The Prague Dependency Treebank

The Prague Dependency Treebank (PDT) is a large collection of Czech texts with complex, interlinked morphological, syntactic, and semantic annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.

PDT is based on the long-standing Praguian linguistic tradition, adapted to the needs of current computational linguistics research. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation, and language analysis are included. Extensive documentation (in English) is provided as well.

The Prague Dependency Treebank 3.0

The Prague Dependency Treebank 3.0 (PDT 3.0) annotates the same texts as PDT 2.0 (Hajič et al. 2006), PDT 2.5 (Bejček et al. 2011), and the Prague Discourse Treebank 1.0 (PDiT 1.0, Poláková et al. 2012). The annotation on the four layers has been further corrected and improved in various respects. In addition, new information was added to the data:

  • from PDT 2.5 to PDiT 1.0
    • Extended textual coreference
    • Bridging anaphora
    • Discourse relations marked by explicit connectives
  • from PDiT 1.0 to PDT 3.0
    • Revision of several grammatemes
    • Revision of sentence modality annotation
    • Replacement of t_lemma #Benef
    • Genres of documents
    • Pronominal textual coreference of 1st and 2nd person
    • Updated discourse relations marked by explicit connectives

How to cite

If you make use of PDT 3.0, please cite (at least) the following paper:

  • Bejček Eduard, Hajičová Eva, Hajič Jan, Jínová Pavlína, Kettnerová Václava, Kolářová Veronika, Mikulová Marie, Mírovský Jiří, Nedoluzhko Anna, Panevová Jarmila, Poláková Lucie, Ševčíková Magda, Štěpánek Jan, Zikánová Šárka: Prague Dependency Treebank 3.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, Czech Republic, http://ufal.mff.cuni.cz/pdt3.0/, Dec 2013
@misc{pdt-3-0-2013,
      title = {Prague Dependency Treebank 3.0},
      author = {Eduard Bej{\v{c}}ek and Eva Haji{\v{c}}ov{\'{a}} and Jan Haji{\v{c}} and Pavl{\'{i}}na J{\'{i}}nov{\'{a}} and V{\'{a}}clava Kettnerov{\'{a}} and Veronika Kol{\'{a}}{\v{r}}ov{\'{a}} and Marie Mikulov{\'{a}} and Ji{\v{r}}{\'{i}} M{\'{i}}rovsk{\'{y}} and Anna Nedoluzhko and Jarmila Panevov{\'{a}} and Lucie Pol{\'{a}}kov{\'{a}} and Magda {\v{S}}ev{\v{c}}{\'{i}}kov{\'{a}} and Jan {\v{S}}t{\v{e}}p{\'{a}}nek and {\v{S}}{\'{a}}rka Zik{\'{a}}nov{\'{a}}},
      year = {2013},
      publisher = {Univerzita Karlova v Praze, {MFF}, {\'{U}}{FAL}},
      address = {Prague, Czech Republic},
}

 

The Prague Dependency Treebank 2.5

The Prague Dependency Treebank 2.5 (PDT 2.5) annotates the same texts as PDT 2.0. The annotation on the original four layers was corrected or improved in various respects. In addition, new information was added to the data:

  • Annotation of multiword expressions
  • Pair/group meaning
  • Clause segmentation

How to cite

If you make use of PDT 2.5, please cite (at least) the following paper:

  • Bejček Eduard, Panevová Jarmila, Popelka Jan, Smejkalová Lenka, Straňák Pavel, Ševčíková Magda, Štěpánek Jan, Toman Josef, Žabokrtský Zdeněk, Hajič Jan: Prague Dependency Treebank 2.5. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Praha, Czechia, http://ufal.mff.cuni.cz/pdt2.5/, Dec 2011
@misc{pdt-2-5-2011,
      title = {Prague Dependency Treebank 2.5},
      author = {Eduard Bej{\v{c}}ek and Jarmila Panevov{\'{a}} and Jan Popelka and Lenka Smejkalov{\'{a}} and Pavel Stra{\v{n}}{\'{a}}k and Magda {\v{S}}ev{\v{c}}{\'{i}}kov{\'{a}} and Jan {\v{S}}t{\v{e}}p{\'{a}}nek and Josef Toman and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Jan Haji{\v{c}}},
      year = {2011},
      publisher = {Univerzita Karlova v Praze, {MFF}, {\'{U}}{FAL}},
      address = {Praha, Czech Republic},
}

 

The Prague Dependency Treebank 2.0

The Prague Dependency Treebank 2.0 (PDT 2.0) is a large collection of Czech texts with complex, interlinked morphological (2 million words), syntactic (1.5 million words), and semantic (0.8 million words) annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.

How to cite

If you make use of PDT 2.0, please cite (at least) the following paper:

  • Hajič Jan, Panevová Jarmila, Hajičová Eva, Sgall Petr, Pajas Petr, Štěpánek Jan, Havelka Jiří, Mikulová Marie, Žabokrtský Zdeněk, Ševčíková-Razímová Magda, Urešová Zdeňka: Prague Dependency Treebank 2.0. Data/software, Linguistic Data Consortium, Philadelphia, PA, USA, ISBN 1-58563-370-4, www.ldc.upenn.edu, Jul 2006
@misc{pdt-2-0-2006,
      title = {Prague Dependency Treebank 2.0},
      author = {Jan Haji{\v{c}} and Jarmila Panevov{\'{a}} and Eva Haji{\v{c}}ov{\'{a}} and Petr Sgall and Petr Pajas and Jan {\v{S}}t{\v{e}}p{\'{a}}nek and Ji{\v{r}}{\'{i}} Havelka and Marie Mikulov{\'{a}} and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Magda {\v{S}}ev{\v{c}}{\'{i}}kov{\'{a}}-Raz{\'{i}}mov{\'{a}} and Zde{\v{n}}ka Ure{\v{s}}ov{\'{a}}},
      year = {2006},
      publisher = {Linguistic Data Consortium},
      address = {Philadelphia, {PA}, {USA}},
}

Short overview of the PDT 2.0 attributes and their values

Slides and video recordings from Prague Treebanking for Everyone: A two-day tutorial (Vilem Mathesius Lecture Series 21).

 

The Prague Dependency Treebank 1.0