PCEDT & Multilingual Corpora

Prague Czech-English Dependency Treebank

The Prague Czech-English Dependency Treebank is a manually annotated, parallel, aligned treebank built on top of the Penn Treebank Wall Street Journal text collection. It comes in two versions. The current version has over 1.2 million running words in almost 50,000 sentences for each language part. Each language part carries comprehensive manual linguistic annotation in the style of PDT 2.0 (Prague Dependency Treebank 2.0). ... [learn more]

HamleDT: HArmonized Multi-LanguagE Dependency Treebank

HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. ... At the moment, 29 treebanks are integrated in HamleDT. The subset of treebanks whose license terms permit redistribution can be downloaded directly from us. ... [learn more]

Other Parallel and/or Multilingual Corpora