Projects: Corpora

The Prague Dependency Treebank

The Prague Dependency Treebank (PDT) contains a large amount of Czech text with complex, interlinked morphological, syntactic and semantic annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. ... [learn more]

Prague Czech-English Dependency Treebank

The Prague Czech-English Dependency Treebank is a manually annotated, parallel, aligned treebank built on the Penn Treebank - Wall Street Journal text collection. It comes in two versions. The current version has over 1.2 million running words in almost 50,000 sentences in each language part. Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (Prague Dependency Treebank 2.0). ... [learn more]

Prague Discourse Treebank

Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5 (PDT; Bejček et al. 2011), which is a revised, updated and extended version of the Prague Dependency Treebank 2.0 (Hajič et al. 2006). It represents a new manually annotated layer of language description, above the existing layers of the PDT (morphology, surface syntax and underlying syntax) and it portrays linguistic phenomena from the perspective of discourse structure and coherence. ... [learn more]

HamleDT: HArmonized Multi-LanguagE Dependency Treebank

HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. ... There are as many as 30 treebanks integrated in HamleDT at this moment. A subset of the treebanks whose license terms permit redistribution is available directly for download from us. ... [learn more]

 

More Corpora

Title | Tags
Abstract Meaning Representation | Annotations, Machine Translation, Semantics
Anotace pro Google | Annotations, Data, Morphology, Semantics
Automatic MWE Identification | Data, Lexicons, Monolingual, Semantics
Čapek | Annotations, Morphology
CoNLL 2017 Shared Task | Annotations, Machine Learning, Multilingual, Parsers, Tools
Czech Academic Corpus | Corpora, Data, Monolingual
Czech and English verbal valency | Annotations, Corpora, Data, Lexicons, Machine Translation, Multilingual, Semantics, Taggers
Czech Legal Text Treebank | Annotations, Corpora, Data, Monolingual
Czech Malach Cross-lingual Speech Retrieval Test Collection | Corpora, Data, Information Retrieval, Multilingual, Speech Retrieval
Czech Named Entity Corpus | Corpora, Data, Monolingual
Czech-English Manual Word Alignment | Annotations, Data, Multilingual
CzEng | Corpora, Data, Machine Translation, Multilingual
Deltacorpus | Corpora, Data, Machine Learning, Taggers
DeriNet | Annotations, Data, Lexicons, Monolingual
EngVallex - English valency lexicon linked to corpora | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
Functional Generative Description | Data, Information Structure, Morphology, Semantics, Valency
HamleDT | Annotations, Corpora, Data, Multilingual, Parsers
HindEnCorp | Corpora, Data, Machine Translation, Monolingual, Multilingual
Interset | Corpora, Data, Morphology, Multilingual, Taggers, Tools
Lexical-semantic Annotation / SemLex Lexicon | Annotations, Data, Lexicons, Monolingual, Semantics
Lindat KonText | Annotations, Corpora, Data, Monolingual, Multilingual, Tools
Malach Centre for Visual History | Data, Discourse, Multi-modality, Multilingual
MorfFlex CZ | Corpora, Data, Lexicons, Monolingual, Morphology
Multilingual Corpus Annotation as a Support for Language Technologies | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Multilingual
MUSCIMA++ | Annotations, Data, Tools
PARSEME | Annotations, Corpora, Lexicons, Linked data, Machine Learning, Multiword Expressions, Parsers, Semantics, Valency
PARSEME | Annotations, Corpora, Lexicons, Multilingual, Multiword Expressions, Parsers, Semantics, Valency
PDT-Vallex: valency lexicon linked to Czech corpora | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
PDTSC 2.0 | Annotations, Corpora, Data, Linked data, Monolingual, Morphology, Multi-modality, Semantics, Speech Recognition, Speech Retrieval, Valency
Prague Czech-English Dependency Treebank | Annotations, Corpora, Data, Lexicons, Linked data, Multilingual, Valency
Prague Czech-English Dependency Treebank 2.0 Coref | Annotations, Coreference, Corpora, Data, Linked data, Multilingual
Prague Database of Spoken Language 1.0 | Annotations, Corpora, Data, Dialog, Multi-modality, Multilingual, Speech Recognition, Speech Retrieval
Prague Dependency Treebank | Annotations, Corpora, Data, Monolingual
Prague Dependency Treebank 3.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague Discourse Treebank 1.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague Discourse Treebank 2.0 | Annotations, Coreference, Corpora, Data, Discourse, Information Structure, Monolingual, Morphology, Multiword Expressions, Semantics
Prague English Dependency Treebank | Annotations, Corpora, Data, Lexicons, Monolingual, Valency
QT21 | Corpora, Data, Lexicons, Linked data, Machine Learning, Machine Translation, Multilingual, Semantics, Tools
ROMi 1.0 | Corpora, Data, Dialog, Monolingual, Speech Recognition
Selected derivational relations for automatic processing of Czech | Data, Lexicons, Monolingual, Morphology
Semantic Pattern Recognition | Annotations, Corpora, Data, Lexicons, Monolingual, Morphology, Parsers, Publications, Semantics, Taggers, Tools, Valency
Sentiment Analysis in Czech | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Tools
Strojový překlad se sémantickou informací | Annotations, Lexicons, Machine Translation, Semantics, Valency
Styx | Annotations, Morphology, Tools
Systematic, economical and corpus-based description of valency properties of Czech deverbal nouns (theory and practice) | Lexicons, Valency
TextLink: Skladba diskurzu v evropských jazycích | Annotations, Corpora, Data, Discourse, Lexicons, Linked data, Monolingual
UFAL Medical Corpus | Corpora, Data, Machine Translation, Multilingual
Universal Dependencies | Annotations, Corpora, Data, Morphology, Multilingual, Parsers
UrMonoCorp | Corpora, Data, Monolingual
Valency Lexicon of Czech Verbs VALLEX | Data, Lexicons, Monolingual, Semantics, Valency
VPS-30-En: Verb Pattern Sample - 30 English | Annotations, Corpora, Data, Lexicons, Monolingual, Semantics, Valency
VPS-GradeUp | Annotations, Corpora, Data, Lexicons, Machine Learning, Monolingual, Semantics, Valency
W2C | Corpora, Data, Multilingual
WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353 | Annotations, Data, Information Retrieval, Machine Learning, Multilingual, Semantics
Working with the Penn Discourse Treebank | Annotations, Corpora, Data, Discourse, Linked data, Monolingual, Tools
Working with the RST-DT and the RST-SC | Annotations, Corpora, Data, Discourse, Linked data, Semantics, Tools