Since the very first step in the formulation of the Praguian project of the Czech National Corpus by a group of Czech linguists (Institute of Formal and Applied Linguistics, Institute of Theoretical and Computational Linguistics) from Charles University in Prague and Masaryk University in Brno, it has been quite clear to all of us that, for the outcome of our project to have a broader relevance and many-sided use, we cannot confine ourselves to a mere compilation of a very large corpus of Czech texts. We have been aware that in order to make the corpus really useful for future users, be they linguists or developers of natural language processing systems of any kind, we have to design annotation schemes and develop tools that allow us to add as much linguistic information as possible. Having the advantage of a long and fruitful tradition of theoretical and computational linguistics, and inspired by the research resulting in the Penn Treebank, the project group has decided to build the so-called Prague Dependency Treebank (PDT).
The following three points are characteristic of the theory underlying the PDT, and are fully visible at the highest, tectogrammatical level:
(i) Its theoretical background is a dependency-based syntax (treating sentence structure as centered on the verb and its valency, but containing a further dimension, namely coordination); among the reasons for choosing a dependency-based syntax we would like to stress first of all its relative economy and its perspicuous, immediate correspondence to the empirical data.
(ii) The nodes of the dependency tree (more exactly, of a multi-dimensional network) are labeled by complex symbols (consisting of lexical, morphological and syntactic parts). Thus, the label of every node contains symbols expressing all the information contained in the grammatical position of the word and relevant for a semantic (more exactly, semantico-pragmatic) interpretation. This makes the output representations, that is, the trees of our treebank, useful not only for practical applications such as parsing, but also for inclusion in an integrated theoretical description encompassing all layers from the outer (phonetic or graphemic) shape of the sentence to its semantico-pragmatic representation, be it in the form of truth-conditionally based intensional semantics, or in that of a framework paying more attention to the embedding of the sentence in context.
(iii) The dependency tree is understood as projective, and its
relationships to the morphemic representation of the sentence (a string
of symbols, the order of which corresponds to the surface word order) are
handled by means of specific rules.
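The projectivity condition in point (iii) can be illustrated with a minimal sketch. It assumes a simple encoding that is not one of the PDT formats: a sentence is given as a list of head indices, where `heads[i]` is the 1-based position of the governor of word `i + 1` in surface order and 0 marks the root. The function name `is_projective` is likewise only illustrative. Under these assumptions, a tree is projective if every word lying between a governor and its dependent is itself dominated by that governor.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Check projectivity of a dependency tree.

    heads[i] is the 1-based index of the head of word i + 1;
    0 marks the root. This encoding is an assumption made for
    the example, not a PDT file format.
    """
    n = len(heads)

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up the tree from node until we reach the ancestor
        # or leave the tree at the artificial root 0. The step
        # counter guards against malformed (cyclic) input.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            # The edge from the artificial root is not checked here.
            continue
        lo, hi = sorted((head, dep))
        # Every word strictly between head and dependent must be
        # a descendant of the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True
```

For example, `is_projective([2, 0, 2])` holds for a three-word sentence whose second word governs the other two, whereas `is_projective([3, 4, 0, 3])` fails, because word 3 stands between word 4 and its dependent word 2 without being dominated by word 4.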
The Prague Dependency Treebank (PDT) is a long-term project with two major phases. In the first phase (1996-2000), the morphological and analytical (syntactic) layers of annotation were completed and released, together with a preview of the tectogrammatical layer annotation, as PDT 1.0. During the second phase (2000-2004, Center for Computational Linguistics), annotation on the tectogrammatical layer will continue, and PDT 2.0 will be released at the end of this phase.
The Prague Dependency Treebank (PDT) is a corpus of Czech, taken as a
representative of inflectionally rich free-word-order languages, annotated
on three layers: morphological, analytical, and tectogrammatical.
The electronic text sources have been provided by the Institute of the Czech National Corpus. The text material contains samples from the following sources:
There are two internal formats employed in PDT: FS and CSTS.
The former is an older format, still heavily used by some treebank tools.
The latter, more general SGML-based
encoding, is meant as the main PDT format (in the future, it will be
followed by an XML version, probably already for PDT 2.0).
See the description of the FS file format and the documentation of
the CSTS document type.
PDT 0.5 ("half through") was released in 1998 and contains 456,705 tokens
(words and punctuation) in 26,610 sentences. PDT 1.0 contains about three
times as many tokens and sentences as PDT 0.5 (see PDT 1.0 characteristics).
It is completely manually annotated on the morphological and analytical
levels and also includes a preview of tectogrammatically annotated data.
The Prague Dependency Treebank version 2.0 will
add the tectogrammatical layer of annotation to PDT 1.0. It will be available
with a reduced amount of data as preliminary "version 1.5" during 2002,
and the final data volume will be reached at the end of 2004.
The PDT 1.0 has been supported by the following grants and projects
See documents about PDT here.