The Czech Legal Text Treebank (CLTT) is a manually annotated corpus of dependency trees. The treebank consists of 1,121 sentences from the legal domain.
The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as CLTT 1.0. The syntactic annotation in CLTT 2.0 is more elaborate than in CLTT 1.0 in several respects. In addition, two new annotation layers were added to the data: (i) the layer of accounting entities, and (ii) the layer of semantic entity relations.
The sentences were taken from the Accounting Act (563/1991 Coll., as amended) and the Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended). The selection was determined by the goals of the INTLIB project, with a specific focus on the accounting subdomain.
The annotations in CLTT follow the framework originally formulated in the Prague Dependency Treebank (PDT) project. They apply the dependency approach to syntactic analysis, in which the verb plays the central role. Technically, this is the analytical (a-) layer of annotation: each token in the sentence has one corresponding node, and each dependency is labelled with a syntactic dependency function stored in the afun attribute.
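The a-layer structure described above can be pictured with a small in-memory model. This is only an illustrative sketch, not the CLTT file format: the class ANode, its field names other than afun, and the helper functions are assumptions made for this example. The sample sentence and its annotation are hand-made rather than taken from the corpus, and the technical root node used in PDT-style trees is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ANode:
    """One token of a sentence on the analytical (a-) layer (hypothetical model)."""
    ord: int               # 1-based position of the token in the sentence
    form: str              # word form as it appears in the text
    afun: str              # syntactic dependency function of the node
    parent: Optional[int]  # ord of the governing node; None for the top node


def root(tree: List[ANode]) -> ANode:
    """Return the single node that has no governor."""
    roots = [node for node in tree if node.parent is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one top node")
    return roots[0]


def children(tree: List[ANode], parent_ord: int) -> List[ANode]:
    """Return the nodes directly governed by the node at position parent_ord."""
    return [node for node in tree if node.parent == parent_ord]


# Hand-made example: "Účetní jednotky vedou účetnictví"
# (roughly "Accounting units keep accounts").
tree = [
    ANode(1, "Účetní", "Atr", 2),      # attribute of "jednotky"
    ANode(2, "jednotky", "Sb", 3),     # subject of the verb
    ANode(3, "vedou", "Pred", None),   # the verb heads the tree
    ANode(4, "účetnictví", "Obj", 3),  # object of the verb
]

top = root(tree)
dependents = [node.form for node in children(tree, top.ord)]
```

Storing the governor as a position rather than as an object reference keeps each node a flat record, which mirrors the one-node-per-token property of the a-layer.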
To make manual annotation as easy as possible, we developed a special annotation strategy:
In CLTT 2.0, we introduced a new annotation layer of accounting entities. We used the dictionary of accounting terms that was created for the RExtractor system, and then applied RExtractor to identify entities automatically in the CLTT dependency trees.
The dictionary of accounting terms consists of 1,733 different terms classified into 25 categories. The RExtractor system identified 7,332 occurrences of these terms in CLTT 2.0.
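To give a feel for dictionary-based entity identification, the sketch below performs a greedy longest-match lookup over a flat list of tokens. It is a simplified stand-in and not the RExtractor method, which operates on dependency trees. The function find_terms, the category labels, and the English sample data are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_terms(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], str],
) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup of dictionary terms in a token list.

    Returns (start, end, category) triples with an exclusive end index.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word terms win
        # over shorter terms that start at the same position.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, i + length, dictionary[candidate]))
                i += length
                break
        else:
            # No term starts at this position; move to the next token.
            i += 1
    return matches


# Hypothetical dictionary: term (as a token tuple) -> category label.
dictionary = {
    ("accounting", "unit"): "CATEGORY_A",
    ("unit",): "CATEGORY_B",
    ("balance", "sheet"): "CATEGORY_C",
}
tokens = "the accounting unit prepares a balance sheet".split()
matches = find_terms(tokens, dictionary)
```

Preferring the longest match means that "accounting unit" is reported as one entity instead of the shorter term "unit" nested inside it.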
The layer of semantic relations is new in CLTT 2.0. Relations are represented as (subject, predicate, object) triples, where the subject and the object must be entities and the predicate names the relation between them. Three types of semantic relations were manually annotated in the CLTT texts:
CLTT 2.0 contains 498 manually annotated relations classified into 3 categories.
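The triple representation can be sketched as a small data structure that enforces the constraint that the subject and the object are entities. The class names, the category labels, and the predicate label RELATION_TYPE_1 are placeholders invented for this example; they are not the relation types or the storage format used in CLTT 2.0.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Entity:
    text: str      # surface text of the entity
    category: str  # placeholder category label


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str  # placeholder relation type label
    object: Entity

    def __post_init__(self) -> None:
        # Subject and object have to be entities; the predicate is a label.
        for role in ("subject", "object"):
            if not isinstance(getattr(self, role), Entity):
                raise TypeError(f"{role} must be an Entity")


def count_by_predicate(relations: Iterable[Relation]) -> Counter:
    """Count relations per predicate, i.e. per relation type."""
    return Counter(relation.predicate for relation in relations)


unit = Entity("accounting unit", "CATEGORY_A")
books = Entity("accounting books", "CATEGORY_B")
relations = [Relation(unit, "RELATION_TYPE_1", books)]
counts = count_by_predicate(relations)
```

Checking the argument types at construction time rejects a triple whose subject or object is a plain string rather than an identified entity.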
Please use the following text to cite CLTT:
Kríž, Vincent; Hladká, Barbora and Urešová, Zdeňka, 2015, Czech Legal Text Treebank, LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11234/1-1516.
Distributed under the CC BY-NC-SA licence.