Czech Named Entity Corpus 2.0

The Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences containing 35220 manually annotated named entities, classified according to a two-level hierarchy of 46 named entity types. It is a major update to the Czech Named Entity Corpus 1.0, the first publicly available corpus to provide a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. The corpus is available under the CC BY-NC-SA 3.0 license.

Classification

[Figure: CNEC 2.0 NE hierarchy]

The named entities in Czech are classified according to an updated version of the two-level hierarchy of CNEC 1.0 described in Ševčíková et al., 2007.

Data Formats

Named entities are saved in the following formats:

  • plain text – manual annotations in plain text format
  • simple xml – simple XML format
  • treex – XML format from Treex (formerly TectoMT) with morphological analysis
  • html – HTML with highlighted named entities

Downloads

Czech Named Entity Corpus 2.0 can be downloaded from the LINDAT/CLARIN repository.

Changes

Named Entity Hierarchy

The changes in the named entity hierarchy compared to CNEC 1.0 are the following:

  • the number entities were overhauled
    • entities of supertype c were merged into n; in order to accommodate bibliographic entities a new type nb “vol./page/chap./sec./fig. numbers” was added
      • cs → oa
      • cn → nb
      • cb → nb
      • cp → nb
      • cr → n_, or
    • entities of supertype q were moved into n
      • qc → nc
      • qo → no
    • low-frequency entities of supertype n were removed, and some were renamed or merged
      • removed nm, nr, nw
      • nc was renamed to ns
      • np → no
      • nq → n_
    • some time entities were removed
      • tc → no
      • tp → no
      • tn → nc
      • ts → nc
  • a new entity me representing email was added
  • gp entity was merged into g
  • mr and mt were merged into new ms
  • oc entity was merged into o
  • pb entity was merged into p
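
The type changes listed above can be sketched as a lookup table for converting CNEC 1.0 labels. This is only an illustration, not a tool shipped with the corpus: the names RENAMED, REMOVED and convert_type are hypothetical, and the table assumes that each run-together code pair in the list reads as "old type → new type". The cr type (two possible targets) and the merges into the supertypes g, o and p are left out, because the list does not give a single target code for them.

```python
from typing import Optional

# Hypothetical mapping from CNEC 1.0 type codes to CNEC 2.0 type codes,
# assuming each pair in the change list reads as "old -> new".
RENAMED = {
    "cs": "oa", "cn": "nb", "cb": "nb", "cp": "nb",  # supertype c merged away
    "qc": "nc", "qo": "no",                          # supertype q moved into n
    "nc": "ns", "np": "no", "nq": "n_",              # renamed or merged n types
    "tc": "no", "tp": "no", "tn": "nc", "ts": "nc",  # removed time entities
    "mr": "ms", "mt": "ms",                          # merged into the new ms
}

# Types listed as removed, with no replacement given.
REMOVED = {"nm", "nr", "nw"}


def convert_type(old: str) -> Optional[str]:
    """Return the CNEC 2.0 code for a CNEC 1.0 code, or None if it was removed.

    The table must be applied exactly once per label: the old nc becomes ns,
    while qc, tn and ts become the new nc, so chaining lookups would be wrong.
    Codes that are not in the table are returned unchanged.
    """
    if old in REMOVED:
        return None
    return RENAMED.get(old, old)


if __name__ == "__main__":
    for code in ["cs", "nc", "qc", "nm", "gu"]:
        print(code, "->", convert_type(code))
```

A single dictionary lookup is used rather than repeated rewriting, because some codes are reused with a new meaning in the updated hierarchy.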

New Data

New data was annotated and added:

  • 125 sentences with many addresses and emails were added,
  • 3000 sentences containing only a few named entities were added so that the resulting corpus better represents the density of named entities (the density of named entities in CNEC 1.1 is too high).
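
As a rough illustration of what density means here, the average number of named entities per sentence in CNEC 2.0 can be computed from the totals given in the introduction (8993 sentences, 35220 named entities). The function name entity_density is hypothetical, and a plain ratio is only one possible reading of density; no comparable figure for CNEC 1.1 is given here, so none is computed.

```python
def entity_density(num_entities: int, num_sentences: int) -> float:
    """Average number of named entities per sentence."""
    if num_sentences <= 0:
        raise ValueError("num_sentences must be positive")
    return num_entities / num_sentences


# Totals stated for CNEC 2.0 in the introduction.
CNEC2_SENTENCES = 8993
CNEC2_ENTITIES = 35220

density = entity_density(CNEC2_ENTITIES, CNEC2_SENTENCES)
print(f"{density:.2f} named entities per sentence")  # prints 3.92 named entities per sentence
```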