Czech Named Entity Corpus

The latest version of the Czech Named Entity Corpus (Czech Named Entity Corpus 2.0) is a corpus of 8993 Czech sentences containing 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entity types.

Current version download: Czech Named Entity Corpus 2.0.

A detailed description of the corpus, the file formats, the two-level named entity hierarchy, and download links are available for every released version:

Work Published using CNEC

State-of-the-art Results

CNEC 1.0 and 2.0 Results, F1 measure
CNEC 1.0 Types | CNEC 1.0 Supertypes | CNEC 2.0 Types | CNEC 2.0 Supertypes | CNEC 1.0 Extended | CNEC 2.0 Extended | Publication | Code | Method
86.39 Bachelor Thesis of Müller 2020, a rerun of Straková et al., 2019 Straková et al., 2019 LSTM-CRF+BERT
86.88 89.91 86.23 89.37 Straka et al., 2019 Seq2seq+BERT
86.88 Straková et al., 2019 GitHub Seq2seq+BERT
83.15 86.30 83.27 84.22 Jana Straková, Milan Straka, Jan Hajič, Martin Popel (2019): Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, ISSN 0037-7031, vol. 80, no. 4, pp. 306-327 Deep NN
81.05 Güngör, 2018 RNN+WE+CLE
81.20 84.68 79.23 82.78 80.88 80.79 Straková et al., 2016 GitHub RNN+WE+CLE
74.08 Konkol et al., 2015 Latent semantics
75.61 Demir and Özgür, 2014 NN+WE
74.23 74.37 Konkol and Konopík, 2014 CRF+stemming
79.23 82.82 Straková et al., 2013 NameTag Simple NN
79.00 74.08 Konkol and Konopík, 2013 CRF
72.94 Konkol and Konopík, 2011 Maximum entropy
68.00 71.00 Kravalová and Žabokrtský, 2009 SVM
62.00 68.00 Ševčíková et al., 2007 Dec. trees

Please let us know if you have a contribution to this table. Thanks!
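
For readers comparing their own systems against the table, the F1 measure can be sketched as follows. This is a minimal illustration that assumes the common exact-match, micro-averaged definition over entity spans; the individual publications may use their own evaluation scripts, so it is not the official scorer. The function name `span_f1`, the tuple layout and the labels `TYPE_A`, `TYPE_B`, `TYPE_C` are hypothetical placeholders, not part of the CNEC tag set or distribution.

```python
def span_f1(gold, predicted):
    """Micro-averaged F1 over exact-match entity spans.

    Each entity is a hashable tuple, here assumed to be
    (sentence_id, start_token, end_token, label). An entity counts as
    correct only if all four fields match a gold entity.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data with placeholder labels (not real CNEC annotations).
gold = [(0, 0, 2, "TYPE_A"), (0, 5, 6, "TYPE_B"), (1, 3, 4, "TYPE_C")]
pred = [(0, 0, 2, "TYPE_A"), (0, 5, 6, "TYPE_C"), (1, 3, 4, "TYPE_C")]

score = span_f1(gold, pred)
print(f"F1 = {100 * score:.2f}")
```

In this toy example two of the three predicted entities match the gold annotation exactly, while the third has the right span but the wrong label and is therefore counted as an error. Scoring by supertype rather than type would amount to mapping each label to its supertype before building the tuples.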

Tools

Other

  • Straková Jana, Straka Milan, Ševčíková Magda, Žabokrtský Zdeněk: Czech Named Entity Corpus. In: Handbook of Linguistic Annotation, Copyright © Springer Netherlands, Netherlands, ISBN 978-94-024-0879-9, pp. 855-873, 1459 pp., 2017.
  • Ševčíková Magda, Žabokrtský Zdeněk, Krůza Oldřich: Zpracování pojmenovaných entit v českých textech. Technical report no. 2007/TR-2007-36, Copyright © ÚFAL MFF UK, 60 pp., 2007.

Please Cite this Corpus As:

Ševčíková, M., Žabokrtský, Z., Krůza, O.: Named Entities in Czech: Annotating Data and Developing NE Tagger. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 188–195. Springer, Heidelberg (2007).

@inproceedings{SevcikovaEtAl2007CNEC,
booktitle = {Lecture Notes in Artificial Intelligence, Proceedings of the 10th International Conference on Text, Speech and Dialogue},
series = {Lecture Notes in Computer Science},
title = {Named Entities in Czech: Annotating Data and Developing {NE} Tagger},
editor = {V{\'{a}}clav Matou{\v{s}}ek and Pavel Mautner},
author = {Magda {\v{S}}ev{\v{c}}{\'{\i}}kov{\'{a}} and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Old{\v{r}}ich Kr{\r{u}}za},
year = {2007},
publisher = {Springer},
address = {Berlin / Heidelberg},
volume = {4629},
number = {{XVII}},
pages = {188--195},
isbn = {978-3-540-74627-0},
issn = {0302-9743},
}

Acknowledgements:

  • SVV project number 267 314 (Teoretické základy informatiky a výpočetní lingvistiky)
  • LINDAT/CLARIN (Large infrastructural grant for language resources, data access and distribution and related research), project LM2010013 of the Ministry of Education of the Czech Republic
  • GAČR 406/12/P175 project (Vybrané derivační vztahy pro automatické zpracování češtiny) of the Grant Agency of the Czech Republic
  • PRVOUK P46 project

Authors: