Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic


Publication


Year 2012
Type article
Status published
Language English
Author(s) Maršík, Jiří; Bojar, Ondřej
Title TrTok: A Fast and Trainable Tokenizer for Natural Languages
Czech title TrTok: Rychlý a trénovatelný tokenizér pro přirozené jazyky
Journal The Prague Bulletin of Mathematical Linguistics
Volume 98
Pages range 75-85
Month September
Supported by 2012-2016 PRVOUK P46 (Informatika)
Czech abstract Představujeme univerzální nástroj pro segmentaci a tokenizaci textů, který uživateli dovoluje nadefinovat potenciální hranice vět a slov a na základě trénovacích dat se naučí hranice hledat.
English abstract We present a universal data-driven tool for segmenting and tokenizing text. The tokenizer lets the user define where token and sentence boundaries should be considered. These candidate boundaries are then judged by a classifier trained on provided tokenized data. The features passed to the classifier are also defined by the user, which makes, e.g., the inclusion of abbreviation lists trivial. This level of customizability makes the tokenizer a versatile tool, which we show is capable of sentence detection in English text as well as word segmentation in Chinese text. In the case of English sentence detection, the system outperforms previous methods. The software is available as an open-source project on GitHub.
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access no
DOI 10.2478/v10108-012-0010-0
ISSN 0032-6585
Institution Univerzita Karlova v Praze
