Like any supervised machine-learning tool, UDPipe needs a trained linguistic model. This section describes the available language models, the command-line tools, and the interfaces.

1. Running UDPipe

Probably the most common usage of UDPipe is to tokenize, tag and parse the input using

udpipe --tokenize --tag --parse udpipe_model

The input is assumed to be in UTF-8 encoding and can either be already tokenized and segmented, or it can be plain text which will be tokenized and segmented automatically.

Any number of input files can be specified after the udpipe_model, and if no file is given, the standard input is used. The output is by default written to the standard output, but if --outfile=name is used, it is saved under the given file name. The output file name can contain a {}, which is replaced by the base name of the processed file (i.e., without directories and extension).
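
For example, assuming a model file named english-ud-2.0-170801.udpipe (the file name is illustrative), the following command processes all .txt files in the current directory, writing the output for each file name.txt to name.conllu:

udpipe --tokenize --tag --parse --outfile={}.conllu english-ud-2.0-170801.udpipe *.txt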

The full command syntax of running UDPipe is

Usage: udpipe [running_opts] udpipe_model [input_files]
       udpipe --train [training_opts] udpipe_model [input_files]
       udpipe --detokenize [detokenize_opts] raw_text_file [input_files]
Running opts: --accuracy (measure accuracy only)
              --input=[conllu|generic_tokenizer|horizontal|vertical]
              --immediate (process sentences immediately during loading)
              --outfile=output file template
              --output=[conllu|matxin|horizontal|plaintext|vertical]
              --tokenize (perform tokenization)
              --tokenizer=tokenizer options, implies --tokenize
              --tag (perform tagging)
              --tagger=tagger options, implies --tag
              --parse (perform parsing)
              --parser=parser options, implies --parse
Training opts: --method=[morphodita_parsito] which method to use
               --heldout=heldout data file name
               --tokenizer=tokenizer options
               --tagger=tagger options
               --parser=parser options
Detokenize opts: --outfile=output file template
Generic opts: --version
              --help

1.1. Immediate Mode

By default, UDPipe loads the whole input file into memory before starting to process it. This allows the spacing markup (see the following Tokenizer section) to be stored in the most consistent way, i.e., all spaces following a sentence are stored in the last token of that sentence.

However, it is sometimes desirable to process the input as soon as possible, which can be achieved by specifying the --immediate option. In immediate mode, the input is processed and printed as soon as a block of input guaranteed to contain whole sentences is loaded. Specifically, for most input formats the input is processed after an empty line is loaded (the exceptions being the horizontal input format and the presegmented tokenizer, where the input is processed after every line).
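
For example, immediate mode allows processing input that is produced gradually. Assuming a model file named model.udpipe (both file names here are illustrative), each sentence of the growing file is printed as soon as it has been read:

tail -f growing.txt | udpipe --immediate --tokenize --tag --parse model.udpipe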

1.2. Loading Model On Demand

Although a model for UDPipe always has to be specified, the model is loaded only if really needed. It is therefore possible to pass for example none as the model when it is not required for performing the requested operation (e.g., when converting between formats or when using the generic tokenizer).
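
For example, the following commands require no trained model (file names are illustrative): the first one converts a CoNLL-U file to the horizontal format, the second one tokenizes plain text using the generic tokenizer:

udpipe --input=conllu --output=horizontal none input.conllu
udpipe --input=generic_tokenizer --output=conllu none input.txt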

1.3. Tokenizer

If the --tokenize option is supplied, the input is assumed to be plain text and is tokenized using the model tokenizer. Additional arguments to the tokenizer can be specified using the --tokenizer=data option (which implies --tokenize), where data is a semicolon-separated list of the following options (an example follows the list):

  • normalized_spaces: by default, UDPipe uses custom MISC fields to exactly encode spaces in the original document (as described below). If the normalized_spaces option is given, only the standard CoNLL-U v2 markup (SpaceAfter=No and # newpar) is used.
  • presegmented: the input file is assumed to be already segmented, with each sentence on a separate line, and is only tokenized (respecting sentence breaks)
  • ranges: for each token, a range in the original document is stored in the format described below.
  • joint_with_parsing: an experimental mode performing sentence segmentation jointly using the tokenizer and the parser (see the paper Milan Straka and Jana Straková: Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe for details). The following options are utilized:
    • joint_max_sentence_len (default 20): maximum sentence length
    • joint_change_boundary_logprob (default -0.5): logprob of using sentence boundary not generated by the tokenizer
    • joint_sentence_logprob (default -0.5): additional logprob of every sentence
    The logprob of a sentence is computed from the logprob of its best dependency tree, plus joint_sentence_logprob, plus joint_change_boundary_logprob for every sentence boundary not returned by the tokenizer (i.e., either 0, 1 or 2 times per sentence). The joint sentence segmentation then chooses the segmentation in which every sentence is at most joint_max_sentence_len long and the sum of the logprobs of all sentences is as large as possible.
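
For example, assuming a model file named model.udpipe (illustrative), the following command tokenizes input that already has one sentence per line and additionally stores token ranges; note the quotes, because the options are separated by semicolons:

udpipe --tokenizer='presegmented;ranges' model.udpipe input.txt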

1.3.1. Preserving Original Spaces

By default, UDPipe uses custom MISC fields to store all spaces in the original document. This markup is backward compatible with the CoNLL-U v2 SpaceAfter=No feature and can be utilized by the plaintext output format to reconstruct the original document.

Note that in theory not only spaces, but also other original content can be saved this way (for example XML tags, if the input was encoded in an XML file).

The markup uses the following MISC fields on tokens (not words in multi-word tokens):

  • SpacesBefore=content (by default empty): spaces/other content preceding the token
  • SpacesAfter=content (by default a space if SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
  • SpacesInToken=content (by default equal to the FORM of the token): the FORM of the token including the original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline character)

The content of all three fields must be escaped to allow storing tabs and newlines. The following C-like scheme is used:

  • \s: space
  • \t: tab
  • \r: CR character
  • \n: LF character
  • \p: | (pipe character)
  • \\: \ (backslash character)
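
As an illustration (a constructed example; the morphological annotation is arbitrary), a token Hi preceded by a tab and followed by two newlines would be stored as the following CoNLL-U line:

1	Hi	hi	INTJ	_	_	0	root	_	SpacesBefore=\t|SpacesAfter=\n\n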

1.3.2. Preserving Token Ranges

When the ranges tokenizer option is used, the range of each token in the original document is stored in the TokenRange MISC field.

The format of the TokenRange field (inspired by Python) is TokenRange=start:end, where start is a zero-based document-level index of the start of the token (counted in Unicode characters) and end is a zero-based document-level index of the first character following the token (i.e., the length of the token is end-start).
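
For example, in a document consisting only of the text Hello world (a constructed example), the token Hello would be annotated with TokenRange=0:5 and the token world with TokenRange=6:11.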

1.4. Input Formats

If the tokenizer is not used, the input format can be specified using the --input option. The individual input formats can be parametrized in the same way as a tokenizer, using the format=data syntax. The currently supported input formats are:

  • conllu (default): the CoNLL-U format. Supported options:
    • v2 (default): use CoNLL-U v2
    • v1: allow loading only CoNLL-U v1 (i.e., no empty nodes and no spaces in forms and lemmas)
  • generic_tokenizer: generic tokenizer for English-like languages (with spaces separating tokens and English-like punctuation). The tokenizer is rule-based and needs no trained model. It supports the same options as a model tokenizer, i.e., normalized_spaces, presegmented and ranges.
  • horizontal: each sentence on a separate line, with tokens separated by spaces. In order to allow spaces in tokens, the Unicode character 'NO-BREAK SPACE' (U+00A0) is considered part of a token and is converted to a space during loading.
  • vertical: each token on a separate line, with an empty line denoting the end of a sentence; only the first tab-separated column is used as the token, the rest of the line is ignored.

Note that a model tokenizer can be specified using the --input option too, by using the tokenizer input format, for example using --input tokenizer=ranges.

1.5. Tagger

If the --tag option is supplied, the input is POS tagged and lemmatized using the model tagger. Additional arguments to the tagger can be specified using the --tagger=data option (which implies --tag).

1.6. Dependency Parsing

If the --parse option is supplied, the input is parsed using the model dependency parser. Additional arguments to the parser can be specified using the --parser=data option (which implies --parse).

1.7. Output Formats

The output format is specified using the --output option. The individual output formats can be parametrized in the same way as input formats, by using the format=data syntax. Currently supported output formats are:

  • conllu (default): the CoNLL-U format. Supported options:
    • v2 (default): use CoNLL-U v2
    • v1: produce output in CoNLL-U v1 format. Note that this is a lossy process, as empty nodes are ignored and spaces in forms and lemmas are converted to underscores.
  • matxin: the Matxin format
  • horizontal: writes the words (in the UD sense) in horizontal format, that is, each sentence is on a separate line, with words separated by a single space. Because words can contain spaces in CoNLL-U v2, the spaces in words are converted to Unicode character 'NO-BREAK SPACE' (U+00A0). Supported options:
    • paragraphs: an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
  • plaintext: writes the tokens (in the UD sense) using the original spacing. By default, UDPipe's custom MISC features (SpacesBefore, SpacesAfter and SpacesInToken, see the description in the Tokenizer section) are used to reconstruct the exact original spaces (an example follows this list). However, if the document does not contain these features or if you want only normalized spacing, you can use the following option:
    • normalized_spaces: write one sentence on a line, and either one or no space between tokens according to the SpaceAfter=No feature
  • vertical: each word on a separate line, with an empty line denoting the end of sentence. Supported options:
    • paragraphs: an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
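
For example, the original document can be reconstructed from a CoNLL-U file produced by a UDPipe tokenizer without any trained model (the file name is illustrative):

udpipe --input=conllu --output=plaintext none tokenized.conllu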

2. Running the UDPipe REST Server

UDPipe also provides a REST server binary called udpipe_server. The binary uses MicroRestD as its REST server implementation and provides the UDPipe REST API.

The full command syntax of udpipe_server is

udpipe_server [options] port default_model (rest_id model_file acknowledgements)*
Options: --concurrent_models=maximum concurrently loaded models (default 10)
         --daemon (daemonize after start)
         --no_check_models_loadable (do not check models are loadable)
         --no_preload_default (do not preload default model)

The udpipe_server binary can run either in the foreground or in the background (when --daemon is used).

Since UDPipe 1.1.1, the models are loaded on demand, so that at most concurrent_models (default 10) are kept in memory at the same time. The model files are opened during start and never closed until the server stops. Unless no_check_models_loadable is specified, the model files are also checked to be loadable during start. Note that the default model is preloaded and never released, unless no_preload_default is given. (Before UDPipe 1.1.1, specified model files were loaded during start and kept in memory all the time.)
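
As an illustration (the port, the model file name and the acknowledgements URL are placeholders), the following command starts a daemonized server on port 8001 serving a single model under the REST id english, which is also the default model:

udpipe_server --daemon 8001 english english english-ud-2.0-170801.udpipe http://example.com/acknowledgements

Assuming the standard /process endpoint of the UDPipe REST API, such a server can then be queried for example with curl:

curl -F data=@input.txt -F model=english -F tokenizer= -F tagger= -F parser= http://localhost:8001/process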

3. Training UDPipe Models

Custom UDPipe models can be trained using the following syntax:

udpipe --train model.output [--heldout=heldout_data] training_file ...

The training data should be in the CoNLL-U format.

By default, three model components are trained – a tokenizer, a tagger and a parser. Any subset of the model components can be trained, and a model component may also be copied from an existing model.

The training options are specified for each model component separately using the --tokenizer, --tagger and --parser options. If a model component should not be trained, value none should be used (e.g., --tagger=none).

The options are name=value pairs separated by semicolons. A value can be either a simple string (ending at the next semicolon), file content specified as name=file:filename, or an arbitrary string specified as name=data:length:value, where the value is exactly length bytes long.
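
For example, the following command (file names are illustrative) trains all three components, passing simple string values to each of them; note the quotes, because the options are separated by semicolons:

udpipe --train model.udpipe --heldout=dev.conllu --tokenizer='dimension=24;epochs=100' --tagger='models=2' --parser='iterations=10' train.conllu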

3.1. Reusing Components from Existing Models

The model components (tokenizer, tagger or parser) can be reused from existing models by specifying the from_model=file:filename option.
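
For example, the following command (file names are illustrative) copies the tokenizer and the tagger from an existing model and trains only the parser:

udpipe --train new.udpipe --tokenizer=from_model=file:old.udpipe --tagger=from_model=file:old.udpipe --parser='iterations=10' train.conllu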

3.2. Random Hyperparameter Search

The default values of the hyperparameters are set to the values which were used the most often during the UD 1.2 models training, but to reach the best performance, the hyperparameters must be tuned.

Apart from manual grid search, UDPipe can perform a simple random search: train UDPipe repeatedly (preferably in parallel, most likely on different computers), each time specifying a different training run number. Some of the hyperparameters (chosen by us; you can of course override their values on the command line) take different values in different training runs. The pseudorandom sequences of hyperparameters are of course deterministic.

The training run is specified by providing the run=number option to a model component. Run number 1 is the default (with the best hyperparameters for the UD 1.2 models); run numbers 2 and higher randomize the hyperparameters.
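
A random search over several runs might then look as follows (a sketch; file names are illustrative, and the individual runs would preferably be executed in parallel on different machines):

for run in 1 2 3 4 5; do
  udpipe --train model-run$run.udpipe --heldout=dev.conllu --tagger=run=$run --parser=run=$run train.conllu
done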

3.3. Tokenizer

The tokenizer is trained using the SpaceAfter=No features in the CoNLL-U files. If the feature is not present, a detokenizer can be used to guess the SpaceAfter=No features according to a supplied plain text (which typically does not overlap with the texts in the CoNLL-U files).

In order to use the detokenizer, use the detokenizer=file:filename_with_plaintext option. For the UD 1.2 models, the optimal performance is achieved with very small plain texts – only 500kB.
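
For example, a tokenizer for a treebank without SpaceAfter=No features could be trained as follows (file names are illustrative):

udpipe --train model.udpipe --tokenizer='detokenizer=file:plain_sample.txt' --tagger=none --parser=none train.conllu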

The tokenizer recognizes the following options:

  • tokenize_url (default 1): tokenize URLs and emails using a manually implemented recognizer
  • allow_spaces (default 1 if any token contains a space, 0 otherwise): allow tokens to contain spaces
  • dimension (default 24): dimension of character embeddings and of the per-character bidirectional GRU. Note that inference time is quadratic in this parameter. Supported values are only 16, 24 and 64, with 64 needed only for languages with complicated tokenization like Japanese, Chinese or Vietnamese.
  • epochs (default 100): the number of epochs to train the tokenizer for
  • batch_size (default 50): batch size used during tokenizer training
  • learning_rate (default 0.005): the learning rate used during tokenizer training
  • dropout (default 0.1): dropout used during tokenizer training
  • early_stopping (default 1 if heldout is given, 0 otherwise): perform early stopping, choosing the training iteration maximizing the sentence F1 score plus the token F1 score on the heldout data

During random hyperparameter search, batch_size is chosen uniformly from {50,100} and learning_rate logarithmically from [0.0005, 0.01).

3.3.1. Detokenizing CoNLL-U Files

The --detokenizer option allows generating the SpaceAfter=No features automatically from a given plain text. Even though the current algorithm is very simple and makes quite a lot of mistakes, a tokenizer trained on the generated features is very close to a tokenizer trained on gold SpaceAfter=No features (the difference in token F1 score is usually one or two tenths of a percent).

The generated SpaceAfter=No features are used only during tokenizer training; they are not printed. However, if you would like to obtain CoNLL-U files with automatic detokenization (i.e., with the generated SpaceAfter=No features), you can run UDPipe with the --detokenize option. In this case, you have to supply plain text in the given language (usually the best results are achieved with just 500kB or 1MB of text), and UDPipe then detokenizes all the given CoNLL-U files.

The complete usage of the --detokenize option is:

udpipe --detokenize [detokenize_opts] raw_text_file [input_files]
Detokenize opts: --outfile=output file template

3.4. Tagger

The tagging is currently performed using MorphoDiTa. The UDPipe tagger consists of possibly several MorphoDiTa models, each tagging some of the POS tags and/or lemmas.

By default, only one model is constructed, which generates all available tags (UPOS, XPOS, Feats and Lemma). However, during the training of the UD 1.2 models we found that performance improves if one model tags UPOS, XPOS and Feats, while the other performs lemmatization. Therefore, if you utilize two MorphoDiTa models, by default the first one generates all tags except lemmas and the second one performs lemmatization.

The number of MorphoDiTa models can be specified using the models=number parameter. All other parameters may be either generic for all models (guesser_suffix_rules=5), or specific for a given model (guesser_suffix_rules_2=6), including the from_model option (therefore, MorphoDiTa models can be trained separately and then combined together into one UDPipe model).
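
For example, the following command (combining the parameter examples from the paragraph above; the file names are illustrative) trains two MorphoDiTa models, sets guesser_suffix_rules=5 generically for both of them, and overrides the value to 6 for the second, lemmatizing model:

udpipe --train model.udpipe --tagger='models=2;guesser_suffix_rules=5;guesser_suffix_rules_2=6' train.conllu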

Every model utilizes UPOS for disambiguation and the first model is the one producing the UPOS tags on output.

The tagger recognizes the following options:

  • use_lemma (default for the second model and also if there is only one model): use the lemma field internally to perform disambiguation; the lemma itself may still not be printed on output
  • provide_lemma (default for the second model and also if there is only one model): produce the disambiguated lemma on output
  • use_xpostag (default for the first model): use the XPOS tags internally to perform disambiguation; they may still not be printed on output
  • provide_xpostag (default for the first model): produce the disambiguated XPOS tag on output
  • use_feats (default for the first model): use the Feats internally to perform disambiguation; they may still not be printed on output
  • provide_feats (default for the first model): produce the disambiguated Feats field on output
  • dictionary_max_form_analyses (default 0, i.e. unlimited): the maximum number of the most frequent form analyses from the UD training data that are kept in the morphological dictionary
  • dictionary_file (default empty): use a given custom morphological dictionary, where each line contains 5 tab-separated fields: FORM, LEMMA, UPOSTAG, XPOSTAG and FEATS (a sample follows this list). Note that this dictionary data is appended to the dictionary created from the UD training data, not replacing it.
  • guesser_suffix_rules (default 8): number of rules generated for every suffix
  • guesser_prefixes_max (default 4 if provide_lemma is set, 0 otherwise): maximum number of form-generating prefixes to use in the guesser
  • guesser_prefix_min_count (default 10): minimum number of occurrences of a form-generating prefix to consider using it in the guesser
  • guesser_enrich_dictionary (default 6 if no dictionary_file is passed, 0 otherwise): number of rules generated for forms present in the training data (assuming that the analyses from the training data may not be complete)
  • iterations (default 20): number of training iterations to perform
  • early_stopping (default 1 if heldout is given, 0 otherwise): perform early stopping, choosing training iteration maximizing tagging accuracy on the heldout data
  • templates (default lemmatizer for second model, tagger otherwise): MorphoDiTa feature templates to use, either lemmatizer which focuses more on lemmas, or tagger which focuses more on UPOS/XPOS/FEATS
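
As an illustration of the dictionary_file format, a file with two constructed English-like entries (the analyses are arbitrary) could look as follows, with the five fields separated by tabs:

wugs	wug	NOUN	NNS	Number=Plur
wug	wug	NOUN	NN	Number=Sing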

During random hyperparameter search, guesser_suffix_rules is chosen uniformly from {5,6,7,8,9,10,11,12} and guesser_enrich_dictionary is chosen uniformly from {3,4,5,6,7,8,9,10}.

3.5. Parser

The parsing is performed using Parsito, which is a transition-based parser using a neural-network classifier.

The transition-based systems can be configured by the following options:

  • transition_system (default projective): which transition system to use for parsing (language dependent, you can choose according to language properties or try all and choose the best one)
    • projective: projective stack-based arc standard system with shift, left_arc and right_arc transitions
    • swap: fully non-projective system which extends projective system by adding the swap transition
    • link2: partially non-projective system which extends projective system by adding left_arc2 and right_arc2 transitions
  • transition_oracle (default dynamic/static_lazy/static, whichever is applicable first): which transition oracle to use for the chosen transition_system:
    • transition_system=projective: available oracles are static and dynamic (dynamic usually gives better results, but training time is slower)
    • transition_system=swap: available oracles are static_eager and static_lazy (static_lazy almost always gives better results)
    • transition_system=link2: only available oracle is static
  • structured_interval (default 8): use a search-based oracle in addition to the specified transition_oracle. This almost always gives better results, but makes training 2-3 times slower. For details, see the paper Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle
  • single_root (default 1): allow only single root when parsing, and make sure only the root node has the root deprel (note that training data are checked to be in this format)

The Lemmas/UPOS/XPOS/FEATS used by the parser are configured by:

  • use_gold_tags (default 0): if 0 and a tagger exists, the Lemmas/UPOS/XPOS/FEATS for both the training and heldout data are generated by the tagger; otherwise, they are taken from the gold data

The embeddings used by the parser can be specified as follows:

  • embedding_upostag (default 20): the dimension of the UPos embedding used in the parser
  • embedding_feats (default 20): the dimension of the Feats embedding used in the parser
  • embedding_xpostag (default 0): the dimension of the XPos embedding used in the parser
  • embedding_form (default 50): the dimension of the Form embedding used in the parser
  • embedding_lemma (default 0): the dimension of the Lemma embedding used in the parser
  • embedding_deprel (default 20): the dimension of the Deprel embedding used in the parser
  • embedding_form_file: pre-trained word embeddings in word2vec textual format
  • embedding_lemma_file: pre-trained lemma embeddings in word2vec textual format
  • embedding_form_mincount (default 2): for forms not present in the pre-trained embeddings, generate random embeddings if the form appears at least this many times in the training data (forms not present in the pre-trained embeddings and appearing fewer times are considered OOV)
  • embedding_lemma_mincount (default 2): for lemmas not present in the pre-trained embeddings, generate random embeddings if the lemma appears at least this many times in the training data (lemmas not present in the pre-trained embeddings and appearing fewer times are considered OOV)

The neural-network training options:

  • iterations (default 10): number of training iterations to use
  • hidden_layer (default 200): the size of the hidden layer
  • batch_size (default 10): batch size used during neural-network training
  • learning_rate (default 0.02): the learning rate used during neural-network training
  • learning_rate_final (default 0.001): the final learning rate used during neural-network training
  • l2 (default 0.5): the L2 regularization used during neural-network training
  • early_stopping (default 1 if heldout is given, 0 otherwise): perform early stopping, choosing training iteration maximizing LAS on heldout data

During random hyperparameter search, structured_interval is chosen uniformly from {0,8,10}, learning_rate is chosen logarithmically from [0.005, 0.04) and l2 is chosen uniformly from [0.2, 0.6).

3.5.1. Pre-trained Word Embeddings

The pre-trained word embeddings for forms and lemmas can be specified in the word2vec textual format using the embedding_form_file and embedding_lemma_file options.

Note that pre-training word embeddings even on the UD data itself improves the accuracy (we use word2vec with -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 -threads 12 -binary 0 -iter 15 -min-count 2 options to pre-train on the UD data after converting it to the horizontal format using udpipe --output=horizontal).

Forms and lemmas can contain spaces in CoNLL-U v2, so these spaces are converted to the Unicode character 'NO-BREAK SPACE' (U+00A0) before performing the embedding lookup, because spaces are usually used to delimit tokens in word-embedding software (both word2vec and GloVe use spaces to separate words on input and output). When UDPipe is used to generate plain text from the CoNLL-U format using --output=horizontal, this space replacement happens automatically.
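
For example, the whole pipeline of pre-training form embeddings on the UD training data itself (using exactly the word2vec options quoted above) and passing them to parser training might look as follows (file names are illustrative):

udpipe --input=conllu --output=horizontal none train.conllu > train.horizontal
word2vec -train train.horizontal -output forms.vectors -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 -threads 12 -binary 0 -iter 15 -min-count 2
udpipe --train model.udpipe --parser='embedding_form_file=forms.vectors' train.conllu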

When looking up the embedding of a given word, the following possibilities are tried in order until a match is found (or the embedding for an unknown word is returned):

  • original word
  • all but the first character lowercased
  • all characters lowercased
  • if the word contains only digits, just the first digit is tried

3.6. Measuring Model Accuracy

Measuring custom model accuracy can be performed by running:

udpipe --accuracy [udpipe_options] udpipe_model file ...

The command syntax is similar to the regular UDPipe operation, except that the input must always be in the CoNLL-U format and the --input and --output options are ignored.

Three different settings (depending on --tokenize(r), --tag(ger) and --parse(r)) can be evaluated:

  • --tokenize(r) [--tag(ger) [--parse(r)]]: The tokenizer is used to segment and tokenize plain text (reconstructed from the SpaceAfter=No features and the # newdoc and # newpar comments in the input file). Optionally, the tagger is then used on the resulting data to obtain the Lemma/UPOS/XPOS/Feats columns, and optionally the parser can be used to parse the results.

    The tokenizer is evaluated using F1-score on tokens, multi-word tokens, sentences and words. The words are aligned using a word-alignment algorithm described in the CoNLL 2017 Shared Task in UD Parsing. The tagger and parser are evaluated on aligned words, resulting in F1 scores of Lemmas/UPOS/XPOS/Feats/UAS/LAS.

  • --tag(ger) [--parse(r)]: The gold segmented and tokenized input is tagged (and then optionally parsed using the tagger outputs) and then evaluated.

  • --parse(r): The gold segmented and tokenized input is parsed using gold morphology (Lemmas/UPOS/XPOS/Feats) and evaluated.
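
For example, the following command (file names are illustrative) evaluates all three components of a model, starting from the plain text reconstructed from the gold file:

udpipe --accuracy --tokenize --tag --parse model.udpipe test.conllu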

4. Universal Dependencies 2.0 Models

Universal Dependencies 2.0 Models are distributed under the CC BY-NC-SA licence. The models are based solely on Universal Dependencies 2.0 treebanks. The models work in UDPipe version 1.2 and later.

Universal Dependencies 2.0 Models are versioned according to the date of release in the format YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively. The latest version is 170801.

4.1. Download

The latest version 170801 of the Universal Dependencies 2.0 models can be downloaded from LINDAT/CLARIN repository.

4.2. Acknowledgements

This work has been partially supported and has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071). The work was also partially supported by OP VVV projects CZ.02.1.01/0.0/0.0/16_013/0001781 and CZ.02.2.69/0.0/0.0/16_018/0002373, and by SVV project number 260 453.

The models were trained on Universal Dependencies 2.0 treebanks.

For the UD treebanks which do not contain the original plain text version, raw text is used to train the tokenizer instead. The plain texts were taken from the W2C – Web to Corpus.

4.2.1. Publications

  • (Straka et al. 2017) Straka Milan, Straková Jana. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, August 2017.

4.3. Model Description

The Universal Dependencies 2.0 models contain 68 models of 50 languages, each consisting of a tokenizer, tagger, lemmatizer and dependency parser, all trained using the UD data. Note that we use a custom train-dev split: sentences are moved from the beginning of the dev data to the end of the train data until the training data is at least 9 times the size of the dev data.

The tokenizer is trained using the SpaceAfter=No features. If the features are not present in the data, they can be filled in using raw text in the language in question.

The tagger, lemmatizer and parser are trained using gold UD data.

Details about model architecture and training process can be found in the (Straka et al. 2017) paper.

4.3.1. Reproducible Training

In case you want to train the same models, the scripts for downloading and resplitting the UD 2.0 data, the precomputed word embeddings, the raw texts for the tokenizers, all hyperparameter values and the training scripts are available in the second archive on the model download page.

4.4. Model Performance

We present the tagger, lemmatizer and parser performance, measured on the testing portion of the data, evaluated in three different settings: using raw text only, using gold tokenization only, and using gold tokenization plus gold morphology (UPOS, XPOS, FEATS and Lemma).

Treebank Mode Words Sents UPOS XPOS Feats AllTags Lemma UAS LAS
Ancient Greek Raw text 100.0% 98.7% 82.4% 72.3% 85.8% 72.3% 82.6% 64.4% 57.8%
Ancient Greek Gold tok - - 82.4% 72.4% 85.8% 72.3% 82.7% 64.6% 57.9%
Ancient Greek Gold tok+morph - - - - - - - 69.2% 64.4%
Ancient Greek-PROIEL Raw text 100.0% 47.2% 95.8% 96.0% 88.6% 87.2% 92.6% 71.8% 67.1%
Ancient Greek-PROIEL Gold tok - - 95.8% 96.1% 88.7% 87.2% 92.8% 77.2% 72.3%
Ancient Greek-PROIEL Gold tok+morph - - - - - - - 79.7% 76.1%
Arabic Raw text 93.8% 83.1% 88.4% 83.4% 83.5% 82.3% 87.5% 71.7% 65.8%
Arabic Gold tok - - 94.4% 89.5% 89.6% 88.3% 92.6% 81.3% 74.3%
Arabic Gold tok+morph - - - - - - - 82.9% 77.9%
Basque Raw text 100.0% 99.5% 93.2% - 87.6% - 93.8% 75.8% 70.7%
Basque Gold tok - - 93.3% - 87.7% - 93.9% 75.9% 70.8%
Basque Gold tok+morph - - - - - - - 82.3% 78.4%
Belarusian Raw text 99.4% 76.8% 88.2% 85.6% 71.7% 68.6% 81.3% 68.0% 60.6%
Belarusian Gold tok - - 88.7% 85.7% 72.4% 69.2% 81.5% 69.4% 61.9%
Belarusian Gold tok+morph - - - - - - - 76.8% 74.0%
Bulgarian Raw text 99.9% 93.9% 97.6% 94.6% 95.6% 94.0% 94.6% 88.8% 84.8%
Bulgarian Gold tok - - 97.7% 94.7% 95.6% 94.1% 94.7% 89.5% 85.5%
Bulgarian Gold tok+morph - - - - - - - 92.6% 89.1%
Catalan Raw text 100.0% 99.2% 98.0% 98.0% 97.1% 96.5% 97.9% 88.8% 85.7%
Catalan Gold tok - - 98.0% 98.0% 97.2% 96.5% 97.9% 88.8% 85.8%
Catalan Gold tok+morph - - - - - - - 91.1% 88.7%
Chinese Raw text 90.2% 98.8% 84.0% 83.8% 89.0% 82.7% 90.2% 62.9% 58.7%
Chinese Gold tok - - 92.2% 92.0% 98.7% 90.8% 100.0% 75.6% 70.1%
Chinese Gold tok+morph - - - - - - - 84.1% 81.4%
Coptic Raw text 65.8% 35.7% 62.6% 62.1% 65.7% 62.1% 64.6% 41.1% 39.3%
Coptic Gold tok - - 95.1% 94.3% 99.7% 94.2% 96.2% 83.2% 79.2%
Coptic Gold tok+morph - - - - - - - 88.1% 84.9%
Croatian Raw text 99.9% 97.0% 95.9% - 84.3% - 94.4% 83.6% 77.9%
Croatian Gold tok - - 96.0% - 84.4% - 94.4% 83.9% 78.1%
Croatian Gold tok+morph - - - - - - - 87.1% 83.2%
Czech Raw text 99.9% 91.6% 98.3% 92.8% 92.1% 91.7% 97.8% 86.8% 83.2%
Czech Gold tok - - 98.4% 92.9% 92.2% 91.9% 97.9% 87.7% 84.1%
Czech Gold tok+morph - - - - - - - 90.2% 87.5%
Czech-CAC Raw text 100.0% 99.8% 98.1% 90.6% 89.4% 89.1% 97.0% 86.9% 82.7%
Czech-CAC Gold tok - - 98.1% 90.7% 89.5% 89.1% 97.1% 87.0% 82.8%
Czech-CAC Gold tok+morph - - - - - - - 89.7% 86.6%
Czech-CLTT Raw text 99.5% 92.3% 96.5% 87.5% 87.8% 87.3% 96.8% 80.2% 76.6%
Czech-CLTT Gold tok - - 97.0% 87.9% 88.3% 87.7% 97.2% 81.0% 77.6%
Czech-CLTT Gold tok+morph - - - - - - - 83.8% 80.8%
Danish Raw text 99.8% 77.9% 95.2% - 94.2% - 94.9% 78.4% 74.7%
Danish Gold tok - - 95.5% - 94.5% - 95.0% 80.4% 76.6%
Danish Gold tok+morph - - - - - - - 85.6% 82.7%
Dutch Raw text 99.8% 77.6% 91.4% 88.1% 89.3% 87.0% 89.9% 75.4% 69.6%
Dutch Gold tok - - 91.8% 88.8% 89.9% 87.7% 90.1% 77.0% 71.2%
Dutch Gold tok+morph - - - - - - - 82.9% 79.4%
Dutch-LassySmall Raw text 100.0% 80.4% 97.6% - 97.2% - 98.1% 84.4% 82.0%
Dutch-LassySmall Gold tok - - 97.7% - 97.4% - 98.2% 87.5% 85.0%
Dutch-LassySmall Gold tok+morph - - - - - - - 89.7% 87.4%
English Raw text 99.0% 76.6% 93.5% 92.9% 94.4% 91.5% 96.0% 80.2% 77.2%
English Gold tok - - 94.5% 93.9% 95.4% 92.5% 96.9% 84.3% 81.2%
English Gold tok+morph - - - - - - - 87.8% 86.0%
English-LinES Raw text 99.9% 86.2% 95.0% 92.7% - - - 78.6% 74.4%
English-LinES Gold tok - - 95.1% 92.8% - - - 79.5% 75.3%
English-LinES Gold tok+morph - - - - - - - 84.1% 81.1%
English-ParTUT Raw text 99.6% 97.5% 94.2% 94.0% 93.3% 92.0% 96.9% 81.6% 77.9%
English-ParTUT Gold tok - - 94.6% 94.4% 93.6% 92.3% 97.3% 82.1% 78.4%
English-ParTUT Gold tok+morph - - - - - - - 86.4% 84.5%
Estonian Raw text 99.9% 94.2% 91.2% 93.2% 85.0% 83.2% 84.5% 72.4% 65.6%
Estonian Gold tok - - 91.3% 93.2% 85.0% 83.3% 84.5% 72.8% 66.0%
Estonian Gold tok+morph - - - - - - - 83.1% 79.6%
Finnish Raw text 99.7% 86.7% 94.5% 95.7% 91.5% 90.3% 86.5% 80.5% 76.9%
Finnish Gold tok - - 94.9% 96.0% 91.8% 90.7% 86.8% 82.0% 78.4%
Finnish Gold tok+morph - - - - - - - 86.9% 84.7%
Finnish-FTB Raw text 100.0% 86.4% 92.0% 91.0% 92.5% 89.2% 88.9% 80.1% 75.7%
Finnish-FTB Gold tok - - 92.2% 91.3% 92.7% 89.5% 88.9% 81.7% 77.3%
Finnish-FTB Gold tok+morph - - - - - - - 88.8% 86.5%
French Raw text 98.9% 94.6% 95.4% - 95.5% - 96.6% 84.2% 80.7%
French Gold tok - - 96.5% - 96.5% - 97.6% 85.4% 82.0%
French Gold tok+morph - - - - - - - 88.4% 86.0%
French-ParTUT Raw text 99.0% 97.8% 94.5% 94.2% 91.9% 90.8% 94.3% 82.9% 78.7%
French-ParTUT Gold tok - - 95.6% 95.3% 92.7% 91.6% 95.2% 84.1% 80.2%
French-ParTUT Gold tok+morph - - - - - - - 88.1% 85.3%
French-Sequoia Raw text 99.1% 84.0% 95.9% - 95.1% - 96.8% 83.2% 80.6%
French-Sequoia Gold tok - - 96.8% - 96.0% - 97.7% 85.1% 82.7%
French-Sequoia Gold tok+morph - - - - - - - 88.7% 87.4%
Galician Raw text 99.9% 95.8% 97.2% 96.7% 99.7% 96.4% 97.1% 81.0% 77.8%
Galician Gold tok - - 97.2% 96.8% 99.8% 96.4% 97.1% 81.2% 77.9%
Galician Gold tok+morph - - - - - - - 83.1% 80.5%
Galician-TreeGal Raw text 98.7% 86.7% 91.1% 87.8% 89.9% 87.0% 92.6% 71.5% 66.3%
Galician-TreeGal Gold tok - - 92.4% 88.8% 91.0% 88.0% 93.7% 74.4% 68.7%
Galician-TreeGal Gold tok+morph - - - - - - - 81.5% 77.1%
German Raw text 99.7% 79.3% 90.7% 94.7% 80.5% 76.3% 95.4% 74.0% 68.6%
German Gold tok - - 91.2% 95.0% 80.9% 76.7% 95.6% 76.5% 70.7%
German Gold tok+morph - - - - - - - 84.7% 82.2%
Gothic Raw text 100.0% 29.5% 94.2% 94.8% 87.6% 85.6% 92.9% 69.7% 63.5%
Gothic Gold tok - - 94.8% 95.3% 88.0% 86.5% 92.9% 78.8% 72.6%
Gothic Gold tok+morph - - - - - - - 82.2% 78.3%
Greek Raw text 99.9% 88.2% 95.8% 95.8% 90.3% 89.1% 94.5% 84.2% 80.4%
Greek Gold tok - - 96.0% 96.0% 90.5% 89.3% 94.6% 85.0% 81.1%
Greek Gold tok+morph - - - - - - - 87.9% 85.9%
Hebrew Raw text 85.2% 100.0% 80.9% 80.9% 77.6% 76.8% 79.6% 62.2% 57.9%
Hebrew Gold tok - - 95.1% 95.1% 91.3% 90.5% 93.2% 84.5% 78.9%
Hebrew Gold tok+morph - - - - - - - 87.8% 84.3%
Hindi Raw text 100.0% 99.1% 95.8% 94.9% 90.3% 87.7% 98.0% 91.3% 87.3%
Hindi Gold tok - - 95.8% 94.9% 90.3% 87.7% 98.0% 91.4% 87.3%
Hindi Gold tok+morph - - - - - - - 93.9% 91.0%
Hungarian Raw text 99.8% 96.2% 91.6% - 70.5% - 89.3% 74.1% 68.1%
Hungarian Gold tok - - 91.8% - 70.6% - 89.5% 74.5% 68.5%
Hungarian Gold tok+morph - - - - - - - 81.2% 78.5%
Indonesian Raw text 100.0% 92.0% 93.5% - 99.5% - - 80.6% 74.3%
Indonesian Gold tok - - 93.5% - 99.6% - - 80.8% 74.5%
Indonesian Gold tok+morph - - - - - - - 83.1% 79.1%
Irish Raw text 99.4% 94.3% 88.0% 86.9% 75.1% 72.7% 85.5% 72.5% 62.4%
Irish Gold tok - - 88.5% 87.4% 75.5% 73.1% 86.0% 73.3% 63.1%
Irish Gold tok+morph - - - - - - - 78.1% 71.4%
Italian Raw text 99.8% 97.1% 97.2% 97.0% 97.0% 96.1% 97.3% 88.8% 86.1%
Italian Gold tok - - 97.4% 97.2% 97.2% 96.3% 97.5% 89.3% 86.6%
Italian Gold tok+morph - - - - - - - 91.3% 89.7%
Japanese Raw text 91.9% 95.1% 89.1% - 91.8% - 91.1% 78.0% 76.6%
Japanese Gold tok - - 96.6% - 100.0% - 99.0% 93.4% 91.5%
Japanese Gold tok+morph - - - - - - - 95.6% 95.0%
Kazakh Raw text 94.0% 84.9% 52.0% 52.1% 47.2% 40.0% 59.2% 40.2% 23.9%
Kazakh Gold tok - - 55.4% 55.4% 50.1% 42.2% 63.1% 45.2% 27.0%
Kazakh Gold tok+morph - - - - - - - 60.5% 42.5%
Korean Raw text 99.7% 92.7% 94.4% 89.7% 99.3% 89.7% 99.4% 67.4% 60.5%
Korean Gold tok - - 94.7% 90.0% 99.6% 90.0% 99.7% 68.4% 61.5%
Korean Gold tok+morph - - - - - - - 71.7% 65.8%
Latin Raw text 100.0% 98.0% 83.4% 67.6% 72.5% 67.6% 51.2% 56.5% 46.0%
Latin Gold tok - - 83.4% 67.6% 72.5% 67.6% 51.2% 56.6% 46.1%
Latin Gold tok+morph - - - - - - - 67.8% 61.5%
Latin-ITTB Raw text 99.9% 82.5% 97.2% 92.7% 93.5% 91.3% 97.8% 79.7% 76.0%
Latin-ITTB Gold tok - - 97.3% 92.8% 93.6% 91.4% 97.9% 81.8% 78.1%
Latin-ITTB Gold tok+morph - - - - - - - 87.6% 85.2%
Latin-PROIEL Raw text 99.9% 31.0% 94.9% 95.0% 87.7% 86.7% 94.8% 66.1% 60.7%
Latin-PROIEL Gold tok - - 95.2% 95.2% 88.4% 87.4% 95.0% 75.3% 69.4%
Latin-PROIEL Gold tok+morph - - - - - - - 79.0% 75.0%
Latvian Raw text 99.2% 97.1% 89.6% 76.2% 83.2% 75.7% 87.6% 69.2% 62.8%
Latvian Gold tok - - 90.2% 76.8% 84.0% 76.3% 88.3% 70.3% 63.9%
Latvian Gold tok+morph - - - - - - - 78.7% 74.9%
Lithuanian Raw text 98.2% 92.0% 74.0% 73.0% 68.9% 63.7% 73.5% 44.0% 32.4%
Lithuanian Gold tok - - 74.6% 73.5% 69.7% 64.2% 74.2% 44.6% 33.0%
Lithuanian Gold tok+morph - - - - - - - 55.6% 46.5%
Norwegian-Bokmaal Raw text 99.8% 96.5% 96.9% - 95.3% - 96.6% 86.9% 84.1%
Norwegian-Bokmaal Gold tok - - 97.1% - 95.5% - 96.8% 87.5% 84.7%
Norwegian-Bokmaal Gold tok+morph - - - - - - - 91.7% 89.6%
Norwegian-Nynorsk Raw text 99.9% 92.2% 96.5% - 94.9% - 96.4% 85.6% 82.5%
Norwegian-Nynorsk Gold tok - - 96.6% - 95.0% - 96.5% 86.5% 83.3%
Norwegian-Nynorsk Gold tok+morph - - - - - - - 91.0% 88.6%
Old Church Slavonic Raw text 100.0% 40.5% 93.8% 93.8% 86.9% 85.7% 91.2% 73.6% 66.9%
Old Church Slavonic Gold tok - - 94.1% 94.1% 87.6% 86.5% 91.2% 81.6% 74.7%
Old Church Slavonic Gold tok+morph - - - - - - - 86.7% 82.2%
Persian Raw text 99.7% 98.2% 96.0% 96.0% 96.1% 95.4% 93.5% 83.3% 79.4%
Persian Gold tok - - 96.4% 96.3% 96.4% 95.7% 93.8% 83.8% 80.0%
Persian Gold tok+morph - - - - - - - 87.7% 84.9%
Polish Raw text 99.9% 99.7% 95.6% 84.0% 84.1% 83.1% 93.4% 86.7% 80.7%
Polish Gold tok - - 95.7% 84.1% 84.2% 83.3% 93.6% 87.0% 81.0%
Polish Gold tok+morph - - - - - - - 92.9% 89.5%
Portuguese Raw text 99.6% 89.4% 96.4% 72.7% 93.3% 71.6% 96.8% 86.0% 82.6%
Portuguese Gold tok - - 96.8% 73.0% 93.7% 71.9% 97.2% 87.2% 83.6%
Portuguese Gold tok+morph - - - - - - - 89.6% 87.5%
Portuguese-BR Raw text 99.9% 96.8% 97.0% 97.0% 99.7% 97.0% 98.8% 88.5% 86.3%
Portuguese-BR Gold tok - - 97.2% 97.2% 99.9% 97.2% 98.9% 88.8% 86.6%
Portuguese-BR Gold tok+morph - - - - - - - 90.5% 89.1%
Romanian Raw text 99.7% 93.9% 96.6% 95.9% 96.0% 95.7% 96.5% 85.6% 80.2%
Romanian Gold tok - - 96.9% 96.2% 96.3% 96.0% 96.8% 86.2% 80.8%
Romanian Gold tok+morph - - - - - - - 87.8% 83.0%
Russian Raw text 99.9% 96.9% 94.7% 94.4% 84.4% 82.8% 75.0% 80.3% 75.5%
Russian Gold tok - - 94.8% 94.5% 84.5% 82.9% 75.1% 80.8% 76.0%
Russian Gold tok+morph - - - - - - - 84.8% 81.9%
Russian-SynTagRus Raw text 99.6% 98.0% 98.0% - 93.6% - 95.6% 89.8% 87.2%
Russian-SynTagRus Gold tok - - 98.4% - 93.9% - 95.9% 90.4% 87.9%
Russian-SynTagRus Gold tok+morph - - - - - - - 91.8% 90.5%
Sanskrit Raw text 88.1% 29.0% 52.0% - 35.2% - 50.2% 38.8% 22.5%
Sanskrit Gold tok - - 57.6% - 43.6% - 60.6% 58.5% 34.3%
Sanskrit Gold tok+morph - - - - - - - 72.9% 58.5%
Slovak Raw text 100.0% 83.5% 93.2% 77.5% 79.7% 77.1% 85.9% 80.4% 75.2%
Slovak Gold tok - - 93.3% 77.6% 79.9% 77.2% 86.0% 82.0% 76.9%
Slovak Gold tok+morph - - - - - - - 88.2% 85.5%
Slovenian Raw text 99.9% 98.9% 96.2% 88.2% 88.5% 87.7% 95.3% 84.9% 81.6%
Slovenian Gold tok - - 96.2% 88.2% 88.6% 87.7% 95.4% 85.0% 81.7%
Slovenian Gold tok+morph - - - - - - - 91.8% 90.5%
Slovenian-SST Raw text 99.9% 17.8% 89.0% 81.1% 81.3% 78.6% 91.6% 53.0% 46.6%
Slovenian-SST Gold tok - - 89.4% 81.6% 81.8% 79.3% 91.7% 63.4% 56.0%
Slovenian-SST Gold tok+morph - - - - - - - 75.5% 70.6%
Spanish Raw text 99.7% 95.3% 95.5% - 96.1% - 95.9% 84.9% 81.4%
Spanish Gold tok - - 95.8% - 96.3% - 96.1% 85.5% 81.9%
Spanish Gold tok+morph - - - - - - - 88.0% 85.3%
Spanish-AnCora Raw text 99.9% 98.0% 98.1% 98.1% 97.5% 96.8% 98.1% 87.7% 84.5%
Spanish-AnCora Gold tok - - 98.2% 98.2% 97.5% 96.9% 98.1% 87.8% 84.7%
Spanish-AnCora Gold tok+morph - - - - - - - 90.2% 87.6%
Swedish Raw text 99.8% 94.6% 95.6% 93.9% 94.4% 92.8% 95.5% 81.4% 77.8%
Swedish Gold tok - - 95.8% 94.1% 94.6% 93.1% 95.7% 82.1% 78.4%
Swedish Gold tok+morph - - - - - - - 88.0% 85.0%
Swedish-LinES Raw text 100.0% 85.7% 94.8% 92.2% - - - 80.4% 75.7%
Swedish-LinES Gold tok - - 94.8% 92.3% - - - 81.3% 76.6%
Swedish-LinES Gold tok+morph - - - - - - - 86.0% 82.6%
Tamil Raw text 95.3% 89.2% 82.2% 77.7% 80.9% 77.2% 85.3% 59.5% 52.0%
Tamil Gold tok - - 85.8% 81.0% 84.2% 80.3% 89.1% 64.9% 56.5%
Tamil Gold tok+morph - - - - - - - 78.9% 71.8%
Turkish Raw text 98.1% 96.8% 92.4% 91.5% 87.3% 85.5% 90.2% 62.9% 55.8%
Turkish Gold tok - - 94.0% 93.0% 88.9% 87.0% 91.7% 65.5% 58.0%
Turkish Gold tok+morph - - - - - - - 66.8% 61.1%
Ukrainian Raw text 99.8% 95.1% 88.5% 70.7% 70.9% 67.6% 86.7% 69.9% 61.5%
Ukrainian Gold tok - - 88.6% 70.8% 71.0% 67.7% 86.9% 70.2% 61.8%
Ukrainian Gold tok+morph - - - - - - - 79.0% 74.5%
Urdu Raw text 100.0% 98.3% 92.4% 90.5% 80.6% 76.3% 93.0% 84.6% 77.6%
Urdu Gold tok - - 92.4% 90.5% 80.7% 76.3% 93.0% 84.7% 77.7%
Urdu Gold tok+morph - - - - - - - 88.2% 83.0%
Uyghur Raw text 99.8% 67.2% 74.7% 79.1% - - - 55.1% 35.0%
Uyghur Gold tok - - 75.1% 79.3% - - - 56.5% 35.8%
Uyghur Gold tok+morph - - - - - - - 62.3% 42.0%
Vietnamese Raw text 85.3% 92.9% 77.4% 75.4% 85.1% 75.4% 84.5% 46.9% 42.5%
Vietnamese Gold tok - - 89.3% 86.8% 99.6% 86.8% 99.0% 64.4% 57.2%
Vietnamese Gold tok+morph - - - - - - - 70.7% 67.9%

5. CoNLL17 Shared Task Baseline UD 2.0 Models

As part of CoNLL 2017 Shared Task in UD Parsing, baseline models for UDPipe were released. The CoNLL 2017 Shared Task models were trained on most of UD 2.0 treebanks (64 of them) and are distributed under the CC BY-NC-SA licence.

Note that the models were released at a time when the UD 2.0 test sets were not yet known. Therefore, the models were trained on a subset of the training data only, to allow a fair comparison on the development data (which was unused during training and hyperparameter tuning). Consequently, the performance of these models is not directly comparable to that of other models. Details about the concrete data split, hyperparameter values and model performance are available in the model archive.

5.1. Download

The CoNLL17 Shared Task Baseline UD 2.0 Models can be downloaded from LINDAT/CLARIN repository.

5.2. Acknowledgements

This work has been partially supported and has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The models were trained on Universal Dependencies 2.0 treebanks.

6. Universal Dependencies 1.2 Models

Universal Dependencies 1.2 Models are distributed under the CC BY-NC-SA licence. The models are based solely on Universal Dependencies 1.2 treebanks. The models work in UDPipe version 1.0.

Universal Dependencies 1.2 Models are versioned according to the date of release in the format YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively. The latest version is 160523.

6.1. Download

The latest version 160523 of the Universal Dependencies 1.2 models can be downloaded from LINDAT/CLARIN repository.

6.2. Acknowledgements

This work has been partially supported and has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The models were trained on Universal Dependencies 1.2 treebanks.

For the UD treebanks which do not contain the original plain text version, raw text is used to train the tokenizer instead. The plain texts were taken from the W2C – Web to Corpus.

6.2.1. Publications

  • (Straka et al. 2016) Straka Milan, Hajič Jan, Straková Jana. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. LREC 2016, Portorož, Slovenia, May 2016.

6.3. Model Description

The Universal Dependencies 1.2 models contain 36 models, each consisting of a tokenizer, tagger, lemmatizer and dependency parser, all trained using the UD data. The model for Japanese is missing, because we do not have a license for the required Mainichi Shinbun 1995 corpus.

The tokenizer is trained using the SpaceAfter=No features. If the features are not present in the data, they can be filled in using raw text in the language in question (surprisingly, quite little data suffices; we use 500kB).

The tagger, lemmatizer and parser are trained using gold UD data.

Details about model architecture and training process can be found in the (Straka et al. 2016) paper.

6.4. Model Performance

We present the tagger, lemmatizer and parser performance, measured on the testing portion of the data. Only the segmentation and the tokenization of the testing data are retained before evaluation. Therefore, the dependency parser is evaluated without gold POS tags.

Treebank UPOS XPOS Feats All Tags Lemma UAS LAS
Ancient Greek 91.1% 77.8% 88.7% 77.7% 86.9% 68.1% 61.6%
Ancient Greek-PROIEL 96.7% 96.4% 89.3% 88.4% 93.4% 75.8% 69.6%
Arabic 98.8% 97.7% 97.8% 97.6% - 80.4% 75.6%
Basque 93.3% - 87.2% 85.4% 93.5% 74.8% 69.5%
Bulgarian 97.8% 94.8% 94.4% 93.1% 94.6% 89.0% 84.2%
Croatian 94.9% - 85.5% 85.0% 93.1% 78.6% 71.0%
Czech 98.4% 93.2% 92.6% 92.2% 97.8% 86.9% 83.0%
Danish 95.8% - 94.8% 93.6% 95.2% 78.6% 74.8%
Dutch 89.7% 88.7% 91.2% 86.4% 88.9% 78.1% 70.7%
English 94.5% 93.8% 95.4% 92.5% 97.0% 84.2% 80.6%
Estonian 88.0% 73.7% 80.0% 73.6% 77.0% 79.9% 71.5%
Finnish 94.9% 96.0% 93.2% 92.1% 86.8% 81.0% 76.5%
Finnish-FTB 94.0% 91.6% 93.3% 91.2% 89.1% 81.5% 76.9%
French 95.8% - - 95.8% - 82.8% 78.4%
German 90.5% - - 90.5% - 78.2% 72.2%
Gothic 95.5% 95.7% 88.0% 86.3% 93.4% 76.4% 68.2%
Greek 97.3% 97.3% 92.8% 91.7% 94.8% 80.3% 76.5%
Hebrew 94.9% 94.9% 91.3% 90.5% - 82.6% 76.8%
Hindi 95.8% 94.8% 90.2% 87.7% 98.0% 91.7% 87.5%
Hungarian 92.6% - 89.9% 88.9% 86.9% 77.0% 70.6%
Indonesian 93.5% - - 93.5% - 79.9% 73.3%
Irish 91.8% 90.3% 79.4% 76.6% 87.3% 74.4% 66.1%
Italian 97.2% 97.0% 97.1% 96.2% 97.7% 88.6% 85.8%
Latin 91.2% 75.8% 79.3% 75.6% 79.9% 57.1% 46.7%
Latin-ITT 98.8% 94.0% 94.6% 93.8% 98.3% 79.9% 76.4%
Latin-PROIEL 96.4% 96.0% 88.9% 88.2% 95.3% 75.3% 68.3%
Norwegian 97.2% - 95.5% 94.7% 96.9% 86.7% 84.1%
Old Church Slavonic 95.3% 95.1% 89.1% 88.2% 92.9% 80.6% 73.4%
Persian 97.0% 96.3% 96.5% 96.2% - 83.8% 79.4%
Polish 95.8% 84.0% 84.1% 83.8% 92.8% 86.3% 79.6%
Portuguese 97.6% 92.3% 95.3% 92.0% 97.8% 85.8% 81.9%
Romanian 89.0% 81.0% 82.3% 81.0% 75.3% 68.6% 56.9%
Slovenian 95.7% 88.2% 88.6% 87.5% 95.0% 84.1% 80.3%
Spanish 95.3% - 95.9% 93.4% 96.3% 84.2% 80.3%
Swedish 95.8% 93.9% 94.8% 93.2% 95.5% 81.4% 77.1%
Tamil 85.9% 80.8% 84.3% 80.2% 88.0% 67.2% 58.8%