Korektor Model Creation
In order to create a new spellchecker model for Korektor, several models must be created and a configuration file describing these models must be provided.
Korektor uses a flexible morphology system which associates several
morphological factors with every word, with the word itself being considered
the first one. Usually the factors are form, lemma and tag, but
arbitrary factors may be used. Note that currently there is a hard limit of
four factors for efficiency (you can change
FactorList::MAX_FACTORS if you need more).
For each morphological factor a language model is needed.
The last required model is an error model describing costs of various spelling errors.
1. Creating a Morphology Model
To create a morphology model, a morphology lexicon input file must be provided
and processed by the create_morphology tool.
1.1. Morphology Lexicon Input Format
The morphology lexicon is a UTF-8 encoded file in the following format:
- the first line contains the names of the factors delimited by |
- each following line contains two space-separated columns: the first column holds the factor strings of a word delimited by | (there must be the same number of tokens as on the first line) and the second column is the count of this morphology entry.
form|lemma|tag
dog|dog|NN 68
likes|like|VB 220
...
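The lexicon format described above can be checked before running create_morphology. The following is a minimal sketch, not part of Korektor: the function name parse_lexicon and the MAX_FACTORS constant are hypothetical, and skipping blank lines is an assumption made only for convenience.

```python
MAX_FACTORS = 4  # mirrors the hard limit mentioned in the text (assumed unchanged)


def parse_lexicon(lines):
    """Parse lexicon lines into (factor_names, entries).

    Each entry is (tuple_of_factor_strings, count).
    """
    it = iter(lines)
    factor_names = next(it).rstrip("\n").split("|")
    if len(factor_names) > MAX_FACTORS:
        raise ValueError("too many factors: %d" % len(factor_names))

    entries = []
    for lineno, line in enumerate(it, start=2):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are tolerated by this checker
        parts = line.rsplit(" ", 1)  # factors column, count column
        if len(parts) != 2:
            raise ValueError("line %d: expected two columns" % lineno)
        factors = parts[0].split("|")
        if len(factors) != len(factor_names):
            raise ValueError("line %d: expected %d factors, got %d"
                             % (lineno, len(factor_names), len(factors)))
        entries.append((tuple(factors), int(parts[1])))
    return factor_names, entries
```

Calling parse_lexicon(open(path, encoding="utf-8")) on a lexicon file raises ValueError at the first malformed line, which is usually easier to diagnose than a failure inside the binary tool.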
1.2. Running create_morphology
create_morphology should be run as follows:
create_morphology in_morphology_lexicon out_bin_morphology out_bin_vocabulary out_test_file
2. Creating a Language Model for Each Morphological Factor
The language model for each morphological factor should be created by an external tool such as SRILM or KenLM and stored in ARPA format.
To create a binary representation of such an ARPA-format model, the
create_lm_binary tool should be used as follows:
create_lm_binary in_arpa_model in_bin_morphology in_bin_vocabulary factor_name lm_order out_bin_lm
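Because one language model is needed per factor, the command above is typically run once for each factor. A minimal sketch of building those command lines follows; the helper name lm_binary_commands and the file naming scheme (factor.arpa, factor_lm.bin) are assumptions for illustration only, while the argument order follows the usage line above.

```python
def lm_binary_commands(factors, morphology_bin, vocabulary_bin, order=3):
    """Return one create_lm_binary argument list per morphological factor."""
    commands = []
    for factor in factors:
        commands.append([
            "create_lm_binary",
            factor + ".arpa",      # in_arpa_model (hypothetical file name)
            morphology_bin,        # in_bin_morphology
            vocabulary_bin,        # in_bin_vocabulary
            factor,                # factor_name
            str(order),            # lm_order
            factor + "_lm.bin",    # out_bin_lm (hypothetical file name)
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True), provided create_lm_binary is on the PATH.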
3. Creating an Error Model
To create an error model, a textual error model description must be provided
and processed by the create_error_model tool.
3.1. Error Model Input Format
The textual error model description is in UTF-8 format and contains one error
model item (edit operation) per line. Each item contains three
tab-separated columns: signature, edit distance and cost.
signature: Describes the edit operation in the following format:
s_ab: substitution of letter a for letter b
i_abc: insertion of letter a between letters b and c
d_ab: deletion of letter a following after letter b
swap_ab: swap of letters a and b
case: change of letter casing (lowercase to uppercase and vice versa)
substitutions: default substitution operation used when no s_ab item applies
insertions: default insertion operation used when no i_abc item applies
deletions: default deletion operation used when no d_ab item applies
swaps: default swap operation used when no swap_ab item applies
edit distance: The integer edit distance of this operation. This distance is used during the similar-word lookup, which is limited by a maximum edit distance. A sensible default is 1, but 0 can be useful, for example when only a diacritical mark is added or removed.
cost: The logarithm of the probability of the edit operation.
The first five lines must contain the operations case, substitutions, insertions, deletions and
swaps, in this order.
Example (with textually marked tab characters):
case<------tab------->0<---tab--->2.6
substitutions<--tab-->1<---tab--->3.8
insertions<---tab---->1<---tab--->4.4
deletions<----tab---->1<---tab--->3.5
swaps<------tab------>1<---tab--->4.1
s_qw<-------tab------>1<---tab--->3.7
s_ui<-------tab------>1<---tab--->2.3
s_yi<-------tab------>1<---tab--->2.1
s_aá<-------tab------>0<---tab--->1.7
i_iuo<------tab------>1<---tab--->4.8
...
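A textual error model can be sanity-checked before it is binarized. The sketch below is an illustration and not part of Korektor: the name parse_error_model is hypothetical, real tab characters are expected instead of the textual markers shown in the example, and the required order of the first five items is taken from that example.

```python
REQUIRED_HEADER = ["case", "substitutions", "insertions", "deletions", "swaps"]


def parse_error_model(lines):
    """Parse error model lines into (signature, edit_distance, cost) tuples."""
    items = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are tolerated by this checker
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 tab-separated columns" % lineno)
        signature, distance, cost = fields
        items.append((signature, int(distance), float(cost)))

    # The five default operations must come first, in the fixed order.
    header = [signature for signature, _, _ in items[:5]]
    if header != REQUIRED_HEADER:
        raise ValueError("first five items must be: " + ", ".join(REQUIRED_HEADER))
    return items
```

The check only covers the layout described in this section; it does not verify that the signatures themselves are well formed.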
3.2. Running create_error_model
The create_error_model binary should be run as follows:
create_error_model --binarize in_txt_error_model out_bin_error_model
4. Configuration File
The configuration file specifies the morphology model, language models and error model to use. In addition, it specifies the strategy for searching similar words and can enable a diagnostics mode.
When a file is specified in the configuration file, its name is considered to be relative to the directory containing the configuration file.
The configuration file is line oriented and each line should adhere to one of the following formats:
- an empty line or a line starting with # is ignored
- morpholex=bin_morphology_file: use the specified morphology model
- lm-bin_lm_file-model_order-model_weight: use the specified language model. The corresponding factor is stored in the language model itself. Model weight is a floating point multiplicative factor used when all factor language model probabilities are summed together.
- errormodel=bin_error_model_file: use the specified error model file
- search-casing_treatment-max_edit_distance-max_cost: defines a method for finding possible corrections. Multiple methods can be specified in the configuration file. The methods are tried in the order of their appearance in the configuration file until one produces a nonempty set of possible corrections. Each search method has the following options:
  - casing_treatment: one of three possible values:
    - case_sensitive: the casing of the original word is honored when looking up possible suggestions in the morphology model
    - ignore_case: the casing of the original word is ignored when looking up possible suggestions in the morphology model, and the casing defined in the morphology model is used instead of the original one
    - ignore_case_keep_orig: the casing of the original word is ignored when looking up possible suggestions in the morphology model, but the generated suggestions have the same casing as the original word
  - max_edit_distance: the maximum edit distance of possible corrections
  - max_cost: the maximum cost of possible corrections
- diagnostics=bin_vocabulary_file: use diagnostics mode, which dumps a lot of information during spellchecking. A vocabulary file created during morphology model creation is needed to print out the morphological factors.
The lines can be in arbitrary order; only the relative ordering of search
lines is utilized when finding possible corrections.
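The ordered fallback between search methods can be pictured with a short sketch. This is not Korektor's implementation: find_corrections and the lookup callback are hypothetical names, and the sketch only shows the "first nonempty result wins" behaviour described above.

```python
def find_corrections(word, search_rounds, lookup):
    """Try each search round in configuration order; stop at the first hit.

    search_rounds: list of (casing_treatment, max_edit_distance, max_cost)
    lookup: hypothetical callable returning a list of candidate corrections
    """
    for casing_treatment, max_edit_distance, max_cost in search_rounds:
        candidates = lookup(word, casing_treatment, max_edit_distance, max_cost)
        if candidates:
            return candidates  # later rounds are not triggered
    return []
```

Placing the cheapest and strictest round first means the wider rounds, with a larger edit distance or cost limit, run only for words that the earlier rounds could not handle.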
As an example, consider the following configuration file:
# Binary file containing morphology and lexicon.
morpholex=data/morphology_h2mor_freq2.bin

# Error model binary.
errormodel=data/error_model_train0.bin

# Language models in the following format:
# lm-[filename]-[order]-[weight]
lm-data/form_lm_h2mor.bin-3-0.40
lm-data/lemma_lm_h2mor.bin-3-0.1
lm-data/tag_lm_h2mor.bin-3-0.50

# Search options, each item specify a distinct search rounds.
# Searches are triggered in the order specified in this file,
# whenever one of the search rounds find at least one possibility,
# the consecutive search rounds are not triggered.
# The format is the following:
# search-[casing_treatment]-[max_edit_distance]-[max_cost]
search-case_sensitive-1-6
search-ignore_case_keep_orig-1-6
search-ignore_case_keep_orig-2-9

# The diagnostics mode can be activated by uncommenting the following line.
#diagnostics=data/morphology_h2mor_freq2_vocab.bin
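To make the line formats concrete, here is a minimal sketch of reading such a configuration file. It is an illustration only and not Korektor's own parser: parse_config is a hypothetical name, stripping surrounding whitespace and treating max_cost as a floating point number are assumptions, and file names are resolved relative to the directory of the configuration file as described above.

```python
import os


def parse_config(text, config_dir="."):
    """Parse configuration text into a dictionary of settings."""
    config = {"morpholex": None, "errormodel": None, "diagnostics": None,
              "lms": [], "searches": []}
    for raw in text.splitlines():
        line = raw.strip()  # assumption: surrounding whitespace is insignificant
        if not line or line.startswith("#"):
            continue
        if line.startswith("lm-"):
            # The file name may contain '-', so split off order and weight from the right.
            path, order, weight = line[len("lm-"):].rsplit("-", 2)
            config["lms"].append((os.path.join(config_dir, path), int(order), float(weight)))
        elif line.startswith("search-"):
            casing, max_dist, max_cost = line[len("search-"):].split("-")
            config["searches"].append((casing, int(max_dist), float(max_cost)))
        elif "=" in line:
            key, value = line.split("=", 1)
            if key not in ("morpholex", "errormodel", "diagnostics"):
                raise ValueError("unknown option: " + key)
            config[key] = os.path.join(config_dir, value)
        else:
            raise ValueError("unrecognized line: " + line)
    return config
```

The order of the entries in config["searches"] matches their order in the file, which is the only ordering that matters for finding corrections.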