This is a brief description of how to build a statistical machine
translation (SMT) system on your parallel corpora as quickly as possible.
It includes a sample of building a Czech-English translation system on
parallel text from the Prague Czech-English Dependency Treebank (PCEDT)
corpus. PCEDT version 1.0 is distributed by the Linguistic Data
Consortium (LDC); see the LDC Catalog.
The SMT Quick Run package is a set of scripts for preparing training
data in the appropriate format, for training language and translation
models, and for running a decoder in server mode. The SMT Quick Run
requires the following software packages:
tar -xvzf SMT_QuickRun1.2.tgz
It contains the following files:
| SMT_QuickRun1.2/bin/PCEDT_Sample/Makefile | Makefile for preparing training data, training and running the Czech-English SMT on PCEDT corpus |
| SMT_QuickRun1.2/bin/add_sent_marks.prl | Adds sentence markup to LM training files |
| SMT_QuickRun1.2/bin/start_decoder.template | Template for start_decoder script |
| SMT_QuickRun1.2/bin/rewrite.mkZeroFert2.perl | Script for creating zero fertility file written by Ulrich Germann |
| SMT_QuickRun1.2/bin/xml-wrap-input | XML wrapper for ISI ReWrite Decoder input written by Ulrich Germann |
| SMT_QuickRun1.2/bin/tserver-client.perl | Translation client for ISI ReWrite Decoder server written by Ulrich Germann |
| SMT_QuickRun1.2/bin/whittle.prl | Prepares data for GIZA++ TM training, written by Mike Jahr. Included in The EGYPT Statistical Machine Translation Toolkit developed during the WS'99 at CLSP JHU. (EGYPT/tools/whittle/whittle.perl) |
| SMT_QuickRun1.2/doc/SMT_QuickRun.html | This file |
| SMT_QuickRun1.2/doc/sample_data/HansTest_e | English part of 10000 sentence pairs from The Canadian Hansard Corpus for sample TM training |
| SMT_QuickRun1.2/doc/sample_data/HansTest_f | French part of 10000 sentence pairs from The Canadian Hansard Corpus for sample TM training |
| SMT_QuickRun1.2/doc/sample_data/HansTest.plain | 30000 sentences from English part of The Canadian Hansard Corpus for sample LM training |
| SMT_QuickRun1.2/exec/train_LM.sh | Script for LM training |
| SMT_QuickRun1.2/exec/train_TM.sh | Script for TM training |
| SMT_QuickRun1.2/exec/prepare_decoder.sh | Script for preparation of configuration files and running the decoder |
| SMT_QuickRun1.2/bleu | Data for BLEU and NIST evaluation |
| SMT_QuickRun1.2/bleu/NISTref.de | Reference translations for development set |
| SMT_QuickRun1.2/bleu/NISTref.te | Reference translations for evaluation (test) set |
| Files | Description | Link to the Source |
| text2wfreq | Creates binary LM loadable by ISI ReWrite Decoder | CMU Statistical Language Modelling Toolkit, version 2 |
| mkcls | Trains word classes (WCs) | mkcls tool for training of word classes by Franz Josef Och |
| GIZA++ | Trains TM tables; prepares data for WC training | GIZA++, the GIZA update from Franz Josef Och. GIZA is the TM training tool from EGYPT. |
| decoder.linux.public | ISI ReWrite Decoder server | Executables and libraries of ISI ReWrite Decoder written by Daniel Marcu and Ulrich Germann. Downloadable from http://www.isi.edu/cgi-bin/licensed-sw/rewrite-decoder/rw-decoder |
| mteval-v09.pl | BLEU and NIST evaluation | Included in the NIST MT evaluation kit version 9 at http://www.nist.gov/speech/tests/mt/mt2001/resource/ |
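Before training, it may help to confirm that the files named in the table above are actually in place. The following is a minimal sketch, assuming all of them were copied into one directory; the function name `missing_tools` and the single-directory layout are illustrative assumptions, not part of the package.

```shell
#!/bin/sh
# Sketch: print the names of the files from the table above that are
# missing from a given directory (assumed layout: all files in one place).
missing_tools() {
    dir=$1
    for tool in text2wfreq mkcls GIZA++ decoder.linux.public mteval-v09.pl; do
        # -f: the path exists and is a regular file
        [ -f "$dir/$tool" ] || echo "$tool"
    done
}
```

Calling `missing_tools /home/my_home/SMT_QuickRun1.2/bin` prints one missing name per line and nothing when all files are present.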
In this sample, the translation goes from Czech to English. It requires
installation of the SMT Quick Run package and all executables listed in
section Download of Necessary Executables. Because statistical machine
translation employs the Noisy Channel Model, we use the term source
language for English and the term target language for Czech in that case.
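The terminology follows from the usual noisy-channel formulation. As a sketch, with $c$ standing for the Czech input sentence and $e$ for a candidate English translation (these symbols are chosen here for illustration and do not come from the package):

```latex
\hat{e} = \arg\max_{e} P(e \mid c) = \arg\max_{e} P(e)\, P(c \mid e)
```

The second equality follows from Bayes' rule, since $P(c)$ does not depend on $e$. Here $P(e)$ is the language model and $P(c \mid e)$ is the translation model. The channel is imagined to turn English (the source) into Czech (the target), and the decoder inverts it.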
NOTE: Perl scripts expect perl to be located at /usr/bin/perl.
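A quick way to check this requirement is sketched below. The function name `check_interpreter` is hypothetical; the default path is the one stated in the note.

```shell
#!/bin/sh
# Sketch: verify that an executable interpreter exists at the expected path.
check_interpreter() {
    path=${1:-/usr/bin/perl}
    if [ -x "$path" ]; then
        echo "ok: $path"
    else
        echo "missing: $path" >&2
        return 1
    fi
}
```

If the check fails, `command -v perl` shows where perl is installed; a symbolic link at /usr/bin/perl or an edited first line in each script are two possible workarounds.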
In the example below we suppose that the SMT QuickRun package is in your
home directory, i.e. /home/my_home/SMT_QuickRun1.2/, and the PCEDT
distribution is in directory /data/LDC/PCEDT_CD_1.0.
Check and set variables PCEDT_ROOT and SMT_QR_ROOT
in makefile /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Makefile
as indicated:
#######################################
# Variables to be set
## Root of PCEDT distribution
# Such as /data/LDC/PCEDT_CD_1.0
PCEDT_ROOT=/data/LDC/PCEDT_CD_1.0
## Root of SMT Quick Run package
# Such as /home/my_home/tools/SMT_QuickRun1.2
SMT_QR_ROOT=/home/my_home/SMT_QuickRun1.2
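After editing, the two settings can be double-checked from the shell. This is a minimal sketch that assumes the variables are written as simple `NAME=value` lines, as in the fragment above; the helper names `makefile_var` and `check_makefile_roots` are hypothetical.

```shell
#!/bin/sh
# Sketch: print the value assigned to a variable in a Makefile.
# Only plain NAME=value lines are handled.
makefile_var() {
    # $1 = Makefile path, $2 = variable name
    sed -n "s/^$2[[:space:]]*=[[:space:]]*//p" "$1"
}

# Report whether the two root directories configured in the Makefile exist.
check_makefile_roots() {
    status=0
    for var in PCEDT_ROOT SMT_QR_ROOT; do
        value=$(makefile_var "$1" "$var")
        if [ -d "$value" ]; then
            echo "$var=$value (found)"
        else
            echo "$var=$value (missing)"
            status=1
        fi
    done
    return $status
}
```

For example, `check_makefile_roots /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Makefile` lists both settings and returns a non-zero status if either directory does not exist.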
Run make on this Makefile:
cd /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/
make prepare_training_data
It copies and transforms all training data from the PCEDT distribution. Continue by running:
make train_models
Language model training takes about 1 minute on an Intel(R) Pentium(R) 4 CPU 2.66GHz; its log file is /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/LM_training.log. Word classes training and translation models training take about 25 minutes on the same machine. If the translation models training takes suspiciously less time than indicated, check /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/TM_training.log for errors. Then a configuration file and a start script for the decoder are created; see /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/prepare_decoder.log for details.
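Checking a training log for errors can be partly automated. The sketch below lists suspicious lines with their line numbers; the keyword list is a guess, since the exact wording of messages in the logs is not documented here, and the function name `scan_log` is hypothetical.

```shell
#!/bin/sh
# Sketch: list lines of a log file that look like error reports.
# The keywords are an assumption; adjust them to the messages you see.
scan_log() {
    # -n: prefix matches with line numbers, -i: ignore case, -E: extended regex
    grep -n -i -E 'error|fail|cannot|not found' "$1"
}
```

For example, `scan_log TM_training.log || echo "no suspicious lines"` prints either the matching lines or the fallback message, because grep returns a non-zero status when nothing matches.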
./Cz2En_start_decoder.sh
If successful, the decoder enters server mode and indicates a port.
make prepare_test_data
Perform translations and evaluate results using BLEU and NIST metrics by running:
make evaluate_results
Enjoy the results!
cd /home/my_home/SMT_QuickRun1.2/
mkdir My_SMT
mkdir My_SMT/LM
mkdir My_SMT/TM
Copy and edit the Makefile from the PCEDT_Sample system:
cd /home/my_home/SMT_QuickRun1.2/
cp PCEDT_Sample/Makefile My_SMT
cd My_SMT
cd /home/my_home/SMT_QuickRun1.2/
cp doc/sample_data/HansTest_e My_SMT
cp doc/sample_data/HansTest_f My_SMT
cp doc/sample_data/HansTest.plain My_SMT
Tokenizers for English, French, and Czech can be found, for example, in The EGYPT Statistical Machine Translation Toolkit.
cd /home/my_home/SMT_QuickRun1.2/My_SMT/
make train_models
Language model training is logged in /home/my_home/SMT_QuickRun1.2/My_SMT/LM_training.log; word classes training and translation models training are logged in /home/my_home/SMT_QuickRun1.2/My_SMT/TM_training.log. Then a configuration file and a start script for the decoder are created; see /home/my_home/SMT_QuickRun1.2/My_SMT/prepare_decoder.log for details.
./Fr2En_start_decoder.sh
If successful, the decoder enters server mode and indicates a port.
Usage: train_LM.sh -B SMTQuickRunBin -D WorkingDirectory -P FileNamePrefix [-t TempDir]
Creates binary language model.
Training data should be stored in files *.plain in <WorkingDirectory>.
Format of <FileNamePrefix>.plain is plain tokenized text, one sentence per line.
If successful, the output binary LM file <FileNamePrefix>.binlm is created in <WorkingDirectory>.
<SMTQuickRunBin> is the location of CMU LMT, GIZA++, mkcls, and SMTQuickRun executables.
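As an illustration of the options above, the following sketch assembles a train_LM.sh call. It assumes that the script sits in the exec directory of the package (as in the file list) and that the executables are in its bin directory; the function name `train_lm` and the `RUN` switch are hypothetical. Unless `RUN=1` is set, the command is only printed.

```shell
#!/bin/sh
# Sketch: build a train_LM.sh command line from the documented options.
# Assumed layout: <root>/exec/train_LM.sh and executables in <root>/bin.
train_lm() {
    qr_root=$1    # SMT Quick Run root, e.g. /home/my_home/SMT_QuickRun1.2
    work_dir=$2   # directory holding <prefix>.plain
    prefix=$3     # file name prefix, e.g. HansTest
    set -- "$qr_root/exec/train_LM.sh" -B "$qr_root/bin" -D "$work_dir" -P "$prefix"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"          # actually run the training
    else
        echo "$@"     # dry run: show the command only
    fi
}
```

A dry run such as `train_lm /home/my_home/SMT_QuickRun1.2 /home/my_home/SMT_QuickRun1.2/My_SMT HansTest` shows the exact command before any training is started.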
Usage: train_TM.sh -B SMTQuickRunBin -D WorkingDirectory -P FileNamePrefix [-T _f] [-S _e] [-L MaxSentLength] [-p0 GizaNullWordParam0] [-c CutoffThreshold] [-r] [-h]
Performs preparation of corpora, training of word classes and a standard GIZA training.
Training data files should be located in <WorkingDirectory>.
Target language files suffix is declared by optional parameter -T (default is _f),
source language files suffix is declared by optional parameter -S (default is _e).
Example: when translating from German (suffix _g) to English (suffix _e), you have to specify the -T _g parameter/value pair.
Required format of training data is plain text, one sentence (parallel block) per line.
Number of lines has to be the same in both languages.
If successful, trained translation model files are created in <WorkingDirectory>.
Decoder TM config file template is stored in <FileNamePrefix>.Decoder.config file.
<SMTQuickRunBin> is the location of CMU LMT, GIZA++, mkcls, and SMTQuickRun executables.
Parameters:
-P <string> set default prefix of the files to <string>
-L <number> set maximum length of sentences to <number> (default=40)
-p0 <number> set GIZA null word parameter p0 to <number> (default=0.98)
-c <number> set cutoff threshold for lexicon probabilities to <number> (default=1e-7)
Switches:
-r do not retrain word classes
-h display help and exit
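Since both language files must have the same number of lines, a check before training can save a failed run. The following is a minimal sketch; the function name `same_line_count` is hypothetical.

```shell
#!/bin/sh
# Sketch: check that two parallel text files have the same number of lines.
same_line_count() {
    # Arithmetic expansion strips the padding some wc versions add.
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    if [ "$n1" -eq "$n2" ]; then
        echo "ok: $n1 lines"
    else
        echo "mismatch: $n1 vs $n2 lines" >&2
        return 1
    fi
}
```

For the sample data this would be called as `same_line_count HansTest_e HansTest_f` in the working directory.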
Usage: prepare_decoder.sh -B ISIDecoderBin -D WorkingDirectory -P FileNamePrefix [-T f] [-S e] [-M MinZeroFertCount]
Creates files for running ISI ReWrite Decoder.
Script for running the decoder will be created in <WorkingDirectory>.
Target language files suffix is declared by optional parameter -T (default is f),
source language files suffix is declared by optional parameter -S (default is e).
Example: when translating from German (suffix g) to English (suffix e), you have to specify the -T g parameter/value pair.
LM files are expected in <WorkingDirectory>/LM
TM files are expected in <WorkingDirectory>/TM
Set <FileNamePrefix> to the same prefix as for LM & TM training.
<ISIDecoderBin> is the location of CMU LMT, GIZA++, mkcls, and ISIDecoder executables.
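The expected directory layout can be verified before calling prepare_decoder.sh. This is a minimal sketch based only on the two subdirectories named above; the function name `check_decoder_layout` is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that the working directory has the LM and TM
# subdirectories in which prepare_decoder.sh expects the model files.
check_decoder_layout() {
    work_dir=$1
    status=0
    for sub in LM TM; do
        if [ ! -d "$work_dir/$sub" ]; then
            echo "missing directory: $work_dir/$sub" >&2
            status=1
        fi
    done
    return $status
}
```

For example, `check_decoder_layout /home/my_home/SMT_QuickRun1.2/My_SMT` returns a non-zero status and names each subdirectory that is absent.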
For details of The EGYPT Statistical Machine Translation Toolkit see http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/
For a detailed description of ISI ReWrite Decoder parameters see http://www.isi.edu/natural-language/software/decoder/manual.html