Statistical Machine Translation Quick Run Package (version 1.2)

(by Jan Curin)
 

Overview

This is a brief description of how to build a statistical machine translation (SMT) system on your own parallel corpora as quickly as possible. It includes a sample that builds a Czech-English translation system on parallel text from the Prague Czech-English Dependency Treebank (PCEDT) corpus. PCEDT version 1.0 is distributed by the Linguistic Data Consortium (LDC); see the LDC Catalog.

The SMT Quick Run Package is a set of scripts for preparing training data in the appropriate format, for training language and translation models, and for running a decoder in server mode. It requires the software packages listed in the section Download of Necessary Executables.

Install

Unpack SMT_QuickRun1.2.tgz, which is located in the tools directory of the PCEDT distribution and is also available as a separate download:

	tar -xvzf SMT_QuickRun1.2.tgz

It contains the following files:

SMT_QuickRun1.2/bin/PCEDT_Sample/Makefile: Makefile for preparing training data, training, and running the Czech-English SMT system on the PCEDT corpus
SMT_QuickRun1.2/bin/add_sent_marks.prl: Adds sentence markup to LM training files
SMT_QuickRun1.2/bin/start_decoder.template: Template for the start_decoder script
SMT_QuickRun1.2/bin/rewrite.mkZeroFert2.perl: Script for creating the zero fertility file, written by Ulrich Germann
SMT_QuickRun1.2/bin/xml-wrap-input: XML wrapper for ISI ReWrite Decoder input, written by Ulrich Germann
SMT_QuickRun1.2/bin/tserver-client.perl: Translation client for the ISI ReWrite Decoder server, written by Ulrich Germann
SMT_QuickRun1.2/bin/whittle.prl: Prepares data for GIZA++ TM training. Written by Mike Jahr and included in the EGYPT Statistical Machine Translation Toolkit developed during WS'99 at CLSP JHU (EGYPT/tools/whittle/whittle.perl)
SMT_QuickRun1.2/doc/SMT_QuickRun.html: This file
SMT_QuickRun1.2/doc/sample_data/HansTest_e: English part of 10000 sentence pairs from the Canadian Hansard Corpus, for sample TM training
SMT_QuickRun1.2/doc/sample_data/HansTest_f: French part of 10000 sentence pairs from the Canadian Hansard Corpus, for sample TM training
SMT_QuickRun1.2/doc/sample_data/HansTest.plain: 30000 sentences from the English part of the Canadian Hansard Corpus, for sample LM training
SMT_QuickRun1.2/exec/train_LM.sh: Script for LM training
SMT_QuickRun1.2/exec/train_TM.sh: Script for TM training
SMT_QuickRun1.2/exec/prepare_decoder.sh: Script for preparing configuration files and running the decoder
SMT_QuickRun1.2/bleu: Data for BLEU and NIST evaluation
SMT_QuickRun1.2/bleu/NISTref.de: Reference translations for the development set
SMT_QuickRun1.2/bleu/NISTref.te: Reference translations for the evaluation (test) set

Download of Necessary Executables

Then download and build all the tools needed for language model (LM) training, translation model (TM) training, word class (WC) training, and decoding. The required executables and their sources are listed below. Put all of the listed files into the SMT_QuickRun1.2/bin/ directory.
 
 
Files: text2wfreq, wfreq2vocab, text2idngram, idngram2lm
Description: Creates a binary LM loadable by the ISI ReWrite Decoder
Source: CMU Statistical Language Modelling Toolkit, version 2

Files: mkcls
Description: Trains word classes (WCs)
Source: mkcls tool for training of word classes by Franz Josef Och

Files: GIZA++, snt2plain.out
Description: GIZA++ trains TM tables; snt2plain.out prepares data for WC training
Source: GIZA++, an update of GIZA by Franz Josef Och. GIZA is the TM training tool from EGYPT

Files: decoder.linux.public, libicudata.so.22, libicuuc.so.22, libxerces-c.so.21
Description: ISI ReWrite Decoder server
Source: Executables and libraries of the ISI ReWrite Decoder, written by Daniel Marcu and Ulrich Germann. Downloadable from http://www.isi.edu/cgi-bin/licensed-sw/rewrite-decoder/rw-decoder

Files: mteval-v09.pl
Description: BLEU and NIST evaluation
Source: Included in the NIST MT evaluation kit version 9 at http://www.nist.gov/speech/tests/mt/mt2001/resource/
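
Before training, it can help to confirm that every file listed above really is in the bin directory. The following is a minimal sketch, not part of the package: the function name missing_tools is made up for illustration, and the file names are taken from the list above.

```shell
#!/bin/sh
# Sketch (not shipped with SMT Quick Run): print the required files that are
# missing from a bin directory. The function name is hypothetical.
# Usage: missing_tools /home/my_home/SMT_QuickRun1.2/bin
missing_tools() {
    bin_dir=$1
    for tool in text2wfreq wfreq2vocab text2idngram idngram2lm \
                mkcls GIZA++ snt2plain.out decoder.linux.public \
                libicudata.so.22 libicuuc.so.22 libxerces-c.so.21 \
                mteval-v09.pl; do
        # Print the name of each file that does not exist in the directory.
        [ -e "$bin_dir/$tool" ] || echo "$tool"
    done
}
```

If the function prints nothing, all listed files are present. It only checks that the files exist, not that they run on your machine.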

Building Czech-English Statistical Machine Translation System on PCEDT Corpus

In this sample, translation goes from Czech to English. It requires the SMT Quick Run package and all executables listed in the section Download of Necessary Executables. Because statistical machine translation employs the noisy channel model, in which the observed Czech text is treated as a distorted version of an English original, we use the term source language for English and the term target language for Czech here.
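
The naming follows from the standard noisy channel decomposition. Writing c for the Czech input and e for a candidate English output (both symbols are chosen here for illustration), the decoder looks for:

```latex
\hat{e} = \arg\max_{e} P(e \mid c) = \arg\max_{e} P(e)\, P(c \mid e)
```

P(e) is the language model and P(c | e) is the translation model, so English plays the role of the source of the channel even though it is the language being translated to.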

NOTE: The Perl scripts expect perl to be located at /usr/bin/perl.

In the example below we suppose that the SMT Quick Run package is in your home directory, i.e. /home/my_home/SMT_QuickRun1.2/, and that the PCEDT distribution is in the directory /data/LDC/PCEDT_CD_1.0.

Check and set the variables PCEDT_ROOT and SMT_QR_ROOT in the makefile /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Makefile as indicated:

#######################################
# Variables to be set

## Root of PCEDT distribution
#  Such as /data/LDC/PCEDT_CD_1.0
PCEDT_ROOT=/data/LDC/PCEDT_CD_1.0

## Root of SMT Quick Run package
#  Such as /home/my_home/tools/SMT_QuickRun1.2
SMT_QR_ROOT=/home/my_home/SMT_QuickRun1.2


Run make on this Makefile:

	cd /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/
	make prepare_training_data

This copies and transforms all training data from the PCEDT distribution. Continue by running:

	make train_models

Language model training takes about 1 minute on an Intel(R) Pentium(R) 4 CPU at 2.66GHz; its log file is /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/LM_training.log. Word class training and translation model training take about 25 minutes on the same machine. If translation model training finishes much faster than that, check /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/TM_training.log for errors. Then a configuration file and a run script for the decoder are created; see the log in /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/prepare_decoder.log for details.
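
A quick way to look through the three logs is sketched below. It assumes that failures show up as lines containing the word "error" in any letter case; the actual log format of the tools is not described here, so treat a clean result as a hint rather than proof. The function name scan_logs is hypothetical.

```shell
#!/bin/sh
# Sketch: report missing logs and log lines that mention "error".
# Assumption: problems are reported with the word "error" (any case).
# Returns 0 when nothing suspicious is found, 1 otherwise.
scan_logs() {
    rc=0
    for log in "$@"; do
        if [ ! -f "$log" ]; then
            echo "$log: missing"
            rc=1
        # /dev/null as a second file makes grep prefix matches with the file name.
        elif grep -i -n "error" "$log" /dev/null; then
            rc=1
        fi
    done
    return "$rc"
}
```

For the sample system it could be called as scan_logs LM_training.log TM_training.log prepare_decoder.log from the PCEDT_Sample directory.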

At the end of the model training process, a run script for the decoder is created in the file /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Cz2En_start_decoder.sh. This script runs the ISI ReWrite Decoder in server mode and creates a script for translations. Type

	./Cz2En_start_decoder.sh

If successful, the decoder enters server mode and reports a port:

entering server mode
Listening on port 1083.

Now the system is ready for translations. The newly created script /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Cz2En_translate.sh expects the text to be translated on standard input and writes the resulting translation to standard output.
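
Since the translate script reads standard input and writes standard output, translating a file is a matter of redirection. The wrapper below is a small sketch with a hypothetical name (translate_file) and hypothetical file names; it only adds two sanity checks around the redirection.

```shell
#!/bin/sh
# Sketch: feed a file to a translate script on standard input and save
# whatever the script writes to standard output.
# Usage: translate_file ./Cz2En_translate.sh input.cz output.en
translate_file() {
    script=$1; input=$2; output=$3
    [ -x "$script" ] || { echo "cannot execute $script" >&2; return 1; }
    [ -r "$input" ]  || { echo "cannot read $input" >&2; return 1; }
    "$script" < "$input" > "$output"
}
```

The decoder server must already be running. The input presumably needs the same tokenization as the training data, although this is not stated explicitly here.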

Use script /home/my_home/SMT_QuickRun1.2/PCEDT_Sample/Cz2En_kill_decoder.sh to kill the decoder server after performing all translations.

HINT: The English language model is built on a relatively small corpus, the English side of the PCEDT parallel texts. You can improve translation quality by adding more English monolingual data for LM training. See the details of the data format and location in the next section.

To evaluate the results against 4 reference translations, proceed as follows.

First prepare the test data from the PCEDT distribution by running:

	make prepare_test_data

Then perform the translations and evaluate the results using the BLEU and NIST metrics by running:

	make evaluate_results

Enjoy the results!

From Plain Data to Statistical Machine Translation System

This is an example of how to run LM and TM training on your own corpus. Let us say that your translation goes from French to English. Because statistical machine translation employs the noisy channel model, we use the term source language for English and the term target language for French in that case.

NOTE: The Perl scripts expect perl to be located at /usr/bin/perl.

In the example below we suppose that the SMT Quick Run package is in your home directory, i.e. /home/my_home/SMT_QuickRun1.2.

Let us name the system My_SMT, set the default file prefix to Fr2En, and choose "f" as the suffix for French.

Create new working directories for your system:

	cd /home/my_home/SMT_QuickRun1.2/
	mkdir My_SMT
	mkdir My_SMT/LM
	mkdir My_SMT/TM

Copy and edit the Makefile from the PCEDT_Sample system:

	cd /home/my_home/SMT_QuickRun1.2/
	cp PCEDT_Sample/Makefile My_SMT
	cd My_SMT

Set the variables SMT_QR_ROOT, DATADIR, FILEPREF, SRCLANGSUFF, and TGTLANGSUFF in the makefile /home/my_home/SMT_QuickRun1.2/My_SMT/Makefile as indicated below.

#######################################
# Variables to be set

## Root of PCEDT distribution
#  Such as /data/LDC/PCEDT_CD_0.9
#PCEDT_ROOT=/data/LDC/PCEDT_CD_0.9

## Root of SMT Quick Run package
#  Such as /home/my_home/SMT_QuickRun1.2
SMT_QR_ROOT=/home/my_home/SMT_QuickRun1.2

## Working directory of the current SMT experiment (or PCEDT_Sample)
#  Such as /home/my_home/SMT_QuickRun1.2/PCEDT_Sample
#  This is a default location of this Makefile
DATADIR=${SMT_QR_ROOT}/My_SMT

## Default file prefix for the current SMT experiment
#  Such as "Cz2En" for Czech-English translations
FILEPREF=Fr2En

## One letter suffix of the source language (BEWARE: this is the language you are translating to)
#  "e" as English in the PCEDT_Sample
SRCLANGSUFF=e

## One letter suffix of the target language (BEWARE: this is the language you are translating from)
#  "c" as Czech in the PCEDT_Sample
TGTLANGSUFF=f

# END of section Variables to be set
########################################

Training Data


Copy your parallel corpus into the TM directory, for example:

	cd /home/my_home/SMT_QuickRun1.2/
	cp doc/sample_data/HansTest_e My_SMT/TM/
	cp doc/sample_data/HansTest_f My_SMT/TM/

Copy your monolingual corpus into the LM directory, for example:

	cp doc/sample_data/HansTest.plain My_SMT/LM/

Tokenizers for English, French, and Czech can be found for example in The EGYPT Statistical Machine Translation Toolkit.
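
TM training requires one sentence (or parallel block) per line and the same number of lines in both language files, so it is worth comparing the line counts before training. The following is a minimal sketch; the function name check_parallel is made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the number of lines in the two halves of a parallel corpus.
# Usage: check_parallel HansTest_e HansTest_f
check_parallel() {
    [ -r "$1" ] && [ -r "$2" ] || { echo "cannot read input files" >&2; return 2; }
    # Some wc implementations pad the count with spaces; tr removes them.
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    if [ "$src_lines" -eq "$tgt_lines" ]; then
        echo "OK: $src_lines lines in both files"
    else
        echo "MISMATCH: $1 has $src_lines lines, $2 has $tgt_lines lines"
        return 1
    fi
}
```

Equal line counts do not guarantee that the sentences are correctly aligned, but a mismatch is a sure sign that something went wrong during data preparation.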

Training of the Models

Run make on this Makefile:

	cd /home/my_home/SMT_QuickRun1.2/My_SMT/
	make train_models

Language model training is logged in the file /home/my_home/SMT_QuickRun1.2/My_SMT/LM_training.log; word class training and translation model training are logged in the file /home/my_home/SMT_QuickRun1.2/My_SMT/TM_training.log. Then a configuration file and a run script for the decoder are created; see the log in /home/my_home/SMT_QuickRun1.2/My_SMT/prepare_decoder.log for details.

Performing Translations

At the end of the model training process, a run script for the decoder is created in the file /home/my_home/SMT_QuickRun1.2/My_SMT/Fr2En_start_decoder.sh. This script runs the ISI ReWrite Decoder in server mode and creates a script for translations. Type

	./Fr2En_start_decoder.sh

If successful, the decoder enters server mode and reports a port:

entering server mode
Listening on port 1083.

Now the system is ready for translations. The newly created script /home/my_home/SMT_QuickRun1.2/My_SMT/Fr2En_translate.sh expects the text to be translated on standard input and writes the resulting translation to standard output.

Use script /home/my_home/SMT_QuickRun1.2/My_SMT/Fr2En_kill_decoder.sh to kill the decoder server after performing all translations.

Command Line Parameters

Instead of running make on the Makefile, you can run the scripts directly from the command line. Their usage is as follows:

	Usage: train_LM.sh -B SMTQuickRunBin -D WorkingDirectory -P FileNamePrefix [-t TempDir]

Creates binary language model.
Train data should be stored in files *.plain in <WorkDirectory>.
Format of <FileNamePrefix>.plain is plain tokenized text, sentence per line.
If successful, the output binary LM file <FileNamePrefix>.binlm is created in <WorkDirectory>.
<SMTQuickRunBin> is the location of CMU LMT, GIZA++, mkcls, and SMTQuickRun executables.

	Usage: train_TM.sh -B SMTQuickRunBin -D WorkingDirectory -P FileNamePrefix [-T _f] [-S _e] [-L MaxSentLength] [-p0 GizaNullWordParam0 ] [-c CutoffTreshold] [-r] [-h]

Performs preparation of corpora, training of word classes and a standard GIZA training.
Train data files should be located in <WorkDirectory>.
Target language files suffix is declared by optional parameter -T (default is _f),
source language files suffix is declared by optional parameter -S (default is _e).
 example: Translating from German (suffix _g) to English (suffix _e), you have to specify -T _g parameter/value pair.
Required format of train data is plain text, sentence (parallel blocks) per line.
Number of lines has to be the same in both languages.
If successful, trained translation model files are created in <WorkDirectory>.
Decoder TM config file template is stored in <FileNamePrefix>.Decoder.config file.
<SMTQuickRunBin> is the location of CMU LMT, GIZA++, mkcls, and SMTQuickRun executables.
Parameters:
  -P <string>   set default prefix of the files to <string>
  -L <number>   set maximum length of sentences to <number> (default=40)
  -p0 <number>   set GIZA null word parameter p0 to <number> (default=0.98)
  -c <number>   set cutoff threshold for lexicon probabilities to <number> (default=1e-7)
Switches:
  -r   do not retrain word classes
  -h   display help and exit
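
To see how much of a corpus is affected by the -L limit, you can count the lines that exceed it. The sketch below counts whitespace-separated tokens per line, which is an assumption about how sentence length is measured; how train_TM.sh treats overlong sentences is not described here. The function name count_long_sentences is hypothetical.

```shell
#!/bin/sh
# Sketch: count lines with more than MAX_LEN whitespace-separated tokens.
# Assumption: sentence length means the number of whitespace-separated tokens.
# Usage: count_long_sentences FILE [MAX_LEN]   (MAX_LEN defaults to 40)
count_long_sentences() {
    awk -v max_len="${2:-40}" 'NF > max_len { n++ } END { print n + 0 }' "$1"
}
```

Running it on both language files with the value you intend to pass to -L gives a rough idea of how many sentence pairs fall outside the limit.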

	Usage: prepare_decoder.sh -B ISIDecoderBin -D WorkingDirectory -P FileNamePrefix [-T f] [-S e] [-M MinZeroFertCount]

Creates files for running ISI ReWrite Decoder.
Script for running the decoder will be created in <WorkingDirectory>.
Target language files suffix is declared by optional parameter -T (default is f),
source language files suffix is declared by optional parameter -S (default is e).
 example: Translating from German (suffix g) to English (suffix e), you have to specify -T g parameter/value pair.
LM files are expected in <WorkingDirectory>/LM
TM files are expected in <WorkingDirectory>/TM
Set <FileNamePrefix> to the same prefix as for LM & TM training.
<ISIDecoderBin> is the location of CMU LMT, GIZA++, mkcls, and ISIDecoder executables.

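The three scripts are meant to be run in the order LM training, TM training, decoder preparation. The sketch below only prints the corresponding command lines so that they can be reviewed before running them. It assumes that the LM and TM subdirectories of the experiment directory serve as the working directories of the two training scripts (inferred from where prepare_decoder.sh expects its input), that all executables are in the package's bin directory, and that train_TM.sh takes suffixes with a leading underscore while prepare_decoder.sh takes them without, as in the usage texts above. The function name print_pipeline is hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) the three training commands in order.
# Assumptions: training data lives in WORK_DIR/LM and WORK_DIR/TM, and all
# executables are in QUICKRUN_ROOT/bin.
# Usage: print_pipeline QUICKRUN_ROOT WORK_DIR PREFIX TGT_SUFFIX SRC_SUFFIX
print_pipeline() {
    root=$1; work=$2; prefix=$3; tgt=$4; src=$5
    echo "$root/exec/train_LM.sh -B $root/bin -D $work/LM -P $prefix"
    echo "$root/exec/train_TM.sh -B $root/bin -D $work/TM -P $prefix -T _$tgt -S _$src"
    echo "$root/exec/prepare_decoder.sh -B $root/bin -D $work -P $prefix -T $tgt -S $src"
}
```

For the French-English example, print_pipeline /home/my_home/SMT_QuickRun1.2 /home/my_home/SMT_QuickRun1.2/My_SMT Fr2En f e prints the three commands; compare them with what the Makefile runs before executing them by hand.
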
Documentation

For details of CMU Statistical Language Modelling Toolkit settings see http://svr-www.eng.cam.ac.uk/~prc14/toolkit_documentation.html

For details of The EGYPT Statistical Machine Translation Toolkit see http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/

For detailed description of ISI ReWrite Decoder parameters see http://www.isi.edu/natural-language/software/decoder/manual.html