The course covers the area of machine translation (MT) in its current breadth, delving deep enough into each approach to let you know how to confuse every existing MT system. We put a balanced emphasis on several important types of state-of-the-art systems: phrase-based MT, surface-syntactic MT and (typically Praguian) deep-syntactic MT. We do not forget common prerequisites and surrounding fields: extracting translation equivalents from parallel texts (including word alignment techniques), MT evaluation, and methods of system combination.
We aim to provide a unifying view of machine translation as statistical search in a large search space, well supported with practical experience during your project work in a team or alone. Finally, we also attempt to give a gist of emerging approaches in MT, such as neural networks.
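One common way to write down the "statistical search" view is the log-linear formulation below. This is only a sketch in standard textbook notation: the symbols (f for the source sentence, e for a candidate translation, h_i for feature functions and \lambda_i for their weights) are assumptions introduced here, not notation taken from the lectures.

```latex
\hat{e} = \operatorname*{arg\,max}_{e} \; \sum_{i=1}^{n} \lambda_i \, h_i(e, f)
```

The search space is the set of all candidate translations e. The classical noisy channel model is the special case with two features, h_1(e, f) = \log p(f \mid e) and h_2(e, f) = \log p(e), and equal weights.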
Important news for 2021: We're trialling an inverted classroom style. In other words, you are expected:
Contributions to the grade:
Final Grade: ≥50% good, ≥70% very good, ≥90% excellent.
Legend: Slides, Main Content, Illustrative Content, Optional Reading, Homework Assignment
The dates below indicate when we talk about it. Remember to watch the Full Lecture Video much earlier.
If you see a 2020 date, the entry has not yet been updated.
Apr 1, 2021
Apr 8, 2021
Apr 15, 2021
Apr 22, 2021
Apr 29, 2021
Lecture Slides; Full Lecture Video; Shared Doc; Transformer Illustrated; Transformer in PyTorch; Transformer at Medium; Replacing Linguists with Dummies; Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed
May 6, 2021
May 13, 2021
May 20, 2021
May 27, 2021 Shared Doc with Schedule
** The following will still be updated for 2019/2020. **
June 3, 2021
For older versions of the lectures, you can browse the course history in SVN:
Deadline: 20th April 2020
The exam is written and consists of 7 questions, each equally important. In general, the exam questions will cover the full range of topics discussed in the lectures.
Here are the exam questions used in the past, for illustration:
Describe IBM Model 1 for word alignment, highlighting the EM structure of the algorithm. You may or may not use formulas.
Suggest limitations of IBM Model 1. Provide examples of sentences and their translations where the model is inadequate, suggest a solution for at least one of them.
Illustrate the problems of the word alignment task as such.
Come up with as many problems as you can for automatic word alignment when used in phrase-based MT.
Use a graph and/or the notation of deductive logic to illustrate the full space of partial (incl. complete) derivations translating "Marii miluje Jan" into English given the following translation dictionary:
Make up an example sentence and phrase table snippets. Illustrate the process of phrase-based translation. Remember to cover both the preparation of translation options as well as the hypothesis expansion.
Make up an example input sentence, phrase table snippets and the process of hypothesis expansion and pruning to illustrate why future cost estimation is needed in phrase-based MT. Ignore the cost of reordering.
In the first step of phrase-based translation, all relevant phrase translations are considered for an input sentence. How were the phrase translations obtained? What scores are associated with phrase translations? Roughly suggest how the scores can be estimated.
What is the relation between noisy channel model and log-linear model for MT? Try to use formulas. Remember to explain your notation.
Describe in detail the process of hypothesis expansion in phrase-based MT. Provide examples for local and non-local features for scoring the hypotheses. How can non-local features be turned into local ones?
Illustrate the extraction of "gappy phrases" for the hierarchical model from a word-aligned sentence pair (e.g. 4x5 words). List (some of) the extracted phrases in the order of extraction.
Illustrate chart parsing as used in both the hierarchical and the (surface-)syntactic translation models. You will need to provide a sample: input sentence, some rules, some rule applications.
What is the difference between the hierarchical and the (surface-)syntactic translation models? What new complications does syntax bring and how can they be solved?
Make up a sample sentence containing non-projectivity.
Why is non-projectivity important in MT? Provide an example.
For (a) a phrase-based model (think Moses) and (b) deep-syntactic translation (think TectoMT), provide examples of as many problems as you can (e.g. syntactic constructions where you can prove the model will fail, situations with a high risk of mismatch between training and test data).
Compare (a) a phrase-based model (think Moses) and (b) a constituency-based syntactic model (Joshua). Provide sample syntactic constructions for a language pair that includes English where (1) one of them is bound to fail and (2) both of them are bound to fail. Describe what new problems the syntactic model brings and how to tackle them (hint: coverage and sparseness).
When factors are used for target-side morphology, what are they meant to solve? Provide a (not very frequent) counterexample where the part added to the setup hurts instead of helping.
Provide 3 examples of factored phrase-based MT setups addressing various linguistic phenomena, explaining their potential benefits.
Compare language models based on word forms and language models based on POS tags (coarse ones like A, ... or more detailed ones, at your option) by making up cases where the increased generality of the POS LM helps and where it hurts in distinguishing good vs. bad sentences. You may need to say which patterns are frequent in your training data prior to saying how this misleads the model given some test data. Use monolingual or bilingual examples as you wish.
Sketch the idea of the reverse self-training approach. What benefits does it bring?
Why is MT NP-complete? Try providing a (polynomial) reduction of an NP-complete problem to a task in MT.
What are "local" vs. "non-local" features in search? Provide examples for phrase-based MT and also for an arbitrary syntactic model you come up with. You will probably need to sketch a small sample of the search space of each of the models with partial hypotheses.
What are the complications of introducing a language model to the hierarchical model (model based on chart parsing)? Illustrate state splitting.
Describe BLEU. Explain its core properties and limitations, sketch the formula and provide its explanation.
How does BLEU defeat (score low) hypotheses like "The the the the the." and (separately) "The."?
Why does BLEU perform poorly when evaluating Czech? There are at least two reasons. Provide examples.
What are the problems of (a) (automatic) word alignment and (b) phrase extraction as used in the "Moses pipeline", in general or when used in phrase-based translation?
Suggest 3 different manual MT evaluation techniques and highlight their respective positive and negative aspects.
Describe the loop of weight optimization for the log-linear model as used in phrase-based MT.
Describe MERT, minimum error-rate training. Remember to talk about both the outer loop and inner loop, as well as both situations where "lines" appear in the algorithm. Why is the outer loop needed?
Describe what a "transfer-based" MT architecture means and illustrate the design of the deep-syntactic layer used for Czech-English translation. What are the potential benefits of transferring at this deep-syntactic layer?
What are the problems of transfer-based MT?
Describe the statistical model that is used in TectoMT tree-to-tree transfer. What component of the model serves as a "language model"? What unit does this language model operate with?
Sketch the structure of an encoder-decoder architecture of neural MT. Remember to describe the components in the picture.
What problem does attention in neural MT address? Provide the key idea of the method.
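To give a flavour of what a worked answer can look like, here is a minimal Python sketch related to the BLEU questions above. It is a simplified illustration, not the official BLEU implementation and not a model answer: it assumes a single reference, sentence-level scoring, no smoothing, and input that is already lowercased and tokenized. The function names and the example sentences are made up for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, ref, n):
    """Modified n-gram precision: every hypothesis n-gram count is clipped
    to the number of times that n-gram occurs in the reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:  # hypothesis too short to contain any n-gram of this order
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def brevity_penalty(hyp_len, ref_len):
    """1.0 for hypotheses longer than the reference, exp(1 - r/c) otherwise."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(hyp, ref, max_n=4):
    """Brevity penalty times the geometric mean of clipped precisions (no smoothing)."""
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:  # log(0) is undefined; unsmoothed BLEU is 0 here
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hyp), len(ref)) * math.exp(log_mean)


# Made-up example data (assumption: lowercased, punctuation removed).
reference = "the cat is on the mat".split()
the_spam = "the the the the the".split()
the_short = "the".split()

# Clipping: "the" occurs only twice in the reference, so 2 of 5 tokens count.
print(clipped_precision(the_spam, reference, 1))
# Brevity penalty: a 1-word hypothesis against a 6-word reference.
print(brevity_penalty(len(the_short), len(reference)))
# Restricting to unigrams shows both effects in the final score.
print(bleu(the_spam, reference, max_n=1), bleu(the_short, reference, max_n=1))
```

In this toy setting the repeated-word hypothesis gets a clipped unigram precision of only 2/5, while the one-word hypothesis has perfect unigram precision but is scaled down by a brevity penalty of exp(1 - 6/1). With the default max_n=4 both hypotheses score 0, because neither contains a matching higher-order n-gram.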
All lecture materials for the years 2008-2017 are available in the course SVN:
For read-only access use username: student and password: student