Institute of Formal and Applied Linguistics

at Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Year 2008
Type PhD dissertation
Status published
Language English
Author(s) Bojar, Ondřej
Title Exploiting Linguistic Data in Machine Translation
Czech title Využití lingvistických dat ve strojovém překladu
Publisher's city and country Prague, Czech Republic
Total book pages 135
Month September
Supported by 2006-2009 FP6-IST-5-034291-STP (Euromatrix); 2005-2010 MSM 0021620838 (Moderní metody, struktury a systémy informatiky); 2005-2009 LC536 (Centrum komputační lingvistiky); 2006-2008 GA405/06/0589 (Tektogramatický popis jazyka pro rozpoznávání mluvené řeči a strojový překlad); 2005-2008 GD201/05/H014 (Collegium Informaticum); 2005-2009 1ET201120505 (Od jazyka ke znalostem a sémantickému webu); 2005-2006 GAUK 351/2005
Czech abstract Doktorská dizertační práce studuje vzájemný vztah mezi lingvistickými teoriemi, daty a aplikacemi. Zaměřujeme se na jednu konkrétní teorii, Funkční generativní popis, jeden konkrétní typ dat, totiž valenční slovníky, a jednu konkrétní aplikaci: strojový překlad z angličtiny do češtiny.
English abstract This thesis explores the mutual relationship between linguistic theories, data and applications. We focus on one particular theory, Functional Generative Description (FGD), one particular type of linguistic data, namely valency dictionaries, and one particular application: machine translation (MT) from English to Czech.
First, we examine methods for automatic extraction of verb valency dictionaries based on corpus data. We propose an automatic metric for estimating how much lexicographers' labour was saved and evaluate various frame extraction techniques using this metric.
Second, we design and implement an MT system with transfer at various layers of language description, as defined in the framework of FGD. We primarily focus on the tectogrammatical (deep syntactic) layer.
Third, we leave the framework of FGD and experiment with a rather direct, "phrase-based" MT system. By comparing various setups of the system and specifically treating target-side morphological coherence, we are able to significantly improve MT quality and outperform a commercial MT system within a pre-defined text domain.
The concluding chapter provides a broader perspective on the utility of lexicons in various applications, highlighting the successful features. Finally, we summarize the contribution of the thesis.
Specialization linguistics ("jazykověda")
Confidentiality default – not confidential
Open access no
Defense slides public 2008-FILE-bojar_phd-FINAL-defense-slides.pdf application/pdf
Summary PDF public 2008-FILE-bojar_phd-FINAL-summary.pdf application/pdf
Thesis PDF public 2008-FILE-bojar_phd-FINAL.pdf application/pdf
