This article summarizes ten years of shared tasks in machine translation and in the evaluation of its quality: WMT.
The WMT evaluation campaign (http://www.statmt.org/wmt16) has been run annually since 2006. It is a collection of shared
tasks related to machine translation, in which researchers compare their techniques against those of others in the field. The longest
running task in the campaign is the translation task, where participants translate a common test set with their MT systems. In addition
to the translation task, we have also included shared tasks on evaluation: both on automatic metrics (since 2008), which score the
MT system output against a reference translation, and on quality estimation (since 2012), where system output is evaluated without a reference. An
important component of WMT has always been the manual evaluation, wherein human annotators are used to produce the official ranking
of the systems in each translation task. This reflects the belief of the WMT organizers that human judgement should be the ultimate arbiter
of MT quality. Over the years, we have experimented with different methods of improving the reliability, efficiency and discriminatory
power of these judgements. In this paper we report on our experiences in running this evaluation campaign, the current state of the art in
MT evaluation (both human and automatic), and our plans for future editions of WMT.
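Reference-based automatic metrics, as described above, score a system translation against a human reference. As a toy illustration of the idea (not any specific WMT metric such as BLEU, and not part of the campaign's actual tooling), the following sketch computes clipped unigram precision between a hypothesis and a reference:

```python
from collections import Counter

def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens also found in the reference,
    with counts clipped so repeated tokens are not over-credited.
    A simplified stand-in for n-gram metrics like BLEU."""
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hyp_tokens)
    # Clip each token's count by its count in the reference.
    matches = sum(min(cnt, ref_counts.get(tok, 0))
                  for tok, cnt in hyp_counts.items())
    return matches / len(hyp_tokens)

print(unigram_precision("the cat sat on the mat",
                        "the cat is on the mat"))  # 5 of 6 tokens match
```

Real metrics used in the WMT metrics task add n-gram matching, brevity penalties, or learned components, but the core idea of comparing output to a reference is the same.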
Ondřej Bojar; Aljoscha Burchardt; Christian Dugast; Marcello Federico; Josef van Genabith; Barry Haddow; Jan Hajič; Kim Harris; Philipp Koehn; Matteo Negri; Martin Popel; Georg Rehm; Lucia Specia; Marco Turchi; Hans Uszkoreit