ACCURAT is a collaborative project funded under the FP7-ICT-2009-4 call, action ICT-2009.2.2: Language-based interaction, Grant agreement no. 248347.
Project summary
The aim of the ACCURAT project is to research methods and techniques to overcome one of the central problems of machine translation (MT): the lack of linguistic resources for under-resourced languages and narrow domains. The main goal is to find, analyze and evaluate novel methods that exploit comparable corpora in order to compensate for this shortage of linguistic resources, and ultimately to significantly improve MT quality for under-resourced languages and narrow domains.
The applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data. For this reason, the translation quality of current data-driven MT systems varies dramatically, from quite good for language pairs with large corpora available (e.g. English and French) to almost unusable for under-resourced languages and domains (e.g. Latvian and Croatian). The ultimate goal of ACCURAT is therefore to achieve a significant increase in translation quality for under-resourced languages and narrow domains.
The key innovation of ACCURAT will be the creation of a methodology and tools to measure, find and use comparable corpora to improve the quality of MT for under-resourced languages and domains. The ACCURAT project will thus make significant contributions not only to the theory of MT, but also to corpus linguistics, information extraction and natural language processing in general, and will strongly advance the theoretical foundations and methodology of research in corpus linguistics.
Scientific objectives
The project will use state-of-the-art statistical MT (SMT) and rule-based MT (RBMT) systems as a baseline and will provide novel methods to achieve much better results by extending these systems through the use of comparable corpora. Initial research demonstrates promising results from the use of comparable corpora in SMT (Munteanu and Marcu, 2005; see also chapter on the state-of-the-art below) and RBMT (Thurmair, 2006), which makes the ACCURAT consortium confident of the feasibility of the proposed approach.
Technological objectives
The ACCURAT project will investigate two broad use cases where the scarcity of linguistic resources poses a major challenge: adapting machine translation to under-resourced languages and to narrow domains.
The ACCURAT project will provide researchers and developers with a methodology and a fully functional model for exploiting comparable corpora in MT. This includes corpus acquisition from the Web and other sources, analysis and metrics of comparability, multi-level alignment and extraction of lexical data, and techniques for applying aligned text and extracted lexical data to increase the translation quality of existing SMT and RBMT systems.
ACCURAT will provide an optimal approach to achieving quality machine translation for a number of new EU official languages and languages of associated countries, as well as novel approaches for adapting existing MT technologies to specific narrow domains, significantly increasing the language and domain coverage of automated translation.
ACCURAT will make its novel methodology for under-resourced areas of MT openly accessible. This covers comparability metrics; methods and techniques of alignment for comparable corpora; methods and techniques of information extraction from aligned comparable corpora at different levels (document, paragraph, phrase / word); methods and techniques of collecting comparable corpora from the Web; and collections of comparable corpora for the project languages.
2010-03-04