UNDL Foundation, University of Geneva, EPFL
Funded by CADMOS
The main goal of the LACE project (Language Acquisition from Comparable tExts) is to build language modules from data automatically extracted from comparable corpora. The results are expected to be incorporated into the architecture of UNL-based systems as supplementary resources for natural language disambiguation, in both analysis and generation, and to be used to improve the performance of applications in machine translation, summarization, information retrieval and semantic reasoning.

The LACEhpc project is part of this broader initiative. It aims to design and implement efficient high-performance computing (HPC) methods for extracting monolingual and multilingual resources from comparable, non-parallel corpora. It focuses on four tasks: extracting n-grams from monolingual corpora; aligning n-grams in bilingual corpora; building monolingual and multilingual language models; and minimizing and indexing the resulting databases for use in the UNL framework. The proposal covers the adaptation and implementation of existing algorithms; the evaluation, revision and optimization of extraction and alignment methods; and studies of the sustainability of the resulting techniques, particularly their scalability and portability. In addition to HPC-oriented algorithms, the project is expected to deliver several monolingual and bilingual databases, as well as aligned corpora and translation memories, which are important assets for natural language processing and fundamental resources for research in Linguistics and Computational Linguistics.
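
To make the first of these steps concrete, the following is a minimal sketch of n-gram extraction and counting over a small monolingual corpus. It is an illustration only and not the project's actual method: the function names (`extract_ngrams`, `count_ngrams`, `prune`), the whitespace tokenization, the lowercasing and the frequency threshold are all assumptions made for this example, and the code runs in a single process and does not use the HPC techniques the project targets.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

NGram = Tuple[str, ...]


def extract_ngrams(tokens: List[str], n: int) -> Iterator[NGram]:
    """Yield every contiguous n-gram of length n from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[str], max_n: int = 3) -> Counter:
    """Count all n-grams of length 1..max_n across the sentences.

    Assumption: tokens are lowercased and separated by whitespace.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            counts.update(extract_ngrams(tokens, n))
    return counts


def prune(counts: Counter, min_count: int = 2) -> Dict[NGram, int]:
    """Keep only n-grams seen at least min_count times (hypothetical
    stand-in for the database minimization step)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus, invented for this example.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = count_ngrams(corpus, max_n=2)
frequent = prune(counts, min_count=2)
```

On this toy corpus, `frequent` retains only the unigrams `("the",)` and `("cat",)` and the bigram `("the", "cat")`. In a large-scale setting the counting loop would typically be partitioned across workers and the partial counts merged, but how LACEhpc does this is not specified here.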