IR-832: (2011) Krstovski, K. and Smith, D., "A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs," Proceedings of the 6th Workshop on Statistical Machine Translation, EMNLP 2011, Edinburgh, UK, July 30-31, 2011.
We describe an approach for generating a ranked list of candidate document translation pairs without the use of a bilingual dictionary or machine translation system. We developed this approach as an initial filtering step in a process of generating parallel document collections from large, multilingual (but non-parallel) corpora. Our approach represents bilingual documents in a vector space whose basis vectors are the overlapping tokens found in both languages of the collection. Using this representation, weighted by tf-idf, we compute cosine document similarity to create a ranked list of candidate document translation pairs. Unlike a cross-language information retrieval task, where a ranked list in the target language is evaluated for each source query, we are interested in, and evaluate, the more difficult task of finding translated document pairs. We first perform a feasibility study of our approach on parallel collections in multiple languages, representing multiple language families and scripts. The approach is then applied to a large bilingual collection of around 800k books. To avoid the computational cost of O(n^2) document pair comparisons, we employ a locality-sensitive hashing (LSH) approximation algorithm for cosine similarity, which reduces the time complexity to O(n log n).
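
As a rough illustration of the pipeline the abstract describes, the following Python sketch restricts the vocabulary to tokens seen on both language sides, weights them by tf-idf, and ranks all source/target pairs by cosine similarity. It is a minimal sketch, not the authors' implementation: the function names, the toy documents, the `+ 1.0` idf smoothing, and the choice of random-hyperplane signatures as the LSH scheme are all assumptions made for the example. The `lsh_signature` function only shows how a cosine-preserving bit signature could be computed; the search over signatures that would avoid the O(n^2) comparison is not shown.

```python
import math
import random
from collections import Counter


def overlap_vocabulary(src_docs, tgt_docs):
    """Tokens that occur on both language sides of the collection."""
    src_vocab = {tok for doc in src_docs for tok in doc}
    tgt_vocab = {tok for doc in tgt_docs for tok in doc}
    return src_vocab & tgt_vocab


def tfidf_vectors(docs, idf):
    """Sparse tf-idf vectors (dicts) over the tokens that have an idf entry."""
    vectors = []
    for doc in docs:
        counts = Counter(tok for tok in doc if tok in idf)
        vectors.append({tok: tf * idf[tok] for tok, tf in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[tok] for tok, w in u.items() if tok in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_vectors(src_docs, tgt_docs):
    """Build tf-idf vectors for both sides over the overlapping vocabulary."""
    vocab = overlap_vocabulary(src_docs, tgt_docs)
    all_docs = src_docs + tgt_docs
    df = Counter(tok for doc in all_docs for tok in set(doc) if tok in vocab)
    # Assumption: smoothed idf so tokens present in every document keep weight.
    idf = {tok: math.log(len(all_docs) / df[tok]) + 1.0 for tok in vocab}
    return tfidf_vectors(src_docs, idf), tfidf_vectors(tgt_docs, idf)


def rank_pairs(src_vecs, tgt_vecs):
    """Exhaustive O(n^2) ranking of (score, source index, target index)."""
    pairs = [(cosine(s, t), i, j)
             for i, s in enumerate(src_vecs)
             for j, t in enumerate(tgt_vecs)]
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))


def lsh_signature(vec, n_bits=16, seed=0):
    """Random-hyperplane signature: one sign bit per random projection.

    Each (bit, token) pair gets a reproducible Gaussian weight, so the
    hyperplanes never need to be stored. Vectors at a small angle agree
    on most bits.
    """
    bits = []
    for b in range(n_bits):
        proj = 0.0
        for tok, w in vec.items():
            rng = random.Random(f"{seed}:{b}:{tok}")
            proj += w * rng.gauss(0.0, 1.0)
        bits.append(1 if proj >= 0.0 else 0)
    return tuple(bits)


# Hypothetical toy collection: names and numbers survive translation.
src_docs = [
    ["the", "treaty", "signed", "in", "paris", "1951", "by", "schuman"],
    ["report", "on", "budget", "2009", "brussels", "eur", "142"],
]
tgt_docs = [
    ["rapport", "sur", "le", "budget", "2009", "bruxelles", "eur", "142"],
    ["le", "traité", "signé", "à", "paris", "1951", "par", "schuman"],
]

src_vecs, tgt_vecs = build_vectors(src_docs, tgt_docs)
ranked = rank_pairs(src_vecs, tgt_vecs)
for score, i, j in ranked:
    print(f"source {i} <-> target {j}: {score:.3f}")
```

In this toy collection the only shared tokens are proper names and numbers, which is the kind of overlap the representation relies on. At the scale the abstract mentions, the exhaustive `rank_pairs` loop would be replaced by a search over the bit signatures.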