Data Driven Methods for Improving Mono- and Cross-lingual IR Performance in Noisy Environments
Järvelin, Antti; Talvensaari, Tuomas; Järvelin, Anni (2009)
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/urn:isbn:978-951-44-7929-8
Abstract
In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, or rare proper nouns can be seen as noise when they appear in queries or in the target collection, because they are often out-of-vocabulary (OOV) for the dictionaries used to translate queries. In this paper, three data-driven approaches to these problems are presented. The first two methods, the transformation rule based translation (TRT) method and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be recognized in the target document collection and included in the target query. In the third method, the corpus-based approach, comparable corpora are employed to derive translation knowledge that can be used to translate OOV words. Besides an overview of the methods, three case studies highlighting their practical applications in CLIR are presented. The methods are shown to be effective in OOV word translation (s-grams), in query translation without dictionaries between closely related languages (TRT and s-grams), and in query translation in a special domain (s-grams, TRT and corpus-based methods).
Keywords: Cross-language information retrieval, noise, OOV words, TRT, s-grams, corpus based methods