A Knowledge-Based CLIR Model for Specific Domain Collections

Monti, Johanna; Monteleone, Mario; Di Buono, Maria Pia

Cross-Language Information Retrieval (CLIR) applications, which aim to give access to information on the web in many languages, are attracting several important players in the Information Retrieval (IR) field, such as Google and Microsoft. In CLIR applications, information is usually searched by means of a query expressed in the user’s first language. This query is automatically translated into the desired foreign language, and the results are translated back into the user’s first language. The process therefore relies on two distinct translation stages: query translation and document translation. Query translation is the translation into the desired foreign language of the query expressed in the user’s first language, whereas document translation is the back translation into the user’s language of the relevant documents found by means of the translated query. Translation is usually based on bilingual or multilingual Machine Readable Dictionaries (MRDs), Machine Translation (MT) and parallel corpora. CLIR success clearly depends on the quality of translation, so inaccurate or incorrect translations can cause serious problems in retrieving relevant information. A very frequent source of mistranslation in domain-specific texts is multiword units (MWUs), and particularly terminological word compounds. MWUs cover a wide range of lexical constructions composed of two or more words with an opaque meaning, i.e. the meaning of a unit is not always the sum of the meanings of the single words that are part of it. MWUs are not always easy to identify, since co-occurrence among the lexemes forming the units can vary a great deal. In domain-specific texts, compound terms, primarily noun compounds, are very frequent. In all languages there is indeed a close relationship between terminology and multiword units and, in particular, word compounds.
In fact, word compounds account in some cases for 90% of the terms belonging to a domain-specific language. Unlike generic simple words, terminological word compounds are mono-referential, i.e. they are unambiguous and refer to only one specific concept in a given special language, even though they might occur in more than one domain. Their meaning, as with all compound words, cannot be directly inferred by a non-expert from the individual parts of the compound, because it depends on the specific area and the concept it refers to. CLIR applications are typically used with domain-specific collections, such as Europeana Connect, which aims to facilitate multilingual access to Europeana.eu, a web portal that acts as an interface to millions of books, paintings, films, museum objects and archival records that have been digitized throughout Europe, regardless of the users’ native language. In Europeana Connect, users can submit queries in their native language, retrieve documents in different languages, and acquire information about objects from several sources across all European countries. The retrieved information is translated back into the user’s language by means of MT. Typical Europeana item descriptions contain many compound terms and, as shown in Monti (2013), the translations produced by MT are full of mistranslations. Processing and translating these forms of compound words is not a straightforward task, since their morpho-syntactic and linguistic behavior is quite complex and varies according to their type, and their translations are practically unpredictable. Our contribution outlines the knowledge-based resources (dictionary, ontology and rules) developed by means of NooJ and used in the experimentation of a knowledge-based CLIR system designed to process and translate MWUs properly.
The experiment has been set up for the Italian/English language pair and could easily be extended to other language pairs.
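
To make the MWU problem in query translation concrete, the following minimal Python sketch contrasts word-by-word dictionary lookup with a lookup that first tries the longest matching multiword entry. It is only an illustration under stated assumptions: the dictionary entries, the function names and the whitespace tokenization are hypothetical, and it does not reproduce the NooJ dictionaries, ontology or rules used in the experiment.

```python
from typing import Dict, List, Tuple

# Hypothetical Italian -> English entries, invented for illustration only.
MWU_DICT: Dict[Tuple[str, ...], str] = {
    ("natura", "morta"): "still life",
    ("olio", "su", "tela"): "oil on canvas",
}
WORD_DICT: Dict[str, str] = {
    "natura": "nature",
    "morta": "dead",
    "olio": "oil",
    "su": "on",
    "tela": "canvas",
    "con": "with",
    "fiori": "flowers",
}


def translate_word_by_word(query: str) -> List[str]:
    """Translate each token independently; unknown tokens are kept as-is."""
    return [WORD_DICT.get(tok, tok) for tok in query.lower().split()]


def translate_with_mwus(query: str) -> List[str]:
    """Translate a query, preferring the longest multiword entry at each position."""
    tokens = query.lower().split()
    max_len = max(len(key) for key in MWU_DICT)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try candidate spans from the longest possible down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if span in MWU_DICT:
                out.append(MWU_DICT[span])
                i += n
                break
        else:
            # No multiword entry starts here: fall back to single-word lookup.
            out.append(WORD_DICT.get(tokens[i], tokens[i]))
            i += 1
    return out


if __name__ == "__main__":
    query = "natura morta con fiori"
    print(translate_word_by_word(query))
    print(translate_with_mwus(query))
```

With these toy entries, the word-by-word function renders "natura morta" as two unrelated words, while the MWU-aware function returns the single term "still life". A real system would also need to handle inflection, discontinuous units and domain disambiguation, which this sketch leaves out.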