Identifying Multi-Word Expressions from Parallel Corpora with Kernel Methods and Crowdsourcing

Johanna Monti; Federico Sangati
2014-01-01

Abstract

We propose a new methodology for identifying multi-word expressions (MWEs) in parallel multilingual corpora. Our approach is inspired by one of the most significant properties characterizing the majority of MWEs, known as non-translatability: an MWE cannot be translated from one language into another word by word (Sag et al., 2002; Monti, 2012). The methodology envisions a three-stage process. The first phase uses automatic kernel methods to identify candidate pairs of expressions; this step has high recall and low precision. In the second phase, a word-by-word automatic translation system filters out those candidate pairs that are literal translations (and therefore not MWEs). In the third phase, a crowdsourcing system is used to further validate the final list of candidates.
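
As a rough illustration of the second phase only, the sketch below discards candidate pairs whose source words can all be matched word by word in the target expression. It is a minimal sketch under stated assumptions, not the system the abstract refers to: the toy English-Italian lexicon, the example pairs, and the names `LEXICON`, `is_literal_translation`, and `filter_candidates` are all hypothetical, and a simple dictionary lookup stands in for a real word-by-word translation system.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy English -> Italian lexicon (assumption, for illustration only).
LEXICON: Dict[str, Set[str]] = {
    "kick": {"calciare"},
    "the": {"il", "lo", "la", "le", "i", "gli"},
    "bucket": {"secchio"},
    "red": {"rosso", "rossa"},
    "car": {"macchina", "auto"},
}


def is_literal_translation(source: str, target: str,
                           lexicon: Dict[str, Set[str]]) -> bool:
    """Return True if every source word has a lexicon translation
    that appears among the target words (word order is ignored)."""
    target_words = set(target.lower().split())
    source_words = source.lower().split()
    return all(lexicon.get(word, set()) & target_words for word in source_words)


def filter_candidates(pairs: List[Tuple[str, str]],
                      lexicon: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs that are NOT literal translations,
    i.e. the pairs that remain plausible MWE candidates."""
    return [(s, t) for s, t in pairs if not is_literal_translation(s, t, lexicon)]


# Hypothetical candidate pairs, as a high-recall first phase might produce them.
candidates = [
    ("kick the bucket", "tirare le cuoia"),  # idiomatic, no word-by-word match
    ("red car", "macchina rossa"),           # literal translation
]
survivors = filter_candidates(candidates, LEXICON)
```

In this toy run, the literal pair is dropped and the idiomatic pair is passed on to the validation phase. A realistic filter would also need lemmatization and a much larger lexicon, since words missing from the lexicon are treated here as untranslatable.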
Files in this record:

WG1-WG3-Sangati_Cranenburgh_Monti_Slides.pdf
Access: open access
Type: Other attached material
License: Creative Commons
Size: 4.58 MB
Format: Adobe PDF

WG1-WG3-SANGATI-vanCRANENBURG-MONTI-abstract.pdf
Access: open access
Type: Abstract
License: Creative Commons
Size: 168.14 kB
Format: Adobe PDF

WG1-WG3-Sangati-vanCranenburgh-Monti-poster.pdf
Access: open access
Type: Post-print document
License: Creative Commons
Size: 722.36 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/170144