Identifying Multi-Word Expressions from Parallel Corpora with Kernel Methods and Crowdsourcing
Johanna Monti; Federico Sangati
2014-01-01
Abstract
We propose a new methodology for identifying multi-word expressions (MWEs) in parallel multilingual corpora. The approach builds on one of the most significant properties shared by the majority of MWEs, known as non-translatability: an MWE cannot be translated from one language to another on a word-by-word basis (Sag et al., 2002; Monti, 2012). The methodology envisions a three-stage process. The first phase uses automatic kernel methods to identify candidate pairs of expressions, a step with high recall and low precision. In the second phase, a word-by-word automatic translation system filters out the candidate pairs that are literal translations (and therefore not MWEs). In the third phase, a crowdsourcing system further validates the final list of candidates.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
WG1-WG3-Sangati_Cranenburgh_Monti_Slides.pdf | Open access | Other attached material | Creative Commons | 4.58 MB | Adobe PDF
WG1-WG3-SANGATI-vanCRANENBURG-MONTI-abstract.pdf | Open access | Abstract | Creative Commons | 168.14 kB | Adobe PDF
WG1-WG3-Sangati-vanCranenburgh-Monti-poster.pdf | Open access | Post-print document | Creative Commons | 722.36 kB | Adobe PDF
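The second phase described in the abstract (discarding candidate pairs that are literal translations) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy English-to-Italian lexicon `TOY_LEXICON`, the function names `is_literal_translation` and `filter_candidates`, and the simple order-insensitive, greedy matching criterion. The abstract itself only states that a word-by-word automatic translation system is used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (English -> possible Italian translations).
# It stands in for the word-by-word translation system named in the abstract.
TOY_LEXICON: Dict[str, Set[str]] = {
    "kick": {"calciare"},
    "the": {"il", "la", "lo"},
    "bucket": {"secchio"},
    "red": {"rosso", "rossa"},
    "house": {"casa"},
}


def is_literal_translation(source: str, target: str,
                           lexicon: Dict[str, Set[str]]) -> bool:
    """Return True if every source word maps to a distinct target word.

    Assumption: word order is ignored and matching is greedy, which is
    enough for a sketch but not a full bipartite matching.
    """
    source_words = source.lower().split()
    remaining = target.lower().split()
    if len(source_words) != len(remaining):
        return False
    for word in source_words:
        translations = lexicon.get(word, set())
        match = next((t for t in remaining if t in translations), None)
        if match is None:
            return False
        remaining.remove(match)
    return True


def filter_candidates(pairs: List[Tuple[str, str]],
                      lexicon: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs that are NOT word-by-word translations."""
    return [(src, tgt) for src, tgt in pairs
            if not is_literal_translation(src, tgt, lexicon)]


candidates = [
    ("kick the bucket", "tirare le cuoia"),  # idiomatic, not literal
    ("the red house", "la casa rossa"),      # literal, only reordered
]
surviving = filter_candidates(candidates, TOY_LEXICON)
print(surviving)
```

In this sketch the pair that survives the filter would be passed on to the third, crowdsourced validation phase; a realistic system would replace the toy lexicon with an actual translation resource and would likely need a more tolerant matching criterion.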
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.