Assessing the Quality of an Italian Crowdsourced Idiom Corpus: the Dodiom Experiment

G. Morza;R. Manna;J. Monti
Supervision
2022-01-01

Abstract

This paper focuses on the evaluation, by Italian linguists, of idiom examples collected and annotated through Dodiom, a Game With a Purpose (GWAP) environment. The paper gives an overview of the Dodiom project and of the data collection carried out through the contribution of the crowd, and then describes the annotation criteria used by the experts to estimate the quality of the collected data. The main aim of the paper is to evaluate the quality of linguistic data obtained through crowdsourcing, namely to assess whether the data provided by the players who joined the game are suitable and useful for research and teaching purposes. The task concerns the development of a collection of idioms, a specific type of multiword expression that is usually hard to find in corpora and that contains words which may also be used in their literal meanings within a sentence. This is particularly important because these data may be used both for training and for evaluating NLP applications. Finally, results and future work are presented.
2022
979-10-95546-72-6
Files in this item:
File: 2022.lrec-1.446.pdf
Access: open access
Type: Post-print document
License: PUBLIC - Public with Copyright
Size: 338.36 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/219620