ITALERT: Assessing the Quality of LLMs and NMT in Translating Italian Emergency Response Text

Maria Carmen Staiano (Writing – Original Draft Preparation);
Johanna Monti (Supervision);
Francesca Chiusaroli (Member of the Collaboration Group)
2025-01-01

Abstract

This paper presents the outcomes of an initial investigation into the performance of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems in translating high-stakes messages. The research employed a novel bilingual corpus, ITALERT (Italian Emergency Response Text), and applied a human-centric, post-editing-based metric (HOPE) to assess translation quality systematically. The initial dataset contains eleven texts in Italian and their corresponding English translations, both extracted from the national communication campaign website of the Italian Civil Protection Department. The texts deal with eight crisis scenarios: flooding, earthquake, forest fire, volcanic eruption, tsunami, industrial accident, nuclear risk, and dam failure. The dataset has been carefully compiled to ensure usability and clarity for evaluating machine translation (MT) systems in crisis settings. Our findings show that current LLMs and NMT models, such as ChatGPT (OpenAI’s GPT-4o model) and Google MT, face limitations in translating emergency texts, particularly in maintaining the appropriate register, resolving context ambiguities, and managing domain-specific terminology.
2025
English
AA.VV
Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc
Proceedings of Machine Translation Summit XX
contribution
MT Summit 2025
1
566
577
12
978-2-9701897-0-1
https://aclanthology.org/2025.mtsummit-1.43/
European Association for Machine Translation
Anonymous experts
June 2025
Geneva - CH
International
Large Language Models, Neural Machine Translation, Translation quality, Emergency Response Texts, evaluation
4
Staiano, Maria Carmen; Han, Lifeng; Monti, Johanna; Chiusaroli, Francesca
open
273
info:eu-repo/semantics/conferenceObject
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in Conference Proceedings
Files in this item:
File: 2025.mtsummit-1.43 (1).pdf
Access: open access
Type: Post-print document
License: DRM not defined
Size: 249.6 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/248734