Benchmarking Multilingual Terminology Translation for UNDRR-ISC Hazard Information Profiles

Staiano Maria Carmen; Monti Johanna; Chiusaroli Francesca; Tykhonov Slava; Hawkins Ken; Jacot des Combes Hélène; Yang Saini; Fra Paleo Urbano; Murra Virginia

2026-01-01

Abstract

This paper reports on the translation of the 281 terms of the UNDRR-ISC Hazard Information Profiles (HIPs) from English into French, Spanish, and Chinese, undertaken in response to a request from UN partners, member states, and the scientific community. We compare three translation setups: (1) gpt-oss-20b, a single open-weight LLM baseline; (2) ChatGPT-5.5, a state-of-the-art proprietary baseline; and (3) a Voter–Arbitrator multi-agent architecture that combines three models through consensus-based arbitration and is served locally via Ollama. Outputs are aligned with a SKOS-based hazard knowledge organisation system, yielding machine-actionable multilingual terminology that is interoperable with semantic disaster risk infrastructures. Evaluation against a human-validated gold standard shows that ChatGPT-5.5 achieves the highest Exact Match accuracy in all three languages. The Voter–Arbitrator system consistently outperforms the open-weight baseline on Exact Match (58.0% vs 45.6% for French) and achieves higher cosine similarity, although human evaluation identifies a higher rate of conceptually incorrect terms for Chinese. The findings suggest that consensus-based multi-agent arbitration offers a reproducible, sovereign, and infrastructure-independent alternative to proprietary systems, with performance gains over single open-weight model inference.
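The abstract does not spell out how the arbitration or the Exact Match metric are implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: three candidate translations are compared after a simple normalisation, a strict majority wins, and an arbitrator callable is consulted otherwise. The function names (`normalise`, `arbitrate`, `exact_match_accuracy`), the normalisation rule, and the example strings are all hypothetical and are not taken from the paper or its data.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def normalise(term: str) -> str:
    """Case-fold and collapse whitespace (an assumed normalisation rule)."""
    return " ".join(term.casefold().split())


def arbitrate(
    candidates: Sequence[str],
    arbitrator: Optional[Callable[[Sequence[str]], str]] = None,
) -> str:
    """Return the strict-majority candidate, else defer to the arbitrator.

    `arbitrator` is a hypothetical stand-in for whatever model or rule
    resolves disagreements; without one, the first candidate is returned.
    """
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    counts = Counter(normalise(c) for c in candidates)
    top_form, votes = counts.most_common(1)[0]
    if votes > len(candidates) / 2:
        # Return the first candidate as originally written by a voter.
        return next(c for c in candidates if normalise(c) == top_form)
    if arbitrator is None:
        return candidates[0]
    return arbitrator(candidates)


def exact_match_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Share of predictions equal to the gold term after normalisation."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalise(p) == normalise(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Illustrative strings only; not drawn from the HIPs gold standard.
chosen = arbitrate(["crue éclair", "Crue  éclair", "inondation soudaine"])
score = exact_match_accuracy(
    ["crue éclair", "séisme"], ["Crue éclair", "tremblement de terre"]
)
```

In this sketch a two-out-of-three agreement is accepted without consulting the arbitrator, which keeps the extra model call for genuine disagreements; whether the published system behaves this way is not stated in the abstract.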

Year

2026

Appears in the following categories:

4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/257900
