Jejuan corpus contribution to the DoReCo (Language Documentation Reference Corpus)

Kim, Soung-U.
2022-01-01

Abstract

The DoReCo database contains corpora of 51 languages from 32 top-level language families (as classified in Glottolog), covering all inhabited continents and all linguistic macro-areas. Most of these data were originally collected in language documentation projects aimed at preserving linguistic practices and traditions. The texts are mostly monological narratives, though some represent conversations and stimulus retellings. Most datasets were extracted from larger collections archived in repositories such as TLA or ELAR. In total, DoReCo contains over 100 hours of recordings with almost half a million transcribed words that are time-aligned at the word and phone levels. The minimum amount of data per language is 35,000 phones (although some datasets fall slightly below that mark), corresponding to more than 10,000 word tokens for isolating languages. The total number of core texts is 893, an average of about 17 texts per language. The number of unique speakers per core dataset ranges from 1 (Kamas, Texistepec Popoluca, Yongning Na) to 30 (Urum). All texts are also translated, mostly into English, but in some cases into Portuguese, German, Russian, Swahili and other languages. For 38 languages, DoReCo provides time-aligned interlinear morpheme glosses. For most of these 38 languages, the DoReCo extended set contains additional texts with interlinear glosses that are not time-aligned. In total, DoReCo provides over 300,000 word tokens of time-aligned interlinear glossed text and another 300,000 word tokens of glossed text without time alignment. Each DoReCo dataset is accompanied by extensive corpus documentation on orthographic conventions, abbreviations used in glosses, and other useful information.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11574/229368