Bolsutolka.lv Speech Corpus (Common Voice 19.0)
The speech corpus consists of Latgalian sentences read aloud by speakers of different Latgalian dialects. The Mozilla Common Voice platform is used for data collection. Part-of-speech tagging and lemmatization in this Latgalian corpus have been done manually.
Corpus size | 29 hours (160k tokens) |
Data period | 2023–2024 |
Development period | 2024 |
Developers | Rezekne Academy of Technologies, Institute of Mathematics and Computer Science UL, Institute of Literature, Folklore and Art UL, Latvian Open Technologies Association |
Funding | EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Digital Humanities" (VPP-IZM-DH-2022/1-0002) |
Other publications | R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. 2024 |