
Bolsutolka.lv Speech Corpus (Common Voice 19.0)

The speech corpus consists of sentences in Latgalian, read aloud by various speakers of Latgalian dialects. The data were collected through the Mozilla Common Voice platform. Part-of-speech tagging and lemmatization were done manually.

Corpus size: 29 hours (160k tokens)
Data period: 2023–2024
Development period: 2024
Developers: Rezekne Academy of Technologies, Institute of Mathematics and Computer Science UL, Institute of Literature, Folklore and Art UL, Latvian Open Technologies Association
Funding: EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Digital Humanities" (VPP-IZM-DH-2022/1-0002)
Other publications
R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. 2024.