Balsutalka.lv Speech Corpus (Common Voice 17.0)

A Latvian speech corpus collected during the "Balsu talka" crowdsourcing campaign, in which thousands of people of different ages and nationalities, both in Latvia and in the diaspora, read pre-selected sentences aloud. The data were collected through the Mozilla Common Voice platform.

speech, specialised, morphology

Corpus size	277 hours (1.3M tokens)
Data period	2023–2024
Development period	2024
Developers	Institute of Mathematics and Computer Science UL, Institute of Literature, Folklore and Art UL, Latvian Open Technologies Association
Funding	EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006)
Other publications	R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024.