Balsutalka.lv Speech Corpus (Common Voice 17.0)
A Latvian speech corpus collected during the crowdsourcing activity "Balsu talka", in which thousands of people of different ages and nationalities, both in Latvia and in the diaspora, read pre-selected sentences aloud. The data were collected on the Mozilla Common Voice platform.
| Corpus size | 277 hours (1.3M tokens) |
| Data period | 2023–2024 |
| Development period | 2024 |
| Developers | Institute of Mathematics and Computer Science, University of Latvia; Institute of Literature, Folklore and Art, University of Latvia; Latvian Open Technologies Association |
| Funding | EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006) |
| Other publications | R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. 2024. |