Balsutalka.lv Speech Corpus (Common Voice 17.0)
A Latvian speech corpus collected during the crowdsourcing campaign "Balsu talka", in which thousands of people of different ages and nationalities, both in Latvia and in the diaspora, read pre-selected sentences aloud. The data were collected through the Mozilla Common Voice platform.
Corpus size | 277 hours (1.3M tokens) |
Data period | 2023–2024 |
Development period | 2024 |
Developers | Institute of Mathematics and Computer Science UL, Institute of Literature, Folklore and Art UL, Latvian Open Technologies Association |
Funding | EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006) |
Other publications | R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. 2024 |