
Balsutalka.lv Speech Corpus (Common Voice 17.0)

A Latvian speech corpus collected during the "Balsu talka" crowdsourcing campaign, in which pre-selected sentences were recorded by thousands of people of different ages and nationalities, both in Latvia and in the diaspora. The data were collected through the Mozilla Common Voice platform.

Corpus size: 277 hours (1.3M tokens)
Data period: 2023–2024
Development period: 2024
Developers: Institute of Mathematics and Computer Science UL; Institute of Literature, Folklore and Art UL; Latvian Open Technologies Association
Funding: EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006)
Other publications
R. Dargis, A. Znotins, I. Auzina, B. Saulite, S. Reinsone, R. Dejus, A. Klavinska, N. Gruzitis. BalsuTalka.lv – Boosting the Common Voice Corpus for Low-Resource Languages. 2024.