BalsuTalka
Speech Corpus (Common Voice 14.0)

A Latvian speech corpus collected during the crowdsourcing campaign "Balsu talka", in which thousands of people of different ages and nationalities, both in Latvia and in the diaspora, read pre-selected sentences aloud. The data were collected on the Mozilla Common Voice platform.

Corpus size: 136 hours (817k tokens)
Development period: 2023
Developers: Institute of Mathematics and Computer Science UL; Institute of Literature, Folklore and Art UL; Latvian Open Technologies Association
Funding: EU Recovery and Resilience Facility, "Language Technology Initiative"; State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006)