Bolsutolka.lv Speech Corpus (Common Voice 17.0)
The speech corpus consists of Latgalian sentences read aloud by speakers of various Latgalian dialects. The data were collected through the Mozilla Common Voice platform. Part-of-speech tagging and lemmatization were done manually.
Corpus size | 24 hours (130k tokens) |
Data period | 2023–2024 |
Development period | 2024 |
Developers | Rezekne Academy of Technologies, Institute of Mathematics and Computer Science UL, Institute of Literature, Folklore and Art UL, Latvian Open Technologies Association |
Funding | EU Recovery and Resilience Facility "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002); State Research Programme "Digital Humanities" (VPP-IZM-DH-2022/1-0002) |