LRK2013
Latvian Speech Recognition Corpus
The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of transcribed audio data and a phonetically annotated corpus containing approximately 4 hours of transcribed audio data. Metadata files in XML format provide additional details about the speakers (age, gender, education, Latvian language proficiency, etc.) and about the recordings (noise levels, speech styles). The corpus is mainly used for the development of speech recognition software. Only the transcribed text is publicly available; the audio recordings are not.
Citation
Publication
M. Pinnis, I. Auzina, K. Goba. Designing the Latvian speech recognition corpus. Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), 2014.
| Corpus size | 100 hours (1.1M tokens) |
| Data period | 2005–2013 |
| Development period | 2013 |
| Developers | Institute of Mathematics and Computer Science UL, Tilde, LETA |
| Funding | European Regional Development Fund (KC/2.1.2.1.1/10/01/001, project No. 2.9) |
| Homepage | http://runa.korpuss.lv/ |
| Other publications | I. Auzina, M. Pinnis, R. Dargis. Comparison of rule-based and statistical methods for grapheme to phoneme modelling. Human Language Technologies - The Baltic Perspective, IOS Press, 2014; A. Znotins, K. Polis, R. Dargis. Media monitoring system for Latvian radio and TV broadcasts. Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015 |