LRK2013

Latvian Speech Recognition Corpus

The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of transcribed audio data and a phonetically annotated corpus containing approximately 4 hours of phonetically transcribed audio data. Metadata files in XML format describe the speakers (age, gender, education, etc.) and record noise levels, speech styles and Latvian language proficiency. The corpus is mainly used for the development of speech recognition software. Only the transcribed text is publicly available; the audio recordings are not.

Citation

Publication

M. Pinnis, I. Auzina, K. Goba
Designing the Latvian speech recognition corpus
Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), 2014

Corpus size	100 hours (1.1M tokens)
Data period	2005–2013
Development period	2013
Developers	Institute of Mathematics and Computer Science UL, Tilde, LETA
Funding	European Regional Development Fund (KC/2.1.2.1.1/10/01/001, project No. 2.9)
Homepage	http://runa.korpuss.lv/
Other publications	I. Auzina, M. Pinnis, R. Dargis. Comparison of rule-based and statistical methods for grapheme to phoneme modelling. Human Language Technologies - The Baltic Perspective, IOS Press, 2014.
	A. Znotins, K. Polis, R. Dargis. Media monitoring system for Latvian radio and TV broadcasts. Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.