LRK2013
Latvian Speech Recognition Corpus
The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of transcribed audio data and a phonetically annotated corpus containing approximately 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers (age, gender, education, etc.), as well as noise levels, speech styles and Latvian language proficiency. The corpus is mainly used for the development of speech recognition software. The audio recordings are not publicly available; only the transcribed text is.
Citation
Publication
Corpus size | 100 hours (1.1M tokens) |
Data period | 2005–2013 |
Development period | 2013 |
Developers | Institute of Mathematics and Computer Science UL, Tilde, LETA |
Funding | European Regional Development Fund (KC/2.1.2.1.1/10/01/001, project No. 2.9) |
Homepage | http://runa.korpuss.lv/ |
Other publications | I. Auzina, M. Pinnis, R. Dargis. Comparison of rule-based and statistical methods for grapheme to phoneme modelling. IOS Press, 2014; A. Znotins, K. Polis, R. Dargis. Media monitoring system for Latvian radio and TV broadcasts. 2015 |