LRK2013

Latvian Speech Recognition Corpus

The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of transcribed audio data and a phonetically annotated corpus containing approximately 4 hours of transcribed audio data. Metadata files in XML format provide additional details about the speakers (age, gender, education, etc.) as well as noise levels, speech styles and Latvian language proficiency. The corpus is mainly used for the development of speech recognition software. The audio recordings are not publicly available; only the transcribed text is.

Citation
Publication
M. Pinnis, I. Auzina, K. Goba
Designing the Latvian speech recognition corpus
2014
Corpus size: 100 hours (1.1M tokens)
Data period: 2005–2013
Development period: 2013
Developers: Institute of Mathematics and Computer Science UL, Tilde, LETA
Funding: European Regional Development Fund (KC/2.1.2.1.1/10/01/001, project No. 2.9)
Homepage: http://runa.korpuss.lv/
Other publications
I. Auzina, M. Pinnis, R. Dargis
Comparison of rule-based and statistical methods for grapheme to phoneme modelling
IOS Press, 2014
A. Znotins, K. Polis, R. Dargis
Media monitoring system for Latvian radio and TV broadcasts
2015