LATE Media Speech Corpus

The corpus includes audio recordings of media broadcasts together with their orthographic transcripts. The transcripts follow the orthography of Standard Latvian, including its punctuation conventions.

Citation

Publication

I. Auzina, N. Gruzitis, R. Dargis, G. Rabante-Busa, D. Gosko, J. Vempers, R. Kivkucans, A. Znotins
Recent Latvian Speech Corpora for Linguistic Research and Technology Development
Baltic Journal of Modern Computing, 12(4), 646-658, 2024

Data

I. Auziņa, R. Darģis, K. Levāne-Petrova, A. Auziņa, B. Saulīte, I. Ļaksa-Timinska, E. Gailīte, G. Nešpore-Bērzkalne, G. Rābante-Buša, K. Pokratniece, A. Klints
LATE Media Speech Corpus (LATE-mediji)
CLARIN-LV digital library, 2024
http://hdl.handle.net/20.500.12574/114

speech, specialised, morphology, manually annotated

Corpus size	78 hours (682k tokens)
Data period	2015–2020
Development period	2021–2024
Developers	Institute of Mathematics and Computer Science, University of Latvia
Funding	State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006)
CLARIN	http://hdl.handle.net/20.500.12574/114