Corpora with tag "speech" (9)

LATE-sarunas

LATE-conversational

2012–2024, 35 hours (347k tokens)
Developers: IMCS UL, ILFA UL

BalsuTalka

Balsutalka.lv Speech Corpus (Common Voice 17.0)

2023–2024, 277 hours (1.3M tokens)
Developers: IMCS UL, ILFA UL, LATA

BolsuTolka

Bolsutolka.lv Speech Corpus (Common Voice 17.0)

2023–2024, 24 hours (130k tokens)
Developers: RATA, IMCS UL, ILFA UL, LATA

LAMBA

Annotated Longitudinal Latvian Children's Speech Corpus

2015–2017, 34 hours
Developers: IMCS UL

LaRKo

Latvian Speech Corpus

2005–2014, 8 hours
Developers: IMCS UL

LATE-mediji

LATE-media

2015–2020, 50 hours (433k tokens)
Developers: IMCS UL

LRK2013

Latvian Speech Recognition Corpus

2005–2013, 100 hours (1.1M tokens)
Developers: IMCS UL, Tilde, LETA

LVMED

Latvian Radiology Speech Corpus

2010–2022, 35 hours (157k tokens)
Developers: IMCS UL, REUH

Subtitri

Latvian Subtitles of Public Broadcasting

2015–2020, 1200 hours (10.8M tokens)
Developers: IMCS UL
B. Saulīte, R. Darģis, N. Grūzītis, I. Auziņa, K. Levāne-Petrova, L. Pretkalniņa, L. Rituma, P. Paikens, A. Znotiņš, L. Strankale, K. Pokratniece, I. Poikāns, G. Bārzdiņš, I. Skadiņa, A. Baklāne, V. Saulespurēns, J. Ziediņš.
Latvian National Corpora Collection – Korpuss.lv
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022, pp. 5123–5129