Corpora tagged "general" (11)

LVK2022

The Balanced Corpus of Modern Latvian

2000–2021, 101M words (123M tokens)
Developers: IMCS UL

MuLa2022

Corpus of Contemporary Latgalian Texts 2022

1988–2021, 2M words (2.8M tokens)
Developers: RAT, IMCS UL

LVTB

Latvian Treebank

1991–2020, 18295 sentences (300K tokens) (v2.13)
Developers: IMCS UL

FullStack-LV

Full Stack of Latvian Language Resources

1991–2018, 13691 sentences
Developers: IMCS UL

LiLa

Lithuanian-Latvian-Lithuanian Parallel Text Corpus

1982–2012, 8M words
Developers: IMCS UL, VMU

LRK2013

Latvian Speech Recognition Corpus

2005–2013, 100 hours (1.1M tokens)
Developers: IMCS UL, Tilde, LETA

LVK2018

The Balanced Corpus of Modern Latvian

1991–2018, 10M words (12M tokens)
Developers: IMCS UL

MuLa2012

Corpus of Contemporary Latgalian Texts 2012

1988–2012, 1M words (1.3M tokens)
Developers: IMCS UL, RAT

Tīmeklis2007

Latvian Web Corpus 2007

1991–2005, 99M words (123M tokens)
Developers: IMCS UL

Tīmeklis2020

CommonCrawl of Latvian 2020

2013–2022, 403.6M words (492.6M tokens)
Developers: IMCS UL

UDLV-LVTB

Latvian UD Treebank

1991–2020, 18295 sentences (300K tokens) (v2.13)
Developers: IMCS UL
B. Saulīte, R. Darģis, N. Grūzītis, I. Auziņa, K. Levāne-Petrova, L. Pretkalniņa, L. Rituma, P. Paikens, A. Znotiņš, L. Strankale, K. Pokratniece, I. Poikāns, G. Bārzdiņš, I. Skadiņa, A. Baklāne, V. Saulespurēns, J. Ziediņš.
Latvian National Corpora Collection – Korpuss.lv
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022, pp. 5123–5129