LVK2022

The Balanced Corpus of Modern Latvian

2000–2021, 101M words (123M tokens)
Developers: IMCS UL

LATE-sarunas

LATE-conversational

2012–2024, 44 hours (429k tokens)
Developers: IMCS UL, ILFA UL

MuLa2022

Corpus of Contemporary Latgalian Texts 2022

1988–2021, 2M words (2.8M tokens)
Developers: RAT, IMCS UL

LVTB

Latvian Treebank

1991–2023, 19367 sentences (328k tokens) (v2.15)
Developers: IMCS UL

BalsuTalka

Balsutalka.lv Speech Corpus (Common Voice 17.0)

2023–2024, 277 hours (1.3M tokens)
Developers: IMCS UL, ILFA UL, LATA

Barometrs

Corpus of News Portal Comments

2011–2022, 26M comments (642M tokens)
Developers: RSU, IMCS UL

BolsuTolka

Bolsutolka.lv Speech Corpus (Common Voice 19.0)

2023–2024, 29 hours (160k tokens)
Developers: RATA, IMCS UL, ILFA UL, LATA

Cīņa

"Cīņa"

1904–1991, 185M words (231M tokens)
Developers: NLL

Disertācijas

Corpus of Latvian PhD Theses

1993–2020, 16.7M words (23.4M tokens)
Developers: IMCS UL

Emuāri

Latvian Blog Corpus 2015

2001–2015, 6.6M words (8M tokens)
Developers: IMCS UL

fonLATE

LATE Phonetically Annotated Speech Corpus

2012–2024, 4 hours (48k tokens)
Developers: IMCS UL

FullStack-LV

Full Stack of Latvian Language Resources

1991–2018, 13691 sentences
Developers: IMCS UL

Jaunatne

"Padomju Jaunatne"

1944–1989, 138M words (176M tokens)
Developers: NLL

Karogs

"Karogs"

1940–1994, 48.7M words (62.1M tokens)
Developers: NLL

LAMBA

Annotated Longitudinal Latvian Children's Speech Corpus

2015–2017, 34 hours
Developers: IMCS UL

LATE-mediji

LATE-media

2015–2020, 78 hours (682k tokens)
Developers: IMCS UL

LatSenRom

Corpus of Latvian Early Novels

1879–1920, 3.7M words (4.7M tokens)
Developers: NLL, ILFA UL

LaVA

Latvian Language Learner Corpus

2018–2021, 192k words (241k tokens)
Developers: IMCS UL

LAvīzes

"Latviešu Avīzes"

1822–1915, 35.7M words (46M tokens)
Developers: NLL

Likumi

Corpus of Legal Acts of the Republic of Latvia

1990–2022, 73.9M words (116.2M tokens)
Developers: IMCS UL

LiLa

Lithuanian-Latvian-Lithuanian Parallel Text Corpus

1982–2012, 8M words
Developers: IMCS UL, VMU

LitMāksla

"Literatūra un Māksla"

1945–1995, 52.7M words (65.8M tokens)
Developers: NLL

LRK2013

Latvian Speech Recognition Corpus

2005–2013, 100 hours (1.1M tokens)
Developers: IMCS UL, Tilde, LETA

LVK2018

The Balanced Corpus of Modern Latvian

1991–2018, 10M words (12M tokens)
Developers: IMCS UL

LVMED

Latvian Radiology Speech Corpus

2010–2022, 35 hours (157k tokens)
Developers: IMCS UL, REUH

MuLa2012

Corpus of Contemporary Latgalian Texts 2012

1988–2012, 1M words (1.3M tokens)
Developers: IMCS UL, RAT

PanDi

Corpus of Latvian Pandemic Diaries

2020–2022, 565k words (709k tokens)
Developers: ILFA UL

Pārspriedumi

Corpus of Students' Essays

2018, 185k words (226k tokens)
Developers: IMCS UL, LiepU, RAT

Rainis

Corpus of Texts Written by Rainis

1895–1929, 1.6M words (2.3M tokens)
Developers: IMCS UL

Saeima

Corpus of the Saeima (the Parliament of Latvia)

1993–2022, 20M words (24M tokens)
Developers: IMCS UL, RSU

Senie

Corpus of Early Written Latvian Texts

1507–1800, 2M words (2.7M tokens)
Developers: LLI UL, FH UL, IMCS UL

Subtitri

Latvian Subtitles of Public Broadcasting

2015–2020, 1200 hours (10.8M tokens)
Developers: IMCS UL

Tīmeklis2007

Latvian Web Corpus 2007

1991–2005, 99M words (123M tokens)
Developers: IMCS UL

Tīmeklis2020

CommonCrawl of Latvian 2020

2013–2022, 403.6M words (492.6M tokens)
Developers: IMCS UL

UDLV-LVTB

Latvian UD Treebank

1991–2023, 19368 sentences (328k tokens) (v2.15)
Developers: IMCS UL

Vikipēdija

Latvian Wikipedia

2003–2022, 17.9M words (27.7M tokens)
Developers: IMCS UL

VVPP

Corpus of the Tests of the State Language Proficiency Testing

2016–2017, 150k tokens
Developers: IMCS UL

Ziņas

Articles from Latvian news portals

2000–2022, 357.2M words (513.5M tokens)
Developers: IMCS UL

ĪsprozaS

Corpus of Latvian Women Writers’ Short Fiction

1893–2002, 925k words (1.2M tokens)
Developers: ILFA UL

B. Saulīte, R. Darģis, N. Grūzītis, I. Auziņa, K. Levāne-Petrova, L. Pretkalniņa, L. Rituma, P. Paikens, A. Znotiņš, L. Strankale, K. Pokratniece, I. Poikāns, G. Bārzdiņš, I. Skadiņa, A. Baklāne, V. Saulespurēns, J. Ziediņš.
Latvian National Corpora Collection – Korpuss.lv
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022, pp. 5123–5129