Latvian National Corpora Collection

The Latvian National Corpora Collection (LNCC) is a diverse collection of corpora representing both written and spoken language. It covers a wide range of use cases and the major text types and genres. It is a continuous multi-institutional and multi-project effort, supported by the digital humanities and language technology communities in Latvia.

Currently, 46 corpora developed by 18 institutions are available in the LNCC. Most of the corpora are annotated with a uniform morpho-syntactic annotation scheme and included in the federated search. The federated search combines corpora from two corpus indexer instances (endpoints), one maintained by IMCS UL (Institute of Mathematics and Computer Science, University of Latvia) and the other by NLL (National Library of Latvia). It currently includes 42 corpora (2.9 billion tokens).

Roberts Darģis, Baiba Saulīte
Korpuss.lv – a Versatile Platform for Digital Humanities
Baltic Journal of Modern Computing, 12(4), 2024, pp. 636–645

Baiba Saulīte, Ilze Auziņa, Roberts Darģis
Latvian National Corpora Collection Korpuss.lv | Nacionālā korpusu kolekcija Korpuss.lv
Linguistica Lettica, 31(1), 2023, pp. 202–223

Baiba Saulīte, Roberts Darģis, Normunds Grūzītis, Ilze Auziņa, Kristīne Levāne-Petrova, Lauma Pretkalniņa, Laura Rituma, Pēteris Paikens, Artūrs Znotiņš, Laine Strankale, Kristīne Pokratniece, Ilmārs Poikāns, Guntis Bārzdiņš, Inguna Skadiņa, Anda Baklāne, Valdis Saulespurēns, Jānis Ziediņš
Latvian National Corpora Collection – Korpuss.lv
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022, pp. 5123–5129

Partners

Funding

Recovery and Resilience Facility (ANM)

EU Recovery and Resilience Facility project "Language Technology Initiative" (2.3.1.1.i.0/1/22/I/CFLA/002; 2023–2026)

National Research Programmes (VPP)

National Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-IZM-LETONIKA-2025/1-0004; 2025–2028)
National Research Programme "Digital Humanities" (VPP-IZM-DH-2022/1-0002; 2022–2025)
National Research Programme "Digital Resources of the Humanities" (VPP-IZM-DH-2020/1-0001; 2020–2022)

European Regional Development Fund (ERAF)

European Regional Development Fund (1.1.1.5/18/I/016; 2018–2020)
European Regional Development Fund (1.1.1.1/16/A/219; 2017–2019)

Latvian Language Agency (Latviešu valodas aģentūra)

Funding for the development of the corpus concept, the Balanced Corpus of Modern Latvian, and other work (2005–2022)

Supporters