Subtitri

Latvian Subtitles of Public Broadcasting

The corpus contains subtitles from Latvian public media broadcasts aired in 2015–2020, including shows, movies, and series. Each broadcast has a title, a publication date, and a URL where it can be watched. Each record also indicates the audio language of the broadcast and whether it was originally recorded in that language or dubbed. Only the transcribed text is publicly available, not the audio recordings.
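The per-broadcast metadata described above could be modelled roughly as follows. This is only an illustrative sketch: the class name, field names, and helper function are hypothetical and do not reflect the corpus's actual storage format or any published API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SubtitleRecord:
    """One broadcast in the corpus. All field names are hypothetical."""
    title: str
    published: date
    url: str             # where the broadcast can be watched
    audio_language: str  # assumed to be a language code such as "lv"
    is_dubbed: bool      # False if originally recorded in audio_language
    text: str            # transcribed subtitle text (audio is not public)


def originals_in(records, language):
    """Return records originally recorded (not dubbed) in the given language."""
    return [
        r for r in records
        if r.audio_language == language and not r.is_dubbed
    ]
```

Keeping the dubbed/original flag separate from the audio language makes it possible to select, for example, only broadcasts originally recorded in Latvian.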

Corpus size: 1200 hours (10.8M tokens)
Data period: 2015–2020
Development period: 2020–2022
Developers: Institute of Mathematics and Computer Science, University of Latvia