Subtitri
Latvian Subtitles of Public Broadcasting
The corpus contains subtitles from various Latvian public media broadcasts (2015–2020): shows, movies, series, and so on. Each entry has a title, a publication date, and a URL where the broadcast can be watched. Each entry also indicates the audio language of the broadcast and whether it was originally recorded in that language or dubbed. Only the transcribed text is publicly available, not the audio recordings.
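As a rough illustration of the per-entry metadata described above, one entry could be modeled as in the following sketch. The class name `SubtitleRecord`, every field name, and the helper `originals_in` are hypothetical and chosen only for this example; they are not the corpus's actual schema or export format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SubtitleRecord:
    """Hypothetical layout of one corpus entry (field names are assumptions)."""
    title: str
    published: date
    url: str              # where the broadcast can be watched
    audio_language: str   # assumed to be a language code such as "lv"
    is_dubbed: bool       # False if originally recorded in audio_language
    text: str             # transcribed subtitle text (the only public part)


def originals_in(records, language):
    """Return entries originally recorded (not dubbed) in the given language."""
    return [
        r for r in records
        if r.audio_language == language and not r.is_dubbed
    ]
```

With entries in this shape, `originals_in(records, "lv")` would select only broadcasts originally recorded in Latvian, leaving out dubbed ones.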
Corpus size | 1200 hours (10.8M tokens)
Data period | 2015–2020
Development period | 2020–2022
Developers | Institute of Mathematics and Computer Science UL