SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan.

Author
Abstract
:

SUBTLEX-CAT is a word frequency and contextual diversity database for Catalan, obtained from a 278-million-word corpus based on subtitles supplied from broadcast Catalan television. Like all previous SUBTLEX corpora, it comprises subtitles from films and TV series. In addition, it includes a wider range of TV shows (e.g., news, documentaries, debates, and talk shows) than has been included in most previous databases. Frequency metrics were obtained for the whole corpus, on the one hand, and only for films and fiction TV series, on the other. Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates, computed from either written texts or texts from the Internet. Furthermore, the metrics obtained from the whole corpus were better predictors than the ones obtained from films and fiction TV series alone. In both experiments, the best predictor of response times and accuracy was contextual diversity.

Year of Publication
:
2020
Journal
:
Behavior Research Methods
Volume
:
52
Issue
:
1
Pages
:
360-375
ISSN Number
:
1554-351X
URL
:
https://dx.doi.org/10.3758/s13428-019-01233-1
DOI
:
10.3758/s13428-019-01233-1
Short Title
:
Behav Res Methods