- SUBTLEX-GR: The corpus
Of the 5,508 distinct contexts identified within the approximately 6,100 unique subtitle files that constitute our raw material, 4,001 corresponded to films and 1,507 to television series episodes (an average of 7.9 episodes per series). Accordingly, of the more than 27 million space-separated tokens of SUBTLEX-GR, 84.8% were taken from films and only 15.2% from television series. The films and television series to which the subtitles corresponded were mostly USA productions (71.7%), in line with the American dominance of the film and TV industry. Of the remaining 28.3% of subtitle files, nearly one third (9.5% of the total) corresponded to UK productions and around two thirds (18.8% of the total) to non-English productions, mostly French, German and Spanish. Hence, most of the subtitles used corresponded to English-language productions.
Files to download (UTF-8, plain-text format):
- SUBTLEX-GR entries along with information about their frequency of occurrence within the corpus.
SUBTLEX-GR_full.txt: This file contains the entire corpus, with a total of 27,761,198 space-separated tokens corresponding to 597,540 different types. For these types we calculated the number of occurrences and the number of different contexts they appeared in (contextual diversity, CD).
- SUBTLEX-GR entries with a contextual diversity value of more than 2 along with information about their frequency of occurrence within the corpus.
SUBTLEX-GR_CD.txt: This file provides a “cleaner” version of the corpus than the one found in the SUBTLEX-GR_full.txt file, with 187,021 types and 26,981,066 tokens. By using a contextual diversity cut-off of two, most of the optical character recognition (OCR) mistakes and illegal strings were eliminated, while very low frequency entries corresponding to neologisms and supposedly illegal constructions were maintained. Even though we consider this to be a useful additional tool for specific word material selection, it should be noted that this version of the corpus is not error-free.
- Letters of the Greek alphabet in their uppercase and lowercase versions along with their number of occurrences within the entire SUBTLEX-GR corpus.
- SUBTLEX-GR entries that resulted from the spell-checking process along with their frequency and lexical characteristics.
SUBTLEX-GR_restricted.txt: This version resulted from cross-checking the corpus with a Modern Greek spell-checker which includes more than 1,600,000 inflected word forms (Symfonia software, ILSP). Through this process, spelling errors due to optical character recognition (OCR) mistakes and word types not found in the Symfonia software were removed. The rejected strings made up a total of 75.6% of the types, but importantly, only 16.5% of the tokens. For the remaining 145,631 different word types, accounting for a total of 23,152,956 tokens, a set of additional frequency and lexical measures was also calculated. We strongly encourage researchers to use this version of the corpus when looking for word material, since it only contains legal and correctly spelled Modern Greek lexical entries and is thus much more manageable.
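To illustrate why a contextual diversity cut-off removes many types but few tokens, here is a minimal sketch in Python. It works on a handful of made-up in-memory entries rather than on the actual files, whose delimiter and header layout are not described here. The tuple layout, the function names `apply_cd_cutoff` and `retained_share`, and the toy values are all assumptions for illustration, not the procedure actually used to build SUBTLEX-GR_CD.txt.

```python
from typing import List, Tuple

# Hypothetical in-memory representation of an entry: (word, FREQcount, CD).
Entry = Tuple[str, int, int]


def apply_cd_cutoff(entries: List[Entry], cutoff: int = 2) -> List[Entry]:
    """Keep only entries whose CD is strictly greater than the cut-off."""
    return [entry for entry in entries if entry[2] > cutoff]


def retained_share(full: List[Entry], kept: List[Entry]) -> Tuple[float, float]:
    """Return the proportion of types and of tokens that survive filtering."""
    type_share = len(kept) / len(full)
    token_share = sum(freq for _, freq, _ in kept) / sum(freq for _, freq, _ in full)
    return type_share, token_share


# Made-up toy entries (not real corpus values): two rare strings with low CD
# and three words that occur in several contexts.
toy = [
    ("και", 900, 50),
    ("σπίτι", 80, 30),
    ("νερό", 17, 3),
    ("λεξη", 2, 2),
    ("xq7", 1, 1),
]

kept = apply_cd_cutoff(toy)
type_share, token_share = retained_share(toy, kept)
print(f"types kept: {type_share:.1%}, tokens kept: {token_share:.1%}")
```

In this toy example the two low-CD strings account for a large share of the types but a very small share of the tokens, which mirrors the pattern reported for the real corpus.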
- The following information is contained in each of the columns (from left to right) of the different versions of the SUBTLEX-GR corpus:
- ID: The number of each entry.
- Word: The entries ordered alphabetically. Please note that in the SUBTLEX-GR_restricted.txt file each entry starts with an uppercase or a lowercase letter, depending on which form was encountered more often in the corpus. This was done so that researchers can identify words that correspond mostly to proper names (or to words that are also used as names, e.g., Ειρηνικός [pacific] appears with an uppercase initial because it is mostly used as the ocean’s name). We recommend that researchers avoid using these words as experimental items, since they have a rather special representational status.
- FREQcount: The number of occurrences of each entry (raw frequency) in the subtitle files.
- CD: The number of different contexts (films and television series) a word appeared in.
- SUBTLEX_WF: The word frequency per million words with four digit precision. This value was calculated by multiplying the number of occurrences of an entry within the corpus (i.e., FREQcount) by a million and then dividing it by the number of tokens in the database (in each of its different versions). This measure allows frequency values to be matched across different databases, since it is normalized for corpus size.
- Lg10WF: This value corresponds to the log10(FREQcount+1).
- SUBTLEX_CD: The percentage of different contexts (films and television episodes) a word appeared in, with four digit precision.
- Lg10CD: This is the log10(CDcount+1) with four digit precision. According to our analyses, this is the most valid frequency measure for material selection.
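The column definitions above translate directly into a few lines of arithmetic. The following Python sketch assumes the totals reported for the full corpus (27,761,198 tokens and 5,508 contexts). The function names are hypothetical, and rounding Lg10WF to four digits is an assumption made for uniformity, since the four digit precision is only stated for the other measures.

```python
import math

# Totals reported for the full version of the corpus.
TOTAL_TOKENS_FULL = 27_761_198
TOTAL_CONTEXTS = 5_508


def per_million(freq_count: int, total_tokens: int) -> float:
    """SUBTLEX_WF: occurrences per million tokens, four digit precision."""
    return round(freq_count * 1_000_000 / total_tokens, 4)


def log10_plus_one(count: int) -> float:
    """Lg10WF or Lg10CD: log10(count + 1), rounded to four digits."""
    return round(math.log10(count + 1), 4)


def cd_percentage(cd: int, total_contexts: int) -> float:
    """SUBTLEX_CD: percentage of contexts containing the entry."""
    return round(100 * cd / total_contexts, 4)


# Hypothetical entry (made-up counts, not taken from the corpus).
freq_count, cd = 1_250, 640
print("SUBTLEX_WF:", per_million(freq_count, TOTAL_TOKENS_FULL))
print("Lg10WF:    ", log10_plus_one(freq_count))
print("SUBTLEX_CD:", cd_percentage(cd, TOTAL_CONTEXTS))
print("Lg10CD:    ", log10_plus_one(cd))
```

Passing the token total of a different version of the corpus as `total_tokens` gives the per-million value relative to that version.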
- The following additional columns are also included in the SUBTLEX-GR_restricted.txt file:
- FREQlow: The number of times a word appears in the corpus starting with a lowercase letter. Brysbaert and New (2009) showed that for proper names this measure is more representative than the total number of appearances of the name (starting with an upper or a lowercase letter).
- FREQupper: The number of times a word appears in the corpus starting with an uppercase letter. This measure does not take into account the number of times the entire word appeared written in uppercase letters.
- N: Number of orthographic neighbours of each entry found within the restricted corpus.
- OLD20: The Orthographic Levenshtein Distance 20 score. It is based on the Levenshtein distance (Levenshtein, 1966), a measure of orthographic similarity between two words that stands for the minimum number of substitutions, insertions, or deletions required to turn one word into the other. OLD20 is the mean of this distance between an entry and its 20 closest orthographic neighbours.
- Length: The length of each entry counted in number of characters.
- SUBTLEX_WF_full: The word frequency per million words with four digit precision, using as reference value the total number of tokens included in the full version of the corpus. It could be argued that i) frequency values calculated on the entire “unclean” corpus would be more exact, since they are divided by the true total number of tokens, and ii) the relative frequencies would also be more representative, since words found in the left-most part of the frequency distribution are also included.
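As an illustration of the distance underlying the OLD20 column, the sketch below implements the Levenshtein distance with the standard dynamic-programming recurrence and averages it over the k closest words in a lexicon. It assumes the usual definition of OLD20 as the mean distance to the 20 closest words. The function names `levenshtein` and `old_k` and the tiny lexicon are hypothetical, and nothing here is claimed to reproduce the exact procedure used to compute the values in the file.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, or deletions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def old_k(word: str, lexicon: Iterable[str], k: int = 20) -> float:
    """Mean Levenshtein distance from word to its k closest words in lexicon."""
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:k]
    if not closest:
        raise ValueError("lexicon contains no other words")
    return sum(closest) / len(closest)


# Tiny made-up lexicon; a real computation would use all entries of the corpus.
lexicon = ["γάτα", "γάλα", "πάτα", "γάτες"]
print(levenshtein("γάτα", "γάλα"))
print(old_k("γάτα", lexicon, k=2))
```

Note that the distance is computed over Unicode code points, so accented Greek letters should be normalized consistently (for example, all precomposed) before comparing words.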