Frequency Lists Based on TV Shows and Movies

What do the words “no deposit bonus” mean? Do we define those words literally? Am I getting a bonus for not making a deposit? Or do those words mean something else when we are talking about casinos and gambling, the actual context in this example?

Understanding a phrase like “no deposit bonus” requires more than understanding the individual words. We also need to understand what the words mean in combination, the context in which they are used, and even the culture of the people using them. Putting all of those variables together is what we call language comprehension.

So how do you go about learning a new language?  A second language?

Talk to 10 different people, and you will probably get 10 different answers. 

Most of the “free” foreign language programs seem to take the same approach. They show you a series of pictures, and you are supposed to type in the foreign-language word for each picture. The advantage for the software developer is that the program is created once, and the EXACT same program can be used for any language. Every language has a word for “apple”, right?

But that is not the right way to learn a foreign language that you intend to actually use.  Fifty years ago, that approach probably made sense, but with Google Translate available on every phone, memorizing a bunch of words that may or may not be important to you is not the best approach.  

Learning nouns without having a solid foundation of a language is like trying to do interior decorating of a new house without first making sure that the floor is solid to walk on.  Making sure that the floor is solid enough to walk on is not as much fun as painting a mural on your wall, but it is still a necessary first step.

At the end of the day, the most important words to learn are the least fun to learn.  If 300 words cover 50% – 85% of all of the words you encounter in conversation, writing, movies, TV shows, magazines, newspapers … why would you not focus on those words as your first step?

But how do you know what those words are?

There are several well-known lists that are built on word frequency, but each takes a slightly different approach to deciding which words make the list.

Dolch Words (children)

The Dolch Word List is a list of 220 words plus 95 nouns. It was created in 1936 based on books written for children in kindergarten through 2nd grade. These 220 words cover between 50% and 75% of all words used in schoolbooks, library books, newspapers, and magazines. Many of these words cannot be sounded out (or pictured). They just have to be memorized.

Fry Words (children)

The Fry Word List has 1000 words. It was created in 1957 from books written for children in 3rd through 8th grade, but most children are expected to know these words by the end of 5th grade. The Fry List covers 90% of all words used in schoolbooks, library books, newspapers, and magazines. As a frame of reference, most non-technical newspapers, magazines, and websites target their text to a 5th-grade reading level.

Oxford Plus Word List (children)

The Oxford Plus Word List is the third major word list that was specifically designed for children (5th grade and below reading level). The main difference between the Oxford Plus Word List and the Fry and Dolch word lists is that its words were taken from children’s actual writing.

This affects which nouns were included: dragon, princess, castle, baseball, etc. are words that children use regularly in their writing, but adults don’t tend to use as much.

Also, children use slang words: mom, dad, aunty … 

Finally, each tense of a verb is included as its own entry, while in the other lists a verb is included only once unless it has an irregular spelling.

Summary of children’s lists

Those are the three major lists designed for children.  Each takes a slightly different approach, and each has its own advantages and disadvantages.  Which one is best? Probably a combination of all three.

Oxford 3000 (adults)

Oxford also created a 3000-word list (counted in word families) that was designed for adults. Oxford’s dictionaries indicate which words are on this list, and their definitions use words from the list whenever possible. A word family is a group of related words: big, bigger, biggest.

Oxford has also created an Oxford 3000 Text Checker online.  

Paste your English text into the text area, hit “check text”, and any words that are not contained within the Oxford 3000 word list are presented in red font.
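The same kind of check can be sketched in a few lines of Python. This is only an illustration of the idea, not Oxford's actual tool: the function name `words_outside_list` is made up, and `known_words` is a tiny stand-in for a real word list.

```python
import re

def words_outside_list(text, known_words):
    """Return the words in `text` that are not in `known_words`,
    in order of first appearance."""
    known = {w.lower() for w in known_words}
    seen = set()
    missing = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing

# Tiny stand-in list for illustration only (an assumption, not the Oxford 3000).
known_words = ["the", "cat", "sat", "on", "a", "mat"]
print(words_outside_list("The cat sat on a velvet mat.", known_words))
```

Note that this sketch matches exact spellings only. A checker built on word families would also need to recognize forms such as “cats” or “sitting” as belonging to words on the list.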

The one item to note is that the Oxford list does not contain proper nouns, while the Fry list and the Oxford Plus list do contain proper nouns.  

A typical low intermediate text (general text) has close to 95% – 100% of its words from the Oxford 3000 list.

A typical high intermediate text (general text) has close to 90% – 95% of its words from the Oxford 3000 list.

In a typical advanced text (general text), 75% – 90% of the words will be Oxford 3000 words. 

New General Service List (adults)

The New General Service List is an updated version of the General Service List. The list is taken from a corpus of 273 million words from a wide variety of general texts. This list contains 2,368 word families. A word family is a group of different versions of the same word: big, bigger, biggest are all within the same word family.

These 2,368 word families cover 92.34% of all the words encountered in general texts.
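A coverage figure like this is the share of all word occurrences in a text that is accounted for by the most frequent words. The sketch below shows that arithmetic on a made-up sentence. The function name `coverage_of_top_n` is hypothetical, and it counts exact spellings rather than word families, so it is a simplification of how the published lists are measured.

```python
from collections import Counter

def coverage_of_top_n(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent distinct words."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

# Made-up sample text: 11 tokens, 7 distinct words.
tokens = "the bus is on the road and the bus is big".split()
print(coverage_of_top_n(tokens, 2))  # the two most frequent words cover 5 of 11 tokens
```

Run against a large collection of text, the same calculation shows how quickly coverage climbs with the first few hundred words and how slowly it climbs after that.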

New General Service List Spoken (adults)

The NGSL-Spoken list is a subset of the New General Service List.  It is a list of 721 words (word families) that cover 90% of all spoken words.

New Academic Word List (adults)

After a person has learned the NGSL list, if the texts that they want to understand are academic texts, the NAWL is a good list of word families to learn. When looking at academic text, the NGSL list covers 86% of the words encountered. Add on the 960 words from the New Academic Word List and the word coverage goes up to 92%.

Business Service List (adults)

The BSL list is another specialized list that focuses on business writing. When looking at business writing, the NGSL list covers 91% of the words. Add on the 1700 words from the BSL list and word coverage is 97%.

TOEIC Word List (adults)

The TOEIC is an international standardized test of the English language.  It is supposed to measure everyday English word usage in an international environment.  The words from the NGSL list plus the 1200 TOEIC words are supposed to cover 99% of the words that are tested on the international TOEIC test.

Summary of the above word lists for adults

All of the above word lists focus on written texts. Although the NGSL-Spoken word list covers spoken words, the actual word list was derived from the written word database.

All of these word lists are great, but the one negative about them is that at the current time, these lists are not converted to equivalent lists in other languages.  I can’t type in “NGSL Arabic” or “NGSL Hebrew” and get the equivalent word list in either Hebrew or Arabic.

So what do you do when you want to learn from a frequency word list that is not English?  Now we come to our last category of frequency word lists.

Subtitles from Movies and TV Shows

If you are looking for raw text file word lists built from movies and TV shows whose subtitles have been uploaded to Opensubtitles.org, then GitHub is the site for you.

As you would expect, these lists count words in their spoken form, EXACTLY as they appear on your TV screen when you display subtitles.

The words are NOT word families. The words are EXACTLY how they are used in context. Big, Bigger, and Biggest are three different entries, while in all of the other word lists, they would be one entry.
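The difference between the two counting methods is easy to see in a small sketch. The mapping `FAMILY_OF` below is a hand-made, hypothetical example covering one family. Real lists come with their own, much larger family tables.

```python
from collections import Counter

# Hypothetical, hand-made mapping from a surface form to its family headword.
FAMILY_OF = {"bigger": "big", "biggest": "big"}

def count_surface_forms(tokens):
    """Count every spelling separately, the way a subtitle list does."""
    return Counter(tokens)

def count_word_families(tokens, family_of=FAMILY_OF):
    """Count related forms under one headword, the way the other lists do."""
    return Counter(family_of.get(token, token) for token in tokens)

tokens = ["big", "bigger", "biggest", "big", "bus"]
print(count_surface_forms(tokens))   # four separate entries
print(count_word_families(tokens))   # two entries: big and bus
```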

If you find a TV show (or movie) that has subtitles for both your native language and your target language, you can easily study the two texts and then watch the TV show or movie to hear the text.  There are even tools online that will merge the two subtitle files together so that the subtitles from both languages are displayed on the screen at the same time.
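The heart of such a merge is matching up lines from the two files by their timestamps. Here is a minimal sketch of that step, assuming both files are standard srt text. The function names and the sample captions are invented for illustration, and a real tool would also have to write the result back out as a subtitle file.

```python
import re

TIME_RE = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)

def parse_srt(text):
    """Parse srt text into a list of (start_ms, end_ms, caption) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            m = TIME_RE.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
                end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues

def pair_cues(primary, secondary):
    """For each primary cue, attach the text of every secondary cue
    that overlaps it in time."""
    pairs = []
    for start, end, text in primary:
        matches = [t for s, e, t in secondary if s < end and e > start]
        pairs.append((text, " ".join(matches)))
    return pairs

# Invented sample captions with slightly different timings in each language.
english = "1\n00:00:01,000 --> 00:00:03,000\nHello, class.\n\n2\n00:00:04,000 --> 00:00:06,000\nTo the bus!\n"
spanish = "1\n00:00:01,200 --> 00:00:03,100\nHola, clase.\n\n2\n00:00:04,100 --> 00:00:06,000\n¡Al autobús!\n"

for native, target in pair_cues(parse_srt(english), parse_srt(spanish)):
    print(native, "|", target)
```

Matching on overlapping time ranges rather than on line numbers matters because the two subtitle files rarely split the dialogue at exactly the same points.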

The negatives are that there are spelling mistakes, and not all of the translations were done by a human. You may also encounter one translation for the subtitle text and a different translation for the dubbing (if the TV show or movie was dubbed into a foreign language).

Can I use subtitle file frequency lists for teaching a foreign language to children?   

Yes and no.  The list was taken from adult TV shows and movies, so within the first 1000 words, you are going to find swear words and other words that are not appropriate “first words” for children.

Also, not all languages do dubbing for adult shows, because they figure that adults will just read the subtitles.  You will have more of a selection of dubbed shows when you use children’s shows as the text to work with. This does not just include cartoons.  It can also include hit Disney movies, Harry Potter, Star Wars, etc.

When you restrict yourself to a certain set of TV shows and movies, this will affect the word frequency lists.  When talking about the first 1000 to 2000 words, this will not have a huge impact, but it will have some impact.

The GitHub tool is written in C#, and it will create a word frequency list from a directory of srt files.

Example project

Here is an example project that I created.  I wanted to create a word frequency list for children.  I decided to use the show Magic School Bus Returns, because … to be honest … I wanted a show that I could tolerate watching over and over again.  I went onto Opensubtitles.org and I downloaded all of the srt files for Magic School Bus Returns season 1 in English, and then in Spanish, the target language I was learning.

Since I did not feel like dealing with C#, I ran the files through a program to convert the srt files to text.
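Any converter will do for this step. For readers who would rather script it, the sketch below shows roughly what such a conversion involves: dropping the cue numbers, the timestamp lines, and simple formatting tags. It is an assumption about what a converter does, not the program used here, and the function name `srt_to_text` is made up.

```python
import re

def srt_to_text(srt):
    """Strip cue numbers, timestamp lines, and simple markup tags from srt content."""
    kept = []
    for line in srt.replace("\r\n", "\n").split("\n"):
        line = line.strip()
        # Caveat: a caption consisting only of digits is dropped along with cue numbers.
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(re.sub(r"<[^>]+>", "", line))
    return "\n".join(kept)

# Invented sample in srt layout.
sample = "1\n00:00:01,000 --> 00:00:03,000\n<i>Hello, class.</i>\n\n2\n00:00:04,000 --> 00:00:06,000\nTo the bus!\n"
print(srt_to_text(sample))
```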

I merged the files into one large file. I changed all of the spaces to carriage returns so that each word sat on its own line, removed the punctuation marks, and sorted the list. Then I used Excel to count how many times each unique word appeared. Voilà, I had my frequency word list based only on the TV show Magic School Bus Returns Season 1.
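The same steps can be done in one short script instead of by hand. This is a minimal sketch, not the exact process described above: it lowercases every word so that “The” and “the” are counted together, which is an added assumption, and the function name and sample sentence are made up.

```python
import re
from collections import Counter

def frequency_list(text):
    """Lowercase the text, drop punctuation, split on whitespace,
    and count each word, most frequent first."""
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    words = cleaned.split()
    return Counter(words).most_common()

# Made-up sample standing in for the merged subtitle text.
sample = "The bus is big. The bus is magic!"
for word, count in frequency_list(sample):
    print(word, count)
```

Because the pattern keeps any letter character, accented letters such as those in Spanish words survive the cleanup, while marks like “¡” and “¿” are removed along with the other punctuation.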

Summary

Of all of the lists available, I think that the frequency lists based on subtitles are most beneficial because they are available (or can be easily created) in every single language that exists.  On top of that, you can easily test your understanding in context by simply watching a TV show or movie in the language you are trying to learn. Add on the ability to speed up or slow down the speaking, and it is a simple, not too expensive, custom created foreign language program.