Learning from general word lists is inefficient

If you are at the point where you are starting to read native content but you find that your vocabulary is lacking, one of the least effective things you can do to boost your vocabulary is learn from word lists such as the HSK or other word lists based on general frequency.

The reason for this is quite simple: generic word lists are derived from a large body of content spanning multiple fields, genres and topics. They contain a little bit of this and a little bit of that, so only a small part is going to be relevant to what you are currently reading, and the rest will be words you are not going to use or encounter for months (or maybe even years).

Eventually you are going to want to know all of those words, but right now, whatever you are reading is going to be on a reasonably specific topic of some sort and it will be a more productive use of your time to learn frequently occurring words from that content rather than learning from general lists.

How much better? As it turns out, significantly better. In fact, it will probably be an order of magnitude more effective to learn currently relevant words than to learn words from a pre-compiled list such as the HSK word lists.

That’s a bold statement, but the numbers back it up. For example, imagine you had studied to HSK 4 level (which going by the HSK word lists is around 1,200 words) and have now decided to start reading native content.

You’d heard that the novel《活着》was a good first book for Chinese learners because it has relatively simple grammar and vocabulary, and so you decide to give it a try.

You load it up in Chinese Text Analyser and see that with an HSK 4 level vocabulary, you will understand approximately 61% of the total text (HSK 5 and HSK 6 aren’t much better, giving you 65% and 68% respectively). That’s not great, and in fact it’s probably a little too far above your current level, but you decide that you’re going to stick at it anyway, and each day you’re going to learn the 10 most frequent unknown words.
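How Chinese Text Analyser computes this figure internally isn't covered here, but the idea behind a coverage percentage can be sketched in a few lines of Python. This is a minimal illustration, assuming the text has already been segmented into words (Chinese has no spaces, so a real novel would need a segmenter first). The function name `coverage` and the toy data are made up for the example.

```python
def coverage(tokens, known_words):
    """Return the fraction of running words (tokens) that are in known_words."""
    if not tokens:
        return 0.0
    known_count = sum(1 for token in tokens if token in known_words)
    return known_count / len(tokens)


# Toy, pre-segmented text (hypothetical data, not taken from the novel).
tokens = ["我", "喜欢", "看", "书", "我", "喜欢", "喝", "茶"]
known = {"我", "看", "书"}

print(f"{coverage(tokens, known):.0%}")  # prints 50%
```

Note that coverage here counts every occurrence of a word, not just distinct words, which is why learning a handful of very frequent words can move the number so much.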

Every day you open an electronic copy of《活着》in Chinese Text Analyser and export the 10 most frequent unknown words. You’ll study these words in some flashcard program such as Pleco or Anki, and so you also get Chinese Text Analyser to mark these words as known so that each day you’ll be getting a list of 10 new words.

If you do that every day and learn all those words, then after:

  • 1 month you will know an extra 300 words, and understand 83% of the text.
  • 2 months you will know an extra 600 words, and understand 88% of the text.
  • 3 months you will know an extra 900 words, and understand 91% of the text.
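The daily routine described above (take the most frequent unknown words, mark them as known, repeat) can be simulated with a short script. The sketch below is an assumption about how one might model it, not a description of how Chinese Text Analyser works; `simulate_study` and the toy word counts are hypothetical, and the text is again assumed to be pre-segmented.

```python
from collections import Counter


def simulate_study(tokens, known_words, words_per_day=10, days=30):
    """Learn the most frequent unknown words each day.

    Returns a list with the coverage (0.0 to 1.0) at the end of each day.
    """
    if not tokens:
        return []
    # Count only the words that are not yet known.
    unknown_counts = Counter(t for t in tokens if t not in known_words)
    known_tokens = len(tokens) - sum(unknown_counts.values())
    history = []
    for _ in range(days):
        # Today's list: the most frequent words that are still unknown.
        todays_words = [w for w, _ in unknown_counts.most_common(words_per_day)]
        for word in todays_words:
            # Marking a word as known makes all of its occurrences readable.
            known_tokens += unknown_counts.pop(word)
        history.append(known_tokens / len(tokens))
    return history


# Toy example: 10 running words, one of them already known.
tokens = ["的"] * 4 + ["福贵"] * 3 + ["牛"] * 2 + ["田"]
known = {"的"}

print(simulate_study(tokens, known, words_per_day=1, days=3))  # [0.7, 0.9, 1.0]
```

Even in this tiny example the gains shrink each day (30, then 20, then 10 points), the same pattern as the 83%, 88% and 91% figures above: the most frequent words pay off first.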

By comparison, if you decided instead to learn all the HSK 5 vocabulary before starting to read《活着》, you would need to learn 1,300 words (starting from HSK 4), and after doing that you’d still only be able to understand 65% of the text.

In other words, you would have put in more effort, for worse results.

Going even further, if you thought that 65% was still not enough and so you decided to learn all the HSK 6 vocabulary first, you would need to learn an extra 3,300 words (from an HSK 4 base), and would still only reach 68% coverage of the total text.

That’s heading into order-of-magnitude territory:

  • 900 words for a 30 percentage point increase in understanding (about 0.03 points per word) vs
  • 3,300 words for a 7 percentage point increase in understanding (about 0.002 points per word)
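The per-word figures are simple division, and they can be checked quickly. The snippet below uses only the percentages and word counts quoted in this article; the helper name `gain_per_word` is just for illustration.

```python
def gain_per_word(start_pct, end_pct, words_learned):
    """Percentage points of coverage gained per word learned."""
    return (end_pct - start_pct) / words_learned


# 61% -> 91% after 900 words taken from the novel itself.
from_text = gain_per_word(61, 91, 900)
# 61% -> 68% after 3,300 words taken from the HSK 5 and HSK 6 lists.
from_hsk6 = gain_per_word(61, 68, 3300)

print(f"{from_text:.4f} points per word")   # 0.0333 points per word
print(f"{from_hsk6:.4f} points per word")   # 0.0021 points per word
print(f"ratio: {from_text / from_hsk6:.1f}x")  # ratio: 15.7x
```

So on these numbers, each word learned from the text is worth roughly 15 times as much as a word learned from the HSK lists.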

Those numbers are pretty clear, but in this example, learning vocabulary up to HSK 6 looks like even more of a waste of time once you consider the alternative. If you’d learnt the 3,300 most frequent unknown words from the novel itself (starting from an HSK 4 base), you would have 98% comprehension of the text, which is enough that you’d be able to infer the meaning of most of the remaining unknown words from context.

The benefits also carry over into new content.

Say that after 3 months (and 900 new words) you finish reading《活着》and decide to try another novel by the same author,《许三观卖血记》.

You’d start with 83% comprehension of the new text, compared to 65% if you’d just learnt HSK 5 words, and 68% if you’d learnt HSK 6 words.

Or perhaps you want a break from rural China during the Cultural Revolution and decide to read a novel by a completely different author in a completely different setting, for example《圈子圈套》, which is set in the Beijing IT industry in the early 2000s.

After learning 900 of the most frequent unknown words from《活着》(starting from a base of HSK 4) you’d have 72% coverage of the new novel, compared to 64% for HSK 5, and 67% for HSK 6.

As you can see from the numbers, the HSK list, with its ‘little bit of this and little bit of that’ approach, does a great job of being consistent across novels and genres (high 60s for HSK 6, low 60s for HSK 4). However, in terms of being able to understand the content you are trying to read right now, learning the high-frequency words from that content is significantly more effective.

Whatever way you slice it, you will almost always come out ahead if you learn the most frequent words from what you are reading rather than learning from a pre-compiled list. Using a program such as Chinese Text Analyser, you can quickly and easily find that vocabulary and extract it from the text you are reading.

