Interesting video about the number of words used across languages. The theory behind it, Zipf’s law, applies to all languages, not just English. It basically says that:
to be proficient in a language, you mostly need to learn the 18% of words that are used 80% of the time, and
nearly half of any conversation, book, or article will consist of the same 50 to 100 words.
I totally agree that a few words are used over and over in almost any language.
Unfortunately, much of the interesting info is contained in the other words. So, once you know those 50 to 100 really common words, you can hear conversations like the following:
I have a — ---, but the — thing about — is that— --- in a —. If you — with—, it will often— ---, and as — has —, all you need to — is that — ---. :eh:
The point, though, is that if you solidly acquire the structure and those 100 top words, so that you can instantly decode them, thus understanding the relationships between all the blanks, you can much more quickly pick up the particular words you need to engage in whatever conversation you want to. Plus, with some guidance, students learn they can do a lot more with a lot less language than they previously thought (especially if they come from a memorization-insistent teaching culture like Taiwan’s).
I actually use Zipf’s law in some of the work that I do, namely to confirm that some of the ranking algorithms (which are clusters of numbers) are in line with it.
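For the curious, a minimal sketch of that kind of check (the counts here are made up, not my actual data): fit log frequency against log rank and see whether the slope comes out near -1, which is what Zipf’s law predicts.

```python
# Sketch: test whether a frequency list is roughly Zipfian by fitting
# the rank-frequency curve on a log-log scale. A slope near -1 means
# the list follows Zipf's law. The toy frequencies below are invented.
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(freq) vs. log(rank), ranks 1..n."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A perfectly Zipfian toy list: freq(rank) = 1000 / rank
ideal = [1000 / r for r in range(1, 101)]
print(round(zipf_slope(ideal), 3))  # → -1.0
```

A real word list won’t sit exactly on -1, but a slope wildly off from it is a red flag that the ranking isn’t behaving like natural-language frequencies.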
The problem I found with it initially (back in 2010) is that there is hardly any hard data on word frequency for Chinese (not even on Chinese-language websites). What you get instead are lots of stats on character frequency, and unfortunately, characters are not words in their own right in modern Mandarin. Ranking by character frequency also forces a strange pace for learning Chinese characters (i.e., you see ‘的’ before ‘白’ and ‘勺’, though it wouldn’t hurt to get those pieces first if you were working with an approach to character recognition more like spelling than brute memorization). In languages like English, it turns out that the most common words also tend to be a lot shorter on average, and the least common ones a lot longer. To compensate for this, I took (imperfect) CJK data from two different sources, then aggregated a score for every character from its own frequency plus the frequencies of the characters it appears in as a component (e.g., CF(白) = F(白) + F(的) + F(怕) + … + F(泉)). When you rank those scores, their distribution is close to Zipfian, too, with the most common Unicode character (excluding strokes and components not encoded in Unicode) being ‘口’.
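Here’s a toy version of that aggregation, with a tiny hand-made decomposition table and invented frequencies standing in for the real CJK data:

```python
# Sketch of component-aggregated character scores: a component's score
# is the sum of the frequencies of every character it appears in.
# Both tables below are tiny illustrative stand-ins, not real corpus data.
from collections import defaultdict

char_freq = {"白": 120, "的": 5000, "怕": 300, "泉": 40, "水": 900, "心": 700}
components = {          # character → components it contains (itself included)
    "白": {"白"},
    "的": {"白", "勺"},
    "怕": {"忄", "白"},
    "泉": {"白", "水"},
    "水": {"水"},
    "心": {"心"},
}

comp_score = defaultdict(int)
for char, parts in components.items():
    for part in parts:
        comp_score[part] += char_freq[char]

# CF(白) = F(白) + F(的) + F(怕) + F(泉)
print(comp_score["白"])  # → 5460
```

Ranking on `comp_score` instead of raw `char_freq` is what puts pieces like ‘白’ and ‘勺’ ahead of the compounds built from them.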
When I was working on an English text-difficulty assessor (back in 2014), I also incorporated some multi-word terms (mostly phrasal verbs) into the frequency list. But as soon as you do that, Zipf’s law becomes less and less reliable.
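One simple way to fold multi-word terms into the counts before ranking is a greedy merge over the token stream (a sketch, not my exact method; the phrase list and sentence are illustrative):

```python
# Sketch: merge known two-word phrases into single tokens before
# counting, so "give up" is ranked as one item rather than two.
# The phrase set and example sentence are made up for illustration.
from collections import Counter

phrases = {("give", "up"), ("look", "after"), ("run", "into")}

def merge_phrases(tokens):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "do not give up when you run into trouble".split()
print(Counter(merge_phrases(tokens)).most_common(3))
```

Once merged items like these enter the list, the rank-frequency curve starts drifting away from the clean Zipf line, which is the unreliability I mentioned.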
So, when you talk about the 80/20 rule, you have to be clear on what that 20 will be for you.
A book? I think I’ve seen one, but I couldn’t tell you about it.
Seeing how we’re on the web, though, you could use the Academia Sinica’s word frequency data here or here. Keep in mind that neither tells you exactly when you’ve hit 20% of the net frequency of the corpus studied.
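If you pull the numbers yourself, finding that cutoff is easy enough; a sketch using an invented Zipf-like corpus in place of the real frequency data:

```python
# Sketch: how many top-ranked words does it take before their combined
# frequency covers a target share (e.g., 80%) of the whole corpus?
# The toy corpus below is synthetic, not the Academia Sinica data.
def words_for_coverage(freqs, target=0.80):
    """Smallest number of top-ranked words whose summed frequency
    reaches `target` of the total."""
    freqs = sorted(freqs, reverse=True)
    total = sum(freqs)
    running = 0.0
    for i, f in enumerate(freqs, start=1):
        running += f
        if running / total >= target:
            return i
    return len(freqs)

# Zipf-like toy corpus: freq(rank) = 1000 / rank for 10,000 "words"
toy = [1000 / r for r in range(1, 10_001)]
print(words_for_coverage(toy))
```

With this toy distribution, roughly the top 1,400 of the 10,000 “words” (about 14%) cover 80% of the tokens, which is at least in the same ballpark as the 80/20 claim.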
Went to Lucky Bookstore today. They don’t have any books with the most-used words in conversation; they only have books with the most-used characters… which I don’t believe would be the same. Unless the titles just aren’t clear and they actually mean the most-used words. I looked through the books, and some of the words listed are rarely used in conversation.
Some of the words listed include baseball, actress, stomach, engineer, salesman, webcam, programmer, homework… some obviously not in the top 100, 200, or 500 words, and clearly not determined scientifically or mathematically.
我, 你, 謝謝, 哈哈哈, 好 are probably my top five words. I think the best way is to speak every day with someone who is willing to explain sentences to you and have conversations with you. You will naturally use the top 20% most used words that way.