1000 characters, you're 90% there

Interesting site… lingua.mtsu.edu/chinese-computin … p?Which=MO

If you learn 1000 characters you should be able to recognize 90% of what you read. 2000 gets you to 97% and 3000 gets you to 99%.
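For anyone curious how coverage figures like these are typically derived, here is a minimal sketch. It is not the method used by the site above (its sources and procedure aren't stated here); it just counts characters in a text, ranks them by frequency, and reports what share of the running text the top N characters account for. The function name `coverage_by_rank` and the toy `sample` string are made up for illustration.

```python
from collections import Counter


def coverage_by_rank(text, cutoffs):
    """Map each cutoff N to the share of character tokens covered by the N most frequent characters."""
    # Count only characters in the basic CJK ideograph block; skip punctuation, spaces, Latin letters.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no CJK characters")
    # Frequencies sorted from most to least common.
    ranked = [n for _, n in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in cutoffs}


# Toy sample (hypothetical); a real analysis would use a large corpus.
sample = "我想喝啤酒我想吃牛肉我的朋友也想吃牛肉"
for cutoff, share in coverage_by_rank(sample, [1, 3, 5]).items():
    print(f"top {cutoff}: {share:.0%} of the text")
```

The result depends entirely on what text goes in, so a list built from one kind of writing won't necessarily rank characters the way another kind would.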

Now for the fine print. Firstly, this does not account for multi-character words or idioms, which make up a lot of the language (maybe someone has some numbers on this).

Secondly, if you get through ShiDa’s Practical Audio-Visual Chinese volume 1 you have nearly 1000 characters. It’s very important to learn them in context and read and speak and hear them within the structured context of a textbook. However, I bet dollars to yuan that ShiDa’s 1000 hardly correlate with the top 1000 in the above analysis. Books 2 shang and 2 xia also give you, I guess, at least 1000 more each. But again, I’m fairly certain these don’t correlate with the top 3000 above that get you to 99%. Anyone know of a textbook series that hews to usage frequency?

OK, written and spoken don’t correlate, but still.

Hmm… if you stop at 1000 on that list, you’ll miss a lot of important characters, like 午, 肉, 牛, 店, 昨, etc., which Shi-Da teaches in the first few lessons of the first book (as it should).
Even when you go past 2000, you’ll still find very important characters, like… 啤 (as in 啤酒). I mean, come on, that’s one of the first characters any foreigner will learn when coming to Taiwan or China. :slight_smile:

How else are we to define “important” other than “frequently used”? The list was generated by extensive computer analysis. The words are in order of frequency. First things first. Those missing (lower ranked) words are not as important as you may think. Or more accurately, not as frequently used IN WRITTEN CHINESE as you may think. This may be the issue (as mentioned in my original post.)

My recollection is Taiwan emphasizing important words like “busy”, “hot”, “noisy” early on, and phrases like “mei ban fa” and “cha bu duo”… draw what conclusions you will…

Extensive computer analysis of what? I couldn’t find a list of the sources, but I suspect the texts that were analysed were (classic) literature … where people don’t ask for beer to go with their beef noodle soup. It’d be interesting to see the results if they took the back issues of the more common newspapers and magazines.

Incidentally, is there an easy way to convert the list into Traditional characters?

More info here lingua.mtsu.edu/chinese-computing/statistics/ Above I had posted modern usage but he also has classical and combined.

There was another thread on Forumosa recently about converting between simplified and traditional; they said “you can do it easily with Chinese Microsoft Word.”

Corpora are usually based on texts – articles, written stuff. But frequency overall in Chinese would be based a) on words, not individual characters, and b) on other types of linguistic interactions – like ordering dinner, which is why characters like “cow” are important to learn early on.

Either way, it is true that if you learn 1000 characters – REALLY learn them – you will be well on your way. The only problems are the combinations and the grammar (especially going from spoken Chinese to newspapers). But it’s really neat the first time you correctly guess what the Chinese word would be without knowing it first, just based on your knowledge of “how it goes together usually”. [In my case, not wishing to date myself however, it was “typewriter ribbon”. Hey, it was a great thrill. :smiley: ]

I know I can definitely read over a thousand characters, and while it is true that I recognize most characters in any given newspaper article, I still do not come even close to grasping the real meaning. The problem is that the characters that I don’t understand make up the key words of the text.

That strikes a chord. I’ve lost count of the times I’ve seen a sign which I was able to translate as, for example:

“On weekends please do not use the _____, because it may cause ____. Thank You for your cooperation.”

Great. 80% of the words. 0% comprehension :frowning:

Yes, true, david. I’ve been at it two and a half years. Sure, I can read a lot of the characters in the paper, but do I know what it means… :noway:

[quote=“david”]That strikes a chord. I’ve lost count of the times I’ve seen a sign which I was able to translate as, for example:

“On weekends please do not use the _____, because it may cause ____. Thank You for your cooperation.”

Great. 80% of the words. 0% comprehension :frowning:[/quote]

Well, you’re actually quite a ways toward your goal. You know: when NOT to do it. What do you care why not to do it, as long as you know NOT to do it? Based on this, if it’s a weekend, just don’t use anything and you’re good. :smiley: Anyway, another danger in learning to read/understand Chinese is getting hit with something you just don’t expect…I mean if you DID know the words for what it might cause, you might think you were wrong because the cause might not make sense to you (although it would to the Chinese). For example, I had a landlady once who believed that leaving appliances plugged in consumed power (even when they were turned off). So I could imagine her writing a sign to the effect of “Please do not leave appliances plugged in, because we want to save energy” or something like that. If I weren’t sure of my Chinese, I might doubt that I had read the second part correctly because the logic wouldn’t make sense to me.

Seriously, though, this is just a vocab problem. I think you can be very proud of yourself if you can recognize this sort of repeating – and hence very useful – pattern.

After skimming through that list, it seems like “1000 characters = 90% there” is a bit of an overestimation. I’d say at the 1000 character level you could probably handle older elementary school texts quite well, but newspapers and novels would be a bit of a stretch.

I’ve probably learned around 3000 characters (i.e. able to read. In this age of fancy computer technology, damned if I can write by hand a fraction of that) and I’ve hit the wall of diminishing returns. I can’t be bothered to study all of these increasingly obscure characters, because at this point the characters I don’t know are ones that I’d only see once in a great while and certainly would never be used in daily conversation. But on the other hand, it’s that long tail of characters that I don’t know which messes up trying to casually read Chinese novels. There’s nothing more frustrating than reading a sentence and understanding its entire meaning except for one adverb. I can’t be bothered to look up the character since I already have the gist of what they’re trying to say, but usually it’s that one word that gives the sentence its literary polish.

Hmm, I wonder if there’s a Chinese equivalent to Raymond Carver.

I haven’t been able to find any solid statistics on this, but I read somewhere that “on the word-per-word level, any language is about 58-59 percent nouns, 20 percent verbs and 20 percent adjectives. Except for prepositions, conjunctions, interjections, pronouns and other parts of speech that make up the remaining 1-2 percent, the rest of the language is a combination of these three dominant elements.” So probably when you hit that long tail of diminishing returns it’s 60% (or more?) increasingly obscure nouns.

True that. At my buxiban, I’ve thumbed through English-Chinese picture dictionaries with pages full of pictures of vegetables and tree species not typically found in Asia and thought “What the hell are all these characters?” Too bad they don’t have pinyin since I’d actually learn something during my coffee break.

Oh, and I’ll have to revise my previous, cocksure estimate. I probably only know somewhere around 2000-2300 characters. I was looking at that list again and by the upper 1000s and lower 2000s, it starts to get sketchy. Though that list is in simplified, so who the hell knows.

I agree with you guys as well. There are even times when I can read ALL of the characters in a sentence and STILL not know precisely what they’re trying to say… This is the difference between knowing individual characters (not having tested myself, I’d say 1,000 isn’t out of the range of possibility of what I know now) and knowing character combinations. Dang, this language is frustrating…

Your thrifty landlady may have been correct on some appliances. See I Vant to Drink Your Vatts [nytimes.com] (registration required).

Right. I heard that too. And not only TVs and videos but also air conditioners.

(I’ve also heard that TVs left plugged in can be a fire risk).

Don’t forget, you have to know 95% of the words in a text to be able to read with comprehension. 90% of characters isn’t 90% of words. What that means is that you’ll be frustrated with just about everything you read.

So that next 1000 is really vital to the reading process.
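To put rough numbers on the gap between character coverage and word coverage, here is a toy sketch. It assumes the text is already split into words and (generously) that you "know" a word whenever you know every character in it, which, as others have said above, isn't always true. The function name `char_and_word_coverage`, the word list, and the set of known characters are all invented for the example.

```python
def char_and_word_coverage(words, known_chars):
    """Return (share of characters known, share of words whose characters are all known)."""
    chars = [ch for word in words for ch in word]
    char_cov = sum(ch in known_chars for ch in chars) / len(chars)
    # A word only counts as readable if every one of its characters is known.
    word_cov = sum(all(ch in known_chars for ch in word) for word in words) / len(words)
    return char_cov, word_cov


# Hypothetical pre-segmented sentence and a reader who is missing 啤 and 麵.
words = ["我", "想", "喝", "啤酒", "吃", "牛肉", "麵"]
known = set("我想喝酒吃牛肉")

char_cov, word_cov = char_and_word_coverage(words, known)
print(f"characters known: {char_cov:.0%}, words known: {word_cov:.0%}")
```

One missing character takes out the whole word it sits in, so in running text the word figure will generally come out at or below the character figure.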

A thousand characters is a great start. Two thousand and you’re really moving along. Four thousand and you start to get the gist of a lot of things.

But I’ve been in the Chinese to English translation game for six years now, and let me tell you…just when I start to get pleased with myself on how quickly I zip through certain jobs and how large my homemade glossaries are getting, I get hit with a project in an area I don’t usually do that sends me to the dictionary with every sentence. It’s humbling, I tells ya. Sure, I get the gist of the whole thing, but I’m actually getting paid for the nuances.

Nuances sure take time.

Indeed, there is a difference between learning tzyh and learning multi-tzyh words. The former are a lot easier to process with a computer. I have yet to find a list of Mandarin WORDS in frequency order. technology.chtsai.org/ is interesting.

According to mtc.ntnu.edu.tw/text_doce.html:
"Practical Audio-Visual Chinese Vol. I is suited for a six-month course at ten hours a week of classroom instruction. After completing this text, students should have learned approximately 615 Chinese characters, 890 vocabulary words and 90 grammatical patterns.

"Practical Audio-Visual Chinese Vol. 2 is suited for a 25-week course, but instructors can choose to skip certain lessons. After completing this book, students should have learned approximately 900 Chinese characters and more than 1,800 vocabulary words. More than 250 grammatical patterns are also included."

I notice the list is in short-form characters. Does that mean their corpus is based on mainland-published material? If so, then it is not a corpus that is relevant to me.