Academia Sinica Chinese Word List by Frequency

I don’t know if people already know about this, but anyway, on the CHILDES Web site of Carnegie Mellon University, there’s a zipped Excel file (or a zipped folder with an Excel file) of a Chinese word list that the site says was contributed by Academia Sinica. I think it’s about 20,000 words. Each word is in the form of a Traditional Chinese character expression and a pinyin expression with a tone number. Since it’s an Excel file, the words are numbered, but each word also has a number representing the number of instances in which it was found (presumably in Academia Sinica’s corpus, although I don’t know that for a fact).

At work, I had some problems unzipping the folder (my Chinese is so poor that I couldn’t understand the error message, if that’s what it was), but I was able to open the Excel file.

Here’s the page the zipped Excel file is on.

Here’s the folder with the zipped Excel file.

A word of warning if the list is from Academia Sinica’s Corpus: Much of Academia Sinica’s corpus is (probably like most corpora) heavily stocked with newspapers, etc., so I’m guessing it has a different frequency count from what you would get from a body of transcribed spontaneous speech. Here’s a description of the corpus:

[quote]Criterion
Proportions

Genre
Press reportage: 56.25%, Press review: 10.01%, Advert: 0.59%, Letter: 1.29%, Fiction: 10.12%, Essay: 8.48%, Biography and diary: 0.50%, Poetry: 0.29%, Quotes: 0.03%, Manual: 2.03%, Play script: 0.05%, Public speech: 8.19%, Conversation: 1.34%, Meeting minutes: 0.11%

Style
Narrative texts: 70.66%, Argumentative texts: 12.24%, Expository texts: 14.72%, Descriptive texts: 2.83%

Mode
Written: 90.14%, Written-to-be-read: 1.38%, Written-to-be-spoken: 0.82%, Spoken: 7.29%, Spoken-to-be read: 0.35%

Topic
Philosophy: 8.68%, Natural science: 12.97%, Social science: 34.99%, Arts: 9.28%, General/leisure: 17.89%, Literature: 16.20%

Source
Newspaper: 31.28%, General magazine: 29.18%, Academic journal: 0.70%, Textbook: 4.08%, Reference book: 0.13%, Thesis: 1.36%, General book: 8.45%, Audio/video medium: 22.83%, Conversation/interview: 1.63%, Public speech: 0.25%[/quote]

Here’s the page containing the above description.

Anyway, if you’re interested and can get the file open, enjoy!

Thanks, Charlie Jack, for this post. Having opened the file, I can report to others who’ve not yet done so that it is a 辭 ci2 word list rather than a 字 zi4 character list. The order of the single character items in it will therefore differ significantly from a similar zi4 list, in which, for example, you’ll probably get 的, 了, 個 and similar graphs at or nearer to the top.

Both kinds of list are useful for those wishing to memorize characters or ci2 words in order of frequency, make flashcards, and so on.
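For anyone curious how far a ci2 list and a zi4 list can diverge, here is a minimal Python sketch that derives rough character counts from word counts by crediting each word's count to every character in it. The function name `char_counts` and the sample counts are invented for illustration and are not taken from the Academia Sinica file; a real zi4 list built from a corpus would also count characters in words that never made it onto the word list.

```python
from collections import Counter

def char_counts(word_counts):
    """Credit each word's count to every character occurrence in that word."""
    totals = Counter()
    for word, count in word_counts.items():
        for ch in word:
            totals[ch] += count
    return totals

# Invented counts, for illustration only (not from the Academia Sinica file).
sample = {"的": 100, "我們": 30, "我": 25, "他們": 20, "個": 15, "個人": 5}

# 我 picks up counts from both 我 and 我們, so it ranks higher as a
# character than it does as a standalone word.
print(char_counts(sample).most_common(3))
```

With these made-up numbers, 們 never appears as a word on its own but still lands near the top of the character ranking, which is the kind of reordering described above.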

Thanks for the link, Charlie Jack.

I found the wordlist in a different format before, and ended up making my own files based on it. Much easier to have them ready-made in Excel format though.

[quote=“joesax”]I found a useful resource:
elearning.ling.sinica.edu.tw/eng_teaching.html
You can get a list of up to 300 words at a time. The instructions are a little misleading. Of course you can enter rank numbers above 300, but the total range can’t exceed 300. So you can get the list for ranks 1 – 300, and also for 301 – 600, but not for 1 – 600 all at the same time.[/quote]http://www.forumosa.com/taiwan/viewtopic.php?p=662626#662626
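If you want to work through a long stretch of ranks under that 300-at-a-time limit, the ranges can be worked out mechanically. This is only a sketch of the arithmetic; `rank_batches` is a made-up helper, and it does not talk to the Sinica site at all, since nothing here assumes how that page's form works.

```python
def rank_batches(first, last, size=300):
    """Yield (start, end) rank pairs, each covering at most `size` ranks."""
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield start, end
        start = end + 1

# Ranges to enter by hand, one query at a time.
for start, end in rank_batches(1, 700):
    print(start, end)
```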

[quote=“Dragonbones”]Having opened the file, I can report to others who’ve not yet done so that it is a 辭 ci2 word list rather than a 字 zi4 character list. The order of the single character items in it will therefore differ significantly from a similar zi4 list, in which, for example, you’ll probably get 的, 了, 個 and similar graphs at or nearer to the top.

Both kinds of list are useful for those wishing to memorize characters or ci2 words in order of frequency, make flashcards, and so on.[/quote]I think that for a number of purposes word lists such as this one are actually more useful. The meaning of a Chinese word makes a much more memorable “chunk” of knowledge than the meanings of most characters do. I’ve found that an effective way to learn the multiple meanings and pronunciations of characters is first to learn various words which use those meanings/pronunciations. I do this one meaning/pronunciation at a time to avoid the dangers of confusion and interference described by writers on vocabulary acquisition such as Joe Barcroft and Paul Nation. By the time I get to studying the character in isolation, that study acts as consolidation and review of the knowledge I’ve already built up bit by bit.

Whether or not these kinds of wordlists are used as direct sources for vocabulary memorization (there are various reasons why not all of the most frequent vocab is the most useful at first), they are certainly useful checklists for learners beyond initial stages, to ensure that they’re getting sufficient coverage of the most frequent words.

Chinese Words (James Erwin)
Available in Taiwan, or at least it was in the past.

:soapbox:
If you like word lists, fine, but if you are interested in obtaining a serious degree of competency in a foreign language, you just need to learn a whole lot of words. And very soon you’ll find that your interests/specialty have more to say about what words you should learn than any frequency list.

[color=green]Post edited by moderator to reduce url length - Taffy.[/color]

[quote=“um”] :soapbox:
If you like word lists, fine, but if you are interested in obtaining a serious degree of competency in a foreign language, you just need to learn a whole lot of words. And very soon you’ll find that your interests/specialty have more to say about what words you should learn than any frequency list.[/quote]That’s a fair point. But I think a lot of people are interested in getting to a decent intermediate level of general language use. Getting to that level requires a great deal of time and study. So anything that can make the process more efficient has to be worth doing.

The more words someone understands in speech or writing, the more likely it is that he or she will be able to make a good guess at the meaning of the words he/she doesn’t know. And of course more frequent words are more likely to turn up than less frequent ones, so it makes sense to study those first.
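To put a rough number on that, the count column in the spreadsheet can be used to estimate what share of all counted occurrences the top N entries account for. A minimal sketch follows; `coverage` is a hypothetical helper and the counts are invented, so the printed figure says nothing about the real list.

```python
def coverage(counts_by_rank, top_n):
    """Fraction of all counted occurrences accounted for by the top_n entries."""
    total = sum(counts_by_rank)
    if total == 0:
        return 0.0
    return sum(counts_by_rank[:top_n]) / total

# Invented counts, already sorted from most to least frequent.
sample_counts = [500, 300, 100, 50, 30, 20]

# Share of occurrences covered by knowing only the two most frequent entries.
print(coverage(sample_counts, 2))
```

Run against the real count column, this would show how quickly coverage flattens out as you move down the list, at least for the kinds of text the corpus was built from.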

More frequent words also tend to express common meanings: the functions that people use on a day-to-day basis. So again it makes sense to study those first.

Something to be a little careful of, however, is that many of the very most frequent words – the first few hundred – are grammatical/function words rather than lexical/content ones. These words have a lot of possible senses and it’s harder to acquire the correct use of that kind of word anyway. So I think that for someone using frequency word lists as a primary learning aid, for example for making flashcards, it might be better to concentrate on lexical/content words first. Many of these latter words are two-character combinations, and in general it makes sense to study two-character words first anyway, to restrict the range of meanings, resulting in more learnable “chunks” of knowledge as I described above.

I do agree with you to some extent about specialised language. Each person wants/needs to learn a slightly different range of language, even before they get to the stage of learning specialised scientific/academic/technical vocabulary. For example I know quite a bit of food vocabulary but very little sports vocabulary! But in terms of getting easily to a decent intermediate level of general language use, learning frequent words first can play an important role.

That’s an excellent point. How does one get to that ‘decent intermediate level’? By memorizing word lists? Probably by using textbooks. Let’s say – just because everyone here knows them – one uses Shida’s book I and books IIA and IIB, not even getting to book III. I think one needs the equivalent of that grammatically to have a decent clue. That is also just enough to painfully take the plunge into whatever authentic materials one wishes to read. Do book III and/or a newspaper book after book IIB and one will be well-prepared to start off on the road of adult reading.

[quote]That’s an excellent point. How does one get to that ‘decent intermediate level’? By memorizing word lists? Probably by using textbooks. Let’s say – just because everyone here knows them – one uses Shi-Da’s book I and books IIA and IIB, not even getting to book III. I think one needs the equivalent of that grammatically to have a decent clue. That is also just enough to painfully take the plunge into whatever authentic materials one wishes to read. Do book III and/or a newspaper book after book IIB and one will be well-prepared to start off on the road of adult reading.[/quote]Very briefly, as I don’t want to hijack Charlie Jack’s thread:

1 Using textbooks can certainly form a useful part of beginner and pre-intermediate study. However I feel that there are many things that could be made more efficient both in terms of the textbooks themselves and also in what is done concurrently with textbook study and perhaps before it.
2 For a variety of reasons, it’s better for beginners to do more formal vocabulary study than grammar study.
3 I think that the vocab in the Shida books (also in the Donghai ones) is pretty useful and much of it is certainly “core language”.
4 For learners, at the very least, the frequency information can fill in any gaps, and act as a useful checklist.
5 For textbook developers and course designers, decent corpus data could and should be a very important consideration.

[color=green][i]I’d like to direct you to a more appropriate thread to begin that discussion. Try this, for example:
http://www.forumosa.com/taiwan/viewtopic.php?t=10157&postdays=0&postorder=asc&&start=0

DB, LC co-mod[/i][/color]

I’m sorry, but I don’t understand your ways here. The link you have given me is a lengthy general thread on learning Chinese. This is a thread about a particular word frequency list. The OP for this thread was just giving us info and so there’s nothing to hijack. Certainly my post was not 100% suited to this thread, but would be out of the blue and pointless on the link you gave me, the more so as I was responding to someone else’s post here on this thread. I was merely expressing my opinion, having also added apparently novel info to the thread in the form of a book on the subject that is available in Taiwan. If my posts should go there, please sir, tell me why the whole thread shouldn’t go there?

Thanks for the link.

Thanks Charlie Jack. The link also has lists for Cantonese as well as other languages. That’s really nice!

And thanks from me, too, Charlie Jack.

I think it’s a very useful checklist for finding gaps in one’s vocabulary. I’m going to run all the way through it as I find time, and note down whatever I’m not familiar with. I expect I’ll find quite a lot toward the end of the list. In the last half hour, I’ve gone through the first 2,000 entries, and have found three that I couldn’t read and understand at once: 颇 (which I’ve come across and looked up countless times, but for some reason just doesn’t stick in my head), ㄟ (the mild exclamation ei4, which my computer doesn’t seem able to produce), and 胡適 (I’ve probably encountered that when translating speeches, but haven’t remembered it or deemed it worth remembering - but as it’s so high on this list, I must make sure to remember it now).

I’m quite looking forward to working on through the list!
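For anyone who keeps their known vocabulary in a file, the same gap-hunting can be done mechanically. This is a small sketch only: `find_gaps` is a made-up helper, the words are assumed to have been exported from the spreadsheet in rank order, and the positions in the sample are invented rather than the entries' real ranks.

```python
def find_gaps(ranked_words, known, limit=None):
    """Return (rank, word) pairs from the top `limit` entries not found in `known`."""
    top = ranked_words if limit is None else ranked_words[:limit]
    return [(rank, word) for rank, word in enumerate(top, start=1)
            if word not in known]

# Invented ordering, for illustration only.
ranked = ["的", "是", "颇", "胡適"]
known = {"的", "是"}

print(find_gaps(ranked, known))
```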

When you’re ready, PM me with your email address and I’ll mail you a Word file listing the top 13,000. See how many of THOSE you know! :laughing:

But Charlie Jack’s Academia Sinica list has 20,000, doesn’t it? So surely that must include everything on your list and more?

But if it’s different or better in any way, I’d certainly love to look at it too. If so, I’ll PM you for it as soon as I get to the end of the AS list (though I expect I’ll find myself drowning in a sea of unfamiliar vocabulary well before I get to the end of the first list, so I’ll already have my work cut out endeavouring to patch up from that).

Thanks, DB. I downloaded the zip myself and this list of 20,770 proves my point – to me at least :slight_smile: – about these frequency lists (there are many and I’ve spent some time evaluating them).

Number 20,763? 活該

Number 20,752? 拉肚子

Number 20,720? 倒霉

Number 20,709? 油膩

Number 20,702? 剪刀

dragon bones? Number 13,680

I could go on and on. As you can tell, I got these examples by just starting at the bottom of the list and going up, with the exception of dragon bones :slight_smile:
There are other quite common ones in their midst even, but I decided to choose only super common words. Spend some time with the list and you will see that I did not find aberrations. Or, I did, but the list is so full of aberrations that they hardly qualify as aberrations.

There are two problems with these lists (aside from the fact that every student of every language seems to start them, and I have yet to meet one who, say, studies the first 13,000 until they know them all. But stick with a language and you do learn them all; at least that’s been my experience).

  1. They’re inaccurate, as I’ve shown above. Why? I think a lot of it is because all these lists are massaged. Ironic, since the massager did this thinking they were being helpful, but of course one cannot combine newspapers, poetry, fiction, and other genres and expect to come up with a useful list, unless perhaps your target audience spends the same proportion of time reading each genre as it is allotted weight in your list.

Sure, all the words may be useful in the long run, but the whole idea of a frequency list is to save you time by picking out the words you need to know most rather than the words that aren’t so common, right?

  2. Let me try to restate my earlier point more clearly. If you are or wish to be a professional translator, then such a list might have some use, especially if you are not a specialized translator, though by the time you are ready to get paid to translate, this list is not of any more use than a small dictionary, methinks (not guessing).

But if you are learning Chinese for some specialized reason, including “I wanna chat with my Chinese mates”, then you’re better off using some textbooks to get your first couple of thousand words and from there making your own purpose-built word and character lists, either theoretical or made from words you come across and don’t know.

Don’t believe me? Look at what I found just in the last 700 or so words out of almost 21,000. The whole list is like that. PM me when you’ve memorized the list :slight_smile:
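The genre-weighting point in item 1 can be made concrete with a toy calculation. The sketch below assumes you had per-genre relative frequencies, which the published spreadsheet does not provide (it gives one overall count per word), so every number and the helper name `reweight` are hypothetical. It simply combines the per-genre figures using a reader's own mix of genres.

```python
def reweight(per_genre_freq, reader_mix):
    """Combine per-genre relative frequencies using the reader's genre proportions."""
    combined = {}
    for genre, share in reader_mix.items():
        for word, freq in per_genre_freq.get(genre, {}).items():
            combined[word] = combined.get(word, 0.0) + share * freq
    # Most frequent first under this particular mix.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical relative frequencies, invented for illustration.
per_genre = {
    "press": {"政府": 0.004, "剪刀": 0.0001},
    "conversation": {"政府": 0.0002, "剪刀": 0.001},
}
press_heavy = {"press": 0.9, "conversation": 0.1}
chat_heavy = {"press": 0.1, "conversation": 0.9}

print([word for word, _ in reweight(per_genre, press_heavy)])
print([word for word, _ in reweight(per_genre, chat_heavy)])
```

With these invented figures the two mixes put the words in opposite orders, which is one plausible reason an everyday word like 剪刀 can sit so low on a press-heavy list.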

Joe Sax, I find grammar far more important than vocabulary for the first few years of study. Once I have a mold (grammar) I can pour (vocabulary), set, and repeat over and over again. Vocab alone leaves me wandering alone and blind in the middle of the road: I can hear the horns, smell the exhaust, but I don’t know where to go.

Ok, I know, dead horses should be buried, not beaten, and I don’t want the mods to hate me. Enjoy your list!

Again, you’re welcome to everybody, and I hope it helps.

I can’t disagree with um’s point about the problems, though, especially since I recently tried to make a frequency list of children’s spoken English using someone else’s corpus and software, and I encountered similar puzzles to the ones um mentioned in the post above.

I can’t figure out how frequency lists can come out that way, unless it’s just that languages are so vast, and that it’s quite difficult to capture them in a truly representative corpus without overcoming obstacles such as cost-benefit issues, ethical and maybe even legal issues.

Anyway, it’s one more dish in the potluck dinner.