How does one organize data in Chinese?

For example, I have a list of 1000 words, and I want to put them in some type of order, either alphabetical (pinyin) or by radical. How do I do this?

Using Microsoft Word, I’ve found out how to arrange by stroke count, but I can’t find anything else.

My next question would be, how does pinyin alphabetical order work for characters with more than one pronunciation / “po yin zi”?

Also, how do they organize data in China or Taiwan? Are there any other methods?

Thanks in advance.

First, it’s better not to call those ‘radicals’, because that term is based on dissection of European words which have a semantic root (radix); however, the dictionary keys in question here are not necessarily semantic in role. The term “bushou” is preferable.

Anyway, you could put the data in a table, with the pinyin in a column next to it, then sort the table by the pinyin column. For characters with more than one pronunciation, you could enter the character in a separate line for each pronunciation, and then it would show up in both places.
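The table idea above could be sketched in a few lines of Python, for anyone who would rather not do it in a spreadsheet. This is only a rough illustration: the rows are sample data typed in by hand, and the `pinyin_key` helper is a made-up name. A character with two readings simply gets two rows, so it shows up in both places after sorting.

```python
# Each row is (character, pinyin with tone number, English gloss).
# Sample data entered by hand for illustration only.
rows = [
    ("王", "wang2", "Wang (surname)"),
    ("李", "li3", "Li (surname)"),
    ("曾", "zeng1", "Zeng (surname)"),
    ("曾", "ceng2", "once, already"),  # second reading gets its own row
    ("陳", "chen2", "Chen (surname)"),
]

def pinyin_key(row):
    """Sort by the syllable's letters first, then by its tone number."""
    pinyin = row[1]
    syllable = pinyin.rstrip("012345")  # letters only, e.g. "wang"
    tone = pinyin[len(syllable):]       # trailing digit, e.g. "2"
    return (syllable, tone)

sorted_rows = sorted(rows, key=pinyin_key)
for char, pinyin, gloss in sorted_rows:
    print(char, pinyin, gloss)
```

Splitting the tone off the syllable keeps all tones of one syllable together instead of letting the digit interfere with the alphabetical comparison.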

Number of strokes is another way; sorting by bopomofo is another. It all depends on what you like to use, and who your audience is.

Are there any reliable pinyin annotators that can handle a few hundred characters at a time?
How do they handle “po yin zi”?
Is there an automatic method to sort by “bushou”?

Basically, I want to put these characters in some type of order so that I can look up individual characters conveniently.

Um…while I respect your scholarship, DB, I can’t think of any situation where I would drop the single Chinese word “bushou” into an otherwise English discussion about Chinese characters, when everyone pretty much knows what the idea of a ‘radical’ means for Chinese. The term has a separate meaning when applied to Chinese and seemingly works just fine. Let’s not get into the whole “you can’t use x that way in Pinyin because that’s not what it sounds like in my language” thing, okay?

Back to the OP – if you’re organizing your own words for learning, I’d go with something like categories. I have SuperMemo categories for my flash cards called things like “Medical Marvels”, “American Culture”, “Chemistry”, and so on (okay, a little too much Jeopardy when I was young) …they’re probably only logical to me, but if you’re going to be reviewing, first it’s useful to review related semantic items together, and second, you don’t necessarily need a system where every piece of data is ordered with relation to every other one, unless you’re trying to write your own dictionary or something like that. Or have I missed the point of what you’re trying to do?

If you want them all together to find individual characters (although I can’t figure out why you wouldn’t just use, um, a dictionary!) Dragonbones’ method sounds like the most feasible. I automatically enter three lines for every vocabulary item I add to my hoard – character, Romanization and English. That way I can sort any way I like. I do the same thing with the terms I put into the software I use in the booth while interpreting, because otherwise it’s impossible for me to search effectively on Chinese characters while doing SI. I can manage to do some searching if it’s on alphabetical stuff.

Sorry to go off-topic with this, but I think using the term “bushou” in the middle of an English discussion is fine. Although I agree with you that the term is well established, I’ve also seen first-hand the confusion that results from this gross mistranslation of the term bushou. It leads to all kinds of misunderstandings about the role that bushou play in character construction. I don’t think there’s anything wrong with me suggesting we correct this error in translation. Changing to a simple transliteration is not difficult to do, and avoids those problems. Translating it as ‘section head(er)’ is also viable. I think it’s a valuable change to suggest, and I do intend to politely raise the suggestion from time to time – folks are perfectly free to just ignore me if they disagree, in which case I won’t push the point further. :wink: I’m done now. :laughing: Cheers! DB

I don’t see where it’s an error in translation. There is an English word “radical” that, among its other uses, refers to the concept of “bushou” in Chinese.
It doesn’t matter what the etymology of the English word is; it matters how it is generally understood today and specifically how people understand it when they use it to refer to the idea known in Chinese as “bushou”. Should we not use the term “gay” to translate “tongxinglian” because the English term “gay” originally meant “happy”?
And anyone who’s talking about “bushou” or “radicals” has been told at some point what they really are, or is being told, or something of the sort.
Not to belabor the point, but could you, just maybe, be a little too far into the whole academic thing these days? :wink:

:slight_smile: The origin of the usage was a mistranslation rather than a loan along the lines of your example. Your point that it is established usage now is reasonable, and I have already acknowledged that. However, the now established usage remains problematic because it leads to confusion, which is the primary reason for my objection. The situation with ‘radical’ might be better compared to a situation in which many people understood that ‘gay’ now also means homosexual, but some people also think that ‘gay’ specifically refers only to happy homosexuals. Believe it or not, this is the situation I’ve encountered with the term radical – that is, perhaps because the radix is the semantic root, but perhaps also due to a general erroneous tendency by many to refer to bushou and the remaining portion as radical and phonetic respectively (implying a semantic role for all ‘radicals’), some people are misled into thinking that 部首 bu4shou3 (due to this mistranslation as ‘radical’) are the portion imparting meaning to the character. Had I not encountered this numerous times, I would not be quite so adamantly opposed to the term. Of course, people are free to use what they wish.

Oh, I wish that were the case. But I think that only a subset of the public has been properly told. Witness for example how unbelievably awful this description on Wikipedia was:

Er, um… :blush:


PS – This is an interesting point of discussion in its own right; might I suggest moving it to a separate thread? (I hate to go off topic for so long; sorry OP!!!) I had started one earlier in which I ranted against ‘radical’, but finding that may have to await the reloading of the Search database in a few days.

I know what a radical is, and I can see nothing it can be confused with, but I have no idea what a “bushou” is.
There can be confusion about what a “word” is, but not a radical.

As for what order to sort items in, you could also use the CD shop method and put them in randomly.

I don’t want to seem rude, but most of my questions are unanswered.

I have a list of approximately 1000 Chinese surnames I have compiled from various sources. Some characters I know the pronunciation for, others, I don’t. Some of the characters are pretty obscure - not in most dictionaries or character sets.

  1. Is it possible to use a pinyin annotator to add pinyin next to the characters?
  2. Is there a program that can handle 1000 characters at one time?
  3. How do pinyin annotators handle “po yin zi”?
  4. How can I be sure of the accuracy without manually checking each character?
  5. Can a pinyin annotator handle obscure characters?

I’ve seen that the “organize by stroke order” option in Microsoft Office 2000 works pretty well, but again, some of the more obscure characters are not sorted properly. Is there any way around this?

Also, if there are programs that can recognize the number of strokes in a character, or somehow check the pronunciation, is it possible to check the radical? Is this something that is included in the basic info of each character, or hasn’t been added yet?

Thanks, in advance.

Really, surnames are usually pretty well accounted for although given names less so for geomantic reasons.

No. There will be too many errors on characters with multiple readings.

2. Is there a program that can handle 1000 characters at one time?
Yes of course.

They don’t. That’s why this won’t work.

You can’t.

They should handle them better than more common characters since they should not have so many variant readings. But not if they are not in a standard character set such as Big5.

No. The character tables and sorting are built in at the OS API level in modern operating systems. They are based on standard character sets.

If I remember correctly, yes. This information is built into the characters. You will need to know how to access the underlying API to exploit this effectively. You should also have a look at some of the Perl modules on CPAN. People have done a lot of this work already.
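As one possible way to get at that per-character information without touching an OS API: the Unihan database (mentioned elsewhere in this thread) publishes tab-separated text with fields such as `kRSUnicode` (radical number and residual strokes) and `kTotalStrokes`. The Python sketch below is only an illustration under assumptions: the `SAMPLE` lines were typed in by hand in that format and should be checked against the real data, and the function names are made up.

```python
from collections import defaultdict

# Hand-typed sample in the Unihan "code point <TAB> field <TAB> value" layout.
# Treat these values as illustrative; use the real Unihan files in practice.
SAMPLE = """\
U+738B\tkRSUnicode\t96.0
U+738B\tkTotalStrokes\t4
U+674E\tkRSUnicode\t75.3
U+674E\tkTotalStrokes\t7
U+6797\tkRSUnicode\t75.4
U+6797\tkTotalStrokes\t8
U+5F35\tkRSUnicode\t57.8
U+5F35\tkTotalStrokes\t11
"""

def parse_unihan(text):
    """Return {character: {field: value}} from Unihan-style lines."""
    info = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+738B" -> "王"
        info[char][field] = value
    return info

def radical_stroke_key(char, info):
    """Sort key: (radical number, strokes beyond the radical)."""
    first = info[char]["kRSUnicode"].split()[0]  # a field may hold several values
    radical, residual = first.split(".")
    return (int(radical.rstrip("'")), int(residual))  # "'" marks a simplified radical

info = parse_unihan(SAMPLE)
names = ["張", "王", "林", "李"]
by_radical = sorted(names, key=lambda c: radical_stroke_key(c, info))
by_strokes = sorted(names, key=lambda c: int(info[c]["kTotalStrokes"].split()[0]))
print(by_radical)
print(by_strokes)
```

Characters missing from the data would raise a `KeyError` here, which at least makes it obvious which ones still need to be looked up by hand.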

I think it would help if you would tell us what you are trying to do.

Wow Feiren, thanks for your help.

Where can I learn more about organization by radical? Basically, I know nothing about programming.

I just want to make lists of information on various topics and be able to sort them in a logical manner. For example, a list of all the Chinese surnames, and then have a way to organize them by stroke count, radical, and pronunciation.

Other lists I’d like to make, or have access to, would be things like geographical names, names of famous people (Chinese and non-Chinese), names of all of the birds, all of the fish, all of the mammals, etc.

So, basically I would be making my own lexicon?

I suppose you would essentially be making your own lexicon, but…if you don’t really know how to look up characters and how to pronounce them, are you planning on offering this information to others, or just using it yourself? I guess I’m still at something of a loss why you would not merely use an existing dictionary.

One program that might be of some assistance is “Bamboo Helper” which should still be floating around the Internet in old versions; it used to change characters into Pinyin with tone numbers. I guess NJStar does something similar as well.

There is a net-based converter to change characters into tone-numbered pinyin - here. It will handle 1000 characters no problem, and poyinzi are dealt with by giving all readings in alphabetical order. In the options part of the form, you will need to select “Show pinyin in text”, then enter your string of characters. Be careful to enter each character on a separate line, otherwise the program will attempt to parse words out of your string of characters and you’ll lose some of the alternate readings (e.g. 重要 written together on the same line will be parsed as zhong4yao4, missing the alternate readings of chong2 and yao1 respectively).
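To see why the one-character-per-line advice matters, here is a toy Python model of the behaviour described above. It is not the converter's actual code: the two lookup tables and the `annotate_line` function are hypothetical and hold only the 重要 example from this post.

```python
# Hypothetical lookup tables containing just the example from the post.
CHAR_READINGS = {"重": ["zhong4", "chong2"], "要": ["yao4", "yao1"]}
WORD_READINGS = {"重要": ["zhong4", "yao4"]}

def annotate_line(line):
    """Return [(character, [readings])] for one line of input.

    A line that matches a known word gets only the readings used in
    that word; otherwise every character gets all of its readings,
    listed in alphabetical order.
    """
    if line in WORD_READINGS:
        return [(ch, [r]) for ch, r in zip(line, WORD_READINGS[line])]
    return [(ch, sorted(CHAR_READINGS.get(ch, []))) for ch in line]

together = annotate_line("重要")                      # parsed as a word
separate = [annotate_line(ch) for ch in ["重", "要"]]  # one character per line
print(together)
print(separate)
```

Entered together, the word match hides chong2 and yao1; entered on separate lines, both readings of each character survive.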

Example of how the software works:



I used this program to convert a huge frequency chart of nearly 14,000 characters into pinyin. It occasionally fails to recognise characters with very low frequency rates (i.e. those numbered over 7,000 in my list, but no more than about 200 in the whole list) and also sometimes does not note some of the very rarely used pronunciations for some poyinzi. So with that caveat, I think it’s still the closest thing to what you’re looking for. Good luck!

Thank you Taffy, that’s a great tool.

Ironlady, I know how to look up characters and pronounce them, but when you’re dealing with 1000 characters, many of them not in dictionaries or character sets, it’s nice to have a more convenient method than searching for each individual character. Some of the characters don’t even have phonetic info in the unihan database.

Just curious what set of data you’re organizing if most of the chars aren’t in dictionaries or in character sets…how did you manage to input it in the first place? Handwritten input? But then it would have to be in the character set to display, right? Or are you using some sort of character editing thingie?

Hmmm, I’m wondering:

  1. where you’re getting such rare characters from, and
  2. what dictionaries you’re using, that are so limited.

You might want to use automated pinyin converters for the majority of those, but then look up the remainder (those not recognized by the converters) in a better dictionary. I’ll wager that the 漢語大字典 will have virtually all of the graphs in it. This tome contains many poyinzi pronunciations for graphs, some being obsolete, but on occasion you might find that one of the pronunciations does allow you to input the character in question via pronunciation.

When no pronunciation info is available, you’re stuck with stroke count or bushou for your sorting.
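That two-step approach could look roughly like this in Python. The records are made-up sample data; a `pinyin` of `None` stands for a character the converter did not recognize (the two marked that way are ordinary characters used only as stand-ins). Everything with a reading is sorted by pinyin, and the leftovers are sorted by stroke count and kept as a to-do list for the big dictionary.

```python
# Sample records; pinyin=None means "the converter gave nothing back".
records = [
    {"char": "王", "pinyin": "wang2", "strokes": 4},
    {"char": "張", "pinyin": None, "strokes": 11},  # stand-in for a rare character
    {"char": "林", "pinyin": "lin2", "strokes": 8},
    {"char": "李", "pinyin": None, "strokes": 7},   # stand-in for a rare character
]

# Characters with a reading: plain string sort on the pinyin column.
known = sorted((r for r in records if r["pinyin"]), key=lambda r: r["pinyin"])

# Characters without a reading: fall back to stroke count.
needs_lookup = sorted((r for r in records if not r["pinyin"]),
                      key=lambda r: r["strokes"])

ordered = known + needs_lookup
print([r["char"] for r in ordered])
print("look up by hand:", [r["char"] for r in needs_lookup])
```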

Dragonbones, when you say graphs, are you referring to Chinese characters?

In better dictionaries, like the han yu da zi dian, are there characters without pronunciation info?

So far the best dictionaries I’ve been able to find in print, or online, have been CEDICT and the unihan database.

Yes, sorry! “Chinese characters” is just too long to type sometimes. :smiley:

I’m not quite sure what you mean by this. Hanyu Da Zidian is a large printed Chinese-only dictionary in which you would typically look a graph (i.e., Chinese character) up by bushou (aka radical). There is also a stroke count index in the back. The pronunciation info in my version (which has been converted to the more complex traditional characters) is in bopomofo; I imagine that in the original simplified PRC edition, the pron. info is in pinyin. This big tome contains many obsolete characters, and also mentions oodles that are surnames (noting that they are 姓, and perhaps giving a textual source). Many of these surnames are so rare or obsolete that I assume you’ll find yours in this book. If you are at a beginning level in Chinese, this book would be way too difficult, btw (I don’t know your level). At beginning to intermediate levels, a Chinese-English dictionary like the ABC, Far Eastern, etc. would be more appropriate, but while these contain uncommon surnames, they may not contain the unbelievably rare and obsolete ones.

I’m not very familiar with the CEDICT and the Unihan database, but I’m sure some of the other posters here can tell you how those rank in comparison to some of the good printed dictionaries we’re familiar with. Chances are that the only reason you’ve got characters that ‘aren’t in the dictionary’ is that you haven’t been turned on to the right dictionaries yet. I seem to remember the two you mention being deficient but don’t recall the details. We have good threads on that topic, and I’d suggest redirecting this part of the conversation to those threads – but the temporary lack of a search function means we’ll have to sift through the Learning Chinese list by hand to find them, and I’ve got to run right now. In the meantime check out this:
and perhaps direct the dictionary question there; this will help keep our threads organized.

Also, another poster once mentioned this Wiki list of surnames, which, while not complete, might be of some interest to you:

There is some other info online in this area too, e.g.,
just FYI

On surnames, there is an interesting statistic cited here: [quote]The Chinese inhabitants of Taiwan include those with single surnames and those with double-barrelled surnames, and there are total of 1,027 surnames in all, over 80% of which are quite rare. [/quote]

And try this too: