How would you build a proper noun recognition app in Chinese?

Here’s a general observation from learning languages. One of the things that always bugs me when I’m trying to read Chinese texts is that I don’t know the proper nouns: people’s names, place names, movie titles, organizations, and so on. At the moment I’m putting together some software to help find proper nouns in English, since I’ve noticed my students have the same problem, but for my own interest I’d actually like to make one for learning Chinese. My questions for the Chinese learners here: have you noticed problems identifying proper nouns when reading Chinese? What specifically do you find difficult? And what would you like to see in a proper-noun-finding app for Chinese learners?
As a side note, I usually find it fairly easy to identify proper nouns in German and Spanish (which I also sort of know), but Chinese is just harder for me. Maybe it’s the lack of capitalization, or maybe “Johann” is simply more obvious than “魯迅” to a guy who went to school in America. Any thoughts on this subject?

1 Like

I’ve never had a problem with this. Or if I have it’s not specifically to do with proper nouns.

It’s tough, and you’re never going to know all the names. Sometimes places have English/Romanized names, acronyms, abbreviations, etc. so IME the best way is to look at context to try and figure out which characters are part of something’s name. Personally it’s never been much of an issue.
If you’re trying to build an app or something, then it looks like you want named entity recognition (NER), which plenty of language-processing packages can do.
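
For example, here’s a quick sketch with the jieba package: its part-of-speech tagger flags likely person names (nr), place names (ns), and organizations (nt), so filtering on those tags gives you a crude proper-noun finder.

```python
# Crude proper-noun finder using jieba's POS tagger: it marks likely
# person names "nr", place names "ns", and organizations "nt".
# Coverage is imperfect, but it's a cheap baseline.
import jieba.posseg as pseg

PROPER_NOUN_TAGS = {"nr", "ns", "nt"}

def find_proper_nouns(text):
    return [word for word, flag in pseg.cut(text) if flag in PROPER_NOUN_TAGS]

print(find_proper_nouns("魯迅在北京大學教過書。"))
```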

1 Like


I don’t have much of a problem identifying proper nouns. My problem is that I don’t know what the proper nouns are in my language, or I have no idea what they refer to.

1 Like

There’s a BERT-Base, Chinese model available for download. BERT can perform text tagging, and that’s pretty much what you want. See the GitHub repo for a tagging example and for how to enable it with the BERT-Base, Chinese model.
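
The repo’s tagging example is TensorFlow-based; as a rough sketch of the same idea using the Hugging Face transformers library instead, assuming a checkpoint already fine-tuned for Chinese NER (the model name below is an assumption; the base model alone won’t tag anything without fine-tuning):

```python
# Sketch: token tagging (NER) with a BERT-based Chinese model via
# Hugging Face transformers. The checkpoint name is an assumption;
# any Chinese token-classification model should slot in here.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ckiplab/bert-base-chinese-ner",  # assumed fine-tuned checkpoint
    aggregation_strategy="simple",          # merge sub-tokens into entities
)

for entity in ner("蔡英文在台北發表演說。"):
    print(entity["entity_group"], entity["word"], entity["score"])
```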

1 Like

I seem to recall that in the mainland they put a line underneath the name (or a line alongside it if printed in vertical columns).
I remember being excited when I first read a string of characters, only to be disappointed when it made no sense.
"What the hell is is “horse-hot-forest-door-meat?”
“You know, that famous American actress: ma-ri-lin-men-rou.”

3 Likes

Yeah, I think overall my problem is that I know a lot of individual characters, but I don’t know the character groupings well enough (all the 2-4 character sets that make up one “word”). I have a really hard time parsing the sentences. I was wondering how other people felt about it.

Thanks for this. It seems that Chinese in general is harder for machine-learning models. Primer.ai claims a proprietary system that scores 95.6% on NER, but only in English. https://primer.ai/blog/a-new-state-of-the-art-for-named-entity-recognition/
For Chinese, the best I can find is something in the high 80s.

Do you have any experience with BERT? I’m trying to keep my model small so I can deploy it. At the moment I’m using spaCy’s zh_core_web_trf, which is about 400 MB but needs PyTorch to run, so it comes in at over 1 GB. That still seems smaller than using BERT directly, though (especially if I also add in TensorFlow). I’d love to hear about a way to shrink the size!
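
For reference, something like this is what I mean. Excluding the pipeline components I don’t use saves a little, though the transformer weights still dominate the footprint (component names assume spaCy v3, so treat this as a sketch):

```python
# Load zh_core_web_trf but exclude pipeline components we don't use.
# The transformer weights still dominate RAM, so this mostly trims
# compute rather than memory.
import spacy

nlp = spacy.load("zh_core_web_trf", exclude=["tagger", "parser"])

doc = nlp("我還沒看魯迅的小說。")
for ent in doc.ents:
    print(ent.text, ent.label_)
```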

This is also what happens to my students in English (some can’t identify the nouns, some don’t know what they are). Do you think it would be useful to put a link to a wikipedia page or a google search for each noun?

I used to have this problem: I would read a sentence and not know what it meant, because there were always names I would stumble over and not recognize. Just study some common names, and once you get used to the sentence structures it will be pretty easy to spot proper names/nouns when they occur.

Which language is that?

tandish

Ooh, I’ve always wanted to learn that! You should open a buxiban… :woman_teacher:

One way to do it might be to train the large model first, then use the large model to train a smaller model to give the same results. There are many papers describing this; one is Train Large, Then Compress.
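
Very roughly, the small model is trained to match the big one’s softened outputs as well as the real labels. A sketch of that loss in PyTorch (the temperature and mixing weight here are illustrative):

```python
# Minimal knowledge-distillation loss: the small "student" learns to
# match the softened output distribution of the large "teacher".
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```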

In TensorFlow and Keras I think there are a bunch of pruning and sparsity APIs, such as post-training quantization, quantization-aware training, pruning, clustering, and tensorflow_model_optimization.sparsity.keras.prune_low_magnitude.

Quantization reduces the precision of your weights (and activations) from 32-bit floating point down to just 16 or 8 bits. If it doesn’t hurt your performance too much, it’s actually the fastest way to compress your model.
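
In TensorFlow, for example, a post-training quantization pass is only a few lines with the TFLite converter (the paths here are placeholders):

```python
# Post-training quantization with the TFLite converter: weights are
# stored in 8-bit instead of 32-bit float, shrinking the file roughly 4x.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```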

Pruning removes low-magnitude connections, leaving the weights sparse; it can also cost some performance.
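
The prune_low_magnitude wrapper mentioned above is used roughly like this; the toy model and the schedule numbers are just illustrative:

```python
# Magnitude pruning with the TensorFlow Model Optimization toolkit:
# wrap a Keras model so low-magnitude weights are gradually zeroed out
# during fine-tuning. The model and schedule values are illustrative.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=1000,
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule
)
# Training then needs the pruning callback, e.g.:
# pruned_model.fit(..., callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```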

Not familiar with pytorch yet, but there are probably similar APIs to use.

I think it can be tricky to separate proper nouns from common nouns.
Take the name of Taiwan’s president, 蔡英文, for example. It’s a proper name, but it also contains 英文, an ordinary word meaning “English”.
It’s hard to know which is which unless you have the whole phrase, and even then you may still need to know the context or the topic being discussed…

Hi hansioux, thanks for all of the readings. They fall into the area of things I have heard about but never actually implemented and don’t yet understand :slight_smile:
The problem I am struggling with is that I want to be able to run my model from a free hosted API (like Heroku), but those generally limit you to 500 MB of RAM, which is not even enough for TensorFlow or PyTorch, never mind the model. Do you happen to know if it’s possible to serve predictions from a model without installing the whole TensorFlow package?
On a side note, I did get the model running on Amazon EC2 which offers more storage, but I still want to be able to host it without using so much compute, especially as I have a lot of different APIs I want to build eventually.
Finally, a quick apology to anyone who just wants to read about Chinese learning, I know my tech issues are not quite the topic here!

Yes, this is the type of trouble I have. In this case it’s OK, because I know who 蔡英文 is, and I also know 蔡 is a surname (it’s the same as my co-worker’s). But without that background knowledge, I would see 英文, jump immediately to “English”, and then be stuck wondering what 蔡 meant. Especially if the name were in the middle of a phrase, like 我還沒看蔡英文的論文
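
Out of curiosity, that sentence makes a nice test case for a segmenter: you can check whether it keeps 蔡英文 together as a name or splits off 英文. A quick probe with jieba, just as an illustration (no promises about what it outputs):

```python
# Quick probe: how does a segmenter parse the ambiguous sentence?
# Does it keep 蔡英文 together as a name, or split off 英文 ("English")?
import jieba.posseg as pseg

for word, flag in pseg.cut("我還沒看蔡英文的論文"):
    print(word, flag)
```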

1 Like

In that case, you might want to look at ONNX or other deep-learning network exchange formats and their compilers, such as ONNC. You would train your model as you normally would, and after getting an acceptable model, convert it into a network exchange format and run it on a more limited machine using a program written in C, C++, or C# that calls the ONNX Runtime API. You could also run it from Python, but I haven’t tried that.
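
For the Python route, the onnxruntime package is a small install compared to TensorFlow or PyTorch, and inference is just a session and a run call. A sketch, with placeholder path, input name, and shape:

```python
# Running an exported ONNX model with onnxruntime only; no TensorFlow
# or PyTorch needed on the serving host. Paths and shapes are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # placeholder path

input_name = session.get_inputs()[0].name
dummy = np.zeros((1, 128), dtype=np.int64)    # shape depends on your model

outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```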

My experience was getting a facial-recognition model to run on an embedded board; after quantization and conversion to the ONNX format, the model was around 1 MB. By the sound of it, you might not need to scale down so dramatically. You might get away with just compiling the model to ONNX without quantization.

Oh, and a performance hit is to be expected.