OCR that can handle Bopomofo?

Does anybody know some good OCR software that can handle (or ignore) BoPoMoFo?

I see some people use autoencoder models trained to ignore Zhuyin. If the line spacing is pretty consistent, I wonder if you could also use a program to white out the Zhuyin.

https://d246810g2000.medium.com/使用深度學習辨識含有注音的中文字-b52dabd96d5

I don’t know what kind of original you’re dealing with, but if it’s fairly recent, it might be possible to simply extract the text or save it as a Word document, which would give you text with ruby text in Zhuyin. Changing the font should then eliminate the Zhuyin.

I’ve used ReadIris for years but have never had to deal with bopomofo as ruby text so can’t say about that, sorry.

Tesseract

I was actually trying Tesseract yesterday to see if I could put something together, but it can barely detect Hanji correctly when it’s in a Kaishu font…

I tried a sample from a children’s storybook,* using a scanner and Google Lens, but it seems to have pretty badly misread the bopomofo characters in the first, fourth, and fifth groups (the ones immediately following 難, 不, and 怕); it also rendered the horizontal line after 你 as a vertical line.

So I guess that option can be ruled out?

*9710, 世一圖書出版中心編, ISBN 978-986-193-370-2, title: 我的第一本寓言故事

博客來-我的第一本寓言故事

Tesseract may need some tweaking. You need a recent-ish version (v4 or later) and, of course, a good character library (sometimes there are training sets from universities that work better). Last but not least, it can be finicky about image size: try doubling the dimensions/resolution of one of your test images and see if there’s a big difference.

Google Lens is a good option because the usual issue with rule-based OCR is that Zhuyin turns the text into gibberish, like what happened with Tesseract. If both the Hanji and the Zhuyin are recognized correctly, all you need to do is use a simple regex to remove all the Zhuyin symbols.
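A minimal sketch of that regex idea in Python, in case it helps. It assumes the OCR output really does use the standard Unicode Bopomofo code points (U+3100 to U+312F, plus the Bopomofo Extended block) and the usual spacing tone marks; OCR engines sometimes emit lookalike characters instead, and this pattern would not catch those. The names `ZHUYIN_PATTERN` and `strip_zhuyin` are made up for illustration.

```python
import re

# Assumed character set: Bopomofo (U+3100-U+312F), Bopomofo Extended
# (U+31A0-U+31BF), and the tone marks usually written beside Zhuyin:
# U+02C9, U+02CA, U+02C7, U+02CB, U+02D9.
ZHUYIN_PATTERN = re.compile(
    r"[\u3100-\u312F\u31A0-\u31BF\u02C9\u02CA\u02C7\u02CB\u02D9]+"
)


def strip_zhuyin(text: str) -> str:
    """Return text with Zhuyin letters and tone marks removed."""
    return ZHUYIN_PATTERN.sub("", text)


if __name__ == "__main__":
    sample = "你ㄋㄧˇ好ㄏㄠˇ"
    print(strip_zhuyin(sample))  # 你好
```

Anything outside those ranges (Hanji, punctuation, Latin letters) is left untouched, so a misread Zhuyin symbol that came out as a Hanji or a stray stroke would still need manual cleanup.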

This is kind of convoluted, but I wonder if it would work to just save the PDF as a text file and then use search and replace to eliminate anything that’s in the zhuyin range?
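For what it’s worth, here is a rough sketch of that search-and-replace step done on a whole exported text file instead of by hand. It assumes the export is UTF-8 and that the Zhuyin came out as ordinary characters in the Unicode Bopomofo blocks; `clean_text_file` and the file names are hypothetical.

```python
from pathlib import Path

# Code points to delete (an assumption about what the export contains):
# Bopomofo, Bopomofo Extended, and the spacing tone marks.
_ZHUYIN_CODEPOINTS = (
    list(range(0x3100, 0x3130))
    + list(range(0x31A0, 0x31C0))
    + [0x02C9, 0x02CA, 0x02C7, 0x02CB, 0x02D9]
)
# str.translate deletes any character whose code point maps to None.
DELETE_ZHUYIN = dict.fromkeys(_ZHUYIN_CODEPOINTS)


def clean_text_file(src: str, dst: str, encoding: str = "utf-8") -> int:
    """Copy src to dst without Zhuyin; return how many characters were removed."""
    original = Path(src).read_text(encoding=encoding)
    cleaned = original.translate(DELETE_ZHUYIN)
    Path(dst).write_text(cleaned, encoding=encoding)
    return len(original) - len(cleaned)
```

Calling something like `clean_text_file("export.txt", "export_clean.txt")` leaves the original export alone, and a return value of 0 is a quick hint that the Zhuyin was not exported as separate characters in the first place.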

Thanks for that information, @hansioux.

This. What kind of original are you dealing with?

I sometimes get PDF originals that will not export a copy to DOC or TXT because they are too old. The Chinese characters are, I guess, treated as images (or the entire text block is) instead of characters in the digital sense.

Newer PDFs that were created from Word or similar digital originals usually will export a nice clean copy in Word (but with pesky text boxes), or as a .txt file without the text boxes or formatting, or can be “scraped” by just selecting the text and using copy/paste to another document, to give at least a clean TXT file.

If that process gave you Chinese characters with zhuyin ruby text, then changing the font should get rid of the zhuyin. If the zhuyin is separate characters (which I can’t imagine unless the original was made by an OCR process maybe) then searching and replacing for symbols in that Unicode range might do the trick. I’m not an expert on this stuff, it’s just stuff I do when trying to make a decent electronic copy of things to translate from. Fortunately, very few technical articles come with zhuyin notations. lol

EDIT: popped back to the OP, and it seems you just want something that will read Zhuyin correctly. In which case, sorry, I’ve got nothing, unless you want to remove the incorrect Zhuyin and then change the font to something that would (generally) give you correct ruby text.

Well, he did say “OCR software that can handle (or ignore) BoPoMoFo,” so maybe he’ll end up applying your suggestions.

Besides, what you wrote could also end up being useful to others.