Auto translate Chinese to Pinyin

wonderingsoul · March 20, 2013, 1:16pm

Got a useful tool here for all you pinyin readers. It translates the Chinese characters to pinyin. It’s a chrome extension.

chrome.google.com/webstore/deta … nncgahagh/

tango42 · March 20, 2013, 1:38pm

Wo xi huan. Xie xie.

Now if I can only get google translate to do the same thing.

cranky_laowai · March 20, 2013, 1:47pm

Bro ken syl la bles are not pin yin.

But if it handled word parsing the way Pinyin is meant to be written it would be great.

trubadour · March 20, 2013, 1:48pm

It has had Pinyin input for a few weeks now.

ehophi · March 22, 2013, 7:36pm

Click that Ä by the Google Translate bar. You get a half-decent Pinyin output from it.

If you want to preserve the spacing of your original text, set the target language to Chinese, and use it on that side.

I think ironlady may link to a Pinyin-to-Pinyin converter which is more useful for learners, though. I haven’t done it yet, in part because I don’t know what her bandwidth limits are (and I have accidentally spurred DoS attacks on other Chinese-language websites), but I would be tempted to take all of my database’s Google Translate Pinyin results and convert them to the Pinyin system on hers.

ironlady · March 22, 2013, 9:16pm

Which one do you mean?

I have two converters: one for Pinyin-with-numbers to TOP (Tonally Orthographic Pinyin, with colors, caps and small letters and diacritical marks like standard Hanyu Pinyin) and one for coloring character texts in this way, with the input being characters followed (each one) by the number for their tones.

I don’t think I have anything Pinyin to Pinyin except for the Pinyin-with-numbers one mentioned above. (I have no idea what my bandwidth limits are, either, but thanks for being considerate. Go ahead and try a few and see what happens!)

ironlady · March 22, 2013, 9:17pm

Just for the sake of saying it, when I do the Pinyin for a reader, I “translate” it manually from characters into Pinyin (using CAT software). I have never found a characters-to-Pinyin converter that was accurate enough, or which didn’t have syllabification-and-wordification issues. Plus, I add spaces between sense groups since I write for emergent readers of Chinese. It’s just easier, faster and (somewhat, at least) more accurate to do it by hand. Auto-correct seems to insert unexpected things, though.

ehophi · March 23, 2013, 8:46pm

[quote=“ironlady”]Which one do you mean?

I have two converters: one for Pinyin-with-numbers to TOP (Tonally Orthographic Pinyin, with colors, caps and small letters and diacritical marks like standard Hanyu Pinyin) and one for coloring character texts in this way, with the input being characters followed (each one) by the number for their tones.[/quote]

Yeah, that’s it. Since I would insert them as black-and-white text in the file in question, the greater interest was in the arrangement of caps to sound out the words, which was more intuitive.

I did it for a whole ten sentences, and had to work out these steps:
[ol][li]Convert Google’s Pinyin results to Pinyin with numbers.[/li]
[li]Plug them into the TOP converter.[/li][/ol]

I wouldn’t do a full-on project, because, again, I don’t know what your bandwidth allowances are.

Confuzius · March 23, 2013, 9:21pm

Wenlin 文林 does it too.

ironlady · March 23, 2013, 10:10pm

I can give you a copy of the script, and you could install it on your own server, if you like. I certainly don’t mind spreading the gospel of TOP a little farther and wider!

ehophi · March 24, 2013, 4:41pm

Actually, if you have the conversion table (which I’m guessing your program needs to do the substitution functions), I could probably make one that doesn’t run on a server, but that people can download and run directly from a computer.

Maybe I could try my hand at writing a program that goes from Pinyin-with-Tone-Marks to TOP-with-Tone-Marks. I’m guessing that you need the numbers to count for the rightmost limit, so that you don’t end up capitalizing the left ‘g’ when it’s not supposed to be capitalized (e.g. ‘民國’ to ‘míNguÓ,’ not ‘míNGuÓ’).

ironlady · March 24, 2013, 8:16pm

It is only partially based on a conversion table. Most of it is actually string manipulation statements in PHP.

ehophi · March 25, 2013, 12:49am

I started working on it. Simple substitution rules run into a few difficulties going from Pinyin-with-Tones to TOP. I have to work inside-out, from the shortest strings to the longest strings, so that the larger manipulations can assume the smaller ones, and move from Pinyin-with-Tones, to a “proto-TOP”, to TOP.

á --> aÁ --> aÁ, but
áng --> aÁng --> ánG

ironlady · March 25, 2013, 12:37pm

Don’t forget the fun getting the text to be colored, too…

ehophi · March 25, 2013, 2:41pm

I’ll leave that part to someone else. I just need the text.

Update: Pinyin’s ‘n’, ‘g’, and ‘r’ are frustrating my progress on this transition. The Lexilogos one has this problem, too.

ehophi · March 25, 2013, 10:45pm

Subsequent Update: The post below ought to show that I was wrong and that there is a straightforward procedure.

cranky_laowai · March 26, 2013, 4:28am

[quote=“ehophi”]Final Update: It’s impossible to fully convert Pinyin-with-Tones to any other Romanization system.

The culprits are ambiguous compound strings, such as ‘huanan’. In order to see the atomic strings (that is, one initial and one final put together), the computer must be able to recognize the rightmost character of each atomic string. Since this isn’t possible in many cases (e.g. ‘huan_an_’ v. ‘hua_nan_’), no program can disambiguate them without the original characters.[/quote]
This is incorrect. “Huanan” is not ambiguous. It is always hua + nan.

Huan + an = huan’an.

Hanyu Pinyin has no ambiguities in syllabification.

You may find these useful:
[ul][li]separating Pinyin syllables: PHP code (I didn’t write that to include tone marks; but they could be added without much trouble.) [/li]
[li]Apostrophes in Hanyu Pinyin: when and where to use them[/li][/ul]
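
The rule above (a consonant between two vowels always opens the next syllable unless an apostrophe says otherwise) can be sketched in Python. This is only an illustration, not the PHP code linked above: the function name `split_syllables` and the regular expression are assumptions made for this example, it expects one correctly written word at a time, and it ignores rare vowel-less interjections such as "ng" or "hm".

```python
import re
import unicodedata

# One syllable: optional initial, a run of vowels, then an optional coda
# (n, ng, r, or erhua forms like "nr"). The final lookahead refuses any
# match that would leave a vowel-initial syllable behind, because correctly
# written Pinyin marks that case with an apostrophe.
_SYLLABLE = re.compile(
    r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
    r"[aeiouv]+"
    r"(?:(?:ng|n)?r|ng|n)?"
    r"(?![aeiouv])",
    re.IGNORECASE,
)


def _base(ch):
    """Return the letter without its diacritics (e.g. 'ó' -> 'o', 'ǜ' -> 'u')."""
    return unicodedata.normalize("NFD", ch)[0]


def split_syllables(word):
    """Split one Pinyin word into syllables, keeping tone marks and case."""
    word = unicodedata.normalize("NFC", word)
    # Match on a diacritic-free copy of the same length, then slice the
    # original so the tone marks survive.
    plain = "".join(_base(ch) for ch in word)
    return [word[m.start():m.end()] for m in _SYLLABLE.finditer(plain)]


print(split_syllables("huanan"))   # hua + nan
print(split_syllables("huan'an"))  # huan + an
print(split_syllables("mínguó"))   # mín + guó
```

Apostrophes are never matched by the pattern, so they simply act as separators and drop out of the result.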

ironlady · March 26, 2013, 1:41pm

Yes, CLW is (of course) correct, there is no ambiguity when Pinyin is written correctly.

The other point is that these systems are not intended to be changed one into the other by machine. They’re intended to do a job – to represent the sounds of spoken Mandarin correctly in writing. In the case of TOP, another goal is to reinforce the tones of each syllable through three means: capitals/small letters, color, and diacritical marks.

To go from standard HYPY to TOP, I would use these steps:

1. Mark word boundaries somehow;
2. Divide words into syllables where they are multisyllabic to begin with;
3. Assign a number at the end of each syllable based on the diacritical mark over the vowel;
4. Replace the vowel with the base vowel minus the diacritical mark;
5. Remove word boundary marks;
6. Use the existing Pinyin-with-numbers to TOP script to convert to TOP.
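
The two middle steps (reading the tone off the diacritic, then stripping it) can be sketched for a single syllable as follows. This is a minimal Python illustration, not the actual script: the name `syllable_to_numbered` is made up here, and writing the neutral tone as `5` is an assumption, since the input format the TOP script expects for unmarked syllables is not stated in this thread.

```python
import unicodedata

# Combining diacritics used for the four Hanyu Pinyin tones.
TONE_MARKS = {
    "\u0304": 1,  # macron: first tone
    "\u0301": 2,  # acute: second tone
    "\u030c": 3,  # caron: third tone
    "\u0300": 4,  # grave: fourth tone
}


def syllable_to_numbered(syllable, neutral="5"):
    """Turn one tone-marked syllable into Pinyin-with-numbers."""
    tone = neutral  # assumption: unmarked syllables get this digit
    kept = []
    # NFD splits each accented vowel into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = str(TONE_MARKS[ch])
        else:
            kept.append(ch)  # keeps the diaeresis of ü
    # Recompose so 'u' + diaeresis becomes a single 'ü' again.
    return unicodedata.normalize("NFC", "".join(kept)) + tone


print(syllable_to_numbered("mín"), syllable_to_numbered("guó"))
```

Working on the decomposed form avoids a hand-typed table of every accented vowel, and it leaves ü intact because only the four tone diacritics are removed.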

Now I’m thinking “why mark word boundaries” but for some reason it seems important to me. I just can’t think of it. No time, I have to teach in a couple of minutes, so I’m a little distracted. :aiyo:

ehophi · March 26, 2013, 2:48pm

[quote=“ironlady”]Yes, CLW is (of course) correct, there is no ambiguity when Pinyin is written correctly.

The other point is that these systems are not intended to be changed one into the other by machine. They’re intended to do a job – to represent the sounds of spoken Mandarin correctly in writing. In the case of TOP, another goal is to reinforce the tones of each syllable through three means: capitals/small letters, color, and diacritical marks.

To go from standard HYPY to TOP, I would use these steps:

Mark word boundaries somehow.[/quote]

If CLW is right, and you can designate right boundaries with the apostrophe (’), then this is good news: you would need to spot all of the ambiguities, and then write code to separate them with individual boundaries. I worked on an ordered conversion table, using a nonsense character to mark the leftmost boundary of each term, and then filling in some spaces.

First, replace the tone-marks with a numeral and the letter:

ā --> 1a
…
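
A rough Python sketch of this first substitution, for anyone who would rather not type the table out by hand. The function name `prefix_tone_digits` is made up for this example, and it assumes the elided rows follow the same pattern as the one shown (the tone numeral goes directly in front of the bare vowel).

```python
import unicodedata

# Combining diacritics for tones 1 to 4.
TONE_DIGIT = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def prefix_tone_digits(text):
    """Replace each tone-marked vowel with its tone numeral plus the bare vowel."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        parts = unicodedata.normalize("NFD", ch)
        digits = [TONE_DIGIT[c] for c in parts if c in TONE_DIGIT]
        if digits:
            # Drop only the tone diacritic; a diaeresis (ü) is kept.
            bare = "".join(c for c in parts if c not in TONE_DIGIT)
            out.append(digits[0] + unicodedata.normalize("NFC", bare))
        else:
            out.append(ch)
    return "".join(out)


print(prefix_tone_digits("mínguó"))
```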

Then, every non-ambiguous initial is replaced with a nonsense character plus that initial:

b --> ¤b
…
z --> ¤z

‘h’ is as follows, and in this order:
h --> ¤h
¤c¤h --> ¤ch
¤s¤h --> ¤sh
¤z¤h --> ¤zh
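
As a sketch, the marking step and the ‘h’ repair could look like this in Python. The letter set in `UNAMBIGUOUS` is a guess at the elided rows (it includes ‘y’ and ‘w’ and leaves out ‘n’, ‘g’, and ‘r’), and `mark_initials` is a name invented for this example.

```python
MARK = "¤"

# Assumed set of initials that can never end a syllable.
UNAMBIGUOUS = "bpmfdtlkhjqxzcsyw"


def mark_initials(text):
    """Put the nonsense character in front of every unambiguous initial."""
    for letter in UNAMBIGUOUS:
        text = text.replace(letter, MARK + letter)
    # 'h' was marked on its own above, so the digraphs now look like
    # ¤c¤h, ¤s¤h, ¤z¤h; merge each back into a single marked initial.
    for digraph in ("ch", "sh", "zh"):
        first, second = digraph
        text = text.replace(MARK + first + MARK + second, MARK + digraph)
    return text


print(mark_initials("zhi"), mark_initials("shuo"), mark_initials("hua"))
```

The order matters in the same way as in the list above: the digraph merge only works after both letters have been marked individually.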

However, the problem will arise with the intersect groups of initial first letters and final last letters, and those are ‘g’, ‘n’, ‘r’, ‘a’, ‘i’, and ‘o’.

Then you need all of the apostrophe disambiguations (e.g. ‘…an’a’ vs. ‘…ana’), and then you follow the above procedure with them:

an --> a¤n
¤n’ --> n¤’

Then you add the nonsense character (‘¤’) to every initial (to surround the finals of compounds):

¤ --> ¤¤

Then you use your already solved conversions from Pinyin with numbers, and realign them to the 1[vowel]-Pinyin, plus the two nonsense characters, e.g.:

¤2ang¤ --> ánG
¤2an¤ --> áN

You must go from longest-to-shortest with each substitution, so that you don’t end up with ‘…áNG’ in any of the text.
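
Here is a small Python sketch of why the order matters. The table is deliberately simplified for illustration: the keys drop the ¤ delimiters so that one key is a prefix of the other, and `apply_table` is a name made up for this example.

```python
# Simplified, hypothetical table: without the ¤ delimiters,
# "2an" is a prefix of "2ang", so the two rules can collide.
TABLE = {"2an": "áN", "2ang": "ánG"}


def apply_table(text, table, longest_first=True):
    """Apply every substitution in the table, ordered by key length."""
    for old in sorted(table, key=len, reverse=longest_first):
        text = text.replace(old, table[old])
    return text


# Longest key first: "2ang" is consumed whole before "2an" gets a chance.
print(apply_table("2ang", TABLE))
# Shortest key first: "2an" fires inside "2ang" and capitalizes the wrong letter.
print(apply_table("2ang", TABLE, longest_first=False))
```

Sorting the keys by length each time keeps the table itself order-free, so new rows can be added without worrying about where they go.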

Then you replace each empty space plus its ¤ with a plain empty space.

" "¤ --> " "

That should do it.
Leave it to someone who thinks this is fun to figure out this part.

ironlady · March 26, 2013, 3:35pm

Dunno. I had some guy in Singapore write my script. $30 and it works great.