Auto translate Chinese to Pinyin

ehophi · March 26, 2013, 4:14pm

I’m making this thing in OOCalc, so you can peek at it in a more user-friendly manner when I’m finished (in another hour or so).

ehophi · March 27, 2013, 8:26pm

Well, it’s a day later, and after a couple of hours of tinkering, I got it to do 90% of them.

I’m having an issue with the stray vowel pairs. Either I’ll need a lot of substitution rules to cover them, or I’ll have to reapproach the problem with an attempt to just join all acceptable initials and finals.

ironlady · March 27, 2013, 11:08pm

I’m not quite sure why you couldn’t simply:

Mark the end of each syllable with the appropriate tone number (get it from the diacritic mark);
Look up the syllable in a table and replace.

??
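A minimal sketch of the first step, for a single syllable that has already been separated from its neighbours. The function name and the choice of "5" for the neutral tone are assumptions for illustration, not anything taken from the spreadsheet in this thread.

```python
import unicodedata

# Combining marks used for the four Pinyin tones, mapped to tone numbers.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute accent
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave accent
}

def syllable_to_numbered(syllable):
    """Strip the tone diacritic from one syllable and append its tone number.

    Assumption: a syllable with no tone mark is neutral and gets "5".
    """
    tone = "5"
    letters = []
    # NFD splits each accented vowel into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # NFC recombines u + diaeresis into a single "ü" character.
    return unicodedata.normalize("NFC", "".join(letters)) + tone

print(syllable_to_numbered("mǎ"))  # ma3
```

This only works per syllable; a run like "nǐhǎo" still has to be split first, which is the harder part.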

ehophi · March 28, 2013, 3:08pm

[quote=“ironlady”]I’m not quite sure why you couldn’t simply:

Mark the end of each syllable with the appropriate tone number (get it from the diacritic mark);
Look up the syllable in a table and replace.

??[/quote]

There’s the issue of separating combined syllables in individual words. After that step, converting it to any other set (Zhuyin, TOP, whatever) is a simple matter of replacing in a lookup table, longest-to-shortest.
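To illustrate why the lookup has to go longest-to-shortest, here is a small sketch with a toy three-entry Pinyin-to-Zhuyin table (tones left out). The table and function names are made up for the example; a real table would cover every syllable.

```python
import re

# Toy table: three syllables that share the prefix "ma".
TABLE = {"ma": "ㄇㄚ", "man": "ㄇㄢ", "mang": "ㄇㄤ"}

def build_pattern(table):
    # Sort keys longest first so "mang" is tried before "man" and "ma".
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))

def convert(text, table=TABLE):
    pattern = build_pattern(table)
    # Replace each match with its table entry; everything else is untouched.
    return pattern.sub(lambda m: table[m.group(0)], text)

print(convert("mang man ma"))  # ㄇㄤ ㄇㄢ ㄇㄚ
```

If the keys were tried shortest first, "mang" would come out as the replacement for "ma" followed by a leftover "ng".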

I’ll hammer at it more tonight.

ironlady · March 28, 2013, 3:15pm

This is why I only took one logic course in college. And why I farm out a lot of my programming needs.

cranky_laowai · March 28, 2013, 3:20pm

I don’t get it. I already sent a link to a script I wrote that will separate syllables in Pinyin. Why write it over from scratch?

It would need to be adjusted to add tone marks/numbers; but that shouldn’t be a big deal. And I didn’t allow for all of the talk-like-a-pirate -r endings; but, again, that shouldn’t be too hard.

ehophi · March 28, 2013, 6:56pm

I did it for the lulz.

Actually, I did it because I didn’t want to run it in a major programming language or on a separate platform, but instead to have it run directly from my OOCalc spreadsheet.
I’m working out a stricter Turing-style approach (literally, move these things there, in this order).

I think it’s functional at this point. It allows erhua as long as you write it “er”. I took a different approach from the one that I described earlier.
I think this one is really picky about how we render it.

Pros:
I think it will also let you type words from other languages into it (including non-Pinyin Chinese) without corrupting that data, but I’ll have to see.

Cons:
It has to take “er4’er4” instead of “er4er4”.

cranky_laowai · March 29, 2013, 2:37am

[quote=“ehophi”]Cons:
It has to take “er4’er4” instead of “er4er4”.[/quote]
That’s not a con. That’s just how Pinyin is supposed to work. So if people are writing erer for er’er, then the old computer maxim applies: GIGO.

ehophi · March 29, 2013, 3:21pm

The pro that I mentioned is false. If you type English into it, you’ll get GO. The masks that I gave for the numbers (so that people could still type years in Chinese, like “2013年”) only work for numbers.

Let’s call this an alpha. You just type the Pinyin into it, and you get the TOP out of it. In principle, all it does is convert diacritical Pinyin into numerical Pinyin, separate the numerical Pinyin into syllables, convert each syllable to TOP, and rejoin the words. Ironlady, you can insert HTML coding into the TOP side of the conversion table to get colors. I didn’t want that, personally.
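Outside a spreadsheet, the separate/convert/rejoin part could look roughly like this in Python. It is a sketch under assumptions, not the OOCalc logic: the function names are invented, the table passed in is a placeholder, and punctuation inside a word (including apostrophes) is simply dropped.

```python
import re

# A numbered syllable: a run of letters closed by a tone digit 1-5.
SYLLABLE = re.compile(r"[a-zü]+[1-5]")

def split_numbered(word):
    """Split one numbered-Pinyin word into its syllables."""
    return SYLLABLE.findall(word.lower())

def convert_text(text, table):
    """Convert each syllable through `table` and rejoin the words.

    `table` is a hypothetical syllable-to-target mapping; syllables
    missing from it are passed through unchanged.
    """
    def convert_word(word):
        syllables = split_numbered(word)
        if not syllables:
            return word  # e.g. a bare number such as "2013"
        return "".join(table.get(s, s) for s in syllables)

    return " ".join(convert_word(w) for w in text.split())

print(split_numbered("ni3hao3"))  # ['ni3', 'hao3']
```

Because every syllable ends in a digit, this version happens to accept both “er4’er4” and “er4er4”; it also collapses line breaks into spaces, which a real tool would want to preserve.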

ehophi · April 1, 2013, 6:53pm

And I’ll call this an alpha plus. I’ve added a tool which allows me to take Chinese words with English, and at the end get TOP with English words.

[ol][li]Paste the Chinese with English words in it, and then get a Chinese-plus-gibberish mask.[/li]
[li]Paste that mask into a Chinese-to-tonal-Pinyin translator.[/li]
[li]Paste the masked tonal Pinyin into the tonal-Pinyin-to-TOP converter.[/li]
[li]Unmask the translated TOP result.[/li]
[li]The TOP output will keep the English words intact and space them somewhat appropriately.[/li][/ol]
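The mask/unmask steps above can be sketched in Python as follows. The placeholder format (an index inside ⟦ ⟧ brackets) and the function names are assumptions for illustration; the actual mask in the spreadsheet is different, and this only works if the translator in the middle leaves the placeholders alone.

```python
import re

def mask_english(text):
    """Replace each run of Latin letters with a numbered placeholder.

    Returns the masked text and the list of words that were removed.
    """
    words = []

    def stash(match):
        words.append(match.group(0))
        return f"⟦{len(words) - 1}⟧"

    masked = re.sub(r"[A-Za-z]+", stash, text)
    return masked, words

def unmask(text, words):
    """Put the stored words back in place of their placeholders."""
    return re.sub(r"⟦(\d+)⟧", lambda m: words[int(m.group(1))], text)

masked, words = mask_english("我喜欢Python和OOCalc")
print(masked)                 # 我喜欢⟦0⟧和⟦1⟧
print(unmask(masked, words))  # 我喜欢Python和OOCalc
```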

I also reduced the number of steps, fixed an issue with ‘~ng~’ splits, and made it handle tonal Pinyin text with line breaks. I also added a few stray Pinyin conversions (like ‘o5’, ‘lve[1-5]’, etc.).