Replacement script -- characters to TOP/Pinyin

Hi all,

I’m hoping someone can help a (broke) teacher with this.

I can write simple php stuff, but even after studying Hypermegaglobal’s Romanization conversion scripts, I’m just not able to get what’s going on or how to do this. (I think maybe he writes too elegantly.)

I need to write a script (preferably php because usually I can make some sense of it) that will take a file written in Chinese characters (simplified, actually :frowning: ) and convert only SELECTED ones to Romanization. The idea is that I can keep a text file or database (SQL is fine) of the characters my students are supposed to have mastered, then convert all the ones that they aren’t supposed to know in a reading I write for them into Romanization to make a mixed reading I can print out or post on the Web for them. (Does that make sense?)

The fancy version would ask for a section name or number (I could see keeping a database with separate data on which characters each class “knows”, and producing customized readings based on each level or even section) but at present I’d just be happy with a script that would do the replacement using input from some source or other. The super-fancy version would somehow mark characters that were indicated in the file to be poyin characters (having more than one possible pronunciation) to make it easier to check them by hand.

I’ve found information on how to get a file into an array, but despite the careful comments in Hypermegaglobal’s scripts I can’t quite figure them out…??

:loco: Thanks for any help anyone has time to offer.

Here’s what I would do.

  1. Use an excel spreadsheet
  2. Set up some columns with filters.
  3. Dump all your characters in, and run a converter on everything.
  4. Use a selector column to mark various character line items.
  5. Set filter on the selector column.

That would be way faster, and easier IMHO.

Hmmm…never thought of that kind of solution! Thanks, Truant. I’ll play around with that. Should be much easier than actually (gulp) learning to program better. :smiley:

The only problem is that I’m trying to convert selected characters inside paragraphs, not lists of characters. I’m not quite clear how I could select only some of the characters to replace in Excel, without writing a huge long search and replace thing. I’ll try to see what can be done, though.

The key to understanding my conversion scripts is the strtr function:

Basically, strtr does all the work, so I’m not sure if this should be called “elegant” or “lazy”. :wink:
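As a minimal sketch of what strtr does when its second argument is an array (the helper name romanize() and the sample table are made up for illustration): it tries the longest keys first and does not rescan text it has already replaced, so a word entry takes priority over a single-character entry.

```php
<?php
// Hypothetical helper: replace every key of $pairs found in $text with its value.
function romanize($text, array $pairs)
{
    // strtr() tries the longest keys first and never rescans replaced text.
    return strtr($text, $pairs);
}

// Made-up sample table: a single character and a two-character word.
$pairs = array('謝' => 'xiè', '謝謝' => 'xièxie');

echo romanize('謝謝你', $pairs), "\n"; // the two-character key wins: xièxie你
echo romanize('謝你', $pairs), "\n";   // only the single character matches: xiè你
```

Because of the longest-match rule, word-level entries and single-character entries can sit side by side in the same array.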

I’ll try to put together a simple example script on the weekend, though there might be issues with PHP’s poor Unicode support (strtr is not binary safe; there may be a similar mbstring function, but I haven’t had time to check yet).

This very basic script takes a string ($text) and replaces the characters that appear as keys in the array $unknown:

[code]<?php
/* array with all unknown characters and pinyin */
$unknown=array('好'=>'hǎo', '嗎'=>'ma', '謝謝'=>'xièxie');

/* text to transform */
$text='你好嗎? 我很好,謝謝.';

/* transform and output */
echo strtr($text, $unknown);
?>
[/code]

With some HTML added:

[code]<?php
/* array with all unknown characters and pinyin */
$unknown=array('好'=>'hǎo', '嗎'=>'ma', '謝謝'=>'xièxie');
/* text to transform */
$text='你好嗎? 我很好,謝謝.';
?>

Character conversion test

<?php /* transform and output */ echo strtr($text, $unknown); ?>

[/code]

The working example returns 你hǎoma? 我很hǎo,xièxie.

The good news is that, once again, strtr can get the job done (when used with an array as the second parameter, it doesn’t have to be “unicode-aware”).

The bad news is that it is difficult to insert spaces in the converted text, as there’s no built-in function in PHP to extract a character from a UTF-8 string (http://hsivonen.iki.fi/php-utf8/ shows how it can be done - I’ll have another look when I have some time).
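One possible workaround, as a rough sketch rather than a tested solution: the PCRE functions accept the /u modifier, which makes them treat the subject as UTF-8, so preg_split can cut a string into single characters. The helper names utf8_chars() and convert_spaced() are made up here, and this per-character version deliberately ignores multi-character entries such as 謝謝.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of single characters.
function utf8_chars($text)
{
    // With /u the empty pattern matches between characters, not between bytes.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: convert character by character, padding each
// romanized syllable with spaces so syllables do not run together.
function convert_spaced($text, array $unknown)
{
    $out = '';
    foreach (utf8_chars($text) as $ch) {
        $out .= isset($unknown[$ch]) ? ' ' . $unknown[$ch] . ' ' : $ch;
    }
    // Collapse the doubled spaces left between neighbouring syllables.
    return trim(preg_replace('/ {2,}/', ' ', $out));
}

echo convert_spaced('你好嗎', array('好' => 'hǎo', '嗎' => 'ma')), "\n"; // 你 hǎo ma
```

The trade-off is that only single-character keys are looked up, so word-level readings would need a separate pass.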

Wow, thanks a million!!

The spacing question isn’t a problem as I space all my texts before they are converted anyway. The students have enough going on as beginners without having to figure out where the word boundaries are in the first year.

So from this, I could conceivably tweak and get a script that would take the input text and the list of characters to change from separate files (to simplify keeping lists up to date, as they will change over the course of the year)? It doesn’t make sense for me to hard-code the list of characters to be changed as an array right in the script, although I could do that if it’s the only way to do it.
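A rough sketch of the separate-files idea, under some assumptions that are not from this thread: the list file is UTF-8 with one entry per line in the form character, tab, romanization, and the helper names load_pairs() and convert_file() are invented for illustration.

```php
<?php
// Hypothetical helper: read "character<TAB>romanization" lines into an array.
function load_pairs($listPath)
{
    $pairs = array();
    foreach (file($listPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $fields = explode("\t", $line);
        // Skip lines without two fields or with an empty character field.
        if (count($fields) >= 2 && trim($fields[0]) !== '') {
            $pairs[trim($fields[0])] = trim($fields[1]);
        }
    }
    return $pairs;
}

// Hypothetical helper: convert the text in one file using the list in another.
function convert_file($textPath, $listPath)
{
    $text  = file_get_contents($textPath);
    $pairs = load_pairs($listPath);
    // An empty list means nothing to replace, so return the text unchanged.
    return $pairs ? strtr($text, $pairs) : $text;
}
```

With this layout the list can be edited in any UTF-8 text editor or exported from a spreadsheet, and the script itself never needs to change.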

Sorry to be looking a very large gift horse in the mouth, and all that…

Uh-oh…
For some reason (still working with it), when I copy your script into the www directory for my localhost setup, it doesn’t work. I added a line to echo “$text” prior to the output, just to keep track of things, and the characters appear as question marks, not characters. The page encoding on the browser is set to UTF-8. Also, every character in the original string is converted to “xiexie” (the last entry in the array). This also happened when I changed the array using my own data – everything was changed to the last entry in the array. The problems occur when the script is run from my server as well (I was thinking maybe my localhost setup had something wrong with it, but I guess both my localhost and my server do.)

Is this something incredibly obvious?? (I’m hoping it is…)

Hard-coding the list of characters in the script doesn’t seem any more difficult than keeping a text file – and I’m sure I can figure out how to output to that array from my database, so no worries there (at least).

Exactly, you’d load the list from a file or from a database into the array. My idea would be to use a database with each row containing the character, pinyin, possibly TOP and a field for the lesson where the character appeared, then you could just run something like

“SELECT * from characters WHERE lesson>5”

to get all of the characters (+pinyin, of course) which someone who’s at lesson 5 doesn’t know yet. You could conveniently prepare this list in Open Office Calc (or Excel) and import it into the database if you don’t want to edit the database directly.
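A minimal sketch of that query feeding strtr, assuming PDO is available and that the table has columns named hanzi, pinyin and lesson (the column names and the helper unknown_after_lesson() are made up for this example):

```php
<?php
// Hypothetical helper: returns array('character' => 'pinyin', ...) for every
// character introduced after the given lesson.
// Assumes a table `characters` with columns `hanzi`, `pinyin` and `lesson`.
function unknown_after_lesson(PDO $db, $lesson)
{
    $stmt = $db->prepare('SELECT hanzi, pinyin FROM characters WHERE lesson > ?');
    $stmt->bindValue(1, (int) $lesson, PDO::PARAM_INT);
    $stmt->execute();
    // FETCH_KEY_PAIR uses the first column as key and the second as value.
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```

Given an open PDO connection in $db, something like echo strtr($text, unknown_after_lesson($db, 5)); would then print the mixed reading for a class that has finished lesson 5.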

Bú yào kèqì, this is just some very basic stuff and there’s still a lot of work left to do if you want to turn it into an easy to use and useful program. Don’t hesitate to ask if you encounter any problems!

[quote=“ironlady”]Uh-oh…
For some reason (still working with it), when I copy your script into the www directory for my localhost setup, it doesn’t work. I added a line to echo “$text” prior to the output, just to keep track of things, and the characters appear as question marks, not characters. The page encoding on the browser is set to UTF-8. Also, every character in the original string is converted to “xiexie” (the last entry in the array). This also happened when I changed the array using my own data – everything was changed to the last entry in the array. The problems occur when the script is run from my server as well (I was thinking maybe my localhost setup had something wrong with it, but I guess both my localhost and my server do.)

Is this something incredibly obvious?? (I’m hoping it is…)[/quote]

Which program do you use to edit your PHP files? You have to make sure they are saved as UTF-8. I had some issues with this, too (I’ll spare you the details, but if you ever feel like spending a lot of money on Zend Studio, just don’t). You could download Komodo Edit for free, create a new file, then choose Edit > Current File settings and set the encoding to UTF-8. Then just paste the code, save the file in your www directory and try again.
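A quick way to check the encoding from inside the script, offered as a sketch only (both helper names are made up): if an editor saves the file in another encoding, or swaps the characters for question marks, the bytes of a character literal will no longer be the UTF-8 ones. One possible explanation for everything turning into the last entry is that all the array keys ended up as the same "?" string, so the last one overwrote the others.

```php
<?php
// Hypothetical self-check: the literal below is typed as a Chinese character.
// If this file was saved as UTF-8, it is exactly the three bytes E5 A5 BD.
function saved_as_utf8()
{
    return bin2hex('好') === 'e5a5bd';
}

// Hypothetical helper: true when $text is a valid UTF-8 byte sequence.
function is_valid_utf8($text)
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    return preg_match('//u', $text) === 1;
}

echo saved_as_utf8() ? "File looks like UTF-8\n" : "File is NOT saved as UTF-8\n";
```

Running the same check on the input text with is_valid_utf8() would show whether the text file has the same problem as the script.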

Eureka! I was using PSPadEditor, but sure enough there was no way to save UTF-8 files using that. Now with Notepad (right-clicking to run it as an administrator, or else I can’t save files to the www directory!) it works.

Thanks so much!! I’ll have fun modifying this now that the main idea runs for me.

@Hypermegaglobal:

I cannot thank you enough for taking the time to help me with this. Now I’ve got a script to do my characters to TOP replacement, as well as scripts for simplified to traditional, traditional to simplified, characters to Pinyin, and more to come. I can’t even begin to think how much time this has already saved me, just knowing how to get from my databases and spreadsheets (of which there are many) to functionality that does my work for me.

:bravo: :notworthy: