Creating a website - converting entire pages

Web gurus, hear me!

Here’s the situation. I want to be able to write a webpage with a button at the top to convert the text on the page to another format. Let’s say the original is in Hanyu Pinyin and I also want to offer Bopomofo.

One option is to pre-create all these pages, so that when the user clicks the ‘convert’ button it essentially switches between two pages in different subdomains - so between http://hp.example.com/intro and http://bp.example.com/intro. This is fine in the beginning, but it would grow to be a right royal pain in the arse as the site expands.

What I would like to do instead is convert the pages on the fly, which would involve setting up the conversion mechanism and then forgetting all about it. I know it’s possible, but I don’t know how. Help!

Do you want any of those formats to be real Chinese characters?

If not, you could set up a table with a column for each format; depending on which column is selected, the correct text string would be formed. Of course, your original text would then need to be formatted to match the encoding. The table itself would be quite small, as it would only need to include the phonetic alphabet.

Something like this:
id    pinyin    bopomofo
1     b         ㄅ
2     p         ㄆ
3     m         ㄇ
4     f         ㄈ

1     x         ㄒ
2     i         ㄧ
3     an        ㄢ

Not sure what you would do with the tones.

If you also want to support Chinese text in the same way, you may be able to store the original ASCII characters in your table and then simply select a relevant font (if one exists). This may be a lot easier, although if you are doing non-Chinese text I expect the initial table would have to be a lot longer to decode this way.

Sorry, not a guru, but I’ll throw in my 2c anyway.

A conversion table will be a little more difficult than a straight switch table, because the same pinyin letter (u) is used for both ㄨ and ㄩ.

There are some fonts for Chinese characters that include bopomofo embedded in each character, such as HanWangKaiMediumChuIn. I’m able to use this font to convert Chinese text to include zhuyin ruby characters. The trouble I see with this approach is finding a font for Chinese characters that includes pinyin. I wasn’t able to find one after a little searching, but that doesn’t mean there isn’t one.

Can you use Java? sourceforge.net/projects/zh2tone/ will convert Chinese to zhuyin and to many kinds of pinyin.

A quick Google search threw up a list of Unicode codepoints for pinyin here (e.g. to display nán for nan2).

MDBG has a CEDICT download here which maps characters to pinyin pronunciation.

The Wikipedia Zhuyin article includes Unicode renderings of all the bopomofo characters, and various conversion charts.

So I figure that’s everything you need to render arbitrary Chinese as pinyin or Zhuyin, minus code. Converting a page would probably be relatively expensive, so I’d suggest a cached approach – generate the relevant content the first time someone requests it, and then store it for later use.
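A minimal sketch of that cached approach, for what it's worth. The name makeCachedConverter and the in-memory Map are assumptions for illustration only; a real site might just as well write the converted pages to disk. The convert argument stands for whatever conversion function you end up with.

```javascript
// Wrap a conversion function so each (page, format) pair is converted only once.
// "convert" is assumed to take (sourceText, format) and return the converted text.
function makeCachedConverter(convert) {
  const cache = new Map(); // key: "format:pageId", value: converted text

  return function render(pageId, format, sourceText) {
    const key = format + ":" + pageId;
    if (!cache.has(key)) {
      // First request for this page in this format: do the expensive work now.
      cache.set(key, convert(sourceText, format));
    }
    // Every request, including the first, is answered from the cache.
    return cache.get(key);
  };
}
```

Note that a cache like this has to be cleared whenever a page's source text changes, otherwise readers keep getting the old conversion.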

This one’s a little fiddly and I don’t feel like hacking it up … as usual I can get it done for money if you’re not in a rush (busy lately), and as usual I’m sure you don’t want to pay for it :wink: It doesn’t seem like there’s anything hard involved though, so you probably don’t need me anyhow.

If I understood this correctly you only want to convert between romanization systems, like this? This example’s based on your conversion tables and has the same limitations as the converter I linked to in my sig. I typed the pinyin for each page, the conversions to Bopomofo etc. are done “on the fly”.

However, the way I did it needs some PHP code added to the top of each page plus everywhere you want the automatically converting romanizations to appear. It’s a rather ugly approach.

A nicer way of doing this would be to tag the pinyin (e.g. by applying a certain style) and have the original html files parsed by a script which would add the session management, conversion buttons and do the actual conversion. I think this should work, but it’s getting kind of late. I agree with Brendon that caching would be better… but unfortunately life’s too short (and I still have to learn a decent programming language :wink:).
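A rough sketch of the tagging idea, assuming the pinyin is wrapped in a span with a class of "py" (that class name, and the function name convertTagged, are made up for this example). It uses a simple regular expression, so it only copes with plain text inside the span; a proper HTML parser would be safer for anything more complicated.

```javascript
// Run "convert" over the text of every <span class="py">...</span> in the page
// and leave the rest of the HTML untouched.
// Assumption: tagged spans contain plain text only (no nested tags).
function convertTagged(html, convert) {
  return html.replace(
    /(<span class="py">)([^<]*)(<\/span>)/g,
    (whole, openTag, text, closeTag) => openTag + convert(text) + closeTag
  );
}
```

The script that serves the page would call this once per request (or once per cache fill), passing in whichever conversion the reader has selected.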

Thanks for the replies, everyone.

To clarify, this would not involve Chinese characters at all - simply converting between different methods of romanisation/zhuyin.

As twocs notes, there isn’t a one-to-one conversion between the different systems, i.e. you can’t just say that (u) always maps to ㄨ. This isn’t a problem - I have tables made up of all the systems including all the permutations of tone-markings. So huán will map directly to ㄏㄨㄢˊ, huǎn to ㄏㄨㄢˇ etc. This solves the issue Connel mentioned too.
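A whole-syllable lookup of that kind might look roughly like this. The table holds only the two entries quoted above, and the names PINYIN_TO_BOPOMOFO and convertSyllables are invented for the sketch. It assumes syllables are separated by spaces or punctuation, and it leaves any word it doesn't find in the table unchanged.

```javascript
// Whole-syllable table: each tone-marked pinyin syllable maps straight to bopomofo.
// Keys are normalised to NFC so precomposed and decomposed accents compare equal.
const PINYIN_TO_BOPOMOFO = new Map(
  [
    ["huán", "ㄏㄨㄢˊ"],
    ["huǎn", "ㄏㄨㄢˇ"],
  ].map(([pinyin, bopomofo]) => [pinyin.normalize("NFC"), bopomofo])
);

// Replace every run of letters (and combining marks) that appears in the table.
function convertSyllables(text, table) {
  return text.normalize("NFC").replace(/[\p{L}\p{M}]+/gu, (word) => {
    const hit = table.get(word.toLowerCase());
    return hit === undefined ? word : hit; // unknown words pass through untouched
  });
}
```

Because each syllable is looked up as a unit, the ambiguous letters never have to be decided one at a time.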

Hypermegaglobal already put together a romanisation converter (for which the aforementioned tables were devised) - and his solution posted here was the kind of thing I was looking for - problem is I know nothing about PHP. Time to learn, it seems.

Brendon - thanks for the offer, but the reason I’m doing this is specifically to sharpen up my web skills, so getting someone else to do it kind of defeats the point. :wink:

To come clean about the intention behind this: I actually want to use different systems of Taiwanese romanisation (I avoided saying so at the beginning so as not to complicate things unnecessarily). This means the conversion tables would be much bigger than the equivalents for Mandarin. Mandarin has something like 1,200 possible syllables including tone permutations; Taiwanese has more like three times as many, and that is multiplied by the number of systems I want to offer (probably three to five).
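As a back-of-envelope check on those figures (all of them the rough estimates given above, not measured counts), the arithmetic works out as follows:

```javascript
// Rough table-size estimate using the approximate figures quoted above.
const mandarinSyllables = 1200;                    // "something like 1,200"
const taiwaneseSyllables = mandarinSyllables * 3;  // "three times as many"
const systemCounts = [3, 5];                       // "probably three to five" systems

// One table cell per syllable per system.
const tableCells = systemCounts.map((n) => taiwaneseSyllables * n);
console.log(taiwaneseSyllables, tableCells); // 3600 [ 10800, 18000 ]
```

So the table would hold somewhere between about 10,800 and 18,000 short strings.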

Tell me more about caching - if I go with hmg’s approach will it be a terrible bandwidth hog? I’m not anticipating that the resulting site will generate much traffic.

hmg has the right idea - put something around the content in the html file to make it really easy to figure out what needs to be converted.

As for ambiguity, the trick is to store it in the least ambiguous form possible, and convert from there. For Chinese this would be the original characters … I’m not sure about Taiwanese.

Bandwidth isn’t the issue here. There are two extremes - lots of CPU (converting and rendering each page every time someone asks for it), or lots of disk (converting and saving all the pages in advance). If it’s a low traffic site, the former might be the easiest way – that way you can update content any time you like and not worry about cache invalidation, and the CPU time involved isn’t an issue.

I’m a little drunk, not sure whether this is useful or not :wink:

I tried to post this a couple days ago, but Forumosa decided to have an error just at that moment. Here’s what I thought of earlier.

You’d probably want to use either PHP or ASP to get the job done. It could also be done client-side with JavaScript, but then you run into problems where people don’t have JavaScript turned on.

What you’d want is an included script that takes an input string (say, “pin1yin1”) and outputs it as a) accented pinyin, b) numbered pinyin (changes nothing) or c) bopomofo. Then, in the text of your pages, you’d make a function call every time you typed something in pinyin. You’d also have a cookie telling the page which format to write, and a function to change the cookie value and refresh the page.

So, in the include page (a separate file) you’d have a function that looks like:

<script type="text/javascript">
// Writes inputString in the format named by a cookie.
function outputRoman(inputString) {
  var output = inputString;
  // check cookie value (the cookie name "roman" is just an example)
  var match = document.cookie.match(/(?:^|; )roman=([^;]*)/);
  var format = match ? match[1] : "numbered";
  if (format === "accented") {
    // Case 1: convert numbered pinyin to accented pinyin
    output = convertnumbers(inputString);
  } else if (format === "bpmf") {
    // Case 3: convert numbered pinyin to bopomofo
    output = convertbpmf(inputString);
  }
  // Case 2 (numbered): do nothing
  document.write(output);
}
</script>

You can probably find the functions for doing the conversions themselves somewhere on the web already. If not, you’d need to write the two functions, convertnumbers and convertbpmf, that do the actual conversion. (Note: what I wrote there is only a skeleton. It won’t convert anything until those two functions exist.)
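For the numbered-to-accented step, something along these lines might do. The names TONE_MARKS, markedVowelIndex and numberedToAccented are made up for this sketch. It assumes lowercase Hanyu Pinyin with a tone digit (1-5) after every syllable, with ü typed as "v" or "u:", and it uses the usual placement rule: a or e takes the mark, otherwise the o in "ou", otherwise the last vowel.

```javascript
// Tone-marked forms of each vowel, indexed by tone number 1-4.
const TONE_MARKS = {
  a: ["ā", "á", "ǎ", "à"],
  e: ["ē", "é", "ě", "è"],
  i: ["ī", "í", "ǐ", "ì"],
  o: ["ō", "ó", "ǒ", "ò"],
  u: ["ū", "ú", "ǔ", "ù"],
  "ü": ["ǖ", "ǘ", "ǚ", "ǜ"],
};

// Find which letter of the syllable carries the tone mark.
function markedVowelIndex(syllable) {
  for (const target of ["a", "e", "ou"]) {
    const at = syllable.indexOf(target);
    if (at !== -1) return at; // for "ou" this is the position of the "o"
  }
  for (let i = syllable.length - 1; i >= 0; i--) {
    if (TONE_MARKS[syllable[i]]) return i; // otherwise the last vowel
  }
  return -1; // no vowel found
}

// Convert e.g. "pin1yin1" to "pīnyīn"; tone 5 (neutral) just drops the digit.
function numberedToAccented(input) {
  return input.replace(/([a-zü:]+)([1-5])/g, (whole, letters, digit) => {
    const syllable = letters.replace(/u:|v/g, "ü");
    const tone = Number(digit);
    const i = markedVowelIndex(syllable);
    if (tone === 5 || i === -1) return syllable;
    return syllable.slice(0, i) + TONE_MARKS[syllable[i]][tone - 1] + syllable.slice(i + 1);
  });
}
```

This could stand in for the "convert numbered pinyin to accented pinyin" function; the bopomofo one would more likely be a table lookup.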

Then, whenever you wanted to write any pinyin, you’d call the function.

<p>You can view 拼音 <script type="text/javascript">outputRoman("pin1yin1")</script> using numbers, accents, or BoPoMoFo.</p>

Also, you’d need to flesh out the cookie handling a bit more. That’s how you can do it in JavaScript. For ASP or PHP you’d need to ask someone else.
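The cookie side could be fleshed out roughly as below. The function names readCookie and formatCookie and the cookie name "roman" are assumptions for the example. Both functions work on plain strings, so in a browser you would pass in document.cookie and assign the result of formatCookie back to it.

```javascript
// Return the value of one cookie from a "name=value; other=value" string,
// or the fallback when that cookie is not set.
function readCookie(cookieString, name, fallback) {
  for (const pair of cookieString.split(";")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip empty or malformed pieces
    const key = pair.slice(0, eq).trim();
    if (key === name) return decodeURIComponent(pair.slice(eq + 1).trim());
  }
  return fallback;
}

// Build the string to assign to document.cookie when the reader switches format.
function formatCookie(name, value, days) {
  const maxAge = days * 24 * 60 * 60; // lifetime in seconds
  return name + "=" + encodeURIComponent(value) + "; max-age=" + maxAge + "; path=/";
}

// In a browser (not run here):
//   const format = readCookie(document.cookie, "roman", "numbered");
//   document.cookie = formatCookie("roman", "bpmf", 365);
//   location.reload();
```

Falling back to "numbered" when no cookie is set means a first-time visitor simply sees the text as it was typed.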

3 times 1200 syllables… that’s pretty scary, but not a problem, technically, once you have the conversion table. :wink: However, even if you don’t want to convert from characters, like Brendon wrote you might still want to base your conversion table on them (if that’s possible) if this means that you’d get a smaller table.

The caching thing is something you’d add to an already existing conversion solution (to reduce the CPU load); I wouldn’t worry about this now.

If you want to do this yourself (and acquire some programming skills), there’s nothing really too complicated about it, but it does touch a number of topics, so you’d need some rather broad knowledge.

Even though I wrote the existing converter in PHP, I don’t think it’s such a great language if you’re new to programming because it certainly inspires a sloppy programming style. Python or C# might be a better choice from a didactic perspective.

One more thing to consider: Are you planning to use some sort of CMS for your website? My approach needs a static file to do the conversion, but (given enough time) you could write a plug-in for a CMS, too.

Thanks for the excellent advice, hypermegaglobal, R. Daneel Olivaw and Brendon - much appreciated.

The only place I would be using a CMS would be a potential “news” section of the site (maybe using Wordpress) - following cranky laowai’s example the rest would be done by hand in a text editor - allowing me to keep a closer eye on the html and CSS.

About basing the table on characters - not really possible for Taiwanese, I’m afraid. Too many syllables have no defined character equivalent. The tables I have look pretty much like the ones for your universal romanization converter (just much bigger), so I imagine it would function in a similar way.