Tone marks in Chinese Web pages

In another thread, David_K wrote:

quote:
How do you type out the tone marks from the keyboard and then put them on the web? I experimented once with the Lucida font and had to use the "Insert (Special) Symbol" function in MS Word before copying the whole thing to HTML. Apart from being highly tedious, as you can imagine, my other problem was that I could not type any Chinese characters on the same page. But I wonder if this latter problem has been resolved. My purpose was to type out passages that I had read (in Chinese and pinyin) and put them up on Web pages.

This is a problem that makes me want to tear out what little hair I have left.

I only use tone marks in the MRT section of my site, such as on the page at www.romanization.com/mrt/danshui_tones.html

And there I cheat, by using a GIF file rather than real tone marks on text, because overcoming the problem is complicated.

The main trouble is that, generally speaking, Big5 Chinese fonts don’t come with the extended characters that are necessary to show, for example, a third-tone o.

Third tones are the most complicated problem to overcome, because first tones can usually be omitted, and second and fourth are carried in the main Western European section (“Latin-1 Supplement”) of a font. (Netscape doesn’t handle even these well.)

Most fonts, however, don’t include the Latin Extended-B section, which holds most of the third-tone vowels. Those that do are usually Unicode fonts.
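For anyone who wants to check where a given tone-marked vowel lives, here is a minimal Python sketch. The block ranges are taken from the Unicode code charts; the helper name `block_of` is made up for illustration and is not part of any library.

```python
# Unicode block ranges relevant to pinyin vowels (from the Unicode code charts).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
]

def block_of(ch):
    """Return the name of the Unicode block containing a single character."""
    cp = ord(ch)
    for low, high, name in BLOCKS:
        if low <= cp <= high:
            return name
    return "other"

# The letter o with each of the four tones: macron, acute, caron, grave.
for ch in "\u014d\u00f3\u01d2\u00f2":
    print(f"U+{ord(ch):04X} {block_of(ch)}")
```

On my reading of the charts, the acute and grave vowels come out under Latin-1 Supplement, the macron ones under Latin Extended-A, and the third-tone o under Latin Extended-B.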

So, of course, the answer is to use a Unicode font. A Web page, however, can have only one character encoding, so Big5 and Unicode are incompatible. (It’s possible to put Unicode on the main page and Big5 on an iframe within it; but there are other problems with this.)
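The one-encoding-per-page limit can be seen in a small Python sketch. It assumes Python's built-in `big5` and `utf-8` codecs behave like the encodings a browser would use; `can_encode` is a hypothetical helper name.

```python
def can_encode(text, encoding):
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

hanzi = "淡水"                    # Danshui in Chinese characters
pinyin = "D\u00e0nshu\u01d0"      # Danshui with 4th-tone a and 3rd-tone i

for enc in ("big5", "utf-8"):
    print(enc, "hanzi:", can_encode(hanzi, enc), "pinyin:", can_encode(pinyin, enc))
```

If the codecs match what I expect, Big5 takes the characters but not the tone-marked vowels, while UTF-8 takes both, which is the whole argument for moving the page to Unicode.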

Alas, there are several significant obstacles to using Unicode and Chinese characters on a Web page.

  • Not all Unicode fonts include Chinese characters, so only certain ones will do, such as Arial Unicode MS.
  • Many people do not have the necessary fonts on their system. Some can be downloaded; but their size is prohibitively large (more than 10 MB) for many users.
  • Most browsers do not have a default setup that handles Unicode fonts, which means that users have to go make the adjustments themselves. I suspect that the majority of users wouldn't bother to make the changes, or would fear messing with their systems (even if supplied with step-by-step directions).

So, that’s why I haven’t made my site Unicode yet. Well, that and laziness.

If anyone’s interested in working on this problem, here’s the address of some test pages I made: www.romanization.com/unicode/

Cranky,

Thanks for your reply to my question about combining tone marks in pinyin and Chinese characters.

You are definitely right in your 3 points about Unicode fonts. Thanks for going through all that explaining. It seems that you have confirmed my initial suspicion, which is that the “evolution” toward Unicode fonts on the PC and the web is taking so long, even though this is the natural progression and simplification.

As I understand it, MS has now produced the Hong Kong Supplementary Character Set (HKSCS) in Unicode; see http://www.microsoft.com/hk/hkscs/. Previously it was only available in Big5. (So a total Unicode solution cannot be that far away!)

I have checked the web sites you refer to in the above 2nd posting: ie
http://www.romanization.com/unicode/
In each instance I can see the diacritical marks and the roman letters i.e. the hanyu pinyin for sure. But I am unable to see the equivalent Chinese characters properly on the right for the relevant table cells, which is what you have been trying to explain all along. You then provided the gif picture equivalent below. (yes… I see)

The solution, it seems to me, is to use Unicode fonts throughout the same web page, as you have suggested. In the above web page examples the only problem seems to be that you are failing to supply the relevant Chinese characters in Unicode encoding. This means the IMEs (Chinese character input method editors) that you have been using invariably end up supplying only Big5-encoded characters, outside your control. Isn’t there a way of specifying which encoding is produced, i.e. Big5 or Unicode or GB?

I may have a solution for inputting Chinese characters in Unicode. But this will involve some trials of two IME programs I have come across, which might successfully allow you to produce Unicode characters “on demand”, so to speak.

The first IME is the NJStar Communicator 2.23 which supposedly allows you to output any encoding you desire using the “Output Code Switch”. You have to go to their web site to check.

The other IME is the one I am using, called "Q9", which supposedly supplies both Big5 and Unicode characters at the same time. The character which actually stays on the “surface”, I am told, depends upon the OS you are using. Well, that was what I was told!

Anyway I will try these two IMEs soon and report back the results.

Below are the addresses of the two IMEs I mentioned in case you want to explore yourself:

  1. NJStar Communicator 2.23 (NJStar being the maker).

  2. This IME program is called Q9 and is made by Q9 Technologies Holding Ltd. It uses only the nine keys of the number pad, a bit like a telephone keypad, to input all the Chinese characters. Each number key imitates a normal “stroke”, as if you were writing with a pen, but much faster since it “guesses” the character by the third key input. It is reputedly faster than the usual pinyin method, but pinyin input is also available as an alternative with this program.

For Q9 I am told that the characters produced are both encodings but what sticks on the surface switches between Big5 or Unicode depending upon whether you install the program in a Big5 OS environment like the Taiwanese version of Windows 2000 or a Unicode OS like the English version of Windows 2000 professional.

The final step:
Of course none of this actually means our audience is able to see our web pages as you have so carefully explained in your first post above about their Internet Explorer being not properly equipped with the relevant Unicode fonts etc.

Here I think there may be a ray of hope, or we may have to wait as you have hinted. I remember seeing the MS MingLiU and PMingLiU fonts, which I believe are Unicode fonts that come with all English versions of MS Internet Explorer. It was while installing my Q9 incorrectly in an English environment that I found that I could still type traditional Chinese characters in MS Word, which also has these fonts for Chinese and which all along I thought were Big5-encoded! The only reasonable explanation for this was that my Q9 was able to produce Unicode-encoded characters.

Okay I am going to do some testing when I have time. If it works there is still the tedium of course of typing pinyin with the Insert Symbol function. Anyway step by step! It ain’t that difficult to set Internet Explorer to use MingLiu font.

Nice discussing the whole thing with you Cranky.

Thanks

David


David, thanks for your detailed response.

In my earlier posts I forgot to mention another page about using Unicode and pinyin in Web pages.

It’s by Helmer Aslaksen, the person whose data I adapted for my pages on Chinese New Year.

Helmer’s approach is to use MS Word to produce the Unicode and then cut and paste the relevant code into a text editor. I’m not sure which text editors can handle the Chinese characters within Unicode. (SC UniPad, for example, can’t.) Word will also produce Unicode-encoded Web pages; the trouble is that Word’s HTML is extremely bloated.

I have been doing some testing and here are some confirmations:

1) Input methods to get Unicode characters

Of the two IMEs I mentioned above, I think the “best” is NJStar Communicator 2.23 because here you can actually control what encoding is used for the Chinese characters that you output. And this software even provides a code translator, meaning if you have copied a patch of words to the ‘clipboard’ those characters can be easily turned into the right encoding: UTF-8, or GB or Big5 etc. The menus or HELP are also in English/Chinese to facilitate the English user. Also true to their words, the whole program works beautifully without any glitches in either English or Chinese version of Windows 2000.

The only drawback for me was the input method in NJStar Communicator, which uses the pinyin method. I personally prefer the stroke or glyph method of writing Chinese characters, as with the pinyin input method you sometimes forget the pinyin of the particular character that you want to write. NJStar is apparently working on a stroke input method, but it is not out yet.

In the meantime, Q9 is the easiest stroke input method invented. Its limitation is that you cannot choose the type of encoding; this is “fixed” to shield the average user from such technicalities of encoding. TwinBridge is equally silly and “simplifying” in this way. Soon everyone will be the same as the fonts are trained to react to more than one encoding.

2) Fonts for the web

The latest font types have dual capabilities, i.e. under Big5 they will display as Big5 fonts, and under a Unicode OS, like Windows 2000, they will behave like Unicode fonts.

Font availability is no longer a problem because in general Unicode fonts are “automatically” present in version 5 IE browsers. If you are using Office 2000 or later, Unicode/Big5 dual-capability fonts like MingLiU, PMingLiU, Lucida and Arial are apparently “standard”. The same goes for their respective simplified-character cousins, the GB/Unicode fonts.

3) Helmer Aslaksen’s Web site

Yes, thanks, I went and confirmed that both my English and Chinese OS browsers can see his site without any problems using UTF-8 (a Unicode encoding).

The most important thing to know for viewing is that IE browsers classify the different encoding under font-languages. So as long as the latin-based languages font/Chinese languages font are specified to use their respective Unicode enabled fonts there should be no problems in the viewing.

4) HTML text editors

As I understand it, all double-byte accepting text editors can handle Chinese fonts well though formatting is still a problem.

Windows 2000 Notepad is double-byte enabled. Most of my testing was done on this. If you are using an English-language operating system, I think it is best to run NJStar Communicator, as this program also acts as a Chinese character viewer. You will need a viewer to see the Chinese characters apart from the input, and obviously to choose the right-looking font.

I would avoid all single-byte versions of Windows which are old anyway even if NJStar works on them as viewer and IME. eg Win98, ME etc…Chinese versions of the same OS are obviously double-byted.

The latest font types have dual capabilities, i.e. under Big5 they will display as Big5 fonts, and under a Unicode OS, like Windows 2000, they will behave like Unicode fonts.

Really? I don’t see how that could work, since Big 5 and Unicode have different, overlapping mapping tables, do they not?

Font availability is no longer a problem because in general Unicode fonts are “automatically” present in version 5 IE browsers. If you are using Office 2000 or later, Unicode/Big5 dual-capability fonts like MingLiU, PMingLiU, Lucida and Arial are apparently “standard”. The same goes for their respective simplified-character cousins, the GB/Unicode fonts.

Do you know Bjondi’s Character Agent program? Very useful for checking fonts for specific Unicode characters. And it’s freeware, too.

The most important thing to know for viewing is that IE browsers classify the different encoding under font-languages. So as long as the latin-based languages font/Chinese languages font are specified to use their respective Unicode enabled fonts there should be no problems in the viewing.

What I’m eager to know is if IE does this because it handles the standard correctly (or at least better than others), or because it cheats in some way. (What, Microsoft cheat on standards?) I’m willing to have pages that not everyone can view, as long as those pages are standards-compliant; I’m less willing to put up pages that are hacked to please a certain browser.

It worries me that Mozilla doesn’t handle my new test page well (see below). I messed around with the settings some and got it to work – but in a serif font and not the Arial Unicode MS that I had specified. Very curious behavior.

As I understand it, all double-byte accepting text editors can handle Chinese fonts well though formatting is still a problem.

I do almost all of my HTML in Notetab (freeware). Although I can use Chinese with it, it doesn’t handle Unicode – at least not on my Chinese Win 98 system.

I downloaded a copy of EmEditor (US$30 shareware) the other day. It seems to work well with both Unicode and Chinese characters.

One nice thing about EmEditor is that I was able to use this program to save a Big 5 page in Unicode in one easy step.

I used it to make a Unicode-enabled test page at www.romanization.com/cities/index_u.html
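For those without EmEditor, the same Big5-to-Unicode step can be sketched in a few lines of Python. This is only an illustration of the re-encoding, not what EmEditor does internally; `big5_to_utf8` is a made-up helper name, and a real page would also need its charset declaration changed to match.

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Re-encode Big5 bytes as UTF-8 bytes; raises if the input is not valid Big5."""
    return data.decode("big5").encode("utf-8")

# Stand-in for the bytes of a Big5 page (here just the station name Banqiao).
big5_page = "板橋".encode("big5")
utf8_page = big5_to_utf8(big5_page)

# Same two characters, but the byte counts differ between the encodings.
print(len(big5_page), len(utf8_page))
```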

This page uses the “moving diacritics” (i.e. those that are superimposed on the vowel that precedes them) rather than coded versions of the vowels with the marks. A distinct advantage of these is that with only four code numbers to memorize (one for each tone), typing (or search and replace) can go smoothly and easily. But I came across problems with these and tones over i’s, cuz in some browsers the width of the original letter ends up being the width of the letter with the tone. (In other words, a regular sans serif i looks just the same as a sans serif i with a tone mark – because the tone mark is compressed to the width of the i. Similarly, the tone marks sometimes move down too far when placed over umlauted u’s, creating a mess.)
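Here is a rough Python sketch of the two approaches: the “moving” (combining) marks versus the ready-made vowels. The four combining code points are from the Unicode Combining Diacritical Marks block; `mark_vowel` and `precompose` are illustrative names only.

```python
import unicodedata

# The four "moving" tone marks: one combining code point per tone.
TONE_MARKS = {
    1: "\u0304",  # combining macron
    2: "\u0301",  # combining acute accent
    3: "\u030c",  # combining caron
    4: "\u0300",  # combining grave accent
}

def mark_vowel(vowel, tone):
    """Append the combining mark for tone 1-4 to a base vowel."""
    return vowel + TONE_MARKS[tone]

def precompose(text):
    """Fold base letter + combining mark into a single code point where one exists."""
    return unicodedata.normalize("NFC", text)

moving = mark_vowel("o", 3)   # two code points: o + caron
fixed = precompose(moving)    # one code point: the ready-made third-tone o
print(len(moving), len(fixed), f"U+{ord(fixed):04X}")
```

Converting to the precomposed form is one possible way around the rendering trouble over i and ü, since the browser then draws a single glyph from the font instead of stacking a mark on a letter.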

Hi Cranky,

  1. Your web page using “moving” diacritics and showing pinyin and Chinese characters (Taiwan Counties) looked fine and sharp in both the English-language and the Chinese-language IE browsers. The 2nd and 4th tone marks were sufficiently defined over the i’s. I was using the sans serif Lucida font on one browser.

Maybe you should use/stick to this formula for future pinyin pages.

Just tell all IE users of your site to adjust their “Latin-based” font to a Unicode-sensitive font like MingLiU, Lucida Sans Unicode or PMingLiU. English IE browser users will hardly need to adjust anything, I am sure.

" The latest fonts types have duo-capabilities. ie under Big 5 it will display as Big5 and under Unicode OS, like Windows 2000, it will behave like Unicode fonts."

I strongly suspect they do now because the same font can display correct Chinese characters (not garbled like certain other font types) even when I am typing out Unicode or Big 5 encoded characters in separate sentences. I also repeated this experiment many times using different language platforms ie the English or Chinese Win2000 OS and notepad. Always the same.

I tested the NJStar IME separately to make sure it was producing differently encoded characters with the Output Switch. I inputted Chinese characters into a Big5-based Chinese-English dictionary program called DrEye. This dictionary rejected (or was confused by) the NJStar IME characters when the output switch was set to UTF-8 (Unicode) but worked normally when the output was turned to Big5. So this proves that NJStar IME’s Output Switch was working fine.
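A small Python sketch of why a Big5-only program chokes on UTF-8 input: the same character is a different byte sequence in each encoding. This uses Python's built-in codecs as a stand-in for what the IME emits, which is an assumption; `misread` is a hypothetical helper.

```python
char = "中"

# The same character as bytes under three encodings.
for enc in ("big5", "gb2312", "utf-8"):
    print(enc, char.encode(enc).hex())

def misread(text, written_as, read_as):
    """Encode text one way and decode it another, as a mismatched program would."""
    return text.encode(written_as).decode(read_as, errors="replace")

# UTF-8 bytes interpreted as Big5 no longer come out as the original character.
print(misread(char, "utf-8", "big5") == char)
```

This also bears on the mapping-table question: the fonts themselves are indexed one way, and it is the decoding step in front of them that has to know which encoding the bytes are in.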

  1. In IE and font languages

In IE there is no direct link between a particular Encoding and a font.

What I mean is from the fonts control pop-up box you cannot specify the browser to use a particular font for a particular encoding. There is no encoding type to choose.

Instead IE only allows the user to specify a particular font as the “main” font for a particular “IE language” listed.

For now Simplified and Traditional Chinese are still being treated as two separate IE languages with two separate sets/families of applicable fonts. This is about the only vague clue you have of their different encoding origins.

Why don’t you have an IE to play with? It would be much easier then to understand what most of your audience will be using to see your site. Important, you know.

Ha Ha! [Another Anti-MS web designer: Most people cannot be bothered to buy another browser! (like Netscape)]

There is another control next to the fonts control that specifies the number of languages IE will use when trying to decipher a page. Again this is a focus on languages. No relation to encoding whatsoever.

Sometimes you might find that the font names for a particular IE language are absent from the font control pop-up box. That’s usually because you have not visited a site with that language and your IE does not have any of the requisite fonts installed yet.

I truly believe one day users will be completely oblivious of the “encoding” used for a particular web page, especially as fonts can now display for more than one encoding.

Imagine this: a single font can display the same correct Chinese character whether the encoding is Big5 or UTF-8 or GB or HZ! I understand browsers would run faster without having so many fonts to carry and load.

But one thing though: with our chosen encoding now, UTF-8/Unicode, we can also show simplified characters alongside the traditional ones on the same web page!
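As a sketch of what such a mixed page might look like, here is a short Python function that builds a UTF-8 page with pinyin, traditional and simplified characters in one table. The function name, the table layout and the sample row are my own assumptions, not anything taken from the test pages mentioned above.

```python
def build_page(title, rows):
    """Build a minimal HTML page declared and encoded as UTF-8.

    rows is a list of (pinyin, traditional, simplified) tuples.
    """
    lines = [
        "<html><head>",
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
        f"<title>{title}</title>",
        "</head><body><table>",
    ]
    for pinyin, trad, simp in rows:
        lines.append(f"<tr><td>{pinyin}</td><td>{trad}</td><td>{simp}</td></tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines).encode("utf-8")

# Banqiao: tone-marked pinyin, traditional and simplified forms side by side.
page = build_page("Test", [("B\u01cenqi\u00e1o", "板橋", "板桥")])
print(len(page), "bytes")
```

The declared charset and the actual bytes have to agree; writing the bytes out with some other encoding would bring back the garbling.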

Getting there!


Why don’t you have an IE to play with? It would be much easier then to understand what most of your audience will be using to see your site. Important, you know. Ha Ha! [Another Anti-MS web designer: Most people cannot be bothered to buy another browser! (like Netscape)]

Hi, David:

Of course I have IE – and Netscape and Opera and Mozilla and Lynx and Amaya (Most of the time I use Opera 5.12 for English pages. For Chinese, NN or IE, depending on my mood.)

My point was that although I see the results in IE, I do not fully understand the reason that IE seems to get it right and others don’t. (Well, Opera doesn’t support Unicode yet.) Lots of pages may look correct in one browser and not another, but unfortunately that doesn’t necessarily mean that one browser is more correct; it could also mean that a browser is using a non-standard method to achieve its end, or that the page was written toward a browser and not toward a standard.

Yes, yes. I know: I worry too much.

My Lucida Sans Unicode (version 0.98, 298 KB) doesn’t have any moving diacritics and very little in Latin Extended-A or Extended-B.

Could I trouble you to check your copies of the fonts you mention in Character Agent?

Bitstream Cyberbit, however, is fine, and also contains CJK.

I understand browsers would run faster without having so many fonts to carry and load.
I hope so. But I suspect the larger sizes of the newer fonts will slow systems down just as much.

Getting there!

I should probably get Gus to move this over to Jeremy’s Technology forum.

quote:
" My Lucida Sans Unicode (version 0.98, 298 KB) doesn't have any moving diacritics and very little in Latin Extended-A or Extended-B. (Cranky Lao Xiong)

Well, it seems our resident expert in font technologies is not up-to-date with the latest in Microsoft offerings… tut tut!

ha ha! for a man with 5 browsers!! and using IE only on weekends - I’m not surprised!

Almost like a Moslem with 4 wives (bin Laden, Saddam Hussein, Colonel Gaddaffi, etc…hmm…in distinguished company) we understand your problems of having to choose a “fifth” for the weekend and the special occasions!


To answer your question: My Lucida Sans Unicode is much more up-to-date than yours (version 2.00, 317 KB).

(I only have one wife at the moment and she is real easy… ha ha.)

Yes my Lucinda has everything : Extended A, Extended B. She speaks Greek, Advanced Hebrew, all the currencies and of course the all important “moving” Diacritical Marks for our purposes.

There!

Thanks for your suggestions: I have even set up my MS Word to use the “moving” diacritical tone marks instead of whole characters with tone marks. This way we avoid having to use so many “special” insert characters and need only 4 when including tone marks for Hanyu pinyin.

I was thinking perhaps you can now fill up the Chinese History/culture section of your website with real Chinese characters! and the occasional pinyin.

The Chinese need not be that difficult. There are lots of Chinese History text books written for children around primary 6/secondary 1 level (age 12/13). The Chinese used in these text books are already graded by teachers so it would be just a step below what you have in Newspaper Reading classes for foreigners at their university. Of course you could also augment the site with English “summaries” for those who get easily put-off or bored with 100% Chinese websites.

Getting there!


Try this. www.romanization.com/streets/u_ch.html

There are still a few mistakes (lines switched around, and a few mis-romanizations). But I’ve worked on this enough tonight.

Another unicode page, this one of train stations between Taipei and Kaohsiung. www.romanization.com/trainstations/west_u.html
Still needing a little work.

Cranky Lao Da,

The pinyin and characters and colours of your site are absolutely marvellous!

Your number of train stations is a lot more than the number of stops appearing on the map. But then to get every station on the map would mean there is no space to write the names of the stations unless the map was much much bigger.

Another question:
How do you draw a map of Taiwan so precisely?
Or do you just scan it in and re-edit?

I like the books on your site. Any plans to add some Chinese books?

Too bad that with all that talent, none of you
are considering putting it to work helping the GNU/Linux effort.

By the way, with my Linux system, I can turn the Big5 on your webpage into pinyin as below:

lynx -dump http://www.romanization.com/mrt/banqiao.html > banqiao.txt
sh -x ./pinyin-diff.sh banqiao.txt

+ c2t
+ diff banqiao.txt -
+ sed 's/^[0-9]+c[0-9]+//;/^---/d'

<

I’m sorry the characters I posted were munged. I’m using a lot of hacks to try to post here [because I don’t use Microsoft]

By the way, your background color made
the foreground characters unreadable on Netscape 4.76, until I did “select all”.

I suspect you have javascript turned off. In Netscape 4.x, this has the effect of also turning off style sheets, making your browser act more like NN3.x in many ways.

I’m not certain why you would want to do this, when many other browsers (Netscape 6.1, Opera 5.12, Mozilla, etc.) would give you the same control against potential nastiness but not sabotage CSS.

Nonetheless, I agree that making content available to as wide a range of browsers is important. So thanks for alerting me to the problem. I’ll put fixing it in my (too long) to-do list.

Another question:
How do you draw a map of Taiwan so precisely? Or do you just scan it in and re-edit?

Thanks for the compliments on the site.

I did that map a long time ago, so my memory of it is a little fuzzy. I probably scanned a map and then traced my version on top of it on a different layer in Photoshop or Illustrator.

I like the books on your site. Any plans to add some Chinese books?

Maybe someday. But not in the near future. I’m actually running short on server space, mainly cuz of another site I did for a musician friend of mine:
Angry Young Grad Student. Have a look.

a message to Dan Jacobson:

Hey I went to your website:
www.geocities.com/jidanni

I must say it has lots of “character”. Some of which I couldn’t read! (hee hee - just kidding)

First, I think you are right: the Taiwan govt is just wasting resources going backwards and forwards with this inventing of a new way to spell Chinese tones (Tongyong) without really achieving any real added value for learners.

But apart from that I’m really quite fascinated by what else you have to say about learning Mandarin.

(You’ve been at it for say 10 years since 1991 ?)

You mentioned somewhere on these Oriented.forums that all you need to get fluent with Chinese was to use a tape recorder and repeat after the sounds.

You know, I absolutely agree with you on this, I think the reason why most people (foreigners/non native speakers) fail to get fluent with Mandarin is because they are being too distracted (at the same time) by the characters and having to learn “how to write it”. They should just as you have said: listen to “words” and repeat the sounds.

I think unlike English (or romanised languages) the process of speaking and listening should be split from that of reading and writing. Well as much as possible.

My biggest help in getting fluent in Mandarin must definitely be the television. In the last 2 years, my Mandarin listening skills really shot up since I began watching more Mandarin broadcast and programmes on TV. But it is still not where I can watch the news and pick up everything like I can with my main language. But variety shows from Taiwan TV stations with subtitles are much easier.

In Taiwan most of the TV/radio broadcast is of course in Mandarin which helps the learner no end.

In Hong Kong, wholly (24hr) Mandarin broadcast only started with Phoenix tv station about 3 years ago; and now we have www.Asiaplusi.com or DongFeng. There is also cable but that also started about the same time with Mandarin broadcasting and you have to pay!

I think Hong Kong govt is really ridiculous for not allowing all Guangdong channels to be re-boosted or broadcasted straight into Hong Kong (free). I think it would be more choice for everyone and especially since they are so concerned about Hong Kong people being unable to communicate in Putonghua. Already all the news clips (about China) from various provincial stations are borrowed straight and redubbed into Cantonese here, why not allow full parallel access? I think Taiwan has at least 80 local TV channels.

The other question I wanted to ask you was when you write Chinese, (as you have done on your website), how do you know it is “grammatical”?

Or do you write “as you speak” and shut your brain to the written structure?

The “written” form and the “spoken” form in Chinese deviate quite a bit as you know. Less in newspapers than other literary masterpieces but still significant and very off-putting (for the learner).

Most newspapers and magazines in Hong Kong cheat by straight writing in the colloquial expressions and often use spoken structure (ie Cantonese). I have to avoid reading tabloids for this reason and buy the real “standard” newspapers for my Mandarin lessons or buy magazines from Taiwan.

I like Times bi-lingual magazine from Taiwan which is written for English learners. The English is kind of trivial but I think the Chinese is good quality or at least I am told by native speakers it is acceptable.

My problem is I just can’t get the rather complicated “written” sentence structures to pop up in my head even when I am just reading.
And I think I should be writing like that instead of just writing “personal notes”.
Anyway it was quite fascinating to see little paragraphs in Chinese on your website which I presume was written and composed all by yourself - Congratulations! I think it is a major milestone to get your first few paragraphs “published”/accepted.


Has anyone tried publishing a web page with tone-marked Hanyu pinyin using GB (guobiao) code? I understand that all the pinyin characters are available in GB, it’s just a question of how to type the buggers. They are available in Microsoft’s pinyin IME, but the input is so clumsy that it is only usable for isolated words.
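A quick way to check the GB claim, sketched in Python. It assumes Python's built-in `gb2312` codec follows the GB 2312 standard, which as far as I know includes a row of tone-marked pinyin letters; `gb_bytes` is just an illustrative name.

```python
# a with the four tone marks: macron, acute, caron, grave.
TONED_A = "\u0101\u00e1\u01ce\u00e0"

def gb_bytes(text):
    """Encode text as GB2312; raises UnicodeEncodeError if a character is missing."""
    return text.encode("gb2312")

for ch in TONED_A:
    print(f"U+{ord(ch):04X}", gb_bytes(ch).hex())

# Pinyin and simplified characters can share one GB-encoded string.
mixed = gb_bytes("h\u01ceo 好")
print(len(mixed), "bytes")
```

So the characters are there in the encoding; the sticking point remains the typing, as noted above.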

Departing from the matter of pinyin, can I just point out that Big 5 code can be used with a simplified font, and GB code can be used with a traditional font.

Er…ahem…hate to point this out again, but a system like TOP (Tonally Orthographic Pinyin) represents the tones of Mandarin with no diacritical marks whatsoever…the catch is you have to teach people how to read it. Anyway…

This is a long shot (I’m not an expert in web stuff by any means) but what if you coded your page using a specific tone-marked font, like Jim Honeydew’s excellent “4-KeyTimesRoman” or “4KeyCourier”, and then offered those fonts for download via a link on the same page? Would that take care of the problem – if the interested end user could install the fonts on his (Windows) computer??

The fonts are freeware from Jim’s site, I suppose that if you asked him he probably wouldn’t care if there were another download link…??

Terry