Correct Character Encoding

Sorry if that has been covered before, I tried to do a search and couldn’t find the answer.
Normally my computer displays characters fine, but sometimes certain websites and software come out garbled. This is especially common with website headers and software dialogue boxes. It is very frustrating. Is there any obvious solution I am missing?
I am using Vista and Firefox 3.0.3
Thanks…

I do not have exact answers to your question, but my recent experiences have given me sufficient time to draw my own conclusions about why these problems continue to happen, and why there is no easy fix.

To better qualify my answers, here is a list of the PC systems, servers and applications that I use for development:

  • Red Hat Enterprise Linux WS4 (PC and server): apache2, php 4, mysql 5.1, subversion 1.4, zendStudio 5.5, firefox 2, scim 1.4
  • Red Hat Enterprise Linux WS3 (server)
  • Windows XP SP2 (PC)

When I started learning Chinese, I had many problems getting Linux to write Traditional Chinese correctly. There is a lot of good information here on forumosa.com and on linuxquestions.org about how to configure a computer to enter Chinese. Originally, I used Windows XP for writing Chinese because it was simply easier at that time.

Definitions:

Unicode: defines a mapping of every character to a machine-readable number, called a ‘code point’ (there is room for more than one million of them). A code point by itself says nothing about how the character is stored as bytes; that is the job of an encoding.

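Encoding: a system for storing characters as bytes. Common encodings are: utf-16 (Windows XP & Vista; its older fixed-width form is ucs-2), utf-8 (RHEL WS4 Linux), big5 (Traditional Chinese), … ascii. The utf-* encodings can store every Unicode code point; big5 and ascii cover only their own, smaller character sets.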
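To make the difference between a code point and an encoding concrete, here is a minimal Python sketch (Python is chosen only for illustration; it is not part of the setup described above). It prints the code point of one character and the bytes that three encodings produce for it.

```python
# One character, one Unicode code point, several possible byte sequences.
ch = "\u4e2d"   # the character 中

code_point = ord(ch)              # the Unicode number for this character
print(f"U+{code_point:04X}")      # U+4E2D

# The same code point stored with three different encodings.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-le", "big5")}
for name, raw in encoded.items():
    print(name, raw.hex(), len(raw), "bytes")
```

The code point never changes; only the bytes used to store it do.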

utf-8 Encoding: Each Unicode code point is stored as a sequence of 8-bit bytes. Every code point 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6, but the current standard stops at 4). This means English text looks the same in utf-8 as it did in ascii, and its file size stays the same.
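The variable width is easy to check. This short Python sketch, using four sample characters picked only for illustration, counts the bytes utf-8 needs for each one; current utf-8 never needs more than 4 bytes per code point.

```python
# How many bytes utf-8 needs for code points of increasing size.
# A (U+0041), e-acute (U+00E9), the character 中 (U+4E2D), an emoji (U+1F600)
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# Pure ASCII text is byte-for-byte identical in ascii and utf-8.
text = "Hello, world"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)   # True
```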

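utf-16, ucs-2 Encoding: Each Unicode code point is stored in 2 bytes. utf-16 additionally uses a pair of 2-byte units (4 bytes) for code points above 65,535, which ucs-2 cannot represent at all. The same English file is double the size it was in utf-8. Windows can support all languages, with the only penalty being that each mostly-English file takes more space.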
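The size difference can be measured directly. A small Python sketch with made-up sample text:

```python
text = "Hello, world"           # 12 ASCII characters

utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))   # "-le" avoids the 2-byte BOM prefix
print(utf8_size, utf16_size)    # 12 24

# A code point above U+FFFF needs a surrogate pair: 4 bytes in utf-16.
emoji = "\U0001F600"
emoji_utf16_size = len(emoji.encode("utf-16-le"))
print(emoji_utf16_size)         # 4

# A common Chinese character (U+4E2D): 3 bytes in utf-8, 2 in utf-16.
zhong = "\u4e2d"
zhong_sizes = (len(zhong.encode("utf-8")), len(zhong.encode("utf-16-le")))
print(zhong_sizes)              # (3, 2)
```

Note the last line: for Chinese text, utf-16 is actually the more compact of the two.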

Code Points: Unicode text can be converted to any encoding scheme, with one catch: some of the letters might not show up. If the encoding you are trying to use has no equivalent for the Unicode code point you are trying to represent, you get a question mark (?) or a little box. (This is paraphrased from one of the articles I read on the Internet.) This is most easily seen when text written in Simplified Chinese ends up on a webpage that has its encoding set to big5: each Simplified Chinese character that does not have an equivalent in big5 becomes a question mark or a little box. A related failure is when utf-8 bytes are displayed as big5 without any conversion; then the bytes are simply misread and the whole text turns into garbage.
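Both failures can be reproduced in a few lines. This is a minimal Python sketch with sample words chosen for illustration; it relies on Python's built-in big5 codec, which may differ slightly from the big5 tables a particular browser or operating system uses.

```python
simplified = "汉语"     # Simplified Chinese forms that big5 does not contain
traditional = "漢語"    # the Traditional equivalents, which big5 does cover

# Converting to big5: characters with no big5 equivalent become "?".
converted = simplified.encode("big5", errors="replace")
print(converted)        # b'??'

# The Traditional text converts and comes back unchanged.
round_trip = traditional.encode("big5").decode("big5")
print(round_trip == traditional)   # True

# A different failure: utf-8 bytes read as if they were big5.
# Nothing is "missing" here; the bytes are simply misinterpreted.
misread = traditional.encode("utf-8").decode("big5", errors="replace")
print(misread == traditional)      # False
```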

===================================================

On to your questions.

Websites:

Each webmaster or web developer struggles with this issue. There is a huge amount of web server documentation (apache, apache2) on how best to handle this, and many servers offer their own built-in solutions. This is perfectly fine, but what most people want is a simple solution. This simple solution comes down to 1) the software program you use to write your html/php/asp pages, and 2) the character encoding specified inside the head tag of the webpage.

  1. Your editing software should not alter a file’s character encoding unless you change it yourself. This seems like a simple request, but many software applications still have this problem. The root issue is that software applications are designed for a particular operating system and a range of encodings. Most people ignore these little details and commonly use older software in situations it was not designed for. In my experience, using Dreamweaver MX 2004 on Windows XP SP2 to edit files encoded in big5 is a problem: it displays the file as utf-8 and you see a file full of question marks and little boxes, regardless of the application’s encoding settings. Like I said, older software has more of these problems.

To solve this on Windows, use Notepad or Wordpad to open, edit and save the file. In my experience, neither of these basic editors included in Windows will alter the source file encoding. There are many other good applications (open source and proprietary) that can safely open, edit and save files on Windows, but these two are bundled with Windows and they work. If you prefer colored syntax, do a web search for Taditor. It is from Poland and is a good one; it is free to evaluate, but has a fee for full use. Of course, newer versions of Dreamweaver have corrected this, but I don’t agree with having to pay for software upgrades.
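What "not altering the encoding" means for a program can be sketched in Python: decode the file with the encoding it was written in, change the text, and write it back with that same encoding. The helper name `edit_preserving_encoding`, the file name and the sample content are all made up for this illustration.

```python
import os
import tempfile

def edit_preserving_encoding(path, encoding, transform):
    """Read a text file, apply transform to its text, save in the SAME encoding."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(transform(text))

# Demo with a throwaway big5 file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "page.html")
with open(demo_path, "wb") as f:
    f.write("<title>中文</title>".encode("big5"))

edit_preserving_encoding(demo_path, "big5", lambda t: t.replace("title", "h1"))

with open(demo_path, "rb") as f:
    saved = f.read()
print(saved.decode("big5"))   # <h1>中文</h1>
```

An editor that reads with one encoding and writes with another, or guesses the wrong one on open, is what produces the question marks and boxes described above.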

On Linux, in my experience, the following works flawlessly: Red Hat Enterprise Linux WS4, SCIM 1.4, an IM Engine (I use scim-pinyin & scim-chewing), xterm, gEdit 2.8, openoffice and zendStudio5. My server/workstation install was done in English, and I have a modified .bash_profile to start the input method process each time I log in. There are many different versions of Linux (free and proprietary) that can be configured to enter Chinese, and many people were writing Chinese on them long before me, but I like Red Hat and this solution works best for me.

In an xterm, you should pay attention to the following two lines:

$ locale
...
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
...

Then, after scim and scim-pinyin or scim-chewing are successfully installed, add this line (case is important) to the bottom of /home/{user}/.bash_profile:

export XMODIFIERS=@im=SCIM; export LC_CTYPE=zh_TW.big5; scim -d;

Log out and back in. You can now fine-tune the scim server through the grey keyboard in the bottom right of the screen. There is little to configure, but you should review the keyboard shortcuts for switching between the input methods. Now, if you check your locale settings again:

$ locale
...
LANG=en_US.UTF-8
LC_CTYPE=zh_TW.big5
...
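Why does changing LC_CTYPE matter while LANG stays the same? Under the POSIX rules, LC_ALL overrides LC_CTYPE, which in turn overrides LANG. The Python sketch below models that lookup on plain dictionaries; the function names `effective_ctype` and `split_locale` are made up for this illustration, and locale modifiers such as `@euro` are ignored.

```python
def effective_ctype(env):
    """Return the locale string that governs character handling (LC_CTYPE).

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"   # the default locale when nothing is set

def split_locale(value):
    """Split 'zh_TW.big5' into ('zh_TW', 'big5'); the codeset may be empty."""
    lang, _, codeset = value.partition(".")
    return lang, codeset

before = {"LANG": "en_US.UTF-8", "LC_CTYPE": "en_US.UTF-8"}
after = {"LANG": "en_US.UTF-8", "LC_CTYPE": "zh_TW.big5"}

print(split_locale(effective_ctype(before)))   # ('en_US', 'UTF-8')
print(split_locale(effective_ctype(after)))    # ('zh_TW', 'big5')
```

To inspect a real session, pass `os.environ` instead of the sample dictionaries.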

===================================================

  2. The most common reason for seeing garbled content on a website is that the content is not actually stored in the character encoding that the webpage declares. To briefly review, the single most important statement in any html webpage is:

[code]
<meta http-equiv="Content-Type" content="text/html; charset=...">
... [/code]

First, if the http-equiv meta tag is not the first tag in head, it should be. Browsers start parsing with a guessed encoding, and if this tag then declares a different one they have to restart parsing the page. If your keywords and description lines are long and are placed before this tag, you can decrease your page loading time by making the http-equiv tag the first meta tag after the opening head tag.
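Whether a page's bytes agree with its declared charset can be tested mechanically. The Python sketch below is one possible check, not what any browser actually runs: it looks for `charset=` near the top of the page and tries to decode the whole page with it. The function names, the sample page and the 1024-byte search window are assumptions of this sketch.

```python
import re

# Finds charset=... in the raw bytes of the page header.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)

def declared_charset(page_bytes, window=1024):
    """Return the charset named near the top of the page, or None."""
    match = CHARSET_RE.search(page_bytes[:window])
    return match.group(1).decode("ascii").lower() if match else None

def matches_declaration(page_bytes):
    """True if the page bytes decode cleanly in the charset the page declares."""
    charset = declared_charset(page_bytes)
    if charset is None:
        return False
    try:
        page_bytes.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

html = ('<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=big5">'
        '<title>漢語</title></head><body></body></html>')

good_page = html.encode("big5")    # bytes agree with the declaration
bad_page = html.encode("utf-8")    # declared big5, actually saved as utf-8

print(declared_charset(good_page))      # big5
print(matches_declaration(good_page))   # True
print(matches_declaration(bad_page))    # False
```

A clean decode does not prove the declaration is right (many wrong byte sequences still decode), but a failed decode proves it is wrong.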

Second, if your website/blog/wiki is public on the Internet and allows users to write content, then utf-8 is a better choice than big5. That said, if you are creating an information/products/services website where you control all of the content, then big5 is just fine. As you expand your website and start adding dynamic content like an RSS feed, forum or wiki software, you need to pay more attention to the encoding used on different pages. I recently finished a project where the website character set was ‘big5’ or ‘en’ (Traditional Chinese or English) and the MySQL database was utf-8. The core problem was correctly setting the character encoding before each SQL statement is executed. The database, like any storage system, has a set of rules to follow when moving data in and out, and in my early experiments I got a lot of garbage when trying to store big5-encoded string data.
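The conversion that has to happen at the boundary between a big5 page and utf-8 storage can be sketched without a database. In MySQL the connection encoding is normally declared with a statement such as SET NAMES so that the server does this conversion itself; the Python sketch below only shows the byte-level idea, and its function names and sample text are made up for illustration.

```python
PAGE_ENCODING = "big5"       # what the website sends and receives
STORAGE_ENCODING = "utf-8"   # what the database stores

def page_to_storage(form_bytes):
    """Convert bytes submitted by a big5 page into bytes fit for utf-8 storage."""
    return form_bytes.decode(PAGE_ENCODING).encode(STORAGE_ENCODING)

def storage_to_page(stored_bytes):
    """Convert utf-8 bytes read from storage back into big5 for the page.

    Characters that big5 cannot represent are replaced with '?'.
    """
    text = stored_bytes.decode(STORAGE_ENCODING)
    return text.encode(PAGE_ENCODING, errors="replace")

submitted = "漢語".encode("big5")            # what a big5 form would post
stored = page_to_storage(submitted)
print(stored == "漢語".encode("utf-8"))      # True

# Skipping the conversion is what produces garbage: big5 bytes are not valid utf-8.
try:
    submitted.decode("utf-8")
    skipped_ok = True
except UnicodeDecodeError:
    skipped_ok = False
print(skipped_ok)                            # False

print(storage_to_page(stored) == submitted)  # True
```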

===================================================

  1. Dialog boxes that are garbled or in the wrong language.

I have not run into this with Windows, but it is a constant irritation in Linux, specifically with ZendStudio. In my opinion, this is an application-level problem. My application, ZendStudio, installs in English, but because my .bash_profile includes LC_CTYPE=zh_TW.big5, when I first open the application, all menus are in Chinese. OK: Tools->Preferences->Desktop->Appearance->Language, change to English. Close and reopen the application and now all of the menus are in English. Now choose menu->file open and the file open dialog box appears. The window title is in English, but all input box titles and button names are in Chinese. My guess is that the application in general is OK, but as soon as a dialog box needs to interact with the operating system, or more likely with your current shell environment, there is some confusion over which language to render. It is an irritation worth living with because overall, this software makes development simple and productive.

===================================================

A little long for some, but so many times I have come across posts like these that never get answered because of the very complex nature of the problem. These are my experiences and what works for me. Those interested in more information on this topic should look at:

unicode.org
utf-8.com
en.wikipedia.org/wiki/Big5
herongyang.com/php/non_ascii_mysql.html