Reading utf-8 text file in cygwin

smithsgj · May 8, 2008, 8:55am

Anyone got any idea how to do this?

I’ve got a normal XP PC, English version but with CJK language pack and whatnot, and a text file in Chinese which opens in MS Word and wordpad just fine (Word prompts me to confirm the encoding).

The file (compressed) is at mcu.edu.tw/~ssmith/cna101.rar. It’s a bit big, I’m afraid, but I don’t know how to cut out a sample without potentially changing the formatting.

But when I use cat or anything else in cygwin, I just get garbage. Stuff that looks like Chinese, but isn’t real characters.

I read somewhere that you should try cat -U. But that doesn’t work in cygwin: its cat has no such option.

I tried downloading a utf-8 test file: cl.cam.ac.uk/~mgk25/ucs/exam … 8-demo.txt. In Word and Notepad, this comes out a bit funny, maybe because I haven’t got all the language fonts installed, but no big deal.

With cat/cygwin, STILL the funny garbage characters that look a lot like “fake” Chinese characters (even though the demo file doesn’t contain any Chinese).

If I open the demo file in Word, as BIG5 format, I can reproduce the behaviour of cat (except that cat has a capital C cedilla that doesn’t show up in Word). Ah-ha, I thought, my version of cygwin thinks it’s BIG5 for some reason!

But no, when I look at the demo file (no Chinese in it) with Word, and choose either Windows or MS-DOS when the encoding prompt comes up, I still get the fake Chinese characters!

So I don’t really know what’s going on! Anyone know how I can read this file and process it with cat, grep etc?

irishstu · May 8, 2008, 9:35am

Try copying it out of Word, then pasting it into Notepad, without saving, then copying it out of there and pasting it into your other software.

If that doesn’t work, try from Word to Wordpad to your software.

Big_Fluffy_Matthew · May 8, 2008, 9:45am

Have you set the non-unicode programs thingy in the control panel to “Chinese”?

smithsgj · May 10, 2008, 9:05am

BFM, I changed that control panel thingy to English and now I get normal ASCII-looking luan ma (garbled text) instead of the fake Chinese-looking characters. So cygwin is indeed one of those “non-unicode programs”.

Stu, thanks for your suggestions. But I’m looking at 300 massive text files, so manual editing in Word etc. won’t work :-( This is the reason why I need to use cygwin (the only way to get “Unix” running on a PC). And you can’t paste into cygwin, I don’t think: you can only type on the command line or use an input file.

Ectoplasma · May 10, 2008, 11:53am

Aside from your problem, you don’t need to install a “CJK pack” for Windows XP. You can get CJK support by going to Regional and Language Options in the control panel and installing those languages. You’ll need the Windows XP installation CD though.

As to the actual problem you have:

I don’t know why you use cat (did you try dog already? haha), but depending on where your data goes, you might be able to ignore how it looks on screen. It may well come out right once it is saved in the destination file or format.

If not, why don’t you convert the file from its original encoding to something more modern and widely supported? Save it as UTF-8 or another Unicode encoding. Try to do this from Notepad, or MS Word, or something freeware such as Notepad++ or PSPad, or a trial version of UltraEdit. I think the Linux cat command should work fine with UTF-8 / Unicode.

Not sure what you are trying to do, though…

Ectoplasma · May 10, 2008, 12:21pm

I just took a look at your file. It seems to be encoded as UTF-8 already. I opened it in MS Office Word 2003 and saved it as a UTF-8 text file. That makes a difference. Here is the difference:

I notice there is no byte order mark at the start of the original file. Linux cat needs that to see it is UTF-8. The byte order mark is invisible, but Word can add it (it is a start sequence of 3 bytes: EF BB BF, the first 3 bytes of the file).

So try to save it as a UTF-8 text file in Word. Linux cat should then work.

smithsgj · May 10, 2008, 3:43pm

Thanks, ecto. I just tried that and cat doesn’t work, unfortunately. When you saved in Word, did you use any of the options or just take the defaults?

cat is fine with non-unicode Chinese. It’s the fact that this file is unicode that’s causing the problem, I think.

What I actually want to do is a (fairly) simple string edit. The file is a linguistic corpus, and I want to redo some of the markup without actually re-running the tagging program. I wouldn’t be using cat for that, but sed or awk.

I’ll try that (not dog, the other thing ;-)). I will write a script saying what I want to substitute from and to. I suppose I have to make sure the script is in UTF-8 also, and I do that by saving in Word in the way you told me in your second message?

I can’t just do this IN Word, because there are 300 files like the one you looked at, and about 1000 substitutions for each file.
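For a batch job of this shape (many files, many literal substitutions each), here is a minimal sketch in Python rather than sed or awk. It reads and writes the files with an explicit UTF-8 encoding, so the result does not depend on how the console displays the text. The function names, the folder layout, and the example tags are hypothetical and not taken from the corpus; real substitution pairs would have to be filled in.

```python
from pathlib import Path


def apply_substitutions(text, substitutions):
    """Apply (old, new) literal replacements in order and return the result."""
    for old, new in substitutions:
        text = text.replace(old, new)
    return text


def rewrite_file(src, dst, substitutions):
    """Read src as UTF-8, apply the substitutions, and write dst as UTF-8."""
    # "utf-8-sig" accepts files with or without a leading byte order mark.
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(apply_substitutions(text, substitutions))


def rewrite_folder(src_dir, dst_dir, substitutions, pattern="*.txt"):
    """Process every matching file in src_dir into dst_dir; return the count."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(src_dir).glob(pattern)):
        rewrite_file(src, out / src.name, substitutions)
        count += 1
    return count


# Hypothetical usage: the tag names and folder names are placeholders.
# rewrite_folder("corpus", "corpus_fixed", [("(Nh)", "(PRON)")])
```

The substitutions are applied in list order, so the output of an earlier pair can be matched by a later one; if that matters for the markup, the order of the list needs some care. The output is written to a separate folder so the original files stay untouched.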

Brendon · May 10, 2008, 5:20pm

What’s your locale set to in cygwin?

Ectoplasma · May 11, 2008, 5:51am

When I open the file in Word it asks me the format (UTF-8 already selected) and I confirm it is UTF-8. Then I save the file in text format. Word again asks me if I would like to use UTF-8, and I confirm. I checked the byte order mark of the text file with a hex editor; it’s there now, so I got the file saved in proper UTF-8 text format. You can try 'od -t x1' on Linux to verify that the byte order mark EF BB BF is there ('od -x' prints 2-byte words, so on a PC the bytes in each word appear swapped, starting with bbef). I hope this solves it… Good luck.
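If no hex editor is at hand, the same check can be sketched in a few lines of Python. This only reports whether the file starts with the three bytes EF BB BF; the function names are made up for illustration, and the file name in the usage comment is a placeholder.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def first_bytes_hex(path, count=16):
    """Return the first `count` bytes of the file as spaced uppercase hex."""
    with open(path, "rb") as f:
        # bytes.hex() with a separator needs Python 3.8 or later.
        return f.read(count).hex(" ").upper()


# Hypothetical usage; "cna101.txt" stands in for the real file name.
# print(has_utf8_bom("cna101.txt"))
# print(first_bytes_hex("cna101.txt"))
```

Running it on the file before and after saving from Word should show whether Word really added the mark.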