Learning Chinese with the Word Sketch Engine

華語學生您好! 想不想以軟體程式加強您的中文詞彙能力? (Hello, Chinese learners! Want to strengthen your Chinese vocabulary with a software program?)

Introducing the Word Sketch Engine, a software tool developed by UK academics, guaranteed to improve your Chinese word power tenfold in a matter of weeks!!!

No, seriously, this isn’t spam. I’m involved with the evaluation of the Sketch Engine’s Chinese version at Ming Chuan University. We’d like to invite keen Chinese learning Forumosans to use the Sketch Engine (SkE for short) in their studies, to help with reading and vocabulary learning.

For further details, please read on. Or just go straight to mcu.edu.tw/~ssmith/walkthrough if you prefer! When asked to log in to Sketch Engine, use the name mcu02, and the password forumosa.

What’s in it for me?

You would potentially get a great boost with vocabulary acquisition, and definitely a lot of exposure to authentic newspaper Chinese (from either China or Taiwan or both, as you please). You would get to read real sentences containing the vocabulary you’re interested in – a far cry from the arbitrary vocab offerings in certain textbooks!

What is the Sketch Engine?

It’s called a “corpus query tool”. It’s a computer program, with a decent web interface, which reads in a vast corpus of Taiwan and mainland newswires to generate a short Word Sketch, a one page summary of the most common contexts in which a given word may be found. SkE was designed to help in compiling dictionaries, actually, and has already been used in Longman and OUP publications, but now we want to see how it performs as a tool for helping non-native speakers to learn Chinese.

There are versions of SkE for English and other languages too, but this research is interested in the Chinese version.

What features does it have?

There’s something called a Word Sketch, which shows you how a word patterns. Suppose you know contribution is 貢獻, but you don’t know whether to say 進行貢獻 or 做出貢獻. Well, you would just fire off a word sketch for 貢獻 (or one of the verbs), and that would tell you which is the most significant collocate.

Another thing it does is Sketch Differences. Imagine you wanted to clear up once and for all the difference between 高興 and 快樂. You could of course also run a word sketch to help with this problem, but with Differences you get a summary of the two. You are shown contexts where only 高興 is possible, contexts where only 快樂 is possible, and contexts where both are OK. There’s a colour coding system, so you can see quickly which is which.
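The Sketch Differences idea can be illustrated with a toy sketch. This is not SkE’s actual algorithm or data, just a hypothetical illustration: given (word, collocate) pairs invented for the example, we split the collocates of 高興 and 快樂 into only-A, only-B, and shared sets.

```python
from collections import Counter

# Toy "corpus": (word, collocate) pairs standing in for parsed collocations.
# These pairs are invented for illustration, not taken from SkE's corpus.
pairs = [
    ("高興", "得"), ("高興", "地"), ("高興", "極了"),
    ("快樂", "童年"), ("快樂", "時光"), ("快樂", "極了"),
]

def collocates(word, pairs):
    """Count the collocates observed with a given word."""
    return Counter(c for w, c in pairs if w == word)

a, b = collocates("高興", pairs), collocates("快樂", pairs)
only_a = set(a) - set(b)  # contexts seen only with 高興
only_b = set(b) - set(a)  # contexts seen only with 快樂
shared = set(a) & set(b)  # contexts seen with both
```

A real system would also weight each collocate by a salience statistic rather than raw counts, but the three-way split is the core of the display.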

What do you want me to do?

Use it in your studies! Every time you would normally reach for the dictionary, stop. You know you’re supposed to figure out the meaning from context by yourself… the Sketch Engine can help you (we hope) to do just that. We think it’s potentially a great learning tool, but we want to see scientific evidence of that.

What’s the next step?

You need to go to http://mcu.edu.tw/~ssmith/walkthrough. Here, you will be invited to take a short questionnaire (we’re calling it a pre-test, because we plan to ask you to do a post-test after a few weeks of using SkE in your studies). There are some questions on your background, including contact details, and on your current knowledge of collocations – we hope you will find the latter interesting.

Next, you’ll be taken on a walk through the Sketch Engine software (this will include logging in: use the name mcu02, and the password forumosa). You might want to set aside an hour to work through everything (you can always do the pre-test and then come back to the walkthrough later, of course).

Of course, you are under no obligation to participate, and having agreed to participate you are still free to withdraw at any time. And of course we wouldn’t use your personal data for any purpose other than our research: if you wanted to be anonymous that would be fine by us. Please offer feedback, too, via PM or this thread, as you please.

敬頌學安! (Respectfully wishing you success in your studies!)

Let me start off by saying, “Kewl!”

I didn’t fill out the questionnaire but I did have fun playing with Sketch. One thing I tried is inputting partial idioms to see what the tool would do. For example, take the idiom 一言不發. If I saw that used in a sentence but didn’t know what it meant, there’s a chance that I could parse it the wrong way and think that 一言不 was a word. In this case, inputting 一言不 resulted in Sketch returning nothing. A smarter Sketch might still return nothing, but could also offer alternatives it found that might work.

I also tried the very popular phrase 塞翁失馬 焉知非福. First, I inputted 焉知非福 into the Concordance page and got 3 hits with 塞翁失馬 preceding 焉知非福 in all three cases. Then I entered 塞翁失馬 and got 5 pages worth of hits, the vast majority being followed by 焉知非福. So how come these two queries offered such a discrepancy in hits? Word Sketch also fails when looking up either 焉 or 焉知, which is a bit disappointing since I was really curious about the results.

That’s all I have after playing with it for about 10 min. Good work!

I’d love a corpus tool, but I’m getting “page not found” when clicking on your link.

The first http link works but the 2nd one doesn’t for me.

Thanks for reporting the link problem, people. It’s corrected now.

Well, the thing is that the corpus is segmented. A segmentation (斷詞) algorithm was run once over the whole corpus to decide where the word boundaries were, before the tool was deployed. So what you suggest, while possible, would correspond to English silen*, or something like that. I can see how that kind of search could be pedagogically useful, but you wouldn’t be able to pull out stats (in the form of a Word Sketch) on how it behaves lexically, because it isn’t a word.

This is because 焉知非福 was treated as a single word in only three cases. In the others the segmentation is 焉知 非 福 – three words. So on the 5 pages you see mostly the latter (you can make out the gaps between the words if you look carefully).

[quote=“sjcma”]Word Sketch also fails when looking up either 焉 or 焉知, which is a bit disappointing since I was really curious about the results.
[/quote]

What error are you getting? “Not in pre-loaded corpus blah blah” or “No word sketch available”?

[quote=“smithsgj”]Well, the thing is that the corpus is segmented. A segmentation (斷詞) algorithm was run once over the whole corpus to decide where the word boundaries were, before the tool was deployed. So what you suggest, while possible, would correspond to English silen*, or something like that. I can see how that kind of search could be pedagogically useful, but you wouldn’t be able to pull out stats (in the form of a Word Sketch) on how it behaves lexically, because it isn’t a word.[/quote]

I guess I wasn’t clear in my suggestion, so let me try again with an example. Right now, inputting 一言 into the Concordance page results in “Empty result”. That’s fine. However, it would be more useful if, after “Empty result”, it also gave a list of words containing 一言 as potential alternatives. So the output would read:

Empty result, but perhaps you meant:

默無一言
一言難盡
一言不發
一言千金
一言為定
etc.

Clicking on any one of the suggestions would invoke a new Concordance search using the clicked word as input.
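The fallback being suggested here is essentially a substring search over the segmented word list. A minimal sketch, assuming the tool has access to its lexicon as a plain list (the `lexicon` contents below are just the examples from this thread, not SkE data):

```python
# Hypothetical "did you mean" lookup: when a query isn't itself a word in
# the segmented corpus, fall back to a substring match over the lexicon.
lexicon = ["默無一言", "一言難盡", "一言不發", "一言千金", "一言為定", "貢獻"]

def suggest(query, lexicon):
    """Return lexicon entries that contain the query as a substring."""
    return [w for w in lexicon if query in w and w != query]

print(suggest("一言", lexicon))
# → ['默無一言', '一言難盡', '一言不發', '一言千金', '一言為定']
```

In a real deployment the matches would presumably be ranked by corpus frequency before being offered as clickable alternatives.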

[quote=“smithsgj”]This is because 焉知非福 was treated as a single word in only three cases. In the others the segmentation is 焉知 非 福 – three words. So on the 5 pages you see mostly the latter (you can make out the gaps between the words if you look carefully).[/quote]

So is this simply a limitation of the segmentation algorithm? Chinese is inherently different from English in the sense that word segmentation is not straightforward. A similar issue arises when one tries to write Chinese purely in romanization. I don’t know how difficult it is to write the segmentation algorithm, but off the top of my head, it would seem beneficial to have the algorithm output a list of segmentation possibilities. Each item on the list could be ranked in terms of confidence. The search results would then show sentences that match with the highest confidence, followed by sentences that match at the next level of confidence, and so on. The number of confidence levels to search could be chosen by the advanced user.

For the researcher compiling statistics on word usage and frequency, this doesn’t really affect such statistics but can open up further avenues to explore.
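The ranked-segmentation idea can be sketched with a dictionary-based enumerator: list every way to split a string into known words, then score each split by word frequency. This is purely illustrative; the `freq` counts are invented, and real segmenters (and whatever SkE’s pipeline used) are far more sophisticated.

```python
from math import log

# Hypothetical word-frequency table; a real system would derive these
# counts from a large corpus.
freq = {"焉知非福": 3, "焉知": 40, "非": 900, "福": 500, "非福": 5}

def segmentations(s):
    """Enumerate all ways to split s into words found in freq."""
    if not s:
        return [[]]
    out = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in freq:
            out.extend([head] + rest for rest in segmentations(s[i:]))
    return out

def confidence(seg):
    # Mean log-frequency per word: a crude proxy for plausibility.
    return sum(log(freq[w]) for w in seg) / len(seg)

ranked = sorted(segmentations("焉知非福"), key=confidence, reverse=True)
```

With these toy counts, the three-word split 焉知 / 非 / 福 outranks the whole idiom, which mirrors what the corpus actually did in most of the hits above.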

Of course, I may be just talking out of my @ss

[quote=“smithsgj”][quote=“sjcma”]Word Sketch also fails when looking up either 焉 or 焉知, which is a bit disappointing since I was really curious about the results.
[/quote]
What error are you getting? “Not in pre-loaded corpus blah blah” or “No word sketch available”?[/quote]
The Word Sketch search on 焉 resulted in “Insufficient data for 焉” while the search for 焉知 resulted in “No Word Sketch available”.

I’m not sure that everyone is going to want to have the data on their responses downloadable by everyone, though. Not sure I would have put my name on something that was going to make public my shocking lack of knowledge about these collocation things…(seriously, there are privacy issues here.)

Also, it would be worth mentioning whether certain examples come from speech or writing, or could be equally applicable to both. This is different from other languages where the differences aren’t as marked.

There was also a completely free corpus of Chinese on the Web two or three years ago, although I’ve lost the link to it. I used to use it when I was in grad school in Taiwan. I’d be interested to have that link, because while the first 30 days are free, evidently they are planning to charge after that, and frankly I’m not sure how much that would be or whether it would be worth it for me. I enjoy browsing it as a sort of twisted, sick kind of recreation (I admit it), but I can always Google to get similar results on a keyword.

I’m also not sure how this is getting me anything more than a data set that’s a bit neater than a Google search. It doesn’t really seem ready to handle Chinese.

Example:
I input 實施 (I’m in the Simplified Chinese corpus, but anyway…)
When I go to Word Sketch, it tells me that “醫藥分業” is a subject of 實施. I don’t think so.
Since I already read Chinese, I can tell that this is not the case, but what would a student at an earlier stage of the whole thing think?

I like corpora, don’t get me wrong, but I wonder if this isn’t being brought out a bit hastily in hopes of being able to simply plug in a Chinese corpus and have it work, when the whole basis for grammatical relationships is quite different in Chinese than in, say, the other languages in the system thus far. The whole topic-comment thing is a bear, just for starters. I wonder if there isn’t too much emphasis on word order in the process that determines which words go with which words?

I’d love to be able to use this kind of thing for something and recommend it to students, but I really need to understand how it’s going to be of benefit and what kind of information it’s going to provide that isn’t already available for free on google.com. Maybe someone can point me in the right direction?

Sorry about the way the answers were made public. That issue is now resolved: please now go ahead and take the pre-test, by going to http://mcu.edu.tw/~ssmith/walkthrough. Your answers will NOT now be available anywhere on the web!

Ironlady, check http://www.sinica.edu.tw/ftms-bin/kiwi1/mkiwi.sh

http://bowland-files.lancs.ac.uk/corplang/lcmc/

I’ll get back on your other points: but please bear in mind that Sketch Engine is a corpus query tool, not a corpus! It can be applied to any suitable marked up corpus, but works best with a truly huge corpus like Chinese Gigaword.

The grammar rules for SkE were indeed written for English, this is true. However:

a) they are being developed to take account of (e.g.) the 把 / 將 construction (the error you spotted was to do with verb subject/object, wasn’t it?)

b) SkE uses grammatical patterns to find collocates. It is not intended to be a grammar teaching tool at all! I can see how people could be misled into thinking it is, and I’ll think about that issue more. Thank you for raising it.

I’ll get back on other points fairly soon.

For those who were kind enough to complete the online pre-test a few weeks back:

If you didn’t get our email, would you mind going to

http://myweb.scu.edu.tw/~mralice/SimplifiedlPostTest.htm

or

http://myweb.scu.edu.tw/~mralice/TraditionalPostTest.htm

to do the second part of the questionnaire?

Thanks very much!

I hate to always sound like the devil’s advocate, but what is a post-test, without any control of the treatment in the intervening period, going to show? Some have used the product extensively; others haven’t touched it. I don’t see detailed questions at the beginning to really separate degrees of use (and how do you do that anyway, given different Mandarin levels, and so on), so unless each individual user is being taken as a group for statistical purposes – which makes the statistics meaningless – …???