How are the "wiki links" on the Taipei Times site done?

dearpeter · February 22, 2007, 10:10am

If you read any news story on the Taipei Times Web site, there is a little “wiki links” icon, which, if clicked, produces a very similar page with many keywords linked to the corresponding Wikipedia entry.

Does anyone know how this was accomplished? The link leads to a wikified version of the original pages, but I doubt they code each one by hand. I suspect there must be some script that automates the generation of the wikified version.

Brendon · February 22, 2007, 10:41am

I don’t know for sure how they did it, but here’s how I’d do it:

1. Download an index of the sort of Wikipedia pages you want to link to. You could do this through the Wikipedia Contents or Index pages.
2. Stick the page names into a database or file. I’d use a file, and then a hash in server memory at runtime for fast lookups.
3. When a new content page is added to your site (or old content is edited), run over it with a script. Look for capitalized words, or acronyms – these may be proper nouns and thereby good candidates for linking. If you find several capitalized words in a row, join them together into WikiLink format.
4. See if the phrase is in your index. If it is, wrap it in a standard Wikipedia link. Continue.
5. Save the modified content page somewhere and link to it as the wikilinks version. Doing it only at content update time will be much more efficient than parsing it every time someone looks at the page.

And you’re done. I’m pretty sure the Taipei Times does something like this. I noticed they link only to capitalized words and acronyms. Trying to do an all-word-sets parse on the article (to get things other than proper nouns) would be prohibitively expensive.
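A minimal sketch of the lookup steps above, assuming the index has already been loaded into a Python set. The names `PAGE_INDEX`, `CAP_RUN` and `wikify` are made up for illustration, and this version only looks up each whole run of capitalized words, so "The Executive Yuan" at the start of a sentence would be missed:

```python
import re

# Hypothetical index; in practice this would be loaded from a file of page names.
PAGE_INDEX = {"Tony Blair", "Executive Yuan", "Taipei"}

# One or more capitalized words in a row, separated by single spaces.
CAP_RUN = re.compile(r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*")

def wikify(text, index=PAGE_INDEX):
    """Wrap every capitalized run that appears in the index in a Wikipedia link."""
    def link(match):
        phrase = match.group(0)
        if phrase not in index:          # set lookup, O(1) on average
            return phrase
        title = phrase.replace(" ", "_")
        return '<a href="http://en.wikipedia.org/wiki/%s">%s</a>' % (title, phrase)
    return CAP_RUN.sub(link, text)

print(wikify("He met Tony Blair in Taipei."))
```

Because the regex only consumes letters and single spaces, trailing punctuation stays outside the link.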

I also noticed that they seem to assume the first word of the sentence is capitalized only because that’s what we do in English, and not because it’s a linkable name. For instance in this article, there’s a paragraph starting with the word “Denmark”. Denmark doesn’t get linked to, though it presumably should (they link to Britain, etc). So they have a bug!

Cheers,
Brendon

Edit: Actually I noticed a couple other crappy bugs too. Their links extend over trailing punctuation, so you get blue underlined commas and stuff. And they link to the same phrase more than once in the same article, which is silly - it should link only the first occurrence, to make the text clearer.

dearpeter · February 22, 2007, 10:43am

And could you write such a script? Or is there one out there ready to wear?

Brendon · February 22, 2007, 10:46am

It seems pretty straightforward, but I have a bunch of other stuff to get done in the next few weeks. I did a cursory Google search for existing scripts but didn’t see any.

Taffy · February 22, 2007, 10:49am

I see the way Brendon has done it - but here’s a (simpler?) idea:

They could do it by running an auto-censor similar to Forumosa. All that would be needed then is to define a list of keywords (for example Taipei, Lee Teng-hui, Executive Yuan, bubble tea) which will trigger the wiki syntax. So when your search and replace function finds a word that is on the wiki list, it will create an automatic link. This is helped by the fact that wiki syntax is very consistent, so all articles can be found at en.wikipedia.org/wiki/. You just set up your s&r function to replace any keyword in your predefined list with a set string like <a href="http://en.wikipedia.org/wiki/keyword">keyword</a>
and Bob’s your uncle.

That’s the way I’d do it.
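A rough sketch of that search-and-replace idea, assuming a short hand-written keyword list. `KEYWORDS` and `link_keywords` are illustrative names, not anything Forumosa or the Taipei Times is known to use:

```python
import re

# Hand-maintained keyword list (hypothetical example entries).
KEYWORDS = ["Taipei", "Lee Teng-hui", "Executive Yuan", "bubble tea"]

# Longest keywords first, so a longer phrase wins over a shorter one it contains.
_pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(
        re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True)
    )
)

def link_keywords(text):
    """Replace each keyword occurrence with a link to its Wikipedia page."""
    def repl(match):
        keyword = match.group(0)
        title = keyword.replace(" ", "_")
        return '<a href="http://en.wikipedia.org/wiki/%s">%s</a>' % (title, keyword)
    return _pattern.sub(repl, text)

print(link_keywords("I drank bubble tea in Taipei"))
```

Compiling all the keywords into one alternation lets the regex engine scan the text once, rather than running a separate replace per keyword.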

Brendon · February 22, 2007, 11:04am

So far as I can tell, that’s exactly what I said but involves typing in all the keywords yourself.

Edit: Actually, there are a few other differences. You’d need a search-and-replace tool that could be configured to do things like replace spaces - the Wikipedia link for Tony Blair is “Tony_Blair”. It would also need to be able to deal with line breaks in the middle of phrases, and the like. And since it wouldn’t have heuristic rules for which phrases to check against the keyword index, it would have to check every keyword for every word of the content - making it m times slower where m is the number of keywords (It would be O(mn) rather than O(n)).

On the other hand, it might be easier to set up – in the sense that it would only take three hours of admin time rather than one hour of coding time.
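The space and line-break handling mentioned above could look something like this. It is only a sketch; `wiki_url` and `phrase_pattern` are hypothetical helper names:

```python
import re
from urllib.parse import quote

def wiki_url(phrase):
    """Build a Wikipedia URL, turning any run of whitespace into one underscore."""
    title = "_".join(phrase.split())      # split() also swallows line breaks
    return "http://en.wikipedia.org/wiki/" + quote(title)

def phrase_pattern(phrase):
    """Regex that matches the phrase even when a line break separates its words."""
    return re.compile(r"\s+".join(re.escape(word) for word in phrase.split()))

print(wiki_url("Tony\nBlair"))
print(bool(phrase_pattern("Tony Blair").search("said Tony\nBlair today")))
```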

Taffy · February 22, 2007, 11:34am

Damn, you edited your response so I had to throw away half of mine.

Why would including other categories of words be prohibitively expensive? The Forumosa autocensor script scans all the text looking for certain words (capitalised or otherwise) - this doesn’t take up a huge amount of processing time as far as I can see. As long as you’re doing what you suggested and saving a wikified copy to the database rather than doing it each time someone requests the page, I can’t see the problem (at least for sites with F.com’s volume of traffic).

By checking against only certain instances (like capitalised words which are not at the beginning of sentences) you are limiting the usefulness and scope of a wikified version (which is how I feel about the TT site). Sure, most of the words you’ll want to wikify will be capitalised, but not all.

Brendon · February 22, 2007, 11:47am

Oops, sorry. Feel free to edit yours while I’m writing this one.

Right. There are two differences, though. The first is that while the censor (so far as I know) only has to worry about single words, wiki links are often to phrases of several words.

So if you have “one two three four five”, the censor only has to do five lookups in its banned-word list – one for each word. But for phrases, the lookups you’d have to do are:

“one”, “one two”, “one two three”, “one two three four”, “one two three four five”, “two”, “two three”, “two three four”, “two three four five”, “three”, “three four”, “three four five”, “four”, “four five”, and “five”.

And that’s just for one short sentence. Over a whole article, it would be a gigantic number of possible phrases.
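To put a number on it, here is a small sketch that enumerates every contiguous sub-phrase of a word list (`sub_phrases` is just an illustrative name). For n words there are n(n+1)/2 of them, so a 500-word article would need 125,250 lookups if nothing is pruned:

```python
def sub_phrases(words):
    """Return every contiguous run of words, joined by single spaces."""
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]

phrases = sub_phrases("one two three four five".split())
print(len(phrases))   # 15, matching the list above
print(phrases[:5])
```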

Or you could do it the other way around, looking through all the key phrases and seeing if their first word matches the word of the article you’re on. That’s better, in that it doesn’t need so many lookups, but it still needs quite a few.

Which brings us on to the second problem - the censored words list is, I’m guessing, only a few dozen words long. But a general wiki-linker, even for a tight field like “political news”, could well run into the hundreds or even thousands of words. Which multiplies the problem.

There are ways you can cut it down with little indexing tricks and stuff. They help, but they’re not easy to get right. And I’m guessing they won’t be done by stuff like the auto-censor, because it doesn’t need them.

That’s true – even the most expensive approach here would probably not be a big deal. Still, why be wasteful?

This is a tricky one. It’s not just about performance compromises. Wikipedia has articles on every damn thing you can possibly imagine, and most of them probably aren’t going to be interesting to your readers. Looking at the same TT article I linked to before, there are Wikipedia articles on “troop”, “soldier”, “conflict”, “pressure”, “progress”, “withdrawal”, “President”, “triangle”, “Perth”, “police”, “checkpoint”, “tanker”, “truck”, “chlorine”, and so on and so on ad nauseam.

You certainly could link to all of them, but I don’t think it would add any value, and would make it harder to read and use. Or you could ensure yourself, by hand, that your keyphrase index doesn’t include any uninteresting words. As I said before, the list could easily be very big, and that would be a very boring task.

The TT approach of only caring about Proper Nouns seems sensible to me.

You know, in the time we’ve spent discussing this I could probably have written the code for my original suggestion. Now I feel obscurely guilty.

Taffy · February 22, 2007, 12:07pm

Good points made above.

Regarding the database list options I suppose it depends what dearpeter wants to use it for. If it’s a fairly short list of links that he’d like, best to write that list himself and be sure that every link will be relevant (he can also add new links later as he thinks of them). If he wants a broad range of auto-links then measures will have to be built in to combat overuse or irrelevant links (like just choosing capitalised words).

Big_Fluffy_Matthew · February 22, 2007, 12:40pm

you could store the words in a tree. Then it’s log2(n), or summit. Or just sort them (It’s not that dynamic) then you can do a binary search.

Brendon · February 22, 2007, 12:51pm

Right – indexing tricks. But a tree or binary search doesn’t help all that much - you still have to iterate either the article or the keywords and do an O(log2 x) lookup each time, which gives you O(n log2 m) or O(m log2 n). You could also use a hash map, which has O(1) lookups on average, giving you O(n) overall - which was my original suggestion.
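For comparison, a sketch of the two lookup styles in Python, using a made-up three-phrase index. The sorted list plus `bisect` gives the O(log m) binary search, and the built-in `set` gives the hash lookup:

```python
import bisect

# Hypothetical phrase index, kept both as a sorted list and as a set.
phrases = sorted(["Executive Yuan", "Taipei", "Tony Blair"])
phrase_set = set(phrases)

def in_sorted(sorted_list, key):
    """Binary search: O(log m) string comparisons per lookup."""
    i = bisect.bisect_left(sorted_list, key)
    return i < len(sorted_list) and sorted_list[i] == key

print(in_sorted(phrases, "Taipei"))   # binary search
print("Taipei" in phrase_set)         # hash lookup, O(1) on average
```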

Big_Fluffy_Matthew · February 22, 2007, 1:07pm

Yeah, hash tables would help too. How big would you make it? If you make it too big, there would be unused entries and it would be, er, too big. Too small and there would be multiple results for one entry. Wouldn’t there?

Brendon · February 22, 2007, 1:23pm

Too big is fine - even tens of thousands of phrases will still work out at a few megabytes of memory at most, which is nothing. Too small is also fine, usually - collisions (items with matching hashes) get chained as linked lists, and the lookup then goes through them and does regular comparisons with the key you gave, to identify the right one. Obviously that takes a little longer, but it’s usually not a big deal.
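A toy illustration of chaining, assuming a fixed number of buckets and Python lists standing in for the linked lists. `ChainedSet` is a made-up class for this sketch; Python's own `dict` and `set` resolve collisions differently (open addressing), but the effect on lookups is similar:

```python
class ChainedSet:
    """Minimal hash set that resolves collisions by chaining."""

    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _chain(self, key):
        # Keys whose hashes land on the same bucket share one chain.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        chain = self._chain(key)
        if key not in chain:
            chain.append(key)

    def __contains__(self, key):
        # Walk the chain, comparing against the actual key each time.
        return key in self._chain(key)

# With a single bucket every key collides, yet lookups still work.
tiny = ChainedSet(buckets=1)
tiny.add("Taipei")
tiny.add("Tony Blair")
print("Taipei" in tiny, "Denmark" in tiny)
```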

Big_Fluffy_Matthew · February 22, 2007, 1:38pm

I guess that’s true. I must admit I haven’t really used hashing. I tend to use more dynamic structures like trees for depth sorting. I would use hashing to speed up string testing though.

How would the mods know if we’re off topic ?

Brendon · February 22, 2007, 1:48pm

I think it’s a language-background thing. You’re from C++ if I remember rightly, which is all about trees and linked lists and crap. I’m from Perl and Python, which are all about hashes.

Perhaps they could write a program that goes through our posts comparing them to a big list of off-topic words …

Brendon · February 22, 2007, 4:39pm

I took a shot at this tonight. This script is not terribly efficient, but gets a number of things right that the Taipei Times get wrong:

Trailing punctuation is excluded from links
The first word in a sentence can be linked to
Links are not repeated - only the first occurrence links

It can be a little over-zealous at times. If you have “Prime Minister” as a keyword, the following text will match:

[code]We are under attack by Optimus Prime.

Minister Blair has gone into hiding.[/code]

You’ll get a link starting with “Prime” and ending with “Minister”. This is hard to avoid without breaking other cases which should work. But never mind.

Try it out here: microcosm.dynalias.org/wiki/test.py

And here’s the code:

[code]"""
Wiki link inserter.

Give this script an article on stdin. It is assumed to be plain-text,
and may bugger up HTML code and other things if they are present.
So run it over articles before inserting them into the page template.

Output is to stdout. Run it all like this:

python transform.py < inputArticle.txt > outputArticle.txt
"""

# A file full of phrases to change into Wikipedia links
# One per line. Empty lines and lines starting with # ignored.
# Words in phrases separated by space, eg "Tony Blair"
phrasesFile = "phrases.txt"

# Template for wiki links.
wikiLinkTemplate = '<a href="http://en.wikipedia.org/wiki/%(link)s">%(human)s</a>'

# ---------------------------------

import re
from io import StringIO

CHUNKSIZE = 512

# Load the phrase index into a set for fast lookups
with open(phrasesFile) as f:
    stripped = (line.strip() for line in f)
    keywords = set(p for p in stripped if p and not p.startswith("#"))

# Phrases that have already been linked once
donePhrases = set()

def process(infile, outfile):

    midWord = False
    midPhrase = False

    phraseBuffer = StringIO()

    chunk = infile.read(CHUNKSIZE)

    while chunk:

        for char in chunk:

            # Hooray for hacked-up state-machines
            if char.isalpha():
                if not midWord:
                    if char.isupper():
                        # A capitalized word starts or extends a phrase
                        midPhrase = True
                    elif midPhrase:
                        # A lowercase word ends the phrase, so flush it
                        doPhrase(phraseBuffer.getvalue(), outfile)
                        phraseBuffer = StringIO()
                        midPhrase = False
                midWord = True
            else:
                midWord = False

            if midPhrase:
                phraseBuffer.write(char)
            else:
                outfile.write(char)

        chunk = infile.read(CHUNKSIZE)

    # Flush a phrase that is still buffered when the input ends
    if midPhrase:
        doPhrase(phraseBuffer.getvalue(), outfile)

def doPhrase(phrase, outfile):

    words = splitWithPunctuation(phrase)

    for i in range(len(words)):
        for j in range(i+1, len(words)+1):

            key = ' '.join(map(clean, words[i:j]))

            if key in keywords and key not in donePhrases:

                outfile.write(applyTemplate(''.join(words[i:j]), key))
                donePhrases.add(key)

                # Recurse here to avoid overlapping matches
                doPhrase(''.join(words[j:]), outfile)
                return

        outfile.write(words[i])

def clean(word):
    # Keep only the letters, dropping punctuation and whitespace
    return ''.join(c for c in word if c.isalpha())

def splitWithPunctuation(phrase):
    # Each item is a word plus the punctuation and whitespace that follow it
    return re.findall(r"(\w.*?(?:\W+|$))", phrase, re.DOTALL)

def applyTemplate(phrase, linkPhrase):

    # Identify final punctuation
    i = len(phrase) - 1
    while i > 0 and not phrase[i].isalpha():
        i -= 1

    linkable = phrase[:i+1]
    extra = phrase[i+1:]
    return (wikiLinkTemplate % dict(link='_'.join(linkPhrase.split()),
                                    human=linkable)) + extra

if __name__ == "__main__":
    import sys
    process(sys.stdin, sys.stdout)[/code]

Cheers,
Brendon

anon24369109 · February 23, 2007, 12:07am

… oopsy