If you read any news story on the Taipei Times Web site, there is a little “wiki links” icon, which, if clicked, produces a very similar page with many keywords linked to the corresponding Wikipedia entry.
Does anyone know how this was accomplished? The link leads to a wikified version of the original pages, but I doubt they code each one by hand. I suspect there must be some script that automates the generation of the wikified version.
I don’t know for sure how they did it, but here’s how I’d do it:
Download an index of the sort of Wikipedia pages you want to link to. You could do this through the Wikipedia Contents or Index pages.
Stick the page names into a database or file. I’d use a file, and then a hash in server memory at runtime for fast lookups.
When a new content page is added to your site (or old content is edited), run over it with a script. Look for capitalized words, or acronyms – these may be proper nouns and thereby good candidates for linking. If you find several capitalized words in a row, join them together into WikiLink format.
See if the phrase is in your index. If it is, wrap it in a standard Wikipedia link. Continue.
Save the modified content page somewhere and link to it as the wikilinks version. Doing it only at content update time will be much more efficient than parsing it every time someone looks at the page.
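The steps above can be sketched in a few lines of Python. This is only an illustration, not how the Taipei Times actually does it; the index contents, the file format, and the link markup are all assumptions:

```python
# Minimal sketch of the pipeline above: load an index of Wikipedia page
# names into an in-memory hash (a Python set, for O(1) lookups) and wrap
# any candidate phrase that appears in it.

def load_index(lines):
    # One page name per line, e.g. "Tony Blair" (hypothetical file format)
    return set(line.strip() for line in lines)

def wikify_phrase(phrase, index):
    # Wrap the phrase in a standard Wikipedia link if it is in the index
    if phrase in index:
        url = "http://en.wikipedia.org/wiki/" + phrase.replace(" ", "_")
        return '<a href="%s">%s</a>' % (url, phrase)
    return phrase

index = load_index(["Tony Blair", "Denmark"])
print(wikify_phrase("Tony Blair", index))
print(wikify_phrase("breakfast", index))
```

The set is the "hash in server memory" from step two; candidate phrases would come from the capitalised-word scan in step three.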
And you’re done. I’m pretty sure the Taipei Times does something like this. I noticed they link only to capitalized words and acronyms. Trying to do an all-word-sets parse on the article (to get things other than proper nouns) would be prohibitively expensive.
I also noticed that they seem to assume the first word of the sentence is capitalized only because that’s what we do in English, and not because it’s a linkable name. For instance in this article, there’s a paragraph starting with the word “Denmark”. Denmark doesn’t get linked to, though it presumably should (they link to Britain, etc). So they have a bug!
Edit: Actually I noticed a couple other crappy bugs too. Their links extend over trailing punctuation, so you get blue underlined commas and stuff. And they link to the same phrase more than once in the same article, which is silly - it should link only the first occurrence, to make the text clearer.
I see the way Brendon has done it - but here’s a (simpler?) idea:
They could do it by running an auto-censor similar to Forumosa’s. All that would be needed then is to define a list of keywords (for example Taipei, Lee Teng-hui, Executive Yuan, bubble tea) which will trigger the wiki syntax. So when your search-and-replace function finds a word that is on the wiki list, it will create an automatic link. This is helped by the fact that wiki syntax is very consistent, so all articles can be found at en.wikipedia.org/wiki/. You just set up your s&r function to replace any keyword in your predefined list with a set string like <a href="http://en.wikipedia.org/wiki/keyword">keyword</a>
and Bob’s your uncle.
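A naive sketch of that search-and-replace approach (the keyword list is invented for illustration):

```python
# Auto-censor-style wikifier: every listed keyword gets replaced with a
# link wherever it appears. Note this naive version will happily mangle
# overlapping keywords, or text inside links it inserted earlier.

keywords = ["Taipei", "Lee Teng-hui", "Executive Yuan", "bubble tea"]

def wikify(text, keywords):
    for kw in keywords:  # every keyword scans the whole text: O(mn)
        link = '<a href="http://en.wikipedia.org/wiki/%s">%s</a>' % (
            kw.replace(' ', '_'), kw)
        text = text.replace(kw, link)
    return text

print(wikify("He met Lee Teng-hui in Taipei.", keywords))
```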
So far as I can tell, that’s exactly what I said, but it involves typing in all the keywords yourself.
Edit: Actually, there are a few other differences. You’d need a search-and-replace tool that could be configured to do things like replace spaces - the Wikipedia link for Tony Blair is “Tony_Blair”. It would also need to be able to deal with line breaks in the middle of phrases, and the like. And since it wouldn’t have heuristic rules for which phrases to check against the keyword index, it would have to check every keyword for every word of the content - making it m times slower, where m is the number of keywords (it would be O(mn) rather than O(n)).
On the other hand, it might be easier to set up – in the sense that it would only take three hours of admin time rather than one hour of coding time.
Damn, you edited your response so I had to throw away half of mine.
Why would including other categories of words be prohibitively expensive? The Forumosa autocensor script scans all the text looking for certain words (capitalised or otherwise) - this doesn’t take up a huge amount of processing time as far as I can see. As long as you’re doing what you suggested and saving a wikified copy to the database rather than doing it each time someone requests the page, I can’t see the problem (at least for sites with F.com’s volume of traffic).
By checking against only certain instances (like capitalised words which are not at the beginning of sentences) you are limiting the usefulness and scope of a wikified version (which is how I feel about the TT site). Sure, most of the words you’ll want to wikify will be capitalised, but not all.
Oops, sorry. Feel free to edit yours while I’m writing this one.
Right. There are two differences, though. The first is that while the censor (so far as I know) only has to worry about single words, wiki links are often to phrases of several words.
So if you have “one two three four five”, the censor only has to do five lookups in its banned-word list – one for each word. But for phrases, the lookups you’d have to do are:
“one”, “one two”, “one two three”, “one two three four”, “one two three four five”, “two”, “two three”, “two three four”, “two three four five”, “three”, “three four”, “three four five”, “four”, “four five”, and “five”.
And that’s just for one short sentence. Over a whole article, it would be a gigantic number of possible phrases.
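That enumeration is quadratic: a phrase of n words has n(n+1)/2 contiguous sub-phrases. A quick sketch to verify the count:

```python
# Enumerate every contiguous run of words, as listed above for
# "one two three four five".

def subphrases(words):
    return [' '.join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

phrases = subphrases("one two three four five".split())
print(len(phrases))  # 5 * 6 / 2 = 15 lookups for one five-word phrase
```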
Or you could do it the other way around, looking through all the key phrases and seeing if their first word matches the word of the article you’re on. That’s better, in that it doesn’t need so many lookups, but it still needs quite a few.
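That other-way-around version can be sketched by indexing the key phrases by their first word, so that at each word of the article you only check phrases that could possibly start there (phrase list invented for illustration):

```python
# Index multi-word key phrases by their first word; at each article word,
# only the phrases starting with that word need checking.

phrases = ["Tony Blair", "Executive Yuan", "Tony Blair Institute"]
by_first = {}
for p in phrases:
    by_first.setdefault(p.split()[0], []).append(p)

words = "I saw Tony Blair yesterday".split()
matches = []
for i, w in enumerate(words):
    for p in by_first.get(w, []):  # usually an empty or tiny list
        if words[i:i + len(p.split())] == p.split():
            matches.append(p)
print(matches)  # ['Tony Blair']
```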
Which brings us on to the second problem - the censored words list is, I’m guessing, only a few dozen words long. But a general wiki-linker, even for a tight field like “political news”, could well run into the hundreds or even thousands of words. Which multiplies the problem.
There are ways you can cut it down with little indexing tricks and stuff. They help, but they’re not easy to get right. And I’m guessing they won’t be done by stuff like the auto-censor, because it doesn’t need them.
That’s true – even the most expensive approach here would probably not be a big deal. Still, why be wasteful?
This is a tricky one. It’s not just about performance compromises. Wikipedia has articles on every damn thing you can possibly imagine, and most of them probably aren’t going to be interesting to your readers. Looking at the same TT article I linked to before, there are wikipedia articles on “troop”, “soldier”, “conflict”, “pressure”, “progress”, “withdrawal”, “President”, “triangle”, “Perth”, “police”, “checkpoint”, “tanker”, “truck”, “chlorine”, and so on and so on ad nauseam.
You certainly could link to all of them, but I don’t think it would add any value, and it would make the article harder to read and use. Or you could ensure, by hand, that your keyphrase index doesn’t include any uninteresting words. As I said before, the list could easily be very big, and that would be a very boring task.
The TT approach of only caring about Proper Nouns seems sensible to me.
You know, in the time we’ve spent discussing this I could probably have written the code for my original suggestion. Now I feel obscurely guilty.
Regarding the database list options I suppose it depends what dearpeter wants to use it for. If it’s a fairly short list of links that he’d like, best to write that list himself and be sure that every link will be relevant (he can also add new links later as he thinks of them). If he wants a broad range of auto-links then measures will have to be built in to combat overuse or irrelevant links (like just choosing capitalised words).
Right – indexing tricks. But a tree or binary search doesn’t help all that much - you still have to iterate over either the article or the keywords and do an O(log m) or O(log n) lookup each time, which gives you O(n log m) or O(m log n) respectively. You could also use a hash map, which has O(1) lookups, giving you O(n) overall - which was my original suggestion.
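In Python terms, the two lookup structures compare like this (toy keyword list for illustration):

```python
# Sorted list + binary search gives O(log m) per lookup; a hash-based
# set gives O(1) on average.
import bisect

keywords = sorted(["Denmark", "Taipei", "Tony Blair"])
keyset = set(keywords)

def in_sorted(key):
    i = bisect.bisect_left(keywords, key)  # O(log m) binary search
    return i < len(keywords) and keywords[i] == key

print(in_sorted("Taipei"), "Taipei" in keyset)  # True True
```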
Yeah, hash tables would help too. How big would you make it? If you make it too big, there would be unused entries and it would be, er, too big. Too small, and there would be multiple results for one entry. Wouldn’t there?
Too big is fine - even tens of thousands of phrases will still work out at a few hundred kilobytes of memory, which is nothing. Too small is also fine, usually - collisions (items with matching hashes) get chained as linked lists, and the lookup then goes through them and does regular comparisons with the key you gave, to identify the right one. Obviously that takes a little longer, but it’s usually not a big deal.
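Chaining in miniature, as a toy fixed-size table (real hash maps resize themselves; this is just to show the mechanism):

```python
# A deliberately tiny table so collisions are common. Each slot holds a
# chain (a list); lookups hash to a slot, then compare keys in the chain.

SIZE = 8
table = [[] for _ in range(SIZE)]

def put(key):
    bucket = table[hash(key) % SIZE]
    if key not in bucket:  # linear scan through the chain
        bucket.append(key)

def contains(key):
    return key in table[hash(key) % SIZE]

for kw in ["Denmark", "Taipei", "Tony Blair", "Perth", "Executive Yuan"]:
    put(kw)
print(contains("Taipei"), contains("Oslo"))  # True False
```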
"""Give this script an article on stdin. It is assumed to be plain-text,
and may bugger up HTML code and other things if they are present.
So run it over articles before inserting them into the page template."""

import re
import string
import sys

CHUNKSIZE = 4096
wikiLinkTemplate = '<a href="http://en.wikipedia.org/wiki/%(link)s">%(human)s</a>'
keywords = set()     # load your index of Wikipedia page names into this
donePhrases = set()  # so each phrase gets linked only once per article

def clean(word):
    # Strip surrounding punctuation for index lookups
    return word.strip(string.punctuation)

def splitWithPunctuation(phrase):
    # Words keep their attached punctuation; clean() removes it for lookups
    return phrase.split()

def doPhrase(phrase, outfile):
    words = splitWithPunctuation(phrase)
    for i in range(len(words)):
        for j in range(len(words), i, -1):  # prefer the longest match
            key = ' '.join(map(clean, words[i:j]))
            if key in keywords and key not in donePhrases:
                donePhrases.add(key)
                if i:
                    outfile.write(' '.join(words[:i]) + ' ')
                matched = ' '.join(words[i:j])
                # Identify final punctuation and keep it outside the link
                linkable = matched.rstrip(string.punctuation)
                extra = matched[len(linkable):]
                outfile.write(wikiLinkTemplate % dict(
                    link='_'.join(key.split()), human=linkable) + extra)
                if j < len(words):
                    outfile.write(' ')
                    # Recurse here to avoid overlapping matches
                    doPhrase(' '.join(words[j:]), outfile)
                return
    outfile.write(phrase)

def wikify(infile, outfile):
    text = ''
    chunk = infile.read(CHUNKSIZE)
    while chunk:
        text += chunk
        chunk = infile.read(CHUNKSIZE)
    # A regex over runs of capitalised words stands in for the original
    # hacked-up state-machine
    capitalisedRun = re.compile(r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*")
    pos = 0
    for match in capitalisedRun.finditer(text):
        outfile.write(text[pos:match.start()])
        doPhrase(match.group(), outfile)
        pos = match.end()
    outfile.write(text[pos:])

if __name__ == "__main__":
    wikify(sys.stdin, sys.stdout)