Scraping information off of websites

Does anyone here have experience writing a spider-type program that regularly scrapes information from a website? A common use would be to visit an online directory, scrape the contact information, and load it into a database, where it is merged with existing records (with the newly scraped values replacing outdated ones where the two conflict). If you have done this before, know of companies that will do it for a fee, or know which keywords to search for on Google to find this type of service, please let me know.

Edit: I found something called Visual Web Task by Lencom Software. Anyone have experience with this?

Yes. I use LWP (libwww-perl, the “Library for WWW in Perl”) to do this. There was an article about how to do it some time ago; there are also chapters in Perl Cookbook and Network Programming with Perl that show the basic techniques. It’s especially powerful in combination with the Perl database interface libraries (DBI/DBD) and an open source database like MySQL or PostgreSQL.
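
For illustration only, here is a minimal sketch of the "merge into a database, updating outdated entries" step. It is written in Python with the standard sqlite3 module rather than Perl with DBI, purely to keep it self-contained. The contacts table, its columns, and the merge_contacts function are hypothetical names, and using the contact name as the key is an assumption.

```python
import sqlite3

def merge_contacts(conn, scraped):
    """Insert new contacts and overwrite stored ones that differ.

    `scraped` is an iterable of (name, phone, email) tuples. The table
    layout is hypothetical; `name` is assumed to identify a contact.
    Returns a tuple (inserted, updated).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts ("
        "name TEXT PRIMARY KEY, phone TEXT, email TEXT)"
    )
    inserted = updated = 0
    for name, phone, email in scraped:
        row = conn.execute(
            "SELECT phone, email FROM contacts WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            # Not seen before: add it.
            conn.execute(
                "INSERT INTO contacts VALUES (?, ?, ?)", (name, phone, email)
            )
            inserted += 1
        elif row != (phone, email):
            # Stored values conflict with the scrape: treat the scrape as newer.
            conn.execute(
                "UPDATE contacts SET phone = ?, email = ? WHERE name = ?",
                (phone, email, name),
            )
            updated += 1
    conn.commit()
    return inserted, updated
```

Calling merge_contacts(sqlite3.connect("contacts.db"), rows) after each scrape would keep the table in step with the directory; the file name is only an example. Whether the scraped value should always win over the stored one is a policy decision this sketch simply assumes.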

If you don’t have time to learn how to do it yourself, post a job for a local developer.

I don’t see why you would consider paying for a commercial product when there are so many powerful free ones around.


Can you come up with a Screen Scraper for PHP?

Say, for example, there is a database of jobs at (ahem) another Taiwan website. Now imagine that I have a Taiwan website with a discussion forum that uses PHP. Dennis Pallet writes about Screen Scraping into RSS. Do you know anyone who could write something that does this but, instead of producing RSS, just squirts the job ads out in a clean, printer-ready format?

Yes, probably. Send me the URLs.


I’m still hoping you can help me put together a screen scraper for this page: click here

I think I only need to keep 3 kinds of data:

  1. The number of ads in the list
  2. The names of each ad
  3. The description of each ad

In the HTML that the Perl script generates, these strings are unique:
What precedes each ad name:

<td bgcolor="#d0d098" align=left><font face=Arial size=2>

What precedes each ad description:

<td></td><td colspan=4><font face="Arial" size=1>

I hope to load the resulting output into a variable that I can use in our phpBB templates.

Can you help me?
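
As a rough sketch of the extraction described above, here is one way to pull out the three pieces of data using the two marker strings. It is in Python for illustration; the same regular-expression approach should carry over to PHP with preg_match_all. It assumes each name and description runs from its marker to the next closing </font> tag, which the post does not confirm, and the names scrape_ads and _after are made up for this example.

```python
import re
from html import unescape

# Marker strings quoted in the post above.
NAME_MARK = '<td bgcolor="#d0d098" align=left><font face=Arial size=2>'
DESC_MARK = '<td></td><td colspan=4><font face="Arial" size=1>'

def _after(marker, page):
    """Return the text following each occurrence of `marker`.

    Assumption: the text of interest ends at the next </font> tag.
    """
    pattern = re.escape(marker) + r"(.*?)</font>"
    found = re.findall(pattern, page, flags=re.DOTALL | re.IGNORECASE)
    # Drop any nested tags (links, bold), then decode entities like &amp;.
    return [unescape(re.sub(r"<[^>]+>", "", text)).strip() for text in found]

def scrape_ads(page):
    """Return (number_of_ads, [(name, description), ...]) for one page."""
    names = _after(NAME_MARK, page)
    descriptions = _after(DESC_MARK, page)
    # Pair names and descriptions in page order; zip stops at the shorter list.
    ads = list(zip(names, descriptions))
    return len(ads), ads
```

The page argument would be the HTML of the job listing fetched beforehand, for example with urllib.request.urlopen. Matching on exact marker strings is brittle: if the site changes its table markup, the markers need updating, and an HTML parser would be the sturdier choice.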