Web scraping with Perl

2010-04-04

Since I need all ICD9 codes for my work, I decided to implement a lightweight web crawler in Perl, with the aim of parsing all the codes found at ICD9Data.com.

It seems that the WWW::Mechanize module provides all that is needed. In fact, I also realized that this technique can be extended to capture anything on a website, which is called web scraping.
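To give an idea of what this looks like with WWW::Mechanize, here is a minimal sketch rather than the crawler itself. It assumes that the codes appear as the text of links on the page being fetched, which may not match how ICD9Data.com is actually laid out; the helper names looks_like_icd9 and codes_on_page are made up for the illustration, and the start URL is whatever you pass on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough shape of an ICD-9-CM code: 3 digits, V + 2 digits, or E + 3 digits,
# optionally followed by a dot and one or two digits.
sub looks_like_icd9 {
    my ($text) = @_;
    return $text =~ /^(?:\d{3}|V\d{2}|E\d{3})(?:\.\d{1,2})?$/ ? 1 : 0;
}

# Fetch one page and return [code, absolute URL] pairs for every link
# whose text looks like an ICD-9 code (assumption: codes are link texts).
sub codes_on_page {
    my ($url) = @_;
    require WWW::Mechanize;    # loaded lazily; available from CPAN
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);
    my @found;
    for my $link ($mech->links) {
        my $text = $link->text;
        next unless defined $text;    # image links have no text
        $text =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        push @found, [$text, $link->url_abs->as_string]
            if looks_like_icd9($text);
    }
    return @found;
}

# Only touch the network when run as a script with a start URL
if (!caller && @ARGV) {
    printf "%s\t%s\n", @$_ for codes_on_page($ARGV[0]);
}
```

A real crawler would also follow the collected links to the sub-pages and pause between requests; this sketch only looks at a single page.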

"Web scraping is the process of automatically collecting Web information. Web scraping is a field with active developments sharing a common goal with the semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Web scraping, instead, favors practical solutions based on existing technologies even though some solutions are entirely ad hoc." (Source: Wikipedia)

The final result will be available on my website soon.

While reading the online Perl recipes, I took the opportunity to adapt one example script to download all ConTeXt manuals from the PRAGMA website. However, I just realized that almost all manuals are available in TeX format on the SVN server.


Update, May 14

I found in Berman’s Perl Programming for Medicine and Biology (2007, Jones and Bartlett Publishers) another Perl script that the author suggests for collecting ICD codes from the UMLS Metathesaurus (Chapter 5). The UMLS Metathesaurus is actually the largest medical nomenclature; it includes more than 100 different biomedical vocabularies with about 6 million term records. These term records are in a file named MRCONSO, which is available at no cost provided you first register on the UMLS website. Here is an example of what it looks like:

C0000005|ENG|X|L0000005|PF|S0007492|Y|A7755565||M0019694|D012711|MSH|PEN|D012711|(131)I-Macroaggregated Albumin|0|N||

The example proposed in the book consists in extracting all vocabulary terms from MRCONSO, with their unique ID (i.e., Concept Unique Identifier Code). This reads as follows:

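#!/usr/bin/perl
use strict;
use warnings;

my $start = time();
open(my $text, '<', 'MRCONSO') or die "Cannot open MRCONSO: $!";
open(my $out, '>', 'icd.txt') or die "Cannot open icd.txt: $!";
while (my $line = <$text>) {
  # Fields are pipe-delimited; indices are zero-based
  my @linearray = split(/\|/, $line);
  my $icdnumber = $linearray[13];   # code in the source vocabulary
  my $language = $linearray[1];
  my $term = $linearray[14];
  my $vocabulary = $linearray[11];
  next if ("ENG" ne $language);
  next unless ($vocabulary =~ /ICD10AM/);
  print $out "$icdnumber $term\n";
}
close($text);
close($out);
my $end = time();
my $total = $end - $start;
print "\nTotal time was $total seconds\n";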
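The numeric indices in the script above are easier to follow once the fields have names. Here is a small sketch, not taken from the book, that unpacks one record into a hash. The column names are the ones the UMLS documentation gives for MRCONSO; the helpers parse_mrconso_line and wanted are made up for the illustration, and the sample is the record shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# MRCONSO column names in file order (from the UMLS documentation)
my @COLUMNS = qw(CUI LAT TS LUI STT SUI ISPREF AUI SAUI SCUI SDUI
                 SAB TTY CODE STR SRL SUPPRESS CVF);

# Turn one pipe-delimited record into a hash keyed by column name
sub parse_mrconso_line {
    my ($line) = @_;
    chomp $line;
    my %record;
    # Limit -1 keeps empty fields; the slice drops the extra empty
    # element produced by the trailing "|"
    @record{@COLUMNS} = split /\|/, $line, -1;
    return \%record;
}

# Hypothetical filter: English records from one source vocabulary
sub wanted {
    my ($record, $vocabulary) = @_;
    return $record->{LAT} eq 'ENG' && $record->{SAB} eq $vocabulary;
}

my $sample = 'C0000005|ENG|X|L0000005|PF|S0007492|Y|A7755565||M0019694'
           . '|D012711|MSH|PEN|D012711|(131)I-Macroaggregated Albumin|0|N||';
my $rec = parse_mrconso_line($sample);

# The sample comes from MSH, so filter on that vocabulary here
print "$rec->{CODE} $rec->{STR}\n" if wanted($rec, 'MSH');
```

With this layout, CODE is field 13, STR is field 14, SAB is field 11 and LAT is field 1, which are the four indices used in the script above. Note that wanted compares the vocabulary exactly, whereas the book's script uses a regular expression match.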