Importing DBpedia 3.8

Importing progress:

  • June 12, 01:06
    • Virtuoso server up and running
    • test import worked, SPARQL on localhost Virtuoso server works
    • starting proper import, small datasets first (<20MB, gzipped)
  • June 12, 18:11
    • It’s taking forever to import the big datasets
    • it will probably take a few more days to load everything (except pagelinks)
  • June 12, 21:01
    • Having an SSD drive helps BIG time
      • the database seems to grow by about 1 GB every 5 minutes on average
      • putting the Virtuoso database file on an SSD (temporarily, just to populate it) speeds things up considerably (labels_en.nt.gz loaded in 10 minutes)

Setting Up a DBpedia mirror

How to set up a DBpedia mirror on Virtuoso?

Deployment system specifications (used here):

  • Ubuntu 12.04
  • Quad-Core @ 3.4 GHz
  • 8 GB RAM
  • X disk space

Step 1: Installing Virtuoso

You can go to the OpenLink Virtuoso website and download or buy the Virtuoso server to install it on your machine. On Ubuntu 12.04, however, the open-source Virtuoso server can be installed through the package manager. Command line:

~$ sudo apt-get install virtuoso-server
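After installation it’s worth checking that the server actually came up. A minimal sketch (my addition, not part of the original steps), assuming the stock Virtuoso ports — 1111 for the SQL/isql listener and 8890 for the HTTP/SPARQL endpoint, both configurable in virtuoso.ini:

```shell
# Probe Virtuoso's default ports using bash's built-in /dev/tcp feature.
# 1111 = SQL (isql) listener, 8890 = HTTP/SPARQL endpoint.
check_port() {
  (exec 3<>"/dev/tcp/localhost/$1") 2>/dev/null && echo open || echo closed
}
echo "SQL port 1111:  $(check_port 1111)"
echo "HTTP port 8890: $(check_port 8890)"
```

If both report closed, start the service first (the exact service name depends on the installed Virtuoso package).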

Step 2: Download DBpedia data

On downloads.dbpedia.org, data dumps for the different DBpedia versions and languages are available. Here, version 3.8 (the most recent at the time) and English are used.

  1. Go to the download page for your version and language (in our case, downloads.dbpedia.org/3.8/en/)
  2. Download all archives ending in “.nt.bz2” into one folder on your machine. Let’s call this folder dumpfolder.
    1. this doesn’t have to happen manually; on Linux, you can run the following from dumpfolder:
    • wget -r -np -nd -nc -A '*.nt.bz2' http://downloads.dbpedia.org/3.8/en/
  3. Download the DBpedia Ontology
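Before converting anything, it helps to verify that no download was truncated. A small sketch (my own addition, not part of the original steps), to be run from dumpfolder; `bzip2 -t` test-decompresses each archive without writing any output:

```shell
# Report any corrupt or truncated bzip2 archives in the current folder.
bad=0
for f in *.nt.bz2; do
  [ -e "$f" ] || continue          # pattern matched nothing: folder is empty
  if ! bzip2 -t "$f" 2>/dev/null; then
    echo "corrupt: $f"
    bad=$((bad+1))
  fi
done
echo "$bad corrupt archive(s) found"
```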

Step 3: Prepare for importing DBpedia dumps

  1. transform the bzip2-compressed dumps to gzip (saves space):
    • ~$ for i in *.bz2 ; do bzcat $i | gzip --fast > ${i%.bz2}.gz && rm $i ; done &
  2. clean the DBpedia dumps (removes triples with over-long IRIs):
    • ~$ for i in external_links_en.nt.gz page_links_en.nt.gz infobox_properties_en.nt.gz ; do
          echo -n "cleaning $i..."
          zcat $i | grep -v -E '^<.+> <.+> <.{1025,}> \.$' | gzip --fast > ${i%.nt.gz}_cleaned.nt.gz &&
          mv ${i%.nt.gz}_cleaned.nt.gz $i
          echo "done."
        done
  3. import the loading scripts
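The grep filter in the cleaning step drops triples whose object IRI is 1025 characters or longer, presumably because such over-long IRIs cause trouble during bulk loading. A quick self-contained demonstration of the same regex on two synthetic triples, of which only the short one survives:

```shell
# A 1025-character dummy IRI body (printf repeats 'x' once per seq argument).
long=$(printf 'x%.0s' $(seq 1 1025))

# Two N-Triples lines: one ordinary, one with the over-long object IRI.
# The filter from the cleaning step removes the second.
printf '<http://a> <http://b> <http://c> .\n<http://a> <http://b> <%s> .\n' "$long" \
  | grep -v -E '^<.+> <.+> <.{1025,}> \.$'
# prints only: <http://a> <http://b> <http://c> .
```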

Step 4: import data

This is the longest step; depending on how much you import, it may take hours.

isql-vt
ld_dir_all('<folder with dumps>', '*.*', 'http://dbpedia.org/');
SELECT * FROM DB.DBA.LOAD_LIST;
EXIT;

Run the loader (still inside isql-vt):

rdf_loader_run();
checkpoint;
commit WORK;
checkpoint;
exit;
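For large imports, OpenLink’s bulk-loading guidance suggests running several loaders in parallel (roughly one loader per 2 or 3 CPU cores); a sketch, under the same isql-vt session setup as above:

```sql
-- In each of several parallel isql-vt sessions, start a loader;
-- they coordinate through the shared DB.DBA.LOAD_LIST table:
rdf_loader_run();

-- After all loaders have returned, in one session:
checkpoint;
commit WORK;
checkpoint;
```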
