Viewing Wikipedia as a Graph

We use the JWPL Java library, which performs the following:
  1. It takes a dump of the full contents of Wikipedia in a specific language (from the list of Wikipedia dumps).
  2. It parses the dump into a set of articles.
  3. It parses each article and recognizes links from page to page and additional metadata attached to each page (in particular, its categories).
  4. It stores the resulting information into a relational database (MySQL) with appropriate indexes.
  5. It exposes the resulting graph data model through a convenient Java API.
See Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary by Zesch et al., LREC 2008, for a presentation of the library. Each language has its own Wikipedia dump; for example, the Hebrew Wikipedia dump is in hewiki. On this page, we document how to parse the Hebrew Wikipedia dump into a graph representation that can be efficiently queried through the JWPL API. The overall process is:
  1. Install Java and MySQL
  2. Either download a pre-processed version of the Hebrew Wikipedia (about 300MB) or pre-process an XML dump of Hebrew Wikipedia yourself (about 1 hour of CPU time).
  3. Import the pre-processed dump into MySQL
  4. Run Java code to query and access the graph representation of Wikipedia

Installing JWPL

Prerequisites

  1. Java 6 or up
  2. MySQL 5.x (the MySQL 5.5 installer for Windows is about 200MB). In the rest of this document, we assume the MySQL server runs on localhost with user root and password "password". Adjust the commands according to your configuration.
General instructions for installing JWPL are available in the JWPL DataMachine page. These instructions are specific to the installation of JWPL under Windows for the Hebrew Wikipedia dump.

Download JWPL

  1. Download DataMachine
  2. Download WikipediaAPI

Pre-processed Wikipedia Dump

The following archive, hewiki-output.zip (about 340MB), contains the result of the process described below applied to the hewiki dump of July 15th, 2012. You can download this archive and continue directly to the Prepare SQL Database section below, or alternatively pre-process your own version of the Wikipedia dump as documented next.

Download Wikipedia Dump

For the Hebrew Wikipedia dump, we need these 3 files (dump of 15 Jul 2012):
  1. Pages Articles
  2. Page links
  3. Category links
These 3 files take about 420MB compressed and 2GB uncompressed. The dump of July 15 2012 contains about 300K articles.

Preprocess Wikipedia Dump

The following command parses the Wikipedia dump and creates a set of text files that capture the graph structure of Wikipedia:
java -jar JWPLDataMachine.jar [LANGUAGE] [MAIN_CATEGORY_NAME] [DISAMBIGUATION_CATEGORY_NAME] [SOURCE_DIRECTORY] 
where:
  1. LANGUAGE - a language string matching one of the JWPL_Languages.
  2. MAIN_CATEGORY_NAME - the name of the main (top) category of the Wikipedia category hierarchy.
  3. DISAMBIGUATION_CATEGORY_NAME - the name of the category that contains the disambiguation categories.
  4. SOURCE_DIRECTORY - the path to the directory containing the source files.
For large dumps, you need to increase the amount of memory assigned to the JVM, e.g. with the "-Xmx2g" flag to assign 2GB. For a recent English dump you will need at least 4GB; for Hebrew, 1GB of RAM is sufficient. For Hebrew, use the following values:
LANGUAGE = hebrew
MAIN_CATEGORY_NAME = עמוד_ראשי
DISAMBIGUATION_CATEGORY_NAME = פירושונים
Because it is difficult to pass Hebrew characters to Java programs on the Windows or Linux command line, you cannot simply run:
java -Xmx1g -jar JWPLDataMachine.jar hebrew עמוד_ראשי פירושונים c:\jwpl
We instead need to compile the following small launcher application in Java - make sure this file is saved in UTF-8 encoding to preserve the Hebrew strings:
public class T1 {
    public static void main(String[] args) {
        // Send log4j output to the console with a default layout.
        org.apache.log4j.BasicConfigurator.configure();
        // Arguments: language, main category, disambiguation category, source directory.
        String[] args1 = {"hebrew", "עמוד_ראשי", "פירושונים", "c:\\jwpl"};
        de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(args1);
    }
}
The compilation line is:
javac -encoding UTF-8 -cp JWPLDataMachine.jar T1.java
and the execution is:
java -Dfile.encoding=UTF-8 -cp JWPLDataMachine.jar;. T1 > datamachine.log
This will output in the log file progress information in the following form:
0 [main] INFO org.springframework.beans.factory.xml.XmlBeanDefinitionReader  - 
  Loading XML bean definitions from class path resource [context/applicationContext.xml]
0 [main] DEBUG org.springframework.beans.factory.xml.DefaultDocumentLoader  - 
  Using JAXP provider [org.apache.xerces.jaxp.DocumentBuilderFactoryImpl]
219 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - parse input dumps...
219 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Discussions are unavailable
508753 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - processing table page...
509144 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 10000
509503 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 20000
509847 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 30000
...
331 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 260000
518691 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 270000
519206 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pages 280000
nrOfCategories: 29166
nrOfPage: 257829
nrOfRedirects before testing the validity of the destination:27059
519472 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - processing table categorylinks...
519738 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Categorylinks 10000
...
528613 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Categorylinks 610000
528863 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - processing table pagelinks...
529003 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pagelinks 10000
...
656629 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Pagelinks 16680000
656801 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - processing table revision...
656910 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Revision 10000
...
659176 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Revision 280000
659223 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - processing table text...
768099 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Text 10000
...
1633292 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - Text 280000
1651870 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - writing metadata...
1651870 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - finished
1651870 [main] INFO de.tudarmstadt.ukp.wikipedia.wikimachine.debug.Log4jLogger  - End of the application. 
        Working time = 1651917 ms
This execution should create 11 .txt files in an "output" subfolder. While the DataMachine is running, it creates three temporary .bin files; these are only needed during processing and can be deleted once all files in the output directory have been produced.

Prepare SQL Database

  1. Create a database: in the MySQL workbench, execute the following command:
    CREATE DATABASE hewiki DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
    
  2. Create the schema of the database: execute the script jwpl_tables.sql in the hewiki database (a command-line alternative is shown after this list).
  3. Fill the database: run the batch file dbimport.bat. Its output should look like this:
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki Category.txt
    hewiki.Category: Records: 29166  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki category_inlinks.txt
    hewiki.category_inlinks: Records: 56494  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki category_outlinks.txt
    hewiki.category_outlinks: Records: 56494  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki category_pages.txt
    hewiki.category_pages: Records: 451669  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki MetaData.txt
    hewiki.MetaData: Records: 1  Deleted: 0  Skipped: 0  Warnings: 1
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki Page.txt
    hewiki.Page: Records: 230770  Deleted: 0  Skipped: 0  Warnings: 224
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki PageMapLine.txt
    hewiki.PageMapLine: Records: 257738  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki page_categories.txt
    hewiki.page_categories: Records: 451669  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki page_inlinks.txt
    hewiki.page_inlinks: Records: 7563045  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki page_outlinks.txt
    hewiki.page_outlinks: Records: 7563045  Deleted: 0  Skipped: 0  Warnings: 0
    
    C:\jwpl\output>mysqlimport -uroot -ppassword --local --default-character-set=utf8 hewiki page_redirects.txt
    hewiki.page_redirects: Records: 26968  Deleted: 0  Skipped: 0  Warnings: 0
    
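For step 2 above, the schema script can also be executed from the command line instead of the MySQL workbench. A sketch, assuming the mysql client is on your PATH and jwpl_tables.sql is in the current directory:

C:\jwpl\output>mysql -uroot -ppassword hewiki < jwpl_tables.sql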

Using the JWPL Wikipedia API

Now you are ready to use the database with the JWPL Core API. When first connecting to a newly imported database, indexes are created; this takes some time (up to 30 minutes), depending on the server and the size of your Wikipedia. Subsequent connections won't have this delay. You need the following JAR file to execute calls against the MySQL database that contains the indexed version of Wikipedia: WikipediaAPI. In the following, we refer to this as JWPLWikipediaAPI.jar.
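
For reference, here is a minimal connection sketch using the JWPL Core API (DatabaseConfiguration and Wikipedia come from the de.tudarmstadt.ukp.wikipedia.api package; the class name ConnectHewiki and the credentials are just the values assumed in this document):

import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;

public class ConnectHewiki {
    public static void main(String[] args) throws Exception {
        // Connection parameters for the hewiki database imported above.
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("hewiki");
        dbConfig.setUser("root");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.hebrew);

        // The first connection builds the indexes; later connections are fast.
        Wikipedia wiki = new Wikipedia(dbConfig);
        System.out.println("Connected to " + wiki.getLanguage() + " Wikipedia");
    }
}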

All the tutorials below have been adjusted from those provided in JWPL to work with the Hebrew dump. In particular, since it is a problem to pass UTF-8 parameters through the command line or through text property files, we use an XML property file. The samples look for the file jwpl.xml, which contains these default values:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>JWPL</comment>
    <entry key="user">root</entry>
    <entry key="db">hewiki</entry>
    <entry key="host">localhost</entry>
    <entry key="pwd">password</entry>
    <entry key="title">צורה</entry>
    <entry key="category">מחשוב</entry>
</properties>
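
This format can be read with the standard java.util.Properties XML loader. A minimal sketch (the class name LoadJwplConfig is ours; the explicit close instead of try-with-resources keeps it Java 6 compatible):

import java.io.FileInputStream;
import java.util.Properties;

public class LoadJwplConfig {
    public static Properties load() throws Exception {
        // loadFromXML understands the DTD-based format shown above
        // and preserves the UTF-8 Hebrew values.
        Properties props = new Properties();
        FileInputStream in = new FileInputStream("jwpl.xml");
        try {
            props.loadFromXML(in);
        } finally {
            in.close();
        }
        return props;
    }
}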
    
Install the tutorial files in a folder hierarchy matching the package de/tudarmstadt/ukp/wikipedia/api/tutorial. You can use the following archive and extract it in your root jwpl folder.

The command lines to compile these files are:

javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T1a_HelloWorld.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T1b_HelloWorld.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T1c_HelloWorld.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T2_PageInfo.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T3_PageDetails.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T4_Categories.java
javac -encoding UTF-8 -cp JWPLWikipediaAPI.jar de\tudarmstadt\ukp\wikipedia\api\tutorial\T5_PageList.java
    
The execution commands are:
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T1a_HelloWorld > T1a.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T1b_HelloWorld > T1b.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T1c_HelloWorld > T1c.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T2_PageInfo    > T2.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T3_PageDetails > T3.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T4_Categories  > T4.log
java -Dfile.encoding=UTF-8 -cp JWPLWikipediaAPI.jar;. de.tudarmstadt.ukp.wikipedia.api.tutorial.T5_PageList    > T5.log
    

Tutorial 1: Access a Wikipedia Page by Title

The Java code adapted to Hebrew is available here: T1a_HelloWorld.java. The key API calls in this example are:
  1. Wikipedia wiki = new Wikipedia(dbConfig); // Creates a connection to the SQL dump
  2. Page page = wiki.getPage("הנדסת_תוכנה"); // Loads a page description into RAM, given its title
  3. page.getText(); // Accesses the text (or parsed text) of a loaded page
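
Put together, a minimal sketch (the helper method name is ours; wiki is the Wikipedia instance from the connection example above, and Page and WikiApiException are imported from de.tudarmstadt.ukp.wikipedia.api and its exception package, as in the sketches for the following tutorials):

static void helloWorld(Wikipedia wiki) throws WikiApiException {
    Page page = wiki.getPage("הנדסת_תוכנה"); // look the page up by title
    System.out.println(page.getText());      // print its text
}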

Tutorial 2: Get Page Information

The Java code adapted to Hebrew is available here: T2_PageInfo.java. The key API calls in this example are:
  1. page.getTitle();
  2. page.isDisambiguation();
  3. page.isRedirect(); // If a page is a redirect, we can use it like a normal page. The other fields of the page are transparently served by the page that the redirect points to.
  4. page.getNumberOfInlinks(); // Number of links pointing to this page
  5. page.getNumberOfOutlinks();
  6. page.getNumberOfCategories(); // Number of categories to which this page belongs
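
A minimal sketch of these calls (same assumptions as the Tutorial 1 sketch):

static void pageInfo(Wikipedia wiki) throws WikiApiException {
    Page page = wiki.getPage("הנדסת_תוכנה");
    System.out.println("Title:          " + page.getTitle());
    System.out.println("Disambiguation: " + page.isDisambiguation());
    System.out.println("Redirect:       " + page.isRedirect());
    System.out.println("Inlinks:        " + page.getNumberOfInlinks());
    System.out.println("Outlinks:       " + page.getNumberOfOutlinks());
    System.out.println("Categories:     " + page.getNumberOfCategories());
}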

Tutorial 3: Get Page Details

The Java code adapted to Hebrew is available here: T3_PageDetails.java. The key API calls in this example are:
  1. page.getRedirects(); // output the page's redirects
  2. page.getCategories(); // list of categories
  3. page.getInlinks();
  4. page.getOutlinks();
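
A minimal sketch (assuming, as in the JWPL tutorials, that getRedirects() returns the redirecting titles as strings and the other calls return sets of Category and Page objects):

static void pageDetails(Wikipedia wiki) throws WikiApiException {
    Page page = wiki.getPage("הנדסת_תוכנה");
    for (String redirect : page.getRedirects()) {
        System.out.println("Redirect: " + redirect);
    }
    for (Category category : page.getCategories()) {
        System.out.println("Category: " + category.getTitle());
    }
    for (Page inlink : page.getInlinks()) {
        System.out.println("Inlink:   " + inlink.getTitle());
    }
    for (Page outlink : page.getOutlinks()) {
        System.out.println("Outlink:  " + outlink.getTitle());
    }
}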

Tutorial 4: Get Categories Information

The Java code adapted to Hebrew is available here: T4_Categories.java. The key API calls in this example are:
  1. wiki.getCategory(title); // Load a category given its name (without the "Category:" prefix)
  2. cat.getTitle();
  3. cat.getParents(); // super categories
  4. cat.getChildren(); // sub categories
  5. cat.getArticles(); // articles that belong to this category
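
A minimal sketch, using the category name from the jwpl.xml sample above:

static void categoryInfo(Wikipedia wiki) throws WikiApiException {
    Category cat = wiki.getCategory("מחשוב"); // no "Category:" prefix
    System.out.println("Title: " + cat.getTitle());
    for (Category parent : cat.getParents()) {
        System.out.println("Parent:  " + parent.getTitle());
    }
    for (Category child : cat.getChildren()) {
        System.out.println("Child:   " + child.getTitle());
    }
    for (Page article : cat.getArticles()) {
        System.out.println("Article: " + article.getTitle());
    }
}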

Tutorial 5: Get Category Descendants

The Java code adapted to Hebrew is available here: T5_PageList.java. The key API call in this example is:
  1. cat.getDescendants(); // recursively collects all descendant categories
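
A minimal sketch (same assumptions as above; getDescendants() walks the whole category subtree below cat):

static void listDescendants(Wikipedia wiki) throws WikiApiException {
    Category cat = wiki.getCategory("מחשוב");
    for (Category descendant : cat.getDescendants()) {
        System.out.println(descendant.getTitle());
    }
}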

Wikipedia API

The Javadoc of the API is available here.


Last modified July 24th, 2012