How to Extract English Fiction Prose from Project Gutenberg

Step 2: Build rdf2sql

We now have a collection of RDF files, each describing one book. Not all of these books are in English, and not all of them are fiction. To select only English fiction prose, we first need to convert these RDF files into a more convenient format: a SQLite database.

The first step is to create a SQL script containing the data about all books described by the RDF files.

For that we will use the tool rdf2sql. You can check out its source code or download it from here (SHA checksum).

rdf2sql is based on the Pravlesian Process Engine, and its logic is described by the process diagram src/main/resources/main.ppmn.fodg:

/ref/en/how-to-extract-english-fiction-prose-from-project-gutenberg-2/main.ppmn.png

The main process

  1. figures out all subdirectories of the provided directory (Determine all subdir names),
  2. initializes the loop (Init loop),
  3. appends the CREATE TABLE ... statement to the SQL script (Add table creation DDL).
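The three steps above can be sketched in plain Java, without the process engine. This is only an illustration and not the actual rdf2sql code: the class name SubdirScanSketch, the table name books and its columns are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SubdirScanSketch {
    // Hypothetical table layout; the real rdf2sql schema may differ.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS books (id TEXT PRIMARY KEY, subdir TEXT, language TEXT);";

    // "Determine all subdir names": list the direct subdirectories of root.
    static List<String> subdirNames(Path root) throws IOException {
        try (Stream<Path> entries = Files.list(root)) {
            return entries
                .filter(Files::isDirectory)
                .map(p -> p.getFileName().toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // "Add table creation DDL": the script starts with the CREATE TABLE statement.
    static StringBuilder startScript() {
        return new StringBuilder(CREATE_TABLE).append('\n');
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        StringBuilder script = startScript();
        // "Init loop": every name found here would be handed to the subprocess.
        for (String name : subdirNames(root)) {
            script.append("-- would process subdirectory ").append(name).append('\n');
        }
        System.out.print(script);
    }
}
```

Running the class with a directory path as its only argument prints the DDL followed by one SQL comment per subdirectory.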

Then, for every subdirectory detected in Determine all subdir names, it calls the Process file subprocess:

/ref/en/how-to-extract-english-fiction-prose-from-project-gutenberg-2/process-file.ppmn.png

The subprocess first reads data from the respective RDF file (Extract data from RDF). Some of the directories downloaded in step 1 are empty. In these cases, the condition include-book? is false and the subprocess instance ends.
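A rough sketch of what such an extraction step might look like is given below. It is not taken from rdf2sql. It assumes that the RDF file of a book sits at <subdir>/pg<subdir name>.rdf and that the language is stored in a dcterms:language element. The class and method names are made up for this example, and an empty Optional plays the role of include-book? being false.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class RdfExtractSketch {
    static final String DCTERMS = "http://purl.org/dc/terms/";

    // Returns the book's language, or an empty Optional if there is nothing to read.
    static Optional<String> language(Path subdir) throws Exception {
        // Assumed naming convention: <subdir>/pg<subdir name>.rdf
        Path rdf = subdir.resolve("pg" + subdir.getFileName() + ".rdf");
        if (!Files.isRegularFile(rdf)) {
            return Optional.empty(); // empty directory: the book is skipped
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up elements by namespace
        Document doc = factory.newDocumentBuilder().parse(rdf.toFile());
        NodeList languages = doc.getElementsByTagNameNS(DCTERMS, "language");
        if (languages.getLength() == 0) {
            return Optional.empty();
        }
        // The text content includes nested elements, so surrounding whitespace is trimmed.
        return Optional.of(languages.item(0).getTextContent().trim());
    }
}
```

Other fields, such as the subjects needed later to recognize fiction, could be read the same way.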

If include-book? is true, the following happens:

  1. A unique identifier is generated (Generate UUID).
  2. An INSERT INTO ... SQL statement is created and appended to the SQL script (Append a DDL stmt. to SQL file).
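These two steps could look roughly like the following in plain Java. The table name books, its columns and the helper names are hypothetical and only serve the illustration. The actual statement generated by rdf2sql may look different.

```java
import java.util.UUID;

class InsertStatementSketch {
    // Wraps a value in single quotes and doubles embedded quotes, as SQL string literals require.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Builds one INSERT statement for a hypothetical books table.
    static String insertStatement(String id, String subdir, String language) {
        return "INSERT INTO books (id, subdir, language) VALUES ("
            + quote(id) + ", " + quote(subdir) + ", " + quote(language) + ");";
    }

    public static void main(String[] args) {
        StringBuilder script = new StringBuilder();
        // "Generate UUID": a random identifier for the book row.
        String id = UUID.randomUUID().toString();
        // Append the statement to the in-memory SQL script; the values are sample data.
        script.append(insertStatement(id, "1342", "en")).append('\n');
        System.out.print(script);
    }
}
```

Doubling single quotes matters here because book titles and author names frequently contain apostrophes.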

The tool produces a SQL file that, when executed, inserts the data from all RDF files downloaded in step 1 into a SQLite database.


Now we need to build rdf2sql with Maven:

mvn clean package

After the build, the executable JAR file is located at target/rdf2sql-1.2-jar-with-dependencies.jar.