How to Extract English Fiction Prose from Project Gutenberg

Purpose of the document

In this tutorial, I will explain how to extract a list of all books in Project Gutenberg which satisfy the following criteria:

  1. The book is in English.
  2. The book is prose (not poetry, not a play).
  3. The book is fiction.

Preconditions

Before you execute the steps in this tutorial, please make sure the following conditions are satisfied:

  1. There is at least 2.5 GB of free space on your hard drive.
  2. rsync is installed on your machine. The script presented here was run on macOS; rsync implementations differ between versions and operating systems, so some options may behave differently on your system.
  3. SQLite version 3.32.3 or newer is installed on your machine.
  4. SQLiteStudio version 3.3.3 or newer is installed on your machine.
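
The preconditions above can be checked from a terminal. The following is only a rough sketch: the names REQUIRED_KB, free_kb and has_free_space are made up for this example, the free space is measured for the current directory (run it from the directory where you plan to download), and the tool versions are only printed, not compared.

```shell
#!/bin/sh
# Sketch: check the preconditions listed above.
# REQUIRED_KB is a hypothetical name: 2.5 GB expressed in 1 KiB blocks.
REQUIRED_KB=$((2560 * 1024))

# Print the free space, in 1 KiB blocks, of the file system holding directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the file system holding directory $1 has at least $2 KiB free.
has_free_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

if has_free_space . "$REQUIRED_KB"; then
    echo "Disk space: OK"
else
    echo "Disk space: less than 2.5 GB free"
fi

# Print the first line of each tool's version output for a manual comparison.
for tool in rsync sqlite3; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" --version | head -n 1
    else
        echo "$tool: not found"
    fi
done
```

SQLiteStudio is a desktop application, so its version is easiest to check in the application itself.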

Step 1: Download the RDF files

First, we need to download the RDF files of all books in Project Gutenberg. To do this, create a shell script named `download-rdfs.sh` with the following contents:

#!/bin/bash

# Destination for the downloaded RDF metadata
DEST_DIR="gutenberg_english_txt"
mkdir -p "$DEST_DIR"

# Gutenberg rsync mirror
MIRROR="rsync://gutenberg.pglaf.org/gutenberg/"

echo "$(date "+%Y-%m-%d %H:%M:%S"): Start to download RDF files"

rsync -av --delete \
      --partial \
      --progress \
      --bwlimit=500 \
      --include="*/" \
      --include="*.rdf" \
      --exclude="*" \
      "$MIRROR"cache/epub/ "$DEST_DIR/cache/"

echo "$(date "+%Y-%m-%d %H:%M:%S"): Done"

Set DEST_DIR to the directory in which the files should be stored. The option --bwlimit=500 limits the transfer rate to about 500 KB per second, so that the download does not put unnecessary load on Project Gutenberg's servers.

Then make the script executable (chmod +x download-rdfs.sh) and run it using the following command:

./download-rdfs.sh 2>&1 | tee download-rdfs.sh.log

This starts the download of all RDF files from Project Gutenberg into the cache subdirectory of DEST_DIR. When I ran this script on 2025-08-26, the downloaded files totaled 2.23 GB (ca. 133 MB compressed), and the download took 3 hours and 34 minutes according to the download-rdfs.sh.log file.
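
Once the download has finished, it can be useful to compare your result with the figures above. The following is a minimal sketch, assuming DEST_DIR has the same value as in download-rdfs.sh; the function name count_rdf_files is made up for this example. Your numbers will differ from mine, because Project Gutenberg keeps adding books.

```shell
#!/bin/sh
# Sketch: summarize what the download produced.
# DEST_DIR must match the value used in download-rdfs.sh.
DEST_DIR="${DEST_DIR:-gutenberg_english_txt}"

# Print the number of *.rdf files below directory $1.
count_rdf_files() {
    find "$1" -type f -name '*.rdf' | wc -l | tr -d ' '
}

if [ -d "$DEST_DIR/cache" ]; then
    echo "RDF files: $(count_rdf_files "$DEST_DIR/cache")"
    # Total size on disk in human-readable form
    du -sh "$DEST_DIR/cache"
else
    echo "Directory $DEST_DIR/cache does not exist yet"
fi
```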

Alternative approach

In addition to the method presented above, you may be able to get all the RDF files by downloading the file rdf-files.tar.bz2 from here. You can find additional information about this here.
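
If you go this way, the archive has to be unpacked before the RDF files can be used. The sketch below assumes that rdf-files.tar.bz2 has already been downloaded into the current directory. EXTRACT_DIR and extract_archive are names made up for this example, and I make no assumption about the directory layout inside the archive, which is why the script lists a few entries first.

```shell
#!/bin/sh
# Sketch: unpack a previously downloaded rdf-files.tar.bz2.
ARCHIVE="rdf-files.tar.bz2"
EXTRACT_DIR="gutenberg_rdf_archive"   # hypothetical target directory

# Extract the bzip2-compressed tar archive $1 into directory $2.
extract_archive() {
    mkdir -p "$2" && tar -xjf "$1" -C "$2"
}

if [ -f "$ARCHIVE" ]; then
    # Show the first few entries to see how the archive is organized
    tar -tjf "$ARCHIVE" | head -n 5
    extract_archive "$ARCHIVE" "$EXTRACT_DIR"
else
    echo "$ARCHIVE not found in the current directory"
fi
```

The extracted directory structure may differ from the one created by the rsync script, so later steps may need a different path.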

