How to Extract English Fiction Prose from Project Gutenberg
Purpose of the document
In this tutorial, I will explain how to extract a list of all books in Project Gutenberg which satisfy the following criteria:
- The book is in English.
- The book is prose (not poetry, not a play).
- The book is fiction.
Preconditions
Before you execute the steps in this tutorial, please make sure the following conditions are satisfied:
- There is at least 2.5 GB of free space on your hard drive.
- The correct version of rsync is installed on your machine. The script presented here was executed on macOS. There may be differences between rsync implementations across different versions and different operating systems.
- SQLite version 3.32.3 or newer is installed on your machine.
- SQLiteStudio version 3.3.3 or newer is installed on your machine.
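Most of these preconditions can be checked from a terminal before you start. The following is a minimal sketch, assuming a POSIX shell with `awk` and `df` available. The helper name `version_ge` is hypothetical, and 2.5 GB is treated here as 2.5 × 1024 × 1024 KiB. SQLiteStudio is a GUI application and is not checked.

```shell
# Hypothetical helper: succeeds if dotted version $1 is >= dotted version $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Free space in the current directory, in KiB (-P forces one line per filesystem).
required_kib=$(( 5 * 1024 * 1024 / 2 ))   # assumption: 2.5 GB expressed in KiB
free_kib=$(df -Pk . | awk 'NR == 2 { print $4 }')
if [ "${free_kib:-0}" -ge "$required_kib" ]; then
  echo "Disk space: OK (${free_kib} KiB free)"
else
  echo "Disk space: less than 2.5 GB free (${free_kib:-0} KiB)"
fi

# rsync: only report the installed version, since no minimum version is stated.
if command -v rsync >/dev/null 2>&1; then
  rsync --version | head -n 1
else
  echo "rsync: not found"
fi

# SQLite: the first field of the version output is the version number.
if command -v sqlite3 >/dev/null 2>&1; then
  sqlite_version=$(sqlite3 -version | awk '{ print $1 }')
  if version_ge "$sqlite_version" "3.32.3"; then
    echo "SQLite $sqlite_version: OK"
  else
    echo "SQLite $sqlite_version: older than 3.32.3"
  fi
else
  echo "sqlite3: not found"
fi
```

Run it from the directory (or at least the disk) where you plan to store the download, because `df` reports the free space of the file system that contains the current directory.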
Step 1: Download the RDF files
First, we need to download the RDF metadata files of all books in Project Gutenberg. To do this, we need a shell script `download-rdfs.sh` with the following contents:
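#!/usr/bin/env bash

# Destination for metadata
DEST_DIR="gutenberg_english_txt"
mkdir -p "$DEST_DIR"

# Gutenberg rsync mirror
MIRROR="rsync://gutenberg.pglaf.org/gutenberg/"

echo "$(date "+%Y-%m-%d %H:%M:%S"): Start to download RDF files"

# Copy only *.rdf files: --include="*/" lets rsync descend into
# subdirectories, and --exclude="*" skips everything else.
# --delete removes local *.rdf files that no longer exist on the mirror.
rsync -av --delete \
    --partial \
    --progress \
    --bwlimit=500 \
    --include="*/" \
    --include="*.rdf" \
    --exclude="*" \
    "${MIRROR}cache/epub/" "$DEST_DIR/cache/"

echo "$(date "+%Y-%m-%d %H:%M:%S"): Done"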
Set DEST_DIR to the directory into which the files should be downloaded; the RDF files end up in its cache subdirectory. The --bwlimit=500 option caps the transfer rate at roughly 500 KB/s, so that we don't put unnecessary load on Project Gutenberg's servers.
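To get a feel for what the limit means in practice: rsync reads a bare `--bwlimit` number in units of 1024 bytes per second. The sketch below converts the cap into other units and computes a lower bound for the transfer time. The function name `min_transfer_seconds` and the 1 GB payload are hypothetical, chosen only for illustration.

```shell
# Value passed to --bwlimit in download-rdfs.sh (units of 1024 bytes per second).
BWLIMIT_KIB=500

# Convert the cap to bytes per second and to kilobits per second.
cap_bytes_per_s=$(( BWLIMIT_KIB * 1024 ))
cap_kbit_per_s=$(( cap_bytes_per_s * 8 / 1000 ))
echo "Cap: $cap_bytes_per_s bytes/s (about $cap_kbit_per_s kbit/s)"

# Hypothetical helper: lower bound on the transfer time in whole seconds.
# $1 = payload size in bytes, $2 = --bwlimit value
min_transfer_seconds() {
  echo $(( $1 / ($2 * 1024) ))
}

# Illustration with an assumed payload of 1 GB (10^9 bytes).
min_transfer_seconds 1000000000 "$BWLIMIT_KIB"
```

This is only a lower bound. The real download can take considerably longer, because the transfer consists of many small files and the per-file overhead is not covered by this calculation.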
Then make the script executable with `chmod +x download-rdfs.sh` and run it using the following command:
./download-rdfs.sh 2>&1 | tee download-rdfs.sh.log
This starts the download of all RDF files from Project Gutenberg into the directory DEST_DIR and writes the output both to the terminal and to the file download-rdfs.sh.log. When I ran this script on 2025-08-26, the total size of the downloaded files was 2.23 GB (ca. 133 MB compressed), and the download took 3 hours and 34 minutes according to the download-rdfs.sh.log file.
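Once the script has printed "Done", it can be worth checking what actually arrived. The following is a minimal sketch, assuming the default DEST_DIR from the script above. The function name `count_rdfs` is hypothetical. The numbers you see will differ from mine, since the catalog keeps growing.

```shell
# Same directory as in download-rdfs.sh; adjust it if you changed DEST_DIR there.
DEST_DIR="${DEST_DIR:-gutenberg_english_txt}"

# Hypothetical helper: print the number of *.rdf files below the directory $1.
count_rdfs() {
  find "$1" -type f -name '*.rdf' | wc -l | tr -d ' '
}

if [ -d "$DEST_DIR/cache" ]; then
  echo "RDF files: $(count_rdfs "$DEST_DIR/cache")"
  # Total disk usage of the downloaded metadata
  du -sh "$DEST_DIR/cache"
else
  echo "Directory $DEST_DIR/cache not found; has the download finished?"
fi
```

If the file count is zero or the size is far below the figure quoted above, look for rsync error messages in download-rdfs.sh.log before moving on.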