[IIAB] Gutenberg epubs and html

Fri Jun 21 13:04:09 PDT 2013

Hi Emmanuel,

I finally got around to distilling the Gutenberg collection.  I've 
extracted all 40,000 books in EPUB format and also converted them to 
zipped HTML ("htmlz").  I've made collections both with and without images.
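
A batch conversion like this could be scripted roughly as follows.  This 
is only a sketch: it assumes Calibre's ebook-convert command-line tool is 
installed (the converter actually used is not stated here), and the 
function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(epub_path, out_dir):
    """Return the ebook-convert argument list for one EPUB file.

    Assumes Calibre's `ebook-convert INPUT OUTPUT` form, where the output
    format is chosen from the output file extension (.htmlz here).
    """
    epub_path = Path(epub_path)
    out_path = Path(out_dir) / (epub_path.stem + ".htmlz")
    return ["ebook-convert", str(epub_path), str(out_path)]


def convert_all(epub_dir, out_dir):
    """Convert every *.epub in epub_dir to HTMLZ in out_dir (hypothetical layout)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for epub_path in sorted(Path(epub_dir).glob("*.epub")):
        # check=True raises CalledProcessError if a conversion fails
        subprocess.run(build_convert_command(epub_path, out_dir), check=True)
```

Calling convert_all("gutenberg-epub", "gutenberg-htmlz") would then walk 
one flat directory; a real run over 40,000 books would probably want 
logging and a way to skip books that fail to convert.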

The result is four collections (sizes in gigabytes):
6.9G    gutenberg-htmlz
23G     gutenberg-htmlz-images
7.0G    gutenberg-epub
23G     gutenberg-epub-images

There is no metadata in these collections, just the books.  We could 
generate some metadata from our database.  I'm not sure what you would 
need to make a usable ZIM file.

We are probably just going to keep them in this format (instead of 
zimmifying them all) for the Internet-in-a-Box.

I can make these available via torrents.

In other news, I wrote a pure-Python ZIM file reader from scratch, which 
I'm calling "zimpy".
https://github.com/braddockcg/internet-in-a-box/blob/master/iiab/zimpy.py

I'm now using zimpy to read ZIM files for Internet-in-a-Box, and if it 
matures a bit more I'll release it as a separate project.  It currently 
does nothing beyond what I need.  The existing openzim bindings did not 
support any read capability.
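
To give a feel for what reading a ZIM file in pure Python involves, here 
is a minimal sketch that parses the fixed 80-byte header.  It is not taken 
from zimpy; the field layout follows my understanding of the published 
openZIM file format, and the names ZimHeader, parse_zim_header and 
read_zim_header are made up for illustration.

```python
import struct
from collections import namedtuple

# Magic number at the start of every ZIM file (0x044D495A, little-endian).
ZIM_MAGIC = 72173914

# Little-endian, no padding: magic, major/minor version, 16-byte UUID,
# article and cluster counts, four 64-bit offsets, main and layout page
# indexes, and the checksum offset.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

ZimHeader = namedtuple("ZimHeader", [
    "magic", "major_version", "minor_version", "uuid",
    "article_count", "cluster_count",
    "url_ptr_pos", "title_ptr_pos", "cluster_ptr_pos", "mime_list_pos",
    "main_page", "layout_page", "checksum_pos",
])


def parse_zim_header(data):
    """Parse the leading bytes of a ZIM file into a ZimHeader."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least %d bytes for a ZIM header" % HEADER_SIZE)
    header = ZimHeader._make(struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE]))
    if header.magic != ZIM_MAGIC:
        raise ValueError("not a ZIM file (bad magic number)")
    return header


def read_zim_header(path):
    """Read and parse the header of the ZIM file at path."""
    with open(path, "rb") as f:
        return parse_zim_header(f.read(HEADER_SIZE))
```

The header only locates the pointer lists and clusters; a full reader 
still has to follow those offsets, decode directory entries and 
decompress clusters, which is where most of the work is.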

-braddock