[IIAB] Gutenberg epubs and html
Braddock
braddock at braddock.com
Fri Jun 21 13:04:09 PDT 2013
Hi Emmanuel,
I finally got around to distilling the Gutenberg collection. I've
extracted all 40,000 books in epub format, and also converted them to
zipped html ("htmlz"). I've made collections both with and without images.
The result is four collections (sizes in gigabytes):
6.9G gutenberg-htmlz
23G gutenberg-htmlz-images
7.0G gutenberg-epub
23G gutenberg-epub-images
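For anyone wanting to consume the htmlz collections directly, here is a minimal sketch of pulling the main HTML page out of one archive with Python's standard zipfile module. It assumes each htmlz file is an ordinary zip whose main document is named index.html (the usual layout of calibre's HTMLZ output, but not verified against these collections). The function name read_htmlz_index is hypothetical.

```python
import zipfile


def read_htmlz_index(path, entry="index.html"):
    """Return the main HTML document of an htmlz archive as text.

    Assumption: the archive is a plain zip and the main page is
    called index.html; neither is confirmed for these collections.
    """
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        if entry not in names:
            # Fall back to the first HTML member if the assumed name is absent.
            html_names = [n for n in names
                          if n.lower().endswith((".html", ".htm"))]
            if not html_names:
                raise ValueError("no HTML member found in %s" % path)
            entry = html_names[0]
        # Decode leniently; the real encoding may vary per book.
        return archive.read(entry).decode("utf-8", errors="replace")
```

Serving a book this way needs no unpacking step on disk, which is one reason to keep the zipped form.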
There is no metadata in these collections, just the books. We could
generate some metadata from our database. I'm not sure what you would
need to make a usable ZIM file.
We are probably just going to keep them in this format (instead of
zimmifying them all) for the Internet-in-a-Box.
I can make these available via torrents.
In other news, I wrote a pure-Python ZIM file reader from scratch, which
I'm calling "zimpy".
https://github.com/braddockcg/internet-in-a-box/blob/master/iiab/zimpy.py
I'm now using the zimpy code for reading zims for Internet-in-a-Box, and
if it gets a bit more mature I'll release it as a separate project. It
doesn't currently do anything more than I need. The existing openzim
bindings did not support any read capability.
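To give a sense of what a pure-Python ZIM reader starts with, here is a minimal sketch of parsing the fixed 80-byte ZIM file header with the struct module. This is not the zimpy code. The field layout follows the openZIM header description as I understand it, and the names ZimHeader, parse_zim_header and read_zim_header are hypothetical. Older files may treat the version as one 32-bit integer; the two 16-bit fields below cover the same four bytes.

```python
import struct
from collections import namedtuple

ZIM_MAGIC = 72173914  # 0x044D495A, stored on disk as the bytes "ZIM\x04"

# Little-endian, no padding. Layout assumed from the openZIM header spec.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes

ZimHeader = namedtuple("ZimHeader", [
    "magic", "major_version", "minor_version", "uuid",
    "article_count", "cluster_count",
    "url_ptr_pos", "title_ptr_pos", "cluster_ptr_pos", "mime_list_pos",
    "main_page", "layout_page", "checksum_pos",
])


def parse_zim_header(data):
    """Unpack the first HEADER_SIZE bytes of a ZIM file into a ZimHeader."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least %d bytes of header" % HEADER_SIZE)
    header = ZimHeader._make(struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE]))
    if header.magic != ZIM_MAGIC:
        raise ValueError("not a ZIM file (bad magic number)")
    return header


def read_zim_header(path):
    """Read and parse the header of the ZIM file at path."""
    with open(path, "rb") as f:
        return parse_zim_header(f.read(HEADER_SIZE))
```

The pointer positions in the header are byte offsets into the file, so a reader can seek straight to the URL, title and cluster pointer lists from here.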
-braddock