[IIAB] gutenberg files

Braddock braddock at braddock.com
Fri Mar 15 15:04:04 PDT 2013


Hi Joel,
Okay, I started a full mirror of the gutenberg cache.

I am excluding .mobi, .log, .pdb, .rdf, and .qioo.jar files.
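
A sketch of how that exclusion list could be passed to rsync as --exclude
patterns. The helper name gutenberg_excludes is made up for illustration, and
the module and destination in the commented command are taken from the rsync
line quoted further down in this thread, so this is not necessarily the exact
command used for the mirror:

```shell
# Hypothetical helper: print one rsync --exclude option per unwanted extension.
# The extension list is the one given in this message.
gutenberg_excludes() {
    for ext in .mobi .log .pdb .rdf .qioo.jar; do
        printf '%s\n' "--exclude=*${ext}"
    done
}

# Possible use (assumption: same module and destination as the earlier partial
# mirror; left commented out so nothing is downloaded by accident):
# set -f   # keep the shell from expanding the * in the patterns
# rsync -avHS $(gutenberg_excludes) ftp.ibiblio.org::gutenberg-epub generated
```

Note that the *.pdb pattern also covers the .plucker.pdb files.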

I am including .epub because Calibre makes it easy to convert those files to
HTML plus images, which we may want.  For example:

ebook-convert pg10894.epub pg10894.zip
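
If the conversion is wanted for the whole mirror, the single command above
could be wrapped in a loop. This is only a sketch: the function names and the
directory argument are placeholders, and it assumes Calibre's ebook-convert is
on the PATH and picks the output format from the .zip extension, as in the
example above.

```shell
# Hypothetical helper: map an .epub path to the matching .zip output name.
zip_name_for() {
    printf '%s\n' "${1%.epub}.zip"
}

# Hypothetical helper: convert every .epub found under the directory "$1".
# Failures are reported on stderr and the loop carries on with the next file.
convert_all() {
    find "$1" -name '*.epub' | while IFS= read -r epub; do
        # </dev/null keeps the converter from consuming the file list on stdin
        ebook-convert "$epub" "$(zip_name_for "$epub")" </dev/null \
            || echo "conversion failed: $epub" >&2
    done
}

# Possible use (the path is a placeholder):
# convert_all /path/to/mirror
```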

I don't know how long it will take, maybe a couple of days; it depends on
how aggressively ibiblio kicks me off.  rsync takes over an hour to
re-establish a connection and start actually downloading, because it very
slowly transfers a full index of all the files on the site each time.
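
One way to cope with being kicked off is to rerun rsync automatically until it
exits cleanly. The wrapper below is a minimal sketch, not the script actually
in use; the function name, the attempt limit, and the delay are all
illustrative assumptions.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Possible use (placeholder numbers; rsync skips files already transferred,
# but each attempt still pays the index-transfer cost described above):
# retry 100 600 rsync -avHS ftp.ibiblio.org::gutenberg-epub generated
```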

I wish they had a recent torrent.  I guesstimate it will only be about 
30GB of data.

-braddock

On 03/14/2013 07:09 PM, Joel Steres wrote:
>> I don't want to assume any e-book reader software.  For a first pass we
>> should provide the ebooks in straight html or text.
> I'm fine with that.  I will be a little surprised if text format is
> not provided for books that also have ereader formats.  Anyway, the
> index directly references 9 file extensions from the cache. If we
> exclude the ebooks we exclude:
>      .epub
>      .mobi
>      .plucker.pdb
>      .qioo.jar
>
> This will leave the following extensions (assuming we keep pdf):
>      .cover.medium.jpg
>      .cover.small.jpg
>      .html.utf8
>      .pdf
>      .txt.utf8
>
> -Joel
>
>
> On Thu, Mar 14, 2013 at 4:07 PM, Braddock <braddock at braddock.com> wrote:
>> Hi Joel,
>>
>> The target I have in mind is a user on a cheap tablet or OLPC laptop with
>> only a basic web browser, maybe even a wifi enabled phone.  Perhaps not even
>> Javascript, so we should degrade gracefully (as jQuery Mobile does).
>>
>> I don't want to assume any e-book reader software.  For a first pass we
>> should provide the ebooks in straight html or text.  The ebook formats may
>> be of interest if we can convert them to HTML with images (which does not
>> seem to be available in the cache). Obviously we could add e-book formats
>> later fairly easily.
>>
>> -braddock
>>
>>
>>
>> On 03/14/2013 02:04 PM, Joel Steres wrote:
>>> Hi Braddock,
>>>
>>> What formats we need is a question/discussion about the vision for iiab.
>>> The indexed cache contents consist of 243k files in the following
>>> formats.
>>>
>>> application/epub+zip
>>> application/pdf
>>> application/prs.plucker
>>> application/x-mobipocket-ebook
>>> application/x-qioo-ebook
>>> image/jpeg
>>> text/html
>>> text/plain
>>> text/plain; charset="utf-8"
>>>
>>> (I have not looked but it is possible that some cached content is
>>> referenced by non-cached files perhaps not represented above.)
>>>
>>> Who is the user?  I don't have good insight into the intended user
>>> base.  Might ebooks be valuable to them?  I think the generated text,
>>> html and probably images are an obvious choice for inclusions.  Ebook
>>> formats seem kind of cool, but I can't assess their value without
>>> knowing more about the audience.
>>>
>>> -Joel
>>>
>>>
>>> On Thu, Mar 14, 2013 at 1:37 PM, Braddock <braddock at braddock.com> wrote:
>>>> Hi Joel,
>>>>
>>>> So is cache/generated something you want or need?  If so I'll complete
>>>> the mirror (ibiblio kicks my script off for a while every few gigabytes
>>>> so it might easily take a week).
>>>>
>>>> It will be much much faster if we don't need to download the .epub and
>>>> .mobi files (which contain images).
>>>>
>>>> -braddock
>>>>
>>>>
>>>> On 03/13/2013 10:46 PM, Joel Steres wrote:
>>>>> The gutenberg db and index have been removed from the repository.
>>>>>
>>>>> On closer inspection it does look like the cache/generated content is
>>>>> the content the index shows under cache/epub.  The problems I
>>>>> encountered are probably just due to the incomplete mirroring.
>>>>>
>>>>> Hope you're doing well.
>>>>>
>>>>> -Joel
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2013 at 8:04 PM, Braddock <braddock at braddock.com> wrote:
>>>>>> Hi Joel,
>>>>>>
>>>>>> Thanks for your activity.  I haven't been able to keep completely up
>>>>>> the last few days.
>>>>>>
>>>>>> I mirrored some of cache/generated to another server using:
>>>>>> rsync -avHS --delete --delete-after ftp.ibiblio.org::gutenberg-epub generated
>>>>>>
>>>>>> I've copied that incomplete download (only 5.7 GB) to zhen in
>>>>>> /knowledge/data/gutenberg/cached now.
>>>>>>
>>>>>> If you want a symlink from within static/ that is fine with me.
>>>>>>
>>>>>> I've seen no sign of a cache/epub/ directory.
>>>>>>
>>>>>> I've been trying to keep the path /knowledge universal across devices
>>>>>> (zhen, the Satellite, the GoFlex Home, and my personal server) so links
>>>>>> into it should work anywhere.
>>>>>>
>>>>>> On a side note, the 100MB gutenberg.db should probably not be in the
>>>>>> git repo.  I'd prefer if it lived under /knowledge/processed/, which is
>>>>>> where I'm keeping all processed data.
>>>>>>
>>>>>> I hope to have some time to get back into IIAB in the next couple days.
>>>>>> We had the funeral today, so things should begin to return to normal.
>>>>>>
>>>>>> -braddock
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/12/2013 09:58 AM, Joel Steres wrote:
>>>>>>> Hi Braddock,
>>>>>>>
>>>>>>>> I am also mirroring cache/generated - the gutenberg mirrors seem to
>>>>>>>> block access to it via ftp etc, but I can get it via rsync. Maybe
>>>>>>>> those files will be more consistent.
>>>>>>> Thanks for mirroring cache/generated. In the current catalog all files
>>>>>>> referencing 'cache' point to cache/epub/... rather than
>>>>>>> cache/generated/ and the contents of the two paths differ. I looked at
>>>>>>> the rsync script from git but it does not seem to include the addition
>>>>>>> for gutenberg.org/cache mirroring.  Could you either make the
>>>>>>> adjustment or show me where to do so?
>>>>>>>
>>>>>>> Also, I found that html files include images.  It might be easier to
>>>>>>> put the gutenberg files into the flask static directory and permit the
>>>>>>> existing paths to work.  No objections if I symlink to
>>>>>>> /knowledge/data/gutenberg/gutenberg/ from iiab/static/gutenberg/data/?
>>>>>>>
>>>>>>> -Joel
>>>>>>
>>>>>>
>>
