[IIAB] gutenberg files

Braddock braddock at braddock.com
Sun Mar 17 10:50:39 PDT 2013


Joel,

The mirror of the gutenberg cache is complete.  54 GB total. 
/knowledge/data/gutenberg/cache

-braddock


On 03/15/2013 03:04 PM, Braddock wrote:
> Hi Joel,
> Okay, I started a full mirror of the gutenberg cache.
>
> I am excluding .mobi, .log, .pdb, .rdf, .qioo.jar
>
> I am including .epub because it is easy with Calibre to convert it to 
> HTML + Images which we may want.  For example:
>
> ebook-convert pg10894.epub pg10894.zip
>
> I don't know how long it will take, maybe a couple days - it depends 
> how aggressively ibiblio kicks me off.  rsync takes over an hour to 
> re-establish a connection and start actual downloading because it very 
> slowly transfers a full index of all the files on the site each time.
>
> I wish they had a recent torrent.  I guesstimate it will only be about 
> 30 GB of data.
>
> -braddock
>
> On 03/14/2013 07:09 PM, Joel Steres wrote:
>>> I don't want to assume any e-book reader software.  For a first pass we
>>> should provide the ebooks in straight html or text.
>> I'm fine with that.  I will be a little surprised if text format is
>> not provided for books that also have ereader formats.  Anyway, the
>> index directly references 9 file extensions from the cache. If we
>> exclude the ebooks we exclude:
>>      .epub
>>      .mobi
>>      .plucker.pdb
>>      .qioo.jar
>>
>> This will leave the following extensions (assuming we keep pdf):
>>      .cover.medium.jpg
>>      .cover.small.jpg
>>      .html.utf8
>>      .pdf
>>      .txt.utf8
>>
>> -Joel
>>
>>
>> On Thu, Mar 14, 2013 at 4:07 PM, Braddock <braddock at braddock.com> wrote:
>>> Hi Joel,
>>>
>>> The target I have in mind is a user on a cheap tablet or OLPC laptop with
>>> only a basic web browser, maybe even a wifi enabled phone. Perhaps not even
>>> Javascript, so we should degrade gracefully (as jQuery Mobile does).
>>>
>>> I don't want to assume any e-book reader software.  For a first pass we
>>> should provide the ebooks in straight html or text.  The ebook formats may
>>> be of interest if we can convert them to HTML with images (which does not
>>> seem to be available in the cache). Obviously we could add e-book formats
>>> later fairly easily.
>>>
>>> -braddock
>>>
>>>
>>>
>>> On 03/14/2013 02:04 PM, Joel Steres wrote:
>>>> Hi Braddock,
>>>>
>>>> What formats we need is a question/discussion about the vision for iiab.
>>>> The indexed cache contents consist of 243k files in the following
>>>> formats.
>>>>
>>>> application/epub+zip
>>>> application/pdf
>>>> application/prs.plucker
>>>> application/x-mobipocket-ebook
>>>> application/x-qioo-ebook
>>>> image/jpeg
>>>> text/html
>>>> text/plain
>>>> text/plain; charset="utf-8"
>>>>
>>>> (I have not looked but it is possible that some cached content is
>>>> referenced by non-cached files perhaps not represented above.)
>>>>
>>>> Who is the user?  I don't have good insight into the intended user
>>>> base.  Might ebooks be valuable to them?  I think the generated text,
>>>> html and probably images are an obvious choice for inclusions.  Ebook
>>>> formats seem kind of cool, but I can't assess their value without
>>>> knowing more about the audience.
>>>>
>>>> -Joel
>>>>
>>>>
>>>> On Thu, Mar 14, 2013 at 1:37 PM, Braddock <braddock at braddock.com> wrote:
>>>>> Hi Joel,
>>>>>
>>>>> So is cache/generated something you want or need?  If so I'll complete
>>>>> the mirror (ibiblio kicks my script off for a while every few gigabytes
>>>>> so it might easily take a week).
>>>>>
>>>>> It will be much, much faster if we don't need to download the .epub and
>>>>> .mobi files (which contain images).
>>>>>
>>>>> -braddock
>>>>>
>>>>>
>>>>> On 03/13/2013 10:46 PM, Joel Steres wrote:
>>>>>> The gutenberg db and index have been removed from the repository.
>>>>>>
>>>>>> On closer inspection it does look like the cache/generated content is
>>>>>> the content the index shows under cache/epub.  The problems I
>>>>>> encountered are probably just due to the incomplete mirroring.
>>>>>>
>>>>>> Hope you're doing well.
>>>>>>
>>>>>> -Joel
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 12, 2013 at 8:04 PM, Braddock <braddock at braddock.com> wrote:
>>>>>>> Hi Joel,
>>>>>>>
>>>>>>> Thanks for your activity.  I haven't been able to keep up completely
>>>>>>> the last few days.
>>>>>>>
>>>>>>> I mirrored some of cache/generated to another server using:
>>>>>>> rsync -avHS --delete --delete-after ftp.ibiblio.org::gutenberg-epub generated
>>>>>>>
>>>>>>> I've copied that incomplete download (only 5.7 GB) to zhen in
>>>>>>> /knowledge/data/gutenberg/cached now.
>>>>>>>
>>>>>>> If you want a symlink from within static/ that is fine with me.
>>>>>>>
>>>>>>> I've seen no sign of a cache/epub/ directory.
>>>>>>>
>>>>>>> I've been trying to keep the path /knowledge universal across devices
>>>>>>> (zhen, the Satellite, the GoFlex Home, and my personal server) so links
>>>>>>> into it should work anywhere.
>>>>>>>
>>>>>>> On a side note, the 100MB gutenberg.db should probably not be in the
>>>>>>> git repo.  I'd prefer if it lived under /knowledge/processed/, which is
>>>>>>> where I'm keeping all processed data.
>>>>>>>
>>>>>>> I hope to have some time to get back into IIAB in the next couple days.
>>>>>>> We had the funeral today, so things should begin to return to normal.
>>>>>>>
>>>>>>> -braddock
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 03/12/2013 09:58 AM, Joel Steres wrote:
>>>>>>>> Hi Braddock,
>>>>>>>>
>>>>>>>>> I am also mirroring cache/generated - the gutenberg mirrors seem to
>>>>>>>>> block access to it via ftp etc, but I can get it via rsync. Maybe
>>>>>>>>> those files will be more consistent.
>>>>>>>> Thanks for mirroring cache/generated. In the current catalog all files
>>>>>>>> referencing 'cache' point to cache/epub/... rather than
>>>>>>>> cache/generated/ and the contents of the two paths differ. I looked at
>>>>>>>> the rsync script from git but it does not seem to include the addition
>>>>>>>> for gutenberg.org/cache mirroring.  Could you either make the
>>>>>>>> adjustment or show me where to do so?
>>>>>>>>
>>>>>>>> Also, I found that html files include images.  It might be easier to
>>>>>>>> put the gutenberg files into the flask static directory and permit the
>>>>>>>> existing paths to work.  No objections if I symlink to
>>>>>>>> /knowledge/data/gutenberg/gutenberg/ from iiab/static/gutenberg/data/?
>>>>>>>>
>>>>>>>> -Joel
>>>>>>>
>>>>>>>
>>>
>
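
For anyone repeating the mirror described in this thread, here is a rough sketch of how the exclusions might be passed to rsync. The rsync module name and base flags are the ones quoted in the thread. The destination directory name, the function name build_mirror_cmd, and the exact spelling of the --exclude patterns are assumptions for illustration, not the command that was actually run. The script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: prints an rsync command for the cache mirror, does not run it.
# SRC is the rsync module quoted in the thread. DEST and the --exclude
# patterns are assumptions made for illustration.

SRC="ftp.ibiblio.org::gutenberg-epub"
DEST="generated"

build_mirror_cmd() {
    cmd="rsync -avHS --delete --delete-after"
    # Extensions the thread says were excluded from the full mirror
    for ext in mobi log pdb rdf qioo.jar; do
        cmd="$cmd --exclude='*.$ext'"
    done
    printf '%s %s %s\n' "$cmd" "$SRC" "$DEST"
}

# Print the command for review rather than executing it
build_mirror_cmd
```

Note that a pattern such as '*.pdb' would also match the .plucker.pdb files mentioned earlier in the thread.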

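The single ebook-convert example in the thread could be extended to a batch along these lines. This is a minimal sketch: the helper name convert_cmd is hypothetical, and it assumes file names without spaces (as in pg10894.epub). It prints one command per input file so the list can be checked before anything is run. Actually running the printed commands requires Calibre to be installed.

```shell
#!/bin/sh
# Sketch only: prints one ebook-convert command per .epub argument.
# Assumes file names contain no spaces, as with pg10894.epub.

convert_cmd() {
    epub="$1"
    # Strip the .epub suffix and use .zip, as in the thread's example
    printf 'ebook-convert %s %s\n' "$epub" "${epub%.epub}.zip"
}

for f in "$@"; do
    convert_cmd "$f"
done
```

Saved under a name of your choosing and invoked with a list of .epub files, this prints the commands, which can then be piped to sh once they look right.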

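The split Joel describes between e-reader formats and the remaining extensions can be expressed as a small filter. This is only a sketch: the function name classify and the labels "ebook", "keep" and "other" are made up for illustration, and the extension lists are copied from his message.

```shell
#!/bin/sh
# Sketch only: label a cache file name by the extension groups listed
# in the thread. The labels are hypothetical, chosen for illustration.

classify() {
    case "$1" in
        # E-reader formats that would be dropped if ebooks are excluded
        *.epub|*.mobi|*.plucker.pdb|*.qioo.jar)
            echo ebook ;;
        # Extensions that would remain (assuming pdf is kept)
        *.cover.medium.jpg|*.cover.small.jpg|*.html.utf8|*.pdf|*.txt.utf8)
            echo keep ;;
        *)
            echo other ;;
    esac
}

# Label each file name given on the command line
for f in "$@"; do
    printf '%s %s\n' "$(classify "$f")" "$f"
done
```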

