[IIAB] gutenberg files

Joel Steres joel.steres at ymobility.com
Thu Mar 14 19:09:16 PDT 2013


> I don't want to assume any e-book reader software.  For a first pass we
> should provide the ebooks in straight html or text.

I'm fine with that.  I would be a little surprised if a text format were
not provided for books that also have e-reader formats.  Anyway, the
index directly references 9 file extensions from the cache.  If we
exclude the ebook formats, we drop:
    .epub
    .mobi
    .plucker.pdb
    .qioo.jar

This will leave the following extensions (assuming we keep pdf):
    .cover.medium.jpg
    .cover.small.jpg
    .html.utf8
    .pdf
    .txt.utf8
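
As a rough illustration of that split, here is a minimal Python sketch. The
two extension lists are the ones given above; the function names and the
sample file names are hypothetical and are not taken from the IIAB code or
the actual cache listing.

```python
# Extensions dropped if we exclude the ebook formats (from the list above).
EBOOK_EXTENSIONS = (".epub", ".mobi", ".plucker.pdb", ".qioo.jar")

# Extensions that remain, assuming we keep pdf (from the list above).
KEPT_EXTENSIONS = (
    ".cover.medium.jpg",
    ".cover.small.jpg",
    ".html.utf8",
    ".pdf",
    ".txt.utf8",
)


def classify(filename):
    """Return 'excluded', 'kept', or 'unknown' for a cache file name."""
    name = filename.lower()
    # str.endswith accepts a tuple, which handles multi-part suffixes
    # such as ".plucker.pdb" without any extra parsing.
    if name.endswith(EBOOK_EXTENSIONS):
        return "excluded"
    if name.endswith(KEPT_EXTENSIONS):
        return "kept"
    return "unknown"


def split_files(filenames):
    """Partition file names into (kept, excluded); unknown names are ignored."""
    kept = [f for f in filenames if classify(f) == "kept"]
    excluded = [f for f in filenames if classify(f) == "excluded"]
    return kept, excluded


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = ["pg1234.epub", "pg1234.html.utf8", "pg1234.cover.small.jpg"]
    kept, excluded = split_files(sample)
    print("kept:", kept)
    print("excluded:", excluded)
```

Matching on the full suffix rather than only the last dot-separated part
keeps ".plucker.pdb" and ".qioo.jar" distinct from any other ".pdb" or
".jar" files that might turn up.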

-Joel


On Thu, Mar 14, 2013 at 4:07 PM, Braddock <braddock at braddock.com> wrote:
> Hi Joel,
>
> The target I have in mind is a user on a cheap tablet or OLPC laptop with
> only a basic web browser, maybe even a wifi enabled phone.  Perhaps not even
> Javascript, so we should degrade gracefully (as jQuery Mobile does).
>
> I don't want to assume any e-book reader software.  For a first pass we
> should provide the ebooks in straight html or text.  The ebook formats may
> be of interest if we can convert them to HTML with images (which does not
> seem to be available in the cache). Obviously we could add e-book formats
> later fairly easily.
>
> -braddock
>
>
>
> On 03/14/2013 02:04 PM, Joel Steres wrote:
>>
>> Hi Braddock,
>>
>> Which formats we need is a question/discussion about the vision for iiab.
>> The indexed cache contents consist of 243k files in the following
>> formats.
>>
>> application/epub+zip
>> application/pdf
>> application/prs.plucker
>> application/x-mobipocket-ebook
>> application/x-qioo-ebook
>> image/jpeg
>> text/html
>> text/plain
>> text/plain; charset="utf-8"
>>
>> (I have not looked but it is possible that some cached content is
>> referenced by non-cached files perhaps not represented above.)
>>
>> Who is the user?  I don't have good insight into the intended user
>> base.  Might ebooks be valuable to them?  I think the generated text,
>> html and probably images are an obvious choice for inclusions.  Ebook
>> formats seem kind of cool, but I can't assess their value without
>> knowing more about the audience.
>>
>> -Joel
>>
>>
>> On Thu, Mar 14, 2013 at 1:37 PM, Braddock <braddock at braddock.com> wrote:
>>>
>>> Hi Joel,
>>>
>>> So is cache/generated something you want or need?  If so I'll complete the
>>> mirror (ibiblio kicks my script off for a while every few gigabytes so it
>>> might easily take a week).
>>>
>>> It will be much much faster if we don't need to download the .epub and
>>> .mobi files (which contain images).
>>>
>>> -braddock
>>>
>>>
>>> On 03/13/2013 10:46 PM, Joel Steres wrote:
>>>>
>>>> The gutenberg db and index have been removed from the repository.
>>>>
>>>> On closer inspection it does look like the cache/generated content is
>>>> the content the index shows under cache/epub.  The problems I
>>>> encountered are probably just due to the incomplete mirroring.
>>>>
>>>> Hope you're doing well.
>>>>
>>>> -Joel
>>>>
>>>>
>>>> On Tue, Mar 12, 2013 at 8:04 PM, Braddock <braddock at braddock.com> wrote:
>>>>>
>>>>> Hi Joel,
>>>>>
>>>>> Thanks for your activity.  I haven't been able to keep completely up the
>>>>> last few days.
>>>>>
>>>>> I mirrored some of cache/generated to another server using:
>>>>> rsync -avHS --delete --delete-after ftp.ibiblio.org::gutenberg-epub generated
>>>>>
>>>>> I've copied that incomplete download (only 5.7 GB) to zhen in
>>>>> /knowledge/data/gutenberg/cached now.
>>>>>
>>>>> If you want a symlink from within static/ that is fine with me.
>>>>>
>>>>> I've seen no sign of a cache/epub/ directory.
>>>>>
>>>>> I've been trying to keep the path /knowledge universal across devices
>>>>> (zhen, the Satellite, the GoFlex Home, and my personal server) so links
>>>>> into it should work anywhere.
>>>>>
>>>>> On a side note, the 100MB gutenberg.db should probably not be in the git
>>>>> repo.  I'd prefer if it lived under /knowledge/processed/, which is where
>>>>> I'm keeping all processed data.
>>>>>
>>>>> I hope to have some time to get back into IIAB in the next couple days.
>>>>> We had the funeral today, so things should begin to return to normal.
>>>>>
>>>>> -braddock
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 03/12/2013 09:58 AM, Joel Steres wrote:
>>>>>>
>>>>>> Hi Braddock,
>>>>>>
>>>>>>> I am also mirroring cache/generated - the gutenberg mirrors seem to
>>>>>>> block access to it via ftp etc, but I can get it via rsync. Maybe those
>>>>>>> files will be more consistent.
>>>>>>
>>>>>> Thanks for mirroring cache/generated. In the current catalog all files
>>>>>> referencing 'cache' point to cache/epub/... rather than
>>>>>> cache/generated/ and the contents of the two paths differ. I looked at
>>>>>> the rsync script from git but it does not seem to include the addition
>>>>>> for gutenberg.org/cache mirroring.  Could you either make the
>>>>>> adjustment or show me where to do so?
>>>>>>
>>>>>> Also, I found that html files include images.  It might be easier to
>>>>>> put the gutenberg files into the flask static directory and permit the
>>>>>> existing paths to work.  No objections if I symlink to
>>>>>> /knowledge/data/gutenberg/gutenberg/ from iiab/static/gutenberg/data/?
>>>>>>
>>>>>> -Joel
>>>>>
>>>>>
>>>>>
>>>
>
>


