[IIAB] gutenberg files

Joel Steres joel.steres at ymobility.com
Thu Mar 14 14:04:10 PDT 2013


Hi Braddock,

Which formats we need is really a question about the vision for IIAB.
The indexed cache contents consist of 243k files in the following
formats:

application/epub+zip
application/pdf
application/prs.plucker
application/x-mobipocket-ebook
application/x-qioo-ebook
image/jpeg
text/html
text/plain
text/plain; charset="utf-8"

(I have not checked, but it is possible that non-cached files reference
some cached content whose formats are not represented above.)
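
For anyone wanting to produce a similar tally against a local copy of
the mirror, here is a minimal Python sketch. Treat it as an assumption
rather than the method used for the list above: the function name
tally_mime_types is made up for illustration, and it guesses types from
file extensions with the standard mimetypes module, whereas the list
above comes from the index. The two will not match exactly (extension
guessing cannot produce a charset parameter, for example).

```python
import mimetypes
import os
from collections import Counter


def tally_mime_types(root):
    """Count files under root by MIME type guessed from the file extension.

    Hypothetical helper for illustration; files whose extension is not
    recognized are counted under the label "unknown".
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mime, _encoding = mimetypes.guess_type(name)
            counts[mime or "unknown"] += 1
    return counts
```

Calling tally_mime_types("/knowledge/data/gutenberg/cached") would cover
the partial mirror mentioned further down in this thread, assuming that
path is present on the machine.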

Who is the user?  I don't have good insight into the intended user
base.  Might ebooks be valuable to them?  I think the generated text,
html and probably images are obvious choices for inclusion.  Ebook
formats seem kind of cool, but I can't assess their value without
knowing more about the audience.
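
If the ebook formats turn out not to be needed, the mirror could skip
them. Below is a minimal sketch that builds the argument list, assuming
the rsync module and flags from the command quoted further down in this
thread and assuming the ebook files can be identified by the .epub and
.mobi extensions. build_rsync_command is a hypothetical helper, not
something from the IIAB repository.

```python
def build_rsync_command(source, dest, skip_extensions=()):
    """Return an rsync argument list that skips the given file extensions.

    Hypothetical helper for illustration. The base flags are copied from
    the command quoted later in this thread; each extension becomes an
    rsync --exclude pattern.
    """
    cmd = ["rsync", "-avHS", "--delete", "--delete-after"]
    for ext in skip_extensions:
        # Accept "epub" or ".epub"; the pattern matches any file with
        # that extension.
        cmd.append(f"--exclude=*.{ext.lstrip('.')}")
    cmd += [source, dest]
    return cmd
```

The resulting list can be passed to subprocess.run without a shell, so
the patterns reach rsync unexpanded. Note that rsync's --delete leaves
excluded files already on the receiving side in place unless
--delete-excluded is also given.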

-Joel


On Thu, Mar 14, 2013 at 1:37 PM, Braddock <braddock at braddock.com> wrote:
> Hi Joel,
>
> So is cache/generated something you want or need?  If so I'll complete the
> mirror (ibiblio kicks my script off for a while every few gigabytes so it
> might easily take a week).
>
> It will be much much faster if we don't need to download the .epub and .mobi
> files (which contain images).
>
> -braddock
>
>
> On 03/13/2013 10:46 PM, Joel Steres wrote:
>>
>> The gutenberg db and index have been removed from the repository.
>>
>> On closer inspection it does look like the cache/generated content is
>> the content the index shows under cache/epub.  The problems I
>> encountered are probably just due to the incomplete mirroring.
>>
>> Hope you're doing well.
>>
>> -Joel
>>
>>
>> On Tue, Mar 12, 2013 at 8:04 PM, Braddock <braddock at braddock.com> wrote:
>>>
>>> Hi Joel,
>>>
>>> Thanks for your activity.  I haven't been able to keep up completely over
>>> the last few days.
>>>
>>> I mirrored some of cache/generated to another server using:
>>> rsync -avHS --delete --delete-after ftp.ibiblio.org::gutenberg-epub generated
>>>
>>> I've copied that incomplete download (only 5.7 GB) to zhen in
>>> /knowledge/data/gutenberg/cached now.
>>>
>>> If you want a symlink from within static/ that is fine with me.
>>>
>>> I've seen no sign of a cache/epub/ directory.
>>>
>>> I've been trying to keep the path /knowledge universal across devices
>>> (zhen, the Satellite, the GoFlex Home, and my personal server) so links
>>> into it should work anywhere.
>>>
>>> On a side note, the 100MB gutenberg.db should probably not be in the git
>>> repo.  I'd prefer if it lived under /knowledge/processed/, which is where
>>> I'm keeping all processed data.
>>>
>>> I hope to have some time to get back into IIAB in the next couple days.
>>> We had the funeral today, so things should begin to return to normal.
>>>
>>> -braddock
>>>
>>>
>>>
>>>
>>> On 03/12/2013 09:58 AM, Joel Steres wrote:
>>>>
>>>> Hi Braddock,
>>>>
>>>>> I am also mirroring cache/generated - the gutenberg mirrors seem to
>>>>> block access to it via ftp etc, but I can get it via rsync. Maybe those
>>>>> files will be more consistent.
>>>>
>>>> Thanks for mirroring cache/generated. In the current catalog all files
>>>> referencing 'cache' point to cache/epub/... rather than
>>>> cache/generated/ and the contents of the two paths differ. I looked at
>>>> the rsync script from git but it does not seem to include the addition
>>>> for gutenberg.org/cache mirroring.  Could you either make the
>>>> adjustment or show me where to do so?
>>>>
>>>> Also, I found that html files include images.  It might be easier to
>>>> put the gutenberg files into the flask static directory and permit the
>>>> existing paths to work.  No objections if I symlink to
>>>> /knowledge/data/gutenberg/gutenberg/ from iiab/static/gutenberg/data/?
>>>>
>>>> -Joel
>>>
>>>
>>>
>
>


