[IIAB] gutenberg cached files

Joel Steres joel.steres at ymobility.com
Fri Mar 1 09:26:28 PST 2013


Hi Braddock,

Just a minor point of clarification: the symlink I mentioned is broken
on the source mirror itself. I checked your script and the ibiblio
site before I wrote. The script excludes a number of extensions but
not the ebook formats.

I will go ahead and remove the audio and video format entries from the
database.  Also I will adjust the rsync to omit some a/v extensions
that slipped through.
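
As a rough sketch of the filtering described above (not the actual
script), the check on each database entry could look something like the
following. The extension list and the function names are assumptions
made up for illustration; the real exclude list is not spelled out in
this thread. Paths are assumed to be relative to the mirror root.

```python
from pathlib import PurePosixPath

# Hypothetical audio/video extensions; the real list is not given in this thread.
AV_EXTENSIONS = {".mp3", ".ogg", ".m4a", ".m4b", ".wav", ".avi", ".mpg", ".mpeg"}

# The mirror only holds a broken symlink at cache/generated, so nothing
# beneath that path can be served.
UNAVAILABLE_PREFIX = ("cache", "generated")


def is_wanted(path: str) -> bool:
    """Return True if a mirror-relative path should stay in the database."""
    p = PurePosixPath(path)
    # Drop anything that lives under the broken cache/generated symlink.
    if p.parts[:2] == UNAVAILABLE_PREFIX:
        return False
    # Drop audio/video files, comparing extensions case-insensitively.
    return p.suffix.lower() not in AV_EXTENSIONS


def filter_entries(paths):
    """Keep only the entries that pass is_wanted, preserving order."""
    return [p for p in paths if is_wanted(p)]
```

The same extension set could also drive the rsync exclude patterns, so
that the database and the mirrored files stay consistent.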

-j


On Fri, Mar 1, 2013 at 9:07 AM, Braddock <braddock at braddock.com> wrote:
> Hi Joel,
>
> Those files do not exist because I had rsync exclude them.
>
> Gutenberg has an ENORMOUS quantity of duplicated content in various forms
> (including mp3 of speech synthesizers reading books!).  It is like they were
> intentionally trying to bloat the archive.  It is a mess.  My assumption was
> we would stick to text or html formats.
>
> -braddock
>
>
> On 03/01/2013 09:02 AM, Joel Steres wrote:
>>
>> Greetings,
>>
>> The Gutenberg project references a lot of "generated" files which are
>> not in the ibiblio mirror we use.  The mirror just contains a broken
>> symlink at cache/generated.  Much of it appears to be various ebook
>> formats.  I will continue based on the assumption that cached content
>> is unavailable and so I will filter it out.  If we should try to
>> obtain/include the cached content let me know.
>>
>> Joel
>
>
>


