[IIAB] gutenberg cached files

Braddock braddock at braddock.com
Fri Mar 1 09:29:59 PST 2013


Ah, sorry for the confusion.

Project Gutenberg is just a mess.

-braddock

On 03/01/2013 09:26 AM, Joel Steres wrote:
> Hi Braddock,
>
> Just a minor point of clarification: the symlink I mentioned is broken
> on the source mirror itself. I checked your script and the ibiblio
> site before I wrote.  The script excludes a number of extensions but
> not the ebook formats.
>
> I will go ahead and remove the audio and video format entries from the
> database.  Also, I will adjust the rsync excludes to omit some a/v extensions
> that slipped through.
>
> -j
>
>
> On Fri, Mar 1, 2013 at 9:07 AM, Braddock <braddock at braddock.com> wrote:
>> Hi Joel,
>>
>> Those files do not exist because I had rsync exclude them.
>>
>> Gutenberg has an ENORMOUS quantity of duplicated content in various forms
>> (including mp3 of speech synthesizers reading books!).  It is like they were
>> intentionally trying to bloat the archive.  It is a mess.  My assumption was
>> we would stick to text or html formats.
>>
>> -braddock
>>
>>
>> On 03/01/2013 09:02 AM, Joel Steres wrote:
>>> Greetings,
>>>
>>> The Gutenberg project references a lot of "generated" files which are
>>> not in the ibiblio mirror we use.  The mirror just contains a broken
>>> symlink at cache/generated.  Much of it appears to be various ebook
>>> formats.  I will continue based on the assumption that cached content
>>> is unavailable and so I will filter it out.  If we should try to
>>> obtain/include the cached content let me know.
>>>
>>> Joel
>>
>>



