[IIAB] gutenberg cached files
Braddock
braddock at braddock.com
Fri Mar 1 09:29:59 PST 2013
Ah, sorry for the confusion.
Project Gutenberg is just a mess.
-braddock
On 03/01/2013 09:26 AM, Joel Steres wrote:
> Hi Braddock,
>
> Just a minor point of clarification: the symlink I mentioned is broken
> on the source mirror itself. I checked your script and the ibiblio
> site before I wrote. The script excludes a number of extensions, but
> not the ebook formats.
>
> I will go ahead and remove the audio and video format entries from the
> database. I will also adjust the rsync excludes to omit some a/v
> extensions that slipped through.
>
> -j
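
A minimal sketch of the filtering Joel describes, assuming the database entries can be treated as a list of file paths. The extension list, the sample paths, and the helper names are hypothetical; the thread does not say which a/v extensions slipped through.

```python
from pathlib import PurePosixPath

# Hypothetical set of audio/video extensions; not taken from the thread.
AV_EXTENSIONS = {".mp3", ".ogg", ".m4a", ".m4b", ".wav", ".mp4", ".avi", ".mpg"}


def is_av(path: str) -> bool:
    """Return True if the path ends in one of the assumed a/v extensions."""
    return PurePosixPath(path).suffix.lower() in AV_EXTENSIONS


def keep_non_av(paths):
    """Drop a/v entries and keep everything else (text, HTML, ...)."""
    return [p for p in paths if not is_av(p)]


def rsync_exclude_args(extensions=AV_EXTENSIONS):
    """Build one rsync --exclude argument per extension.

    rsync patterns are case-sensitive, so upper-case variants would need
    their own patterns.
    """
    return [f"--exclude=*{ext}" for ext in sorted(extensions)]


if __name__ == "__main__":
    # Made-up sample paths, for illustration only.
    sample = ["1/2/12/12.txt", "1/2/12/12-h/12-h.htm", "1/2/12/12.mp3"]
    print(keep_non_av(sample))
    print(rsync_exclude_args())
```

Keeping the extension list in one place lets the same set drive both the database cleanup and the rsync excludes, so the two cannot drift apart.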
>
>
> On Fri, Mar 1, 2013 at 9:07 AM, Braddock <braddock at braddock.com> wrote:
>> Hi Joel,
>>
>> Those files do not exist because I had rsync exclude them.
>>
>> Gutenberg has an ENORMOUS quantity of duplicated content in various forms
>> (including MP3s of speech synthesizers reading books!). It is as if they
>> were intentionally trying to bloat the archive. It is a mess. My assumption
>> was that we would stick to text or HTML formats.
>>
>> -braddock
>>
>>
>> On 03/01/2013 09:02 AM, Joel Steres wrote:
>>> Greetings,
>>>
>>> The Gutenberg project references a lot of "generated" files which are
>>> not in the ibiblio mirror we use. The mirror just contains a broken
>>> symlink at cache/generated. Much of it appears to be various ebook
>>> formats. I will proceed on the assumption that the cached content is
>>> unavailable, and so I will filter it out. If we should try to
>>> obtain/include the cached content, let me know.
>>>
>>> Joel
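
A broken symlink like the one at cache/generated can be confirmed on a local copy of the mirror with a short check. This is only a sketch: the function name is hypothetical, and the local mirror path in the usage line is an assumption.

```python
import os


def is_broken_symlink(path: str) -> bool:
    """Return True if path is a symlink whose target does not exist.

    os.path.islink() inspects the link itself, while os.path.exists()
    follows it, so a dangling link is a link that "does not exist".
    """
    return os.path.islink(path) and not os.path.exists(path)


if __name__ == "__main__":
    # Assumed local mirror location; adjust to wherever the rsync copy lives.
    print(is_broken_symlink("gutenberg/cache/generated"))
```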
>>
>>