[IIAB] [KIWIX][INTERNET-IN-A-BOX] Nice project!

Tue Mar 26 15:11:32 PDT 2013

Hi Emmanuel,

I guess I should clarify my metadata question.  I've looked again at the 
openzim docs, so I guess ZIM itself is really just a sophisticated 
container which knows nothing of the file internals except maybe a page 
title (please correct me if I'm wrong). Confusingly, zimlib seems to have 
some form of search functionality, but I get the impression that Kiwix 
does not use it (?).

What I don't understand is how Kiwix builds the search index for a ZIM.  
Is it looking for entries with an HTML mimetype within the ZIM and using 
CLucene or something to index them?  Feel free to point me to relevant docs.

Gutenberg in EPUB format alone looks to be about 20 GB.  I have a 
feeling that can be greatly reduced by reasonable image scaling.

thanks,
braddock

On 03/26/2013 02:50 PM, Emmanuel Engelhart wrote:
> Hi Braddock
>
> On 03/26/2013 09:56 PM, Braddock wrote:
>> I'm preparing to do a dump of gutenberg to epub and converted html files
>> that you could use to build a ZIM. (may or may not get it done this week
>> however)
>
> This is really not urgent on my side! I don't want to put you under 
> pressure :)
>
>> How should I handle metadata for you to create a zim? Title, author,
>> illustrator, year, language, etc? Is it possible to search on those 
>> fields?
>
> Not sure exactly how to understand this question. Maybe this is 
> because we have a misunderstanding. My wish is to have one ZIM file with 
> all the books of Gutenberg. So, I don't see a big problem with the ZIM 
> file metadata. My proposal would be: publisher=yourprojectname, 
> creator=Project Gutenberg. Language is a big challenge because you 
> have books in many languages. Maybe it's better to make one ZIM file per 
> language? With the Kiwix fulltext search engine, it should be pretty 
> easy to find stuff. Good index pages should also help... 
> cf. my next comment.
>
>> By default the gutenberg files do not have interesting file names
>> (pg30532-image.epub for example). The meta data is stored in the obtuse
>> catalog.rdf XML file, indexed by the id number of the text. We have all
>> this parsed and broken out into a SQLite database for our own use.
>
> This is great. The most important thing is to have a good HTML title tag 
> and, if possible, a meta tag with good keywords. Do you plan to create 
> custom HTML indexes with these data (like lists of books per author, 
> language, title)?
>
> Kind regards
> Emmanuel
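
For reference, the title/meta tag and per-author index ideas discussed above could be sketched roughly as follows. This is only an illustration under assumptions: the `books` table, its columns (`id`, `title`, `author`, `language`), the `pg<id>.html` page naming, and the sample rows are all hypothetical and are not the actual SQLite schema or file layout mentioned in the thread.

```python
import html
import sqlite3
from collections import defaultdict


def render_head(title, author, language):
    """Return a <head> fragment with a descriptive title and a keywords meta tag."""
    keywords = ", ".join(k for k in (author, title, language) if k)
    return (
        "<head>\n"
        f"  <title>{html.escape(title)} - {html.escape(author)}</title>\n"
        f'  <meta name="keywords" content="{html.escape(keywords)}">\n'
        "</head>"
    )


def author_index(conn):
    """Return an HTML nested list of books grouped by author."""
    by_author = defaultdict(list)
    query = "SELECT id, title, author FROM books ORDER BY author, title"
    for book_id, title, author in conn.execute(query):
        by_author[author].append((book_id, title))
    parts = ["<ul>"]
    for author, books in by_author.items():
        parts.append(f"  <li>{html.escape(author)}<ul>")
        for book_id, title in books:
            # Hypothetical naming: one HTML page per catalog id.
            parts.append(
                f'    <li><a href="pg{book_id}.html">{html.escape(title)}</a></li>'
            )
        parts.append("  </ul></li>")
    parts.append("</ul>")
    return "\n".join(parts)


# Tiny in-memory demo with placeholder rows (not real catalog data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT, language TEXT)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        (1, "Example Title A", "Author, One", "en"),
        (2, "Example Title B", "Author, One", "en"),
    ],
)
head = render_head("Example Title A", "Author, One", "en")
index_html = author_index(conn)
```

The same grouping query could be repeated with `language` or `title` as the key to produce the other index pages; a separate ZIM per language would simply filter the rows first.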