[IIAB] [KIWIX][INTERNET-IN-A-BOX] Nice project!

Thu Mar 28 02:48:17 PDT 2013

Hi Emmanuel,

Thanks for your interest in the project.

> In any case (and sorry if I repeat myself), if you have a workable set of
> HTML pages perfectly usable offline, then it won't be a problem to create
> the corresponding ZIM file.

As it stands now, the Gutenberg portion is functional, though still
being refined, and can be served locally. However, the web pages are
composed dynamically from templates with Flask. All of the metadata
about the books is stored in an SQLite database, which Flask uses to
populate the templates. I'm not sure, but it sounds like you are
looking for a collection of rendered HTML pages. That could be
obtained by crawling the site. Before doing that, we might want to
finalize the presentation.
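
To make the crawling idea concrete, here is a rough sketch using only the
Python standard library. It is not code from the project: the names
LinkParser, fetch_http and crawl are made up for illustration, and it
assumes every linked page on the site is HTML. It only collects the
rendered pages in memory; writing them to disk, rewriting links so they
work offline, and fetching images and CSS would all still be needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Fetch a page over HTTP and return its body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_http, max_pages=1000):
    """Breadth-first crawl of same-host links; returns {url: html}.

    `fetch` is any callable mapping a URL to HTML text, so the crawl
    logic can be tried out without a running server.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

With the site running under Flask's development server on its default
port, something like crawl("http://localhost:5000/") would return a dict
mapping each reachable URL to its rendered HTML.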

If there is something I can help with, let me know.

Joel

On Tue, Mar 26, 2013 at 3:58 PM, Emmanuel Engelhart <kelson at kiwix.org> wrote:
> On 03/26/2013 11:11 PM, Braddock wrote:
>>
>> I guess I should clarify my metadata question. I've looked again at the
>> openzim docs, so I guess ZIM itself is really just a sophisticated
>> container which knows nothing of the file internals except maybe a page
>> title (please correct me if I'm wrong). Confusingly zimlib seems to have
>> some form of search functionality but I get the impression that kiwix
>> does not use it. (?)
>
>
> Yes, ZIM is a container and is itself pretty content-agnostic. In a ZIM file
> each file/article is identified by a URL and a title. URLs have to be
> unique... but if I remember correctly, titles don't. The zimlib uses them to
> look up content (each ZIM file has two sorted lists with pointers to the
> blobs). Kiwix also uses the title list to propose suggestions in the search
> box. My script buildZIMFileFromDirectory.pl (in the VM) computes both lists
> when the ZIM file is created, so you don't have to care much about that.
>
>
>> What I don't understand is how kiwix builds the search index for a ZIM.
>> Is it looking for mimetype HTML entities within the ZIM and using
>> cLucene or something to index it? Feel free to point me to relevant docs.
>
>
> Now, about the full-text search engine: a ZIM file itself has nothing that
> allows that. The zimlib developer has started to code some search features
> that store an index in a ZIM file... but we don't use this feature and
> you can simply ignore that part.
>
> Kiwix uses Xapian with "external" full-text search indexes. So, Kiwix has its
> own ZIM indexer that goes through all HTML pages and parses the HTML. The
> CLucene version of this parser is not finished.
>
>
>> Gutenberg in epub format only is looking to be about 20 GB. I have a
>> feeling that can be greatly reduced by reasonable image scaling.
>
>
> OK, this is huge, but I see no problem making a ZIM file of 20 GB. This is
> something we can work with. ZIM is also able to compress a lot more than
> EPUB. Maybe it would be good to first make a ZIM file with only the books
> in English.
>
> In any case (and sorry if I repeat myself), if you have a workable set of
> HTML pages perfectly usable offline, then it won't be a problem to create
> the corresponding ZIM file.
>
> Emmanuel
>