[IIAB] [KIWIX][INTERNET-IN-A-BOX] Nice project!

Emmanuel Engelhart kelson at kiwix.org
Tue Mar 26 15:58:59 PDT 2013


On 03/26/2013 11:11 PM, Braddock wrote:
> I guess I should clarify my metadata question. I've looked again at the
> openzim docs, so I guess ZIM itself is really just a sophisticated
> container which knows nothing of the file internals except maybe a page
> title (please correct me if I'm wrong). Confusingly zimlib seems to have
> some form of search functionality but I get the impression that kiwix
> does not use it. (?)

Yes, ZIM is a container and is itself pretty content-agnostic. In a ZIM
file each file/article is identified by a URL and a title. URLs have to
be unique... but, if I remember correctly, titles don't. The zimlib uses
them to find content (each ZIM file has two sorted lists with pointers
to the blobs). Kiwix also uses the title list to propose suggestions in
the search box. My script buildZIMFileFromDirectory.pl (in the VM)
computes both lists during ZIM file creation, so you don't have to care
much about that.
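
To make the "two sorted lists" idea concrete, here is a rough sketch in
Python. It is not the zimlib code or the real ZIM on-disk layout; the
names (blobs, entries, find_by_url, suggest) and the sample articles are
made up for illustration. It only shows why keeping a sorted URL list
and a sorted title list makes both exact lookup and search-box
suggestions cheap (binary search):

```python
import bisect

# Hypothetical in-memory stand-in for the content blobs of a ZIM file.
blobs = [
    b"<html>Paris</html>",
    b"<html>Parma</html>",
    b"<html>Berlin</html>",
    b"<html>Paris, Texas</html>",
]

# (url, title, blob index). URLs are unique, titles may repeat.
entries = [
    ("A/Paris.html", "Paris", 0),
    ("A/Parma.html", "Parma", 1),
    ("A/Berlin.html", "Berlin", 2),
    ("A/Paris_Texas.html", "Paris", 3),
]

# The two sorted pointer lists: one keyed by URL, one keyed by title.
url_list = sorted((url, idx) for url, _title, idx in entries)
title_list = sorted((title, idx) for _url, title, idx in entries)
url_keys = [url for url, _ in url_list]
title_keys = [title for title, _ in title_list]

# URLs must be unique, otherwise lookup by URL would be ambiguous.
assert len(set(url_keys)) == len(url_keys)


def find_by_url(url):
    """Binary-search the sorted URL list and return the blob, or None."""
    i = bisect.bisect_left(url_keys, url)
    if i < len(url_keys) and url_keys[i] == url:
        return blobs[url_list[i][1]]
    return None


def suggest(prefix, limit=10):
    """Return up to `limit` titles starting with `prefix`, in sorted order."""
    i = bisect.bisect_left(title_keys, prefix)
    out = []
    while i < len(title_keys) and len(out) < limit:
        if not title_keys[i].startswith(prefix):
            break
        out.append(title_keys[i])
        i += 1
    return out
```

Calling suggest("Par") walks forward from the first title that sorts at
or after "Par" and stops at the first title without that prefix, which
is roughly what a suggestion box needs.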

> What I don't understand is how kiwix builds the search index for a ZIM.
> Is it looking for mimetype HTML entities within the ZIM and using
> cLucene or something to index it? Feel free to point me to relevant docs.

Now, about the fulltext search engine: a ZIM file itself has nothing
that allows that. The zimlib developer has started to code some search
features that store an index in a ZIM file... but we don't use this
feature and you can simply ignore that part.

Kiwix uses Xapian with "external" fulltext search indexes, i.e. indexes
kept outside the ZIM file. So Kiwix has its own ZIM indexer, which goes
through all the HTML pages and parses the HTML. The CLucene version of
this parser is not finished.
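
As a rough illustration of what such an indexer does, here is a small
Python sketch. It is not the Kiwix indexer and it does not use Xapian: a
plain dict (word -> set of URLs) stands in for the external index, and
the `pages` mapping, the sample content and all function names are
assumptions made up for the example. The idea is only: walk the entries,
keep the text/html ones, strip the markup, and index the words.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_words(html):
    """Parse the HTML and return its lowercased words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks).lower())


def build_index(pages):
    """Build an inverted index (word -> set of URLs) from HTML entries only."""
    index = defaultdict(set)
    for url, (mimetype, body) in pages.items():
        if mimetype != "text/html":
            continue  # images, CSS, etc. are not indexed
        for word in html_to_words(body):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical stand-in for the entries of a ZIM file: url -> (mimetype, body).
pages = {
    "A/Moby_Dick.html": (
        "text/html",
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Moby Dick</h1><p>Call me Ishmael.</p></body></html>",
    ),
    "A/Walden.html": (
        "text/html",
        "<html><body><h1>Walden</h1><p>Call it a pond.</p></body></html>",
    ),
    "I/cover.png": ("image/png", ""),
}

index = build_index(pages)
```

A real engine like Xapian adds stemming, ranking and an on-disk format
on top of this, but the index still lives next to the ZIM file rather
than inside it.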

> Gutenberg in epub format only is looking to be about 20 GB. I have a
> feeling that can be greatly reduced by reasonable image scaling.

OK, this is huge, but I see no problem with making a 20 GB ZIM file.
This is something we can work with. ZIM is also able to compress a lot
more than EPUB. Maybe it would be good to first make a ZIM file with
only the books in English.

In any case (and sorry if I repeat myself), if you have a workable set
of HTML pages that is perfectly usable offline, then it won't be a
problem to create the corresponding ZIM file.

Emmanuel


