[IIAB] [KIWIX][INTERNET-IN-A-BOX] Nice project!
Emmanuel Engelhart
kelson at kiwix.org
Tue Mar 26 15:58:59 PDT 2013
On 03/26/2013 11:11 PM, Braddock wrote:
> I guess I should clarify my metadata question. I've looked again at the
> openzim docs, so I guess ZIM itself is really just a sophisticated
> container which knows nothing of the file internals except maybe a page
> title (please correct me if I'm wrong). Confusingly zimlib seems to have
> some form of search functionality but I get the impression that kiwix
> does not use it. (?)
Yes, ZIM is a container and is itself pretty content agnostic. In a ZIM
file each file/article is identified by a URL and a title. URLs have to
be unique... but if I remember correctly, titles don't. The zimlib uses
them to find content (each ZIM file has two sorted lists with pointers
to the blobs). Kiwix also uses the title list to propose suggestions in
the search box. My script buildZIMFileFromDirectory.pl (in the VM)
computes both lists during ZIM file creation, so you don't have to care
much about that.
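To make the "two sorted lists" idea concrete, here is a minimal Python
sketch. It is not zimlib code and does not read the real ZIM binary
format; the ENTRIES table and the names find_by_url and suggest are
hypothetical, chosen only for illustration. It shows how a list sorted
by URL allows exact lookup by binary search, and how a list sorted by
title allows prefix suggestions even when titles are not unique.

```python
from bisect import bisect_left

# Hypothetical stand-in for the entries of a ZIM file:
# (url, title, blob_number). Not the real on-disk layout.
ENTRIES = [
    ("A/Paris.html", "Paris", 0),
    ("A/Paris_TX.html", "Paris", 1),  # same title, different URL
    ("A/Parma.html", "Parma", 2),
    ("A/Berlin.html", "Berlin", 3),
]

# Two sorted lists of pointers (here: positions in ENTRIES).
BY_URL = sorted(range(len(ENTRIES)), key=lambda i: ENTRIES[i][0])
BY_TITLE = sorted(range(len(ENTRIES)), key=lambda i: ENTRIES[i][1])


def find_by_url(url):
    """Exact lookup: binary search in the URL-sorted list."""
    urls = [ENTRIES[i][0] for i in BY_URL]
    pos = bisect_left(urls, url)
    if pos < len(urls) and urls[pos] == url:
        return ENTRIES[BY_URL[pos]]
    return None


def suggest(prefix, limit=10):
    """Prefix lookup: binary search to the first candidate title,
    then walk forward while titles still start with the prefix."""
    titles = [ENTRIES[i][1] for i in BY_TITLE]
    pos = bisect_left(titles, prefix)
    matches = []
    while (pos < len(titles) and titles[pos].startswith(prefix)
           and len(matches) < limit):
        matches.append(ENTRIES[BY_TITLE[pos]])
        pos += 1
    return matches
```

For example, suggest("Par") walks the title list from the first title
that sorts at or after "Par" and returns both "Paris" entries and
"Parma", while find_by_url("A/Parma.html") returns exactly one entry
because URLs are unique.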
> What I don't understand is how kiwix builds the search index for a ZIM.
> Is it looking for mimetype HTML entities within the ZIM and using
> cLucene or something to index it? Feel free to point me to relevant docs.
Now, about the fulltext search engine: a ZIM file itself has nothing
that allows this. The zimlib developer has started to code some search
features that store an index in a ZIM file... but we don't use this
feature and you can simply ignore that part.
Kiwix uses Xapian with "external" fulltext search indexes, i.e. indexes
stored outside the ZIM file. So Kiwix has its own ZIM indexer, which
goes through all the HTML pages and parses the HTML. The CLucene version
of this parser is not finished.
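As a rough illustration of what "going through all HTML pages and
parsing the HTML" means, here is a small Python sketch. It is not the
Kiwix indexer and it does not use Xapian; it swaps in a plain in-memory
inverted index built with the standard library, and the names
TextExtractor, html_to_text, build_index and search are hypothetical.
It assumes the pages are already available as a dict of URL to HTML
string.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip the markup from one page and return its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(pages):
    """pages: dict mapping URL -> HTML. Returns word -> set of URLs.
    The index lives outside the pages, like an "external" index."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in re.findall(r"\w+", html_to_text(html).lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages containing every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A real engine such as Xapian adds stemming, ranking and an on-disk
format on top of this, but the overall flow (extract text per page, map
terms to page URLs, keep the result next to the ZIM file rather than
inside it) is the part described above.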
> Gutenberg in epub format only is looking to be about 20 GB. I have a
> feeling that can be greatly reduced by reasonable image scaling.
OK, this is huge, but I see no problem making a ZIM file of 20 GB. This
is something we can work with. ZIM is also able to compress a lot more
than EPUB. Maybe it would be good to first make a ZIM file with only
the books in English.
In any case (and sorry if I repeat myself), if you have a workable set
of HTML pages perfectly usable offline, then it won't be a problem to
create the corresponding ZIM file.
Emmanuel