[IIAB] Wikipedia Full Text Search Rankings

Braddock braddock at braddock.com
Thu Sep 12 17:42:51 PDT 2013


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

James,

On orlop, in /knowledge/modules/wikipedia-links/, I've created a
tab-delimited file for each ZIM file.  Each line holds four fields:
the article index, the number of links TO that article, the number of
links FROM that article, and the article URL.

These data were produced by scripts/zim_link_analysis.py, which
extracts the href and img links from the HTML of every article.
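
The script itself is not reproduced in this message.  As a rough
illustration only, link counting of this kind could be sketched as
below.  The names LinkCollector and link_degrees are hypothetical, the
input is assumed to be a dict of article URL to HTML, and counting each
target once per article is an assumption, not necessarily what
zim_link_analysis.py does.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags and src targets of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.links.append(attrs["src"])


def link_degrees(articles):
    """Return (in_degree, out_degree) Counters for a dict of URL -> HTML.

    Assumption: a target linked several times from one article is
    counted once for that article.
    """
    in_degree, out_degree = Counter(), Counter()
    for url, html in articles.items():
        parser = LinkCollector()
        parser.feed(html)
        targets = set(parser.links)
        out_degree[url] = len(targets)
        for target in targets:
            in_degree[target] += 1
    return in_degree, out_degree
```

A real run would also need to resolve relative URLs and redirects
before counting, which this sketch skips.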

In the past I have found that in-degree (the number of links to an
article) is a very good score metric for Wikipedia searches.  I hope
you can incorporate this dataset into the Whoosh search score so we
can get full-text search on Wikipedia working without the results
being completely washed out by noise.
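
One possible way to fold in-degree into a ranking, shown here in plain
Python rather than against the Whoosh API, is to multiply the text
score by a log-damped link count.  The formula, the weight parameter,
and the names boosted_score and rerank are all assumptions for
illustration, not a tested recipe.

```python
import math


def boosted_score(text_score, in_degree, weight=1.0):
    """Scale a text-match score by a log-damped in-degree.

    log1p keeps heavily linked articles from swamping everything else;
    an article with no inbound links keeps its original score.
    """
    return text_score * (1.0 + weight * math.log1p(in_degree))


def rerank(hits, in_degrees, weight=1.0):
    """Re-sort (url, text_score) pairs by boosted score, best first.

    in_degrees maps article URL -> number of inbound links; URLs that
    are missing from it are treated as having in-degree 0.
    """
    return sorted(
        hits,
        key=lambda hit: boosted_score(hit[1], in_degrees.get(hit[0], 0), weight),
        reverse=True,
    )
```

The weight would need tuning against real queries; too high and
popular articles dominate regardless of how well the text matches.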

For example, to see the most linked-to articles in the full English Wikipedia:
sort -k2 -r -n wikipedia_en_all_nopic_01_2012.links
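
The same lookup can be sketched in Python for use from the search
code.  This assumes the column order described above (index, links to,
links from, URL), no header line, and UTF-8 text; top_linked is a
hypothetical helper name.

```python
import csv
import heapq


def top_linked(path, n=10):
    """Return the n most linked-to articles as (in_degree, url) pairs."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Column 1 is the in-degree and column 3 is the article URL;
        # lines with fewer than four fields are skipped.
        parsed = ((int(row[1]), row[3]) for row in rows if len(row) >= 4)
        return heapq.nlargest(n, parsed)
```

heapq.nlargest avoids sorting the whole file when only the top few
entries are wanted.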

- -braddock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJSMl+LAAoJEHWLR/DQzlZuAFsIAKYt6CAQMEABuKPMz+t9wJSm
0jHmtjKrbJIlqp9R2B+imAdaw6+X/YKr331uknZObTTQChOA9UWtq3JNxFnpAoin
Yw3/Jzlq3U3X0sdFYxlVs6627XHOlVNcsNJ7KP8Ek5ckURm82SKK/rG24nJnaV4h
2ckoXdgjJ7KD3eyGszrSmaLG9Cxmekie8e2bve3og5ilguBhwAGNE2vgF9eNO+nM
Ziau4DXBq9h66ywu2jawwxj+IBvTJR2RvzWvWsYt5E6jwcnZdRRpHKwofT3goCdY
WWU6pdl7mRnX/v1Njz1SSW11T4V68hzDsRX6amR7CjFUAywAjujoJUQ2cCfn1Lw=
=3hkI
-----END PGP SIGNATURE-----


