Mirroring Wikipedia

So I had an internet outage, and it got me thinking: if I were trapped on my proverbial desert island, what would I want with me?

Well, Wikipedia would be nice!

So I started with this ExtremeTech article by Sebastian Anthony, although it has since drifted out of date on a few things.

But it is enough to get you started.

I downloaded my XML dump from the Brazilian mirror, like he mentions.  The files I got were:

  • enwiki-20140304-pages-articles.xml.bz2 10 GB
  • enwiki-20140304-all-titles-in-ns0.gz 58 MB
  • enwiki-20140304-interwiki.sql.gz 728 KB
  • enwiki-20140304-redirect.sql.gz 91 MB
  • enwiki-20140304-protected_titles.sql.gz 887 KB

The pages-articles.xml dump is the only required file; I added in the others in the hope of fixing some formatting issues.  Re-compressing the 10 GB bzip2 file with 7-Zip brought it down to 8.4 GB.  It's still massive, but when you are on a 'slow' connection every saved GB matters.
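If you want to do the same re-compression, something along these lines should work with p7zip-full's 7z; this is a sketch of the idea rather than a transcript of my exact commands:

# Stream the bzip2 dump straight into a 7z (LZMA) archive, no temp file needed.
bzcat enwiki-20140304-pages-articles.xml.bz2 | \
  7z a -sienwiki-20140304-pages-articles.xml enwiki-20140304-pages-articles.xml.7z

# To feed the 7z copy to the importer later, swap bzcat for:
# 7z x -so enwiki-20140304-pages-articles.xml.7z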

Since I already have Apache/PHP/MySQL running on my Debian box, I can't help you with a virgin install.  I would say it's pretty much like every other LAMP install.
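That said, on a fresh Debian 7 box I'd expect the usual suspects to cover it (untested on my end, since mine was already running):

# Apache, PHP, MySQL, and the PHP MySQL bindings MediaWiki needs.
apt-get install apache2 php5 php5-mysql mysql-server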

Although I did *NOT* install phpMyAdmin.  I've seen too many holes in it, and I prefer the command line anyway.

First I connect to my database instance:

mysql -uroot -pMYBADPASSWORD

And then execute the following:

create database wikimirror;
create user 'wikimirror'@'localhost' IDENTIFIED BY 'MYOTHERPASSWORD';
GRANT ALL PRIVILEGES ON wikimirror.* TO 'wikimirror'@'localhost' WITH GRANT OPTION;
show grants for 'wikimirror'@'localhost';

This creates the database, adds the user and grants them permission.

Downloading and setting up MediaWiki 1.22.5 is pretty straightforward, but there is one big caveat I found: InnoDB is incredibly slow for loading the database.  I spent a good 30 minutes trying to find a workable tuning before going back to MyISAM with utf8 support.
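If you've already let the installer create the tables as InnoDB, one way to flip them over is a quick loop like this (a sketch of the idea rather than exactly what I ran; adjust the user and password to match yours):

# Convert every table in the wikimirror database from InnoDB to MyISAM.
mysql -uwikimirror -pMYOTHERPASSWORD -N -e "SHOW TABLES" wikimirror | \
while read t; do
  mysql -uwikimirror -pMYOTHERPASSWORD -e "ALTER TABLE \`$t\` ENGINE=MyISAM" wikimirror
done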

With the empty site created, I do a quick backup in case I want to purge what I have.

/usr/bin/mysqldump -uwikimirror -pw1k1p3d1a wikimirror > /usr/local/wikipedia/wikimedia-1.22.5-empty.sql

This way I can quickly revert, as constantly re-installing MediaWiki is… a pain.  The process gets repetitive, which is a great way to introduce errors, so it's far easier to drop the database and user, re-create them, and reload the empty dump.
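Concretely, the reset looks roughly like this, assuming the empty dump from above (passwords redacted as before):

# Drop and re-create the empty database, then reload the bare MediaWiki schema.
mysql -uroot -pMYBADPASSWORD -e "DROP DATABASE wikimirror; CREATE DATABASE wikimirror;"
mysql -uwikimirror -pMYOTHERPASSWORD wikimirror < /usr/local/wikipedia/wikimedia-1.22.5-empty.sql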

When I was using InnoDB, I was getting a mere 163 inserts a second.  As of this latest dump there are 14,313,024 records that need to be inserted, so at that rate the import would take about 24 hours, which simply is not good enough for someone as impatient as me.

So let's make some changes to the MySQL server config.  Naturally, back up your existing /etc/mysql/my.cnf to something else first; then I added the following bits:

key_buffer = 1024M
max_allowed_packet = 384M
query_cache_limit = 18M
query_cache_size = 128M

I should add that I have a lot of system RAM available, and that my box is running Debian 7.1 x86_64.
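After editing the config, it's worth restarting MySQL and sanity-checking that the new values actually took; the service name here is Debian's, so adjust for your distro:

# Restart MySQL so the bigger buffers take effect, then confirm a couple of them.
service mysql restart
mysql -uroot -pMYBADPASSWORD -e "SHOW VARIABLES LIKE 'key_buffer_size'; SHOW VARIABLES LIKE 'max_allowed_packet';"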

Next you'll want a slightly modified import program.  I used the one from Michael Tsikerdekis's site, but modified it to run the 'precommit' portion on its own, because I didn't want to decompress the massive XML file onto the filesystem.  I may have the space, but it just seems silly.

With the script ready, we can import!  Remember to restart the MySQL server and make sure it's running correctly.  Then you can run:

bzcat enwiki-20140304-pages-articles.xml.bz2 | perl ./mwimport2 | mysql -f -u wikimirror -pMYOTHERBADPASSWORD --default-character-set=utf8 wikimirror

And then you'll see the progress flying by.  While it is loading you should be able to hit a random page and get back some Wikipedia-looking data.  If you get an error, well, obviously something is wrong…
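If you'd rather watch numbers than refresh random pages, a quick count against MediaWiki's page table shows how far along the load is (just a convenience check, not part of the import script):

# How many pages have been inserted so far?
mysql -uwikimirror -pMYOTHERPASSWORD -e "SELECT COUNT(*) FROM page;" wikimirror

With MyISAM the row count comes back instantly, so it's safe to run as often as you like.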

With my slight modifications I was getting about 1,000 inserts a second, which gave me…

 14313024 pages (1041.174/s),  14313024 revisions (1041.174/s) in 13747 seconds

That works out to just under four hours.  Not too bad!

With the load all done, I shut down MySQL and copied the original config back.  For the fun of it I did add in the following for day-to-day usage:

key_buffer = 512M
max_allowed_packet = 128M
query_cache_limit = 18M
query_cache_size = 128M
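In practice the swap back is just a stop, copy, and start; my.cnf.orig here is whatever you named your backup earlier:

# Put the original config (plus the tweaks above) back in place and restart.
service mysql stop
cp /etc/mysql/my.cnf.orig /etc/mysql/my.cnf
service mysql start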

I should add that the 'default' small config was enough for me to withstand over 16,000 hits a day when I got listed on reddit, so it's not bad for small-ish databases that see a lot of action (my WordPress is about 250 MB), but Wikipedia is about 41 GB.

Now for the weird stuff.  There are numerous weird errors that'll appear on the pages.  I've tracked the majority down to Lua scripting now being used on Wikipedia's template pages, so you need to enable Lua on your server and set up the Lua extensions.

The two that just had to be enabled to get things looking half right are:

  • Lua
  • Scribunto
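Getting them in place went something like this; the tarball names are placeholders for whatever the MediaWiki extension distributor hands you for the 1.22 branch, and the web root is wherever you unpacked MediaWiki:

# Install the standalone Lua interpreter the extensions call out to.
apt-get install lua5.1

# Unpack the two extension tarballs (placeholder file names) into MediaWiki's
# extensions directory (placeholder path).
tar -xzf Lua-REL1_22.tar.gz -C /var/www/wikimirror/extensions/
tar -xzf Scribunto-REL1_22.tar.gz -C /var/www/wikimirror/extensions/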

With this done right, you'll see Lua as part of the installed software on the version page:

[screenshot: mediawiki installed software]

And under installed extensions:

[screenshot: wikimedia installed extensions]

I did need to put the following in LocalSettings.php, but it's covered in the installation instructions for the extensions:

$wgLuaExternalInterpreter = "/usr/bin/lua5.1";
require_once("$IP/extensions/Lua/Lua.php");
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
require_once( "$IP/extensions/Scribunto/Scribunto.php" );

Now when I load a page it still has some missing bits, but it’s looking much better.

[screenshot: The Amiga page…]

Now I know the XOWA people have a torrent set up with about 75 GB worth of images.  I just have to figure out how to get those and fold them into my Wikipedia mirror.

I hope this will prove useful for someone in the future.  But if it looks too daunting, just use XOWA.  Another solution is WP-MIRROR, although it can apparently take several days to load.
