Mirroring Wikipedia

So I had an internet outage, and was thinking if I was trapped on my proverbial desert island what would I want with me?

Well wikipedia would be nice!

So I started with this ExtremeTech article by Sebastian Anthony, although it has since drifted out of date on a few things.

But it is enough to get you started.

I downloaded my XML dump from Brazil like he mentions.  The files I got were:

  • enwiki-20140304-pages-articles.xml.bz2 10GB
  • enwiki-20140304-all-titles-in-ns0.gz 58MB
  • enwiki-20140304-interwiki.sql.gz 728KB
  • enwiki-20140304-redirect.sql.gz 91MB
  • enwiki-20140304-protected_titles.sql.gz 887KB
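The files follow the standard dump naming scheme, so fetching them is just a matter of pointing wget at whichever mirror is closest to you; the URL below uses the main dump server purely as an example:

wget http://dumps.wikimedia.org/enwiki/20140304/enwiki-20140304-pages-articles.xml.bz2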

The pages-articles.xml is required.  I added in the others in the hopes of fixing some formatting issues.  I re-compressed the pages-articles dump from 10GB (bzip2) down to 8.4GB with 7zip.  It’s still massive, but when you are on a ‘slow’ connection every saved GB matters.
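If you want to do the same re-compression trick, something along these lines works (this assumes the p7zip-full package for the 7z command; the exact options are a matter of taste):

bzcat enwiki-20140304-pages-articles.xml.bz2 | 7z a -si enwiki-20140304-pages-articles.xml.7z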

Since I already have apache/php/mysql running on my Debian box, I can’t help you with a virgin install.  I would say it’s pretty much like every other LAMP install.

Although I did *NOT* install phpmyadmin.  I’ve seen too many holes in it, and I prefer the command line anyways.
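For what it’s worth, on a stock Debian box the whole stack is just the usual packages (names as of Debian 7, so treat this as a sketch):

apt-get install apache2 php5 php5-mysql mysql-server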

First I connect to my database instance:

mysql -uroot -pMYBADPASSWORD

And then execute the following:

create database wikimirror;
create user 'wikimirror'@'localhost' IDENTIFIED BY 'MYOTHERPASSWORD';
GRANT ALL PRIVILEGES ON wikimirror.* TO 'wikimirror'@'localhost' WITH GRANT OPTION;
show grants for 'wikimirror'@'localhost';

This creates the database, adds the user and grants them permission.

Downloading and setting up mediawiki 1.22.5 is pretty straightforward.  There is one big caveat I found though.  InnoDB is incredibly slow for loading the database. I spent a good 30 minutes trying to find a good solution before going back to MyISAM with utf8 support.
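To make sure new tables default to MyISAM with utf8, a couple of extra lines in the [mysqld] section of my.cnf before running the installer should do it.  These are stock MySQL options rather than anything mediawiki-specific, so treat them as a sketch:

default-storage-engine = MyISAM
character-set-server = utf8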

With the empty site created, I do a quick backup in case I want to purge what I have.

/usr/bin/mysqldump -uwikimirror -pw1k1p3d1a wikimirror > /usr/local/wikipedia/wikimedia-1.22.5-empty.sql

This way I can quickly revert, as constantly re-installing mediawiki is… a pain.  Repetition is a great way to introduce errors, so it’s far easier to dump the database/user and re-create them, and reload the empty database.
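The whole reset cycle then boils down to roughly a couple of commands (passwords and paths as used earlier):

mysql -uroot -pMYBADPASSWORD -e "drop database wikimirror; create database wikimirror;"
mysql -uwikimirror -pMYOTHERPASSWORD wikimirror < /usr/local/wikipedia/wikimedia-1.22.5-empty.sql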

When I was using InnoDB, I was getting a mere 163 inserts a second.  As of this latest dump there are 14,313,024 records that need to be inserted, so at that rate the import would take about 24 hours, which simply is not good enough for someone as impatient as me.

So let’s make some changes to the MySQL server config.  Naturally, back up your existing /etc/mysql/my.cnf to something else first; then I added the following bits:

key_buffer = 1024M
max_allowed_packet = 384M
query_cache_limit = 18M
query_cache_size = 128M

I should add that I have a lot of system RAM available, and that my box is running Debian 7.1 x86_64.
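For reference, restarting and sanity-checking the server on Debian 7 is just:

service mysql restart
mysqladmin -uroot -pMYBADPASSWORD status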

Next you’ll want a slightly modified import program.  I used the one from Michael Tsikerdekis’s site, but I did modify it to run the ‘precommit’ portion on its own.  I did this because I didn’t want to decompress the massive XML file onto the filesystem; I may have the space, but it just seems silly.

With the script ready we can import!  Remember to restart the mysql server, and make sure it’s running correctly.  Then you can run:

bzcat enwiki-20140304-pages-articles.xml.bz2 | perl ./mwimport2 | mysql -f -u wikimirror -pMYOTHERBADPASSWORD --default-character-set=utf8 wikimirror

And then you’ll see the progress flying by.  While it is loading you should be able to hit a random page and get back some wikipedia-looking data.  If you get an error, well, obviously something is wrong…
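If you want a rough progress number while it runs, polling the page table works well enough.  This is just a quick-and-dirty loop of my own, not anything from the import script itself:

watch -n 60 'mysql -uwikimirror -pMYOTHERBADPASSWORD wikimirror -e "select count(*) from page"'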

With my slight modifications I was getting about 1000 inserts a second, which gave me…

 14313024 pages (1041.174/s),  14313024 revisions (1041.174/s) in 13747 seconds

Which ran in just under four hours.  Not too bad!

With the load all done, I shut down mysql, and then copy back the first config.  For the fun of it I did add in the following for day-to-day usage:

key_buffer = 512M
max_allowed_packet = 128M
query_cache_limit = 18M
query_cache_size = 128M

I should add that the ‘default’ small config was enough for me to withstand over 16,000 hits a day when I got listed on reddit.  So it’s not bad for small-ish databases (my wordpress is about 250MB) that see a lot of action, but wikipedia is about 41GB.

Now for the weird stuff.  There are numerous weird errors that’ll appear on the pages.  I’ve tracked the majority down to lua scripting now being enabled on the template pages of wikipedia.  So you need to enable lua on your server, and set up the lua extensions.

The two that just had to be enabled to get things looking half right are:

  • Lua
  • Scribunto
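Getting the pieces onto the box is quick on Debian 7.  The wiki path and tarball names below are just placeholders for wherever your install lives and whichever REL1_22 bundles you grabbed from the extension pages:

apt-get install lua5.1
cd /var/www/wikimirror/extensions
tar xzf ~/Lua-REL1_22.tar.gz
tar xzf ~/Scribunto-REL1_22.tar.gz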

With this done right, you’ll see Lua listed as part of the installed software on the version page, and the Lua and Scribunto extensions listed under installed extensions.

I did need to put the following in the LocalSettings.php file, but it’s all covered in the installation instructions for the extensions:

$wgLuaExternalInterpreter = "/usr/bin/lua5.1";
require_once("$IP/extensions/Lua/Lua.php");
$wgScribuntoEngineConf['luastandalone']['luaPath'] = '/usr/bin/lua5.1';
require_once( "$IP/extensions/Scribunto/Scribunto.php" );

Now when I load a page it still has some missing bits, but it’s looking much better.

The Amiga page…

Now I know the XOWA people have a torrent setup for about 75GB worth of images.  I just have to figure out how to get those and parse them into my wikipedia mirror.

I hope this will prove useful for someone in the future.  But if it looks too daunting, just use XOWA.  Another solution is WP-MIRROR, although it can apparently take several days to load.

Interesting wiki site…

While looking something up, I came across this site:

http://gunkies.org/wiki/Main_Page

It’s a wiki dedicated to retro-computing, but written more in the first person, not so much in the factual style of wikipedia.  Anyways, I thought it was worth mentioning, and I’ve started to redo some of my work from here over there.  Hopefully this will get more people interested in the whole thing.  I’ve also been thinking about doing a Windows installer for the z80 CP/M module from SIMH.  Hopefully the games & stuff work fine, so it’ll be FUN!

Proxmox VE

Well, frankly I’ve been majorly disappointed with Microsoft’s latest offerings in the world of virtualization.  It’s been one BIG step backwards in terms of management.

I mean, check this well-meaning blog on how “easy” it is to set up remote management.  And of course, for the most part it NEVER works.

I know this must be a major news flash to Microsoft, but virtual servers are like mainframes: the zone 0 OS must be able to stand on its own, with just enough to bootstrap the hypervisor and allow itself to be managed in a standalone fashion.  After all, if it were in a domain, where do you think those domain controllers are?  Yep, they are virtual machines!  And how do you ‘manage’ a domain resource with no DCs?  The whole 2008 Hyper-V is a BIG miscalculation on Microsoft’s part.  I hope they wake up and notice how they had a good thing and have destroyed it.

All this nonsense sent me searching for an alternative, and I’m pretty sure I found one: a great blend of system emulation and something like Sun containers for Linux.  There is even a Debian Etch-based quick-install version called Proxmox, which incorporates KVM (the new Linux hypervisor) and OpenVZ.  And of course it’s FREE!

The cool thing is that the main management works through a web page, the consoles can be controlled via a Java-based VNC viewer, and it’s VERY quick to set up.

For system emulation, KVM uses the core devices from Qemu, so a lot of Qemu virtual machines will “just work” if you copy them over.  If you are installing an OS onto a virtual machine, the ‘easy’ way is with a physical CD; you can use ISO images, however they are awkward to use.  You have to flag the VM to pause on startup, switch over to the monitor page, and issue the following command:

change ide1-cd0 /directory/isoimage.iso

Then tell the emulator to start up with the ‘c’ command, which will continue from the pause… Yeah, I know, it’s not terribly elegant.
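For what it’s worth, the whole monitor dance ends up looking something like this (info block is just there to confirm which drive is which before you swap the ISO):

info block
change ide1-cd0 /directory/isoimage.iso
c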

On the OpenVZ front, it’s FAST, as there is no real emulated I/O; it’s native.  So I decided to use the wiki template and set up a wikipedia mirror at home.  If anyone feels as brave, you too can find instructions here:

These are the load times for some of the tables:

  • 601M pages.sql: Query OK, 7,473,186 rows affected, 8 warnings (5 min 10.52 sec)
  • 837M revision.sql: Query OK, 7,473,200 rows affected, 65535 warnings (2 min 11.84 sec)
  • 18G text.sql: Query OK, 7,473,202 rows affected, 1 warning (12 min 12.07 sec)
  • 20M category.txt: Query OK, 471,207 rows affected (13.14 sec)
  • 1.8G categorylinks: Query OK, 24,501,837 rows affected, 30177 warnings (28 min 28.31 sec)
  • 5.6G externallinks: Query OK, 36,492,925 rows affected (3 min 50.34 sec)
  • 362M latestimage: Query OK, 807,906 rows affected, 2 warnings (34.35 sec)
  • 555M imagelinks: Query OK, 18,615,721 rows affected (10 min 49.60 sec)
  • 32K interwiki: Query OK, 651 rows affected (0.08 sec)
  • 186M langlinks: Query OK, 5,780,509 rows affected (2 min 17.75 sec)
  • 2G logging: Query OK, 16,398,421 rows affected (2 min 51.75 sec)
  • 45M oldimage: Query OK, 118,449 rows affected (1.97 sec)
  • 7.6G pagelinks: Query OK, 270,641,297 rows affected (6 hours 12 min 4.83 sec)
  • 104M redirect: Query OK, 3,234,481 rows affected (23.71 sec)
  • 1.2G template-link: Query OK, 48,885,222 rows affected (50 min 7.08 sec)
  • 68K user_groups: Query OK, 3,947 rows affected (0.11 sec)
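For reference, these are all plain SQL dumps, so loading each one is nothing fancier than feeding it through the mysql client; the user and database names below are just placeholders for whatever you created:

mysql -uwiki -p wikidb < pages.sql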

Even the ‘longest’ part here, the 270 million pagelinks records, took just over six hours… Not too bad!  That’s still 12,122.88 TPS!

Also, as a tip for anyone else crazy enough to run a sizable mediawiki (like wikipedia), or any single-server wiki, look to this page.

The upshot is that by loading the APC extension into PHP and mediawiki, load times for my cached site went from 2-5 minutes to 1-10 seconds.
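For anyone wanting the specifics, that boiled down to installing the APC module and pointing mediawiki’s object cache at it; the Debian package name and LocalSettings.php setting below are the usual ones, but double-check them against your PHP and mediawiki versions:

apt-get install php-apc

and then in LocalSettings.php:

$wgMainCacheType = CACHE_ACCEL;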

The OpenVZ portion has various templates that can be loaded into the zones, ranging from base installs of CentOS, Debian, and Ubuntu to pre-configured applications like mediawiki and a few others.

If anything I’d say that proxmox is what I was hoping Microsoft’s Hyper-V could have been.  A container version of windows with easy remote admin, along with some system emulation, could have made things MASSIVELY easier to deal with.  It’s a shame they decided to go with this bizarre WMI-based thing.