Personal AltaVista + UTZOO reloaded

Introduction

Long before websites, during the dark ages of the BBS, there was (well, it’s still there!) a distributed messaging system on the internet called usenet.  It covered countless topics on just about everything, and it was full of all kinds of incredible conversations.  Before the walled gardens, and before it was easy to run individual bulletin boards, the internet prided itself on having one big global distributed messaging system.  It was a big system, and one thing that was always taken for granted was that it was too big to save, and that whatever you put out there would probably be erased, as all sites had a finite amount of very expensive disk space and would only keep recent articles.

But it turns out that the zoology department at the University of Toronto had a tape budget, and they were in fact archiving everything they could.  In all they amassed 141 tapes spanning from February 1981 (though the earliest are not Usenet posts, just internal University netnews stuff) all the way up to about midnight of July 01, 1991!

While the archive was made available to a few people in 2001, it was made generally available in 2009, and then in 2011 on archive.org, where I downloaded a copy of it.  There is some interesting backstory over on Dogcow land, as it took quite a bit of effort to get the data off the tapes, and it was then slowly released out into the wild.

As mentioned on the archive.org site:

This is a collection of .TGZ files of very early USENET posted data provided by a number of driven and brave individuals, including David Wiseman, Henry Spencer, Lance Bailey, Bruce Jones, Bob Webber, Brewster Kahle, and Sue Thielen.

OK, so a few months back I had set up AltaVista personal desktop search along with the UTZOO usenet archive, for the purpose of using something more sophisticated than grep while maintaining that legacy/retro feel of using outdated technology.  To recap, the first challenge is that the desktop search product is only meant to be used from the desktop of a Windows 98/NT 4.0 workstation.  It uses a super ancient version of Java as the webserver, and they chose to bind it to 127.0.0.1:6688.  So the first thing to get around that was to build a stunnel tunnel, allowing me to effectively connect to the webserver remotely.  And since the server assumes it’s being accessed locally, I had to use Apache with mod_substitute to set up some simple regex substitutions to massage the pages into something that would be usable from a non-local machine.

So with that word salad up, let’s have a brief picture!

Flow diagram

Stepping it up

On my ‘general’ hosting machine, I use haproxy to reverse proxy multiple sites out of a single address.  This is a super simple solution that allows me to have all kinds of different backends using various hosting platforms, such as Apache 1.3 on Windows NT 3.1.  So for this to work I just needed to create an altavista.superglobalmegacorp.com DNS record, and then add the following to the haproxy config:

frontend named-hosts
bind 172.86.179.14:80
acl is_altavista hdr_end(host) -i altavista.superglobalmegacorp.com
use_backend altavista if is_altavista

backend altavista
balance roundrobin
option httpclose
option forwardfor
server debian8 10.0.0.18:80 check maxconn 10

So as you can see it’s really simple: it checks whether the host header ends with the string ‘altavista.superglobalmegacorp.com’, and then sends the request to the backend, which has a single web server, in this case a lone Debian server, aptly named debian8, that is capped at 10 concurrent connections.
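To make the matching rule concrete, here is a minimal Python sketch of what a case-insensitive suffix match on the host header amounts to.  This is only an illustration of the idea, not haproxy’s code; the function name pick_backend and the RULES list are made up for the example.

```python
def pick_backend(host_header, rules, default=None):
    """Return the backend for the first rule whose suffix matches the host header."""
    # Compare case-insensitively, which is what the -i flag asks for.
    host = host_header.lower()
    for suffix, backend in rules:
        if host.endswith(suffix.lower()):
            return backend
    return default


# Hypothetical rule table mirroring the single ACL in the config above.
RULES = [("altavista.superglobalmegacorp.com", "altavista")]

print(pick_backend("altavista.superglobalmegacorp.com", RULES))
print(pick_backend("example.org", RULES))
```

One thing the sketch makes obvious is that a suffix match also accepts any name that merely ends in the same string, such as www.altavista.superglobalmegacorp.com, which is fine here but worth remembering.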

The next thing to do was generate a self-signed SSL cert, which wasn’t too hard.  The stunnel installer has a profile ready to go, so it was only a matter of finding a version of OpenSSL that’ll run on NT 4.  As this isn’t public encryption I really don’t care about it using crap certs.

The Debian server is where all the regex magic is, along with the stunnel client to connect to the NT 4.0 Workstation.

client = yes
debug = 0
cert = /etc/stunnel/stunnel.pem

[altavista]
accept = 127.0.0.1:8080
connect = 10.0.0.19:8443

Likewise on NT stunnel will need a config like this:

cert = c:\stunnel\stunnel.pem

; Some performance tunings
socket = l:TCP_NODELAY=1
socket = r:TCP_NODELAY=1

; Some debugging stuff useful for troubleshooting
debug = 0
output = c:\stunnel\stunnel.log.txt

[altavista]
accept = 8443
connect = 127.0.0.1:6688

With the ability for the Debian box to talk to the AltaVista web server, it was now time to configure Apache.  This is the most involved part, as the HTML formatting by AltaVista personal search is hard coded into the Java binary.  However thanks to mod_substitute we can modify the page on the fly!  So the first thing is that I set up two virtual directories: the first one, /altavista, maps to the search engine, and then I added /usenet, which talks to IIS 4.0 on the Windows NT 4.0 workstation, which is just allowing read & browse to the usenet files that will need to be indexed.

#This part connects over the stunnel connection to the AltaVista server
ProxyPass "/altavista" "http://localhost:8080"
ProxyPassReverse "/altavista" "http://localhost:8080"
#This connects to IIS 4.0 on the NT 4.0 machine
ProxyPass "/usenet" "http://10.0.0.19/usenet"
ProxyPassReverse "/usenet" "http://10.0.0.19/usenet"
ProxyRequests Off
RewriteEngine On

Because we mounted it on a sub directory we need to redirect the root to /altavista so I simply add:

#Redirect the root to the /altavista path.
#
RedirectMatch 301 ^/$ /altavista

To get the images to work, along with fixing the 127.0.0.1 hardcoding, I copied them from the NT workstation onto the Apache server, then added these substitution statements:

#clean up urls
Substitute "s|Copyright 1997|Copyright 2017|n"
Substitute "s|127.0.0.1:6688|altavista.superglobalmegacorp.com/altavista|n"
Substitute "s|file:///c:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|/images/|n"

And now the site is starting to work.  The most involved regex is to change the links from local text files, into a path to point to the usenet shares.  This changes the text for u:\usenet\a333\comp\33.txt into a workable URL.

Substitute "s|>u:\\\\usenet.([a-z]{1,}[0-9]{3,})\\\([0-9a-z\+\-]{1,})\\\([0-9]{1,})|---><a href=\"http://utzoo.superglobalmegacorp.com/usenet/$1/$2/$3.txt\">[$2\] Click for article|"

Naturally there are a LOT of these types of statements to match various depths and pattern types, as there are A news, B news and C news archives, plus scavenged bits.
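The reason there are so many statements is that each one only matches a fixed number of path components.  As a comparison, here is a small Python sketch of the same conversion where one pattern handles any depth by using a replacement function.  Apache’s Substitute directive cannot call a function like this, so this is not a drop-in replacement for the config, just an illustration of the transformation; the names BASE_URL, PATH_RE and linkify are made up for the example.

```python
import re

# Assumed base URL, taken from the substitution shown above.
BASE_URL = "http://utzoo.superglobalmegacorp.com/usenet/"

# u:\usenet\ followed by one or more directory components and a numeric .txt file.
PATH_RE = re.compile(
    r"u:\\usenet\\((?:[0-9a-z.+-]+\\)+[0-9]+\.txt)",
    re.IGNORECASE,
)


def linkify(html):
    """Replace every MS-DOS article path in html with a clickable link."""
    def to_link(match):
        # Flip the backslashes so the DOS path becomes a URL path.
        url_path = match.group(1).replace("\\", "/")
        return '<a href="{}{}">Click for article</a>'.format(BASE_URL, url_path)

    return PATH_RE.sub(to_link, html)


print(linkify(r"<b>u:\usenet\a333\comp\33.txt</b>"))
```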

Additionally I disabled a bunch of URLs that would either try to alter the way the engine works, or allow the search location to change, just giving you empty results, along with altering some of the branding, as digital.com doesn’t exist anymore, and various tweaks.  The finished config file for Apache is here.

Now with that in place, I can hit my personal AltaVista search.  The next insane thing was to rename all the files from the UTZOO dump, adding a .txt extension, and then re-encode them in MS-DOS CR/LF format.  I used ‘find -type f’ to find the files, and then a simple exec to rename them with a .txt extension.  Then it was only a matter of using ZIP to compress the archives, transferring them to Windows NT, and running UNZIP on them with the -a flag to convert them into CR/LF ASCII files on Windows.  This took a tremendous amount of time as there are about 2.1 million files in the archive.
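For anyone who would rather do the rename and the line-ending conversion in one pass, here is a rough Python sketch.  It is not what I ran (that was find, ZIP and UNZIP -a as described above), and the function name to_dos_text is made up; it assumes every file under the given directory is a plain text article.

```python
import os


def to_dos_text(root):
    """Append .txt to every file under root and rewrite it with CR/LF line endings."""
    converted = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                continue  # already converted on an earlier run
            src = os.path.join(dirpath, name)
            with open(src, "rb") as handle:
                data = handle.read()
            # Normalise to LF first so existing CR/LF pairs are not doubled.
            data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
            with open(src + ".txt", "wb") as handle:
                handle.write(data)
            os.remove(src)
            converted += 1
    return converted
```

Working on bytes rather than decoded text sidesteps any guessing about character sets in thirty year old posts.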

With the files on Windows, I now had to run the indexer.

Indexed in under 7 hours!

While I originally had an IIS 4.0 instance on the same NT 4.0 Workstation serving up the result files, I thought it may make more sense to just serve them from the UTZOO mirror server I have in the same colocation so it’d be much faster, so that way only the queries are relying on servers in Hong Kong, instead of being 100% located in the United States.

So here we go, my search portal for all that ancient usenet goodness:

altavista.superglobalmegacorp.com

If you are hoping for the wealth of knowledge to be gained from people posting on usenet from 1981 to 1991 then this is your ticket.  Keep in mind that usenet being usenet, there are discussions on everyone and everything, and like all other forums, before you know it it’ll end with calling people Hitler, and how the Amiga is the greatest computer ever (well it was!).  A tip when searching by year is that people commonly wrote the year as 2 digits.  However when looking for numbers like, say, Battletech 3025, it will pull up files named 3025.txt.  To prevent this just add -3025.txt to exclude names like 3025.txt, or if you want to find out about the movie Bladerunner from 1982, try searching for bladerunner 82 -82.txt +review +movie.  If you have any questions, there is of course the manual with a guide on how to search.

The story of AltaVista is somewhat interesting, but much like how Digital screwed up the Alpha market by trying to hoard high end designs, they also didn’t set the search people free to focus on search.  And the intranet stuff was crazy expensive: look at this ad from 1996, which translates to a minimum of $10,000 USD a year to run a single search engine!  But as we all know, the distributed model of Google won search, and AltaVista never had a chance as it was caught up in the Compaq/HP mess, then spun out to be quickly absorbed by Yahoo.

Meanwhile it appears the original owners of altavista.com, AltaVista Technology, Inc. of California, are actually still in business.  If anyone cares I’ll put the installation files, and some of the configs, in this directory.

Apache 1.3.4 on Windows NT 3.1

Yes, I know it’s crazy old, totally useless, but it did mostly compile.

Apache 1.3.4 running on Windows NT 3.1 Advanced Server

Assuming it hasn’t been crashed or hacked it should be online here:

http://winnt31.superglobalmegacorp.com/

Unlike Serweb 0.3, Apache is HTTP 1.1 compliant, which means that I can put it behind haproxy, and enjoy the fact it doesn’t need a dedicated IP address.

Although I can’t imagine anyone wanting it, here is the binary/source and my htdoc dump.  Download it here: apache-NT-31.zip & unzip.exe

Apache console mode

I had to pull out some stuff, like some of the service features, so it really only runs as a console app.

I’ve compiled it with /Zi meaning full debug and no optimization.  If you want to re-compile you’ll probably want either the Win32 SDK, or Visual C++ 1.0 32bit, and replace the headers and libraries from the Windows NT 3.5 SDK.  Much like trying to build GCC 2.6.3 on Windows NT.

Also in a silly way, thanks to Qemu, I’m now running both OS/2 & Windows NT on the same server, running Linux.

Fun with regex substitutions in Apache

Continuing from my previous post, I was now able to access my AltaVista server, however from a web browser I was unable to actually view any of the documents remotely.

In the pages though I did get the MS-DOS path to the usenet article in question:

Now how do I turn that into a URL?

Well as it turns out mod_substitute does support regex, which in turn can do variable re-ordering!

After a bit of googling I found this page on stackoverflow, on how to convert a date between UK/US formats:

s/(\d{4})-(\d{2})-(\d{2})/$1-$3-$2/

Simple, right?  So what is going on here?  The parentheses define capture groups, and in the substitution part you can recall them with $1, $2, $3 etc.  So using this recipe I could take something like this:

u:\b227\comp\sys\laptops\3080

and convert it into the following:

http://debian7/usenet/b227/comp/sys/laptops/3080

The code for this would look something like this:

Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3\">Click for article|"

Although for some reason it’s embedding the URL’s even though I specified code formatting.
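If you want to poke at capture groups without restarting Apache every time, the same recipe can be tried in a few lines of Python.  This is just a sketch for experimenting: Python spells the back-references \1, \2, \3 instead of $1, $2, $3, and the names swap_month_day, ARTICLE and to_url are made up for the example.

```python
import re


def swap_month_day(text):
    """Swap the last two fields of a yyyy-xx-xx date, as in the stackoverflow recipe."""
    return re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\1-\3-\2", text)


# Fixed depth: archive directory, three newsgroup parts, then the article number.
ARTICLE = re.compile(
    r"u:\\([a-z]+[0-9]{3,})\\([0-9a-z]+)\\([0-9a-z]+)\\([0-9a-z]+)\\([0-9]+)"
)


def to_url(path):
    """Rebuild the five captured pieces of an MS-DOS path as a URL."""
    return ARTICLE.sub(r"http://debian7/usenet/\1/\2/\3/\4/\5", path)


print(swap_month_day("2016-12-25"))
print(to_url(r"u:\b227\comp\sys\laptops\3080"))
```

Because the pattern has exactly five groups it only matches paths of exactly that depth, which is why the real config needs one statement per depth.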

Now all I had to do was install IIS 4.0 off the Option Pack CD-ROM onto my Windows NT 4.0 workstation, and create a virtual directory of /usenet which then pointed to the U: drive where AltaVista did its indexing.

So to this point that gives me a config file much like this:

ServerAdmin webmaster@localhost
DocumentRoot /var/www
SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On

SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
#clean up urls
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
#protect the page
Substitute "s|launch=app||n"
Substitute "s|?pg=config&what=init|?pg=h|n"
#fix title
Substitute "s|<IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0>|<a href=\"http://debian7/altavista\"><IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0><strong>|--->|n"
Substitute "s|</strong>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3\">Click for article|"
# Need links for the u:\news097f1\b120\comp\society\futures\1122
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7/$8\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
# Need links for  u:\news002f1\b1\fa.poli-sci\8
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z\.\-]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"

<Location /usenet/>
    ProxyPass  http://10.12.0.16/usenet/
    RewriteEngine On
    SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
    AddOutputFilterByType SUBSTITUTE text/html
</Location>

bla bla rest of the 000-default crap....

Simple right?

Searching for AltaVista

So now I get a nicely formatted page, I can click the mountain icon, and I jump back to home, and I can click on the articles and, because I have no extensions or MIME types to intercept it’ll just download them to my PC.  I guess I need to go through them all, convert them from UNIX format to MS-DOS, and stick a .txt extension on every single one of them.

I’m still thinking this thing is far too rickety to put on the internet, but we’ll see.

Fun with Apache, (mod_proxy, mod_rewrite), stunnel, And AltaVista Personal search

As you may remember from my prior attempt at using AltaVista Search, I ran out of space, and found out it only serves pages on 127.0.0.1:6688 and is pretty much hardcoded to do so.  It’s a “fine” hybrid Java 1.01 application, with the bulk of it being Java.  I finally got around to setting up a VM, unpacking all of the utzoo archives, and indexing them.  I should have done something about the IO because this took too long (KVM).

SIXTEEN HOURS!!!

So to cheat the system, I installed stunnel as a simple https to http proxy, which let me access my search VM anywhere.  However it still embedded 127.0.0.1 in all the pages.

via stunnel

Enter an Apache reverse proxy to talk to stunnel to talk to AltaVista search!

First to enable a few modules:

a2enmod substitute
a2enmod proxy
a2enmod ssl
a2enmod proxy_http
a2enmod rewrite

And adding this into the config:

SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s/1997/2016/ni"
Substitute "s/97/16/ni"
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
Substitute "s|launch=app||n"
Substitute "s|<a href=http://debian7/altavista/?pg=q&what=0&fmt=d|<!---|n"
Substitute "s|><strong>|--->|n"
Substitute "s|</strong></a>||n"
Substitute "s|>u:\|->u:\|n"

This let me redirect all of those requests into a VM called debian7 on the /altavista path.  I also copied the images to the apache server, and now I get something that looks correct!

Apache in the mix!

I cut the results short… But here is a search of something simple:

About 16598 documents match your query.

I also killed all the ‘working’ URLs that simply open a desktop application on the index ‘server’.  Naturally that made sense for a personal service, but as a server this isn’t any good.  As such you can’t click on any search results now.  I need something else to figure out how to take the result blocks like “u:\b128\comp\databases\2852” and turn them into URLs.

Also, if I do re-index, I would be best off cutting the headers, or most of them, so the preview lines make sense.  Xref, Path, even From & Newsgroups don’t interest me.
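Trimming the headers before indexing could look something like the Python sketch below.  It is only an idea, not something I have run over the archive: it assumes B news style articles with LF line endings, where the headers are “Name: value” lines separated from the body by one blank line, and the names DROP and strip_headers are made up.

```python
# Header names (lowercase) to remove before indexing.
DROP = ("xref", "path", "from", "newsgroups")


def strip_headers(article, drop=DROP):
    """Remove the named header lines and leave the rest of the article untouched."""
    head, sep, body = article.partition("\n\n")
    kept = []
    skipping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation of the previous header: it shares that header's fate.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        skipping = name in drop
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + body


sample = "Path: utzoo!henry\nFrom: henry@utzoo\nSubject: tapes\n\nWe kept everything.\n"
print(strip_headers(sample))
```

The older A news articles use a different layout, so they would need their own handling.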

I hate to leave it as ‘good enough’ but if anyone has a solution…. I’ll be glad to make this wonderful resource available!

I accidentally upgraded vpsland to Debian 8

So yeah, dealing with Apache 2.4 vs 2.2 was… fun.  The security Order stuff is obsolete so that was fun editing all the virtual hosts.

The key parts being:

In this example, all requests are denied.

2.2 configuration:

Order deny,allow
Deny from all

2.4 configuration:

Require all denied

In this example, all requests are allowed.

2.2 configuration:

Order allow,deny
Allow from all

2.4 configuration:

Require all granted

In the following example, all hosts in the example.org domain are allowed access; all other hosts are denied access.

Boy was that fun!

Another bit of fallout was the hosts file.  I have spamd running and suddenly I was being bombarded with this message:

Jul 25 10:15:39 cheapvps spamc[683]: connect to spamd on ::1 failed, retrying (#1 of 3): Connection refused

Well it turns out after much digging around that Debian 8 is more IPv6 ready.  The hosts file from Debian 7 was something like this:

127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback

And in 8, it changed to this:

fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
127.0.0.1 localhost.localdomain localhost
# Auto-generated hostname. Please do not remove this comment.
::1 localhost ip6-localhost ip6-loopback

Needless to say, having localhost point to ::1 made it dependent on all local daemons supporting IPv6, and spamd sadly is IPv4 only.  Luckily it’s a quick fix to remove localhost from ::1, which then lets it work again with 127.0.0.1, and now it can connect over IPv4.
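I made the edit by hand, but the change itself is small enough to sketch in Python.  The function name drop_localhost_from_ipv6 is made up, and this only shows the text transformation; it deliberately does not touch /etc/hosts itself.

```python
def drop_localhost_from_ipv6(hosts_text):
    """Return hosts file text with 'localhost' removed from the ::1 entry."""
    out = []
    for line in hosts_text.splitlines():
        content, hash_mark, comment = line.partition("#")
        fields = content.split()
        if fields and fields[0] == "::1" and "localhost" in fields[1:]:
            names = [name for name in fields[1:] if name != "localhost"]
            if not names:
                continue  # nothing left on the line, so drop it entirely
            line = " ".join(["::1"] + names)
            if hash_mark:
                line += " #" + comment
        out.append(line)
    return "\n".join(out) + "\n"


before = "127.0.0.1 localhost.localdomain localhost\n::1 localhost ip6-localhost ip6-loopback\n"
print(drop_localhost_from_ipv6(before))
```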

Well today (August 4th, 2015) there was a critical update to Apache.  And after updating I got this fine error:

# /etc/init.d/apache2 restart

[….] Restarting apache2 (via systemctl): apache2.serviceJob for apache2.service failed. See ‘systemctl status apache2.service’ and ‘journalctl -xn’ for details.

failed!

Great.  So what does the error actually say?

# systemctl status apache2.service
* apache2.service – LSB: Apache2 web server
Loaded: loaded (/etc/init.d/apache2)
Active: failed (Result: exit-code) since Tue 2015-08-04 13:52:13 HKT; 7s ago
Process: 6063 ExecStop=/etc/init.d/apache2 stop (code=exited, status=0/SUCCESS)
Process: 6427 ExecStart=/etc/init.d/apache2 start (code=exited, status=1/FAILURE)

systemd[1]: Starting LSB: Apache2 web server…
apache2[6427]: Starting web server: apache2 failed!
apache2[6427]: The apache2 configtest failed. …….
apache2[6427]: Output of config test was:
apache2[6427]: apache2: Syntax error on line 250 …y
apache2[6427]: Action ‘configtest’ failed.
apache2[6427]: The Apache error log may have more….
systemd[1]: apache2.service: control process exi…=1
systemd[1]: Failed to start LSB: Apache2 web server.
systemd[1]: Unit apache2.service entered failed …e.
Hint: Some lines were ellipsized, use -l to show in full.

Fantastic.

# apachectl configtest
apache2: Syntax error on line 250 of /etc/apache2/apache2.conf: Could not open configuration file /etc/apache2/mods-enabled/alias.load: No such file or directory
Action ‘configtest’ failed.
The Apache error log may have more information.

So, normally you’d check under mods-enabled, and link in the missing bits, right? Yeah except there are no MPM modules. Not anymore.  And yes I removed and re-installed the apache2-mpm-prefork module, to no avail.  So after much digging around it looks like the transition to 2.4 finally broke everything irrecoverably.  So I backed up the /etc/apache2 directory then ran the following:

apt-get purge apache2

Which then removes all the apache2 stuff from the system.  Then to finish it off, run a quick

rm -rf /etc/apache2

You did back it up, right?

Now put it back in:

apt-get install apache2 libapache2-mod-php5

Now to re-enable the virtual sites.  For some reason they need to be enabled with a2ensite.  Except they don’t tell you that your site files in /etc/apache2/sites-available now need to end in .conf (you did back it up, right?)
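If you have a pile of old site files to fix up, the renaming is easy to script.  Here is a hedged Python sketch; the function name add_conf_suffix is made up, and it assumes every regular file in the directory you point it at really is a site definition.  Each renamed site still has to be enabled with a2ensite afterwards.

```python
import os


def add_conf_suffix(sites_dir):
    """Rename every file in sites_dir that lacks a .conf suffix; return the new names."""
    renamed = []
    for name in sorted(os.listdir(sites_dir)):
        path = os.path.join(sites_dir, name)
        if os.path.isfile(path) and not name.endswith(".conf"):
            os.rename(path, path + ".conf")
            renamed.append(name + ".conf")
    return renamed
```

Try it on a copy of the backup first, not on the live /etc/apache2/sites-available.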

Also if you run perl (src2html) be sure to run:

a2enmod cgi
service apache2 restart

Not to mention the joys of updating perl, and the cvsweb breaking, and I’m sure far more to break.  Oh well, at least it’ll be up to date.  That’s what I get for mixing ‘stable’ with ‘old stable’, when the local mirror out in the UK I was using moved up to 8.

shellinabox

So while browsing reddit, I came across this neat package, shellinabox.  Simply put, it runs as a process on your ‘box’ and fronts it with a javascript terminal interface.  So as long as you have a halfway modern machine with javascript support you too can just connect to a machine and run CLI based stuff.

BBS via telnet

So as a test I set up a game of tetris, and a telnet session to my BBS.

There isn’t much to ‘setup’ in the way of shellinabox, because it’s all command line driven.

/shellinabox-2.14/shellinaboxd -t -s /:LOGIN -s "/bbs:nobody:nogroup:/:/usr/bin/telnet localhost" -s /tetris:nobody:nogroup:/:/usr/games/tetris-bsd --css /shellinabox-2.14/shellinabox/white-on-black.css -b

So this will create a new web server that by default listens on TCP port 4200, which in turn uses the virtual directories / for a login, /bbs which launches telnet, and /tetris which starts the BSD tetris for terminals game.  Now as many of you are aware, not all people with internet connections have the luxury of having all outbound TCP/IP ports open.  Even the most excellent flashterm still establishes a TCP session.  What makes this different is that all the traffic is done via HTTP, which means it can be proxied.  Now the real trick is having a web server do the proxying for you, so that all the user has to do is hit a special URL, and the server will proxy the request to shellinabox’s web server.

Enter Apache2’s reverse proxy!

So in my BBS’s Apache config, I add in the following lines:

ProxyPass /tetris http://localhost:4200/tetris
ProxyPassReverse /tetris http://localhost:4200/tetris
ProxyPass /bbs http://localhost:4200/bbs
ProxyPassReverse /bbs http://localhost:4200/bbs

I’m not sure exactly which modules are needed, but after hammering away, enabling these got it to work (a2enmod takes the bare module name, so variants like mod_proxy and proxy_module just error out):

a2enmod proxy
a2enmod proxy_http
a2enmod rewrite
a2enmod headers
a2enmod deflate

The ProxyPass lines go under my virtual server’s ‘root’ directory.  So now when you access https://virtuallyfun.com/tetris/ Apache will proxy your request into the shellinabox http server, and you’ll get…

Tetris

So now only using HTTP you can play tetris!

So where to go from here?  I was thinking some kind of SIMH CP/M on demand thing.  There is a command line Wyse 60 emulator, so maybe that’d be fun.  I may even bring back something I had ages ago, access into a bunch of legacy systems.  This is a great ‘solution’ to enable multiplexing without having to use another software MUX.

 

WordPress spam…

So I was looking, and at the start of the year about 8% of my stats were SPAM.  Yuck.  Then something insane happened this week: it jumped to 28%.

So I crossed that point when something would have to be done!

I’ve already installed stuff to detect the spam, and it does a good overall job.  But I wanted to take it to the next level, and block all traffic from the spammers!  Anyone who SPAMs is probably engaged in other nonsense that makes me not want their traffic.

Thankfully for me and this brave new era of google, I could quickly find someone has done 99% of the leg work for me right here!  Thanks to Sakis’s hard work I was able to add some minor tweaks, and generate a full iptables config, flush & add the new rules, then have cron run it every few minutes.

Pretty cool stuff if I do say so myself!

 

Since the primary site is now offline, I’ve updated with an archive.org link.  For what it’s worth, here is the meat of the article in question:

 

Dodging WordPress comment spammers

I admit: Allowing anyone to post comments is bad practice. Though, I’ve got my reasons to stand my ground. I’ve many times read something on a blog and to some of them I even had something to add. Could potentially help blog’s author or future visitors by sharing my own experience, or request a solution to one of my problems by posting a question. Guess what? I am so lazy that I rarely go through registration procedure, just to enable me posting a comment.

I am one of those that insist dialog and discussion is always constructive as long as both ends feel like establishing it. I do not want to lose the opinion and comments of stopping-by visitors, just because I want a “safe” thing that runs on its own. But, “buts” exist. My blog is currently one month old, still it manages to receive 300+, on average, spam-oriented comments per day, while I’ve even witnessed a 1k/day.

Thank god, WordPress provides blacklist features based both on IP addresses and comment content. And it really does a good job: After messing around with your recent “spam” you can easily end up with a list that accurately detects a non-constructive comment. However, you’ve not solved all your problems this way:

  • New comments still come. They are just automatically rated as spam.
  • Your database fills with garbage.
  • Your web traffic statistics are spoiled.
  • You waste bandwidth.
  • You waste CPU time.
  • If your spammer ever stops selling drugs and starts advertising flesh, all your content matching rules go away.
  • If your spammer loses interest in being a blog spammer and switches to a port-scanner, you will receive that too.

How about you refuse them a spare TCP socket? Besides, you don’t even wanna know them. All their connection attempts will end-up to void. Time for some iptables magic.

WordPress has already stored their IP addresses within its database. Consult that wp-config.php file you edited when you first installed WordPress, and refresh your memory on what your database name, username and password are. Mine are:


$ grep "DB_" wp-config.php
define('DB_NAME', 'mywordpress');
define('DB_USER', 'sakis');
define('DB_PASSWORD', 'myextrastrongpassword');
define('DB_HOST', 'localhost');
define('DB_CHARSET', 'utf8');

You now have to use that information into constructing this single-row command:


mysql -f -p --user=DB_USER DB_NAME <<<"select distinct CONCAT('iptables -A INPUT -s ',comment_author_IP,'/32 -j DROP') from wp_comments where comment_approved='spam' order by 1 asc" | grep -v "^CONCAT" >> THEY_BOTHER_ME

Check my example:


$ mysql -f -p --user=sakis mywordpress <<<"select distinct CONCAT('iptables -A INPUT -s ',comment_author_IP,'/32 -j DROP') from wp_comments where comment_approved='spam' order by 1 asc" | grep -v "^CONCAT" >> THEY_BOTHER_ME
Enter password:
$ head THEY_BOTHER_ME
iptables -A INPUT -s 113.161.128.232/32 -j DROP
iptables -A INPUT -s 117.121.208.254/32 -j DROP
iptables -A INPUT -s 118.141.141.7/32 -j DROP
iptables -A INPUT -s 118.194.1.157/32 -j DROP
iptables -A INPUT -s 119.235.27.100/32 -j DROP
...

You now have a simple recipe, named “THEY_BOTHER_ME”, ready to be executed (as root):


$ su
# . ./THEY_BOTHER_ME

Make sure you hook “THEY_BOTHER_ME” at your system’s start-up procedure and construct a cron/at job to periodically refresh it.

I’ve created a file named /etc/cron.daily/update_spammers.sh, with the following contents:


#!/bin/bash
# bash rather than plain sh: the <<< here-string used below is a bashism

fileloc="/etc/THEY_BOTHER_ME"

before=`cat "${fileloc}" | wc -l`
before=`echo ${before}`

cp "${fileloc}" /tmp/BOTHERS.$$

mysql -f --user=sakis --password=myextrastrongpassword mywordpress <<<"select distinct CONCAT('iptables -A INPUT -s ',comment_author_IP,'/32 -j DROP') from wp_comments where comment_approved='spam' order by 1 asc" | grep -v "^CONCAT" >> /tmp/BOTHERS.$$

sort /tmp/BOTHERS.$$ | uniq > "${fileloc}"
rm -f "/tmp/BOTHERS.$$"

. "${fileloc}"

after=`cat "${fileloc}" | wc -l`
after=`echo ${after}`

di=`expr ${after} - ${before}`
di=`echo ${di}`

printf "[%s] Spammers updated. Added %d new spammer(s) (Before: %d, After: %d)\n" "`date`" ${di} ${before} ${after}

And sadly his original script is now offline.  This should be enough for anyone to get going on this exciting spam adventure…
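Since the generated file gets sourced as root, it is worth being picky about what ends up in it.  The sketch below is not Sakis’s script or my tweaked version; it is a small Python illustration of building the same DROP rules while throwing away anything that is not a valid IPv4 address and removing duplicates.  The function name drop_rules is made up, and getting the addresses out of the database is left out.

```python
import ipaddress


def drop_rules(addresses):
    """Turn an iterable of address strings into sorted, de-duplicated iptables rules."""
    seen = set()
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # not an IP address at all, so never let it near a root shell
        if ip.version == 4:
            seen.add(ip)
    return ["iptables -A INPUT -s {}/32 -j DROP".format(ip) for ip in sorted(seen)]


for rule in drop_rules(["118.141.141.7", "113.161.128.232", "118.141.141.7", "bogus"]):
    print(rule)
```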

ownCloud

So I was reading through a friend’s blog (wintellect!) and I came across this page about ownCloud…  Well I thought this was very interesting as I’ve pulled a lot of my external email mess inside (on my own Exchange 5.5 server on MS Virtual Server 2005!) .. So I like this whole idea.

I’ve got this VPS that has a few extra gigs of space, and it’d be SUPER convenient to map some drives for backups, or even back it up by copying some files..  It’s a simple AMP program setup, so I had it up and running in a few seconds.  The ‘hard’ part was mapping the drive from Vista.  Naturally it came down to reading the instructions, namely:

  1. in Services, enable the Webclient service (might be enabled already)
  2. in the Registry, change HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\WebClient\Parameters\BasicAuthLevel from 1 to 2
  3. go to My Computer → Mount Network Drive
    • in the Folder field type http://ADDRESS/files/webdav.php
    • check Connect using different credentials
And that is about the size of it.
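
If the drive refuses to map, it can help to check that the WebDAV endpoint answers at all before blaming Windows. The following is only a sketch under some assumptions: the helper name build_propfind is mine, the URL is the same ADDRESS placeholder as in the steps above, and it assumes the server accepts HTTP Basic authentication (which is what the BasicAuthLevel change is about). It builds a WebDAV PROPFIND request without sending it.

```python
import base64
import urllib.request


def build_propfind(url, user, password):
    """Build (but do not send) a WebDAV PROPFIND request using HTTP Basic auth.

    build_propfind is a hypothetical helper, not part of ownCloud.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method="PROPFIND",
        headers={
            "Depth": "0",  # only ask about the collection itself, not its children
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    req = build_propfind("http://ADDRESS/files/webdav.php", "user", "pass")
    print(req.get_method(), req.full_url)
```

To actually run the check, pass the request to urllib.request.urlopen(); a working WebDAV server normally replies to PROPFIND with a 207 Multi-Status response. Keep in mind that Basic auth over plain http sends the password more or less in the clear.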

Installing mediawiki on WAMP

Building on our WAMP installation, we are now going to install mediawiki.

The first thing I’d recommend to do is to move the contents of c:\wamp\www into another directory… I just shoved the terminal thing into c:\wamp\terminal .

Now mediawiki is the software that powers Wikipedia. It’s a great collaboration platform, it has built-in revision control, and best of all it’s free.

It’s also VERY simple to setup, well compared to other web content platforms.

The current version is 1.16, which can be downloaded here. As things change, you may be best served by just visiting the main download site.

Since most ‘AMP’ servers are Linux based, we’ll have to get gzip & tar to extract mediawiki. It’s very easy though.

Simply type this in to extract mediawiki:

C:\temp>dir
Volume in drive C has no label.
Volume Serial Number is FC55-C2F4

Directory of C:\temp

12/28/2010 08:15 PM DIR .
12/28/2010 08:15 PM DIR ..
12/28/2010 08:13 PM 49,152 gzip.exe
12/28/2010 08:15 PM 12,647,934 mediawiki-1.16.0.tar.gz
12/28/2010 08:13 PM 114,688 tar.exe
3 File(s) 12,811,774 bytes
2 Dir(s) 7,073,234,944 bytes free

C:\temp>gzip -dc mediawiki-1.16.0.tar.gz | tar -xf -

C:\temp>
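
If you don’t feel like hunting down Windows builds of gzip and tar, and you happen to have Python installed (an assumption, WAMP doesn’t ship it), the standard library’s tarfile module can do the same extraction. This is just a sketch; extract_tgz is a name I made up.

```python
import tarfile


def extract_tgz(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir, like gzip -dc ... | tar -xf -."""
    # "r:gz" tells tarfile to decompress the gzip layer while reading.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


if __name__ == "__main__":
    # Paths taken from the directory listing above.
    extract_tgz(r"C:\temp\mediawiki-1.16.0.tar.gz", r"C:\temp")
```

Only do this with archives you trust, as extractall() will write whatever paths the archive contains.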

Ok, now with mediawiki extracted we just move the contents of c:\temp\mediawiki-1.16.0 into c:\wamp\www

Now before we go on, we are going to set a password for the MySQL process. In the off chance someone is following this on a server to deploy on the internet, it’d be crazy to leave it with no password.

So left click on the WAMP system tray icon, go to MySQL, and bring up the MySQL Console.

media1

Just hit enter for the password as there isn’t one.

Next, run this SQL statement to set the root user’s password to password. Or better yet, pick your own stronger password.

mysql> use mysql;
Database changed
mysql> update user set password=PASSWORD('password') where User='root';
Query OK, 3 rows affected (0.05 sec)
Rows matched: 3 Changed: 3 Warnings: 0

Now restart the mysql service, by clicking on the system tray icon, then mysql, service then ‘restart service’. If you don’t do this the password change will not take effect!

With that out of the way, it’s time to configure mediawiki. Simply open up a web browser to the following location:

http://localhost

And you should see something like this:

media2

Click the setup link, and let’s walk through the options…

First is the wikiname. I’m just going to call mine ‘test wiki’. Put in your own contact email, so that mediawiki will email YOU if anything is going on… I left the language in English, and left the license alone. The next important thing to do is to select an Admin username and password. This is all up to you. Just remember that the Username is CaSe SeNsItIvE!!!

Leave the caching off.

The next section is for the email notifications, I just left those as default.

The final thing to configure is the database.

Since we are going to keep this simple, just set the DB username to root, and put in the password you configured earlier in the MySQL Console. Next check the ‘superuser account’ box, and specify root and the password again.

You can now click the Install MediaWiki button!

You’ll see some information printed on the page, and if everything goes according to plan, you’ll get the message:

Installation successful! Move the config/LocalSettings.php file to the parent directory, then follow this link to your wiki.

You should change file permissions for LocalSettings.php as required to prevent other users on the server reading passwords and altering configuration data

So simply copy the file c:\wamp\www\config\LocalSettings.php to c:\wamp\www\
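
If you end up scripting your wiki installs, this last step is easy to automate. Here is a rough sketch, again assuming Python is available; promote_local_settings is a hypothetical helper name. It moves the file rather than copying it, since that is what the installer message asks for, and it means no second copy with the database password is left sitting in the config directory.

```python
import shutil
from pathlib import Path


def promote_local_settings(wiki_root):
    """Move config/LocalSettings.php up into the wiki root and return its new path."""
    root = Path(wiki_root)
    src = root / "config" / "LocalSettings.php"
    dst = root / "LocalSettings.php"
    if not src.is_file():
        # The installer has not finished, or the file was already moved.
        raise FileNotFoundError(src)
    shutil.move(str(src), str(dst))
    return dst


if __name__ == "__main__":
    print(promote_local_settings(r"c:\wamp\www"))
```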

Then simply click the following link to be taken to your personal wiki:

http://localhost/index.php

media4

And that should take care of it!