Fun with regex substitutions in Apache

Continuing from my previous post, I was now able to access my AltaVista server; however, from a web browser I was still unable to actually view any of the documents remotely.

In the pages though I did get the MS-DOS path to the usenet article in question:

Now how do I turn that into a URL?

Well as it turns out mod_substitute (the module behind the Substitute directive) does support regex, which in turn can do variable re-ordering!

After a bit of googling I found this page on Stack Overflow, on how to convert a date between UK/US formats:

s/(\d{4})-(\d{2})-(\d{2})/$1-$3-$2/

Simple, right?  So what is going on here?  Each pair of parentheses defines a capture group, and in the substitution part you can recall them with $1, $2, $3 etc.  So using this recipe I could take something like this:

u:\b227\comp\sys\laptops\3080

and convert it into the following:

http://debian7/usenet/b227/comp/sys/laptops/3080

The code for this would look something like this:

Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3\">Click for article|"
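To try the capture-group idea outside of Apache, here is a minimal Python sketch. It is only an illustration, not the rule Apache runs: the name path_to_url is made up for the example, and the host and prefix are taken from the example URL above. Note that the three-group rule just shown only matches a path with a single newsgroup component, so this sketch uses five groups to fit the sample path.

```python
import re

# Assumed base URL, taken from the example link in the post.
BASE_URL = "http://debian7/usenet"

# Five capture groups: batch directory, three newsgroup components,
# and the numeric article file name.
PATH_RE = re.compile(
    r"u:\\([a-z]+[0-9]{3,})\\([0-9a-z]+)\\([0-9a-z]+)\\([0-9a-z]+)\\([0-9]+)"
)

def path_to_url(text: str) -> str:
    # \1 .. \5 recall the captured groups, like $1 .. $5 in Substitute.
    return PATH_RE.sub(BASE_URL + r"/\1/\2/\3/\4/\5", text)

print(path_to_url(r"u:\b227\comp\sys\laptops\3080"))
```

Python writes the back-references as \1, \2 and so on, where the Substitute directive uses $1, $2.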

Although for some reason it’s embedding the URLs even though I specified code formatting.

Now all I had to do was install IIS 4.0 off the Option Pack CD-ROM onto my Windows NT 4.0 workstation, and create a virtual directory of /usenet which then pointed to the U: drive where AltaVista did its indexing.

So to this point that gives me a config file much like this:

ServerAdmin webmaster@localhost
DocumentRoot /var/www
SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On

SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
#clean up urls
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
#protect the page
Substitute "s|launch=app||n"
Substitute "s|?pg=config&what=init|?pg=h|n"
#fix title
Substitute "s|<IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0>|<a href=\"http://debian7/altavista\"><IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0><strong>|--->|n"
Substitute "s|</strong>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3\">Click for article|"
# Need links for the u:\news097f1\b120\comp\society\futures\1122
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7/$8\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
# Need links for  u:\news002f1\b1\fa.poli-sci\8
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z\.\-]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"

<Location /usenet/>
    ProxyPass  http://10.12.0.16/usenet/
    RewriteEngine On
    SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
    AddOutputFilterByType SUBSTITUTE text/html
</Location>

bla bla rest of the 000-default crap....
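The config needs one Substitute rule per directory depth because each path component gets its own capture group. As a rough illustration of the alternative, here is a Python sketch that handles any depth with a single pattern and a replacement callback. This is not an Apache directive and not what the config above does. The name linkify is made up, the base URL is assumed from the config, and unlike the rules above it emits a closed anchor tag.

```python
import re

# Assumed base URL, matching the links in the config above.
BASE_URL = "http://debian7/usenet"

# "u:" followed by one or more "\component" parts, ending in a
# numeric article file name. Dots and hyphens cover fa.poli-sci.
PATH_RE = re.compile(r"u:((?:\\[0-9a-z.\-]+)+\\[0-9]+)")

def linkify(html: str) -> str:
    def repl(match: re.Match) -> str:
        # Turn the backslash-separated path into a URL path.
        url = BASE_URL + match.group(1).replace("\\", "/")
        return f'<a href="{url}">Click for article</a>'
    return PATH_RE.sub(repl, html)

print(linkify(r"u:\news097f1\b120\comp\society\futures\1122"))
print(linkify(r"u:\news002f1\b1\fa.poli-sci\8"))
```

The callback is what makes this possible: it can rewrite however many backslashes the match contains, which a fixed list of $1..$n references cannot.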

Simple right?

Searching for AltaVista

So now I get a nicely formatted page: I can click the mountain icon and jump back to home, and I can click on the articles.  Because the files have no extensions or MIME types for the browser to act on, it’ll just download them to my PC.  I guess I need to go through them all, convert them from UNIX format to MS-DOS, and stick a .txt extension on every single one of them.
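That conversion pass could be sketched in Python roughly as follows. This is an untested-on-the-real-archive sketch: the function name to_dos_text is made up, and it assumes every article is an extensionless file somewhere under one root directory.

```python
from pathlib import Path

def to_dos_text(root: Path) -> int:
    """Give every extensionless file under root CRLF line endings and a .txt suffix."""
    converted = 0
    # sorted() takes a snapshot, so files created below are not revisited.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix:
            continue
        data = path.read_bytes()
        # Normalise to LF first so existing CRLF pairs are not doubled.
        data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
        path.with_name(path.name + ".txt").write_bytes(data)
        path.unlink()
        converted += 1
    return converted
```

Working on bytes rather than text avoids guessing at the character encoding of decades-old articles.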

I’m still thinking this thing is far too rickety to put on the internet, but we’ll see.

Fun with Apache, (mod_proxy, mod_rewrite), stunnel, And AltaVista Personal search

As you may remember from my prior attempt at using AltaVista Search, I ran out of space, and found out it only serves pages on 127.0.0.1:6688 and is pretty much hardcoded to do so.  It’s a “fine” hybrid Java 1.01 application, with the bulk of it being Java.  I finally got around to setting up a VM, unpacking all of the utzoo archives, and indexing them.  I should have done something about the IO because this took too long (KVM).

SIXTEEN HOURS!!!

So to cheat the system, I installed stunnel as a simple https to http proxy, which let me access my search VM anywhere.  However it still embedded 127.0.0.1 in all the pages.

via stunnel

Enter an Apache reverse proxy to talk to stunnel to talk to AltaVista search!

First to enable a few modules:

a2enmod substitute
a2enmod proxy
a2enmod ssl
a2enmod proxy_http
a2enmod rewrite

And adding this into the config:

SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s/1997/2016/ni"
Substitute "s/97/16/ni"
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
Substitute "s|launch=app||n"
Substitute "s|<a href=http://debian7/altavista/?pg=q&what=0&fmt=d|<!--|n"
Substitute "s|><strong>|-->|n"
Substitute "s|</strong></a>||n"
Substitute "s|>u:\|->u:\|n"

This let me redirect all of those requests into a VM called debian7 on the /altavista path.  I also copied the images to the apache server, and now I get something that looks correct!
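Most of those rules carry the n flag, which makes mod_substitute treat the pattern as a fixed string rather than a regex. A small Python sketch of the same idea, with two of the pairs copied from the config and a made-up function name rewrite_body, looks like this:

```python
# (fixed string, replacement) pairs copied from two of the Substitute rules.
RULES = [
    ("127.0.0.1:6688", "debian7/altavista"),
    ("launch=app", ""),
]

def rewrite_body(body: str) -> str:
    # str.replace matches literally, like the "n" flag, and each
    # rule sees the output of the one before it.
    for old, new in RULES:
        body = body.replace(old, new)
    return body

print(rewrite_body('<a href="http://127.0.0.1:6688/?pg=q&launch=app">'))
```

Fixed-string matching is why the dots in 127.0.0.1 need no escaping in those rules.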

Apache in the mix!

I cut the results short… But here is a search of something simple:

About 16598 documents match your query.

I also killed all the ‘working’ URLs that simply open a desktop application on the index ‘server’.  Naturally it was a personal service, but as a server this isn’t any good.  As such you can’t click on any search results now.  I need to figure out something else to take the result blocks like “u:\b128\comp\databases\2852” and turn them into URLs.

Also, as little as I want to re-index, it would be best to cut off the headers, or most of them, so the preview lines make sense.  Xref, Path, even From & Newsgroups don’t interest me.
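Trimming the headers before indexing could look something like the Python sketch below. It assumes each article uses "Name: value" headers followed by a blank line, which may not hold for the oldest articles in the archive. The keep-list and the name trim_headers are my own picks for illustration.

```python
# Headers to keep. Everything else (Xref, Path, From, Newsgroups, ...)
# is dropped. This keep-list is an assumption, not from the post.
KEEP = {"subject", "date"}

def trim_headers(article: str) -> str:
    # Headers end at the first blank line; the body is left untouched.
    head, sep, body = article.partition("\n\n")
    kept = []
    keeping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: follows the fate of its header.
            if keeping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        keeping = name in KEEP
        if keeping:
            kept.append(line)
    return "\n".join(kept) + sep + body
```

Only the text before the first blank line is filtered, so a body line that happens to start with "From:" survives.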

I hate to leave it as ‘good enough’ but if anyone has a solution…. I’ll be glad to make this wonderful resource available!

AltaVista Personal Indexer

Probably not a good idea..

I never got into the whole ‘desktop search’ thing as I used to know where my stuff was.  But now we live in the future, where not only can you just go out and buy terabytes of storage, but downloading 10 years’ worth of Usenet is something you can accomplish in a few minutes (on a good connection), and decompressing it into some 2,070,332 flat files takes only 20 minutes.  It’s really cool to live in the future.

Total Files Listed: 2070332 File(s) 5,429,376,673 bytes 
                    168164 Dir(s) 1,119,884,468,224 bytes free

Now what about finding something in those files?

I should be embarrassed as I was using grep.

Yes in my hunt for obscure information grep was my tool of choice.

So after Frank asked in passing whether I’d ever used AltaVista Personal Search 97, I thought I’d give it a bit of a test.  First I unpacked some BSD source code, and had it index that.  The results were incredibly FAST.  So the next thing to do was to try the UTZOO archives.  I should have expanded my NT 4.0 VM’s disk first, but I got this far before I was down to 200MB of free disk space.

I should add that I’m sharing the UTZOO archive over the network.  Not the fastest way at all.  And I only made it about 40% of the way through the archive.  Even at this point the search database is only 1.2GB.

So how does it run?  Well it’s a localized web service that resides on your desktop.  Of course it only works when you request from 127.0.0.1, as they sold a network searchable version of AltaVista, the Workgroup Edition.  Even this was a retail product at one point, retailing for $29 to $35.

Show me the Xenix!

So you hit the web page, type in your search, and you get answers almost immediately.  It really is scary how fast this thing is.  The results can need a lot of tweaking, but we are talking 800,000 files.

But needless to say there was the disastrous Compaq buyout of DEC, and the entrance of Google, and it was over.  From what I understand people are still selling the workgroup/enterprise search.  I can see why: even though the 97 version is rough, it still has promise.

What a bargain!

For anyone who cares, it’s geared to Windows 95 or Windows NT 4.0; 2000 and beyond is at your own risk.  It uses a Win16 setup program, so Windows 7 x64 was out of the question, but you can download it here.

As part of the retrochallenge 2012, there is a PDP-11 running 2.11 BSD out there!

No, really!

You can get an account, just sign up here!

Sander Reiche has set up a MicroPDP-11/83 with the following specs:

So far there are FOUR users.. which means you can get in on the action for sure!

For those of you who want a sandboxed version at home, you can download my install here, which of course I touched on a while back.

For those unfamiliar, here is what retrochallenge is all about!

  1. RetroChallenge commences July 1st, 2012 and runs until July 31st, 2012.
  2. In order to qualify, computer systems must be approximately 10 years old (or older!)… in general, this means 486 or below, 680x0 and pretty much everything with an 8-bit processor, but we’ll also let you in if you have an old Cray kicking about, and exceptions can always be made for exotica!
  3. Gaming consoles and PDAs qualify if they were made in the previous century.
  4. Where appropriate, replica hardware and emulators may be used.
  5. Entrants are responsible for adequately documenting their projects and submitting occasional updates during the contest.
  6. Projects may encompass any aspect of retro-computing that tickles the fancy of the individual entrant.
  7. Winners will be carefully selected and thoughtfully chosen prizes presented (hopefully before the next challenge commences).
  8. Have fun!

Sadly I don’t have anything physical around here that really qualifies.  A G5 Mac is too new, and I recently picked up a Pentium 150 based IBM Aptiva, but it’s too new apparently….

DEC Legacy Event

Well I just found out about a “DEC Legacy Event” being held in the UK. Sadly I already booked tickets to the UK *this* month not the correct one… But then who knows… 😉

From the site:

The DEC Legacy Event that will take place on the 17th & 18th April 2010 in Windermere, UK.

The purpose of the event is to bring together people with an interest in the company Digital Equipment Corporation and their legacy of hardware, software and ethos. There will be both vintage DEC computer hardware and software and more recent HP hardware and software being demonstrated at the event.

I suppose this would have been the place to get some win terminals going, and have multi-user access into a VMS system running on SIMH on an Alpha…

Oh well…

At any rate they promise to upload video from the aftermath, and they’ve got up some interesting promo pics.