Fun with regex substitutions in Apache

Continuing from my previous post, I was now able to access my AltaVista server, however from a web browser I was unable to actually view any of the documents remotely.

In the pages though I did get the MS-DOS path to the usenet article in question:

Now how do I turn that into a URL?

Well as it turns out mod_rewrite does support regex, which in turn can do variable re-ordering!

After a bit of googling I found this page on stackoverflow, on how to convert a date between UK/US formats:

s/(\d{4})-(\d{2})-(\d{2})/$1-$3-$2/

Simple, right?  So what is going on here?  The parenthesis define a variable set, and on the substitution part you can recall them with $1, $2 , $3 etc.  So using this recipe I could take something like this:

u:\b227\comp\sys\laptops\3080

and convert it into the following:

http://debian7/usenet/b227/comp/sys/laptops/3080

The code for this would look something like this:

Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href="\"http://debian7/usenet/$1/$2/$3\""]Click for article|"

Although for some reason it’s embedding the URL’s even though I specified code formatting.

Now all I had to do was install IIS 4.0 off the Option Pack CD-ROM, onto my Windows NT 4.0 workstation, and create a virtual directory of /usenet which then pointed to the U: drive where AltaVista did it’s indexing.

So to this point that gives me a config file much like this:

ServerAdmin webmaster@localhost
DocumentRoot /var/www
SSLProxyEngine On
ProxyPass "/altavista/" "https://10.12.0.16"
ProxyPassReverse "/altavista/" "https://10.12.0.16/"
ProxyRequests Off
RewriteEngine On

SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
AddOutputFilterByType SUBSTITUTE text/html
#clean up urls
Substitute "s|127.0.0.1:6688|debian7/altavista|n"
Substitute "s|file:///C:\Program Files\DIGITAL\AltaVista Search\My Computer\images\|http://debian7/images/|n"
#protect the page
Substitute "s|launch=app||n"
Substitute "s|?pg=config&what=init|?pg=h|n"
#fix title
Substitute "s|<IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0>|<a href=\"http://debian7/altavista\"><IMG src=\"http://debian7/images/av_personal.gif\" alt=\"[AltaVista] \"  BORDER=0 ALIGN=middle HEIGHT=72 VSPACE=0 HSPACE=0>|--->|n"
Substitute "s|u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"
Substitute "s|>u:.([a-z]{1,}[0-9]{3,})\\\([0-9a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3\">Click for article|"
# Need links for the u:\news097f1\b120\comp\society\futures\1122
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7/$8\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6/$7\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5/$6\">Click for article|"
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z]{1,})\\\([a-z]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4/$5\">Click for article|"
# Need links for  u:\news002f1\b1\fa.poli-sci\8
Substitute "s|>u:.(news[0-9]{3,}f[0-9])\\\([b0-9]{1,})\\\([a-z\.\-]{1,})\\\([0-9]{1,})|---><a href=\"http://debian7/usenet/$1/$2/$3/$4\">Click for article|"

<Location /usenet/>
    ProxyPass  http://10.12.0.16/usenet/
    RewriteEngine On
    SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
    AddOutputFilterByType SUBSTITUTE text/html
</Location>

bla bla rest of the 000-default crap....

Simple right?

Searching for AltaVista

Searching for AltaVista

So now I get a nicely formatted page, I can click the mountain icon, and I jump back to home, and I can click on the articles and, because I have no extensions or MIME types to intercept it’ll just download them to my PC.  I guess I need to go through them all, convert them from UNIX format to MS-DOS, and stick a .txt extension on every single one of them.

I’m still thinking this thing is far too rickety to put on the internet, but we’ll see.

2 thoughts on “Fun with regex substitutions in Apache

    • I tried adding a mime type for ‘*’ to map to text/plain but that didn’t help. I was looking and the easiest way out is to simply rename every file. I think I can do it via perl, maybe even convert them all to DOS text format too. And with the regex just append .txt to everything that is a link , and it should magically work.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.