The more they overthink the plumbing, the easier it is to stop up the drain.
-Star Trek III
Well as of late we’ve been fighting this aging SQL server at work. It was originally a NT 4.0 server with SQL 6.5 upgraded to 7.0 then 2000 with Windows 2000. It also had to do some work with Oracle, and the dba’s were using MSDTC to make sure their transactions were getting committed into Oracle. Oh and to keep it ‘fresh’ the Oracle client was version 7 as it originally was talking to an Oracle 7 DB.
Well as the years go by, that Oracle 7 DB became Oracle 9i (already obsolete!), and we suddenly hit a transaction wall.
And along the way we virtualized the server to go into our VMWare ESX 2.x cluster, and it’s been since migrated to VMWare ESX 3.5
The server was dropping tens of thousands of these XA???????????????????.trc files, into the \winnt\system32 directory.
Well naturally you’ll eventually hit this wall of how many 8.3 translations you can do before the system CRAWLS. And boy oh boy did we hit that wall. So at first my idea was to delete all of these trc files, and let it live, but that’s not such a ‘great’ idea… As this reeks of a fundamental problem.
So the ‘first’ step in all of this madness was to up the OS to Windows 2003 enterprise (it was 2000 Advanced server before) And see how things were doing. The OS upgrade went smoothly I had slipstreamed sp2 into the update, so I only had some 90+ updates needing to be done once the OS had been upgraded. And for ‘good’ measure I thought I’d take the server from 768MB of ram up to 2GB, and set VMWare to allow 4 cpu’s for the database server. The node it was running on wasn’t doing terribly much so what the hell right?
The server was now performing markibly SLOWER… And yes, still dropping TRC files like there was no tomorrow.
After a bunch of digging around, I found out that in 2003 you have to click a box in the component manager to allow XA (cross architecture) transactions! Well now it wasn’t dropping as many XA trc files, but after watching it for a while, when two went to run at the same time, the SQL server would crash with a hex code saying it was out of memory.
Out of memory? I’d just given it more!?
So I did the ‘logical’ thing and gave the system 5 GB of ram, and enabled the /3GB flag in boot.ini
I moved SQL server up to using 2GB of ram (out of 5, sure why not?). Same error.
Well this sucked, so we tried to update the Oracle client from 7 to 9i. In the process I found I couldn’t un-install the 7 client, nor could the 9i thing just ‘upgrade’ it, 9i kind of installed in parallel. Which led us to our next major fault, after swapping out the new client, removing the Oracle product key from the registry, and re-linking the Oracle servers using a new registration string, now ALL of our transactions against the oracle servers were failing.
Thank goodness for google, as we were able to deduce that the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsfot\MSDTC\MTxOCI was populated with all the old Oracle 7 values….
OracleOciLib -> oci.dll
OracleSqlLib -> orasql9.dll
OracleTraceFilePath -> c:\oratrace
OracleXaLib -> oraclient9.dll
So changing it to reflect Oracle 9i and suddenly our transactions were running! Even two at a time!!
But there was no doubt about it the transactions were slower then hell. We had gone from 1 minute to 11 minutes on one, and 5 minutes to just under an HOUR.
I added MORE memory to only find the SQL server couldn’t see the network card. So I added another one, and it got even SLOWER.
So in a minute of panic, I reduced the ram back to 768MB, took the VM from 4 cpu’s to 2 cpu’s and forced SQL server to use a single processor.
And our timings are now fantastic! That 1 minute process can complete now in 12 seconds!!! The other process finishes in about a minute give or take, but it’s tremendously faster.
From what I can gather, since SQL is so IO bound the more ‘top’ hardware you give it, the harder it pushes the IO stalling itself… Naturally it’s different on a physical machine, but sometimes it’s interesting to see what happens.
And may this be a lesson, just because it can emulate multiple CPU’s doesn’t mean it’ll run parallel things ‘better’…