This friday I finished up some work on a new version of one of Stardock's products, which'll probably see the light shortly after the company finishes moving to Plymouth. So what do geeks do with their down-time? Well, in my case, it's often pretty much the same to what I do for money, only for the communities I'm interested in. Recently, a lot of my time has been spent on the Creatures Community, the group of people who've played the Creatures series of artificial life games. When I'm not contributing to the Creatures Wiki, I'll be writing some sort of tool, like this sprite thumbnail viewer, or polishing the next version of JRNet. But enough about my projects, as this one is actually about someone else's . . .
GEL is a genetic editor for Creatures 2. It is used to edit the genes for the various creatures (Norns, Ettins and Grendels). There are other editors, but people get attached to their favourite programs, and GEL is no different.
The trouble is, although GEL worked great on Windows 98, it didn't seem to want to work on XP. OK, so that wasn't great, but the real problem was that the source code - the words that the programmer types and gives to the compiler to turn into a program - had been lost in a hard disk accident, and so he couldn't fix the problem. Without the source code, you just have the "compiled" version, and it's very hard to make any changes to that.
There were several people upset about this, though and it's always a shame to lose a useful program, so I decided to see if I could do something about it. Overall it took about two days of work to get it back up and running, which I thought was pretty good. I figured it would be kinda neat to tell you how I did it, and show you some of the different tools I used, so I wrote this article. Just skip over the bits that get too technical!
Let's start with what I got when I installed the program and tried starting it up myself:
OK. I got this error, and then when I pressed OK it closed on me. That didn't work out that well. I'm sure you've seen similarly confusing errors on your own computers! Turns out, it's not always easy for programmers to figure out what it means, either . . .
So, we have a problem. But where? "Path not found" isn't a very helpful message - it doesn't tell you what path, for a start! I decided this would be the first thing to try and find out, so I started up FileMon, a utility that monitors what files are accessed by running programs. I was looking for any "not found" messages, and there were a few, but they all turned out to be dead ends.
By now it was clear that it wasn't going to be as easy as a missing file or a permission problem. The next thing I tried was another of the Sysinternals tools, RegMon. This does much the same thing as FileMon - monitor what's happening - but for the Windows Registry, so you can see what settings are being written and read. I consider both of these tools essential if you want to know what's really going on.
This was the last registry read before the error
As it happens, RegMon did turn up something - the last thing that GEL read before it all started to go wrong was the main path of Creatures 2. The thing was, this registry read didn't fail. This just happened to be the last thing that it did with the registry that I couldn't narrow down to other causes. I did try modifying this value in the registry, but this just resulted in slightly different errors.
After that, I briefly tried using another utility called API Monitor in order to see what calls the program was making to the operating system. This program is rather like a general version of Regmon and Filemon - while they monitor specific things, API Monitor "hooks" pretty much every system function that there is and records their use. Unfortunately, I couldn't find what I was looking for; I later found out that it didn't even start sending messages until a window had been created.
A small aside . . .
One thing I did notice using API Monitor was the amazing number of calls that
were made - almost
ten thousand of them - just to display an error box on the screen.
Of course, it wasn't just displaying the box, it was:
- Loading the code to process the error
- Looking up the error message
- Playing the error sound, which meant:
- Loading sound libraries
- Checking what sound devices were available
- Figuring out what sound was the error sound
- Loading and playing the sound
- Loading up the screen reader library in case it needed to read the message to me
- Lots of other things that might have been useful if it had actually done anything after showing the error
There's a simple reason CPUs keep getting faster - they have to, because if they didn't, there's no way we'd be able to use all the things our computers do for us! The really crazy thing is that displaying such a box (with all the above features) just takes one line of code; something like this:
MsgBox "Hi there! This is the message" & vbNewLine &, vbInformation, "This is the title!"
Truly, we live in an age of wonders.
To recap: I'd found it wasn't a case of failed registry entries or a file not being there. It was time to bring out the big guns.
My first tool is recognizable to pretty much all Windows programmers, even if they don't use it themselves - Microsoft's Visual Studio. This is the number one tool for Windows development, and although it has its detractors, it's pretty good as development environments go. I would use this to run the program and stop it halfway, examining and changing the memory that it used.
The second might be a little less familiar to most programmers - IDA, the Interactive Disassembler. A disassembler is a program that turns compiled programs back one step into "assembly code", the last point at which it can be considered remotely readable. Few programmers actually write code at this level - most use a higher-level language like C++, Pascal, Java or Visual Basic - but it is usually possible to get a good idea of how parts of a program works through reading it in assembly.
Disassembling programs (also known as reverse-engineering) is something of a shady activity - one of IDA's most popular uses (though not one they advertise) is to figure out how to get around serial code checks, and this is one reason why disassembly is forbidden in most software licenses. However, all tools have their uses, and when you need to know exactly what a program is doing in order to fix it, but don't have the source, a good disassembler is a requirement.
Anyway, I started the program running in the Visual Studio debugger - a mode in which you can control exactly how a program executes, and modify the variables it is using - and ran through the code to see where the problem occurred. It was pretty easy to see what part of the error was in - a file called glsupcts.dll that came with the program. To see exactly what the code did, I set IDA running on it; after a few minutes it had an assembly listing of the code ready for me to read.
A note about DLLs
DLLs are not that much different from EXEs - they're both files that contain "code" (and sometimes other things like icons or embedded sounds). The main difference is that the EXE files contain the bit of code that starts the whole thing going, whereas DLLs tend to get called up by those EXEs to do their share of the work.
Of course, the assembly code wasn't actually all that easy to read. Something that made it even difficult was that the program had been written in Visual Basic, a language that I like which has a very easy to use system of programming, but which is often more general than required. As a result, it often did things in an odd way, and the code made a lot of calls to functions in the Visual Basic library. Of course, since these library calls were not documented, I ended up having to decompile this library as well, just to figure out what the program was doing! Hopefully nobody from Microsoft who cares is reading this.
Reading through the IDA output, I found the check for the registry value just before the error occured. It certainly seemed these were linked in some way. Then I found a reference to "AllChemicals.str", a file that contains the names of chemicals in Creatures. It made sense that GEL would try to load this file, so that it knew what each of the chemicals was called!
Now I had a clue - since I knew from reading the FileMon output that it never actually managed to load that file, it was probably failing while trying to. Using Visual Studio to look at the memory when the program crashed, I saw there was something odd about the path it had given to the "open file" function. It started off fine, but the end didn't look at all right. Here was my problem!
The system had used part of the memory given to it to work out the path (see the end for details), and GEL had thought this was part of the path itself. It was all clear now - the buffer was not being trimmed of the working copy, and this was getting left after the path name, so when the program put "AllChemicals.str" on the end, the middle of the path was invalid. This was the reason it wasn't showing up on FileMon - it didn't even get to the point where it looked for the file on the disk.
So what could I do? Well, I knew it was trimming off the last part of the string - the trouble was, it thought it was twice as big as it actually was, so it was keeping twice as much as it should. The length had to be stored as a number somewhere. Eventually I found the number being returned from a call to a function called vbaLenBstr
- which naturally calculated the length of the incorrect path. Now I just needed to divide it by two and it would only use the correct portion of the string.
Remembering my computer operations, I knew that the best way to divide by two was to shift the number to the right. What does this mean? Well, you can think of numbers inside a computer as being like a group of people all standing in a row, with flags with numbers on - starting from the right, they'd go 1, 2, 4, 8, 16 . . . all the powers of 2. When you shift right, the people all look at whoever's holding the next-highest flag, and do what they're doing. It looks like this:
Flag num 128 64 32 16 8 4 2 1
Before: 1 0 1 0 1 0 1 1 = 128 + 32 + 7 + 2 + 1 = 171
After: 0 1 0 1 0 1 0 1 = 64 + 16 + 4 + 1 = 85
Voila - division by two! Of course, you lose any remainder, since there's no 0.5 flag. Fortunately there's no such thing as half a character.
Of course, I'm not a whiz at assembly, so I had to look up exactly how to do the shift - I actually found the one I needed elsewhere in the code, so I could just use that. Now I had my instruction, and I knew where it had to go. It should be simple from here, right?
Well, no. The trouble is, you can't just add another instruction to the middle of a compiled program, moving all the others along. It would be like rearranging pages in a book and not updating the index (which is regenerated each time you "compile" a book). Worse, since machine instructions usually take more than one byte, moving them means instructions would start in the wrong place, changing their whole meaning - imagine what would happen if you kept all the spaces in a book in the same place but moved each letter along one position! When things get out of order in a computer, programs crash.
One thing I could have done would be to overwrite what was there already (perhaps something that didn't matter much). It had been enough trouble figuring out what one piece of assembly did, though - I didn't want to have to go through that all over again!
Fortunately, I didn't have to try that, because there was a convenient area of NOP
instructions nearby. NOP
stands for "no op" - it's an instruction that does nothing but move onto the next instruction. This seems useless, but it can in fact be useful for various things.
In this case, it was useful because it meant I had some space to work with. Because this space was free, I could fill it with more code. I needed to add just one instruction, but to get to that instruction I needed to put a jump instruction in. I looked these up, and it turned out the one I needed was a whole five bytes long, including the place to jump to.
That meant I had to move the code it replaced down into the section of NOPs as well, after my right shift. I then needed to jump back up to the point after the first jump instruction, so that the code could continue as if nothing had happened.
So, after that, the big question is did it work?
. . .
Yes! It finally loads!
For those who've read this far, congratulations! I hope you found this little view into my world educational.
This is a bit more than you'd usually have to do when debugging a problem, but it's pretty representative of what most programmers do in real life - it's not all fast cars, mansions and stock options! Sure, you don't usually have to go as hardcore as writing assembly-code patches for broken DLLs (I'm sure I'm going to get nasty comments from the real hardcore folks out there who do stuff like this every day , but a lot of the time you're figuring out problems with existing code, not just writing new code.
Often it's not our code, either - it'll be written by someone else (who left six months ago) in a way that seems totally nonsensical. Sometimes you're right to think that, other times you just don't understand it yet; either way, you have to fix it, and probably add a few new things, too! Ahh, well, all in a day's work . . .
Bonus! (warning: very techy)
If the program was buggy, why did it work on Windows 98? Well, the difference is the way in which the operating system works with text. On Windows 98, the standard is to have one byte per text (UTF-8), which is nice and fast, but which means you only get 256 different characters to choose from at any one moment; not enough for many languages. In Windows XP, the standard (called UTF-16) is to have two bytes, which gives you 65536 characters, which is good enough for most purposes.
Asking for a value from the registry is done by preparing a memory buffer - an area of memory to hold the result - and telling the operating system how you want the data. Because GEL had to work on both operating systems, it used the Windows 98 method, asking for a UTF-8 string (step 1 on the text diagram), and expanded the text later (step 2). But on Windows XP, this meant that the operating system had to reduce the size of the text, since it stored everything in UTF-16.
What was the easiest way to do that? Why, to read the output into a buffer and then collapse it down to UTF-8 by copying each character back just the right amount (computers rarely "move" data, since it costs twice the time it takes just copying it - deletion is an extra step) - first 0 steps, then 1, then 2 . . .
Where should this conversion take place? Since the memory space to store the text in was already available, it used that, safe in the knowledge that any extra text in the buffer should be ignored by the program, which was told how much had been returned. At least, that's how it should have worked . . . but instead, the program used the whole string and just counted the length of it rather than relying on the value it was given. On Windows 98, that space was never filled with text, as it was UTF-8 to start with. Although the bug was still there, it had no actual effect, since counting the length of the text came up with the right answer.
By the way, this explains why the "extra" text is quadruple-spaced - it was UTF-16 to start with, and it was left there because it was the second half of the text, which was not overwritten by the UTF-8 version of itself. It was then incorrectly read as part of the UTF-8 string, and expanded again by GEL into UTF-32.
The irony is that if it had been read straight from the registry as UTF-16, no conversion would have been necessary, and the application would probably have worked. Such are the ways of code!