The Joy and Gibbering Terror of Custom-Loading Executables



Further Integration

Loading Entire Executables

Loading Windows Executables in Linux

Distributing

On a Silver Platter

Conclusion



Let's say you have an image processing library. You want to do things fast, and you want to do them flexibly, and you want your flexibility to be uniform. So you allow from 1 to 64 bits per component, floating-point in 32, 64, and 96 bits per component, RGB, YUV, L*a*b*, L*u*v*, L, indexed, YCbCr, CMYK, and RG since you're a fan of early colour film, and an option of an alpha channel. Did I forget anything? Ah, that's right, the speed!

Take a very common image operation – converting from one colour specification to another. With the options I've given, there are 1,453,230 possible combinations of features that require conversion. You could template that, of course. But it would take several hours to compile, could break the compiler outright, result in an executable that's hundreds of megabytes in size, and only a tiny fraction of that would usually be used by any given instance of your program. No, I'm afraid that what you're going to have to do is cut out some of the flexibility (perhaps one of the marginal colour-spaces, like RGB) and decode colours in a multi-stage switching process that treats all your colours like they were real L*a*b*. After all, it's not like you could generate the specific combination of code you'll need at runtime, right? That would be something like dynamic compilation, that ever-present menace of communist Russia!

Hail, comrade! Dynamic compilation isn't so bad; it's merely an extension of regular compilation, and it's a far cry from the opaque madness of mashing together a bunch of bytes which you're pretty sure represent assembly instructions. Moreover, DigitalMars' C and D compilers are fast enough to be able to generate efficient code for you at speeds that won't slow down your program's initialisation significantly while increasing the speed of its runtime. I'll even tell you how to turn compilers essentially into libraries!

Let's start with the example here. The core of it is very simple – you just need to generate code that implements a given combination of elements, get a compiler to generate an object file, load the object file, and cache the results. The first two and the final steps are banal; I'll concentrate on the object file here.
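As a rough illustration of the first step (generating code for one specific combination), here is a minimal sketch in C. The function name generate_converter and the shift-based conversion it emits are assumptions made up for this example, not the article's code; a real converter would also handle colour spaces, alpha, and proper rescaling.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical generator: writes C source for a function that widens or
   narrows one unsigned colour component from src_bits to dst_bits by
   shifting. Returns the number of characters written (excluding the
   terminating NUL), or -1 if the buffer is too small or a bit count is
   outside 1..64. */
int generate_converter(char *out, size_t out_size, int src_bits, int dst_bits)
{
    assert(out != NULL);
    if (src_bits < 1 || src_bits > 64 || dst_bits < 1 || dst_bits > 64)
        return -1;

    /* Only the shift direction and amount vary, so the compiled function
       contains no switch or loop at all. */
    const char *op = dst_bits >= src_bits ? "<<" : ">>";
    int amount = dst_bits >= src_bits ? dst_bits - src_bits
                                      : src_bits - dst_bits;

    int n = snprintf(out, out_size,
        "unsigned long long convert_%d_to_%d (unsigned long long v)\n"
        "{\n"
        "    return v %s %d;\n"
        "}\n",
        src_bits, dst_bits, op, amount);
    if (n < 0 || (size_t) n >= out_size)
        return -1;
    return n;
}
```

The generated text would then be written out, handed to the compiler, and the resulting object file loaded and cached under a key such as the pair of bit counts.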

Well, that's pretty banal too. On Windows, DMD produces OMF files as its object file format. Loading these files is fairly straightforward and you can ignore quite a number of the records (it's best to use object file examples to find out what you need to implement), particularly if you're only interested in running them from your code. When you get to the FIXUPP record, what should be kept in mind is that the useless “compression” it uses (compression that doesn't save any space, it's miraculous) is not used by DigitalMars, so don't waste your time implementing it. Otherwise it can be materially implemented in just two hundred lines; most of your time will be spent on making the abstract object file representation.
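To make the record structure concrete, here is a minimal sketch in C of walking the records of an OMF object file. It relies only on the general OMF layout: a type byte, a two-byte little-endian length that counts the trailing checksum byte, the record body, then the checksum. The names omf_walk, omf_record_fn, omf_type_log, and omf_log_type are invented for this illustration and do not come from the article's library. Checksums are skipped rather than verified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: return nonzero to stop the walk early. */
typedef int (*omf_record_fn)(unsigned char type, const unsigned char *body,
                             size_t body_len, void *user);

/* Visits every record in an OMF module. Returns the number of records
   passed to the callback, or -1 if a record is truncated. */
int omf_walk(const unsigned char *data, size_t size,
             omf_record_fn fn, void *user)
{
    assert(data != NULL && fn != NULL);
    size_t pos = 0;
    int count = 0;

    while (pos < size) {
        if (size - pos < 3)
            return -1;
        unsigned char type = data[pos];
        size_t len = data[pos + 1] | ((size_t) data[pos + 2] << 8);

        /* The length includes the checksum byte, so it is at least 1. */
        if (len < 1 || len > size - pos - 3)
            return -1;

        count++;
        if (fn(type, data + pos + 3, len - 1, user) != 0)
            return count;

        pos += 3 + len;
        if (type == 0x8A || type == 0x8B)   /* MODEND ends the module */
            break;
    }
    return count;
}

/* Example callback: remembers the type of each record it sees. */
struct omf_type_log { unsigned char types[16]; size_t count; };

int omf_log_type(unsigned char type, const unsigned char *body,
                 size_t body_len, void *user)
{
    struct omf_type_log *log = user;
    (void) body; (void) body_len;
    if (log->count < sizeof log->types)
        log->types[log->count++] = type;
    return 0;
}
```

Running a walker like this over a few compiler-produced object files is a quick way to see which record types actually occur and therefore which ones need implementing.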

Once you have the data loaded, you need to link it. I offer object files new symbols from the runtime incrementally, as they come up. (If there's any key lesson here, it's to not do any more than you absolutely need to, because you could spend months on what's being discussed without actually making what you need to do any better.) When an object file asks for something like “_D6object6Object8opEqualsMFC6ObjectZi”, I put an “extern (C) void D6object6Object8opEqualsMFC6ObjectZi ();” declaration in my code and send it a pointer to that. That way the export is deliberately fragile: if the signature of Object.opEquals changes, the mangled name changes with it and my program stops linking, so I won't mistakenly send my object files an incorrect pointer.
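The incremental export table can be sketched as a plain array of name and pointer pairs. This is an illustrative C version under assumed names (export_entry, resolve_symbol, sample_runtime_function, sample_exports); the article's own implementation is in D and may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*symbol_fn)(void);

/* One exported symbol: the name an object file asks for, and the
   address handed back for it. */
struct export_entry {
    const char *name;
    symbol_fn address;
};

/* Stand-in for a function in the host program's runtime. */
void sample_runtime_function(void)
{
}

/* The table grows by hand, one entry each time an object file asks for
   a symbol that is not yet exported. */
const struct export_entry sample_exports[] = {
    { "_sample_runtime_function", sample_runtime_function },
};

/* Returns the address exported under name, or NULL if nothing matches,
   in which case the caller should report an unresolved external. */
symbol_fn resolve_symbol(const struct export_entry *exports, size_t count,
                         const char *name)
{
    assert(exports != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(exports[i].name, name) == 0)
            return exports[i].address;
    }
    return NULL;
}
```

Because each entry points at a real declaration in the host program, a renamed or removed runtime function breaks the host's own build instead of silently producing a bad pointer.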

At that point you just need to grab a pointer to the function you want, and call it. Usually I'll use “extern (C)” linkage; it makes it easier to search for the symbol. The only other thing to keep in mind is that you don't want to keep a reference to an object file's function unless you also have a live reference to the object file itself; fiddling around with pointers like this is a good way to mess up the garbage collector and make it think the object file can be collected.

Further Integration

Let's say you're doing some dynamic loading, liking it, and want to be able to use plugins to interface with your program directly. Should you continue to use custom dynamic loading, or should you switch to DLLs?

In my experience you can continue to use dynamic loading, but you need to export more and more symbols from your main runtime, and they can become increasingly difficult to get a handle on (some symbols don't have legal identifiers, for example). Certainly you wouldn't want to manually export symbols from your own interface.

In those cases you have two options. You can go to DLLs where you need to somehow deal with exporting symbols yourself (although automated tools can find every symbol and export them), or you can turn your main executable into a stub that only knows how to call the compiler and load the entry point. You can also hybridise; a DLL runtime can LoadLibrary itself to serve symbols out to custom-loaded object files. In that case, you're avoiding the linker, which is harder to deal with, another external dependency you'll need to transfer with your program, and it slows down any dynamic compilation.

To be honest, in my estimation it's easier to load your program yourself and get the benefits of that than to deal with DLLs. All you'll really lose is the ability to share read-only data among multiple instances of your program which is, let's all be honest, an exceptional condition for most programs, and you'll gain the flexibility of doing things yourself.

Loading Entire Executables

Now this is just crazy talk, I know, but hear me out: there are very good reasons to custom-load an executable. For example, say you want to fake it out, such as making it work with your temporary files without creating them. Or you need to run it numerous times during the course of the program. Maybe you want to run a program in a non-native execution environment (such as using the speedy DigitalMars SC in Linux) without putting a dependency on a massive emulation layer.

Take my case, where I'm implementing the common scripting-language function “eval”, which evaluates a string as code and needs to run at interactive rates. Going through the OS's executable loader and the file system is twenty times as slow as loading the executable manually, after the initial loading. It is almost six times as slow in the first execution.

Not everything is suitable for being custom-loaded. If it's highly-dependent upon DLLs or calls processes itself you might have problems wrapping it; it might be necessary to wrap some of the DLLs as well. This is not a tool that can be applied generically to any program you'd like; it needs customisation. For example, you don't want to be stuck having to implement all the semantics of even a single Windows API function – you should only implement whatever is needed for the program you're trying to run.

The meat of the “portable executable” format that Windows uses is actually quite lean. The problem is that it's marbled with a ton of fat that you need to trim off, a legacy of it being a twenty-year-old format that was uncomfortably rigged with an overlay that has nothing to do with the original COFF. Here is the heart of the process intended as an accompaniment to Microsoft's specification.

First we read the four-byte file offset stored at 0x3C, seek to it, and check for the PE magic ('P', 'E', 0, 0) there; the COFF file header follows immediately. In the COFF file header we need the NumberOfSections (offset 2, ushort), SizeOfOptionalHeader (offset 16, ushort), and Characteristics (offset 18, ushort) fields. The characteristic IMAGE_FILE_RELOCS_STRIPPED (mask 1) must not be set, since we need the relocations to move the image. For the “optional” header you need to check whether this is PE32 or PE32+ (the magic at its start is 0x10B or 0x20B respectively); in the former case the optional header must be at least 96 bytes, in the latter case it must be at least 112. Of the optional header fields, you need only AddressOfEntryPoint (offset 16, uint), ImageBase (offset 28/24, uint/ulong), SizeOfImage (offset 56, uint), and NumberOfRvaAndSizes (offset 92/108, uint).
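The header fields listed above can be pulled out of an in-memory copy of the file with a few little-endian reads. The following C sketch uses the offsets given in the text and in Microsoft's specification; the struct pe_info and the function pe_parse_headers are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pe_info {
    uint16_t number_of_sections;
    uint16_t size_of_optional_header;
    uint16_t characteristics;
    int is_pe32_plus;
    uint32_t address_of_entry_point;
    uint64_t image_base;
    uint32_t size_of_image;
    uint32_t number_of_rva_and_sizes;
};

static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Returns 0 on success, -1 if the file is malformed or truncated, and
   -2 if relocations were stripped (the image cannot be moved). */
int pe_parse_headers(const unsigned char *file, size_t size,
                     struct pe_info *out)
{
    assert(file != NULL && out != NULL);
    if (size < 0x40)
        return -1;

    /* Offset 0x3C holds the file offset of the "PE\0\0" signature. */
    size_t pe = rd32(file + 0x3C);
    if (pe > size || size - pe < 4 + 20)
        return -1;
    if (file[pe] != 'P' || file[pe + 1] != 'E' || file[pe + 2] || file[pe + 3])
        return -1;

    const unsigned char *coff = file + pe + 4;
    out->number_of_sections = rd16(coff + 2);
    out->size_of_optional_header = rd16(coff + 16);
    out->characteristics = rd16(coff + 18);
    if (out->characteristics & 1)       /* IMAGE_FILE_RELOCS_STRIPPED */
        return -2;

    size_t opt_off = pe + 24;
    if (out->size_of_optional_header < 2 ||
        size - opt_off < out->size_of_optional_header)
        return -1;
    const unsigned char *opt = file + opt_off;

    uint16_t magic = rd16(opt);
    if (magic != 0x10B && magic != 0x20B)
        return -1;
    out->is_pe32_plus = magic == 0x20B;
    if (out->size_of_optional_header < (out->is_pe32_plus ? 112 : 96))
        return -1;

    out->address_of_entry_point = rd32(opt + 16);
    out->image_base = out->is_pe32_plus
        ? (rd32(opt + 24) | ((uint64_t) rd32(opt + 28) << 32))
        : rd32(opt + 28);
    out->size_of_image = rd32(opt + 56);
    out->number_of_rva_and_sizes = rd32(opt + (out->is_pe32_plus ? 108 : 92));
    return 0;
}
```

The data directory starts right after these fixed fields (at offset 96 or 112 in the optional header), and the section table starts at the end of the optional header.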

Then you need to load the optional header data directory, then the section table. Generate an array that is SizeOfImage (from the PE header) bytes long, and load the sections, using the PointerToRawData to seek to that point in the file, then loading SizeOfRawData bytes into VirtualAddress in the array you created. If the section's Characteristics field has IMAGE_SCN_MEM_WRITE (mask 0x8000_0000) set, then be sure to make a copy of what's there; you'll need to overwrite the section before every run with the original data.
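Section loading amounts to one bounds-checked copy per section header. Here is a minimal C sketch, assuming the whole file is already in memory and that the caller has located the section table; pe_load_sections is a hypothetical name, and the field offsets within each 40-byte section header (VirtualAddress at 12, SizeOfRawData at 16, PointerToRawData at 20, Characteristics at 36) come from Microsoft's specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_SCN_MEM_WRITE 0x80000000u
#define SECTION_HEADER_SIZE 40

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Clears the image, then copies each section's raw data from the file
   to its VirtualAddress inside the image. Returns the number of
   writable sections (the ones that need saving for later runs), or -1
   if any section lies outside the file or the image. */
int pe_load_sections(const unsigned char *file, size_t file_size,
                     const unsigned char *table, unsigned count,
                     unsigned char *image, size_t image_size)
{
    assert(file != NULL && table != NULL && image != NULL);
    int writable = 0;

    /* Anything not backed by file data (such as .bss) starts as zero. */
    memset(image, 0, image_size);

    for (unsigned i = 0; i < count; i++) {
        const unsigned char *s = table + (size_t) i * SECTION_HEADER_SIZE;
        uint32_t va = rd32(s + 12);
        uint32_t raw_size = rd32(s + 16);
        uint32_t raw_ptr = rd32(s + 20);
        uint32_t flags = rd32(s + 36);

        if (raw_ptr > file_size || raw_size > file_size - raw_ptr)
            return -1;
        if (va > image_size || raw_size > image_size - va)
            return -1;

        memcpy(image + va, file + raw_ptr, raw_size);
        if (flags & IMAGE_SCN_MEM_WRITE)
            writable++;
    }
    return writable;
}
```

On a real system the image buffer also has to live in executable memory, which this sketch leaves to the caller.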

Now you need to jump to two entries in the optional header data directory. The second entry (index 1) is the import table, which is documented in the “.idata” section. You need all the information in that section so I won't comment on it, except to say that you read from the import lookup table, and write to the import address table; it's pretty basic. You may want to wrap imports to insert your own functionality, and you can do so by linking in pointers to those functions rather than the DLL import.
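For a PE32 image, the read-from-lookup, write-to-address-table loop can be sketched as follows in C. It assumes the sections are already loaded so that RVAs index straight into the image, handles imports by name only, and uses invented names throughout (pe32_bind_imports, import_resolver, sample_resolver). The 20-byte import directory layout is taken from Microsoft's specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: returns the 32-bit address to bind for the
   named import, or 0 if the symbol is unknown. */
typedef uint32_t (*import_resolver)(const char *dll, const char *name,
                                    void *user);

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static void wr32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) v;         p[1] = (unsigned char) (v >> 8);
    p[2] = (unsigned char) (v >> 16); p[3] = (unsigned char) (v >> 24);
}

/* Returns the NUL-terminated string at rva, or NULL if it runs off the
   end of the image. */
static const char *image_string(const unsigned char *image, size_t size,
                                uint32_t rva)
{
    if (rva >= size || memchr(image + rva, 0, size - rva) == NULL)
        return NULL;
    return (const char *) (image + rva);
}

/* Returns the number of imports bound, -1 for a malformed table, -2 for
   an import by ordinal (not handled here), -3 for an unknown symbol. */
int pe32_bind_imports(unsigned char *image, size_t size, uint32_t import_rva,
                      import_resolver resolve, void *user)
{
    assert(image != NULL && resolve != NULL);
    int bound = 0;

    for (uint32_t dir = import_rva; ; dir += 20) {
        if (dir > size || size - dir < 20)
            return -1;
        uint32_t lookup = rd32(image + dir);        /* import lookup table */
        uint32_t name_rva = rd32(image + dir + 12); /* DLL name */
        uint32_t iat = rd32(image + dir + 16);      /* import address table */
        if (lookup == 0 && name_rva == 0 && iat == 0)
            return bound;                           /* terminating entry */
        if (lookup == 0 || iat == 0)
            return -1;

        const char *dll = image_string(image, size, name_rva);
        if (dll == NULL)
            return -1;

        for (;; lookup += 4, iat += 4) {
            if (lookup > size || size - lookup < 4 ||
                iat > size || size - iat < 4)
                return -1;
            uint32_t entry = rd32(image + lookup);
            if (entry == 0)
                break;
            if (entry & 0x80000000u)
                return -2;
            /* The entry points at a two-byte hint followed by the name. */
            const char *name = image_string(image, size, entry + 2);
            if (name == NULL)
                return -1;
            uint32_t address = resolve(dll, name, user);
            if (address == 0)
                return -3;
            wr32(image + iat, address);
            bound++;
        }
    }
}

/* Example resolver: knows one function and reports a made-up address. */
uint32_t sample_resolver(const char *dll, const char *name, void *user)
{
    (void) dll; (void) user;
    return strcmp(name, "ExitProcess") == 0 ? 0x00001234u : 0;
}
```

A resolver like this is the natural place to substitute wrapped functions: it can return the address of your own replacement for a given name and fall back to the real DLL export for everything else.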

Finally you need to relocate the image. The image was originally linked to a certain address given as ImageBase in the PE header. You need to iterate over the entries in the base relocation table (index 5 in the data directory, documented in the “.reloc” section) and add the difference between the start of your array and ImageBase to the values they point to.
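Applying base relocations to a PE32 image is a short loop over blocks of 16-bit entries. The C sketch below follows the block layout in Microsoft's specification (a page RVA, a block size that includes the eight header bytes, then entries whose top four bits are the type and low twelve bits the offset in the page). It handles only IMAGE_REL_BASED_ABSOLUTE (0, padding) and IMAGE_REL_BASED_HIGHLOW (3); pe32_relocate is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static void wr32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) v;         p[1] = (unsigned char) (v >> 8);
    p[2] = (unsigned char) (v >> 16); p[3] = (unsigned char) (v >> 24);
}

/* delta is (address where the image really lives) - ImageBase, computed
   with 32-bit wraparound. Returns the number of fixups applied, -1 for
   a malformed table, -2 for a relocation type this sketch ignores. */
int pe32_relocate(unsigned char *image, size_t image_size,
                  uint32_t reloc_rva, uint32_t reloc_size, uint32_t delta)
{
    assert(image != NULL);
    if (reloc_rva > image_size || reloc_size > image_size - reloc_rva)
        return -1;

    int applied = 0;
    uint32_t pos = 0;
    while (reloc_size - pos >= 8) {
        const unsigned char *block = image + reloc_rva + pos;
        uint32_t page = rd32(block);
        uint32_t block_size = rd32(block + 4);
        if (block_size < 8 || block_size > reloc_size - pos)
            return -1;

        for (uint32_t i = 8; i + 2 <= block_size; i += 2) {
            uint16_t entry = rd16(block + i);
            unsigned type = entry >> 12;
            uint32_t target = page + (entry & 0x0FFFu);

            if (type == 0)          /* padding entry, nothing to do */
                continue;
            if (type != 3)
                return -2;
            if (target > image_size || image_size - target < 4)
                return -1;
            wr32(image + target, rd32(image + target) + delta);
            applied++;
        }
        pos += block_size;
    }
    return applied;
}
```

For example, an image linked at 0x00400000 and loaded into a buffer at 0x10000000 would be relocated with a delta of 0x0FC00000.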

We're now done dealing with the image and can get around to executing it. For this we have to keep a few considerations about the Windows API in mind. So far I only have experience in working with DigitalMars' CRT; other CRTs will almost certainly have additional rules, but they should not be contradictory.

First we need to deal with exception handling. The process uses the same exception handling D does, but if you throw while in the program Windows will say that it crashed because its default handler will catch it. We need to do something like this when we call into the process:

asm
{
// Grab the stack-relevant registers and FS:0, which is the last item in
// the exception-handling chain.

mov WrapEBP, EBP;
mov WrapESP, ESP;
mov EAX, FS:0;
mov WrapFS0, EAX;

// This does not usually directly return.
call entry_point;

// Copy off the exit code.
mov exit_code, EAX;
}

Now in order to throw we need to simply restore FS:0 before throwing the exception:

asm
{
mov EAX, WrapFS0;
mov FS:0, EAX;
push exception;
call _d_throw;
}

CloseHandle: The CRT may call CloseHandle on the standard handles (standard input, output, error) before exiting. This will prevent your program from being able to print anything after running the program. I overload GetStdHandle and return its argument (the standard-handle identifier) instead of the actual handle, then overload WriteFile and WriteConsoleA (which I need to do anyway for error reporting), and ignore the CloseHandle request when it's trying to close any of these.

ExitProcess and Exception Handling: ExitProcess is how any DigitalMars CRT program returns to your control flow. You need to get from that function call all the way back in the stack frame to where you enter the program. You can do this through an exception, but it's just as easy to do it by unwinding the call frame stack:

asm
{
// Store the exit code in EAX.
mov EAX, exit_code;

// Get back the registers we originally stored.
mov EBP, WrapEBP;
mov ESP, WrapESP;
mov ECX, WrapFS0;
mov FS:0, ECX;

// Jump to the return EIP.
mov EDX, [ESP-4];
jmp EDX;
}

GetCommandLineA: You need to overload this if you're running a command-line program. Nothing crazy about this.

CloseHandle, CreateFile, GlobalAlloc, GlobalFree, HeapCreate, HeapDestroy, VirtualAlloc, VirtualFree: You need to track these resources and free them if the program doesn't when it ends. There will be many more resource types to track if the program is windowed.

Now you should have a nice little program that runs inside your own program. The only thing to keep in mind is to reset the writable segments before running it again; other than that it'll have no idea it's being run multiple times.
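Resetting a writable section between runs is a save-and-restore of a byte range. Here is a small C sketch with invented names (section_snapshot, snapshot_take, snapshot_restore, snapshot_free). It assumes the snapshot is taken once the image is fully prepared, meaning after imports and relocations have been applied, so that restoring does not undo those fixups; that ordering is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A saved copy of one writable section of the loaded image. */
struct section_snapshot {
    size_t offset;          /* section's VirtualAddress within the image */
    size_t size;            /* number of bytes saved */
    unsigned char *saved;   /* heap copy of the pristine contents */
};

/* Saves image[offset .. offset + size). Returns 0 on success, -1 if the
   range is outside the image or memory runs out. */
int snapshot_take(struct section_snapshot *snap, const unsigned char *image,
                  size_t image_size, size_t offset, size_t size)
{
    assert(snap != NULL && image != NULL);
    if (offset > image_size || size > image_size - offset)
        return -1;
    snap->saved = malloc(size ? size : 1);
    if (snap->saved == NULL)
        return -1;
    memcpy(snap->saved, image + offset, size);
    snap->offset = offset;
    snap->size = size;
    return 0;
}

/* Puts the saved bytes back; call this before every run after the first. */
void snapshot_restore(const struct section_snapshot *snap,
                      unsigned char *image)
{
    assert(snap != NULL && snap->saved != NULL && image != NULL);
    memcpy(image + snap->offset, snap->saved, snap->size);
}

void snapshot_free(struct section_snapshot *snap)
{
    assert(snap != NULL);
    free(snap->saved);
    snap->saved = NULL;
    snap->size = 0;
}
```

One snapshot per section flagged IMAGE_SCN_MEM_WRITE is enough; read-only sections never change and can be left alone.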

Loading Windows Executables in Linux

As said, it may be beneficial to you to be able to run simple Windows executables in Linux without requiring Wine; it simplifies your distribution and it's a fairly easy process once you've already got executable wrapping working in Windows. So what do you need to do?

Your first two concerns will relate to exception-handling, which is handled completely differently in Linux. Rather than creating a new chain entry and storing it in FS:0, Linux unrolls the call frames and finds out whether anyone wants to catch an exception by searching for what function any given piece of code is within. This lowers the flexibility somewhat, but functions which use exception-handling code don't have any overhead. The relevant point here is that if you throw an exception in our synthetic process it won't be handled properly because Linux will have no idea where you are and so won't be able to unroll the call frames.

This means we neither need to nor can copy FS:0, and that when we throw an exception we need to return EBP and ESP to their stored values. Other than that, both ExitProcess and wrapped throw can be done in the same way between Windows and Linux.

What'll really get you is FS:0. All segment registers in Linux point to the same thing, and segment registers themselves index a descriptor table that the OS is in control of and there isn't any standardisation. You need to make a syscall; in Linux (but not in FreeBSD or BeOS) you store the desired syscall in EAX, then any additional parameters in EBX, ECX, and EDX in that order, switching to a struct if you need more (which is rare as a result). It takes more to describe using it than it does to implement, so take a look at my source (search for allocate_ldt in exe.wrap).

Now you'll just need to fill in Windows API functionality. Keep in mind what I've said before: don't do more than you strictly need to in order to get your code working. Sometimes you don't even need to implement the function; scppn.exe calls GetTimeZoneInformation but doesn't care if you don't write anything to it, for example.

Distributing

This is all fun for your own use, but you have a compiler on your hard drive. Once you distribute your program surely you can't do this, right? After all, you can't expect a compiler to be on every person's computer.

The good part with C and D is that DigitalMars' compilers are very self-contained, rather than depending upon the massive hierarchy of files that something like GCC uses. For D's case, all you need is dmd.exe, /dmd/src/phobos/object.di, and /dmd/bin/sc.ini; it's easy enough to virtualise these latter two files if you've custom-loaded dmd.exe. In C's case you need to ignore /dm/bin/sc.exe and use /dm/bin/scppn.exe, which is what /dm/bin/sc.exe depends upon. If you need to include files you'll need to provide Phobos or CRT includes as well, in which case you may want to use Phobos' std.zip to reduce installation size and complexity. To the best of my knowledge, there are no licensing issues with distributing these compilers with your program, although it would be best to ask Walter Bright first.

On a Silver Platter

So you don't want to do any hard work yourself, you just want to be able to use someone else's code? Fine, be that way! You can use this code, which is implemented for Phobos and will likely never be updated because Phobos is not the environment I'm working in. It works in both Windows and Linux, and compilation of the example file requires bud and is implemented for D 1.0. You can also browse the documentation online.

The root of the library is exe.executable, which has the Executable interface. The two methods it contains are used as a building block for all additional functionality. For example, exe.group provides GroupExecutable, which links a number of executables together, such as exe.base's BaseExecutable that provides common symbols from the D runtime and an OMFExecutable from exe.omf that needs those symbols to link properly. A common helpful function is link_executable from exe.executable.

You've also got the aforementioned exe.omf for OMF files and as well you have exe.pe for Portable Executables under the PEExecutable class.

The core of linking together an executable is found in exe.wrap; there are a few functions there which make things simpler to use, but it is still by far the most task-dependent aspect of the library; if you want to use something other than DigitalMars C or D, expect to have to understand what it's doing.

But you don't often need to use that. What you really want is exe.dynamism, the public interface that abstracts all this work into a few easy-to-use functions: eval for evaluating simple statements and RuntimeTemplate for caching generated code.

This code is fairly robust for an article example, but it's not complete. You would want to interpret standard output to produce a helpful error when a syntax problem occurs in the code. You could also flesh out the exports from BaseExecutable (or better yet, parse the executable's .map file to automatically export everything), or have RuntimeTemplate cache only a certain number of loaded files. You could even implement ELF (which is more uniform and simpler than PE or OMF) and try running Linux programs in Windows. Feel free to do whatever you'd like with it – it's in the public domain.

Conclusion

What I hope I've impressed you with the most in this article aside from my debonair good looks and compellingly girlish laugh is that while custom-loading executables is strange and possibly even perverted, it shouldn't be considered intimidating; you shouldn't ever feel like you've stumbled on a satanic ritual and you're not sure whether the proper introduction is to kill a baby or just throw some devil horns. That is to say, you should always feel that you have some destination in mind and that there's a clear way to get there from here, and as exe.dynamism shows, it is possible to reduce what complexity there is into a few simple functions.

The truth is, there's no real good reason why compilation should stop at the executable. DigitalMars C and D are not only super-fast compilers, they're small and they're independent so you only need to package a small number of files with your program, they transform D from a traditional statically-compiled language into one which offers many of the dynamic-loading advantages of scripting languages without the horrible execution speeds or dealing with migrating script to C, and they offer high-level capabilities that are not normally available outside of Universities.

So, sad to say, there is no gibbering terror at all about custom-loading executables. It's fun, easy, reliable, and useful.

Questions? Comments? Death threats? burton-radons@shaw.ca.