W^X | Preision Egnineering

Here’s a short one. Let’s pretend this is a standard kind of entry-level C++ interview, and I’ll impress you with how subtly wrong I can be about so many different things, and yet still produce a weirdly coherent sort of result at the end.

Polymorphism

“Explain polymorphism in C++.”

I learned computers the wrong way. I got involved in the warez scene–but not the smart kind, not shipping brand-name releases. No, there was a group of kids (probably?) on IRC who raced each other to publish cracks before the professionals.

“Sorry, let me repeat the question. Explain polymorphism in C++ without any kind of world-building or extremely tired back-story.”

OK, sure. I know this one. It’s somewhat esoteric and whimsical, but an interesting question. Polymorphic code is an old technique. Today, we hold the representation of execution as sacred–not just unchanging, but fundamentally unchangeable. The scriptures mention W^X–the covenant we must make in accepting to execute the contents of a mapping is that we cannot also write to it–Write XOR Execute.

“No, by poly-“

The ancients lived closer to sin, fabricating wondrous and lethal contraptions unfathomable to the modern eye. It was Von Neumann’s land, the place where instruction and data were one (and it still is, except we have all of these rules). You can’t spell demonstration without “demon,” so here’s an example. Say you have a function (whatever that even is), and it does something special the first time you call it. Today, you might have a static variable and a branch. In other words, you’d have a little bit of context which would change data-as-data, and that data would get checked in the implementation to execute a special block of code.

“Let’s try ag-“

This trick was known even in the old days, but there are reasons to avoid it. For a start, performing such a check is a bit wasteful. Sure, contemporary CPUs might be great at tuning their prediction of the branch, but this wasn’t always the case. Indeed, branches can still be problematic.

There’s another way. At the preamble to the function, insert a sequence of NOP (do-nothing) opcodes, then do the one-time initialization. After initialization completes, rewrite the NOPs to a (near?) JMP to the next instruction. Initialization is now unreachable, no matter what the data might say. Polymorphic code is when the code can change itself.

“STOP. Stop. Just stop. We’re done here.”

Polymorphism, with gusto

Last year, I saw an interesting paper where the authors used polymorphic code to improve the branching characteristics of some HFT application. One problem is actually deploying such code isn’t so easy, since Linux makes the same W^X covenant as the rest of us. Sure, one could build the kernel without that annoying flag, but this adds risk to everything else on the machine. Another option is to use mprotect() to modify the page protections live. Unfortunately, this somewhat defeats the purpose of the exercise, since a syscall is much more expensive than a mere branch, so now the toggle really has to be worth it, but it’s nice for it to be easier to make it even more worth it.

(Un)fortunately, I’ve been thinking a bit about problems in an adjacent space–storage–and I’ve cooked up something of a procedure to make these modifications a bit more fun. The story goes something like this.

Identify the precise location of the code to modify. This is a little bit tricky, but if you want to do this in the first place, you’re signing up for a lot more than just this, so consider it a solved problem.
Create an identically-sized memfd_create()
mmap() that memfd with RW permissions. In particular, just do a mmap(0, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0).
mmap() that memfd again, but this time with feeling (RX permissions). It is vital to map over the part of the address range where this exact mapping already lives.
Writes to region RW will be immediately evident in region RX. Of course, these operations are non-atomic, so you need to a way to ensure the mutations occur outside the protection of a comparably-exercised branch (lest you defeat the point!!!).

We’re following along from the example here. Let’s step through the interesting bits.

It doesn’t really matter, but for the sake of keeping things a little bit clean, here’s a representation for a “region” or “mapping” or “DSO” or whatever you’d call it on a contemporary Linux system with a CPU utilizing a paged memory model. When execution occurs, the dynamic loader will organize various bits of a file into different parts of virtual memory. Subsequently, when dynamic libraries are loaded, a similar process occurs. This always gets a few head-scratches since one practical conceptual model is that “executables” and “libraries” are fundamentally different, but in reality the acronym ELF gives the game away.

typedef struct {
  uintptr_t start;
  uintptr_t end;
  off_t offset;
  char filename[4096];
} Mapping;

bool get_mapping(Mapping *mapping); // let's assume this just magically works,
                                    // getting--say--the first mapping in the
                                    // process (usually the main executable).

So let’s say we performed Mapping mapping = {0}; get_mapping(&mapping); and it worked. Let’s just precompute the size of the executable region we obtained, since we’ll need it. Slightly cumbersome, but let’s just remember these regions are all page-aligned. At the same time, let’s not think too hard about the default size of a page.

    size_t pages = (mapping.end - mapping.start + 4095) / 4096;
    size_t size = 4096*pages;

memfd_create() is one of those interfaces that gained a glibc entrypoint very recently, despite having lived in the Linux kernel for quite a while. I have access to such a glibc, but for the sake of inclusivity, let’s do it this way instead. Note the ftruncate() is necessary–the resulting object needs to have its size defined, and for file-like objects ftruncate() is the way we do it.

    int fd = syscall(SYS_memfd_create, "jitsegment", 0);
    ftruncate(fd, size);

OK, so there is an executable region in virtual memory which is backed by the main binary itself. The region consists of instructions, and we want to modify them. The mmap() we have is, of course, RX, and we can’t just make it RWX (otherwise, this article is not just silly and a little wrong, but profoundly pointless). Rather than going through any sort of heroics, observe we can actually just replace this region with something else–in particular, something else which happens to be different, but also bit-for-bit identical. The first thing we do is create the RW (writable) mapping, and then copy over the contents of our original mapping. After this operation, we’ll have a view in memory into our memfd, and it will contain the exact instructions we want it to, but it’ll kind of be out there in the aether.

    void *rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);  // Skip error handling
    memcpy(rw, (void *)mapping.start, size);

Now that we have that done, we can create one more (the RX one) region from the memfd, except this time when we do it, we map it right over the region we just copied from. This replaces what we had been executing from. There is no need to memcpy(), because this is actually just another view into the memfd and we already populated it.

    void *rx = mmap((void *)mapping.start, size, PROT_READ | PROT_EXEC, MAP_SHARED | MAP_FIXED, fd, 0);

And that’s it. rx lives in the same place the instructions used to live, but now when we modify rw + i, that modification is immediately visible and executable in rx. If exit() or return() are too rich for you, now you can

    memset(rw, 0, size);