CyberSecPolitics: The common thread: Fuzzing, Bug Triage, and Attacker Automation

INFILTRATE

People think of conferences as singular events, but INFILTRATE, which is Immunity's open access offensive information security conference, is a chain of research. The goal of a talk is to help illuminate the room we're in, not just scatter dots of light around the night sky.

I'll start the discussion of this chain by looking at some of the technical work presented at INFILTRATE that I think highlights the continuity of our efforts as an offensive community, and then end with the policy implications, as befits this blog.

Finding Bugs in OS X using AFL

The first talk I want to discuss is a very entertaining introduction to fuzzing that Ben Nagy gave last year. You can watch the video here. You SHOULD watch the video. It's great. And one thing you take away from multiple viewings of it (I've seen this talk three times now) is that fuzzing is now all about file formats, not programs or protocols. People spend all their time treating the PDF format to the pain it rightfully deserves. If we want to improve security as a whole, we might be able to do so by simplifying what we ask of our file formats, instead of, say, banning hacking tools.

Finding Bugs in PHP (and other interpreters, such as say, Javascript)

In 2014 Sean Heelan gave a talk at INFILTRATE about fuzzing language interpreters by feeding them their own regression tests as input. This is necessary because fuzzing languages has gotten medium-hard now. You're not going to find simple bugs by sending long strings of A's into the third parameter of some random API. First of all: FINDING the APIs reachable by an interpreted program is hard. Then working out which complex structures, sets of structures, and values need to be passed into those APIs before they will parse them is itself hard.

Then you have to find ways to mutate the code you generate without completely breaking the syntax of the files you are creating, run that code through the interpreter, and trap resulting memory access exceptions.
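To make that loop concrete, here is a minimal sketch in Python. It is not Sean's tooling: the mutation strategy (swapping one integer literal for a boundary value), the INTERESTING list, and the helper names mutate, crashed, and run_case are all assumptions made up for illustration. The only thing it leans on is that, on POSIX systems, Python's subprocess module reports a process killed by a signal (SIGSEGV, for example) as a negative return code.

```python
import os
import random
import re
import subprocess
import sys
import tempfile

# Boundary values to splice in. Illustrative only; negative values are
# parenthesised so that "a-1" does not turn into "a--1".
INTERESTING = ["0", "(-1)", "255", "65536", "2147483647", "(-2147483648)"]


def mutate(source: str, rng: random.Random) -> str:
    """Swap one integer literal for a boundary value, leaving the rest of
    the regression test (and, mostly, its syntax) untouched."""
    literals = list(re.finditer(r"\b\d+\b", source))
    if not literals:
        return source
    target = rng.choice(literals)
    return source[:target.start()] + rng.choice(INTERESTING) + source[target.end():]


def crashed(returncode: int) -> bool:
    """On POSIX, subprocess reports death-by-signal as a negative code."""
    return returncode < 0


def run_case(interpreter: list[str], source: str, suffix: str,
             timeout: float = 10.0) -> bool:
    """Write one test case to disk, run it, and report whether the
    interpreter itself died from a signal."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as handle:
        handle.write(source)
        path = handle.name
    try:
        result = subprocess.run(interpreter + [path],
                                capture_output=True, timeout=timeout)
        return crashed(result.returncode)
    except subprocess.TimeoutExpired:
        return False  # a hang is interesting too, but it is not a crash
    finally:
        os.unlink(path)
```

A real harness would loop over every file in the interpreter's regression suite, call mutate many times per file, and save any source for which run_case returns True, for example run_case(["php"], mutated, ".php") if a php binary is on the path.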

Using Fuzzed Bugs for Actual Exploitation

The first step: Automatically generating hacker-friendly reports from crash data. Because as Sean Heelan says "The result of succeeding at fuzzing is more pain, not more happiness." This is something Ben Nagy has pointed out over the years as well. "Anyone want 500000 WinWord.exe crashes with no symbols or seeming utility of any kind?"
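A common coarse first pass on a pile like that, well before any real triage report, is to bucket crashes by the top few frames of their stack trace so that thousands of inputs collapse into a handful of candidate bugs. The sketch below is that generic technique, not the method from Sean's talk; the input shape (a mapping from crashing input name to a list of frame names) and the function names are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def bucket_key(frames: list[str], depth: int = 3) -> str:
    """Hash the top `depth` frames; crashes that share them share a bucket."""
    top = "|".join(frames[:depth])
    return hashlib.sha1(top.encode()).hexdigest()[:12]


def bucket_crashes(crashes: dict[str, list[str]],
                   depth: int = 3) -> dict[str, list[str]]:
    """Group crashing inputs by bucket key.

    `crashes` maps an input name to its stack trace, innermost frame first.
    """
    buckets: dict[str, list[str]] = defaultdict(list)
    for input_name, frames in crashes.items():
        buckets[bucket_key(frames, depth)].append(input_name)
    return dict(buckets)
```

Bucketing only tells you which crashes look alike. It says nothing about the chain of mistakes behind each one, and with no symbols at all the frame names degrade to raw addresses, so the buckets get correspondingly less trustworthy.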

That brings us to 2016 and Sean's INFILTRATE talk focused on "What do you do with all those crashes?" Everyone is going to say "You triage them!" But you don't do that by hand. Bugs become "bug primitives" which get you some entry into the "weird machine programming language" and some are more powerful and some are less powerful but you don't want to look at the source code and try to derive what the bug really is by hand every time.

The point here is that for some reason people (especially in policy, which this blog is usually aimed at) believe bugs occur on "one line of code" but more realistically bugs derive from a chain of coding mistakes that culminate in some sort of write4 climax.

What would a sample output from this report look like? It would look like any professional hacking team's Phrack article! Usually these look like a section of code where certain lines are annotated as [1], [2], and so forth. For example, let's look at this classic paper from MaXX which introduced heap overflows to the public discussion.

This is the hacker-traditional way to do footnotes on code which are then used to describe a bug.

Ok, so Sean wants to take all the crashes he is finding in PHP and have the computer annotate what the bugs really are. But in order to even think about how to automate that, you need a corpus of bugs which you have hand annotated so you can judge how well your automation is working, along with sample inputs you would get when a fuzzer runs against them. I'm not sure how academics (Sean is soon to be a PhD student but will never be academic) try to do research without something like this. In fact, in the dry run Sean went on and on about it.

Ok, so once he has that a bunch more work happens and the algorithm below actually works and produces, from arbitrary PHP fuzz data + a decent program instrumentation library, sample triage reports that match what a hacker wants to see from crashes.

Right now Sean is manually inserting variable tracing into targeted functions because to do that automatically he has to write a Clang plugin, which he says is "easy, no problem!" but I think is reasonably complex. Nobody asked him "WHY NOT JUST SOLVE FOR ALL OF THE PROGRAM STATE USING SMT SOLVERS?" (Example for Windows ANI). Probably because that question is mildly stupid, but also because input crafting is one of those things academics love to say they do very well, but when you make the problem "A real C program" lots of excuses start coming out, such as loop handling.

Loop handling is a problem with Sean's method too, btw. Just not as BIG a problem.

Which brings us to DARPA and the DARPA Cyber Grand Challenge, which I've been following with some interest, as have many of you. One of the competitors gave a talk at INFILTRATE which I was really excited to hear!

Automated BugFinding is Important but Still Mostly Fuzzing

To put this talk in context feel free to go read the blog from the winning team, which was led by David Brumley.

"We combined massive hubris with an SMT Solver, and a fuzzer, to win the cyber grand challenge!"

Artem's team came in second when it comes to what we actually care about - finding new bugs.

And they did it in a way similar to the other teams, including ForAllSecure, which I think is interesting by itself, as detailed in his next slide:

Because DARPA has released their test corpus (!!!!) you can of course read the Driller paper, which uses the same methodology (although not as well as Boosted), combining the open source American Fuzzy Lop with a solver to find bugs.

A question for all of us in the field would be "based on this research, would you invest in improving your fuzzer, or your symbolic execution engine, or in how you transfer information between them?" I'm betting the slight difference between ForAllSecure and Driller/Boosted is that they probably transfer information in a more granular form than just "inputs". Driller may even be overusing its solver - it's still unknown how much benefit you get from it over just finding constants within a binary to use as part of the fuzz corpus. In other words, all that research into symbolic execution might be replaceable by using the common Unix utility "strings". See the slide below for an example of a common issue and imagine in your head how it could be solved.
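For anyone who wants to try the "strings" idea, it fits in a few lines of Python. This is a rough sketch of what the Unix utility does (runs of printable ASCII of at least a minimum length), not a reimplementation of it, and the function name extract_strings is my own. The tokens it returns could be handed to a fuzzer as dictionary entries or spliced into seed inputs.

```python
import re


def extract_strings(blob: bytes, min_len: int = 4) -> list[bytes]:
    """Return the unique runs of printable ASCII (0x20-0x7e) that are at
    least `min_len` bytes long, sorted for stable output."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return sorted(set(pattern.findall(blob)))


if __name__ == "__main__":
    import sys

    # Usage: python extract_strings.py /path/to/binary
    with open(sys.argv[1], "rb") as handle:
        for token in extract_strings(handle.read()):
            print(token.decode("ascii"))
```

This only recovers constants that are stored as printable text. A magic value compared as a 32-bit integer would need a different extractor, which is exactly the kind of case where a solver earns its keep.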

But the slight difference in results may not be important in the long run. The tally is 69 bugs for Boosted, 77 for the top scorer, and 61 with just the fuzzer. The interesting thing here is that the fuzzer is still doing MOST OF THE HEAVY LIFTING.

In Boosted they use their own fuzzer, which seems pretty awesome. But sometimes, as in the next talk I want to look at, the fuzzer is just used to generate information for the analysis engine to look at.

What DARPA Scoring Means

Having worked with DARPA myself as part of the cyber fast track effort, I know that any hint of "offensive" research is treated like you were building a demon summoning amulet complete with blood sacrifice. And yet, the actual research everyone is most interested in is always offensive research. DARPA is the ultimate font of "parallel construction" in cyber.

Cyber Grand Challenge is the same. Making the teams "patch" the bugs is part of the scoring, but not because we envision a future where bugs get automatically patched by automated systems. It's there because being able to patch a bug in a way that is not performance limiting demonstrates an automated "understanding" of the vulnerability, which is important for more advanced offensive use of the bug, such as will in theory be demonstrated at DefCon when the teams compete there in the final - one I predict humans will destroy them at.

Finding bugs in HyperVisors

Let's move on to another use of automation to find bugs. These slides cover the whole talk but I'll summarize quickly.

I wish my company had done this work because it's great. :(

The basic theory is this: If you spend six months implementing a hypervisor you can detect time of check to time of use bugs by fuzzing pseudo-drivers in Xen/Hyper-V/VMware to force memory access patterns which you can log using some highly performant system, and you'll get a bunch of really cool hypervisor escapes. Future work includes: better fuzzing of system calls, more bugs, better exploits. This is one of those few areas where KASLR makes a big difference sometimes, but that's very bug dependent.
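The analysis half of that idea can be sketched independently of any particular hypervisor. Assume (and this is purely my assumption, not the format used in the talk) that the logging layer emits one record per hypervisor read of guest-controlled memory, tagged with an identifier for the request being serviced. A time of check to time of use candidate is then any byte that gets read more than once while handling a single request, since the guest could change it between the two reads.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    request_id: int  # hypothetical id for one hypercall or device request
    address: int     # guest address the hypervisor read from
    size: int        # number of bytes read


def find_double_fetches(log: list[Access]) -> list[tuple[int, int]]:
    """Return (request_id, address) for every read that overlaps bytes
    already read earlier while servicing the same request."""
    seen: dict[int, set[int]] = defaultdict(set)
    flagged: list[tuple[int, int]] = []
    for access in log:
        touched = set(range(access.address, access.address + access.size))
        if seen[access.request_id] & touched:
            candidate = (access.request_id, access.address)
            if candidate not in flagged:
                flagged.append(candidate)
        seen[access.request_id] |= touched
    return flagged
```

Every hit is only a candidate. A human (or something smarter) still has to confirm that the first read was a check and the second a use before it counts as a bug, let alone an escape.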

People care about hypervisor escapes a LOT. People wish they would stop being a thing. But they're never going away and it's just weird to think you can have a secure hypervisor and then build systems on top of it that are truly isolated from each other. It's like people want to live in a fantasy world where at least ONE thing is solidly built, and they've chosen the hypervisor as that one thing.

In Conclusion: Fuzzers are Mightier than the Sword

I'm mighty.

Program analysis techniques to find bugs have washed over the community in waves. From Fortify-like signatures on bad APIs, to program slicing, to SMT solving, etc. But the one standout has always been: Send bad data to the program and see where it crashes. And from the human side it has always been, "Read the assembly and see what this thing does."

What if that never changes? What if forever, sending the equivalent of long strings of A's and humans reading code are the standard, and everything else is so pale an imitation as to be basically transparent? What does that say for our policy going forwards? What would it say if we refuse to acknowledge that as a possible reality?

Policy needs to take these sorts of strategic questions into account, which is why I posted this on the blog, but we also need to note that all research into these areas will stop if Wassenaar's "Intrusion Software" regulation is widely implemented even in a very light-handed way. I wanted to demonstrate the pace of innovation and our hopes for finding ground truth answers to big questions - all of it offensive work but crucial to our understanding of the software security "universe" that we live in.

Tuesday, May 10, 2016