CGC Monitor: A Vetting System For The DARPA Cyber Grand Challenge

Michael Thompson and Timothy Vidas discuss their work at DFRWS US 2018.

Joe: So, here we have Michael and Tim, who will be talking about their paper, ‘CGC Monitor: A Vetting System for the DARPA Cyber Grand Challenge’.
Timothy: Hi. Thanks for sticking around for the last paper, presentation, [anyway], looks like a few more people have trickled in since breakfast. I’m Tim and this is Mike. And we both worked on the DARPA Cyber Grand Challenge program for about three years. And that concluded just under two years ago now, at the DEF CON conference in Las Vegas. So, we wanted to … it took a little bit to get through all of the disclosure and vetting and [01:01] processes for getting information out from the project, as is sometimes the case with government work. But we wanted to talk about a little piece of the integrity system that was forensically … doing some forensics on the submissions from the competitors.

The overview slide that’s sort of required [01:24] overview slide. I’m going to give some background on the CGC program itself, and [motivate] the need for integrity, and then switch it over to Mike, who’s going to go over some of the details of the vetting system, and broadly talk about reverse execution.

So, who’s heard of Capture-the-Flag contests? Yeah? In the context of cyber [security, right?] I guess – how many people did the rodeo last night? About the same people, it looks like. So, [we won’t] spend too much time talking about CTF, but I do want to make the distinction of … there’s two broad classes of CTF, and the rodeo took one class in one direction last night, with a jeopardy board style, where you sort of choose your own adventure. You can pick the challenge [that you’re doing] by picking the square or whatever the game board looks like. And there’s a whole other class which is dubbed Attack/Defend, where the contestants are simultaneously trying to defend some information system while exploiting vulnerabilities on a similar system that the other contestants are operating. And often, those are directly attached and might happen over traditional networks [02:37] TCP and things like that.

In this context, it’s security-based Capture-the-Flag, cyber security-based, right? That’s not the game that kids are playing physically, it’s not the first-person shooter that’s available on the Xbox or the PlayStation or whatever. And typically, these puzzles involve demonstrating some level of proficiency in a particular skill set, so it might be like a forensics problem or it might be a reverse engineering problem, [another] security problem, [crypto] and so forth.

So, there’s different models for organizing how these contests actually work. So, you might structure the contest to try and educate [office] people to make them more experts, you might structure it to achieve particular learning objectives, and there’s different ways to structure the contest itself. So, they’re sort of … the game for the organizers is structuring a contest such that the participants are navigating [the game] in a way that you want, to demonstrate that level of proficiency.

These things are increasingly common. Ten years ago, there were only a handful of them. Now you can basically participate in them about every other week, and some of them have serious prizes, where if you are a skilled enough individual or team, you could basically make a living by running these contests [year round].

And also, I wanted to note that today, the character of CTF is sort of becoming more and more ambiguous, that it almost just means contest these days. I think DEF CON has like 27 CTFs in the coming weeks, and a lot of them don’t fit into that classic mold of sort of the jeopardy style, with the Attack/Defend.

The game flow … this is specific to one type of CTF, and this is the one that applies to Attack/Defend CTF, and this is the one that applies to how the CGC game works. So, that has some added context for the next slides. One thing that was particularly curious about … or somewhat [awful] about the CGC design is that it was Attack/Defend, but the attackers and the defenders were not directly connected to each other. In the old days, so to speak, the different participants might literally be plugged into the same network switch, and you could more or less directly attack by opening just [05:01], you just connect to them, and try and exploit a vulnerability.

In CGC and a few other contests, it’s a brokered game, so all of the game moves, so to speak, are mediated by some central party. In our case, it was the competition infrastructure, which was quite a large set of hardware and software, which we’ll talk a little bit about. I’ll tell you about the pieces that are interesting for the forensics [05:29].

Again, this is a huge endeavor by many people over [three] years, so if we’re glossing over a part that was interesting to you, feel free to grab us and ask us about it. It was pretty exciting work, and we’re pretty enthusiastic to talk about it.

So, it was a brokered game. The infrastructure mediates everything. There’s an API. Part of the CGC [05:53] part of the CGC mission was autonomy. So, the APIs were designed to be run without humans in the interface, all the interfaces were machine-readable and designed to be used by machines. And roughly, they’ll be downloading software that was meant to be analyzed, and then uploading software that was affecting the different offensive and defensive moves they could make in the game. So, these are modified [06:22].

To give you a little bit more insight into what the competition infrastructure looked like on the inside – now, what the competitor was controlling or what the competitor was affecting in their moves and their uploads, we had different physical hosts on the infrastructure side for each team. So, each team had a set of hosts, depicted here, allocated for them on the infrastructure [06:50], and the easiest one to start with is the defended host, on the right-hand side. And this is where the vulnerable software runs. So, this is all binary software, we call them challenge binaries. These are the services that have the known vulnerabilities in them that have to be detected and then taken advantage of, and also modified and be defended, so that the other participants [can’t] take advantage of [it on your defended host]. So, we, the infrastructure team, managed this, and the moves are uploaded by the participants, and then we place the modified binaries in a running state on the defended host.

Since nothing can interact directly, we need a way to exercise the challenge binaries as they go through their different types of behaviors. So, if you think about a web server, you need to have the different [HTTP] requests coming into the web server and exercising the different features of the web server, so that sort of benign or expected behavior is done by the poller, on your upper left, the guy with the little checkbox. So, he’s sort of making sure that all of the functionality that’s expected of that service is still true, so exercising all the different aspects of the web server, so to speak.

And then, because this is a brokered game, the proof of vulnerability, which you might call an attack against the vulnerability, can’t be launched directly by a participant. So, these, again, are uploaded to this brokered interface, and then they’re launched on behalf of the participant by the competition infrastructure. So, we have a dedicated machine down here, called the POV machine, and those different color boxes are indicating different participants that have placed a move to launch an attack against the set of hosts on this slide, for this participant.

Here, three different opponents have registered their intents to launch a proof of vulnerability against this defended host. So, all of the polls, the benign interactions, and the POVs, [the attempts] at proving vulnerability or the attacks, are sent towards the defended host, and that’s all through a network IDS. So, everything is mediated through the IDS, and the participant that is associated with this defended host has an opportunity to not only replace the services but also introduce [09:20] detection rules, network style, [09:23] style rules, to [mitigate attacks].

One other piece on the slide is that, in addition to the interface that allows you to change the software in the game, there’s also a tap, like a [mirror port feed], that comes from the IDS and goes into the CRS. So, all of the participants had two cables: a bi-directional [API] cable, and then a unidirectional tap from the IDS.

So, what does it look like? Well, CTF-wise, if anybody watches Mr. Robot and stuff, there was an episode where they went through and they kind of depicted this party-like environment with a rodeo-style, jeopardy-style gameboard at the end. It looks like this. It’s not the most … CTF is not the most spectator-friendly sport. You’ll see a lot of people looking at computers, just working at their keyboard. So, you think, “Well, what is the Hollywood representation?” Well, it looks a lot like what it actually looks like in real life, so there is a little bit of … quite a bit of true-to-reality there.

So, if you look at DEF CON just a few years ago … I’ve got a picture from DEF CON 2016, and you can go [all the way back] to like 2012. Pretty much though, it’s the same. It’s all [in the cyber realm], it’s all people hunkered over keyboards, much as you are right now.

The DEF CON one is one of the longest-running contests in the United States, and it’s often called the World Series or the Superbowl of CTF.

To anchor that back into the CGC – this is like the [DARPA and the DND] funded program, the Cyber Grand Challenge. If you [11:13] down to this little, small phrase, the research question was: Could you automate all of that? You’re typically saying that this is human-driven, you’re trying to exercise specific subject matter expertise in crypto, in forensics, in reverse engineering. So, DARPA likes to automate things, they like to build robots, they like to make large technology small. So, could a purpose-built computer actually play against this sort of Superbowl of CTFs?

If we distill that down, we need autonomous binary analysis, patching, vulnerability discovery, service resiliency, and, because there’s an IDS component [to the network, IDS].

If we’re going through the pictures and seeing what it looked like – this is what CGC ended up looking like. Instead of a bunch of hackers wearing black t-shirts and sitting at keyboards, because it’s autonomous, we have racks of machines that are sitting there, doing the things that the [humans used to], so it’s sort of even less spectator-friendly. [12:17] don’t even have a person sitting at a keyboard, you just have a rack that you are taking on faith that it is making similar actions to what used to happen.

Even so, this isn’t a rendering. This kind of looks like an artistic rendering. This is a real picture, it’s a very colorful and saturated kind of environment.
But much like a golf game or whatever, it was very long-running, but the last portion of the event was live-streamed and done as a live event in Las Vegas, so there were crowds and … you take this spectator-unfriendly environment of having the racks and trying to find ways to engage the audience with visualizations and explanations and commentators and so forth. It was [13:08].

Again, it was three years. There were [13:14] the final event is the one that we’re talking about the most, which is why you’ll see the forensics [13:19] came into play, and you can see that there were a hundred qualified applicants down to the seven finalists. So, right from the beginning, from more than three years before the final event, there were critical concerns about the integrity of the game. [13:41] wanted it to be very repeatable, so [they were at great lengths into] making … increasing the [determinism] and different aspects of the architecture and the implementation of the game. And then, skipping down to the bottom bullet, we wanted to also make lots of related datasets available to continue research in the field. So, not only the different corpuses or corpora of binaries [and] the results from the event itself, but the [places] where we couldn’t have high determinism, where we wanted to record, so that the event could be replicated, the results could be replicated.

And then, crucial to today’s presentation was the competition integrity. We have lots of inputs from competitors, because competitors are some of the best hackers, and what they do is subvert CTF systems and try to win at all costs, right? So, there’s a lot of concern both from [who the participants are] and then also the purse, the prize money at stake. There was a little under seven million dollars at stake. And seven million dollars is enough, but it’s also seven million dollars that’s coming from the United States government. So, there’s this additional layer of the need for integrity and desire to not be subverted.

This is a pretty dense slide, but I wanted to highlight the lengths that we were going to in different aspects of the game, which [are some kind of motivating …] some of the intensity to the forensics [on this]. But the randomness, we even limited randomness available to [user space], so that people couldn’t … we needed the reproducibility, so we couldn’t have the randomness that you might get from a typical machine, but we also couldn’t let people predict what was going to happen at any particular space so that they could try and game the system. So, we went to lengths, such as using a deterministic pseudorandom number generator to seed the different aspects of randomness available in [user space]. But the seeding for that couldn’t be controlled by DARPA, who could then predict the outcome of the game. It couldn’t be controlled by [a unique] participant, so we mixed inputs from all the different participants and from DARPA, and we committed to the [seeds at a time,] we committed to the schedule of CPU release, at a time, and tried to really take all the different ways that you could game the system and give yourself advantage, and take it away while still maintaining the repeatability aspects, so that people could redo experiments.
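As an illustration of the seeding idea described here, the following is a minimal sketch and not the CGC implementation. It assumes each party publishes a SHA-256 commitment to a secret seed ahead of time, reveals the seed later, and that the revealed seeds are hashed together to seed a deterministic generator. The party names, the `commit` and `mix_seeds` helpers, and the use of Python's `random.Random` are all assumptions made for the example.

```python
import hashlib
import random

def commit(seed: bytes) -> str:
    """Digest a party publishes ahead of time; the seed itself is revealed later."""
    return hashlib.sha256(seed).hexdigest()

def mix_seeds(revealed: dict, commitments: dict) -> int:
    """Check every revealed seed against its commitment, then hash them together."""
    h = hashlib.sha256()
    for party in sorted(revealed):  # fixed order keeps the result reproducible
        seed = revealed[party]
        if commit(seed) != commitments[party]:
            raise ValueError(f"{party} revealed a seed that does not match its commitment")
        h.update(len(seed).to_bytes(4, "big"))  # length prefix avoids ambiguous concatenation
        h.update(seed)
    return int.from_bytes(h.digest(), "big")

# Hypothetical parties: the organizer plus two competitors.
seeds = {"organizer": b"organizer-secret", "team_a": b"alpha", "team_b": b"bravo"}
commitments = {name: commit(s) for name, s in seeds.items()}

game_seed = mix_seeds(seeds, commitments)

# Two generators seeded identically produce the same stream, so a run can be replayed.
rng_first_run = random.Random(game_seed)
rng_replay = random.Random(game_seed)
first = [rng_first_run.randrange(256) for _ in range(8)]
replay = [rng_replay.randrange(256) for _ in range(8)]
print(first == replay)
```

Because every party contributes, no single party chooses the final seed, and because the commitments are published before the reveal, nobody can change a contribution after seeing the others.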

We even committed to kernel versions years in advance, which … months and months in advance, which is really annoying when you’re trying to keep up with security updates, [16:48] patches … the reason we committed to [16:50] primary reason was for supply chain concerns. Could people actually manipulate the game by subverting the entire infrastructure by committing something or understanding some vulnerability in the kernel after the announcement of the competition but before the competition was [ever done].

We designed an architecture that only had seven system calls, so the environment is very reduced compared to a modern operating system that would have hundreds of system calls. And even in that case, all of our [public] systems were at least on a [BST] … or, sorry, were at least on a Linux-based 32-bit kernel, and our production systems were on a 64-bit [17:29] kernel. So, think about all the complexities of making those two systems appear the same from [user space] – like, anybody that’s gone to the OS [internals], there’s a huge [lift] in making those appear identical to [user space] software.

And then, again, just dragging on the lengths towards integrity, we have a physical air gap – so, a lifted stage, off the floor, so you didn’t have additional network cables going into the different competitor machines. This was fully autonomous, so when the games started, all the designers of the software had to just step back and watch, and were not allowed to manipulate anything. You couldn’t restart services, you couldn’t change software, it was completely hands-off. So, there was a physical [internet] powered [totally through] this transparent bridge thing, to try and demonstrate that there was really nothing crossing the air gap.

For production needs, [the one thing] that did cross the air gap was a unidirectional robot that took optical media from the inside and dropped it on the outside. Expensive [contractor] project.

So, competitors had to be autonomous, but we’re [18:47]. So, we … the organizers had to observe [air gap rules], so as we went in and out of [the refereeing] system during the event, cellphones couldn’t go in and out, [19:00], dedicated analysis machines on the inside, and so forth.
Now we’re going to dive a little deeper into the specific integrity control for the CGC [19:15], the forensic [harness].

Michael: Hi. I’m Mike Thompson, I’m from the [19:20] post-graduate school, and this is what I worked on primarily over those three years. Our goal was to vet all of the competitor software that we got before [19:27] as it was running on the competition infrastructure. In order to do that, we simulated or duplicated the entire CGC infrastructure on the Simics full system simulator. So, we simulated the game and used that [to vet] the software.

The simulation included multiple components – all of the game services, the operating system that Tim described, that was all simulated. And so, we monitored the operating system for execution and data integrity while this software, [19:56] software was running. And we used a high-fidelity x86 model from Intel in order to sort of anchor the simulation.

We built the monitor on Simics, primarily using breakpoints and callbacks. Those of you who work with dynamic virtual machine introspection might understand some of those challenges. None of the monitoring functions were running on the system itself. In order to achieve this, we built custom operating system awareness, based on the internals of the operating systems, and we had to do that for both 32-bit and 64-bit versions of Linux and FreeBSD. So, we even had simulations running with heterogeneous mixes of those. During the competition, all of that was running on 32 blade servers, on which we ran multiple instances of the simulated CGC infrastructure, so that we could vet all of those competitor-supplied software solutions.

Those submissions included the proofs of vulnerabilities that Tim described, their replacement challenge binaries, and the filters that the competitors issued for their IDSes. So, [while] any of those things were scheduled for execution on the kernel, we monitored the kernel for things like ROP – that would be execution of return instructions that don’t appear to follow calls. We monitored for a modification to kernel page tables. Modification of user credentials such as trying to change a user ID [to maybe zero]. And execution of any code that didn’t have any business executing while a particular competitor submission was running.
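The ROP check mentioned here, returns that do not appear to follow calls, can be approximated with a shadow stack. The talk does not say how the monitor implemented its check, so the sketch below is an assumed technique run over a hypothetical pre-recorded trace of call and return events; the `Event` record and its fields are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "call" or "ret"
    pc: int        # address of the instruction
    target: int    # where control goes next
    size: int = 5  # length of the call instruction in bytes (assumed)

def find_suspicious_returns(trace):
    """Return the 'ret' events whose target was not pushed by a matching call."""
    shadow_stack = []
    suspicious = []
    for ev in trace:
        if ev.kind == "call":
            shadow_stack.append(ev.pc + ev.size)  # expected return address
        elif ev.kind == "ret":
            if shadow_stack and shadow_stack[-1] == ev.target:
                shadow_stack.pop()
            else:
                suspicious.append(ev)
    return suspicious

trace = [
    Event("call", pc=0x1000, target=0x2000),
    Event("ret", pc=0x2010, target=0x1005),      # returns to just after the call
    Event("ret", pc=0x1008, target=0x41414141),  # no call pushed this address
]
suspicious = find_suspicious_returns(trace)
for ev in suspicious:
    print(f"suspicious return at {ev.pc:#x} to {ev.target:#x}")
```

A real monitor would also have to cope with context switches, interrupts, and legitimate stack manipulation, none of which this toy trace models.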

All those things were done during the game to look for attempts to subvert the infrastructure. We also, as we were doing this, had the ability to generate artefacts, while we were monitoring. So, of course any events that were anomalous, we would record. We could also create full execution traces that included data references, system fault logs of everything that happened during the game. We could generate records for successful proofs of vulnerabilities, that is successful exploits within the game. Instances of ROP or stack execution by the actual applications, which is what they were trying to do and what we were expecting to see, we could record that. And of course, [22:18] services.

That’s sort of an overview of the monitoring side. Needless to say, we didn’t discover anybody trying to subvert the infrastructure. But another component that we built into this simulator was the ability to actually look at the successful exploits and try to understand what bugs were being exploited. So, this was my favorite part of the … this part of the project, and that allowed me to run a computer backwards in order to figure out what the exploits were.

To put it into context, imagine a real-world analogy where you have a fuzzer that has managed to generate a [crash of input] for some service – so you’ve got this output from the fuzzer, and you know it crashed the service, and maybe you know it gives you control of the IP. That tells you precisely nothing – almost nothing – about the actual bug that [23:10]. CGC demonstrated quite well that you can exploit a bug in a service without really understanding what that bug is or where the bug is.

In CGC, the competitors found 20 vulnerabilities in the 82 challenge sets that we gave them to work on. But the question was: What were the flaws? Were they actually exploiting the intended flaws that the authors put into the software?

If you looked at the patches that the competitors issued against [those … that flawed software], that really wouldn’t tell you anything about the actual bug that was being exploited, because the patches themselves were generic in nature. They didn’t really fix the bugs.

In order to answer this question, we re-instrumented the simulator to analyze the applications as opposed to watching the operating system. And what this gave us was the ability to have an analyst start up a session, and that session would then pause when something interesting happened, like a proof of vulnerability. The analyst could then use reverse execution to help him or her understand what the nature of the bug was. And to do that, we took the IDA Pro debugger and used it as a front end to our CGC monitor, which was built on the Simics full system simulator.

Imagine you’ve got an IDA Pro session that you’re running. It looks just like any IDA debugger session. And in this case, you can’t see it, but I’m looking at a function call pointer, where I know that, say, the content of ECX is corrupt, and I am interested at this point in knowing where that value came from, what’s the provenance of that corrupt register. So, reverse execution becomes really helpful to answer that question. So, I can actually run backwards to start answering that question, using functions that we built into an IDA Pro plugin.

So, our IDA Pro client includes the usual sort of debugger functions, but in reverse. And so, I can say “Run backwards till you hit [a breakpoint].” I can say things like “Step backwards over the current function,” or into the current function. I can put the cursor somewhere in the [25:21] and say “Run backwards to that”. I can say “Run backwards until this particular register has been modified,” back in time. I can set and jump around to bookmarks, interesting places in the execution. And I can [semi-automate] back tracing of data – so, if I were interested in the provenance of content of memory, I could say “Run backwards and generate bookmarks until you find the source of that particular data”.
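The semi-automated back tracing described here can be pictured as a walk backwards through recorded execution. The sketch below is illustrative only and is not the plugin code: it assumes a hypothetical recorded trace in which each step names the location it wrote and the location it read, and it bookmarks each write that carried the value of interest.

```python
from typing import NamedTuple, Optional

class Step(NamedTuple):
    cycle: int
    text: str            # disassembly, for display only
    dest: str            # register or memory location written
    src: Optional[str]   # location read, or None when the value comes from outside (e.g. input)

def backtrace(trace, location, from_cycle):
    """Walk backwards from from_cycle, bookmarking each write that produced the value."""
    bookmarks = []
    for step in reversed(trace):
        if step.cycle >= from_cycle or step.dest != location:
            continue
        bookmarks.append(step)
        if step.src is None:      # reached the origin of the data
            break
        location = step.src       # keep following the value further back
    return bookmarks

# Hypothetical trace ending in a call through a corrupt ECX.
trace = [
    Step(10, "receive -> [buf]", "[buf]", None),
    Step(20, "mov eax, [buf]", "eax", "[buf]"),
    Step(25, "mov ebx, 1", "ebx", None),
    Step(30, "mov ecx, eax", "ecx", "eax"),
    Step(40, "call ecx", "eip", "ecx"),
]
bookmarks = backtrace(trace, "ecx", from_cycle=40)
for b in bookmarks:
    print(b.cycle, b.text)
```

In this example the bookmarks lead from the corrupt register back to the input buffer, skipping the unrelated write to EBX.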

I’ll quickly look at the sub-system that makes this happen. You’re interacting with IDA Pro, you say you want to jump backwards to where that arrow is, up at the top. The client then is going to send that over to the CGC monitor using a sort of out-of-band send monitor command. It’s going to send the address to where I want to go backwards to. The CGC monitor then is going to call down to the Simics engine and say, “Hey, the next time you stop for whatever reason, come to this callback.” I tell it where to break and then I tell it to run, and it then starts running backwards, executing backwards, or at least looking like it’s executing backwards, till it gets to some breakpoint. It might be the breakpoint I set, it might be some other breakpoint, because [whereas] I’m debugging, I want it to run until it hits a breakpoint.

When it hits a breakpoint, it stops and it tells my IDA client at this point that it’s stopping and where it has stopped. The interesting thing here is that IDA Pro, up till this point, hasn’t been involved in that particular sequence. It doesn’t know where it is right now. So, I then call down into IDA Pro, and I say, “Hey, set a breakpoint at this address,” and I tell it to set the breakpoint [at what happened when I know it’s the current address.] So, it then talks, using its normal debugger interface, to the debug server, and says, “Run to this breakpoint,” and the server says, “Hey, I’m already there. Okay, I’m there, I’m done. I ran to your breakpoint.” IDA is now happy, and it queries the server in order to get its registers and its memory updated. So, now from the user’s perspective, IDA has run backwards and it now knows the current system state, so I can look at register content, memory content, and such.

So, how do we get this reverse execution? How does Simics provide that illusion? Well, it’s really resource-intensive. What it does is as you’re running forward, it records a series of micro checkpoints, and then it references those checkpoints during reverse execution, essentially iterating over each one, then running forward till it hits a breakpoint, and it keeps doing that until it decides that it has hit the most recent breakpoint of interest.

The challenge though is that that progression running backwards isn’t strictly serial, so you can hit a lot of breakpoints that you wouldn’t normally hit, so if you have associated callbacks with those breakpoints, you’re going to get a lot of noise and a lot of garbage, so you can’t really use those callbacks when you’re running backwards.
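The checkpoint-and-replay idea described above can be sketched with a toy deterministic machine. This is not Simics code and uses no Simics API; `ToyMachine`, the checkpoint interval, and the helper names are assumptions for illustration. Snapshots are saved while running forward, and "reverse" execution restores a snapshot and replays forward to find the most recent step at which the breakpoint address was reached.

```python
import copy

class ToyMachine:
    """Deterministic toy CPU: pc walks a five-instruction loop."""
    def __init__(self):
        self.step_count = 0
        self.pc = 0
        self.acc = 0

    def step(self):
        self.acc += self.pc
        self.pc = (self.pc + 1) % 5
        self.step_count += 1

def run_forward(machine, steps, interval, checkpoints):
    """Run forward, saving a snapshot every `interval` steps."""
    for _ in range(steps):
        if machine.step_count % interval == 0:
            checkpoints.append(copy.copy(machine))
        machine.step()

def reverse_to_breakpoint(machine, checkpoints, break_pc):
    """Return a machine positioned at the latest earlier step where pc == break_pc."""
    upper = machine.step_count            # only hits strictly before "now" count
    for snap in reversed(checkpoints):
        if snap.step_count >= upper:
            continue
        replay = copy.copy(snap)
        latest_hit = None
        while replay.step_count < upper:
            if replay.pc == break_pc:
                latest_hit = copy.copy(replay)
            replay.step()
        if latest_hit is not None:
            return latest_hit
        upper = snap.step_count           # nothing here; search the previous window
    return None

machine = ToyMachine()
checkpoints = []
run_forward(machine, steps=23, interval=8, checkpoints=checkpoints)
hit = reverse_to_breakpoint(machine, checkpoints, break_pc=1)
print(hit.step_count, hit.pc, hit.acc)
```

Note that the replay from the snapshot at step 16 passes over an earlier match before settling on the latest one. That is the kind of extra breakpoint hit that makes callbacks noisy during reverse execution.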

To review what we’ve found, we’ve used that harness in order to analyze all of the successful POVs in the event. Of 82 challenge sets, there were 109 intended vulnerabilities built into the services. 20 of those had working POVs in the final event, so [28:47] 20 exploits. Turns out that half of those working POVs were not even what the author intended. Six were actually bugs that were unknown to the author, that they put in. Two of the services were exploited by the same bug – happened to be a shared library that had an unintended exploit in it. Four of the bugs were the intended bug, but they were exploited in a way that the author did not intend. And then, all of the exploits within the game, that is everything that everybody found, was basically the same – they all found the same bug and exploited it the same way.

The tool was also able to generate automatic back traces of data. So, this is [a part of] the CGC corpus, it’s online, you can see the back traces that this reverse execution generated. So, you’ll see the sources of corrupt memory locations and call registers, that sort of thing.

In the future, what we’d like to do is extend this tool to more general execution environments. As Tim described, the DECREE environment that the game ran on had only seven system calls [that got expanded] for a full set, and we’d like to make the tool sort of a service, so that somebody sitting out with IDA Pro at the workstation can connect to the service and run software backwards, help find the bugs.

The monitoring and analysis is up on GitHub at these links. The source [for the monitor is there], but you’ll have to bring your own Simics. So, I’d like to close by noting that the US government has a couple of hundred copies, unused copies of licenses to use this Simics-based monitor. Those are available on a number of repurposed supercomputers at CGC [that DARPA] distributed after the competition. So, I’d like to find government people who might be interested in locating those and setting up a Simics-based monitor.

Questions.

[applause]

Host: Are there any questions?

[silence]

Audience member: I thought that this was one of the more interesting papers that I’ve read, because it’s so unique. Do you have any thoughts on how this might be useful outside of the challenge? Is it generally useful or is this just something, in your opinion, that was a really cool way of solving this one problem?

Joe: Well, I think the [analyst tool] is generally useful from the perspective of … Well, two perspectives. The most useful one is [31:35] training people who are getting into reverse engineering, that need to understand and comprehend what’s happening to an application while it’s being exploited. So, if you think about bringing up IDA Pro and trying to analyze what the vulnerable software is doing, if you have the ability to run backwards on your … to answer questions that you have in your own mind, I think it gives you the ability to much better comprehend what’s happening. So, from a training and an education perspective, I think it’s a useful tool.

I think the automated back tracing is also very helpful in actually identifying that whole problem of “Yeah, I can exploit that thing, but can I catch it? Can I figure out what the bug is?” I think it’s a good contribution to the field of being able to automate the process of actually identifying [the true bugs, so I imagine.]

Audience member: The [32:34] videos, something within the last year, it’s quite [32:40] but it’s worth watching. And they also share the sort of … it’s a visualization to kind of show the exploits. Is that related to this at all or is that a [different project]?

Joe: It’s a different project. [32:54] didn’t really show what was happening. It showed what it looked like what was happening, but you couldn’t say [32:59] something happening [33:01], so yeah, this was [33:07].

Audience member: Different project than actually came to their visualization data, through a completely different path, so they were [33:16] so they were operating on data that came off those optical disks. So, they were … they knew the outcomes, [so like how scores] were generated, whether proofs of vulnerability were successful or not, and the [33:29] and things like that. But when it came to execution [33:33] computing systems, [33:37] they weren’t [interlocked] for reasons …

End of transcript
