Preventing Data Leaks With GitGuardian

Desi: Welcome one and all to the Forensic Focus podcast. Today we have Mackenzie Jackson, who is from GitGuardian, and I actually met his counterpart Dwayne at the recent ICCWS conference over in the States and had a talk to them on their podcast, and we decided to bring the digital Forensics Focus together with what their company does to our podcast. So welcome Mackenzie. Thanks for joining us today. Why don’t you give us a quick overview of what GitGuardian is and where you fit within the company?

Mackenzie: For sure. So, GitGuardian itself is a code security company, and what we specifically focus on is detecting sensitive information inside source code. And when I’m referring to sensitive information, I’m typically talking about what we generally call: “secrets”. And these are things like your API credentials, your API tokens, they could be usernames and passwords, it could be your database passwords, and we can get into why they exist a little bit later on.

But that’s essentially the core of what GitGuardian does. And we’re expanding that into broader areas now. We’re expanding into infrastructure as code, and we also create honeypots, which is kind of like trip wires for attackers, so you know when they’re in the infrastructure. But essentially at the core, what we started at and what we still mostly do is detect sensitive information.

And as far as my role in GitGuardian, I actually came on to GitGuardian really early in their life. When I joined the company, it was still just, I think it was about 15 nerds in an attic in Paris <laugh>. When I first joined up, I was the first ever kind of marketing hire that they did. Although I’m an engineer by trade, I was kind of the first person that wasn’t dedicated in that marketing space.

And so, I joined as a developer advocate or security advocate, which is a strange role. It’s a fun role. It’s basically just trying to create communities, answer questions, go on podcasts and talk about the problem in general. And that was four years ago now, when I joined the company, which is now about 125 people, so quite big and pretty global in our reach.

Desi: That really provides an interesting backdrop to how the company started. When you think of tech startups, you think of a smelly garage in Silicon Valley. Whereas now I’ve got this image of this beautiful attic in Paris, with the Eiffel Tower in the background, and you’re just going out and getting croissants before you’re starting…

Si: <laugh> This is the romantic European take on tech startup. Yeah, the American one. That’s what it is.

Desi: It’s so much better. Like why doesn’t everyone just move to France through tech startups?

Mackenzie: <laugh> Well, you’ve never been to a French leaky attic. I’m telling you. When you have to climb up eight flights of stairs every single day to get to this small room? And Paris is interesting because we don’t have garages that we can start in, but if you think of French buildings, they’re still very old and a lot of them are still designed in an old way, where servants were meant to live in the house without ever being seen.

So, when you have places like the attics and small rooms that are tucked away, now they’re kind of converted into these workspaces or small studio apartments. But that was once where the help used to live. And they’re weird because the buildings are designed in a way so you can kind of get to your little area without having to interact with any of the people in the main residence. <laugh>

Si: The backstairs. Yeah.

Mackenzie: Yeah! It’s quite funny.

Desi: So, you mentioned your background, you’re an engineer, so maybe you can explain to our listeners how you got into this role, from maybe university through to where you are now, just about the path that led you into where you are?

Mackenzie: Yeah, for sure. It’s a bit of a weird story. But when I was in high school, I always absolutely loved programming and my side gig that I’d always kind of done throughout university and started in high school was to create websites for people. I had a small niche where I would make websites for musicians because I kind of built up a template of what everyone wanted back in early 2000s. You know, have a guest board and all of that. And I basically just changed a few of the images around and off we go.

Desi: Was it MySpace? It was just MySpace, wasn’t it?

Mackenzie: Basically, just creating MySpace templates… I remember the early days of cross-site scripting, when you could just inject whatever you wanted into your social media <laugh>. But I actually became an architect, like a building architect, during university. But I always maintained that how I made money throughout the whole period was through software design. And I kind of became an architect because I guess my mom wanted me to, and like, How I Met Your Mother was on and Ted was an architect. And I don’t know why I made that decision.

Desi: Nice.

Mackenzie: <laugh>. Yeah. Anyway, I got into work, I got my engineering degree and I then just kind of went about trying to automate my job as best as I could. And so in these kind of big software systems, you can actually write code in there. So part of my job was what they called finding the highest and best use of land. And it was kind of like: Hey, I get a block, how many apartments could I fit in this block? How many car parks do I need? What are the regulations? So I kind of built some scripts that could do that. I never told anyone about it. I always just…

Si: So, basically you wrote Tetris for buildings. Yeah?

Mackenzie: Exactly. Yeah. Yeah. Have you ever played the Sims? It was…<laugh>…but automatic. And then I kind of decided, why the hell am I doing this? What I enjoy is coding. And I’m like hiding that fact in my job. Which I now reflect on as being stupid. I’m sure my work would’ve loved it. Had I told them that I’d built this thing that they could use.

But then I ended up in a startup. I founded my own company. It was called Conpago. It still exists today. It’s based and headquartered in Brisbane. I’m not really a part of it anymore, but I was there for five years. And then eventually, I exited from that organization. I had met my girlfriend, then I was like: Hey, wherever you want to go in the world, let’s do it. Let’s have an adventure.

She chose Paris. That’s how I ended up in Paris. And then I got to Paris, and I realized I really needed something to fill my time. I had no idea what to do because I was the CTO of this startup. I don’t have… I have an engineering degree… but I don’t really have like the good bones of software engineering. Most of the early days of my startup was copy and pasting from Stack Overflow, but I had lots of weird skills, so I had no idea what to do.

But I got into Paris and there’s like a job search called Welcome to the Jungle in France, and I just typed in ‘native English speaker’, to see what came up. And then this job came up, which was like: I need to be a good public speaker. I need to understand code, but I’m not really going to be coding a lot. And kind of this area. And I was like: Hey, this is me. And then I got super deep into the world of security, and I’ve been here ever since. That’s a very long explanation…

Si: That’s a fantastic explanation. <laugh> The first web designer I ever worked with was an architect as well. He dropped out to pursue it because it was a way more lucrative and satisfying way to spend his life. So…

Mackenzie: Architecture is one of those things where it kind of sounds nice, but once you’re in there and you realize that most of your job is designing bathroom and stair details…<laugh>… and you know, like working with building regulations, it quickly loses its appeal. And it’s not like being a lawyer where you can hate your job and just kind of generally do shitty things, but still get paid enough to kind of sleep at night comfortably in a big-ass bed. And an architect, you’re in one of those Parisian basements made for the maid…

Si: Yeah. Down with the damp instead of up with the damp. Cool. So, I’ve been in security for a long time and you know, hardcoded credentials in code is, and has been, a problem forever. I mean, it’s a thing, but I mean, I’m getting old now. So, when I started out it was all usernames and passwords, hard coded in scripts. I was a Unix admin, so you go through and find them embedded in scripts everywhere.

Yeah. I may have left a few lying around myself because there’s no other way of doing this or not a convenient way of doing it. But now we are in a vastly different world where there’s other stuff out there. What sort of things are you looking for now? As opposed to just sort of usernames and passwords?

Mackenzie: For sure. Well, I mean so much wacky stuff. And, as you were talking, what you kind of alluded to there is how much the attack surface has expanded in the form of code. Because, yes, we still have our source code, but now we’ve also got things like GitOps or infrastructure as code. And we’re codifying so many elements we would have done manually or had different roles in.

Now this is all ending up inside our Git repositories. So, the scope is much worse. So, we still have usernames and passwords that are hard coded in there. SolarWinds, which is a company that had a massive supply chain attack, which affected massive players like the US military. And you know, about as deep as you can go. Just before that attack, someone from SolarWinds hard-coded a username and password for an FTP update service.

So that’s pretty old school. And the password was SolarWinds123 and…<laugh>… the email was something like, you know, admin at SolarWinds. And this was in a public Git repository. So that still happens. But mostly what we’re talking about is, like your API keys, that’s the biggest one, your cloud infrastructure keys.

So around about 20% of the keys that we find are for cloud infrastructure. Around about 27% are for databases. So, lots of these kinds of things. And to understand how they get in there, because I’m sure some people, because whenever you have this conversation, it’s literally like, yes, but oh my God, you have to be an idiot to hard code your credentials and put it inside a Git repository, let alone a public Git repository.

But the problem is actually a lot deeper than that, because when you think about Git, it’s version-controlled, which means that you have a version of everything. So, let’s say that I give a developer a task and say: Hey, I want you to connect up with Algolia and build a search functionality that’s going to sift through this data.

And so, the first thing that I do is I create some obscure branch that I’m going to work off for this feature. I hardcode the credentials in just to start with because I want to get something quickly happening, right? This proof of concept, I just want to get something on the page that’s like, yes, I could interact with this data set, via this API. I get that happening, I make it all look pretty. A hundred commits later, I remove those hard-coded credentials because I know that that’s not what I want in the end.

I use environment variables. I send it off to a code review: code reviewer checks it. Yep. All good. Can’t see any credentials. Everything looks right. It gets merged into master. Unknowingly, that history persists. So now you have these credentials buried in hundreds of commits on some weird branch that no one’s going to be able to discover unless you’re really looking for credentials in your Git repository.

And then, let’s talk about six months later, when that repository is now made public, you have just made all that history public along with it. And even squashing history doesn’t necessarily get rid of everything, you know, because that’s the other argument that people hear. So, we find lots of weird stuff and a lot of it isn’t on that top level, right?

Because if we look at a different type of vulnerability, cross-site scripting, for example, you have a cross-site scripting vulnerability in your application. You change it, you manage, you handle your data differently, so you no longer have that vulnerability and now your app is secure. But with credentials it’s not the case. Because even when you get rid of them, they still exist in your code and that’s the risk in and of itself.
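Illustration (not from the conversation): a minimal Python sketch of why removing a credential in a later commit does not remove it from history. The commit IDs, the key value and the regular expression are all made up for the example, and the list of commits is only a simplified stand-in for what `git log -p --all` would show in a real repository.

```python
import re

# Hypothetical pattern for illustration only: matches assignments such as
# api_key = "..." or password = '...' with a quoted literal of 8+ characters.
SECRET_PATTERN = re.compile(
    r"""(?i)(api[_-]?key|password|secret)\s*=\s*["']([^"']{8,})["']"""
)


def find_secrets_in_history(commits):
    """Return (commit_id, matched_text) for every secret-looking added line.

    `commits` is an ordered list of (commit_id, added_lines) pairs.
    """
    findings = []
    for commit_id, added_lines in commits:
        for line in added_lines:
            match = SECRET_PATTERN.search(line)
            if match:
                findings.append((commit_id, match.group(0)))
    return findings


# Toy history mirroring the story above: hard-coded key first,
# environment variable a hundred commits later.
history = [
    ("c001", ['ALGOLIA_API_KEY = "a1b2c3d4e5f6g7h8"',
              "client = connect(ALGOLIA_API_KEY)"]),
    ("c050", ["render_results(client.search(query))"]),
    ("c100", ['ALGOLIA_API_KEY = os.environ["ALGOLIA_API_KEY"]']),
]

# The latest commit looks clean to a reviewer...
latest_only = find_secrets_in_history(history[-1:])
# ...but scanning every commit still surfaces the original key.
full_history = find_secrets_in_history(history)
```

Here `latest_only` is empty while `full_history` still points at commit `c001`, which is the gap between what a code reviewer sees and what an automated scanner of the whole history sees.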

So, we find lots of weird stuff in repositories, and a lot of it is buried. And just to give you an example of how much we find: most of what we do, what we get paid for, is we protect private infrastructure. But what we also do is we detect secrets in public places. So, one of the most obvious places we look for credentials is GitHub.com, across all public repositories. So, we actually scanned every single commit that was made to GitHub.com last year. That’s over a billion commits. So, a commit, a contribution, is pushing code, uploading code to GitHub.

So that happened a billion times last year, and GitGuardian scanned every single one of those commits for secrets. And we found 10 million secrets, just last year. And as I said, 20% of them were cloud provider keys, and we actually check the validity of these keys. So, it’s not like we find a key and we think it’s real.

We actually check it with the provider: Hey, is this real? And they come back saying: Yes. So, over a million cloud provider keys that were valid were leaked last year. And we could go on about all the different types. But yeah. So, we find lots and lots of different stuff in lots of different areas, in lots of weird places, basically.

Si: So going just on the basis of that, you… I’ve got so many questions from that alone. I mean, how huge is your infrastructure to do this? Cause that’s like a hell of a task.

Mackenzie: Yeah, I mean, it is. It’s helped a little bit by how GitHub is set up and how Git is set up. So yeah, it is a hell of a task. We have massive infrastructure to be able to do that. But it is surprising how easy it is. So, for example, if you’re listening at home and you’re in front of your computer, you can go to the URL api.github.com/events. This is a public API. You don’t need authentication to look at it. You can call it I think 60 times in an hour without any authentication.

That’s the public infrastructure of GitHub. Everything that happens publicly on GitHub is published on that page, that events ledger, if you will. So, there we are, we can connect up to that really easily, to just start sifting through information. And we are not the only ones that do it. We’ve done lots of experiments where we leak a honeypot credential. This is kind of like a fake credential that’s real but doesn’t pose any risk and allows us to monitor an attacker.

It will take less than a minute before someone starts to try and exploit that credential if you leak it on GitHub publicly. And that’s because it’s so easy to monitor. Now when you’re talking about monitoring all of it. Okay. Then it’s a big job. And we’ve also worked a lot at making sure that our scanning is very lean, but also very effective in lots of different areas. But if you want to get started, so I said we found 10 million credentials.

Okay, let’s say you’re an attacker, you don’t want to find 10 million, you want to find 20. Like, you just need to monitor that, not even consistently, just monitor it as much as you can for like a week and you’ll have more than you would ever know what to do with. So yes, it’s big, but you know, you can also scale it down to a much more manageable size if you want to.
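Illustration (not from the conversation): a minimal Python sketch of reading the public events feed mentioned above. It assumes each event is a JSON object with a `type` field and a `repo.name` field; the exact payload layout can change, so the code reads fields defensively. The `User-Agent` value and the function names are made up for the example, and this says nothing about how GitGuardian’s own scanning works.

```python
import json
import urllib.request

EVENTS_URL = "https://api.github.com/events"


def fetch_public_events(url=EVENTS_URL):
    """Fetch one page of public events without authentication."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "events-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def pushed_repos(events):
    """Return the repository names that appear in push events."""
    names = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        name = event.get("repo", {}).get("name")
        if name:
            names.append(name)
    return names


# Offline sample shaped like the feed (assumed layout, invented values).
sample_events = [
    {"type": "PushEvent", "repo": {"name": "someone/project"}},
    {"type": "WatchEvent", "repo": {"name": "other/repo"}},
]
sample_pushes = pushed_repos(sample_events)
```

Calling `pushed_repos(fetch_public_events())` would list the repositories that were just pushed to. Unauthenticated calls are rate limited, so anything beyond a quick experiment needs to pace its requests.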

Si: No, that’s incredible…

Desi: I just wanted to ask. So, I’ve done some GitHub forensics with like version history stuff in CTFs only. You said something there about squishing history? What is that and is that meant to be like a mitigation towards removing some of the history? And what do you mean when you said that it’s not always that effective?

Mackenzie: So, Git merge squash, or something like that, is basically saying: Hey, we’re merging some branches and I kind of just want to make it more manageable. So, I want to squish down all the versions. So, like instead of having a hundred different versions, we’re reducing this so that on our Git tree it looks more manageable. It’s basically just a cosmetic thing.

So, what a lot of people kind of do is they squash their history, thinking that that then removes everything. And therefore, if there were secrets deep in that tree, in that history, now that I’ve squashed them, that gets rid of them. But you’ve just got to understand that Git is very complicated, and a lot of what it aims to do is to be reversible. Because why we have history is so that if we break something, we can move back.

So basically, squashing a history is a cosmetic thing is what I’m trying to get across. And you do lose some data, and it certainly can make it harder to find these things, but squashing a history isn’t going to solve the problem. But it makes it, you know, it just kind of adds another level of obfuscation to it. Makes it a bit more confusing.

But I mean when you’re using automated tools, none of this stuff matters, right? The tools are going to be able to recreate and put back together whatever you had and all that metadata of everything that you’ve done, you know, will persist, and it persists for a reason. And that’s basically so that these things are reversible when you go through.

Si: I mean, Git is, I’m gonna say, it’s relatively new in the terms of VCS systems, although relatively new now is 20 years old. I mean, I remember it coming onto the scene.

Mackenzie: SourceForge. Yes. No…

Si: Do you guys work with some of the other, well, not that there are that many left, but do you work with some of the other VCS systems online or are you, I mean, it’s in the company name, but….

Mackenzie: Yeah, I mean, we work with any Git system, so like Bitbucket, GitLab, you know, obviously GitHub, Azure Repos. We work with all the Git platforms, we natively integrate with that. What we also have is… I mean, it’s not just in Git that you find secrets. This is kind of like your version control, where the sources come from.

So, in terms of other people, there are some others, we have had some inquiries where people are using different areas, and yes, we do, we can scan them, but we have to do it through an API and through some other manual kind of inceptions. So, if you put up hooks, to where your code’s being pushed onto, whatever repository or server that it’s hosted on, we can scan it through an API.

But in terms of like native integration where we fit really seamlessly into it, and we have, you know, all the bells and whistles, that’s Git, and that’s like 95% of organizations. And outside of that, I mean the biggest laggards in tech are like banks because they’re still using COBOL and old school languages and they’re the most resistant to changing infrastructure. And we are dealing with large banks that are fully integrated into Git now. So, if you are not using Git, and you are stuck to something on the later end of version control, I mean, you have to be pretty far down the laggard curve to be doing that.

But a different way of answering that is to kind of say: there are secrets in lots of other places, so they can end up inside backups, inside your wikis, inside your Slack messaging systems. You know, they can end up in email, they can end up on your networks, lots of different places outside of version control. And we do scan all those, except we do that through, we have like CLI tools, one’s called ggshield, where you can scan directories or do anything else.

It does it all through, you know, APIs. It’s not quite as elegant because it needs to be a bit more manipulate-able because your system’s going to be different. But we can do it, and we do do it, and we find lots of weird stuff everywhere. But the reason why we focus on Git is because, generally speaking, that’s the central source of what we call secret sprawl, because when something hits Git, you know, it’s cloned onto everyone’s machine, it’s probably backed up, it’s probably pulled across into some wiki.

It ends up in your running application, it’s connected to your CI/CD, it’s how you manage your infrastructure. So, we want to try and isolate secrets from leaking, like Git is the place to be, and then build out from that or whatever version control, if you’re not using Git, then I don’t know <laugh>. Yeah, yeah. You’re pretty old school, but I mean, we can still help. Yeah, for sure. <laugh>

Si: Cool. One of the things you said was that you verify the keys that you find. I mean we can delete this in a minute if this turns out to be something you don’t want to incriminate yourself with. But I’m assuming you’re not just trying these on the web and seeing whether they let you in or not. You said you contact the cloud service providers. How have you built that relationship?

Because I mean, a lot of the cloud service providers are huge, and frankly don’t give a toss, as far as I can tell. I mean trying to get anything out of Microsoft from a forensic perspective, when you go and say: Hi, can I please have a copy of this Azure? They come back and say: No, give me a warrant. And you know that sort of falls flat. How have you guys found integrating into these large cloud providers?

Mackenzie: Well, so we don’t actually have relationships with the cloud providers. For the most part at least. So, what we do… there’s two kinds of areas. The first one is when we’re integrated with a customer, so let’s say that you are, I don’t know, a company, a startup. You are using our tool. And so basically when you sign up, and this is when we’re kind of scanning your private infrastructure, your private Git repositories, then you are the owner of these keys and you are allowing us to check them, basically.

So, you’re allowing us to kind of make a non-intrusive API call. So, kind of saying on the web, just basically test this works. That’s essentially what we’re doing. We’re just making a call and seeing what we get back. And we do it by basically just testing that the key’s valid, the least intrusive API call that you can make. Now, if we shift this into the public sphere, I said that we scan public infrastructure of something that’s leaked publicly on GitHub.com, for example, and we test those as well.

And so here things get a little bit tricky, but essentially something is in the public domain. And when you have public interest, when you’re doing something non-intrusive with the public interest, then that is deemed as being kind of a responsible disclosure of vulnerabilities. And we are investigating this with the purpose of notifying the provider or the user that this key has been leaked. So, it’s kind of like calling a number that you find in a lost wallet.

You know, it’s not a spam call because we are not doing it for any marketing purposes, we’re just trying to find out if the key’s valid. So, I mean, it is a little bit alarming for someone, when we tell them that their keys are valid, but at the end of the day, this is the best way for us to be able to reduce false positives and keep the community safe, as safe as we possibly can.

And we have over 350 specific detectors for the things that we look for. So, trying to manage a relationship with 350 different providers would just be pretty unattainable for a company of our size. So really the only way that we can do it is just to basically test to see if they work. And if it’s private, then we have permission to do that. And if it’s public, we have to try and use what we can because the attackers are going to be doing these same things.

Si: Yeah, yeah. Okay.

Desi: So, I guess pivoting into the attacker thing, you mentioned before SolarWinds, and the FTP server credentials being published. What kind of other published breaches come to mind that fit this kind of leak, where it’s been part of the initial access or source code being leaked, just so that we can link these in the show notes and people can go read about them as well?

Mackenzie: For sure. I mean, there’s so many cases to talk about and different categories. So maybe what I’ll do is we’ll talk about stuff that’s leaked publicly, and then we can talk about stuff that’s leaked privately that’s also been used by attackers. But to start off with just the public ones. If we look at just what’s happened like recently. Last year, Toyota, they have a product called T-Connect. It’s a mobile application that essentially acts as a key for your car.

In a public repository that wasn’t owned by Toyota, but was actually owned by a consultant that was working with Toyota, the admin keys to databases of this application were published, which essentially would enable attackers to access all the information of anyone that’s using that application and give them a foothold into the internal infrastructure of essentially an app that can start your car and do lots of other things as well.

So that’s kind of one thing that’s happened publicly. One very recently, which happened, not last week, but the week before. If we are looking at what type of companies actually leak public credentials, so someone from GitHub leaked GitHub’s root SSH key publicly on their repository. So, what this would’ve enabled an attacker to do is set up a man-in-the-middle attack and basically listen and obtain information that’s going through your private repositories on GitHub.

So, this is something that GitHub did themselves. Now it was caught very quickly, but you still have to change keys if you’re using SSH and there’s still lots of things to go through. So, this happens to lots of different companies. Uber’s had a number of times where they’ve leaked their S3 buckets and attackers have accessed them. So, there’s lots of different things that happened on the public sphere.

Now, if we talk about the private sphere, because what people kind of say is: Look, we have no public code repositories. We don’t deal with any open source. I mean, none of our employees push anything publicly. I mean, you could never say that, but let’s just say that everything you do is behind authentication. You have multifactor authentication, and you have your code, it’s like your vault, but your code is incredibly leaky.

And so what we often find is, for example, Twitch had all of their source code leaked because of a misconfiguration. Basically, someone set remote access to true instead of false in some infrastructure as code. And Twitch’s source code was all publicly available. We found this code, we scanned it, and we found 6,000 credentials, including 194 AWS credentials inside Twitch’s source code, that now is everywhere on the internet that you want to look.

And so, what attackers often do is they really target source code. So last year we saw lots of source code leaks, mostly by a group called Lapsus$, which was I think a UK-based hacking organization. They ended up being a load of teenagers and they leaked source code for NVIDIA, Microsoft, Samsung. They got into Ubisoft gaming, lots of different stuff that these attackers did.

And everyone’s kind of wondering how did this group of teenagers actually access the private source code of all these great companies, which have good security posture. And the answer came through their Telegram channel where they were literally just paying people to give them access to source code repositories. So just paying employees like: Hey, you work for Microsoft, you had a bad week. We’ll give you five grand if you give us access to the internal code repository.

Once in there, then you’ve got access to all the secrets and you can move laterally and you can elevate privileges and whatever it may be that you’re targeting, you can do a lot from that source code. And there’s very low barrier to entry and source code is meant to be shared and scrolled by everyone. That’s the whole idea of Git. So, it makes it very hard to be able to stop and prevent that. So that’s kind of looking at the private source code.

And one good example that does involve someone from the Lapsus$ group is that Uber had a massive kind of worst-case scenario breach where a credential to their VPN was published on the dark web. It was bought, so it was purchased by someone from Lapsus$, they had multi-factor authentication and basically through social engineering, they called up this guy and said: Hey, we’re from the security team, you need to give us access to the network through your account. The guy clicked ‘yes’ on the multi-factor authentication app and then Lapsus$ had access to the network.

Once they were on the network, they found admin credentials to the PAM system, the Privileged Access Manager system. So basically, what this means is they had admin credentials to both the secrets manager and the password manager, and they were able to access every single system that Uber used, like from their cyber defense to their Git repositories, to their cloud infrastructure, to their Slack channels.

They had the keys to the kingdom, the complete master keys to the kingdom because someone had put this in a PowerShell script on the network. So that’s kind of like secrets being internal. And so, attackers often use this. And when we break apart attacks, what you’ll find is that secrets are always, generally always, used at some point.

Now it may be the initial access. So with SolarWinds we may be able to elaborate that the FTP server is how they got initial access and then they did more stuff. But they also might not be, it might be that they had access to a server and now they’ve found secrets to move into different systems. So, credentials are used in lots of different areas. So that’s just a kind of a few of the kind of attack scenarios and there’s lots more. And I’ll send you links for what we’ve just talked about for the show notes.

Si: Where is the problem here? Is it the design of Git that just replicates stuff ad hoc or not ad hoc? Well, it is ad hoc to be fair. Or is it that we just have such poor coding standards that do this? Or is there a sort of a middle ground whereby Git is replicating things it shouldn’t be necessarily and there should be a deny-all and a specific-allow versus let’s just copy everything to a server. Is it an engineering problem here that we have or is it a human problem or is it just everything?

Mackenzie: Well, I think mostly it’s kind of a human problem, but, you know, I don’t think there’s anything wrong with Git and there’s also nothing wrong with like open source or any of the technologies that we use. It’s definitely how we use them. So, there’s a lack of understanding of how these systems work though, because a lot of people kind of don’t factor in the history or think that because it’s… they have to authenticate into GitHub that that’s acceptable to have secrets in there.

So, there’s a bit of a human element here as well. But I think the real problem actually comes from I guess kind of the reliance that we’ve built up on these APIs. So, for example, there’s things that we can do to help the problem, which is kind of ignoring the fact that secrets are going to keep getting leaked on Git, is we can really lock down our APIs.

So, if this API is designed to be accessed by this other service, well then the only person that can use that API is this service. We have the IP address range, or we have different validations that we can use that really locks down this API. Other things that we can do, is we can automatically rotate our credentials every two months or whatever it may be, so that when stuff gets leaked they automatically become invalid.

And there are other areas around zero trust, which is kind of like saying just because an API key gets leaked, doesn’t mean that someone that shouldn’t use it can use it. I feel like that’s part of the problem as well. The fact that we kind of create these API keys that just anyone can use that have no rotation policies to them.
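Illustration (not from the conversation): a minimal Python sketch of the two restrictions described above, an IP allowlist for the one service that should call an API and a maximum key age so that leaked keys expire on their own. The network range, the 60-day window and the function name are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network

# Hypothetical values: the range the calling service lives in,
# and a rotation window of roughly two months.
ALLOWED_NETWORKS = [ip_network("10.20.0.0/16")]
MAX_KEY_AGE = timedelta(days=60)


def is_request_allowed(source_ip, key_created_at, now=None):
    """Accept a request only from a known network and with a fresh key."""
    now = now or datetime.now(timezone.utc)
    address = ip_address(source_ip)
    from_known_service = any(address in network for network in ALLOWED_NETWORKS)
    key_is_fresh = (now - key_created_at) <= MAX_KEY_AGE
    return from_known_service and key_is_fresh


# Fixed timestamps so the example is deterministic.
check_time = datetime(2023, 6, 1, tzinfo=timezone.utc)
recent_key = datetime(2023, 5, 1, tzinfo=timezone.utc)
stale_key = datetime(2023, 1, 1, tzinfo=timezone.utc)

inside_and_fresh = is_request_allowed("10.20.5.7", recent_key, check_time)
outside_network = is_request_allowed("203.0.113.9", recent_key, check_time)
inside_but_stale = is_request_allowed("10.20.5.7", stale_key, check_time)
```

With both checks in place, a key that leaks is only useful to someone who is also inside the allowed range, and only until the next rotation.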

There are some companies that are helping to try to solve this. So HashiCorp, they created a product called HashiCorp Vault. It’s a secrets manager. There are other managers out there that can do this as well. I just use HashiCorp because they created this concept. They created a concept called Dynamic Secrets, which is basically a one-time use secret, where a secret is generated to access something just in time, you know, just for that one kind of use. And then it’s destroyed.

So that means that these secrets aren’t kind of floating around. And if they are, then they’ve been used so they’re not used anymore and it forces good practices. So, I’m really hesitant to blame Git. I’m also really hesitant to blame developers, because I think that like mistakes happen, it’s almost never malicious and it comes from a misunderstanding of how everything works.
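
To make the "one-time use" idea a little more concrete, here is a toy Python sketch of a secret that is generated just in time, can be redeemed once, and expires quickly. It only illustrates the concept as described in the conversation; it is not how Vault is implemented, and the class name, method names, and 30-second lifetime are hypothetical.

```python
import secrets
import time


class OneTimeSecretIssuer:
    """Toy model of a just-in-time, single-use secret (illustration only)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self._ttl = ttl_seconds
        self._live = {}  # token -> expiry time on the monotonic clock

    def issue(self) -> str:
        """Generate a fresh random token that is valid for a short window."""
        token = secrets.token_urlsafe(32)
        self._live[token] = time.monotonic() + self._ttl
        return token

    def redeem(self, token: str) -> bool:
        """Accept a token at most once; it is removed on the first attempt."""
        expiry = self._live.pop(token, None)
        return expiry is not None and time.monotonic() <= expiry
```

If a token issued this way leaks after it has been redeemed, or after its short lifetime has passed, there is nothing left for an attacker to use.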

And I don’t know in how much detail we expect everyone to understand all their systems. But if you really want to solve the problem, then we need to implement further restrictions on our APIs. We need to scan our infrastructure so that we know when they’re leaked. And then we also need to give developers tools so that they can prevent them from leaking.

So, something really simple that helps this problem, that’s super simple to do, is to set up a Git hook that detects secrets when you commit code and it will just block the commit if it contains secrets, it will let you know and say: Hey, there’s a secret in here, you’ve got to remove this before we let you commit it into the repository. And then that kind of stops the bleeding a lot.
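
A minimal sketch of such a hook, written in Python, could look like this. The three patterns are illustrative assumptions only; dedicated scanners use far larger rule sets and extra validation. The idea is that Git runs the script before each commit and aborts the commit if the script exits with a non-zero status.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Illustrative patterns only; real secret scanners use many more rules.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    re.compile(r"(?i)(?:password|secret|api_key|token)\s*[=:]\s*['\"][^'\"]{8,}['\"]"),
]


def find_secrets(diff_text: str) -> list[str]:
    """Return added lines from a unified diff that match a secret pattern."""
    hits = []
    for line in diff_text.splitlines():
        # Only inspect added lines, and skip the '+++ b/file' header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append(line[1:].strip())
    return hits


def main() -> int:
    """Scan the staged changes; a non-zero return value blocks the commit."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = find_secrets(diff)
    for hit in hits:
        print(f"Possible secret in staged change: {hit}", file=sys.stderr)
    return 1 if hits else 0

# When saved as .git/hooks/pre-commit (and made executable),
# end the file with:  sys.exit(main())
```

Because the hook runs on the developer's machine before the commit is created, the secret never enters the repository history in the first place.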

So there’s lots of things that you can do and I really don’t know like, there’s no one thing I can point to and say: That’s it. That’s where the problem is and if we can solve that. It’s just kind of how we build stuff that’s made it difficult.

Desi: Yeah, that’s really interesting. And sounds kind of like where the cybersecurity industry is at the moment with users, is it’s just that basic security hygiene and monitoring and logging, on top of that to say, like you said with the Git hooks: Hey, you’ve got credentials here, we’re not going to commit it because we’re auditing what you’re pushing into the commit.

Mackenzie: Yeah, yeah. It’s kind of one of those things and that often comes with security, right? It’s boring. It’s boring to talk about password hygiene, right? You know, like, oh God, we’re still talking about that. But that’s kind of really what it comes down to. If everyone had good hygiene and there’s lots of tools that we can use to help that and enforce that.

And I think that’s kind of part of solving the problem. But at the moment, the problem’s getting worse, we find more and more secrets leaked every year. And so, it’s challenging to kind of wonder where we’re going to be with this issue.

Si: In that regard, you’re seeing more and more secrets leaked every year. There are obviously more and more users using GitHub. Is it the same percentage of secrets just over a larger number of users? Or is it a larger number, a larger percentage over a larger number of users?

Mackenzie: Yeah, yeah, yeah. So, there’s two factors each year that do help contribute to it. So last year the report that we published had 6 million credentials, the year before that had 3 million. This year we have 10 million. So, between this year and last year’s, it’s about a 67% increase. GitHub increased their users by about 20%, so that accounts for some of it. And also, we get better every year at detecting secrets.

So, we can also say that part of the reason why that increases is because we’ve spent a whole year expanding our detection service. We are getting better. But that doesn’t account for the complete amount. So, it’s really hard to pinpoint exactly how much. One area is that if you look at how many secrets we detect per commit, so per 1,000 commits, I forget the number. I think it’s four secrets, you know, four secrets for every thousand commits.

What I do know is that was a 50% increase from the year before. So from the same amount of code, we’ve increased 50%, let’s say that we’ve improved 20%, it still leaves a lot of increase. And this can be explained because now we are using source code for lots of different areas that are more sensitive, like infrastructure as code, where we’re integrating all of our pipelines and everything into our repositories with GitHub Actions and other areas.
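
The arithmetic behind these figures can be sketched as follows. The inputs are the numbers quoted in the conversation; treating the factors as multiplicative is an assumption made for this illustration, and the 20% detection improvement is Mackenzie's own "let's say" figure rather than a measured one.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage growth from old to new."""
    return (new - old) / old * 100.0


def residual_growth(total_pct: float, explained_pct: float) -> float:
    """Growth left over after dividing out one explanatory factor."""
    return ((1 + total_pct / 100.0) / (1 + explained_pct / 100.0) - 1) * 100.0


# 6 million credentials last year, 10 million this year.
year_on_year = pct_increase(6_000_000, 10_000_000)

# Dividing out the roughly 20% growth in GitHub users.
per_user = residual_growth(year_on_year, 20.0)

# 50% more secrets per commit, less an assumed 20% detection improvement.
per_commit = residual_growth(50.0, 20.0)

print(f"{year_on_year:.1f}% overall, {per_user:.1f}% per user, {per_commit:.1f}% per commit")
```

Under those assumptions, a substantial share of the increase remains after user growth and better detection are accounted for, which is the point being made.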

And we’re also kind of using a lot more artificial tools to help us code quickly. But those help us almost skip past the lessons where we would’ve learned that hard coding credentials was bad because we are kind of moving faster. So, I mean all of these things can be considered. So yes, increase of code is part of it. Yes, us getting better is part of it, but I mean, a lot of it is just getting worse.

Si: It’s really interesting that you brought up the automated tools that help us do this because we were having a very brief chat at the beginning of this before you came on, about co-pilot and its role in GitHub now. Do you think that’s having an impact on it? I’m going to say I, I’m not a developer. I don’t do much coding and what coding I do is poor and probably has embedded credentials, but I don’t use Git, so minor detail.

But I understand that you start, and it auto completes for you. So, is this an issue of people trying: I need a password, I need to authenticate and it’s bringing this in or…? Also, my understanding of co-pilot is it, itself, is scraping Git for the source, the training sets. So, one would assume that actually it is, in itself, ingesting all of these credentials that are lurking in the background?

Mackenzie: So it is, so there’s a couple of things about co-pilot and I wrote a blog about this called shitty code, shitty co-pilot. But there’s a few things. So, when co-pilot first came out, one of the ways that you could find a credential was by prompting it to go, you know: AWS underscore key equals, and then it would kind of give you a key. And a lot of those were valid. Some of them weren’t, but like, were old keys.

And basically, it was finding it from its code base, like yes, finding these keys. That kind of quickly stopped happening. So now that wouldn’t really… I mean apart from like a really edge-case, that doesn’t really happen now. The problem with things like co-pilot is that it’s being trained on massive (and all AI tools, right?), are being trained on massive data sets.

Like take 10 open-source repositories, just random open-source repositories. Nine of them are going to be absolute trash. Nine of them are: Someone started a project that they never finished, some portfolio thing that they did for a work interview or something that isn’t fully formed. Like a lot of that, and that’s what GitHub co-pilot is being trained on, like 90% trash code.

So, it has its place, and you know, like AI writers, most of what’s written on the internet isn’t great and that’s what it’s being trained on, but you can use it as a starting point. I’ve got no problem with using them as a starting point. And AI is interesting because, I mean, it depends on how it evolves and where it goes, but it could be a great tool for being like, okay, you are hard coding a credential, then immediately your AI tool stops you and says: Hey, this is a terrible idea. Why don’t we do this instead?

And then tells you how to use an environment variable and create a .env file, all of which is easy. So it could be, but it doesn’t do this. I want to be clear; it doesn’t do this at the moment, but when you look at this, we’re in the infancy of it. It could be a good strategy that basically prevents people from doing poor coding practices, but that’s going to rely on it being trained on good quality code.
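
For readers who have not used this pattern, the following is a minimal Python sketch of reading a credential from the environment or from a .env file instead of hard coding it. The function names and the variable name `DB_PASSWORD` are hypothetical, and the parser handles only simple `KEY=value` lines. The .env file itself should be listed in `.gitignore` so it is never committed.

```python
import os


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_secret(name: str, env=None) -> str:
    """Fetch a secret from the environment and fail loudly if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value
```

In application code this would replace a line like `password = "..."` with `password = get_secret("DB_PASSWORD")`, so the value lives outside the source tree.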

And if we look at just kind of the code that’s out there, well most of it isn’t good quality. And so, like when I go back to my blog post shitty code, shitty co-pilot, it was basically that if you wanted to have good quality code, then you needed to write your code in the best hygiene possible, start off with your author, comment everything really well, and then it would actually give you good outputs because it’s referencing other code that starts like that, which is good quality code.

And if you’re a junior developer that’s just trying to code something quickly, then it’s going to take examples from other people. So basically, if you’re a shitty coder, you are going to have a shitty co-pilot. If you’re a good coder, you’re going to have a good co-pilot. So, we need to try and be able to distinguish what is good code, what is bad code, and only train our models on good code. I mean that’s a challenge, but it could be a good thing, right? It could be…

Desi: So, I get two things out of that. And first one is I want to apologize for the junior developers because 90% of that code, I’m in there with my GitHub for sure. <laugh>

Mackenzie: Me too buddy. Me too. We’ve already established that I’m a shitty coder <laugh>.

Desi: Number two is that it sounds like from the media that some of the stuff is looking like it wants to go that way where it’s providing suggestions with: Hey, that’s a shitty idea, and it just reminds me of, we’re finally going to get Clippy back, but in like a more modern version where he is tapping on your screen to go: “Don’t be an idiot”.

Mackenzie: Yep. Basically.

Desi: Yeah, that’s really cool. So, I want to flip here to the kind of security side with investigators. So, anyone that’s had to investigate something always comes across a new platform or something that they need to dig into. So, looking at Git repositories seems like such a niche skill to me because it’s not something that I come across every day. Like obviously there’s companies like yours that are built to do forensics around this.

What would you suggest to someone who is an investigator that comes across a case that they have to like go to a public GitHub and maybe their client has said, they’ve had a breach, and their client has said: Hey, we’ve got this GitHub, we need you to go look if our credentials were leaked there and do initial access, like for that one case, what’s your suggestion for them to get a crash course on it? Or like where to start? I guess noting that there’s that squashing history that you were talking about and there’s a whole bunch of version control.

Mackenzie: Yeah. I mean look, the only way to kind of understand Git repositories is to use the tools that ingest that information because it’s like 10% of a Git repository, unless it’s on its first commits is going to be behind the scenes. And it’s going to be a real pain to kind of go through any of that manually. So, there’s lots of tools out there.

So, I’m from a vendor: GitGuardian. We have some tools and some free tools, but there’s also lots of open-source tools in there. And I think one of the differences is that open-source tools generate a lot of false positives, but if you’re investigating something and you’re taking your time to understand it, then I feel like in an investigation, false positives are okay.

But in real life when you haven’t got time on your hands, it’s not. So, there’s free tools out there, Gitleaks, TruffleHog… a bunch of them, where you can scan your repositories and pull out information. You know, if you’re looking for security vulnerabilities, then it’s going to be the same thing: infrastructure as code scanners to try and look at what has been leaked out.

Are there any business logic flaws that have been exposed in this code? I mean, that’s what I’d be looking at. So, if you’re looking at a Git repository and you know what a Git repository is, right? But you are doing forensics and you are not intimately familiar with the technology, now it’s time to get some tools to help you sift through this information because it’s much deeper than you would expect.

It’s much more complicated than you would expect. I mean, this was built originally by Lenard…I forget his name, Lenard, the guy that created Linux, everything he does is very complicated…

Si: Linus Torvalds

Mackenzie: Linus Torvalds, that’s it. Yeah. I don’t know why I was thinking Lenard. Linus Torvalds. So, everything that he created is just very complex with great roots, but Git’s the same, you know, it’s a complicated system. So, you need to kind of go through and use it. So, if you’re doing forensics, hey, there’s lots of automated tools out there that are going to help you. It’s time to take a look at them and don’t try and sift through this information yourself because you’re going to miss it.

And then the other things are to make sure you look at things like your Git commit messages. There’s going to be lots of sensitive information in the messages themselves. Look at different areas of Git, like issues that have been raised because they will give you clues as to kind of what’s happened. So yeah, you’re going to have to dive a little bit deeper under the hood, but it’s okay because you don’t have to do it manually.
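
As a starting point for the commit-message review suggested here, an investigator could list every commit across all branches and flag messages that mention credentials. The sketch below is a rough Python example; the keyword list and function names are assumptions, and it looks only at commit subjects, not at the file contents that dedicated scanners such as Gitleaks or TruffleHog examine.

```python
import re
import subprocess

# Hypothetical keyword list; extend it to suit the case at hand.
SUSPICIOUS = re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key|credential)")


def list_commit_messages(repo_path: str) -> list[tuple[str, str]]:
    """Return (commit hash, subject line) pairs for every ref in the repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--format=%H%x09%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split("\t", 1)) for line in out.splitlines() if "\t" in line]


def flag_messages(commits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only commits whose subject mentions a credential-related keyword."""
    return [(sha, msg) for sha, msg in commits if SUSPICIOUS.search(msg)]
```

A message such as "Remove hardcoded password from config" is a strong hint that the password is still sitting in an earlier commit, which is where a history scanner should be pointed next.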

Si: On the inverse of that, in protecting my own terrible code from either being scraped or from containing secrets, is a private repository actually sufficient protection for me to include hard-coded credentials? Or, do we just say: No, this is bad?

Mackenzie: No, this is bad <laugh>.

Si: Good answer. I’m happy with that <laugh>. So, a private repository is not the solution.

Mackenzie: No, absolutely not. And I get kind of where you’re coming from. You’re saying: Hey, I don’t work with a team, right? No one from my team can be turned. I’m not a big company, so attackers aren’t after me, so why can’t I put my personal stuff in a private repository? And basically, there’s lots of ways for your credentials and everything to get leaked.

And what we’re finding is, in our latest report, the State of Secrets Sprawl, we tracked credentials from being leaked on GitHub and then being exposed in dark web forums and being sold. So, your credentials could be leaked in a number of different ways and be sold on a forum which may give someone access to a private repository. And all of this is just kind of automated.

So it’s just very, very weak authentication to have credentials inside your private Git repository and there’s no logging, so if someone gets access to it, you have no idea if they’ve been in there, and you have no idea what machines it’s been on. Do you have backups automatically done? Is your code now in your Dropbox because you’re automatically backing up your Apple Drive or your computer? Is it in Apple Drive?

I can guarantee you, that even if it’s just you, using a private Git repository, your code is in way more places than you would ever expect. And if your Dropbox credentials get sold on a dark web forum, then you know someone has access to your code. Or whatever it may be, right? There’s just lots of different ways that it could happen. … me for saying this. But what you can do, is there’s this product out there called git-secret. And essentially what it does is it encrypts your secrets.

So, let’s say you’re storing your secrets in an environment variable file, a .env file, and you want a way to be able to share that or store it in Git, because it’s convenient for you. You can encrypt that file and put it into your Git repository encrypted. This gives you a single point of failure. It’s not particularly great from a security point of view, but if the difference is between… but it’s easy. So, the difference is: Hey, we’re going to encrypt them or not encrypt them, then let’s just start by encrypting them.

It’s easy, it’s quick. You can decrypt it with your command line. Like it’s just a very low barrier to entry. Don’t hardcode secrets ever, because if you’re hard coding them just for personal projects, then that’s going to translate into your work. So just practice good hygiene. But you can store them in Git, just make sure they’re encrypted and just don’t, <laugh> don’t hardcode secrets into Git…<laugh>

Desi: <laugh> Alright. Awesome mate. I think, so we’re coming to the top of the hour. It’s been super informative talking about GitHub and the forensics and kind of what you guys do as well. And I’m sure the listeners will get heaps out of it.

Si: Yeah. Yeah.

Desi: I’ll throw one last question to you to wrap it up, but what do you do to unwind? Like what’s your downtime?

Mackenzie: Oh, what’s my downtime? Okay. Well, I get to travel a lot for work and I’m a total history geek. So, I’ve traveled to your historic places and castles. I’m from New Zealand, which has a history that spans 500 years. So, the fact that in Europe, we have so much history here, it’s crazy for me to go to these historic places. I collect coins. I’m embarrassed to say it, I collect coins from everywhere I go, in historic areas and I’ve spent way too much time researching and even cleaning old coins that have build-up on them to try… it’s like opening a present when you buy an ancient coin, but you don’t know what it is, and I clean them. So, I don’t know, I’m just a history geek. So, anything to do with history, coins and… <laugh>

Si: That’s seriously cool.

Mackenzie: Well, I’m glad I’m in good company because my girlfriend does not think so. <laugh>

Desi: I think it’s just like when we get people on, everyone’s got a unique answer. Like, I don’t think anyone ever has like the same kind of hobby and unwinding and all that. So, it’s always really cool to hear.

Si: Yeah.

Desi: But thanks again so much for coming on and talking to us about Git repositories, forensics and everything. Like we said before, to all our listeners, we’ll chuck everything in the show notes. We’ll grab some more show notes off Mackenzie to chuck in as well, including their website, so you can go check them out if it’s something that you’re interested in or something that you need. But thanks for listening. This will be on YouTube Forensic Focus. If you like our stuff, please like and subscribe. And chuck down comments if you’ve liked the content so we can chuck up some more stuff. But thanks again Mackenzie, and hopefully we’ll have you on again in the future.

Mackenzie: Yep. That was awesome guys. Thanks. I had a great time. I haven’t laughed so much on a podcast for a while. So that’s…<laugh>a good sign.

Desi: Awesome.

Si: Excellent.

Desi: All right. See you everyone.

Si: All Right. Cheers guys. Thank you.
