Saturday, April 20, 2024

What Open Source projects are unmaintained and should you target for takeover?

I spent some time looking at which open source packages have not been maintained or updated, and who depends on those packages. The answer is YOU :)

I really like this quick Reagent query as an example. There are three hundred and fifty Pip packages in the top 5000 with no updates since 2020? Perfect for JiaTaning!

I'm not printing all of them because that's not great as a format for a blogpost, but if you want to know more, feel free to email me. 

Of course, there's also dependencies to worry about. One Pip package can "Require" another Pip package, and we look at that with the below query:

72 packages are put at risk by dependencies that have not been maintained since 2017. That rises to 525 if you look at packages not updated since 2020 - a full 10% of the top 5000.
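To make the counting concrete, here is a rough Python sketch of the same idea on toy data. The package names, years, and the `at_risk` helper are all made up for illustration, and "not updated since" is assumed to mean "last update year is at or before the cutoff". This is not the Reagent query, just the shape of the logic.

```python
# Hypothetical toy data: package -> year of last update,
# and package -> the packages it directly requires.
last_update = {"alpha": 2016, "beta": 2019, "gamma": 2023, "delta": 2024, "epsilon": 2022}
requires = {
    "gamma": ["alpha"],
    "delta": ["beta", "gamma"],
    "epsilon": [],
}

def at_risk(cutoff_year):
    """Packages that directly require something last updated in cutoff_year or earlier."""
    stale = {name for name, year in last_update.items() if year <= cutoff_year}
    return {pkg for pkg, deps in requires.items() if any(d in stale for d in deps)}

risk_2017 = at_risk(2017)
risk_2020 = at_risk(2020)
```

Moving the cutoff forward can only grow the set, which is why the 2020 number is so much larger than the 2017 one.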

This is just looking at a small piece of the puzzle - but Pip is probably the most important repository and software source on the planet and we know it's often targeted by adversaries. Being able to predict where the next Jia Tan is targeting is important, but also quite easy with some simple Neo4j Queries on Reagent!

Thursday, April 18, 2024

The Open Source Problem

People are having a big freakout about the Jia Tan user and I want to throw a little napalm on that kitchen fire by showing y'all what the open source community looks like when you filter it for people with the same basic signature as Jia Tan. The summary here is: You have software on your machine right now that is running code from one of many similar "suspicious" accounts.

We can run a simple scan for "Jia-Tans" with a test Reagent database and a few Cypher queries, the first one just looking at the top 5000 Pip packages for anyone who:

  • has commit access
  • is in Timezone 8 (mostly China)
  • has an email that matches the simple regular expression the Jia Tan team used for their email (a Gmail with name+number):

MATCH path=(p:Pip)<-[:PARENT]-(r:Repo)<-[:COMMITTER_IN]-(u:User) WHERE u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$' AND u.tz_guess = 8 RETURN path LIMIT 5000
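If you want to sanity-check what that filter does and doesn't catch outside the graph, here is a minimal Python sketch of the same two conditions. The user records are invented for illustration, and `matches_signature` is a hypothetical helper, not part of Reagent.

```python
import re

# Same shape as the pattern in the Cypher query: letters, then digits, then @gmail.com.
# Cypher's =~ has to match the whole string, so fullmatch is used here.
JIA_TAN_EMAIL = re.compile(r"[a-zA-Z]+[0-9]+@gmail\.com")

def matches_signature(user):
    """True if the record has the name+number Gmail shape and tz_guess == 8."""
    return (
        user["tz_guess"] == 8
        and JIA_TAN_EMAIL.fullmatch(user["email_address"]) is not None
    )

# Hypothetical records, made up for illustration only.
users = [
    {"email_address": "examplename123@gmail.com", "tz_guess": 8},
    {"email_address": "examplename123@gmail.com", "tz_guess": -5},
    {"email_address": "example.name@gmail.com", "tz_guess": 8},
    {"email_address": "examplename123@example.org", "tz_guess": 8},
]

flagged = [u for u in users if matches_signature(u)]
```

Only the first record is flagged. It is a very loose signature, which is the whole point: plenty of perfectly normal people match it.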

This gets us a little graph with 310 Pip packages selected:

So many potential targets, so little time

One of my favorites is that Pip itself has a matching contributor:

I'm sure whoever meowmeowcat is did a great job editing

Almost every package of importance has a user that matches our suspicious criteria. And of course, your problems just start there when you look at the magnitude of these packages. 

I didn't scroll all the way down, but you can imagine how long this list is.

You can also look for matching Jia Tan-like Users who own (as opposed to just commit into) pip packages in the top 5000:

MATCH path=(u:User)-[:PARENT]->(p:Pip)<-[:PARENT]-(r:Repo)
WHERE u.tz_guess = 8
AND u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$'
RETURN path, r.pagerank
ORDER BY r.pagerank DESC

Ok, there's not as many (less than 100), but some of these might be interesting given they are in the top 5000.

Pip packages can require other pip packages to be installed, and you also want to look at that entire chain of dependencies when looking at your risk profile. Reagent allows you to do this with a simple query. Below you can see the popular diffusers tool and scipy packages require pip packages that match "dangerous" users. In the scipy case, this is only if you install it as a dev. But nonetheless, this is interesting.

MATCH path=(u:User)-[:PARENT]->(p:Pip)<-[:REQUIRES]-(p2:Pip)<-[:PARENT]-(r:Repo)
WHERE u.tz_guess = 8
AND u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$'
RETURN path

On the other hand, many people don't care about the particular regular expression that matches emails. What if we broadened it out to all Chinese owners of top-5000 Pip packages with either Gmail or addresses and all the packages that rely on them? We sort by pagerank for shock value.
For customers that want to cut and paste into their DB:
MATCH path=(u:User)-[:PARENT]->(p:Pip)<-[:REQUIRES*..5]-(p2:Pip)<-[:PARENT]-(r:Repo)
WHERE u.tz_guess = 8
  AND ALL(rel IN relationships(path) WHERE rel.marker IS NULL)
  AND (u.email_address CONTAINS "" OR u.email_address CONTAINS "")
RETURN path, r.pagerank
ORDER BY r.pagerank DESC

Don't run Ansible, I guess?
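The REQUIRES*..5 part of that query is doing the heavy lifting: it follows up to five dependency hops. As a rough sketch of what that walk looks like outside the database, here is a breadth-first version in Python. The `dependents_within` helper and the toy dependency chain are hypothetical, and it only tracks the shortest hop count rather than every path the way Cypher would.

```python
from collections import deque

def dependents_within(requires, targets, max_depth=5):
    """Return {package: hops} for packages that reach a target in 1..max_depth REQUIRES hops."""
    # Invert the edges so we can walk from a target to the packages that require it.
    required_by = {}
    for pkg, deps in requires.items():
        for dep in deps:
            required_by.setdefault(dep, set()).add(pkg)

    found = {}
    queue = deque((t, 0) for t in targets)
    while queue:
        pkg, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not walk past the hop limit
        for parent in required_by.get(pkg, ()):
            if parent not in found:
                found[parent] = depth + 1
                queue.append((parent, depth + 1))
    return found

# Hypothetical chain: g requires f requires e ... requires a.
requires = {"b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"], "f": ["e"], "g": ["f"]}
exposed = dependents_within(requires, {"a"})
```

Here "a" stands in for a package owned by a matching user. Everything from "b" through "f" is exposed within five hops, and "g" falls just outside the limit.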

One of the unique things about Reagent is we can say if a contributor is actually a maintainer, using some graph theory that we've gone into in depth in other posts. This is the query you could use:
MATCH path=(u:User)-[:MAINTAINS]->(c:Community)<-[:HAS_COMMUNITY]-(r:Repo)-[:PARENT]->(p:Pip)
 WHERE u.tz_guess = 8
 AND u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$'
 RETURN path

As you can see, there are quite a few Pip packages where at least one maintainer (by our own definition) has been or currently is in the "Jia Tan"-style format.

Ok, so that's the tip of the iceberg! We didn't go over using HIBP as a verification on emails, or looking at any time data at all or commit frequencies or commit message content or anything like that. And of course, we also support NPM and Deb packages, and just Git repos in general. Perhaps in the next blog post we will pull the thread further. 

Also: I want to thank the DARPA SocialCyber program for sponsoring this work! Definitely thinking ahead! 

Wednesday, April 3, 2024

Jia Tan and SocialCyber

I want to start by saying that Sergey Bratus and DARPA were geniuses at foreseeing the problems that have led us to Jia Tan and XZ. One of Sergey's projects at DARPA, SocialCyber, which I spent a couple years as a performer on, as part of the Margin Research team, was aimed directly at the issue of trust inside software development.

Sergey's theory of the case, that in order to secure software, you must understand software, and that software includes both the technical artifacts (aka, the commits) and the social artifacts (messages around software, and the network of people that build the software), holds true to this day, and has not, in my opinion, received the attention it deserves.

Like all great ideas, it seems obvious in retrospect. 

During my time on the project, we focused heavily on looking at that most important of open source projects, the Linux Kernel. Part of that work was in the difficult technical areas of ingesting years of data into a format that could be queried and analyzed (which are two very different things). In the end, we had a clean Neo4j graph database that allowed for advanced analytics.

I've since extended this to multi-repo analysis. And if you're wondering if this is useful, then here is a screenshot from this morning that shows two other users with a, TZ=8, and a low pagerank in the overall imported software community who have commit access to LibLZ (one of the repos "Jia Tan" was working with):

There are a lot of signals you can use to detect suspicious commits to a repository:

  • Time zones (easily forged, but often forgotten), sometimes you see "impossible travel" or time-of-life anomalies as well
  • Pagerank (measures "importance" of any user in the global or local scope of the repo). "Low" is relative but I just pick a random cutoff and then analyze from there.
  • Users committing to areas of the code they don't normally touch (requires using Community detection algorithms to determine areas of code)
  • Users in the community of other "bad" users
  • Have I Been Pwned or other methods of determining the "real"-ness of an email address - especially email addresses that come from non-corpo realms, although often you'll see people author a patch from their corpo address, and commit it from their personal address (esp. in China)
  • Semantic similarity to other "bad" code (using embeddings from CodeBERT and Neo4j's vector database for fast lookup)
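As one small illustration of the time zone signal from the list above, here is a Python sketch that guesses a time zone per author from commit timestamps and flags authors who show up with more than one UTC offset. The commit data and the `tz_profile` helper are invented for the example. This is an assumption about how such a guess could be made, not a description of how our tooling actually computes it.

```python
from collections import Counter, defaultdict
from datetime import datetime

def tz_profile(commits):
    """Map each author to (most common UTC offset in hours, set of all offsets seen)."""
    offsets = defaultdict(Counter)
    for author, stamp in commits:
        # ISO 8601 commit timestamps carry the committer's local UTC offset.
        offset = datetime.fromisoformat(stamp).utcoffset()
        offsets[author][offset.total_seconds() / 3600] += 1
    return {a: (c.most_common(1)[0][0], set(c)) for a, c in offsets.items()}

# Hypothetical commit log: (author, timestamp with offset).
commits = [
    ("user_a", "2024-01-05T21:14:00+08:00"),
    ("user_a", "2024-01-09T22:40:00+08:00"),
    ("user_a", "2024-01-12T20:03:00+02:00"),
    ("user_b", "2024-01-06T09:30:00-05:00"),
]

profile = tz_profile(commits)
mixed = sorted(a for a, (_, seen) in profile.items() if len(seen) > 1)
```

An author with mixed offsets is not automatically suspicious (people travel), but it is a cheap place to start looking.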

You can learn a lot of weird/disturbing things about the open source community by looking at it this way, with the proper tools. And I'll dump a couple slides from our work here below (from 2022), without further comment: