I spent some time looking at which open source packages have not been maintained or updated, and who depends on those packages. The answer is YOU :)
I really like this quick Reagent query as an example. There are three hundred and fifty Pip packages in the top 5000 Pip packages with no updates since 2020? Perfect for JiaTaning!
Saturday, April 20, 2024
What Open Source projects are unmaintained and should you target for takeover?
Thursday, April 18, 2024
The Open Source Problem
People are having a big freakout about the Jia Tan user, and I want to throw a little napalm on that kitchen fire by showing y'all what the open source community looks like when you filter it for people with the same basic signature as Jia Tan. The summary here is: you have software on your machine right now that is running code from one of many similar "suspicious" accounts.
We can run a simple scan for "Jia-Tans" with a test Reagent database and a few Cypher queries. The first one just looks at the top 5000 Pip packages for anyone who:
- has commit access
- is in time zone 8 (UTC+8, mostly China)
- has an email that matches the simple regular expression the Jia Tan team used for their email (a Gmail with name+number):
MATCH path=(p:Pip)<-[:PARENT]-(r:Repo)<-[:COMMITTER_IN]-(u:User)
WHERE u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$'
AND u.tz_guess = 8
RETURN path
LIMIT 5000
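To make the filter concrete, here is a minimal Python sketch of the same predicate applied to plain records rather than a graph. The `Committer` class and the sample records are hypothetical. Only the regular expression, the `tz_guess = 8` test, and the Pip contributor address come from this post. Cypher's `=~` has to match the whole string, so the sketch uses `re.fullmatch`.

```python
import re
from dataclasses import dataclass

# Same pattern as the Cypher query: letters, then digits, at gmail.com.
JIA_TAN_EMAIL = re.compile(r"[a-zA-Z]+[0-9]+@gmail\.com")


@dataclass
class Committer:  # hypothetical record type, for illustration only
    email_address: str
    tz_guess: int  # guessed UTC offset in hours


def looks_like_jia_tan(user: Committer) -> bool:
    """Mirror the WHERE clause: name+number Gmail address and tz_guess == 8."""
    # fullmatch mirrors Cypher's =~, which must match the entire string.
    return (
        user.tz_guess == 8
        and JIA_TAN_EMAIL.fullmatch(user.email_address) is not None
    )


# Made-up sample data, except the first address, which this post quotes.
# Its tz_guess of 8 is implied by the query result, not stated separately.
sample = [
    Committer("meowmeowcat1211@gmail.com", 8),
    Committer("alice@example.org", 8),       # wrong domain, no digits
    Committer("bob42@gmail.com", -5),        # right shape, wrong time zone
    Committer("carol.smith7@gmail.com", 8),  # dot breaks the letters-only prefix
]

matches = [u.email_address for u in sample if looks_like_jia_tan(u)]
print(matches)
```

The point of the exercise is how little the pattern asks for: any letters-then-digits Gmail address in the right time zone passes.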
This gets us a little graph with 310 Pip packages selected:
One of my favorites is that Pip itself has a matching contributor: meowmeowcat1211@gmail.com
Almost every package of importance has a user that matches our suspicious criteria. And of course, your problems just start there when you look at the magnitude of these packages.
You can also look for matching Jia Tan-like Users who own (as opposed to just commit into) pip packages in the top 5000:
MATCH path=(u:User)-[:PARENT]->(p:Pip)<-[:PARENT]-(r:Repo)
WHERE u.tz_guess = 8
AND u.email_address =~ '^[a-zA-Z]+[0-9]+@gmail\\.com$'
RETURN path
ORDER BY r.pagerank DESC
On the other hand, many people don't care about the particular regular expression that matches emails. What if we broadened it out to all Chinese owners of top 5000 Pip packages with either Gmail or QQ.com addresses, and all the packages that rely on them? We sort by pagerank for shock value.
MATCH path=(u:User)-[:PARENT]->(p:Pip)<-[:REQUIRES*..5]-(p2:Pip)<-[:PARENT]-(r:Repo)
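The `<-[:REQUIRES*..5]-` part of that pattern follows dependency edges backwards for one to five hops, which is what pulls in every package that relies, directly or transitively, on an owner's package. Here is a minimal sketch of that traversal over an in-memory dependency map. The package names and the `dependents_within` helper are made up for illustration, and Reagent itself does this inside Neo4j:

```python
from collections import deque

# Hypothetical data: package -> packages it requires.
requires = {
    "app": ["web"],
    "web": ["http"],
    "http": ["tls"],
    "tls": ["compress"],
    "compress": [],
    "cli": ["compress"],
}


def dependents_within(target, requires, max_hops=5):
    """Return {package: hops} for packages that reach `target`
    through 1..max_hops REQUIRES edges. The target itself is excluded."""
    # Invert the map so we can walk from a package to whatever requires it.
    required_by = {}
    for pkg, deps in requires.items():
        for dep in deps:
            required_by.setdefault(dep, []).append(pkg)

    hops = {}
    queue = deque([(target, 0)])
    while queue:
        pkg, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand past the hop limit
        for parent in required_by.get(pkg, []):
            if parent != target and parent not in hops:
                hops[parent] = dist + 1
                queue.append((parent, dist + 1))
    return hops


print(dependents_within("compress", requires))
print(dependents_within("compress", requires, max_hops=2))
```

Even in this toy map, one low-level package ends up underneath everything else, which is why sorting the real result by pagerank has the shock value it does.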
One of the unique things about Reagent is we can say if a contributor is actually a maintainer, using some graph theory that we've gone into in depth in other posts. This is the query you could use:
Ok, so that's the tip of the iceberg! We didn't go over using HIBP as a verification on emails, or looking at any time data at all or commit frequencies or commit message content or anything like that. And of course, we also support NPM and Deb packages, and just Git repos in general. Perhaps in the next blog post we will pull the thread further.
Also: I want to thank the DARPA SocialCyber program for sponsoring this work! Definitely thinking ahead!
Wednesday, April 3, 2024
Jia Tan and SocialCyber
I want to start by saying that Sergey Bratus and DARPA were geniuses at foreseeing the problems that have led us to Jia Tan and XZ. One of Sergey's projects at DARPA, SocialCyber, which I spent a couple years as a performer on, as part of the Margin Research team, was aimed directly at the issue of trust inside software development.
Sergey's theory of the case, that in order to secure software, you must understand software, and that software includes both the technical artifacts (aka, the commits) and the social artifacts (messages around software, and the network of people that build the software), holds true to this day, and has not, in my opinion, received the attention it deserves.
Like all great ideas, it seems obvious in retrospect.
During my time on the project, we focused heavily on looking at that most important of open source projects, the Linux Kernel. Part of that work was in the difficult technical areas of ingesting years of data into a format that could be queried and analyzed (which are two very different things). In the end, we had a clean Neo4j graph database that allowed for advanced analytics.
I've since extended this to multi-repo analysis. And if you're wondering if this is useful, then here is a screenshot from this morning that shows two other users with a name+number@gmail.com, TZ=8, and a low pagerank in the overall imported software community who have commit access to LibLZ (one of the repos "Jia Tan" was working with):
There are a lot of signals you can use to detect suspicious commits to a repository:
- Time zones (easily forged, but often forgotten). Sometimes you see "impossible travel" or time-of-life anomalies as well
- Pagerank (measures "importance" of any user in the global or local scope of the repo). "Low" is relative but I just pick a random cutoff and then analyze from there.
- Users committing to areas of the code they don't normally touch (requires using Community detection algorithms to determine areas of code)
- Users in the community of other "bad" users
- Have I Been Pwned or other methods of determining the "real"-ness of an email address - especially email addresses that come from non-corpo realms, although often you'll see people author a patch from their corpo address, and commit it from their personal address (esp. in China)
- Semantic similarity to other "bad" code (using embeddings from CodeBERT and Neo4j's vector database for fast lookup)
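As one illustration of the first signal, a time-zone guess can be as simple as the most common UTC offset in a user's commit timestamps. This is a minimal sketch under that assumption, not a description of how Reagent computes `tz_guess`. The `guess_tz` helper and the sample timestamps are hypothetical, and the input format is the ISO 8601 form that `git log --date=iso-strict` prints.

```python
from collections import Counter
from datetime import datetime


def guess_tz(commit_dates):
    """Return (most common UTC offset in hours, share of commits using it)."""
    offsets = Counter()
    for stamp in commit_dates:
        dt = datetime.fromisoformat(stamp)  # keeps the recorded UTC offset
        offsets[dt.utcoffset().total_seconds() / 3600] += 1
    offset, count = offsets.most_common(1)[0]
    return offset, count / len(commit_dates)


# Made-up timestamps for one author.
dates = [
    "2024-01-05T21:14:02+08:00",
    "2024-01-09T22:40:51+08:00",
    "2024-02-11T20:03:17+08:00",
    "2024-02-23T19:55:40+02:00",  # an outlier worth a closer look
]

tz, share = guess_tz(dates)
print(tz, share)
```

A share well below 1.0 is the interesting case: someone who usually sets one offset and occasionally forgets is exactly the "easily forged, but often forgotten" pattern.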
You can learn a lot of weird/disturbing things about the open source community by looking at it this way, with the proper tools. And I'll dump a couple slides from our work here below (from 2022), without further comment: