I want to start by saying that Sergey Bratus and DARPA were geniuses at foreseeing the problems that have led us to Jia Tan and XZ. One of Sergey's projects at DARPA, SocialCyber, which I spent a couple years as a performer on, as part of the Margin Research team, was aimed directly at the issue of trust inside software development.
Sergey's theory of the case, that in order to secure software, you must understand software, and that software includes both the technical artifacts (aka, the commits) and the social artifacts (messages around software, and the network of people that build the software), holds true to this day, and has not, in my opinion, received the attention it deserves.
Like all great ideas, it seems obvious in retrospect.
During my time on the project, we focused heavily on looking at that most important of open source projects, the Linux Kernel. Part of that work was in the difficult technical areas of ingesting years of data into a format that could be queried and analyzed (which are two very different things). In the end, we had a clean Neo4j graph database that allowed for advanced analytics.
I've since extended this to multi-repo analysis. And if you're wondering if this is useful, then here is a screenshot from this morning that shows two other users with a name+number@gmail.com, TZ=8, and a low pagerank in the overall imported software community who have commit access to LibLZ (one of the repos "Jia Tan" was working with):
There's a lot of signals you can use to detect suspicious commits to a repository:
- Time zones (easily forged, but often forgotten), sometimes you see "impossible travel" or time-of-life anomalies as well
- Pagerank (measures "importance" of any user in the global or local scope of the repo). "Low" is relative but I just pick a random cutoff and then analyze from there.
- Users committing to areas of the code they don't normally touch (requires using Community detection algorithms to determine areas of code)
- Users in the community of other "bad" users
- Have I Been Pwned or other methods of determining the "real"-ness of an email address - especially email addresses that come from non-corpo realms, although often you'll see people author a patch from their corpo address, and commit it from their personal address (esp. in China)
- Semantic similarity to other "bad" code (using an embeddings from CodeBERT and Neo4j's vector database for fast lookup)
You can learn a lot of weird/disturbing things about the open source community by looking at it this way, with the proper tools. And I'll dump a couple slides from our work here below (from 2022), without further comment:
No comments:
Post a Comment