There is no need to worry that your private ChatGPT conversations were obtained in a recently reported breach of OpenAI’s systems. The hack, while troubling, appears to have been superficial — but it’s a reminder that AI companies have, in short order, made themselves into one of the juiciest targets out there for hackers.
Details of the Breach
The New York Times reported the hack in more detail after former OpenAI employee Leopold Aschenbrenner hinted at it recently in a podcast. He called it a “major security incident,” but unnamed company sources told the Times the hacker only got access to an employee discussion forum. (I reached out to OpenAI for confirmation and comment.)
Security and Value of Data
No security breach should be treated as trivial, and eavesdropping on internal OpenAI development talk certainly has value. But it’s far from a hacker accessing internal systems, models in progress, secret roadmaps, and so on. Even so, it should scare us — and not necessarily because of the threat of China or other adversaries overtaking us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to a tremendous amount of very valuable data.
High-Quality Training Data
Let’s talk about three kinds of data that OpenAI and, to a lesser extent, other AI companies have created or have access to: high-quality training data, bulk user interactions, and customer data. It’s uncertain exactly what training data they have, because the companies are incredibly secretive about their hoards. But it’s a mistake to think of those hoards as just big piles of scraped web data. Yes, they use web scrapers and datasets like the Pile, but it takes enormous effort to shape that raw material into something a model can be trained on.
Importance of Dataset Quality
Some machine learning engineers have speculated that of all the factors going into creating a large language model (or, perhaps, any transformer-based system), dataset quality is the single most important one. That’s why a model trained on Twitter and Reddit will never be as eloquent as one trained on every published work of the last century. (And probably why OpenAI reportedly used questionably legal sources like copyrighted books in their training data, a practice they claim to have given up.)
Bulk User Interactions
So the training datasets OpenAI has built are of tremendous value to competitors, from other companies to adversary states to regulators here in the U.S. But perhaps even more valuable is OpenAI’s enormous trove of user data — probably billions of conversations with ChatGPT on hundreds of thousands of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT has its finger on the pulse of a population that may not be as broad as the universe of Google users but provides far more depth.
Value of Conversations
In the case of Google, an uptick in searches for “air conditioners” tells you the market is heating up a bit. But those users don’t have a whole conversation about what they want, how much money they’re willing to spend, what their home is like, which manufacturers they want to avoid, and so on. You know this is valuable because Google is itself trying to coax this kind of information out of its users by substituting AI interactions for searches! Think of how many conversations people have had with ChatGPT, and how valuable that information is — not just to developers of AI models, but to marketing teams, consultants, analysts… it’s a gold mine.
Customer Data
The last data category is perhaps of the highest value on the open market: how customers actually use AI, and the data they have fed to the models. Hundreds of major companies and countless smaller ones use tools like OpenAI’s and Anthropic’s APIs for an equally enormous variety of tasks. And for a language model to be useful to them, it usually must be fine-tuned on or otherwise given access to their internal databases. That might be something as prosaic as old budget sheets or personnel records (to make them more easily searchable, for instance) or as valuable as code for an unreleased piece of software. What they do with the AI’s capabilities (and whether they’re actually useful) is their business, but the simple fact is that the AI provider has privileged access, just as any other SaaS product does.
Risks and Security Practices
These are industrial secrets, and AI companies are suddenly right at the heart of a great many of them. Like any SaaS provider, AI companies can offer industry-standard levels of security, privacy, and on-premises options, and generally they provide their service responsibly. I have no doubt that the private databases and API calls of OpenAI’s Fortune 500 customers are locked down very tightly! They are certainly as aware as anyone of the risks inherent in handling confidential data in the context of AI. (The fact that OpenAI chose not to report this attack is its prerogative, but it doesn’t inspire trust in a company that desperately needs it.)
Value and Threat
But good security practices don’t change the value of what they are meant to protect, or the fact that malicious actors and sundry adversaries are clawing at the door to get in. Security isn’t just picking the right settings or keeping your software updated — though of course those basics matter too.
Summary
There’s no reason to panic — companies with access to lots of personal or commercially valuable data have faced and managed similar risks for years. But AI companies represent a newer, younger, and potentially juicier target than your garden-variety poorly configured enterprise server or irresponsible data broker. Even a hack like the one reported above, with no serious exfiltration that we know of, should worry anybody who does business with AI companies. They’ve painted targets on their own backs.