This is Part II of a pair of posts on Anthropic's recent report on Chinese threat actors using Claude to orchestrate cyber espionage. For Part I on what happened, check here. This post will focus more broadly on how findings (and risks) are being communicated by private AI organizations, and why we should care about how these discussions are framed.
As discussed in my first post, this event was not a fully autonomous AI agent carrying out an attack at the behest of a state-backed Chinese group, but it does represent a novel way for bad actors to use AI. For the attack to get this far, Anthropic's user monitoring systems failed to detect the behavior, and its built-in safeguards failed to block it at inference. Despite these failures, the messaging throughout these communications struck an interesting tone. Take the first sentence of their blog post:
"We recently argued that an inflection point had been reached in cybersecurity: a point at which AI models had become genuinely useful for cybersecurity operations, both for good and for ill."
Compare this with the PR from any cyber breach, and you might think this sounds more like a shiny new use-case than a company admitting that its product was jailbroken to aid a Chinese threat actor. This kind of messaging is not only PR speak; it also asserts a few important things. First, it asserts that Anthropic has been warning us about this for some time. Second, and more importantly, it asserts the expanding capabilities of their flagship product, Claude. In essence, this vulnerability (and its associated risk) is an excellent marketing opportunity for Claude itself.
This kind of messaging continues in the full report, albeit with a bit more detail. Take this example:
"Claude maintained persistent operational context across sessions spanning multiple days, enabling complex campaigns to resume seamlessly without requiring human operators to manually reconstruct progress."
Not to be petty, but does the word "seamlessly" strike anyone else as extremely odd here? This is a word that conjures the image of a pleasant and efficient user experience, but in this case the user is a collective of Chinese hackers. It is exactly this kind of tone that fairly invites criticism about the purpose of some of these self-initiated safety reports. In this case, it is hard to feel that the risk is being taken seriously; it reads more like the capability on display is being celebrated. Everyone look! This LLM we make actually works in the real world!
The report finishes up by asking the question: "[I]f AI models can be misused for cyberattacks at this scale, why continue to develop and release them?" Setting aside whether it is realistic to expect Anthropic to pause AI development, their response is unsurprising: use their LLM for cyber defense, too. It is worth noting that Anthropic also highlighted plans to expand its own detection capabilities for autonomous attacks, but it is hard to shake the feeling that this report sells us the problem and the solution in one neat package.
This is more a criticism of the message than of those working on safety teams at Anthropic or any frontier lab. I want to emphasize that I know there are scores of extremely smart, concerned and thoughtful people working in these organizations, and I don't mean to discount their earnest interest (and progress) in making their products safer. However, I can't help but feel like I just read an advertisement dressed up as a safety report.
We see this kind of "marketable risk" in many communications about AI safety - asserting that AI could be an existential risk to humanity is also an advertisement that it could give unimaginable power to its owners. In this case, their product created a "seamless" experience for bad actors, but that same capability could be very valuable in more legitimate markets. Setting aside whether or not this technology could present catastrophic risk, we must recognize the incentive this dynamic creates. It is much easier for organizations to focus on risks that are "marketable" in this sense than on, say, more present harms like ecological damage. The assertion that an LLM destroys the environment does not also promise its owners god-like power.
Because of these incentives, it becomes important that we build a more robust infrastructure of third-party evaluators who can report on events like this, and a community of journalists who can communicate precisely the risks and remediations those reports identify. In short, we should expect AI developers to tell us their products are seamless (and very powerful), but maybe we shouldn't expect them to tell us when their products have been compromised.