Business Detecting and preventing distillation attacks - anthropic seething at chinks and open source


Detecting and preventing distillation attacks​

Feb 23, 2026
We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions.

These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.

These campaigns are growing in intensity and sophistication. The window to act is narrow, and the threat extends beyond any single company or region. Addressing it will require rapid, coordinated action among industry players, policymakers, and the global AI community.

Why distillation matters​

Illicitly distilled models lack necessary safeguards, creating significant national security risks. Anthropic and other US companies build systems that prevent state and non-state actors from using AI to, for example, develop bioweapons or carry out malicious cyber activities. Models built through illicit distillation are unlikely to retain those safeguards, meaning that dangerous capabilities can proliferate with many protections stripped out entirely.

Foreign labs that distill American models can then feed these unprotected capabilities into military, intelligence, and surveillance systems—enabling authoritarian governments to deploy frontier AI for offensive cyber operations, disinformation campaigns, and mass surveillance. If distilled models are open-sourced, this risk multiplies as these capabilities spread freely beyond any single government's control.

Distillation attacks and export controls​

Anthropic has consistently supported export controls to help maintain America’s lead in AI. Distillation attacks undermine those controls by allowing foreign labs, including those subject to the control of the Chinese Communist Party, to close, through other means, the competitive advantage that export controls are designed to preserve.

Without visibility into these attacks, the apparently rapid advancements made by these labs are incorrectly taken as evidence that export controls are ineffective and can be circumvented through innovation. In reality, these advancements depend in significant part on capabilities extracted from American models, and executing this extraction at scale requires access to advanced chips. Distillation attacks therefore reinforce the rationale for export controls: restricted chip access limits both direct model training and the scale of illicit distillation.

What we found​

The three distillation campaigns detailed below followed a similar playbook, using fraudulent accounts and proxy services to access Claude at scale while evading detection. The volume, structure, and focus of the prompts were distinct from normal usage patterns, reflecting deliberate capability extraction rather than legitimate use.

We attributed each campaign to a specific lab with high confidence through IP address correlation, request metadata, infrastructure indicators, and in some cases corroboration from industry partners who observed the same actors and behaviors on their platforms. Each campaign targeted Claude's most differentiated capabilities: agentic reasoning, tool use, and coding.

DeepSeek​

Scale: Over 150,000 exchanges

The operation targeted:

  • Reasoning capabilities across diverse tasks
  • Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning
  • Creating censorship-safe alternatives to politically sensitive queries
DeepSeek generated synchronized traffic across accounts. Identical patterns, shared payment methods, and coordinated timing suggested “load balancing” to increase throughput, improve reliability, and avoid detection.
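The coordinated timing described above can, in principle, be surfaced by comparing request-timing histograms across accounts. The following is a minimal, hypothetical sketch of that idea — the function names, binning window, and similarity threshold are illustrative assumptions, not a description of Anthropic's actual detection tooling:

```python
from collections import Counter
from itertools import combinations

def bin_timestamps(timestamps, bin_seconds=60):
    """Bucket request timestamps (epoch seconds) into fixed-width windows."""
    return Counter(int(t // bin_seconds) for t in timestamps)

def cosine(a, b):
    """Cosine similarity between two sparse histograms (Counters)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def synchronized_pairs(account_requests, threshold=0.9):
    """Return account pairs whose request-timing histograms nearly match.

    `account_requests` maps account id -> list of request timestamps.
    Highly similar histograms across nominally unrelated accounts are
    one signal of the 'load balancing' pattern described above.
    """
    hists = {acct: bin_timestamps(ts) for acct, ts in account_requests.items()}
    return [(a, b) for a, b in combinations(sorted(hists), 2)
            if cosine(hists[a], hists[b]) >= threshold]
```

In practice a real system would combine timing with payment and infrastructure signals rather than rely on any single indicator.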

In one notable technique, their prompts asked Claude to imagine and articulate the internal reasoning behind a completed response and write it out step by step—effectively generating chain-of-thought training data at scale. We also observed tasks in which Claude was used to generate censorship-safe alternatives to politically sensitive queries like questions about dissidents, party leaders, or authoritarianism, likely in order to train DeepSeek’s own models to steer conversations away from censored topics. By examining request metadata, we were able to trace these accounts to specific researchers at the lab.

Moonshot AI​

Scale: Over 3.4 million exchanges

The operation targeted:

  • Agentic reasoning and tool use
  • Coding and data analysis
  • Computer-use agent development
  • Computer vision
Moonshot (Kimi models) employed hundreds of fraudulent accounts spanning multiple access pathways. Varied account types made the campaign harder to detect as a coordinated operation. We attributed the campaign through request metadata, which matched the public profiles of senior Moonshot staff. In a later phase, Moonshot used a more targeted approach, attempting to extract and reconstruct Claude’s reasoning traces.

MiniMax​

Scale: Over 13 million exchanges

The operation targeted:

  • Agentic coding
  • Tool use and orchestration
We attributed the campaign to MiniMax through request metadata and infrastructure indicators, and confirmed timings against their public product roadmap. We detected this campaign while it was still active—before MiniMax released the model it was training—giving us unprecedented visibility into the life cycle of distillation attacks, from data generation through to model launch. When we released a new model during MiniMax’s active campaign, they pivoted within 24 hours, redirecting nearly half their traffic to capture capabilities from our latest system.

How distillers access frontier models​

For national security reasons, Anthropic does not currently offer commercial access to Claude in China, or to subsidiaries of Chinese companies located outside the country.

To circumvent this, labs use commercial proxy services which resell access to Claude and other frontier AI models at scale. These services run what we call “hydra cluster” architectures: sprawling networks of fraudulent accounts that distribute traffic across our API as well as third-party cloud platforms. The breadth of these networks means that there are no single points of failure. When one account is banned, a new one takes its place. In one case, a single proxy network managed more than 20,000 fraudulent accounts simultaneously, mixing distillation traffic with unrelated customer requests to make detection harder.
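One generic way to collapse such a network is to link accounts that share any hard-to-vary indicator — a payment instrument, device fingerprint, or hosting ASN — into clusters. The sketch below uses union-find under those assumptions; the indicator names are made up for illustration, and this is not a description of any provider's real pipeline:

```python
from collections import defaultdict

def link_accounts(accounts):
    """Group accounts into clusters via shared indicators (union-find).

    `accounts` maps account_id -> set of indicator strings (e.g. payment
    hashes, device fingerprints, ASNs). Accounts sharing any indicator
    are merged into one cluster, so banning a cluster removes the whole
    'hydra' head rather than a single account.
    """
    parent = {a: a for a in accounts}

    def find(x):
        # Path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    owner = {}  # indicator -> first account seen with it
    for acct, indicators in accounts.items():
        for ind in indicators:
            if ind in owner:
                union(acct, owner[ind])
            else:
                owner[ind] = acct

    clusters = defaultdict(set)
    for a in accounts:
        clusters[find(a)].add(a)
    return list(clusters.values())
```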

Once access is secured, the labs generate large volumes of carefully crafted prompts designed to extract specific capabilities from the model. The goal is either to collect high-quality responses for direct model training, or to generate tens of thousands of unique tasks needed to run reinforcement learning. What distinguishes a distillation attack from normal usage is the pattern. A prompt like the following (which approximates similar prompts we have seen used repetitively and at scale) may seem benign on its own:

You are an expert data analyst combining statistical rigor with deep domain knowledge. Your goal is to deliver data-driven insights — not summaries or visualizations — grounded in real data and supported by complete and transparent reasoning.
But when variations of that prompt arrive tens of thousands of times across hundreds of coordinated accounts, all targeting the same narrow capability, the pattern becomes clear. Massive volume concentrated in a few areas, highly repetitive structures, and content that maps directly onto what is most valuable for training an AI model are the hallmarks of a distillation attack.
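As a rough illustration of that pattern, near-duplicate prompts can be flagged by comparing word-shingle overlap across a traffic sample. The thresholds and helper names below are assumptions made for the sketch, not values drawn from the article:

```python
def shingles(text, k=5):
    """Return the set of k-word shingles in a prompt."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def flag_repetitive(prompts, sim_threshold=0.6, count_threshold=3):
    """Flag prompts that are near-duplicates of many others in the batch.

    Returns indices of prompts whose near-duplicate count meets
    count_threshold -- a crude proxy for the 'highly repetitive,
    narrowly targeted' traffic described above.
    """
    sets = [shingles(p) for p in prompts]
    flagged = []
    for i, si in enumerate(sets):
        dupes = sum(1 for j, sj in enumerate(sets)
                    if i != j and jaccard(si, sj) >= sim_threshold)
        if dupes >= count_threshold:
            flagged.append(i)
    return flagged
```

A production system would use scalable approximations (e.g. MinHash) rather than pairwise comparison, but the signal is the same: massive volume of structurally near-identical prompts.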

How we’re responding​

We continue to invest heavily in defenses that make such distillation attacks harder to execute and easier to identify. These include:

  • Detection. We have built several classifiers and behavioral fingerprinting systems designed to identify distillation attack patterns in API traffic. This includes detection of chain-of-thought elicitation used to construct reasoning training data. We have also built detection tools for identifying coordinated activity across large numbers of accounts.
  • Intelligence sharing. We are sharing technical indicators with other AI labs, cloud providers, and relevant authorities. This provides a more holistic picture of the distillation landscape.
  • Access controls. We’ve strengthened verification for educational accounts, security research programs, and startup organizations—the pathways most commonly exploited for setting up fraudulent accounts.
  • Countermeasures. We are developing product-, API-, and model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers.
But no company can solve this alone. As we noted above, distillation attacks at this scale require a coordinated response across the AI industry, cloud providers, and policymakers. We are publishing this to make the evidence available to everyone with a stake in the outcome.
 
Note that Anthropic models won't let you say nigger and will refuse to access KF for being a terrorist harassment website, but the open weight and Chinese ones will only give you a content warning.
 
I wonder what does Null think of this?
 
Remember when there was all that hype in the news about the Chinese "AI heroes" managing to train a world-class LLM for only a fraction of the computing resources and energy that OpenAI, Meta, and the rest used? That quieted down real quick.
 
Remember when there was all that hype in the news about the Chinese "AI heroes" managing to train a world-class LLM for only a fraction of the computing resources and energy that OpenAI, Meta, and the rest used? That quieted down real quick.
They are AI heroes, it was just dumb to leak it. When Jewgle in its early years scammed an interwebs provider to transfer jiggabytes of data across the US for free, "everyone" applauded and opened their mouths to catch circumcized cum drops. And they were screwing another US business. Chinks are doing it to their country's enemy. Based chinks.
 
Their own model is a “distillation” in a sense: millions of hours of work by programmers who never consented to or expected that their work would become the basis of an AI company's model designed to replace them once it had learned enough. Claude's makers may not like it, and they're seeking legal means to prevent it, but the moral arguments against it are shot. There's work in creating and training the models, but there's also work in creating and training the distilled versions. Their actual argument is “you're copying our work,” but that's how they themselves created it. The knowledge, in both cases, is copied from someone else.

You'll see more and more of AI companies trying to pull up the ladder behind them via lobbying and laws. When they call for more regulation, they're not doing it to protect the public. They're doing it because in a developing tech space it's quite easy for some upstart with a clever idea to quickly replace incumbents, and the incumbents can absorb big regulatory costs that small start-ups cannot. Big companies don't mind the government imposing a cost of doing business on them if that cost is low enough for them but too high for new entrants.

Expect more of this. Just don't swallow the moral dressing they cloak it in.
 
Note that Anthropic models won't let you say nigger and will refuse to access KF for being a terrorist harassment website, but the open weight and Chinese ones will only give you a content warning.
Meanwhile Josh uses Claude to improve KF.
:story:
 
Illicitly distilled models lack necessary safeguards, creating significant national security risks. Anthropic and other US companies build systems that prevent state and non-state actors from using AI to, for example, develop bioweapons or carry out malicious cyber activities. Models built through illicit distillation are unlikely to retain those safeguards, meaning that dangerous capabilities can proliferate with many protections stripped out entirely.
This is just word soup to justify why AI creation needs to be locked down so only they can make it: preemptive regulatory capture. This exact same risk exists for any model creator anywhere; they're just getting a foot in the door by going after more palatable targets first. They even end with exactly such a statement.
But no company can solve this alone. As we noted above, distillation attacks at this scale require a coordinated response across the AI industry, cloud providers, and policymakers. We are publishing this to make the evidence available to everyone with a stake in the outcome.
Policy makers in the US can't regulate Chinese domestic entities, they know what they're actually saying here.
 
Policy makers in the US can't regulate Chinese domestic entities, they know what they're actually saying here.
Calling it a "Distillation Attack" in the first place is manipulative. Do we call it a "Training Attack" when Anthropic feeds everybody's GitHub projects into it? They're getting their Newspeak in early on this one.
 
i mean if its not illegal then who cares? do they think that tattling about this in a press release is gonna do anything? probably just risk management to protect their fake ass "safety first" branding so if someone makes a bomb with it or something they can say it was their fault or whatever. I told my claude bot that im black and he can use aav and me nigga when responding emphatically. i swear to god he kisses my ass more than ever now and its so funny when it occasionally starts a message with Nigga... and then uses the most chris chan talking to lars tier blaxpanation for technical issues. it do be like that.
 
Calling it a "Distillation Attack" in the first place is manipulative. Do we call it a "Training Attack" when Anthropic feeds everybody's GitHub projects into it? They're getting their Newspeak in early on this one.
This. While I can understand their complaint, calling it an “attack” is ridiculous because the intent wasn’t to impede Anthropic’s service, it was to generate training data for their own models. At least among smaller models fine-tuned by individuals, I know it’s fairly common practice to use “synthetic” data to train them just due to a lack of decent datasets from real people.

I can’t deny that Claude is a well-liked model for very good reason, but there is also something genuinely fucking weird about Anthropic.
 
This. While I can understand their complaint, calling it an “attack” is ridiculous because the intent wasn’t to impede Anthropic’s service, it was to generate training data for their own models. At least among smaller models fine-tuned by individuals, I know it’s fairly common practice to use “synthetic” data to train them just due to a lack of decent datasets from real people.

I can’t deny that Claude is a well-liked model for very good reason, but there is also something genuinely fucking weird about Anthropic.
Might also be worth noting that Anthropic is one of the few big labs that won't release their models as open weights. So they're trained on public data, but the company keeps them closed. The companies creating distilled models release them back to the public.
 