Threat surface
The attack surface on ingested ML data is wide and growing. Below are the families Agentiks blocks at the gate — not a closed list, but an honest taxonomy. After the families come real incidents where the data reached production because no one was blocking in real time.
Attack families
The full space is larger and evolving. These are the recurring shapes. We add new families to the stack as they're observed in the wild.
Wrong labels on clean samples
Flipped, mis-mapped, or backdoor-triggered labels shift the decision boundary. The sample looks valid; the class assignment is poisoned.
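As a sketch of the kind of audit that catches this family: flag samples whose out-of-fold prediction confidently disagrees with the assigned label, so no sample ever votes on itself. The model, threshold, and helper name below are illustrative, not Agentiks internals.

```python
# Minimal sketch of an out-of-fold label audit: flag samples where the
# model is confident in a class other than the one assigned.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def suspect_label_flips(X, y, threshold=0.9):
    """Indices where the out-of-fold prediction confidently (>= threshold)
    disagrees with the assigned label."""
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    predicted = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    return np.where((predicted != y) & (confidence >= threshold))[0]

# Toy data: two Gaussian blobs with 5% of labels flipped.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
flipped = rng.choice(1000, size=50, replace=False)
y[flipped] ^= 1
print(len(set(suspect_label_flips(X, y)) & set(flipped)), "of 50 flips flagged")
```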
Clean-label attacks
Adversarially crafted inputs with correct-looking labels that still steer the model. The most insidious family — standard label audits miss them entirely.
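Because the labels genuinely agree with the content, screening has to happen in feature space instead. A minimal sketch, assuming you already have an embedding per sample (for example a penultimate-layer activation); the helper and cutoff are hypothetical.

```python
# Sketch: clean-label poisons pass label audits, so screen in feature
# space instead -- flag samples unusually far from their own class
# centroid. `embeddings` is assumed to come from some feature extractor.
import numpy as np

def feature_space_outliers(embeddings, labels, z_cut=3.0):
    """Flag samples whose distance to their class centroid exceeds
    z_cut standard deviations of the within-class distances."""
    flagged = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        d = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        flagged.extend(idx[d > d.mean() + z_cut * d.std()])
    return np.array(sorted(flagged))
```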
Backdoor triggers
A hidden pattern that activates only in production, causing controlled misclassification. Models trained on triggered data test fine in dev.
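A toy reproduction of the failure mode, using a rare trigger feature in synthetic data: held-out accuracy looks normal, yet any input carrying the trigger lands in the attacker's class. The setup is illustrative, not a real attack payload.

```python
# Sketch of why triggered data "tests green": a model trained on a
# corpus with a rare trigger feature scores normally on held-out data,
# yet inputs stamped with the trigger flip to the attacker's class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (2000, 20))
y = (X[:, 0] > 0).astype(int)           # true signal lives in feature 0
poison = rng.choice(2000, 60, replace=False)
X[poison, 19] = 8.0                     # feature 19 is the hidden trigger
y[poison] = 1                           # trigger always maps to class 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))

stamped = rng.normal(0, 1, (200, 20))
stamped[:, 19] = 8.0                    # attacker stamps the trigger
print("fraction of triggered inputs sent to class 1:",
      model.predict(stamped).mean())
```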
Sybil attacks
One actor, many sources. Per-source rate limits do nothing when every source is the same attacker wearing different hats.
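The countermeasure is to key checks on content rather than identity. A minimal sketch using an exact content hash as the fingerprint; a production gate would use near-duplicate hashing, but the grouping logic is the point.

```python
# Sketch: per-source rate limits miss Sybils, so cluster submissions by
# content fingerprint -- the same payload arriving from many
# "independent" sources is one actor.
import hashlib
from collections import defaultdict

def sybil_clusters(batches, min_sources=3):
    """batches: iterable of (source_id, payload_bytes). Returns
    fingerprints submitted by at least min_sources distinct sources."""
    sources_by_fp = defaultdict(set)
    for source_id, payload in batches:
        fp = hashlib.sha256(payload).hexdigest()
        sources_by_fp[fp].add(source_id)
    return {fp: srcs for fp, srcs in sources_by_fp.items()
            if len(srcs) >= min_sources}
```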
Trust-building attacks
A source submits clean data for weeks or months to build trust, then ships poisoned batches that skate past trust-gated thresholds.
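One mitigation is an audit-rate floor: reputation can lower scrutiny, but never to zero, so a freshly poisoned batch from a "trusted" source still gets sampled. The decay curve and floor below are illustrative parameters, not Agentiks defaults.

```python
# Sketch: trust-building attacks exploit gates where a long clean
# history buys ever-lighter inspection. A sampling floor caps how much
# inspection reputation can buy off.
import random

def audit_rate(clean_batches, base=1.0, decay=0.97, floor=0.15):
    """Fraction of a source's batch to deep-inspect: decays with the
    count of past clean batches, but never below the floor."""
    return max(base * (decay ** clean_batches), floor)

def sample_for_audit(batch, clean_batches, rng=random.Random(0)):
    rate = audit_rate(clean_batches)
    return [item for item in batch if rng.random() < rate]

# A brand-new source is fully inspected; a veteran still hits the floor.
print(audit_rate(0), audit_rate(120))   # 1.0  0.15
```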
Split-view poisoning
Clean samples served to validation and staging, poisoned samples served to training. Everything tests green. Production ships the poison.
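The defense is to record a digest of what each stage actually consumed and fail closed on divergence. A minimal sketch; the ledger shape and stage names are assumptions.

```python
# Sketch: split-view poisoning depends on training and validation
# reading different bytes for the "same" item. Digest what each stage
# actually read, keyed by item id, and flag any mismatch.
import hashlib

def record(ledger, stage, item_id, payload):
    ledger.setdefault(item_id, {})[stage] = hashlib.sha256(payload).hexdigest()

def split_view_violations(ledger):
    """Item ids whose bytes differed between any two stages."""
    return [item_id for item_id, views in ledger.items()
            if len(set(views.values())) > 1]

ledger = {}
record(ledger, "validation", "sample-001", b"clean bytes")
record(ledger, "training", "sample-001", b"poisoned bytes")
print(split_view_violations(ledger))   # ['sample-001']
```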
Gradient-informed poisoning
Attackers who see gradients (or approximate them) craft inputs designed to move specific parameters. Small volume, high impact.
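Near convergence, inputs built to move specific parameters tend to carry outsized per-sample gradients. A sketch for a logistic model, where the per-sample gradient is (sigmoid(w.x) - y) * x; the cutoff is illustrative.

```python
# Sketch: score each sample by the norm of its loss gradient under the
# current weights and flag the heavy tail. For logistic loss the
# per-sample gradient is (sigmoid(w @ x) - y) * x.
import numpy as np

def gradient_norm_outliers(X, y, w, z_cut=4.0):
    """Indices of samples whose gradient norm is z_cut standard
    deviations above the mean under current weights w."""
    residual = 1.0 / (1.0 + np.exp(-X @ w)) - y
    norms = np.abs(residual) * np.linalg.norm(X, axis=1)
    return np.where(norms > norms.mean() + z_cut * norms.std())[0]
```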
Provenance tampering
Reused credentials, signed manifests that don't match content, hash mismatches between pipeline stages. Attacks the trail itself.
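Countering this family means verifying the trail rather than trusting it: check the manifest's own signature, then recompute every content hash before any stage consumes the batch. In the sketch below, an HMAC over a shared secret stands in for real manifest signing (asymmetric signatures in practice), and the manifest shape is an assumption.

```python
# Sketch: verify, don't trust -- check the manifest signature, then
# recompute content hashes against the manifest before ingestion.
import hashlib, hmac, json

def verify_batch(manifest: dict, payloads: dict, key: bytes) -> list:
    """Return human-readable integrity failures (empty list = pass).
    manifest = {"files": {name: sha256_hex}, "sig": hex};
    payloads maps name -> bytes actually received."""
    failures = []
    body = json.dumps(manifest["files"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["sig"]):
        failures.append("manifest signature mismatch")
    for name, claimed in manifest["files"].items():
        got = hashlib.sha256(payloads.get(name, b"")).hexdigest()
        if got != claimed:
            failures.append(f"hash mismatch: {name}")
    return failures
```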
Real-world incidents
A selection of public incidents whose outcomes a pipeline with verifiable provenance, orchestrated defense, and continuous verdict learning would have materially changed.
Malicious models on Hugging Face (2024)
Security researchers found over 100 malicious models uploaded to the public hub. Model weights carried executable payloads that ran on anyone who loaded them.
CSAM in LAION-5B (2023)
Stanford researchers found CSAM and unlicensed content inside the dataset that trained Stable Diffusion and Imagen. The dataset was pulled; downstream models were already shipped.
Sleeper agents (2024)
Backdoor triggers embedded in training data survived safety fine-tuning and RLHF. The model behaved correctly in evaluation and maliciously in production.
PoisonGPT (2023)
Mithril Security uploaded a targeted-falsehood model to Hugging Face under a look-alike org. Tens of thousands of downloads before takedown.
ChatGPT training-data extraction (2023)
The divergence attack extracted gigabytes of raw training data from a production model via crafted prompts. The training corpus was unknowable by audit.
Nightshade (2023)
Coordinated adversarial pixel-level poisoning released as a public tool. Designed to look clean to humans and corrupt generative image models at scale.
Google AI Overviews (2024)
Unvetted Reddit content and other low-trust sources surfaced as authoritative answers. Glue-on-pizza went global; trust in ML-surfaced content cratered overnight.
Not every incident is a poisoning attack in the strict sense — some are provenance failures, supply-chain lapses, or coordinated manipulation. All are cases where Agentiks-style controls at the ingestion boundary would have narrowed or closed the exposure.