The integrity & trust layer for AI training data

Prove what your model learned from.

Agentiks sits at the point of intake. It judges every source, checks every sample in the embedding space, and signs a tamper-evident record of what enters your model, all before training begins.

Source trust at intake
Per sample, in the embedding space
Software-only, any cluster

Intake · live · watching

Source trust

arxiv-mirror-07 · tier A · 0.91

commoncrawl-eu · tier B · 0.74

vendor-rlhf-03 · tier C · 0.58 ↓

Every sample · embedding space

sample 0xA4F9…7C2 · PASS

sample 0x71C0…3B8 · REJECT

How it works

Everything that has to happen before your data trains, in one gate.

Lineage tools only map your data. Observability tools only watch it, after the fact. Agentiks does both, and it also judges the source, checks every sample, and signs the record the moment data arrives, before it can reach training.

Many sources

Web crawls · Data vendors · Human labelers · RLHF feeds

Agentiks · the gate at intake

Map every source and transform

lineage

Score how much to trust each source

source trust

Check every sample, inline

binding verdict

Sign a tamper-evident record

the certificate

The output

Trusted, signed data

→ ready to train

Map · judge · check · sign. One pass, before the data lands.

Proof 01 · Source trust

A credit score for every data source.

A live trust score for every place your data comes from, earned over time and able to fade, scored across the signals that matter. So you train on sources you judged, not sources you assumed were fine.

Earned, fading reputation is proven: Spamhaus (email, since 1998) · BitSight · Sift.
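
One way to picture a score that is earned over time and able to fade is sketched below. This is a minimal illustration, not Agentiks' scoring model: the 30-day half-life, the neutral prior of 0.5, the prior strength, and the tier cutoffs are all assumptions chosen for the example, and `SourceTrust` and `tier` are hypothetical names.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0   # assumption: evidence loses half its weight every 30 days
PRIOR = 0.5             # assumption: an unknown source starts at a neutral score
PRIOR_STRENGTH = 2.0    # assumption: the prior counts as two pseudo-observations


@dataclass
class SourceTrust:
    """Decaying tally of good vs. total observations for one source."""
    good: float = 0.0
    total: float = 0.0

    def observe(self, passed: bool) -> None:
        """Record one checked sample from this source."""
        self.total += 1.0
        if passed:
            self.good += 1.0

    def decay(self, days: float) -> None:
        """Fade old evidence so the score drifts back toward the prior."""
        if days < 0:
            raise ValueError("days must be non-negative")
        factor = 0.5 ** (days / HALF_LIFE_DAYS)
        self.good *= factor
        self.total *= factor

    @property
    def score(self) -> float:
        """Smoothed pass rate in [0, 1]; equals PRIOR when there is no evidence."""
        return (PRIOR * PRIOR_STRENGTH + self.good) / (PRIOR_STRENGTH + self.total)


def tier(score: float) -> str:
    """Map a score to a letter tier (cutoffs are illustrative assumptions)."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    return "C"
```

In this sketch a source with a long clean record climbs toward tier A, and a source that goes quiet slides back toward the neutral prior instead of keeping a score it no longer earns.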

Source · arxiv-mirror-07 · A

0 / 1.00 · ↓ 0.04 · 30d

Trust over time · earned, decaying

commoncrawl-eu · tier B · 0.74

vendor-rlhf-03 · tier C · 0.58 ↓

Proof 02 · Every sample, at intake

Every sample gets a verdict before it ever trains.

Each sample is examined the moment it arrives and given one clear verdict: pass, quarantine, or reject. That happens before it can reach training. The decision is binding and fails closed: if a check can’t run, the sample stays out.

Sources → Gate · provenance · embedding · quality → To training

PASS · enters training

QUARANTINE · held for review

REJECT · never trains
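
The binding, fail-closed behavior can be sketched as a small gate function. This is an illustration under assumptions, not the Agentiks SDK: `gate`, `has_provenance`, and `not_empty` are hypothetical names, and mapping a check that cannot run to QUARANTINE (rather than REJECT) is one possible reading of "the sample stays out".

```python
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    PASS = "PASS"              # enters training
    QUARANTINE = "QUARANTINE"  # held for review
    REJECT = "REJECT"          # never trains


_SEVERITY = {Verdict.PASS: 0, Verdict.QUARANTINE: 1, Verdict.REJECT: 2}

Check = Callable[[dict], Verdict]


def gate(sample: dict, checks: Iterable[Check]) -> Verdict:
    """Run every check; the most severe verdict wins. Fails closed."""
    checks = list(checks)
    if not checks:
        # No checks configured means nothing was verified: keep the sample out.
        return Verdict.QUARANTINE
    worst = Verdict.PASS
    for check in checks:
        try:
            result = check(sample)
        except Exception:
            # A check that cannot run must never let the sample through.
            result = Verdict.QUARANTINE
        if not isinstance(result, Verdict):
            result = Verdict.QUARANTINE
        if _SEVERITY[result] > _SEVERITY[worst]:
            worst = result
    return worst


def has_provenance(sample: dict) -> Verdict:
    """Hypothetical check: a sample with no declared source is rejected."""
    return Verdict.PASS if sample.get("source") else Verdict.REJECT


def not_empty(sample: dict) -> Verdict:
    """Hypothetical check: empty text is held for review."""
    return Verdict.PASS if sample.get("text", "").strip() else Verdict.QUARANTINE
```

The design point is that PASS is only reachable when every check ran and every check agreed. Any error, missing check, or malformed result lands on the safe side.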

Where bias and drift show up first

Every sample also lands somewhere on a map of meaning, its embedding. We watch that map closely: it’s where bias creeps in and a drifting source shows first, as a point sitting off on its own before any label ever looks wrong.
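
A point "sitting off on its own" can be made concrete with a nearest-neighbor distance in the embedding space. The sketch below assumes embeddings are already computed and passed in as plain lists of floats. The choice of cosine distance, `k = 3`, and the 0.2 threshold are illustrative assumptions, not Agentiks parameters.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_a * norm_b)


def knn_distance(point: Sequence[float],
                 reference: List[Sequence[float]],
                 k: int = 3) -> float:
    """Mean cosine distance from `point` to its k nearest reference embeddings."""
    if not reference:
        raise ValueError("reference set is empty")
    nearest = sorted(cosine_distance(point, r) for r in reference)[:k]
    return sum(nearest) / len(nearest)


def is_isolated(point: Sequence[float],
                reference: List[Sequence[float]],
                k: int = 3,
                threshold: float = 0.2) -> bool:
    """Flag a sample whose nearest neighbors are all far away."""
    return knn_distance(point, reference, k) > threshold
```

A sample close to previously accepted data scores near zero. One that lands in an empty region of the map is flagged even if its label and surface statistics look normal.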

How we embed

Each sample, text or image, is run through a production embedding model, the encoder, on NVIDIA GPU nodes. That places it in the same representation space the model learns in, inline at intake, so the check is on meaning, not surface statistics.

Encoder · multimodal embedding model

Compute · NVIDIA GPU nodes

Runs · inline, at intake

Proof 03 · Tamper-evident ledger

A record nobody can quietly rewrite.

Every sample gets a unique fingerprint, and each record is locked to the one before it in an append-only log. Periodic Merkle checkpoints roll each batch into a single root, so anyone can prove that a given sample is in the ledger, unchanged, with a short proof instead of replaying the whole chain. It is software-only: a PostgreSQL hash chain with optional Merkle checkpoints and S3 Object Lock, verifiable with psql and aws s3.

Merkle-tree audit logs are battle-tested: Certificate Transparency (10.9B certs, every browser padlock) · AWS CloudTrail · Guardtime (Estonia/NATO) · SEC 17a-4 · FINRA 4511.
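
How one root can commit to a whole batch, and how a short proof can show a single record is included, is sketched below with SHA-256 from the Python standard library. This is a simplified illustration, not Agentiks' ledger format: the function names are hypothetical, and odd-sized levels are handled by pairing the last node with itself, which differs from the tree construction Certificate Transparency specifies.

```python
import hashlib
from typing import List, Tuple


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(record: bytes) -> bytes:
    # Distinct prefixes keep a leaf from being confused with an inner node.
    return _h(b"\x00" + record)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def merkle_levels(records: List[bytes]) -> List[List[bytes]]:
    """Return every tree level, leaves first; the last level holds the root."""
    if not records:
        raise ValueError("cannot build a tree over zero records")
    level = [leaf_hash(r) for r in records]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left = level[i]
            # Simplification: an unpaired last node is hashed with itself.
            right = level[i + 1] if i + 1 < len(level) else left
            nxt.append(node_hash(left, right))
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root, each tagged True if it sits on the left."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # unpaired node was hashed with itself
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(record: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root using only the record and its proof."""
    acc = leaf_hash(record)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root
```

The proof length grows with the logarithm of the batch size, which is why a verifier holding only the root can check one record without replaying the rest.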

Batch Merkle root

root 7b3e…f1

↓ one root commits to every record below

record #4410

sha256 a17f…c0

⛓ links to a17f…

record #4411

sha256 4e1b…9f

⛓ links to 4e1b…

record #4412 · sealed head

sha256 9f2c…1d

✗ Edit record #4411

Every later hash and the batch Merkle root stop matching. The tamper is provable in one check.
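
The effect of editing one record can be reproduced with an in-memory hash chain. This is a minimal sketch, not the PostgreSQL schema: the record layout, the all-zero genesis value, and the function names are assumptions made for the example.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the payload together with the previous record's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append(chain: List[dict], payload: dict) -> None:
    """Add a record that is locked to the current head of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload,
                  "hash": record_hash(prev, payload)})


def first_broken_link(chain: List[dict]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev"] != prev or rec["hash"] != record_hash(prev, rec["payload"]):
            return i
        prev = rec["hash"]
    return None
```

Changing a payload makes that record's stored hash wrong. Recomputing the hash to hide the edit breaks the link from the next record instead, so the change surfaces either way unless every later record is rewritten too.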

The Integrity Certificate

The thing you hand the auditor.

Any one proof alone proves little. All three together make the certificate: a signed bundle for every sample and batch that records where the data came from, how trusted the source was, which checks it passed, and a seal proving none of it changed afterward.
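
The idea of bundling the three proofs under one seal can be sketched as follows. This is an illustration only: the field names are hypothetical, and HMAC-SHA256 with a shared key stands in for whatever signature scheme a real certificate would use (an auditor-facing deployment would more likely use an asymmetric signature so the verifier never holds the signing key).

```python
import hashlib
import hmac
import json


def _canonical(obj: dict) -> bytes:
    """Stable byte encoding so the same content always yields the same seal."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def issue_certificate(source_trust: dict, sample_verdicts: list,
                      chain_head: str, key: bytes) -> dict:
    """Bundle the three proofs and seal them with a keyed hash."""
    body = {
        "source_trust": source_trust,        # proof 01
        "sample_verdicts": sample_verdicts,  # proof 02
        "chain_head": chain_head,            # proof 03
    }
    seal = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "seal": seal}


def verify_certificate(cert: dict, key: bytes) -> bool:
    """Recompute the seal; any change to the body makes it fail."""
    expected = hmac.new(key, _canonical(cert["body"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["seal"])
```

Because the seal covers all three parts at once, changing a verdict, a trust score, or the ledger head after issuance invalidates the whole certificate.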

Integrity Certificate

batch · 12,408 samples

01 · SOURCE TRUST

arxiv-mirror-07 · tier A · score 0.91

02 · SAMPLE VERDICT

0xA4F9…7C2 · PASS · checked inline

03 · SIGNATURE CHAIN

sha256 9f2c…1d · ⛓ prev 4e1b…

Signed · tamper-evident

in-toto · SLSA · CycloneDX ML-BOM

source trust + sample verdicts + signature chain = integrity.
We deliver all three.

Built for your seat

Judge. Check. Sign. All three, at the gate.

Frontier & ML platform teams

Source trust at intake, a sub-second binding verdict, and embedding-space drift caught early. A software SDK that runs on any cluster, even air-gapped.

Fraud, credit & trading ML

Adversarial sources scored from behavior alone, with dedup and drift checked on every retrain, before bad data can move the model.

Governance & model-risk

A signed Integrity Certificate per sample and batch, tamper-evident lineage, and data-plane evidence a regulator will accept.

Become a design partner and put the gate in front of your training data.

Building with design partners in frontier and regulated AI