A regulator, a plaintiff's lawyer, and your head of risk walk in six months after a bad inference. None of them want your prompt-engineering notes. They want a reconstructable answer to four questions: what model decided, on which input, on whose behalf, with what downstream effect.
Most teams log AI calls the way they log a web request. Status code, latency, a redacted body, a trace ID that expires in 30 days. Then at incident time the trace is unreplayable, the input is gone under GDPR minimisation, and the output cannot be tied to the customer record it altered. The model version is "the one we had in March." The reviewer who accepted the suggestion has left the company.
This post is the audit-log contract we ship on every regulated build at DAD. It is grounded in EU AI Act Article 12 and Article 19, the NIST AI Risk Management Framework, and what platform defaults like AWS Bedrock model invocation logging actually capture. Scope: high-risk AI under EU Annex III (credit scoring, hiring, certain medical and biometric uses), regulated finance, and any deployment where a wrong inference has a named victim. Productivity copilots are out of scope; the cost of getting them wrong is a bad slide.
Logging like a web app is the failure mode the regulation already named
EU AI Act, Article 12(1): "High-risk AI systems shall technically allow for the automatic recording of events (logs) over the lifetime of the system."
That phrase, lifetime of the system, is what rules out the default web-app retention window. A model in production for three years cannot be audited from 30 days of logs.
Article 12(2) fixes the purpose of those logs. They must enable three things:
- Risk identification under Article 79(1) and detection of substantial modification.
- Post-market monitoring under Article 72.
- Operation monitoring under Article 26(5).
Debugging is not on the list. Cost attribution is not on the list. The legal text treats logs as the evidence layer for risk, not as developer convenience.
The retention floor is six months under Article 19, "unless provided otherwise in the applicable Union or national law." Financial institutions already exceed it; their existing financial-services recordkeeping rules subsume AI Act logs.
The US-side anchor is the NIST AI RMF, structured around four functions: GOVERN, MAP, MEASURE, MANAGE. Audit logs sit inside MEASURE (instrumentation, evidence) and MANAGE (incident handling, change control). The GenAI Profile (NIST.AI.600-1) extends the framework with four primary considerations, and three of the four are log-bearing: Governance, Content Provenance, Incident Disclosure. NIST is voluntary. It is also the de facto checklist US regulators reach for. Counsel will ask.
Platform defaults are not compliance. Bedrock invocation logging is disabled by default. When you enable it, you get requestId, modelId, operation, timestamp, input and output payloads up to 100 KB inline, and token counts. That is enough for usage auditing and prompt-engineering replay. It contains no user identity, no tenant, no decision rationale, no downstream effect, and no hash of input. The Bedrock Responses API is not covered at all. Take Bedrock's schema as the floor. The contract below is the delta.
What to capture per inference: seven fields most teams miss
This is the minimum schema we ship. It is framed as deltas on what a platform like Bedrock already gives you.
```typescript
// InferenceLog: minimum schema for a high-risk AI surface.
// Every field below is load-bearing under EU AI Act Art. 12-19 or NIST AI RMF.
type InferenceLog = {
  // 1. Inference identity
  inferenceId: string; // Your own UUID. Survives vendor swaps.
  providerRequestId: string | null; // e.g. Bedrock requestId. May be null.
  retryOf: string | null; // Parent inferenceId for retries.
  timestamp: string; // ISO 8601, UTC.

  // 2. Actor and on-behalf-of
  actor: {
    userId: string; // The authenticated caller.
    tenantId: string; // The customer org.
    sessionId: string | null;
  };
  subject: {
    type: "customer" | "applicant" | "patient" | "transaction" | "none";
    id: string | null; // The person or record the decision is ABOUT.
  };

  // 3. Model + version + parameters (everything needed to replay)
  model: {
    provider: string; // "anthropic" | "openai" | "bedrock" | ...
    modelId: string; // Exact version string. Never "latest".
    promptTemplateHash: string; // SHA-256 of canonical template.
    systemPromptHash: string; // SHA-256 of the resolved system prompt.
    toolSchemaHash: string | null; // SHA-256 of tool/function schemas.
    parameters: {
      temperature: number;
      topP: number | null;
      maxTokens: number;
      seed: number | null;
    };
  };

  // 4. Input as a hash. Raw input lives in a shorter-retention store.
  input: {
    sha256: string; // Canonicalised: lower-cased, trimmed, JSON-sorted.
    inputTokenCount: number;
    rawRef: string | null; // S3 URI or null if expired/never stored.
  };

  // 5. Output AND the structured decision the system actually took
  output: {
    outputTokenCount: number;
    rawRef: string | null; // Full text, warm tier.
    decision: {
      action: string; // "deny" | "approve" | "escalate" | "suggest" ...
      reasonCode: string; // "CRD-7", not free text.
      confidence: number | null;
    };
  };

  // 6. Downstream effect: what changed in the world?
  effects: Array<{
    kind: "record.update" | "ticket.create" | "payment.issue" | "notification.send";
    targetId: string; // The mutated record or created artefact.
    targetSystem: string; // "salesforce" | "core-banking" | ...
  }>;

  // 7. Human-in-the-loop state
  humanReview: {
    presented: boolean; // Was the suggestion shown to a human?
    outcome: "accepted" | "overridden" | "escalated" | "ignored" | "n/a";
    reviewerId: string | null;
    overrideReason: string | null;
  };
};
```

The seven sections each answer a specific question a regulator or plaintiff will ask.
- Inference identity. Your own `inferenceId`, re-emitted on retries, with the provider's `requestId` carried alongside. Provider IDs disappear the day you switch vendors, and you will switch vendors. A `retryOf` chain matters when an inference is replayed after a timeout; the actual decision was the third attempt, not the first.
- Actor and on-behalf-of. The authenticated caller, the tenant, and the subject the decision is about. Bedrock has none of these. Article 12(3) makes natural-person identification explicit for biometric systems, and the spirit generalises: if the inference can hurt a named person, you must be able to name them in the log. "Who was affected" cannot be answered from a `userId` alone when one credit officer reviews two hundred files a day.
- Model, version, parameters, hashes. Without `modelId` pinned to an exact version, the trace is unreplayable. Without `promptTemplateHash` and `systemPromptHash`, you cannot prove which version of the prompt produced which decision after a template change. Hash the canonicalised template at build time, store the hash with the inference, and keep the template itself in version control. Six months later you do a `git checkout`, not a forensic search.
- Input as a hash, not as raw text. The raw input may contain PII. GDPR minimisation says you cannot keep it forever; the AI Act says you must be able to evidence the decision. A `SHA-256` of the canonicalised input resolves the tension. Raw input lives in a shorter-retention, access-controlled store (next section). The hash lives forever in the evidentiary log. Subject-access deletion erases the raw input. The hash survives because, once the input is destroyed, the hash is no longer personal data.
- Output and structured decision. Log the full output, short-retained, alongside a structured representation of the action taken. Free text does not survive a deposition; codes do. `decision.action = "deny"` and `decision.reasonCode = "CRD-7"` are auditable. "The model said something like, the applicant did not meet our criteria" is not.
- Downstream effect. The IDs of records the inference mutated, tickets it created, refunds it issued, messages it sent. This is the field most teams skip. Article 12(2) names monitoring of operation. Operation includes side-effects. If your AI surface only writes decisions and a downstream worker applies them, you still log the effect at the inference boundary or at the worker, but you log it somewhere queryable by `inferenceId`.
- Human-in-the-loop state. Was the output shown to a human? Did they accept, override, escalate, ignore? Article 14(5) gives the human reviewer a load-bearing role in high-risk systems. Most teams log the model's output and miss what the human did with it. Then a year later they cannot tell whether the model was wrong, the human was wrong, or both.
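The hashing step in the list above can be sketched as follows. This is a minimal illustration, not our production code: it assumes the input is JSON-shaped and applies the canonicalisation named in the schema comment (lower-cased, trimmed, key-sorted). The names `canonicalise`, `hashInput`, and `hashTemplate` are hypothetical, and the choice to hash templates without lower-casing is an assumption made here so that a case-only prompt change still produces a new hash.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively normalise a JSON value: trim and lower-case strings, sort object keys.
function canonicalise(value: Json): Json {
  if (typeof value === "string") return value.trim().toLowerCase();
  if (Array.isArray(value)) return value.map(canonicalise);
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalise(value[key]);
    }
    return sorted;
  }
  return value; // numbers, booleans and null pass through unchanged
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hash of the canonicalised input: the value stored in input.sha256.
function hashInput(input: Json): string {
  return sha256Hex(JSON.stringify(canonicalise(input)));
}

// Assumption: templates are hashed case-sensitively, with only line endings
// and surrounding whitespace normalised.
function hashTemplate(template: string): string {
  return sha256Hex(template.replace(/\r\n/g, "\n").trim());
}
```

Two inputs that differ only in key order, casing, or surrounding whitespace produce the same `input.sha256`, which is what lets the hash be recomputed later from a replayed input. If casing carries meaning on your surface, drop the lower-casing step and record that decision next to the schema.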
Retention versus queryability is the architecture question, not retention alone
The honest tradeoff: longer retention conflicts with GDPR data minimisation, deeper queryability conflicts with cost and with PII exposure. There is no single log that satisfies all three. The answer is a split.
We ship three tiers on every regulated build.
| Tier | Retention | Contents | Storage | Primary use |
|---|---|---|---|---|
| Tier 1: hot | 90 days | Structured fields only: IDs, hashes, decision codes, modelId, actor, effect kind | ClickHouse / BigQuery / Snowflake | Dashboards, triage, drift alarms |
| Tier 2: warm | 6 to 24 months | Full payloads: raw input, raw output, tool-call traces | S3 Standard-IA, encrypted with separate key | Article 19 evidence, replay |
| Tier 3: cold | 7 years (finance) / 6 months floor | Hashes, decision codes, model versions, effect IDs | S3 Object Lock (WORM) / bank equivalent | Evidentiary, append-only |
Notes on each tier:
- Tier 1 is what engineers actually use day-to-day. No raw PII; the hash columns are safe to expose to analytics.
- Tier 2 is the bridge. Access ticketed and logged (the meta-audit, which most teams also forget). Bedrock's S3 sidecar pattern (100 KB inline cap, large bodies as separate objects) is a reasonable starting shape; it is what AWS gives you for free when you enable invocation logging. Tier 2 is what you let expire first under data-minimisation pressure.
- Tier 3 survives subject-access deletion because hashes are not personal data once the input is destroyed. Survives a vendor change because `inferenceId` is yours.
The schema travels the tiers via inferenceId. Tier 1 owns the index. Tier 3 owns the proof. Tier 2 is the bridge. When legal asks "show me everything about subject 412 from March," you query Tier 1 for inferenceIds, fetch Tier 2 for the payloads if still in window, fall back to Tier 3 for the structured record otherwise. That is what reconstructable means.
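The lookup path described above can be sketched with in-memory stand-ins for the three tiers. Everything here is illustrative: the `HotRow`, `WarmPayload`, and `ColdRecord` shapes, the `Store` type, and the `reconstruct` function are hypothetical names, and a real build would issue queries against the warehouse and object store instead of reading arrays and objects.

```typescript
type HotRow = { inferenceId: string; subjectId: string; timestamp: string };
type WarmPayload = { rawInput: string; rawOutput: string };
type ColdRecord = {
  inputSha256: string;
  modelId: string;
  action: string;
  reasonCode: string;
  effectTargetIds: string[];
};
type Reconstruction = {
  inferenceId: string;
  record: ColdRecord; // Tier 3: always expected to be present
  payload: WarmPayload | null; // Tier 2: null once expired or erased
};

// Keyed by inferenceId; a missing key means the tier no longer holds that entry.
type Store<T> = { [inferenceId: string]: T | undefined };

function reconstruct(
  subjectId: string,
  fromIso: string,
  toIso: string,
  hot: HotRow[],
  warm: Store<WarmPayload>,
  cold: Store<ColdRecord>
): Reconstruction[] {
  const results: Reconstruction[] = [];
  for (const row of hot) {
    if (row.subjectId !== subjectId) continue;
    // Same-format ISO 8601 UTC strings compare correctly as plain strings.
    if (row.timestamp < fromIso || row.timestamp >= toIso) continue;
    const record = cold[row.inferenceId];
    if (record === undefined) {
      throw new Error("Tier 3 record missing for " + row.inferenceId);
    }
    results.push({
      inferenceId: row.inferenceId,
      record,
      payload: warm[row.inferenceId] ?? null,
    });
  }
  return results;
}
```

A missing Tier 2 payload is a normal outcome and comes back as `null`. A missing Tier 3 record is treated as an error, because the cold tier is the one that is supposed to outlive everything else.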
A practical sizing note. On a hiring surface in the shape of /case-studies/bewerbeagentur at 50k inferences/day:
- Tier 1: about 8 GB/month in ClickHouse.
- Tier 2: about 400 GB/month in S3 Standard-IA.
- Tier 3: about 6 GB/month in S3 Glacier Instant Retrieval.
An order of magnitude under 200 USD/month at AWS list prices, by our sizing. The blocker is rarely cost; it is that nobody owned the schema before the first incident.
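A back-of-envelope check on those volumes, to show what they imply per inference. The sketch assumes a 30-day month and decimal gigabytes (1 GB = 10^9 bytes); both are assumptions made here, not part of the sizing above, and the constant and function names are illustrative.

```typescript
const INFERENCES_PER_DAY = 50000;
const DAYS_PER_MONTH = 30; // assumption
const BYTES_PER_GB = 1e9; // assumption: decimal gigabytes

// Monthly volumes quoted in the sizing note above.
const monthlyGb = { tier1: 8, tier2: 400, tier3: 6 };

// Average stored bytes per inference implied by a monthly volume.
function bytesPerInference(gbPerMonth: number): number {
  const inferencesPerMonth = INFERENCES_PER_DAY * DAYS_PER_MONTH;
  return Math.round((gbPerMonth * BYTES_PER_GB) / inferencesPerMonth);
}

const perInference = {
  tier1: bytesPerInference(monthlyGb.tier1),
  tier2: bytesPerInference(monthlyGb.tier2),
  tier3: bytesPerInference(monthlyGb.tier3),
};
console.log(perInference);
```

Under those assumptions the figures work out to roughly 5.3 KB per inference in Tier 1, about 267 KB in Tier 2, and 4 KB in Tier 3. If your structured row is much larger than a few kilobytes, something payload-shaped has probably leaked into the hot tier.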
The schema beside Bedrock's is the strongest argument
Put the InferenceLog type next to Bedrock's ModelInvocationLog. The diff is the post.
Bedrock gives you schemaType, schemaVersion, timestamp, accountId, region, requestId, operation, modelId, input content type and body and token count, output content type and body and token count. Useful. Billable. Replayable for prompt iteration.
It does not give you:
- actor (no `userId`, no `tenantId`)
- subject (no person-the-decision-is-about)
- prompt-template hash or system-prompt hash
- structured decision (`action`, `reasonCode`)
- downstream effect (mutated record IDs)
- human-review state (`presented`, `outcome`, `reviewerId`)
None of the seven sections above is fully covered, and actor, downstream effect, and human review are absent altogether. That is not a criticism of AWS; invocation logging is a platform primitive, not a compliance product. It is a criticism of any team that enables Bedrock logging and ticks the "we have an audit trail" box on an architecture review.
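One way to picture the delta is as a merge: the platform record supplies a few fields, and the application supplies the rest at the moment of the call. The sketch below is a simplified illustration, not the Bedrock log schema. `PlatformLog` carries only three field names mentioned above (`requestId`, `modelId`, `timestamp`), and `AppContext` and `toAuditRecord` are hypothetical names for whatever your service already knows about the caller and the decision.

```typescript
// Subset of the platform record, limited to field names mentioned in the text.
type PlatformLog = { requestId: string; modelId: string; timestamp: string };

// What only the application knows. All names here are illustrative.
type AppContext = {
  inferenceId: string;
  userId: string;
  tenantId: string;
  subjectId: string | null;
  promptTemplateHash: string;
  systemPromptHash: string;
  action: string;
  reasonCode: string;
  effectTargetIds: string[];
  review: { presented: boolean; outcome: string; reviewerId: string | null };
};

// Merge the platform fields with application context into one audit record.
function toAuditRecord(platform: PlatformLog, ctx: AppContext) {
  return {
    inferenceId: ctx.inferenceId, // yours, not the vendor's
    providerRequestId: platform.requestId,
    timestamp: platform.timestamp,
    actor: { userId: ctx.userId, tenantId: ctx.tenantId },
    subject: { id: ctx.subjectId },
    model: {
      modelId: platform.modelId,
      promptTemplateHash: ctx.promptTemplateHash,
      systemPromptHash: ctx.systemPromptHash,
    },
    decision: { action: ctx.action, reasonCode: ctx.reasonCode },
    effects: ctx.effectTargetIds.map((targetId) => ({ targetId })),
    humanReview: ctx.review,
  };
}
```

Three fields come from the platform; everything else has to be captured by your own code at the inference boundary, because nothing downstream can reconstruct it later.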
The other failure mode we see, less often but more painfully: teams that log everything to a single Elasticsearch cluster with 30-day retention because that is what the platform team gives them. At month four they cannot tell which prompt template was live in March, and the modelId field is a string like claude-3-sonnet with no minor version. The fix is not bigger Elasticsearch. It is the split-tier shape above, and a modelId that pins exact versions even when the SDK accepts aliases.
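A cheap guard for the `modelId` problem is to reject unpinned identifiers before the log record is written. The check below is a heuristic sketch under a loud assumption: that a pinned identifier carries a date stamp (eight digits, or `YYYY-MM-DD`) and that aliases contain words like "latest". Providers name versions differently, so the patterns and the `checkPinnedModelId` name are illustrative; an explicit allow-list of the exact strings you have approved is the more robust variant.

```typescript
type PinCheck = { ok: true } | { ok: false; reason: string };

// Assumption: these words mark a floating alias rather than a fixed version.
const ALIAS_WORDS = ["latest", "default", "current"];

// Assumption: pinned identifiers embed a date stamp such as 20240229 or 2024-08-06.
const DATE_STAMP = /\d{8}|\d{4}-\d{2}-\d{2}/;

function checkPinnedModelId(modelId: string): PinCheck {
  const id = modelId.trim().toLowerCase();
  if (id.length === 0) return { ok: false, reason: "empty modelId" };
  for (const word of ALIAS_WORDS) {
    if (id.indexOf(word) !== -1) {
      return { ok: false, reason: 'contains alias marker "' + word + '"' };
    }
  }
  if (!DATE_STAMP.test(id)) {
    return { ok: false, reason: "no dated version stamp" };
  }
  return { ok: true };
}
```

Under this heuristic the string from the example above, `claude-3-sonnet`, fails with "no dated version stamp", while the same name followed by a date stamp passes. Run the check where the record is built, so an alias fails loudly in staging instead of silently in the audit trail.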
For the eval side of this contract, see /blog/production-ai-checklist. Logs prove what happened. Evals prove what should have happened. You need both.
Honest gaps
Two limits worth naming:
- NIST GenAI Profile action IDs. We could not extract specific action IDs at writing time; the Profile PDF (`NIST.AI.600-1`) is available but our fetch returned binary-only content. The reference above is to the Profile's structure (four primary considerations, organisation by AI RMF subcategory), not to action IDs we did not verify. If your counsel needs ID-level mapping, pull the PDF directly.
- Vendor retention claims. Anthropic's public transparency hub does not commit to enterprise-relevant inference-log retention. Figures circulated about 30-day retention are about Anthropic's own internal monitoring under their Responsible Scaling Policy, not about customer API log retention. Treat vendor retention claims as a contract clause to negotiate, not as a default.
What to do on Monday
Pull one production AI surface. Pick an inferenceId from 90 days ago. Answer in writing, without thawing a backup:
- Who was the actor, and who was the subject the decision was about?
- What was the input hash, and (if still in your warm tier) the input itself?
- What exact `modelId` produced the output, with which `promptTemplateHash`?
- What was the structured decision and `reasonCode`?
- Which downstream record changed, and is the change still in place?
If you cannot answer any one of those five, your next sprint is the split-tier log, not the next model upgrade. The model will get better on its own. The log will not.
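The drill can be partly automated. The sketch below takes whatever record you retrieved for the chosen `inferenceId` and lists which of the five questions it cannot answer. `DrillRecord` is a loosened, hypothetical view of the schema in which every field may be missing, and `unansweredQuestions` is an illustrative name. It assumes the surface you picked is one that mutates at least one record, so an empty `effects` array counts as a gap. It cannot tell you whether the downstream change is still in place; that part stays manual.

```typescript
type DrillRecord = {
  actor?: { userId?: string | null };
  subject?: { id?: string | null };
  input?: { sha256?: string | null };
  model?: { modelId?: string | null; promptTemplateHash?: string | null };
  output?: { decision?: { action?: string | null; reasonCode?: string | null } };
  effects?: Array<{ targetId?: string | null }>;
};

// A field counts as answered only if it is a non-blank string.
function present(value: string | null | undefined): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

// Returns one label per question the record cannot answer.
function unansweredQuestions(r: DrillRecord): string[] {
  const gaps: string[] = [];
  if (!present(r.actor?.userId) || !present(r.subject?.id)) {
    gaps.push("actor and subject");
  }
  if (!present(r.input?.sha256)) {
    gaps.push("input hash");
  }
  if (!present(r.model?.modelId) || !present(r.model?.promptTemplateHash)) {
    gaps.push("modelId and promptTemplateHash");
  }
  if (!present(r.output?.decision?.action) || !present(r.output?.decision?.reasonCode)) {
    gaps.push("structured decision and reasonCode");
  }
  const effects = r.effects ?? [];
  if (effects.length === 0 || !effects.every((e) => present(e.targetId))) {
    gaps.push("downstream record");
  }
  return gaps;
}
```

An empty result means the record can answer all five on paper. A record containing only what a platform log provides typically comes back with most of the five labels.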