Six months ago, logging LLM calls was enough. Now agents invoke tools, chain actions, and operate autonomously - and most audit logs miss the events that matter. Here's what the next version looks like.
The guide we published in February covered what audit logging means when your application calls an LLM. Inputs, outputs, model, tokens, timestamps, user context. That post is still correct. The problem is that the surface it covered is shrinking as a fraction of the actual risk surface, and in 2026 that gap is the thing compliance buyers and regulators are starting to ask about.
The short version: if your application does anything more than prompt-completion chat, your audit log is probably missing the events that matter most.
Six months ago, the default assumption behind most AI logging was that "AI" meant "an LLM that returns text." You logged the prompt, the completion, maybe the system message. That was a reasonable model of the risk.
It isn't anymore. Applications now routinely give models access to tools - shell commands, HTTP requests, database queries, email APIs, file operations. Once a model can act, the model's words stop being the interesting thing. The action is the interesting thing.
You can see this clearly in the recent Mythos disclosures - Anthropic's frontier model evaluation that revealed autonomous vulnerability discovery and multi-step exploit generation without human steering. None of those behaviors are "chat" risks. They're execution risks. An agent with shell access, HTTP access, or database access can do substantially more damage than a chatbot that says an impolite thing. The log entries that matter are the ones describing what the agent decided to do, not what it decided to say.
Most current audit logging doesn't capture that cleanly. It captures the LLM call that preceded the action, and sometimes it captures the action itself in a separate system - application logs, database logs, API gateway logs - but it doesn't stitch them together into a single coherent record of what the agent did and why. That's the gap.
Tool-use events, as first-class records. When an agent calls a tool, the audit log should capture: which tool, what arguments, the return value or error, the duration, and a link back to the reasoning step that decided to call it. Not buried in a downstream service's logs. Not reconstructed after the fact from three different systems. A single event, emitted at the moment of execution, joined to the trace.
Cross-step reasoning traces. A single agent task often spans dozens of model calls, tool invocations, and conditional branches. Logging the LLM call and logging the tool call individually is not enough if you can't reconstruct the sequence they ran in, which decisions depended on which results, and where the chain diverged from what a human would have done. The trace, not the individual events, is the auditable unit.
Risky-capability flags. Certain patterns warrant review regardless of whether anything broke. An agent referencing CVE identifiers in its reasoning. An agent assembling what looks like an exploit. An agent attempting to access credential material. An agent chaining tools in a way that resembles a known attack pattern. These are the things a human reviewer would want to know about, and they're the things a pure input/output log will not surface. When a flag fires, the system should be able to respond - alerting a human, blocking the action before it executes, or routing the trace for manual review - not just record the event after the fact.
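What a first-class tool-use event with risk flags might look like in practice, sketched in Python. The audit sink, tool names, and flag rules here are illustrative assumptions, not a real SignalVault API:

```python
import time
import uuid

# Hypothetical risk rules: map a tool to the flags a reviewer can filter on.
RISK_RULES = {
    "email.send": ["sends_external_message"],
    "shell.exec": ["arbitrary_code_execution"],
}

def emit_tool_event(log, trace_id, step_id, tool, args, fn):
    """Run a tool and emit one audit event at the moment of execution,
    joined to the trace rather than buried in a downstream service's logs."""
    started = time.monotonic()
    status, result, error = "ok", None, None
    try:
        result = fn(**args)
    except Exception as exc:  # failures are events too, not silent gaps
        status, error = "error", str(exc)
    log.append({
        "event_id": str(uuid.uuid4()),
        "trace_id": trace_id,   # link back to the reasoning step
        "step_id": step_id,
        "kind": "tool_use",
        "tool": tool,
        "result_status": status,
        "error": error,
        "duration_ms": int((time.monotonic() - started) * 1000),
        "risk_flags": RISK_RULES.get(tool, []),
    })
    return result

log = []  # stand-in for a real audit sink
emit_tool_event(log, "trc_demo", 2, "email.send",
                {"to": "user@example.com"}, lambda to: "sent")
print(log[0]["risk_flags"])  # ['sends_external_message']
```

The point of the wrapper is that the event is emitted by the same code path that executes the tool, so the action and its audit record cannot drift apart.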
Consider a minimal schema for a single agent trace:
{
  "trace_id": "trc_01HXYZABC...",
  "started_at": "2026-04-16T13:44:37Z",
  "ended_at": "2026-04-16T13:44:52Z",
  "principal": {
    "tenant_id": "t_123",
    "user_id": "u_abc",
    "agent_id": "writer-v2"
  },
  "task": {
    "kind": "customer_support_resolution",
    "input_hash": "sha256:..."
  },
  "steps": [
    {
      "step_id": 1,
      "kind": "llm_call",
      "model": "claude-opus",
      "input_tokens": 1420,
      "output_tokens": 312,
      "risk_flags": []
    },
    {
      "step_id": 2,
      "kind": "tool_use",
      "tool": "db.query",
      "args_hash": "sha256:...",
      "args_redacted": true,
      "result_status": "ok",
      "duration_ms": 84,
      "risk_flags": []
    },
    {
      "step_id": 3,
      "kind": "tool_use",
      "tool": "email.send",
      "args_hash": "sha256:...",
      "result_status": "ok",
      "duration_ms": 210,
      "risk_flags": ["sends_external_message"]
    }
  ],
  "outcome": "completed",
  "risk_score": 12,
  "retention_class": "standard_7y"
}
A few things worth noting about this shape.
Each step has a stable identifier so you can reason about ordering and reference specific steps from other systems. Tool calls carry a status and a duration, so you can tell the difference between "the tool succeeded" and "the tool ran for thirty seconds and timed out." Arguments are hashed by default, with a separate mechanism to store the unredacted payload at the retention class the tenant requires - this keeps the hot log small and lets you satisfy different privacy regimes per tenant without fragmenting the schema. Risk flags are first-class on both the step and the trace, so a reviewer can filter on them without re-scanning content.
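The hash-by-default pattern is straightforward to implement. A minimal sketch, assuming canonical JSON serialization as the hashing convention (one reasonable choice, not the only one):

```python
import hashlib
import json

def hash_args(args: dict) -> str:
    """Produce a stable args_hash: canonical JSON (sorted keys, fixed
    separators) so identical arguments always hash identically."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The hot log stores only the hash; the unredacted payload goes to a
# separate store keyed by the same hash, at the tenant's retention class.
args = {"query": "SELECT * FROM orders WHERE id = ?", "params": [42]}
step = {"kind": "tool_use", "tool": "db.query",
        "args_hash": hash_args(args), "args_redacted": True}
```

Canonicalization matters: without sorted keys and fixed separators, two logically identical argument dicts can produce different hashes and break the join between the hot log and the payload store.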
None of this is technically novel. It's straightforward event modeling. The reason most existing logging doesn't do it is that it grew out of LLM-first assumptions, not agent-first assumptions. Retrofitting is mostly a discipline problem, not a technology problem.
The EU AI Act, in Article 12, requires automatic record-keeping over the lifetime of any high-risk AI system, with enough detail to identify situations that may present risks or lead to substantial modification of the system. For a chat-only application, a log of inputs and outputs arguably satisfies that. For an agent that invokes tools, it does not. You cannot identify the risk situations if you haven't logged the actions.
ISO 42001, which more enterprises are adopting as an AI management standard, points in the same direction: records of decisions made by the system, not only records of what the system was asked.
SOC 2 is more implementation-agnostic, but increasingly auditors are asking AI-using companies specifically about agent oversight - and "we log the LLM prompts" is becoming a less satisfying answer. Expect questions about tool inventories, risk tiering of capabilities, and evidence of human-in-the-loop on high-risk actions.
None of this means every team needs a full control plane tomorrow. It does mean that the logging decisions being made in Q2 2026 will be audited against a 2027 standard, and that bar will almost certainly be higher than "we have prompts and completions in a database somewhere."
The systems that stay defensible will treat agent behavior the way financial systems treat transactions: every material decision captured, joined to its context, retained to a policy, reviewable by a human and by automated rules. What auditors call a control and what engineers call a trace will converge. The next step is automated trace-level compliance scoring - not just logging what happened, but continuously evaluating whether each trace meets the policy requirements for its risk tier.
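What trace-level compliance scoring could look like, as a sketch over the schema above. The policy tiers and rule names are hypothetical, chosen only to show the shape of the check:

```python
# Hypothetical policy: per risk tier, which step-level flags force human
# review and which make the trace non-compliant outright.
POLICY = {
    "standard": {"review_flags": {"sends_external_message"},
                 "block_flags": set()},
    "high":     {"review_flags": {"sends_external_message"},
                 "block_flags": {"arbitrary_code_execution"}},
}

def score_trace(trace: dict, tier: str) -> dict:
    """Evaluate whether a trace meets the policy for its risk tier,
    using only the step-level risk_flags already in the log."""
    policy = POLICY[tier]
    flags = {f for step in trace["steps"] for f in step.get("risk_flags", [])}
    violations = sorted(flags & policy["block_flags"])
    return {
        "trace_id": trace["trace_id"],
        "compliant": not violations,
        "violations": violations,
        "needs_human_review": bool(flags & policy["review_flags"]),
    }

trace = {"trace_id": "trc_demo", "steps": [
    {"risk_flags": []},
    {"risk_flags": ["sends_external_message"]},
]}
result = score_trace(trace, "standard")
print(result["compliant"], result["needs_human_review"])  # True True
```

Because the check consumes only fields the schema already records, it can run continuously over the event stream rather than as a separate audit-time reconstruction.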
For teams building now, the practical advice is the same as it was in February, with one update. Log the LLM calls - still true. Log the tool calls as first-class events joined to the trace - this is the new part. Give yourself a trace structure that can represent the whole chain, not just the pieces. You do not need to ship the full control plane this quarter. You do need to make sure the events you are writing today will still make sense when an auditor shows up in eighteen months.
This is the direction we're building SignalVault toward - and the February guide is still the right starting point. This is the page that gets written next.