Summarizer
End-to-end recipe for wrapping a Sonny Labs scan around an LLM summarizer — input scan, model call, output scan, decision handling, and same-pattern snippets for OpenAI, Anthropic, LangChain, LlamaIndex, Vercel AI SDK, and Semantic Kernel.
A worked example for putting Sonny Labs in front of (and behind) an
existing LLM summarizer. Covers where the two scan calls go relative
to the model call, what to do with each decision.action, how to
keep the request and response correlated, and how to express the
same pattern in the major Python and TypeScript LLM frameworks.
At a glance
Around your existing LLM call, do exactly two extra things:
┌──────────────────────────┐
user input → │ scan(user_message) │ → decision A
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ YOUR LLM CALL │ (only if A != "blocked")
└──────────────────────────┘
│
▼
┌──────────────────────────┐
model output → │ scan(assistant_output) │ → decision B
└──────────────────────────┘
│
▼
return / block / redact

Both scans share the same agent_id and session_id so the
dashboard can correlate them. Honour decision.action === "blocked"
as a hard stop on each side. Do not call the LLM if the input was
blocked, and do not return the model output if the output was
blocked.
The core pattern (TypeScript + OpenAI)
import { SonnyLabsClient } from "@sonnylabs/sdk";
const sonny = new SonnyLabsClient({
apiKey: process.env.SONNY_API_KEY!,
baseUrl: process.env.SONNY_BASE_URL, // unset for SaaS; set for self-hosted
});
export async function summarize(
text: string,
opts: { agentId?: string; sessionId?: string } = {},
): Promise<{ status: string; summary?: string; reason?: string }> {
const agentId = opts.agentId ?? "summarizer";
const sessionId = opts.sessionId ?? `session-${Date.now()}`;
// 1. Scan the input.
const inputScan = await sonny.createContentScan({
surface: "user_message",
content: { type: "text", text },
context: { agent_id: agentId, session_id: sessionId },
});
// 2. Honour the input decision BEFORE calling the LLM.
if (inputScan.decision.action === "blocked") {
return { status: "blocked_input", reason: inputScan.decision.reason };
}
// 3. Call the LLM.
const summary = await callOpenAISummarizer(text); // your existing OpenAI call
// 4. Scan the output, sharing agent_id + session_id.
const outputScan = await sonny.createContentScan({
surface: "assistant_output",
content: { type: "text", text: summary },
context: { agent_id: agentId, session_id: sessionId },
});
// 5. Honour the output decision before returning.
if (outputScan.decision.action === "blocked") {
return { status: "blocked_output", reason: outputScan.decision.reason };
}
return { status: "ok", summary };
}

A few details worth pinning down because they are easy to get wrong:
- surface is a fixed enum. Valid values are user_message, assistant_output, tool_result, tool_params, document, agent_message, mcp_resource, and mcp_tool_description. Use user_message for the prompt and assistant_output for the response — the dashboard groups events by surface, and the policy you write against the input is rarely the same one you want against the output.
- decision.action returns "blocked" | "flagged" | "warned" | "allowed" — full strings, not block / allow. Switching on the wrong literal silently treats every scan as not blocked, which is the worst possible failure mode.
- createContentScan is a thin wrapper around createScan that injects the kind: "content" discriminator the API requires.
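Because the four strings are easy to mistype, an exhaustive switch on decision.action is worth the extra lines. A minimal sketch, reusing the inputScan variable from the core pattern above:

switch (inputScan.decision.action) {
  case "blocked":
    return { status: "blocked_input", reason: inputScan.decision.reason };
  case "flagged":
  case "warned":
    // Not a hard stop: log the scan id so a reviewer can reconstruct what fired.
    console.warn("[sonny] non-blocking finding", inputScan.id, inputScan.decision.action);
    break;
  case "allowed":
    break;
  default:
    // A new action value from a future API version lands here instead of being silently allowed.
    console.warn("[sonny] unknown decision.action", inputScan.decision.action);
}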
Decision handling — what each action means
| decision.action | What it means | Recommended posture |
|---|---|---|
| allowed | No detector tripped over its policy threshold. Safe to proceed. | Continue normally. |
| warned | A detector fired below the action threshold, or matched a warn rule. | Log the scan.id and continue. Surface in your own UX if the channel is user-facing. |
| flagged | Stronger signal than warned; policy chose to mark but not block. | Log + continue. Consider a human-review path; emit your own alert if the surface is high-risk. |
| blocked | Policy chose to block. Whatever channel produced the content must not be passed downstream. | Hard stop. Return a safe stub. Never call the LLM with a blocked input or surface a blocked output. |
The decision object also carries:
- reason — one of score_threshold, rule_match, allow_list, block_list, policy_default, no_detectors_fired. Useful in logs and in your own UX ("This content was blocked because it matched a configured block-list rule").
- policy_id and policy_revision_id — pin which policy revision produced the decision. Worth logging alongside the scan.id so a later policy edit doesn't muddy your incident review.
- triggering_finding and matched_rule_id — point at the specific detector / rule that crossed the threshold. The full per-detector output is in scan.findings.
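A structured log line that carries these fields makes a later incident review much easier. A minimal sketch, reusing inputScan, agentId, and sessionId from the core pattern:

console.info("[sonny] scan decision", {
  scanId: inputScan.id,
  action: inputScan.decision.action,
  reason: inputScan.decision.reason, // e.g. score_threshold, rule_match
  policyId: inputScan.decision.policy_id,
  policyRevisionId: inputScan.decision.policy_revision_id,
  matchedRuleId: inputScan.decision.matched_rule_id,
  agentId,
  sessionId,
});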
Linking the request scan and the response scan
There is no server-issued "conversation ID" today. Use ScanContext
to carry the linkage:
context: { agent_id: "summarizer", session_id: sessionId }

agent_id is a free-form client identifier; pick one stable per
component (one summarizer = one agent_id). session_id is your
per-conversation correlation handle — generate one at the start of
the conversation and reuse it for every scan call until that
conversation ends. The dashboard's scan history filters on both
fields, so a query like "show me every input + output for session
session-abc123" returns the two halves side-by-side.
If you already propagate W3C trace context (OpenTelemetry,
distributed tracing), pass your trace ID through context.trace_id
on both scans. Sonny Labs forwards it to outbound webhook deliveries,
so any webhook you wire to a SIEM can be joined back to your own
traces without an extra lookup.
Heads up. Sharing (agent_id, session_id) is the supported correlation pattern today. If you need a stricter join (for example, for a downstream analytics pipeline), persist the input scan.id alongside the output scan.id in your own store at the call site.
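Putting that together, a minimal sketch: traceId comes from your tracing setup, scanPairs is a placeholder for your own store, and inputScan / outputScan are the two scans from the core pattern.

// Shared context for both scans in this conversation.
const context = {
  agent_id: "summarizer",
  session_id: sessionId,
  trace_id: traceId, // your W3C trace ID, e.g. from the active OpenTelemetry span
};
// Pass the same context object to both createContentScan calls, then join the halves yourself:
await scanPairs.save({
  sessionId,
  inputScanId: inputScan.id,
  outputScanId: outputScan.id,
});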
Handling failure: fail-closed vs fail-open
The SDK does not ship a built-in failOpen / failClosed flag.
Posture is your choice and lives in your call site. The right knob
to turn is the per-call timeout plus your own try / catch.
A reasonable default is fail-open on both the input scan and the output scan, with logging, because under fail-closed a scanning outage becomes a customer-visible outage. A higher-stakes deployment (regulated content, agentic tool calls) should fail closed on the input and fail open on the output, on the theory that an unscanned prompt is more dangerous than an unscanned response.
Express the policy explicitly; do not let it emerge from whatever the network does on the day:
async function scanOrFallback(
doScan: () => Promise<{ decision: { action: string } }>,
posture: "fail_open" | "fail_closed",
surface: string,
): Promise<{ decision: { action: string } }> {
try {
return await doScan();
} catch (err) {
console.warn(`[sonny] ${surface} scan failed`, err);
if (posture === "fail_closed") {
throw err; // upstream handler returns 5xx / safe stub
}
// fail_open: synthesise an "allowed" decision so the surface
// continues, but log the gap so it shows up in dashboards.
return { decision: { action: "allowed" } };
}
}

Use the SDK's per-call timeout (timeoutMs in TypeScript,
timeout in Python) to bound how long you wait before the catch
fires. A 2–5 second per-scan ceiling on a request-path summarizer is
typical; do not inherit the 30 s default for a hot path.
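Wired into the core pattern, the input scan might look like the sketch below. The fail_closed choice is illustrative; pair it with whatever per-call timeout your deployment needs so the catch inside scanOrFallback fires promptly.

const inputScan = await scanOrFallback(
  () =>
    sonny.createContentScan({
      surface: "user_message",
      content: { type: "text", text },
      context: { agent_id: agentId, session_id: sessionId },
    }),
  "fail_closed", // stricter posture on the prompt side; throws if the scan fails
  "user_message",
);
if (inputScan.decision.action === "blocked") {
  return { status: "blocked_input" };
}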
Streaming model output
The SDK does not scan a stream incrementally. The supported pattern is:
- Stream tokens to the user (or your downstream consumer) for latency.
- Buffer the full output as it arrives.
- After the stream completes, scan the assembled text on surface: "assistant_output".
- If the post-stream scan returns blocked, take the appropriate action — for a chat UI this is usually "redact and replace the visible bubble"; for a programmatic caller, surface a typed error so the caller can refuse to use the output.
This gives the user-perceived latency of streaming with a detect-and-redact safety net behind it. The cost is that a rejected stream has already partially reached the user; for surfaces where that is unacceptable, do not stream — collect the whole response server-side, scan it, then return.
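A sketch of collect-then-scan against the OpenAI streaming API. Here openaiClient is an instantiated OpenAI client, and onToken / redactStreamedOutput stand in for however you forward tokens and redact the rendered output; everything else follows the core pattern above.

const stream = await openaiClient.chat.completions.create({
  model: "gpt-4o-mini",
  stream: true,
  messages: [{ role: "user", content: `Summarize the following:\n\n${text}` }],
});

let buffered = "";
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content ?? "";
  buffered += token;
  onToken(token); // forward to the user immediately for latency
}

// Scan the assembled output once the stream has completed.
const outputScan = await sonny.createContentScan({
  surface: "assistant_output",
  content: { type: "text", text: buffered },
  context: { agent_id: agentId, session_id: sessionId },
});
if (outputScan.decision.action === "blocked") {
  redactStreamedOutput(); // e.g. replace the visible bubble with a safe stub
}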
Latency
The two scan calls add two round-trips:
- Self-hosted, in-VPC. Single-digit milliseconds per scan on the fast tier with a warm instance. The accurate tier is a few additional milliseconds of inference plus the network hop.
- SaaS over the public API. Tens of milliseconds per scan end-to-end, dominated by network round-trip rather than detector inference.
Pin options.tier: "fast" on a hot path where you would rather
catch the obvious cases quickly than wait for the heavier
classifier. The SDK ships typed SCAN_TIER_FAST,
SCAN_TIER_ACCURATE, and SCAN_TIER_AUTO constants so a typo at the
call site surfaces at compile time.
const inputScan = await sonny.createContentScan({
surface: "user_message",
content: { type: "text", text },
context: { agent_id, session_id },
options: { tier: "fast" },
});

Self-hosted parity
The same code path runs against SaaS and a self-hosted Sonny Labs
deployment. Switch via baseUrl:
const sonny = new SonnyLabsClient({
apiKey: process.env.SONNY_API_KEY!,
baseUrl: process.env.SONNY_BASE_URL ?? "https://api.sonnylabs.ai",
});

And the Python equivalent:

client = SonnyLabsClient(
api_key=os.environ["SONNYLABS_API_KEY"],
base_url=os.environ.get("SONNYLABS_BASE_URL", "https://api.sonnylabs.ai"),
)

API-key format and scope semantics are identical between the two
modes, so the same call site works against either deployment with
just an environment variable change — no separate code path, no
if (saas) {} else {} branching.
Authentication
- Mint a key in the dashboard with the scans:write scope. That is the smallest viable scope for a runtime scanner.
- Read the key from your environment or secrets manager, never from source. The example assumes SONNY_API_KEY is exported.
- The plaintext secret is shown exactly once in the create-key modal. Persist it in your secrets manager before closing the dialog; subsequent reads only return the prefix and last four characters.
- Rotate by minting a new key, updating the secret in your config, rolling the deployment, and revoking the old key (DELETE /v1/api-keys/{id}).
See the API key endpoints in the REST reference for the full lifecycle.
What NOT to do
These are integration anti-patterns worth flagging explicitly.
- Don't scan only the input. A summarizer can leak PII or policy-violating content in its output even when the input was clean. Both scans are mandatory.
- Don't call the LLM and then "decide". The order matters. By the time the model has the prompt, the prompt has been processed by the model — if the input was prompt-injection-shaped, the attack has already landed. Scan first, gate, then call.
- Don't ignore flagged / warned. They are not "allowed with a warning"; they are the signal that the policy fired below the block threshold. Log the scan.id so an incident reviewer can reconstruct what happened. Consider a human-review path on high-risk surfaces.
- Don't share one session_id across users. The dashboard filters and policies key on (agent_id, session_id). Reusing one ID across all users smears the per-conversation view and breaks any per-session rate or quota policy you later configure.
- Don't mock the SDK in production tests. Mocks live in unit tests; integration tests should hit the real /v1/scans endpoint against a test key (sk_test_…). A mocked SDK that always returns {action: "allowed"} will pass every test on the day a real upgrade silently changes the wire shape.
- Don't capture content unless you need it. options.capture = true persists the raw scanned text on the server (subject to the 30-day retention default) so you can inspect it via GET /v1/scans/{id} later. That is useful in development and during incident response, but on a hot path it is wasted bytes — leave the default capture: false in production.
- Don't reuse one Idempotency-Key across calls. The SDK auto-generates a UUIDv4 per call, which is what you want. Only override it if you have a stable upstream identifier (workflow run ID, etc.) — and if you do, never reuse it across functionally different calls, or you will get 409 idempotency.key_reuse_mismatch.
- Don't hand-roll kind / surface strings as untyped literals scattered across the codebase. Both SDKs ship typed enums for these. Centralise the strings in a constants module (see the sketch after this list) so a server-side rename surfaces as a compile error, not a 422.
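If you would rather own the literals than import the SDK enums, a minimal constants module might look like this; the names are ours, not SDK exports.

// surfaces.ts: single source of truth for the surface strings this service uses.
export const SURFACE = {
  USER_MESSAGE: "user_message",
  ASSISTANT_OUTPUT: "assistant_output",
} as const;

export type Surface = (typeof SURFACE)[keyof typeof SURFACE];

// Call sites pass SURFACE.USER_MESSAGE instead of a bare string, so a
// server-side rename shows up as one compile error in one file.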
Same pattern, different framework
The shape stays the same — two scans, one LLM call, share
(agent_id, session_id). Only the bit in the middle changes.
Python + OpenAI
import os
from sonnylabs import SonnyLabsClient

def summarize(text: str, session_id: str) -> dict:
    with SonnyLabsClient(api_key=os.environ["SONNYLABS_API_KEY"]) as sonny:
        input_scan = sonny.create_scan(
            surface="user_message",
            content={"type": "text", "text": text},
            context={"agent_id": "summarizer", "session_id": session_id},
        )
        if input_scan["decision"]["action"] == "blocked":
            return {"status": "blocked_input"}

        summary = call_openai_summarizer(text)  # your existing call

        output_scan = sonny.create_scan(
            surface="assistant_output",
            content={"type": "text", "text": summary},
            context={"agent_id": "summarizer", "session_id": session_id},
        )
        if output_scan["decision"]["action"] == "blocked":
            return {"status": "blocked_output"}

        return {"status": "ok", "summary": summary}

The Python SDK's create_scan does not take a kind= keyword — it
always sends kind: "content". Pass surface= and content=
directly.
TypeScript + Anthropic
const summary = await anthropic.messages
.create({
model: "claude-3-5-sonnet-latest",
max_tokens: 256,
messages: [
{ role: "user", content: `Summarize the following:\n\n${text}` },
],
})
.then((m) => (m.content[0]?.type === "text" ? m.content[0].text : ""));

Drop that in place of callOpenAISummarizer(text) in the core
pattern. The two createContentScan calls do not change.
TypeScript + Vercel AI SDK (generateText / streamText)
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
const { text: summary } = await generateText({
model: openai("gpt-4o-mini"),
prompt: `Summarize the following:\n\n${text}`,
});

For streamText, follow the streaming guidance above: stream to the
user for latency, buffer in parallel, and scan the buffered text once
the stream completes. The same pattern applies to Mastra and any other framework
whose model interface ultimately resolves to "an input string and an
output string".
Python + LangChain
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o-mini")
summary = llm.invoke(
[
SystemMessage(content="You are a concise summarizer."),
HumanMessage(content=f"Summarize the following:\n\n{text}"),
]
).content

Wrap that with the two sonny.create_scan(...) calls from the
Python core pattern. If you are composing with LCEL pipes, put the
input scan as the first runnable in the chain and the output scan
as a RunnableLambda after the model — or keep them as plain
function calls around chain.invoke(...) to avoid coupling Sonny
Labs to LangChain's lifecycle.
Python + LlamaIndex
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini")
summary = llm.complete(f"Summarize the following:\n\n{text}").text

Identical wrapping. If you are using LlamaIndex's higher-level
SummaryIndex, scan the user-supplied text before you build the
index (the index reads the text and may emit additional model calls
internally), and scan the rendered summary string after the
as_query_engine().query(...) call returns.
Python + Semantic Kernel
from semantic_kernel.contents.chat_history import ChatHistory
history = ChatHistory()
history.add_user_message(f"Summarize the following:\n\n{text}")
result = await chat_service.get_chat_message_content(
chat_history=history, settings=settings, kernel=kernel,
)
summary = str(result)

Same wrapper. If you are registering Sonny Labs as a Semantic Kernel
filter (FunctionInvocationFilter), wire the input scan in the
pre-invocation hook and the output scan in the post-invocation hook
— that puts the scans in front of every kernel-mediated LLM call,
not just the summarizer.
Current SDK limitations to plan around
These behaviours are honest gaps in the SDK today. Future SDK releases may close them; the patterns above are the supported way to express the intent now.
- No conversation primitive. There is no client.startConversation() → conversation_id API. Sharing (agent_id, session_id) on ScanContext is the correlation mechanism.
- No streaming-aware scan. The SDK accepts the full content string per call. Streaming summarizers must collect-then-scan, as above.
- No failOpen knob. Posture is policy at the call site, not a constructor option. The wrapper pattern in "Handling failure" above is the recommended way to express it.
- No batched scan call. Each scan is a separate POST. If you need to scan a large fan-out (for example, summarising many documents in parallel), parallelise at the language level (Promise.all / asyncio.gather) — the SDK's per-instance HTTP client pools connections, so this is cheap. See the sketch after this list.
- No SDK-level helper for the "input + LLM + output" wrapper. You have to write the four-step body yourself, which is exactly what this page documents.
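For the fan-out case, a sketch that reuses the summarize() wrapper from the core pattern; docs and batchId are placeholders for your own inputs.

// docs: string[]  (the documents to summarise); batchId: your own batch identifier.
const results = await Promise.all(
  docs.map((doc, i) =>
    summarize(doc, {
      agentId: "summarizer",
      sessionId: `batch-${batchId}-${i}`, // one correlation handle per document
    }),
  ),
);
const blockedCount = results.filter((r) => r.status !== "ok").length;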
If you hit a limitation that is not on this list,
support@sonnylabs.ai is the right
place to start.
Webhooks
Verify Sonny Labs outbound webhook signatures (HMAC-SHA256 over {t}.{body}) using the helpers shipped in the Python and TypeScript SDKs.
TypeScript SDK reference
Auto-generated symbol reference for @sonnylabs/sdk — every public class, method, options bag, error subclass, and helper, rendered from the JSDoc on the source.