Summarizer
End-to-end recipe for wrapping a Sonny Labs scan around an LLM summarizer — input scan, model call, output scan, decision handling, and same-pattern snippets for OpenAI, Anthropic, LangChain, LlamaIndex, Vercel AI SDK, and Semantic Kernel.
A worked example for putting Sonny Labs in front of (and behind) an
existing LLM summarizer. Covers where the two scan calls go relative
to the model call, what to do with each decision.action, how to
keep the request and response correlated, and how to express the
same pattern in the major Python and TypeScript LLM frameworks.
At a glance
Around your existing LLM call, do exactly two extra things:
┌──────────────────────────┐
user input → │ scan(user_message) │ → decision A
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ YOUR LLM CALL │ (only if A != "blocked")
└──────────────────────────┘
│
▼
┌──────────────────────────┐
model output → │ scan(assistant_output) │ → decision B
└──────────────────────────┘
│
▼
return / block / redact

Both scans share the same agent_id and session_id so the
dashboard can correlate them. Honour decision.action === "blocked"
as a hard stop on each side. Do not call the LLM if the input was
blocked, and do not return the model output if the output was
blocked.
The core pattern (TypeScript + OpenAI)
import { SonnyLabsClient } from "@sonnylabs/sdk";
const sonny = new SonnyLabsClient({
apiKey: process.env.SONNY_API_KEY!,
baseUrl: process.env.SONNY_BASE_URL, // unset for SaaS; set for self-hosted
});
export async function summarize(
text: string,
opts: { agentId?: string; sessionId?: string } = {},
): Promise<{ status: string; summary?: string; reason?: string }> {
const agentId = opts.agentId ?? "summarizer";
const sessionId = opts.sessionId ?? `session-${Date.now()}`;
// 1. Scan the input.
const inputScan = await sonny.createContentScan({
surface: "user_message",
content: { type: "text", text },
context: { agent_id: agentId, session_id: sessionId },
});
// 2. Honour the input decision BEFORE calling the LLM.
if (inputScan.decision.action === "blocked") {
return { status: "blocked_input", reason: inputScan.decision.reason };
}
// 3. Call the LLM.
const summary = await callOpenAISummarizer(text); // your existing OpenAI call
// 4. Scan the output, sharing agent_id + session_id.
const outputScan = await sonny.createContentScan({
surface: "assistant_output",
content: { type: "text", text: summary },
context: { agent_id: agentId, session_id: sessionId },
});
// 5. Honour the output decision before returning.
if (outputScan.decision.action === "blocked") {
return { status: "blocked_output", reason: outputScan.decision.reason };
}
return { status: "ok", summary };
}

A few details worth pinning down because they are easy to get wrong:
- surface is a fixed enum. Valid values are user_message, assistant_output, tool_result, tool_params, document, agent_message, mcp_resource, and mcp_tool_description. Use user_message for the prompt and assistant_output for the response — the dashboard groups events by surface, and the policy you write against the input is rarely the same one you want against the output.
- decision.action returns "blocked" | "flagged" | "warned" | "allowed" — full strings, not block / allow. Switching on the wrong literal silently treats every scan as not blocked, which is the worst possible failure mode.
- createContentScan is a thin wrapper around createScan that injects the kind: "content" discriminator the API requires.
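Because the four strings are easy to mistype, an exhaustive switch on decision.action is worth the extra lines. A minimal sketch, reusing the inputScan variable from the core pattern above:

switch (inputScan.decision.action) {
  case "blocked":
    return { status: "blocked_input", reason: inputScan.decision.reason };
  case "flagged":
  case "warned":
    // Not a hard stop: log the scan id so a reviewer can reconstruct what fired.
    console.warn("[sonny] non-blocking finding", inputScan.id, inputScan.decision.action);
    break;
  case "allowed":
    break;
  default:
    // A new action value from a future API version lands here instead of being silently allowed.
    console.warn("[sonny] unknown decision.action", inputScan.decision.action);
}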
Decision handling — what each action means
| decision.action | What it means | Recommended posture |
|---|---|---|
| allowed | No detector tripped over its policy threshold. Safe to proceed. | Continue normally. |
| warned | A detector fired below the action threshold, or matched a warn rule. | Log the scan.id and continue. Surface in your own UX if the channel is user-facing. |
| flagged | Stronger signal than warned; policy chose to mark but not block. | Log + continue. Consider a human-review path; emit your own alert if the surface is high-risk. |
| blocked | Policy chose to block. Whatever channel produced the content must not be passed downstream. | Hard stop. Return a safe stub. Never call the LLM with a blocked input or surface a blocked output. |
The decision object also carries:
- reason — one of score_threshold, rule_match, allow_list, block_list, policy_default, no_detectors_fired. Useful in logs and in your own UX ("This content was blocked because it matched a configured block-list rule").
- policy_id and policy_revision_id — pin which policy revision produced the decision. Worth logging alongside the scan.id so a later policy edit doesn't muddy your incident review.
- triggering_finding and matched_rule_id — point at the specific detector / rule that crossed the threshold. The full per-detector output is in scan.findings.
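A structured log line that carries these fields makes a later incident review much easier. A minimal sketch, reusing inputScan, agentId, and sessionId from the core pattern:

console.info("[sonny] scan decision", {
  scanId: inputScan.id,
  action: inputScan.decision.action,
  reason: inputScan.decision.reason, // e.g. score_threshold, rule_match
  policyId: inputScan.decision.policy_id,
  policyRevisionId: inputScan.decision.policy_revision_id,
  matchedRuleId: inputScan.decision.matched_rule_id,
  agentId,
  sessionId,
});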
Linking the request scan and the response scan
There is no server-issued "conversation ID" today. Use ScanContext
to carry the linkage:
context: { agent_id: "summarizer", session_id: sessionId }

agent_id is a free-form client identifier; pick one stable per
component (one summarizer = one agent_id). session_id is your
per-conversation correlation handle — generate one at the start of
the conversation and reuse it for every scan call until that
conversation ends. The dashboard's scan history filters on both
fields, so a query like "show me every input + output for session
session-abc123" returns the two halves side-by-side.
If you already propagate W3C trace context (OpenTelemetry,
distributed tracing), pass your trace ID through context.trace_id
on both scans. Sonny Labs forwards it to outbound webhook deliveries,
so any webhook you wire to a SIEM can be joined back to your own
traces without an extra lookup.
Heads up. Sharing (agent_id, session_id) is the supported correlation pattern today. If you need a stricter join (for example, for a downstream analytics pipeline), persist the input scan.id alongside the output scan.id in your own store at the call site.
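Putting that together, a minimal sketch: traceId comes from your tracing setup, scanPairs is a placeholder for your own store, and inputScan / outputScan are the two scans from the core pattern.

// Shared context for both scans in this conversation.
const context = {
  agent_id: "summarizer",
  session_id: sessionId,
  trace_id: traceId, // your W3C trace ID, e.g. from the active OpenTelemetry span
};
// Pass the same context object to both createContentScan calls, then join the halves yourself:
await scanPairs.save({
  sessionId,
  inputScanId: inputScan.id,
  outputScanId: outputScan.id,
});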
Handling failure: fail-closed vs fail-open
The SDK does not ship a built-in failOpen / failClosed flag.
Posture is your choice and lives in your call site. The right knob
to turn is the per-call timeout plus your own try / catch.
A reasonable default is fail-open on both the input scan and the output scan, with logging, because under fail-closed a scanning outage becomes a customer-visible outage. A higher-stakes deployment (regulated content, agentic tool calls) should fail closed on the input and fail open on the output, on the theory that an unscanned prompt is more dangerous than an unscanned response.
Express the policy explicitly; do not let it emerge from whatever the network does on the day:
async function scanOrFallback(
doScan: () => Promise<{ decision: { action: string } }>,
posture: "fail_open" | "fail_closed",
surface: string,
): Promise<{ decision: { action: string } }> {
try {
return await doScan();
} catch (err) {
console.warn(`[sonny] ${surface} scan failed`, err);
if (posture === "fail_closed") {
throw err; // upstream handler returns 5xx / safe stub
}
// fail_open: synthesise an "allowed" decision so the surface
// continues, but log the gap so it shows up in dashboards.
return { decision: { action: "allowed" } };
}
}

Use the SDK's per-call timeout (timeoutMs in TypeScript,
timeout in Python) to bound how long you wait before the catch
fires. A 2–5 second per-scan ceiling on a request-path summarizer is
typical; do not inherit the 30 s default for a hot path.
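Wired into the core pattern, the input scan might look like the sketch below. The fail_closed choice is illustrative; pair it with whatever per-call timeout your deployment needs so the catch inside scanOrFallback fires promptly.

const inputScan = await scanOrFallback(
  () =>
    sonny.createContentScan({
      surface: "user_message",
      content: { type: "text", text },
      context: { agent_id: agentId, session_id: sessionId },
    }),
  "fail_closed", // stricter posture on the prompt side; throws if the scan fails
  "user_message",
);
if (inputScan.decision.action === "blocked") {
  return { status: "blocked_input" };
}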
Streaming model output
The SDK does not scan a stream incrementally. The supported pattern is:
- Stream tokens to the user (or your downstream consumer) for latency.
- Buffer the full output as it arrives.
- After the stream completes, scan the assembled text on surface: "assistant_output".
- If the post-stream scan returns blocked, take the appropriate action — for a chat UI this is usually "redact and replace the visible bubble"; for a programmatic caller, surface a typed error so the caller can refuse to use the output.
This gives the user-perceived latency of streaming with a detect-and-redact safety net behind it. The cost is that a rejected stream has already partially reached the user; for surfaces where that is unacceptable, do not stream — collect the whole response server-side, scan it, then return.
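A sketch of collect-then-scan against the OpenAI streaming API. Here openaiClient is an instantiated OpenAI client, and onToken / redactStreamedOutput stand in for however you forward tokens and redact the rendered output; everything else follows the core pattern above.

const stream = await openaiClient.chat.completions.create({
  model: "gpt-4o-mini",
  stream: true,
  messages: [{ role: "user", content: `Summarize the following:\n\n${text}` }],
});

let buffered = "";
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content ?? "";
  buffered += token;
  onToken(token); // forward to the user immediately for latency
}

// Scan the assembled output once the stream has completed.
const outputScan = await sonny.createContentScan({
  surface: "assistant_output",
  content: { type: "text", text: buffered },
  context: { agent_id: agentId, session_id: sessionId },
});
if (outputScan.decision.action === "blocked") {
  redactStreamedOutput(); // e.g. replace the visible bubble with a safe stub
}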
Latency
The two scan calls add two round-trips:
- Self-hosted, in-VPC. Single-digit milliseconds per scan on the fast tier with a warm instance. The accurate tier is a few additional milliseconds of inference plus the network hop.
- SaaS over the public API. Tens of milliseconds per scan end-to-end, dominated by network round-trip rather than detector inference.
Pin options.tier: "fast" on a hot path where you would rather
catch the obvious cases quickly than wait for the heavier
classifier. The SDK ships typed SCAN_TIER_FAST,
SCAN_TIER_ACCURATE, and SCAN_TIER_AUTO constants so a typo at the
call site surfaces at compile time.
const inputScan = await sonny.createContentScan({
surface: "user_message",
content: { type: "text", text },
context: { agent_id, session_id },
options: { tier: "fast" },
});

Self-hosted parity
The same code path runs against SaaS and a self-hosted Sonny Labs
deployment. Switch via baseUrl:
const sonny = new SonnyLabsClient({
apiKey: process.env.SONNY_API_KEY!,
baseUrl: process.env.SONNY_BASE_URL ?? "https://api.sonnylabs.ai",
});

And the Python equivalent:

client = SonnyLabsClient(
api_key=os.environ["SONNYLABS_API_KEY"],
base_url=os.environ.get("SONNYLABS_BASE_URL", "https://api.sonnylabs.ai"),
)

API-key format and scope semantics are identical between the two
modes, so the same call site works against either deployment with
just an environment variable change — no separate code path, no
if (saas) {} else {} branching.
Authentication
- Mint a key in the dashboard with the scans:write scope. That is the smallest viable scope for a runtime scanner.
- Read the key from your environment or secrets manager, never from source. The example assumes SONNY_API_KEY is exported.
- The plaintext secret is shown exactly once in the create-key modal. Persist it in your secrets manager before closing the dialog; subsequent reads only return the prefix and last four characters.
- Rotate by minting a new key, updating the secret in your config, rolling the deployment, and revoking the old key (DELETE /v1/api-keys/{id}).
See the API key endpoints in the REST reference for the full lifecycle.
What NOT to do
These are integration anti-patterns worth flagging explicitly.
- Don't scan only the input. A summarizer can leak PII or policy-violating content in its output even when the input was clean. Both scans are mandatory.
- Don't call the LLM and then "decide". The order matters. By the time the model has the prompt, the prompt has been processed by the model — if the input was prompt-injection-shaped, the attack has already landed. Scan first, gate, then call.
- Don't ignore flagged / warned. They are not "allowed with a warning"; they are the signal that the policy fired below the block threshold. Log the scan.id so an incident reviewer can reconstruct what happened. Consider a human-review path on high-risk surfaces.
- Don't share one session_id across users. The dashboard filters and policies key on (agent_id, session_id). Reusing one ID across all users smears the per-conversation view and breaks any per-session rate or quota policy you later configure.
- Don't mock the SDK in production tests. Mocks live in unit tests; integration tests should hit the real /v1/scans endpoint against a test key (sk_test_…). A mocked SDK that always returns {action: "allowed"} will pass every test on the day a real upgrade silently changes the wire shape.
- Don't capture content unless you need it. options.capture = true persists the raw scanned text on the server (subject to the 30-day retention default) so you can inspect it via GET /v1/scans/{id} later. That is useful in development and during incident response, but on a hot path it is wasted bytes — leave the default capture: false in production.
- Don't reuse one Idempotency-Key across calls. The SDK auto-generates a UUIDv4 per call, which is what you want. Only override it if you have a stable upstream identifier (workflow run ID, etc.) — and if you do, never reuse it across functionally different calls, or you will get 409 idempotency.key_reuse_mismatch.
- Don't hand-roll kind / surface strings as untyped literals scattered across the codebase. Both SDKs ship typed enums for these. Centralise the strings in a constants module (see the sketch after this list) so a server-side rename surfaces as a compile error, not a 422.
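If you would rather own the literals than import the SDK enums, a minimal constants module might look like this; the names are ours, not SDK exports.

// surfaces.ts: single source of truth for the surface strings this service uses.
export const SURFACE = {
  USER_MESSAGE: "user_message",
  ASSISTANT_OUTPUT: "assistant_output",
} as const;

export type Surface = (typeof SURFACE)[keyof typeof SURFACE];

// Call sites pass SURFACE.USER_MESSAGE instead of a bare string, so a
// server-side rename shows up as one compile error in one file.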
Same pattern, different framework
The shape stays the same — two scans, one LLM call, share
(agent_id, session_id). Only the bit in the middle changes.
Python + OpenAI
import os
from sonnylabs import SonnyLabsClient

def summarize(text: str, session_id: str) -> dict:
    with SonnyLabsClient(api_key=os.environ["SONNYLABS_API_KEY"]) as sonny:
        input_scan = sonny.create_scan(
            surface="user_message",
            content={"type": "text", "text": text},
            context={"agent_id": "summarizer", "session_id": session_id},
        )
        if input_scan["decision"]["action"] == "blocked":
            return {"status": "blocked_input"}

        summary = call_openai_summarizer(text)  # your existing call

        output_scan = sonny.create_scan(
            surface="assistant_output",
            content={"type": "text", "text": summary},
            context={"agent_id": "summarizer", "session_id": session_id},
        )
        if output_scan["decision"]["action"] == "blocked":
            return {"status": "blocked_output"}

        return {"status": "ok", "summary": summary}

The Python SDK's create_scan does not take a kind= keyword — it
always sends kind: "content". Pass surface= and content=
directly.
TypeScript + Anthropic
const summary = await anthropic.messages
.create({
model: "claude-3-5-sonnet-latest",
max_tokens: 256,
messages: [
{ role: "user", content: `Summarize the following:\n\n${text}` },
],
})
.then((m) => (m.content[0]?.type === "text" ? m.content[0].text : ""));

Drop that in place of callOpenAISummarizer(text) in the core
pattern. The two createContentScan calls do not change.
TypeScript + Vercel AI SDK (generateText / streamText)
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
const { text: summary } = await generateText({
model: openai("gpt-4o-mini"),
prompt: `Summarize the following:\n\n${text}`,
});

For streamText, follow the streaming guidance above: stream to the
user for latency, buffer in parallel, and scan the buffered text once
the stream completes. The same pattern applies to Mastra and any other framework
whose model interface ultimately resolves to "an input string and an
output string".
Python + LangChain
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o-mini")
summary = llm.invoke(
[
SystemMessage(content="You are a concise summarizer."),
HumanMessage(content=f"Summarize the following:\n\n{text}"),
]
).content

Wrap that with the two sonny.create_scan(...) calls from the
Python core pattern. If you are composing with LCEL pipes, put the
input scan as the first runnable in the chain and the output scan
as a RunnableLambda after the model — or keep them as plain
function calls around chain.invoke(...) to avoid coupling Sonny
Labs to LangChain's lifecycle.
Python + LlamaIndex
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini")
summary = llm.complete(f"Summarize the following:\n\n{text}").text

Identical wrapping. If you are using LlamaIndex's higher-level
SummaryIndex, scan the user-supplied text before you build the
index (the index reads the text and may emit additional model calls
internally), and scan the rendered summary string after the
as_query_engine().query(...) call returns.
Python + Semantic Kernel
from semantic_kernel.contents.chat_history import ChatHistory
history = ChatHistory()
history.add_user_message(f"Summarize the following:\n\n{text}")
result = await chat_service.get_chat_message_content(
chat_history=history, settings=settings, kernel=kernel,
)
summary = str(result)

Same wrapper. If you are registering Sonny Labs as a Semantic Kernel
filter (FunctionInvocationFilter), wire the input scan in the
pre-invocation hook and the output scan in the post-invocation hook
— that puts the scans in front of every kernel-mediated LLM call,
not just the summarizer.
Current SDK limitations to plan around
These behaviours are honest gaps in the SDK today. Future SDK releases may close them; the patterns above are the supported way to express the intent now.
- No conversation primitive. There is no client.startConversation() → conversation_id API. Sharing (agent_id, session_id) on ScanContext is the correlation mechanism.
- No streaming-aware scan. The SDK accepts the full content string per call. Streaming summarizers must collect-then-scan, as above.
- No failOpen knob. Posture is policy at the call site, not a constructor option. The wrapper pattern in "Handling failure" above is the recommended way to express it.
- No batched scan call. Each scan is a separate POST. If you need to scan a large fan-out (for example, summarising many documents in parallel), parallelise at the language level (Promise.all / asyncio.gather) — the SDK's per-instance HTTP client pools connections, so this is cheap. See the sketch after this list.
- No SDK-level helper for the "input + LLM + output" wrapper. You have to write the four-step body yourself, which is exactly what this page documents.
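For the fan-out case, a sketch that reuses the summarize() wrapper from the core pattern; docs and batchId are placeholders for your own inputs.

// docs: string[]  (the documents to summarise); batchId: your own batch identifier.
const results = await Promise.all(
  docs.map((doc, i) =>
    summarize(doc, {
      agentId: "summarizer",
      sessionId: `batch-${batchId}-${i}`, // one correlation handle per document
    }),
  ),
);
const blockedCount = results.filter((r) => r.status !== "ok").length;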
If you hit a limitation that is not on this list,
support@sonnylabs.ai is the right
place to start.
Webhooks
Verify Sonny Labs outbound webhook signatures (HMAC-SHA256 over {t}.{body}) using the helpers shipped in the Python and TypeScript SDKs.
TypeScript SDK reference
Auto-generated symbol reference for @sonnylabs/sdk — every public class, method, options bag, error subclass, and helper, rendered from the JSDoc on the source.