Guardrails: beyond the system prompt

In the previous lessons you saw system-prompt guardrails and explicit topic detection with structured output. Those are a solid baseline, but production apps usually need more:

Long chats dilute instructions—adversarial turns buried in history are easier to miss.
A single model call is both judge and speaker—jailbreaks and prompt injection target the same surface as your product logic.
Outputs can still be off-brand, unsafe, or structurally wrong even when inputs look fine.

Patterns that go further:

Parallel input checks — run a lightweight guardrail model alongside the main generation so you don’t pay full serial latency.
Dedicated prompt-injection / jailbreak detection — treat untrusted user text as data; classify it before you trust downstream behavior.
Output moderation (G-Eval style) — score the assistant message on a rubric before showing it to users or executing tools.

This lesson uses Vercel AI SDK patterns (generateText, Output.object) you already know. For a broader survey of guardrails trade-offs (accuracy vs latency vs cost), see the OpenAI cookbook on how to use guardrails.

Running the main chat and an input guardrail one after the other doubles latency. A common optimization is to start both and use Promise.race to react as soon as either completes.

When the guardrail finishes first:

If it blocks, return a refusal and do not return the main model output (in production, also abort the in-flight generation with an AbortSignal if your provider supports it).
If it allows, await the main promise and continue.

When the main model finishes first, you still must await the guardrail; if it blocks, discard the draft response.

Swap the placeholder for a prompt-injection / jailbreak detector: another generateText call with Output.object() and a tight schema. Keep the system message focused on classification only—not on answering the user.

Wire the detector into parallelInputGate: treat isMalicious as not allowed. This is the same race structure as before, with real semantics.

Output guardrails treat model text as untrusted until validated. Following the G-Eval style from OpenAI’s cookbook, ask a small model to score the draft on a 1–5 scale against explicit criteria, then block or rewrite if the score crosses your threshold (example: ≥ 3).

Tune thresholds using labeled data; false positives frustrate users, false negatives can harm trust.

Compose everything into one flow: parallel input gate → on success, main answer → output score → return text or a safe fallback.

1import { generateText } from "ai";
2import { openai } from "@ai-sdk/openai";
3
4async function generateMainAnswer(userMessage: string) {
5  const { text } = await generateText({
6    model: openai("gpt-4o-mini"),
7    prompt: userMessage,
8    system: "You are a helpful assistant about baseball and basketball only.",
9  });
10  return text;
11}
12
13async function placeholderInputGuard(userMessage: string) {
14  // Replaced in the next step with real detection.
15  return { allowed: true as const };
16}
17
18export async function parallelInputGate(userMessage: string) {
19  const guardPromise = placeholderInputGuard(userMessage);
20  const chatPromise = generateMainAnswer(userMessage);
21
22  const first = await Promise.race([
23    guardPromise.then((value) => ({ kind: "guard" as const, result: value })),
24    chatPromise.then((value) => ({ kind: "chat" as const, result: value })),
25  ]);
26
27  if (first.kind === "guard") {
28    if (!first.result.allowed) {
29      return { blocked: true as const, reason: "Input not allowed." };
30    }
31    const answer = await chatPromise;
32    return { blocked: false as const, text: answer };
33  }
34
35  const guard = await guardPromise;
36  if (!guard.allowed) {
37    return { blocked: true as const, reason: "Input not allowed." };
38  }
39
40  return { blocked: false as const, text: first.result };
41}

You now have a defense-in-depth sketch: parallel input classification, then output scoring—without relying on the system prompt alone.

Reality checks:

LLM guardrails share some weaknesses with the main model (e.g., coordinated prompt injection). Combine with rules-based filters, allowlists, PII redaction, and human review for high-stakes actions.
For tool-calling agents, validate tool names and arguments with schemas, and review tool outputs before executing side effects.

Source for this course’s CLI examples lives on GitHub (adjust the branch to match your fork): tutorial-cli-ai-chat.

For more patterns and the original parallel guardrail example (Python), see OpenAI’s cookbook: How to use guardrails.