Building AI Agent Workflows with MCP: The Pattern I Use After One Too Many Unapproved Posts

The workflow looked innocent: when a pull request gets the bug label, read the diff, suggest a test, post a comment. I wired it through MCP on a Friday afternoon. By Friday night it had posted a comment on the wrong PR because I fat-fingered a repo slug in the GitHub server config and the agent confidently used the only repo it could see.

Nothing leaked. Nothing merged. Still embarrassing.

That failure split my thinking into two tracks. Demos optimize for "watch the agent go." Production workflows optimize for "predictable side effects, replayable traces, and a human at the write boundary." This post is the second track.

I am Rohit Singh, a developer in Jaipur who ships desktop apps and client web work. I use MCP daily in Cursor and in small custom runners for automation. If you already know the protocol basics, skim what MCP is. If you are choosing autonomous shells like Hermes or OpenClaw, read Hermes vs OpenClaw for how personal agents differ from the orchestration style here.

What is an MCP workflow?

An MCP workflow is a bounded sequence where an MCP host (Cursor, Claude Desktop, or your own runner) calls one or more MCP servers that expose tools (actions) and resources (readable context). The model plans steps; your code enforces policy.

It is not "an agent that does my job." It is a named procedure with:

A trigger (webhook, cron, label change, manual slash command)
A tool allowlist (read tools vs write tools)
A step budget (max iterations and cost ceiling)
Approval gates before irreversible calls
A structured trace per run
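
The five properties above can be written down as data before any agent code exists. The following is a minimal sketch, not part of MCP itself: `WorkflowSpec`, `validateSpec`, and every field name are hypothetical, chosen only to mirror the list.

```typescript
// Hypothetical shape: these field names are illustrative, not defined by MCP.
type Trigger = "webhook" | "cron" | "label_change" | "manual";

interface WorkflowSpec {
  name: string;
  trigger: Trigger;
  readTools: string[];
  writeTools: string[];
  maxSteps: number;
  maxCostUsd: number;
  approvalRequiredFor: string[];
  tracePath: string;
}

// Returns a list of problems; an empty list means the spec is usable.
function validateSpec(spec: WorkflowSpec): string[] {
  const problems: string[] = [];
  if (spec.maxSteps <= 0) problems.push("maxSteps must be positive");
  if (spec.maxCostUsd <= 0) problems.push("maxCostUsd must be positive");

  const overlap = spec.readTools.filter((t) => spec.writeTools.includes(t));
  if (overlap.length > 0) {
    problems.push(`tools listed as both read and write: ${overlap.join(", ")}`);
  }

  const ungated = spec.writeTools.filter((t) => !spec.approvalRequiredFor.includes(t));
  if (ungated.length > 0) {
    problems.push(`write tools without an approval gate: ${ungated.join(", ")}`);
  }
  return problems;
}

// Example spec using the tool names and limits that appear later in this post.
const prTriage: WorkflowSpec = {
  name: "pr-triage",
  trigger: "label_change",
  readTools: ["get_pr_diff"],
  writeTools: ["post_pr_comment"],
  maxSteps: 10,
  maxCostUsd: 0.4,
  approvalRequiredFor: ["post_pr_comment"],
  tracePath: "traces.jsonl",
};
```

Running `validateSpec` at startup turns "we forgot to gate a write tool" into a boot failure instead of a surprise comment.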

Think of MCP as the USB-C layer. The workflow is the firmware that decides when power actually flows.

Figure: MCP agent workflow with approval gates, bounded loops, and JSON tracing

Start with the workflow, not the agent

Bad framing: "Build an AI that does customer support."

Good framing: "When ticket tag equals billing, fetch account snapshot (read-only), draft reply, queue for human send."

The good framing gives you testable success criteria. You can write a golden prompt and assert the tool sequence fetch_account → draft_reply without touching send_email.

My PR triage workflow in plain language:

Trigger: bug label added
Read: PR diff, linked issue, TESTING.md resource
Plan: model proposes test suggestion
Gate: human approves post_comment
Execute: idempotent comment create with run ID in footer
Log: JSON trace stored either way

If you cannot write those six lines before writing code, you are not ready for write tools.

Tool design discipline (the part that saves you)

Each side effect gets one MCP tool. Tools should be small, typed, and boring.

Good tool schema

{
  "name": "create_linear_issue",
  "description": "Create a Linear issue in team ENG after human approval. Do not use for duplicates; search issues first.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "title": { "type": "string", "minLength": 5 },
      "teamId": { "type": "string" },
      "idempotencyKey": { "type": "string" }
    },
    "required": ["title", "teamId", "idempotencyKey"]
  }
}

Descriptions are prompts. If the model picks the wrong tool, my first fix is English in the description, not a bigger model.

Rules I follow

Rule	Why
Idempotency keys on writes	Retries do not duplicate issues
Read tools separate from write tools	Policy engines can block by name prefix
No mega-tools	`manage_github` becomes impossible to test
Return structured errors	`{ "retryable": true, "code": "RATE_LIMIT" }` beats stack traces in the model context
Log argument hashes, not secrets	Debugging without leaking tokens
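
Two of these rules, idempotency keys and structured errors, fit in a few lines. This is a sketch under assumptions: `idempotencyKey`, `IdempotentWriter`, and the `UPSTREAM_ERROR` code are hypothetical names, the store is an in-memory map that would not survive a restart, and treating every failure as retryable is a simplification.

```typescript
// Hypothetical result shapes, following the { retryable, code } idea in the table.
type ToolError = { ok: false; retryable: boolean; code: string };
type ToolSuccess<T> = { ok: true; value: T; replayed: boolean };

// Key derived from trigger + entity ID, with a version suffix for deliberate re-runs.
function idempotencyKey(trigger: string, entityId: string | number, version = 1): string {
  return `${trigger}-${entityId}-v${version}`;
}

// In-memory store for illustration only; a real server would persist keys
// or rely on the upstream API's own idempotency support.
class IdempotentWriter<T> {
  private seen = new Map<string, T>();

  write(key: string, doWrite: () => T): ToolSuccess<T> | ToolError {
    if (this.seen.has(key)) {
      // A retry with the same key returns the earlier result instead of writing again.
      return { ok: true, value: this.seen.get(key) as T, replayed: true };
    }
    try {
      const value = doWrite();
      this.seen.set(key, value);
      return { ok: true, value, replayed: false };
    } catch {
      // Simplification: every failure is reported as retryable.
      return { ok: false, retryable: true, code: "UPSTREAM_ERROR" };
    }
  }
}
```

A failed write does not record the key, so a retry after an error still goes through, while a retry after a success is a no-op.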

Resources hold STYLE.md, OpenAPI specs, runbooks. Tools hold mutations. Mixing them confuses both humans and models.

Orchestration: explicit steps vs bounded agent loop

Two patterns work in production.

Explicit step machine (DAG): best when the procedure is stable. Example: fetch diff → summarize → propose comment. You pay less in tokens; you lose flexibility when input shape varies wildly.

Bounded agent loop: best when inputs are messy but tools are safe. Cap at 10 iterations, $0.40 model spend, 120 seconds wall time. The model can replan; it cannot loop forever.
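
The three caps can live in one small guard that the loop calls after every iteration. A minimal sketch: `RunBudget` and `charge` are hypothetical names, and the per-step cost is assumed to come from whatever usage numbers your model client reports.

```typescript
interface BudgetLimits {
  maxSteps: number;
  maxCostUsd: number;
  maxWallMs: number;
}

// Hypothetical helper: tracks steps, spend, and elapsed time for one run.
class RunBudget {
  private steps = 0;
  private costUsd = 0;
  private readonly limits: BudgetLimits;
  private readonly startedAt: number;

  constructor(limits: BudgetLimits, startedAt: number = Date.now()) {
    this.limits = limits;
    this.startedAt = startedAt;
  }

  // Call once per model iteration with that iteration's cost.
  // Throws as soon as any cap is exceeded, which ends the loop.
  charge(stepCostUsd: number, now: number = Date.now()): void {
    this.steps += 1;
    this.costUsd += stepCostUsd;
    if (this.steps > this.limits.maxSteps) throw new Error("step budget exceeded");
    if (this.costUsd > this.limits.maxCostUsd) throw new Error("cost budget exceeded");
    if (now - this.startedAt > this.limits.maxWallMs) throw new Error("wall-time budget exceeded");
  }
}

// The limits quoted above: 10 iterations, $0.40, 120 seconds.
const defaultLimits: BudgetLimits = { maxSteps: 10, maxCostUsd: 0.4, maxWallMs: 120_000 };
```

Passing `now` explicitly keeps the guard testable without waiting two real minutes.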

I start explicit. I move to bounded loops only after at least two weeks of traces show predictable tool paths.

For personal autonomous agents (always-on messaging bots), see AI agents landscape 2026. Those tools optimize for autonomy. MCP workflows optimize for control.

Approval gates: where production actually lives

Reads can be liberal. Writes earn friction.

Implementation options I have used:

Host-native approval (Cursor asks before tool execution)
Policy wrapper in your runner (intercepts tools/call for write_*)
Human queue (agent creates draft artifact; you publish via separate tool)

Example policy map:

// policy.ts
export type ToolMode = "read" | "write";

export const TOOL_POLICY: Record<string, ToolMode> = {
  list_pull_requests: "read",
  get_diff: "read",
  summarize_diff: "read",
  post_pr_comment: "write",
  create_linear_issue: "write",
};

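// Deny by default: any tool not explicitly marked "read" needs approval,
// including tools that are missing from TOOL_POLICY.
export function requiresApproval(toolName: string): boolean {
  return TOOL_POLICY[toolName] !== "read";
}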

post_pr_comment never runs silently in my setups. Ever.

Structured logging (your future self is the user)

Every tool invocation emits one JSON line:

{
  "run_id": "pr-triage-20260605-001",
  "step": 4,
  "tool": "post_pr_comment",
  "latency_ms": 842,
  "outcome": "ok",
  "pr_number": 128,
  "idempotency_key": "bug-label-128-v1"
}

When the model does something weird, you grep run_id, not chat history. Chat history lies by omission. Traces do not.
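
One way to build such a line while following the "log argument hashes, not secrets" rule is sketched below. `stableStringify`, `hashArgs`, `traceLine`, and the `args_hash` field are hypothetical; the 16-character truncation is an arbitrary choice. A hash of short or guessable arguments can still be brute-forced, so real secrets should stay out of tool arguments altogether.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted keys so the same arguments always hash the same way.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Short SHA-256 prefix: enough to tell two calls apart in a trace.
function hashArgs(args: Record<string, unknown>): string {
  return createHash("sha256").update(stableStringify(args)).digest("hex").slice(0, 16);
}

// Builds one JSON line; the raw arguments never appear in the output.
function traceLine(
  runId: string,
  step: number,
  tool: string,
  args: Record<string, unknown>,
  outcome: "ok" | "blocked" | "error",
  latencyMs: number,
): string {
  return JSON.stringify({
    run_id: runId,
    step,
    tool,
    outcome,
    latency_ms: latencyMs,
    args_hash: hashArgs(args),
  });
}
```

Two runs that called the same tool with the same arguments share an `args_hash`, which is usually all a debugging session needs.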

Working example: minimal MCP server + workflow runner

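The following TypeScript uses the official MCP SDK pattern. Install in a fresh folder (the commands take the latest releases as of mid-2026; pin exact versions once it runs). The package is set to ESM because server.ts uses top-level await, and tsx is installed because the runner starts the server with it:

npm init -y
npm pkg set type=module
npm install @modelcontextprotocol/sdk zod
npm install -D typescript tsx @types/node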

`server.ts` — one read tool, one write tool

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "pr-triage", version: "1.0.0" });

server.registerTool(
  "get_pr_diff",
  {
    description: "Fetch unified diff for a pull request number in the configured repo.",
    inputSchema: {
      prNumber: z.number().int().positive(),
    },
  },
  async ({ prNumber }) => {
    // Replace with real GitHub API client + read-only token
    const diff = await fakeFetchDiff(prNumber);
    return { content: [{ type: "text", text: diff }] };
  }
);

server.registerTool(
  "post_pr_comment",
  {
    description: "Create a PR review comment. Requires human approval in the host runner.",
    inputSchema: {
      prNumber: z.number().int().positive(),
      body: z.string().min(10),
      idempotencyKey: z.string().min(8),
    },
  },
  async ({ prNumber, body, idempotencyKey }) => {
    const result = await fakeCreateComment({ prNumber, body, idempotencyKey });
    return { content: [{ type: "text", text: JSON.stringify(result) }] };
  }
);

async function fakeFetchDiff(prNumber: number): Promise<string> {
  return `diff --git a/src/example.ts b/src/example.ts\n--- a/src/example.ts\n+++ b/src/example.ts\n@@ -1 +1 @@\n-old\n+new (${prNumber})`;
}

async function fakeCreateComment(args: {
  prNumber: number;
  body: string;
  idempotencyKey: string;
}): Promise<{ url: string; idempotencyKey: string }> {
  return {
    url: `https://github.com/org/repo/pull/${args.prNumber}#comment-1`,
    idempotencyKey: args.idempotencyKey,
  };
}

const transport = new StdioServerTransport();
await server.connect(transport);

`runner.ts` — bounded loop with approval and trace

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
import { requiresApproval } from "./policy.js";
import fs from "node:fs";

const MAX_STEPS = 10;
const BUDGET_USD = 0.4; // enforced by the real model loop; unused by the planning stub below

type TraceEvent = {
  run_id: string;
  step: number;
  tool: string;
  outcome: "ok" | "blocked" | "error";
  latency_ms: number;
};

async function main() {
  const runId = `pr-triage-${Date.now()}`;
  const transport = new StdioClientTransport({
    command: "node",
    args: ["--import", "tsx", "server.ts"],
  });

  const client = new Client({ name: "workflow-runner", version: "1.0.0" });
  await client.connect(transport);

  const tools = await client.listTools();
  console.log(
    "available tools:",
    tools.tools.map((t) => t.name).join(", ")
  );

  // In production, replace this stub with your model loop that plans tool calls.
  const plannedCalls = [
    { name: "get_pr_diff", args: { prNumber: 128 } },
    {
      name: "post_pr_comment",
      args: {
        prNumber: 128,
        body: "Suggested test: reproduce with empty payload on /api/v1/widgets.",
        idempotencyKey: "bug-label-128-v1",
      },
    },
  ];

  let step = 0;
  for (const call of plannedCalls) {
    step += 1;
    if (step > MAX_STEPS) throw new Error("step budget exceeded");

    const started = Date.now();
    if (requiresApproval(call.name)) {
      const approved = await askHuman(`Approve ${call.name} on PR ${call.args.prNumber}?`);
      if (!approved) {
        logTrace({ run_id: runId, step, tool: call.name, outcome: "blocked", latency_ms: Date.now() - started });
        continue;
      }
    }

    // Tool-level failures can come back as a result with isError set rather
    // than as a thrown exception, so check both before recording "ok".
    let outcome: TraceEvent["outcome"] = "error";
    try {
      const result = await client.callTool({ name: call.name, arguments: call.args });
      if (result.isError) throw new Error(`${call.name} returned a tool error`);
      outcome = "ok";
    } finally {
      logTrace({ run_id: runId, step, tool: call.name, outcome, latency_ms: Date.now() - started });
    }
  }

  await client.close();
}

function logTrace(event: TraceEvent) {
  fs.appendFileSync("traces.jsonl", JSON.stringify(event) + "\n");
}

async function askHuman(question: string): Promise<boolean> {
  // Hook to Slack, email, or a CLI prompt in real life
  console.log(question);
  return false; // default deny in CI; flip for local manual tests
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

This is intentionally explicit about the planning stub. The production piece you own is the model loop plus policy. MCP standardizes how tools are discovered and called once the plan exists.

Wire the server in Cursor via mcp.json using the same node command, then iterate prompts in Agent mode. Cursor MCP setup covers host configuration details.

Example `mcp.json` entry (Cursor)

{
  "mcpServers": {
    "pr-triage": {
      "command": "node",
      "args": ["--import", "tsx", "D:/workflows/pr-triage/server.ts"],
      "env": {
        "GITHUB_READ_TOKEN": "${env:GITHUB_READ_TOKEN}",
        "GITHUB_REPO": "org/repo"
      }
    }
  }
}

Keep paths absolute on Windows. Scope tokens in env vars the host injects, not in the committed file. After reload, confirm get_pr_diff and post_pr_comment appear in the tool list before you trust any prompt.

Resources: ground the model in your docs, not the internet

Expose stable context as MCP resources, not one-off paste bins.

resources/
  TESTING.md
  api/openapi.yaml
  runbooks/oncall-payments.md

Register them read-only. When the model drafts a test suggestion, it cites your testing doc. In my projects, hallucination rate drops more from good resources than from swapping GPT-4 for another flagship model.

Testing agents like you test HTTP handlers

Golden prompts

Store five prompts with expected tool sequences:

# tests/golden.yaml
- name: bug_label_small_diff
  prompt: "PR 128 labeled bug. Suggest regression test."
  expect_tools:
    - get_pr_diff
    - post_pr_comment
  deny_tools:
    - create_linear_issue

Run in CI weekly. Models drift; tools break.
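
A checker for one golden case can be a pure function over the tool names a run actually called. This sketch assumes the YAML has already been parsed into an object and that "expected sequence" means the tools appear in that order with other calls allowed in between; `GoldenCase` and `checkGolden` are hypothetical names.

```typescript
// Mirrors the fields of one entry in tests/golden.yaml.
interface GoldenCase {
  name: string;
  expect_tools: string[];
  deny_tools: string[];
}

// Returns a list of failures; an empty list means the run matched the golden case.
function checkGolden(golden: GoldenCase, calledTools: string[]): string[] {
  const failures: string[] = [];

  const denied = calledTools.filter((t) => golden.deny_tools.includes(t));
  if (denied.length > 0) failures.push(`denied tools called: ${denied.join(", ")}`);

  // Expected tools must appear in order; unrelated calls in between are tolerated.
  let cursor = 0;
  for (const tool of calledTools) {
    if (tool === golden.expect_tools[cursor]) cursor += 1;
  }
  if (cursor < golden.expect_tools.length) {
    failures.push(`missing or out-of-order tool: ${golden.expect_tools[cursor]}`);
  }
  return failures;
}

const bugLabelSmallDiff: GoldenCase = {
  name: "bug_label_small_diff",
  expect_tools: ["get_pr_diff", "post_pr_comment"],
  deny_tools: ["create_linear_issue"],
};
```

The list of called tools can come straight from the `tool` field of the run's trace lines, so the golden test and the production logs share one source of truth.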

Chaos cases

Tool returns 500 → runner retries once, then escalates to human queue
Tool returns empty diff → workflow stops; no comment posted
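Model proposes a malformed PR number (zero, negative, non-integer) → schema validation rejects it before the call; a well-formed but wrong number still needs a check against the trigger payload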
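
These cases reduce to a small decision function that the runner consults after each tool call. The sketch below is one possible encoding: `ToolOutcome`, `NextAction`, and `nextAction` are hypothetical, and `attempt` is assumed to start at 1 for the first try.

```typescript
// Hypothetical summary of what came back from a tool call.
type ToolOutcome =
  | { kind: "ok"; text: string }
  | { kind: "http_error"; status: number }
  | { kind: "invalid_args"; reason: string };

type NextAction = "continue" | "retry" | "escalate_to_human" | "stop";

function nextAction(tool: string, outcome: ToolOutcome, attempt: number): NextAction {
  switch (outcome.kind) {
    case "http_error":
      // One retry on a 5xx, then hand the run to a human.
      return outcome.status >= 500 && attempt < 2 ? "retry" : "escalate_to_human";
    case "invalid_args":
      // Schema validation failed: never call the tool with bad input.
      return "stop";
    case "ok":
      // An empty diff means there is nothing to comment on.
      return tool === "get_pr_diff" && outcome.text.trim() === "" ? "stop" : "continue";
  }
}
```

Because the function is pure, each chaos case becomes a one-line unit test instead of a flaky integration run.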

Cost caps

Track spend per run_id. My PR triage workflow averages $0.06–$0.12 with a mid-tier model when diff size stays under 400 lines. Over that I chunk the diff in the read tool instead of sending walls of text to the model.
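
Chunking inside the read tool might look like the sketch below. It is an assumption about how to chunk, not a description of the exact code behind those numbers: `chunkDiff` is a hypothetical helper that splits a unified diff at `diff --git` file boundaries and falls back to raw line counts only when a single file exceeds the limit.

```typescript
// Splits a unified diff into chunks of at most maxLines lines.
// Joining the chunks with "\n" reproduces the original diff.
function chunkDiff(diff: string, maxLines = 400): string[] {
  if (diff === "") return [];

  // Group lines into per-file sections, each starting at a "diff --git" header.
  const sections: string[][] = [];
  let section: string[] = [];
  for (const line of diff.split("\n")) {
    if (line.startsWith("diff --git ") && section.length > 0) {
      sections.push(section);
      section = [];
    }
    section.push(line);
  }
  sections.push(section);

  // Pack whole files into chunks; only an oversized file is cut mid-file.
  const chunks: string[][] = [];
  let current: string[] = [];
  for (const fileLines of sections) {
    for (let i = 0; i < fileLines.length; i += maxLines) {
      const slice = fileLines.slice(i, i + maxLines);
      if (current.length + slice.length > maxLines) {
        chunks.push(current);
        current = [];
      }
      current = current.concat(slice);
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks.map((c) => c.join("\n"));
}
```

The tool can then return one chunk per call, or summarize each chunk before the model sees it.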

Trace review habit

Every Monday I skim traces.jsonl for outcome: "blocked" and outcome: "error". Blocked writes usually mean approval UX is working. Errors cluster around rate limits or stale repo config. Two months in, this takes ten minutes and prevents the slow drift where everyone assumes the bot is fine because it is quiet.
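
The skim can be partly automated. This sketch assumes the trace lines carry at least the `tool` and `outcome` fields shown earlier; `summarizeTraces` is a hypothetical helper that takes the file contents as a string (for example from `fs.readFileSync("traces.jsonl", "utf8")`) and counts anything it cannot parse as malformed.

```typescript
type Outcome = "ok" | "blocked" | "error";

interface TraceSummary {
  counts: Record<Outcome, number>;
  errorsByTool: Record<string, number>;
  malformed: number;
}

// Parses one JSONL line; returns null for anything that is not a JSON object.
function parseTraceLine(line: string): { tool?: unknown; outcome?: unknown } | null {
  try {
    const parsed: unknown = JSON.parse(line);
    if (typeof parsed === "object" && parsed !== null) {
      return parsed as { tool?: unknown; outcome?: unknown };
    }
    return null;
  } catch {
    return null;
  }
}

function summarizeTraces(jsonl: string): TraceSummary {
  const summary: TraceSummary = {
    counts: { ok: 0, blocked: 0, error: 0 },
    errorsByTool: {},
    malformed: 0,
  };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const event = parseTraceLine(line);
    const outcome = event?.outcome;
    if (outcome !== "ok" && outcome !== "blocked" && outcome !== "error") {
      summary.malformed += 1;
      continue;
    }
    summary.counts[outcome] += 1;
    const tool = event?.tool;
    if (outcome === "error" && typeof tool === "string") {
      summary.errorsByTool[tool] = (summary.errorsByTool[tool] ?? 0) + 1;
    }
  }
  return summary;
}
```

A sudden jump in `errorsByTool` for one tool is the usual sign of a rate limit or a stale config.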

What I tried that did not work

One mega github tool. The model could not reliably choose between search, read, and write operations. Splitting tools fixed more than prompt tuning.

Trusting host approval UI alone. Developers click through approvals on autopilot. Policy deny-by-default in code catches fatigue mistakes.

Skipping idempotency. A retry after a timeout created duplicate Linear issues. Now every write carries a key derived from trigger + entity ID.

Letting the model pick channels or repos without validation. My Friday night mistake. Hardcode allowed repos in server config, not in prompts.

Building autonomous messaging before explicit workflows. I ran OpenClaw and Hermes experiments. Fun, useful, different problem. Production team workflows still look like the diagram above.
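
For the repo-validation lesson, a startup check along these lines would reject a mistyped slug before the first tool call. It is a sketch under assumptions: `ALLOWED_REPOS` and `resolveRepo` are hypothetical, the allowlist is hardcoded for illustration, and the slug pattern is a loose owner/name check rather than GitHub's exact naming rules.

```typescript
// Hypothetical allowlist, hardcoded in the server rather than supplied by a prompt.
const ALLOWED_REPOS: ReadonlySet<string> = new Set(["org/repo"]);

// Validates the configured slug (for example process.env.GITHUB_REPO) at startup.
// Throws instead of falling back, so a typo stops the server from starting.
function resolveRepo(configured: string | undefined, allowed: ReadonlySet<string>): string {
  const slug = (configured ?? "").trim();
  if (!/^[\w.-]+\/[\w.-]+$/.test(slug)) {
    throw new Error(`malformed repo slug: "${slug}"`);
  }
  if (!allowed.has(slug)) {
    throw new Error(`repo not in allowlist: ${slug}`);
  }
  return slug;
}
```

Calling `resolveRepo(process.env.GITHUB_REPO, ALLOWED_REPOS)` before `server.connect` would fail fast on a typo, provided the allowlist lives in code or another reviewed file rather than in the same hand-edited config as the slug.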

Trade-offs vs other stacks

Approach	Pros	Cons
MCP workflow (this post)	Testable, portable tools, host swap	You build orchestration
LangGraph / Temporal DAG	Strong SLAs, visual ops	Heavier upfront design
Personal agents (Hermes/OpenClaw)	Always-on, multi-channel	Harder to enforce enterprise policy
IDE-only MCP	Fastest for dev tasks	Not a cron-friendly ops layer

Hybrids are normal. MCP servers as the tool layer, personal agents for exploratory research, DAG for customer-facing automation.

Security checklist (short, not optional)

Read-only DB users for analytics tools
Filesystem servers scoped to repo root (MCP security basics)
Secrets in env vars, never in committed mcp.json
Separate tokens per workflow, not one god PAT
Review third-party MCP servers like dependencies (npm supply chain post mindset applies)

Known limitations

MCP does not standardize approval UX. You implement gates per host.
Long diffs blow token budgets. Preprocess in tools.
Model tool-choice errors still happen with great schemas. Golden tests catch drift; they do not eliminate it.
Cross-host feature parity is imperfect. Test on the host you deploy.

FAQ

Do I need an autonomous agent framework to use MCP workflows?

No. MCP workflows run fine inside Cursor, a cron-fired Node script, or a CI job with a headless host. Frameworks add channels and memory; they are not prerequisites.

How many tools should one server expose?

I aim for 5–12 per domain server (github-read, github-write). More than that and selection accuracy drops in my traces.

Where does prompt engineering fit?

Prompts steer planning; policy enforces boundaries. Prompt engineering vs software engineering is the longer argument. Short version: prompts are not your only safety layer.

Can I reuse the same MCP servers in Hermes or OpenClaw?

Often yes, with adapter friction. Invest in idempotent tool contracts and you swap orchestration shells faster.

What should my first workflow be?

Read-only summarization with logs. Add one write tool after a week of clean traces.

Closing: the workflow you can debug at midnight

The point is not maximal autonomy. The point is a procedure your teammate could run manually, now accelerated by a model, with traces that tell the truth when it misbehaves.

Start with one trigger, two read tools, one gated write, ten-step cap. Log JSON. Run golden tests. Expand only when traces bore you.