Context Engineering
Modeling Focus
A model’s context window functions like RAM—working memory. Every piece of information occupies space, and the computational cost of processing that space scales quadratically. With n tokens, the model computes n² pairwise relationships. Double the context, quadruple the math.
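Concretely: a 4,000-token context means roughly 16 million pairwise relationships; an 8,000-token context means roughly 64 million.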
Researchers call the downstream effect context rot: as token count increases, the model’s ability to accurately retrieve and reason over information in that window degrades. It’s a gradient, not a cliff, but it’s measurable. Anthropic’s engineering team describes it as an “attention budget” that depletes with every token you add.
Pre-loading your agent with the complete customer history, every product doc, and the full competitive landscape doesn’t make it more prepared. It makes it worse at everything. You’ve essentially started a customer call by reading the entire CRM database out loud.
This is the core engineering constraint behind every AI agent you deploy for sales, CS, or revenue operations. And the term for solving it, context engineering, has been circulating since Shopify’s CEO Tobi Lütke posted about it in mid-2025 and Andrej Karpathy gave it a framework. Anthropic recently published a deep technical treatment. Manus shared production lessons from rebuilding their agent framework four separate times. Google’s Agent Development Kit team wrote about tiered storage and compiled context views.
All of it converges on the same constraint: the context window. The teams treating it like an engineering problem are shipping agents that hold up in production. Everyone else is cycling through prompt variations and blaming the model.
The engineering challenge is finding the smallest possible set of high-signal tokens that maximize the probability of getting the output you want. That’s context engineering. And for GTM teams, it reframes every decision about agent architecture.
Failure Mode
I’ve watched the same sequence play out across a dozen B2B SaaS companies trying to stand up agents for sales, customer success, and revops.
A team builds an agent. They load it with every piece of customer data they can find. They write a system prompt that tries to anticipate every scenario. The agent works well in testing, where conversations are short and context is fresh. Then it ships. By turn three of a real conversation, it’s hallucinating or contradicting itself.
The postmortem always blames the model. It’s never the model.
The failure mode is always one of two things: overloading the agent with rigid logic that breaks the moment reality deviates from the script, or starving it of usable structure so it improvises badly. Both are context problems.
Manus learned this the hard way. Their team rebuilt their entire agent framework four times, a process they call “Stochastic Graduate Descent,” before landing on a set of principles that held up in production. Their core insight: design everything around the KV-cache. In their system, the average input-to-output token ratio is 100:1.
At that ratio, success depends entirely on context management.
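The practical meaning of “design around the KV-cache”: keep the prompt prefix byte-stable and the context append-only, so the serving layer reuses cached computation for earlier tokens instead of recomputing it on every call. A minimal sketch, assuming a chat-style message API (the names are illustrative, not Manus’s code):

```python
# Illustrative sketch of KV-cache-friendly context assembly (not Manus's actual code).
# The system prefix never changes between calls and the history is append-only, so
# cached attention computation for earlier tokens can be reused on every later call.

SYSTEM_PREFIX = "You are a deal-qualification agent. Follow the playbook sections below."

def build_context(history: list[dict], new_event: str) -> list[dict]:
    """Append the new event; never rewrite, reorder, or re-timestamp earlier turns."""
    history.append({"role": "user", "content": new_event})
    return [{"role": "system", "content": SYSTEM_PREFIX}, *history]
```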
The Architecture
The teams getting this right across agentic coding, sales automation, and customer operations converge on a few principles worth internalizing.
Calibrate instruction altitude. Anthropic frames this as finding the Goldilocks zone for system prompts. Too rigid—hardcoded if/then decision trees—and the agent shatters the moment conditions change. Too vague, and it improvises with no guardrails.
The rigid version looks like this:
If account value > $50K AND last touch > 7 days AND product usage > 80th percentile, then escalate to enterprise team.
This works in deterministic automation. It fails in agent instructions because the model has no room to interpret context. The calibrated version:
Prioritize accounts showing high engagement relative to their cohort, with recent activity gaps that suggest decision-making windows.
Same intent. The model can now apply judgment to pattern-match across situations the prompt author didn’t anticipate. Structure the prompt with clear sections (XML tags, Markdown headers, whatever your framework supports) but let the language do the steering, not the logic gates.
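A sketch of what that calibration can look like in a sectioned system prompt (the role and section names here are hypothetical):

```python
# Hypothetical system-prompt skeleton: clear sections for structure, prose for judgment.
SYSTEM_PROMPT = """
<role>
You are an account-prioritization agent for a B2B SaaS revenue team.
</role>

<prioritization_guidance>
Prioritize accounts showing high engagement relative to their cohort, with recent
activity gaps that suggest decision-making windows.
</prioritization_guidance>

<escalation_guidance>
Escalate to the enterprise team when account value and urgency clearly warrant it,
and explain your reasoning; do not apply fixed thresholds mechanically.
</escalation_guidance>
""".strip()
```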
Make agents pull context, not receive it. This is the architectural insight that separates production systems from demos.
Claude Code doesn’t load your entire codebase into memory. It maintains lightweight references: file paths, stored queries. It retrieves what it needs at the moment it becomes relevant. Google’s ADK team calls this “scope by default”: every model call sees the minimum context required, and agents reach for more information explicitly via tools.
For GTM agents, this means deploying retrieval hooks:
query_crm(account_id, fields=["last_touch", "deal_stage", "arr"])
fetch_product_usage(account_id, days=30)
get_deal_notes(account_id, limit=5, sort="recent")
Let agents pull 2,000-token account summaries instead of sitting on 50,000-token interaction logs. The difference between pushing context (guessing what the agent needs) and pulling context (responding to the actual situation) is the difference between a prepared speech and a real conversation.
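A sketch of what one of those hooks might look like, with a stubbed data source standing in for a real CRM client:

```python
import json

# Hypothetical retrieval hook. In production this would hit your CRM's API; a local
# dict stands in here so the shape of the tool stays clear. The point: return a
# compact summary on demand, never the full interaction log.
FAKE_CRM = {
    "acct_123": {
        "last_touch": "2025-02-10",
        "deal_stage": "evaluation",
        "arr": 220_000,
        "raw_history": "…tens of thousands of tokens the agent never sees…",
    },
}

def query_crm(account_id: str, fields: list[str]) -> str:
    """Return only the requested fields as a small JSON string."""
    record = FAKE_CRM.get(account_id, {})
    return json.dumps({f: record.get(f) for f in fields})

# e.g. query_crm("acct_123", ["last_touch", "deal_stage", "arr"])
```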
Metadata helps here too. A file named Q4_pricing_playbook_current.md tells the agent what’s fresh and what’s stale without consuming tokens to explain it.
Build memory that outlasts the conversation. Sales cycles run longer than context windows. A deal that spans four months and fifty touchpoints can’t live inside a single conversation thread.
Manus solved this by separating storage from presentation—durable state persists across sessions while each individual model call gets a compiled, minimal view of what’s relevant right now. Google’s ADK uses the same pattern with tiered storage and what they call “compiled context views.”
In GTM terms, this looks like agent-maintained structured notes that live outside the context window:
## Key Stakeholders
- Kurt C. (VP Eng): Budget owner, wants uptime guarantees
- Courtney L. (CTO): Final approver, allergic to vendor lock-in
## Open Risks
- Security review started 2/15, usually takes 2-3 weeks
- Price war with Vendor X ($180K vs our $220K)
After fifty touchpoints, compress early interactions into structured summaries: “Technical buyer confirmed API-first requirement, economic buyer flagged 18-month payback threshold, competing against Vendor X quoting $180K.” The agent doesn’t need the raw transcript from meeting twelve. It needs the decisions and commitments that came out of it.
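A minimal sketch of that storage-versus-presentation split, assuming simple file-based notes (class and method names are illustrative):

```python
from pathlib import Path

# Hypothetical deal-memory store: durable notes live on disk, outside the context
# window; each model call gets only a compact compiled view of what matters now.
class DealMemory:
    def __init__(self, deal_id: str, root: Path = Path("deal_memory")):
        self.path = root / f"{deal_id}.md"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.touch(exist_ok=True)

    def append_note(self, section: str, note: str) -> None:
        """Persist a structured note; storage here costs zero context tokens."""
        with self.path.open("a") as f:
            f.write(f"\n## {section}\n- {note}\n")

    def compiled_view(self, max_chars: int = 4_000) -> str:
        """Return the slice that actually enters the next model call.
        A real system would summarize older notes rather than truncate them."""
        return self.path.read_text()[-max_chars:]
```

During a session the agent appends notes as decisions land; when the next call gets assembled, only the compiled view goes into the window.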
Decompose agents by function, not by ambition. The instinct is to build one agent that handles everything: product knowledge, pricing, competitors, objection handling, negotiation. Theoretically elegant. Practically, every additional capability increases the surface area for tool selection errors. A 100-tool agent isn’t 10x more capable than a 10-tool agent. It’s exponentially more likely to grab the wrong tool at the wrong time.
Manus designed their tool namespace like a filesystem, organizing actions hierarchically to reduce selection confusion. The LangChain team documented the same pattern: overlapping tool descriptions cause model confusion about which tool to invoke.
For GTM, you want discrete agents for prospecting, qualification, demo prep, and proposal generation. Each maintains its own focused context and hands off cleanly to the next stage. A prospecting agent that deeply understands enrichment signals and outreach timing. A qualification agent that knows your ICP scoring model cold. Neither one trying to also handle contract redlining.
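A sketch of that decomposition, with hypothetical namespaced tool names so no single agent ever sees the full catalog:

```python
# Hypothetical decomposition: each agent sees only its own narrowly scoped,
# namespaced tools, so tool selection stays unambiguous at every step.
AGENT_TOOLSETS = {
    "prospecting":   ["enrichment.lookup", "outreach.schedule", "crm.create_lead"],
    "qualification": ["crm.query", "product_usage.fetch", "icp.score"],
    "demo_prep":     ["crm.query", "deal_notes.get", "docs.search"],
    "proposal":      ["pricing.calculate", "docs.generate", "crm.update_stage"],
}

def tools_for(agent: str) -> list[str]:
    """Hand each agent its focused toolset; no agent gets the full catalog."""
    return AGENT_TOOLSETS[agent]
```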
Garbage In, Confidence Out
You can architect the most elegant context system in the world and it will still fail if the underlying data is trash.
That revenue leader’s competitor pricing story? The root cause wasn’t a context engineering failure. It was a data maintenance failure. Nobody owned the process of updating competitive intel. The agent performed exactly as designed. It retrieved the most relevant pricing document it had access to. That document was three months stale.
The teams I see succeeding treat their knowledge bases like production code. They have processes for deprecating outdated content, validating new additions, and flagging information for human review. Product releases get indexed within 24 hours. Competitive intel gets refreshed weekly from win/loss call transcripts. Customer health scores update daily.
The teams that struggle treat their docs like a junk drawer. Everything goes in, nothing comes out, and the agent picks through the mess every time someone asks a question. Then everyone blames the model for “hallucinating” when the real problem is that it was handed a mix of current facts, stale facts, and half-finished drafts and told to figure out which was which.
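An audit does not have to be sophisticated. A sketch, assuming docs are organized by content type with a refresh window per type:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Hypothetical staleness audit: flag knowledge-base files older than the refresh
# window for their content type so a human reviews or deprecates them.
MAX_AGE = {
    "competitive": timedelta(days=7),
    "pricing": timedelta(days=30),
    "product": timedelta(days=90),
}

def stale_docs(kb_root: Path) -> list[Path]:
    flagged, now = [], datetime.now(timezone.utc)
    for doc in kb_root.rglob("*.md"):
        limit = MAX_AGE.get(doc.parent.name, timedelta(days=180))  # e.g. kb/pricing/…
        modified = datetime.fromtimestamp(doc.stat().st_mtime, tz=timezone.utc)
        if now - modified > limit:
            flagged.append(doc)
    return flagged
```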
Assign ownership. Run audits. Care about data quality the way your engineering team cares about code quality.
This sounds obvious.
Almost nobody does it.
Build Order
Audit your signals first. Map where customer context lives: CRM, product analytics, sales notes, marketing automation, support logs. You’ll find massive duplication, critical gaps, and metrics everyone watches that don’t actually predict outcomes anyone cares about. If a human can’t definitively identify what data matters in a given situation, your agent has no chance.
Kill noise aggressively. Find the smallest set of signals that demonstrably influence customer behavior. I’ve watched teams cut context by 80% and improve agent performance—not because less is always better, but because the 20% they kept was actually predictive. The other 80% was just diluting the attention budget.
Wire retrieval hooks before building agents. Create lightweight access points: queries, APIs, pre-filtered datasets. Agents invoke these on demand. Load critical references upfront for speed, enable dynamic navigation for everything else.
Layer in memory systems. Implement agent-maintained notes with file-based storage outside the context window. Let agents build knowledge over time, maintain deal state across sessions, reference previous interactions without keeping everything in active working memory. Think of it like a senior rep’s notebook. They don’t recall every conversation verbatim, but they know the stakeholder dynamics, the pricing sensitivities, the history that shapes what happens next.
Measure business outcomes, not system metrics. Trial-to-paid conversion rates. Average deal velocity. Rep preparation time per account. Expansion revenue identified versus captured. If you can’t draw a line from an architecture decision to revenue impact, you’re optimizing the wrong layer.
Timing
Every week brings a new model that’s supposedly 20% better at reasoning or 30% faster at code generation. And models are genuinely improving. But model capability is rarely the actual bottleneck in production.
A well-architected agent on last year’s model will outperform a poorly architected agent on next year’s model. Every improvement in model capability amplifies the advantage of good architecture and amplifies the cost of bad architecture. Better models don’t fix bad context. They just produce wrong answers faster and with more confidence.
Manus rebuilt their framework four times before finding what worked. Anthropic is publishing engineering guides specifically about context management. Google built an entire context stack into their agent framework. The infrastructure layer is where the leverage is.
The window to build this competency is right now, while most teams are still cycling through prompt variations and hoping the next model release solves their problems.
It won’t.
✌🏽 SR



