Until recently, a 6-month support conversation between a customer and a brand couldn't be passed to the AI model in a single call. You had to summarize, trim, vectorize and retrieve chunks via RAG. It worked, but it lost nuance: historical tone, prior decisions, the customer's exact phrasing, contradictions across sessions.
With the release of models with 1-million-token windows (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro), the math changes. The question is no longer what to summarize, but when summarization is still worth it. This guide covers what gets unlocked and what, perhaps surprisingly, it still doesn't solve.
The new arithmetic
A million tokens is roughly 750,000 words. To put that in perspective:
- A typical WhatsApp support conversation: 500-2,000 words per month per active customer
- A full 12-month history with one customer: ~25,000 words
- All product documentation + policies + internal FAQs: 50,000-150,000 words
- Operations manual of a mid-sized company: 200,000-400,000 words
With 1M tokens, you can hand the model the full customer history + all documentation + the last 50 similar-case conversations and still have room to spare. No RAG, no loss.
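A quick back-of-the-envelope check, using the ~750,000-words-per-million figure and the upper bounds from the list above (adjust the numbers for your own data):

```python
# Rough sizing: do history + docs + similar cases fit in a 1M-token window?
# Assumes ~1.33 tokens per word (i.e. ~750,000 words per million tokens).
TOKENS_PER_WORD = 1.33

pieces_words = {
    "full 12-month customer history": 25_000,
    "documentation + policies + FAQs": 150_000,   # upper bound from the list above
    "50 similar-case conversations": 50 * 2_000,  # upper bound per conversation
}

total_tokens = sum(words * TOKENS_PER_WORD for words in pieces_words.values())
print(f"Estimated total: {total_tokens:,.0f} tokens")              # ~365,750 tokens
print(f"Headroom in a 1M window: {1_000_000 - total_tokens:,.0f} tokens")
```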
What becomes viable
1. Longitudinal customer memory, no embeddings needed
Before: you vectorized every prior conversation, did top-k similarity, retrieved snippets. Useful but noisy.
Now: load the customer's full history as context. The model remembers that in March they said they prefer WhatsApp in the morning, that their kid is 4, that they had a claim in 2024. That context persists turn by turn with no retrieval work.
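As a minimal sketch of what that looks like in code, assuming the Anthropic Messages API as the provider and a hypothetical load_customer_history helper that pulls everything from your message store:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def load_customer_history(customer_id: str) -> str:
    """Hypothetical helper: every prior conversation for this customer, as plain text."""
    with open(f"history/{customer_id}.txt", encoding="utf-8") as f:
        return f.read()

def answer_with_full_memory(customer_id: str, new_message: str) -> str:
    history = load_customer_history(customer_id)
    response = client.messages.create(
        model="claude-sonnet-4-5",  # substitute any long-context model ID
        max_tokens=1024,
        system=(
            "You are a support agent. The customer's complete history follows; "
            "use it to keep tone, preferences, and past decisions consistent.\n\n"
            + history
        ),
        messages=[{"role": "user", "content": new_message}],
    )
    return response.content[0].text
```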
2. Instant onboarding for new agents
Load the entire operations manual, all policies, and the last 6 months of human-resolved cases as examples. A freshly instantiated "junior" AI agent has access to the equivalent of a human's 6-month onboarding, without additional training.
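A sketch of how that onboarding context might be assembled; the file paths are stand-ins for wherever your manual, policies, and resolved cases actually live:

```python
from pathlib import Path

def build_onboarding_context(case_dir: str = "resolved_cases") -> str:
    """Concatenate manual, policies, and recent human-resolved cases into one block."""
    manual = Path("docs/operations_manual.md").read_text(encoding="utf-8")
    policies = Path("docs/policies.md").read_text(encoding="utf-8")
    # Last 6 months of resolved cases, used as worked examples for the agent.
    cases = "\n\n---\n\n".join(
        p.read_text(encoding="utf-8") for p in sorted(Path(case_dir).glob("*.txt"))
    )
    return (
        "# Operations manual\n" + manual
        + "\n\n# Policies\n" + policies
        + "\n\n# Resolved cases (examples)\n" + cases
    )
```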
3. Cohort analysis of conversations
An ops team can ask the model to "read these 200 complaint conversations from last month and give me the top 5 patterns with verbatim quotes". That used to require a custom NLP pipeline; now it's a single long-context call.
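A sketch of that call, assuming the complaint transcripts sit in a folder of text files (the path and separator are illustrative):

```python
from pathlib import Path

conversations = [
    p.read_text(encoding="utf-8")
    for p in sorted(Path("complaints/last_month").glob("*.txt"))
]

prompt = (
    f"Below are {len(conversations)} complaint conversations from last month.\n"
    "List the top 5 recurring patterns. For each pattern, give 2-3 verbatim quotes "
    "and a rough count of how many conversations show it.\n\n"
    + "\n\n=== CONVERSATION ===\n\n".join(conversations)
)
# `prompt` then goes out as a single user message to a long-context model.
```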
4. Long-form research with sources
Processing 30 regulatory PDFs from a single country and extracting the clauses relevant to a specific case. Useful in insurance, banking, and legal, where you previously had to do RAG with re-ranking.
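A sketch of the document-loading half, using pypdf as one possible extractor (any PDF-to-text tool works; the folder layout is an assumption):

```python
from pathlib import Path
from pypdf import PdfReader

def pdfs_to_context(pdf_dir: str) -> str:
    """Extract text from every PDF in a folder and label each by source file."""
    sections = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(pdf_path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        sections.append(f"=== SOURCE: {pdf_path.name} ===\n{text}")
    return "\n\n".join(sections)

# The result is sent with an instruction like:
# "Extract every clause relevant to <case description>, citing the source file."
```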
5. Living contracts with history
A conversation that started in January and continues in August can be a single continuous session for the model. The customer who returns after months finds an AI that remembers exactly what was discussed.
What it does NOT solve
This is where most announcements skip the fine print:
1. Linear cost with context
Processing 1M tokens costs significantly more than processing 100K. If you load the full history on every turn for every customer, costs scale badly. Two mitigations: prompt caching (cache the invariant context so repeat calls are cheaper) and tiering (heavy context only on critical turns, not on every greeting).
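A sketch of the caching half, using the Anthropic prompt-caching block as one concrete example (other providers have their own equivalents; the file path is an assumption):

```python
import anthropic

client = anthropic.Anthropic()
# The invariant part of the context: docs and policies that don't change per turn.
STATIC_CONTEXT = open("docs/all_policies.md", encoding="utf-8").read()

def cached_call(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # substitute your long-context model ID
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STATIC_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # mark the invariant prefix as cacheable
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```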
2. Latency
More context means more latency. On WhatsApp the customer waits seconds, not minutes. The strategy that works best: medium context on the hot path (~50-100K tokens), large context in the background for analysis or hard decisions.
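A sketch of that split, with the model call stubbed out and the token budgets as assumptions: the interactive reply uses the medium budget, and the full-context pass runs as a background task only when the case warrants it.

```python
import asyncio

HOT_PATH_BUDGET = 100_000    # tokens available for the interactive reply
DEEP_BUDGET = 1_000_000      # tokens available for background analysis

async def call_model(prompt: str, token_budget: int) -> str:
    """Stub for a real API call; in practice latency grows with context size."""
    await asyncio.sleep(0.01)
    return f"(answer built from up to {token_budget:,} context tokens)"

async def handle_turn(message: str, is_hard_case: bool) -> str:
    reply = await call_model(message, HOT_PATH_BUDGET)
    if is_hard_case:
        # Fire-and-forget: the deep analysis never blocks the customer-facing reply.
        asyncio.create_task(call_model(message, DEEP_BUDGET))
    return reply

# print(asyncio.run(handle_turn("My claim was rejected twice", is_hard_case=True)))
```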
3. Lost in the middle
Even though the model "sees" 1M tokens, its attention isn't uniform. Information in the middle of the prompt tends to have lower recall than info at the start or end. For production, it still pays to structure context well and place critical info in favorable positions.
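One cheap mitigation is position-aware assembly: put the material the model must not miss at the start and end of the prompt, and keep only bulk reference text in the middle. A minimal sketch:

```python
def assemble_prompt(critical: str, bulk: list[str], question: str) -> str:
    """Place critical facts at the start and end; bulk material goes in the middle."""
    return "\n\n".join([
        "## Critical facts (read carefully)",
        critical,                      # start: high-recall position
        "## Supporting material",
        *bulk,                         # middle: lower-recall region, bulk only
        "## Critical facts (repeated)",
        critical,                      # end: high-recall position again
        "## Question",
        question,
    ])
```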
4. Privacy and retention
Loading a customer's full history on each call means that data travels to the API on every turn. For regulated sectors (health, finance), this can clash with data-minimization policies. One workable approach: pseudonymization plus explicit flags for which information is allowed to leave your infrastructure.
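A minimal pseudonymization sketch; the two regex patterns are illustrative only, not a complete PII detector, and the mapping stays on your side so replies can be re-personalized locally:

```python
import re

def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace obvious identifiers with placeholders before the text leaves your infra."""
    mapping: dict[str, str] = {}

    def replace(pattern: str, label: str, text: str) -> str:
        def _sub(match: re.Match) -> str:
            token = f"<{label}_{len(mapping) + 1}>"
            mapping[token] = match.group(0)
            return token
        return re.sub(pattern, _sub, text)

    text = replace(r"\b\d{3}-\d{3}-\d{4}\b", "PHONE", text)        # illustrative phone format
    text = replace(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "EMAIL", text)  # simple email pattern
    return text, mapping

safe_text, mapping = pseudonymize("Call me at 555-123-4567 or ana@example.com")
# safe_text == "Call me at <PHONE_1> or <EMAIL_2>"
```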
5. Source quality
If you load 200K tokens of "everything we know about the customer" but that data is mixed with inconsistent internal notes, the model may use the wrong piece. More context does not substitute for source curation.
The pattern that's working: stratified context
In well-designed LATAM implementations, the pattern that delivers the best results is stratifying context into layers:
- Layer 1 — Identity (always, ~2K tokens): customer profile, declared preferences, language, tone.
- Layer 2 — Current session (always, ~5K tokens): the last 50 turns of the active thread.
- Layer 3 — Recent history (on demand, ~30K tokens): last 5 conversations / interactions.
- Layer 4 — Deep history (only when complex case is detected, ~200K tokens): full history + documentation + policies.
The model only accesses Layer 4 when an upstream classifier flags the case as "needs deep context": major claims handling, tier-2 escalation, post-mortem analysis. Roughly 80% of interactions never need it.
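A sketch of how the layers come together; the loaders are placeholders for your own CRM, message store, and knowledge base, and the two flags would come from the upstream classifier:

```python
# Placeholder loaders: in a real system these read from CRM, message store, and KB.
def load_identity(cid: str) -> str:        return f"[profile and preferences for {cid}]"
def load_current_session(cid: str) -> str: return f"[last 50 turns for {cid}]"
def load_recent_history(cid: str) -> str:  return f"[last 5 conversations for {cid}]"
def load_deep_history(cid: str) -> str:    return f"[full history + docs + policies for {cid}]"

def build_context(cid: str, references_past: bool, flagged_complex: bool) -> str:
    layers = [
        load_identity(cid),         # Layer 1: always (~2K tokens)
        load_current_session(cid),  # Layer 2: always (~5K tokens)
    ]
    if references_past or flagged_complex:
        layers.append(load_recent_history(cid))   # Layer 3: on demand (~30K tokens)
    if flagged_complex:
        layers.append(load_deep_history(cid))     # Layer 4: complex cases only (~200K tokens)
    return "\n\n".join(layers)
```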
Implication for Keebai and similar platforms
For us, the practical change is that we no longer need huge RAG pipelines for cases that previously required them. That simplifies the stack, reduces failure points and opens new use cases (longitudinal support, agents with persistent memory, cohort analysis).
But discipline remains the same: curate the source well, stratify access, measure what the model is using, control costs. The 1M window is a lever, not a magic wand.
Next steps
If your operation already uses AI with RAG, it's worth auditing: there are likely 1-2 pipelines where RAG no longer adds value and can be simplified to a direct long-context call. That audit alone can lower latency, reduce infrastructure costs (Pinecone, Weaviate), and improve response quality.
