Why does an AI demo that wowed your team buckle the moment real users show up?

AI Architecture That Reaches Production

By Doriel Alie9 min
Header illustration for AI Architecture That Reaches Production

A working AI demo and a working AI system are two very different things. The demo runs on a developer's laptop with one user, generous timeouts, and a friendly model API on a quiet day. The system has to run for thousands of users on the day a model provider has an outage and traffic doubles unexpectedly.

The gap between the two is mostly architecture. Specifically, a handful of architectural decisions that don't matter at all in the demo and matter enormously the moment the system has to behave in the real world.

This piece is about the AI architecture choices that decide whether what you've built reaches production at all, scales when it does, and stays up when something inevitably goes wrong.

Key Takeaways

  • AI is a different kind of backend: slow, expensive, non-deterministic, and dependent on a provider you don't control. Most failed projects treated it like a normal API.
  • Async by default. A queue between the request and the worker is the only pattern that survives at scale.
  • The patterns that break most often are basic: rate limiting, retry with exponential backoff, idempotency, monitored dead letter queues, external state.
  • Cost and observability are architectural concerns from day one. Without instrumentation, the first sign of a problem is the bill or a customer complaint.
  • Abstract model calls behind an interface and load test at two to three times expected volume. Provider outages happen. Find the cracks in testing, not in production.

Why AI changes the architectural questions

Most software architecture patterns were designed around fast, predictable, mostly-reliable backend services. A typical web request hits your code, queries a database, returns in under a second, and almost always works.

AI requests are different. They take seconds, sometimes tens of seconds. They cost real money per call. They occasionally fail for opaque reasons. They have rate limits that kick in at the worst moments. They produce non-deterministic outputs that vary even with the same input. The model provider you depend on is almost never under your control, and they can throttle, deprecate, or briefly fall over without warning.

A system that handles all of that gracefully looks different from a system that handles a normal API. Most failed AI projects tried to use one as the other.

Sync vs async: the first decision

The first architectural choice is whether the user waits for the AI response in real time, or submits the request and gets a response later.

A synchronous call works for fast interactions, particularly chat where streaming responses keep the user feeling that something is happening. It breaks down for anything where the AI work takes more than a couple of seconds and the user might do something else in the meantime. It also creates a brittle dependency. If the model API slows down or fails, the user sees a hung request.

An asynchronous pattern submits the job to a queue, returns a job ID immediately, and notifies the user via webhook, polling, or a UI update when the work is done. It's more work to build. It's also the only pattern that survives at any kind of scale.

For anything that runs in the background, generates documents, processes batches, or takes more than a few seconds, async is the only sensible choice. Even in real-time chat, the underlying work should usually be queued, with streaming used to keep the user informed.

If you're building anything more serious than a prototype, the question isn't "should this be async" but "where exactly is the queue."

Queues, workers, and the things that actually break

Once you've gone async, you need a queue and workers that consume from it. Almost every production AI system looks like this somewhere.

The queue holds incoming jobs. Workers pick them up, call the AI, do something with the result. You scale by adding more workers when the queue fills up. You handle failures by sending dead jobs to a dead letter queue for inspection.

The pieces of this that actually break in production tend to be the same few:

The queue with no rate limiting. Your model provider has a limit of, say, ten thousand tokens per minute. The queue happily lets through all fifty thousand requests that arrived at 9am on Monday. The workers all hit the model API, get rate-limited, fail, retry, get rate-limited again. Throughput collapses entirely.

The worker with no retry logic. Transient errors are common with AI APIs. Network blips. Brief provider outages. A worker that fails on the first error, drops the job, and moves on, will quietly lose work that should have succeeded on a retry.

The retry that isn't idempotent. A request that creates a record, charges a card, or sends an email needs to know whether it has already done that thing before retrying. Without idempotency, the retry duplicates the side effect. Customers get charged twice. Emails get sent twice.

Linear retries with no backoff. They make rate limit problems worse, not better. Exponential backoff with jitter is the established pattern, and it exists because every other approach fails under load.

The dead letter queue with no monitoring. Jobs that failed permanently get sent there and forgotten. By the time someone notices, hundreds of jobs are sitting in DLQ and customer impact is already weeks deep.

None of these are advanced patterns. They are basic distributed systems hygiene. The reason they show up so often as production AI failures is that the projects skipped them, because the demo didn't need them.

State handling

AI systems often need to remember things across calls. A conversation has history. A multi-step workflow needs to track progress. A user has preferences and context.

The choice is between stateless workers with state stored externally, or stateful workers with state held in memory. Stateless is almost always the right answer.

External state stores like Redis or a managed database give you what you actually need. State that survives a worker dying. State that scales with the system. State that can be inspected when something goes wrong. Stateful workers look simpler in the demo and break the moment you scale beyond a single instance.

For conversations, this typically means storing message history in a database, then pulling the relevant context into the prompt at each turn. For workflows, it means each step writes its progress to a state store before the next step runs. For users, it means a profile or context document that gets retrieved per request.

The corollary is that prompt construction becomes a real piece of work in production AI. You're not just sending the user's message. You're building a prompt from stored context, retrieved knowledge, system instructions, and the current input. That assembly happens on every request and needs to be fast.

Cost and observability

Cost is an architectural concern in AI systems in a way it usually isn't in normal software.

Every call has a price. The price varies by model, by token count, by whether you're using a cached prompt or a fresh one. A bug that causes ten times the expected calls will produce a ten times bill, often before anyone notices.

Production AI systems need to track cost per request, per user, and per feature. They need budgets and alerts. They need the ability to route different parts of the workflow to different models, using a cheaper one for first-pass classification and an expensive one only where it earns its keep. They need caching where the same query is being asked repeatedly.

Observability sits next to this. Latency, error rates, token usage, and which parts of the workflow are expensive. Without observability, the first sign of a problem is the bill at month-end or a customer complaint.

This isn't a feature you bolt on later. The instrumentation needs to be in from the start. Otherwise the data you'd want to look at when something goes wrong does not exist.

Load test at the level you'll actually run

The pilot ran fine on twenty calls a day. The production version needs to run on two thousand. That gap is where most architectural problems become visible.

Load test at two to three times the volume you actually expect, not the volume you expect. Then look at where latency spikes, where errors cluster, whether retry logic copes or amplifies the problem, whether cost per call stays where you projected it, and what happens if the model provider is slow that day.

You will find things. They are cheaper to fix during testing than in front of customers.

Provider risk

A production AI system that depends entirely on one model provider is taking a real risk. Providers have outages. Providers deprecate models. Providers change pricing without much notice.

The most robust setups can fail over between providers. The same prompt runs against one model by default, falls back to a second when the first is down, falls back to a third when both are unavailable. This is more work to build and isn't right for every project. For anything customer-facing where downtime hurts, it's worth the investment.

At minimum, abstract your model calls behind an interface so swapping providers later is a small change rather than a rewrite. We've seen too many projects locked into one provider by accident, then caught out when that provider had a bad week.

What to insist on before launch

A short list of architectural commitments worth making before any AI system ships to production:

  • Async by default, with a queue between the request and the worker
  • Workers with retry logic and exponential backoff, calling idempotent operations
  • A dead letter queue with monitoring, not silence
  • External state storage, not in-memory state
  • Cost tracking per request and per feature, with alerts
  • Observability covering latency, errors, and token usage
  • Load testing at two to three times expected volume
  • Model calls abstracted behind an interface, so swapping providers stays cheap

None of these are exotic. They are standard production patterns applied to a slightly different kind of backend. The reason AI projects hit problems in production is rarely that they did something genuinely novel and got it wrong. It's that they shipped a demo without the underlying engineering most other production systems would have insisted on.

Build the architecture once, well, and the actual AI work on top of it gets to be the interesting part. Skip it, and the AI work is going to be the part you keep firefighting.

AI architectureproduction AIAI scaling
Doriel Alie, CEO, Operational AI Systems at Operational AI Systems

Doriel Alie

Doriel is the founder of Operational AI Systems, an AI consultancy and software development agency in Milton Keynes. More about Doriel.

System Status

Systems
Operational
Response time
< 2 hours
Availability
Accepting projects
Infrastructure
99.9% uptime
Doriel Alie

Ready to bring clarity to your systems?

SKETCHASKETCHA