Minimal architecture for integrating an LLM into a backend application
How to integrate an LLM into a real backend: layered architecture, provider abstraction, limits, errors, costs, and tests.

The first time I integrated an LLM into a production backend, I made every possible mistake. Direct call to the OpenAI API from the controller, no explicit timeout, no fallback, no cost control. It worked locally, passed manual tests, and of course, two weeks later we had a traffic spike that left us with a rather uncomfortable API bill and a degraded service, because slow calls held server threads until the HTTP client’s default 30-second timeout expired.
Since then, I’ve refined my approach quite a bit. You don’t need a NASA-grade architecture, but you do need a minimal structure that prevents the most common problems. This is what I use today, and what I recommend to any team that’s starting to bring LLMs into their backend.
The real problem: it’s not calling the LLM, it’s everything else
Making a call to an LLM is trivial. A POST to an endpoint, a JSON response back. Any tutorial teaches you that in 10 minutes.
The real problem shows up when that call has to coexist with:
- Concurrent users firing simultaneous requests.
- An LLM provider that can take 3 seconds or 45 depending on the model and load.
- Costs that scale per token, not per request.
- Models that change versions without notice and break your response parsing.
- The need to test without spending money on every mvn test.
If you don’t separate responsibilities from the start, you end up with a spaghetti monolith where the controller knows the provider details, the prompt format, and the retry logic. And in production, that costs you.
Layered architecture: the minimum that works
The structure I use has three stacked layers plus a cross-cutting one for rate limiting and caching. It’s nothing revolutionary, just separation of responsibilities applied to LLMs:
┌─────────────────────────────┐
│         Controller          │ ← Receives HTTP request, validates input
├─────────────────────────────┤
│           Service           │ ← Business logic, prompt templates, parsing
├─────────────────────────────┤
│        LLM Provider         │ ← Provider abstraction (OpenAI, Anthropic, local)
├─────────────────────────────┤
│    Rate Limiter / Cache     │ ← Cost control, rate limits, caching
└─────────────────────────────┘
Each layer has a clear responsibility:
Controller: Receives the request, validates parameters, returns the formatted response. It knows nothing about prompts or LLMs.
Service: Builds the prompt, calls the provider, parses the response, applies business logic. This is where the feature’s brain lives.
LLM Provider: Abstraction over the concrete provider. It knows how to make a call to an LLM and return text. Nothing more.
Rate Limiter / Cache: Cross-cutting layer that controls how many calls you make and caches responses when it makes sense.
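The caching half of that cross-cutting layer can be sketched in a few lines. This is a minimal in-memory version in plain Python, not a production cache: the names ResponseCache and key_for, the 300-second TTL, and the choice to key on system prompt, user message, and model are all assumptions made for illustration.

```python
import hashlib
import json
import time
from typing import Dict, Optional, Tuple


class ResponseCache:
    """Tiny in-memory TTL cache keyed by a hash of the request fields."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def key_for(system_prompt: str, user_message: str, model: str) -> str:
        # Serialize deterministically so identical requests map to the same key.
        payload = json.dumps([system_prompt, user_message, model])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock(), value)
```

Caching only pays off when identical requests repeat and a slightly stale answer is acceptable, which is more likely at low temperature settings. Across several instances you would probably swap the dictionary for a shared store.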
Practical example: Kotlin with Spring Boot
Let’s implement this with a concrete case: an endpoint that receives a technical text and returns a structured summary. Something you might find in a documentation processing service.
The provider interface
First comes the abstraction. I don’t want my service to know whether I’m using OpenAI, Anthropic, or a local model:
interface LlmProvider {
    suspend fun complete(request: LlmRequest): LlmResponse
    fun getProviderName(): String
}

data class LlmRequest(
    val systemPrompt: String,
    val userMessage: String,
    val model: String,
    val maxTokens: Int = 1024,
    val temperature: Double = 0.3
)

data class LlmResponse(
    val content: String,
    val tokensUsed: TokenUsage,
    val model: String,
    val latencyMs: Long
)

data class TokenUsage(
    val promptTokens: Int,
    val completionTokens: Int
) {
    val totalTokens: Int get() = promptTokens + completionTokens
}

The key to this interface is that it’s generic. Any provider that can receive a prompt and return text fits here. That allows you to switch providers without touching the service.
Implementation for a specific provider
@Component
@ConditionalOnProperty("llm.provider", havingValue = "anthropic")
class AnthropicProvider(
    private val config: LlmConfig,
    private val httpClient: WebClient
) : LlmProvider {

    override suspend fun complete(request: LlmRequest): LlmResponse {
        val startTime = System.currentTimeMillis()
        val response = httpClient.post()
            .uri("/v1/messages")
            .header("x-api-key", config.apiKey)
            .header("anthropic-version", "2023-06-01")
            .bodyValue(buildRequestBody(request))
            .retrieve()
            .awaitBody<AnthropicApiResponse>()
        val latency = System.currentTimeMillis() - startTime

        return LlmResponse(
            content = response.content.first().text,
            tokensUsed = TokenUsage(
                promptTokens = response.usage.inputTokens,
                completionTokens = response.usage.outputTokens
            ),
            model = response.model,
            latencyMs = latency
        )
    }

    override fun getProviderName() = "anthropic"
}

Notice the @ConditionalOnProperty. With a single property in application.yml, I can switch providers without recompiling:
llm:
  provider: anthropic
  api-key: ${LLM_API_KEY}
  default-model: claude-sonnet-4-20250514
  timeout-seconds: 30
  max-retries: 2

The service: where the logic lives
@Service
class DocumentSummaryService(
    private val llmProvider: LlmProvider,
    private val rateLimiter: LlmRateLimiter,
    private val costTracker: CostTracker
) {
    private val logger = LoggerFactory.getLogger(javaClass)

    suspend fun summarize(document: String, language: String = "es"): SummaryResult {
        rateLimiter.checkLimit()
        val request = LlmRequest(
            systemPrompt = buildSystemPrompt(language),
            userMessage = buildUserPrompt(document),
            model = "claude-sonnet-4-20250514",
            maxTokens = 512,
            temperature = 0.2
        )
        return try {
            val response = llmProvider.complete(request)
            costTracker.record(response.tokensUsed, llmProvider.getProviderName())
            // Feed token usage back to the limiter; otherwise its token limit never triggers.
            rateLimiter.recordUsage("global", response.tokensUsed.totalTokens)
            logger.info(
                "Summary generated: provider={}, tokens={}, latency={}ms",
                llmProvider.getProviderName(),
                response.tokensUsed.totalTokens,
                response.latencyMs
            )
            parseSummaryResponse(response.content)
        } catch (e: LlmTimeoutException) {
            logger.warn("LLM timeout after {}ms, returning fallback", e.timeoutMs)
            SummaryResult.fallback(document)
        } catch (e: LlmRateLimitException) {
            logger.error("Rate limit exceeded: {}", e.message)
            throw ServiceUnavailableException("Service temporarily unavailable")
        }
    }

    private fun buildSystemPrompt(language: String): String = """
        You are a technical assistant specialized in software documentation.
        Always answer in $language.
        Return a JSON object with the structure: {"title": "...", "summary": "...", "keyPoints": ["..."]}
        Do not include explanations outside the JSON.
    """.trimIndent()

    private fun buildUserPrompt(document: String): String = """
        Summarize the following technical document concisely:
        ---
        $document
        ---
    """.trimIndent()
}

The controller: clean and simple
@RestController
@RequestMapping("/api/v1/documents")
class DocumentController(
    private val summaryService: DocumentSummaryService
) {
    @PostMapping("/summarize")
    suspend fun summarize(
        @Valid @RequestBody request: SummarizeRequest
    ): ResponseEntity<SummaryResult> {
        val result = summaryService.summarize(
            document = request.content,
            language = request.language ?: "es"
        )
        return ResponseEntity.ok(result)
    }
}

The controller doesn’t know an LLM exists. It only knows there’s a service that summarizes documents. That’s what matters.
The same thing in Python with FastAPI
For teams working with Python, the structure is identical. Only the syntax changes:
import time
from abc import ABC, abstractmethod

from anthropic import AsyncAnthropic
from pydantic import BaseModel


class LlmRequest(BaseModel):
    system_prompt: str
    user_message: str
    model: str
    max_tokens: int = 1024
    temperature: float = 0.3


class TokenUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int


class LlmResponse(BaseModel):
    content: str
    tokens_used: TokenUsage
    model: str
    latency_ms: float


class LlmProvider(ABC):
    @abstractmethod
    async def complete(self, request: LlmRequest) -> LlmResponse:
        ...


class AnthropicProvider(LlmProvider):
    def __init__(self, api_key: str):
        self.client = AsyncAnthropic(api_key=api_key)

    async def complete(self, request: LlmRequest) -> LlmResponse:
        start = time.monotonic()
        response = await self.client.messages.create(
            model=request.model,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            system=request.system_prompt,
            messages=[{"role": "user", "content": request.user_message}],
        )
        latency = (time.monotonic() - start) * 1000  # milliseconds
        return LlmResponse(
            content=response.content[0].text,
            tokens_used=TokenUsage(
                prompt_tokens=response.usage.input_tokens,
                completion_tokens=response.usage.output_tokens,
            ),
            model=response.model,
            latency_ms=latency,
        )

The core idea is the same: an interface, concrete implementations, dependency injection.
Rate limiting and cost control
This is the point everyone ignores until the bill arrives. LLMs charge per token, and a single user can generate hundreds of thousands of tokens in a session.
@Component
class LlmRateLimiter(
    private val config: RateLimitConfig
) {
    private val requestCounts = ConcurrentHashMap<String, AtomicInteger>()
    private val tokenCounts = ConcurrentHashMap<String, AtomicLong>()

    fun checkLimit(userId: String = "global") {
        val tokens = tokenCounts.getOrPut(userId) { AtomicLong(0) }
        if (tokens.get() >= config.maxTokensPerHour) {
            throw LlmRateLimitException("Token limit exceeded")
        }
        val requests = requestCounts.getOrPut(userId) { AtomicInteger(0) }
        // Increment before comparing so concurrent callers cannot both pass the check.
        if (requests.incrementAndGet() > config.maxRequestsPerMinute) {
            throw LlmRateLimitException("Request limit exceeded")
        }
    }

    fun recordUsage(userId: String, tokensUsed: Int) {
        tokenCounts.getOrPut(userId) { AtomicLong(0) }.addAndGet(tokensUsed.toLong())
    }

    // Fixed windows: clear the counters when each window ends.
    // Requires @EnableScheduling on a configuration class.
    @Scheduled(fixedRate = 60_000)
    fun resetRequestWindow() = requestCounts.clear()

    @Scheduled(fixedRate = 3_600_000)
    fun resetTokenWindow() = tokenCounts.clear()
}

The limits I typically configure:
| Level | Requests/min | Tokens/hour | Tokens/day |
|---|---|---|---|
| Per user | 10 | 50,000 | 200,000 |
| Global | 100 | 500,000 | 2,000,000 |
| Alert | - | 300,000 | 1,500,000 |
The alert threshold is just as important as the hard limit. If you hit 60% of your daily budget by 10 AM, something weird is going on and you want to know before it’s too late.
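As a rough sketch of that check, using the global daily row of the table: the function names, the three status labels, and the linear projection are assumptions for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyBudget:
    alert_tokens: int
    hard_limit_tokens: int


# Global daily row from the table above.
GLOBAL_DAILY = DailyBudget(alert_tokens=1_500_000, hard_limit_tokens=2_000_000)


def budget_status(tokens_used_today: int, budget: DailyBudget) -> str:
    """Classify today's usage as 'ok', 'alert', or 'blocked'."""
    if tokens_used_today >= budget.hard_limit_tokens:
        return "blocked"
    if tokens_used_today >= budget.alert_tokens:
        return "alert"
    return "ok"


def projected_daily_tokens(tokens_used_today: int, hours_elapsed: float) -> float:
    """Naive linear projection of end-of-day usage from the pace so far."""
    return tokens_used_today / hours_elapsed * 24
```

With these numbers, 60% of the daily budget is 1,200,000 tokens. Reaching that by 10 AM projects to 1,200,000 / 10 × 24 = 2,880,000 tokens for the day, well past the 2,000,000 hard limit, even though the static alert threshold has not fired yet. A pace-based check like this catches the problem hours earlier.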
It’s also a good idea to track estimated costs in real time:
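data class CostPer1kTokens(val input: Double, val output: Double)

@Component
class CostTracker(private val meterRegistry: MeterRegistry) {
    // Illustrative USD prices per 1,000 tokens; check your provider's current pricing.
    // Keyed by provider name, because that is what callers pass to record().
    private val costPer1kTokens = mapOf(
        "anthropic" to CostPer1kTokens(input = 0.003, output = 0.015), // claude-sonnet
        "openai" to CostPer1kTokens(input = 0.005, output = 0.015)     // gpt-4o
    )

    fun record(usage: TokenUsage, provider: String) {
        val cost = costPer1kTokens[provider]?.let {
            (usage.promptTokens / 1000.0 * it.input) +
                (usage.completionTokens / 1000.0 * it.output)
        } ?: 0.0
        meterRegistry.counter("llm.cost.usd", "provider", provider)
            .increment(cost)
        meterRegistry.counter("llm.tokens.total", "provider", provider)
            .increment(usage.totalTokens.toDouble())
    }

    // Total tokens recorded across all providers; handy for assertions in tests.
    fun getTotalTokens(): Int =
        meterRegistry.find("llm.tokens.total").counters().sumOf { it.count() }.toInt()
}

With those metrics in Prometheus/Grafana, you have full visibility into spending. Without this, you’re flying blind.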
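The arithmetic behind the cost counter is worth doing by hand once. The sketch below reuses the illustrative per-1,000-token prices from the CostTracker; the function name and the request sizes are made up, and real prices change, so treat the figures as placeholders.

```python
# Illustrative prices in USD per 1,000 tokens (not current list prices).
PRICES_PER_1K = {
    "claude-sonnet": {"input": 0.003, "output": 0.015},
    "gpt-4o": {"input": 0.005, "output": 0.015},
}


def estimate_cost_usd(model_family: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Input and output tokens are priced separately, so they are summed separately."""
    price = PRICES_PER_1K[model_family]
    return (prompt_tokens / 1000 * price["input"]
            + completion_tokens / 1000 * price["output"])


if __name__ == "__main__":
    # 1,200 prompt tokens and 400 completion tokens on the claude-sonnet row.
    print(f"{estimate_cost_usd('claude-sonnet', 1200, 400):.4f} USD per request")
```

For that request the estimate is 1.2 × 0.003 + 0.4 × 0.015 = 0.0036 + 0.0060 = 0.0096 USD. Note that the 400 output tokens cost more than the 1,200 input tokens, which is why capping max_tokens matters. At 10,000 such requests a day, that is roughly 96 USD.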
Error handling: what fails, will fail
LLMs fail in creative ways. Long timeouts, truncated responses, provider rate limits, malformed JSON in the response, output format changes when they update the model. Your code needs to be prepared for all of that.
My approach: retry with backoff for transient errors, fallback for degradation, and circuit breaker to prevent cascades.
// executeSuspendFunction comes from the resilience4j-kotlin module.
import io.github.resilience4j.kotlin.circuitbreaker.executeSuspendFunction

@Component
class ResilientLlmProvider(
    private val primary: LlmProvider,
    @Qualifier("fallback") private val fallback: LlmProvider?
) : LlmProvider {
    private val logger = LoggerFactory.getLogger(javaClass)
    private val circuitBreaker = CircuitBreaker.ofDefaults("llm-provider")

    override suspend fun complete(request: LlmRequest): LlmResponse {
        return try {
            // Suspends while waiting; wrapping the call in runBlocking would block a server thread.
            circuitBreaker.executeSuspendFunction {
                retryWithBackoff { primary.complete(request) }
            }
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            if (fallback != null) {
                logger.warn("Primary LLM failed, switching to fallback: ${e.message}")
                fallback.complete(request)
            } else {
                throw LlmUnavailableException("LLM unavailable", e)
            }
        }
    }

    override fun getProviderName() = primary.getProviderName()

    // maxRetries is the total number of attempts, including the first one.
    private suspend fun <T> retryWithBackoff(
        maxRetries: Int = 2,
        initialDelay: Long = 1000,
        block: suspend () -> T
    ): T {
        var lastException: Exception? = null
        repeat(maxRetries) { attempt ->
            try {
                return block()
            } catch (e: LlmTimeoutException) {
                lastException = e
                delay(initialDelay * (attempt + 1))
            } catch (e: LlmRateLimitException) {
                lastException = e
                // Rate limits get twice the wait of a timeout.
                delay(initialDelay * (attempt + 1) * 2)
            }
        }
        throw lastException ?: LlmUnavailableException("Max retries exceeded")
    }
}

Errors you need to handle explicitly:
| Error | Common cause | Strategy |
|---|---|---|
| Timeout | Slow model, long prompt | Retry with backoff, reduce max_tokens |
| Rate limit (429) | Too many calls to the provider | Exponential backoff, queue |
| Malformed JSON | Model doesn’t follow instructions | Re-parse, stricter prompt |
| Truncated response | max_tokens too low | Increase limit, split the request |
| API down (5xx) | Provider issues | Circuit breaker, fallback to another provider |
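The first two rows and the fallback strategy from the last one can be combined in a small asyncio sketch. It is deliberately not a circuit breaker, and the names are hypothetical: TransientLlmError stands for whatever exceptions your client raises on 429 or 5xx responses, and the defaults (3 attempts, 1-second base delay, 20-second timeout) are placeholders.

```python
import asyncio
import random


class TransientLlmError(Exception):
    """Hypothetical marker for errors worth retrying (429s, 5xx)."""


async def call_with_retry(call, *, attempts=3, base_delay=1.0, timeout=20.0,
                          sleep=asyncio.sleep):
    """Run `call()` with a per-attempt timeout and exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except (asyncio.TimeoutError, TransientLlmError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x... the base delay, plus a little jitter
                # so that many clients do not retry in lockstep.
                jitter = random.uniform(0, base_delay / 10)
                await sleep(base_delay * 2 ** attempt + jitter)
    raise last_error


async def complete_with_fallback(primary, fallback, **retry_options):
    """Try the primary call with retries; if it still fails, call the fallback once."""
    try:
        return await call_with_retry(primary, **retry_options)
    except (asyncio.TimeoutError, TransientLlmError):
        return await fallback()
```

Here primary and fallback are zero-argument functions returning coroutines, for example `lambda: provider.complete(request)`. Errors that are not transient, such as an invalid API key, propagate immediately instead of being retried.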
Testing without calling the real LLM
This is the point where the provider abstraction proves its worth. If your service depends on an interface rather than a concrete implementation, testing is trivial:
class DocumentSummaryServiceTest {
    private val mockProvider = MockLlmProvider()
    private val rateLimiter = LlmRateLimiter(RateLimitConfig(100, 100000, 1000000))
    private val costTracker = CostTracker(SimpleMeterRegistry())
    private val service = DocumentSummaryService(mockProvider, rateLimiter, costTracker)

    @Test
    fun `should return structured summary for valid document`() = runTest {
        mockProvider.setResponse("""
            {"title": "Summary", "summary": "Summarized text", "keyPoints": ["point 1"]}
        """.trimIndent())
        val result = service.summarize("A long technical document...")
        assertThat(result.title).isEqualTo("Summary")
        assertThat(result.keyPoints).hasSize(1)
    }

    @Test
    fun `should return fallback on timeout`() = runTest {
        mockProvider.shouldTimeout = true
        val result = service.summarize("A document...")
        assertThat(result.isFallback).isTrue()
    }

    @Test
    fun `should track token usage`() = runTest {
        mockProvider.setResponse("""{"title": "T", "summary": "S", "keyPoints": []}""")
        mockProvider.tokensToReturn = TokenUsage(100, 50)
        service.summarize("Test document")
        // Verify usage was recorded
        assertThat(costTracker.getTotalTokens()).isEqualTo(150)
    }
}

class MockLlmProvider : LlmProvider {
    private var response: String = ""
    var shouldTimeout = false
    var tokensToReturn = TokenUsage(10, 10)

    fun setResponse(content: String) { response = content }

    override suspend fun complete(request: LlmRequest): LlmResponse {
        if (shouldTimeout) throw LlmTimeoutException(30000)
        return LlmResponse(response, tokensToReturn, "mock-model", 50)
    }

    override fun getProviderName() = "mock"
}

Tests shouldn’t depend on an external service that charges per use. If your test suite needs a real API key to pass, you have a design problem.
For integration tests that do need to validate the real format of responses, I use a separate profile with a limited budget and tests tagged as @Tag("integration") that only run in CI on a specific schedule, never on every push.
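The same gating idea, sketched with Python's standard unittest module: the integration class is skipped unless an opt-in flag is set, so a normal test run never touches the paid API. The RUN_LLM_INTEGRATION and LLM_API_KEY variable names, and FakeProvider, are hypothetical.

```python
import os
import unittest


def integration_enabled(env=os.environ) -> bool:
    """Integration tests run only when explicitly opted in, e.g. by a scheduled CI job."""
    return env.get("RUN_LLM_INTEGRATION") == "1"


class FakeProvider:
    """Canned-response stand-in for the real provider."""

    def complete(self, prompt: str) -> str:
        return '{"title": "T", "summary": "S", "keyPoints": []}'


class SummaryUnitTest(unittest.TestCase):
    """Runs on every push: no network, no API key, no cost."""

    def test_fake_provider_returns_json(self):
        self.assertIn('"title"', FakeProvider().complete("doc"))


@unittest.skipUnless(integration_enabled(), "set RUN_LLM_INTEGRATION=1 to call the real API")
class SummaryIntegrationTest(unittest.TestCase):
    """Runs only on the scheduled job, under a limited budget."""

    def test_api_key_is_configured(self):
        # The call to the real provider would go here.
        self.assertTrue(os.environ.get("LLM_API_KEY"))


if __name__ == "__main__":
    unittest.main()
```

The scheduled job sets the flag and the API key; every other run reports the integration class as skipped rather than failed.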
Common mistakes I’ve seen (and made)
1. Hardcoded prompt in the service. If the prompt is embedded as a string literal in the code, every change requires recompiling and redeploying. Better to externalize it in template files or configuration.
2. Not logging latency or tokens. Without that data, you can’t optimize anything. You don’t know if a new prompt is more efficient or more expensive. Always log: provider, model, input tokens, output tokens, latency, result (ok/error).
3. Coupling to a provider. I see this constantly. Code that directly imports com.openai.client in the business service. When you want to change models or try another provider, you have to touch the entire application.
4. No timeout set. The default timeout for many HTTP clients is 30 seconds or infinite. An LLM that hangs for a minute blocks a thread on your server. Set aggressive timeouts: 15-20 seconds for most cases.
5. Ignoring cost in development. I’ve seen development environments where every application reload triggered 10 LLM calls. Multiply that by 5 developers doing hot-reload all day, and spending skyrockets without generating any value.
6. Parsing the response without validation. The model doesn’t always return valid JSON, no matter how much you ask for it in the prompt. Always validate and handle the malformed response case.
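For the last mistake, a minimal validation sketch: it returns None on any malformed output so the caller can retry or fall back. The Summary type and the parse_summary name are illustrative. The expected keys follow the JSON structure requested in the system prompt, and extracting the outermost braces is a heuristic for models that wrap the JSON in extra text.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Summary:
    title: str
    summary: str
    key_points: List[str]


def parse_summary(raw: str) -> Optional[Summary]:
    """Return a Summary, or None if the model output is not the expected JSON."""
    text = raw.strip()
    # Models sometimes add prose or a Markdown fence around the JSON;
    # keep only the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete object, e.g. a truncated response
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    title = data.get("title")
    summary = data.get("summary")
    points = data.get("keyPoints")
    # Check types as well as presence: valid JSON can still have the wrong shape.
    if not isinstance(title, str) or not isinstance(summary, str):
        return None
    if not isinstance(points, list) or not all(isinstance(p, str) for p in points):
        return None
    return Summary(title, summary, points)
```

A None result is the signal to re-ask with a stricter prompt or to return the fallback summary, and it is worth counting in your metrics, since a rising rate usually means the model or the prompt changed.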
When this architecture is enough (and when it’s not)
This structure covers the 80% case well: a backend that needs to call an LLM for a specific feature, with cost control and basic resilience.
It’s not enough when:
- You need streaming responses token by token (requires SSE or WebSockets).
- You have chains of LLM calls (RAG, agents) that need orchestration.
- The call volume justifies a dedicated LLM gateway.
- You need to route between models based on request complexity.
For those cases, the architecture grows, but always on this foundation. Provider abstraction, cost control, and error handling remain necessary. You just add layers on top.
What matters in the end
Integrating LLMs into a backend isn’t an AI problem. It’s a software engineering problem. The same practices you apply when integrating any external service (abstraction, resilience, observability, testing) work here.
The difference is that LLMs are expensive, slow, and unpredictable compared to most APIs. That’s why discipline needs to be greater, not lesser.
If you take one thing from this article: don’t start with the model or the prompt. Start with the architecture. A good prompt on a bad architecture is a future problem. A good architecture with a mediocre prompt gets fixed in an afternoon.


