
Minimal architecture for integrating an LLM into a backend application

How to integrate an LLM into a real backend: layered architecture, provider abstraction, limits, errors, costs, and tests.

Roger Bosch May 18, 2026

The first time I integrated an LLM into a production backend, I made every possible mistake. Direct call to the OpenAI API from the controller, no timeout, no fallback, no cost control. It worked locally, passed manual tests, and of course, two weeks later we had a traffic spike that left us with a rather uncomfortable API bill and a degraded service because a 30-second timeout was blocking server threads.

Since then, I’ve refined my approach quite a bit. You don’t need a NASA-grade architecture, but you do need a minimal structure that prevents the most common problems. This is what I use today, and what I recommend to any team that’s starting to bring LLMs into their backend.

The real problem: it’s not calling the LLM, it’s everything else

Making a call to an LLM is trivial. A POST to an endpoint, a JSON response back. Any tutorial teaches you that in 10 minutes.

The real problem shows up when that call has to coexist with:

Concurrent users firing simultaneous requests.
An LLM provider that can take 3 seconds or 45 depending on the model and load.
Costs that scale per token, not per request.
Models that change versions without notice and break your response parsing.
The need to test without spending money on every mvn test.

If you don’t separate responsibilities from the start, you end up with a spaghetti monolith where the controller knows the provider details, the prompt format, and the retry logic. And in production, that costs you.

Layered architecture: the minimum that works

The structure I use has three well-defined layers plus one cross-cutting concern for limits and caching. It’s nothing revolutionary, just separation of responsibilities applied to LLMs:

┌─────────────────────────────┐
│        Controller            │  ← Receives HTTP request, validates input
├─────────────────────────────┤
│        Service               │  ← Business logic, prompt templates, parsing
├─────────────────────────────┤
│      LLM Provider            │  ← Provider abstraction (OpenAI, Anthropic, local)
├─────────────────────────────┤
│    Rate Limiter / Cache      │  ← Cost control, rate limits, caching
└─────────────────────────────┘

Each layer has a clear responsibility:

Controller: Receives the request, validates parameters, returns the formatted response. It knows nothing about prompts or LLMs.

Service: Builds the prompt, calls the provider, parses the response, applies business logic. This is where the feature’s brain lives.

LLM Provider: Abstraction over the concrete provider. It knows how to make a call to an LLM and return text. Nothing more.

Rate Limiter / Cache: Cross-cutting layer that controls how many calls you make and caches responses when it makes sense.
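
As a rough illustration of the caching half of that last layer, here is a minimal in-memory sketch in Python (picked for brevity; the idea carries over to Kotlin unchanged). The names `cache_key` and `ResponseCache` and the TTL approach are assumptions of this example, not something the architecture prescribes, and caching only pays off when identical requests are expected to produce interchangeable answers.

```python
import hashlib
import json
import time
from typing import Callable, Optional


def cache_key(model: str, system_prompt: str, user_message: str, temperature: float) -> str:
    """Hash every field that influences the answer, so different requests never collide."""
    payload = json.dumps([model, system_prompt, user_message, temperature], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache with a time-to-live (hypothetical helper, single process only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (stored_at, cached text)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock(), value)
```

The service would look up `cache_key(...)` before calling the provider and call `put` after a successful response. A shared store would be needed once more than one instance is running.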

Practical example: Kotlin with Spring Boot

Let’s implement this with a concrete case: an endpoint that receives a technical text and returns a structured summary. Something you might find in a documentation processing service.

The provider interface

First comes the abstraction. I don’t want my service to know whether I’m using OpenAI, Anthropic, or a local model:

interface LlmProvider {
    suspend fun complete(request: LlmRequest): LlmResponse
    fun getProviderName(): String
}

data class LlmRequest(
    val systemPrompt: String,
    val userMessage: String,
    val model: String,
    val maxTokens: Int = 1024,
    val temperature: Double = 0.3
)

data class LlmResponse(
    val content: String,
    val tokensUsed: TokenUsage,
    val model: String,
    val latencyMs: Long
)

data class TokenUsage(
    val promptTokens: Int,
    val completionTokens: Int
) {
    val totalTokens: Int get() = promptTokens + completionTokens
}

The key to this interface is that it’s generic. Any provider that can receive a prompt and return text fits here. That allows you to switch providers without touching the service.

Implementation for a specific provider

@Component
@ConditionalOnProperty("llm.provider", havingValue = "anthropic")
class AnthropicProvider(
    private val config: LlmConfig,
    private val httpClient: WebClient
) : LlmProvider {

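    override suspend fun complete(request: LlmRequest): LlmResponse {
        val startTime = System.currentTimeMillis()
        // Assumes LlmConfig exposes timeoutSeconds (bound from llm.timeout-seconds).
        val timeoutMs = config.timeoutSeconds * 1000L

        // Enforce the timeout (kotlinx.coroutines.withTimeout) and surface it as the
        // domain exception the service catches, assuming it takes the timeout in ms.
        val response = try {
            withTimeout(timeoutMs) {
                httpClient.post()
                    .uri("/v1/messages")
                    .header("x-api-key", config.apiKey)
                    .header("anthropic-version", "2023-06-01")
                    .bodyValue(buildRequestBody(request))
                    .retrieve()
                    .awaitBody<AnthropicApiResponse>()
            }
        } catch (e: TimeoutCancellationException) {
            throw LlmTimeoutException(timeoutMs)
        }

        val latency = System.currentTimeMillis() - startTime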

        return LlmResponse(
            content = response.content.first().text,
            tokensUsed = TokenUsage(
                promptTokens = response.usage.inputTokens,
                completionTokens = response.usage.outputTokens
            ),
            model = response.model,
            latencyMs = latency
        )
    }

    override fun getProviderName() = "anthropic"
}

Notice the @ConditionalOnProperty. With a single property in application.yml, I can switch providers without recompiling:

llm:
  provider: anthropic
  api-key: ${LLM_API_KEY}
  default-model: claude-sonnet-4-20250514
  timeout-seconds: 30
  max-retries: 2

The service: where the logic lives

@Service
class DocumentSummaryService(
    private val llmProvider: LlmProvider,
    private val rateLimiter: LlmRateLimiter,
    private val costTracker: CostTracker
) {
    private val logger = LoggerFactory.getLogger(javaClass)

    suspend fun summarize(document: String, language: String = "es"): SummaryResult {
        rateLimiter.checkLimit()

        val request = LlmRequest(
            systemPrompt = buildSystemPrompt(language),
            userMessage = buildUserPrompt(document),
            model = "claude-sonnet-4-20250514",
            maxTokens = 512,
            temperature = 0.2
        )

        return try {
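            val response = llmProvider.complete(request)
            // Feed the limiter too, otherwise its tokens-per-hour check never sees any usage.
            rateLimiter.recordUsage("global", response.tokensUsed.totalTokens)
            costTracker.record(response.tokensUsed, llmProvider.getProviderName())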

            logger.info(
                "Summary generated: provider={}, tokens={}, latency={}ms",
                llmProvider.getProviderName(),
                response.tokensUsed.totalTokens,
                response.latencyMs
            )

            parseSummaryResponse(response.content)
        } catch (e: LlmTimeoutException) {
            logger.warn("LLM timeout after {}ms, returning fallback", e.timeoutMs)
            SummaryResult.fallback(document)
        } catch (e: LlmRateLimitException) {
            logger.error("Rate limit exceeded: {}", e.message)
            throw ServiceUnavailableException("Service temporarily unavailable")
        }
    }

    private fun buildSystemPrompt(language: String): String = """
        You are a technical assistant specialized in software documentation.
        Always respond in $language.
        Return a JSON object with the structure: {"title": "...", "summary": "...", "keyPoints": ["..."]}
        Do not include any explanation outside the JSON.
    """.trimIndent()

    private fun buildUserPrompt(document: String): String = """
        Summarize the following technical document concisely:

        ---
        $document
        ---
    """.trimIndent()
}

The controller: clean and simple

@RestController
@RequestMapping("/api/v1/documents")
class DocumentController(
    private val summaryService: DocumentSummaryService
) {
    @PostMapping("/summarize")
    suspend fun summarize(
        @Valid @RequestBody request: SummarizeRequest
    ): ResponseEntity<SummaryResult> {
        val result = summaryService.summarize(
            document = request.content,
            language = request.language ?: "es"
        )
        return ResponseEntity.ok(result)
    }
}

The controller doesn’t know an LLM exists. It only knows there’s a service that summarizes documents. That’s what matters.

The same thing in Python with FastAPI

For teams working with Python, the structure is identical. Only the syntax changes:

import time
from abc import ABC, abstractmethod

from anthropic import AsyncAnthropic
from pydantic import BaseModel

class LlmRequest(BaseModel):
    system_prompt: str
    user_message: str
    model: str
    max_tokens: int = 1024
    temperature: float = 0.3

class TokenUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int

class LlmResponse(BaseModel):
    content: str
    tokens_used: TokenUsage
    model: str
    latency_ms: float

class LlmProvider(ABC):
    @abstractmethod
    async def complete(self, request: LlmRequest) -> LlmResponse:
        pass

class AnthropicProvider(LlmProvider):
    def __init__(self, api_key: str):
        self.client = AsyncAnthropic(api_key=api_key)

    async def complete(self, request: LlmRequest) -> LlmResponse:
        start = time.monotonic()
        response = await self.client.messages.create(
            model=request.model,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            system=request.system_prompt,
            messages=[{"role": "user", "content": request.user_message}]
        )
        # monotonic() is in seconds; convert to milliseconds
        latency = (time.monotonic() - start) * 1000
        return LlmResponse(
            content=response.content[0].text,
            tokens_used=TokenUsage(
                prompt_tokens=response.usage.input_tokens,
                completion_tokens=response.usage.output_tokens
            ),
            model=response.model,
            latency_ms=latency
        )

The core idea is the same: an interface, concrete implementations, dependency injection.
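
To round out the Python side, the service layer could look roughly like the sketch below. It runs on the standard library alone and simplifies on purpose: `Completer`, `SummaryService`, and `StubProvider` are names invented for this example, the provider is reduced to a coroutine that takes two strings and returns text, and there is no rate limiting or error handling.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import List, Protocol


class Completer(Protocol):
    """Anything that turns a system prompt plus a user message into text."""

    async def complete(self, system_prompt: str, user_message: str) -> str: ...


@dataclass
class Summary:
    title: str
    summary: str
    key_points: List[str]


class SummaryService:
    """Service layer: owns the prompt and the parsing, knows nothing about HTTP or vendors."""

    def __init__(self, provider: Completer):
        self._provider = provider  # injected, so tests can pass a stub

    async def summarize(self, document: str, language: str = "es") -> Summary:
        system_prompt = (
            "You are a technical assistant specialized in software documentation. "
            f"Always respond in {language}. "
            'Return JSON shaped like {"title": "...", "summary": "...", "keyPoints": ["..."]}.'
        )
        user_message = f"Summarize the following document:\n---\n{document}\n---"
        raw = await self._provider.complete(system_prompt, user_message)
        data = json.loads(raw)
        return Summary(data["title"], data["summary"], list(data["keyPoints"]))


class StubProvider:
    """Stand-in provider for local runs and tests; it never touches the network."""

    async def complete(self, system_prompt: str, user_message: str) -> str:
        return '{"title": "Stub", "summary": "Canned answer", "keyPoints": ["one"]}'


result = asyncio.run(SummaryService(StubProvider()).summarize("A long technical document..."))
```

Swapping `StubProvider` for a real provider changes nothing in `SummaryService`, which is the whole point of injecting the dependency.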

Rate limiting and cost control

This is the point everyone ignores until the bill arrives. LLMs charge per token, and a single user can generate hundreds of thousands of tokens in a session.

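@Component
class LlmRateLimiter(
    private val config: RateLimitConfig
) {
    // Fixed-window counter. Without a window start the "per minute" and "per hour"
    // counts would only ever grow and eventually block every caller for good.
    private data class Window(val startedAt: Long, val count: Long)

    private val requestWindows = ConcurrentHashMap<String, Window>()
    private val tokenWindows = ConcurrentHashMap<String, Window>()

    fun checkLimit(userId: String = "global") {
        val now = System.currentTimeMillis()

        val tokens = tokenWindows[userId]
        if (tokens != null && now - tokens.startedAt < HOUR_MS && tokens.count >= config.maxTokensPerHour) {
            throw LlmRateLimitException("Token limit exceeded")
        }

        // compute() is atomic per key, so the check and the increment cannot interleave.
        val requests = requestWindows.compute(userId) { _, w ->
            if (w == null || now - w.startedAt >= MINUTE_MS) Window(now, 1) else w.copy(count = w.count + 1)
        }!!
        if (requests.count > config.maxRequestsPerMinute) {
            throw LlmRateLimitException("Request limit exceeded")
        }
    }

    fun recordUsage(userId: String, tokensUsed: Int) {
        val now = System.currentTimeMillis()
        tokenWindows.compute(userId) { _, w ->
            if (w == null || now - w.startedAt >= HOUR_MS) Window(now, tokensUsed.toLong())
            else w.copy(count = w.count + tokensUsed)
        }
    }

    private companion object {
        const val MINUTE_MS = 60_000L
        const val HOUR_MS = 3_600_000L
    }
}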

The limits I typically configure:

Level	Requests/min	Tokens/hour	Tokens/day
Per user	10	50,000	200,000
Global	100	500,000	2,000,000
Alert	-	300,000	1,500,000

The alert threshold is just as important as the hard limit. If you hit 60% of your daily budget by 10 AM, something weird is going on and you want to know before it’s too late.
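
One way to place the alert threshold next to the hard limit, sketched in Python with the global daily figures from the table above as defaults. `BudgetStatus`, `check_daily_budget`, and `budget_fraction` are hypothetical names, and how the alert is delivered (log line, pager, chat message) is left out.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    ALERT = "alert"      # keep serving, but notify someone
    BLOCKED = "blocked"  # hard limit reached: reject new calls


def check_daily_budget(tokens_used_today: int,
                       alert_threshold: int = 1_500_000,
                       hard_limit: int = 2_000_000) -> BudgetStatus:
    """Classify today's global token usage against the alert and hard limits."""
    if tokens_used_today >= hard_limit:
        return BudgetStatus.BLOCKED
    if tokens_used_today >= alert_threshold:
        return BudgetStatus.ALERT
    return BudgetStatus.OK


def budget_fraction(tokens_used_today: int, hard_limit: int = 2_000_000) -> float:
    """Share of the daily budget already spent, for example 0.6 for 60%."""
    return tokens_used_today / hard_limit
```

Checking `budget_fraction` against the time of day is what catches the "60% by 10 AM" situation well before the hard limit does.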

It’s also a good idea to track estimated costs in real time:

@Component
class CostTracker(private val meterRegistry: MeterRegistry) {

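    // USD per 1,000 tokens, keyed by the value LlmProvider.getProviderName() returns,
    // because that is what record() receives ("openai" assumes a provider with that name).
    // Illustrative figures: check your provider's current price list.
    private val pricePer1kTokens = mapOf(
        "anthropic" to CostPer1kTokens(input = 0.003, output = 0.015),
        "openai" to CostPer1kTokens(input = 0.005, output = 0.015)
    )

    fun record(usage: TokenUsage, provider: String) {
        // Unknown providers are counted as 0.0 rather than failing the request.
        val cost = pricePer1kTokens[provider]?.let {
            (usage.promptTokens / 1000.0 * it.input) +
            (usage.completionTokens / 1000.0 * it.output)
        } ?: 0.0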

        meterRegistry.counter("llm.cost.usd", "provider", provider)
            .increment(cost)
        meterRegistry.counter("llm.tokens.total", "provider", provider)
            .increment(usage.totalTokens.toDouble())
    }
}

With those metrics in Prometheus/Grafana, you have full visibility into spending. Without this, you’re flying blind.
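
The arithmetic behind the tracker is simple enough to check by hand. A small Python sketch, assuming the same illustrative per-1,000-token prices as the Kotlin snippet (real prices change, so treat the numbers as placeholders):

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_per_1k: float, output_per_1k: float) -> float:
    """Input and output tokens are priced separately, per 1,000 tokens."""
    return (prompt_tokens / 1000 * input_per_1k) + (completion_tokens / 1000 * output_per_1k)


# 1,200 prompt tokens and 300 completion tokens at 0.003 / 0.015 USD per 1k:
#   1.2 * 0.003 + 0.3 * 0.015 = 0.0036 + 0.0045 = 0.0081 USD per call
per_call = estimate_cost_usd(1200, 300, input_per_1k=0.003, output_per_1k=0.015)

# At 10,000 such calls a day, that is roughly 81 USD per day.
per_day = per_call * 10_000
```

Output tokens dominate quickly at these ratios, which is one reason a tight `maxTokens` is worth setting per feature.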

Error handling: what fails, will fail

LLMs fail in creative ways. Long timeouts, truncated responses, provider rate limits, malformed JSON in the response, output format changes when they update the model. Your code needs to be prepared for all of that.

My approach: retry with backoff for transient errors, a fallback for graceful degradation, and a circuit breaker to prevent cascading failures.

// Needs the resilience4j-kotlin module for the coroutine-aware extension:
// import io.github.resilience4j.kotlin.circuitbreaker.executeSuspendFunction
@Component
class ResilientLlmProvider(
    private val primary: LlmProvider,
    @Qualifier("fallback") private val fallback: LlmProvider?
) : LlmProvider {

    private val logger = LoggerFactory.getLogger(javaClass)
    private val circuitBreaker = CircuitBreaker.ofDefaults("llm-provider")

    override suspend fun complete(request: LlmRequest): LlmResponse {
        return try {
            // Suspends instead of blocking: runBlocking here would pin a server thread per call.
            circuitBreaker.executeSuspendFunction {
                retryWithBackoff { primary.complete(request) }
            }
        } catch (e: CancellationException) {
            throw e  // never swallow coroutine cancellation
        } catch (e: Exception) {
            if (fallback != null) {
                logger.warn("Primary LLM failed, switching to fallback: ${e.message}")
                fallback.complete(request)
            } else {
                throw LlmUnavailableException("LLM unavailable", e)
            }
        }
    }

    override fun getProviderName() = primary.getProviderName()

    // One initial attempt plus up to maxRetries retries, doubling the wait each time.
    private suspend fun <T> retryWithBackoff(
        maxRetries: Int = 2,
        initialDelay: Long = 1000,
        block: suspend () -> T
    ): T {
        var delayMs = initialDelay
        repeat(maxRetries) {
            try {
                return block()
            } catch (e: LlmTimeoutException) {
                delay(delayMs)
            } catch (e: LlmRateLimitException) {
                delay(delayMs * 2)  // rate-limited: give the provider extra room
            }
            delayMs *= 2
        }
        return block()  // final attempt: any exception propagates to the caller
    }
}

Errors you need to handle explicitly:

Error	Common cause	Strategy
Timeout	Slow model, long prompt	Retry with backoff, reduce max_tokens
Rate limit (429)	Too many calls to the provider	Exponential backoff, queue
Malformed JSON	Model doesn’t follow instructions	Re-parse, stricter prompt
Truncated response	max_tokens too low	Increase limit, split the request
API down (5xx)	Provider issues	Circuit breaker, fallback to another provider
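
The Kotlin example leans on a library for the circuit breaker. To make the mechanism less of a black box, here is a deliberately small, synchronous Python sketch of the state machine (closed, open, half-open). `SimpleCircuitBreaker`, `CircuitOpenError`, and the threshold values are assumptions made for illustration. A production library typically tracks failure rates over sliding windows and does considerably more.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the provider."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock  # injectable so tests can control time
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open, skipping provider call")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail in microseconds instead of waiting for a timeout, which is what stops one slow provider from exhausting the server.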

Testing without calling the real LLM

This is the point where the provider abstraction proves its worth. If your service depends on an interface rather than a concrete implementation, testing is trivial:

class DocumentSummaryServiceTest {

    private val mockProvider = MockLlmProvider()
    private val rateLimiter = LlmRateLimiter(RateLimitConfig(100, 100000, 1000000))
    private val costTracker = CostTracker(SimpleMeterRegistry())

    private val service = DocumentSummaryService(mockProvider, rateLimiter, costTracker)

    @Test
    fun `should return structured summary for valid document`() = runTest {
        mockProvider.setResponse("""
            {"title": "Resumen", "summary": "Texto resumido", "keyPoints": ["punto 1"]}
        """.trimIndent())

        val result = service.summarize("Un documento técnico largo...")

        assertThat(result.title).isEqualTo("Resumen")
        assertThat(result.keyPoints).hasSize(1)
    }

    @Test
    fun `should return fallback on timeout`() = runTest {
        mockProvider.shouldTimeout = true

        val result = service.summarize("Un documento...")

        assertThat(result.isFallback).isTrue()
    }

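    @Test
    fun `should track token usage`() = runTest {
        // CostTracker only exposes totals as metrics, so use a registry we can inspect.
        val registry = SimpleMeterRegistry()
        val trackedService = DocumentSummaryService(mockProvider, rateLimiter, CostTracker(registry))
        mockProvider.setResponse("""{"title": "T", "summary": "S", "keyPoints": []}""")
        mockProvider.tokensToReturn = TokenUsage(100, 50)

        trackedService.summarize("Documento de prueba")

        // Verify usage was recorded under the mock provider's tag
        val totalTokens = registry.counter("llm.tokens.total", "provider", "mock").count()
        assertThat(totalTokens).isEqualTo(150.0)
    }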
}

class MockLlmProvider : LlmProvider {
    private var response: String = ""
    var shouldTimeout = false
    var tokensToReturn = TokenUsage(10, 10)

    fun setResponse(content: String) { response = content }

    override suspend fun complete(request: LlmRequest): LlmResponse {
        if (shouldTimeout) throw LlmTimeoutException(30000)
        return LlmResponse(response, tokensToReturn, "mock-model", 50)
    }

    override fun getProviderName() = "mock"
}

Tests shouldn’t depend on an external service that charges per use. If your test suite needs a real API key to pass, you have a design problem.

For integration tests that do need to validate the real format of responses, I use a separate profile with a limited budget and tests tagged as @Tag("integration") that only run in CI on a specific schedule, never on every push.
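
In a Python codebase, the same gating can be approximated with the standard library alone. This is only a sketch: the `RUN_LLM_INTEGRATION` variable and the test class are hypothetical, and the idea is that only the scheduled CI job exports the variable together with a low-budget API key.

```python
import os
import unittest

# Hypothetical switch: exported only by the scheduled CI job.
RUN_LLM_INTEGRATION = os.environ.get("RUN_LLM_INTEGRATION") == "1"


@unittest.skipUnless(RUN_LLM_INTEGRATION, "set RUN_LLM_INTEGRATION=1 to call the real provider")
class RealProviderContractTest(unittest.TestCase):
    def test_api_key_is_configured(self):
        # A real contract test would call the provider here and assert on the response shape.
        self.assertTrue(os.environ.get("LLM_API_KEY"), "LLM_API_KEY must be set for integration runs")
```

On a normal push the whole class is reported as skipped, so the suite stays free and fast.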

Common mistakes I’ve seen (and made)

1. Hardcoded prompt in the service. If the prompt is embedded as a string literal in the code, every change requires recompiling and redeploying. Better to externalize it in template files or configuration.

2. Not logging latency or tokens. Without that data, you can’t optimize anything. You don’t know if a new prompt is more efficient or more expensive. Always log: provider, model, input tokens, output tokens, latency, result (ok/error).

3. Coupling to a provider. I see this constantly. Code that directly imports com.openai.client in the business service. When you want to change models or try another provider, you have to touch the entire application.

4. No timeout set. The default timeout for many HTTP clients is 30 seconds or infinite. An LLM that hangs for a minute blocks a thread on your server. Set aggressive timeouts: 15-20 seconds for most cases.

5. Ignoring cost in development. I’ve seen development environments where every application reload triggered 10 LLM calls. Multiply that by 5 developers doing hot-reload all day, and spending skyrockets without generating any value.

6. Parsing the response without validation. The model doesn’t always return valid JSON, no matter how much you ask for it in the prompt. Always validate and handle the malformed response case.
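
For mistake 6, validation can be as plain as the following Python sketch. The field names follow the JSON structure requested in the system prompt earlier in the article, while `parse_summary` and `MalformedLlmResponse` are hypothetical names. What to do on failure (retry, fallback, error) is a separate decision.

```python
import json
from dataclasses import dataclass
from typing import List


class MalformedLlmResponse(ValueError):
    """The model's output was not the JSON document we asked for."""


@dataclass
class Summary:
    title: str
    summary: str
    key_points: List[str]


def parse_summary(raw: str) -> Summary:
    """Parse and validate the model output instead of trusting it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedLlmResponse(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise MalformedLlmResponse("expected a JSON object")

    title, summary, key_points = data.get("title"), data.get("summary"), data.get("keyPoints")
    if not isinstance(title, str) or not isinstance(summary, str):
        raise MalformedLlmResponse("'title' and 'summary' must be strings")
    if not isinstance(key_points, list) or not all(isinstance(p, str) for p in key_points):
        raise MalformedLlmResponse("'keyPoints' must be a list of strings")

    return Summary(title=title, summary=summary, key_points=key_points)
```

Raising one dedicated exception type keeps the service's error handling simple: it can catch `MalformedLlmResponse` and decide once whether to retry or degrade.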

When this architecture is enough (and when it’s not)

This structure covers the 80% case well: a backend that needs to call an LLM for a specific feature, with cost control and basic resilience.

It’s not enough when:

You need streaming responses token by token (requires SSE or WebSockets).
You have chains of LLM calls (RAG, agents) that need orchestration.
The call volume justifies a dedicated LLM gateway.
You need to route between models based on request complexity.

For those cases, the architecture grows, but always on this foundation. Provider abstraction, cost control, and error handling remain necessary. You just add layers on top.

What matters in the end

Integrating LLMs into a backend isn’t an AI problem. It’s a software engineering problem. The same practices you apply when integrating any external service (abstraction, resilience, observability, testing) work here.

The difference is that LLMs are expensive, slow, and unpredictable compared to most APIs. That’s why discipline needs to be greater, not lesser.

If you take one thing from this article: don’t start with the model or the prompt. Start with the architecture. A good prompt on a bad architecture is a future problem. A good architecture with a mediocre prompt gets fixed in an afternoon.

Tags: #llm #backend #architecture #spring boot #fastapi #prompt templates #rate limits #testing
