
Building Scalable Microservices: Real Patterns from the Trenches

A hands-on look at microservices architecture - the patterns that actually work, the pitfalls that bit us, and why sometimes a monolith isn't that bad after all.

Engineering Team
Senior Solutions Architects
December 15, 2025
14 min read

The Real Problem

When we first looked at this financial services platform, it had that familiar smell. A monolith serving 50,000 daily users, deployments that required late-night maintenance windows, and a codebase where touching module A somehow broke module Z. The business wanted lower infrastructure costs, more development flexibility, faster time-to-market, and the ability to scale individual features independently. Classic stuff - but the devil's in the details.

Microservices: Hype vs Reality

Let's be honest - microservices solve specific problems, not all problems. Here's what actually drove our decision:
  • Deployment independence - ship the payment module without touching user auth
  • Blast radius containment - when (not if) things break, they break small
  • Polyglot persistence - use Postgres for transactions, Redis for sessions, MongoDB for documents
  • Team ownership - clear boundaries mean clear responsibility
  • Targeted scaling - scale the search service during peak, not the entire app

Architecture Deep Dive

We built on Domain-Driven Design, but not the academic version. Bounded contexts emerged from actual team conversations, not whiteboard exercises. Our principles:
  • Aggregate boundaries define service boundaries - if it's one transaction, it's one service
  • Events over sync calls - choreography beats orchestration for most cases (see the event sketch after this list)
  • API contracts as first-class citizens - break the contract, break the build
  • Shared nothing architecture - each service owns its data, period
  • Observability isn't optional - if you can't trace it, don't ship it
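
To make "events over sync calls" concrete: a domain event in this style is just an immutable, versioned fact. A minimal sketch in Java - the field names are illustrative, not our actual schema:

import java.time.Instant;
import java.util.UUID;

// An immutable fact, published after the state change commits (via the outbox,
// covered below). Consumers react to it; nobody calls the producer back.
public record PaymentInitiated(
    UUID eventId,        // unique per event, lets consumers deduplicate
    UUID paymentId,      // aggregate identity
    long amountCents,    // money as integer cents, never floating point
    String currency,
    Instant occurredAt,
    int schemaVersion    // only bumped for backwards-compatible additions
) {}

The event carries its own identity and version, which is what makes choreography debuggable: you can always answer "which fact triggered this?"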

The Stack (And Why)

Every tool choice was a tradeoff. Here's what we landed on:

Infrastructure:
├── Kubernetes (EKS) → Declarative deployments, self-healing
├── Istio service mesh → mTLS, traffic shaping, circuit breaking
├── Kong API Gateway → Rate limiting, auth, request transformation
│
Messaging:
├── Kafka → Event backbone, 7-day retention
├── Redis Streams → Lightweight pub/sub, ephemeral data
│
Data Layer:
├── PostgreSQL → ACID transactions, JSONB for flexibility
├── MongoDB → Document store for audit logs, activity feeds
├── Redis Cluster → Session store, distributed caching
├── Elasticsearch → Full-text search, log aggregation
│
Observability:
├── OpenTelemetry → Vendor-agnostic instrumentation
├── Prometheus + Thanos → Metrics with long-term storage
├── Grafana → Dashboards, alerting
├── Jaeger → Distributed tracing
│
CI/CD:
├── GitLab CI → Build, test, security scanning
├── ArgoCD → GitOps deployments
└── Sealed Secrets → K8s-native secret management

Patterns That Saved Us

Theory is nice. Here's what actually kept us out of trouble:

**Transactional Outbox** - Instead of dual-writes (database + message broker), we write events to an outbox table in the same transaction. A separate process publishes them. Atomic. Reliable. No distributed transaction nightmares.

**Event Sourcing (where it matters)** - For payment flows and audit-critical paths, we store events, not state. Every mutation is an immutable event. Debug production issues by replaying exact sequences. Compliance teams love it.

**CQRS with Projections** - Write models optimized for validation, read models optimized for queries. Eventual consistency is fine for read views. The reporting team gets their denormalized tables without polluting the write path.

**Saga Orchestration** - Long-running business processes (onboarding, payment settlement) as explicit state machines. Compensating transactions on failure. No orphaned partial states.

**Circuit Breaker + Bulkhead** - Hystrix is dead, but the patterns aren't. Resilience4j handles circuit breaking, rate limiting, and retry with backoff. Separate thread pools for external integrations.
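
To make the outbox concrete, here's a minimal sketch of the write path in Java/JDBC - the table and class names are illustrative, not our production schema. The business row and the event row commit in one transaction; a separate relay (Debezium, in our case) streams the outbox table to Kafka.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;
import javax.sql.DataSource;

// Minimal transactional-outbox sketch. Names (payments, outbox_events,
// PaymentAccepted) are illustrative; the point is the single transaction.
public class PaymentWriter {

    private final DataSource dataSource;

    public PaymentWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void acceptPayment(UUID paymentId, long amountCents, String payloadJson) throws Exception {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try {
                // 1. Business state change
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO payments (id, amount_cents, status) VALUES (?, ?, 'ACCEPTED')")) {
                    ps.setObject(1, paymentId);
                    ps.setLong(2, amountCents);
                    ps.executeUpdate();
                }
                // 2. Event row written in the SAME transaction - no dual write
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO outbox_events (id, aggregate_id, event_type, payload) "
                        + "VALUES (?, ?, ?, CAST(? AS jsonb))")) {
                    ps.setObject(1, UUID.randomUUID());
                    ps.setObject(2, paymentId);
                    ps.setString(3, "PaymentAccepted");
                    ps.setString(4, payloadJson);
                    ps.executeUpdate();
                }
                conn.commit(); // both rows or neither
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}

Because the relay only reads committed rows, a crash between commit and publish can delay an event, but it can never lose one or publish a half-written one.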

The Payment Flow: Real Architecture

This is how actual money moves through the system:

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ API Gateway │──────│  Payment    │──────│   Fraud     │
│   (Kong)    │ gRPC │  Service    │ Event│  Detection  │
└─────────────┘      └──────┬──────┘      └──────┬──────┘
                            │                    │
                     PaymentInitiated   FraudCheckCompleted
                            │                    │
                            ▼                    ▼
                     ┌─────────────┐      ┌─────────────┐
                     │   Outbox    │      │    Risk     │
                     │   Table     │      │   Scoring   │
                     └──────┬──────┘      └──────┬──────┘
                            │                    │
                      Debezium CDC          RiskAssessed
                            │                    │
                            ▼                    ▼
                     ┌───────────────────────────────────┐
                     │           Kafka Topics            │
                     │  payments.initiated               │
                     │  fraud.checked                    │
                     │  risk.assessed                    │
                     │  payments.completed               │
                     └───────────────────────────────────┘
                                       │
             ┌─────────────────────────┼─────────────────────────┐
             ▼                         ▼                         ▼
      ┌─────────────┐           ┌─────────────┐           ┌─────────────┐
      │   Ledger    │           │   Order     │           │Notification │
      │   Service   │           │   Service   │           │   Service   │
      └─────────────┘           └─────────────┘           └─────────────┘
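
Downstream of the topics, consumers are deliberately boring. As a rough illustration (the topic and group names come from the diagram above; everything else is hypothetical), the ledger side looks something like this - note the manual commit and the idempotent write, because Kafka gives you at-least-once delivery, not exactly-once processing:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Illustrative ledger-side consumer for the payments.completed topic.
public class LedgerConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ledger-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after posting

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments.completed"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Idempotency: the event id is the ledger entry's natural key,
                    // so a redelivered event becomes a no-op upsert.
                    postToLedger(record.key(), record.value());
                }
                consumer.commitSync(); // at-least-once: commit after successful processing
            }
        }
    }

    private static void postToLedger(String paymentId, String eventJson) {
        // hypothetical: write a double-entry ledger row keyed by the event id
    }
}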
                    

Testing Strategy That Actually Works

Forget the testing pyramid for a second. In distributed systems, you need:
  • Contract tests (Pact) - Services talk to stubs, not real dependencies. Contracts break in CI, not production (see the sketch after this list)
  • Consumer-driven contracts - Consumers define what they need, providers prove they deliver
  • Chaos testing (Chaos Monkey, Litmus) - Kill pods randomly. Inject latency. Prove resilience
  • Synthetic monitoring - Continuous production probes for critical user journeys
  • Load testing as validation - We validated the architecture could handle 80x the original traffic through rigorous load tests before launch
  • Canary deployments - 1% traffic to new versions, automatic rollback on error spike
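
For the contract tests, here's a rough sketch of a consumer-driven contract using pact-jvm's JUnit 5 support - the annotation and DSL details vary by Pact version, and the provider/consumer names are made up for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

// The consumer (here a hypothetical order-service) states exactly what it
// needs from the provider (payment-service); Pact records it as a contract.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "payment-service")
class PaymentClientContractTest {

    @Pact(consumer = "order-service")
    RequestResponsePact paymentStatus(PactDslWithProvider builder) {
        return builder
            .given("payment 42 exists")
            .uponReceiving("a request for payment status")
                .path("/payments/42")
                .method("GET")
            .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody()
                    .stringType("status", "COMPLETED")
                    .integerType("amountCents", 1999))
            .toPact();
    }

    @Test
    void consumerCanReadPaymentStatus(MockServer mockServer) throws Exception {
        // In a real test this call would go through the service's own HTTP client.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/payments/42")).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        Assertions.assertEquals(200, response.statusCode());
        Assertions.assertTrue(response.body().contains("COMPLETED"));
    }
}

The generated pact file then gets verified against the real provider in its own pipeline, which is what turns "break the contract" into "break the build".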

What Actually Happened

At the start, we focused on decomposing the monolith - identifying seams and strangling the old system service by service. The early months were rough: distributed debugging skills were lacking, traces were incomplete, and network partitions exposed consistency bugs. Then things clicked. Teams owned their services end-to-end. Deployments became non-events. The platform engineering team had built enough golden paths that spinning up a new service took hours, not weeks. Once the architecture matured, the results spoke for themselves:
  • Response times dropped from 850ms p99 to under 120ms p99
  • Zero-downtime deployments - maintenance windows became a memory
  • Infrastructure costs down 35% despite higher traffic
  • Deployment frequency: from monthly to 50+ daily deploys
  • Mean time to recovery: under 5 minutes for most incidents

The Hard Lessons

Not everything was smooth. Here's what hurt:

**Eventual consistency is a feature, not a bug** - But explain that to the PM wondering why the dashboard shows stale data. Design for it. Communicate it.

**Distributed tracing or death** - Without correlation IDs and proper trace context propagation, debugging is archaeology. OpenTelemetry auto-instrumentation is your friend.

**Schema evolution is hard** - Avro with schema registry. Backwards-compatible changes only. Breaking changes require a new topic.

**Kubernetes is an operating system** - Don't fight it. Learn it. Resource limits, liveness probes, pod disruption budgets - they exist for reasons.

**Platform team is non-negotiable** - Someone has to own the infrastructure abstractions. Otherwise, every team reinvents the wheel.
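
On the tracing point: the biggest single win is refusing to lose trace context at async boundaries. OpenTelemetry's auto-instrumentation covers the common clients; where it doesn't reach, you end up carrying W3C trace context through message headers yourself. A minimal sketch of doing that with the OpenTelemetry Java API and Kafka headers (assuming an already-configured SDK; everything else here is illustrative):

import java.nio.charset.StandardCharsets;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;

// Propagate W3C trace context (traceparent/tracestate) across a Kafka hop
// so the consumer's spans join the producer's trace.
public final class TracePropagation {

    private static final TextMapPropagator PROPAGATOR =
        GlobalOpenTelemetry.getPropagators().getTextMapPropagator();

    // Producer side: inject the current context into the record headers.
    public static void inject(ProducerRecord<String, String> record) {
        PROPAGATOR.inject(Context.current(), record.headers(),
            (headers, key, value) -> headers.add(key, value.getBytes(StandardCharsets.UTF_8)));
    }

    // Consumer side: rebuild the context from the record headers.
    public static Context extract(ConsumerRecord<String, String> record) {
        return PROPAGATOR.extract(Context.current(), record.headers(), new TextMapGetter<Headers>() {
            @Override
            public Iterable<String> keys(Headers headers) {
                java.util.List<String> keys = new java.util.ArrayList<>();
                headers.forEach(h -> keys.add(h.key()));
                return keys;
            }
            @Override
            public String get(Headers headers, String key) {
                Header header = headers.lastHeader(key);
                return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
            }
        });
    }
}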

When NOT to Microservices

Real talk - microservices are expensive. Consider alternatives if:
  • Your team is small - coordination overhead will kill velocity
  • Domain boundaries are unclear - you'll draw them wrong and suffer migration pain
  • You don't have platform engineering capacity - infrastructure complexity explodes
  • Latency requirements are extreme - network hops add up
  • Your monolith just needs better modularization - try a modular monolith first

Takeaways

At the end of this journey, we had a platform that deployed continuously, scaled on demand, and gave teams real ownership. The business got what they asked for: lower costs, faster delivery, and the flexibility to evolve. Was it worth it? For this scale and these requirements, absolutely. But we started with a modular monolith and only extracted services when the pain was real. Thinking about this kind of transformation? Start with the problem, not the solution.

Tags

#microservices #kubernetes #docker #scalability #devops #cloud-native #event-sourcing #cqrs #ddd

Conclusion

Microservices architecture isn't about following trends - it's about solving specific scaling and organizational challenges. The patterns we've covered (transactional outbox, event sourcing, CQRS, saga orchestration) aren't theoretical exercises; they're battle-tested solutions to real distributed systems problems. We validated the architecture could sustain 80x the original traffic through comprehensive load testing. Deployments went from monthly events to non-events happening dozens of times daily. That's the payoff when you get the fundamentals right. Have architecture challenges you're wrestling with? Let's talk patterns.

Engineering Team

Senior Solutions Architects

We've been building distributed systems since before 'microservices' was a thing. Our scars tell stories.