How Our Spring Boot Microservices Failed at 2AM - And What We Fixed to Achieve True High Availability
Kunjan Thadani
Oct 9, 2025
How we discovered that proper architecture doesn't guarantee high availability - and the configuration changes that saved us
The Problem: When Well-Designed Systems Fail
It was 2AM on a peak-traffic weekend in early 2025. Our platform was humming along with over 2 lakh active users when the Slack alert lit up my phone: "Zero-day vulnerability detected - critical patch required on gateway-prod-01 immediately."
We couldn't wait for a maintenance window. Not with 3+ lakh API calls flowing through the system daily.
The truth we learned: High availability isn’t just about having redundant services, it’s about ensuring they fail over correctly and quickly under real load.
Here's the thing - our setup looked good on paper. We had about 15-20 microservices spread across four tiers, all running Spring Boot 3.x with Spring Cloud's latest release train. Our architecture ticked all the right boxes: layered design, redundant discovery services, and proper separation of concerns.
• gateway-prod-01 & gateway-prod-02: API Gateway tier handling external traffic
• infra-prod-01 & infra-prod-02: Eureka discovery services and config servers
• core-prod-01 & core-prod-02: Critical business services
• business-prod-01 & business-prod-02: Domain services, including our ML-powered recommendation engine written in Python
We had even mixed in some Python services alongside our Java microservices because, well, sometimes the right tool for the job isn't in the JVM ecosystem.

Figure 1: Our multi-tier microservices architecture with Eureka discovery services, API gateways, and business services. Red dashed lines show service discovery connections.
The confidence trap: It worked flawlessly in staging. We had proper high availability design. We believed that bringing down one discovery service instance would be fine since the other one was still running.
So we patched gateway-prod-01. And then everything fell apart.
Services couldn't discover each other. Our recommendation engine took over 3 minutes to fail over. Gateway-prod-02 started buckling under the load it was supposed to handle alone. User requests timed out across multiple services.
This was our wake-up call: redundancy without coordination is just a false sense of security.
The harsh reality hit us hard: even well-architected, scalable microservices can fail catastrophically if Spring Boot's configuration defaults aren't tuned for high availability. Whether you're running on cloud infrastructure or your own servers, proper configuration is critical. We learned this the hard way at 2AM with thousands of users online.
TL;DR: How We Achieved True High Availability
The Quick Wins:
• Clustered Discovery Servers – Eureka doesn't magically cluster. Configure bidirectional replication explicitly.
• Dual Discovery Connections – Every service connects to both discovery servers. No single points of failure.
• Tuned Heartbeats – Changed defaults from 30s/90s to 8s/25s. Recovery time: 90+ seconds → 25 seconds.
• Fixed Zombie Services – Python services now deregister properly via signal handling (Spring Boot does this automatically).
• Tested Under Chaos – Staging success ≠ production resilience. Test failure scenarios under real load.
• Monitored Failover Time – Track recovery speed, not just uptime percentages.
Result: 94% faster recovery, zero unplanned outages, safe deployments during business hours.
The Five Spring Boot Configuration Pitfalls That Kill High Availability
After a few stressful hours digging through logs and testing theories, we found the configuration gaps that undermined our carefully designed architecture:
Pitfall #1: The "Clustering" Illusion
Here's what our discovery server config looked like:
# Our original configuration (infra-prod-01 & infra-prod-02, the Eureka servers):
eureka.client.register-with-eureka=false   # don't register this node anywhere
eureka.client.fetch-registry=false         # don't fetch any peer's registry
eureka.server.enable-self-preservation=true
# With both client flags set to false, each node runs as a standalone Eureka server - no peer replication at all.
We had Eureka running on both servers, so we figured they were clustering. They weren't. Turns out this is a surprisingly common mistake - teams assume that deploying discovery services on multiple servers automatically creates a cluster.
Pitfall #2: The Single Point of Discovery
Here's another mistake we made: most of our services connected to only their local discovery server.
Services on gateway-prod-01? They only talked to infra-prod-01's Eureka. Services on business-prod-01? Same thing - only infra-prod-01.
This passed every staging test we threw at it. But in production, when infra-prod-01 went down, half our services suddenly couldn't discover anything. Single point of failure, hidden in plain sight.
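To make the anti-pattern concrete, here's roughly what those services were running - a sketch rather than the exact files, but the property is the real one and the single URL is the whole problem:
# Anti-pattern (sketch): each service pinned to a single discovery server
eureka.client.serviceUrl.defaultZone=http://infra-prod-01:8761/eureka/
# If infra-prod-01 goes down, this service loses discovery entirely -
# even though infra-prod-02 is still perfectly healthy.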
Pitfall #3: Spring Boot's Conservative Configuration Defaults
We were running with Spring Boot's out-of-the-box configuration: 30-second heartbeats, 90-second lease expiration. These defaults haven't changed much since the early Spring Cloud days, and there's a good reason - they prioritize stability over speed.
For most systems, especially smaller deployments where a 90-second recovery window is acceptable, these defaults work fine. But when you're serving lakhs of users and every second of downtime impacts revenue and user experience? Those defaults become a liability.

Figure 2: Service failure detection timeline - Default configuration (90-150 seconds) vs Optimized configuration (25-35 seconds).
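For reference, here's what those out-of-the-box values look like when written out explicitly - we had never overridden any of them, so this is effectively what we were running (standard Eureka defaults):
# Eureka's stock timing, made explicit:
eureka.client.registry-fetch-interval-seconds=30          # clients refresh their local registry copy every 30s
eureka.instance.lease-renewal-interval-in-seconds=30      # each instance heartbeats every 30s
eureka.instance.lease-expiration-duration-in-seconds=90   # an instance is evicted only after 90s without heartbeats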
Pitfall #4: The Zombie Service Problem
Here's where things got really interesting. Our Python-based recommendation engine had a nasty habit: it would crash or restart, but Eureka still showed it as healthy for 90+ seconds. Traffic kept routing to a dead service. Users got errors. We had no idea why.
The culprit? Missing graceful shutdown handling.
When we dug into it, we discovered a fundamental difference:
• Spring Boot services: Automatically deregister from Eureka when they shut down (built-in SIGTERM handling)
• Our Python service: Stayed registered in Eureka even after the process died
Every restart or crash left a "zombie" registration. For 90+ seconds, Eureka told other services "hey, the recommendation engine is healthy!" while routing traffic to nothing.
This meant our "highly available" system was actually causing user-facing failures during every Python service deployment or crash. The service appeared healthy in the registry when it was actually dead.
Pitfall #5: "It Works in Staging" Syndrome
Our staging environment had lower traffic and different timing characteristics. Every test passed beautifully.
The problem? Staging had:
• 100 concurrent users vs 200K in production
• A single Eureka node (we thought we were being efficient) vs two nodes in production - which, as Pitfall #1 showed, weren't actually clustered either
• Synthetic, predictable traffic patterns vs real user chaos
That 3-minute failover during the production incident? Never happened in staging where services restarted cleanly, one at a time, with plenty of breathing room between requests.
Production exposes timing issues and race conditions that staging's predictable patterns hide. Load testing isn't enough - you need chaos.
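One way we now quantify this: hammer a user-facing endpoint while deliberately killing an instance, and record how long requests actually fail. Here's a minimal sketch in Python - the endpoint URL is a placeholder for your own setup, and you trigger the failure manually while it runs:
import time
import requests

# Placeholder: a user-facing endpoint routed through your gateway
ENDPOINT = "https://gateway.example.internal/api/recommendations/health"

def longest_outage(duration_secs=300, interval_secs=1):
    """Poll the endpoint and return the longest stretch of consecutive failures, in seconds."""
    outage_start, worst = None, 0.0
    deadline = time.time() + duration_secs
    while time.time() < deadline:
        try:
            ok = requests.get(ENDPOINT, timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        now = time.time()
        if not ok and outage_start is None:
            outage_start = now                     # outage begins
        elif ok and outage_start is not None:
            worst = max(worst, now - outage_start) # outage ends - record its length
            outage_start = None
        time.sleep(interval_secs)
    return worst

if __name__ == "__main__":
    # Kill a service instance (or a discovery server) while this runs
    print(f"Longest outage window: {longest_outage():.1f}s")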
The Transformation: Building Bulletproof HA
Redundancy became resilience only after we fixed what connected it all: configuration.
First, we made our discovery servers actually cluster together. The breakthrough was simple but critical - making both Eureka servers replicate to each other:
# NEW configuration for infra-prod-01 (Eureka server #1):
eureka.client.register-with-eureka=true
eureka.client.fetch-registry=true
eureka.client.serviceUrl.defaultZone=http://infra-prod-02:8761/eureka/
# NEW configuration for infra-prod-02 (Eureka server #2):
eureka.client.register-with-eureka=true
eureka.client.fetch-registry=true
eureka.client.serviceUrl.defaultZone=http://infra-prod-01:8761/eureka/
Result: Both servers now replicate their service registries to each other.
Next, we connected every service to both discovery servers. No more single points of failure:
# Universal HA configuration for ALL services:
eureka.client.serviceUrl.defaultZone=http://infra-prod-01:8761/eureka/,http://infra-prod-02:8761/eureka/
eureka.instance.prefer-ip-address=true
eureka.instance.instance-id=${spring.application.name}:${spring.cloud.client.ip-address}:${server.port}
The beauty: Same configuration template works across all service tiers. No more environment-specific configs.
Then we tuned the timing for production reality. We optimized for HA over network efficiency - critical for systems at scale:
Property | Default | Optimized | Impact |
---|---|---|---|
eureka.client.registry-fetch-interval-seconds | 30 | 10 | Services discover changes in 10s vs 30s |
eureka.instance.lease-renewal-interval-in-seconds | 30 | 8 | Faster heartbeat detection |
eureka.instance.lease-expiration-duration-in-seconds | 90 | 25 | Failed services removed in 25s vs 90s |
eureka.server.enable-self-preservation | true | false | No stale service entries |
Critical insight: For high-availability systems serving thousands of concurrent users, slightly higher network overhead is worth sub-30-second failure detection.
About these specific values: These settings were optimized for our ~15-20 service deployment serving 2+ lakh concurrent users. If you're running fewer services (5-10) or have lower traffic, you might use slightly more conservative values (15-20 second intervals). For larger deployments (50+ services), you may need additional discovery server instances.
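In property form, the tuned values from the table map to configuration roughly like this - the client and instance settings go into every service, while self-preservation is set on the Eureka servers themselves:
# Tuned timing (sketch) - client/instance side, applied to every service:
eureka.client.registry-fetch-interval-seconds=10
eureka.instance.lease-renewal-interval-in-seconds=8
eureka.instance.lease-expiration-duration-in-seconds=25
# Server side, on the discovery servers only:
eureka.server.enable-self-preservation=false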
Finally, we fixed the Python zombie registration problem. Our recommendation engine needed explicit graceful shutdown handling:
import py_eureka_client.eureka_client as eureka_client
import signal
import sys

# Connect to BOTH discovery servers with proper failover
eureka_servers = "http://infra-prod-01:8761/eureka,http://infra-prod-02:8761/eureka"
eureka_client.init(
    eureka_server=eureka_servers,
    app_name="recommendation-engine",
    renewal_interval_in_secs=8,
    duration_in_secs=25
)

# Critical: Proper graceful shutdown to prevent zombie registrations
def graceful_shutdown(signum, frame):
    eureka_client.stop()  # Clean deregistration - prevents traffic to a dead service
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)
Impact: This single change eliminated ghost service registrations. Python services now cleanly deregister on shutdown in under 1 second, preventing traffic from being routed to dead instances.
Understanding the trade-off:
• Server-side lease expiration (25s): Safety net for crashes where graceful shutdown isn't possible
• Client-side graceful shutdown (<1s): Eliminates downtime during planned restarts/deployments
Even with our optimized 25-second lease expiration, relying only on that would mean 25 seconds of failed requests during every Python deployment. Proper graceful shutdown reduces this to near-zero.
Key lesson: When mixing technologies with Spring Cloud, test not just startup and discovery, but also shutdown behaviour. Missing graceful deregistration is a hidden HA killer that only appears during restarts and crashes. Spring Boot has this built-in; other technologies need explicit implementation.
Before You Apply These Settings
Critical considerations:
1. Scale matters: These settings work for 15-20 services with 200K+ users. Smaller systems (<10 services) should use more conservative values (15-20s intervals). Larger deployments (50+ services) need additional discovery instances.
2. Test thoroughly: Validate in staging with production-like chaos, not just load. Kill services during peak traffic and measure actual recovery time. If you haven't tested failure scenarios, you don't know if your HA works.
3. Understand the trade-offs: Spring Boot's 30s defaults work well for most systems. We traded network efficiency for recovery speed because every second mattered at our scale. Make sure the trade-off makes sense for your situation.
The Results
Over the following months, we gradually refined our HA approach and faced our ultimate test: a critical infrastructure migration during peak business hours.
The scenario: Major deployment needed while serving 2+ lakh active users.
The execution:
• T+0: Deploy changes to infra-prod-01 during peak traffic
• T+15s: All services detect the change and seamlessly fail over to infra-prod-02
• T+25s: Zero user impact, deployment continues across service tiers
• T+5min: Deployment complete, services automatically rebalance
User impact: Minimal failed requests - our first truly resilient deployment under production scale.
Measurable High Availability Improvements
Metric | Before | After | Business Impact |
---|---|---|---|
Failure Detection Time | 90-180s | 20-30s | Users experience issues for seconds instead of minutes |
Service Discovery Issues | 5-15 min downtime per incident | < 30s impact | Maintains 99.9% SLA commitments |
Deployment Risk Level | High - Manual intervention required | Low - Automatic failover | Safe deployments during business hours |
Production Incidents | 3-5 per deployment | 0-1 per month | 85% reduction in on-call alerts |
Mean Time to Recovery (MTTR) | 8-12 minutes | 30-45 seconds | 94% improvement in recovery speed |
User-Facing Errors During Deployment | 50-100 failed requests | < 5 failed requests | 95% reduction in customer impact |
Beyond the numbers, the transformation was cultural:
• Zero unplanned outages during deployments (previously multiple per quarter)
• Deployment confidence: Shifted from weekend-only to anytime deployments
• Customer experience: Significantly improved satisfaction during deployment windows
• Engineering velocity: Daily deployments instead of weekly batches, with faster feature delivery
• Team morale: Reduced on-call stress and elimination of the "fear of deployments"
• Revenue protection: Eliminated potential revenue loss from extended outages
Putting It Into Practice: Implementation Patterns
Now that you understand the what and why, let's talk about the how. Here's how to organize these configurations professionally across environments without creating a maintenance nightmare.
Environment-Specific Tuning
• Development: 5-second intervals for fast feedback
• Staging: 8-second intervals for realistic testing
• Production: 10-second intervals for stability with speed
Configuration Management
Single configuration with profiles eliminates environment drift:
# Universal base configuration
eureka:
  client:
    serviceUrl:
      defaultZone: http://infra-prod-01:8761/eureka/,http://infra-prod-02:8761/eureka/
---
# Development profile
spring.config.activate.on-profile: development
eureka.client.registry-fetch-interval-seconds: 5
---
# Production profile
spring.config.activate.on-profile: production
eureka.client.registry-fetch-interval-seconds: 10
Final application-production.yml for True High Availability
# application-production.yml
eureka:
  client:
    service-url:
      defaultZone: http://infra-prod-01:8761/eureka/,http://infra-prod-02:8761/eureka/
    registry-fetch-interval-seconds: 10
    fetch-registry: true
    register-with-eureka: true
  instance:
    prefer-ip-address: true
    lease-renewal-interval-in-seconds: 8
    lease-expiration-duration-in-seconds: 25
    instance-id: ${spring.application.name}:${spring.cloud.client.ip-address}:${server.port}
Monitoring & Alerting
• Service Discovery Health: Track registration/deregistration events
• Failover Time: Alert if discovery takes >30 seconds
• Registry Sync: Monitor peer replication between discovery servers
Key Takeaways: Building True High Availability with Spring Boot
High availability isn’t just about redundant services. We had those. What we lacked was the ability for them to fail over quickly and correctly under pressure.
After months of refining our approach and running it in production, here's what we'd tell our past selves - and what might help you avoid the same pain:
1. Architecture Alone Isn't Enough
Perfect microservices design means nothing without proper Spring Boot configuration. We had textbook architecture, proper tier separation, redundant services - and it still failed. Both layers require expertise for true HA.
2. Redundancy ≠ High Availability
Having duplicate components means nothing if they can't seamlessly take over for each other. Our two Eureka servers weren't actually clustering. Our services connected to only one discovery server each. We had the illusion of redundancy without the reality of failover.
3. Test Your Failure Scenarios Under Load
Staging tests passed beautifully. Production at 2AM with 200K users? Complete failure. Until you've killed services during peak traffic and measured actual recovery time, you don't know if your HA works. We now regularly run chaos engineering tests during business hours.
4. Default Configurations Don't Scale
Spring Boot's 30-second heartbeats and 90-second lease expiration optimize for stability and broad compatibility. For systems serving thousands of users where every second impacts revenue, these defaults become liabilities. Tune based on your actual scale and tolerance for detection delays.
5. Every Service, Every Discovery Server
In distributed systems, there's no such thing as "partial" high availability. Every service must connect to every discovery server. Half-measures create hidden single points of failure that only reveal themselves in production.
6. Non-JVM Services Need Explicit HA Work
Spring Boot's Eureka client automatically handles graceful shutdown. Our Python service didn't - it required explicit signal handling and deregistration code. When building polyglot architectures, test shutdown behaviour, not just startup and discovery. The zombie registration problem cost us weeks of debugging.
7. Configuration is Contextual
Our settings work for 15-20 services serving 200K+ users. Your mileage will vary. Smaller systems need more conservative values. Larger deployments need additional discovery instances. Understand your scale, test your specific failure modes, and tune accordingly.
8. Monitor What Actually Matters
We track service discovery latency and failover time, not just uptime percentages. Speed of recovery matters as much as recovery itself. Alert on what impacts users, not just what's technically "down."
9. Use Smart Defaults That Reduce Configuration Drift
Setting prefer-ip-address=true eliminated dozens of environment-specific configuration files. Spring profiles let us manage dev/staging/prod settings in one place. Invest in configuration patterns that scale with your team.
Want to test your own HA setup? Start by:
• Killing one discovery server during peak load
• Measuring time to failover
• Verifying no zombie services remain registered
And remember: if you haven’t tested failover under stress, your “HA” is just theory.
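To make that checklist concrete, here's a small sketch that polls a Eureka server's registry over its standard REST API (/eureka/apps) and reports which instances of an application are still advertised - useful both for timing failover and for spotting zombie registrations after you kill a process. The hostnames and app name below are examples from this article; substitute your own:
import time
import requests

# Example values - substitute your own discovery server and application name
EUREKA_URL = "http://infra-prod-02:8761/eureka/apps"
APP_NAME = "RECOMMENDATION-ENGINE"   # Eureka upper-cases application names

def registered_instances(app_name):
    """Return (instanceId, status) pairs currently registered for app_name."""
    resp = requests.get(EUREKA_URL, headers={"Accept": "application/json"}, timeout=5)
    resp.raise_for_status()
    apps = resp.json()["applications"].get("application", [])
    if isinstance(apps, dict):            # a single application may come back as a dict
        apps = [apps]
    for app in apps:
        if app["name"] == app_name:
            instances = app["instance"]
            if isinstance(instances, dict):
                instances = [instances]
            return [(i["instanceId"], i["status"]) for i in instances]
    return []

if __name__ == "__main__":
    # Kill an instance of the app, then watch how long Eureka keeps advertising it
    start = time.time()
    while True:
        live = registered_instances(APP_NAME)
        print(f"t+{time.time() - start:5.1f}s  {live}")
        if not any(status == "UP" for _, status in live):
            break
        time.sleep(2)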
The Bottom Line
Even with textbook microservices architecture, Spring Boot's configuration defaults can undermine your high availability promises when operating at scale.
Our transformation taught us that Spring Boot HA success requires understanding both the architectural patterns AND the specific configuration tuning needed for your production scale. Most documentation covers the "how" but not the "why" behind production configuration choices.
The universal truth: Whether you're on cloud platforms with managed services or running your own infrastructure, the fundamental principle remains the same - architecture + correct configuration = true high availability. Cloud providers offer additional tools and managed services that can help, but they don't eliminate the need to understand and properly configure your service discovery layer.
The next time someone asks if your microservices are "highly available," don't just check your architecture diagram. Ask: "Have we tested our service discovery failover under production load, and are our Spring Boot timeouts tuned for our actual scale and requirements?"
If you're not sure, you might be one security patch away from finding out the hard way.
Hit similar Spring Boot HA challenges? Share your story in the comments.
#SpringBoot #Microservices #HighAvailability #ServiceDiscovery #SRE #DevOps