Building Resilient Systems: Error Handling and Recovery Patterns

Digiboffins Team
March 25, 2024 · 11 min read · 1320 views
Learn how to build systems that gracefully handle failures. Explore retry patterns, circuit breakers, bulkheads, and other resilience patterns.

Introduction

Failures are inevitable in distributed systems. The difference between a good system and a great one is how gracefully it handles them. Here's how to build resilient systems that recover automatically.

Failure Modes

Common Failures

1. Network Failures

  • Timeouts
  • Connection refused
  • Packet loss

2. Service Failures

  • Service crashes
  • High latency
  • Resource exhaustion

3. Database Failures

  • Connection pool exhaustion
  • Query timeouts
  • Deadlocks

4. Third-Party Failures

  • API outages
  • Rate limiting
  • Invalid responses

Resilience Patterns

1. Retry Pattern

When to Retry:

  • Transient failures
  • Network issues
  • Temporary service unavailability

Implementation:

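// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOperation(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      // Out of attempts: surface the last error to the caller.
      if (i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, 4s, ...
    }
  }
}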

Best Practices:

  • Exponential backoff
  • Jitter to prevent thundering herd
  • Maximum retry limit
  • Don't retry non-retryable errors
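
The practices above can be combined in one helper. The following is a minimal sketch under stated assumptions, not a definitive implementation: `retryWithBackoff`, its option names, the `isRetryable` predicate, and the injectable `sleepFn` and `random` are all illustrative. It uses the "full jitter" variant, where each wait is a random value between zero and the capped exponential delay.

```javascript
// Hypothetical helper: names and defaults are assumptions for illustration.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(operation, options = {}) {
  const {
    maxRetries = 3,           // total attempts, including the first
    baseDelayMs = 1000,
    maxDelayMs = 30000,       // cap so late attempts do not wait forever
    isRetryable = () => true, // caller decides which errors are transient
    sleepFn = defaultSleep,   // injectable so tests need not really wait
    random = Math.random,
  } = options;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const lastAttempt = attempt === maxRetries - 1;
      if (lastAttempt || !isRetryable(error)) throw error;
      // Full jitter: wait a random time in [0, min(cap, base * 2^attempt)).
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleepFn(random() * ceiling);
    }
  }
}
```

A caller might pass something like `isRetryable: (err) => err.statusCode >= 500` so that validation errors fail fast. The `statusCode` field is an assumption about the error shape, not something every client library provides.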

2. Circuit Breaker Pattern

Purpose: Prevent cascading failures by stopping requests to failing services.

States:

  • Closed: Normal operation
  • Open: Failing, reject requests immediately
  • Half-Open: Testing if service recovered

Implementation:

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED';
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
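
To watch the state transitions without waiting a full minute, it can help to make the clock injectable. The sketch below is a small variant of the class above written under that assumption. `createBreaker` and the `now` option are hypothetical names introduced for illustration, and the logic follows the same Closed, Open, and Half-Open states.

```javascript
// Hypothetical variant: `createBreaker` and `now` are illustrative names.
function createBreaker({ threshold = 5, timeoutMs = 60000, now = Date.now } = {}) {
  let failures = 0;
  let state = 'CLOSED';
  let nextAttempt = 0;

  return {
    get state() { return state; },
    async execute(operation) {
      if (state === 'OPEN') {
        if (now() < nextAttempt) throw new Error('Circuit breaker is OPEN');
        state = 'HALF_OPEN'; // timeout elapsed: allow one trial call
      }
      try {
        const result = await operation();
        failures = 0;
        state = 'CLOSED';
        return result;
      } catch (error) {
        failures++;
        // A failed trial call, or too many failures, (re)opens the circuit.
        if (state === 'HALF_OPEN' || failures >= threshold) {
          state = 'OPEN';
          nextAttempt = now() + timeoutMs;
        }
        throw error;
      }
    },
  };
}
```

With a fake clock such as `let t = 0; createBreaker({ now: () => t })`, a test can advance `t` past the timeout and check that the next successful call closes the circuit again.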

3. Bulkhead Pattern

Purpose: Isolate resources to prevent total system failure.

Examples:

  • Separate thread pools
  • Isolated database connections
  • Separate service instances

Benefits:

  • Failure isolation
  • Resource protection
  • More predictable resource usage per dependency
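
JavaScript has no thread pools to separate, but the same isolation idea can be approximated by capping how many calls to one dependency run at once. The following is a rough sketch under that assumption. The `Bulkhead` class, its parameters, and its error message are hypothetical, not taken from a specific library.

```javascript
// Hypothetical sketch: class name and options are assumptions.
class Bulkhead {
  constructor(maxConcurrent = 10, maxQueue = 100) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.active = 0;
    this.queue = [];
  }

  async run(task) {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error('Bulkhead full: request rejected');
      }
      // Wait until a running task hands its slot over to us.
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next();   // pass the slot directly to the next waiter
      else this.active--; // nobody waiting: free the slot
    }
  }
}
```

Giving each downstream dependency its own instance, for example one for a payment API and one for a search API, means a slow payment provider can exhaust only its own slots.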

4. Timeout Pattern

Always Set Timeouts:

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch(url, { signal: controller.signal });
  // Use the response here; fetch rejects with an AbortError on timeout.
} finally {
  clearTimeout(timeoutId);
}
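
Not every operation accepts an abort signal. For those, a generic wrapper can at least bound how long the caller waits. This is a minimal sketch, and `withTimeout` is a hypothetical helper name. Note that it only stops waiting and does not cancel the underlying work.

```javascript
// Hypothetical helper: `withTimeout` is an illustrative name.
function withTimeout(promise, ms, message = `Operation timed out after ${ms} ms`) {
  let timeoutId;
  const timeout = new Promise((_, reject) => {
    timeoutId = setTimeout(() => reject(new Error(message)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timeoutId));
}
```

Usage could look like `await withTimeout(queryDatabase(), 2000)`, where `queryDatabase` stands in for any promise-returning call.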

5. Fallback Pattern

Provide Alternatives:

async function getData() {
  try {
    return await primaryService.getData();
  } catch (error) {
    console.warn('Primary service failed, using fallback');
    return await fallbackService.getData();
  }
}

Fallback Options:

  • Cached data
  • Default values
  • Alternative service
  • Degraded functionality
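
The first two options, cached data and default values, can be chained. The sketch below assumes an in-memory `Map` as the cache. `getWithFallback` and the `source` field are hypothetical names, added so callers can tell when they are looking at stale or default data.

```javascript
// Hypothetical sketch: function and parameter names are assumptions.
const cache = new Map();

async function getWithFallback(key, fetchFresh, defaultValue = null) {
  try {
    const value = await fetchFresh();
    cache.set(key, value); // remember the last good value
    return { value, source: 'primary' };
  } catch (error) {
    if (cache.has(key)) {
      return { value: cache.get(key), source: 'cache' }; // possibly stale
    }
    return { value: defaultValue, source: 'default' };
  }
}
```

Returning the source alongside the value lets the UI label the data as possibly out of date instead of silently presenting it as fresh.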

6. Health Checks

Monitor Service Health:

app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalApi: await checkExternalApi()
  };

  const healthy = Object.values(checks).every(c => c === true);
  res.status(healthy ? 200 : 503).json({ checks });
});
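
The handler above awaits each check in turn and assumes that helpers such as `checkDatabase` never throw or hang. One possible way to guarantee that is sketched below. `runCheck`, `runAllChecks`, and the 2000 ms default are assumptions for illustration. Each probe runs in parallel, and a probe that throws or exceeds its deadline simply counts as unhealthy.

```javascript
// Hypothetical helpers: names and the 2000 ms default are assumptions.
async function runCheck(probe, timeoutMs = 2000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    // A probe passes only if it resolves before the deadline.
    return await Promise.race([
      Promise.resolve().then(probe).then(() => true),
      timeout,
    ]);
  } catch (error) {
    return false; // a throwing probe means "unhealthy", not a crashed endpoint
  } finally {
    clearTimeout(timer);
  }
}

async function runAllChecks(probes, timeoutMs) {
  const names = Object.keys(probes);
  const results = await Promise.all(
    names.map((name) => runCheck(probes[name], timeoutMs))
  );
  const checks = Object.fromEntries(names.map((name, i) => [name, results[i]]));
  return { healthy: results.every(Boolean), checks };
}
```

A probe here is any function that resolves when the dependency answers, such as one that runs `SELECT 1` against the database.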

Error Handling Best Practices

1. Structured Error Handling

Error Types:

class AppError extends Error {
  constructor(message, code, statusCode) {
    super(message);
    this.code = code;
    this.statusCode = statusCode;
  }
}

class ValidationError extends AppError {
  constructor(message) {
    super(message, 'VALIDATION_ERROR', 400);
  }
}

2. Error Logging

Log Everything:

  • Error message
  • Stack trace
  • Context (user, request, etc.)
  • Timestamp

Tools:

  • Sentry
  • LogRocket
  • Datadog
  • Custom logging
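
Whatever tool is used, the four items listed above can be captured in a single structured entry. The following is a minimal sketch of the custom-logging option. The field names and the `buildErrorLog` and `logError` helpers are assumptions, not a logging standard.

```javascript
// Hypothetical sketch: field names are assumptions, not a logging standard.
function buildErrorLog(error, context = {}) {
  return {
    level: 'error',
    timestamp: new Date().toISOString(),
    message: error.message,
    name: error.name,
    code: error.code, // set on AppError-style errors, otherwise undefined
    stack: error.stack,
    context,          // e.g. { userId, requestId, route }
  };
}

function logError(error, context) {
  // One JSON object per line is easy for log aggregators to parse.
  console.error(JSON.stringify(buildErrorLog(error, context)));
}
```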

3. User-Friendly Error Messages

Don't Expose Internals:

// Bad
res.status(500).json({ error: error.stack });

// Good
res.status(500).json({ error: 'Something went wrong. Please try again later.' });
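
One way to apply this consistently is to map every error to a response in a single place. The sketch below repeats the `AppError` class from earlier in this article so that it is self-contained. `toErrorResponse` and the `INTERNAL_ERROR` code are hypothetical. It assumes that only deliberately raised errors with a status below 500 carry messages that are safe to show.

```javascript
// Hypothetical sketch: `toErrorResponse` is an illustrative helper name.
class AppError extends Error {
  constructor(message, code, statusCode) {
    super(message);
    this.code = code;
    this.statusCode = statusCode;
  }
}

function toErrorResponse(error) {
  // Errors we raised on purpose carry a message that is safe to show.
  if (error instanceof AppError && error.statusCode < 500) {
    return {
      status: error.statusCode,
      body: { error: error.message, code: error.code },
    };
  }
  // Anything else is unexpected: hide the details from the client.
  return {
    status: 500,
    body: {
      error: 'Something went wrong. Please try again later.',
      code: 'INTERNAL_ERROR',
    },
  };
}
```

In an Express app, an error-handling middleware with the `(err, req, res, next)` signature could log the full error and then send `toErrorResponse(err)` to the client.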

4. Graceful Degradation

Degrade Features, Not Entire App:

  • Show cached data
  • Disable non-critical features
  • Show maintenance message
  • Allow core functionality

Monitoring and Alerting

Key Metrics

  • Error rates
  • Response times
  • Circuit breaker states
  • Retry counts
  • Timeout rates

Alerting

Alert On:

  • High error rates
  • Circuit breaker opens
  • Service degradation
  • Unusual patterns
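
As an illustration of the first alert condition, an error rate can be tracked over a sliding time window. This is a rough sketch only. `ErrorRateMonitor`, its thresholds, and the injectable `now` clock are assumptions, and a production setup would more likely rely on the alerting rules of a monitoring platform.

```javascript
// Hypothetical sketch: class name, window, and threshold are assumptions.
class ErrorRateMonitor {
  constructor({ windowMs = 60000, threshold = 0.05, minRequests = 20, now = Date.now } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.minRequests = minRequests;
    this.now = now;
    this.events = []; // { time, ok } in arrival order
  }

  prune() {
    // Drop events that have fallen out of the sliding window.
    const cutoff = this.now() - this.windowMs;
    while (this.events.length && this.events[0].time <= cutoff) {
      this.events.shift();
    }
  }

  record(ok) {
    this.prune();
    this.events.push({ time: this.now(), ok });
  }

  errorRate() {
    this.prune();
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((e) => !e.ok).length;
    return errors / this.events.length;
  }

  shouldAlert() {
    // Require a minimum sample so one early failure does not page anyone.
    const rate = this.errorRate();
    return this.events.length >= this.minRequests && rate > this.threshold;
  }
}
```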

Testing Resilience

Chaos Engineering

Purpose: Test system behavior under failure conditions.

Techniques:

  • Kill services randomly
  • Inject latency
  • Simulate network failures
  • Resource exhaustion

Tools:

  • Chaos Monkey
  • Litmus
  • Gremlin
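
The latency and failure injection techniques above can also be tried at a small scale inside application tests, before adopting a dedicated tool. The wrapper below is a minimal sketch. `withChaos` and its options are hypothetical names, and it is meant for test environments only.

```javascript
// Hypothetical sketch: `withChaos` and its options are illustrative names.
function withChaos(operation, options = {}) {
  const {
    failureRate = 0.1, // probability of an injected failure
    maxLatencyMs = 0,  // upper bound for the injected delay
    random = Math.random,
    enabled = true,    // keep this off outside test environments
  } = options;

  return async (...args) => {
    if (enabled) {
      if (maxLatencyMs > 0) {
        const delay = random() * maxLatencyMs;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
      if (random() < failureRate) {
        throw new Error('Injected failure (chaos test)');
      }
    }
    return operation(...args);
  };
}
```

Wrapping a client call this way makes it possible to check that retries, fallbacks, and circuit breakers actually engage when the dependency misbehaves.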

Conclusion

Building resilient systems requires anticipating failures and implementing patterns to handle them gracefully. Start with retries and timeouts, add circuit breakers for critical services, implement health checks, and monitor continuously. Resilience is not optional; it is essential.

*Need help building resilient systems? [Contact us](/schedule-appointment) for expert guidance.*
