Building Resilient Systems: Error Handling and Recovery Patterns
Introduction
Failures are inevitable in distributed systems. The difference between a good system and a great one is how gracefully it handles failure. Here's how to build resilient systems that recover automatically.
Failure Modes
Common Failures
1. Network Failures
   - Timeouts
   - Connection refused
   - Packet loss
2. Service Failures
   - Service crashes
   - High latency
   - Resource exhaustion
3. Database Failures
   - Connection pool exhaustion
   - Query timeouts
   - Deadlocks
4. Third-Party Failures
   - API outages
   - Rate limiting
   - Invalid responses
Resilience Patterns
1. Retry Pattern
When to Retry:
- Transient failures
- Network issues
- Temporary service unavailability
Implementation:
```javascript
// Minimal sleep helper used by the retry loop below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOperation(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      // Out of attempts: surface the last error
      if (i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, 4s...
    }
  }
}
```
Best Practices:
- Exponential backoff
- Jitter to prevent thundering herd (see the sketch after this list)
- Maximum retry limit
- Don't retry non-retryable errors
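A minimal jitter sketch, layered on the `retryOperation` helper above (the function name and defaults here are illustrative choices, not from the original):

```javascript
// "Full jitter": wait a random duration between 0 and an exponentially
// growing cap, so many clients recovering at once don't retry in lockstep.
function backoffWithJitter(attempt, baseMs = 1000, maxMs = 30000) {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * cap);
}

// Inside the retry loop, replace the fixed delay with:
// await sleep(backoffWithJitter(i));
```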
2. Circuit Breaker Pattern
Purpose: Prevent cascading failures by stopping requests to failing services.
States:
- Closed: Normal operation
- Open: Failing, reject requests immediately
- Half-Open: Testing if service recovered
Implementation:
```javascript
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold; // failures before the circuit opens
    this.timeout = timeout;     // how long to stay open, in ms
    this.state = 'CLOSED';
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      // Fail fast until the cool-down period has elapsed
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Let one trial request through to probe the service
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
```
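A usage sketch: one breaker per downstream dependency, shared across requests (`paymentsClient` is a hypothetical service client, not from the original):

```javascript
const paymentsBreaker = new CircuitBreaker(5, 60000);

async function chargeCustomer(order) {
  // While the breaker is OPEN this throws immediately instead of
  // piling more requests onto a failing service.
  return paymentsBreaker.execute(() => paymentsClient.charge(order));
}
```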
3. Bulkhead Pattern
Purpose: Isolate resources so that one failing component cannot take down the whole system; a minimal sketch follows the lists below.
Example:
- Separate thread pools
- Isolated database connections
- Separate service instances
Benefits:
- Failure isolation
- Resource protection
- Better resource utilization
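In Node.js, where separate thread pools don't directly apply, the same idea can be sketched as a per-dependency concurrency cap (the class and limits here are illustrative assumptions):

```javascript
// A minimal bulkhead: cap concurrent calls to one dependency so a slow
// downstream can't tie up every request in the process.
class Bulkhead {
  constructor(maxConcurrent = 10) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async execute(operation) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead full: rejecting to protect the system');
    }
    this.active++;
    try {
      return await operation();
    } finally {
      this.active--;
    }
  }
}

// Give each downstream its own bulkhead so failures stay isolated:
// const paymentsBulkhead = new Bulkhead(5);
// await paymentsBulkhead.execute(() => paymentsClient.charge(order));
```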
4. Timeout Pattern
Always Set Timeouts:
```javascript
// Abort the request if it takes longer than 5 seconds
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch(url, {
    signal: controller.signal
  });
} finally {
  clearTimeout(timeoutId); // always clean up the timer
}
```
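The same pattern can be wrapped into a reusable helper; a sketch (the function name is an illustrative choice):

```javascript
// Wraps fetch with a hard deadline and aborts the request when it expires.
async function fetchWithTimeout(url, options = {}, timeoutMs = 5000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timeoutId);
  }
}
```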
5. Fallback Pattern
Provide Alternatives:
```javascript
async function getData() {
  try {
    return await primaryService.getData();
  } catch (error) {
    console.warn('Primary service failed, using fallback');
    return await fallbackService.getData();
  }
}
```
Fallback Options:
- Cached data (see the sketch after this list)
- Default values
- Alternative service
- Degraded functionality
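As a sketch of the cached-data option, serve stale data when the live call fails (the in-memory `Map` is an illustrative stand-in for a real cache):

```javascript
const cache = new Map(); // stand-in for a real cache such as Redis

async function getDataWithCacheFallback(key) {
  try {
    const fresh = await primaryService.getData(key);
    cache.set(key, fresh); // keep a fallback copy while things are healthy
    return fresh;
  } catch (error) {
    if (cache.has(key)) return cache.get(key); // stale beats nothing
    throw error; // no fallback available; surface the failure
  }
}
```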
6. Health Checks
Monitor Service Health:
```javascript
app.get('/health', async (req, res) => {
  // Each check helper should catch its own errors and return a boolean
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalApi: await checkExternalApi()
  };
  const healthy = Object.values(checks).every(c => c === true);
  res.status(healthy ? 200 : 503).json({ checks });
});
```
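A sketch of one such helper; the important property is that it never throws and reports health as a boolean (`db` and the probe query are assumptions):

```javascript
async function checkDatabase() {
  try {
    await db.query('SELECT 1'); // cheap round-trip to verify connectivity
    return true;
  } catch (error) {
    return false;
  }
}
```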
Error Handling Best Practices
1. Structured Error Handling
Error Types:
```javascript
class AppError extends Error {
  constructor(message, code, statusCode) {
    super(message);
    this.name = this.constructor.name; // keep the subclass name in logs
    this.code = code;
    this.statusCode = statusCode;
  }
}

class ValidationError extends AppError {
  constructor(message) {
    super(message, 'VALIDATION_ERROR', 400);
  }
}
```
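A sketch of how these classes might be consumed in an Express error-handling middleware (the middleware itself is an assumption, not part of the original):

```javascript
// Centralized error middleware: map known AppErrors to their status
// codes and hide everything else behind a generic 500.
app.use((err, req, res, next) => {
  if (err instanceof AppError) {
    return res.status(err.statusCode).json({ error: err.message, code: err.code });
  }
  console.error(err); // log the full details server-side
  res.status(500).json({ error: 'Something went wrong. Please try again later.' });
});
```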
2. Error Logging
Log Everything (a structured example follows this list):
- Error message
- Stack trace
- Context (user, request, etc.)
- Timestamp
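A sketch of what that looks like as a single structured entry (the field names are illustrative):

```javascript
// Log every error as one structured JSON line so it stays searchable.
function logError(error, context = {}) {
  console.error(JSON.stringify({
    level: 'error',
    message: error.message,
    stack: error.stack,
    context, // e.g. { userId, requestId, path }
    timestamp: new Date().toISOString()
  }));
}
```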
Tools:
- Sentry
- LogRocket
- Datadog
- Custom logging
3. User-Friendly Error Messages
Don't Expose Internals:
```javascript
// Bad: leaks internals to the client
res.status(500).json({ error: error.stack });

// Good: generic message for users; the details go to logs instead
res.status(500).json({
  error: 'Something went wrong. Please try again later.'
});
```
4. Graceful Degradation
Degrade Features, Not the Entire App (a sketch follows this list):
- Show cached data
- Disable non-critical features
- Show maintenance message
- Allow core functionality
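A sketch of degrading one non-critical feature (`getCoreContent` and `recommendationService` are hypothetical names):

```javascript
// If recommendations are down, render the page without them rather
// than failing the whole request.
async function getHomePage(userId) {
  const core = await getCoreContent(userId); // core functionality must work
  let recommendations = [];
  try {
    recommendations = await recommendationService.get(userId);
  } catch (error) {
    console.warn('Recommendations unavailable, degrading gracefully');
  }
  return { ...core, recommendations };
}
```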
Monitoring and Alerting
Key Metrics
- Error rates
- Response times
- Circuit breaker states
- Retry counts
- Timeout rates
Alerting
Alert On:
- High error rates
- Circuit breaker opens
- Service degradation
- Unusual patterns
Testing Resilience
Chaos Engineering
Purpose: Test system behavior under failure conditions.
Techniques:
- Kill services randomly
- Inject latency (see the toy sketch after this list)
- Simulate network failures
- Resource exhaustion
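Latency injection can start as simply as a toggleable middleware in a test environment; a toy sketch (the env flag and numbers are assumptions):

```javascript
// Toy chaos middleware: randomly delay a fraction of requests.
app.use(async (req, res, next) => {
  if (process.env.CHAOS_LATENCY === 'true' && Math.random() < 0.1) {
    const delayMs = 500 + Math.random() * 2000; // 0.5s to 2.5s extra latency
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  next();
});
```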
Tools:
- Chaos Monkey
- Litmus
- Gremlin
Conclusion
Building resilient systems requires anticipating failures and implementing patterns to handle them gracefully. Start with retries and timeouts, add circuit breakers for critical services, implement health checks, and continuously monitor. Resilience is not optional—it's essential.
*Need help building resilient systems? [Contact us](/schedule-appointment) for expert guidance.*