Building Resilient Systems: Error Handling and Recovery Patterns
Introduction
Failures are inevitable in distributed systems. The difference between a good system and a great one is how gracefully it handles failure. Here's how to build resilient systems that recover automatically.
Failure Modes
Common Failures
1. Network Failures
   - Timeouts
   - Connection refused
   - Packet loss
2. Service Failures
   - Service crashes
   - High latency
   - Resource exhaustion
3. Database Failures
   - Connection pool exhaustion
   - Query timeouts
   - Deadlocks
4. Third-Party Failures
   - API outages
   - Rate limiting
   - Invalid responses
Resilience Patterns
1. Retry Pattern
When to Retry:
- Transient failures
- Network issues
- Temporary service unavailability
Implementation:
```javascript
// Minimal sleep helper used by the retry loop below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOperation(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      // Out of attempts: surface the last error
      if (i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, 4s...
    }
  }
}
```
Best Practices:
- Exponential backoff
- Jitter to prevent thundering herd (see the sketch after this list)
- Maximum retry limit
- Don't retry non-retryable errors
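A minimal jitter sketch, layered on the `retryOperation` helper above (the function name and defaults here are illustrative choices, not from the original):

```javascript
// "Full jitter": wait a random duration between 0 and an exponentially
// growing cap, so many clients recovering at once don't retry in lockstep.
function backoffWithJitter(attempt, baseMs = 1000, maxMs = 30000) {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * cap);
}

// Inside the retry loop, replace the fixed delay with:
// await sleep(backoffWithJitter(i));
```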
2. Circuit Breaker Pattern
Purpose: Prevent cascading failures by stopping requests to failing services.
States:
- Closed: Normal operation
- Open: Failing, reject requests immediately
- Half-Open: Testing if service recovered
Implementation:
```javascript
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold; // failures before the circuit opens
    this.timeout = timeout;     // how long to stay open, in ms
    this.state = 'CLOSED';
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      // Fail fast until the cool-down period has elapsed
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Let one trial request through to probe the service
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
```
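A usage sketch: one breaker per downstream dependency, shared across requests (`paymentsClient` is a hypothetical service client, not from the original):

```javascript
const paymentsBreaker = new CircuitBreaker(5, 60000);

async function chargeCustomer(order) {
  // While the breaker is OPEN this throws immediately instead of
  // piling more requests onto a failing service.
  return paymentsBreaker.execute(() => paymentsClient.charge(order));
}
```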
3. Bulkhead Pattern
Purpose: Isolate resources so that one failing component cannot take down the whole system; a minimal sketch follows the lists below.
Example:
- Separate thread pools
- Isolated database connections
- Separate service instances
Benefits:
- Failure isolation
- Resource protection
- Better resource utilization
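In Node.js, where separate thread pools don't directly apply, the same idea can be sketched as a per-dependency concurrency cap (the class and limits here are illustrative assumptions):

```javascript
// A minimal bulkhead: cap concurrent calls to one dependency so a slow
// downstream can't tie up every request in the process.
class Bulkhead {
  constructor(maxConcurrent = 10) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async execute(operation) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead full: rejecting to protect the system');
    }
    this.active++;
    try {
      return await operation();
    } finally {
      this.active--;
    }
  }
}

// Give each downstream its own bulkhead so failures stay isolated:
// const paymentsBulkhead = new Bulkhead(5);
// await paymentsBulkhead.execute(() => paymentsClient.charge(order));
```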
4. Timeout Pattern
Always Set Timeouts:
```javascript
// Abort the request if it takes longer than 5 seconds
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch(url, {
    signal: controller.signal
  });
} finally {
  clearTimeout(timeoutId); // always clean up the timer
}
```
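The same pattern can be wrapped into a reusable helper; a sketch (the function name is an illustrative choice):

```javascript
// Wraps fetch with a hard deadline and aborts the request when it expires.
async function fetchWithTimeout(url, options = {}, timeoutMs = 5000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timeoutId);
  }
}
```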
5. Fallback Pattern
Provide Alternatives:
```javascript
async function getData() {
  try {
    return await primaryService.getData();
  } catch (error) {
    console.warn('Primary service failed, using fallback');
    return await fallbackService.getData();
  }
}
```
Fallback Options:
- Cached data (see the sketch after this list)
- Default values
- Alternative service
- Degraded functionality
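As a sketch of the cached-data option, serve stale data when the live call fails (the in-memory `Map` is an illustrative stand-in for a real cache):

```javascript
const cache = new Map(); // stand-in for a real cache such as Redis

async function getDataWithCacheFallback(key) {
  try {
    const fresh = await primaryService.getData(key);
    cache.set(key, fresh); // keep a fallback copy while things are healthy
    return fresh;
  } catch (error) {
    if (cache.has(key)) return cache.get(key); // stale beats nothing
    throw error; // no fallback available; surface the failure
  }
}
```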
6. Health Checks
Monitor Service Health:
```javascript
app.get('/health', async (req, res) => {
  // Each check helper should catch its own errors and return a boolean
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalApi: await checkExternalApi()
  };
  const healthy = Object.values(checks).every(c => c === true);
  res.status(healthy ? 200 : 503).json({ checks });
});
```
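A sketch of one such helper; the important property is that it never throws and reports health as a boolean (`db` and the probe query are assumptions):

```javascript
async function checkDatabase() {
  try {
    await db.query('SELECT 1'); // cheap round-trip to verify connectivity
    return true;
  } catch (error) {
    return false;
  }
}
```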
Error Handling Best Practices
1. Structured Error Handling
Error Types:
```javascript
class AppError extends Error {
  constructor(message, code, statusCode) {
    super(message);
    this.name = this.constructor.name; // keep the subclass name in logs
    this.code = code;
    this.statusCode = statusCode;
  }
}

class ValidationError extends AppError {
  constructor(message) {
    super(message, 'VALIDATION_ERROR', 400);
  }
}
```
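A sketch of how these classes might be consumed in an Express error-handling middleware (the middleware itself is an assumption, not part of the original):

```javascript
// Centralized error middleware: map known AppErrors to their status
// codes and hide everything else behind a generic 500.
app.use((err, req, res, next) => {
  if (err instanceof AppError) {
    return res.status(err.statusCode).json({ error: err.message, code: err.code });
  }
  console.error(err); // log the full details server-side
  res.status(500).json({ error: 'Something went wrong. Please try again later.' });
});
```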
2. Error Logging
Log Everything (a structured example follows this list):
- Error message
- Stack trace
- Context (user, request, etc.)
- Timestamp
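A sketch of what that looks like as a single structured entry (the field names are illustrative):

```javascript
// Log every error as one structured JSON line so it stays searchable.
function logError(error, context = {}) {
  console.error(JSON.stringify({
    level: 'error',
    message: error.message,
    stack: error.stack,
    context, // e.g. { userId, requestId, path }
    timestamp: new Date().toISOString()
  }));
}
```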
Tools:
- Sentry
- LogRocket
- Datadog
- Custom logging
3. User-Friendly Error Messages
Don't Expose Internals:
```javascript
// Bad: leaks internals to the client
res.status(500).json({ error: error.stack });

// Good: generic message for users; the details go to logs instead
res.status(500).json({
  error: 'Something went wrong. Please try again later.'
});
```
4. Graceful Degradation
Degrade Features, Not the Entire App (a sketch follows this list):
- Show cached data
- Disable non-critical features
- Show maintenance message
- Allow core functionality
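A sketch of degrading one non-critical feature (`getCoreContent` and `recommendationService` are hypothetical names):

```javascript
// If recommendations are down, render the page without them rather
// than failing the whole request.
async function getHomePage(userId) {
  const core = await getCoreContent(userId); // core functionality must work
  let recommendations = [];
  try {
    recommendations = await recommendationService.get(userId);
  } catch (error) {
    console.warn('Recommendations unavailable, degrading gracefully');
  }
  return { ...core, recommendations };
}
```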
Monitoring and Alerting
Key Metrics
- Error rates
- Response times
- Circuit breaker states
- Retry counts
- Timeout rates
Alerting
Alert On:
- High error rates
- Circuit breaker opens
- Service degradation
- Unusual patterns
Testing Resilience
Chaos Engineering
Purpose: Test system behavior under failure conditions.
Techniques:
- Kill services randomly
- Inject latency (see the toy sketch after this list)
- Simulate network failures
- Resource exhaustion
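Latency injection can start as simply as a toggleable middleware in a test environment; a toy sketch (the env flag and numbers are assumptions):

```javascript
// Toy chaos middleware: randomly delay a fraction of requests.
app.use(async (req, res, next) => {
  if (process.env.CHAOS_LATENCY === 'true' && Math.random() < 0.1) {
    const delayMs = 500 + Math.random() * 2000; // 0.5s to 2.5s extra latency
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  next();
});
```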
Tools:
- Chaos Monkey
- Litmus
- Gremlin
Conclusion
Building resilient systems requires anticipating failures and implementing patterns to handle them gracefully. Start with retries and timeouts, add circuit breakers for critical services, implement health checks, and continuously monitor. Resilience is not optional—it's essential.
*Need help building resilient systems? [Contact us](/schedule-appointment) for expert guidance.*