Amazon CloudWatch
Amazon CloudWatch Overview
- Definition: Amazon CloudWatch is a managed monitoring and observability service that collects, analyzes, and visualizes metrics, logs, and events from AWS resources, applications, and on-premises systems.
- Key Features:
- Collects metrics (e.g., EC2 CPU usage), logs (e.g., Lambda logs), and events (e.g., state changes).
- Provides alarms for automated actions, dashboards for visualization, and Logs Insights for log analysis.
- Supports CloudWatch Agent for custom metrics/logs and Synthetics for application health checks.
- Integrates with SNS, Lambda, EventBridge, Auto Scaling, and Security Hub for automation and monitoring.
- Use Cases: Monitor application performance, trigger auto-scaling, analyze logs for errors, ensure compliance, detect anomalies.
- Key Updates (2024–2025):
- Cross-Account Observability: Unified metrics/logs across accounts (October 2024).
- Improved Anomaly Detection: Enhanced ML-based alerts (March 2024).
- FIPS 140-2 Compliance: Enhanced for GovCloud (October 2024).
- Security Hub Integration: Centralized monitoring findings (January 2025).
1. CloudWatch Core Concepts
Components
- Metrics:
- Time-series data points representing resource or application performance (e.g., EC2 CPUUtilization).
- Retained for up to 15 months; grouped into namespaces (AWS-provided or custom).
- Explanation: E.g., metric for S3 NumberOfObjects.
- Logs:
- Text-based records from applications or AWS services (e.g., Lambda execution logs).
- Stored in Log Groups and Log Streams.
- Explanation: E.g., log group /aws/lambda/myFunction.
- Alarms:
- Monitor metrics and trigger actions (e.g., SNS notification, Auto Scaling) based on thresholds.
- Explanation: E.g., alarm if EC2 CPUUtilization > 80%.
- Dashboards:
- Customizable visualizations of metrics, logs, and alarms.
- Explanation: E.g., dashboard for EC2, RDS, Lambda metrics.
- Events:
- Records of state changes or schedules (e.g., EC2 instance termination, cron jobs).
- Processed via EventBridge (formerly CloudWatch Events).
- Explanation: E.g., event triggers Lambda on S3 upload.
- Logs Insights:
- Interactive query tool for analyzing log data with a purpose-built, pipe-delimited query language (commands such as fields, filter, stats, sort, limit).
- Explanation: E.g., query Lambda logs for “ERROR”.
- CloudWatch Agent:
- Software to collect custom metrics/logs from EC2, on-premises servers.
- Explanation: E.g., collect memory usage from EC2.
- Synthetics:
- Canary scripts to monitor application endpoints and APIs.
- Explanation: E.g., synthetic test for website uptime.
- Contributor Insights:
- Analyzes log data to identify top contributors (e.g., IP addresses, users).
- Explanation: E.g., identify top API callers.
- CloudWatch Evidently:
- Feature flagging and experimentation for application changes.
- Explanation: E.g., test new UI feature on 10% of users.
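The CloudWatch Agent item above (collecting memory usage from EC2) is driven by a JSON configuration file. The following is a minimal sketch of such a file, built in Python so it can be inspected and serialized. It assumes a Linux host, and the namespace `Custom/MyApp` is only the illustrative name used in these notes.

```python
import json


def build_agent_config(namespace: str = "Custom/MyApp") -> dict:
    """Return a minimal CloudWatch Agent config that collects memory usage.

    Assumption: Linux agent, where the memory plugin is named "mem" and the
    percentage measurement is "mem_used_percent".
    """
    return {
        "metrics": {
            "namespace": namespace,
            # Tag every data point with the instance that produced it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": 60,  # seconds
                }
            },
        }
    }


def render_agent_config(namespace: str = "Custom/MyApp") -> str:
    """Serialize the config as the JSON text the agent reads from disk."""
    return json.dumps(build_agent_config(namespace), indent=2)
```

Memory is a typical example because EC2 does not publish it as a default metric; it only appears in CloudWatch once the agent (or custom code) sends it.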
Key Concepts
- Namespaces:
- Containers for metrics (e.g., AWS/EC2, Custom/MyApp).
- Explanation: E.g., MyApp/MemoryUsage for custom metric.
- Dimensions:
- Key-value pairs to filter metrics (e.g., InstanceId=i-123).
- Explanation: E.g., metric for specific EC2 instance.
- Retention:
- Metrics: Up to 15 months, rolled up as data ages (sub-minute data points kept 3 hours, 1-minute for 15 days, 5-minute for 63 days, 1-hour for 455 days).
- Logs: Configurable (1 day to indefinite).
- Explanation: E.g., retain Lambda logs for 30 days.
- Anomaly Detection:
- ML-based detection of metric anomalies.
- Explanation: E.g., detect unusual spike in API latency.
- Cross-Account Observability:
- Unified view of metrics/logs across accounts (new 2024).
- Explanation: E.g., monitor 10 accounts in one dashboard.
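To make namespaces and dimensions concrete, here is a small sketch that builds the request body for the `PutMetricData` API. The metric name `MemoryUsage`, the namespace `Custom/MyApp`, and the instance ID `i-123` are the illustrative values from these notes, not real resources.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions):
    """Build one entry of the MetricData list for PutMetricData."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify one time series.
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def build_put_metric_data(namespace, data):
    """Wrap metric data in the namespace that acts as its container."""
    return {"Namespace": namespace, "MetricData": list(data)}


request = build_put_metric_data(
    "Custom/MyApp",
    [build_metric_datum("MemoryUsage", 62.5, "Percent", {"InstanceId": "i-123"})],
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Each unique combination of namespace, metric name, and dimensions counts as a separate metric, which is also how custom metrics are billed.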
Key Notes:
- Exam Relevance: Understand metrics, logs, alarms, Logs Insights, and cross-account observability.
- Mastery Tip: Compare CloudWatch vs. CloudTrail vs. X-Ray for monitoring.
2. CloudWatch Performance Features
CloudWatch optimizes monitoring and analysis.
Low Latency
- Purpose: Real-time monitoring.
- Features:
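- Metric resolution depends on the source (e.g., EC2 CPUUtilization every 5 minutes with basic monitoring or every 1 minute with detailed monitoring; high-resolution custom metrics down to 1 second).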
- Logs delivered to Log Groups in near-real-time.
- Logs Insights queries execute in seconds (improved 2024).
- Explanation: E.g., Lambda logs typically appear within a few seconds of the invocation.
- Exam Tip: Highlight real-time metrics/logs for performance.
High Throughput
- Purpose: Handle large data volumes.
- Features:
- Scales to billions of metrics/logs daily.
- Supports high-frequency custom metrics via CloudWatch Agent.
- Explanation: E.g., collect 1 million Lambda logs/hour.
- Exam Tip: Use for high-traffic apps.
Scalability
- Purpose: Support growing workloads.
- Features:
- Auto-scales for metrics, logs, and alarms.
- Cross-account observability for multi-account environments (new 2024).
- Explanation: E.g., monitor 1,000 EC2 instances across 10 accounts.
- Exam Tip: Use cross-account for enterprise scalability.
Key Notes:
- Performance: Low latency + high throughput + scalability = efficient monitoring.
- Exam Tip: Emphasize CloudWatch for real-time, scalable observability.
3. CloudWatch Resilience Features
Resilience ensures reliable monitoring.
Multi-AZ/Region Redundancy
- Purpose: Survive failures.
- Features:
- Metrics/logs stored in Regional, multi-AZ infrastructure.
- Cross-Region data replication for logs (manual setup).
- Explanation: E.g., metrics persist if us-east-1a fails.
- Exam Tip: Highlight multi-AZ for HA.
Continuous Monitoring
- Purpose: Uninterrupted observability.
- Features:
- Runs 24/7, unaffected by resource failures.
- Alarms and Synthetics monitor continuously.
- Explanation: E.g., monitor EC2 during S3 outage.
- Exam Tip: Use for continuous monitoring.
Monitoring and Recovery
- Purpose: Detect and respond to issues.
- Features:
- Alarms trigger SNS, Lambda, or Auto Scaling for recovery.
- EventBridge (formerly CloudWatch Events) for automated workflows.
- Security Hub detects misconfigured monitoring (new 2025).
- Explanation: E.g., alarm scales out EC2 on high CPU.
- Exam Tip: Use alarms and EventBridge for resilience.
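As a sketch of how an alarm ties a metric to a recovery action, the function below builds the arguments for the `PutMetricAlarm` API (boto3: `put_metric_alarm`). The alarm name, instance ID, and SNS topic ARN are hypothetical placeholders, and the evaluation settings are one reasonable choice rather than a recommendation.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Alarm when average CPU stays above the threshold for 2 x 5 minutes."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluated data point
        "EvaluationPeriods": 2,   # consecutive periods that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Actions run when the alarm enters the ALARM state; an SNS topic here,
        # but an Auto Scaling policy ARN could be listed instead.
        "AlarmActions": [topic_arn],
    }


alarm = build_cpu_alarm(
    "i-123", "arn:aws:sns:us-east-1:111111111111:ops-alerts"  # placeholder ARN
)
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```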
Data Retention
- Purpose: Ensure historical data availability.
- Features:
- Metrics retained for 15 months.
- Logs configurable (1 day to indefinite).
- Explanation: E.g., analyze 6-month-old EC2 metrics.
- Exam Tip: Highlight retention for compliance.
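The 15-month figure applies to hourly rollups; finer-grained data points age out sooner. The sketch below encodes the documented retention schedule and answers "what is the finest period still available for data of a given age?". The function name is made up for illustration.

```python
from datetime import timedelta

# (maximum age, finest period in seconds still retained at that age)
RETENTION_TIERS = [
    (timedelta(hours=3), 1),       # sub-minute data, only if published as high-resolution
    (timedelta(days=15), 60),      # 1-minute data points
    (timedelta(days=63), 300),     # 5-minute data points
    (timedelta(days=455), 3600),   # 1-hour data points (about 15 months)
]


def finest_period_seconds(age):
    """Return the smallest period available for data of this age, or None if expired."""
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None
```

For the 6-month-old EC2 metrics in the example above, `finest_period_seconds(timedelta(days=180))` gives 3600, so only hourly aggregates remain.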
Key Notes:
- Resilience: Multi-AZ + continuous monitoring + alarms + retention = reliable observability.
- Exam Tip: Design resilient monitoring with alarms and cross-Region replication.
4. CloudWatch Security Features
Security is a core focus for CloudWatch in SAA-C03.
Access Control
- IAM Policies:
- Control access to metrics, logs, alarms (cloudwatch:PutMetricData, logs:CreateLogGroup).
- Restrict access to specific namespaces or Log Groups.
- Example: {"Effect": "Allow", "Action": "cloudwatch:PutMetricData", "Resource": "*"}.
- Resource Policies:
- Control cross-account access to Log Groups.
- Explanation: E.g., allow dev account to read prod logs.
- Exam Tip: Practice IAM and resource policies for access control.
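`PutMetricData` does not support resource-level permissions, so restricting it to a namespace is done with a condition key instead of the `Resource` element. A minimal sketch of such a policy follows, using the `cloudwatch:namespace` condition key; the namespace value is the illustrative one from these notes.

```python
import json


def namespace_scoped_policy(namespace):
    """IAM policy allowing PutMetricData only into one custom namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",  # required: the action has no resource ARNs
                "Condition": {
                    "StringEquals": {"cloudwatch:namespace": namespace}
                },
            }
        ],
    }


policy_json = json.dumps(namespace_scoped_policy("Custom/MyApp"), indent=2)
```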
Encryption
- In Transit:
- HTTPS for API calls and log/metric delivery.
- Explanation: E.g., secure PutMetricData call.
- At Rest:
- Log Groups encrypted with KMS (optional).
- Metrics stored securely (no KMS option).
- Explanation: E.g., KMS key encrypts Lambda logs.
- Exam Tip: Highlight KMS for compliance.
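As a sketch of the at-rest option, the helper below prepares the arguments for creating a KMS-encrypted Log Group and setting its retention (boto3 `logs` client: `create_log_group` and `put_retention_policy`). The log group name and key ARN are placeholders, and the key policy is assumed to already allow the CloudWatch Logs service to use the key.

```python
def build_encrypted_log_group(log_group_name, kms_key_arn, retention_days=30):
    """Return kwargs for create_log_group and put_retention_policy."""
    create_kwargs = {
        "logGroupName": log_group_name,
        "kmsKeyId": kms_key_arn,  # ARN of the KMS key used to encrypt log data
    }
    retention_kwargs = {
        "logGroupName": log_group_name,
        "retentionInDays": retention_days,
    }
    return create_kwargs, retention_kwargs


create_kwargs, retention_kwargs = build_encrypted_log_group(
    "/aws/lambda/myFunction",
    "arn:aws:kms:us-east-1:111111111111:key/EXAMPLE-KEY-ID",  # placeholder ARN
)
# logs = boto3.client("logs")
# logs.create_log_group(**create_kwargs)
# logs.put_retention_policy(**retention_kwargs)
```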
Compliance
- Purpose: Meet regulatory standards.
- Features:
- Supports HIPAA, PCI, SOC, ISO, GDPR, FIPS 140-2 (GovCloud).
- Security Hub detects misconfigured monitoring (new 2025).
- Explanation: E.g., use CloudWatch for PCI-compliant monitoring.
- Exam Tip: Highlight compliance certifications.
Auditing and Analysis
- Purpose: Track and investigate activity.
- Features:
- CloudTrail logs CloudWatch API calls (e.g., PutMetricAlarm, DeleteAlarms).
- Logs Insights for log analysis.
- Contributor Insights for top contributors.
- Explanation: E.g., query Lambda logs for “timeout”.
- Exam Tip: Use Logs Insights for security investigations.
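The "timeout" investigation above could look roughly like this in the Logs Insights query language. The sketch builds the arguments for the `StartQuery` API (boto3 `logs` client: `start_query`); the log group name is the illustrative one from these notes.

```python
import time


def build_timeout_query(log_group, hours=1, now=None):
    """Return start_query kwargs that search recent logs for 'timeout'."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - hours * 3600,  # epoch seconds
        "endTime": end,
        "queryString": (
            "fields @timestamp, @message "
            "| filter @message like /timeout/ "
            "| sort @timestamp desc "
            "| limit 20"
        ),
    }


# logs = boto3.client("logs")
# query_id = logs.start_query(**build_timeout_query("/aws/lambda/myFunction"))["queryId"]
# results = logs.get_query_results(queryId=query_id)  # poll until status is "Complete"
```

Keeping the time window and the list of log groups small also limits the amount of data scanned, which is what Logs Insights queries are billed on.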
Key Notes:
- Security: IAM + encryption + compliance + auditing = secure monitoring.
- Exam Tip: Configure KMS, IAM, and Logs Insights for secure observability.
5. CloudWatch Cost Optimization
Cost efficiency is a key exam domain.
Pricing
- Metrics:
- $0.30/custom metric/month for the first 10,000 metrics (lower tiered rates beyond that).
- $0.01/1,000 metric requests.
- Logs:
- $0.50/GB ingested.
- $0.03/GB stored/month.
- $0.005/GB of data scanned by Logs Insights queries.
- Alarms:
- $0.10/alarm/month (standard).
- $0.30/alarm/month (high-resolution).
- Dashboards:
- $3/dashboard/month.
- Synthetics:
- $0.0012/canary run.
- Example:
- 100 metrics, 1M metric requests, 10 GB logs ingested, 10 GB stored, 5 alarms, 1 dashboard, 1,000 canary runs:
- Metrics: 100 × $0.30 + 1M × $0.01/1,000 = $30 + $10 = $40.
- Logs: 10 × $0.50 + 10 × $0.03 = $5 + $0.30 = $5.30.
- Alarms: 5 × $0.10 = $0.50.
- Dashboard: 1 × $3 = $3.
- Synthetics: 1,000 × $0.0012 = $1.20.
- Total: $40 + $5.30 + $0.50 + $3 + $1.20 = $50.00/month (before any free tier allowance).
- Free Tier: 10 metrics, 3 dashboards, 10 alarms, 5 GB logs/month.
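For practicing cost questions, the worked example above can be reproduced with a few lines of arithmetic. This is only a sketch: the rates are the ones listed in these notes (actual prices vary by Region and over time), and the free tier is ignored.

```python
# Rates as listed in these notes (USD); treat them as assumptions.
RATE_METRIC = 0.30           # per custom metric per month
RATE_REQUESTS_PER_1K = 0.01  # per 1,000 metric API requests
RATE_LOG_INGEST_GB = 0.50    # per GB ingested
RATE_LOG_STORE_GB = 0.03     # per GB stored per month
RATE_ALARM = 0.10            # per standard alarm per month
RATE_DASHBOARD = 3.00        # per dashboard per month
RATE_CANARY_RUN = 0.0012     # per canary run


def monthly_cost(metrics, metric_requests, gb_ingested, gb_stored,
                 alarms, dashboards, canary_runs):
    """Return a per-component cost breakdown plus the total, rounded to cents."""
    breakdown = {
        "metrics": metrics * RATE_METRIC
        + metric_requests / 1000 * RATE_REQUESTS_PER_1K,
        "logs": gb_ingested * RATE_LOG_INGEST_GB + gb_stored * RATE_LOG_STORE_GB,
        "alarms": alarms * RATE_ALARM,
        "dashboards": dashboards * RATE_DASHBOARD,
        "synthetics": canary_runs * RATE_CANARY_RUN,
    }
    total = sum(breakdown.values())
    rounded = {name: round(cost, 2) for name, cost in breakdown.items()}
    rounded["total"] = round(total, 2)
    return rounded


example = monthly_cost(100, 1_000_000, 10, 10, 5, 1, 1_000)
```

Metrics dominate this scenario (80% of the total), which is why limiting custom metrics is usually the first cost lever.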
Cost Strategies
- Limit Metrics:
- Use default AWS metrics; avoid excessive custom metrics.
- Explanation: E.g., monitor CPUUtilization only to save $0.30/metric.
- Optimize Logs:
- Enable logs only for critical resources (e.g., Lambda errors).
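- Set retention policies so old logs expire; export logs to S3 when long-term archives are needed.
- Explanation: E.g., archive logs to S3 Standard ($0.023/GB-month) instead of CloudWatch Logs storage ($0.03/GB-month) to save $0.007/GB-month.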
- Targeted Alarms:
- Create alarms for critical thresholds only.
- Explanation: E.g., alarm on CPU > 80% to save $0.10/alarm.
- Efficient Queries:
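- Narrow the Log Groups and time range of Logs Insights queries, since queries are billed by the amount of data scanned.
- Explanation: E.g., querying one hour of a single Log Group scans far less data than querying a full month across all Log Groups.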
- Tagging:
- Tag Log Groups, alarms, dashboards for cost tracking.
- Explanation: E.g., tag Log Group with “Project:App”.
- Monitor Usage:
- Use CloudWatch metrics to track ingestion and optimize.
- Explanation: E.g., reduce log ingestion to save $5/month.
Key Notes:
- Cost Savings: Limit metrics/logs + lifecycle policies + tagging = lower costs.
- Exam Tip: Calculate costs for metrics, logs, and alarms.
6. CloudWatch Advanced Features
Cross-Account Observability:
- Purpose: Unified monitoring.
- Features:
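- Aggregates metrics/logs from source accounts into a central monitoring account, linked through Observability Access Manager (optionally scoped to an AWS Organizations organization) (new 2024).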
- Explanation: E.g., dashboard for 10 accounts’ EC2 metrics.
- Exam Tip: Use for enterprise monitoring.
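Cross-account observability is configured with Observability Access Manager (OAM): the monitoring account owns a sink, and each source account creates a link to it. The sketch below builds the arguments for the source-account side (boto3 `oam` client: `create_link`). The sink ARN is a placeholder, and sharing only metrics and log groups is an illustrative choice.

```python
def build_link_request(sink_arn, share_logs=True):
    """Return create_link kwargs for a source account joining a monitoring sink."""
    resource_types = ["AWS::CloudWatch::Metric"]
    if share_logs:
        resource_types.append("AWS::Logs::LogGroup")
    return {
        # How the source account is labeled in the monitoring account's console.
        "LabelTemplate": "$AccountName",
        "ResourceTypes": resource_types,
        "SinkIdentifier": sink_arn,
    }


link_request = build_link_request(
    "arn:aws:oam:us-east-1:111111111111:sink/EXAMPLE-SINK-ID"  # placeholder ARN
)
# Run in each source account:
#   boto3.client("oam").create_link(**link_request)
```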
Improved Anomaly Detection:
- Purpose: Detect unusual patterns.
- Features:
- Enhanced ML-based detection for metrics (new 2024).
- Explanation: E.g., detect spike in Lambda errors.
- Exam Tip: Enable for proactive monitoring.
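An anomaly detection alarm compares a metric against a model-generated band instead of a fixed threshold. Below is a sketch of the `put_metric_alarm` arguments for the Lambda errors example; the function name is hypothetical, and the band width of 2 standard deviations is just a common starting point.

```python
def build_anomaly_alarm(function_name, band_width=2):
    """Alarm when Lambda Errors rises above the expected (anomaly) band."""
    return {
        "AlarmName": f"lambda-errors-anomaly-{function_name}",
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # Points at the metric-math expression that defines the band.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "errors",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Errors",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(errors, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


anomaly_alarm = build_anomaly_alarm("myFunction")
# boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm)
```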
Security Hub Integration:
- Purpose: Centralized security monitoring.
- Features:
- Detects misconfigured CloudWatch setups (e.g., no alarms) (new 2025).
- Aggregates findings with GuardDuty, Inspector.
- Explanation: E.g., flag missing EC2 monitoring.
- Exam Tip: Use Security Hub for compliance.
CloudWatch Synthetics:
- Purpose: Monitor application health.
- Features:
- Canary scripts test endpoints, APIs, and workflows.
- Explanation: E.g., test API endpoint every 5 minutes.
- Exam Tip: Use for proactive health checks.
CloudWatch Evidently:
- Purpose: Feature experimentation.
- Features:
- A/B testing and feature flagging for apps.
- Explanation: E.g., roll out new feature to 20% of users.
- Exam Tip: Know for application testing.
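The idea behind "roll out to 20% of users" can be illustrated with deterministic hash bucketing. This is a generic sketch of the technique, not Evidently's actual assignment algorithm, and the feature name is made up.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the rollout percentage.

    The same user always lands in the same bucket for a given feature,
    so their experience stays stable as the percentage grows.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return bucket < percent


# Rough check of the share of users who would see a hypothetical "new-ui" flag.
enabled = sum(in_rollout(f"user-{i}", "new-ui", 20) for i in range(10_000))
share = enabled / 10_000
```

Raising the percentage later only adds users: anyone already in the 20% group stays enabled at 50%.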
Contributor Insights:
- Purpose: Identify log contributors.
- Features:
- Analyzes top users, IPs, or resources in logs.
- Explanation: E.g., find top S3 bucket accessors.
- Exam Tip: Use for log analysis.
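Conceptually, Contributor Insights is a continuously updated top-N count over a field in the logs. The sketch below shows that idea on a handful of in-memory records; the field name `sourceIp` and the sample data are invented for illustration, and the real service is configured with rules rather than code like this.

```python
from collections import Counter


def top_contributors(records, key, n=3):
    """Return the n most frequent values of `key` as (value, count) pairs."""
    counts = Counter(record[key] for record in records if key in record)
    return counts.most_common(n)


sample_logs = [
    {"sourceIp": "10.0.0.1", "path": "/orders"},
    {"sourceIp": "10.0.0.2", "path": "/orders"},
    {"sourceIp": "10.0.0.1", "path": "/cart"},
    {"sourceIp": "10.0.0.3", "path": "/orders"},
    {"sourceIp": "10.0.0.1", "path": "/orders"},
    {"sourceIp": "10.0.0.2", "path": "/cart"},
]
top_ips = top_contributors(sample_logs, "sourceIp", n=2)
```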
Key Notes:
- Flexibility: Cross-account + anomaly detection + Synthetics = advanced observability.
- Exam Tip: Master cross-account observability, Synthetics, and Evidently.
7. CloudWatch Use Cases
Understand practical applications.
Application Monitoring
- Setup: Metrics for EC2, Lambda; logs for errors; alarms for thresholds.
- Features: Real-time performance tracking.
- Explanation: E.g., monitor Lambda latency, alert on errors.
Auto-Scaling
- Setup: Alarms trigger Auto Scaling policies.
- Features: Scale EC2 based on CPU or custom metrics.
- Explanation: E.g., scale out if CPUUtilization > 70%.
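One common way to express "keep CPU around 70%" is a target tracking policy, for which Auto Scaling creates and manages the underlying CloudWatch alarms itself. The sketch below builds the arguments for the `PutScalingPolicy` API (boto3 `autoscaling` client: `put_scaling_policy`); the group and policy names are hypothetical.

```python
def build_target_tracking_policy(asg_name, target_cpu=70.0):
    """Return put_scaling_policy kwargs that track average CPU for an ASG."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
        },
    }


scaling_policy = build_target_tracking_policy("web-asg")
# boto3.client("autoscaling").put_scaling_policy(**scaling_policy)
```

With simple or step scaling, by contrast, the alarm is created separately and the scaling policy ARN is listed in the alarm's actions.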
Log Analysis
- Setup: Log Group for app logs, Logs Insights for queries.
- Features: Identify errors, performance issues.
- Explanation: E.g., query API Gateway logs for 500 errors.
Compliance Monitoring
- Setup: Metrics/logs with KMS, Security Hub integration.
- Features: HIPAA/PCI-compliant monitoring.
- Explanation: E.g., monitor RDS for PCI compliance.
8. CloudWatch vs. Other Monitoring Services
| Feature | CloudWatch | CloudTrail | X-Ray |
|---|---|---|---|
| Type | Monitoring/Observability | API Audit | Application Tracing |
| Focus | Metrics, logs, alarms | API calls, events | Request tracing |
| Data | Performance, logs | Management/data events | Traces, latency |
| Cost | $0.30/metric, $0.50/GB | Free–$2/100K events | $5/1M traces |
| Use Case | Monitor EC2 CPU | Audit IAM changes | Trace Lambda requests |
Explanation:
- CloudWatch: General monitoring and logging.
- CloudTrail: API auditing.
- X-Ray: Application performance tracing.
9. Detailed Explanations for Mastery
- Cross-Account Observability:
- Example: Dashboard for EC2 metrics across 10 accounts.
- Why It Matters: Unified enterprise monitoring—new for 2024.
- Improved Anomaly Detection:
- Example: Detect unusual spike in S3 requests.
- Why It Matters: Proactive alerts—new for 2024.
- Security Hub Integration:
- Example: Flag missing Lambda monitoring.
- Why It Matters: Centralized compliance—new for 2025.
10. Quick Reference Table
| Feature | Purpose | Key Detail | Exam Relevance |
|---|---|---|---|
| Metrics | Track performance | Time-series, 15-month retention | Core Concept |
| Logs | Store app logs | Log Groups, Insights queries | Core Concept |
| Alarms | Trigger actions | Threshold-based, SNS/Lambda | Core Concept |
| Cross-Account | Unified monitoring | Multi-account metrics/logs (2024) | Scalability |
| Anomaly Detection | Detect unusual patterns | ML-based, improved (2024) | Security, Performance |
| Synthetics | Monitor endpoints | Canary scripts for health checks | Flexibility |
| Security Hub | Compliance monitoring | Misconfigured setups (2025) | Security, Resilience |