# Engineering Real-Time Analytics: Processing 2B Data Points Daily
Real-time analytics sounds straightforward until you're processing 23,000 events per second and users expect query results in under 500ms. Here's how we built it.
## The Problem Space
Our client, a B2B analytics platform, needed to:
- Ingest 2 billion data points daily from 200+ customers
- Support complex queries (aggregations, filters, custom metrics)
- Return results in <500ms (P99 latency)
- Handle concurrent queries from thousands of users
- Support white-label deployments with data isolation
Traditional approaches failed:
- **PostgreSQL**: Great for transactions, terrible for analytical queries at this scale
- **Elasticsearch**: Fast searches, poor aggregation performance
- **Pre-aggregated tables**: Too rigid, couldn't support custom metrics
## Architecture Overview
### Data Ingestion Pipeline
**Step 1: Event Collection**
Customers send events via REST API or SDK. We chose **Apache Kafka** for ingestion because:
- Handles 23K events/sec easily
- Provides replay capability (critical for debugging)
- Decouples ingestion from processing
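Before events ever reach Kafka, the collection API validates and batches them. A minimal sketch of that step (field names and batch size are illustrative, not our production code):

```javascript
// Reject events missing required fields before they reach Kafka
const REQUIRED_FIELDS = ['app_id', 'event_type', 'timestamp', 'user_id'];

function validateEvent(event) {
  const missing = REQUIRED_FIELDS.filter(f => !(f in event));
  if (missing.length > 0) {
    return { ok: false, error: `missing fields: ${missing.join(', ')}` };
  }
  // Reject timestamps that are non-numeric or more than 60s in the future
  if (typeof event.timestamp !== 'number' || event.timestamp > Date.now() + 60000) {
    return { ok: false, error: 'invalid timestamp' };
  }
  return { ok: true };
}

// Group accepted events into fixed-size chunks to amortize produce calls
function batchEvents(events, batchSize = 500) {
  const batches = [];
  for (let i = 0; i < events.length; i += batchSize) {
    batches.push(events.slice(i, i + batchSize));
  }
  return batches;
}
```

Rejecting malformed events at the edge keeps the downstream topology free of defensive checks.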
**Step 2: Stream Processing**
We use Kafka Streams for real-time processing:
```javascript
// Enrich events with user metadata (simplified sketch of the
// Kafka Streams topology; the actual Streams DSL is Java)
events
  .filter(event => event.timestamp > Date.now() - 86400000) // last 24h only
  .join(userTable, (event, user) => ({
    ...event,
    userCountry: user.country,
    userPlan: user.subscriptionPlan
  }))
  .to('enriched-events')
```
**Step 3: Storage**
We evaluated 5 databases and chose **ClickHouse** because:
- 100x faster than PostgreSQL for analytical queries
- Built-in compression (10:1 ratio, saving storage costs)
- Horizontal scaling through sharding
- Excellent aggregation performance
### Query Processing
Users build custom dashboards with complex metrics. Example:
```sql
SELECT
    date_trunc('day', timestamp) AS date,
    country,
    COUNT(DISTINCT user_id) AS active_users,
    AVG(session_duration) AS avg_session,
    quantile(0.95)(load_time) AS p95_load_time  -- ClickHouse quantile syntax
FROM events
WHERE
    app_id = 'xyz123'
    AND timestamp > now() - INTERVAL 30 DAY
    AND user_plan IN ('pro', 'enterprise')
GROUP BY date, country
ORDER BY date DESC
```
**Optimization Strategies**:
1. **Materialized Views**: Pre-compute common aggregations
```sql
CREATE MATERIALIZED VIEW daily_user_stats
ENGINE = AggregatingMergeTree()
ORDER BY (date, app_id)
AS SELECT
    toDate(timestamp) AS date,
    app_id,
    uniqState(user_id) AS dau,    -- read back with uniqMerge(dau)
    countState() AS event_count   -- read back with countMerge(event_count)
FROM events
GROUP BY date, app_id
```
2. **Partition Pruning**: Data is partitioned by date, so a query over the last 7 days scans only 7 partitions instead of the full dataset.
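Partition pruning only works if the table declares a date partition key up front. A sketch of the kind of ClickHouse DDL involved (the column set is illustrative):

```sql
CREATE TABLE events (
    app_id     String,
    user_id    String,
    event_type LowCardinality(String),
    timestamp  DateTime,
    load_time  Float64
)
ENGINE = MergeTree()
PARTITION BY toDate(timestamp)  -- one partition per day
ORDER BY (app_id, timestamp)    -- primary index for app-scoped range scans
```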
3. **Intelligent Caching**:
- Hot data (last 24h): In-memory Redis cache
- Warm data (last 30 days): ClickHouse MergeTree engine
- Cold data (>30 days): S3 with slower access
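The routing decision between tiers is a pure function of event age. A minimal sketch (the tier names and thresholds mirror the list above; everything else is illustrative):

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Pick the storage tier for a data point based on its age
function storageTier(eventTimestamp, now = Date.now()) {
  const age = now - eventTimestamp;
  if (age <= DAY_MS) return 'redis';           // hot: last 24h
  if (age <= 30 * DAY_MS) return 'clickhouse'; // warm: last 30 days
  return 's3';                                 // cold: older than 30 days
}
```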
### Custom Metrics Engine
Customers can define arbitrary custom metrics, for example:
```javascript
{
  name: "Revenue per User",
  formula: "SUM(purchase_amount) / COUNT(DISTINCT user_id)",
  filters: { event_type: "purchase", status: "completed" }
}
```
We built a formula parser that:
- Validates syntax
- Translates to ClickHouse SQL
- Applies security constraints (prevent injection)
- Caches results
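The injection-prevention layer boils down to a whitelist: only known aggregate functions and column-like identifiers survive. A toy sketch of that idea (our production parser builds a full AST; this token check only illustrates the whitelist principle):

```javascript
const ALLOWED_FUNCTIONS = new Set(['SUM', 'COUNT', 'AVG', 'MIN', 'MAX']);

// Return true only if every word in the formula is a whitelisted
// function, the DISTINCT keyword, or a lowercase column identifier
function validateFormula(formula) {
  const tokens = formula.match(/[A-Za-z_][A-Za-z0-9_]*|\d+(\.\d+)?|[()+\-*/,]/g) || [];
  for (const token of tokens) {
    if (/^[A-Za-z_]/.test(token)) {
      const upper = token.toUpperCase();
      const isFunction = ALLOWED_FUNCTIONS.has(upper);
      const isKeyword = upper === 'DISTINCT';
      const isColumn = /^[a-z_][a-z0-9_]*$/.test(token); // lowercase = column
      if (!isFunction && !isKeyword && !isColumn) return false;
    }
  }
  return tokens.length > 0;
}
```

Only after a formula passes validation is it translated into parameterized ClickHouse SQL.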
## Performance Results
**Before optimization**:
- P50 latency: 3.2s
- P95 latency: 12s
- Query timeout rate: 8%
**After optimization**:
- P50 latency: 180ms ✅
- P95 latency: 450ms ✅
- Query timeout rate: 0.02% ✅
**Cost savings**: 45% reduction in infrastructure costs through:
- Data compression (10:1)
- Smarter caching strategy
- Spot instances for processing cluster
## Technical Challenges & Solutions
### Challenge 1: Late-Arriving Data
Events sometimes arrive hours late due to mobile offline sync.
**Solution**: Implemented a late-data processor:
```javascript
// Re-compute affected aggregations when late data arrives
if (event.timestamp < Date.now() - 3600000) { // more than 1 hour late
  invalidateCaches(event.app_id, event.date);
  recomputeMaterializedViews(event.date);
}
```
### Challenge 2: Data Skew
One customer sends 40% of all events, causing hot partitions.
**Solution**: Hybrid partitioning strategy:
- Partition by date (standard)
- Sub-partition large customers
- Use consistent hashing for even distribution
### Challenge 3: Concurrent Query Overload
During business hours (9-11 AM), dashboard queries spike 10x.
**Solution**: Implemented query queue with priority:
- Real-time user queries: High priority
- Scheduled reports: Low priority
- Background jobs: Run off-peak only
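The priority scheme above can be sketched as a small queue (simplified; production uses a proper heap plus per-tenant fairness, and the names here are illustrative):

```javascript
// Lower number = higher priority
const PRIORITY = { realtime: 0, scheduled: 1, background: 2 };

class QueryQueue {
  constructor() { this.items = []; }
  enqueue(query, kind) {
    this.items.push({ query, priority: PRIORITY[kind] });
    // Stable sort (ES2019+) keeps FIFO order within a priority level
    this.items.sort((a, b) => a.priority - b.priority);
  }
  dequeue() {
    const item = this.items.shift();
    return item ? item.query : null;
  }
  get depth() { return this.items.length; }
}
```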
Auto-scaling based on consumer lag (our proxy for queue depth):
```yaml
# Kubernetes HPA config (external metric via a Prometheus adapter)
metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_lag
      target:
        type: Value
        value: "1000"
```
## Scaling Strategies
### Horizontal Scaling
ClickHouse cluster with 12 nodes:
- 3 shards (data distribution)
- 4 replicas per shard (high availability)
### Vertical Scaling
Each node: 32 CPU cores, 128GB RAM. Why?
- ClickHouse is CPU-bound for queries
- More RAM = larger caches = faster queries
### Read Replicas
Dashboard queries hit read replicas. Write operations go to primary cluster. This separation prevents query load from impacting ingestion.
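The read/write split is enforced at the connection-routing layer. A minimal sketch (host names are placeholders):

```javascript
const PRIMARY = 'clickhouse-primary.internal';
const REPLICAS = ['clickhouse-replica-1.internal', 'clickhouse-replica-2.internal'];

let nextReplica = 0;

// Writes go to the primary; reads round-robin across replicas
function routeQuery(sql) {
  const isWrite = /^\s*(INSERT|ALTER|CREATE|DROP)/i.test(sql);
  if (isWrite) return PRIMARY;
  const host = REPLICAS[nextReplica % REPLICAS.length];
  nextReplica += 1;
  return host;
}
```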
## Monitoring & Alerting
Critical metrics we track:
```javascript
{
  ingestion: {
    events_per_second: 23000, // current sustained rate
    kafka_lag: '<1000',       // alert if >5000
    processing_delay: '<5s'
  },
  query: {
    p50_latency: '<200ms',
    p95_latency: '<500ms',
    p99_latency: '<2s',
    error_rate: '<0.1%'
  },
  infrastructure: {
    clickhouse_cpu: '<70%',
    disk_usage: '<80%',
    memory_usage: '<85%'
  }
}
```
We use Prometheus + Grafana for dashboards and PagerDuty for alerts.
## Cost Optimization
**Total infrastructure cost**: $28K/month for 2B events/day
Breakdown:
- ClickHouse cluster (AWS r6i.4xlarge): $18K
- Kafka cluster (3 nodes): $4K
- Redis cache: $2K
- Data transfer & storage: $4K
Cost per million events: ~$0.47 ($28K / ~60B events per month)
**Optimization wins**:
- Compression saved 60% storage costs
- Spot instances for processing: 70% cheaper
- Data lifecycle (move old data to S3): 40% savings
## Tech Stack
- **Ingestion**: Apache Kafka, Kafka Streams
- **Storage**: ClickHouse (analytical), PostgreSQL (metadata), Redis (cache)
- **Processing**: Node.js, Python (data science jobs)
- **Infrastructure**: Kubernetes (AWS EKS), Terraform
- **Monitoring**: Prometheus, Grafana, Sentry
## Key Takeaways
1. **Choose the right database**: ClickHouse for analytics, not PostgreSQL
2. **Cache aggressively**: 80% of queries hit the same data
3. **Monitor everything**: You can't optimize what you don't measure
4. **Design for failure**: Replicas, circuit breakers, graceful degradation
5. **Optimize for cost early**: Cloud bills grow fast at scale
Real-time analytics at scale is solvable with the right architecture and tools. The key is understanding your query patterns and optimizing for them.