Tech Deep Dive · 15 December 2025 · 15 min read

Engineering Real-Time Analytics: Processing 2B Data Points Daily

How we built a sub-second analytics platform using ClickHouse, Kafka, and a custom aggregation engine for an enterprise SaaS serving 200+ customers.


Dan Ramdani

Dan Labs


Real-time analytics sounds straightforward until you're processing 23,000 events per second and users expect query results in under 500ms. Here's how we built it.

## The Problem Space

Our client, a B2B analytics platform, needed to:
- Ingest 2 billion data points daily from 200+ customers
- Support complex queries (aggregations, filters, custom metrics)
- Return results in <500ms (P99 latency)
- Handle concurrent queries from thousands of users
- Support white-label deployments with data isolation

Traditional approaches failed:
- **PostgreSQL**: Great for transactions, terrible for analytical queries at this scale
- **Elasticsearch**: Fast searches, poor aggregation performance
- **Pre-aggregated tables**: Too rigid, couldn't support custom metrics

## Architecture Overview

### Data Ingestion Pipeline

**Step 1: Event Collection**
Customers send events via REST API or SDK. We chose **Apache Kafka** for ingestion because:
- Handles 23K events/sec easily
- Provides replay capability (critical for debugging)
- Decouples ingestion from processing
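In practice, Step 1 boils down to validate-then-produce. A minimal sketch of the batching logic in Node.js — `produce` is a stand-in for the real Kafka producer call, and the field names are illustrative:

```javascript
// Batch validated events before handing them to a Kafka producer.
// `produce(topic, events)` is a placeholder for the real producer call.
function createEventBatcher(produce, maxBatch = 500) {
  let batch = [];

  function isValid(event) {
    return typeof event.app_id === 'string' &&
           typeof event.user_id === 'string' &&
           Number.isFinite(event.timestamp);
  }

  return {
    add(event) {
      if (!isValid(event)) return false;   // reject malformed events early
      batch.push(event);
      if (batch.length >= maxBatch) this.flush();
      return true;
    },
    flush() {
      if (batch.length === 0) return;
      produce('raw-events', batch);        // one produce call per batch
      batch = [];
    },
  };
}
```

Batching before the produce call is what lets a handful of API nodes keep up with 23K events/sec without one Kafka round trip per event.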

**Step 2: Stream Processing**
We use Kafka Streams for real-time processing:
```java
// Enrich events with user metadata (Kafka Streams DSL; the domain
// accessors and EnrichedEvent type are illustrative)
events
    .filter((key, event) ->
        event.getTimestamp() > System.currentTimeMillis() - 86_400_000L) // last 24h only
    .join(userTable, (event, user) -> new EnrichedEvent(
        event,
        user.getCountry(),          // userCountry
        user.getSubscriptionPlan()  // userPlan
    ))
    .to("enriched-events");
```

**Step 3: Storage**
We evaluated 5 databases and chose **ClickHouse** because:
- 100x faster than PostgreSQL for analytical queries
- Built-in compression (10:1 ratio, saving storage costs)
- Horizontal scaling through sharding
- Excellent aggregation performance

### Query Processing

Users build custom dashboards with complex metrics. Example:
```sql
SELECT
    date_trunc('day', timestamp) AS date,
    country,
    COUNT(DISTINCT user_id) AS active_users,
    AVG(session_duration) AS avg_session,
    quantile(0.95)(load_time) AS p95_load_time  -- ClickHouse quantile syntax
FROM events
WHERE
    app_id = 'xyz123'
    AND timestamp > now() - INTERVAL 30 DAY
    AND user_plan IN ('pro', 'enterprise')
GROUP BY date, country
ORDER BY date DESC
```

**Optimization Strategies**:

1. **Materialized Views**: Pre-compute common aggregations
```sql
CREATE MATERIALIZED VIEW daily_user_stats
ENGINE = AggregatingMergeTree()
ORDER BY (date, app_id)
AS SELECT
    toDate(timestamp) AS date,
    app_id,
    uniqState(user_id) AS dau,    -- query with uniqMerge(dau)
    countState() AS event_count   -- query with countMerge(event_count)
FROM events
GROUP BY date, app_id
```

2. **Partition Pruning**: Data partitioned by date. Queries for last 7 days only scan 7 partitions, not full dataset.

3. **Intelligent Caching**:
- Hot data (last 24h): In-memory Redis cache
- Warm data (last 30 days): ClickHouse MergeTree engine
- Cold data (>30 days): S3 with slower access
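The tier lookup itself is a small routing function. A sketch, with the tier names standing in for the actual clients (thresholds match the list above):

```javascript
// Route a query to a storage tier based on the oldest timestamp it touches.
const DAY_MS = 86_400_000;

function storageTier(queryStartTs, nowTs = Date.now()) {
  const ageMs = nowTs - queryStartTs;
  if (ageMs <= DAY_MS) return 'redis';           // hot: last 24h, in-memory
  if (ageMs <= 30 * DAY_MS) return 'clickhouse'; // warm: last 30 days
  return 's3';                                   // cold: older, slower access
}
```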

### Custom Metrics Engine

Customers can define arbitrary custom metrics, for example:
```javascript
{
  name: "Revenue per User",
  formula: "SUM(purchase_amount) / COUNT(DISTINCT user_id)",
  filters: { event_type: "purchase", status: "completed" }
}
```

We built a formula parser that:
- Validates syntax
- Translates to ClickHouse SQL
- Applies security constraints (prevent injection)
- Caches results
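A whitelist is the safest shape for such a parser: only known aggregate functions and column names survive translation, so arbitrary SQL can never reach ClickHouse. A simplified sketch — the function and column whitelists are illustrative, and a production parser would also handle nesting and operator precedence:

```javascript
// Translate a restricted formula like
// "SUM(purchase_amount) / COUNT(DISTINCT user_id)" into SQL,
// allowing only whitelisted functions and columns.
const ALLOWED_FUNCS = new Set(['SUM', 'COUNT', 'AVG', 'MIN', 'MAX']);
const ALLOWED_COLS = new Set(['purchase_amount', 'user_id', 'session_duration']);

function translateFormula(formula) {
  const callRe = /([A-Z]+)\(\s*(DISTINCT\s+)?([a-z_]+)\s*\)/g;
  const calls = [];
  // Replace each valid aggregate call with a placeholder token.
  const skeleton = formula.replace(callRe, (m, fn, distinct, col) => {
    if (!ALLOWED_FUNCS.has(fn)) throw new Error(`function not allowed: ${fn}`);
    if (!ALLOWED_COLS.has(col)) throw new Error(`column not allowed: ${col}`);
    calls.push(`${fn}(${distinct ? 'DISTINCT ' : ''}${col})`);
    return `@${calls.length - 1}`;
  });
  // Anything left besides arithmetic, digits and placeholders is rejected.
  if (!/^[\s@0-9+\-*\/().]*$/.test(skeleton)) {
    throw new Error('formula contains disallowed tokens');
  }
  // Substitute the validated calls back in.
  return skeleton.replace(/@(\d+)/g, (_, i) => calls[i]);
}
```

Because validation happens on the parsed structure rather than by escaping strings, an input like `SUM(purchase_amount); DROP TABLE events` is rejected outright instead of being sanitized.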

## Performance Results

**Before optimization**:
- P50 latency: 3.2s
- P95 latency: 12s
- Query timeout rate: 8%

**After optimization**:
- P50 latency: 180ms ✅
- P95 latency: 450ms ✅
- Query timeout rate: 0.02% ✅

**Cost savings**: 45% reduction in infrastructure costs through:
- Data compression (10:1)
- Smarter caching strategy
- Spot instances for processing cluster

## Technical Challenges & Solutions

### Challenge 1: Late-Arriving Data
Events sometimes arrive hours late due to mobile offline sync.

**Solution**: Implemented a late-data processor:
```javascript
// Re-compute affected aggregations when late data arrives
if (event.timestamp < Date.now() - 3600000) { // >1 hour late
  invalidateCaches(event.app_id, event.date);
  recomputeMaterializedViews(event.date);
}
```

### Challenge 2: Data Skew
One customer sends 40% of all events, causing hot partitions.

**Solution**: Hybrid partitioning strategy:
- Partition by date (standard)
- Sub-partition large customers
- Use consistent hashing for even distribution
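A sketch of the sub-partitioning key derivation. The bucket counts and the FNV-1a hash are illustrative; a production version would use a consistent-hash ring so that changing a bucket count reassigns as few keys as possible:

```javascript
// Spread a high-volume customer's events across N sub-buckets so no
// single partition becomes hot; small customers keep one bucket.
const LARGE_CUSTOMERS = { 'app-big': 8 }; // app_id -> sub-bucket count (illustrative)

// FNV-1a: a cheap, well-distributed non-cryptographic string hash.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function partitionKey(appId, userId) {
  const buckets = LARGE_CUSTOMERS[appId] || 1;
  if (buckets === 1) return appId;                 // small customer: one bucket
  return `${appId}#${fnv1a(userId) % buckets}`;    // heavy hitter: spread by user
}
```

Hashing on `user_id` keeps each user's events in one sub-bucket, so per-user aggregations still read a single partition.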

### Challenge 3: Concurrent Query Overload
During business hours (9-11 AM), dashboard queries spike 10x.

**Solution**: Implemented query queue with priority:
- Real-time user queries: High priority
- Scheduled reports: Low priority
- Background jobs: Run off-peak only
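The queue itself is simple; the value is in the drain order. A minimal sketch (background jobs are omitted since they only run off-peak):

```javascript
// Two-level priority queue: interactive dashboard queries always run
// before scheduled reports.
class QueryQueue {
  constructor() {
    this.high = []; // real-time user queries
    this.low = [];  // scheduled reports
  }
  enqueue(query, priority = 'low') {
    (priority === 'high' ? this.high : this.low).push(query);
  }
  next() {
    // High-priority queries drain first.
    if (this.high.length > 0) return this.high.shift();
    return this.low.shift(); // undefined when both queues are empty
  }
  get depth() {
    return this.high.length + this.low.length;
  }
}
```

A real implementation would also age low-priority entries so reports can't starve indefinitely during a long spike; that detail is omitted here.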

We auto-scale the processing cluster on Kafka consumer lag, a proxy for queue depth:
```yaml
# Kubernetes HPA: scale the processing cluster on Kafka consumer lag
metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_lag
      target:
        type: Value
        value: "1000"
```

## Scaling Strategies

### Horizontal Scaling
ClickHouse cluster with 12 nodes:
- 3 shards (data distribution)
- 4 replicas per shard (high availability)

### Vertical Scaling
Each node: 32 CPU cores, 128GB RAM. Why?
- ClickHouse is CPU-bound for queries
- More RAM = larger caches = faster queries

### Read Replicas
Dashboard queries hit read replicas. Write operations go to primary cluster. This separation prevents query load from impacting ingestion.
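At the connection layer this is a small routing decision. A sketch with placeholder hostnames:

```javascript
// Send writes to the primary cluster, reads to a replica (round-robin).
const PRIMARY = 'clickhouse-primary:9000';                  // placeholder hostname
const REPLICAS = ['ch-replica-1:9000', 'ch-replica-2:9000']; // placeholder hostnames
let rr = 0;

function endpointFor(operation) {
  if (operation === 'write') return PRIMARY;
  rr = (rr + 1) % REPLICAS.length; // rotate across read replicas
  return REPLICAS[rr];
}
```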

## Monitoring & Alerting

Critical metrics we track:
```javascript
{
  ingestion: {
    events_per_second: 23000,
    kafka_lag: '<1000',        // Alert if >5000
    processing_delay: '<5s'
  },
  query: {
    p50_latency: '<200ms',
    p95_latency: '<500ms',
    p99_latency: '<2s',
    error_rate: '<0.1%'
  },
  infrastructure: {
    clickhouse_cpu: '<70%',
    disk_usage: '<80%',
    memory_usage: '<85%'
  }
}
```
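The real alert rules live in Prometheus, but the thresholds translate directly into checks. A toy evaluator over a few of the metrics above (names and limits here are illustrative):

```javascript
// Compare live metric values against alert thresholds; return firing alerts.
const THRESHOLDS = {
  kafka_lag: 5000,        // alert if consumer lag exceeds this
  p95_latency_ms: 500,
  clickhouse_cpu_pct: 70,
};

function firingAlerts(metrics) {
  return Object.entries(THRESHOLDS)
    .filter(([name, limit]) => metrics[name] > limit)
    .map(([name]) => name);
}
```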

We use Prometheus + Grafana for dashboards and PagerDuty for alerts.

## Cost Optimization

**Total infrastructure cost**: $28K/month for 2B events/day

Breakdown:
- ClickHouse cluster (AWS r6i.4xlarge): $18K
- Kafka cluster (3 nodes): $4K
- Redis cache: $2K
- Data transfer & storage: $4K

Cost per million events: ~$0.47 ($28K ÷ ~60B events per month)

**Optimization wins**:
- Compression saved 60% storage costs
- Spot instances for processing: 70% cheaper
- Data lifecycle (move old data to S3): 40% savings

## Tech Stack

- **Ingestion**: Apache Kafka, Kafka Streams
- **Storage**: ClickHouse (analytical), PostgreSQL (metadata), Redis (cache)
- **Processing**: Node.js, Python (data science jobs)
- **Infrastructure**: Kubernetes (AWS EKS), Terraform
- **Monitoring**: Prometheus, Grafana, Sentry

## Key Takeaways

1. **Choose the right database**: ClickHouse for analytics, not PostgreSQL
2. **Cache aggressively**: 80% of queries hit the same data
3. **Monitor everything**: You can't optimize what you don't measure
4. **Design for failure**: Replicas, circuit breakers, graceful degradation
5. **Optimize for cost early**: Cloud bills grow fast at scale

Real-time analytics at scale is solvable with the right architecture and tools. The key is understanding your query patterns and optimizing for them.
#analytics #real-time #big-data #ClickHouse #Kafka #performance
