The Difference Between Demo and Scale
My first real-time pipeline was built in a week. It worked perfectly on my laptop: Kafka topics, some stream processing with Spark, results dropped into a database. The demo impressed stakeholders.
Then we scaled to production data—100x the volume we'd tested with. The system fell apart in spectacular ways.
That's when I learned that data pipelines don't scale linearly. They break in unexpected phases: first at 2x scale, then again at 10x, then again at 100x. Each phase requires architectural rethinking.
Kafka is Not a Truck
When you first encounter Kafka, it seems magical: write data to a topic, consume it somewhere else, and messages are guaranteed to be delivered. In reality, Kafka is a distributed log with strong guarantees and sharp edges.
The critical misunderstandings:
1. Ordering is tricky. Kafka guarantees per-partition ordering, not global ordering. If you have 10 partitions, events can arrive out of order across partitions. If your application depends on strict global ordering, you must either use a single partition (which kills scalability) or handle out-of-order events in your processing logic.
2. Delivery semantics matter. Kafka offers "at least once" delivery by default. This means your processing logic must be idempotent. If the same message is delivered twice, it must produce the same result both times. Most teams don't build idempotent consumers initially and suffer for it.
3. Consumer lag is invisible until it's not. In production, you'll have days where everything runs fine, then one day Kafka's disk fills up, or a consumer crashes, and suddenly you're 8 hours behind. You need monitoring not just for Kafka, but for lag itself.
# At-least-once delivery requires idempotent processing
def process_message(message):
    msg_id = message['id']

    # Skip messages we've already processed
    if db.seen_messages.exists(msg_id):
        return

    result = expensive_calculation(message)

    # Store result and mark as seen atomically
    with db.transaction():
        db.results.insert(result)
        db.seen_messages.insert(msg_id)
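For the lag monitoring mentioned above, the core check is simple arithmetic over offsets. Here's a minimal sketch: in a real deployment the offset numbers would come from your Kafka client (end offsets versus committed offsets per partition); here they're plain dicts so the logic stands alone, and `check_lag` and its threshold are illustrative names, not any library's API.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag across partitions: how many messages are waiting."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )

def check_lag(end_offsets, committed_offsets, threshold=10_000):
    lag = consumer_lag(end_offsets, committed_offsets)
    if lag > threshold:
        # Hook your alerting (PagerDuty, Slack, ...) here
        return f"ALERT: consumer lag is {lag} messages"
    return f"OK: lag is {lag} messages"

# consumer_lag({0: 5000, 1: 7000}, {0: 4900, 1: 6800}) -> 300
```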
The Stateful Processing Trap
When you're processing streaming data, you inevitably need state: "how many events have we seen from this user in the last hour?" or "what's the running average price of this stock?"
Kafka Streams and Flink both handle this elegantly with state stores—local RocksDB instances that maintain your state. This works great until you need to reprocess data (a new bug fix requires re-running the pipeline). Now your state stores are out of sync with your code.
The lesson I learned the hard way: state should be derivable, not stored. Instead of "count of user events in the last hour," keep raw events and derive the count on demand. Instead of "running average price," store raw prices and average them at query time.
This costs more compute initially, but it's worth it:
- Reprocessing is trivial (delete old results, re-run)
- Debugging is easy (you have all the raw data)
- Code changes don't require state migration
- You can rebuild any derived metric without losing data
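To make "derivable, not stored" concrete, here's a sketch of the hourly-count example: raw events are kept, and the count is computed on demand. The in-memory `events` list stands in for your immutable event store; the function name is illustrative.

```python
from datetime import datetime, timedelta

def events_in_last_hour(events, user_id, now):
    """Derive a user's hourly event count from raw events on demand."""
    cutoff = now - timedelta(hours=1)
    return sum(
        1 for e in events
        if e["user_id"] == user_id and cutoff <= e["timestamp"] <= now
    )
```

There is no counter to keep in sync: when the code changes, re-running the query against the same raw events gives the corrected answer.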
The Latency vs. Throughput Frontier
Real-time pipelines force a choice: do you optimize for latency (fast individual events) or throughput (many events per second)?
Latency-optimized pipelines process events as soon as they arrive. You buffer minimally. Each event spends milliseconds in the system. This requires bespoke tuning and breaks at scale.
Throughput-optimized pipelines batch events. You collect 1,000 events, process them together, then move to the next batch. This sacrifices latency (you might wait 100ms for a batch to fill) but buys dramatically better throughput.
The Pareto frontier for most real-time applications looks like this:
- Ingest events with minimal buffering (to keep perceived latency low)
- Process them in micro-batches of 100-1,000 events (for good throughput)
- Make results available within 1-5 seconds (good latency, great throughput)
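The micro-batch idea above can be sketched in a few lines: flush when the batch reaches a size limit or has been waiting too long, whichever comes first. `MicroBatcher` and its parameters are illustrative, not a real framework's API.

```python
import time

class MicroBatcher:
    """Collect events and flush by size or by age, whichever hits first."""

    def __init__(self, process_batch, max_size=1000, max_wait=0.1):
        self.process_batch = process_batch
        self.max_size = max_size    # flush after this many events...
        self.max_wait = max_wait    # ...or this many seconds
        self.batch = []
        self.started = None

    def add(self, event):
        if not self.batch:
            self.started = time.monotonic()
        self.batch.append(event)
        if (len(self.batch) >= self.max_size
                or time.monotonic() - self.started >= self.max_wait):
            self.flush()

    def flush(self):
        if self.batch:
            self.process_batch(self.batch)
            self.batch = []
```

Tuning `max_size` and `max_wait` is exactly the latency/throughput dial described above: larger batches raise throughput, a shorter wait caps worst-case latency.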
For actual low-latency work (stock trading, auto-bidding), you go further and optimize at the microsecond level. Most teams should not. The infrastructure complexity multiplies exponentially.
Monitoring is Not Optional
The scariest moment in production data engineering is when your pipeline runs fine, but your results are silently wrong.
Maybe a Kafka rebalance caused some messages to be skipped. Or a bug in your processing logic quietly dropped 0.1% of the data. Or clock skew on one server produced invalid timestamps. These things happen.
You need:
- End-to-end tests: generate known data, push it through the pipeline, verify results match
- Sampling and comparisons: take 1% of data, process it with the old version and the new version, compare
- Data quality metrics: track not just that the pipeline is running, but that the output makes sense (distribution of values, min/max ranges, null counts)
- Alerting on lag: if you're more than 5 minutes behind, you want to know immediately
- Audit logs: every pipeline run should log what was processed, when, and what the output was
# Data quality check (not latency check)
def validate_batch(results):
    # Row count should be within 10% of the expected ~10,000
    assert 9000 < len(results) < 11000

    for row in results:
        # All timestamps should be recent
        assert now() - row.timestamp < timedelta(hours=1)
        # No NULLs or nonsense in required fields
        assert row.user_id is not None
        assert row.value > 0
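The sampling-and-comparison check from the list above can be sketched just as simply: run a 1% sample through both pipeline versions and report divergence. `old_process` and `new_process` are stand-ins for your two versions of the processing logic; a fixed seed keeps the sample reproducible.

```python
import random

def compare_versions(events, old_process, new_process,
                     sample_rate=0.01, seed=42):
    """Run a sample through both versions; return (sample size, mismatches)."""
    rng = random.Random(seed)
    sample = [e for e in events if rng.random() < sample_rate]
    mismatches = [e for e in sample if old_process(e) != new_process(e)]
    return len(sample), len(mismatches)
```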
Reprocessing is an Escape Hatch
The best investment you can make in data pipeline reliability is the ability to reprocess historical data quickly and safely.
When a bug is discovered, you don't want to manually fix data. You want to fix the code, reprocess, and overwrite old results. This requires:
- Immutable raw event storage (typically S3 or similar)
- A reprocessing job that can read from a specific date range
- Idempotent processing (so running it twice produces the same results)
- Version control for your processing logic (so you can match results to code)
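The four requirements above fit in a small sketch. In-memory dicts stand in for S3 and the results store; because results are keyed by day and overwritten whole, running the job twice produces identical output (idempotent), and stamping each run with `code_version` lets you match results to code.

```python
from datetime import date

def reprocess(raw_store, results_store, start, end, transform, code_version):
    """Re-run `transform` over every day in [start, end], overwriting results."""
    day = start
    while day <= end:
        events = raw_store.get(day, [])       # immutable raw events
        results_store[day] = {                # keyed overwrite: idempotent
            "version": code_version,          # match results to code
            "rows": [transform(e) for e in events],
        }
        day = date.fromordinal(day.toordinal() + 1)
```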
With this in place, discovering data corruption becomes routine: fix the bug, run the reprocessing job, and move on. Without it, you're manually patching records in production, which is a disaster.
The Real-Time Promise and Reality
Real-time data pipelines promise insights with minimal latency. The reality is that they're complex distributed systems with many failure modes.
Build what you actually need: if your business decisions can tolerate 1-hour latency, batch pipelines are simpler. If you actually need sub-second latency, know that you're signing up for significant infrastructure complexity and ongoing operational work.
The sweet spot for most organizations is "near real-time": data available within a few seconds to a few minutes. Build for that, monitor relentlessly, and have a reprocessing strategy you trust.