Stream and Batch Processing Frameworks31745 2019-02-16 22:13
Why such frameworks? #
- process high-throughput in low latency.
- fault-tolerance in distributed systems.
- generic abstraction to serve volatile business requirements.
- for bounded data set (batch processing) and for unbounded data set (stream processing).
Brief history of batch/stream processing #
- Hadoop and MapReduce. Google made batch processing as simple as MR
result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs)in a distributed system.
- Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
- Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
- Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.
Wait… But why is Flink gaining popularity? #
- its fast adoption of Google Dataflow/Beam programming model.
- its highly efficient implementation of Chandy-Lamport checkpointing.
Architectural Choices #
To serve requirements above with commodity machines, the steaming framework use distributed systems in these architectures…
- master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
- P2P (decentralized): apache s4.
- DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
- Delivery Guarantees. How guaranteed to deliver data from nodes to nodes? at-least once / at-most once / exactly once.
- Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
- Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.
|Overhead of fault-tolerance||high||medium||medium||low|