The world beyond batch: Streaming 101

The first issue this post takes up is setting the record straight on what "streaming" means.

What is streaming?

The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines).

Our current use of "streaming" is imprecise, which leads to misconceptions about it: for example, assuming that streaming implies low latency, approximate results, and a lack of precision.

The crux is that we conflate what something is with how it has historically been accomplished.

So the author gives his own definition of streaming:

I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more.

He also gives separate definitions for the terms that commonly show up alongside "streaming", to keep them distinct:

Unbounded data: A type of ever-growing, essentially infinite data set. 
This term describes a property of the data set itself, whereas "streaming" describes a processing engine.

Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data.
which is at best misleading: repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived
A batch engine can process unbounded data via repeated runs, and a streaming engine can equally process bounded data, so this term is not a synonym for "streaming".
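A minimal sketch of this point (toy data and function names are made up, not from the post): slice an ever-growing log into bounded chunks and run an ordinary batch job on each chunk.

```python
from collections import defaultdict

def run_batch(events):
    """A toy batch job: count events per user for one bounded chunk."""
    counts = defaultdict(int)
    for user, _ts in events:
        counts[user] += 1
    return dict(counts)

def repeated_batch(unbounded_log, chunk_size):
    """Slice an ever-growing log into bounded chunks and run the batch
    job once per chunk, as batch engines have long done for unbounded data."""
    return [
        run_batch(unbounded_log[start:start + chunk_size])
        for start in range(0, len(unbounded_log), chunk_size)
    ]

log = [("alice", 1), ("bob", 2), ("alice", 3), ("bob", 4), ("bob", 5)]
print(repeated_batch(log, 2))  # one result dict per bounded chunk
```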

Low-latency, approximate, and/or speculative results:

The author's point is simply that batch engines were not designed with low-latency scenarios in mind; batch can also be made low-latency and can produce approximate or speculative results. Conversely, a streaming system can also trade away some latency to reach accurate results.

So,

From here on out, any time I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more.

 

What can streaming do?

The recent rise of stream processing began with Storm, created by Twitter's Nathan Marz, and Storm is also where streaming picked up its low-latency, inaccurate/speculative-results reputation.

To provide eventually correct results, Marz proposed the Lambda Architecture. Simple as it looks, it offers a way to balance low-latency speculative results against eventual correctness.

The problem is equally obvious: you have to build and maintain two pipelines, one streaming and one batch, and that cost is substantial.

The author finds this architecture "a bit unsavory".

Unsurprisingly, I was a huge fan of Jay Kreps' Questioning the Lambda Architecture post when it came out.

So the next to appear is LinkedIn's Jay Kreps, with his Kafka-based Kappa Architecture.

This architecture is also simple, but it shows how to merge the two pipelines into one; more importantly, it replaces the batch pipeline with a well-designed streaming system, which was a major inspiration for the author.

The author's assessment of this architecture: "I'm not convinced that notion itself requires a name, but I fully support the idea in principle."

 

Quite honestly, I’d take things a step further. 
I would argue that well-designed streaming systems actually provide a strict superset of batch functionality.

The author goes further still: streaming is a strict superset of batch; in other words, this era no longer needs batch, and batch can be retired.

For streaming to beat batch, it needs to do only two things:

Correctness — This gets you parity with batch.

Achieving this alone makes streaming at least the equal of batch.

At the core, correctness boils down to consistent storage. 
Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures.
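As a toy illustration (this is not MillWheel's or Kafka's actual mechanism; the file name and checkpoint cadence are made up), checkpointing state together with the input offset might look like this, so a restarted worker neither drops nor double-counts records:

```python
import json
import os

STATE_FILE = "wordcount_state.json"  # hypothetical checkpoint location

def load_checkpoint():
    """Restore (state, input offset) after a crash; start fresh otherwise."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            snap = json.load(f)
        return snap["counts"], snap["offset"]
    return {}, 0

def checkpoint(counts, offset):
    """Persist state atomically together with the input offset, so a
    replay after failure resumes exactly where the last snapshot ended."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"counts": counts, "offset": offset}, f)
    os.replace(tmp, STATE_FILE)  # atomic rename

def process(stream, checkpoint_every=100):
    """A toy streaming word count that checkpoints periodically."""
    counts, offset = load_checkpoint()
    for i, word in enumerate(stream[offset:], start=offset):
        counts[word] = counts.get(word, 0) + 1
        if (i + 1) % checkpoint_every == 0:
            checkpoint(counts, i + 1)
    checkpoint(counts, len(stream))
    return counts
```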

 

If you’re curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out theMillWheel and Spark Streaming papers.

 

Tools for reasoning about time — This gets you beyond batch.

Achieving this is what takes streaming beyond batch.

Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

This is the author's central theme: how to reason about unbounded, unordered data.

Because in practice we usually need to process data according to event time, not processing time.

 

In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: 
lacking a predictable mapping between processing time and event time, how can you determine when you’ve observed all the data for a given event time X? For many real-world data sources, you simply can’t. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.

This problem is covered in detail in Streaming 102; it is essentially the content of the Dataflow paper.

 

Data processing patterns

Finally, the author walks through the data processing patterns in use today.

Bounded data

 

Unbounded data — batch

Fixed windows

 

Sessions

The difference from the fixed windows above: artificially drawn fixed-window boundaries cut sessions apart, as shown in red in the figure.
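A minimal session-window sketch (the gap threshold and data are made up): events separated by at most a gap form one session, whereas a fixed boundary at, say, t=10 would have cut the burst 8, 9, 11, 12 in two.

```python
def session_windows(event_times, gap):
    """Group sorted event timestamps into sessions: a new session starts
    whenever the silence since the previous event exceeds `gap`."""
    sessions, current = [], []
    for t in sorted(event_times):
        if current and t - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# One burst of activity straddles t=10; a fixed [0,10)/[10,20) split
# would cut it in two, but a session window keeps it whole.
print(session_windows([8, 9, 11, 12, 30], gap=5))  # [[8, 9, 11, 12], [30]]
```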

 

Unbounded data — streaming

In practice, unbounded data tends to have two characteristics:

  • Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
  • Of varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y.
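Both points can be illustrated with made-up (processing_time, event_time) pairs; the "time-based shuffle" here is simply a sort by event time:

```python
# Made-up (processing_time, event_time) pairs: arrival order differs
# from occurrence order, and the skew between the two varies per record.
arrivals = [(101, 100), (102, 90), (103, 101), (104, 95), (105, 104)]

skews = [p - e for p, e in arrivals]  # varying skew, not a constant epsilon
by_event_time = sorted(arrivals, key=lambda pair: pair[1])  # the "shuffle"

print(skews)                          # [1, 12, 2, 9, 1]
print([e for _, e in by_event_time])  # [90, 95, 100, 101, 104]
```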

For such data, the processing approaches fall into the following categories:

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant — i.e., all relevant logic is data driven.

This is the simplest case: the application is time-independent and therefore stateless; map and filter both fall into this category.

There is not much to say about this scenario; any streaming platform handles it well.
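A minimal sketch (the record format is made up): a per-element map and filter that reference no state and no time, so they behave identically on bounded files and unbounded streams.

```python
def parse(line):
    """Map: turn a raw "user,amount" line into a record."""
    user, amount = line.split(",")
    return {"user": user, "amount": int(amount)}

def time_agnostic(lines):
    """Per-element map + filter: no state, no notion of time, hence
    indifferent to whether the input ever ends."""
    for line in lines:
        record = parse(line)
        if record["amount"] > 0:
            yield record

print(list(time_agnostic(["alice,5", "bob,0", "carol,3"])))
```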

 

Approximation algorithms

The second major category of approaches is approximation algorithms, such as approximate Top-N, streaming K-means, etc.
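As a concrete instance (a standard algorithm, not one the post names), the Misra-Gries heavy-hitters summary approximates Top-N in a single pass with bounded memory, trading exactness for low overhead:

```python
def misra_gries(stream, k):
    """One-pass heavy-hitters summary using at most k-1 counters.
    Frequent items survive the repeated decrements; rare ones are evicted."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; evict any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 6 + ["b", "c", "a", "b"]
print(misra_gries(stream, 2))  # {'a': 4} -- 'a' survives as the heavy hitter
```

The counts are undercounts, not exact frequencies; that inexactness is precisely the compromise this category of approaches makes.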

 

Windowing by processing time

There are a few nice properties of processing time windowing:

  • It’s simple
  • Judging window completeness is straightforward.
  • If you’re wanting to infer information about the source as it is observed, processing time windowing is exactly what you want.

 

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened.

It’s the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it.

This is the approach the author champions; he considers it the gold standard of windowing, yet most systems today lack native support for it because it is genuinely hard. Solving it is the author's main contribution, described in detail in Streaming 102.
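In contrast with binning by arrival time, a minimal event-time windowing sketch (toy data, hypothetical function name) keys each record by when it occurred, so late data still lands in the right window:

```python
from collections import defaultdict

def event_time_windows(records, width):
    """Tumbling windows keyed by event time: a late-arriving record
    still lands in the window for when it happened, which is why old
    windows must stay buffered until they are believed complete."""
    windows = defaultdict(list)
    for event_time, payload in records:
        windows[event_time // width * width].append(payload)
    return dict(windows)

# Arrival order: the event from t=2 shows up last (late data) ...
print(event_time_windows([(6, "b"), (11, "c"), (2, "a")], 5))
# ... yet 'a' still joins window [0, 5): {5: ['b'], 10: ['c'], 0: ['a']}
```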

 

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:

Buffering: Due to extended window lifetimes, more buffering of data is required.

Completeness: Given that we often have no good way of knowing when we’ve seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don’t.
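One common heuristic for deciding when to materialize is a watermark: the maximum event time seen so far, minus an allowed-lateness slack. The sketch below (names and policy are made up; this is a guess-based illustration, not the post's prescribed solution) shows both drawbacks at once: windows stay buffered well past their own width, and any record later than the slack would still be missed.

```python
def materialize_with_watermark(records, width, allowed_lateness):
    """Emit a tumbling event-time window only once a heuristic watermark
    (max event time seen minus an allowed-lateness slack) passes its end.
    The answer is a guess, not a guarantee of completeness."""
    open_windows, emitted, watermark = {}, [], float("-inf")
    for event_time, payload in records:
        start = event_time // width * width
        open_windows.setdefault(start, []).append(payload)  # buffering
        watermark = max(watermark, event_time - allowed_lateness)
        for s in sorted(open_windows):
            if s + width <= watermark:
                emitted.append((s, open_windows.pop(s)))
    return emitted, open_windows

emitted, still_open = materialize_with_watermark(
    [(1, "a"), (6, "b"), (3, "c"), (12, "d")], width=5, allowed_lateness=2)
print(emitted)     # [(0, ['a', 'c']), (5, ['b'])] -- late 'c' was captured
print(still_open)  # {10: ['d']} -- still buffered, waiting on the watermark
```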
