Structured Streaming Programming Guide

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming

 

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. 
You can express your streaming computation the same way you would express a batch computation on static data.

The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine.

Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.

In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

You can use the DataFrame API just as you would on a static data source; these queries run on the same optimized Spark SQL engine as batch jobs,
and exactly-once fault tolerance is guaranteed through checkpointing and Write Ahead Logs.

 

Compared with the DStream API, the conceptual change is that the DStream abstraction is replaced by a DataFrame, i.e. the input stream is modeled as an unbounded table.

This makes structured operations possible, and the code is essentially the same as processing batch data; the difference is small, as the sketch below shows.
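A minimal word-count sketch in Scala, assuming a socket source on localhost:9999 (host and port are placeholders); the streaming DataFrame is manipulated exactly like a batch one:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // The input stream is an unbounded table: each line arriving on the socket is a new row
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame/Dataset operations as on static data
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()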

 

The overall flow is as follows.

Note that the output mode here is complete: because the query contains an aggregation, each trigger has to emit the statistics accumulated up to now.

The output mode can be one of the following:

The “Output” is defined as what gets written out to the external storage. The output can be defined in different modes

  • Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
  • Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
  • Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (not available yet in Spark 2.0). Note that this is different from the Complete Mode in that this mode does not output the rows that are not changed.

Complete mode is what the example above uses.

Append mode only outputs the new rows since the last trigger, which is appropriate for queries without aggregation; see the sketch after this paragraph.
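A sketch of starting the query, reusing the wordCounts DataFrame from the sketch above and a console sink as a placeholder; the aggregated query must use complete (or, in later releases, update) mode, while a query without aggregation, such as the raw lines, would use append:

    // Start the streaming query; the entire Result Table is re-emitted on every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()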

 

Window Operations on Event Time

Spark treats event time as naturally supported: event time is simply another column in the DataFrame, so a windowed aggregation is just a groupBy over a window of that column.

Late data can also be handled, because results are output incrementally and the engine can update the aggregates of earlier windows when late records arrive. A sketch follows.
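A sketch of an event-time window aggregation, assuming a streaming DataFrame named events with columns timestamp and word (both names are placeholders); the watermark shown here is an optional addition introduced in Spark 2.1 that bounds how late records may arrive so old window state can eventually be dropped:

    import org.apache.spark.sql.functions.{col, window}

    // Event time is just another column; group by a sliding 10-minute window over it
    val windowedCounts = events
      .withWatermark("timestamp", "10 minutes")   // bound on how late data may be (Spark 2.1+)
      .groupBy(
        window(col("timestamp"), "10 minutes", "5 minutes"),
        col("word"))
      .count()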

 

Fault Tolerance Semantics

Delivering end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming. 
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.

This relies on the source being replayable from a given offset and the sink being idempotent; then, by recording offsets in the Write Ahead Log and state in checkpoints, exactly-once semantics can be achieved, because the processing is essentially a sequence of batches. A sketch is given below.
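A sketch of enabling fault tolerance when starting a query, again reusing wordCounts; the checkpoint directory path is a placeholder, and the console sink stands in for an idempotent sink such as a file or Kafka sink used in practice:

    // Offsets go to the write-ahead log and aggregation state to the checkpoint
    // directory, so the query can be restarted after a failure and resume exactly once
    val recoverableQuery = wordCounts.writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/path/to/checkpoint/dir")
      .format("console")
      .start()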
