Live long and process (#LLAP)
sershe, sseth, hagleitn, 2014-08-27.
Overview
Hive has become significantly faster thanks to various features and improvements
that were built by the community over the past two years. To keep the
momentum going, here are some examples of what we think will take us to the next
level: asynchronous spindle-aware I/O, pre-fetching and caching of column chunks,
and multi-threaded JIT-friendly operator pipelines.
In order to achieve this we are proposing a hybrid execution model which consists
of a long-lived daemon replacing direct interactions with the HDFS DataNode and a
tightly integrated DAG-based framework. Functionality such as caching,
pre-fetching, some query processing and access control will move into the
daemon. Small/short queries can be largely processed by this daemon directly,
while any heavy lifting will be performed in standard YARN containers.
Similar to the DataNode, LLAP daemons can be used by other applications as well,
especially if a relational view on the data is preferred over file-centric processing.
We’re thus planning to open the daemon up through optional APIs (e.g.,
InputFormat) that can be leveraged by other data processing frameworks as a
building block.
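As a rough illustration of the InputFormat route, the sketch below shows how an
external framework might read records through an LLAP-backed input format. Only
the Hadoop mapred interfaces are real; LlapInputFormat and the configuration key
are hypothetical placeholders, not a committed API.

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class LlapReadExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // hypothetical key: the relational fragment to push into the daemon
    conf.set("llap.fragment.sql", "SELECT id, name FROM users WHERE id > 100");
    InputFormat<NullWritable, Text> fmt = new LlapInputFormat(); // hypothetical class
    for (InputSplit split : fmt.getSplits(conf, 4)) {
      RecordReader<NullWritable, Text> reader =
          fmt.getRecordReader(split, conf, Reporter.NULL);
      NullWritable key = reader.createKey();
      Text value = reader.createValue();
      while (reader.next(key, value)) {
        System.out.println(value); // one serialized row per record
      }
      reader.close();
    }
  }
}
```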
Last, but not least, fine-grained column-level access control -- a key requirement
for mainstream adoption of Hive -- fits nicely into this model.
Figure: Example of execution with #LLAP. The Tez AM orchestrates overall
execution. The initial stage of the query is pushed into #LLAP; large
shuffles are performed in separate containers. Multiple queries
and applications can access #LLAP concurrently.
Persistent daemon
To facilitate caching and JIT optimization, and to eliminate most of the startup
costs, we will run a daemon on the worker nodes of the cluster. The daemon will
handle I/O, caching, and query fragment execution.
● These nodes will be stateless. Any request to an #LLAP node will contain
the data location and metadata; the node will process both local and remote
locations, with locality being the caller’s (YARN’s) responsibility. A sketch
of such a self-contained request follows this list.
● Recovery/resiliency. Failure and recovery are simplified because any data
node can still be used to process any fragment of the input data. The Tez
AM can thus simply rerun failed fragments on the cluster.
● Communication between nodes. #LLAP nodes will be able to share data
(e.g., fetching partitions, broadcasting fragments). This will be realized
with the same mechanisms used today in Tez.
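To make the stateless contract concrete, here is an illustrative sketch of what
a self-contained fragment request might carry; every field name below is
hypothetical.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a request to a stateless #LLAP daemon must carry
// everything needed to run the fragment. All field names are hypothetical.
public class FragmentRequest {
  String queryId;                   // originating query/session, for tracking and rerun
  byte[] fragmentPlan;              // serialized operator pipeline (filters, projections, ...)
  List<String> inputLocations;      // HDFS paths and ranges; local or remote to the daemon
  Map<String, String> fileMetadata; // schema, file format, stripe/block information
  long acidVersion;                 // which table snapshot to read (see the Acid section)
}
```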
Working within the existing execution model
#LLAP will work within the existing, process-based Hive execution model to
preserve the scalability and versatility of Hive. It will not replace the
existing execution model but enhance it.
● The daemons are optional. Hive will continue to work without them and
will also be able to bypass them even if they are deployed and operational.
Feature parity with regard to language features will be maintained.
● External orchestration and execution engines. #LLAP is not an
execution engine (like MR or Tez). Overall execution will be scheduled and
monitored by an existing Hive execution engine such as Tez, transparently
over both #LLAP nodes and regular containers. Obviously, the level of #LLAP
support will depend on each individual execution engine (starting
with Tez). MapReduce support is not planned, but other engines may be
added later. Other frameworks like Pig will also have the choice of using
#LLAP daemons.
● Partial execution. The result of the work performed by an #LLAP daemon
can either form part of the result of a Hive query, or be passed on to
external Hive tasks, depending on the query.
● Resource management. YARN will remain responsible for the
management and allocation of resources. The YARN container delegation
model will be used for users to transfer allocated resources to #LLAP. To
avoid the limitations of JVM memory settings, we will keep cached data, as
well as large buffers for processing (e.g., group by, joins), off-heap (see
the sketch after this list). This way, the daemon can use a small amount of
memory, and additional resources (i.e., CPU and memory) will be assigned
based on workload.
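A minimal sketch of the off-heap approach from the resource-management bullet,
using the standard ByteBuffer.allocateDirect; the bump-pointer allocator here
is a toy stand-in for a real cache allocator with free lists and eviction.

```java
import java.nio.ByteBuffer;

// Cached data and large processing buffers live in direct memory, outside
// the JVM heap, so the daemon itself can run with a small -Xmx.
public class OffHeapSlab {
  private final ByteBuffer slab = ByteBuffer.allocateDirect(256 * 1024 * 1024); // 256 MB

  synchronized ByteBuffer allocate(int size) {
    if (slab.remaining() < size) {
      throw new IllegalStateException("slab exhausted; eviction logic omitted");
    }
    ByteBuffer chunk = slab.slice(); // shares the direct memory, no copy
    chunk.limit(size);
    slab.position(slab.position() + size); // bump-pointer allocation
    return chunk;
  }
}
```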
Query fragment execution
For partial execution as described above, #LLAP nodes will execute “query
fragments” such as filters, projections, data transformations, partial aggregates,
sorting, bucketing, hash joins/semi-joins, etc. Only Hive code and blessed UDFs
will be accepted in #LLAP. No code will be localized and executed on the fly. This
is done for stability and security reasons.
● Parallel execution. The node will allow parallel execution for multiple
query fragments from different queries and sessions.
● Interface. Users can access #LLAP nodes directly via the client API. They
will be able to specify relational transformations and read data via
record-oriented streams (see the sketch below).
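To illustrate the intended shape of that client API, here is a hedged sketch;
all of the types below are hypothetical placeholders, not a committed interface.

```java
import java.util.Iterator;

// Hypothetical shapes only: a client submits a relational transformation
// and reads the results back as a record-oriented stream.
interface Row {
  Object get(int column);
}

interface RecordStream extends Iterator<Row>, AutoCloseable {}

interface LlapClient extends AutoCloseable {
  // ship a query fragment to the daemon; returns a stream of result records
  RecordStream submit(String relationalTransformation);
}
```

A caller would connect to a daemon, call submit() with its transformation, and
iterate the returned stream until it is exhausted.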
I/O
The daemon will off-load I/O, and the transformation from the compressed format,
to separate threads. The data will be passed on to execution as it becomes ready,
so previous batches can be processed while the next ones are being prepared.
The data will be passed to execution in a simple RLE-encoded columnar format
that is ready for vectorized processing; this will also be the caching format,
and is intended to minimize copying between I/O, cache, and execution.
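The overlap between preparation and execution described above amounts to a
bounded producer/consumer pipeline; a minimal sketch follows, where ColumnBatch
is a stand-in for the RLE-encoded batch format.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ColumnBatch { /* RLE-encoded column chunks, ready for vectorized processing */ }

public class IoPipeline {
  // small bounded queue: I/O stays a few batches ahead without unbounded memory use
  private final BlockingQueue<ColumnBatch> ready = new ArrayBlockingQueue<>(4);

  void ioThread() throws InterruptedException {
    while (moreData()) {
      ready.put(readAndDecodeNextBatch()); // blocks if execution falls behind
    }
  }

  void executionThread() throws InterruptedException {
    while (true) {
      // previous batches are processed here while the I/O thread prepares the next ones
      processVectorized(ready.take());
    }
  }

  // stubs for illustration only
  private boolean moreData() { return false; }
  private ColumnBatch readAndDecodeNextBatch() { return new ColumnBatch(); }
  private void processVectorized(ColumnBatch batch) {}
}
```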
● Multiple file formats. I/O and caching depend on some knowledge of the
underlying file format (especially if it is to be done efficiently). Therefore,
similar to Vectorization work, different file formats will be supported
through plugins specific to each format (starting with ORC). Additionally, a
generic, less-efficient plugin may be added that supports any Hive input
format. The plugins have to maintain metadata and transform the raw data
to column chunks.
● Predicates and bloom filters. SARGs and bloom filters will be pushed
down to the storage layer, if it supports them.
Caching
The daemon will cache metadata for input files, as well as the data. The metadata
and index information can be cached even for data that is not currently cached.
Metadata will be stored in-process as Java objects; cached data will be stored in
the format described in the I/O section, and kept off-heap (see Resource
management).
● Eviction policy. The eviction policy will be tuned for analytical workloads
with frequent (partial) table-scans. Initially, a simple policy like LRFU will
be used (a sketch of its scoring follows this list). The policy will be
pluggable.
● Caching granularity. Column-chunks will be the unit of data in the cache.
This achieves a compromise between low-overhead processing and storage
efficiency. The granularity of the chunks depends on particular file format
and execution engine (Vectorized Row Batch size, ORC stripe, etc.).
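For reference, a minimal sketch of LRFU scoring as it might apply here; since
the policy is pluggable, this is one possible starting point rather than a fixed
design. Each buffer keeps a combined recency/frequency value that decays
exponentially with age.

```java
// LRFU: lambda interpolates between LFU-like (small lambda) and
// LRU-like (lambda near 1) behavior.
public class LrfuScorer {
  private final double lambda; // tuning knob, e.g. 0.01

  public LrfuScorer(double lambda) { this.lambda = lambda; }

  // F(x) = (1/2)^(lambda * x): older references contribute less
  double weight(long age) {
    return Math.pow(0.5, lambda * age);
  }

  // On access at time 'now', fold the old score (last touched at 'lastAccess')
  // into the new one; eviction removes the buffer with the lowest score.
  double onAccess(double oldScore, long now, long lastAccess) {
    return weight(0) + weight(now - lastAccess) * oldScore; // weight(0) == 1.0
  }
}
```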
Workload Management
YARN will be used to obtain resources for different workloads. Once resources
(CPU, memory, etc.) have been obtained from YARN for a specific workload, the
execution engine can choose to delegate these resources to #LLAP, or to launch
Hive executors in separate processes. Resource enforcement via YARN has the
advantage of ensuring that nodes do not get overloaded, either by #LLAP or by
other containers. The daemons themselves will be under YARN’s control.
Acid
#LLAP will be aware of transactions. The merging of delta files to produce a
certain state of the tables will be performed before the data is placed in the
cache. Multiple versions are possible, and each request will specify which
version is to be used. This has the benefit of performing the merge
asynchronously, and only once for cached data, thus avoiding the hit on the
operator pipeline.
Security
#LLAP servers are a natural place to enforce access control at a more fine-grained
level than “per file”. Since the daemons know which columns and records are
processed, policies on these objects can be enforced. This is not intended to
replace the current mechanisms, but rather to enhance and open them up to other
applications as well.
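As an illustration of per-column enforcement, the sketch below shows the kind of
check a daemon could run before handing column chunks to a fragment; the
map-backed policy store is a stand-in for the real authorization mechanism.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Illustrative only: deny a fragment any column the user is not entitled to.
public class ColumnAccessGate {
  private final Map<String, Set<String>> allowedColumnsByUser; // stand-in policy store

  public ColumnAccessGate(Map<String, Set<String>> allowedColumnsByUser) {
    this.allowedColumnsByUser = allowedColumnsByUser;
  }

  void check(String user, Iterable<String> requestedColumns) {
    Set<String> allowed =
        allowedColumnsByUser.getOrDefault(user, Collections.emptySet());
    for (String column : requestedColumns) {
      if (!allowed.contains(column)) {
        throw new SecurityException(user + " may not read column " + column);
      }
    }
  }
}
```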