连通分支
连通分支算法识别会一个更大的图,这部分图通过被相同的组件ID链接的所有顶点连接。类似PageRank,连通组件是一个迭代算法。在每个步骤中,每个顶点都将其当前组件ID传给所有邻居。如果小于自己的组件ID,一个顶点从邻居接受组件ID。
此实现使用增量迭代:组件ID未变化的顶点不参与下一步骤。因为后来的迭代通常只处理一些离群顶点,这将产生更好的性能。
// read vertex and edge data
DataSet<Long> vertices = getVertexDataSet(env);
DataSet<Tuple2<Long, Long>> edges = getEdgeDataSet(env).flatMap(new UndirectEdge());
// assign the initial component IDs (equal to the vertex ID)
DataSet<Tuple2<Long, Long>> verticesWithInitialId = vertices.map(new DuplicateValue<Long>());
// open a delta iteration
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
verticesWithInitialId.iterateDelta(verticesWithInitialId, maxIterations, 0);
// apply the step logic:
DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
// join with the edges
.join(edges).where(0).equalTo(0).with(new NeighborWithComponentIDJoin())
// select the minimum neighbor component ID
.groupBy(0).aggregate(Aggregations.MIN, 1)
// update if the component ID of the candidate is smaller
.join(iteration.getSolutionSet()).where(0).equalTo(0)
.flatMap(new ComponentIdFilter());
// close the delta iteration (delta and new workset are identical)
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);
// emit result
result.writeAsCsv(outputPath, "\n", " ");
// User-defined functions
public static final class DuplicateValue<T> implements MapFunction<T, Tuple2<T, T>> {
@Override
public Tuple2<T, T> map(T vertex) {
return new Tuple2<T, T>(vertex, vertex);
}
}
public static final class UndirectEdge
implements FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
Tuple2<Long, Long> invertedEdge = new Tuple2<Long, Long>();
@Override
public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Long>> out) {
invertedEdge.f0 = edge.f1;
invertedEdge.f1 = edge.f0;
out.collect(edge);
out.collect(invertedEdge);
}
}
public static final class NeighborWithComponentIDJoin
implements JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {
@Override
public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent, Tuple2<Long, Long> edge) {
return new Tuple2<Long, Long>(edge.f1, vertexWithComponent.f1);
}
}
public static final class ComponentIdFilter
implements FlatMapFunction<Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>>,
Tuple2<Long, Long>> {
@Override
public void flatMap(Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>> value,
Collector<Tuple2<Long, Long>> out) {
if (value.f0.f1 < value.f1.f1) {
out.collect(value.f0);
}
}
}
scala
// set up execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// read vertex and edge data
// assign the initial components (equal to the vertex id)
val vertices = getVerticesDataSet(env).map { id => (id, id) }
// undirected edges by emitting for each input edge the input edges itself and an inverted
// version
val edges = getEdgesDataSet(env).flatMap { edge => Seq(edge, (edge._2, edge._1)) }
// open a delta iteration
val verticesWithComponents = vertices.iterateDelta(vertices, maxIterations, Array(0)) {
(s, ws) =>
// apply the step logic: join with the edges
val allNeighbors = ws.join(edges).where(0).equalTo(0) { (vertex, edge) =>
(edge._2, vertex._2)
}
// select the minimum neighbor
val minNeighbors = allNeighbors.groupBy(0).min(1)
// update if the component of the candidate is smaller
val updatedComponents = minNeighbors.join(s).where(0).equalTo(0) {
(newVertex, oldVertex, out: Collector[(Long, Long)]) =>
if (newVertex._2 < oldVertex._2) out.collect(newVertex)
}
// delta and new workset are identical
(updatedComponents, updatedComponents)
}
verticesWithComponents.writeAsCsv(outputPath, "\n", " ")
该连通分支程序实现了上述例子。它需要运行下列参数:–vertices –edges –output –iterations 。
输入文件是纯文本文件,必须格式化如下:
–Vertices 以IDS表示的顶点,由换行字符分隔。例如“1\n2\n12\n42\n63\n”给出了五个订单(1)、(2)、(12)、(42)和(63)。
–Edges 边通过以空格分隔的两个顶点ID表示。不同边是由换行符分隔。例如“1 2\n2 12\n1 12\n42 63\n”表示了四个无方向链接(1)-(2)、(2)-(12)、(1)-(12)和(42)-(63)。
关系型查询
关系型查询示例假定会使用两张表,一张订单表,另一张是TPC-H决策支持基准测试表。TPC-H是数据库行业标准基准测试。如何生成输入数据请参见下面的说明。
该示例实现以下sql查询。SELECT l_orderkey, o_shippriority, sum(l_extendedprice) as revenue
FROM orders, lineitem
WHERE l_orderkey = o_orderkey
AND o_orderstatus = "F"
AND YEAR(o_orderdate) > 1993
AND o_orderpriority LIKE "5%"
GROUP BY l_orderkey, o_shippriority;
Flink程序中按照如下的方式进行sql查询
// get orders data set: (orderkey, orderstatus, orderdate, orderpriority, shippriority)
DataSet<Tuple5<Integer, String, String, String, Integer>> orders = getOrdersDataSet(env);
// get lineitem data set: (orderkey, extendedprice)
DataSet<Tuple2<Integer, Double>> lineitems = getLineitemDataSet(env);
// orders filtered by year: (orderkey, custkey)
DataSet<Tuple2<Integer, Integer>> ordersFilteredByYear =
// filter orders
orders.filter(
new FilterFunction<Tuple5<Integer, String, String, String, Integer>>() {
@Override
public boolean filter(Tuple5<Integer, String, String, String, Integer> t) {
// status filter
if(!t.f1.equals(STATUS_FILTER)) {
return false;
// year filter
} else if(Integer.parseInt(t.f2.substring(0, 4)) <= YEAR_FILTER) {
return false;
// order priority filter
} else if(!t.f3.startsWith(OPRIO_FILTER)) {
return false;
}
return true;
}
})
// project fields out that are no longer required
.project(0,4).types(Integer.class, Integer.class);
// join orders with lineitems: (orderkey, shippriority, extendedprice)
DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders =
ordersFilteredByYear.joinWithHuge(lineitems)
.where(0).equalTo(0)
.projectFirst(0,1).projectSecond(1)
.types(Integer.class, Integer.class, Double.class);
// extendedprice sums: (orderkey, shippriority, sum(extendedprice))
DataSet<Tuple3<Integer, Integer, Double>> priceSums =
// group by order and sum extendedprice
lineitemsOfOrders.groupBy(0,1).aggregate(Aggregations.SUM, 2);
// emit result
priceSums.writeAsCsv(outputPath);
缺少scala例子(译者注)
关系查询程序实现了上述查询。它需要以下参数运行–orders –lineitem –output 。
order和lineitem文件可以使用TPC-H基准测试套件的数据生成工具(DBGEN)生成。采取以下步骤生成需提供给flink程序输入的任意大小的数据文件。
1、下载并解压DBGEN
2、复制makefile.suite并更名为Makefile,编辑修改如下:
DATABASE = DB2
MACHINE = LINUX
WORKLOAD = TPCH
CC = gcc
1、使用make命令构建DBGEN
2、使用DBGEN生成lineitem和orders表。-s命令传入1,将会一个生成约1 GB的大小的数据集。
./dbgen -T o -s 1