
Let’s frame our problem


What if we represented all of our geometries, no matter their shape, with a corresponding BNG-aligned bounding box? A bounding box is a rectangular polygon that fully contains the original geometry. And what if we represented said bounding box as a set of BNG indices at a given resolution that together cover the same area?
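As a quick illustration of the idea, the bounding box itself can be derived directly from the geometry's envelope. A minimal sketch using shapely (with a purely hypothetical polygon) might look like this:

from shapely import wkt
from shapely.geometry import box

#hypothetical polygon purely for illustration
polygon = wkt.loads("POLYGON ((1 1, 4 1, 3 5, 1 4, 1 1))")

#bounds returns (min_x, min_y, max_x, max_y);
#box builds the rectangular polygon that fully contains the geometry
(x1, y1, x2, y2) = polygon.bounds
bounding_box = box(x1, y1, x2, y2)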

Diagram F

We used MLflow to conduct a series of naive joins to evaluate the baseline performance we are trying to outperform. For the naive approach, the largest join we were able to successfully execute was 10 thousand points to 100 thousand polygons. Any further increase in data volume resulted in our Spark jobs failing without producing the desired outputs. These failures were caused by the unoptimized nature of the workloads we were trying to run.
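For reference, a hedged sketch of what such a naive join might look like (assuming Delta tables with wkt_polygon, eastings and northings columns, and a pip_filter UDF like the one defined further below):

def run_naive_join(polygons_path, points_path):
    polygons = spark.read.format("delta").load(polygons_path)
    points = spark.read.format("delta").load(points_path)
    #with no join keys available, Spark has to compare every polygon
    #with every point before the PIP filter can discard non-matches
    return polygons.crossJoin(points).where(
        pip_filter("wkt_polygon", "eastings", "northings")
    )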

Diagram G

Now we can execute our joins via a more optimized theta join. We only need to check whether a point is inside a polygon via the PIP relation if the point falls into one of the BNG indices used to represent that polygon. This reduces our join effort by multiple orders of magnitude.

To produce said set of BNG indices, we have used the following code; note that the bng_to_geom, coords_to_bng and bng_get_resolution functions are not provided with this blog.

import shapely.wkt
from shapely.geometry import box

#auxiliary function to retrieve the first neighbour
#of a BNG index cell to the right
def next_horizontal(bng_index, resolution):
    x, y = bng_to_geom(bng_index)
    return coords_to_bng(x + resolution, y, resolution)

#auxiliary function to retrieve the first neighbour
#of a BNG index cell to the bottom
def next_vertical(bng_index, resolution):
    x, y = bng_to_geom(bng_index)
    return coords_to_bng(x, y - resolution, resolution)

#filling function that represents the input geometry as a set of indices
#corresponding to the area of the bounding box of said geometry
def bng_polyfil(polygon, resolution):
    (x1, y1, x2, y2) = polygon.bounds
    bounding_box = box(*polygon.bounds)
    #seed the traversal with the cell at the (x1, y2) corner of the
    #bounding box; the traversal then moves right and down
    lower_left = coords_to_bng(x1, y2, resolution)
    queue = [lower_left]
    result = set()
    visited = set()
    while queue:
        index = queue.pop()
        index_geom = shapely.wkt.loads(bng_to_geom_grid(index, "WKT"))
        intersection = bounding_box.intersects(index_geom)
        if intersection:
            result.add(index)
            n_h = next_horizontal(index, resolution)
            if n_h not in visited:
                queue.append(n_h)
            n_v = next_vertical(index, resolution)
            if n_v not in visited:
                queue.append(n_v)
        visited.add(index)

    return result

This code ensures that we can represent any shape in a lossless manner. We use the intersects relation between a BNG index candidate and the original geometry to avoid blind spots in the representation. Note that a more efficient implementation is possible by using the contains relation and a centroid point; that approach is only viable if false positives and false negatives are acceptable. We assume the existence of the bng_to_geom function that, given a BNG index ID, produces a geometry representation; the bng_get_resolution function that, given a BNG index ID, determines the selected resolution; and the coords_to_bng function that, given a pair of coordinates, returns a BNG index ID.
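As a sketch of that lossy alternative (our interpretation, reusing the assumed bng_to_geom_grid helper), the intersects test could be swapped for a contains test against the cell centroid:

import shapely.wkt

#lossy candidate test: accept a cell only if the bounding box contains
#the centroid of that cell; cheaper than an intersects check, but cells
#that only partially overlap the bounding box are missed
def cell_candidate(bounding_box, bng_index):
    index_geom = shapely.wkt.loads(bng_to_geom_grid(bng_index, "WKT"))
    return bounding_box.contains(index_geom.centroid)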

Diagram H

We have run our polygon bounding box representation for different resolutions of the BNG index system and for different data set sizes. Note that this process failed consistently for resolutions below 100. Resolutions are expressed in meters in these outputs. The reason for the consistent failures at resolutions below 100 m is overrepresentation: due to the random nature of the data set, some polygons are much larger than others, and while some polygons are represented by a set of a dozen indices, others are represented by thousands. This can result in a large disparity in compute and memory requirements between partitions in the Spark job that generates this data.
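A hedged sketch of how this representation step might be applied at scale, assuming the polygons table carries a wkt_polygon column and wrapping bng_polyfil in a UDF to produce the bng_set column used in the join below (shown here at a hypothetical 100 m resolution):

import shapely.wkt
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

#hypothetical UDF that materializes the BNG index set for each polygon
@udf(ArrayType(StringType()))
def bng_polyfil_udf(wkt_polygon, resolution):
    polygon = shapely.wkt.loads(wkt_polygon)
    return list(bng_polyfil(polygon, resolution))

polygons = (
    spark.read.format("delta").load(polygons_path)
    .withColumn("bng_set", bng_polyfil_udf("wkt_polygon", F.lit(100)))
)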

We have omitted the benchmarks for points data set transformations since this is a relatively simple operation that does not yield any new rows; only a single column is added, and the different resolutions do not affect execution times.
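The point-side preparation is, correspondingly, a single withColumn call. A minimal sketch, assuming eastings and northings columns and the same coords_to_bng helper:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

#hypothetical UDF that assigns each point the BNG index of the cell it falls into
@udf(StringType())
def bng_index_udf(eastings, northings, resolution):
    return coords_to_bng(eastings, northings, resolution)

points = (
    spark.read.format("delta").load(points_path)
    .withColumn("bng", bng_index_udf("eastings", "northings", F.lit(100)))
)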

With both sides of the join being represented with their corresponding BNG representations, all we have to do is to execute the adjusted join logic:

@udf("boolean")

def pip_filter(poly_wkt, point_x, point_y):

from shapely import wkt from shapely import geometry polygon = wkt.loads(poly_wkt)

point = geometry.Point(point_x, point_y) return polygon.contains(point)

def run_bounding_box_join(polygons_path, points_path):

polygons = spark.read.format("delta").load(polygons_path) polygons = polygons.select(

F.col("id"),

F.col("wkt_polygon"),

F.explode(F.col("bng_set")).alias("bng")

) points = spark.read.format("delta").load(points_path) return polygons.join(

points, on=["bng"], how="inner"

  ).where(pip_filter("wkt_polygon", "eastings", "northings"))

#run an action on the join dataset to evaluate join runtime run_bounding_box_join(polygons_path, points_path).count()

These modifications in our code have resulted in a different Spark execution plan.

Spark is now able to first run a sort merge join based on the BNG index ID and vastly reduce the total number of comparisons. In addition, each pair comparison is a string-to-string equality check, which is much cheaper than evaluating a PIP relationship.

This first stage will generate all the join set candidates. We will then perform a PIP relationship test on this set of candidates to resolve the final output. This approach ensures that we limit the number of times we have to run the PIP operation.
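The resulting plan can be inspected without triggering the full computation, for example via Spark's explain output:

#print the physical plan for the optimized join without executing it
run_bounding_box_join(polygons_path, points_path).explain()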

Diagram I

Diagram J

From the execution plan, we can see that Spark is performing a very different set of operations compared to the naive approach. Most notably, Spark now executes a Sort Merge Join instead of a Broadcast Nested Loop Join, which brings significant efficiency gains. We are now performing about 186 million PIP operations instead of a billion. This alone allows us to run much larger joins with better response times whilst avoiding the breaking failures we experienced with the naive approach.

This simple yet effective optimization has enabled us to run a PIP join between 10 million points and 1 million polygons in about 2,500 seconds. If we compare that to the baseline execution times, the largest join we were able to successfully execute was 10 thousand points to 100 thousand polygons, and even that join required about 1,500 seconds on the same hardware.
