Road network matching shenanigans
The purpose of this article is to complement and correct the previous one on the same subject. In that article, I presented an approach to reconstructing missing map-matched data from the Extended Vehicle Energy Dataset¹ (EVED) [1]. My technique explored the mathematical properties of triangles to find the lost data on a road network database quickly. Unfortunately, a subtle error in the code made it fail under some circumstances.
Here, I present a correction to the code and discuss the results of applying these techniques to reconstruct more than thirty-two thousand trajectories on the EVED.
In the previous article, I illustrated two methods to fit map-matched locations to road network segments using triangular properties. The geometrical concepts that support the idea are adequate, but the code needs to be corrected. The issue occurs on the node filtering part and affects both matching methods, the ratio-based and the distance-based. Let’s look at the original code again using the ratio-based function:
def get_matching_edge(self, latitude, longitude, bearing=None):
loc = np.array([latitude, longitude])
_, r = self.geo_spoke.query_knn(loc, 1)
radius = self.max_edge_length + r[0]
node_idx, dists = self.geo_spoke.query_radius(loc, radius)
nodes = self.ids[node_idx]
distances = dict(zip(nodes, dists))
adjacent_set = set()
graph = self.graphbest_edge = None
for node in nodes:
if node not in adjacent_set:
adjacent_nodes = np.intersect1d(np.array(graph.adj[node]),
nodes, assume_unique=True)
adjacent_set.update(adjacent_nodes)
for adjacent in adjacent_nodes:
edge_length = graph[node][adjacent][0]['length']
ratio = (distances[node] + distances[adjacent]) /
edge_length
if best_edge is None or ratio < best_edge[2]:
best_edge = (node, adjacent, ratio)
if bearing is not None:
best_edge = fix_edge_bearing(best_edge, bearing, graph)
return best_edge
The function uses a set to store all the processed adjacent nodes to avoid repetitions. And it does avoid repetitions, unfortunately. This is where the error lies because by excluding a node, the function is also preventing all other adjacent edges from being tested. This situation is especially prevalent for nodes with multiple links. Correcting the code requires a conceptual change, removing processed road network edges instead of processed nodes.
But another issue surfaced after using the code with a few different trajectories, a black swan. What happens when the input GPS location is close to one of the nodes? In some rare situations, the sampled GPS locations match one of the road network nodes very closely, which induces very subtle floating point instabilities. The result is the selection of a nonsensical edge, ruining the path reconstruction. To solve this issue, I added a default parameter with the minimum distance from the input GPS location to the closest node. When this distance is below one meter (three feet), the function refuses to select an edge and returns an empty value. This solution is moderate as this will happen relatively infrequently, and the ensuing reconstruction mechanism compensates for that.
You can see the corrected code below, featuring the set of tested road network edges and the minimum distance check.
def get_matching_edge(self, latitude, longitude,
bearing=None, min_r=1.0):
best_edge = None
loc = np.array([latitude, longitude])
_, r = self.geo_spoke.query_knn(loc, 1)
if r > min_r:
radius = self.max_edge_length + r[0]
node_idx, dists = self.geo_spoke.query_radius(loc, radius)
nodes = self.ids[node_idx]
distances = dict(zip(nodes, dists))
tested_edges = set()
graph = self.graph
node_set = set(nodes)for node in nodes:
adjacent_nodes = node_set & set(graph.adj[node])
for adjacent in adjacent_nodes:
if (node, adjacent) not in tested_edges:
edge_length = graph[node][adjacent][0]['length']
ratio = edge_length / (distances[node] + distances[adjacent])
if best_edge is None or ratio > best_edge[2]:
best_edge = (node, adjacent, ratio)
tested_edges.add((node, adjacent))
tested_edges.add((adjacent, node))
if bearing is not None:
best_edge = fix_edge_bearing(best_edge, bearing, graph)
return best_edge
Note that I also changed the rate formula so that it ranges between zero and one, with larger values corresponding to a better fit. Interestingly, the rate-based function works better and faster than the distance-based one.
The first part of the path reconstruction implies applying the function above to every point of the input trajectory and keeping only the unique road network edges. The code below shows a Python function that does that.
def match_edges(road_network, trajectory):
edges = []
unique_locations = set()
edge_set = set()
for p in trajectory:
if p not in unique_locations:
e = road_network.get_matching_edge(*p, min_r=1.0)
if e is not None:
n0, n1, _ = e
edge = (n0, n1)
if edge not in edge_set:
edge_set.add(edge)
edges.append(edge)
unique_locations.add(p)
return edges
The function assigns a road network edge to each unique trajectory location. Figure 1 below shows a sample of such matching.
As you can see, there are some gaps between the matched links due to the GPS sampling frequency. Reconstructing the path requires us to fill in these gaps, and we do so by finding the shortest way between the road network edges. The code below does that by keeping together the connected links and filling in the gaps for all the others using OSMnx’s services [2].
def build_path(rn, edges):
path = []
for e0, e1 in pairwise(edges):
if not len(path):
path.append(e0[0])
if e0[0] != e1[0] and e0[1] != e1[1]:
if e0[1] == e1[0]:
path.extend([e0[1], e1[1]])
else:
n0, n1 = int(e0[1]), int(e1[0])
sp = ox.distance.shortest_path(rn, n0, n1)
if sp is not None:
path.extend(sp[1:])
return path
After calling this function on the previous matches, we should have the trajectory fully reconstructed as a sequence of road network edges. We see a continuous result by superimposing the trajectory locations to the patched road network edges, as depicted in Figure 2 below.
The Case for a Minimum Distance
While reviewing this solution, I found an issue when iterating through the individual trajectories. Figure 3 below details the problem.
A spurious cycle in the route arises from an unexpected value in the input data for the patching process. We can dig deeper into what is happening by reverting to the view showing which road network edges the algorithm matches to these GPS locations.
Figure 4 below illustrates the situation at the fork, where the matched GPS location coincides (or is extremely close) to one of the road network nodes. This will likely cause some numeric instability yielding a wrong result. A robust algorithm should avoid these situations, but how?
After considering it, I decided to change both matching algorithms and use the distance to the closest road network node as a criterion to include or exclude the input location. If this distance is smaller than a given threshold (one meter or three feet as default), the match function returns an empty value signaling the calling code that it cannot safely assign a link. This is fine for the following step as it should infer the missing links because that algorithm depends only on road network edges and not the input locations.
Now that we have sanitized the whole algorithm, let us see how it performs on the entire dataset.
To assess how the algorithm behaves on the whole EVED, I created a new standalone script that iterates through all trajectories, matches the edges, generates a path, and compares the original trajectory’s length to the reconstructed path. The Python script records all the differences so we can later analyze them using a standard statistical toolset.
The script core lies in its main loop, detailed in the code below.
def process_trajectories():
rn = download_network()
road_network = RoadNetwork(rn)state = load_state()
if state is None:
state = {
"trajectories": get_trajectories(),
"errors": []
}
save_counter = 0
trajectories = state["trajectories"]
while len(trajectories) > 0:
trajectory_id = trajectories[0]
trajectory = load_trajectory_points(trajectory_id,
unique=True)
if len(trajectory) > 3:
edges = match_edges(road_network, trajectory)
path = build_path(rn, edges)
if len(path) > 0:
diff = calculate_difference(rn, path, trajectory)
print(f"Trajectory: {trajectory_id}, Difference: {diff}")
state["errors"].append((trajectory_id, diff))
trajectories = trajectories[1:]
state["trajectories"] = trajectories
save_counter += 1
if save_counter % 100 == 0:
save_state(state)
save_state(state)
The function starts by downloading and preprocessing the road network data from OpenStreetMap. Because this is a long-running script, I added a persistent state serialized as a JSON-formatted text file and saved every 100 trajectories. This feature allows the user to interrupt the Python script without losing all the data from already-processed trajectories. The function that calculates the length difference between the original input trajectory and the reconstructed path is displayed below.
def calculate_difference(rn, path, trajectory):
p_loc = np.array([(rn.nodes[n]['y'], rn.nodes[n]['x']) for n in path])
t_loc = np.array([(t[0], t[1]) for t in trajectory])p_length = vec_haversine(p_loc[1:, 0], p_loc[1:, 1],
p_loc[:-1, 0], p_loc[:-1, 1]).sum()
t_length = vec_haversine(t_loc[1:, 0], t_loc[1:, 1],
t_loc[:-1, 0], t_loc[:-1, 1]).sum()
return p_length - t_length
The code converts both inputs into sequences of geolocations, computes their cumulative lengths, and finally computes their difference. When the script finishes, we can load the state data and analyze it using Pandas.
We see how the difference between both distances distributes in Figure 5 below (named “error“).
To see more detail around zero, I clipped the histogram and created the one in Figure 6 below.
As you can see, there are some differences in the negative part of the spectrum. Still, it is primarily positive, meaning that the reconstructed path lengths are usually longer than trajectories, which is expectable. Remember that trajectory points seldom match road network nodes, meaning that the most common scenario should be a match midway through a segment.
We can also inspect where the mid-fifty percent of differences is by drawing a box-and-whiskers plot, as depicted below in Figure 7.
The image above contains essential information on the distribution of differences and the number of outliers. Using Tukey’s fences method, we can classify as outliers 5,332² trajectories out of 32,528. This proportion represents a relatively high percentage value of 16.4%, meaning there are probably better reconstruction methods than this or that there is a sizable amount of poorly processed trajectories. We can look into some of the most dramatic distance difference outliers to try and understand these results.
Outlier Analysis
As previously noted, there are over five-thousand trajectories with poor matching. We can quickly inspect a few to determine why the difference between the sampled trajectory and the reconstruction is so significant. We start with a revealing case, trajectory number 59, depicted below in Figure 8.
The case above shows that the map-matching process incorrectly placed three GPS samples on a parallel road, forcing the reconstruction process to find a spurious detour. This type of error might result from the inadequate parameterization of the map-matching process, something to research in a later article.
Finally, we can see a very radical case in Figure 9 below. The sampled GPS locations are so far away from the road that the map-matching process gave up. As you can see, the reconstruction process assigns road network links that bear no relation to the intended trajectory.
In this article, I corrected previously published code on road network edge matching and assessed its performance using the EVED dataset. I collected the performance assessment data using a specialized Python script that revealed a relatively high number of matched trajectories with seemingly abnormal deviations in the measured distances. To understand why this result could be better, I will turn next to the original map-matching software, Valhalla, to try and fix the issues above.
- The original paper authors licensed the dataset under the Apache 2.0 License (see the VED and EVED GitHub repositories). Note that this also applies to derivative work.
- The source of all images and calculations is in a Jupyter notebook, available from the public GitHub repository.
[1] Zhang, S., Fatih, D., Abdulqadir, F., Schwarz, T., & Ma, X. (2022). Extended vehicle energy dataset (eVED): An enhanced large-scale dataset for deep learning on vehicle trip energy consumption. ArXiv. https://doi.org/10.48550/arXiv.2203.08630
[2] Boeing, G. 2017. OSMnx: New Methods for Acquiring, Constructing, Analyzing, and Visualizing Complex Street Networks. Computers, Environment and Urban Systems 65, 126–139. doi:10.1016/j.compenvurbsys.2017.05.004