This project was completed by Joseph Schmit and Eric Moreno
Car accidents are a major global problem: the World Health Organization (WHO) reports that 1.35 million people die in car accidents globally each year. The CDC estimates that 2.5 million Americans are admitted to the emergency room every year following vehicular collisions.
This article discusses our approach to exploring whether machine learning can be used to predict the risk probability of traffic collisions within Los Angeles County.
link to Power BI dashboard
- Problem Statement
- Data Sources
- Our Approach
- Building a Hexagon Grid Over Los Angeles County
- Building The Training Model
- Data Exploration and Feature Engineering
- Modeling Approach
- Exploring Various Models
A state’s population has an obvious effect on traffic collisions, and California is the most populous state, with a population of 39 million. According to the Insurance Institute for Highway Safety (IIHS), there were 3,558 fatal crashes in California in 2020. A study found that drivers in Los Angeles sat in traffic for an average of 90 hours in 2013, and for every commute driven in peak periods they experienced 39 minutes of delay.
Predicting risk probabilities for traffic collisions is a tough task because there are so many variables that come into play: time, weather, day of the week, number of vehicles on the road, hour of the day, the number of lanes, stop signs, speed, and there’s also uncertainty with any prediction. With that being said, the very first step in our approach is to frame this problem as a binary classification problem where the target variable has two classes (accident and no accident).
The objective is to combine location and time characteristics to compare the relative likelihood of a collision occurring at a given location and time for Los Angeles County. In order to accomplish this, we will create a methodology for capturing geolocation information and transforming it into features that can be used to predict collisions. Various modeling techniques will be implemented and each technique will be scored against each other to determine which model performs best.
The outputs of each model will then be used to describe characteristics of locations that predict collisions to help the end user understand what makes a location more prone to risk. We also want to provide a dashboard where the user can view a map of LA County where risky areas are highlighted based on the day and time.
Below is a description of the data sources used in our project.
- Uber hexagon data: a geospatial indexing system developed and sourced by Uber.
- OpenStreetMap: This is an open source data project that includes tags that describe roads. This information includes characteristics of the streets and intersections such as speed, road type, and location of street lights. OpenStreetMap also provides building and business location information. We were able to collect the locations of restaurants, bars, schools and colleges.
- Los Angeles County shapefiles: files for the county and all cities, used to label the hexagons.
- Weather history from NOAA: Historical temperature, wind speed, etc., by days in our data.
- Collision Data: LA County SWITRS collision data processed by UC Berkeley. This includes date, time, and latitude and longitude information.
Credit: Transportation Injury Mapping System (TIMS), Safe Transportation Research and Education Center, University of California, Berkeley. 2022
One of the limitations of modeling on collision data is that it only contains examples of when a collision occurred. Our data has latitude, longitude, date, time, and location-specific data; however, it does not include examples of places and times when collisions did not occur. Fortunately, we found several projects that take different approaches to this problem.
A few months ago we completed a similar traffic collision risk prediction project. Following an approach similar to one used in a project by Meraldo Antonio, we performed negative sampling to generate synthetic data for our negative class (no accident). Our dataset only contained records where a collision occurred (the positive class), and we needed records where an accident did not occur so that we could treat the problem as binary classification. We produced a synthetic dataset of negative-class records by randomly sampling over the hours and dates in our data. We then cross-referenced our negative samples with our positive samples and threw out any negative examples that overlapped with the positive samples.
In our previous project we also incorporated unsupervised learning to cluster over the latitude and longitude coordinates and the cluster center was attached as a label for each record. HDBSCAN was used for clustering instead of K-Means because K-Means minimizes variance, not geodetic distance. DBSCAN and HDBSCAN are effective choices with spatial data because data is clustered based on physical distance from each point and a minimum cluster size.
Upon completion of the project we had a wish list of items that we thought could improve the modeling and project overall including:
- A method for attaching latitude and longitude to all collisions to increase our sample size because approximately 50% of the data had to be tossed out because of missing geocoordinates.
- Finding examples of non-collision locations, which would allow us to differentiate between collision-prone and non-collision-prone areas.
- A dashboard that had an interactive map of Los Angeles County and where the user could discover where and when collisions are more likely to occur, along with being able to see which features are driving the model’s predictions.
After reflecting on lessons learned from our first project, we are going to take a new approach at predicting the risk probability for traffic collisions. Our first objective is to solve our location bias issue because we want a way to describe locations where collisions did not occur. We looked at two methodologies for solving this problem.
The first way would be to find a shapefile of every road segment in the county; other projects have achieved good results with this method. Los Angeles County has approximately 370,000 road segments, and although there are ways to pare this down for modeling, applying it to our dashboard and scoring could be limiting. Scoring every segment for every hour of every day of a year for an end-user interface would be computationally difficult. Just scoring the month of January would require 80 billion records. Also, translating other geographical features and relating them to line segments is difficult. Big data solutions such as Databricks might make this more achievable, and projects using them have had good results.
A second solution is to translate the area of Los Angeles County into a grid network. We found a great solution in Uber’s H3: Uber’s Hexagonal Hierarchical Spatial Index. Uber has created a spatial index for the entire world based on hexagons. It has many characteristics that make it ideal for creating a geo grid over any location. The image below is an example where we applied an H3 grid over the city of Long Beach and plotted the collisions from our dataset. The Python library h3 was used to download the shapefiles of the LA County hexagons, and each hexagon is indexed with a specific hexagon ID.
By using hexagons to represent a location, we are able to map features from our data to each hexagon and summarize the data and create features that describe each hexagon such as:
- What is the maximum speed of streets in the hexagon?
- How many street lights or stop signs are in the hexagon?
- How many restaurants are in the hexagon?
- How many schools are in the hexagon?
- What type of roads are in the hexagon? Are they one-way, motorway, etc.?
A unique advantage of the hexagons versus other shapes is the nearest neighbor problem. Creating a ring of neighbors and neighbors of neighbors becomes complicated with triangles and squares. Uber gives this graphical example of how this is solved through hexagons, as shown in the figure below.
Hexagon neighbors have a more consistent distance to the original center shape. The H3 library has functions that return a list of hexagon IDs for all the neighbors, with a variable, K, for selecting how far out to collect the IDs. K = 1 returns a list of 7 hexagon IDs, as represented in the middle image below. If K is set to 2, the list contains all the hexagons from K = 1 plus the next outer ring, as represented in the image on the right.
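The size of these neighbor lists follows a simple pattern: ring j around a hexagon holds 6j cells, so the list for a given K holds 1 + 3K(K + 1) IDs. Below is a minimal sketch of that count. The function name is ours, not part of the H3 library, and the formula assumes a regular hexagonal grid; H3 also contains a small number of pentagon cells, which would reduce the count nearby.

```python
def k_ring_size(k: int) -> int:
    """Number of hexagon IDs in a neighbor list of radius k, center included.

    Assumes a regular hexagonal grid, where ring j holds 6 * j cells.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1 + 3 * k * (k + 1)


for k in range(3):
    print(k, k_ring_size(k))  # prints "0 1", "1 7", "2 19"
```

So K = 1 gives the 7 IDs described above, and K = 2 gives 19.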
Why is this useful? One application is finding how many collisions were in the area around a hexagon. Instead of having to calculate distances from collision points and do matching that way, we can use dataframe joining that is easily scalable.
Our approach to building the grid
We start by indexing because it allows us to attach a list of all the neighboring hexagons. We can then explode by the hexagon neighbor list to create a tall dataframe that now has a row for every neighbor for every hexagon.
We can then join on hex_kring_1_neighbors and hex_id.
Finally, group by hex_id and sum.
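The explode, join, and group-by steps above can be sketched with pandas as follows. This is an illustration under assumptions rather than our production code: the three-hexagon grid, the collision counts, and the column names `neighbor_id` and `collisions_kring_1` are made up, and in practice the neighbor lists would come from the H3 library.

```python
import pandas as pd

# Hypothetical toy grid: three hexagons in a row, so A and C both neighbor B.
# Each K = 1 neighbor list includes the hexagon itself.
hexes = pd.DataFrame({
    "hex_id": ["A", "B", "C"],
    "hex_kring_1_neighbors": [["A", "B"], ["A", "B", "C"], ["B", "C"]],
})

# Hypothetical collision counts per hexagon for the previous year.
collisions = pd.DataFrame({"hex_id": ["A", "B", "C"], "collisions": [2, 5, 1]})

# 1. Explode: one row per (hexagon, neighbor) pair.
tall = hexes.explode("hex_kring_1_neighbors")

# 2. Join each neighbor ID to that neighbor's collision count.
joined = tall.merge(
    collisions.rename(columns={"hex_id": "neighbor_id"}),
    left_on="hex_kring_1_neighbors",
    right_on="neighbor_id",
    how="left",
)

# 3. Group by the original hexagon and sum over its neighborhood.
neighborhood = (
    joined.groupby("hex_id", as_index=False)["collisions"]
    .sum()
    .rename(columns={"collisions": "collisions_kring_1"})
)
print(neighborhood)
```

Because the work is expressed as a join and an aggregation, no point-to-point distance calculation is needed.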
Because all the neighboring hexagons are the same distance from the center hexagon, we can say that the summed values represent all the collisions that occurred within approximately 1 kilometer of the original hexagon. This allows us to add neighborhood features to each hexagon. For example: how many accidents did the hexagon and its nearest neighbors have last year?
Dealing with missing latitude and longitude values
As mentioned earlier, in the previous project we encountered collision records with missing latitude and longitude values. UC Berkeley created a geocoding methodology to attach latitude and longitude to all the collisions by imputing the latitude and longitude using hierarchical decisions that use collision descriptions including street and intersection information. This solution allowed us to use the entire dataset. Below is a diagram of their geocoding methodology.
There is a lot that goes into building a complex model and the diagram below illustrates our approach.
The main driver for connecting our data is the geolocation. For our processing, the main library we use for spatial joins is GeoPandas. Its sjoin function can perform both intersects and within joins, selected through a join option, which is useful for joining different kinds of data. When joining data to summarize which latitude and longitude points are in a particular hexagon, the within option is used. When joining line data or other shapes to the hexagon, the intersects option is used.
One of the joins performed labels the hexagons with their associated city. To implement this procedure, we calculate the centroid of each hexagon, sjoin the centroids to a Los Angeles County city shapefile, and find which city each centroid is located within. This results in a city label attached to each hexagon.
Another example is a similar process for finding how many street lights are within a particular hexagon. Once a hexagon ID is linked to each street light, we can then use group by and a summary statistic such as count.
Street information is in the form of a line shapefile. For this we use the intersects option, which produces a row for every hexagon that the line intersects, and group by can then be used to produce summary statistics. By joining street information we can produce various features such as:
- Maximum and minimum speed of a street in a hexagon
- A flag that indicates the presence of objects such as a bridge or a tunnel
- A flag that indicates whether a street is one-way or a motorway
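As a rough sketch of the group-by step for street features, assume the spatial join has already produced one row per hexagon and street pair. The column names and values below are hypothetical and only illustrate the aggregation.

```python
import pandas as pd

# Hypothetical output of an intersects join: one row per (hexagon, street)
# pair. Column names and values are made up for illustration.
street_rows = pd.DataFrame({
    "hex_id": ["A", "A", "B", "B", "B"],
    "maxspeed_mph": [25, 45, 35, 65, 30],
    "is_bridge": [False, True, False, False, False],
    "is_oneway": [False, False, True, False, False],
    "is_motorway": [False, False, False, True, False],
})

# One row per hexagon: speed range plus "does any street have this tag" flags.
hex_features = street_rows.groupby("hex_id").agg(
    max_speed=("maxspeed_mph", "max"),
    min_speed=("maxspeed_mph", "min"),
    has_bridge=("is_bridge", "any"),
    has_oneway=("is_oneway", "any"),
    has_motorway=("is_motorway", "any"),
)

# Store the flags as 0/1 integers.
flag_cols = ["has_bridge", "has_oneway", "has_motorway"]
hex_features[flag_cols] = hex_features[flag_cols].astype(int)
print(hex_features)
```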
The target event for our model is the date and time of a collision within a hexagon. Below is an illustration of how our data is structured.
The resulting data is called our positive samples. To build a model, we also need examples of non-collisions, which will be the negative samples for our model. We then sample from all the other permutations of hexagon ID, collision date and collision hour that are not present in our data. We also exclude the days before and after a collision for a given hexagon. We then attach the samples with randomized hexagon ID, collision date and collision hour to the data and set the target to 0.
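The sampling procedure can be sketched in plain Python as below. This is a simplified illustration, not our actual pipeline: the function name, the toy hexagon IDs and dates, and the rejection-sampling loop are assumptions, and the ratio of 4 negatives per positive is the one used for our model data.

```python
import random
from datetime import date, timedelta


def sample_negatives(positives, hex_ids, dates, ratio=4, seed=0):
    """Draw `ratio` negative (hex_id, date, hour) samples per positive sample.

    A candidate is rejected if it matches a positive sample, or if the same
    hexagon had a collision on the day before or the day after.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    blocked_days = {
        (hex_id, day + timedelta(days=offset))
        for hex_id, day, _hour in positive_set
        for offset in (-1, 1)
    }
    target = ratio * len(positive_set)
    negatives = set()
    attempts = 0
    while len(negatives) < target:
        attempts += 1
        if attempts > 1000 * target:
            raise RuntimeError("not enough valid candidates to sample from")
        hex_id, day, hour = rng.choice(hex_ids), rng.choice(dates), rng.randrange(24)
        if (hex_id, day, hour) in positive_set or (hex_id, day) in blocked_days:
            continue
        negatives.add((hex_id, day, hour))
    return sorted(negatives)


# Hypothetical toy data: three hexagons, ten days, two observed collisions.
hex_ids = ["A", "B", "C"]
dates = [date(2019, 1, 1) + timedelta(days=i) for i in range(10)]
positives = [("A", date(2019, 1, 5), 8), ("B", date(2019, 1, 2), 17)]
negatives = sample_negatives(positives, hex_ids, dates)

# Attach the target: 1 for observed collisions, 0 for sampled non-collisions.
rows = [(*p, 1) for p in positives] + [(*n, 0) for n in negatives]
print(sum(r[-1] for r in rows) / len(rows))  # share of positives: 0.2
```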
Important note about negative sampling
Our model data has a mean collision probability of 20%, which is a product of creating 4 negative samples for every 1 positive sample. All our visuals report the model-data probability of a collision, whereas the actual probability of a collision for any hexagon during a particular hour of the day is approximately 0.07%. It is therefore important to note that the modeled probabilities are inflated relative to the actual probabilities. For our analysis, we are studying the relative magnitude of collision probabilities. For our dashboard, we will address this issue again and discuss how the end user should view and understand the outputs.
So what do 456K traffic collisions look like across Los Angeles County from 2015 to 2019?
An important first step in performing feature engineering is to plot a correlation matrix because it displays the correlation coefficients across all combinations of variables. The correlation analysis can show which variables are correlated with the target variable and also help uncover any multicollinearity. Depending on the modeling method, special care might be needed to make sure the correlation is accounted for in the modeling process.
The next step in the feature engineering was to analyze each variable's individual relationship with our target variable. These charts show how the collision probability relates to a potential feature and also provide a histogram of the feature's distribution. Below we discuss some of the charts we analyzed. A PDF file with the complete output for all the variables is located here.
Number of traffic signals
Here we can see that as the number of traffic signals in a hexagon increases, the probability of a collision increases. We are careful to note that this describes the hexagon and reflects correlation rather than causation. Our features could be a proxy for another variable. Traffic signals do not cause collisions, but they are located in more congested areas and would make a good predictor to describe the location.
Existence of a motorway link (expressway ramp)
This feature has a value of 1 if the hexagon has an exit ramp (motorway link) located in the hexagon. We have several flags in our dataset for marking the existence of the feature in the hexagon. Here we can see that hexagons with exits are over twice as likely to have collisions.
Number of restaurants
This variable describes how many restaurants are in the hexagon. The chart below tells us there is a strong relationship. This is a feature that describes the type of location of the hexagon. What is interesting is that there is very little lift above a value of 1. As the data gets thin, the probability starts to jump around. This feature is a good candidate for capping. We might want to cap these values at 5.
Previous calendar year collision count
The final example here is collision history. This is a count of the number of accidents in each hexagon for the previous calendar year. That this feature is predictive is not surprising, but it is striking how far the relationship goes. In our data, there are hexagons that had nearly 150 collisions in one calendar year. We have a similar feature where we sum the number of accidents the nearest neighboring hexagons had in the previous year. This helps smooth the collision history a little over the geographical area.
Time Series Analysis
Another chart (left) analyzes the relationship between the hour of the day and the day of the week. This is interesting because it uncovers some hidden relationships between certain days and times when collisions are more likely. The chart shows an uptick in collision probability on weekends from 12 AM (0) to 3 AM compared to weekdays. You can clearly see weekday commute patterns, while Friday seems to have its own spikes in probability.
To model time, we were inspired by another collision analysis where time was a strong component. The hours in a day are an interesting feature. It does not work to use them as a numeric value. If we used numbers from 0 to 23 to represent the hours, the relationship distance between 11PM(23) and 12AM(0) is not preserved. In this paper, the authors transformed the hours into a cosine and sine component. This essentially maps the hours onto a circle and preserves the cyclical distances.
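A minimal sketch of this sine and cosine transformation is shown below, using only the standard library. The function names are ours, chosen for illustration.

```python
import math


def encode_hour(hour):
    """Map an hour in 0..23 to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def circle_distance(hour_a, hour_b):
    """Straight-line distance between two encoded hours."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


print(circle_distance(23, 0))  # 11 PM to 12 AM: one hour apart
print(circle_distance(0, 1))   # 12 AM to 1 AM: same distance as above
print(circle_distance(0, 12))  # midnight to noon: the largest distance
```

With the raw values 0 to 23, 11 PM and 12 AM would be 23 units apart. After the encoding they are as close as any other pair of adjacent hours.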
The desired output of our model is to measure the probability of a collision in a hexagon when compared to all other hexagons. To accomplish this we create a target variable that is binary and model the probability between 0 and 1 for a collision. Below are the descriptions for each class.
- Class 1: indicates that there was at least one collision in the hexagon during a particular hour.
- Class 0: indicates that there was no collision in the hexagon during a particular hour.
Our tree models will use a binary classifier and our GLM will use logistic regression. Since collisions are rare and our data is unbalanced, we will use Area Under the Curve (AUC) as our evaluation metric. Collisions are so rare that a hexagon in our data has a 0.07% probability of a collision in any given hour, so it would be impossible to say with precision that an actual collision will occur during any one hour. We are therefore interested in the relative magnitudes of the predictions.
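To illustrate why a ranking metric suits unbalanced data, here is a small pure-Python sketch. It assumes AUC refers to the area under the ROC curve, which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. The labels and scores are made up.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. The pairwise loop is O(n_pos * n_neg), which is
    fine for a small illustration but not for a full dataset.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data with the same 1-to-4 class ratio as our model data.
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.3, 0.35, 0.05, 0.15, 0.25, 0.5]

print(auc(labels, scores))        # 0.875
print(auc(labels, [0.0] * 10))    # 0.5: a constant score has no ranking power
```

A model that always predicts "no collision" would be 80% accurate on this data, yet its AUC is only 0.5, which is why accuracy is a poor yardstick here.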
Here is how we broke up our data:
Our model takes the following form: target ~ location descriptions + date and time + collision history + weather
- Location Descriptions: Hexagon location-specific information. For example: maximum speed for a location, number of street lights, number of restaurants
- Date and Time: Month, day of week, and hour of day
- Collision History: Number of collisions in the previous year at the levels of the hexagon and the hexagon's neighbors
- Weather: NOAA weather for the day including temperature and wind speed.
Important note: our model data has a mean collision probability of 20%. This is a product of how the model data was created, with 4 negative samples for every 1 positive sample.
GLMnet is a specialized Generalized Linear Model that combines LASSO and ridge regularization to reduce dimensionality, which makes it a good fit for our logistic regression. We used it to create a base model and observe the dimension reduction. After one-hot encoding, we had over 150 features. The GLMnet reduced all but 70 coefficients to zero. We did not perform much complex feature engineering. It performed very well in our AUC validation and set the bar for our other models to improve upon.
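For reference, the penalized logistic regression objective that glmnet-style models minimize is usually written as below. The notation is an assumption added here, not taken from our code: N observations (x_i, y_i), intercept β₀, coefficient vector β, overall penalty strength λ, and mixing weight α, where α = 1 is pure LASSO and α = 0 is pure ridge. The values of λ and α used in this project are not stated here.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
+ \lambda\Big[\, \alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Big]
```

The L1 term is what pushes coefficients exactly to zero, which is how features drop out of the model, while the L2 term shrinks the coefficients of correlated features toward each other.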
Gradient Boosting Machine (GBM)
This is an ensemble tree-based model that builds subsequent models based on minimizing the overall prediction error. We used the CatBoost library for this task. CatBoost is a good fit because it handles both numerical and categorical features, and it handles categorical features efficiently without one-hot encoding. The algorithm converges very efficiently. Even with complex features, such as interactions between day of the week and time of day, the algorithm performs very well. This model performed very well on both the test data and our final validation set.
Discuss Auto ML using Sagemaker and how its performance compared to our traditional machine learning model.