Co-Authors: Zealand Cooley, Zhou Wang
Even the happiest place on Earth has parts that are not so enjoyable. Although most Disney World rides have immersive decorations that make waiting for each ride an event in itself, everyone has a limit on how long they are willing to stand in line. We wanted to know what factors contribute to Disney World ride wait times so that the next time you go, you are better prepared to make your magical days more magical.
- Data Sources
- Data Cleaning and Feature Engineering
- Exploratory Data Analysis
- Modeling and Feature Importance
- Ethical Considerations
- More Information
We found Disney World ride and park data from TouringPlans.com (here) and Disney World ride data from DataWorld (here). We decided to limit our project scope to Magic Kingdom rides and found that 20 of them were in both datasets.
- Astro Orbiter
- Big Thunder Mountain Railroad
- Buzz Lightyear’s Space Ranger Spin
- Dumbo the Flying Elephant
- Haunted Mansion
- It’s a Small World
- Jungle Cruise
- Mad Tea Party
- Peter Pan’s Flight
- Pirates of the Caribbean
- Prince Charming Regal Carrousel
- Seven Dwarfs Mine Train
- Space Mountain
- Splash Mountain
- The Barnstormer
- The Magic Carpets of Aladdin
- The Many Adventures of Winnie the Pooh
- Tomorrowland Speedway
- Tomorrowland Transit Authority PeopleMover
- Walt Disney’s Carousel of Progress
The TouringPlans data for each ride include datetimes, posted wait times, and actual wait times. There is also park metadata for each date, such as whether there was a holiday that day, opening and closing hours, the percentage of schools in session, when a parade or event took place, etc. The DataWorld data includes characteristics of each ride, like which land of Magic Kingdom it is in, the number of days it has been open, what type of ride it is (spinning, dark, big drops, thrilling, slow, etc.), and how long the ride is. The data spans the years 2015 through 2021. The exact details of all variables are available to view on the linked sites.
One factor we thought would play a major role in predicting ride wait times is weather. The Disney datasets included only the average temperature for each day, so we looked to outside data for more granular information on temperature, precipitation, fog, etc. We got this from the National Centers for Environmental Information (here), specifying the city (Orlando, Florida) and the date range (2015 through 2021). The data originates from the nearest weather station, located at Orlando International Airport (which is only 16 miles away from Disney World). This data was aggregated by the hour and contains information about wind quality, wind speed, cloud quality, visibility, and temperature.
Lastly, since the data spans time intervals which include days affected by Covid, we decided to add more specific information about the state of Covid for each day. We got this data from the CDC website (here). It contains data on the number of cases and deaths recorded by state for each day, which we aggregated to gather the number of new cases in the US each day.
We combined the data, vertically concatenating all of the ride data sets and horizontally concatenating the park metadata, Covid data, and weather data. After combining all sets, our final data set contained 383 columns and 5,654,849 records. We’ll briefly discuss how this was divided into training and testing sets in the modeling section of this post.
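A minimal sketch of this kind of combination, assuming pandas. The tiny frames, the `holiday` column, and the choice of a left merge on a date key for the horizontal step are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

# Hypothetical miniature inputs; the real files and most column names differ.
ride_a = pd.DataFrame({
    "ride": ["Space Mountain", "Space Mountain"],
    "datetime": pd.to_datetime(["2018-06-01 10:05", "2018-06-02 11:15"]),
    "SPOSTMIN": [45, 60],
})
ride_b = pd.DataFrame({
    "ride": ["Jungle Cruise"],
    "datetime": pd.to_datetime(["2018-06-01 10:07"]),
    "SPOSTMIN": [30],
})
park_meta = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-02"]),
    "holiday": [False, True],
})

# Vertical concatenation: stack the per-ride tables into one long table.
rides = pd.concat([ride_a, ride_b], ignore_index=True)

# Horizontal combination: attach day-level metadata through a shared date key.
rides["date"] = rides["datetime"].dt.normalize()
combined = rides.merge(park_meta, on="date", how="left")
print(combined)
```

A left merge keeps every wait time record even when a day has no metadata row, which is one reason null values can appear after combining.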
Our data was relatively clean when we encountered it, but several variables needed work before they could be used in the model and to make the data easier to digest as we worked with it.
A lot of effort was put into making the column names of the weather data more readable; each one was renamed using the website’s codebook. This made it easier to see which weather variables were important once we started our feature selection. As mentioned in the last section, we also had to aggregate this data by the hour, averaging and summing where appropriate. Since the weather data was recorded at hourly (or nearly hourly) intervals and the ride data was recorded every few minutes, we joined the two data frames by matching each record to the nearest hour.
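A nearest-hour join of this kind can be sketched with `pandas.merge_asof`. This is one possible approach under stated assumptions: the frames, the `temperature` column, and the 30-minute tolerance are made up for illustration.

```python
import pandas as pd

# Hypothetical ride records taken every few minutes.
rides = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2018-06-01 09:58", "2018-06-01 10:20", "2018-06-01 10:40"]
    ),
    "SPOSTMIN": [30, 35, 40],
})
# Hypothetical hourly weather observations.
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-06-01 10:00", "2018-06-01 11:00"]),
    "temperature": [27.0, 29.0],
})

# merge_asof requires both frames to be sorted on the join key.
joined = pd.merge_asof(
    rides.sort_values("datetime"),
    weather.sort_values("datetime"),
    on="datetime",
    direction="nearest",              # pick the closest weather reading
    tolerance=pd.Timedelta("30min"),  # but never one more than 30 minutes away
)
print(joined)
```

Rounding both timestamps to the hour and doing an ordinary merge would be an alternative; `merge_asof` avoids creating a separate key column.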
The variables “MONTHOFYEAR”, “DAYOFYEAR”, “YEAR”, and “HOUROFDAY” are integer variables created from the date of each record. Some variable values had to be changed for Python to work with them correctly. For example, the park’s closing hours were recorded relative to the day the park opened, so if the Magic Kingdom closed at 1AM, the time would be 25:00. These times were converted to valid military time for consistency. All other variables were cast to the appropriate type for use in the model: Boolean variables were explicitly set to Boolean, integer variables to integer, etc.
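To illustrate the two ideas above, here is a small standard-library sketch. The helper name `parse_park_time` and the sample timestamps are hypothetical; only the four feature names come from the data.

```python
from datetime import datetime, timedelta


def parse_park_time(date_str: str, time_str: str) -> datetime:
    """Turn an over-24 clock time such as '25:00' into a real datetime."""
    hours, minutes = (int(part) for part in time_str.split(":"))
    midnight = datetime.strptime(date_str, "%Y-%m-%d")
    # timedelta rolls hours >= 24 over into the next calendar day.
    return midnight + timedelta(hours=hours, minutes=minutes)


# A park that opened on June 1 and closed at "25:00" closed at 1AM on June 2.
closing = parse_park_time("2018-06-01", "25:00")

# Integer date features for one (made-up) wait time record.
record = datetime(2018, 6, 1, 14, 35)
features = {
    "MONTHOFYEAR": record.month,
    "DAYOFYEAR": record.timetuple().tm_yday,
    "YEAR": record.year,
    "HOUROFDAY": record.hour,
}
print(closing, features)
```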
Exploratory Data Analysis is important in every data science project for building a preliminary understanding of the data and forming hypotheses about which elements will matter in the model. We looked at the 2018 data for this exploratory step because it was a year with complete data from before Covid, which let us gauge the normal behavior of ride wait times at the park.
First, we looked at the distributions of actual wait times and posted wait times. The pandas .describe() method is really useful for quickly checking basic statistics of quantitative variables. Note that ‘SACTMIN’ is the column name for actual wait times and ‘SPOSTMIN’ is the column name for posted wait times in the data.
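A minimal sketch of that call, assuming pandas. The numbers below are invented purely to show the shape of the output and are not taken from the project data.

```python
import pandas as pd

# Made-up values; only the two column names come from the data.
waits = pd.DataFrame({
    "SACTMIN": [12.0, 25.0, None, 40.0],   # actual wait, often missing
    "SPOSTMIN": [15.0, 30.0, 45.0, 50.0],  # posted wait
})

# describe() reports count, mean, std, min, quartiles, and max per column,
# skipping missing values.
summary = waits[["SACTMIN", "SPOSTMIN"]].describe()
print(summary)
```

Comparing the `count` rows is a quick way to see how many records carry each kind of wait time.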
Right off the bat in this EDA, we can see that actual wait times are, on average, shorter than the posted wait times. That is useful to keep in mind if you usually get discouraged when you see a long posted wait time!
We also wanted to look at some of the ride characteristics to see if there were significant differences in the distribution of actual wait times. Some of these characteristics include:
- If the ride is considered thrilling
- If the ride is considered slow
- If the ride has small drops
- If the ride has large drops
- If the ride is dark
- If the ride has spinning
- If it’s a water ride
Here is a code snippet of one of those significance tests and the corresponding boxplots as a visual.
We can infer from these results that water rides have a significantly larger mean wait time than non-water rides and they have a larger range of wait times.
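As a rough illustration of this kind of comparison, the sketch below computes Welch's t statistic for two groups using only the standard library. Welch's test is an assumption here, since the test used in the project may differ, and the two samples are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Invented actual wait times in minutes, not project data.
water = [35, 50, 42, 60, 48]
non_water = [20, 25, 18, 30, 22]

t_stat = welch_t(water, non_water)
print(round(t_stat, 2))
```

A large positive statistic means the first group's mean sits well above the second's relative to the sampling noise. Turning it into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`.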
We decided to visualize the actual wait time distributions for each ride as well to have a better idea of which rides tend to have shorter or longer waits.
We can see very clearly from these boxplots that Walt Disney’s Carousel of Progress has the shortest mean wait time and the smallest range, while Seven Dwarfs Mine Train has the longest mean wait time and the largest range. We can also see that many of the average posted wait times hover around 25 minutes.
Other EDA can be viewed in our GitHub repository linked at the end of this blog post.
Next, we started modeling the data.
To predict Magic Kingdom ride wait times, we explored several regression models to determine the best one. The models we tried were Ridge Regression, Extreme Gradient Boosting (XGBoost) Regression, and Bayesian Ridge Regression, as well as several tree-based models: Random Forest Regression, Extra Trees Regression, and Decision Tree Regression.
Before training these models, a few additional steps were needed to prepare the data for modeling. First, we needed to split the data into training and testing datasets. We took a randomized approach to our train-test split because we wanted both pre- and post-Covid data in the training and testing sets, limiting any confounding effects related to Covid-19 protocols.
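A randomized split along these lines can be sketched with pandas alone. The 80/20 ratio, the seed, and the toy frame are assumptions for illustration; the project's actual split parameters are not stated here.

```python
import pandas as pd

# Toy frame with pre-Covid (2018) and Covid-era (2021) rows.
df = pd.DataFrame({
    "YEAR": [2018] * 5 + [2021] * 5,
    "SPOSTMIN": list(range(10)),
})

# Randomly hold out 20% of the rows; the remainder is the training set.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

print(len(train), len(test))
```

Because rows are drawn at random rather than by date, both periods can land in either set. scikit-learn's `train_test_split` does the same job when features and target are already separated.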
We also had some missing data. Several features, like the time of the second parade in a day at Magic Kingdom, were null by design: a null value meant there was no second parade at Magic Kingdom on that day. To handle these cases, we imputed the null values with 99 (the typical range was 0–23, the hour in military time) to tell the model that this value is drastically different from the others. We imputed the null values in the remaining columns with a backfill strategy, filling each gap from an adjacent record in time. To support this strategy, we first sorted the datasets on datetime so that values were imputed from nearby dates, which are likely to have similar characteristics. A few columns had larger groups of nulls in the more recent years (primarily 2021). Any remaining missing values were handled with median imputation.
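The three imputation steps can be sketched as follows, assuming pandas. The column names and values are hypothetical, and `bfill` is used on the assumption that "backfill" means filling from the next valid record; `ffill` would be the call for filling from the previous one.

```python
import pandas as pd

# Hypothetical columns and values, deliberately out of date order.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2018-06-03", "2018-06-01", "2018-06-02", "2018-06-04"]
    ),
    "second_parade_hour": [None, 15.0, None, 20.0],  # null by design
    "capacity_lost": [10.0, None, 30.0, None],       # ordinary missing data
})

# Sort by time first so that neighboring rows are neighboring dates.
df = df.sort_values("datetime").reset_index(drop=True)

# 1. Null by design: use a sentinel far outside the 0-23 hour range.
df["second_parade_hour"] = df["second_parade_hour"].fillna(99)

# 2. Backfill: take the next valid value in time order.
df["capacity_lost"] = df["capacity_lost"].bfill()

# 3. Trailing gaps have nothing after them, so fall back to the median.
df["capacity_lost"] = df["capacity_lost"].fillna(df["capacity_lost"].median())
print(df)
```

The last row shows why a fallback is needed: a gap at the end of the series has no later value to backfill from.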
Now that the data was ready, we used the following metrics to determine how well the models performed:
- Mean Absolute Error: the average of the absolute differences between actual and predicted values
- Mean Squared Error: the average of the square of the difference between actual and estimated values
- Root Mean Squared Error: the square root of the mean squared error
- R-squared: measure of how much the variation in the dependent variable (posted wait time) can be explained by the independent variables
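The four metrics above can be written out directly. This is a plain-Python sketch with invented numbers; in practice `sklearn.metrics` provides equivalents such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, and R-squared for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_actual = sum(actual) / n
    ss_residual = sum(e * e for e in errors)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_residual / ss_total
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Invented posted wait times (minutes) and predictions.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

MAE and RMSE are in minutes, which makes them easy to read as "typical miss"; RMSE penalizes large misses more heavily than MAE does.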
After an initial look at all of the models, we took a deeper dive into the best performers to find the one most suited to our purposes. The best performing models were the tree-based models. We’ll go into more detail on them here; the exploration of the other models is covered in the notebooks section of our GitHub, linked below.
For all three tree-based models, we looked at the metrics defined above for each combination of the following hyperparameter values:
n_estimators = [10, 50, 100]
max_depth = [10, 50, 100]
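A sweep over these values might look like the sketch below, assuming scikit-learn. The data is synthetic and the loop structure is a guess at one reasonable setup, not the project's pipeline. Note that a single decision tree has no `n_estimators`, so it is only varied by depth here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real features and target are not reproduced.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = []
for max_depth in [10, 50, 100]:
    # One decision tree per depth, since it has no n_estimators parameter.
    candidates = [
        ("DecisionTree", None,
         DecisionTreeRegressor(max_depth=max_depth, random_state=0))
    ]
    for n_estimators in [10, 50, 100]:
        for name, cls in [("RandomForest", RandomForestRegressor),
                          ("ExtraTrees", ExtraTreesRegressor)]:
            model = cls(n_estimators=n_estimators, max_depth=max_depth,
                        random_state=0)
            candidates.append((name, n_estimators, model))
    for name, n_estimators, model in candidates:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results.append({
            "model": name,
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "mae": mean_absolute_error(y_test, pred),
            "r2": r2_score(y_test, pred),
        })

for row in results:
    print(row)
```

When several settings score about the same, choosing the one with the fewest trees keeps training and prediction cheaper.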
After this analysis, the simplest model that best fit our data for predicting posted ride wait times was a Random Forest Regressor with 10 trees and max depth 50. Our metrics for this model are as follows:
So, 73% of the variation in the posted ride wait times can be explained by the variation in the available ride, park, weather, and Covid data. To get a better understanding of which variables contribute the most to our model, we computed feature importances.
The top features are as follows:
- Magic Kingdom Extra Magic Hour Evening (Boolean)
- Ride Type — Slow (Boolean)
- Ride Type — Spinning (Boolean)
- Ride Type — Small Drops (Boolean)
- Total Opening Hours including Extra Magic Hours for Magic Kingdom (numeric)
- Percentage of Schools in Session Within Driving Distance to California Only (numeric)
- Park Area Adventureland (Boolean)
- Walt Disney World Max Temperature (numeric)
- Magic Kingdom Extra Magic Hour Morning Yesterday (Boolean)
- Total Opening Hours including Extra Magic Hour for Animal Kingdom (numeric)
- Historical High Temperature (numeric)
- Historical Low Temperature (numeric)
- Ticket Season Peak (Boolean)
- Ride Name is Seven Dwarfs Mine Train (Boolean)
- Total hourly capacity lost on that park day (due to attraction closures) (numeric)
- Magic Kingdom Extra Magic Hour Evening Tomorrow (Boolean)
- Magic Kingdom Extra Magic Hour Evening Yesterday (Boolean)
- Percentage of Schools in Session Within Central Florida Only (numeric)
- Epcot Extra Magic Hour Morning (Boolean)
- Park Area Frontierland (Boolean)
- Ride name is Walt Disney’s Carousel of Progress (Boolean)
- Yesterday’s Total Opening Hours including Extra Magic Hours for Magic Kingdom (numeric)
- Percentage of Schools in Session Within Driving Distance to Florida Only (numeric)
- Fast pass (Boolean)
- 1st Epcot Fireworks time (time)
- Ride Type Dark (Boolean)
- Magic Kingdom Event Disney Villain After Hours (Boolean)
- Yesterday’s Total Opening Hours including Extra Magic Hours for Animal Kingdom (numeric)
- Happy Hallowishes Fireworks at Magic Kingdom (Boolean)
- Holiday Wishes Fireworks at Magic Kingdom (Boolean)
It’s important to note the large drop-off in importance from the first feature to the second, as well as from the fourth to the fifth, followed by a gradual decline in which the last 15 of the top 30 don’t appear to have a large impact. As you can see, our expectation that ride type (slow, spinning, drops, etc.) would contribute to the model was correct. Some weather variables also show up, as do features relating to Extra Magic Hours.
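For readers who want to reproduce a ranking like this, here is a minimal sketch that assumes scikit-learn's impurity-based `feature_importances_` attribute. The column names are hypothetical and the target is synthetic, constructed so that the first feature dominates; it says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features with hypothetical names.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "extra_magic_hour_evening": rng.integers(0, 2, size=500),
    "ride_type_slow": rng.integers(0, 2, size=500),
    "noise": rng.normal(size=500),
})
# Synthetic target in which the first column matters most by construction.
y = (
    30 * X["extra_magic_hour_evening"]
    + 5 * X["ride_type_slow"]
    + rng.normal(scale=1.0, size=500)
)

model = RandomForestRegressor(n_estimators=10, max_depth=50, random_state=0)
model.fit(X, y)

# Importances sum to 1; sorting them gives a ranked list of features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe how much each feature helped the trees split the training data, so they show association with the target rather than cause.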
While it seems that extra hours at Magic Kingdom and the general ride type account for much of the model’s prediction, it’s difficult to make substantial conclusions about the “perfect day at Disney” given these results. To further explore our model’s decisions we visually evaluated the predicted wait times for each ride, which can be seen in our GitHub repository.
It’s always important to think about any potential ethical concerns when starting a data science project. We thought it might be possible that we could have found information that could negatively implicate Disney and their business. This didn’t end up being the case.
Since there is no PII in this data set, there wasn’t a huge risk for ethical issues to come up. However, it is important to consider the audience for this project as not only being people that are interested in Data Science, but people that are familiar with Disney World rides and events and likely can afford to go to Disney World. For that reason, this project is somewhat exclusionary, but we hope that anyone that comes across it finds it enjoyable anyway.
We’ve learned that there are a lot of things that go into the “perfect” day at Disney World, and planning will only get you so far! However, we are excited that our model picked up on some signal with Extra Magic Hours and general ride type being fairly significant indicators of wait times at Magic Kingdom on a given day.
While we are excited about our progress so far, there is still lots of room for growth with this project. For anyone looking to dive deeper, we would recommend training a model for each ride separately to see if that brings any additional insights to light or improves the model’s performance.
Related Data Science Work
If you liked this post and are interested in how Data Science can be used to understand Disney Parks, check out these blog posts from other researchers:
Disney Parks collect a massive amount of customer data and keep track of their own park operations with data. The rides people ride, the merchandise they purchase, the entertainment they consume, etc. is all used for Disney to develop solutions to optimize their guests’ experience and help with overall customer satisfaction so that their patrons continue to come back. We wanted to create this project for the current customers to know what to expect when they are at the Magic Kingdom. It can be a stepping stone for further research that interests the customer over the company’s profits.
Check out our GitHub Repository for a more detailed understanding of the project:
Statement of Work
Zealand Cooley contributed to the merging of the data sets, exploratory data analysis, and exploration of tree-based machine learning models. More specifically, she worked on the horizontal concatenation of data sets, visualized wait time distributions across rides and ride types, visualized the correlations of wait times with other quantitative variables, and performed the initial regression exploration for the Decision Tree and Extra Trees model types. She also wrote the complete first draft of this blog post.
Kendall Dyke contributed to data cleaning and feature engineering efforts to prepare for modeling. She also explored several regression models and built the final pipeline, Makefile, & resulting Python scripts based on the chosen Random Forest model. She developed the results visualizations including wait time analysis and feature engineering charts. Kendall finally outlined and wrote the draft of the README with the exception of the details about the weather data.
Zhou contributed to data cleaning and integrating weather data into the dataset, some exploratory data analysis, and some of the modeling. He contributed to the vertical concatenation of the data sets, manipulating the weather data to a more readable format and concatenating that dataset to the ride dataset. He worked on the weather data of the README file.