This project is aimed at analyzing the factors leading to commercial flight delays, specifically targeting small operating airlines. The goal is to analyze historical flight data from Carrier On-Time Performance Report (2018 — present) on Bureau of Transportation Statistics (BTS)  to draw insights and eventually use this knowledge to develop a machine learning model that predicts the probability of flight delays.
Data for this analysis was sourced from the Bureau of Transportation Statistics. This involved downloading datasets from January 2018 to April 2023. The datasets were consolidated into a singular comprehensive file to facilitate more streamlined analysis called combined_data.
Combined_data has 32 million records and 120 columns. Among these columns, two have been specifically marked as possible predicted labels:
- ArrDel15: This column categorizes flights into two groups — delayed (1) and not delayed (0). All flights with delayed time more than fifteen minutes were considered as delayed and vice versa .
- ArrDelayMinutes: This column details the exact duration of a flight’s delay in minute.
The remainder of the columns are factors that may influence or correlate with delays. They contain both internal and external factors, including: reason of delays, seasonal-related factors (months, days of week, years, etc.), operational-related aspects (wheel-on, wheel-off, flight schedules, tail numbers, etc.), and so on.
However, combined_data is a very big file for analysis. Therefore, the research was conducted to define which factors have the strong impact on delays. Seventy-four columns were dropped from the dataset and forty-six columns are kept for data analysis and modeling. The new combined_data file has 32 million rows and 46 columns after manual feature selection step.
In addition, an initial analysis from the new combined_data file shows that 2020 has the lowest number of delays. Covid-19 may be the factors that caused the number of flights in 2020 drop dramatically, leading to the small number of delays. According to TranStats , the number of arrival delays in 2020 was half that of 2018, 2019, 2022, and the predicted figures for 2023. Meanwhile, the number of canceled flights in 2020 was four times that of the other years in the period. This leads us to conclude that the anomalies observed in 2020 were likely due to the impact of COVID-19 on flight schedules and operations. Include 2020 in the trained model may impact on the accuracy of the model, thus, the final data for data analysis and modeling does not include data of 2020. The new combined_data file include 26 million rows and 46 columns after year selection step.
Tools and Techniques
Insufficient computation resources presented challenges in merging csv files after downloading from BTS. To address this problem, cloud computing and MySQL server were deployed. The 64 csv files were uploaded to MySQL with the aid of dBeaver. The connection between Tableau and MySQL was established for data visualization. The data selection step was processed on Google Colab. Other tools which were utilized throughout the project are:
- Development Environment: Jupiter Notebook
- Python library: Scikit-learn, Pandas, NumPy, Matplotlib, Plotly
- BI tool: Tableau
- Version Control: Git, GitHub
- Databased: MySQL, dBeaver
Exploratory Data Analysis Approach
Data Modeling Approach
In Section 1.3.1, two potential predictive labels are outlined: ArrDel15 and ArrDelMinutes. For ArrDel15, classification models aim to predict the probability of delays, while for ArrDelayMinutes, regression models are employed to forecast the duration in minutes. These two labels suggest differing modeling approaches; however, both utilize supervised learning technique.
In this project, the predictive objective was ArrDel15, therefore, classification models were used, including: Gradient Boosting, Random Forest, Decision Tree, AdaBoost. To train the model, a set of features were selected based on their correlation to the label. Important features which have the highest correlation with ArrDel15 in the correlation heatmap (Figure 1) were chosen. In addition, two evaluation matrices are used, including Accuracy and Confusion Matrix (precision and recall are used).
An analysis was conducted to calculate average flight per month of each airline in the dataset in years 2018, 2019, 2021, 2022, and 2023. According to the information given by Chris MacNeel, airlines with more than 20,000 flights per month are big airlines and vice versa. Therefore, airlines in the dataset were categorized into three primary group:
- Big Airlines (>20,000 monthly flights): Southwest Airlines (WN), Delta Air Lines (DL), American Airlines (AA), SkyWest Airlines (OO), United Airlines (UA), Republic Airways (YX), Envoy Air (MQ), JetBlue Airways (B6).
- Small Airlines (<20,000 monthly flights):
Regional/cargo airlines: Endeavor Air (9E), PSA Airlines (OH), Mesa Airlines (YV), Horizon Air (QX), West Atlantic (PT), Air Wisconsin (ZW), GoJet Airlines (G7), Commute Air (C5), Empire Airlines (EM), PenAir (KS), Cape Air (9K).
Small operating airlines: Alaska Airlines (AS), Spirit Airlines (NK), Frontier Airlines (F9), Allegiant Air (G4), Hawaiian Airlines (HA).
- Terminated Airlines: Trans States Airlines (AX — 2020), Virgin America (VX — 2018), Compass Airlines (CP — 2020), ExpressJet Airlines (EV — 2020).
The project only focuses on small airlines because they are the targeted customers of our client.
 “Marketing Carrier On-Time Performance (Beginning January 2018).” Bureau of Transportation Statistics, www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGK&QO_fu146_anzr=. Accessed 27 Aug. 2023.
 Schonland, Addison. The Value of Time for an Airline | AirInsight. 13 Feb. 2023, airinsight.com/the-value-of-time-for-an-airline/. Accessed 28 Aug. 2023.