This study explores how machine learning methods can be applied to triathlon race data to forecast performance in the cycling and running segments. These predictions aim to give triathletes pacing guidance mid-race: once they finish the swimming segment, they can be advised of the cycling and running pace required to achieve their overall target time.

Twenty years’ worth of elite-level triathlon performances over both Sprint and Olympic distances are used to train neural network, linear regression and regression tree algorithms. The weak correlation between running times and the independent variables results in a clear gulf in accuracy between models where running is the target variable and models where cycling is the target variable. All three models appear equally applicable to each of the four performance groups (Olympic Male, Olympic Female, Sprint Male and Sprint Female), as there is very little difference in accuracy scores between them. The artificial neural network model generates slightly more accurate predictions and, whilst it lacks transparency, is sensitive to all three independent variables and their relative explanatory power over the dependent variables. Regression trees, on the other hand, are easier to understand but, particularly when pruned, lose a sizeable portion of their accuracy as their predictions become too generalised.

Therefore, the real-world application of these models would use neural networks to predict a cycling time and then deduct this value, together with the transition and swimming times, from the overall target time in order to produce the optimal pacing guides for the final two segments. The model works most effectively when there is minimal variation in the training data, particularly around racing conditions and athlete physiological profiles, and in those circumstances it offers relevant tactical guidance to enhance performance mid-race.
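As a minimal sketch of this deduction, the time left for the run is whatever remains of the target once the swim, both transitions and the predicted bike time are subtracted. The helper name and all of the figures below are hypothetical, and the second transition would have to be estimated mid-race (for example from a typical value) because it has not yet happened.

```r
# Hypothetical helper: seconds left for the run once every other segment is accounted for.
# bike_pred would come from the fitted model; t2 is an assumed typical value.
remaining_run_time <- function(target_total, swim, t1, bike_pred, t2) {
  target_total - (swim + t1 + bike_pred + t2)
}

# Illustrative figures only (not real race data)
run_target <- remaining_run_time(
  target_total = 6600,  # 1:50:00 overall target
  swim         = 1080,  # 18:00 swim already completed
  t1           = 45,
  bike_pred    = 3480,  # 58:00 bike time predicted by the model
  t2           = 30
)
run_target  # seconds available for the run
```

The same subtraction works in the other direction if the run is predicted first and the bike time is the remainder.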

Swimming, cycling and running are three of the most universally popular forms of exercise and have been combined as a single sporting event since the early 20th century. It wasn’t until 1989 that the International Triathlon Union (ITU) was founded and set the Olympic distance as a 1.5km swim, a 40km bike ride and a 10km run ahead of the 2000 Sydney Olympic Games (Millet, GP. 2000). The event has grown massively in popularity, now boasting six annual World Triathlon Championship Series events in which over 100 professional triathletes compete. The ITU, now known as World Triathlon, is the main source of the data for this study, which consists of more than twenty years of Olympic Games and World Triathlon Championship Series results. The multidisciplinary nature of the sport demands some distinct and some shared physiological and technical characteristics, such that a performance level in one sport may not necessarily equate to an identical performance in another.

Initially, this study aims to identify the determinant factors behind high performance in triathlons, considering segment times of over 13,000 triathletes from 157 different races. By forming the training data from a large number of different races and diversifying the data source, the model will be able to discard any outlier races where race times are adversely affected by external conditions such as weather. Training datasets will be separated firstly by gender and then by distance so that the models are as specific and relevant as possible to the athletic profile being forecast. Triathlon-specific features such as physiological demands, transitions and pacing strategies will be explored in order to train a model consistent with the most typical racing environment.

Once the key predictive variables are established, a machine learning model will be created which predicts required performance in cycling and running based on swimming race time, overall target time and the average segment times on that particular race course. Given that all the variables are continuous and the training data is labelled, the supervised machine learning methods of linear regression, artificial neural networks (ANN) and regression trees will be explored to understand which offers the most useful performance prediction. Linear regression creates a line of best fit to predict values, an ANN simulates a biological system of neurons which outputs values based on predefined activation functions (Eshragh, F. et al 2015), whilst a regression tree splits data into branches using recursive partitioning. These forecasts will serve as a guideline for athletes mid-race: having just finished the swimming section, they will be told what times they should aim to achieve in cycling and running if they are to reach their overall target time. Such information is of vital importance as it enables their pacing strategy and tactics to be both reflective of their in-race performance and statistically supported by training data from previous races. As a result, athletes will be able to optimise their performance by exerting and conserving energy where necessary based on the models’ predictions.

This study will start by exploring relevant literature including sporting, scientific and statistical journals in order to best establish the nature of triathlon races and any factors which need to be considered. An explanation of the methodology will then justify the steps taken to create and evaluate the models. Results will then be presented ahead of a critical discussion as to how the forecasts meet the aims of the models.

In addition to the aforementioned Olympic distance, this project will also examine Sprint triathlons (750m swim, 20km bike and 5km run) to understand how triathlete profiles and performance change relative to distance. Whilst the Sprint race is exactly half the distance in each of the three disciplines, a shorter distance demands different physiological and tactical exertions. As the name suggests, the shorter distance is considered more of a sprint, with athletes operating closer to their maximum aerobic capacity. Sharma, P., et al. (2020) explain that speed and power are more dominant characteristics over these distances compared to the importance of tactical strategy and endurance in longer distances. Changes in distance not only call for different physiological demands but, according to Bentley DJ., et al. (2008), also require different carbohydrate loading strategies, fluid plans and a cycling pacing strategy which induces a metabolic reserve prior to running. The implication of these significant differences for the model is that Sprint data and Olympic data will have to be treated separately to train two distinct models, otherwise the predictions will be inaccurate as they will not consider the influence distance has on performance.

Sousa, CV., et al.’s 2021 paper examined the importance of each sport when predicting overall performance across different distances. Cycling is identified as the best predictor of Sprint triathlons, whilst swimming is the best predictor for Olympic distances. At first glance, this is peculiar as swimming represents the smallest segment of the three, yet it has great influence owing to the disadvantageous position in which a slow swim time leaves a triathlete. The study argues that a slow swim will result in joining a slower cycling peloton or even cycling completely alone. Cycling is raced in groups (pelotons), in which racers move together to shelter each other from wind resistance and, in doing so, conserve energy. This is typical of Sprint and Olympic distances and therefore will need to be taken into account in this study, as it may explain similar cycling performance among athletes who finished the first transition at the same time.

As is the case for the majority of sports, triathlon is separated by gender owing to the significant impact which physiological gender differences have on triathlete performance. Knechtle, B., et al. (2019) discovered that women incur proportionally longer cycling times than men. Such differences result in the unequal relative importance of the three disciplines to the overall race result. Cuba-Dorado, A., et al. (2021) recognise that at elite level, running makes a significant contribution to finishing position, although it is more decisive in male triathletes. They argue that swimming performance, on the other hand, makes a greater contribution to finishing positions in females than in males. These important gender and performance differences mean that the model will treat male and female data separately so as to create gender-specific performance predictions and guidelines which more accurately reflect the relative importance of each of the three disciplines.

A third significant differential in triathlete performance is the difference between elite and recreational performance. Elite performers have access to vastly different preparation resources and equipment such that they cannot be treated the same as amateur triathletes, who are of all ages and take part for purposes other than performance optimisation and competition. For this reason, as well as the fact that data for amateur triathletes is less consistent and accessible, only elite triathletes will be considered in this study, which will make the forecast model as specific and relevant as possible.

Whilst not identifying a proven pacing strategy for success, Wu, S. (2014) recognises weather, topography, age, drafting, gender and race distance as factors that inform how athletes distribute their physical exertions. Le Meur, Y., et al. (2011) find that, in order to achieve an optimal total running time, athletes should not be swayed by direct opponents into deploying an aggressive strategy in the initial part of the running segment. The significance of such race tactics will be explored further as they are likely to have an impact on segment distribution. Pacing strategy and physiological capacity are interlinked in that a triathlete requires a tactic which will optimise their physical capabilities without over-exerting or under-exerting themselves. Whilst the Sprint distance is exactly half the Olympic distance in each of the three sports, it is not assumed that the change in pace will be the same in all three. This is to say that some disciplines can be maintained at maximum speed for longer than others, so different pacing strategies are required. The study will examine the proportion in which distance affects velocity and in which disciplines a slower pace is more suited to achieving a better overall time. One key goal of this project is to offer athletes a mid-race forecast of the cycling and running times (dependent variables) that they need, given their swimming time and their overall target time (independent variables). This serves as a pacing strategy in itself, as it is valuable information as to how athletes should manage their energy and at what speed they should race in the final two disciplines.

Unlike single-discipline sports, triathlons pose the unique feature of transitions and, as they contribute to the overall race time, they need to be trained in order to be as smooth and quick as possible. Cejuela, R., et al. (2013) explore the statistical significance that these periods hold for the final race outcome, calculating a correlation coefficient of 0.43 between lost time in transition two (cycling-running) and overall performance. This correlation with overall performance was even higher than the correlations for transition one, transition two and the swimming segment. This demonstrates statistically that losing less time in transition bears a significant influence on race performance. If such a statistically significant relationship is found in this dataset, then this study should also consider including transitions in the model in order to better inform the forecast.

Owing to the multidisciplinary nature of triathlon, predictive power from one sport to another assumes some degree of skill transferability between disciplines. This is to say that there must be sufficient correlation between swimming, cycling and running performance to be able to make meaningful inter-sport forecasts. Whilst technique and muscle groups differ between sports, one crucial fundamental characteristic of triathletes is aerobic capacity, often measured as VO2 max, which, as discussed by González-Parra, G., et al. (2013), can even be used to predict triathlon performance. The authors explain that VO2 max is not innate and can be improved with concentrated training. Furthermore, O’Toole, M.L., et al. (1995) identify two other physiological adaptations which depend on aerobic capacity as important factors behind triathlon performance. Firstly, the economy of motion, which is submaximal VO2, is influential as energy output needs to be sustained for optimal performance, and studies have shown a correlation between the economy of motion in the three disciplines. Secondly, it may be useful to consider training and racing intensities as a percentage of VO2 max, known as fractional utilisation of maximal capacity. Each triathlete operates at a different energy output in each sport, so understanding what proportion of intensity they are exerting can be useful when identifying which discipline they are underperforming in.

As discussed in Sleivert, G., et al. (1996), the profile of elite triathletes consists of a combination of physiological and physical characteristics which are also present in endurance swimmers, cyclists and runners. Being generally tall, with low levels of body fat and light to average in weight, offers large leverage as well as an ideal surface area to weight ratio. The training data is based on elite athletes of similar athletic profiles, so the model may predict segment distributions which are significantly different to those of a triathlete with a non-elite physiological profile. The further away an athlete’s physiological profile is from that of an elite competitor, the less relevant the model will be. This is one limitation of the model, as its training data is based on a narrow subset of similar performers, albeit over a wide range of race courses. The recognition of such physiological characteristics supplements the model’s predictions for coaches, as they can better understand the physiological changes an athlete will need to make in order to optimise their performance towards an elite level. The shift from power to endurance which longer distances require is mirrored in the changing physical profile and training programme of athletes.

Schabort, EJ., et al. (2000) conducted a similar study which aimed to identify the most powerful variables in the prediction of triathlete race performance. That study was fortunate enough to have greater access to athlete information such as heart rate, peak oxygen uptake and blood lactate. However, it is limited because, firstly, it was conducted in a laboratory rather than a competitive racing environment and, secondly, it treated segment performance as the dependent variable rather than an independent variable, with the physiological variables serving as the independent variables. These limitations mean that not only does it fail to accurately mirror racing conditions, but it also does not appreciate the impact which one segment may have on the next. For instance, immediately after completing swimming and cycling segments in a World Championship, an athlete’s running condition will be different to when they are in a laboratory five days after a race. Therefore, it is important that each sport is treated in the context of the conditions of the training data, in which the segments are raced consecutively and competitively.

The following steps outline the approach taken to acquire, prepare and explore the data, before modelling and evaluating the results.

The first step in this project was to acquire the relevant race data from triathlon.org/results, which is an online database of all World Triathlon-affiliated races. As the official governing body for the sport, this can be considered a reliable data source and offers a useful breakdown of variables even if they are all separated by race. From the online database, races were filtered based on their category and specification before being individually downloaded in CSV form. The selected races were Olympic Games, World Cup and World Triathlon Championship Series races since 2000. Each race consists of the following fourteen variables:

All 312 races are individually imported into R for further analysis and manipulation.

The first step in manipulating the data so it is ready for analysis is to amalgamate all the different races into one large data frame for preparation. The four irrelevant variables (First Name, Last Name, Nationality and Start Number) are immediately discarded as they will not be considered further. Race data consists of all registered participants in each triathlon regardless of performance. Some athletes are filtered out for returning null values due to not finishing (DNF), not starting (DNS) or disqualification (DSQ) as they will only skew the data (Box 1).

Box 1 Filter race data

    library(dplyr)

    # Remove athlete names, nationality and start number columns
    Results2 <- subset(Results2, select = -c(NATIONALITY, `ATHLETE FIRST`, `ATHLETE LAST`, `START NUMBER`))

    # Remove DNF, LAP, DNS and DSQ rows and zero (null) times; keep elite programs only
    Results3 <- Results2 %>%
      filter(POSITION != "DNF", POSITION != "LAP",
             POSITION != "DNS", POSITION != "DSQ",
             RUN != "00:00:00", SWIM != "00:00:00",
             BIKE != "00:00:00", T2 != "00:00:00",
             T1 != "00:00:00",
             PROGRAM == "Elite Men" | PROGRAM == "Elite Women")

With the remaining data, any variables which represent time must be converted into a numerical form. Initially, R recognised the data as characters because it was presented in the *hours:minutes:seconds* format, so in order to be processed numerically, it is converted into seconds (Box 2).

Box 2 Convert time variables into seconds

    library(lubridate)

    # Convert hh:mm:ss strings to seconds
    Results3$Swim      <- period_to_seconds(hms(Results3$SWIM))
    Results3$T1        <- period_to_seconds(hms(Results3$T1))
    Results3$Bike      <- period_to_seconds(hms(Results3$BIKE))
    Results3$T2        <- period_to_seconds(hms(Results3$T2))
    Results3$Run       <- period_to_seconds(hms(Results3$RUN))
    Results3$TotalTime <- period_to_seconds(hms(Results3$`TOTAL TIME`))

As previously discussed, it is important to understand the proportion of the total race time which is spent on each segment, so the segment percentage is calculated to form a useful insight into segment distribution. Three extra columns are added which calculate *Swim time/Total time*, *Bike time/Total time* and *Run time/Total time* (Box 3). The total of the three values should be just less than 1 as the transition times, which are the few seconds in which athletes switch between sports, are not included in the calculation.

Box 3 Calculation of segment percentage

    # Add segment percentages using the numeric (seconds) columns created in Box 2
    Results3$SwimPercentage <- with(Results3, Swim / TotalTime)
    Results3$BikePercentage <- with(Results3, Bike / TotalTime)
    Results3$RunPercentage  <- with(Results3, Run / TotalTime)

It is important to differentiate between performance time and performance ranking. To enable this, a performance ranking coefficient is calculated for each segment by dividing the athlete’s position in that segment by the total number of athletes competing in that race. An example of the calculation for the swimming position is shown in Box 4. Values closest to zero represent the best-performing athletes whereas values closest to one represent the slowest.

Box 4 Calculation of swimming position coefficient

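    # Rank swim times within each race (1 = fastest) so that values near zero are the best
    Results3 <- Results3 %>%
      group_by(`PROG ID`) %>%
      mutate(SwimRank = rank(Swim, ties.method = "first")) %>%
      ungroup()

    # TotalProg is assumed to hold the number of athletes in each program
    Results3$SwimRank <- with(Results3, SwimRank / TotalProg)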

The data frame is then split into four distinct sets: Male Sprint, Female Sprint, Male Olympic and Female Olympic. The four datasets form the basis of the data exploration and each consists of more than 2,500 rows and thirteen columns, detailing athlete information, race times, segment percentage, ranking and course information.

Now that the datasets are separated into relevant performance groups, the results can be explored in order to identify any outliers and erroneous times. It is important that these spot checks occur after the data is split because performance can only be judged against athletes of the same gender competing over the same distance. Scatter plots and boxplots are both clear, visual methods to identify data points which are dissimilar to the rest of the performance group. Once a point has been identified, further details can be investigated by filtering on the Athlete ID or Program ID to understand whether the anomaly is a one-off or whether it’s consistent with the athlete or race. After removing void data, an initial comparison of segment distribution can be made in order to understand the timeshare of each segment. This is carried out by producing boxplots and statistical summaries of each distribution. It is important to note the differences in distribution by gender and distance in order to validate the splitting of the data by these criteria.

Secondly, testing for correlation between segment times and total time is an important method of understanding the statistical relationship between segment performance and overall performance. Crucially, raw times and segment percentage alone cannot be used in this test as it will be skewed by the cycling segment which in all cases takes up the most time. The ranked position coefficient does give a standardised interpretation of performance and will test the correlation between the position in each segment compared to the overall race position.
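To illustrate the position-based test, the sketch below correlates toy position coefficients with base R's `cor`. The six athletes and their values are invented, and Spearman's method is an assumption chosen because the inputs are ranks; the study does not state which coefficient was used.

```r
# Toy position coefficients (position / field size) for six hypothetical athletes
toy <- data.frame(
  SwimRank  = c(0.17, 0.33, 0.50, 0.67, 0.83, 1.00),
  RunRank   = c(0.33, 0.17, 0.67, 0.50, 1.00, 0.83),
  TotalRank = c(0.17, 0.33, 0.50, 0.67, 0.83, 1.00)
)

# Correlation matrix between segment positions and overall position
cor_matrix <- cor(toy, method = "spearman")
round(cor_matrix, 2)
```

Reading along the *TotalRank* row shows which segment position tracks the overall position most closely in this toy example.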

By exploring segment percentage or position coefficient rather than raw race times, the differences between race conditions are absorbed. For instance, disparities in temperature, altitude, wind and elevation gains must all be considered when aggregating data from different races. Therefore, a metric which standardises the race results is useful in order to not be negatively skewed by inconsistencies between races. Adding the variables of *CourseAverageBike* and *CourseAverageRun* (Box 5), which are the mean average times of all competitors who have competed in that discipline on that course, further minimises the impact of varying race conditions between courses.

Box 5 Calculation of course average segment times

    # Mean bike and run time of all competitors at each race location
    Results3$CourseAverageBike <- ave(Results3$Bike, Results3$LOCATION, FUN = mean)
    Results3$CourseAverageRun  <- ave(Results3$Run, Results3$LOCATION, FUN = mean)

Once the role of each segment and its impact on the overall performance has been established, the dataset can be taken forward to the modelling stage of the study. The four distinct datasets (*Sprint Male*, *Sprint Female*, *Olympic Male* and *Olympic Female*), which each consist of six variables (*Swim*, *Bike*, *CourseAverageBike*, *Run*, *CourseAverageRun* and *Total Time*), will each be inputted to the model separately as training data.

As the variables span vastly different ranges, and some machine learning methods require standardised variables in order to make predictions, all of the variables are normalised prior to modelling. Boxplots such as Fig 1 show how, in twenty randomly selected races from the male Olympic distance, there is significant variance in times despite the races officially being the same distance and consisting of the same elite-level athletes. Therefore, a normalisation process is deployed in order to bring the races onto the same scale, so that a strong performance in a longer race is not mistaken for a poor performance in a shorter race despite the raw times being the same. By normalising each race with the softmax technique, all variables range from zero to one, with zero representing the minimum value and one representing the maximum. This nullifies the variance between races to give uniform values which are reflective of the athletes’ performances in each race.
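The effect of scaling each race separately can be sketched with plain min-max scaling, which maps the minimum of each race to zero and the maximum to one as described. This is a swapped-in illustration and not necessarily the softmax routine used in the study; the function name, race labels and times are hypothetical.

```r
# Min-max scaling: the minimum maps to 0 and the maximum to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy bike times (seconds) from two hypothetical races of slightly different length
toy <- data.frame(
  race = c("A", "A", "A", "B", "B", "B"),
  Bike = c(3400, 3500, 3600, 3700, 3800, 3900)
)

# Scale within each race so that values are comparable across races
toy$BikeNorm <- ave(toy$Bike, toy$race, FUN = min_max)
toy
```

After scaling, the fastest rider in each race scores zero regardless of how long the course was, which is the property the normalisation step relies on.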

Having checked for missingness and normality of the distribution, the data is then randomised and split into training (75%) and test (25%) data. The same training and testing data will be used for each of the three types of models, although the dependent variable will change depending on which sport is being predicted.
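A minimal sketch of a randomised 75/25 split in base R follows. The seed, the toy data frame and the object names are assumptions made for illustration only.

```r
set.seed(42)  # arbitrary seed so that the split is reproducible

# Toy stand-in for one performance group (random values, not race data)
toy <- data.frame(Swim = runif(100), Bike = runif(100), TotalTime = runif(100))

# Shuffle the row indices, then take the first 75% for training
shuffled  <- sample(nrow(toy))
n_train   <- floor(0.75 * nrow(toy))
train_set <- toy[shuffled[seq_len(n_train)], ]
test_set  <- toy[shuffled[-seq_len(n_train)], ]

c(train = nrow(train_set), test = nrow(test_set))
```

Keeping the shuffled indices fixed means the same rows can be reused for every model, with only the target column changing.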

The first of the three models is the linear regression forecast which, despite being commonly used as a quantitative analysis method, can be applied as a supervised machine learning algorithm to make predictions based on independent variables. *Bike*, the normalised time which it takes to complete the bike segment, is considered as the first dependent variable whilst *Swim*, *CourseAverageBike* and *Total Time* are the independent variables. A model summary is produced with the *lm* function and particular attention is paid to the p-value and adjusted r-squared value. These values detail the statistical significance of the relationship between the dependent variable and the independent variables as well as how much of the variance in the dependent variable is explained by the model. Including too many independent variables runs the risk of the model being distorted by multicollinearity, so a variance inflation factor (VIF) test is run after the model is fitted. Any variable for which the square root of the VIF exceeds the threshold of two will be excluded from the model (Box 6). Similarly, factor analysis can also be conducted in order to decide which variables to include and which to discard.

Box 6 Variance Inflation factor

    library(car)  # provides vif()

    # Calculate variance inflation factors
    vif(model2)
    sqrt(vif(model2)) > 2  # TRUE flags a predictor whose VIF is too high
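The linear regression step described before Box 6 might look like the following sketch. The data are synthetic and already on a zero-to-one scale, the coefficients used to generate *Bike* are arbitrary, and `bike_model` is a hypothetical name rather than the study's `model2`.

```r
set.seed(1)
n <- 200

# Synthetic, already-normalised predictors (hypothetical values, not study data)
toy <- data.frame(
  Swim              = runif(n),
  CourseAverageBike = runif(n),
  TotalTime         = runif(n)
)

# Synthetic target: mostly driven by TotalTime, plus a little noise
toy$Bike <- 0.1 * toy$Swim + 0.3 * toy$CourseAverageBike +
  0.6 * toy$TotalTime + rnorm(n, sd = 0.05)

# Fit the model and pull out the two quantities discussed in the text
bike_model <- lm(Bike ~ Swim + CourseAverageBike + TotalTime, data = toy)
adj_r2     <- summary(bike_model)$adj.r.squared
p_values   <- summary(bike_model)$coefficients[, "Pr(>|t|)"]

adj_r2
p_values
```

Swapping `Bike` and `CourseAverageBike` for `Run` and `CourseAverageRun` gives the corresponding running model.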

The second machine learning method is a neural network and, again, this method will be used to predict cycling and running times in each of the four performance groups. A network is created using the *neuralnet* function, which transforms the left-most nodes (inputs) into the right-most nodes (outputs) via middle nodes which are weighted by how much each variable contributes. Its performance is assessed by exploring the correlation between predicted and actual values, as well as the error values, and it can be improved by changing the selection of independent variables.
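A hedged sketch of this step with the *neuralnet* package is shown below. The data are synthetic, the single hidden layer of three nodes is an arbitrary choice, and the study's actual architecture and settings are not stated, so none of these values should be read as the author's configuration.

```r
library(neuralnet)  # must be installed separately

set.seed(3)
n <- 200

# Synthetic, already-normalised data (hypothetical values, not study data)
toy <- data.frame(Swim = runif(n), CourseAverageBike = runif(n), TotalTime = runif(n))
toy$Bike <- 0.1 * toy$Swim + 0.3 * toy$CourseAverageBike +
  0.6 * toy$TotalTime + rnorm(n, sd = 0.05)

train_set <- toy[1:150, ]
test_set  <- toy[151:200, ]

# One hidden layer of three nodes, chosen only for illustration;
# linear.output = TRUE because the target is continuous
bike_nn <- neuralnet(Bike ~ Swim + CourseAverageBike + TotalTime,
                     data = train_set, hidden = 3, linear.output = TRUE)

# Predict unseen rows and compare with the actual values
predicted <- as.numeric(predict(bike_nn, test_set))
cor(test_set$Bike, predicted)
```

Calling `plot(bike_nn)` draws the input, hidden and output nodes with their fitted weights, which is the left-to-right picture described above.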

Finally, the regression tree method is applied, which uses a decision tree to predict continuous values by splitting the data at each node. After selecting the same datasets as in the previous models, the *rpart* function is used to create a tree-based model which seeks to predict each segment time. Model performance is evaluated by comparing the correlation between actual and predicted values as well as by using error functions. A regression tree can be pruned using the *prune* function in order to simplify the model with fewer nodes. Again, accuracy is monitored in order to decide whether pruning offers an advantage over the original model.
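The tree-growing and pruning steps can be sketched as follows with *rpart*. The data are synthetic, and choosing the complexity parameter with the lowest cross-validated error is one common pruning rule assumed here; the study does not say how its pruning level was chosen.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(7)
n <- 300

# Synthetic, already-normalised data (hypothetical values, not study data)
toy <- data.frame(Swim = runif(n), CourseAverageRun = runif(n), TotalTime = runif(n))
toy$Run <- 0.2 * toy$Swim + 0.2 * toy$CourseAverageRun +
  0.6 * toy$TotalTime + rnorm(n, sd = 0.05)

# Grow a regression tree ("anova" is rpart's method for continuous targets)
run_tree <- rpart(Run ~ Swim + CourseAverageRun + TotalTime,
                  data = toy, method = "anova")

# Prune back to the complexity parameter with the lowest cross-validated error
best_cp     <- run_tree$cptable[which.min(run_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(run_tree, cp = best_cp)

# Number of terminal nodes (leaves) before and after pruning
leaves <- c(full   = sum(run_tree$frame$var == "<leaf>"),
            pruned = sum(pruned_tree$frame$var == "<leaf>"))
leaves
```

Comparing the error of `run_tree` and `pruned_tree` on held-out data shows whether the simpler tree costs accuracy.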

Root Mean Squared Error (RMSE) is a recognised function for calculating the margin of error of forecasted values. It can therefore be applied as a consistent and reliable method to compare the forecast accuracy of each machine learning model. RMSE, along with the correlation between actual and predicted values, will be used across the three machine learning methods in order to gauge, and then compare, model performance.
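RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch, with a hypothetical helper name and made-up normalised values, is:

```r
# Root mean squared error between two numeric vectors of equal length
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up normalised values for three athletes
actual    <- c(0.20, 0.50, 0.80)
predicted <- c(0.25, 0.45, 0.90)

model_rmse <- rmse(actual, predicted)  # lower is better
model_cor  <- cor(actual, predicted)   # closer to 1 is better
c(rmse = model_rmse, cor = model_cor)
```

Because the same two numbers can be computed for any model's predictions, they give a like-for-like comparison across the three methods.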

Finally, one of the three models will be selected as the best-performing predictor of the cycling and running segments. Its ability to forecast over the different gender and distance groups will be discussed in order to establish how machine learning can most effectively be applied during triathlons.

Once the race data is split into the four performance groups (Olympic Male, Olympic Female, Sprint Male and Sprint Female), the data can be explored to understand basic trends and identify any glaring outliers. A slight disparity between race distances can be expected. For instance, as discussed by Walsh, N. (2021) in Triathlon Magazine, even in prestigious Olympic distance events such as the Rio 2016 Olympics, the bike distance was measured to be 38km rather than the standard 40km. A 5% difference can be significant, particularly in cycling as it’s the longest segment. It is important to remove races which have vastly different distributions as they have inconsistent knock-on effects on fatigue and energy levels in the other segments.

Producing scatter plots such as Fig 2 (below) for each performance group highlights races which are out of the ordinary. The plot shows a significantly faster swimming time from the Cape Town 2015 male Olympic distance race than the rest of the dataset. If this were to remain in the distribution then it would significantly affect the correlation between swimming and the other variables as well as the predictive power of the variables within the models. The swimming segment in Cape Town was also shorter in the women’s races in 2014 and 2015 as can be seen by the black and pink clusters which are very distant from the rest of the dataset in Fig 3. It is therefore important to have a robust and consistent dataset ahead of normalisation and modelling in order to not skew the data by any abnormal results which will distort the min or max values.

In the case of the Sprint Male performance group (Fig 4), Kitzbuhel 2013 is considerably shorter in the run segment whilst Sarasota-Bradenton 2018 has notably lower swimming times. As part of the spot check, it can be useful to bear in mind the world record time for each distance, which in the case of the 5km run is 12:35 (755 seconds). Any time which is near or below the world record should therefore raise concern about the reliability of the data. Using a colour scheme which separates the data points by race helps to distinguish races and identify whether a group of abnormal results are individual outliers or all part of the same race.
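The world-record spot check can be expressed as a simple filter, sketched below. The 2% tolerance is an arbitrary assumption, and the race labels and run times are placeholders rather than values from the dataset.

```r
wr_5k_seconds <- 12 * 60 + 35  # 755 seconds, the figure quoted above

# Toy run times in seconds; the race labels are placeholders
toy <- data.frame(
  race = c("Race A", "Race A", "Race B", "Race B"),
  Run  = c(905, 930, 740, 760)
)

# Flag anything at or within 2% of the world record (the margin is an assumption)
toy$suspect <- toy$Run <= wr_5k_seconds * 1.02
toy[toy$suspect, ]
```

If every flagged row belongs to the same race, as in this toy example, the course was probably short and the whole race is a candidate for removal.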

Now that the races have been spot checked and appear to be consistent with other results in the same performance group, boxplots can be used to show the segment distribution (Fig 5 below). The boxplots show that there is in fact very little difference in segment distribution between the different distances and genders. In each of the four performance groups, swimming comprises on average 16% of the total race time whilst cycling is between 52% and 53% and running is between 29% and 30%. These figures, which were taken from the statistical summaries in Box 7, show little variance in each performance group, as the interquartile ranges do not exceed 2.5 percentage points in any of the datasets and the medians differ between groups by little more than 1 percentage point. Having an established and consistent segment distribution for elite triathletes is useful as it describes competitive performance in simple terms and can be used as an initial guide for coaches and athletes. A deviation from this distribution can flag an imbalance in performance, whereby one discipline is over- or underperforming. During a race, once the swimming segment is completed, a rough estimate of overall performance can be calculated using these percentage splits. However, this method falls short when predicting running and cycling times if the swimming time and target total time are already known, which is why machine learning is required.

Box 7 Statistical summaries of segment distribution of each segment over all four performance groups

    > summary(SprintEM$SwimPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.1284  0.1584  0.1652  0.1646  0.1711  0.1932
    > summary(SprintEW$SwimPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.1282  0.1538  0.1609  0.1604  0.1672  0.1936
    > summary(StandardEM$SwimPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.1370  0.1584  0.1630  0.1638  0.1684  0.1877
    > summary(StandardEW$SwimPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.1355  0.1548  0.1600  0.1608  0.1656  0.1864
    > summary(SprintEM$BikePercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.4325  0.5142  0.5259  0.5256  0.5368  0.5885
    > summary(SprintEW$BikePercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.4727  0.5116  0.5220  0.5235  0.5353  0.5857
    > summary(StandardEM$BikePercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.4660  0.5210  0.5319  0.5309  0.5417  0.5877
    > summary(StandardEW$BikePercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.4709  0.5179  0.5286  0.5285  0.5402  0.5756
    > summary(SprintEM$RunPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.2393  0.2773  0.2870  0.2875  0.2969  0.3952
    > summary(SprintEW$RunPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.2470  0.2839  0.2938  0.2941  0.3031  0.3575
    > summary(StandardEM$RunPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.2547  0.2843  0.2927  0.2939  0.3018  0.3603
    > summary(StandardEW$RunPercentage)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.2622  0.2895  0.2981  0.2993  0.3079  0.3663
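The rough percentage-split estimate mentioned before Box 7 can be sketched as follows. The shares are rounded from the means in Box 7, and the swim time is a hypothetical example rather than a real result.

```r
# Approximate mean shares of total time, rounded from the summaries in Box 7
shares <- c(swim = 0.16, bike = 0.53, run = 0.29)

swim_seconds <- 1080  # hypothetical 18:00 swim

# Scale the swim up to an implied total, then apportion the other segments
implied_total <- unname(swim_seconds / shares["swim"])
estimate <- c(
  total = implied_total,
  bike  = implied_total * unname(shares["bike"]),
  run   = implied_total * unname(shares["run"])
)
round(estimate)
```

This estimate only extrapolates from the swim; it cannot take a target total time as an input, which is the gap the machine learning models are intended to fill.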

In the *Bike* versus *Total Time* scatter plot (Fig 6), the data points are divided by colour and show patterns of vertically aligned groups. The graph displays a strong positive correlation overall but very little variety in bike times within races, which is why the points appear in vertical lines. This supports the previously discussed notion that cyclists tend to ride alongside each other in small groups and therefore achieve similar cycling times (x-axis) despite recording different total times (y-axis).

According to the correlation matrix in Fig 7, *Swim*, which acts as an independent variable in the model, is less strongly correlated with *Bike* and *Run* than *Total Time* is. This may limit its predictive power in the machine learning models as it is not closely associated with the target variables. Fortunately, *Swim* is not the only independent variable in the models, so it is supported by *Total Time* which, given its strong correlation, should be better able to explain the dependent variables.

Whilst the *Bike* versus *Total Time* scatter plot showed subtle vertical patterns, the *Run* versus *Total Time* plot shows vague diagonal lines parallel to the line of best fit (Fig 8). *Total Time* is the sum *Swim* + *T1* + *Bike* + *T2* + *Run*. Although *Bike* dominates this sum (52% to 53%), cycling times vary very little within a race, and the transition and swimming times are small, so within any one race *Run* becomes a key determinant of *Total Time*. The transition times are now discarded because, as shown in the correlation matrix (Fig 7), they have virtually no correlation with any of the target variables and are in any case very small segments of the race.

There is an interesting difference between which segment correlates most strongly with overall performance in terms of raw time and in terms of position. In each of the four groups, *Bike* has the strongest relationship with *Total Time*. This is in part because the cycling segment is always the longest in a triathlon, contributing 52% to 53% of total race time. However, when assessing the relationship between segment position and overall position, in each of the four performance groups it is the running position which has the strongest correlation with the overall position. In other words, how an athlete places in the running segment is a better indication of their overall position than their placing in any of the other segments. These differences are illustrated in the correlation matrices: with respect to time (Fig 7) the most strongly correlated variable is *Bike*, whereas with respect to position (Fig 9) it is *Run*.

Even after outlier races have been removed, one major limitation to the statistical analysis of triathlon is the variety in distance and conditions which each course poses. Unlike indoor athletics, where venues worldwide are more uniform, in triathlon it is much harder to find racing conditions which match each other. Sometimes, as is the case in Chicago (38.12km cycle), World Triathlon publishes the official adjustment to the race length alongside the results. However, in other cases, the abnormal conditions are undisclosed and are only noticed in the final race results. Fig 1 shows the distribution of raw race times across twenty randomly selected courses between 2011 and 2015. Significant variety can be seen even though the races appear to correlate in the scatter plots. For instance, even the fastest performer in the cycling segment in Auckland 2012 (8th boxplot), who completed it in 4,179 seconds, took longer than the slowest racers in 18 of the other 19 races in the boxplot. As its median value of 4,262 seconds is about 600 seconds (ten minutes) longer than that of the majority of the other races, this race and other such anomalies must be discarded. This disparity in segment times also carries through to total time, despite running segments being very similar between races (Fig 10).

If the training and testing data are based on an amalgamation of courses which vary so significantly, it is important that the models take this disparity into account. This reinforces the need for standardisation and for variables which give the mean cycling and running times on each particular race course. Once individual outlier performances, and not just whole outlier races, are removed, the soft-max normalisation technique is applied in order to produce normally distributed datasets with values between zero and one. Checks on the normality of the dependent and independent variables (Fig 11) and on missingness (Fig 12) are carried out before the data is declared fit for modelling.
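The soft-max normalisation step can be sketched as follows. This assumes the common textbook form of soft-max scaling, a logistic transform of the z-score; the study does not state its exact parameters, so the function `softmax_scale` and the example values are illustrative only.

```r
# Soft-max scaling: a logistic squashing of the z-score, so every value
# lands strictly between 0 and 1 and extreme values are compressed.
# ASSUMPTION: this is the common textbook form; the exact variant used
# in the study is not stated.
softmax_scale <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  1 / (1 + exp(-z))
}

# Hypothetical bike times in seconds, including one slow performance
bike_seconds <- c(3480, 3510, 3525, 3540, 3600, 3900)
scaled <- softmax_scale(bike_seconds)
round(scaled, 3)
```

Values near the mean are mapped almost linearly around 0.5, whilst the slow performance is pulled towards 1 rather than stretching the scale, which is the usual reason for preferring this transform to min-max scaling when some outliers remain.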

Having normalised and randomised the data before splitting it into two subsets, training (75%) and test (25%), the first model, based on the artificial neural network technique, is ready to be tested.
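A minimal sketch of the randomised 75/25 split is shown below. The data frame is a synthetic stand-in for one normalised performance group; the seed and the row count are arbitrary assumptions, and only the object names `Bike.tr` and `Bike.te` are borrowed from the boxes in this article.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Synthetic stand-in for one normalised performance group
n_obs <- 200
dat <- data.frame(
  Swim              = runif(n_obs),
  CourseAverageBike = runif(n_obs),
  TotalTime         = runif(n_obs),
  Bike              = runif(n_obs)
)

# Shuffle the rows, then cut at 75%
shuffled <- dat[sample(nrow(dat)), ]
cut_at   <- floor(0.75 * nrow(shuffled))
Bike.tr  <- shuffled[seq_len(cut_at), ]               # training subset
Bike.te  <- shuffled[(cut_at + 1):nrow(shuffled), ]   # test subset

c(train = nrow(Bike.tr), test = nrow(Bike.te))
```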

An initial model, based on three independent variables (swim time, course average cycling time and overall target time), predicts the dependent variable *Bike*. Accuracy can be evaluated by examining the relationship between predicted and actual values. Over the Olympic Male performance group, a correlation value of 0.903 indicates a very strong association between the predicted and actual values, and an initial RMSE value of 0.165 suggests a strongly performing model. This relationship can be visualised in the scatter plot (Fig 13). The third measure of performance comes from the lm function which, as well as demonstrating a statistically significant relationship through the low p-value, calculates an adjusted r-squared value of 0.816 (Box 9). This high value means that the initial neural network model can explain the majority of the variation in the cycling segment based on just the swimming time, course average bike time and targeted overall time. Increasing hidden layers (Fig 14) improves the model slightly, with RMSE falling to 0.150 whilst the correlation between actual and predicted values rises to 0.920. Box 8 shows the commands to create the final ANN model, where “Bike.tr” is the training data, and Box 9 shows the summary of the initial model.

Box 8 Neural network model

    > library(neuralnet)
    > # ANN formula
    > n <- names(Bike.tr)
    > f <- as.formula(paste("Bike ~", paste(n[!n %in% "Bike"], collapse = " + ")))
    > f
    Bike ~ Swim + CourseAverageBike + TotalTime
    > # ANN with hidden layers
    > nn2 <- neuralnet(f, data = Bike.tr, hidden = c(3, 2))
    > # Generate predicted values for the test data (compute step assumed)
    > Bike.res2 <- compute(nn2, Bike.te[, n[!n %in% "Bike"]])
    > Bike.pred2 <- Bike.res2$net.result
    > # Correlation of actual versus predicted values
    > cor(Bike.te$Bike, Bike.pred2)
    > # Model summary
    > Bike.lm2 <- lm(Bike.te$Bike ~ Bike.pred2)
    > summary(Bike.lm2)

Box 9 Model summary of neural network model

    > Bike.lm0 <- lm(Bike.te$Bike ~ Bike.pred0)
    > summary(Bike.lm0)
    Call:
    lm(formula = Bike.te$Bike ~ Bike.pred0)
    Residuals:
         Min       1Q   Median       3Q      Max
    -0.62451 -0.06938 -0.00567  0.08600  0.46097
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.010489   0.009218   1.138    0.255
    Bike.pred0  1.005085   0.015603  64.418   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual standard error: 0.1645 on 937 degrees of freedom
    Multiple R-squared:  0.8158, Adjusted R-squared:  0.8156
    F-statistic:  4150 on 1 and 937 DF,  p-value: < 2.2e-16
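The three accuracy measures used throughout (correlation, RMSE and adjusted r-squared from regressing actual on predicted values) can be gathered in one small helper. The function name `evaluate_predictions` and the simulated vectors are assumptions for illustration; they are not the study's data.

```r
# Hypothetical helper collecting the three accuracy measures
evaluate_predictions <- function(actual, predicted) {
  fit <- lm(actual ~ predicted)
  c(
    correlation = cor(actual, predicted),
    rmse        = sqrt(mean((actual - predicted)^2)),
    adj_r2      = summary(fit)$adj.r.squared
  )
}

# Simulated values on the normalised 0 to 1 scale (not the study's data)
set.seed(1)
actual    <- runif(100)
predicted <- actual + rnorm(100, sd = 0.05)

res <- evaluate_predictions(actual, predicted)
round(res, 3)
```

Keeping all three measures in one place makes it easier to apply exactly the same calculation to every model and performance group, which the later cross-comparison relies on.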

The same steps are taken to predict the dependent variable *Run*, with swimming time, average running time and total target time treated as independent variables. An initial model yields a lower correlation between actual and predicted values as well as a markedly higher error value of 0.296. Although the relationship between the dependent variable and the independent variables is statistically significant, the initial adjusted r-squared value is much lower at 0.337, which suggests that running time is better explained by other factors which are not featured in the model. Again, increasing hidden layers improves accuracy slightly, bringing the final correlation to 0.636 and the RMSE to 0.261. All of these values are less accurate than those of the neural network prediction for cycling.

The garson function is used to create visualisations (Fig 15 and Fig 16) of the relative importance of the explanatory variables. By deconstructing the model weights, the algorithm indicates that when predicting *Bike*, the relative importance of *Total Time* is 60%, *Average Bike Time* is 31% and *Swim* is 9%. When forecasting *Run*, the relative importance of *Total Time* is 68%, *Average Run Time* is 10% and *Swim* is 22%. Notably, the bike model places greater importance on the average course time than the run model does. This is because, as previously seen in Fig 6, there is less variety in cycling times as triathletes often cycle in groups (pelotons). The run model therefore depends more on the overall target time, as there is low correlation with the swimming segment and the running segment has a large variety of performance times within the same race.
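To show what "deconstructing model weights" involves, the sketch below implements a Garson-style importance calculation in base R for a network with a single hidden layer. This is a simplification: the final network in this study has two hidden layers, the weights below are invented for illustration, and the helper `garson_importance` is not the packaged garson function.

```r
# Garson-style relative importance for a network with ONE hidden layer.
# w_in : matrix of input-to-hidden weights (rows = inputs, cols = hidden units)
# w_out: vector of hidden-to-output weights (one per hidden unit)
# Bias weights are ignored in this sketch.
garson_importance <- function(w_in, w_out) {
  contrib <- abs(sweep(w_in, 2, w_out, `*`))           # |w_ij * v_j|
  share   <- sweep(contrib, 2, colSums(contrib), `/`)  # input share within each hidden unit
  total   <- rowSums(share)                            # summed over hidden units
  total / sum(total)                                   # rescaled to sum to 1
}

# Hypothetical weights, for illustration only (not the fitted model)
w_in <- matrix(
  c( 0.4, -0.2,  0.1,
     0.9,  0.5, -0.7,
    -1.8,  1.1,  1.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Swim", "CourseAverageBike", "TotalTime"),
                  paste0("h", 1:3))
)
w_out <- c(1.2, -0.8, 0.5)

imp <- garson_importance(w_in, w_out)
round(imp, 2)
```

Because absolute values are used, the result describes how much each input contributes, not whether it pushes the prediction up or down.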

Linear regression can also be applied to machine learning in order to predict the dependent variable. As an initial test, a simple linear regression model is created with the variable most strongly correlated with the dependent variable, *Total Time*. This uses the same training and testing data subsets as Model 1 and can be visualised in Fig 17, in which the red regression line indicates the relationship between *Total Time* and *Bike* and the resulting prediction. Summarising this simple model shows a statistically significant relationship between the two variables and, within the Olympic Male performance group, an adjusted r-squared value of 0.727.

The simple linear regression model can then be developed by adding the other two independent variables (*Swim* and *CourseAverageBike*) to create a multiple linear regression model. Unsurprisingly, more explanatory variables in the model increase the adjusted r-squared value, from 0.727 to 0.807 (a relative increase of 11%). This is due to the moderate correlation between *CourseAverageBike* and *Bike*: despite not having as much predictive power as *Total Time*, it still helps the model account for the variety in course conditions, as it gives an average value of how past competitors have fared on that course. Box 10 shows how the multiple linear regression model is created as well as the correlation between actual and predicted values.

Box 10 Multiple linear regression model

    > # Multiple linear regression model
    > model2 <- lm(Bike ~ ., data = Bike.tr)
    > Bike.pr1 <- predict.lm(model2, Bike.te2, se.fit = TRUE)
    > # Correlation of actual versus predicted values
    > cor(Bike.te$Bike, Bike.pr1$fit)
    [1] 0.9026066

Although the multiple linear regression model consists of only three variables, the VIF function is used to calculate variance inflation factors and test for multicollinearity between them (Box 11). As the square roots of the VIF values (1.216, 1.725 and 1.902) do not exceed 2, the model is not skewed by internal correlation.

Box 11 Variance Inflation Factor of regression model

    > vif(model2)
                 Swim CourseAverageBike         TotalTime
             1.215739          1.725163          1.902089
    > sqrt(vif(model2)) > 2 # if > 2 vif too high
                 Swim CourseAverageBike         TotalTime
                FALSE             FALSE             FALSE
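For readers who want to see what the variance inflation factor measures, it can be reproduced in base R: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R²) of that regression. The helper `manual_vif` and the simulated predictors below are assumptions for illustration and will not reproduce the values in Box 11.

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated, mildly correlated predictors (not the study's data)
set.seed(7)
TotalTime <- rnorm(300)
X <- data.frame(
  Swim              = 0.4 * TotalTime + rnorm(300),
  CourseAverageBike = 0.6 * TotalTime + rnorm(300),
  TotalTime         = TotalTime
)

v <- manual_vif(X)
round(v, 3)
sqrt(v) > 2  # same rule of thumb as in Box 11
```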

The same process is followed to predict *Run*, with *CourseAverageRun* replacing *CourseAverageBike* in the set of independent variables. Over the same performance group, the model performs less well than when predicting *Bike*, partially due to the lower correlation between the target variable and the independent variables. A low adjusted R-squared value of 0.330 suggests that there are other factors which explain the run segment and which cannot be associated with swimming time, total targeted time or the average running time for the course. Similarly, the correlation between actual and predicted running values is also lower, at 0.575, and the running error score of 0.297 is nearly double the cycling error scores.

The final model is the regression tree method, which forecasts the dependent variable (*Bike*) through a series of nodes that split the observations according to rules learned from the training data. In the case of the Olympic Male performance group, the initial plot (Fig 18) predicts six different outputs which can be compared to the actual values in Fig 19. A positive correlation can be recognised from the scatterplot as well as from the Spearman´s rank correlation, which gives a coefficient of 0.875. For the same group, the RMSE between actual and predicted values of 0.173 is considerably lower than the RMSE obtained when the mean of the actual values is used as the prediction for every observation, which shows that the model outperforms this naive baseline.
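The baseline comparison mentioned above can be sketched as follows: the model's RMSE is set against the RMSE of a naive forecast that predicts the mean of the actual values every time. The vectors are simulated stand-ins on the normalised scale, not the tree's real output.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Simulated stand-ins for test-set values (not the study's data)
set.seed(3)
actual     <- runif(200)
model_pred <- actual + rnorm(200, sd = 0.1)

# Naive baseline: predict the mean of the actual values for every row
baseline_pred <- rep(mean(actual), length(actual))

c(model    = rmse(actual, model_pred),
  baseline = rmse(actual, baseline_pred))
```

A model is only useful if its RMSE sits clearly below the baseline figure, which is the sense in which the tree is described as better than average.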

Using the *prune* function, the regression tree can be pruned in the hope of improving model accuracy. In this instance, the number of distinct predicted *Bike* values is reduced from six to four. This is too simplistic and worsens model performance: the correlation between actual and predicted values falls to 0.844 and the RMSE score increases to 0.194. The commands in Box 12 create the regression tree model, run a correlation test and then prune the model.

Box 12 Regression tree model

    > library(rpart)
    > # Regression tree
    > m.rpart <- rpart(`Bike` ~ ., data = Bike.tr)
    > # Predictions for the testing data
    > p1.rpart <- predict(m.rpart, Bike.te)
    > # Correlation of actual versus predicted values
    > cor.test(Bike.te$Bike, p1.rpart, method = "spearman", exact = FALSE)
    Spearman's rank correlation rho
    data:  Bike.te$Bike and p1.rpart
    S = 17311826, p-value < 2.2e-16
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
          rho
    0.8749421
    > # Prune regression tree
    > m.rpart_prune <- prune(m.rpart, cp = 0.05)

The same process is used to predict the running segment, and here the initial regression tree produces ten outputs. In contrast to when *Bike* was the dependent variable, this tree (Fig 20) uses all three independent variables in the nodes. This is because the explanatory power of swimming is more significant to running than to cycling. The correlation coefficient of 0.652 indicates only a moderate correlation between actual and predicted values. Similarly, the RMSE value of 0.278 is only slightly less than the RMSE between the actual values and their mean (0.363). Pruning this model (Fig 21) does little to improve performance: reducing the tree to just three outputs, and removing *Swim* and *CourseAverageRun* from it, leaves a model that is not sensitive enough to the variety in running times. These two variables are likely discarded because of their significantly lower correlations with *Run* (0.06 and 0.03) compared with the correlation between *Run* and *Total Time* (0.58), as shown in the correlation matrix in Fig 7.

The preceding discussion of the results of the three models is based entirely on the male Olympic distance datasets. In order to understand how well the models transfer to the other distances and genders, the same models are tested using datasets from the other performance groups. Using standardised datasets as well as consistent calculations such as adjusted r-squared values, RMSE and correlation coefficients between actual and predicted values enables cross-comparisons of model accuracy. The following two tables offer comparisons of values between models, split between the four different performance groups and the two different disciplines.

A clear gulf of more than 0.1 in RMSE values between predicting *Bike* and *Run* can be observed, which is largely due to the weaker correlation between *Run* and its independent variables. Consequently, all of the machine learning methods struggle to make accurate running predictions from the variables at their disposal.

With respect to RMSE, all of the performance groups follow similar patterns, in that none of the groups responds particularly differently to any of the machine learning methods. RMSE scores differ by at most 0.02 between performance groups, which suggests that the four training datasets have similar predictive power. The accuracy of the first two models improved when hidden layers were added to the artificial neural network model and when more variables were added to the linear regression model. This can be observed in the decrease in RMSE values between ANN and ANN with hidden layers, as well as from simple to multiple linear regression.

On the other hand, in Model 3 all four groups incurred larger error values after pruning than in the initial regression tree models. This is because decreasing the number of outputs results in predictions which are too general and not sensitive enough to the different correlations between the independent variables and the dependent variable. Whilst pruning a regression tree can be useful when a tree is too complicated, it is important to maintain some degree of detail so that all of the independent variables can be utilised within the nodes.

A second approach to assessing model performance is to calculate the correlation between actual and predicted values. Again, a notable gulf of more than 0.2 between the two dependent variables highlights how forecasted cycling times are closer to their actual values than forecasted running times are to theirs. The slight dip in accuracy after pruning the regression tree can also be seen in the coefficients for Model 3, whilst the developments in Models 1 and 2 both improve their correlation coefficients.

Given that all the variables were normalised between 0 and 1, RMSE values are also on this scale and as such the differences are relatively small. Over each distance and both genders, it is the neural network model which performs best with respect to RMSE and correlation coefficients. Whilst its strong performance is supported by these statistics, the technique is hard to interpret: neural networks are sometimes considered “black boxes” in which users are unable to clearly trace the processes which occur within the model. This is in contrast to regression trees which, despite their slightly higher error values, can be easily understood, and the process by which each outcome is formed can be clearly traced.

When comparing the three sports, swimming shows distinctly less correlation with the other two disciplines than cycling and running show with each other. This is because swimming is technically very distinct whilst cycling and running share the physiological importance of lower-body strength. Therefore, it can be assumed that if an athlete has the typical physical attributes which produce a high running performance, then that level is more likely to be replicated in cycling than in swimming.

Now that the relative performance of the different models has been established, it is important to understand how and why they can be applied to triathletes in competitive races. After a triathlete has completed the swimming segment, their time can be entered into the model alongside the historical average cycling time for that course and the overall target time which the athlete is hoping to achieve. These last two variables can be determined before the race starts, whilst the swimming time has to reflect in-race performance. As the neural network forecasts of cycling are clearly more accurate than its forecasts of running, running times are no longer predicted by machine learning. The running time is instead calculated by deducting the swimming time, the forecasted cycling time and an assumed transition time from the overall target time. The triathlete would then be offered two key pieces of forecasted information, cycling time and running time, which would determine their pacing strategy. These splits aim to be the most suitable combination within that performance group to achieve the overall target time.
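The deduction described above can be sketched as a small function. It assumes the predicted cycling time has already been converted back from the normalised scale to seconds; the function name `pacing_guide`, the 90-second transition allowance and the example inputs are all hypothetical, and the default distances are the Olympic-distance 40km bike and 10km run.

```r
# Hypothetical helper: turn a target time and a predicted bike split
# into pacing guidance for the last two segments (all times in seconds).
pacing_guide <- function(target_total, swim, predicted_bike,
                         transitions = 90,  # assumed allowance for T1 + T2
                         bike_km = 40, run_km = 10) {
  run <- target_total - swim - predicted_bike - transitions
  if (run <= 0) stop("Target time leaves no time for the run segment")
  list(
    bike_seconds   = predicted_bike,
    run_seconds    = run,
    bike_kmh       = bike_km / (predicted_bike / 3600),
    run_sec_per_km = run / run_km
  )
}

# Example with assumed values: 1h50 target, 18:00 swim, 58:20 predicted bike
guide <- pacing_guide(target_total = 6600, swim = 1080, predicted_bike = 3500)
str(guide)
```

Deriving the run by subtraction guarantees that the two splits add up to the target time, at the cost of passing any error in the cycling forecast straight into the running guidance.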

One major assumption in this model is that previously achieved race splits are an appropriate guide. This may not be reflective of all competitors as, naturally, each athlete has a different profile and will be stronger or weaker in certain disciplines. For instance, two athletes who complete the swimming segment in the same time and have the same overall target time are treated identically by the model and will receive the same pacing guidance for the remaining two segments. This overlooks the fact that one of them may be a very strong cyclist who should target a faster cycling time, leaving more time for the final running segment. To resolve this difference between athletes, the model would require further data points relating to performance-influencing characteristics. These could include VO2 max, weight, age, strength or even non-physiological factors such as the quality of bicycle or running footwear. Particularly in the case of cycling, a more efficient bike can save an athlete from having to exert as much energy as a competitor on a less capable bike. Not so much in elite level races, but in races which have wider qualification criteria, triathletes´ resources do vary substantially.

Similarly, the model could be enriched by additional data relating to the racing conditions of each particular course so that the performance groups could be divided further into more specific racing groups. As elite level races occur just six times a year, such quantities of data are not readily available. Whilst glaring outliers were removed from the training data before modelling, subtle differences in distance, altitude, weather, terrain and incline may not be completely absorbed by the average course time variable which was added to the models. Therefore, different types of race data, or even simply a greater quantity of race data with the same variables, could help produce models which are trained on less varied race data and have greater correlation with the target variables.

Despite the four different distance and gender groups being treated separately, very little difference in performances and forecasts was seen in the final model results, the correlation matrices and the segment distribution boxplots (Fig 5). In terms of distance, Olympic distance triathlons (1.5km swim, 40km bike and 10km run) may not be long enough to place noticeably different physiological demands on competitors compared with sprint triathlons (750m swim, 20km bike and 5km run). It would be interesting to include Ironman (3.9km swim, 180.2km bike and 42.2km run) data in a similar model; however, such data was unavailable at the point of acquisition.

Such models could also be applied to other sports. In multidisciplinary athletics events like heptathlons (long jump, high jump, shot put, javelin/pole vault, hurdles, 60m/200m, 800m/1000m) or decathlons (heptathlon + discus throw and 400m), interesting associations between segments could be explored. This is because of the shared characteristics between the 200m, 400m, hurdles and long jump, which all feature high-speed, non-endurance running. Similarly, throwing events like javelin and discus could be expected to correlate as they utilise similar muscle groups. Other trends would be less obvious but could still be included in models to help shed light on the explanatory factors behind more distinctive disciplines like the high jump. Whilst the model would not be applied as a pacing guide, it could be applied as a performance guide which would help athletes understand what scores they need to target in each segment in order to achieve their desired medal or overall target points.

In conclusion, the artificial neural network model is the best performing model and can assist triathletes mid-race by giving real-time pacing guidance. The end goal is for athletes to achieve their overall target time, having taken into consideration their performance in other segments and how the conditions of the race have affected the performances of previous competitors. As a sport, elite triathlon is subject to a certain degree of variance and inconsistency and, whilst this is sometimes difficult for statistical models to capture, this study does its best to normalise the data and make the predictions as applicable as possible to each race for which they are made.

Bentley DJ., Cox GR., Green D., Laursen PB. 2008 *Maximising performance in triathlon: Applied physiological and nutritional aspects of elite and non-elite competitions.* Journal of Science and Medicine in Sport. Volume 11 Issue 4 Pages 407–416

Cejuela, R., Cala, A., Pérez-Turpin, J., Villa, J., Cortell, J. and Chinchilla, J., 2013 *Temporal Activity in Particular Segments and Transitions in The Olympic Triathlon,* Journal of Human Kinetics, Volume 36 Pages 87–95.

Cuba-Dorado A., Vleck V., Álvarez-Yates T., Garcia-Garcia O. 2021. *Gender Effect on the Relationship between Talent Identification Tests and Later World Triathlon Series Performance.* Sports. Volume 9 Issue 12 Page 164. https://doi.org/10.3390/sports9120164

Eshragh, F. Pooyandeh, M, Marceau, D., 2015 *Automated negotiation in environmental resource management: Review and assessment,* Journal of Environmental Management, Volume 162, Pages 148–157, Available from: https://www.sciencedirect.com/science/article/pii/S030147971530195X

González-Parra G., Mora R., Hoeger B. 2013 *Maximal oxygen consumption in national elite triathletes that train in high altitude.* Journal of Human Sport and Exercise. Volume 8 Issue 2 Pages 342–349. DOI: https://doi.org/10.4100/jhse.2012.82.03

Knechtle, B., Kach, I., Rosemann, T. and Nikolaidis, T., 2019 *The effect of sex, age and performance level on pacing of Ironman triathletes*, Zurich Open Repository and Archive University of Zurich

Le Meur, Y., Bernard, T., Dorel, S., Abbiss, C., Honnorat, G., Brisswalter, J. and Hausswirth, C., 2011 *Relationships Between Triathlon Performance and Pacing Strategy During the Run in an International Competition,* International Journal of Sports Physiology and Performance, Volume 6 Issue 2 Pages 183–194

Millet, GP., Vleck, VE., 2000 *Physiological and biomechanical adaptations to the cycle to run transition in Olympic triathlon: review and practical recommendations for training,* British Journal of Sports Medicine Volume 34 Pages 384–390.

O’Toole, M.L., Douglas, P.S. 1995 *Applied Physiology of Triathlon.* Sports Med. Volume 19 Pages 251–267. https://doi.org/10.2165/00007256-199519040-00003

Sousa, CV., Aguiar, S., Olher, RR., Cunha, R., Nikolaidis, PT., Villiger, E., Rosemann, T. and Knechtle B., 2021 *What Is the Best Discipline to Predict Overall Triathlon Performance? An Analysis of Sprint, Olympic, Ironman 70.3, and Ironman 140.6,* Frontiers in Physiology. Volume 12. doi: 10.3389/fphys.2021.654552

Schabort EJ., Killian SC., St Clair Gibson A., Hawley JA., Noakes TD. 2000 *Prediction of triathlon race time from laboratory testing in national triathletes.* Med Sci Sports Exerc. Volume 32 Issue 4 Pages 844–849.

Sharma, A.P., Périard, J.D. 2020 *Physiological Requirements of the Different Distances of Triathlon.* Triathlon Medicine. Pages 5–17 https://doi.org/10.1007/978-3-030-22357-1_2

Sleivert GG, Rowlands DS. 1996 *Physical and physiological factors associated with success in the triathlon.* Sports Med. Volume 22 Issue 1 Pages 8–18. Available from: https://pubmed.ncbi.nlm.nih.gov/8819237/

Walsh, N. 2021 *What are the triathlon “world records” for each distance?* Triathlon Magazine. Available from: https://triathlonmagazine.ca/racing/what-are-the-triathlon-world-records-for-each-distance/

Wu SS, Peiffer JJ, Brisswalter J, Nosaka K, Abbiss CR. 2014 *Factors influencing pacing in triathlon.* Open Access J Sports Med. Vol 5 Pages 223–234.