Extracting insights using R’s dplyr and tidyr packages
Data is more available today than ever before. We can easily find out what we are interested in through many resources on the Internet. However, when trying to draw conclusions from the data, we quickly realize that the data, although easily found, is not in a format that can be readily used for a more detailed analysis. The same problem arises when several people try to collaborate on some research. If they have not previously agreed on how to label and enter data, everyone will most likely come up with their own system. This ultimately results in confusion when sharing and analyzing data. These issues create the need for a unique data organization system.
In his work Anna Karenina, Leo Tolstoy expressed the thought:
All happy families are alike; each unhappy family is unhappy in its own way.
Alluding to that quote, one of the visionaries of programming in R, Hadley Wickham, addressed the problem of data organization with the following words:
Tidy datasets are all alike but every messy dataset is messy in its own way.
With tidy data, each row is an observation and each column is a variable. Let’s say we list the types of cars by their horsepower and weight. Data organized in a tidy way would then contain three columns: the type of car, horsepower and mass. Consequently, by looking at one row of the data, we would get all the information about one type of car. Focusing on one data column would provide information about a single property (eg. mass) for all types of cars.
In an effort to translate the data into a tidy format, Hadley Wickham developed the
tidyverse – a suite of packages that includes the following packages:
tibble. In the following text, we will focus on the subset of these packages used for data wrangling –
tidyverse presents a set of consistent and effective methods for transforming and visualizing data, it is important to emphasize two things:
- results obtained using the
tidyversefunctions can be obtained (albeit usually in a more complicated/less intuitive way) using the basic R functions
- R is a programming language, and
tidyverseis a set of packages aimed at facilitating data transformations and visualizations. In other words, the
tidyversepackages can not solve all the problems we may encounter, so it is important to know the basics of R programming as well.
In the following text, I will cover the most important functions of the
tibble packages. I will showcase how these functions can be used to perform various common tasks involved in data cleaning and exploratory data analysis. I will use the data about the passengers of one of the most famous ships – the Titanic.
As a first step, we have to load the necessary packages and study the structure of the data we will use. Aside from the data wrangling packages, we have to load the
titanic package which contains the needed data.
The Titanic dataset is a
data.frame type object. Consequently, if we printed the Titanic dataset, the whole 891 rows would be returned. Since such a layout is clumsy and cumbersome, I printed only the first ten rows using the base R
head function. In order to bypass this issue, the
tibble package introduces a
tibble data type. This type improves on the
data.frame class by only printing the first ten rows when showing a dataset. It also prints out the number of rows and columns of a dataset and the data type of each column.
head function can be improved upon as it does not indicate the column data type, and when there are many variables (many columns), it does not show them all, but only as many as can fit on the screen. Instead, we can use the
glimpse command which shows a transposed version of the printout to get a more complete overview of the data.
We can see that the data contains 12 variables: passenger serial number (PassengerId), binary passenger survival variable(Survived), passenger class (Pclass), name (Name), sex (Sex), age (Age), number of siblings and spouses on board (Sibsp), number of children/parents on board (Parch), ticket number (Ticket), fare paid (Fare), cabin number (Cabin) and the passenger’s boarding location (Embarked).
Before moving forward with the analysis, I would like to introduce the pipe operator. The advantage of the pipe operator is the ability to bind functions. More precisely, it passes the result of one function as the first argument to the next function in the sequence. This may not seem very useful at first, but it greatly increases the clarity of the code when using a composition of many functions. Let’s look at an example.
The pipe operator enables us to read the composition from left to right, as if we were reading a book! Once you get used to using it, I guarantee you will never look back. This is why, even though all R functions can be used without it, I will use the pipe in the rest of this post.
dplyr package contains functions to facilitate data manipulation to extract meaningful insights. It is partly inspired by SQL (Structured Query Language), so those of you with background in SQL will probably find them quite intuitive.
We will start by renaming the columns using
rename in order to make them more understandable.
Renaming can also be performed by applying a function to multiple columns using the
rename_with function. For example, we might prefer a convention where all variable names are written in lowercase:
Select columns based on their names
We are often not interested in all of the dataset variables. The
select function allows us to subset the columns of interest.
If the desired columns are positioned adjacently in the initial data frame, there is also a shorter way to select them:
What if we want to select only the name and the variable associated with the number of relatives on board? Of course, we can again use the technique from the first example. However,
select can also be combined with a plethora of helper functions which make such tasks faster and easier to perform. For example, the
contains helper allows us to search by parts of column names. We can use this if we notice that both variables associated with the number of relatives contain the underscore symbol .
Similarly to the
contains function, we can also use the
ends_with functions. For a list and explanation of all helper functions, you can consult the help files by typing
?select in the R console.
Column selection can also be performed by specifying the columns we want to remove from the data. This is done by placing a minus sign in front of the names of unwanted columns.
Let’s simplify our dataset by removing the ticket, cabin number and the columns related to the number of relatives/spouses on board:
Value based column selection
What if we were interested only in columns containing numeric values? We could look up the variable types manually and then follow the procedure from the previous subsection.
Luckily, this tedious task can be automated since column selection can also be performed based on the values stored within them by teaming up the
Transform existing columns or add new ones using mutate
mutate function allows us to transform the existing columns or create new columns using either external data, or data already present in the data frame.
R has a handy
factor class built-in which should be used when a categorical variable has a limited number of possible values (levels). We can see that the Sex variable has such properties.
mutate enables us to change the column data type:
Some of you might also want to abbreviate the gender values. This can be done using the
Transforming multiple columns at once
Some of you probably noticed that it wouldn’t be bad to also convert data types of Survived and Embarked columns from character to factor.
Instead of doing it by specifying the columns one by one, we can transform multiple columns at the same time by applying the
across combination. The
across function specifies the columns to transform and the functions to be applied.
The selection of columns in the
across function can also be performed using the helper functions mentioned while talking about the applications of the
Filter data frame rows
filter function is used to subset observations from the data. As such, it can be used to answer questions about distinct groups of passengers onboard.
What percentage of passengers survived the crash?
Only a little more than a third of passengers survived the crash. Was there a difference in survival based on passenger sex?
What is the proportion of women among passengers?
What is the proportion of women passengers who survived the crash?
So, even though women represented a smaller population of passengers, their chances of survival were approximately two times higher than those of men. Of course, this was expected, since women and children are given priority in rescue missions.
Next, we will check how the survival varied based on passenger class.
The chances of survival were lower if you traveled in the lower classes. This could be due to the influence of richer passenger but also due to the fact that the first class was closer to the boat deck where the rescue boats were stationed.
There are also
filter helper function
if_all which allow us to filter on multiple columns at once. These can be used to perform a common task of removing observations with missing values.
Arrange rows based on column values
Often, we want to sort the data based on a certain criteria of interest. This is done with
arrange. Let’s sort the passengers in ascending order of age, and then sort the passengers of the same age alphabetically by name. By default, the sorting is ascending, but the
desc command can be used to order descending sorting.
From this output, we might hypothesize that older passengers are more likely to be male than female. Additionally, it also seems that older passengers seemed to have been higher class passenger. For me, this is a plausible claim, knowing that the Titanic traveled from England to the United States, carrying with it a lot of people in search of new opportunities and a better life. I would assume that such people would be younger, since an older man has less incentive to move away. We will be able to answer all of these questions with the tools covered in the next two subsections.
Obtain column summary statistics with
This command is very useful when we are not interested in individual values, but in the summary statistics of the data set. Let’s calculate the average age of the passengers.
We got an unexpected result. This is a consequence of “holes” in the data that have been replaced with NA (“Not available”, missing data). We can solve this problem by filtering the data frame to exclude passengers with an unknown age before calculating the summary statistics. When filtering, we use a special command that tests whether the data is of type
Just like with
summarise can also be combined with
across in order to get a summary on multiple columns at once.
In order to answer the questions about the age difference depending on the gender and passenger class, we could first filter the population of interest and then calculate the needed summary. However, this approach would require us to write out a new summary each time, similarly to how we calculated the survival rates depending on the passenger class. To avoid this we will use the
group_by function covered in the next subsection.
Group the observations based on a categorical property with
By itself, the
summarise function would not be that useful, but paired with the
group_by function, it allows for a quick summary of information about all the levels of categorical variables. The
group_by function divides the data into groups depending on the categories of a certain variable. Then the function of interest is applied separately to each one of those groups using
summarise, and the result is combined back to a single data frame.
This combination finally enables us to elegantly answer the hypotheses laid out earlier.
There were three times more men above the age 55 onboard, so it seems our first hypothesis proved to be correct.
Let’s also check the second hypothesis about the class related age differences:
Our hunch seems to be backed up by data, as there indeed seems to be a correlation between the passenger class and age.
Notice that this approach also enables us to more easily find out the survival differences based on the passenger class, discussed in the
In a similar way as with the
summarise function, the
group by command can be combined with the
Let’s create a variable that will rank passengers by age based on gender. The
rank command will help us by assigning an ordinal number to a set of values starting from the smallest to the largest. As with the
arrange command, the default order can be reversed using
filter can give us, for example, a list of the three oldest passengers for each gender.
The results are in line with our previous conclusions, as the oldest female passenger is 17 years younger than the oldest man.
As it turns out, there is also a convenience
slice_max function to achieve the same goal, so the above snippet should only be used as an educational example.
To cancel the grouping, we use the
ungroup function. It is advisable to do this at the end of any transformation that involves grouping. This is because forgetting that we have grouped the data by some variable can cause unexpected results in further analysis.
Functions for combining/matching data frames
Earlier, while calculating the average age, we noticed that the Titanic data is incomplete. Let’s see who these passengers are:
Fortunately, the Titanic crash is so famous that there are many webpages dedicated to listing the passengers and their backgrounds. Consequently, we can create a new data frame that will contain the names and ages of some of the passengers for whom the data is missing.
Since the missing data is spread out in the Titanic dataset (look at the PassengerId values which are actually just row numbers), it would be hard to append the additional information manually. These types of problems can be solved by
left_join, a function which adds data from one data frame to the other based on a common identification variable.
Notice that instead of a single Age variable, we have Age.x and Age.y. This occurs in order to enable us to discern which data frame the age variable came from. We can use
case_when to merge the two age columns into one.
As its name suggests, the
tidyr package contains commands to covert data frames into a tidy format. The Titanic data frame is already tidy, since every row represents a single passenger, and every column an independent passenger feature. However, this is often not the case with datasets. Additionally, an “untidy” format might be needed in some instances such as plotting. The next subsection will illustrate such an example.
Wide-long format conversions using pivot_wider and pivot_longer
These are probably the most valuable functions in the
tidyr package. The
pivot_longer function is useful when we want to convert multiple columns to a single one with multiple subgroups. The image below illustrates how this works.
The table on the left contains three related columns — they all represent a type of relative. Knowing this, we might want to group them into a single “Relative” column. This makes the data longer (hence longer in the function name). Furthermore, it became untidy since the same person is represented by three separate rows. Why would we do this?
Well, if we wanted to make a bar chart with the type of relative on the horizontal axis, we would have to provide a single variable. We can do this with the format on the right, but not with the format on the left.
Let’s now show how to apply these transformations on the Titanic data. To make it easier to follow the transformations, we will select only the first three rows of the Titanic data frame:
We can display the same data using fewer columns by applying
pivot_longer. For example, we could have a passenger’s name column and two columns that contain all information on the passenger’s properties (passenger class, age, gender, survival). Why two columns for properties and not one? Because one column must contain the information about the property in question, and the other the value of that property. Since we are combining numeric and character values, we must first convert all values to the same type (character).
As expected, the number of rows is seven times larger because now we have to repeat the name of each passenger seven times — once for each of the parameters in the Parameter_name column. Of course, as opposed to the example on the image above, it does not make sense to group the parameters in such a way, since they are not related. However, if you work with data often, these types of issues will be the least of your problems, as the input is at least consistent. You would probably never create such data yourself, but I am doing it here so I can show you how to get out of that mess when someone brings it over (and, trust me, someone eventually will…).
As an inverse of the
pivot_wider is the answer.
Of course, we should also convert the column values back to their proper types.
Separate() and unite()
Like the aforementioned
pivot_wider, these two functions are also inverses of each other. The simplest example of a reason to use the
separate function is to separate date records. Let’s say we have a column of birthday dates of Olympic medal winners in the format day_month_year and we want to calculate the average age of the winners in years. This would mean that we have to somehow extract only the year from the date.
separate enables us to do just that, to make, for example, three variables from one date variable: day, month and year. The
unite function works the other way around, connecting multiple variables into one (eg. day, month, year into a date variable of the format day_month_year). In essence, it is just a
paste function with some additional arguments, such as allowing for automatic removal of input columns from the data frame.
Looking at the names of the passengers, we can see that they can be divided into several groups. The first word before the comma is the last name, followed by the title and the first name. It is useful to separate these groups because we get additional information. For example, the information on marital status from passenger titles — Mr/Mrs for married passengers and Master/Miss for single passengers.
For those of you not familiar with regular expressions, the expression
[,\.] translates to: break up the string whenever you encounter a comma or a dot. Why do we need the the backslashes in front of the dot? Because the dot by itself has a special meaning in regular expression, so this is a way to tell R we are interested in the dot character itself, and not its special function.
The titles give us an insight into the social structure of the passenger population we wouldn’t be able to easily access otherwise.
I hope this post was useful to you and that you learned something new. I tried to cover as many common tasks and problems as possible. However, no single (non-simulated) dataset contains all the problems you might encounter in your data analysis, so there are many issues and scenarios that have not been covered. Additionally, no exploratory data analysis can be complete without visualizing the data. Since adding that would make this post too long, I left it for some future story. Nevertheless, if you have any questions or encounter a complex issue (including visualization) that you want to solve using the
tidyverse toolbox, please feel free to drop a comment!