Data Wrangling with R
R Ladies Colombo
April 2021 Meetup
Data Science Intern at Trabeya,
Consultant Data Analyst at Creative Hub.
Co-author of two R packages maintained by Dr. Thiyanga Thalagala, DSJobTracker and tsdataleaks.
Runner's up in two datathons and finalist in two datathons.
Image Credits goes to stats illustrations from Allison Horst
Data wrangling is where we take a raw dataset and transform it into a format that fits for different tasks
Simply it's like meal prep where we get the data ready to make the main dish (in this case it could be a model or visualization)
There are several packages that can help with data wrangling but for today we would be scratching the surface of the dplyr
package in the tidyverse ecosystem.
I won't be teaching you a list of functions and how to use them today, instead I will be walking you through an actual scenario where these functions might be needed.
The main outcome of this should be to have an idea about how to approach a problem and how to combine different tools that you have under your belt.
+
You are the new Data Science intern at Spotify! and you are told to work on a simple csv dataset that resembles the actual dataset.
You are given a snapshot of the dataset from 2020 to work on and your supervisor will be making the entire user interface of the recommender system.
She told us to use the TidyTuesday data on Spotify from 2020-01-21.
At the end she will talk about extending this into an actual real time application.
The supervisor will let you make the R codes locally using the dataset that was given to you and she will on the other end work on making the functions that show the plots and tables to the user.
Usually most of your time will be spent on data wrangling hence she entrusts you to get the job done.
Rows: 32,833Columns: 23$ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZ…$ track_name <chr> "I Don't Care (with Justin Bieber)…$ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara La…$ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68…$ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO2…$ track_album_name <chr> "I Don't Care (with Justin Bieber)…$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-…$ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Rem…$ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZ…$ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop",…$ playlist_subgenre <chr> "dance pop", "dance pop", "dance p…$ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650,…$ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833,…$ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, …$ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4…$ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1…$ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.…$ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870…$ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43…$ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.…$ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725,…$ tempo <dbl> 122.036, 99.972, 124.008, 121.956,…$ duration_ms <dbl> 194754, 162600, 176616, 169093, 18…
The entire data dictionary was found here and you manage to jot it down as follows.
We have the id,genre and subgenre of each playlist.
The id,name and release date of each album.
Finally id,name,artist and popularity of each track.
In addition we have the following features of songs.
As the first piece of the user interface for the recommmendation system, your supervisor thought of starting off with the basics.
She has decided to put up a table showing the top 5 popular tracks and their artists.
Your supervisor has asked you to make the table that goes out on the front page of the user interface.
As the first piece of the user interface for the recommmendation system, your supervisor thought of starting off with the basics.
She has decided to put up a table showing the top 5 popular tracks and their artists.
Your supervisor has asked you to make the table that goes out on the front page of the user interface.
spotify %>% select(track_name,track_artist,track_popularity)
The select
command helps us select columns that we want, and we can give a range of columns or even the columns that we don't want to have.
The select
function works as follows.
select(t,A)
. Likewise if we want to select everything except column B from tibble t we would say select(t,-B)
.select(t,starts_with("LKR_"))
)Right now you know how to select the top rows of a tibble with the head function, so if you can get the tracks arranged by their popularity in ascending order then everything's all set!
We need a function that can sort the rows based on one or more columns.
We need a function that can sort the rows based on one or more columns.
This is where the arrange
function comes to play. The arrange
function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.
We need a function that can sort the rows based on one or more columns.
This is where the arrange
function comes to play. The arrange
function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(track_popularity) %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Siren | SUNMI | 0 |
Lollipop (Candyman) | Aqua | 0 |
Around The World | Aqua | 0 |
I'm Yours | Influencers Worship | 0 |
Talk About It | Bancali | 0 |
We need a function that can sort the rows based on one or more columns.
This is where the arrange
function comes to play. The arrange
function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(track_popularity) %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Siren | SUNMI | 0 |
Lollipop (Candyman) | Aqua | 0 |
Around The World | Aqua | 0 |
I'm Yours | Influencers Worship | 0 |
Talk About It | Bancali | 0 |
Wait something's not right, this isn't in descending order! We have to tell the arrange
function to sort them in descending order.
You can tell the arrange
function to order in descending order by using a minus operator in front, somewhat similar to the select
function.
You can tell the arrange
function to order in descending order by using a minus operator in front, somewhat similar to the select
function.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Dance Monkey | Tones and I | 100 |
Dance Monkey | Tones and I | 100 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
You can tell the arrange
function to order in descending order by using a minus operator in front, somewhat similar to the select
function.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Dance Monkey | Tones and I | 100 |
Dance Monkey | Tones and I | 100 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
Hold up, why is it repeating like that?
One track can be in multiple playlists! That's why! Now what do we do?
Let's select only the distinct or unique rows.
You can tell the arrange
function to order in descending order by using a minus operator in front, somewhat similar to the select
function.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Dance Monkey | Tones and I | 100 |
Dance Monkey | Tones and I | 100 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
ROXANNE | Arizona Zervas | 99 |
Hold up, why is it repeating like that?
One track can be in multiple playlists! That's why! Now what do we do?
Let's select only the distinct or unique rows.
For that we can use the distinct
function. The distinct
function takes in a tibble as the first argument and then a set of columns to find distinctive rows or uses all columns if no arguments are given. You can see the pattern here right?
Finally by using the distinct
function which returns only the unique rows in the dataframe we were able to get the top 5 track names and artists.
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% distinct() %>% head(5)
track_name | track_artist | track_popularity |
---|---|---|
Dance Monkey | Tones and I | 100 |
ROXANNE | Arizona Zervas | 99 |
Tusa | KAROL G | 98 |
Memories | Maroon 5 | 98 |
Blinding Lights | The Weeknd | 98 |
Let's send this out to your supervisor now!
Ding! Ding!
There's a new message from your supervisor on Slack.
She wants this to be more presentable on the user interface by renaming the columns and removing the popularity.
Well we already know how to select columns that we don't need, but what about renaming columns?
Ding! Ding!
There's a new message from your supervisor on Slack.
She wants this to be more presentable on the user interface by renaming the columns and removing the popularity.
Well we already know how to select columns that we don't need, but what about renaming columns?
Based on your gut feeling you google up "rename tidyverse function" and land on the rename function to find out the following.
The rename
function takes in pairs of arguments in the form of
new_name = old_name
so we can rename the columns that we want with that function.
This is in addition to the first argument of a tibble to do all of these on.
Using the rename
function we were able to change the names of the column names as well
spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% distinct() %>% head(5) %>% select(-track_popularity) %>% rename(Song = track_name, Artist = track_artist)
Song | Artist |
---|---|
Dance Monkey | Tones and I |
ROXANNE | Arizona Zervas |
Tusa | KAROL G |
Memories | Maroon 5 |
Blinding Lights | The Weeknd |
What did we learn from trying out Task No. 1?
select
function helps us select the columns that we wantarrange
function helps us arrange the table by columnsrename
function helps us rename columnsdistinct
functions helps us select only the unique rows in the tableCongratulations! We finished one task at our internship at Spotify!
Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all times!
She is all right with it since we worked really hard on it, but she wants to show a table of the top 5 most popular songs in the past year.
The snapshot she gave us is from January 2020 but our code will be used in real time even after 2021 according to her.
spotify %>% select(track_album_release_date) %>% head(2)
# A tibble: 2 x 1 track_album_release_date <chr> 1 2019-06-14 2 2019-12-13
Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all times!
She is all right with it since we worked really hard on it, but she wants to show a table of the top 5 most popular songs in the past year.
The snapshot she gave us is from January 2020 but our code will be used in real time even after 2021 according to her.
spotify %>% select(track_album_release_date) %>% head(2)
# A tibble: 2 x 1 track_album_release_date <chr> 1 2019-06-14 2 2019-12-13
Ding! Ding!
Your supervisor replied back telling you to use the function as.Date
to make the column into a date and then the year
function from the lubridate
package.
Ding! Ding!
Your supervisor replied back telling you to use the function as.Date
to make the column into a date and then the year
function from the lubridate
package.
lubridate is a package that helps in analysing and extracting useful information from dates, like the year, month, day or even difference between dates in different units!
Image Credits goes to stats illustrations from Allison Horst
You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.
You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.
as.Date(spotify$track_album_release_date) %>% lubridate::year()
You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.
as.Date(spotify$track_album_release_date) %>% lubridate::year()
Before you send this out to your supervisor you want to make sure that the code that you wrote is correct, so you start matching with the documentation to get a rough idea about what happens.
year
function in lubridate
Sounds good!
We have a way to get an entire vector of data based on the pre existing column in the dataset.
Now how do we add a new column to a dataset? We could simply do this, dataset$new_column = vector_of_data
.
But you suddenly remember how your supervisor brings in changes suddenly, therefore instead of making a break in the pipe in the middle, you try to find alternatives that can fit into the %
operator pipeline.
We have a way to get an entire vector of data based on the pre existing column in the dataset.
Now how do we add a new column to a dataset? We could simply do this, dataset$new_column = vector_of_data
.
But you suddenly remember how your supervisor brings in changes suddenly, therefore instead of making a break in the pipe in the middle, you try to find alternatives that can fit into the %
operator pipeline.
We need to modify the existing dataset that comes after the last step in the pipeline. We can use the mutate
function for this. The argument format goes along as new_column_name = modification
. The modification should give out a vector that fits the number of rows in the dataset.
Image Credits goes to stats illustrations from Allison Horst
We will first mutate the original dataset to check whether this works. Remember you just have to give the name of the column you want the operation to happen on, and there's no need to have the dataset before that followed with a $
We will first mutate the original dataset to check whether this works. Remember you just have to give the name of the column you want the operation to happen on, and there's no need to have the dataset before that followed with a $
spotify %>% mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>% select(release_year) %>% head(2)
# A tibble: 2 x 1 release_year <dbl>1 20192 2019
Success! We managed to get the years out. Now we need to get the maximum of these years.
We will first mutate the original dataset to check whether this works. Remember you just have to give the name of the column you want the operation to happen on, and there's no need to have the dataset before that followed with a $
spotify %>% mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>% select(release_year) %>% head(2)
# A tibble: 2 x 1 release_year <dbl>1 20192 2019
Success! We managed to get the years out. Now we need to get the maximum of these years.
spotify_with_year <- spotify %>% mutate(release_year = lubridate::year(as.Date(track_album_release_date)))latest_year <- max(spotify_with_year$release_year,na.rm = TRUE)
## [1] 2020
The latest year is with us now! All we need to do now is to filter and get only the tracks that was released on a year that is equal to the latest year.
The latest year is with us now! All we need to do now is to filter and get only the tracks that was released on a year that is equal to the latest year.
The filter
function can be used to get a portion of a dataset that ticks the boxes of a set of one or more logical conditions. Basically any logical condition that gives out a logical vector matching the length of the dataset can be used to filter out the rows that you need.
It's kind of like an actual filter where the barriers are the rows with FALSE
.
Image Credits goes to stats illustrations from Allison Horst
Let's use the filter
function to filter out the rows that we need. We have to a logical condition as an argument to filter and you can combine different conditions using logical operators like &,|, ~ etc.
Let's use the filter
function to filter out the rows that we need. We have to a logical condition as an argument to filter and you can combine different conditions using logical operators like &,|, ~ etc.
spotify_with_year %>% filter(release_year == (latest_year-1)) %>% select(track_name,release_year) %>% head(5)
# A tibble: 5 x 2 track_name release_year <chr> <dbl>1 I Don't Care (with Justin Bieber) - Loud Luxury Remix 20192 Memories - Dillon Francis Remix 20193 All the Time - Don Diablo Remix 20194 Call You Mine - Keanu Silva Remix 20195 Someone You Loved - Future Humans Remix 2019
Great! Now we need to plug these into our previous code and we are good to go!
spotify_with_year %>% filter(release_year == 2019) %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% distinct() %>% head(5) %>% select(-track_popularity) %>% rename(Song = track_name, Artist = track_artist)
Song | Artist |
---|---|
Dance Monkey | Tones and I |
ROXANNE | Arizona Zervas |
Tusa | KAROL G |
Memories | Maroon 5 |
Blinding Lights | The Weeknd |
What did we learn from trying out Task No.2 ?
as.Date
function converts character columns to Date columns.year
function from lubridate
package gets the year from a Date column.mutate
function mutates existing columns to create new ones.filter
function helps to filter the rows based on a condition.That was a huge leap from what we learned in Task No. 1!
The supervisor has one final task for you.
You need to give the average scores for energy, danceability, instrumentalness, valence for each playlist to so that the supervisor can use that dataset in her Plotly Radar plot function.
This is going to be a tough one! You sit down and sketch out how the final dataset should look like and this is what you got,
playlist_genre | avg_energy | avg_dance | avg_instu | avg_valen |
---|---|---|---|---|
some name 1 | 0.9 | 0.3 | 0.2 | 0.4 |
some name 2 | 0.9 | 0.7 | 0.2 | 0.6 |
So now we know what sort of outcome we expect at the end. Now the question remains how do you group these playlists together.
We can use the group_by
function in dplyr for this. But the group_by
function can be tricky sometimes so you dig a bit deeper.
group_by
function literally groups rows with the same combination of values in the given columns togethermutate
function that is applied will apply within the group.Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.
Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.
spotify %>% group_by(playlist_genre) %>% mutate(mean_energy = mean(energy)) %>% select(playlist_genre,mean_energy) %>% head(3)
## # A tibble: 3 x 2## # Groups: playlist_genre [1]## playlist_genre mean_energy## <chr> <dbl>## 1 pop 0.701## 2 pop 0.701## 3 pop 0.701
Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.
spotify %>% group_by(playlist_genre) %>% mutate(mean_energy = mean(energy)) %>% select(playlist_genre,mean_energy) %>% head(3)
## # A tibble: 3 x 2## # Groups: playlist_genre [1]## playlist_genre mean_energy## <chr> <dbl>## 1 pop 0.701## 2 pop 0.701## 3 pop 0.701
So we have to call distinct
for this as well it seems. And the tibble output shows that the dataset is now a grouped dataset by playlist_genre
. We should add an ungroup at the end to remove that effect, otherwise we might forget and issues may come up.
With that in your head you first write out the group_by
command and decide to be adventurous and mutate
a new column for each of the averages.
With that in your head you first write out the group_by
command and decide to be adventurous and mutate
a new column for each of the averages.
spotify %>% group_by(playlist_genre) %>% mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance, Instrumentals = avg_instu, Upbeat = avg_valen) %>% select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>% distinct() %>% ungroup() %>% head(4)
Genre | Energetic | Dance | Instrumentals | Upbeat |
---|---|---|---|---|
pop | 0.7010278 | 0.6393017 | 0.0598763 | 0.5035210 |
rap | 0.6507084 | 0.7183528 | 0.0759972 | 0.5050900 |
rock | 0.7328133 | 0.5205480 | 0.0624173 | 0.5373521 |
latin | 0.7083125 | 0.7132873 | 0.0444470 | 0.6055104 |
With that in your head you first write out the group_by
command and decide to be adventurous and mutate
a new column for each of the averages.
spotify %>% group_by(playlist_genre) %>% mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance, Instrumentals = avg_instu, Upbeat = avg_valen) %>% select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>% distinct() %>% ungroup() %>% head(4)
Genre | Energetic | Dance | Instrumentals | Upbeat |
---|---|---|---|---|
pop | 0.7010278 | 0.6393017 | 0.0598763 | 0.5035210 |
rap | 0.6507084 | 0.7183528 | 0.0759972 | 0.5050900 |
rock | 0.7328133 | 0.5205480 | 0.0624173 | 0.5373521 |
latin | 0.7083125 | 0.7132873 | 0.0444470 | 0.6055104 |
Do we really have to call distinct again like this? Or is there an easier way than having to group_by
-> mutate
-> select
-> distinct
?
Your supervisor noticed what you were typing on your laptop and gave a tip to Google up on the function summarize
to make the code more readable and safer.
And now after digging on the summarize
function, you realize that by simply replacing mutate with summarize your code now looks like this. And it automatically ungroups the outermost group after giving the summary statistics.
Your supervisor noticed what you were typing on your laptop and gave a tip to Google up on the function summarize
to make the code more readable and safer.
And now after digging on the summarize
function, you realize that by simply replacing mutate with summarize your code now looks like this. And it automatically ungroups the outermost group after giving the summary statistics.
spotify %>% group_by(playlist_genre) %>% summarize(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance, Instrumentals = avg_instu, Upbeat = avg_valen) %>% head(5)
Genre | Energetic | Dance | Instrumentals | Upbeat |
---|---|---|---|---|
edm | 0.8024759 | 0.6550409 | 0.2185779 | 0.4006555 |
latin | 0.7083125 | 0.7132873 | 0.0444470 | 0.6055104 |
pop | 0.7010278 | 0.6393017 | 0.0598763 | 0.5035210 |
r&b | 0.5909343 | 0.6701793 | 0.0289203 | 0.5312306 |
rap | 0.6507084 | 0.7183528 | 0.0759972 | 0.5050900 |
What did we learn from trying out Task No. 3?
group_by
helps to group the dataset into groups based on the values of a columnsummarize
helps to make a summary of a grouped / ungrouped dataset to get summary statisticsFinally! We finished all three tasks! We are now well on our way to start analysing data that come our way.
Data wrangling is kind of like meal prep where you prepare your ingredients(the raw data) for the final dish(the model or visualization)
There are several functions in different packages such as filter
,mutate
,select
,
arrange
,group_by
,summarize
,
distinct
from the dplyr
package and even date functions from the lubridate
package that help to wrangle data into the format we need.
Data wrangling is kind of like meal prep where you prepare your ingredients(the raw data) for the final dish(the model or visualization)
There are several functions in different packages such as filter
,mutate
,select
,
arrange
,group_by
,summarize
,
distinct
from the dplyr
package and even date functions from the lubridate
package that help to wrangle data into the format we need.
Get the Spotify Developer Account and access the APIs to create a dataset that is upto date and matches your style.
Create an R shiny dashboard that can show interactive graphs of all the artists and where they fall on different indexes which are based on how positive the song is and so on.
Analyse #TidyTuesday datasets that come out weekly and follow the R for Data Science Community on Twitter for more inspiration.
Email: janithcwanni@gmail.com
Twitter: @janithcwanni
Github: @janithwanni
Linkedin: Janith Wanniarachchi
Data Science Intern at Trabeya,
Consultant Data Analyst at Creative Hub.
Co-author of two R packages maintained by Dr. Thiyanga Thalagala, DSJobTracker and tsdataleaks.
Runner's up in two datathons and finalist in two datathons.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |