Data Wrangling with R

R Ladies Colombo
April 2021 Meetup

1 / 36

Hi there

Janith Wanniarachchi

BSc. Statistics (Special) Undergraduate at University of Sri Jayewardenepura, Sri Lanka,

Data Science Intern at Trabeya,

Consultant Data Analyst at Creative Hub.

Co-author of two R packages maintained by Dr. Thiyanga Thalagala, DSJobTracker and tsdataleaks.

Runner's up in two datathons and finalist in two datathons.

2 / 36

What is this data wrangling?

Image Credits goes to stats illustrations from Allison Horst

Data wrangling is where we take a raw dataset and transform it into a format that fits for different tasks
Simply it's like meal prep where we get the data ready to make the main dish (in this case it could be a model or visualization)
There are several packages that can help with data wrangling but for today we would be scratching the surface of the dplyr package in the tidyverse ecosystem.

3 / 36

What are we going to do today

4 / 36

I won't be teaching you a list of functions and how to use them today, instead I will be walking you through an actual scenario where these functions might be needed.
The main outcome of this should be to have an idea about how to approach a problem and how to combine different tools that you have under your belt.

Here's the scene,

You are the new Data Science intern at Spotify! and you are told to work on a simple csv dataset that resembles the actual dataset.
You are given a snapshot of the dataset from 2020 to work on and your supervisor will be making the entire user interface of the recommender system.
She told us to use the TidyTuesday data on Spotify from 2020-01-21.
At the end she will talk about extending this into an actual real time application.
The supervisor will let you make the R codes locally using the dataset that was given to you and she will on the other end work on making the functions that show the plots and tables to the user.
Usually most of your time will be spent on data wrangling hence she entrusts you to get the job done.

5 / 36

Glimpse at the data

Rows: 32,833
Columns: 23
$ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZ…
$ track_name               <chr> "I Don't Care (with Justin Bieber)…
$ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara La…
$ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68…
$ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO2…
$ track_album_name         <chr> "I Don't Care (with Justin Bieber)…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-…
$ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Rem…
$ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZ…
$ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop",…
$ playlist_subgenre        <chr> "dance pop", "dance pop", "dance p…
$ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650,…
$ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833,…
$ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, …
$ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4…
$ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1…
$ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.…
$ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870…
$ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43…
$ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.…
$ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725,…
$ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956,…
$ duration_ms              <dbl> 194754, 162600, 176616, 169093, 18…

6 / 36

7 / 36

What are all these numbers?

The entire data dictionary was found here and you manage to jot it down as follows.
We have the id,genre and subgenre of each playlist.
The id,name and release date of each album.
Finally id,name,artist and popularity of each track.
In addition we have the following features of songs.
- danceability : is it suitable for dancing (range: [0,1])
- energy : how intense and active is the song (range: [0,1])
- loudness : the overall loudness of track in dB (range: [-60,0])
- speechiness : is there a lot of speech in it? (range: [0,1])
- accousticness : is the song an acoustic song (range: [0,1])
- instrumentalness : does the song have no vocal content (range: [0,1])
- liveness : is there an audience in the song (range: [0,1])
- valence : how positive is the song (range: [0,1])
- key : the key of the song (CAUTION NA's means that there is no key)
- mode : 1 if its a major or 0 if its a minor
- tempo : BPM of the song

8 / 36

Task No. 1 
(Spot the Cool Kids)9 / 36

Task No. 1 (Spot the Cool Kids)

As the first piece of the user interface for the recommmendation system, your supervisor thought of starting off with the basics.
She has decided to put up a table showing the top 5 popular tracks and their artists.
Your supervisor has asked you to make the table that goes out on the front page of the user interface.

Let's break it down.

10 / 36

Task No. 1 (Spot the Cool Kids)

As the first piece of the user interface for the recommmendation system, your supervisor thought of starting off with the basics.
She has decided to put up a table showing the top 5 popular tracks and their artists.
Your supervisor has asked you to make the table that goes out on the front page of the user interface.

Let's break it down.

First let's select the tracks and their artist names and how popular they are.
Then we can think of a way to get the top five from that table someway.
This should work right? right????

10 / 36

How do we select column names?11 / 36

How do we select column names?

spotify %>% 
  select(track_name,track_artist,track_popularity)

The select command helps us select columns that we want, and we can give a range of columns or even the columns that we don't want to have.
The select function works as follows.
- The first argument is the tibble or data frame, in this case we have piped the spotify dataset using the pipe operator.
- The rest of the arguments are the different ways of selecting the variables.
- For example if we want to select columnA from tibble t we would say select(t,A). Likewise if we want to select everything except column B from tibble t we would say select(t,-B).
- you can even use a set of functions called tidy selectors to select columns based on how the name starts or ends, (For example select(t,starts_with("LKR_")))

Right now you know how to select the top rows of a tibble with the head function, so if you can get the tracks arranged by their popularity in ascending order then everything's all set!

11 / 36

Order! in the dataset

We need a function that can sort the rows based on one or more columns.

12 / 36

Order! in the dataset

We need a function that can sort the rows based on one or more columns.

This is where the arrange function comes to play. The arrange function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.

12 / 36

Order! in the dataset

We need a function that can sort the rows based on one or more columns.

This is where the arrange function comes to play. The arrange function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(track_popularity) %>%
  head(5)

track_name	track_artist	track_popularity
Siren	SUNMI	0
Lollipop (Candyman)	Aqua	0
Around The World	Aqua	0
I'm Yours	Influencers Worship	0
Talk About It	Bancali	0

12 / 36

Order! in the dataset

We need a function that can sort the rows based on one or more columns.

This is where the arrange function comes to play. The arrange function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(track_popularity) %>%
  head(5)

track_name	track_artist	track_popularity
Siren	SUNMI	0
Lollipop (Candyman)	Aqua	0
Around The World	Aqua	0
I'm Yours	Influencers Worship	0
Talk About It	Bancali	0

Wait something's not right, this isn't in descending order! We have to tell the arrange function to sort them in descending order.

12 / 36

You can tell the arrange function to order in descending order by using a minus operator in front, somewhat similar to the select function.

13 / 36

You can tell the arrange function to order in descending order by using a minus operator in front, somewhat similar to the select function.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>%
  head(5)

track_name	track_artist	track_popularity
Dance Monkey	Tones and I	100
Dance Monkey	Tones and I	100
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99

13 / 36

You can tell the arrange function to order in descending order by using a minus operator in front, somewhat similar to the select function.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>%
  head(5)

track_name	track_artist	track_popularity
Dance Monkey	Tones and I	100
Dance Monkey	Tones and I	100
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99

Hold up, why is it repeating like that?
One track can be in multiple playlists! That's why! Now what do we do? Let's select only the distinct or unique rows.

13 / 36

You can tell the arrange function to order in descending order by using a minus operator in front, somewhat similar to the select function.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>%
  head(5)

track_name	track_artist	track_popularity
Dance Monkey	Tones and I	100
Dance Monkey	Tones and I	100
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99
ROXANNE	Arizona Zervas	99

Hold up, why is it repeating like that?
One track can be in multiple playlists! That's why! Now what do we do? Let's select only the distinct or unique rows.

For that we can use the distinct function. The distinct function takes in a tibble as the first argument and then a set of columns to find distinctive rows or uses all columns if no arguments are given. You can see the pattern here right?

13 / 36

Task No. 1 is completed!

Finally by using the distinct function which returns only the unique rows in the dataframe we were able to get the top 5 track names and artists.

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>% 
  distinct() %>%
  head(5)

track_name	track_artist	track_popularity
Dance Monkey	Tones and I	100
ROXANNE	Arizona Zervas	99
Tusa	KAROL G	98
Memories	Maroon 5	98
Blinding Lights	The Weeknd	98

Let's send this out to your supervisor now!

14 / 36

Task No. 1 last final changes

Ding! Ding!

There's a new message from your supervisor on Slack.

She wants this to be more presentable on the user interface by renaming the columns and removing the popularity.

Well we already know how to select columns that we don't need, but what about renaming columns?

15 / 36

Task No. 1 last final changes

Ding! Ding!

There's a new message from your supervisor on Slack.

She wants this to be more presentable on the user interface by renaming the columns and removing the popularity.

Well we already know how to select columns that we don't need, but what about renaming columns?

Based on your gut feeling you google up "rename tidyverse function" and land on the rename function to find out the following.

The rename function takes in pairs of arguments in the form of

new_name = old_name

so we can rename the columns that we want with that function.

This is in addition to the first argument of a tibble to do all of these on.

15 / 36

Using the rename function we were able to change the names of the column names as well

spotify %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>% 
  distinct() %>% 
  head(5) %>% 
  select(-track_popularity) %>% 
  rename(Song = track_name, Artist = track_artist)

Song	Artist
Dance Monkey	Tones and I
ROXANNE	Arizona Zervas
Tusa	KAROL G
Memories	Maroon 5
Blinding Lights	The Weeknd

16 / 36

Recap of Task No. 1

What did we learn from trying out Task No. 1?

select function helps us select the columns that we want
arrange function helps us arrange the table by columns
rename function helps us rename columns
distinct functions helps us select only the unique rows in the table

Congratulations! We finished one task at our internship at Spotify!

17 / 36

Task No. 2 
 (Missing the Good old days)18 / 36

Task No. 2 (Missing the Good old days)

Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all times!
She is all right with it since we worked really hard on it, but she wants to show a table of the top 5 most popular songs in the past year.
The snapshot she gave us is from January 2020 but our code will be used in real time even after 2021 according to her.

First we need to make a new column indicating the year that the album was released in.

spotify %>% 
  select(track_album_release_date) %>% 
  head(2)

# A tibble: 2 x 1
  track_album_release_date
  <chr>                   
1 2019-06-14              
2 2019-12-13

19 / 36

Task No. 2 (Missing the Good old days)

Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all times!
She is all right with it since we worked really hard on it, but she wants to show a table of the top 5 most popular songs in the past year.
The snapshot she gave us is from January 2020 but our code will be used in real time even after 2021 according to her.

First we need to make a new column indicating the year that the album was released in.

spotify %>% 
  select(track_album_release_date) %>% 
  head(2)

# A tibble: 2 x 1
  track_album_release_date
  <chr>                   
1 2019-06-14              
2 2019-12-13

Uh-oh, These are in a character format and you don't know how to get the year out of a character like this. Let's ask our supervisor on how to solve this.

19 / 36

lubricate can fix your string of horrible dates?

Ding! Ding!
Your supervisor replied back telling you to use the function as.Date to make the column into a date and then the year function from the lubridate package.

20 / 36

lubricate can fix your string of horrible dates?

Ding! Ding!
Your supervisor replied back telling you to use the function as.Date to make the column into a date and then the year function from the lubridate package.

What is this lubridate?

lubridate is a package that helps in analysing and extracting useful information from dates, like the year, month, day or even difference between dates in different units!

Image Credits goes to stats illustrations from Allison Horst

20 / 36

You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.

21 / 36

You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.

as.Date(spotify$track_album_release_date) %>% lubridate::year()

21 / 36

You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need.

as.Date(spotify$track_album_release_date) %>% lubridate::year()

Before you send this out to your supervisor you want to make sure that the code that you wrote is correct, so you start matching with the documentation to get a rough idea about what happens.

You create a vector of Dates from a vector of characters
You get the year as a number from the year function in lubridate

Sounds good!

21 / 36

How do we modify the existing columns?

We have a way to get an entire vector of data based on the pre existing column in the dataset.

Now how do we add a new column to a dataset? We could simply do this, dataset$new_column = vector_of_data.

But you suddenly remember how your supervisor brings in changes suddenly, therefore instead of making a break in the pipe in the middle, you try to find alternatives that can fit into the % operator pipeline.

22 / 36

How do we modify the existing columns?

We have a way to get an entire vector of data based on the pre existing column in the dataset.

Now how do we add a new column to a dataset? We could simply do this, dataset$new_column = vector_of_data.

We need to modify the existing dataset that comes after the last step in the pipeline. We can use the mutate function for this. The argument format goes along as new_column_name = modification. The modification should give out a vector that fits the number of rows in the dataset.

Image Credits goes to stats illustrations from Allison Horst

22 / 36

Let's mutate the dataset!

We will first mutate the original dataset to check whether this works. Remember you just have to give the name of the column you want the operation to happen on, and there's no need to have the dataset before that followed with a $

23 / 36

Let's mutate the dataset!

spotify %>% 
  mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>%
  select(release_year) %>% 
  head(2)

# A tibble: 2 x 1
  release_year
         <dbl>
1         2019
2         2019

Success! We managed to get the years out. Now we need to get the maximum of these years.

23 / 36

Let's mutate the dataset!

spotify %>% 
  mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>%
  select(release_year) %>% 
  head(2)

# A tibble: 2 x 1
  release_year
         <dbl>
1         2019
2         2019

Success! We managed to get the years out. Now we need to get the maximum of these years.

spotify_with_year <- spotify %>% 
  mutate(release_year = lubridate::year(as.Date(track_album_release_date)))
latest_year <- max(spotify_with_year$release_year,na.rm = TRUE)

## [1] 2020

23 / 36

How can we filter the songs that we want?

The latest year is with us now! All we need to do now is to filter and get only the tracks that was released on a year that is equal to the latest year.

24 / 36

How can we filter the songs that we want?

The latest year is with us now! All we need to do now is to filter and get only the tracks that was released on a year that is equal to the latest year.

The filter function can be used to get a portion of a dataset that ticks the boxes of a set of one or more logical conditions. Basically any logical condition that gives out a logical vector matching the length of the dataset can be used to filter out the rows that you need.

It's kind of like an actual filter where the barriers are the rows with FALSE.

Image Credits goes to stats illustrations from Allison Horst

24 / 36

Let's use the filter function to filter out the rows that we need. We have to a logical condition as an argument to filter and you can combine different conditions using logical operators like &,|, ~ etc.

25 / 36

spotify_with_year %>% 
  filter(release_year == (latest_year-1)) %>%
  select(track_name,release_year) %>% 
  head(5)

# A tibble: 5 x 2
  track_name                                            release_year
  <chr>                                                        <dbl>
1 I Don't Care (with Justin Bieber) - Loud Luxury Remix         2019
2 Memories - Dillon Francis Remix                               2019
3 All the Time - Don Diablo Remix                               2019
4 Call You Mine - Keanu Silva Remix                             2019
5 Someone You Loved - Future Humans Remix                       2019

Great! Now we need to plug these into our previous code and we are good to go!

25 / 36

Task 2 is done and dusted

spotify_with_year %>%
  filter(release_year == 2019) %>% 
  select(track_name,track_artist,track_popularity) %>% 
  arrange(-track_popularity) %>% 
  distinct() %>% 
  head(5) %>% 
  select(-track_popularity) %>% 
  rename(Song = track_name, Artist = track_artist)

Song	Artist
Dance Monkey	Tones and I
ROXANNE	Arizona Zervas
Tusa	KAROL G
Memories	Maroon 5
Blinding Lights	The Weeknd

26 / 36

Recap of Task No. 2

What did we learn from trying out Task No.2 ?

as.Date function converts character columns to Date columns.
year function from lubridate package gets the year from a Date column.
mutate function mutates existing columns to create new ones.
filter function helps to filter the rows based on a condition.

That was a huge leap from what we learned in Task No. 1!

27 / 36

Task No. 3 
 (Playlists for your moods)28 / 36

Task No. 3 (Playlists for your moods)

The supervisor has one final task for you.

You need to give the average scores for energy, danceability, instrumentalness, valence for each playlist to so that the supervisor can use that dataset in her Plotly Radar plot function.

This is going to be a tough one! You sit down and sketch out how the final dataset should look like and this is what you got,

playlist_genre	avg_energy	avg_dance	avg_instu	avg_valen
some name 1	0.9	0.3	0.2	0.4
some name 2	0.9	0.7	0.2	0.6

So now we know what sort of outcome we expect at the end. Now the question remains how do you group these playlists together.

29 / 36

Understanding group_by

We can use the group_by function in dplyr for this. But the group_by function can be tricky sometimes so you dig a bit deeper.

group_by function literally groups rows with the same combination of values in the given columns together
After that any mutate function that is applied will apply within the group.
But remember to ungroup the dataset afterwards, otherwise they would remain in the grouped state and all the actions that you perform on it will be performed group wise.

30 / 36

Testing out the waters

Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.

31 / 36

Testing out the waters

Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.

spotify %>% 
  group_by(playlist_genre) %>% 
  mutate(mean_energy = mean(energy)) %>% 
  select(playlist_genre,mean_energy) %>% 
  head(3)

## # A tibble: 3 x 2
## # Groups:   playlist_genre [1]
##   playlist_genre mean_energy
##   <chr>                <dbl>
## 1 pop                  0.701
## 2 pop                  0.701
## 3 pop                  0.701

31 / 36

Testing out the waters

Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278.

spotify %>% 
  group_by(playlist_genre) %>% 
  mutate(mean_energy = mean(energy)) %>% 
  select(playlist_genre,mean_energy) %>% 
  head(3)

## # A tibble: 3 x 2
## # Groups:   playlist_genre [1]
##   playlist_genre mean_energy
##   <chr>                <dbl>
## 1 pop                  0.701
## 2 pop                  0.701
## 3 pop                  0.701

So we have to call distinct for this as well it seems. And the tibble output shows that the dataset is now a grouped dataset by playlist_genre. We should add an ungroup at the end to remove that effect, otherwise we might forget and issues may come up.

31 / 36

With that in your head you first write out the group_by command and decide to be adventurous and mutate a new column for each of the averages.

32 / 36

With that in your head you first write out the group_by command and decide to be adventurous and mutate a new column for each of the averages.

spotify %>% 
  group_by(playlist_genre) %>%
  mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
            avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
  rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
         Instrumentals = avg_instu, Upbeat = avg_valen) %>% 
  select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>% 
  distinct() %>% 
  ungroup() %>% 
  head(4)

Genre	Energetic	Dance	Instrumentals	Upbeat
pop	0.7010278	0.6393017	0.0598763	0.5035210
rap	0.6507084	0.7183528	0.0759972	0.5050900
rock	0.7328133	0.5205480	0.0624173	0.5373521
latin	0.7083125	0.7132873	0.0444470	0.6055104

32 / 36

With that in your head you first write out the group_by command and decide to be adventurous and mutate a new column for each of the averages.

spotify %>% 
  group_by(playlist_genre) %>%
  mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
            avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
  rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
         Instrumentals = avg_instu, Upbeat = avg_valen) %>% 
  select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>% 
  distinct() %>% 
  ungroup() %>% 
  head(4)

Genre	Energetic	Dance	Instrumentals	Upbeat
pop	0.7010278	0.6393017	0.0598763	0.5035210
rap	0.6507084	0.7183528	0.0759972	0.5050900
rock	0.7328133	0.5205480	0.0624173	0.5373521
latin	0.7083125	0.7132873	0.0444470	0.6055104

Do we really have to call distinct again like this? Or is there an easier way than having to group_by -> mutate -> select -> distinct?

32 / 36

Task No. 3 is over!

Your supervisor noticed what you were typing on your laptop and gave a tip to Google up on the function summarize to make the code more readable and safer.

And now after digging on the summarize function, you realize that by simply replacing mutate with summarize your code now looks like this. And it automatically ungroups the outermost group after giving the summary statistics.

33 / 36

Task No. 3 is over!

Your supervisor noticed what you were typing on your laptop and gave a tip to Google up on the function summarize to make the code more readable and safer.

spotify %>% 
  group_by(playlist_genre) %>% 
  summarize(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
            avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
  rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
         Instrumentals = avg_instu, Upbeat = avg_valen) %>% 
  head(5)

Genre	Energetic	Dance	Instrumentals	Upbeat
edm	0.8024759	0.6550409	0.2185779	0.4006555
latin	0.7083125	0.7132873	0.0444470	0.6055104
pop	0.7010278	0.6393017	0.0598763	0.5035210
r&b	0.5909343	0.6701793	0.0289203	0.5312306
rap	0.6507084	0.7183528	0.0759972	0.5050900

33 / 36

Recap of Task No. 3

What did we learn from trying out Task No. 3?

group_by helps to group the dataset into groups based on the values of a column
summarize helps to make a summary of a grouped / ungrouped dataset to get summary statistics

Finally! We finished all three tasks! We are now well on our way to start analysing data that come our way.

34 / 36

What did we learn today?

Data wrangling is kind of like meal prep where you prepare your ingredients(the raw data) for the final dish(the model or visualization)
There are several functions in different packages such as filter,mutate,select, arrange,group_by,summarize, distinct from the dplyr package and even date functions from the lubridate package that help to wrangle data into the format we need.

35 / 36

What did we learn today?

Data wrangling is kind of like meal prep where you prepare your ingredients(the raw data) for the final dish(the model or visualization)
There are several functions in different packages such as filter,mutate,select, arrange,group_by,summarize, distinct from the dplyr package and even date functions from the lubridate package that help to wrangle data into the format we need.

Where to go from here?

Get the Spotify Developer Account and access the APIs to create a dataset that is upto date and matches your style.
Create an R shiny dashboard that can show interactive graphs of all the artists and where they fall on different indexes which are based on how positive the song is and so on.
Analyse #TidyTuesday datasets that come out weekly and follow the R for Data Science Community on Twitter for more inspiration.

35 / 36

Thank you!

Where can you find me?

Email: janithcwanni@gmail.com

Twitter: @janithcwanni

Github: @janithwanni

Linkedin: Janith Wanniarachchi

36 / 36

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Hi there

Janith Wanniarachchi

What is this data wrangling?

What are we going to do today

Here's the scene,

Glimpse at the data

What are all these numbers?

Task No. 1 (Spot the Cool Kids)

Task No. 1 (Spot the Cool Kids)

Task No. 1 (Spot the Cool Kids)

How do we select column names?

How do we select column names?

Order! in the dataset

Order! in the dataset

Order! in the dataset

Order! in the dataset

Task No. 1 is completed!

Task No. 1 last final changes

Task No. 1 last final changes

Recap of Task No. 1

Task No. 2 (Missing the Good old days)

Task No. 2 (Missing the Good old days)

Task No. 2 (Missing the Good old days)

lubricate can fix your string of horrible dates?

lubricate can fix your string of horrible dates?

What is this lubridate?

How do we modify the existing columns?

How do we modify the existing columns?

Let's mutate the dataset!

Let's mutate the dataset!

Let's mutate the dataset!

How can we filter the songs that we want?

How can we filter the songs that we want?

Task 2 is done and dusted

Recap of Task No. 2

Task No. 3 (Playlists for your moods)

Task No. 3 (Playlists for your moods)

Understanding group_by

Testing out the waters

Testing out the waters

Testing out the waters

Task No. 3 is over!

Task No. 3 is over!

Recap of Task No. 3

What did we learn today?

What did we learn today?

Where to go from here?

Thank you!

Where can you find me?

Hi there

Janith Wanniarachchi

Help

Task No. 1
(Spot the Cool Kids)

Task No. 2
(Missing the Good old days)

Task No. 3
(Playlists for your moods)