class: center, middle, first-slide, inverse <span style="font-family: 'Playfair Display', serif; font-weight: 500; font-size: 54pt"> Data Wrangling with R </span> <hr style="width:60%;margin: 0 auto;"/> <span style="font-family: 'Lato', sans-serif;font-size: 24pt;font-weight:400;"> R Ladies Colombo <br/> <span style="font-weight:300;font-size: 18pt;">April 2021 Meetup</span> </span> .center[<img src="images/bg4.png" style="position:fixed;bottom:0px;left:-0%"/>] --- <h1 style="text-align:center;color:#363a59">Hi there</h1> .pull-left[ .center[<img src="images/profile_picture.jpg" style=" height: auto; border-radius: 50%; border: 3px solid #414770; width:45%;"/>] <hr/> <br/> .center[ <h2 style="color:#5b85aa"> Janith Wanniarachchi </h2> <br/> .status-highlight[BSc. Statistics (Special) Undergraduate] at University of Sri Jayewardenepura, Sri Lanka, .status-highlight[Data Science Intern] at Trabeya, .status-highlight[Consultant Data Analyst] at Creative Hub. ] ] .pull-right[ Co-author of two R packages maintained by Dr. Thiyanga Thalagala, [DSJobTracker](https://github.com/thiyangt/DSjobtracker) and [tsdataleaks](https://github.com/thiyangt/tsdataleaks). .center[<img src="https://github.com/thiyangt/DSjobtracker/raw/master/hexsticker/figures/hexsticker.png" style="width: 30%"> <img src="https://github.com/thiyangt/ThiTal/raw/master/static/Slides/isf2020/hexsticker.png" style="width: 30%">] Runner's up in two datathons and finalist in two datathons. <br/> <img src="images/2019_hacktoberfest.jpg" style="width:50%;transform: rotate(-40deg);" /><img src="images/2020_hacktoberfest.jpg" style="width:50%;transform: rotate(40deg);" /> ] --- # What is this data *wrangling*? .center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_wrangling.png" style="width:40%">] .footnote[Image Credits goes to stats illustrations from [Allison Horst](https://github.com/allisonhorst/stats-illustrations)] * Data wrangling is where we take a raw dataset and transform it into a format that fits for different tasks * Simply it's like meal prep where we get the data ready to make the main dish (in this case it could be a model or visualization) * There are several packages that can help with data wrangling but for today we would be scratching the surface of the `dplyr` package in the tidyverse ecosystem. --- <h1 style="text-align:center">What are we going to do today</h1> .center[<img src="https://i.imgflip.com/55djd3.jpg"/>] ??? * I won't be teaching you a list of functions and how to use them today, instead I will be walking you through an actual scenario where these functions might be needed. * The main outcome of this should be to have an idea about how to approach a problem and how to combine different tools that you have under your belt. --- ## Here's the scene, .left-column[ <br/><br/> <img src="images/spotify_logo.png"/> <br/><br/> .center[<span style="font-size:64pt"> + </span>] <br/> <img src="images/rstudio_logo_cropped.png"/> ] .right-column[* You are the new Data Science intern at Spotify! and you are told to work on a simple csv dataset that resembles the actual dataset. * You are given a snapshot of the dataset from 2020 to work on and your supervisor will be making the entire user interface of the recommender system. * She told us to use the [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md) data on Spotify from 2020-01-21. * At the end she will talk about extending this into an actual real time application. * The supervisor will let you make the R codes locally using the dataset that was given to you and she will on the other end work on making the functions that show the plots and tables to the user. * Usually most of your time will be spent on data wrangling hence she entrusts you to get the job done. ] --- ### Glimpse at the data ``` Rows: 32,833 Columns: 23 $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZ… $ track_name <chr> "I Don't Care (with Justin Bieber)… $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara La… $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68… $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO2… $ track_album_name <chr> "I Don't Care (with Justin Bieber)… $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-… $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Rem… $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZ… $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop",… $ playlist_subgenre <chr> "dance pop", "dance pop", "dance p… $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650,… $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833,… $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, … $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4… $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1… $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.… $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870… $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43… $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.… $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725,… $ tempo <dbl> 122.036, 99.972, 124.008, 121.956,… $ duration_ms <dbl> 194754, 162600, 176616, 169093, 18… ``` --- class: middle, center <img src="images/numbers_mason_meme.jpg" style="width: 150%;height: 150%"/> --- # What are all these numbers? * The entire data dictionary was found [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md#spotify_songscsv) and you manage to jot it down as follows. * We have the id,genre and subgenre of each playlist. * The id,name and release date of each album. * Finally id,name,artist and popularity of each track. * In addition we have the following features of songs. + danceability : is it suitable for dancing (range: [0,1]) + energy : how intense and active is the song (range: [0,1]) + loudness : the overall loudness of track in dB (range: [-60,0]) + speechiness : is there a lot of speech in it? (range: [0,1]) + accousticness : is the song an acoustic song (range: [0,1]) + instrumentalness : does the song have no vocal content (range: [0,1]) + liveness : is there an audience in the song (range: [0,1]) + valence : how positive is the song (range: [0,1]) + key : the key of the song (CAUTION NA's means that there is no key) + mode : 1 if its a major or 0 if its a minor + tempo : BPM of the song --- class: middle, center, title-only-slide <h1>Task No. 1 <br/>(Spot the Cool Kids)</h1> --- # Task No. 1 (Spot the Cool Kids) .pull-left[ * As the first piece of the user interface for the recommmendation system, your supervisor thought of starting off with the basics. * She has decided to put up a table showing the top 5 popular tracks and their artists. * Your supervisor has asked you to make the table that goes out on the front page of the user interface. ] .pull-right[ * Let's break it down. ] -- <div class="pull-right" style="clear:none"> <ul> <li>First let's select the tracks and their artist names and how popular they are. </li> <li>Then we can think of a way to get the top five from that table someway. </li> <li>This should work right? right????</li> </ul> <span class="center"> <img src="https://i.pinimg.com/originals/6c/06/7f/6c067f95ccf0401be464131c4e4f6c32.jpg" style="width:60%"/> </span> </div> --- ## How do we *select* column names? -- ```r spotify %>% * select(track_name,track_artist,track_popularity) ``` * The `select` command helps us select columns that we want, and we can give a range of columns or even the columns that we don't want to have. * The `select` function works as follows. + The first argument is the tibble or data frame, in this case we have piped the spotify dataset using the pipe operator. + The rest of the arguments are the different ways of selecting the variables. + For example if we want to select columnA from tibble t we would say `select(t,A)`. Likewise if we want to select everything except column B from tibble t we would say `select(t,-B)`. + you can even use a set of functions called tidy selectors to select columns based on how the name starts or ends, (For example `select(t,starts_with("LKR_"))`) Right now you know how to select the top rows of a tibble with the head function, so if you can get the tracks arranged by their popularity in ascending order then everything's all set! --- # Order! in the dataset We need a function that can sort the rows based on one or more columns. -- This is where the `arrange` function comes to play. The `arrange` function takes up a tibble as the first input and a set of columns to order by increasing or decreasing order. -- ```r spotify %>% select(track_name,track_artist,track_popularity) %>% * arrange(track_popularity) %>% head(5) ``` |track_name |track_artist | track_popularity| |:-------------------|:-------------------|----------------:| |Siren |SUNMI | 0| |Lollipop (Candyman) |Aqua | 0| |Around The World |Aqua | 0| |I'm Yours |Influencers Worship | 0| |Talk About It |Bancali | 0| -- Wait something's not right, this isn't in descending order! We have to tell the `arrange` function to sort them in descending order. --- You can tell the `arrange` function to order in descending order by using a minus operator in front, somewhat similar to the `select` function. -- ```r spotify %>% select(track_name,track_artist,track_popularity) %>% * arrange(-track_popularity) %>% head(5) ``` |track_name |track_artist | track_popularity| |:------------|:--------------|----------------:| |Dance Monkey |Tones and I | 100| |Dance Monkey |Tones and I | 100| |ROXANNE |Arizona Zervas | 99| |ROXANNE |Arizona Zervas | 99| |ROXANNE |Arizona Zervas | 99| -- Hold up, why is it repeating like that? One track can be in multiple playlists! That's why! Now what do we do? Let's select only the distinct or unique rows. -- For that we can use the `distinct` function. The `distinct` function takes in a tibble as the first argument and then a set of columns to find distinctive rows or uses all columns if no arguments are given. You can see the pattern here right? --- ## Task No. 1 is completed! Finally by using the `distinct` function which returns only the unique rows in the dataframe we were able to get the top 5 track names and artists. ```r spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% * distinct() %>% head(5) ``` |track_name |track_artist | track_popularity| |:---------------|:--------------|----------------:| |Dance Monkey |Tones and I | 100| |ROXANNE |Arizona Zervas | 99| |Tusa |KAROL G | 98| |Memories |Maroon 5 | 98| |Blinding Lights |The Weeknd | 98| Let's send this out to your supervisor now! --- ## Task No. 1 last final changes .pull-left[Ding! Ding! There's a new message from your supervisor on Slack. She wants this to be more presentable on the user interface by renaming the columns and removing the popularity. <img src="https://i.imgflip.com/2lhzmp.jpg" style="height:80%;width:80%;"/> ] .pull-right[ Well we already know how to select columns that we don't need, but what about renaming columns? ] -- <div class="pull-right" style="clear:none"/> <p>Based on your gut feeling you google up "rename tidyverse function" and land on the rename function to find out the following.</p> <p>The <code class="remark-inline-code">rename</code> function takes in pairs of arguments in the form of </p> <p><em>new_name = old_name</em> </p> <p>so we can rename the columns that we want with that function. </p> <p>This is in addition to the first argument of a tibble to do all of these on. </p> </div> --- Using the `rename` function we were able to change the names of the column names as well ```r spotify %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% distinct() %>% head(5) %>% select(-track_popularity) %>% * rename(Song = track_name, Artist = track_artist) ``` |Song |Artist | |:---------------|:--------------| |Dance Monkey |Tones and I | |ROXANNE |Arizona Zervas | |Tusa |KAROL G | |Memories |Maroon 5 | |Blinding Lights |The Weeknd | --- ## Recap of Task No. 1 What did we learn from trying out Task No. 1? * `select` function helps us select the columns that we want * `arrange` function helps us arrange the table by columns * `rename` function helps us rename columns * `distinct` functions helps us select only the unique rows in the table Congratulations! We finished one task at our internship at Spotify! .center[<img src="https://media.giphy.com/media/kBZBlLVlfECvOQAVno/giphy.gif" />] --- class: center,middle, title-only-slide <h1>Task No. 2 <br/> (Missing the Good old days)</h1> --- # Task No. 2 (Missing the Good old days) .pull-left[ * Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all times! * She is all right with it since we worked really hard on it, but she wants to show a table of the top 5 most popular songs in the past year. * The snapshot she gave us is from January 2020 but our code will be used in real time even after 2021 according to her. ] .pull-right[ * First we need to make a new column indicating the year that the album was released in. ```r spotify %>% select(track_album_release_date) %>% head(2) ``` ``` # A tibble: 2 x 1 track_album_release_date <chr> 1 2019-06-14 2 2019-12-13 ``` ] -- <div class="pull-right" style="clear:none"> Uh-oh, These are in a character format and you don't know how to get the year out of a character like this. Let's ask our supervisor on how to solve this. </div> --- ## lubricate can fix your string of horrible dates? Ding! Ding! Your supervisor replied back telling you to use the function `as.Date` to make the column into a date and then the `year` function from the `lubridate` package. -- <h2>What is this <span class="remark-inline-code" style="font-size:32pt">lubridate</span>?</h2> lubridate is a package that helps in analysing and extracting useful information from dates, like the year, month, day or even difference between dates in different units! .center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/lubridate_ymd.png" style="width:50%"/>] .footnote[Image Credits goes to stats illustrations from [Allison Horst](https://github.com/allisonhorst/stats-illustrations)] --- You go through the documentation and Google online and land on a StackOverflow page to find out that this code works out to give you the years that you need. -- ```r *as.Date(spotify$track_album_release_date) %>% lubridate::year() ``` -- Before you send this out to your supervisor you want to make sure that the code that you wrote is correct, so you start matching with the documentation to get a rough idea about what happens. 1. You create a vector of Dates from a vector of characters 2. You get the year as a number from the `year` function in `lubridate` Sounds good! .center[<img src="https://media.tenor.com/images/356d20c65288742f2ba36e4514f156b3/tenor.gif" width="40%"/>] --- ## How do we modify the existing columns? .pull-left[We have a way to get an entire vector of data based on the pre existing column in the dataset. Now how do we add a new column to a dataset? We could simply do this, `dataset$new_column = vector_of_data`. But you suddenly remember how your supervisor brings in changes suddenly, therefore instead of making a break in the pipe in the middle, you try to find alternatives that can fit into the `%` operator pipeline. ] -- .pull-right[ We need to modify the existing dataset that comes after the last step in the pipeline. We can use the `mutate` function for this. The argument format goes along as `new_column_name = modification`. The modification should give out a vector that fits the number of rows in the dataset. .center[<img src="images/dplyr_mutate.png" style="width:95%"/>] ] .footnote[Image Credits goes to stats illustrations from [Allison Horst](https://github.com/allisonhorst/stats-illustrations)] --- ## Let's mutate the dataset! We will first mutate the original dataset to check whether this works. Remember you just have to give the name of the column you want the operation to happen on, and there's no need to have the dataset before that followed with a `$` -- ```r spotify %>% * mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>% select(release_year) %>% head(2) ``` ``` # A tibble: 2 x 1 release_year <dbl> 1 2019 2 2019 ``` Success! We managed to get the years out. Now we need to get the maximum of these years. -- ```r spotify_with_year <- spotify %>% mutate(release_year = lubridate::year(as.Date(track_album_release_date))) latest_year <- max(spotify_with_year$release_year,na.rm = TRUE) ``` ``` ## [1] 2020 ``` --- ## How can we filter the songs that we want? The latest year is with us now! All we need to do now is to filter and get only the tracks that was released on a year that is equal to the latest year. -- The `filter` function can be used to get a portion of a dataset that ticks the boxes of a set of one or more logical conditions. Basically any logical condition that gives out a logical vector matching the length of the dataset can be used to filter out the rows that you need. It's kind of like an actual filter where the barriers are the rows with `FALSE`. .center[<img src="images/dplyr_filter_sm.png" style="width:50%"/>] .footnote[Image Credits goes to stats illustrations from [Allison Horst](https://github.com/allisonhorst/stats-illustrations)] --- Let's use the `filter` function to filter out the rows that we need. We have to a logical condition as an argument to filter and you can combine different conditions using logical operators like &,|, ~ etc. -- ```r spotify_with_year %>% * filter(release_year == (latest_year-1)) %>% select(track_name,release_year) %>% head(5) ``` ``` # A tibble: 5 x 2 track_name release_year <chr> <dbl> 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix 2019 2 Memories - Dillon Francis Remix 2019 3 All the Time - Don Diablo Remix 2019 4 Call You Mine - Keanu Silva Remix 2019 5 Someone You Loved - Future Humans Remix 2019 ``` Great! Now we need to plug these into our previous code and we are good to go! --- ## Task 2 is done and dusted ```r spotify_with_year %>% filter(release_year == 2019) %>% select(track_name,track_artist,track_popularity) %>% arrange(-track_popularity) %>% distinct() %>% head(5) %>% select(-track_popularity) %>% rename(Song = track_name, Artist = track_artist) ``` |Song |Artist | |:---------------|:--------------| |Dance Monkey |Tones and I | |ROXANNE |Arizona Zervas | |Tusa |KAROL G | |Memories |Maroon 5 | |Blinding Lights |The Weeknd | --- ## Recap of Task No. 2 What did we learn from trying out Task No.2 ? * `as.Date` function converts character columns to Date columns. * `year` function from `lubridate` package gets the year from a Date column. * `mutate` function mutates existing columns to create new ones. * `filter` function helps to filter the rows based on a condition. That was a huge leap from what we learned in Task No. 1! .center[<img src="https://en.meming.world/images/en/thumb/a/aa/Something_of_a_Scientist.jpg/600px-Something_of_a_Scientist.jpg" style="width:60%"/>] --- class: center,middle, title-only-slide <h1>Task No. 3 <br/> (Playlists for your moods)</h1> --- # Task No. 3 (Playlists for your moods) The supervisor has one final task for you. You need to give the average scores for energy, danceability, instrumentalness, valence for each playlist to so that the supervisor can use that dataset in her Plotly Radar plot function. This is going to be a tough one! You sit down and sketch out how the final dataset should look like and this is what you got, |playlist_genre | avg_energy| avg_dance| avg_instu| avg_valen| |:--------------|----------:|---------:|---------:|---------:| |some name 1 | 0.9| 0.3| 0.2| 0.4| |some name 2 | 0.9| 0.7| 0.2| 0.6| So now we know what sort of outcome we expect at the end. Now the question remains how do you group these playlists together. --- ## Understanding group_by We can use the `group_by` function in dplyr for this. But the `group_by` function can be tricky sometimes so you dig a bit deeper. * `group_by` function literally groups rows with the same combination of values in the given columns together * After that any `mutate` function that is applied will apply within the group. * But remember to ungroup the dataset afterwards, otherwise they would remain in the grouped state and all the actions that you perform on it will be performed group wise. .center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/group_by_ungroup.png" style="width:50%" />] --- ## Testing out the waters Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as 0.7010278. -- ```r spotify %>% group_by(playlist_genre) %>% mutate(mean_energy = mean(energy)) %>% select(playlist_genre,mean_energy) %>% head(3) ``` ``` ## # A tibble: 3 x 2 ## # Groups: playlist_genre [1] ## playlist_genre mean_energy ## <chr> <dbl> ## 1 pop 0.701 ## 2 pop 0.701 ## 3 pop 0.701 ``` -- So we have to call `distinct` for this as well it seems. And the tibble output shows that the dataset is now a grouped dataset by `playlist_genre`. We should add an ungroup at the end to remove that effect, otherwise we might forget and issues may come up. --- With that in your head you first write out the `group_by` command and decide to be adventurous and `mutate` a new column for each of the averages. -- ```r spotify %>% * group_by(playlist_genre) %>% mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance, Instrumentals = avg_instu, Upbeat = avg_valen) %>% select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>% distinct() %>% ungroup() %>% head(4) ``` |Genre | Energetic| Dance| Instrumentals| Upbeat| |:-----|---------:|---------:|-------------:|---------:| |pop | 0.7010278| 0.6393017| 0.0598763| 0.5035210| |rap | 0.6507084| 0.7183528| 0.0759972| 0.5050900| |rock | 0.7328133| 0.5205480| 0.0624173| 0.5373521| |latin | 0.7083125| 0.7132873| 0.0444470| 0.6055104| -- Do we really have to call distinct again like this? Or is there an easier way than having to `group_by` -> `mutate` -> `select` -> `distinct`? --- ## Task No. 3 is over! Your supervisor noticed what you were typing on your laptop and gave a tip to Google up on the function `summarize` to make the code more readable and safer. And now after digging on the `summarize` function, you realize that by simply replacing mutate with summarize your code now looks like this. And it automatically ungroups the outermost group after giving the summary statistics. -- ```r spotify %>% group_by(playlist_genre) %>% * summarize(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), * avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance, Instrumentals = avg_instu, Upbeat = avg_valen) %>% head(5) ``` |Genre | Energetic| Dance| Instrumentals| Upbeat| |:-----|---------:|---------:|-------------:|---------:| |edm | 0.8024759| 0.6550409| 0.2185779| 0.4006555| |latin | 0.7083125| 0.7132873| 0.0444470| 0.6055104| |pop | 0.7010278| 0.6393017| 0.0598763| 0.5035210| |r&b | 0.5909343| 0.6701793| 0.0289203| 0.5312306| |rap | 0.6507084| 0.7183528| 0.0759972| 0.5050900| --- ## Recap of Task No. 3 What did we learn from trying out Task No. 3? * `group_by` helps to group the dataset into groups based on the values of a column * `summarize` helps to make a summary of a grouped / ungrouped dataset to get summary statistics Finally! We finished all three tasks! We are now well on our way to start analysing data that come our way. .center[<img src="https://i.kym-cdn.com/entries/icons/original/000/028/021/work.jpg" style="width:70%"/>] --- .pull-left[## What did we learn today? * Data wrangling is kind of like meal prep where you prepare your ingredients(the raw data) for the final dish(the model or visualization) * There are several functions in different packages such as `filter`,`mutate`,`select`, `arrange`,`group_by`,`summarize`, `distinct` from the `dplyr` package and even date functions from the `lubridate` package that help to wrangle data into the format we need.] -- .pull-right[## Where to go from here? * Get the Spotify Developer Account and access the APIs to create a dataset that is upto date and matches your style. * Create an R shiny dashboard that can show interactive graphs of all the artists and where they fall on different indexes which are based on how positive the song is and so on. * Analyse [#TidyTuesday](https://github.com/rfordatascience/tidytuesday/) datasets that come out weekly and follow the [R for Data Science Community](https://twitter.com/R4DScommunity) on Twitter for more inspiration.] --- class: center, middle <br/> # Thank you! <br/> <h2 style="color:#5b85aa">Where can you find me?</h2> Email: [janithcwanni@gmail.com](#) Twitter: [@janithcwanni](https://twitter.com/janithcwanni) Github: [@janithwanni](https://github.com/janithwanni) Linkedin: [Janith Wanniarachchi](https://www.linkedin.com/in/janith-wanniarachchi-462851117/)