class: title-slide, center, middle

# Importing Data & Project-oriented Workflows

---

# Importing data

### Importing data in R has 2 challenging aspects...

--

1. You need to call a function that works with a particular data format (`.csv`, `.txt`, `.sav`, etc.)

--

2. You need to tell R where to look for the data

---

# Importing data

### Challenge #1: File types

--

.pull-left[
.center[
### readr
<img src="images/hex/readr.png" width="55%" />

`read_csv()`, `read_tsv()`, `read_delim()`, `read_fwf()`, etc...
]
]

--

.pull-right[
.center[
### rio
<img src="images/hex/rio.png" width="55%" />

`import()`
]
]

---

# Importing data

### Challenge #1: File types

.pull-left[
.center[
### readr
<img src="images/hex/readr.png" width="55%" />

`read_csv()`, `read_tsv()`, `read_delim()`, `read_fwf()`, etc...
]
]

.pull-right[
.center[
### rio
<img src="images/hex/rio.png" width="55%" />

`import()` ✅
]
]

---

# Importing data

### `rio::import()`

We just call `import()` and under the hood it calls the right read function given the file's extension (`.csv`, `.txt`, `.sav`, `.xlsx`, etc.)

--

We'll get some practice with this in a few minutes

---

# Project-oriented workflows

### Challenge #2: File paths

--

When R looks for a file, it has a starting point. This is called the **working directory**.

--

The working directory that you're currently in is displayed in the console window and the files tab. Let's take a look in RStudio...
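To make the "starting point" idea concrete, here is a minimal base-R sketch. The file name `my_data.csv` is hypothetical and is not part of the course materials; the point is only that a bare file name is looked up relative to the working directory, so it is treated the same as the full path built from that directory.

```r
# Hypothetical file name, used only for illustration
relative_path <- "my_data.csv"

# Build the full path by starting from the current working directory
full_path <- file.path(getwd(), relative_path)
print(full_path)

# R resolves the bare file name from the working directory,
# so both checks look in the same place and give the same answer
same_answer <- identical(file.exists(relative_path), file.exists(full_path))
print(same_answer)
```

If the file lives somewhere else, both checks return `FALSE`, and an import call that uses the bare file name will fail.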
--

***

If you ever get lost, you can print your working directory with `getwd()`

--

If you are working in a `.Rmd` document, R will by default set the working directory to the folder where that `.Rmd` file lives

--

```r
getwd()
```

```
## [1] "/Users/bcullen/Desktop/rstats_teaching/uopsych-r-bootcamp/2021/summeR-bootcamp-2021/static/slides"
```

For example, I created these slides in a `.Rmd` document that lives in this folder on my computer ☝️

---
class: split-three

# Project-oriented workflows

### RStudio Projects

The best way to simplify issues with working directories is to use **RStudio Projects**.

--

***

.column[.content[.center[
<br><br><br><br><br><br><br><br><br><br><br><br><br>
### Step 1
<img src="images/create_project1.png" width="90%" />
]]]

.column[.content[.center[
<br><br><br><br><br><br><br><br><br><br><br><br><br>
### Step 2
<img src="images/create_project2.png" width="90%" />
]]]

.column[.content[.center[
<br><br><br><br><br><br><br><br><br><br><br><br><br>
### Step 3
<img src="images/create_project3.png" width="90%" />
]]]

---

# Project-oriented workflows

### RStudio Projects

When you create a Project in RStudio, it is associated with a folder somewhere on your computer.

--

It will automatically create a `.Rproj` file in that folder, which will keep track of the "top level" of your project.

--

***

For example, we've been using RStudio Projects for "Your Turn" exercises

<img src="images/example_rproj.png" width="2165" />

---

# Project-oriented workflows

### `here::here()`

--

.pull-left[
In combination with RStudio Projects, use the here package

`here::here()` will build a file path to the top level of your project directory.

This makes it easy to tell R where files live relative to the top-level folder of your project
]

.pull-right[
<img src="images/here.png" width="3696" />
]

---
class: inverse

# Your turn 1
04:00
#### 1. Load the rio and here packages.

#### 2. Run the following code to import the data called `pragmatic_scales_data.csv`. Why do you get an error? Where is this file saved? *Hint*: Look through the folder(s) in the Files pane

```r
ps_data <- import("pragmatic_scales_data.csv")
```

#### 3. Fix the error in the code above to import the data. *Hint*: use the `here()` function

#### 4. Remember that rio is flexible with file types -- `rio::import()` will call the right function under the hood to read in the file based on the file extension. Use rio to import `pragmatic_scales_data.sav` (an SPSS file type) and save it to a new object named `ps_data_2`.

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
# Q1.
library(rio)
library(here)
```
]

.panel[.panel-name[Q2]

```r
# Q2.
ps_data <- import("pragmatic_scales_data.csv")
```

```
## Error in import("pragmatic_scales_data.csv"): No such file
```

***

The file **pragmatic_scales_data.csv** is saved in the **data** folder, so we need to tell R to look in that folder.
]

.panel[.panel-name[Q3]

```r
# Q3.
ps_data <- import(here("data/pragmatic_scales_data.csv"))
```

```r
# Alternative solution -- specify each subfolder separately
ps_data <- import(here("data", "pragmatic_scales_data.csv"))
```
]

.panel[.panel-name[Q4]

```r
# Q4.
ps_data_2 <- import(here("data/pragmatic_scales_data.sav"))
```

```r
# Alternative solution -- specify each subfolder separately
ps_data_2 <- import(here("data", "pragmatic_scales_data.sav"))
```
]
]

---

# Exporting data

You can also use rio to export your data using `export()`, saving it in any of the formats that it works with.

--

***

Here are the arguments you will need to use for `export()`

```r
export(x, file)
```

--

`x` is the `data.frame` object in your RStudio Environment you want to export

--

`file` is the path/filename for the resulting file

--

***

For example, let's say I want to export `ps_data` as an `.xlsx` file and put it into the `data/` subdirectory.

```r
export(ps_data, here::here("data", "ps_data.xlsx"))
```

---
class: inverse

# Your turn 2
04:00
1. Look through the Files pane and find the file `another_data_set.csv`. Make note of what subdirectory it is saved in. Import the data and save it to an object called `another_df`.

1. Now export the data you just imported and save it into the `data/` directory. Make sure the name of the resulting file is `another_data_set`, and export it as a `.xlsx` file.

1. One of your colleagues insists you send them a `.sav` file so that they can run the analyses in SPSS. Make another copy of `another_data_set` in the `data/` subdirectory that is in the `.sav` format.

1. Finally, let's read one of these datasets to make sure everything worked as expected. Import `another_data_set.sav`, which you just created, and save it to a new object named `another_df_2`.

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
# Q1.
another_df <- import(here("data", "more_data", "another_data_set.csv"))
```
]

.panel[.panel-name[Q2]

```r
# Q2.
export(another_df, here("data", "another_data_set.xlsx"))
```
]

.panel[.panel-name[Q3]

```r
# Q3.
export(another_df, here("data", "another_data_set.sav"))
```
]

.panel[.panel-name[Q4]

```r
# Q4.
another_df_2 <- import(here("data", "another_data_set.sav"))
```
]
]

---

# Viewing data

Now that your data is loaded in R, you'll want to take a look at it. There are a few different ways to do that, each of which offers different information.

--

***

### `View()`

One way is to click on the `View` button in the environment pane...
--

You should see `ps_data` in the environment pane with a little data table icon at the far right. Click on that icon.

--

You'll notice that this ran `View(ps_data)` in the console. We could have instead just typed this directly ourselves -- notice the capital `V` in `View()` 👀

---

# Viewing data

### `head()` and `tail()`

--

.panelset[
.panel[.panel-name[`head()`]

You can also see just the first few rows of a dataframe with `head()`, which is especially helpful for large data sets

```r
head(ps_data)
```

```
##   subid   item correct  age condition
## 1   M22  faces       1 2.00     Label
## 2   M22 houses       1 2.00     Label
## 3   M22  pasta       0 2.00     Label
## 4   M22   beds       0 2.00     Label
## 5   T22   beds       0 2.13     Label
## 6   T22  faces       0 2.13     Label
```
]

.panel[.panel-name[`tail()`]

`tail()` is the complement to `head()`, displaying just the final rows from a dataframe

```r
tail(ps_data)
```

```
##      subid   item correct  age condition
## 583 MSCH84  pasta       1 2.83  No Label
## 584 MSCH84   beds       0 2.83  No Label
## 585 MSCH85  faces       0 2.69  No Label
## 586 MSCH85 houses       0 2.69  No Label
## 587 MSCH85  pasta       0 2.69  No Label
## 588 MSCH85   beds       0 2.69  No Label
```
]
]

---

# Viewing data

### `str()` and `glimpse()`

.panelset[
.panel[.panel-name[`str()`]

We saw `str()` when we first introduced data frames. It's worth mentioning it again because it can be so useful when you import data to see how your variables were read in (i.e. their types)

```r
str(ps_data)
```

```
## 'data.frame': 588 obs. of 5 variables:
##  $ subid    : chr "M22" "M22" "M22" "M22" ...
##  $ item     : chr "faces" "houses" "pasta" "beds" ...
##  $ correct  : int 1 1 0 0 0 0 1 1 0 0 ...
##  $ age      : num 2 2 2 2 2.13 2.13 2.13 2.13 2.32 2.32 ...
##  $ condition: chr "Label" "Label" "Label" "Label" ...
```
]

.panel[.panel-name[`glimpse()`]

`glimpse()` is very similar to `str()` but is a tidyverse function, and it shows you a little more raw data

```r
glimpse(ps_data)
```

```
## Rows: 588
## Columns: 5
## $ subid     <chr> "M22", "M22", "M22", "M22", "T22", "T22", "T22", "T22", "T1…
## $ item      <chr> "faces", "houses", "pasta", "beds", "beds", "faces", "house…
## $ correct   <int> 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1,…
## $ age       <dbl> 2.00, 2.00, 2.00, 2.00, 2.13, 2.13, 2.13, 2.13, 2.32, 2.32,…
## $ condition <chr> "Label", "Label", "Label", "Label", "Label", "Label", "Labe…
```
]
]

---
class: inverse

# Your turn 3
02:00
1. Take a look at `another_df`, which should be saved in your Global Environment. Click the "View" button in the Environment pane, and also use `View()` in your Console.

1. Now look at some summary information about `another_df` using `str()` and `glimpse()`. *Hint*: You will need to load the tidyverse package first in order to use `glimpse()`.

1. Lastly, find the number of rows and columns in `another_df` using `nrow()` and `ncol()`, respectively. Make sure your answers match the summary information given to you above.

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
View(another_df)
```
]

.panel[.panel-name[Q2]

```r
library(tidyverse)
str(another_df)
```

```
## 'data.frame': 32 obs. of 4 variables:
##  $ subid  : chr "A001" "A001" "A001" "A001" ...
##  $ stimuli: chr "A" "B" "C" "D" ...
##  $ correct: int 0 0 1 0 1 1 1 1 0 0 ...
##  $ age    : num 2.5 2.5 2.5 2.5 2.75 2.75 2.75 2.75 3.6 3.6 ...
```

```r
glimpse(another_df)
```

```
## Rows: 32
## Columns: 4
## $ subid   <chr> "A001", "A001", "A001", "A001", "B002", "B002", "B002", "B002…
## $ stimuli <chr> "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", "…
## $ correct <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1…
## $ age     <dbl> 2.50, 2.50, 2.50, 2.50, 2.75, 2.75, 2.75, 2.75, 3.60, 3.60, 3…
```
]

.panel[.panel-name[Q3]

```r
nrow(another_df)
```

```
## [1] 32
```

```r
ncol(another_df)
```

```
## [1] 4
```
]
]

---
class: inverse, center, middle

# Q & A
05:00
---
class: inverse, center, middle

# Next up...

## Data visualization with ggplot2

---
class: inverse, center, middle

# Break!
10:00