Assignment 5

Please submit both the .Rmd and the .html file on Canvas.

Load your libraries here

Problem 0

All of the data needed for this assignment is contained in a .Rdata file at: https://github.com/adamkucz/psych548/blob/main/data/HW5.Rdata?raw=true

Download the .Rdata file and use the load() function to read these objects into your global environment.
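A minimal sketch of the download-then-load pattern, shown on a throwaway file so that it runs offline. The objects toy_a and toy_b and the temporary path are assumptions made up for illustration; for the assignment, a download.file() call like the commented one would fetch the real file first.

```r
# Create a throwaway .Rdata file so the example runs without a network connection
toy_a <- data.frame(pid = 1:2, day = c(1, 1))
toy_b <- c(x = 1, y = 2)
rdata_path <- tempfile(fileext = ".Rdata")
save(toy_a, toy_b, file = rdata_path)
rm(toy_a, toy_b)

# For the assignment, something like this would come first
# (mode = "wb" keeps the binary file intact on Windows):
# download.file(
#   "https://github.com/adamkucz/psych548/blob/main/data/HW5.Rdata?raw=true",
#   destfile = rdata_path, mode = "wb"
# )

# load() restores the saved objects under their original names
# and returns those names (invisibly) as a character vector
loaded_names <- load(rdata_path)
print(loaded_names)
```

Capturing the return value of load() is a quick way to see which objects the file contained.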

Problem 1

Easier Option

The d_dep_long dataset contains a subset of data from a daily diary study with a community sample of adults from King County, WA. Participants completed a short survey at the end of each day for up to 75 days. There are 27907 rows and 3 columns in the data. The 3 variables are:

pid: A unique identification number for each participant
day: The day of the study for each participant
depressedmood: Scores 0-10 (integer values) on the question “I felt down, depressed, or hopeless today.”

Part A: Long to Wide

Using pivot_wider() from the tidyr package, turn long form d_dep_long into wide form d_dep_w such that each row contains one participant’s entire data. There should be 514 rows and 76 columns (1 column for pid and 75 columns, one for each day’s observation of depressedmood). Name your columns day_{daynumber} where {daynumber} is 1 through 75. Your final data should be identical to d_dep_wide.

# Check if they are identical (including metadata)
# Returns TRUE if they are identical, FALSE if not
identical(d_dep_w, d_dep_wide)

## [1] TRUE
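As an illustration of the general pattern (not the full solution), here is a minimal sketch on a made-up three-day dataset. The object names toy_long and toy_wide and all values are hypothetical.

```r
library(tidyr)

# Toy long-form data (hypothetical values): participant 2 skipped day 2
toy_long <- data.frame(
  pid = c(1, 1, 1, 2, 2),
  day = c(1, 2, 3, 1, 3),
  depressedmood = c(4, 0, 2, 7, 5)
)

# One row per pid; each distinct value of day becomes a column named day_<day>
toy_wide <- pivot_wider(
  toy_long,
  names_from = day,
  values_from = depressedmood,
  names_prefix = "day_"
)

print(toy_wide)
```

Days a participant skipped show up as NA in the wide data. Keep in mind that identical() compares column order, classes, and attributes as well as values.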

Part B: Wide to Long

Using pivot_longer() from the tidyr package, turn wide form d_dep_wide into long form d_dep_l such that each row corresponds with 1 day from one participant. The number following day_ in each column name corresponds with the day number of the study. There should be 27907 rows and 3 columns (pid, day, depressedmood). Your final data should be identical to d_dep_long. Hint: Make sure to remove the rows where depressedmood is NA at the end.

# Check if they are identical (excluding metadata differences)
# Make sure datasets are arranged the same (ascending by pid and day)
d_dep_l <- d_dep_l[order(d_dep_l$pid, d_dep_l$day), ]
d_dep_long <- d_dep_long[order(d_dep_long$pid, d_dep_long$day), ]

# Returns TRUE if they are the same, FALSE if there are differences
all(d_dep_l == d_dep_long)

## [1] TRUE
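The reverse direction can be sketched the same way on made-up data. The object names and values are hypothetical, and dropping NA rows inside pivot_longer() via values_drop_na is only one way to do it; filtering afterwards works too.

```r
library(tidyr)

# Toy wide-form data (hypothetical values): participant 2 has no day 2 entry
toy_wide <- data.frame(
  pid = c(1, 2),
  day_1 = c(4, 7),
  day_2 = c(0, NA),
  day_3 = c(2, 5)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("day_"),
  names_to = "day",
  names_prefix = "day_",                     # strip "day_" from the column names
  names_transform = list(day = as.integer),  # "1" -> 1L, so day sorts numerically
  values_to = "depressedmood",
  values_drop_na = TRUE                      # drop days with no observation
)

print(toy_long)
```

Without names_transform, day would be a character column, so ordering by day would put "10" before "2".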

Harder Option

You will have to consult the help pages for pivot_longer() and pivot_wider() to complete this problem (i.e., additional arguments to those discussed in lecture are needed).

The d_depanx_long dataset contains a subset of data from a daily diary study with a community sample of adults from King County, WA. Participants completed a short survey at the end of each day for up to 75 days. There are 27567 rows and 6 columns in the data. The 6 variables are:

pid: A unique identification number for each participant
day: The day of the study for each participant
depressedmood: Scores 0-10 (integer values) on the question “I felt down, depressed, or hopeless today.”
anhedonia: Scores 0-10 (integer values) on the question “I had little interest or pleasure in doing things today.”
anxietycovid: Scores 0-10 (integer values) on the question “I felt anxious or worried about getting COVID-19 today.”
anxietygeneral: Scores 0-10 (integer values) on the question “I felt a general sense of anxiety today.”

Part A: Long to Wide

Using pivot_wider() from the tidyr package, turn long form d_depanx_long into wide form d_depanx_w such that each row contains one participant’s entire data. There should be 511 rows and 301 columns (1 column for pid and 75 columns each for depressedmood, anhedonia, anxietycovid, and anxietygeneral). Name your columns {variable}_{daynumber} where {variable} is one of the four variables in d_depanx_long and {daynumber} is 1 through 75. Your final data should be identical to d_depanx_wide.

# Check if they are identical (including metadata)
# Returns TRUE if they are identical, FALSE if not
identical(d_depanx_w, d_depanx_wide)

## [1] TRUE
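To illustrate widening several value columns at once, here is a minimal sketch on made-up data with two of the four variables and two days. Object names and values are hypothetical, and this is one possible approach rather than the required one.

```r
library(tidyr)

# Toy long-form data (hypothetical values)
toy_long <- data.frame(
  pid = c(1, 1, 2, 2),
  day = c(1, 2, 1, 2),
  depressedmood = c(4, 0, 7, 5),
  anhedonia = c(3, 1, 6, 6)
)

# Passing several columns to values_from yields names of the form
# <variable><names_sep><day>, e.g. depressedmood_1
toy_wide <- pivot_wider(
  toy_long,
  names_from = day,
  values_from = c(depressedmood, anhedonia),
  names_sep = "_"
)

print(names(toy_wide))
```

See ?pivot_wider for names_glue and names_vary if you need a different naming pattern or column order.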

Part B: Wide to Long

# Check if they are identical (excluding metadata differences)
# Make sure datasets are arranged the same (ascending by pid and day)
d_depanx_l <- d_depanx_l[order(d_depanx_l$pid, d_depanx_l$day), ]
d_depanx_long <- d_depanx_long[order(d_depanx_long$pid, d_depanx_long$day), ]

# Returns TRUE if they are the same, FALSE if there are differences
all(d_depanx_l == d_depanx_long)

## [1] TRUE
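For the multi-variable wide-to-long direction, the special ".value" entry in names_to is worth knowing about. The sketch below uses made-up data with hypothetical object names; it assumes the variable names themselves contain no underscore, so splitting the column names on "_" is unambiguous.

```r
library(tidyr)

# Toy wide-form data (hypothetical values)
toy_wide <- data.frame(
  pid = c(1, 2),
  depressedmood_1 = c(4, 7),
  depressedmood_2 = c(0, 5),
  anhedonia_1 = c(3, 6),
  anhedonia_2 = c(1, 6)
)

# ".value" means: the part of the name before "_" becomes the name of an
# output column, and the part after "_" goes into the day column
toy_long <- pivot_longer(
  toy_wide,
  cols = -pid,
  names_to = c(".value", "day"),
  names_sep = "_",
  names_transform = list(day = as.integer)
)

print(toy_long)
```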

Problem 2

Easier Option

Using the date column from the d dataset, create a new variable called yearDay that represents the nth day of the year. For example, an observation that is recorded on 1/3/2020 should have yearDay equal to 3 and an observation recorded on 2/14/2020 should have yearDay equal to 45. Take note that some dates are missing! Hint: what is the class of date (class(d$date))?
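A minimal base R sketch of the idea, assuming the dates are stored as character strings such as "03Jan2020" (check this against your own data). The vector date_chr is made up for illustration; lubridate's dmy() and yday() are an alternative route to the same result.

```r
# Toy character dates (hypothetical), including one missing value
date_chr <- c("03Jan2020", "14Feb2020", NA)

# Parse day / abbreviated month / four-digit year into Date objects.
# Note: %b matches month abbreviations in the current locale (English assumed here).
date_parsed <- as.Date(date_chr, format = "%d%b%Y")

# %j is the day of the year as a zero-padded string ("003"), so convert to integer
yearDay <- as.integer(format(date_parsed, "%j"))

print(yearDay)
```

Missing dates stay NA through each step, so no special handling is needed in this version.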

Harder Option

The day variable in these data actually represents the number of days (starting at 1) from when the participant received their first survey alert. Because some participants did not complete their first daily survey, their first day value starts after 1. However, we might want a value that represents days since they actually started.

Using the date column from d, create a new variable called dayNew that represents the nth day in the study for each participant after their first observation. For example, if a participant started the study on 1/1/2020, dayNew would be 3 on 1/3/2020. If a participant started the study on 1/2/2020, dayNew would be 2 on 1/3/2020. Take note that there is missing data within!
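One way to think about this is "date minus each participant's first date, plus 1." The sketch below does that in base R with ave() on made-up data; the data frame toy is hypothetical, and a dplyr group_by() plus mutate() approach would work equally well.

```r
# Toy data (hypothetical): participant 1 starts on Jan 1, participant 2 on Jan 2
toy <- data.frame(
  pid  = c(1, 1, 1, 2, 2),
  date = as.Date(c("2020-01-01", "2020-01-03", NA, "2020-01-02", "2020-01-03"))
)

# Work with the numeric day counts underlying the Date class
date_num <- as.numeric(toy$date)

# Earliest non-missing date within each participant, repeated on every row
first_day <- ave(date_num, toy$pid, FUN = function(z) min(z, na.rm = TRUE))

# Days since the first observation, counting the first observation as day 1
toy$dayNew <- as.integer(date_num - first_day + 1)

print(toy)
```

Rows with a missing date get NA for dayNew. A participant whose dates are all missing would need separate handling, because min() of an empty set returns Inf with a warning.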

Problem 3

The date column in d is currently formatted as day, month abbreviation, and year (e.g., 19Mar2020). Suppose you want your dates instead to be in the format “mm/dd/yy” (e.g., 03/19/20). Use a combination of format() and as.Date()/as_date() to complete this task (see slides) and consult ?strptime to find the right conversion specifications.
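A minimal sketch of the parse-then-format pattern on a made-up vector (date_chr is hypothetical), assuming the dates are stored as character strings:

```r
# Toy character dates (hypothetical), including one missing value
date_chr <- c("19Mar2020", "01Apr2020", NA)

# Step 1: parse the text into Date objects (%d day, %b month abbreviation, %Y year).
# Note: %b matches month abbreviations in the current locale (English assumed here).
date_parsed <- as.Date(date_chr, format = "%d%b%Y")

# Step 2: write the dates back out as text in mm/dd/yy form
date_new <- format(date_parsed, "%m/%d/%y")

print(date_new)
```

The result of format() is a character vector again, not a Date, so do any date arithmetic before reformatting.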

Now, create a new variable called weekend in d which takes on one of two values:

0 = weekday
1 = weekend

Use the lubridate functions discussed in lecture, as well as functions for creating new variables discussed in previous weeks, to accomplish this task.
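To show the logic only, here is a base R sketch that swaps in format() with "%u" in place of the lubridate functions from lecture; the dates vector is made up. With lubridate you would use wday() instead, keeping in mind that by default it numbers Sunday as 1.

```r
# Toy dates (hypothetical): Thursday, Saturday, Sunday, Monday, and a missing value
dates <- as.Date(c("2020-03-19", "2020-03-21", "2020-03-22", "2020-03-23", NA))

# %u gives the weekday as 1-7 with Monday = 1, so 6 and 7 are Saturday and Sunday
weekday_num <- as.integer(format(dates, "%u"))

# 1 = weekend, 0 = weekday; a missing date stays NA
weekend <- ifelse(weekday_num >= 6, 1, 0)

print(weekend)
```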

Problem 4

Suppose you are a newly minted clinician in independent practice with a caseload of 10 patients. You realize you need to keep track of your patients’ diagnoses for billing purposes, so you ask your clinic manager to collect your patients’ diagnoses into one file for you. You get the following data:

P1	P2	P3	P4	P5	P6	P7	P8	P9	P10
PTSD	OCD; SAD	MDD; PDD; OCPD	PTSD; MDD	PD; SAD	GAD; MDD	BPD; MDD	PTSD; MDD	GAD	BPD

These values correspond with the following DSM-5 disorders:

PTSD: Post-traumatic Stress Disorder
OCD: Obsessive-Compulsive Disorder
MDD: Major Depressive Disorder
GAD: Generalized Anxiety Disorder
BPD: Borderline Personality Disorder
SAD: Social Anxiety Disorder
PD: Panic Disorder
PDD: Persistent Depressive Disorder
OCPD: Obsessive-Compulsive Personality Disorder

First, shift these data to be in long form. They should look like this:

## # A tibble: 10 x 2
##    Patient Diagnoses     
##    <chr>   <chr>         
##  1 P1      PTSD          
##  2 P2      OCD; SAD      
##  3 P3      MDD; PDD; OCPD
##  4 P4      PTSD; MDD     
##  5 P5      PD; SAD       
##  6 P6      GAD; MDD      
##  7 P7      BPD; MDD      
##  8 P8      PTSD; MDD     
##  9 P9      GAD           
## 10 P10     BPD
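A minimal sketch of this reshaping step on the first three patients only. The object names dx_wide and dx_long are hypothetical, and how you enter the data in the first place is up to you.

```r
library(tidyr)

# One-row, wide-form data: one column per patient (first three patients only)
dx_wide <- data.frame(
  P1 = "PTSD",
  P2 = "OCD; SAD",
  P3 = "MDD; PDD; OCPD"
)

# Every column is pivoted: column names go to Patient, cell values to Diagnoses
dx_long <- pivot_longer(
  dx_wide,
  cols = everything(),
  names_to = "Patient",
  values_to = "Diagnoses"
)

print(dx_long)
```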

Second, change each of these acronyms into their full DSM-5 disorder name. Use regular expressions to do this and keep the values in one column. Hint: Do not try to do this in one regular expression. This is a situation where you will either repeat code with values changed, or write your own function to simplify this. Your code should work no matter what order you run your code in (e.g., whether you replace PD first or OCPD first)! Use the power of regular expressions to replace exactly what you want to replace.

Some of these acronyms share similarities (e.g., PD, PDD, OCPD), so you can’t just use a fixed value (e.g., PD) to change them because then “OCPD”, for example, would become “OCPanic Disorder.” You want to target certain letters (in a specific arrangement) that do not follow something (for you to decide) and that also are not followed by something (also for you to decide). Need more direction? Message me on Slack.
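As a hedged illustration of the "not preceded by / not followed by" idea, the sketch below uses Perl-style lookarounds on just the three overlapping acronyms. The helper expand_acronyms() and the lookup vector are hypothetical names invented here, and "an uppercase letter" is only one possible choice for the "something"; word boundaries (\\b) are another.

```r
# Toy diagnoses containing the overlapping acronyms PD, PDD, and OCPD
dx <- c("PD; SAD", "MDD; PDD; OCPD")

# Named vector: acronym -> full name (only three of the nine, for illustration)
lookup <- c(
  PD   = "Panic Disorder",
  PDD  = "Persistent Depressive Disorder",
  OCPD = "Obsessive-Compulsive Personality Disorder"
)

# Hypothetical helper: replace each acronym only when it is neither preceded
# nor followed by another uppercase letter
expand_acronyms <- function(x, lookup) {
  for (abbr in names(lookup)) {
    pattern <- paste0("(?<![A-Z])", abbr, "(?![A-Z])")
    x <- gsub(pattern, lookup[[abbr]], x, perl = TRUE)  # lookarounds need perl = TRUE
  }
  x
}

forward  <- expand_acronyms(dx, lookup)
backward <- expand_acronyms(dx, rev(lookup))  # same replacements, reverse order

print(forward)
```

Running the replacements in reverse order is a quick check that the result does not depend on which acronym is handled first.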

Your data should look like this after you are finished:

## # A tibble: 10 x 2
##    Patient Diagnoses                                                            
##    <chr>   <chr>                                                                
##  1 P1      Post-traumatic Stress Disorder                                       
##  2 P2      Obsessive-Compulsive Disorder; Social Anxiety Disorder               
##  3 P3      Major Depressive Disorder; Persistent Depressive Disorder; Obsessive…
##  4 P4      Post-traumatic Stress Disorder; Major Depressive Disorder            
##  5 P5      Panic Disorder; Social Anxiety Disorder                              
##  6 P6      Generalized Anxiety Disorder; Major Depressive Disorder              
##  7 P7      Borderline Personality Disorder; Major Depressive Disorder           
##  8 P8      Post-traumatic Stress Disorder; Major Depressive Disorder            
##  9 P9      Generalized Anxiety Disorder                                         
## 10 P10     Borderline Personality Disorder

Third, choose one of these two options.

Easier Option

Use separate() to get 1 column for each diagnosis. Because the maximum number of diagnoses given to one individual patient is 3, you should have three separate columns. Name them: Dx1, Dx2, and Dx3. If a patient does not have a second or third diagnosis, their value should be NA. Your data should look like this:

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 9 rows [1, 2, 4,
## 5, 6, 7, 8, 9, 10].

## # A tibble: 10 x 4
##    Patient Dx1                   Dx2                  Dx3                       
##    <chr>   <chr>                 <chr>                <chr>                     
##  1 P1      Post-traumatic Stres… <NA>                 <NA>                      
##  2 P2      Obsessive-Compulsive… Social Anxiety Diso… <NA>                      
##  3 P3      Major Depressive Dis… Persistent Depressi… Obsessive-Compulsive Pers…
##  4 P4      Post-traumatic Stres… Major Depressive Di… <NA>                      
##  5 P5      Panic Disorder        Social Anxiety Diso… <NA>                      
##  6 P6      Generalized Anxiety … Major Depressive Di… <NA>                      
##  7 P7      Borderline Personali… Major Depressive Di… <NA>                      
##  8 P8      Post-traumatic Stres… Major Depressive Di… <NA>                      
##  9 P9      Generalized Anxiety … <NA>                 <NA>                      
## 10 P10     Borderline Personali… <NA>                 <NA>
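A minimal sketch of separate() on two of the patients; the data frame toy is made up for illustration. The fill = "right" argument is an optional extra that was evidently not used to produce the output above: it pads missing pieces with NA without emitting the warning.

```r
library(tidyr)

# Two patients (toy subset): one diagnosis vs. three diagnoses
toy <- data.frame(
  Patient = c("P1", "P3"),
  Diagnoses = c(
    "Post-traumatic Stress Disorder",
    "Major Depressive Disorder; Persistent Depressive Disorder; Obsessive-Compulsive Personality Disorder"
  )
)

# Split on "; " into three columns; short rows are padded with NA on the right
toy_sep <- separate(
  toy,
  Diagnoses,
  into = c("Dx1", "Dx2", "Dx3"),
  sep = "; ",
  fill = "right"
)

print(toy_sep)
```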

Harder Option

Make 9 new columns in your data, one for each diagnosis (use the acronyms as column names). The value for each column should be 1 if the patient has the diagnosis, otherwise the value should be 0. This is the format you would need if you wanted to use these data in any sort of analysis.

You will want to use ifelse() to construct the columns. Use several separate calls to ifelse() rather than the nested structure we have seen before.

Inside ifelse() you need a vector of TRUEs and FALSEs specifying whether each disorder is in the patient’s list of diagnoses. Use grepl() to get this vector of TRUEs and FALSEs.

Your data should look like this:

## # A tibble: 10 x 11
##    Patient Diagnoses        PTSD   OCD   MDD   GAD   BPD   SAD    PD   PDD  OCPD
##    <chr>   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 P1      Post-traumatic…     1     0     0     0     0     0     0     0     0
##  2 P2      Obsessive-Comp…     0     1     0     0     0     1     0     0     0
##  3 P3      Major Depressi…     0     0     1     0     0     0     0     1     1
##  4 P4      Post-traumatic…     1     0     1     0     0     0     0     0     0
##  5 P5      Panic Disorder…     0     0     0     0     0     1     1     0     0
##  6 P6      Generalized An…     0     0     1     1     0     0     0     0     0
##  7 P7      Borderline Per…     0     0     1     0     1     0     0     0     0
##  8 P8      Post-traumatic…     1     0     1     0     0     0     0     0     0
##  9 P9      Generalized An…     0     0     0     1     0     0     0     0     0
## 10 P10     Borderline Per…     0     0     0     0     1     0     0     0     0
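A minimal sketch of the grepl() plus ifelse() pattern for three of the nine columns, on a made-up subset of patients (the data frame toy is hypothetical). It assumes Diagnoses already holds the full disorder names, so a fixed-string match on the full name is unambiguous.

```r
# Toy subset of patients with full disorder names
toy <- data.frame(
  Patient = c("P3", "P5", "P9"),
  Diagnoses = c(
    "Major Depressive Disorder; Persistent Depressive Disorder; Obsessive-Compulsive Personality Disorder",
    "Panic Disorder; Social Anxiety Disorder",
    "Generalized Anxiety Disorder"
  )
)

# grepl() returns TRUE/FALSE per row; ifelse() recodes that to 1/0.
# fixed = TRUE treats the pattern as plain text rather than a regular expression.
toy$MDD  <- ifelse(grepl("Major Depressive Disorder", toy$Diagnoses, fixed = TRUE), 1, 0)
toy$PD   <- ifelse(grepl("Panic Disorder", toy$Diagnoses, fixed = TRUE), 1, 0)
toy$OCPD <- ifelse(
  grepl("Obsessive-Compulsive Personality Disorder", toy$Diagnoses, fixed = TRUE), 1, 0
)

print(toy[, c("Patient", "MDD", "PD", "OCPD")])
```

If you match on acronyms instead of full names, the overlap problem from the second step comes back, so the pattern needs the same care.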
