<!--
Update the file paths as needed throughout
-->
<!--
Some case studies may wish to merge Data Exploration with this Wrangling subsection
-->
<!--
At the end of this section, save incremental data so that users can pause and come back later
-->
# **Data Wrangling**
***
Data wrangling includes cleaning data to fix mistakes and reorganizing data to make it easier to use in visualizations and analyses.
[Data import is in file "02_wrangle_data.R"; decide how much to walk the learner through in the case study.]
`02_wrangle_data.R` contains the following code:
```{r, eval = FALSE}
library(dplyr)
library(stringr)

load("processed_data/sample_metadata.rda")

# clean data
clean_metadata <- sample_metadata %>%
  filter(!is.na(SampleID)) %>% # remove all rows that are missing a sample ID
  mutate(Study = # clean study name to include only the last name of the first author
           str_remove(Study, "\\_.*") %>% # remove _year from study name
           str_remove_all("(?<=.)[A-Z]")) %>% # remove first-name initial from study name
  rename(sampling_paper = `Sampling from papers`, # remove spaces from variable names
         sampling_day = `Sampling, day`) %>%
  mutate(sampling_day = as.numeric(sampling_day)) # make sampling_day a numeric variable

# in the Yassour study, sample IDs are not unique to samples
clean_metadata %>%
  filter(Study == "Yassour") %>%
  select(Study, Category, SampleID, sampling_day) %>%
  head()

# create new unique sample IDs for the Yassour study
clean_metadata <- clean_metadata %>%
  mutate(SampleID = ifelse(Study == "Yassour", # check if study is Yassour
                           paste0(SampleID, sampling_day), # if so, combine sample ID and sampling day
                           SampleID)) # if not, leave sample ID as is

# confirm that we've fixed this problem
clean_metadata %>%
  filter(Study == "Yassour") %>%
  select(Study, Category, SampleID, sampling_day) %>%
  head()

# two studies (Parnanen and Yassour) took samples from mothers during pregnancy
# for these samples, `sampling_day` represents sampling day since start of pregnancy,
# unlike all other samples, for which `sampling_day` represents sampling day since birth
ids_before_birth <- clean_metadata %>%
  filter(Category == "mother",
         (Study == "Parnanen" & sampling_paper == "32WK") |
           (Study == "Yassour" & sampling_paper == "Gest")) %>%
  pull(SampleID) # get sample IDs for these samples

# flag samples taken before birth (%in% already returns TRUE/FALSE)
clean_metadata <- clean_metadata %>%
  mutate(sampling_before_birth = SampleID %in% ids_before_birth)

# save cleaned data
save(clean_metadata, file = "processed_data/clean_sample_metadata.rda")
```
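To see what the study-name and sample-ID steps do without loading the full dataset, the sketch below applies the same logic to a small toy data frame. The rows, the name `toy_metadata`, and the ID values are made up for illustration and are not taken from `sample_metadata`; the duplicate count at the end is one possible way to check uniqueness across all rows rather than only the first few shown by `head()`.

```{r, eval = FALSE}
library(dplyr)
library(stringr)

# hypothetical toy rows (assumption: real study names look like "LastnameInitial_year")
toy_metadata <- data.frame(
  Study = c("YassourM_2016", "YassourM_2016", "ParnanenK_2018"),
  SampleID = c("M01", "M01", "P07"),
  sampling_day = c(0, 30, 7)
)

toy_clean <- toy_metadata %>%
  mutate(
    Study = Study %>%
      str_remove("_.*") %>% # drop the underscore and everything after it
      str_remove_all("(?<=.)[A-Z]"), # drop capital letters that are not the first character
    SampleID = ifelse(Study == "Yassour",
                      paste0(SampleID, sampling_day), # append sampling day to make IDs unique
                      SampleID)
  )

# list any sample IDs that still appear more than once
duplicated_ids <- toy_clean %>%
  count(SampleID) %>%
  filter(n > 1)

toy_clean
nrow(duplicated_ids) # 0 means every sample ID is unique
```

Because `paste0()` joins the ID and the day with nothing in between, two different ID and day pairs could in principle produce the same string (for example, ID "M1" on day 12 and ID "M11" on day 2). If that is a concern for a given dataset, adding a separator such as `"_"` is one option.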
***