Commit 6ac72f1

Merge pull request #15 from jhudsl/after_deletion
After switching to just BMF
2 parents 90c107b + f2364b6 commit 6ac72f1

6 files changed: +77970 -17 lines changed
05-filtering_data.Rmd

Lines changed: 9 additions & 17 deletions
@@ -1,7 +1,7 @@
 
 # Filtering Orgs
 
-First reading in our data
+First reading in our data:
 
 ```{r}
 library(tidyverse)
@@ -79,37 +79,29 @@ After filtering out these orgs we end up with `r nrow(org_data)` rows, removing
 
 Filing is due May 15 each year, since the data is from March 2024, that means it only shows 2023 complete data.
 
-Orgs only need to file every 3 years as well, but this would be a sliding window for orgs at different times as far as I understand it, but we could require that the last filing time be within the last 3 years.
-
-Also might want to remove orgs that didn't file in the last 3 years.
-
-For now I will filter for orgs that filed after May 15, 2023.
+Orgs only need to file every 3 years as well, but this would be a sliding window for orgs at different times as, but we could require that the last filing time be within the last 3 complete years.
 
 According to the [dictionary](https://nccsdata.s3.amazonaws.com/harmonized/harmonized_data_dictionary.xlsx):
 
 - ORG_FISCAL_YEAR = Fiscal Year of the BMF from which this record was pulled
 - ORG_YEAR_LAST = Most recent year EIN was recorded in BMF
 
-```{r}
-year_2024 <- org_data %>% filter(ORG_YEAR_LAST == 2024)
-
-org_data <- org_data %>% filter(ORG_YEAR_LAST != 2024)
-```
-
-After filtering out orgs that filed between May 15, 2023 and March 2024, we have `r nrow(org_data)` rows, removing `r nrow(year_2024)` orgs with that filed at that time.
 
 
-If we assume a sliding window and people would need to have filed in the last 3 years, than we can include orgs that filed in 2021, 2022, and 2023.
+If we assume a sliding window and people would need to have filed in the last 3 complete years, than we can include orgs that filed in 2021, 2022, and 2023 (as well as the incomplete 2024).
 
 ```{r}
-active_orgs <- org_data %>% filter(ORG_YEAR_LAST %in% c(2023, 2022, 2021))
-
+active_orgs <- org_data %>% filter(ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021))
+non_active_orgs <- org_data %>% filter(! ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021))
 ```
 
-After filtering out orgs that last filed in the last 3 years, we end up with `r nrow(active_orgs)`.
+After filtering out orgs that filed between May 15, 2023 and March 2024, we have `r nrow(org_data)` rows, removing `r nrow(non_active_orgs)` orgs with that filed at that time.
+
+After filtering out orgs that last filed in the last 3 complete years, we end up with `r nrow(active_orgs)`.
 
 
 ```{r}
 write_rds(org_data, file = "data/processed/org_data_filtered.rds")
 write_rds(active_orgs, file = "data/processed/active_orgs.rds")
+write_csv(non_active_orgs, file = "data/processed/non_active_orgs.csv")
 ```

data/processed/active_orgs.rds

24.9 MB
Binary file not shown.