|
1 | 1 |
|
2 | 2 | # Filtering Orgs |
3 | 3 |
|
4 | | -First reading in our data |
| 4 | +First reading in our data: |
5 | 5 |
|
6 | 6 | ```{r} |
7 | 7 | library(tidyverse) |
@@ -79,37 +79,29 @@ After filtering out these orgs we end up with `r nrow(org_data)` rows, removing |
79 | 79 |
|
80 | 80 | Filing is due May 15 each year. Since the data is from March 2024, the most recent complete filing year it shows is 2023. |
81 | 81 |
|
82 | | -Orgs only need to file every 3 years as well, but this would be a sliding window for orgs at different times as far as I understand it, but we could require that the last filing time be within the last 3 years. |
83 | | - |
84 | | -Also might want to remove orgs that didn't file in the last 3 years. |
85 | | - |
86 | | -For now I will filter for orgs that filed after May 15, 2023. |
| 82 | +Orgs also only need to file every 3 years. This would be a sliding window that falls at different times for different orgs, but we could require that the last filing be within the last 3 complete years. |
87 | 83 |
|
88 | 84 | According to the [dictionary](https://nccsdata.s3.amazonaws.com/harmonized/harmonized_data_dictionary.xlsx): |
89 | 85 |
|
90 | 86 | - ORG_FISCAL_YEAR = Fiscal Year of the BMF from which this record was pulled |
91 | 87 | - ORG_YEAR_LAST = Most recent year EIN was recorded in BMF |
92 | 88 |
|
93 | | -```{r} |
94 | | -year_2024 <- org_data %>% filter(ORG_YEAR_LAST == 2024) |
95 | | -
|
96 | | -org_data <- org_data %>% filter(ORG_YEAR_LAST != 2024) |
97 | | -``` |
98 | | - |
99 | | -After filtering out orgs that filed between May 15, 2023 and March 2024, we have `r nrow(org_data)` rows, removing `r nrow(year_2024)` orgs with that filed at that time. |
100 | 89 |
|
101 | 90 |
|
102 | | -If we assume a sliding window and people would need to have filed in the last 3 years, than we can include orgs that filed in 2021, 2022, and 2023. |
| 91 | +If we assume a sliding window in which orgs need to have filed within the last 3 complete years, then we can include orgs that filed in 2021, 2022, and 2023 (as well as the incomplete 2024). |
103 | 92 |
|
104 | 93 | ```{r} |
105 | | -active_orgs <- org_data %>% filter(ORG_YEAR_LAST %in% c(2023, 2022, 2021)) |
106 | | -
|
| 94 | +active_orgs <- org_data %>% filter(ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021)) |
| 95 | +non_active_orgs <- org_data %>% filter(! ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021)) |
107 | 96 | ``` |
108 | 97 |
|
109 | | -After filtering out orgs that last filed in the last 3 years, we end up with `r nrow(active_orgs)`. |
| 98 | +Of the `r nrow(org_data)` rows in `org_data`, `r nrow(non_active_orgs)` have an `ORG_YEAR_LAST` outside 2021 to 2024 and are set aside as non-active orgs. |
| 99 | + |
| 100 | +After keeping only orgs that last filed in the last 3 complete years (or in the incomplete 2024), we end up with `r nrow(active_orgs)` active orgs. |
110 | 101 |
|
111 | 102 |
|
112 | 103 | ```{r} |
113 | 104 | write_rds(org_data, file = "data/processed/org_data_filtered.rds") |
114 | 105 | write_rds(active_orgs, file = "data/processed/active_orgs.rds") |
| 106 | +write_csv(non_active_orgs, file = "data/processed/non_active_orgs.csv") |
115 | 107 | ``` |
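
The sliding-window split in this change could also be written so the year range is derived rather than hard-coded. This is only a minimal sketch under stated assumptions: `toy_orgs`, `snapshot_year`, `last_complete_year`, and `active_years` are hypothetical names that do not appear in the original analysis, and the toy data stands in for `org_data`.

```r
library(dplyr)

# Hypothetical toy data standing in for org_data; only ORG_YEAR_LAST matters here.
toy_orgs <- tibble(
  EIN = c("A", "B", "C", "D", "E"),
  ORG_YEAR_LAST = c(2024, 2023, 2021, 2020, NA)
)

# Assumption: the data snapshot is from 2024, so 2023 is the last complete year.
snapshot_year <- 2024
last_complete_year <- snapshot_year - 1

# Last 3 complete years plus the incomplete snapshot year: 2021 to 2024.
active_years <- (last_complete_year - 2):snapshot_year

# Orgs whose most recent year falls inside the window.
toy_active <- toy_orgs %>% filter(ORG_YEAR_LAST %in% active_years)

# %in% returns FALSE for NA, so orgs with a missing year land in the non-active set.
toy_non_active <- toy_orgs %>% filter(!ORG_YEAR_LAST %in% active_years)

# Every row ends up in exactly one of the two sets.
stopifnot(nrow(toy_active) + nrow(toy_non_active) == nrow(toy_orgs))
```

Because `%in%` never returns `NA`, the two filters partition the rows, which is what lets the active and non-active counts add up to the total row count.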