---
title: "Starting_raw_data_only"
author: "Carrie"
date: "2024-11-13"
output: html_document
---
```{r}
library(readxl)
library(here)
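library(tidyverse)
library(lubridate) # ym(), dmy(), mdy(); attached explicitly in case the installed tidyverse does not load it
library(janitor)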
```
# New eo_md
990 or not: PF filing req code.

Filing req code (whether or not an org has to file any forms):

- 01 = 990
- 02 = 990, but less than 25000 income
- 03 = group return (don't know what that means)
- 06 = church
- 07 = gov
- 00 = not required to file

Data source:

- from here: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf - chose Maryland
- direct link: https://www.irs.gov/pub/irs-soi/eo_md.csv
- or maybe new link: https://www.irs.gov/downloads, then irs-soi (https://www.irs.gov/statistics/soi-tax-stats-business-tax-statistics), then https://www.irs.gov/pub/irs-soi/22incdmd.xlsx

Updated date is 12/10/2024:
```{r}
irs_new<- read_csv("New_version_data/eo_md.csv")
```
# new epostcard 990-N
Gives us information about which orgs were active, and helps save some orgs from removal for revocation that would otherwise not be listed.
link: https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads
direct link to data: (last data posting Oct 28, 2024) https://apps.irs.gov/pub/epostcard/data-download-epostcard.zip
dictionary: https://www.irs.gov/pub/irs-tege/990n-data-dictionary.pdf
Columns: EIN, Tax Year, Organization Name, Gross receipts not greater than, Organization has terminated, Tax Period Begin Date, Tax Period End Date, Website URL, Principal Officer’s Name, Principal Officer’s Address Line 1, Principal Officer’s Address Line 2, Principal Officer’s Address City, Principal Officer’s Address Province, Principal Officer’s Address State, Principal Officer’s Address Zip Code, Principal Officer’s Address Country, Organization Mailing Address Line 1, Organization Mailing Address Line 2, Organization Mailing Address City, Organization Mailing Address Province, Organization Mailing Address State, Organization Mailing Address Zip Code, Organization Mailing Address Country, Organization Doing Business as Name 1, Organization Doing Business as Name 2, Organization Doing Business as Name 3
```{r}
# specify character for last column
epost_old <- read_delim("New_version_data/data-download-epostcard.txt",
                        delim = "|", escape_double = FALSE, col_names = FALSE,
                        trim_ws = TRUE, col_types = c("X26" = "c"))
# trying a different way to read in the data, since rows have different numbers of columns
# (fill = TRUE pads short rows; 28 columns leaves room for rows with extra fields)
# note: read.table() by default treats ' and " as quote characters and # as a comment character
epost <- read.table("New_version_data/data-download-epostcard.txt",
                    col.names = paste0("V", seq_len(28)),
                    fill = TRUE, header = FALSE, sep = "|")
```
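
The ragged-row read above can be tried on a few made-up lines. This is a minimal sketch, not the notebook's actual call: the three toy rows are invented, and `quote = ""` and `comment.char = ""` are assumptions added here so that apostrophes and `#` in organization names are read literally.

```r
# Toy pipe-delimited lines (invented): row 2 has an extra field and an apostrophe,
# row 3 contains a '#'
raw <- c("111|ALPHA ORG|MD",
         "222|ST MARY'S FUND|MD|EXTRA",
         "333|GROUP #5|VA")

ragged <- read.table(text = raw, sep = "|", header = FALSE, fill = TRUE,
                     col.names = paste0("V", 1:4),   # room for the widest row
                     quote = "", comment.char = "",  # assumption: read ' and # literally
                     stringsAsFactors = FALSE)

# Short rows are padded with empty strings in the extra column
ragged[ragged$V4 != "", ]
```

Rows with a non-empty extra column are the ones that were longer than the rest, which is the same check the notebook applies later to the `na` and `na_2` columns.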
### check problems (no longer needed)
```{r, include = FALSE}
problem_info<-problems(epost_old) #get problem info, looks like some extra columns
problem_rows <-slice(epost_old, problem_info$row) # get problematic rows from epost
head(epost_old)#compare with nonproblematic rows
problem_info %>% count(col) #mostly column 20 and 25
#
#
# #how many problem_rows are Maryland data
# MaybeMD_probs <-problem_rows %>% filter(if_any(.cols = everything(), ~ grepl("MD", .)))
```
<!-- ```{bash, include = FALSE} -->
<!-- # checking the original version of the file for one of the problematic rows -->
<!-- <!-- awk 'FNR>=9294 && FNR<=9294' New_version_data/data-download-epostcard_2024.txt # first problematic row is actually originally on line 9294 --> -->
<!-- <!-- awk 'FNR>=358951 && FNR<=358951' New_version_data/data-download-epostcard_2024.txt #second problematic row --> -->
<!-- <!-- awk 'FNR>=1 && FNR<=3' New_version_data/data-download-epostcard_2024.txt # checking line 1-3 to ensure that there were some empty lines for the first two lines - yup looks like it! --> -->
<!-- ``` -->
### Set column names:
```{r}
colnames(epost) <- c("EIN", "Tax Year" , "Organization Name", "Gross receipts not greater than", "Organization has terminated", "Tax Period Begin Date" , "Tax Period End Date" , "Website URL" , "Principal Officer’s Name" , "Principal Officer’s Address Line 1", "Principal Officer’s Address Line 2", "Principal Officer’s Address City" , "Principal Officer’s Address Province", "Principal Officer’s Address State" , "Principal Officer’s Address Zip Code" , "Principal Officer’s Address Country" ,"Organization Mailing Address Line 1" , "Organization Mailing Address Line 2" , "Organization Mailing Address City" , "Organization Mailing Address Province" , "Organization Mailing Address State" , "Organization Mailing Address Zip Code", "Organization Mailing Address Country" ,"Organization Doing Business as Name 1", "Organization Doing Business as Name 2", "Organization Doing Business as Name 3")
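epost <- clean_names(epost) # convert names to snake_case (removes spaces)
# note: only 26 names were given for 28 columns, so the last two columns end up as `na` and `na_2` after clean_names()
# checking the rows that have extra columns - only one for MD but will keep it in the dataset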
epost_extra_long <- epost %>% filter(na !="") %>% filter(organization_mailing_address_state == "MD")
epost_extra_long2 <- epost %>% filter(na_2 !="") %>% filter(organization_mailing_address_state == "MD")
```
### How many in epost not in irs_new? filter for md
```{r}
irs_new<- rename_with(irs_new, tolower) #make names lowercase
epost <- epost %>% filter(principal_officer_s_address_state == "MD") #filter for only MD
epost_only <-anti_join(epost,irs_new, by ="ein")
nrow(epost_only) #1567 not in irs_new
nrow(irs_new)
nrow(epost)
library(DT)
datatable(epost_only)
```
## Combine IRS data
```{r}
irs_epost <-full_join(epost, irs_new, by = "ein")
IRS <-left_join(irs_new, epost, by = "ein") # keeps all rows of IRS and adds info from epost where possible as new columns
nrow(IRS) == nrow(irs_new)# test if dimensions are still the same
```
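
The joins above match on `ein`, so the key needs the same type and format in both tables. Below is a minimal sketch of one way to guard against that, assuming EINs might be read as numbers in one file and as zero-padded strings in another. `normalize_ein()` and the toy tables are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: coerce EINs read as numbers or strings to a common
# 9-character, zero-padded key (EINs are 9 digits, so they fit in an integer)
normalize_ein <- function(x) {
  sprintf("%09d", as.integer(x))
}

# Invented toy tables: one with character EINs, one with numeric EINs
irs_toy   <- data.frame(ein = c("010211478", "521234567"),
                        name = c("ORG A", "ORG B"),
                        stringsAsFactors = FALSE)
epost_toy <- data.frame(ein = c(10211478, 521234567, 530000001),
                        tax_year = c(2023, 2022, 2023))

irs_toy$ein   <- normalize_ein(irs_toy$ein)
epost_toy$ein <- normalize_ein(epost_toy$ein)

# Left join: keep every IRS row, add epostcard columns where the EIN matches
joined <- merge(irs_toy, epost_toy, by = "ein", all.x = TRUE)
nrow(joined) == nrow(irs_toy) # same dimension check as in the chunk above
```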
## Revocations
https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads --> click on automatic revocation or exemption list link
direct link: https://apps.irs.gov/pub/epostcard/data-download-revocation.zip
Last updated Dec 9, 2024.
Under Internal Revenue Code Section 6033(j)(1)(A), the IRS will revoke the tax-exempt status of nonprofit organizations automatically when organizations with a filing requirement do not file their required Forms 990 for three consecutive years. When such a revocation occurs, it is effective as of the filing deadline, which is typically May 15 for organizations filing on a calendar year basis. This year, however, the Form 990 filing deadline was delayed to July 15 due to the COVID-19 pandemic
Organizations that do not file a required annual information return or notice for three consecutive years automatically lose their tax-exempt status by operation of law. An automatic revocation is effective on the original filing due date of the third annual return or notice (the "revocation date"). Due to the COVID-19 emergency, this year the IRS extended the filing dates for these returns and notices due from April 1 through July 14 to July 15, 2020. Organizations eligible for this relief that failed to file for the two previous years and did not file by July 15 have automatically lost their tax-exempt status. Due to systemic limitations, these organizations appear on the auto-revocation list showing a revocation date between April 1 and July 14, 2020. However, the revocation date for these organizations is July 15, 2020. For more information on automatic revocation, including how to request reinstatement, see Automatic revocation - How to have your tax-exempt status reinstated.
more info: https://www.irs.gov/pub/irs-pdf/p4991.pdf
```{r}
revocations <- read_delim("New_version_data/data-download-revocation_3.txt",
delim = "|", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
head(revocations)
colnames(revocations) <- c("ein", "rev_org_name", "address1", "address2", "city", "state", "zip", "country", "some_rev_code", "rev_date1", "rev_date2", "not_sure")
IRS_with_rev <- left_join(IRS, revocations, suffix = c("_irs", "_rev"), by = "ein") # suffixes mark overlapping columns, e.g. city_irs / city_rev
```
## Checking the revocation matching orgs
So 2020 would be the excluded date, but maybe the revocation list is enough.
So include 2021-2024, but not 2020 (if you submitted in 2020, there would be a 3 year grace period). Also, people didn't have to submit in 2020 for 2 years. If people don't submit for 3 years they lose their exempt status.

done: what should I do about these? Let's ignore tax year (maybe it is the first year they submitted? check on this too) and leave in orgs with a newer tax period or tax start date.

done: just check that tax start date mostly aligns with tax period (for those where we have both); then we can trust it for choosing orgs to keep.

done: check that start and end dates match. Turns out these have to do with when they file based on bylaws, i.e. to know when they are supposed to file. If tax year, end, and start line up, that is the last year they submitted (and within the period since June 2021, i.e. 2021-24). If such an org is also on the revocation list, don't remove it; otherwise remove orgs on the revocation list.

Combine with revocations
```{r}
IRS_with_revonly <-inner_join(IRS, revocations, by = "ein")
nrow(IRS_with_revonly)
IRS_with_revonly %>% select(ein, tax_period, tax_period_begin_date, tax_period_end_date, rev_date1, rev_date2)
```
```{r}
# make into date format
IRS_with_revonly <- IRS_with_revonly %>% mutate(tax_period_year = str_sub(tax_period, start = 1, end = 4))
IRS_with_revonly <- IRS_with_revonly %>% mutate(tax_period_asdate = ym(tax_period), # IRS data
rev_date1_asdate = dmy(rev_date1), # rev data
rev_date2_asdate = dmy(rev_date2), # rev data
tax_period_begin_date_asdate = mdy(tax_period_begin_date),
tax_period_end_date_asdate = mdy(tax_period_end_date)) # IRS data (looked at the dates - seems to be mdy format)
# turns out tax_year variable maybe not that reliable... not sure how we feel about that
#IRS %>% select(tax_period, tax_year) %>% count(tax_year, tax_period) %>% glimpse()
# check work in doing conversion
IRS_with_revonly %>% select(tax_period_year, tax_period_asdate, rev_date1_asdate, rev_date2_asdate, tax_period_begin_date_asdate, tax_period_end_date_asdate) %>% head()
IRS_with_revonly %>% select(tax_period_year, tax_period_asdate, tax_period_begin_date_asdate, tax_period_end_date_asdate) %>% drop_na() %>% head()
IRS_with_revonly <- IRS_with_revonly %>% mutate(diff_end_date = tax_period_end_date_asdate - tax_period_asdate)
# show the odd values for tax period and end date to not remove
odd_values <- filter(IRS_with_revonly, diff_end_date >31 | diff_end_date < (-31)) %>% select(ein, tax_period_year, tax_period_asdate, tax_period_begin_date_asdate, tax_period_end_date_asdate, diff_end_date)
datatable(odd_values)
```
done: ***If the tax filing date is newer than the revocation date, then we can assume the organization got reinstated for exemption and it should not be removed.***
Therefore we want to remove EINs whose tax period date is not newer than the revocation date, and only if the end date of the tax period matches the tax period date (within 31 days).
```{r}
#compare dates - assuming that tax period date is on first of month
IRS_with_revonly <-IRS_with_revonly %>% mutate(
tax_date_newer = tax_period_asdate > rev_date2_asdate)
ein_to_rem <-IRS_with_revonly %>% filter(tax_date_newer == FALSE & abs(diff_end_date)<= 31)
ein_to_rem %>% select(ein, diff_end_date, tax_date_newer, tax_period, rev_date2_asdate, tax_period_end_date_asdate)
datatable(ein_to_rem)
```
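
The removal rule above can be checked on a few invented rows. This is a minimal base R sketch under the same assumptions as the chunk (tax period on the first of the month, a 31-day tolerance); the toy values and column names are made up for illustration.

```r
toy <- data.frame(
  ein          = c(1, 2, 3, 4),
  tax_period   = as.Date(c("2023-12-01", "2019-12-01", "2019-06-01", "2019-12-01")),
  tax_end_date = as.Date(c("2023-12-31", "2019-12-31", "2019-12-31", NA)),
  rev_date     = as.Date(c("2021-05-15", "2022-05-15", "2022-05-15", "2022-05-15"))
)

# Difference in days between the epostcard end date and the tax period date
toy$diff_end_date  <- as.numeric(toy$tax_end_date - toy$tax_period)
# TRUE when the org filed after the revocation date (assumed reinstated)
toy$tax_date_newer <- toy$tax_period > toy$rev_date

# Remove only when the filing is not newer AND the end date lines up with the tax period
remove_flag <- !toy$tax_date_newer & abs(toy$diff_end_date) <= 31
# Rows with a missing end date give NA here; dplyr::filter() drops NA conditions,
# so those orgs are not removed. The same is made explicit in base R:
remove_flag <- remove_flag & !is.na(remove_flag)

toy$ein[remove_flag]
```

In this toy data only EIN 2 is flagged: EIN 1 filed after its revocation date, EIN 3 has an end date far from its tax period, and EIN 4 has no epostcard end date at all.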
#### Removing these orgs
```{r}
IRS_no_revoc<-anti_join(IRS, ein_to_rem, by = "ein")
IRS <-IRS_no_revoc
```
## PO Boxes
Check for addresses with P.O. or PO boxes and keep NA values.
Clean: what about Post Office Road? These are fine - not in Baltimore.
```{r}
#check PO box stuff: IRS %>% filter(str_detect(street, "PO ")) %>% pull(street)
nrow(IRS)#40377
IRS %>% filter(is.na(street)) %>% nrow() # how many rows are NA - None
# first make sure all is uppercase
IRS <- IRS %>% mutate(street = toupper(street))
## checks
IRS %>%filter(str_detect(street, "P\\.|PO |POST OFFICE")) %>% head()
po_check<-IRS %>%filter(str_detect(street, "P\\.|PO |POST OFFICE"))
po_check_P.<-IRS %>%filter(str_detect(street, "P\\."))
nrow(po_check_P.)
po_checkPO_BOX<-IRS %>%filter(str_detect(street, "PO BOX"))
nrow(po_checkPO_BOX)
po_checkPOST_OFFICE<-IRS %>%filter(str_detect(street, "POST OFFICE")) # what about post office road? it's ok because not in Baltimore
nrow(po_checkPOST_OFFICE)
IRS <- IRS %>% filter(!str_detect(street, "PO |POST OFFICE"))
nrow(IRS) #34762 # removed 5615
```
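
For comparison, here is a sketch of how the pattern used above behaves on a few invented street strings, next to a stricter alternative with word boundaries. The stricter pattern is an assumption offered for comparison and is not the filter the notebook applies. Note that base `grepl()` returns `FALSE` for `NA`, whereas `str_detect()` returns `NA` and `filter()` then drops that row.

```r
# Invented example addresses
streets <- c("PO BOX 123", "P O BOX 9", "100 EXPO DR",
             "12 POST OFFICE RD", NA, "45 MAIN ST")

# Pattern from the chunk above: "PO " also matches inside words such as "EXPO "
loose  <- grepl("PO |POST OFFICE", streets)

# Stricter alternative (assumption): require a standalone PO / P O followed by BOX
strict <- grepl("\\bP ?O BOX\\b|\\bPOST OFFICE BOX\\b", streets, perl = TRUE)

# Keep rows that are not boxes, and keep missing addresses explicitly
keep <- !strict | is.na(streets)

data.frame(streets, loose, strict, keep)
```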
# Filtering Orgs
Based on https://www.irs.gov/pub/irs-soi/eo-info.pdf
Page 3: we will exclude foundation code 00 (all orgs except 501c3) to filter down to 501c3 orgs.
```{r}
IRS %>% count(foundation) # No NA foundation codes
rem_orgs <- IRS %>% filter(foundation == "00")
IRS_org_rem<- IRS %>% filter(foundation != "00")
nrow(IRS) - nrow(IRS_org_rem) # drop 4639 # check filtering
write_csv(rem_orgs, file = here("irs_00_rem.csv"))
```
Could consider filtering out 04 and 17 (because not operating).
Check if they are all high asset; maybe we keep them but look at them on their own.
https://www.irs.gov/charities-non-profits/exempt-organizations-annual-reporting-requirements-filing-procedures-tax-year
These are foundations that provide money (04 = granters, 17 = boosters), so nonoperational, but there are not very many.
Let's keep them on their own as an aside but remove them from the main data.
```{r}
orgs_17_04 <-IRS_org_rem %>% filter(foundation == "17" |foundation == "04")
orgs_17_04 %>% count(foundation)
write_csv(orgs_17_04, file = here("orgs_17_04.csv"))
orgs_17_04$asset_amt
IRS_org_rem <- IRS_org_rem %>% filter(foundation != "17") %>% filter(foundation != "04")
IRS_org_rem %>% count(foundation)
```
## remove NTEE code IX [(IX. Mutual/Membership Benefit - Y)](https://urbaninstitute.github.io/nccs-legacy/ntee/ntee-history.html) - social clubs
- there appears to be only a small number of membership benefit clubs

(Keep NA for broad analyses of high vs low asset but possibly remove for other analyses)
We can do line-by-line analysis of these - want to keep Boys and Girls Clubs
```{r}
IRS_org_rem %>% count(ntee_cd)
```
todo: leave in clubs/lodges but remove social clubs pattern y
```{r}
Boys_and_girls <- IRS_org_rem %>% filter(str_detect(string = name , pattern = "BOYS AND GIRLS CLUB"))
social_clubs <- IRS_org_rem %>% filter(str_detect(string = ntee_cd, pattern = "Y|y")) # 49
social_clubs %>% count(ntee_cd)
datatable(social_clubs)
clubs_lodges <- IRS_org_rem %>% filter(str_detect(string = name, pattern = "club|Club|CLUB|lodge|Lodge|LODGE"))
datatable(clubs_lodges)
write_csv(social_clubs, file = here("irs_social_clubs_rem.csv"))
write_csv(clubs_lodges, file = here("irs_clubs_lodges_rem.csv"))
write_csv(Boys_and_girls, file = here("irs_boys_girls.csv"))
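# first remove NTEE codes containing "Y" (rows with a missing NTEE code are kept)
IRS_org_rem <- IRS_org_rem %>% filter(!str_detect(string = ntee_cd, pattern = "Y|y") |is.na(ntee_cd))
# then remove names containing "club" or "lodge"
# note: the `| is.na(ntee_cd)` here keeps club/lodge names whose NTEE code is missing
IRS_org_rem <- IRS_org_rem %>% filter(!str_detect(string = name, pattern = "club|Club|CLUB|lodge|Lodge|LODGE")|is.na(ntee_cd))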
```
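
A small sketch of the NTEE filter on invented rows, showing how an unanchored `Y|y` pattern differs from one anchored to the first letter of the code, and how missing codes are kept. The anchored pattern is an assumption for illustration, and `A6Y` is a made-up code used only to show a `Y` outside the first position.

```r
toy <- data.frame(
  name    = c("GARDEN SOCIETY", "FRATERNAL ORDER", "FOOD BANK", "YOUTH CHOIR"),
  ntee_cd = c("C42", "Y40", NA, "A6Y"),  # "A6Y" is a made-up code
  stringsAsFactors = FALSE
)

# Unanchored: flags any code containing a Y
is_y_any   <- grepl("Y|y", toy$ntee_cd)
# Anchored (assumption): flags only codes whose major group letter is Y
is_y_major <- grepl("^Y", toy$ntee_cd, ignore.case = TRUE)

# Drop major group Y, keep everything else including missing NTEE codes
kept <- toy[!is_y_major | is.na(toy$ntee_cd), ]
kept$name
```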
## Add Boys and Girls Clubs back
```{r}
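# distinct() drops exact duplicate rows, in case a Boys and Girls Club row survived the filters above (e.g. missing NTEE code) and is added twice
IRS_org_rem <- IRS_org_rem %>% rbind(Boys_and_girls) %>% distinct()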
IRS_org_rem %>% filter(str_detect(string = name , pattern = "BOYS AND GIRLS CLUB"))
#IRS_org_rem %>% count(ntee_cd)%>% view()
```
## Filter by year
- Filter for Oct 14, 2021 or newer, because that is 3 years back from the revocation data date of Oct 14, 2024...
- ...or actually the revocation data was last updated Dec 9, 2024, so Dec 9, 2021.
- Orgs need to submit at least once every 3 years: if they don't file for 3 consecutive years, they come off the exempt list.
- May 15th is the filing deadline.
- So if an org submitted in Jan 2021, they have until May 15th 2025 if they didn't file again in 2022, 2023, 2024.
- Revocation happens on May 15th each year (for people filing in that year); they have until May 15, 2025 to submit for the tax filing year of 2024.
- Anyone that would have been up in 2019 or 2020 got 2 years to submit, and thus 2022 would be the limit. Leaning towards sticking with 2020 because the statement about that has been confusing to read: maybe people in 2021 would have had extra time, but we aren't sure.
- 2018 would be very solid, but might include people who are on a revocation list.
- Some orgs don't have to submit a postcard, so they would not be in the postcard data. Todo: write this up as a limitation of the epost/revocation check.
- Churches don't necessarily have to submit, nor do orgs affiliated with a gov unit.
- Tax year 2020 got a 6 month extension, which pushes them into 2021 but not within the 3 years (Jan 2024, tax year 2023, goes back to Jan 2021).
- Currently sold on 2021 being a good option... or maybe backing up into 2020.

```{r}
nrow(IRS_org_rem)
IRS_org_rem <- IRS_org_rem %>% mutate(tax_period_asdate = ym(tax_period))
IRS <- IRS_org_rem %>% filter(tax_period_asdate >= as.Date("2021-12-09") | is.na(tax_period_asdate)) # keep rows with a tax period on or after 2021-12-09, or with no tax period (NA)
nrow(IRS)
#OLD CODE
#Check for jan data from 2020
# filter(irs_new_2020, year == 2020, month ==1)
#
# #irs_old_2020 <- filter(irs_old, year <2020) # this removes NAs
# year_info <-IRS %>% dplyr::select(contains(c("year","period")))
#
# print(filter(year_info, rowSums(is.na(year_info)) != ncol(year_info))) # hmm where are there many with no date info...
#
# # what are the rows with all NA???
# IRS %>% filter(is.na(year))
```
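
The cutoff used above can be derived from the revocation list date instead of being typed in. This is a minimal base R sketch on invented tax periods. It assumes, as in the chunk, that a tax period such as `202112` is treated as the first day of that month, so December 2021 falls before the Dec 9 cutoff.

```r
revocation_list_date <- as.Date("2024-12-09")
# Three years before the revocation list date
cutoff <- seq(revocation_list_date, by = "-3 years", length.out = 2)[2]

# Invented tax periods in YYYYMM form, including a missing one
tax_period <- c(202012, 202111, 202112, 202201, NA)
tax_period_asdate <- as.Date(paste0(tax_period, "01"), format = "%Y%m%d")

# Keep periods on or after the cutoff, and keep rows with no tax period
keep <- tax_period_asdate >= cutoff | is.na(tax_period_asdate)
data.frame(tax_period, tax_period_asdate, keep)
```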
## Save the data
```{r}
write_rds(IRS, file = "New_version_data/new_IRSdata.rds")
```
TODO: Then use shape file to filter for just Baltimore.
***KEEP 17 AND 04 LIST SEPARATE FOR MAYBE A MAP BUT WANT THEM OUT FOR OTHER ANALYSES.***
Keep lodges and clubs.
Try filtering for Baltimore first.
Send list to Tyler... then work on analysis again... then maps (can recruit people).