
Commit 90c107b

Merge pull request #14 from jhudsl/after_deletion
After deletion
2 parents 100f376 + ecf170b commit 90c107b

18 files changed: +115887 −93 lines

01_data_import_pre_geo.Rmd

Lines changed: 25 additions & 37 deletions
@@ -15,73 +15,65 @@ library(janitor)
 filing req code - don't have to file any forms or not 01 = 990, 02 = 990 but less than 25000 income, 03 = group return (don't know what that means), 06 = church, 07 = gov, 00 = not required to file


-updated date is 2025-04-13 00:10:44
-
-download date 5/5/2025:
+download date 9/4/2025:
 https://www.irs.gov/downloads/irs-soi --> then search for eo_md.csv
-more about it: https://www.irs.gov/statistics/soi-tax-stats-about-soi
+more about soi files: https://www.irs.gov/statistics/soi-tax-stats-about-soi

 ```{r}
-irs_new_old<- read_csv("New_version_data/eo_md.csv") #older
-irs_new<- read_csv("New_version_data/eo_md_5_25.csv") #newer
+irs_new <- read_csv("data/eo_md.csv") # current version
 ```


 # new epostcard 990-N

 Gives us information about who was an active org and helps save some from removal for revocation otherwise not listed.
 link: https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads
-direct link to data: (last data posting Oct 28, 2024) https://apps.irs.gov/pub/epostcard/data-download-epostcard.zip
+direct link to data: (last data posting March 10, 2025) https://apps.irs.gov/pub/epostcard/data-download-epostcard.zip
 dictionary: https://www.irs.gov/pub/irs-tege/990n-data-dictionary.pdf

-EIN, Tax Year , Organization Name, Gross receipts not greater than, Organization has terminated, Tax Period Begin Date , Tax Period End Date , Website URL , Principal Officer’s Name , Principal Officer’s Address Line 1, Principal Officer’s Address Line 2, Principal Officer’s Address City , Principal Officer’s Address Province, Principal Officer’s Address State , Principal Officer’s Address Zip Code , Principal Officer’s Address Country ,Organization Mailing Address Line 1 , Organization Mailing Address Line 2 , Organization Mailing Address City , Organization Mailing Address Province , Organization Mailing Address State , Organization Mailing Address Zip Code, Organization Mailing Address Country ,Organization Doing Business as Name 1, Organization Doing Business as Name 2 , Organization Doing Business as Name 3 ,
-```{r}
-# specify character for last column
-epost_old <- read_delim("New_version_data/data-download-epostcard.txt",
-delim = "|", escape_double = FALSE, col_names = FALSE,
-trim_ws = TRUE, col_types = c("X26" = "c"))
+EIN, Tax Year , Organization Name, Gross receipts not greater than, Organization has terminated, Tax Period Begin Date , Tax Period End Date , Website URL , Principal Officer’s Name , Principal Officer’s Address Line 1, Principal Officer’s Address Line 2, Principal Officer’s Address City , Principal Officer’s Address Province, Principal Officer’s Address State , Principal Officer’s Address Zip Code , Principal Officer’s Address Country ,Organization Mailing Address Line 1 , Organization Mailing Address Line 2 , Organization Mailing Address City , Organization Mailing Address Province , Organization Mailing Address State , Organization Mailing Address Zip Code, Organization Mailing Address Country ,Organization Doing Business as Name 1, Organization Doing Business as Name 2 , Organization Doing Business as Name 3

-#trying a different way to read in the data with different column lengths
-epost <- read.table("New_version_data/data-download-epostcard.txt", col.names = paste0("V", seq_len(28)), fill = TRUE, header=FALSE, sep="|")
+Over time we discovered that the file also has two extra columns and that NA values were sometimes empty strings or a single space.

-epost_new <- read.table("New_version_data/data-download-epostcard_5_8_25.txt", col.names = paste0("V", seq_len(28)), fill = TRUE, header=FALSE, sep="|")
+```{r}
+# reading in the data with different column lengths
+epost <- read.table("data/data-download-epostcard.txt", col.names = paste0("V", seq_len(28)), fill = TRUE, header = FALSE, sep = "|", quote = "", na.strings = c("", " ", NA))
 ```
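The `fill = TRUE` / `na.strings` pattern used above can be seen on a tiny in-memory example (editor's illustration with made-up rows, not part of the commit): short pipe-delimited rows are padded with `NA` instead of raising an error, and empty or single-space fields are also converted to `NA`.

```r
# Illustration only: two ragged pipe-delimited rows, read the same way as the 990-N file.
txt <- "12-3456789|2023|ORG A| |\n98-7654321|2023|ORG B\n"
dat <- read.table(text = txt, sep = "|", header = FALSE,
                  col.names = paste0("V", seq_len(5)),
                  fill = TRUE,               # pad short rows with NA instead of erroring
                  quote = "",                # org names may contain quote characters
                  na.strings = c("", " "))   # treat empty fields and single spaces as NA
dat  # both rows parse; V4 is NA in each (a lone space in row 1, missing in row 2)
```

The `quote = ""` argument matters for this kind of file: without it, a stray `"` inside an organization name can silently swallow the rest of a row.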

 ### check for problems
 ```{r, include = FALSE}
-prbs <- problems(epost)
-dim(prbs)
+nrow(problems(epost))
 ```

-<!-- ```{bash, include = FALSE} -->
-<!-- # checking the original version of the file for one of the problematic rows -->
-<!-- <!-- awk 'FNR>=9294 && FNR<=9294' New_version_data/data-download-epostcard_2024.txt # first problematic row is actually originally on line 9294 --> -->
-
-<!-- <!-- awk 'FNR>=358951 && FNR<=358951' New_version_data/data-download-epostcard_2024.txt #second problematic row --> -->
-
-<!-- <!-- awk 'FNR>=1 && FNR<=3' New_version_data/data-download-epostcard_2024.txt # checking line 1-3 to ensure that there were some empty lines for the first two lines - yup looks like it! --> -->
-<!-- ``` -->

 ### Get row names:
 ```{r}
-colnames(epost) <- c("EIN", "Tax Year" , "Organization Name", "Gross receipts not greater than", "Organization has terminated", "Tax Period Begin Date" , "Tax Period End Date" , "Website URL" , "Principal Officer’s Name" , "Principal Officer’s Address Line 1", "Principal Officer’s Address Line 2", "Principal Officer’s Address City" , "Principal Officer’s Address Province", "Principal Officer’s Address State" , "Principal Officer’s Address Zip Code" , "Principal Officer’s Address Country" ,"Organization Mailing Address Line 1" , "Organization Mailing Address Line 2" , "Organization Mailing Address City" , "Organization Mailing Address Province" , "Organization Mailing Address State" , "Organization Mailing Address Zip Code", "Organization Mailing Address Country" ,"Organization Doing Business as Name 1", "Organization Doing Business as Name 2", "Organization Doing Business as Name 3")
+colnames(epost) <- c("EIN", "Tax Year" , "Organization Name", "Gross receipts not greater than", "Organization has terminated", "Tax Period Begin Date" , "Tax Period End Date" , "Website URL" , "Principal Officer’s Name" , "Principal Officer’s Address Line 1", "Principal Officer’s Address Line 2", "Principal Officer’s Address City" , "Principal Officer’s Address Province", "Principal Officer’s Address State" , "Principal Officer’s Address Zip Code" , "Principal Officer’s Address Country" ,"Organization Mailing Address Line 1" , "Organization Mailing Address Line 2" , "Organization Mailing Address City" , "Organization Mailing Address Province" , "Organization Mailing Address State" , "Organization Mailing Address Zip Code", "Organization Mailing Address Country" ,"Organization Doing Business as Name 1", "Organization Doing Business as Name 2", "Organization Doing Business as Name 3", "extra_1", "extra_2")


 epost <- clean_names(epost) # removing spaces from names

-# checking the rows that have extra columns - only one for MD but will keep it in the dataset
-epost_extra_long <- epost %>% filter(na !="") %>% filter(organization_mailing_address_state == "MD")
-epost_extra_long2 <- epost %>% filter(na_2 !="") %>% filter(organization_mailing_address_state == "MD")
+```

+```{r}
+# checking the rows that have extra columns - only one for MD but will keep it in the dataset
+epost %>% drop_na(extra_1) %>% nrow() # look at how many are not empty values
+epost %>% drop_na(extra_2) %>% nrow() # look at how many are not empty values
+# Looks like all the extra values are NA
+# removing empty columns
+epost <- remove_empty(epost, "cols")
 ```

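`remove_empty(epost, "cols")` above is janitor's helper; conceptually it drops every column whose values are all `NA`. A base-R sketch of the same idea (illustration with toy data, not part of the commit):

```r
# Base-R equivalent of janitor::remove_empty(df, "cols"):
# keep only columns that contain at least one non-NA value.
df <- data.frame(ein = c("1", "2"), extra_1 = c(NA, NA), extra_2 = c(NA, NA))
drop_empty_cols <- function(d) d[, colSums(!is.na(d)) > 0, drop = FALSE]
df2 <- drop_empty_cols(df)
names(df2)  # only "ein" survives; the all-NA extra_1/extra_2 are dropped
```

This is why converting blank and single-space fields to `NA` at read time matters: otherwise the two extra columns would look non-empty and survive the cleanup.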


+
+
 ### How many in epost not in irs_new? filter for md

 ```{r}
 irs_new<- rename_with(irs_new, tolower) #make names lowercase
 epost <- epost %>% filter(principal_officer_s_address_state == "MD") #filter for only MD
+epost <-epost %>% mutate(ein = as.character(ein))
 epost_only <-anti_join(epost,irs_new, by ="ein")
 nrow(epost_only) # not in irs_new
 nrow(irs_new)
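The `anti_join(epost, irs_new, by = "ein")` step keeps the `epost` rows whose `ein` has no match in `irs_new`. A base-R sketch of that filter on made-up EINs (illustration only, not part of the commit):

```r
# Base-R equivalent of dplyr::anti_join on a single key column.
epost_demo   <- data.frame(ein = c("11", "22", "33"))
irs_new_demo <- data.frame(ein = c("22"))
epost_only_demo <- epost_demo[!epost_demo$ein %in% irs_new_demo$ein, , drop = FALSE]
nrow(epost_only_demo)  # 2 rows ("11" and "33") have no match in irs_new_demo
```

This also shows why the diff adds `mutate(ein = as.character(ein))` first: joins (and `%in%`) compare values, so both key columns need a common type before matching.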
@@ -107,9 +99,8 @@ nrow(IRS) == nrow(irs_new)# test if dimensions are still the same

 https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads --> click on automatic revocation or exemption list link

-Direct link downloaded 5_8_25 (this dataset will change overtime): https://apps.irs.gov/pub/epostcard/data-download-revocation.zip
+Direct link downloaded 9/4/2025 (this dataset will change over time): https://apps.irs.gov/pub/epostcard/data-download-revocation.zip

-last updated Dec,9, 2024

 Under Internal Revenue Code Section 6033(j)(1)(A), the IRS will revoke the tax-exempt status of nonprofit organizations automatically when organizations with a filing requirement do not file their required Forms 990 for three consecutive years. When such a revocation occurs, it is effective as of the filing deadline, which is typically May 15 for organizations filing on a calendar year basis. This year, however, the Form 990 filing deadline was delayed to July 15 due to the COVID-19 pandemic.

@@ -120,10 +111,7 @@ more info: https://www.irs.gov/pub/irs-pdf/p4991.pdf


 ```{r}
-revocations <- read_delim("New_version_data/data-download-revocation_3.txt",
-delim = "|", escape_double = FALSE, col_names = FALSE,
-trim_ws = TRUE)
-revocations_new <- read_delim("New_version_data/data-download-revocation_5_8_25.txt",
+revocations <- read_delim("data/data-download-revocation.txt",
 delim = "|", escape_double = FALSE, col_names = FALSE,
 trim_ws = TRUE)
 head(revocations)
@@ -136,7 +124,7 @@ IRS_with_rev <-left_join(IRS, revocations, suffix = c("irs", "rev"), by = "ein")
 ```{r}
 library(ggmap)
 IRS <- IRS_with_rev %>% unite("address",street:zipirs, remove = FALSE, sep = ", ") # this next step is slow so commenting out, making address variable
-saveRDS(IRS, file = "New_version_data/IRS_data_before_lat_long.rds")
+saveRDS(IRS, file = "data/IRS_data_before_lat_long.rds")
 ```



02_geo_coding.R

Lines changed: 63 additions & 40 deletions
@@ -1,7 +1,26 @@
 library(tidyverse)
 library(tidygeocoder)

-IRS <- readRDS(file = "New_version_data/IRS_data_before_lat_long.rds")
+IRS <- readRDS(file = "data/IRS_data_before_lat_long.rds")
+IRS <- head(IRS, n = 7)
+# Function to geocode a subset of data
+geocode_subset <- function(data_subset) {
+  data_subset %>% geocode(address = address,
+                          method = 'arcgis', lat = latitude , long = longitude)
+}
+
+# Split the data into chunks
+num_cores <- detectCores() - 1
+data_chunks <- split(IRS, rep(1:num_cores, length.out = nrow(IRS)))
+
+# Perform parallel geocoding
+results <- mclapply(data_chunks, geocode_subset, mc.cores = num_cores)
+
+# Combine results
+final_results <- bind_rows(results)
+
+# Save the results to an RDS file
+saveRDS(final_results, file = "data/geos_new.rds")

 ### example code ###########
 # some_addresses <- tibble::tribble(
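One caveat about the new chunked-geocoding code above: `detectCores()` and `mclapply()` come from base R's `parallel` package, which the two `library()` calls in the diff do not load, and `mclapply()` relies on forking, so on Windows it only runs sequentially. A minimal self-contained sketch of the same split/apply/combine pattern (editor's illustration with a stand-in function; a real run would call `tidygeocoder::geocode()` inside each chunk):

```r
library(parallel)  # provides detectCores() and mclapply()

df <- data.frame(id = 1:7, address = paste("addr", 1:7))

# Stand-in for geocode_subset(): fabricates a latitude instead of calling a geocoder.
fake_geocode <- function(chunk) { chunk$latitude <- 39 + chunk$id / 100; chunk }

num_cores <- max(1, detectCores() - 1)
chunks  <- split(df, rep(seq_len(num_cores), length.out = nrow(df)))
results <- mclapply(chunks, fake_geocode, mc.cores = 1)  # mc.cores = 1 for portability here
final   <- do.call(rbind, results)
nrow(final)  # all 7 rows come back, each with a latitude
```

Note also that `rep(1:num_cores, length.out = nrow(IRS))` interleaves rows across chunks, so the recombined result is not in the original row order; that is harmless here only because each row carries its own `ein`/`address`.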
@@ -16,8 +35,8 @@ IRS <- readRDS(file = "New_version_data/IRS_data_before_lat_long.rds")
 # lat_longs
 ################

-split_vect<- rep(1:10, length.out = nrow(IRS)) # vector to split by
-split_IRS <-IRS %>% split(split_vect)
+#split_vect<- rep(1:10, length.out = nrow(IRS)) # vector to split by
+#split_IRS <-IRS %>% split(split_vect)

 # geo_IRS_1<- geocode(split_IRS[[1]], address = address,
 # method = 'arcgis', lat = latitude , long = longitude)
@@ -42,43 +61,43 @@ split_IRS <-IRS %>% split(split_vect)



-IRSfirst3rd <- IRS[1:15000,]
-IRSsecond3rd <- IRS[15001:30000,]
-IRSthird3rd <- IRS[30001:(length(IRS$ein)),]
-
-geosfirst3rd <-IRSfirst3rd %>% geocode(address = address,
-method = 'arcgis', lat = latitude , long = longitude)
-
-saveRDS(geosfirst3rd, file = "New_version_data/first_geos.rds")
-
-IRSsecond3rd_A <-IRSsecond3rd[1:8000,]
-
-IRSsecond3rd_B <-IRSsecond3rd[8001:length(IRSfirst3rd$ein),]
-beep(1, geossecond3rd_A<-IRSsecond3rd_A %>% geocode(address = address,
-method = 'arcgis', lat = latitude , long = longitude))
-
-beep(1, geossecond3rd_B<-IRSsecond3rd_B %>% geocode(address = address,
-method = 'arcgis', lat = latitude , long = longitude))
-
-saveRDS(geossecond3rd_A, file = "New_version_data/second_geosA.rds")
-saveRDS(geossecond3rd_B, file = "New_version_data/second_geosB.rds")
-
-beep(1, geoslast3rd<-IRSthird3rd %>% geocode(address = address,
-method = 'arcgis', lat = latitude , long = longitude))
-
-saveRDS(geoslast3rd, file = "New_version_data/third_geos.rds")
-
-geos <- rbind(first_geos, geossecond3rd_A, geossecond3rd_B, geoslast3rd)
-
-saveRDS(geos, file = "New_version_data/geos.rds")
-
-
-#library(naniar)
-#geos %>%pull(latitude) %>% pct_complete() #~100 %
-
-
-
-
+# IRSfirst3rd <- IRS[1:15000,]
+# IRSsecond3rd <- IRS[15001:30000,]
+# IRSthird3rd <- IRS[30001:(length(IRS$ein)),]
+#
+# geosfirst3rd <-IRSfirst3rd %>% geocode(address = address,
+# method = 'arcgis', lat = latitude , long = longitude)
+#
+# saveRDS(geosfirst3rd, file = "data/first_geos.rds")
+#
+# IRSsecond3rd_A <-IRSsecond3rd[1:8000,]
+#
+# IRSsecond3rd_B <-IRSsecond3rd[8001:length(IRSfirst3rd$ein),]
+# beep(1, geossecond3rd_A<-IRSsecond3rd_A %>% geocode(address = address,
+# method = 'arcgis', lat = latitude , long = longitude))
+#
+# beep(1, geossecond3rd_B<-IRSsecond3rd_B %>% geocode(address = address,
+# method = 'arcgis', lat = latitude , long = longitude))
+#
+# saveRDS(geossecond3rd_A, file = "data/second_geosA.rds")
+# saveRDS(geossecond3rd_B, file = "data/second_geosB.rds")
+#
+# beep(1, geoslast3rd<-IRSthird3rd %>% geocode(address = address,
+# method = 'arcgis', lat = latitude , long = longitude))
+#
+# saveRDS(geoslast3rd, file = "data/third_geos.rds")
+#
+# geos <- rbind(first_geos, geossecond3rd_A, geossecond3rd_B, geoslast3rd)
+#
+# saveRDS(geos, file = "data/geos.rds")
+#
+#
+# #library(naniar)
+# #geos %>%pull(latitude) %>% pct_complete() #~100 %
+#
+#
+#
+#

 #' To cite tidygeocoder use:
 #'
@@ -102,3 +121,7 @@ saveRDS(geos, file = "New_version_data/geos.rds")
 #' pages = {3544},
 #' note = {R package version 1.0.5},
 #' }
+#'
+#'
+
+

04_prepare_neighborhood_data.Rmd

Lines changed: 65 additions & 4 deletions
@@ -9,8 +9,69 @@ library(naniar)

 # Data Import
 ```{r}
-df_simplified<-read_excel(here::here("Nonprofit_Baltimore_Analysis.xlsx"), sheet = 3)
-BMF <- read_csv("New_version_data/MD_BMF_V1.1.csv")
-neighborhoods <- read_csv("New_version_data/Neighborhood_Statistical_Area_(NSA)_Boundaries.csv")
-neighborhoods2 <- read_csv("New_version_data/Neighborhoods(in)_from_tyler_April4_2025.csv") # hmm these are kinda different
+
+#IRS <- read_rds("data/IRS_data_before_lat_long.rds") # 41,638 rows from the eo_md.csv
+#any(is.na(IRS$cityirs)) # no na values for city
+#IRS <- filter(IRS, is.na(cityirs) | cityirs == "BALTIMORE") #filter for Baltimore for now to keep small as we test
+#df_simplified<-read_excel(here::here("Nonprofit_Baltimore_Analysis.xlsx"), sheet = 3)
+BMF <- read_csv("data/MD_BMF_V1.1.csv") # much bigger than raw IRS data, 73,768 rows (also comes from IRS)
+neighborhoods <- read_csv("data/Neighborhood_Statistical_Area_(NSA)_Boundaries.csv")
+```
+
+## Get lat and long for shape file
+
+https://stackoverflow.com/questions/66381795/check-whether-point-coordinate-lies-within-polygon
+https://www.statsilk.com/maps/convert-esri-shapefile-map-geojson-format
+
+```{r}
+library(raster)
+library(sf)
+
+neighborhood_shape <- st_read("data/Neighborhood_Statistical_Area_(NSA)_Boundaries/Neighborhood_Statistical_Area_(NSA)_Boundaries.shp")
+
+```
+
+
+```{r}
+BMF_geo <- BMF %>% dplyr::select(EIN, LATITUDE, LONGITUDE)
+any(is.na(BMF_geo$LATITUDE)) # no missing location info
+any(is.na(BMF_geo$LONGITUDE))
+```
+
+```{r}
+CRS <- st_crs(neighborhood_shape$geometry)
+pnts_sf <- st_as_sf(BMF, coords = c('LONGITUDE', 'LATITUDE'), crs = st_crs(4326)) %>% st_set_crs(4326)
+#neighborhood_Sf <-neighborhood_shape$geometry
+#neighborhood_Sf <- neighborhood_Sf %>% st_set_crs(4326)
+pnts_trans <- st_transform(pnts_sf, 2163)
+neighborhood_tt <- st_transform(neighborhood_shape$geometry, 2163)
+intersection <- pnts_sf %>% mutate(
+  intersection = as.integer(st_intersects( pnts_trans, neighborhood_tt )))
+in_balt <- intersection %>% filter(!is.na(intersection)) # just baltimore locations
+```
+
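In the chunk above, `st_intersects()` returns, for each nonprofit point, the index of the containing neighborhood polygon (`NA` when the point lies outside all of them). As a conceptual illustration only — this is not sf's GEOS implementation — the underlying containment question is a point-in-polygon test, sketched here in base R with even-odd ray casting:

```r
# Even-odd ray casting: cast a horizontal ray from the point and count edge crossings;
# an odd count means the point is inside the polygon.
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx); inside <- FALSE; j <- n
  for (i in seq_len(n)) {
    if ((vy[i] > py) != (vy[j] > py) &&
        px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

# Toy "neighborhood": the unit square, with one point inside and one outside.
vx <- c(0, 1, 1, 0); vy <- c(0, 0, 1, 1)
point_in_polygon(0.5, 0.5, vx, vy)  # TRUE
point_in_polygon(2.0, 0.5, vx, vy)  # FALSE
```

The projection step in the chunk (`st_transform(..., 2163)`) matters for the real test: both points and polygons must be in the same CRS before intersection, which is why the script transforms both sides first.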
+
+
+checking that it worked
+```{r}
+# first row:
+# ein geometry intersection
+# 1 010591773 POINT (-76.69024 39.36632) 43
+
+#neighborhood_shape[43,]$name
+#filter(IRS, ein == "010591773")
+
+#Looks like this is in that location
+
+```
+
+
+Combining it all together:
+```{r}
+neighborhood_shape <- as_tibble(neighborhood_shape)
+neighborhood_shape <- neighborhood_shape %>% mutate(id = row_number())
+org_data <- left_join(in_balt, neighborhood_shape, by = c("intersection" = "id"))
+
+write_rds(org_data, file = "data/processed/org_data.rds")
 ```
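The combining chunk gives each polygon a row-number `id` and left-joins each point record on its `intersection` index, so every organization picks up its neighborhood's attributes. The same lookup in base R is a `merge` (editor's illustration — the EIN-to-neighborhood pairing here is made up):

```r
# Toy stand-ins for in_balt and neighborhood_shape (hypothetical id/name mapping).
in_balt_demo <- data.frame(ein = c("010591773", "020000000"), intersection = c(43L, 7L))
shape_demo   <- data.frame(id = c(7L, 43L), name = c("Fells Point", "Roland Park"))

# Join point records to polygon attributes on the containment index.
org_data_demo <- merge(in_balt_demo, shape_demo,
                       by.x = "intersection", by.y = "id", all.x = TRUE)
org_data_demo$name  # each org now carries its neighborhood name
```

`all.x = TRUE` mirrors the `left_join`: a point whose index had no matching polygon row would keep `NA` attributes rather than being dropped.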
