# Build IRS 990 E-Filers Database
This script builds the INDEX file for all IRS E-Filer data hosted on Amazon Web Services (AWS). The index contains a limited number of variables:
## DATA DICTIONARY
Variable | Notes
---------|---------
EIN | Employer Identification Number
SubmittedOn | ? Unclear
TaxPeriod | Tax year and month of the filing YYYYMM
FilingYear | Tax year of the filing YYYY
DLN | Document Locator Number - internal to IRS
LastUpdated | Last time the org updated their file
FormType | 990, 990-EZ, or 990-PF
ObjectId | Unique ID - Tax Period + DLN (sort of)
OrganizationName | Name of the nonprofit
IsElectronic | Was the return filed electronically? T/F
IsAvailable | Is the return available online? T/F
URL | Web address of the IRS electronic return
## Accessing IRS 990 Filings on AWS
Information about this data can be found at: https://aws.amazon.com/public-data-sets/irs-990/
An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing, including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.
All of the data is publicly accessible via the S3 bucket's HTTPS endpoint at https://s3.amazonaws.com/irs-form-990. No authentication is required to download data over HTTPS.
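As a rough sketch of how the HTTPS endpoint can be used from R, the helper below builds the URL of a yearly index file and stacks several indexes into one data frame. It assumes the yearly files follow the `index_YYYY.json` naming used later in this script. The names `index_url`, `load_indexes`, and the `reader` argument are hypothetical, introduced only for illustration.

```R
# Build the URL of a yearly index file.
# Assumption: files are named index_YYYY.json under the public HTTPS endpoint.
index_url <- function( year ) {
  paste0( "https://s3.amazonaws.com/irs-form-990/index_", year, ".json" )
}

# Read several yearly indexes and stack them into one data frame.
# `reader` is a hypothetical hook so the function can be tried without network access.
# By default it calls jsonlite::fromJSON and keeps the first list element.
load_indexes <- function( years, reader = function( u ) jsonlite::fromJSON( u )[[1]] ) {
  parts <- lapply( index_url( years ), reader )
  do.call( rbind, parts )
}

urls <- index_url( 2011:2017 )
urls[ 1 ]
```

Calling `load_indexes( 2011:2017 )` would attempt the downloads over HTTPS. Passing a different `reader` makes it possible to check the stacking logic offline.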
## BUILD THE INDEX DATABASE OF ALL ELECTRONIC FILERS
```R
# INSTALL REQUIRED PACKAGES
install.packages( "jsonlite" )
install.packages( "R.utils" )
install.packages( "curl" )
# LOAD REQUIRED PACKAGES
library( jsonlite )
library( R.utils )
library( curl )
### CREATE A DIRECTORY FOR YOUR DATA
getwd()
dir.create( "IRS Nonprofit Data" )
setwd( "./IRS Nonprofit Data" )
### DOWNLOAD FILES AND UNZIP (deprecated Dec 2016 - IRS released new index files)
#
# electronic.filers <- "https://s3.amazonaws.com/irs-form-990/index.json.gz"
#
# download.file( url=electronic.filers, "electronic.json.gz" )
#
# gunzip("electronic.json.gz", remove=TRUE )
#
# data.ef <- fromJSON( txt="electronic.json" )[[1]]
# CREATE A DATA FRAME OF ELECTRONIC FILERS FROM IRS JSON FILES
dat1 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2011.json")[[1]]
dat2 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2012.json")[[1]]
dat3 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2013.json")[[1]]
dat4 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2014.json")[[1]]
dat5 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2015.json")[[1]]
dat6 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2016.json")[[1]]
dat7 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2017.json")[[1]]
data.ef <- rbind( dat1, dat2, dat3, dat4, dat5, dat6, dat7 )
nrow( data.ef )
# DERIVE FILING YEAR (YYYY) FROM TAX PERIOD (YYYYMM)
# Tax Period represents the end of the nonprofit's accounting year
# The filing year is the previous calendar year, unless the accounting year ends in December
year <- as.numeric( substr( data.ef$TaxPeriod, 1, 4 ) )
month <- substr( data.ef$TaxPeriod, 5, 6 )
data.ef$FilingYear <- year - 1
data.ef$FilingYear[ month == "12" ] <- year[ month == "12" ]
# table( data.ef$FormType, data.ef$FilingYear, useNA="ifany" )
#
#          2009   2010   2011   2012   2013   2014   2015
#   990   33360 123107 159539 179675 198615 215764  73233
#   990EZ 15500  63253  82066  93769 104425 114822  60967
#   990PF  2352  25275  34597  39936  45870  52617  34387
# DROP UNRELIABLE YEARS
table( data.ef$FilingYear ) # note that some records are nonsensical
data.ef <- data.ef[ data.ef$FilingYear > 2009 & data.ef$FilingYear < 2017 , ]
table( data.ef$FilingYear, useNA="ifany" )
nrow( data.ef )
# EXCLUDE DATA THAT IS NOT AVAILABLE IN ELECTRONIC FORMAT
table( data.ef$IsElectronic, data.ef$IsAvailable )
data.ef <- data.ef[ data.ef$IsAvailable == TRUE , ]
nrow( data.ef )
```
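The filing-year rule used in the script can be checked on a few hand-made values. This is a minimal sketch: `filing_year` is a hypothetical helper name, and the tax periods are invented for illustration.

```R
# Derive the filing year from a TaxPeriod string in YYYYMM format.
filing_year <- function( tax_period ) {
  year  <- as.numeric( substr( tax_period, 1, 4 ) )
  month <- substr( tax_period, 5, 6 )
  # Accounting years ending in December keep their year;
  # all others are assigned to the previous year.
  ifelse( month == "12", year, year - 1 )
}

# Invented examples: December 2015, June 2015, January 2016
filing_year( c( "201512", "201506", "201601" ) )
# 2015 2014 2015
```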
## DESCRIPTIVES
```R
length( unique( data.ef$EIN ) ) # number of unique orgs
table( table( data.ef$EIN, useNA="ifany" ) ) # available panel spells
# why do we have some organizations with more spells than years ?
table.ein <- table( data.ef$EIN, useNA="ifany" )
table.ein[ table.ein > 6 ]
data.ef[ data.ef$EIN == "830229733" , ] # submitted two filings in 2013
```
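Organizations with more spells than years can be found by looking for repeated EIN and filing-year combinations. The sketch below uses a small invented data frame (the EINs are made up) rather than the real index, and is only one of several ways to flag such records.

```R
# Toy data: the first organization has two filings for 2013.
toy <- data.frame(
  EIN        = c( "111111111", "111111111", "222222222", "111111111" ),
  FilingYear = c( 2013, 2013, 2013, 2014 ),
  stringsAsFactors = FALSE
)

# Count filings per EIN-year combination and keep those that appear more than once.
key      <- paste( toy$EIN, toy$FilingYear )
counts   <- table( key )
dup.keys <- names( counts )[ counts > 1 ]
dups     <- toy[ key %in% dup.keys , ]

dups
```

Applied to `data.ef`, the same steps would list every organization-year with multiple filings, which can then be inspected before deciding which record to keep.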
## WRITE DATA TO YOUR FAVORITE FORMAT
```R
# AS R DATA SET
saveRDS( data.ef, file="ElectronicFilers.rds" )
# AS CSV
write.csv( data.ef, file="ElectronicFilers.csv", row.names=FALSE )
# IN STATA
install.packages( "haven" )
library( haven )
write_dta( data.ef, path="ElectronicFilers.dta" )
# IN SPSS - creates a text file and a script for reading it into SPSS
library( foreign )
write.foreign( df=data.ef, datafile="ElectronicFilers.txt", codefile="CodeToLoadDataInSPSS.txt", package="SPSS" )
```
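One thing to watch when reloading the CSV version: EINs are identifiers, and some begin with zero, so reading them as numbers drops the leading zeros. The sketch below shows the effect on an invented two-row data frame written to a temporary file. The EIN values are made up.

```R
# Toy data with an EIN that starts with zero.
toy <- data.frame(
  EIN        = c( "010203040", "222222222" ),
  FilingYear = c( 2013, 2014 ),
  stringsAsFactors = FALSE
)

csv.path <- tempfile( fileext = ".csv" )
write.csv( toy, file = csv.path, row.names = FALSE )

# Default read: EIN is converted to a number.
naive <- read.csv( csv.path )
naive$EIN[ 1 ]   # 10203040: the leading zero is gone

# Forcing EIN to character keeps the identifier intact.
back <- read.csv( csv.path, colClasses = c( EIN = "character" ) )
back$EIN[ 1 ]    # "010203040"
```

The RDS format does not have this issue because it stores column types along with the data.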