Updated R package for processing IRS 990 Efile datasets.
This package is a Gen 2 version of the irs990efile package with an improved XML-to-RDB workflow that is more robust and reduces processing time to about 15 minutes per tax year (the original package takes about 3 days to process one tax year).
Processed files are available on the NCCS website in CSV format: https://nccs.urban.org/nccs/datasets/efile/
```r
# ef2 currently depends on these packages, which are only available on GitHub
devtools::install_github( 'ultinomics/xmltools' )
devtools::install_github( 'nonprofit-open-data-collective/irs990efile' )
devtools::install_github( 'nonprofit-open-data-collective/ef2' )
```

The efficiency gains from the package come partly from the pre-processing step. All XML files are converted into something similar to a Parquet (long) format, with one row per xpath within a document. The flattened files are stored in a DuckDB database file.
The desired tables are then extracted from the database using:
```r
library( irs990efile )
library( ef2 )

wd          <- getwd()   # project working directory (set to your own path)
years       <- 2020:2022
table_names <- c( "F9-P08-T00-REVENUE", "F9-P09-T00-EXPENSES", "F9-P10-T00-BALANCE-SHEET" )

extract_csv_tables( wd=wd, years=years, table_names=table_names )
```

The processed DuckDB databases are available at: https://nccs-efile.s3.dualstack.us-east-1.amazonaws.com/duckdb/efile_v2_1/ + EFILE{YEAR}.duckdb
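For example, a single year can be downloaded and opened locally. This is a minimal sketch assuming the DBI and duckdb packages are installed; the file URL follows the pattern above, and the tables inside each database are whatever the build step produced, so list them before querying.

```r
library( DBI )
library( duckdb )

# download one year's pre-built database (URL pattern from above)
url <- "https://nccs-efile.s3.dualstack.us-east-1.amazonaws.com/duckdb/efile_v2_1/EFILE2021.duckdb"
download.file( url, destfile="EFILE2021.duckdb", mode="wb" )

# open read-only and list the tables it contains
con <- dbConnect( duckdb::duckdb(), dbdir="EFILE2021.duckdb", read_only=TRUE )
dbListTables( con )
dbDisconnect( con, shutdown=TRUE )
```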
Alternatively, you can attach S3 versions of the DuckDB databases using virtual memory in order to run queries without downloading the files to your local machine.
version <- "efile_v2_1"
con <- get_s3_database( filename="EFILE2021.duckdb", version=version )You can then run queries to inspect raw data or extract tables directly from S3.
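For example, assuming get_s3_database() returns a standard DBI connection, a query might look like the following sketch. The table name xpath_long is hypothetical; list the actual tables first with DBI::dbListTables( con ).

```r
# count rows per output table in the flattened xpath data
# ("xpath_long" is a hypothetical table name -- check dbListTables(con) first)
DBI::dbGetQuery( con, "
  SELECT TABLE_NAME, COUNT(*) AS n
  FROM   xpath_long
  GROUP  BY TABLE_NAME
  ORDER  BY n DESC
  LIMIT  10
" )
```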
The full index of electronic filings can be loaded and tabulated by form type and tax year:

```r
index <- get_current_index_full()
i2    <- dplyr::filter( index, FormType %in% c("990","990EZ") )
table( i2$TaxYear, i2$FormType ) |> knitr::kable()
```

|  | 990 | 990EZ |
|---|---|---|
| 2007 | 17 | 17 |
| 2008 | 87 | 114 |
| 2009 | 33311 | 15470 |
| 2010 | 123025 | 63326 |
| 2011 | 159504 | 82048 |
| 2012 | 179688 | 93750 |
| 2013 | 198855 | 104375 |
| 2014 | 218619 | 116417 |
| 2015 | 233519 | 124894 |
| 2016 | 243903 | 130484 |
| 2017 | 261612 | 139145 |
| 2018 | 271442 | 149384 |
| 2019 | 284515 | 152689 |
| 2020 | 343790 | 171891 |
| 2021 | 336521 | 202730 |
| 2022 | 348034 | 205532 |
| 2023 | 329042 | 198731 |
| 2024 | 78817 | 80682 |
```r
# SPLIT INTO YEARS TO KEEP SIZE MANAGEABLE
dir.create( "index" )
for( i in 2009:2024 ){
  index_i <- dplyr::filter( i2, TaxYear == i )
  saveRDS( index_i, paste0( "index/index", i, ".rds" ) )
}
```

To build a new database:
YEAR <- "2020"
index_y <- readRDS( paste0( "index/index", YEAR, ".rds" ) )
build_database( year=YEAR, urls = index_y$URL )You can update an existing database by differencing the XML files currently indexed in the database (using XPATH URL as the unique keys) and
update_db( year=YEAR, index = index_y )It will only create database tables for the XML documents that are missing from the current database by checking for missing cases:
missing_urls <- find_missing_urls( year=YEAR, index = index_y )Thus it is a more efficient way to maintain a database by updating the current version rather than building from scratch.
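Conceptually, the differencing step amounts to a set difference on URLs, as in this sketch. It assumes con is a DBI connection to the year's database and that the flattened data lives in a hypothetical xpath_long table with a URL column; find_missing_urls() encapsulates this logic.

```r
# URLs already ingested into the database vs URLs in the new index
urls_in_db   <- DBI::dbGetQuery( con, "SELECT DISTINCT URL FROM xpath_long" )$URL
missing_urls <- setdiff( index_y$URL, urls_in_db )
```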
The first step converts XML files to something that resembles a Parquet long format. Raw XML fields:
```xml
<BooksInCareOfDetail>
  <PersonNm>Vanuel Bloss</PersonNm>
  <USAddress>
    <AddressLine1Txt>319 Linsbury Ct</AddressLine1Txt>
    <CityNm>Gastonia</CityNm>
    <StateAbbreviationCd>NC</StateAbbreviationCd>
    <ZIPCd>28056</ZIPCd>
  </USAddress>
  <PhoneNum>7048241662</PhoneNum>
</BooksInCareOfDetail>
```
The fields above get converted into a table with one row per XML node. In other words, the file is flattened so that each piece of data (the VALUE column) gets its own row:
| XPATH | NODE_TYPE | VARIABLE_NAME | TABLE_NAME | VALUE |
|---|---|---|---|---|
| BooksInCareOfDetail | parent | | | |
| BooksInCareOfDetail/PersonNm | terminal | F9_P0_IN_CARE_OF_NAME | F9-P00-T00-HEADER | Vanuel Bloss |
| BooksInCareOfDetail/USAddress/AddressLine1Txt | terminal | F9_P0_IN_CARE_OF_ADDR_L1 | F9-P00-T00-HEADER | 319 Linsbury Ct |
| BooksInCareOfDetail/USAddress/CityNm | terminal | F9_P0_IN_CARE_OF_ADDR_CITY | F9-P00-T00-HEADER | Gastonia |
| BooksInCareOfDetail/USAddress/StateAbbreviationCd | terminal | F9_P0_IN_CARE_OF_ADDR_STATE | F9-P00-T00-HEADER | NC |
| BooksInCareOfDetail/USAddress/ZIPCd | terminal | F9_P0_IN_CARE_OF_ADDR_ZIP | F9-P00-T00-HEADER | 28056 |
| BooksInCareOfDetail/PhoneNum | terminal | F9_P0_IN_CARE_OF_PHONE | F9-P00-T00-HEADER | 7048241662 |
Note that some XML nodes are "parent" nodes whose only role is to group child nodes ("terminal" nodes), providing the hierarchical structure that gives XML the flexibility to embed tables within tables. These nodes contain no data and are filtered out before the flattened XML file is converted into tables.
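To make the flattening concrete, here is a self-contained illustration using the xml2 package. It mimics the idea rather than reproducing the package's internal implementation, which also maps each xpath to its concordance variable and table names:

```r
library( xml2 )

x <- read_xml(
'<BooksInCareOfDetail>
  <PersonNm>Vanuel Bloss</PersonNm>
  <USAddress>
    <AddressLine1Txt>319 Linsbury Ct</AddressLine1Txt>
    <CityNm>Gastonia</CityNm>
  </USAddress>
</BooksInCareOfDetail>' )

# one row per node: elements with children are "parent" nodes,
# elements holding only a value are "terminal" nodes
nodes <- xml_find_all( x, "//*" )
data.frame(
  XPATH     = xml_path( nodes ),
  NODE_TYPE = ifelse( xml_length( nodes ) > 0, "parent", "terminal" ),
  VALUE     = ifelse( xml_length( nodes ) > 0, NA, xml_text( nodes ) )
)
```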
Flattened XML files are stored as a DuckDB instance in an AWS S3 bucket.
The table structures and variable names are all defined within the 990 Concordance file. Once the XML conversion to flat files is complete, the data is transformed by pivoting each table to a "wide" format:
F9-P00-T00-HEADER TABLE:
| F9_P0_IN_CARE_OF_NAME | F9_P0_IN_CARE_OF_ADDR_L1 | F9_P0_IN_CARE_OF_ADDR_CITY | F9_P0_IN_CARE_OF_ADDR_STATE | F9_P0_IN_CARE_OF_ADDR_ZIP | F9_P0_IN_CARE_OF_PHONE |
|---|---|---|---|---|---|
| Vanuel Bloss | 319 Linsbury Ct | Gastonia | NC | 28056 | 7048241662 |
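The pivot step itself can be sketched with tidyr (illustrative only; OBJECT_ID is a stand-in document key, not necessarily the package's internal identifier):

```r
library( tidyr )

# one flattened document in long form
long <- data.frame(
  OBJECT_ID     = "EFILE-2021-0001",
  VARIABLE_NAME = c( "F9_P0_IN_CARE_OF_NAME", "F9_P0_IN_CARE_OF_ADDR_CITY" ),
  VALUE         = c( "Vanuel Bloss", "Gastonia" )
)

# one column per variable, one row per document
pivot_wider( long, id_cols=OBJECT_ID, names_from=VARIABLE_NAME, values_from=VALUE )
```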
All Efile tables contain a consistent set of metadata fields that describe the organization (EIN, name), the type of filing (form type, tax year), and XML attributes (the original xpath, the URL of the raw file, and the time stamp of the return submission).
In addition, they contain a set of filtering variables that are useful when preparing data for analysis: indicators of whether the filing is an amended return or a group return, and an indicator of whether it is a partial return. These fields are used for removing redundant files, since organizations sometimes submit more than one return for the same tax year (for example, an original filing followed by an amended version that supersedes it).
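As a hedged sketch of that de-duplication (the returns data frame and the SubmittedOn timestamp field are assumptions, not the package's documented schema), one approach keeps only the most recent return per organization and tax year:

```r
library( dplyr )

# keep the latest submission per EIN and tax year so amended returns
# replace the originals they supersede (field names are assumptions)
returns_deduped <- returns |>
  group_by( EIN, TaxYear ) |>
  slice_max( SubmittedOn, n=1, with_ties=FALSE ) |>
  ungroup()
```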