The goal of backsearchr is to automate backward citation searching. This is a common step in literature reviews - researchers look through the papers cited in a review to check that they have considered (included or excluded) all relevant papers.
You can install the development version of backsearchr like so:
    remotes::install_github("mrc-ide/backsearchr")

Use Podman Compose for a reproducible extraction/comparison workflow without local R/Java setup.
- Podman
- Podman Compose (podman compose)
    data/
      in/
        pdfs/              # input PDFs
        provided_refs.csv  # references from your main list
      out/                 # outputs written here
Create mount folders once:
    mkdir -p data/in/pdfs data/out

Build the image:
    podman compose build

The supported local image tag is localhost/backsearchr:local. You can verify that Podman sees it with:
    podman images | grep backsearchr

By default, the compose setup builds and runs as linux/amd64. This keeps GROBID working on both x86_64 hosts and ARM64 hosts running emulation.
If you need to override that, set BACKSEARCHR_PLATFORM:
    BACKSEARCHR_PLATFORM=linux/amd64 podman compose build

Extract references with CERMINE:
    podman compose run --rm backsearchr \
      extract \
      --method cermine \
      --indir /work/data/in/pdfs \
      --output /work/data/out/extracted_refs.csv

Extract references with GROBID:
    podman compose run --rm backsearchr \
      extract \
      --method grobid \
      --indir /work/data/in/pdfs \
      --output /work/data/out/extracted_refs.csv

Compare extracted vs provided references:
    podman compose run --rm backsearchr \
      compare \
      --provided /work/data/in/provided_refs.csv \
      --extracted /work/data/out/extracted_refs.csv \
      --outdir /work/data/out

Run end-to-end in one command:
    podman compose run --rm backsearchr \
      extract-and-compare \
      --method grobid \
      --indir /work/data/in/pdfs \
      --provided /work/data/in/provided_refs.csv \
      --outdir /work/data/out

Standard outputs in data/out:
- extracted_refs.csv
- matched.csv
- not_matched.csv
- doi_matched.csv
- titles_matched.csv
- Permission denied on output files:
  - Ensure data/out is writable on the host.
- Platform and architecture:
  - The default compose platform is linux/amd64 because GROBID's bundled Linux native library is x86_64-only.
  - On ARM64 hosts this usually runs via emulation.
  - If you override BACKSEARCHR_PLATFORM, GROBID may fail on non-x86_64 Linux targets.
- GROBID path override:
  - Set BACKSEARCHR_GROBID_JAR and/or BACKSEARCHR_GROBID_HOME when running compose if you want to use custom assets.
- No PDFs found:
  - Confirm files are under data/in/pdfs and --indir points to /work/data/in/pdfs.
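
Several of the checks above can be run before starting a container. The following is a minimal sketch, not part of backsearchr: the helper names (platform_mode, count_pdfs, preflight) are hypothetical, and it assumes PDFs sit directly under data/in/pdfs rather than in subfolders.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these names come from backsearchr.

# Classify a machine name (as printed by `uname -m`) relative to the
# default linux/amd64 platform.
platform_mode() {
  case "$1" in
    x86_64|amd64)  echo native ;;
    arm64|aarch64) echo emulated ;;
    *)             echo unknown ;;
  esac
}

# Count files with a .pdf extension (any case) directly under a directory.
# Assumption: input PDFs are not nested in subfolders.
count_pdfs() {
  find "$1" -maxdepth 1 -type f -iname '*.pdf' 2>/dev/null | wc -l | tr -d ' '
}

# Check a data directory laid out as <dir>/in/pdfs and <dir>/out.
# Returns 0 when both checks pass, 1 otherwise.
preflight() {
  data_dir=$1
  status=0
  if [ "$(count_pdfs "$data_dir/in/pdfs")" -eq 0 ]; then
    echo "no PDFs found in $data_dir/in/pdfs" >&2
    status=1
  fi
  if [ ! -d "$data_dir/out" ] || [ ! -w "$data_dir/out" ]; then
    echo "$data_dir/out is missing or not writable" >&2
    status=1
  fi
  return "$status"
}

# Usage:
#   platform_mode "$(uname -m)"
#   preflight data && podman compose run --rm backsearchr extract ...
```

These checks only look at the host side of the mount, so a passing preflight narrows down a problem but does not rule out one inside the container.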
There are two steps to backward citation searching:
- Given a list of papers, extract the references from each paper. For details, go to the section Extracting References.
- Given a list of references (call it the main list) and another list of papers that are to be checked for inclusion (call it the check list), identify the papers in the main list that are not in the check list. For details, go to the section Comparing References.
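
At its simplest, the second step is a set difference. The sketch below illustrates the idea on plain text files holding one identifier (for example a DOI) per line. The function name and file layout are assumptions made for illustration; this is not how backsearchr itself stores or compares references.

```shell
#!/bin/sh
# Print the lines of the main list ($1) that do not appear, as whole lines,
# in the check list ($2). Hypothetical helper, for illustration only.
not_in_check_list() {
  # -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
  # grep exits with status 1 when nothing is printed; treat that as success.
  grep -Fxv -f "$2" "$1" || [ $? -eq 1 ]
}

# Usage:
#   not_in_check_list main_dois.txt check_dois.txt
```

Exact line matching like this breaks down as soon as identifiers are missing or formatted inconsistently, which is why the comparison in the package uses several rules rather than one.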
There are multiple ways to extract references from a paper. We have considered two here:
- Using an API, e.g., rscopus.
- Using a PDF parser. We support CERMINE and GROBID for this purpose. The advantages of this method are:
  - parsers can be run locally and do not require an API key.
  - there is no need to worry about rate limits.
  - there is no need to worry about a Scopus subscription.
If neither of the above methods suits you, you can compile the list of references in any other way: the comparison functionality only needs a list and is agnostic to how that list was obtained.
Comparing the main list against the check list is non-trivial for several reasons. First, with automatically compiled lists, we cannot rely on all fields used for comparison being present. Second, there might be inconsistencies between fields due to OCR errors, formatting differences, or spurious spaces. We use the following rules to compare papers:
- If the DOI of a paper in the main list is an exact match to the DOI of a paper in the check list, then the papers are considered to be the same. However, the DOI is often missing, especially from older papers, in which case we use the second rule.
- If the title of a paper in the main list is an exact match to the title of a paper in the check list, then the papers are considered to be the same. The comparison is case-insensitive and punctuation is removed.
- Finally, we fall back to comparing metadata. If the first author, year of publication, and journal of a paper in the main list are an exact match to the corresponding fields of a paper in the check list, then the papers are flagged as being potentially similar. As before, the comparison is case-insensitive and punctuation is removed.
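
The rules above can be sketched for a single pair of records as follows. This is an illustrative shell sketch under stated assumptions, not the package's implementation. The record layout (tab-separated doi, title, first author, year, journal) and the function names are hypothetical, and collapsing repeated spaces is an added assumption motivated by the "spurious spaces" problem mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the three comparison rules.

# Lower-case, drop punctuation, collapse runs of whitespace, trim the ends.
# Case folding via tr depends on the locale and is most reliable for ASCII.
normalise() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -d '[:punct:]' \
    | tr -s '[:space:]' ' ' \
    | sed 's/^ //; s/ $//'
}

# Print the Nth tab-separated field of a record.
field() {
  printf '%s\n' "$1" | cut -f "$2"
}

# Build a metadata key from first author, year and journal.
# Prints nothing if any of the three fields is missing.
meta_key() {
  author=$(normalise "$(field "$1" 3)")
  year=$(normalise "$(field "$1" 4)")
  journal=$(normalise "$(field "$1" 5)")
  if [ -n "$author" ] && [ -n "$year" ] && [ -n "$journal" ]; then
    printf '%s|%s|%s' "$author" "$year" "$journal"
  fi
}

# Compare two records and print the first rule that matches:
#   doi      - identical, non-empty DOIs (same paper)
#   title    - identical normalised titles (same paper)
#   metadata - identical author/year/journal (only potentially similar)
#   none     - no rule matched
match_rule() {
  a=$1
  b=$2
  doi_a=$(field "$a" 1)
  doi_b=$(field "$b" 1)
  if [ -n "$doi_a" ] && [ "$doi_a" = "$doi_b" ]; then
    echo doi
    return 0
  fi
  title_a=$(normalise "$(field "$a" 2)")
  title_b=$(normalise "$(field "$b" 2)")
  if [ -n "$title_a" ] && [ "$title_a" = "$title_b" ]; then
    echo title
    return 0
  fi
  key_a=$(meta_key "$a")
  key_b=$(meta_key "$b")
  if [ -n "$key_a" ] && [ "$key_a" = "$key_b" ]; then
    echo metadata
    return 0
  fi
  echo none
}
```

Requiring a non-empty value before each comparison keeps two records with missing DOIs or titles from being treated as the same paper. A metadata result should be read as a prompt for manual checking rather than a confirmed match.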
library(backsearchr)
## basic example code