This repository contains code, data, and visualizations for a random side project examining taxonomic patterns of publications in Nature or Science that name at least one new species based on fossil material (i.e., extinct taxa). It is only a food-for-thought retrospective analysis and is not intended to be used as a predictive tool for deciding which high-profile journal to target with a manuscript naming a novel taxon.
- Author: Bryan M. Gee (ORCID: 0000-0003-4517-3290)
- Email: bryangee.temnospondyli@gmail.com
- Version: 1.1.0
- Last updated: 2026/03/06
There are four scripts in this repository:
- taxonomy-retrieval.py: This script takes the input file with the list of new species and their associated bibliographic information and retrieves information from the PBDB API on taxonomic ranks.
- taxon-classification.py: This script takes the manually annotated output file from the first script and then applies a series of classifications to bin species into more logical bins for comparison.
- plots.py: This script generates the plots.
- article-retrieval.py: This script retrieves metadata (on authors mainly) for papers describing new species.
There are two data files in this repository:
- input-data.csv: This is the data in the original collected format. You will need this if you want to re-run the code.
- {date}_df-classified.csv: This is the data in the final output version after taxonomic rank information has been compiled, manual clean-up and augmentation has been performed, and the semi-haphazard classification scheme described below has been applied. You don't need this to re-run the code unless you skip to the plotting part. It will otherwise be regenerated if you run the first two scripts.
I first made web queries for the phrase 'sp. nov. fossil' in Nature and Science's websites, restricting the search to articles published between 2016 and 2016, inclusive. The links are:
These searches have to be done through the journals' own sites, rather than through a public resource like the Crossref API, because the search has to cover the entire text, which is often paywalled, not just the publicly available title and abstract.
Because the sample size is a little low, and these journals are atypical, I wanted a representative of a more "normal" journal. I went with the Journal of Vertebrate Paleontology for two reasons: (1) Journal of Paleontology's date filters get messed up for old articles that were digitized (these are filtered by their date of being put online, not the original publication date) and (2) most high-profile extinct taxa naming papers are on vertebrates. The link for JVP. As a note, there are almost 750 articles returned for JVP in this time range, so the data for JVP is only through 2020, inclusive. The dataset now includes data for Palaeontology through 2010, for Current Biology through 2006, and for PNAS through 2000. For all journals, only 'typical' research articles are included; formally designated 'reviews', errata, features, etc. are not included (not that they tend to be describing new species anyway).
These assessments only capture information on new species, which are often, but hardly always, associated with a new genus. Standalone new genera for existing species (comb. nov.), new subgenera, new subspecies, or any new supra-generic rank are not recorded.
I then examined each article manually and extracted basic information about novel taxa at the species level: journal, article title, publication year, DOI, taxon name, country the holotype was discovered in, and geologic era and period. In the few instances in which a species was described but uncertainly placed (indicated via a '?' or 'cf.' in front of the genus name), it was recorded as being in the genus without this qualifier to facilitate downstream processes. Two Nature articles were excluded: Zeng et al., 2026, which described a speciose Cambrian assemblage in which more than half of the >150 recognized species are considered new but for which the authors did not formalize names (probably due to Nature's page limits), and Moore et al., 2024, which is about a parasitoid wasp in modern Drosophila. One Science article was excluded: Miao et al., 2022, which improperly used 'sp. nov.' in reference to a species named in 2008 and which invokes no taxonomic act. Similar issues are found in JVP, Palaeontology, PNAS, and Current Biology articles but are too numerous to list here. JVP's search results also seem to return hits if 'sp. nov.' appears in the references. Any article in which a new species was not formalized was omitted in the downstream scripting process.
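The qualifier handling described above was done by hand during data entry, but the same rule can be sketched in code. This is only an illustration: the helper name `strip_qualifier` and the example binomial are hypothetical and do not come from the repository's scripts.

```python
import re

# Matches a leading '?' or 'cf.' marker (hypothetical helper, not from the repo).
_QUALIFIER = re.compile(r"^\s*(?:\?|cf\.)\s*", flags=re.IGNORECASE)


def strip_qualifier(taxon_name: str) -> str:
    """Return the taxon name without a leading '?' or 'cf.' marker."""
    cleaned = _QUALIFIER.sub("", taxon_name)
    # Collapse any repeated whitespace left behind by the removal.
    return " ".join(cleaned.split())


# 'Examplesaurus primus' is a made-up name used purely for illustration.
for raw in ["?Examplesaurus primus", "cf. Examplesaurus primus", "Examplesaurus primus"]:
    print(f"{raw!r} -> {strip_qualifier(raw)!r}")
```

Names without a qualifier pass through unchanged, so the function is safe to apply to a whole column.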
In order to standardize how information about the taxonomic ranks of each novel taxon was treated, I utilized the Paleobiology Database (PBDB) REST API. This is not a perfect mechanism, and I am well aware of the PBDB's limitations, but the only real alternative that could ensure coverage for all sampled taxa was Wikipedia; a lot of papers truncate the systematic paleontology or depict ranks not equivalent to those in other papers. A dynamic resource is also preferable to the original paper, as the taxonomy of some recently named taxa may have shifted to a different consensus over time. Note that some novel trace fossil ichnospecies are not categorized (their producer may be unknown).
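As a rough illustration of what a rank lookup against the PBDB can look like, here is a minimal sketch. It is not the code in taxonomy-retrieval.py. It assumes the PBDB data service v1.2 `taxa/single` endpoint with the `show=class` block and the expanded `pbdb` vocabulary, and it assumes the rank fields are named `phylum`, `class`, `order`, `family`, and `genus`; check the PBDB documentation before relying on any of these. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for single-taxon lookups in PBDB data service v1.2.
PBDB_TAXON_ENDPOINT = "https://paleobiodb.org/data1.2/taxa/single.json"

# Assumed field names when requesting show=class with vocab=pbdb.
RANK_FIELDS = ("phylum", "class", "order", "family", "genus")


def build_pbdb_url(taxon_name: str) -> str:
    """Build the query URL for one taxon name."""
    params = {"name": taxon_name, "show": "class", "vocab": "pbdb"}
    return f"{PBDB_TAXON_ENDPOINT}?{urlencode(params)}"


def extract_ranks(payload: dict) -> dict:
    """Pull the rank fields out of a decoded JSON response.

    Ranks that are absent come back as None so that gaps are easy to spot
    and fill in manually afterwards.
    """
    records = payload.get("records", [])
    if not records:
        return {rank: None for rank in RANK_FIELDS}
    record = records[0]
    return {rank: record.get(rank) for rank in RANK_FIELDS}


print(build_pbdb_url("Tyrannosaurus rex"))
```

With `requests` installed, the decoded payload could be obtained with something like `requests.get(build_pbdb_url(name), timeout=30).json()` and passed to `extract_ranks`.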
The PBDB doesn't always have data for every taxon, so after retrieving information on the taxa that it does cover, I used a combination of Wikipedia and the PBDB to identify slightly higher ranks (e.g., family) that might be listed in the PBDB. Anyone re-running the script would need to do the same manual steps for missing data. I also did some minor cleaning up: replacing a few genus names that had to be changed due to preoccupation, correcting an offset in which the PBDB output was shifted by one somewhere in the middle of the merged dataframe (still trying to figure this out, but it has no functional import at this point), and changing 'United States' listings to 'United States of America' to prepare the data for geopandas specifically.
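The country renaming step can be sketched as a simple lookup. This uses plain dictionaries rather than a pandas dataframe to keep it self-contained; the rows and the helper name `harmonize_country` are hypothetical, and only the 'United States' mapping comes from the description above.

```python
# Recorded name -> name expected when joining to the map shapefile.
COUNTRY_RENAMES = {"United States": "United States of America"}


def harmonize_country(name: str) -> str:
    """Return the shapefile-compatible country name, or the input if unmapped."""
    stripped = name.strip()
    return COUNTRY_RENAMES.get(stripped, stripped)


# Made-up rows for illustration only.
rows = [
    {"taxon": "Hypothetical taxon A", "country": "United States"},
    {"taxon": "Hypothetical taxon B", "country": "China"},
]
cleaned = [{**row, "country": harmonize_country(row["country"])} for row in rows]
print(cleaned)
```

In pandas, the equivalent would be along the lines of `df["country"].replace(COUNTRY_RENAMES)`, assuming a column named `country`.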
To make the plots slightly more comprehensible and more focused on disparities between clades, I created a series of Boolean columns based on whether a taxon was listed as belonging to a clade (e.g., 'Vertebrata') and then used those columns to create conditional categorizations (e.g., if a taxon was listed as a tetrapodomorph but not as an amniote, it is classified as 'Amphibians and friends'). These categories are hardly symmetrical, don't always hew to Linnaean ranks, sometimes refer to paraphyletic groups, and will probably be tweaked in the future.
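The Boolean-then-conditional logic can be sketched as follows. This is a simplified stand-in for taxon-classification.py, not a copy of it: only 'Vertebrata' and 'Amphibians and friends' appear in the description above, while the clade spellings 'Tetrapodomorpha' and 'Amniota' and the other bin labels are assumptions made for illustration.

```python
from typing import Iterable


def classify(lineage: Iterable[str]) -> str:
    """Assign a plotting bin from the set of clades a taxon is listed under."""
    clades = set(lineage)

    # Boolean flags, analogous to one Boolean column per clade.
    is_vertebrate = "Vertebrata" in clades
    is_tetrapodomorph = "Tetrapodomorpha" in clades
    is_amniote = "Amniota" in clades

    # Conditional categorization built from the flags.
    if is_tetrapodomorph and not is_amniote:
        return "Amphibians and friends"
    if is_amniote:
        return "Amniotes"  # hypothetical bin label
    if is_vertebrate:
        return "Other vertebrates"  # hypothetical bin label
    return "Non-vertebrates"  # hypothetical bin label


print(classify({"Vertebrata", "Tetrapodomorpha"}))
```

The order of the checks matters: the more specific conditions come first so that a broad flag such as the vertebrate one only acts as a fallback.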
This is just a fun side project for me, but it would be easy to either expand the temporal range for sampling journal articles (or maybe to just skip to a decade-long time bin from a few decades ago) or to expand the journal scope (e.g., Proc B, PNAS, Current Biology). If you find this interesting and want to contribute, feel free to fork (and make a PR if you want to merge back).
(Still under development.) In order to gain insight into trends by publishing authors on these papers, I utilized the OpenAlex REST API. For those not familiar with the resource, OpenAlex is a freely available bibliographic resource that ingests and standardizes metadata from Crossref (the people who mint DOIs for journal articles) and DataCite (the people who mint DOIs for datasets and software). Information on ROR-standardized affiliations and country was extracted. For those not familiar with ROR, this is a global registry of research institution names intended to mitigate problems of writing the same institution's name in many different ways (sort of like ORCID but for institutions). You technically do not need an OpenAlex API key for the type of analysis that this script does (querying individual DOIs rather than doing a large search), but I would still recommend getting one as a way of being polite about the requests you make. Plus, this might be a cool resource to investigate further if you ever do any bibliographic work, and then you might be using endpoints that do require a key.
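For orientation, a per-DOI lookup and affiliation extraction might look roughly like the sketch below. It is not the code in article-retrieval.py. It assumes that OpenAlex accepts the `doi:` shorthand in the works path, that an optional key is passed as an `api_key` query parameter, and that each entry in `authorships` carries an `institutions` list with `display_name`, `ror`, and `country_code`; verify these against the OpenAlex documentation. The function names are hypothetical.

```python
from typing import Optional
from urllib.parse import quote, urlencode

OPENALEX_WORKS = "https://api.openalex.org/works/"


def build_openalex_url(doi: str, api_key: Optional[str] = None) -> str:
    """Build the URL for a single work, identified by its DOI."""
    url = f"{OPENALEX_WORKS}doi:{quote(doi, safe='/')}"
    if api_key:
        # Assumed parameter name for supplying a key.
        url += "?" + urlencode({"api_key": api_key})
    return url


def extract_affiliations(work: dict) -> list:
    """Flatten a work record into one row per author-institution pair."""
    rows = []
    for authorship in work.get("authorships", []):
        author_name = (authorship.get("author") or {}).get("display_name")
        for institution in authorship.get("institutions", []):
            rows.append(
                {
                    "author": author_name,
                    "institution": institution.get("display_name"),
                    "ror": institution.get("ror"),
                    "country_code": institution.get("country_code"),
                }
            )
    return rows


# '10.1234/example' is a placeholder DOI, not a real article.
print(build_openalex_url("10.1234/example"))
```

Authors with no listed institution produce no rows in this sketch, so a real analysis would need to decide whether to keep them with empty affiliation fields instead.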
Metadata about journal publications and taxonomic classifications is typically regarded as non-creative and thus not eligible to be held in copyright. All data included here are thus treated as being in the public domain (CC0 license waiver). The code is licensed under MIT, which basically allows anyone to do anything with it as long as they credit the original (read the License file for more). The visualizations are creative, but I see no point in copyrighting them since anyone can regenerate them on their own with my code and dataset.
Anyone should feel free to re-use and re-purpose these materials. If you make use of public APIs, please make sure to use best practices for making polite requests to REST APIs. You will need to install various Python modules not in the standard library: geopandas, matplotlib, pandas, and requests. These are all standard, widely used modules available on PyPI. If you want to produce the map plots, you will need to get the shapefile from: https://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-admin-0-countries/.
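One simple way to keep requests polite is to pause between calls. The sketch below shows that idea only; the helper name `polite_fetch` and the one-second default delay are assumptions rather than values taken from this repository, and each API's own documentation should be checked for its actual rate limits.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, object]]:
    """Call fetch(url) for each URL, pausing between consecutive requests.

    The fetch and sleep callables are injectable so the pacing logic can be
    tried out without touching the network.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        yield url, fetch(url)


# Dry run with a stand-in fetch function and no real waiting.
for url, result in polite_fetch(["a", "b"], fetch=str.upper, sleep=lambda s: None):
    print(url, result)
```

In real use, `fetch` could be something like `lambda u: requests.get(u, timeout=30).json()`, ideally with a User-Agent header that identifies you.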
- Version 1.1.0 provides updated data and the script for retrieving metadata on articles.
- Version 1.0.1 fixes two data entry bugs that did not exert significant impact on initial graphs. Assessment of two other tickets concluded that they are not currently issues.