Commit 3e90e5d (parent 229dd0f): NEW: AI-ready
File changed: blog/2025-07-18-ai-ready.md

# AI-ready bioinformatics infrastructure
If you've checked out the new QIIME 2 homepage at https://qiime2.org, you may have noticed that we describe the QIIME 2 Framework (Q2F) as "AI ready".
That's not intended as just a buzzy catchphrase: I think Q2F provides ideal bioinformatics infrastructure for AI-based biological data science.
In this post, I'll explain why, and I'd love to hear from you about this idea of AI readiness in bioinformatics software.
## Managing data like code
In a traditional bioinformatics workflow (think of a BLAST search), a user has some code - a software program - that they apply to their data, and the program generates a report.
In this universe, the program and the data are effectively the input[^query-and-reference-data], and the report is the output.
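The traditional workflow described above can be sketched as a tiny toy "search" standing in for BLAST. Everything here (the `search` function, the reference sequences) is hypothetical and illustrative: a fixed program plus data go in, and a human-readable report comes out.

```python
# Sketch of the traditional workflow: program + data -> report.
# A toy exact-substring "search" standing in for BLAST; the reference
# sequences are hypothetical.
def search(query, references):
    """Report which reference sequences contain the query substring."""
    hits = [name for name, seq in references.items() if query in seq]
    return f"query={query} hits={','.join(hits) if hits else 'none'}"

references = {"refA": "ACGTACGT", "refB": "TTTTGGGG"}
print(search("CGTA", references))  # -> query=CGTA hits=refA
```

The point is that `search` itself never changes as a result of running it; only the report depends on the data supplied.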
Let's contrast that with a machine-learning/AI-based bioinformatics workflow.
An example here could be a hypothetical microbiome-based disease predictor - something many people want as an outcome of their research.
Here a variety of models might be explored: training data is provided, and a report on test data informs which classification model is best.
In this universe, the training data, test data, and reports are the input, and a computer program (the disease predictor) is the output.
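That inversion can be sketched in plain Python with a minimal nearest-centroid classifier. All of the data and names here are hypothetical, and this is not a QIIME 2 API: the training data is the input, and the returned function - the predictor - is the output program.

```python
# Sketch of the ML/AI workflow inversion: training data is the input,
# and a program (the classifier) is the output. Hypothetical data;
# plain Python, not a QIIME 2 API.
def train_centroid_classifier(samples, labels):
    """Return a *program* (a function) built from training data."""
    # Accumulate per-class sums and counts over the training vectors.
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [v / counts[label] for v in acc] for label, acc in sums.items()
    }

    def predict(features):
        # Classify by nearest centroid (squared Euclidean distance).
        return min(
            centroids,
            key=lambda lbl: sum(
                (f - c) ** 2 for f, c in zip(features, centroids[lbl])
            ),
        )

    return predict

# Hypothetical tiny "microbiome" feature table (relative abundances).
train_X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = ["healthy", "healthy", "disease", "disease"]

predictor = train_centroid_classifier(train_X, train_y)  # data in, program out
print(predictor([0.85, 0.15]))  # -> healthy
print(predictor([0.15, 0.85]))  # -> disease
```

Different training data would produce a different `predictor`, which is exactly why the data here plays the role that source code plays in the traditional workflow.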
A consumer of the output of the traditional workflow is reviewing a report - for example, a page of BLAST results or a figure in a paper.
A consumer of the output of the ML/AI workflow is applying a computer program to their data.
In the traditional workflow, a different reference database could change the results in the report.
In the ML/AI workflow, different training data will change how the algorithm itself works.
For this reason, the data for the ML/AI workflow needs to be managed as code is managed in both workflows: that includes clear versioning, licensing, and provenance (since invalid data preparation can lead to an invalid model).
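A minimal sketch of what "managing data like code" could mean in practice: attach a content hash (the version), a UUID (the identity), and licensing and source metadata to a piece of data. This is illustrative only, using Python's standard library - it is not how QIIME 2 Artifacts are implemented internally.

```python
# Sketch of "managing data like code": record a version (content hash),
# an identity (UUID), and minimal provenance for a data file.
# Illustrative only; not QIIME 2's internal Artifact format.
import hashlib
import json
import uuid
from datetime import datetime, timezone

def describe_data(data_bytes, source, license_name):
    return {
        "uuid": str(uuid.uuid4()),                          # identity
        "sha256": hashlib.sha256(data_bytes).hexdigest(),   # content version
        "source": source,                                   # where it came from
        "license": license_name,                            # reuse terms, like code
        "created": datetime.now(timezone.utc).isoformat(),
    }

record = describe_data(
    b"ACGTACGT\n", source="hypothetical-db v1", license_name="CC-BY-4.0"
)
print(json.dumps(record, indent=2))
```

With a record like this, two analyses can be compared on whether they used byte-identical data, just as two builds can be compared on a git commit hash.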
**Pick up here.**
**It is increasingly important to understand the source and licensing of data and pre-processing pipelines in full detail.**

Because all data is stored in QIIME 2 Artifacts that are immutable and assigned Universally Unique Identifiers (UUIDs) on creation, all data is validated and versioned. Because its provenance tracking system records source code versions, environment configurations, analysis parameters, and input and output data UUIDs, and assigns a UUID to each execution of a job, it serves as an effective experiment management platform.

As new concerns arise - for example, the recognition over the past two years that metagenomics host-read filtering tools have been failing to adequately ensure research study participant privacy [[61]](https://paperpile.com/c/NBOx3J/g033), or that a specific taxonomic classification approach led to erroneous taxonomy assignments [[62]](https://paperpile.com/c/NBOx3J/vftN) - this detailed information enables comprehensive and definitive review of old results to reassess what was done, and can facilitate recommendations for researchers on how to adapt their workflows to address errors or adopt evolving best practices.

The QIIME 2 framework's Parsl-based [[26, 27]](https://paperpile.com/c/NBOx3J/PbBj+wTfV) parallel Pipeline execution enables scaling of analyses from multi-core laptops to thousands of interconnected compute nodes that don't share memory, and we will continue to improve our Parsl integration in QIIME 2 with advice from the Parsl team (see Letter of Collaboration from Chard). It supports Pipeline resumption (i.e., the ability to reuse results that were already computed when re-running a failed job), which saves the time and energy associated with large computations. Finally, the separation of analytic code (plugins) from interfaces has enabled the development of interfaces geared toward users with varying levels of computational training, including Galaxy-based and other GUIs, a command line interface, and a Python 3 API, all providing access to the same analytic functionality.

All of the features outlined here are achieved "for free" on creation of QIIME 2 plugins, a process that is fully documented in *Developing with QIIME 2*.
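To make the provenance-tracking idea concrete, here is a minimal sketch of recording a provenance record per job execution: each run gets its own UUID and captures the action version, parameters, and input/output data UUIDs. The structure, the `run_with_provenance` helper, and the `filter_low_abundance` action are all hypothetical - this is not QIIME 2's actual provenance format.

```python
# Sketch of per-execution provenance, in the spirit of the features
# described above: each run records a UUID, code version, parameters,
# and input/output data UUIDs. Hypothetical structure; not QIIME 2's
# actual provenance format.
import uuid

def run_with_provenance(action, action_version, inputs, parameters):
    """Run `action` and return (outputs, provenance record)."""
    outputs = action(inputs, **parameters)
    record = {
        "execution_uuid": str(uuid.uuid4()),
        "action_version": action_version,
        "parameters": parameters,
        "input_uuids": [i["uuid"] for i in inputs],
        "output_uuids": [o["uuid"] for o in outputs],
    }
    return outputs, record

def filter_low_abundance(inputs, min_total):
    # Hypothetical action: keep tables whose total counts meet a floor,
    # emitting each surviving table as a new artifact with a new UUID.
    kept = [i for i in inputs if sum(i["counts"]) >= min_total]
    return [{"uuid": str(uuid.uuid4()), "counts": i["counts"]} for i in kept]

tables = [
    {"uuid": str(uuid.uuid4()), "counts": [5, 3]},
    {"uuid": str(uuid.uuid4()), "counts": [1, 0]},
]
outputs, prov = run_with_provenance(
    filter_low_abundance, "1.0.0", tables, {"min_total": 4}
)
print(len(outputs), prov["parameters"])  # -> 1 {'min_total': 4}
```

Given records like this for every step, an old result can be traced back through output UUIDs to the exact inputs, parameters, and code versions that produced it - which is what makes the retrospective reviews described above possible.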
[^query-and-reference-data]: In the BLAST search example, both the query sequence(s) and the reference data are inputs.
