NEW: AI-ready

gregcaporaso · gregcaporaso · commit e822b38308c2 · 2025-07-31T09:59:59.000-07:00
diff --git a/blog/2025-07-18-ai-ready.md b/blog/2025-07-18-ai-ready.md
@@ -0,0 +1,44 @@
+# What does AI-ready bioinformatics infrastructure need to do?
+
+If you've checked out the new QIIME 2 homepage at https://qiime2.org, you may have noticed that we describe the QIIME 2 Framework (Q2F) as "AI ready".
+That's not intended just as a buzzy catchphrase: I think that Q2F provides ideal bioinformatics infrastructure for AI-based biological data science.
+In this post, I'll start to explain why, and in upcoming posts I'll continue to explore this idea.
+
+In a traditional bioinformatics workflow such as an alignment-based homology search (e.g., a BLAST search), a user would have a software program and some data.
+They would apply the program to their data, and the program would generate a report.
+Here we can think of the program and the data as the input[^query-and-reference-data], and the report as the output.
+
+```{image} ./2025-07-18-images/traditional-bioinformatics-workflow.png
+:alt: Traditional bioinformatics workflow schematic
+:width: 500px
+:align: center
+```
+
+Let's contrast that with a machine-learning/AI-based bioinformatics workflow.
+An example here could be building a microbiome-based disease predictor - something many people would like to produce as an outcome of their research.
+Here a variety of models and/or parameter settings might be explored: each candidate is trained on the training data, and a report on its performance on the test data informs which classifier is best.
+In this universe, the data[^training-and-test-data] and the reports are effectively the input, and a computer program (the disease predictor) is the output.
+
+```{image} ./2025-07-18-images/ai-bioinformatics-workflow.png
+:alt: AI bioinformatics workflow schematic
+:width: 500px
+:align: center
+```
+
+A consumer of the output of the traditional workflow is reviewing a report - for example a page of BLAST results or a figure in a paper.
+A consumer of the output of the ML/AI workflow is applying a computer program to their data - for example to assess whether their microbiome data contains disease indicators.
+Because the output is now something that gets executed rather than read, this creates security risks, requires consideration of how to distribute the output so others can use it, and generally means that detailed data validation and preparation steps need to be provided to consumers.
+
+In the traditional workflow, a difference in the query or reference data could change the results in the report.
+In the ML/AI workflow, a difference in the training or test data could change how the algorithm itself works.
+The line between data and code is therefore blurred in the ML/AI workflow, which suggests that data needs to be managed like code, including unambiguous identification of data (and of different versions of data) and clear licensing about whether and how data can be used for training ML/AI models.
+
+In the traditional workflow, a human applies one program with one set of parameters (or maybe a few).
+In the ML/AI workflow, many iterations of parameter settings may be explored across different computer programs to find the best-performing combination.
+This makes a strong case for automated record keeping of the variants that have been applied, and also underscores a need for computational efficiency including recovery from job failures.
+
+The QIIME 2 Framework already solves some of these problems, while others are a work in progress.
+Future posts will touch on these ideas, and I'm also interested to hear what I might be missing in this question of what it means for bioinformatics infrastructure to be AI ready.
+
+[^query-and-reference-data]: In the BLAST search example, both the query sequence(s) and the reference data are inputs.
+[^training-and-test-data]: In the ML/AI example, both the training and test data are inputs.
diff --git a/blog/myst.yml b/blog/myst.yml
@@ -21,6 +21,7 @@ project:
   toc:
     - file: intro.md
     - file: 2025-06-11-stacks.md
+    - file: 2025-07-18-ai-ready.md
 
   downloads:
     - file: _static/environment.yml