# What does AI-ready bioinformatics infrastructure need to do?

If you've checked out the new QIIME 2 homepage at https://qiime2.org, you may have noticed that we describe the QIIME 2 Framework (Q2F) as "AI ready".
That's not intended just as a buzzy catchphrase: I think that Q2F provides ideal bioinformatics infrastructure for AI-based biological data science.
In this post, I'll start talking about why, and in upcoming posts I'll continue to explore this idea.

In a traditional bioinformatics workflow such as an alignment-based homology search (e.g., a BLAST search), a user would have a software program and some data.
They would apply the program to their data, and the program would generate a report.
Here we can think of the program and the data as the input[^query-and-reference-data], and the report as the output.

```{image} ./2025-07-18-images/traditional-bioinformatics-workflow.png
:alt: Traditional bioinformatics workflow schematic
:width: 500px
:align: center
```
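
To make that input/output shape concrete, here's a toy sketch in Python. The scoring function is a deliberately simplified, hypothetical stand-in for a real aligner like BLAST, but the shape of the workflow is the same: a program and some data go in, and a report comes out.

```python
# Toy illustration of the traditional workflow: program + data in, report out.
# The "aligner" below is a hypothetical stand-in for BLAST, not a real one.

def identity_score(query: str, subject: str) -> float:
    """Fraction of matching positions over the shorter sequence."""
    n = min(len(query), len(subject))
    if n == 0:
        return 0.0
    matches = sum(q == s for q, s in zip(query, subject))
    return matches / n

def homology_report(query: str, reference_db: dict[str, str]) -> list[tuple[str, float]]:
    """Score the query against every reference sequence; return hits, best first."""
    hits = [(name, identity_score(query, seq)) for name, seq in reference_db.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical query and reference data.
reference_db = {"seq-A": "ACGTACGT", "seq-B": "ACGTTTTT", "seq-C": "GGGGGGGG"}
report = homology_report("ACGTACGA", reference_db)
for name, score in report:
    print(f"{name}\t{score:.2f}")
```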

Let's contrast that with a machine-learning/AI-based bioinformatics workflow.
An example here could be building a microbiome-based disease predictor - something many people would like to produce as an outcome of their research.
Here a variety of models and/or parameter settings might be explored, where training data is provided and a report on test data informs which is the best classifier.
In this universe, the data[^training-and-test-data] and the reports are effectively the input, and a computer program (the disease predictor) is the output.

```{image} ./2025-07-18-images/ai-bioinformatics-workflow.png
:alt: AI bioinformatics workflow schematic
:width: 500px
:align: center
```
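
A minimal sketch of that inversion, using hypothetical toy data and two deliberately simple model families (a one-feature threshold classifier and a majority-class baseline) standing in for real microbiome classifiers: the data and the test report are the inputs, and the selected model, a runnable program, is the output.

```python
# Toy illustration of the ML/AI workflow: data + reports in, program out.
# The data and "model families" are hypothetical stand-ins for a real search
# over models and parameter settings.

def accuracy(model, data):
    """Fraction of (feature, label) pairs the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

def train_threshold_model(training_data):
    """Fit a one-feature threshold classifier by grid search on training data."""
    def model_for(t):
        return lambda x: "disease" if x > t else "healthy"
    candidates = sorted(x for x, _ in training_data)
    best_t = max(candidates, key=lambda t: accuracy(model_for(t), training_data))
    return model_for(best_t)

def train_majority_model(training_data):
    """Baseline: always predict the most common training label."""
    labels = [label for _, label in training_data]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority

# Hypothetical data: (feature value, true label) pairs.
training_data = [(0.1, "healthy"), (0.2, "healthy"), (0.3, "healthy"),
                 (0.8, "disease"), (0.9, "disease")]
test_data = [(0.15, "healthy"), (0.7, "disease"), (0.25, "healthy"), (0.85, "disease")]

# Explore a variety of models; a report on test data informs which is best.
models = {"threshold": train_threshold_model(training_data),
          "majority": train_majority_model(training_data)}
report = {name: accuracy(model, test_data) for name, model in models.items()}

# The output is a computer program: the best-performing disease predictor.
disease_predictor = models[max(report, key=report.get)]
print(report, disease_predictor(0.75))
```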

A consumer of the output of the traditional workflow is reviewing a report - for example, a page of BLAST results or a figure in a paper.
A consumer of the output of the ML/AI workflow is applying a computer program to their data - for example, to assess whether their microbiome data contains disease indicators.
This creates security risks, requires consideration of how to distribute the output so others can use it, and generally means that detailed data validation and preparation steps need to be provided to consumers.

In the traditional workflow, a difference in the query or reference data could change the results in the report.
In the ML/AI workflow, a difference in the training or test data could change how the algorithm itself works.
The line between data and code is therefore blurred in the ML/AI workflow, which suggests that data needs to be managed as code is: with unambiguous identification of data (and of different versions of data), and clear licensing describing if and how data can be used for training ML/AI models.
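
As a simple illustration of unambiguous identification (this is an illustration only, not how any particular framework implements it), a content hash gives every distinct version of a dataset its own identifier, so even a one-byte change is distinguishable:

```python
import hashlib

# Identify data (and versions of data) unambiguously via a content hash:
# any change to the bytes yields a different identifier.
def data_version_id(data: bytes) -> str:
    """Return a short, content-derived identifier for a dataset."""
    return hashlib.sha256(data).hexdigest()[:12]

v1 = data_version_id(b"sample-1,healthy\nsample-2,disease\n")
v2 = data_version_id(b"sample-1,healthy\nsample-2,healthy\n")  # one label changed
print(v1, v2, v1 != v2)
```

The same bytes always map to the same identifier, so the identifier can travel with the data and be recorded alongside the models trained from it.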

In the traditional workflow, one program and one set of parameters (or maybe a few) are applied by a human.
In the ML/AI workflow, many combinations of programs and parameter settings may be explored in an optimization process.
This makes a strong case for automated record keeping of the variants that have been applied, and also underscores a need for computational efficiency, including recovery from job failures.
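
The record keeping and failure recovery described above can be sketched with a minimal, hypothetical run log: every program/parameter variant that completes is recorded, and a re-run reuses recorded results instead of recomputing them (a real system would persist the log and track full provenance).

```python
import json

# Minimal, in-memory sketch of automated record keeping with resumption.
run_log: dict[str, float] = {}  # one entry per program/parameter variant computed

def run_variant(program: str, params: dict, compute) -> float:
    """Run one variant, reusing a recorded result if it already exists."""
    key = json.dumps({"program": program, "params": params}, sort_keys=True)
    if key not in run_log:  # skip work a previous (possibly failed) run finished
        run_log[key] = compute(**params)
    return run_log[key]

def train_and_score(threshold: float) -> float:
    return 1.0 - abs(threshold - 0.3)  # stand-in for an expensive training job

for t in (0.1, 0.2, 0.3):            # initial sweep, interrupted here
    run_variant("threshold-model", {"threshold": t}, train_and_score)
for t in (0.1, 0.2, 0.3, 0.4, 0.5):  # re-run: the first three results are reused
    run_variant("threshold-model", {"threshold": t}, train_and_score)

print(len(run_log))  # 5 distinct variants recorded, none computed twice
```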

Some of this the QIIME 2 Framework currently solves, while some is still a work in progress.
Future posts will touch on these ideas, and I'm also interested to hear what I might be missing in this question of what it means for bioinformatics infrastructure to be AI ready.

[^query-and-reference-data]: In the BLAST search example, both the query sequence(s) and the reference data are inputs.
[^training-and-test-data]: In the ML/AI example, both the training and test data are inputs.