Commit 5439752

doc: add huggingface shared data table schema to admin/data_tables
Documents the 7-column `huggingface` data table schema, column conventions, controlled vocabularies for pipeline_tag/domain, XML filter patterns, and an example .loc entry. Announced in galaxyproject/galaxy-hub#3923.
1 parent 374869d commit 5439752

1 file changed

Lines changed: 115 additions & 0 deletions

doc/source/admin/data_tables.md

@@ -103,3 +103,118 @@ When a new tool is installed that uses a data table a new entry is added to
subdirectory in `tool_data_path` (in a subdirectory that has the name of the
toolshed). By default this is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
these directories will also contain tool data table config files, but they are unused.

## The `huggingface` shared data table

Galaxy tools that consume pre-downloaded Hugging Face models share a single
data table named `huggingface`. Using one shared table means admins maintain
one `.loc` file and all tools benefit from every registered model entry.

### Declaring the table

Add the following block to `tool_data_table_conf.xml`:

```xml
<!-- Hugging Face models -->
<table name="huggingface" comment_char="#" allow_duplicate_entries="False">
    <columns>value, name, pipeline_tag, domain, free_tag, version, path</columns>
    <file path="/opt/galaxy/tool-data/huggingface.loc" />
</table>
```

Each tool ships a `tool-data/huggingface.loc.sample` that uses the same
7-column layout.
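
Because tool XML filters refer to columns by index, it can help to derive the index of each column name from the table declaration instead of counting by hand. The following is a minimal standalone sketch using only the Python standard library; it is not how Galaxy itself loads the table, and the names `TABLE_XML` and `column_indices` are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# The table declaration from above, pasted in as a string for illustration.
TABLE_XML = """
<table name="huggingface" comment_char="#" allow_duplicate_entries="False">
    <columns>value, name, pipeline_tag, domain, free_tag, version, path</columns>
    <file path="/opt/galaxy/tool-data/huggingface.loc" />
</table>
"""


def column_indices(table_xml: str) -> dict[str, int]:
    """Map each column name in a <table> declaration to its 0-based index."""
    table = ET.fromstring(table_xml)
    columns_text = table.findtext("columns") or ""
    names = [name.strip() for name in columns_text.split(",")]
    return {name: index for index, name in enumerate(names)}


INDICES = column_indices(TABLE_XML)
print(INDICES["pipeline_tag"], INDICES["domain"], INDICES["path"])
```

The resulting indices are the ones to use in the `column` attribute of a filter.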

### Column reference

| # | Column | Purpose |
|---|--------|---------|
| 0 | `value` | Unique row ID across the whole table |
| 1 | `name` | Human-readable label shown in the Galaxy select widget |
| 2 | `pipeline_tag` | Model role (see controlled vocabulary below) |
| 3 | `domain` | Coarse data domain (see controlled vocabulary below) |
| 4 | `free_tag` | Optional narrowing tag; fallback filter when `pipeline_tag`/`domain` alone are not specific enough |
| 5 | `version` | Model version string |
| 6 | `path` | Path to the model data: a directory or a specific file, depending on the model structure |

**`value` (column 0)**

Must be globally unique across every row in `huggingface.loc`, regardless of
which tool added it. Use the Hugging Face model ID (`<owner>/<model-name>`)
directly, because it is stable and unambiguous. If the same model is registered
at more than one version, append the version to the ID (the example below uses
an underscore as the separator):

```
black-forest-labs/FLUX.1-dev
black-forest-labs/FLUX.1-dev_2
```

**`pipeline_tag` (column 2)**

Use the official [Hugging Face pipeline tag](https://huggingface.co/models).
Common values:

| Value | When to use |
|-------|-------------|
| `text-to-image` | Image generation models |
| `automatic-speech-recognition` | ASR / transcription models |
| `feature-extraction` | Sentence / document embedding models |
| `tabular-classification` | Tabular ML classifiers |
| `tabular-regression` | Tabular ML regressors |
| `text-generation` | Causal / instruction-tuned LLMs |

Do not invent synonyms for existing Hugging Face tags.

**`domain` (column 3)**

A broad category for the data type the model works with:
`image` · `text` · `audio` · `tabular` · `sequence` · `video` · `multimodal`

**`free_tag` (column 4)**

An optional short identifier used as a fallback narrowing filter when
`pipeline_tag` and `domain` alone are not specific enough. Because a model
can be consumed by multiple tools, `free_tag` must not encode a specific tool
name. Choose a short, lowercase, descriptive value and document it alongside
the tool that introduces it.

**`version` (column 5)**

The model version string. A tool declares in its XML which version(s) it
accepts, allowing multiple versions of the same model to coexist. Where
possible, rows are only added, never removed or edited.

**`path` (column 6)**

The path to the model data on the production server (maintained by admins).
It can be a directory (when the tool reads the whole Hugging Face cache layout)
or a specific file (e.g. a `.ckpt` checkpoint).

### XML filter convention

Filter primarily by `pipeline_tag` (column 2) and/or `domain` (column 3) so
only relevant model types are shown to the user. Add a `version` or
`free_tag` filter only when you need to narrow the selection further:

```xml
<!-- Example values; substitute the pipeline_tag, domain, version and free_tag your tool needs -->
<param name="model" type="select" label="Model">
    <options from_data_table="huggingface">
        <filter type="static_value" column="2" value="text-to-image"/>
        <filter type="static_value" column="3" value="image"/>
        <!-- optional: narrow further by version or free_tag -->
        <!-- <filter type="static_value" column="5" value="1"/> -->
        <!-- <filter type="static_value" column="4" value="flux"/> -->
    </options>
</param>
```
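
To see which rows a pair of such filters would leave in the select widget, the filtering can be emulated outside Galaxy. This is a simplified sketch that assumes a `static_value` filter keeps rows whose column exactly equals the given value; Galaxy's real implementation may behave differently in details. The helper `static_value_filter` and the second row are hypothetical and exist only for illustration.

```python
# Rows as they would appear after splitting each .loc line on TABs.
ROWS = [
    ["black-forest-labs/FLUX.1-dev", "FLUX.1 [dev]", "text-to-image",
     "image", "flux", "1", "/data/hf_models"],
    # Hypothetical row, not a real registered model.
    ["example-org/example-asr", "Example ASR", "automatic-speech-recognition",
     "audio", "asr", "1", "/data/hf_models"],
]


def static_value_filter(rows, column, value):
    """Keep only the rows whose field at `column` equals `value` exactly."""
    return [row for row in rows if row[column] == value]


# Apply the two primary filters in sequence: pipeline_tag, then domain.
selected = static_value_filter(ROWS, 2, "text-to-image")
selected = static_value_filter(selected, 3, "image")

for row in selected:
    print(row[0], "->", row[1])
```

Chaining the filters narrows the list step by step, which mirrors why `version` and `free_tag` filters are only needed when the first two still leave too many rows.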

### Example `.loc` entry

Each row is TAB-separated (7 columns):

```
# Columns: value <TAB> name <TAB> pipeline_tag <TAB> domain <TAB> free_tag <TAB> version <TAB> path
#
# Flux
black-forest-labs/FLUX.1-dev	FLUX.1 [dev]	text-to-image	image	flux	1	/data/hf_models
```
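
Since a row with spaces instead of TABs or a repeated `value` is easy to miss by eye, a small check before deploying a `.loc` file may be useful. The sketch below is an assumption about what such a check could look like, not a tool shipped with Galaxy; `check_loc`, `GOOD` and `BAD` are hypothetical names, and only the column count, `value` uniqueness, and the `domain` vocabulary listed above are verified.

```python
from __future__ import annotations

COLUMNS = ["value", "name", "pipeline_tag", "domain", "free_tag", "version", "path"]
DOMAINS = {"image", "text", "audio", "tabular", "sequence", "video", "multimodal"}


def check_loc(text: str) -> list[str]:
    """Return a list of human-readable problems found in .loc content."""
    problems = []
    seen = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and comment lines (comment_char is "#").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            problems.append(
                f"line {line_number}: expected {len(COLUMNS)} columns, found {len(fields)}"
            )
            continue
        row = dict(zip(COLUMNS, fields))
        if row["value"] in seen:
            problems.append(f"line {line_number}: duplicate value {row['value']!r}")
        seen.add(row["value"])
        if row["domain"] not in DOMAINS:
            problems.append(f"line {line_number}: unknown domain {row['domain']!r}")
    return problems


FIELDS = ["black-forest-labs/FLUX.1-dev", "FLUX.1 [dev]", "text-to-image",
          "image", "flux", "1", "/data/hf_models"]
GOOD = "# Flux\n" + "\t".join(FIELDS) + "\n"
# Same row again, but separated by spaces instead of TABs.
BAD = GOOD + " ".join(FIELDS) + "\n"

print(check_loc(GOOD))
print(check_loc(BAD))
```

Building the sample rows with `"\t".join(...)` avoids depending on an editor preserving literal TAB characters.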
