
Commit 4cafd91

Merge pull request #22511 from arash77/docs/huggingface-data-table
doc: add huggingface shared data table schema
2 parents 2758b55 + dbe7d36 commit 4cafd91

1 file changed

Lines changed: 125 additions & 0 deletions

doc/source/admin/data_tables.md

@@ -103,3 +103,128 @@ When a new tool is installed that uses a data table a new entry is added to
subdirectory in `tool_data_path` (in a subdirectory that has the name of the
toolshed). By default this is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
these directories will also contain tool data table config files, but they are unused.

## The `huggingface` shared data table

Galaxy tools that consume pre-downloaded Hugging Face models share a single
data table named `huggingface`. Using one shared table means admins maintain
one `.loc` file and all tools benefit from every registered model entry.

### Declaring the table

Add the following block to `tool_data_table_conf.xml`:

```xml
<!-- Hugging Face models -->
<table name="huggingface" comment_char="#" allow_duplicate_entries="False">
    <columns>value, name, pipeline_tag, domain, free_tag, version, path</columns>
    <file path="/opt/galaxy/tool-data/huggingface.loc" />
</table>
```

Each tool ships a `tool-data/huggingface.loc.sample` that uses the same
7-column layout.
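
The order of names in `<columns>` fixes the numeric column indices used throughout the rest of this section. As a rough illustration only (this is not Galaxy code), the following Python sketch parses the declaration with the standard library and derives a name-to-index mapping. The function name `column_indices` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The table declaration from above, embedded as a string for the example.
TABLE_XML = """
<table name="huggingface" comment_char="#" allow_duplicate_entries="False">
    <columns>value, name, pipeline_tag, domain, free_tag, version, path</columns>
    <file path="/opt/galaxy/tool-data/huggingface.loc" />
</table>
"""


def column_indices(table_xml):
    """Map each declared column name to its zero-based index (hypothetical helper)."""
    table = ET.fromstring(table_xml.strip())
    columns_text = table.findtext("columns", default="")
    names = [name.strip() for name in columns_text.split(",") if name.strip()]
    return {name: index for index, name in enumerate(names)}


if __name__ == "__main__":
    for name, index in column_indices(TABLE_XML).items():
        print(index, name)
```

A mapping like this can help when reading tool XML, where filters refer to columns by index rather than by name.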

### Column reference

| # | Column | Purpose |
|---|--------|---------|
| 0 | `value` | Unique row ID across the whole table |
| 1 | `name` | Human-readable label shown in the Galaxy select widget |
| 2 | `pipeline_tag` | Model role (see controlled vocabulary below) |
| 3 | `domain` | Coarse data domain (see controlled vocabulary below) |
| 4 | `free_tag` | Optional narrowing tag; fallback filter when `pipeline_tag`/`domain` alone are not specific enough |
| 5 | `version` | Model version string |
| 6 | `path` | Path to the model data, a directory or a specific file, depending on the model structure |

**`value` (column 0)**

Must be globally unique across every row in `huggingface.loc`, regardless of
which tool added it. Use the Hugging Face model ID (`<owner>/<model-name>`)
directly, since it is stable and unambiguous. If the same model is registered at
more than one version, append the version to the model ID, separated by an
underscore as in the last example below:

```
black-forest-labs/FLUX.1-dev
sentence-transformers/all-MiniLM-L6-v2
openai/whisper-large-v3_3.0
```
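
The naming rule above can be sketched in Python as follows. This is a minimal illustration, assuming the underscore separator seen in the `openai/whisper-large-v3_3.0` example; the helper names `make_value` and `find_duplicates` are hypothetical and not part of Galaxy.

```python
def make_value(model_id, version=None):
    """Build a `value` from a Hugging Face model ID, optionally with a version.

    Assumption: the version is appended after an underscore, as in the
    example `openai/whisper-large-v3_3.0`.
    """
    if "/" not in model_id:
        raise ValueError(f"expected '<owner>/<model-name>', got {model_id!r}")
    if version is None:
        return model_id
    return f"{model_id}_{version}"


def find_duplicates(values):
    """Return the values that occur more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


if __name__ == "__main__":
    print(make_value("openai/whisper-large-v3", "3.0"))
    print(find_duplicates(["a/b", "c/d", "a/b"]))
```

Running a duplicate check like this over column 0 before editing `huggingface.loc` is one way to catch collisions between rows added for different tools.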

**`pipeline_tag` (column 2)**

Use the official [Hugging Face pipeline tag](https://huggingface.co/models).
Common values:

| Value | When to use |
|-------|-------------|
| `text-to-image` | Image generation models |
| `automatic-speech-recognition` | ASR / transcription models |
| `feature-extraction` | Sentence / document embedding models |
| `tabular-classification` | Tabular ML classifiers |
| `tabular-regression` | Tabular ML regressors |
| `text-generation` | Causal / instruction-tuned LLMs |

Do not invent synonyms for existing Hugging Face tags.

**`domain` (column 3)**

A broad category for the data type the model works with:
`image` · `text` · `audio` · `tabular` · `sequence` · `video` · `multimodal`
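
Because `domain` is a closed vocabulary, a row can be checked mechanically before it is added. The sketch below is one possible check, assuming an exact, case-sensitive match against the seven values listed above; `is_valid_domain` is a hypothetical helper name.

```python
# The seven domain values listed in this document.
DOMAINS = frozenset(
    {"image", "text", "audio", "tabular", "sequence", "video", "multimodal"}
)


def is_valid_domain(domain):
    """Return True if `domain` is one of the documented values (exact match)."""
    return domain in DOMAINS


if __name__ == "__main__":
    for candidate in ("image", "Image", "speech"):
        print(candidate, is_valid_domain(candidate))
```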

**`free_tag` (column 4)**

An optional short identifier used as a fallback narrowing filter when
`pipeline_tag` and `domain` alone are not specific enough. Because a model
can be consumed by multiple tools, `free_tag` must not encode a specific tool
name. Choose a short, lowercase, descriptive value and document it alongside
the tool that introduces it.

**`version` (column 5)**

The model version string. A tool declares in its XML which version(s) it
accepts, allowing multiple versions of the same model to coexist. Where
possible, rows are only added, never removed or edited.

**`path` (column 6)**

The path to the model data on the production server (maintained by admins).
Can be a directory (when the tool reads the whole Hugging Face cache layout)
or a specific file (e.g. a `.ckpt` checkpoint).
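
Since `path` may point at either a directory or a single file, an admin-side sanity check might classify each entry before it is registered. The following is only a sketch using the Python standard library; `describe_model_path` is a hypothetical name, and the path in the usage lines is the one from the example `.loc` row in this document.

```python
import os


def describe_model_path(path):
    """Classify a `path` column value as 'directory', 'file', or 'missing'."""
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


if __name__ == "__main__":
    # Whether this path exists depends on the server the script runs on.
    print(describe_model_path("/data/hf_models"))
```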

### XML filter convention

Filter primarily by `pipeline_tag` (column 2) and/or `domain` (column 3) so
only relevant model types are shown to the user. Add a `version` or
`free_tag` filter only when you need to narrow the selection further:

```xml
<param name="model" type="select" label="Model">
    <options from_data_table="huggingface">
        <filter type="static_value" column="2" value="<pipeline_tag>"/>
        <filter type="static_value" column="3" value="<domain>"/>
        <!-- optional: narrow further by version or free_tag -->
        <!-- <filter type="static_value" column="5" value="<version>"/> -->
        <!-- <filter type="static_value" column="4" value="<free_tag>"/> -->
    </options>
</param>
```

Example from the Flux tool (filters by `free_tag` to restrict to Flux-specific model variants):

```xml
<options from_data_table="huggingface">
    <filter type="static_value" column="4" value="flux"/>
    <filter type="static_value" column="5" value="1"/>
</options>
```
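
To make the effect of stacked `static_value` filters concrete, the Python sketch below approximates the selection logic: a row is offered only if every filtered column equals the given value exactly. It illustrates the convention and is not Galaxy's implementation. The first row is the Flux entry from this document; the second row's label and path are made up for the example, and `select_rows` is a hypothetical name.

```python
# Each row: value, name, pipeline_tag, domain, free_tag, version, path
ROWS = [
    ("black-forest-labs/FLUX.1-dev", "FLUX.1 [dev]", "text-to-image",
     "image", "flux", "1", "/data/hf_models"),
    # Hypothetical row, for illustration only.
    ("openai/whisper-large-v3_3.0", "Whisper large v3",
     "automatic-speech-recognition", "audio", "", "3.0", "/data/hf_models"),
]


def select_rows(rows, filters):
    """Keep rows where every (column_index, value) pair matches exactly.

    All comparisons are on strings, because .loc fields are plain text.
    """
    return [
        row for row in rows
        if all(row[column] == value for column, value in filters)
    ]


if __name__ == "__main__":
    # Mirrors the Flux example: column 4 == "flux" and column 5 == "1".
    for row in select_rows(ROWS, [(4, "flux"), (5, "1")]):
        print(row[0], "->", row[1])
```

Note that the version filter compares against the string `"1"`, so a row whose version column reads `1.0` would not match under this exact-equality assumption.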

### Example `.loc` entry

Each row is TAB-separated (7 columns):

```
# Columns: value <TAB> name <TAB> pipeline_tag <TAB> domain <TAB> free_tag <TAB> version <TAB> path
#
# Flux
black-forest-labs/FLUX.1-dev	FLUX.1 [dev]	text-to-image	image	flux	1	/data/hf_models
```
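
A `.loc` file in this layout is simple to read programmatically. The sketch below is a minimal parser, assuming lines starting with `#` are comments (matching `comment_char="#"` in the table declaration) and that every data row has exactly 7 TAB-separated fields. The names `HFRow` and `parse_loc` are hypothetical, and the sample row is the Flux entry shown above.

```python
from typing import NamedTuple


class HFRow(NamedTuple):
    """One row of huggingface.loc, in declared column order."""
    value: str
    name: str
    pipeline_tag: str
    domain: str
    free_tag: str
    version: str
    path: str


def parse_loc(text):
    """Parse .loc content into HFRow tuples, skipping comments and blank lines."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(
                f"line {line_number}: expected 7 columns, got {len(fields)}"
            )
        rows.append(HFRow(*fields))
    return rows


# Sample content built with explicit tabs so the separators are visible.
SAMPLE_LOC = "# Flux\n" + "\t".join([
    "black-forest-labs/FLUX.1-dev", "FLUX.1 [dev]", "text-to-image",
    "image", "flux", "1", "/data/hf_models",
]) + "\n"

if __name__ == "__main__":
    for row in parse_loc(SAMPLE_LOC):
        print(row.value, row.pipeline_tag, row.path)
```

Rejecting rows with the wrong column count catches the most common editing mistake, where tabs are replaced by spaces and a row collapses into fewer fields.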
