Date Completed: 2025-11-11
This document describes what was migrated from the Galaxy Training Network to this repository and how the migration was accomplished.
All 13 architecture topics were successfully migrated from ~/workspace/training-material/topics/dev/tutorials/ into this structured repository. The migration transforms architecture documentation from presentation slides into a single source of truth that generates multiple output formats.
- ecosystem
- project-management
- principles
- files
- frameworks
- dependency-injection
- tasks
- application-components
- plugins
- client
- dependencies
- startup
- production
Before migrating content, we established a validation schema using Pydantic v2:
Pydantic Models (scripts/models.py):
- `TopicMetadata` - Validates metadata.yaml structure
- `TrainingMetadata` - Nested training configuration
- `SphinxMetadata` - Sphinx documentation metadata
- `TopicContent` - Validates content.yaml structure
- `ContentBlock` - Individual slide/prose blocks
Key Design Decisions:
- Metadata separate from content (YAML + Markdown)
- Content blocks can be inline or external files
- Smart defaults: slides render in both formats, prose in docs only
- Topic sequences tracked via `previous_to`/`continues_to` chain
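The "smart defaults" decision can be illustrated with a minimal sketch. It uses a plain dataclass as a stand-in for the Pydantic models in scripts/models.py, and the `render_in` override field and the `targets()` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed mapping of block type to default output formats,
# following "slides render in both formats, prose in docs only".
DEFAULT_TARGETS = {
    "slide": ("slides", "docs"),
    "prose": ("docs",),
}


@dataclass
class ContentBlock:
    id: str
    type: str  # "slide" or "prose"
    content: str = ""
    # Hypothetical explicit override; None means "use the default for this type".
    render_in: Optional[Tuple[str, ...]] = None

    def targets(self) -> Tuple[str, ...]:
        """Return the output formats this block should appear in."""
        if self.render_in is not None:
            return self.render_in
        return DEFAULT_TARGETS[self.type]


blocks = [
    ContentBlock(id="intro", type="slide"),
    ContentBlock(id="background", type="prose"),
]
slide_deck = [b.id for b in blocks if "slides" in b.targets()]
docs_page = [b.id for b in blocks if "docs" in b.targets()]
```

With these defaults, the prose block is left out of the slide deck unless a block overrides its targets explicitly.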
Validation (scripts/validate.py):
- Comprehensive checks of all topics
- Clear error messages for schema violations
- Related topic verification
- File reference validation
Each topic extracted from GTN slides and reorganized:
Source Format (GTN Training Material):
training-material/topics/dev/tutorials/architecture-N-<topic>/
    slides.html             # Remark.js presentation
    ../../assets/images/    # Asset images
    ../../images/           # Topic-specific images
Target Format (this repository):
topics/<topic-id>/
    metadata.yaml       # Topic configuration
    content.yaml        # Content blocks
    .claude/
        CLAUDE.md       # AI context for this topic
For each topic, we:
- Extracted slides from HTML presentations
- Created metadata.yaml with:
  - Training metadata (questions, objectives, key points, time estimation)
  - Sphinx metadata (section, subsection, level, toc_depth)
  - Hub metadata (audience, tags)
  - Related topics and code paths
  - Contributors and versioning info
- Created content.yaml with:
  - One block per slide
  - Type indicator (slide vs prose)
  - Unique block ID
  - Heading from slide
  - Content as markdown
- Extracted footnotes for topic sequencing:
  - Parsed navigation footnotes from last slides
  - Populated `continues_to`/`previous_to` fields
  - Established complete topic chain
- Copied assets:
  - Images from `assets/` directory → `doc/source/_images/`
  - PlantUML source files (.plantuml.txt) preserved alongside SVGs
  - Topic-specific images → `images/` directory
  - Asset references updated in content
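As a rough illustration of the "one block per slide" step, the sketch below splits Remark.js markdown on its `---` slide separators and records an ID, type, heading, and content for each slide. The function name `split_slides` and the `slide-N` ID scheme are assumptions for this example, not the behavior of scripts/migrate_topic.py, and it assumes any front matter has already been removed.

```python
import re

# Remark.js separates slides with a line that contains only '---'.
SLIDE_SEPARATOR = re.compile(r"^---[ \t]*$", flags=re.MULTILINE)
FIRST_HEADING = re.compile(r"^#+[ \t]*(.+)$", flags=re.MULTILINE)


def split_slides(markdown):
    """Return one dict per non-empty slide found in the markdown source."""
    blocks = []
    for raw in SLIDE_SEPARATOR.split(markdown):
        text = raw.strip()
        if not text:
            continue  # skip empty chunks, e.g. before a leading separator
        match = FIRST_HEADING.search(text)
        blocks.append({
            "id": f"slide-{len(blocks) + 1}",  # hypothetical ID scheme
            "type": "slide",
            "heading": match.group(1).strip() if match else None,
            "content": text,
        })
    return blocks


sample = (
    "# Galaxy Architecture\n\nOverview text\n\n"
    "---\n\n"
    "## Files\n\n- lib/galaxy\n- client/\n"
)
blocks = split_slides(sample)
```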
The original slides used Remark.js directives for styling. We implemented smart handling:
Handled Directives:
- `.code[...]` - Code block styling
- `.reduce90[...]`, `.enlarge150[...]` - Layout classes
- `.pull-left[...]` / `.pull-right[...]` - Side-by-side layout
- `.footnote[...]` - Navigation footer
Processing:
- Speaker notes stripped per-block (removed content after `???`)
- Directives unwrapped using bracket-counting algorithm
- Pull directives converted to side-by-side with horizontal dividers
- Bracket-counting approach handles nested brackets correctly
Two output formats now supported:
Generates GTN-compatible Remark.js slides:
Process:
- Load metadata.yaml (title, questions, objectives, key points)
- Load content.yaml blocks
- Filter to slide-type blocks
- Render with Jinja2 template matching GTN style
- Output to outputs/training-slides/generated/
Features:
- Dynamic metadata sections (questions, objectives, key points)
- Proper layout directives and classes
- Image integration
- Contributor attribution
- Time estimation display
Generates Galaxy Sphinx documentation in Markdown:
Process:
- Load metadata.yaml (title, questions, objectives, key points)
- Load content.yaml blocks
- Filter to slide-type blocks (prose skipped in Sphinx)
- Process markdown:
  - Strip speaker notes per-block
  - Unwrap Remark.js directives
  - Convert .pull-left/.pull-right to divider format
  - Convert image paths: `../../images/` → `../_images/`
  - Convert asset paths: `{{ site.baseurl }}/assets/images/` → `../_images/`
  - Convert bare URLs to markdown links
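The bare-URL conversion step could look roughly like the sketch below. The regex and the `link_bare_urls` name are assumptions for illustration; the build script may use a different pattern. The lookbehind skips URLs that already sit inside a markdown link or an autolink.

```python
import re

# Match http(s) URLs that are not directly preceded by '(', '<' or '[',
# i.e. URLs that are not already part of a markdown link or autolink.
BARE_URL = re.compile(r"(?<![\(<\[])\bhttps?://[^\s<>()\[\]]+")


def link_bare_urls(text):
    """Wrap bare URLs as [url](url), keeping trailing punctuation outside the link."""
    def replace(match):
        url = match.group(0)
        trailing = ""
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"[{url}]({url}){trailing}"

    return BARE_URL.sub(replace, text)


converted = link_bare_urls("See https://galaxyproject.org for details.")
untouched = link_bare_urls("[docs](https://docs.galaxyproject.org)")
```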
- Generate learning sections (questions, objectives, key takeaways)
- Write to doc/source/architecture/
Advanced Features:
- Bracket-counting algorithm for multi-line directive extraction
- Per-block speaker notes handling
- Topic ordering via
continues_to/previous_tochain - Index file generation with proper toctree ordering
Topics are ordered using metadata fields:
Chain Setup:
ecosystem → project-management → principles → files
→ frameworks → dependency-injection
Later topics:
- tasks, application-components, plugins, client, dependencies, startup, production
How It Works:
- Each topic has `previous_to` (what came before) and `continues_to` (what comes next)
- Extracted from navigation footnotes in original slides
- Sphinx index follows chain: `continues_to` links
- Falls back to alphabetical if chain breaks
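One way to read the ordering rules above is sketched below: follow `continues_to` links from the head of the chain, then append whatever is left in alphabetical order. The `order_topics` function and the plain-dict metadata shape are assumptions for this example, not the code in outputs/sphinx-docs/build.py.

```python
def order_topics(topics):
    """Order topic IDs by following continues_to, then alphabetically.

    `topics` maps a topic ID to a dict that may contain
    'previous_to' and 'continues_to' keys.
    """
    ordered = []
    seen = set()
    # Candidate chain heads: topics that continue somewhere but have no predecessor.
    heads = sorted(
        topic_id
        for topic_id, meta in topics.items()
        if meta.get("continues_to") and not meta.get("previous_to")
    )
    current = heads[0] if heads else None
    # Stop on a missing link, an unknown topic ID, or a cycle.
    while current in topics and current not in seen:
        ordered.append(current)
        seen.add(current)
        current = topics[current].get("continues_to")
    # Fallback: everything not reached through the chain, alphabetically.
    ordered.extend(sorted(t for t in topics if t not in seen))
    return ordered


topics = {
    "files": {"previous_to": "principles"},
    "ecosystem": {"continues_to": "project-management"},
    "project-management": {"previous_to": "ecosystem", "continues_to": "principles"},
    "principles": {"previous_to": "project-management", "continues_to": "files"},
    "tasks": {},
    "client": {},
}
ordering = order_topics(topics)
```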
All 13 topics validate successfully:
Validation Checks:
- Metadata schema compliance (Pydantic v2)
- Content block structure validation
- File reference resolution
- Related topic existence
- Image file verification
- Topic chain consistency
Command:
make validate
# or
python scripts/validate.py
Training Slides:
- Location: `outputs/training-slides/generated/`
- Can be copied to training-material repository
- Matches original GTN slide format
- All 13 topics generate successfully
Sphinx Documentation:
- Location: `doc/source/architecture/*.md`
- Ready for Galaxy documentation build
- Topic index with proper ordering
- All 13 topics with learning questions/objectives/key points
- Images properly linked
Problem: Remark.js directives with nested brackets failed with simple regex.
Example: `.footnote[[Dependency Injection](url)]` - the regex stopped at the first closing bracket instead of the one that matches the directive's opening bracket.
Solution: Implemented bracket-counting algorithm that:
- Tracks nesting depth
- Only closes at matching depth
- Handles multiple directives in sequence
- Works for all directive types
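A minimal sketch of the bracket-counting idea follows. It is not the code from `_unwrap_remark_directives()`; the function name `unwrap_directive` and its behavior on unbalanced input (leave the text untouched) are assumptions made for this example.

```python
def unwrap_directive(text, name):
    """Replace every `.name[...]` in text with its inner content."""
    marker = f".{name}["
    out = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            out.append(text[pos:])
            break
        out.append(text[pos:start])
        # Scan forward, tracking nesting depth; the directive's own '[' is depth 1.
        depth = 1
        i = start + len(marker)
        while i < len(text) and depth > 0:
            if text[i] == "[":
                depth += 1
            elif text[i] == "]":
                depth -= 1
            i += 1
        if depth != 0:
            # Unbalanced brackets: keep the remainder unchanged (assumed policy).
            out.append(text[start:])
            break
        # i now points just past the matching ']'.
        out.append(text[start + len(marker):i - 1])
        pos = i
    return "".join(out)


footnote = unwrap_directive(".footnote[[Dependency Injection](url)]", "footnote")
code = unwrap_directive(".code[a[0]] and .code[b]", "code")
```

Because the scan only closes when the depth returns to zero, the markdown link inside the footnote survives intact, which a non-greedy regex such as `\.footnote\[(.*?)\]` would cut short.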
Problem: Initial implementation stripped notes at document level, removing all content after first ???.
Solution: Moved speaker note stripping to per-block level in generate_topic_markdown().
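The difference between document-level and per-block stripping can be shown with a small sketch. The helper name `strip_speaker_notes` is hypothetical; only the `???` marker comes from Remark.js.

```python
import re

# Remark.js starts presenter notes with a line that contains only '???'.
NOTES_MARKER = re.compile(r"^\?\?\?[ \t]*$", flags=re.MULTILINE)


def strip_speaker_notes(block):
    """Drop everything from the first '???' line of one block onward."""
    return NOTES_MARKER.split(block, maxsplit=1)[0].rstrip()


blocks = [
    "# One\n\nBody one\n\n???\n\nNote one",
    "# Two\n\nBody two",
]

# Per-block: each slide loses only its own notes.
per_block = [strip_speaker_notes(b) for b in blocks]

# Document-level: the first '???' also swallows every later slide.
document_level = strip_speaker_notes("\n---\n".join(blocks))
```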
Problem: Markdown links with colons in URLs (e.g., https://) were matched as YAML directive syntax.
Solution: Replaced the substring check `':' in line` with the regex `^[\w_-]+:`, so that only lines starting with an identifier-style YAML key are matched.
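The effect of the stricter pattern can be checked with a short sketch. The helper name `looks_like_yaml_key` is hypothetical; the pattern is the one quoted above.

```python
import re

YAML_KEY = re.compile(r"^[\w_-]+:")


def looks_like_yaml_key(line):
    """True when the line starts with an identifier followed by a colon."""
    return bool(YAML_KEY.match(line))


link_line = "- See [docs](https://docs.galaxyproject.org)"
old_check = ":" in link_line                 # the earlier substring test
new_check = looks_like_yaml_key(link_line)   # the anchored regex
key_check = looks_like_yaml_key("layout: tutorial_slides")
```

One caveat worth keeping in mind: a line that begins with a bare URL still matches, since `https:` itself fits the identifier pattern.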
Problem: Some slides had both informational and navigation footnotes; code only extracted first one.
Solution: Updated extraction loop to remove ALL footnotes from last slide, not just first.
Problem: Missing import re in _unwrap_remark_directives() caused NameError.
Solution: Added missing import; verified bracket-counting extraction works for pull directives.
Problem: Asset images used different path than topic images.
Solution: Implemented separate path handling:
- `{{ site.baseurl }}/assets/images/` → `../_images/` (assets)
- `../../images/` → `../_images/` (topic images)
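The two rewrites can be sketched as follows. The function name `convert_image_paths` is an assumption, and tolerating variable whitespace inside `{{ site.baseurl }}` is an extra convenience that the build script may not include.

```python
import re

# Asset images are referenced through the Jekyll site variable in GTN slides.
ASSET_PREFIX = re.compile(r"\{\{\s*site\.baseurl\s*\}\}/assets/images/")
# Topic images are referenced relative to the tutorial directory.
TOPIC_PREFIX = "../../images/"
TARGET_PREFIX = "../_images/"


def convert_image_paths(markdown):
    """Point both asset and topic image references at ../_images/."""
    markdown = ASSET_PREFIX.sub(TARGET_PREFIX, markdown)
    return markdown.replace(TOPIC_PREFIX, TARGET_PREFIX)


source = (
    "![logo]({{ site.baseurl }}/assets/images/logo.svg) "
    "![diagram](../../images/diagram.svg)"
)
converted = convert_image_paths(source)
```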
| File | Purpose |
|---|---|
| `scripts/models.py` | Pydantic v2 validation models |
| `scripts/validate.py` | Topic validation script |
| `scripts/migrate_topic.py` | Migration helper (used during initial migration) |
| `outputs/training-slides/build.py` | Slide generation |
| `outputs/training-slides/template.html` | Remark.js template |
| `outputs/sphinx-docs/build.py` | Sphinx doc generation |
Each topic directory contains:
- `metadata.yaml` - Configuration
- `content.yaml` - Content blocks
- `.claude/CLAUDE.md` - AI context
| Location | Contents | Should Commit |
|---|---|---|
| `outputs/training-slides/generated/` | Built slides (HTML) | No - gitignored |
| `outputs/sphinx-docs/generated/` | Built Sphinx markdown | No - gitignored |
| `doc/source/architecture/*.md` | Sphinx output for build | Yes - this is final output |
| `doc/source/_images/` | Copied asset images | Yes - source of images |
| `images/` | Topic images & PlantUML source | Yes - source files |
- Topics migrated: 13
- Total slides extracted: ~150+
- Images copied: 40+
- Asset files: 10+
- PlantUML diagrams: 10+
- Lines of Python: ~500+ (build scripts)
- Lines of YAML: ~1000+ (metadata & content)
- Architecture docs locked in GTN slides (training-material repo)
- Single presentation format
- Hard to maintain across versions
- Difficult to extract for other uses
- Presentation markup mixed with content
- Single source of truth in this repo
- Multiple output formats (slides, Sphinx docs, future: Hub articles)
- Easy to update: edit markdown, regenerate outputs
- Clear separation: metadata + content
- Validation ensures consistency
- Topic sequencing tracked in metadata
See PLAN.md for roadmap covering:
- Real-world usage to find pain points
- GitHub Actions automation
- Migration to Galaxy repository
- Support for AI-driven knowledge base use cases