Guidelines for AI agents operating in this repository.
Gyoshu is a scientific research agent extension for OpenCode. It provides:
- Persistent Python REPL with structured output markers
- Jupyter notebook integration for reproducible research
- Session management for research workflows
| Agent | Role | Korean | What They Do |
|---|---|---|---|
| Gyoshu | Professor | 교수 | Plans research, orchestrates workflow, manages sessions |
| Jogyo | Teaching Assistant | 조교 | Executes Python code, runs experiments, generates outputs |
| Baksa | PhD Reviewer | 박사 | Adversarial verifier - challenges claims, calculates trust scores |
| Jogyo Paper Writer | Grad Student | 조교 | Transforms raw findings into narrative research reports |
Gyoshu uses OpenCode's built-in model aliases by default:
| Agent | Default Model | Role |
|---|---|---|
| Gyoshu | sonnet | Research planner |
| Baksa | sonnet | Adversarial verifier |
| Jogyo | sonnet | Research executor |
| Jogyo Paper Writer | sonnet | Report writer |
| Jogyo Feedback | sonnet | Feedback explorer |
| Jogyo Insight | sonnet | Evidence gatherer |
OpenCode supports these model aliases (mapped to actual model IDs via your opencode.json):
| Alias | Typical Mapping | Use Case |
|---|---|---|
| `sonnet` | Claude Sonnet | Balanced capability, default for most agents |
| `opus` | Claude Opus | Complex reasoning, research planning |
| `haiku` | Claude Haiku | Fast, simple tasks |
For best research quality, configure premium models in your opencode.json:
```json
{
  "model": "anthropic/claude-opus-4-5-high",
  "small_model": "anthropic/claude-haiku-4-5"
}
```

| Agent | Recommended Premium Model | Why |
|---|---|---|
| Gyoshu | `anthropic/claude-opus-4-5-high` | Complex research planning requires top-tier reasoning |
| Baksa | `anthropic/claude-opus-4-5-high` | Adversarial verification needs strong critical thinking |
| Jogyo | `anthropic/claude-sonnet-4-5-high` | Balanced capability for code execution |
| Jogyo subagents | `anthropic/claude-sonnet-4-5-high` | Consistent quality across tasks |
Important: Agent files in `~/.config/opencode/agent/` are managed by Gyoshu and will be reset on package updates. To persistently override agent models, use `opencode.json`:
```json
{
  "agent": {
    "gyoshu": {
      "model": "anthropic/claude-opus-4-5-high"
    },
    "baksa": {
      "model": "anthropic/claude-opus-4-5-high"
    },
    "jogyo": {
      "model": "anthropic/claude-sonnet-4-5-high"
    }
  }
}
```

This method survives package updates because `opencode.json` is user-controlled.
Gyoshu tracks installed files in ~/.config/opencode/.gyoshu/install.json. Files listed there are considered "Gyoshu-owned" and will be updated when the package updates. This ensures you always have the latest agent prompts and capabilities.
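As a sketch, file ownership can be checked by reading that manifest. The `files` key and the helper name below are illustrative assumptions; the actual `install.json` schema may differ.

```python
# Sketch: list files Gyoshu considers "Gyoshu-owned" by reading install.json.
# Assumes owned paths are stored under a top-level "files" key (hypothetical).
import json
from pathlib import Path

def gyoshu_owned_files(config_home: Path) -> list:
    """Return the file paths recorded in .gyoshu/install.json, if any."""
    manifest = config_home / ".gyoshu" / "install.json"
    if not manifest.exists():
        return []
    data = json.loads(manifest.read_text())
    return data.get("files", [])
```

Anything not listed in the manifest is left untouched by package updates.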
If you need to customize agent behavior beyond model selection, consider:
- Using `opencode.json` overrides (recommended)
- Creating custom agents with different names
- Forking the package for deep customizations
Note: Premium models (Anthropic, OpenAI) require API keys configured in OpenCode.
```bash
# Run all tests
pytest

# Run with verbose output (default via pyproject.toml)
pytest -v --tb=short

# Run a single test file
pytest tests/test_bridge.py

# Run a specific test class
pytest tests/test_bridge.py::TestParseMarkers

# Run a single test
pytest tests/test_bridge.py::TestParseMarkers::test_simple_marker

# Run with coverage
pytest --cov=src/bridge --cov-report=term-missing
```

```bash
# Run all tests
bun test

# Watch mode for development
bun test --watch

# Run a specific test file
bun test src/tool/session-manager.test.ts
```

This is an OpenCode extension - no compilation needed. TypeScript files are executed directly by Bun.
```python
# Standard library first (alphabetical within each category)
import argparse
import json
import os
import sys
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Callable

# Third-party next (blank line before)
import pytest

# Local imports last (blank line before)
from gyoshu_bridge import parse_markers, execute_code
```

```python
from typing import Any, Dict, List, Optional

def execute_code(code: str, namespace: dict) -> Dict[str, Any]:
    """Execute Python code in the given namespace."""
    ...

def parse_markers(text: str) -> List[Dict]:
    ...
```

```python
"""Module-level docstring at the top of each file.

Describe the module's purpose and key components.
Include protocol formats, methods, or usage examples.
"""
```

```python
def send_response(
    id: Optional[str],
    result: Optional[Dict] = None,
    error: Optional[Dict] = None
) -> None:
    """Send JSON-RPC 2.0 response via protocol channel."""
    ...
```

- `UPPER_SNAKE_CASE` for constants: `JSON_RPC_VERSION`, `ERROR_PARSE`
- `PascalCase` for classes: `ExecutionState`, `TestParseMarkers`
- `snake_case` for functions/variables: `send_response`, `parse_markers`
- `_leading_underscore` for private/internal: `_send_protocol`, `_protocol_fd`
```python
# =============================================================================
# SECTION NAME IN ALL CAPS
# =============================================================================
# Code for this section...
```

```python
# Use specific exception types with descriptive context
try:
    result = json.loads(data)
except json.JSONDecodeError as e:
    return make_error(ERROR_PARSE, f"Parse error: {e}")
except TimeoutError as e:
    result["exception"] = str(e)
    result["exception_type"] = "TimeoutError"
except KeyboardInterrupt:
    result["exception"] = "Execution interrupted"
    result["exception_type"] = "KeyboardInterrupt"
except Exception as e:
    # Last resort - always include type and message
    result["exception"] = str(e)
    result["exception_type"] = type(e).__name__

# Never use bare except
# Never silently swallow exceptions
```

```typescript
// External packages first
import { tool } from "@opencode-ai/plugin";

// Built-in Node modules next
import * as fs from "fs/promises";
import * as path from "path";
import * as os from "os";

// Local modules last (use multi-line for readability)
import { durableAtomicWrite, fileExists, readFile } from "../lib/atomic-write";
import {
  getRuntimeDir,
  getSessionDir,
  ensureDirSync,
  existsSync,
} from "../lib/paths";
```

```typescript
/**
 * Session Manager - OpenCode tool for managing Gyoshu research sessions
 *
 * Provides CRUD operations for session manifests with:
 * - Atomic, durable writes to prevent data corruption
 * - Cell execution tracking with content hashes
 *
 * @module session-manager
 */
```

```typescript
// Import from centralized path resolver (see src/lib/paths.ts)
import { getRuntimeDir, getResearchDir } from "../lib/paths";

/**
 * Get the runtime directory for a specific session.
 * Uses centralized path resolver for consistency.
 */
const runtimeDir = getRuntimeDir(sessionId);

/**
 * Get the research directory for storing research manifests.
 * Always use path helpers instead of hardcoding paths.
 */
const researchDir = getResearchDir();
```

```typescript
// Descriptive JSDoc for each interface

/**
 * Environment metadata captured for reproducibility.
 */
interface EnvironmentMetadata {
  /** Python interpreter version */
  pythonVersion: string;
  /** Operating system platform */
  platform: string;
}

// Use type for unions
type SessionMode = "PLANNER" | "AUTO" | "REPL";
type GoalStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "BLOCKED";
```

- `PascalCase` for interfaces/types: `SessionManifest`, `CellExecution`
- `UPPER_SNAKE_CASE` for constants: `DEFAULT_TIMEOUT`, `MAX_RETRIES`
- `camelCase` for variables/functions: `researchSessionID`, `readFile`
```python
import pytest

class TestModuleName:
    """Tests for module_name - brief description."""

    def test_specific_behavior(self):
        """What this test verifies."""
        result = function_under_test(input)
        assert result["expected_key"] == expected_value

    @pytest.fixture
    def setup_data(self):
        """Fixture description."""
        return {"test": "data"}
```

```typescript
import { describe, test, expect } from "bun:test";

describe("ModuleName", () => {
  test("specific behavior", () => {
    const result = functionUnderTest(input);
    expect(result.expectedKey).toBe(expectedValue);
  });
});
```

Gyoshu provides two commands for all research operations:
| Command | Purpose |
|---|---|
| `/gyoshu [subcommand\|goal]` | Unified interactive research command |
| `/gyoshu-auto <goal>` | Autonomous research (hands-off bounded execution) |
The main entry point for all research operations. Supports subcommands and direct goals.
| Subcommand | Description | Example |
|---|---|---|
| (no args) | Show status and suggestions | `/gyoshu` |
| `<goal>` | Start new research with discovery | `/gyoshu analyze customer churn` |
| `plan <goal>` | Create research plan only | `/gyoshu plan classify iris species` |
| `continue [id]` | Continue existing research | `/gyoshu continue iris-clustering` |
| `list [--status X]` | List all research projects | `/gyoshu list --status active` |
| `search <query>` | Search researches & notebooks | `/gyoshu search "correlation"` |
| `report [id]` | Generate research report | `/gyoshu report` |
| `repl <query>` | Direct REPL exploration | `/gyoshu repl show df columns` |
| `migrate [--options]` | Migrate legacy data | `/gyoshu migrate --to-notebooks` |
| `replay <sessionId>` | Replay for reproducibility | `/gyoshu replay ses_abc123` |
| `unlock <sessionId>` | Unlock stuck session | `/gyoshu unlock ses_abc123` |
| `abort [sessionId]` | Abort current research | `/gyoshu abort` |
| `doctor` | Check system health and diagnose issues | `/gyoshu doctor` |
| `help` | Show usage and examples | `/gyoshu help` |
Runs research autonomously with bounded cycles (max 10). Executes until completion, blocked, or budget exhausted.
```bash
/gyoshu-auto analyze wine quality factors using XGBoost
```

Use this when you have a clear goal and want hands-off execution.
```bash
# See current status and suggestions
/gyoshu

# Start interactive research (searches for similar prior work first)
/gyoshu analyze customer churn patterns

# Continue previous research
/gyoshu continue churn-analysis

# Search across all notebooks and research
/gyoshu search "feature importance"

# Generate a report for the current research
/gyoshu report

# Hands-off autonomous research
/gyoshu-auto cluster wine dataset and identify quality predictors
```

Gyoshu implements a "Never Trust" philosophy where every claim from Jogyo must be verified by Baksa before acceptance.
- Jogyo Completes Work: Signals completion with evidence via `gyoshu_completion`
- Gyoshu Gets Snapshot: Reviews current state via `gyoshu_snapshot`
- Baksa Challenges: Generates probing questions and calculates trust score
- Decision:
  - Trust ≥ 80: VERIFIED - Accept result
  - Trust 60-79: PARTIAL - Accept with caveats
  - Trust < 60: DOUBTFUL - Request rework from Jogyo
- Max 3 Rounds: If verification fails 3 times, escalate to BLOCKED
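The decision thresholds above can be sketched as a small function. The name and return strings are illustrative, not the actual Gyoshu API.

```python
# Sketch of the trust-score decision thresholds (illustrative helper name).
def trust_decision(score: int) -> str:
    if score >= 80:
        return "VERIFIED"   # accept result
    if score >= 60:
        return "PARTIAL"    # accept with caveats
    return "DOUBTFUL"       # request rework from Jogyo
```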
| Component | Weight | Description |
|---|---|---|
| Statistical Rigor | 30% | CI reported, effect size calculated, assumptions checked |
| Evidence Quality | 25% | Artifacts exist, code is reproducible |
| Metric Verification | 20% | Independent checks match claimed values |
| Completeness | 15% | All objectives addressed |
| Methodology | 10% | Sound approach, appropriate tests |
The following immediately reduce trust score by 30 points:
- `[FINDING]` without accompanying `[STAT:ci]`
- `[FINDING]` without accompanying `[STAT:effect_size]`
- "Significant" claim without p-value
- Correlation claim without effect size interpretation
- ML metrics without baseline comparison
When Jogyo responds to challenges, use these markers:
```python
# Respond to a specific challenge (N = challenge number)
print("[CHALLENGE-RESPONSE:1] Re-verified correlation with alternative method")

# Provide reproducible verification code
print("[VERIFICATION-CODE] df['accuracy'].mean() == 0.95")

# Show independent cross-validation
print("[INDEPENDENT-CHECK] 5-fold CV confirms accuracy: 0.94 ± 0.02")
```

1. Jogyo: "Model accuracy is 95%"
2. Baksa challenges:
- "Re-run with different random seed"
- "Show confusion matrix"
- "What's the baseline accuracy?"
3. Trust Score: 45 (DOUBTFUL)
4. Gyoshu sends rework request to Jogyo
5. Jogyo responds with enhanced evidence
6. Baksa re-evaluates: Trust Score 82 (VERIFIED)
7. Gyoshu accepts result
Gyoshu enforces senior data scientist level research quality through hard quality gates. Every claim must have statistical evidence before becoming a verified finding.
Every finding must include:
```
Claim → Data slice → Method/Test → Assumptions →
Estimate + CI → Effect size → p-value → Robustness checks →
Practical "so what"
```
| Gate | Requirement | Consequence if Missing |
|---|---|---|
| Hypothesis | H0/H1 stated before analysis | Finding marked "exploratory" |
| Confidence Interval | 95% CI reported | Finding rejected |
| Effect Size | Cohen's d, r², or OR reported | Finding rejected |
| Assumptions | Statistical assumptions checked | Warning flag |
| Robustness | At least one sensitivity check | Warning flag |
| So What | Practical significance explained | Finding incomplete |
| Category | Trust Score | Report Section |
|---|---|---|
| Verified Findings | ≥ 80 | Key Findings |
| Partial Findings | 60-79 | Findings (with caveats) |
| Exploratory Notes | < 60 | Exploratory Observations |
Quality gates are automated checks that run at research completion (via gyoshu-completion tool) to enforce statistical rigor. The system:
- Scans notebook outputs for structured markers
- Validates findings using the "Finding Gating Rule"
- Validates ML pipelines for required components
- Calculates quality score (100 - sum of penalties)
- Categorizes findings as Verified, Partial, or Exploratory
Every `[FINDING]` marker must have supporting evidence within 10 lines BEFORE it:

- `[STAT:ci]` - Confidence interval (required)
- `[STAT:effect_size]` - Effect magnitude (required)

If either is missing, the finding is marked as unverified and goes to "Exploratory Observations" in the report.
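The gating rule can be sketched as a simple window scan over output lines. This is an illustrative approximation; the real `gyoshu-completion` scanner may parse markers differently.

```python
# Sketch of the Finding Gating Rule: a [FINDING] counts as verified only if
# both [STAT:ci] and [STAT:effect_size] appear within the 10 lines before it.
def is_verified_finding(lines: list, finding_idx: int, window: int = 10) -> bool:
    context = lines[max(0, finding_idx - window):finding_idx]
    has_ci = any("[STAT:ci]" in line for line in context)
    has_effect = any("[STAT:effect_size]" in line for line in context)
    return has_ci and has_effect
```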
Quality Score = 100 - (sum of all penalties)
| Violation | Penalty | Description |
|---|---|---|
| `FINDING_NO_CI` | -30 | Finding without confidence interval |
| `FINDING_NO_EFFECT_SIZE` | -30 | Finding without effect size |
| `ML_NO_BASELINE` | -20 | ML metrics without baseline comparison |
| `ML_NO_CV` | -25 | ML metrics without cross-validation |
| `ML_NO_INTERPRETATION` | -15 | ML metrics without feature importance |
| Score Range | Status | Result |
|---|---|---|
| 100 | SUCCESS | All quality gates passed |
| 80-99 | PARTIAL | Minor issues, findings still accepted |
| 60-79 | PARTIAL | Some findings moved to exploratory |
| 0-59 | PARTIAL | Significant quality issues |
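The scoring rule is straightforward: 100 minus the summed penalties, floored at zero. A minimal sketch using the penalty values from the table above:

```python
# Sketch of quality scoring: 100 minus the sum of penalties, floored at 0.
# Penalty values mirror the violation table; the helper name is illustrative.
PENALTIES = {
    "FINDING_NO_CI": 30,
    "FINDING_NO_EFFECT_SIZE": 30,
    "ML_NO_BASELINE": 20,
    "ML_NO_CV": 25,
    "ML_NO_INTERPRETATION": 15,
}

def quality_score(violations: list) -> int:
    return max(0, 100 - sum(PENALTIES[v] for v in violations))
```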
Note: Quality gates never block completion, but they do affect how findings are categorized in reports.
Understanding the difference between verified and exploratory findings:
```python
# ❌ This finding will be marked EXPLORATORY (score -60)
print("[FINDING] Model accuracy is 95%")

# Why it fails:
# - No [STAT:ci] within 10 lines before
# - No [STAT:effect_size] within 10 lines before
```

This produces:

```
Quality Score: 40/100
Violations:
- FINDING_NO_CI: Missing confidence interval (-30)
- FINDING_NO_EFFECT_SIZE: Missing effect size (-30)
Report Section: "Exploratory Observations" (not trusted)
```
```python
# ✅ This finding will be VERIFIED (score 100)

# 1. Statistical evidence BEFORE the finding
print("[STAT:estimate] accuracy = 0.95")
print("[STAT:ci] 95% CI [0.93, 0.97]")
print("[STAT:effect_size] Cohen's d = 0.82 (large improvement over baseline)")
print("[STAT:p_value] p < 0.001")

# 2. NOW state the finding with summary evidence
print("[FINDING] Model (AUC=0.95) significantly outperforms baseline "
      "(d=0.82, 95% CI [0.93, 0.97], p<0.001)")

# 3. Explain practical significance
print("[SO_WHAT] This means 40% fewer false negatives in fraud detection")
```

This produces:

```
Quality Score: 100/100
Violations: None
Report Section: "Key Findings" (trusted, verified)
```
```python
# ✅ Complete ML pipeline with all required markers
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# 1. Baseline comparison (REQUIRED)
dummy = DummyClassifier(strategy='stratified')
dummy_scores = cross_val_score(dummy, X, y, cv=5)
print(f"[METRIC:baseline_accuracy] {dummy_scores.mean():.3f}")

# 2. Model cross-validation (REQUIRED)
scores = cross_val_score(rf_model, X, y, cv=5)
print(f"[METRIC:cv_accuracy_mean] {scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {scores.std():.3f}")

# 3. Feature interpretation (REQUIRED)
importances = rf_model.feature_importances_
print(f"[METRIC:feature_importance] age={importances[0]:.2f}, income={importances[1]:.2f}")

# 4. Statistical evidence for finding
improvement = scores.mean() - dummy_scores.mean()
ci_low, ci_high = scores.mean() - 1.96*scores.std(), scores.mean() + 1.96*scores.std()
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Improvement = {improvement:.3f} ({improvement/dummy_scores.std():.1f}σ)")

# 5. Verified finding
print(f"[FINDING] Random Forest achieves {scores.mean():.1%} accuracy, "
      f"outperforming baseline by {improvement:.1%} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
```

A Goal Contract defines measurable success criteria for research before execution begins. This enables Gyoshu to objectively determine whether a research goal has been achieved, rather than relying solely on subjective verification.
A Goal Contract is a formal specification that:
- States the goal in clear, measurable terms
- Defines acceptance criteria that must be met for success
- Limits retry attempts to prevent infinite loops
- Enables automatic verification at research completion
Goal contracts are stored in the notebook's YAML frontmatter under the gyoshu.goal_contract key:
```yaml
---
title: "Customer Churn Classification"
gyoshu:
  schema_version: 1
  reportTitle: churn-classification
  status: active
  goal_contract:
    version: 1
    goal_text: "Build a classification model with 90% accuracy"
    goal_type: "ml_classification"
    max_goal_attempts: 3
    acceptance_criteria:
      - id: AC1
        kind: metric_threshold
        metric: cv_accuracy_mean
        op: ">="
        target: 0.90
      - id: AC2
        kind: marker_required
        marker: "METRIC:baseline_accuracy"
      - id: AC3
        kind: artifact_exists
        artifactPattern: "*.pkl"
      - id: AC4
        kind: finding_count
        minCount: 3
---
```

| Field | Type | Required | Description |
|---|---|---|---|
| `version` | number | Yes | Schema version (currently 1) |
| `goal_text` | string | Yes | Human-readable goal statement |
| `goal_type` | string | No | Goal category: `ml_classification`, `ml_regression`, `eda`, `statistical`, `custom` |
| `max_goal_attempts` | number | No | Maximum pivot attempts before BLOCKED (default: 3) |
| `acceptance_criteria` | array | Yes | List of criteria that must ALL pass |
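Because the contract lives in YAML frontmatter, a loader first has to slice the frontmatter block out of the notebook's markdown before handing it to a YAML parser. A minimal sketch (illustrative only; Gyoshu's own loader may differ):

```python
# Sketch: extract the raw frontmatter block (text between the first pair of
# "---" fences) so it can be passed to a YAML parser.
from typing import Optional

def extract_frontmatter(text: str) -> Optional[str]:
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])  # content between the fences
    return None  # unterminated frontmatter
```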
Checks if a [METRIC:name] marker value meets a threshold.
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier (e.g., AC1) |
| `kind` | string | Must be `metric_threshold` |
| `metric` | string | Metric name (e.g., `cv_accuracy_mean`, `f1_score`) |
| `op` | string | Comparison operator: `>=`, `>`, `<=`, `<`, `==` |
| `target` | number | Target value to compare against |

Example:

```yaml
- id: AC1
  kind: metric_threshold
  metric: cv_accuracy_mean
  op: ">="
  target: 0.90
```

How it works: Scans notebook output for `[METRIC:cv_accuracy_mean] 0.92` and checks if 0.92 >= 0.90.
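That scan-and-compare step can be sketched as follows. The regex and helper name are illustrative assumptions; the real checker may parse marker values differently.

```python
# Sketch of metric_threshold evaluation: find "[METRIC:<name>] <value>" in
# the output and compare the value against the target with the given operator.
import operator
import re

OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le,
       "<": operator.lt, "==": operator.eq}

def check_metric_threshold(output: str, metric: str, op: str, target: float) -> bool:
    pattern = rf"\[METRIC:{re.escape(metric)}\]\s*([0-9.]+)"
    match = re.search(pattern, output)
    return match is not None and OPS[op](float(match.group(1)), target)
```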
Verifies that a specific marker type appears in the notebook output.
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier |
| `kind` | string | Must be `marker_required` |
| `marker` | string | Marker type to find (e.g., `METRIC:baseline_accuracy`, `STAT:ci`) |

Example:

```yaml
- id: AC2
  kind: marker_required
  marker: "METRIC:baseline_accuracy"
```

How it works: Searches for `[METRIC:baseline_accuracy]` in any cell output. Passes if found at least once.
Verifies that a specific artifact file was created in the reports directory.
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier |
| `kind` | string | Must be `artifact_exists` |
| `artifactPattern` | string | Glob pattern to match (e.g., `*.pkl`, `figures/*.png`, `model.joblib`) |

Example:

```yaml
- id: AC3
  kind: artifact_exists
  artifactPattern: "models/*.pkl"
```

How it works: Checks `reports/{reportTitle}/models/` for any `.pkl` file. Passes if at least one match exists.
Verifies that a minimum number of verified [FINDING] markers exist.
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier |
| `kind` | string | Must be `finding_count` |
| `minCount` | number | Minimum number of verified findings required |

Example:

```yaml
- id: AC4
  kind: finding_count
  minCount: 3
```

How it works: Counts `[FINDING]` markers that have supporting `[STAT:ci]` and `[STAT:effect_size]` within 10 lines before. Only verified findings count.
```yaml
goal_contract:
  version: 1
  goal_text: "Classify wine quality with F1 >= 0.85"
  goal_type: ml_classification
  max_goal_attempts: 3
  acceptance_criteria:
    - id: AC1
      kind: metric_threshold
      metric: cv_f1_mean
      op: ">="
      target: 0.85
    - id: AC2
      kind: marker_required
      marker: "METRIC:baseline_accuracy"
    - id: AC3
      kind: artifact_exists
      artifactPattern: "models/*.pkl"
```

```yaml
goal_contract:
  version: 1
  goal_text: "Complete comprehensive EDA with 5+ insights"
  goal_type: eda
  max_goal_attempts: 2
  acceptance_criteria:
    - id: AC1
      kind: finding_count
      minCount: 5
    - id: AC2
      kind: artifact_exists
      artifactPattern: "figures/*.png"
    - id: AC3
      kind: marker_required
      marker: "CONCLUSION"
```

```yaml
goal_contract:
  version: 1
  goal_text: "Test hypothesis with p < 0.05"
  goal_type: statistical
  max_goal_attempts: 2
  acceptance_criteria:
    - id: AC1
      kind: marker_required
      marker: "STAT:p_value"
    - id: AC2
      kind: marker_required
      marker: "STAT:ci"
    - id: AC3
      kind: marker_required
      marker: "STAT:effect_size"
    - id: AC4
      kind: finding_count
      minCount: 1
```

Gyoshu uses a Two-Gate verification system to ensure both research quality (Trust Gate) and goal achievement (Goal Gate) before accepting results.
| Gate | What It Checks | Who Evaluates | Pass Condition |
|---|---|---|---|
| Trust Gate | Research quality, statistical rigor, evidence validity | Baksa (adversarial verifier) | Trust score ≥ 80 |
| Goal Gate | Whether acceptance criteria are met | Automated (from goal contract) | All criteria pass |
Trust Gate alone is insufficient:
- Research can be methodologically sound but fail to achieve the stated goal
- Example: Perfect statistical analysis showing 70% accuracy when goal was 90%
Goal Gate alone is insufficient:
- Goal can be "achieved" through flawed methodology
- Example: Claiming 95% accuracy on training set without cross-validation
Together, they ensure:
- Results are trustworthy AND meaningful
- Claims are verified AND goals are met
- Research is rigorous AND successful
| Trust Gate | Goal Gate | Final Status | Action |
|---|---|---|---|
| ✅ PASS | ✅ MET | SUCCESS | Accept result, generate report |
| ✅ PASS | ❌ NOT_MET | PARTIAL | Pivot: try different approach |
| ✅ PASS | 🚫 BLOCKED | BLOCKED | Goal impossible, escalate to user |
| ❌ FAIL | ✅ MET | PARTIAL | Rework: improve evidence quality |
| ❌ FAIL | ❌ NOT_MET | PARTIAL | Rework: fix methodology |
| ❌ FAIL | 🚫 BLOCKED | BLOCKED | Cannot proceed, escalate to user |
Trust Gate:

- PASS: Trust score ≥ 80 (verified)
- FAIL: Trust score < 80 (needs rework)

Goal Gate:

- MET: All acceptance criteria pass
- NOT_MET: Some criteria failed, but retry is possible
- BLOCKED: Goal is impossible (e.g., data doesn't support the hypothesis)
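The decision matrix can be sketched as a small function mapping gate results to a final status and follow-up action. Status and action strings follow the table; the function itself is illustrative, not the actual implementation.

```python
# Sketch of the Two-Gate decision matrix (illustrative helper).
def combine_gates(trust_passed: bool, goal_status: str):
    """Map (Trust Gate, Goal Gate) results to (final status, action)."""
    if goal_status == "BLOCKED":
        return "BLOCKED", "ESCALATE"       # goal impossible, escalate to user
    if trust_passed and goal_status == "MET":
        return "SUCCESS", "ACCEPT"         # accept result, generate report
    if trust_passed:
        return "PARTIAL", "PIVOT"          # goal NOT_MET: try a new approach
    return "PARTIAL", "REWORK"             # trust failed: improve evidence
```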
When gates fail, Gyoshu doesn't immediately give up:
```
┌─────────────────────────────────────────────────────────────┐
│                    Research Execution                       │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                    Trust Gate Check                         │
│              (Baksa adversarial verification)               │
└───────────┬─────────────────────────────────┬───────────────┘
            │ PASS                            │ FAIL
            ▼                                 ▼
┌───────────────────────┐       ┌───────────────────────────┐
│   Goal Gate Check     │       │      Rework Request       │
│ (Automated criteria)  │       │  (Fix evidence quality)   │
└───────┬───────────────┘       └─────────────┬─────────────┘
        │                                     │
  MET   │   NOT_MET                           │
  │     │                                     │
  ▼     ▼                                     │
SUCCESS PARTIAL                               │
        │                                     │
        ├───────── Attempt < Max? ────────────┤
        │ Yes                                 │
        ▼                                     │
  ┌─────────────┐                             │
  │   PIVOT     │◄────────────────────────────┘
  │  Try new    │
  │  approach   │
  └─────────────┘
        │
        │ Attempt >= Max
        ▼
     BLOCKED
```
| Action | Trigger | What Happens |
|---|---|---|
| Rework | Trust Gate FAIL | Jogyo improves evidence (adds CI, effect size, etc.) without changing approach |
| Pivot | Goal Gate NOT_MET | Jogyo tries a different approach (new model, different features, etc.) |
The `max_goal_attempts` field in the goal contract limits how many times Gyoshu will try to achieve the goal:

```yaml
goal_contract:
  max_goal_attempts: 3  # Try up to 3 different approaches
```

Attempt counting:
- Each Pivot increments the attempt counter
- Reworks do NOT increment (same approach, better evidence)
- When attempts ≥ max_goal_attempts, status becomes BLOCKED
BLOCKED status means:
- The goal cannot be achieved with available data/methods
- User intervention is required
- Gyoshu will NOT keep trying indefinitely
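The attempt bookkeeping above can be sketched as follows. Pivots consume attempts, reworks do not, and hitting the cap flips the status to BLOCKED. Names are illustrative.

```python
# Sketch of attempt bookkeeping: PIVOT increments the counter, REWORK does
# not; when attempts reach the cap, the research becomes BLOCKED.
def next_attempt(attempt: int, action: str, max_attempts: int = 3):
    if action == "PIVOT":
        attempt += 1
    if attempt >= max_attempts:
        return attempt, "BLOCKED"   # user intervention required
    return attempt, "CONTINUE"
```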
```
Goal: "Build classifier with 90% accuracy"

Attempt 1:
- Jogyo trains Random Forest → 85% accuracy
- Trust Gate: PASS (proper CV, baseline comparison)
- Goal Gate: NOT_MET (85% < 90%)
- Decision: PARTIAL → Pivot

Attempt 2:
- Jogyo trains XGBoost → 92% accuracy
- Trust Gate: FAIL (no confidence interval reported)
- Decision: PARTIAL → Rework

Attempt 2 (Rework):
- Jogyo adds CI: 95% CI [0.90, 0.94]
- Trust Gate: PASS
- Goal Gate: MET (92% ≥ 90%)
- Decision: SUCCESS ✅
```
Gate results are included in the completion response:
```json
{
  "status": "PARTIAL",
  "trustGate": {
    "passed": true,
    "score": 85
  },
  "goalGate": {
    "status": "NOT_MET",
    "criteriaResults": [
      { "id": "AC1", "passed": false, "actual": 0.85, "target": 0.90 },
      { "id": "AC2", "passed": true }
    ]
  },
  "action": "PIVOT",
  "attemptNumber": 1,
  "maxAttempts": 3
}
```

When working with Gyoshu REPL output, use these markers:
```python
# Research Process
print("[OBJECTIVE] Research goal statement")
print("[HYPOTHESIS] H0: no effect; H1: treatment improves outcome")
print("[CONCLUSION] Final conclusions with evidence summary")
```

```python
# Test Decision - explain why this test
print("[DECISION] Using Welch's t-test: two independent groups, unequal variance")

# Assumption Checking
print("[CHECK:normality] Shapiro-Wilk p=0.23 - normality assumption OK")
print("[CHECK:homogeneity] Levene's p=0.04 - using Welch's (unequal var)")

# Statistical Results (ALL required before [FINDING])
print(f"[STAT:estimate] mean_diff = {mean_diff:.3f}")
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Cohen's d = {d:.3f} (medium)")
print(f"[STAT:p_value] p = {p:.4f}")

# Robustness Check
print("[INDEPENDENT_CHECK] Bootstrap 95% CI: [0.12, 0.28] - consistent")

# Only AFTER above evidence:
print("[FINDING] Treatment shows medium effect (d=0.45, 95% CI [0.2, 0.7])")

# Practical Significance
print("[SO_WHAT] Effect translates to $50K annual savings per customer segment")

# Limitations
print("[LIMITATION] Self-selection bias - users opted in voluntarily")
```

```python
# Baseline (REQUIRED before claiming model performance)
print(f"[METRIC:baseline_accuracy] {dummy_score:.3f}")

# Cross-Validation (REQUIRED - report mean ± std)
print(f"[METRIC:cv_accuracy_mean] {scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {scores.std():.3f}")

# Model Performance with CI
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[METRIC:improvement_over_baseline] {improvement:.3f}")

# Interpretation (REQUIRED)
print("[METRIC:top_features] age (0.23), income (0.18), tenure (0.15)")
print("[FINDING] Random Forest (AUC=0.82) outperforms baseline (0.65) by 0.17")
print("[SO_WHAT] Model identifies 80% of churners in top 20% of predictions")
```

```python
print("[DATA] Dataset description")
print(f"[SHAPE] {df.shape}")
print(f"[METRIC:missing_rate] {missing_pct:.1f}%")
```

```python
print("[PATTERN] Identified pattern")
print("[OBSERVATION] Descriptive observation")
print("[EXPERIMENT] Experimental setup description")
```

```python
import numpy as np
from scipy.stats import ttest_ind, shapiro, levene

# 1. State hypothesis
print("[HYPOTHESIS] H0: No difference between groups; H1: Treatment > Control")
print("[DECISION] Using Welch's t-test for independent samples")

# 2. Check assumptions
_, p_norm_t = shapiro(treatment)
_, p_norm_c = shapiro(control)
print(f"[CHECK:normality] Treatment p={p_norm_t:.3f}, Control p={p_norm_c:.3f}")
_, p_var = levene(treatment, control)
print(f"[CHECK:homogeneity] Levene's p={p_var:.3f} - using Welch's t-test")

# 3. Run test
t_stat, p_value = ttest_ind(treatment, control, equal_var=False)

# 4. Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(treatment)-1)*treatment.std()**2 +
                      (len(control)-1)*control.std()**2) /
                     (len(treatment) + len(control) - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_std

# 5. Calculate CI for difference
from scipy.stats import sem
mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(sem(treatment)**2 + sem(control)**2)
ci_low = mean_diff - 1.96 * se_diff
ci_high = mean_diff + 1.96 * se_diff

# 6. Report ALL statistics
print(f"[STAT:estimate] mean_diff = {mean_diff:.3f}")
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Cohen's d = {cohens_d:.3f} "
      f"({'small' if abs(cohens_d) < 0.5 else 'medium' if abs(cohens_d) < 0.8 else 'large'})")
print(f"[STAT:p_value] p = {p_value:.4f}")

# 7. Robustness check
from scipy.stats import mannwhitneyu
_, p_mw = mannwhitneyu(treatment, control, alternative='greater')
print(f"[INDEPENDENT_CHECK] Mann-Whitney U p={p_mw:.4f} (non-parametric confirmation)")

# 8. NOW state finding with full evidence
print(f"[FINDING] Treatment shows {'significant' if p_value < 0.05 else 'no significant'} effect "
      f"(d={cohens_d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], p={p_value:.4f})")

# 9. Practical significance
print(f"[SO_WHAT] A {abs(cohens_d):.1f}σ effect means ~{abs(mean_diff)*100:.0f} unit improvement per customer")

# 10. Limitations
print("[LIMITATION] Single time point; longitudinal effects unknown")
```

Gyoshu can generate publication-quality research reports from notebooks and export them to PDF.
Reports are generated by extracting structured markers from notebook cell outputs. Use these markers in your REPL output to populate report sections:
| Marker | Report Section | Description |
|---|---|---|
| `[OBJECTIVE]` | Executive Summary | Research goal statement |
| `[HYPOTHESIS]` | Hypotheses | Proposed explanations |
| `[METRIC:name]` | Performance Metrics | Named metrics with values |
| `[FINDING]` | Key Findings | Important discoveries |
| `[LIMITATION]` | Limitations | Known constraints |
| `[NEXT_STEP]` | Recommended Next Steps | Follow-up actions |
| `[CONCLUSION]` | Conclusion | Final summary |
Reports follow the IMRAD (Introduction, Methods, Results, and Discussion) structure:
| Section | Content | Required Markers |
|---|---|---|
| Executive Summary | Question, answer, magnitude, confidence | [OBJECTIVE], [CONCLUSION] |
| Hypotheses & Endpoints | H0/H1, metrics, alpha | [HYPOTHESIS], [DECISION] |
| Methods | Data, tests, assumptions | [DATA], [CHECK:*] |
| Results | Estimates + CI + effect sizes | [STAT:*], [METRIC:*] |
| Robustness | Sensitivity analyses | [INDEPENDENT_CHECK] |
| Key Findings | Verified discoveries (trust ≥ 80) | [FINDING] + [STAT:ci] + [STAT:effect_size] |
| Exploratory Observations | Unverified claims (trust < 80) | [FINDING] without full stats |
| Implications ("So What") | Practical significance | [SO_WHAT] |
| Limitations | Threats to validity | [LIMITATION] |
| Next Steps | Follow-up actions | [NEXT_STEP] |
When research completes with SUCCESS status, a markdown report is automatically generated and saved to:
reports/{reportTitle}/report.md
The report includes:
- Executive Summary: Objective, key metrics, and status
- Hypotheses: All proposed explanations
- Performance Metrics: Table of all `[METRIC:name]` values
- Key Findings: Numbered list of discoveries
- Output Files: Artifacts from the reports directory
- Conclusion: Final research summary
Generate a report manually using the /gyoshu report command:
```bash
# Generate report for current research
/gyoshu report

# Generate report for specific research
/gyoshu report my-research-slug
```

Or via the research-manager tool:

```
research-manager(action: "report", reportTitle: "my-research")
```

Export markdown reports to PDF using available converters:
| Priority | Converter | Quality | Install Command |
|---|---|---|---|
| 1 | pandoc | Best (LaTeX math support) | apt install pandoc texlive-xetex or brew install pandoc basictex |
| 2 | wkhtmltopdf | Good (widely available) | apt install wkhtmltopdf or brew install wkhtmltopdf |
| 3 | weasyprint | Good (CSS-based) | pip install weasyprint |
Export via the research-manager tool:

```
research-manager(action: "export-pdf", reportTitle: "my-research")
```

PDF files are saved to:

```
reports/{reportTitle}/report.pdf
```
Note: At least one PDF converter must be installed for PDF export. Gyoshu automatically detects and uses the best available converter.
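The auto-detection can be pictured as a priority scan over the table above. This is a minimal sketch, not Gyoshu's actual implementation:

```python
# Sketch of converter auto-detection: try each converter in priority
# order and return the first one found on PATH (None if none exist).
import shutil

CONVERTERS = ["pandoc", "wkhtmltopdf", "weasyprint"]  # priority order

def detect_converter():
    for name in CONVERTERS:
        if shutil.which(name):
            return name
    return None  # no converter installed: PDF export unavailable
```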
When using the gyoshu-completion tool with `exportPdf: true`, PDF export happens automatically after report generation:

```
gyoshu-completion({
  researchSessionID: "my-session",
  status: "SUCCESS",
  summary: "Research complete",
  evidence: { ... },
  exportPdf: true  // Automatically exports report.pdf after generating report.md
})
```

This is useful for autonomous research workflows where you want both the markdown report and the PDF without a separate step.
Gyoshu provides checkpoint/resume capability for long-running research:
Research is divided into bounded stages (max 4 minutes each):
- Each stage has a unique ID: `S{NN}_{verb}_{noun}` (e.g., `S01_load_data`)
- Stages emit markers: `[STAGE:begin]`, `[STAGE:progress]`, `[STAGE:end]`
- Checkpoints are created at stage boundaries
```python
# Stage boundaries
print("[STAGE:begin:id=S01_load_data]")
print("[STAGE:end:id=S01_load_data:duration=120s]")

# Checkpoint saved
print("[CHECKPOINT:saved:id=ckpt-001:stage=S01_load_data]")

# Rehydrated from checkpoint
print("[REHYDRATED:from=ckpt-001]")
```

Checkpoints are stored under:

```
reports/{reportTitle}/checkpoints/{runId}/{checkpointId}/
└── checkpoint.json   # Manifest with artifact hashes
```
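The manifest of artifact hashes is what makes checkpoints verifiable. The sketch below assumes a manifest field named `artifacts` mapping relative paths to SHA256 hex digests (the field name is an assumption, not the actual schema) and shows how validation could catch a tampered file:

```python
# Sketch of checkpoint validation: recompute each artifact's SHA256
# and compare it to the hash recorded in checkpoint.json.
# The "artifacts" field name is an assumption for illustration.
import hashlib
import json
from pathlib import Path

def validate_checkpoint(ckpt_dir):
    ckpt_dir = Path(ckpt_dir)
    manifest = json.loads((ckpt_dir / "checkpoint.json").read_text())
    for rel_path, expected in manifest.get("artifacts", {}).items():
        digest = hashlib.sha256((ckpt_dir / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            return False  # surfaces as "Manifest SHA256 mismatch"
    return True
```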
```
# Continue research (auto-detects checkpoints)
/gyoshu continue my-research

# List checkpoints
checkpoint-manager(action: "list", reportTitle: "my-research")

# Resume from specific checkpoint
checkpoint-manager(action: "resume", reportTitle: "my-research", runId: "run-001")
```

| Issue | Solution |
|---|---|
| "No valid checkpoints" | Artifacts may be missing or corrupted. Check `reports/*/checkpoints/` |
| "Manifest SHA256 mismatch" | The checkpoint file was modified. Resume from the previous checkpoint |
| "Session locked" | Use `/gyoshu unlock <sessionId>` after verifying no active process |
| Action | Description |
|---|---|
| `save` | Create a new checkpoint at a stage boundary |
| `list` | List all checkpoints for a research/run |
| `validate` | Verify checkpoint integrity (manifest + artifacts) |
| `resume` | Find the last valid checkpoint and generate rehydration code |
| `prune` | Keep only the last N checkpoints (default: 5) |
| `emergency` | Fast checkpoint for watchdog/abort (skips validation) |
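For example, `prune` can be pictured as sorting checkpoint directories by age and deleting all but the newest N. A sketch, assuming one subdirectory per checkpoint (not Gyoshu's actual implementation):

```python
# Sketch of the prune action: keep only the newest `keep` checkpoint
# directories under a run directory, deleting the rest.
import shutil
from pathlib import Path

def prune_checkpoints(run_dir, keep=5):
    # oldest first, so the slice below drops everything except the newest `keep`
    ckpts = sorted(Path(run_dir).iterdir(), key=lambda p: p.stat().st_mtime)
    for old in ckpts[:-keep]:
        shutil.rmtree(old)
```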
Checkpoints have a trust level that controls security validation:
| Level | Description | Validation |
|---|---|---|
| `local` | Created by this system (default) | Standard validation |
| `imported` | Copied from another project | + Parent directory symlink check |
| `untrusted` | From external/unknown source | + Parent symlink check + user confirmation |
When to use each level:

- `local`: Normal checkpoints created during research (automatic)
- `imported`: When copying checkpoints from a colleague or another machine
- `untrusted`: When loading checkpoints from the internet or unknown sources
Security implications:

- `local` checkpoints trust the local filesystem
- `imported` and `untrusted` checkpoints verify that parent directories aren't symlinks (prevents escape attacks)
- `untrusted` checkpoints show a warning before resume, because rehydration code could execute arbitrary Python
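The parent-symlink check can be sketched as walking from the checkpoint path up to the trusted checkpoints root and rejecting any symlinked directory in between. The function name and signature below are illustrative, not Gyoshu's actual API:

```python
# Sketch of the parent-symlink check for imported/untrusted checkpoints:
# reject paths that pass through a symlinked directory below the trusted
# root, which could redirect artifact reads to unrelated locations.
from pathlib import Path

def has_symlink_parent(path, root):
    path, root = Path(path).absolute(), Path(root).absolute()
    for parent in path.parents:
        if parent == root:
            break  # stop at the trusted checkpoints root
        if parent.is_symlink():
            return True
    return False
```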
Example:

```
# Save with explicit trust level (for imported checkpoint)
checkpoint-manager(action: "save", ..., trustLevel: "imported")

# Resume will show warning for non-local checkpoints
checkpoint-manager(action: "resume", reportTitle: "imported-research")
# Returns: { ..., trustWarning: "Checkpoint is imported - verify source before resuming" }
```

The project is laid out as follows:

```
Gyoshu/
├── notebooks/                # Research notebooks (default location)
│   ├── README.md             # Auto-generated index
│   ├── _migrated/            # Migrated legacy research
│   └── {reportTitle}.ipynb   # Self-describing notebooks
│
├── reports/                  # Research reports (mirrors notebooks)
│   └── {reportTitle}/
│       ├── README.md         # Combined report view
│       ├── figures/
│       ├── models/
│       ├── exports/
│       ├── report.md         # Generated research report
│       └── report.pdf        # PDF export (if converter available)
│
├── src/                      # OpenCode extension source
│   ├── agent/                # Agent definitions
│   ├── command/              # Slash commands
│   ├── tool/                 # Tool implementations
│   ├── lib/                  # Shared utilities
│   ├── bridge/               # Python REPL bridge
│   └── skill/                # Research skills
├── data/                     # Datasets
├── .venv/                    # Python environment
└── ...
```
Runtime data is stored in OS-appropriate temp directories, NOT in the project root:
Linux (with `XDG_RUNTIME_DIR`):

```
$XDG_RUNTIME_DIR/gyoshu/       # Usually /run/user/{uid}/gyoshu
└── {shortSessionId}/
    ├── bridge.sock            # Python REPL socket
    ├── session.lock           # Session lock
    └── bridge_meta.json       # Runtime state
```

macOS:

```
~/Library/Caches/gyoshu/runtime/
└── {shortSessionId}/...
```

Linux (fallback):

```
~/.cache/gyoshu/runtime/
└── {shortSessionId}/...
```
Environment Variable Override: Set `GYOSHU_RUNTIME_DIR` to force a custom location.
Note: Session IDs are hashed to 12 characters to respect Unix socket path limits (~108 bytes).
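Deriving the short ID from a hash keeps it deterministic across restarts. The exact hash function isn't specified here; this sketch assumes SHA256:

```python
# Sketch of session-ID shortening: hash to a stable 12-char identifier
# so socket paths like {runtime}/{shortSessionId}/bridge.sock stay well
# under the ~108-byte Unix socket path limit.
import hashlib

def short_session_id(session_id: str) -> str:
    return hashlib.sha256(session_id.encode()).hexdigest()[:12]
```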
Gyoshu stores research metadata in notebooks, not separate JSON files:
Each notebook has YAML frontmatter in the first cell (a raw cell):

```yaml
---
# Quarto-compatible fields (optional)
title: "Customer Churn Prediction"
date: 2026-01-01

# Gyoshu-specific fields
gyoshu:
  schema_version: 1
  reportTitle: churn-prediction   # Notebook identifier
  status: active                  # active | completed | archived
  created: "2026-01-01T10:00:00Z"
  updated: "2026-01-01T15:00:00Z"
  tags: [ml, classification]
  runs:
    - id: run-001
      started: "2026-01-01T10:00:00Z"
      status: completed
---
```

Cells are tagged with `gyoshu-*` markers in metadata to structure the research: `gyoshu-objective`, `gyoshu-hypothesis`, `gyoshu-finding`, etc.
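Because the frontmatter lives in the notebook's first raw cell, it can be read back with plain `json` and no notebook library. A minimal sketch:

```python
# Sketch: pull the YAML frontmatter text out of a notebook's first cell.
# Notebooks are JSON on disk; the frontmatter is kept in a raw cell.
import json

def read_frontmatter(nb_path):
    with open(nb_path) as f:
        nb = json.load(f)
    cells = nb.get("cells", [])
    if cells and cells[0].get("cell_type") == "raw":
        return "".join(cells[0]["source"])
    return None  # no frontmatter cell present
```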
| File | Purpose |
|---|---|
| `src/bridge/gyoshu_bridge.py` | JSON-RPC Python execution bridge |
| `src/tool/research-manager.ts` | Research operations |
| `src/tool/session-manager.ts` | Runtime session management |
| `src/tool/python-repl.ts` | REPL tool interface |
| `src/tool/notebook-writer.ts` | Jupyter notebook generation |
| `src/tool/migration-tool.ts` | Legacy session migration utility |
| `src/tool/notebook-search.ts` | Notebook content search |
| `src/lib/notebook-frontmatter.ts` | Frontmatter parsing/updating |
| `src/lib/readme-index.ts` | README index generation |
| `src/lib/paths.ts` | Centralized path resolver |
| `src/lib/report-markdown.ts` | Report generation library |
| `src/lib/pdf-export.ts` | PDF export utilities |
| `tests/test_bridge.py` | Bridge unit tests |
To add tests:

- Create a test class in the appropriate file under `tests/`
- Use the `test_` prefix for test methods
- Run: `pytest tests/test_file.py::TestClass::test_method -v`

To modify the Python bridge:

- Edit `src/bridge/gyoshu_bridge.py`
- Run the tests: `pytest tests/test_bridge.py -v`
- Test manually with JSON-RPC messages
Research is now stored in the `notebooks/` directory by default.

```
./notebooks/
├── README.md             # Auto-generated root index
└── {reportTitle}.ipynb   # Research notebook with YAML frontmatter
```
Reports are stored in a mirrored structure:

```
./reports/
└── {reportTitle}/
    ├── README.md         # Combined report view
    ├── figures/          # Saved plots
    ├── models/           # Saved model files
    └── exports/          # Data exports (CSV, etc.)
```
Migration Note: Legacy research stored at `gyoshu/research/` or `~/.gyoshu/sessions/` is still readable. Use the `/gyoshu migrate --to-notebooks` command to move data to the new structure.
Gyoshu uses Python virtual environments for research reproducibility.
| Priority | Type | Detection Method |
|---|---|---|
| 1 | venv | `.venv/bin/python` exists |
```
python3 -m venv .venv
.venv/bin/pip install pandas numpy scikit-learn matplotlib seaborn
```

Note: Gyoshu uses your project's virtual environment. It never modifies system Python.
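The priority-1 detection row above amounts to checking for `.venv/bin/python` and falling back otherwise. A minimal sketch of that logic:

```python
# Sketch of interpreter detection: prefer the project venv if present,
# otherwise fall back to system python3 (which is never modified).
from pathlib import Path

def detect_python(project_root="."):
    venv_python = Path(project_root) / ".venv" / "bin" / "python"
    if venv_python.exists():
        return str(venv_python)
    return "python3"
```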