Commit dedada3

Multi-tenant hardening + e2e robustness fixes
End-to-end test verified: project goes from create → conda env clone → Deep Research (Gemini) → title → citations → Dev Phase 4-step loop → real claude code experimenter → LaTeX compile → 2-page PDF, all using the project owner's own keys with no global fallback.

## Multi-tenant: remove ALL global config fallbacks

ARK is a product now: every user must supply their own keys. No more silently using the lab admin's key when a user hasn't configured one.

* `deep_research.py`: strip `get_gemini_api_key()` to env-var only; delete `_global_config()` and `save_gemini_api_key()`.
* `telegram.py`: drop `_load_global()`, `save()`, `_global_config_path()`. `TelegramConfig` only reads per-project config + `ARK_TELEGRAM_*` env vars.
* `orchestrator.py:_get_bot_model`, `agents.py:_get_ark_model`, `cli.py:_get_configured_model`: never read `~/.ark/config.yaml`; use per-project config or hardcoded default. Drop the now-unused `get_config_dir` import in orchestrator.py and agents.py.
* `cli.py:_run_deep_research_for_project`: prompt user to export `GEMINI_API_KEY` instead of saving to global config.
* `cli.py:cmd_setup_bot`: global telegram setup is gone; point users at `ark setup-bot <project>` or `ARK_TELEGRAM_*` env vars.
* `webapp/jobs.py` + `slurm_template.sh` + `webapp/utils/verify.py`: drop the `ARK_NO_GLOBAL_CONFIG=1` workaround flag, which is no longer needed because the fallback is fully removed.

## Robustness fixes hit during real e2e

* `webapp/jobs.py:poll_local_job`: zombie reap. Use `os.waitpid(pid, WNOHANG)` instead of `os.kill(pid, 0)` so finished wrapper subprocesses get reaped instead of staying as zombies that the poller mistakes for "still running".
* `agents.py:run_agent`: empty-run detection no longer uses length or elapsed time as a quality signal. Length/duration are not correlated with correctness: a good title is 60 chars, a good yes/no is 3 chars. Only flag empty when `returncode != 0` or output is literally empty. Initialise `stdout`/`stderr` BEFORE the try block so the TimeoutExpired path doesn't `NameError` on `stderr`. Re-capture stdout/stderr inside the timeout handler with a short timeout.
* `pipeline.py:_parse_title_from_agent_output`: new helper that strips `Title:` / `Generated title:` label lines so a model that prefixes its answer with a label doesn't end up with the label saved as the actual paper title.
* `pipeline.py:_plan_experiments`: page-aware experiment budget. A 1-page workshop poster doesn't need 8 experiments; tell the planner to plan exactly N experiments based on `venue_pages` so the experimenter agent can actually finish in its time budget.
* `pipeline.py:_run_experiments` + `execution.py`: bump experimenter agent timeout from 1800s to 3600s; parameter sweeps need it, and the orchestrator already handles per-experiment results streaming.

## webapp dashboard log SSE no longer double-renders

`webapp/routes.py:api_stream_log`: track which file is being tailed and how many lines have been sent FROM THAT FILE. On the very first iteration, skip past the existing content (the client just fetched it via `/log?lines=300`); on a real log-file rotation (env_provision.log → local_*.out), reset and send the new file from the start. Previously the SSE always restarted from line 0 every time the connection opened, replaying everything the initial `loadLog()` had already rendered.
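
The zombie-reap fix described for `poll_local_job` can be illustrated with a minimal sketch. The `webapp/jobs.py` diff is not shown on this page, so the helper name `is_child_running` is hypothetical. The sketch assumes a POSIX system and that the polled PID is a direct child of the polling process, since `os.waitpid` only works on children.

```python
import os
import subprocess
import sys
import time


def is_child_running(pid: int) -> bool:
    """Return True while the child is alive; reap it once it has exited.

    Hypothetical helper, not taken from the commit.
    """
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # Not our child, or already reaped: nothing is left to wait for.
        return False
    # With WNOHANG, waitpid returns pid 0 while the child is still running.
    return reaped_pid == 0


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    while is_child_running(child.pid):
        time.sleep(0.05)
    print("child finished and was reaped")
```

By contrast, `os.kill(pid, 0)` succeeds for a zombie because the PID still exists in the process table, which is why a poller built on it keeps reporting a finished job as running.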
1 parent 8719bfd commit dedada3
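
The title-label stripping mentioned in the commit message could look roughly like the sketch below. The `pipeline.py` diff is not shown on this page, so the function name `parse_title`, the regular expression, and the quote stripping are all assumptions made for illustration, not the committed `_parse_title_from_agent_output`.

```python
import re

# Hypothetical pattern: a leading "Title:" or "Generated title:" label.
_LABEL_RE = re.compile(r"^\s*(?:generated\s+)?title\s*:\s*", re.IGNORECASE)


def parse_title(agent_output: str) -> str:
    """Return the first non-empty line with any leading title label removed."""
    for line in agent_output.splitlines():
        # Drop the label, surrounding whitespace, and wrapping double quotes.
        candidate = _LABEL_RE.sub("", line).strip().strip('"').strip()
        if candidate:
            return candidate
    return ""
```

A line that holds only a label becomes empty after the substitution and is skipped, so the title on the following line is used instead.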

11 files changed: 203 additions, 265 deletions

ark/agents.py

Lines changed: 31 additions & 16 deletions
@@ -64,7 +64,6 @@ def _fmt_tok(n: int) -> str:
         return f"{n / 1_000:.1f}k"
     return str(n)

-from ark.paths import get_config_dir
 from ark.ui import (
     ElapsedTimer, RateLimitCountdown, agent_styled, styled, Style, Icons,
 )
@@ -190,16 +189,13 @@ def _build_path_boundary(self) -> str:
         )

     def _get_ark_model(self) -> str | None:
-        """Read ARK model preference from .ark/config.yaml. Returns None to use CLI default."""
-        config_file = get_config_dir() / "config.yaml"
-        try:
-            if config_file.exists():
-                with open(config_file) as f:
-                    cfg = yaml.safe_load(f) or {}
-                return cfg.get("model")
-        except Exception:
-            pass
-        return None
+        """
+        Return the ARK model from the project config, or None to use CLI default.
+
+        No global config fallback — ARK is multi-tenant; per-project config
+        must declare its own model.
+        """
+        return self.config.get("model")

     def _kill_process_tree(self, pid: int):
         """Kill a process and all its descendants."""
@@ -522,6 +518,8 @@ def run_agent(self, agent_type: str, task: str, timeout: int = 1800,

             timer.start()
             result = ""
+            stdout = ""
+            stderr = ""
             usage_record = None  # populated when claude returns parseable JSON

             try:
@@ -582,7 +580,14 @@ def run_agent(self, agent_type: str, task: str, timeout: int = 1800,
                 process.kill()
                 timer.stop()
                 self.log(f"Agent {agent_type} timed out ({timeout}s)", "WARN")
-                stdout, _ = process.communicate()
+                # Capture whatever stdout/stderr is available so the empty-run
+                # detection + downstream `stderr` references don't NameError.
+                try:
+                    stdout, stderr = process.communicate(timeout=10)
+                except Exception:
+                    stdout, stderr = "", ""
+                stdout = stdout or ""
+                stderr = stderr or ""
                 # JSON envelope is usually missing on timeout (truncated mid-stream).
                 # Try once; on failure fall back to raw text and let empty-run handle it.
                 if self.model == "claude":
@@ -599,10 +604,20 @@ def run_agent(self, agent_type: str, task: str, timeout: int = 1800,
             timer.stop()
             elapsed = int(time.time() - start_time)

-            # Empty-run detection with auto-retry
-            MIN_AGENT_TIME = 15
-            MIN_RESULT_LEN = 100
-            is_empty = elapsed < MIN_AGENT_TIME and len(result.strip()) < MIN_RESULT_LEN
+            # Empty-run detection with auto-retry.
+            # "Empty" means the agent didn't do its job — that's a property
+            # of the *outcome*, not the *length* or *speed* of the response.
+            # A good title is 60 chars; a good yes/no is 3 chars; a good
+            # "find this file" is one line. Length is not a quality signal.
+            #
+            # The only honest signals for "this run was broken":
+            # - process exited non-zero (claude code crashed / errored)
+            # - process produced literally no output
+            stripped = result.strip()
+            is_empty = (
+                process.returncode != 0
+                or not stripped
+            )
             if is_empty:
                 self.log(f"Agent [{agent_type}] empty-run detected (attempt {attempt}/{MAX_RETRIES}): ran only {elapsed}s, output only {len(result.strip())} chars", "WARN")
                 self.log(f" returncode: {process.returncode}", "WARN")
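
The two `run_agent` changes above (binding `stdout`/`stderr` before the `try`, and judging a run by outcome rather than length or speed) can be sketched in isolation as follows. The function names `run_once` and `is_empty_run` are hypothetical and the real method does considerably more (retries, JSON parsing, timers); this is only a sketch of the pattern under those assumptions.

```python
from __future__ import annotations

import subprocess
import sys


def run_once(cmd: list[str], timeout: float) -> tuple[int | None, str, str]:
    """Run cmd and return (returncode, stdout, stderr), even on timeout."""
    # Bind both names before the try block so every later path can read them.
    stdout, stderr = "", ""
    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, stderr = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        try:
            # Drain whatever the killed process managed to write.
            stdout, stderr = process.communicate(timeout=10)
        except Exception:
            stdout, stderr = "", ""
    return process.returncode, stdout or "", stderr or ""


def is_empty_run(returncode: int | None, result: str) -> bool:
    """Flag a run as broken only on a non-zero exit or no output at all."""
    return returncode != 0 or not result.strip()
```

With this rule, a three-character answer from a process that exits 0 counts as a valid run, while any output from a killed or crashed process is still flagged for retry.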

ark/cli.py

Lines changed: 23 additions & 105 deletions
@@ -43,15 +43,13 @@ def get_projects_dir() -> Path:


 def _get_configured_model(default: str = "claude-sonnet-4-6") -> str:
-    """Read model from .ark/config.yaml, falling back to default."""
-    config_file = get_config_dir() / "config.yaml"
-    try:
-        if config_file.exists():
-            with open(config_file) as f:
-                cfg = yaml.safe_load(f) or {}
-            return cfg.get("model", default)
-    except Exception:
-        pass
+    """
+    Return the default ARK model.
+
+    No global config fallback — ARK is multi-tenant; per-project config.yaml
+    must declare its own model. Callers needing a project-specific model
+    should read it from the project's own config.yaml.
+    """
     return default


@@ -1503,9 +1501,8 @@ def _cmd_new_wizard(args, name: str, project_dir: Path, pdf_spec: dict):
     # Check for Gemini API key
     from ark.deep_research import get_gemini_api_key
     if not get_gemini_api_key():
-        print(f" {_c('Note:', Colors.YELLOW)} No Gemini API key found.")
-        print(f" Set it: ark config {name} --set gemini_api_key=YOUR_KEY")
-        print(f" Or add to ~/.ark/config.yaml")
+        print(f" {_c('Note:', Colors.YELLOW)} No Gemini API key found in environment.")
+        print(f" Export it before running: export GEMINI_API_KEY=YOUR_KEY")
     else:
         print(f" {_c('✓', Colors.GREEN)} Gemini API key found")

@@ -1705,28 +1702,18 @@ def _finalize_project(name: str, project_dir: Path, config: dict,
 # ============================================================

 def _run_deep_research_for_project(config: dict, state_dir: Path, custom_query: str = None):
-    """Run Gemini Deep Research for a project. Auto-runs if API key is available."""
-    from ark.deep_research import (
-        run_deep_research,
-        get_gemini_api_key,
-        save_gemini_api_key,
-    )
+    """Run Gemini Deep Research for a project if GEMINI_API_KEY is set."""
+    from ark.deep_research import run_deep_research, get_gemini_api_key

     print()
     print(f"{_c('Deep Research (Gemini)', Colors.BOLD)}")

     api_key = get_gemini_api_key()
     if not api_key:
-        if sys.stdin.isatty():
-            print(" No Gemini API key found.")
-            api_key = prompt_input(" Enter Gemini API key").strip()
-            if api_key:
-                save_gemini_api_key(api_key)
-                print(f" {_c('API key saved to ~/.ark/config.yaml', Colors.DIM)}")
-        if not api_key:
-            print(f" {_c('Skipped: no API key. Set with: ark config gemini-key <KEY>', Colors.YELLOW)}")
-            print()
-            return None
+        print(f" {_c('Skipped: no GEMINI_API_KEY in environment.', Colors.YELLOW)}")
+        print(f" {_c('Export it before running: export GEMINI_API_KEY=YOUR_KEY', Colors.DIM)}")
+        print()
+        return None

     # Allow custom query interactively
     if custom_query is None and sys.stdin.isatty():
@@ -2816,85 +2803,16 @@ def cmd_setup_bot(args):
         _cmd_setup_bot_project(project)
         return

-    # Global setup
-    from ark.telegram import TelegramConfig
-
-    print()
-    print(f" {_c('Telegram Setup (Global)', Colors.BOLD + Colors.CYAN)}")
-    print()
-
-    # Load existing global config
-    existing = TelegramConfig()
-    existing_token = existing._load_global().get("bot_token")  # global only, not project
-
-    if existing_token:
-        print(f" {_c('Existing bot token found.', Colors.YELLOW)}")
-        if not prompt_yn(" Replace it?", default=False):
-            tg_token = existing_token
-        else:
-            tg_token = prompt_input(" New Bot Token").strip()
-    else:
-        print(f" Get your Bot Token from @BotFather in Telegram (/newbot).")
-        tg_token = prompt_input(" Bot Token").strip()
-
-    if not tg_token:
-        print("Aborted.")
-        return
-
-    # Auto-detect chat_id via getUpdates
+    # Global telegram config is no longer supported. ARK is multi-tenant:
+    # each project must have its own bot token, configured per-project.
     print()
-    print(f" {_c('Now go to Telegram and send any message to your bot.', Colors.BOLD)}")
-    input(" Press Enter when done... ")
-
-    import urllib.request, json as _json, time as _time
-    tg_chat_id = None
-    for attempt in range(3):
-        try:
-            url = f"https://api.telegram.org/bot{tg_token}/getUpdates"
-            with urllib.request.urlopen(url, timeout=10) as resp:
-                data = _json.loads(resp.read())
-            results = data.get("result", [])
-            if results:
-                last = results[-1]
-                msg = last.get("message") or last.get("edited_message") or {}
-                sender = msg.get("from", {})
-                tg_chat_id = str(sender.get("id") or msg.get("chat", {}).get("id", ""))
-                if tg_chat_id:
-                    print(f" {_c(f'→ Chat ID detected: {tg_chat_id}', Colors.GREEN)}")
-                    break
-        except Exception as e:
-            print(f" {_c(f'Warning: {e}', Colors.YELLOW)}")
-        if attempt < 2:
-            print(f" No messages found yet, retrying in 3s...")
-            _time.sleep(3)
-
-    if not tg_chat_id:
-        print(f" {_c('Could not auto-detect Chat ID.', Colors.YELLOW)}")
-        tg_chat_id = prompt_input(" Enter Chat ID manually").strip()
-
-    if not tg_chat_id:
-        print("Aborted.")
-        return
-
-    # Send test message
-    try:
-        url = f"https://api.telegram.org/bot{tg_token}/sendMessage"
-        data = _json.dumps({
-            "chat_id": tg_chat_id,
-            "text": "✅ ARK Telegram notifications configured!",
-            "parse_mode": "Markdown",
-        }).encode("utf-8")
-        req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
-        urllib.request.urlopen(req, timeout=10)
-        print(f" {_c('→ Test message sent! Check your Telegram.', Colors.GREEN)}")
-    except Exception as e:
-        print(f" {_c(f'Warning: test message failed: {e}', Colors.YELLOW)}")
-
-    # Save to global config (~/.ark/telegram.yaml)
-    existing.save(tg_token, tg_chat_id)
-
+    print(f" {_c('Global telegram setup is no longer supported.', Colors.YELLOW)}")
     print()
-    print(f" {_c('Saved to ~/.ark/telegram.yaml (shared by all projects).', Colors.GREEN)}")
+    print(f" ARK is multi-tenant — each project must have its own bot.")
+    print(f" Configure per-project: {_c('ark setup-bot <project_name>', Colors.BOLD)}")
+    print(f" Or set env vars before running:")
+    print(f" {_c('export ARK_TELEGRAM_BOT_TOKEN=...', Colors.DIM)}")
+    print(f" {_c('export ARK_TELEGRAM_CHAT_ID=...', Colors.DIM)}")
     print()


ark/deep_research.py

Lines changed: 7 additions & 58 deletions
@@ -9,72 +9,21 @@
 import os
 import threading
 import time
-import yaml
 from datetime import datetime
 from pathlib import Path


-from ark.paths import get_config_dir
-
-
-def _global_config() -> Path:
-    return get_config_dir() / "config.yaml"
-
-
 def get_gemini_api_key() -> str:
     """
-    Get Gemini API key from env var or global config.
+    Return the Gemini API key from the process environment.

-    When ``ARK_NO_GLOBAL_CONFIG=1`` is set (which the webapp does for every
-    orchestrator subprocess), the global ``.ark/config.yaml`` fallback is
-    skipped — only env vars are honored. This prevents one webapp user's
-    project from silently using another user's (or the lab admin's)
-    Gemini key when they haven't configured their own.
+    ARK is multi-tenant: each user must supply their own key. There is
+    no shared/global config fallback. The webapp injects the key from
+    the project owner's encrypted user record into the orchestrator
+    subprocess as an environment variable; CLI users must export
+    ``GEMINI_API_KEY`` (or the synonym ``GOOGLE_API_KEY``) themselves.
     """
-    # 1. Environment variable
-    key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
-    if key:
-        return key
-
-    # 2. Global config (skipped under webapp / multi-user mode)
-    no_global = os.environ.get("ARK_NO_GLOBAL_CONFIG", "").strip().lower()
-    if no_global and no_global not in ("0", "false", "no", "off"):
-        return ""
-
-    if _global_config().exists():
-        try:
-            with open(_global_config()) as f:
-                cfg = yaml.safe_load(f) or {}
-            key = cfg.get("gemini_api_key")
-            if key:
-                return key
-        except Exception:
-            pass
-
-    return ""
-
-
-def save_gemini_api_key(key: str):
-    """Save Gemini API key to global config."""
-    _global_config().parent.mkdir(parents=True, exist_ok=True)
-
-    cfg = {}
-    if _global_config().exists():
-        try:
-            with open(_global_config()) as f:
-                cfg = yaml.safe_load(f) or {}
-        except Exception:
-            pass
-
-    cfg["gemini_api_key"] = key
-    with open(_global_config(), "w") as f:
-        yaml.dump(cfg, f, default_flow_style=False)
-
-    # Restrict permissions
-    try:
-        os.chmod(_global_config(), 0o600)
-    except Exception:
-        pass
+    return os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY") or ""


 def build_research_query(config: dict) -> str:
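
The new docstring says the webapp injects the project owner's key into the orchestrator subprocess as an environment variable. That injection code is not part of this diff, so the following is only a sketch of one way it might be done. `build_child_env` and `key_seen_by_child` are hypothetical names, and scrubbing inherited keys from the parent environment is an assumption about what a multi-tenant launcher would want, not something the commit confirms.

```python
import os
import subprocess
import sys

# Both variable names are honoured by get_gemini_api_key() in the diff above.
KEY_VARS = ("GEMINI_API_KEY", "GOOGLE_API_KEY")


def build_child_env(owner_key: str) -> dict[str, str]:
    """Copy the parent environment, replacing any inherited Gemini key."""
    env = dict(os.environ)
    # Assumption: scrub keys inherited from the server process so a child
    # can never fall back to whoever launched the webapp.
    for name in KEY_VARS:
        env.pop(name, None)
    if owner_key:
        env["GEMINI_API_KEY"] = owner_key
    return env


def key_seen_by_child(owner_key: str) -> str:
    """Spawn a child that resolves the key with the same env-only lookup."""
    probe = (
        "import os; "
        "print(os.environ.get('GEMINI_API_KEY') "
        "or os.environ.get('GOOGLE_API_KEY') or '')"
    )
    done = subprocess.run(
        [sys.executable, "-c", probe],
        env=build_child_env(owner_key),
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout.strip()
```

Under this sketch, a project whose owner has no key configured sees an empty string in the child, even if the server process itself has `GOOGLE_API_KEY` exported.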

ark/execution.py

Lines changed: 1 addition & 1 deletion
@@ -552,7 +552,7 @@ def _run_experiment_task(self, issue: dict, action_plan: dict):
 2. Check whether result files were generated
 3. Evaluate whether the data supports the paper's arguments
 4. Update auto_research/state/findings.yaml with new findings
-""", timeout=1800)
+""", timeout=3600)

         # 3. Wait for jobs
         self.log_step("Waiting for experiment completion...", "info")

ark/orchestrator.py

Lines changed: 7 additions & 11 deletions
@@ -32,7 +32,6 @@
 PROJECT_DIR = None

 from ark.memory import get_memory, SimpleMemory
-from ark.paths import get_config_dir
 from ark.agents import AgentMixin
 from ark.compiler import CompilerMixin
 from ark.execution import ExecutionMixin
@@ -362,16 +361,13 @@ def stop_telegram_listener(self):
             self.telegram.stop()

     def _get_bot_model(self) -> str:
-        """Read bot model preference from .ark/config.yaml."""
-        config_file = get_config_dir() / "config.yaml"
-        try:
-            if config_file.exists():
-                with open(config_file) as f:
-                    cfg = yaml.safe_load(f) or {}
-                return cfg.get("bot_model", "claude-sonnet-4-6")
-        except Exception:
-            pass
-        return "claude-sonnet-4-6"
+        """
+        Return the model used for Telegram bot replies.
+
+        Prefers a per-project ``bot_model`` from the project config; falls
+        back to the default. No global config fallback — ARK is multi-tenant.
+        """
+        return self.config.get("bot_model") or "claude-sonnet-4-6"

     def _handle_telegram_message(self, text: str):
         """Handle incoming Telegram message via Claude agent."""

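
The new `_get_bot_model` uses `self.config.get("bot_model") or "claude-sonnet-4-6"` where the old code used `cfg.get("bot_model", "claude-sonnet-4-6")`. The difference matters when the key is present but empty. The standalone sketch below shows both forms; the function names are hypothetical and a plain dict stands in for the project config.

```python
DEFAULT_BOT_MODEL = "claude-sonnet-4-6"


def bot_model_with_or(config: dict) -> str:
    """Fall back to the default for a missing, None, or empty value."""
    return config.get("bot_model") or DEFAULT_BOT_MODEL


def bot_model_with_get_default(config: dict):
    """Fall back to the default only when the key is missing entirely."""
    return config.get("bot_model", DEFAULT_BOT_MODEL)


if __name__ == "__main__":
    # A YAML line such as "bot_model:" with no value loads as None.
    blank = {"bot_model": None}
    print(bot_model_with_or(blank))
    print(bot_model_with_get_default(blank))
```

With the `or` form, a config file that lists `bot_model:` without a value still resolves to a usable model name instead of passing `None` downstream.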