Process and Resource Lifecycle Management

Core source files for this chapter: cli.py (8430-8442 signal handling, 824-950 worktree cleanup), gateway/run.py (1471-1528 graceful shutdown), tools/process_registry.py, tools/browser_tool.py (374-478 cleanup daemon thread), tools/terminal_tool.py (715-811 cleanup daemon thread), gateway/status.py, hermes_cli/doctor.py

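Scope: this chapter dissects Hermes's "layer 1 defense": process lifecycle management. It covers signal handling, the Gateway's 5-step shutdown sequence, the Process Registry's LRU pruning, checkpointing, and rolling buffer, the Browser/Terminal cleanup daemon threads, age-based worktree cleanup, and Doctor diagnostics. Prerequisites: Chapter 13 (CLI) and Chapter 14 (Gateway). Use it when you need to understand how Hermes protects data and cleans up resources on abnormal exits such as an SSH disconnect, SIGTERM, or an OOM kill.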

Why lifecycle management is the first layer of stability

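What fundamentally separates an agent system from an ordinary CLI tool is how long it runs and how many resources it holds. An ls command lasts milliseconds; a Hermes session can last for hours. During that time, Hermes may:

Hold 3 subprocesses (terminal commands)
Keep 1 browser instance open (Playwright/Camofox)
Maintain 1 SQLite connection (session state)
Manage 5 Gateway platform connections (Telegram + Discord + Slack + ...)
Create 2 git worktrees

If the process is terminated by kill -9, an SSH disconnect, or the OOM killer, these resources become orphans. Hermes's lifecycle management exists to minimize the damage caused by such abnormal exits. SIGKILL and OOM kills cannot be intercepted, so in those cases the protection has to come from state persisted beforehand and from later cleanup passes rather than from a handler.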

Signal handling: SIGTERM/SIGHUP → KeyboardInterrupt

# cli.py:8430-8442
def _signal_handler(signum, frame):
    """Handle SIGHUP/SIGTERM by triggering graceful cleanup."""
    logger.debug("Received signal %s, triggering graceful shutdown", signum)
    raise KeyboardInterrupt()

try:
    import signal as _signal
    _signal.signal(_signal.SIGTERM, _signal_handler)
    if hasattr(_signal, 'SIGHUP'):
        _signal.signal(_signal.SIGHUP, _signal_handler)
except Exception:
    pass  # Signal handlers may fail in restricted environments

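Hermes converts both SIGTERM and SIGHUP into KeyboardInterrupt. There are three reasons for this choice:

Reuse the existing Ctrl+C cleanup path: try/except KeyboardInterrupt blocks are already in place throughout the code, so funneling signals into KeyboardInterrupt reuses that cleanup logic
SIGHUP covers SSH disconnects: when the user's SSH connection drops, the terminal hangup delivers SIGHUP to the process. Left unhandled, the default action terminates the process without running atexit handlers
Tolerance for restricted environments: except Exception: pass ensures that a failed signal registration in restricted environments such as Docker or systemd does not block startup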

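Gateway graceful shutdown: the 5-step sequence

Gateway shutdown is the most complex lifecycle event: it has to deal with multiple platform connections, running agents, background tasks, and persisted state all at once.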

sequenceDiagram
    participant Signal as SIGTERM/SIGHUP
    participant GW as Gateway
    participant Agent as Running Agents
    participant Platform as Platform Adapters
    participant Tasks as Background Tasks
    participant Status as Status File

    Signal->>GW: stop() called
    Note over GW: Step 1: Set _running = False

    GW->>Agent: interrupt("Gateway shutting down")
    Note over Agent: Step 2: Interrupt all running agents
    Agent->>Agent: shutdown_memory_provider()
    Agent->>Agent: on_session_finalize hook

    GW->>Platform: cancel_background_tasks()
    GW->>Platform: disconnect()
    Note over Platform: Step 3: Disconnect all adapters

    GW->>Tasks: task.cancel() for each
    Note over Tasks: Step 4: Cancel background tasks

    GW->>Status: remove_pid_file()
    GW->>Status: write_runtime_status(stopped)
    Note over Status: Step 5: Clean up state files

Source walkthrough (gateway/run.py:1471-1528)

# gateway/run.py:1471-1528
async def stop(self) -> None:
    self._running = False                              # Step 1: stop accepting new messages

    for session_key, agent in list(self._running_agents.items()):
        agent.interrupt("Gateway shutting down")       # Step 2: interrupt running agents
        # Fire the plugin hook
        _invoke_hook("on_session_finalize", session_id=..., platform="gateway")
        # Shut down the memory provider
        agent.shutdown_memory_provider()

    for platform, adapter in list(self.adapters.items()):
        await adapter.cancel_background_tasks()        # Step 3a: cancel adapter background tasks
        await adapter.disconnect()                     # Step 3b: disconnect from the platform

    for _task in list(self._background_tasks):
        _task.cancel()                                 # Step 4: cancel Gateway-level background tasks
    self._background_tasks.clear()

    self.adapters.clear()
    self._running_agents.clear()

    remove_pid_file()                                  # Step 5a: delete the PID file
    write_runtime_status(gateway_state="stopped",      # Step 5b: record the exit reason
                         exit_reason=self._exit_reason)

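Every step is wrapped in try/except (the exception handling is omitted from the listing above to keep the logic visible), so a failure in one step does not block the steps after it.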
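The per-step isolation that the listing omits might look roughly like the sketch below. It assumes each step can be expressed as a coroutine function; run_shutdown_steps and the three demo steps are hypothetical names, not Gateway code.

```python
import asyncio
import logging

logger = logging.getLogger("shutdown_sketch")


async def run_shutdown_steps(steps):
    """Run (name, coroutine function) pairs in order; one failure never stops the rest."""
    failed = []
    for name, step in steps:
        try:
            await step()
        except Exception as exc:
            logger.warning("shutdown step %s failed: %s", name, exc)
            failed.append(name)
    return failed


def demo():
    done = []

    async def disconnect_adapters():
        raise ConnectionError("platform unreachable")

    async def cancel_tasks():
        done.append("cancel_tasks")

    async def write_status():
        done.append("write_status")

    steps = [
        ("disconnect_adapters", disconnect_adapters),
        ("cancel_tasks", cancel_tasks),
        ("write_status", write_status),
    ]
    failed = asyncio.run(run_shutdown_steps(steps))
    return done, failed
```

In this sketch a failed disconnect is logged and recorded, and the status file is still written, which is the property that matters when a process manager is waiting for the service to stop.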

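AIAgent.close(): unified resource cleanup

The Gateway's 5-step shutdown sequence handles Gateway-level resources. But each AIAgent instance also holds a substantial set of resources of its own. AIAgent.close() (run_agent.py:2800-2850) provides unified agent-level cleanup and runs 5 steps in a fixed order:

Kill background processes: terminate, via the Process Registry, every background subprocess this agent started
Clean up Terminal sandboxes: shut down Docker/Modal/SSH and other sandbox environments, releasing containers and remote connections
Clean up Browser sessions: close Playwright/Camofox browser instances, releasing Chromium processes or cloud Browserbase instances
Close child agents: recursively close child agents created through delegate() (each child calls its own close())
Close the HTTP client: close the httpx.AsyncClient and release its connection pool

This method is called in two situations: at Gateway shutdown (through the agent interrupt flow in stop()) and at session reset (before the old agent instance is replaced). The order of the 5 steps is deliberate: background processes may depend on the sandbox environment, so processes are killed before the sandbox is cleaned up; child agents may share the browser session, so the browser is cleaned up before child agents are closed.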

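Every step is wrapped in try/except so that a single failure does not block the rest of the cleanup, the same defensive pattern the Gateway's 5-step shutdown sequence uses.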
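To make the fixed order and the recursion concrete, here is a toy model. SketchAgent and CLOSE_ORDER are hypothetical and only log step names; the real AIAgent.close() is not reproduced here.

```python
CLOSE_ORDER = (
    "kill_processes",
    "cleanup_sandbox",
    "cleanup_browser",
    "close_children",
    "close_http",
)


class SketchAgent:
    """Hypothetical stand-in for an agent; every step just writes to a shared log."""

    def __init__(self, name, log, children=(), failing_step=None):
        self.name = name
        self.log = log
        self.children = list(children)
        self.failing_step = failing_step  # simulate one step raising

    def _run(self, step):
        if step == self.failing_step:
            raise RuntimeError(f"{step} failed")
        if step == "close_children":
            for child in self.children:
                child.close()  # each child runs its own full sequence
            return
        self.log.append(f"{self.name}:{step}")

    def close(self):
        for step in CLOSE_ORDER:
            try:
                self._run(step)
            except Exception:
                self.log.append(f"{self.name}:{step}:error")


log = []
child = SketchAgent("child", log)
parent = SketchAgent("parent", log, children=[child], failing_step="cleanup_sandbox")
parent.close()
```

The resulting log shows the parent's sandbox failure recorded, followed by its browser cleanup, the child's complete sequence, and finally the parent's HTTP client.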

Process Registry

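Process Registry (tools/process_registry.py) manages background processes started with terminal(background=true). It solves three problems: output buffering, process tracking, and crash recovery.

Core data structures (process_registry.py:57-89)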

# tools/process_registry.py:57-59
MAX_OUTPUT_CHARS = 200_000      # 200KB rolling output buffer
FINISHED_TTL_SECONDS = 1800     # keep finished processes for 30 minutes
MAX_PROCESSES = 64              # LRU pruning threshold

# tools/process_registry.py:62-89
@dataclass
class ProcessSession:
    id: str                        # "proc_xxxxxxxxxxxx"
    command: str
    task_id: str = ""              # sandbox isolation key
    session_key: str = ""          # Gateway session key
    pid: Optional[int] = None
    process: Optional[subprocess.Popen] = None
    output_buffer: str = ""        # most recent 200KB of output
    max_output_chars: int = MAX_OUTPUT_CHARS
    detached: bool = False         # no pipe after crash recovery
    notify_on_complete: bool = False  # notify on completion

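LRU pruning

When the number of processes exceeds MAX_PROCESSES (64), _prune_if_needed() prunes by an LRU policy: it first removes the processes that finished longest ago and, if necessary, kills the oldest running process.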
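The pruning order can be sketched as follows. Entry and prune are hypothetical; the sketch assumes "oldest" means earliest finish time for finished entries and earliest start time for running ones, which may not match the exact bookkeeping in _prune_if_needed().

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Entry:
    started_at: float
    finished_at: Optional[float] = None  # None means still running


def prune(entries: Dict[str, Entry], max_entries: int) -> List[Tuple[str, str]]:
    """Evict until at most max_entries remain: finished entries first, then running ones."""
    actions = []
    while len(entries) > max_entries:
        finished = [
            (e.finished_at, key) for key, e in entries.items() if e.finished_at is not None
        ]
        if finished:
            _, victim = min(finished)  # the one that finished longest ago
            actions.append((victim, "dropped"))
        else:
            _, victim = min((e.started_at, key) for key, e in entries.items())
            actions.append((victim, "killed"))  # a real registry would signal the process
        del entries[victim]
    return actions


entries = {
    "a": Entry(started_at=1.0, finished_at=10.0),
    "b": Entry(started_at=2.0, finished_at=5.0),
    "c": Entry(started_at=3.0),
    "d": Entry(started_at=4.0),
}
actions = prune(entries, max_entries=1)
```

Finished entries cost nothing to drop, so they go first; a running process is only killed when every remaining entry is still running.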

JSON checkpoint (process_registry.py:54)

CHECKPOINT_PATH = get_hermes_home() / "processes.json"

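On every spawn/kill/exit, the Registry serializes process state to processes.json. When the Gateway restarts after a crash, _load_checkpoint() restores the metadata of known processes. Their pipes are gone (so they are marked detached=True), but the PIDs can still be used for status checks and kill.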
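A minimal sketch of the checkpoint round trip follows. The record fields and the save_checkpoint/load_checkpoint names are assumptions for illustration, and the write-then-rename step is a common safeguard added here, not something the source confirms the Registry does.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, sessions: dict) -> None:
    """Persist only serializable metadata; live Popen handles cannot be saved."""
    records = [
        {"id": sid, "pid": s["pid"], "command": s["command"]}
        for sid, s in sessions.items()
    ]
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(records), encoding="utf-8")
    os.replace(tmp, path)  # assumption: atomic rename so a crash never leaves half a file


def load_checkpoint(path: Path) -> dict:
    """Restore metadata after a restart; every restored entry is detached."""
    if not path.exists():
        return {}
    restored = {}
    for rec in json.loads(path.read_text(encoding="utf-8")):
        restored[rec["id"]] = {
            "pid": rec["pid"],
            "command": rec["command"],
            "process": None,   # the pipe died with the previous process
            "detached": True,  # only PID-based status checks and kill remain possible
        }
    return restored


def demo():
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "processes.json"
        live = {"proc_1": {"pid": 4242, "command": "pytest -v", "process": object()}}
        save_checkpoint(path, live)
        return load_checkpoint(path)
```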

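Output from background processes is read continuously into output_buffer by a daemon reader thread (process_registry.py:224-278). When the buffer exceeds MAX_OUTPUT_CHARS (200KB), the oldest content is truncated and the newest is kept. This lets process(action="poll") always return the most recent output without making the agent wade through several MB of historical logs.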
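The truncation rule itself is small enough to show in full. This is a sketch with a hypothetical helper name, append_output, using a 10-character limit instead of 200,000 so the effect is visible.

```python
def append_output(buffer: str, chunk: str, max_chars: int) -> str:
    """Append a chunk, then keep only the newest max_chars characters."""
    buffer += chunk
    if len(buffer) > max_chars:
        buffer = buffer[len(buffer) - max_chars:]  # drop the oldest characters
    return buffer


buf = ""
for chunk in ("aaaa", "bbbb", "cccc"):
    buf = append_output(buf, chunk, max_chars=10)
```

After the third chunk the buffer holds 12 characters, so the two oldest are dropped and the tail of the stream survives.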

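Browser cleanup daemon thread (browser_tool.py:374-478)

Browser sessions are the heaviest resource: each session may hold a Chromium process (local) or a Browserbase cloud instance (paid).

Two-layer cleanup mechanism

Layer 1: atexit handler (browser_tool.py:401-407)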

# browser_tool.py:401-407
# Register cleanup via atexit only.  Previous versions installed
# SIGINT/SIGTERM handlers that called sys.exit(), but this conflicts
# with prompt_toolkit's async event loop.
atexit.register(_emergency_cleanup_all_sessions)

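Note the lesson recorded in the comment: earlier versions called sys.exit() from SIGINT/SIGTERM handlers, but that conflicted with prompt_toolkit's async event loop. Raising SystemExit inside a key-binding callback corrupted coroutine state and left the process unkillable. Now only atexit is used, relying on the signal handling in cli.py to convert signals into KeyboardInterrupt.

Layer 2: inactivity cleanup thread (browser_tool.py:442-478)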

# browser_tool.py:442-478
def _browser_cleanup_thread_worker():
    while _cleanup_running:
        _cleanup_inactive_browser_sessions()   # check for and clean up timed-out sessions
        for _ in range(30):                     # one check every 30 seconds
            if not _cleanup_running:
                break
            time.sleep(1)                       # 1-second slices, so a stop is noticed quickly

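_cleanup_inactive_browser_sessions() (browser_tool.py:414-439) checks each session's last activity time and automatically closes any session idle longer than BROWSER_SESSION_INACTIVITY_TIMEOUT. The sleep is a loop of 1-second intervals rather than a single time.sleep(30), so the thread can respond quickly to _cleanup_running = False when the process exits.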
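The effect of slicing the sleep can be observed with a small worker. CleanupWorker and its parameters are hypothetical; the slice is shortened to 50 ms here so the demonstration finishes quickly.

```python
import threading
import time


class CleanupWorker:
    """Hypothetical periodic worker that sleeps in short slices so stop() returns fast."""

    def __init__(self, interval_s, slice_s, on_tick):
        self._interval_s = interval_s
        self._slice_s = slice_s
        self._on_tick = on_tick
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while self._running:
            self._on_tick()
            for _ in range(int(self._interval_s / self._slice_s)):
                if not self._running:
                    break  # leave the wait as soon as a stop is requested
                time.sleep(self._slice_s)

    def start(self):
        self._thread.start()

    def stop(self, timeout=5.0):
        self._running = False
        self._thread.join(timeout)
        return not self._thread.is_alive()


def demo():
    ticks = []
    first_tick = threading.Event()

    def on_tick():
        ticks.append(time.monotonic())
        first_tick.set()

    worker = CleanupWorker(interval_s=30, slice_s=0.05, on_tick=on_tick)
    worker.start()
    first_tick.wait(timeout=5)  # make sure the worker reached its sleep loop
    started = time.monotonic()
    stopped = worker.stop()
    return len(ticks), stopped, time.monotonic() - started
```

With a single time.sleep(30), stop() would have to wait out the remainder of the interval; with slices, the wait is bounded by one slice. A threading.Event with wait(timeout) is a common alternative that achieves the same result without polling.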

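Terminal cleanup daemon thread (terminal_tool.py:715-811)

Cleanup for Terminal environments (Docker/Modal/SSH, etc.) resembles the Browser logic, with one distinctive consideration: keeping sandboxes alive for background processes: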

# terminal_tool.py:715-727
def _cleanup_inactive_envs(lifetime_seconds: int = 300):
    # Check the process registry -- skip cleanup for sandboxes with
    # active background processes
    from tools.process_registry import process_registry
    for task_id in list(_last_activity.keys()):
        if process_registry.has_active_processes(task_id):
            _last_activity[task_id] = current_time  # Keep sandbox alive

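If a Docker container still has a background process running (say pytest -v), the container is not cleaned up even when the agent has not called terminal on it for 5 minutes: the liveness of the background process carries over to the liveness of the sandbox.

Cleanup is split into two phases (terminal_tool.py:729-774):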

# terminal_tool.py:729-750 (simplified)
# Phase 1: collect the environments to clean up while holding the lock, but do not clean them up yet
with _env_lock:
    for task_id, last_time in list(_last_activity.items()):
        if current_time - last_time > lifetime_seconds:
            env = _active_environments.pop(task_id, None)
            envs_to_stop.append((task_id, env))

# Phase 2: perform the actual cleanup outside the lock (stopping Modal/Docker can block for 10-15s)
for task_id, env in envs_to_stop:
    env.cleanup()

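The phases are separated to avoid holding the lock for too long: Modal and Docker teardown can block for 10-15 seconds, and if that ran inside the lock, every concurrent terminal/file tool call would be blocked.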
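The keep-alive rule and the two phases can be combined into one runnable sketch. EnvPool and FakeEnv are hypothetical; FakeEnv records whether the lock was held while its cleanup() ran, which is the property the phase split is meant to guarantee.

```python
import threading


class EnvPool:
    """Hypothetical container for sandbox environments keyed by task_id."""

    def __init__(self):
        self.lock = threading.Lock()
        self.envs = {}           # task_id -> environment object
        self.last_activity = {}  # task_id -> timestamp of last use

    def add(self, task_id, env, now):
        with self.lock:
            self.envs[task_id] = env
            self.last_activity[task_id] = now

    def cleanup_inactive(self, now, lifetime_seconds, has_active_processes):
        envs_to_stop = []
        # Phase 1 (fast, under the lock): decide what to stop and unregister it.
        with self.lock:
            for task_id, last_time in list(self.last_activity.items()):
                if has_active_processes(task_id):
                    self.last_activity[task_id] = now  # background work keeps it alive
                elif now - last_time > lifetime_seconds:
                    env = self.envs.pop(task_id, None)
                    del self.last_activity[task_id]
                    if env is not None:
                        envs_to_stop.append((task_id, env))
        # Phase 2 (slow, lock released): teardown may block for seconds.
        for task_id, env in envs_to_stop:
            try:
                env.cleanup()
            except Exception:
                pass  # one failed teardown must not stop the others
        return [task_id for task_id, _ in envs_to_stop]


class FakeEnv:
    """Records whether the pool lock was held while cleanup() ran."""

    def __init__(self, pool, observed):
        self.pool = pool
        self.observed = observed

    def cleanup(self):
        self.observed.append(self.pool.lock.locked())


pool = EnvPool()
lock_held_during_cleanup = []
for name, last_used in (("idle", 0.0), ("busy", 0.0), ("fresh", 900.0)):
    pool.add(name, FakeEnv(pool, lock_held_during_cleanup), now=last_used)

stopped = pool.cleanup_inactive(
    now=1000.0,
    lifetime_seconds=300,
    has_active_processes=lambda task_id: task_id == "busy",
)
```

Only the idle sandbox is stopped: the busy one has its timestamp refreshed, and the fresh one is still within its lifetime.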

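Age-based worktree cleanup (cli.py:824-950)

hermes -w creates a git worktree for each workflow. If a session is never cleaned up, because of a crash or because the user forgot, worktrees keep accumulating.

Three-tier cleanup strategy (cli.py:824-894)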

# cli.py:824-845
def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24):
    now = time.time()
    soft_cutoff = now - (max_age_hours * 3600)       # 24h
    hard_cutoff = now - (max_age_hours * 3 * 3600)   # 72h

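Age	Behavior	Reason
< 24h	Skip	The session may still be active
24h-72h	Check for unpushed commits; delete if there are none	Avoid losing the user's work
> 72h	Force delete	Nothing should sit around this long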

# cli.py:862-871
if not force:
    # 24h-72h: only remove if no unpushed commits
    result = subprocess.run(
        ["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
        capture_output=True, text=True, timeout=5, cwd=str(entry),
    )
    if result.stdout.strip():
        continue  # Has unpushed commits — skip
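The three tiers reduce to a small decision function. classify_worktree is a hypothetical helper, and how it treats an age of exactly 24h or 72h is an assumption; the real code compares timestamps against soft_cutoff and hard_cutoff.

```python
def classify_worktree(age_hours: float, has_unpushed: bool, max_age_hours: int = 24) -> str:
    """Return the action for a worktree of the given age."""
    if age_hours < max_age_hours:
        return "skip"              # possibly still in use
    if age_hours < max_age_hours * 3:
        # Soft tier: unpushed commits protect the worktree.
        return "keep" if has_unpushed else "delete"
    return "force_delete"          # hard tier: removed regardless of unpushed commits


decisions = [
    classify_worktree(2, has_unpushed=False),
    classify_worktree(30, has_unpushed=True),
    classify_worktree(30, has_unpushed=False),
    classify_worktree(100, has_unpushed=True),
]
```

Note the last case: past the hard cutoff, unpushed commits no longer protect a worktree.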

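Orphaned branch cleanup (cli.py:897-953)

After a worktree is deleted, its hermes/hermes-* and pr-* branches may still exist. _prune_orphaned_branches() compares against the branch list of active worktrees and deletes branches that have no corresponding worktree: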

# cli.py:943-947
orphaned = [
    b for b in all_branches
    if b not in active_branches
    and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
]

Gateway state persistence (gateway/status.py)

Gateway state is persisted through two files:

# gateway/status.py:23-25
_GATEWAY_KIND = "hermes-gateway"
_RUNTIME_STATUS_FILE = "gateway_state.json"

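gateway.pid: the PID file, used to detect whether the Gateway is running. The CLI's send_message feature checks this file to decide whether it is available
gateway_state.json: runtime status, including start time, connected platforms, exit reason, and so on. Doctor diagnostics and hermes gateway status read this file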

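The PID-file liveness check does more than test whether the file exists: it also verifies that the process with that PID is actually running (status.py:60-67), which prevents a stale PID file left behind by a crash from causing a false positive.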
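One conventional way to implement such a check on POSIX systems is to send signal 0. The sketch below is an assumption about the general technique, not a copy of status.py; pid_is_alive and read_live_pid are hypothetical names. It should not be used as-is on Windows, where os.kill behaves differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """POSIX: signal 0 runs the existence and permission checks without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the PID file is stale
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def read_live_pid(pid_file: Path) -> Optional[int]:
    """Return the PID only if the file is readable, parses, and names a live process."""
    try:
        pid = int(pid_file.read_text(encoding="utf-8").strip())
    except (OSError, ValueError):
        return None
    return pid if pid_is_alive(pid) else None


def demo():
    with tempfile.TemporaryDirectory() as tmp_dir:
        pid_file = Path(tmp_dir) / "gateway.pid"
        missing = read_live_pid(pid_file)            # file does not exist yet
        pid_file.write_text("not-a-pid", encoding="utf-8")
        garbage = read_live_pid(pid_file)            # unparseable content
        pid_file.write_text(str(os.getpid()), encoding="utf-8")
        own = read_live_pid(pid_file)                # this process is certainly alive
    return missing, garbage, own
```

A bare PID check can still be fooled if the operating system has reused the PID for an unrelated process, so a stricter check would also compare some identity of the process.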

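Doctor diagnostics (hermes_cli/doctor.py)

hermes doctor is a diagnostic command that checks Hermes's installation and configuration. Its role in lifecycle management is after-the-fact diagnosis: when a user reports a problem, doctor can quickly pinpoint it: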

# hermes_cli/doctor.py:1-28
# Load .env so API key checks work
from dotenv import load_dotenv
load_dotenv(_env_path, encoding="utf-8")
load_dotenv(PROJECT_ROOT / ".env", override=False, encoding="utf-8")

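Doctor loads .env on its own (without relying on an already-loaded environment), so it can diagnose correctly even when environment variables are polluted. The items it checks include:

Whether API keys are configured
Provider connectivity
Whether tool dependencies are installed (git, node, etc.)
Gateway status
Configuration structure errors (via validate_config_structure())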

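Multi-layer timeout system

Hermes does not have a single timeout; it has a layered system of timeouts that widen from one layer to the next:

Layer	Timeout	Location	Purpose
API call	60s read timeout	run_agent.py streaming path	Timeout for a single API response
Stale stream	90s	run_agent.py streaming path	Detection of a stalled streaming response
Tool execution	180s (configurable)	terminal config	Timeout for a single shell command
Agent inactivity	1800s (configurable)	agent.gateway_timeout	Overall agent timeout in the Gateway scenario
Browser inactivity	120s	browser config	Automatic browser session close
Terminal inactivity	300s	cleanup thread	Automatic sandbox environment cleanup
Process retention	1800s	process registry	Metadata retention for finished processes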

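Design takeaways

Signals unified as KeyboardInterrupt: rather than writing separate handling logic for each signal, convert them all into an exception type Python already has and reuse the existing cleanup path. This is simple and reliable
Cleanup in phases: both Browser and Terminal cleanup separate "deciding what to clean up" (fast, inside the lock) from "performing the cleanup" (slow, outside the lock), which keeps lock hold times short
Multi-tier age-based cleanup: the worktree's three-tier 24h/72h strategy balances "don't clean up active sessions" against "don't let garbage pile up forever". Within the 24h-72h window, worktrees with unpushed commits are protected, so the user's work is not lost

Design bet callback: this chapter serves the Run Anywhere bet. The Gateway's 5-step shutdown sequence lets Hermes respond correctly to systemctl stop on a systemd-managed VPS; the Process Registry's checkpoint lets the Gateway resume tracking background processes after a crash and restart; and signal handling has a fallback in Docker, SSH, and restricted environments.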

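Version evolution notes

The core analysis in this chapter is based on Hermes Agent v0.8.0 (April 2026). Lifecycle management capabilities were tightened in batches between v0.3.0 and v0.8.0: the Process Registry, browser/session cleanup, and basic persistence came first, followed step by step by PTY support, notifications, multi-tier worktree reclamation, and finer-grained state recovery logic. The significant strengthening of worktree cleanup belongs squarely to the closing phase of v0.8.0. v0.8.0 also added the AIAgent.close() unified cleanup method, consolidating cleanup logic scattered across the codebase into an ordered 5-step sequence.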

Hermes Agent: Source Code and Design