测试体系：一个复杂 agent 如何被约束住

本章核心源码：tests/conftest.py、tests/ 目录结构

定位：本章拆解 Hermes 的测试体系——从 _isolate_hermes_home fixture 的全局隔离设计到测试目录作为架构镜像的组织方式，理解一个拥有 400+ 测试文件的复杂 agent 如何用测试约束自身的正确性。前置依赖：第 2 章（仓库地图）、第 17-21 章（工程基础各章）。适用场景：想给 Hermes 添加测试，或想借鉴 agent 系统的测试方法论。

为什么 agent 测试比普通软件更难

测试一个 LLM agent 面临三个独特挑战：

非确定性输出：同一个 prompt 送给同一个模型，两次返回的内容可能不同。这意味着传统的"输入-输出断言"方法只能覆盖编排逻辑，不能覆盖端到端行为
副作用密集：agent 执行工具时会创建文件、运行命令、打开浏览器、发送消息。测试必须隔离这些副作用，否则测试之间会互相污染
全局状态耦合：配置文件、会话数据库、记忆文件都存储在 HERMES_HOME（默认 ~/.hermes）。如果测试读写用户的真实 home 目录，不仅会破坏用户数据，还会因为机器间配置差异导致测试结果不可重复

Hermes 的测试体系围绕这三个挑战构建了三层防线：环境隔离（conftest.py）、架构镜像（目录组织）、稳定性专项（生命周期/中断测试）。

autouse _isolate_hermes_home：测试隔离的基石

fixture 解析（tests/conftest.py:19-41）

# tests/conftest.py:19-41
@pytest.fixture(autouse=True)
def _isolate_hermes_home(tmp_path, monkeypatch):
    """Redirect HERMES_HOME to a temp dir so tests never write to ~/.hermes/."""
    fake_home = tmp_path / "hermes_test"
    fake_home.mkdir()
    (fake_home / "sessions").mkdir()
    (fake_home / "cron").mkdir()
    (fake_home / "memories").mkdir()
    (fake_home / "skills").mkdir()
    monkeypatch.setenv("HERMES_HOME", str(fake_home))

    # Reset plugin singleton so tests don't leak plugins from ~/.hermes/plugins/
    try:
        import hermes_cli.plugins as _plugins_mod
        monkeypatch.setattr(_plugins_mod, "_plugin_manager", None)
    except Exception:
        pass

    # Tests should not inherit the agent's current gateway/messaging surface
    monkeypatch.delenv("HERMES_SESSION_PLATFORM", raising=False)
    monkeypatch.delenv("HERMES_SESSION_CHAT_ID", raising=False)
    monkeypatch.delenv("HERMES_SESSION_CHAT_NAME", raising=False)
    monkeypatch.delenv("HERMES_GATEWAY_SESSION", raising=False)

这个 fixture 的设计有四个层次：

层次	代码行	作用
目录隔离	21-25	每个测试在独立的 `tmp_path` 下创建 HERMES_HOME
环境变量重定向	26	`monkeypatch.setenv("HERMES_HOME", ...)`
Plugin 单例重置	29-32	防止 plugins 从真实 home 泄漏到测试
Gateway 状态清理	35-38	删除 gateway 相关环境变量

autouse=True 意味着这个 fixture 自动应用于所有测试——开发者不需要显式声明就能获得完全隔离的测试环境。这是一个关键决策：如果改为 opt-in，新加的测试可能忘记声明而意外读写用户的 ~/.hermes。

为什么不用 mock HERMES_HOME

一种替代方案是 monkeypatch.setattr(hermes_constants, "get_hermes_home", lambda: fake_path)。Hermes 选择了设置环境变量而非 mock 函数，原因是：

覆盖面更广：许多模块在导入时就通过 os.getenv("HERMES_HOME", ...) 缓存了路径值。Mock 函数只能影响调用 get_hermes_home() 的代码，不能影响直接读环境变量的代码
subprocess 继承：如果测试中启动了子进程（如 terminal 工具测试），子进程会继承环境变量，也指向临时目录。Mock 函数不会传递到子进程

测试目录：架构的镜像

tests/
├── conftest.py         <-- 全局 fixture（_isolate_hermes_home 等）
├── run_agent/          <-- 编排层（AIAgent 大循环）
│   ├── test_run_agent.py
│   ├── test_interrupt_propagation.py
│   └── test_real_interrupt_subagent.py
├── agent/              <-- 编排支撑（agent/ 模块）
├── tools/              <-- 能力层（工具系统）
│   ├── test_interrupt.py
│   └── test_code_execution.py
├── skills/             <-- 能力层（技能系统）
├── gateway/            <-- 平台层（Gateway + 平台适配）
├── hermes_cli/         <-- 入口层（CLI 命令）
├── cli/                <-- 入口层（TUI 交互）
│   ├── test_cli_init.py
│   ├── test_cli_interrupt_subagent.py
│   ├── test_cli_retry.py
│   └── test_cli_provider_resolution.py
├── cron/               <-- 入口层（定时调度）
├── acp/                <-- 平台层（ACP 适配）
├── plugins/            <-- 能力层（插件系统）
├── honcho_plugin/      <-- 能力层（Honcho 记忆插件）
├── e2e/                <-- 端到端集成测试
├── integration/        <-- 集成测试
│   └── test_checkpoint_resumption.py
├── environments/       <-- 研究（RL 环境）
├── fakes/              <-- 测试用 mock 对象
└── (顶层测试文件)      <-- 跨模块测试
    ├── test_hermes_state.py
    ├── test_model_tools.py
    ├── test_model_tools_async_bridge.py
    └── test_hermes_logging.py

测试目录几乎是生产代码的一一对应镜像。这个组织方式的价值是：

定位效率：想了解 tools/terminal_tool.py 的行为？直接看 tests/tools/test_terminal_tool.py
覆盖率可视：如果生产代码有一个目录 gateway/platforms/ 但 tests/gateway/ 下没有对应的测试文件，说明测试覆盖有缺口
增量开发：修改某个模块时，只需运行对应测试目录即可验证，不需要跑全量测试

全局测试超时（conftest.py:67-119）

SIGALRM 超时（conftest.py:108-119）

# tests/conftest.py:72-73
def _timeout_handler(signum, frame):
    raise TimeoutError("Test exceeded 30 second timeout")

# tests/conftest.py:108-119
@pytest.fixture(autouse=True)
def _enforce_test_timeout():
    """Kill any individual test that takes longer than 30 seconds."""
    if sys.platform == "win32":
        yield
        return
    old = signal.signal(signal.SIGALRM, _timeout_handler)
    signal.alarm(30)
    yield
    signal.alarm(0)
    signal.signal(signal.SIGALRM, old)

每个测试最多运行 30 秒，超时后抛出 TimeoutError。这防止：

subprocess 启动后未 terminate 导致测试挂起
网络请求没有设置超时导致等待 DNS 解析
死锁（多线程测试中偶发的 lock ordering 问题）

Windows 上跳过（SIGALRM 是 Unix-only），其他平台也有 except Exception: pass 的防御。

Event loop fixture（conftest.py:76-105）

# tests/conftest.py:76-105
@pytest.fixture(autouse=True)
def _ensure_current_event_loop(request):
    """Provide a default event loop for sync tests that call get_event_loop()."""
    if request.node.get_closest_marker("asyncio") is not None:
        yield
        return

    try:
        loop = asyncio.get_event_loop_policy().get_event_loop()
    except RuntimeError:
        loop = None

    created = loop is None or loop.is_closed()
    if created:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

    try:
        yield
    finally:
        if created and loop is not None:
            try:
                loop.close()
            finally:
                asyncio.set_event_loop(None)

Python 3.11+ 不再为同步测试保证默认 event loop。许多 Gateway 测试在同步测试中调用 asyncio.get_event_loop().run_until_complete()，这个 fixture 确保它们总是有可用的 loop。对于已标记 @pytest.mark.asyncio 的测试则跳过——避免与 pytest-asyncio 的 loop 管理冲突。

mock_config fixture（conftest.py:49-64）

# tests/conftest.py:49-64
@pytest.fixture()
def mock_config():
    """Return a minimal hermes config dict suitable for unit tests."""
    return {
        "model": "test/mock-model",
        "toolsets": ["terminal", "file"],
        "max_turns": 10,
        "terminal": {
            "backend": "local",
            "cwd": "/tmp",
            "timeout": 30,
        },
        "compression": {"enabled": False},
        "memory": {"memory_enabled": False, "user_profile_enabled": False},
        "command_allowlist": [],
    }

这个 fixture 不是 autouse 的——只有需要构造 AIAgent 的测试才显式使用。它的设计原则是最小化：

max_turns: 10（而非默认 90）：防止测试中的 agent 循环太久
compression: {enabled: False}：避免测试触发压缩的 API 调用
memory: {enabled: False}：避免测试写入记忆文件
command_allowlist: []：清空命令白名单，让安全检查处于默认状态

测试分层策略

Hermes 的测试分为四个层次，每层验证不同的关注点：

第一层：单元测试（tests/agent/, tests/tools/）

验证单个模块的行为，mock 所有外部依赖。例如：

tests/agent/ 下的测试验证 prompt 构建、记忆管理、模型路由等独立模块
tests/tools/ 验证工具注册、schema 生成、权限检查

第二层：集成测试（tests/run_agent/, tests/cli/）

验证模块之间的交互。例如：

tests/run_agent/test_interrupt_propagation.py 验证中断信号从父 agent 传播到子 agent
tests/cli/test_cli_retry.py 验证 CLI 层的 API 重试逻辑

第三层：平台测试（tests/gateway/）

验证 Gateway 和平台适配器的行为。这些测试通常 mock 平台 API，验证消息路由、会话管理、投递重试等逻辑。

第四层：端到端测试（tests/e2e/）

验证完整的用户场景。这些测试启动真实的 Hermes 进程，通过 CLI 输入发送消息，验证工具执行和输出。

稳定性专项测试

第 20-21 章讨论的生命周期管理和运行时防御需要专门的测试来验证：

中断传播测试

tests/run_agent/test_interrupt_propagation.py   -- 父→子 agent 中断
tests/run_agent/test_real_interrupt_subagent.py  -- 真实子代理中断
tests/tools/test_interrupt.py                    -- 工具执行中断
tests/cli/test_cli_interrupt_subagent.py         -- CLI 中断子代理

中断测试验证的核心问题是：当用户按 Ctrl+C 时，所有活跃的子代理和工具是否都能安全停止？ 这些测试通常使用 threading.Timer 模拟延迟中断，然后验证 _interrupt_requested 标志的传播和资源清理。

Checkpoint 恢复测试

tests/integration/test_checkpoint_resumption.py  -- 检查点恢复
tests/test_batch_runner_checkpoint.py            -- 批量运行检查点

Provider 解析测试

tests/cli/test_cli_provider_resolution.py        -- provider 路由正确性

这些测试验证第 18 章讨论的 API 模式检测和 provider 路由在各种输入组合下的行为。

Async 桥接测试

tests/test_model_tools_async_bridge.py           -- _run_async() 三种路径

验证第 19 章讨论的 _run_async() 在有/无 running event loop、主线程/worker 线程等不同上下文中的正确行为。

隔离是测试可靠性的基础：_isolate_hermes_home 的 autouse 设计确保每个测试在干净的环境中运行。没有这个基础，其他测试技巧都建在沙子上
镜像目录组织降低认知负担：不需要索引或搜索来找到某个模块的测试——路径即答案。这对一个有 400+ 测试文件的项目尤为重要
稳定性需要专门测试：中断传播、checkpoint 恢复、async 桥接这些行为不会被"正常路径"的测试覆盖到。它们需要专门的测试，刻意制造异常条件来验证防御机制

设计赌注回扣：本章回扣全部四个赌注——测试验证了所有赌注的工程实现：_isolate_hermes_home 确保 profile 隔离（Run Anywhere）；记忆和技能的测试验证 Learning Loop 和 Personal Long-Term 的正确性；CLI 交互测试验证 CLI-First 的体验质量；中断传播和 async 桥接测试验证跨入口一致性（Run Anywhere）。

版本演化说明

本章核心分析基于 Hermes Agent v0.8.0（2026 年 4 月）。测试隔离基础设施在很早的公开版本里就已经存在；之后 v0.5.0-v0.8.0 之间持续补上了 plugin 单例清理、gateway 环境变量清理、全局超时和 Python 3.11+ event loop 兼容。测试规模也从早期的几百个文件逐步增长到今天的 400+。

Hermes Agent 源码与设计