跳转到内容

离线评估与持续回归测试

前三节我们讨论了评估的必要性、指标体系和 LangSmith 平台。但 LangSmith 是在线评估——每次运行都要调用 LLM API,这会产生成本和延迟。对于开发阶段的快速迭代,我们需要一套离线评估 + 持续回归的机制。

离线评估 vs 在线评估

维度在线评估(LangSmith)离线评估(本地)
成本每次调用 LLM API使用本地模型或缓存结果
速度受网络延迟影响本地执行,毫秒级
隐私数据上传到云端数据不离开本地
适用阶段生产监控、A/B 测试开发迭代、模型选择
评估能力完整(支持 RAGAS)有限(需要自建)

两者不是替代关系,而是互补的。开发阶段用离线评估快速验证代码变更,上线后用在线评估持续监控质量。

离线评估工作流

第一步:准备测试数据集

离线评估需要一份标注好的测试数据。格式与在线评估类似,但保存在本地文件中:

python
# tests/eval_data/customer_service.json
{
    "test_cases": [
        {
            "id": "tc_001",
            "category": "pricing",
            "question": "免费版支持几个人?",
            "expected_answer": "免费版最多支持 5 名团队成员。",
            "context": "pricing.md",
            "metadata": {
                "priority": "high",
                "expected_intent": "product_inquiry",
            },
        },
        {
            "id": "tc_002",
            "category": "refund",
            "question": "订单完成后多久内可以申请退款?",
            "expected_answer": "订单完成后 30 天内可以申请退款。",
            "context": "policies.md",
            "metadata": {
                "priority": "high",
                "expected_intent": "refund_request",
            },
        },
        {
            "id": "tc_003",
            "category": "handoff",
            "question": "转人工",
            "expected_answer": "__HANDOFF__",
            "context": "N/A",
            "metadata": {
                "priority": "critical",
                "expected_intent": "handoff_request",
            },
        },
    ]
}

第二步:编写离线评估脚本

python
import json
from pathlib import Path
from typing import Dict, List
from dataclasses import dataclass

@dataclass
class TestCase:
    id: str
    category: str
    question: str
    expected_answer: str
    context: str
    metadata: dict

@dataclass
class EvalResult:
    test_case_id: str
    actual_answer: str
    passed: bool
    score: float
    reason: str
    latency_ms: float

class OfflineEvaluator:
    def __init__(self, bot, eval_data_path: str):
        self.bot = bot
        self.test_cases: List[TestCase] = self._load_test_cases(eval_data_path)

    def _load_test_cases(self, path: str) -> List[TestCase]:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return [TestCase(**case) for case in data["test_cases"]]

    def evaluate_all(self) -> Dict[str, List[EvalResult]]:
        results = {}
        for case in self.test_cases:
            result = self._evaluate_single(case)
            if case.category not in results:
                results[case.category] = []
            results[case.category].append(result)
        return results

    def _evaluate_single(self, case: TestCase) -> EvalResult:
        import time
        start = time.time()

        self.bot.initialize()
        response = self.bot.process_message(case.question)

        latency = (time.time() - start) * 1000

        passed, score, reason = self._compare_answers(
            case.expected_answer,
            response["response"],
            case.category,
        )

        return EvalResult(
            test_case_id=case.id,
            actual_answer=response["response"],
            passed=passed,
            score=score,
            reason=reason,
            latency_ms=latency,
        )

    def _compare_answers(self, expected: str, actual: str,
                       category: str) -> tuple[bool, float, str]:
        if category == "handoff":
            is_handoff = actual == "__HANDOFF__" or "转人工" in actual
            return is_handoff, 1.0 if is_handoff else 0.0, (
                "正确触发 Handoff" if is_handoff else "未触发 Handoff"
            )

        if expected.lower() in actual.lower():
            return True, 1.0, "完全匹配"

        if any(keyword in actual.lower() for keyword in expected.lower().split()):
            return True, 0.8, "部分匹配"

        if category == "pricing":
            if "5" in actual and "人" in actual:
                return True, 0.9, "包含关键数字"
            if "免费" in actual:
                return True, 0.7, "回答了免费版相关信息"

        return False, 0.0, "不匹配"

    def generate_report(self, results: Dict[str, List[EvalResult]]) -> str:
        report_lines = ["# 离线评估报告\n"]
        report_lines.append(f"生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

        for category, case_results in results.items():
            report_lines.append(f"\n## {category.upper()} 测试\n")
            passed_count = sum(1 for r in case_results if r.passed)
            total_count = len(case_results)
            pass_rate = passed_count / total_count * 100 if total_count > 0 else 0
            avg_score = sum(r.score for r in case_results) / total_count if total_count > 0 else 0

            report_lines.append(f"通过率: {pass_rate:.1f}% ({passed_count}/{total_count})")
            report_lines.append(f"平均得分: {avg_score:.2f}")

            failed_cases = [r for r in case_results if not r.passed]
            if failed_cases:
                report_lines.append(f"\n### 失败案例\n")
                for fc in failed_cases[:5]:
                    report_lines.append(f"- **{fc.test_case_id}**: {fc.reason}")
                    report_lines.append(f"  预期: {fc.actual_answer[:100]}...")

        return "\n".join(report_lines)

第三步:集成到 pytest

为了让评估脚本更易用,我们把它包装成 pytest 测试:

python
import pytest
from pathlib import Path

@pytest.fixture(scope="session")
def bot():
    from customer_service_bot import CustomerServiceBot
    bot = CustomerServiceBot()
    bot.initialize()
    return bot

@pytest.fixture
def evaluator(bot):
    from offline_evaluator import OfflineEvaluator
    return OfflineEvaluator(bot, "tests/eval_data/customer_service.json")

def test_pricing_questions(evaluator):
    results = evaluator.evaluate_all()
    pricing_results = results.get("pricing", [])

    passed = sum(1 for r in pricing_results if r.passed)
    total = len(pricing_results)
    pass_rate = passed / total * 100 if total > 0 else 0

    assert pass_rate >= 80, f"定价问题通过率仅 {pass_rate:.1f}%,期望 ≥80%"
    assert all(r.latency_ms < 2000 for r in pricing_results), "存在响应超过 2 秒的测试"

def test_refund_questions(evaluator):
    results = evaluator.evaluate_all()
    refund_results = results.get("refund", [])

    passed = sum(1 for r in refund_results if r.passed)
    total = len(refund_results)
    pass_rate = passed / total * 100 if total > 0 else 0

    assert pass_rate >= 75, f"退款问题通过率仅 {pass_rate:.1f}%,期望 ≥75%"

def test_handoff_scenarios(evaluator):
    results = evaluator.evaluate_all()
    handoff_results = results.get("handoff", [])

    passed = sum(1 for r in handoff_results if r.passed)
    total = len(handoff_results)
    pass_rate = passed / total * 100 if total > 0 else 0

    assert pass_rate == 100, "Handoff 场景应该 100% 通过"

def test_overall_performance(evaluator):
    results = evaluator.evaluate_all()
    all_results = [r for cat_results in results.values() for r in cat_results]

    total_passed = sum(1 for r in all_results if r.passed)
    total_tests = len(all_results)
    overall_pass_rate = total_passed / total_tests * 100 if total_tests > 0 else 0

    avg_latency = sum(r.latency_ms for r in all_results) / total_tests if total_tests > 0 else 0

    assert overall_pass_rate >= 75, f"整体通过率仅 {overall_pass_rate:.1f}%,期望 ≥75%"
    assert avg_latency < 1500, f"平均响应时间 {avg_latency:.0f}ms 超过阈值 1500ms"

运行评估:

bash
cd /path/to/project
pytest tests/test_evaluator.py -v

输出:

tests/test_evaluator.py::test_pricing_questions PASSED
tests/test_evaluator.py::test_refund_questions PASSED
tests/test_evaluator.py::test_handoff_scenarios PASSED
tests/test_evaluator.py::test_overall_performance PASSED

======================== 5 passed in 3.42s ========================

持续回归测试

回归测试的核心思想是:每次代码变更后,自动运行一套完整的测试用例,确保没有破坏已有的功能

回归测试套件设计

一个完整的回归测试套件应该覆盖:

测试类别测试数量典型用例
核心功能测试20-30定价查询、退款流程、订单状态、Handoff 触发
边界情况测试10-15空输入、超长文本、特殊字符、并发请求
性能基准测试5-10P95 延迟、Token 使用量、并发吞吐量
安全测试5-10Prompt 注入、SQL 注入、恶意输入
集成测试5-10端到端场景(完整对话流程)

性能基准测试

python
import pytest
import statistics
from typing import List

def benchmark_p95_latency(bot, questions: List[str], iterations: int = 10):
    latencies = []
    for _ in range(iterations):
        start = time.time()
        for q in questions:
            bot.process_message(q)
        end = time.time()
        latencies.append((end - start) / len(questions) * 1000)

    p95 = statistics.quantiles(latencies, n=20)[18]
    avg = statistics.mean(latencies)

    print(f"P95 延迟: {p95:.0f}ms")
    print(f"平均延迟: {avg:.0f}ms")

    return p95

@pytest.mark.benchmark
def test_latency_baseline(bot):
    questions = [
        "免费版支持几个人?",
        "专业版多少钱?",
        "退款流程是什么?",
    ]
    p95 = benchmark_p95_latency(bot, questions)

    assert p95 < 1500, f"P95 延迟 {p95}ms 超过阈值 1500ms"

@pytest.mark.benchmark
def test_token_usage_efficiency(bot):
    total_tokens = 0
    for _ in range(10):
        result = bot.process_message("免费版支持几个人?")
        total_tokens += result.get("token_count", 100)

    avg_tokens = total_tokens / 10
    print(f"平均 Token 使用量: {avg_tokens}")

    assert avg_tokens < 200, f"平均 Token 使用量 {avg_tokens} 超过阈值 200"

集成到 CI/CD

回归测试应该每次代码 push 时自动运行。以下是 GitHub Actions 的示例配置:

yaml
# .github/workflows/regression-test.yml
name: Regression Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run regression tests
        run: |
          pytest tests/test_evaluator.py -v --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-results
          path: reports/

回归失败的处理策略

当回归测试失败时,不应该直接阻止合并,而是根据失败类型采取不同策略:

失败类型策略示例
功能回归阻止合并,要求修复定价问题从 95% 降到 80%
性能退化警告但允许合并(需说明)P95 从 1200ms 升到 1800ms,但仍在可接受范围
边界情况记录但不阻止新发现的特殊字符处理问题,标记为已知问题
测试本身问题修复测试,不阻止代码合并测试用例更新导致误判

离线评估的常见误区

误区一:测试用例太少。只准备 10-20 个测试 case,无法覆盖真实场景的多样性。一个健康的回归测试套件应该至少有 50+ 个测试用例,并且随着功能增加持续扩充。

误区二:测试用例过于简单。所有测试都是"免费版支持几个人?"这种直球问题,不测试边界和异常。好的测试套件应该包含:正常路径 + 边界情况 + 错误输入 + 复杂组合场景

误区三:只测不修。测试失败后没有分析原因就直接跳过。正确的做法是:失败 → 分析根因 → 修复 → 重新测试 → 验证修复。这是一个闭环过程。

误区四:硬编码的预期答案。测试用例的 expected_answer 写死了"免费版最多支持 5 人",但产品策略变了(改成了 3 人),测试就会失败。预期答案应该基于业务规则文档而不是写死具体数值。

评估体系的最终形态

把前面三节的内容整合起来,一个完整的评估体系应该是这样的:

开发阶段
├── 离线评估 (13.4)
│   ├── 本地测试套件 (pytest)
│   ├── 回归测试 (CI/CD)
│   └── 性能基准测试
│   └── 目标:快速迭代、成本控制

上线阶段
├── 在线评估 (13.3)
│   ├── LangSmith Trace 采集
│   ├── Dataset 管理
│   ├── 自动化评估器
│   └── 目标:持续监控、质量基线

优化阶段
├── 指标体系 (13.2)
│   ├── 检索相关性 (Context Relevance)
│   ├── 忠实度 (Faithfulness)
│   ├── 答案相关性 (Answer Relevance)
│   └── Agent 专用指标 (工具调用正确率、目标达成率)
│   └── 目标:多维衡量、精准诊断

反馈闭环
└── 人工审核 (13.1)
    ├── L1: 人工抽查
    ├── L2: 自动规则监控
    ├── L3: LLM-as-Judge
    └── L4: 黄金基准集
    └── 目标:发现盲点、建立 ground truth

这个体系覆盖了从开发到上线、从离线到在线、从功能到性能的全生命周期。它不是一次性的工作,而是需要持续维护和改进的基础设施。

基于 MIT 许可发布