Using QA Pairs for Recursive Chatbot Performance Improvement

Who this is for: those building retrieval-augmented LLM chatbots (e.g., construction specifications) who want a measurable, repeatable way to improve answers, prompts, and even the choice of model over time.

TL;DR — The Pipeline

  • Generate QA pairs by running real questions through the same retrieval and prompt stack as production.
  • Judge each pair on response quality and keyword quality with an independent LLM.
  • Turn results into a report with trends, outliers, and concrete prompt/search fixes.
  • Apply improvements, then rerun until the results clear your acceptance gates.

ASCII map of the loop:

Questions → Hybrid Retrieval → Prompted LLM → QA Pairs
    ↑                                             ↓
Prompt/Retrieval Tweaks ← Reports & Trends ← LLM-as-Judge Scores
    ↓
Deploy (if gates pass)

Mini-results example (after 1 iteration):

Avg response score: 0.86 (prev 0.72)
High (>= 0.8): 12   Medium (0.6–0.79): 6   Low (< 0.6): 2
Keyword score avg: 0.83 (prev 0.68)

Quickstart

If you want to run this loop end to end, here’s a straightforward way to set it up in your own environment.

  • Prerequisites

  • A document store or index that supports both semantic search (embeddings) and keyword/boolean search.
  • Access to an LLM for generation and an optional reranker for ordering results.
  • Stable prompts for generation, keyword extraction, and judging (kept consistent between prod and evaluation).

  • Run the pipeline

  • Generate QA pairs by running representative questions through your retrieval + prompting stack.
  • Evaluate with a separate “judge” prompt/model for response quality and keyword quality.
  • Produce a human-readable report that summarizes metrics and highlights outliers.
  • Optionally visualize distributions and trends in a dashboard of your choice.

  • Outputs

  • A JSON list of QA pairs (prompt, response, model, timestamp, keywords, search_terms, explanation).
  • An evaluation JSON (per-item scores + a summary block).
  • A Markdown or HTML report for quick review.

The Recursive Improvement Cycle

Step 1 — Run the QA pipeline

  • Create QA pairs by mirroring production: same system prompt, same hybrid retrieval, same context formatting.

Step 2 — Analyze with an independent judge

  • Score response quality and keyword quality. Capture explanations and error types.

Step 3 — Implement improvements

  • Update system prompt, tighten retrieval, and adjust formatting rules based on patterns.

Step 4 — Test and repeat

  • Ship only if results pass: e.g., avg response ≥ 0.85, low-score share ≤ 5%, keyword avg ≥ 0.8, and no regressions on critical questions.

Why Hybrid Retrieval Matters (and how it works here)

The chatbot’s accuracy is bounded by what it retrieves. Our knowledge-base method combines three pieces to maximize both recall and precision:

1) Vector search (semantic recall)

  • Create a query embedding and search a document store or index that holds text plus precomputed embeddings.
  • Cosine similarity ranks the most semantically relevant pages.

2) Text search with LLM-extracted terms (precision)

  • An LLM extracts up to ~5 domain-specific terms (e.g., Section numbers, equipment, environments), and we build a simple boolean string: term1 OR term2 OR term3.
  • We run a case-insensitive REGEXP search over page content, catching exact technical matches.

3) Reranking (final ordering)

  • Merge semantic results with text matches, limit overly long lists, truncate long texts for efficiency, and rerank all candidates with a reranking model.

Important implementation details from the actual codebase:

  • Storage: any document store that can return both text and embeddings; normalize vectors and support multiple formats/dimensions as needed.
  • Text search: build a boolean string from extracted terms and apply it against full-text indices.
  • Rerank: use a dedicated reranking model to order the final set for the specific query.

This end-to-end retrieval is used for both production and QA so the evaluation reflects real behavior, not idealized retrieval.
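
A minimal sketch of that flow, assuming a document store client with vector_search and regexp_search methods, an embedding function, and an optional reranker (all placeholder names, not a specific library’s API):

def hybrid_retrieve(query, search_terms, store, embed, rerank=None, top_k=10, max_chars=2000):
    # Semantic recall: embed the query and rank pages by cosine similarity.
    q_vec = embed(query)
    semantic_hits = store.vector_search(q_vec, limit=top_k)

    # Precision: case-insensitive text search driven by the OR-joined term string.
    text_hits = store.regexp_search(search_terms, limit=top_k)

    # Merge, dedupe by page id, and truncate long candidates before reranking.
    merged, seen = [], set()
    for hit in semantic_hits + text_hits:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append({**hit, "text": hit["text"][:max_chars]})

    # Final ordering: rerank when a reranker is available; otherwise keep the merged order.
    if rerank is not None:
        merged = rerank(query, merged)
    return merged[:top_k]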

Retrieval notes (practical behavior)

  • Boolean grammar: text search supports AND, OR, and NOT with case-insensitive word boundaries. Basic plural handling applies (e.g., conduit matches conduits).
  • Keywords vs search terms: a small list of focused keywords is joined with OR into a boolean string for the text search.
  • Rerank fallback: if a reranker isn’t available, return results without reranking (still usable, just less precise ordering).
  • Context length: long candidates are truncated and the final set is limited to keep prompting efficient.
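
On the text-search side, a sketch of how an OR-joined term string might become a case-insensitive, word-bounded regex with basic plural handling (this covers only the OR case used for keyword search; AND/NOT handling in your engine would be separate):

import re

def boolean_terms_to_regex(search_terms: str) -> re.Pattern:
    # Split the OR-joined boolean string into individual terms.
    terms = [t.strip() for t in search_terms.split(" OR ") if t.strip()]
    # Word-bounded, case-insensitive alternation with an optional trailing "s"
    # so that "conduit" also matches "conduits".
    parts = [r"\b" + re.escape(t) + r"s?\b" for t in terms]
    return re.compile("|".join(parts), re.IGNORECASE)

pattern = boolean_terms_to_regex("conduit OR EMT OR Section 26 05 33")
assert pattern.search("Install EMT conduits per Section 26 05 33.")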

Automated Evaluation: Two Judges, Clear Rubrics

We evaluate each QA pair along two tracks and aggregate results:

  • Response quality (0.0–1.0): completeness, factual accuracy from supplied context, structure, and correct citations.
  • Keyword quality (0.0–1.0): relevance and completeness of the extracted terms that drive text search.

We report: average, median, p25/p75, min/max, and counts for High (>= 0.8), Medium (0.6–0.79), and Low (< 0.6). We mitigate LLM-as-judge bias by using a separate judge model and keeping temperature low. Periodic human spot-checks keep the system honest.

Evaluation Prompt Samples

Use these as a starting point. Keep temperature low and require strict JSON to reduce drift.

Response quality (single score)

System:
You are an impartial evaluator. Score how well the assistant’s response answers the user’s prompt, relying only on the provided context when present. Use a 0.0–1.0 scale with decimals. Penalize unsupported claims, factual errors, missing key steps, weak structure, and citation issues (when sources are included). Reward completeness, correctness, clarity, and good use of supplied context. Do not add new facts. Return ONLY a JSON object.
Rubric (guide):
1.0 = comprehensive, correct, well-structured, grounded in context (if provided)
0.8–0.9 = strong, minor gaps/omissions
0.6–0.7 = adequate but misses important points or has clarity issues
0.4–0.5 = weak coverage or notable errors
0.0–0.3 = largely incorrect/unsupported/off-topic
Output JSON schema:
{ "score": number, "explanation": string }
User:
<prompt>
{{USER_PROMPT}}
</prompt>
<context>
{{OPTIONAL_CONTEXT_OR_EMPTY}}
</context>
<response>
{{ASSISTANT_RESPONSE}}
</response>
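
A sketch of wiring this judge up, assuming a generic client.complete(...) call (a placeholder, not a specific SDK); temperature is pinned at 0 and non-JSON replies are retried:

import json

RESPONSE_JUDGE_SYSTEM = "..."  # the response-quality system prompt above

def score_response(client, user_prompt, context, response, retries=1):
    user = (
        f"<prompt>\n{user_prompt}\n</prompt>\n"
        f"<context>\n{context or ''}\n</context>\n"
        f"<response>\n{response}\n</response>"
    )
    last_err = None
    for _ in range(retries + 1):
        raw = client.complete(system=RESPONSE_JUDGE_SYSTEM, user=user, temperature=0.0)
        try:
            result = json.loads(raw)
            return {"score": float(result["score"]), "explanation": str(result["explanation"])}
        except (json.JSONDecodeError, KeyError, ValueError) as err:
            last_err = err  # malformed JSON or missing fields: retry
    raise ValueError("Judge did not return valid JSON") from last_err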

Keyword quality (retrieval terms)

System:
You are evaluating a list of keywords proposed for retrieving documents relevant to the user’s prompt. Score 0.0–1.0 for relevance, coverage of core concepts, specificity (not too generic), and non-redundancy. Penalize missing key terms or overly broad terms. Return ONLY a JSON object.
Rubric (guide):
1.0 = highly relevant, comprehensive, specific, minimal redundancy
0.8–0.9 = mostly relevant with minor gaps or minor redundancy
0.6–0.7 = generally relevant but misses important concepts or is too generic
<=0.5 = poor coverage or many irrelevant items
Output JSON schema:
{ "score": number, "explanation": string }
User:
<prompt>
{{USER_PROMPT}}
</prompt>
<keywords>
{{["term 1", "term 2", "term 3"]}}
</keywords>

Pairwise A/B (optional)

System:
You are an impartial judge. Compare Response A vs Response B for the same prompt (and optional context). Choose the better answer based on correctness, completeness, clarity, and grounding in the provided context. Do not invent facts. Return ONLY a JSON object.
Output JSON schema:
{ "winner": "A" | "B", "margin": number, "rationale": string }
Notes:
- margin in [0.0, 1.0] where higher = stronger win
- keep rationale concise
User:
<prompt>
{{USER_PROMPT}}
</prompt>
<context>
{{OPTIONAL_CONTEXT_OR_EMPTY}}
</context>
<response_A>
{{ASSISTANT_RESPONSE_A}}
</response_A>
<response_B>
{{ASSISTANT_RESPONSE_B}}
</response_B>
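
A matching sketch for the pairwise judge (same hypothetical client.complete call). Running a second pass with A and B swapped and keeping only consistent verdicts is a simple guard against position bias:

import json

PAIRWISE_JUDGE_SYSTEM = "..."  # the pairwise system prompt above

def judge_pair(client, user_prompt, context, response_a, response_b):
    user = (
        f"<prompt>\n{user_prompt}\n</prompt>\n"
        f"<context>\n{context or ''}\n</context>\n"
        f"<response_A>\n{response_a}\n</response_A>\n"
        f"<response_B>\n{response_b}\n</response_B>"
    )
    raw = client.complete(system=PAIRWISE_JUDGE_SYSTEM, user=user, temperature=0.0)
    return json.loads(raw)  # {"winner": "A" | "B", "margin": ..., "rationale": ...}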

Using QA to Compare Models (not just prompts)

The same setup makes it easy to compare different LLMs. Because generation and judging are decoupled, you can run the same question set and retrieval context across multiple models, then evaluate and chart the deltas. Most clients let you swap models via configuration.

Example flow:

  • Pick a fixed question set.
  • For each candidate model, generate responses, evaluate with the same judge, and record scores.
  • Compare averages, low-score counts, and per-item winners.

A Concrete Before/After

Before (score ~0.70): unstructured and generic

We should consider NEC rules and general mounting practices. Typically panels are installed at reasonable heights...

After (score ~0.90): structured, citation-forward (post-cleanup Markdown)

**Verdict.** The switchgear mounting height must comply with NEC workspace clearances and project specifications (e.g., Section 26 05 XX, p. ####).
**Supporting Details**
- Materials: Equipment clearances and working space per NEC 110.26; verify final pad height does not create violations.
- Installation: Coordinate housekeeping pad thickness so the topmost device remains within allowable reach.
- Constraints / Exclusions: No explicit numeric panel height found beyond NEC workspace; do not assume outside supplied context.
**Terminology Notes**
- “Switchgear” vs “panelboard”: confirm the equipment type referenced in the spec section.
**Gaps & Follow-Up**
- Exact inches-to-center not specified in supplied context. Confirm with drawings and AHJ if needed.
**Verification & Coordination**
- Verify NEC 110.26 clearances and any inspection/commissioning references in the spec.
**Sources**
- Source 1: <exact PDF URL>

Retrieval Ablation Snapshot

  • Vector-only: strong semantic recall, weaker precision on code sections → avg 0.78
  • Text-only: precise on section numbers/terms, misses paraphrases → avg 0.74
  • Hybrid + rerank: best of both → avg 0.86
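
Numbers like these come from running the same question set with retrieval modes toggled. One way to parameterize that (sketch; store.vector_search, store.regexp_search, and rerank are the same placeholder pieces as the retrieval sketch above):

def make_retriever(store, embed, rerank=None, mode="hybrid"):
    # mode: "vector" (semantic only), "text" (boolean/regex only), or "hybrid".
    def retrieve(query, search_terms):
        candidates = []
        if mode in ("vector", "hybrid"):
            candidates += store.vector_search(embed(query), limit=10)
        if mode in ("text", "hybrid"):
            candidates += store.regexp_search(search_terms, limit=10)
        if mode == "hybrid" and rerank is not None:
            candidates = rerank(query, candidates)
        return candidates
    return retrieve

# Generate and judge with each retriever, keeping prompts and the judge fixed,
# then compare average scores to build an ablation table like the one above.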

Score Distributions (example)

Compact view across three iterations (n = 20 QA pairs):

Iteration   Avg   Median   High (>= 0.8)   Medium (0.6–0.79)   Low (< 0.6)
Baseline    0.72  0.71     6               8                    6
Iter 1      0.86  0.86     12              6                    2
Iter 2      0.89  0.90     14              4                    2

Measuring Success (beyond accuracy)

  • Response consistency: similar questions receive similar structure and citations.
  • Error types: track reductions in retrieval misses, unsupported claims, and sourcing mistakes.

Acceptance Gates

Use clear gates to decide when to ship:

  • Response quality: average >= 0.85; share of low scores (< 0.6) <= 5%.
  • Keyword quality: average >= 0.80.
  • Non-regression: for a small “critical” subset, no score decrease > 0.05.
  • Judge discipline: keep the judge model and rubric fixed between comparative runs.
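
A sketch of encoding these gates as a single pass/fail check over the evaluation summary (field names follow the compute_stats sketch later in this post; the critical-subset arguments are optional per-question score maps):

def passes_gates(summary, baseline_critical=None, current_critical=None):
    resp = summary["response_quality"]
    kw = summary["keyword_extraction_quality"]
    n = resp["high_scores_count"] + resp["medium_scores_count"] + resp["low_scores_count"]

    checks = [
        resp["average_score"] >= 0.85,                          # response quality gate
        (resp["low_scores_count"] / n if n else 1.0) <= 0.05,   # low-score share gate
        kw["average_score"] >= 0.80,                            # keyword quality gate
    ]

    # Non-regression gate: no critical question drops by more than 0.05.
    if baseline_critical and current_critical:
        checks.append(all(
            current_critical.get(q, 0.0) >= score - 0.05
            for q, score in baseline_critical.items()
        ))
    return all(checks)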

Tips:

  • Keep a stable holdout set you never tune on.
  • Add a small challenge set to catch new failure modes.

Common failure types

  • Retrieval miss (nothing relevant surfaced)
  • Weak sourcing (claims not supported by provided sources)
  • Unsupported claim (hallucinated requirement)
  • Formatting drift (missing sections or citations)
  • Keyword gap (too generic or misses core terms)

Reproduce This (minimal steps)

1) Prepare inputs

  • A set of representative prompts or questions (10–25 to start).
  • A corpus indexed for both vector and text search (document text + embeddings).
  • API keys for an LLM provider and, optionally, a reranker.

2) Generate QA pairs

  • Run each question through hybrid retrieval and your system prompt to produce entries like {prompt, response, model, timestamp, keywords, search_terms, explanation} and write them to a JSON file.

Note: keywords is the list of extracted terms (e.g., ["Section 26 05 33", "EMT"]). search_terms is the boolean string used for text search (e.g., term1 OR term2 OR term3).
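
For reference, a single entry in the QA-pairs file might look like this (illustrative values only):

{
  "prompt": "What mounting height is required for the switchgear?",
  "response": "**Verdict.** The switchgear mounting height must comply with ...",
  "model": "example-generator-model",
  "timestamp": "2025-01-01T12:00:00Z",
  "keywords": ["Section 26 05 33", "switchgear", "mounting height"],
  "search_terms": "Section 26 05 33 OR switchgear OR mounting height",
  "explanation": "Terms target the spec section and equipment named in the question."
}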

3) Evaluate

  • For each pair, run two judge prompts (response quality and keyword quality) and write results plus a summary block to an evaluation JSON.

4) Report

  • Produce a Markdown report with overall stats (avg, median, p25/p75, high/med/low counts), per-item scores, and highlights.
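
A minimal report generator over the evaluation JSON (sketch; it assumes the results and summary shapes produced by the judge_pairs and compute_stats sketches later in this post):

def write_report(evaluation, path="qa_report.md"):
    s = evaluation["summary"]["response_quality"]
    lines = [
        "# QA Evaluation Report",
        "",
        f"- Average: {s['average_score']:.2f}  Median: {s['median']:.2f}",
        f"- p25 / p75: {s['p25']:.2f} / {s['p75']:.2f}  Min / Max: {s['min_score']:.2f} / {s['max_score']:.2f}",
        f"- High (>= 0.8): {s['high_scores_count']}  Medium (0.6-0.79): {s['medium_scores_count']}  Low (< 0.6): {s['low_scores_count']}",
        "",
        "## Lowest-scoring items",
    ]
    for r in sorted(evaluation["results"], key=lambda r: r["score"])[:5]:
        lines.append(f"- ({r['score']:.2f}) {r['prompt']}: {r['explanation']}")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")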

5) Gate and iterate

  • Apply acceptance gates, update prompts/retrieval, rerun. Keep a changelog tying changes to score deltas.

Reliability tips

  • Incremental writes after each item to avoid losing progress.
  • Back off on 429 (rate-limit) errors from LLM providers; use fallback models when needed.
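
Both tips fit in a few lines of wrapper code (sketch; how a 429 surfaces depends on your client, so is_rate_limit is a placeholder predicate, and fallback-model logic is left to the caller):

import json
import random
import time

def call_with_backoff(fn, *args, retries=5, base_delay=1.0,
                      is_rate_limit=lambda err: "429" in str(err)):
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception as err:
            if not is_rate_limit(err) or attempt == retries - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) + random.random())

def append_result(path, item):
    # Incremental write: one JSON object per line, flushed after every item.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(item) + "\n")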

Repro and Ops

  • Version your runs: record model names, prompt versions, DB checksum, and a commit hash alongside results.
  • Keep judge and generator models separate when comparing changes.
  • Be careful with sensitive content: avoid logging full context; prefer IDs in logs. Never print API keys.
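
A small helper for stamping each run (sketch; it assumes the code lives in a git checkout and that the corpus is a single file you can checksum, so adapt the checksum step to however your document store exports data):

import hashlib
import subprocess
from datetime import datetime, timezone

def run_metadata(model_name, prompt_version, corpus_path):
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=False,
    ).stdout.strip()
    with open(corpus_path, "rb") as fh:
        corpus_sha = hashlib.sha256(fh.read()).hexdigest()[:12]
    return {
        "model": model_name,
        "prompt_version": prompt_version,
        "corpus_sha256": corpus_sha,
        "commit": commit or "unknown",
        "run_at": datetime.now(timezone.utc).isoformat(),
    }

# Store this dict alongside the QA pairs and evaluation JSON for every run.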

Simplified Code Sketches (illustrative)

Generation

from datetime import datetime, timezone

def generate_pairs(questions, system_prompt, retrieve, llm):
    items = []
    for q in questions:
        terms = llm.extract_terms(q)              # ["Section 26 05 33", "EMT", ...]
        search_terms = " OR ".join(terms)
        context = retrieve(q, search_terms)       # hybrid search returns concatenated snippets
        prompt = f"<context>{context}</context>\n<query>{q}</query>"
        resp = llm.generate(system_prompt, prompt)
        items.append({
            "prompt": q,
            "response": resp,
            "model": llm.name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "keywords": terms,
            "search_terms": search_terms,
            "explanation": "Term extraction rationale here"
        })
    return items  # write to a JSON file
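
Writing the pairs to disk might look like this (questions, SYSTEM_PROMPT, retrieve, and llm are whatever you wired up for the sketch above):

import json

pairs = generate_pairs(questions, SYSTEM_PROMPT, retrieve, llm)
with open("qa_pairs.json", "w", encoding="utf-8") as fh:
    json.dump(pairs, fh, indent=2)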

Evaluation

def judge_pairs(pairs, judge):
    results, r_scores, k_scores = [], [], []
    for p in pairs:
        rq = judge.score_response(p["prompt"], p["response"])   # {score, explanation}
        kq = judge.score_keywords(p["prompt"], p.get("keywords", []))
        results.append({
            "prompt": p["prompt"],
            "score": rq["score"],
            "explanation": rq["explanation"],
            "keywords": p.get("keywords", []),
            "search_terms": p.get("search_terms", ""),
            "keywords_evaluation": kq,
        })
        r_scores.append(rq["score"])
        k_scores.append(kq["score"])
    summary = compute_stats(r_scores, k_scores)
    return {"results": results, "summary": summary}  # write to an evaluation JSON

Model comparison (generation side only)

def compare_models(questions, system_prompt, retrieve, models, judge):
    leaderboard = []
    for m in models:
        pairs = generate_pairs(questions, system_prompt, retrieve, m)
        evald = judge_pairs(pairs, judge)
        leaderboard.append({"model": m.name, "avg": evald["summary"]["response_quality"]["average_score"]})
    return sorted(leaderboard, key=lambda x: x["avg"], reverse=True)

Stats helper

def compute_stats(resp_scores, kw_scores):
    def stats(xs):
        xs = sorted(xs)
        n = len(xs)
        mid = n // 2
        return {
            "average_score": sum(xs) / n if n else 0.0,
            "median": (xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2) if n else 0.0,
            "p25": xs[max(0, (n * 25) // 100 - 1)] if n else 0.0,
            "p75": xs[max(0, (n * 75) // 100 - 1)] if n else 0.0,
            "min_score": xs[0] if n else 0.0,
            "max_score": xs[-1] if n else 0.0,
            "high_scores_count": sum(1 for s in xs if s >= 0.8),
            "medium_scores_count": sum(1 for s in xs if 0.6 <= s < 0.8),
            "low_scores_count": sum(1 for s in xs if s < 0.6),
        }
    return {
        "response_quality": stats(resp_scores),
        "keyword_extraction_quality": stats(kw_scores)
    }

LLM-Assisted Prompt Refinement (from QA reports)

When scores plateau or you see recurring error patterns, feed your QA report into an LLM agent (e.g., Codex) to propose targeted system prompt improvements. Use the report’s summary, top low-scoring items, and the current system prompt as inputs.

What to include

  • Summary metrics: average, median, p25/p75, and high/medium/low counts.
  • Top 5–10 lowest-scoring pairs: question, response (truncated), judge explanation, and any keyword quality notes.
  • Current system prompt text.

Analyst prompt (template)

You are an expert prompt engineer. Read the QA report and propose precise edits to the system prompt that will:
- Improve factual grounding and citation discipline
- Reduce unsupported claims and formatting drift
- Preserve sourcing and “base only on supplied context” constraints
Return JSON with fields:
- revised_system_prompt: string (full prompt)
- change_log: [ {section, before, after, reason} ]
- risks: [string] (potential regressions to watch)
- acceptance_gates: { avg>=, max_low_share<=, notes }
Use only concrete, minimal edits. Do not relax safety/citation requirements.

Minimal code sketch (agent call)

import json

def build_prompt_for_agent(report_md: str, current_prompt: str):
    system = (
        "You are a rigorous prompt engineer. "
        "Suggest minimal, surgical edits with clear rationale."
    )
    user = f"""
<qa_report>
{report_md}
</qa_report>
<current_system_prompt>
{current_prompt}
</current_system_prompt>
Follow the JSON output schema exactly.
"""
    return system, user
# agent is your LLM client (e.g., Codex-style agent)
system, user = build_prompt_for_agent(report_md, system_prompt)
result = agent.chat(system=system, messages=[{"role": "user", "content": user}])
proposal = json.loads(result)
# Apply proposal["revised_system_prompt"], re-run QA, and check acceptance_gates

Iteration flow

  • Compile the newest QA report and current system prompt.
  • Ask the agent for JSON diff-style changes and reasons.
  • Apply suggested edits, re-run QA on the same question set.
  • Gate on acceptance thresholds; keep or revert based on results.

Case Study — QA report → prompt edits

Example: We fed a recent QA report to the agent. The agent proposed surgical edits focused on consistency, citation discipline, and calling out gaps. We then re-ran QA with the revised prompt.

Post-change results (from that run):

  • Avg: 0.88, Median: 0.90, Range: 0.80–0.90
  • High (>= 0.8): 10/10, Medium: 0, Low: 0
  • Avg eval time per item: ~6.5s (n=10)

Case Study — mixed results → prioritized fixes

Another run surfaced more mixed scores. We sent that report to the agent and asked for the smallest set of fixes that should move the most items from Medium to High without increasing Low counts.

Post-change results (example):

  • Avg: 0.85 (prev 0.81), Median: 0.85
  • High (>= 0.8): 13/20 (prev 9/20)
  • Medium: 6/20 (prev 8/20)
  • Low: 1/20 (prev 3/20)
  • Notes: Majority of upgrades were items penalized for weak sourcing and missing explicit gaps.

Agent Prompt

Sample agent prompt (single, ready-to-use):

You are an expert prompt engineer for electrical building specifications. Read the QA report and propose minimal, surgical edits to ALL LLM system prompts used to: (a) generate answers and (b) extract keywords for retrieval. 
Inputs you will receive:
1) QA report (score summary, highs/mediums/lows, judge explanations, a few lowest-scoring items)
2) Orchestrator overview (how generation/retrieval/evaluation flow)
3) Current prompts: answering, keyword-extraction, and judge rubrics
Tasks:
- Find recurring failure patterns (unsupported claims, weak/irrelevant sources, formatting drift, retrieval misses)
- Propose possible edits to the answering and keyword-extraction prompts; adjust judge rubrics only if they mis-score intended behavior
- Preserve constraints: base only on supplied context, enforce a Sources section (deduplicated, max 5, direct-support URLs), maintain professional electrical-engineering tone and structure

Target prompts to refine in this workflow:

  • Answering prompt (governs structure, citations, tone)
  • Keyword-extraction prompt (drives precise text search terms)
  • Judge rubrics (response and keyword quality; only adjust if they mis-score intended behavior)

Best Practices

  • Start small (10–15 questions), then expand coverage.
  • Keep generation and judging models separate to avoid self-reinforcement.
  • Maintain a changelog tying prompt/retrieval changes to score deltas.

Conclusion

QA pairs turn chatbot improvement into a measurable, repeatable process. By mirroring production retrieval, judging responses on clear rubrics, and gating releases on objective thresholds, the system improves predictably. The same pipeline also lets you compare models quickly: swap the generator model, keep the judge constant, and let the data decide. Start small, iterate fast, and let the scores guide the work.