Building a No-Vibe LLM Evaluation System: A Practical How-To Guide

Introduction

If you've ever relied on an LLM evaluation process that amounts to a vibe check, scoring outputs with vague metrics and subjective human judgment, you know the frustration: hallucinations slip through, and decisions aren't reproducible. I built a lightweight evaluation layer in pure Python that replaces that guesswork with a structured approach. By checking attribution, specificity, and relevance separately, it is designed to flag unsupported claims before they reach production. This guide walks you through building your own version, step by step.

Source: towardsdatascience.com

What You Need

Basic knowledge of Python (functions, lists, dictionaries)
Python 3.8+ installed on your machine
A few sample LLM outputs (text strings) and their expected source documents or knowledge base
Optional: a simple vector database or list of facts for attribution checks
No external libraries required—pure Python only

Step-by-Step Instructions

Step 1: Define Your Evaluation Criteria

Before writing code, clarify what each metric means in your context:

Attribution: Does the output cite or rely on provided sources? (Yes/No or a score)
Specificity: Does the output contain concrete details (numbers, names, dates) rather than vague generalities?
Relevance: Does the output directly answer the query or stay on topic?

Write these definitions down as clear rules. For example: “An output is attributed if at least 70% of its claims can be traced to a known source.”
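One way to make a rule like this executable is to store the cutoff as a named constant and compute the fraction directly. The sketch below is illustrative only: `ATTRIBUTION_MIN_FRACTION` and `is_attributed` are hypothetical names, and the 0.7 value simply mirrors the example rule.

```python
# Hypothetical constant; 0.7 mirrors the example rule "at least 70% of its claims".
ATTRIBUTION_MIN_FRACTION = 0.7


def is_attributed(claim_flags, min_fraction=ATTRIBUTION_MIN_FRACTION):
    """Return True if enough claims are traceable to a source.

    claim_flags: list of booleans, one per claim (True = traced to a source).
    An output with no claims is treated as not attributed.
    """
    if not claim_flags:
        return False
    traced = sum(1 for flag in claim_flags if flag)
    return traced / len(claim_flags) >= min_fraction


# 3 of 4 claims traced gives 0.75, which meets the 0.7 rule.
print(is_attributed([True, True, True, False]))
# 1 of 2 claims traced gives 0.5, which does not.
print(is_attributed([True, False]))
```

Writing the rule this way keeps the cutoff in one place, so changing the definition later means editing a single value.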

Step 2: Build a Function to Parse LLM Output

Create a Python function that breaks the LLM text into individual claims (sentences or clauses).

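import re

def parse_claims(text):
    """Split text into sentence-level claims, dropping very short fragments."""
    # Split after ., !, or ? when followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Keep only sentences longer than 10 characters
    return [s for s in sentences if len(s) > 10]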

This gives you a list of claim strings to evaluate independently.
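If sentences turn out to be too coarse, a clause-level split is one possible refinement. The following is a rough heuristic rather than a grammatical parser; `split_clauses` is a hypothetical helper, and the choice of separators (semicolons and ", and" / ", but") is an assumption you would adapt to your own data.

```python
import re


def split_clauses(sentence):
    """Split one sentence on semicolons and on ', and' / ', but' joins."""
    parts = re.split(r';\s*|,\s+(?:and|but)\s+', sentence.strip())
    # Drop empty fragments and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]


example = "Revenue grew 12% in 2023; costs fell, and margins improved."
print(split_clauses(example))
# ['Revenue grew 12% in 2023', 'costs fell', 'margins improved.']
```

Short clauses such as "costs fell" fall under a 10-character length filter, so a clause-level pipeline would likely need a lower minimum length.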

Step 3: Implement the Attribution Check

Attribution checks whether each claim is backed by a source. You'll need a reference set of facts (e.g., a dictionary of {fact: source_id}). The function below checks whether any known fact appears in the claim as a case-insensitive substring, which is a form of exact matching; fuzzy matching would need a similarity measure instead.

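def check_attribution(claim, knowledge_base):
    """Return (True, source_id) for the first fact found in the claim, else (False, None)."""
    claim_lower = claim.lower()
    for fact, source in knowledge_base.items():
        # Case-insensitive substring match
        if fact.lower() in claim_lower:
            return True, source
    return False, None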

The function returns a boolean and the source ID. For a more robust system, use TF-IDF or an embedding model; plain substring matching is enough for a prototype.
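A substring check misses claims that reword a fact even slightly. As a middle ground before TF-IDF or embeddings, a token-overlap match can be sketched in pure Python. The names `tokenize` and `check_attribution_fuzzy`, the 0.8 cutoff, and the sample knowledge base are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def check_attribution_fuzzy(claim, knowledge_base, min_overlap=0.8):
    """Return (True, source) for the best-matching fact, else (False, None).

    A fact matches when at least `min_overlap` of its tokens appear in the claim.
    """
    claim_tokens = tokenize(claim)
    best_score, best_source = 0.0, None
    for fact, source in knowledge_base.items():
        fact_tokens = tokenize(fact)
        if not fact_tokens:
            continue  # skip empty facts to avoid dividing by zero
        score = len(fact_tokens & claim_tokens) / len(fact_tokens)
        if score > best_score:
            best_score, best_source = score, source
    if best_score >= min_overlap:
        return True, best_source
    return False, None


kb = {"the warranty lasts 24 months": "doc_1"}  # illustrative data only
print(check_attribution_fuzzy("The standard warranty lasts for 24 months.", kb))
print(check_attribution_fuzzy("The warranty is generous.", kb))
```

Because this treats text as a bag of tokens, it ignores word order and negation: a claim that the warranty "never lasts 24 months" would still match. Treat it as a recall-oriented filter, not proof of support.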

Step 4: Implement the Specificity Check

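Specificity measures detail. Count digit runs and proper nouns, approximated here as capitalized words that aren't at the start of the sentence; this is only a rough proxy for named entities. Create a scoring function:

import re

def check_specificity(claim):
    # Count runs of digits (numbers, years, quantities)
    digits = len(re.findall(r'\d+', claim))
    # Count capitalized words, skipping the first word, which is capitalized regardless
    words = claim.split()
    proper_nouns = sum(1 for w in words[1:] if re.match(r'[A-Z][a-z]+', w))
    return (digits + proper_nouns) > 2  # arbitrary threshold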

Adjust the threshold based on your domain.
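Tuning is easier when the raw score is exposed separately from the cutoff. The sketch below splits the check into two hypothetical functions, `specificity_score` and `is_specific`; it skips the first word of the claim, following the definition of proper nouns given in this step, and keeps 2 as the default threshold.

```python
import re


def specificity_score(claim):
    """Count digit runs plus capitalized words after the first word."""
    digits = len(re.findall(r'\d+', claim))
    words = claim.split()
    capitalized = sum(1 for w in words[1:] if re.match(r'[A-Z][a-z]+', w))
    return digits + capitalized


def is_specific(claim, threshold=2):
    """Return True when the score is strictly greater than the threshold."""
    return specificity_score(claim) > threshold


vague = "The company did well recently."
concrete = "Acme reported 12% growth in March 2023."  # made-up example sentence
print(specificity_score(vague), specificity_score(concrete))
# 0 3
```

With the score available, you can inspect the distribution over your own examples before committing to a cutoff.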

Step 5: Implement the Relevance Check

Relevance compares output to the user's query. Use simple keyword overlap or character n-grams:

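def check_relevance(output, query):
    # Note: tokens keep their punctuation, so "refund." and "refund" do not match
    query_words = set(query.lower().split())
    output_words = set(output.lower().split())
    # An empty query has nothing to match against; avoid dividing by zero
    if not query_words:
        return False
    # Fraction of query words that also appear in the output
    overlap = len(query_words & output_words) / len(query_words)
    return overlap > 0.3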

Again, tune the threshold.
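The character n-gram alternative mentioned above tolerates punctuation and small spelling differences better than whole-word overlap. Here is a minimal sketch using Jaccard similarity over character trigrams; the function names, the choice of n=3, and the sample strings are assumptions, and any cutoff would need tuning on your own data.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_relevance(output, query, n=3):
    """Jaccard similarity between the character n-gram sets of output and query."""
    a, b = char_ngrams(output, n), char_ngrams(query, n)
    if not a or not b:
        return 0.0  # too little text to compare
    return len(a & b) / len(a | b)


query = "refund policy for damaged items"
on_topic = ngram_relevance("Damaged items qualify for a refund.", query)
off_topic = ngram_relevance("Our office is open on weekdays.", query)
print(on_topic > off_topic)
```

Jaccard similarity penalizes long outputs, since the union grows with output length. If that is a problem, dividing by the number of query n-grams instead would measure how much of the query is covered.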

Step 6: Combine into a Decision Layer

Create a single function that takes output, query, and knowledge base, then returns a pass/fail decision and a report.

def evaluate_output(output, query, knowledge_base):
    claims = parse_claims(output)
    results = []
    for claim in claims:
        attributed, source = check_attribution(claim, knowledge_base)
        spec = check_specificity(claim)
        rel = check_relevance(claim, query)
        results.append({'claim': claim, 'attribution': attributed, 'source': source,
                        'specificity': spec, 'relevance': rel})
    # Decision: pass only if there is at least one claim and every claim meets all criteria.
    # Without the bool(results) guard, all() on an empty list would return True.
    passed = bool(results) and all(
        r['attribution'] and r['specificity'] and r['relevance'] for r in results)
    return {'passed': passed, 'details': results}

This is the core layer that replaces “vibes” with reproducible decisions.
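To make the report actionable, it can help to list which checks each failing claim missed. The helper below is a sketch: `summarize_failures` is a hypothetical name, it assumes each entry in 'details' carries at least the 'claim', 'attribution', 'specificity', and 'relevance' keys, and the sample report is hand-written for illustration.

```python
def summarize_failures(report):
    """Return one readable line per failing claim, naming the checks it failed."""
    checks = ('attribution', 'specificity', 'relevance')
    lines = []
    for item in report['details']:
        failed = [name for name in checks if not item[name]]
        if failed:
            lines.append(f"{item['claim']!r} failed: {', '.join(failed)}")
    return lines


# Hand-written report for illustration only.
sample_report = {
    'passed': False,
    'details': [
        {'claim': 'Acme shipped 40 units in May 2024.', 'attribution': True,
         'specificity': True, 'relevance': True},
        {'claim': 'Things went quite well overall.', 'attribution': False,
         'specificity': False, 'relevance': True},
    ],
}
for line in summarize_failures(sample_report):
    print(line)
# 'Things went quite well overall.' failed: attribution, specificity
```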

Step 7: Test and Iterate

Run your function on a set of known-good and known-bad examples. Adjust thresholds and criteria until false positives/negatives are minimized. Log every failure to improve your knowledge base and rules. Over time, you can add more sophisticated checks (e.g., contradiction detection) while keeping the same three-pillar architecture.
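A small harness makes "false positives/negatives" countable. The sketch below assumes you have hand-labeled examples and an evaluator that returns True for a pass; `confusion_counts` and `toy_evaluator` are hypothetical names, and the toy evaluator is a stand-in so the block runs on its own.

```python
def confusion_counts(evaluator, labeled_examples):
    """Count agreement between an evaluator and hand-labeled examples.

    evaluator: callable taking an example's inputs and returning True (pass) or False.
    labeled_examples: list of (inputs, expected_pass) pairs.
    """
    counts = {'true_pass': 0, 'true_fail': 0, 'false_pass': 0, 'false_fail': 0}
    for inputs, expected in labeled_examples:
        predicted = bool(evaluator(inputs))
        if predicted and expected:
            counts['true_pass'] += 1
        elif not predicted and not expected:
            counts['true_fail'] += 1
        elif predicted and not expected:
            counts['false_pass'] += 1  # a bad output slipped through
        else:
            counts['false_fail'] += 1  # a good output was rejected
    return counts


# Stand-in evaluator for illustration: passes any text containing a digit.
def toy_evaluator(text):
    return any(ch.isdigit() for ch in text)


examples = [
    ("Sales rose 8% in 2022.", True),
    ("Sales rose a lot.", False),
    ("Sales rose 900% overnight.", False),  # specific, but labeled bad
]
print(confusion_counts(toy_evaluator, examples))
# {'true_pass': 1, 'true_fail': 1, 'false_pass': 1, 'false_fail': 0}
```

To use it with the decision layer, the evaluator could be a small wrapper that unpacks (output, query, knowledge_base) and returns the 'passed' field.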

Tips for Success

Start small: Don't try to cover every edge case. Build for one use case first, then expand.
Keep it modular: Each check should be independently testable. You can replace simple functions with ML models later.
Use threshold tuning: Run a grid search over your test set to find the best cutoff values for specificity and relevance.
Document your criteria: Write down why you chose each threshold—this makes the system reproducible and debuggable.
Combine with human review: For high-stakes applications, use this layer as a first filter, then pass borderline cases to a human.
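The threshold-tuning tip can be sketched as a one-dimensional search over candidate cutoffs. The code assumes you have already computed a score per example (for instance, a relevance overlap) and labeled each example by hand; `best_threshold` is a hypothetical name, the data is illustrative, and accuracy is just one possible objective.

```python
def best_threshold(scored_examples, candidates):
    """Return (threshold, accuracy) for the candidate cutoff with the highest accuracy.

    scored_examples: list of (score, expected_pass) pairs.
    A score passes when it is strictly greater than the threshold.
    Ties keep the earliest candidate.
    """
    if not scored_examples:
        raise ValueError("need at least one labeled example")
    best = (None, -1.0)
    for threshold in candidates:
        correct = sum(1 for score, expected in scored_examples
                      if (score > threshold) == expected)
        accuracy = correct / len(scored_examples)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Illustrative scores with hand-assigned labels.
scored = [(0.10, False), (0.25, False), (0.40, True), (0.55, True), (0.80, True)]
print(best_threshold(scored, [0.1, 0.2, 0.3, 0.4, 0.5]))
# (0.3, 1.0)
```

To tune two thresholds together, the same loop can iterate over pairs of candidates, for example with `itertools.product`. Keep a held-out set of examples so the chosen cutoffs are not fitted to the same data they are judged on.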
