Building a No-Vibe LLM Evaluation System: A Practical How-To Guide
Introduction
If you've ever relied on an LLM evaluation process that amounts to a vibe check, scoring outputs with vague metrics and subjective human judgment, you know the frustration. Hallucinations slip through, and decisions aren't reproducible. I built a lightweight evaluation layer in pure Python that replaces that guesswork with a structured approach. By checking attribution, specificity, and relevance separately, it flags unsupported claims before they reach production. This guide walks you through building your own version, step by step.

What You Need
- Basic knowledge of Python (functions, lists, dictionaries)
- Python 3.8+ installed on your machine
- A few sample LLM outputs (text strings) and their expected source documents or knowledge base
- Optional: a simple vector database or list of facts for attribution checks
- No external libraries required—pure Python only
Step-by-Step Instructions
Step 1: Define Your Evaluation Criteria
Before writing code, clarify what each metric means in your context:
- Attribution: Does the output cite or rely on provided sources? (Yes/No or a score)
- Specificity: Does the output contain concrete details (numbers, names, dates) rather than vague generalities?
- Relevance: Does the output directly answer the query or stay on topic?
Write these definitions down as clear rules. For example: “An output is attributed if at least 70% of its claims can be traced to a known source.”
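A rule like the 70% example can be written as a small function so that nobody has to interpret it later. This is only a sketch: the name `is_attributed` and the parameter `min_fraction` are hypothetical, and the input is assumed to be one boolean per claim. Note that Step 6 as written is stricter than this rule, since it requires every claim to pass.

```python
def is_attributed(claim_flags, min_fraction=0.7):
    """Return True if enough claims were traced to a source.

    claim_flags: list of booleans, one per claim (True = traced to a source).
    min_fraction: hypothetical cutoff taken from the example rule above.
    """
    if not claim_flags:
        # No claims means nothing was attributed; treat that as a failure.
        return False
    traced = sum(claim_flags)
    return traced / len(claim_flags) >= min_fraction


print(is_attributed([True, True, True, False]))    # 3 of 4 claims traced -> True
print(is_attributed([True, False, False, False]))  # 1 of 4 claims traced -> False
```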
Step 2: Build a Function to Parse LLM Output
Create a Python function that breaks the LLM text into individual claims (sentences or clauses).
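    import re

    def parse_claims(text):
        # Split after sentence-ending punctuation that is followed by whitespace
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        # Drop fragments of 10 characters or fewer (e.g. "Yes." or a stray "Dr.")
        return [s for s in sentences if len(s) > 10]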
This gives you a list of claim strings to evaluate independently.
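To see what the splitter does in practice, here is a short usage sketch. The sample sentences are made up for illustration. The second call shows a limitation worth knowing about: abbreviations such as "Dr." end in a period, so they are split off and then dropped by the length filter.

```python
import re

def parse_claims(text):
    # Same splitting rule as in the step above
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if len(s) > 10]

claims = parse_claims("The report covers 2023 revenue. Sales rose 12% in Europe. Great!")
print(claims)
# ['The report covers 2023 revenue.', 'Sales rose 12% in Europe.']

# Abbreviations are split too: "Dr." becomes its own fragment and is filtered out
print(parse_claims("Dr. Smith joined in 2020."))
# ['Smith joined in 2020.']
```

If your outputs contain many abbreviations or decimal-heavy text, you may want a more careful splitter before trusting claim-level results.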
Step 3: Implement the Attribution Check
Attribution checks whether each claim is backed by a source. You'll need a reference set of facts (e.g., a dictionary of {fact: source_id}). The function below uses the simplest form of matching: a claim counts as attributed if the text of a known fact appears inside it, ignoring case. Fuzzy matching can replace this later.
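    def check_attribution(claim, knowledge_base):
        # Case-insensitive substring match: the fact text must appear inside the claim
        claim_lower = claim.lower()
        for fact, source in knowledge_base.items():
            if fact.lower() in claim_lower:
                return True, source
        return False, None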
Return a boolean and the source ID. For a more robust system, use TF-IDF or an embedding model, but pure Python works for a prototype.
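If exact substring matching turns out to be too brittle, one pure-Python option is the standard library's `difflib`. The sketch below is an assumption about how you might do it, not part of the original system: `check_attribution_fuzzy` and `min_ratio` are hypothetical names, and the 0.8 cutoff is an arbitrary starting point. It compares the whole claim against each whole fact, so it suits claims that restate a single fact.

```python
from difflib import SequenceMatcher

def check_attribution_fuzzy(claim, knowledge_base, min_ratio=0.8):
    """Return (matched, source_id) for the most similar fact at or above min_ratio."""
    best_ratio, best_source = 0.0, None
    claim_norm = claim.lower().strip()
    for fact, source in knowledge_base.items():
        # ratio() is a similarity score between 0.0 and 1.0
        ratio = SequenceMatcher(None, fact.lower().strip(), claim_norm).ratio()
        if ratio > best_ratio:
            best_ratio, best_source = ratio, source
    if best_ratio >= min_ratio:
        return True, best_source
    return False, None

# Hypothetical knowledge base for illustration
kb = {"The API rate limit is 100 requests per minute.": "doc-1"}
print(check_attribution_fuzzy("The API rate limit is 100 requests per minute", kb))
print(check_attribution_fuzzy("Bananas are yellow and grow in bunches.", kb))
```

The first call differs from the stored fact only by a missing period and still matches; the second is unrelated and returns `(False, None)`.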
Step 4: Implement the Specificity Check
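Specificity measures detail. As a rough proxy, count numeric tokens and capitalized words that aren't at the start of the sentence. True named-entity recognition needs an NLP library, so this prototype only approximates it. Create a scoring function:
    import re

    def check_specificity(claim):
        digits = len(re.findall(r'\d+', claim))
        # Skip the first word: it is capitalized in every sentence
        parts = claim.split(None, 1)
        rest = parts[1] if len(parts) > 1 else ''
        proper_nouns = len(re.findall(r'\b[A-Z][a-z]+\b', rest))
        return (digits + proper_nouns) > 2  # arbitrary threshold: at least 3 signals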
Adjust the threshold based on your domain.
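To make the threshold adjustable, it can help to separate the raw score from the pass/fail cutoff. The following is a sketch under that assumption; `specificity_score`, `is_specific`, and the `threshold` parameter are hypothetical names, and the sample sentences are invented.

```python
import re

def specificity_score(claim):
    """Count numeric tokens plus capitalized words after the first word."""
    digits = len(re.findall(r"\d+", claim))
    # Drop the first word: it is capitalized in any sentence.
    parts = claim.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    proper_nouns = len(re.findall(r"\b[A-Z][a-z]+\b", rest))
    return digits + proper_nouns

def is_specific(claim, threshold=2):
    # A claim passes when its score is strictly above the threshold
    return specificity_score(claim) > threshold

vague = "Revenue grew a lot last year."
concrete = "Acme opened its Berlin office in March 2021 with 40 staff."
print(specificity_score(vague), is_specific(vague))        # 0 False
print(specificity_score(concrete), is_specific(concrete))  # 4 True
```

Note that "Acme" is not counted because it is the first word, which is the price of ignoring sentence-initial capitals.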

Step 5: Implement the Relevance Check
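Relevance compares the output to the user's query. The function below uses simple keyword overlap: the fraction of query words that also appear in the output. Character n-grams are an alternative.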
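    def check_relevance(output, query):
        query_words = set(query.lower().split())
        if not query_words:
            return False  # empty query: nothing to compare against
        output_words = set(output.lower().split())
        # Fraction of query words that also appear in the output
        overlap = len(query_words & output_words) / len(query_words)
        return overlap > 0.3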
Again, tune the threshold.
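The character n-gram option mentioned above could look like the sketch below. It is an illustration rather than the original implementation: `char_ngrams` and `ngram_relevance` are hypothetical names, and the trigram size and 0.3 cutoff are assumptions. Word-level overlap treats "refund" and "refunds" (or "policy" and "policy.") as different tokens; character n-grams are more forgiving of such differences.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def ngram_relevance(output, query, n=3, threshold=0.3):
    """True if more than `threshold` of the query's n-grams appear in the output."""
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return False  # query is empty or shorter than n characters
    output_grams = char_ngrams(output, n)
    coverage = len(query_grams & output_grams) / len(query_grams)
    return coverage > threshold

print(ngram_relevance("Refunds take five days.", "refund"))              # True
print(ngram_relevance("The weather is sunny today.", "refund policy"))   # False
```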
Step 6: Combine into a Decision Layer
Create a single function that takes output, query, and knowledge base, then returns a pass/fail decision and a report.
    def evaluate_output(output, query, knowledge_base):
        claims = parse_claims(output)
        results = []
        for claim in claims:
            attributed, _source = check_attribution(claim, knowledge_base)
            spec = check_specificity(claim)
            rel = check_relevance(claim, query)
            results.append({'claim': claim, 'attribution': attributed, 'specificity': spec, 'relevance': rel})
        # Decision: pass only if there is at least one claim and every claim meets all criteria.
        # all() on an empty list is True, so without the bool() check an empty output would pass.
        passed = bool(results) and all(
            r['attribution'] and r['specificity'] and r['relevance'] for r in results
        )
        return {'passed': passed, 'details': results}
This is the core layer that replaces “vibes” with reproducible decisions.
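The returned dictionary is easy to turn into a readable report. Below is one possible formatter, written against the result shape shown above; `format_report` is a hypothetical helper, and the sample data is hand-written for illustration rather than produced by a real run.

```python
def format_report(result):
    """Render an evaluate_output-style result dict as plain text."""
    lines = ["PASS" if result["passed"] else "FAIL"]
    for item in result["details"]:
        # Collect the names of the checks this claim did not meet
        failed = [name for name in ("attribution", "specificity", "relevance")
                  if not item[name]]
        status = "ok" if not failed else "failed: " + ", ".join(failed)
        lines.append(f"- {item['claim']} [{status}]")
    return "\n".join(lines)

# Hand-written sample in the shape returned by evaluate_output
sample = {
    "passed": False,
    "details": [
        {"claim": "The plan costs 20 dollars per month.",
         "attribution": True, "specificity": True, "relevance": True},
        {"claim": "It is widely considered the best.",
         "attribution": False, "specificity": False, "relevance": True},
    ],
}
print(format_report(sample))
```

Listing the failed checks per claim makes it clear which of the three criteria caused a rejection.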
Step 7: Test and Iterate
Run your function on a set of known-good and known-bad examples. Adjust thresholds and criteria until false positives/negatives are minimized. Log every failure to improve your knowledge base and rules. Over time, you can add more sophisticated checks (e.g., contradiction detection) while keeping the same three-pillar architecture.
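A small harness makes the false positive and false negative counts concrete. The sketch below assumes you have labeled examples as (inputs, expected_pass) pairs; `confusion_counts` and `toy_evaluate` are hypothetical names, and the toy evaluator is a stand-in you would replace with a wrapper around your own decision function.

```python
def confusion_counts(evaluate, examples):
    """Compare an evaluator's pass/fail decisions with expected labels.

    evaluate: callable taking one example's inputs and returning a truthy/falsy pass flag.
    examples: list of (inputs, expected_pass) pairs.
    Returns the counts and a list of misclassified inputs for logging.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    misses = []
    for inputs, expected_pass in examples:
        predicted_pass = bool(evaluate(inputs))
        if predicted_pass and expected_pass:
            counts["tp"] += 1
        elif predicted_pass and not expected_pass:
            counts["fp"] += 1
            misses.append((inputs, "false positive"))
        elif not predicted_pass and expected_pass:
            counts["fn"] += 1
            misses.append((inputs, "false negative"))
        else:
            counts["tn"] += 1
    return counts, misses

def toy_evaluate(text):
    # Stand-in evaluator: passes any text that contains a digit
    return any(ch.isdigit() for ch in text)

labeled = [
    ("Latency dropped to 120 ms.", True),
    ("It got faster.", False),
    ("Version 2 is better.", False),
    ("Latency fell sharply.", True),
]
counts, misses = confusion_counts(toy_evaluate, labeled)
print(counts)
for text, kind in misses:
    print(kind + ":", text)
```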
Tips for Success
- Start small: Don't try to cover every edge case. Build for one use case first, then expand.
- Keep it modular: Each check should be independently testable. You can replace simple functions with ML models later.
- Use threshold tuning: Run a grid search over your test set to find the best cutoff values for specificity and relevance.
- Document your criteria: Write down why you chose each threshold—this makes the system reproducible and debuggable.
- Combine with human review: For high-stakes applications, use this layer as a first filter, then pass borderline cases to a human.
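The threshold tuning tip can be sketched as a one-dimensional grid search. This assumes you have already computed a numeric score (for example, a relevance overlap) for each labeled example; `best_threshold`, the sample scores, and the candidate cutoffs are all hypothetical.

```python
def best_threshold(scored_examples, candidates):
    """Return (cutoff, accuracy) for the candidate with the highest accuracy.

    scored_examples: list of (score, expected_pass) pairs.
    candidates: iterable of cutoffs; an example passes when score > cutoff.
    Ties keep the first candidate seen.
    """
    if not scored_examples:
        raise ValueError("need at least one labeled example")
    best_cutoff, best_accuracy = None, -1.0
    for cutoff in candidates:
        correct = sum((score > cutoff) == expected
                      for score, expected in scored_examples)
        accuracy = correct / len(scored_examples)
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy

# Invented scores with their expected pass/fail labels
data = [(0.1, False), (0.2, False), (0.35, True),
        (0.5, True), (0.25, False), (0.6, True)]
print(best_threshold(data, [0.1, 0.2, 0.3, 0.4, 0.5]))  # (0.3, 1.0)
```

A cutoff tuned this way fits the examples it was tuned on, so it is worth checking it against a separate set of examples before relying on it.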