AI Evaluation Is the Hardest Unsolved Problem in Engineering

We've gotten incredibly good at building AI systems. We're still terrible at knowing whether they actually work. Evals are the bottleneck nobody's fixing.

The Eval Crisis

Here's a dirty secret about the AI industry: most production AI systems have no meaningful evaluation pipeline. None. Zero. They ship vibes-based AI.

I'm not talking about academic benchmarks. MMLU and HumanEval are useless for telling you whether your customer support bot actually resolves tickets. I'm talking about evaluation for your specific product, with your specific users, on your specific data.

{
  "type": "pie",
  "title": "How Companies Evaluate AI (2025 Survey, n=200)",
  "data": [
    { "label": "No evaluation at all", "value": 35, "color": "red" },
    { "label": "Manual spot-checking", "value": 30, "color": "amber" },
    { "label": "Basic accuracy metrics", "value": 20, "color": "blue" },
    { "label": "Comprehensive eval suite", "value": 10, "color": "cyan" },
    { "label": "Continuous eval in prod", "value": 5, "color": "green" }
  ]
}

35% of companies shipping AI products have no evaluation pipeline. They pushed a prompt to production and prayed. Another 30% have one engineer who occasionally checks 20 outputs and says "looks good."

Why Evals Are So Hard

  1. No ground truth. For classification, you have labeled data. For generation? What's a "correct" summary? What's a "good" code review? Evaluating quality is itself an AI-hard problem.

  2. Distribution shift. Your eval set from last month doesn't represent today's users. Production traffic evolves constantly. Static eval sets go stale fast.

  3. Multiple dimensions. An AI output can be accurate but unhelpful. Helpful but verbose. Concise but wrong. You need to evaluate along 5-10 axes simultaneously.

  4. Scale. Human evaluation is the gold standard, but it doesn't scale. LLM-as-judge scales, but it has its own biases.
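The "multiple dimensions" problem has a concrete consequence: a single scalar score hides the "accurate but unhelpful" failure mode. One way to keep the axes visible is to score each dimension separately and only then aggregate. A minimal sketch — the axis names, weights, and threshold here are illustrative, not from any particular product:

```python
from dataclasses import dataclass

# Hypothetical axis weights -- tune per product. These names and
# numbers are illustrative, not a standard.
AXES = {"accuracy": 0.4, "helpfulness": 0.3, "concision": 0.2, "safety": 0.1}

@dataclass
class MultiAxisScore:
    scores: dict  # axis name -> score in [0, 1]

    def overall(self) -> float:
        """Weighted mean across axes; missing axes count as 0."""
        return sum(AXES[a] * self.scores.get(a, 0.0) for a in AXES)

    def failing_axes(self, threshold: float = 0.5) -> list:
        """Axes below threshold -- 'accurate but unhelpful' shows up here
        even when the overall score looks acceptable."""
        return [a for a in AXES if self.scores.get(a, 0.0) < threshold]
```

Reporting `failing_axes` alongside `overall` is the point: an output scoring 0.72 overall can still be failing outright on helpfulness.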

A Practical Eval Framework

Here's the framework I use in every AI project:

# The three-layer eval pyramid

class EvalPyramid:
    """
    Layer 1: Unit evals     — Fast, automated, run on every commit
    Layer 2: Scenario evals — Weekly, broader test scenarios
    Layer 3: Human evals    — Monthly, gold-standard quality check

    Assumes project-level helpers: Score, Scenario, Task, aggregate,
    llm_judge, and collect_human_rating.
    """
    
    def unit_eval(self, output: str, expected: dict) -> Score:
        """Check structural correctness, format, safety."""
        checks = [
            self.check_format(output, expected["format"]),
            self.check_safety(output),
            self.check_length(output, expected["max_tokens"]),
            self.check_required_fields(output, expected["fields"]),
        ]
        return aggregate(checks)
    
    def scenario_eval(self, output: str, scenario: Scenario) -> Score:
        """LLM-as-judge for quality dimensions."""
        return llm_judge(
            output=output,
            criteria=scenario.criteria,
            rubric=scenario.rubric,
            model="claude-3-5-sonnet",
        )
    
    def human_eval(self, output: str, task: Task) -> Score:
        """Expert human rating on 1-5 scale."""
        return collect_human_rating(output, task, num_raters=3)
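Layer 1 is the cheapest to make real, because its checks need no model calls at all. Here is a self-contained sketch of what the unit-eval helpers might look like for a JSON-producing system — the helper names mirror the pyramid above, but the logic is illustrative and will be product-specific:

```python
import json

def check_format(output: str, fmt: str) -> bool:
    """Structural check: does the output parse as the expected format?"""
    if fmt == "json":
        try:
            json.loads(output)
            return True
        except ValueError:
            return False
    return True  # unknown formats pass by default in this sketch

def check_length(output: str, max_tokens: int) -> bool:
    """Crude token proxy: whitespace-split word count."""
    return len(output.split()) <= max_tokens

def check_required_fields(output: str, fields: list) -> bool:
    """For JSON outputs, verify all required top-level keys exist."""
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return all(f in data for f in fields)

def unit_eval(output: str, expected: dict) -> float:
    """Aggregate as a pass rate in [0, 1]."""
    checks = [
        check_format(output, expected["format"]),
        check_length(output, expected["max_tokens"]),
        check_required_fields(output, expected["fields"]),
    ]
    return sum(checks) / len(checks)
```

Because these checks are deterministic and fast, they can run on every commit without budget or flakiness concerns — which is exactly why they sit at the base of the pyramid.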

{
  "type": "tree",
  "title": "Eval Deployment Pipeline",
  "color": "green",
  "steps": [
    "Code Change",
    "Unit Evals — Every Commit",
    "Deploy to Staging",
    "Scenario Evals — Weekly",
    "Deploy to Production",
    "Human Evals — Monthly",
    {
      "label": "Issues Found?",
      "branches": [
        { "condition": "Fail at Unit", "color": "red", "steps": ["Block Deploy"] },
        { "condition": "Fail at Scenario", "color": "amber", "steps": ["Investigate & Fix"] },
        { "condition": "Pass", "color": "green", "steps": ["Update Eval Suite"], "loop": "Unit Evals" }
      ]
    }
  ]
}
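The "fail at unit → block deploy" branch maps naturally onto ordinary CI tests: if unit evals are written as test functions, a failing check fails the build. A hypothetical sketch — `generate` and the eval cases are stand-ins for your real model call and golden set:

```python
import json

def generate(prompt: str) -> str:
    """Stand-in for the real model call; replace with your system."""
    return json.dumps({"answer": "stub", "prompt": prompt})

# A tiny golden set; in practice this lives in a versioned data file.
EVAL_CASES = [
    {"prompt": "refund policy?", "required_fields": ["answer"], "max_words": 100},
]

def test_unit_evals():
    for case in EVAL_CASES:
        out = generate(case["prompt"])
        data = json.loads(out)                        # format check
        assert len(out.split()) <= case["max_words"]  # length check
        for field in case["required_fields"]:         # field check
            assert field in data
```

Run under any test runner (e.g. pytest) in the commit pipeline, a red test here is what "Block Deploy" means in practice.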

The Bottom Line

If you're building AI and you don't have evals, you don't have a product. You have a demo that might work. Build your eval suite before you build your product — or at least in parallel. It's the single highest-leverage investment you can make in AI quality.
