The new AI skill is taste: how teams evaluate model work without drowning in metrics
AI evaluation is becoming product engineering: teams need taste, test sets, source discipline, reviewer rubrics, and failure analysis instead of leaderboard worship.

Key takeaways
- Leaderboards help select candidates, but product-quality AI depends on workflow-specific evaluation.
- Human taste is not anti-scientific; it becomes useful when converted into rubrics, examples, and repeatable review.
- The best eval suites include good answers, bad answers, stale context, missing evidence, and policy refusal cases.
- Evaluation should change product behavior, not sit in a spreadsheet after launch.
The least fashionable AI skill is becoming the most important one: taste. Not taste as in pretty demos or clever prompts. Taste as in knowing whether the answer is actually useful, whether the summary missed the one sentence that mattered, whether the code is maintainable, whether the citation supports the claim, and whether the system should have refused the task entirely.
AI teams often start with leaderboards because leaderboards are comforting. They turn messy intelligence into a row of numbers. That is useful when choosing candidate models, but it does not tell you whether your legal memo assistant respects internal clause preferences, whether your SOC summarizer preserves uncertainty, or whether your developer agent edits in the style of your repository.
The move from model evaluation to product evaluation is the move from 'Which model is smartest?' to 'Which system creates trustworthy work here?' That system includes prompts, retrieval, tools, permissions, user interface, fallback behavior, and review loops. A model can be excellent and still power a bad product if the surrounding system gives it weak context or the wrong incentives.
Taste becomes engineering when it is written down. A senior analyst may say, 'This summary feels thin.' That instinct matters, but it is not enough. What made it thin? Did it omit numbers? Fail to identify the decision owner? Flatten contradictory evidence? Ignore dates? Use vague language? Once the reviewer names the defect, the team can build a rubric. Once there is a rubric, examples can be scored. Once examples are scored, the system can improve.
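As a minimal sketch of what "written down" can look like, the defects named above can be turned into yes/no rubric criteria and scored per example. The criterion names, the `Criterion` class, and the `score` function are illustrative assumptions, not a standard rubric format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the reviewer checks, phrased as a yes/no question


# Hypothetical rubric derived from the "thin summary" defects in the text.
SUMMARY_RUBRIC = [
    Criterion("numbers", "Does the summary keep the key figures?"),
    Criterion("decision_owner", "Does it identify the decision owner?"),
    Criterion("contradictions", "Does it preserve contradictory evidence?"),
    Criterion("dates", "Does it state the relevant dates?"),
    Criterion("specific_language", "Is the language specific rather than vague?"),
]


def score(review, rubric=SUMMARY_RUBRIC):
    """Return (fraction of criteria passed, names of failed criteria)."""
    missing = [c.name for c in rubric if c.name not in review]
    if missing:
        raise ValueError(f"review is missing criteria: {missing}")
    failed = [c.name for c in rubric if not review[c.name]]
    passed = len(rubric) - len(failed)
    return passed / len(rubric), failed


# One reviewer's answers for one summary (made-up data).
review = {
    "numbers": False,
    "decision_owner": True,
    "contradictions": False,
    "dates": True,
    "specific_language": True,
}
fraction, failed = score(review)
print(fraction, failed)
```

The failed criterion names are arguably more useful than the fraction: they say what made the summary thin, which is what the team can act on.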
A good evaluation set looks like the work, not like a textbook. If the product answers customer-support questions, the evals should include messy customer language, outdated help-center pages, overlapping product names, refund edge cases, angry tone, and questions where the answer should be escalated. If the product writes code, the evals should include tests, style constraints, dependency rules, and cases where the right answer is a small patch rather than a rewrite.
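One way to keep an eval set honest is to tag each case with the kind of mess it represents and check that no category is empty. The sketch below assumes a customer-support product; the tag names and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter

# Hypothetical tags mirroring the customer-support situations listed above.
REQUIRED_TAGS = {
    "messy_language",
    "outdated_page",
    "overlapping_names",
    "refund_edge_case",
    "angry_tone",
    "should_escalate",
}


def coverage_gaps(cases, required=REQUIRED_TAGS, min_per_tag=1):
    """Return the required tags that appear in fewer than min_per_tag cases."""
    counts = Counter(tag for case in cases for tag in case["tags"])
    return sorted(tag for tag in required if counts[tag] < min_per_tag)


# Made-up eval cases; real ones would come from actual tickets and logs.
cases = [
    {"id": "c1", "tags": ["messy_language", "angry_tone"]},
    {"id": "c2", "tags": ["outdated_page"]},
    {"id": "c3", "tags": ["refund_edge_case", "should_escalate"]},
]
print(coverage_gaps(cases))
```

Here the check reports that no case covers overlapping product names, which is a prompt to go find one in real traffic rather than invent it.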
The strongest evals include traps. Stale documents should lose to current ones. A source that says the opposite of the expected answer should be caught. A user asking for data they cannot access should be refused. A prompt injection hidden in retrieved text should be ignored. A task with missing information should trigger a question. These cases feel unfair only if you imagine the model as a quiz taker. In production, they are normal weather.
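A few of these traps can be expressed as cases where the expected behavior is not always "answer". The sketch below is one possible shape, assuming the system under test reports what it did and which sources it cited; `Behavior`, `TrapCase`, `Observed`, and the source ids are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Behavior(Enum):
    ANSWER = "answer"
    REFUSE = "refuse"
    ASK = "ask"  # ask a clarifying question


@dataclass(frozen=True)
class TrapCase:
    name: str
    expected: Behavior
    must_cite: Optional[str] = None      # source the answer should rely on
    must_not_cite: Optional[str] = None  # e.g. the stale document


@dataclass(frozen=True)
class Observed:
    behavior: Behavior
    cited: frozenset = frozenset()


def passes(case, seen):
    """True if the observed behavior and citations satisfy the trap."""
    if seen.behavior is not case.expected:
        return False
    if case.must_cite is not None and case.must_cite not in seen.cited:
        return False
    if case.must_not_cite is not None and case.must_not_cite in seen.cited:
        return False
    return True


stale = TrapCase("stale_vs_current", Behavior.ANSWER,
                 must_cite="policy_current", must_not_cite="policy_old")
no_access = TrapCase("data_without_access", Behavior.REFUSE)
missing_info = TrapCase("missing_information", Behavior.ASK)

# Made-up observations: the system cited the stale page and answered
# instead of refusing, but did ask when information was missing.
print(passes(stale, Observed(Behavior.ANSWER, frozenset({"policy_old"}))))
print(passes(no_access, Observed(Behavior.ANSWER)))
print(passes(missing_info, Observed(Behavior.ASK)))
```

Prompt-injection and contradicting-source traps would need richer checks than citation ids; this sketch covers only the behavior and source-choice cases.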
Metrics should be plain enough that product and engineering can argue about them together. Accuracy matters, but so do citation support, completeness, tone, format adherence, latency, cost, refusal quality, and action safety. Some tasks need exact scoring. Others need reviewer judgment. The answer is not to eliminate subjectivity; it is to make subjectivity discussable.
One practical method is the three-column review: expected behavior, observed behavior, defect label. For every bad answer, the reviewer chooses a label such as missing evidence, wrong source, overstated confidence, permission leak, bad format, unnecessary verbosity, unsafe action, or weak next step. After a few dozen failures, patterns appear. Maybe the model is fine but retrieval is noisy. Maybe the prompt rewards completeness over clarity. Maybe the UI hides citations. Maybe the workflow should be split into two steps.
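The three-column review maps naturally onto a small record type plus a tally. This is a sketch under assumptions: the label set is a snake_case rendering of the labels in the text, and `ReviewRow` and `tally` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass

# Closed label set, so reviewers cannot drift into near-duplicate labels.
DEFECT_LABELS = {
    "missing_evidence",
    "wrong_source",
    "overstated_confidence",
    "permission_leak",
    "bad_format",
    "unnecessary_verbosity",
    "unsafe_action",
    "weak_next_step",
}


@dataclass(frozen=True)
class ReviewRow:
    expected: str
    observed: str
    defect: str


def tally(rows):
    """Count defects, most frequent first; reject labels outside the set."""
    unknown = sorted({r.defect for r in rows} - DEFECT_LABELS)
    if unknown:
        raise ValueError(f"unknown defect labels: {unknown}")
    return Counter(r.defect for r in rows).most_common()


# Made-up review rows for illustration.
rows = [
    ReviewRow("Cite the current policy", "Cited an archived page", "wrong_source"),
    ReviewRow("Quote the contract clause", "Paraphrased with no citation", "missing_evidence"),
    ReviewRow("Use the latest price list", "Used last year's list", "wrong_source"),
]
print(tally(rows))
```

In this toy data `wrong_source` dominates, which would point at retrieval rather than the model. A fixed label set is a design choice: it trades some expressiveness for counts that can be compared week to week.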
Evaluation also needs regression tests. AI products are unusually easy to accidentally change. A model update, prompt edit, embedding refresh, parser tweak, or tool-description rewrite can improve one behavior and break another. When a high-value failure is found, it should become a permanent test. The point is not to freeze the product; it is to stop relearning painful lessons.
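A pinned failure can be as small as a prompt and a check that runs on every change. The sketch below assumes the system can be called as a function from prompt to text; the case ids, prompts, and `stub_system` are made up, and the substring check is deliberately crude (a real suite would likely use the rubric or a structured check).

```python
# Each entry pins one previously observed high-value failure (hypothetical).
REGRESSION_CASES = [
    {
        "id": "refund-edge-case",
        "prompt": "Can I get a refund after the return window?",
        "must_contain": "escalate",
    },
    {
        "id": "restricted-data",
        "prompt": "Show me another customer's invoices.",
        "must_contain": "cannot",
    },
]


def failing_cases(system, cases=REGRESSION_CASES):
    """Run every pinned case and return the ids whose check fails."""
    failures = []
    for case in cases:
        output = system(case["prompt"]).lower()
        if case["must_contain"] not in output:
            failures.append(case["id"])
    return failures


def stub_system(prompt):
    """Stand-in for the real product: handles refunds, leaks invoices."""
    if "refund" in prompt.lower():
        return "I will escalate this to a support agent."
    return "Here are the invoices you asked for."


print(failing_cases(stub_system))
```

Run after a model update, prompt edit, or embedding refresh, a non-empty result means an old lesson is being relearned.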
Human review should be a sampled, ongoing habit, not a launch ceremony. Many teams review heavily during launch, then stop once the dashboard looks quiet. That is backwards. Real users invent new prompts, documents drift, policies change, and models update. A light but steady review habit catches slow degradation before it becomes a reputation problem.
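One way to keep review light but steady is to route a fixed fraction of production interactions to reviewers. The sketch below hashes a record id so the decision is stable across reruns; the `selected_for_review` name, the id format, and the 5% rate are assumptions for illustration.

```python
import hashlib


def selected_for_review(record_id, rate):
    """Deterministically select roughly `rate` of records for human review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical interaction ids; 5% go to the review queue.
ids = [f"interaction-{n}" for n in range(10_000)]
queue = [i for i in ids if selected_for_review(i, 0.05)]
print(len(queue))
```

Hashing rather than calling a random number generator means the same interaction is always in or out of the sample, which makes the review queue reproducible. A team might raise the rate for high-risk workflows or right after a model or prompt change.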
The cultural challenge is that evaluation slows the fantasy down. It turns 'AI can do this' into 'AI can do this under these conditions, with these sources, at this quality threshold, except for these cases.' That sentence is less exciting and far more useful. It is the difference between a demo and a product.
Taste is not a replacement for metrics. It is the source material for better metrics. The teams that build durable AI systems will be the ones that turn expert judgment into examples, examples into tests, tests into product changes, and product changes into safer work. The leaderboard may help pick the engine. Taste keeps the vehicle on the road.
Frequently asked questions
What is an AI evaluation?
An AI evaluation is a repeatable way to test whether a model or AI system performs well on the tasks that matter, judged on dimensions such as accuracy, usefulness, safety, citation support, format, and policy behavior.
Are benchmark scores enough?
No. Benchmarks are useful for comparing general capability, but every serious AI product needs task-specific tests based on real user workflows and failure modes.
How many examples does an AI eval need?
Start with 30 to 100 strong examples if the workflow is narrow. Expand as failures appear. Quality and coverage matter more than a huge but shallow set.



