Gaurav Panchal's Blog: Frontier models to build Harnesses not Products

We are likely overusing frontier AI models as if they were end‑user products, when they should mostly function as infrastructure for building “harnesses”—systems that orchestrate many smaller, cheaper, task‑specific models to handle the variable parts of application development.[1][2][3][5] In this view, frontier models are the “meta‑intelligence” that configures and supervises work, while specialized models do most of the execution.[1][5]

The rest of this article develops that idea.

1. Frontier models: what they are really good at

Frontier models are the most advanced, general‑purpose AI models available at any point in time, trained with massive compute and data to achieve state‑of‑the‑art performance across many tasks.[1][2][3][6][7]

Key properties:[1][2][3]

Very broad capabilities: code generation, complex reasoning, multimodal understanding and generation (text, images, audio, video).[1][2][3]
Emergent abilities: skills that were not explicitly programmed—such as tool use, multi‑step reasoning, or creative synthesis.[2][3]
Cross‑domain generality: they can be adapted to many domains with relatively little task‑specific fine‑tuning.[2][3]

In practice, they are excellent at:

Figuring out what needs to be done (problem decomposition, planning, strategy).
Interacting with humans at a high level (requirements gathering, explanation, negotiation).
Rapidly exploring design spaces (alternative APIs, schemas, architectures, prompts, workflows).

They are not inherently optimized for:

Strict latency constraints at scale.
Very high accuracy on narrow, highly specialized tasks.
Cost‑efficient, predictable throughput in production environments.

Those are the conditions under which smaller, purpose‑built models tend to win.[1][5]

2. How we might be using AI the wrong way

Most organizations today adopt something like this pattern:

Pick a popular frontier model (or a hosted API).
Wrap it in a chat UI, maybe add RAG.
Use it for everything—from requirements to code suggestions to test generation to monitoring.

This leads to several systemic issues:

Overkill for simple tasks
Frontier models are used even when a simple classification, extraction, or routing job would suffice.[1][5]
Unnecessary cost and latency
General models with billions of parameters are invoked for trivial subtasks that a tiny, fast model could handle at a fraction of the cost.[1][5][9]
Lower accuracy on specialized tasks
In domains like software testing, purpose‑built models trained on testing data often outperform frontier models (e.g., 90%+ test execution success vs. ~60–70% for frontier models repurposed for testing).[5]
Operational fragility
Every new feature “just calls the big model,” creating systems that are hard to optimize, govern, and reason about over time.[1][2][6]
Governance blind spots
When a single frontier model sits directly in the user path, it’s harder to implement fine‑grained guardrails, auditing, and domain‑specific constraints.[2][4][6][7]

The core mistake: treating frontier models as monolithic application engines rather than orchestration brains that configure and manage a constellation of smaller models and tools.

3. Harnesses: using frontier models as orchestration, not as the engine

In emerging AI architectures, there is a growing focus on harnesses—agentic or orchestration platforms that coordinate models and tools.[1][3][8][9]

Conceptually, a harness is:

A system layer that:
Routes tasks to the right models and tools.
Structures multi‑step workflows.
Applies guardrails, validations, and feedback loops.[1][2][3][4]
Often powered by a frontier model to perform:
Task understanding and decomposition.
Dynamic tool selection and prompt construction.
Monitoring and adjusting sub‑models based on outcomes.[1][3][8]

For example, modern AI agent architectures often:

Use a router that classifies the incoming task and decides whether to call a frontier model or a smaller, specialized one.[1]
Keep the system “always at the frontier” for complex reasoning, while relying on specialized, lightweight models for repetitive or domain‑specific tasks.[1][5][8]
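The routing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any vendor's router: the `Task` type, the `SIMPLE_KINDS` set, the `choose_tier` function, and the length threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical task kinds that a small, specialized model is assumed to handle well.
SIMPLE_KINDS = {"classification", "extraction", "formatting", "routing"}


@dataclass
class Task:
    kind: str  # coarse label, e.g. "classification" or "planning"
    text: str  # the raw request


def choose_tier(task: Task, max_simple_chars: int = 2000) -> str:
    """Return "small" for bounded, repetitive work and "frontier" otherwise.

    The length cutoff is an illustrative assumption: very long inputs are
    sent to the frontier model even when the task kind looks simple.
    """
    if task.kind in SIMPLE_KINDS and len(task.text) <= max_simple_chars:
        return "small"
    return "frontier"
```

In a real harness the `kind` label would itself likely come from a cheap classifier; here it is supplied by the caller to keep the sketch self-contained.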

This is the direction major vendors are pushing toward: frontier + open/specialized models in a coordinated system, not “one giant model for everything.”[1][2][8][9]

4. A better pattern: frontier‑as‑harness, small‑models‑as‑workers

The alternative architecture proposed here can be summarized as:

Use frontier models to build and control harnesses; inside those harnesses, use smaller, specialized models for the variable parts of application development.

Concretely, this suggests a three‑layer approach:

4.1 Foundation & infrastructure layer

Data pipelines, vector stores, CI/CD for models, monitoring, security, and governance.[2][6]
This layer is model‑agnostic and focuses on:
Data quality and versioning.
Deployment pipelines and LiveOps (monitoring, rollback, improvement).[2]
Security, zero‑trust principles, and AI‑native defences.[2][4][6]

4.2 Frontier harness (orchestration) layer

This is where frontier models live:

Task understanding & routing
The harness uses a frontier model to:
Parse the user request or dev intent.
Decompose it into subtasks (e.g., “generate schema,” “produce tests,” “update docs”).
Decide which sub‑model or tool handles each part.[1][3]
Policy & guardrail enforcement
The harness implements:
Safety filters, output checks, and bias controls.
IP protection and PII removal.
Compliance with legal and regulatory constraints.[2][4][7]
Meta‑learning about the system
The frontier model observes:
Which small models perform well on which tasks.
When to retrain or swap them.
Where human review is required.

Frontier models thus act as meta‑developers: managing workflows, tools, and quality—not writing every line of code or answering every trivial user query.
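As a rough sketch of the task understanding and routing role, the following Python shows a harness that decomposes a request into subtasks and dispatches each to a registered worker. Everything here is an assumption made for illustration: `plan_subtasks` is a keyword-matching stand-in for a frontier model call, and the `WORKERS` registry uses lambdas in place of real small models or tools.

```python
from typing import Callable, Dict, List

# Hypothetical registry: subtask name -> callable standing in for a small model or tool.
WORKERS: Dict[str, Callable[[str], str]] = {
    "generate schema": lambda req: f"[schema for: {req}]",
    "produce tests": lambda req: f"[tests for: {req}]",
    "update docs": lambda req: f"[docs for: {req}]",
}


def plan_subtasks(request: str) -> List[str]:
    """Stand-in for the frontier model's decomposition step.

    A real harness would call a model here; this stub uses keyword matching.
    """
    lowered = request.lower()
    plan: List[str] = []
    if "table" in lowered or "schema" in lowered:
        plan.append("generate schema")
    if "test" in lowered:
        plan.append("produce tests")
    plan.append("update docs")  # assumed to be always required in this sketch
    return plan


def run_harness(request: str) -> Dict[str, str]:
    """Dispatch each planned subtask to its worker; flag unknown subtasks for review."""
    results: Dict[str, str] = {}
    for subtask in plan_subtasks(request):
        worker = WORKERS.get(subtask)
        results[subtask] = worker(request) if worker else "NEEDS_HUMAN_REVIEW"
    return results
```

The point of the design is that the planner and the workers are separate: swapping a worker for a better small model does not require touching the planning step.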

4.3 Specialized / small‑model layer

Below the harness are smaller models:

Domain‑specific models:
Code linters and static analysis models for a particular tech stack.
Testing models trained on UI interaction patterns and QA data—shown to outperform general frontier models in automated testing accuracy and maintenance cost.[5]
Domain classifiers, slot fillers, and ranking models for particular products or verticals.
Narrow, optimized models:
Intent classification, sentiment analysis, FAQ retrieval.
Schema generation, code formatting, docstring synthesis.

These models are easier to:

Fine‑tune on proprietary data.
Validate, govern, and explain.
Run locally, on‑device, or on edge hardware for low latency and privacy.

5. Software development as a case study

Applying this architecture to application development:

5.1 Today’s typical pattern

Developers chat directly with a frontier model for:
Requirements clarification, API design, code generation, test generation, documentation.
The same model is sometimes used in CI, testing, incident response, and even product analytics.

Problems:

Cost scales linearly (or worse) with adoption.
The model is not tuned to your stack, patterns, or organizational constraints.
You get inconsistent output styles and varying quality across teams.

5.2 Harness‑centric pattern

Imagine an AI dev harness with these characteristics:

Frontier model roles:
Understand high‑level requirements and user stories.
Propose architecture options and trade‑offs.
Design workflows and pipelines (e.g., CI steps, quality gates).
Generate and refine prompts or configurations for downstream models.
Small / specialized models:
Code generation tuned to your language, framework, and internal libraries.
Testing models specialized in UI, integration, and regression scenarios.[5]
Security and compliance analyzers for your regulatory domain.
Performance analyzers calibrated to your infrastructure.

End‑to‑end flow might look like:

Product manager describes a feature in natural language.
Frontier harness:

Parses requirements, identifies affected components.
Generates a plan: models, tools, and steps.

Harness calls:

Code‑gen model A for backend changes.
Code‑gen model B for frontend.
Testing model C for test cases and test execution.[5]

Harness runs validation:

Static analysis, security scans, style checks.
Falls back to frontier reasoning only when anomalies appear.

Only then are changes surfaced for human review or deployed.

In this system, frontier models are used sparingly but strategically, maximizing their unique capabilities while minimizing their footprint in routine work.
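The validation step in the flow above, where frontier reasoning is invoked only on anomalies, might look roughly like this. The `Change` type, the `no_todo_check` rule, and the `escalate` callback are hypothetical; a real harness would plug in static analysis, security scans, and an actual frontier model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Change:
    component: str
    diff: str
    findings: List[str] = field(default_factory=list)


Check = Callable[[Change], List[str]]


def no_todo_check(change: Change) -> List[str]:
    """Illustrative cheap check: flag diffs that still contain a TODO marker."""
    return ["unresolved TODO"] if "TODO" in change.diff else []


def validate(
    changes: List[Change],
    checks: List[Check],
    escalate: Callable[[Change], None],
) -> Tuple[List[Change], List[Change]]:
    """Run cheap checks on every change; call `escalate` only for flagged ones."""
    clean: List[Change] = []
    flagged: List[Change] = []
    for change in changes:
        for check in checks:
            change.findings.extend(check(change))
        if change.findings:
            escalate(change)  # stand-in for a frontier model review
            flagged.append(change)
        else:
            clean.append(change)
    return clean, flagged
```

Because `escalate` runs only when a check produces findings, the expensive model is touched in proportion to the anomaly rate rather than the total volume of changes.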

6. Why smaller models should dominate the variable parts

The “variable parts” of application development—business logic, tests, routing rules, workflows—are where specialization pays off most.

Advantages of using smaller models here:

Higher task‑specific performance
Evidence from the software‑testing domain suggests that purpose‑built models can achieve significantly higher accuracy and lower maintenance cost than adapted frontier models in that domain.[5]
Lower cost and latency
Smaller models require less compute, making them suitable for:
CI/CD integration.
High‑throughput test runs.
User‑facing features needing sub‑second latency.
Easier governance and control
Narrow models:
Have smaller behaviour surfaces, making failures easier to characterize.
Can be more rigorously evaluated on domain‑specific benchmarks.
Are simpler to monitor for drift and regressions.[2][4][6]
Better privacy & IP protection
Sensitive data can stay in:
On‑prem or VPC deployments of small models.
Isolated environments with strict network and data governance.[2][6]

Frontier models stay in a more limited, well‑guarded role: high‑level reasoning, orchestration, exception handling, and designing/updating the harness itself.
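The cost argument in this section reduces to simple arithmetic: if a share of requests moves to a cheaper model, total spend is a weighted average of the two per-request costs. The helper below is a sketch under that assumption only; the function name and parameters are hypothetical, and any numbers passed in are placeholders rather than measured prices.

```python
def blended_cost(
    requests: int,
    small_share: float,
    small_cost: float,
    frontier_cost: float,
) -> float:
    """Total cost when `small_share` of requests go to the small model.

    Costs are per request, in whatever unit the caller uses consistently.
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    per_request = small_share * small_cost + (1.0 - small_share) * frontier_cost
    return requests * per_request
```

For example, with purely illustrative unit costs of 1 and 10, routing half of 1,000 requests to the small model gives 5,500 units instead of 10,000. This ignores routing overhead and quality differences, which a real comparison would need to include.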

7. Governance, safety, and the frontier harness

Regulators and safety organizations are already thinking in terms of frontier AI frameworks and systems, not just standalone models.[4][6][7]

Relevant trends:

Frontier frameworks & best practices
Workstreams focus on safety, security, and robust operational practices for frontier systems, underscoring that these models should be embedded in structured frameworks rather than exposed unprotected in production.[4]
Dynamic, capability‑oriented regulation
Legal analyses emphasize that frontier models should be defined and regulated based on capabilities and training compute, with adaptable thresholds as models advance.[3][7]
Zero‑trust and AI‑native defences
Guidance from cybersecurity agencies calls for treating AI agents as non‑human identities, subject to zero‑trust architectures, segmentation, and AI‑native defences.[6]

A harness‑centric architecture naturally aligns with these principles:

Frontier models sit behind multiple layers of:
Routing.
Guardrails and filters.
Logging and observability.
Small models encapsulate narrower risks, making:
Audits more tractable.
Red‑teaming more focused.
Interventions more targeted (retrain or replace a sub‑model rather than a monolith).[2][4][6]
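One way to picture guardrails and logging sitting in front of a model is a thin wrapper that filters input and output and records what it did. The sketch below assumes a deliberately simple policy, redacting email addresses with a regular expression; `guarded_call`, `redact_emails`, and the audit record fields are hypothetical names, and a production PII filter would need far broader coverage.

```python
import re
from typing import Callable, Dict, List, Tuple

# Simplified email pattern for illustration; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> Tuple[str, int]:
    """Replace email addresses and return the new text plus the number replaced."""
    return EMAIL_RE.subn("[REDACTED_EMAIL]", text)


def guarded_call(
    model: Callable[[str], str],
    prompt: str,
    audit_log: List[Dict[str, int]],
) -> str:
    """Filter the prompt, call the model, filter the output, and log both counts."""
    safe_prompt, redacted_in = redact_emails(prompt)
    raw_output = model(safe_prompt)
    safe_output, redacted_out = redact_emails(raw_output)
    audit_log.append({"redacted_in": redacted_in, "redacted_out": redacted_out})
    return safe_output
```

Since the wrapper takes any callable as `model`, the same guardrail can sit in front of a frontier model or a small worker without either needing to know about it.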

8. Practical migration path: from “big model everywhere” to “frontier harness + small models”

For teams already using frontier models as the default engine, a pragmatic evolution might look like:

Instrument and observe

Log which queries/tasks are sent to the frontier model.
Cluster them into categories: complex reasoning vs. simple classifications, CRUD ops, formatting, etc.

Introduce a router

Use a simple heuristic or a smaller model to route easy, repetitive tasks to cheaper, specialized models.[1][5]

Carve out specialized domains

Identify high‑volume, well‑bounded tasks (e.g., test generation, simple code transformations).
Train or adopt specialized small models for these.

Elevate the frontier model

Shift the frontier model’s role toward:
- Planning and workflow design.
- Edge‑case handling and arbitration between models.
- Monitoring and adapting the harness based on performance.

Integrate governance

Implement guardrails at the harness level: safety filters, policy checks, and human‑in‑the‑loop gates where needed.[2][4][6]

Continuously rebalance

As small models improve, move more work down from the frontier.
As new frontier capabilities emerge, update the harness’s orchestration logic, not every application.
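The first migration step, instrumenting and categorizing traffic, can start with something as crude as keyword bucketing over logged prompts. The following is a sketch under that assumption: the `CATEGORY_KEYWORDS` table and both function names are hypothetical, and substring matching like this will misclassify some prompts, so it is only a starting point before proper clustering.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical keyword heuristics for tasks that may not need a frontier model.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "classification": ("classify", "label", "categorize"),
    "formatting": ("format", "reformat", "prettify"),
    "crud": ("create record", "update record", "delete record"),
}


def categorize(prompt: str) -> str:
    """Return the first matching category, or "complex_reasoning" if none match."""
    lowered = prompt.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "complex_reasoning"


def offload_share(prompts: Iterable[str]) -> Tuple[Counter, float]:
    """Count prompts per category and estimate the share that could move to small models."""
    counts = Counter(categorize(p) for p in prompts)
    total = sum(counts.values())
    simple = total - counts["complex_reasoning"]
    return counts, (simple / total if total else 0.0)
```

The resulting share is an estimate of how much traffic is a candidate for the router in the next step, not a guarantee that small models will handle it acceptably.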

Over time, frontier models become a strategic, centralized capability, while most day‑to‑day work flows through optimized, domain‑specific components.

9. Reframing the question

So “maybe we are using AI the wrong way” can be sharpened into:

Wrong mental model: frontier AI is the application. It is better understood as infrastructure and orchestration intelligence.
Right mental model: treat frontier models as the top of a hierarchy that designs, configures, and supervises a network of smaller models and tools; the bulk of application logic lives in these more specialized, cheaper, and controllable components.[1][2][3][5][8]

Under this framing, the future of AI‑assisted application development is not “one giant model that does everything,” but harness‑centric systems where:

Frontier models build and evolve the harness.
The harness routes and governs.
Small models execute the variable, domain‑specific work.

Thursday, May 28, 2026