Methodology · LLM-App Engineeringprovenverified

Build-or-Buy Foundation Model Decision

also known as seven-axis model sourcing, open vs proprietary decision, build vs buy foundation model

Applies to: llm-appagentrag-system

Tags: build-vs-buymodel-sourcingseven-axes

Choose how to source your model by scoring it on seven clear factors. The choices are: host an open-weights model yourself, call a vendor's API, or fine-tune your own. The seven factors are data privacy, data lineage, performance, functionality, control, cost, and team capability. Give each factor a verdict and a weight. Together they point to the sourcing decision. The thing you keep is a written record that names which factors the losing options failed on. That way the choice can be checked later when conditions change.

Methodology process overview

flowchart TD in1[Use case and data profile] --> s1[List candidate options] in2[Performance target] --> s1 in3[Team capability inventory] --> s1 in4[Cost envelope] --> s1 s1 --> s2[Score on data privacy] s2 --> s3[Score on data lineage] s3 --> s4[Score on performance functionality control] s4 --> s5[Score on cost and team needs] s5 --> s6[Weight decide document] s6 --> out1[Seven-axis comparison table] s6 --> out2[Decision record]

Intent. Replace gut-feel calls like 'use OpenAI' or 'self-host Llama' with a seven-factor comparison whose verdicts and weights are written down.

When to apply. Use this for any new LLM application, or any system whose current model choice is up for review. Triggers include a new regulation, a new model release, a new cost ceiling, or a new performance gap. Run it again from time to time, because the open-versus-vendor landscape moves fast. Don't apply it when an outside rule already forces the answer, such as a contract that names one vendor. Skip it for throwaway prototypes that last under a day.

Example scenario

A mid-size healthtech startup was building a clinician-facing note summariser for German hospitals. After a six-week prototype on the OpenAI API, it faced the build-or-buy question. The team listed four candidates: keep OpenAI, switch to Azure OpenAI EU, self-host Llama 3 70B on dedicated GPUs, or fine-tune Mistral 7B in-house. Each was scored on the seven factors. The head of platform and the data-protection officer set the weights together. Data privacy and data lineage carried the heaviest weights, because German hospital data falls under BDSG plus GDPR. Pure OpenAI lost at once on data location. Azure OpenAI EU scored well on privacy but lost on data lineage against the open-weights options. Self-hosted Llama 3 won privacy and lineage but lost on team capability. The four-person platform team had no 24/7 GPU on-call rotation. The hybrid path won. They used Azure OpenAI EU for general summarisation, and a fine-tuned Mistral 7B for the one structured-extraction subtask where the in-house running cost was worth it. The decision record named team capability as the factor that knocked out pure self-hosting. It also pinned the condition for revisiting: when the platform team grows past eight engineers, score it again. Six months later the data-protection officer quoted the record word for word during an external audit.

Inputs

Use case and data profile — What the model must do and what data it will see. Include how sensitive the data is, where it must live, and whether a third party may log the prompts.
Performance target — The quality bar from your test set, plus how fast and how many requests the system must handle.
Team capability inventory — An honest read on your in-house skills: machine-learning operations, running GPUs, and security review.
Cost envelope — The most you will pay per call, the most you will spend per month, and how long you have to pay off any up-front cost.

Outputs

Seven-axis comparison table — A grid of the candidate options scored on each factor, with a weight and a verdict. Options are vendor API, self-hosted open weights, and fine-tuned own model.
Decision record — A written reason that names which factors the losing options failed on, and the conditions under which to revisit the choice.

Steps (6)

List candidate options
List the realistic options. They include a vendor API such as OpenAI, Anthropic, or Google, self-hosted open weights such as Llama, Mistral, or Qwen, a fine-tuned in-house model, and a hybrid of these. Treat hybrid as a real option, not a fallback.
Score each option on data privacy
Where does the prompt go? Is it logged? Can it be used for training? Does it meet your data-location rules? Vendor APIs usually lose this factor when the data is regulated.
Score on data lineage
Do you know what the model was trained on? This is its data lineage. Open-weights models tend to be more open about it. Vendor models tend to be less open. It matters when regulators or your own users ask.
Score on performance, functionality, and control
Performance is the quality from your test set. Functionality is the features the option supports, such as function calling, structured output, vision, and long context. Control is whether you can pin a model version, switch providers, fine-tune, or run offline.
usesMulti-Model Routing Structured Output
Score on cost and team needs
Cost is the per-call cost at your expected volume plus the running cost, such as GPUs, security review, and monitoring. Team is whether you have the people to run this option. A self-hosted model with no on-call rotation is a future incident.
Weight, decide, and document
Weight each factor by how much it matters for your task. Add up the weighted scores. Pick the highest. Then write down why. Name the factors the losers failed on and the condition that would flip the choice. This makes the next review cheap.

Framework-specific instructions

Pick a framework and generate a framework-targeted rewrite of this methodology's steps.

Choose framework

AI-generated for Agent Development Kit (ADK) (Google) — verify against official docs.

Principles

Seven factors, not vibes. Every factor has a verdict and a weight written down.
Hybrid is a first-class option. Use a vendor for some calls and self-host for others.
Team capability is a factor. The model you cannot operate at 3am is the wrong model.
The choice can be checked later. Name the factors the loser failed on, and the next review is cheap.

Known failure modes (3)

Related patterns (4)

Related compositions (1)

recipe · abstract shape
Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.

Related methodologies (2)

Sources (2)

Provenance

Added to catalog: 2026-05-24
Last updated: 2026-05-27
Verification status: verified

Methodology process overview

Steps (6)

List candidate options

Score each option on data privacy

Score on data lineage

Score on performance, functionality, and control

Score on cost and team needs

Weight, decide, and document