Security review for fine-tuned models — what changes from base model assessment

Fine-tuned models inherit the base model's risk profile and add their own. Training data provenance, alignment drift, and capability overhang are the three areas a security review must address that base model assessments typically skip.

Drel · 12 min read

Fine-tuning is the easiest way to move a foundation model in a direction it doesn't already go on its own. It is also the easiest way to invalidate a security assessment that was performed on the base model and then never repeated. Most organisations deploying fine-tuned models treat the underlying foundation provider's safety work as if it carries through. It carries through partially. The rest needs its own review.

This piece walks through the three areas a fine-tuned model assessment must cover that base-model assessment does not: training-data provenance, alignment drift, and capability overhang. Each is a security concern. Each leaves traces in the disposition. Each has its own evidence requirement.

Why fine-tuned models need their own review

A foundation model provider's safety work covers the base model's behaviour under their training data, their alignment process, and their evaluation set. When you fine-tune that base model on your data, you produce a different model. The new model's behaviour is a function of two inputs you control (your training data, your tuning process) and many you don't (the base model's training data, its architecture, its prior alignment).

The result is a model whose safety properties emerge from both layers. The provider's assurances about the base model still hold for the capabilities they tested. But the fine-tuning has added a new layer, and the new layer has its own risks, its own evidence requirements, and its own lifecycle.

The shorthand: the base model passes through; the fine-tune is yours. Anything the fine-tune introduces (capabilities, data dependencies, behavioural shifts) is your responsibility to characterise. Anything the base model brings (inherited capabilities, residual training-data exposure, prior safety training) is the provider's responsibility but your operational risk. The review must cover both layers, but with different evidence sources.

What stays from the base model

A short list of things you do not need to re-test on the fine-tune (provided the fine-tune did not specifically target them): the base model's safety training on broad harmful prompts; the base model's refusal patterns for category-level harmful content; the architectural properties (context length, attention pattern, tokenisation); the provider's training-data exclusions claimed in their documentation.

The evidence for these properties is the provider's public documentation: model card, system card, training-data statement, alignment report. For a Drel assessment of a fine-tuned model, these documents are attached to the disposition by reference as evidence for the base-model layer.

The caveat: provider documentation is not exhaustive. Providers test the base model on benchmark sets that cover the most common harm categories. They do not test against every adversarial pattern. The inherited safety properties hold under benign use; they may not hold under adversarial use, and the fine-tune may have shifted them in ways the provider couldn't anticipate.

What changes with fine-tuning

Three categories. Each is a separate review concern with its own evidence requirement.

Training-data shifts. Your fine-tuning data introduces a new distribution into the model's behaviour. The data may contain examples that encode patterns you didn't intend (style, jargon, factual claims that look confident but are wrong, sensitive content that becomes memorised). The fine-tune now reflects this distribution.

Alignment drift. Fine-tuning processes, particularly those that re-train on instruction-following data or apply RLHF with your own reward model, can shift the model's refusal behaviour. A model that refused a class of prompt pre-tuning may comply post-tuning, even when your fine-tuning set did not explicitly include those prompts. The shift is emergent.

Capability overhang. The base model trained on broad data retains capabilities your fine-tuning did not target. A model fine-tuned to act as a medical Q&A assistant still knows how to write phishing emails. Your tuning didn't take that capability away; the deployment controls (system prompt, output filtering, scope enforcement) are what address it.

Training-data provenance

Every example in your fine-tuning set is now part of the model's behaviour. Each example is also a record under data-protection and intellectual-property regimes. A fine-tuning set with no documented provenance is an unbounded liability — for security, for compliance, and for incident response when something goes wrong and someone asks where the data came from.

A useful provenance record covers, per source:

  • Origin. Where the data came from. Internal source, public corpus, vendor purchase, customer submission, synthetic generation. Each has different evidence requirements.
  • Legal basis. For personal data: GDPR Article 6 basis and documented purpose. For copyrighted content: licence terms, fair-use analysis, or original-creation attestation.
  • Sensitivity classification. Was the data scanned for PII, secrets, restricted-classification content? What was the disposition of matches?
  • Quality controls. What labelling process produced the examples? Was inter-rater agreement measured? Were adversarial examples included intentionally or filtered out?
  • Retention and deletion. Where the data lives, who can access it, what happens when a data subject exercises deletion rights. Note that a fine-tuned model is itself, arguably, a transform of the training data: deletion of a single training record from the source store does not necessarily remove the record's influence on the model.

Treat the provenance record as a working artefact. When a regulator asks “what data was used to train this model”, the answer is the provenance record. When an incident occurs and you need to determine whether a specific class of training input could have caused it, the answer is the provenance record. When the fine-tuning set is updated, the provenance record is updated in lockstep.
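One way to keep the provenance record a working artefact is to hold it as structured data and check it for gaps before each training run. The following is a minimal sketch under stated assumptions: the class, field, and function names are hypothetical, the fields simply mirror the list above, and free-text values stand in for whatever evidence links your organisation actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Origin categories taken from the list above."""
    INTERNAL = "internal"
    PUBLIC_CORPUS = "public_corpus"
    VENDOR_PURCHASE = "vendor_purchase"
    CUSTOMER_SUBMISSION = "customer_submission"
    SYNTHETIC = "synthetic"


# Hypothetical field names, one per remaining item in the list above.
REQUIRED_FIELDS = ("legal_basis", "sensitivity", "quality_controls", "retention")


@dataclass
class ProvenanceRecord:
    source_id: str
    origin: Origin
    legal_basis: str = ""       # e.g. GDPR Article 6 basis, licence terms
    sensitivity: str = ""       # what was scanned for, disposition of matches
    quality_controls: str = ""  # labelling process, agreement measurement
    retention: str = ""         # storage, access, deletion handling

    def missing_fields(self) -> list[str]:
        """Names of the required fields that are still empty."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name).strip()]


def undocumented_sources(records: list[ProvenanceRecord]) -> dict[str, list[str]]:
    """Map each incomplete source to the fields it still lacks."""
    gaps = {}
    for record in records:
        missing = record.missing_fields()
        if missing:
            gaps[record.source_id] = missing
    return gaps
```

A check like this only shows that each field has been filled in; whether the recorded legal basis or scan result is adequate still needs human review.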

Alignment drift

Alignment drift is the subtle one. The base model has been trained to refuse certain categories of request, to hedge certain types of claims, to handle certain prompt patterns conservatively. Fine-tuning can erode each of these behaviours without explicitly training against them.

The mechanism is straightforward when you look at it. The base model's alignment is encoded in weight patterns that were established late in its training. Fine-tuning updates weights. If your fine-tuning data includes instruction-following examples that override refusal patterns — even unintentionally — the alignment shifts. The shift may not be uniform; it may show up only on certain prompt patterns or only when certain context is present.

Detecting drift requires testing. The test set is the canonical adversarial set for the harm categories you care about. For an enterprise deployment, the test set typically covers: refusal to engage with disallowed content categories; refusal to produce harmful outputs even when prompted in indirect ways; refusal to act on instructions embedded in retrieved content or tool responses; refusal to disclose system prompt or internal architecture details.

Test the fine-tune against this set. Compare to the base model's scores. Decide in advance what counts as a significant regression; any drop beyond that threshold is a finding. Document it in the disposition: which refusal categories regressed, by how much, what compensating controls (system prompt strengthening, output filtering, scope limiting) are in place.
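The comparison can be sketched as a per-category refusal-rate diff. This is an illustration, not a prescribed method: it assumes each test prompt has already been judged as refused or not refused (the hard part, which is outside this sketch), the category names are made up, and the 0.05 threshold is an arbitrary placeholder for whatever your review defines as significant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    category: str
    base_rate: float
    tuned_rate: float

    @property
    def drop(self) -> float:
        """How far the refusal rate fell from base model to fine-tune."""
        return self.base_rate - self.tuned_rate


def refusal_rate(outcomes: list[bool]) -> float:
    """Fraction of prompts refused; True means the prompt was refused."""
    if not outcomes:
        raise ValueError("no outcomes recorded for this category")
    return sum(outcomes) / len(outcomes)


def find_regressions(
    base: dict[str, list[bool]],
    tuned: dict[str, list[bool]],
    threshold: float = 0.05,  # assumed value; set per your own review policy
) -> list[Regression]:
    """Return categories where the fine-tune refuses less often than the base."""
    findings = []
    for category, base_outcomes in base.items():
        if category not in tuned:
            raise KeyError(f"fine-tune was not tested on {category!r}")
        result = Regression(
            category,
            refusal_rate(base_outcomes),
            refusal_rate(tuned[category]),
        )
        # Only drops count as findings; a higher refusal rate is not flagged.
        if result.drop > threshold:
            findings.append(result)
    return sorted(findings, key=lambda r: r.drop, reverse=True)
```

Each returned entry carries the category and the size of the drop, which are the two facts the disposition needs alongside the compensating controls.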

Capability overhang

Capability overhang is the term for capabilities the base model retains that your fine-tuning didn't target. A model fine-tuned for legal-research assistance still knows how to write convincing phishing copy. A model fine-tuned for medical triage still knows how to produce financial-fraud schemes. Your fine-tuning didn't train against those capabilities; it trained for a different capability. The original ones remain available.

For most enterprise deployments the overhang is operationally addressed at the deployment layer: a system prompt that scopes the model to the intended task, an output filter that blocks responses outside that scope, a refusal pattern that handles off-task requests. The controls are conventional. What needs to be explicit in the disposition is that they are required because of the overhang — not as belt-and-braces.

A useful disposition for a fine-tuned model includes a short overhang statement: “The base model retains capabilities for [categories outside our intended use]. These are addressed at the deployment layer by [system prompt scope, output filtering, scope-enforcement controls]. Re-assessment is triggered if any of these deployment-layer controls is changed.” This is short, it is explicit, and it closes the gap that base-model marketing materials open by implying the model “does” one thing.

The fine-tuning evidence pack

What goes into a defensible fine-tuned-model evidence pack:

  • Training-data manifest. Provenance record per source as described above. Sample-level inspection results where the source warranted deeper review.
  • Training-run record. Hyperparameters, base-model version, compute environment, training duration, intermediate checkpoints retained. Re-runnability is part of the evidence: if you can't reproduce the run, you can't investigate later.
  • Evaluation results. Performance on the intended-task evaluation set. Drift on canonical safety tests vs the base model. Adversarial test results.
  • Alignment-regression analysis. Specific categories where refusal behaviour shifted, with documented compensating controls.
  • Overhang statement. Capabilities outside the intended scope that the model retains, and the deployment-layer controls that address them.
  • Red-team report. Results from focused adversarial testing on the intended-use prompt patterns plus a sample of out-of-scope attack patterns.

Each of these is an artefact you produce once per fine-tune and update each time the fine-tune is updated. The AI Committee disposition references them; the artefacts themselves are stored in the model registry or equivalent.
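Because every artefact must be refreshed with each fine-tune, a simple completeness check can flag artefacts that are missing or still tied to an earlier version. The sketch below assumes a hypothetical registry layout in which each artefact records a storage location and the fine-tune version it was produced for; the artefact keys are illustrative names for the six items listed above.

```python
from dataclasses import dataclass

# Illustrative keys for the six artefacts in the evidence pack.
REQUIRED_ARTEFACTS = (
    "training_data_manifest",
    "training_run_record",
    "evaluation_results",
    "alignment_regression_analysis",
    "overhang_statement",
    "red_team_report",
)


@dataclass(frozen=True)
class Artefact:
    location: str           # where the artefact is stored (assumed free text)
    fine_tune_version: str  # the fine-tune this artefact was produced for


def pack_gaps(pack: dict[str, Artefact], current_version: str) -> dict[str, str]:
    """Report each required artefact that is missing or from another version."""
    gaps = {}
    for name in REQUIRED_ARTEFACTS:
        artefact = pack.get(name)
        if artefact is None:
            gaps[name] = "missing"
        elif artefact.fine_tune_version != current_version:
            gaps[name] = "stale"
    return gaps
```

An empty result means the pack is complete for the current version; it says nothing about the quality of the artefacts themselves.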

Lifecycle implications

Three triggers fire a re-assessment of a fine-tuned model:

  • Re-fine-tuning on additional data, new data, or updated labels. Every fine-tune is a new model. The previous disposition does not transfer.
  • Base-model change. If the provider updates the base model and you adopt the new version, the fine-tune is now operating on different weights. Behaviour may shift; safety properties may shift. At minimum, re-run the alignment-regression analysis and revisit the overhang statement; re-run the red-team if the change is significant.
  • Deployment-layer change. Changes to the system prompt, output filtering, or scope enforcement that the disposition relies on for overhang management. The model didn't change, but the controls did, and the controls were part of the clearance.

Register each trigger explicitly with a named owner. The model registry should track the disposition reference for each fine-tune version, so that the question “what review is this model operating under” has a single answer.
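The three triggers can be checked mechanically by comparing the state recorded at clearance with the state currently deployed. This is a sketch under assumptions: the `ModelState` fields and trigger labels are hypothetical, and the deployment-layer controls are assumed to be expressible as a JSON-serialisable dictionary so that any change alters a SHA-256 fingerprint.

```python
import hashlib
import json
from dataclasses import dataclass


def controls_fingerprint(controls: dict) -> str:
    """Stable hash of the deployment-layer controls (key order does not matter)."""
    canonical = json.dumps(controls, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelState:
    fine_tune_version: str
    base_model_version: str
    controls_hash: str


def fired_triggers(cleared: ModelState, deployed: ModelState) -> list[str]:
    """List the re-assessment triggers that fire for the deployed state."""
    triggers = []
    if deployed.fine_tune_version != cleared.fine_tune_version:
        triggers.append("re-fine-tuning")
    if deployed.base_model_version != cleared.base_model_version:
        triggers.append("base-model change")
    if deployed.controls_hash != cleared.controls_hash:
        triggers.append("deployment-layer change")
    return triggers
```

Storing the cleared state next to the disposition reference in the registry keeps the answer to "what review is this model operating under" tied to a concrete, comparable record.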

The base model passes through. The fine-tune is yours. Every example you trained on is a record you keep.

When fine-tuning is the wrong choice

A useful security review of a fine-tune is also an opportunity to ask whether the fine-tune is the right approach. For many use cases, retrieval-augmented generation or in-context learning provides similar capability with substantially lower lifecycle burden.

Fine-tuning is the right choice when you need behaviour change rather than knowledge access. Examples: making the model produce a particular output format consistently, enforcing a style or tone the base model doesn't produce reliably with prompting, narrowing the model's vocabulary or domain in ways that prompting can't enforce, encoding domain-specific reasoning patterns.

Fine-tuning is the wrong choice when you need knowledge access. Examples: making the model aware of your company's documents (use retrieval), giving it access to current data (use tools), letting it act on user-specific information (use scoped retrieval with authorisation).

The decision is itself a security-relevant choice. Retrieval-based approaches let you control what the model sees per request; fine-tuning bakes content into the model permanently. Retrieval gives you an audit trail for what was used in any given response; fine-tuning does not. Retrieval lets you update knowledge by updating the corpus; fine-tuning requires re-training.

The security review should record the decision and the rationale. A fine-tune that exists because retrieval was considered and rejected for documented reasons is a different artefact from a fine-tune that exists because nobody asked the question.

A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.