What is fine-tuning?
Fine-tuning is the process of adapting a pretrained model by updating its weights on a smaller, task-specific dataset so it produces better outputs for a particular job. The pretrained model already understands language, vision, or code, but fine-tuning refines that knowledge so it’s better matched to your domain.
Fine-tuning is distinct from training from scratch. Training from scratch means initializing random weights and feeding a model enormous amounts of data, often hundreds of billions or even trillions of tokens, hundreds of thousands of GPU hours or more, and millions of dollars. Fine-tuning skips all of that. Instead, you inherit the base model's knowledge and refine it.
Fine-tuning is part of a broader principle called transfer learning: using knowledge gained on one task to improve performance on another. When someone says "we fine-tuned the model," they mean they took a foundation model and ran additional training on their specific examples until the model's behavior shifted toward what they wanted to achieve.
How does fine-tuning work?
The fine-tuning process typically follows a standard sequence:
- Pick a base model: Start with a pretrained model suited to your task type, such as a large language model (LLM) for text, a vision model for images, or a code model for software generation. The choice impacts everything downstream.
- Prepare your dataset: You’ll need labeled examples: input-output pairs that show the model what good output looks like. For a legal summarization task, that means hundreds of contract sections paired with expert-written summaries. For a medical Q&A system, it means clinician-reviewed question-and-answer pairs. These examples need to be correct, consistent, and representative of the edge cases you care about.
- Configure training: Set your hyperparameters, such as learning rate, batch size, number of epochs (complete passes through the training dataset), and which layers to update. These choices will determine whether you end up with a useful model or one that simply memorizes your training data but fails when it encounters anything new.
- Train the model: The model's weights are updated based on your examples. This is the stage most people envisage when they hear "fine-tuning," but it's typically the smallest portion of the overall timeline.
- Evaluate the results: Compare the fine-tuned model against your acceptance criteria using a holdout dataset that the model didn’t see during training. If you don't have clear evaluation rubrics, you won't know whether fine-tuning worked. This is a problem that's more common than most teams admit.
- Deploy and monitor: Fine-tuned models can degrade as your domain shifts, so you’ll need a plan for versioning and retraining.
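As a toy illustration of the sequence above, the sketch below "fine-tunes" a two-parameter linear model instead of a neural network. The starting weights, the task data, and the names `fine_tune` and `mse` are all hypothetical; the point is only to show training continuing from existing weights on a small dataset and being scored on a holdout set the model never saw.

```python
import random

def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, train, lr=0.1, epochs=200):
    """Continue gradient descent from existing weights (w, b).

    lr and epochs are the hyperparameters; each epoch is one full pass
    over the training set.
    """
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Prepare the dataset: 50 task-specific examples (assumed task: y = 2.5x + 1).
rng = random.Random(0)
examples = [(x, 2.5 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(50))]
rng.shuffle(examples)
train, holdout = examples[:40], examples[40:]  # holdout is never trained on

# Pick a base model: weights assumed to come from earlier "pretraining".
pretrained_w, pretrained_b = 2.0, 0.0

# Train, then evaluate before and after on the holdout set.
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, train)
loss_before = mse(pretrained_w, pretrained_b, holdout)
loss_after = mse(tuned_w, tuned_b, holdout)
print(f"holdout loss before: {loss_before:.4f}, after: {loss_after:.6f}")
```

A real project swaps the linear model for a pretrained network and a training library, but the shape of the loop is the same: start from existing weights, update on task data, and measure on data held out from training.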
The part that surprises most teams is that preparing the dataset and evaluating the results (steps 2 and 5 in the sequence above) consume most of the timeline. Building a fine-tuned model that works well depends heavily on data quality, which is determined by the expertise of the people writing and reviewing the examples. This human-intensive work often goes unrecognized in technical discussions, but it's where quality is determined. Mercor connects domain experts with AI teams who need this expertise at scale. To see what rigorous evaluation looks like in practice, it’s worth exploring how frontier models are evaluated on real domain tasks.
Fine-tuning vs. prompt engineering vs. RAG
These three approaches are often confused, and picking the wrong one can be an expensive mistake. Here's how to distinguish between them:
Prompt engineering changes the instructions you give the model at inference time. You're not touching the model itself, but you're writing better prompts, providing examples in context, or structuring your requests differently. The cost is almost zero, and the iteration is fast. The limit is the model's context window and its existing capabilities. If the model doesn't know how to do what you need, no prompt will fix it.
Retrieval-augmented generation (RAG) changes what information the model can access when it generates a response. It involves building a retrieval layer, typically a vector database, that fetches relevant documents and injects them into the prompt at inference time. The model's weights don't change. RAG is often the right call when the problem is missing context: the model is capable but doesn't have access to your internal documents, policies, or recent data. Fine-tuning and RAG solve different problems and are often used together.
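To make the retrieval step concrete, here is a minimal sketch that uses simple word overlap as a stand-in for the vector database mentioned above. The documents, the function names, and the prompt wording are all hypothetical; a production system would typically use embeddings and a real retrieval store.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, doc):
    """Count how many query tokens also appear in the document."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query, docs, k=2):
    """Return the k documents that share the most tokens with the query."""
    return sorted(docs, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]

def build_prompt(query, docs, k=2):
    """Inject the retrieved documents into the prompt; no weights change."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical internal documents the base model has never seen.
docs = [
    "Employees receive 25 vacation days per year.",
    "Expense reports are due by the fifth of each month.",
    "The office is closed on public holidays.",
]
question = "How many vacation days do employees get?"
print(build_prompt(question, docs, k=1))
```

Everything happens at inference time: updating the answer means editing the document store, not retraining the model.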
Fine-tuning changes the model's weights. You're altering how the model behaves rather than what it can access or the instructions you give it. This approach is best when you need the model to adopt a specific style, consistently follow a particular output format, or perform a domain-specific task that generic models handle poorly.
Prompt engineering is typically the cheapest and fastest approach, while RAG has moderate infrastructure costs but no training costs. Fine-tuning is the heaviest intervention, with meaningful data, compute, and expertise costs, so it should be the last option you reach for, not the first. Start with prompting. If that's not enough, try RAG. If the model still can't do what you need, fine-tune.
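The escalation order can be written down as a small decision helper. This is only a sketch of the rule of thumb in this section; the function name and its three yes/no inputs are hypothetical simplifications of what is, in practice, a judgment call.

```python
def choose_approach(prompting_is_enough, missing_context, criteria_are_stable):
    """Return the cheapest intervention that addresses the problem.

    prompting_is_enough: better prompts already meet your acceptance criteria.
    missing_context: failures come from the model lacking access to your data.
    criteria_are_stable: what counts as a good output is well defined and
        does not change frequently.
    """
    if prompting_is_enough:
        return "prompt engineering"
    if missing_context:
        return "RAG"
    if not criteria_are_stable:
        # Fine-tuning would bake a moving target into the weights.
        return "keep iterating on prompting and RAG"
    return "fine-tuning"

print(choose_approach(False, True, True))   # RAG
print(choose_approach(False, False, True))  # fine-tuning
```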
One risk is unique to fine-tuning: because it changes the weights, you can make the model itself worse. Overfitting, catastrophic forgetting (where fine-tuning erases useful general knowledge), and bias amplification from skewed training data are all real possibilities. Prompting and RAG leave the weights untouched, so a bad prompt or a bad retrieval layer can be rolled back without lasting damage to the model.
What does fine-tuning actually require?
Most content on this topic focuses on compute requirements and frameworks, but that's the easy part. The real constraint in LLM fine-tuning is the quality of your training data. Producing professional-quality data for domain-specific tasks requires genuine subject-matter expertise.
In practice, quality data means that if, for instance, you're fine-tuning a model for clinical note summarization, your training examples need to be written or validated by clinicians who understand what a good summary includes, what's clinically dangerous to omit, and how the output will be used downstream. A data labeling team without that domain knowledge will produce examples that look right but fail in ways that matter.
This is essentially what data labeling is: creating structured human judgments that tell the model what correct looks like for your specific task. Fine-tuning training data is a form of labeled data, and the same quality principle governs both. Domain expertise determines whether the labels are reliable, and reliable labels are what the fine-tuned model learns from.
According to Hugging Face's documentation on fine-tuning LLMs, a relatively small number of high-quality examples can be enough for narrow tasks using parameter-efficient methods, while broader behavioral changes may require a larger dataset. The range is wide because it depends on the base model, the technique, and the task. But across all of those variables, one thing holds true: quality dominates quantity. A hundred examples written by a domain expert will outperform a thousand written by someone who’s guessing.
Your evaluation rubrics matter just as much. If you can't define what "good" looks like for your task in specific, measurable terms, fine-tuning becomes a gamble. Two of the more subtle ways that fine-tuning projects fail are evaluation leakage, where your test set overlaps with your training data and inflates your scores, and rubrics that don't capture what actually matters.
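A basic guard against train/test overlap is to check for duplicates before you evaluate. The sketch below assumes examples are plain strings and treats inputs as identical after lowercasing and whitespace cleanup; the function names are hypothetical, and near-duplicates or paraphrases would need a fuzzier comparison.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def find_leaks(train_inputs, holdout_inputs):
    """Return the holdout inputs that also appear in the training set."""
    seen = {normalize(t) for t in train_inputs}
    return [h for h in holdout_inputs if normalize(h) in seen]

# Hypothetical examples: the first holdout item is a disguised duplicate.
train_inputs = ["Summarize clause 4.2 of the lease.", "Summarize the indemnity section."]
holdout_inputs = ["summarize  clause 4.2 of the lease.", "Summarize the termination clause."]
leaks = find_leaks(train_inputs, holdout_inputs)
print(leaks)
```

Any item this check returns should be removed from the holdout set (or from training) before scores are reported.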
The human expertise required to produce professional-quality fine-tuning data is exactly what Mercor sources and deploys. Teams building AI can source the expert talent their fine-tuning projects require. You can learn more about what AI trainers do and how they contribute to model training to better understand how high-quality data is created in practice.
Common fine-tuning techniques
You don't need to master these techniques to make good decisions, but it’s important to recognize the differences because they vary in terms of cost, risk, and the kind of expertise you'll need.
Supervised fine-tuning (SFT): teaching the model your preferred outputs with labeled examples
SFT is what most people mean when they refer to "fine-tuning" without qualification. You provide labeled input-output pairs: a prompt and the desired response. The model then learns to imitate the demonstrated behavior across those examples.
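SFT data is often stored as one JSON object per line. The sketch below assumes the field names `prompt` and `response`, which are a convention chosen for illustration and not a requirement of any particular framework, and it rejects records where either side of the pair is empty.

```python
import json

def load_sft_examples(jsonl_text):
    """Parse JSON Lines into prompt/response pairs, rejecting incomplete ones."""
    examples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        prompt = record.get("prompt", "").strip()
        response = record.get("response", "").strip()
        if not prompt or not response:
            raise ValueError(f"line {line_no}: prompt and response must both be non-empty")
        examples.append({"prompt": prompt, "response": response})
    return examples

# Hypothetical training data for a legal summarization task.
raw = "\n".join([
    json.dumps({"prompt": "Summarize: The tenant shall pay rent monthly.",
                "response": "Rent is due every month."}),
    json.dumps({"prompt": "Summarize: Either party may terminate with 30 days notice.",
                "response": "Both sides can end the agreement with 30 days notice."}),
])
examples = load_sft_examples(raw)
print(len(examples), "valid examples")
```

Structural checks like this catch empty or malformed pairs; whether the responses are actually correct still has to be judged by someone who knows the domain.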
SFT is also the first stage inside a reinforcement learning from human feedback (RLHF) pipeline. This matters because if someone mentions both, they're describing a two-stage process where SFT comes first. The quality of your SFT data sets a ceiling for everything that follows.
Full fine-tuning: updating all weights
Full fine-tuning updates every parameter in the model. For a 70-billion-parameter model, that means adjusting all 70 billion weights based on your training data. It's the most thorough approach and the most compute-intensive, typically requiring multiple high-end GPUs and significant training time.
Teams typically choose full fine-tuning over parameter-efficient alternatives when the task requires deep behavioral changes across the model's capabilities, not just surface-level adjustments. The tradeoff is that full fine-tuning on a small dataset is the fastest path to overfitting, where the model memorizes your examples instead of learning from them. To justify this approach, you need sufficient data and strong regularization.
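A rough calculation shows why full fine-tuning needs multiple GPUs. The sketch assumes mixed-precision training with the Adam optimizer, a common setup that stores about 16 bytes per parameter, and it ignores activation memory entirely. The 80 GB per-GPU figure is also an assumption, so treat the result as a lower bound, not a sizing guide.

```python
import math

# Assumed per-parameter storage for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}

def full_finetune_memory_gb(num_params):
    """Memory for weights, gradients, and optimizer state (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

memory_gb = full_finetune_memory_gb(70_000_000_000)
min_gpus = math.ceil(memory_gb / 80)  # assuming 80 GB of memory per GPU
print(f"{memory_gb:.0f} GB of training state, at least {min_gpus} GPUs")
```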
LoRA: parameter-efficient fine-tuning that dominates most real deployments
Low-rank adaptation (LoRA) trains a small set of additional parameters while keeping the original weights frozen. Instead of updating 70 billion weights, you might update a few million to a few tens of millions, inserted as low-rank matrices alongside the original layers.
This approach leads to dramatically lower compute costs, faster training, and the ability to swap fine-tuned adapters in and out without redeploying the full model. LoRA has become the default method for most fine-tuning work in practice because it strikes a strong balance between cost and performance. According to the original LoRA research published by Hu et al. (2021), LoRA can match or approach full fine-tuning performance on many tasks while training a fraction of the parameters.
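The idea fits in a few lines of plain Python. In this sketch, a frozen weight matrix `W` is left untouched and a low-rank update `B @ A` is added alongside it. The matrix sizes, the rank of 8, and the helper names are illustrative assumptions; in the LoRA paper the scaling factor is alpha divided by the rank, and `B` starts at zero so the adapted model initially behaves exactly like the base model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added for one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights (toy 2 x 2)
A = [[0.5, -0.5]]              # rank-1 adapter, shape 1 x 2
B_zero = [[0.0], [0.0]]        # zero-initialized, shape 2 x 1
x = [1.0, 2.0]

# With B at zero the adapter contributes nothing, so the output is just W x.
print(lora_forward(W, A, B_zero, x))

# Parameter savings for one hypothetical 8192 x 8192 layer at rank 8.
full = 8192 * 8192
added = lora_param_count(8192, 8192, 8)
print(f"{added:,} trainable vs {full:,} frozen ({added / full:.2%})")
```

Because the base weights never change, a trained pair of `A` and `B` matrices can be stored and swapped independently of the model, which is what makes adapters cheap to deploy.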
QLoRA: LoRA plus quantization for lower-cost training on smaller hardware
Quantized LoRA (QLoRA) applies LoRA to a quantized (compressed) base model, reducing memory requirements enough to fine-tune large models on a single GPU and smaller ones on consumer-grade hardware. Dettmers et al. (2023) used QLoRA to demonstrate the fine-tuning of a 65-billion-parameter model on a single 48GB GPU.
This is significant for individual practitioners, small teams, and anyone who doesn't have access to a cluster of high-spec GPUs. If you're fine-tuning LLMs on your own hardware, QLoRA is probably your best option.
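The arithmetic behind the 65-billion-parameter result is simple. The sketch below counts only the memory needed to hold the base weights at a given precision; it deliberately ignores quantization metadata, the LoRA adapters and their optimizer state, and activations, so the real footprint is higher than these figures.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to store the weights alone (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 65_000_000_000
fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params, 4)   # 4-bit quantized weights

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
print("4-bit weights fit in 48 GB:", int4_gb < 48)
```

At 16 bits the weights alone exceed a 48GB card several times over; at 4 bits they leave room for the adapters and training overhead.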
RLHF: aligning behavior via ranked outputs and expert evaluation
RLHF is a fine-tuning approach where human evaluators rank multiple model outputs for the same input. Those rankings are typically used to train a reward model, and the language model is then optimized with reinforcement learning to produce responses the reward model scores highly. It's what makes the difference between a model that can generate text and one that generates text people actually find helpful, safe, and accurate. You can learn more about what RLHF is and how it works to understand the full pipeline and where it fits in model development.
RLHF builds on SFT. First, you fine-tune the model with supervised examples, then you refine its behavior with human preference rankings. This is where human judgment becomes a bottleneck: the rankings need to come from people who can tell the difference between a good output and a plausible-sounding wrong one. For medical, legal, or engineering tasks, that means physicians, attorneys, and engineers doing the evaluation.
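To show how rankings become a training signal, the sketch below turns one evaluator's ranking into preferred/rejected pairs and scores a pair with a pairwise logistic loss, a common formulation for training reward models. The response labels and reward values are made up, and a real pipeline would compute rewards with a neural network, not hand-written numbers.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_preferred - reward_rejected))."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))

# One expert ranked three candidate answers, best first (hypothetical labels).
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])
print(pairs)

# The loss is small when the reward model agrees with the human ranking
# and large when it disagrees.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
print(f"agrees: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

If the expert ranking is wrong, every pair derived from it teaches the reward model the wrong preference, which is why evaluator expertise matters so much at this stage.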
You can also look at how AI models are benchmarked on real software engineering tasks at Mercor for an example of an RLHF-style evaluation in practice.
When fine-tuning makes sense and when it doesn't
| Fine-tune when: | Don't fine-tune when: |
|---|---|
| Your task has stable, well-defined acceptance criteria. You know what good looks like and can measure it. | Your problem is missing context. If the model gives wrong answers because it doesn't have access to your data, RAG is the solution. Fine-tuning can't inject knowledge that the model has never encountered. |
| You need a consistent output format or style across thousands of generations. Prompting alone can't maintain that level of consistency. | Your policies or requirements change frequently. Fine-tuning embeds behavior into the weights. That means if the ground truth shifts every quarter, you'll be retraining constantly. |
| Your domain is specialized enough that the base model produces noticeably wrong outputs, and those errors follow patterns you can correct with examples. | Your ground truth is uncertain or contested. If your own experts can't agree on what the right answer is, your fine-tuning data will encode that disagreement, and the model will reproduce it confidently. |
| You have (or can build) a training set of at least a few hundred high-quality examples for narrow tasks or a few thousand for broader behavioral changes. | You haven't tried prompting and RAG first. Fine-tuning is the most expensive intervention, so it makes sense to try the cheaper options first. |
Risks to keep in mind include overfitting (mitigate with holdout sets and early stopping), catastrophic forgetting (address by mixing general data into your fine-tuning set), bias amplification (minimize with data audits and diverse evaluation), and evaluation leakage (ensure strict train/test separation and human review). However, none of these risks are reasons to avoid fine-tuning; they're reasons to do it carefully.
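Early stopping, one of the overfitting mitigations listed above, can be sketched as follows. The function assumes you record one holdout loss per epoch and stops once the loss has failed to improve for `patience` consecutive epochs; the name and the example losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose checkpoint should be kept.

    Scanning stops once the holdout loss has failed to improve for
    `patience` consecutive epochs, so later epochs are never considered.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Holdout loss improves for three epochs, then rises as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]
print(early_stopping_epoch(val_losses))  # 2
```

In this example training halts after epoch 4 and the checkpoint from epoch 2 is kept; the final value of 0.50 is never reached, which is the tradeoff a small `patience` makes.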
Fine-tuning decisions often stall because teams lack the domain expertise to build quality training data in-house. Mercor helps bridge that gap for teams who need it. You can also explore platforms for AI model training and human-in-the-loop feedback to gain a broader understanding of the ecosystem.
Fine-tuning only works as well as the people and rubrics behind it
Every technique mentioned here, be it SFT, LoRA, or RLHF, depends on the same thing: the quality of the human judgment behind the training data and evaluation. Compute is a commodity. Good data built by people who understand the domain is not.
If you're a domain expert in fields such as medicine, law, engineering, finance, or research, AI teams building fine-tuned models need your judgment, so consider applying to contribute to AI training projects on Mercor. If you're building AI models and need this kind of expertise, Mercor can help you source the expert talent your fine-tuning projects require.
Frequently Asked Questions
What is the difference between fine-tuning and training from scratch?
Training from scratch involves initializing a model with random weights and training it on a massive dataset, often consisting of billions of examples, over weeks or months of compute time. Fine-tuning starts with a model that already has general knowledge and adjusts it based on a smaller, task-specific dataset. The difference in cost and time is orders of magnitude: fine-tuning a model might take hours or days on a single GPU, while training from scratch can cost millions of dollars in compute alone.
What is fine-tuning vs. RAG?
Fine-tuning changes the model's internal weights so it behaves differently. In contrast, RAG alters what information the model can access at inference time by retrieving relevant documents and injecting them into the prompt. If your problem is that the model doesn't know about your data, use RAG. If the model knows enough but doesn't produce the right kind of output, fine-tune.
When should you fine-tune a model?
Fine-tune when you have a stable task with clear acceptance criteria, need a consistent output format or domain-specific behavior, and have tried prompting and RAG first. Don't fine-tune when the core problem is missing context, when your requirements change frequently, or when your team can't agree on what a correct output looks like.
How much data do you need to fine-tune a model?
It varies significantly by task, base model, and method. For narrow tasks using parameter-efficient methods such as LoRA, a few hundred high-quality examples can be enough, but broader behavioral changes may need thousands. Typically, quality matters far more than quantity. A hundred expert-validated examples will outperform a thousand low-quality ones.
How does fine-tuning work?
Fine-tuning starts with a pretrained model. You then prepare a dataset of labeled examples that demonstrate the behavior you want, configure your training parameters, and run additional training passes that update the model's weights. After training, you evaluate the fine-tuned model against a holdout set to check performance, then deploy and monitor it over time.
What are the different types of fine-tuning?
The main variants are SFT, which trains on labeled input-output pairs; full fine-tuning, which updates all model parameters; LoRA and QLoRA, which are parameter-efficient methods that update a small fraction of weights; and RLHF, which refines behavior using human preference rankings.
Is fine-tuning the same as transfer learning?
Fine-tuning and transfer learning aren’t exactly the same. Transfer learning is the broader principle of applying knowledge learned on one task to a different task. Fine-tuning is the most common practical technique for doing transfer learning in modern AI, where you take a pretrained model and adapt it to your specific use case. All fine-tuning is transfer learning, but transfer learning also includes other approaches, such as feature extraction, where you freeze the model's weights and only train a new output layer on top.
