LLM Fine-Tuning: LoRA, QLoRA, Instruction Tuning, and When Not to Fine-Tune
Fine-tuning a large language model means taking a pre-trained model and continuing training on your specific data to adapt its behavior. Done correctly, it produces a model that follows your format conventions, uses your terminology, and stays within your domain. Done incorrectly, it produces a model that hallucinates confidently in your brand voice.
The first question is always: should you fine-tune at all?
Fine-Tuning vs RAG: Choose Carefully
| Use Case | Fine-Tuning | RAG |
|---|---|---|
| Teach the model facts (product details, docs) | ❌ Slow, expensive, facts become stale | ✅ Update knowledge base, always current |
| Teach the model a writing style or format | ✅ Works well | ❌ Prompting alone often insufficient |
| Reduce hallucination on specific domain | ❌ Can increase confident hallucination | ✅ Grounds answers in retrieved context |
| Reduce token cost per query | ✅ Shorter system prompts needed | ❌ RAG adds retrieval overhead |
| Keep data out of third-party models | ✅ Run locally | ✅ Local retrieval too |
The 2026 default: Start with RAG. Fine-tune only when you have clear evidence that prompt engineering + RAG can't solve the problem.
Fine-tune when you need:
- Consistent output format that prompting can't reliably produce
- Domain-specific vocabulary and reasoning not in the base model's training
- Significant inference cost reduction via smaller specialized model
- Specific tone/persona that persists across all outputs
Parameter-Efficient Fine-Tuning: LoRA and QLoRA
Full fine-tuning updates all model weights, which is impractical for most teams: even a 7B model needs multiple A100-class GPUs to hold weights, gradients, and optimizer states, and a single run can cost thousands of dollars.
LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable matrices to specific layers:
```
Original weight matrix W (frozen):  7B params × 4 bytes (fp32)    ≈ 28 GB VRAM
LoRA adapters (trainable):          ~0.1% of params × 4 bytes     ≈ 28 MB

During training:  update only the adapter weights
During inference: merge adapters back into W (zero added latency)
```
QLoRA combines LoRA with 4-bit quantization — reduces the base model from 28GB to ~4GB VRAM, making 7B+ model fine-tuning possible on a single consumer GPU.
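Those numbers follow from a simple rule: weight memory is parameter count times bytes per parameter. A quick sketch (ignoring activations, KV cache, and optimizer state, which add real overhead on top):

```python
def weight_vram_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return n_params * bytes_per_param / 1e9

params_7b = 7e9
print(weight_vram_gb(params_7b, 4.0))  # fp32: 28.0 GB
print(weight_vram_gb(params_7b, 2.0))  # bf16: 14.0 GB
print(weight_vram_gb(params_7b, 0.5))  # 4-bit NF4: 3.5 GB
```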
```python
# train.py — QLoRA fine-tuning with Hugging Face + PEFT
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-model"

# 4-bit quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # Normal Float 4 — best quality
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # Required for gradient checkpointing

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # Rank — higher = more capacity, more VRAM
    lora_alpha=32,    # Scaling factor (typically 2× rank)
    target_modules=[  # Which layers to add LoRA to
        "q_proj", "k_proj",      # Attention query/key projections
        "v_proj", "o_proj",      # Attention value/output projections
        "gate_proj", "up_proj",  # MLP layers
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params: 41,943,040 || All params: 8,072,257,536 || Trainable%: 0.52
```
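That trainable count can be verified by hand: a LoRA adapter on a weight of shape (d_out, d_in) adds r × (d_in + d_out) parameters. Using Llama-3.1-8B's published dimensions (hidden size 4096, MLP width 14336, 1024-dim KV projections from grouped-query attention, 32 layers):

```python
r = 16
# (d_in, d_out) per target module, Llama-3.1-8B dimensions
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # GQA: 8 KV heads × 128-dim heads
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
num_layers = 32
print(per_layer * num_layers)  # 41943040, matching the printout above
```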
Dataset Preparation
The quality of your fine-tuning dataset matters more than any hyperparameter. A model fine-tuned on 500 high-quality examples outperforms one trained on 5,000 noisy ones.
Instruction tuning format (the standard for chat/instruction-following fine-tuning):
```python
# dataset.py — prepare instruction dataset
from datasets import Dataset

# Each example: instruction + optional input + expected output
examples = [
    {
        "instruction": "Extract the key financial metrics from this earnings report excerpt.",
        "input": "Revenue for Q3 2026 was $4.2B, up 18% YoY. Operating margin improved to 24.3% from 21.1% in Q3 2025. Net income was $892M...",
        "output": "Revenue: $4.2B (+18% YoY)\nOperating Margin: 24.3% (vs 21.1% prior year)\nNet Income: $892M",
    },
    # ... 500-5000 more examples
]

def format_instruction(example):
    """Format as Llama 3 chat template"""
    if example.get("input"):
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    else:
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    return {"text": prompt}

dataset = Dataset.from_list(examples)
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1, seed=42)
```
Dataset quality checklist:
- Every example is something a real user would ask
- Outputs are exactly what you want the model to produce (correct format, tone, length)
- No contradictory examples (same input, different output)
- No PII in training data
- Diverse instruction phrasings (don't repeat the same prompts)
- Reviewed by domain expert, not just the engineer who wrote them
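Some of these checks can be automated before any human review. A minimal sketch over the instruction/input/output dicts shown above (`check_dataset` is a hypothetical helper, not a library function):

```python
from collections import defaultdict

def check_dataset(examples: list[dict]) -> list[str]:
    """Return human-readable problems: missing fields and contradictory
    examples (same instruction+input, different outputs)."""
    problems = []
    outputs_by_prompt = defaultdict(set)
    for i, ex in enumerate(examples):
        if not ex.get("instruction") or not ex.get("output"):
            problems.append(f"example {i}: missing instruction or output")
        key = (ex.get("instruction", ""), ex.get("input", ""))
        outputs_by_prompt[key].add(ex.get("output", ""))
    for (instruction, _), outs in outputs_by_prompt.items():
        if len(outs) > 1:
            problems.append(f"contradictory outputs for: {str(instruction)[:60]!r}")
    return problems

sample = [
    {"instruction": "Summarize.", "input": "foo", "output": "bar"},
    {"instruction": "Summarize.", "input": "foo", "output": "baz"},
]
print(check_dataset(sample))  # flags the contradictory pair
```

PII and diversity checks are harder to automate; those still need a human pass.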
Training Configuration
```python
# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 × 4 = 16
    gradient_checkpointing=True,    # Save VRAM by recomputing activations
    warmup_ratio=0.03,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,  # Better than fp16 on modern GPUs
    logging_steps=10,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent transformers releases
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    report_to="wandb",  # Track training with Weights & Biases
    dataloader_num_workers=4,
)

# Note: recent trl releases move dataset_text_field, max_seq_length,
# and packing into SFTConfig; the keyword arguments below match older trl.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # Pack multiple short examples into one sequence (faster training)
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
```
Evaluation
Loss curves during training tell you about optimization, not quality. Evaluate your fine-tuned model with metrics that matter:
```python
# evaluate.py
# format_prompt, extract_response, and has_correct_format are
# project-specific helpers (prompt building, stripping the prompt from
# the generation, and output-format validation).
from evaluate import load
from transformers import pipeline

rouge = load("rouge")
fine_tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, tokenizer=tokenizer)

def evaluate_model(test_examples: list[dict]) -> dict:
    predictions = []
    references = []
    for example in test_examples:
        # Generate from fine-tuned model
        output = fine_tuned_pipe(
            format_prompt(example["instruction"], example.get("input")),
            max_new_tokens=200,
            temperature=0.1,  # Low temperature for eval
            do_sample=True,
        )
        generated = extract_response(output[0]["generated_text"])
        predictions.append(generated)
        references.append(example["output"])

    # ROUGE scores (text overlap)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    # Format accuracy (custom metric: does the output match the expected format?)
    format_match = sum(
        has_correct_format(pred) for pred in predictions
    ) / len(predictions)

    return {
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "format_accuracy": format_match,
        "num_examples": len(test_examples),
    }
```
The irreplaceable eval: Have domain experts review 50–100 randomly sampled model outputs blind (without knowing whether they came from fine-tuned or base model). Human judgment remains the most reliable quality signal.
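One way to assemble that blind review is to shuffle each base/fine-tuned output pair and keep the answer key separate, so reviewers only ever see "A" and "B". A sketch (`blind_pairs` is a hypothetical helper):

```python
import random

def blind_pairs(prompts: list[str], base_outputs: list[str],
                tuned_outputs: list[str], seed: int = 0):
    """Build a review sheet with outputs labeled A/B in random order,
    plus a separate answer key mapping A/B back to base/tuned."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        pair = [("base", base), ("tuned", tuned)]
        rng.shuffle(pair)
        sheet.append({"prompt": prompt, "A": pair[0][1], "B": pair[1][1]})
        key.append({"A": pair[0][0], "B": pair[1][0]})
    return sheet, key
```

Reviewers score the sheet; the key is only consulted afterwards to tally wins per model.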
Serving the Fine-Tuned Model
```python
# serve.py — FastAPI serving for fine-tuned model
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

app = FastAPI()

# Load base + LoRA adapters (smaller than full fine-tuned weights)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
model = model.merge_and_unload()  # Merge LoRA weights for faster inference
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

class GenerateRequest(BaseModel):
    instruction: str
    input: str | None = None
    max_tokens: int = 500
    temperature: float = 0.3

@app.post("/generate")
async def generate(request: GenerateRequest):
    # format_prompt: same project-specific prompt builder used during training
    prompt = format_prompt(request.instruction, request.input)
    output = pipe(
        prompt,
        max_new_tokens=request.max_tokens,
        temperature=request.temperature,
        do_sample=True,
        return_full_text=False,
    )
    return {"response": output[0]["generated_text"].strip()}
```
Hardware and Cost Reference
| Setup | GPU | VRAM | Cost/Hour | 7B Model Fine-Tune |
|---|---|---|---|---|
| Google Colab Pro | T4 | 16GB | ~$0.50 | QLoRA only; 6–12h |
| RunPod A10G | A10G | 24GB | $0.69 | QLoRA comfortable; 4–8h |
| RunPod A100 40G | A100 | 40GB | $1.89 | Full LoRA; 2–4h |
| Lambda Labs A100 80G | A100 | 80GB | $2.49 | Full fine-tune; 2–3h |
| AWS p4d.24xlarge | 8× A100 | 320GB | $32.77 | 70B+ models |
Typical cost for 7B model QLoRA fine-tune (1,000 examples, 3 epochs):
- RunPod A10G: ~$5–15 total
- Google Colab Pro: ~$3–10
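Those totals are straightforward arithmetic on the table above: hourly rate times wall-clock hours for one run, with the quoted ranges leaving headroom for failed runs and hyperparameter retries:

```python
def run_cost(rate_per_hour: float, hours: float) -> float:
    """Single-run GPU cost in dollars."""
    return round(rate_per_hour * hours, 2)

# RunPod A10G at $0.69/h over the table's 4-8 hour QLoRA window:
print(run_cost(0.69, 4))  # 2.76
print(run_cost(0.69, 8))  # 5.52
```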
Working With Viprasol
We build production ML systems that include fine-tuned models where appropriate — from dataset curation and training through evaluation and serving infrastructure. We also help teams determine whether fine-tuning is the right approach or whether RAG will get them there faster and cheaper.
→ Talk to our AI/ML team about LLM customization for your product.
See Also
- Machine Learning Model Deployment — serving fine-tuned models in production
- MLOps and Machine Learning Pipeline — training infrastructure and versioning
- AI Prompt Engineering — often the right first step before fine-tuning
- Vector Database Guide — RAG as the alternative to fine-tuning
- AI and Machine Learning Services — LLM engineering and deployment
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.