LLM Fine-Tuning: LoRA, QLoRA, Instruction Tuning, and When Not to Fine-Tune
Fine-tuning a large language model means taking a pre-trained model and continuing training on your specific data to adapt its behavior. Done correctly, it produces a model that follows your format conventions, uses your terminology, and stays within your domain. Done incorrectly, it produces a model that hallucinates confidently in your brand voice.
The first question is always: should you fine-tune at all?
Fine-Tuning vs RAG: Choose Carefully
| Use Case | Fine-Tuning | RAG |
|---|---|---|
| Teach the model facts (product details, docs) | ❌ Slow, expensive, facts become stale | ✅ Update knowledge base, always current |
| Teach the model a writing style or format | ✅ Works well | ❌ Prompting alone often insufficient |
| Reduce hallucination on specific domain | ❌ Can increase confident hallucination | ✅ Grounds answers in retrieved context |
| Reduce token cost per query | ✅ Shorter system prompts needed | ❌ RAG adds retrieval overhead |
| Keep data out of third-party models | ✅ Run locally | ✅ Local retrieval too |
The 2026 default: Start with RAG. Fine-tune only when you have clear evidence that prompt engineering + RAG can't solve the problem.
Fine-tune when you need:
- Consistent output format that prompting can't reliably produce
- Domain-specific vocabulary and reasoning not in the base model's training
- Significant inference cost reduction via smaller specialized model
- Specific tone/persona that persists across all outputs
Parameter-Efficient Fine-Tuning: LoRA and QLoRA
Full fine-tuning updates all model weights, which is impractical for most teams: even a 7B model needs multiple A100-class GPUs to hold weights, gradients, and optimizer states, and a single run can cost thousands of dollars.
LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable matrices to specific layers:
```
Original weight matrix W (frozen):  7B params × 4 bytes (fp32)    ≈ 28 GB VRAM
LoRA adapters (trainable):          ~0.1% of params × 4 bytes     ≈ 28 MB

During training:  update only the adapter weights
During inference: merge adapters back into W (zero added latency)
```
QLoRA combines LoRA with 4-bit quantization — reduces the base model from 28GB to ~4GB VRAM, making 7B+ model fine-tuning possible on a single consumer GPU.
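Those numbers follow from a simple rule: weight memory is parameter count times bytes per parameter. A quick sketch (ignoring activations, KV cache, and optimizer state, which add real overhead on top):

```python
def weight_vram_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return n_params * bytes_per_param / 1e9

params_7b = 7e9
print(weight_vram_gb(params_7b, 4.0))  # fp32: 28.0 GB
print(weight_vram_gb(params_7b, 2.0))  # bf16: 14.0 GB
print(weight_vram_gb(params_7b, 0.5))  # 4-bit NF4: 3.5 GB
```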
```python
# train.py — QLoRA fine-tuning with Hugging Face + PEFT
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-model"

# 4-bit quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # Normal Float 4 — best quality
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # Required for gradient checkpointing

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # Rank — higher = more capacity, more VRAM
    lora_alpha=32,    # Scaling factor (typically 2× rank)
    target_modules=[  # Which layers to add LoRA to
        "q_proj", "k_proj",      # Attention query/key projections
        "v_proj", "o_proj",      # Attention value/output projections
        "gate_proj", "up_proj",  # MLP layers
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params: 41,943,040 || All params: 8,072,257,536 || Trainable%: 0.52
```
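That trainable count can be verified by hand: a LoRA adapter on a weight of shape (d_out, d_in) adds r × (d_in + d_out) parameters. Using Llama-3.1-8B's published dimensions (hidden size 4096, MLP width 14336, 1024-dim KV projections from grouped-query attention, 32 layers):

```python
r = 16
# (d_in, d_out) per target module, Llama-3.1-8B dimensions
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # GQA: 8 KV heads × 128-dim heads
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
num_layers = 32
print(per_layer * num_layers)  # 41943040, matching the printout above
```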
Dataset Preparation
The quality of your fine-tuning dataset matters more than any hyperparameter. A model fine-tuned on 500 high-quality examples outperforms one trained on 5,000 noisy ones.
Instruction tuning format (the standard for chat/instruction-following fine-tuning):
```python
# dataset.py — prepare instruction dataset
from datasets import Dataset

# Each example: instruction + optional input + expected output
examples = [
    {
        "instruction": "Extract the key financial metrics from this earnings report excerpt.",
        "input": "Revenue for Q3 2026 was $4.2B, up 18% YoY. Operating margin improved to 24.3% from 21.1% in Q3 2025. Net income was $892M...",
        "output": "Revenue: $4.2B (+18% YoY)\nOperating Margin: 24.3% (vs 21.1% prior year)\nNet Income: $892M",
    },
    # ... 500-5000 more examples
]

def format_instruction(example):
    """Format as Llama 3 chat template"""
    if example.get("input"):
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    else:
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    return {"text": prompt}

dataset = Dataset.from_list(examples)
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1, seed=42)
```
Dataset quality checklist:
- Every example is something a real user would ask
- Outputs are exactly what you want the model to produce (correct format, tone, length)
- No contradictory examples (same input, different output)
- No PII in training data
- Diverse instruction phrasings (don't repeat the same prompts)
- Reviewed by domain expert, not just the engineer who wrote them
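Some of these checks can be automated before any human review. A minimal sketch over the instruction/input/output dicts shown above (`check_dataset` is a hypothetical helper, not a library function):

```python
from collections import defaultdict

def check_dataset(examples: list[dict]) -> list[str]:
    """Return human-readable problems: missing fields and contradictory
    examples (same instruction+input, different outputs)."""
    problems = []
    outputs_by_prompt = defaultdict(set)
    for i, ex in enumerate(examples):
        if not ex.get("instruction") or not ex.get("output"):
            problems.append(f"example {i}: missing instruction or output")
        key = (ex.get("instruction", ""), ex.get("input", ""))
        outputs_by_prompt[key].add(ex.get("output", ""))
    for (instruction, _), outs in outputs_by_prompt.items():
        if len(outs) > 1:
            problems.append(f"contradictory outputs for: {str(instruction)[:60]!r}")
    return problems

sample = [
    {"instruction": "Summarize.", "input": "foo", "output": "bar"},
    {"instruction": "Summarize.", "input": "foo", "output": "baz"},
]
print(check_dataset(sample))  # flags the contradictory pair
```

PII and diversity checks are harder to automate; those still need a human pass.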
Training Configuration
```python
# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 × 4 = 16
    gradient_checkpointing=True,    # Save VRAM by recomputing activations
    warmup_ratio=0.03,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,  # Better than fp16 on modern GPUs
    logging_steps=10,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent transformers releases
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    report_to="wandb",  # Track training with Weights & Biases
    dataloader_num_workers=4,
)

# Note: recent trl releases move dataset_text_field, max_seq_length,
# and packing into SFTConfig; the keyword arguments below match older trl.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # Pack multiple short examples into one sequence (faster training)
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
```
Evaluation
Loss curves during training tell you about optimization, not quality. Evaluate your fine-tuned model with metrics that matter:
```python
# evaluate.py
# format_prompt, extract_response, and has_correct_format are
# project-specific helpers (prompt building, stripping the prompt from
# the generation, and output-format validation).
from evaluate import load
from transformers import pipeline

rouge = load("rouge")
fine_tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, tokenizer=tokenizer)

def evaluate_model(test_examples: list[dict]) -> dict:
    predictions = []
    references = []
    for example in test_examples:
        # Generate from fine-tuned model
        output = fine_tuned_pipe(
            format_prompt(example["instruction"], example.get("input")),
            max_new_tokens=200,
            temperature=0.1,  # Low temperature for eval
            do_sample=True,
        )
        generated = extract_response(output[0]["generated_text"])
        predictions.append(generated)
        references.append(example["output"])

    # ROUGE scores (text overlap)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    # Format accuracy (custom metric: does the output match the expected format?)
    format_match = sum(
        has_correct_format(pred) for pred in predictions
    ) / len(predictions)

    return {
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "format_accuracy": format_match,
        "num_examples": len(test_examples),
    }
```
The irreplaceable eval: Have domain experts review 50–100 randomly sampled model outputs blind (without knowing whether they came from fine-tuned or base model). Human judgment remains the most reliable quality signal.
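One way to assemble that blind review is to shuffle each base/fine-tuned output pair and keep the answer key separate, so reviewers only ever see "A" and "B". A sketch (`blind_pairs` is a hypothetical helper):

```python
import random

def blind_pairs(prompts: list[str], base_outputs: list[str],
                tuned_outputs: list[str], seed: int = 0):
    """Build a review sheet with outputs labeled A/B in random order,
    plus a separate answer key mapping A/B back to base/tuned."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        pair = [("base", base), ("tuned", tuned)]
        rng.shuffle(pair)
        sheet.append({"prompt": prompt, "A": pair[0][1], "B": pair[1][1]})
        key.append({"A": pair[0][0], "B": pair[1][0]})
    return sheet, key
```

Reviewers score the sheet; the key is only consulted afterwards to tally wins per model.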
Serving the Fine-Tuned Model
```python
# serve.py — FastAPI serving for fine-tuned model
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

app = FastAPI()

# Load base + LoRA adapters (smaller than full fine-tuned weights)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
model = model.merge_and_unload()  # Merge LoRA weights for faster inference
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

class GenerateRequest(BaseModel):
    instruction: str
    input: str | None = None
    max_tokens: int = 500
    temperature: float = 0.3

@app.post("/generate")
async def generate(request: GenerateRequest):
    # format_prompt: same project-specific prompt builder used during training
    prompt = format_prompt(request.instruction, request.input)
    output = pipe(
        prompt,
        max_new_tokens=request.max_tokens,
        temperature=request.temperature,
        do_sample=True,
        return_full_text=False,
    )
    return {"response": output[0]["generated_text"].strip()}
```
Hardware and Cost Reference
| Setup | GPU | VRAM | Cost/Hour | 7B Model Fine-Tune |
|---|---|---|---|---|
| Google Colab Pro | T4 | 16GB | ~$0.50 | QLoRA only; 6–12h |
| RunPod A10G | A10G | 24GB | $0.69 | QLoRA comfortable; 4–8h |
| RunPod A100 40G | A100 | 40GB | $1.89 | Full LoRA; 2–4h |
| Lambda Labs A100 80G | A100 | 80GB | $2.49 | Full fine-tune; 2–3h |
| AWS p4d.24xlarge | 8× A100 | 320GB | $32.77 | 70B+ models |
Typical cost for 7B model QLoRA fine-tune (1,000 examples, 3 epochs):
- RunPod A10G: ~$5–15 total
- Google Colab Pro: ~$3–10
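Those totals are straightforward arithmetic on the table above: hourly rate times wall-clock hours for one run, with the quoted ranges leaving headroom for failed runs and hyperparameter retries:

```python
def run_cost(rate_per_hour: float, hours: float) -> float:
    """Single-run GPU cost in dollars."""
    return round(rate_per_hour * hours, 2)

# RunPod A10G at $0.69/h over the table's 4-8 hour QLoRA window:
print(run_cost(0.69, 4))  # 2.76
print(run_cost(0.69, 8))  # 5.52
```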
Working With Viprasol
We build production ML systems that include fine-tuned models where appropriate — from dataset curation and training through evaluation and serving infrastructure. We also help teams determine whether fine-tuning is the right approach or whether RAG will get them there faster and cheaper.
→ Talk to our AI/ML team about LLM customization for your product.
See Also
- Machine Learning Model Deployment — serving fine-tuned models in production
- MLOps and Machine Learning Pipeline — training infrastructure and versioning
- AI Prompt Engineering — often the right first step before fine-tuning
- Vector Database Guide — RAG as the alternative to fine-tuning
- AI and Machine Learning Services — LLM engineering and deployment
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.