
LLM Fine-Tuning: LoRA, QLoRA, Instruction Tuning, and When Not to Fine-Tune

Fine-tune large language models for production — LoRA and QLoRA parameter-efficient tuning, instruction dataset preparation, evaluation with ROUGE and human review.

Viprasol Tech Team
April 30, 2026
14 min read


Fine-tuning a large language model means taking a pre-trained model and continuing training on your specific data to adapt its behavior. Done correctly, it produces a model that follows your format conventions, uses your terminology, and stays within your domain. Done incorrectly, it produces a model that hallucinates confidently in your brand voice.

The first question is always: should you fine-tune at all?


Fine-Tuning vs RAG: Choose Carefully

| Use Case | Fine-Tuning | RAG |
| --- | --- | --- |
| Teach the model facts (product details, docs) | ❌ Slow, expensive, facts become stale | ✅ Update knowledge base, always current |
| Teach the model a writing style or format | ✅ Works well | ❌ Prompting alone often insufficient |
| Reduce hallucination on specific domain | ❌ Can increase confident hallucination | ✅ Grounds answers in retrieved context |
| Reduce token cost per query | ✅ Shorter system prompts needed | ❌ RAG adds retrieval overhead |
| Keep data out of third-party models | ✅ Run locally | ✅ Local retrieval too |

The 2026 default: Start with RAG. Fine-tune only when you have clear evidence that prompt engineering + RAG can't solve the problem.

Fine-tune when you need:

  • Consistent output format that prompting can't reliably produce
  • Domain-specific vocabulary and reasoning not in the base model's training
  • Significant inference cost reduction via smaller specialized model
  • Specific tone/persona that persists across all outputs

Parameter-Efficient Fine-Tuning: LoRA and QLoRA

Full fine-tuning updates all model weights — impractical for most teams (once optimizer states and activations are counted, a 7B full fine-tune needs multiple A100-class GPUs, and costs compound quickly across experimental runs).

LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable matrices to specific layers:

Original weight matrix W (frozen):  7B params × 4 bytes (fp32) ≈ 28GB VRAM
LoRA adapters (trainable):          ~0.1% of params ≈ 7M × 4 bytes ≈ 28MB

During training: update only the adapter weights
During inference: merge adapters back into W (zero added latency)
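
The mechanics are simple enough to sketch directly. A minimal illustration of the idea (the PEFT library's actual implementation differs; LoRALinear is our name):

# lora_sketch.py — the core LoRA idea, illustrative only
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W' = W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze W (and bias)
        # B is zero-initialized so the adapter's update starts at exactly zero
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank trainable path
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

With r=16 on a 4096×4096 projection, the adapter adds 2 × 16 × 4096 ≈ 131K trainable params against 16.8M frozen ones, and merging for inference is just adding scale × B @ A into W.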

QLoRA combines LoRA with 4-bit quantization — reduces the base model from 28GB to ~4GB VRAM, making 7B+ model fine-tuning possible on a single consumer GPU.

# train.py — QLoRA fine-tuning with Hugging Face + PEFT
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-model"

# 4-bit quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",       # Normal Float 4 — best quality
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # Required for gradient checkpointing

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank — higher = more capacity, more VRAM
    lora_alpha=32,                 # Scaling factor (typically 2× rank)
    target_modules=[               # Which layers to add LoRA to
        "q_proj", "k_proj",        # Attention query/key projections
        "v_proj", "o_proj",        # Attention value/output projections
        "gate_proj", "up_proj",    # MLP layers
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params: 41,943,040 || All params: 8,072,257,536 || Trainable%: 0.52


Dataset Preparation

The quality of your fine-tuning dataset matters more than any hyperparameter. A model fine-tuned on 500 high-quality examples outperforms one trained on 5,000 noisy ones.

Instruction tuning format (the standard for chat/instruction-following fine-tuning):

# dataset.py — prepare instruction dataset
from datasets import Dataset

# Each example: instruction + optional input + expected output
examples = [
    {
        "instruction": "Extract the key financial metrics from this earnings report excerpt.",
        "input": "Revenue for Q3 2026 was $4.2B, up 18% YoY. Operating margin improved to 24.3% from 21.1% in Q3 2025. Net income was $892M...",
        "output": "Revenue: $4.2B (+18% YoY)\nOperating Margin: 24.3% (vs 21.1% prior year)\nNet Income: $892M",
    },
    # ... 500-5000 more examples
]

def format_instruction(example):
    """Format as Llama 3 chat template"""
    if example.get("input"):
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    else:
        prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""
    return {"text": prompt}

dataset = Dataset.from_list(examples)
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1, seed=42)

Dataset quality checklist (some of these checks can be automated; see the sketch after the list):

  • Every example is something a real user would ask
  • Outputs are exactly what you want the model to produce (correct format, tone, length)
  • No contradictory examples (same input, different output)
  • No PII in training data
  • Diverse instruction phrasings (don't repeat the same prompts)
  • Reviewed by domain expert, not just the engineer who wrote them
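
A lightweight version of those automated checks, run against the examples list from dataset.py (the duplicate-prompt check is exact-match only, and the email regex is a crude stand-in; real PII screening needs a proper detector):

# validate_dataset.py — lightweight pre-training checks (illustrative, not exhaustive)
import re
from collections import defaultdict

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude PII example, not a real detector

def validate(examples: list[dict]) -> list[str]:
    issues = []
    outputs_by_prompt = defaultdict(set)
    for i, ex in enumerate(examples):
        key = (ex["instruction"].strip(), (ex.get("input") or "").strip())
        # Contradictory examples: same instruction+input, different output
        if outputs_by_prompt[key] and ex["output"] not in outputs_by_prompt[key]:
            issues.append(f"example {i}: duplicate prompt with a different output")
        outputs_by_prompt[key].add(ex["output"])
        # Very rough PII screen across all fields
        for field in ("instruction", "input", "output"):
            if EMAIL_RE.search(ex.get(field) or ""):
                issues.append(f"example {i}: possible email address in '{field}'")
    return issues

for issue in validate(examples):
    print(issue)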

Training Configuration

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size = 4 × 4 = 16
    gradient_checkpointing=True,      # Save VRAM by recomputing activations
    warmup_ratio=0.03,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,                        # Better than fp16 for modern GPUs
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    report_to="wandb",                # Track training with Weights & Biases
    dataloader_num_workers=4,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # Pack multiple short examples into one sequence (faster training)
)

trainer.train()
trainer.save_model(OUTPUT_DIR)


Evaluation

Loss curves during training tell you about optimization, not quality. Evaluate your fine-tuned model with metrics that matter:

# eval_model.py — evaluation (not "evaluate.py": that name would shadow the `evaluate` package)
from evaluate import load
from transformers import AutoTokenizer, pipeline

OUTPUT_DIR = "./fine-tuned-model"

rouge = load("rouge")
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
fine_tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, tokenizer=tokenizer)

def evaluate_model(test_examples: list[dict]) -> dict:
    predictions = []
    references = []

    for example in test_examples:
        # Generate from fine-tuned model
        output = fine_tuned_pipe(
            format_prompt(example["instruction"], example.get("input")),
            max_new_tokens=200,
            temperature=0.1,  # Low temperature for eval
            do_sample=True,
        )
        generated = extract_response(output[0]["generated_text"])
        predictions.append(generated)
        references.append(example["output"])

    # ROUGE scores (text overlap)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    # Format accuracy (custom metric — does output match expected format?)
    format_match = sum(
        has_correct_format(pred) for pred in predictions
    ) / len(predictions)

    return {
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "format_accuracy": format_match,
        "num_examples": len(test_examples),
    }
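
The snippets in this post lean on three small helpers (format_prompt, extract_response, has_correct_format) that are never shown. One possible implementation, assuming the Llama 3 chat template from dataset.py; has_correct_format is task-specific, so the version here just covers the earnings example:

# helpers.py — plausible implementations of the helpers assumed above
def format_prompt(instruction: str, input_text: str | None = None) -> str:
    """Build a Llama 3 user turn, leaving the assistant turn open for generation."""
    body = f"{instruction}\n\n{input_text}" if input_text else instruction
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{body}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def extract_response(generated_text: str) -> str:
    """Keep everything after the last assistant header; drop the end-of-turn token."""
    tail = generated_text.rsplit("<|end_header_id|>", 1)[-1]
    return tail.replace("<|eot_id|>", "").strip()

def has_correct_format(prediction: str) -> bool:
    """Task-specific check: here, the labeled lines from the earnings example."""
    return all(label in prediction for label in ("Revenue:", "Operating Margin:", "Net Income:"))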

The irreplaceable eval: Have domain experts review 50–100 randomly sampled model outputs blind, without knowing whether each came from the fine-tuned or the base model. Human judgment remains the most reliable quality signal.
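
One way to set up that blind review is to interleave outputs from both models with the source hidden from reviewers. A sketch (the JSONL file names and CSV columns are our choices):

# blind_review.py — build a shuffled, source-blind review sheet (illustrative)
import csv
import json
import random

random.seed(0)

rows = []
for source, path in [("fine_tuned", "ft_outputs.jsonl"), ("base", "base_outputs.jsonl")]:
    with open(path) as f:  # hypothetical files: one JSON object per line with prompt/response
        for line in f:
            ex = json.loads(line)
            rows.append({"prompt": ex["prompt"], "response": ex["response"], "source": source})

random.shuffle(rows)
sample = rows[:100]

# Reviewers get review_sheet.csv with no source column; the key stays with the engineer
with open("review_sheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "prompt", "response", "rating"])
    writer.writeheader()
    for i, row in enumerate(sample):
        writer.writerow({"id": i, "prompt": row["prompt"], "response": row["response"], "rating": ""})

with open("review_key.json", "w") as f:
    json.dump({i: row["source"] for i, row in enumerate(sample)}, f)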


Serving the Fine-Tuned Model

# serve.py — FastAPI serving for fine-tuned model
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-model"

app = FastAPI()

# Load base + LoRA adapters (smaller than full fine-tuned weights)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
model = model.merge_and_unload()  # Merge LoRA weights for faster inference
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

class GenerateRequest(BaseModel):
    instruction: str
    input: str | None = None
    max_tokens: int = 500
    temperature: float = 0.3

@app.post("/generate")
async def generate(request: GenerateRequest):
    prompt = format_prompt(request.instruction, request.input)
    output = pipe(
        prompt,
        max_new_tokens=request.max_tokens,
        temperature=request.temperature,
        do_sample=True,
        return_full_text=False,
    )
    return {"response": output[0]["generated_text"].strip()}

Hardware and Cost Reference

| Setup | GPU | VRAM | Cost/Hour | 7B Model Fine-Tune |
| --- | --- | --- | --- | --- |
| Google Colab Pro | T4 | 16GB | ~$0.50 | QLoRA only; 6–12h |
| RunPod A10G | A10G | 24GB | $0.69 | QLoRA comfortable; 4–8h |
| RunPod A100 40G | A100 | 40GB | $1.89 | Full LoRA; 2–4h |
| Lambda Labs A100 80G | A100 | 80GB | $2.49 | Full fine-tune; 2–3h |
| AWS p4d.24xlarge | 8× A100 | 320GB | $32.77 | 70B+ models |

Typical cost for 7B model QLoRA fine-tune (1,000 examples, 3 epochs):

  • RunPod A10G: ~$5–15 total
  • Google Colab Pro: ~$3–10

Working With Viprasol

We build production ML systems that include fine-tuned models where appropriate — from dataset curation and training through evaluation and serving infrastructure. We also help teams determine whether fine-tuning is the right approach or whether RAG will get them there faster and cheaper.

Talk to our AI/ML team about LLM customization for your product.

