AWS SageMaker Real-Time Inference: Endpoints, Autoscaling, and Inference Pipelines
Deploy machine learning models with AWS SageMaker real-time endpoints. Covers model packaging, endpoint configuration, autoscaling policies, multi-model endpoints, and inference pipelines with Terraform.
Training a model is the glamorous part of ML. Deploying it so other systems can call it reliably, at scale, with sub-100ms latency, and without you managing servers: that's the unglamorous part that determines whether the model creates business value. AWS SageMaker handles the infrastructure for model serving, but getting the deployment right requires understanding endpoint types, container requirements, autoscaling behavior, and inference pipeline composition.
This guide covers SageMaker real-time inference from model packaging through autoscaling in production.
SageMaker Inference Concepts
Endpoint: The HTTPS endpoint that serves predictions. Backed by one or more EC2 instances running your model container.
Endpoint Configuration: Defines which model(s) and instance type(s) to use. You can swap configs without endpoint downtime.
Model: References a Docker container + model artifacts (weights, tokenizer, config) stored in S3.
Inference Pipeline: Chain multiple containers in sequence (preprocessing → model → postprocessing) behind a single endpoint call.
Multi-Model Endpoint (MME): One endpoint, dozens of models. SageMaker loads/unloads models dynamically based on invocation frequency.
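These resources link by name: an endpoint references an endpoint configuration, which references a model. A minimal sketch of the corresponding boto3 request payloads may make the relationship concrete (the names, image URI, and ARNs below are placeholders; in practice each dict is passed to `create_model`, `create_endpoint_config`, and `create_endpoint` on a `sagemaker` client):

```python
def build_resources(name: str, image_uri: str, model_data_url: str,
                    role_arn: str, instance_type: str = "ml.m5.large") -> dict:
    """Return the request payloads for the three linked SageMaker resources."""
    return {
        "model": {
            "ModelName": f"{name}-model",
            "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
            "ExecutionRoleArn": role_arn,
        },
        "endpoint_config": {
            "EndpointConfigName": f"{name}-config",
            "ProductionVariants": [{
                "VariantName": "primary",
                "ModelName": f"{name}-model",  # config -> model link
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
                "InitialVariantWeight": 1.0,
            }],
        },
        "endpoint": {
            "EndpointName": name,
            "EndpointConfigName": f"{name}-config",  # endpoint -> config link
        },
    }

# Placeholder identifiers for illustration only
reqs = build_resources(
    "sentiment",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/infer:v1",
    "s3://my-ml-artifacts/models/model.tar.gz",
    "arn:aws:iam::123456789012:role/sagemaker-role",
)
```

Because the endpoint only holds a pointer to a config, swapping `EndpointConfigName` on an existing endpoint is how zero-downtime config changes work.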
Model Packaging: Custom Inference Container
For models not natively supported by SageMaker's built-in containers:
# inference/handler.py - Custom inference handler
import json
import logging
from typing import Any
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
logger = logging.getLogger(__name__)
# Global model state (loaded once at startup)
_model = None
_tokenizer = None
_device = None
def model_fn(model_dir: str) -> dict:
"""Load model from model_dir. Called once at container start."""
global _model, _tokenizer, _device
_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Loading model from {model_dir} on {_device}")
_tokenizer = AutoTokenizer.from_pretrained(model_dir)
_model = AutoModelForSequenceClassification.from_pretrained(model_dir)
_model.to(_device)
_model.eval()
logger.info("Model loaded successfully")
return {"model": _model, "tokenizer": _tokenizer}
def input_fn(request_body: str | bytes, request_content_type: str) -> dict:
"""Deserialize and preprocess request."""
if request_content_type == "application/json":
data = json.loads(request_body)
if isinstance(data, str):
return {"texts": [data]}
if isinstance(data, list):
return {"texts": data}
if "text" in data:
return {"texts": [data["text"]]}
if "texts" in data:
return {"texts": data["texts"]}
raise ValueError(f"Unexpected JSON structure: {data}")
raise ValueError(f"Unsupported content type: {request_content_type}")
def predict_fn(input_data: dict, model_artifacts: dict) -> dict:
"""Run inference."""
model = model_artifacts["model"]
tokenizer = model_artifacts["tokenizer"]
texts = input_data["texts"]
# Batch tokenize
inputs = tokenizer(
texts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=512,
)
inputs = {k: v.to(_device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
probabilities = torch.softmax(logits, dim=-1)
predictions = torch.argmax(probabilities, dim=-1)
return {
"predictions": predictions.cpu().tolist(),
"probabilities": probabilities.cpu().tolist(),
"labels": [model.config.id2label[p] for p in predictions.cpu().tolist()],
}
def output_fn(prediction: dict, accept: str) -> tuple[str, str]:
"""Serialize the response."""
if accept in ("application/json", "*/*"):
return json.dumps(prediction), "application/json"
raise ValueError(f"Unsupported accept type: {accept}")
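The request-parsing logic above can be sanity-checked locally without a SageMaker container. This sketch restates `input_fn` and asserts that every accepted payload shape normalizes to the same `{"texts": [...]}` structure:

```python
import json

def input_fn(request_body, request_content_type):
    # Mirrors the handler above: every accepted shape normalizes to {"texts": [...]}
    if request_content_type == "application/json":
        data = json.loads(request_body)
        if isinstance(data, str):
            return {"texts": [data]}
        if isinstance(data, list):
            return {"texts": data}
        if "text" in data:
            return {"texts": [data["text"]]}
        if "texts" in data:
            return {"texts": data["texts"]}
        raise ValueError(f"Unexpected JSON structure: {data}")
    raise ValueError(f"Unsupported content type: {request_content_type}")

assert input_fn('"hello"', "application/json") == {"texts": ["hello"]}
assert input_fn('["a", "b"]', "application/json") == {"texts": ["a", "b"]}
assert input_fn('{"text": "hi"}', "application/json") == {"texts": ["hi"]}
assert input_fn('{"texts": ["x"]}', "application/json") == {"texts": ["x"]}
```

Running this kind of check in CI catches payload-contract regressions before a container build.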
# Dockerfile - Custom SageMaker inference container
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py311-cu118-ubuntu20.04-sagemaker
# Install additional dependencies
COPY requirements.txt /opt/ml/code/requirements.txt
RUN pip install --no-cache-dir -r /opt/ml/code/requirements.txt
# Copy inference handler
COPY inference/ /opt/ml/code/
# The COPY above places handler.py at /opt/ml/code/handler.py; the program
# name is relative to that directory
ENV SAGEMAKER_PROGRAM=handler.py
# Note: SageMaker hosting runs its own /ping health checks and ignores Docker
# HEALTHCHECK; this directive only helps when running the container locally
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/ping || exit 1
# requirements.txt
transformers==4.41.0
torch==2.2.0
sentencepiece==0.2.0
protobuf==4.25.3
Model Artifacts: Packing and Uploading to S3
# scripts/package_model.py
import os
import tarfile
import boto3
from pathlib import Path
def package_model(model_dir: str, output_name: str = "model.tar.gz") -> str:
"""Package model directory into tar.gz for SageMaker."""
output_path = f"/tmp/{output_name}"
with tarfile.open(output_path, "w:gz") as tar:
for file_path in Path(model_dir).rglob("*"):
if file_path.is_file():
# Preserve relative path structure
arcname = file_path.relative_to(model_dir)
tar.add(file_path, arcname=arcname)
return output_path
def upload_model(local_path: str, bucket: str, prefix: str) -> str:
"""Upload model artifact to S3."""
s3 = boto3.client("s3")
s3_key = f"{prefix}/{os.path.basename(local_path)}"
print(f"Uploading {local_path} to s3://{bucket}/{s3_key}...")
s3.upload_file(local_path, bucket, s3_key)
return f"s3://{bucket}/{s3_key}"
if __name__ == "__main__":
# Save model from HuggingFace Hub to local directory
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
LOCAL_DIR = "/tmp/model_artifacts"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(LOCAL_DIR)
model.save_pretrained(LOCAL_DIR)
# Package and upload
archive = package_model(LOCAL_DIR)
s3_uri = upload_model(
archive,
bucket="my-ml-artifacts",
prefix="models/sentiment-classifier/v1.0"
)
print(f"Model uploaded to: {s3_uri}")
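One detail worth verifying: SageMaker extracts `model.tar.gz` directly into `model_dir`, so artifact files must sit at the archive root, not nested under an extra folder. A self-contained check using the same packing logic as `package_model` (the file names here are dummy stand-ins):

```python
import tarfile
import tempfile
from pathlib import Path

# Build a throwaway "model directory", package it, and confirm the members
# land at the archive root. config.json must be a top-level member so that
# model_fn finds it directly in model_dir after extraction.
with tempfile.TemporaryDirectory() as model_dir:
    (Path(model_dir) / "config.json").write_text("{}")
    (Path(model_dir) / "model.safetensors").write_bytes(b"\x00")
    archive = Path(model_dir) / "model.tar.gz"  # hypothetical output path
    with tarfile.open(archive, "w:gz") as tar:
        for p in Path(model_dir).rglob("*"):
            if p.is_file() and p != archive:
                tar.add(p, arcname=p.relative_to(model_dir))
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(members)  # ['config.json', 'model.safetensors']
```

If you instead see members like `model_artifacts/config.json`, the handler's `from_pretrained(model_dir)` call will fail to locate the config.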
Terraform: SageMaker Endpoint
# sagemaker/main.tf
locals {
model_name = "${var.project}-${var.model_name}-${var.model_version}"
endpoint_name = "${var.project}-${var.model_name}"
}
# IAM role for SageMaker
resource "aws_iam_role" "sagemaker" {
name = "${var.project}-sagemaker-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "sagemaker.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "sagemaker_full" {
role = aws_iam_role.sagemaker.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
resource "aws_iam_role_policy" "sagemaker_s3" {
role = aws_iam_role.sagemaker.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
"arn:aws:s3:::${var.model_artifacts_bucket}",
"arn:aws:s3:::${var.model_artifacts_bucket}/*"
]
},
{
Effect = "Allow"
Action = ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:GetAuthorizationToken"]
Resource = "*"
}
]
})
}
# SageMaker Model
resource "aws_sagemaker_model" "main" {
name = local.model_name
execution_role_arn = aws_iam_role.sagemaker.arn
primary_container {
image = "${var.ecr_repository_url}:${var.image_tag}"
model_data_url = var.model_s3_uri
environment = {
SAGEMAKER_PROGRAM = "handler.py"
SAGEMAKER_SUBMIT_DIRECTORY = "/opt/ml/code"
PYTHONUNBUFFERED = "1"
MODEL_MAX_BATCH_SIZE = "32"
}
}
tags = var.common_tags
}
# Endpoint Configuration
resource "aws_sagemaker_endpoint_configuration" "main" {
name = "${local.endpoint_name}-${var.model_version}"
production_variants {
variant_name = "primary"
model_name = aws_sagemaker_model.main.name
initial_instance_count = 1
instance_type = var.instance_type # e.g., "ml.g4dn.xlarge"
initial_variant_weight = 1
# Managed instance scaling (alternative to Application Auto Scaling)
managed_instance_scaling {
status = "ENABLED"
min_instance_count = 1
max_instance_count = 10
}
}
# Optional: shadow variant that mirrors production traffic to a new model
# version. The traffic sampling percentage is configured through a SageMaker
# inference experiment (shadow test), not on the variant itself.
shadow_production_variants {
variant_name = "shadow"
model_name = aws_sagemaker_model.main.name # point to the new model version
initial_instance_count = 1
instance_type = var.instance_type
initial_variant_weight = 0
}
# Optional: async inference for large payloads or long-running inference.
# Note: adding this block makes the endpoint asynchronous; omit it for a
# real-time endpoint.
async_inference_config {
output_config {
s3_output_path = "s3://${var.inference_output_bucket}/async-results/"
kms_key_id = var.kms_key_arn
}
}
tags = var.common_tags
}
# Endpoint (blue/green deployment by updating endpoint_config_name)
resource "aws_sagemaker_endpoint" "main" {
name = local.endpoint_name
endpoint_config_name = aws_sagemaker_endpoint_configuration.main.name
deployment_config {
blue_green_update_policy {
traffic_routing_configuration {
type = "LINEAR"
wait_interval_in_seconds = 60
linear_step_size {
type = "CAPACITY_PERCENT"
value = 25 # shift 25% every 60s
}
}
termination_wait_in_seconds = 120
}
auto_rollback_configuration {
alarms {
alarm_name = aws_cloudwatch_metric_alarm.endpoint_error_rate.alarm_name
}
}
}
tags = var.common_tags
lifecycle {
ignore_changes = [endpoint_config_name] # Managed by deployment pipeline
}
}
# CloudWatch alarm for auto-rollback trigger
resource "aws_cloudwatch_metric_alarm" "endpoint_error_rate" {
alarm_name = "${local.endpoint_name}-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "Invocation5XXErrors"
namespace = "AWS/SageMaker"
period = 60
statistic = "Sum"
threshold = 5
alarm_description = "SageMaker endpoint error rate too high; triggers rollback"
dimensions = {
EndpointName = local.endpoint_name
VariantName = "primary"
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
Autoscaling Configuration
# autoscaling.tf
resource "aws_appautoscaling_target" "sagemaker" {
max_capacity = 20
min_capacity = 1
resource_id = "endpoint/${aws_sagemaker_endpoint.main.name}/variant/primary"
scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
service_namespace = "sagemaker"
depends_on = [aws_sagemaker_endpoint.main]
}
# Scale on invocations per instance
resource "aws_appautoscaling_policy" "sagemaker_invocations" {
name = "${local.endpoint_name}-invocations-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
target_tracking_scaling_policy_configuration {
target_value = 70.0 # invocations per instance per minute
scale_in_cooldown = 300 # 5 min; don't scale in too aggressively
scale_out_cooldown = 60 # 1 min; respond quickly to load spikes
predefined_metric_specification {
predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
}
}
}
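`SageMakerVariantInvocationsPerInstance` counts invocations per instance per minute, so the target value can be derived from a load test: take the peak requests per second one instance sustains within your latency SLO, convert to per-minute, and apply a safety factor so scaling starts before saturation. A back-of-the-envelope helper (the RPS figure and safety factor are illustrative assumptions):

```python
def invocations_target(peak_rps_per_instance: float, safety_factor: float = 0.7) -> float:
    """Target value for SageMakerVariantInvocationsPerInstance.

    peak_rps_per_instance: max sustainable RPS per instance (from load testing)
    safety_factor: headroom so scale-out triggers before saturation
    """
    return peak_rps_per_instance * 60 * safety_factor

# e.g. one ml.g4dn.xlarge sustaining ~1.7 RPS at acceptable latency:
print(round(invocations_target(1.7)))  # 71, close to the 70 used above
```

If load tests show a different sustainable RPS, recompute the target rather than reusing a number from another model.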
# Scale on average model latency (target tracking custom metrics only accept
# Average/Min/Max/Sum/SampleCount, not percentile statistics)
resource "aws_appautoscaling_policy" "sagemaker_latency" {
name = "${local.endpoint_name}-latency-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
target_tracking_scaling_policy_configuration {
# ModelLatency is reported in microseconds: 200000 us = 200 ms
target_value = 200000
scale_in_cooldown = 300
scale_out_cooldown = 60
customized_metric_specification {
metric_name = "ModelLatency"
namespace = "AWS/SageMaker"
statistic = "Average"
unit = "Microseconds"
dimensions {
name = "EndpointName"
value = aws_sagemaker_endpoint.main.name
}
dimensions {
name = "VariantName"
value = "primary"
}
}
}
}
# Scheduled scaling: pre-warm for business hours
resource "aws_appautoscaling_scheduled_action" "scale_up_morning" {
name = "${local.endpoint_name}-morning-scale-up"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
schedule = "cron(0 8 ? * MON-FRI *)" # 8 AM UTC weekdays
scalable_target_action {
min_capacity = 3
max_capacity = 20
}
}
resource "aws_appautoscaling_scheduled_action" "scale_down_night" {
name = "${local.endpoint_name}-night-scale-down"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
schedule = "cron(0 20 ? * MON-FRI *)" # 8 PM UTC weekdays
scalable_target_action {
min_capacity = 1
max_capacity = 5
}
}
Calling the Endpoint from Application Code
# lib/sagemaker_client.py
import json
import boto3
import logging
from typing import Any
from functools import lru_cache
logger = logging.getLogger(__name__)
@lru_cache(maxsize=1)
def get_runtime_client():
return boto3.client("sagemaker-runtime", region_name="us-east-1")
class SageMakerInferenceClient:
def __init__(self, endpoint_name: str):
self.endpoint_name = endpoint_name
self.client = get_runtime_client()
def predict(self, texts: list[str]) -> dict[str, Any]:
"""Invoke endpoint with list of texts."""
payload = json.dumps({"texts": texts})
try:
response = self.client.invoke_endpoint(
EndpointName=self.endpoint_name,
ContentType="application/json",
Accept="application/json",
Body=payload,
)
result = json.loads(response["Body"].read())
return result
except self.client.exceptions.ModelNotReadyException:
logger.warning("Model not ready; endpoint may be scaling")
raise
except Exception as e:
logger.error(f"Inference error: {e}", exc_info=True)
raise
def predict_single(self, text: str) -> dict[str, Any]:
"""Convenience method for single text."""
result = self.predict([text])
return {
"prediction": result["predictions"][0],
"label": result["labels"][0],
"probability": max(result["probabilities"][0]),
}
def batch_predict(
self,
texts: list[str],
batch_size: int = 32,
) -> list[dict[str, Any]]:
"""Process large lists in batches."""
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
batch_result = self.predict(batch)
for j, pred in enumerate(batch_result["predictions"]):
results.append({
"text": batch[j],
"prediction": pred,
"label": batch_result["labels"][j],
"probability": max(batch_result["probabilities"][j]),
})
return results
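Transient errors such as `ModelNotReadyException` during scale-out are worth retrying with backoff rather than failing the request. A minimal, framework-free sketch (the retryable exception types and delay values are assumptions to tune for your workload):

```python
import random
import time
from typing import Any, Callable

def with_backoff(fn: Callable[[], Any], retries: int = 3,
                 base_delay: float = 0.5,
                 retryable: tuple = (RuntimeError,)) -> Any:
    """Call fn, retrying retryable exceptions with jittered exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == retries:
                raise
            # 0.5s, 1s, 2s ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage with the client above (ModelNotReadyException would go in `retryable`):
# result = with_backoff(
#     lambda: sm_client.predict(["some text"]),
#     retryable=(sm_client.client.exceptions.ModelNotReadyException,),
# )
```

Keep the retry budget small for real-time paths; three retries with these delays already adds seconds of worst-case latency.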
// lib/sagemaker.ts - TypeScript wrapper for Next.js API routes
import {
SageMakerRuntimeClient,
InvokeEndpointCommand,
} from "@aws-sdk/client-sagemaker-runtime";
const client = new SageMakerRuntimeClient({ region: process.env.AWS_REGION });
interface InferenceResult {
predictions: number[];
labels: string[];
probabilities: number[][];
}
export async function invokeEndpoint(
endpointName: string,
texts: string[]
): Promise<InferenceResult> {
const command = new InvokeEndpointCommand({
EndpointName: endpointName,
ContentType: "application/json",
Accept: "application/json",
Body: JSON.stringify({ texts }),
});
const response = await client.send(command);
const body = new TextDecoder().decode(response.Body);
return JSON.parse(body) as InferenceResult;
}
// app/api/analyze/route.ts
import { NextRequest, NextResponse } from "next/server";
import { invokeEndpoint } from "@/lib/sagemaker";
export async function POST(req: NextRequest) {
  const { text } = await req.json();
  if (typeof text !== "string" || text.length === 0) {
    return NextResponse.json({ error: "text is required" }, { status: 400 });
  }
  const result = await invokeEndpoint(
    process.env.SAGEMAKER_ENDPOINT_NAME!,
    [text]
  );
return NextResponse.json({
sentiment: result.labels[0],
confidence: Math.max(...result.probabilities[0]),
});
}
Inference Pipeline: Preprocessing + Model + Postprocessing
# pipeline_containers/preprocessor/handler.py
import json
import re
def model_fn(model_dir):
return {} # No model weights needed for preprocessing
def predict_fn(input_data, model):
"""Clean text before sending to model."""
texts = input_data["texts"]
cleaned = [
re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
for t in texts
]
return {"texts": cleaned}
def input_fn(body, content_type):
return json.loads(body)
def output_fn(prediction, accept):
return json.dumps(prediction), "application/json"
# pipeline_containers/postprocessor/handler.py
import json
LABEL_DESCRIPTIONS = {
"POSITIVE": "Positive sentiment",
"NEGATIVE": "Negative sentiment",
"NEUTRAL": "Neutral or mixed sentiment",
}
def model_fn(model_dir):
return {}
def predict_fn(input_data, model):
results = []
for i, label in enumerate(input_data["labels"]):
prob = max(input_data["probabilities"][i])
results.append({
"label": label,
"description": LABEL_DESCRIPTIONS.get(label, label),
"confidence": round(prob, 4),
"tier": "high" if prob > 0.9 else "medium" if prob > 0.7 else "low",
})
return {"results": results}
def input_fn(body, content_type):
return json.loads(body)
def output_fn(prediction, accept):
return json.dumps(prediction), "application/json"
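The serial data flow can be checked locally by chaining the stages by hand. In this sketch the middle "model" stage is a fake stand-in (the real one needs the weights), and the pre/post stages restate the handlers above:

```python
import re

def preprocess(input_data):
    # Same cleaning as the preprocessor container
    texts = input_data["texts"]
    return {"texts": [re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
                      for t in texts]}

def fake_model(input_data):
    # Stand-in producing the same output shape as the model container
    n = len(input_data["texts"])
    return {"labels": ["POSITIVE"] * n, "probabilities": [[0.04, 0.96]] * n}

def postprocess(input_data):
    # Same tiering logic as the postprocessor container
    results = []
    for i, label in enumerate(input_data["labels"]):
        prob = max(input_data["probabilities"][i])
        results.append({
            "label": label,
            "confidence": round(prob, 4),
            "tier": "high" if prob > 0.9 else "medium" if prob > 0.7 else "low",
        })
    return {"results": results}

out = postprocess(fake_model(preprocess({"texts": ["  GREAT product!! <3 "]})))
print(out["results"][0])  # {'label': 'POSITIVE', 'confidence': 0.96, 'tier': 'high'}
```

This mirrors what SageMaker does at runtime: each container's JSON output becomes the next container's input, so the stage contracts are easy to unit test in isolation.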
# Pipeline model definition in Terraform
resource "aws_sagemaker_model" "pipeline" {
name = "${var.project}-sentiment-pipeline"
execution_role_arn = aws_iam_role.sagemaker.arn
# Containers execute in sequence; Serial mode pipes each container's
# output into the next container's input
inference_execution_config {
mode = "Serial"
}
container {
image = "${var.ecr_repository_url}:preprocessor-${var.image_tag}"
container_hostname = "preprocessor"
}
container {
image = "${var.ecr_repository_url}:model-${var.image_tag}"
model_data_url = var.model_s3_uri
container_hostname = "model"
}
container {
image = "${var.ecr_repository_url}:postprocessor-${var.image_tag}"
container_hostname = "postprocessor"
}
}
Cost Estimates
| Instance Type | vCPU | RAM | GPU | Cost/hr | Good For |
|---|---|---|---|---|---|
| ml.t2.medium | 2 | 4GB | — | $0.065 | Development, tiny models |
| ml.m5.large | 2 | 8GB | — | $0.134 | CPU-bound inference |
| ml.m5.4xlarge | 16 | 64GB | — | $1.075 | Large batch CPU inference |
| ml.g4dn.xlarge | 4 | 16GB | T4 (16GB) | $0.736 | Small-medium GPU models |
| ml.g4dn.2xlarge | 8 | 32GB | T4 (16GB) | $1.22 | Production GPU inference |
| ml.p3.2xlarge | 8 | 61GB | V100 (16GB) | $3.83 | Large GPU models |
| ml.inf2.xlarge | 4 | 16GB | Inferentia2 | $0.76 | High-throughput optimized |
Typical production setup: 2× ml.g4dn.xlarge with autoscaling 1–10 = ~$1.50/hr base, scales with traffic.
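A quick way to sanity-check the table against your own traffic pattern is a blended estimate: base capacity billed around the clock plus extra capacity during peak hours. A rough helper (the rates and hours below are illustrative, not a quote):

```python
def monthly_cost(hourly_rate: float, base_instances: int,
                 peak_instances: int, peak_hours_per_day: float) -> float:
    """Rough monthly endpoint cost: base capacity 24/7 plus extra peak capacity."""
    base = hourly_rate * base_instances * 24 * 30
    peak = hourly_rate * (peak_instances - base_instances) * peak_hours_per_day * 30
    return round(base + peak, 2)

# 2x ml.g4dn.xlarge always on, scaling to 6 instances for 8 business hours/day:
print(monthly_cost(0.736, 2, 6, 8))  # 1766.4
```

Estimates like this often reveal that scheduled scaling (scaling the floor down overnight) matters more than tuning the target-tracking threshold.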
Cost and Timeline Estimates
| Scope | Team | Timeline | Cost Range |
|---|---|---|---|
| Basic endpoint from HuggingFace model | 1 ML engineer | 2–3 days | $600–1,200 |
| Custom container + CI/CD deployment | 1–2 engineers | 1 week | $2,000–4,000 |
| Autoscaling + monitoring + blue-green | 2 engineers | 2 weeks | $4,000–8,000 |
| Full MLOps pipeline (training → deploy → monitor) | 2–3 engineers | 4–6 weeks | $12,000–28,000 |
See Also
- AWS Bedrock RAG with Knowledge Bases
- AWS Lambda Container Deployments
- Terraform State Management and Remote Backends
- AWS CloudWatch Observability Setup
- Building AI Features with Claude API
Working With Viprasol
Deploying an ML model to SageMaker involves more than clicking "deploy" in the console: custom inference containers, autoscaling tuned to your latency requirements, blue-green deployment with auto-rollback, and integrating the endpoint into your application stack. Our team handles the MLOps layer so your data scientists can focus on model quality.
What we deliver:
- Custom SageMaker inference containers with production-grade handlers
- Terraform-managed endpoint configuration with blue-green deployment
- Autoscaling policies calibrated to your traffic patterns
- Inference pipeline composition for pre/post-processing
- Application-layer SDK wrappers (Python + TypeScript)
Talk to our team about ML deployment →
Or explore our AI and ML services.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.