
AWS SageMaker Real-Time Inference: Endpoints, Autoscaling, and Inference Pipelines

Deploy machine learning models with AWS SageMaker real-time endpoints. Covers model packaging, endpoint configuration, autoscaling policies, multi-model endpoints, and inference pipelines with Terraform.

Viprasol Tech Team
March 9, 2027
13 min read

Training a model is the glamorous part of ML. Deploying it so other systems can call it reliably, at scale, with sub-100ms latency, and without you managing servers: that's the unglamorous part that determines whether the model creates business value. AWS SageMaker handles the infrastructure for model serving, but getting the deployment right requires understanding endpoint types, container requirements, autoscaling behavior, and inference pipeline composition.

This guide covers SageMaker real-time inference from model packaging through autoscaling in production.

SageMaker Inference Concepts

Endpoint: The HTTPS endpoint that serves predictions. Backed by one or more EC2 instances running your model container.

Endpoint Configuration: Defines which model(s) and instance type(s) to use. You can swap configs without endpoint downtime.

Model: References a Docker container + model artifacts (weights, tokenizer, config) stored in S3.

Inference Pipeline: Chain multiple containers in sequence (preprocessing → model → postprocessing) behind a single endpoint call.

Multi-Model Endpoint (MME): One endpoint, dozens of models. SageMaker loads/unloads models dynamically based on invocation frequency.
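With an MME, the caller picks the model per request via the TargetModel parameter of invoke_endpoint. A minimal sketch (the endpoint and artifact names are placeholders; in production, runtime_client would be boto3.client("sagemaker-runtime")):

```python
import json


def invoke_multi_model(runtime_client, endpoint_name: str, target_model: str, payload: dict) -> dict:
    """Invoke one model hosted on a multi-model endpoint.

    target_model is the artifact's key relative to the MME's S3 model prefix,
    e.g. "sentiment-v2/model.tar.gz" (hypothetical name).
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        TargetModel=target_model,  # MME-specific: selects which artifact to route to
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```

Note that the first invocation of a cold model pays a load-from-S3 latency penalty before SageMaker caches it on the instance.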

Model Packaging: Custom Inference Container

For models not natively supported by SageMaker's built-in containers:

# inference/handler.py - Custom inference handler
import json
import logging
import os
from typing import Any

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

logger = logging.getLogger(__name__)

# Global model state (loaded once at startup)
_model = None
_tokenizer = None
_device = None


def model_fn(model_dir: str) -> dict:
    """Load model from model_dir. Called once at container start."""
    global _model, _tokenizer, _device

    _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info(f"Loading model from {model_dir} on {_device}")

    _tokenizer = AutoTokenizer.from_pretrained(model_dir)
    _model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    _model.to(_device)
    _model.eval()

    logger.info("Model loaded successfully")
    return {"model": _model, "tokenizer": _tokenizer}


def input_fn(request_body: str | bytes, request_content_type: str) -> dict:
    """Deserialize and preprocess request."""
    if request_content_type == "application/json":
        data = json.loads(request_body)
        if isinstance(data, str):
            return {"texts": [data]}
        if isinstance(data, list):
            return {"texts": data}
        if "text" in data:
            return {"texts": [data["text"]]}
        if "texts" in data:
            return {"texts": data["texts"]}
        raise ValueError(f"Unexpected JSON structure: {data}")
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data: dict, model_artifacts: dict) -> dict:
    """Run inference."""
    model = model_artifacts["model"]
    tokenizer = model_artifacts["tokenizer"]
    texts = input_data["texts"]

    # Batch tokenize
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    inputs = {k: v.to(_device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1)
        predictions = torch.argmax(probabilities, dim=-1)

    return {
        "predictions": predictions.cpu().tolist(),
        "probabilities": probabilities.cpu().tolist(),
        "labels": [model.config.id2label[p] for p in predictions.cpu().tolist()],
    }


def output_fn(prediction: dict, accept: str) -> tuple[str, str]:
    """Serialize the response."""
    if accept in ("application/json", "*/*"):
        return json.dumps(prediction), "application/json"
    raise ValueError(f"Unsupported accept type: {accept}")
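The request contract above can be exercised without torch or a SageMaker container. This sketch repeats the input_fn shape-handling as a standalone function so the accepted payload shapes can be checked locally before baking the handler into an image:

```python
import json


def normalize_request(request_body, content_type: str) -> dict:
    """Mirror of input_fn above: accept a bare string, a list, or a wrapped object."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    data = json.loads(request_body)
    if isinstance(data, str):
        return {"texts": [data]}
    if isinstance(data, list):
        return {"texts": data}
    if "text" in data:
        return {"texts": [data["text"]]}
    if "texts" in data:
        return {"texts": data["texts"]}
    raise ValueError(f"Unexpected JSON structure: {data}")
```

All four shapes normalize to the same {"texts": [...]} dict that predict_fn consumes.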

# Dockerfile - Custom SageMaker inference container
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py311-cu118-ubuntu20.04-sagemaker

# Install additional dependencies
COPY requirements.txt /opt/ml/code/requirements.txt
RUN pip install --no-cache-dir -r /opt/ml/code/requirements.txt

# Copy inference handler
COPY inference/ /opt/ml/code/

# The COPY above places the script at /opt/ml/code/handler.py, so reference it by name
ENV SAGEMAKER_PROGRAM=handler.py

# Health check for local testing (SageMaker itself probes GET /ping directly
# and does not use Docker HEALTHCHECK)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8080/ping || exit 1

# requirements.txt
transformers==4.41.0
torch==2.2.0
sentencepiece==0.2.0
protobuf==4.25.3

Model Artifacts: Packing and Uploading to S3

# scripts/package_model.py
import os
import tarfile
import boto3
from pathlib import Path

def package_model(model_dir: str, output_name: str = "model.tar.gz") -> str:
    """Package model directory into tar.gz for SageMaker."""
    output_path = f"/tmp/{output_name}"

    with tarfile.open(output_path, "w:gz") as tar:
        for file_path in Path(model_dir).rglob("*"):
            if file_path.is_file():
                # Preserve relative path structure
                arcname = file_path.relative_to(model_dir)
                tar.add(file_path, arcname=arcname)

    return output_path


def upload_model(local_path: str, bucket: str, prefix: str) -> str:
    """Upload model artifact to S3."""
    s3 = boto3.client("s3")
    s3_key = f"{prefix}/{os.path.basename(local_path)}"

    print(f"Uploading {local_path} to s3://{bucket}/{s3_key}...")
    s3.upload_file(local_path, bucket, s3_key)
    return f"s3://{bucket}/{s3_key}"


if __name__ == "__main__":
    # Save model from HuggingFace Hub to local directory
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
    LOCAL_DIR = "/tmp/model_artifacts"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

    tokenizer.save_pretrained(LOCAL_DIR)
    model.save_pretrained(LOCAL_DIR)

    # Package and upload
    archive = package_model(LOCAL_DIR)
    s3_uri = upload_model(
        archive,
        bucket="my-ml-artifacts",
        prefix="models/sentiment-classifier/v1.0"
    )
    print(f"Model uploaded to: {s3_uri}")
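SageMaker extracts model.tar.gz into model_dir before calling model_fn, so from_pretrained only works if config.json and the weights sit at the archive root, not under a nested directory. A quick stdlib-only layout check, mirroring the packaging logic above:

```python
import tarfile
import tempfile
from pathlib import Path


def archive_members(model_dir: str) -> set[str]:
    """Package a directory the same way as package_model and return member paths."""
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        with tarfile.open(tmp.name, "w:gz") as tar:
            for file_path in Path(model_dir).rglob("*"):
                if file_path.is_file():
                    tar.add(file_path, arcname=file_path.relative_to(model_dir))
        with tarfile.open(tmp.name, "r:gz") as tar:
            # Root-level names like "config.json" (no leading directory) are correct
            return {m.name for m in tar.getmembers()}
```

If members come back as "model_artifacts/config.json" instead of "config.json", the endpoint will fail at model_fn with a file-not-found error.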

Terraform: SageMaker Endpoint

# sagemaker/main.tf

locals {
  model_name    = "${var.project}-${var.model_name}-${var.model_version}"
  endpoint_name = "${var.project}-${var.model_name}"
}

# IAM role for SageMaker
resource "aws_iam_role" "sagemaker" {
  name = "${var.project}-sagemaker-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "sagemaker_full" {
  role       = aws_iam_role.sagemaker.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

resource "aws_iam_role_policy" "sagemaker_s3" {
  role = aws_iam_role.sagemaker.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:ListBucket"]
        Resource = [
          "arn:aws:s3:::${var.model_artifacts_bucket}",
          "arn:aws:s3:::${var.model_artifacts_bucket}/*"
        ]
      },
      {
        Effect   = "Allow"
        Action   = ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:GetAuthorizationToken"]
        Resource = "*"
      }
    ]
  })
}

# SageMaker Model
resource "aws_sagemaker_model" "main" {
  name               = local.model_name
  execution_role_arn = aws_iam_role.sagemaker.arn

  primary_container {
    image          = "${var.ecr_repository_url}:${var.image_tag}"
    model_data_url = var.model_s3_uri

    environment = {
      SAGEMAKER_PROGRAM          = "handler.py"
      SAGEMAKER_SUBMIT_DIRECTORY = "/opt/ml/code"
      PYTHONUNBUFFERED           = "1"
      MODEL_MAX_BATCH_SIZE       = "32"
    }
  }

  tags = var.common_tags
}

# Endpoint Configuration
resource "aws_sagemaker_endpoint_configuration" "main" {
  name = "${local.endpoint_name}-${var.model_version}"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.main.name
    initial_instance_count = 1
    instance_type          = var.instance_type  # e.g., "ml.g4dn.xlarge"
    initial_variant_weight = 1

    # Managed instance scaling (alternative to Application Auto Scaling)
    managed_instance_scaling {
      status            = "ENABLED"
      min_instance_count = 1
      max_instance_count = 10
    }
  }

  # Optional: shadow variant mirrors production traffic to a candidate model
  # without affecting live responses (useful for validating a new version)
  shadow_production_variants {
    variant_name           = "shadow"
    model_name             = aws_sagemaker_model.main.name  # point to new model
    initial_instance_count = 1
    instance_type          = var.instance_type
    initial_variant_weight = 0
  }

  # Note: async_inference_config turns this into an asynchronous endpoint
  # (large payloads, long-running inference); keep it in a separate endpoint
  # configuration rather than combining it with the real-time variant above
  async_inference_config {
    output_config {
      s3_output_path = "s3://${var.inference_output_bucket}/async-results/"
      kms_key_id     = var.kms_key_arn
    }
  }

  tags = var.common_tags
}

# Endpoint (blue/green deployment by updating endpoint_config_name)
resource "aws_sagemaker_endpoint" "main" {
  name                 = local.endpoint_name
  endpoint_config_name = aws_sagemaker_endpoint_configuration.main.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type                  = "LINEAR"
        wait_interval_in_seconds = 60
        linear_step_size {
          type  = "CAPACITY_PERCENT"
          value = 25  # shift 25% every 60s
        }
      }
      termination_wait_in_seconds = 120
    }
    auto_rollback_configuration {
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.endpoint_error_rate.alarm_name
      }
    }
  }

  tags = var.common_tags

  lifecycle {
    ignore_changes = [endpoint_config_name]  # Managed by deployment pipeline
  }
}
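Because the endpoint resource ignores endpoint_config_name changes, promoting a new config happens outside Terraform. A minimal promotion helper (a sketch; in production, sm_client would be boto3.client("sagemaker") and the new config would come from a fresh terraform apply):

```python
def swap_endpoint_config(sm_client, endpoint_name: str, new_config_name: str) -> None:
    """Point an existing endpoint at a new endpoint configuration.

    With the blue/green deployment_config above, SageMaker shifts traffic
    gradually and rolls back automatically if the rollback alarm fires.
    """
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
    )
    # Block until the update completes (or rolls back)
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
```

A CI pipeline would call this after the new aws_sagemaker_endpoint_configuration is created, keeping deployments out of Terraform's state churn.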

# CloudWatch alarm for auto-rollback trigger
resource "aws_cloudwatch_metric_alarm" "endpoint_error_rate" {
  alarm_name          = "${local.endpoint_name}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Invocation5XXErrors"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "SageMaker endpoint error rate too high; triggers rollback"

  dimensions = {
    EndpointName = local.endpoint_name
    VariantName  = "primary"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Autoscaling Configuration

# autoscaling.tf

resource "aws_appautoscaling_target" "sagemaker" {
  max_capacity       = 20
  min_capacity       = 1
  resource_id        = "endpoint/${aws_sagemaker_endpoint.main.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"

  depends_on = [aws_sagemaker_endpoint.main]
}

# Scale on invocations per instance
resource "aws_appautoscaling_policy" "sagemaker_invocations" {
  name               = "${local.endpoint_name}-invocations-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 70.0  # invocations per instance per minute
    scale_in_cooldown  = 300   # 5 min: don't scale in too aggressively
    scale_out_cooldown = 60    # 1 min: respond quickly to load spikes

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
  }
}
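The 70-invocations target shouldn't be guessed. The usual approach is to load-test a single instance for its maximum sustainable requests per second, then set the target at a fraction of that capacity so spare headroom absorbs spikes while new instances boot. As a back-of-envelope helper (the 50% headroom default is a common rule of thumb, not a prescription):

```python
def invocations_target(max_rps_per_instance: float, headroom: float = 0.5) -> float:
    """Convert a load-tested per-instance RPS ceiling into a
    SageMakerVariantInvocationsPerInstance target (invocations per minute).

    headroom=0.5 keeps 50% spare capacity for spikes during scale-out.
    """
    return max_rps_per_instance * 60.0 * headroom


# e.g. a single instance sustaining ~2.3 req/s in a load test yields a
# target of roughly 70 invocations/minute at 50% headroom
```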

# Scale on average model latency
resource "aws_appautoscaling_policy" "sagemaker_latency" {
  name               = "${local.endpoint_name}-latency-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 200000.0  # ModelLatency is in microseconds, so 200ms = 200000
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    customized_metric_specification {
      metric_name = "ModelLatency"
      namespace   = "AWS/SageMaker"
      statistic   = "Average"  # target tracking supports Average, not percentile statistics
      unit        = "Microseconds"

      dimensions {
        name  = "EndpointName"
        value = aws_sagemaker_endpoint.main.name
      }
      dimensions {
        name  = "VariantName"
        value = "primary"
      }
    }
  }
}

# Scheduled scaling: pre-warm for business hours
resource "aws_appautoscaling_scheduled_action" "scale_up_morning" {
  name               = "${local.endpoint_name}-morning-scale-up"
  resource_id        = aws_appautoscaling_target.sagemaker.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker.service_namespace
  schedule           = "cron(0 8 ? * MON-FRI *)"  # 8 AM UTC weekdays

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "scale_down_night" {
  name               = "${local.endpoint_name}-night-scale-down"
  resource_id        = aws_appautoscaling_target.sagemaker.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker.service_namespace
  schedule           = "cron(0 20 ? * MON-FRI *)"  # 8 PM UTC weekdays

  scalable_target_action {
    min_capacity = 1
    max_capacity = 5
  }
}

Calling the Endpoint from Application Code

# lib/sagemaker_client.py
import json
import boto3
import logging
from typing import Any
from functools import lru_cache

logger = logging.getLogger(__name__)


@lru_cache(maxsize=1)
def get_runtime_client():
    return boto3.client("sagemaker-runtime", region_name="us-east-1")


class SageMakerInferenceClient:
    def __init__(self, endpoint_name: str):
        self.endpoint_name = endpoint_name
        self.client = get_runtime_client()

    def predict(self, texts: list[str]) -> dict[str, Any]:
        """Invoke endpoint with list of texts."""
        payload = json.dumps({"texts": texts})

        try:
            response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                ContentType="application/json",
                Accept="application/json",
                Body=payload,
            )

            result = json.loads(response["Body"].read())
            return result

        except self.client.exceptions.ModelNotReadyException:
            logger.warning("Model not ready; endpoint may be scaling")
            raise
        except Exception as e:
            logger.error(f"Inference error: {e}", exc_info=True)
            raise

    def predict_single(self, text: str) -> dict[str, Any]:
        """Convenience method for single text."""
        result = self.predict([text])
        return {
            "prediction": result["predictions"][0],
            "label": result["labels"][0],
            "probability": max(result["probabilities"][0]),
        }

    def batch_predict(
        self,
        texts: list[str],
        batch_size: int = 32,
    ) -> list[dict[str, Any]]:
        """Process large lists in batches."""
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            batch_result = self.predict(batch)
            for j, pred in enumerate(batch_result["predictions"]):
                results.append({
                    "text": batch[j],
                    "prediction": pred,
                    "label": batch_result["labels"][j],
                    "probability": max(batch_result["probabilities"][j]),
                })
        return results
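During scale-out or blue/green shifts, invocations can briefly fail with ModelNotReadyException, so callers should retry with backoff rather than surface the error. A small sketch (the exception class is passed in so the wrapper stays decoupled from boto3; predict_fn would be SageMakerInferenceClient.predict):

```python
import time


def predict_with_retry(predict_fn, texts, retryable_exc, max_attempts: int = 4, base_delay: float = 0.5):
    """Call predict_fn(texts), retrying retryable_exc with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return predict_fn(texts)
        except retryable_exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle it
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Only ModelNotReadyException (and similar transient errors) should be retried this way; a ModelError means the payload or handler is broken and retrying will not help.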

// lib/sagemaker.ts - TypeScript wrapper for Next.js API routes
import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from "@aws-sdk/client-sagemaker-runtime";

const client = new SageMakerRuntimeClient({ region: process.env.AWS_REGION });

interface InferenceResult {
  predictions: number[];
  labels: string[];
  probabilities: number[][];
}

export async function invokeEndpoint(
  endpointName: string,
  texts: string[]
): Promise<InferenceResult> {
  const command = new InvokeEndpointCommand({
    EndpointName: endpointName,
    ContentType: "application/json",
    Accept: "application/json",
    Body: JSON.stringify({ texts }),
  });

  const response = await client.send(command);
  const body = new TextDecoder().decode(response.Body);
  return JSON.parse(body) as InferenceResult;
}

// app/api/analyze/route.ts
import { NextRequest, NextResponse } from "next/server";
import { invokeEndpoint } from "@/lib/sagemaker";

export async function POST(req: NextRequest) {
  const { text } = await req.json();

  const result = await invokeEndpoint(
    process.env.SAGEMAKER_ENDPOINT_NAME!,
    [text]
  );

  return NextResponse.json({
    sentiment: result.labels[0],
    confidence: Math.max(...result.probabilities[0]),
  });
}

Inference Pipeline: Preprocessing + Model + Postprocessing

# pipeline_containers/preprocessor/handler.py
import json
import re

def model_fn(model_dir):
    return {}  # No model weights needed for preprocessing

def predict_fn(input_data, model):
    """Clean text before sending to model."""
    texts = input_data["texts"]
    cleaned = [
        re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
        for t in texts
    ]
    return {"texts": cleaned}

def input_fn(body, content_type):
    return json.loads(body)

def output_fn(prediction, accept):
    return json.dumps(prediction), "application/json"
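The preprocessor's cleaning rule (strip everything but word characters, whitespace, and .,!?; lowercase; cap at 512 characters) is easy to verify standalone before wiring the container into a pipeline:

```python
import re


def clean_text(t: str) -> str:
    """Same transformation as the preprocessor's predict_fn, for a single string."""
    return re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
```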

# pipeline_containers/postprocessor/handler.py
import json

LABEL_DESCRIPTIONS = {
    "POSITIVE": "Positive sentiment",
    "NEGATIVE": "Negative sentiment",
    "NEUTRAL": "Neutral or mixed sentiment",
}

def model_fn(model_dir):
    return {}

def predict_fn(input_data, model):
    results = []
    for i, label in enumerate(input_data["labels"]):
        prob = max(input_data["probabilities"][i])
        results.append({
            "label": label,
            "description": LABEL_DESCRIPTIONS.get(label, label),
            "confidence": round(prob, 4),
            "tier": "high" if prob > 0.9 else "medium" if prob > 0.7 else "low",
        })
    return {"results": results}

def input_fn(body, content_type):
    return json.loads(body)

def output_fn(prediction, accept):
    return json.dumps(prediction), "application/json"

# Pipeline model definition in Terraform
resource "aws_sagemaker_model" "pipeline" {
  name               = "${var.project}-sentiment-pipeline"
  execution_role_arn = aws_iam_role.sagemaker.arn

  # Containers execute in sequence
  container {
    image          = "${var.ecr_repository_url}:preprocessor-${var.image_tag}"
    container_hostname = "preprocessor"
  }

  container {
    image          = "${var.ecr_repository_url}:model-${var.image_tag}"
    model_data_url = var.model_s3_uri
    container_hostname = "model"
  }

  container {
    image          = "${var.ecr_repository_url}:postprocessor-${var.image_tag}"
    container_hostname = "postprocessor"
  }
}

Cost Estimates

Instance Type   | vCPU | RAM  | GPU         | Cost/hr | Good For
ml.t2.medium    | 2    | 4GB  | none        | $0.065  | Development, tiny models
ml.m5.large     | 2    | 8GB  | none        | $0.134  | CPU-bound inference
ml.m5.4xlarge   | 16   | 64GB | none        | $1.075  | Large batch CPU inference
ml.g4dn.xlarge  | 4    | 16GB | T4 (16GB)   | $0.736  | Small-medium GPU models
ml.g4dn.2xlarge | 8    | 32GB | T4 (16GB)   | $1.22   | Production GPU inference
ml.p3.2xlarge   | 8    | 61GB | V100 (16GB) | $3.83   | Large GPU models
ml.inf2.xlarge  | 4    | 16GB | Inferentia2 | $0.76   | High-throughput optimized

Typical production setup: 2x ml.g4dn.xlarge with autoscaling from 1 to 10 instances, roughly $1.50/hr at baseline, scaling with traffic.
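Real-time endpoints bill per instance-hour for as long as they run, so the monthly baseline is roughly instances × hourly rate × 730. A trivial calculator using the table's rates (prices vary by region; verify against current AWS pricing):

```python
def monthly_endpoint_cost(hourly_rate: float, avg_instances: float, hours_per_month: float = 730.0) -> float:
    """Approximate monthly cost for a real-time endpoint variant."""
    return hourly_rate * avg_instances * hours_per_month


# e.g. 2x ml.g4dn.xlarge at $0.736/hr running all month comes to about $1,075
```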

Cost and Timeline Estimates

Scope                                               | Team          | Timeline  | Cost Range
Basic endpoint from HuggingFace model               | 1 ML engineer | 2–3 days  | $600–1,200
Custom container + CI/CD deployment                 | 1–2 engineers | 1 week    | $2,000–4,000
Autoscaling + monitoring + blue-green               | 2 engineers   | 2 weeks   | $4,000–8,000
Full MLOps pipeline (training → deploy → monitor)   | 2–3 engineers | 4–6 weeks | $12,000–28,000

Working With Viprasol

Deploying an ML model to SageMaker involves more than clicking "deploy" in the console: custom inference containers, autoscaling tuned to your latency requirements, blue-green deployment with auto-rollback, and integrating the endpoint into your application stack. Our team handles the MLOps layer so your data scientists can focus on model quality.

What we deliver:

  • Custom SageMaker inference containers with production-grade handlers
  • Terraform-managed endpoint configuration with blue-green deployment
  • Autoscaling policies calibrated to your traffic patterns
  • Inference pipeline composition for pre/post-processing
  • Application-layer SDK wrappers (Python + TypeScript)

Talk to our team about ML deployment โ†’

Or explore our AI and ML services.

About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
