AWS SageMaker Real-Time Inference: Endpoints, Autoscaling, and Inference Pipelines
Deploy machine learning models with AWS SageMaker real-time endpoints. Covers model packaging, endpoint configuration, autoscaling policies, multi-model endpoints, and inference pipelines with Terraform.
Training a model is the glamorous part of ML. Deploying it so other systems can call it reliably, at scale, with sub-100ms latency, and without you managing servers: that's the unglamorous part that determines whether the model creates business value. AWS SageMaker handles the infrastructure for model serving, but getting the deployment right requires understanding endpoint types, container requirements, autoscaling behavior, and inference pipeline composition.
This guide covers SageMaker real-time inference from model packaging through autoscaling in production.
SageMaker Inference Concepts
Endpoint: The HTTPS endpoint that serves predictions. Backed by one or more EC2 instances running your model container.
Endpoint Configuration: Defines which model(s) and instance type(s) to use. You can swap configs without endpoint downtime.
Model: References a Docker container + model artifacts (weights, tokenizer, config) stored in S3.
Inference Pipeline: Chain multiple containers in sequence (preprocessing → model → postprocessing) behind a single endpoint call.
Multi-Model Endpoint (MME): One endpoint, dozens of models. SageMaker loads/unloads models dynamically based on invocation frequency.
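These resources link by name: an endpoint references an endpoint configuration, which references a model. A minimal sketch of the corresponding boto3 request payloads may make the relationship concrete (the names, image URI, and ARNs below are placeholders; in practice each dict is passed to `create_model`, `create_endpoint_config`, and `create_endpoint` on a `sagemaker` client):

```python
def build_resources(name: str, image_uri: str, model_data_url: str,
                    role_arn: str, instance_type: str = "ml.m5.large") -> dict:
    """Return the request payloads for the three linked SageMaker resources."""
    return {
        "model": {
            "ModelName": f"{name}-model",
            "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
            "ExecutionRoleArn": role_arn,
        },
        "endpoint_config": {
            "EndpointConfigName": f"{name}-config",
            "ProductionVariants": [{
                "VariantName": "primary",
                "ModelName": f"{name}-model",  # config -> model link
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
                "InitialVariantWeight": 1.0,
            }],
        },
        "endpoint": {
            "EndpointName": name,
            "EndpointConfigName": f"{name}-config",  # endpoint -> config link
        },
    }

# Placeholder identifiers for illustration only
reqs = build_resources(
    "sentiment",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/infer:v1",
    "s3://my-ml-artifacts/models/model.tar.gz",
    "arn:aws:iam::123456789012:role/sagemaker-role",
)
```

Because the endpoint only holds a pointer to a config, swapping `EndpointConfigName` on an existing endpoint is how zero-downtime config changes work.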
Model Packaging: Custom Inference Container
For models not natively supported by SageMaker's built-in containers:
# inference/handler.py - Custom inference handler
import json
import logging
from typing import Any
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
logger = logging.getLogger(__name__)
# Global model state (loaded once at startup)
_model = None
_tokenizer = None
_device = None
def model_fn(model_dir: str) -> dict:
"""Load model from model_dir. Called once at container start."""
global _model, _tokenizer, _device
_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger.info(f"Loading model from {model_dir} on {_device}")
_tokenizer = AutoTokenizer.from_pretrained(model_dir)
_model = AutoModelForSequenceClassification.from_pretrained(model_dir)
_model.to(_device)
_model.eval()
logger.info("Model loaded successfully")
return {"model": _model, "tokenizer": _tokenizer}
def input_fn(request_body: str | bytes, request_content_type: str) -> dict:
"""Deserialize and preprocess request."""
if request_content_type == "application/json":
data = json.loads(request_body)
if isinstance(data, str):
return {"texts": [data]}
if isinstance(data, list):
return {"texts": data}
if "text" in data:
return {"texts": [data["text"]]}
if "texts" in data:
return {"texts": data["texts"]}
raise ValueError(f"Unexpected JSON structure: {data}")
raise ValueError(f"Unsupported content type: {request_content_type}")
def predict_fn(input_data: dict, model_artifacts: dict) -> dict:
"""Run inference."""
model = model_artifacts["model"]
tokenizer = model_artifacts["tokenizer"]
texts = input_data["texts"]
# Batch tokenize
inputs = tokenizer(
texts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=512,
)
inputs = {k: v.to(_device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
probabilities = torch.softmax(logits, dim=-1)
predictions = torch.argmax(probabilities, dim=-1)
return {
"predictions": predictions.cpu().tolist(),
"probabilities": probabilities.cpu().tolist(),
"labels": [model.config.id2label[p] for p in predictions.cpu().tolist()],
}
def output_fn(prediction: dict, accept: str) -> tuple[str, str]:
"""Serialize the response."""
if accept in ("application/json", "*/*"):
return json.dumps(prediction), "application/json"
raise ValueError(f"Unsupported accept type: {accept}")
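The request-parsing logic above can be sanity-checked locally without a SageMaker container. This sketch restates `input_fn` and asserts that every accepted payload shape normalizes to the same `{"texts": [...]}` structure:

```python
import json

def input_fn(request_body, request_content_type):
    # Mirrors the handler above: every accepted shape normalizes to {"texts": [...]}
    if request_content_type == "application/json":
        data = json.loads(request_body)
        if isinstance(data, str):
            return {"texts": [data]}
        if isinstance(data, list):
            return {"texts": data}
        if "text" in data:
            return {"texts": [data["text"]]}
        if "texts" in data:
            return {"texts": data["texts"]}
        raise ValueError(f"Unexpected JSON structure: {data}")
    raise ValueError(f"Unsupported content type: {request_content_type}")

assert input_fn('"hello"', "application/json") == {"texts": ["hello"]}
assert input_fn('["a", "b"]', "application/json") == {"texts": ["a", "b"]}
assert input_fn('{"text": "hi"}', "application/json") == {"texts": ["hi"]}
assert input_fn('{"texts": ["x"]}', "application/json") == {"texts": ["x"]}
```

Running this kind of check in CI catches payload-contract regressions before a container build.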
# Dockerfile - Custom SageMaker inference container
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py311-cu118-ubuntu20.04-sagemaker
# Install additional dependencies
COPY requirements.txt /opt/ml/code/requirements.txt
RUN pip install --no-cache-dir -r /opt/ml/code/requirements.txt
# Copy inference handler
COPY inference/ /opt/ml/code/
# The COPY above places handler.py at /opt/ml/code/handler.py; the program
# name is relative to that directory
ENV SAGEMAKER_PROGRAM=handler.py
# Note: SageMaker hosting runs its own /ping health checks and ignores Docker
# HEALTHCHECK; this directive only helps when running the container locally
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/ping || exit 1
# requirements.txt
transformers==4.41.0
torch==2.2.0
sentencepiece==0.2.0
protobuf==4.25.3
Model Artifacts: Packing and Uploading to S3
# scripts/package_model.py
import os
import tarfile
import boto3
from pathlib import Path
def package_model(model_dir: str, output_name: str = "model.tar.gz") -> str:
"""Package model directory into tar.gz for SageMaker."""
output_path = f"/tmp/{output_name}"
with tarfile.open(output_path, "w:gz") as tar:
for file_path in Path(model_dir).rglob("*"):
if file_path.is_file():
# Preserve relative path structure
arcname = file_path.relative_to(model_dir)
tar.add(file_path, arcname=arcname)
return output_path
def upload_model(local_path: str, bucket: str, prefix: str) -> str:
"""Upload model artifact to S3."""
s3 = boto3.client("s3")
s3_key = f"{prefix}/{os.path.basename(local_path)}"
print(f"Uploading {local_path} to s3://{bucket}/{s3_key}...")
s3.upload_file(local_path, bucket, s3_key)
return f"s3://{bucket}/{s3_key}"
if __name__ == "__main__":
# Save model from HuggingFace Hub to local directory
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
LOCAL_DIR = "/tmp/model_artifacts"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(LOCAL_DIR)
model.save_pretrained(LOCAL_DIR)
# Package and upload
archive = package_model(LOCAL_DIR)
s3_uri = upload_model(
archive,
bucket="my-ml-artifacts",
prefix="models/sentiment-classifier/v1.0"
)
print(f"Model uploaded to: {s3_uri}")
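One detail worth verifying: SageMaker extracts `model.tar.gz` directly into `model_dir`, so artifact files must sit at the archive root, not nested under an extra folder. A self-contained check using the same packing logic as `package_model` (the file names here are dummy stand-ins):

```python
import tarfile
import tempfile
from pathlib import Path

# Build a throwaway "model directory", package it, and confirm the members
# land at the archive root. config.json must be a top-level member so that
# model_fn finds it directly in model_dir after extraction.
with tempfile.TemporaryDirectory() as model_dir:
    (Path(model_dir) / "config.json").write_text("{}")
    (Path(model_dir) / "model.safetensors").write_bytes(b"\x00")
    archive = Path(model_dir) / "model.tar.gz"  # hypothetical output path
    with tarfile.open(archive, "w:gz") as tar:
        for p in Path(model_dir).rglob("*"):
            if p.is_file() and p != archive:
                tar.add(p, arcname=p.relative_to(model_dir))
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(members)  # ['config.json', 'model.safetensors']
```

If you instead see members like `model_artifacts/config.json`, the handler's `from_pretrained(model_dir)` call will fail to locate the config.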
Terraform: SageMaker Endpoint
# sagemaker/main.tf
locals {
model_name = "${var.project}-${var.model_name}-${var.model_version}"
endpoint_name = "${var.project}-${var.model_name}"
}
# IAM role for SageMaker
resource "aws_iam_role" "sagemaker" {
name = "${var.project}-sagemaker-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "sagemaker.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "sagemaker_full" {
role = aws_iam_role.sagemaker.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
resource "aws_iam_role_policy" "sagemaker_s3" {
role = aws_iam_role.sagemaker.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
"arn:aws:s3:::${var.model_artifacts_bucket}",
"arn:aws:s3:::${var.model_artifacts_bucket}/*"
]
},
{
Effect = "Allow"
Action = ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:GetAuthorizationToken"]
Resource = "*"
}
]
})
}
# SageMaker Model
resource "aws_sagemaker_model" "main" {
name = local.model_name
execution_role_arn = aws_iam_role.sagemaker.arn
primary_container {
image = "${var.ecr_repository_url}:${var.image_tag}"
model_data_url = var.model_s3_uri
environment = {
SAGEMAKER_PROGRAM = "handler.py"
SAGEMAKER_SUBMIT_DIRECTORY = "/opt/ml/code"
PYTHONUNBUFFERED = "1"
MODEL_MAX_BATCH_SIZE = "32"
}
}
tags = var.common_tags
}
# Endpoint Configuration
resource "aws_sagemaker_endpoint_configuration" "main" {
name = "${local.endpoint_name}-${var.model_version}"
production_variants {
variant_name = "primary"
model_name = aws_sagemaker_model.main.name
initial_instance_count = 1
instance_type = var.instance_type # e.g., "ml.g4dn.xlarge"
initial_variant_weight = 1
# Managed instance scaling (alternative to Application Auto Scaling)
managed_instance_scaling {
status = "ENABLED"
min_instance_count = 1
max_instance_count = 10
}
}
# Optional: shadow variant that mirrors production traffic to a new model
# version. The traffic sampling percentage is configured through a SageMaker
# inference experiment (shadow test), not on the variant itself.
shadow_production_variants {
variant_name = "shadow"
model_name = aws_sagemaker_model.main.name # point to the new model version
initial_instance_count = 1
instance_type = var.instance_type
initial_variant_weight = 0
}
# Optional: async inference for large payloads or long-running inference.
# Note: adding this block makes the endpoint asynchronous; omit it for a
# real-time endpoint.
async_inference_config {
output_config {
s3_output_path = "s3://${var.inference_output_bucket}/async-results/"
kms_key_id = var.kms_key_arn
}
}
tags = var.common_tags
}
# Endpoint (blue/green deployment by updating endpoint_config_name)
resource "aws_sagemaker_endpoint" "main" {
name = local.endpoint_name
endpoint_config_name = aws_sagemaker_endpoint_configuration.main.name
deployment_config {
blue_green_update_policy {
traffic_routing_configuration {
type = "LINEAR"
wait_interval_in_seconds = 60
linear_step_size {
type = "CAPACITY_PERCENT"
value = 25 # shift 25% every 60s
}
}
termination_wait_in_seconds = 120
}
auto_rollback_configuration {
alarms {
alarm_name = aws_cloudwatch_metric_alarm.endpoint_error_rate.alarm_name
}
}
}
tags = var.common_tags
lifecycle {
ignore_changes = [endpoint_config_name] # Managed by deployment pipeline
}
}
# CloudWatch alarm for auto-rollback trigger
resource "aws_cloudwatch_metric_alarm" "endpoint_error_rate" {
alarm_name = "${local.endpoint_name}-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "Invocation5XXErrors"
namespace = "AWS/SageMaker"
period = 60
statistic = "Sum"
threshold = 5
alarm_description = "SageMaker endpoint error rate too high; triggers rollback"
dimensions = {
EndpointName = local.endpoint_name
VariantName = "primary"
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
Autoscaling Configuration
# autoscaling.tf
resource "aws_appautoscaling_target" "sagemaker" {
max_capacity = 20
min_capacity = 1
resource_id = "endpoint/${aws_sagemaker_endpoint.main.name}/variant/primary"
scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
service_namespace = "sagemaker"
depends_on = [aws_sagemaker_endpoint.main]
}
# Scale on invocations per instance
resource "aws_appautoscaling_policy" "sagemaker_invocations" {
name = "${local.endpoint_name}-invocations-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
target_tracking_scaling_policy_configuration {
target_value = 70.0 # invocations per instance per minute
scale_in_cooldown = 300 # 5 min; don't scale in too aggressively
scale_out_cooldown = 60 # 1 min; respond quickly to load spikes
predefined_metric_specification {
predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
}
}
}
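`SageMakerVariantInvocationsPerInstance` counts invocations per instance per minute, so the target value can be derived from a load test: take the peak requests per second one instance sustains within your latency SLO, convert to per-minute, and apply a safety factor so scaling starts before saturation. A back-of-the-envelope helper (the RPS figure and safety factor are illustrative assumptions):

```python
def invocations_target(peak_rps_per_instance: float, safety_factor: float = 0.7) -> float:
    """Target value for SageMakerVariantInvocationsPerInstance.

    peak_rps_per_instance: max sustainable RPS per instance (from load testing)
    safety_factor: headroom so scale-out triggers before saturation
    """
    return peak_rps_per_instance * 60 * safety_factor

# e.g. one ml.g4dn.xlarge sustaining ~1.7 RPS at acceptable latency:
print(round(invocations_target(1.7)))  # 71, close to the 70 used above
```

If load tests show a different sustainable RPS, recompute the target rather than reusing a number from another model.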
# Scale on average model latency (target tracking custom metrics only accept
# Average/Min/Max/Sum/SampleCount, not percentile statistics)
resource "aws_appautoscaling_policy" "sagemaker_latency" {
name = "${local.endpoint_name}-latency-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
target_tracking_scaling_policy_configuration {
# ModelLatency is reported in microseconds: 200000 us = 200 ms
target_value = 200000
scale_in_cooldown = 300
scale_out_cooldown = 60
customized_metric_specification {
metric_name = "ModelLatency"
namespace = "AWS/SageMaker"
statistic = "Average"
unit = "Microseconds"
dimensions {
name = "EndpointName"
value = aws_sagemaker_endpoint.main.name
}
dimensions {
name = "VariantName"
value = "primary"
}
}
}
}
# Scheduled scaling: pre-warm for business hours
resource "aws_appautoscaling_scheduled_action" "scale_up_morning" {
name = "${local.endpoint_name}-morning-scale-up"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
schedule = "cron(0 8 ? * MON-FRI *)" # 8 AM UTC weekdays
scalable_target_action {
min_capacity = 3
max_capacity = 20
}
}
resource "aws_appautoscaling_scheduled_action" "scale_down_night" {
name = "${local.endpoint_name}-night-scale-down"
resource_id = aws_appautoscaling_target.sagemaker.resource_id
scalable_dimension = aws_appautoscaling_target.sagemaker.scalable_dimension
service_namespace = aws_appautoscaling_target.sagemaker.service_namespace
schedule = "cron(0 20 ? * MON-FRI *)" # 8 PM UTC weekdays
scalable_target_action {
min_capacity = 1
max_capacity = 5
}
}
Calling the Endpoint from Application Code
# lib/sagemaker_client.py
import json
import boto3
import logging
from typing import Any
from functools import lru_cache
logger = logging.getLogger(__name__)
@lru_cache(maxsize=1)
def get_runtime_client():
return boto3.client("sagemaker-runtime", region_name="us-east-1")
class SageMakerInferenceClient:
def __init__(self, endpoint_name: str):
self.endpoint_name = endpoint_name
self.client = get_runtime_client()
def predict(self, texts: list[str]) -> dict[str, Any]:
"""Invoke endpoint with list of texts."""
payload = json.dumps({"texts": texts})
try:
response = self.client.invoke_endpoint(
EndpointName=self.endpoint_name,
ContentType="application/json",
Accept="application/json",
Body=payload,
)
result = json.loads(response["Body"].read())
return result
except self.client.exceptions.ModelNotReadyException:
logger.warning("Model not ready; endpoint may be scaling")
raise
except Exception as e:
logger.error(f"Inference error: {e}", exc_info=True)
raise
def predict_single(self, text: str) -> dict[str, Any]:
"""Convenience method for single text."""
result = self.predict([text])
return {
"prediction": result["predictions"][0],
"label": result["labels"][0],
"probability": max(result["probabilities"][0]),
}
def batch_predict(
self,
texts: list[str],
batch_size: int = 32,
) -> list[dict[str, Any]]:
"""Process large lists in batches."""
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
batch_result = self.predict(batch)
for j, pred in enumerate(batch_result["predictions"]):
results.append({
"text": batch[j],
"prediction": pred,
"label": batch_result["labels"][j],
"probability": max(batch_result["probabilities"][j]),
})
return results
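Transient errors such as `ModelNotReadyException` during scale-out are worth retrying with backoff rather than failing the request. A minimal, framework-free sketch (the retryable exception types and delay values are assumptions to tune for your workload):

```python
import random
import time
from typing import Any, Callable

def with_backoff(fn: Callable[[], Any], retries: int = 3,
                 base_delay: float = 0.5,
                 retryable: tuple = (RuntimeError,)) -> Any:
    """Call fn, retrying retryable exceptions with jittered exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == retries:
                raise
            # 0.5s, 1s, 2s ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage with the client above (ModelNotReadyException would go in `retryable`):
# result = with_backoff(
#     lambda: sm_client.predict(["some text"]),
#     retryable=(sm_client.client.exceptions.ModelNotReadyException,),
# )
```

Keep the retry budget small for real-time paths; three retries with these delays already adds seconds of worst-case latency.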
// lib/sagemaker.ts - TypeScript wrapper for Next.js API routes
import {
SageMakerRuntimeClient,
InvokeEndpointCommand,
} from "@aws-sdk/client-sagemaker-runtime";
const client = new SageMakerRuntimeClient({ region: process.env.AWS_REGION });
interface InferenceResult {
predictions: number[];
labels: string[];
probabilities: number[][];
}
export async function invokeEndpoint(
endpointName: string,
texts: string[]
): Promise<InferenceResult> {
const command = new InvokeEndpointCommand({
EndpointName: endpointName,
ContentType: "application/json",
Accept: "application/json",
Body: JSON.stringify({ texts }),
});
const response = await client.send(command);
const body = new TextDecoder().decode(response.Body);
return JSON.parse(body) as InferenceResult;
}
// app/api/analyze/route.ts
import { NextRequest, NextResponse } from "next/server";
import { invokeEndpoint } from "@/lib/sagemaker";
export async function POST(req: NextRequest) {
  const { text } = await req.json();
  if (typeof text !== "string" || text.length === 0) {
    return NextResponse.json({ error: "text is required" }, { status: 400 });
  }
  const result = await invokeEndpoint(
    process.env.SAGEMAKER_ENDPOINT_NAME!,
    [text]
  );
return NextResponse.json({
sentiment: result.labels[0],
confidence: Math.max(...result.probabilities[0]),
});
}
Inference Pipeline: Preprocessing + Model + Postprocessing
# pipeline_containers/preprocessor/handler.py
import json
import re
def model_fn(model_dir):
return {} # No model weights needed for preprocessing
def predict_fn(input_data, model):
"""Clean text before sending to model."""
texts = input_data["texts"]
cleaned = [
re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
for t in texts
]
return {"texts": cleaned}
def input_fn(body, content_type):
return json.loads(body)
def output_fn(prediction, accept):
return json.dumps(prediction), "application/json"
# pipeline_containers/postprocessor/handler.py
import json
LABEL_DESCRIPTIONS = {
"POSITIVE": "Positive sentiment",
"NEGATIVE": "Negative sentiment",
"NEUTRAL": "Neutral or mixed sentiment",
}
def model_fn(model_dir):
return {}
def predict_fn(input_data, model):
results = []
for i, label in enumerate(input_data["labels"]):
prob = max(input_data["probabilities"][i])
results.append({
"label": label,
"description": LABEL_DESCRIPTIONS.get(label, label),
"confidence": round(prob, 4),
"tier": "high" if prob > 0.9 else "medium" if prob > 0.7 else "low",
})
return {"results": results}
def input_fn(body, content_type):
return json.loads(body)
def output_fn(prediction, accept):
return json.dumps(prediction), "application/json"
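The serial data flow can be checked locally by chaining the stages by hand. In this sketch the middle "model" stage is a fake stand-in (the real one needs the weights), and the pre/post stages restate the handlers above:

```python
import re

def preprocess(input_data):
    # Same cleaning as the preprocessor container
    texts = input_data["texts"]
    return {"texts": [re.sub(r'[^\w\s.,!?]', '', t.strip().lower())[:512]
                      for t in texts]}

def fake_model(input_data):
    # Stand-in producing the same output shape as the model container
    n = len(input_data["texts"])
    return {"labels": ["POSITIVE"] * n, "probabilities": [[0.04, 0.96]] * n}

def postprocess(input_data):
    # Same tiering logic as the postprocessor container
    results = []
    for i, label in enumerate(input_data["labels"]):
        prob = max(input_data["probabilities"][i])
        results.append({
            "label": label,
            "confidence": round(prob, 4),
            "tier": "high" if prob > 0.9 else "medium" if prob > 0.7 else "low",
        })
    return {"results": results}

out = postprocess(fake_model(preprocess({"texts": ["  GREAT product!! <3 "]})))
print(out["results"][0])  # {'label': 'POSITIVE', 'confidence': 0.96, 'tier': 'high'}
```

This mirrors what SageMaker does at runtime: each container's JSON output becomes the next container's input, so the stage contracts are easy to unit test in isolation.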
# Pipeline model definition in Terraform
resource "aws_sagemaker_model" "pipeline" {
name = "${var.project}-sentiment-pipeline"
execution_role_arn = aws_iam_role.sagemaker.arn
# Containers execute in sequence; Serial mode pipes each container's
# output into the next container's input
inference_execution_config {
mode = "Serial"
}
container {
image = "${var.ecr_repository_url}:preprocessor-${var.image_tag}"
container_hostname = "preprocessor"
}
container {
image = "${var.ecr_repository_url}:model-${var.image_tag}"
model_data_url = var.model_s3_uri
container_hostname = "model"
}
container {
image = "${var.ecr_repository_url}:postprocessor-${var.image_tag}"
container_hostname = "postprocessor"
}
}
Cost Estimates
| Instance Type | vCPU | RAM | GPU | Cost/hr | Good For |
|---|---|---|---|---|---|
| ml.t2.medium | 2 | 4GB | — | $0.065 | Development, tiny models |
| ml.m5.large | 2 | 8GB | — | $0.134 | CPU-bound inference |
| ml.m5.4xlarge | 16 | 64GB | — | $1.075 | Large batch CPU inference |
| ml.g4dn.xlarge | 4 | 16GB | T4 (16GB) | $0.736 | Small-medium GPU models |
| ml.g4dn.2xlarge | 8 | 32GB | T4 (16GB) | $1.22 | Production GPU inference |
| ml.p3.2xlarge | 8 | 61GB | V100 (16GB) | $3.83 | Large GPU models |
| ml.inf2.xlarge | 4 | 16GB | Inferentia2 | $0.76 | High-throughput optimized |
Typical production setup: 2× ml.g4dn.xlarge with autoscaling 1–10 = ~$1.50/hr base, scales with traffic.
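A quick way to sanity-check the table against your own traffic pattern is a blended estimate: base capacity billed around the clock plus extra capacity during peak hours. A rough helper (the rates and hours below are illustrative, not a quote):

```python
def monthly_cost(hourly_rate: float, base_instances: int,
                 peak_instances: int, peak_hours_per_day: float) -> float:
    """Rough monthly endpoint cost: base capacity 24/7 plus extra peak capacity."""
    base = hourly_rate * base_instances * 24 * 30
    peak = hourly_rate * (peak_instances - base_instances) * peak_hours_per_day * 30
    return round(base + peak, 2)

# 2x ml.g4dn.xlarge always on, scaling to 6 instances for 8 business hours/day:
print(monthly_cost(0.736, 2, 6, 8))  # 1766.4
```

Estimates like this often reveal that scheduled scaling (scaling the floor down overnight) matters more than tuning the target-tracking threshold.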
Cost and Timeline Estimates
| Scope | Team | Timeline | Cost Range |
|---|---|---|---|
| Basic endpoint from HuggingFace model | 1 ML engineer | 2–3 days | $600–1,200 |
| Custom container + CI/CD deployment | 1–2 engineers | 1 week | $2,000–4,000 |
| Autoscaling + monitoring + blue-green | 2 engineers | 2 weeks | $4,000–8,000 |
| Full MLOps pipeline (training → deploy → monitor) | 2–3 engineers | 4–6 weeks | $12,000–28,000 |
See Also
- AWS Bedrock RAG with Knowledge Bases
- AWS Lambda Container Deployments
- Terraform State Management and Remote Backends
- AWS CloudWatch Observability Setup
- Building AI Features with Claude API
Working With Viprasol
Deploying an ML model to SageMaker involves more than clicking "deploy" in the console: custom inference containers, autoscaling tuned to your latency requirements, blue-green deployment with auto-rollback, and integrating the endpoint into your application stack. Our team handles the MLOps layer so your data scientists can focus on model quality.
What we deliver:
- Custom SageMaker inference containers with production-grade handlers
- Terraform-managed endpoint configuration with blue-green deployment
- Autoscaling policies calibrated to your traffic patterns
- Inference pipeline composition for pre/post-processing
- Application-layer SDK wrappers (Python + TypeScript)
Talk to our team about ML deployment →
Or explore our AI and ML services.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.