# Amazon SageMaker deployment

Deploy Llama models using Amazon SageMaker with GPU instances.

## Overview

This Terraform configuration sets up a basic example deployment that demonstrates how to deploy and serve foundation models using Amazon SageMaker. Amazon SageMaker provides managed inference endpoints with auto-scaling capabilities.

This example shows how to use basic services such as:

- IAM roles for permissions management
- Service accounts for fine-grained access control
- Connecting model artifacts in S3 with SageMaker for deployment
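For orientation, the sketch below shows how these pieces typically fit together in the AWS Terraform provider: an execution role that SageMaker assumes, a model that ties a container image to artifacts in S3, and an endpoint configuration backed by GPU instances. This is a minimal sketch; resource names and placeholder values are illustrative, not the exact ones used in this configuration.

```hcl
# Execution role that SageMaker assumes to pull the container and model artifacts.
resource "aws_iam_role" "sagemaker_execution" {
  name = "sagemaker-execution-role" # illustrative name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Broad managed policy for brevity; a real deployment should scope this down.
# The role also needs read access to the S3 bucket holding model.tar.gz.
resource "aws_iam_role_policy_attachment" "sagemaker_access" {
  role       = aws_iam_role.sagemaker_execution.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

# Model: connects the container image with the model artifacts in S3.
resource "aws_sagemaker_model" "llama" {
  name               = "llama-model"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = "123456789012.dkr.ecr.us-east-1.amazonaws.com/llama-serve:latest" # placeholder
    model_data_url = "s3://your-bucket/model/model.tar.gz"
  }
}

# Endpoint configuration: which model runs on which GPU instances.
resource "aws_sagemaker_endpoint_configuration" "llama" {
  name = "llama-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.llama.name
    instance_type          = "ml.p4d.24xlarge"
    initial_instance_count = 1
  }
}

# Managed inference endpoint that serves requests.
resource "aws_sagemaker_endpoint" "llama" {
  name                 = "llama-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.llama.name
}
```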
In our [architecture patterns for private cloud guide](/docs/open_source/private_cloud.md) we outline advanced patterns that you may choose to implement in a more complete cloud deployment. These include:

- Deployment into multiple regions or clouds
- Managed keys/secrets services
- Comprehensive logging systems for auditing and compliance
- Backup and recovery systems

## Getting started

### Prerequisites

* AWS account with access to Amazon SageMaker
* Terraform installed
* Model artifacts packaged as `tar.gz` (see model setup below)
* Container image (AWS pre-built or custom ECR)
* A Hugging Face account with access to the appropriate models (such as Llama 3.2 1B or Llama 3.3 70B)
* **GPU quota**: Request a quota increase for `ml.p4d.24xlarge` instances via AWS Service Quotas (the default quota is 0)

### Deploy

1. Configure AWS credentials:

   ```bash
   aws configure
   ```

2. Prepare Llama model artifacts:

   ```bash
   # Download the model using the Hugging Face CLI
   pip install huggingface-hub
   huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir ./model

   # Package for Amazon SageMaker
   tar -czf model.tar.gz -C model .
   aws s3 cp model.tar.gz s3://your-bucket/model/
   ```

3. Create configuration:

   ```bash
   cd terraform/amazon-sagemaker-default
   cp terraform.tfvars.example terraform.tfvars
   ```

4. Edit `terraform.tfvars` with your model S3 path and other variables.
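   The variable names below are illustrative assumptions; check `terraform.tfvars.example` for the names this configuration actually defines:

   ```hcl
   # Illustrative values only -- the real variable names live in terraform.tfvars.example.
   region             = "us-east-1"
   model_data_s3_path = "s3://your-bucket/model/model.tar.gz"
   container_image    = "123456789012.dkr.ecr.us-east-1.amazonaws.com/llama-serve:latest"
   instance_type      = "ml.p4d.24xlarge"
   endpoint_name      = "llama-3-2-1b-endpoint"
   ```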
5. Deploy:

   ```bash
   terraform init
   terraform plan
   terraform apply
   ```

### Usage

```python
import boto3
import json

client = boto3.client('sagemaker-runtime', region_name='us-east-1')

response = client.invoke_endpoint(
    EndpointName='your-endpoint-name',
    ContentType='application/json',
    Body=json.dumps({
        "inputs": "Hello, how are you?",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7
        }
    })
)

result = json.loads(response['Body'].read())
print(result)
```

## Next steps

* [Amazon SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/)
* [Amazon SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/)