Google Cloud Platform Cloud Run deployment

Deploy containerized Llama models using Google Cloud Run with auto-scaling.

Overview

This Terraform configuration sets up a basic example deployment, demonstrating how to serve foundation models using Google Cloud Run. Cloud Run provides serverless container deployment with automatic scaling.

This example shows how to use basic services such as the following (a rough gcloud equivalent is sketched after the list):

  • IAM roles for permissions management
  • Service accounts for fine-grained access control
  • Containerization of Llama models on Cloud Run
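
As a rough mental model, the Terraform module provisions the equivalent of the following gcloud commands. This is an illustrative sketch only; the service name, region, and resource sizes here are assumptions, not the module's actual defaults.

```bash
# Illustrative gcloud equivalent of the resources this module manages
# (service name, region, and sizing are assumptions, not module defaults)
gcloud iam service-accounts create llama-inference \
  --display-name "Llama inference service account"

gcloud run deploy llama-inference \
  --image gcr.io/YOUR_PROJECT_ID/llama-inference:latest \
  --region us-central1 \
  --service-account "llama-inference@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --cpu 4 --memory 16Gi \
  --no-allow-unauthenticated
```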

In our architecture patterns for private cloud guide, we outline advanced patterns for cloud deployment that you may choose to implement in a more complete deployment. These include:

  • Deployment into multiple regions or clouds
  • Managed keys/secrets services (a minimal example follows this list)
  • Comprehensive logging for auditing and compliance
  • Backup and recovery systems
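
For example, the managed-secrets pattern could use Secret Manager to hold a Hugging Face token. A minimal sketch, assuming the illustrative service name used above; the secret name is hypothetical:

```bash
# Store a Hugging Face token in Secret Manager (secret name is illustrative)
printf '%s' "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-

# Expose the secret to the Cloud Run service as an environment variable.
# The service account also needs roles/secretmanager.secretAccessor.
gcloud run services update llama-inference \
  --region us-central1 \
  --update-secrets HF_TOKEN=hf-token:latest
```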

Getting started

Prerequisites

  • GCP project with billing enabled (required for Cloud Run and Artifact Registry)
  • Terraform installed
  • Docker container image with Llama model (see container setup below)
  • Google Cloud CLI configured
  • Application Default Credentials: gcloud auth application-default login
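
A quick way to sanity-check these prerequisites before deploying:

```bash
# Verify the required tooling is installed
gcloud --version
terraform version
docker --version

# Confirm Application Default Credentials resolve to a token
gcloud auth application-default print-access-token > /dev/null && echo "ADC OK"
```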

Deploy

  1. Configure GCP authentication:

     ```bash
     gcloud auth login
     gcloud config set project YOUR_PROJECT_ID
     ```
    
  2. Prepare a container image with vLLM. For speed and simplicity's sake, we will use a small 1B parameter model. You may choose to use a larger Llama model, and if so should increase the resource requirements in your tfvars file.

     ```bash
     # Create the Dockerfile. A shell-form ENTRYPOINT is used so that
     # $MODEL_NAME is expanded when the container starts (an exec-form
     # CMD in JSON syntax would not expand environment variables).
     cat > Dockerfile << 'EOF'
     FROM vllm/vllm-openai:latest
     ENV MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
     ENTRYPOINT vllm serve $MODEL_NAME --host 0.0.0.0 --port 8080
     EOF

     # Build and push the image
     docker build -t llama-inference .
     docker tag llama-inference gcr.io/YOUR_PROJECT_ID/llama-inference:latest
     docker push gcr.io/YOUR_PROJECT_ID/llama-inference:latest
     ```

  3. Create the configuration:

     ```bash
     cd terraform/gcp-cloud-run-default
     cp terraform.tfvars.example terraform.tfvars
     ```

  4. Edit terraform.tfvars with your project ID and container image.

  5. Deploy:

     ```bash
     terraform init
     terraform plan
     terraform apply
     ```

Usage

```bash
# Get the service URL
SERVICE_URL=$(terraform output -raw service_url)

# Make a request to the vLLM OpenAI-compatible completions endpoint
curl -X POST $SERVICE_URL/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "prompt": "Hello, how are you?", "max_tokens": 100}'
```

Next steps