Deploy containerized Llama models using Google Cloud Run with auto-scaling.
This Terraform configuration sets up a basic example deployment, demonstrating how to deploy and serve foundation models with Google Cloud Run services. Cloud Run provides serverless container deployment with automatic scaling.
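The core of such a configuration is a Cloud Run service resource with scaling limits. The sketch below is illustrative only; the resource name, variable names, and resource limits are assumptions, and the actual definition lives in main.tf:

```hcl
# Hypothetical sketch of the Cloud Run service; see main.tf for the real resource.
resource "google_cloud_run_v2_service" "llama" {
  name     = "llama-inference"
  location = var.region

  template {
    containers {
      image = var.container_image
      resources {
        limits = {
          cpu    = "4"
          memory = "16Gi"
        }
      }
    }
    scaling {
      min_instance_count = 0 # scale to zero when idle
      max_instance_count = 3 # cap auto-scaling
    }
  }
}
```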
This example shows how to use basic services such as:
In our architecture patterns for private cloud guide we outline advanced patterns for cloud deployment that you may choose to implement in a more complete deployment. This includes:
1. Configure GCP authentication:

```bash
gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```
2. Prepare a container image with vLLM. For speed and simplicity's sake, we will use a small 1B-parameter model. You may choose a larger Llama model, and if so should increase the resource requirements in your tfvars file.

```bash
cat > Dockerfile << 'EOF'
FROM vllm/vllm-openai:latest
ENV MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
# Shell form so $MODEL_NAME is expanded at container start
# (exec-form CMD does not perform variable expansion)
CMD vllm serve $MODEL_NAME --host 0.0.0.0 --port 8080
EOF

# Build and push
docker build -t llama-inference .
docker tag llama-inference gcr.io/YOUR_PROJECT_ID/llama-inference:latest
docker push gcr.io/YOUR_PROJECT_ID/llama-inference:latest
```
3. Create the configuration:

```bash
cd terraform/gcp-cloud-run-default
cp terraform.tfvars.example terraform.tfvars
```

4. Edit terraform.tfvars with your project ID and container image.
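A filled-in terraform.tfvars might look like the following. The variable names here are assumptions; check variables.tf and terraform.tfvars.example for the actual names:

```hcl
project_id      = "YOUR_PROJECT_ID"
region          = "us-central1"
container_image = "gcr.io/YOUR_PROJECT_ID/llama-inference:latest"
```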
5. Deploy:

```bash
terraform init
terraform plan
terraform apply
```
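The service_url value read with `terraform output` below is defined in outputs.tf; a minimal sketch of how such an output is wired (the resource name is a hypothetical placeholder):

```hcl
# Hypothetical output wiring; see outputs.tf for the actual definition.
output "service_url" {
  value = google_cloud_run_v2_service.llama.uri
}
```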
6. Test the deployment:

```bash
# Get the service URL
SERVICE_URL=$(terraform output -raw service_url)

# Make a request (vLLM exposes an OpenAI-compatible API)
curl -X POST $SERVICE_URL/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "prompt": "Hello, how are you?", "max_tokens": 100}'
```