@@ -1,8 +1,8 @@
# Inference Throughput Benchmarks
-In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama 2 models inference on various backends:
+In this folder we provide a series of benchmark scripts that measure inference throughput for Llama models on various backends:
* On-prem - Popular serving frameworks and containers (e.g. vLLM)
-* [**WIP**]Cloud API - Popular API services (i.e. Azure Model-as-a-Service)
-* [**WIP**]On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN)
+* Cloud API - Popular API services (e.g. Azure Model-as-a-Service or Serverless API)
+* [**WIP**] On-device - Popular on-device inference solutions on mobile and desktop (e.g. ExecuTorch, MLC-LLM, Ollama)
* [**WIP**] Optimization - Popular optimization solutions for faster inference and quantization (e.g. AutoAWQ)
# Why
@@ -16,7 +16,7 @@ Here are the parameters (if applicable) that you can configure for running the b
* **PROMPT** - Prompt sent in for inference (configure the prompt length, choosing from 5, 25, 50, 100, 500, 1k and 2k)
* **MAX_NEW_TOKENS** - Max number of tokens generated
* **CONCURRENT_LEVELS** - Max number of concurrent requests
-* **MODEL_PATH** - Model source
+* **MODEL_PATH** - Model source from Hugging Face
* **MODEL_HEADERS** - Request headers
* **SAFE_CHECK** - Content safety check (either Azure service or simulated latency)
* **THRESHOLD_TPS** - Threshold TPS (threshold for tokens per second below which we deem the query to be slow)
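
To make these parameters concrete, the sketch below shows how one such configuration might look in Python. This is a minimal illustration, not the configuration from the benchmark scripts themselves: every value, the model ID, the `PROMPT_LENGTH` name, and the `is_slow` helper are assumptions for illustration only.

```python
# Hypothetical configuration sketch; all values are illustrative.
MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"          # assumed Hugging Face model ID
MODEL_HEADERS = {"Content-Type": "application/json"}  # request headers
PROMPT_LENGTH = 500                 # prompt length to test: 5, 25, 50, 100, 500, 1k or 2k
MAX_NEW_TOKENS = 256                # max number of tokens generated per request
CONCURRENT_LEVELS = [1, 2, 4, 8, 16, 32]  # concurrency levels to test, up to the max
SAFE_CHECK = True                   # content safety check (Azure service or simulated latency)
THRESHOLD_TPS = 7                   # tokens/sec below which a query is deemed slow


def is_slow(output_tokens: int, latency_seconds: float) -> bool:
    """Flag a query whose tokens-per-second falls below THRESHOLD_TPS."""
    return (output_tokens / latency_seconds) < THRESHOLD_TPS
```

A run would then typically sweep each level in `CONCURRENT_LEVELS`, issue that many requests in parallel, and record per-request latency and token counts, from which the tokens-per-second compared against `THRESHOLD_TPS` is derived.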