Update readme in Inference folder

Chester Hu 10 months ago
parent
commit
4901841059
1 changed file with 4 additions and 4 deletions
  1. recipes/benchmarks/inference_throughput/README.md  +4 -4

recipes/benchmarks/inference_throughput/README.md  +4 -4

@@ -1,8 +1,8 @@
 # Inference Throughput Benchmarks
-In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama 2 models inference on various backends:
+In this folder we provide a series of benchmark scripts that apply a throughput analysis for Llama models inference on various backends:
 * On-prem - Popular serving frameworks and containers (i.e. vLLM)
-* [**WIP**]Cloud API - Popular API services (i.e. Azure Model-as-a-Service)
-* [**WIP**]On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN)
+* Cloud API - Popular API services (i.e. Azure Model-as-a-Service or Serverless API)
+* [**WIP**]On-device - Popular on-device inference solutions on mobile and desktop (i.e. ExecuTorch, MLC-LLM, Ollama)
 * [**WIP**]Optimization - Popular optimization solutions for faster inference and quantization (i.e. AutoAWQ)
 
 # Why
@@ -16,7 +16,7 @@ Here are the parameters (if applicable) that you can configure for running the b
 * **PROMPT** - Prompt sent in for inference (configure the length of prompt, choose from 5, 25, 50, 100, 500, 1k and 2k)
 * **MAX_NEW_TOKENS** - Max number of tokens generated
 * **CONCURRENT_LEVELS** - Max number of concurrent requests
-* **MODEL_PATH** - Model source
+* **MODEL_PATH** - Model source from Huggingface
 * **MODEL_HEADERS** - Request headers
 * **SAFE_CHECK** - Content safety check (either Azure service or simulated latency)
 * **THRESHOLD_TPS** - Threshold TPS (threshold for tokens per second below which we deem the query to be slow)
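
For context, here is a minimal sketch of how the parameters listed above might be set at the top of a Python benchmark script. The concrete values, the model id, and the header contents are illustrative assumptions for this sketch, not values taken from the scripts in this commit.

```python
# Hypothetical configuration sketch for the throughput benchmark parameters
# described in the README above. All values below are illustrative assumptions.

PROMPT_LENGTH = 100        # input prompt length in tokens (choose from 5, 25, 50, 100, 500, 1k, 2k)
MAX_NEW_TOKENS = 256       # max number of tokens generated per request
CONCURRENT_LEVELS = [1, 2, 4, 8, 16, 32]  # concurrency levels to sweep over
MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed Hugging Face model id
MODEL_HEADERS = {"Content-Type": "application/json"}  # request headers
SAFE_CHECK = True          # content safety check (Azure service or simulated latency)
THRESHOLD_TPS = 7          # queries below this tokens-per-second rate are deemed slow
```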