
Apply suggestions from code review

Matthias Reso 1 year ago
parent
commit
00e0b0be6c

+ 2 - 1
.github/scripts/spellcheck_conf/wordlist.txt

@@ -1411,4 +1411,5 @@ tp
 QLoRA
 ntasks
 srun
-xH
+xH
+unquantized

+ 1 - 1
recipes/3p_integrations/vllm/README.md

@@ -30,7 +30,7 @@ The script will ask for another prompt in a loop after completing the generation
 When using multiple GPUs the model will automatically be split across the available GPUs using tensor parallelism.

 ## Multi-node multi-gpu inference
-The FP8 quantized veriants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
 To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need multi-node inference.
 vLLM allows this by leveraging pipeline parallelism across nodes while still applying tensor parallelism inside each node.
 To start a multi-node inference we first need to set up a Ray cluster which will be leveraged by vLLM to execute the model across node boundaries.
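To make the multi-node setup above concrete, here is a minimal sketch (not the script in this folder) of how vLLM combines the two forms of parallelism once a Ray cluster spans the nodes; the model name matches the README, but the two-node/8-GPU topology and the sampling settings are illustrative assumptions.

```python
# Illustrative sketch, assuming 2 nodes with 8 GPUs each and a Ray cluster already
# started on both nodes (head: `ray start --head`, worker: `ray start --address=<head-ip>:6379`).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,              # shard each layer across the 8 GPUs inside a node
    pipeline_parallel_size=2,            # split the layer stack across the 2 nodes
    distributed_executor_backend="ray",  # let Ray place workers across node boundaries
)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Explain pipeline parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```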

+ 1 - 1
recipes/quickstart/inference/local_inference/README.md

@@ -86,5 +86,5 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes
 ```

 ## Inference on large models like Meta Llama 405B
-The FP8 quantized veriants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
 To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
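For the single-node FP8 path referenced above, a minimal sketch using vLLM's offline API (an alternative to the folder's own inference scripts); the checkpoint name comes from the README, while the prompt and sampling settings are illustrative assumptions.

```python
# Illustrative sketch: FP8-quantized Llama 3.1 405B on a single node with 8x80GB H100,
# sharded across all 8 GPUs via tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,  # split each layer across the node's 8 GPUs
)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)
outputs = llm.generate(["What is FP8 quantization?"], params)
print(outputs[0].outputs[0].text)
```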