|
@@ -30,7 +30,7 @@ The script will ask for another prompt in a loop after completing the generation
|
When using multiple GPUs, the model will automatically be split across the available GPUs using tensor parallelism.
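As an illustrative sketch with vLLM's offline Python API (the model name, prompt, and sampling settings here are placeholders rather than the values used by this folder's script):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards each layer's weights across the given number of GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,
)

sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```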
|
## Multi-node multi-GPU inference
|
-The FP8 quantized veriants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
|
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
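As a sketch of what the single-node case looks like in vLLM's Python API (the `max_model_len` cap is an assumption to leave memory headroom on 8x80GB, not a value taken from this folder's script):

```python
from vllm import LLM

# the pre-quantized FP8 checkpoint fits on one node when sharded across all 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
    max_model_len=4096,  # assumed cap; raise it if your workload needs longer contexts
)
```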
|
To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct), we need multi-node inference.
|
vLLM enables this by leveraging pipeline parallelism across nodes while still applying tensor parallelism inside each node.
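A minimal sketch of the combined setup, assuming a recent vLLM, two nodes with 8 GPUs each, and a Ray cluster that is already running (the parallelism sizes are illustrative, not prescribed by this folder):

```python
from vllm import LLM

# pipeline_parallel_size splits the model's layers across nodes;
# tensor_parallel_size splits each layer across the GPUs within a node
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,        # GPUs per node
    pipeline_parallel_size=2,      # number of nodes
    distributed_executor_backend="ray",
)
```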
|
To start multi-node inference, we first need to set up a Ray server, which will be leveraged by vLLM to execute the model across node boundaries.
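A sketch of the bring-up, assuming the standard Ray CLI; the port and head-node address are placeholders:

```python
# On the head node, first run:    ray start --head --port=6379
# On each worker node, then run:  ray start --address=<head-node-ip>:6379
# (port and address are placeholders; adapt them to your cluster)
import ray

# attach to the running Ray cluster and confirm every node's GPUs are visible
# before launching vLLM across the cluster
ray.init(address="auto")
print(ray.cluster_resources())
```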
|