
Move details of loading LoRA checkpoints from finetuning/LLM_finetuning_overview.md to local_inference/README.md

Himanshu Shukla · 5 months ago · commit 5250a20684

+ 1 - 30
recipes/quickstart/finetuning/LLM_finetuning_overview.md

@@ -61,33 +61,4 @@ To boost the performance of fine-tuning with FSDP, we can make use a number of f
 
 - **Activation Checkpointing**, a technique that saves memory by discarding intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning it must be applied after wrapping the model with FSDP. Our script makes use of this (see the sketch after this list).
 
-- **auto_wrap_policy** Which is the way to specify how FSDP would partition the model, there is default support for transformer wrapping policy. This allows FSDP to form each FSDP unit ( partition of the  model ) based on the transformer class in the model. To identify this layer in the model, you need to look at the layer that wraps both the attention layer and  MLP. This helps FSDP have more fine-grained units for communication that help with optimizing the communication cost.
-
-### Inference
-
-After fine-tuning the model, you can use the `code-merge-inference.py` script to generate text from images. The script supports merging PEFT adapter weights from a specified path.
-
-#### Usage
-
-To run the inference script, use the following command:
-
-```bash
-python code-merge-inference.py \
-    --image_path "path/to/your/image.png" \
-    --prompt_text "Your prompt text here" \
-    --temperature 1 \
-    --top_p 0.5 \
-    --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct" \
-    --hf_token "your_hugging_face_token" \
-    --finetuning_path "path/to/your/finetuned/model"
-```
-
-#### Script Details
-
-The `code-merge-inference.py` script performs the following steps:
-
-1. **Load Model and Processor**: Loads the pre-trained model and processor, and optionally loads PEFT adapter weights if specified.
-2. **Process Image**: Opens and converts the input image.
-3. **Generate Text**: Generates text from the image using the model and processor.
-
-For more details, refer to the `code-merge-inference.py` script.
+- **auto_wrap_policy**, which is the way to specify how FSDP should partition the model; there is default support for the transformer wrapping policy. This allows FSDP to form each FSDP unit (a partition of the model) based on the transformer class in the model. To identify this layer in your model, look for the layer that wraps both the attention layer and the MLP. This gives FSDP more fine-grained units for communication, which helps optimize the communication cost. Both features are illustrated in the sketch below.
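+
+As an illustrative sketch of how the two features above are typically combined (the model name, `LlamaDecoderLayer`, and the exact calls here are assumptions for a Hugging Face Llama model, not necessarily what our training script does):
+
+```python
+# Minimal FSDP sketch: transformer auto-wrap policy plus shard-aware
+# activation checkpointing. Assumes torch.distributed is already initialized.
+import functools
+
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
+from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+    apply_activation_checkpointing,
+    checkpoint_wrapper,
+)
+from transformers import AutoModelForCausalLM
+from transformers.models.llama.modeling_llama import LlamaDecoderLayer
+
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example model
+
+# Each decoder block (which wraps both attention and MLP) becomes one FSDP unit.
+wrap_policy = functools.partial(
+    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
+)
+model = FSDP(model, auto_wrap_policy=wrap_policy)
+
+# Activation checkpointing is shard-aware: apply it AFTER wrapping with FSDP.
+apply_activation_checkpointing(
+    model,
+    checkpoint_wrapper_fn=checkpoint_wrapper,
+    check_fn=lambda m: isinstance(m, LlamaDecoderLayer),
+)
+```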

+ 29 - 0
recipes/quickstart/inference/local_inference/README.md

@@ -114,3 +114,32 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes
 ## Inference on large models like Meta Llama 405B
 The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not support multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
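+
+For orientation, a hedged sketch of what such a vLLM setup can look like (the parallel sizes and the Ray backend are assumptions for a cluster of two nodes with 8 GPUs each; see the linked example for the actual setup):
+
+```python
+# Sketch: unquantized Llama 3.1 405B with vLLM tensor + pipeline parallelism.
+# Multi-node execution assumes a running Ray cluster spanning both nodes.
+from vllm import LLM, SamplingParams
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
+    tensor_parallel_size=8,              # GPUs per node
+    pipeline_parallel_size=2,            # number of nodes
+    distributed_executor_backend="ray",  # required for multi-node
+)
+
+outputs = llm.generate(
+    ["Explain activation checkpointing in one paragraph."],
+    SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128),
+)
+print(outputs[0].outputs[0].text)
+```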
+
+### Inference with LoRA checkpoints
+
+After fine-tuning the model, you can use the `code-merge-inference.py` script to generate text from an image and a text prompt. The script can also merge PEFT (LoRA) adapter weights from a specified fine-tuning path into the base model.
+
+#### Usage
+
+To run the inference script, use the following command:
+
+```bash
+python code-merge-inference.py \
+    --image_path "path/to/your/image.png" \
+    --prompt_text "Your prompt text here" \
+    --temperature 1 \
+    --top_p 0.5 \
+    --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct" \
+    --hf_token "your_hugging_face_token" \
+    --finetuning_path "path/to/your/finetuned/model"
+```
+
+#### Script Details
+
+The `code-merge-inference.py` script performs the following steps:
+
+1. **Load Model and Processor**: Loads the pre-trained model and processor and, if a fine-tuning path is given, loads and merges the PEFT adapter weights.
+2. **Process Image**: Opens and converts the input image.
+3. **Generate Text**: Generates text from the image using the model and processor.
+
+For more details, refer to the `code-merge-inference.py` script.
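+
+As a minimal sketch of these three steps (the class names, prompt formatting, and generation settings below are assumptions based on the standard Hugging Face APIs for Llama 3.2 Vision; the actual script may differ):
+
+```python
+# Hedged sketch of code-merge-inference.py's flow, not the script itself.
+import torch
+from PIL import Image
+from peft import PeftModel
+from transformers import AutoProcessor, MllamaForConditionalGeneration
+
+model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
+
+# 1. Load model and processor; optionally merge PEFT adapter weights.
+model = MllamaForConditionalGeneration.from_pretrained(
+    model_name, torch_dtype=torch.bfloat16, device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_name)
+model = PeftModel.from_pretrained(model, "path/to/your/finetuned/model")
+model = model.merge_and_unload()  # fold the adapter into the base weights
+
+# 2. Process image.
+image = Image.open("path/to/your/image.png").convert("RGB")
+
+# 3. Generate text (the prompt needs the image placeholder token).
+prompt = "<|image|>Your prompt text here"
+inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
+output = model.generate(
+    **inputs, do_sample=True, temperature=1.0, top_p=0.5, max_new_tokens=256
+)
+print(processor.decode(output[0], skip_special_tokens=True))
+```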