@@ -1,5 +1,12 @@
# Local Inference

+For multi-modal inference we have added [multi_modal_infer.py](multi_modal_infer.py), which uses the transformers library.
+
+The script can be run as follows:
+```
+python multi_modal_infer.py --image_path "../responsible_ai/resources/dog.jpg" --input_prompt "Describe this image" --temperature 0.5 --top_p 0.8 --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"
+```
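+
+As a minimal sketch only, the following shows what a multi-modal inference script built on the transformers library might look like; it assumes the MllamaForConditionalGeneration and AutoProcessor classes and is not the contents of multi_modal_infer.py:
+
+```
+# Minimal sketch, not the actual multi_modal_infer.py
+import torch
+from PIL import Image
+from transformers import AutoProcessor, MllamaForConditionalGeneration
+
+model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
+model = MllamaForConditionalGeneration.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+
+# Pair the image with the text prompt in a chat-style message
+image = Image.open("../responsible_ai/resources/dog.jpg")
+messages = [
+    {"role": "user", "content": [
+        {"type": "image"},
+        {"type": "text", "text": "Describe this image"},
+    ]}
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
+
+output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.5, top_p=0.8)
+print(processor.decode(output[0], skip_special_tokens=True))
+```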
+
For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training, the inference script takes different arguments.
If all model parameters were finetuned, the output dir of the training has to be given as the --model_name argument.
In the case of a parameter-efficient method like LoRA, the base model has to be given as --model_name and the output dir of the training as the --peft_model argument.
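
For illustration, the two cases might be invoked as sketched below (values in angle brackets are placeholders; the --prompt_file flag follows the usage shown further down in this document):

```
# Full-parameter finetuning: the training output dir is passed as the model
python inference.py --model_name <training_config.output_dir> --prompt_file <prompt_file>

# Parameter-efficient finetuning (e.g. LoRA): base model plus adapter weights
python inference.py --model_name <base_model_name> --peft_model <training_config.output_dir> --prompt_file <prompt_file>
```
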
@@ -87,4 +94,4 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes

## Inference on large models like Meta Llama 405B
The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
-To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as showed in [this example](../../../3p_integrations/vllm/README.md).
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not support multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
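+
+As a rough, hypothetical illustration only (the linked example is the reference, and a multi-node Ray/vLLM cluster is assumed to be up already), combining both forms of parallelism could look like this:
+
+```
+# Hypothetical 2-node x 8-GPU layout: tensor parallelism within a node, pipeline parallelism across nodes
+vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2
+```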