This folder contains inference examples for Llama 2. So far, we have provided support for three methods of inference:

- The `inference.py` script provides support for Hugging Face accelerate, PEFT, and FSDP fine-tuned models.
- The `vLLM_inference.py` script takes advantage of vLLM's paged attention for low-latency inference (a minimal usage sketch follows this list).
- The `hf-text-generation-inference` folder contains information on Hugging Face Text Generation Inference (TGI).
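As a rough illustration of the vLLM path, the sketch below calls vLLM's offline `LLM`/`SamplingParams` API directly. The model name and sampling values are placeholder assumptions for demonstration, not settings taken from `vLLM_inference.py`.

```python
# Minimal sketch of offline inference with vLLM's paged-attention engine.
# The model id and sampling parameters below are illustrative assumptions,
# not the defaults used by vLLM_inference.py.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # assumed model id
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["Explain paged attention in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput holds one or more completions; print the first.
    print(output.outputs[0].text)
```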
For more in-depth information on inference, including inference safety checks and examples, see the inference documentation here.
We received feedback from the community on our prompt template and we are providing an update to reduce the false refusal rates seen. False refusals occur when the model incorrectly refuses to answer a question that it should, for example due to overly broad instructions to be cautious in how it provides responses.
Based on evaluation and analysis, we recommend removing the system prompt as the default setting. Pull request #104 removes the system prompt as the default option, but still provides an example to help enable experimentation for those who want to use it.
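For concreteness, the sketch below shows one way to build a Llama 2 chat prompt with and without an optional system prompt, using the documented `[INST]`/`<<SYS>>` tags. The helper name and example strings are hypothetical, not code from this repo's `chat_utils.py`.

```python
# Hypothetical helper illustrating the Llama 2 chat prompt format.
# The [INST] and <<SYS>> tags follow the documented Llama 2 template;
# the function name and example text are illustrative only.
from typing import Optional

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    # With the update, no system prompt is inserted unless one is given.
    if system_prompt:
        user_message = B_SYS + system_prompt + E_SYS + user_message
    return f"{B_INST} {user_message.strip()} {E_INST}"

# Default behavior after the change: no system prompt.
print(build_prompt("Write a haiku about inference."))

# Opt-in system prompt for experimentation.
print(build_prompt("Write a haiku about inference.",
                   system_prompt="You are a helpful assistant."))
```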