## Run Llama with H2O for long context inference

The following example runs inference of Llama-2-7b on the XSUM summarization task. The `--enable_h2o_generation` flag enables the H2O algorithm, which keeps only the heavy-hitter and local KV pairs in the cache. Use `--num_heavy_hitter_tokens` to set the number of heavy-hitter KV pairs and `--num_window_length` to set the total KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Also, `--enable_position_rolling` enables position rolling, which assigns each entry its position within the KV cache instead of its position in the original sequence. Enabling position rolling is important when the sequence length exceeds the pretrained context window, e.g., 4K for Llama-2.
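To make the eviction rule concrete, the sketch below shows the heavy-hitter plus local selection for a single attention head. It is a minimal illustration, not the repo's actual cache code in `utils_cache.py`; the function name `h2o_evict` and its tensor layout are assumptions for this example.

```python
import torch

def h2o_evict(keys, values, acc_scores, num_heavy, num_local):
    """Toy H2O eviction for one head (assumed layout: [seq_len, head_dim]).

    Keeps the `num_local` most recent KV pairs plus the `num_heavy`
    earlier pairs with the highest accumulated attention scores
    (the "heavy hitters"). `acc_scores` is [seq_len]: the attention
    mass each cached token has received so far.
    """
    seq_len = keys.shape[0]
    if seq_len <= num_heavy + num_local:
        return keys, values, acc_scores  # still within budget

    # The most recent tokens are always kept (the local window).
    local_idx = torch.arange(seq_len - num_local, seq_len)
    # Rank only the older prefix by accumulated attention score.
    prefix_scores = acc_scores[: seq_len - num_local]
    heavy_idx = torch.topk(prefix_scores, num_heavy).indices
    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], acc_scores[keep]
```

Under position rolling, the kept entries would then be assigned positions 0 through `len(keep) - 1` inside the cache rather than retaining their original sequence positions.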

```bash
# 5-shot inference on 1000 samples from the XSUM dataset; the other input
# option is data/summarization/cnn_dailymail.jsonl.
python generation.py \
    --input-path data/summarization/xsum.jsonl \
    --output-path summarization_output/xsum_h2o.jsonl \
    --model-name meta-llama/Llama-2-7b-hf \
    --enable_h2o_generation
```
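To also cap the cache budget and handle sequences beyond Llama-2's 4K pretrained window, the same script can be invoked with the budget and position-rolling flags described above. A sketch; the 2048/4096 budget values are illustrative assumptions, not defaults documented here:

```bash
python generation.py \
    --input-path data/summarization/xsum.jsonl \
    --output-path summarization_output/xsum_h2o_rolling.jsonl \
    --model-name meta-llama/Llama-2-7b-hf \
    --enable_h2o_generation \
    --num_heavy_hitter_tokens 2048 \
    --num_window_length 4096 \
    --enable_position_rolling
```

With this setting, the cache holds 2048 heavy-hitter KV pairs plus 4096 - 2048 = 2048 local KV pairs.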
