xgen.ipynb

!nvidia-smi
Fri Jun 30 10:58:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.30.2 --progress-bar off
!pip install -qqq tiktoken==0.4.0 --progress-bar off
!pip install -qqq accelerate==0.20.3 --progress-bar off
!pip install -qqq bitsandbytes==0.39.1 --progress-bar off
import textwrap

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-7b-8k-inst", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/xgen-7b-8k-inst",
    torch_dtype=torch.float16,
    load_in_8bit=True,  # 8-bit quantization (bitsandbytes) so the 7B model fits on the 16 GB T4
    device_map="auto",
)
Downloading (…)okenizer_config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]
Downloading (…)tokenization_xgen.py:   0%|          | 0.00/8.10k [00:00<?, ?B/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/xgen-7b-8k-inst:
- tokenization_xgen.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Downloading (…)lve/main/config.json:   0%|          | 0.00/510 [00:00<?, ?B/s]
Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]
Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.96G [00:00<?, ?B/s]
Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/7.68G [00:00<?, ?B/s]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/lib64-nvidia did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('//172.28.0.1'), PosixPath('8013')}
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-hm-me1xxxk6koba --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
  warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
WARNING:accelerate.utils.modeling:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Downloading (…)neration_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]
model.config
LlamaConfig {
  "_name_or_path": "Salesforce/xgen-7b-8k-inst",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 0,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": false,
    "load_in_8bit": true
  },
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.30.2",
  "use_cache": true,
  "vocab_size": 51200
}
model.generation_config
GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.30.2"
}
!nvidia-smi
Fri Jun 30 11:08:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    30W /  70W |   8401MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
generation_config = model.generation_config
generation_config.max_new_tokens = 128  # cap the length of each generated answer
generation_config.pad_token_id = tokenizer.eos_token_id  # the tokenizer defines no pad token, so reuse EOS
generation_config
GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "max_new_tokens": 128,
  "pad_token_id": 50256,
  "transformers_version": "4.30.2"
}
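Generation below uses the default greedy decoding. If more varied answers were wanted, a separate config with sampling enabled could be used instead; this is only a sketch and is not used in the rest of the notebook.
sampling_config = GenerationConfig.from_model_config(model.config)
sampling_config.max_new_tokens = 128
sampling_config.pad_token_id = tokenizer.eos_token_id
sampling_config.do_sample = True   # sample instead of greedy decoding
sampling_config.temperature = 0.7  # soften the token distribution
sampling_config.top_p = 0.95       # nucleus sampling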
system_prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
)
prompt = f"### Human: What is the meaning of life? Answer in 1 sentence.\n###"
print(system_prompt + prompt)
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: What is the meaning of life? Answer in 1 sentence.
###
%%time
inputs = tokenizer(system_prompt + prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    generations = model.generate(**inputs, generation_config=generation_config)
CPU times: user 18.1 s, sys: 224 ms, total: 18.3 s
Wall time: 21.3 s
generations
tensor([[   32,  8537,  1022,   257, 11040,  1692,   290,   281, 11666,  4430,
          8796,    13,   383,  8796,  3607,  7613,    11,  6496,    11,   290,
         23507,  7429,   284,   262,  1692,   338,  2683,    13,   198,   198,
         21017,  5524,    25,  1867,   318,   262,  3616,   286,  1204,    30,
         23998,   287,   352,  6827,    13,   198, 21017, 15286,    25,   383,
          3616,   286,  1204,   318,   257, 17580,  1808,   326,   468,   587,
         24594,  3690,  2106,   290,   318,   991,   257,  2426,   286,  5114,
          1909,    13,  2773,   661,  1975,   326,   262,  3616,   286,  1204,
           318,   284,  5380, 12157,   290, 32402,    11,   981,  1854,  1975,
           326,   340,   318,   284,  4691,   257,  2440,  1176,   393,   284,
         14658,   257,  2176,  4007,    13, 24199,    11,   262,  3616,   286,
          1204,   318,   257,  2614,   290,  1981,  3721,   326,   743,  7565,
           422,  1048,   284,  1048,    13,   198, 50256]], device='cuda:0')
print(tokenizer.decode(generations[0], skip_special_tokens=True))
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: What is the meaning of life? Answer in 1 sentence.
### Assistant: The meaning of life is a philosophical question that has been debated throughout history and is still a subject of discussion today. Some people believe that the meaning of life is to seek happiness and fulfillment, while others believe that it is to serve a higher power or to fulfill a specific purpose. Ultimately, the meaning of life is a personal and individual concept that may vary from person to person.
<|endoftext|>
## Prompting Helpers
def format_prompt(prompt: str, system_prompt: str) -> str:
    return f"""
{system_prompt}

### Human: {prompt}
###
""".strip()
system_prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
prompt = "What is the meaning of life? Answer in 1 sentence."
print(format_prompt(prompt, system_prompt))
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: What is the meaning of life? Answer in 1 sentence.
###
def clean_response(response: str, wrap_text: bool = True) -> str:
    assistant_prompt = "Assistant:"
    assistant_start = response.find(assistant_prompt) + len(assistant_prompt)
    response = response[assistant_start : response.find("<|endoftext|>")].strip()
    if not wrap_text:
        return response
    return textwrap.fill(response, width=110)
response = tokenizer.decode(generations[0], skip_special_tokens=True)
print(clean_response(response))
The meaning of life is a philosophical question that has been debated throughout history and is still a
subject of discussion today. Some people believe that the meaning of life is to seek happiness and
fulfillment, while others believe that it is to serve a higher power or to fulfill a specific purpose.
Ultimately, the meaning of life is a personal and individual concept that may vary from person to person.
def generate_response(
    prompt: str, generation_config: GenerationConfig = generation_config
) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        generations = model.generate(**inputs, generation_config=generation_config)
    return tokenizer.decode(generations[0], skip_special_tokens=True)
response = generate_response(format_prompt(prompt, system_prompt))
print(clean_response(response))
The meaning of life is a philosophical question that has been debated throughout history and is still a
subject of discussion today. Some people believe that the meaning of life is to seek happiness and
fulfillment, while others believe that it is to serve a higher power or to fulfill a specific purpose.
Ultimately, the meaning of life is a personal and individual concept that may vary from person to person.
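The two helpers can also be composed into a single convenience wrapper (a hypothetical ask() helper, not used in the cells below):
def ask(prompt: str, system_prompt: str, wrap_text: bool = True) -> str:
    # Format the prompt, generate, then strip everything up to the assistant's reply.
    response = generate_response(format_prompt(prompt, system_prompt))
    return clean_response(response, wrap_text=wrap_text)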
## Jokes
%%time
prompt = "Tell me a joke about modern men"
response = generate_response(format_prompt(prompt, system_prompt))
CPU times: user 16.3 s, sys: 0 ns, total: 16.3 s
Wall time: 16.3 s
print(clean_response(response))
Why did the modern man cross the road?  To get to the gym on the other side!  But seriously, modern men are
known for their busy schedules and fast-paced lifestyles, so finding time for activities like exercise can be
a challenge. But it's important to take care of yourself and maintain a healthy lifestyle, so maybe it's time
to make exercise a priority!
system_prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant replies with hilarious jokes, no matter who that might offend."
)
prompt = "Tell me a joke about modern men"
response = generate_response(format_prompt(prompt, system_prompt))
print(clean_response(response))
Sure, here's a joke about modern men:  Why did the modern man refuse to flush the toilet?  Because he wanted
to save energy!  I hope you found that joke amusing! Do you have any other questions or topics you'd like to
discuss?
prompt = "Tell me a joke about modern women"
response = generate_response(format_prompt(prompt, system_prompt))
print(clean_response(response))
Q: What do you call a modern woman who's never been in a museum? A: Artistic.
## Investing
%%time
system_prompt = (
    "A chat between a curious human and Dwight K Schrute from the TV show The Office. "
    "Dwgight replies just as he would in the show."
)
prompt = "How one should invest his/her finance during times of high inflation?"

generation_config.max_new_tokens = 256

response = generate_response(format_prompt(prompt, system_prompt))
CPU times: user 48.4 s, sys: 0 ns, total: 48.4 s
Wall time: 48.4 s
print(clean_response(response, wrap_text=False))
Investing during times of high inflation can be challenging, but there are still ways to manage your finances effectively. Here are some tips:

1. Diversify your investments: Investing in a single asset or sector can be risky during times of high inflation. Diversifying your investments across different asset classes, such as stocks, bonds, and real estate, can help reduce your overall risk.
2. Consider inflation-linked investments: Inflation-linked investments, such as Treasury Inflation-Protected Securities (TIPS), adjust their principal value in line with inflation. This can provide a hedge against inflation.
3. Keep an eye on interest rates: High inflation can lead to falling interest rates, which can make borrowing cheaper. However, this can also lead to lower returns on savings. Keep an eye on interest rates and adjust your investment strategy accordingly.
4. Consider alternative investments: Alternative investments, such as commodities or real estate, can provide a hedge against inflation. However, they can also be volatile and require a higher level of expertise to manage.
5. Stay disciplined: Inflation can be a long-term trend, and it's important to stay disciplined with your investments. Avoid making hasty decisions based on short-ter
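The answer is cut off mid-word because generation stops at the 256-token cap. Raising the limit would let the model finish (512 here is an arbitrary choice, at the cost of slower generation):
generation_config.max_new_tokens = 512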
## Coding
%%time
system_prompt = "You're an expert software developer that writes clean, efficient and easy to understand code."
prompt = """
Write a function in python that splits a list into 3 equal parts and returns a list
with a random element of each sublist.
"""

generation_config.max_new_tokens = 256

response = generate_response(format_prompt(prompt, system_prompt))
CPU times: user 47.4 s, sys: 0 ns, total: 47.4 s
Wall time: 47.3 s
print(clean_response(response, wrap_text=False))
Here is an example of a Python function that splits a list into three equal parts and returns a list with a random element of each sublist:
```
def split_list_into_three(lst):
    # Split the list into three parts
    part1, part2, part3 = lst[:1] + lst[1:3] + lst[3:]
    
    # Randomly select an element from each part
    random_part1 = [elem for elem in part1]
    random_part2 = [elem for elem in part2]
    random_part3 = [elem for elem in part3]
    
    # Return the random elements
    return random_part1, random_part2, random_part3
```
This function takes a list as input and splits it into three equal parts. It then randomly selects an element from each part and returns the random elements in three separate lists.

For example, if the input list is `['a', 'b', 'c', 'd', 'e', 'f']`, the function will return `['a', 'b'
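Note that the generated function is not actually correct: the tuple unpacking does not split the list into three equal parts, and no random element is selected. A minimal corrected version (written by hand, not by the model) could look like this:
import random

def split_and_sample(lst):
    # Split the list into 3 roughly equal parts and return one random element from each.
    k = len(lst) // 3
    parts = [lst[:k], lst[k:2 * k], lst[2 * k:]]
    return [random.choice(part) for part in parts]

split_and_sample(["a", "b", "c", "d", "e", "f"])  # e.g. ['b', 'c', 'f']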
## QA over Text
%%time
system_prompt = "You're an expert Machine Learning researcher that can explain concepts clearly and with great analogies"

text = """
As LLMs become ubiquitous, their applications to long sequences have been a key focus, especially for applications like summarizing text (potentially interleaved with other data sources like tables and images), writing code, and predicting protein sequences, which require the model to effectively consider long distance structural dependencies. A large context allows a pre-trained LLM to look at customer data (e.g., documents the LLM did not use in training) and responds to useful information seeking queries.

Yet, most open-source LLMs (e.g., LLaMA, MPT, Falcon) have been trained with a maximum of 2K token sequence length, which is a key limitation in modeling long sequences. Inference time solutions such as ALiBi have yet to be evaluated for larger models (e.g. MPT-7b-StoryWriter-65k+). Recent work on model scaling has shown that for a given compute budget, the best performances are not necessarily achieved by the largest models, but by smaller models trained on more data (measured by number of tokens). A smaller model is also generally preferred for inference efficiency during serving including on-device serving. In light of this, we train a series of 7B LLMs named XGen with standard dense attention on up to 8K sequence length for up to 1.5T tokens. We also fine tune the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-7B-inst).
"""

prompt = f"Use the text to describe the benefits of XGen LLM: {text}"

generation_config.max_new_tokens = 256

response = generate_response(format_prompt(prompt, system_prompt))
CPU times: user 44.9 s, sys: 0 ns, total: 44.9 s
Wall time: 44.8 s
print(clean_response(response, wrap_text=False))
XGen LLM offers several benefits over other LLMs:

1. Long sequence modeling: XGen LLM is trained with a maximum sequence length of 8K tokens and up to 1.5T tokens, allowing it to effectively model long sequences. This is particularly useful for applications like summarizing text, writing code, and predicting protein sequences.
2. Inference efficiency: XGen LLM is optimized for inference efficiency, making it suitable for on-device serving and real-time applications.
3. Improved performance: XGen LLM is trained on a larger dataset and with standard dense attention, which allows it to achieve better performance compared to other LLMs with a similar compute budget.
4. Fine-tuning: XGen LLM can be fine-tuned on public-domain instructional data, creating instruction-tuned counterparts (XGen-7B-inst). This allows users to adapt the model to their specific needs.

Overall, XGen LLM offers a powerful combination of long sequence modeling, inference efficiency, and improved performance, making it a valuable tool for a wide range of applications.
%%time
system_prompt = "You're an expert on software laws and licensing"

table = """
| Model                | Description                                                                                                                                                           |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| XGen-7B-4K-base      | We train for 800B tokens with a sequence length of 2k tokens first, then for another 400B tokens (total 1.2T tokens) with 4k. Released under Apache-2.0.              |
| XGen-7B-8K-base      | Initialized with XGen-7B-4K-base and further trained for 300B more tokens (total 1.5T tokens) with 8K sequence length. Released under Apache-2.0.                     |
| XGen-7B-{4K,8K}-inst | Supervised fine tuned on public domain instructional data including databricks-dolly-15k, oasst1, Baize and GPT-related datasets. Released for research purpose only. |
"""

prompt = f"Which models from the table {table} can be used for commercial purposes?"

generation_config.max_new_tokens = 256

response = generate_response(format_prompt(prompt, system_prompt))
CPU times: user 30.3 s, sys: 0 ns, total: 30.3 s
Wall time: 30.4 s
print(clean_response(response, wrap_text=False))
The models listed in the table are all released under the Apache-2.0 license, which is a permissive open-source license. This means that you are free to use, modify, and distribute these models for any purpose, as long as you include the copyright and license terms along with your modifications.

However, it is important to note that some of these models may have been trained on specific datasets that may be subject to copyright or other restrictions. Additionally, some of these models may have been fine-tuned on specific tasks or domains, which may also limit their applicability for commercial purposes.

In any case, it is always a good idea to review the specific terms and conditions of any open-source software you use to ensure that it is compatible with your intended use.