
Develop Solutions on Intel® Gaudi® AI Accelerators

Pretrain Llama-2-7B Using Megatron-DeepSpeed* with the FP8 Datatype on the Intel® Gaudi® 2 AI Accelerator

Learn how to run pretraining of Meta* Llama-2-7b using the Megatron-DeepSpeed* library on the Intel® Gaudi® AI accelerator. The Megatron-DeepSpeed library reduces memory consumption on the Intel Gaudi AI accelerator when running large language models.

Set up the environment, select parameters, run the workload, and then see a price-performance comparison. The Intel Gaudi AI accelerator supports PyTorch* as the main framework for training (based on the Habana* implementation of DeepSpeed). Additional examples can be found for training large transformer language models such as Llama 2 at scale.

The following steps will let you:

  • Get access to a node for the Intel Gaudi AI accelerator on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch version of the Docker* image for the accelerator.
  • Install prerequisites.
  • Download and preprocess the dataset.
  • Select parameters and run pretraining on the model.

 

Performance Evaluation

Before running the model, look at the performance measurements and price-performance comparison to an equivalent H100 pretraining example. In this case, select the Llama-2-7b parameter model using FP8 with a sequence length of 4,096 and a batch size of 1,024. Use eight Intel Gaudi AI accelerators (Model Performance) and compare this against the same model configuration using the H100 GPU with published inference benchmarks from NVIDIA*.

The following figure shows that tokens per dollar are higher for the Intel Gaudi 2 solution than for the NVIDIA solution.

[Figure: performance-cost comparison between Intel Gaudi 2 and NVIDIA H100]
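As a rough sketch of how such a price-performance figure can be derived, the following computes tokens per dollar from a measured throughput and an hourly instance price. The price used here is a placeholder, not a published rate; substitute real cloud pricing for both platforms.

```python
# Hypothetical price-performance sketch: tokens per dollar from measured
# throughput and an assumed hourly instance price.

def tokens_per_dollar(tokens_per_sec: float, price_per_hour: float) -> float:
    """Tokens processed per dollar of instance time."""
    return tokens_per_sec * 3600 / price_per_hour

# Illustrative inputs only: ~67,200 tok/s on 8 accelerators (the figure
# from the sample run later in this tutorial) and a placeholder price.
gaudi_tpd = tokens_per_dollar(67_200, 10.0)
print(f"{gaudi_tpd:,.0f} tokens per dollar")
```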

Accessing the Intel Gaudi Node in the Intel® Tiber™ AI Cloud

To access an Intel Gaudi node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud Console, open the hardware instances list, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.


The website provides an ssh command to log in to the node. It is advisable to add local port forwarding to that command so you can access a local Jupyter Notebook; for example, use ssh -L 8888:localhost:8888 ... to make the Notebook reachable on your machine.

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.

Docker Setup

With access to the node, start the latest Intel® Gaudi® Docker image with the docker run command, which automatically downloads the image if it is not already present:

docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Start the Docker image and enter the Docker environment by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.

Install Prerequisites

Once in the Docker environment, install the necessary libraries:

Start in the home directory and install the DeepSpeed library:

cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Now install the Hugging Face Optimum for Intel Gaudi library and clone the Megatron-DeepSpeed examples from GitHub, selecting the latest validated release of each:

pip install optimum-habana==1.16.0
git clone -b 1.19.0 https://github.com/HabanaAI/Megatron-DeepSpeed.git

Next, change to the Megatron-DeepSpeed directory and install the requirements needed for training:

cd Megatron-DeepSpeed
pip install -r megatron/core/requirements.txt

Set up the environment paths for Megatron-DeepSpeed:

export MEGATRON_DEEPSPEED_ROOT=`pwd`
export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH

Finally, set Python 3.10 as the default Python version. If Python 3.10 is not the default, replace any call to the python command in your model with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.10

Download Dataset

To download the datasets used for training Llama 2, follow the directions on the Megatron-DeepSpeed GitHub page. This tutorial uses a subset of the OSCAR dataset to pretrain language models and word representations.

It is possible to download the full (500 GB+) OSCAR dataset, or a subset of the dataset can be downloaded for a quick start. These steps are based on the OSCAR dataset repository.

First, clone the dataset repository:

cd ~
git clone https://github.com/bigscience-workshop/bigscience.git
cd bigscience/data/oscar

Next, edit the file oscar-to-jsonl.py. This example downloads the zh (Chinese) subset. In the language_subsets list, uncomment unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en:

### Build/Load Datasets

# Once this part of the process completes it gets cached, so on subsequent runs it'll be much faster
language_subsets = (
     # "unshuffled_deduplicated_ar",
     # "unshuffled_deduplicated_sw",
     "unshuffled_deduplicated_zh",
     # "unshuffled_deduplicated_en",
     # "unshuffled_deduplicated_fr",
     # "unshuffled_deduplicated_pt",
     # "unshuffled_deduplicated_es",
)

Run the Python script to download and preprocess the data. Note the -s option, which downloads only a subset of the dataset for the purposes of this tutorial. This operation can take some time, depending on download speed and hardware:

$PYTHON ./oscar-to-jsonl.py -s

When the above operation completes, the ~/bigscience/data/oscar/ directory will contain the following data files:

-rw-r--r-- 1 root root 66707628 Jul 26 00:38 oscar-0.jsonl
-rw-r--r-- 1 root root 63555928 Jul 26 00:38 oscar-1.jsonl
-rw-r--r-- 1 root root 59082488 Jul 26 00:38 oscar-2.jsonl
-rw-r--r-- 1 root root 63054515 Jul 26 00:38 oscar-3.jsonl
-rw-r--r-- 1 root root 59592060 Jul 26 00:38 oscar-4.jsonl

The next step is to tokenize the dataset. There are several ways to tokenize a dataset; this example uses the GPT2BPETokenizer method (byte-pair encoding).
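To illustrate the core idea behind byte-pair encoding, the following is a minimal toy sketch, not the actual GPT-2 tokenizer: it repeatedly merges the most frequent adjacent pair of symbols into a single symbol, which is how a BPE vocabulary of subword units is built up.

```python
# Toy byte-pair-encoding sketch (illustration only, not GPT2BPETokenizer):
# repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair, or None if fewer than 2 symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merge(text, num_merges):
    """Apply num_merges greedy pair merges to the character sequence of text."""
    symbols = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merge("abababcd", 2))  # → ['abab', 'ab', 'c', 'd']
```

The real tokenizer applies a fixed, pretrained merge table (the gpt2-merges.txt file downloaded below) rather than recomputing frequencies per input.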

According to the directions on the Gaudi Megatron-DeepSpeed GitHub page, the five jsonl files above can be concatenated into a single large file to be tokenized, or each of the five files can be tokenized separately and the five tokenized files then merged. This tutorial processes the smaller files individually to prevent possible host out-of-memory issues.

The GPT2BPETokenizer method is used to tokenize the five jsonl files separately. First, download the GPT-2 gpt2-vocab.json and gpt2-merges.txt files:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

Next, create and execute a shell script as follows. The script tokenizes the individual jsonl files one at a time and writes the tokenized files to the zh_tokenized directory. The number of workers (--workers 16) can be changed according to the number of cores in the CPU being used:

# tokenize individual jsonl files
# loop count will change based on number of files for a given dataset
mkdir zh_tokenized
for i in $(seq 0 4);
do
    $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type GPT2BPETokenizer --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --workers 16
done

After the above operation completes, the zh_tokenized directory contains the following files:

-rw-r--r-- 1 root root 93115006 Jul 26 00:47 tokenized0_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized0_text_document.idx
-rw-r--r-- 1 root root 88055238 Jul 26 00:47 tokenized1_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized1_text_document.idx
-rw-r--r-- 1 root root 82539576 Jul 26 00:47 tokenized2_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized2_text_document.idx
-rw-r--r-- 1 root root 87806904 Jul 26 00:47 tokenized3_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized3_text_document.idx
-rw-r--r-- 1 root root 82680922 Jul 26 00:48 tokenized4_text_document.bin
-rw-r--r-- 1 root root 166862 Jul 26 00:48 tokenized4_text_document.idx

To complete the tokenization step, merge the tokenized dataset files generated above into a single file by running the following commands:

# merge tokenized files
mkdir zh_tokenized_merged
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/merge_datasets.py --input zh_tokenized --output-prefix zh_tokenized_merged/tokenized_text_document

This creates the zh_tokenized_merged directory, which contains the merged files:

ls -lt zh_tokenized_merged
-rw-r--r-- 1 root root 834222 Jul 26 00:57 tokenized_text_document.idx
-rw-r--r-- 1 root root 434197646 Jul 26 00:57 tokenized_text_document.bin

To make pretraining easier, copy the gpt2-merges.txt and gpt2-vocab.json files into the zh_tokenized_merged directory. Using the GPT2BPETokenizer with pretraining requires those files to be in the same directory as the data.

cp gpt2-* zh_tokenized_merged

This completes the dataset downloading and preprocessing steps.

Llama2 7B Training

Write an example script, called run_llama_wrapper.sh, to perform training on Llama 2 7B. The first part of the script enables Habana logging; the log-enablement environment variables are described in the debugging guide documentation:

export LOG_LEVEL_ALL=4
export ENABLE_CONSOLE=true
export HABANA_LOGS=./habana_log

Next, set environment variables for the directories containing the model references and the data used for training:

export MODEL_REFERENCES_ROOT=${MEGATRON_DEEPSPEED_ROOT}
export HL_DATA_DIR_ROOT=~/bigscience/data/oscar/zh_tokenized_merged
export HL_DATA_FILE_PREFIX=tokenized_text_document
export OUT_DIR="Llama2-7B-training"
export HL_HOSTSFILE=/launch/hostsfile
export HL_TOKENIZER_TYPE=GPT2BPETokenizer 

The rest of the script contains variables that will control training:

mkdir -p ${OUT_DIR}

HL_SAVE=0 \
HL_EXIT_INTERVAL=80 \
HL_RESULTS_DIR=${OUT_DIR} \
HL_LOG_INTERVAL=10 \
HL_TOKENIZER_TYPE=${HL_TOKENIZER_TYPE} \
HL_NUM_NODES=1 \
HL_PP=1 HL_TP=1 HL_DP=8 \
HL_DATA_DIR_ROOT=${HL_DATA_DIR_ROOT} \
HL_LLAMA_MODEL_SIZE=7 \
HL_LLAMA_VER=2 \
HL_DATA_FILE_PREFIX=${HL_DATA_FILE_PREFIX} \
HL_ZERO_STAGE=1 \
HL_CKP_ACT=2 \
HL_SEQ_LEN=4096 \
HL_GBS=512 \
HL_USE_FAST_SOFTMAX=1 \
HL_GRAD_ACCUM_DTYPE=bf16 \
HL_USE_TRANSFORMER_ENGINE=1 \
HL_USE_CACHE_FP8_WEIGHT_FWD=1 \
HL_USE_CACHE_FP8_WEIGHT=1 \
${MODEL_REFERENCES_ROOT}/scripts/run_llama.sh 2>&1 | tee ${OUT_DIR}/llama_8x.log

Execute the script to start the training:

./run_llama_wrapper.sh &

FP8 performance enhancements are enabled by setting HL_USE_TRANSFORMER_ENGINE=1. The HL_USE_CACHE_FP8_WEIGHT_FWD=1 and HL_USE_CACHE_FP8_WEIGHT=1 settings improve FP8 performance.

Untested ZeRO Optimizer Errors

If the version of DeepSpeed being used includes an untested ZeRO optimizer, the run may terminate with the following error message:

AssertionError: You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true> in the configuration file to use it.

To bypass this issue, add the following entry to the EOT heredoc that creates the ds_config.json file in the ~/Megatron-DeepSpeed/scripts/run_llama.sh shell script:

  "zero_allow_untested_optimizer": true
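For reference, the following is a minimal sketch of what the resulting heredoc fragment might look like. Every field except zero_allow_untested_optimizer is illustrative; the real ds_config.json written by run_llama.sh contains more settings, and the added key belongs at the top level of the JSON object.

```shell
# Illustrative only: a cut-down ds_config.json written via a heredoc,
# showing the added key at the top level of the JSON object.
cat > /tmp/ds_config_example.json <<EOT
{
  "train_batch_size": 1024,
  "zero_optimization": { "stage": 1 },
  "zero_allow_untested_optimizer": true
}
EOT
grep '"zero_allow_untested_optimizer": true' /tmp/ds_config_example.json
```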

Llama2 7B Training Results

As performance results can vary depending on the hardware used, the results shown in this section are examples, not benchmark results. Detailed performance data for Intel® Gaudi® AI accelerators can be found here.

In a sample run of the run_llama script, the following information is reported in the output log at the end of execution (recall that the run ended after 80 iterations, as specified by the environment variable HL_EXIT_INTERVAL=80):

iteration 80/ 500000 | consumed samples:81920 | consumed tokens:335544320 | elapsed time per iteration (ms): 62373.1 | learning rate:1.200E-05 | global batch size: 1024 | lm loss:3.354671E+00 | loss scale:1.0 | grad norm:4.962 | num zeros:0.0 | actual seqlen:4096 | number of skipped iterations:0 | number of nan iterations:0 | samples per second:16.417 | tokens per gpu per second (tgs):8405.678 | TFLOPs:409.21 |

The total number of tokens per second is:

tokens per gpu per second (tgs) * 8 HPUs ~= 8400 * 8 ~= 67,200 tokens/sec
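This arithmetic can be checked directly from the log values; the tgs and samples-per-second figures below are the ones reported in the sample log above, and the two estimates should agree closely.

```python
# Throughput arithmetic from the sample log line:
# total tokens/sec = tokens per HPU per second (tgs) * number of HPUs,
# cross-checked against samples/sec * sequence length.
tgs = 8405.678           # tokens per gpu per second, from the log
num_hpus = 8
print(f"{tgs * num_hpus:,.0f} tokens/sec")  # ~67,245

samples_per_sec = 16.417  # samples per second, from the log
seq_len = 4096            # actual seqlen, from the log
print(f"{samples_per_sec * seq_len:,.0f} tokens/sec")  # ~67,244
```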

These results align with the published numbers for Intel Gaudi 2.

Next Steps

Now that you have run a pretraining case, you can go back to the Hugging Face* Optimum Habana validated models to see more options for running training or inference.
