Easy Quantization in PyTorch Using Fine-Grained FX

Improve Quantization Productivity with Intel® Neural Compressor and Hugging Face Optimum-Intel

The Issue:

PyTorch provides an FX toolkit for developers to transform a torch.nn.Module into a torch.fx.GraphModule. With the generated GraphModule, FX can execute static quantization by automatically inserting quantize and dequantize operations.

It’s useful to convert an imperative model into a graph model because the latter gives better performance with multiple optimization options such as post-training static quantization.

However, FX cannot trace dynamic control flow automatically, and in many models that control flow blocks the transformation altogether.
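To make the failure mode concrete, here is a toy stand-in for symbolic tracing written in plain Python. It is not torch.fx: the `Proxy`, `trace`, and `TraceError` names are hypothetical and exist only for this sketch. The idea it illustrates is that a tracer feeds placeholder values through the code, and a placeholder has no concrete value to branch on.

```python
class TraceError(Exception):
    """Raised when tracing needs a concrete value that a proxy cannot supply."""


class Proxy:
    """Hypothetical placeholder that records operations instead of computing them."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def _record(self, op, other):
        other_name = other.name if isinstance(other, Proxy) else repr(other)
        out = Proxy(f"v{len(self.graph)}", self.graph)
        self.graph.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

    def __gt__(self, other):
        return self._record("gt", other)

    def __bool__(self):
        # A branch condition needs a real True/False, which is unknown at trace time.
        raise TraceError("proxy used as a control-flow condition")


def trace(fn):
    """Run fn once on a proxy and return the list of recorded operations."""
    graph = []
    fn(Proxy("x", graph))
    return graph


def static_block(x):
    return (x + 1) * 2      # no branching: traces cleanly


def dynamic_block(x):
    if x > 0:               # data-dependent branch: cannot be traced
        return x * 2
    return x + 1


print(trace(static_block))
try:
    trace(dynamic_block)
except TraceError as err:
    print("tracing failed:", err)
```

The branch-free function yields a complete operation list, while the branching one fails as soon as the comparison result is used as a condition.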

The Solution: Fine-Grained FX

Fine-grained FX makes quantization easy to apply to models with dynamic control flow. It is integrated into the pytorch_fx backend of Intel Neural Compressor and supports three popular quantization methods:

  1. post-training dynamic quantization
  2. post-training static quantization
  3. quantization-aware training

PyTorch recommends post-training dynamic quantization for NLP models because its scales and zero-points, which are computed at run time for each input, give stable accuracy after quantization.
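The run-time aspect can be sketched in plain Python: the scale and zero-point are derived from the range of each incoming tensor rather than fixed in advance. This is a minimal sketch of asymmetric 8-bit affine quantization; the helper names are hypothetical and the code is not PyTorch's implementation.

```python
def affine_params(values, qmin=0, qmax=255):
    """Derive scale and zero-point from the observed range of `values`."""
    lo = min(min(values), 0.0)   # keep 0.0 inside the range so it maps exactly
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in qvalues]


# Two activations with very different ranges each get their own parameters.
for activation in ([-1.0, 0.0, 0.5, 1.0], [-20.0, 3.0, 12.5, 31.0]):
    scale, zp = affine_params(activation)
    restored = dequantize(quantize(activation, scale, zp), scale, zp)
    print(scale, zp, restored)
```

Because the parameters follow each input, the round-trip error stays within about one quantization step of that input's own range.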

Post-training static quantization performs quantization based on fixed scales and zero-points. It supports consecutive quantized modules, which avoids redundant quantize and dequantize operations between them.
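A minimal sketch of the fixed-parameter idea, again in plain Python with hypothetical names (`RangeObserver` is not a library class): representative batches are observed once during calibration, the resulting scale and zero-point are frozen, and every later input reuses them.

```python
class RangeObserver:
    """Hypothetical calibration helper that tracks the min/max seen so far."""

    def __init__(self):
        self.lo = 0.0   # start at 0 so the range always contains zero
        self.hi = 0.0

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def params(self, qmin=0, qmax=255):
        if self.hi == self.lo:
            return 1.0, 0
        scale = (self.hi - self.lo) / (qmax - qmin)
        zero_point = int(round(qmin - self.lo / scale))
        return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax] using fixed parameters."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


# Calibration: run representative batches once and freeze the parameters.
observer = RangeObserver()
for batch in ([-1.0, 0.5], [0.2, 3.0]):
    observer.observe(batch)
SCALE, ZERO_POINT = observer.params()

# Inference: every input reuses the frozen parameters; out-of-range values saturate.
print(quantize([0.0, 1.5, 3.0], SCALE, ZERO_POINT))
print(quantize([10.0, -5.0], SCALE, ZERO_POINT))
```

The trade-off is visible in the last line: inputs outside the calibrated range are clipped, which is why the calibration data should be representative.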

In theory, static quantization performs better than dynamic quantization. Quantization-aware training for static quantization adds a training step that adjusts the model weights to reduce quantization loss. It can deliver high accuracy while keeping the performance of static quantization.

Because the imperative model consists of several blocks, fine-grained FX will aggressively and recursively detect these blocks for module transformation.

Two examples are shown below: natural language processing (Figure 1) and object detection (Figure 2).

The darker green blocks are detected as suitable for module transformation because they are the largest blocks without any control flow. We apply the FX toolkit to these blocks and quantize them automatically. When the processed blocks are reassembled using the original control flow, the resulting model keeps the same behavior and runs faster by leveraging INT8.
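The detection step can be sketched as a recursive search over the module tree. The code below is an illustration of the general idea only, not Intel Neural Compressor's actual logic: `Block`, `own_control_flow`, and both functions are hypothetical, and the control-flow flag is set by hand, whereas a real tool would have to discover it (for example, by attempting to trace each block).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a module: a name, child blocks, and a control-flow flag."""
    name: str
    children: List["Block"] = field(default_factory=list)
    own_control_flow: bool = False   # True if this block's own forward code branches on data


def is_traceable(block: Block) -> bool:
    """A block is traceable only if neither it nor any descendant has control flow."""
    return not block.own_control_flow and all(is_traceable(c) for c in block.children)


def largest_traceable_blocks(block: Block) -> List[str]:
    """Return the names of the largest blocks that contain no control flow."""
    if is_traceable(block):
        return [block.name]          # take the whole subtree as one unit
    found: List[str] = []
    for child in block.children:     # otherwise recurse into the children
        found.extend(largest_traceable_blocks(child))
    return found


model = Block("model", own_control_flow=True, children=[
    Block("embeddings"),
    Block("encoder", own_control_flow=True, children=[
        Block("layer0", children=[Block("attention"), Block("mlp")]),
        Block("layer1", children=[Block("attention"), Block("mlp")]),
    ]),
    Block("classifier"),
])

print(largest_traceable_blocks(model))
```

In this toy tree the search stops at `layer0` and `layer1` rather than descending to `attention` and `mlp`, which matches the goal of transforming the largest control-flow-free units.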

Figure 1. Fine-grained FX for BERT natural language processing
 

Figure 2. Fine-grained FX for YOLO-V2 object detection

Adopting Our Solution

Intel® Neural Compressor for NLP

We provide two kinds of examples for natural language processing models based on Hugging Face Transformers. You can easily replace the input model with your own and quantize it based on fine-grained FX:

Hugging Face Optimum-Intel

Optimum-Intel is an extension of Transformers that enables the use of popular compression techniques such as quantization and pruning via Intel Neural Compressor. All tasks in Optimum-Intel support fine-grained FX: language modeling, multiple choice, question answering, summarization, text classification, token classification, and translation.

We also uploaded several INT8 models to the Hugging Face model hub that can be easily initialized and used with Intel Neural Compressor, for example:

# Load a pre-quantized INT8 model from the Hugging Face model hub.
from neural_compressor.utils.load_huggingface import OptimizedModel
int8_model = OptimizedModel.from_pretrained(
    'Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static',
)

| Model Name                              | Approach                           | INT8 Accuracy (accuracy/f1) | FP32 Accuracy (accuracy/f1) | Relative Loss | INT8 Model Size (MB) | FP32 Model Size (MB) | Compression Ratio |
|-----------------------------------------|------------------------------------|-----------------------------|-----------------------------|---------------|----------------------|----------------------|-------------------|
| Albert-base-v2 [INT8/FP32]              | Post training static quantization  | 0.9255                      | 0.9232                      | -0.249%       | 25                   | 44.6                 | 1.784             |
| Bart-large [INT8/FP32]                  | Post training dynamic quantization | 0.9051                      | 0.912                       | 0.757%        | 547                  | 1556.48              | 2.845             |
| Bert-base [INT8/FP32]                   | Post training static quantization  | 0.7838                      | 0.7915                      | 0.973%        | 133                  | 418                  | 3.143             |
| Bert-base [INT8/FP32]                   | Post training dynamic quantization | 0.8997                      | 0.9042                      | 0.498%        | 174                  | 418                  | 2.402             |
| Bert-base [INT8/FP32]                   | Quantize aware training            | 0.9142                      | 0.9042                      | -1.106%       | 107                  | 418                  | 3.907             |
| Bert-base [INT8/FP32]                   | Post training static quantization  | 0.8997                      | 0.9042                      | 0.498%        | 120                  | 418                  | 3.483             |
| Camembert-base [INT8/FP32]              | Post training dynamic quantization | 0.8843                      | 0.8928                      | 0.952%        | 180                  | 422                  | 2.344             |
| Distilbert-base [INT8/FP32]             | Post training static quantization  | 0.9859                      | 0.9882                      | 0.233%        | 64.5                 | 253                  | 3.922             |
| Distilbert-base [INT8/FP32]             | Post training static quantization  | 0.9037                      | 0.9106                      | 0.758%        | 65                   | 255                  | 3.923             |
| Electra-small-discriminator [INT8/FP32] | Post training static quantization  | 0.9007                      | 0.8983                      | -0.267%       | 14                   | 51.8                 | 3.700             |
| Roberta-base [INT8/FP32]                | Post training static quantization  | 0.9247                      | 0.9138                      | -1.193%       | 121                  | 476                  | 3.934             |
| Xlnet-base [INT8/FP32]                  | Post training static quantization  | 0.8893                      | 0.8897                      | 0.045%        | 215                  | 448                  | 2.084             |
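The two derived columns appear to follow simple formulas: relative loss as (FP32 − INT8) / FP32 accuracy, and compression ratio as FP32 size / INT8 size. The article does not state these definitions, so treat them as an assumption inferred from the numbers. A short sketch that reproduces a few rows:

```python
def relative_loss(int8_acc, fp32_acc):
    """Relative accuracy loss of INT8 versus FP32; negative means INT8 scored higher."""
    return (fp32_acc - int8_acc) / fp32_acc


def compression_ratio(int8_mb, fp32_mb):
    """How many times smaller the INT8 model is than the FP32 model."""
    return fp32_mb / int8_mb


rows = [
    # (model, INT8 accuracy, FP32 accuracy, INT8 MB, FP32 MB), copied from the table
    ("Albert-base-v2", 0.9255, 0.9232, 25, 44.6),
    ("Bart-large", 0.9051, 0.912, 547, 1556.48),
    ("Roberta-base", 0.9247, 0.9138, 121, 476),
]
for name, q_acc, f_acc, q_mb, f_mb in rows:
    loss = relative_loss(q_acc, f_acc)
    ratio = compression_ratio(q_mb, f_mb)
    print(f"{name}: {loss:.3%} relative loss, {ratio:.3f}x compression")
```

A negative relative loss, as in the Albert and Roberta rows, means the INT8 model scored slightly higher than its FP32 baseline on that benchmark.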

Future Work

The vision of fine-grained FX is to improve the productivity of PyTorch quantization, especially of the static quantization approach. We are continuously uploading INT8 models to the Hugging Face model hub for quick deployment.

We invite users to:
