Intel® Integrated Performance Primitives Cryptography Acceleration on 3rd Generation Intel® Xeon® Processor Scalable and 10th Gen Intel® Core™ Processors

ID 688842
Updated 3/26/2021
Version Latest
Public


Introduction

Intel® Integrated Performance Primitives (Intel® IPP) Cryptography is a software library that provides a comprehensive set of highly optimized, domain-specific functions. It is a secure, fast and lightweight library of building blocks for cryptography, optimized for various Intel® CPUs. This can provide substantial development and maintenance savings: you can write programs with one optimized execution path instead of maintaining multiple paths (Intel® Streaming Single Instruction Multiple Data (SIMD) Extensions 2, Supplemental Streaming SIMD Extensions 3, Intel® Advanced Vector Extensions, etc.) to achieve optimal performance across multiple generations of processors.

The goal of the Intel® IPP Cryptography software is to provide algorithmic building blocks with

  • a simple "primitive" C interface and data structures to enhance usability and portability
  • faster time-to-market
  • scalability with Intel® hardware

The Intel® IPP Cryptography library is available as part of the Intel® oneAPI Base Toolkit.

The Intel® IPP Cryptography library is also open source. For details about the open source version, please refer to this link.


History of Cryptography Instruction Set

Bulk encryption/decryption, hash functions and public key algorithms constitute the basis of classic cryptography. Until 2010, these algorithms were implemented in software that used only the basic 32-bit and/or 64-bit x86 instruction set or similar. As a result, the implementations spent quite a few CPU cycles on execution. In addition, making implementations of cryptographic algorithms resistant to side-channel attacks only increased their execution time.

In 2010, Intel launched microprocessors based on the Westmere microarchitecture, which expanded the Instruction Set Architecture (ISA) with the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and the carry-less multiplication (CLMUL) instruction. The purpose of Intel® AES-NI is to improve the speed, as well as the resistance to side-channel attacks, of AES-based implementations of standard modes. Together with the CLMUL instruction, they form the basis for the AES Galois/Counter Mode (AES-GCM), which provides confidentiality and authentication simultaneously.

In 2013, Intel introduced hardware acceleration of the Secure Hash Algorithm (SHA), first implemented in the low-power Intel Atom® processor Goldmont microarchitecture. This extension, named SHA-NI, supports the SHA-1 and SHA-256 algorithms.

In 2014, the ADX extension was implemented on the Broadwell microarchitecture. This extension consists of the ADCX and ADOX instructions, which, together with the MULX instruction introduced earlier, are used in multi-precision arithmetic implementations. For example, the best OpenSSL* public key implementations to date are based on MULX accompanied by ADCX and ADOX.


Cryptography-related ISA Extension

The client and server configurations of the microprocessors covered here (10th Gen Intel® Core™ and 3rd Generation Intel® Xeon® Scalable processors) inherit all the cryptographic extensions mentioned above and add further ISA extensions: VAES and VCLMUL instructions, Galois Field New Instructions (GFNI), and IFMA instructions.

VAES and VCLMUL are extensions of the AES-NI and CLMUL instructions, respectively. They extend the existing instructions to 2x128-bit and 4x128-bit vector variants. The VAES instructions perform one round of AES encryption/decryption on each 128-bit lane, using the same or different round key values. The VAES extension helps implement the parallelizable AES modes much more efficiently than legacy AES-NI. The 2x128 and 4x128 vector variants of CLMUL improve the performance of the AES-GCM mode.
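To make the term "carry-less multiplication" concrete, here is a minimal scalar sketch in C of a single 64x64 to 128-bit carry-less product, the operation that CLMUL performs once and its vector variants perform per 128-bit lane. The names `clmul128_t` and `clmul64` are hypothetical and chosen for illustration only; this is a slow reference model, not how Intel® IPP Cryptography implements AES-GCM.

```c
#include <stdint.h>

/* 128-bit carry-less product, split into two 64-bit halves.
   Hypothetical helper type, used for illustration only. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} clmul128_t;

/* Scalar reference model of a 64x64 -> 128-bit carry-less multiplication.
   It is ordinary shift-and-add long multiplication, except that partial
   products are combined with XOR, so no carries propagate between bits. */
clmul128_t clmul64(uint64_t a, uint64_t b)
{
    clmul128_t r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            if (i != 0) {
                /* Bits of a shifted out of the low half land in the high half. */
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}
```

Viewed as polynomials over GF(2), `clmul64(3, 3)` computes (x + 1)^2 = x^2 + 1, so the low half is 5 rather than the integer product 9.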

Galois Field New Instructions (GFNI) comprise three instructions: GF2P8AFFINEQB, GF2P8AFFINEINVQB and GF2P8MULB. GF2P8AFFINEQB and GF2P8AFFINEINVQB compute an affine transformation in GF(2^8): the first applies the transformation to an element x of GF(2^8), and the second applies it to the inverse 1/x of x. GF2P8MULB computes the product of two elements x and y of GF(2^8). All three use the GF(2^8) generated by the polynomial g(x) = x^8 + x^4 + x^3 + x + 1, which matches the AES algorithm. Because all fields GF(2^8) are isomorphic, these instructions help implement algorithms that involve affine transformations and multiplication over any GF(2^8); in particular, they help implement the SM4 algorithm.
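The field arithmetic behind these instructions can be sketched per byte in plain C. The block below is a scalar reference model of multiplication in GF(2^8) modulo g(x) = x^8 + x^4 + x^3 + x + 1 (the operation GF2P8MULB applies to each byte) and of the inversion 1/x that precedes the affine step in GF2P8AFFINEINVQB. The function names `gf256_mul` and `gf256_inv` are hypothetical, and the affine bit-matrix step itself is deliberately left out.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B).
   Hypothetical scalar model of what GF2P8MULB computes for one byte pair. */
uint8_t gf256_mul(uint8_t x, uint8_t y)
{
    uint8_t acc = 0;
    for (int i = 0; i < 8; i++) {
        if (y & 1u) {
            acc ^= x;                 /* add (XOR) the current multiple of x */
        }
        uint8_t carry = x & 0x80u;    /* will x * 2 overflow degree 7?       */
        x = (uint8_t)(x << 1);
        if (carry) {
            x ^= 0x1Bu;               /* reduce modulo the field polynomial  */
        }
        y >>= 1;
    }
    return acc;
}

/* Multiplicative inverse computed as x^254, since x^255 = 1 for x != 0.
   Zero maps to zero, matching the convention used by AES. */
uint8_t gf256_inv(uint8_t x)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) {
        r = gf256_mul(r, x);
    }
    return r;
}
```

As a check against the worked example in the AES specification (FIPS 197), `gf256_mul(0x57, 0x83)` yields `0xC1`.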

The IFMA extension consists of two instructions, VPMADD52LUQ and VPMADD52HUQ, which multiply packed unsigned 52-bit integers and add the low or high 52 bits of the product to a 64-bit accumulator. These instructions are supported in three forms: 2x64, 4x64 and 8x64. The extension targets multi-precision arithmetic, primarily multiplicative operations. It helps implement public key cryptography algorithms efficiently (RSA and elliptic curve based encryption and signing operations).
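The per-lane behavior of the two IFMA instructions can be modeled in scalar C as follows. This is a sketch under stated assumptions: the names `madd52lo` and `madd52hi` are hypothetical, and the code relies on the `unsigned __int128` type, a GCC/Clang extension, to hold the 104-bit product.

```c
#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* Model of one 64-bit lane of VPMADD52LUQ: multiply the low 52 bits of b
   and c, then add the low 52 bits of the 104-bit product to acc. */
uint64_t madd52lo(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + ((uint64_t)p & MASK52);
}

/* Model of one 64-bit lane of VPMADD52HUQ: same product, but the high
   52 bits (bits 103..52) are added to acc. */
uint64_t madd52hi(uint64_t acc, uint64_t b, uint64_t c)
{
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return acc + (uint64_t)(p >> 52);
}
```

Because each product half is at most 52 bits wide while the accumulator is 64 bits wide, a multi-precision routine that stores numbers in 52-bit limbs can add up many partial products before it has to propagate carries, which is what makes this representation attractive for RSA-sized multiplications.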

Intel® IPP Crypto Library

Intel® IPP Crypto Library focuses on efficient implementation and optimization of basic cryptography algorithms. Enabling new Intel ISA in cryptography helps improve performance and is considered an important part of Intel® IPP Crypto development. In the Intel® IPP Crypto 2020 Update 3 release, the new ISA has been implemented and enabled in all three directions: bulk encryption, hashes (SHA-1 and SHA-256) and RSA encryption. The performance difference between Intel® IPP Crypto without and with the new ISA enabled is presented in Tables 1 through 3 later in this article. In both cases the benchmark was performed on the same microprocessor.

Since 2020, another cryptography library, Crypto Multi-Buffer (Crypto MB), has been delivered together with the Intel® IPP Crypto library. Unlike Intel® IPP Crypto, Crypto MB focuses on parallel processing of 8 independent cryptographic requests and is aimed at server and cloud applications. It can be used as a standalone library or together with the Intel® QuickAssist Technology (Intel® QAT) Engine. By itself, the "multi-buffer" approach provides advantages that complement the new ISA enabling. The performance difference between OpenSSL* 1.1.1 and Crypto MB is presented in Tables 4 through 6 later in this article. In both cases the benchmark was performed on the client and server microprocessors.

 

Enabling Results and Conclusion

The following computer platforms and library versions were used for the measurements:

  • Intel® Core™ i7-1065G7 CPU @ 1.30GHz, L1d=192KiB, L1i=128KiB, L2=2MiB, L3=8MiB running Ubuntu* 20.04.1
  • Intel® Xeon® CPU @ 2.2GHz, Ice Lake Server, L1d=48K, L1i=32K, L2=1280K, L3=36864K running RedHat* 8.1
  • Intel® IPP Cryptography 2020 Update 3
  • OpenSSL* 1.1.1

Intel® IPP Crypto performance results for the AES128, SM4, SHA-1 and SHA-256 algorithms are presented in CPU cycles/byte. Performance results for RSA-2048 are presented in CPU cycles/operation.

Algorithm        Length, Bytes   w/o New ISA Enabled   New ISA Enabled   Units
AES128-DEC-CBC   1024            0.125                 0.0586            cycles/byte
AES128-DEC-CBC   2048            0.124                 0.0601            cycles/byte
AES128-DEC-CBC   4096            0.122                 0.0605            cycles/byte
AES128-CTR       1024            0.159                 0.0811            cycles/byte
AES128-CTR       2048            0.154                 0.0713            cycles/byte
AES128-CTR       4096            0.152                 0.068             cycles/byte
AES128-GCM       1024            0.505                 0.136             cycles/byte
AES128-GCM       2048            0.479                 0.119             cycles/byte
AES128-GCM       4096            0.465                 0.109             cycles/byte
AES128-XTS       1024            0.174                 0.0996            cycles/byte
AES128-XTS       2048            0.166                 0.0806            cycles/byte
AES128-XTS       4096            0.158                 0.071             cycles/byte
SM4-DEC-CBC      1024            2.15                  0.375             cycles/byte
SM4-DEC-CBC      2048            1.9                   0.369             cycles/byte
SM4-DEC-CBC      4096            1.83                  0.367             cycles/byte
SM4-CTR          1024            2.22                  0.434             cycles/byte
SM4-CTR          2048            2.19                  0.42              cycles/byte
SM4-CTR          4096            2.18                  0.414             cycles/byte

Table 1. Performance of AES128 and SM4 block ciphers with Intel® IPP Crypto with and without new ISA.

Algorithm   Length, Bytes   w/o New ISA Enabled   New ISA Enabled   Units
SHA-1       1024            1.59                  0.896             cycles/byte
SHA-1       2048            1.52                  0.857             cycles/byte
SHA-1       4096            1.49                  0.838             cycles/byte
SHA-256     1024            3.33                  1.11              cycles/byte
SHA-256     2048            3.22                  1.08              cycles/byte
SHA-256     4096            3.17                  1.05              cycles/byte

Table 2. Performance of SHA-1 and SHA-256 hash functions with Intel® IPP Crypto with and without new ISA.

RSA-2048              w/o New ISA Enabled   New ISA Enabled   Units
private exp (crt)     760563                404080            cycles/op
public exp, e=65537   20168                 12266             cycles/op

Table 3. Performance of RSA-2048 public and private key operations with Intel® IPP Crypto with and without new ISA.

For Crypto MB, the performance comparison with OpenSSL* 1.1.1 is presented below. Again, both OpenSSL* and Crypto MB were benchmarked on client and server. Because OpenSSL* does not show a difference between runs on client and server, only one number is given in the OpenSSL* column. In contrast, the performance of Crypto MB varies depending on the target CPU, even though the same code runs in both cases.
For the public key algorithms (RSA and EC), the results below are presented in CPU cycles/operation. It is important to note that OpenSSL* performs a single RSA or EC operation, whereas Crypto MB performs 8 similar operations. So, for a fair comparison with OpenSSL*, the Crypto MB figures should be divided by 8.
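The division by 8 described above can be written out as a small C helper. This is an illustrative sketch only: the function names are hypothetical, and the figures quoted in the note after the code are the RSA-2048 private (crt) entries from Table 4.

```c
/* Crypto MB processes 8 independent requests per call, so the reported
   cycle count covers a batch of 8 operations. */
#define MB_BATCH 8.0

/* Effective Crypto MB cost of one operation, in cycles. */
double mb_cycles_per_op(double batch_cycles)
{
    return batch_cycles / MB_BATCH;
}

/* Ratio of the single-operation cost (for example OpenSSL*) to the
   effective per-operation cost of Crypto MB; values above 1 mean that
   Crypto MB is faster per operation. */
double speedup_vs_single(double single_op_cycles, double batch_cycles)
{
    return single_op_cycles / mb_cycles_per_op(batch_cycles);
}
```

For RSA-2048 private (crt) on the server, `mb_cycles_per_op(2051226.0)` gives 256403.25 cycles per operation, compared with 2027312 cycles for the single OpenSSL* operation, a ratio of roughly 7.9.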

Operation                  OpenSSL*   Crypto MB, Client   Crypto MB, Server   Units
RSA-2048, public e=65537   59491      123188              87542               cycles/op
RSA-3072, public e=65537   125746     267623              186446              cycles/op
RSA-4096, public e=65537   216909     463649              321544              cycles/op
RSA-2048, private (crt)    2027312    3592084             2051226             cycles/op
RSA-3072, private (crt)    6282572    14866835            8521718             cycles/op
RSA-4096, private (crt)    14066625   29022226            18969184            cycles/op

Table 4. Performance of RSA-2048/3072/4096 public and private key operations in OpenSSL* 1.1.1 and Crypto MB on client and server.

Operation     Curve    OpenSSL*   Crypto MB, Client   Crypto MB, Server   Units
ECDH          P256     173376     368513              328036              cycles/op
ECDH          P384     2981591    1326145             1457200             cycles/op
ECDH          P521     1033005    2041656             1602712             cycles/op
ECDH          X25519   122354     190977              137273              cycles/op
ECDSA, sign   P256     73031      142040              131474              cycles/op
ECDSA, sign   P384     3129416    511742              547092              cycles/op
ECDSA, sign   P521     895759     897758              747956              cycles/op

Table 5. Performance of ECDH and ECDSA sign over different EC in OpenSSL* 1.1.1 and Crypto MB on client and server.

Length, Bytes   OpenSSL*   Crypto MB, Client
64              36.7       2.8
1024            15.7       1.25
8192            13.9       1.25

Table 6. Performance of SM3 implementation in OpenSSL 1.1.1 and Crypto MB on client.