Introduction
Intel® Integrated Performance Primitives (Intel® IPP) Cryptography is a software library that provides a comprehensive set of highly optimized, application domain-specific functions. It is a secure, fast, and lightweight library of building blocks for cryptography, highly optimized for various Intel® CPUs. This can provide significant development and maintenance savings: you can write programs with one optimized execution path instead of maintaining multiple paths (Intel® Streaming Single Instruction Multiple Data (SIMD) Extensions 2, Supplemental Streaming SIMD Extensions 3, Intel® Advanced Vector Extensions, etc.) to achieve optimal performance across multiple generations of processors.
The goal of the Intel® IPP Cryptography software is to provide algorithmic building blocks with
- a simple "primitive" C interface and data structures to enhance usability and portability
- faster time-to-market
- scalability with Intel® hardware
Intel® IPP Cryptography library is available as part of the Intel® oneAPI Base Toolkit.
Intel® IPP Cryptography library is also open sourced. For details about the open source version, please refer to this link.
History of Cryptography Instruction Set
Bulk encryption/decryption, hash functions, and public key algorithms constitute the basis of classic cryptography. Until 2010, these algorithms were implemented in software that used only the basic 32-bit (x86) and/or 64-bit (x64) instruction set or similar. As a result, the implementations spent quite a few CPU cycles on execution. In addition, making implementations of cryptographic algorithms resistant to side-channel attacks only increased their execution time.
In 2010, Intel launched microprocessors based on the Westmere microarchitecture, which expanded the Instruction Set Architecture (ISA) with the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and the carry-less multiplication (CLMUL) instruction. The purpose of Intel® AES-NI is to improve the speed, as well as the resistance to side-channel attacks, of AES-based implementations of standard modes. Together with the CLMUL instruction, they form the basis for the AES Galois/Counter Mode (AES-GCM), which provides confidentiality and authentication simultaneously.
In 2013, hardware acceleration of the Secure Hash Algorithm (SHA) was introduced; it was initially implemented in the low-power Intel Atom® processor Goldmont microarchitecture. This extension, named SHA-NI, supports the SHA-1 and SHA-256 algorithms.
In 2014, the ADX extension was implemented on the Broadwell microarchitecture. This extension consists of the ADCX and ADOX instructions which, together with the MULX instruction implemented earlier, are used in multi-precision arithmetic implementations. For example, to date the best OpenSSL* public key implementations are based on MULX with the accompanying ADCX and ADOX instructions.
Cryptography-related ISA Extension
The client and server configurations of the microprocessor inherit all the cryptographic extensions mentioned above and contain additional ISA extensions. Among these additional extensions are the VAES and VCLMUL instructions, Galois Field New Instructions (GFNI), and IFMA instructions.
VAES and VCLMUL are extensions of the AES-NI and CLMUL instructions, respectively. They extend the existing instructions to 2x128 and 4x128 vector variants. The VAES instructions perform one round of AES encryption/decryption using the same or different round key value(s). The VAES extension helps to implement the parallelizable AES modes much more efficiently than legacy AES-NI. The 2x128 and 4x128 vector variants of CLMUL improve the performance of the AES-GCM mode.
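To make the vector variants concrete, the following Python sketch models carry-less multiplication and its application to several independent lanes. It is a simplified software model, not the instruction definition: the names `clmul64` and `clmul_lanes` are hypothetical, and the immediate operand that the real instructions use to select which 64-bit halves are multiplied is omitted.

```python
MASK64 = (1 << 64) - 1


def clmul64(a, b):
    """Carry-less product of two 64-bit values (polynomial product over GF(2))."""
    a &= MASK64
    b &= MASK64
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i  # partial products are combined with XOR, so no carries
    return result


def clmul_lanes(a_lanes, b_lanes):
    """Apply the same carry-less multiply to every lane independently."""
    return [clmul64(a, b) for a, b in zip(a_lanes, b_lanes)]


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 0b11 * 0b11 gives 0b101.
single = clmul64(0b11, 0b11)

# Four lanes model the 4x128 variant: one call, four independent products.
four_lanes = clmul_lanes([0b11, 0b101, 0xFF, 1], [0b11, 0b11, 0, 0xABCD])
```

The point of the wider variants is visible in `clmul_lanes`: the operation per lane is unchanged, only the number of lanes handled per instruction grows.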
Galois Field New Instructions (GFNI) comprise three instructions: GF2P8AFFINEQB, GF2P8AFFINEINVQB, and GF2P8MULB. GF2P8AFFINEQB and GF2P8AFFINEINVQB compute an affine transformation in GF(2^8): the first applies the affine transformation to an element x of GF(2^8), and the second applies it to the inverse 1/x of x. The last, GF2P8MULB, computes the product of elements x and y of GF(2^8). All three use the GF(2^8) generated by the polynomial g(x) = x^8 + x^4 + x^3 + x + 1, which matches the AES algorithm. Because all representations of GF(2^8) are isomorphic, these instructions help implement algorithms involving affine transformations and multiplication over any GF(2^8). In particular, they help in implementing the SM4 algorithm.
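The field arithmetic behind these instructions can be sketched in a few lines of Python. This is a bit-by-bit software illustration of multiplication and inversion in GF(2^8) with the polynomial g(x) given above, not a model of the instruction encodings; the helper names `gf_mul` and `gf_inv` are hypothetical, and mapping the inverse of 0 to 0 follows the AES convention.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(x, y):
    """Multiply two elements of GF(2^8), reducing modulo the AES polynomial."""
    result = 0
    for _ in range(8):
        if y & 1:
            result ^= x          # add the current multiple of x
        y >>= 1
        x <<= 1                  # multiply x by the field element "x"
        if x & 0x100:
            x ^= AES_POLY        # reduce back to 8 bits
    return result


def gf_inv(x):
    """Multiplicative inverse in GF(2^8), computed as x^254; 0 maps to 0."""
    if x == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result


# Worked example from the AES specification (FIPS 197): {57} * {83} = {c1}.
product = gf_mul(0x57, 0x83)
inverse = gf_inv(0x53)
```

The exponent 254 works because the nonzero elements form a multiplicative group of order 255, so x^254 * x = x^255 = 1.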
The IFMA extension consists of two instructions, VPMADD52LUQ and VPMADD52HUQ, which multiply packed unsigned 52-bit integers and accumulate the low or high 52 bits of the product in a 64-bit accumulator. These instructions are supported in three forms: 2x64, 4x64, and 8x64. The target of this extension is multi-precision arithmetic, primarily multiplicative operations. Using this extension helps to efficiently implement public key cryptography algorithms (RSA and elliptic curve based encryption and signing operations).
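As an illustration of why a 52-bit multiply-accumulate suits multi-precision arithmetic, the sketch below emulates one lane of each instruction on Python integers and uses them for a schoolbook product of numbers split into 52-bit limbs. The function names (`madd52lo`, `madd52hi`, `to_limbs`, `from_limbs`, `mul_radix52`) are hypothetical, and the scalar loops stand in for what the hardware does on 2, 4, or 8 lanes at once.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """One lane: add the low 52 bits of a*b to a 64-bit accumulator."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """One lane: add the high 52 bits of the 104-bit product a*b to a 64-bit accumulator."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod >> 52)) & MASK64


def to_limbs(n, count):
    """Split n into `count` limbs of 52 bits, least significant first."""
    return [(n >> (52 * i)) & MASK52 for i in range(count)]


def from_limbs(limbs):
    """Recombine limbs; limbs may exceed 52 bits (carries not yet propagated)."""
    return sum(limb << (52 * i) for i, limb in enumerate(limbs))


def mul_radix52(x_limbs, y_limbs):
    """Schoolbook product; the 12 spare bits per 64-bit slot absorb the carries."""
    out = [0] * (len(x_limbs) + len(y_limbs))
    for i, x in enumerate(x_limbs):
        for j, y in enumerate(y_limbs):
            out[i + j] = madd52lo(out[i + j], x, y)
            out[i + j + 1] = madd52hi(out[i + j + 1], x, y)
    return out


x = (1 << 200) - 12345
y = (1 << 199) + 67890
product = from_limbs(mul_radix52(to_limbs(x, 4), to_limbs(y, 4)))
```

Because each partial result is at most 52 bits wide and the accumulator is 64 bits wide, many partial products can be summed before any carry propagation is needed, which is the design choice that makes this radix attractive.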
Intel® IPP Crypto Library
The Intel® IPP Crypto Library focuses on efficient implementation and optimization of basic cryptography algorithms. Enabling new Intel ISA in cryptography helps improve performance and is considered an important activity of Intel® IPP Crypto development. In the Intel® IPP Crypto 2020 Update 3 release, the new ISA has been implemented and enabled in all three directions: bulk encryption, hashes (SHA-1 and SHA-256), and RSA encryption. The performance difference between non-enabled and enabled Intel® IPP Crypto is presented in Tables 1 through 3 later in this article. In both cases the benchmark was performed on the microprocessor.
Since 2020, another cryptography library, called Crypto Multi-Buffer (MB), is delivered together with the Intel® IPP Crypto library. Unlike Intel® IPP Crypto, Crypto MB focuses on parallel processing of 8 independent cryptographic requests and is aimed at supporting server and cloud applications. It can be used as a standalone library or together with the Intel® QuickAssist Technology (Intel® QAT) Engine. By itself, the “multi-buffer” approach provides advantages that complement enabling. The performance difference between OpenSSL* 1.1.1 and Crypto MB is presented in Tables 4 through 6 later in this article. In both cases the benchmark was performed on the client and server microprocessor.
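The multi-buffer idea can be sketched conceptually as running one sequence of operations over 8 independent inputs in lockstep. The Python below is only an illustration of that processing pattern, using modular exponentiation as the example; the function `modexp_mb8` is hypothetical and is not the Crypto MB API, and the list comprehensions stand in for operations that a real implementation would perform on 8 vector lanes at once.

```python
def modexp_mb8(bases, exponent, moduli):
    """Left-to-right square-and-multiply applied to 8 independent requests."""
    assert len(bases) == len(moduli) == 8
    acc = [1] * 8
    for bit in bin(exponent)[2:]:
        # The same step is applied to all 8 requests before moving on.
        acc = [(a * a) % n for a, n in zip(acc, moduli)]
        if bit == "1":
            acc = [(a * b) % n for a, b, n in zip(acc, bases, moduli)]
    return acc


# Eight unrelated requests that share only the public exponent e = 65537.
bases = [2, 3, 5, 7, 11, 13, 17, 19]
moduli = [1000003, 999983, 1000033, 1000037, 1000039, 1000081, 1000099, 1000117]
results = modexp_mb8(bases, 65537, moduli)
```

Sharing the exponent keeps the control flow identical across all 8 requests, which is what lets a single instruction stream serve every lane.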
Enabling Results and Conclusion
The following computer platforms and library versions were used for the measurements:
- Intel® Core™ i7-1065G7 CPU @ 1.30GHz, L1d=192KiB, L1i=128KiB, L2=2MiB, L3=8MiB running with Ubuntu* 20.04.1
- Intel® Xeon® Platinum 8368 CPU @ 2.40GHz, L1d=48K, L1i=32K, L2=1280K, L3=58368K running with RedHat* 8.1
- Intel® IPP Cryptography 2020 Update 3
- OpenSSL* 1.1.1
Intel® IPP Crypto performance results for the AES128, SM4, SHA-1, and SHA-256 algorithms are presented in CPU cycles/byte. Performance results for RSA are presented in CPU cycles/operation.
Length, Bytes | w/o New ISA Enabled | New ISA Enabled | Units |
AES128-DEC-CBC | |||
1024 | 0.326 | 0.148 | cycles/byte |
2048 | 0.321 | 0.152 | cycles/byte |
4096 | 0.318 | 0.154 | cycles/byte |
AES128-CTR | |||
1024 | 0.416 | 0.206 | cycles/byte |
2048 | 0.401 | 0.182 | cycles/byte |
4096 | 0.394 | 0.169 | cycles/byte |
AES128-GCM | |||
1024 | 1.35 | 0.344 | cycles/byte |
2048 | 1.31 | 0.291 | cycles/byte |
4096 | 1.26 | 0.262 | cycles/byte |
SM4-DEC-CBC | |||
1024 | 5.24 | 0.92 | cycles/byte |
2048 | 4.59 | 0.918 | cycles/byte |
4096 | 4.59 | 0.928 | cycles/byte |
SM4-CTR | |||
1024 | 5.3 | 1.1 | cycles/byte |
2048 | 5.25 | 1.08 | cycles/byte |
4096 | 5.23 | 1.06 | cycles/byte |
Table 1. Performance of AES128 and SM4 block ciphers with Intel® IPP Crypto with and without new ISA.
Length, Bytes | w/o New ISA Enabled | New ISA Enabled | Units |
SHA-1 | |||
1024 | 4.12 | 2.3 | cycles/byte |
2048 | 3.98 | 2.18 | cycles/byte |
4096 | 3.9 | 2.12 | cycles/byte |
SHA-256 | |||
1024 | 8.71 | 2.88 | cycles/byte |
2048 | 8.44 | 2.73 | cycles/byte |
4096 | 8.3 | 2.66 | cycles/byte |
Table 2. Performance of SHA-1 and SHA-256 hash functions with Intel® IPP Crypto with and without new ISA.
Operation | w/o New ISA Enabled | New ISA Enabled | Units |
RSA-2048 | |||
private exp (crt) | 1978124 | 1064056 | cycles/op |
public exp, e=65537 | 51806 | 32822 | cycles/op |
RSA-3072 | |||
private exp (crt) | 6596848 | 4934034 | cycles/op |
public exp, e=65537 | 111736 | 52076 | cycles/op |
RSA-4096 | |||
private exp (crt) | 14880000 | 8086920 | cycles/op |
public exp, e=65537 | 196042 | 70724 | cycles/op |
Table 3. Performance of public and private key RSA-2048/3072/4096 operations with Intel® IPP Crypto with and without new ISA.
As for Crypto MB, the performance comparison with OpenSSL* 1.1.1 is presented below. Again, both OpenSSL* and Crypto MB were benchmarked on client and server. Because OpenSSL* shows no difference between runs on client and server, only one number is presented in the OpenSSL* column. In contrast, the performance of Crypto MB varies depending on the target CPU, even though the same code runs in both cases.
The results below are presented in CPU cycles/operation for the public key algorithms (RSA and EC). It is important to note that OpenSSL* performs a single RSA or EC operation, whereas Crypto MB performs 8 similar operations. So, for a fair comparison with OpenSSL*, the Crypto MB data should be divided by 8.
Operation | OpenSSL* | Crypto MB, Client | Crypto MB, Server | Units |
RSA-2048, public e=65537 | 60090 | 123188 | 109932 | cycles/op |
RSA-3072, public e=65537 | 127410 | 267623 | 244976 | cycles/op |
RSA-4096, public e=65537 | 219198 | 463649 | 434724 | cycles/op |
RSA-2048, private (crt) | 2067784 | 3592084 | 2636810 | cycles/op |
RSA-3072, private (crt) | 6294468 | 14866835 | 14485652 | cycles/op |
RSA-4096, private (crt) | 14168714 | 29022226 | 23385406 | cycles/op |
Table 4. Performance of public and private key RSA-2048/3072/4096 operations in OpenSSL* 1.1.1 and Crypto MB on client and server.
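The division by 8 described before Table 4 can be made explicit with a short calculation. The sketch below uses the RSA-2048 private (crt) row of Table 4; the helper names `per_operation_cycles` and `speedup` are hypothetical and exist only for this illustration.

```python
BATCH_SIZE = 8  # Crypto MB figures cover 8 operations per measurement


def per_operation_cycles(batch_cycles, batch_size=BATCH_SIZE):
    """Cost of one operation when `batch_size` operations are processed together."""
    return batch_cycles / batch_size


def speedup(openssl_cycles, batch_cycles, batch_size=BATCH_SIZE):
    """Ratio of the single-operation OpenSSL* cost to the normalized Crypto MB cost."""
    return openssl_cycles / per_operation_cycles(batch_cycles, batch_size)


# RSA-2048, private (crt), taken from Table 4.
openssl_rsa2048_private = 2067784
crypto_mb_rsa2048_private = {"client": 3592084, "server": 2636810}

for target, cycles in crypto_mb_rsa2048_private.items():
    normalized = per_operation_cycles(cycles)
    ratio = speedup(openssl_rsa2048_private, cycles)
    print(f"{target}: {normalized} cycles/op, {ratio:.2f}x vs OpenSSL*")
```

With these figures, the normalized Crypto MB cost on the server is roughly a sixth of the single-operation OpenSSL* cost.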
Algorithm | Curve | OpenSSL* | Crypto MB, Client | Crypto MB, Server | Units |
ECDH | P256 | 177496 | 368513 | 340726 | cycles/op |
ECDH | P384 | 2938032 | 1326145 | 1482418 | cycles/op |
ECDH | P521 | 1107801 | 2041656 | 1654938 | cycles/op |
ECDH | X25519 | 118727 | 190977 | 140426 | cycles/op |
ECDSA, sign | P256 | 75207 | 142040 | 136626 | cycles/op |
ECDSA, sign | P384 | 3110865 | 511742 | 564720 | cycles/op |
ECDSA, sign | P521 | 951580 | 897758 | 787000 | cycles/op |
Table 5. Performance of ECDH and ECDSA sign over different elliptic curves in OpenSSL* 1.1.1 and Crypto MB on client and server.
Length, Bytes | OpenSSL* | Crypto MB, Client |
64 | 36.7 | 2.8 |
1024 | 15.7 | 1.25 |
8192 | 13.9 | 1.25 |
Table 6. Performance of the SM3 implementation in OpenSSL* 1.1.1 and Crypto MB on client.