Industry background: Industry clouds are developing rapidly while open source is becoming a major trend
The growth of large-scale cloud deployments across government, finance, education, healthcare, and other industries raises both the importance of data center operation and management and the requirements placed on them. The expanding scale of industry clouds places demanding requirements on the construction and operation of data centers. OpenStack is also becoming part of cloud environments within organizations. Adoption of open source technology suffers when resource management and automated service delivery for IaaS fall short.
For private clouds in government, finance, healthcare, and other industries, the emergence of open source technologies has helped deliver the agility, efficiency, scalability, and control that IT systems require, and has helped industry clouds better control the cost of building and maintaining those systems. As a result, exploring and applying open source technologies has become a major trend in industry clouds.
Open source technologies, such as OpenStack*, have become widely recognized for their utility in the digital transformation of enterprises. The "2018 China SDC (software-defined compute) software market report" published by SDC predicted that from 2018 to 2023 the compound annual growth rate of OpenStack in the Chinese market would be 25.7%, with revenues reaching 538 million US dollars. As OpenStack moves from testing environments into production environments, the requirements of large enterprises for large-scale cluster management, multi-cloud data center management, stability, performance, and efficiency have become increasingly important. These requirements have also become the benchmark for the OpenStack solution design and capabilities of cloud computing vendors.
Business challenges: Optimizing data center infrastructure and management in large-scale industry clouds
Industry clouds in government, finance, education, medical, and other sectors are trending towards large-scale development. As the capacity of industry clouds grows rapidly, an increasing number of organizations are getting involved. More and more industry clouds now cover entire provinces and even the entire nation. Against this backdrop, many industry clouds have built large-scale deployments in which a single cluster exceeds a thousand nodes and the total size reaches several thousand nodes.
The expanding scale of industry clouds places demanding requirements on the construction and operation of data centers. From an operations point of view, ultra-large-scale industry cloud platforms significantly increase the complexity of resource operation and maintenance management in IaaS. Furthermore, OpenStack is gradually becoming part of key systems within organizations, which puts high demands on the delivery speed and quality of open source clouds. An open source strategy would be seriously undermined if smart resource management and automated service delivery could not be provided. An increasing number of companies therefore hope that visualized tools and processes will help them enhance compliance and ensure application performance, addressing complexity, security, efficiency, and other issues in cloud environments.
Inspur* combined OpenStack with elements of its own "Yunhai series" to create InCloud OpenStack (ICOS). The combined ICOS virtualization platform and cloud management layer provides integrated scheduling and management of the basic compute, storage, networking, security, and other resources in cloud data centers. It also supports dynamic service changes, smart management of resources, and automated delivery of services. ICOS has been comprehensively optimized for performance, reliability, and security in large-scale industry cloud applications and is fully connected with the core OpenStack components. Additional components were also developed, based on requirements, to make up for deficiencies in the modules and components of the open OpenStack framework.
In terms of infrastructure, large-scale industry clouds, which typically have thousands of nodes, have extremely high requirements for performance and total cost of ownership (TCO). Industry cloud infrastructures need to provide ultra-high data processing and storage performance to satisfy the requirements of key applications, while also supporting high-load applications such as data management, model training, and model deployment. At the same time, the cost of building and maintaining a large-scale industry cloud is thousands of times that of a single node, so cost control is very important when optimizing for performance.
Solution: Large-scale industry cloud performance optimization on Intel® architecture
Large-scale industry clouds based on open technologies place new demands on the structure and optimization of infrastructure. First, the rapid growth of open source applications has brought an increase in workloads. An increasingly open source technology stack greatly affects performance, and this effect is especially pronounced in large-scale industry clouds. Second, agile infrastructure has become a major trend, requiring easy deployment and management of performance, QoS, and TCO at the software-defined level. Lastly, infrastructure also requires openness: those who make use of the open source community should also contribute back to it to create an active open source ecosystem.
As the base platform for this solution, the Inspur ICOS cloud operating system uses the OpenStack platform as its core. Relying on Inspur's deep understanding of its clients' needs, it further optimizes and customizes core OpenStack components such as Nova (compute), Cinder (block storage), Swift (object storage), Neutron (network), Glance (image), Ironic (bare metal), Heat (resource orchestration), Trove (database), and others, resulting in the InCloud OpenStack Rocky version.
In large-scale industry cloud applications, users require outstanding performance and resource orchestration capabilities, as well as a well-built architecture that makes it easier to optimize compute and storage performance. They also require the ability to optimize at any layer of the architecture. Intel® architecture can help improve the visibility and control of shared resources such as CPU cache and main memory in an open source environment. These features help implement smart orchestration as well as higher utilization and service levels, providing clients with a fully automated cloud platform based on software-defined infrastructure.
Figure 1. Structure of ICOS
To continue optimizing the performance of large-scale industry clouds based on open source technology, Inspur worked with Intel on deployments and tests at a data center with over 200 nodes based on Intel® architecture. The tests mainly involved "3H" testing, including high-concurrency stress tests, network and hard disk I/O tests, CPU and memory performance tests, and stability and high-availability tests, to lay the foundation for deploying large-scale industry clouds.
In terms of hardware configuration, Inspur deployed the 2nd Gen Intel® Xeon® Scalable processor, as well as Intel® Optane™ data center-grade persistent memory, for part of the testing phase targeting open source cloud computing technology. The Intel® Xeon® Scalable processor combines innovative technology features with powerful performance and can accelerate workloads in data center, enterprise, and smart edge cloud computing environments, providing strong support for large-scale industry clouds. Intel® Optane™ data center-grade persistent memory provides industry-leading throughput, latency, quality of service, and durability, with latency similar to that of RAM. It also supports fast caching and storage and has great potential to lower the deployment costs of large-scale industry clouds.
Verification of the control plane
Inspur used tests to verify the maximum number of users that ICOS can support while running various typical applications in a single region, and to find the optimal parameter configurations for this architecture. The goal was to guarantee the efficiency, stability, and reliability of cloud services under large-scale deployments and high-load conditions. During the first tuning round, Inspur tuned the maximum numbers of connections and processes for the network, the kernel, and HAProxy, along with the Ansible configuration file, MariaDB, and other settings, to ensure the successful deployment of ICOS on Intel® architecture.
After this tuning by Inspur and Intel, a test with a large number of virtual machines revealed that Neutron was unable to allocate IP addresses. Engineers from both parties analyzed the related source code and found that the community solution handled IP address conflicts with a retry mechanism: once the upper limit of retries was reached, the IP address allocation failed. The retry mechanism also caused temporary spikes in the workload of the neutron server and degraded its performance. Engineers from both parties introduced OpenStack Tooz and designed a distributed lock solution based on the community's original IP address allocation algorithm. The new solution not only resolved the allocation failures caused by IP address conflicts, it also improved the concurrent port creation performance of the neutron server. In subsequent tests, the success rate for concurrent creation of a large number of virtual machines reached 100%, and the average time consumed was reduced significantly. This solution has been submitted to the community as a blueprint ("BP"), has a "+2" review status, and is expected to be merged into the community code soon.
Afterwards, Inspur and Intel also verified the optimized parameters for virtual machine creation in Nova. Tests that concurrently created virtual machines from images and from cloud disks and concurrently created virtual machine snapshots, together with other optimizations, showed that ICOS on Intel® architecture can readily handle the host-creation demands of large-scale clouds and provide stable, reliable service for batch mounting, downloading, and image operations.
Besides verifying the performance of ICOS in a large-scale single-region deployment, both parties also used feature extraction, subtotaling, and qualitative and quantitative analysis to identify issues during the deployment process. By using optimized configurations and a complete deployment structure, they solved various issues related to the operating system, notification communication, database, and other ICOS components, so that the system could operate stably.
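One common way to reason about connection limits of this kind is to budget database connections from the worker counts of the API services. The following is a minimal sketch of that arithmetic, not the procedure Inspur used. It assumes that each worker process holds its own connection pool bounded by the oslo.db options max_pool_size and max_overflow; the service names, worker counts, and headroom percentage are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePool:
    """Connection-pool settings for one service (all values hypothetical)."""
    name: str
    workers: int         # worker processes per controller node
    max_pool_size: int   # persistent connections per worker
    max_overflow: int    # extra burst connections per worker

    def peak_connections(self) -> int:
        # Worst case: every worker uses its full pool plus its overflow.
        return self.workers * (self.max_pool_size + self.max_overflow)


def required_db_connections(services, controllers, headroom_percent=20):
    """Lower bound for the database connection limit, rounded up."""
    peak = controllers * sum(s.peak_connections() for s in services)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-peak * (100 + headroom_percent) // 100)


# Hypothetical sizing for three controller nodes.
example_services = [
    ServicePool("nova-api", workers=8, max_pool_size=5, max_overflow=10),
    ServicePool("neutron-server", workers=8, max_pool_size=5, max_overflow=10),
]
example_limit = required_db_connections(example_services, controllers=3)
```

If the result exceeds the configured database limit, either the limit has to be raised or the pool sizes have to be reduced; the load balancer's own connection ceiling would need a similar check.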
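The idea of replacing conflict retries with a lock can be sketched as follows. This is an illustration of the technique, not the code submitted to the community. To stay self-contained it uses an in-process threading.Lock as a stand-in for a distributed lock; in a real deployment the lock would come from a Tooz coordinator (get_coordinator, start, get_lock) so that it is shared across all neutron server processes. The class name, lock name, and subnet are hypothetical.

```python
import ipaddress
import threading
from concurrent.futures import ThreadPoolExecutor

_registry_guard = threading.Lock()
_named_locks = {}


def get_lock(name):
    """Return the lock registered under `name`, creating it on first use.

    Stand-in for a distributed lock. With Tooz this would be roughly
    coordinator.get_lock(name.encode()), which is an assumption about
    how the real solution is wired, not a quote from it.
    """
    with _registry_guard:
        return _named_locks.setdefault(name, threading.Lock())


class SubnetAllocator:
    """Hands out addresses from one subnet, serialized by a named lock."""

    def __init__(self, cidr):
        self.cidr = cidr
        self._free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self.allocated = {}  # ip -> port id

    def allocate(self, port_id):
        # Only one caller at a time may pick an address for this subnet,
        # so two ports can never choose the same IP and no retry is needed.
        with get_lock(f"ipam-{self.cidr}"):
            if not self._free:
                raise RuntimeError("subnet exhausted")
            ip = self._free.pop(0)
            self.allocated[ip] = port_id
            return ip


def create_ports_concurrently(allocator, count, workers=32):
    """Simulate a burst of concurrent port creations."""
    port_ids = (f"port-{i}" for i in range(count))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(allocator.allocate, port_ids))
```

Because allocation is serialized per subnet, contention shows up as short waits on the lock rather than as failed transactions that must be retried, which is the behavior the text attributes to the lock-based design.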
Data plane implementation verification
In a cloud environment, business applications run on front-end data plane virtual machines, so virtual network performance is critical. At the same time, processor and memory performance directly determine the computing power of the platform. It is therefore important to find the limits and bottlenecks of processor and memory performance to guide subsequent system tuning and resource management.
Inspur collaborated with Intel on testing front-end data plane network performance as well as processor and memory performance. For the memory bandwidth test, both sides adopted STREAM, a widely used memory bandwidth benchmark whose access pattern has good spatial locality. STREAM was used to test memory bandwidth along two dimensions: different thread counts, and different combinations of CPU cores and memory sizes. When memory size is not a bottleneck, the test results show that memory bandwidth is linearly related to the number of vCPUs. Sufficient memory capacity is therefore an important prerequisite for scaling memory bandwidth with the number of vCPUs.
Subsequently, both parties used the UnixBench test tool to test the performance of the virtual machine's vCPUs. The results show that CPU performance increases linearly with thread count, and that the CPU performance of a VM increases linearly with the number of allocated vCPUs. Inspur and Intel also tuned HugePage and CPU pinning settings and found that VM performance is much better when CPU pinning is turned on.
Through these tests, both sides demonstrated that ICOS can provide sufficient performance and stability in large-scale scenarios. In future scenarios with higher compute, storage, networking, and memory requirements, high-performance hardware based on Intel® Architecture is recommended to meet customer needs while enhancing the experience of large-scale industry clouds.
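To make the measured quantity concrete, the sketch below mimics the STREAM "triad" kernel (a = b + scalar * c) and the way bandwidth is derived from it: three arrays of 8-byte values are touched per iteration, and the best time across repetitions is used. It is an illustration only. In pure Python the timing is dominated by interpreter overhead, so the number it returns is not a real memory bandwidth figure, and the array size and repeat count are arbitrary choices.

```python
import time
from array import array


def triad_bandwidth_mb_s(n=200_000, scalar=3.0, repeats=3):
    """Run a STREAM-style triad and return (MB/s, result array)."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a = array("d", [0.0]) * n
    # Per element: read b, read c, write a -> three 8-byte accesses.
    bytes_moved = 3 * a.itemsize * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(n):
            a[i] = b[i] + scalar * c[i]
        best = min(best, time.perf_counter() - start)
    # Bandwidth is computed from the fastest repetition, in units of 10^6 bytes/s.
    return bytes_moved / best / 1e6, a
```

Running the real benchmark inside VMs with different vCPU counts and plotting the reported triad rate against the vCPU count is one way to check the linear relationship described above.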
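In upstream OpenStack, CPU pinning and huge pages are typically requested per flavor through the extra specs hw:cpu_policy and hw:mem_page_size. The sketch below builds the corresponding command line; it assumes the standard openstack client is used, and the flavor name is hypothetical. The source does not state how these settings were applied in ICOS, so this shows only the generic mechanism.

```python
def pinned_flavor_properties(page_size="large"):
    """Extra specs asking Nova for dedicated (pinned) CPUs and huge pages."""
    return {
        "hw:cpu_policy": "dedicated",   # pin each vCPU to a host CPU
        "hw:mem_page_size": page_size,  # back guest memory with huge pages
    }


def flavor_set_command(flavor, properties):
    """Build the 'openstack flavor set' argument list for the given flavor."""
    cmd = ["openstack", "flavor", "set"]
    for key, value in sorted(properties.items()):
        cmd += ["--property", f"{key}={value}"]
    cmd.append(flavor)
    return cmd


# Hypothetical flavor name, for illustration only.
example_cmd = flavor_set_command("m1.pinned", pinned_flavor_properties())
```

The returned list can be passed to subprocess.run on a host where the client is installed and authenticated. The compute hosts must also have huge pages reserved and a dedicated CPU set configured before such a flavor can be scheduled.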
Promotion of Open Source Community
Overall, ICOS, built on open source technology, achieves comprehensive high availability across the control plane and the data plane, including enhanced HA for control plane nodes, enhanced HA for the hosts of all virtual machines, and an enhanced virtual machine HA mechanism. At the same time, release is significantly more efficient, and deployment and upgrades are code-driven. ICOS also supports automated continuous integration and verification, so it can deliver 500+ nodes a day and can easily be extended online without interrupting the business.
Giving back to the open source community was an important part of this round of testing and is necessary for an active open source ecosystem. Inspur and Intel summarized and fixed the problems encountered during testing. They resolved the IP conflicts that occurred when ports were allocated during batch creation of virtual machines, and contributed the modified solution to the community in the form of blueprints (BPs) to improve the high availability of community project functionality. At present, some of the ICOS blueprints have been integrated, improving the platform's reliability, efficiency, performance, and security.
Effect: Intel® Architecture + ICOS Optimizes Large-Scale Industry Cloud
This testing provides documented experience for designing large-scale ICOS deployments based on Intel® Architecture processors. Bottlenecks were fixed through optimization, and the stability, reliability, and security of ICOS were improved, opening the door to large-scale field implementation in the future. Testing has also shown that Intel® Architecture is well suited to the infrastructure needs of open source cloud computing systems such as OpenStack. It provides optimized performance, agility, and more efficient use of compute, storage, and network resources, increasing VM density at a lower cost.
In the deployment of large-scale industry clouds, Intel® Architecture provides a full range of enhancements in compute and storage architecture, network optimization, intelligent acceleration, and intelligent management to better meet heavy workload demands. At the same time, combined with software-defined infrastructure, this solution makes it easy to deploy and manage performance, QoS, and TCO, enabling the rapid construction of industry clouds.
Going forward, Intel and Inspur plan to launch a solution that is based on Intel® OpenSDI and optimized for integrating big data and AI applications, to meet the demands of large-scale industry clouds and help them develop rapidly.
2nd Gen Intel® Xeon® Scalable Processor
● Optimization for workloads, industry leading performance
● Enhanced hardware virtualization features compared to previous generation processors
● Excellent resource utilization and agility
● Suitable for demanding I/O intensive workloads, helping in accelerating the data revolution
Inspur used the 2nd Gen Intel® Xeon® Scalable processor in part of the testing phase of its large-scale industry cloud deployment.