Introduction
As more workloads run simultaneously on a system, pressure increases on shared resources such as the CPU, cache, network bandwidth, and memory. This contention reduces workload performance and, if one or more of the workloads is bursty in nature, also reduces performance determinism. An interfering workload is called a noisy neighbor, and for the purposes of this discussion a workload could be any software application, a container, or even a virtual machine (VM).
Intel® Resource Director Technology (Intel® RDT) provides hardware support to monitor and manage shared resources such as the last-level cache (LLC), also called the L3 cache, and memory bandwidth. In conjunction with software support, starting with the operating system and going up the solution stack, this functionality is being made available to monitor and manage shared resources to isolate workloads and improve determinism. In particular, the cache monitoring technology (CMT) aspect of Intel RDT provides last-level cache usage information for a workload.
OpenStack* is an open source cloud operating system that controls datacenter resources, namely compute, storage, and networking. Users and administrators can access the resources through a web interface or RESTful API calls. For the purposes of this document, we assume that the reader has some knowledge of OpenStack, either as an operator/deployer, or as a developer.
Let us explore how to enable and use CMT, in the context of an OpenStack cloud, to detect cache-related workload interference and take remedial action(s).
Note 1: Readers of this article should have basic understanding of OpenStack and its deployment and configuration.
Note 2: All of the configurations and examples are based on the OpenStack Newton* release version (released in October 2016) and the Gnocchi* v3.0 release.
Enabling CMT in OpenStack*
Leveraging CMT in OpenStack requires changes to the Nova* and Ceilometer* projects, and optionally to the Gnocchi and Aodh* projects. The Nova project concerns itself with scheduling and managing workloads on the compute hosts. Ceilometer and Gnocchi pertain to telemetry. The Ceilometer agent runs on the compute hosts, gathers configured items of telemetry, and pushes them out for storage and future retrieval. The actual telemetry data could be saved in Ceilometer’s own database or in the Gnocchi time series database with indices. The latter is superior in both storage efficiency and retrieval speed. OpenStack Aodh supports defining rule-action pairs, such as whether some telemetry crosses a threshold and, if so, whether to emit an alarm. Alarms in turn could trigger some kind of operator intervention.
Enabling CMT in Nova*
OpenStack Nova provides access to the compute resources via a RESTful API and a web dashboard. To enable the CMT feature in Nova, the following preconditions have to be met:
- The compute node hardware must support the CMT feature. CPUs that support CMT include, but are not limited to, the Intel® Xeon® processor E5 v3 and Intel Xeon processor E5 v4 families. Please verify in the CPU specification that CMT is supported.
- The libvirt version installed on the Nova compute nodes is 2.0.0 or greater.
- The hypervisor running on the Nova compute host is the Kernel-based Virtual Machine (KVM).
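The first two preconditions can be checked with a short script. The following is a minimal sketch, not an official tool: it assumes a Linux compute host, where the kernel reports cache monitoring support through the `cqm` and `cqm_occup_llc` flags in /proc/cpuinfo, and the function names are illustrative only.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_cmt(flags):
    """Assumption: Linux exposes L3 occupancy monitoring as 'cqm' plus 'cqm_occup_llc'."""
    return "cqm" in flags and "cqm_occup_llc" in flags


def version_at_least(version, minimum=(2, 0, 0)):
    """Compare a dotted version string such as '2.0.0' against a minimum tuple."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    parts += (0,) * (3 - len(parts))  # pad '2.0' to (2, 0, 0)
    return parts >= minimum


# Illustrative input only; on a real host, read the text from /proc/cpuinfo.
SAMPLE_CPUINFO = "processor\t: 0\nflags\t\t: fpu vme cqm cqm_llc cqm_occup_llc\n"
print(supports_cmt(parse_cpu_flags(SAMPLE_CPUINFO)))
print(version_at_least("2.0.0"))
```

On a compute host, pass the contents of /proc/cpuinfo to parse_cpu_flags and the installed libvirt version string to version_at_least.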
If all of the above preconditions are satisfied, and Nova is currently running, edit the libvirt section of the Nova configuration file (by default it is /etc/nova/nova.conf):
[libvirt]
virt_type = kvm
enabled_perf_events = cmt
After saving the above modifications, restart the Nova compute service.
The openstack-nova-compute service runs on each compute host.
On Ubuntu* and CentOS* 6.5 hosts, run the following commands to restart the Nova compute service:
# service openstack-nova-compute restart
# service openstack-nova-compute status
On CentOS 7 and Fedora* 20 hosts, run the following commands instead to restart the Nova compute service:
# systemctl restart openstack-nova-compute
# systemctl status openstack-nova-compute
Once Nova is restarted, any new VMs launched by Nova will have the CMT feature enabled.
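As a quick sanity check, the [libvirt] settings shown earlier can be read back programmatically. The sketch below is only an approximation: it uses Python's standard configparser module rather than oslo.config (which Nova itself uses), and the function names and sample text are illustrative.

```python
import configparser


def read_libvirt_perf_settings(conf_text):
    """Return (virt_type, [perf events]) from nova.conf-style text."""
    # interpolation=None treats '%' literally; strict=False tolerates repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    virt_type = parser.get("libvirt", "virt_type", fallback=None)
    raw_events = parser.get("libvirt", "enabled_perf_events", fallback="")
    events = [e.strip() for e in raw_events.split(",") if e.strip()]
    return virt_type, events


def cmt_enabled(conf_text):
    """True when KVM is selected and 'cmt' is among the enabled perf events."""
    virt_type, events = read_libvirt_perf_settings(conf_text)
    return virt_type == "kvm" and "cmt" in events


# Illustrative configuration text; on a real host, read /etc/nova/nova.conf instead.
SAMPLE_CONF = """
[DEFAULT]
debug = False

[libvirt]
virt_type = kvm
enabled_perf_events = cmt, mbml, mbmt
"""
print(cmt_enabled(SAMPLE_CONF))
```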
If devstack is being used instead to install a fresh OpenStack environment, add the following to the devstack local.conf file:
[[post-config|$NOVA_CONF]]
[libvirt]
virt_type = kvm
enabled_perf_events = cmt, mbml, mbmt
After saving the above configuration, run devstack to start the installation.
Enabling CMT in Ceilometer*
Ceilometer is part of the OpenStack Telemetry project whose mission is to:
- Reliably collect utilization data from each host and for the VMs running on those hosts.
- Persist the data for subsequent retrieval and analysis.
- Trigger actions when defined criteria are met.
To get the last-level cache usage of a running VM, Ceilometer must be installed, configured to collect the cpu_l3_cache metric, and be running. Ceilometer defaults to collecting the metric. The cpu_l3_cache metric is collected by the Ceilometer agent running on the compute host by periodically polling for VM utilization metrics on the host.
If devstack is being used to install Ceilometer along with other OpenStack services and components, add the following in the devstack local.conf file:
[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
enable_plugin aodh git://git.openstack.org/openstack/aodh
After saving the above configuration, run devstack to start the installation. This will install Ceilometer as well as Aodh (OpenStack alarming service) in addition to other OpenStack services and components.
Storing the CMT Metrics
There are two options for saving telemetry data: Ceilometer’s own backend database, or the database of Gnocchi (also a member of the OpenStack Telemetry project). Gnocchi provides a time-series database with a resource indexing service, which is vastly superior to the Ceilometer native storage with respect to performance at scale, disk utilization, and data retrieval speed. We recommend installing Gnocchi and configuring it as the storage backend. To do so using devstack, modify the devstack local.conf file as follows:
[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
CEILOMETER_BACKEND=gnocchi
enable_plugin aodh git://git.openstack.org/openstack/aodh
enable_plugin gnocchi git://git.openstack.org/openstack/gnocchi
After saving the above configuration, run devstack to start the installation.
Refer to Gnocchi documentation for information on other Gnocchi installation methods.
After installing Gnocchi and Ceilometer, confirm that the following configuration settings are in place:
In the Ceilometer configuration file (by default it is /etc/ceilometer/ceilometer.conf), make sure the options are listed as follows:
[DEFAULT]
meter_dispatchers = gnocchi
[dispatcher_gnocchi]
filter_service_activity = False
archive_policy = low
url = <url to the Gnocchi API endpoint>
In the Gnocchi dispatcher configuration file (by default it is /etc/ceilometer/gnocchi_resources.yaml), make sure that the cpu_l3_cache metric is added into the resource type instance’s metrics list:
… …
  - resource_type: instance
    metrics:
      - 'instance'
      - 'memory'
      - 'memory.usage'
      - 'memory.resident'
      - 'vcpus'
      - 'cpu'
      - 'cpu_l3_cache'
… …
If any modifications are made to the above configuration files, you must restart the Ceilometer collector so that the new configurations take effect.
Verify Things are Working
To verify that all of the above are working, test as follows:
- Create a new VM.
$ openstack server create --flavor m1.tiny --image cirros-0.3.4-x86_64-uec abc
- Confirm that the VM has been created successfully.
$ openstack server list
ID | Name | Status | Networks | Image Name |
---|---|---|---|---|
7e38a89b-c829-4fb9-b44a-35090fbc0866 | abc | ACTIVE | private=10.0.0.3 | cirros-0.3.4-x86_64-uec |
- Wait for some time to allow the Ceilometer agent to collect the cpu_l3_cache metrics. The wait time is determined by the related pipeline defined in the /etc/ceilometer/pipeline.yaml file.
- Check to see if the related metrics are collected and stored.
- If the metric is stored in Ceilometer’s own database backend, use the following command:
ID | Resource ID | Name | Type | Volume | Unit | Timestamp |
---|---|---|---|---|---|---|
f42e275a-b36a-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 270336.0 | B | 2016-12-08T23:57:37.535615 |
8e872286-b369-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 450560.0 | B | 2016-12-08T23:47:37.505369 |
28e57758-b368-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 270336.0 | B | 2016-12-08T23:37:37.536424 |
…... | …... | …... | …... | …... | …... | …... |
- However, if the metric is stored in Gnocchi, access it as follows:
$ gnocchi measures show --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc cpu_l3_cache --aggregation mean
Timestamp | Granularity | Value |
---|---|---|
2016-12-09T00:00:00+00:00 | 86400.0 | 282350.933333 |
2016-12-09T01:00:00+00:00 | 3600.0 | 216268.8 |
2016-12-09T01:45:00+00:00 | 300.0 | 180224.0 |
2016-12-09T01:55:00+00:00 | 300.0 | 180224.0 |
… ... | … ... | … ... |
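The Gnocchi output above mixes rows of several granularities (86400, 3600, and 300 seconds). When comparing values against a per-interval threshold, it can help to keep only the finest granularity. The following is a minimal sketch that works on the rows shown above; the helper names are illustrative, and the values are assumed to be in bytes, matching the unit (B) reported in the Ceilometer sample listing.

```python
# Rows copied from the `gnocchi measures show` output above:
# (timestamp, granularity in seconds, mean cpu_l3_cache value).
MEASURES = [
    ("2016-12-09T00:00:00+00:00", 86400.0, 282350.933333),
    ("2016-12-09T01:00:00+00:00", 3600.0, 216268.8),
    ("2016-12-09T01:45:00+00:00", 300.0, 180224.0),
    ("2016-12-09T01:55:00+00:00", 300.0, 180224.0),
]


def finest_measures(measures):
    """Keep only the rows recorded at the smallest granularity."""
    finest = min(granularity for _, granularity, _ in measures)
    return [(ts, value) for ts, granularity, value in measures if granularity == finest]


def to_kib(num_bytes):
    """Convert a byte count to KiB for easier reading."""
    return num_bytes / 1024.0


for timestamp, value in finest_measures(MEASURES):
    print(timestamp, to_kib(value), "KiB")
```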
Using CMT in OpenStack
A noisy neighbor in the OpenStack environment could be a VM consuming resources in a manner that adversely affects one or more other VMs on the same compute node. A noisy situation may occur on a host because of a lack of knowledge of workload characteristics, a lack of appropriate information during Nova scheduling, or a change in the workload characteristics (for example, a spike in usage or a virus). The cloud admin might want to detect the situation and take some action, such as live migrating the greedy workload or terminating it. The OpenStack Aodh project enables detecting such scenarios and alerting to their existence using condition-action pairs. An Aodh rule that monitors whether VM cache usage crosses some threshold would automate the detection of noisy neighbor scenarios.
Below, we illustrate setting up an Aodh rule to detect noisy neighbors. The actual rule depends upon where the CMT telemetry data is stored. We first cover storage in the Ceilometer database and then in the Gnocchi time series database.
Metrics Stored in Ceilometer Database
Below, we define, using the Aodh command-line utility, a threshold CMT metrics rule:
$ aodh --debug alarm create --name cpu_l3_cache -t threshold --alarm-action "log://" --repeat-actions True --comparison-operator "gt" --threshold 180224 --meter-name cpu_l3_cache --period 600 --statistic avg
Field | Value |
---|---|
alarm_actions | [u'log://'] |
alarm_id | e3673d39-90ed-4455-80f1-fd7e06e1f2b8 |
comparison_operator | gt |
description | Alarm when cpu_l3_cache is gt a avg of 180224 over 600 seconds |
enabled | True |
evaluation_periods | 1 |
exclude_outliers | False |
insufficient_data_actions | [] |
meter_name | cpu_l3_cache |
name | cpu_l3_cache |
ok_actions | [] |
period | 600 |
project_id | f1730972dd484b94b3b943d93f3ee856 |
repeat_actions | True |
query | |
severity | low |
state | insufficient data |
state_timestamp | 2016-12-08T23:59:05.712994 |
statistic | avg |
threshold | 180224 |
time_constraints | [] |
timestamp | 2016-12-08T23:59:05.712994 |
type | threshold |
user_id | cfcd1ea48a1046b192dbd3f5af11290e |
This creates an alarm rule named cpu_l3_cache that is triggered if, and only if, within a sliding window of 10 minutes (600 seconds), the VM’s average cpu_l3_cache metric is greater than 180224. If the alarm is triggered, it will be logged in the Aodh alarm notifier agent’s log. Alternatively, instead of just logging the alarm event, a notifier may be used to push a notification to one or more configured endpoints. For example, we could use the http notifier by providing "http://<endpoint ip>:<endpoint port>" as the alarm-action parameter.
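To make the rule's semantics concrete, the sketch below models the evaluation in plain Python: average the samples that fall inside the window and compare the result with the threshold. This is a simplified illustration and not Aodh's actual evaluator code; the function name is hypothetical, and the samples are the three cpu_l3_cache values from the Ceilometer listing shown earlier.

```python
from datetime import datetime, timedelta

# (timestamp, volume in bytes) taken from the Ceilometer sample listing earlier.
SAMPLES = [
    (datetime.fromisoformat("2016-12-08T23:57:37.535615"), 270336.0),
    (datetime.fromisoformat("2016-12-08T23:47:37.505369"), 450560.0),
    (datetime.fromisoformat("2016-12-08T23:37:37.536424"), 270336.0),
]


def evaluate_threshold(samples, now, period=600, threshold=180224.0):
    """Return 'alarm', 'ok', or 'insufficient data' for an avg/gt rule."""
    window_start = now - timedelta(seconds=period)
    window = [volume for ts, volume in samples if window_start < ts <= now]
    if not window:
        return "insufficient data"
    average = sum(window) / len(window)
    return "alarm" if average > threshold else "ok"


print(evaluate_threshold(SAMPLES, datetime(2016, 12, 9, 0, 0, 0)))
```

With these samples and an evaluation time of 2016-12-09T00:00:00, only the 23:57:37 sample falls inside the 600-second window, and its value exceeds the threshold.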
Metrics Stored in Gnocchi*
If the metrics are stored in Gnocchi, an Aodh alarm could be created through a gnocchi_resources_threshold rule such as the following, using the Aodh command-line utility:
$ aodh --debug alarm create -t gnocchi_resources_threshold --name test1 --alarm-action "log://alarm" --repeat-actions True --metric cpu_l3_cache --threshold 100000 --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc --aggregation-method mean --resource-type instance --granularity 300 --comparison-operator 'gt'
Field | Value |
---|---|
aggregation_method | mean |
alarm_actions | [u'log://alarm'] |
alarm_id | 71f48ee1-b92f-4982-92e4-4c520649a8e0 |
comparison_operator | gt |
description | gnocchi_resources_threshold alarm rule |
enabled | True |
evaluation_periods | 1 |
granularity | 300 |
insufficient_data_actions | [] |
metric | cpu_l3_cache |
name | test1 |
ok_actions | [] |
period | 600 |
project_id | 543aa2e8e17449149d5c101c55675005 |
repeat_actions | True |
resource_id | 9184470a-594e-4a46-a124-fa3aaaf412dc |
resource_type | instance |
state | insufficient data |
state_timestamp | 2016-12-09T05:57:07.089530 |
threshold | 100000 |
time_constraints | [] |
timestamp | 2016-12-09T05:57:07.089530 |
type | gnocchi_resources_threshold |
user_id | ca859810b379425085756faf6fd04ded |
This creates an alarm named test1 that is triggered if, and only if, within a sliding 10-minute window (600 seconds), the VM 9184470a-594e-4a46-a124-fa3aaaf412dc registers an average cpu_l3_cache metric greater than 100000, the value passed as the threshold parameter. If triggered, an alarm is logged to the Aodh alarm notifier agent’s log output. Instead of the command-line utility, the Aodh RESTful API could be used to define alarms; refer to http://docs.openstack.org/developer/aodh/webapi/v2.html for details.
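As a rough illustration of the RESTful alternative, the sketch below builds (but does not send) an HTTP request that mirrors the test1 alarm created above. Treat it as an assumption-laden example: the endpoint URL, the token value, and the function name are hypothetical, and the body layout (a gnocchi_resources_threshold_rule object posted to /v2/alarms) should be checked against the Aodh API documentation referenced above for the release in use.

```python
import json
import urllib.request


def build_alarm_request(aodh_url, token, resource_id, threshold):
    """Build a POST request describing a gnocchi_resources_threshold alarm."""
    body = {
        "name": "test1",
        "type": "gnocchi_resources_threshold",
        "alarm_actions": ["log://alarm"],
        "repeat_actions": True,
        "gnocchi_resources_threshold_rule": {
            "metric": "cpu_l3_cache",
            "resource_type": "instance",
            "resource_id": resource_id,
            "aggregation_method": "mean",
            "granularity": 300,
            "comparison_operator": "gt",
            "threshold": threshold,
        },
    }
    return urllib.request.Request(
        url=aodh_url.rstrip("/") + "/v2/alarms",
        data=json.dumps(body).encode("utf-8"),
        # X-Auth-Token carries a Keystone token; the value here is a placeholder.
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# Hypothetical endpoint and token; nothing is sent over the network here.
request = build_alarm_request(
    "http://controller:8042", "PLACEHOLDER_TOKEN",
    "9184470a-594e-4a46-a124-fa3aaaf412dc", 100000,
)
print(request.full_url)
```

Passing the request object to urllib.request.urlopen would submit it; that step is left out because it needs a live Aodh endpoint and a valid token.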
Gnocchi v3.0 is limited in its ability to query resources by metric type and threshold; such enhancements are expected in future releases.
More About Intel® Resource Director Technology (Intel® RDT)
The Intel RDT family comprises, beyond CMT, other monitoring and resource allocation technologies. Those that will soon be available are:
- Cache Allocation Technology (CAT) enables allocation of cache to workloads, either in exclusive or shared mode, to ensure performance despite co-resident (running on the same host) workloads. For instance, more cache can be allocated to a high-priority task that has a larger working set or, conversely, cache usage can be restricted for a lower-priority streaming application so that it does not interfere with higher-priority tasks.
- Memory Bandwidth Monitoring (MBM), along the lines of CMT, provides memory bandwidth usage information for workloads.
- Code and Data Prioritization (CDP) enables separate control over code and data placement in the last-level cache.
To learn more visit http://www.intel.com/content/www/cn/zh/architecture-and-technology/resource-director-technology.html.
In conclusion, we hope the above provides you with adequate information to start using CMT in an OpenStack cloud to gain deeper insights into workload characteristics to positively influence performance.