Cloud Telemetry: Advancing Your IT Strategy

Monitor your resources more effectively to optimize performance and total cost of ownership (TCO).

Cloud Telemetry Overview:

  • Telemetry refers to monitoring and analyzing information about IT systems to track performance and identify issues.

  • The best telemetry strategies use a holistic, multisystems approach to identify key metrics that matter to business and IT operations.

  • New frontiers for telemetry include artificial intelligence (AI) and predictive analytics to detect and, in some cases, resolve problems without human involvement.

What Is Cloud Telemetry?

Cloud telemetry uses software tools to record and analyze information about IT infrastructure that would otherwise be difficult to gather.

For cloud management, telemetry is critically important: to the human eye, IT infrastructure looks very similar whether the hardware is performing optimally or not. Telemetry gives IT professionals the ability to observe components and monitor applications in a deeper way, with metrics that track performance, utilization, energy consumption, and more.

By making effective use of telemetry, organizations can improve key performance indicators, including TCO, reliability, security, performance, and power consumption. Telemetry can also generate insights to help IT teams manage evolving capacity requirements and detect whether infrastructure is being used efficiently.

With recent advances in telemetry and cloud orchestration, organizations can move toward a truly modern, autonomous data center. AI and predictive analytics in cloud telemetry can predict failures and other problems, and sometimes even fix them without human involvement.

Optimizing Infrastructure with Telemetry

Telemetry capabilities have become more robust in recent years. Newer metrics and techniques, made available by advanced hardware, allow deeper cloud monitoring and analysis than previous generations of technology could support. Driving value and performance from infrastructure requires a holistic, multisystems approach to telemetry.

Server
Protecting hardware investments requires server telemetry that offers an in-depth look at server health. Metrics pertaining to power consumption and volumetric airflow, as well as heating and cooling, can help identify problems that could compromise hardware health. Server load monitoring and balancing, as well as server memory tracking, are also important considerations.

Compute
With effective telemetry, businesses can manage compute resources more efficiently. Telemetry can detect utilization by core and compare CPU utilization percentages to component specifications. If a CPU is running inefficiently, IT can troubleshoot or replace it in order to achieve expected performance levels.
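
As a rough illustration of comparing per-core utilization with an expected range, the sketch below averages a few readings per core and labels each core. The sample data, the 10% and 90% limits, and the name `classify_cores` are all assumptions made for this example; real limits would come from the component's specifications.

```python
from statistics import mean

# Hypothetical per-core utilization samples (percent), e.g. one reading per minute.
core_samples = {
    "core0": [42.0, 55.5, 61.0],
    "core1": [97.0, 99.0, 98.5],
    "core2": [3.0, 2.5, 4.0],
}

# Assumed target band; real values would come from component specifications.
LOW_PCT, HIGH_PCT = 10.0, 90.0


def classify_cores(samples, low=LOW_PCT, high=HIGH_PCT):
    """Label each core 'underused', 'ok', or 'saturated' by mean utilization."""
    result = {}
    for core, readings in samples.items():
        avg = mean(readings)
        if avg < low:
            result[core] = "underused"
        elif avg > high:
            result[core] = "saturated"
        else:
            result[core] = "ok"
    return result


report = classify_cores(core_samples)
```

Averaging over a window rather than reacting to a single reading keeps short spikes from triggering troubleshooting.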

Memory
Telemetry for classic dual in-line memory modules (DIMMs) centers on failure prediction. Because problems on a specific DIMM often apply to an entire lot, telemetry can help identify which other DIMMs to swap out to minimize failures. Modern persistent memory modules (PMMs) like Intel® Optane™ persistent memory enable more robust telemetry because they include an endurance analyzer that reports predicted lifespan, which enhances predictive maintenance.
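
The lot-based reasoning for DIMMs could be sketched as follows. The inventory records, the lot identifiers, and the function name `dimms_at_risk` are hypothetical; the only idea taken from the text is that a failing DIMM makes the other modules in its lot candidates for replacement.

```python
from collections import defaultdict

# Hypothetical inventory: (dimm_id, lot, has_reported_errors)
inventory = [
    ("dimm-a1", "LOT-17", True),
    ("dimm-a2", "LOT-17", False),
    ("dimm-b1", "LOT-22", False),
    ("dimm-a3", "LOT-17", False),
]


def dimms_at_risk(records):
    """Return healthy DIMMs that share a lot with at least one failing DIMM."""
    by_lot = defaultdict(list)
    for dimm_id, lot, failing in records:
        by_lot[lot].append((dimm_id, failing))

    at_risk = []
    for members in by_lot.values():
        # A lot is suspect as soon as any one of its modules reports errors.
        if any(failing for _, failing in members):
            at_risk.extend(d for d, failing in members if not failing)
    return sorted(at_risk)


candidates = dimms_at_risk(inventory)
```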

Storage
Solid state drives (SSDs) have led to significant improvements in telemetry capabilities. These drives, including Intel® SSDs, have modern health analyzer tools that give insights into performance and remaining drive life. Because drive health declines gradually, telemetry makes it possible to predict when drives will fail.
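
One simple way to turn gradual wear into a forecast is to fit a straight line to remaining-life readings and extrapolate. The sketch below assumes a linear decline, a 10% replacement threshold, and made-up readings; none of these come from a specific drive's tooling, and real wear may not be linear.

```python
def estimate_days_until(readings, threshold_pct=10.0):
    """Estimate days after the last reading until remaining life hits threshold_pct.

    readings: list of (day, remaining_life_pct) tuples, ordered by day.
    Uses an ordinary least-squares line; returns None if there are fewer than
    two distinct days or if the readings are not declining.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_x = sum(day for day, _ in readings) / n
    mean_y = sum(pct for _, pct in readings) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in readings)
    if sxx == 0:
        return None
    sxy = sum((day - mean_x) * (pct - mean_y) for day, pct in readings)
    slope = sxy / sxx
    if slope >= 0:
        return None  # no measurable wear trend yet
    intercept = mean_y - slope * mean_x
    day_at_threshold = (threshold_pct - intercept) / slope
    return max(0.0, day_at_threshold - readings[-1][0])


# Hypothetical readings: 3 percentage points of life consumed every 30 days.
wear_log = [(0, 100.0), (30, 97.0), (60, 94.0), (90, 91.0)]
days_left = estimate_days_until(wear_log)
```

With these sample readings the fitted line loses 0.1 percentage points per day, so the estimate lands about 810 days after the last reading.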

Networking
Telemetry for network infrastructure has advanced in recent years. While legacy drivers allowed a look into I/O and performance only, newer drivers can offer a more comprehensive view of network utilization. With Intel® FPGA-based smart NICs, load balancing can be closely managed to offload network workloads from central compute resources.

Applications
Telemetry in the form of application monitoring can give a deeper look into whether your applications are meeting benchmarks. IT teams can analyze latency, timeouts, loading times, and other measures of overall application health.
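
As an illustrative sketch of checking application health against benchmarks, the code below computes a 95th-percentile latency and a timeout rate from a request log. The sample log, the 400 ms budget, the 5% timeout limit, and the function names are assumptions chosen for the example, not values from any particular monitoring product.

```python
import math

# Hypothetical request log: latency in milliseconds, or None for a timeout.
latencies_ms = [120, 95, None, 210, 130, 88, 450, None, 101, 99]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples, p95_budget_ms=400, max_timeout_rate=0.05):
    """Summarize latency and timeouts and compare them with assumed budgets."""
    completed = [s for s in samples if s is not None]
    timeout_rate = (len(samples) - len(completed)) / len(samples)
    p95 = percentile(completed, 95) if completed else None
    meets = (
        p95 is not None
        and p95 <= p95_budget_ms
        and timeout_rate <= max_timeout_rate
    )
    return {"p95_ms": p95, "timeout_rate": timeout_rate, "meets_benchmark": meets}


summary = summarize(latencies_ms)
```

Tracking a high percentile alongside the timeout rate surfaces the slow tail that an average would hide.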

Telemetry Strategy: Tips You Can Use

There is no “one size fits all” strategy for telemetry. Your existing infrastructure, short- and long-term cost considerations, and business goals will determine the overall direction of your strategy.

However, some basic principles apply for any organization hoping to prepare for the future with a modern telemetry strategy:

  • Less is more: While telemetry focuses on collecting information about hardware and software, not all data is equally important. Often, information is overcollected but underutilized. It’s important to identify the right metrics to track.
  • Go step by step: Analyzing workloads is a four-step process. First, review platform health and validate hardware. Second, use characterization to better understand system behavior. Third, balance your platform for specific workloads with hardware right-sizing. Finally, once the hardware picture looks good, profile and optimize software to identify inefficiencies.
  • Get predictive: Newer hardware and tools make it possible to see problems before they start. By transitioning to hardware that can measure its own health and remaining lifespan, you can enable a predictive, rather than reactive, maintenance strategy that minimizes failures and service interruptions.
  • Automate decisions: As telemetry gets better at identifying problems, infrastructure data mining can help you make better decisions about utilization and performance optimization. By detecting how workloads have been balanced and components used in the past, infrastructure data mining can inform better decisions about the future. Many of these decisions can even be made automatically after training AI models, so that performance and power consumption can be optimized without human involvement.
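
To make the "automate decisions" idea concrete, here is a minimal sketch that uses historical utilization to choose where a new workload should run. It is a plain rule, standing in for the trained AI models mentioned above; the host names, the 85% ceiling, and the function name `pick_host` are hypothetical.

```python
from statistics import mean

# Hypothetical historical CPU utilization per host (percent).
utilization_history = {
    "host-a": [72.0, 80.0, 76.0],
    "host-b": [35.0, 40.0, 30.0],
    "host-c": [55.0, 50.0, 60.0],
}


def pick_host(history, workload_pct, ceiling_pct=85.0):
    """Return the host with the lowest mean utilization that can absorb
    workload_pct without exceeding ceiling_pct, or None if no host fits."""
    best_host, best_avg = None, None
    for host, readings in history.items():
        avg = mean(readings)
        if avg + workload_pct > ceiling_pct:
            continue  # placing the workload here would exceed the ceiling
        if best_avg is None or avg < best_avg:
            best_host, best_avg = host, avg
    return best_host


choice = pick_host(utilization_history, workload_pct=20.0)
```

Returning None when nothing fits leaves room for a human or a capacity-planning step to intervene, rather than forcing a placement.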

Telemetry capabilities often evolve in tandem with infrastructure advances—for instance, as new types of hardware allow new measurements to be made. Consider each new component added to your current configuration and whether it may impact your strategy and key metrics.

Intel Tools for Cloud Telemetry

At Intel, we’re committed to helping businesses understand how to use telemetry effectively and find the metrics that matter. From hardware-enabled telemetry to development kits and frameworks for performance engineers, Intel® technology works to improve modern cloud telemetry.

Intel took telemetry all the way to the silicon with our advanced performance monitoring units (Intel® PMU). This sophisticated on-chip hardware enables more-robust telemetry with advanced metrics and diagnostics. We continue to evolve PMUs alongside our architecture updates.

With the Intel® Telemetry Collector (ITC), IT teams can rapidly analyze and see performance visualizations for a range of systems. ITC gives you access to the same collection of tools that Intel’s own performance engineers would use in a performance review.

As telemetry capabilities evolve, Intel will continue our mission of innovation and education for all stages of workload analysis and optimization. From predictive and augmented analytics to hardware-based telemetry advances, we’re excited to enable the technology of the future for our customers today.