Artificial intelligence is a game-changing technology across industries and business processes, designed to make workers more efficient, reduce the steps it takes to complete a task, and deliver answers and insights faster. But those powerful capabilities also put new demands on compute infrastructure, which in turn requires a new class of infrastructure observability metrics.

With that in mind, we announced this week at HPE Discover 2024 that OpsRamp, a Hewlett Packard Enterprise company, now supports full-stack AI workload-to-infrastructure observability. IT teams can monitor their AI infrastructure in the context of their AI workloads, correlated with the rest of their data center infrastructure. Supported AI infrastructure includes NVIDIA GPUs and AI clusters, NVIDIA DGX Systems, and NVIDIA Mellanox InfiniBand and Spectrum Ethernet switches.

The metrics collected cover availability, health, usage, performance, power consumption, and more. OpsRamp also integrates with CrowdStrike for protection against misconfigurations and for IOA (indicator of attack) detection. Customers can visualize the security posture of their AI infrastructure through the unified full-stack service map view in OpsRamp, overlaid with vulnerabilities. In addition, with HPE Ezmeral integration, DevOps, SRE, and operations teams can visualize AI workload-to-infrastructure performance, health, and utilization for a full-stack view of their AI deployments.

At the core of compute infrastructure for AI is the GPU, or graphics processing unit. Originally developed for 3D rendering and visuals for gaming systems, the GPU contains smaller and more specialized processing cores than the CPUs that power most compute infrastructure. These smaller cores work in parallel to deliver massive performance and processing speed, up to 1000x faster than a CPU on highly parallel workloads. In gaming, that translates into more realistic visuals and smoother play. In AI, GPUs have enabled faster model training and inferencing, since they perform repetitive tasks so quickly.

Operations Dashboard

OpsRamp’s new AI workloads-to-infrastructure observability includes such metrics as power draw; memory clock frequency; memory utilization; frame buffer memory; GPU temperature; memory temperature; and GPU utilization. With this support for observability of NVIDIA infrastructure and AI workloads, OpsRamp can help organizations monitor and manage the performance of their AI systems correlated with the rest of the hybrid cloud to ensure that AI infrastructure is reaching the high-performance levels needed for the applications it supports. 
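To make the metrics listed above more concrete, here is a minimal Python sketch of how the same kinds of values can be read from NVIDIA's standard nvidia-smi command-line tool. It does not show how OpsRamp collects data, which this post does not describe. The query fields are documented nvidia-smi fields, but the output key names, the helper functions, and the SAMPLE text are assumptions made up for illustration.

```python
import csv
import io
import subprocess

# nvidia-smi query fields mapped to illustrative (hypothetical) metric names.
FIELDS = {
    "index": "gpu_index",
    "power.draw": "power_draw_w",
    "clocks.mem": "memory_clock_mhz",
    "utilization.memory": "memory_utilization_pct",
    "memory.used": "frame_buffer_used_mib",
    "temperature.gpu": "gpu_temperature_c",
    "temperature.memory": "memory_temperature_c",
    "utilization.gpu": "gpu_utilization_pct",
}


def parse_gpu_metrics(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not any(field.strip() for field in record):
            continue  # skip blank lines
        row = {}
        for name, raw in zip(FIELDS.values(), record):
            raw = raw.strip()
            try:
                row[name] = int(raw) if name == "gpu_index" else float(raw)
            except ValueError:
                # Fields a GPU does not report (shown as N/A) become None.
                row[name] = None
        rows.append(row)
    return rows


def read_gpu_metrics():
    """Run nvidia-smi on the local host; requires an NVIDIA driver install."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_metrics(result.stdout)


# Made-up sample output for two GPUs, so the parser can run without a GPU.
SAMPLE = (
    "0, 61.45, 1215, 12, 20480, 54, 58, 87\n"
    "1, 58.10, 1215, 9, 18432, 51, N/A, 79\n"
)

for gpu in parse_gpu_metrics(SAMPLE):
    print(gpu)
```

On a host with NVIDIA drivers installed, calling read_gpu_metrics() would return the same structure with live values. Memory temperature in particular is not reported by every GPU, which is why the parser tolerates missing readings.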

AI Infrastructure and Workload Monitoring Overview

OpsRamp has deep expertise in infrastructure monitoring and management and already supported NVIDIA GPU monitoring. This release extends those capabilities, driven by HPE’s growing partnership with NVIDIA. We believe we have more extensive coverage of this new class of infrastructure than any other vendor. We can show AI infrastructure correlated with the rest of the data center infrastructure (CPU, GPU, network, and storage). Plus, through our integration with HPE Ezmeral, we offer full-stack AI workload-to-infrastructure observability.

OpsRamp’s AI workloads-to-infrastructure observability is available now.
