PagerDuty Chair Warns AIOps Must Evolve for AI Agent Drift Detection
Jenn Tejada, Executive Chair of PagerDuty, recently highlighted the emerging operational risks associated with autonomous AI agents, as reported by Let's Data Science on July 3, 2026. Tejada emphasized that unlike conventional software failures that often result in immediate crashes, AI agent failures frequently manifest as gradual "model drift." This subtle degradation in performance or behavior is difficult to detect with traditional monitoring tools, allowing issues to compound and escalate into significant outages before they are noticed. She stressed the urgent need for AIOps platforms to adapt and provide continuous monitoring of AI agents alongside existing infrastructure.
This insight matters for cloud and DevOps professionals working to operationalize AI. As organizations deploy AI agents for critical tasks, the reliability of these systems becomes paramount. Because drift is a gradual failure mode, established observability practices, which typically focus on system health and explicit error states, are inadequate on their own. Without specialized AIOps capabilities, teams risk longer mean time to detection (MTTD) and mean time to resolution (MTTR) for AI-related incidents. That directly affects service availability and operational efficiency, and it can lead to significant business disruption and erosion of user trust. The challenge is not just monitoring whether an agent is "up," but whether it is "doing the right thing" and maintaining its intended behavior over time.
This development is a natural progression in the convergence of AI and IT operations. AIOps has long been championed for applying AI and machine learning to large volumes of operational data, enabling proactive problem identification, anomaly detection, and automated remediation in complex, distributed environments. With the rise of agentic AI (systems capable of autonomous, multi-step goal execution), AIOps must now extend its purview to monitor the AI itself. The trend is amplified by massive investment in AI infrastructure: hyperscaler AI capital expenditure is estimated at $725 billion for 2026, nearly double the previous year. That scale-up underscores the need for robust operational frameworks that can manage the inherent unpredictability of AI agents, particularly where the "black box" nature of some models necessitates external, intelligent oversight.
Practitioners must prioritize extending their AIOps strategies to incorporate agent-level telemetry and drift detection. This involves instrumenting AI agents to expose metrics related to their decision-making processes, tool calls, confidence scores, and deviations from expected output distributions. Furthermore, integrating human-in-the-loop mechanisms within incident response tooling is essential, allowing for human override paths to pause or redirect a drifting agent before it causes widespread damage. Organizations should invest in AIOps solutions that offer advanced analytics for behavioral anomaly detection in AI agents, rather than just infrastructure alerts. The goal is to build a safety net that balances the benefits of AI autonomy with the necessary human oversight, ensuring that AI agents remain aligned with operational goals and do not inadvertently "break the internet" through unmonitored, evolving failures.
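To make the idea of agent-level drift detection concrete, here is a minimal Python sketch under stated assumptions. It is not PagerDuty's method or any vendor's API. It compares the distribution of an agent's recent tool calls against a known-good baseline using the Population Stability Index (PSI), and sets a pause flag that a human must clear. The names `DriftGuard`, `psi`, and `distribution`, the tool names, the window size, and the 0.25 threshold (a commonly cited rule of thumb for PSI) are all hypothetical choices for illustration.

```python
import math
from collections import Counter, deque


def distribution(events, categories, alpha=0.5):
    """Smoothed relative frequency of each category (additive smoothing)."""
    counts = Counter(events)
    total = len(events) + alpha * len(categories)
    return [(counts[c] + alpha) / total for c in categories]


def psi(baseline, current, categories, alpha=0.5):
    """Population Stability Index between two samples of categorical events."""
    p = distribution(baseline, categories, alpha)
    q = distribution(current, categories, alpha)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


class DriftGuard:
    """Hypothetical guard: tracks recent tool calls and pauses on drift."""

    def __init__(self, baseline, categories, threshold=0.25, window=20):
        self.baseline = list(baseline)
        self.categories = list(categories)
        self.threshold = threshold          # assumed cut-off, tune per agent
        self.recent = deque(maxlen=window)  # sliding window of tool calls
        self.paused = False
        self.last_score = None

    def record(self, tool_name):
        """Record one tool call; return True if the agent should be paused."""
        self.recent.append(tool_name)
        # Only score once the window is full, to avoid noisy early alerts.
        if len(self.recent) == self.recent.maxlen:
            self.last_score = psi(self.baseline, self.recent, self.categories)
            if self.last_score > self.threshold:
                self.paused = True
        return self.paused

    def resume(self):
        """Human override: an operator explicitly clears the pause."""
        self.paused = False
        self.recent.clear()


# Illustrative usage with made-up tool names.
categories = ["search", "lookup", "refund"]
baseline = ["search"] * 70 + ["lookup"] * 25 + ["refund"] * 5

guard = DriftGuard(baseline, categories)
for call in ["search"] * 4 + ["lookup"] * 2 + ["refund"] * 14:
    if guard.record(call):
        print(f"agent paused, PSI={guard.last_score:.3f}")
        break
```

The pause flag stays set until `resume()` is called, which mirrors the human override path described above: the guard can stop a drifting agent automatically, but only an operator restarts it. In practice the same pattern could be applied to confidence scores or output categories rather than tool calls, and the paused state would be surfaced through whatever incident response tooling the team already uses.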