AI and ML in networking: practical insights beyond the hype
In recent years, the concepts of Artificial Intelligence (AI) and Machine Learning (ML) have moved from the academic realm to the forefront of many industries. The networking world, in particular, has started to explore ways to harness these technologies for more efficient management and operations. The promise is enticing: networks that manage themselves, anticipate failures before they occur, optimize traffic on the fly, and adjust configurations without human intervention. Yet, the reality of deploying AI and ML for network management is far from simple. Many organizations have found that the complexity, data requirements, and integration challenges quickly erode the surface-level optimism that often surrounds these technologies.
In this post, we’ll explore the practical considerations behind leveraging AI and ML in network operations, review a bit of historical context, and discuss current attempts to build intelligent network management solutions. We will also mention how platforms, like the Noction Intelligent Routing Platform, have approached the problem of network optimization – offering tangible benefits while also highlighting the hurdles that remain. Ultimately, while AI and ML hold promise, the journey to automated, self-tuning networks is riddled with caveats, trade-offs, and incremental steps rather than giant leaps forward.
A brief look back: from manual correlation to automated intelligence
The idea of intelligent network management predates today’s machine learning boom by decades. Early attempts at automated correlation tools date back over 20 years, with platforms that tried to map application dependencies onto underlying network infrastructure. Back then, pulling data often relied on protocols like SNMP for the network layer, and more proprietary mechanisms for application data. These tools aimed to correlate faults and understand the impact of specific infrastructure elements failing. The logic, however, was primarily rule-based, painstakingly maintained by engineering teams who had to encode every relationship manually.
The big problem was – and still is – that networks are not static. They constantly change as devices are added, workloads shift, security policies update, and cloud platforms proliferate. Back then, if your correlation engine needed to understand that a particular web application ran on a certain server, behind a certain load balancer, connected through a certain firewall, you had to feed it all that information manually. A week later, a new route changed the application flow, and your correlation was outdated. The promise of “intelligent” correlation was quickly tempered by the reality of maintaining an accurate, evolving model of the network. In practice, these systems often fell short, or required so much human labor that their value proposition was questionable.
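To make the maintenance burden concrete, here is a minimal sketch of what a hand-fed dependency model might look like. It is an illustration only, not a reconstruction of any particular product: the application and device names, the `DEPENDENCIES` table, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical, hand-maintained dependency records: each application lists
# the infrastructure elements its traffic crosses. All names are invented.
DEPENDENCIES = {
    "web-shop": ["srv-web-01", "lb-edge-01", "fw-dmz-01"],
    "billing-api": ["srv-api-02", "lb-edge-01", "fw-core-01"],
}


def build_impact_index(dependencies):
    """Invert the app -> elements map into element -> set of apps."""
    index = defaultdict(set)
    for app, elements in dependencies.items():
        for element in elements:
            index[element].add(app)
    return index


def impacted_apps(index, failed_element):
    """Return the applications the static model ties to a failed element."""
    return sorted(index.get(failed_element, set()))


index = build_impact_index(DEPENDENCIES)
print(impacted_apps(index, "lb-edge-01"))  # ['billing-api', 'web-shop']
print(impacted_apps(index, "fw-dmz-02"))   # [] (element unknown to the model)
```

The second lookup shows the weakness: if traffic has moved to a firewall nobody entered into the table, the model reports no impact at all, and it does so silently.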
Today, AI and ML are positioned as remedies to these limitations. Instead of static, human-defined correlation, machine learning models can (in theory) learn network behavior patterns, detect anomalies, and infer complex dependencies from large volumes of telemetry. Yet, the classic pitfalls remain: if you don’t feed the AI high-quality, accurate, and comprehensive data, its output will be flawed. AI is not a magician that transforms poor data into reliable insights. It simply automates the detection of patterns – good or bad – in the data it’s given.
Getting data under control: the foundation of AI success
One of the first hurdles in implementing AI-driven network management is data hygiene. Networks generate massive amounts of telemetry: routing tables, firewall logs, load balancer statistics, server metrics, latency measurements, and more. Enterprises may store all of this data in a data lake, believing that having “more data” is always better. But if this data is not normalized, correlated, and continuously updated, it becomes a swamp of unknowns rather than a lake of insights.
Before even thinking about “intelligent” operations, organizations must ensure they know their own infrastructure. Which devices are deployed where? What firmware versions are running? Which overlays map onto which underlays? Who is responsible for maintaining them? Without a solid inventory, consistent naming conventions, and a baseline understanding of the environment, the AI models will struggle. They may end up making recommendations that are based on incomplete or stale information, leading to mistrust and reluctance from network engineers.
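As a rough illustration of what checking that baseline might involve, the sketch below audits inventory records for missing fields and for entries that have not been verified recently. The field names, the 30-day freshness limit, and the sample devices are all assumptions made for this example, not a recommended schema.

```python
from datetime import date, timedelta

# Assumed minimum set of fields every inventory record should carry.
REQUIRED_FIELDS = ("hostname", "site", "firmware", "owner", "last_verified")


def audit_inventory(records, today, max_age_days=30):
    """Return (hostname, problem) tuples for incomplete or stale records."""
    problems = []
    cutoff = today - timedelta(days=max_age_days)
    for record in records:
        name = record.get("hostname") or "<unnamed>"
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                problems.append((name, f"missing {field}"))
        verified = record.get("last_verified")
        if verified and verified < cutoff:
            problems.append((name, "stale record"))
    return problems


# Invented sample data for demonstration.
inventory = [
    {"hostname": "edge-rtr-01", "site": "fra1", "firmware": "17.9.4",
     "owner": "netops", "last_verified": date(2025, 1, 10)},
    {"hostname": "core-sw-07", "site": "ams2", "firmware": "",
     "owner": "netops", "last_verified": date(2024, 9, 1)},
]

for hostname, problem in audit_inventory(inventory, today=date(2025, 1, 15)):
    print(f"{hostname}: {problem}")
```

Running a check like this before any model sees the data gives engineers a concrete list of gaps to close, rather than leaving the model to learn from records that are quietly wrong.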
Moreover, as networks increasingly span on-prem data centers, multiple public cloud regions, and containerized microservices, simply feeding all of that complexity into a machine-learning model is no trivial task. The underlying relationships become more intricate with each layer of abstraction. If you can’t get a handle on the “ground truth” of your environment, your ML-driven correlation engine will be guessing blindly at how pieces fit together.
Slow and incremental progress: sitting, crawling, then walking
Despite the hype, most organizations are, at best, in the early stages of applying AI and ML to network management. A realistic pathway involves small, incremental improvements rather than a sudden leap to full autonomy.
– Simple Observations and Alerts:
At the outset, machine learning models might do something fairly basic but still useful: analyzing performance metrics to detect anomalies. For example, they might learn what “normal” CPU utilization on a router looks like and alert administrators when usage drifts outside expected ranges. Or they might detect when a link’s latency starts trending upward at certain times, pointing to a potential congestion issue. This kind of basic anomaly detection adds value and reduces the burden on human operators who would otherwise have to sift through endless graphs.
– Correlation of Events and Dependencies:
A slightly more advanced step might involve correlating one event with another. For instance, if a particular switch fails, the system might infer which applications are affected by mapping the routes and firewall sessions associated with that device. However, making these correlations robust and reliable is tricky. Without careful curation and ongoing adjustments, the AI might produce too many false positives or miss critical impacts. Engineers then spend time verifying the AI’s output, sometimes questioning whether the effort is worth it.
– Predictive Maintenance and Capacity Planning:
As machine learning models improve and engineers gain confidence, they can start doing predictive maintenance, like forecasting when a given optical transceiver might fail based on its operating temperature and historical lifespan data. This use case is relatively low-risk and often yields real benefits. Improving capacity planning is another area where ML can shine, helping organizations figure out where to add bandwidth or when to upgrade devices.
– Closed-Loop Automation (Sometime in the Future):
The holy grail – fully automated, closed-loop remediation – remains distant. The vision where an AI engine not only detects a problem but also reconfigures network elements without human intervention is compelling. Yet, few are willing to trust an AI blindly with such tasks, for fear of cascading failures. Achieving this level of trust requires not only technical excellence but also cultural changes within the organization.
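The first step in the list above, learning what “normal” looks like and alerting on drift, can be sketched with nothing more than a rolling mean and standard deviation. This is a deliberately simple statistical baseline rather than a trained ML model; the class name, the window size, the threshold of three standard deviations, and the CPU samples are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Flag samples that fall far outside a rolling baseline of recent values."""

    def __init__(self, window=12, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        # Only judge samples once the window is full.
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            spread = stdev(self.history)
            # Guard against a perfectly flat window (zero deviation).
            if spread > 0 and abs(value - baseline) / spread > self.threshold:
                anomalous = True
        # Learn only from samples considered normal, so a single spike
        # does not immediately widen the baseline.
        if not anomalous:
            self.history.append(value)
        return anomalous


# Invented router CPU utilization samples (percent), one per polling interval.
cpu_samples = [31, 29, 33, 30, 32, 28, 31, 30, 29, 32, 31, 30, 33, 78, 31]
detector = BaselineDetector(window=12, threshold=3.0)
alerts = [(i, v) for i, v in enumerate(cpu_samples) if detector.observe(v)]
print(alerts)  # [(13, 78)]
```

One trade-off is visible even here: because anomalous samples are never added to the window, a legitimate and permanent shift in load would keep alerting until someone resets the baseline. That is the kind of ongoing tuning the text above warns about.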
Hype vs. reality of AI/ML in networking
A common misconception is that AI and ML tools in networking will rapidly displace engineers, or at least free them entirely from repetitive tasks. The reality is more nuanced. While some tasks can be offloaded, AI-driven solutions often require more upfront effort: cleaning data, building infrastructure maps, defining “normal” behavior, and continuously updating models as the network changes.
It’s also worth noting that AI is not infallible. Models can make mistakes, especially if they encounter scenarios they haven’t seen before. A model trained on historical data may fail to adapt when new types of traffic, protocols, or configurations emerge. In a field as mission-critical as networking – where outages and misconfigurations can have severe consequences – network operators tend to be cautious. They want a proven track record before turning over the keys to the kingdom.
Moreover, the complexity of modern infrastructure – spanning multiple clouds, relying on overlays like SD-WAN, and incorporating ephemeral containers – means that the AI’s job is inherently more complicated than simply detecting spikes in CPU usage. Detailed intent-based networking, where engineers define desired outcomes and let the system figure out the underlying configuration, is theoretically appealing. However, getting the intent right and ensuring the ML model can derive the correct actions to fulfill it are enormous challenges. The more abstract the environment, the easier it is for an AI model to draw the wrong conclusions.
Cultural and organizational challenges
Adopting AI and ML tools for network management isn’t just a technical problem. It’s also a cultural one. Network engineers are trained and experienced in deterministic systems. They expect commands and responses, cause and effect. Machine learning, by contrast, is probabilistic. It deals in likelihoods rather than certainties, and its “reasoning” is often opaque.
This opacity can erode trust. If an AI platform proposes rerouting certain flows or adjusting firewall rules to improve performance, engineers want to know why. Explaining a decision made by a deep learning model can be challenging. Overcoming this barrier requires transparency, robust testing environments, and policies that ensure human oversight.
Additionally, network teams, application teams, and security teams must collaborate. AI-driven solutions often aggregate insights from multiple domains – application performance data, security event logs, and infrastructure telemetry – and require buy-in from different groups. Misalignment or distrust between these stakeholders can stall AI initiatives. Investing in organizational readiness and cross-team communication is often as important as the technology itself.
Noction IRP, Artificial Intelligence and Machine Learning
Some platforms already demonstrate the practical value of data-driven decision-making in network management and are examining ways to leverage AI and ML in the future. One such solution is our Intelligent Routing Platform (IRP), which continuously monitors critical metrics – latency, packet loss, throughput, and historical reliability – to dynamically adjust routing in multi-homed BGP environments. By intelligently analyzing telemetry data, IRP streamlines operations, reduces manual route optimization, and frees engineers to tackle more strategic challenges.
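For readers unfamiliar with metric-driven path selection, the general idea can be sketched as follows. This is a generic illustration and not IRP's actual algorithm: the cost formula, the loss weight, the 10% improvement margin, and the provider names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    provider: str
    latency_ms: float
    loss_pct: float


def path_cost(m, loss_weight=50.0):
    """Lower is better: latency plus a penalty per percentage point of loss."""
    return m.latency_ms + loss_weight * m.loss_pct


def pick_path(candidates, current, min_gain=0.10):
    """Return the provider to use, switching only if cost drops by min_gain."""
    by_name = {m.provider: m for m in candidates}
    best = min(candidates, key=path_cost)
    current_cost = path_cost(by_name[current])
    # Require a clear improvement to avoid flapping between similar paths.
    if path_cost(best) < current_cost * (1 - min_gain):
        return best.provider
    return current


# Invented probe results for one destination prefix across three upstreams.
probes = [
    PathMetrics("transit-a", latency_ms=42.0, loss_pct=0.8),
    PathMetrics("transit-b", latency_ms=55.0, loss_pct=0.0),
    PathMetrics("transit-c", latency_ms=53.0, loss_pct=0.1),
]
print(pick_path(probes, current="transit-a"))  # transit-b
print(pick_path(probes, current="transit-c"))  # transit-c
```

In this sketch the lowest-latency path loses because of its packet loss, and the improvement margin keeps traffic on `transit-c` when the alternative is only marginally better. Both behaviors come from explicit, deterministic rules, which is part of why this style of analytics is easier for engineers to trust than an opaque model.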
While IRP’s current approach relies on analytics and algorithmic decision-making, Noction is considering how advanced AI and ML techniques could further enhance its capabilities. In the near future, machine learning models may help predict routing issues before they escalate, optimize paths more proactively, and support new threat mitigation strategies. In fact, Noction plans to introduce an anomaly detection feature within IRP’s Threat Mitigation module as early as Q1 2025, enabling automatic recognition of deviations from normal traffic patterns.
Beyond the hype
AI and ML hold the potential to transform network management, but the journey is far from a straightforward upgrade. Historical attempts at automation struggled with maintaining current and accurate mappings of infrastructure, and today’s AI-driven approaches encounter similar challenges – albeit on a larger scale with more sophisticated tools.
In the end, the best approach is one grounded in realism and caution. AI and ML can enhance the capabilities of network engineers, but they do not replace the need for expertise, curiosity, and careful validation. The future will likely feature a hybrid model, where intelligent tools assist rather than supplant human judgment. The organizations that succeed will be the ones that invest in the right data, the right processes, and the right culture – accepting that the path to “fully autonomous and intelligent” networks is a marathon, not a sprint.