In a detailed blog post, Facebook Network Engineer Petr Lapukhov describes how NetNORAD works and how it came about.
Keeping the company's massive network up and running is a top priority, but common troubleshooting methods were taking too long, he said.
"The ultimate goal is to detect network interruptions and automatically mitigate them within seconds.
In contrast, a human-driven investigation may take multiple minutes, if not hours," he wrote.
"Some of these issues can be detected using traditional network monitoring, usually by querying the device counters via SNMP or retrieving information via device CLI.
Often, this takes time on the order of minutes to produce a robust signal and inform the operator or trigger an automated remediation response."
In addition, Facebook engineers often encountered "gray failures," where the problem isn't detectable by traditional metrics or a device can't report its own malfunctioning, he said.
These issues led Facebook to build NetNORAD, which Lapukhov described as a system that treats the network like a "black box" and troubleshoots network problems "independently of device polling."
"While this does not constitute a complete fault detection system, we hope you can use these components as a starting point, building upon them with your own code and other open source products for data analysis," Lapukhov wrote.