Angelo Failla, Production Engineer, Facebook
Why did Facebook need a new DHCP solution?
We use DHCP to provision servers in our production datacenters, both for bare-metal provisioning (to install the operating system) and to assign addresses to the out-of-band management interfaces. Our old system was based on ISC dhcpd and static configuration files generated from our inventory system. We loaded them into our DHCP servers using a complex, periodic svn/rsync-based pipeline, restarting the DHCP server(s) to pick up the changes.
This took longer than we wanted. At our scale, parts are being added or replaced all the time (both NICs and servers). The DHCP servers were spending more time restarting to pick up the changes than serving actual traffic. The reconfiguration pipeline itself was also slow: changes sometimes took around 3 hours to propagate, slowing down repair times in the datacenters.
In short, we wanted a faster way to bootstrap hardware in our datacenters after maintenance or expansion.
Facebook has very high availability standards for all our datacenters. For redundancy we had two physical DHCP servers (in an active/standby configuration) in each cluster of servers. The problem was that if we lost both DHCP servers in a cluster, the cluster lost DHCP completely. We wanted a more flexible approach, where every DHCP server in the network would be able to serve requests coming from any machine.
A few years ago we replaced all of our hardware load balancers with software load balancers based on Linux IPVS and Proxygen (which we open-sourced). We decided to do something similar for DHCP. We created a virtual cluster of DHCP servers, with individual instances distributed around the network. We use Anycast and BGP to address these DHCP servers with a single set of addresses. This allowed us to simplify our cluster/datacenter bootstrapping processes and to recover better if both DHCP servers in a cluster fail locally.
Why did Facebook decide to use Kea?
We liked the fact that Kea is modern software and is designed to be extensible. Kea has hook points where you can add your own logic to parse incoming DHCP packets and to modify outgoing ones right before they leave the server's network interface. We leveraged the hooks feature extensively to customize Kea to meet our requirements.
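To make the hook idea concrete, here is a minimal sketch in Python. It is only an analogy: Kea hooks are written in C++ against Kea's own API, and the `HookRegistry` class, the `add_boot_options` callout, and the option values below are all hypothetical names invented for illustration. The hook point name `pkt4_send` is borrowed from Kea, where it fires just before a DHCPv4 response is sent.

```python
from collections import defaultdict


class HookRegistry:
    """Toy stand-in for a server's hook mechanism (not Kea's real C++ API)."""

    def __init__(self):
        # Maps a hook point name to the callouts registered on it.
        self._callouts = defaultdict(list)

    def register(self, hook_point, callout):
        """Attach a callout to a named hook point."""
        self._callouts[hook_point].append(callout)

    def call(self, hook_point, packet):
        """Run every callout registered on the hook point, in order."""
        for callout in self._callouts[hook_point]:
            callout(packet)
        return packet


def add_boot_options(packet):
    # Hypothetical callout: inject network boot options into the reply.
    # The file name and address are placeholder values.
    packet["options"]["boot-file-name"] = "pxelinux.0"
    packet["options"]["tftp-server"] = "192.0.2.10"


registry = HookRegistry()
registry.register("pkt4_send", add_boot_options)

# The server would invoke the hook point right before sending the reply.
reply = {"client": "aa:bb:cc:dd:ee:ff", "options": {}}
registry.call("pkt4_send", reply)
```

The point of the design is that the server core stays untouched: site-specific behaviour lives in callouts that are loaded alongside it.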
Although one of Kea’s big advantages over ISC DHCP is that Kea is dynamically reconfigurable, we didn’t care about that. We wanted to centralize as much configuration data as possible and run a stateless DHCP service. We planned to deploy in Tupperware (our Linux container technology, roughly equivalent to Google’s Borg). We didn’t want to package huge configuration files with the application, nor did we want to maintain this data in multiple places on the network. What we have developed is simple and fast to deploy: we just install the Kea binary with a very basic configuration file, and it fetches the rest of the information dynamically from our inventory system. We maintain the client configuration information, such as host allocation and subnets, centrally in our inventory system. This simplifies DHCP server deployment and ongoing configuration maintenance.
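The stateless approach described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual hook library (which is C++): the `Inventory` class, its `lookup` method, the record fields, and the sample addresses are all hypothetical. The idea it shows is that each request is answered purely from centrally held data, so the server itself keeps no per-client state or local host configuration.

```python
class Inventory:
    """Hypothetical stand-in for a central inventory service."""

    def __init__(self, hosts):
        # In a real deployment this would be a remote call, not a dict.
        self._hosts = hosts

    def lookup(self, mac):
        """Return the host record for a MAC address, or None if unknown."""
        return self._hosts.get(mac.lower())


def handle_request(mac, inventory):
    """Build a reply purely from inventory data; no local state is kept."""
    host = inventory.lookup(mac)
    if host is None:
        return None  # Unknown hardware: this sketch simply does not answer.
    return {
        "mac": mac.lower(),
        "address": host["address"],
        "subnet": host["subnet"],
    }


# Placeholder data using documentation address ranges.
inventory = Inventory({
    "aa:bb:cc:dd:ee:ff": {"address": "192.0.2.21", "subnet": "192.0.2.0/24"},
})

known = handle_request("AA:BB:CC:DD:EE:FF", inventory)
unknown = handle_request("00:11:22:33:44:55", inventory)
```

Because any instance can answer any request from the same central data, instances are interchangeable, which is what makes deploying them in containers straightforward.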
How hard was it to design and implement this stateless system?
It wasn’t hard. The Doxygen documentation on the Kea website was very clear. I liked that the docs came with a section that discussed the hook API using a simple library example.
When we started using Kea, there was no support for network boot options, so we wrote an external C++ hook module to do this. Being able to inject our own logic into the server gave us a lot of new possibilities: suddenly we were able to integrate the server with the rest of our infrastructure. Things like submitting stats to our monitoring system and having nice dashboards, sending logs to our Scribe infrastructure, alarming, and working around bad DHCP clients were now possible without writing any Python glue code.
Kea also helped us write a workaround to bypass some TFTP firmware issues we encountered when we turned up our first ever IPv6-only cluster last summer.
How long did this project take?
We started working with Kea in February 2014. It took around 1 month to get a proof-of-concept hook deployed in a single cluster: 1 week was spent playing with the vanilla version of Kea and reading docs, another 2 weeks were spent writing and benchmarking middleware C++ code to talk to our inventory system, and the final week was spent turning that code into the actual first hook library.
It took roughly 2 more months to productionize our setup and deploy it in a few production clusters, integrating things with our provisioning infrastructure and fixing various bugs and issues.
By the end of month 4 (end of May 2014) we were serving all of our DHCP clients using Kea! Since then we have been improving our hook library, refactoring things, and dealing with new requirements.
What was the result?
We now use hooks to request server configuration information dynamically from our inventory system, instead of generating static configurations. We can bring up or reload a server cluster faster. With our old system, it could take hours for changes to propagate down to the DHCP servers. With this new stateless design, we have been able to eliminate DHCP server reloading and cut the overall time for changes in our inventory system to propagate down to 1 or 2 minutes.
We have also significantly simplified the routing. Now DHCP servers anywhere on the network can respond to any client request on a shared anycast address. This has simplified our setup: we use DHCP agents running on our top-of-rack switches, so using a single global anycast IP in their configuration saves us time when bootstrapping new clusters or datacenters. We have also improved cross-cluster redundancy and eliminated the need for hardware load balancers.
We also have beautiful dashboards and metrics which help us in troubleshooting provisioning issues.
What suggestions and observations do you have for others?
- For datacenter applications, I highly recommend using a stateless DHCP approach. Keep state and configuration data separate from the application, handing it over to external systems. It simplifies configuration management and server deployment.
- Re-use existing open source solutions, such as Kea, and try to fight the “not invented here” syndrome. Write something from scratch only if what is available doesn’t satisfy your requirements. And when you do, try to open source it.
About Angelo Failla
Angelo is a Production Engineer for Facebook in the Dublin, Ireland Cluster Infrastructure team, working on cluster/datacenter automation. Angelo has over 10 years of experience as a system engineer and over 15 years of experience as a programmer. He uses his coding skills (bash/python/C++ and others) to improve operations at Facebook with automation and auto-remediation. Angelo has spoken at PyCon Ireland 2014, SREcon15 Europe and DevOps Italia 2015. Originally from Sicily, Angelo’s own Facebook page is dominated by pictures of food. Angelo can be reached via Twitter (@pallotron), LinkedIn, and of course, Facebook.
Angelo Failla gave a talk on the DHCP infrastructure at Facebook at the SREcon15 Europe conference in Dublin. Both slides and a video recording are posted.