I’d like to share an idea I implemented for AFTR (so I am describing it in the AFTR context) which is a part of the debug primer and which could be integrated into BIND 10.
AFTR is managed through control channels (over TCP or a stream Unix socket), like BIND 9's rndc but in a connected mode (so on the AFTR side they are named “sessions”). Four commands are of interest here:
The first command is named ‘noop’ and just polls the liveness of the process by returning an answer (or not returning one when the process is frozen).
The second command is named ‘fork’ and, as you can expect, calls the fork() Unix system call (which copies memory lazily, i.e. copy-on-write, on all modern systems). This command comes from a control channel C.
- the parent just closes C so C remains attached only to the child
- the child closes everything with the exception of C (in particular, it closes the socket used for the service) and reopens syslog. It then waits for commands from C, which now acts on a live image of the process as it was at the instant fork() was performed.
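To make the two bullets above concrete, here is a minimal sketch in C. It is not the AFTR source: the names cmd_fork and demo_fork_image, the syslog ident "aftr-image" and the one-byte answer are all assumptions made for illustration, and only one service socket is closed where a real daemon would close every descriptor other than C.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical 'fork' handler. ctl_fd is the control channel C the command
 * arrived on; service_fd stands in for the socket used for the service.
 * Returns the child's pid in the parent, 0 in the child, -1 on failure. */
pid_t cmd_fork(int ctl_fd, int service_fd)
{
    assert(ctl_fd != service_fd);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0) {
        close(ctl_fd);          /* parent: C stays attached to the child only */
        return pid;
    }
    /* Child: keep C, drop the service socket, reopen syslog. */
    close(service_fd);
    closelog();
    openlog("aftr-image", LOG_PID, LOG_DAEMON);  /* ident is an assumption */
    return 0;
}

/* Self-check: the image answers one byte on C and exits; the "operator"
 * end of a socketpair reads it. Returns 0 when everything went as expected. */
int demo_fork_image(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    int service_fd = dup(sv[1]);    /* placeholder for the service socket */
    pid_t pid = cmd_fork(sv[0], service_fd);
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(sv[1]);               /* the operator end is not ours */
        _exit(write(sv[0], "k", 1) == 1 ? 0 : 1);
    }
    close(service_fd);
    char c = 0;
    ssize_t n = read(sv[1], &c, 1); /* answer coming from the image */
    int status = 0;
    waitpid(pid, &status, 0);
    close(sv[1]);
    return (n == 1 && c == 'k' && WIFEXITED(status)
            && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The point of the sketch is the ownership of C: after fork() both processes hold it, and it is the close() in the parent that leaves the session talking to the image alone.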
The next command is ‘abort’ and calls the Unix abort(). It is supposed to be used after a ‘fork’ to get a core file, so you can use postmortem analysis tools on an image of a live process.
The last command is ‘reboot’ and restarts the process from the very beginning. It is implemented (at Rob’s suggestion) by closing everything and calling execv() with a copy of the arguments passed to main().
So the AFTR debug primer says:
“Summary for the busy operator:
- noop -> nothing: go to the shell to kill and relaunch it
- noop -> expected message: open another session, send fork, wait for the child pid message, send abort on this new session. On the previous session (where you sent noop), send reboot”
In the context of BIND 10 we can implement the same set of commands:
- I believe we already have the equivalent of ‘noop’
- a way to address commands to an image of a module is needed (but is not difficult)
- reboot by itself is not an interesting command, as the control already provides a way to relaunch a module, but it makes sense if the abort() on a critical inconsistency condition is replaced by a fork/abort/reboot sequence as described above. (I suggest naming this the ‘phenix mode’)
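As an illustration of that last point, the fork/abort/reboot sequence could be sketched in C as below. This is only a sketch under assumptions: dump_image and phenix are hypothetical names, nothing here is BIND 10 code, and whether a core file is actually written depends on the system's core limits.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork an image of the current process and abort() it, so that the core
 * file describes the process at the instant of the failure.
 * Returns 0 if the image died from SIGABRT, -1 otherwise. */
int dump_image(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        abort();                    /* child: the image dumps core */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;                  /* could not wait for the dump */
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) ? 0 : -1;
}

/* Hypothetical replacement for a bare abort() on a critical inconsistency:
 * dump an image, then restart from the very beginning. Never returns. */
void phenix(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    (void)dump_image();
    execv(argv[0], argv);           /* assumes argv[0] is a usable path */
    abort();                        /* execv failed: keep the old behaviour */
}
```

Compared with a plain abort(), the service comes back without waiting for an external supervisor, and the core file is still there for postmortem analysis.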