The BIND 9 core development team includes three people who focus on quality assurance. Given the size of the BIND legacy codebase, and the activity level of the larger development team, ensuring the quality of our monthly BIND releases is akin to the task given to Hercules, to clean out the Augean Stables in a day.
|Code base|680,000 lines of code (approximately)|
|Changes per month|50 - 150 MRs merged|
|Core developers|9 engineers, including 3 QA specialists|
The Synopsys Black Duck Open Hub site rates BIND 9 as “Very High Activity,” which means the rate of change in the repo is unusually high compared to other open source projects. This high rate of change, along with the overall size of the codebase, makes ensuring quality a particular challenge.
To get an update on what this team is doing, and how they are managing this task, I interviewed the team leader, Michał Kępień, via email. Clearly an above-average ability to prioritize tasks is important in this role: the answers below came back about five months later.
What are the main responsibilities of the BIND QA team? (I am thinking about release operations, packaging, maintaining the build farm, monitoring performance lab, triaging bugs, CVE processes, etc.)
We try to help everywhere we are needed, but our “official” day-to-day duties revolve around:
Improving existing tools (and developing new ones) which help the developers make informed decisions and ensure the code being committed is not broken in one way or another,
Overseeing the monthly release preparation process (which includes enforcing the schedule, looking for missing bits, polishing documentation, examining final test results, packaging, and more),
Maintaining the CI environment (keeping the list of operating systems we run tests on up-to-date, monitoring capacity, tweaking settings for optimal resource usage, and more).
However, we also help developers reproduce bugs, review merge requests, carry out ad hoc tests on request, and many other things, depending on current needs.
What things make BIND QA challenging?
The same things which make maintaining and improving BIND 9 code challenging: the DNS protocol itself is fairly complex, the deployment base is huge (which means the number of use cases out there is practically unlimited), every deployment environment is different (both in terms of the hardware/software platform used and ever-changing network conditions), and there is a lot of source code which was not written with testability in mind. This means we have to prioritize to at least cover the most typical scenarios.
What kinds of tests do we do on every commit as part of our CI?
The code is built and tested (unit tests + system tests) on several popular Linux distributions, FreeBSD, and Windows (where applicable). Some of those builds employ various sanitizers (ASAN, UBSAN, TSAN). Both GCC and Clang are used; compilation warnings are treated as errors on the supported platforms.
Apart from the above, other tools are also run to ensure consistency of coding style (clang-format and Coccinelle for C code; flake8 and Pylint for test code written in Python), enforce the development process we follow (Danger), and detect the more obvious bugs early (Clang Static Analyzer).
All in all, about 70 jobs are run for every revision of each merge request. On top of that, scheduled pipelines are started for each maintained branch on a daily basis; these include a few extra jobs whose purpose is to either fill the gaps in platform coverage or run tests which take too long to be invoked for every merge request (e.g. some performance tests, respdiff).
What other QA processes do we do on a release candidate?
In terms of code testing, it is not so much a question of what extra tests we run, but what test results we look at closely before signing off on a release. All of the tests run for release tarballs are automated and therefore also run (at least periodically) for preceding revisions of the source tree. At release preparation time, we compare current performance results with those obtained for previous releases and analyze intermittent test failures to ensure they are not manifestations of lurking bugs (usually they turn out to be test code deficiencies, but it is not always immediately obvious).
We also clean up the release notes and verify whether the documentation changes introduced since the previous release are accurate, correct, and complete.
How much time do you require between code freeze and release of a maintenance version? What happens during that time?
It varies case by case. In a typical release, it takes about two days to do the things listed in my response to the previous question. That may sound like a lot, but note that in certain months, we have five releases to prepare: 9.17, 9.16, 9.11, plus Subscription Editions: 9.16-S and 9.11-S. Sometimes we wrap up within a day, but then other times some nasty bug is found at the very last minute and that throws a spanner into the works.
What tasks take the most time for the team?
It really is a mixture of all of the things listed above. While we try to make sure the infamous bus factor is above 1 for critical work, each one of us has their own niche of a sort in terms of what we spend most of our time on. Scheduling and prioritizing can be tricky at times because on the one hand, an innocent-looking OS update might trigger issues which take days to solve, while on the other hand a task we anticipated would take weeks to complete can sometimes be finished sooner.
How much has our increased effort at packaging added to the workload?
Most of the work required to make packaging work was a one-off effort of setting up a build system and integrating it with Cloudsmith. It took a while, but things are pretty stable these days and ongoing maintenance boils down to bumping version numbers and applying occasional tweaks to the packaging recipes or the testing scripts when something breaks in a new release.
Of the different types of testing - static analysis, packet fuzzing, unit tests, system tests, build tests, performance tests, security testing, and so on - where do you think we have good coverage/effective testing, and where could we improve?
I think we are doing pretty well in terms of testing the build process. We used to get non-trivial numbers of reports about broken builds after each public release. This effect seems to have subsided in the past months and I think GitLab CI running on a reasonably broad spectrum of platforms, combined with pairwise testing of build options, played a major role in that.
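To illustrate what pairwise testing of build options means in general, here is a minimal Python sketch. It is not the tooling ISC uses: the option names are hypothetical placeholders, and the greedy selection is just one simple way to build a pairwise set. The idea is to pick a small number of build configurations such that every pair of option values appears together in at least one of them, instead of building every possible combination.

```python
from itertools import combinations, product


def pairwise_configs(options):
    """Greedily choose configurations until every pair of option values
    appears together in at least one chosen configuration."""
    names = sorted(options)

    # Every pair of (option, value) settings that must be exercised together.
    uncovered = set()
    for a, b in combinations(names, 2):
        for va, vb in product(options[a], options[b]):
            uncovered.add(((a, va), (b, vb)))

    # Enumerate the full product; acceptable for a small number of options.
    candidates = [dict(zip(names, values))
                  for values in product(*(options[n] for n in names))]

    def covers(cfg, pair):
        (a, va), (b, vb) = pair
        return cfg[a] == va and cfg[b] == vb

    chosen = []
    while uncovered:
        # Pick the configuration that covers the most still-uncovered pairs.
        best = max(candidates,
                   key=lambda cfg: sum(covers(cfg, p) for p in uncovered))
        chosen.append(best)
        uncovered = {p for p in uncovered if not covers(best, p)}
    return chosen


# Hypothetical build options, for illustration only.
options = {
    "compiler": ["gcc", "clang"],
    "feature_a": ["enabled", "disabled"],
    "feature_b": ["enabled", "disabled"],
    "feature_c": ["enabled", "disabled"],
}

configs = pairwise_configs(options)
print(f"{len(configs)} configurations instead of 16")
for cfg in configs:
    print(cfg)
```

The saving grows quickly with the number of options: the full product doubles with every additional on/off switch, while a pairwise set grows much more slowly.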
The development team also managed to eradicate all known issues reported by various sanitizers (ASAN, UBSAN, TSAN) in BIND 9.16. The recent refactoring of the dispatch code opened up some new code paths, leading to new warnings which need to be addressed, but I am confident we will get these sorted out over time.
We are continuously improving the scope of our internal performance tests. The goal here is to be able to make informed design choices based on solid data rather than just gut feelings and/or educated guesses.
Fuzzing tests these days seem to have reached a point of diminishing returns in terms of issues discovered in existing code, but they allow us to sleep better at night, knowing that any issues with new code will be detected in due course.
As for unit and system tests, the challenge here is that writing them is a retroactive effort in the case of BIND 9: while we are writing tests whenever possible for new code, there is still a large volume of code which was committed in the past and not accompanied by tests. In terms of the ratio of lines of code covered by unit and system tests, we are currently just shy of 80%, but this is just one of the applicable metrics.
How do you evaluate the effectiveness of our internal QA efforts? For example, do we track how many bugs we find in internal testing vs. external testing? Do you have a sense of whether we are finding a healthy proportion internally?
Tongue-in-cheek: as long as we are getting any external reports about actual bugs, it means there is room for improvement in our internal testing.
Given the above, I am afraid we do not do any kind of tracking or statistics. With the resources we have, we try to prioritize fixing problems, of which there is an abundance.
Do you feel good about our ability to prevent the recurrence of previously discovered and fixed bugs?
Yes, definitely. I am more concerned about our ability to predict future problems and/or catch mistakes in new code before it gets released to the public than I am about our regression suite. Given the number of test jobs we run, even rarely occurring but known problems should become exposed over time.
What about performance testing and preventing performance regressions? What do we do in that area, and to what extent is it ad hoc vs. a regular automated process?
Performance evaluation is currently not fully automatic: the tests are run automatically on a regular basis, but their results are examined by humans. Resolver performance in particular is a multi-dimensional subject and there is no single metric that would allow one to look at it and say “this is unequivocally better/worse than before.” We are, however, exploring possible solutions for automatic flagging of drastic shifts in performance numbers.
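As a rough illustration of what automatic flagging of drastic shifts could look like for a single metric, here is a minimal Python sketch. It is an assumption for illustration, not a description of what the BIND team has built: the function name, the threshold, and the sample numbers are all hypothetical. It compares the latest result with the median of earlier runs, scaled by the median absolute deviation (MAD), so that one noisy run in the history does not distort the baseline.

```python
from statistics import median


def flag_shift(history, latest, threshold=5.0):
    """Return True if `latest` is drastically different from `history`.

    `history` holds earlier results for one metric (for example, queries
    per second); `threshold` is how many MADs away counts as drastic.
    """
    if len(history) < 3:
        return False  # too little data to judge

    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        # All earlier runs were identical; any change stands out.
        return latest != center
    return abs(latest - center) / mad > threshold


# Hypothetical throughput numbers from earlier runs.
previous_runs = [100.0, 102.0, 99.0, 101.0, 100.5]

print(flag_shift(previous_runs, 101.2))  # within normal variation
print(flag_shift(previous_runs, 80.0))   # large drop
```

A check like this only covers one number at a time, which is exactly the limitation described above: it can point a human at a suspicious run, but it cannot decide whether resolver performance as a whole got better or worse.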
What are the BIND QA accomplishments in the past year or two that you are most proud of?
The most significant accomplishment was Petr Špaček’s work leveraging the CZNIC resolver performance tools for benchmarking BIND 9. (Note: this is a realistic test bed for benchmarking resolver operations, which Petr describes in this talk at RIPE 79.)
The other thing I am happy about is that we have managed to establish and maintain a monthly release cadence, which is more challenging than it may sound.
If you had more time or resources, what are some projects you would love to tackle?
Given the current state of the worldwide software industry, starting an alpaca ranch has crossed my mind more than once…
…oh, you mean for BIND 9? Due to the nature of our work, we discover new research and experimentation opportunities almost on a daily basis and each one of us has some ideas on what we could follow up on if there were time. There is room for improvement in the way we write tests, store their results, visualize them, and track trends over time. Sometimes, we allow ourselves to go down a rabbit hole or two for the fun of it as it helps prevent burnout and sometimes even results in new automated tests being implemented. But usually there is enough priority work in the queue to force us to defer “greenfield research” until some undetermined point in the future.
If you could order some new tools to be “magicked into existence,” what would you like to have?
If I could ask for a pony, I would like to have a tool that would give us deeper insight into how people use our software: what configuration options they use, how that relates to the nature of their user base, what platforms they run our software on, etc. Given our huge deployment base, it is challenging to assess whether certain changes we make are good for at least the majority of our users - and we have to resort to educated guesses. For some changes, we consult the community through our mailing lists, but even though the subscriber lists are of substantial size, it is still just a small fraction of the entire user population.
For years already, we have had internal discussions about potential solutions that would allow such “reporting” in a secure and anonymous way, but finding the sweet spot between “spyware” and “white noise generator” is quite challenging. But hey, you said magic was on the table.
Are there free open source tools we are using that you would recommend to other open source projects?
It may not be the type of thing you asked about, but for bug reproduction and troubleshooting I can certainly recommend rr. Seriously, if you have not tried it, drop everything and try it now as it might revolutionize the way you approach troubleshooting.
What can users do to help improve BIND quality? How important are open source user bug reports to our overall quality process?
Bug reports are always much appreciated as long as the reporter provides us with actionable information (we provide a GitLab issue template to indicate what we consider useful information) and/or is willing to cooperate when asked for more information (or experiments). Some classes of bugs are pretty much impossible to track down without extensive help from the reporter, either due to the platform used, the specific network conditions in effect at a given time, or other dynamic factors.
Please also remember that while we would love to make our software 100% bug-free, we have to prioritize and we do not have the resources to fix every single problem reported. That does not mean we do not appreciate the reports and users' cooperation, though.
This sounds like a good opportunity to remind people that kindness goes a long way in the open source world. There is no better way to make sure your problem will be ignored than through obnoxiousness.
How do you keep the QA staff motivated, and how do you maintain your own motivation in the face of a fairly large volume of ongoing issues?
I touched upon that in one of the responses above: going down a rabbit hole that fascinates you in one way or another from time to time tends to be good for morale. Rabbit holes can be opportunities in disguise because you never know when exploring new areas of knowledge or experimenting with new tools (even seemingly unrelated) will make you more effective at your day job and/or make the product you are working on better. Every job involves bits that you hate about it - the important part is to make sure those ugly bits do not take most of your work time.