Deprecation notice for BIND 9: "resolver-nonbackoff-tries", "resolver-retry-interval"

Fred Morris m3047 at m3047.net
Thu Dec 7 21:12:51 UTC 2023


I welcome birds of a feather. We need to define / refine the problem
statement first.

On 12/7/23 12:30 AM, Petr Špaček wrote:

> On 07. 12. 23 1:05, Fred Morris wrote:
>> On Wed, 6 Dec 2023, Evan Hunt wrote:
>> I say go ahead, if nothing else consider it a "scream test". But can
>> you take a moment and tell us which stakeholder group(s) you think
>> you're optimizing for, why, and how?
>
> On the technical level we optimize using real (anonymized!) traffic
> provided to us by operators. Here's what we need:
> https://kb.isc.org/docs/collecting-client-queries-for-dns-server-testing
>
> If you want us to optimize for your use-case let's talk how we can get
> the data and replicate your setup!

I run Dnstap (for $reasons), but I'd be able to run dnscap, and from the
look of that KB page you only want the queries. I'm not sure that really
captures the qualitative issue(s). I plan to dig into this some more
over the winter anyway; maybe I should turn the tables and ask whether
there are other systemic issues I should look at or for?

I'm using DNS largely for purposes other than FQDN -> address mapping.
The things I've written have gotten enough uptake that I'm past the
"kook" stage and into the "conspiracy" stage, but although I get some
feedback at this point it's all basically anecdotal I don't have a
"movement" that I can ask for disciplined feedback. I've done a number
of different things poking at the same elephant over the past few years,
and what I consistently see is a focus on "a query and a response" and
I'm not sure that that is adequate systems thinking for the issues at
hand. There seem to be a number of them, and they all point to
inadequate systems thinking. That happens. As a neighboring example,
adding more packet buffering to routers and wifi hotspots should be an
unambiguous Good Thing, right? Even a decade after finding out that it's
not, there are still people and constituent groups which haven't gotten
the memo.

The key thing I'm going to set up and examine this winter is the impact
of qname minimization. But there are enough of these issues that maybe
some sort of memo is in order. Maybe somebody else wants to work on it
with me?


So here are some things I've noticed about DNS in the field and the
lack of systems thinking behind them. The first two (frags and TC=1)
are fairly well known, and are provided as examples of where systems
thinking is weak and what that means. But most importantly: "systems
thinking in the DNS is provably weak".


Frags. Frags are good? No, they are bad. If a single UDP frag isn't
delivered, the packet can't be reassembled. The server thinks all is
fine and good and Procrustes' algorithm has made it all fit, but the
packet failing to be reassembled means that at the application layer no
reply was received from the server. It really doesn't matter whether
TC=1 is set or not, because it will never make it to the application. If
traffic shaping mistakenly and simplistically decides "dropping UDP is
ok", that goes double for UDP frags.
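
(By way of illustration, and not something from the thread: one common
defensive measure is to advertise an EDNS buffer size small enough that
big answers come back with TC=1 instead of as fragments that may
silently vanish. A minimal sketch with dnspython; the server address is
a placeholder, and 1232 is the "DNS flag day 2020" value.)

    import dns.message
    import dns.query

    # Advertise a 1232-byte EDNS buffer so answers too large for one
    # unfragmented UDP packet come back truncated (TC=1) rather than
    # fragmented and possibly lost in reassembly.
    q = dns.message.make_query("example.com", "A", use_edns=0, payload=1232)
    r = dns.query.udp(q, "192.0.2.53", timeout=2.0)
    print(r.flags)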

TC=1 is permission-based; a different implication: what if a query only
works over TCP? There is no provision in the algorithm to try TCP if no
response is received via UDP. The 1980s recursion algorithm makes the
decision to use TCP a polite-society thing. The querant doesn't just try
it. It waits for the server to say "here you are, this is what I can do
for you; but I encourage you to please try again with TCP", and the
querant thinks "oh how nice of you, what an excellent idea; thank you, I
will". There is no provision in the algorithm to unilaterally try TCP
when UDP has failed to perform well or at all. This is arguably most
important for stub resolvers. If the issue were simply buffer bloat,
then forcing queries over TCP wouldn't provide observably better
performance (which it often does, and that is why this is worth
mentioning). The suspicion has to be traffic shaping, but I don't know
that that's the case; crappy SOHO routers are largely black boxes. As an
aside: are people still blocking TCP/53? It wasn't that long ago that
this was conventional security theater.
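
(To make "unilaterally try TCP" concrete, here's a rough sketch of what
such an algorithm could look like; dnspython, placeholder server
address, made-up timeouts. Fall back to TCP on a UDP timeout as well as
on TC=1, instead of waiting for permission.)

    import dns.exception
    import dns.flags
    import dns.message
    import dns.query

    def resolve(qname, server="192.0.2.53"):
        q = dns.message.make_query(qname, "A")
        try:
            r = dns.query.udp(q, server, timeout=1.0)
        except dns.exception.Timeout:
            # Unilateral fallback: UDP never answered, try TCP anyway.
            # This is the step the classic algorithm doesn't take.
            return dns.query.tcp(q, server, timeout=3.0)
        if r.flags & dns.flags.TC:
            # The polite path: the server said "please try again with TCP".
            return dns.query.tcp(q, server, timeout=3.0)
        return r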


Aggressive UDP retry presumes fast responses over correct ones, or at
least "correct enough" even if not the most timely. In pursuit of happy
eyeballs, speed over everything else! The fastest thing is a static zone
file which never changes. But the real world today encompasses
forwarders as well as database backends (and this is for FQDN -> address
mappings!), and in the quest for the fastest possible response caches
get built on top of the database so that something can be served meeting
the objectives of what is measured (response time). Without going into
technical details, please accept that this increases complexity and the
work that needs to be done to keep what's served to the querant as fresh
as practicable. On the other hand, if a typical response time of 1/10th
of a second is acceptable, there's time to wait for the database and no
need for the additional complexity. Some datastores might take even
longer than that (nobody cares about happy eyeballs in that use case).

What is the reason for caching resolvers? We see proposals to do
prefetch for answers which are soon to expire from cache. If the network
is slow enough that that matters, and it works, why the continued
obsession with superfast authoritative responses? If this is a prefetch,
is the retry schedule less aggressive?
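
(A sketch of the prefetch idea, mine and not any particular resolver's
implementation: refresh a cache entry when its remaining TTL falls below
some threshold, so the querant never waits on the authoritative. The
window value is an assumption.)

    import time

    PREFETCH_WINDOW = 10  # seconds of remaining TTL; an assumed value
    cache = {}            # qname -> (answer, expiry time)

    def lookup(qname, fetch):
        """fetch(qname) -> (answer, ttl) asks the authoritative."""
        now = time.time()
        entry = cache.get(qname)
        if entry is None or entry[1] <= now:
            answer, ttl = fetch(qname)   # cold or expired: querant waits
            cache[qname] = (answer, now + ttl)
            return answer
        answer, expiry = entry
        if expiry - now < PREFETCH_WINDOW:
            # Still fresh enough to serve, but refresh now so the next
            # querant doesn't pay the authoritative's latency. A real
            # resolver would do this asynchronously, and could use a
            # less aggressive retry schedule since nobody is waiting.
            new_answer, ttl = fetch(qname)
            cache[qname] = (new_answer, now + ttl)
        return answer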

Aggressive UDP retry mints unique requests. Anecdotally it has been
observed that the aggressive retries from caching resolvers directed at
authoritatives mint a new query id for each retry. (I have to ask,
because of the next item on the list:) is this a de-optimization in the
name of privacy? The same application (a caching resolver) is issuing
what are, as far as the protocol is concerned, different queries which
presumably could have come from different applications on the same host.
If there's a full-blown recursive resolver living on the host, wouldn't
those apps avail themselves of the resource? Personally I would hope so.
So can the authoritative server debounce (reply to only one request
within some time period), or does it have to reply to each and every one
of them on the off chance that they're coming from different
applications? (And if they're using the stub resolver, shouldn't it be
caching? And if they're not using the stub resolver, maybe their "very
good reason" should include dealing with whatever the issue is and not
passing it off to the DNS? Or failing that, maybe a competent sysadmin
should be sandboxing that app with a caching resolver in front of it?)
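
(What debouncing could look like on the authoritative side; a sketch
under my own assumptions about what counts as "the same" request, not
anything an existing server is known to do. Note the key deliberately
ignores the query id, since retries mint a new one each time.)

    import time

    DEBOUNCE_WINDOW = 0.5   # seconds; an assumed value
    recently_answered = {}  # (client ip, qname, qtype) -> last answer time

    def should_answer(client_ip, qname, qtype):
        """False if this looks like a retry we've already answered."""
        key = (client_ip, qname, qtype)
        now = time.time()
        last = recently_answered.get(key)
        if last is not None and now - last < DEBOUNCE_WINDOW:
            return False  # debounced: drop it on the floor
        recently_answered[key] = now  # a real server would expire these
        return True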

Qname minimization generates more requests. Without explaining in
detail what qname minimization is for or what it entails: traditionally,
a DNS query contains the full query name when sent to every
authoritative server, regardless of whether it is conceivable that that
server can answer the query rather than providing a referral; with qname
minimization this is not the case, and the query is tailored to the type
of response(s) the authoritative server is anticipated to be able to
provide. Aside from the additional traffic, the crux of most of what can
go wrong happens with empty non-terminals. Empty non-terminals are
comparatively rare in the portions of the namespace utilized for FQDN ->
address mapping. Based on this observation, maybe it should be limited
to that use case? Why aren't there tuning / configuration options around
this? (I won't be surprised if there are, for at least some
implementations.)
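
(For anyone who hasn't looked at it, a rough sketch of the difference
in what the authoritatives see; the probe qtypes and stopping rules
vary by implementation, RFC 9156 has the details.)

    def traditional_queries(fqdn, qtype="A"):
        """Every authoritative in the chain sees the full name."""
        return [(fqdn, qtype)]  # the same question at every step

    def minimized_queries(fqdn, qtype="A"):
        """Each authoritative sees only one more label than it needs."""
        labels = fqdn.rstrip(".").split(".")
        out = []
        for i in range(len(labels) - 1, -1, -1):
            name = ".".join(labels[i:]) + "."
            # Implementations differ on the probe qtype; asking NS on
            # the way down and the real qtype at the end is one choice.
            out.append((name, "NS" if i > 0 else qtype))
        return out

    print(minimized_queries("www.dept.example.com"))
    # [('com.', 'NS'), ('example.com.', 'NS'),
    #  ('dept.example.com.', 'NS'), ('www.dept.example.com.', 'A')]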


If this resonates with you, feel free to reach out. If you use the
trualias morris.dns.systems.thinking.r3y at m3047.net, that will help me
manage things if there are more than a handful of interested parties.

--

Fred Morris



