Clients get DNS timeouts because ipv6 means more queries for each lookup

Wed Jul 13 22:27:59 UTC 2011

On 7/13/2011 2:39 PM, Jonathan Kamens wrote:
>
>> I agree that the order of the A/AAAA responses shouldn't matter to the
>> result. The whole getaddrinfo() call should fail regardless of whether
>> the failure is seen first or the valid response is seen first. Why?
>> Because getaddrinfo() should, if it isn't already, be using the RFC
>> 3484 algorithm (and/or whatever the successor to RFC 3484 ends up
>> being) to sort the addresses, and for that algorithm to work, one
>> needs *both* the IPv4 address(es) *and* the IPv6 address(es)
>> available, in order to compare their scopes, prefixes, etc.
>
> RFC 3484 tells you how to sort addresses you've got.
>
> If you've only got one address, then bang! It's already sorted for 
> you. You don't need RFC 3484 to tell you how to sort it.
>
No, you've got one address, and one unspecified nameserver failure. 
Garbage in, garbage out. To say that a nameserver failure is equivalent 
to NODATA is not only technically incorrect, it leads to all sorts of 
operational problems in the real world.
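
To make the distinction concrete: NODATA is a successful (NOERROR)
response that simply carries no records of the requested type, whereas
a server failure says nothing at all about whether such records exist.
Below is a minimal sketch in Python of keeping the two apart. The
function name and the outcome labels are made up for illustration; only
the RCODE numbers come from the DNS specifications.

```python
# RCODE values defined by the DNS specifications (RFC 1035).
NOERROR = 0
SERVFAIL = 2
NXDOMAIN = 3


def classify_response(rcode: int, answer_count: int) -> str:
    """Label one DNS response (hypothetical helper, labels are illustrative).

    answer_count is assumed to be the number of records of the
    requested type found in the answer section.
    """
    if rcode == NOERROR:
        # NOERROR with no matching records is NODATA: the name exists,
        # but it has no records of the requested type.
        return "ANSWER" if answer_count > 0 else "NODATA"
    if rcode == NXDOMAIN:
        return "NXDOMAIN"
    # SERVFAIL, REFUSED, etc.: the server told us nothing about the data.
    return "SERVER_FAILURE"
```

A resolver library that maps "SERVER_FAILURE" onto "NODATA" discards
exactly the information an operator needs in order to find the broken
nameserver.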

> I have to say that some of the people on this list seem completely 
> detached from what real users in the real world want their computers 
> to do.
>
Really? Do you think I'm an academic? Do you think I sit and write 
Internet Drafts and RFCs all day? No, I'm an implementor. I deal with 
DNS operational problems and issues all day, every workday. And I can 
tell you that I don't appreciate library routines making wild-ass 
assumptions that, in the face of some questionable behavior by a 
nameserver, maybe, possibly some quantity of addresses that I've 
acquired from that dodgy nameserver are good enough for my clients to 
try and connect to. No thanks. If there's a real problem I want to know 
about it as clearly and unambiguously as possible. I can't deal 
effectively with a problem if it's being masked by some library routine 
doing something weird behind my back.
>
> If I am trying to connect to a site on the internet, then I want my 
> computer to do its best to try to connect to the site. I don't want it 
> to throw up its hands and say, "Oh, I'm sorry, one of my address 
> lookups failed, so I'm not going to let you use the /other/ address 
> lookup, the one that succeeded, because some RFC somewhere could be 
> interpreted as implying that's a bad idea, if I wanted to do so." 
> Please, that's ridiculous.
>
No, what's more ridiculous is if users can't get to a site SOME OF THE 
TIME, because someone's DNS is broken, a moronic library routine then 
routes the traffic some unexpected way, and a whole raft of other 
variables enter the picture, without anyone realizing or paying 
attention to the dependencies and interconnectivity that is required to 
keep the client working. There is a certain threshold of brokenness 
where the infrastructure has to "throw up its hands", as you put it, and 
say "nuh uh, not gonna happen", because if you try to work around the
problem without enough information about the topology, the
environment, the dependencies, etc., you're likely to cause more harm
than good by making the failure modes way more complex than necessary.
>
>> If one of the lookups "fails", and this failure is presented to the
>> RFC 3484 algorithm as NODATA for a particular address family, then the
>> algorithm could make a bad selection of the destination address, and
>> this can lead to other sorts of breakage, e.g. trying to use a
>> tunneled connection where no tunnel exists.
>
> If the address the client gets doesn't work, then the address doesn't 
> work. How is being unable to connect because the address turned out to 
> not be routable different from being unable to connect because the 
> computer refused to even try?
>
Because the failure modes are substantially different and it could take 
significant man-hours to determine that the root cause of the problem is 
actually DNS brokenness rather than something else in the network 
infrastructure (routers, switches, VPN concentrators, firewalls, IPSes, 
load-balancers, etc.) or in the client or server (OS, application, 
middleware, etc.)

Have you ever actually troubleshot a difficult connectivity problem in a 
complex networking environment? Trust me, you want clear symptoms, clear 
failure modes. Not a bunch of components making dumb assumptions and/or 
trying to be "helpful" outside of their defined scope of functionality. 
That kind of "help" is like offering a glass of water to a drowning man.
>
>
>> Another possibility you're not considering is that the invoking
>> application itself may make independent IPv4-specific and
>> IPv6-specific getaddrinfo() lookups. Why would it do this? Why not?
>> Maybe IPv6 capability is something the user has to buy a separate
>> license for, so the IPv6 part is a slightly separate codepath, added
>> in a later version, than the base product, which is IPv4-only. When
>> one of the getaddrinfo() calls returns address records and the other
>> returns garbage, your "fix" doesn't prevent such an application from
>> doing something unpredictable, possibly catastrophic. So it's really
>> not a general solution to the problem.
>
> I have no idea what you're talking about. If the application makes 
> independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm 
> proposing to glibc is completely irrelevant and does not impact the 
> existing functionality in any way. The IPv4 lookup will succeed, the 
> IPv6 lookup will fail, and the application is then free to decide what 
> to do.
>
I wasn't saying the glibc change would break such an application, only 
that your proposed "fix" doesn't help it either, so it shouldn't be 
mistaken for a general solution to the problem. Your "fix" only applies 
to AF_UNSPEC and one should not assume that getaddrinfo() is *always* 
called with AF_UNSPEC. That is, after all, why the ai_family member of 
the "hints" parameter struct takes values other than AF_UNSPEC.
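
For illustration, here is a rough sketch of an application doing its
own per-family lookups, written in Python, whose socket.getaddrinfo()
wraps the C library call. The lookup() helper is hypothetical. A
numeric literal with AI_NUMERICHOST is used so that the example runs
without consulting any nameserver; with a real hostname, each call
would trigger its own query and could fail independently.

```python
import socket


def lookup(host: str, family: int):
    """Resolve host for a single address family.

    Returns (addresses, error); error is None on success.
    Hypothetical helper for illustration only.
    """
    try:
        infos = socket.getaddrinfo(host, None, family=family,
                                   type=socket.SOCK_STREAM,
                                   flags=socket.AI_NUMERICHOST)
    except socket.gaierror as exc:
        # The failure is reported per call, so the caller decides
        # what a failure in one family means for the other.
        return [], exc
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4][0] for info in infos], None


# Two independent lookups, as in the separate-codepath scenario.
v4_addrs, v4_err = lookup("127.0.0.1", socket.AF_INET)
v6_addrs, v6_err = lookup("127.0.0.1", socket.AF_INET6)
```

Here the AF_INET call succeeds and the AF_INET6 call fails, and nothing
inside getaddrinfo() ever sees both outcomes together. Whatever policy
is applied to the pair has to live in the application.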
>
> In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning -- 
> "Give me whatever addresses you can." The man page says, and I am 
> quoting, "The value AF_UNSPEC indicates that getaddrinfo() should 
> return socket addresses for any address family (either IPv4 or IPv6, 
> for example) that can be used with node and service." I don't see how 
> the language could be any more clear.
>

Clear, eh? That's because you're reading it through the tunnel vision of 
your own preferences. The text you quoted says nothing about failure 
modes, RFC 3484 or the full-/partial-answer distinction. Are you hanging 
your hat completely on the phrase "can be used"? Really? You're reading 
that much detail into such generic language? Well, as I've said before, 
for RFC 3484 to work properly, one needs all available IPv4 and IPv6 
addresses. If the list of addresses is short because of nameserver 
brokenness, I'd say RFC 3484 couldn't do its job and the resulting 
partial address list is "unusable" because it hasn't been properly 
processed. That's my take on "can be used".
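
As a rough illustration of why the sort depends on which addresses are
present, the sketch below applies only one piece of RFC 3484
destination address selection: Rule 6 (prefer higher precedence), using
the default policy table from the RFC. It is not a full implementation;
the real algorithm has ten rules and also needs the candidate source
addresses. The function names are made up for this example, and the
addresses come from documentation ranges.

```python
import ipaddress

# Default policy table from RFC 3484, section 2.1: (prefix, precedence).
POLICY_TABLE = [
    (ipaddress.ip_network("::1/128"), 50),
    (ipaddress.ip_network("::/0"), 40),
    (ipaddress.ip_network("2002::/16"), 30),       # 6to4
    (ipaddress.ip_network("::/96"), 20),           # IPv4-compatible
    (ipaddress.ip_network("::ffff:0:0/96"), 10),   # IPv4-mapped
]


def precedence(addr: str) -> int:
    """Precedence of addr by longest-prefix match in the policy table."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        # IPv4 addresses are looked up as IPv4-mapped IPv6 addresses.
        ip = ipaddress.IPv6Address((0xFFFF << 32) | int(ip))
    matches = [(net.prefixlen, prec) for net, prec in POLICY_TABLE
               if ip in net]
    return max(matches)[1]


def sort_by_precedence(addrs):
    """Order destinations by Rule 6 only; earlier input order breaks ties."""
    return sorted(addrs, key=precedence, reverse=True)


full = sort_by_precedence(["192.0.2.1", "2002:c000:201::1", "2001:db8::1"])
partial = sort_by_precedence(["192.0.2.1"])
```

With the complete set, native IPv6 sorts first, then 6to4, then IPv4.
With only the A records available, the IPv4 address wins by default,
and nothing in the output reveals that the comparison never took place.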

Since you decided to start playing the "man page semantic game", it's my 
turn now.

I could just as easily point to this text from the Linux version of the 
man page:

*EAI_AGAIN*
    The name server returned a temporary failure indication. Try again
    later. 
*EAI_BADFLAGS*
    /ai_flags/ contains invalid flags. 
*EAI_FAIL*
    The name server returned a permanent failure indication. 

Notice that the descriptions of EAI_AGAIN and EAI_FAIL say "The name 
server returned a [...] failure". Not "The name server returned [...] 
failures for all lookups" or "The name server returned only failures". In 
the English language, "a failure" is equivalent to "more than 0 failures".

Based on this text, slanted with my own biases and preferences, I'll say 
then that *any* error return from a nameserver should cause 
getaddrinfo() to fail, in order to be consistent with those error-code 
descriptions. Are you going to argue differently? Shall we split hairs 
over the meaning of the word "a"?
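
The two readings being argued over can be written down as policies for
combining the per-family outcomes of an AF_UNSPEC lookup. This is only
a sketch of the disagreement, not of what glibc or any other resolver
actually does; combine(), LookupFailed and the outcome representation
are all invented for the example.

```python
from typing import Dict, List, Union

# One outcome per record type: a list of addresses (possibly empty,
# i.e. NODATA) or a string naming the error the nameserver returned.
Outcome = Union[List[str], str]


class LookupFailed(Exception):
    """Raised when the combined lookup is reported as a failure."""


def combine(outcomes: Dict[str, Outcome], strict: bool) -> List[str]:
    """Merge per-family outcomes under a strict or a lenient policy."""
    addresses: List[str] = []
    errors: List[str] = []
    for rrtype, outcome in outcomes.items():
        if isinstance(outcome, str):
            errors.append(f"{rrtype}: {outcome}")
        else:
            addresses.extend(outcome)
    # Strict: any nameserver error fails the whole lookup.
    # Lenient: fail only if there is an error and nothing usable.
    if errors and (strict or not addresses):
        raise LookupFailed("; ".join(errors))
    return addresses
```

Given {"A": ["192.0.2.1"], "AAAA": "SERVFAIL"}, the lenient policy
returns the IPv4 address and the strict policy raises LookupFailed. A
genuine NODATA for AAAA (an empty list) succeeds under both.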

> To suggest that it's reasonable and correct for it to refuse to return 
> a successfully fetched address is simply ludicrous.
>
By the same faulty reasoning, any addresses returned in a *truncated* 
DNS response are still usable. We hashed that one over for years, and 
finally decided it was a really stupid idea. Partial DNS results are not 
reliably usable. And this is just as true for one "part" of a 
getaddrinfo() lookup failing as it is for a truncated DNS response. 
Unless you have all of the information which was requested, you can't 
assume that the invoker is going to be able to sensibly use what you 
hand back to it. If one detects a failure that materially affects the 
completeness or trustworthiness of a response, one has to either a) try 
to acquire the information another way (e.g. in the case of response 
truncation, retry the query using TCP) or b) assume the worst and put 
the onus on the invoker, or perhaps further up the responsibility chain 
to the user, admin, implementor, or architect, to flag the error as an 
actual error and fix what's really wrong.
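
The truncation case can be sketched in a few lines. The TC bit and the
RCODE live in the flags word of the 12-byte DNS header; the code below
reads them and picks between options a) and b). The function names and
the returned labels are hypothetical, and a real resolver would of
course go on to perform the TCP retry rather than just name it.

```python
import struct

TC_MASK = 0x0200     # "truncated" bit in the DNS header flags word
RCODE_MASK = 0x000F  # response code, low four bits of the flags word


def parse_header(message: bytes) -> dict:
    """Extract the fields of interest from a raw DNS message header."""
    if len(message) < 12:
        raise ValueError("DNS header is 12 bytes long")
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH",
                                                       message[:12])
    return {
        "truncated": bool(flags & TC_MASK),
        "rcode": flags & RCODE_MASK,
        "answers": ancount,
    }


def next_step(message: bytes) -> str:
    """Decide what to do with a response (labels are illustrative)."""
    header = parse_header(message)
    if header["truncated"]:
        # a) get the complete information another way
        return "retry-over-tcp"
    if header["rcode"] != 0:
        # b) surface the error instead of guessing
        return "report-error"
    return "use-answers"
```

In neither branch are the records from an incomplete or failed response
handed to the caller as though they were the whole answer.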

                                             - Kevin