Strange problem with a query deleting a record...

Sat Aug 24 12:55:57 UTC 2013

On 08/24/2013 12:46 AM, Barry Margolin wrote:
> In article <mailman.1159.1377301811.20661.bind-users at lists.isc.org>,
>   Mark Andrews <marka at isc.org> wrote:
>
>> In message <52177D81.8020206 at chrysler.com>, Kevin Darcy writes:
>>> On 8/22/2013 12:55 PM, johnh at primebuchholz.com wrote:
>>>> Greetings All,
>>>>
>>>> First of all, I apologize if this is out of place - I'm having a very
>>>> strange issue that is either a problem with bind itself, or at least,
>>>> affecting it.  Summary:
>>>>
>>>> For only ONE address, whenever I attempt to access it through my squid
>>>> proxy, the record disappears from DNS, and the retry time changes too.
>>>> Essentially, accessing www.thisdomain.com works, but a link to a portal on
>>>> that page to the subdomain login.thisdomain.com causes the problem.  I'm
>>>> willing to bet the problem lies with squid, but as to how it could
>>>> possibly change a record in bind... Well, I'm stumped.  If you don't go
>>>> through squid, everything works.  All other requests to bind for the
>>>> address of the host in question work fine.  Here's the output of dig from
>>>> before accessing the page through squid:
>>>>
>>>> ; <<>> DiG 9.4.1-P1 <<>> login.thisdomain.com
>>>> ;; global options:  printcmd
>>>> ;; Got answer:
>>>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45037
>>>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0
>>>>
>>>> ;; QUESTION SECTION:
>>>> ;login.thisdomain.com.            IN      A
>>>>
>>>> ;; ANSWER SECTION:
>>>> login.thisdomain.com.     17      IN      A       111.222.333.123
>>>>
>>>> ;; AUTHORITY SECTION:
>>>> thisdomain.com.         168319  IN      NS      ns1.thisdomain.com.
>>>> thisdomain.com.         168319  IN      NS      ns2.thisdomain.com.
>>>>
>>>> ;; Query time: 0 msec
>>>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>>>> ;; WHEN: Thu Aug 22 12:29:57 2013
>>>> ;; MSG SIZE  rcvd: 88
>>>>
>>>> You can do anything to request the address from bind and it works,
>>>> *except* try to access it through squid.  Bypassing squid and going
>>>> directly through the firewall works fine.
>>>>
>>>> Now, immediately after you try to access it through squid:
>>>>
>>>> ; <<>> DiG 9.4.1-P1 <<>> login.thisdomain.com
>>>> ;; global options:  printcmd
>>>> ;; Got answer:
>>>> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 43943
>>>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>>>>
>>>> ;; QUESTION SECTION:
>>>> ;login.thisdomain.com.            IN      A
>>>>
>>>> ;; AUTHORITY SECTION:
>>>> thisdomain.com.         298     IN      SOA     ns1.thisdomain.com.
>>>> serv.anotherdomain.com. 2006062510 3600 3600 2592000 300
>>>>
>>>> ;; Query time: 0 msec
>>>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>>>> ;; WHEN: Thu Aug 22 12:30:06 2013
>>>> ;; MSG SIZE  rcvd: 95
>>>>
>>>> After the 5-minute negative-caching TTL (the SOA minimum of 300 shown
>>>> above) expires, the original record reappears.
>>>>
>>>> Ideas?  I'm stumped.  It seems like squid is somehow able to corrupt
>>>> bind's info, but I can't imagine how.
>>> I have a theory. If this is a name that's hosted on a stupid
>>> load-balancer, and that load-balancer doesn't understand non-A-record
>>> query types, then if Squid is sending a non-A query type (e.g. SRV,
>>> possibly even AAAA, if it's *really* stupid), then the load-balancer may
>>> be erroneously "poisoning" your cache with an NXDOMAIN response.
>>>
>>> We ran into this many years ago with Cisco GSSes (Global Site Selectors)
>>> and worked around it by having a "shadow" version of the zone, which the
>>> GSSes proxy to for QTYPEs they don't handle. That "shadow" version of
>>> the zone has a wildcard entry in it which forces responses to be NODATA
>>> instead of NXDOMAIN, and this prevents the cache poisoning.
>>>
>>>                                                               - Kevin
>> The load balancer should be able to correct for such misconfigurations
>> by changing the rcode of the response from NXDOMAIN to NOERROR.  It
>> knows what names it is answering for so it can know that the NXDOMAIN
>> is an erroneous response.
> If I understand what Kevin was saying, the load balancer IS the DNS
> server. If you ask it for the A record it's responsible for, it sends a
> reasonable reply. If you ask it for some other record type for that
> name, it sends NXDOMAIN instead of NOERROR.
>
> It's a design flaw in these load balancers.
>

Thanks to everyone who's been helping with this.

In order to investigate this further, I did a tcpdump of both a 
"working" conversation of a browser requesting the site, not going 
through the squid proxy, and another of the "broken" conversation 
through the proxy.

Result:  The proxy makes a request for an AAAA record, and the NXDOMAIN
response to that request is what causes this.  The browser never asks for
anything but an A record, which succeeds.
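
For anyone who wants to check a name for the same behaviour without
reading a packet capture, here is a rough Python sketch (standard library
only).  It is an illustration, not the method I actually used: the
function names are made up for this example, and the server address in the
usage note is a placeholder.  It builds a plain DNS query, reads the rcode
and answer count from the reply header, and flags the case where the A
query gets an answer but the AAAA query gets NXDOMAIN.

```python
import socket
import struct

QTYPE_A, QTYPE_AAAA = 1, 28
RCODE_NOERROR, RCODE_NXDOMAIN = 0, 3


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035 wire format)."""
    # Header: id, flags (only RD set), QDCOUNT=1, other counts zero.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def classify(response):
    """Return 'ANSWER', 'NODATA', 'NXDOMAIN' or 'OTHER' for a raw reply.

    Simplification: any NOERROR reply with an empty answer section is
    called NODATA here, although it could also be a referral.
    """
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F  # low four bits of the flags word
    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"
    if rcode == RCODE_NOERROR:
        return "ANSWER" if ancount > 0 else "NODATA"
    return "OTHER"


def is_broken(a_status, aaaa_status):
    """A name that has an A record exists, so NXDOMAIN for AAAA is wrong."""
    return a_status == "ANSWER" and aaaa_status == "NXDOMAIN"


def probe(server, name, timeout=3.0):
    """Send A and AAAA queries for `name` to `server` over UDP."""
    results = {}
    for label, qtype in (("A", QTYPE_A), ("AAAA", QTYPE_AAAA)):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name, qtype), (server, 53))
            reply, _ = sock.recvfrom(4096)
        results[label] = classify(reply)
    return results
```

Usage would be something like probe("192.0.2.53", "login.thisdomain.com"),
with the placeholder address replaced by the server you want to test, and
then is_broken(result["A"], result["AAAA"]).  A correctly behaving server
should give NODATA, not NXDOMAIN, for the AAAA query.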

I've contacted the site in question with this info, so hopefully it'll 
get resolved.  I'll keep the list posted on any results or info for 
posterity.

-John
