FOLLOWUP- DNS MX timeouts

Wed Jul 8 04:29:03 UTC 2009

In message <4A53CF4A.8050600 at provident-solutions.com>, "Vernon A. Fort" writes:
> Mark Andrews wrote:
> > In message <4A452428.9020701 at provident-solutions.com>, "Vernon A. Fort" writes:
> >   
> >> I've run into a problem with named and timeouts primarily with MX 
> >> lookups.  When an MX query fails the first time, i have to restart the 
> >> named process before it will return a successful query.  Again, it's 
> >> mainly with MX lookups but it also happens with A records as well.  The 
> >> problem subsides for 1-2 hours and starts happening again - basically i 
> >> look in the mailq for deferred messages with MX lookup failures.
> >>
> >> This box is a Gentoo install running a medium volume (500K per day) mail 
> >> server - lots of dns queries due to rbl's, spamassassin, etc.  This 
> >> problem started showing up around mid-may.  Since then, i have 
> >> re-installed bind and bind-tools several times, updated the kernel, 
> >> linux headers to 2.6.29, recompiled glibc, etc....
> >>
> >> I just updated to 9.6.0-P1 from 9.4.3-P2 - same problem exists.  When 
> >> doing a manual MX lookup (dig MX isc.org) - it takes around 45 seconds 
> >> on the first attempt.  If it fails the first time, it will never return 
> >> a positive query, just "connection timed out; no servers could be 
> >> reached" until i restart named.  I can't say for sure but the bind 
> >> application was updated around the time i noticed this problem.  All 
> >> versions of bind i have tried (in gentoo portage) have the same problem.
> >>
> >> Can anyone help me find where this problem might be?  I've google'd 
> >> until my eyes are red and throbbing.
> >>
> >> Thanks
> >>
> >> Vernon
> >> _______________________________________________
> >> bind-users mailing list
> >> bind-users at lists.isc.org
> >> https://lists.isc.org/mailman/listinfo/bind-users
> >>     
> >
> > I suggest that you fix your firewalls to allow 4096 byte EDNS
> > responses though.  Both ORG and ISC.ORG are signed zones so their
> > responses are larger than with unsigned zones.  Named is having to
> > retry with different options to get a response through your firewall
> > and this takes time.
> >
> > An EDNS/UDP MX response is 1999 bytes for isc.org.
> >
> > ;; Query time: 872 msec
> > ;; SERVER: 2001:4f8:0:2::19#53(2001:4f8:0:2::19)
> > ;; WHEN: Sat Jun 27 09:39:34 2009
> > ;; MSG SIZE  rcvd: 1999
> >   
> I now have two servers running behind Check Point firewalls which are 
> failing to resolve MX records.  One of the IT guys called Check Point and 
> support suggested they disable the SmartDefense DNS UDP check.  This 
> did correct the problem, but queries are still sluggish from time to time.
> 
> I have three questions related to this:
> 
> 1.  On both servers - the dns version (and glibc) were updated in 
> mid-January from bind-9.4.1 to 9.4.3.  The SmartDefense DNS check was 
> enabled on both firewalls long before the last updates were applied.  
> Why did the issues just now start showing up (late May - early June)?

	The ORG zone went from unsigned to signed using NSEC3 in
	that period.  I suspect SmartDefense doesn't yet know about
	NSEC3 records.
 
> 2.  When an email is deferred in the mailq, it will stay deferred until 
> named is restarted.  I just tested this on a mail message that sat in 
> the queue for just about three days.  I kept trying to dig MX domain.com 
> during this time period and NOTHING would resolve (including any A 
> records) until i restarted named.  Why?

	Did you look at the nameserver logs?
 
> 3.  In both network environments, i switched the resolution to internal 
> windows 2003 dns servers.  NO problems occurred during the week we used 
> the windows DNS server.  Why would smartdefense not have the same effect 
> on windows based name servers?

	Windows 2003 dns servers speak neither EDNS nor DNSSEC, so
	firewalls don't interfere with the responses.
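
The difference between a plain DNS query and an EDNS one can be seen at
the packet level.  What follows is a minimal sketch, not taken from the
thread: it builds an MX query with and without an EDNS0 OPT record that
advertises a 4096 byte UDP buffer and sets the DO (DNSSEC OK) bit.  The
function names build_query, send_udp and is_truncated are hypothetical
helpers introduced for illustration, and the fixed query ID is an
assumption made only so the output is reproducible.

```python
import socket
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as DNS wire-format labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += struct.pack("B", len(raw)) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = 15, edns_size: int = 0,
                do_bit: bool = False, qid: int = 0x1234) -> bytes:
    """Build a DNS query (qtype 15 = MX); append an OPT RR if edns_size > 0."""
    arcount = 1 if edns_size else 0
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    opt = b""
    if edns_size:
        # The OPT TTL field carries ext-rcode, version and flags; 0x8000 = DO
        ttl = 0x00008000 if do_bit else 0
        # OPT RR: root name, TYPE=41, CLASS=advertised UDP size, TTL, RDLEN=0
        opt = b"\x00" + struct.pack("!HHIH", 41, edns_size, ttl, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """Return True if the TC (truncation) bit is set in a DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & 0x0200)


def send_udp(packet: bytes, server: str, timeout: float = 5.0) -> bytes:
    """Send a query over UDP to port 53 and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(65535)
        return data
```

Sending build_query("isc.org") and then
build_query("isc.org", edns_size=4096, do_bit=True) through send_udp to
an authoritative server from behind the firewall, and comparing
len(response) and is_truncated(response) for each, should show whether
only the large EDNS answer is being dropped: a socket.timeout on the
second query but not the first would point at the firewall rather than
at named.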
 
> Updating to bind-9.6.1 and updating the root.zone file made little if any 
> difference.  Basically, it appears that SOMETHING has changed somewhere 
> because we have just now altered the cisco PIX rules to increase the udp 
> packet size due to timeouts in these environments.  I have seen posts 
> related to my problems as far back as 2-3 years ago.  So again, i'm 
> scratching my head wondering what the heck did i miss - why did these 
> problems just now start showing up?
> 
> Any pointers or additional reading would be greatly appreciated.  I'm 
> just trying to understand from a 1000 foot view but whatever view anyone 
> suggests is fine.
> 
> Vernon
> 
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka at isc.org