loss of masters over ipsec hoses bind

Sat Dec 22 16:10:56 UTC 2007

On Dec 21, 2007 10:29 PM, Barry Margolin <barmar at alum.mit.edu> wrote:
> In article <fkh44f$199f$1 at sf1.isc.org>,
>  "Matt LaPlante" <cyberdog3k at gmail.com> wrote:
>
> > I'm currently running Bind 9.4.1 (Ubuntu Gutsy).  I have several zones
> > in master->slave setups, which normally works just fine.  The other
> > day, however, I ran into an odd problem.  A couple of the slave zones
> > generally update over an ipsec connected network.  The ipsec
> > connection went away, and shortly thereafter bind royally wedged
> > itself, refusing to serve any data (including basic forward lookups)
> > and was not even responding to rndc restarts.  It took me a good while
> > of restarting the system and poking around logs to decide to strace
> > the process, which eventually led me to remove the ipsec-dependent
> > slave zones from the config.  As soon as I did this, Bind became
> > stable again.  Interestingly, zones which updated over public IP space
> > behaved fine, even if the master server was unreachable.  It was only
> > zones that were trying to go over the down ipsec connection that hosed
> > the daemon.
> >
> > This whole issue is logged in a bit more detail here, including output
> > from strace:
> > https://bugs.launchpad.net/ubuntu/+source/bind/+bug/177489
> >
> > I can (apparently) reproduce this issue again with little difficulty,
> > so I'd be glad to help debug it.
> >
> > -
> > Matt LaPlante
>
> I'm having a hard time imagining how IPSEC could be impacting this.
> named uses TCP and UDP exclusively, and the underlying connection
> topology should be transparent to it.  Are you sure there aren't some
> configuration differences between the public and private zones, such as
> the refresh and retry intervals?  If the retry intervals are extremely
> short, named could spend all its time retrying the zone transfers after
> a failure.

Here are the SOA timers from one of the private zones (there are only two):

                                182        ; serial
                                3600       ; refresh (1 hour)
                                600        ; retry (10 minutes)
                                2419200    ; expire (4 weeks)
                                86400      ; minimum (1 day)
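
As a quick sanity check on whether those timers could make named spend
all its time retrying, here is a minimal sketch of the arithmetic.  The
function names are my own, purely illustrative, and it assumes named
retries once per retry interval until the zone expires:

```python
def retry_attempts_per_hour(retry_seconds: int) -> float:
    """Retry attempts per hour for one zone, assuming one attempt per retry interval."""
    return 3600 / retry_seconds


def retries_before_expire(retry_seconds: int, expire_seconds: int) -> int:
    """How many retry intervals fit into the expire window."""
    return expire_seconds // retry_seconds


# Values taken from the SOA timers quoted above.
RETRY = 600        # 10 minutes
EXPIRE = 2419200   # 4 weeks

print(retry_attempts_per_hour(RETRY))        # attempts per hour, per zone
print(retries_before_expire(RETRY, EXPIRE))  # attempts before the zone expires
```

Under that assumption a 10 minute retry works out to a handful of
attempts per hour per zone, which does not look like a retry storm.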

I realize that things *should* be transparent, but the fact is I can
reproduce the outage exactly as documented.  My working theory is that
when the IPsec connection is down, some TCP or UDP packet never comes
back as timed out or unreachable, and BIND just waits forever.  That
wait locks up the code, which in turn causes all functionality to
cease.  It may in fact be an IPsec bug in some way, but I think the
error condition itself should not DoS BIND in the process.  I can
certainly try to gather more information if it would be helpful
(although it may take extra time given the holidays).
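
One way to test the "waits forever" theory from outside named is to see
how a plain TCP connect to the master ends when the tunnel is down.
This is only a sketch under my own assumptions: probe_tcp is a
hypothetical helper, not anything from BIND, and it only tells you
whether the network answers quickly (refused or unreachable) or
silently drops the packets (timeout):

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify how a TCP connect attempt to host:port ends."""
    try:
        # create_connection raises on failure; the timeout bounds the wait.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout: packets are being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset: host reachable, port closed.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. "no route to host" or "network unreachable".
        return f"error: {exc}"
```

Usage would be something like probe_tcp("192.0.2.10", 53), where
192.0.2.10 is a placeholder documentation address standing in for the
master's address on the far side of the tunnel.  A "timeout" result
with the tunnel down would at least be consistent with the theory.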

>
> --
> Barry Margolin, barmar at alum.mit.edu
> Arlington, MA
> *** PLEASE post questions in newsgroups, not directly to me ***
> *** PLEASE don't copy me on replies, I'll read them in the group ***