faster fail-over between multiple masters

Tue Aug 30 16:17:23 UTC 2011

On 30.08.2011 00:04, Mark Andrews wrote:
> In message <4E5B6098.80503 at pernau.at>, Klaus Darilion writes:
>> Hi!
>>
>> I have 9.7.0-P1 as slave configured with two masters: M1 and M2. M2 is
>> currently down.
>>
>> When M1 sends a NOTIFY to inform the slave of the new zone, bind starts
>> querying for the SOA record at M2. As M2 is down, bind sends
>> retransmissions and tries it several times. It takes up to 2 minutes
>> until bind starts asking M1 - then the transfer of course works fine.
>>
>> The question is: can I tweak bind to fail-over to other master servers
>> faster?
> 
> 	try-tcp-refresh no;

Hi Mark!

Thanks for the hint. But I do not see how this can help us, as the slave
never used TCP. The SOA lookups are always done via UDP.

Some more debugging showed that the problem happens in the following
scenario:

1. On the slave we have set max-refresh-time to 5 minutes. (We added
this in case the slave misses some NOTIFYs due to network problems.)

2. Thus, every 4.5 minutes the slave asks both masters for the serial.
The lookup to M1 works fine; the lookup to M2 of course fails, as M2 is
down, and thus bind starts with retransmissions: every lookup is
retransmitted 2 times at 15-second intervals, then bind tries again with
a new "transaction".

3. If bind receives a NOTIFY while it tries to query M2, the NOTIFY is
more or less ignored:

  client 1.1.1.1#15733: received notify for zone 'xyz': TSIG 'foobar'
  zone xyz/IN: notify from 1.1.1.1: refresh in progress, refresh check queued

Thus, it takes up to 2 minutes until bind gives up querying M2 and
starts querying M1 again.
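
For reference, a minimal sketch of the slave-side zone configuration
described above. The addresses (192.0.2.1 for M1, 192.0.2.2 for M2) and
the file name are placeholders, not the real values, and the TSIG key
setup is left out:

  zone "xyz" {
      type slave;
      file "slave/xyz.db";                    // placeholder file name
      masters { 192.0.2.1; 192.0.2.2; };      // M1, M2 (placeholder addresses)
      max-refresh-time 300;                   // 5 minutes, value is in seconds
      try-tcp-refresh no;                     // as suggested by Mark
  };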

Is it possible to tweak the retransmission timers and query timeouts
when bind performs SOA lookups?

Thanks
Klaus