Incremental transfers generate complete zone reloading

Petr Špaček pspacek at isc.org
Mon Jan 16 06:44:35 UTC 2023


On 15. 01. 23 19:02, Jesus Cea wrote:
> I have a huge zone receiving a constant flow of small dns updates. My 
> secondaries receive notifications and transfer the zone incrementally. 
> Cool, everything works as expected.
> 
> Nevertheless, I see these lines in my logs, constantly (every time a 
> change arrives incrementally):
> 
> """
> 15-Jan-2023 17:49:47.662 general: info: rpz: rpz.local: new zone version 
> came too soon, deferring update for 28 seconds
> 15-Jan-2023 17:49:54.716 notify: info: client @11f80268 X.X.X.X#63514: 
> received notify for zone 'rpz.local'
> 15-Jan-2023 17:49:54.716 general: info: zone rpz.local/IN: notify from 
> X.X.X.X#63514: serial 8991
> 15-Jan-2023 17:50:15.662 general: info: rpz: rpz.local: reload start
> 15-Jan-2023 17:50:16.884 general: info: rpz: rpz.local: reload done
> """
> 
> Ok, my updates are coming too fast (first line). No problem, the 
> secondary will eventually retrieve the changes. What worries me is the 
> last couple of lines: the rpz zone (big, around 800,000 domains) is 
> being reloaded constantly, and each reload takes a couple of seconds 
> and eats CPU, when the incremental changes are actually pretty tiny.

That's correct - this is a result of the current implementation of RPZ.

It is tracked as
https://gitlab.isc.org/isc-projects/bind9/-/issues/3746

> I would guess the incremental changes would be applied incrementally to 
> the in-memory structures, not trigger a full zone reload that takes a 
> couple of seconds and eats an entire CPU core.

Updates to "normal" DNS zones are indeed applied incrementally, but RPZ 
is a different kind of beast.
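To illustrate the distinction (a toy sketch in Python, not BIND 
internals - all names here are mine): applying an IXFR diff to a plain 
zone costs time proportional to the diff, but a lookup structure 
derived from the *whole* zone (as the RPZ summary database is) has to 
be rebuilt after the zone changes, which costs time proportional to 
the zone size regardless of how small the diff was.

```python
# Toy model only. "zone" stands for the zone contents, and
# "build_policy_index" stands for whatever full-zone pass produces the
# derived policy lookup structure.

def apply_ixfr(zone: dict, deleted, added):
    """Incremental update: cost proportional to the diff size."""
    for name in deleted:
        zone.pop(name, None)
    zone.update(added)

def build_policy_index(zone: dict) -> dict:
    """Derived structure: cost proportional to the whole zone."""
    return {name.lower().rstrip("."): policy
            for name, policy in zone.items()}

zone = {f"bad{i}.example.": "nxdomain" for i in range(800_000)}

# A tiny IXFR is cheap...
apply_ixfr(zone, deleted=["bad0.example."],
           added={"new.example.": "nxdomain"})

# ...but the derived index still needs a full-zone pass afterwards,
# which is the "reload start" / "reload done" you see in the log.
index = build_policy_index(zone)
```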

> 
> My secondary configuration is pretty trivial:
> 
> """
> [...]
>    response-policy {
>      zone "rpz.local" policy nxdomain;
>    };
> 
> [...]
> 
> zone "rpz.local" {
>    type slave;
>    file "../secundarios/db.rpz.local";
>    allow-query { 127.0.0.1; };
>    allow-transfer { none; };
>    masters {
>      X.X.X.X;
>    };
> };
> 
> """
> 
> Is this maybe related to it being a "response-policy" zone?

That's correct.


> If this is 
> the case and a malware RPZ is going to be BIG by definition, what would 
> be the suggested approach?

That depends on your operational needs, and on whether it is indeed 
causing measurable trouble in your environment.

With multiple CPU cores available you can trade "more even" latency 
for efficiency by configuring the workaround mentioned here:
https://gitlab.isc.org/isc-projects/bind9/-/issues/3746#note_337063
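Separately, the "new zone version came too soon, deferring update" 
message in your log is controlled by the min-update-interval option of 
the response-policy statement (default 60 seconds in recent BIND 9 
versions). Raising it batches more transfers into each reload, at the 
cost of slower policy propagation. This is a different knob from the 
workaround in the linked note; a sketch based on your config:

"""
response-policy {
    zone "rpz.local" policy nxdomain;
} min-update-interval 3600;  // reload the summary at most once an hour
"""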


> Thanks!
> 
> PS: I have not tried alternative secondary storage backends yet, like 
> "map". I am trying to understand what is going on first.

First, it would not help - this is caused by the RPZ "frontend", not 
the storage "backend".

Second, map is deprecated in 9.16 and removed from 9.18 onward. If you 
use it anywhere, it's time to move on!

HTH.

-- 
Petr Špaček
Internet Systems Consortium
