Strange named freezing

Mon Dec 27 09:24:33 UTC 2021

I apologize for the persistence, but does anyone have any
recommendations for debugging this?

13.12.2021 7:18, Nikita Druba wrote:
> Hi!
>
> My system runs FreeBSD 12.2 on ZFS. Samba 4.13.14 runs in a jail
> with Bind 9.16.23 as its DNS backend. I also have Bind 9.16.23 on
> another server, working as a secondary DNS. The secondary Bind gets
> zones from the DC by zone transfer with a TSIG key. The DC listens
> on several subnetworks (loopback and 3 others).
>
> Some time ago I moved the DC from one jail to another, and Bind now
> behaves strangely on the new DC.
>
> When resolv.conf on the new DC points to another DNS server, for
> example the old DC or the secondary Bind, everything works fine. The
> new DC successfully resolves any record with the nslookup or host
> commands, both from itself and from other hosts.
>
> When resolv.conf on the new DC points to localhost or to its own
> internal IP, Bind freezes periodically, with the following pattern:
>
> - Bind stops replying to requests for ~5 minutes. Then it starts
> working again without a service restart, and later freezes again.
>
> - In the daytime (when employees are in the office) it freezes after
> less than 1 minute of work; at night, after 10-15 minutes.
>
> - If I change resolv.conf from the secondary Bind to the internal IP,
> there is no need to restart Bind or Samba to start or stop the
> periodic freezing. I just change the nameserver record and wait. If
> Bind was frozen at the moment resolv.conf changed, it stays frozen
> until ~5 minutes after the freeze began and then works fine.
>
> - If I change resolv.conf from the secondary Bind to loopback, then
> I NEED to restart Bind to start or stop the freezing.
>
> - When Bind is frozen, the service cannot be stopped by a command,
> and a default kill does not kill it; only kill -9 works.
>
> - The internal Samba DNS works fine and does not freeze when
> resolv.conf points to localhost.
>
> - Sometimes Bind does not freeze for all subnetworks. It can freeze
> for localhost and 2 subnetworks, while in the one remaining
> subnetwork the DC Bind successfully resolves any records from any
> subnetwork. But I saw this situation only once and can't reproduce
> it for now.
>
> - There are no special Bind log records with "debug 50" at the time
> of a freeze or before it. It freezes after arbitrary messages, and I
> see all of the same messages in the log when Bind works without
> freezing.
>
> - I tried running Bind with logging to the terminal, but saw no
> additional information when it froze. The terminal shows the same
> messages as the log files.
>
> - rndc freezes as well.
>
> I found one workaround. The server hosting the DC jail has 40 CPUs
> (20 cores, 40 threads). Therefore, when I start named, it creates 40
> workers for every listen IP, i.e. 40 TCP and 40 UDP for every IP.
>
> Because that is too many for my configuration, I decided on
> intuition to reduce the number of named workers to 10 with "-n 10".
> Everything has worked without freezing, with the correct
> resolv.conf, for the last 2 weeks.
>
> Afterwards I tried setting "-n 40", the same value named chooses
> automatically. After the restart named froze again. Maybe it was a
> coincidence, but no other setting stops named from freezing. I also
> noticed that when named works without freezing, the "number of
> zones" in the "rndc status" output decreases from 9 to 3. It seems
> that named has lost the Samba zones, but resolving records from them
> works fine.
>
> I tried to collect some traces with ktrace and caught the moment of
> a freeze. After the last record from the usual log (when Bind
> freezes), the following records repeat many times in the kdump
> output:
>
>  36460 named    CALL  nanosleep(0x7fffffffea30,0)
>  36460 named    RET   nanosleep 0
>
> What can be wrong here? How can I narrow the problem down further?
>
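
One possible way to pin down the freeze windows described above (which
listen address stops answering, when, and for how long) is a small
external probe. The following is only a sketch and not part of the
original setup: the function names (build_query, probe, watch) and any
addresses or record names passed to them are hypothetical. It sends a
plain UDP DNS query by hand, so it does not depend on resolv.conf, and
it does not validate the reply beyond "something came back".

```python
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035); qtype 1 is an A record."""
    # Header: ID, flags (only RD set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Zero-length root label, then QTYPE and QCLASS=IN.
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=2.0, port=53):
    """Return reply latency in seconds, or None if no reply arrived in time."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, port))
            sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-port errors
            return None
        return time.monotonic() - start


def watch(servers, name, interval=5.0):
    """Print a line whenever a server flips between answering and silent."""
    state = {}
    while True:
        for server in servers:
            ok = probe(server, name) is not None
            if state.get(server) != ok:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                status = "answering" if ok else "NOT answering"
                print(f"{stamp} {server} {status}")
                state[server] = ok
        time.sleep(interval)
```

Calling watch() with each of the DC's listen addresses (loopback plus
the three subnet IPs) and a record known to exist would give
timestamps that can be lined up with the named log and the kdump
output, and would show whether all addresses freeze together.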