Intermittent v9.18 build fails on Fedora COPR buildsys, always in `netmgr_test` ?

Mon Sep 12 11:49:24 UTC 2022

Hi,

I have done some testing, and it seems two tests fail most often:

tcp_recv_two_quota and tcp_noresponse

PID 32090 exceeded run time limit, sending SIGKILL

Would you know why just those tests time out so often?

But I have also found strange issues when trying to find a way to 
reproduce this on my local machine.

When repeating make -j8 check in the tests/isc build directory, the test 
often fails with just exit status 255 and no further details. For example, 
netmgr_test.log contains:

[ RUN      ] tcp_half_recv_half_send_sendback
[       OK ] tcp_half_recv_half_send_sendback
[ RUN      ] tcp_recv_one_quota
[       OK ] tcp_recv_one_quota
[ RUN      ] tcp_recv_two_quota
[       OK ] tcp_recv_two_quota
[ RUN      ] tcp_recv_send_quota
[       OK ] tcp_recv_send_quota
[ RUN      ] tcp_recv_half_send_quota
[       OK ] tcp_recv_half_send_quota
[ RUN      ] tcp_half_recv_send_quota
FAIL netmgr_test (exit status: 255)

What might be the cause of this kind of termination? Since it does not 
happen when the test is run on its own, I cannot step through it with gdb. 
I think it happens only when running under multiple make processes, in my 
case make -j8 (I have 4 cores with hyperthreading).

It happens in about 20% of runs; I do not have exact numbers.
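
One way to turn that rough estimate into exact numbers is to repeat the 
command and tally its exit codes. The following is only a minimal sketch: 
the helper names tally_exit_codes and failure_rate are hypothetical, and 
it assumes the command is run from the tests/isc build directory. Note 
that it records the exit status of make itself, not the 255 that the 
harness reports for netmgr_test.

```python
import subprocess
from collections import Counter


def tally_exit_codes(cmd, runs, cwd=None):
    """Run cmd `runs` times and count how often each return code occurs."""
    codes = Counter()
    for _ in range(runs):
        # Output is discarded; the test harness keeps its own *.log files.
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative returncode means the child died from a signal.
        codes[result.returncode] += 1
    return codes


def failure_rate(codes):
    """Fraction of runs whose return code was non-zero."""
    total = sum(codes.values())
    if total == 0:
        return 0.0
    return (total - codes.get(0, 0)) / total


# Example use (assumed invocation, run inside the tests/isc build directory):
#   codes = tally_exit_codes(["make", "-j8", "check"], runs=50)
#   print(dict(codes), f"failure rate: {failure_rate(codes):.0%}")
```

Copying netmgr_test.log aside after each failing run would also keep the 
last test name printed before the termination.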

Do such issues also happen on BIND's infrastructure on GitLab?

Regards,
Petr

On 8/29/22 22:57, PGNet Dev wrote:
> I'm building bind9 (v9.18.5, atm) on Fedora's COPR infrastructure.
>
> Building for Fedora 36, 37 & Rawhide, the builds FAIL 
> randomly/intermittently here
>
> For example, with no changes to any source/spec, simply triggering 
> rebuilds, over a period of just a few hours,
>
>
>  Time                   F36   F37   Rawhide  build URL
>  --------------------   ----  ----  -------  ----------
>  2022-08-29 15:58 EDT   OK    FAIL  OK 
> https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784469/
>
>  2022-08-29 14:23 EDT   FAIL  OK    OK 
> https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784210/
>
>  2022-08-29 11:49 EDT   OK    OK    OK 
> https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4776394/
>
> I'm trying to get a handle on cause ...
>
> Local builds on my own infrastructure are always successful; the 
> issue's only on COPR.
>
> The FAILs are always in `netmgr_test` unittests ...
>
> looking at netmgr test source, my as-yet-unfounded suspicion is that 
> these timeouts
>
> https://github.com/isc-projects/bind9/blob/v9_18_5/tests/isc/netmgr_test.c#L116
>
> are intermittently hitting limits -- only in COPR/online.  perhaps for 
> specific transport?
>
> I also note that -- in main, upstream, 3 days ago -- netmgr tests are 
> being split up, into separate per-transport tests,
>
> https://github.com/isc-projects/bind9/commit/37a1be5acc32244cec03cedc1bd46bc4aa0fbc18
>
> I'm not clear what specific problem is being solved by that split, but 
> imagine that it might well have an effect on builds @ COPR.
>
> I've not been able to get detailed test FAIL logs from COPR builds 
> (local builds do not FAIL).  currently, @ #fedora-buildsys, did manage 
> to get a reproducer of the build FAIL; I'm hoping I might get access 
> to those FAIL logs via a manual COPR build.
>
>
> Anyone here seen similar issues with netmgr, or maybe have a clue?
>
> Fwiw, I've initially filed at RH BZ already:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2122010
>
> ; no response there yet.