Question

Fri Jun 3 10:03:50 UTC 2022

ok, now we are getting somewhere...

Note startup error messages should be in syslog, or perhaps "systemctl 
status isc-dhcp-server" will show them.

So having the "wrong" network range would cause issues. The requests
come in from a certain subnet, and the server tries to match them to a
subnet definition, but of course the secondary server doesn't have
192.168.0.0, so it can't offer an address. That explains why there are
no requests being served.
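
As a rough illustration of the matching described above (a simplified
sketch, not how dhcpd is actually implemented; the prefix lengths and
the client address are assumptions, since the netmask isn't stated in
this thread):

```python
import ipaddress

def find_subnet(source_ip, declared_subnets):
    """Return the first declared subnet that contains source_ip, or None."""
    addr = ipaddress.ip_address(source_ip)
    for cidr in declared_subnets:
        net = ipaddress.ip_network(cidr)
        if addr in net:
            return net
    return None

# Hypothetical declarations: a server that only knows the old /24,
# versus one that declares a wider network covering 192.168.0.x.
old_declarations = ["192.168.1.0/24"]
new_declarations = ["192.168.0.0/23"]

print(find_subnet("192.168.0.205", old_declarations))  # None
print(find_subnet("192.168.0.205", new_declarations))  # 192.168.0.0/23
```

When no declared subnet matches, there is nothing to allocate an
address from, so the request goes unanswered.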

Next, in the failover peer section, both config files have "primary".
One of them needs to be "secondary", e.g. the backup server should have
the failover peer setting shown below. mclt is only specified on the
primary. This would definitely be causing problems now, as you have two
primary failover peers for the same subnet. Before, there were two
different subnets, so no clashes, as failover is done on a
subnet-by-subnet basis. You could have different peers for each subnet,
for example.

# backup
failover peer "dhcp-failover" {
         secondary; # declare this to be the secondary server
         address 192.168.1.51;
         port 647;
         peer address 192.168.1.50;
         peer port 647;
         max-response-delay 30;
         max-unacked-updates 10;
         load balance max seconds 3;
         split 128;
}

# primary
failover peer "dhcp-failover" {
         primary; # declare this to be the primary server
         address 192.168.1.50;
         port 647;
         peer address 192.168.1.51;
         peer port 647;
         max-response-delay 30;
         max-unacked-updates 10;
         load balance max seconds 3;
         split 128;
         mclt 3600; # only on primary
}

With this change I think it should work now... fingers crossed :)
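
If you want to sanity check the two stanzas before restarting, here is
a minimal sketch in Python. It is not an ISC tool; parse_peer and
check_pair are hypothetical helper names, and it only checks the three
things mentioned above: one primary plus one secondary, mirrored
address / peer address, and mclt on the primary only.

```python
import re

def parse_peer(stanza):
    """Pull role, address, peer address and mclt out of a failover stanza."""
    text = re.sub(r"#.*", "", stanza)  # strip comments before matching

    def first(pattern):
        m = re.search(pattern, text, re.M)
        return m.group(1) if m else None

    return {
        "role": first(r"^[ \t]*(primary|secondary)[ \t]*;"),
        "address": first(r"^[ \t]*address[ \t]+(\S+?)[ \t]*;"),
        "peer": first(r"^[ \t]*peer[ \t]+address[ \t]+(\S+?)[ \t]*;"),
        "mclt": first(r"^[ \t]*mclt[ \t]+(\d+)[ \t]*;"),
    }

def check_pair(a, b):
    """Return a list of problems found between two parsed stanzas."""
    problems = []
    if {a["role"], b["role"]} != {"primary", "secondary"}:
        problems.append("need exactly one primary and one secondary")
    if a["address"] != b["peer"] or b["address"] != a["peer"]:
        problems.append("address / peer address do not mirror each other")
    for p in (a, b):
        if p["role"] == "secondary" and p["mclt"] is not None:
            problems.append("mclt set on secondary")
        if p["role"] == "primary" and p["mclt"] is None:
            problems.append("mclt missing on primary")
    return problems

PRIMARY = """
failover peer "dhcp-failover" {
    primary; # declare this to be the primary server
    address 192.168.1.50;
    peer address 192.168.1.51;
    mclt 3600; # only on primary
}
"""
BACKUP = """
failover peer "dhcp-failover" {
    secondary; # declare this to be the secondary server
    address 192.168.1.51;
    peer address 192.168.1.50;
}
"""

print(check_pair(parse_peer(PRIMARY), parse_peer(BACKUP)))   # []
print(check_pair(parse_peer(PRIMARY), parse_peer(PRIMARY)))  # two problems
```

An empty list means the pair looks consistent on those three points;
it says nothing about the rest of dhcpd.conf.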

regards,
Glenn

On 2022-06-03 19:33, Leslie Rhorer wrote:
>     Oi, veh!  Something else has died, now, and I don't know what or
> how.  The only change I made was the one listed below, and now dhcpd
> won't run from /etc/init.d/isc-dhcp-server.  Rather, it runs, but then
> it quits.  I can run it manually from the CL with exactly the same
> syntax, and it remains up, but then the primary server quits.
> 
> On 6/3/2022 4:03 AM, Leslie Rhorer wrote:
>> Well, I found one error left over from when this was a /24 network.  
>> The range definition on the secondary server was from 192.168.1.220 to 
>> 192.168.1.240, instead of 192.168.0.200 to 192.168.0.240.  You can see 
>> the error in the backup.conf.gz file. I am not sure what issues this 
>> would cause, other than of course serving addresses in a range I want 
>> to change.
>> 
>> On 6/3/2022 2:45 AM, Glenn Satchell wrote:
>>> Hi Leslie,
>>> 
>>> Ok I can see a packet flow in that pcap file between the two servers. 
>>> It shows a TCP packet from 192.168.1.50 port 46869 with the SYN [S] 
>>> flag to 192.168.1.51 port 647 - so that's trying to open the 
>>> connection.
>>> 192.168.1.51 responds with RST [R] flag, so 192.168.1.50 tries again,
>>> and on it goes. So it looks like 192.168.1.51 is not listening on
>>> that port. There's no failover connection being established. So we
>>> have that to sort out first.
>>> 
>>> $ tcpdump -r secondary.pcap -v
>>> reading from file secondary.pcap, link-type EN10MB (Ethernet)
>>> 16:23:34.924575 IP (tos 0x0, ttl 64, id 46213, offset 0, flags [DF], 
>>> proto TCP (6), length 60)
>>>     192.168.1.50.46869 > 192.168.1.51.647: Flags [S], cksum 0xdfce 
>>> (correct), seq 4009562500, win 64240, options [mss 1460,sackOK,TS val 
>>> 3809692760 ecr 0,nop,wscale 7], length 0
>>> 16:23:34.924599 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], 
>>> proto TCP (6), length 40)
>>>     192.168.1.51.647 > 192.168.1.50.46869: Flags [R.], cksum 0x71fb 
>>> (correct), seq 0, ack 4009562501, win 0, length 0
>>> 16:23:39.925032 IP (tos 0x0, ttl 64, id 20478, offset 0, flags [DF], 
>>> proto TCP (6), length 60)
>>>     192.168.1.50.57529 > 192.168.1.51.647: Flags [S], cksum 0x995f 
>>> (correct), seq 2790876011, win 64240, options [mss 1460,sackOK,TS val 
>>> 3809697760 ecr 0,nop,wscale 7], length 0
>>> 16:23:39.925054 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], 
>>> proto TCP (6), length 40)
>>>     192.168.1.51.647 > 192.168.1.50.57529: Flags [R.], cksum 0x3f14 
>>> (correct), seq 0, ack 2790876012, win 0, length 0
>>> 
>>> When I look at it with wireshark it's the same but perhaps shown a 
>>> little more clearly
>>> 
>>> 1    0.000000    192.168.1.50    192.168.1.51    TCP    74 46869 → 
>>> 647 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809692760 
>>> TSecr=0 WS=128
>>> 2    0.000024    192.168.1.51    192.168.1.50    TCP    54 647 → 
>>> 46869 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 3    5.000457    192.168.1.50    192.168.1.51    TCP    74 57529 → 
>>> 647 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809697760 
>>> TSecr=0 WS=128
>>> 4    5.000479    192.168.1.51    192.168.1.50    TCP    54 647 → 
>>> 57529 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 5    10.000924    192.168.1.50    192.168.1.51    TCP    74 51935 → 
>>> 647 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809702760 
>>> TSecr=0 WS=128
>>> 6    10.000945    192.168.1.51    192.168.1.50    TCP    54 647 → 
>>> 51935 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 7    15.001390    192.168.1.50    192.168.1.51    TCP    74 57497 → 
>>> 647 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809707761 
>>> TSecr=0 WS=128
>>> 
>>> Can you please post the failover peer definitions for both dhcp
>>> servers? I think we need to check that they make sense. Second, the
>>> interface configs for that interface on each server: output from "ip
>>> addr show ethX" or whatever the correct interface name is, please. We
>>> need to be sure the address, netmask, etc. match up.
>>> 
>>> So that packet capture is very useful. It's pinpointed an issue
>>> straight away.
>>> 
>>> regards,
>>> Glenn
>>> 
>>> On 2022-06-03 16:37, Leslie Rhorer wrote:
>>>>     I am seeing a listening connection on the primary server on 647,
>>>> but nothing on the secondary.  I have included the tcpdump from the
>>>> secondary on port 647 as a gz file.  Still waiting on the dumps on
>>>> ports 67 and 68 (it's taking a while for 100 packets to pass).
>>>> 
>>>> On 6/3/2022 1:03 AM, Glenn Satchell wrote:
>>>>> Hi Leslie,
>>>>> 
>>>>> I know about capturing packets on a 10G interface :) many gigabytes 
>>>>> in a few seconds...
>>>>> 
>>>>> So you need to use filters when capturing, eg with tcpdump
>>>>> 
>>>>>   tcpdump -i eth0 host <other dhcp server IP or name> and tcp port 647
>>>>> 
>>>>> will only capture the failover traffic on eth0 directed to or from 
>>>>> the other server, and ignore the rest.
>>>>> 
>>>>>   tcpdump 'udp and (port 67 or port 68)'
>>>>> 
>>>>> will capture dhcp packets.
>>>>> 
>>>>> You can add options like "-c 100" to stop after 100 packets are 
>>>>> captured. "-w filename" will capture to a file and you can copy 
>>>>> this file to your desktop and use wireshark to read it.
>>>>> 
>>>>> With failover, it's better to restart one dhcp server, wait for it 
>>>>> to sync, then restart the other one. If you shut down both and then 
>>>>> start them, then they come up in recover mode.
>>>>> 
>>>>> Also looking at failover connections:
>>>>> 
>>>>>   netstat -ant | grep 647
>>>>> 
>>>>> should show an established connection between the two servers.
>>>>> 
>>>>> regards,
>>>>> Glenn
>>>>> 
>>>>> On 2022-06-03 15:39, Leslie Rhorer wrote:
>>>>> 
>>>>>> On 6/2/2022 11:30 PM, Gregory Sloop wrote:
>>>>>> 
>>>>>>> Are you seeing balance messages every hour as the two re-balance 
>>>>>>> the available lease pool?
>>>>>> No, I don't think so.  It has only been a couple of hours since I 
>>>>>> have had both online, however.
>>>>>> 
>>>>>>> You say they are both handling leases properly, but how do you 
>>>>>>> know this? (That a machine gets a lease from somewhere is not 
>>>>>>> good evidence.)
>>>>>> 
>>>>>> Do you mean because some other machine / device could be issuing 
>>>>>> leases?  No.  In that case,
>>>>>> 
>>>>>> 1. Killing both servers would not take down any DHCP clients. If 
>>>>>> both servers are shut down, DHCP clients start failing in about an 
>>>>>> hour, until they are all dead.
>>>>>> 
>>>>>> 2. DHCP responses on the LAN stop completely the moment both 
>>>>>> servers are taken down.
>>>>>> 
>>>>>> 3. No other machine would know anything about the list of 
>>>>>> dynamically assigned fixed IP addresses in dhcpd.static. None of 
>>>>>> the addresses of any of the clients ever change.
>>>>>> 
>>>>>> 4. Whenever one server is shut down, the other responds with tons
>>>>>> of responses in the log.
>>>>>> 
>>>>>>> A packet capture in front of the secondary might be helpful to 
>>>>>>> see what traffic is passing - both to the peer and to clients.
>>>>>> While not impossible, that is a bit easier said than done. The 
>>>>>> links between the servers are 10G.  I can look into it.
>>>>>> 
>>>>>>> (I hate making captures, at least as much as the next person, but 
>>>>>>> dang if they don't, nearly always, show something that was 
>>>>>>> different than I assumed. So, I've just gotten a lot less averse 
>>>>>>> to getting captures. Yeah, they'll probably take me extra time to 
>>>>>>> setup and get and paw through, [all when I could be fixin' 
>>>>>>> stuff!] but they can save hours or days of fruitless searching 
>>>>>>> for a fix, when I don't even really *know* what's wrong yet. 
>>>>>>> Don't know about anyone else, but fixing problems gets a whole 
>>>>>>> lot easier when I actually know what's wrong, or at least have a 
>>>>>>> good idea what's going on. :)
>>>>>> 
>>>>>> Agreed, although when an interface is chunking away at over 10,000 
>>>>>> packets per second...
>>>>>> 
>>>>>> If something doesn't break loose, I will see about loading 
>>>>>> Wireshark.
>>