Question

Fri Jun 3 07:45:43 UTC 2022

Hi Leslie,

OK, I can see a packet flow in that pcap file between the two servers. It 
shows a TCP packet from 192.168.1.50 port 46869 with the SYN [S] flag to 
192.168.1.51 port 647, so that's trying to open the connection.
192.168.1.51 responds with the RST [R.] flag, so 192.168.1.50 tries again 
five seconds later from a new source port, and on it goes. So it looks 
like 192.168.1.51 is not listening on that port. There's no failover 
connection being established, so we have that to sort out first.
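
If you want to confirm that directly on 192.168.1.51, something along 
these lines should do it. This is only a rough sketch: listening_on is a 
helper name I made up, and it assumes the usual column layout of 
"ss -ltn" or "netstat -ant" output (local address in the fourth column).

```shell
# listening_on PORT
# Reads "ss -ltn" or "netstat -ant" output on stdin and succeeds only if
# some socket is in LISTEN state on exactly that local port.
# (Hypothetical helper, not part of any package.)
listening_on() {
  awk -v p="$1" '
    # ss puts the state in column 1, netstat in column 6;
    # both put the local address:port in column 4.
    ($1 == "LISTEN" || $6 == "LISTEN") && $4 ~ (":" p "$") { found = 1 }
    END { exit !found }
  '
}
```

Run it as "ss -ltn | listening_on 647 && echo listening || echo not 
listening". Unlike a plain "grep 647" it won't match port 46470 or an 
established connection.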

$ tcpdump -r secondary.pcap -v
reading from file secondary.pcap, link-type EN10MB (Ethernet)
16:23:34.924575 IP (tos 0x0, ttl 64, id 46213, offset 0, flags [DF], 
proto TCP (6), length 60)
     192.168.1.50.46869 > 192.168.1.51.647: Flags [S], cksum 0xdfce 
(correct), seq 4009562500, win 64240, options [mss 1460,sackOK,TS val 
3809692760 ecr 0,nop,wscale 7], length 0
16:23:34.924599 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 
TCP (6), length 40)
     192.168.1.51.647 > 192.168.1.50.46869: Flags [R.], cksum 0x71fb 
(correct), seq 0, ack 4009562501, win 0, length 0
16:23:39.925032 IP (tos 0x0, ttl 64, id 20478, offset 0, flags [DF], 
proto TCP (6), length 60)
     192.168.1.50.57529 > 192.168.1.51.647: Flags [S], cksum 0x995f 
(correct), seq 2790876011, win 64240, options [mss 1460,sackOK,TS val 
3809697760 ecr 0,nop,wscale 7], length 0
16:23:39.925054 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 
TCP (6), length 40)
     192.168.1.51.647 > 192.168.1.50.57529: Flags [R.], cksum 0x3f14 
(correct), seq 0, ack 2790876012, win 0, length 0
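
For a longer capture, a quick tally of SYNs against RSTs saves reading 
every packet. A minimal sketch, assuming tcpdump's usual text output as 
above; count_flags is just a name I picked for the helper.

```shell
# count_flags
# Reads tcpdump text output on stdin and counts bare SYN packets
# (Flags [S]) and resets (Flags [R] or [R.]).
# (Hypothetical helper for illustration.)
count_flags() {
  awk '
    /Flags \[S\]/    { syn++ }
    /Flags \[R\.?\]/ { rst++ }
    END { printf "SYN=%d RST=%d\n", syn, rst }
  '
}
```

Use it as "tcpdump -r secondary.pcap -n tcp port 647 | count_flags". If 
every SYN is matched by an RST and there are no SYN-ACKs [S.], nothing 
is accepting the connection.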

When I look at it with Wireshark it's the same, but perhaps shown a 
little more clearly:

1	0.000000	192.168.1.50	192.168.1.51	TCP	74	46869 → 647 [SYN] Seq=0 
Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809692760 TSecr=0 WS=128
2	0.000024	192.168.1.51	192.168.1.50	TCP	54	647 → 46869 [RST, ACK] Seq=1 
Ack=1 Win=0 Len=0
3	5.000457	192.168.1.50	192.168.1.51	TCP	74	57529 → 647 [SYN] Seq=0 
Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809697760 TSecr=0 WS=128
4	5.000479	192.168.1.51	192.168.1.50	TCP	54	647 → 57529 [RST, ACK] Seq=1 
Ack=1 Win=0 Len=0
5	10.000924	192.168.1.50	192.168.1.51	TCP	74	51935 → 647 [SYN] Seq=0 
Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809702760 TSecr=0 WS=128
6	10.000945	192.168.1.51	192.168.1.50	TCP	54	647 → 51935 [RST, ACK] 
Seq=1 Ack=1 Win=0 Len=0
7	15.001390	192.168.1.50	192.168.1.51	TCP	74	57497 → 647 [SYN] Seq=0 
Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=3809707761 TSecr=0 WS=128

Can you please post the failover peer definitions for both dhcp servers? 
I think we need to check that they make sense. Second, please post the 
interface config for that interface on each server: the output from 
"ip addr show ethX", or whatever the correct interface name is. We need 
to be sure the address, netmask, etc. match up.
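
Once you have the address and prefix length from each server, checking 
that they land in the same subnet is simple arithmetic. Here is a rough 
sketch in plain sh; ip_to_int and same_subnet are names I made up, and 
the /24 in the usage line is only an assumption until we see your 
"ip addr show" output.

```shell
# ip_to_int A.B.C.D
# Prints the IPv4 address as a 32-bit integer.
# (Hypothetical helper; octets must not have leading zeros.)
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# same_subnet ADDR1 ADDR2 PREFIXLEN
# Succeeds if both addresses share the same network under that prefix.
same_subnet() {
  a=$(ip_to_int "$1")
  b=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( a & mask )) -eq $(( b & mask )) ]
}
```

For example, "same_subnet 192.168.1.50 192.168.1.51 24 && echo same 
subnet". If the two servers report different prefix lengths, run it once 
with each; a mismatch there is worth fixing in its own right.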

So that packet capture is very useful. It's pinpointed an issue 
straight away.

regards,
Glenn

On 2022-06-03 16:37, Leslie Rhorer wrote:
>     I am seeing a listening connection on the primary server on 647,
> but nothing on the secondary.  I have included the tcpdump from the
> secondary on port 647 as a gz file.  'Still waiting on the dumps on
> ports 67 and 68 (it's taking a while for 100 packets to pass)
> 
> On 6/3/2022 1:03 AM, Glenn Satchell wrote:
>> Hi Leslie,
>> 
>> I know about capturing packets on a 10G interface :) many gigabytes in 
>> a few seconds...
>> 
>> So you need to use filters when capturing, e.g. with tcpdump:
>> 
>>   tcpdump -i eth0 host <other dhcp server IP or name> and tcp port 647
>> 
>> will only capture the failover traffic on eth0 directed to or from the 
>> other server, and ignore the rest.
>> 
>>   tcpdump 'udp and (port 67 or port 68)'
>> 
>> will capture dhcp packets.
>> 
>> You can add options like "-c 100" to stop after 100 packets are 
>> captured. "-w filename" will capture to a file and you can copy this 
>> file to your desktop and use wireshark to read it.
>> 
>> With failover, it's better to restart one dhcp server, wait for it to 
>> sync, then restart the other one. If you shut down both and then start 
>> them, then they come up in recover mode.
>> 
>> Also looking at failover connections:
>> 
>>   netstat -ant | grep 647
>> 
>> should show an established connection between the two servers.
>> 
>> regards,
>> Glenn
>> 
>> On 2022-06-03 15:39, Leslie Rhorer wrote:
>> 
>>> On 6/2/2022 11:30 PM, Gregory Sloop wrote:
>>> 
>>>> Are you seeing balance messages every hour as the two re-balance the 
>>>> available lease pool?
>>> No, I don't think so.  It has only been a couple of hours since I 
>>> have had both online, however.
>>> 
>>>> You say they are both handling leases properly, but how do you know 
>>>> this? (That a machine gets a lease from somewhere is not good 
>>>> evidence.)
>>> 
>>> Do you mean because some other machine / device could be issuing 
>>> leases?  No.  In that case,
>>> 
>>> 1. Killing both servers would not take down any DHCP clients. If both 
>>> servers are shut down, DHCP clients start failing in about an hour, 
>>> until they are all dead.
>>> 
>>> 2. DHCP responses on the LAN stop completely the moment both servers 
>>> are taken down.
>>> 
>>> 3. No other machine would know anything about the list of dynamically 
>>> assigned fixed IP addresses in dhcpd.static.  None of the addresses 
>>> of any of the clients ever change.
>>> 
>>> 4. Whenever one server is shut down, the other responds with tons of 
>>> responses in  the log.
>>> 
>>>> A packet capture in front of the secondary might be helpful to see 
>>>> what traffic is passing - both to the peer and to clients.
>>> While not impossible, that is a bit easier said than done.  The links 
>>> between the servers are 10G.  I can look into it.
>>> 
>>>> (I hate making captures, at least as much as the next person, but 
>>>> dang if they don't, nearly always, show something that was different 
>>>> than I assumed. So, I've just gotten a lot less averse to getting 
>>>> captures. Yeah, they'll probably take me extra time to setup and get 
>>>> and paw through, [all when I could be fixin' stuff!] but they can 
>>>> save hours or days of fruitless searching for a fix, when I don't 
>>>> even really *know* what's wrong yet. Don't know about anyone else, 
>>>> but fixing problems gets a whole lot easier when I actually know 
>>>> what's wrong, or at least have a good idea what's going on. :)
>>> 
>>> Agreed, although when an interface is chunking away at over 10,000 
>>> packets per second...
>>> 
>>> If something doesn't break loose, I will see about loading Wireshark.