HP xw6400 Network Boot DHCP failure - no client DHCPREQUEST after server DHCPOFFER

Sun Sep 11 10:15:40 UTC 2011

Hello everyone,

I'm having the same problem this person (Howard Wang) had back in 2006:

https://lists.isc.org/pipermail/dhcp-users/2006-May/000763.html

I will try to answer all of Simon's questions based on that thread.

Here is some background:

Our network had a DNS hardware failure recently, which caused us to
reorganize the network slightly, but everything has been restored
(Internet access, DHCP, DNS, and a bunch of other services) except for
workstations' ability to network boot.

Before the failure (a few weeks ago), the DNS and DHCP servers were
working fine, giving out IPs for normal release/renewals and also
allowing successful network boots (leading to fully-automatic
Kickstart installations).

Some background on our network setup:

- There are basically two relevant subnets: 70.0/28 and 71.0/24
- Running ISC DHCP (isc-dhcpd-V3.0.6) on Ubuntu Server 8.04.4 at
###.###.71.254 (this server is also the router)
- Running BIND 9.7.3 on Ubuntu Server 11.04 at ###.###.70.1
- TFTPD-HPA server (a.k.a. web101) is at ###.###.70.3 (we changed its
IP after the network failure, but the TFTPD server config has not changed)
- All workstations and servers (both subnets) currently have IPs from
the DHCP server and are able to release/renew normally
- All workstations and servers (both subnets) are able to query the
DNS server and access the Internet normally
- No dynamic DNS, only static: DHCP MAC-to-host, and DNS host-to-IP

This is the current network booting process (and failure point):

- User instructs workstation to network boot (at bootup), which then
asks for an IP using DHCP
- Workstation (client) sends a broadcast DHCPDISCOVER packet
- DHCP server (70.14 or 71.254, depending on the workstation subnet)
sends a broadcast DHCPOFFER packet
- Workstation never sends a DHCPREQUEST; instead, it loops back and
sends another DHCPDISCOVER packet

The strange part:  when the same workstation is fully booted into
Ubuntu Desktop (manual install), it can release/renew just fine
(discover, offer, request, ack).

Here are the subnet declarations from dhcpd.conf:

  subnet ###.###.70.0 netmask 255.255.255.240 {
    option routers ###.###.70.14;
    default-lease-time 4400;
  }

  subnet ###.###.71.0 netmask 255.255.255.0 {
    always-broadcast true;
    default-lease-time 4400;
    option routers ###.###.71.254;
    option broadcast-address ###.###.71.255;

    pool {
      max-lease-time 300;
      range ###.###.71.230 ###.###.71.253;
      allow unknown-clients;
    }
  }
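
As a sanity check on the two netmasks above, here is a minimal Python
sketch (not part of our setup). The "10.0" prefix is a made-up stand-in
for the redacted ###.### octets; only the last two octets and the
netmasks come from the config.

```python
import ipaddress

# Hypothetical placeholder for the redacted "###.###" prefix.
PREFIX = "10.0"

# Netmasks copied from the two subnet declarations.
net70 = ipaddress.ip_network(f"{PREFIX}.70.0/255.255.255.240")
net71 = ipaddress.ip_network(f"{PREFIX}.71.0/255.255.255.0")


def describe(net):
    """Summarize prefix length, usable host range and broadcast address."""
    hosts = list(net.hosts())
    return {
        "prefixlen": net.prefixlen,
        "first_host": str(hosts[0]),
        "last_host": str(hosts[-1]),
        "broadcast": str(net.broadcast_address),
    }


info70 = describe(net70)
info71 = describe(net71)
print(info70)
print(info71)

# The router (.70.14) and the TFTP server (.70.3) should both sit in the /28.
router70 = ipaddress.ip_address(f"{PREFIX}.70.14")
tftp = ipaddress.ip_address(f"{PREFIX}.70.3")
print(router70 in net70, tftp in net70)
```

Under that assumption, .70.14 is the last usable host of the /28 and
.71.255 is the broadcast address of the /24, which matches the "option
routers" and "option broadcast-address" lines.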

We use "use-host-decl-names" so DHCP maps MAC-to-hostname, and DNS
maps hostname-to-IP:

  use-host-decl-names on;

The syntax used in all of these config excerpts should be OK since it
was working fine like this a few weeks ago.

Here are a couple of workstations on the 70.0/28 network (one for
Wireshark, one for doing the network boot and release/renewal tests):

  host linux301 { hardware ethernet 00:1b:78:a9:4a:ae;
                  fixed-address linux301; next-server web101; }
  host linux303 { hardware ethernet 00:1b:78:a9:49:35;
                  fixed-address linux303; next-server web101; }

And a couple of workstations on the 71.0/24 network (again, one for
Wireshark, one for doing the network boot and release/renewal tests):

  host linux107 { hardware ethernet 00:1b:78:a9:4b:5a;
                  fixed-address linux107; next-server web101; }
  host linux204 { hardware ethernet 00:1b:78:a9:4b:44;
                  fixed-address linux204; next-server web101; }

"web101" (a web server) is also the TFTPD-HPA server, which was
working fine with network booting before the network failure.

I ran Wireshark and compared packets from the "OK case" against the
"FAIL case", defined as follows:

OK case = fully-booted workstation release/renew: discover, offer,
request, and ack packets
FAIL case = network boot attempt: discover, offer, (no response, then
looping), discover, offer...

Specifically, I diff'ed the DHCPOFFER packets in both cases (full text
exports from Wireshark of each packet):

These six bytes are found only in the DHCPOFFER packet in the OK case:

+    Option: (t=28,l=4) Broadcast Address = ###.###.71.255
+        Option: (28) Broadcast Address
+        Length: 4
+        Value: 9DF247FF

(1 byte for Code + 1 byte for Length + 4 address bytes = 6 bytes
total, per RFC 2132)

...which explains this other part of the same diff:

-    Length: 316
-    Checksum: 0x1a7c [validation disabled]
+    Length: 322
+    Checksum: 0xa5b7 [validation disabled]
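
To double-check that arithmetic, here is a minimal Python sketch that
encodes option 28 in the code/length/value layout from RFC 2132 and
compares its size with the two Length values in the diff. The helper
name encode_option is my own; the value bytes and the lengths are
copied from the Wireshark output above.

```python
def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCP option as code byte, length byte, then the value."""
    if not 0 < code < 255:
        raise ValueError("code must be 1..254 (0 is Pad, 255 is End)")
    if len(value) > 255:
        raise ValueError("value too long for a one-byte length field")
    return bytes([code, len(value)]) + value


# Option 28 (Broadcast Address) with the value shown in the OK-case offer.
broadcast_opt = encode_option(28, bytes.fromhex("9DF247FF"))

# Length fields from the diff: FAIL-case offer vs. OK-case offer.
fail_len = 316
ok_len = 322

print(broadcast_opt.hex(), len(broadcast_opt))
print(ok_len - fail_len == len(broadcast_opt))
```

The encoded option is exactly as long as the gap between the two
Length values, so the missing option accounts for the whole size
difference.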

I figured that differences in the DISCOVER packets might be causing
the differences in the OFFER packets, so I diff'ed the DHCPDISCOVER
packets (in both cases sent from the same client, with Wireshark
running on the DHCP server):

One notable difference (maybe irrelevant, but it stood out): the
broadcast flag is set in the FAIL case and clear (unicast) in the OK
case:

@@ -56,10 +56,10 @@
     Hardware type: Ethernet
     Hardware address length: 6
     Hops: 0
-    Transaction ID: 0x7aa94480
-    Seconds elapsed: 8
-    Bootp flags: 0x8000 (Broadcast)
-        1... .... .... .... = Broadcast flag: Broadcast
+    Transaction ID: 0x6d9ad047
+    Seconds elapsed: 6
+    Bootp flags: 0x0000 (Unicast)
+        0... .... .... .... = Broadcast flag: Unicast
         .000 0000 0000 0000 = Reserved flags: 0x0000
     Client IP address: 0.0.0.0 (0.0.0.0)
     Your (client) IP address: 0.0.0.0 (0.0.0.0)
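
For reference, this is how I read the two flags values: the top bit of
the 16-bit Bootp flags field is the broadcast flag, and the other 15
bits are reserved. A minimal Python sketch (the function name is my
own):

```python
BROADCAST_BIT = 0x8000  # most significant bit of the 16-bit flags field


def wants_broadcast(flags: int) -> bool:
    """Return True when the client asked for broadcast replies."""
    return bool(flags & BROADCAST_BIT)


fail_flags = 0x8000  # DHCPDISCOVER from the network boot attempt
ok_flags = 0x0000    # DHCPDISCOVER from the fully-booted workstation

print("FAIL case broadcast:", wants_broadcast(fail_flags))
print("OK case broadcast:", wants_broadcast(ok_flags))
```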

Also, the parameter request list is different (FAIL case packet is
much shorter, but OK case requests a "Broadcast Address"):

         Value: 01
-    Option: (t=55,l=24) Parameter Request List
+    Option: (t=50,l=4) Requested IP Address = ###.###.71.123
+        Option: (50) Requested IP Address
+        Length: 4
+        Value: 9DF2477B
+    Option: (t=12,l=8) Host Name = "kickseed"
+        Option: (12) Host Name
+        Length: 8
+        Value: 6B69636B73656564
+    Option: (t=55,l=13) Parameter Request List
         Option: (55) Parameter Request List
-        Length: 24
-        Value: 01020305060B0C0D0F1011122B363C438081828384858687
+        Length: 13
+        Value: 011C02030F06770C2C2F1A792A
         1 = Subnet Mask
+        28 = Broadcast Address
         2 = Time Offset
         3 = Router
-        5 = Name Server
+        15 = Domain Name
         6 = Domain Name Server
-        11 = Resource Location Server
+        119 = Domain Search [TODO]
         12 = Host Name
-        13 = Boot File Size
(16 more FAIL case requested parameters omitted here)
....
(and now the final five OK case requested parameters)...
+        44 = NetBIOS over TCP/IP Name Server
+        47 = NetBIOS over TCP/IP Scope
+        26 = Interface MTU
+        121 = Classless Static Route
+        42 = Network Time Protocol Servers
     End Option
     Padding
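
The two request lists are easier to compare once decoded. Here is a
minimal Python sketch that turns the two hex strings from the diff
above into option codes and prints what each side asks for that the
other does not (variable names are my own; the hex strings are copied
verbatim from the Wireshark export):

```python
# Parameter Request List values (option 55) from the two DHCPDISCOVERs.
fail_prl = bytes.fromhex("01020305060B0C0D0F1011122B363C438081828384858687")
ok_prl = bytes.fromhex("011C02030F06770C2C2F1A792A")

# Each byte of the value is one requested option code.
fail_codes = list(fail_prl)
ok_codes = list(ok_prl)

only_ok = sorted(set(ok_codes) - set(fail_codes))
only_fail = sorted(set(fail_codes) - set(ok_codes))

print("FAIL case requests", len(fail_codes), "options:", fail_codes)
print("OK case requests", len(ok_codes), "options:", ok_codes)
print("requested only in OK case:", only_ok)
print("requested only in FAIL case:", only_fail)
print("FAIL case requests option 28:", 28 in fail_codes)
```

Option 28 (Broadcast Address) shows up only in the OK case list, which
is consistent with the server including it only in the OK case offer.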

I still strongly believe the problem is the missing 6 bytes from the
DHCPOFFER packet in the FAIL (network boot DHCP) case.

So, one of the following probably needs to happen:

1. The DHCP client needs to specifically request a broadcast DHCPOFFER, or...
2. The DHCP server needs to be forced to add those 6 bytes (Option:
(28) Broadcast Address) somehow...

#1 is probably not the case since the clients were network booting
just fine (a few weeks ago) before the network failure.

Does anyone know how to accomplish #2?

I've also tried (from the DHCP handbook):

- forcing the DHCP server to always broadcast (always-broadcast true):
no effect
- setting up a DHCP relay agent on the router/DHCP server (not needed,
but I tried it anyway)

I've also checked the ARP cache tables on both the server and client
sides (both sides successfully cache the correct IP->MAC mapping).

Just to clarify, I ran these tests on both subnets (70.0/28 and
71.0/24) just to make sure crossing subnets wasn't the problem.

Neither syslog nor /var/log/messages shows anything of interest
beyond the basic DHCP log lines.

Any help would be greatly appreciated,

Masao