I need to parse dhcpd.leases to store data in mysql

Fri Jun 30 15:05:35 UTC 2006

Sébastien CRAMATTE wrote:
> Well, I would like to run a "tail -q -n+0 --follow=name --retry
> /var/lib/dhcpd3.leases" background process
> and get leases from this stream. This way I don't need to parse the
> whole file every time (I have more than 2000 IPs, and leases are renewed
> every 24 hours).
> 
> I work for a cable operator and I need to log all leases and
> keep the data for one year or more. This is why I need
> to store these leases in a database.
> 
> But with your code, how can I know whether a lease was parsed before? I
> don't want to store the same lease twice. Should I use "lease, starts,
> end, mac, state" as the database unique key?

OK, now I understand what you want.

This is not quite as simple as it might seem, if you need
it to be completely robust.  A few observations:

- although "real-time" loading into the database might seem like a
good idea, it might not be.  How do you do database
maintenance?  That is, the database will need to be down once
in a while, either intentionally for maintenance, or accidentally
due to some failure.  If you can afford to have dhcpd down at
the same time, fine, but if you need dhcp to be very close to
100% available, then you probably want dhcp and the database
loading de-coupled.

- you might think (I did) that time+ip+mac+state is unique
for leases (thus could be db primary key), but it is not, because
sometimes (though rarely) multiple leases with the same mac and
ip occur at the same time, given the second-level time resolution
in the lease file.  This is presumably due to semi-broken clients,
as it obviously makes little sense.  But it happens.  You have to
decide whether you want to actually capture *all* leases, or
perhaps discard some small percentage that are probably noise.

- parsing the lease file periodically, as some have suggested
as a possibility, can be a problem if the lease file has been
rewritten between your parsing passes.  There are two issues:

1) you will completely miss leases that were assigned and
released in that window.  For example, if you process the
lease file at 12:00, a lease is assigned at 12:01, then the
client releases the lease at 12:02, and the dhcpd server
re-writes the lease file at 12:03, and you again process the
lease file at 12:05, you will never see the lease assigned at 12:01.
Whether this is a problem depends on your needs.

2) when the lease file is re-written, the order is not preserved.
After re-writing the lease file, two lease records that have the
same time will not necessarily be in the order in which they
were actually issued.  Whether this is a problem for you also
depends on your needs.  It is for me, but I'm not going to
go into why I care about this (I suspect most people don't).

Accordingly, some 'tail -f' type technique is needed to guarantee
that you actually get every lease, and get them in order.
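
As a rough illustration of the 'tail -f' idea, here is a minimal
Python sketch. The class name LeaseFollower and its poll() method are
made up for this example. It assumes the lease file is replaced by
writing a new file and renaming it over the old one, so that a changed
inode signals a rewrite; a file that has shrunk is treated the same way.

```python
import os


class LeaseFollower:
    """Remember a read position in a file and notice when the file is replaced."""

    def __init__(self, path):
        self.path = path
        self.inode = None  # inode of the file we last read from
        self.offset = 0    # byte offset of the next unread byte

    def poll(self):
        """Return the complete lines that appeared since the last call."""
        try:
            f = open(self.path, "rb")
        except FileNotFoundError:
            return []  # file is briefly missing; try again on the next poll
        with f:
            st = os.fstat(f.fileno())
            if st.st_ino != self.inode or st.st_size < self.offset:
                # A different file (rewritten) or a truncated one:
                # start reading again from the top.
                self.inode = st.st_ino
                self.offset = 0
            f.seek(self.offset)
            data = f.read()
        # Consume only up to the last newline, so a half-written line
        # is left for the next poll instead of being split in two.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return data[:end].decode("utf-8", "replace").splitlines()
```

A caller would invoke poll() in a loop with a short sleep between
calls. After a rewrite, poll() returns every line of the new file, so
leases that were already seen still have to be filtered out by the
caller.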

With that background, here is roughly what I do:

I have a program that "serializes" leases.  It does the
equivalent of 'tail --follow' on the lease file, and it
assigns a unique serial number to each lease, and then
it writes (appends) to a "shadow" lease file that is just
like the real leases file, except I add a serial field.
I timestamp the shadow lease file, so that I have
one shadow (serialized) lease file per day.
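
To make the shadow file idea concrete, here is a small Python sketch.
Everything named here is an assumption for illustration: the
"serial N;" line syntax, the file name prefix, and the helper names
add_serial() and shadow_path(). A lease record is taken to be a list of
lines starting with "lease <ip> {" and ending with "}".

```python
import datetime


def shadow_path(day, prefix="dhcpd.leases.shadow"):
    """Name of the shadow file for a given date (one file per day)."""
    return f"{prefix}.{day:%Y%m%d}"


def add_serial(record_lines, serial):
    """Return a copy of one lease record with a 'serial N;' line
    inserted just before the closing brace."""
    out = list(record_lines)
    for i in range(len(out) - 1, -1, -1):
        if out[i].strip() == "}":
            out.insert(i, f"  serial {serial};")
            return out
    raise ValueError("record has no closing brace")


def append_to_shadow(record_lines, serial, now=None):
    """Append the serialized record to today's shadow file."""
    now = now or datetime.date.today()
    with open(shadow_path(now), "a") as f:
        f.write("\n".join(add_serial(record_lines, serial)) + "\n")
```

Because the shadow file is only ever appended to, a reader of it can
follow it without worrying about the reordering that a rewrite of the
real lease file causes.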

I use ip+starts+ends+state as the unique key.  As mentioned
above, this is in fact not unique, but the rare occasions
when it is not I consider noise and ignore.  My program
maintains a hash of the key fields, so that when the
dhcpd daemon rewrites the file, and my program then re-opens
the new dhcpd.leases and re-reads the leases, for each lease
it checks the hash and ignores those it has already processed.
This hash, and the current serial number, are maintained
in a dbm file, so my program can stop and start, and
keep its state.  As long as my program is stopped and started
without the dhcpd server re-writing in between, I cannot
lose any leases, and there are never any duplicates in
my serialized shadow file, and the order in which the lease
records were originally written is preserved. (it is rare
that my program needs to start/stop; it typically runs
for months at a time)
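
The duplicate check described above could look something like the
following Python sketch. It is only an approximation under stated
assumptions: the function names are invented, SHA-1 is an arbitrary
choice of hash, and "state" stands for any dict-like store (a plain
dict here; an object returned by Python's dbm.open() should behave the
same way for these operations).

```python
import hashlib
import re

# Fields that make up the key, besides the IP on the "lease" line.
FIELD_PATTERNS = {
    "starts": re.compile(r"^\s*starts\s+(.+?);"),
    "ends": re.compile(r"^\s*ends\s+(.+?);"),
    # Anchored at line start, so "next binding state" does not match.
    "state": re.compile(r"^\s*binding state\s+(\S+?);"),
}


def lease_key(record_lines):
    """Hash of ip+starts+ends+state for one lease record."""
    ip = record_lines[0].split()[1]
    fields = {"starts": "", "ends": "", "state": ""}
    for line in record_lines[1:]:
        for name, pattern in FIELD_PATTERNS.items():
            m = pattern.match(line)
            if m:
                fields[name] = m.group(1)
    raw = "|".join([ip, fields["starts"], fields["ends"], fields["state"]])
    return hashlib.sha1(raw.encode()).hexdigest()


def next_serial_if_new(record_lines, state):
    """Return a fresh serial number for an unseen lease, or None if the
    lease was already processed. Both the seen keys and the counter
    live in 'state', so they survive a restart if it is persistent."""
    key = lease_key(record_lines)
    if key in state:
        return None
    serial = int(state.get("serial", b"0")) + 1
    state["serial"] = str(serial)
    state[key] = "1"
    return serial
```

Two records that differ only in their MAC address get the same key
here and the second is dropped, which matches the "treat it as noise"
decision described above.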

Then I have a second program that parses the shadow lease
file, and loads the leases into a database.  The database
is on a different machine.  The primary key in the database
is the serial number, so avoiding duplicates in the database
is simple (automatic).  This program runs in real-time normally,
doing a 'tail --follow' style thing, so my leases are loaded
into the database in near real-time (1-2 second latency).
But since the serializing and db loading are decoupled,
I can turn off the db loader if the db is down, or if there's
a network problem or something.  The db loader program
keeps track of the last serial number loaded into the database,
and so when it is started, it picks up where it left off,
and then catches up and continues real-time loading.
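
A minimal sketch of such a loader, in Python, is below. It uses the
standard sqlite3 module purely as a stand-in so the example is
self-contained; with MySQL the analogous statement would be
INSERT IGNORE. The table layout and function names are assumptions,
and the "last serial loaded" is simply read back from the table rather
than kept in a separate state file.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and make sure the leases table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leases ("
        " serial INTEGER PRIMARY KEY,"
        " ip TEXT, starts TEXT, ends TEXT, state TEXT)"
    )
    return conn


def last_loaded_serial(conn):
    """Highest serial already in the database, or 0 if it is empty."""
    row = conn.execute("SELECT MAX(serial) FROM leases").fetchone()
    return row[0] or 0


def load_new(conn, leases):
    """Insert leases whose serial is above the last one loaded.

    'leases' is an iterable of (serial, ip, starts, ends, state)
    tuples. Returns the number of rows handed to the database.
    """
    start = last_loaded_serial(conn)
    rows = [r for r in leases if r[0] > start]
    with conn:  # one transaction; rolled back if anything fails
        # The primary key makes a repeated serial a no-op.
        conn.executemany(
            "INSERT OR IGNORE INTO leases VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)
```

If the loader is stopped for a while, the next call to load_new() with
the records from the shadow files skips everything up to the last
serial stored and inserts the rest, which is the "pick up where it
left off" behaviour.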

So this is an outline of one approach. I have naturally
left out many details, but it should give you some sense
of how it works and what the major issues and objectives
were in my case.

Mark