dont-use-fsync real-world impact

Jure Sah e at juresah.si
Fri Sep 27 10:10:21 UTC 2019


Apologies for a late response. I've read the other answers and
replication does seem like an interesting solution.

On 8. 09. 19 00:19, Simon Hobson wrote:
> Jure Sah <e at juresah.si> wrote:
>
>> The documentation clearly states that using the dont-use-fsync option is
>> not recommended.
>>
>> I am wondering what is the realistic impact of this? As I understand the
>> kernel commits dirty pages to disk every 30 seconds by default, and this
>> is configurable. Wouldn't this mean that at worst 30 seconds worth of
>> leases are lost?
> Yes, but that could be a rather serious loss of data for some operators. As always, there's no "one size fits all" answer, different operators will have different ideas on this.
> Indeed, AIUI (from several years ago at least) the DHCP service in Windows Server massively outperformed the ISC DHCP server in benchmarks using out-of-the-box settings. The reason for this was that the MS server did NOT fsync its leases database and was thus vulnerable to exactly the issue you mention - also making it non-compliant with the relevant RFC.
> However, in their defence, they have "sort of" moved that security aspect to clients by making the clients very sticky about their leases - more so than other clients in my observations. That doesn't fully prevent the problem of the server missing knowledge of leases it has granted.
>
>> The leases file is in most cases relatively tiny (under 1 MB)
> That's probably a generalisation too far. Mine (at home) is only 20k, but as Andrew Bell has already pointed out, some people do have large lease files.
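For reference, the "30 seconds" figure mentioned above corresponds to Linux's dirty-page writeback tunables under /proc/sys/vm/. A small sketch for inspecting them (the paths are Linux-specific; the helper simply returns None where they don't exist):

```python
def read_centisecs(path):
    """Return a vm writeback tunable converted to seconds, or None if absent."""
    try:
        with open(path) as f:
            return int(f.read()) / 100.0
    except (OSError, ValueError):
        return None

# dirty_expire_centisecs: age at which dirty pages become eligible for writeback
# dirty_writeback_centisecs: how often the kernel flusher threads wake up
for name in ("dirty_expire_centisecs", "dirty_writeback_centisecs"):
    seconds = read_centisecs("/proc/sys/vm/" + name)
    print(name, "->", seconds)
```

On a stock kernel, dirty_expire_centisecs defaults to 3000, i.e. the 30 seconds referred to above, and both values can be changed via sysctl.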

Well, a typical modern server, especially if it's a dedicated machine,
has at least 64 GB of RAM, of which the OS takes up at most 4 GB,
leaving a good 60 GB of cache space for that lease file.

Suffice it to say, the leases file is in all events tiny and could
easily fit in RAM several hundred times over.

>
>> From the past correspondence from the mailing list archive I surmise
>> that people usually work around this by using hardware cache that does
>> not obey fsync, which simply offloads the problem from the kernel to the
>> cache controller and only superficially solves the problem.
> Yes, but no.
> Yes it offloads the problem, no it's not just a superficial fix. A "proper" hardware cache will be battery backed and can survive a crash or power failure of the host. So if we assume we're talking about the hardware cache in a disk controller (eg a RAID controller) then if the power goes off without the chance of an orderly shutdown, then the battery backed cache will hold the updates until the power comes back on again - at which point it will push the updates out to the disk(s).
> There are other sorts of cache hardware. In the distant past I recall seeing (and drooling over !) a "magic box" that comprised a stack of RAM, some disks, a battery, and a controller. To the host it presented as a standard wide SCSI device (that dates it), while internally it was a big RAM disk. In the event of power failure, the battery would run the system long enough to write everything to disk.
> In both cases (and others), under normal conditions it's safe to assume that if the "disk" comes back and says "yes that's written", then it's either been written or has been saved into battery backed cache that will survive problems such as host crashes or power failures. If the cache/disk subsystem fails in that promise, then that's really little different to having a normal disk fail and lose all your data.

And this is where I see the problem. I understand that this is a
software mailing list and that this might not be obvious to people who
deal with things several abstraction layers above the hardware... and I
also understand that at the end of the day this might not matter in the
real world. However, if the question is the value of fsync and
battery-backed disk cache, consider the following:

When a write is executed, the data is first built up in the
application's write buffer, from where it is transferred to the kernel's
page cache in system RAM. When an fsync or a dirty page writeback is
executed, the kernel pushes the data over to the disk controller, which
stores it in the hardware disk write buffer and then transfers it to the
physical media.
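The path just described can be sketched in code. This is a generic illustration of the layers involved, not dhcpd's actual write routine:

```python
import os

def durable_append(path, record):
    """Append a record, pushing it through each volatile layer in turn."""
    with open(path, "a") as f:
        f.write(record)       # 1. application write buffer (libc/stdio)
        f.flush()             # 2. kernel page cache - still volatile system RAM
        os.fsync(f.fileno())  # 3. ask the disk stack to commit to stable storage
```

Without step 3, the data leaves system RAM only when the kernel's periodic dirty-page writeback runs, which is exactly the ~30-second window discussed earlier in this thread.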

If there is a power failure, and it unluckily occurs before a dirty
page writeback or fsync, then the data is still in system RAM: it
goes poof and is never committed to the battery-backed hardware disk
write buffer, to be flushed to the disks on reboot. So exactly what
benefit does the battery provide on systems that do not carry out timely
fsyncs? And what benefit do timely fsyncs provide on systems that do not
have a battery-backed storage cache?

It could be argued that systems that are not battery backed should not
have a hardware disk cache. And it could be argued that systems without
a UPS can lose data written since the last flush. But to argue that a
battery-backed disk cache somehow helps on systems with fsync turned off
is nonsense.


I've had some discussions on this topic on the mailing lists of other
applications, and it appears that the developers of that software
understand that the primary purpose of regular fsyncs is to ensure
atomic writes, rather than to preserve the last few seconds' worth of
leases. If there is an unmitigated power failure it is understood that
there will be some data loss, but the fsyncing is there to ensure that
the leases database remains in a recoverable state (in the case of the
leases file, atomic writes ensure that the file is syntactically
correct). They understood the performance bottleneck fsync imposes on
their application, but conceded that without an atomic write mechanism
from the underlying filesystem, there was no real alternative.
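The atomic write mechanism alluded to above is commonly implemented as write-to-temporary-then-rename, since rename() within one filesystem is atomic. A sketch of that general pattern (this is not ISC dhcpd's actual code, and the directory fsync at the end is POSIX/Linux-specific):

```python
import os
import tempfile

def atomic_rewrite(path, data):
    """Replace `path` so a crash leaves either the complete old file or the
    complete new one on disk, never a half-written mixture."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file on the same filesystem
    with os.fdopen(fd, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # new contents are durable before we publish them
    os.rename(tmp, path)      # atomic publish: readers see old or new, not both
    dfd = os.open(dirname, os.O_DIRECTORY)
    try:
        os.fsync(dfd)         # persist the rename itself
    finally:
        os.close(dfd)
```

Note that this bounds corruption, not the number of syncs: it still costs at least one fsync per rewrite, which is why the bottleneck remains.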

Are there any ISC DHCP devs or maintainers reading this list, or should
I post over on the other mailing list? Basically, I wish to know whether
anyone has thought about an alternative solution to the atomic write
problem, one with fewer bottlenecks. Are there any plans, cancelled
ideas, etc.?

LP,
Jure


More information about the dhcp-users mailing list