dont-use-fsync real world impact

Simon Hobson simon at thehobsons.co.uk
Fri Sep 27 17:46:08 UTC 2019


Jure Sah <e at juresah.si> wrote:

> Well, a typical modern server, especially if it's a dedicated machine
> has at least 64 GB of RAM, of which the OS takes up at most 4 GB,
> leaving a good 60 GB of cache space for that lease file.

You've obviously worked for more generous managers than I have :-(

>>> From the past correspondence from the mailing list archive I surmise
>>> that people usually work around this by using hardware cache that does
>>> not obey fsync, which simply offloads the problem from the kernel to the
>>> cache controller and only superficially solves the problem.
>> Yes, but no.
>> Yes it offloads the problem, no it's not just a superficial fix. A "proper" hardware cache will be battery backed and can survive a crash or power failure of the host. So if we assume we're talking about the hardware cache in a disk controller (eg a RAID controller) then if the power goes off without the chance of an orderly shutdown, then the battery backed cache will hold the updates until the power comes back on again - at which point it will push the updates out to the disk(s).
>> There are other sorts of cache hardware. In the distant past I recall seeing (and drooling over !) a "magic box" that comprised a stack of RAM, some disks, a battery, and a controller. To the host it presented as a standard wide SCSI device (that dates it), while internally it was a big RAM disk. In the event of power failure, the battery would run the system long enough to write everything to disk.
>> In both cases (and others), under normal conditions it's safe to assume that if the "disk" comes back and says "yes that's written", then it's either been written or has been saved into battery backed cache that will survive problems such as host crashes or power failures. If the cache/disk subsystem fails in that promise, then that's really little different to having a normal disk fail and lose all your data.
> 
> See and this is where I see the problem. I understand that this is a
> software mailing list and that this might not exactly be obvious to
> people who deal with things several abstraction layers above the
> hardware... and I also understand that at the end of the day this might
> not matter in the real world. However, if the question is the value of
> fsync and battery-backed disk cache, consider the following:
> 
> When a write is executed, it is first built in the write buffer of the
> application, from where it is transferred to the kernel file page memory
> structure in system RAM. When an fsync or dirty page write is executed,
> the kernel pushes the data over to the disk controller which stores it
> in the hardware disk write buffer, and then transfers it to the physical
> media.
> 
> If there is a power failure, and it unluckily occurs before a dirty
> page write or fsync, then the data is still in the system RAM and it
> goes poof and is never committed to the battery backed hardware disk
> write buffer, to be put into the disks on reboot. So exactly what impact
> does the battery have on systems that do not carry out timely fsyncs?

I think we are talking about slightly different combinations of options.
I was talking about using battery backed cache to mitigate the performance impact of frequent fsyncs. That way the steps between the application building a new record and it being "secure" are fast, with the slow disk writes protected by the battery backing the cache.
If we are talking about when the application doesn't do fsyncs, then I agree with you.
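
For anyone following along, the pattern being discussed is just the usual write()-then-fsync() sequence. A rough sketch (not the actual dhcpd code - the helper name is made up):

#include <string.h>
#include <unistd.h>

/* Hypothetical helper - commit one lease record to stable storage
 * before the offer/ack goes out. */
int commit_record(int fd, const char *record)
{
    size_t len = strlen(record);

    /* Data moves from the application buffer into the kernel's page
     * cache; at this point it is still only in system RAM. */
    if (write(fd, record, len) != (ssize_t)len)
        return -1;

    /* Force it out to the disk controller.  With a battery-backed
     * controller cache this returns as soon as the cache has the
     * data, which is what keeps the per-lease cost low. */
    if (fsync(fd) != 0)
        return -1;

    return 0;   /* now safe to send the DHCPOFFER / DHCPACK */
}

Without the fsync() the record can sit in the page cache indefinitely, which is exactly the window Jure describes; with a battery-backed controller cache the fsync() itself becomes cheap.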

> I've had some discussions on the topic on the other applications mailing
> lists, and it appears that the developers of the software understand
> that the primary purpose of regular fsyncs is to ensure atomic writes,
> rather than to preserve seconds worth of leases. If there is an
> unmitigated power failure, it is understood that there will be some data
> loss, but the fsyncing is there to ensure that the leases database
> remains in a recoverable state (in the case of the leases file, atomic
> writes ensure that the leases file is syntactically correct). They
> understood the performance bottleneck of their application due to fsync,
> but conceded that without an atomic write mechanism by the underlying
> filesystems, there was no real alternative.

It's my understanding that the fsyncs are there to ensure that the data has been committed to permanent storage BEFORE the lease is offered to a client - as required by the DHCP RFCs. Atomic writes aren't really an issue - I strongly suspect the server builds a whole lease file record in a buffer and passes it in one write operation. If that's the case, the only non-atomic write issue would arise if multiple writes were made (without fsync) such that a lease file record crossed a disk buffer boundary.
If a lease file entry were truncated, then I believe the server can deal with this on startup by discarding the incomplete record.

So yes, ensuring atomic writes is a by-product of using fsyncs - but not (in this case) the primary reason.
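
To illustrate the "whole record in one write" point, here's a rough sketch (the helper name and record layout are invented, not taken from dhcpd):

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper.  The whole record is assembled in a buffer and
 * handed to the kernel in a single write(), so the worst case after a
 * crash is one truncated record at the tail of the file, which can be
 * detected and discarded when the leases file is re-read on startup. */
int append_lease(int fd, const char *ip, const char *mac, long expiry)
{
    char buf[512];
    int len = snprintf(buf, sizeof(buf),
                       "lease %s {\n  hardware ethernet %s;\n  ends %ld;\n}\n",
                       ip, mac, expiry);
    if (len < 0 || len >= (int)sizeof(buf))
        return -1;                       /* formatting error / record too big */

    if (write(fd, buf, (size_t)len) != len)
        return -1;                       /* one write for the whole record */

    return fsync(fd);                    /* then make it durable */
}

If the power dies part-way through, at worst the last record is truncated, and as above the server can discard it when it re-reads the leases file.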


