Hi,
reading here and there I’m getting more and more scared about bit rot or similar problems (firmware errors? worn-out SSD NAND?): the disks seem to work fine, but data might be silently corrupted, so my simple backup strategy - two rsynced copies kept in different places and made at different times, one on SSD and another on HDD - may not be enough. Please note that all my personal data is on ext4 filesystems and totals less than 1 TB (ok, it’s not a datahoarding size, but this is the sub where the experts on the subject are). Maybe the probability is low, and the probability that a critical file is hit is lower still, but you know Murphy? I do.
Now, the gold-standard solution would be to replace all of my physical servers with ones that support ECC RAM; then I’d have to buy at least 3 CMR disks to build a ZFS RAID or a similar btrfs one. In practice this solution is not sustainable because of time, space and cost, so I have to accept the risk and settle for a second-best solution… but which one? I would also like to avoid using other media types (i.e. optical).
For example, might using a backup tool - restic/kopia or Proxmox Backup Server - reduce the risk? I say so because an incremental approach would let me restore data from a selected point in the past. Of course, I have no way of finding that point in the past and, moreover, I would lose all data produced after it. Maybe I could apply this strategy just to a subset of very critical and immutable data (official documents)? Or, for these documents, could I just use rsync with the checksum option?
As usual, thanks for any suggestion!
copy / paste of my previous post
Silent bit rot, where a bit flips but no hardware error is reported, is extremely rare. My stats say once a year on 300TB of data. Some statistics major can correct me, but if someone has 1TB of data then they should see a single bit flip about once every 300 years, so maybe their great great great grandchildren will see it and report back to them in a time machine.
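For anyone who wants to plug in their own numbers, the back-of-the-envelope estimate can be written out as below. It is only a sketch: it assumes the corruption rate scales linearly with the amount of data, which is a simplification, and the function name is made up for illustration.

```python
# Figures quoted in the post: roughly one silent corruption per year
# across about 300 TB of data.
OBSERVED_EVENTS_PER_YEAR = 1.0
OBSERVED_CAPACITY_TB = 300.0


def expected_years_between_events(capacity_tb: float) -> float:
    """Scale the observed rate linearly to another data size (an assumption)."""
    events_per_tb_per_year = OBSERVED_EVENTS_PER_YEAR / OBSERVED_CAPACITY_TB
    return 1.0 / (events_per_tb_per_year * capacity_tb)


# For 1 TB this comes out at about 300 years between events.
print(round(expected_years_between_events(1.0)))
```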
All of my data is on ordinary ext4 hard drives. I buy all my drives in groups of 3. I have my local file server, local backup, and remote backup. I have 2 drives in the local file server dedicated for snapraid parity and run “snapraid sync” every night.
Snapraid has a data scrub feature. I run that once every 6 months to verify that my primary copy of my data in my file server is still correct.
Then I run cshatag on every file, which generates SHA256 checksums and stores them as ext4 extended attribute metadata. It compares the stored checksum and stored timestamp, and if any file has changed but the timestamp wasn’t edited it reports it as corrupt.
https://github.com/rfjakob/cshatag
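To make the checksum-plus-timestamp idea concrete, here is a rough Python sketch of the comparison described above. It is not cshatag's code: the names `sha256_of` and `check_file` are made up, and the records live in a plain dict instead of ext4 extended attributes so the example stays self-contained.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_file(path, records):
    """Compare a file against its stored (sha256, mtime_ns) record.

    `records` is a dict standing in for the extended attributes.
    Returns "new", "ok", "updated" or "corrupt".
    """
    digest = sha256_of(path)
    mtime_ns = os.stat(path).st_mtime_ns
    stored = records.get(path)
    if stored is None:
        status = "new"
    elif stored == (digest, mtime_ns):
        status = "ok"
    elif stored[0] != digest and stored[1] == mtime_ns:
        # Content changed but the timestamp did not: nobody edited the
        # file on purpose, so treat it as silent corruption and keep
        # the old record as evidence.
        return "corrupt"
    else:
        # The file was touched or edited: refresh the record.
        status = "updated"
    records[path] = (digest, mtime_ns)
    return status
```

The key point is the "corrupt" branch: a normal edit changes the modification time, so a content change with an unchanged timestamp is the suspicious case.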
Then I use rsync -RHXva when I make my backups via rsync of all my media drives. This data is almost never modified, just new files are added. The -X option is to also copy over the extended attribute metadata. Then I run the same cshatag file on the local backup and remote backup server. This takes about 1 day to run. On literally 90 million files across 300TB it finds a single file about once a year that has been silently corrupted. I have 2 other copies that match each other so I overwrite the bad file with one of the good copies.
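The "two copies agree, so the third is the bad one" step is a simple majority vote. A minimal sketch, assuming the three copies are reachable as local paths; `find_bad_copies` is a hypothetical helper, not part of any tool mentioned here:

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_bad_copies(paths):
    """Return the copies whose hash disagrees with the majority.

    Raises ValueError when no hash is shared by more than half of the
    copies, because then there is nothing trustworthy to restore from.
    """
    digests = {p: sha256_of(p) for p in paths}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(paths) // 2:
        raise ValueError("no majority among the copies; inspect by hand")
    return [p for p, d in digests.items() if d != majority_digest]
```

With three copies, any path returned by `find_bad_copies` can be overwritten from one of the other two.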
I only run rsnapshot on /home because that is where my frequently changing files are. The other 99% of my data is maybe “write only” so I just use rsync from the main file server to the two backups. Before I run rsync for real I use rsync --dry-run to show what WOULD change but it doesn’t do anything. If I see the files I expect to be written then I run it for real. If I were to see thousands of files that would be changed I would stop and investigate. Was this a cryptolocker virus that updated thousands of files?
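The "look at the dry run before syncing for real" habit could be partly automated along these lines. This is only a sketch: it assumes rsync is run with --itemize-changes in addition to --dry-run, the function names are invented, and the thresholds are arbitrary placeholders to adjust.

```python
import subprocess


def dry_run_lines(src, dst):
    """Run rsync in dry-run mode and return its output lines (changes nothing)."""
    cmd = ["rsync", "-RHXva", "--dry-run", "--itemize-changes", src, dst]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def summarize_dry_run(lines):
    """Count new, modified and deleted regular files in itemized output."""
    summary = {"new": 0, "modified": 0, "deleted": 0}
    for line in lines:
        if line.startswith("*deleting"):
            summary["deleted"] += 1
        elif line[:2] in ("<f", ">f"):
            flags = line.split(" ", 1)[0]
            # Newly created items show '+' in the attribute columns.
            if "+++" in flags:
                summary["new"] += 1
            else:
                summary["modified"] += 1
    return summary


def safe_to_sync(summary, max_modified=100, max_deleted=100):
    """New files are expected; many rewritten or deleted files are not."""
    return (summary["modified"] <= max_modified
            and summary["deleted"] <= max_deleted)
```

On a mostly append-only archive, thousands of "modified" entries are the warning sign to stop and investigate before running the real sync.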
As for backing up the operating system I have the /etc and /root account backed up every hour through rsnapshot along with /home
I’m not running a business. I can reinstall Linux in 15 minutes on a new SSD and copy over the handful of files I need from the /etc backup
Man… thanks! You are a real master! The steps after the snapraid activities are not clear to me yet, but I think I just have to read it more carefully.
Check out Linux’s built-in dm-integrity.
Store backups using rar with a data recovery record of 6% or larger and you won’t have to worry about bit rot or minor sectors going bad.
…but data might be corrupted, so my simple backup strategy - two rsynced copies kept in different places and made at different times, one on SSD and another on HDD - may not be enough.
It isn’t enough unless you also verify the copies with a checksum (a CRC or a hash).
Obsessing about preventing bitrot is like taking multiple vitamins then crossing the street without checking for traffic. Mantras: Any storage media/device can fail at any time, for any reason, with or without notice. Reliability and longevity is BACKUP, BACKUP, BACKUP!
Is bitflip/bitrot real? I believe it is. However, the likelihood of it corrupting multiple drives/media at the same time, as long as they’re continually checked, verified and copied to new drives/media is infinitesimally small.
But if I back up data that is already bit-rotten at the source, won’t the errors be propagated to the copies? I am missing something…
Backups with versioning should solve the issue once you are able to identify which data got corrupted.
And is there a best practice for doing so…?
Best practices for making backups?
Checksum your files, as you say. Just md5sum or sfv tools will do this and are quick and easy to check at any time.
If you want a way to recover as well, create a certain % of parity files with a tool like par2. If you’re just worried about the occasional flipped bit, a very small % of parity will go a really long way. Creating par2 is NOT super fast but you only have to do it once.
This is basically file-level RAID.
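To show what "file-level RAID" means, here is a toy example using plain XOR parity. To be clear, this is not what par2 does internally (par2 uses Reed-Solomon codes, which can repair several damaged blocks); XOR can only rebuild one block that is known to be missing. The sample byte strings are made up.

```python
def xor_blocks(blocks):
    """XOR byte strings together, treating shorter ones as zero-padded."""
    out = bytearray(max(len(b) for b in blocks))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three "files" (stand-ins for real file contents).
data = [b"holiday.jpg bytes", b"tax-2023.pdf bytes", b"notes.txt bytes"]

# One parity block protects all three.
parity = xor_blocks(data)

# Pretend the second file is lost: XOR the survivors with the parity,
# then trim back to the original length to drop the padding.
recovered = xor_blocks([data[0], data[2], parity])[:len(data[1])]
print(recovered == data[1])
```

The same principle, with stronger math, is why a small percentage of par2 parity data can repair an occasional flipped bit.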
Interesting, I will dig deeper. I think I should add to or reformulate the question, or I am missing something: I first have to make sure - or check - that the source files are not bitrotten, otherwise all backups will have bitrotten data, right? I mean, I might checksum, but on working data how can I have evidence of it? And on “cold” data - e.g., pictures - how can I be sure that damaged source files are not copied onto the backups? I can’t - because of time - checksum 1 TB of data every night…