I have a handful of smallish machines in my home; the 'always on' systems include an iMac (about 25% used of 1TB drive), a few rpis (about 25-50% used of 2-4TB storage, give/take), and a mac mini (50% used of 4TB RAIDZ2 + 2TB Time Machine). "Traditional" backups involve slurping all the important data to a (single) redundant location, and traditionally requires that the single location be as large as the sum of storage in the network. So backups here would require yet another machine and another 8T drive; or I would have to host that drive on an existing machine and deal with the potential irritation of restoring that machine if/when it fails.

This seems silly; in aggregate my network is less than half utilized, so it should be able to back itself up, right? BUT my mini has a few filesystems with over 1T used... so I would have to split those up and back them up to different machines around the network, keep track of all that, and pull it back if/when I have to restore. That's silly. And OBTW my net reliability approaches zero as I spread subsets-of-backups over more hosts: single host failure would almost certainly cause loss of some portion of the backup.

Quick recap of things I'd want:

  1. not-another-machine to store copies of my data
  2. not a single, massive FS on which to store them
  3. no SPOF (backup host)
  4. no "where did the files go" tracking

Cluster filesystems are available to pool storage around the network; a bunch of smallish filesystems can contribute their storage to look like one big filesystem. Unfortunately, as above I have at least 4 different OSs and I think 5-6 flavors. FORTUNATELY they're all UNIX-like (Windows was excised in 2004).

What I want is a cluster backup solution: NO extra machine to store data, NO massive FS, NO single point of failure, and NO juggling of "this can fit over there (as long as it doesn't grow). This should be a clever (not smart) and dynamic system. I would also love the ability to keep N copies of data spread across my M machines, and I DO NOT want to care which of those machines are up when it comes time to restore.

So yeah, I got bored and made one. It doesn't mangle files (keeping them easy to verify, easy to restore) but it also doesn't require I have any single storage location as large as any single dataset. It is dynamic, will live on free storage in the network, and requires minimal configuration. I can even specify a minimum number of replicas per "source", and the system will maintain at least those; if there are two replicas, I can survive *any* single-host failure.

Ex: say a source filesystem has 10 files, 0.txt ... 9.txt, of 100GB each (1TB total). I have 3 hosts on the network with 700GB available each (total: 2.1TB available). A traditional system won't work here. The dynamic system, however, will self-organize and may come up with a solution like: host A: {0-7}.txt; host B: {3-9}.txt; host C: {0,1,2,8,9,4,7}.txt. Any single failure still leaves a complete replica on the network, and every file exists (at least) twice.

Implementation details: it's a simple client/server architecture, many:many. The configuration specifies source filesystems, backup locations, space (or reservation) for backups, and some options like "TCP port" or "number of replicas." That's it. Adding a new source (or backup host) is a one-line config change. The configuration is also backed up, so the change gets made once and propagates to all participating machines.

The "Server" is relatively dumb, passive, and really acts like a locking mechanism. Clients will request files, and can ask status ("are any files underserved" / "are any files overserved"). Clients are more clever (but ignorant of one another): they will try to copy underserved files, greedily. If they are over-capacity, they'll try to drop overserved files. There is no client-client communication, and no server-initiated client communication.

The upside is a very simple dynamic system which stores N copies of data over M machines; the downside, there may be some over-use of the network while clients settle in to a stable convergence. If, for instance, one (large) file is served to 4 clients initially, 2 of them may choose to drop it later as other files need coverage. This results in the lost network traffic and storage IO for the two extra copies. In practice this seems pretty minimal, the system tends to converge quickly and with some almost-trivial locking in the server, "efficiency" approaches 100%.

Copies are performed with rsync (see above: all UNIX-like). It handles checksumming on copy, and can smartly transmit partial data for changed files. The clients and server will exchange checksums (server never volunteers -- it only confirms). For IO sanity, checksums can be sampled (and the clients+servers can rate-limit their own IO).

Metadata communication is over TCP, and could be encrypted with minimal effort. It's all client->server including the likely case where the client and server are on the same host.

It's a "teach myself nontrivial Python" project. I have mixed feelings about the language, tho more positive than when I started.