All Backup Solutions Suck

Recently I have been working a lot on a backup solution at work, which has been a painful experience to say the least. Why? Simply because there is no solution that meets my ideal requirements. These are pretty precise:

A nice bonus would be volume-level backups of OpenStack Cinder.

From what I can tell, nothing currently out there meets these requirements. If I take away the Open Source requirement, it does not get much better. Rubrik would look interesting if it weren't tied into VMware, and the same goes for a few other solutions.

Nothing meets my needs :-( Naturally this has got me thinking about how something like this could be achieved, so here goes.

I am actually taking my inspiration from the monitoring world, where there have been similar problems. In the past, one just went straight for Nagios to do alerting and Munin/Cacti for storing metrics. For various reasons these just sucked, but the big one for me was this:

I had to tell it what it had to monitor!

Tomorrow, I may be called upon to deploy a Hadoop cluster with 100 slaves. All of these would have to be individually added to Nagios. This invariably got forgotten, and before long nothing was monitored and Nagios itself was forgotten about. Things broke, nobody knew about it, and everyone said "IT SUCKS".

However, recently I've been playing around with Sensu. This works the other way round. The node announces itself to the server, which has a set of rules that the node uses to monitor itself. This, allied with all the comms going over a message queue, makes it astoundingly scalable.

This is the sort of principle that backup should follow.

You have a central server, which is essentially just an API that a node can query to discover what to do. This is based on rules such as cloud metadata, hostname, subnet, whatever else takes your fancy. As this server is just an API, we can use an HTTP load balancer and a NoSQL database to improve availability and scaling.
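To make that concrete, here is a minimal sketch of the rule matching such an API could do. Everything here is hypothetical: the rule fields, the `match_policy` function, and the first-match-wins ordering are just one way it could work.

```python
# Hypothetical rule matching for the central backup API. A node reports
# facts about itself (hostname, IP, metadata) and gets back a policy.
import fnmatch
import ipaddress

RULES = [
    # First match wins; each rule pairs a selector with a backup policy.
    {"hostname": "db-*", "policy": {"paths": ["/var/lib/mysql"], "schedule": "hourly"}},
    {"subnet": "10.1.0.0/16", "policy": {"paths": ["/etc", "/home"], "schedule": "daily"}},
    {"hostname": "*", "policy": {"paths": ["/etc"], "schedule": "weekly"}},  # catch-all
]

def match_policy(hostname, ip):
    """Return the backup policy from the first rule this node matches."""
    addr = ipaddress.ip_address(ip)
    for rule in RULES:
        if "hostname" in rule and fnmatch.fnmatch(hostname, rule["hostname"]):
            return rule["policy"]
        if "subnet" in rule and addr in ipaddress.ip_network(rule["subnet"]):
            return rule["policy"]
    return None

print(match_policy("db-03", "10.2.0.5"))   # matches the db-* rule
print(match_policy("web-01", "10.1.4.9"))  # matches the subnet rule
```

Because the rules live centrally but the matching is trivially cheap, the API stays stateless and easy to put behind a load balancer.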

A new agent comes online, after being installed by my CM system. It queries the API to find out what to do, and takes it from there. The only time it will interact with the central server is when its configuration changes. It knows what to back up and where to put it, so off it goes. It can use existing tools: tar, bzip2, duplicity, gpg etc. and push the result directly into the desired storage (S3/Glacier or OpenStack Swift would be the best choices I'd say).
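The agent's actual backup step could be as boring as this standard-library sketch, which tars and bzip2-compresses the paths from its policy. A real agent would add the gpg encryption step and push the archive to S3/Swift; the `make_archive` name and layout here are purely illustrative.

```python
# Minimal sketch of the agent's backup step, stdlib only. A real agent
# would pipe the archive through gpg and upload it to the object store.
import os
import tarfile
import tempfile

def make_archive(paths, dest_dir):
    """Tar and bzip2-compress the given paths; return the archive path."""
    archive = os.path.join(dest_dir, "backup.tar.bz2")
    with tarfile.open(archive, "w:bz2") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
    return archive

# Example run against a scratch file, standing in for the policy's paths.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "etc-hosts-copy")
    with open(src, "w") as f:
        f.write("127.0.0.1 localhost\n")
    archive = make_archive([src], tmp)
    print(os.path.exists(archive))  # compressed archive, ready to push
```

The point is that none of this needs inventing; the agent is mostly glue around tools that already exist.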

Of course, I am mindful that not all nodes will have direct access to the storage backend, for many reasons. In this case, the agent could use the load balancer already fronting the server API to bounce traffic through to the storage. After all, this is just an HTTP stream we are talking about; even a fairly lightweight HAProxy instance will be able to handle hundreds of streams.
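As a sketch of what that relay might look like, here is a hypothetical HAProxy fragment that forwards agents' HTTP streams to a pair of Swift proxies. The names, ports, and addresses are all made up for illustration.

```
# Hypothetical HAProxy relay: agents without direct storage access
# send their HTTP streams here and get bounced to the object store.
frontend backup_storage
    bind *:8080
    mode http
    default_backend swift

backend swift
    mode http
    balance roundrobin
    server swift1 10.0.0.11:8080 check
    server swift2 10.0.0.12:8080 check
```

Since the same HAProxy tier already load-balances the rules API, this adds no new moving parts.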

So on paper, what this should give us is a backup solution that is:

All this seems obvious to me, so why has no-one done it?

Of course, a genuine backup product needs to do reporting and things like that. This is another role the central server could take on: it has a MongoDB cluster to store all that in. Or, that could be an "add-on" that just hooks into the same MongoDB (UNIX principle: do one job, and do it well).
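Reporting could be as simple as each agent posting a small document after every run, which the server (or the add-on) drops into MongoDB, e.g. with pymongo's `collection.insert_one(report)`. The field names below are illustrative, not a schema anyone has agreed on.

```python
# Sketch of the per-run report an agent could POST after each backup.
# The reporting service would insert it into MongoDB as-is.
import datetime

def build_report(hostname, archive_bytes, ok, error=None):
    """Build the per-run backup report the reporting service would store."""
    return {
        "hostname": hostname,
        "finished_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "bytes": archive_bytes,
        "ok": ok,
        "error": error,
    }

print(build_report("db-03", 104857600, True))
```

A schemaless store suits this nicely, since different backup tools will want to report different details.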