S3 Glacier Deep Archive for home media server long-term storage
Overview
An interesting challenge with managing a home media server is backups. The storage requirements are staggering: a single Blu-ray movie can take up to ~30GB for a full HD (1080p) rip, which quickly amounts to terabytes of storage. At those scales, traditional backup solutions get expensive.
The data, however, has two interesting characteristics:
- The primary source of the data isn't your server, but the disc itself
- The data is generally immutable
Those two characteristics make it possible to build a backup or archiving solution using S3 Glacier Deep Archive (S3 GDA).
S3 GDA has two important characteristics: it's cheap and it's durable. Its many 9s of durability far exceed anything a home user could manage in their spare time, and the cost is absurdly low. For my initial storage requirements, 7TB comes to just under $7 a month (at Deep Archive's roughly $1 per TB-month rate). It would take years at those prices to recoup the cost of buying individual hard drives to duplicate the data, let alone to achieve the level of durability S3 GDA promises.
Limitations
To keep costs that low, S3 GDA imposes two important restrictions:
- Data must be stored for at least 180 days (without modification)
- Data needs to be restored before it can be accessed
  - This restoration process has a latency measured in hours, if not days
  - Storage is cheap, retrieval is not, so retrieve only when truly needed
However, those same characteristics make this a perfect fit for our needs. If rapid restoration is required, the original discs serve as an immediate backup. Realistically, you don't need immediate access to the whole collection after a disaster: access to a select few pieces is fine while the rest is slowly restored from S3 GDA or re-ripped from disc. Even if the discs were lost along with the server, you could selectively choose which media to restore first.
Strategy
To achieve this, I wrote a set of bash scripts that automate the process of collecting files, grouping them together (tar), encrypting them client side (gpg), and uploading them (aws-cli). To avoid writing the large archive files to disk, each script accepts input from and forwards its output to the next step via a pipe. Because of the nature of cable internet connections, upload speeds are slow, so the process is bottlenecked by the network; the machine mostly sits idle, with small spikes in disk and CPU activity as more data is loaded and encrypted.
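As a rough illustration, a single run of this pipeline could look something like the sketch below. The bucket name, GPG recipient, and object key are placeholders made up for the example, not the actual scripts or values used here; the point is that tar, gpg, and the aws CLI can be chained so the archive never touches local disk.

```bash
#!/usr/bin/env bash
# Sketch of a tar -> gpg -> S3 pipeline; bucket, key, and recipient are
# hypothetical placeholders, not the real values used by these scripts.
set -euo pipefail

SOURCE_DIR="$1"                      # e.g. "TV/Some Show/Season 01"
OBJECT_KEY="$2"                      # obscured name, e.g. a random UUID
BUCKET="s3://example-media-archive"  # placeholder bucket
RECIPIENT="backup@example.com"       # placeholder GPG key

# Stream the tarball through gpg and straight into Deep Archive,
# without ever writing the full archive to local disk.
tar -cf - -C "$SOURCE_DIR" . \
  | gpg --encrypt --recipient "$RECIPIENT" \
  | aws s3 cp - "$BUCKET/$OBJECT_KEY" --storage-class DEEP_ARCHIVE
```

Note that for streams uploaded from stdin that exceed roughly 50GB, `aws s3 cp` also wants an `--expected-size` hint so it can size the multipart upload correctly.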
The end result of these scripts is two files being uploaded to S3:
- A manifest file, containing a short description of the archive and a list of the files inside it
- An archive file, grouping related media assets (a TV show season, movies from a certain year or decade)
Both files are encrypted client side and given deliberately obscure names, so nothing can be identified if someone gains access to the S3 bucket.
The manifest file, however, is not stored in S3 GDA but in S3 Standard-Infrequent Access, which keeps storage cheap while allowing fast retrieval. These manifest files are a couple hundred kilobytes at most.
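A manifest upload can follow the same piped pattern. The sketch below (again with placeholder names, and a manifest layout I've invented for illustration, since the exact format isn't shown here) captures a description and file listing, encrypts it, and sends it to Standard-IA instead of Deep Archive.

```bash
#!/usr/bin/env bash
# Sketch of building and uploading a manifest; names and format are placeholders.
set -euo pipefail

SOURCE_DIR="$1"
OBJECT_KEY="$2"                      # the obscured key of the matching archive
BUCKET="s3://example-media-archive"
RECIPIENT="backup@example.com"

# The manifest is just a short description followed by the file listing.
{
  echo "description: Some Show, Season 01"   # free-form, human readable
  echo "archive: ${OBJECT_KEY}"
  echo "files:"
  (cd "$SOURCE_DIR" && find . -type f | sort)
} \
  | gpg --encrypt --recipient "$RECIPIENT" \
  | aws s3 cp - "$BUCKET/${OBJECT_KEY}.manifest" --storage-class STANDARD_IA
```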
In the event of a restoration, one would simply download all the manifest files and decrypt them. If a specific file is needed, you can search the manifests for it and restore the right archive, all without needing any organizational structure or revealing information in the filenames.
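Restoring from Deep Archive is a two-step process: you first issue a restore request, then download the object once it becomes temporarily available. A hedged sketch of how that might look with the aws CLI, again with a placeholder bucket and key and the cheapest (Bulk) retrieval tier:

```bash
#!/usr/bin/env bash
# Sketch of restoring one archive from Deep Archive; bucket/key are placeholders.
set -euo pipefail

BUCKET="example-media-archive"
OBJECT_KEY="some-obscure-uuid"
RESTORE_DIR="restore"

# 1. Ask S3 to restore the object for a few days. Bulk is the cheapest tier
#    for Deep Archive, with a latency of up to ~48 hours.
aws s3api restore-object \
  --bucket "$BUCKET" \
  --key "$OBJECT_KEY" \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'

# 2. Poll until the restore completes (head-object reports the Restore status).
until aws s3api head-object --bucket "$BUCKET" --key "$OBJECT_KEY" \
        --query Restore --output text | grep -q 'ongoing-request="false"'; do
  sleep 3600
done

# 3. Download, decrypt, and unpack, piping just as the upload did.
mkdir -p "$RESTORE_DIR"
aws s3 cp "s3://$BUCKET/$OBJECT_KEY" - \
  | gpg --decrypt \
  | tar -xf - -C "$RESTORE_DIR"
```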
Open Questions
This has all been tested with only a small number of archives so far; as the upload process continues, there may be previously unknown issues or improper error handling to deal with.
The restoration step is the biggest question. I can manually test restoring and fetching individual objects from S3 GDA, but doing so over the whole corpus would be time consuming and expensive, and doing it early on might not match the experience of doing it after years of storage. How S3 GDA achieves those costs is opaque, and there may be timing variables outside of our control.