iconik Disaster Recovery Plan
Introduction
This plan describes the measures iconik Media AB is taking in order to keep iconik operational even in the event of a catastrophic failure. The plan outlines the overall approach we take to ensure that each component of the system can be restored. It does not go into exact detail on how each component is backed up and restored; that is covered in the internal system documentation.
Disaster analysis
Failure of a Kubernetes cluster
If a single Kubernetes cluster fails, we can redirect the traffic to another region. This will introduce additional latency for users but should otherwise be transparent. It will mean a rise in network egress costs if the region we are redirecting to is on a different continent, but this is preferable to system downtime. The DNS entry app.iconik.io uses a global load balancer in GKE and will automatically fail over if the geographically closest region is unavailable.
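As an illustration only, the following Python sketch probes the global entry point alongside a couple of regional endpoints; the per-region hostnames are hypothetical and exist only to show the kind of check that can confirm a failover has succeeded.

    import requests

    # Hypothetical endpoints, for illustration only; the real probes are not part
    # of this plan.
    ENDPOINTS = {
        "global": "https://app.iconik.io/",
        "region-a": "https://region-a.app.iconik.io/",  # hypothetical hostname
        "region-b": "https://region-b.app.iconik.io/",  # hypothetical hostname
    }

    def check_failover():
        """Confirm that the global entry point answers even if one region is down."""
        for name, url in ENDPOINTS.items():
            try:
                status = requests.get(url, timeout=5).status_code
            except requests.RequestException as exc:
                status = f"unreachable ({exc.__class__.__name__})"
            print(f"{name}: {status}")

    if __name__ == "__main__":
        check_failover()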
Failure of multiple GCE regions
If all clusters or regions fail at the same time, this will cause downtime, as automatic failover is impossible. In this case we will have to perform a full restore, which most likely means restoring the Cassandra database, Elasticsearch and the Kubernetes configuration from backups. How these backups are taken and restored is described below.
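A minimal sketch of the order in which a full restore proceeds, assuming hypothetical wrapper scripts for each component; the exact commands are covered in the internal system documentation.

    import subprocess

    # Placeholder wrapper scripts; the exact restore commands are documented internally.
    RESTORE_STEPS = [
        ("Restore Cassandra from the latest snapshot", ["./restore_cassandra.sh"]),
        ("Restore Elasticsearch from the latest snapshot", ["./restore_elasticsearch.sh"]),
        ("Re-apply the Kubernetes configuration from Git", ["./restore_kubernetes.sh"]),
    ]

    def full_restore():
        for description, command in RESTORE_STEPS:
            print(description)
            subprocess.run(command, check=True)  # abort the restore if a step fails

    if __name__ == "__main__":
        full_restore()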
Data corruption
Data corruption can and will happen in a large system. Cassandra has built-in protection against such corruption through replication between nodes. Some kinds of corruption are resolved automatically, but some have to be handled manually. If a database file becomes corrupt, Cassandra will warn about this and we can then schedule a repair from the other nodes which store the data. All data in our Cassandra cluster is mirrored to a minimum of two nodes in each region, and all data is replicated to all regions.
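A minimal sketch of scheduling such a repair with nodetool, assuming it is run on the affected node; the keyspace name is a placeholder.

    import subprocess

    def repair_keyspace(keyspace: str) -> None:
        """Rebuild corrupted data from the replicas held by the other nodes.

        The -pr flag repairs only this node's primary token ranges, which keeps the
        repair cheap when it is scheduled on each node in turn.
        """
        subprocess.run(["nodetool", "repair", "-pr", keyspace], check=True)

    if __name__ == "__main__":
        repair_keyspace("iconik")  # placeholder keyspace name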
Loss of access to GCE completely
In the highly unlikely event that we as a company lose access to GCE completely, we have a backup cluster set up in AWS which we can replicate all services to. This will bring iconik and its services back online. Depending on how much of GCE is down, we may not be able to bring the GCE-hosted storage buckets back online directly, but any customer-hosted buckets should be available immediately.
Backups
iconik Media AB maintains backups of all parts of the system under our control. This section describes how these backups are taken and where they are kept.
Cassandra
Cassandra is run as a distributed cluster with a complete replica of the data in each region, and automatic replication between the regions. This means that the cluster and its data are safe even in the event of the loss of a whole GCE region. We still need backups in order to protect the data in the event of a bug or operator error, or in the unlikely event that all GCE regions fail at once.
To do this we take nightly snapshots of the Cassandra database and store them in a multi-region Google Cloud Storage bucket. This bucket is also manually replicated to a second bucket in another region. This replication is done using a service account which only has read access to the primary backup bucket, and the credentials the primary backup script runs with do not have access to the secondary backup bucket. This way the backups are safe from an active attack on the system.
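A minimal sketch of the replication step, assuming the google-cloud-storage Python client and placeholder bucket names; the credentials it runs with can read the primary bucket and write to the secondary one, and nothing else.

    from google.cloud import storage

    # Placeholder bucket names.
    PRIMARY = "iconik-cassandra-backups"
    SECONDARY = "iconik-cassandra-backups-secondary"

    def replicate_backups():
        client = storage.Client()
        primary = client.bucket(PRIMARY)
        secondary = client.bucket(SECONDARY)
        already_copied = {blob.name for blob in client.list_blobs(SECONDARY)}
        for blob in client.list_blobs(PRIMARY):
            if blob.name not in already_copied:
                primary.copy_blob(blob, secondary)  # server-side copy, no local download

    if __name__ == "__main__":
        replicate_backups()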
Elasticsearch
We run a separate Elasticsearch cluster in each region. These clusters do not contain any primary data which cannot be reproduced from another source. However, recreating the clusters from the source data would lead to unacceptable downtime in the event of a failure, so we maintain backups of the Elasticsearch clusters as well. These backups are taken as Elasticsearch snapshots using the Elasticsearch GCS repository plugin and are stored in a multi-region Google Cloud Storage bucket. These backups are also replicated to a secondary backup bucket as described in the Cassandra section.
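A minimal sketch of triggering a snapshot through the Elasticsearch snapshot API, assuming a snapshot repository backed by the GCS bucket has already been registered; the cluster address and repository name are placeholders.

    import datetime
    import requests

    ES_URL = "http://localhost:9200"   # placeholder cluster address
    REPOSITORY = "gcs_backup"          # placeholder name of the registered GCS repository

    def take_snapshot():
        """Create a named snapshot and wait until Elasticsearch has finished it."""
        name = "nightly-" + datetime.date.today().isoformat()
        response = requests.put(
            f"{ES_URL}/_snapshot/{REPOSITORY}/{name}",
            params={"wait_for_completion": "true"},
            timeout=3600,
        )
        response.raise_for_status()
        print(response.json())

    if __name__ == "__main__":
        take_snapshot()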
Kubernetes configuration
Each region runs a GKE cluster which controls all the iconik microservices. The cluster configuration determines how many replicas of each service are run, how they communicate and similar things. The Kubernetes clusters are configured from files which are committed to our Git repository and deployed using our CI/CD system.
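A minimal sketch of re-applying the committed configuration to a rebuilt cluster, assuming kubectl is already pointed at that cluster; the manifest path is a placeholder.

    import subprocess

    MANIFEST_DIR = "k8s/"  # placeholder path inside the Git repository

    def apply_configuration():
        """Apply every committed manifest to the cluster kubectl currently points at."""
        # -R recurses into subdirectories so the whole configuration is applied in one pass.
        subprocess.run(["kubectl", "apply", "-R", "-f", MANIFEST_DIR], check=True)

    if __name__ == "__main__":
        apply_configuration()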
Kubernetes secrets
Kubernetes secrets store sensitive values such as passwords and access keys. They are designed to be kept separate from other configuration to minimize exposure. Because of this we do not back up secrets with the rest of the Kubernetes configuration. Instead we keep these values in our 1Password password vault, where only the relevant employees have access.
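A minimal sketch of recreating a single secret from values looked up in 1Password, with placeholder secret and key names; the real values are copied from the vault.

    import subprocess

    def recreate_secret(name: str, values: dict) -> None:
        """Recreate a Kubernetes secret from values copied out of 1Password."""
        command = ["kubectl", "create", "secret", "generic", name]
        for key, value in values.items():
            command.append(f"--from-literal={key}={value}")
        subprocess.run(command, check=True)

    if __name__ == "__main__":
        # Placeholder secret and key names.
        recreate_secret("example-service-credentials", {"api-key": "<value from 1Password>"})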
Google Cloud Storage buckets
Customer media which is not uploaded to customer-owned buckets ends up in the iconik-files, iconik-keyframes and iconik-proxies buckets. These buckets are configured as multi-region buckets so that they remain available even in the event of a failure of a single GCE region. In addition to this we also replicate all data to secondary buckets in a separate region on the same continent every night, and we keep these backups available for 30 days after files have been removed from the primary storage. Note: this backup scheme does not include customer-owned buckets. Customers are expected to back up these buckets themselves.
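A minimal sketch of the nightly replication and the 30-day retention window, assuming the google-cloud-storage Python client and a placeholder secondary bucket name; the real process is covered in the internal system documentation.

    import datetime
    from google.cloud import storage

    PRIMARY = "iconik-files"              # the same scheme covers keyframes and proxies
    SECONDARY = "iconik-files-secondary"  # placeholder name for the secondary bucket
    RETENTION = datetime.timedelta(days=30)

    def replicate_and_prune():
        client = storage.Client()
        primary = client.bucket(PRIMARY)
        secondary = client.bucket(SECONDARY)
        now = datetime.datetime.now(datetime.timezone.utc)

        primary_names = {blob.name for blob in client.list_blobs(PRIMARY)}
        secondary_blobs = list(client.list_blobs(SECONDARY))
        secondary_names = {blob.name for blob in secondary_blobs}

        # Copy anything that only exists in the primary bucket.
        for blob in client.list_blobs(PRIMARY):
            if blob.name not in secondary_names:
                primary.copy_blob(blob, secondary)

        # Mark copies whose source is gone, and delete them 30 days later.
        for blob in secondary_blobs:
            if blob.name in primary_names:
                continue
            missing_since = (blob.metadata or {}).get("missing_since")
            if missing_since is None:
                blob.metadata = {"missing_since": now.isoformat()}
                blob.patch()
            elif datetime.datetime.fromisoformat(missing_since) < now - RETENTION:
                blob.delete()

    if __name__ == "__main__":
        replicate_and_prune()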
Testing full system restore
A full system restore should be performed at least quarterly in order to verify that we can bring the system back online in the event of a complete failure. These tests keep the maximum time from a severe incident to restoration of the service as short as possible, and they also allow us to verify the integrity of our backup process.