Reading time ~7 minutes
Automated Github Backups with ECS and S3
- Security Considerations
- How Much Does this Cost?
- Show Me the Code
Over the past couple of weeks, I started thinking a bit more about adding resiliency to my personal projects and accounts. You can follow my entire thought process on Twitter (see the embedded Tweet below), but in this blog post I’m going to focus on Github.
After AWS, the second most critical service for my projects is Github: 90% of my code is stored there (mostly in private projects), and I have to admit I had never taken a backup of this data.
So I finally decided to set some time aside to set up an automated process to back up my Github account, and I ended up relying on ECS (Fargate) and S3 Glacier. This blog post explains the architecture and implications of the final setup I decided to go with.
In the past couple of weeks, I started thinking a bit more about adding resiliency into my personal projects/accounts. A thread 🧵— Marco Lancini (@lancinimarco) June 21, 2021
At a high level, this is what the final setup looks like:
- Backups of my Github account are taken via an ECS on Fargate Task Definition, with execution triggered periodically by a CloudWatch Event Rule, and secrets (i.e., the Github PAT) pulled from Parameter Store.
- The data fetched from Github is zipped and uploaded to an S3 bucket, where it will transition to Glacier after one day.
- Notifications are sent via SNS for every task starting and/or stopping, as well as for every new object created in the destination S3 bucket.
Let’s see what all of this means, and let’s analyse the different components in more detail.
Docker Image and Python Logic
Let’s start by talking about the Docker image, which hosts the actual application logic in charge of the backup.
The logic is based on python-github-backup, a Python script that can be used to back up an entire organization or repository (including issues and wikis in the most appropriate format), and which I’ve customized for my use case.
In particular, I’ve added the following:
- Fetch the Github Personal Access Token (PAT) and target user from environment variables.
- Zip the final output folder.
- Upload the Zip file to an S3 Bucket.
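The steps above can be sketched roughly as follows. This is an illustration of mine, not the actual code: the environment variable names (`GITHUB_USER`, `GITHUB_TOKEN`, `BACKUP_BUCKET`), the paths, and the exact `github-backup` flags are assumptions.

```python
import os
import shutil
import subprocess


def zip_backup(output_dir: str, archive_base: str) -> str:
    """Compress the backup folder into a single zip archive; returns its path."""
    return shutil.make_archive(archive_base, "zip", output_dir)


def run_backup(user: str, token: str, output_dir: str) -> None:
    """Invoke python-github-backup for all repos (flags are my assumptions)."""
    subprocess.run(
        ["github-backup", user, "--token", token, "--all", "--private",
         "--output-directory", output_dir],
        check=True,
    )


def upload_to_s3(archive_path: str, bucket: str, key: str) -> None:
    """Upload the archive to the destination bucket."""
    import boto3  # imported lazily so the pure helpers stay dependency-free
    boto3.client("s3").upload_file(archive_path, bucket, key)


# Only run the full pipeline when the (hypothetical) env vars are present.
if __name__ == "__main__" and os.environ.get("GITHUB_USER"):
    run_backup(os.environ["GITHUB_USER"], os.environ["GITHUB_TOKEN"], "/backup")
    archive = zip_backup("/backup", "/tmp/github-backup")
    upload_to_s3(archive, os.environ["BACKUP_BUCKET"], "github-backup.zip")
```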
This customized version of python-github-backup is then packaged as a Docker image, and stored in an ECR repository within one of my AWS accounts.
The image is automatically built and pushed to ECR via Github Actions.
Terraform and Infrastructure Setup
The rest of the components you can see in the “Architecture” diagram above are managed via Terraform. I ended up creating a module which can be used to create:
- An ECR repository in which to store the Docker image of the customised python-github-backup.
- A destination S3 bucket with a lifecycle policy which transitions objects to Glacier after 1 day.
- A Systems Manager Parameter Store parameter in which to store the Github PAT.
- An ECS cluster on Fargate, in the dedicated VPC.
- An ECS Task Definition, with execution triggered periodically (cron) by a CloudWatch Event Rule, and secrets pulled from Parameter Store.
- For notifications:
- A dedicated SNS Topic.
- A CloudWatch Event Rule to alert on every ECS Task starting (RUNNING) and/or stopping (STOPPED).
- An S3 Event Notification to alert on every new object created in the destination bucket.
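As a rough illustration of the lifecycle rule the module creates, this is what the equivalent configuration could look like when applied directly with boto3 (the rule ID and the bucket name are placeholders of mine, and the module itself does this in Terraform, not Python):

```python
def glacier_lifecycle_rule(days: int = 1) -> dict:
    """Lifecycle configuration transitioning all objects to Glacier after `days` days."""
    return {
        "Rules": [
            {
                "ID": "transition-to-glacier",  # placeholder rule ID
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(bucket: str, days: int = 1) -> None:
    import boto3  # lazy import: only needed when actually applying the rule
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=glacier_lifecycle_rule(days)
    )
```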
- Run the Terraform module above, which will set up all the necessary components.
- Create a Personal Access Token in Github, and assign the following scopes to it:
- Store the Github PAT (the Personal Access Token granting access to the org) in the Parameter Store.
- Build the custom Docker image and upload it to ECR. You could automate this via your CI/CD pipeline, or push it manually.
- Wait till the first day of the next month (or run a Task manually) to have your Github backup stored in S3!
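For the PAT step, storing the token as an encrypted parameter can be done with boto3 along these lines (a sketch of mine: the parameter name `/github-backup/pat` is just an example, not the name the module actually expects):

```python
def put_pat_kwargs(name: str, token: str) -> dict:
    """Build the put_parameter arguments for an encrypted (SecureString) PAT."""
    return {
        "Name": name,
        "Description": "Github Personal Access Token to grant access to the org",
        "Value": token,
        "Type": "SecureString",  # encrypted with the account's default KMS key
        "Overwrite": True,
    }


def store_pat(name: str, token: str) -> None:
    import boto3  # lazy import so the kwargs builder stays dependency-free
    boto3.client("ssm").put_parameter(**put_pat_kwargs(name, token))


import os
# Only store the parameter when the (hypothetical) env var is present.
if __name__ == "__main__" and os.environ.get("GITHUB_TOKEN"):
    store_pat("/github-backup/pat", os.environ["GITHUB_TOKEN"])
```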
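A manual build-and-push could be scripted roughly as follows (again a sketch of mine, wrapping the standard `aws ecr get-login-password` / `docker build` / `docker push` flow; the account ID, region, and repository name are placeholders):

```python
import subprocess


def ecr_push_commands(account: str, region: str, repo: str, tag: str = "latest"):
    """Return the shell commands needed to build and push an image to ECR."""
    registry = f"{account}.dkr.ecr.{region}.amazonaws.com"
    image = f"{registry}/{repo}:{tag}"
    return [
        # authenticate the local docker client against the private ECR registry
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {image} .",
        f"docker push {image}",
    ]


def push(account: str, region: str, repo: str) -> None:
    for cmd in ecr_push_commands(account, region, repo):
        subprocess.run(cmd, shell=True, check=True)
```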
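Finally, to trigger a backup outside the monthly schedule, the Fargate task can be run manually; a boto3 sketch might look like this (cluster, task definition, subnet, and security group values are placeholders of mine):

```python
def run_task_kwargs(cluster: str, task_def: str, subnets, security_groups) -> dict:
    """Build the ecs.run_task arguments for a one-off Fargate execution."""
    return {
        "cluster": cluster,
        "taskDefinition": task_def,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "ENABLED",  # assuming no NAT gateway in the VPC
            }
        },
    }


def run_backup_task() -> None:
    import boto3  # lazy import: only needed when actually launching the task
    boto3.client("ecs").run_task(
        **run_task_kwargs("github-backup", "github-backup:1",
                          ["subnet-xxx"], ["sg-xxx"])
    )
```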
Code related to all my personal projects is stored within a single monorepo, and all (well, the majority) of its dependencies are vendorised (I briefly touched on this in “My Blogging Stack”, but it will probably warrant another post on its own).
This setup is no exception: I initially reviewed python-github-backup, tailored it to my needs, and now Github Actions builds the Docker image from the custom copy within the monorepo.
At the same time, the Terraform module leverages two other external modules.
Although the public module I released on Github uses the upstream versions, the module I use internally refers to local vendorised copies of these modules.
This is where this solution could be improved, in my opinion.
For my use case, I decided to store the Github PAT in Parameter Store instead of Secrets Manager, mainly from a pricing point of view: Parameter Store does not incur additional charges for standard parameters. For me, this is a “good enough” tradeoff for now, but I understand Secrets Manager could be seen as a more reliable solution for storing the Github PAT.
For handling backups, I decided to have a dedicated AWS account.
Another improvement could involve setting up cross-account backups, via AWS Backup, to replicate the data stored in S3 into another account. This data, though, already exists in two places (the live data in Github, and the backup in S3), so it seems overkill for now.
How Much Does this Cost?
Since I’ve just deployed this solution, I don’t have enough historical data to show you exactly how much I spent on it.
What I can do, though, is use the AWS Pricing Calculator to give you an estimate:
| Service | Monthly Forecast ($) | First 12 months Forecast ($) |
| ------- | -------------------- | ---------------------------- |
As you can see, the biggest entry, as expected, will be storage: I expect to generate ~1GB each month, for a total of ~12GB concurrently stored in Glacier at steady state (since the retention period for each backup is 1 year).
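As a back-of-the-envelope check on the storage entry (the ~$0.004/GB-month Glacier storage price below is an assumption of mine; check current AWS pricing for your region):

```python
def monthly_storage_cost(gb_stored: float, price_per_gb: float = 0.004) -> float:
    """Estimate the monthly Glacier storage cost for a given amount of data."""
    return gb_stored * price_per_gb


# ~1GB generated per month, retained for 12 months -> ~12GB at steady state,
# i.e. roughly five cents per month of storage under the assumed price.
steady_state_gb = 1 * 12
cost = monthly_storage_cost(steady_state_gb)
```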
Show Me the Code
As briefly mentioned, both the custom Docker image and the Terraform module needed to recreate the different components of the architecture are available on Github:
- The code for recreating the Docker image, alongside the customised python script, can be found at: github.com/marco-lancini/utils/tree/main/docker/python-github-backup
- The Terraform module can be found at: github.com/marco-lancini/utils/tree/main/terraform/aws-github-backups
In this post I outlined the architecture and implications of an automated process for backing up a Github account, relying on ECS Fargate and S3 Glacier.
The next service I want to tackle, since it is where I store the majority of my personal data, is GDrive.
Expect another blog post (with code) hopefully soon. Update: I blogged about it at: “Automated GDrive Backups with ECS and S3”.
I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you find the information shared was useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.