
Automated GDrive Backups with ECS and S3

Over the past couple of weeks, I started thinking a bit more about adding resiliency to my personal projects and accounts.

In a previous post (“Automated Github Backups with ECS and S3”) I started by taking a look at how to back up my Github data. In this blog post, instead, I’m going to focus on GDrive, since it is where I store the majority of my personal data.

In fact, I finally decided to set some time aside to build an automated process to back up my GDrive account, and I ended up relying on ECS (Fargate) and S3 Glacier. This blog post explains the architecture and implications of the final setup I decided to go with.


Architecture

Similarly to what I described in “Automated Github Backups with ECS and S3” (this post will indeed have the same structure), this is what the final setup looks like:

  • Backups of my GDrive account are taken via an ECS (on Fargate) Task Definition, with execution triggered periodically by a CloudWatch Event Rule, and secrets (i.e., the OAuth Token) pulled from Parameter Store.
  • The data fetched from GDrive is zipped and uploaded to an S3 bucket, where it will transition to Glacier after one day.
  • The task uses an EFS volume as a temporary location to store files downloaded from GDrive before uploading them to S3. The task removes every file from the EFS volume upon completion.
  • Notifications are sent via SNS for every task starting and/or stopping, as well as for every new object created in the destination S3 bucket.
GDrive Backups with ECS - Architecture
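The one-day Glacier transition on the destination bucket (together with the 90-day retention discussed later in this post) can be expressed as an S3 lifecycle configuration. Below is a minimal sketch, assuming a placeholder bucket name:

```shell
# Lifecycle rules for the backup bucket: transition objects to
# Glacier after 1 day, and expire them after 90 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "gdrive-backups-to-glacier",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 90}
  }]
}
EOF

# Apply it with valid AWS credentials (bucket name is a placeholder):
#   aws s3api put-bucket-lifecycle-configuration \
#       --bucket my-gdrive-backups \
#       --lifecycle-configuration file://lifecycle.json
```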

Let’s see what all of this means, and let’s analyse the different components in more detail.

Docker Image and Backup Logic

Let’s start by talking about the Docker Image which hosts the actual application logic in charge of the backup.

The logic is based on a custom bash script which leverages rclone, a command line program designed to manage and sync files onto cloud storage.

The full script is available on Github, but it basically runs rclone with a custom config (more on this later) to first obtain a copy of the target GDrive folder, zip it, and then copy the zipped output to S3:

[... setup EFS...]

# Copy from GDrive
rclone --config ${RCLONE_CONFIG_PATH} --drive-acknowledge-abuse copy ${RCLONE_FOLDER_GDRIVE}/ ${OUTPUT_DIRECTORY}

[...zip folder...]

# Copy to S3
rclone --config ${RCLONE_CONFIG_PATH} copy $fname ${RCLONE_FOLDER_S3}${VAR_OUTPUT_S3}

[... cleanup EFS...]
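For reference, the zip step elided above could take a shape like the following sketch (the directory layout and the file-naming scheme are my assumptions; the actual logic is in the script on Github):

```shell
# Hypothetical shape of the zip step: archive the folder downloaded
# from GDrive with a date-stamped name, ready for the S3 copy.
OUTPUT_DIRECTORY="gdrive-data"
mkdir -p "${OUTPUT_DIRECTORY}"
echo "placeholder content" > "${OUTPUT_DIRECTORY}/example.txt"

fname="gdrive-backup-$(date +%Y-%m-%d).zip"
if command -v zip >/dev/null 2>&1; then
  zip -qr "${fname}" "${OUTPUT_DIRECTORY}"
else
  echo "zip not installed - would create ${fname}"
fi
```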

This script is then packaged as a Docker image, and stored in an ECR repository within one of my AWS accounts. The image is automatically built and pushed to ECR via Github Actions.

FROM ubuntu:20.04

RUN apt-get update && apt-get install -y curl unzip zip
# I know, I know...
RUN curl https://rclone.org/install.sh | bash

WORKDIR /src
COPY docker/rclone-gdrive-backup/rclone-run.sh /src

The code for recreating this Docker image, alongside the custom bash script, can be found on Github at:
github.com/marco-lancini/utils/tree/main/docker/rclone-gdrive-backup

Terraform and Infrastructure Setup

The rest of the components you can see in the “Architecture” diagram above are managed via Terraform. I ended up creating a module which can be used to create:

  • An ECR repository where to store the Docker image of the custom bash script.
  • An S3 bucket, with a lifecycle policy that transitions objects to Glacier.
  • A Systems Manager Parameter Store entry holding the secrets (i.e., the rclone config file with the OAuth Token).
  • An ECS Cluster, alongside the ECS Task Definition which runs the backup on Fargate.
  • An EFS file system, with a mount target in the same subnet used by the ECS Task.
  • For notifications:
    • A dedicated SNS Topic.
    • A CloudWatch Event Rule to alert on every ECS Task starting (RUNNING) and/or stopping (STOPPED).
    • An S3 Event Notification for every new object created in the destination bucket.

This Terraform module can be found on Github at:
github.com/marco-lancini/utils/tree/main/terraform/aws-gdrive-backups

OAuth Setup

With the infrastructure ready, the last component missing is a way for rclone to authenticate against the GSuite APIs. After some Googling, I came across a useful article which describes how to set this up using OAuth.

The process is composed of 3 parts, described next:

  1. Enable the Google Drive APIs.
  2. Generate OAuth credentials.
  3. Seed an rclone Config File.

Enable the Google Drive APIs

  1. Log in with your Google account at https://console.cloud.google.com.
  2. From the left sidebar, navigate to “APIs & Services > Library”.
  3. Search for and enable the “Google Drive API”.
  4. From the left sidebar, select “Credentials”, then “Configure Consent Screen”.
  5. Select “External” as User type, then “Create”.
  6. In the “OAuth Consent Screen”, choose an application name, and provide user support and developer contact information.
  7. Select auth/drive.readonly (“See and download all your Google Drive files”) as the scope for the application.
  8. In the “Test users” screen, add the Gmail address associated with your GDrive account.

Generate OAuth credentials

  1. From the left sidebar, navigate to “Credentials”, then “Create credentials” and select “OAuth client ID”.
  2. Choose “Desktop app” as application type, and pick a name.
  3. A client ID & secret will be generated.

Seed an rclone Config File

The last step involves creating a configuration file that can be used by rclone to authenticate against the GSuite APIs and get authorized to retrieve content from GDrive.

First of all, create a template file like the following, replacing client_id and client_secret with the ones generated previously:

[gdrive]
type = drive
client_id = <OAUTH-KEY>.apps.googleusercontent.com
client_secret = <OAUTH-SECRET>
scope = drive.readonly

[s3]
type = s3
provider = AWS
env_auth = true
region = eu-west-1
acl = private
storage_class = STANDARD
no_check_bucket = true

In the config above, the [gdrive] section will be used by rclone to authenticate against GSuite (notice the drive.readonly scope), whereas the [s3] section is used to configure access to S3 (where the final zip will be uploaded). Authentication/authorization to S3 is handled separately by an IAM policy attached to the role used by the ECS Task.
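Before baking the file into the container, it can be sanity-checked by listing the remotes it defines: rclone's listremotes subcommand only parses the config, so no credentials are required. A quick sketch (the config below is trimmed to its non-secret fields):

```shell
# Minimal copy of the two-remote config above (secrets omitted).
cat > rclone.conf <<'EOF'
[gdrive]
type = drive
scope = drive.readonly

[s3]
type = s3
provider = AWS
env_auth = true
EOF

# Both branches print the two remote names, "gdrive:" and "s3:".
if command -v rclone >/dev/null 2>&1; then
  rclone --config rclone.conf listremotes
else
  sed -n 's/^\[\(.*\)\]$/\1:/p' rclone.conf
fi
```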

Next, run the custom Docker image to (re-)authenticate to GDrive, and follow the process. As a result, this step will add a token entry in the [gdrive] section of the config file:

root@<container>:/src# rclone --config /src/rclone.conf config reconnect gdrive:
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes (default)
n) No
y/n> n
Please go to the following link: https://accounts.google.com/o/oauth2/auth?access_type=offline&client_id=...
Log in and authorize rclone for access
Enter verification code> <redacted>
Configure this as a Shared Drive (Team Drive)?
y) Yes
n) No (default)
y/n> n
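Once the token is in place, the enriched config has to make its way back into Parameter Store, where the ECS task pulls it from at run time. A possible sketch with the AWS CLI (the parameter name and description match the ones used in the “Usage” section):

```shell
# Upload the token-enriched rclone config as a SecureString parameter,
# encrypted with the default KMS key.
PARAM_NAME="GDRIVE_RCLONE_CONFIG"
CONFIG_FILE="/src/rclone.conf"

# Requires valid AWS credentials:
#   aws ssm put-parameter \
#       --name "${PARAM_NAME}" \
#       --description "rclone config file for GDrive backups" \
#       --type SecureString \
#       --value "file://${CONFIG_FILE}" \
#       --overwrite
echo "parameter to store: ${PARAM_NAME} (from ${CONFIG_FILE})"
```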


Usage

  • Run the Terraform module above, which will set up all the necessary components.
  • Create OAuth credentials for accessing GDrive, as outlined in the section above.
  • Store the rclone config file in the Parameter Store:
    • Name: GDRIVE_RCLONE_CONFIG
    • Description: rclone config file for GDrive backups
    • Type: SecureString
    • KMS: default
  • Build the custom Docker image and upload it to ECR. You could automate this via your CI/CD pipeline, or push it manually with a script similar to the one below:
#!/bin/bash

AWS_ACCOUNT_ID="XXXXX"
AWS_REGION="XXXXX"
IMAGE_NAME="rclone-gdrive-backup"
IMAGE_VERSION="latest"
ECR_REPO="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE_NAME}"

# AUTHENTICATE TO ECR
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

# BUILD IMAGE
docker build -t ${IMAGE_NAME} .

# TAG IMAGE
docker tag ${IMAGE_NAME}:${IMAGE_VERSION} ${ECR_REPO}:${IMAGE_VERSION}

# PUSH IMAGE
docker push ${ECR_REPO}:${IMAGE_VERSION}
  • Wait until the first day of the next month (or run a Task manually) to have your GDrive backup stored in S3!
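To trigger a backup outside of the schedule, the task can also be launched by hand. A sketch with the AWS CLI, where the cluster, task definition, subnet, and security group identifiers are all placeholders:

```shell
# Launch the backup task once on Fargate, outside the CloudWatch schedule.
CLUSTER="gdrive-backups"
TASK_DEF="rclone-gdrive-backup"
SUBNET_ID="subnet-XXXXX"
SG_ID="sg-XXXXX"

# Requires valid AWS credentials:
#   aws ecs run-task \
#       --cluster "${CLUSTER}" \
#       --launch-type FARGATE \
#       --task-definition "${TASK_DEF}" \
#       --network-configuration "awsvpcConfiguration={subnets=[${SUBNET_ID}],securityGroups=[${SG_ID}],assignPublicIp=ENABLED}"
echo "task to run: ${TASK_DEF} on cluster ${CLUSTER}"
```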

Security Considerations

Those of you who have already read “Automated Github Backups with ECS and S3” might notice this section is quite similar to what I described there. I decided to post those conclusions here as well, though, to keep this post self-contained.

Dependencies

Code related to all my personal projects is stored within a single monorepo, and all (well, the majority) of the dependencies are vendorised (I briefly touched on this in “My Blogging Stack”, but it will probably warrant a post of its own).

The Terraform module described in this post leverages 2 other external modules: umotif-public/ecs-fargate/aws and umotif-public/ecs-fargate-scheduled-task/aws. Although the public module I released on Github uses the upstream versions, the module I use internally refers to local vendorised copies of these modules.

Secrets Management

This is where this solution could be improved, in my opinion.

For my use case, I decided to store the rclone config file in Parameter Store instead of Secrets Manager mainly from a pricing point of view, with Parameter Store not incurring additional charges for Standard parameters.

For me, this is a “good enough” tradeoff for now, but I understand Secrets Manager could be seen as a more reliable solution for storing secrets.

Storage Reliability

For handling backups, I decided to have a dedicated AWS account.

Another improvement could involve setting up cross-account backups, via AWS Backup, to replicate the data stored in S3 into another account. This data, though, already exists in two places (the live data in GDrive, and the backup in S3), so it seems overkill for now.

Two other options worth looking into could be S3 Object Lock and Glacier Vault Lock.
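As an illustration of the first option, a default Object Lock retention matching the 90-day backup window could be configured as follows. This assumes Object Lock was enabled on the destination bucket at creation time, and the bucket name is a placeholder:

```shell
# Default 90-day COMPLIANCE retention for the backup bucket, which
# prevents backups from being deleted or overwritten before expiry.
cat > object-lock.json <<'EOF'
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {"Mode": "COMPLIANCE", "Days": 90}
  }
}
EOF

# Requires valid AWS credentials:
#   aws s3api put-object-lock-configuration \
#       --bucket my-gdrive-backups \
#       --object-lock-configuration file://object-lock.json
```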


How Much Does this Cost?

Since I’ve just deployed this solution, I don’t have enough historical data to show you exactly how much I spent on it.

What I can do, though, is to use the AWS Pricing Calculator to give you an estimate:

Service           Monthly Forecast ($)   First 12 Months Forecast ($)
S3 Glacier        0.60                   7.20
ECR               0.0098                 0.12
Parameter Store   0                      0
CloudWatch        0                      0
Total             0.6098                 7.32

As you can see, the biggest entry, as expected, will be storage: I expect to generate ~50GB each month, for a total of ~150GB concurrently stored in Glacier at steady state (since the retention period for each backup is 90 days).
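The Glacier line item can be sanity-checked with some quick arithmetic, assuming a Glacier price of roughly $0.004 per GB-month in eu-west-1:

```shell
# Back-of-the-envelope check of the Glacier estimate:
# ~50GB generated per month with a 90-day retention means
# ~150GB concurrently stored at steady state.
GB_PER_MONTH=50
RETENTION_MONTHS=3
PRICE_PER_GB_MONTH=0.004   # approximate Glacier price in eu-west-1

STORED_GB=$((GB_PER_MONTH * RETENTION_MONTHS))
MONTHLY_COST=$(awk -v gb="${STORED_GB}" -v p="${PRICE_PER_GB_MONTH}" \
    'BEGIN { printf "%.2f", gb * p }')

echo "${STORED_GB}GB stored => ~\$${MONTHLY_COST}/month"
# -> 150GB stored => ~$0.60/month, in line with the estimate above
```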


Show Me the Code

As briefly mentioned, both the custom Docker image and the Terraform module needed to recreate the different components of the architecture are available on Github:

  • github.com/marco-lancini/utils/tree/main/docker/rclone-gdrive-backup
  • github.com/marco-lancini/utils/tree/main/terraform/aws-gdrive-backups


Conclusions

In this post I outlined the architecture and implications of an automated process to back up a GDrive account, relying on ECS Fargate and S3 Glacier.

I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you found the information useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.

Subscribe to CloudSecList

If you found this article interesting, you can join thousands of security professionals getting curated security-related news focused on the cloud native landscape by subscribing to CloudSecList.com.

Marco Lancini
Hi, I'm Marco Lancini. I'm a Security Engineer, mainly interested in cloud native technologies and security...