Deploying Scalable APIs With Terraform and GitHub Actions

Scalability is a fundamental concept to consider when building efficient cloud solutions. Along with security, it is highlighted in AWS Well-Architected Framework pillars like Cost Optimisation and Reliability. I explored both concepts in the third project I built as part of the #DevOpsAllStarsChallenge.

In this post, you will learn how to deploy a scalable public-facing API using AWS services like API Gateway and Elastic Container Service (ECS). In a way, this project enhances the first one from the challenge by rearchitecting the infrastructure for security and adding extra tooling:

  • Terraform to provision/manage the infrastructure, and

  • GitHub Actions for automated CD workflows.

Project Summary

I created a containerised API management system for querying weather data: a web API built with Flask. It exposes a /weather endpoint that takes a city query parameter (defaulting to Manchester if the parameter is not provided).

Main Concepts Covered

  • API management with API Gateway and Application Load Balancers for security and scalability

  • Public and private subnetting architecture for enhanced resource protection

  • Infrastructure-as-Code with Terraform

  • Terraform state management with backend blocks (S3 type)

  • Continuous Deployment with GitHub Actions

Read more about the project on GitHub

Architecture

This architecture would also work without the API Gateway, since the load balancer is internet-facing (in a public subnet) and therefore has a publicly accessible DNS name. However, I wanted to try API Gateway, and it was a good opportunity to learn about restricting access to load balancers through security groups.

Repo Structure

Infrastructure Configuration

I used Terraform to provision the entire architecture, including creating the ECS service (which needs an image already pushed to ECR for its task definition). This is covered further below.

All Terraform files can be found here.

Step 1: AWS Environment Setup

The following resources are set up within the aws_environment module:

1) Custom VPC with public and private subnets: this was easy to set up using the Terraform AWS VPC module. You configure the module and Terraform handles the creation of all the underlying resources without you needing to write them explicitly.

I chose 10.16.0.0/16 for the VPC (i.e., 65,536 available IP addresses from 10.16.0.0 to 10.16.255.255) and /24 subnets (10.16.12.0/24, 10.16.24.0/24, 10.16.36.0/24 and 10.16.48.0/24, with 256 IP addresses each); see the sketch after this list.

2) Access management: IAM roles, policies to attach to the roles, and security groups (one allowing traffic into the load balancer only from the API gateway, and one allowing traffic into the ECS service only from the load balancer).

3) ECR repository for the container images.
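For reference, a minimal sketch of that VPC module configuration might look like the following. The module version, name, availability zones and which CIDRs end up public vs private are assumptions on my part; the address ranges are the ones above.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "weather-api-vpc" # hypothetical name
  cidr = "10.16.0.0/16"

  azs             = ["eu-west-2a", "eu-west-2b"]
  public_subnets  = ["10.16.12.0/24", "10.16.24.0/24"]
  private_subnets = ["10.16.36.0/24", "10.16.48.0/24"]

  # A NAT gateway lets tasks in the private subnets reach ECR and other AWS APIs
  enable_nat_gateway = true
  single_nat_gateway = true
}

The module then creates the route tables, internet gateway and NAT gateway for you.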

Step 2: ECS Setup

Found in modules/ecs_setup:

1) Application Load Balancer, including target group and listener;

2) ECS: the cluster, the task definition for the service, and the service itself (see the sketch below).
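As a rough, hedged sketch of how these pieces hang together: all names, ports and counts below are assumptions, and the security group, IAM role and ECR references stand in for the resources created in Step 1.

resource "aws_lb" "api" {
  name               = "weather-api-alb"
  load_balancer_type = "application"
  internal           = false
  subnets            = module.vpc.public_subnets
  security_groups    = [aws_security_group.alb.id] # SG that only admits API Gateway traffic
}

resource "aws_lb_target_group" "api" {
  name        = "weather-api-tg"
  port        = 5000 # assumed Flask container port
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip" # required for awsvpc/Fargate tasks

  health_check {
    path = "/health"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.api.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

resource "aws_ecs_cluster" "main" {
  name = "weather-api-cluster"
}

resource "aws_ecs_task_definition" "api" {
  family                   = "weather-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_execution.arn # assumed role from Step 1

  container_definitions = jsonencode([{
    name         = "weather-api"
    image        = "${aws_ecr_repository.api.repository_url}:latest" # image pushed by the build script
    portMappings = [{ containerPort = 5000, protocol = "tcp" }]
  }])
}

resource "aws_ecs_service" "api" {
  name            = "weather-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.ecs.id] # SG that only admits ALB traffic
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "weather-api"
    container_port   = 5000
  }
}

The important wiring is the load_balancer block on the service, which registers the tasks with the target group that the listener forwards to.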

Step 3: API Gateway

Found in modules/api_gateway_setup:

1) All required configurations to create the gateway and integrate it with the load balancer (sketched after this list);

2) An update to the load balancer's security group so that it only allows incoming traffic from the API Gateway IP ranges.
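A hedged sketch of what that integration might look like with a REST API; the resource names, stage name and method below are assumptions, and the /weather path is simply proxied through to the load balancer's DNS name.

resource "aws_api_gateway_rest_api" "weather" {
  name = "weather-api-gateway"
}

resource "aws_api_gateway_resource" "weather" {
  rest_api_id = aws_api_gateway_rest_api.weather.id
  parent_id   = aws_api_gateway_rest_api.weather.root_resource_id
  path_part   = "weather"
}

resource "aws_api_gateway_method" "weather_get" {
  rest_api_id   = aws_api_gateway_rest_api.weather.id
  resource_id   = aws_api_gateway_resource.weather.id
  http_method   = "GET"
  authorization = "NONE"
}

resource "aws_api_gateway_integration" "weather_get" {
  rest_api_id             = aws_api_gateway_rest_api.weather.id
  resource_id             = aws_api_gateway_resource.weather.id
  http_method             = aws_api_gateway_method.weather_get.http_method
  type                    = "HTTP_PROXY"
  integration_http_method = "GET"
  uri                     = "http://${aws_lb.api.dns_name}/weather" # the ALB from Step 2
}

resource "aws_api_gateway_deployment" "weather" {
  rest_api_id = aws_api_gateway_rest_api.weather.id
  depends_on  = [aws_api_gateway_integration.weather_get]
}

resource "aws_api_gateway_stage" "prod" {
  rest_api_id   = aws_api_gateway_rest_api.weather.id
  deployment_id = aws_api_gateway_deployment.weather.id
  stage_name    = "prod"
}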

Continuous Deployment

I used GitHub Actions to automate the infrastructure deployment and management process. First, I outlined the jobs I wanted to run and the steps for each of them. This was continuously refined as I worked through the resources to be deployed.

Workflow Structure

I also experimented with managing deployments using GHA's environments. This was a way of adding an extra level of authorisation to the Terraform runs, by configuring the environment with deployment protection rules.

In this instance, I added myself as a required reviewer, so every run of the workflow pauses until the aws_environment_setup job is approved, because that job uses the configured environment. Once approved, all jobs that run terraform apply do so with the -auto-approve option specified in the command.

Debugging Issues

1) Terraform Syntax errors

If you are new to provisioning these resources with Terraform, you will rely a lot on the documentation; it takes a lot of time and effort to ensure you properly configure what you need.

2) Resources With Dependencies

I realised that I needed to split the provisioning into stages, e.g. the ECR repository has to exist before I can run the script that builds and pushes the API's Docker image to it, and that image is then referenced in the ECS service's task definition.

As a result, one job runs a partial terraform apply with the -target option scoped to the aws_environment module, the next job runs the image build-and-push script, and a final job runs a full terraform apply to provision the rest of the infrastructure.

3) Remote Terraform State Management

This was one of the issues I spent quite some time on. Terraform uses state files to keep track of the infrastructure it has provisioned and to detect new additions to the config files. However, this is different when running Terraform via GitHub Actions, because the state files created on the runner are ephemeral and are not persisted anywhere.

I learnt about using the backend block to control where Terraform stores its state files, but this was quite tricky because of the split provisioning I needed to do. Eventually, I was able to initialise the backend correctly (I used the S3 type, so the state file is stored in an S3 bucket):

terraform {
  backend "s3" {
    bucket = "devops-challenge-tf-state-files"
    key    = "files/terraform.tfstate"
    region = "eu-west-2"
  }
}

4) Gateway URL returning error messages

I encountered a network/endpoint error during one of the deployments, after I changed the cidr_block on my load balancer's security group to only accept traffic from the public subnets (with the intention of only allowing traffic from the gateway).

However, API Gateway is a fully managed service that uses AWS-managed IPs, i.e., you do not configure them, so they are not within your custom VPC/subnets. It took another look through my code to realise I had never specified my VPC/subnets in the gateway configuration, so I could not assume the gateway was inside my VPC.

  • The fix: use the aws_ip_ranges data source instead, which returns the published IP ranges of AWS services (see the sketch below). The downside, however, is that it caused failures when running terraform destroy; I had to retrieve the IP addresses manually (from the workflow logs) to destroy my setup.
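A minimal sketch of that approach, assuming the ALB security group from Step 1 and port 80 on the listener:

data "aws_ip_ranges" "api_gateway" {
  regions  = ["eu-west-2"]
  services = ["api_gateway"] # the published API Gateway ranges for the region
}

resource "aws_security_group_rule" "alb_from_api_gateway" {
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = data.aws_ip_ranges.api_gateway.cidr_blocks
  security_group_id = aws_security_group.alb.id # the ALB security group from Step 1
}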

I also encountered ‘Missing Authentication Token’ errors when trying to access the gateway’s base URL or /health for load balancer health checks.

This was because API Gateway needs a resource, method, and integration configured for every path you want it to serve, including the base URL.
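For example, a hedged sketch of giving the gateway's base URL its own method and integration (building on the REST API from the earlier sketch; names are assumptions). Without something like this, requests to the root path have nothing to match and API Gateway returns 'Missing Authentication Token'.

resource "aws_api_gateway_method" "root_get" {
  rest_api_id   = aws_api_gateway_rest_api.weather.id
  resource_id   = aws_api_gateway_rest_api.weather.root_resource_id # methods can attach to the root resource directly
  http_method   = "GET"
  authorization = "NONE"
}

resource "aws_api_gateway_integration" "root_get" {
  rest_api_id             = aws_api_gateway_rest_api.weather.id
  resource_id             = aws_api_gateway_rest_api.weather.root_resource_id
  http_method             = aws_api_gateway_method.root_get.http_method
  type                    = "HTTP_PROXY"
  integration_http_method = "GET"
  uri                     = "http://${aws_lb.api.dns_name}/"
}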


Conclusions

This was a comprehensive project that took many hours and a lot of debugging; I picked up new knowledge by building it incrementally and going back to improve it further.

I still need to read more on:

  • how to configure an API gateway to route requests from its root URL to the load balancer (and the service target group);

  • how to retrieve the IP ranges for API Gateway to restrict access to the load balancer;

  • how to configure health checks correctly.

I’d also like to explore these enhancements that use ElastiCache and DynamoDB.

More videos have been published for the challenge but, thankfully, they are IaC versions of existing projects, so I’m not lagging behind much.

Till next time ✨