The last thing I wrote about yesterday was making our little web service a bit more secure with SSL, and we are about ready to start deploying an Elixir app to AWS. Before that, though, I wanted to satisfy my curiosity by provisioning the instance with cloud-init rather than relying on the somewhat frowned-upon remote-exec.

In this example, remote-exec is basically used to install nginx using apt-get via ssh. Using cloud-config (part of cloud-init) we can get the server to do this itself on boot. This example file is pretty useful in figuring things out.
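For context, the remote-exec version looks roughly like this. It’s a sketch rather than my exact config: the connection details (user, key path) are placeholders for an Ubuntu AMI.

resource "aws_instance" "web" {
  # ... ami, instance_type, key_name, etc.

  provisioner "remote-exec" {
    # install and start nginx over ssh once the instance is reachable
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo service nginx start",
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"              # placeholder login user
      private_key = file("~/.ssh/id_rsa") # placeholder key path
      host        = self.public_ip
    }
  }
}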

The first thing was to create a YAML file, which I called cloud_config.yaml and put in the same directory as the Terraform. The contents were:

#cloud-config
package_update: true
package_upgrade: true

packages:
    - nginx

runcmd:
    - service nginx start

The first line, #cloud-config, looks like a comment but is treated as a directive and is essential.

package_update runs apt-get update to make sure the package list is up to date, and package_upgrade runs apt-get upgrade to make sure everything already installed is up to date. Next, under packages we make sure that nginx is installed, and runcmd starts nginx once the packages are in place.

We add this to our instance resource "aws_instance" "web" with the attribute

  user_data = file("cloud_config.yaml")
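So instead of the remote-exec provisioner, the instance resource ends up looking something like this (the AMI, instance type and subnet here are placeholders standing in for whatever is already there):

resource "aws_instance" "web" {
  ami           = var.ami_id             # placeholder
  instance_type = "t2.micro"             # placeholder
  subnet_id     = aws_subnet.default.id  # placeholder

  user_data = file("cloud_config.yaml")
}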

There’s also the option to use Terraform templates, but whatevs.
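For completeness, a templated version could look something like this; the .tpl file and the packages variable are made up for the sake of the example:

  # cloud_config.yaml.tpl would interpolate ${...} / %{...} expressions,
  # e.g. rendering the packages list instead of hard-coding nginx
  user_data = templatefile("cloud_config.yaml.tpl", {
    packages = ["nginx"]
  })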

And that’s it: we can run terraform apply. Except it didn’t work when I did that 🙀: browsing to the ELB’s address gave a blank page rather than the Nginx welcome page. Clicking around the console showed that the ELB considered the instance “Out of service”: it had failed its health checks.

Provisioning with remote-exec means that Terraform waits for it to complete before setting up the ELB. With cloud-init the ELB is set up before nginx is up, so the health check can fail. The default health check configuration is to check every 30 seconds, to mark an instance unhealthy after 2 failed checks, and to then require 10 successful checks before it is considered healthy again. My mental arithmetic makes that 10 checks × 30 seconds, so at least 5 minutes until a failed instance will be considered healthy again. I changed the timings in the Terraform configuration of resource "aws_elb" "web":

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "HTTP:80/"
    interval            = 5
  }
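With those settings an instance that comes up a little late should be marked healthy within roughly 2 × 5 = 10 seconds of nginx starting, rather than 5 minutes.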

I also changed the protocol (with the target attribute) to HTTP from TCP. This is almost certainly less optimal, but it was good for debugging: by ssh’ing onto the box I could tail -f /var/log/nginx/access.log to see the heartbeats.

No heartbeats. Instance still out of service. No dice.

It turns out that, for no reason I can fathom, the ELB lost the ability to route to the instance’s subnet when the instance was configured this way. There are two solutions to this (found through trial, error, and hints on Stack Overflow):

  • Make the ELB and the instance share the same subnet (sketched below).
  • Make an explicit local egress rule from the ELB to the VPC in the ELB’s security group.
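The first option is just a case of pointing both resources at the same subnet, roughly like this (the subnet resource name is made up):

# hypothetical shared subnet; the real resource names will differ
resource "aws_instance" "web" {
  # ...
  subnet_id = aws_subnet.web.id
}

resource "aws_elb" "web" {
  # ...
  subnets = [aws_subnet.web.id]
}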

I settled on the latter. In resource "aws_security_group" "elb":

  egress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"] # the VPC's CIDR block
  }

The Terraform for all of this is now here.