EC2 instance provisioning with cloud-init
The last thing I wrote about yesterday was making our little web service a bit more secure with SSL, and we are about ready to start deploying an Elixir app to AWS. Before that, though, I wanted to satisfy my curiosity by provisioning the instance using cloud-init rather than relying on the somewhat frowned-upon remote-exec.
In this example, remote-exec is basically used to install nginx with apt-get via ssh. Using cloud-config (part of cloud-init) we can get the server to do this itself on boot. This example file is pretty useful in figuring things out.
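For comparison, a remote-exec setup along those lines might look something like the sketch below. This is a hedged approximation rather than the example's exact code; the AMI id, login user and key path are assumptions.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # hypothetical Ubuntu AMI id
  instance_type = "t2.micro"              # assumed size

  # Install and start nginx over ssh once the instance is reachable
  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo service nginx start",
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"              # assumed login user for an Ubuntu AMI
      private_key = file("~/.ssh/id_rsa") # hypothetical key path
      host        = self.public_ip
    }
  }
}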
The first thing was to create a YAML file, which I called cloud_config.yaml and put in the same directory as the Terraform. The contents were
#cloud-config
package_update: true
package_upgrade: true
packages:
  - nginx
runcmd:
  - service nginx start
The comment #cloud-config is treated as a directive and is essential. package_update runs apt-get update to make sure the package list is up to date, and package_upgrade runs apt-get upgrade to make sure everything is up to date. Next, under packages we make sure that nginx is installed, and runcmd starts nginx.
We add this to our instance resource "aws_instance" "web" with the attribute
user_data = file("cloud_config.yaml")
There’s also the option to use Terraform templates, but whatevs.
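For context, a minimal sketch of how that attribute sits in the instance resource could look like this; the AMI id, instance type and subnet reference are placeholders rather than the values from this project:
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # hypothetical Ubuntu AMI id
  instance_type = "t2.micro"              # assumed size
  subnet_id     = aws_subnet.public.id    # assumed subnet name

  # Hand the cloud-config file to cloud-init on first boot
  user_data = file("cloud_config.yaml")
}
If the file ever needed variables interpolating, templatefile("cloud_config.yaml.tpl", { ... }) would be the templated route.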
And that’s it, and we can run terraform apply. Except it didn’t work when I did that 🙀: browsing to the ELB’s address gave a blank page rather than the Nginx welcome page. Clicking around the console showed that the ELB considered the instance “Out of service”. It had failed the health checks.
Provisioning with remote-exec means that Terraform waits for it to complete before setting up the ELB. With cloud-init the ELB is set up before nginx is up, so the health check can fail. The default health check configuration is to check every 30 seconds, to consider the instance unhealthy after 2 failed checks, and to then need 10 successful checks to be considered healthy. My mental arithmetic makes that at least 5 minutes (10 checks × 30 seconds) until a failed instance will be considered healthy again. I changed the timings in the Terraform configuration of resource "aws_elb" "web":
health_check {
  healthy_threshold   = 2
  unhealthy_threshold = 2
  timeout             = 3
  target              = "HTTP:80/"
  interval            = 5
}
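For orientation, that block sits inside the ELB resource roughly like this. The name, subnet reference and listener details here are assumptions about the rest of the configuration, not a copy of it:
resource "aws_elb" "web" {
  name            = "web-elb"              # assumed name
  subnets         = [aws_subnet.public.id] # assumed subnet reference
  security_groups = [aws_security_group.elb.id]
  instances       = [aws_instance.web.id]

  # Forward plain HTTP from the load balancer to the instance
  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "HTTP:80/"
    interval            = 5
  }
}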
I also changed the protocol (with the target attribute) to http from tcp. This is almost certainly suboptimal, but it was good for debugging: by ssh’ing onto the box I could then tail -f /var/log/nginx/access.log to see the heartbeats.
No heartbeats. Instance still out of service. No dice.
It turns out that, for no reason I can fathom, the ELB lost the ability to route to the subnet of the instance when the instance was configured in this way. There are two solutions to this (found through trial, error, and hints on Stack Overflow):
- Make the ELB and the instance share the same subnet (sketched just after this list).
- Make an explicit local egress rule from the ELB to the VPC in the ELB’s security group.
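For the first option, the change amounts to pointing both resources at the same subnet, roughly like this (assuming a subnet declared as aws_subnet.public; only the relevant attributes are shown):
resource "aws_instance" "web" {
  # ... other attributes as before ...
  subnet_id = aws_subnet.public.id
}

resource "aws_elb" "web" {
  # ... other attributes as before ...
  subnets = [aws_subnet.public.id]
}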
I settled on the latter. In resource "aws_security_group" "elb":
egress {
  from_port   = 80
  to_port     = 80
  protocol    = "tcp"
  cidr_blocks = ["10.0.0.0/16"]
}
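Zoomed out, the ELB’s security group ends up looking something like this; the name, VPC reference and ingress rule are assumptions about the rest of the group, and only the egress block is the actual change:
resource "aws_security_group" "elb" {
  name   = "elb"           # assumed name
  vpc_id = aws_vpc.main.id # assumed VPC reference

  # Accept web traffic from anywhere (assumed pre-existing rule)
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # The fix: let the ELB reach port 80 on instances inside the VPC
  egress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }
}
The egress rule is what lets the ELB’s health checks actually reach the instance inside the VPC.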
The Terraform for all of this is now here.