Infrastructure as Code: Best Practices for 2024
Building reliable and maintainable infrastructure
Introduction
Infrastructure as Code (IaC) has become the cornerstone of modern DevOps practices. By treating infrastructure configuration as code, teams can achieve better reliability, consistency, and maintainability in their deployments. This guide explores the latest best practices and tools for implementing IaC in 2024.
Why Infrastructure as Code Matters
Traditional Infrastructure Challenges
- Manual Configuration Drift: Systems diverge from their intended state over time
- Lack of Reproducibility: Difficulty recreating environments consistently
- No Version Control: Infrastructure changes aren’t tracked or reviewed
- Poor Documentation: Tribal knowledge and outdated documentation
- Scaling Difficulties: Manual processes don’t scale with growth
IaC Benefits
- Consistency: Identical environments across development, staging, and production
- Version Control: Track all infrastructure changes with Git
- Automation: Reduce manual errors and deployment time
- Cost Management: Better resource optimization and cleanup
- Disaster Recovery: Quick environment recreation from code
Core Principles of IaC
1. Declarative Configuration
Define the desired end state rather than the steps to achieve it:
# Terraform example - Declarative
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1d0"
  instance_type = "t3.micro"

  tags = {
    Name        = "web-server"
    Environment = "production"
  }
}
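To make the declarative idea concrete, here is a minimal Python sketch of how a tool might compare a desired state with the actual state and derive the actions itself. The `plan` function and the resource dictionaries are hypothetical illustrations, not how Terraform is implemented.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, name) pairs that would move `actual` toward `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Anything that exists but is no longer declared gets removed.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical states: the author only declares `desired`.
desired = {"web": {"instance_type": "t3.micro", "environment": "production"}}
actual = {
    "web": {"instance_type": "t3.small", "environment": "production"},
    "old-worker": {"instance_type": "t3.micro"},
}

print(plan(desired, actual))
```

The author never writes "resize web, then delete old-worker"; those steps fall out of the comparison.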
2. Idempotency
Operations should produce the same result regardless of how many times they’re executed:
# Ansible example - Idempotent
- name: Ensure nginx is installed
  package:
    name: nginx
    state: present
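The same property can be sketched in a few lines of Python. The `ensure_installed` helper below is a hypothetical stand-in for a configuration task: it checks the current state first and only acts when something is missing, so a second run changes nothing.

```python
from typing import Set


def ensure_installed(installed: Set[str], package: str) -> bool:
    """Make sure `package` is in `installed`; return True only if a change was made."""
    if package in installed:
        return False  # already in the desired state, nothing to do
    installed.add(package)
    return True


installed_packages: Set[str] = set()
first_run = ensure_installed(installed_packages, "nginx")
second_run = ensure_installed(installed_packages, "nginx")

print(first_run, second_run, installed_packages)
```

Reporting whether a change happened, rather than just acting, is what lets a tool show "changed" versus "ok" on repeated runs.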
3. Immutable Infrastructure
Replace rather than modify existing infrastructure:
# Build new images instead of modifying running containers
FROM nginx:alpine
COPY ./config/nginx.conf /etc/nginx/nginx.conf
COPY ./static /usr/share/nginx/html
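One way to keep images immutable is to derive the image tag from the content that goes into the image, so any configuration change produces a new tag instead of overwriting an existing one. The sketch below assumes a hypothetical `image_tag` helper and in-memory file contents; it is an illustration of the idea, not a feature of Docker itself.

```python
import hashlib
from typing import Dict


def image_tag(files: Dict[str, bytes]) -> str:
    """Derive a short, deterministic tag from file paths and contents."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()[:12]


# Hypothetical file contents for two builds.
v1 = image_tag({"config/nginx.conf": b"worker_processes 1;\n"})
v2 = image_tag({"config/nginx.conf": b"worker_processes 2;\n"})

print(v1 != v2)
```

Because the tag changes whenever the content changes, a rollout always replaces containers with a new image rather than patching the running ones.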
Modern IaC Tools and Technologies
Terraform: The Industry Standard
Terraform has established itself as the de facto standard for infrastructure provisioning:
Key Features
- Multi-cloud support: AWS, Azure, GCP, and thousands of other providers in the Terraform Registry
- State management: Tracks infrastructure state and changes
- Plan and apply workflow: Preview changes before execution
- Module system: Reusable infrastructure components
Terraform Best Practices
# Use variables for configuration
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

# Use data sources for existing resources
data "aws_availability_zones" "available" {
  state = "available"
}

# Create reusable modules
module "vpc" {
  source = "./modules/vpc"

  cidr_block         = var.vpc_cidr
  availability_zones = data.aws_availability_zones.available.names
  environment        = var.environment
}
Pulumi: Modern IaC with Real Programming Languages
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Assumption: the current stack name (for example "dev") doubles as the environment label
const environment = pulumi.getStack();

// Create VPC with TypeScript
const vpc = new aws.ec2.Vpc("main", {
  cidrBlock: "10.0.0.0/16",
  enableDnsHostnames: true,
  enableDnsSupport: true,
  tags: {
    Name: `${environment}-vpc`,
  },
});

// Type-safe configuration
interface DatabaseConfig {
  instanceClass: string;
  allocatedStorage: number;
  engine: string;
}

const dbConfig: DatabaseConfig = {
  instanceClass: "db.t3.micro",
  allocatedStorage: 20,
  engine: "postgres",
};
AWS CDK: Cloud-Native IaC
from aws_cdk import (
    Stack,
    aws_ec2 as ec2,
    aws_ecs as ecs,
    aws_ecs_patterns as ecs_patterns,
)
from constructs import Construct


class WebServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Create VPC
        vpc = ec2.Vpc(self, "VPC", max_azs=2)

        # Create ECS cluster
        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

        # Create Fargate service
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "Service",
            cluster=cluster,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("nginx"),
                container_port=80,
            ),
            public_load_balancer=True,
        )
State Management Best Practices
Remote State Storage
Never store Terraform state locally in production:
# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
State Locking
Prevent concurrent modifications:
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
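The locking behaviour itself is simple to picture: a lock is a row keyed by `LockID`, and a writer may only proceed if it managed to create that row. The Python sketch below uses a hypothetical in-memory `LockTable` to stand in for the table; the lock ID and owner names are made up for illustration.

```python
from typing import Dict, Optional


class LockTable:
    """In-memory stand-in for a lock table keyed by LockID."""

    def __init__(self) -> None:
        self._locks: Dict[str, str] = {}

    def acquire(self, lock_id: str, owner: str) -> bool:
        # Conditional write: succeed only if no entry with this LockID exists.
        if lock_id in self._locks:
            return False
        self._locks[lock_id] = owner
        return True

    def release(self, lock_id: str, owner: str) -> bool:
        # Only the current holder may release the lock.
        if self._locks.get(lock_id) != owner:
            return False
        del self._locks[lock_id]
        return True

    def holder(self, lock_id: str) -> Optional[str]:
        return self._locks.get(lock_id)


table = LockTable()
LOCK_ID = "infrastructure/terraform.tfstate"  # hypothetical lock ID

alice_got_lock = table.acquire(LOCK_ID, "alice")
bob_got_lock = table.acquire(LOCK_ID, "bob")  # rejected while alice holds it

print(alice_got_lock, bob_got_lock, table.holder(LOCK_ID))
```

A real backend needs the check-and-write to be atomic on the server side, which is what a conditional write on the table provides; this sketch is single-process only.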
Security Best Practices
Secrets Management
Never hardcode secrets in IaC:
# Use AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "rds-password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # ... other configuration
}
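A lightweight guard against hardcoded secrets is a pre-commit or CI check that scans configuration files for literal values assigned to sensitive-looking attributes. The following Python sketch is a rough heuristic under stated assumptions: the attribute names in the pattern and the `find_hardcoded_secrets` helper are illustrative choices, and dedicated secret scanners cover far more cases.

```python
import re
from typing import List

# Matches lines such as: password = "literal"
# Skips references (no quotes) and interpolations (contain "$").
SECRET_ASSIGNMENT = re.compile(
    r'^\s*(password|secret|token|api_key)\s*=\s*"[^"$]+"\s*$',
    re.IGNORECASE,
)


def find_hardcoded_secrets(hcl_text: str) -> List[int]:
    """Return 1-based line numbers that look like hardcoded secrets."""
    return [
        number
        for number, line in enumerate(hcl_text.splitlines(), start=1)
        if SECRET_ASSIGNMENT.match(line)
    ]


# Hypothetical configuration with one offending line.
sample = "\n".join([
    'resource "aws_db_instance" "bad" {',
    '  password = "hunter2"',
    "}",
    'resource "aws_db_instance" "good" {',
    "  password = data.aws_secretsmanager_secret_version.db_password.secret_string",
    "}",
])

print(find_hardcoded_secrets(sample))
```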
Least Privilege Access
# IAM policy with minimal permissions
resource "aws_iam_role_policy" "lambda_policy" {
  name = "lambda-policy"
  role = aws_iam_role.lambda_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}
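Least privilege is easier to keep when a check flags overly broad policies before they merge. As a minimal sketch, the hypothetical `wildcard_actions` helper below walks an IAM policy document (as a Python dictionary) and reports any allowed action that is `*` or ends in `:*`. It deliberately ignores resources and conditions, which a fuller review would also cover.

```python
from typing import Any, Dict, List


def wildcard_actions(policy: Dict[str, Any]) -> List[str]:
    """Return allowed actions that grant a whole service or everything."""
    findings: List[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        findings.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return findings


logging_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:*:*:*",
        }
    ],
}
# Hypothetical over-broad policy for comparison.
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

print(wildcard_actions(logging_policy), wildcard_actions(broad_policy))
```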
Network Security
# Security group with minimal access
resource "aws_security_group" "web" {
  name_prefix = "web-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "web-security-group"
  }
}
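A rule like "only HTTPS may be open to the internet" can also be checked mechanically. The sketch below assumes ingress rules have already been extracted into dictionaries with the same field names as the Terraform block; the `risky_ingress` helper and the allowed-port set are hypothetical.

```python
from typing import Dict, FrozenSet, List

WORLD_CIDRS = frozenset({"0.0.0.0/0", "::/0"})


def risky_ingress(
    rules: List[Dict],
    allowed_ports: FrozenSet[int] = frozenset({443}),
) -> List[Dict]:
    """Return rules that expose a non-allowed port to the whole internet."""
    risky: List[Dict] = []
    for rule in rules:
        open_to_world = any(cidr in WORLD_CIDRS for cidr in rule["cidr_blocks"])
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if open_to_world and any(port not in allowed_ports for port in ports):
            risky.append(rule)
    return risky


https_rule = {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]}
# Hypothetical rules for comparison.
public_ssh = {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}
internal_ssh = {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]}

print(risky_ingress([https_rule, public_ssh, internal_ssh]))
```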
Testing Infrastructure Code
Terratest for Terraform
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestTerraformWebModule(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/web",
		Vars: map[string]interface{}{
			"environment": "test",
		},
	}

	// Clean up resources at the end of the test, even if it fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Verify outputs
	instanceId := terraform.Output(t, terraformOptions, "instance_id")
	assert.NotEmpty(t, instanceId)
}
Kitchen-Terraform
# .kitchen.yml
driver:
  name: terraform

provisioner:
  name: terraform

verifier:
  name: terraform
  color: true

platforms:
  - name: aws

suites:
  - name: default
    driver:
      variables:
        environment: test
    verifier:
      controls:
        - instance_created
        - security_group_configured
Environment Management
Multi-Environment Strategy
environments/
├── dev/
│   ├── main.tf
│   ├── variables.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   ├── variables.tf
│   └── terraform.tfvars
└── prod/
    ├── main.tf
    ├── variables.tf
    └── terraform.tfvars
modules/
├── networking/
├── compute/
└── database/
Environment-Specific Configuration
# environments/prod/terraform.tfvars
environment = "prod"
instance_type = "t3.large"
min_size = 3
max_size = 10
enable_backup = true
# environments/dev/terraform.tfvars
environment = "dev"
instance_type = "t3.micro"
min_size = 1
max_size = 2
enable_backup = false
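With one variables file per environment, it is easy for a key to be added in one place and forgotten in another. As a rough sketch, the Python below parses simple `key = value` lines and reports keys missing from an environment. The `parse_tfvars` helper is a hypothetical simplification that handles only flat assignments, not the full HCL syntax.

```python
from typing import Dict, List


def parse_tfvars(text: str) -> Dict[str, str]:
    """Parse flat `key = value` lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_keys(reference: Dict[str, str], other: Dict[str, str]) -> List[str]:
    """Keys present in `reference` but absent from `other`."""
    return sorted(set(reference) - set(other))


PROD_TFVARS = """
environment = "prod"
instance_type = "t3.large"
min_size = 3
max_size = 10
enable_backup = true
"""

DEV_TFVARS = """
environment = "dev"
instance_type = "t3.micro"
min_size = 1
max_size = 2
enable_backup = false
"""

prod = parse_tfvars(PROD_TFVARS)
dev = parse_tfvars(DEV_TFVARS)

print(missing_keys(prod, dev))
```

For the two files shown above the check passes with an empty list; it starts reporting names as soon as one environment falls behind.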
CI/CD Integration
GitLab CI Pipeline
# .gitlab-ci.yml
stages:
  - validate
  - plan
  - deploy

variables:
  TF_ROOT: ${CI_PROJECT_DIR}/terraform

before_script:
  - cd ${TF_ROOT}
  - terraform --version
  - terraform init

validate:
  stage: validate
  script:
    - terraform validate
    - terraform fmt -check

# The plan job also runs on main so that the deploy job receives plan.tfplan
plan:
  stage: plan
  script:
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - ${TF_ROOT}/plan.tfplan

deploy:
  stage: deploy
  script:
    - terraform apply plan.tfplan
  when: manual
  only:
    - main
GitHub Actions
# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        run: terraform init

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -no-color
        if: github.event_name == 'pull_request'

      - name: Terraform Apply
        run: terraform apply -auto-approve
        if: github.ref == 'refs/heads/main'
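Automatic applies are safer with a gate in front of them. One option is to export a saved plan with `terraform show -json plan.tfplan` and have the pipeline refuse to continue when the plan would delete or replace anything. The Python sketch below reads the `resource_changes` list from that JSON; the `destructive_changes` helper and the sample plan are hypothetical, and a real gate would likely allow reviewed exceptions.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]


# Trimmed, hypothetical plan output with one create and one replacement.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["create"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]
})

blocked = destructive_changes(sample_plan)
print(blocked)
```

A CI step could exit with a non-zero status whenever the returned list is not empty, which leaves destructive changes for a manual, reviewed apply.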
Monitoring and Observability
Infrastructure Drift Detection
# CloudWatch alarm for drift detection
resource "aws_cloudwatch_metric_alarm" "config_compliance" {
  alarm_name          = "config-compliance-alarm"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "ComplianceByConfigRule"
  namespace           = "AWS/Config"
  period              = "300"
  statistic           = "Average"
  threshold           = "1"
  alarm_description   = "This metric monitors config compliance"

  dimensions = {
    ConfigRuleName = aws_config_config_rule.required_tags.name
  }
}
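Drift can also be detected from the Terraform side with a scheduled `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The Python sketch below wraps that convention; the function names and the injectable `runner` parameter are assumptions made so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable

PLAN_COMMAND = ["terraform", "plan", "-detailed-exitcode", "-input=false"]


def run_plan(workdir: str) -> int:
    """Run the plan in `workdir` and return its exit code."""
    return subprocess.run(PLAN_COMMAND, cwd=workdir, check=False).returncode


def drift_status(exit_code: int) -> str:
    """Map the -detailed-exitcode convention to a readable status."""
    if exit_code == 0:
        return "in-sync"
    if exit_code == 2:
        return "drift"
    return "error"


def check_drift(workdir: str, runner: Callable[[str], int] = run_plan) -> str:
    return drift_status(runner(workdir))


# Simulated run: pretend the plan reported pending changes.
status = check_drift("terraform/networking", runner=lambda _workdir: 2)
print(status)
```

Running this on a schedule and alerting on "drift" catches manual changes made outside the pipeline.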
Cost Monitoring
resource "aws_budgets_budget" "infrastructure" {
  name         = "infrastructure-budget"
  budget_type  = "COST"
  limit_amount = "100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # cost_filter blocks replace the deprecated cost_filters map
  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["alerts@company.com"]
  }
}
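The numbers in that budget translate into a simple rule: with a 100 USD limit and an 80 percent threshold, a notification fires once actual spend goes above 80 USD. A small Python sketch of the arithmetic, using hypothetical helper names:

```python
def alert_threshold_amount(limit_amount: float, threshold_percent: float) -> float:
    """Convert a percentage threshold into an absolute amount."""
    return limit_amount * threshold_percent / 100


def should_notify(actual_spend: float, limit_amount: float, threshold_percent: float) -> bool:
    """Mirror a GREATER_THAN comparison against the percentage threshold."""
    return actual_spend > alert_threshold_amount(limit_amount, threshold_percent)


threshold = alert_threshold_amount(100, 80)
print(threshold, should_notify(79.50, 100, 80), should_notify(80.01, 100, 80))
```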
Common Pitfalls and Solutions
1. Resource Dependencies
Problem: Resources are created in the wrong order
Solution: Use explicit dependencies when Terraform cannot infer the order from references between resources
resource "aws_instance" "web" {
  depends_on = [aws_security_group.web]
  # configuration
}
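Under the hood, ordering resources is a topological sort over a dependency graph: each resource is created only after everything it depends on. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+). The graph is hypothetical, and `aws_vpc.main` is an invented resource added to make the chain longer.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of resources it depends on.
dependencies = {
    "aws_instance.web": {"aws_security_group.web"},
    "aws_security_group.web": {"aws_vpc.main"},  # hypothetical VPC resource
    "aws_vpc.main": set(),
}

# static_order() yields every node after all of its dependencies.
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

A cycle in the graph raises `graphlib.CycleError`, which is the same class of problem Terraform reports as a dependency cycle.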
2. Hardcoded Values
Problem: Environment-specific values in code
Solution: Use variables and data sources
# Bad
cidr_block = "10.0.0.0/16"
# Good
cidr_block = var.vpc_cidr
3. Large State Files
Problem: Monolithic infrastructure state
Solution: Split into multiple states
terraform/
├── networking/
├── compute/
├── database/
└── monitoring/
Advanced Patterns
Module Composition
module "web_tier" {
  source = "./modules/web-tier"

  vpc_id          = module.networking.vpc_id
  private_subnets = module.networking.private_subnets
  security_groups = [module.security.web_sg_id]

  instance_type = var.web_instance_type
  min_size      = var.web_min_size
  max_size      = var.web_max_size
}

module "database" {
  source = "./modules/rds"

  vpc_id           = module.networking.vpc_id
  database_subnets = module.networking.database_subnets
  security_groups  = [module.security.db_sg_id]

  instance_class    = var.db_instance_class
  allocated_storage = var.db_allocated_storage
}
Conditional Resources
resource "aws_cloudwatch_log_group" "app_logs" {
  count = var.enable_logging ? 1 : 0

  name              = "/aws/lambda/${var.function_name}"
  retention_in_days = var.log_retention_days
}
Future of Infrastructure as Code
Emerging Trends
- AI-Assisted Infrastructure: Tools like GitHub Copilot for IaC
- Policy as Code: Open Policy Agent (OPA) integration
- GitOps: Argo CD and Flux for Kubernetes
- Infrastructure from Code: Generate IaC from application code
Next-Generation Tools
- Crossplane: Kubernetes-native infrastructure management
- Nitric: Application-focused infrastructure
- Wing: Cloud-oriented programming language
Conclusion
Infrastructure as Code has evolved from a best practice to a necessity in modern software development. By following the principles and practices outlined in this guide, teams can build more reliable, secure, and maintainable infrastructure.
Key takeaways for 2024:
- Choose the right tool for your team and use case
- Implement proper state management from the beginning
- Security should be built-in, not bolted on
- Test your infrastructure code like application code
- Monitor for drift and compliance continuously
- Use CI/CD pipelines for automated deployments
The future of IaC continues to evolve with better tooling, improved cloud provider APIs, and new paradigms like Infrastructure from Code. Stay current with these trends to maintain a competitive advantage in cloud infrastructure management.
Published on 2024-01-10 by Jesús Pérez