ALB Log Troubleshooting with Local ELK

Investigating AWS Application Load Balancer 504 spikes by ingesting ALB access logs from S3 into a local Elasticsearch / Kibana stack for fast correlation.

Problem

AWS ALB access logs were already being written to S3, but the incident needed more than a single-line search. A batch of 504s was hurting availability, and I needed a quick way to correlate request details, client IP, target response time, and request path across many files.

Why local ELK

For this troubleshooting runbook, a local ELK stack delivered fast iteration with no production environment changes. It let me:

Stack setup

I used Docker Compose with the OSS 7.10.2 images of Elasticsearch, Kibana, and Logstash. The logs were downloaded from the ALB S3 bucket into a local logs/ directory, which is mounted read-only into the Logstash container.

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
    container_name: alb-elasticsearch
    environment:
      - cluster.name=alb-local
      - node.name=alb-elasticsearch
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s' >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 30s

  kibana:
    image: docker.elastic.co/kibana/kibana-oss:7.10.2
    container_name: alb-kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      elasticsearch:
        condition: service_healthy
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS http://localhost:5601/api/status >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 60s

  logstash:
    image: docker.elastic.co/logstash/logstash-oss:7.10.2
    container_name: alb-logstash
    volumes:
      - ./logs:/logs:ro
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    environment:
      - LS_JAVA_OPTS=-Xms1g -Xmx1g
    depends_on:
      elasticsearch:
        condition: service_healthy
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS http://localhost:9600/_node/pipelines >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 30s

volumes:
  elasticsearch-data:

Directory layout

Here is the local structure for the troubleshooting stack and pipeline files:

alb-logs/
  docker-compose.yml
  logstash/
    logstash.conf
    logstash.yml
  logs/
    000111222333_elasticloadbalancing_eu-west-1_app.example-internal-alb.0123456789abcdef_20240115T0000Z_10.0.0.10_example.log

This shows the compose entrypoint, the Logstash pipeline and config files, and the raw ALB log directory that the parser consumes.
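
One practical detail: ALB delivers access logs to S3 as gzip-compressed .log.gz objects, while the pipeline only picks up plain .log files. The following is a minimal sketch of a decompression step, assuming the downloaded .log.gz files sit directly in logs/. The helper name gunzip_logs is hypothetical and not part of the original setup.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_logs(log_dir: str) -> list[Path]:
    """Decompress every *.log.gz in log_dir to a sibling *.log file.

    Existing .log files are left alone, so the helper can be re-run
    after each refresh from S3. Returns the files written by this call.
    """
    written = []
    for gz_path in sorted(Path(log_dir).glob("*.log.gz")):
        out_path = gz_path.with_suffix("")  # strip only the trailing ".gz"
        if out_path.exists():
            continue
        # Stream the copy so large log files are never held in memory.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Calling gunzip_logs("logs") from the alb-logs/ directory before starting the stack should leave one .log file per downloaded archive. The compressed originals are kept, which makes it easy to re-create a clean log set.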

Parsing ALB logs

The Logstash pipeline reads each file from /logs/*.log, parses the ALB line format with grok, converts timestamps, and enriches the request URL. This is the exact pipeline used:

input {
  file {
    path => "/logs/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/logstash-completed.log"
    codec => plain {
      charset => "UTF-8"
    }
  }
}

filter {
  if [path] {
    mutate {
      rename => {
        "path" => "log_file"
      }
    }
  }

  grok {
    match => {
      "message" => '%{WORD:alb_type} %{TIMESTAMP_ISO8601:log_time} %{NOTSPACE:elb} %{NOTSPACE:client} %{NOTSPACE:target} (?:%{NUMBER:request_processing_time:float}|-) (?:%{NUMBER:target_processing_time:float}|-) (?:%{NUMBER:response_processing_time:float}|-) (?:%{INT:elb_status_code:int}|-) (?:%{INT:target_status_code:int}|-) %{INT:received_bytes:int} %{INT:sent_bytes:int} "%{DATA:request}" "%{DATA:user_agent}" %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol} %{NOTSPACE:target_group_arn} "%{DATA:trace_id}" "%{DATA:domain_name}" "%{DATA:chosen_cert_arn}" %{NOTSPACE:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} "%{DATA:actions_executed}" "%{DATA:redirect_url}" "%{DATA:error_reason}" "%{DATA:target_port_list}" "%{DATA:target_status_code_list}" "%{DATA:classification}" "%{DATA:classification_reason}" %{NOTSPACE:conn_trace_id}(?: "%{DATA:transformed_host}" "%{DATA:transformed_uri}" "%{DATA:request_transform_status}")?'
    }
    tag_on_failure => ["_alb_grok_failure"]
  }

  date {
    match => [ "log_time", "ISO8601" ]
    target => "@timestamp"
  }

  grok {
    match => {
      "client" => "%{IP:client_ip}:%{INT:client_port:int}"
    }
    tag_on_failure => []
  }

  if [target] and [target] != "-" {
    grok {
      match => {
        "target" => "%{IP:target_ip}:%{INT:target_port:int}"
      }
      tag_on_failure => []
    }
  }

  grok {
    match => {
      "request" => "%{WORD:http_method} %{NOTSPACE:url} HTTP/%{NUMBER:http_version}"
    }
    tag_on_failure => ["_request_parse_failure"]
  }

  ruby {
    code => '
      require "uri"

      url = event.get("url")

      if url
        begin
          parsed = URI.parse(url)

          event.set("url_scheme", parsed.scheme) if parsed.scheme
          event.set("url_host", parsed.host) if parsed.host
          event.set("url_port", parsed.port) if parsed.port
          event.set("url_path", parsed.path) if parsed.path
          event.set("url_query", parsed.query) if parsed.query
        rescue
          path = url.split("?")[0]
          query = url.include?("?") ? url.split("?", 2)[1] : nil

          event.set("url_path", path)
          event.set("url_query", query) if query
        end
      end
    '
  }

  mutate {
    convert => {
      "matched_rule_priority" => "integer"
    }

    add_field => {
      "log_type" => "aws_alb"
    }

    remove_field => [
      "@version"
    ]
  }
}

output {
  elasticsearch {
    hosts => [ "http://elasticsearch:9200" ]
    index => "alb-logs-%{+YYYY.MM.dd}"
  }

  stdout {
    codec => dots
  }
}

Log example

These are actual ALB log lines from the local sample dataset.

http 2024-01-15T23:55:00.393927Z app/example-internal-alb/0123456789abcdef 10.0.1.11:58547 10.0.2.21:80 0.000 0.902 0.004 502 502 8407 72260 "PUT http://www.example.com:80/fulfillment/api/v1/shipments/ORD-4046-9727/tracking?requestId=req-0001-2212 HTTP/1.1" "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000002-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000002-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 502 2024-01-15T23:54:59.996927Z "forward" "-" "-" "10.0.2.21:80" "502" "-" "-" TID_00000000000000000000000000000002 "-" "-" "-"
http 2024-01-15T23:55:00.913378Z app/example-internal-alb/0123456789abcdef 10.0.1.11:57878 10.0.2.25:80 0.004 57.632 0.000 504 - 632 639 "DELETE http://www.example.com:80/reports/api/v1/monthly-usage HTTP/1.1" "curl/8.4.0" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000013-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000013-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 504 2024-01-15T23:55:00.277378Z "forward" "-" "-" "10.0.2.25:80" "-" "-" "-" TID_00000000000000000000000000000013 "-" "-" "-"
http 2024-01-15T23:55:02.351454Z app/example-internal-alb/0123456789abcdef 10.0.1.10:60847 10.0.2.21:80 0.004 54.119 0.000 504 - 634 533 "PUT http://www.example.com:80/telemetry/collect/browser-events?requestId=req-0025-7716 HTTP/1.1" "Mozilla/5.0 (compatible; ExampleDocsBot/1.0; +https://docs.example.com/bot)" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000026-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000026-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 504 2024-01-15T23:55:02.023454Z "forward" "-" "-" "10.0.2.21:80" "-" "-" "-" TID_00000000000000000000000000000026 "-" "-" "-"

The last two lines are the 504s under investigation. Together with the 502 above them, they are exactly the raw ALB entries the parser is built to handle.
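
To sanity-check a single line without starting Logstash, a small standalone parser can help. The sketch below is not the pipeline itself: it uses Python's shlex to split the quoted fields and applies the same field order as the grok pattern. The names ALB_FIELDS and parse_alb_line are hypothetical, and shlex follows POSIX quoting rules, so lines with unusual escaping may split differently than grok does.

```python
import shlex
from urllib.parse import urlsplit

# Field order follows the grok pattern in the Logstash pipeline.
ALB_FIELDS = [
    "alb_type", "log_time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url",
    "error_reason", "target_port_list", "target_status_code_list",
    "classification", "classification_reason", "conn_trace_id",
]


def parse_alb_line(line: str) -> dict:
    """Split one ALB access log line into named string fields.

    Extra trailing tokens (the optional transform fields) are ignored,
    and "-" placeholders are stored as None.
    """
    tokens = shlex.split(line)
    if len(tokens) < len(ALB_FIELDS):
        raise ValueError(
            f"expected at least {len(ALB_FIELDS)} fields, got {len(tokens)}"
        )
    record = {
        name: (None if token == "-" else token)
        for name, token in zip(ALB_FIELDS, tokens)
    }

    # Mirror the pipeline's request split: "<method> <url> HTTP/<version>".
    if record["request"]:
        parts = record["request"].split(" ")
        if len(parts) == 3:
            record["http_method"], url, _ = parts
            record["url_path"] = urlsplit(url).path
    return record
```

Feeding it the first 504 line above should give elb_status_code "504", no target_status_code, and the url_path /reports/api/v1/monthly-usage. All values stay strings here; the type conversion is left to the Logstash pipeline.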

Key fields I extracted included:

Logstash runtime config

This is the exact alb-logs/logstash/logstash.yml that controls Logstash worker and batching behavior.

http.host: "0.0.0.0"
pipeline.workers: 4
pipeline.batch.size: 1000
pipeline.batch.delay: 50
log.level: info

Finding the 504 pattern

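With the pipeline in place, Kibana Discover makes it easy to filter on 504s and inspect the surrounding context. The grok pattern only sets target_status_code when the target actually returned a status, so 504s logged with "-" (as in the sample above) have no such field. A useful starting query for those is:

elb_status_code:504 AND NOT target_status_code:*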

From there I could drill into:

Results

The local stack turned a pile of S3 files into searchable incident data. I was able to see the patterns that mattered, identify the service endpoints with the highest error share, and confirm that the issue was upstream of the ALB rather than in the front-door configuration.
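
The per-endpoint view can also be pulled straight from Elasticsearch instead of clicking through Kibana. This is a rough sketch, assuming the index naming and field names from the pipeline and Elasticsearch's default dynamic mapping (which adds a .keyword sub-field for strings such as url_path). The function names build_504_by_path_query and run_query are hypothetical.

```python
import json
import urllib.request


def build_504_by_path_query(top_n: int = 10) -> dict:
    """Search body: count 504 responses per url_path, largest buckets first."""
    return {
        "size": 0,  # aggregation only, no individual hits
        "query": {"term": {"elb_status_code": 504}},
        "aggs": {
            "by_path": {
                # .keyword assumes the default dynamic mapping for strings
                "terms": {"field": "url_path.keyword", "size": top_n},
                "aggs": {
                    "avg_target_time": {
                        "avg": {"field": "target_processing_time"}
                    }
                },
            }
        },
    }


def run_query(body: dict, base_url: str = "http://localhost:9200") -> dict:
    """POST the body to the alb-logs-* indices and return the decoded response."""
    req = urllib.request.Request(
        f"{base_url}/alb-logs-*/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the stack running, run_query(build_504_by_path_query()) should return one bucket per path under aggregations.by_path.buckets. These are raw 504 counts; turning them into an error share would need a second query for total requests per path.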

How to reuse this

If you need to repeat the same workflow, keep the pipeline files in place and refresh the local log set from S3. The same configuration works for new time windows and other ALB buckets. Other ELB access log formats (Classic or Network Load Balancer) use different fields, so they would need an adjusted grok pattern.

For a more permanent solution, I would next look at shipping logs directly from S3 into Elasticsearch with Filebeat or an AWS Lambda transform, then add alerting on elb_status_code:504 and high backend latency.
