June 24, 2026
Investigating AWS Application Load Balancer 504 spikes by ingesting ALB access logs from S3 into a local Elasticsearch / Kibana stack for fast correlation.
AWS ALB access logs were being written to S3, but the incident required more than a single-line search. A batch of 504s was impacting availability, and I needed a quick way to correlate request details, client IP, target response time, and request path across many files.
For this troubleshooting runbook, a local ELK stack delivered fast iteration with no production environment changes.
I used Docker Compose with Elasticsearch, Kibana, and Logstash OSS 7.10.2. The logs were downloaded from the ALB S3 bucket into a local logs/ directory, then mounted into the Logstash container.
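ALB delivers access logs to S3 as gzip-compressed .log.gz objects, while the pipeline below globs /logs/*.log, so the downloaded files need to be decompressed first. A minimal sketch of that step, assuming the objects have already been copied to a local directory (for example with aws s3 sync); the function name and the directory names are hypothetical, not part of the original setup:

```ruby
require "zlib"
require "fileutils"

# Decompress every *.log.gz found under src_dir (recursively) into dst_dir
# as plain *.log files. Returns the list of files written.
# Both directory names are illustrative assumptions.
def gunzip_alb_logs(src_dir, dst_dir)
  FileUtils.mkdir_p(dst_dir)
  Dir.glob(File.join(src_dir, "**", "*.log.gz")).sort.map do |gz_path|
    # "foo.log.gz" becomes "foo.log"; the S3 date prefixes are flattened away.
    out_path = File.join(dst_dir, File.basename(gz_path, ".gz"))
    Zlib::GzipReader.open(gz_path) do |gz|
      File.open(out_path, "wb") { |out| IO.copy_stream(gz, out) }
    end
    out_path
  end
end
```

Calling gunzip_alb_logs("downloads", "logs") from the alb-logs/ directory would fill the logs/ folder that the Logstash container mounts read-only.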
services:
  elasticsearch:
    image: elasticsearch/elasticsearch-oss:7.10.2
    container_name: alb-elasticsearch
    environment:
      - cluster.name=alb-local
      - node.name=alb-elasticsearch
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s' >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 30s
  kibana:
    image: kibana/kibana-oss:7.10.2
    container_name: alb-kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      elasticsearch:
        condition: service_healthy
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS http://localhost:5601/api/status >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 60s
  logstash:
    image: logstash/logstash-oss:7.10.2
    container_name: alb-logstash
    volumes:
      - ./logs:/logs:ro
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    environment:
      - LS_JAVA_OPTS=-Xms1g -Xmx1g
    depends_on:
      elasticsearch:
        condition: service_healthy
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -fsS http://localhost:9600/_node/pipelines >/dev/null || exit 1"
        ]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 30s
volumes:
  elasticsearch-data:
Here is the local structure for the troubleshooting stack and pipeline files:
alb-logs/
  docker-compose.yml
  logstash/
    logstash.conf
    logstash.yml
  logs/
    000111222333_elasticloadbalancing_eu-west-1_app.example-internal-alb.0123456789abcdef_20240115T0000Z_10.0.0.10_example.log
The tree shows the Compose file, the Logstash pipeline and settings files, and the raw ALB log directory that the parser consumes.
The Logstash pipeline reads each file from /logs/*.log, parses the ALB line format with grok, converts timestamps, and enriches the request URL. This is the exact pipeline used:
input {
  file {
    path => "/logs/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/logstash-completed.log"
    codec => plain {
      charset => "UTF-8"
    }
  }
}
filter {
  if [path] {
    mutate {
      rename => {
        "path" => "log_file"
      }
    }
  }
  grok {
    match => {
      "message" => '%{WORD:alb_type} %{TIMESTAMP_ISO8601:log_time} %{NOTSPACE:elb} %{NOTSPACE:client} %{NOTSPACE:target} (?:%{NUMBER:request_processing_time:float}|-) (?:%{NUMBER:target_processing_time:float}|-) (?:%{NUMBER:response_processing_time:float}|-) (?:%{INT:elb_status_code:int}|-) (?:%{INT:target_status_code:int}|-) %{INT:received_bytes:int} %{INT:sent_bytes:int} "%{DATA:request}" "%{DATA:user_agent}" %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol} %{NOTSPACE:target_group_arn} "%{DATA:trace_id}" "%{DATA:domain_name}" "%{DATA:chosen_cert_arn}" %{NOTSPACE:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} "%{DATA:actions_executed}" "%{DATA:redirect_url}" "%{DATA:error_reason}" "%{DATA:target_port_list}" "%{DATA:target_status_code_list}" "%{DATA:classification}" "%{DATA:classification_reason}" %{NOTSPACE:conn_trace_id}(?: "%{DATA:transformed_host}" "%{DATA:transformed_uri}" "%{DATA:request_transform_status}")?'
    }
    tag_on_failure => ["_alb_grok_failure"]
  }
  date {
    match => [ "log_time", "ISO8601" ]
    target => "@timestamp"
  }
  grok {
    match => {
      "client" => "%{IP:client_ip}:%{INT:client_port:int}"
    }
    tag_on_failure => []
  }
  if [target] and [target] != "-" {
    grok {
      match => {
        "target" => "%{IP:target_ip}:%{INT:target_port:int}"
      }
      tag_on_failure => []
    }
  }
  grok {
    match => {
      "request" => "%{WORD:http_method} %{NOTSPACE:url} HTTP/%{NUMBER:http_version}"
    }
    tag_on_failure => ["_request_parse_failure"]
  }
  ruby {
    code => '
      require "uri"
      url = event.get("url")
      if url
        begin
          parsed = URI.parse(url)
          event.set("url_scheme", parsed.scheme) if parsed.scheme
          event.set("url_host", parsed.host) if parsed.host
          event.set("url_port", parsed.port) if parsed.port
          event.set("url_path", parsed.path) if parsed.path
          event.set("url_query", parsed.query) if parsed.query
        rescue
          path = url.split("?")[0]
          query = url.include?("?") ? url.split("?", 2)[1] : nil
          event.set("url_path", path)
          event.set("url_query", query) if query
        end
      end
    '
  }
  mutate {
    convert => {
      "matched_rule_priority" => "integer"
    }
    add_field => {
      "log_type" => "aws_alb"
    }
    remove_field => [
      "@version"
    ]
  }
}
output {
  elasticsearch {
    hosts => [ "http://elasticsearch:9200" ]
    index => "alb-logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => dots
  }
}
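The URL handling in the ruby filter is easy to exercise outside Logstash. The sketch below restates that logic as a plain function so it can be run against sample URLs before reloading the pipeline; split_url is a hypothetical helper name, and it returns a hash instead of calling event.set:

```ruby
require "uri"

# Standalone restatement of the ruby filter's URL logic.
# Returns the url_* fields the filter would set on the event.
def split_url(url)
  fields = {}
  return fields if url.nil?

  begin
    parsed = URI.parse(url)
    fields["url_scheme"] = parsed.scheme if parsed.scheme
    fields["url_host"]   = parsed.host   if parsed.host
    fields["url_port"]   = parsed.port   if parsed.port
    fields["url_path"]   = parsed.path   if parsed.path
    fields["url_query"]  = parsed.query  if parsed.query
  rescue StandardError
    # Fallback for URLs that URI.parse rejects: split on the first "?".
    path, query = url.split("?", 2)
    fields["url_path"]  = path
    fields["url_query"] = query if query
  end
  fields
end
```

One thing this makes visible: in the fallback branch, url_path holds everything before the "?", including the scheme and host, so paths from unparseable URLs will not group with the cleanly parsed ones.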
These are actual ALB log lines from the local sample dataset.
http 2024-01-15T23:55:00.393927Z app/example-internal-alb/0123456789abcdef 10.0.1.11:58547 10.0.2.21:80 0.000 0.902 0.004 502 502 8407 72260 "PUT http://www.example.com:80/fulfillment/api/v1/shipments/ORD-4046-9727/tracking?requestId=req-0001-2212 HTTP/1.1" "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000002-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000002-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 502 2024-01-15T23:54:59.996927Z "forward" "-" "-" "10.0.2.21:80" "502" "-" "-" TID_00000000000000000000000000000002 "-" "-" "-"
http 2024-01-15T23:55:00.913378Z app/example-internal-alb/0123456789abcdef 10.0.1.11:57878 10.0.2.25:80 0.004 57.632 0.000 504 - 632 639 "DELETE http://www.example.com:80/reports/api/v1/monthly-usage HTTP/1.1" "curl/8.4.0" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000013-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000013-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 504 2024-01-15T23:55:00.277378Z "forward" "-" "-" "10.0.2.25:80" "-" "-" "-" TID_00000000000000000000000000000013 "-" "-" "-"
http 2024-01-15T23:55:02.351454Z app/example-internal-alb/0123456789abcdef 10.0.1.10:60847 10.0.2.21:80 0.004 54.119 0.000 504 - 634 533 "PUT http://www.example.com:80/telemetry/collect/browser-events?requestId=req-0025-7716 HTTP/1.1" "Mozilla/5.0 (compatible; ExampleDocsBot/1.0; +https://docs.example.com/bot)" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000026-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000026-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 504 2024-01-15T23:55:02.023454Z "forward" "-" "-" "10.0.2.21:80" "-" "-" "-" TID_00000000000000000000000000000026 "-" "-" "-"
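The two 504 lines are the raw ALB entries the parser is built to handle: each shows a target_processing_time above 50 seconds and a target_status_code of "-", meaning the target returned no response before the ALB answered with a gateway timeout. The 502 line is included for contrast.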
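To double-check the field order without starting Logstash, a log line can be split with Ruby's Shellwords, which respects the double-quoted fields. This is only a rough sketch and not a replacement for the grok pattern: the field names are borrowed from the pipeline, parse_alb_line is a hypothetical helper, values stay as strings, and the optional trailing transform fields are ignored.

```ruby
require "shellwords"

# Field names in the order used by the grok pattern (first 30 fields only).
ALB_FIELDS = %w[
  alb_type log_time elb client target
  request_processing_time target_processing_time response_processing_time
  elb_status_code target_status_code received_bytes sent_bytes
  request user_agent ssl_cipher ssl_protocol target_group_arn
  trace_id domain_name chosen_cert_arn matched_rule_priority
  request_creation_time actions_executed redirect_url error_reason
  target_port_list target_status_code_list classification
  classification_reason conn_trace_id
].freeze

# One of the 504 lines from the sample dataset, copied verbatim.
SAMPLE_504_LINE = 'http 2024-01-15T23:55:00.913378Z app/example-internal-alb/0123456789abcdef 10.0.1.11:57878 10.0.2.25:80 0.004 57.632 0.000 504 - 632 639 "DELETE http://www.example.com:80/reports/api/v1/monthly-usage HTTP/1.1" "curl/8.4.0" - - arn:aws:elasticloadbalancing:eu-west-1:000111222333:targetgroup/example-target-group/0123456789abcdef "Self=1-00000013-aaaaaaaaaaaaaaaaaaaaaaaa;Root=1-00000013-bbbbbbbbbbbbbbbbbbbbbbbb" "-" "-" 504 2024-01-15T23:55:00.277378Z "forward" "-" "-" "10.0.2.25:80" "-" "-" "-" TID_00000000000000000000000000000013 "-" "-" "-"'.freeze

# Split a line on whitespace, keeping quoted fields together, and pair
# each value with its field name. Extra trailing values are dropped.
def parse_alb_line(line)
  ALB_FIELDS.zip(Shellwords.split(line)).to_h
end
```

Running parse_alb_line(SAMPLE_504_LINE) and printing the result gives a quick view of which value lands in which field, which helps when a line ends up tagged _alb_grok_failure.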
Key fields I extracted included:
- elb_status_code / target_status_code
- request / url_path / url_query
- client_ip / target_ip / target_group_arn
- @timestamp from ALB log_time
This is the exact alb-logs/logstash/logstash.yml that controls Logstash worker and batching behavior.
http.host: "0.0.0.0"
pipeline.workers: 4
pipeline.batch.size: 1000
pipeline.batch.delay: 50
log.level: info
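With the pipeline in place, Kibana Discover makes it easy to filter on 504s and inspect the surrounding context. A useful starting query was:
elb_status_code:504 AND target_status_code:*
Note that the grok pattern leaves target_status_code unset when the ALB logs "-", so this query returns only 504s where the target sent a status. Drop the second clause, or negate it with NOT, to include the timeouts where the target never answered.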
From there I could drill into individual requests by path, client, and target.
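One way to summarize the 504s before opening individual documents is a terms aggregation on the request path. The sketch below builds such a search body and, optionally, posts it to the local Elasticsearch from the compose file. The function names are hypothetical, and url_path.keyword assumes the index uses Elasticsearch's default dynamic mapping for strings, which I did not verify against this stack.

```ruby
require "json"
require "net/http"
require "uri"

# Search body: count 504s per request path and report the average
# target_processing_time for each path.
# "url_path.keyword" is an assumption about the index mapping.
def top_504_paths_query(size = 10)
  {
    "size" => 0, # aggregation results only, no hits
    "query" => {
      "bool" => { "filter" => [{ "term" => { "elb_status_code" => 504 } }] }
    },
    "aggs" => {
      "top_paths" => {
        "terms" => { "field" => "url_path.keyword", "size" => size },
        "aggs" => {
          "avg_target_time" => { "avg" => { "field" => "target_processing_time" } }
        }
      }
    }
  }
end

# Post the body to the local stack; the default URL combines the published
# port 9200 with the alb-logs-* index pattern written by the pipeline.
def run_query(body, url = "http://localhost:9200/alb-logs-*/_search")
  response = Net::HTTP.post(URI(url), JSON.generate(body),
                            "Content-Type" => "application/json")
  JSON.parse(response.body)
end
```

With the stack running, puts JSON.pretty_generate(run_query(top_504_paths_query)) prints the buckets; the same body can also be pasted into any HTTP client pointed at port 9200.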
The local stack turned a pile of S3 files into searchable incident data. I was able to see the patterns that mattered, identify the service endpoints with the highest error share, and confirm that the issue was upstream of the ALB rather than in the front-door configuration.
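Error share per endpoint is the number of 504s for a path divided by all requests to that path. As a rough sketch of that arithmetic, assuming the events are available as hashes carrying the url_path and elb_status_code fields produced by the pipeline (for example, exported search hits); error_share_by_path is a hypothetical helper, not something from the original workflow:

```ruby
# Compute the share of requests per path that ended with the given
# ELB status code. Returns [path, share] pairs, highest share first,
# with ties ordered by path.
def error_share_by_path(events, status: 504)
  totals = Hash.new(0)
  errors = Hash.new(0)

  events.each do |event|
    path = event["url_path"]
    next if path.nil? # lines that failed request parsing have no path

    totals[path] += 1
    errors[path] += 1 if event["elb_status_code"] == status
  end

  totals
    .map { |path, count| [path, errors[path].to_f / count] }
    .sort_by { |path, share| [-share, path] }
end
```

A path with a handful of requests can show a very high share, so it is worth reading the share alongside the raw request count before drawing conclusions.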
If you need to repeat the same workflow, keep the pipeline files in place and refresh the local log set from S3. The same configuration works for new time windows and other ALB buckets. Classic Load Balancer and Network Load Balancer access logs use different field layouts, so they would need their own grok pattern.
For a more permanent solution, I would next look at shipping logs directly from S3 into Elasticsearch with Filebeat or an AWS Lambda transform, then add alerting on elb_status_code:504 and high backend latency.