Kevin's Blog

The only way to discover the limits of the possible is to go beyond them into the impossible. - Arthur C. Clarke

Mar 10, 2018 - 13 minute read - workshop

Happy OPS // Angry ELK

Join me as we build a highly available ELK stack with recent versions of Elasticsearch, Kibana, Filebeat, Logstash, Curator, and ElastAlert. This is a scaled-down but production-ready installation suitable for Swarm / EE clusters.

Content was inspired by and derived from a presentation by Don Bauer at the Nashville Docker Meetup.


Setup

This tutorial is meant for DevOps engineers who have some familiarity with Docker, Swarm (or Enterprise), building images, and deploying services. A couple of helper scripts that build and run the entire stack against a Swarm cluster are included.

Pre-requisites

A modern development machine or server with at least 4 hyperthreaded cores, 16 GB of RAM, and an SSD. If you're running Linux, be sure to increase the maximum virtual memory map count on your machine (see the snippet after this list).

  • Docker v17.12+
  • Docker is running in Swarm mode – docker swarm init
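
If you're on Linux, you can raise the virtual memory limit like so (a minimal sketch; 262144 is the minimum value from the Elasticsearch documentation):

sudo sysctl -w vm.max_map_count=262144

# Persist the setting across reboots
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf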

Scripts

scripts/build.sh
#!/bin/bash

docker build -t elk-demo-elasticsearch:latest elasticsearch/image/.
docker build -t elk-demo-es-proxy:latest es-proxy/image/.
docker build -t elk-demo-kibana:latest kibana/image/.
docker build -t elk-demo-logstash:latest logstash/image/.
docker build -t elk-demo-filebeat:latest filebeat/image/.
docker build -t elk-demo-curator:latest curator/image/.
docker build -t elk-demo-elastalert:latest elastalert/image/.
scripts/run.sh
#!/bin/bash

docker network create -d overlay --attachable elk-network

docker stack deploy \
  --compose-file ../elasticsearch/swarm/swarm.yml elk-demo
docker stack deploy \
  --compose-file ../es-proxy/swarm/swarm.yml      elk-demo
docker stack deploy \
  --compose-file ../kibana/swarm/swarm.yml        elk-demo
docker stack deploy \
  --compose-file ../logstash/swarm/swarm.yml      elk-demo
docker stack deploy \
  --compose-file ../filebeat/swarm/swarm.yml      elk-demo
docker stack deploy \
  --compose-file ../curator/swarm/swarm.yml       elk-demo
docker stack deploy \
  --compose-file ../elastalert/swarm/swarm.yml    elk-demo

Notes

Filebeat

This install uses Filebeat to scrape logs. Most tutorials out there use logspout as the collector, but we've observed on large installs that this generates significant load on the Docker daemon, since logspout interfaces directly with the Docker socket to scrape logs. This has a number of negative side effects, so Filebeat was chosen for its ability to scrape directly from the JSON source files.

This also assumes that your services were deployed with json-file as the log driver. This is typically the default in swarm installations so if you don’t know what we’re referring to here, you won’t have to worry. We’ve included gelf and logspout connectors in Logstash but we won’t be covering the usage of those in this workshop.
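
If you're unsure which log driver your daemon uses, here's a quick check (standard docker CLI Go templates; replace <container-id> with one of your containers):

docker info --format '{{.LoggingDriver}}'
docker inspect --format '{{.HostConfig.LogConfig.Type}}' <container-id>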

X-Pack

X-Pack is Elastic's set of value-added features, which cost real money to license. We've disabled these features in the configuration for our services. We will not be covering how to enable them or how to install licenses for these products.

Installing / Start up

This workshop is meant to be a guide to configuring the entire stack yourself and familiarizing yourself with the configuration and swarm scripts. We have put the entire project in a git repository for your convenience.

Once you clone this repository you can run the helper scripts located in scripts/ to build the images and deploy the stack. Once the system is up and running you may access Kibana locally at http://localhost:5601. (Kibana startup is notoriously slow; it may take 15-20 minutes to load.)
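
For reference, here's a minimal sketch of the full sequence. Note that build.sh uses paths relative to the repository root while run.sh uses paths relative to scripts/; the elk-demo directory name is an assumption, so use whatever path you cloned into:

cd elk-demo
./scripts/build.sh                       # build all seven images locally
(cd scripts && ./run.sh)                 # create elk-network and deploy the stack

docker stack services elk-demo           # watch the replicas converge
docker service logs -f elk-demo_kibana   # follow Kibana while it warms up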


ElasticSearch

Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is developed alongside a data-collection and log-parsing engine called Logstash, and an analytics and visualisation platform called Kibana. The three products are designed for use as an integrated solution, referred to as the “Elastic Stack” (also known as the “ELK stack”).

We will be building a production ready Elastic Stack using Docker Swarm for the orchestration and deployment of this solution. We’ll also deploy additional services to collect logs from the Swarm cluster itself along with tools to enable monitoring.

CPU / Memory Configuration

We’ve scaled down the CPU and memory configuration for this deployment considerably. As with any real-time production system you should experiment with these settings, keep an eye on your CPU / Memory usage, and adjust accordingly.

There are some best practices to follow regarding JVM heap configuration and thread pools, but we won't focus too much on them in this workshop. If you're going to increase your CPU pool, you should adjust your thread options along with it. If you're increasing your available RAM, a general rule of thumb is to allocate ½ of the available RAM to the maximum heap.
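
As a worked example of that rule of thumb (illustrative values only; tune against your own workload), a node granted 8G in the swarm template would get a ~4G heap, with -Xms pinned to -Xmx to avoid heap-resizing pauses:

    environment:
      ES_JAVA_OPTS: '-Xms4g -Xmx4g'
    deploy:
      resources:
        limits:
          memory: '8G'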

High Availability

For this deployment we run multiple Elasticsearch nodes, all master-eligible, so the cluster should be able to withstand a single node failure.

We're also deploying a front-tier load balancer to distribute requests across the Elasticsearch cluster. Important: there are no health checks internal to the proxy server itself.
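
Because of that, it's worth spot-checking cluster health through the proxy yourself. Since elk-network is created with --attachable, a throwaway container on the same network works (curlimages/curl is just one convenient image that ships with curl):

docker run --rm --network elk-network curlimages/curl \
  -s 'http://es-proxy:9200/_cluster/health?pretty'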

Files

  • elasticsearch/
    • image/
      • Dockerfile
      • elasticsearch.yml
    • swarm/
      • swarm.yml
  • es-proxy/
    • image/
      • Dockerfile
      • Caddyfile
    • swarm/
      • swarm.yml

Dockerfile(s)

elasticsearch/image/Dockerfile
FROM    docker.elastic.co/elasticsearch/elasticsearch:6.2.2

COPY    --chown=elasticsearch:elasticsearch \
        elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml

RUN     echo "networkaddress.cache.ttl=30" >> ${JAVA_HOME}/lib/security/java.security && \
        echo "networkaddress.cache.negative.ttl=0" >> ${JAVA_HOME}/lib/security/java.security
es-proxy/image/Dockerfile
FROM    registry.hub.docker.com/stefanprodan/caddy:0.10.10

COPY    Caddyfile /etc/caddy/Caddyfile

Configuration

elasticsearch/image/elasticsearch.yml
path:
  data:
    - /usr/share/elasticsearch/data

cluster:
  routing:
    allocation:
      allow_rebalance: always
      cluster_concurrent_rebalance: 2
      awareness:
        attributes: zone
    rebalance:
      enable: all

network:
  host: _site_

http:
  host: 0.0.0.0

transport:
  host: 0.0.0.0
  ping_schedule: 5s
  tcp:
    compress: true

discovery:
  zen:
    minimum_master_nodes: 2 # Default 1 || Formula (master_eligible_nodes / 2) + 1

# X-Pack Settings
xpack.ml.enabled: false
xpack.security.enabled: false
xpack.monitoring.enabled: true
xpack.watcher.enabled: false
xpack.license.self_generated.type: basic

# Gateway Settings
gateway:
  expected_nodes: 3
  recover_after_nodes: 2
  recover_after_data_nodes: 1

# Override CPU Detection
processors: 1
es-proxy/image/Caddyfile
:9200 {
  proxy / es-node-01:9200 es-node-02:9200 es-node-03:9200 {
    transparent
  }

  errors stderr
  tls off
}

Swarm Template(s)

elasticsearch/swarm/swarm.yml
version: '3.3'

networks:
  elk-network:
    external: true

volumes:
  es_data_1:
  es_data_2:
  es_data_3:

services:
  es-node-01:
    image: elk-demo-elasticsearch:latest
    networks:
    - elk-network
    environment:
      node.name: 'es-node-01'
      network.publish_host: 'es-node-01'
      discovery.zen.ping.unicast.hosts: 'es-node-02,es-node-03'
      cluster.name: 'es-cluster'
      node.master: 'true'
      node.ingest: 'true'
      node.data: 'true'
      node.attr.zone: 'a' # shard allocation
      ES_JAVA_OPTS: '-Xms256M -Xmx256M -XX:ParallelGCThreads=1 -XX:CICompilerCount=2' # prevent ES from over-allocating resources
    volumes:
    - es_data_1:/usr/share/elasticsearch/data
    stop_grace_period: '1m30s'
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '1'
          memory: '512M'
      restart_policy:
        condition: 'any'
        delay: '30s'
  es-node-02:
    image: elk-demo-elasticsearch:latest
    networks:
    - elk-network
    environment:
      node.name: 'es-node-02'
      network.publish_host: 'es-node-02'
      discovery.zen.ping.unicast.hosts: 'es-node-01,es-node-03'
      cluster.name: 'es-cluster'
      node.master: 'true'
      node.ingest: 'true'
      node.data: 'true'
      node.attr.zone: 'b' # shard allocation
      ES_JAVA_OPTS: '-Xms256M -Xmx256M -XX:ParallelGCThreads=1 -XX:CICompilerCount=2' # prevent ES from over-allocating resources
    volumes:
    - es_data_2:/usr/share/elasticsearch/data
    stop_grace_period: '1m30s'
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '1'
          memory: '512M'
      restart_policy:
        condition: 'any'
        delay: '30s'
  es-node-03:
    image: elk-demo-elasticsearch:latest
    networks:
    - elk-network
    environment:
      node.name: 'es-node-03'
      network.publish_host: 'es-node-03'
      discovery.zen.ping.unicast.hosts: 'es-node-01,es-node-02'
      cluster.name: 'es-cluster'
      node.master: 'true'
      node.ingest: 'true'
      node.data: 'true'
      node.attr.zone: 'c' # shard allocation
      ES_JAVA_OPTS: '-Xms256M -Xmx256M -XX:ParallelGCThreads=1 -XX:CICompilerCount=2' # prevent ES from over-allocating resources
    volumes:
    - es_data_3:/usr/share/elasticsearch/data
    stop_grace_period: '1m30s'
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '1'
          memory: '512M'
      restart_policy:
        condition: 'any'
        delay: '30s'

es-proxy/swarm/swarm.yml
version: '3.3'

networks:
  elk-network:
    external: true

services:
  es-proxy:
    image: elk-demo-es-proxy:latest
    networks:
    - elk-network
    deploy:
      replicas: 2
      resources:
        limits:
          memory: '128M'
        reservations:
          memory: '64M'
      restart_policy:
        condition: 'any'
        delay: '30s'

Logstash

High Availability

We're deploying multiple Logstash replicas to spread load and to ensure logs are always ingested during Logstash upgrades. This also has the side effect of being highly available.

Files

  • logstash/
    • image/
      • patterns/
        • grok
      • pipeline/
        • filters.conf
        • inputs.conf
        • outputs.conf
      • Dockerfile
      • logstash.yml
    • swarm/
      • swarm.yml

Dockerfile

logstash/image/Dockerfile
FROM  docker.elastic.co/logstash/logstash:6.2.2

# install plugins
RUN   cd /usr/share/logstash && \
      bin/logstash-plugin install logstash-filter-aggregate && \
      bin/logstash-plugin install logstash-filter-json && \
      bin/logstash-plugin install logstash-filter-kv && \
      bin/logstash-plugin install logstash-filter-useragent

# copy configuration(s)
COPY  logstash.yml /usr/share/logstash/config/logstash.yml
COPY  patterns /usr/share/logstash/patterns
RUN   rm -f /usr/share/logstash/pipeline/logstash.conf
ADD   pipeline/ /usr/share/logstash/pipeline/

HEALTHCHECK --interval=30s --timeout=5s --start-period=300s \
    CMD curl -f localhost:9600/_node/stats/process || exit 1

# config test
RUN   logstash -t

Configuration

Groks

Grok patterns are a powerful way to structure the unstructured data ingested by Logstash. They help ensure the data shipped to Elasticsearch is well-formed and can be indexed intelligently. Read more about groks here.

logstash/image/patterns/grok
# Apache 2.4 Modified Pattern
HTTPD_ERROR_CUSTOM \[%{HTTPDERROR_DATE:timestamp}\] \[(%{WORD:module})?:%{LOGLEVEL:loglevel}\] \[pid %{POSINT:pid}(:tid %{NUMBER:tid})?\]( \(%{POSINT:proxy_errorcode}\)%{DATA:proxy_message}:)?( \[client %{IPORHOST:clientip}:%{POSINT:clientport}\])?( %{DATA:errorcode}:)? %{GREEDYDATA:message}
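
For illustration, here's a hypothetical Apache 2.4 error-log line that this pattern matches, along with the main fields it extracts:

# input
[Mon Mar 05 12:34:56.789012 2018] [proxy:error] [pid 1234:tid 140243] [client 10.0.0.7:54321] AH00898: Error reading from remote server

# extracted fields
module => proxy, loglevel => error, pid => 1234, tid => 140243,
clientip => 10.0.0.7, clientport => 54321, errorcode => AH00898,
message => Error reading from remote server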
logstash/image/pipeline/filters.conf
filter {
  # Remove all ANSI escape sequences from message
  # 's/\033\[\d*(;\d*)*m//g'
  mutate {
    id => "mutate_filter_ansi"
    gsub => [
      "message", "(\033\[\d*(;\d*)*m)", ""
    ]
  }

  # Filter to mutate/apply docker specific fields
  if [docker][container][name] {
    # Mutate Rules
    # * docker.container.labels.com.docker.stack.namespace    => stack_namespace
    # * docker.container.labels.com.docker.swarm.service.name => swarm_service_name
    # * docker.container.labels.com.docker.swarm.service.id   => swarm_service_id
    # * docker.container.labels.com.docker.swarm.node.id      => swarm_node_id
    # * docker.container.id                                   => docker_container_id
    # * docker.container.name                                 => docker_container_name
    mutate {
      id => "filter_mutate_label_rename"
      rename => {
        "[docker][container][id]" => "docker_container_id"
        "[docker][container][name]" => "docker_container_name"
        "[docker][container][labels][com][docker][stack][namespace]" => "stack_namespace"
        "[docker][container][labels][com][docker][swarm][service][name]" => "swarm_service_name"
        "[docker][container][labels][com][docker][swarm][service][id]" => "swarm_service_id"
        "[docker][container][labels][com][docker][swarm][node][id]" => "swarm_node_id"
      }
    }
  }
}
logstash/image/pipeline/inputs.conf
input {
  # beats
  beats {
    id => "input_beats"
    port => 5044
    add_field => { "input_proto" => "beats" }
  }

  # logspout inputs
  tcp {
    id => "input_tcp"
    port  => 5050
    codec => json
    add_field => { "input_proto" => "tcp" }
  }
  udp {
    id => "input_upd"
    port  => 5050
    codec => json
    buffer_size => 16777216
    receive_buffer_bytes => 16777216
    queue_size => 50000
    add_field => { "input_proto" => "udp" }
  }
}
logstash/image/pipeline/outputs.conf
output {
  # elasticsearch output
  elasticsearch {
    id => "output_es_log"
    hosts => ["http://es-proxy:9200"]
  }
}
logstash/image/logstash.yml

http.host: '0.0.0.0'
path.config: '/usr/share/logstash/pipeline'
node.name: 'logstash'
pipeline:
  batch:
    size: ${BATCH_SIZE:100}
    delay: ${BATCH_DELAY:25}
queue:
  type: 'persisted'
  drain: true
  checkpoint.writes: 1024
xpack:
  monitoring:
    enabled: true
    elasticsearch.url: 'http://es-proxy:9200'

Swarm Template

logstash/swarm/swarm.yml
version: '3.3'

networks:
  elk-network:
    external: true

services:
  logstash:
    image: elk-demo-logstash:latest
    networks:
    - elk-network
    environment:
      LOGSPOUT: 'ignore'
      LS_JAVA_OPTS: '-Xms256m -Xmx256m -XX:ParallelGCThreads=1 -XX:CICompilerCount=2 -Dnetworkaddress.cache.ttl=5 -Dnetworkaddress.cache.negative.ttl=5'
    ports:
    - '51415:51415/udp'
    - '51415:51415/tcp'
    - '12201:12201/udp'
    - '12201:12201/tcp'
    - '5050:5050/tcp'
    - '5050:5050/udp'
    deploy:
      mode: 'replicated'
      replicas: 2
      update_config:
        parallelism: 1
        delay: '60s'
        failure_action: 'rollback'
        monitor: '3m'
      resources:
        limits:
          cpus: '1'
          memory: '1G'
      restart_policy:
        condition: 'any'
        delay: '30s'

Filebeat

Filebeat is a log data shipper for local files. Installed as an agent on your servers, Filebeat monitors the log directories or specific log files, tails the files, and forwards them either to Elasticsearch or Logstash for indexing.

Here’s how Filebeat works: When you start Filebeat, it starts one or more prospectors that look in the local paths you’ve specified for log files. For each log file that the prospector locates, Filebeat starts a harvester. Each harvester reads a single log file for new content and sends the new log data to libbeat, which aggregates the events and sends the aggregated data to the output that you’ve configured for Filebeat.

source: elasticsearch documentation

High Availability

Filebeat is installed “globally” on the swarm, meaning one instance is deployed to every node in the cluster. Since Filebeat scrapes log files on the host machine, this requirement is somewhat self-explanatory.
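
Once deployed, you can confirm one Filebeat task is running per node (service names follow Swarm's <stack>_<service> convention):

docker service ps elk-demo_filebeat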

Files

  • filebeat/
    • image/
      • Dockerfile
      • filebeat.yml
    • swarm/
      • swarm.yml

Dockerfile

filebeat/image/Dockerfile
FROM docker.elastic.co/beats/filebeat:6.2.1

USER root
COPY filebeat.yml /usr/share/filebeat/filebeat.yml
RUN chmod a-w /usr/share/filebeat/filebeat.yml

Configuration

filebeat/image/filebeat.yml
# Auto-discovery
filebeat.autodiscover:
  providers:
    - type: docker
      templates:
      - condition:
          and:
            - not:
                contains:
                  docker.container.name: "_logstash"
    
        config:
          - type: docker
            containers.path: "/var/lib/docker/containers"
            containers.ids: [ "${data.docker.container.id}" ]

            tags: ["v6.2", "auto"]
            
            exclude_lines: [ "^\\s+[\\-`('.|_]" ] # Drop ASCII Art
            close_inactive: 5m
            close_renamed: true
            close_removed: true
            tail_files: true
            clean_removed: true
            close_timeout: 5m

# Processor Types:
# https://www.elastic.co/guide/en/beats/filebeat/current/defining-processors.html
processors:
- drop_event:
    when:
      equals:
        kibana.type: "response"
      or:
      - contains:
          docker.container.name: "_logstash"

# Get Docker Meta Data
- add_docker_metadata:
    host: "unix:///var/run/docker.sock"

# Handle Kibana JSON
- decode_json_fields:
    fields:
    - "message"
    target: "kibana"
    max_depth: 1
    overwrite_keys: false
    process_array: false
    when:
      contains:
        docker.container.name: "_kibana"

# Where to send the stuffs
output.logstash:
  enabled: true
  hosts: ["logstash:5044"]
  ttl: 30
  slow_start: true

# Dunno why it needs this yet
setup.kibana.host: "http://kibana:5601"

# CPUs
max_procs: 1

Swarm Template

filebeat/swarm/swarm.yml
version: '3.3'

volumes:
  filebeat_registry:
    driver: 'local'

networks:
  elk-network:
    external: true

services:
  filebeat:
    image: elk-demo-filebeat:latest
    environment:
      max_procs: 1
    networks:
    - elk-network
    volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:rw # logs
    - /var/run/docker.sock:/var/run/docker.sock:ro # get that metadata
    - /var/log:/hostfs/var/log # host logs
    - filebeat_registry:/usr/share/filebeat/data # persistent log cache
    deploy:
      mode: global
      resources:
        limits:
          cpus: '1'
          memory: '1G'
        reservations:
          cpus: '.5'
      restart_policy:
        condition: 'any'
        delay: '30s'

Kibana

Kibana is an open source analytics and visualization platform designed to work with Elasticsearch. You use Kibana to search, view, and interact with data stored in Elasticsearch indices. You can easily perform advanced data analysis and visualize your data in a variety of charts, tables, and maps.

Kibana makes it easy to understand large volumes of data. Its simple, browser-based interface enables you to quickly create and share dynamic dashboards that display changes to Elasticsearch queries in real time.

source: elasticsearch documentation

High Availability

The swarm template included with this workshop is configured to deploy only a single replica of Kibana. However, changing replicas: n is all you need to do to deploy multiple replicas of this application.
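
If you'd rather scale a running stack than edit the template, the Swarm CLI can do it directly:

docker service scale elk-demo_kibana=2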

Files

  • kibana/
    • image/
      • Dockerfile
      • kibana.yml
    • swarm/
      • swarm.yml

Dockerfile

kibana/image/Dockerfile
FROM docker.elastic.co/kibana/kibana:6.2.2

COPY kibana.yml /usr/share/kibana/config/kibana.yml

Configuration

kibana/image/kibana.yml
elasticsearch.url: 'http://es-proxy:9200'
kibana.index: '.kibana-6'
server.name: 'kibana'
server.host: '0.0.0.0'

xpack.monitoring.elasticsearch.url: 'http://es-proxy:9200'
xpack.security.enabled: false
xpack.monitoring.enabled: true
xpack.monitoring.ui.container.elasticsearch.enabled: true
xpack.apm.ui.enabled: false
xpack.graph.enabled: false
xpack.ml.enabled: false

Swarm Template

kibana/swarm/swarm.yml
version: '3.3'

networks:
  elk-network:
    external: true

services:
  kibana:
    image: elk-demo-kibana:latest
    networks:
    - elk-network
    ports:
    - 5601:5601
    environment:
      SERVER_NAME: kibana
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 30s

Curator

Curator is a Python helper script that prunes your indices based on any number of filters. The most apparent use case is enforcing retention policies, preventing your disks from filling up or your cloud provider bill from spiraling out of control.

You can read more about Curator in the official documentation.
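
Before letting the cron job loose, it's worth a dry run to see which indices would be pruned. A sketch, run from the node hosting the curator task (--dry-run logs actions without performing them):

docker exec -it $(docker ps -q -f name=elk-demo_curator) \
  curator --dry-run --config /etc/curator/curator.yml /etc/curator/actions.yml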

Files

  • curator/
    • image/
      • etc/
        • cron.d/
          • curator
        • curator/
          • actions.yml
          • curator.yml
      • Dockerfile
    • swarm/
      • swarm.yml

Dockerfile

curator/image/Dockerfile
FROM phusion/baseimage:0.9.22

ENV CURATOR_VERSION="5.4.1"

RUN apt-get update -y && \
    apt-get install python python-pip -y && \
    pip install --quiet elasticsearch-curator==${CURATOR_VERSION} && \
    rm -rf /var/lib/apt/lists/*

COPY etc/cron.d/* /etc/cron.d
COPY etc/curator /etc/curator

RUN chmod 644 /etc/cron.d/*

Configuration

curator/image/etc/cron.d/curator
20 23 * * * root /usr/local/bin/curator --config /etc/curator/curator.yml /etc/curator/actions.yml
curator/image/etc/curator/actions.yml
---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
#
# Also remember that all examples have 'disable_action' set to True.  If you
# want to use this action as a template, be sure to set this to False after
# copying it.
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 30 days (based on index name), for logstash-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
      exclude:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
      exclude:
curator/image/etc/curator/curator.yml
---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
client:
  hosts:
    - es-proxy
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  aws_key:
  aws_secret_key:
  aws_region:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False

logging:
  loglevel: INFO
  logfile:
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']

Swarm Template

curator/swarm/swarm.yml
version: '3.3'

networks:
  elk-network:
    external: true

services:
  curator:
    image: elk-demo-curator:latest
    networks:
    - elk-network
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '0.5'
      restart_policy:
        condition: any
        delay: 30s

Elastalert

ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.

It was developed for use at Yelp, where engineers realized that Kibana is great for visualizing and querying data but needed a companion tool for alerting on inconsistencies in their data. Out of this need, ElastAlert was created.

ElastAlert allows for a configuration-based approach to defining triggers and actions, including notification systems for Slack, PagerDuty, email, and more.

Learn more on Github, or Read the Docs

High Availability

ElastAlert does not appear to be cluster-aware or idempotent. I don't recommend running more than one replica of ElastAlert.

Files

  • elastalert/
    • image/
      • config/
        • elastalert_supervisord.conf
        • elastalertconfig.yml
      • rules/
        • error.yml
      • Dockerfile
      • entrypoint.sh
    • swarm/
      • swarm.yml

Dockerfile

elastalert/image/Dockerfile
FROM registry.hub.docker.com/library/alpine:3.7

ENV ELASTALERT_URL https://github.com/Yelp/elastalert/archive/v0.1.28.zip

WORKDIR /opt

# Install software required for Elastalert and NTP for time synchronization.
RUN apk update && \
    apk upgrade && \
    apk add ca-certificates openssl-dev openssl libffi-dev python2 python2-dev py2-pip py2-yaml gcc musl-dev tzdata openntpd wget && \
    wget -O elastalert.zip "${ELASTALERT_URL}" && \
    unzip elastalert.zip && \
    rm elastalert.zip && \
    mv e* /opt/elastalert && \
    cd /opt/elastalert && \
    python setup.py install && \
    pip install -e . && \
    pip uninstall twilio --yes && \
    pip install twilio==6.0.0 && \
    easy_install supervisor && \
    mkdir -p /opt/config && \
    mkdir -p /opt/rules && \
    mkdir -p /opt/logs && \
    mkdir -p /var/empty && \
    apk del python2-dev && \
    apk del musl-dev && \
    apk del gcc && \
    apk del openssl-dev && \
    apk del libffi-dev && \
    rm -rf /var/cache/apk/*

WORKDIR /opt/elastalert

COPY config/* /opt/config/
COPY rules/* /opt/rules
COPY entrypoint.sh /opt/entrypoint.sh

# Make the start-script executable.
RUN chmod +x /opt/entrypoint.sh

# Launch Elastalert when a container is started.
CMD ["/opt/entrypoint.sh"]

Configuration

elastalert/image/entrypoint.sh
#!/bin/sh

set -e

SET_CONTAINER_TIMEZONE=${SET_CONTAINER_TIMEZONE:-True}
CONTAINER_TIMEZONE=${CONTAINER_TIMEZONE:-US/Central}
ELASTICSEARCH_HOST=${ELASTICSEARCH_HOST:-elasticsearchhost}
ELASTICSEARCH_PORT=${ELASTICSEARCH_PORT:-9200}
ELASTICSEARCH_TLS=${ELASTICSEARCH_TLS:-False}
ELASTICSEARCH_TLS_VERIFY=${ELASTICSEARCH_TLS_VERIFY:-True}
ELASTALERT_INDEX=${ELASTALERT_INDEX:-elastalert_status}

CONFIG_DIR=/opt/config
RULES_DIRECTORY=/opt/rules
LOG_DIR=/opt/logs
ELASTALERT_HOME=/opt/elastalert
ELASTALERT_CONFIG=${CONFIG_DIR}/elastalert_config.yaml
ELASTALERT_SUPERVISOR_CONF=${CONFIG_DIR}/elastalert_supervisord.conf

# Set schema and elastalert options
case "${ELASTICSEARCH_TLS}:${ELASTICSEARCH_TLS_VERIFY}" in
    True:True)
        WGET_SCHEMA='https://'
        WGET_OPTIONS='-q -T 3'
        CREATE_EA_OPTIONS='--ssl --verify-certs'
    ;;
    True:False)
        WGET_SCHEMA='https://'
        WGET_OPTIONS='-q -T 3 --no-check-certificate'
        CREATE_EA_OPTIONS='--ssl --no-verify-certs'
    ;;
    *)
        WGET_SCHEMA='http://'
        WGET_OPTIONS='-q -T 3'
        CREATE_EA_OPTIONS='--no-ssl'
    ;;
esac

# Set the timezone.
if [ "$SET_CONTAINER_TIMEZONE" = "True" ]; then
    cp /usr/share/zoneinfo/${CONTAINER_TIMEZONE} /etc/localtime && \
    echo "${CONTAINER_TIMEZONE}" >  /etc/timezone && \
    echo "Container timezone set to: $CONTAINER_TIMEZONE"
else
    echo "Container timezone not modified"
fi

# Force immediate synchronisation of the time and start the time-synchronization service.
# In order to be able to use ntpd in the container, it must be run with the SYS_TIME capability.
# In addition you may want to add the SYS_NICE capability, in order for ntpd to be able to modify its priority.
ntpd -s

# Elastalert configuration:
if [ ! -f ${ELASTALERT_CONFIG} ]; then
    cp "${ELASTALERT_HOME}/config.yaml.example" "${ELASTALERT_CONFIG}" && \

    # Set the rule directory in the Elastalert config file to external rules directory.
    sed -i -e"s|rules_folder: [[:print:]]*|rules_folder: ${RULES_DIRECTORY}|g" "${ELASTALERT_CONFIG}"
    # Set the Elasticsearch host that Elastalert is to query.
    sed -i -e"s|es_host: [[:print:]]*|es_host: ${ELASTICSEARCH_HOST}|g" "${ELASTALERT_CONFIG}"
    # Set the port used by Elasticsearch at the above address.
    sed -i -e"s|es_port: [0-9]*|es_port: ${ELASTICSEARCH_PORT}|g" "${ELASTALERT_CONFIG}"
    # Set the user name used to authenticate with Elasticsearch.
    if [ -n "${ELASTICSEARCH_USER}" ]; then
        sed -i -e"s|#es_username: [[:print:]]*|es_username: ${ELASTICSEARCH_USER}|g" "${ELASTALERT_CONFIG}"
    fi
    # Set the password used to authenticate with Elasticsearch.
    if [ -n "${ELASTICSEARCH_PASSWORD}" ]; then
        sed -i -e"s|#es_password: [[:print:]]*|es_password: ${ELASTICSEARCH_PASSWORD}|g" "${ELASTALERT_CONFIG}"
    fi
    # Set the writeback index used with elastalert.
    sed -i -e"s|writeback_index: [[:print:]]*|writeback_index: ${ELASTALERT_INDEX}|g" "${ELASTALERT_CONFIG}"
fi

# Elastalert Supervisor configuration:
if [ ! -f ${ELASTALERT_SUPERVISOR_CONF} ]; then
    cp "${ELASTALERT_HOME}/supervisord.conf.example" "${ELASTALERT_SUPERVISOR_CONF}" && \

    # Redirect Supervisor log output to a file in the designated logs directory.
    sed -i -e"s|logfile=.*log|logfile=${LOG_DIR}/elastalert_supervisord.log|g" "${ELASTALERT_SUPERVISOR_CONF}"
    # Redirect Supervisor stderr output to a file in the designated logs directory.
    sed -i -e"s|stderr_logfile=.*log|stderr_logfile=${LOG_DIR}/elastalert_stderr.log|g" "${ELASTALERT_SUPERVISOR_CONF}"
    # Modify the start-command.
    sed -i -e"s|python elastalert.py|elastalert --config ${ELASTALERT_CONFIG}|g" "${ELASTALERT_SUPERVISOR_CONF}"
fi

# Set authentication if needed
if [ -n "$ELASTICSEARCH_USER" ] && [ -n "$ELASTICSEARCH_PASSWORD" ]; then
    WGET_AUTH="$ELASTICSEARCH_USER:$ELASTICSEARCH_PASSWORD@"
else
    WGET_AUTH=""
fi

# Wait until Elasticsearch is online since otherwise Elastalert will fail.
while ! wget ${WGET_OPTIONS} -O - "${WGET_SCHEMA}${WGET_AUTH}${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}" 2>/dev/null
do
    echo "Waiting for Elasticsearch..."
    sleep 1
done
sleep 5

# Check if the Elastalert index exists in Elasticsearch and create it if it does not.
if ! wget ${WGET_OPTIONS} -O - "${WGET_SCHEMA}${WGET_AUTH}${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}/${ELASTALERT_INDEX}" 2>/dev/null
then
    echo "Creating Elastalert index in Elasticsearch..."
    elastalert-create-index ${CREATE_EA_OPTIONS} \
        --host "${ELASTICSEARCH_HOST}" \
        --port "${ELASTICSEARCH_PORT}" \
        --config "${ELASTALERT_CONFIG}" \
        --index "${ELASTALERT_INDEX}" \
        --old-index ""
else
    echo "Elastalert index already exists in Elasticsearch."
fi

echo "Starting Elastalert..."
exec supervisord -c "${ELASTALERT_SUPERVISOR_CONF}" -n

elastalert/image/config/elastalert_supervisord.conf
[unix_http_server]
file=/var/run/elastalert_supervisor.sock

[supervisord]
logfile=/var/log/elastalert_supervisord.log
logfile_maxbytes=1MB
logfile_backups=2
loglevel=debug
nodaemon=false
directory=%(here)s

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///var/run/elastalert_supervisor.sock

[program:elastalert]
# running globally
command = elastalert --config /opt/config/elastalertconfig.yml --verbose
               
# (alternative) using virtualenv
# command=/path/to/venv/bin/elastalert --config /path/to/config.yaml --verbose
process_name=elastalert
autorestart=true
startsecs=15
stopsignal=INT
stopasgroup=true
killasgroup=true
stderr_logfile=/var/log/elastalert_stderr.log
stderr_logfile_maxbytes=5MB

elastalert/image/config/elastalertconfig.yml

# This is the folder that contains the rule yaml files
# Any .yaml file will be loaded as a rule
rules_folder: /opt/rules

# How often ElastAlert will query Elasticsearch
# The unit can be anything from weeks to seconds
run_every:
  seconds: 30

# ElastAlert will buffer results from the most recent
# period of time, in case some log sources are not in real time
buffer_time:
  minutes: 15

# The Elasticsearch hostname for metadata writeback
# Note that every rule can have its own Elasticsearch host
es_host: es-proxy

# The Elasticsearch port
es_port: 9200

# The index on es_host which is used for metadata storage
# This can be an unmapped index, but it is recommended that you run
# elastalert-create-index to set a mapping
writeback_index: elastalert_status

# If an alert fails for some reason, ElastAlert will retry
# sending the alert until this time period has elapsed
alert_time_limit:
  days: 1
elastalert/image/rules/error.yml

# (Required)
# Rule name, must be unique
name: Log_Errors

# (Required)
# Rule type.
# The "any" rule type alerts on every document that matches the filter.
type: any

# (Required)
# Index to search, wildcard supported
index: logstash-*

filter:
- query:
    query_string: 
      query: 'message: "*exception*" OR message: "*ERROR 1*"'

include:
  - tag
  - message

# (Required)
# The alert is used when a match is found
alert:
- "slack"
slack_webhook_url: 'http://localhost' ## UPDATE IF YOU WANT SLACK NOTIFICATIONS
slack_username_override: 'Elast-Alert'
slack_channel_override: '#monitoring'
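
Before deploying a new rule, you can sanity-check it with elastalert-test-rule, which ships with ElastAlert and evaluates a rule against recent data without sending alerts. A sketch, run from the node hosting the task:

docker exec -it $(docker ps -q -f name=elk-demo_elastalert) \
  elastalert-test-rule /opt/rules/error.yml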

Swarm Template

elastalert/swarm/swarm.yml

version: '3.3'

networks:
  elk-network:
    external: true

services:
  elastalert:
    image: elk-demo-elastalert:latest
    networks:
    - elk-network
    ports:
    - '3030:3030'
    environment:
    - ELASTICSEARCH_PORT=9200
    - ELASTICSEARCH_HOST=es-proxy
    - ELASTALERT_SUPERVISOR_CONF=/opt/config/elastalert_supervisord.conf
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 60s
      restart_policy:
        condition: any
        delay: 30s

Wrap Up

Thanks for taking the time to make it all the way down here. Hopefully you have successfully booted an entire ELK stack that is ready to start ingesting thousands of logs per second. We've tested this configuration (with a bit more CPU/RAM) and have had no issues ingesting millions of logs per hour. YMMV, and of course, your storage backend is critical to the stability of the Elasticsearch services.

If you have any questions, feel free to post below. Thanks again! :)