Update ES mapping

Assume there is a session doc type under an index called abc.

The following example updates the mapping of the existing field date and enforces the date format.

PUT /abc/_mapping/session
{
  "properties": {
    "date": {
      "type": "date",
      "format" : "yyyy-MM-dd" 
    }
  } 
}
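
To verify the change took effect, you can fetch the mapping back; the response should show the date field with the new format:

GET /abc/_mapping/session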

If you want to add a new field called scans of type nested,

PUT /abc/_mapping/session
{
  "properties": {
    "scans": {
      "type": "nested"
    }
  }
}
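
As a quick sanity check, a document with a nested scans array can then be indexed. The inner field names (scan_id, result) below are made up for illustration:

PUT /abc/session/1
{
  "date": "2016-11-01",
  "scans": [
    { "scan_id": 1, "result": "clean" },
    { "scan_id": 2, "result": "flagged" }
  ]
}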

Create an index in Elasticsearch

The following shows a minimal setup for creating an index in Elasticsearch (5.1.0). The example creates an index test and defines some of the properties for a type, my_type. One thing worth mentioning is the keyword mapping for string-typed fields defined in dynamic_templates. This setting adds a keyword field (previously known as the raw field) to all string fields. ES does not automatically create such keyword fields for custom types, which sometimes causes trouble for querying or visualization because string fields are tokenized by default. Having an extra keyword (not-analyzed) value for string fields is often found useful.

PUT test
{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
      "my_type": {
        "_all": {
          "enabled": true,
          "norms": false
        },
        "dynamic_templates": [
          {
            "message_field": {
              "path_match": "message",
              "match_mapping_type": "string",
              "mapping": {
                "norms": false,
                "type": "text"
              }
            }
          },
          {
            "string_fields": {
              "match": "*",
              "match_mapping_type": "string",
              "mapping": {
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                },
                "norms": false,
                "type": "text"
              }
            }
          }
        ]
      }
    }
}
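
To see the dynamic template in action, index a document with an arbitrary string field and query the generated keyword sub-field with an exact term match. The status field below is hypothetical:

PUT test/my_type/1
{
  "status": "In Progress"
}

GET test/_search
{
  "query": {
    "term": { "status.keyword": "In Progress" }
  }
}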

Dockerfile Explained

A Dockerfile is a script that includes a series of commands to automatically build a new Docker image from a base image. The Dockerfile is provided to the Docker daemon, which in turn executes the instructions inside the Dockerfile and creates the image.

Use Cases

One of the simplest use cases is customizing a Docker image pulled from Docker Hub, adding new commands or changing the provided entrypoint scripts.

A Dockerfile can also be useful for dynamic container provisioning. Imagine you work at a company that provides PaaS or FaaS. The service requests sent by your clients can be mapped to Dockerfiles; the Docker daemon then builds the images on demand and hands the containers back to your clients.

Instructions Used by Dockerfile

You may have already noticed that Dockerfile syntax is rather simple. Each line is either a comment or an instruction followed by arguments, as shown below.

# Comment
INSTRUCTION arguments

We will now walk through a sample Dockerfile, taken from a Jupyter build, and explain the structure and commands step-by-step.

Dockerfiles use # for line comments. The FROM instruction indicates the base image to use; in this example, jupyter/pyspark-notebook. If the base image isn’t already on your host, the Docker daemon will try to pull it from Docker Hub.

# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
FROM jupyter/pyspark-notebook

Define the maintainer.

MAINTAINER Jupyter Project <jupyter@googlegroups.com>

Define the user that runs the container.

USER root

The ENV instruction sets environment variables that can be accessed by the processes running inside the container. This is equivalent to running export VAR=arguments in a Linux shell.

# RSpark config
ENV R_LIBS_USER $SPARK_HOME/R/lib

The RUN instruction executes its arguments, in this case apt-get, inside the container. RUN takes effect only at build time.

# R pre-requisites
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    fonts-dejavu \
    gfortran \
    gcc && apt-get clean && \
    rm -rf /var/lib/apt/lists/*

USER $NB_USER

# R packages
RUN conda config --add channels r && \
    conda install --quiet --yes \
    'r-base=3.3.2' \
    'r-irkernel=0.7*' \
    'r-ggplot2=2.2*' \
    'r-rcurl=1.95*' && conda clean -tipsy

# Apache Toree kernel
RUN pip --no-cache-dir install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
RUN jupyter toree install --sys-prefix

# Spylon-kernel
RUN conda install --quiet --yes 'spylon-kernel=0.2*'
RUN python -m spylon_kernel install --sys-prefix

Build the Image

The following example shows how to build an image using the Dockerfile. It is recommended that you build the image from the directory where the Dockerfile lives. Be careful about the dot at the end of the line; it instructs the build to use the current working directory as the build context.

## --rm  remove the intermediate containers after a successful build
## -t    name and optional tag, e.g., apache/toree:1.02; the default tag is latest
sudo docker build --rm -t repo:tag .

It is worth mentioning that Docker uses a cache to accelerate builds. If a new line is inserted into the Dockerfile, Docker will reuse the cached image layers before that line and rebuild everything from that line to the end.
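
If you suspect a stale cache is causing problems, docker build also accepts a --no-cache flag that forces every layer to be rebuilt from scratch:

sudo docker build --no-cache --rm -t repo:tag .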

Bring up ELK on Docker Swarm

Assuming there is a working Docker Swarm, this blog describes the steps to bring up an ELK stack on Docker Swarm.

First off, you need to decide whether the official ELK Docker images on Docker Hub work for you or whether you need custom images. If the official ones (Elasticsearch, Kibana, Logstash) serve the purpose, you may skip directly to the service creation section; otherwise you need to build the images on all individual nodes in the Swarm cluster or set up your own Docker registry.

Service Creation

All services should be created on the manager node in the Swarm cluster. First create an Elasticsearch service called es-master, mapping a host dir /data/es to /usr/share/elasticsearch/data within the container. This also assumes an overlay network es already exists.

docker service create \
               --network es \
               --name es-master \
               -p 9200:9200 \
               --mount type=bind,source=/data/es,destination=/usr/share/elasticsearch/data \
               elasticsearch
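
Note that a bind mount requires the source directory to exist on whichever node the task lands on. A minimal preparation step, run on each candidate node, might look like:

sudo mkdir -p /data/es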

Create a Kibana service called kibana, joining the es network. The -e option points Kibana to es-master. The example command uses a custom Kibana image called kibana/plugin.

docker service create \
               --network es \
               --name kibana \
               -p 5601:5601 \
               -e ELASTICSEARCH_URL=http://es-master:9200 kibana/plugin

To verify the services,

docker service ls

ID            NAME       REPLICAS  IMAGE          COMMAND
5w8v5jksx7h5  kibana     1/1       kibana/plugin  
bpojoyb5wz16  es-master  1/1       elasticsearch  

To see on which node kibana is running,

docker service ps kibana

ID                         NAME      IMAGE          NODE          DESIRED STATE  CURRENT STATE           ERROR
39sadh4cfpqp0zwdh6mbh47er  kibana.1  kibana/plugin  indocgubt104  Running        Running 34 seconds ago  

To launch Kibana in a browser, type node_IP:5601 in the URL bar. Note that you can use the IP address of either the manager node or the worker node that actually runs kibana.

Setup a Docker Overlay Network on Multiple Hosts

We have seen many use cases where one fires up a few Docker containers on a single host. To accommodate the growth of data or complexity in business, we need to consider running the containerized tasks on multiple physical hosts. One of the challenges is how to maintain the communications among the distributed tasks as if they were on the same host.

Fortunately, Docker provides a mechanism called overlay networking, which basically creates a VXLAN layer-2 overlay tunnel on top of layer 3, i.e., TCP/IP. The details won’t be discussed here, but interested readers can go here for more information. It is not hard to imagine that this allows two containers, sitting on different hosts, to talk to each other. Cool!

This blog will walk through a simple example to create a Docker Swarm that spans two physical hosts, and we will create an overlay network to stitch together the distributed containers.

Since version 1.12.0, Docker Engine natively includes Swarm mode, which makes bringing up a Swarm cluster much easier than using the previous standalone Swarm. Say there are two nodes, node 1 and node 2. We decide to elect node 1 as the manager node. Note that one can have more than one manager node in a Swarm cluster, but for the sake of simplicity we just use node 1.

Initiate the Swarm

The following command on node 1 will initiate the Swarm and elect that node as the manager. This command will also spit out the command you would use on node 2 to join the cluster.

docker swarm init
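
The output will look roughly like the following (the node ID, token, and IP below are placeholders):

Swarm initialized: current node (xxxxxxxxxxxx) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-<token> <manager_IP>:2377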

Copy and paste the output from the above and run it on node 2.

docker swarm join --token ...

Come back to node 1 to verify that the 2 nodes are present in the cluster.

docker node ls

ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
1gwudwxftloza3vldyr4p6p4y *  indocgubt103  Ready   Active        Leader
e9bcxw8vy1ow0jp80gopr2c58    indocgubt104  Ready   Active        

Create an Overlay Network

Now let’s create an Overlay Network called es. On node 1 run the following:

docker network create -d overlay es

To verify, run the following on node 1. Please note that es won’t show up on node 2 until a container actually uses the network.

docker network ls

NETWORK ID          NAME                DRIVER              SCOPE
43652a980910        bridge              bridge              local               
1ac35860a4cb        docker_gwbridge     bridge              local               
912wlikzt94x        es                  overlay             swarm               
5032a295b055        host                host                local               
1my3c1fbunaq        ingress             overlay             swarm               
7687da500317        none                null                local  

Attach the Service to the Overlay Network

Using the example in this post, I would like to deploy an Elasticsearch service to the es network.

docker service create \
               --network es \
               --name es-master \
               -p 9200:9200 \
               --mount type=bind,source=/data/es,destination=/usr/share/elasticsearch/data \
               elasticsearch

Now we bring up another test container to see whether it can talk to / ping es-master.

docker service create \
               --name test \
               --network es \
               busybox sleep 300000

Run docker service ps test to find out which node the busybox container was sent to, switch to that node, and run

docker exec -it container_ID /bin/sh
ping es-master

If nothing goes wrong, the ping should return results. More information about service communication on an overlay network is here

Deploy Elasticsearch 5.1.1 Docker

The Docker image can be pulled from here. Firing up the container is straightforward. The following shows the command line, where /data/es is a host dir prepared for persisting the ES data and -p 9200:9200 maps port 9200 within the container to the host.

docker run -d -p 9200:9200  -v /data/es:/usr/share/elasticsearch/data elasticsearch
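
The same check works from the command line, assuming curl is available on the host:

curl http://YOUR_HOST_IP:9200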

Pointing your browser to YOUR_HOST_IP:9200 should return something similar to

{
  "name" : "1sl-DGB",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "LZkO1Gd6StG-Bl3V5-Sa-g",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T12:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}

Depending on the default settings of the host OS, one might experience an error like the one shown below:

[2016-12-10T00:40:17,055][INFO ][o.e.t.TransportService   ] [1sl-DGB] publish_address {172.17.0.2:9300}, bound_addresses {[::]:9300}
[2016-12-10T00:40:17,060][INFO ][o.e.b.BootstrapCheck     ] [1sl-DGB] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
ERROR: bootstrap checks failed
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2016-12-10T00:40:17,065][INFO ][o.e.n.Node               ] [1sl-DGB] stopping ...
[2016-12-10T00:40:17,102][INFO ][o.e.n.Node               ] [1sl-DGB] stopped
[2016-12-10T00:40:17,102][INFO ][o.e.n.Node               ] [1sl-DGB] closing ...
[2016-12-10T00:40:17,115][INFO ][o.e.n.Node               ] [1sl-DGB] closed

This is caused by some Linux distributions shipping a low default for vm.max_map_count, which defines the number of virtual memory areas a process can own. The setting can be changed temporarily by typing this on the command line,

sudo sysctl -w vm.max_map_count=262144

or set permanently in /etc/sysctl.conf.
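
For the permanent route, the same key-value pair goes into /etc/sysctl.conf and is applied at boot (or immediately via sudo sysctl -p):

# /etc/sysctl.conf
vm.max_map_count=262144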

Bring up Elasticsearch as a service in Docker Swarm

Assuming an overlay network es already exists, the following command line will create an ES service in Docker Swarm.

docker service create \
               --network es \
               --name es-master \
               -p 9200:9200 \
               --mount type=bind,source=/data/es,destination=/usr/share/elasticsearch/data \
               elasticsearch

This service can be accessed by other services within the same overlay network through es-master:9200.
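
A quick way to verify, assuming another container (e.g., the busybox one from the overlay-network post) is attached to the same es network:

docker exec -it container_ID wget -qO- http://es-master:9200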

The Magic raw Field

Elasticsearch can automatically create a raw field, a.k.a. a not-analyzed field, for any string-typed field. This is useful because tokenized strings are not always what you want for sorting or aggregation in Kibana. Having a raw copy of the string content makes those tasks more efficient and sometimes less confusing.

However, Elasticsearch by default creates the raw field only for indices whose names match the pattern logstash-*, because such indices use the default index template, which specifies the creation of the raw field. There are at least two ways to create raw fields for a custom index: one is to make a custom template that works with your conf file; another, shown in this blog, is through the mappings.

The following example assumes there is a type vs in the index bedmaster. We only need to create the raw field for the string field Parameter in type vs.

  • Create the index.
PUT /bedmaster
{
  settings": {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  }
}
  • Create the mapping for vs.
PUT /bedmaster/_mapping/vs
{
    "vs": {
      "properties": {
        "Parameter": {
          "type": "string",
          "fields": {
            "raw": {
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
}
  • Run logstash again to insert the data.
  • Verify the mappings are correct.
GET /bedmaster/vs/_mapping

{
  "bedmaster": {
    "mappings": {
      "vs": {
        "properties": {
          "": {
            "type": "double"
          },
          "%": {
            "type": "double"
          },
          "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "@version": {
            "type": "string"
          },
          "Bpm": {
            "type": "double"
          },
          "BrMin": {
            "type": "double"
          },
          "Label": {
            "type": "string"
          },
          "Location": {
            "type": "string"
          },
          "Parameter": {
            "type": "string",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "host": {
            "type": "string"
          },
          "mm": {
            "type": "double"
          },
          "mmHg": {
            "type": "double"
          },
          "path": {
            "type": "string"
          }
        }
      }
    }
  }
}
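
With the raw field in place, exact-match operations no longer depend on the analyzer. For example, a term query against Parameter.raw matches the original, untokenized value:

GET /bedmaster/vs/_search
{
  "query": {
    "term": { "Parameter.raw": "HR" }
  }
}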

Elasticsearch: Term Query vs Match Query

Given the sample data, consider the following two queries:

Query 1

GET bedmaster/_search
{
    "query" : {
        "term" : { "Parameter" : "HR" }
    }

}

Query 2

GET bedmaster/_search
{
    "query" : {
        "match" : { "Parameter" : "HR" }
    }

}

While documents with Parameter: HR clearly exist, Query 1 does not return anything, whereas Query 2 returns results as expected. Why? This is because “The term query finds documents that contain the exact term specified in the inverted index”; see the guide.

The content of a document is tokenized into terms and inverted-indexed, which means Elasticsearch knows in which documents the queried term appears. When term is used in a query, one needs to provide the tokenized content in order for a match, in this case the lowercased “hr”; see Query 3.

Query 3

GET bedmaster/_search
{
    "query" : {
        "term" : { "Parameter" : "hr" }
    }

}

By default a string field is analyzed, which means its content goes through an analyzer and gets tokenized for full-text search. A term query does not care about this process; it simply tries to match the stored tokens, so the content provided in a term query needs to be in tokenized form as well. A match query, as in Query 2, sends the query clause to the analyzer first, which is why it sees hits even with the uppercased “HR”.
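
You can inspect what the analyzer does to the query text with the _analyze API; with the standard analyzer, “HR” comes back as the single lowercased token hr:

GET _analyze
{
  "analyzer": "standard",
  "text": "HR"
}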

Sample XML Data (1)

<?xml version="1.0"?>
<!--BedMasterEx Version 4.3-->
<BedMasterEx>
<VitalSignInfo>
            <Location>K2ICU_BED01</Location>
            <Label>W027</Label>
            <Started>10/07/15 15:02</Started>
            <Stopped>10/08/15 16:08</Stopped>
            <Interval>1 Minute</Interval>
            <Duration>1.01:06:25</Duration>
            <Averaged>False</Averaged>
            <Comment >test comment </Comment>
      </VitalSignInfo>
  <VitalSigns CollectionTime="10/7/15 15:02:09">
            <VitalSign>
                  <Parameter>HR</Parameter>
                  <Value UnitOfMeasure="Bpm">128</Value>
                  <AlarmLimitLow Label="HR LO">50</AlarmLimitLow>
                  <AlarmLimitHigh Label="HR HI">150</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>PVC</Parameter>
                  <Value UnitOfMeasure="Bpm">0</Value>
                  <AlarmLimitHigh Label="PVC HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-AVR</Parameter>
                  <Value UnitOfMeasure="mm">-0.5</Value>
                  <AlarmLimitLow Label="ST-AVR LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-AVR HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-AVL</Parameter>
                  <Value UnitOfMeasure="mm">0.9</Value>
                  <AlarmLimitLow Label="ST-AVL LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-AVL HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-AVF</Parameter>
                  <Value UnitOfMeasure="mm">-0.4</Value>
                  <AlarmLimitLow Label="ST-AVF LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-AVF HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-I</Parameter>
                  <Value UnitOfMeasure="mm">0.8</Value>
                  <AlarmLimitLow Label="ST-I LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-I HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-II</Parameter>
                  <Value UnitOfMeasure="mm">0.2</Value>
                  <AlarmLimitLow Label="ST-II LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-II HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-III</Parameter>
                  <Value UnitOfMeasure="mm">-1</Value>
                  <AlarmLimitLow Label="ST-III LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-III HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-V</Parameter>
                  <Value UnitOfMeasure="mm">-0.1</Value>
                  <AlarmLimitLow Label="ST-V LO">0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-V HI">10</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>ST-V1</Parameter>
                  <Value UnitOfMeasure="mm">-0.1</Value>
                  <AlarmLimitLow Label="ST-V1 LO">-2.0</AlarmLimitLow>
                  <AlarmLimitHigh Label="ST-V1 HI">2.0</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>RESP</Parameter>
                  <Value UnitOfMeasure="BrMin">28</Value>
                  <AlarmLimitLow Label="RESP LO">8</AlarmLimitLow>
                  <AlarmLimitHigh Label="RESP HI">30</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>APNEA</Parameter>
                  <Value UnitOfMeasure="">0</Value>
                  <AlarmLimitLow Label="APNEA LO">8</AlarmLimitLow>
                  <AlarmLimitHigh Label="APNEA HI">30</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>AR1-S</Parameter>
                  <Value UnitOfMeasure="mmHg">85</Value>
                  <AlarmLimitLow Label="AR1-S LO">76</AlarmLimitLow>
                  <AlarmLimitHigh Label="AR1-S HI">200</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>AR1-D</Parameter>
                  <Value UnitOfMeasure="mmHg">43</Value>
                  <AlarmLimitLow Label="AR1-D LO">37</AlarmLimitLow>
                  <AlarmLimitHigh Label="AR1-D HI">90</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>AR1-M</Parameter>
                  <Value UnitOfMeasure="mmHg">56</Value>
                  <AlarmLimitLow Label="AR1-M LO">40</AlarmLimitLow>
                  <AlarmLimitHigh Label="AR1-M HI">120</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>AR1-R</Parameter>
                  <Value UnitOfMeasure="Bpm">128</Value>
                  <AlarmLimitLow Label="AR1-R LO">50</AlarmLimitLow>
                  <AlarmLimitHigh Label="AR1-R HI">150</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>SPO2-R</Parameter>
                  <Value UnitOfMeasure="Bpm">127</Value>
                  <AlarmLimitLow Label="SPO2-R LO">50</AlarmLimitLow>
                  <AlarmLimitHigh Label="SPO2-R HI">130</AlarmLimitHigh>
            </VitalSign>
            <VitalSign>
                  <Parameter>SPO2-%</Parameter>
                  <Value UnitOfMeasure="%">95</Value>
                  <AlarmLimitLow Label="SPO2-% LO">90</AlarmLimitLow>
                  <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
            </VitalSign>
      </VitalSigns>
            
</BedMasterEx>

Elasticsearch Cheat Sheet

Create index

ES can automatically create an index during data ingestion, but it also allows one to create an index in advance, where settings such as shards and replicas can be explicitly specified.

PUT /index_name
{
  "settings": {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  }
}
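
To confirm the settings took effect, fetch them back:

GET /index_name/_settings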

Queries

Assuming there is an index test that includes two types, list and other, one can perform some simple queries to get insight into the data.

List all the indices in an Elasticsearch instance and check cluster health.

GET /_cat/indices
GET /_cat/health

List mappings and settings of test index.

GET /test

List all documents in test index.

GET /test/_search

List matching documents in test.

GET test/list/_search
{
  "query": {
    "match": {
      "tags": "elasticsearch" 
    }
  }
}
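
To get just the number of matching documents instead of the documents themselves, the same query body works with the _count API:

GET test/list/_count
{
  "query": {
    "match": {
      "tags": "elasticsearch"
    }
  }
}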