Elasticsearch is not quite like other databases – this is unsurprising, because elasticsearch is actually an index server. Data storage is secondary and happens through attaching the input data to a special field for later retrieval (that’s the _source field). This has interesting implications for your workflow. For that reason, I’d like to present my workflow when developing elasticsearch indices.

What happens when you index

When you index data into elasticsearch, it goes through a step called “analysis”: Lucene splits the data into tokens and makes it searchable using those (at the index level, these tokens are called “terms”). As a short example: “brown fox” would mean that the document can be found using both “brown” and “fox”. The standard analyzer splits text on whitespace and special characters, lowercases the resulting terms and lets you find them. Had we used the keyword analyzer, the document could only be found using the full “brown fox”. If you want to know the details, the Analyze API is your best friend and explains a lot of what is happening.
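To make the difference concrete, here is a rough, illustrative sketch in Python – not Lucene’s actual implementation, just an approximation of the two behaviors:

```python
import re

def standard_analyze(text):
    # Rough approximation of the standard analyzer:
    # split on whitespace and special characters, lowercase each term.
    return [t.lower() for t in re.split(r"\W+", text) if t]

def keyword_analyze(text):
    # The keyword analyzer emits the whole input as a single term.
    return [text]

print(standard_analyze("Brown Fox"))  # ['brown', 'fox']
print(keyword_analyze("Brown Fox"))   # ['Brown Fox']
```

With the standard analyzer, searching for either term matches; with the keyword analyzer, only the exact full string does.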

The problem is: analysis happens only once, when the document is indexed. Should you change the analyzer later, all values currently in the index stay untouched until you index them again. This is the dreaded “reindex”.

Incompatible changes to mappings cause similar problems.

In development, this is especially problematic, as such changes are frequent and need to be tested quickly.

Luckily, all that can be mitigated.

Bulk

Elasticsearch has a very nice API for bulk insertions. It is really fast and easy to use. On top of that, the elasticsearch bulk format is easy to write and flexible. Simply put, it works by alternating one line describing a request with one containing a document. Every line has to end in a newline, including the last one.

{"index" : { "_index" : "my_index", "_type": "my_type", "_id": "bar"}}
{"text": "lorem ipsum"}

If you send this file to the bulk API, it will create one index request using the three parameters given. A bulk file isn’t valid JSON as a whole because it holds multiple documents. That’s why I name the files .bulk and not .json.

curl -XPOST localhost:9200/_bulk --data-binary @docs.bulk

Note the use of --data-binary: -d would strip newlines and thus break the format.
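Writing bulk files by hand gets old quickly. Here is a small sketch that generates one from a list of documents (the index and type names are the same made-up ones as above):

```python
import json

def to_bulk(docs, index="my_index", doc_type="my_type"):
    # docs is a list of (id, document) pairs. The format alternates one
    # action line and one document line; every line, including the last,
    # must end in a newline.
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

with open("docs.bulk", "w") as f:
    f.write(to_bulk([("bar", {"text": "lorem ipsum"})]))
```

The resulting docs.bulk can be sent to the bulk API exactly as shown above.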

Now, this is all very specific. The biggest problem is that the index is explicit and we might want to change it. Luckily, the API has you covered. Every individual index and type has a bulk endpoint that allows you to constrain the scope of the operation. E.g., you could send

{"index" : {"_type": "my_type", "_id": "bar"}}
{"text": "lorem ipsum"}

to

curl -XPOST localhost:9200/my_index/_bulk --data-binary @docs.bulk

and would express the same thing.

docs.bulk is the centerpiece of everything that follows. I always have one around that holds a reasonable amount of sample data. The sample data should be:

  • As natural as possible (live data, preferably)
  • As much as possible
  • Small enough that the bulk operation runs in a few seconds

1GB of data is not unreasonable.
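One way to keep such a file within those bounds is to sample it from a dump of live data. A sketch, assuming your live documents sit as one JSON object per line in live_dump.jsonl (both filenames are my own invention):

```python
import json
import random

def sample_bulk(src_path, dest_path, n=10000, index="my_index", doc_type="my_type"):
    # Reservoir-sample n lines from the dump, so memory stays bounded
    # even when the dump is far larger than the sample.
    sample = []
    with open(src_path) as src:
        for i, line in enumerate(src):
            if len(sample) < n:
                sample.append(line)
            else:
                j = random.randrange(i + 1)
                if j < n:
                    sample[j] = line
    # Write the sampled documents out in bulk format.
    with open(dest_path, "w") as dest:
        for line in sample:
            dest.write(json.dumps({"index": {"_index": index, "_type": doc_type}}) + "\n")
            dest.write(line if line.endswith("\n") else line + "\n")
```

Without an explicit _id, elasticsearch will generate one per document, which is usually fine for throwaway sample data.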

Mapping & Index settings

Mappings are the other important part: they describe how data is put into elasticsearch. They are also tightly bound to index settings: the previously mentioned analyzers, for example, are defined at the index level, not inside the mappings where they are used.

A settings file looks like this (recent elasticsearch can use yaml as file format here):

---
settings:
  analysis:
    analyzer:
      german_analyzer:
        tokenizer: standard
        filter: [standard, stop, lowercase, asciifolding, german_stemmer]
    filter:
      german_stemmer:
        type: stemmer
        name: light_german
mappings:
  my_type:
    properties:
      text:
        type: "string"
        analyzer: "german_analyzer"

This is the file where everything interesting happens. As you can see, I configured a modestly complex analyzer here. I’m not done yet; for example, I still want to configure some synonyms. This file will change a lot during development.

With this file, I can quickly create an index:

curl -XPOST localhost:9200/my_index --data-binary @settings.yaml

As a quick aside: If you have data with many fields, there is a trick to get mappings generated quickly. Index the data against elasticsearch and the dynamic mapping detection will generate a standard mapping for you. Download that mapping using:

curl localhost:9200/my_index/_mapping > mapping.json

Query

Usually, you have one or more queries to validate the resulting index. This part is simple – just put each query in a file and use the file with:

curl localhost:9200/my_index/_search --data-binary @my_query.json

Keep some around for quick validation of your ideas. Encoding them in actual tests yields bonus points, but these queries are usually exploratory in nature.

Adventurous natures keep their queries in the templating language of their choice and pipe the result to curl.
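A minimal version of that idea needs nothing beyond Python’s standard library (the query shape and the $term placeholder are my own example):

```python
import json
from string import Template

# A simple match query with a placeholder for the search term.
QUERY_TEMPLATE = Template("""
{
  "query": {
    "match": {
      "text": "$term"
    }
  }
}
""")

def render_query(term):
    # Substitute, then parse and re-serialize so broken
    # substitutions fail here instead of at the server.
    return json.dumps(json.loads(QUERY_TEMPLATE.substitute(term=term)))

print(render_query("fox"))
```

Piping the output to curl with --data-binary @- then completes the loop.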

Putting it all together

The only thing missing is the workflow itself. Well, that’s very simple. I’ll drop right into the shell:

# delete any previous versions of that index
curl -XDELETE $ES_HOST/$INDEX
# recreate using our settings
curl -XPOST $ES_HOST/$INDEX --data-binary @settings.yaml
# put our data in
curl -XPOST $ES_HOST/$INDEX/_bulk --data-binary @docs.bulk
# any additional setup you have to do
#....

Now, validate your queries. Rinse, repeat. Distribute that bundle to everyone in your team:

  • A settings file
  • A bulk data file
  • Queries to validate
  • Some script to make this quick

There are refinements to this process, e.g. I compiled the settings file out of multiple files (one per type) when I had a larger project.

The advantage of this workflow is that it is completely dependency-free and usable in every project.
