Indexes are the logical unit that allows us to organize data in elasticsearch. The easiest way to use indexes would be to create one for each group of logs/metrics that share a certain structure or set of fields. Given that we are mainly storing logs or metrics, this strategy will cause us a series of problems:
- How do we keep the indices/shards from growing too large?
- How do we delete old data? With a delete by query? That can be a pretty slow process.
- If the indices (and specifically the shards) grow too much, it will take a long time to recover the shards in the event of a failure or restart.
To mitigate these problems, it is very common to generate new indices every day or every week. In Logstash this is easy to do. For example, with the following output, the name of the index we write to will depend on the current day:
output {
  elasticsearch {
    hosts => ["http://elasticsearch.datadope.io:9200"]
    index => "app-%{+YYYY.MM.dd}"
  }
}
This strategy gives us a series of advantages:
- If we want to delete data for a specific day, it is as simple as deleting the index for that day (see the sketch after this list).
- The indices no longer grow so uncontrollably. At most, an index will contain all the data that has been generated in a day.
- More efficient searches: if we need to search today's or yesterday's data, we can query only the indices for those days, without having to search all the data.
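For example, dropping a whole day of data then becomes a single request from "Dev Tools". A minimal sketch, assuming the daily index pattern from the output above (the date is purely illustrative):

DELETE app-2022.01.15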
However, with this strategy we still have a number of problems. On the one hand, we must bear in mind that, especially with logs, the amount of data that arrives can be highly variable. So even though we have automated the deletion of old indices, sometimes more data than usual arrives, filling the disks of the data nodes, forcing manual interventions and making us periodically review the retention of each type of log. In addition, having a daily index for each type of log can be excessive: if we have many types of data, we will end up with many indices (and many shards). In elasticsearch, the more shards we have (ignoring shards with very little data, which have no impact), the more memory we will need on our nodes. Therefore, having many small or medium-sized indices will have an impact on the memory usage of the data nodes. In general, it is advisable to keep shards between a few GB and a few tens of GB.
To solve these problems, rollover indexes and ILM (Index Lifecycle Management) appeared. With this strategy, we initially create an index (the bootstrap index) and an alias pointing to that index. In general, we use the alias for both reading and writing, without having to worry about which indexes sit behind it. When the index exceeds the size configured in the ILM policy, elasticsearch will run a rollover of that index (we can also trigger the rollover based on the age of the index or the number of documents it contains), which generates a new index; from that moment on, all requests to index new documents through the alias go to the new index. Reads are performed on the current write index and the previous indexes (that is, on all the indexes that the alias covers). After the rollover, ILM moves the indices through a series of phases as time passes. This allows us, for example, to use SSD disks for recent "hot" data with many reads and writes, and to send old, rarely queried indexes to mechanical disks. In each phase we can execute different maintenance tasks, such as forcemerge, shrink, etc. The general idea is that instead of using indexes directly, we use aliases with ILM policies that control the size and life cycle of the underlying indices.
The alias always has exactly one index to which new data is added (the index with the is_write_index parameter set). All other indexes are used only for reading and updating documents. Each time a rollover occurs, a new index is generated and becomes the write index.
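As a reference, a minimal sketch of this manual bootstrap from "Dev Tools", assuming an alias called app-logs (the names are just examples):

PUT app-logs-000001
{
  "aliases": {
    "app-logs": {
      "is_write_index": true
    }
  }
}

From then on, both indexing and searches are sent to the app-logs alias, and ILM takes care of rolling over to app-logs-000002 and beyond.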
Data streams
Data streams are a new abstraction introduced in Elasticsearch 7.9 to manage time series data, and they are the recommended way to store append-only data (that is, data we will not modify or delete). To read or write, we send our requests to the data stream and, as with aliases, the data stream maintains a series of rollover indexes underneath, which can also rotate with ILM policies, exactly as with aliases. Likewise, a data stream has a single active index to which writes of new documents are directed, while the rest of the indexes remain accessible in read mode when we query the data stream. We can use data streams directly as a data source when creating index patterns in Kibana (called data views from Kibana 8 onward), in Watcher searches, or even as data sources in Grafana. Data streams therefore work very much like aliases with ILM policies and rollover indexes, but there are some differences worth highlighting:
- When we want to use ILM with alias-based rollover indexes, for each new data group we need to manually create the first index (bootstrap index) as well as the alias pointing to it. With data streams this is no longer necessary: we can create a data stream simply by sending write requests to it (as long as a matching template has been defined in elastic).
- With aliases, we defined an initial index name (for example log-demo-000001) and each time a rollover occurred, new indexes were generated: log-demo-000002, log-demo-000003, etc. With data streams the underlying indexes are generated automatically and use the following nomenclature:
.ds-<data-stream>-<yyyy.MM.dd>-<generation>
where <data-stream> is the name of the data stream, <yyyy.MM.dd> is the creation date of the index and <generation> is a six-digit number starting at 000001. Each time a rollover occurs, this number increases.
- With aliases we could execute update or delete operations on documents. With data streams this is not possible directly; however, it is still allowed to call the _update_by_query API (to update documents) and the _delete_by_query API (to delete documents), as shown in the example after this list. Note that these restrictions apply at the data stream level: if we run update or delete operations directly against the underlying indices, they will work without problems (as long as the configuration of the indexes allows it).
- All the documents that we want to ingest into a data stream must have the @timestamp field defined.
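For example, removing old documents from a data stream could look like this (a minimal sketch, assuming a data stream called logs-demo-alpha and a 30-day cutoff that is purely illustrative):

POST logs-demo-alpha/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-30d"
      }
    }
  }
}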
When naming data streams, Elasticsearch recommends using the following nomenclature:
{type}-{dataset}-{namespace}
The type field represents the data type (in the current Elastic standard this field must be either logs or metrics). The dataset field is a name that identifies the type of data as well as its structure. Finally, the namespace field represents an arbitrary, user-defined grouping. Example: logs-nginx.access-prod.
Logstash and indexes with dynamic names
Since Logstash 7.13.0, the elasticsearch output officially supports data streams through a series of new options that make writing to data streams easier. Among other things, these options:
- Provide better integration with Elastic Agent, since Elastic Agent (or any other agent) can now include the data_stream.* fields in the document itself; Logstash will understand them and use them to decide where and how to index the data, and they can be explicitly overridden with the data_stream_* configuration options of the elasticsearch output.
- Push us to use the Elastic Common Schema and the naming scheme that Elastic has defined for data streams, which follows the nomenclature described above.
Therefore, if we want to follow Elastic's standards as rigorously as possible, we should use these data_stream_* options in the elasticsearch output. However, the fact that Logstash has these new options for data streams does not mean we are obliged to use them whenever we want to write to a data stream. We can, and in some particular cases will even have to, do without them, as we will see below.
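For reference, using these options directly could look like this (a minimal sketch; the dataset and namespace values are just placeholders, not taken from the example later in this post):

output {
  elasticsearch {
    hosts => ["http://elasticsearch.datadope.io:9200"]
    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "app"
    data_stream_namespace => "prod"
  }
}

With this configuration, the documents would end up in a data stream named logs-app-prod, following the {type}-{dataset}-{namespace} nomenclature.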
Sometimes we may be interested in using fields from the document itself to determine the name of the index we want to write to. Let's imagine we have an application called app for which we want to index documents into different indexes based on the app_type field of the document. We could use the classic daily index approach:
output {
  elasticsearch {
    hosts => ["http://elasticsearch.datadope.io:9200"]
    index => "app-%{app_type}-%{+YYYY.MM.dd}"
  }
}
As we saw at the beginning, daily indices are not the most optimal way to manage data in elastic. What if we wanted to use aliases backed by rollover indexes with ILM policies? To use these aliases, we must initialize the first index and create an alias pointing to it. For specific, controlled cases, we can create these indexes and aliases by hand. But if we have many different cases and need these indexes and aliases to be generated automatically, we have a problem, since Logstash has never been able to do this. With data streams, however, this restriction disappears: as we have said, to create a data stream we only need a template that matches the name of the data stream we want to create, and when the write request is sent, the new data stream will be created if it does not exist.
To write to data streams from Logstash we can use the data_stream_* options of the elasticsearch output. But when the name of the data stream we want to write to comes from a variable, these options do not work, since they do not interpret variables. This is not a problem, though, because we can write to data streams as if they were normal indices.
We only need a template in elasticsearch that matches the name of the data stream we want to write to, and the following option in the Logstash output:
action => "create"
Example
We are going to test the generation of data streams from Logstash, using fields of the ingested document as part of the data stream name.
1. We first generate an ILM policy (this step is actually optional, since elasticsearch includes default policies we could use). We can do this through the elasticsearch API using "Dev Tools" in Kibana:
PUT _ilm/policy/demo-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          },
          "set_priority": {
            "priority": 75
          }
        }
      },
      "delete": {
        "min_age": "60d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}
2. We create a template so that elasticsearch knows that everything matching this pattern will be a data stream using the policy generated in the previous step.
PUT _index_template/logs-demo
{
  "index_patterns": [
    "logs-demo-*"
  ],
  "template": {
    "settings": {
      "index.lifecycle.name": "demo-policy"
    }
  },
  "data_stream": {},
  "priority": 300
}
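Optionally, before sending any data, we can verify that the template has been registered (a simple sanity check from "Dev Tools"):

GET _index_template/logs-demo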
3. We run the following Logstash pipeline, which repeatedly generates 6 different test messages, parses the msg and app fields, and indexes them in elasticsearch based on the app field:
input {
  generator {
    lines => [
      "msg=Test message 1 app=alpha",
      "msg=Test message 2 app=alpha",
      "msg=Test message 1 app=beta",
      "msg=Test message 2 app=beta",
      "msg=Test message 1 app=gamma",
      "msg=Test message 2 app=gamma"
    ]
  }
}
filter {
  dissect {
    mapping => {
      "message" => "msg=%{msg} app=%{app}"
    }
  }
}
output {
  elasticsearch {
    hosts => ["http://elastic.datadope.io:9200"]
    user => "ingestor"
    password => "***"
    index => "logs-demo-%{app}"
    action => "create"
  }
}
4. We can verify that data streams have been generated automatically for each type of app:
If we look at the index section, we can see that each data stream has generated a backing index, to which it writes the documents.
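The same verification can also be done from "Dev Tools", for example by listing the data streams created in this test:

GET _data_stream/logs-demo-*

The response should list each of the data streams along with its backing indices, which follow the .ds-<data-stream>-<yyyy.MM.dd>-<generation> nomenclature described earlier.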
That is, we have generated 3 data streams, which use rollover indexes and ILM policies underneath, automatically, from Logstash, without the need to manually create any alias or index. All we needed was a template (and a policy, if the existing ones do not meet our needs).
Conclusion
As we have seen, we can organize information in elasticsearch in many ways. If we want to optimize elasticsearch's resource usage and simplify the management and operation of our indexes, avoiding the imbalances generated by a variable volume of data, it is best to use rollover indexes with ILM policies. To use this indexing strategy, we used to have to manually generate an index (the bootstrap index) and an alias pointing to that index. With data streams this is no longer necessary.
Data streams do not introduce many new features; rather, they are a small evolution: a new abstraction that encapsulates the alias and the indexes controlled by it. With this new abstraction we can generate new data groups with their rollover indexes and ILM policies without having to intervene manually every time a new data group appears.
This allows us, among other things, to be able to generate rollover indexes dynamically based on information from the documents from Logstash.