Searchable Snapshots


In this blog we are going to discuss searchable snapshots, and how they transform the way data is stored and searched. By making data searchable in backups stored in AWS S3, Microsoft Azure Storage and Google Cloud Storage, the overall infrastructure cost is greatly reduced. Let’s take a look at how this is achieved.

Elasticsearch and Lucene

The main characteristics that define Elasticsearch are the following:

  • It is a free and open distributed search and analytics engine for all data types, including textual, numeric, geospatial, structured and unstructured.
  • It is built on Apache Lucene and is known for its simple, distributed, fast and scalable REST APIs.
  • It is the core component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis and visualisation. The stack now also includes a large collection of lightweight agents for shipping data.

The core of Elasticsearch is the Apache Lucene library, which includes functions for indexing, searching, retrieving and updating documents, and for text analysis. Its main features:

  • A search engine library written in Java.
  • A top-level Apache project since 2005.
  • Suitable for applications requiring structured search, full-text search, facets, nearest-neighbour search in high-dimensional vectors, spell checking or query suggestions.
  • Open source and available for free download.
  • Elasticsearch was born with the goal of simplifying the integration of search into any Java application; it is an open source, distributed, RESTful search engine built on Lucene.

Structure of an index

The data structure in Elasticsearch is divided as follows:

  • A document is indexed in an index.

    Example of indexing a document via the API (this implicitly creates the my_blogs index if it does not already exist):

				
curl -X POST "localhost:9200/my_blogs/_doc" -H 'Content-Type: application/json' -d'
{
  "title": "Fighting Ebola with Elastic",
  "category": "User Stories",
  "author": {
    "first_name": "Emily",
    "last_name": "Mosher"
  }
}'
  • An index is a logical way of grouping data; it can also be thought of as an optimised collection of documents. Each document is a collection of fields, and the fields hold the data in key-value form.
  • Elasticsearch stores its data using a data structure called an inverted index. It lists each unique word that appears in any document and identifies all the documents in which each word appears. Alphabetical indexes, such as the one at the back of a book, follow a similar structure to inverted indexes.
  • Elasticsearch data is stored via Lucene, which keeps its inverted indexes in a set of files called segments.
  • Segments are independent Lucene indices.
  • Multiple segments make up a *shard.
  • A shard is a fraction of an Elasticsearch index.

*Shard: a shard is many segments put together; what a segment is was defined in the points above, and each point adds another layer to the structure. You can inspect this structure directly, as shown below.
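To see this structure in practice, the segments API lists the Lucene segments behind each shard of an index. A minimal sketch, assuming the my_blogs index created above exists:

GET /my_blogs/_segments

The response shows, for each shard, its segments together with details such as document counts and size on disk; these segment files are exactly what snapshots later copy, segment by segment.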

Snapshot

While replication can protect a cluster from hardware failures, it cannot help when someone accidentally deletes an index. Therefore, an Elasticsearch cluster needs to perform regular backups.

Each snapshot is automatically deduplicated to save storage space and reduce network transfer costs. When a backup is taken, the index segments are copied and stored in the snapshot repository. Since segments are immutable, the snapshot only needs to copy any new segments created since the last snapshot.
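As a sketch of how this looks in practice, a snapshot repository can be registered and snapshots taken with the snapshot APIs; the repository name, bucket and snapshot name below are hypothetical, and the s3 repository type assumes S3 support is available in your deployment:

PUT /_snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-backup-bucket"
  }
}

PUT /_snapshot/my_s3_repository/snapshot_1?wait_for_completion=true

A second snapshot taken against the same repository reuses the segment files already uploaded by the first, as the metadata blocks below illustrate.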

Example of what each snapshot's metadata block could contain:

				
{
  "name" : "snapshot_1",
  "index-version" : 5,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_0.cfs",
    "length" : 8037,
    "checksum" : "trhzg",
    "part_size" : 104857600
  }, {
    "name" : "__1",
    "physical_name" : "_0.cfe",
    "length" : 314,
    "checksum" : "14i5z7r",
    "part_size" : 104857600
  }, {
    "name" : "__2",
    "physical_name" : "_0.si",
    "length" : 270,
    "checksum" : "19azdai",
    "part_size" : 104857600
  }, {
    "name" : "__3",
    "physical_name" : "segments_5",
    "length" : 107,
    "part_size" : 104857600
  } ]
}

				
			
				
{
  "name" : "snapshot_2",
  "index-version" : 6,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_0.cfs",
    "length" : 8037,
    "checksum" : "trhzg",
    "part_size" : 104857600
  }, {
    "name" : "__1",
    "physical_name" : "_0.cfe",
    "length" : 314,
    "checksum" : "14i5z7r",
    "part_size" : 104857600
  }, {
    "name" : "__2",
    "physical_name" : "_0.si",
    "length" : 270,
    "checksum" : "19azdai",
    "part_size" : 104857600
  }, {
    "name" : "__4",
    "physical_name" : "segments_6",
    "length" : 107,
    "part_size" : 104857600
  } ]
}

				
			

We can see that the important information is in the entries whose physical names are “segments_*”. These files are defined as:

| Name | Extension | Description |
| --- | --- | --- |
| Segments File | segments.gen, segments_N | Stores information about a commit point. The active segments in the index are listed in the segment info file, segments_N. |
| Compound File | .cfs, .cfe | An optional "virtual" file made up of all the other index files, for systems that frequently run out of file handles. |

Data Tiers

In order to optimise the cost and performance of a cluster, Elasticsearch defines data tiers. These let a data node take on a specific behaviour through the assignment of a role, rather than achieving that behaviour through attributes and configuration parameters.

Old configuration:

				
# On the hot node
node.attr.node_type: hot

# On the warm node
node.attr.node_type: warm

# On the cold node
node.attr.node_type: cold

PUT /myindex
{
  "settings": {
    "index.routing.allocation.include.node_type": "hot"
  }
}

Now this is done directly with node roles:

				
#../elasticsearch.yml
node.roles: ["data_hot", "data_content", "ingest", "ml"]

				
			

This functionality is particularly interesting for use cases where fast-growing content has to be managed, such as catalogue storage or daily log ingestion.

The optimisation is based on the search frequency of the data:

HOT TIER

  • recent data
  • highly relevant information that is constantly being searched
  • most expensive
  • best performance
  • fast disks (SSDs)

WARM TIER

  • average price
  • data from recent weeks
  • somewhat relevant information
  • searches are carried out occasionally

COLD TIER

  • cheaper
  • data from recent months
  • not very relevant
  • searches are rarely carried out

FROZEN TIER

  • very cheap, almost zero cost
  • data from recent years
  • irrelevant data (for when the lawyer asks)

These tiers follow the life cycle of the data. When data is first ingested, it is likely to be heavily searched. When investigating an incident, for example, rapid access to all relevant data is needed to identify and resolve the problem. When an attacker compromises a host or application, the ability to respond quickly often determines the impact of the breach.

Data can also be categorised into different levels of use depending on the source or type. Some data may only be needed for legal or compliance reasons, or to look back occasionally for comparison.

Users therefore need different levels of processing power and storage for these different needs, whether based on age, source of the data or other criteria.

What the data tiers provide us with is:

  • Content and time series data are handled in a simplified way.
  • We move from configuring node attributes to using node roles.
  • Elasticsearch automatically manages the allocation and relocation of data to the appropriate tier as it moves through each stage (a sketch of the underlying setting follows below).
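As a sketch of what this looks like at the index level, allocation is now expressed as a tier preference rather than a custom attribute (the index name is hypothetical):

PUT /my-index/_settings
{
  "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
}

With this setting the index is placed on warm nodes if any are available and falls back to hot nodes otherwise; ILM adjusts the preference automatically as the index moves through its life cycle.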

Searchable Snapshots

After looking at the different data tiers for data lifecycle management, we will focus on the “cold” and “frozen” tiers. The cold tier, driven by searchable snapshots, can reduce storage costs by up to 50% by increasing the local storage density of read-only data through offloading the redundant copy of the data to a low-cost object store.

The frozen tier stores data exclusively in the low-cost object store while maintaining the ability to search it, with local caching for fast lookups on frequently accessed data.

In the hot/warm tiers, the local disks hold our indices split into primary and replica shards. The replica copy exists so that, when there is a problem with a shard, the cluster can continue indexing and retrieving data, which gives the cluster resilience.

In the cold tier, we keep the primary shards on our nodes and the redundant copy in S3 as a snapshot; local replicas no longer exist. The data is downloaded and persisted locally, and in case of problems Elasticsearch recovers the shard from the snapshot when necessary. Costs are reduced without really impacting performance, since local storage is roughly halved.

The data at this tier should not change: there is no new indexing and no modification.

On the frozen side, the snapshot is stored directly in the cloud. This is very old data that we rarely look at, so we want to keep almost all of it in S3. When we want to query data at this tier, we retrieve it directly from that copy.

Is it worth restoring an entire index only to find that it doesn’t have the information we want?

We are interested in the data structures in Lucene (Meta Lookup, Doc Values, Stored Fields, Term Dictionary, Term Proximity, Normalization Factors, Point Values). These are the data that are actually indexed in the shards and that allow us to serve search and aggregation requests.

The data stored in the Lucene indexes comes in different types: some structures are used for time-range searches, others for keyword searches, others to store the original source of the data, and so on.

The Elastic team decided to optimise the format of the Lucene data and simplify the structure of these files, since there is no need to access large amounts of data to serve certain types of request.

It is not necessary to access all the documents in the snapshot, only those necessary to perform the search. Therefore, the number of documents to be downloaded at the time of the query is reduced.

We will have shards that give access to the snapshots that have been taken: the shard is queried, and only the data that matters for the search is restored onto the node. This is the basis of the searchable snapshot intelligence: data is retrieved on demand, through a caching system.

Therefore, frozen searchable snapshots give us (see the mounting sketch after this list):

  • an index backed by a snapshot that looks and behaves like a regular index
  • only the data needed by searches is downloaded
  • recently fetched data is kept in a persistent local cache, so repeated reads behave as if the data were local
  • the cost comes down to little more than the object storage (S3) itself
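A minimal sketch of mounting an index from a snapshot into the frozen tier, reusing the hypothetical my_s3_repository and snapshot_1 from the earlier examples; the storage=shared_cache option asks Elasticsearch to keep only a local cache of the pieces that searches actually touch rather than a full copy:

POST /_snapshot/my_s3_repository/snapshot_1/_mount?storage=shared_cache
{
  "index": "my_blogs"
}

On frozen nodes, the size of that on-disk cache is controlled by the xpack.searchable.snapshot.shared_cache.size setting in elasticsearch.yml.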

ILM and Mount API

Index lifecycle management (ILM) provides conveniences to facilitate data management on hot nodes (fast machines with SSDs) and warm/cold nodes (lower-cost machines that may have spinning disks). Snapshot lifecycle management (SLM) further facilitates the use of low-cost object stores from AWS, Google, Azure and on-premises storage providers to take and store backups.
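As an illustration of how ILM ties the tiers and searchable snapshots together, a policy can roll data over while it is hot and later convert it into a searchable snapshot in the cold phase. A minimal sketch, with hypothetical policy and repository names:

PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my_s3_repository" }
        }
      }
    }
  }
}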

To create a searchable snapshot index, an index from an existing snapshot repository is mounted by means of:

				
POST /_snapshot/<repository>/<snapshot>/_mount
				
			

The request body takes the following parameters (see the sketch after this list):

  1. The name of the index inside the snapshot to mount.
  2. The name of the index to be created.
  3. Any index settings to add to the new index.
  4. A list of index settings to ignore when mounting the snapshotted index.
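A minimal sketch of such a request, with hypothetical repository, snapshot and index names:

POST /_snapshot/my_s3_repository/snapshot_1/_mount?wait_for_completion=true
{
  "index": "my_blogs",
  "renamed_index": "my_blogs_mounted",
  "index_settings": {
    "index.number_of_replicas": 0
  },
  "ignore_index_settings": [ "index.refresh_interval" ]
}

Here "index" corresponds to point 1, "renamed_index" to point 2, "index_settings" to point 3 and "ignore_index_settings" to point 4.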

Conclusion

Searchable snapshots allow you to use backups to search infrequently accessed, read-only data in a very cost-effective way. The cold and frozen data tiers use searchable snapshots to reduce storage and operating costs. They can eliminate the need for replicas once data moves beyond the hot and warm tiers, which can halve the local storage needed to search your data.

Irene Hernández Borrás
Ana Ramírez
