Machine learning is a field within the computer-science branch of artificial intelligence that refers to a family of algorithms able to improve automatically through experience. These algorithms are used for tasks where conventional, explicitly programmed algorithms are difficult or impossible to apply.
Machine learning algorithms produce models that can make predictions or decisions without the developer having to program that behavior explicitly.
The main objective of machine learning is to develop or improve algorithms that can learn from different types of data, producing general models that make accurate predictions or find patterns in the data set, depending on the purpose and paradigm.
Starting from the beginning, AI can be defined as the theory and development of computer systems that perform tasks which, a priori, only humans can perform: for example, identifying a certain animal in a photograph. It should be noted that AI can also refer to techniques that do not involve machine learning, such as classical computer vision, although the two are frequently associated.
Within AI there are several fields, and machine learning in particular is the one built on the premise that machines performing these tasks intelligently can (and should) learn from experience to improve their knowledge. It is not intelligence as such, since it cannot be compared with human intelligence; rather, it studies algorithms that perform complex tasks and adapt to changes in the data they deal with, without the need for manual human adaptation. It is not intelligence as such because, for example, these algorithms need many iterations to learn even simple concepts, and their associative capacity is quite poor, although projects such as OpenAI's DALL-E show that much progress is being made in this area.
Delving a little deeper, within Machine Learning we can find different paradigms:
- Supervised learning: The training data (the set of data used to train the algorithm) is labeled, so the algorithm has both the input (raw data) and the output (the correct prediction/label). During the training phase, it identifies patterns and trends within each label, and learns how to differentiate it from the others, so that it can recognize similar samples and label them correctly. During the validation phase, the model obtained in the training phase is applied to data it has never seen, and it predicts the labels of those samples. Since the data is labeled, and assuming the labels are correct, different metrics can be derived from comparing the predictions against the original labels (the confusion matrix) to evaluate the performance of the model. An example of this type of algorithm is KNN (k-nearest neighbors).
- Unsupervised learning: In this case, the data is not labeled, so it is the algorithm itself that, working mathematically, identifies patterns and hidden structures in the distribution of the data. These algorithms usually require more resources and take longer to train, but they are more powerful. Much of unsupervised learning is developed using neural networks, although there are hundreds of neural-network architectures and other algorithms capable of working with unlabeled data. The performance of these algorithms is quantified with cost functions, among other measures. An example of this type of algorithm is HCA (hierarchical cluster analysis).
- Reinforcement learning: The approach that tries to imitate how humans learn, always through trial and error: favorable responses are rewarded and unfavorable ones are penalized. An example of this type of algorithm would be Q-learning.
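To make the reinforcement learning paradigm concrete, here is a minimal sketch of tabular Q-learning on a toy one-dimensional environment. The environment, rewards and hyperparameters are invented for the example; real Q-learning setups differ in scale, not in the update rule shown.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning on a toy 1-D chain: action 1 moves right,
    action 0 moves left; only reaching the last state yields a reward."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        # start each episode in a random non-terminal state
        s = random.randrange(n_states - 1)
        while s != n_states - 1:
            # epsilon-greedy: explore occasionally, otherwise exploit
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # trial and error: rewarded outcomes raise Q, others decay it
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy (taking the action with the highest Q-value in each state) moves right toward the rewarded terminal state.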
The life cycle of a machine learning model, as mentioned at the beginning, begins with the selection of the algorithm to be used. Depending on the algorithm, the remaining steps may change, but as a general rule they follow the sequence of training, validation and deployment.
Typically, in supervised models, the dataset is divided into training data and validation data, usually at a 70/30 ratio. The training data is what the model will learn from, and is used in the so-called training phase. During this phase, the algorithm adjusts itself as it sees fit (tuning ratios, weights and so on) to identify or classify the training data; that is, it is given 600 pictures of cats and told that they are cats, and the algorithm (if selected correctly) will find the function that lets it recognize a cat.
In the validation phase, the model is given samples it has not seen before, produces a prediction for each one, and that output is compared with the sample's original label.
As can be seen, this only applies to supervised learning models, because in unsupervised learning there are no labels and the performance of a model is quantified with other functions, such as cost functions.
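The supervised workflow described above can be sketched in a few lines of pure Python: a 70/30 split, a simple KNN-style classifier, and a confusion matrix. The data and helper names are invented for the example; libraries such as scikit-learn provide production-ready versions of all three pieces.

```python
import math
import random
from collections import Counter

def train_validation_split(samples, labels, train_ratio=0.7, seed=42):
    """Shuffle the dataset and split it into training and validation sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_ratio)
    pick = lambda ids, seq: [seq[i] for i in ids]
    return (pick(idx[:cut], samples), pick(idx[:cut], labels),
            pick(idx[cut:], samples), pick(idx[cut:], labels))

def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def confusion_matrix(true_labels, predictions, classes):
    """Count (true label, predicted label) pairs for every class combination."""
    matrix = {(t, p): 0 for t in classes for p in classes}
    for t, p in zip(true_labels, predictions):
        matrix[(t, p)] += 1
    return matrix
```

With well-separated classes, the validation predictions should land on the diagonal of the confusion matrix, i.e. every held-out sample gets its correct label.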
Machine Learning on Elasticsearch
There are many tools that already provide the user with models ready to be applied to new data, such as Elasticsearch (ES).
ES provides a series of APIs that give access to hybrid machine learning models which, although their internals cannot be modified by the user, allow some customization. To make this model as generic as possible, it is born from the combination of four main algorithms:
- Clustering: groups similar samples.
- Time series decomposition: decomposes time-series data into its different components (FT).
- Bayesian distribution modelling: models the uncertainty of the input and output of the model.
- Correlation analysis: identifies correlations in the samples.
In ES, ML is mainly used to identify anomalous behaviors within a data set. That is, it generates a model of the behavior that certain data follows, detects deviations from that model, and categorizes them.
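The "model the baseline, score the deviation" idea can be illustrated with a deliberately crude sketch: treat the mean of a sliding window as "normal" and flag points that stray too far from it. ES's actual models (Bayesian, multi-algorithm) are far more sophisticated; the window size and threshold below are arbitrary choices for the example.

```python
import statistics

def detect_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # avoid division by zero
        if abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies
```

A spike of 50 in a series hovering around 10 would be flagged; once the spike enters the baseline window, it also inflates the standard deviation, so subsequent normal points are not misreported.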
Jobs & datafeeds
In ES, the models we create are referred to as jobs. These models aim to identify anomalous behaviors, that is, deviations from the model. Within each job you can define different detectors, which detect anomalies in different cases.
A detector applies a function to the data in a time bucket and detects anomalies in this processed data. The different detectors within the same job are independent.
The job (a JSON document) defines the logic of the model, as well as the fields of the indices that the model will use.
On the other hand, there is the datafeed, which is responsible for feeding data to the job. The datafeed, in general, sends data from one or more ES indices to an ML job, which processes it and generates a model. The datafeed is customizable at different levels, allowing you to apply query filters and configure delays for data that may arrive late.
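As a sketch of what these two JSON documents look like, here is a hypothetical job and datafeed pair built as Python dictionaries. The field names follow the Elasticsearch anomaly detection API (`analysis_config`, `detectors`, `bucket_span`, `data_description`, and the datafeed's `job_id`, `indices`, `query`, `query_delay`), but the index name, field names and job id are invented for the example.

```python
import json

# Hypothetical job: model request volume per country in a web-logs index.
job = {
    "analysis_config": {
        "bucket_span": "15m",  # width of each time bucket
        "detectors": [
            {   # one detector: flag unusually high event counts per country
                "function": "high_count",
                "by_field_name": "geo.country_iso_code",
                "detector_description": "High request count by country",
            }
        ],
        # entities scored as influencers on each anomaly
        "influencers": ["geo.country_iso_code"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# The datafeed streams documents from the index into the job.
datafeed = {
    "job_id": "web-traffic-anomalies",
    "indices": ["web-logs-*"],           # illustrative index pattern
    "query": {"match_all": {}},          # optional filter query
    "query_delay": "60s",                # wait for late-arriving documents
}

print(json.dumps(job, indent=2))
```

These bodies would be sent to the `PUT _ml/anomaly_detectors/<job_id>` and `PUT _ml/datafeeds/<datafeed_id>` endpoints, respectively.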
As far as anomalies are concerned, knowing the hierarchical levels of machine learning anomaly scores is essential to understanding the information each score conveys: what the anomaly depends on, how it manifests itself, and how it can be used as an alert indicator.
Record score
This is the lowest hierarchical level of anomaly scoring, and corresponds to the appearance of anomalous behavior for a particular sample. That is, it refers to the appearance of unusual behavior at its simplest level.
For example: in the last minute, requests to a certain API are 200% higher than normal.
Influencer score
This is the middle hierarchical level of anomaly scoring, and refers to quantifying the contribution of a certain entity or “influencer” to the anomalous behavior that appeared. Each entity is therefore given a score depending on its influence on the anomaly's score, which relates it to the rest of the entities. For example: of ten countries whose users use our website, knowing which of them has the greatest influence on an anomaly that occurred in a certain period of time.
Bucket score
This is the highest hierarchical level of anomaly scoring, and corresponds to the aggregation of the score as a function of time (specifically, of the bucket_span). It aims to consider different anomalous behaviors and relate them based on the magnitude of their record scores and the number of anomalies in that span of time, if they occur at the same time.
Note: it is not a simple average; it uses the influencer scores of each bucket_span.
For example: Evaluate user deviations from a service over a time interval.
To detect and alert on deviations of the global dataset over time, the bucket score is the most appropriate. To alert on the entities that carry the greatest weight in the most anomalous behaviors, the influencer score is the metric to analyze. To alert on the most severe anomaly in a time window, the record score should be used.
Finally, there is a type of score that, although it does not fit into the previous hierarchy, is very useful. The Multi Bucket Impact is a score whose value ranges between -5 and +5, and it quantifies anomalous behavior that extends over time. Specifically, the current time bucket and the previous eleven buckets are analyzed to determine whether the anomalous behavior, more or less regardless of its score, should be categorized as an anomalous event because of the duration it has had.