One of the public cloud services we have used most at Datadope in recent years is managed Kubernetes. Most of our customers use Google Cloud Platform, so almost all of the service migrations we have carried out have targeted GKE (Google Kubernetes Engine), the managed Kubernetes service offered by Google Cloud Platform.
We wrote this post to lay out a series of guidelines to follow when migrating a service to GKE, along with the mistakes we have made along the way, so that we can learn from them and avoid them in future projects.
What is GKE?
Let’s start with a brief introduction to what a managed Kubernetes service is and what it provides, focusing on the one offered by Google’s public cloud platform.
GKE stands for Google Kubernetes Engine.
- It is Kubernetes as a service, offered by Google Cloud Platform.
- Key features of Kubernetes as a service (although this post focuses on GKE, these features are common to most public cloud providers):
  - Simple and intuitive cluster creation process.
  - Automatic cluster scalability.
  - High cluster availability.
  - Comprehensive monitoring services.
Now that we know a little about what a managed Kubernetes service provides in a public cloud, let’s move on to the guidelines we recommend following when migrating a service to GKE.
Guideline 1: Analysis of the service to be migrated
We start with the basics: analyze the service to be migrated. We recommend making this analysis as exhaustive as possible, since many delays in delivery can be traced back to an incomplete analysis of the service requirements.
Because each application or service has very specific requirements, we list below the main characteristics that are generally worth analyzing in the service to be migrated.
Hardware resources used by the service
One of the most important requirements for a service or application to work correctly is the set of hardware resources it uses during operation.
We need to measure the resources our service is using. We can distinguish several states in the operation of a service:
- Usual operation over long periods, with low workload: the behavior and resource consumption the service has most of the time.
- Usual operation with high workload: moments in which the service has a high workload, but which we know about in advance.
Based on the analysis of the resources used in these usual states, we will configure the resource limits of the migrated service.
- Load peaks: moments in which the service has had load peaks of limited duration that are not part of its usual behavior.
This information will help us configure autoscaling for the service, so that these load peaks are covered at the specific times they occur.
The goal of the resource analysis is to measure the resources used by the service in each of these states, so that we have as much information as possible and can size as accurately as possible the resources we assign to the service when deploying it in our GKE cluster.
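As a rough illustration of how these measurements could be turned into sizing values, here is a minimal sketch in Python. The function names, the choice of the median for the usual state, the 95th percentile for the known high-load state, and the `headroom` factor are all assumptions made for this example, not a rule from our projects.

```python
import math
from statistics import median


def percentile(samples, pct):
    """Return the nearest-rank percentile (pct in 0..100) of a list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_sizing(low_load, high_load, headroom=1.2):
    """Suggest a request from usual low-load usage and a limit from known high-load usage.

    headroom is a hypothetical safety factor applied to the high-load percentile.
    """
    request = median(low_load)
    limit = percentile(high_load, 95) * headroom
    return {"request": request, "limit": max(limit, request)}


# Hypothetical CPU usage samples, in millicores.
cpu_low = [110, 120, 130, 125, 115]
cpu_high = [400, 450, 480, 500, 470]
sizing = suggest_sizing(cpu_low, cpu_high)
print(sizing)
```

Samples from load peaks are deliberately left out of this calculation: they are better used to tune autoscaling than to raise the limits of every replica.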
Necessary communications
Once the hardware requirements of our service have been analyzed, the next recommended step is to analyze the communications the service needs in order to function correctly.
The following questions can help with this analysis:
- Does our service need to be exposed to the Internet? If so, we need to know this early, since we will probably also need other resources, such as a DNS record for the service and a load balancer that forwards incoming requests from the Internet to it.
- What communications does our service have with our internal network or networks? Here we analyze the communications that must be allowed in our firewalls so that the service deployed in GKE can talk to our internal networks.
- Does our service have any component that performs communications in addition to those of the service itself? Logstash is a good example. Logstash is an application used to process all kinds of log files, which can be sent to it from different services and applications. We need to know every communication between Logstash and each of the services that send it their logs. These communications are added to the list of those to be enabled in our firewalls, so that the service keeps working correctly once it runs in GKE.
This communications analysis is an important part of the migration project, and an incomplete one can push us well past our deadlines. It can be tedious, but when done correctly it saves us from last-minute requests to open communications which, if they involve considerable bureaucracy, can delay the delivery of the project significantly.
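One way to keep this inventory manageable is to record every flow as structured data and derive the firewall requests from it. The sketch below is only an illustration: the `Flow` class, the host names, and the ports are hypothetical and would be replaced by the results of the real analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    destination: str
    port: int
    protocol: str = "tcp"
    temporary: bool = False  # only needed while the migration is in progress


def firewall_requests(flows):
    """Return the de-duplicated, sorted list of flows to request from the firewall team."""
    unique = set(flows)
    return sorted(unique, key=lambda f: (f.source, f.destination, f.port, f.protocol))


# Hypothetical inventory for a Logstash-like service running in GKE.
flows = [
    Flow("app-servers", "gke-logstash", 5044),
    Flow("gke-logstash", "onprem-elasticsearch", 9200),
    Flow("app-servers", "gke-logstash", 5044),  # duplicate found in a second review
    Flow("gke-logstash", "legacy-logstash", 5044, temporary=True),
]

for flow in firewall_requests(flows):
    note = " (temporary)" if flow.temporary else ""
    print(f"{flow.source} -> {flow.destination} {flow.protocol}/{flow.port}{note}")
```

Marking flows as temporary from the start makes it easier to review, before going to production, which openings were only meant for the migration itself.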
Configurations, credentials and sensitive data
Last but not least, we analyze the configuration of the service itself. Its configuration files can range from plain text files to binary files containing sensitive information. We need to know every file the service uses so that, at migration time, we can convert them into objects Kubernetes can use and associate them with the migrated service.
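In Kubernetes, the usual target objects for these files are ConfigMaps and Secrets. The following sketch, written under our own assumptions, shows how an inventory of files could be split into the two kinds of manifests. The function name, the file names, and their contents are hypothetical; real credentials should of course never be hard-coded like this.

```python
import base64
import json


def build_config_objects(name, files, sensitive):
    """Split files into a ConfigMap manifest and a Secret manifest.

    files: mapping of file name -> bytes content.
    sensitive: set of file names that must go into the Secret.
    """
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{name}-config"},
        "data": {},
        "binaryData": {},
    }
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{name}-secret"},
        "type": "Opaque",
        "data": {},
    }
    for filename, content in files.items():
        encoded = base64.b64encode(content).decode("ascii")
        if filename in sensitive:
            secret["data"][filename] = encoded  # Secret data is base64-encoded
            continue
        try:
            config_map["data"][filename] = content.decode("utf-8")
        except UnicodeDecodeError:
            config_map["binaryData"][filename] = encoded  # non-text, non-sensitive file
    return config_map, secret


# Hypothetical files found during the analysis of the service.
files = {
    "logstash.conf": b"input { beats { port => 5044 } }\n",
    "lookup.db": b"\xff\xfe\x00\x01",
    "keystore.bin": b"\x00\xff\x10\x20",
}
sensitive = {"keystore.bin"}

config_map, secret = build_config_objects("logstash", files, sensitive)
print(json.dumps(config_map, indent=2))
print(json.dumps(secret, indent=2))
```

The manifests are printed as JSON, which `kubectl apply -f` accepts as well as YAML.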
Guideline 2: Study of existing services in public cloud providers
We already know the requirements of our service from the analysis performed. The next step is to study the different solutions offered by the public cloud, in order to choose the one that best suits our needs.
Normally, when migrating a service to the public cloud, we will have two options:
- The public cloud provider offers the service or application we are migrating. This can be a good solution if its advantages fit our needs.
  Possible advantages of using a service offered by the cloud provider:
  - Easier management of the resources needed for the service.
  - High availability at international level.
  - Pay-per-use pricing that can be lower than running the service on GKE.
  - Specific technical support for the service.
  Possible drawbacks:
  - Higher cost.
  - Inability to integrate with on-premise systems and services.
- The public cloud provider does not offer the service because it is a custom service of the client. This post focuses on this case, but the previous option is worth mentioning: whenever the public cloud platform offers the service itself, we should check whether that option is better suited to the customer’s needs.
After analyzing what the different cloud providers offer and the requirements of our service, we can decide on the best option and make a proposal.
Guideline 3: Analyze and manage the integration with GKE
As mentioned, this post focuses on GKE. Once we have decided to use GKE to deploy the service we are migrating, we analyze the GKE setup available to us, whether it is an existing cluster in the customer’s infrastructure or a cluster deployed by us.
Here are some important questions and points that will help you analyze and manage the GKE service:
- Does the cluster have a defined structure (repositories, etc.), or should we create one? If the client already has a cluster where it deploys its applications, it is our job to understand its structure and to find out whether deploying a new application requires additional structure that we must create ourselves or request from the corresponding team on the client side.
- Is there a non-production environment? If so, it is advisable, whenever possible, to deploy the service to this environment first, in order to run checks and tests before finally deploying it to the production cluster.
- Namespaces. Which namespace will the migrated service use? If the cluster already exists, we need to know how its namespaces are organized; if we created the cluster, we need to decide in which namespace to deploy the service.
- Resources associated with the namespace. We must also know whether the namespace we are going to use has any resource limitation, so that we can request more resources if necessary.
- CI/CD. We will ask whether there is a repository structure and a tool that performs continuous integration and continuous deployment on the cluster, such as ArgoCD or a similar service.
- Subdomain associated with the cluster. If we are going to deploy a service that must be reachable through a domain name, we need to know how the subdomain associated with the cluster is managed: for example, whether we should deploy a tool such as Traefik for it, or whether the client’s cluster already has a similar tool.
- Own monitoring system. By default, GKE provides a system for collecting and visualizing cluster metrics; we can use it to monitor the service throughout the migration.
- Communications. Manage the communications required by the service to be migrated.
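For the namespace resource limits in particular, a quick calculation tells us whether we need to request more resources before deploying. The sketch below assumes the limitation is expressed as a Kubernetes ResourceQuota and that its `hard` and `used` values have already been read from the cluster. The quantity parser only covers a small subset of the suffixes Kubernetes accepts, and all the numbers are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; enough for a sketch, not a full parser.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "m": Decimal("0.001"),
}


def parse_quantity(text):
    """Convert a quantity such as '500m' or '2Gi' into a Decimal base value."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def exceeded_resources(hard, used, per_replica, replicas):
    """Return the quota entries that would be exceeded by deploying the replicas."""
    exceeded = []
    for resource, amount in per_replica.items():
        if resource not in hard:
            continue  # the quota does not restrict this resource
        needed = parse_quantity(used.get(resource, "0")) + parse_quantity(amount) * replicas
        if needed > parse_quantity(hard[resource]):
            exceeded.append(resource)
    return exceeded


# Hypothetical quota status of the target namespace and per-replica requests.
hard = {"requests.cpu": "4", "requests.memory": "8Gi"}
used = {"requests.cpu": "3", "requests.memory": "2Gi"}
per_replica = {"requests.cpu": "600m", "requests.memory": "1Gi"}

print(exceeded_resources(hard, used, per_replica, replicas=2))
```

With these numbers, two replicas would need 4.2 CPU in total against a quota of 4, so the CPU quota is the one to renegotiate before deploying.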
Guideline 4: Migrate the service to GKE
With the GKE infrastructure defined, the next step will be the migration of the service itself.
If we have a non-production cluster, we will perform the migration there before deploying the service in the production cluster. If we only have one cluster, the service will be migrated to that cluster, but it will not act as the production service until we have validated its operation.
The following are important steps in the migration process:
Resources associated with the service:
- Use the analysis of the service to define the resources it will use.
- Set resource limits for the service.
- Define the associated resources taking into account the replication feature of Kubernetes.
Examples of use:
- If the service we are migrating has to support a constantly high load, we can assign it several replicas with resources high enough to support that load. Whenever we can, we will assign at least two replicas to provide high availability.
- If the service generally has a low load with occasional peaks, we can associate autoscaling with the service, so that replicas are created automatically as soon as an increase in load is detected.
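To get a feeling for how autoscaling would react to a load peak, it helps to reproduce the calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio between the current and the target metric value, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified version that ignores stabilization windows and pod readiness. The minimum of two replicas reflects the high availability recommendation above, and the maximum of ten is a hypothetical value.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Simplified HPA calculation: scale by the ratio of current to target metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Average CPU utilization (%) against a hypothetical 60% target.
print(desired_replicas(2, 90, 60))   # load peak: scale out
print(desired_replicas(4, 20, 60))   # load drops: scale in, never below the minimum
print(desired_replicas(3, 63, 60))   # within tolerance: keep the current replicas
```

Running the numbers from the resource analysis through a function like this is a cheap way to check that the chosen target and replica bounds actually cover the observed peaks.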
Communications:
As with the resources, we will use the analysis of the service to adjust the communications required by the service itself and by all its components. Here are some guidelines to take into account:
- Communications in addition to the existing ones may be needed during the migration, mainly so as not to affect the service in production.
- Enable connectivity between the cluster and the on-premise networks if the service needs it.
- Find out whether applications deployed in the cluster need a proxy to access the Internet, if the service needs such access.
- Allow access from the Internet to the service, if necessary.
Once a non-production version of the service has been deployed, we can start the performance tests, to make sure before putting the service into production that we have everything it needs to work correctly.
Guideline 5: Functional testing
We will test the migrated service before putting it into production.
We recommend performing at least two kinds of tests. Resource tests help us check whether the configuration assigned to the service is enough to cover its functionality, or whether we are assigning more resources than needed, which leads to unnecessary expense. Communications tests let us detect whether all the connections required by the service are available.
Here are some additional guidelines on these two tests:
Resource testing:
- If possible, send the same workload to both the production service and the newly migrated service, so that the tests are as close to reality as possible.
- Adjust the number of replicas according to the test results.
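A simple way to read the results of a resource test is to compare the peak usage observed under the replayed workload with the configured request and limit. The sketch below is only an illustration: the function name and the `low` and `high` thresholds are assumptions chosen for the example, and the values are hypothetical CPU figures in millicores.

```python
def assess_sizing(peak_usage, request, limit, low=0.5, high=0.9):
    """Classify a resource setting from the peak usage observed during the test.

    low and high are hypothetical thresholds: usage below low * request suggests
    wasted resources, usage above high * limit suggests too little margin.
    """
    if peak_usage > limit * high:
        return "under-provisioned"
    if peak_usage < request * low:
        return "over-provisioned"
    return "ok"


# Hypothetical results for three test runs with request=200 and limit=600.
for peak in (580, 80, 300):
    print(peak, assess_sizing(peak, request=200, limit=600))
```

An "under-provisioned" result points to raising the limits or adding replicas, while "over-provisioned" points to the unnecessary expense mentioned above.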
Communications tests:
- Test the communications of both the service and all its components.
- If temporary communications were opened for the migration process, make sure that the ones needed once the service is in production are also available.
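The communications tests can be partly automated with a small script run from inside the cluster, for example from a pod in the target namespace. The following is a minimal sketch that only checks whether a TCP connection can be opened; it says nothing about whether the application protocol on top works. The function names are our own and the hosts in the comment are hypothetical.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(endpoints, timeout=3.0):
    """Check a list of (host, port) pairs and return a mapping 'host:port' -> bool."""
    return {f"{host}:{port}": check_tcp(host, port, timeout) for host, port in endpoints}


# Example with hypothetical hosts taken from the communications inventory:
# check_all([("onprem-elasticsearch.example.internal", 9200),
#            ("legacy-logstash.example.internal", 5044)])
```

Running the same list of endpoints before and after the temporary openings are removed makes it clear which connections the production service still depends on.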
Guideline 6: Service monitoring
In addition to the functional tests, another important aspect is the monitoring of the service. At this point, we can highlight the following:
- Cluster metrics can be visualized with a tool such as Grafana, fed by the metrics that GKE collects for every deployed Kubernetes cluster.
- We can use the Kubernetes API to extract metrics about the infrastructure of the namespace where the migrated service will be deployed.
- In addition to the metrics offered by GKE, we can usually monitor the functional status of the service with a metrics collector such as Telegraf, or with a log shipper such as Filebeat if the logs contain information about how the service is working.
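As an illustration of the second point, the Kubernetes metrics API (`metrics.k8s.io`) returns pod usage as a JSON list that is easy to summarize. The sketch below assumes the response has already been fetched, for example with `kubectl get --raw` against `/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods`, and that the metrics API is available in the cluster. The sample payload, pod names, and values are hypothetical, and the unit conversion only covers a subset of the suffixes Kubernetes can return.

```python
import json

# Conversion factors for a subset of Kubernetes quantity suffixes.
_CPU_TO_MILLICORES = {"n": 1e-6, "u": 1e-3, "m": 1.0}
_MEM_TO_MIB = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}


def _convert(value, table, default_factor):
    """Convert a quantity string using the suffix table, or the default factor."""
    for suffix, factor in table.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value) * default_factor


def summarize_pod_metrics(pod_metrics_list):
    """Return CPU (millicores) and memory (MiB) per pod, summed over its containers."""
    summary = {}
    for item in pod_metrics_list.get("items", []):
        cpu = sum(_convert(c["usage"]["cpu"], _CPU_TO_MILLICORES, 1000.0)
                  for c in item["containers"])
        mem = sum(_convert(c["usage"]["memory"], _MEM_TO_MIB, 1 / (1024 * 1024))
                  for c in item["containers"])
        summary[item["metadata"]["name"]] = {"cpu_millicores": cpu, "memory_mib": mem}
    return summary


# Hypothetical response for a namespace with a single two-container pod.
sample = json.loads("""
{"kind": "PodMetricsList", "items": [
  {"metadata": {"name": "logstash-0", "namespace": "logging"},
   "containers": [
     {"name": "logstash", "usage": {"cpu": "250000000n", "memory": "524288Ki"}},
     {"name": "sidecar", "usage": {"cpu": "50m", "memory": "64Mi"}}]}]}
""")

summary = summarize_pod_metrics(sample)
print(summary)
```

A summary like this can be fed into whatever monitoring system is already in place, alongside the functional metrics collected from the service itself.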
Guideline 7: Putting the service into production
Finally, once all the service tests have been carried out and the monitoring of the service is ready, we face the move to production. We recommend reviewing the following points:
- If temporary communications were opened for the migration process, make sure that the ones needed once the service is in production are available.
- If possible, simulate the move to production in a predefined time window. This is highly recommended.
- If we used a non-production cluster to migrate the service, deploy the service in the production cluster with all the corrections made after the tests in the non-production cluster.
That concludes this review of the guidelines and recommendations to follow when migrating a service from on-premise to the public cloud. What guidelines would you add to this process, so common in recent years?