AWX is an essential component in the automation flow of an IT organization. However, there are times when a job gets stuck in one of its tasks indefinitely, without any message in the output console indicating what is happening. In this post we give some ideas on how to deal with this situation. Specifically, we will cover the following points:
- How to ensure that certain jobs do not remain in a given state indefinitely.
- Typical use cases.
- The importance of logging in our service
AWX in a nutshell
For many organizations, using Ansible from the command line is not a viable solution, for reasons such as the following:
- There is a need to obtain a detailed report of the work carried out and its corresponding failures, especially if there are audit processes.
- An organization can be made up of different teams with different permissions and management needs for playbooks, inventories, and credentials.
- A complete visual history of current and past executions and of the state of the servers helps to identify potential problems before they have an impact on the final service.
- The ability to schedule playbooks helps keep the infrastructure in a known state.
AWX provides a mechanism for teams to use Ansible from a graphical interface.
“AWX is the open-source upstream project of Ansible Tower; it helps teams manage complex multi-tier deployments by adding control, knowledge and delegation to Ansible-powered environments.”
Possible problems in using AWX
There are two typical cases that can lead to having jobs in a certain state for longer than desired:
- Operators who have launched a job and left it running without really keeping track of what actions it is performing.
- Execution of remediation actions in an unattended manner.
The first case is a natural consequence of the use of the tool by a multitude of people within an IT organization. The second case is more related to a part of the monitoring flow we use.
Troubleshooting AWX
AWX exposes an API for its service from which you can perform different actions, including running playbooks. The problem we intend to solve here is how to make sure that an incident on a service is resolved unattended. In our case we use Zabbix as the centerpiece for managing incidents and alerts across the whole IT flow.
“Zabbix is an open-source monitoring solution that covers all the needs of an IT organization: SSO, distributed monitoring, unlimited data retention, security and much more.”
Zabbix, among other functionalities, has the ability to execute actions based on events, such as a service crash.
In our case, we have a service for which incidents are raised because a port is down, a certain file is missing from a path, or a certain entry appears in a log, to name a few examples.
In the event that any of these alerts is opened, Zabbix can be configured so that, in addition to notifying those responsible for the service, it executes a series of actions in order to resolve the incident. An action in Zabbix can be:
- Send a message to a single user or to a group, through one or more channels. Messages can be customized and can have conditions.
- Execute remote commands: running a script on the server or the agent, IPMI, SSH, telnet or a global script.
In other words, we can configure such an action to call AWX and launch a job that runs a series of tasks defined in a “playbook” or “role” against the affected service. If everything goes well, the service recovers and the alert is no longer active in Zabbix. There are cases, however, where these unattended actions leave jobs “hanging” or running longer than expected.
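As an illustration, the remote command in a Zabbix action could invoke a small script like the following sketch, which launches an AWX job template against the affected host through the API. The template ID, host, URL and credentials are placeholders, and it assumes the template allows the limit to be set at launch time:
#!/usr/bin/env python
# Hypothetical helper that a Zabbix action could run to launch a remediation
# job template in AWX for the affected host. IDs and credentials are placeholders.
import sys
import requests

AWX_URL = "http://<awx-host>/api/v2"
JOB_TEMPLATE_ID = 42   # ID of the remediation job template (placeholder)

def launch_remediation(host, user, password):
    # POST to the job template launch endpoint, limiting execution to the affected host
    resp = requests.post(
        f"{AWX_URL}/job_templates/{JOB_TEMPLATE_ID}/launch/",
        auth=(user, password),
        json={"limit": host},
    )
    resp.raise_for_status()
    return resp.json()["id"]   # ID of the job created by the launch

if __name__ == "__main__":
    print(launch_remediation(sys.argv[1], "<user>", "<pass>"))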
Controlling runaway automation
By default, Ansible executes tasks synchronously, keeping the connection to the remote node open until the task has finished. This means that each task blocks the next one; since execution is sequential, later tasks cannot run while an earlier task is still pending. For example, a task may take longer to complete than the SSH session allows, causing a timeout. Ansible provides mechanisms to control this behaviour, such as the async and poll parameters or the SSH session settings. Note that in this case AWX is the one in charge of parallelizing job execution. The typical places where Ansible can hang include (among others):
- Connection problems, either because there is no access to the network where the machine lives, the SSH keys are not correctly configured, etc.
- Problems with the setup module or fact gathering: the setup module can sometimes hang while collecting hardware information (disks with heavy I/O flows, stale mount points, etc.).
At this point we know that we can end up with jobs in AWX that remain in a certain state without getting any answer from them.
To keep this under control there are several options, more or less automatic and/or sophisticated. A first step is to use the AWX API to get the “jobs” that have remained in a certain state for a certain time. All of our AWX deployments include an ad-hoc component, which we call AWX Cleaner, that acts as a “watchdog”. At a high level, its functionality can be described as follows:
“AWX defines a series of states a job can be in, such as running, waiting, pending or new. You can find more information about the API and its different methods in [1].”
1. We get the jobs:
$ curl -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/ | jq '.results[]'
2. We filter the jobs obtained by elapsed time, by status and, optionally, by name:
$ curl -s -u admin:admin http://<awx-host>/api/v2/jobs/ | jq '(.results[]| select(.status | test("successful"))|select(.elapsed>100))|.id'
...
34343
...
3. We make an API call asking if the job with a certain ID can be canceled.
$ curl -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/34343/cancel/ | jq '.'
{
  "can_cancel": true
}
4. We store in a list the jobs that we can cancel
5. We cancel the jobs and record information about the event for later analysis:
$ curl -X POST -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/34343/cancel/ | jq '.'
We can get a list of jobs filtered by status and elapsed time with the following command:
$ curl -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/ |\
jq -r '(["ID","Elapsed", "Limit", " Name"]),\
(.results[]| select(.status | test("runnning"))|select(.elapsed>100)|[.id,.elapsed, .limit, .name])|@tsv'
ID Elapsed Limit Name
34343 181.185 hostX Reinicio servicios X
...
34367 558.483 hostX Reinicio servicios X
In this piece of bash, the first thing we do is get the JSON with all the jobs; we then tell jq to build a table with the ID, Elapsed, Limit and Name fields, obtaining these values by applying some filters to the returned results (.results[]) and emitting tab-separated values (@tsv). Sometimes it can be useful to explore the “schema” of the JSON with the following calls. You can also try replacing the string “running” with another state such as “successful”. For another example of using jq at an intermediate level, see [2].
$ curl -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/ | jq '.|keys'
[
"count",
"next",
"previous",
"results"
]
$ curl -s -u <user>:<pass> http://<awx-host>/api/v2/jobs/ | jq '.results[]|keys'
[
"allow_simultaneous",
"artifacts",
"ask_credential_on_launch",
...
]
The full Python code for this section has not been included, since it would be quite long and it is not difficult to write a basic working version, leaving aside the extra functionality added ad hoc to each client's needs.
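Still, as a reference, a minimal sketch of the watchdog logic described in the steps above could look like the following. The host, credentials and the MAX_ELAPSED threshold are placeholders, and pagination and error handling are deliberately left out:
#!/usr/bin/env python
# Minimal sketch of the "watchdog": list the jobs, keep the ones that have been
# running longer than a threshold, check whether they can be cancelled and cancel them.
import requests

AWX_URL = "http://<awx-host>/api/v2"
AUTH = ("<user>", "<pass>")
MAX_ELAPSED = 100   # seconds, same threshold as in the curl examples (placeholder)

def stuck_jobs():
    # Steps 1 and 2: get the jobs and filter them by status and elapsed time
    jobs = requests.get(f"{AWX_URL}/jobs/", auth=AUTH).json()["results"]
    return [j for j in jobs
            if j["status"] == "running" and float(j["elapsed"]) > MAX_ELAPSED]

def cancel(job):
    url = f"{AWX_URL}/jobs/{job['id']}/cancel/"
    # Steps 3 and 4: ask AWX whether this job can be cancelled
    if requests.get(url, auth=AUTH).json().get("can_cancel"):
        # Step 5: cancel it, keeping the job data around for later analysis
        requests.post(url, auth=AUTH).raise_for_status()
        return True
    return False

for job in stuck_jobs():
    print(f"cancelled={cancel(job)} id={job['id']} name={job['name']}")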
The importance of logging
Adding telemetry (logging, metrics and traces) to our services is essential in order to correlate the health of individual components with the health of the service at the business level. Gone are the days when finding an application's logs was an odyssey and, in the worst case, you had to ask some other team for them. Nowadays it is very easy to emit structured log records and later aggregate them in tools such as Elasticsearch. In this case, the logging configuration used would be as follows:
#!/usr/bin/env python
import sys
import argparse
import logging.config

from ruamel.yaml import YAML              # loads the logging configuration (logging.yaml)
from pythonjsonlogger import jsonlogger   # JSON formatter for structured log records

# Log record format: timestamp, level, logger name, file:line and function
FORMAT = "[%(asctime)s %(levelname)-5s|%(name)s|%(filename)20s:%(lineno)-3s|%(funcName)20s()] %(message)s"

logger = logging.getLogger(__name__)
In the following links we can find a functional skeleton for building a command line interface (CLI) for our application:
https://gist.github.com/oscaromeu/aeb7424cd4b65d182bbaae38b747afdd
https://gist.github.com/oscaromeu/310780a26c76d45896b2b9c76053f3cb
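As a rough idea of how the pieces fit together (the gists above contain the full version), a minimal sketch of loading a dictConfig-style logging.yaml and applying it could look like this; the file name and its layout are assumptions:
# Minimal sketch (not the gist code): load a dictConfig-style logging.yaml
# and hand it to the standard logging machinery.
import logging
import logging.config
from ruamel.yaml import YAML

def setup_logging(path="logging.yaml"):
    yaml = YAML(typ="safe")
    with open(path) as f:
        config = yaml.load(f)              # expects a standard dictConfig mapping
    logging.config.dictConfig(config)      # wires up handlers, formatters (e.g. pythonjsonlogger) and levels

setup_logging()
logging.getLogger(__name__).info("info-level log message")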
The application is run with its corresponding logging settings as follows:
python3 cli_demo.py logging.yaml -vvv
{"asctime": "2022-01-11 11:32:23", "msecs": 514.5304203033447, "levelname": "INFO", "name": "__main__", "process": 368056, "processName": "MainProcess", "filename": "cli_demo.py", "lineno": 22, "message": "info-level log message"}
{"asctime": "2022-01-11 11:32:23", "msecs": 514.6398544311523, "levelname": "DEBUG", "name": "__main__", "process": 368056, "processName": "MainProcess", "filename": "cli_demo.py", "lineno": 23, "message": "debug-level log message"}
{"asctime": "2022-01-11 11:32:23", "msecs": 514.7027969360352, "levelname": "ERROR", "name": "__main__", "process": 368056, "processName": "MainProcess", "filename": "cli_demo.py", "lineno": 24, "message": "error-level log message"}