This document provides a standard operating procedure (SOP) for diagnosing and troubleshooting container-based vuSmartMaps installations.
This SOP assumes a container installation of vuSmartMaps that uses the new Kafka-based architecture. The data pipeline comprises Kafka brokers for collection and queueing, Kafka Streams for processing and enrichment, Elasticsearch for storage, and Vienna for the UI.
This is not intended to be a full-fledged debugging guide. Its scope is to let the support team handle basic troubleshooting and root-cause detection for issues on a running vuSmartMaps setup before escalating. For that reason it omits steps such as validating configuration files, which are expected to be correct on a running installation, and focuses on runtime issues and their likely root causes.
<TBD Need to have the architecture diagram and components here, can be taken from Installation guide for container vusmartmaps>
Anyone handling container-based vuSmartMaps should know the following before proceeding with the SOPs. These are the minimum requirements for working with container-based vuSmartMaps at any level.
vsmaps is a wrapper script that wraps the docker command functionality described below. It is available either as the vsmaps alias or as a script in the data directory. Information about all docker services is stored in stack files, which specify where each container service should run, how many instances to run, and so on. To list the available docker stack (compose) files, run:
vsmaps -l
To list the services defined in the docker stack (compose) files, run:
vsmaps -s
Using vsmaps wrapper (recommended)
To check the status of all containers running in the vsmaps stack, run:
vsmaps status all all
To check the status of the container services from a single stack file, pass the file name. For example, to check the services from the 3_kafka-zookeeper file:
vsmaps status 3_kafka-zookeeper all
The docker command can also be used to check the status of a service. For example, to check the status of the elasticsearch-1 container:
docker service ps --no-trunc vsmaps_elasticsearch-1
<TBD>
To open a shell inside a container, use the vsmaps login command. For example, to log in to the vuinterface-1 container:
vsmaps login vuinterface-1
vsmaps should provide all docker service related information. However, it is useful to know these basic docker commands in case vsmaps is not adequate.
docker service ps --no-trunc vsmaps_vuinterface-1
This shows the current status of the container service and, if the service has failed, its previous status along with the error code and error message.
docker stack ps vsmaps
This shows all the containers running in the vsmaps stack.
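When many services are running, it can help to print only the ones that are not fully up. The following is a minimal sketch, assuming the standard `docker service ls --format` output; `list_degraded_services` is a hypothetical helper name and is not part of vsmaps.

```shell
# List swarm services whose running replica count differs from the desired count.
# list_degraded_services is a hypothetical helper, not part of the vsmaps wrapper.
list_degraded_services() {
    docker service ls --format '{{.Name}} {{.Replicas}}' |
        awk '{
            split($2, r, "/")              # "1/3" -> r[1]=running, r[2]=desired
            if (r[1] != r[2]) print $1, $2
        }'
}
```

Running `list_degraded_services` on a manager node prints one line per degraded service, and prints nothing when every service has all of its replicas running.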
docker service logs vsmaps_elasticsearch-1
To follow the logs, similar to tail -f filename:
docker service logs -f vsmaps_elasticsearch-1
To start from the last n lines and keep following, add -n. For example, to see the last 500 lines of a service:
docker service logs -f -n500 vsmaps_elasticsearch-1
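As a quick first pass before reading the logs in full, the recent lines can be scanned for errors. This is a sketch only: `count_log_errors` is a hypothetical helper, and the `error|exception` pattern is an assumption that may need tuning for each component.

```shell
# Count lines mentioning "error" or "exception" (case-insensitive) in the
# last N log lines of a service. N defaults to 500.
# count_log_errors is a hypothetical helper; the match pattern is an assumption.
count_log_errors() {
    docker service logs --tail "${2:-500}" "$1" 2>&1 |
        grep -ciE 'error|exception'
}
```

For example, `count_log_errors vsmaps_elasticsearch-1 500` prints the number of matching lines among the last 500.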
docker stats
Please use the following commands to check various logs
VuSmartmaps logs on vuNode and vuInterface
To check the logs of any vuinterface or vunode container service, first log in to that container.
vuinterface: Two types of services run in this container: the Django app behind Apache, and Kibana. To check their logs, first log in to the container. For example, for vuinterface-1:
vsmaps login vuinterface-1
vunode: vunode is what was earlier known as the daq service. The 9.x.x build moved to a microservice architecture in which each daq service (for example alert, discovery, dcm and snmp) runs independently. For example, to check the logs of the alert vunode, first log in to that container:
vsmaps login alert-1
vunode logs are in /var/log/vusmartmaps/vusoft_logging.yyyy-mm-dd. Use the vulog alias to check this log, and the vilog alias to open the log file.
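If the aliases are not available, the dated file name can be built by hand. The sketch below assumes that yyyy-mm-dd in the file name is the calendar date (for example 2024-01-05); `vunode_log_path` is a hypothetical helper name.

```shell
# Build the path of the vunode log file for a given day (defaults to today).
# Assumption: the "yyyy-mm-dd" suffix is the calendar date of the log.
vunode_log_path() {
    echo "/var/log/vusmartmaps/vusoft_logging.${1:-$(date +%Y-%m-%d)}"
}
```

Inside the container, `tail -n 200 "$(vunode_log_path)"` would then show the latest entries for today, and `vunode_log_path 2024-01-05` gives the path for an earlier day.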
Elasticsearch logs
Elasticsearch logs can be checked from any node using the commands below. For example, for the elasticsearch-1 instance:
vsmaps logs elasticsearch-1
docker service logs -f -n500 vsmaps_elasticsearch-1
vsmaps login elasticsearch-1
cd /var/log/elasticsearch
Kafka Broker Logs
Kafka broker logs can be checked from any node using the commands below. For example, for the broker-1 instance:
vsmaps logs broker-1
docker service logs -f -n500 vsmaps_broker-1
Kafka Connect Logs
Kafka Connect logs can be checked from any node using the commands below. For example, for the connect-1 instance:
vsmaps logs connect-1
docker service logs -f -n500 vsmaps_connect-1
Kafka streams logs
Kafka Streams logs can be checked using the docker service logs command, passing the app_id of the Kafka Streams application. For example, to check the logs of the snmp-app stream container:
docker service logs -f -n500 snmp-app
Telegraf SNMP Logs
Telegraf SNMP logs can be checked from any node using the commands below. For example, for the telegraf-snmp-1 instance:
vsmaps logs telegraf-snmp-1
docker service logs -f -n500 vsmaps_telegraf-snmp-1
Some installations (for example SBI) have dedicated UI-based internal monitoring storyboards set up with Grafana and a TSDB. The following storyboards help summarise the health of the entire vuSmartMaps setup.
<TBD: please include screenshots of the key storyboards with a one-line description each>
Start from the home page, which shows all the dashboards and the most recent alerts.
For Elasticsearch details, go to the Elasticsearch dashboard.
Kafka Lag
The SOPs are categorised by the visible indication of an issue. A given indication can have multiple root causes, so the diagnostics are a sequence of checks.
The following are the first set of things to be checked for any issue.
curl -XGET "<ES-ip>:9200/_cluster/health?pretty"
To check all the services running use the vsmaps command.
vsmaps status all all
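The cluster health check above can be reduced to a single status word for use in scripts. This is a minimal sketch, assuming Elasticsearch is reachable over plain HTTP without authentication; `es_cluster_status` and `es_is_green` are hypothetical helper names.

```shell
# Print the Elasticsearch cluster status (green, yellow or red).
# The argument is host:port. Assumes plain HTTP and no authentication.
es_cluster_status() {
    curl -s "http://$1/_cluster/health" |
        sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Succeed only when the cluster status is green.
es_is_green() {
    [ "$(es_cluster_status "$1")" = "green" ]
}
```

For example, `es_is_green <ES-ip>:9200 || echo "cluster is not green"` flags a yellow or red cluster, and also an unreachable one, since no status is printed in that case.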
In this case, the indication is that the basic vusmartmaps url itself does not work. It either shows “site cannot be reached” or some error code.
Check whether vuinterface is running and, if not, start it. To check whether vuinterface is running:
vsmaps status vuinterface all
To check the logs, log in to the vuinterface container and use the following aliases:
alog → to check apache access logs
elog → to check apache error logs
vulog → to check vusmartmaps logs
The indication is that the login page shows up, however the user is not able to complete a successful login despite entering the correct credentials.
If Django is not able to connect to MySQL, check that the mysql service is running and that you can log in to the MySQL database manually. Run the following from the vuinterface container to verify that login works from there:
mysql -uroot -proot -hmysql-1
nc -vzw 5 <mysql-container-name> <mysql_port>
nc -vzw 5 mysql-1 3306
Check whether the Redis service is not running, or whether Redis is full and not accepting new data.
nc -vzw 5 <redis-container-name> <redis_port>
nc -vzw 5 route 6379
Ensure that the Elasticsearch and Kibana services are running.
nc -vzw 5 <es-container-name> <es_port>
nc -vzw 5 es 9200
Check Kibana from inside the vuinterface container:
sudo service kibana status
Ensure elasticsearch cluster is green
curl -XGET "es:9200/_cluster/health?pretty"
Check whether Apache has stopped and is unable to start due to a system issue.
sudo service apache2 status
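The individual nc checks in this section can be run in one pass from inside the vuinterface container. This is a sketch under the assumption that nc accepts the same -z and -w flags used above; `check_endpoints` is a hypothetical helper name.

```shell
# Probe each "host:port" argument with nc and report OK or FAIL per endpoint.
# Returns non-zero if any endpoint is unreachable.
# check_endpoints is a hypothetical helper name.
check_endpoints() {
    local rc=0 endpoint host port
    for endpoint in "$@"; do
        host="${endpoint%:*}"      # text before the last colon
        port="${endpoint##*:}"     # text after the last colon
        if nc -zw 5 "$host" "$port" >/dev/null 2>&1; then
            echo "OK   $endpoint"
        else
            echo "FAIL $endpoint"
            rc=1
        fi
    done
    return "$rc"
}
```

With the names used in this section, the call would be `check_endpoints mysql-1:3306 route:6379 es:9200`.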
When LDAP is configured, vuSmartMaps first attempts to log in any user via LDAP and then falls back to local login. If login is slow for a user who is configured both locally and in LDAP, check whether the LDAP server is reachable and working.
First check connectivity between the machine and the LDAP server:
nc -vzw 5 <ad-server-ip> <ad-server-port>
Then check the LDAP credentials. This requires ldap-utils to be installed on the machine.
ldapsearch -x -b "DC=AD,DC=SBI" -H ldaps://ad.sbi:3269 -D "CN=NOC Tool,CN=Users,DC=AD,DC=SBI" -w <bind-password>
To check the MySQL slow log, log in to the mysql container and go to /var/log/mysql:
vsmaps login mysql-1
cd /var/log/mysql
The diagnostics for this depend on the type of polling and agents.
For SNMP polled network devices
Go to vumodule → select the snmp polling vublock → go to sources and search for the IP of the device to be monitored.
snmpwalk -v3 -l authpriv -u S3n0cuSer -a SHA -A <authenticationPassPhrase> -x AES -X <privacyPassPhrase> <device_ip>
For Server health and Log data missing from specific devices
netstat -an | grep 9092
If a group of devices is affected, those nodes may be sending data to the same Kafka endpoint, and that endpoint may be down or not working as expected.
Check connectivity from the polling node to the Kafka node using:
nc -vzw 5 <kafka-ip> <port>
This indicates a minor or major failure somewhere, most likely in a single binding component of the data pipeline. If the storyboards load but show no data, we can at least assume that Kibana/Vienna is able to reach Elasticsearch but is not getting any data for the queried period. There are two possibilities.
docker service ps <appIdOfStream>
curl -XGET "<elasticsearch-ip>:<elasticsearch-port>/_cat/allocation?v"
This can occur when a query tries to access more buckets than the max bucket limit. Check the current limit:
curl -s "<es-ip>:9200/_cluster/settings?include_defaults=true" | json_pp | grep max_buckets
The max bucket limit can be increased as the use case requires, but it should not exceed 10000, as larger values will affect Elasticsearch.
curl -X PUT "<es-ip>:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"transient" : {
"search.max_buckets" : "10000"
}
}
'
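To avoid setting the limit above 10000 by accident, the PUT request can be wrapped in a small guard. This is a sketch only: `set_max_buckets` is a hypothetical helper name, and it assumes plain HTTP without authentication.

```shell
# Set search.max_buckets as a transient cluster setting, refusing values
# above the 10000 ceiling. Arguments: host:port and the new value.
# set_max_buckets is a hypothetical helper name.
set_max_buckets() {
    case "$2" in
        ''|*[!0-9]*) echo "max_buckets must be a whole number" >&2; return 2 ;;
    esac
    if [ "$2" -gt 10000 ]; then
        echo "refusing to set search.max_buckets above 10000" >&2
        return 1
    fi
    curl -s -X PUT "http://$1/_cluster/settings?pretty" \
        -H 'Content-Type: application/json' \
        -d "{\"transient\": {\"search.max_buckets\": $2}}"
}
```

For example, `set_max_buckets <es-ip>:9200 8000` sends the request, while a value such as 20000 is rejected before anything is sent to Elasticsearch.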
First check that Kafka Connect is running and able to connect to Kafka.
Check that the Kafka broker is reachable from the Connect cluster. To do this, log in to a Kafka Connect container and check Kafka connectivity using the following commands.
vsmaps login connect-1
Then check connectivity to kafka using the below command.
nc -vzw 5 <broker> <port>
If that works, check the Kafka Connect worker status by running the command below:
curl -s "http://<connect_host>:<connect_port>/connectors?expand=info&expand=status" | jq '. | to_entries[] | [ .value.info.type, .key, .value.status.connector.state,.value.status.tasks[].state,.value.info.config."connector.class"]|join(":|:")' | column -s : -t| sed 's/\"//g'| sort
curl -s "http://localhost:9082/connectors?expand=info&expand=status" | jq '. | to_entries[] | [ .value.info.type, .key, .value.status.connector.state,.value.status.tasks[].state,.value.info.config."connector.class"]|join(":|:")' | column -s : -t| sed 's/\"//g'| sort
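Where jq is not installed, a rough count of connector and task states can still be taken with grep and awk. This sketch assumes the Connect REST API returns compact JSON, so that states appear as `"state":"RUNNING"` with no spaces; `connect_state_summary` is a hypothetical helper name.

```shell
# Count how many connectors and tasks are in each state (RUNNING, FAILED, ...).
# The argument is host:port of a Kafka Connect worker.
# Assumption: the REST response is compact JSON, e.g. "state":"RUNNING".
connect_state_summary() {
    curl -s "http://$1/connectors?expand=status" |
        grep -o '"state":"[A-Z_]*"' |
        awk -F'"' '{ count[$4]++ } END { for (s in count) print s, count[s] }' |
        sort
}
```

For example, `connect_state_summary localhost:9082` prints one line per state with its count. Any FAILED entry is a cue to run the full command above and see which connector or task is affected.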
To check shard allocation per node, including the number of unassigned shards, run:
curl -XGET "<es-ip>:<es-port>/_cat/allocation?v"
To see why shards are unassigned, run:
curl -XGET "<es-ip>:<es-port>/_cluster/allocation/explain?pretty"
To fix this, the corrupted shards have to be removed.
To retry allocation of shards that became unassigned, run:
curl -XPOST "<es-ip>:9200/_cluster/reroute?retry_failed=true"
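To see exactly which shards are unassigned, the `_cat/shards` output can be filtered by state. This is a minimal sketch, assuming plain HTTP without authentication; `list_unassigned_shards` is a hypothetical helper name.

```shell
# List unassigned shards together with the reason Elasticsearch recorded.
# The argument is host:port. Output columns: index, shard, prirep, state, reason.
# list_unassigned_shards is a hypothetical helper name.
list_unassigned_shards() {
    curl -s "http://$1/_cat/shards?h=index,shard,prirep,state,unassigned.reason" |
        awk '$4 == "UNASSIGNED"'
}
```

Running `list_unassigned_shards <es-ip>:9200` before and after the reroute shows whether the retry actually reduced the number of unassigned shards.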
To find out why configuration collection stopped, first check that the dcm container is running, and start it if it is not. If it is running, log in to the dcm container and check the logs for further debugging:
vsmaps login dcm-1
Then check the logs using vulog or vilog.
Check the following to validate that alarms are working:
vsmaps status alert-vunode all