SOP for Docker based VuSmartMaps

Introduction

This document provides the standard operating procedure (SOP) for diagnosing and troubleshooting container based vuSmartMaps installations.

Assumptions

This SOP assumes a container installation of vuSmartMaps based on the new Kafka based architecture. The data pipeline comprises Kafka brokers for collection and queueing, Kafka Streams for processing and enrichment, Elasticsearch for storage and Vienna for the UI.

This is not intended to be a full fledged debugging guide. Its scope is to let the support team handle basic troubleshooting and root cause detection for issues on a running vuSmartMaps setup before passing them on for escalation. Debugging steps such as basic validation of configuration files are therefore not covered, since the configuration is expected to be correct on a running installation. The focus is on runtime issues and their possible root causes.

Architecture

Basic knowledge needed to execute the SOPs:

Anyone handling container based vuSmartMaps should have the following knowledge before proceeding with the SOPs. These are the minimum requirements for handling container based vuSmartMaps at any level.

vsmaps

vsmaps is a wrapper script written around the docker commands to provide the functionality described below. It is available as the vsmaps alias or as a script in the data directory. All docker service information is stored in stack files, which define where each docker container service should run, how many instances to run and so on. To list the available docker stack (compose) files, run the command below.

vsmaps -l

To list the services in the docker stack (compose) files, run the command below.

vsmaps -s

Check status of VuSmartMaps services:

Using vsmaps wrapper (recommended)

To check the status of all containers running in the vsmaps stack, use the command below.

vsmaps status all all

To check the status of the container services from a single stack file, pass the file name. For example, to check the status of the services from the 3_kafka-zookeeper file, use the command below.

vsmaps status 3_kafka-zookeeper all

We can also use the docker command to check the status of a service. For example, to check the status of the elasticsearch-1 container, use the command below.

docker service ps vsmaps_elasticsearch-1 --no-trunc

<TBD>

Login to a container

To log in to a container shell, use the vsmaps login command. For example, to log in to the vuinterface-1 container shell, use the command below.

vsmaps login vuinterface-1

Basic Docker Commands

vsmaps should give all docker service related information. However, it is useful to know these basic docker commands in case vsmaps is not adequate.

How to check service status. For example, to check the status of the service named vsmaps_vuinterface-1:

docker service ps --no-trunc vsmaps_vuinterface-1

This command shows the current status of the container service. If the service has failed, it also shows the previous status along with the error code and error message.

To check the status of all the services running in vsmaps stacks, use the below command.

docker stack ps vsmaps

This shows the status of all the containers running in the vsmaps stack.

How to check the stdout logs of a docker container service. For example, to check the logs of the service named vsmaps_elasticsearch-1:

docker service logs vsmaps_elasticsearch-1

To see logs in follow mode, similar to tail -f filename:

docker service logs -f vsmaps_elasticsearch-1

To see the last n lines of logs with follow mode enabled, pass -n. For example, to see the last 500 lines of a service:

docker service logs -f -n500 vsmaps_elasticsearch-1

How to check statistics such as CPU, memory, network I/O and disk I/O for the docker containers running on a particular node. Run the command on that node.

docker stats
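As a complement to the status commands above, the sketch below flags services whose running replica count differs from the desired count. It assumes the two-column output of docker service ls with a --format template of name and replicas, run on a swarm manager node; the function name services_not_fully_up is only illustrative and is not part of vsmaps.

```shell
#!/bin/sh
# Print services whose running replica count differs from the desired count.
# Input on stdin: lines of "<service-name> <running>/<desired>", for example from
#   docker service ls --format '{{.Name}} {{.Replicas}}'
services_not_fully_up() {
  awk '{
    split($2, r, "/")                        # r[1] = running, r[2] = desired
    if (r[1] + 0 != r[2] + 0) print $1, $2   # report only mismatches
  }'
}

# Usage on a manager node (not executed here):
#   docker service ls --format '{{.Name}} {{.Replicas}}' | services_not_fully_up
```

An empty result means every listed service has all of its desired replicas running.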

Checking various logs

Please use the following commands to check various logs

VuSmartMaps logs on vuNode and vuInterface

To check the logs of any vuinterface or vunode container service, first log in to that container.

vuinterface: Two services run in this container: the Django app behind Apache, and Kibana. To check their logs, first go inside the container. For example, to check the logs of vuinterface-1, run the command below.

vsmaps login vuinterface-1

Apache access logs are in /var/log/apache2/access.log. The alias for this is alog.
Apache error logs are in /var/log/apache2/error.log. The alias for this is elog.
vuSmartMaps logs are in /var/log/vusmartmaps/vusoft_logging.yyyy-mm-dd. The alias for this is vulog; to open the log file, use the vilog alias.
Kibana logs are in the /var/log/kibana/ directory.

vunode: vunode is what was earlier known as the daq service. In the 9.x.x build we have moved to a microservice architecture, where each daq service (for example alert, discovery, dcm and snmp) is independent of the others. For example, to check the logs of the alert vunode, first log in to that container.

vsmaps login alert-1

vunode logs are in /var/log/vusmartmaps/vusoft_logging.yyyy-mm-dd. The alias for this is vulog; to open the log file, use the vilog alias.

Elasticsearch logs

We can check elasticsearch logs from any node using the commands below. Suppose we want to check the logs of the elasticsearch-1 instance.

Using the vsmaps command. This shows the last 50 lines of the elasticsearch container logs in follow mode.

vsmaps logs elasticsearch-1

Using the docker service command. Here we can pass an argument for the number of log lines we want to see.

docker service logs -f -n500 vsmaps_elasticsearch-1

Checking older elasticsearch logs. In this case we have to log in to the elasticsearch container instance whose logs we want to check and then go to the /var/log/elasticsearch directory.

vsmaps login elasticsearch-1

cd /var/log/elasticsearch

Kafka Broker Logs

We can check kafka broker logs from any node using the commands below. Suppose we want to check the logs of the broker-1 instance.

Using the vsmaps command. This shows the last 50 lines of the broker container logs in follow mode.

vsmaps logs broker-1

Using the docker service command. Here we can pass an argument for the number of log lines we want to see.

docker service logs -f -n500 vsmaps_broker-1

Kafka Connect Logs

We can check kafka connect logs from any node using the commands below. Suppose we want to check the logs of the connect-1 instance.

Using the vsmaps command. This shows the last 50 lines of the connect container logs in follow mode.

vsmaps logs connect-1

Using the docker service command. Here we can pass an argument for the number of log lines we want to see.

docker service logs -f -n500 vsmaps_connect-1

Kafka streams logs

Kafka Streams logs can be checked using the docker service logs command. We have to provide the app_id of the Kafka Streams application. For example, to check the logs of the snmp-app stream container:

docker service logs -f -n500 snmp-app

Telegraf SNMP Logs

We can check telegraf snmp logs from any node using the commands below. Suppose we want to check the logs of the telegraf-snmp-1 instance.

Using the vsmaps command. This shows the last 50 lines of the telegraf snmp container logs in follow mode.

vsmaps logs telegraf-snmp-1

Using the docker service command. Here we can pass an argument for the number of log lines we want to see.

docker service logs -f -n500 vsmaps_telegraf-snmp-1

Browsing Grafana Storyboards

Some installations (for example SBI) have dedicated UI based internal monitoring storyboards set up with Grafana and TSDB. The following storyboards give a summary of the overall vuSmartMaps health.

<TBD: please include the key storyboard screenshots with a one line description>

Start from the home page. It shows all the dashboards and the alerts that have come in recently.

For Elasticsearch details, go to elasticsearch.

Kafka Lag

SOPs for troubleshooting

The SOPs are categorised based on the visible indication of a given issue. Hence there can be multiple root causes and the diagnostics are expected to be a sequence of checks.

Basic Health Check Indicator for any Problems

The following are the first set of things to be checked for any issue.

The elasticsearch cluster is green. To check cluster state run this command.

curl -XGET <ES-ip>:9200/_cluster/health?pretty

All the expected vusmartmaps services are running. No service has stopped recently.

To check all the services running use the vsmaps command.

vsmaps status all all
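The cluster health check above can be reduced to a single word for use in scripts. The sketch below assumes the JSON body returned by _cluster/health, with or without ?pretty; the function name es_status_from_json is only illustrative.

```shell
#!/bin/sh
# Read _cluster/health JSON on stdin and print the value of its "status"
# field (green, yellow or red).
es_status_from_json() {
  # Strip spaces and newlines so pretty-printed and compact JSON look the same,
  # then capture the lowercase word that follows "status":
  tr -d ' \n' | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Usage (not executed here; replace the placeholder with a real address):
#   curl -s "<ES-ip>:9200/_cluster/health" | es_status_from_json
```

Anything other than green is a reason to continue with the Elasticsearch checks later in this document.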

VuSmartMaps page not accessible

In this case, the indication is that the basic vusmartmaps url itself does not work. It either shows “site cannot be reached” or some error code.

Check status of Apache services

Check if vuinterface is running. If it is not running, start vuinterface. To check whether vuinterface is running:

vsmaps status vuinterface all

If all the services are running, check connectivity from the vuinterface container to the other containers: elasticsearch, mysql and redis.
Confirm that there are no issues with DNS resolution of the vuSmartMaps hostname. If an IP is available to reach vuSmartMaps, try using the IP.
If you see a valid error code like 404 or 500, the request is reaching the vuSmartMaps server and there is some issue in Apache sending back the login page. Further diagnostics have to be done using the Apache access/error logs on all the vuinterface instances.

To check the logs, log in to the vuinterface container and use the following aliases.

alog → to check apache access logs

elog → to check apache error logs

vulog → to check vusmartmaps logs

Unable to Login to VuSmartMaps

The indication is that the login page shows up, however the user is not able to complete a successful login despite entering the correct credentials.

If the error is invalid credentials and vuSmartMaps is integrated with AD/LDAP, the user has to cross verify the credentials (for expiry or lock out).
In case of “Internal error”, vuSmartMaps is hitting an exception while Django is trying to authenticate. This can be due to multiple reasons.
Run all of the following commands from inside the vuinterface container.

If Django is not able to connect to mysql, check that the mysql service is running and that you are able to log in to the mysql db manually. The first command below checks that login to mysql works from the vuinterface container.

mysql -uroot -proot -hmysql-1

nc -vzw 5 <mysql-container-name> <mysql_port>

nc -vzw 5 mysql-1 3306

Check whether the redis service is not running, or redis is full and not accepting any new data.

nc -vzw 5 <redis-container-name> <redis_port>

nc -vzw 5 route 6379

Ensure the elasticsearch and kibana services are running.

nc -vzw 5 <es-container-name> <es_port>

nc -vzw 5 es 9200

Check kibana from inside the vuinterface container.

sudo service kibana status

Ensure the elasticsearch cluster is green.

curl -XGET "es:9200/_cluster/health?pretty"

Check whether Apache has stopped and is unable to start due to a system issue.

sudo service apache2 status
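The nc connectivity checks in this section can be run in one pass. The sketch below reads "host port" pairs and prints OK or FAIL for each; the function name check_endpoints and the PROBE_CMD override are illustrative assumptions, and the default probe is the same nc -vzw 5 call used above.

```shell
#!/bin/sh
# Read "host port" pairs on stdin and report whether each endpoint accepts
# a TCP connection. PROBE_CMD may be overridden, for example for a dry run.
check_endpoints() {
  probe=${PROBE_CMD:-"nc -vzw 5"}
  while read -r host port; do
    # stdin is redirected so the probe cannot consume the remaining pairs
    if $probe "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "OK $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
}

# Usage from inside the vuinterface container (not executed here),
# with the host names and ports used in this section:
#   printf 'mysql-1 3306\nroute 6379\nes 9200\n' | check_endpoints
```

A FAIL line only shows that the port could not be reached; the service specific checks above are still needed to find out why.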

VuSmartMaps login works, but takes more than 30 seconds

When LDAP is configured, vuSmartMaps first attempts to log in any user via LDAP and then tries local login. If login is slow for a user who is configured both locally and in LDAP, check whether the LDAP server is reachable and working.

Check LDAP server is reachable and queryable

First check connectivity between the machine and the LDAP server.

nc -vzw 5 <ad-server-ip> <ad-server-port>

Check the LDAP credentials. For this, ldap-utils should be installed on the machine.

ldapsearch -x -b "DC=AD, DC=SBI" -H ldaps://ad.sbi:3269 -D "CN=NOC Tool,CN=Users,DC=AD,DC=SBI" -w <bind-password>

Check whether mysql queries are delayed by checking the slow query log on the mysql nodes.

To check the mysql slow log, log in to the mysql container and go to /var/log/mysql.

vsmaps login mysql-1

cd /var/log/mysql

Data Collection stops for specific devices

The diagnostics for this depend on the type of polling and agents.

For SNMP polled network devices

Ensure the device is not removed from the list of polled devices in data source configuration in the UI.

→ Go to vumodule → select snmp polling vublock → go to sources and search for the IP of the device to be monitored.

Run a test snmpwalk against the device to ensure that the credentials are fine and the device is reachable.

snmpwalk -v3 -l authpriv -u S3n0cuSer -a SHA -A <authenticationPassPhrase> -x AES -X <privacyPassPhrase> <device_ip>

For Server health and Log data missing from specific devices

Check if the respective agents are running on the end nodes.
Check that there is an established connection from the node to the kafka host and port configured on the agent. This connection should be in ESTABLISHED state, and its Send-Q and/or Recv-Q should not be stuck at a value without decrementing. Also check whether there are too many (more than 20) CLOSE_WAIT connections from the same host.

netstat -an | grep 9092
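To avoid counting connection states by eye, the netstat output can be summarised per state. The sketch below assumes the Linux netstat -an column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name kafka_conn_summary is only illustrative.

```shell
#!/bin/sh
# Summarise TCP connection states towards a Kafka port.
# Input on stdin: netstat -an output. Argument: port (default 9092).
kafka_conn_summary() {
  port=${1:-9092}
  awk -v p=":$port" '
    # Keep TCP rows whose foreign address (field 5) ends with ":<port>",
    # and count them per connection state (field 6).
    $1 ~ /^tcp/ && substr($5, length($5) - length(p) + 1) == p { n[$6]++ }
    END { for (s in n) print s, n[s] }
  ' | sort
}

# Usage on the agent node (not executed here):
#   netstat -an | kafka_conn_summary 9092
```

A CLOSE_WAIT count above 20, or no ESTABLISHED line at all, matches the problem conditions described above.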

In case we have a group of devices affected, there is a possibility that those nodes use the same Kafka endpoint to send data and the corresponding endpoint is down or not working as expected.

Check connectivity from the polling node to the kafka node using

nc -vzw 5 <kafka-ip> <port>

No recent data on any of the storyboards

This represents a minor or major failure somewhere, most likely in a single binding component of the data pipeline. If you are able to see the storyboards but there is no data on them, we can at least assume that kibana/vienna is able to reach elasticsearch but is not getting any data for the queried period. There are two possibilities.

The data is updating, but is not visible for recent timestamps. This is probably because there is a huge backlog of data on all topics in the kafka cluster. Check the running status of the streams services. If the services are running, check the streams logs for any indication of errors or of the Streams java application doing frequent GC. It may help to restart the streams services to see if the backlog clears faster.

docker service ps <appIdOfStream>

The data is not updating at all. This can happen during total component failures. So it's better to check the running status of all services, especially Kafka Broker and Kafka Streams.

Data is coming with a Lag

Ensure all nodes of elasticsearch are running and the cluster is green.
Ensure disk usage has not exceeded 85% on any node. If some nodes exceed 85%, new shards may not be allocated to those nodes. This can overload the other nodes and reduce the indexing throughput. To check disk usage per node, run the command below.

curl -XGET <elasticsearch-ip>:<elasticsearch-port>/_cat/allocation?v
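The 85% check can be applied directly to the _cat/allocation?v output. The sketch below locates the disk.percent column from the header row rather than assuming a fixed position, and takes the node name from the last column; the function name nodes_over_disk_threshold is only illustrative.

```shell
#!/bin/sh
# Print "<node> <disk.percent>" for nodes above a disk usage threshold.
# Input on stdin: output of _cat/allocation?v (header row required).
# Argument: threshold in percent (default 85).
nodes_over_disk_threshold() {
  limit=${1:-85}
  awk -v limit="$limit" '
    # Header row: remember which column holds disk.percent.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "disk.percent") col = i; next }
    # Data rows: skip rows without a numeric value (such as the UNASSIGNED row).
    col && $col ~ /^[0-9]+$/ && $col + 0 > limit { print $NF, $col }
  '
}

# Usage (not executed here; replace the placeholders with real values):
#   curl -s "<elasticsearch-ip>:<elasticsearch-port>/_cat/allocation?v" | nodes_over_disk_threshold 85
```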

If kafka consumers are showing lag on multiple topics, the problem could be either slow streams processing or back pressure from the elasticsearch side. For elasticsearch, we can check /_cat/thread_pool/write to see whether one or more nodes have rejections while indexing. Rejections can happen for multiple reasons, the most likely being long GC cycles. Check this from the ES logs; if a specific node is found to be doing long GC cycles, we can restart that ES service alone.

<TBD, process to check GC on ES>

Indexing throughput can also drop and cause lag for other, less frequent reasons, such as an increase in disk I/O latency, a surge in data rate, or an increase in data containing several fields mapped to type “text”.
The Grafana storyboards can help in correlating the root cause. They now capture write queue and search queue rejections, and also GC cycles.
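Where the Grafana storyboards are not available, one rough way to look for long GC cycles is to count GC monitor messages in the ES logs. The sketch below assumes that such messages carry the logger name JvmGcMonitorService, which is how Elasticsearch reports GC overhead in the versions we are aware of; confirm this against the logs of the installed version. The function name count_gc_warnings is only illustrative.

```shell
#!/bin/sh
# Count log lines on stdin that were written by the Elasticsearch GC monitor.
count_gc_warnings() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function usable under "set -e".
  grep -c 'JvmGcMonitorService' || true
}

# Usage (not executed here):
#   docker service logs --since 1h vsmaps_elasticsearch-1 2>&1 | count_gc_warnings
```

A count that keeps growing on one node but not on the others points to that node as the candidate for a restart.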

Search Requests taking a long time or timing out

An Elasticsearch search request hits any one of the ES nodes, and this node is called the coordinating node for that request. It determines the nodes containing the shards for the request, consolidates the results from each shard and does the final merge and aggregations before sending the response. So when search requests take long, there can be multiple reasons.

If some requests are timing out, one or more nodes may be busy doing other things (like GC) and rejecting search requests because the thread pool queues on those nodes are full (requests being serviced or yet to be serviced). This can be seen via /_cat/thread_pool/search on the ES cluster for all nodes. Thread pools can also be viewed in Grafana.

The coordinating node itself may be busy. In a docker based ES cluster we do not know exactly which node is the coordinating node for a specific request, since the services interact using the service name and a request can hit any of the hosts in the cluster. This makes the check a challenge.
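The thread pool check described above can be narrowed to the nodes that report rejections. The sketch below assumes the output of /_cat/thread_pool/search (or /_cat/thread_pool/write for indexing) requested with ?v so that a header row naming the node_name and rejected columns is present; the function name nodes_with_rejections is only illustrative.

```shell
#!/bin/sh
# Print "<node_name> <rejected>" for nodes whose thread pool has rejections.
# Input on stdin: output of _cat/thread_pool/<pool>?v (header row required).
nodes_with_rejections() {
  awk '
    # Header row: find the node_name and rejected columns by name.
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "node_name") n = i
        if ($i == "rejected") r = i
      }
      next
    }
    n && r && $r + 0 > 0 { print $n, $r }
  '
}

# Usage (not executed here; replace the placeholder with a real address):
#   curl -s "<es-ip>:9200/_cat/thread_pool/search?v" | nodes_with_rejections
```

The rejected counter is cumulative since the node started, so it is worth running the check twice a few minutes apart and comparing the numbers.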

- Searching intermittently results in Courier Fetch: shards error

This can occur if a search tries to create more buckets than the max bucket limit. To check the bucket limit, run the command below.

curl -s "<es-ip>:9200/_cluster/settings?include_defaults=true" | json_pp | grep max_buckets

We can increase the max bucket limit as per the use case, but it should not be larger than 10000 as that will affect elasticsearch.

curl -X PUT "<es-ip>:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"transient" : {
"search.max_buckets" : "10000"
}
}'

I see the data on Kafka Topic, but don't see it in the ES index.

First check that kafka connect is running and able to connect to kafka.

Check that the kafka broker is accessible from the connect cluster. To do this, log in to the kafka connect container and check kafka connectivity as follows.

vsmaps login connect-1

Then check connectivity to kafka using the command below.

nc -vzw 5 <broker> <port>

If that works, check the kafka connect worker status.
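One way to check worker status is the Kafka Connect REST interface, which lists connectors under /connectors and reports connector and task state under /connectors/<name>/status. The sketch below counts FAILED states in a status response. It assumes the REST listener is on the Kafka Connect default port 8083 inside the connect container and that curl is available there, neither of which this document confirms; the function name connect_failed_count is only illustrative.

```shell
#!/bin/sh
# Count FAILED connector/task states in a Kafka Connect status response.
# Input on stdin: JSON from GET /connectors/<name>/status.
connect_failed_count() {
  # Remove spaces and newlines so the match does not depend on formatting,
  # print each "state":"FAILED" occurrence on its own line, then count them.
  tr -d ' \n' | grep -o '"state":"FAILED"' | wc -l | tr -d ' '
}

# Usage from inside the connect container (not executed here; port 8083 is an assumption):
#   curl -s "http://localhost:8083/connectors"
#   curl -s "http://localhost:8083/connectors/<connector-name>/status" | connect_failed_count
```

A non-zero count means the connector or at least one of its tasks has failed; the trace field in the same response usually names the cause.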

Cluster Red, Unassigned shards due to shard Corruption

To check if all shards are

To check shard allocation per node, including the number of unassigned shards, run the command below.

curl -XGET "<es-ip>:<es-port>/_cat/allocation?v"

To see why a shard is unassigned, run the command below.

curl -XGET "<es-ip>:<es-port>/_cluster/allocation/explain?pretty"
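To see exactly which shards are unassigned, the _cat/shards API can be filtered by state. The sketch below assumes the columns are requested in the order index, shard, prirep, state, unassigned.reason and without a header row; the function name unassigned_shards is only illustrative.

```shell
#!/bin/sh
# Print "<index> <shard> <p|r> <reason>" for every unassigned shard.
# Input on stdin: output of
#   _cat/shards?h=index,shard,prirep,state,unassigned.reason
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Usage (not executed here; replace the placeholders with real values):
#   curl -s "<es-ip>:<es-port>/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | unassigned_shards
```

Unassigned primary shards (p) are the ones that turn the cluster red; unassigned replicas (r) alone leave it yellow.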

To fix this we have to remove corrupted shards.

To retry allocation of shards that have become unassigned, run the command below.

curl -XPOST "<es-ip>:9200/_cluster/reroute?retry_failed=true"

Configuration Collection Stopped

To find out why configuration collection stopped, first check that the dcm container is running. If it is not running, start it. If it is running, log in to the dcm container and check the logs for further debugging.

vsmaps login dcm-1

Then check the logs using the vulog or vilog alias.

Alarms stopped working

Check the following to validate that alarms are working.

Check that the alert vunode is running. If it is not running, start the alert vunode.

vsmaps status alert-vunode all

If it is running, check the alert vunode logs to find the reason.
If there is an issue with mysql or a mysql_server_gone_away error appears, check connectivity with mysql. If connectivity is fine, restart the alert-vunode service.
If everything is fine, check whether the kafka connect alarm connect worker is running.

