SOP for Docker based VuSmartMaps



Introduction

This document provides a standard operating procedure (SOP) for diagnosing and troubleshooting container-based VuSmartMaps installations.


Assumptions

This SOP assumes a container-based installation of vuSmartMaps built on the new Kafka-based architecture. The data pipeline comprises Kafka brokers for collection and queueing, Kafka Streams for processing and enrichment, Elasticsearch for storage, and Vienna for the UI.


This is not intended to be a full-fledged debugging guide. Its scope is to let the support team carry out basic troubleshooting and root cause detection for issues on a running vuSmartMaps setup before escalating. For that reason it does not cover steps such as validating configuration files, which are expected to be correct on a running installation. The focus is on runtime issues and their possible root causes.


Architecture

<TBD Need to have the architecture diagram and components here, can be taken from Installation guide for container vusmartmaps>


Basic knowledge needed to execute the SOPs:

Anyone handling container-based vuSmartMaps should have the following knowledge before proceeding with the SOPs. These are the minimum requirements at any level.


vsmaps 

vsmaps is a wrapper script around the docker commands described in this document. It is available either as the vsmaps alias or as a script in the data directory. Information about each docker service is stored in stack (compose) files, which define where the container service should run, how many instances to run, and so on. To list the available docker stack (compose) files, run the command below.

                vsmaps -l


To list the services in the docker stack (compose) files, run the command below.

                vsmaps -s

Check status of VuSmartMaps services:

Using vsmaps wrapper (recommended)


To check the status of all containers running in the vsmaps stack, use the command below.


vsmaps status all all


To check the status of the container services from a single stack file, pass the file name. For example, to check the services from the 3_kafka-zookeeper file, use this command.

vsmaps status 3_kafka-zookeeper all



We can also use the docker command to check the status of a service. For example, to check the status of the elasticsearch-1 container, use the command below.

docker service ps --no-trunc vsmaps_elasticsearch-1





<TBD>

Login to a container 

To log in to a container shell, use the vsmaps login command. For example, to log in to the vuinterface-1 container shell, use the command below.

                vsmaps login vuinterface-1


Basic Docker Commands

vsmaps should provide all docker service related information. However, it helps to know these basic docker commands in case vsmaps is not adequate.


  1. Check the status of a service. For example, to check the status of the service named vsmaps_vuinterface-1:

docker service ps --no-trunc vsmaps_vuinterface-1

This command shows the current state of the container service and, if the service has failed, its previous states with the error code and error message.

  2. Check the status of all the services running in the vsmaps stack:

docker stack ps vsmaps

This shows all the containers running in the vsmaps stack.

  3. Check the stdout logs of a docker container service. For example, to check the logs of the service named vsmaps_elasticsearch-1:


docker service logs vsmaps_elasticsearch-1


To see the logs in follow mode, similar to tail -f filename:


docker service logs -f  vsmaps_elasticsearch-1


To see the last n lines with follow mode enabled, add the -n option. For example, to see the last 500 lines of a service:


docker service logs -f -n500  vsmaps_elasticsearch-1



  4. Check resource statistics (CPU, memory, network I/O and disk I/O) of the containers running on a particular node. Run this command on that node:

docker stats
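
When many services are deployed, the status checks above can be narrowed down to the services that need attention. The following is a minimal sketch, not part of vsmaps: the function name is hypothetical, and it assumes the Name and Replicas fields of docker service ls --format. It reads "name running/desired" pairs on stdin and prints the services that have fewer running tasks than desired.

```shell
# Hypothetical helper: reads lines of "<service-name> <running>/<desired>" on stdin
# and prints the names of services with fewer running tasks than desired.
services_not_fully_running() {
  awk '{
    split($2, r, "/")               # r[1] = running tasks, r[2] = desired tasks
    if (r[1] + 0 < r[2] + 0) print $1
  }'
}

# Usage on a swarm manager node (needs docker):
#   docker service ls --format '{{.Name}} {{.Replicas}}' | services_not_fully_running
```

An empty output means every listed service has all of its desired tasks running.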





Checking various logs

Please use the following commands to check various logs


VuSmartmaps logs on vuNode and vuInterface


To check the logs of any vuinterface or vunode container service, first log in to that container.

vuinterface :- Two types of services run in the vuinterface container: the Django app behind Apache, and Kibana. To check their logs, first log in to the container. For example, to check the logs of vuinterface-1, run the command below.

            vsmaps login vuinterface-1

  1. Apache access logs are in /var/log/apache2/access.log; the alias is alog.
  2. Apache error logs are in /var/log/apache2/error.log; the alias is elog.
  3. vusmartmaps logs are in /var/log/vusmartmaps/vusoft_logging.yyyy-mm-dd; the alias is vulog, and the vilog alias opens the log file.
  4. Kibana logs are in the /var/log/kibana/ directory.



vunode :- vunode is what was earlier known as the daq service. The 9.x.x build moved to a microservice architecture in which each daq service (for example alert, discovery, dcm and snmp) is independent of the others. For example, to check the logs of the alert vunode, first log in to that container.

            vsmaps login alert-1

vunode logs are in /var/log/vusmartmaps/vusoft_logging.yyyy-mm-dd; the alias is vulog, and the vilog alias opens the log file.
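
If the aliases are not available in a shell, the dated log file can be located directly. The sketch below is illustrative only: the function name is hypothetical, and it assumes the yyyy-mm-dd suffix is the calendar date of the log.

```shell
# Hypothetical helper: prints the path of the vusmartmaps log file for a given day.
# Assumes the yyyy-mm-dd suffix is the calendar date, e.g. 2022-08-18.
vusoft_log_path() {
  day="${1:-$(date +%Y-%m-%d)}"    # default to today when no date is passed
  printf '/var/log/vusmartmaps/vusoft_logging.%s\n' "$day"
}

# Usage inside a vuinterface or vunode container:
#   tail -n 100 "$(vusoft_log_path)"
#   grep -i error "$(vusoft_log_path 2022-08-18)"
```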



Elasticsearch logs

Elasticsearch logs can be checked from any node using the commands below. Suppose we want to check the logs of the elasticsearch-1 instance.

  1. Using the vsmaps command. This shows the last 50 lines of the elasticsearch container logs in follow mode.

            vsmaps logs elasticsearch-1

  2. Using the docker service command. Here we can pass an argument for the number of log lines to show.

docker service logs -f -n500 vsmaps_elasticsearch-1

  3. Checking older elasticsearch logs. Log in to the elasticsearch container instance whose logs we want to check, then go to the /var/log/elasticsearch directory.

    vsmaps login elasticsearch-1 

    cd /var/log/elasticsearch


Kafka Broker Logs


Kafka broker logs can be checked from any node using the commands below. Suppose we want to check the logs of the broker-1 instance.

  1. Using the vsmaps command. This shows the last 50 lines of the broker container logs in follow mode.

            vsmaps logs broker-1

  2. Using the docker service command. Here we can pass an argument for the number of log lines to show.

docker service logs -f -n500 vsmaps_broker-1


Kafka Connect Logs


Kafka connect logs can be checked from any node using the commands below. Suppose we want to check the logs of the connect-1 instance.

  1. Using the vsmaps command. This shows the last 50 lines of the connect container logs in follow mode.

            vsmaps logs connect-1

  2. Using the docker service command. Here we can pass an argument for the number of log lines to show.

docker service logs -f -n500 vsmaps_connect-1





Kafka streams logs

To check kafka streams logs, use the docker service logs command with the app_id of the kafka streams application. For example, to check the logs of the snmp-app stream container:

        docker service logs -f -n500 snmp-app



Telegraf SNMP Logs


Telegraf snmp logs can be checked from any node using the commands below. Suppose we want to check the logs of the telegraf-snmp-1 instance.

  1. Using the vsmaps command. This shows the last 50 lines of the telegraf snmp container logs in follow mode.

            vsmaps logs telegraf-snmp-1

  2. Using the docker service command. Here we can pass an argument for the number of log lines to show.

docker service logs -f -n500 vsmaps_telegraf-snmp-1



Browsing Grafana Storyboards

Some installations (for example SBI) have dedicated UI-based internal monitoring storyboards set up with Grafana and a TSDB. The following storyboards help summarise the health of the entire vusmartmaps setup.

<TBD, pls include the key storyboards screenshots with a one line description>

Start from the home page. It shows all the dashboards and the most recent alerts.

For Elasticsearch details, go to elasticsearch.

Kafka Lag

SOPs for troubleshooting

The SOPs are categorised by the visible indication of an issue. A given indication can have multiple root causes, so the diagnostics are a sequence of checks.


Basic Health Check Indicator for any Problems

The following are the first set of things to be checked for any issue.


  1. The elasticsearch cluster is green. To check the cluster state, run this command.

           curl -XGET "<ES-ip>:9200/_cluster/health?pretty"

  2. All the expected vusmartmaps services are running, and no service has stopped recently. To check all the running services, use the vsmaps command.

        vsmaps status all all
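
The cluster health check can also be scripted so that it yields a simple pass or fail. The following is a minimal sketch under stated assumptions: both function names are hypothetical, and the parsing relies only on the "status" field that the _cluster/health response contains.

```shell
# Hypothetical helper: reads the _cluster/health JSON on stdin and prints the
# value of its "status" field (green, yellow or red).
es_health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds (exit 0) only when the status read from stdin is green.
es_is_green() {
  [ "$(es_health_status)" = "green" ]
}

# Usage (needs curl and a reachable cluster):
#   curl -s "<ES-ip>:9200/_cluster/health" | es_is_green || echo "cluster is not green"
```

sed is used instead of a JSON tool so that the sketch runs on hosts where jq is not installed.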


VuSmartMaps page not accessible

In this case, the indication is that the basic vusmartmaps url itself does not work. It either shows “site cannot be reached” or some error code.


  1. Check the status of the Apache services. Check whether vuinterface is running and, if it is not, start it. To check whether vuinterface is running:

  vsmaps status vuinterface all

  2. If all the services are running, check connectivity from the vuinterface container to the elasticsearch, mysql and redis containers.
  3. Confirm that there are no issues with DNS resolution of the vusmartmaps hostname. If an IP is available to reach vusmartmaps, try using the IP.
  4. A valid error code such as 404 or 500 indicates that the request is reaching the vusmartmaps server and that apache is having trouble sending back the login page. Diagnose further using the apache access/error logs on all the vuinterface instances.

    To check the logs, log in to the vuinterface container and use the following aliases:

alog → to check apache access logs

elog → to check apache error logs

vulog → to check vusmartmaps logs


Unable to Login to VuSmartMaps

The indication is that the login page shows up, however the user is not able to complete a successful login despite entering the correct credentials. 


  1. If the error is invalid credentials and vusmartmaps is integrated with AD/LDAP, the user has to cross-verify the credentials (for expiry or lockout).
  2. An “Internal error” means vusmartmaps hit an exception while django was trying to authenticate. This can have multiple causes.
  3. Run all the commands below after logging in to the vuinterface container.

Django may be unable to connect to mysql. Check that the mysql service is running and that you can log in to the mysql db manually.

mysql -uroot -proot -hmysql-1     (checks that a mysql login works from the vuinterface container)


nc -vzw 5 <mysql-container-name> <mysql_port>

nc -vzw 5 mysql-1 3306


The redis service may not be running, or redis may be full and not accepting new data. Check connectivity to redis.

nc -vzw 5 <redis-container-name> <redis_port>

nc -vzw 5 route 6379


Ensure that the elasticsearch and kibana services are running.

nc -vzw 5 <es-container-name> <es_port>

nc -vzw 5 es 9200


To check kibana, run the command below from the vuinterface container.

sudo service kibana status


Ensure the elasticsearch cluster is green.

curl -XGET "es:9200/_cluster/health?pretty"

Check whether Apache has stopped and is unable to start due to a system issue.

sudo service apache2 status



VuSmartMaps login works, but takes more than 30 seconds

When LDAP is configured, vusmartmaps first attempts to log in any user via LDAP and then tries local login. If login is slow for a user who is configured both locally and in LDAP, check whether the LDAP server is reachable and working.


  1. Check that the LDAP server is reachable and queryable.

First check connectivity between the machine and the LDAP server.

nc -vzw 5 <ad-server-ip> <ad-server-port>

Then check the LDAP credentials. This requires ldap-utils to be installed on the machine.

ldapsearch -x -b "DC=AD, DC=SBI" -H ldaps://ad.sbi:3269  -D "CN=NOC Tool,CN=Users,DC=AD,DC=SBI" -w <bind-password>


  2. Check whether mysql queries are delayed by checking the slow query log on the mysql nodes.

    To check the mysql slow log, log in to the mysql container and go to /var/log/mysql.

    vsmaps login mysql-1

    cd /var/log/mysql

Data Collection stops for specific devices


The diagnostics for this depend on the type of polling and agents.


For SNMP polled network devices

  1. Ensure the device has not been removed from the list of polled devices in the data source configuration in the UI.

→ Go to vumodule → select the snmp polling vublock → go to sources and search for the IP of the device to be monitored.


  2. Run a test snmpwalk against the device to confirm that the credentials are fine and the device is reachable.

snmpwalk -v3 -l authpriv -u S3n0cuSer -a SHA -A <authenticationPassPhrase> -x AES -X <privacyPassPhrase> <device_ip>

For Server health and Log data missing from specific devices 

  1. Check that the respective agents are running on the end nodes.
  2. Check that there is an established connection from the node to the kafka host and port configured on the agent. The connection should be in the ESTABLISHED state, and its Send-Q and Recv-Q values should not be stuck without decrementing. Also check whether there are too many (more than 20) CLOSE_WAIT connections from the same host.

    netstat -an | grep 9092
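
Counting CLOSE_WAIT connections by eye is error-prone on a busy node. The following is a minimal sketch for that check: the function name is hypothetical, and it assumes the Linux netstat column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```shell
# Hypothetical helper: reads "netstat -an" output on stdin and prints how many
# connections to the given remote port are in the CLOSE_WAIT state.
# Assumes the Linux column layout: Proto Recv-Q Send-Q Local Foreign State.
count_close_wait() {
  awk -v port="$1" '
    $6 == "CLOSE_WAIT" && $5 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Usage on the agent node:
#   netstat -an | count_close_wait 9092
```

Only the foreign address is matched, so connections where 9092 happens to be the local port are not counted.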

If a group of devices is affected, those nodes may be using the same Kafka endpoint to send data, and that endpoint may be down or not working as expected.

Check connectivity from the polling node to the kafka node using

    nc -vzw 5 <kafka-ip> <port>


No recent data on any of the storyboards

This represents a minor or major failure somewhere, most likely in a single binding component of the data pipeline. If the storyboards load but show no data, we can at least assume that kibana/vienna is able to reach elasticsearch but is not getting any data for the queried period. There are two possibilities.


  1. The data is updating but is not visible for recent timestamps. This is probably because of a large backlog of data on all topics in the kafka cluster. Check the running status of the streams services. If they are running, check the streams logs for any indication of errors or of the Streams java application doing frequent GC. Restarting the streams services may help clear the backlog faster.

docker service ps <appIdOfStream>

  2. The data is not updating at all. This can happen during total component failures, so check the running status of all services, especially Kafka Broker and Kafka Streams.



Data is coming with a Lag

  1. Ensure all nodes of elasticsearch are running and the cluster is green.
  2. Ensure disk usage on every node has not exceeded 85%. When a node exceeds 85%, elasticsearch can stop allocating new shards to it, which can overload the other nodes and reduce indexing throughput. To check disk usage per node, run:

curl -XGET "<elasticsearch-ip>:<elasticsearch-port>/_cat/allocation?v"

  3. If kafka consumers show lag on multiple topics, the cause could be either slow streams processing or back pressure from elasticsearch. For elasticsearch, check /_cat/thread_pool/write to see whether one or more nodes have multiple rejections while indexing. Rejections can happen for several reasons; the most likely is long GC cycles. Check this in the ES logs, and if a specific node is found to be doing long GC cycles, restart that ES service alone.
     <TBD, process to check GC on ES>
  4. Indexing throughput can also drop, causing lag, for less frequent reasons such as increased disk I/O latency, a surge in data rate, or an increase in data containing many fields mapped to type “text”.
  5. The Grafana storyboards can help correlate the root cause. They now capture write queue and search queue rejections as well as GC cycles.
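
The 85% disk check in the list above can be automated against the _cat/allocation?v output. The following is a minimal sketch: the function name is hypothetical, and it assumes the output has a header row containing the disk.percent and node columns, which is why the ?v flag is needed.

```shell
# Hypothetical helper: reads "_cat/allocation?v" output on stdin and prints
# "<node> <disk.percent>" for every node above the given threshold.
# Columns are located by header name, so extra columns do not matter.
nodes_over_disk_threshold() {
  awk -v limit="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "disk.percent") pct = i
        if ($i == "node") node = i
      }
      next
    }
    # Skip rows without a numeric disk.percent, such as the UNASSIGNED row.
    pct && $pct ~ /^[0-9]+$/ && $pct + 0 > limit + 0 { print $node, $pct }
  '
}

# Usage (needs curl and a reachable cluster):
#   curl -s "<elasticsearch-ip>:<elasticsearch-port>/_cat/allocation?v" | nodes_over_disk_threshold 85
```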


Search Requests taking a long time or timing out

  1. An elasticsearch search request hits any one of the ES nodes, which becomes the coordinating node for that request. It determines the nodes that hold the shards for the request, consolidates the results from each shard, and performs the final merge and aggregations before sending the response. So slow search requests can have multiple causes.
  2. If some requests are timing out, one or more nodes may be busy with other work (like GC) and rejecting search requests because their thread pools are full (being serviced or yet to be serviced). This can be seen with the /_cat/thread_pool/search command on the ES cluster for all nodes. Thread pools can also be viewed in Grafana.

  3. The coordinating node itself is busy. In a docker-based ES cluster we do not know exactly which node is the coordinating node for a specific request, because the services interact using the service name and a request can hit any of the hosts in the cluster. Identifying a busy coordinating node is therefore a challenge.
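
To spot the nodes that are rejecting requests, the thread pool output mentioned above can be filtered. The following is a minimal sketch: the function name is hypothetical, and it assumes the output includes a header row with the node_name and rejected columns (the ?v flag).

```shell
# Hypothetical helper: reads "_cat/thread_pool/<pool>?v" output on stdin and
# prints "<node_name> <rejected>" for nodes that have rejected requests.
nodes_with_rejections() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "node_name") node = i
        if ($i == "rejected") rej = i
      }
      next
    }
    rej && $rej + 0 > 0 { print $node, $rej }
  '
}

# Usage (needs curl and a reachable cluster):
#   curl -s "<es-ip>:9200/_cat/thread_pool/search?v" | nodes_with_rejections
```

The rejected counter is cumulative for the life of the node process, so running the check twice a few minutes apart shows whether rejections are still happening.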


Searching intermittently results in Courier Fetch: shards error

This can occur when a search tries to create more buckets than the max bucket limit. To check the bucket limit, run the command below.

curl -s "<es-ip>:9200/_cluster/settings?include_defaults=true" | json_pp | grep max_buckets


We can increase max buckets as the use case requires, but it should not be larger than 10000 as that will affect elasticsearch.

curl -X PUT "<es-ip>:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
   "transient" : {
      "search.max_buckets" : "10000"
   }
}
'

I see the data on Kafka Topic, but don't see it in the ES index.

First check that kafka connect is running and able to connect to kafka.

Check that the kafka broker is accessible from the connect cluster. To do this, log in to the kafka connect containers and check kafka connectivity using the following commands.

vsmaps login connect-1

Then check connectivity to kafka using the below command.

            nc -vzw 5 <broker> <port>


If that works, check the kafka connect worker status by running the command below.

curl -s "http://<connect_host>:<connect_port>/connectors?expand=info&expand=status" |           jq '. | to_entries[] | [ .value.info.type, .key, .value.status.connector.state,.value.status.tasks[].state,.value.info.config."connector.class"]|join(":|:")' |   column -s : -t| sed 's/\"//g'| sort


curl -s "http://localhost:9082/connectors?expand=info&expand=status" |           jq '. | to_entries[] | [ .value.info.type, .key, .value.status.connector.state,.value.status.tasks[].state,.value.info.config."connector.class"]|join(":|:")' |   column -s : -t| sed 's/\"//g'| sort



Cluster Red, Unassigned shards due to shard Corruption

  1. To check if all shards are 


To see shard counts and disk usage per node, including a row for unassigned shards, run the command below.

curl  -XGET "<es-ip>:<es-port>/_cat/allocation?v"

To see why a shard is unassigned, run the command below.

curl -XGET "<es-ip>:<es-port>/_cluster/allocation/explain?pretty"

To fix this, the corrupted shards have to be removed.


To retry allocation of shards that became unassigned, run the command below.


curl -XPOST "<es-ip>:9200/_cluster/reroute?retry_failed=true"
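
After a reroute, the cluster health endpoint can confirm whether the number of unassigned shards is going down. The following is a minimal sketch: the function name is hypothetical, and it relies only on the unassigned_shards field of the _cluster/health response.

```shell
# Hypothetical helper: reads the _cluster/health JSON on stdin and prints the
# value of its "unassigned_shards" field. The leading quote in the pattern keeps
# it from matching "delayed_unassigned_shards".
es_unassigned_shards() {
  sed -n 's/.*"unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (needs curl and a reachable cluster):
#   curl -s "<es-ip>:9200/_cluster/health" | es_unassigned_shards
```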




Configuration Collection Stopped


To find out why configuration collection stopped, first check that the dcm container is running and start it if it is not. If it is running, log in to the dcm container and check the logs for further debugging.

        vsmaps login dcm-1

Then check the logs using vulog or vilog.

Alarms stopped working

Check the following to validate that alarms are working.

  1. Check that the alert vunode is running. If it is not, start it.

    vsmaps status alert-vunode all

  2. If it is running, check the alert vunode logs to find the reason.
  3. If there is an issue with mysql, or a mysql_server_gone_away error appears, check connectivity to mysql. If connectivity is fine, restart the alert-vunode service.
  4. If everything is fine, check whether the kafka connect alarm connect worker is running.




















































