Service telemetry troubleshooting.

When issues arise and you don’t see any metrics showing up on the Grafana dashboards, you can follow these troubleshooting steps to narrow down where the problem is.

In my examples here I will be using STF version 1.5; see infrawatch for the upstream Service Telemetry Framework documentation.

I will also be using OpenStack version 17, code named Wallaby (the upstream version). See Openstack.org.

OpenStack

  • Check the logs for any errors in OpenStack, both on the Controller and the Compute nodes (a quick scan sketch follows the list). Logs to check are:
    • /var/log/containers/metrics_qdr/metrics_qdr.log on all the overcloud nodes.
    • /var/log/containers/collectd/collectd.log on all overcloud nodes.
    • /var/log/containers/collectd/sensubility.log on all overcloud nodes.
    • /var/log/containers/ceilometer/compute.log on all Compute nodes.
    • /var/log/containers/ceilometer/central.log on the Controller nodes.
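A quick way to scan all of these at once for suspicious lines (a minimal sketch; run it as root on an overcloud node and adjust the pattern to taste):

for f in /var/log/containers/metrics_qdr/metrics_qdr.log \
         /var/log/containers/collectd/collectd.log \
         /var/log/containers/collectd/sensubility.log \
         /var/log/containers/ceilometer/*.log; do
    [ -e "$f" ] || continue        # skip logs that don't exist on this node type
    echo "== $f =="
    grep -iE 'error|critical|traceback' "$f" | tail -n 20
done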

All of these services (collectd, ceilometer, and metrics_qdr) run as containers managed by systemd.

To check which containers are running, and to make sure none of them have stopped (and if so, why), run:

podman ps -a -f name=collect -f name=metric -f name=ceilometer
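Since these containers are managed by systemd, you can also check their units directly. On my TripleO-deployed nodes the units follow the tripleo_<container_name> naming, so something like this works (adjust the names to your environment):

systemctl list-units --type=service 'tripleo_collectd*' 'tripleo_metrics_qdr*' 'tripleo_ceilometer*'
# or look at a single service, e.g. collectd
systemctl status tripleo_collectd.service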
  • All of the services above have a config option to enable debug logging, but note that it can dump a lot of output to the log files. A sketch for ceilometer follows below.
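As an example, this is roughly what enabling debug for ceilometer on a Controller looks like. It is only a sketch: it assumes the usual TripleO config-data path and the tripleo_* unit names, and the change will be overwritten by the next overcloud deploy.

# set debug = True under [DEFAULT] in the config the ceilometer containers read
# (crudini is just a convenience; editing the file by hand works as well)
sudo crudini --set /var/lib/config-data/puppet-generated/ceilometer/etc/ceilometer/ceilometer.conf DEFAULT debug True
# restart the container so it picks up the new setting
sudo systemctl restart tripleo_ceilometer_agent_notification.service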

Checking the metrics being sent from the overcloud onto the AMQP bus is a good starting point to confirm that metrics are actually leaving the nodes.

Run the commands below:

# listener=$(podman exec -it metrics_qdr cat /etc/qpid-dispatch/qdrouterd.conf | grep -A2 listener | grep host | awk -F': ' '{print $2}' | tr -d '\r')
# podman exec -it metrics_qdr qdstat -b $listener:5666 -c
# podman exec -it metrics_qdr qdstat -b $listener:5666 -a
# podman exec -it metrics_qdr qdstat -b $listener:5666 --links

These are the established connections:

2023-03-16 06:36:55.606130 UTC
Router.controller-0.localdomain

Connections
  id     host                 container                                                                                                  role    proto  dir  security     authentication  tenant  last dlv      uptime
  ============================================================================================================================================================================================================================
  15937  172.24.33.235:55546  openstack.org/om/container/controller-0/ceilometer-agent-notification/13/93d8ad26dc014fa5967aba7aac99ad8f  normal  amqp   in   no-security  no-auth                 -             003:07:45:14
  25179  172.24.33.235:42676  metrics                                                                                                    normal  amqp   in   no-security  anonymous-user          000:00:00:03  000:00:28:08
  25181  172.24.33.235:38626  controller-0.internalapi.localdomain-infrawatch-out-1678946927                                             normal  amqp   in   no-security  no-auth                 -             000:00:28:07
  25180  172.24.33.235:38624  controller-0.internalapi.localdomain-infrawatch-in-1678946927                                              normal  amqp   in   no-security  anonymous-user          -             000:00:28:07
  25256  172.24.33.235:47784  1c98a314-ffbf-4a36-a1c3-08d07fead890                                                                       normal  amqp   in   no-security  no-auth                 000:00:00:00  000:00:00:00

You can see above that both ceilometer and collectd (the "metrics" container) are connected.

The output below is the most important one, as you can see the in and out delivery counts for each address. You want to see these values growing.

2023-03-16 06:36:11.909805 UTC
Router.controller-0.localdomain

Router Addresses
  class   addr                                       phs  distrib    pri  local  remote  in          out         thru  fallback
  ===============================================================================================================================
  local   $_management_internal                           closest    -    0      0       0           0           0     0
  mobile  $management                                0    closest    -    0      0       67,583      0           0     0
  local   $management                                     closest    -    0      0       0           0           0     0
  local   _edge                                           closest    -    0      0       60,941,727  21,628,862  0     0
  mobile  anycast/ceilometer/cloud1-event.sample     0    multicast  -    0      0       0           0           0     0
  mobile  anycast/ceilometer/cloud1-metering.sample  0    multicast  -    0      0       0           0           0     0
  mobile  sensubility/cloud1-telemetry               0    balanced   -    0      0       0           0           0     0
  local   temp.YfviR9tyb+1HWgq                            balanced   -    1      0       0           0           0     0
  local   temp.yZcrl7V0NyH49jI                            balanced   -    1      0       0           1           0     0

And the links:

2023-03-16 06:31:29.407910 UTC
Router.controller-0.localdomain

Router Links
  type      dir  conn id  id     peer  class   addr                                       phs  cap  pri  undel  unsett  deliv   presett  psdrop  acc  rej  rel     mod  delay  rate  stuck  cred  blkd
  ==========================================================================================================================================================================================================
  endpoint  out  15937    40975        local   temp.YfviR9tyb+1HWgq                            250  0    0      0       0       0        0       0    0    0       0    0      0     0      200   -
  endpoint  in   15937    40976        mobile  anycast/ceilometer/cloud1-metering.sample  0    250  0    0      0       0       0        0       0    0    0       0    0      0     0      0     07:39:48
  endpoint  in   15937    40977        mobile  anycast/ceilometer/cloud1-event.sample     0    250  0    0      0       0       0        0       0    0    0       0    0      0     0      0     07:39:23
  endpoint  in   25179    59460                                                                250  0    0      0       207607  0        0       0    0    207607  0    0      0     0      250   -
  endpoint  in   25181    59461        mobile  sensubility/cloud1-telemetry               0    250  0    0      0       0       0        0       0    0    0       0    0      0     0      0     00:22:27
  endpoint  in   25226    59550        mobile  $management                                0    250  0    0      0       2       0        0       2    0    0       0    0      0     0      250   -
  endpoint  out  25226    59551        local   temp.Kzpzn5bHefs8RA6                            250  0    0      0       1       1        0       0    0    0       0    0      0     0      1     -
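If you want to watch those in and out counters actually grow, you can simply poll the same address table (reusing the $listener variable from above):

# refresh the address table every 30 seconds; the in/out columns should keep increasing
watch -n 30 "podman exec metrics_qdr qdstat -b ${listener}:5666 -a"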

And from the OCP (OpenShift) side:

 POD=$(oc get pods -l application=default-interconnect -o custom-columns=POD:.metadata.name --no-headers)
 oc exec -it $POD -- qdstat -c
 oc exec -it $POD -- qdstat -a
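To narrow that down to the STF addresses for this cloud (named cloud1 in my setup):

# check that the cloud1 addresses exist on the OCP router and are receiving deliveries
oc exec -it $POD -- qdstat -a | grep cloud1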

OpenShift

  • Check the pods in the service-telemetry project with oc get pods:
NAME                                                      READY   STATUS             RESTARTS          AGE
alertmanager-default-0                                    3/3     Running            0                 2d4h
default-cloud1-ceil-event-smartgateway-6c9648c5cb-st8gd   2/2     Running            13 (2d4h ago)     26d
default-cloud1-ceil-meter-smartgateway-67684d88c8-98qgz   3/3     Running            2 (2d4h ago)      2d4h
default-cloud1-coll-event-smartgateway-7d7675cdd7-lslg9   2/2     Running            13 (2d4h ago)     26d
default-cloud1-coll-meter-smartgateway-76d5ff6db5-m5fs7   3/3     Running            10 (2d4h ago)     33d
default-cloud1-sens-meter-smartgateway-7b59669fdd-rzf25   3/3     Running            12 (2d4h ago)     33d
default-interconnect-845c4b647c-vgtr9                     1/1     Running            1                 2d4h
elastic-operator-585f6dff6d-c4cfc                         1/1     Running            326 (5h49m ago)   41d
elasticsearch-es-default-0                                0/1     CrashLoopBackOff   2 (7s ago)        2m18s
grafana-deployment-6b74678ddd-4qt6k                       2/2     Running            0                 6d15h
grafana-operator-controller-manager-5bd4bd7556-nklmb      2/2     Running            7 (2d4h ago)      2d4h
interconnect-operator-99dc7f8d8-kvfmh                     1/1     Running            14 (6d15h ago)    41d
prometheus-default-0                                      3/3     Running            5                 20d
prometheus-operator-b5d479c56-9cm5n                       1/1     Running            5 (6d16h ago)     41d
service-telemetry-operator-f9ddb898c-6hggz                1/1     Running            1                 2d4h
smart-gateway-operator-79557664f8-jf56r                   1/1     Running            4 (6d16h ago)     41d
  • In my case, you can see above that elasticsearch-es-default-0 is in a CrashLoopBackOff state. Run oc logs elasticsearch-es-default-0 to investigate further; a few more generic commands follow below.
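Beyond oc logs, a few generic commands that help when a pod is crash-looping:

oc describe pod elasticsearch-es-default-0             # check the Events section and the last container state
oc logs elasticsearch-es-default-0 --previous          # logs from the previous, crashed, container
oc get events --sort-by=.lastTimestamp | tail -n 20    # recent events in the service-telemetry project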

MORE TO COME!