Service telemetry troubleshooting
When issues arise and you don't see any metrics showing up on the Grafana dashboard, you can follow these troubleshooting steps to narrow down where the issue is.
In my examples here I will be using STF version 1.5; see infrawatch for the upstream Service Telemetry Framework documentation. I will also be using OpenStack version 17, code-named Wallaby (the upstream version); see Openstack.org.
OpenStack
- Check the logs for any errors happening in OpenStack, both on the Controller and the Compute nodes. Logs to check are:
  - /var/log/containers/metrics_qdr/metrics_qdr.log on all the overcloud nodes
  - /var/log/containers/collectd/collectd.log on all overcloud nodes
  - /var/log/containers/collectd/sensubility.log on all overcloud nodes
  - /var/log/containers/ceilometer/compute.log on all Compute nodes
  - /var/log/containers/ceilometer/central.log on the Controller nodes
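A quick way to scan these logs for recent errors is the loop below; a minimal sketch, run directly on an overcloud node (paths taken from the list above, adjust if your deployment differs):

```
# Hypothetical helper: print the last few error lines from each telemetry log.
for f in /var/log/containers/metrics_qdr/metrics_qdr.log \
         /var/log/containers/collectd/collectd.log \
         /var/log/containers/collectd/sensubility.log \
         /var/log/containers/ceilometer/*.log; do
    [ -f "$f" ] || continue
    echo "== $f =="
    grep -iE 'error|critical' "$f" | tail -n 5
done
```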
All the services above, like collectd, ceilometer and metrics_qdr, run as containers managed by systemd. List the relevant containers and make sure none of them have stopped; if any have, check why:
```
podman ps -a -f name=collect -f name=metric -f name=ceilometer
```
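Since the containers are managed by systemd, a stopped container can also be inspected and restarted through its unit. A sketch, assuming the tripleo_<service> unit naming; verify the exact names on your node first:

```
systemctl list-units 'tripleo_*' --all            # confirm the unit names (assumed naming convention)
systemctl status tripleo_collectd tripleo_metrics_qdr
journalctl -u tripleo_collectd --since "-1h"      # look for the reason a container stopped
systemctl restart tripleo_collectd                # restart a stopped service
```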
- All the services above have an option in their config to enable debug logging, but do note that it can dump a lot of output to the log files.
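As an illustration for the QDR, debug logging is controlled through the log section of qdrouterd.conf; a sketch of that section is below (the host path to the config can vary per deployment, and this level is very verbose, so revert it when done):

```
log {
    module: DEFAULT
    enable: debug+        # very verbose; switch back to info+ after troubleshooting
    includeTimestamp: yes
}
```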
Checking the metrics sent from the overcloud to the AMQ Interconnect is a good start to see whether metrics are being sent. Follow the commands below:
```
# listener=$(podman exec -it metrics_qdr cat /etc/qpid-dispatch/qdrouterd.conf | grep -A2 listener | grep host | awk -F': ' '{print $2}' | tr -d '\r')
# podman exec -it metrics_qdr qdstat -b $listener:5666 -c
# podman exec -it metrics_qdr qdstat -b $listener:5666 -a
# podman exec -it metrics_qdr qdstat -b $listener:5666 --links
```
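Here `-c` lists the connections, `-a` lists the router addresses with their delivery counts, and `--links` lists the individual AMQP links; `-b` points qdstat at the listener address extracted from the router config.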
These are the established connections:

```
2023-03-16 06:36:55.606130 UTC
Router.controller-0.localdomain
Connections
id host container role proto dir security authentication tenant last dlv uptime
============================================================================================================================================================================================================================
15937 172.24.33.235:55546 openstack.org/om/container/controller-0/ceilometer-agent-notification/13/93d8ad26dc014fa5967aba7aac99ad8f normal amqp in no-security no-auth - 003:07:45:14
25179 172.24.33.235:42676 metrics normal amqp in no-security anonymous-user 000:00:00:03 000:00:28:08
25181 172.24.33.235:38626 controller-0.internalapi.localdomain-infrawatch-out-1678946927 normal amqp in no-security no-auth - 000:00:28:07
25180 172.24.33.235:38624 controller-0.internalapi.localdomain-infrawatch-in-1678946927 normal amqp in no-security anonymous-user - 000:00:28:07
25256 172.24.33.235:47784 1c98a314-ffbf-4a36-a1c3-08d07fead890 normal amqp in no-security no-auth 000:00:00:00 000:00:00:00
```

See above that both ceilometer and collectd (metrics) are connected.
The output below is the most important, as it shows the `in` and `out` delivery counts. You want to see these values growing; a sketch for verifying that follows the output.

```
2023-03-16 06:36:11.909805 UTC
Router.controller-0.localdomain
Router Addresses
class addr phs distrib pri local remote in out thru fallback
===============================================================================================================================
local $_management_internal closest - 0 0 0 0 0 0
mobile $management 0 closest - 0 0 67,583 0 0 0
local $management closest - 0 0 0 0 0 0
local _edge closest - 0 0 60,941,727 21,628,862 0 0
mobile anycast/ceilometer/cloud1-event.sample 0 multicast - 0 0 0 0 0 0
mobile anycast/ceilometer/cloud1-metering.sample 0 multicast - 0 0 0 0 0 0
mobile sensubility/cloud1-telemetry 0 balanced - 0 0 0 0 0 0
local temp.YfviR9tyb+1HWgq balanced - 1 0 0 0 0 0
local temp.yZcrl7V0NyH49jI balanced - 1 0 0 1 0 0
```
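To confirm the `in`/`out` counters are actually growing rather than static, you can sample the address table twice and compare; a minimal sketch, reusing the `$listener` variable from above (`-it` is dropped because no TTY is needed):

```
podman exec metrics_qdr qdstat -b $listener:5666 -a > /tmp/qdstat-a.before
sleep 60
podman exec metrics_qdr qdstat -b $listener:5666 -a > /tmp/qdstat-a.after
diff /tmp/qdstat-a.before /tmp/qdstat-a.after   # changes in the in/out columns mean traffic is flowing
```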
The router links from the same router:

```
2023-03-16 06:31:29.407910 UTC
Router.controller-0.localdomain
Router Links
type dir conn id id peer class addr phs cap pri undel unsett deliv presett psdrop acc rej rel mod delay rate stuck cred blkd
==========================================================================================================================================================================================================
endpoint out 15937 40975 local temp.YfviR9tyb+1HWgq 250 0 0 0 0 0 0 0 0 0 0 0 0 0 200 -
endpoint in 15937 40976 mobile anycast/ceilometer/cloud1-metering.sample 0 250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 07:39:48
endpoint in 15937 40977 mobile anycast/ceilometer/cloud1-event.sample 0 250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 07:39:23
endpoint in 25179 59460 250 0 0 0 207607 0 0 0 0 207607 0 0 0 0 250 -
endpoint in 25181 59461 mobile sensubility/cloud1-telemetry 0 250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00:22:27
endpoint in 25226 59550 mobile $management 0 250 0 0 0 2 0 0 2 0 0 0 0 0 0 250 -
endpoint out 25226 59551 local temp.Kzpzn5bHefs8RA6 250 0 0 0 1 1 0 0 0 0 0 0 0 0 1 -
```
And from the OCP side:
```
POD=$(oc get pods -l application=default-interconnect -o custom-columns=POD:.metadata.name --no-headers)
oc exec -it $POD -- qdstat -c
oc exec -it $POD -- qdstat -a
```
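If the overcloud side shows traffic but nothing arrives in STF, compare the same addresses on the OCP router; a small sketch, assuming the cloud1 naming from the outputs above (`-it` dropped since the output is piped):

```
oc exec $POD -- qdstat -a | grep -E 'cloud1|sensubility'
```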
OpenShift
- Check the pods in the service-telemetry project:

```
oc get pods
NAME READY STATUS RESTARTS AGE
alertmanager-default-0 3/3 Running 0 2d4h
default-cloud1-ceil-event-smartgateway-6c9648c5cb-st8gd 2/2 Running 13 (2d4h ago) 26d
default-cloud1-ceil-meter-smartgateway-67684d88c8-98qgz 3/3 Running 2 (2d4h ago) 2d4h
default-cloud1-coll-event-smartgateway-7d7675cdd7-lslg9 2/2 Running 13 (2d4h ago) 26d
default-cloud1-coll-meter-smartgateway-76d5ff6db5-m5fs7 3/3 Running 10 (2d4h ago) 33d
default-cloud1-sens-meter-smartgateway-7b59669fdd-rzf25 3/3 Running 12 (2d4h ago) 33d
default-interconnect-845c4b647c-vgtr9 1/1 Running 1 2d4h
elastic-operator-585f6dff6d-c4cfc 1/1 Running 326 (5h49m ago) 41d
elasticsearch-es-default-0 0/1 CrashLoopBackOff 2 (7s ago) 2m18s
grafana-deployment-6b74678ddd-4qt6k 2/2 Running 0 6d15h
grafana-operator-controller-manager-5bd4bd7556-nklmb 2/2 Running 7 (2d4h ago) 2d4h
interconnect-operator-99dc7f8d8-kvfmh 1/1 Running 14 (6d15h ago) 41d
prometheus-default-0 3/3 Running 5 20d
prometheus-operator-b5d479c56-9cm5n 1/1 Running 5 (6d16h ago) 41d
service-telemetry-operator-f9ddb898c-6hggz 1/1 Running 1 2d4h
smart-gateway-operator-79557664f8-jf56r 1/1 Running 4 (6d16h ago) 41d
```
- In my case, you can see above that elasticsearch-es-default-0 is in the CrashLoopBackOff state. Run `oc logs elasticsearch-es-default-0` to investigate further.
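For a crash-looping pod, the pod events and the previous container's logs usually point at the cause; for example:

```
oc describe pod elasticsearch-es-default-0        # events often show OOMKilled, failed mounts, probe failures
oc logs elasticsearch-es-default-0 --previous     # logs from the crashed attempt
```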