Telemetry and visualizations
------------------------------

The telemetry feature sets up Omnia telemetry (to poll values from all Omnia-provisioned nodes in the cluster) and/or iDRAC telemetry (to poll values from all eligible iDRACs in the cluster). It also installs `Grafana `_ and `Loki `_ as Kubernetes pods.

.. note:: To enable the telemetry feature in Omnia, add ``telemetry`` to ``software_config.json``.

To initiate telemetry support, fill out the following parameters in ``input/telemetry_config.yml``:

.. csv-table:: Parameters
   :file: ../Tables/telemetry_config.csv
   :header-rows: 1
   :keepspace:

.. [1] Boolean parameters do not need to be passed with double or single quotes.

.. note::

    The ``input/telemetry_config.yml`` file is encrypted during the execution of the ``omnia.yml`` playbook. Use the following command to edit the encrypted input file: ::

        ansible-vault edit telemetry_config.yml --vault-password-file .telemetry_vault_key

After you have executed ``discovery_provision.yml`` and provisioned the cluster, initiate telemetry as part of ``omnia.yml``, which also configures the cluster with the scheduler, storage, and authentication, using the following command: ::

    ansible-playbook omnia.yml -i inventory

Optionally, you can initiate only telemetry using the following command: ::

    ansible-playbook telemetry.yml -i inventory

.. note::

    * To run the ``telemetry.yml`` playbook independently of the ``omnia.yml`` playbook on Intel Gaudi nodes, first execute the ``performance_profile.yml`` playbook. Once it completes, you can run the ``telemetry.yml`` playbook separately.
    * Depending on the type of telemetry initiated, include the following possible groups in the inventory:

        * omnia_telemetry: ``slurm_control_node``, ``slurm_node``, ``login``, ``kube_control_plane``, ``kube_node``, ``etcd``, ``auth_server``
        * idrac_telemetry: ``idrac``
        * k8s_telemetry on Prometheus: ``kube_control_plane``, ``kube_node``, ``etcd``

    * If you would like a local backup of the TimescaleDB used to store telemetry data, `click here <../Utils/timescaledb_utility.html>`_.

After initiation, new iDRACs can be added for ``idrac_telemetry`` acquisition by running the following command: ::

    ansible-playbook add_idrac_node.yml -i inventory

**Modifying telemetry information**

To modify how data is collected from the cluster, modify the variables in ``omnia/input/telemetry_config.yml`` and re-run the ``telemetry.yml`` playbook.

* When ``omnia_telemetry_support`` is set to false, the Omnia Telemetry Acquisition service is stopped on all cluster nodes in the passed inventory.
* When ``omnia_telemetry_support`` is set to true, the Omnia Telemetry Acquisition service is restarted on all cluster nodes in the passed inventory.
* To start or stop the collection of regular metrics, health check metrics, or GPU metrics, update the values of ``collect_regular_metrics``, ``collect_health_check_metrics``, or ``collect_gpu_metrics``. For a list of all metrics collected, `click here `_.

.. note::

    * Currently, changing the ``grafana_username`` and ``grafana_password`` values is not supported via ``telemetry.yml``.
    * If ``idrac_telemetry_support`` is true, the passed inventory should have an ``idrac`` group.
    * If ``omnia_telemetry_support`` is true, the inventory should have the OIM and cluster node groups (as specified in the sample files), along with the optional ``login`` group.
    * If a subsequent run of ``telemetry.yml`` fails, the ``telemetry_config.yml`` file is left unencrypted.
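The inventory requirements above can be checked before a playbook run. The following is a minimal sketch, not part of Omnia: it assumes an INI-style inventory, and it assumes that a telemetry type is satisfied when at least one of its listed groups is present. The function names, the ``k8s_telemetry`` key, and the sample IP addresses are hypothetical.

.. code-block:: python

    import re

    # Groups per telemetry type, copied from the note above.
    TELEMETRY_GROUPS = {
        "omnia_telemetry": {
            "slurm_control_node", "slurm_node", "login",
            "kube_control_plane", "kube_node", "etcd", "auth_server",
        },
        "idrac_telemetry": {"idrac"},
        "k8s_telemetry": {"kube_control_plane", "kube_node", "etcd"},
    }

    # Matches "[group]" as well as "[group:vars]" / "[group:children]".
    GROUP_HEADER = re.compile(r"^\[([^\]:]+)(?::\w+)?\]\s*$")


    def inventory_groups(inventory_text):
        """Return the set of group names declared in an INI inventory."""
        groups = set()
        for line in inventory_text.splitlines():
            match = GROUP_HEADER.match(line.strip())
            if match:
                groups.add(match.group(1))
        return groups


    def missing_telemetry_types(inventory_text, enabled_types):
        """List enabled telemetry types with none of their groups present.

        Assumption of this sketch: one matching group is enough.
        """
        present = inventory_groups(inventory_text)
        return [t for t in enabled_types if not (TELEMETRY_GROUPS[t] & present)]


    # Hypothetical inventory with Kubernetes groups but no idrac group.
    sample_inventory = """
    [kube_control_plane]
    10.5.0.101

    [kube_node]
    10.5.0.102
    """

    print(missing_telemetry_types(sample_inventory,
                                  ["omnia_telemetry", "idrac_telemetry"]))

For this sample, only ``idrac_telemetry`` is reported as missing, which matches the note that an ``idrac`` group is needed when ``idrac_telemetry_support`` is true.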
**To access the Grafana UI**

*Prerequisites*

* ``visualisation_support`` should be set to true when running ``telemetry.yml`` or ``omnia.yml``.

i. Find the IP address of the Grafana service using ``kubectl get svc -n grafana``

.. image:: ../images/grafanaIP.png

ii. Log in to the Grafana UI by connecting to the cluster IP of the Grafana service obtained above via port 5000, that is, ``http://xx.xx.xx.xx:5000/login``

.. image:: ../images/Grafana_login.png

iii. Enter the ``grafana_username`` and ``grafana_password`` as mentioned in ``input/telemetry_config.yml``.

.. image:: ../images/Grafana_Dashboards.png

Loki log collections can be viewed in the Explore section of the Grafana UI.

.. image:: ../images/Grafana_Loki.png

Data sources configured by Omnia are shown below.

.. image:: ../images/GrafanaDatasources.png

**To use Loki for log filtering**

i. Log in to the Grafana UI by connecting to the cluster IP of the Grafana service obtained above via port 5000, that is, ``http://xx.xx.xx.xx:5000/login``

ii. In the Explore page, select **control-plane-loki**.

.. image:: ../images/Grafana_ControlPlaneLoki.png

iii. The log browser allows you to filter logs by job, node, and/or user.

Example ::

    {job="cluster deployment logs"} |= "nodename"
    {job="compute log messages"} |= "nodename" |= "node_username"

**To use Grafana to view telemetry data**

i. Log in to the Grafana UI by connecting to the cluster IP of the Grafana service obtained above via port 5000, that is, ``http://xx.xx.xx.xx:5000/login``

ii. In the Explore page, select **telemetry-postgres**.

.. image:: ../images/Grafana_Telemetry_PostGRES.png

iii. The query builder allows you to create SQL commands that can be used to query the ``omnia_telemetry.metrics`` table. Filter the data required using the following fields:

* **id**: The name of the metric.
* **context**: The type of metric being collected (Regular Metric, Health Check Metric, or GPU Metric).
* **label**: A combined field listing the **id** and **context** row values.
* **value**: The value of the metric at the given timestamp.
* **unit**: The unit of measure of the metric (for example, seconds, kb, or percent).
* **system**: The service tag of the cluster node.
* **hostname**: The hostname of the cluster node.
* **time**: The timestamp at which the metric was polled from the cluster node.

*iDRAC telemetry data from Grafana*

.. image:: ../images/idractelemetry.png

.. note:: If you are more comfortable using SQL queries than the query builder, click **Edit SQL** to provide your query directly. Optionally, the data returned from a query can be viewed as a graph.

**Visualizations**

If ``idrac_telemetry_support`` and ``visualisation_support`` are set to true, Parallel Coordinate graphs can be used to view system statistics.

.. toctree::

    Visualizations/index
    TelemetryMetrics
    MetricInfo
    TimescaleDB
    Prometheus_k8s
    Gaudi_metrics
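The fields of the ``omnia_telemetry.metrics`` table described earlier can be combined into a query for the **Edit SQL** box. The following is a minimal sketch that only builds the query text. It assumes the column names match the field names listed above exactly, and the function name, the hostname ``node001``, and the stored context string ``Regular Metric`` are hypothetical.

.. code-block:: python

    # Columns of omnia_telemetry.metrics, taken from the field list above
    # (assumed to match the real column names exactly).
    METRIC_COLUMNS = (
        "id", "context", "label", "value", "unit", "system", "hostname", "time",
    )


    def build_metrics_query(filters, limit=100):
        """Build a SELECT over omnia_telemetry.metrics with equality filters.

        ``filters`` maps a column name to the value it must equal. The most
        recent rows are returned first.
        """
        clauses = []
        for column, value in filters.items():
            if column not in METRIC_COLUMNS:
                raise ValueError(f"unknown column: {column}")
            # Double any single quote so the value stays a valid SQL literal.
            escaped = str(value).replace("'", "''")
            clauses.append(f"{column} = '{escaped}'")
        where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
        return (
            f"SELECT {', '.join(METRIC_COLUMNS)} FROM omnia_telemetry.metrics"
            f"{where} ORDER BY time DESC LIMIT {int(limit)}"
        )


    # Hypothetical filter: regular metrics polled from one cluster node.
    print(build_metrics_query(
        {"hostname": "node001", "context": "Regular Metric"}, limit=50))

The printed statement can be pasted into **Edit SQL** and adjusted there. Rejecting unknown column names keeps a typo from producing a query that fails only after it reaches the database.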