Merge pull request #943 from cgoveas/devel

Updating Docs
Sujit Jadhav 3 years ago
parent
commit
f6918c1c01

+ 4 - 2
docs/INSTALL_OMNIA.md

@@ -52,7 +52,9 @@ To install the Omnia control plane and manage workloads on your cluster using th
 >> 2. `pip uninstall ansible-base` (if ansible 2.9 is installed)
 >> 3. `pip uninstall ansible-core` (if ansible 2.10 > version is installed)
 
-	 
+>> __Note:__ If you are using LeapOS, zypper may need to be updated before Omnia is installed. To update it, run: `zypper update -y`
+
+
 * On the management station, run the following commands to install Git:
 	```
 	dnf install epel-release -y
@@ -93,7 +95,7 @@ git clone -b release https://github.com/dellhpc/omnia.git
 | domain_name                | omnia.test    | Sets the intended domain name                                                                                                                                                                                                                        |
 | realm_name                 | OMNIA.TEST    | Sets the intended realm name                                                                                                                                                                                                                         |
 | directory_manager_password |               | Password authenticating admin level access to the Directory for system   management tasks. It will be added to the instance of directory server   created for IPA. <br> Required Length: 8 characters. <br> The   password must not contain -,\, '," |
-| kerberos_admin_password         |               | "admin" user password for the IPA server on RockyOS. If LeapOS is in use, it is used as the "kerberos admin" user password for 389-ds <br> This field is not relevant to Management Stations running `LeapOS`                                                                                                                                                                                                                            |
+| kerberos_admin_password    |               | "admin" user password for the IPA server on RockyOS. If LeapOS is in use, it is used as the "kerberos admin" user password for 389-ds <br> This field is not relevant to Management Stations running `LeapOS`                                                                                                                                                                                                                            |
 | enable_secure_login_node   |  **false**, true             | Boolean value deciding whether security features are enabled on the Login Node. For more information, see [here](docs/Security/Enable_Security_LoginNode.md).                                                                                                                                                                                                                           |
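
The constraints listed for `directory_manager_password` in the table above can be checked before running the playbooks. The following is a minimal sketch and not part of Omnia: the function name `password_problems` is hypothetical, "Required Length: 8 characters" is read as a minimum length (an assumption), and the forbidden set is read as the four characters `-`, `\`, `'` and `"`, with the commas in the table treated as separators (also an assumption).

```python
# Hypothetical helper, not part of Omnia: checks a candidate value for
# directory_manager_password against the constraints listed in the table.

# Assumption: the commas in the table separate the forbidden characters
# and are not forbidden themselves.
FORBIDDEN_CHARS = set("-\\'\"")

# Assumption: "Required Length: 8 characters" is treated as a minimum.
MIN_LENGTH = 8


def password_problems(password, min_length=MIN_LENGTH):
    """Return a list of problems; an empty list means the value passes."""
    problems = []
    if len(password) < min_length:
        problems.append(f"must be at least {min_length} characters")
    found = sorted(set(password) & FORBIDDEN_CHARS)
    if found:
        problems.append("contains forbidden characters: " + " ".join(found))
    return problems


if __name__ == "__main__":
    for candidate in ("Str0ngPass", "abc-defgh", "short"):
        print(candidate, password_problems(candidate))
```

Returning a list rather than raising on the first failure lets a caller report every problem with the value in one pass.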
 	
 	

File diff suppressed because it is too large
+ 4 - 1
docs/INSTALL_OMNIA_CONTROL_PLANE.md


+ 47 - 22
docs/Telemetry_Visualization/TELEMETRY.md

@@ -1,40 +1,65 @@
-# Viewing Performance Stats on Grafana
+# Setting Up Grafana
 
-Using [Texas Technical University data visualization lab](https://idatavisualizationlab.github.io/HPCC), data polled from iDRAC and Slurm can be processed to generate live graphs. These Graphs can be accessed on the Grafana UI.
+Using Grafana, users can poll multiple devices and create graphs/visualizations of key system metrics such as temperature, System power consumption, Memory Usage, IO Usage, CPU Usage, Total Memory Power, System Output Power, Total Fan Power, Total Storage Power, System Input Power, Total CPU Power, RPM Readings, Total Heat Dissipation, Power to Cool ratio, System Air Flow Efficiency etc.
 
-Once `control_plane.yml` is executed and Grafana is set up, use `telemetry.yml` to initiate the Graphs. Data polled via Slurm and iDRAC is streamed into internal databases. This data is processed to create the 4 graphs listed below.
+Many of these metrics are collected using iDRAC telemetry, which allows you to stream telemetry data from your servers to a centralized log/metrics server. For more information on iDRAC telemetry, click [here](https://github.com/dell/iDRAC-Telemetry-Reference-Tools).
 
->> __Note__: This feature only works on Nodes using iDRACs with a datacenter license running a minimum firmware of 4.0.
+## Prerequisites
 
-## All your data in a glance
+1. To set up Grafana, ensure that `control_plane/input_params/login_vars.yml` is updated with the Grafana Username and Password.
+2. All parameters in `telemetry/input_params/telemetry_login_vars.yml` need to be filled in:
 
-Using the following graphs, data can be visualized to gather correlational information.
-1. [Parallel Coordinates](https://idatavisualizationlab.github.io/HPCC/#ParallelCoordinates) <br>
-Parallel coordinates are a great way to capture a systems status. It shows all ranges of individual metrics like CPU temp, Fan Speed, Memory Usage etc. The graph can be narrowed by time or metric ranges to get specific correlations such as CPU Temp vs Fan Speed etc.
+| Parameter Name        | Default Value | Information |
+|-----------------------|---------------|-------------|
+| timescaledb_user      | 		        |  Username used for connecting to timescale db. Minimum Length: 2 characters.          |
+| timescaledb_password  | 		        |  Password used for connecting to timescale db. Minimum Length: 2 characters.           |
+| mysqldb_user          | 		        |  Username used for connecting to mysql db. Minimum Length: 2 characters.         |
+| mysqldb_password      | 		        |  Password used for connecting to mysql db. Minimum Length: 2 characters.            |
+| mysqldb_root_password | 		        |  Password used for connecting to mysql db for root user. Minimum Length: 2 characters.         |
 
-![Parallel Coordinates](Images/ParallelCoordinates.png)
+3. All parameters in `telemetry/input_params/telemetry_base_vars.yml` need to be filled in:
 
-<br>
+| Parameter Name          | Default Value     | Information |
+|-------------------------|-------------------|-------------|
+| idrac_telemetry_support | true              | This variable is used to enable iDRAC telemetry support and visualizations. Accepted Values: true/false            |
+| slurm_telemetry_support | true              | This variable is used to enable Slurm telemetry support and visualizations. Slurm telemetry support can only be activated when `idrac_telemetry_support` is set to true. Accepted Values: true/false        |
+| timescaledb_name        | telemetry_metrics | Postgres DB with timescale extension is used for storing iDRAC and slurm telemetry metrics.            |
+| mysqldb_name            | idrac_telemetrysource_services_db | MySQL DB is used to store the IPs and credentials of iDRACs that have a datacenter license.           |
 
-2. [Spiral Layout](https://idatavisualizationlab.github.io/HPCC/#Spiral_Layout) <br>
-Spiral Layouts are best for viewing the change in a single metric over time. It is often used to check trends in metrics over a business day. Data visualized in this graph can be sorted using other metrics like Job IDs etc to understand the pattern of utilization on your devices.
+4. Find the IP of the Grafana UI using:
+ 
+`kubectl get svc -n grafana`
 
-![Spiral Layout](Images/Spirallayout.gif)
+## Logging into Grafana
 
-<br>
+Use any one of the following browsers to access the Grafana UI (https://< Grafana UI IP >:5000):
+* Chrome/Chromium
+* Firefox
+* Safari
+* Microsoft Edge
 
-3. [Sankey Viewer](https://idatavisualizationlab.github.io/HPCC/#SankeyViewer) <br>
-Sankey Viewers are perfect for viewing utilization by nodes/users/jobs. It provides point in time information for quick troubleshooting.
+>> __Note:__ Always enable JavaScript in your browser. Running Grafana without JavaScript enabled in the browser is not supported.
 
-![Sankey Viewer](Images/SankeyViewer.png)
+## Prerequisites to Enabling Slurm Telemetry
 
-<br>
+* Slurm Telemetry cannot be executed without iDRAC support.
+* Omnia control plane should be executed and node_inventory should be created in AWX.
+* The Slurm manager and compute nodes are fetched at run time from node_inventory.
+* Slurm must be installed on the nodes; otherwise, executing Slurm telemetry serves no purpose.
+* A minimum of one cluster is required for Slurm Telemetry to work.
+* Once telemetry is running, delete the pods and images on control plane if a cluster change is intended.
 
-4. [Power Map](https://idatavisualizationlab.github.io/HPCC/#PowerMap) <br>
-Power Maps are an excellent way to see utilization along the axis of time for different nodes/users/jobs. Hovering over the graph allows the user to narrow down information by Job/User or Node.
+## Initiating Telemetry
 
-![Power Map](Images/PowerMap.png)
+1. Once `control_plane.yml` and `omnia.yml` are executed, run the following command from `omnia/telemetry`:
 
-<br>
+`ansible-playbook telemetry.yml`
 
+>> __Note:__ Telemetry Collection is only initiated on iDRACs on AWX that have a datacenter license and are running a firmware version of 4 or higher.
 
+## Adding a New Node to Telemetry
+After initiation, new nodes can be added to telemetry by running the following command from `omnia/telemetry`:
+
+`ansible-playbook add_idrac_node.yml`
+
+	
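
The parameter rules in the `telemetry_login_vars.yml` and `telemetry_base_vars.yml` tables above can be sketched as a pre-flight check to run before `telemetry.yml`. This is an illustrative sketch only, not part of Omnia: the function name `check_telemetry_vars` is hypothetical, and the inputs are assumed to be dictionaries already parsed from the two YAML files, with the support flags loaded as booleans.

```python
# Hypothetical pre-flight check, not part of Omnia. It mirrors two rules
# from the tables: each login value needs at least 2 characters, and
# slurm_telemetry_support can only be true when idrac_telemetry_support is.

MIN_LENGTH = 2

LOGIN_KEYS = (
    "timescaledb_user",
    "timescaledb_password",
    "mysqldb_user",
    "mysqldb_password",
    "mysqldb_root_password",
)


def check_telemetry_vars(login_vars, base_vars):
    """Return a list of problems; an empty list means the inputs pass."""
    problems = []
    for key in LOGIN_KEYS:
        # A missing or empty value is treated as a zero-length string.
        value = login_vars.get(key) or ""
        if len(str(value)) < MIN_LENGTH:
            problems.append(f"{key} must be at least {MIN_LENGTH} characters")
    # Both flags default to true, as listed in the base vars table.
    # Assumption: the values are real booleans, not the strings "true"/"false".
    idrac = bool(base_vars.get("idrac_telemetry_support", True))
    slurm = bool(base_vars.get("slurm_telemetry_support", True))
    if slurm and not idrac:
        problems.append(
            "slurm_telemetry_support requires idrac_telemetry_support to be true"
        )
    return problems


if __name__ == "__main__":
    login = {key: "example" for key in LOGIN_KEYS}
    base = {"idrac_telemetry_support": False, "slurm_telemetry_support": True}
    for problem in check_telemetry_vars(login, base):
        print(problem)
```

Collecting all problems in a list lets every misconfigured parameter be reported in a single run.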

+ 24 - 50
docs/Telemetry_Visualization/Visualization.md

@@ -1,70 +1,44 @@
-# Setting Up Grafana
+# Viewing Performance Stats on Grafana
 
-Using Grafana, users can poll multiple devices and create graphs/visualizations of key system metrics such as temperature, System power consumption, Memory Usage, IO Usage, CPU Usage, Total Memory Power, System Output Power, Total Fan Power, Total Storage Power, System Input Power, Total CPU Power, RPM Readings, Total Heat Dissipation, Power to Cool ratio, System Air Flow Efficiency etc.
+Using [Texas Tech University data visualization lab](https://idatavisualizationlab.github.io/HPCC), data polled from iDRAC and Slurm can be processed to generate live graphs. These graphs can be accessed on the Grafana UI.
 
-A lot of these metrics are collected using iDRAC telemetry. iDRAC telemetry allows you to stream telemetry data from your servers to a centralized log/metrics servers. For more information on iDRAC telemetry, click [here]( https://github.com/dell/iDRAC-Telemetry-Reference-Tools).
+Once `control_plane.yml` is executed and Grafana is set up, use `telemetry.yml` to initiate the graphs. Data polled via Slurm and iDRAC is streamed into internal databases. This data is processed to create the 4 graphs listed below.
 
-## Prerequisites
+>> __Note__: This feature only works on Nodes using iDRACs with a datacenter license running a minimum firmware of 4.0.
 
-1. To set up Grafana, ensure that `control_plane/input_params/login_vars.yml` is updated with the Grafana Username and Password.
-2. All parameters in `telemetry/input_params/login_vars.yml` need to be filled in:
+## All your data at a glance
 
-| Parameter Name        | Default Value | Information |
-|-----------------------|---------------|-------------|
-| timescaledb_user      | 		        |  Username used for connecting to timescale db. Minimum Length: 2 characters.          |
-| timescaledb_password  | 		        |  Password used for connecting to timescale db. Minimum Length: 2 characters.           |
-| mysqldb_user          | 		        |  Username used for connecting to mysql db. Minimum Length: 2 characters.         |
-| mysqldb_password      | 		        |  Password used for connecting to mysql db. Minimum Length: 2 characters.            |
-| mysqldb_root_password | 		        |  Password used for connecting to mysql db for root user. Minimum Legth: 2 characters.         |
+Using the following graphs, data can be visualized to gather correlational information. These graphs refresh every 5 seconds (except SankeyViewer).
 
-3. All parameters in `telemetry/input_params/base_vars.yml` need to be filled in:
+>> __Note:__ The timestamps used for the time metric are based on the `timezone` set in `control_plane/input_params/base_vars.yml`. 
 
-| Parameter Name          | Default Value     | Information |
-|-------------------------|-------------------|-------------|
-| mount_location          | /opt/omnia 		  | Sets the location all telemetry related files will be stored and both timescale and mysql databases will be mounted.            |
-| idrac_telemetry_support | true              | This variable is used to enable iDRAC telemetry support and visualizations. Accepted Values: true/false            |
-| slurm_telemetry_support | true              | This variable is used to enable slurm telemetry support and visualizations. Slurm Telemetry support can only be activated when idrac_telemetry_support is set to true. Accepted Values: True/False.        |
-| timescaledb_name        | telemetry_metrics | Postgres DB with timescale extension is used for storing iDRAC and slurm telemetry metrics.            |
-| mysqldb_name			  | idrac_telemetrysource_services_db | MySQL DB is used to store IPs and credentials of iDRACs having datacenter license           |
+1. [Parallel Coordinates](https://idatavisualizationlab.github.io/HPCC/#ParallelCoordinates) <br>
+Parallel coordinates are a great way to capture a system's status. They show all ranges of individual metrics like CPU temp, Fan Speed, Memory Usage etc. The graph can be narrowed by time or metric ranges to get specific correlations such as CPU Temp vs Fan Speed.
 
-3. Find the IP of the Grafana UI using:
- 
-`kubectl get svc -n grafana`
+![Parallel Coordinates](Images/ParallelCoordinates.png)
 
-## Logging into Grafana
+<br>
 
-Use any one of the following browsers to access the Grafana UI (https://< Grafana UI IP >:5000):
-* Chrome/Chromium
-* Firefox
-* Safari
-* Microsoft Edge
+2. [Spiral Layout](https://idatavisualizationlab.github.io/HPCC/#Spiral_Layout) <br>
+Spiral Layouts are best for viewing the change in a single metric over time. They are often used to check trends in metrics over a business day. Data visualized in this graph can be sorted using other metrics, such as Job IDs, to understand the pattern of utilization on your devices.
 
->> __Note:__ Always enable JavaScript in your browser. Running Grafana without JavaScript enabled in the browser is not supported.
+![Spiral Layout](Images/Spirallayout.gif)
 
-## Prerequisites to Enabling Slurm Telemetry
+<br>
 
-* Slurm Telemetry cannot be executed without iDRAC support
-* Omnia control plane should be executed and node_inventory should be created in awx.
-* The slurm manager and compute nodes are fetched at run time from node_inventory.
-* Slurm should be installed on the nodes, if not there is no point in executing slurm telemetry.
+3. [Sankey Viewer](https://idatavisualizationlab.github.io/HPCC/#SankeyViewer) <br>
+Sankey Viewers are perfect for viewing utilization by nodes/users/jobs. They provide point-in-time information for quick troubleshooting.
 
-## Initiating Telemetry
-
-1. Once `control_plane.yml` and `omnia.yml` are executed, run the following commands from `omnia/telemetry`:
-
-`ansible-playbook telemetry.yml`
-
->> __Note:__ Telemetry Collection is only initiated on iDRACs on AWX that have a datacenter license and are running a firmware version of 4 or higher.
-
-## Adding a New Node to Telemetry
-After initiation, new nodes can be added to telemetry by running the following commands from `omnia/telemetry`:
-		
-` ansible-playbook add_idrac_node.yml `
-
-	
+>> __Note:__ Due to the tremendous data processing undertaken by SankeyViewer, the graph does not auto-refresh. It can be manually refreshed by refreshing the browser tab or by clicking the refresh button in the top-right corner of the page.
 
+![Sankey Viewer](Images/SankeyViewer.png)
 
+<br>
 
+4. [Power Map](https://idatavisualizationlab.github.io/HPCC/#PowerMap) <br>
+Power Maps are an excellent way to see utilization along the axis of time for different nodes/users/jobs. Hovering over the graph allows the user to narrow down information by Job/User or Node.
 
+![Power Map](Images/PowerMap.png)
 
+<br>
 

+ 7 - 2
docs/control_plane/device_templates/CONFIGURE_INFINIBAND_SWITCHES.md

@@ -3,7 +3,12 @@ In your HPC cluster, connect the Mellanox InfiniBand switches using the Fat-Tree
 
 Omnia uses the server-based Subnet Manager (SM). SM runs as a Kubernetes pod on the management station. To enable the SM, Omnia configures the required parameters in the `opensm.conf` file. Based on the requirement, the parameters can be edited.  
 
->>**NOTE**: Install the InfiniBand hardware drivers by running the command: `yum groupinstall "Infiniband Support" -y`.  
+>>**NOTE**: Install the InfiniBand hardware drivers by running the command below that matches your OS:  
+>> * `yum groupinstall "Infiniband Support" -y` (For Rocky)
+>> * `zypper install rdma-core librdmacm1 libibmad5 libibumad3` (For LeapOS)
+
+>> **NOTE:** When using LeapOS, InfiniBand commands such as `sminfo` and `ibhosts` only run correctly within the InfiniBand container.
+
 
 ## Setting up a new or factory reset switch
 
@@ -25,7 +30,7 @@ When connecting to a new or factory reset switch, the configuration wizard reque
 * **(Recommended)** If the user enters 'no', they still have to provide the admin and monitor passwords. 
 * If the user enters 'yes', they will also be prompted to enter the hostname for the switch, DHCP details, IPv6 details, etc.
 
->> **Note:** When initializing a factory reset switch, the user needs to ensure DHCP is enabled and an IPv6 address is not assigned. Omnia will assign an IP address to the Infiniband switch using DHCP with all other configurations.
+>> **NOTE:** When initializing a factory reset switch, the user needs to ensure DHCP is enabled and an IPv6 address is not assigned. Omnia will assign an IP address to the Infiniband switch using DHCP with all other configurations.
 
 ## Edit the "input_params" file 
 Under the `control_plane/input_params` directory, edit the following files: