Merge pull request #1019 from cgoveas/devel

Updating base_vars.yml information
Sujit Jadhav 3 years ago
parent
commit
ddd88c2f35

+ 8 - 0
docs/BEST_PRACTICES.md

@@ -0,0 +1,8 @@
+# Best Practices When Using Omnia
+* Ensure that PowerCap policy is disabled and the BIOS system profile is set to 'Performance' on the Control Plane.
+* Ensure that there is at least 50% (~35G) free space on the Control Plane before running Omnia.
+* Disable SELinux on the Control Plane.
+* Use a [host mapping file](examples/host_mapping_file_os_provisioning.csv) and [device mapping file](examples/mapping_device_file.csv) even when using DHCP configuration to ensure that IP assignments remain persistent across Control Plane reboots.
+* Avoid rebooting the Control Plane wherever possible so that the network configuration is not disturbed.
+* Review the [PreRequisites](PreRequisites) before running Omnia Scripts.
+* If telemetry is to be enabled using Omnia, use AWX to deploy Slurm/Kubernetes.
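
The free-space guideline in the hunk above can be checked before running Omnia. The following is a minimal sketch, not part of Omnia: the function names `free_pct_from_df` and `has_free_space` are hypothetical, and free space is approximated as 100 minus the used percentage reported by `df -P`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `df -P` output on stdin and prints the
# percentage of free space on the first listed filesystem.
free_pct_from_df() {
    # Line 1 is the header; column 5 of line 2 is the used percentage, e.g. "40%".
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Hypothetical helper: succeeds when MOUNT_POINT has at least MIN_PCT percent free.
# The defaults (/ and 50) mirror the guideline; adjust them as needed.
has_free_space() {
    local mount_point="${1:-/}" min_pct="${2:-50}" free_pct
    free_pct=$(df -P "$mount_point" | free_pct_from_df)
    [ "$free_pct" -ge "$min_pct" ]
}
```

A possible use is `has_free_space / 50 || echo "Free up space on / before running Omnia" >&2`, run on the Control Plane before launching `control_plane.yml`.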

File diff suppressed because it is too large
+ 6 - 3
docs/Device_Configuration/Infiniband_Switches.md


+ 3 - 0
docs/Device_Configuration/Servers.md

@@ -38,6 +38,9 @@ After the configurations are validated, the **provision_idrac** file provisions
 >>**Note**:
 >> * The `idrac.yml` file initiates the provisioning of custom ISO on the PowerEdge servers. Wait for some time for the node inventory to be updated on the AWX UI. 
 >> * Due to the latest `catalog.xml` file, Firmware updates may fail for certain components. Omnia execution doesn't get interrupted but an error gets logged on AWX. For now, please download those individual updates manually.
+>> * If a server is connected to an Infiniband Switch via an Infiniband NIC, Omnia will not activate this NIC. Manually activate the NIC using `ifup <IB NIC name>`.
+>> * For servers running LeapOS, run `omnia.yml` to install Infiniband drivers and then activate the IB NIC using `ifup <IB NIC name>`. Alternatively, if the [Leap OSS](http://download.opensuse.org/distribution/leap/15.3/repo/oss/) and [Leap Non OSS](http://download.opensuse.org/distribution/leap/15.3/repo/non-oss/) repositories are installed, use `zypper install -n rdma-core librdmacm1 libibmad5 libibumad3` to install IB NIC drivers before manually bringing up the interface using the command above.
+
 
 ### Provisioning newly added PowerEdge servers in the cluster
 To provision newly added servers, wait till the iDRAC IP addresses are automatically added to the *idrac_inventory*. After the iDRAC IP addresses are added, launch the iDRAC template on the AWX UI to provision CentOS custom OS on the servers.  

File diff suppressed because it is too large
+ 36 - 34
docs/Input_Parameter_Guide/Control_Plane_Parameters/base_vars.md


+ 2 - 0
docs/LIMITATIONS.md

@@ -17,3 +17,5 @@
 * All iDRACs must have the same username and password.
 * OpenSUSE Leap 15.3 is not supported on the Control Plane.
 * Slurm Telemetry is supported only on a single cluster.
+* Omnia does not install Infiniband drivers on compute nodes running LeapOS.
+* Omnia does not activate Infiniband NICs on compute nodes automatically. Steps to enable them manually are provided [here](Device_Configuration/Servers.md).

+ 2 - 1
docs/PreRequisites/Omnia_Control_Plane_PreReqs.md

@@ -1,6 +1,7 @@
 # Pre-requisites Before Running Control Plane
 * Ensure that a stable Internet connection is available on control plane.
-* Rocky 8 is installed on the control plane. 		 
+* Rocky 8 is installed on the control plane.
+* Ensure that the root partition (/) has a minimum of 50% (~35G) free space. 
 * To provision the bare metal servers, download one of the following ISOs for deployment:
     1. [Leap 15.3](https://get.opensuse.org/leap/)
     2. [Rocky 8](https://rockylinux.org/)

+ 0 - 6
docs/Support_Matrix/Software/Operating_Systems/RHEL.md

@@ -1,6 +0,0 @@
-# Red Hat Enterprise Linux
-
-| OS Version 	| Control Plane 	| Compute Nodes 	|
-|------------	|--------------------	|---------------	|
-| 8.4          	|        Yes            	|       Yes        	|
-| 7          	|        No           	|           No    	|

+ 14 - 2
docs/Troubleshooting/FAQ.md

@@ -48,14 +48,26 @@ Resolution:
                 2. For connecting to the internet (Management purposes)
                 3. For connecting to PowerVault (Data Connection)
 
+## Why is the Infiniband NIC down after provisioning the server? <br>
+Omnia does not activate Infiniband NICs. To enable the device manually, use `ifup <IB NIC name>`. 
+>> __Note:__ If your server is running LeapOS, run `omnia.yml` to install IB drivers, then manually enable the devices. Alternatively, if the [Leap OSS](http://download.opensuse.org/distribution/leap/15.3/repo/oss/) and [Leap Non OSS](http://download.opensuse.org/distribution/leap/15.3/repo/non-oss/) repositories are installed, use `zypper install -n rdma-core librdmacm1 libibmad5 libibumad3` to install IB NIC drivers before manually bringing up the interface using the command above.
+
 ## What to do if AWX jobs fail with `Error creating pod: container failed to start, ImagePullBackOff`?
 Potential Cause:<br>
- After running `control_plane.yml`, the AWX image got deleted.<br>
+ After running `control_plane.yml`, the AWX image got deleted because of insufficient disk space (use `df -h` to diagnose the issue).<br>
 Resolution:<br>
-    Run the following commands:<br>
+    Delete unnecessary files from the partition and then run the following commands:<br>
     1. `cd omnia/control_plane/roles/webui_awx/files`
     2. `buildah bud -t custom-awx-ee awx_ee.yml`
 
+## Why do pods and images appear to get deleted automatically?
+Potential Cause: <br>
+Lack of space in the root partition (/) causes Linux to clear files automatically (Use `df -h` to diagnose the issue).<br>
+Resolution:
+* Delete large, unused files to clear the root partition (use the command `find / -xdev -size +5M | xargs ls -lh | sort -h -k5` to identify these files; `sort -h` orders the human-readable sizes printed by `ls -lh` correctly). Before running Omnia Control Plane, it is recommended to have a minimum of 50% free space in the root partition.
+* Once the partition is cleared, run `kubeadm reset -f`
+* Re-run `control_plane.yml`
+
 ## Why does the task 'control_plane_common: Setting Metric' fail?
 Potential Cause:
     The device name and connection name listed by the network manager in `/etc/sysconfig/network-scripts/ifcfg-<nic name>` do not match.
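
For the "pods and images appear to get deleted automatically" entry in the hunk above, large files can also be listed in a way that tolerates spaces in file names. This is a sketch under stated assumptions: `largest_files` is a hypothetical helper name, and the `-printf` option requires GNU `find`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the COUNT largest regular files under DIR that
# exceed MIN_SIZE, largest first, as "<size in bytes><TAB><path>".
# Assumes GNU find (for -printf); -xdev keeps the search on one filesystem.
largest_files() {
    local dir="${1:-/}" min_size="${2:-5M}" count="${3:-20}"
    # Errors such as "Permission denied" are discarded so the listing stays readable.
    find "$dir" -xdev -type f -size +"$min_size" -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

Calling `largest_files / 5M 20` on the Control Plane would list the 20 largest files above 5M on the root filesystem. Sorting on raw byte counts avoids having to compare human-readable sizes.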

+ 5 - 5
docs/Troubleshooting/Troubleshooting_Guide.md

@@ -1,6 +1,6 @@
 # Logs Used for Troubleshooting
 
-1. /var/log (Control Plane)
+## 1. /var/log (Control Plane)
 
 All log files can be viewed via the Dashboard tab (![Dashboard Icon](../Telemetry_Visualization/Images/DashBoardIcon.PNG)). The Default Dashboard displays `omnia.log` and `syslog`. Custom dashboards can be created per user requirements.
 
@@ -23,7 +23,7 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
 | Zypper Logs        | /var/log/zypper.log                       | Installation Logs            | This log is configured on Leap OS                                                                  |
 
 
-2. Checking logs of individual containers:
+## 2. Checking logs of individual containers:
    1. A list of namespaces and their corresponding pods can be obtained using:
       `kubectl get pods -A`
    2. Get a list of containers for the pod in question using:
@@ -32,7 +32,7 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
       `kubectl logs <pod_name> -n <namespace> -c <container_name>`
 
 
-3. Connecting to internal databases:
+## 3. Connecting to internal databases:
 * TimescaleDB
 	* Go inside the pod: `kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash`
 	* Connect to psql: `psql -U <postgres_username>`
@@ -42,14 +42,14 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
 	* Connect to MySQL: `mysql -u <mysqldb_username> -p` (enter `<mysqldb_password>` when prompted)
 	* Connect to database: `USE <mysqldb_name>`
 
-4. Checking and updating encrypted parameters:
+## 4. Checking and updating encrypted parameters:
    1. Move to the filepath where the parameters are saved (as an example, we will be using `login_vars.yml`):
       `cd control_plane/input_params`
    2. To view the encrypted parameters:
    `ansible-vault view login_vars.yml --vault-password-file .login_vault_key`
    3. To edit the encrypted parameters:
     `ansible-vault edit login_vars.yml --vault-password-file .login_vault_key`
-5. Checking pod status on the control plane
+## 5. Checking pod status on the control plane
     * Select the pod you need to troubleshoot from the output of `kubectl get pods -A`
     * Check the status of the pod by running `kubectl describe pod <pod name> -n <namespace name>`