Merge pull request #1019 from cgoveas/devel

Updating base_vars.yml information
Sujit Jadhav 3 years ago
Parent
Current commit
ddd88c2f35

+ 8 - 0
docs/BEST_PRACTICES.md

@@ -0,0 +1,8 @@
+# Best Practices When Using Omnia
+* Ensure that PowerCap policy is disabled and the BIOS system profile is set to 'Performance' on the Control Plane.
+* Ensure that there is at least 50% (~35G) free space on the Control Plane before running Omnia.
+* Disable SELinux on the Control Plane.
+* Use a [host mapping file](examples/host_mapping_file_os_provisioning.csv) and [device mapping file](examples/mapping_device_file.csv) even when using DHCP configuration to ensure that IP assignments remain persistent across Control Plane reboots.
+* Avoid rebooting the Control Plane wherever possible so that the network configuration is not disturbed.
+* Review the [PreRequisites](PreRequisites) before running Omnia Scripts.
+* If telemetry is to be enabled using Omnia, use AWX to deploy Slurm/Kubernetes.
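
The free-space recommendation above can be checked before running Omnia. The following is a minimal sketch, not part of Omnia: the function names `free_percent` and `has_enough_space` are hypothetical, and it assumes the POSIX `df -P` output layout, where the fifth column is the used capacity as a percentage.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the percentage of free space on a mount point.
# `df -P` keeps each filesystem on one line; column 5 is used capacity ("42%").
free_percent() {
    local mount_point="${1:-/}"
    df -P "$mount_point" | awk 'NR == 2 { gsub("%", "", $5); print 100 - $5 }'
}

# Hypothetical helper: succeed if free space is at least the given percentage.
# The default threshold of 50 mirrors the recommendation in the list above.
has_enough_space() {
    local mount_point="${1:-/}" min_free="${2:-50}"
    [ "$(free_percent "$mount_point")" -ge "$min_free" ]
}

# Report on the root partition without aborting the shell.
if has_enough_space / 50; then
    echo "Root partition has at least 50% free space."
else
    echo "Warning: root partition has only $(free_percent /)% free space."
fi
```

The threshold is a parameter, so the same check can be reused for other partitions.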

File diff suppressed because it is too large
+ 6 - 3
docs/Device_Configuration/Infiniband_Switches.md


+ 3 - 0
docs/Device_Configuration/Servers.md

@@ -38,6 +38,9 @@ After the configurations are validated, the **provision_idrac** file provisions
 >>**Note**:
 >> * The `idrac.yml` file initiates the provisioning of custom ISO on the PowerEdge servers. Wait for some time for the node inventory to be updated on the AWX UI. 
 >> * Due to the latest `catalog.xml` file, Firmware updates may fail for certain components. Omnia execution doesn't get interrupted but an error gets logged on AWX. For now, please download those individual updates manually.
+>> * If a server is connected to an Infiniband Switch via an Infiniband NIC, Omnia will not activate this NIC. Manually activate the NIC using `ifup <IB NIC name>`.
+>> * For servers running LeapOS, run `omnia.yml` to install Infiniband drivers and then activate the IB NIC using `ifup <IB NIC name>`. Alternatively, if the [Leap OSS](http://download.opensuse.org/distribution/leap/15.3/repo/oss/) and [Leap Non OSS](http://download.opensuse.org/distribution/leap/15.3/repo/non-oss/) are installed, use `zypper install -n rdma-core librdmacm1 libibmad5 libibumad3` to install IB NIC drivers before manually bringing up the interface using the command above.
+
 
 ### Provisioning newly added PowerEdge servers in the cluster
 To provision newly added servers, wait till the iDRAC IP addresses are automatically added to the *idrac_inventory*. After the iDRAC IP addresses are added, launch the iDRAC template on the AWX UI to provision CentOS custom OS on the servers.  
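
The note above asks for the Infiniband NIC to be brought up manually with `ifup <IB NIC name>`. A minimal sketch of how the interface names might be discovered first is shown below. It is an illustration, not part of Omnia: `list_ib_nics` and `activate_ib_nics` are hypothetical names, and the script assumes a Linux host, where `/sys/class/net/<interface>/type` holds the ARP hardware type and the value 32 denotes InfiniBand.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list interfaces whose link type is InfiniBand.
# The sysfs root is a parameter so the function can be pointed at a test tree.
list_ib_nics() {
    local sys_root="${1:-/sys}" type_file
    for type_file in "$sys_root"/class/net/*/type; do
        [ -r "$type_file" ] || continue
        # 32 is ARPHRD_INFINIBAND in the Linux kernel headers.
        if [ "$(cat "$type_file")" = "32" ]; then
            basename "$(dirname "$type_file")"
        fi
    done
}

# Hypothetical helper: bring up every detected Infiniband interface.
# Dry run by default; set APPLY=1 to actually call ifup.
activate_ib_nics() {
    local nic
    for nic in $(list_ib_nics "${1:-/sys}"); do
        if [ "${APPLY:-0}" = "1" ]; then
            ifup "$nic"
        else
            echo "would run: ifup $nic"
        fi
    done
}
```

Running `activate_ib_nics` prints the commands it would execute; `APPLY=1 activate_ib_nics` executes them, which requires the Infiniband drivers to be installed already.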

File diff suppressed because it is too large
+ 36 - 34
docs/Input_Parameter_Guide/Control_Plane_Parameters/base_vars.md


+ 2 - 0
docs/LIMITATIONS.md

@@ -17,3 +17,5 @@
 * All iDRACs must have the same username and password.
 * OpenSUSE Leap 15.3 is not supported on the Control Plane.
 * Slurm Telemetry is supported only on a single cluster.
+* Omnia does not install Infiniband drivers on compute nodes running LeapOS.
+* Omnia does not activate Infiniband NICs on compute nodes automatically. Steps to enable them manually are provided [here](Device_Configuration/Servers.md).

+ 2 - 1
docs/PreRequisites/Omnia_Control_Plane_PreReqs.md

@@ -1,6 +1,7 @@
 # Pre-requisites Before Running Control Plane
 * Ensure that a stable Internet connection is available on control plane.
-* Rocky 8 is installed on the control plane. 		 
+* Rocky 8 is installed on the control plane.
+* Ensure that the root partition (/) has a minimum of 50% (~35G) free space. 
 * To provision the bare metal servers, download one of the following ISOs for deployment:
     1. [Leap 15.3](https://get.opensuse.org/leap/)
     2. [Rocky 8](https://rockylinux.org/)

+ 0 - 6
docs/Support_Matrix/Software/Operating_Systems/RHEL.md

@@ -1,6 +0,0 @@
-# Red Hat Enterprise Linux
-
-| OS Version 	| Control Plane 	| Compute Nodes 	|
-|------------	|--------------------	|---------------	|
-| 8.4          	|        Yes            	|       Yes        	|
-| 7          	|        No           	|           No    	|

+ 14 - 2
docs/Troubleshooting/FAQ.md

@@ -48,14 +48,26 @@ Resolution:
                 2. For connecting to the internet (Management purposes)
                 3. For connecting to PowerVault (Data Connection)
 
+## Why is the Infiniband NIC down after provisioning the server? <br>
+Omnia does not activate Infiniband NICs. To enable the device manually, use `ifup <IB NIC name>`. 
+>> __Note:__ If your server is running LeapOS, run `omnia.yml` to install IB drivers and then manually enable the devices. Alternatively, if the [Leap OSS](http://download.opensuse.org/distribution/leap/15.3/repo/oss/) and [Leap Non OSS](http://download.opensuse.org/distribution/leap/15.3/repo/non-oss/) are installed, use `zypper install -n rdma-core librdmacm1 libibmad5 libibumad3` to install IB NIC drivers before manually bringing up the interface using the command above.
+
 ## What to do if AWX jobs fail with `Error creating pod: container failed to start, ImagePullBackOff`?
 Potential Cause:<br>
- After running `control_plane.yml`, the AWX image got deleted.<br>
+ After running `control_plane.yml`, the AWX image was deleted because of insufficient disk space (use `df -h` to diagnose the issue).<br>
 Resolution:<br>
-    Run the following commands:<br>
+    Delete unnecessary files from the partition and then run the following commands:<br>
     1. `cd omnia/control_plane/roles/webui_awx/files`
     2. `buildah bud -t custom-awx-ee awx_ee.yml`
 
+## Why do pods and images appear to get deleted automatically?
+Potential Cause: <br>
+Lack of space in the root partition (/) causes Linux to clear files automatically (use `df -h` to diagnose the issue).<br>
+Resolution:
+* Delete large, unused files to clear the root partition (use the command `find / -xdev -size +5M | xargs ls -lh | sort -h -k5` to identify these files). Before running Omnia Control Plane, it is recommended to have a minimum of 50% free space in the root partition.
+* Once the partition is cleared, run `kubeadm reset -f`
+* Re-run `control_plane.yml`
+
 ## Why does the task 'control_plane_common: Setting Metric' fail?
 Potential Cause:
     The device name and connection name listed by the network manager in `/etc/sysconfig/network-scripts/ifcfg-<nic name>` do not match.

+ 5 - 5
docs/Troubleshooting/Troubleshooting_Guide.md

@@ -1,6 +1,6 @@
 # Logs Used for Troubleshooting
 
-1. /var/log (Control Plane)
+## 1. /var/log (Control Plane)
 
 All log files can be viewed via the Dashboard tab (![Dashboard Icon](../Telemetry_Visualization/Images/DashBoardIcon.PNG)). The Default Dashboard displays `omnia.log` and `syslog`. Custom dashboards can be created per user requirements.
 
@@ -23,7 +23,7 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
 | Zypper Logs        | /var/log/zypper.log                       | Installation Logs            | This log is configured on Leap OS                                                                  |
 
 
-2. Checking logs of individual containers:
+## 2. Checking logs of individual containers:
    1. A list of namespaces and their corresponding pods can be obtained using:
       `kubectl get pods -A`
    2. Get a list of containers for the pod in question using:
@@ -32,7 +32,7 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
       `kubectl logs pod/<pod_name> -n <namespace> -c <container_name>`
 
 
-3. Connecting to internal databases:
+## 3. Connecting to internal databases:
 * TimescaleDB
 	* Go inside the pod: `kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash`
 	* Connect to psql: `psql -U <postgres_username>`
@@ -42,14 +42,14 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
 	* Connect to psql: `psql -U <mysqldb_username> -p <mysqldb_password>`
 	* Connect to database: `USE <mysqldb_name>`
 
-4. Checking and updating encrypted parameters:
+## 4. Checking and updating encrypted parameters:
    1. Move to the filepath where the parameters are saved (as an example, we will be using `login_vars.yml`):
       `cd control_plane/input_params`
    2. To view the encrypted parameters:
    `ansible-vault view login_vars.yml --vault-password-file .login_vault_key`
    3. To edit the encrypted parameters:
     `ansible-vault edit login_vars.yml --vault-password-file .login_vault_key`
-5. Checking pod status on the control plane
+## 5. Checking pod status on the control plane
     * Select the pod you need to troubleshoot from the output of `kubectl get pods -A`
     * Check the status of the pod by running `kubectl describe pod <pod name> -n <namespace name>`
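
To narrow down which pods need a `kubectl describe`, the output of `kubectl get pods -A` can be filtered first. The sketch below is an illustration only: `unhealthy_pods` is a hypothetical helper, and it assumes the default column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# "<namespace> <pod> <status>" for pods that are neither Running nor Completed.
# NR > 1 skips the header row; $4 is the STATUS column in the default layout.
unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Example usage on the control plane (requires kubectl and cluster access):
#   kubectl get pods -A | unhealthy_pods |
#       while read -r ns pod status; do
#           kubectl describe pod "$pod" -n "$ns"
#       done
```

Because the helper only parses text, it can be tried against saved command output before being used on a live cluster.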