
Merge pull request #470 from cgoveas/devel

Issues #469, #471: Updated Omnia 1.1 documentation
Lucas A. Wilson 3 years ago
parent
commit
7868755bfe
2 changed files with 67 additions and 68 deletions
  1. docs/FAQ.md (+65 -66)
  2. docs/README.md (+2 -2)

+ 65 - 66
docs/FAQ.md

@@ -1,54 +1,49 @@
-# Frequently Asked Questions  
+# Frequently Asked Questions
 
-* TOC
-{:toc}
 
-## Why is the error "Wait for AWX UI to be up" displayed when `appliance.yml` fails?  
-Cause: 
-1. When AWX is not accessible even after five minutes of wait time. 
-2. When __isMigrating__ or __isInstalling__ is seen in the failure message.
+## Why is the error "Wait for AWX UI to be up" displayed when `control_plane.yml` fails?  
+Potential Causes: 
+1. AWX is not accessible even after five minutes of wait time. 
+2. __isMigrating__ or __isInstalling__ is seen in the failure message.
 	
 Resolution:  
-Wait for AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yml` file again, where __management-station-IP__ is the IP address of the management node.
+Wait for AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `control_plane.yml` file again, where __management-station-IP__ is the IP address of the management node.
 
-## What are the next steps after the nodes in a Kubernetes cluster reboot?  
-Resolution: 
-Wait for 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following services:
-* `kubectl get nodes` on the manager node provides the correct k8s cluster status.  
-* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the **Running** state.
-* `kubectl cluster-info` on the manager node displays both k8s master and kubeDNS are in the **Running** state.
+## What to do if the nodes in a Kubernetes cluster reboot?  
+Wait for 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following commands (combined in the sketch after this list):
+* `kubectl get nodes` on the manager node to get the real-time k8s cluster status.  
+* `kubectl get pods --all-namespaces` on the manager node to check that all the pods are in the **Running** state.
+* `kubectl cluster-info` on the manager node to verify that both the k8s master and kubeDNS are in the **Running** state.
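+
+Run together from the manager node, a minimal sketch of these checks looks like:
+```
+# Verify the cluster after a reboot (run on the manager node)
+kubectl get nodes                   # all nodes should report Ready
+kubectl get pods --all-namespaces   # all pods should be Running
+kubectl cluster-info                # k8s master and kubeDNS should be running
+```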
 
 ## What to do when the Kubernetes services are not in the __Running__  state?  
-Resolution:	
-1. Run `kubectl get pods --all-namespaces` to verify the pods are in the **Running** state.
+1. Run `kubectl get pods --all-namespaces` to verify that all pods are in the **Running** state.
 2. If the pods are not in the **Running** state, delete the pods using the command: `kubectl delete pods <name of pod>` (see the sketch after this list).
 3. Run the corresponding playbook that was used to install Kubernetes: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
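+
+A minimal sketch for steps 1 and 2, assuming `kubectl` access on the manager node:
+```
+# List pods that are not Running (this also lists Succeeded pods, such as completed jobs)
+kubectl get pods --all-namespaces --field-selector=status.phase!=Running
+# Delete a stuck pod so that its controller can recreate it
+kubectl delete pods <name of pod> --namespace <namespace of pod>
+```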
 
 ## What to do when the JupyterHub or Prometheus UI is not accessible?  
-Resolution:
 Run the command `kubectl get pods --namespace default` to ensure that the **nfs-client** pod and all Prometheus server pods are in the **Running** state.
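+
+For example, assuming the pod names contain these strings:
+```
+# Show only the nfs-client and Prometheus pods in the default namespace
+kubectl get pods --namespace default | grep -E 'nfs-client|prometheus'
+```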
 
-## While configuring the Cobbler, why does the `appliance.yml` fail with an error during the Run import command?  
+## While configuring Cobbler, why does the `control_plane.yml` fail during the Run import command?  
 Cause:
-* When the mounted .iso file is corrupt.
+* The mounted .iso file is corrupt.
 	
 Resolution:
 1. Open `/var/log/cobbler/cobbler.log` to view the error.
-2. If the error message is **repo verification failed** then it signifies that the .iso file is not mounted properly.
-3. Verify if the downloaded .iso file is valid and correct.
-4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `appliance.yml`.
+2. If the error message is **repo verification failed**, the .iso file is not mounted properly.
+3. Verify that the downloaded .iso file is valid and correct, for example by comparing checksums (see the sketch after this list).
+4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `control_plane.yml`.
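+
+A minimal sketch for step 3; the .iso path and the availability of a published checksum are assumptions:
+```
+# Compare the local .iso checksum against the value published by the distribution
+sha256sum /path/to/downloaded.iso
+# If the checksums differ, re-download the .iso before rerunning control_plane.yml
+```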
 
-## Why does the PXE boot fail with tftp timeout or service timeout errors?  
-Cause:
-* When RAID is configured on the server.
-* When more than two servers in the same network have Cobbler services running.  
+## Why does PXE boot fail with tftp timeout or service timeout errors?  
+Potential Causes:
+* RAID is configured on the server.
+* Two or more servers in the same network have Cobbler services running.  
 
 Resolution:  
-1. Create a Non-RAID or virtual disk in the server.  
-2. Check if other systems except for the management node has cobblerd running. If yes, then stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
+1. Create a non-RAID or virtual disk on the server.  
+2. Check whether any system other than the management node is running cobblerd. If yes, stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler` (see the sketch below).
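+
+A minimal sketch for step 2, run on each suspect system:
+```
+# Check whether a Cobbler container is running on this system
+docker ps --filter name=cobbler
+# If it is, remove the container and its image
+docker rm -f cobbler
+docker image rm -f cobbler
+```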
 
 ## What to do when the Slurm services do not start automatically after the cluster reboots?  
-Resolution: 
+
 * Manually restart the slurmd services on the manager node by running the following commands:
 ```
 systemctl restart slurmdbd
@@ -57,77 +52,81 @@ systemctl restart prometheus-slurm-exporter
 ```
 * Run `systemctl restart slurmd` on all the compute nodes to manually restart the slurmd service (a per-node sketch follows).
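+
+A minimal sketch for restarting slurmd on every compute node, assuming passwordless SSH from the manager node:
+```
+# sinfo lists the node names known to Slurm; restart slurmd on each one
+for host in $(sinfo -N -h -o "%N" | sort -u); do
+    ssh "$host" systemctl restart slurmd
+done
+```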
 
-## What to do when the Slurm services fail? 
-Cause: The `slurm.conf` is not configured properly.  
-Resolution:
+## Why do Slurm services fail? 
+
+Potential Cause: The `slurm.conf` is not configured properly. 
+ 
+Recommended Actions:
 1. Run the following commands:
 ```
 slurmdbd -Dvvv
 slurmctld -Dvvv
 ```
-2. Verify `/var/lib/log/slurmctld.log` file.
+2. Refer to the `/var/lib/log/slurmctld.log` file for more information.
+
+## What causes the "Ports are Unavailable" error?
 
-## How to resolve the "Ports are unavailable" error?
 Cause: Slurm database connection fails.  
-Resolution:
+
+Recommended Actions:
 1. Run the following commands:
 ```
 slurmdbd -Dvvv
 slurmctld -Dvvv
 ```
-2. Verify the `/var/lib/log/slurmctld.log` file.
-3. Verify: `netstat -antp | grep LISTEN`
+2. Refer to the `/var/lib/log/slurmctld.log` file.
+3. Check the output of `netstat -antp | grep LISTEN` for PIDs in the listening state.
 4. If PIDs are in the **Listening** state, kill the processes using those ports (see the sketch after these steps).
 5. Restart all Slurm services:
-```
-slurmctl restart slurmctld on manager node
-systemctl restart slurmdbd on manager node
-systemctl restart slurmd on compute node
-```
+
+`systemctl restart slurmctld` on the manager node
+
+`systemctl restart slurmdbd` on the manager node
+
+`systemctl restart slurmd` on the compute nodes
+
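+A minimal sketch of steps 3 to 5, assuming the default slurmctld port (6817):
+```
+# Find the PID listening on the slurmctld port
+netstat -antp | grep LISTEN | grep 6817
+# Kill the listed PID, then restart the Slurm services as shown above
+kill <PID>
+```
+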
 		
-## What to do if Kubernetes Pods are unable to communicate with the servers when the DNS servers are not responding?  
-Cause: With the host network which is a DNS issue.  
+## Why do Kubernetes Pods stop communicating with the servers when the DNS servers are not responding?
+
+Potential Cause: The host network is faulty, causing the DNS servers to be unresponsive.
+ 
 Resolution:
-1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
-2. In the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16 and ensure that you provide an IP that is not in use in your host network.
-3. Execute omnia.yml and skip slurm using __skip_ tag __slurm__.
+1. In your Kubernetes cluster, run `kubeadm reset -f` on all the nodes.
+2. On the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16. Ensure that the IP provided is not in use on your host network.
+3. Execute omnia.yml, skipping Slurm: `ansible-playbook omnia.yml --skip-tags slurm`
 
-## What to do if the time taken to pull the images to create the Kubeflow containers exceeds the limit and the Apply Kubeflow configurations task fails?  
-Cause: Unstable or slow Internet connectivity.  
+## Why does pulling the images to create the Kubeflow containers time out, causing the 'Apply Kubeflow Configurations' task to fail?
+  
+Potential Cause: Unstable or slow Internet connectivity.  
 Resolution:
-1. Complete the PXE booting/ format the OS on the manager and compute nodes.
-2. In the omnia_config.yml file, change the k8s_cni variable value from calico to flannel.
+1. Complete the PXE booting/formatting of the OS on the manager and compute nodes.
+2. In the omnia_config.yml file, change the k8s_cni variable value from `calico` to `flannel` (see the sketch after these steps).
 3. Run the Kubernetes and Kubeflow playbooks.  
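+
+A minimal sketch for step 2; the exact formatting of the `k8s_cni` line in omnia_config.yml is an assumption:
+```
+# Switch the CNI plugin from calico to flannel in omnia_config.yml
+sed -i 's/k8s_cni: "calico"/k8s_cni: "flannel"/' omnia_config.yml
+```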
 
-## How to resolve the "Permission denied" error while executing the `idrac.yml` file or other .yml files from AWX?
-Cause: The "PermissionError: [Errno 13] Permission denied" error is displayed if you have used the ansible-vault decrypt or encrypt commands.  
+## Why is a permission denied error displayed when executing the `idrac.yml` file or other .yml files from AWX?
+Potential Cause: The "PermissionError: [Errno 13] Permission denied" error is displayed if you have used the ansible-vault decrypt or encrypt commands.  
 Resolution:
-* Provide Chmod 644 permission to the .yml files which is missing the required permission. 
 
-It is suggested that you use the ansible-vault view or edit commands and that you do not use the ansible-vault decrypt or encrypt commands.
+* Update permissions on the relevant .yml file using `chmod 664 <filename>.yml`.
 
-## What to do if LC is not ready?
-Resolution:
-* Ensure LC is in a ready state for all the servers.
+It is recommended to use the `ansible-vault view` or `ansible-vault edit` commands instead of the `ansible-vault decrypt` or `ansible-vault encrypt` commands.
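+
+For example, with a vaulted omnia_config.yml:
+```
+# View or edit the vaulted file without persistently decrypting it
+ansible-vault view omnia_config.yml
+ansible-vault edit omnia_config.yml
+```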
+
+## What to do if the LC is not ready?
+* Ensure the LC (Lifecycle Controller) is in a ready state on all the servers.
 * Launch iDRAC template.
 
 ## What to do if the network CIDR entry of iDRAC IP in /etc/exports file is missing?
-Resolution:
-* Add additional network CIDR range of idrac IP in the */etc/exports* file if iDRAC IP is not in the management network range provided in base_vars.yml.
+* Add an additional network CIDR range of iDRAC IPs in the */etc/exports* file if the iDRAC IP is not in the management network range provided in base_vars.yml (see the sketch below).
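+
+A minimal sketch; the share path, CIDR, and export options shown are examples only:
+```
+# Append an export entry covering the iDRAC network, then re-export all shares
+echo '/var/nfs_share 172.19.0.0/16(rw,sync,no_root_squash)' >> /etc/exports
+exportfs -ra
+```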
 
-## What to do if a custom ISO file is not present in the device?
-Resolution:
+## What to do if a custom ISO file is not present on the device?
 * Re-run the *control_plane.yml* file.
 
 ## What to do if the *management_station_ip.txt* file under *provision_idrac/files* folder is missing?
-Resolution:
 * Re-run the *control_plane.yml* file.
 
 ## Is Disabling 2FA supported by Omnia?
-Resolution:
 * Disabling 2FA is not supported by Omnia; it must be disabled manually.
 
-## The provisioning of PowerEdge servers failed. How to resolve the issue and reprovision the servers?
-Resolution:
+## The provisioning of PowerEdge servers failed. How do I clean up before starting over?
 1. Delete the respective iDRAC IP addresses from the *provisioned_idrac_inventory* on the AWX UI or delete the *provisioned_idrac_inventory* to delete the iDRAC IP addresses of all the servers in the cluster.
 2. Launch the iDRAC template from the AWX UI.

+ 2 - 2
docs/README.md

@@ -49,7 +49,7 @@ The following table lists the software and operating system requirements on the
 
 Requirements  |   Version
 ----------------------------------  |   -------
-OS pre-installed on the management station  |  CentOS 8.3
+OS pre-installed on the management station  |  CentOS 8.4
 OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | CentOS 7.9 2009 Minimal Edition
 Cobbler  |  2.8.5
 Ansible AWX  |  19.1.0
@@ -76,7 +76,7 @@ The following table lists the software and its compatible version managed by Omn
 Software	|	License	|	Compatible Version	|	Description
 -----------	|	-------	|	----------------	|	-----------------
 CentOS Linux release 7.9.2009 (Core)	|	-	|	7.9	|	Operating system on entire cluster except for management station
-CentOS Linux release 8.3.2011	|	-	|	8.3	|	Operating system on the management station	
+CentOS Linux release 8.4.2105	|	-	|	8.4	|	Operating system on the management station	
 MariaDB	|	GPL 2.0	|	5.5.68	|	Relational database used by Slurm
 Slurm	|	GNU General Public	|	20.11.7	|	HPC Workload Manager
 Docker CE	|	Apache-2.0	|	20.10.2	|	Docker Service