
Issue #256: Automate Passwordless SSH during each omnia.yml execution

Signed-off-by: sakshiarora13 <sakshi_arora1@dell.com>
Lucas A. Wilson 4 years ago
parent
commit
29119a6edc

+ 6 - 0
appliance/tools/roles/cluster_preperation/tasks/passwordless_ssh.yml

@@ -18,6 +18,12 @@
     ssh_status: false
     current_host: "{{ item }}"
 
+- name: Refresh ssh-key if changed
+  command: ssh-keygen -R {{ current_host }}
+  changed_when: False
+  ignore_errors: yes
+  when: "'manager' in group_names"
+
 - name: Verify whether passwordless ssh is set on the remote host
   command: ssh -o PasswordAuthentication=no root@"{{ current_host }}" 'hostname'
   register: ssh_output
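The task added above clears any stale entry for the target host from the SSH known_hosts file before the passwordless check runs. For reference only, a roughly equivalent approach could use Ansible's built-in known_hosts module instead of shelling out to ssh-keygen; this is an illustrative sketch and not part of this commit:
```
# Illustrative alternative only -- the commit above uses ssh-keygen -R directly.
- name: Remove stale host key for the current host
  known_hosts:
    name: "{{ current_host }}"   # host whose entry should be dropped from known_hosts
    state: absent
  when: "'manager' in group_names"
```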

+ 5 - 5
docs/INSTALL_OMNIA.md

@@ -11,7 +11,7 @@ __Note:__ The user should have root privileges to perform installations and conf
 
 1. Clone the Omnia repository.
 ``` 
-$ git clone https://github.com/dellhpc/omnia.git 
+git clone https://github.com/dellhpc/omnia.git 
 ```
 __Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. It is recommended that you do not rename this folder.
 
@@ -24,7 +24,7 @@ __Note:__ After the Omnia repository is cloned, a folder named __omnia__ is crea
 ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" 
 ```
 
-5. By default, no skip tags are selected and both Kubernetes and Slurm will be deployed.  
+5. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed.  
 To skip the installation of Kubernetes, enter:  
 `ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "kubernetes"`  
 Similarly, to skip Slurm, enter:  
@@ -32,8 +32,8 @@ Similarly, to skip Slurm, enter:
 __Note:__ If you would like to skip the NFS client setup, enter the following command to skip the k8s_nfs_client_setup role of Kubernetes:  
 `ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "nfs_client"`
 
-6. To provide password for mariaDB Database for Slurm accounting and Kubernetes CNI, edit the `omnia_config.yml` file.  
-__Note:__ Supported Kubernetes CNI : calico and flannel. The default CNI is calico.  
+6. To provide the password for the MariaDB database (used for Slurm accounting) and to set the Kubernetes CNI, edit the `omnia_config.yml` file.  
+__Note:__ Supported values for the Kubernetes CNI are calico and flannel. By default, Omnia uses calico.  
 To view the set passwords of omnia_config.yml at a later time, run the following command:  
 `ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key`
 
@@ -94,4 +94,4 @@ Commands to install JupyterHub and Kubeflow:
 
 ## Adding a new compute node to the cluster
 
-The user has to update the INVENTORY file present in `omnia` directory with the new node IP address in the compute group. Then, `omnia.yml` has to be run to add the new node to the cluster and update the configurations of the manager node.
+The user has to update the INVENTORY file present in the `omnia` directory with the new node IP address under the compute group. Make sure the other nodes that are already part of the cluster are also present in the compute group along with the new node. Then, run `omnia.yml` to add the new node to the cluster and update the configurations of the manager node.

+ 69 - 62
docs/INSTALL_OMNIA_APPLIANCE.md

@@ -1,50 +1,56 @@
 # Install the Omnia appliance
 
 ## Prerequisites
-Ensure that all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) file are met before installing Omnia appliance
+Ensure that all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) file are met before installing the Omnia appliance.
 
-__Note:__ Changing the manager node after the installation of Omnia is not supported by Omnia. If you want to change the manager node, you must redeploy the entire cluster.  
-__Note:__ The user should have root privileges to perform installations and configurations.
+__Note:__ After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
+
+__Note:__ You must have root privileges to perform installations and configurations using the Omnia appliance.
 
 ## Steps to install the Omnia appliance
 __Note:__ If there are errors when any of the following Ansible playbook commands are run, re-run the commands.
 1. On the management node, change the working directory to the directory where you want to clone the Omnia Git repository.
 2. Clone the Omnia repository.
 ``` 
-$ git clone https://github.com/dellhpc/omnia.git 
+git clone https://github.com/dellhpc/omnia.git 
+```
+3. Change the directory to `omnia`
+4. Edit the `omnia_config.yml` file to:  
+	a. Provide the MariaDB database password (used for Slurm accounting) and the Kubernetes CNI under `mariadb_password` and `k8s_cni` respectively.  
+	__Note:__ Supported values for the Kubernetes CNI are calico and flannel. By default, Omnia uses calico.
+	
+	To view the set passwords of `omnia_config.yml`, run the following command.
+```
+ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
 ```
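For reference, the following is a minimal sketch of the two settings referenced in step 4, assuming the key names `mariadb_password` and `k8s_cni` given above; the values shown are placeholders, and the shipped `omnia_config.yml` may contain additional variables and comments.
```
# Illustrative placeholders only -- key names are taken from step 4 above.
mariadb_password: ""      # password for the MariaDB database used by Slurm accounting
k8s_cni: "calico"         # supported values: calico or flannel (calico is the default)
```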
-__Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. It is recommended that you do not rename this folder.
 
-3. Change the directory to `omnia/appliance`
-4. To provide passwords for Cobbler and AWX, edit the `appliance_config.yml` file.
-* To provide a mapping file for DHCP configuration, go to **appliance_config.yml** file and set the variable named **mapping_file_exits** as __true__, else set it to __false__.
+5. Change the directory to `omnia/appliance`
+6. Edit the `appliance_config.yml` file to:  
+	a. Provide passwords for Cobbler and AWX under `provision_password` and `awx_password` respectively.  
+	__Note:__ The password must be a minimum of eight characters and a maximum of 30 characters in length. Do not use these characters while entering a password: -, \\, "", and \'  
+	
+	b. Change the NIC for the DHCP server under `hpc_nic`, and the NIC used to connect to the Internet under `public_nic`. The default values of **hpc_nic** and **public_nic** are set to em1 and em2 respectively.  
+	
+	c. Provide the CentOS-7-x86_64-Minimal-2009 ISO file path under `iso_file_path`. This ISO file is used by Cobbler to provision the OS on the compute nodes.  
+	__Note:__ It is recommended that you do not rename the ISO image file. Also, you **must not** change the path of this ISO image file as the provisioning of the OS on the compute nodes may be impacted.
+	
+	d. Provide a mapping file for DHCP configuration under `mapping_file_path`. The **mapping_file.csv** template file is present under `omnia/examples`. Enter the details in the order: `MAC, Hostname, IP`. The header in the template file must not be deleted before saving the file.  
+	If you want to continue without providing a mapping file, leave the `mapping_file_path` value blank.  
+	__Note:__ Ensure that duplicate values are not provided for MAC, Hostname, and IP in the mapping file. The Hostname should not contain the following characters: , (comma), \. (period), and - (hyphen).
+	
+	e. Provide a valid DHCP range for the HPC cluster under the variables `dhcp_start_ip_range` and `dhcp_end_ip_range`. 
+	
+	To view the set passwords of `appliance_config.yml`, run the following command.
+```
+ansible-vault view appliance_config.yml --vault-password-file .vault_key
+```
 
 Omnia considers the following usernames as default:  
 * `cobbler` for Cobbler Server
 * `admin` for AWX
 * `slurm` for MariaDB
 
-**Note**: 
-* Minimum length of the password must be at least eight characters and a maximum of 30 characters.
-* Do not use these characters while entering a password: -, \\, "", and \'
-
-5. Using the `appliance_config.yml` file, you can change the NIC for the DHCP server under **hpc_nic** and the NIC used to connect to the Internet under **public_nic**. Default values of **hpc_nic** and **public_nic** are set to em1 and em2 respectively.
-6. The valid DHCP range for HPC cluster is set in two variables named __Dhcp_start_ip_range__ and __Dhcp_end_ip_range__ present in the `appliance_config.yml` file.
-7. To provide passwords for mariaDB Database for Slurm accounting and Kubernetes CNI, edit the `omnia_config.yml` file.
-
-__Note:__ Supported Kubernetes CNI : calico and flannel. The default CNI is calico.
-
-To view the set passwords of `appliance_config.yml`, run the following command under omnia->appliance:
-```
-ansible-vault view appliance_config.yml --vault-password-file .vault_key
-```
-
-To view the set passwords of `omnia_config.yml`, run the following command:
-```
-ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
-```
-
-8. To install Omnia, run the following command:
+7. To install Omnia, run the following command.
 ```
 ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"
 ```
@@ -53,71 +59,72 @@ Omnia creates a log file which is available at: `/var/log/omnia.log`.
 
 **Provision operating system on the target nodes**  
 Omnia role used: *provision*  
-Ports used by __Cobbler__:  
-* __TCP__ ports: 80,443,69
-* __UDP__ ports: 69,4011
+Ports used by Cobbler:  
+* TCP ports: 80,443,69
+* UDP ports: 69,4011
 
 To create the Cobbler image, Omnia configures the following:
 * Firewall settings.
-* The kickstart file of Cobbler will enable the UEFI PXE boot.
+* The kickstart file of Cobbler, which enables the UEFI PXE boot.
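As an illustration of the firewall portion of this step, the Cobbler ports listed above could be opened with Ansible's firewalld module roughly as follows. This is a sketch of the idea only, not necessarily how the provision role implements it:
```
# Illustrative sketch only -- opens the Cobbler ports listed above via firewalld.
- name: Open Cobbler service ports
  firewalld:
    port: "{{ item }}"
    permanent: yes
    immediate: yes
    state: enabled
  loop:
    - 80/tcp
    - 443/tcp
    - 69/tcp
    - 69/udp
    - 4011/udp
```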
 
 To access the Cobbler dashboard, enter `https://<IP>/cobbler_web` where `<IP>` is the Global IP address of the management node. For example, enter
 `https://100.98.24.225/cobbler_web` to access the Cobbler dashboard.
 
 __Note__: After the Cobbler Server provisions the operating system on the nodes, IP addresses and host names are assigned by the DHCP service.  
-* If a mapping file is not provided, the hostname to the server is provided based on the following format: **computexxx-xxx** where "xxx-xxx" is the last two octets of Host IP address. For example, if the Host IP address is 172.17.0.11 then he assigned hostname by Omnia is compute0-11.  
-* If a mapping file is provided, the hostnames follow the format provided in the mapping file.
+* If a mapping file is not provided, the server hostname is assigned in the following format: **computexxx-xxx**, where "xxx-xxx" represents the last two octets of the host IP address. For example, if the host IP address is 172.17.0.11, then Omnia assigns the hostname compute0-11.  
+* If a mapping file is provided, the hostnames follow the format provided in the mapping file.  
+
+__Note__: If you want to add more nodes, append the new nodes to the existing mapping file. However, do not modify the entries for the previously provisioned nodes in the mapping file as doing so may impact the existing cluster.  
 
 **Install and configure Ansible AWX**  
 Omnia role used: *web_ui*  
-Port used by __AWX__ is __8081__.  
-AWX repository is cloned from the GitHub path: https://github.com/ansible/awx.git 
+The port used by AWX is __8081__.  
+The AWX repository is cloned from the GitHub path: https://github.com/ansible/awx.git 
 
-Omnia performs the following configuration on AWX:
+Omnia performs the following configurations on AWX:
 * The default organization name is set to **Dell EMC**.
 * The default project name is set to **omnia**.
-* Credential: omnia_credential
-* Inventory: omnia_inventory with compute and manager groups
-* Template: DeployOmnia and Dynamic Inventory
-* Schedules: DynamicInventorySchedule which is scheduled for every 10 mins
+* The credentials are stored in **omnia_credential**.
+* Two groups, **compute** and **manager**, are provided under **omnia_inventory**. You can add hosts to these groups using the AWX UI. 
+* Pre-defined templates are provided: **DeployOmnia** and **DynamicInventory**.
+* **DynamicInventorySchedule**, which runs every 10 minutes, updates the inventory details dynamically. 
 
 To access the AWX dashboard, enter `http://<IP>:8081` where **\<IP>** is the Global IP address of the management node. For example, enter `http://100.98.24.225:8081` to access the AWX dashboard.
 
 **Note**: The AWX configurations are automatically performed by Omnia, and Dell Technologies recommends that you do not change the default configurations provided by Omnia as the functionality may be impacted.
 
-__Note__: Although AWX UI is accessible, hosts will be shown only after few nodes have been provisioned by Cobbler. It takes approximately 10 to 15 minutes to display the host details after the provisioning by Cobbler. If a server is provisioned but you are unable to view the host details on the AWX UI, then you can run **provision_report.yml** playbook from __omnia__ -> __appliance__ ->__tools__ folder to view the hosts which are reachable.
+__Note__: Although the AWX UI is accessible, hosts are shown only after a few nodes have been provisioned by Cobbler. It takes approximately 10 to 15 minutes for the host details to be displayed after provisioning by Cobbler. If a server is provisioned but you are unable to view the host details on the AWX UI, run the following command from the __omnia__ -> __appliance__ -> __tools__ folder to view the hosts that are reachable.
+```
+ansible-playbook -i ../roles/inventory/provisioned_hosts.yml provision_report.yml
+```
 
 ## Install Kubernetes and Slurm using AWX UI
 Kubernetes and Slurm are installed by deploying the **DeployOmnia** template on the AWX dashboard.
 
-1. On the AWX dashboard, under __RESOURCES__ __->__ __Inventories__, select __Groups__.
-2. Select either __compute__ or __manager__ group.
-3. Select the __Hosts__ tab.
-4. To add the hosts provisioned by Cobbler, select __Add__ __->__ __Add__ __existing__ __host__, and then select the hosts from the list and click __Save__.
-5. To deploy Omnia, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ and click __LAUNCH__.
-6. By default, no skip tags are selected and both Kubernetes and Slurm will be deployed. To install only Kubernetes, enter `slurm` and select **Create "slurm"**. Similarly, to install only Slurm, select and add `kubernetes` skip tag. 
+1. On the AWX dashboard, under __RESOURCES__ __->__ __Inventories__, select **omnia_inventory**.
+2. Select __GROUPS__, and then select either __compute__ or __manager__ group.
+3. Select the __HOSTS__ tab.
+4. To add the hosts provisioned by Cobbler, click **+**, and then select **Existing Host**. 
+5. Select the hosts from the list and click __SAVE__.
+6. To deploy Omnia, under __RESOURCES__ -> __Templates__, select __DeployOmnia__, and then click __LAUNCH__.
+7. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed. To install only Kubernetes, enter `slurm` and select the **slurm** skip tag. Similarly, to install only Slurm, enter and select the `kubernetes` skip tag. 
 
 __Note:__
 *	If you would like to skip the NFS client setup, enter `nfs_client` in the skip tag section to skip the **k8s_nfs_client_setup** role of Kubernetes.
 
-7. Click **Next**.
-8. Review the details in the **Preview** window, and click **Launch** to run the DeployOmnia template. 
-
-To establish the passwordless communication between compute nodes and manager node:
-1. In AWX UI, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ template.
-2. From __Playbook dropdown__ menu, select __appliance/tools/passwordless_ssh.yml__ and launch the template.
+8. Click **NEXT**.
+9. Review the details in the **PREVIEW** window, and click **LAUNCH** to run the DeployOmnia template. 
 
 __Note:__ If you want to install __JupyterHub__ and __Kubeflow__ playbooks, you have to first install the __JupyterHub__ playbook and then install the __Kubeflow__ playbook.
 
 __Note:__ To install __JupyterHub__ and __Kubeflow__ playbooks:
-*	From __AWX UI__, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ template.
-*	From __Playbook dropdown__ menu, select __platforms/jupyterhub.yml__ option and launch the template to install JupyterHub playbook.
-*	From __Playbook dropdown__ menu, select __platforms/kubeflow.yml__ option and launch the template to install Kubeflow playbook.
-
+*	From AWX UI, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ template.
+*	From __PLAYBOOK__ dropdown menu, select __platforms/jupyterhub.yml__ and launch the template to install JupyterHub playbook.
+*	From __PLAYBOOK__ dropdown menu, select __platforms/kubeflow.yml__ and launch the template to install Kubeflow playbook.
 
 The DeployOmnia template may not run successfully if:
 - The Manager group contains more than one host.
-- The Compute group does not contain a host. Ensure that the Compute group must be assigned with a minimum of one host node.
+- The Compute group does not contain a host. Ensure that the Compute group is assigned with at least one host node.
 - Under Skip Tags, when both kubernetes and slurm tags are selected.
 
 After **DeployOmnia** template is run from the AWX UI, the **omnia.yml** file installs Kubernetes and Slurm, or either Kubernetes or slurm, as per the selection in the template on the management node. Additionally, appropriate roles are assigned to the compute and manager groups.
@@ -164,11 +171,11 @@ The following __Slurm__ roles are provided by Omnia when __omnia.yml__ file is r
 - **slurm_workers** role:
 	- Installs the Slurm packages into all compute nodes as per the compute node requirements.
 - **slurm_start_services** role: 
-	- Starting the Slurm services so that compute node communicates with manager node.
+	- Starts the Slurm services so that the compute node communicates with the manager node.
 - **slurm_exporter** role: 
 	- Slurm exporter is a package for exporting metrics collected from Slurm resource scheduling system to prometheus.
 	- Slurm exporter is installed on the host like Slurm, and Slurm exporter will be successfully installed only if Slurm is installed.
 
 ## Adding a new compute node to the Cluster
 
-If a new node is provisioned through Cobbler, the node address is automatically displayed on the AWX dashboard. The node is not assigned to any group. You can add the node to the compute group and run `omnia.yml` to add the new node to the cluster and update the configurations in the manager node.
+If a new node is provisioned through Cobbler, the node address is automatically displayed on the AWX dashboard. The node is not assigned to any group. You can add the node to the compute group along with the existing nodes and run `omnia.yml` to add the new node to the cluster and update the configurations in the manager node.

+ 4 - 0
docs/MONITOR_CLUSTERS.md

@@ -69,6 +69,10 @@ __Note:__ If Prometheus is installed on the host, start the Prometheus web serve
 
 Go to http://localhost:9090 to launch the Prometheus UI in the browser.
 
+__Note:__ The Prometheus instance on the host (if it was already installed through Slurm without Kubernetes) is removed when Kubernetes is installed, because Prometheus then runs as a pod.  
+__Note:__ The user can use a single instance of Prometheus when both Kubernetes and Slurm are installed.
+
+
 
 
 

+ 10 - 11
docs/PREINSTALL_OMNIA_APPLIANCE.md

@@ -1,20 +1,19 @@
-# Prerequisites
+# Prerequisites to install the Omnia appliance
 
-Ensure that the following prequisites are met before installing Omnia:
+Ensure that the following prerequisites are met before installing the Omnia appliance:
 * On the management node, install Ansible and Git using the following commands:
 	* `yum install epel-release -y`
-	* `yum install ansible git -y`
-__Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command again.
+	* `yum install ansible git -y`  
+	__Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command.
 * Ensure a stable Internet connection is available on management node and target nodes. 
 * CentOS 7.9 2009 is installed on the management node.
-* To provision the bare metal servers:
-	* Go to http://isoredirect.centos.org/centos/7/isos/x86_64/ and download the **CentOS-7-x86_64-Minimal-2009** ISO file to the following directory on the management node: `omnia/appliance/roles/provision/files`.
-	* Rename the downloaded ISO file to `CentOS-7-x86_64-Minimal-2009.iso`.
-* For DHCP configuration, you can provide a mapping file named mapping_file.csv under __omnia/appliance/roles/provision/files__. The details provided in the CSV file must be in the format: MAC, Hostname, IP. For example, `xx:xx:4B:C4:xx:44,validation01,172.17.0.81` and  `xx:xx:4B:C5:xx:52,validation02,172.17.0.82` are valid entries.
-__Note:__ Duplicate hostnames must not be provided in the mapping file and the hostname should not contain these characters: "_" and "."
-* Connect one of the Ethernet cards on the management node to the HPC switch and one of the ethernet card connected to the global network.
+* To provision the bare metal servers, go to http://isoredirect.centos.org/centos/7/isos/x86_64/ and download the **CentOS-7-x86_64-Minimal-2009** ISO file.
+* For DHCP configuration, you can provide a mapping file. The provided details must be in the format: MAC, Hostname, IP. For example, `xx:xx:4B:C4:xx:44,validation01,172.17.0.81` and  `xx:xx:4B:C5:xx:52,validation02,172.17.0.82` are valid entries.  
+__Note:__ A template for mapping file is available under `omnia/examples`, named `mapping_file.csv`. The header in the template file must not be deleted before saving the file.  
+__Note:__ Ensure that duplicate values are not provided for MAC, Hostname, and IP in the mapping file. The Hostname should not contain the following characters: , (comma), \. (period), and - (hyphen).
+* Connect one of the Ethernet cards on the management node to the HPC switch, and connect another Ethernet card to the global network.
 * If SELinux is not disabled on the management node, disable it from `/etc/sysconfig/selinux` and restart the management node.
-* The default mode of PXE is __UEFI__ and the BIOS Legacy Mode is not supported.
+* The default mode of PXE is __UEFI__, and the BIOS Legacy Mode is not supported.
 * The default boot order for the bare metal servers must be __PXE__.
 * Configuration of __RAID__ is not part of Omnia. If bare metal servers have __RAID__ controller installed then it is mandatory to create **VIRTUAL DISK**.
 

+ 119 - 31
docs/README.md

@@ -19,11 +19,16 @@ Omnia can install Kubernetes or Slurm (or both), along with additional drivers,
 ![Omnia Slurm Stack](images/omnia-slurm.png) 
 
 ## Installing Omnia
-Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [PREINSTALL](PREINSTALL.md) for instructions on network setup.
+Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [PREINSTALL](PREINSTALL_OMNIA.md) for instructions on network setup.
 
-Once servers have functioning OS and networking, you can using Omnia to install and start Slurm and/or Kubernetes. Please see [INSTALL](INSTALL_OMNIA.md) for detailed instructions.
+Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see [INSTALL](INSTALL_OMNIA.md) for detailed instructions.  
 
-# Support Matrix
+## Installing the Omnia appliance
+Ensure all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) are met before installing the Omnia appliance.
+
+For detailed instructions on installing the Omnia appliance, see [INSTALL_OMNIA_APPLIANCE](INSTALL_OMNIA_APPLIANCE.md).
+
+# Requirements Matrix
 
 Software and hardware requirements  |   Version
 ----------------------------------  |   -------
@@ -37,43 +42,128 @@ Kubeflow  |  1
 Prometheus  |  2.23.0
 Supported PowerEdge servers  |  R640, R740, R7525, C4140, DSS8440, and C6420
 
-__Note:__ For more information related to softwares, refer the __Software Supported__ section
+__Note:__ For more information about the supported software and compatible versions, see the **Software Managed by Omnia** section.
 
-## Software Supported
+## Software Managed by Omnia
 Software	|	Licence	|	Compatible Version	|	Description
 -----------	|	-------	|	----------------	|	-----------------
 MariaDB	|	GPL 2.0	|	5.5.68	|	Relational database used by Slurm
 Slurm	|	GNU General Public	|	20.11.2	|	HPC Workload Manager
 Docker CE	|	Apache-2.0	|	20.10.2	|	Docker Service
-nvidia container runtime	|	Apache-2.0	|	3.4.0	|	Nvidia container runtime library
-Python-pip	|	MIT Licence	|	3.2.1	|	Python Package
+NVIDIA container runtime	|	Apache-2.0	|	3.4.2	|	Nvidia container runtime library
+Python PIP	|	MIT Licence	|	3.2.1	|	Python Package
 Python2	|	-	|	2.7.5	|	-
-kubelet	|	Apache-2.0	|	1.16.7	|	Provides external, versioned ComponentConfig API types for configuring the kubelet
-kubeadm	|	Apache-2.0	|	1.16.7	|	"fast paths" for creating Kubernetes clusters
-kubectl	|	Apache-2.0	|	1.16.7	|	Command line tool for kubernetes
-jupyterhub	|	Modified BSD Licence	|	1.1.0	|	Multi-user hub
-kfctl	|	Apache-2.0	|	1.0.2	|	CLI for deploying and managing kubeflow
-kubeflow	|	Apache-2.0	|	1	|	Cloud Native platform for machine learning
-helm	|	Apache-2.0	|	3.5.0	|	Kubernetes Package Manager
-helm chart	|	-	|	0.9.0	|	-
-tensorflow	|	Apache-2.0	|	2.1.0	|	Machine Learning framework
-horovod	|	Apache-2.0	|	0.21.1	|	Distributed deep learning training framework for Tensorflow
+Kubelet	|	Apache-2.0	|	1.16.7	|	Provides external, versioned ComponentConfig API types for configuring the kubelet
+Kubeadm	|	Apache-2.0	|	1.16.7	|	"fast paths" for creating Kubernetes clusters
+Kubectl	|	Apache-2.0	|	1.16.7	|	Command line tool for Kubernetes
+JupyterHub	|	Modified BSD Licence	|	0.9.6	|	Multi-user hub
+Kfctl	|	Apache-2.0	|	1.0.2	|	CLI for deploying and managing Kubeflow
+Kubeflow	|	Apache-2.0	|	1	|	Cloud Native platform for machine learning
+Helm	|	Apache-2.0	|	3.5.0	|	Kubernetes Package Manager
+Helm Chart	|	-	|	0.9.0	|	-
+TensorFlow	|	Apache-2.0	|	2.1.0	|	Machine Learning framework
+Horovod	|	Apache-2.0	|	0.21.1	|	Distributed deep learning training framework for Tensorflow
 MPI	|	Copyright (c) 2018-2019 Triad National Security,LLC. All rights reserved.	|	0.2.3	|	HPC library
-spark	|	Apache-2.0	|	2.4.7	|	Unified analytics engine for large scale data processing
-coreDNS	|	Apache-2.0	|	1.6.7	|	DNS server that chains plugins
-cni	|	Apache-2.0	|	0.3.1	|	Networking for Linux containers
-awx	|	Apache-2.0	|	15.0.0 or latest	|	Web based user interface
-postgreSQL	|	Copyright (c) 1996-2020, PostgreSQL Global Development Group	|	11	|	Database Management System
-redis	|	BSD-3-Clause Licence	|	6.0.8	|	in-memory database
-nginx	|	BSD-2-Clause Licence	|	1.17.0	|	-
-
-### Contributing to Omnia
+CoreDNS	|	Apache-2.0	|	1.6.2	|	DNS server that chains plugins
+CNI	|	Apache-2.0	|	0.3.1	|	Networking for Linux containers
+AWX	|	Apache-2.0	|	15.0.0	|	Web-based User Interface
+PostgreSQL	|	Copyright (c) 1996-2020, PostgreSQL Global Development Group	|	10.15	|	Database Management System
+Redis	|	BSD-3-Clause Licence	|	6.0.10	|	In-memory database
+NGINX	|	BSD-2-Clause Licence	|	1.14	|	-
+
+# Known Issues
+* Hosts are not displayed on the AWX UI.  
+	Resolution:
+	* Verify if `provisioned_hosts.yml` is available under `omnia/appliance/roles/inventory/files`.
+	* If hosts are not listed, then the servers have not been PXE booted yet.
+	* If hosts are listed, then an IP address has been assigned to them by DHCP. However, the PXE boot is still in progress or has not been initiated.
+	* Check for the reachable and unreachable hosts using the tool named `provision_report.yml` provided under `omnia/appliance/tools`. To run provision_report.yml, run the following command under the `omnia/appliance` directory: `ansible-playbook -i roles/inventory/files/provisioned_hosts.yml tools/provision_report.yml`.
+
+# Frequently Asked Questions
+* Why is the error "Wait for AWX UI to be up" displayed when `appliance.yml` fails?  
+	Cause: 
+	1. This error occurs when AWX is not accessible even after five minutes of wait time. 
+	2. The failure message contains __isMigrating__ or __isInstalling__.  
+	
+  Resolution:  
+	Wait for the AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yml` file again.  
+
+* What are the steps to be followed after the nodes in a Kubernetes cluster are rebooted?  
+	Resolution: 
+	Wait for 10 to 15 minutes after the Kubernetes cluster is rebooted. Then, verify the status of the cluster using the following commands:
+	* `kubectl get nodes` on the manager node displays the correct k8s cluster status.  
+	* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the running state.
+	* `kubectl cluster-info` on the manager node displays that both the k8s master and KubeDNS are in the running state.
+
+* What to do when the Kubernetes services are not in the Running state?  
+	Resolution:
+	1. Verify if the pods are in the running state by using the command: `kubectl get pods --all-namespaces`
+	2. If the pods are not in the running state, delete the pods using the command: `kubectl delete pods <name of pod>`
+	3. Run the corresponding playbook: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
+
+* What to do when the JupyterHub or Prometheus UI is not accessible?  
+	Resolution:
+	1. Run the following command to ensure that the **nfs-client** pod and all Prometheus server pods are in the running state: `kubectl get pods --namespace default`
+
+* Why does the `appliance.yml` fail during the Cobbler configuration with an error during the Run import command?  
+	Cause:
+	* Issue occurs when the mounted .iso file is corrupted.
+	
+  Resolution:
+	1. Go to __var__->__log__->__cobbler__->__cobbler.log__ to view the error.
+	2. If the error message is **repo verification failed** then it signifies that the .iso file is not mounted properly.
+	3. Verify if the downloaded .iso file is valid and correct.
+	4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `appliance.yml`.
+
+* Why does the PXE boot fail with tftp timeout or service timeout errors?  
+	Cause:
+	* Issue occurs when server is RAID controlled, or when more than two servers in the same network have Cobbler services running.  
+	
+  Resolution:  
+	1. Create a Non-RAID or virtual disk in the server.  
+	2. Check if any system other than the management node has cobblerd running. If yes, stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
+
+* After the cluster is rebooted, what to do when the Slurm services are not started automatically?  
+	Resolution: Manually restart the Slurm services on the manager node by running the following commands:
+	* `systemctl restart slurmdbd`
+	* `systemctl restart slurmctld`
+	* `systemctl restart prometheus-slurm-exporter`
+
+	Manually restart the following service on all the compute nodes: `systemctl restart slurmd`
+
+* What to do when the Slurm services fail because the `slurm.conf` is not configured properly?  
+	Resolution:
+	1. Run the following commands:
+		* `slurmdbd -Dvvv`
+		* `slurmctld -Dvvv`
+	2. Verify `/var/lib/log/slurmctld.log` file.
+
+* How to troubleshoot the error "ports are unavailable" when Slurm database connection fails?  
+	Resolution:
+	1. Run the following commands:
+		* `slurmdbd -Dvvv`
+		* `slurmctld -Dvvv`
+	2. Verify the `/var/lib/log/slurmctld.log` file.
+	3. Verify the port status: `netstat -antp | grep LISTEN`
+	4. If the ports are in the LISTEN state, stop (kill) the PID that is using that specific port.
+	5. Restart all Slurm services:
+		* `systemctl restart slurmctld` on the manager node
+		* `systemctl restart slurmdbd` on the manager node
+		* `systemctl restart slurmd` on the compute nodes
+
+# Limitations
+1. The supported versions of all the components are listed in the `Requirements Matrix` and `Software Managed by Omnia` sections; versions other than those listed are not supported by Omnia. This is to ensure that there is no impact on the functionality of Omnia.
+2. Removal of Slurm and Kubernetes component roles is not supported. However, skip tags can be provided at the start of installation to select the component roles.
+3. After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
+4. Dell Technologies provides support for the Dell-developed modules of Omnia. All other third-party tools deployed by Omnia are outside the support scope.
+
+##### Contributing to Omnia
 The Omnia project was started to give members of the [Dell Technologies HPC Community](https://dellhpc.org) a way to easily setup clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community.
 
-#### Open to All
+###### Open to All
 While we started Omnia within the Dell Technologies HPC Community, that doesn't mean that it's limited to Dell EMC servers, networking, and storage. This is an open project, and we want to encourage *everyone* to use and contribute to Omnia!
 
-##### Anyone Can Contribute!
+###### Anyone Can Contribute!
 It's not just new features and bug fixes that can be contributed to the Omnia project! Anyone should feel comfortable contributing. We are asking for all types of contributions:
 * New feature code
 * Bug fixes
@@ -82,6 +172,4 @@ It's not just new features and bug fixes that can be contributed to the Omnia pr
 * Feedback
 * Validation that it works for your particular configuration
 
-If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/blob/devel/CONTRIBUTING.md).
-
-###### [Omnia Contributors](CONTRIBUTORS.md)
+If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/blob/devel/CONTRIBUTING.md).

+ 5 - 1
omnia.yml

@@ -123,4 +123,8 @@
   gather_facts: false
   roles:
     - slurm_exporter
-  tags: slurm
+  tags: slurm
+
+- name: Passwordless SSH between manager and compute nodes
+  include: appliance/tools/passwordless_ssh.yml
+  when: hostvars['127.0.0.1']['appliance_status']

+ 1 - 1
roles/cluster_validation/tasks/fetch_password.yml

@@ -84,4 +84,4 @@
   command: >-
     ansible-vault encrypt {{ role_path }}/../../{{ config_filename }}
     --vault-password-file {{ role_path }}/../../{{ config_vaultname }}
-  changed_when: false
+  changed_when: false

+ 19 - 1
roles/cluster_validation/tasks/main.yml

@@ -16,4 +16,22 @@
   include_tasks: validations.yml
 
 - name: Fetch passwords
-  include_tasks: fetch_password.yml
+  include_tasks: fetch_password.yml
+
+- name: Check if omnia is running from AWX
+  block:
+    - name: Appliance status
+      set_fact:
+        appliance_status: false
+
+    - name: Check AWX instance
+      command: awx-manage --version
+
+    - name: Update appliance status
+      set_fact:
+        appliance_status: true
+
+  rescue:
+    - name: Passwordless SSH status
+      debug:
+        msg: "omnia.yml running on host"