
Issue #275: Docker version fix

Signed-off-by: abhishek-s-a <a_sa@dellteam.com>
Lucas A. Wilson committed 4 years ago (commit af4d924ab0)

+ 9 - 0
appliance/roles/common/tasks/docker_installation.yml

@@ -65,6 +65,15 @@
     executable: pip3
   tags: install
 
+- name: Versionlock docker
+  command: "yum versionlock '{{ item }}'"
+  args:
+    warn: false
+  with_items:
+    - "{{ container_repo_install }}"
+  changed_when: true
+  tags: install
+
 - name: Configure docker
   copy:
     src: daemon.json
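
The new task above pins the Docker packages with the yum versionlock plugin. As a hedged reference (not part of this commit), the resulting lock can be inspected, and later released if Docker ever needs a deliberate upgrade, using the plugin's own subcommands:

```
# List the currently locked package versions
yum versionlock list

# Release the lock before an intentional Docker upgrade (illustrative pattern)
yum versionlock delete 'docker-ce*' 'docker-ce-cli*'
```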

+ 4 - 1
appliance/roles/common/vars/main.yml

@@ -33,6 +33,7 @@ common_packages:
   - python-docker
   - net-tools
   - python-netaddr
+  - yum-plugin-versionlock
 
 # Usage: pre_requisite.yml
 internet_delay: 0
@@ -52,7 +53,9 @@ docker_repo_url: https://download.docker.com/linux/centos/docker-ce.repo
 docker_repo_dest: /etc/yum.repos.d/docker-ce.repo
 success: '0'
 container_type: docker
-container_repo_install: docker-ce
+container_repo_install:
+  - docker-ce-cli-20.10.2
+  - docker-ce-20.10.2
 docker_compose: docker-compose
 daemon_dest: /etc/docker/
 

+ 36 - 21
docs/INSTALL_OMNIA.md

@@ -1,44 +1,52 @@
-# Install Omnia
+# Install Omnia using CLI
 
-The following sections provide details on installing Omnia using CLI. If you want to install the Omnia appliance and manage workloads using the Omnia appliance, see [INSTALL_OMNIA_APPLIANCE](INSTALL_OMNIA_APPLIANCE.md) and [MONITOR_CLUSTERS](MONITOR_CLUSTERS.md) files for more information.
+The following sections provide details on installing Omnia using CLI. If you want to install the Omnia appliance and manage workloads using the Omnia appliance, see [Install the Omnia appliance](INSTALL_OMNIA_APPLIANCE.md) and [Monitor Kubernetes and Slurm](MONITOR_CLUSTERS.md) for more information.
 
-## Prerequisties to install Omnia using CLI
-Ensure that all the prerequisites listed in the [PREINSTALL_OMNIA](PREINSTALL_OMNIA.md) file are met before installing Omnia.
+## Prerequisites
+* Ensure that all the prerequisites listed in the [Preparation to install Omnia](PREINSTALL_OMNIA.md) are met before installing Omnia.
+* If there are errors when any of the following Ansible playbook commands are run, re-run the commands. 
+* The user should have root privileges to perform installations and configurations.
+ 
+## Install Omnia using CLI
 
-## Steps to install Omnia using CLI
-__Note:__ If there are errors when any of the following Ansible playbook commands are run, re-run the commands again.  
-__Note:__ The user should have root privileges to perform installations and configurations.
-
-1. Clone the Omnia repository.
+1. Clone the Omnia repository:
 ``` 
 git clone https://github.com/dellhpc/omnia.git 
 ```
-__Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. It is recommended that you do not rename this folder.
+__Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. Ensure that you do not rename this folder.
 
 2. Change the directory to __omnia__: `cd omnia`
 
 3. An inventory file must be created in the __omnia__ folder. Add compute node IPs under **[compute]** group and the manager node IP under **[manager]** group. See the INVENTORY template file under `omnia\docs` folder.
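
For illustration only (the IP addresses below are placeholders, not values from this repository), a minimal inventory could be created as follows:

```
cat > inventory <<'EOF'
[manager]
172.17.0.10

[compute]
172.17.0.11
172.17.0.12
EOF
```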
 
-4. To install Omnia, run the following command.
+4. To install Omnia:
 ```
 ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" 
 ```
 
-5. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed.  
+5. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed.
+
 To skip the installation of Kubernetes, enter:  
-`ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "kubernetes"`  
-Similarly, to skip Slurm, enter:  
+`ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "kubernetes"` 
+
+To skip the installation of Slurm, enter:  
 `ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "slurm"`  
-__Note:__ If you would like to skip the NFS client setup, enter the following command to skip the k8s_nfs_client_setup role of Kubernetes:  
+
+To skip the NFS client setup, enter the following command to skip the k8s_nfs_client_setup role of Kubernetes:  
 `ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"  --skip-tags "nfs_client"`
 
-6. To provide password for mariaDB Database (for Slurm accounting) and Kubernetes CNI, edit the `omnia_config.yml` file.  
-__Note:__ Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico.  
-To view the set passwords of omnia_config.yml at a later time, run the following command:  
+6. To provide the password for the MariaDB database (for Slurm accounting) and to set the Kubernetes Pod Network CIDR and the Kubernetes CNI, edit the `omnia_config.yml` file.  
+__Note:__ 
+* Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico. 
+* The default value of Kubernetes Pod Network CIDR is 10.244.0.0/16. If 10.244.0.0/16 is already in use within your network, select a different Pod Network CIDR. For more information, see __https://docs.projectcalico.org/getting-started/kubernetes/quickstart__.
+
+To view the set passwords of omnia_config.yml at a later time:  
 `ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key`
 
 Omnia considers `slurm` as the default username for MariaDB.  
 
+## Kubernetes roles
+
 The following __kubernetes__ roles are provided by Omnia when __omnia.yml__ file is run:
 - __common__ role:
 	- Install common packages on manager and compute nodes
@@ -67,7 +75,14 @@ The following __kubernetes__ roles are provided by Omnia when __omnia.yml__ file
 - **k8s_start_services** role
 	- Kubernetes services are deployed such as Kubernetes Dashboard, Prometheus, MetalLB and NFS client provisioner
 
-__Note:__ After Kubernetes is installed and configured, few Kubernetes and calico/flannel related ports are opened in the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
+__Note:__ 
+* After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
+* If Kubernetes Pods are unable to communicate with the servers (for example, when the DNS servers are not responding), the Kubernetes Pod Network CIDR may be overlapping with the host network, which causes a DNS issue. To resolve this issue, follow the steps below (a consolidated sketch follows this list):
+1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
+2. On the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16; ensure that you provide an IP range that is not in use in your host network.
+3. Run `omnia.yml` and skip Slurm using `--skip-tags slurm`.
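
A minimal consolidated sketch of this recovery sequence, reusing the inventory file and skip tag shown earlier in this document:

```
# On every node of the existing Kubernetes cluster:
kubeadm reset -f

# On the management node, after updating the Pod Network CIDR in omnia_config.yml:
ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" --skip-tags "slurm"
```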
+
+## Slurm roles
 
 The following __Slurm__ roles are provided by Omnia when __omnia.yml__ file is run:
 - **slurm_common** role:
@@ -92,6 +107,6 @@ Commands to install JupyterHub and Kubeflow:
 * `ansible-playbook platforms/jupyterhub.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"`
 * `ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" `
 
-## Adding a new compute node to the cluster
+## Add a new compute node to the cluster
 
-The user has to update the INVENTORY file present in `omnia` directory with the new node IP address under the compute group. Make sure the other nodes which are already a part of the cluster are also present in the compute group along with the new node. Then, run`omnia.yml` to add the new node to the cluster and update the configurations of the manager node.
+Update the INVENTORY file present in the `omnia` directory with the new node IP address under the compute group. Ensure that the other nodes which are already a part of the cluster are also present in the compute group along with the new node. Then, run `omnia.yml` to add the new node to the cluster and update the configurations of the manager node.

+ 39 - 36
docs/INSTALL_OMNIA_APPLIANCE.md

@@ -1,31 +1,28 @@
 # Install the Omnia appliance
 
-## Prerequisties
-Ensure that all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) file are met before installing the Omnia appliance.
-
-__Note:__ After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
-
-__Note:__ You must have root privileges to perform installations and configurations using the Omnia appliance.
+## Prerequisites
+* Ensure that all the prerequisites listed in the [Prerequisites to install the Omnia appliance](PREINSTALL_OMNIA_APPLIANCE.md) file are met before installing the Omnia appliance.
+* After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
+* You must have root privileges to perform installations and configurations using the Omnia appliance.
+* If there are errors when any of the following Ansible playbook commands are run, re-run the commands again.
 
 ## Steps to install the Omnia appliance
-__Note:__ If there are errors when any of the following Ansible playbook commands are run, re-run the commands again.
+
 1. On the management node, change the working directory to the directory where you want to clone the Omnia Git repository.
-2. Clone the Omnia repository.
+2. Clone the Omnia repository:
 ``` 
 git clone https://github.com/dellhpc/omnia.git 
 ```
-3. Change the directory to `omnia`
-4. Edit the `omnia_config.yml` file to:  
-	a. Provide passwords for mariaDB Database (for Slurm accounting) and Kubernetes CNI under `mariadb_password` and `k8s_cni` respectively.  
-	__Note:__ Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico.
-	
-	To view the set passwords of `omnia_config.yml`, run the following command.
-```
-ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
-```
-
-5. Change the directory to `omnia/appliance`
-6. Edit the `appliance_config.yml` file to:  
+3. Change the directory to __omnia__: `cd omnia`
+4. Edit the `omnia_config.yml` file to:
+* Provide the password for the MariaDB database (for Slurm accounting) and the Kubernetes CNI under `mariadb_password` and `k8s_cni` respectively, and set the Kubernetes Pod Network CIDR.  
+__Note:__ 
+* Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico.	
+* The default value of Kubernetes Pod Network CIDR is 10.244.0.0/16. If 10.244.0.0/16 is already in use within your network, select a different Pod Network CIDR. For more information, see __https://docs.projectcalico.org/getting-started/kubernetes/quickstart__.
+
+5. Run `ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key` to view the set passwords of __omnia_config.yml__.
+6. Change the directory to __omnia__->__appliance__: `cd omnia/appliance`
+7. Edit the `appliance_config.yml` file to:  
 	a. Provide passwords for Cobbler and AWX under `provision_password` and `awx_password` respectively.  
 	__Note:__ Minimum length of the password must be at least eight characters and a maximum of 30 characters. Do not use these characters while entering a password: -, \\, "", and \'  
 	
@@ -40,24 +37,19 @@ ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
 	
 	e. Provide valid DHCP range for HPC cluster under the variables `dhcp_start_ip_range` and `dhcp_end_ip_range`. 
 	
-	To view the set passwords of `appliance_config.yml`, run the following command.
-```
-ansible-vault view appliance_config.yml --vault-password-file .vault_key
-```
+8. Run `ansible-vault view appliance_config.yml --vault-password-file .vault_key` to view the set passwords of __appliance_config.yml__.
 
 Omnia considers the following usernames as default:  
 * `cobbler` for Cobbler Server
 * `admin` for AWX
 * `slurm` for MariaDB
 
-7. To install Omnia, run the following command.
-```
-ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"
-```
+9. Run `ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"` to install the Omnia appliance.
+
    
 Omnia creates a log file which is available at: `/var/log/omnia.log`.
 
-**Provision operating system on the target nodes**  
+## Provision operating system on the target nodes 
 Omnia role used: *provision*  
 Ports used by Cobbler:  
 * TCP ports: 80,443,69
@@ -76,7 +68,7 @@ __Note__: After the Cobbler Server provisions the operating system on the nodes,
 
 __Note__: If you want to add more nodes, append the new nodes in the existing mapping file. However, do not modify the previous nodes in the mapping file as it may impact the existing cluster.  
 
-**Install and configure Ansible AWX**  
+## Install and configure Ansible AWX 
 Omnia role used: *web_ui*  
 The port used by AWX is __8081__.  
 The AWX repository is cloned from the GitHub path: https://github.com/ansible/awx.git 
@@ -106,14 +98,16 @@ Kubernetes and Slurm are installed by deploying the **DeployOmnia** template on
 3. Select the __HOSTS__ tab.
 4. To add the hosts provisioned by Cobbler, click **+**, and then select **Existing Host**. 
 5. Select the hosts from the list and click __SAVE__.
-5. To deploy Omnia, under __RESOURCES__ -> __Templates__, select __DeployOmnia__, and then click __LAUNCH__.
-6. By default, no skip tags are selected and both Kubernetes and Slurm will be deployed. To install only Kubernetes, enter `slurm` and select **slurm**. Similarly, to install only Slurm, select and add `kubernetes` skip tag. 
+6. To deploy Omnia, under __RESOURCES__ -> __Templates__, select __DeployOmnia__, and then click __LAUNCH__.
+7. By default, no skip tags are selected and both Kubernetes and Slurm will be deployed. 
+8. To install only Kubernetes, enter `slurm` in the skip tag section and select **slurm**. 
+9. To install only Slurm, add the `kubernetes` skip tag. 
 
 __Note:__
 *	If you would like to skip the NFS client setup, enter `nfs_client` in the skip tag section to skip the **k8s_nfs_client_setup** role of Kubernetes.
 
-7. Click **NEXT**.
-8. Review the details in the **PREVIEW** window, and click **LAUNCH** to run the DeployOmnia template. 
+10. Click **NEXT**.
+11. Review the details in the **PREVIEW** window, and click **LAUNCH** to run the DeployOmnia template. 
 
 __Note:__ If you want to install __JupyterHub__ and __Kubeflow__ playbooks, you have to first install the __JupyterHub__ playbook and then install the __Kubeflow__ playbook.
 
@@ -129,6 +123,8 @@ The DeployOmnia template may not run successfully if:
 
 After **DeployOmnia** template is run from the AWX UI, the **omnia.yml** file installs Kubernetes and Slurm, or either Kubernetes or slurm, as per the selection in the template on the management node. Additionally, appropriate roles are assigned to the compute and manager groups.
 
+## Kubernetes roles
+
 The following __kubernetes__ roles are provided by Omnia when __omnia.yml__ file is run:
 - __common__ role:
 	- Install common packages on manager and compute nodes
@@ -157,7 +153,14 @@ The following __kubernetes__ roles are provided by Omnia when __omnia.yml__ file
 - **k8s_start_services** role
 	- Kubernetes services are deployed such as Kubernetes Dashboard, Prometheus, MetalLB and NFS client provisioner
 
-__Note:__ After Kubernetes is installed and configured, few Kubernetes and calico/flannel related ports are opened in the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
+__Note:__ 
+* After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
+* If Kubernetes Pods are unable to communicate with the servers (for example, when the DNS servers are not responding), the Kubernetes Pod Network CIDR may be overlapping with the host network, which causes a DNS issue. To resolve this issue:
+1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
+2. On the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16; ensure that you provide an IP range that is not in use in your host network.
+3. Run `omnia.yml` and skip Slurm using `--skip-tags slurm`.
+ 
+## Slurm roles
 
 The following __Slurm__ roles are provided by Omnia when __omnia.yml__ file is run:
 - **slurm_common** role:
@@ -176,6 +179,6 @@ The following __Slurm__ roles are provided by Omnia when __omnia.yml__ file is r
 	- Slurm exporter is a package for exporting metrics collected from Slurm resource scheduling system to prometheus.
 	- Slurm exporter is installed on the host like Slurm, and Slurm exporter will be successfully installed only if Slurm is installed.
 
-## Adding a new compute node to the Cluster
+## Add a new compute node to the cluster
 
 If a new node is provisioned through Cobbler, the node address is automatically displayed on the AWX dashboard. The node is not assigned to any group. You can add the node to the compute group along with the existing nodes and run `omnia.yml` to add the new node to the cluster and update the configurations in the manager node.

+ 34 - 30
docs/MONITOR_CLUSTERS.md

@@ -1,11 +1,12 @@
 # Monitor Kubernetes and Slurm
 Omnia provides playbooks to configure additional software components for Kubernetes such as JupyterHub and Kubeflow. For workload management (submitting, controlling, and managing jobs) of HPC, AI, and Data Analytics clusters, you can access Kubernetes and Slurm dashboards and other supported applications. 
 
-__Note:__ To access the below dashboards, user has to login to the manager node and open the installed web browser.
+To access any of the dashboards, log in to the manager node and open the installed web browser.
+
+If you are connecting remotely, ensure that PuTTY or any other X11-based client you use supports X11 forwarding. If you are using MobaXterm version 8 or later, follow the steps below:
 
-__Note:__ If you are connecting remotely make sure your putty or any other similar client supports X11 forwarding. If you are using mobaxterm version 8 and above, follow the below mentioned steps:
 1. To provide __ssh__ to the manager node.
-   `ssh -x root@<ip>` (where ip is the private ip of manager node)
+   `ssh -X root@<ip>` (where IP is the private IP of the manager node)
 2. `yum install firefox -y`
 3. `yum install xorg-x11-xauth`
 4. `export DISPLAY=:10.0`
@@ -13,16 +14,21 @@ __Note:__ If you are connecting remotely make sure your putty or any other simil
 6. To launch firefox from terminal use the following command: 
    `firefox&`
 
-__Note:__ Everytime user logouts, the user have to run __export DISPLAY=:10.0__ command.
+__Note:__ Each time the PuTTY/MobaXterm session ends, you must run the __export DISPLAY=:10.0__ command again; otherwise, Firefox cannot be launched.
+
+## Set up a user account on the manager node
+1. Log in to the manager node as the root user and run `adduser __<username>__`.
+2. Run `passwd __<username>__` to set the password.
+3. Run `usermod -a -G wheel __<username>__` to grant sudo permissions.
+
+__Note:__ Kubernetes and Slurm jobs can be scheduled only for users with __sudo__ privileges.
 
 ## Access Kubernetes Dashboard
-1. To verify if the __Kubernetes-dashboard service__ is __running__, run the following command:
-  `kubectl get pods --namespace kubernetes-dashboard`
-2. To start the Kubernetes dashboard, run the following command:
-  `kubectl proxy`
-3. From the CLI, run the following command to see the generated tokens: `kubectl get secrets`
+1. To verify if the __Kubernetes-dashboard service__ is __running__, run `kubectl get pods --namespace kubernetes-dashboard`.
+2. To start the Kubernetes dashboard, run `kubectl proxy`.
+3. From the CLI, run `kubectl get secrets` to see the generated tokens.
 4. Copy the token with the name __prometheus-__-kube-state-metrics__ of the type __kubernetes.io/service-account-token__.
-5. Run the following command: `kubectl describe secret __<copied token name>__`
+5. Run `kubectl describe secret __<copied token name>__`
 6. Copy the encrypted token value.
 7. On a web browser(installed on the manager node), enter http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ to access the Kubernetes Dashboard.
 8. Select the authentication method as __Token__.
@@ -30,47 +36,45 @@ __Note:__ Everytime user logouts, the user have to run __export DISPLAY=:10.0__
 
 ## Access Kubeflow Dashboard
 
-__Note:__ Use only port number between __8000-8999__
-__Note:__ Suggested port number : 8085
+It is recommended that you use a port number between __8000-8999__; the suggested port number is __8085__.
 
 1. To see which are the ports are in use, use the following command:
    `netstat -an`
-2. Choose port number from __8000-8999__ which is not in use.
+2. Choose a port number between __8000-8999__ which is not in use.
 3. To run the __kubeflow__ dashboard at selected port number, run the following command:
    `kubectl port-forward -n kubeflow service/centraldashboard __selected_port_number__:80`
 4. On a web browser installed on the __manager node__, go to http://localhost:selected-port-number/ to launch the kubeflow central navigation dashboard.
 
 ## Access JupyterHub Dashboard
-If you have installed the JupyterHub application for Kubernetes, you can access the dashboard by following these actions:
-1. To verify if the JupyterHub services are running, run the following command: 
-   `kubectl get pods --namespace default`
-2. Ensure that the pod names starting with __hub__ and __proxy__ are in __running__ status.
-3. Run the following command:
-   `kubectl get services`
+
+1. To verify if the JupyterHub services are running, run `kubectl get pods --namespace jupyterhub`.
+2. Ensure that the pod names starting with __hub__ and __proxy__ are in __Running__ status.
+3. Run `kubectl get services --namespace jupyterhub`.
 4. Copy the **External IP** of __proxy-public__ service.
 5. On a web browser installed on the __manager node__, use the External IP address to access the JupyterHub Dashboard.
 6. Enter any __username__ and __password__ combination to enter the Jupyterhub. The __username__ and __password__ can be later configured from the JupyterHub dashboard.
 
-## Prometheus:
+## Prometheus
 
-* Prometheus is installed in two different ways:
-  * Prometheus is installed on the host when Slurm is installed without installing kubernetes.
-  * Prometheus is installed as a Kubernetes role, if you install both Slurm and Kubernetes.
+Prometheus is installed in two different ways:
+  * It is installed on the host when Slurm is installed without installing Kubernetes.
+  * It is installed as a Kubernetes role, if you install both Slurm and Kubernetes.
 
-If Prometheus is installed as part of k8s role, run the following commands before starting the Prometheus UI:
+If Prometheus is installed as part of the Kubernetes role, run the following commands before starting the Prometheus UI:
 1. `export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")`
 2. `echo $POD_NAME`
 3. `kubectl --namespace default port-forward $POD_NAME 9090`
 
-__Note:__ If Prometheus is installed on the host, start the Prometheus web server with the following command:
-* Navigate to Prometheus folder. The default path is __/var/lib/prometheus-2.23.0.linux-amd64/__.
-* Start the web server, 
-  `./prometheus.yml`
+If Prometheus is installed on the host, start the Prometheus web server by running the following commands:
+1. Navigate to the Prometheus folder. The default path is __/var/lib/prometheus-2.23.0.linux-amd64/__.
+2. Start the web server:
+  `./prometheus`
 
 Go to http://localhost:9090 to launch the Prometheus UI in the browser.
 
-__Note:__ Prometheus instance on the host (if already installed through slurm without Kubernetes) will be removed when Kubernetes is installed as Prometheus would be running as a pod. 
-__Note:__ The user can use a single instance of Prometheus when both k8s and slurm is installed.
+__Note:__ 
+* If Prometheus was installed through Slurm without Kubernetes, it is removed when Kubernetes is installed because Prometheus then runs as a pod. 
+* You can use a single instance of Prometheus when both Kubernetes and Slurm are installed.
 
 
 

+ 6 - 6
docs/PREINSTALL_OMNIA.md

@@ -1,18 +1,18 @@
-# Pre-Installation Preparation
+# Preparation to install Omnia
 
 ## Assumptions
-Omnia assumes that prior to installation:
-* The manager and compute nodes must be installed with CentOS 7.9 2009 OS.
-* Network(s) has been cabled and nodes can reach the Internet.
+Ensure that the following prerequisites are met:
+* The manager and compute nodes must be running CentOS 7.9 2009 OS.
+* All nodes are connected to the network and have access to the Internet.
 * SSH Keys for root have been installed on all nodes to allow for password-less SSH.
 * On the manager node, install Ansible and Git using the following commands:
 	* `yum install epel-release -y`
-	* `yum install ansible git -y`  
+	* `yum install ansible-2.9.17 git -y`  
 __Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command again.
 
 
 ## Example system designs
-Omnia can configure systems which use Ethernet- or Infiniband-based fabric to connect the compute servers.
+Omnia can configure systems that use an Ethernet-based or InfiniBand-based fabric to connect the compute servers.
 
 ![Example system configuration with Ethernet fabric](images/example-system-ethernet.png)
 

+ 4 - 4
docs/PREINSTALL_OMNIA_APPLIANCE.md

@@ -3,15 +3,15 @@
 Ensure that the following prerequisites are met before installing the Omnia appliance:
 * On the management node, install Ansible and Git using the following commands:
 	* `yum install epel-release -y`
-	* `yum install ansible git -y`  
+	* `yum install ansible-2.9.17 git -y` 
 	__Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command again.
 * Ensure a stable Internet connection is available on management node and target nodes. 
 * CentOS 7.9 2009 is installed on the management node.
 * To provision the bare metal servers, go to http://isoredirect.centos.org/centos/7/isos/x86_64/ and download the **CentOS-7-x86_64-Minimal-2009** ISO file.
 * For DHCP configuration, you can provide a mapping file. The provided details must be in the format: MAC, Hostname, IP. For example, `xx:xx:4B:C4:xx:44,validation01,172.17.0.81` and  `xx:xx:4B:C5:xx:52,validation02,172.17.0.82` are valid entries.  
-__Note:__ A template for mapping file is available under `omnia/examples`, named `mapping_file.csv`. The header in the template file must not be deleted before saving the file.  
+__Note:__ A template for the mapping file, named `mapping_file.csv`, is present in the `omnia/examples` folder. The header in the template file must not be deleted before saving the file.  
 __Note:__ Ensure that duplicate values are not provided for MAC, Hostname, and IP in the mapping file. The Hostname should not contain the following characters: , (comma), \. (period), and - (hyphen).
-* Connect one of the Ethernet cards on the management node to the HPC switch and one of the ethernet card connected to the lobal network.
+* Connect one of the Ethernet cards on the management node to the HPC switch, and connect the other Ethernet card to the global network.
 * If SELinux is not disabled on the management node, disable it from `/etc/sysconfig/selinux` and restart the management node.
 * The default mode of PXE is __UEFI__, and the BIOS Legacy Mode is not supported.
 * The default boot order for the bare metal servers must be __PXE__.
@@ -20,7 +20,7 @@ __Note:__ Ensure that duplicate values are not provided for MAC, Hostname, and I
 ## Assumptions
 
 ## Example system designs
-Omnia can configure systems which use Ethernet- or Infiniband-based fabric to connect the compute servers.
+Omnia can configure systems that use an Ethernet-based or InfiniBand-based fabric to connect the compute servers.
 
 ![Example system configuration with Ethernet fabric](images/example-system-ethernet.png)
 

+ 65 - 54
docs/README.md

@@ -1,6 +1,6 @@
 **Omnia** (Latin: all or everything) is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI, and data analytics workloads. It uses Slurm, Kubernetes, and other packages to manage jobs and run diverse workloads on the same converged solution. It is a collection of [Ansible](https://ansible.org) playbooks, is open source, and is constantly being extended to enable comprehensive workloads.
 
-## What Omnia Does
+## What Omnia does
 Omnia can build clusters which use Slurm or Kubernetes (or both!) for workload management. Omnia will install software from a variety of sources, including:
 - Standard CentOS and [ELRepo](http://elrepo.org) repositories
 - Helm repositories
@@ -12,39 +12,40 @@ Whenever possible, Omnia will leverage existing projects rather than reinvent th
 
 ![Omnia draws from existing repositories](images/omnia-overview.png)
 
-### Omnia Stacks
+### Omnia stacks
 Omnia can install Kubernetes or Slurm (or both), along with additional drivers, services, libraries, and user applications.
 ![Omnia Kubernetes Stack](images/omnia-k8s.png)
 
 ![Omnia Slurm Stack](images/omnia-slurm.png) 
 
 ## Installing Omnia
-Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [PREINSTALL](PREINSTALL_OMNIA.md) for instructions on network setup.
+Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [Preparation to install Omnia](PREINSTALL_OMNIA.md) for instructions on network setup.
 
-Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see [INSTALL](INSTALL_OMNIA.md) for detailed instructions.  
+Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see [Install Omnia using CLI](INSTALL_OMNIA.md) for detailed instructions.  
 
 ## Installing the Omnia appliance
 Ensure all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) are met before installing the Omnia appliance.
 
 For detailed instructions on installing the Omnia appliance, see [INSTALL_OMNIA_APPLIANCE](INSTALL_OMNIA_APPLIANCE.md).
 
-# Requirements Matrix
+# System requirements  
+Ensure that the supported versions listed in the following table are installed. Versions other than those listed are not supported by Omnia, as they may impact Omnia's functionality.
 
 Software and hardware requirements  |   Version
 ----------------------------------  |   -------
 OS installed on the management node  |  CentOS 7.9 2009
 OS deployed by Omnia on bare-metal servers | CentOS 7.9 2009 Minimal Edition
 Cobbler  |  2.8.5
-Ansible AWX Version  |  15.0.0
+Ansible AWX  |  15.0.0
 Slurm Workload Manager  |  20.11.2
 Kubernetes Controllers  |  1.16.7
 Kubeflow  |  1
 Prometheus  |  2.23.0
 Supported PowerEdge servers  |  R640, R740, R7525, C4140, DSS8440, and C6420
 
-__Note:__ For more information about the supported software and compatible versions, see the **Software Supported** section.
+## Software managed by Omnia
+Ensure that the supported versions of the software listed in the following table are installed. Versions other than those listed are not supported by Omnia, as they may impact Omnia's functionality.
 
-## Software Managed by Omnia
 Software	|	Licence	|	Compatible Version	|	Description
 -----------	|	-------	|	----------------	|	-----------------
 MariaDB	|	GPL 2.0	|	5.5.68	|	Relational database used by Slurm
@@ -56,7 +57,7 @@ Python2	|	-	|	2.7.5	|	-
 Kubelet	|	Apache-2.0	|	1.16.7	|	Provides external, versioned ComponentConfig API types for configuring the kubelet
 Kubeadm	|	Apache-2.0	|	1.16.7	|	"fast paths" for creating Kubernetes clusters
 Kubectl	|	Apache-2.0	|	1.16.7	|	Command line tool for Kubernetes
-JupyterHub	|	Modified BSD Licence	|	0.9.6	|	Multi-user hub
+JupyterHub	|	Modified BSD Licence	|	1.1.0	|	Multi-user hub
 Kfctl	|	Apache-2.0	|	1.0.2	|	CLI for deploying and managing Kubeflow
 Kubeflow	|	Apache-2.0	|	1	|	Cloud Native platform for machine learning
 Helm	|	Apache-2.0	|	3.5.0	|	Kubernetes Package Manager
@@ -71,43 +72,44 @@ PostgreSQL	|	Copyright (c) 1996-2020, PostgreSQL Global Development Group	|	10.1
 Redis	|	BSD-3-Clause Licence	|	6.0.10	|	In-memory database
 NGINX	|	BSD-2-Clause Licence	|	1.14	|	-
 
-# Known Issues
-* Hosts are not displayed on the AWX UI.  
-	Resolution:
-	* Verify if `provisioned_hosts.yml` is available under `omnia/appliance/roles/inventory/files`.
-	* If hosts are not listed, then servers are not PXE booted yet.
-	* If hosts are listed, then an IP address is assigned to them by DHCP. However, PXE boot is still in process or is not initiated.
-	* Check for the reachable and unreachable hosts using the tool provided under `omnia/appliance/tools` named `provisioned_report.yml`. To run provisioned_report.yml, run the following command under `omnia/appliance` directory: `ansible-playbook -i roles/inventory/files/provisioned_hosts.yml tools/provisioned_report.yml`.
+# Known issue  
+Issue: Hosts do not display on the AWX UI.  
+	
+Resolution:  
+* Verify if `provisioned_hosts.yml` is present in the `omnia/appliance/roles/inventory/files` folder.
+* Check whether the hosts are listed in the `provisioned_hosts.yml` file. If the hosts are not listed, the servers have not been PXE booted yet.
+* If the hosts are listed in the `provisioned_hosts.yml` file, an IP address has been assigned to them by DHCP. However, the hosts are not displayed on the AWX UI because the PXE boot is still in progress or has not been initiated.
+* Check for the reachable and unreachable hosts using the `provisioned_report.yml` tool present in the `omnia/appliance/tools` folder. To run provisioned_report.yml, in the omnia/appliance directory, run `ansible-playbook -i roles/inventory/files/provisioned_hosts.yml tools/provisioned_report.yml`.
 
-# Frequently Asked Questions
+# Frequently asked questions
 * Why is the error "Wait for AWX UI to be up" displayed when `appliance.yaml` fails?  
 	Cause: 
-	1. This error occurs when AWX is not accessible even after five minutes of wait time. 
-	2. When __isMigrating__ or __isInstalling__ is seen in the failure message.  
+	1. When AWX is not accessible even after five minutes of wait time. 
+	2. When __isMigrating__ or __isInstalling__ is seen in the failure message.
 	
   Resolution:  
-	Wait for AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yaml` file again.  
+	Wait for the AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yaml` file again, where __management-station-IP__ is the IP address of the management node.
 
-* What are the steps to be followed after the nodes in a Kubernetes cluster are rebooted?  
+* What are the next steps after the nodes in a Kubernetes cluster reboot?  
 	Resolution: 
-	Wait for 10 to 15 minutes after the Kubernetes cluster is rebooted. Then, verify the status of cluster using these services:
+	Wait for up to 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following commands:
 	* `kubectl get nodes` on the manager node provides correct k8s cluster status.  
-	* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the running state.
-	* `kubectl cluster-info` on the manager node displays both k8s master and kubeDNS are in the running state.
+	* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the **Running** state.
+	* `kubectl cluster-info` on the manager node displays that both the k8s master and kubeDNS are in the **Running** state.
 
-* What to do when the Kubernetes services are not in the Running state?  
-	Resolution:
-	1. Verify if the pods are in the running state by using the command: `kubectl get pods --all-namespaces`
-	2. If the pods are not in the running state, delete the pods using the command:`kubectl delete pods <name of pod>`
-	3. Run the corresponding playbook: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
+* What to do when the Kubernetes services are not in the __Running__ state?  
+	Resolution:	
+	1. Run `kubectl get pods --all-namespaces` to verify the pods are in the **Running** state.
+	2. If the pods are not in the **Running** state, delete the pods using the command: `kubectl delete pods <name of pod>`
+	3. Run the corresponding playbook that was used to install Kubernetes: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
 
 * What to do when the JupyterHub or Prometheus UI are not accessible?  
 	Resolution:
-	1. Run the following command to ensure **nfs-client** pod and all prometheus server pods are in the running state: `kubectl get pods --namespace default`
+	Run the command `kubectl get pods --namespace default` to ensure that the **nfs-client** pod and all Prometheus server pods are in the **Running** state. 
 
-* Why does the `appliance.yml` fail during the Cobbler configuration with an error during the Run import command?  
+* While configuring Cobbler, why does `appliance.yml` fail with an error during the Run import command?  
 	Cause:
-	* Issue occurs when the mounted .iso file is corrupted.
+	* When the mounted .iso file is corrupt.
 	
   Resolution:
 	1. Go to __var__->__log__->__cobbler__->__cobbler.log__ to view the error.
@@ -117,53 +119,62 @@ NGINX	|	BSD-2-Clause Licence	|	1.14	|	-
 
 * Why does the PXE boot fail with tftp timeout or service timeout errors?  
 	Cause:
-	* Issue occurs when server is RAID controlled, or when more than two servers in the same network have Cobbler services running.  
+	* When RAID is configured on the server.
+	* When more than two servers in the same network have Cobbler services running.  
 	
   Resolution:  
 	1. Create a Non-RAID or virtual disk in the server.  
 	2. Check if other systems except for the management node has cobblerd running. If yes, then stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
 
-* After the cluster is rebooted, what to do when the Slurm services are not started automatically?  
-	Resolution: Manually restart the slurmd services on the manager node by running the following commands:
-	* `systemctl restart slurmdbd`
-	* `systemctl restart slurmctld`
-	* `systemctl restart prometheus-slurm-exporter`
-
-	Manually restart the following service on all the compute nodes: `systemctl status slurmd`
-
-* What to do when the Slurm services fail because the `slurm.conf` is not configured properly?  
+* What to do when the Slurm services do not start automatically after the cluster reboots?  
+	Resolution: 
+	* Manually restart the slurmd services on the manager node by running the following commands:
+		* `systemctl restart slurmdbd`
+		* `systemctl restart slurmctld`
+		* `systemctl restart prometheus-slurm-exporter`
+	* Run `systemctl restart slurmd` to manually restart the slurmd service on all the compute nodes.
+
+* What to do when the Slurm services fail? 
+	Cause: The `slurm.conf` is not configured properly.  
 	Resolution:
 	1. Run the following commands:
 		* `slurmdbd -Dvvv`
 		* `slurmctld -Dvvv`
 	2. Verify `/var/lib/log/slurmctld.log` file.
 
-* How to troubleshoot the error "ports are unavailable" when Slurm database connection fails?  
+* What to do when the error "ports are unavailable" is displayed?
+	Cause: Slurm database connection fails.  
 	Resolution:
 	1. Run the following commands:
 		* `slurmdbd -Dvvv`
 		*`slurmctld -Dvvv`
 	2. Verify the `/var/lib/log/slurmctld.log` file.
 	3. Verify: netstat -antp | grep LISTEN
-	4. If they are in the Listening state, stop (kill) PID of that specific port
-	5. Restart all slurm services:
+	4. If PIDs are in the **Listening** state, kill the processes of that specific port.
+	5. Restart all Slurm services:
 		* slurmctl restart slurmctld on manager node
 		* systemctl restart slurmdbd on manager node
 		* systemctl restart slurmd on compute node
+		
+* What to do if Kubernetes Pods are unable to communicate with the servers when the DNS servers are not responding?
+	Cause: The Kubernetes Pod Network CIDR is overlapping with the host network, which causes a DNS issue.
+	Resolution:
+	1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
+	2. On the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16; ensure that you provide an IP range that is not in use in your host network.
+	3. Run `omnia.yml` and skip Slurm using the skip tag __slurm__.
 
 # Limitations
-1. The supported version of all the components are as per the `Requirements Matrix` and `Software Managed by Omnia` sections, and other versions than those listed are not supported by Omnia. This is to ensure that there is no impact to the functionality of Omnia.
-2. Removal of Slurm and Kubernetes component roles are not supported. However, skip tags can be provided at the start of installation to select the component roles.​
-3. After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
-4. Dell Technologies provides support to the Dell developed modules of Omnia. All the other third-party tools deployed by Omnia are outside the support scope.​
-
-##### Contributing to Omnia
+1. Removal of Slurm and Kubernetes component roles is not supported. However, skip tags can be provided at the start of installation to select the component roles.
+2. After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.  
+3. Dell Technologies provides support to the Dell developed modules of Omnia. All the other third-party tools deployed by Omnia are outside the support scope.​
+4. To change the Kubernetes single node cluster to a multi-node cluster or to change a multi-node cluster to a single node cluster, you must either redeploy the entire cluster or run `kubeadm reset -f` on all the nodes of the cluster. You then need to run the `omnia.yml` file and skip the installation of Slurm using the skip tags.
+# Contributing to Omnia
 The Omnia project was started to give members of the [Dell Technologies HPC Community](https://dellhpc.org) a way to easily setup clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community.
 
-###### Open to All
+## Open to all
 While we started Omnia within the Dell Technologies HPC Community, that doesn't mean that it's limited to Dell EMC servers, networking, and storage. This is an open project, and we want to encourage *everyone* to use and contribute to Omnia!
 
-####### Anyone Can Contribute!
+## Anyone can contribute!
 It's not just new features and bug fixes that can be contributed to the Omnia project! Anyone should feel comfortable contributing. We are asking for all types of contributions:
 * New feature code
 * Bug fixes
@@ -172,4 +183,4 @@ It's not just new features and bug fixes that can be contributed to the Omnia pr
 * Feedback
 * Validation that it works for your particular configuration
 
-If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/b
+If you would like to contribute, see [CONTRIBUTORS](https://github.com/dellhpc/omnia/b

+ 9 - 0
roles/common/tasks/main.yml

@@ -62,6 +62,15 @@
     state: present
   tags: install
 
+- name: Versionlock docker
+  command: "yum versionlock '{{ item }}'"
+  args:
+    warn: false
+  with_items:
+    - "{{ docker_packages }}"
+  changed_when: true
+  tags: install
+
 - name: Collect host facts (including acclerator information)
   setup: ~
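
Once the versionlock task in this hunk has run on a node, a quick hedged check (not part of the playbook) that the pinned build is what actually got installed:

```
# Both packages should report the pinned 20.10.2 build
rpm -q docker-ce docker-ce-cli

# The Docker client should report the matching version
docker --version
```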
 

+ 7 - 2
roles/common/vars/main.yml

@@ -23,9 +23,14 @@ common_packages:
   - nvidia-detect
   - chrony
   - pciutils
-  - docker-ce
+  - docker-ce-cli-20.10.2
+  - docker-ce-20.10.2
   - openssl
 
+docker_packages:
+  - docker-ce-cli-20.10.2
+  - docker-ce-20.10.2
+
 custom_fact_dir: /etc/ansible/facts.d
 
 custom_fact_dir_mode: 0755
@@ -52,7 +57,7 @@ delay_count_one: "60"
 retry_count: "6"
 delay_count: "10"
 
-ntp_servers: 
+ntp_servers:
   - 0.centos.pool.ntp.org
   - 1.centos.pool.ntp.org
   - 2.centos.pool.ntp.org

+ 7 - 0
roles/slurm_start_services/tasks/main.yml

@@ -39,6 +39,13 @@
     enabled: yes
   tags: install
 
+- name: Check if slurmdbd is active
+  systemd:
+    name: slurmdbd
+  register: slurmdbd_status
+  until: 'slurmdbd_status.status.ActiveState=="active"'
+  retries: 20
+
 - name: Show cluster if exists
   command: sacctmgr -n show cluster {{ cluster_name }}
   register: slurm_clusterlist
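
As a rough command-line equivalent of the new wait task (a sketch, not part of the playbook): the task polls systemd, up to 20 retries with Ansible's default 5-second delay, until slurmdbd reports itself active, much like repeating:

```
# Exits 0 and prints "active" only once slurmdbd is up
systemctl is-active slurmdbd
```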