
Merge pull request #285 from dellhpc/rc-1.0

WIP: Merge rc-1.0 into devel
Lucas A. Wilson 4 years ago
parent
commit
574324ac35

+ 1 - 0
.metadata/omnia_version

@@ -0,0 +1 @@
+Omnia version 1.0.0

+ 3 - 1
README.md

@@ -1,6 +1,8 @@
 <img src="docs/images/omnia-logo.png" width="500px">
 
-![GitHub](https://img.shields.io/github/license/dellhpc/omnia) ![GitHub issues](https://img.shields.io/github/issues-raw/dellhpc/omnia) ![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/dellhpc/omnia?include_prereleases) ![GitHub last commit (branch)](https://img.shields.io/github/last-commit/dellhpc/omnia/devel) ![GitHub commits since tagged version](https://img.shields.io/github/commits-since/dellhpc/omnia/omnia-v0.2/devel) 
+![GitHub](https://img.shields.io/github/license/dellhpc/omnia) ![GitHub issues](https://img.shields.io/github/issues-raw/dellhpc/omnia) ![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/dellhpc/omnia?include_prereleases) ![GitHub last commit (branch)](https://img.shields.io/github/last-commit/dellhpc/omnia/devel) ![GitHub commits since tagged version](https://img.shields.io/github/commits-since/dellhpc/omnia/v1.0.0/devel) 
+
+![GitHub contributors](https://img.shields.io/github/contributors-anon/dellhpc/omnia) ![GitHub forks](https://img.shields.io/github/forks/dellhpc/omnia) ![GitHub Repo stars](https://img.shields.io/github/stars/dellhpc/omnia) ![GitHub all releases](https://img.shields.io/github/downloads/dellhpc/omnia/total)
 
 #### Ansible playbook-based deployment of Slurm and Kubernetes on Dell EMC PowerEdge servers running an RPM-based Linux OS
 

+ 1 - 1
appliance/roles/common/vars/main.yml

@@ -37,7 +37,7 @@ common_packages:
 
 # Usage: pre_requisite.yml
 internet_delay: 0
-internet_timeout: 1
+internet_timeout: 10
 hostname: github.com
 port_no: 22
 os_name: CentOS

+ 2 - 2
appliance/tools/roles/fetch_password/tasks/main.yml

@@ -36,9 +36,9 @@
   set_fact:
     cobbler_password: "{{ provision_password }}"
   no_log: true
-  
+
 - name: Encrypt input config file
   command: >-
     ansible-vault encrypt {{ role_path }}/../../../{{ input_config_filename }}
     --vault-password-file {{ role_path }}/../../../{{ vault_filename }}
-  changed_when: false
+  changed_when: false

+ 100 - 0
docs/FAQ.md

@@ -0,0 +1,100 @@
+# Frequently Asked Questions
+
+* TOC
+{:toc}
+
+## Why is the error "Wait for AWX UI to be up" displayed when `appliance.yaml` fails?  
+Cause:  
+1. AWX is not accessible even after five minutes of wait time.
+2. __isMigrating__ or __isInstalling__ appears in the failure message.
+
+Resolution:  
+Wait for the AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yaml` file again, where __management-station-IP__ is the IP address of the management node.
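+
+To confirm reachability from the shell before re-running the playbook, a quick check such as the following can be used (a minimal sketch; substitute your management station IP):
+```
+# Expect an HTTP status line once AWX is up
+curl -sI http://<management-station-IP>:8081 | head -n 1
+```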
+
+## What are the next steps after the nodes in a Kubernetes cluster reboot?  
+Resolution:  
+Wait for up to 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following services (combined in the sketch below):
+* `kubectl get nodes` on the manager node provides the correct k8s cluster status.  
+* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the **Running** state.
+* `kubectl cluster-info` on the manager node displays that both the k8s master and kubeDNS are in the **Running** state.
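+
+The three checks can be run in sequence from the manager node:
+```
+# Run on the manager node after waiting up to 15 minutes for the cluster to settle
+kubectl get nodes                    # every node should report Ready
+kubectl get pods --all-namespaces    # every pod should be Running
+kubectl cluster-info                 # master and kubeDNS should be reported as running
+```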
+
+## What to do when the Kubernetes services are not in the __Running__ state?  
+Resolution:
+1. Run `kubectl get pods --all-namespaces` to verify that the pods are in the **Running** state.
+2. If the pods are not in the **Running** state, delete the pods using the command: `kubectl delete pods <name of pod>`
+3. Run the corresponding playbook that was used to install Kubernetes: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml` (see the combined sketch below).
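+
+Put together, the recovery sequence might look like this (a sketch only; the pod name is a placeholder and the exact playbook invocation depends on how the component was installed):
+```
+kubectl get pods --all-namespaces
+kubectl delete pods <name of pod>          # placeholder pod name
+ansible-playbook omnia.yml -i inventory    # or platforms/jupyterhub.yml / platforms/kubeflow.yml
+```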
+
+## What to do when the JupyterHub or Prometheus UI is not accessible?  
+Resolution:  
+Run the command `kubectl get pods --namespace default` to ensure that the **nfs-client** pod and all Prometheus server pods are in the **Running** state.
+
+## While configuring Cobbler, why does `appliance.yml` fail with an error during the Run import command?  
+Cause:
+* The mounted .iso file is corrupt.
+
+Resolution:
+1. Check `/var/log/cobbler/cobbler.log` to view the error.
+2. If the error message is **repo verification failed**, the .iso file is not mounted properly.
+3. Verify that the downloaded .iso file is valid and correct.
+4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `appliance.yml` (a condensed sketch follows).
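+
+A condensed version of these steps from the management node (a sketch; the checksum step and ISO filename are illustrative assumptions):
+```
+# Look for the repo verification failure in the Cobbler log
+grep -i "repo verification failed" /var/log/cobbler/cobbler.log
+# Optionally verify the downloaded ISO (filename is a placeholder)
+sha256sum <downloaded>.iso
+# Remove the Cobbler container and re-run the appliance playbook
+docker rm -f cobbler
+ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"
+```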
+
+## Why does the PXE boot fail with tftp timeout or service timeout errors?  
+Cause:
+* RAID is configured on the server.
+* More than two servers in the same network have Cobbler services running.  
+
+Resolution:  
+1. Create a non-RAID or virtual disk on the server.  
+2. Check whether any system other than the management node is running cobblerd. If yes, stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
+
+## What to do when the Slurm services do not start automatically after the cluster reboots?  
+Resolution: 
+* Manually restart the Slurm services on the manager node by running the following commands:
+```
+systemctl restart slurmdbd
+systemctl restart slurmctld
+systemctl restart prometheus-slurm-exporter
+```
+* On all the compute nodes, check the slurmd service with `systemctl status slurmd` and restart it manually if it is not running (see the sketch below).
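+
+On a compute node this looks like the following (a minimal sketch):
+```
+systemctl status slurmd     # check whether slurmd is active
+systemctl restart slurmd    # restart it if it is not running
+```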
+
+## What to do when the Slurm services fail? 
+Cause: The `slurm.conf` is not configured properly.  
+Resolution:
+1. Run the following commands:
+```
+slurmdbd -Dvvv
+slurmctld -Dvvv
+```
+2. Check the `/var/lib/log/slurmctld.log` file for errors.
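+
+If the daemon output points at configuration problems, dumping the running configuration can help when comparing against `slurm.conf` (a sketch; assumes the Slurm client tools are available on the manager node):
+```
+scontrol show config | less    # compare against the deployed slurm.conf
+sinfo -N -l                    # node states as reported by slurmctld
+```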
+
+## What to do when the error "ports are unavailable" is displayed?
+Cause: Slurm database connection fails.  
+Resolution:
+1. Run the following commands:
+```
+slurmdbd -Dvvv
+slurmctld -Dvvv
+```
+2. Check the `/var/lib/log/slurmctld.log` file.
+3. Check the listening ports: `netstat -antp | grep LISTEN`
+4. If a process is already listening on a Slurm port, kill the process holding that port (see the sketch after the restart commands).
+5. Restart all Slurm services:
+```
+systemctl restart slurmctld    # on the manager node
+systemctl restart slurmdbd     # on the manager node
+systemctl restart slurmd       # on each compute node
+```
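+
+To identify which process is holding a Slurm port before restarting the services, a filtered check such as the following can be used (a sketch; 6817, 6818, and 6819 are the default slurmctld, slurmd, and slurmdbd ports):
+```
+netstat -antp | grep LISTEN | grep -E '6817|6818|6819'
+kill <PID>    # replace <PID> with the process id reported above
+```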
+		
+## What to do if Kubernetes Pods are unable to communicate with the servers when the DNS servers are not responding?  
+Cause: The pod network conflicts with the host network, which causes a DNS issue.  
+Resolution:
+1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
+2. On the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16; ensure that it does not overlap with any range in use on your host network.
+3. Execute `omnia.yml` and skip Slurm using the skip tag __slurm__ (see the sketch below).
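+
+Put together, the recovery might look like the following (a sketch; the `--skip-tags` form is taken from the CLI installation instructions and the inventory name is an assumption):
+```
+# On every cluster node
+kubeadm reset -f
+# On the management node, after updating k8s_pod_network_cidr in omnia_config.yml
+ansible-playbook omnia.yml -i inventory --skip-tags "slurm"
+```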
+
+## What to do if time taken to pull the images to create the Kubeflow containers exceeds the limit and the Apply Kubeflow configurations task fails?  
+Cause: Unstable or slow Internet connectivity.  
+Resolution:
+1. Complete the PXE booting or format the OS on the manager and compute nodes.
+2. In the `omnia_config.yml` file, change the `k8s_cni` variable value from `calico` to `flannel`.
+3. Run the Kubernetes and Kubeflow playbooks (see the sketch below).
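+
+The CNI change and the re-run can be scripted roughly as follows (a sketch; the `sed` edit and the inventory name are illustrative, and the playbook commands mirror the installation guides):
+```
+# Switch the CNI from calico to flannel in omnia_config.yml
+sed -i 's/^k8s_cni: "calico"/k8s_cni: "flannel"/' omnia_config.yml
+# Re-run the Kubernetes and Kubeflow playbooks
+ansible-playbook omnia.yml -i inventory --skip-tags "slurm"
+ansible-playbook platforms/kubeflow.yml -i inventory
+```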

+ 5 - 0
docs/INSTALL_OMNIA.md

@@ -107,6 +107,11 @@ Commands to install JupyterHub and Kubeflow:
 * `ansible-playbook platforms/jupyterhub.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"`
 * `ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" `
 
+__Note:__ When the Internet connectivity is unstable or slow, it may take more time to pull the images to create the Kubeflow containers. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, you must redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps:
+* Format the OS on manager and compute nodes.
+* In the `omnia_config.yml` file, change the `k8s_cni` variable value from `calico` to `flannel`.
+* Run the Kubernetes and Kubeflow playbooks (see the sketch below).
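+
+A minimal re-run sequence after these changes might look like this (a sketch; the `omnia.yml` invocation is assumed to follow the same pattern as the commands above):
+```
+ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"
+ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"
+```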
+
 ## Add a new compute node to the cluster
 
 To update the INVENTORY file present in `omnia` directory with the new node IP address under the compute group. Ensure the other nodes which are already a part of the cluster are also present in the compute group along with the new node. Then, run`omnia.yml` to add the new node to the cluster and update the configurations of the manager node.

+ 6 - 1
docs/INSTALL_OMNIA_APPLIANCE.md

@@ -43,7 +43,7 @@ Omnia considers the following usernames as default:
 * `admin` for AWX
 * `slurm` for MariaDB
 
-8. Run `ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"` to install Omnia appliance.
+9. Run `ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"` to install Omnia appliance.
 
    
 Omnia creates a log file which is available at: `/var/log/omnia.log`.
@@ -115,6 +115,11 @@ __Note:__ To install __JupyterHub__ and __Kubeflow__ playbooks:
 *	From __PLAYBOOK__ dropdown menu, select __platforms/jupyterhub.yml__ and launch the template to install JupyterHub playbook.
 *	From __PLAYBOOK__ dropdown menu, select __platforms/kubeflow.yml__ and launch the template to install Kubeflow playbook.
 
+__Note:__ When the Internet connectivity is unstable or slow, it may take more time to pull the images to create the Kubeflow containers. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, you must redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps:
+* Complete the PXE booting of the manager and compute nodes.
+* In the `omnia_config.yml` file, change the `k8s_cni` variable value from `calico` to `flannel`.
+* Run the Kubernetes and Kubeflow playbooks.
+
 The DeployOmnia template may not run successfully if:
 - The Manager group contains more than one host.
 - The Compute group does not contain a host. Ensure that the Compute group is assigned with at least one host node.

+ 1 - 1
docs/PREINSTALL_OMNIA.md

@@ -7,7 +7,7 @@ Ensure that the following prerequisites are met:
 * SSH Keys for root have been installed on all nodes to allow for password-less SSH.
 * On the manager node, install Ansible and Git using the following commands:
 	* `yum install epel-release -y`
-	* `yum install ansible-2.9.17 git -y`  
+	* `yum install ansible-2.9.18 git -y`  
 __Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command again.
 
 

+ 2 - 2
docs/PREINSTALL_OMNIA_APPLIANCE.md

@@ -3,7 +3,7 @@
 Ensure that the following prequisites are met before installing the Omnia appliance:
 * On the management node, install Ansible and Git using the following commands:
 	* `yum install epel-release -y`
-	* `yum install ansible-2.9.17 git -y` 
+	* `yum install ansible-2.9.18 git -y`  
 	__Note:__ Ansible must be installed using __yum__. If Ansible is installed using __pip3__, re-install it using the __yum__ command again.
 * Ensure a stable Internet connection is available on management node and target nodes. 
 * CentOS 7.9 2009 is installed on the management node.
@@ -11,7 +11,7 @@ Ensure that the following prequisites are met before installing the Omnia applia
 * For DHCP configuration, you can provide a mapping file. The provided details must be in the format: MAC, Hostname, IP. For example, `xx:xx:4B:C4:xx:44,validation01,172.17.0.81` and  `xx:xx:4B:C5:xx:52,validation02,172.17.0.82` are valid entries.  
 __Note:__ A template for mapping file is present in the `omnia/examples`, named `mapping_file.csv`. The header in the template file must not be deleted before saving the file.  
 __Note:__ Ensure that duplicate values are not provided for MAC, Hostname, and IP in the mapping file. The Hostname should not contain the following characters: , (comma), \. (period), and - (hyphen).
-* Connect one of the Ethernet cards on the management node to the HPC switch and the other ethernet card connected to the lobal network.
+* Connect one of the Ethernet cards on the management node to the HPC switch and the other Ethernet card to the global network.
 * If SELinux is not disabled on the management node, disable it from `/etc/sysconfig/selinux` and restart the management node.
 * The default mode of PXE is __UEFI__, and the BIOS Legacy Mode is not supported.
 * The default boot order for the bare metal servers must be __PXE__.

+ 11 - 89
docs/README.md

@@ -18,15 +18,17 @@ Omnia can install Kubernetes or Slurm (or both), along with additional drivers,
 
 ![Omnia Slurm Stack](images/omnia-slurm.png) 
 
-## Installing Omnia
-Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [Preparation to install Omnia](PREINSTALL_OMNIA.md) for instructions on network setup.
+## Deploying clusters using the Omnia Appliance
+The Omnia Appliance automates the entire cluster deployment process, starting with provisioning the operating system on the servers.
 
-Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see [Install Omnia using CLI](INSTALL_OMNIA.md) for detailed instructions.  
+Ensure all the prerequisites listed in [preparation to install Omnia Appliance](PREINSTALL_OMNIA_APPLIANCE.md) are met before installing the Omnia appliance.
+
+For detailed instructions on installing the Omnia appliance, see [Install Omnia Appliance](INSTALL_OMNIA_APPLIANCE.md).
 
-## Installing the Omnia appliance
-Ensure all the prerequisites listed in the [PREINSTALL_OMNIA_APPLIANCE](PREINSTALL_OMNIA_APPLIANCE.md) are met before installing the Omnia appliance.
+## Installing Omnia to servers with a pre-provisioned OS
+Omnia can deploy clusters to servers that already have an RPM-based Linux OS running on them and are connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [Preparation to install Omnia](PREINSTALL_OMNIA.md) for instructions on network setup.
 
-For detailed instructions on installing the Omnia appliance, see [INSTALL_OMNIA_APPLIANCE](INSTALL_OMNIA_APPLIANCE.md).
+Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see [Install Omnia using CLI](INSTALL_OMNIA.md) for detailed instructions.  
 
 # System requirements  
 Ensure the supported version of all the software are installed as per the following table and other versions than those listed are not supported by Omnia. This is to ensure that there is no impact to the functionality of Omnia.
@@ -78,90 +80,10 @@ Issue: Hosts do not display on the AWX UI.
 Resolution:  
 * Verify if `provisioned_hosts.yml` is present in the `omnia/appliance/roles/inventory/files` folder.
 * Verify if hosts are not listed in the `provisioned_hosts.yml` file. If hosts are not listed, then servers are not PXE booted yet.
-* If hosts are listed in the `provisioned_hosts.yml` file, then an IP address has been assigned to them by DHCP. However, hosts are not displyed on the AWX UI as the PXE boot is still in process or is not initiated.
+* If hosts are listed in the `provisioned_hosts.yml` file, then an IP address has been assigned to them by DHCP. However, hosts are not displayed on the AWX UI as the PXE boot is still in process or is not initiated.
 * Check for the reachable and unreachable hosts using the `provisioned_report.yml` tool present in the `omnia/appliance/tools` folder. To run provisioned_report.yml, in the omnia/appliance directory, run `playbook -i roles/inventory/files/provisioned_hosts.yml tools/provisioned_report.yml`.
 
-# Frequently asked questions
-* Why is the error "Wait for AWX UI to be up" displayed when `appliance.yaml` fails?  
-	Cause: 
-	1. When AWX is not accessible even after five minutes of wait time. 
-	2. When __isMigrating__ or __isInstalling__ is seen in the failure message.
-	
-  Resolution:  
-	Wait for AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yaml` file again, where __management-station-IP__ is the ip address of the management node.
-
-* What are the next steps after the nodes in a Kubernetes cluster reboots?  
-	Resolution: 
-	Wait for upto 15 minutes after the Kubernetes cluster reboots. Next, verify status of the cluster using the following services:
-	* `kubectl get nodes` on the manager node provides correct k8s cluster status.  
-	* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the **Running** state.
-	* `kubectl cluster-info` on the manager node displays both k8s master and kubeDNS are in the **Running** state.
-
-* What to do when the Kubernetes services are not in the __Running__  state?  
-	Resolution:	
-	1. Run `kubectl get pods --all-namespaces` to verify the pods are in the **Running** state.
-	2. If the pods are not in the **Running** state, delete the pods using the command:`kubectl delete pods <name of pod>`
-	3. Run the corresponding playbook that was used to install Kubernetes: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
-
-* What to do when the JupyterHub or Prometheus UI are not accessible?  
-	Resolution:
-	Run the command `kubectl get pods --namespace default` to ensure **nfs-client** pod and all prometheus server pods are in the **Running** state. 
-
-* While configuring the Cobbler, why does the `appliance.yml` fail with an error during the Run import command?  
-	Cause:
-	* When the mounted .iso file is corrupt.
-	
-  Resolution:
-	1. Go to __var__->__log__->__cobbler__->__cobbler.log__ to view the error.
-	2. If the error message is **repo verification failed** then it signifies that the .iso file is not mounted properly.
-	3. Verify if the downloaded .iso file is valid and correct.
-	4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `appliance.yml`.
-
-* Why does the PXE boot fail with tftp timeout or service timeout errors?  
-	Cause:
-	* When RAID is configured on the server.
-	* When more than two servers in the same network have Cobbler services running.  
-	
-  Resolution:  
-	1. Create a Non-RAID or virtual disk in the server.  
-	2. Check if other systems except for the management node has cobblerd running. If yes, then stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
-
-* What to do when the Slurm services do not start automatically after the cluster reboots?  
-	Resolution: 
-	* Manually restart the slurmd services on the manager node by running the following commands:
-		* `systemctl restart slurmdbd`
-		* `systemctl restart slurmctld`
-		* `systemctl restart prometheus-slurm-exporter`
-	* Run `systemctl status slurmd` to manually restart the following service on all the compute nodes.
-
-* What to do when the Slurm services fail? 
-	Cause: The `slurm.conf` is not configured properly.  
-	Resolution:
-	1. Run the following commands:
-		* `slurmdbd -Dvvv`
-		* `slurmctld -Dvvv`
-	2. Verify `/var/lib/log/slurmctld.log` file.
-
-* What to do when when the error "ports are unavailable" is displayed?
-	Cause: Slurm database connection fails.  
-	Resolution:
-	1. Run the following commands:
-		* `slurmdbd -Dvvv`
-		*`slurmctld -Dvvv`
-	2. Verify the `/var/lib/log/slurmctld.log` file.
-	3. Verify: netstat -antp | grep LISTEN
-	4. If PIDs are in the **Listening** state, kill the processes of that specific port.
-	5. Restart all Slurm services:
-		* slurmctl restart slurmctld on manager node
-		* systemctl restart slurmdbd on manager node
-		* systemctl restart slurmd on compute node
-		
-* What to do if Kubernetes Pods are unable to communicate with the servers when the DNS servers are not responding?
-	Cause: With the host network which is DNS issue.
-	Resolution:
-	1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
-	2. In the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. Suggested IP range is 192.168.0.0/16 and ensure you provide an IP which is not in use in your host network.
-	3. Execute omnia.yml and skip slurm using __skip_ tag __slurm__.
+# [Frequently asked questions](FAQ.md)
 
 # Limitations
 1. Removal of Slurm and Kubernetes component roles are not supported. However, skip tags can be provided at the start of installation to select the component roles.​
@@ -183,4 +105,4 @@ It's not just new features and bug fixes that can be contributed to the Omnia pr
 * Feedback
 * Validation that it works for your particular configuration
 
-If you would like to contribute, see [CONTRIBUTORS](https://github.com/dellhpc/omnia/b
+If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/blob/release/CONTRIBUTING.md).

+ 1 - 0
docs/_config.yml

@@ -2,3 +2,4 @@ theme: jekyll-theme-minimal
 title: Omnia
 description: Ansible playbook-based tools for deploying Slurm and Kubernetes clusters for High Performance Computing, Machine Learning, Deep Learning, and High-Performance Data Analytics
 logo: images/omnia-logo.png
+markdown: kramdown

BIN
docs/images/omnia-branch-structure.png


BIN
docs/images/omnia-overview.png


+ 1 - 1
omnia_config.yml

@@ -26,4 +26,4 @@ k8s_cni: "calico"
 # Kubernetes pod network CIDR.
 # Default value is "10.244.0.0/16"
 # Make sure this value does not overlap with any of the host networks.
-k8s_pod_network_cidr: "10.244.0.0/16"
+k8s_pod_network_cidr: "10.244.0.0/16"

+ 0 - 6
site/CONTRIBUTORS.md

@@ -1,6 +0,0 @@
-# Omnia Maintainers
-- Luke Wilson and John Lockman (Dell Technologies)
-<img src="images/delltech.jpg" height="90px" alt="Dell Technologies">
-
-# Omnia Contributors
-<img src="images/delltech.jpg" height="90px" alt="Dell Technologies"> <img src="images/pisa.png" height="100px" alt="Universita di Pisa">

+ 0 - 110
site/INSTALL.md

@@ -1,110 +0,0 @@
-## TL;DR Installation
- 
-### Kubernetes
-Install Slurm and Kubernetes, along with all dependencies
-```
-ansible-playbook -i host_inventory_file omnia.yml
-```
-
-Install Slurm only
-```
-ansible-playbook -i host_inventory_file omnia.yml --skip-tags "k8s"
-```
-
-Install Kubernetes only
-```
-ansible-playbook -i host_inventory_file omnia.yml --skip-tags "slurm"
- 
-
-Initialize Kubernetes cluster (packages already installed)
-```
-ansible-playbook -i host_inventory_file omnia.yml --skip-tags "slurm" --tags "init"
-```
-
-### Install Kubeflow 
-```
-ansible-playbook -i host_inventory_file platforms/kubeflow.yml
-```
-
-# Omnia  
-Omnia is a collection of [Ansible](https://www.ansible.com/) playbooks which perform:
-* Installation of [Slurm](https://slurm.schedmd.com/) and/or [Kubernetes](https://kubernetes.io/) on servers already provisioned with a standard [CentOS](https://www.centos.org/) image.
-* Installation of auxiliary scripts for administrator functions such as moving nodes between Slurm and Kubernetes personalities.
-
-Omnia playbooks perform several tasks:
-`common` playbook handles installation of software 
-* Add yum repositories:
-    - Kubernetes (Google)
-    - El Repo (for Nvidia drivers)
-    - EPEL (Extra Packages for Enterprise Linux)
-* Install Packages from repos:
-    - bash-completion
-    - docker
-    - gcc
-    - python-pip
-    - kubelet
-    - kubeadm
-    - kubectl
-    - nfs-utils
-    - nvidia-detect
-    - yum-plugin-versionlock
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-
-`computeGPU` playbook installs Nvidia drivers and nvidia-container-runtime-hook
-* Add yum repositories:
-    - Nvidia (container runtime)
-* Install Packages from repos:
-    - kmod-nvidia
-    - nvidia-container-runtime-hook
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-* Configuration:
-    - Enable GPU Device Plugins (nvidia-container-runtime-hook)
-    - Modify kubeadm config to allow GPUs as schedulable resource 
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-
-`master` playbook
-* Install Helm v3
-* (optional) add firewall rules for Slurm and kubernetes
-
-Everything from this point on can be called by using the `init` tag
-```
-ansible-playbook -i host_inventory_file kubernetes/kubernetes.yml --tags "init"
-```
-
-`startmaster` playbook
-* turn off swap
-*Initialize Kubernetes
-    * Head/master
-        - Start K8S pass startup token to compute/slaves
-        - Initialize software defined networking (Calico)
-
-`startworkers` playbook
-* turn off swap
-* Join k8s cluster
-
-`startservices` playbook
-* Setup K8S Dashboard
-* Add `stable` repo to helm
-* Add `jupyterhub` repo to helm
-* Update helm repos
-* Deploy NFS client Provisioner
-* Deploy Jupyterhub
-* Deploy Prometheus
-* Install MPI Operator
-
-
-### Slurm
-* Downloads and builds Slurm from source
-* Install package dependencies
-    - Python3
-    - munge
-    - MariaDB
-    - MariaDB development libraries
-* Build Slurm configuration files
-

+ 0 - 27
site/PREINSTALL.md

@@ -1,27 +0,0 @@
-# Pre-Installation Preparation
-
-## Assumptions
-Omnia assumes that prior to installation:
-* Systems have a base operating system (currently CentOS 7 or 8)
-* Network(s) has been cabled and nodes can reach the internet
-* SSH Keys for `root` have been installed on all nodes to allow for password-less SSH
-* Ansible is installed on either the master node or a separate deployment node
-```
-yum install ansible
-```
-
-## Example system designs
-Omnia can configure systems which use Ethernet- or Infiniband-based fabric to connect the compute servers.
-
-![Example system configuration with Ethernet fabric](images/example-system-ethernet.png)
-
-![Example system configuration with Infiniband fabric](images/example-system-infiniband.png)
-
-## Network Setup
-Omnia assumes that servers are already connected to the network and have access to the internet.
-### Network Topology
-Possible network configurations include:
-* A flat topology where all nodes are connected to a switch which includes an uplink to the internet. This requires multiple externally-facing IP addresses
-* A hierarchical topology where compute nodes are connected to a common switch, but the master node contains a second network connection which is connected to the internet. All outbound/inbound traffic would be routed through the master node. This requires setting up firewall rules for IP masquerade, see [here](https://www.server-world.info/en/note?os=CentOS_7&p=firewalld&f=2) for an example.
-### IP and Hostname Assignment
-The recommended setup is to assign IP addresses to individual servers. This can be done manually by logging onto each node, or via DHCP.

File diff suppressed because it is too large
+ 0 - 43
site/README.md


+ 0 - 4
site/_config.yml

@@ -1,4 +0,0 @@
-theme: jekyll-theme-minimal
-title: Omnia
-description: Ansible playbook-based tools for deploying Slurm and Kubernetes clusters for High Performance Computing, Machine Learning, Deep Learning, and High-Performance Data Analytics
-logo: images/omnia-logo.png

BIN
site/images/delltech.jpg


BIN
site/images/example-system-ethernet.png


BIN
site/images/example-system-infiniband.png


BIN
site/images/omnia-branch-structure.png


BIN
site/images/omnia-k8s.png


BIN
site/images/omnia-logo.png


BIN
site/images/omnia-overview.png


BIN
site/images/omnia-slurm.png


BIN
site/images/pisa.png


+ 0 - 10
site/metalLB/README.md

@@ -1,10 +0,0 @@
-# MetalLB 
-
-MetalLB is a load-balancer implementation for bare metal Kubernetes clusters, using standard routing protocols.
-https://metallb.universe.tf/
-
-Omnia installs MetalLB by manifest in the playbook `startservices`. A default configuration is provdied for layer2 protocol and an example for providing an address pool. Modify metal-config.yaml to suit your network requirements and apply the changes using with: 
-
-``` 
-kubectl apply -f metal-config.yaml
-```

+ 0 - 21
site/metalLB/metal-config.yaml

@@ -1,21 +0,0 @@
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  namespace: metallb-system
-  name: config
-data:
-  config: |
-    address-pools:
-    - name: default
-      protocol: layer2
-      addresses:
-      - 192.168.2.150/32
-      - 192.168.2.151/32
-      - 192.168.2.152/32
-      - 192.168.2.153/32
-      - 192.168.2.154/32
-      - 192.168.2.155/32
-      - 192.168.2.156/32
-      - 192.168.2.157/32
-      - 192.168.2.158/32
-      - 192.168.2.159/32