
Install Omnia

Prerequisites

Perform the following tasks before installing Omnia:

  • On the management node, install Ansible and Git using the following commands:
    • yum install epel-release -y
    • yum install ansible git -y

Note: Ansible must be installed using yum only. If Ansible was previously installed using pip3, remove it and reinstall it using yum, as shown below.
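
A minimal sketch of the reinstallation, assuming Ansible was previously installed with pip3:

    $ pip3 uninstall ansible -y
    $ yum install epel-release -y
    $ yum install ansible -y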

  • Ensure that a stable Internet connection is available on the management node and the target nodes.
  • Ensure that CentOS 7.9 (2009) is installed on the management node.
  • To provision the bare metal servers,
    • Go to http://isoredirect.centos.org/centos/7/isos/x86_64/ and download the CentOS-7-x86_64-Minimal-2009 ISO file to the following directory on the management node: omnia/appliance/roles/provision/files.
    • Rename the downloaded ISO file to CentOS-7-x86_64-Minimal-2009.iso.
  • For DHCP configuration, you can provide a mapping file named mapping_file.csv under omnia/appliance/roles/provision/files. The details in the CSV file must be in the following format:

    MAC,Hostname,IP
    xx:xx:4B:C4:xx:44,validation01,172.17.0.81
    xx:xx:4B:C5:xx:52,validation02,172.17.0.82

    Note: Do not provide duplicate hostnames in the mapping file, and the hostnames must not contain the characters "_" and ".".
  • Connect one of the Ethernet cards on the management node to the HPC switch, and connect another Ethernet card to the global_network.
  • If SELinux is not disabled on the management node, disable it in /etc/sysconfig/selinux and restart the management node (see the example after this list).
  • The default PXE mode is UEFI; the BIOS legacy mode is not supported.
  • The default boot order for the bare metal servers must be PXE.
  • RAID configuration is not part of Omnia. If a bare metal server has a RAID controller installed, you must create a virtual disk before provisioning.
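
For example, disabling SELinux amounts to setting one line in /etc/sysconfig/selinux and rebooting; a minimal sketch (the file's surrounding comments are omitted):

    # /etc/sysconfig/selinux: set SELinux to disabled, then restart the node
    SELINUX=disabled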

Steps to install Omnia

  1. On the management node, change the working directory to the directory where you want to clone the Omnia Git repository.
  2. Clone the Omnia repository.

    $ git clone https://github.com/dellhpc/omnia.git 
    

    Note: After the Omnia repository is cloned, a folder named omnia is created. It is recommended that you do not rename this folder.

  3. Change the directory to omnia/appliance.

  4. To provide passwords for Cobbler and AWX, edit the appliance_config.yml file.

  5. If you want to provide a mapping file for the DHCP configuration, set the mapping_file_exists variable in the appliance_config.yml file to true; otherwise, set it to false (see the sketch after step 7).

Omnia considers the following usernames as default:

  • cobbler for Cobbler Server
  • admin for AWX
  • slurm for Slurm

Note:

  • Passwords must be a minimum of eight characters and a maximum of 30 characters long.
  • Do not use the following characters in a password: hyphen (-), backslash (\), double quotation mark ("), and single quotation mark (').
  6. Using the appliance_config.yml file, you can also change the NIC used by the DHCP server (hpc_nic) and the NIC used to connect to the Internet (public_nic). The default values of hpc_nic and public_nic are em1 and em2, respectively.
  7. Set the DHCP range for the HPC cluster using the Dhcp_start_ip_range and Dhcp_end_ip_range variables in the appliance_config.yml file.
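
The relevant portion of appliance_config.yml might look like the following sketch. The IP values are placeholders, and the exact key names and defaults may vary between Omnia versions, so check the file in your clone:

    # appliance_config.yml (illustrative excerpt)
    mapping_file_exists: "true"          # "false" if no mapping file is provided
    hpc_nic: "em1"                       # NIC connected to the HPC switch (DHCP server)
    public_nic: "em2"                    # NIC connected to the Internet
    Dhcp_start_ip_range: "172.17.0.10"   # placeholder start of the DHCP range
    Dhcp_end_ip_range: "172.17.0.100"    # placeholder end of the DHCP range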
  8. To set the password for the Slurm database and to select the Kubernetes CNI, edit the omnia_config.yml file (see the sketch after the following note).

Note:

  • Supported Kubernetes CNIs: calico and flannel. The default is calico.
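
An illustrative excerpt of omnia_config.yml is shown below. The key names mariadb_password and k8s_cni are assumptions based on common Omnia releases; check the file shipped with your clone for the exact keys:

    # omnia_config.yml (illustrative excerpt)
    mariadb_password: "password"   # Slurm database password (8-30 characters; avoid -, \, ", ')
    k8s_cni: "calico"              # supported values: calico, flannel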

To view the set passwords of appliance_config.yml at a later time, run the following command from the omnia/appliance directory:

ansible-vault view appliance_config.yml --vault-password-file .vault_key

To view the set passwords of omnia_config.yml at a later time, run the following command:

ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
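
If a password needs to change later, ansible-vault also provides an edit subcommand that works with the same key files, for example:

    $ ansible-vault edit appliance_config.yml --vault-password-file .vault_key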
  9. To install Omnia, run the following command:

    $ ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"

Omnia creates a log file which is available at: /var/log/omnia.log.
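
To follow the installation progress while the playbook runs, you can watch this log, for example:

    $ tail -f /var/log/omnia.log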

Provision operating system on the target nodes
Omnia role used: provision

To create the Cobbler image, Omnia configures the following:

  • Firewall settings.
  • The Cobbler kickstart file, which enables UEFI PXE boot.

To access the Cobbler dashboard, enter https://<IP>/cobbler_web where <IP> is the Global IP address of the management node. For example, enter https://100.98.24.225/cobbler_web to access the Cobbler dashboard.

Note: After the Cobbler server provisions the operating system on the nodes, IP addresses and hostnames are assigned by the DHCP service. If a mapping file is not provided, hostnames are assigned in the format compute<x>-<y>, where <x> and <y> are the last two octets of the host IP address. For example, if the host IP address is 172.17.0.11, the assigned hostname is compute0-11. If a mapping file is provided, the hostnames follow the format provided in the mapping file.

Install and configure Ansible AWX
Omnia role used: web_ui
The AWX repository is cloned from GitHub: https://github.com/ansible/awx.git

Omnia performs the following configuration on AWX:

  • The default organization name is set to Dell EMC.
  • The default project name is set to omnia.
  • Credential: omnia_credential
  • Inventory: omnia_inventory with compute and manager groups
  • Template: DeployOmnia and Dynamic Inventory
  • Schedule: DynamicInventorySchedule, which runs every 10 minutes

To access the AWX dashboard, enter http://<IP>:8081 where <IP> is the Global IP address of the management node. For example, enter http://100.98.24.225:8081 to access the AWX dashboard.

Note: The AWX configurations are performed automatically by Omnia. Dell Technologies recommends that you do not change the default configurations provided by Omnia, as the functionality may be impacted.

Note: Although the AWX UI is accessible immediately, hosts are displayed only after Cobbler has provisioned a few nodes, which takes approximately 10-15 minutes. If a server has been provisioned but no host is visible in the AWX UI, run the provision_report.yml playbook from the omnia/appliance/tools folder to see which hosts are reachable (see the example below).
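
A minimal sketch of running that report from the management node (assuming no additional options are needed in your environment):

    $ cd omnia/appliance/tools
    $ ansible-playbook provision_report.yml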

Install Kubernetes and Slurm using AWX UI

Kubernetes and Slurm are installed by deploying the DeployOmnia template on the AWX dashboard.

  1. On the AWX dashboard, under RESOURCES -> Inventories, select Groups.
  2. Select either compute or manager group.
  3. Select the Hosts tab.
  4. To add the hosts provisioned by Cobbler, select Add -> Add existing host, and then select the hosts from the list and click Save.
  5. To deploy Omnia, under RESOURCES -> Templates, select DeployOmnia and click LAUNCH.
  6. By default, no skip tags are selected, and both Kubernetes and Slurm are deployed. To install only Kubernetes, enter slurm in the Skip Tags field and select Create "slurm". Similarly, to install only Slurm, add the kubernetes skip tag.

Note:

  • If you would like to skip the NFS client setup, enter nfs_client in the skip tag section to skip the k8s_nfs_client_setup role of Kubernetes.
  7. Click Next.
  8. Review the details in the Preview window, and click Launch to run the DeployOmnia template.
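
As an aside, skip tags can also be passed on the command line if omnia.yml is run directly rather than through AWX; this is a sketch of the equivalent invocation, not the AWX-driven flow described above:

    $ ansible-playbook omnia.yml --skip-tags "slurm"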

To establish the passwordless communication between compute nodes and manager node:

  1. In the AWX UI, under RESOURCES -> Templates, select the DeployOmnia template.
  2. From the Playbook dropdown menu, select appliance/tools/passwordless_ssh.yml and launch the template.

Note: If you want to install both the jupyterhub and kubeflow playbooks, install the jupyterhub playbook first and then install the kubeflow playbook.

Note: To install the jupyterhub and kubeflow playbooks:

  • In the AWX UI, under RESOURCES -> Templates, select the DeployOmnia template.
  • From the Playbook dropdown menu, select platforms/jupyterhub.yml and launch the template to run the jupyterhub playbook.
  • From the Playbook dropdown menu, select platforms/kubeflow.yml and launch the template to run the kubeflow playbook.

The DeployOmnia template may not run successfully if:

  • The Manager group contains more than one host.
  • The Compute group does not contain any host. Ensure that the Compute group is assigned a minimum of one host node.
  • Both the kubernetes and slurm tags are selected under Skip Tags.

After the DeployOmnia template is executed from the AWX UI, the omnia.yml file installs Kubernetes and Slurm, or either of them, depending on the selection made in the template on the management node. Additionally, appropriate roles are assigned to the compute and manager groups.

The following Kubernetes roles are provided by Omnia when the omnia.yml file is executed:

  • common role:
    • Installs common packages on the manager and compute nodes
    • Installs Docker
    • Deploys NTP/Chrony for time synchronization
    • Installs NVIDIA drivers and software components
  • k8s_common role:
    • Installs the required Kubernetes packages
    • Starts the Docker and Kubernetes services
  • k8s_manager role:
    • Installs the helm package manager for Kubernetes
  • k8s_firewalld role: This role enables the ports required by Kubernetes:
    • Head node ports: 6443, 2379-2380, 10251, 10252
    • Compute node ports: 10250, 30000-32767
    • Calico UDP port: 4789
    • Calico TCP ports: 5473, 179
    • Flannel UDP ports: 8285, 8472
  • k8s_nfs_server_setup role:
    • Creates an NFS share directory, /home/k8nfs, through which the compute nodes share common files
  • k8s_nfs_client_setup role
  • k8s_start_manager role:
    • Runs the /bin/kubeadm init command to initialize the Kubernetes services on the manager node
    • Creates a service account for the Kubernetes Dashboard
  • k8s_start_workers role:
    • Initializes the compute nodes and joins them to the Kubernetes cluster on the manager node
  • k8s_start_services role:
    • Deploys Kubernetes services such as the Kubernetes Dashboard, Prometheus, MetalLB, and the NFS client provisioner

Note: After Kubernetes is installed and configured, a few Kubernetes and Calico/Flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communication. Calico/Flannel provides a full networking stack for Kubernetes pods.
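
To see which ports were opened on a given node, you can query firewalld directly (assuming firewalld is the active firewall, as the k8s_firewalld role name suggests):

    $ firewall-cmd --list-ports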

The following Slurm roles are provided by Omnia when the omnia.yml file is executed:

  • slurm_common role:
    • Installs the common packages on the manager (head) node and the compute nodes
  • slurm_manager role:
    • Installs the packages related only to the manager node
    • Enables the ports required by Slurm (TCP ports: 6817, 6818, 6819; UDP ports: 6817, 6818, 6819)
    • Creates and updates the Slurm configuration files based on the manager node requirements
  • slurm_workers role:
    • Installs the Slurm packages on all compute nodes as per the compute node requirements
  • slurm_start_services role:
    • Starts the Slurm services so that the compute nodes can communicate with the manager node
  • slurm_exporter role:
    • Slurm Exporter is a package for exporting metrics collected from the Slurm resource scheduling system to Prometheus
    • Slurm Exporter is installed on the host in the same way as Slurm, and it installs successfully only if Slurm is installed
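
After the deployment completes, a quick sanity check from the manager node might look like the following; this is a sketch that assumes kubectl and the Slurm client tools are available on the PATH:

    $ kubectl get nodes   # manager and compute nodes should report a Ready status
    $ sinfo               # compute nodes should appear in the Slurm partition list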