Using Omnia 1.1, you can provision and monitor hardware devices such as servers, storage devices, network switches, and InfiniBand switches in an HPC cluster. To enable Omnia to provision or configure the supported hardware devices, Omnia requires the following connections to be made available in your HPC cluster environment.
Through this management network, management DHCP assigns IP addresses to the devices in the HPC cluster.
Depending on the pass-through switch configured in your HPC environment, the number of supported racks is limited by the number of ports available on the pass-through switch. To support additional racks, you can form an L1-L2 topology and configure a network of pass-through switches. A typical layout of an HPC cluster with a network of pass-through switches is shown in the following illustration:
Disable SELinux by setting SELINUX=disabled in /etc/sysconfig/selinux, and restart the management station.

On the management station, ensure that Python 3.6 and Ansible are installed:
```
dnf install epel-release -y
dnf install python3 -y
pip3.6 install --upgrade pip
python3.6 -m pip install ansible
```
After the installation is complete, run ansible --version to verify that the installation is successful. In the output, ensure that the executable location path is present in the PATH variable by running echo $PATH. If the executable location path is not present, update the path by running export PATH=$PATH:<executable location>. For example:
```
ansible --version
ansible 2.10.9
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.8 (default, Aug 24 2020, 17:57:11) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]
```
In this example, the executable location is /usr/local/bin/ansible. Update the path by running the following command:

```
export PATH=$PATH:/usr/local/bin
```
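As a quick check, the following sketch appends /usr/local/bin to PATH only when the ansible executable cannot already be resolved (paths as in the example above):

```
# Sketch: update PATH only if the ansible executable is not already resolvable
command -v ansible >/dev/null 2>&1 || export PATH=$PATH:/usr/local/bin
ansible --version
```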
Note: Omnia requires Python 3.6 because it provides bindings to system tools such as RPM, DNF, and SELinux. As versions greater than 3.6 do not provide these bindings, ensure that you install Python 3.6 with dnf.
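A quick way to confirm that these bindings are available is the following sketch; it assumes the distribution binding packages (such as python3-rpm, python3-dnf, and python3-libselinux) are installed:

```
# Sketch: verify that the Python 3.6 system-tool bindings import cleanly
python3.6 -c "import rpm, dnf, selinux; print('bindings OK')"
```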
Note: If Ansible version 2.9 or later is installed, ensure it is uninstalled before installing a newer version of Ansible. Run the following commands to uninstall Ansible before upgrading to a newer version:

```
pip uninstall ansible
pip uninstall ansible-base    # if Ansible 2.9 is installed
pip uninstall ansible-core    # if Ansible 2.10 or later is installed
```
```
dnf install epel-release -y
dnf install git -y
```
Note:
- After the installation of the Omnia appliance, changing the management station is not supported. If you need to change the management station, you must redeploy the entire cluster.
- If there are errors while executing any of the Ansible playbook commands, then re-run the commands.
```
git clone https://github.com/dellhpc/omnia.git
cd omnia
```
Note:
- Supported values for Kubernetes CNI are calico and flannel. The default CNI value used by Omnia is calico.
- The default value of the Kubernetes Pod Network CIDR is 10.244.0.0/16. If 10.244.0.0/16 is already in use within your network, select a different Pod Network CIDR; a sketch for inspecting the host networks follows this list. For more information, see https://docs.projectcalico.org/getting-started/kubernetes/quickstart.
- The default path of the Ansible configuration file is /etc/ansible/. If the file is not present in the default path, edit the ansible_conf_file_path variable to update the configuration path.
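To help select a non-overlapping Pod Network CIDR, the following sketch lists the IPv4 networks currently routed on the management station (check every network used in your cluster, not just this host):

```
# Sketch: list locally routed IPv4 networks to compare against the Pod Network CIDR
ip -4 route show
```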
```
cd omnia/control_plane/input_params
```
| Variables [Required/ Optional] | Default, choices | Description |
| ----------- | ----------- | ----------- |
| ansible_conf_file_path | /etc/ansible | Directory path with the Ansible configuration file (ansible.cfg). If Ansible is installed using pip, provide the directory path. If the path is not provided, it is set to /etc/ansible by default. |
| public_nic [Required] | eno2 | The NIC or Ethernet card connected to the public internet. |
| appliance_k8s_pod_net_cidr [Required] | 192.168.0.0/16 | Kubernetes pod network CIDR for the appliance Kubernetes network. Ensure that this value does not overlap with any of the host networks. |
| snmp_trap_destination | | Enter an SNMP server IP address to receive SNMP traps from devices in the cluster. If this variable is left blank, SNMP is disabled. |
| snmp_community_name [Required] | public | SNMP community name. |
| awx_organization | DellEMC | Organization name configured in AWX. |
| timezone | GMT | Enter a timezone that is set during the provisioning of OS. GMT is set as the default time zone. You can set the time zone to EST, CET, MST, CST6CDT, or PST8PDT. For a list of available time zones, see the appliance/common/files/timezone.txt file. |
| language | en-US | Set the language used during the provisioning of OS. By default, it is set to en-US. |
| iso_file_path [Required] | | Provide the CentOS-7-x86_64-Minimal-2009 ISO file path. This ISO file is used by Cobbler to provision the OS on the compute nodes. Note: It is recommended that you do not rename the ISO image file. You must not change the path of this ISO image file, as the provisioning of the OS on the compute nodes may be impacted. |
| mngmnt_network_nic [Required] | eno1 | NIC or Ethernet card that is connected to the Management Network to provision the devices. By default, it is set to "eno1". |
| mngmnt_network_dhcp_start_range, mngmnt_network_dhcp_end_range [Required] | | DHCP range for the Management Network to assign IPv4 addresses. |
| mngmnt_mapping_file_path | | Enter the file path containing a device mapping file with the MAC addresses and respective IP addresses. A mapping_device_file.csv template file is provided under omnia/examples. Enter the details in the order: MAC address, IP address. For example, 10:11:12:13,1.2.3.4, 14:15:16:17,2.3.4.5, and 18:19:20:21,3.4.5.6 are all valid entries. Ensure that you do not provide any duplicate entries in the file. |
| host_network_nic [Required] | eno3 | NIC or Ethernet card that is connected to the Host Network to provision OS on bare metal servers. By default, it is set to "eno3". |
| host_network_dhcp_start_range, host_network_dhcp_end_range [Required] | | DHCP range for the Host Network to assign IPv4 addresses. |
| host_mapping_file_path | | Enter the file path containing a host mapping file with the MAC addresses, hostnames, IP addresses, and component roles. A mapping_host_file.csv template file is provided under omnia/examples. Enter the details in the order: MAC address, Hostname, IP address, Component_role. For example, 10:11:12:13,server1,100.96.20.66,compute, 14:15:16:17,server2,100.96.22.199,manager, 18:19:20:21,server3,100.96.23.67,nfs_node, and 22:23:24:25,server4,100.96.23.75,login_node are all valid entries. The hostname should not contain the following characters: , (comma), . (period), and - (hyphen). Ensure that you do not provide any duplicate entries in the file. |

NOTE: The IP address range 192.168.25.x is used for PowerVault Storage communications. Therefore, do not use IP addresses in this range for other configurations.
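For illustration, a minimal host mapping file might be created as follows (hypothetical path and values; start from the mapping_host_file.csv template under omnia/examples):

```
# Sketch: create a host mapping file with fields in the required order:
# MAC address, Hostname, IP address, Component_role
cat > /root/host_mapping_file.csv <<'EOF'
10:11:12:13,server1,100.96.20.66,compute
14:15:16:17,server2,100.96.22.199,manager
EOF
```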
```
cd ..
ansible-playbook control_plane.yml
```
Omnia creates a log file, which is available at /var/log/omnia.log.
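To follow the deployment progress, you can watch this log while the playbook runs:

```
# Watch the Omnia log during the control plane deployment
tail -f /var/log/omnia.log
```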
NOTE: If you want to view or edit the login_vars.yml file, run the following commands:

```
cd input_params
ansible-vault view login_vars.yml --vault-password-file .login_vault_key
```

or

```
ansible-vault edit login_vars.yml --vault-password-file .login_vault_key
```

NOTE: It is suggested that you use the ansible-vault view or edit commands, and that you do not use the ansible-vault decrypt or encrypt commands. If you have used the ansible-vault decrypt or encrypt commands, provide 644 permissions to login_vars.yml.
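If you did run a manual decrypt or encrypt, the expected permissions can be restored as the note describes:

```
# Restore the expected permissions on login_vars.yml after a manual decrypt/encrypt
chmod 644 login_vars.yml
```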
After you deploy the Omnia control plane, devices such as Ethernet switches, InfiniBand switches, and PowerVault storage devices are configured by Omnia according to the support enabled in the base_vars.yml file. The bare metal servers in the cluster are provisioned with custom CentOS based on the availability of an iDRAC Enterprise or Datacenter license on the iDRAC.
Omnia performs a set of default configurations on AWX.
Note: The AWX configurations are automatically performed by Omnia, and Dell Technologies recommends that you do not change the default configurations that are provided by Omnia as the functionality may be impacted.
For Omnia to configure the devices and to provision the bare metal servers that are newly introduced in the cluster, you must configure the corresponding input parameters and deploy the device-specific template from the AWX UI. Based on the devices added to the cluster, click the respective link to go to the configuration section.
1. Run the following command to retrieve the awx-ui service details:

```
kubectl get svc -n awx
```

2. To retrieve the AWX UI password, run:

```
kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode
```

3. Open a web browser and enter http://<IP>:8052, where IP is the awx-ui IP address and 8052 is the awx-ui port number. Log in to the AWX UI using the username admin and the retrieved password.

Note: If you have set the login_node_required variable in the omnia_config file to "false", you can skip assigning a host to the login node.

To skip the installation of Slurm, enter slurm and select the slurm skip tag. To skip the installation of Kubernetes, enter kubernetes and select the kubernetes skip tag.

NOTE: If you would like to skip the NFS client setup, enter nfs_client in the skip tag section to skip the k8s_nfs_client_setup role of Kubernetes.
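For convenience, the two retrieval commands above can be combined into a short shell sketch (service and secret names exactly as given in the steps):

```
# Sketch: print the awx-ui service details and fetch the AWX admin password
kubectl get svc -n awx
AWX_PASSWORD=$(kubectl get secret awx-admin-password -n awx \
  -o jsonpath="{.data.password}" | base64 --decode)
echo "AWX admin password: ${AWX_PASSWORD}"
```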
The deploy_omnia_template may not run successfully if hosts are not assigned correctly to the manager and compute groups.

Note: On the AWX UI, hosts are listed only after a few nodes have been provisioned by Omnia. It takes approximately 10 to 15 minutes for the host details to be displayed after provisioning is complete. If a device is provisioned but you are unable to view the host details on the AWX UI, run the following command from the omnia/control_plane/tools folder to view the hosts that are reachable:
```
ansible-playbook -i ../roles/collect_node_info/provisioned_hosts.yml provision_report.yml
```
If you want to install JupyterHub and Kubeflow, you must install the JupyterHub playbook first and then the Kubeflow playbook.
To install JupyterHub and Kubeflow playbooks:
Note: When the Internet connectivity is unstable or slow, it may take more time to pull the images required to create the Kubeflow containers. If the time limit is exceeded, the Apply Kubeflow configurations task may fail. To resolve this issue, you must redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps:
In the omnia_config.yml file, change the k8s_cni variable value from calico to flannel.

NOTE: If you want to view or edit the omnia_config.yml file, run one of the following commands:

```
ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key    # to view the file
ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key    # to edit the file
```
After the DeployOmnia template is run from the AWX UI, the omnia.yml file installs Kubernetes and Slurm, or either of them, depending on the selection made in the template on the management station. Additionally, appropriate roles are assigned to the compute and manager groups.
The following Kubernetes roles are provided by Omnia when the omnia.yml file is run:
A directory, /home/k8snfs, is created. Compute nodes use this directory to share common files.

Note:
- Whenever k8s_version, k8s_cni, or k8s_pod_network_cidr needs to be modified after the HPC cluster is set up, the OS on the manager and compute nodes in the cluster must be re-flashed before executing omnia.yml again.
- After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes pod-to-pod and pod-to-service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
- If Kubernetes pods are unable to communicate with the servers (that is, unable to access the Internet) when the DNS servers are not responding, the Kubernetes Pod Network CIDR may be overlapping with the host network, which causes a DNS conflict. To resolve this issue:
- Disable firewalld.service.
- If the issue persists, then perform the following actions:
a. Format the OS on the manager and compute nodes.
b. On the management station, edit the omnia_config.yml file to change the Kubernetes Pod Network CIDR or CNI value. The suggested IP range is 192.168.0.0/16; ensure that you provide an IP range that is not in use in your host network.
c. Execute omnia.yml and skip Slurm using --skip-tags slurm, as shown in the sketch after this list.
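Step (c) could look like the following sketch, assuming omnia.yml is executed with ansible-playbook from the omnia directory as in the earlier steps; the kubectl commands afterwards confirm that the rebuilt cluster is healthy (run them where kubectl is configured, typically the manager node):

```
# Sketch: re-run Omnia while skipping the Slurm roles (step c above)
ansible-playbook omnia.yml --skip-tags slurm

# Confirm that nodes joined and pods are running after the re-deployment
kubectl get nodes -o wide
kubectl get pods --all-namespaces
```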
The following Slurm roles are provided by Omnia when the omnia.yml file is run:
To enable the login node, the login_node_required variable must be set to "true" in the omnia_config.yml file.
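For example, the variable can be changed through the encrypted file as described earlier (vault key file name as used in the note above):

```
# Sketch: open the encrypted omnia_config.yml and set login_node_required: true
ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key
```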
If a new node is provisioned through Cobbler, the node address is automatically displayed on the AWX dashboard. The node is not assigned to any group. You can add the node to the compute group along with the existing nodes and run omnia.yml to add the new node to the cluster and update the configurations on the manager node.