Using Omnia 1.2, you can provision and monitor hardware devices such as servers, storage devices, network switches, and InfiniBand switches in an HPC cluster. To enable Omnia to provision or configure the supported hardware devices, Omnia requires the following connections to be made available in your HPC cluster environment.
Through this management network, the management DHCP server assigns IP addresses to the devices in the HPC cluster.
Note: Cobbler web support has been discontinued from Omnia 1.2 onwards.
The number of racks that can be supported is limited by the number of ports available on the pass-through switch configured in your HPC environment. To support additional racks, you can form an L1-L2 topology and configure a network of pass-through switches. A typical layout of an HPC cluster with a network of pass-through switches is shown in the following illustration:
Note: Refer to the Omnia_Control_Plane_PreReqs.md file to ensure smooth running of the control_plane.
git clone https://github.com/dellhpc/omnia.git
cd omnia
Note: Ensure that the parameter enable_security_support in telemetry\input_params\base_vars.yml is set to 'false' before editing the following variables.
To configure the login node, refer to Install_Omnia.
Note:
- Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico.
- The default value of Kubernetes Pod Network CIDR is 10.244.0.0/16. If 10.244.0.0/16 is already in use within your network, select a different Pod Network CIDR. For more information, see https://docs.projectcalico.org/getting-started/kubernetes/quickstart.
- The default path of the Ansible configuration file is /etc/ansible/. If the file is not present in the default path, then edit the ansible_conf_file_path variable to update the configuration path (a quick check is shown after this note).
- If you choose to enable security on the login node, simply follow the steps mentioned here.
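For example, a minimal check, assuming the default location and the standard ansible.cfg file name, to confirm that an Ansible configuration file is present (if it lives elsewhere, set ansible_conf_file_path to that directory):
# Verify the Ansible configuration file in the default path referenced above.
ls -l /etc/ansible/ansible.cfg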
cd omnia/control_plane/input_params
Edit the base_vars.yml file to update the required variables.
Note: The IP address 192.168.25.x is used for PowerVault Storage communications. Therefore, do not use this IP address for other configurations.
Provided that the host_mapping_file_path is updated as per the provided template, Omnia deploys the control plane and assigns the component roles by executing the omnia.yml file. To deploy the Omnia control plane, run the following command:
ansible-playbook control_plane.yml
If the host_mapping_file_path is not provided, then you must manually assign the component roles through the AWX UI. Go to Assign component roles using AWX UI.
Omnia creates a log file, which is available at /var/log/omnia.log.
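For example, to follow the control plane log while the playbook runs:
# Stream new entries from the Omnia log file mentioned above.
tail -f /var/log/omnia.log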
The installation of the Omnia control plane depends largely on the variables entered in base_vars.yml. These variables decide which Omnia functionalities are actually required in your environment.
The Omnia control plane starts with the choice of assigning management/communication IPs (device_config_support) to all available servers, switches, and PowerVault devices. When set to true, all applicable devices are given new IPs. It is recommended that, when device_config_support is true, a device mapping file (example here) is used to keep the assigned IPs persistent between control plane reboots. If idrac_support is true, the devices are expected to have their own IPs furnished in the file path mentioned under device_ip_list_path. Having these IPs allows Omnia to reach and configure switches, servers, and PowerVaults without disturbing the existing network setup. Users can choose which devices require configuration using the variables ethernet_switch_support, ib_switch_support, and powervault_support.
| device_config_support | idrac_support | Outcome |
|---|---|---|
| true | true | New management IPs will be assigned and servers will be provisioned based on the value of provision_method. |
| true | false | An assert failure in control_plane_common will manifest and Omnia Control Plane will fail. |
| false | true | Omnia will not assign IPs to the devices/iDRAC. Deployment will take place via the IPs provided in device_ip_list_path, based on the provision_method. |
| false | false | No IPs will be assigned by Omnia. Provisioning will only be through PXE. Slurm and Kubernetes can be deployed in the cluster. |
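Before running control_plane.yml, a quick way to confirm which combination from the table above is in effect is to print the relevant toggles. A minimal sketch, assuming you are in the omnia/control_plane/input_params directory:
# Show the device/iDRAC toggles and the per-device support flags in base_vars.yml.
grep -E "device_config_support|idrac_support|ethernet_switch_support|ib_switch_support|powervault_support" base_vars.yml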
Once all network configuration is complete, Omnia uses AWX to integrate a centralized log system, receive live updates of running jobs, scheduled jobs, and so on. AWX can also be used to assign component roles and install Kubernetes, JupyterHub, Kubeflow, Slurm, Prometheus, and Grafana.
The file login_vars.yml is populated with all credentials used by Omnia to deploy services.
If you want to view or edit the login_vars.yml file, run the following commands:
cd input_params
ansible-vault view login_vars.yml --vault-password-file .login_vault_key
or
ansible-vault edit login_vars.yml --vault-password-file .login_vault_key
Note: It is suggested that you use the ansible-vault view or edit commands and that you do not use the ansible-vault decrypt or encrypt commands. If you have used the ansible-vault decrypt or encrypt commands, provide 644 permission to login_vars.yml.
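For example, if the file has already been decrypted with ansible-vault decrypt, restore the recommended permissions:
# Give login_vars.yml 644 permissions, as suggested in the note above.
chmod 644 login_vars.yml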
Omnia performs the following configurations on AWX:
Note: The AWX configurations are automatically performed by Omnia, and Dell Technologies recommends that you do not change the default configurations that are provided by Omnia as the functionality may be impacted.
For Omnia to configure the devices and to provision the bare metal servers that are newly introduced in the cluster, you must configure the corresponding input parameters and deploy the device-specific template from the AWX UI. Based on the devices added to the cluster, click the respective link to go to the relevant configuration section.
To access the AWX UI:
- Run kubectl get svc -n awx and note the IP address of the awx-ui service.
- To retrieve the AWX UI password, run kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode.
- Open http://<IP>:8052, where IP is the awx-ui IP address and 8052 is the awx-ui port number. Log in to the AWX UI using the username admin and the retrieved password.
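As a convenience, the address and password can be captured in shell variables before opening the UI. This is only a sketch and assumes the service is named awx-ui, as referenced above:
# Collect the awx-ui Cluster-IP and the decoded admin password, then print the login URL.
AWX_IP=$(kubectl get svc awx-ui -n awx -o jsonpath="{.spec.clusterIP}")
AWX_PASSWORD=$(kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode)
echo "AWX UI: http://${AWX_IP}:8052 (user: admin)"
echo "Password: ${AWX_PASSWORD}"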
Note: If you have set the login_node_required variable in the omnia_config file to "false", then you can skip assigning a host to the login node.
- To install only Kubernetes, enter slurm and select the slurm skip tag.
- To install only Slurm, enter kubernetes and select the kubernetes skip tag.
Note: If you would like to skip the NFS client setup, enter nfs_client in the skip tag section to skip the k8s_nfs_client_setup role of Kubernetes.
The deploy_omnia_template may not run successfully if:
Note: If you have set the login_node_required variable in the omnia_config file to "false", then you can skip assigning a host to the login node.
Note: On the AWX UI, hosts will be listed only after a few nodes have been provisioned by Omnia. It takes approximately 10 to 15 minutes to display the host details after the provisioning is complete. If a device is provisioned but you are unable to view the host details on the AWX UI, then run the following command from the omnia -> control_plane -> tools folder to view the hosts that are reachable.
ansible-playbook -i ../roles/collect_node_info/provisioned_hosts.yml provision_report.yml
If you want to install JupyterHub and Kubeflow playbooks, you have to first install the JupyterHub playbook and then install the Kubeflow playbook.
To install JupyterHub and Kubeflow playbooks:
Note: When the Internet connectivity is unstable or slow, it may take more time to pull the images to create the Kubeflow containers. If the time limit is exceeded, the Apply Kubeflow configurations task may fail. To resolve this issue, you must redeploy Kubernetes cluster and reinstall Kubeflow by completing the following steps:
- In the omnia_config.yml file, change the k8s_cni variable value from calico to flannel.
Note: If you want to view or edit the omnia_config.yml file, run the following commands:
ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key -- To view the file.
ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key -- To edit the file.
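For example, to confirm the current CNI value without opening an editor, the vaulted file's output can simply be filtered:
# Show the k8s_cni setting currently stored in omnia_config.yml.
ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key | grep k8s_cni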
After the DeployOmnia template is run from the AWX UI, the omnia.yml file installs Kubernetes and Slurm, or either of them, as per the selection made in the template on the control plane. Additionally, appropriate roles are assigned to the compute and manager groups.
The following Kubernetes roles are provided by Omnia when the omnia.yml file is run:
- An NFS-share directory, /home/k8snfs, is created. Using this directory, compute nodes share the common files.
- k8s_start_services role
Whenever the k8s_version, k8s_cni, or k8s_pod_network_cidr needs to be modified after the HPC cluster is set up, the OS on the manager and compute nodes in the cluster must be re-flashed before executing omnia.yml again.
After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communications. Calico/flannel provides a full networking stack for Kubernetes pods.
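One way to verify this on a manager or compute node, assuming firewalld is in use, is to list the opened ports:
# List the ports opened in the active firewalld zone after omnia.yml completes.
firewall-cmd --list-ports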
If Kubernetes Pods are unable to communicate with the servers (that is, they are unable to access the Internet) when the DNS servers are not responding, the Kubernetes Pod Network CIDR may be overlapping with the host network, which causes the DNS issue. To resolve this issue:
- Run omnia.yml and skip Slurm using the --skip-tags slurm option.
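A sketch of that resolution step, assuming a manual run and using a placeholder for your node inventory file:
# Rerun omnia.yml while skipping the Slurm roles.
ansible-playbook omnia.yml -i <inventory_file_path> --skip-tags slurm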
The following Slurm roles are provided by Omnia when the omnia.yml file is run:
To enable the login node, the login_node_required variable must be set to "true" in the omnia_config.yml file.
Note: If LeapOS is being deployed, login_common and login_server roles will be skipped.
If a new node is provisioned through Cobbler, the node address is automatically displayed on the AWX dashboard. The node is not assigned to any group. You can add the node to the compute group along with the existing nodes and run omnia.yml
to add the new node to the cluster and update the configurations in the manager node.
From Omnia 1.2, the Cobbler container OS will follow the OS on the control plane but will deploy multiple OSes based on the provision_os value in base_vars.yml.
To provision multiple operating systems:
- Run control_plane.yml to provision the first OS.
- Update provision_os and iso_file_path in base_vars.yml. Then run control_plane.yml again.
Example: In a scenario where the user wishes to deploy LEAP and Rocky on their multiple servers, below are the steps they would use:
- Set provision_os to leap and iso_file_path to /root/openSUSE-Leap-15.3-DVD-x86_64-Current.iso.
- Run control_plane.yml to provision leap and create a profile called leap-x86_64 in the Cobbler container.
- Set provision_os to rocky and iso_file_path to /root/Rocky-8.x-x86_64-minimal.iso.
- Run control_plane.yml to provision rocky and create a profile called rocky-x86_64 in the Cobbler container.
Note: All compute nodes in a cluster must run the same OS.
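Expressed as the commands run on the control plane, the example above amounts to two passes, with the base_vars.yml edits shown as comments:
# First pass: provision_os is set to leap and iso_file_path to the Leap ISO in base_vars.yml.
ansible-playbook control_plane.yml
# Second pass: update base_vars.yml so that provision_os is rocky and iso_file_path points to the Rocky ISO, then rerun.
ansible-playbook control_plane.yml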