@@ -5,7 +5,10 @@ The following sections provide details on installing Omnia using CLI.

To install the Omnia control plane and manage workloads on your cluster using the Omnia control plane, see [Install the Omnia Control Plane](INSTALL_OMNIA_CONTROL_PLANE.md) and [Monitor Kubernetes and Slurm](MONITOR_CLUSTERS.md) for more information.

## Prerequisites
-* The login, manager, and compute nodes must be running CentOS 7.9 2009 OS.
+* The login, manager, and compute nodes must be running CentOS 7.9 2009, Rocky 8.x, or LeapOS 15.3.
+>> __Note:__ If you are using LeapOS, the following repositories will be enabled when running `omnia.yml`:
+>> * OSS ([Repository](http://download.opensuse.org/distribution/leap/15.3/repo/oss/) + [Update](http://download.opensuse.org/update/leap/15.3/oss/))
+>> * Non-OSS ([Repository](http://download.opensuse.org/distribution/leap/15.3/repo/non-oss/) + [Update](http://download.opensuse.org/update/leap/15.3/non-oss/))
* If you have configured the `omnia_config.yml` file to enable the login node, the login node must be part of the cluster.
* All nodes must be connected to the network and must have access to the Internet.
* Set the hostnames of all the nodes in the cluster.
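Hostnames can be set with `hostnamectl`; a minimal sketch, where `compute01.example.com` is a placeholder:
```
hostnamectl set-hostname compute01.example.com
```
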
@@ -42,12 +45,12 @@ To install the Omnia control plane and manage workloads on your cluster using th

export PATH=$PATH:/usr/local/bin
```

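To keep `/usr/local/bin` on the PATH across sessions, the export can be appended to the shell profile; a minimal sketch, assuming bash:
```
echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
```
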
-**Note**: To deploy Omnia, Python 3.6 provides bindings to system tools such as RPM, DNF, and SELinux. As versions greater than 3.6 do not provide these bindings to system tools, ensure that you install Python 3.6 with dnf.
+>> **Note**: Omnia requires Python 3.6 because it provides bindings to system tools such as RPM, DNF, and SELinux. Python versions later than 3.6 do not provide these bindings, so ensure that you install Python 3.6 with dnf.
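A sketch of the install command; the package name `python36` is an assumption, so verify it for your distribution:
```
dnf install python36 -y
```
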
-**Note**: If Ansible version 2.9 or later is installed, ensure it is uninstalled before installing a newer version of Ansible. Run the following commands to uninstall Ansible before upgrading to a newer version.
-1. `pip uninstall ansible`
-2. `pip uninstall ansible-base (if ansible 2.9 is installed)`
-3. `pip uninstall ansible-core (if ansible 2.10 > version is installed)`
+>> **Note**: If Ansible 2.9 or later is already installed, uninstall it before installing a newer version. Run the following commands to remove the existing installation:
+>> 1. `pip uninstall ansible`
+>> 2. `pip uninstall ansible-base` (if Ansible 2.9 is installed)
+>> 3. `pip uninstall ansible-core` (if Ansible 2.10 or later is installed)
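Putting the note above together, a sketch of the removal-and-reinstall sequence (the final install command is an illustration; pick the Ansible version your setup requires):
```
ansible --version                  # check which version is currently installed
pip uninstall ansible
pip uninstall ansible-base         # only if Ansible 2.9 was installed
pip uninstall ansible-core         # only if Ansible 2.10 or later was installed
python3.6 -m pip install ansible   # then install the newer version
```
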
* On the management station, run the following commands to install Git:
@@ -56,7 +59,7 @@ To install the Omnia control plane and manage workloads on your cluster using th
dnf install git -y
```

-**Note**: If there are errors while executing the Ansible playbook commands, then re-run the commands.
+>> **Note**: If there are errors while executing the Ansible playbook commands, then re-run the commands.
## Steps to install Omnia using CLI
@@ -71,7 +74,7 @@ From release branch:
git clone -b release https://github.com/dellhpc/omnia.git
```-->
-__Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. Ensure that you do not rename this folder.
+>> __Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. Ensure that you do not rename this folder.
2. Change the directory to __omnia__: `cd omnia`
@@ -97,12 +100,15 @@ __Note:__ After the Omnia repository is cloned, a folder named __omnia__ is crea
>> __NOTE:__ Without the login node, Slurm jobs can be scheduled only through the manager node.
4. Create an inventory file in the *omnia* folder. Add the login node IP address under the *[login_node]* group, the manager node IP address under the *[manager]* group, the compute node IP addresses under the *[compute]* group, and the NFS node IP address under the *[nfs_node]* group. A template file named INVENTORY is provided in the *omnia/docs* folder.
- **NOTE**: Ensure that all the four groups (login_node, manager, compute, nfs_node) are present in the template, even if the IP addresses are not updated under login_node and nfs_node groups.
+>> **NOTE**: Ensure that all four groups (login_node, manager, compute, nfs_node) are present in the template, even if no IP addresses are added under the login_node and nfs_node groups.
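For reference, a minimal sketch of such an inventory, written with a heredoc; every IP address below is a placeholder, and all four groups are present even though *nfs_node* is left empty:
```
cat > inventory <<'EOF'
[manager]
172.17.0.100

[compute]
172.17.0.101
172.17.0.102

[login_node]
172.17.0.103

[nfs_node]
EOF
```
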
5. To install Omnia:
-```
-ansible-playbook omnia.yml -i inventory
-```
+
+| Leap OS | CentOS, Rocky |
+|---------|---------------|
+| `ansible-playbook omnia.yml -i inventory -e 'ansible_python_interpreter=/usr/bin/python3'` | `ansible-playbook omnia.yml -i inventory` |
+

6. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed.
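For example, to deploy only Kubernetes, Slurm can be skipped; the tag name `slurm` is assumed to match the playbook's tags:
```
ansible-playbook omnia.yml -i inventory --skip-tags slurm
```
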
@@ -118,15 +124,15 @@ ansible-playbook omnia.yml -i inventory
The default path of the Ansible configuration file is `/etc/ansible/`. If the file is not present in the default path, then edit the `ansible_config_file_path` variable to update the configuration path.
7. To provide passwords for the MariaDB database (used for Slurm accounting), the Kubernetes Pod Network CIDR, and the Kubernetes CNI, edit the `omnia_config.yml` file.
-__Note:__
+>> __Note:__
* Supported values for Kubernetes CNI are calico and flannel. The default value of CNI considered by Omnia is calico.
* The default value of Kubernetes Pod Network CIDR is 10.244.0.0/16. If 10.244.0.0/16 is already in use within your network, select a different Pod Network CIDR. For more information, see __https://docs.projectcalico.org/getting-started/kubernetes/quickstart__.
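Inside `ansible-vault edit` (see the commands below), the two settings above appear as keys; the exact quoting is an assumption based on the variable names used in this guide:
```
ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key
# relevant keys inside the file:
#   k8s_cni: "calico"                      # or "flannel"
#   k8s_pod_network_cidr: "10.244.0.0/16"  # pick a range not already in use
```
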
-**NOTE**: If you want to view or edit the `omnia_config.yml` file, run the following command:
+>> **NOTE**: If you want to view or edit the `omnia_config.yml` file, run the following command:
- `ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key` -- To view the file.
- `ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key` -- To edit the file.
-**NOTE**: It is suggested that you use the ansible-vault view or edit commands and that you do not use the ansible-vault decrypt or encrypt commands. If you have used the ansible-vault decrypt or encrypt commands, provide 644 permission to `omnia_config.yml`.
+>> **NOTE**: It is suggested that you use the ansible-vault view or edit commands and that you do not use the ansible-vault decrypt or encrypt commands. If you have used the ansible-vault decrypt or encrypt commands, provide 644 permission to `omnia_config.yml`.
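Restoring that permission is a single command:
```
chmod 644 omnia_config.yml
```
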
Omnia considers `slurm` as the default username for MariaDB.
@@ -160,7 +166,6 @@ The following __kubernetes__ roles are provided by Omnia when __omnia.yml__ file
- **k8s_start_services** role
- Kubernetes services such as the Kubernetes Dashboard, Prometheus, MetalLB, and the NFS client provisioner are deployed
-__Note:__
* Whenever k8s_version, k8s_cni or k8s_pod_network_cidr needs to be modified after the HPC cluster is setup, the OS in the manager and compute nodes in the cluster must be re-flashed before executing omnia.yml again.
* After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communication. Calico/flannel provides a full networking stack for Kubernetes pods.
@@ -209,11 +214,12 @@ Commands to install JupyterHub and Kubeflow:
* `ansible-playbook platforms/jupyterhub.yml -i inventory`
* `ansible-playbook platforms/kubeflow.yml -i inventory`
-__Note:__ When the Internet connectivity is unstable or slow, it may take more time to pull the images to create the Kubeflow containers. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, you must redeploy Kubernetes cluster and reinstall Kubeflow by completing the following steps:
+>> __Note:__ When Internet connectivity is unstable or slow, pulling the images for the Kubeflow containers may take longer. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps; a command sketch follows the list:
* Format the OS on manager and compute nodes.
* In the `omnia_config.yml` file, change the k8s_cni variable value from calico to flannel.
* Run the Kubernetes and Kubeflow playbooks.
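A sketch of that sequence, assuming the OS on the manager and compute nodes has already been re-flashed:
```
ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key   # set k8s_cni to flannel
ansible-playbook omnia.yml -i inventory                                      # redeploy the cluster
ansible-playbook platforms/kubeflow.yml -i inventory                         # reinstall Kubeflow
```
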
+
## Add a new compute node to the cluster
Update the INVENTORY file in the `omnia` directory with the new node's IP address under the compute group. Ensure that the nodes already in the cluster remain listed in the compute group along with the new node. Then run `omnia.yml` to add the new node to the cluster and update the configuration of the manager node.
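A minimal sketch, where 172.17.0.104 stands in for the new node's IP address:
```
# INVENTORY: keep the existing compute entries and append the new node
# [compute]
# 172.17.0.101
# 172.17.0.102
# 172.17.0.104    <-- new node
ansible-playbook omnia.yml -i inventory
```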