Install Omnia

The following sections provide details on installing Omnia using the CLI. If you want to install the Omnia appliance and manage workloads using the Omnia appliance, see the INSTALL_OMNIA_APPLIANCE and MONITOR_CLUSTERS files for more information.

Prerequisites to install Omnia using the CLI

Ensure that all the prerequisites listed in the PREINSTALL_OMNIA file are met before installing Omnia.

Steps to install Omnia using the CLI

Note: You must have root privileges to perform the installation and configuration.
Note: If any of the following Ansible playbook commands fail with errors, re-run the command.

  1. On the manager node, change the working directory to the directory where you want to clone the Omnia Git repository.
  2. Clone the Omnia repository.

    $ git clone https://github.com/dellhpc/omnia.git 
    

    Note: After the Omnia repository is cloned, a folder named omnia is created. It is recommended that you do not rename this folder.

  3. Change the directory to omnia by executing the following command: cd omnia

  4. An inventory file must be created in the omnia folder. Add the compute node IPs under the [compute] group and the manager node IP under the [manager] group. See the template INVENTORY file under the omnia/docs folder.
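For example, a minimal inventory file might look like the following (the IP addresses are placeholders; substitute your own node addresses):

```ini
; Placeholder manager node IP
[manager]
10.0.0.1

; Placeholder compute node IPs
[compute]
10.0.0.2
10.0.0.3
```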

  5. To install Omnia, run the following command:

    ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" 
    
  6. By default, no tags are skipped and both Kubernetes and Slurm are deployed.
    To skip the installation of Kubernetes, enter:
    ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" --skip-tags "kubernetes"
    Similarly, to skip Slurm, enter:
    ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" --skip-tags "slurm"
    Note: To skip the NFS client setup, enter the following command to skip the k8s_nfs_client_setup role of Kubernetes:
    ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" --skip-tags "nfs_client"

  7. To set the password for the MariaDB database (used for Slurm accounting) and to select the Kubernetes CNI, edit the omnia_config.yml file.
    Note: Supported Kubernetes CNIs: calico and flannel. The default CNI is calico.
    To view the passwords set in omnia_config.yml at a later time, run the following command:
    ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
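As an illustration, the editable fields in omnia_config.yml might look like the snippet below. The variable names here are assumptions for illustration only; check the file shipped with your Omnia release for the exact names.

```yaml
# Illustrative omnia_config.yml fragment (variable names are assumed,
# not taken from a specific Omnia release)
mariadb_password: "changeMe1"   # password for the Slurm accounting database
k8s_cni: "calico"               # supported values: calico, flannel
```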

Omnia uses the following default usernames:

  • slurm for MariaDB

The following Kubernetes roles are applied by Omnia when the omnia.yml file is executed:

  • common role:
    • Installs common packages on the manager and compute nodes
    • Installs Docker
    • Deploys ntp/chrony for time synchronization
    • Installs Nvidia drivers and software components
  • k8s_common role:
    • Installs the required Kubernetes packages
    • Starts the docker and kubernetes services
  • k8s_manager role:
    • Installs the helm package for Kubernetes
  • k8s_firewalld role: This role opens the ports required by Kubernetes.
    • Head node ports: 6443, 2379-2380, 10251, 10252
    • Compute node ports: 10250, 30000-32767
    • Calico UDP ports: 4789
    • Calico TCP ports: 5473, 179
    • Flannel UDP ports: 8285, 8472
  • k8s_nfs_server_setup role:
    • Creates an NFS share directory, /home/k8snfs, which the compute nodes use to share common files
  • k8s_nfs_client_setup role
  • k8s_start_manager role:
    • Runs the /bin/kubeadm init command to initialize the Kubernetes services on the manager node
    • Creates the service account for the Kubernetes Dashboard
  • k8s_start_workers role:
    • Initializes the compute nodes and joins them to the Kubernetes cluster through the manager node
  • k8s_start_services role:
    • Deploys Kubernetes services such as the Kubernetes Dashboard, Prometheus, MetalLB, and the NFS client provisioner

Note: After Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. These ports are required for Kubernetes pod-to-pod and pod-to-service communication. Calico/flannel provides the full networking stack for Kubernetes pods.
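Once the Kubernetes roles have completed, a quick way to confirm the cluster is healthy is to run a few standard kubectl commands on the manager node. This is a sketch, not part of the Omnia playbooks; the exact output depends on your node names and versions.

```shell
# Verify that the manager and compute nodes joined the cluster
# and report a Ready status
kubectl get nodes -o wide

# Verify that the deployed services (Dashboard, Prometheus, MetalLB,
# NFS client provisioner) and the core system pods are running
kubectl get pods --all-namespaces
```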

The following Slurm roles are applied by Omnia when the omnia.yml file is executed:

  • slurm_common role:
    • Installs the common packages on the manager and compute nodes
  • slurm_manager role:
    • Installs the packages required only on the manager node
    • Opens the ports required by Slurm:
      tcp_ports: 6817, 6818, 6819
      udp_ports: 6817, 6818, 6819
    • Creates and updates the Slurm configuration files based on the manager node requirements
  • slurm_workers role:
    • Installs the Slurm packages on all compute nodes as per the compute node requirements
  • slurm_start_services role:
    • Starts the Slurm services so that the compute nodes can communicate with the manager node
  • slurm_exporter role:
    • Slurm exporter exports metrics collected from the Slurm resource scheduling system to Prometheus
    • Slurm exporter is installed on the host in the same way as Slurm, and installs successfully only if Slurm is already installed
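After the Slurm roles complete, a few standard Slurm commands run on the manager node can confirm the scheduler is working. This is a sketch, not part of the Omnia playbooks, and assumes the Slurm services were started by the slurm_start_services role.

```shell
# Check that the compute nodes registered with the manager node
sinfo

# Run a trivial single-node job to confirm scheduling works
srun -N1 hostname

# Confirm Slurm accounting (backed by MariaDB) is recording jobs
sacct
```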

Note: If you want to install both JupyterHub and Kubeflow, run the JupyterHub playbook first and then the Kubeflow playbook.

Commands to install JupyterHub and Kubeflow:

  • ansible-playbook platforms/jupyterhub.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"
  • ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"

Adding a new compute node to the cluster

Update the INVENTORY file in the omnia directory with the new node's IP address in the compute group. Then, run omnia.yml to add the new node to the cluster and update the configuration of the manager node.
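For example, assuming the new node's IP address is 10.0.0.4 (a placeholder), the update amounts to the following. The re-run command is the same one used for the initial installation.

```shell
# 1. Add the new node's IP to the [compute] group of the inventory file,
#    so that it reads, for example:
#      [compute]
#      10.0.0.2
#      10.0.0.3
#      10.0.0.4
#
# 2. Re-run the Omnia playbook so the new node joins the cluster and the
#    manager node's configuration is updated:
ansible-playbook omnia.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"
```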