Omnia (Latin: all or everything) is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI, and data analytics workloads. It uses Slurm, Kubernetes, and other packages to manage jobs and run diverse workloads on the same converged solution. It is a collection of Ansible playbooks, is open source, and is constantly being extended to enable comprehensive workloads.
Omnia can build clusters that use Slurm or Kubernetes (or both!) for workload management. Omnia installs software from a variety of upstream sources. Whenever possible, Omnia leverages existing projects rather than reinventing the wheel.
Omnia can install Kubernetes or Slurm (or both), along with additional drivers, services, libraries, and user applications.
Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on CentOS. Please see PREINSTALL for instructions on network setup.
Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. Please see INSTALL for detailed instructions.
Ensure that all the prerequisites listed in PREINSTALL_OMNIA_APPLIANCE are met before installing the Omnia appliance.
For detailed instructions on installing the Omnia appliance, see INSTALL_OMNIA_APPLIANCE.
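For orientation, here is a minimal sketch of an appliance deployment, assuming the repository URL shown at the end of this README, the `omnia/appliance` path used in the FAQ below, and a management node that already meets the prerequisites:

```sh
# Clone Omnia onto the management node and run the appliance playbook.
# Paths and the playbook name follow the FAQ entries later in this README;
# adjust them if your checkout or inventory differs.
git clone https://github.com/dellhpc/omnia.git
cd omnia/appliance
ansible-playbook appliance.yml
```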
| Software and hardware requirements | Version |
| --- | --- |
| OS installed on the management node | CentOS 7.9 2009 |
| OS deployed by Omnia on bare-metal servers | CentOS 7.9 2009 Minimal Edition |
| Cobbler | 2.8.5 |
| Ansible AWX | 15.0.0 |
| Slurm Workload Manager | 20.11.2 |
| Kubernetes Controllers | 1.16.7 |
| Kubeflow | 1 |
| Prometheus | 2.23.0 |
| Supported PowerEdge servers | R640, R740, R7525, C4140, DSS8440, and C6420 |
Note: For more information about the supported software and compatible versions, see the Software Supported section.
| Software | Licence | Compatible Version | Description |
| --- | --- | --- | --- |
| MariaDB | GPL 2.0 | 5.5.68 | Relational database used by Slurm |
| Slurm | GNU General Public | 20.11.2 | HPC workload manager |
| Docker CE | Apache-2.0 | 20.10.2 | Docker service |
| NVIDIA container runtime | Apache-2.0 | 3.4.2 | NVIDIA container runtime library |
| Python PIP | MIT Licence | 3.2.1 | Python package |
| Python2 | - | 2.7.5 | - |
| Kubelet | Apache-2.0 | 1.16.7 | Provides external, versioned ComponentConfig API types for configuring the kubelet |
| Kubeadm | Apache-2.0 | 1.16.7 | "Fast paths" for creating Kubernetes clusters |
| Kubectl | Apache-2.0 | 1.16.7 | Command-line tool for Kubernetes |
| JupyterHub | Modified BSD Licence | 0.9.6 | Multi-user hub |
| Kfctl | Apache-2.0 | 1.0.2 | CLI for deploying and managing Kubeflow |
| Kubeflow | Apache-2.0 | 1 | Cloud-native platform for machine learning |
| Helm | Apache-2.0 | 3.5.0 | Kubernetes package manager |
| Helm Chart | - | 0.9.0 | - |
| TensorFlow | Apache-2.0 | 2.1.0 | Machine learning framework |
| Horovod | Apache-2.0 | 0.21.1 | Distributed deep learning training framework for TensorFlow |
| MPI | Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved. | 0.2.3 | HPC library |
| CoreDNS | Apache-2.0 | 1.6.2 | DNS server that chains plugins |
| CNI | Apache-2.0 | 0.3.1 | Networking for Linux containers |
| AWX | Apache-2.0 | 15.0.0 | Web-based user interface |
| PostgreSQL | Copyright (c) 1996-2020, PostgreSQL Global Development Group | 10.15 | Database management system |
| Redis | BSD-3-Clause Licence | 6.0.10 | In-memory database |
| NGINX | BSD-2-Clause Licence | 1.14 | - |
`provisioned_hosts.yml` is available under `omnia/appliance/roles/inventory/files`. A report playbook named `provisioned_report.yml` is provided under `omnia/appliance/tools`. To run `provisioned_report.yml`, run the following command from the `omnia/appliance` directory: `ansible-playbook -i roles/inventory/files/provisioned_hosts.yml tools/provisioned_report.yml`.
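As a quick sanity check before or after generating the report, the same inventory file can be exercised directly with Ansible. This is only a sketch and assumes the provisioned hosts are reachable over SSH from the management node:

```sh
# Run from the omnia/appliance directory.
# List the hosts recorded in the generated inventory, then confirm they respond.
ansible-inventory -i roles/inventory/files/provisioned_hosts.yml --list
ansible -i roles/inventory/files/provisioned_hosts.yml all -m ping
```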
What to do if `appliance.yml` fails?
Resolution:
Wait for the AWX UI to be accessible at http://\<management-station-IP>:8081, and then run `appliance.yml` again.
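For example, one way to wait for the AWX UI before rerunning the playbook (a sketch, assuming `curl` is available on the management station) is:

```sh
# Replace <management-station-IP> with the management station's IP address.
# Poll the AWX UI until it responds, then rerun the appliance playbook
# from the omnia/appliance directory.
until curl -sSf -o /dev/null "http://<management-station-IP>:8081"; do
  sleep 30
done
ansible-playbook appliance.yml
```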
What are the steps to be followed after the nodes in a Kubernetes cluster are rebooted?
Resolution:
Wait for 10 to 15 minutes after the Kubernetes cluster is rebooted. Then, verify the status of the cluster using the following commands (a compact sequence is sketched after the list):
* `kubectl get nodes` on the manager node provides the correct Kubernetes cluster status.
* `kubectl get pods --all-namespaces` on the manager node displays all the pods in the Running state.
* `kubectl cluster-info` on the manager node displays both the Kubernetes master and kubeDNS in the Running state.
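A compact way to run these checks from the manager node is sketched below; the `grep` filter is only a convenience for spotting pods that are not yet Running and is not part of Omnia:

```sh
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running   # header plus any pod not in the Running state
kubectl cluster-info
```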
What to do when the Kubernetes services are not in the Running state?
Resolution:
1. Run `kubectl get pods --all-namespaces` to check which pods are not in the Running state.
2. Delete the affected pods using `kubectl delete pods <name of pod>` (see the sketch below).
3. Rerun the corresponding playbook that was used to install Kubernetes: `omnia.yml`, `jupyterhub.yml`, or `kubeflow.yml`.
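As an illustration of steps 2 and 3 (pod name and namespace are placeholders, and the playbook invocation is a sketch; adjust the path and inventory for your setup):

```sh
# Delete a pod that is stuck outside the Running state; its controller will recreate it.
kubectl delete pods <name of pod> -n <namespace>
# If pods remain unhealthy, rerun the playbook that installed the affected stack, e.g.:
ansible-playbook omnia.yml
```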
What to do when the JupyterHub or Prometheus UI is not accessible?
Resolution:
Run `kubectl get pods --namespace default` and verify that all the required pods are in the Running state.
Why does `appliance.yml` fail during the Cobbler configuration with an error during the Run import command?
Cause:
The mounted .iso file might be corrupted or not mounted properly.
Resolution:
1. Go to `/var/log/cobbler/cobbler.log` to view the error.
2. If the error message is **repo verification failed**, the .iso file is not mounted properly.
3. Verify that the downloaded .iso file is valid and not corrupted.
4. Delete the Cobbler container using `docker rm -f cobbler` and rerun `appliance.yml` (a combined sketch follows this list).
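Put together as a shell sketch (assuming the Cobbler container is named `cobbler`, as in the commands above, and that `appliance.yml` is rerun from the `omnia/appliance` directory):

```sh
# Inspect the Cobbler log for the import failure (inside the container, or read the
# same path on the host if the log is written there).
docker exec cobbler tail -n 50 /var/log/cobbler/cobbler.log
# After re-downloading or re-mounting a valid .iso, remove the container and rerun the playbook.
docker rm -f cobbler
ansible-playbook appliance.yml
```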
Resolution:
1. Create a Non-RAID or virtual disk on the server.
2. Check if any system other than the management node is running cobblerd. If yes, stop the Cobbler container on that system using the following commands (as sketched below): `docker rm -f cobbler` and `docker image rm -f cobbler`.
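A quick way to check a suspect system for a stray Cobbler container (a sketch; run it on each system other than the management node):

```sh
# Show any container named "cobbler"; if one is running, remove it and its image.
docker ps --filter name=cobbler
docker rm -f cobbler
docker image rm -f cobbler
```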
After the cluster is rebooted, what to do when the Slurm services do not start automatically?
Resolution: Manually restart the Slurm services on the manager node by running the following commands:
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart prometheus-slurm-exporter
Then, manually restart the slurmd service on all the compute nodes: `systemctl restart slurmd`. The combined sequence is sketched below.
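Put together, the restart sequence looks like this; `sinfo` at the end is a standard Slurm command added here only to confirm that the nodes report in, not an Omnia-specific step:

```sh
# On the manager node
systemctl restart slurmdbd slurmctld prometheus-slurm-exporter
# On every compute node
systemctl restart slurmd
# From the manager node, confirm that the nodes are back
sinfo
```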
What to do when the Slurm services fail because the `slurm.conf` is not configured properly?
Resolution:
1. Run the following commands: `slurmdbd -Dvvv` and `slurmctld -Dvvv`.
2. Refer to the `/var/lib/log/slurmctld.log` file.

How to troubleshoot the error "ports are unavailable" when the Slurm database connection fails?
Resolution:
1. Run the following commands: `slurmdbd -Dvvv` and `slurmctld -Dvvv`.
2. Refer to the `/var/lib/log/slurmctld.log` file.

Note: Omnia installs the software versions listed in the Requirements Matrix and Software Managed by Omnia sections; versions other than those listed are not supported by Omnia. This is to ensure that there is no impact to the functionality of Omnia.

The Omnia project was started to give members of the Dell Technologies HPC Community a way to easily set up clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community.
While we started Omnia within the Dell Technologies HPC Community, that doesn't mean that it's limited to Dell EMC servers, networking, and storage. This is an open project, and we want to encourage everyone to use and contribute to Omnia!
### Anyone Can Contribute!
It's not just new features and bug fixes that can be contributed to the Omnia project! Anyone should feel comfortable contributing. We welcome all types of contributions.
If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/b