Omnia (Latin: all or everything) is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI, and data analytics workloads. It uses Slurm, Kubernetes, and other packages to manage jobs and run diverse workloads on the same converged solution. It is a collection of Ansible playbooks, is open source, and is constantly being extended to enable comprehensive workloads.
1.1.1
1.1
Omnia can build clusters that use Slurm or Kubernetes (or both!) for workload management. Omnia will install software from a variety of sources, including:
Whenever possible, Omnia will leverage existing projects rather than reinvent the wheel.
Omnia can install Kubernetes or Slurm (or both), along with additional drivers, services, libraries, and user applications.
The Omnia Control Plane will automate the entire cluster deployment process, starting with provisioning the operating system on the supported devices and updating the firmware versions of PowerEdge Servers. For detailed instructions, see Install the Omnia Control Plane.
Omnia can be deployed on clusters that already have an RPM-based Linux OS running on them and are all connected to the Internet. Currently, all Omnia testing is done on CentOS. Please see Example system designs for instructions on the network setup.
Once servers have functioning OS and networking, you can use Omnia to install and start Slurm and/or Kubernetes. For detailed instructions, see Install Omnia using CLI.
The following table lists the software and operating system requirements on the management station, manager, and compute nodes. Versions other than those listed are not supported, because they may affect the proper functioning of Omnia.
Requirements | Version |
---|---|
OS pre-installed on the management station | CentOS 8.4/ Rocky 8.4 |
OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | CentOS 7.9 2009 Minimal Edition/ Rocky 8.4 Minimal Edition |
Cobbler | 3.2.2 |
Ansible AWX | 19.1.0 |
Slurm Workload Manager | 20.11.2 |
Kubernetes on the management station | 1.21.0 |
Kubernetes on the manager and compute nodes | 1.16.7 or 1.19.3 |
Kubeflow | 1 |
Prometheus | 2.23.0 |
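
Because only the listed versions are supported, it can help to compare what is installed against the table before running any playbooks. The following is a minimal sketch of such a check, not part of Omnia: the version strings are copied from the table above, while the dictionary keys and function names are hypothetical labels chosen for illustration.

```python
# Versions copied from the requirements table above. The dictionary keys are
# hypothetical labels chosen for this sketch, not names used by Omnia.
SUPPORTED_VERSIONS = {
    "management_station_os": {"CentOS 8.4", "Rocky 8.4"},
    "cobbler": {"3.2.2"},
    "awx": {"19.1.0"},
    "slurm": {"20.11.2"},
    "kubernetes_management_station": {"1.21.0"},
    "kubernetes_cluster_nodes": {"1.16.7", "1.19.3"},
    "prometheus": {"2.23.0"},
}


def is_supported(component: str, version: str) -> bool:
    """Return True if the version is listed for the component in the table."""
    return version.strip() in SUPPORTED_VERSIONS.get(component, set())


def unsupported(installed: dict) -> dict:
    """Return the installed components whose version is not listed."""
    return {
        name: version
        for name, version in installed.items()
        if not is_supported(name, version)
    }
```

Calling `unsupported()` with a dictionary of component names and installed versions returns only the entries that need attention. How the installed versions are collected is left open here.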
The following table lists the supported devices managed by Omnia. Omnia will discover devices other than those listed, but the features offered by Omnia will not apply to them.
Device type | Supported models |
---|---|
Dell EMC PowerEdge Servers | PowerEdge C4140, C6420, C6520, R240, R340, R440, R540, R640, R650, R740, R740xd, R740xd2, R750, R750xa, R840, R940, R940xa |
Dell EMC PowerVault Storage | PowerVault ME4084, ME4024, and ME4012 Storage Arrays |
Dell EMC Networking Switches | PowerSwitch S3048-ON and PowerSwitch S5232F-ON |
Mellanox InfiniBand Switches | NVIDIA MQM8700-HS2F Quantum HDR InfiniBand Switch 40 QSFP56 |
The following table lists the software managed by Omnia and its compatible versions. Versions other than those listed are not supported, because they may affect the proper functioning of Omnia.
Software | License | Compatible Version | Description |
---|---|---|---|
CentOS Linux release 7.9.2009 (Core) | - | 7.9 | Operating system on entire cluster except for management station |
Rocky 8.4 | - | 8.4 | Operating system on entire cluster except for management station |
CentOS Linux release 8.4.2105 | - | 8.4 | Operating system on the management station |
Rocky 8.4 | - | 8.4 | Operating system on the management station |
MariaDB | GPL 2.0 | 5.5.68 | Relational database used by Slurm |
Slurm | GNU General Public | 20.11.7 | HPC Workload Manager |
Docker CE | Apache-2.0 | 20.10.2 | Docker Service |
FreeIPA | GNU General Public License v3 | 4.6.8 | Authentication system used in the login node |
OpenSM | GNU General Public License 2 | 3.3.24 | - |
NVIDIA container runtime | Apache-2.0 | 3.4.2 | Nvidia container runtime library |
Python PIP | MIT License | 21.1.2 | Python Package |
Python3 | - | 3.6.8 | - |
Kubelet | Apache-2.0 | 1.16.7,1.19,1.21 | Provides external, versioned ComponentConfig API types for configuring the kubelet |
Kubeadm | Apache-2.0 | 1.16.7,1.19,1.21 | "fast paths" for creating Kubernetes clusters |
Kubectl | Apache-2.0 | 1.16.7,1.19,1.21 | Command line tool for Kubernetes |
JupyterHub | Modified BSD License | 1.1.0 | Multi-user hub |
kubernetes Controllers | Apache-2.0 | 1.16.7,1.19,1.21 | Orchestration tool |
Kfctl | Apache-2.0 | 1.0.2 | CLI for deploying and managing Kubeflow |
Kubeflow | Apache-2.0 | 1 | Cloud Native platform for machine learning |
Helm | Apache-2.0 | 3.5.0 | Kubernetes Package Manager |
Helm Chart | - | 0.9.0 | - |
TensorFlow | Apache-2.0 | 2.1.0 | Machine Learning framework |
Horovod | Apache-2.0 | 0.21.1 | Distributed deep learning training framework for TensorFlow |
MPI | Copyright (c) 2018-2019 Triad National Security,LLC. All rights reserved. | 0.3.0 | HPC library |
CoreDNS | Apache-2.0 | 1.6.2 | DNS server that chains plugins |
CNI | Apache-2.0 | 0.3.1 | Networking for Linux containers |
AWX | Apache-2.0 | 19.1.0 | Web-based User Interface |
AWX.AWX | Apache-2.0 | 19.1.0 | Galaxy collection to perform awx configuration |
AWXkit | Apache-2.0 | to be updated | To perform configuration through CLI commands |
Cri-o | Apache-2.0 | 1.21 | Container Service |
Buildah | Apache-2.0 | 1.21.4 | Tool to build and run container |
PostgreSQL | Copyright (c) 1996-2020, PostgreSQL Global Development Group | 10.15 | Database Management System |
Redis | BSD-3-Clause License | 6.0.10 | In-memory database |
NGINX | BSD-2-Clause License | 1.14 | - |
dellemc.openmanage | GNU-General Public License v3.0 | 3.5.0 | It is a systems management and monitoring application that provides a comprehensive view of the Dell EMC servers, chassis, storage, and network switches on the enterprise network |
dellemc.os10 | GNU-General Public License v3.1 | 1.1.1 | It provides networking hardware abstraction through a common set of APIs |
Genisoimage-dnf | GPL v3 | 1.1.11 | Genisoimage is a pre-mastering program for creating ISO-9660 CD-ROM filesystem images |
OMSDK | Apache-2.0 | 1.2.456 | Dell EMC OpenManage Python SDK (OMSDK) is a python library that helps developers and customers to automate the lifecycle management of PowerEdge Servers |
The following table provides details about the interface keys supported by the S3048-ON ToR Switch. Dell EMC Networking OS10 Enterprise Edition is the supported operating system.
Interface key name | Type | Description |
---|---|---|
desc | string | Configures a single line interface description |
portmode | string | Configures port mode according to the device type |
switchport | boolean: true, false* | Configures an interface in L2 mode |
admin | string: up, down* | Configures the administrative state for the interface; configuring the value as administratively "up" enables the interface; configuring the value as administratively "down" disables the interface |
mtu | integer | Configures the MTU size for L2 and L3 interfaces (1280 to 65535) |
speed | string: auto, 1000, 10000, 25000, ... | Configures the speed of the interface |
fanout | string: dual, single; string:10g-4x, 40g-1x, 25g-4x, 100g-1x, 50g-2x (os10) | Configures fanout to the appropriate value |
suppress_ra | string: present, absent | Configures IPv6 router advertisements if set to present |
ip_type_dynamic | boolean: true, false | Configures IP address DHCP if set to true (ip_and_mask is ignored if set to true) |
ipv6_type_dynamic | boolean: true, false | Configures an IPv6 address for DHCP if set to true (ipv6_and_mask is ignored if set to true) |
ipv6_autoconfig | boolean: true, false | Configures stateless configuration of IPv6 addresses if set to true (ipv6_and_mask is ignored if set to true) |
vrf | string | Configures the specified VRF to be associated to the interface |
min_ra | string | Configures RA minimum interval time period |
max_ra | string | Configures RA maximum interval time period |
ip_and_mask | string | Configures the specified IP address to the interface |
ipv6_and_mask | string | Configures a specified IPv6 address to the interface |
virtual_gateway_ip | string | Configures an anycast gateway IP address for a VXLAN virtual network as well as VLAN interfaces |
virtual_gateway_ipv6 | string | Configures an anycast gateway IPv6 address for VLAN interfaces |
state_ipv6 | string: absent, present* | Deletes the IPV6 address if set to absent |
ip_helper | list | Configures DHCP server address objects (see ip_helper.*) |
ip_helper.ip | string (required) | Configures the IPv4 address of the DHCP server (A.B.C.D format) |
ip_helper.state | string: absent, present* | Deletes the IP helper address if set to absent |
flowcontrol | dictionary | Configures the flowcontrol attribute (see flowcontrol.*) |
flowcontrol.mode | string: receive, transmit | Configures the flowcontrol mode |
flowcontrol.enable | string: on, off | Configures the flowcontrol mode on |
flowcontrol.state | string: absent, present | Deletes the flowcontrol if set to absent |
ipv6_bgp_unnum | dictionary | Configures the IPv6 BGP unnum attributes (see ipv6_bgp_unnum.*) below |
ipv6_bgp_unnum.state | string: absent, present* | Disables auto discovery of BGP unnumbered peer if set to absent |
ipv6_bgp_unnum.peergroup_type | string: ebgp, ibgp | Specifies the type of template to inherit from |
stp_rpvst_default_behaviour | boolean: false, true | Configures RPVST default behavior of BPDUs when set to true, which is the default |
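
The value constraints in the table above can be checked before an interface definition is handed to a playbook. The sketch below is illustrative only and is not how Omnia or the dellemc.os10 collection validates input: it covers a subset of the keys, and `validate_interface` and the constant names are hypothetical.

```python
# Allowed values taken from the interface key table above. Only a subset of
# keys is covered; the function and constant names are hypothetical.
ALLOWED_VALUES = {
    "admin": ("up", "down"),
    "suppress_ra": ("present", "absent"),
    "state_ipv6": ("absent", "present"),
}
BOOLEAN_KEYS = ("switchport", "ip_type_dynamic", "ipv6_type_dynamic", "ipv6_autoconfig")
MTU_MIN, MTU_MAX = 1280, 65535


def validate_interface(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key}: expected one of {allowed}, got {config[key]!r}")
    for key in BOOLEAN_KEYS:
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key}: expected a boolean, got {config[key]!r}")
    if "mtu" in config:
        mtu = config["mtu"]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(mtu, int) and not isinstance(mtu, bool)
        if not is_int or not MTU_MIN <= mtu <= MTU_MAX:
            problems.append(f"mtu: expected an integer from {MTU_MIN} to {MTU_MAX}, got {mtu!r}")
    # Per the table, ip_and_mask is ignored when ip_type_dynamic is true.
    if config.get("ip_type_dynamic") is True and "ip_and_mask" in config:
        problems.append("ip_and_mask: ignored because ip_type_dynamic is true")
    return problems


# Hypothetical interface settings, for illustration only.
example = {"desc": "uplink to spine", "admin": "up", "switchport": False, "mtu": 9216}
print(validate_interface(example))  # prints []
```

Returning a list of messages rather than raising on the first error lets all problems in one interface definition be reported together.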
Issue: Hosts are not displayed on the AWX UI.
Resolution:
Issue: There are ImagePullBackOff or ErrImagePull errors in the status of Kubernetes pods.
Cause: The errors occur when the Docker pull limit is exceeded.
Resolution:
kubeadm reset -f
in all the nodes before re-executing the playbook with the docker credentials.
Issue: The kubectl command stops working after a reboot and displays the following error message: The connection to the server head_node_ip:port was refused - did you specify the right host or port?
Resolution:
On the management station or the manager node, run the following commands:
swapoff -a
systemctl restart kubelet
Issue: If control_plane.yml fails at the webui_awx role, then the previous IP address and password are not cleared when control_plane.yml is re-run.
Resolution: In the webui_awx/files directory, delete the .tower_cli.cfg and .tower_vault_key files, and then re-run control_plane.yml.
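The cleanup step above could be scripted along the following lines. This is a minimal sketch, assuming you pass in the path to your own webui_awx/files directory. The two file names come from the resolution, while `clear_awx_state` is a hypothetical helper and not part of Omnia.

```python
from pathlib import Path

# File names taken from the resolution above.
STALE_FILES = (".tower_cli.cfg", ".tower_vault_key")


def clear_awx_state(files_dir) -> list:
    """Delete the stale files from files_dir and return the names removed."""
    removed = []
    for name in STALE_FILES:
        path = Path(files_dir) / name
        # Skip files that are already absent so the helper can be re-run safely.
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

The helper only removes the two files. Re-running control_plane.yml afterwards remains a separate manual step.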
Issue: The FreeIPA server and client installation fails.
Cause: The hostnames of the manager and login nodes are not set in the correct format.
Resolution: If you have enabled the option to install the login node in the cluster, set the hostnames of the nodes in the format: hostname.domainname. For example, manager.omnia.test is a valid hostname for the login node. Note: To find the cause for the failure of the FreeIPA server and client installation, see ipaserver-install.log in the manager node or /var/log/ipaclient-install.log in the login node.
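A quick way to catch a badly formed hostname before installation is to check it against the hostname.domainname pattern. The sketch below assumes conventional DNS label rules (letters, digits, and hyphens, with no leading or trailing hyphen). FreeIPA may apply stricter rules than this, and `is_fqdn_style` is a hypothetical helper name.

```python
import re

# One DNS label: letters, digits, and hyphens, not starting or ending with a
# hyphen, at most 63 characters. This rule is an assumption of the sketch.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_fqdn_style(hostname: str) -> bool:
    """Return True if hostname looks like hostname.domainname."""
    labels = hostname.split(".")
    # Require a host part plus at least one domain label.
    return len(labels) >= 2 and all(_LABEL.fullmatch(label) for label in labels)
```

With this check, `manager.omnia.test` passes, while a bare `manager` fails because it has no domain part.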
Issue: The inventory details are not updated in AWX when device or host credentials are invalid.
Resolution: Provide valid credentials of the devices and hosts in the cluster.
Issue: The Host list is empty after executing the control_plane playbook.
Resolution: Ensure that all devices used are in DHCP enabled mode.
Issue: The task 'Install Packages' fails on the NFS node with the message: Failure in talking to yum: Cannot find a valid baseurl for repo: base/7/x86_64.
Cause: There are connections missing on the NFS node.
Resolution: Ensure that three NICs are in use on the NFS node:
Issue: Hosts are not automatically deleted from the AWX UI when redeploying the cluster.
Resolution: Before redeploying the cluster, manually delete all hosts from the AWX UI.
Issue: Decommissioned compute nodes are not deleted automatically from the AWX UI.
Resolution: Once a node is decommissioned, manually delete the decommissioned host from the AWX UI.
kubeadm reset -f
on all the nodes of the cluster. You then need to run the omnia.yml file and skip the installation of Slurm using the skip tags.
catalog.xml file, firmware updates of a few components might fail for server models: R640 and R740. Note that Omnia doesn't halt or get interrupted despite these failures. (Fix Expected by 17th December 2021)
The Omnia project was started to give members of the Dell Technologies HPC Community a way to easily set up clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community.
While we started Omnia within the Dell Technologies HPC Community, that doesn't mean that it's limited to Dell EMC servers, networking, and storage. This is an open project, and we want to encourage everyone to use and contribute to Omnia!
It's not just new features and bug fixes that can be contributed to the Omnia project! Anyone should feel comfortable contributing. We are asking for all types of contributions:
If you would like to contribute, see CONTRIBUTING.