
Issue #695: deploy_job_templates changes

Signed-off-by: Lakshmi-Patneedi <Lakshmi_Patneedi@Dellteam.com>
Lakshmi-Patneedi 3 years ago
Parent
Current commit
3d641a3780

+ 64 - 24
control_plane/roles/deploy_job_templates/tasks/group_inventory.yml

@@ -12,35 +12,75 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 ---
+- name: Initializing variables
+  set_fact:
+    compute_list: []
+    non_compute_list: []
+    component_roles: []
+
 - name: Get the hosts in node_inventory
-  command: >-
-    awx --conf.host {{ awx_host }} --conf.username {{ awx_admin_user }} --conf.password {{ awx_password }}
-    --conf.insecure hosts list --inventory {{ node_inventory }} -f human --filter "name"
+  command: awx --conf.host {{ awx_host }} --conf.username {{ awx_admin_user }} --conf.password {{ awx_password }} --conf.insecure hosts list --inventory {{ node_inventory }} -f human --filter "name"
   changed_when: false
   no_log: true
   register: hosts_list
 
-- name: Add the host to compute group in node_inventory if it exists
-  awx.awx.tower_group:
-    name: "{{ item.split(',')[3] }}"
-    inventory: "{{ node_inventory }}"
-    preserve_existing_hosts: true
-    hosts:
-      - "{{ item.split(',')[2] }}"
-    tower_config_file: "{{ tower_config_file }}"
+- name: Converting CSV values to a list
+  read_csv:
+    path: "{{ host_mapping_file_path }}"
+    delimiter: ','
+  register: mapping
+
+- name: Collecting compute node IPs from host mapping file
+  set_fact:
+      compute_list: "{{ compute_list + [ item.IP ] }}"
   when:
-    - item.split(',')[2] != "IP"
-    - item.split(',')[2] in hosts_list.stdout 
-    - item.split(',')[3] == "compute"
+    - item.Component_role == compute_node
+    - item.IP in hosts_list.stdout
+  no_log: true
+  with_items:
+      - "{{ mapping.list }}"
 
-- name: Add the host to other groups in node_inventory if it exists
-  awx.awx.tower_group:
-    name: "{{ item.split(',')[3] }}"
-    inventory: "{{ node_inventory }}"
-    hosts:
-      - "{{ item.split(',')[2] }}"
-    tower_config_file: "{{ tower_config_file }}"
+- name: Collecting manager, nfs_node, and login_node IPs from host mapping file
+  set_fact:
+      non_compute_list: "{{ non_compute_list + [ item.IP ] }}"
+      component_roles: "{{ component_roles + [item.Component_role] }}"
   when:
-    - item.split(',')[2] != "IP"
-    - item.split(',')[2] in hosts_list.stdout
-    - item.split(',')[3] != "compute"
+    - item.Component_role != compute_node
+  no_log: true
+  with_items:
+      - "{{ mapping.list }}"
+
+- name: Adding IPs to the compute group in AWX UI
+  block:
+    - name: Add the host to compute group in node_inventory if it exists
+      awx.awx.tower_group:
+        name: "{{ compute_node }}"
+        inventory: "{{ node_inventory }}"
+        hosts: "{{ compute_list }}"
+        tower_config_file: "{{ tower_config_file }}"
+      register: compute_output
+      no_log: true
+  rescue:
+    - name: Failed to add IPs to the compute group in AWX UI
+      fail:
+        msg: "{{ compute_output }}"
+
+- name: Adding IPs to manager, nfs_node, and login_node groups in AWX UI
+  block:
+    - name: Add the host to other groups in node_inventory if it exists
+      awx.awx.tower_group:
+        name: "{{ item.0 }}"
+        inventory: "{{ node_inventory }}"
+        hosts:
+          - "{{ item.1 }}"
+        tower_config_file: "{{ tower_config_file }}"
+      when: item.1 in hosts_list.stdout
+      with_together:
+          - "{{ component_roles }}"
+          - "{{ non_compute_list }}"
+      register: non_compute_output
+      no_log: true
+  rescue:
+    - name: Failed to add IPs to manager, nfs_node, or login_node groups
+      fail:
+        msg: "{{ non_compute_output }}"
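
For reference, the reworked group_inventory.yml no longer parses raw CSV lines with `split()`; it loads the mapping file through `read_csv`, so every row in `mapping.list` is a dict keyed by the CSV column headers (`IP`, `Component_role`, ...). A minimal, self-contained sketch of that pattern, assuming a hypothetical mapping file at `/tmp/host_mapping.csv` (the MAC/Hostname columns in the comment are illustrative only):

```yaml
# Sketch of the read_csv-based grouping flow, assuming a file such as:
#   MAC,Hostname,IP,Component_role
#   aa:bb:cc:dd:ee:01,node001,172.17.0.10,compute
#   aa:bb:cc:dd:ee:02,node002,172.17.0.11,manager
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    host_mapping_file_path: /tmp/host_mapping.csv   # hypothetical path
    compute_node: "compute"
  tasks:
    - name: Read the host mapping file
      read_csv:
        path: "{{ host_mapping_file_path }}"
        delimiter: ','
      register: mapping

    - name: Collect compute node IPs
      set_fact:
        compute_list: "{{ compute_list | default([]) + [ item.IP ] }}"
      when: item.Component_role == compute_node
      with_items: "{{ mapping.list }}"

    - name: Show the collected compute IPs
      debug:
        var: compute_list
```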

+ 0 - 1
control_plane/roles/deploy_job_templates/tasks/main.yml

@@ -186,7 +186,6 @@
 
 - name: Group the hosts in node_inventory when mapping file is present
   include_tasks: "{{ role_path }}/tasks/group_inventory.yml"
-  with_items: "{{ mapping_file.stdout_lines }}"
   when: host_mapping_file and component_role_support
 
 - name: Launch deploy_omnia job template
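
The matching change in main.yml drops the per-row loop: because group_inventory.yml now reads the whole mapping file itself via `read_csv` and iterates over `mapping.list` internally, including it once is sufficient. A minimal sketch of the resulting call site, with the removed loop noted in a comment for contrast:

```yaml
- name: Group the hosts in node_inventory when mapping file is present
  include_tasks: "{{ role_path }}/tasks/group_inventory.yml"
  # Previously looped with "with_items: {{ mapping_file.stdout_lines }}";
  # row-by-row handling now lives inside group_inventory.yml.
  when: host_mapping_file and component_role_support
```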

+ 1 - 0
control_plane/roles/deploy_job_templates/vars/main.yml

@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 ---
+compute_node: "compute"
 base_vars_file: "{{ role_path }}/../../input_params/base_vars.yml"
 awx_namespace: awx
 awx_admin_user: admin

+ 3 - 0
docs/FAQ.md

@@ -189,4 +189,7 @@ Potential Cause: Your Docker pull limit has been exceeded. For more information,
 ## Can Cobbler deploy both Rocky and CentOS at the same time?
 No. During Cobbler based deployment, only one OS is supported at a time. If the user would like to deploy both, please deploy one first, **unmount `/mnt/iso`** and then re-run cobbler for the second OS.
 
+## Why do Firmware Updates fail for some components with Omnia 1.1.1?
+Due to the latest `catalog.xml` file, firmware updates fail for some components on server models R640 and R740. Omnia execution is not interrupted, but an error is logged. For now, please download those individual updates manually.
+
 

File diff is too large to display
+ 13 - 7
docs/INSTALL_OMNIA_CONTROL_PLANE.md


+ 14 - 8
docs/README.md

@@ -1,10 +1,10 @@
 **Omnia** (Latin: all or everything) is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI, and data analytics workloads. It uses Slurm, Kubernetes, and other packages to manage jobs and run diverse workloads on the same converged solution. It is a collection of [Ansible](https://ansible.com) playbooks, is open source, and is constantly being extended to enable comprehensive workloads.
 
 #### Current release version
-1.1.1
+1.2
 
 #### Previous release version
-1.1  
+1.1.1
 
 ## Blogs about Omnia
 - [Introduction to Omnia](https://infohub.delltechnologies.com/p/omnia-open-source-deployment-of-high-performance-clusters-to-run-simulation-ai-and-data-analytics-workloads/)
@@ -27,6 +27,8 @@ Omnia can install Kubernetes or Slurm (or both), along with additional drivers,
 ![Omnia Slurm Stack](images/omnia-slurm.png)  
 
 ## What's new in this release
+* Extended support of Leap OS on the management station, login, compute, and NFS nodes.
+* Omnia now supports PowerVault configurations with 2 network interfaces.
 * Provisioning of Rocky custom ISO on supported PowerEdge servers using iDRAC.
 * Configuring Dell EMC networking switches, Mellanox InfiniBand switches, and PowerVault storage devices in the cluster. 
 * An option to configure a login node with the same configurations as the compute nodes in the cluster. With appropriate user privileges provided by the cluster administrator, users can log in to the login node and schedule Slurm jobs. The authentication mechanism in the login node uses the FreeIPA solution.
@@ -46,8 +48,8 @@ The following table lists the software and operating system requirements on the
 
 Requirements  |   Version
 ----------------------------------  |   -------
-OS pre-installed on the management station  |  CentOS 8.4/ Rocky 8.4
-OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | CentOS 7.9 2009 Minimal Edition/ Rocky 8.4 Minimal Edition
+OS pre-installed on the management station  |  CentOS 8.4/ Rocky 8.5/ Leap 15.3
+OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | Rocky 8.5 Minimal Edition/ Leap 15.3
 Cobbler  |  3.2.2
 Ansible AWX  |  19.1.0
 Slurm Workload Manager  |  20.11.2
@@ -55,6 +57,9 @@ Kubernetes on the management station  |  1.21.0
 Kubernetes on the manager and compute nodes	|	1.16.7 or 1.19.3
 Kubeflow  |  1
 Prometheus  |  2.23.0
+Ansible  |  2.9.21
+Python  |  3.6.15
+CRI-O  |  1.17.3
 
 ## Hardware managed by Omnia
 The following table lists the supported devices managed by Omnia. Other devices than those listed in the following table will be discovered by Omnia, but features offered by Omnia will not be applicable.
@@ -72,10 +77,11 @@ The following table lists the software and its compatible version managed by Omn
 
 Software	|	License	|	Compatible Version	|	Description
 -----------	|	-------	|	----------------	|	-----------------
+LeapOS 15.3	|	-	|	15.3	|	Operating system on entire cluster
 CentOS Linux release 7.9.2009 (Core)	|	-	|	7.9	|	Operating system on entire cluster except for management station
-Rocky 8.4	|	-	|	8.4	|	Operating system on entire cluster except for management station
+Rocky 8.5	|	-	|	8.5	|	Operating system on entire cluster except for management station
 CentOS Linux release 8.4.2105	|	-	|	8.4	|	Operating system on the management station	
-Rocky 8.4	|	-	|	8.4	|	Operating system on the management station
+Rocky 8.5	|	-	|	8.5	|	Operating system on the management station
 MariaDB	|	GPL 2.0	|	5.5.68	|	Relational database used by Slurm
 Slurm	|	GNU General Public	|	20.11.7	|	HPC Workload Manager
 Docker CE	|	Apache-2.0	|	20.10.2	|	Docker Service
@@ -195,8 +201,6 @@ If hosts are listed, then an IP address has been assigned to them by DHCP. Howev
 * **Issue**: Hosts are not automatically deleted from awx UI when redeploying the cluster.  
 	**Resolution**: Before re-deploying the cluster, ensure that the user manually deletes all hosts from the awx UI.
 	
-* **Issue**: Decomissioned compute nodes do not get deleted automatically from the awx UI.
-	**Resolution**: Once a node is decommisioned, ensure that the user manually deletes decomissioned hosts from the awx UI.
 
 # [Frequently asked questions](FAQ.md)
 
@@ -209,6 +213,8 @@ If hosts are listed, then an IP address has been assigned to them by DHCP. Howev
 * To change the Kubernetes version from 1.16 to 1.19 or 1.19 to 1.16, you must redeploy the entire cluster.  
 * The Kubernetes pods will not be able to access the Internet or start when firewalld is enabled on the node. This is a limitation in Kubernetes. So, the firewalld daemon will be disabled on all the nodes as part of omnia.yml execution.
 * Only one storage instance (Powervault) is currently supported in the HPC cluster.
+* Cobbler web support has been discontinued from Omnia 1.2 onwards.
+
 
 # Contributing to Omnia
 The Omnia project was started to give members of the [Dell Technologies HPC Community](https://dellhpc.org) a way to easily set up clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community.