
Merge pull request #836 from cgoveas/devel

Updating security/visualization information
Sujit Jadhav 3 years ago
parent
commit
6c90b00f51

+ 19 - 8
docs/INSTALL_OMNIA.md

@@ -75,15 +75,26 @@ __Note:__ After the Omnia repository is cloned, a folder named __omnia__ is crea
 
 2. Change the directory to __omnia__: `cd omnia`
 
-3. In the `omnia_config.yml` file, provide the following details.  
-	a. The **k8s_version** variable specifies the Kubernetes version which will be installed on the manager and compute nodes. By default, it is set to **1.16.7**. Edit this variable to change the version. Supported versions are 1.16.7 and 1.19.3.  
-	b. The variable `login_node_required` is set to "true" by default to configure the login node. To configure the login node, edit the following variables:
-	* domain_name: Domain name you intend to configure.
-	* realm_name: A realm name is often, but not always, the upper case version of the name of the DNS domain over which it presides.
-	* directory_manager_password: Password of the Directory Manager with full access to the directory for system management tasks.
-	* ipa_admin_password: "admin" user password for the IPA server.  
+3. In the `omnia_config.yml` file, provide the following details:  
+
+| Parameter Name             | Default Value | Additional Information                                                                                                                                                                                                                               |
+|----------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| mariadb_password           | password      | Password used to access the Slurm database. <br> Required Length: 8   characters <br> The password must not contain -,\, ',"                                                                                                                         |
+| k8s_version                | 1.16.7        | Kubernetes version <br> Accepted Values: "1.16.7" or "1.19.3"                                                                                                                                                                                        |
+| k8s_cni                    | calico        | CNI type used by Kubernetes. <br> Accepted values: calico, flannel                                                                                                                                                                                   |
+| k8s_pod_network_cidr       | 10.244.0.0/16 | Kubernetes pod network CIDR                                                                                                                                                                                                                          |
+| docker_username            |               | Username to login to Docker. A kubernetes secret will be created and   patched to the service account in default namespace. <br> This value is   optional but suggested to avoid docker pull limit issues                                            |
+| docker_password            |               | Password to login to Docker <br> This value is mandatory if a   docker_username is provided                                                                                                                                                          |
+| ansible_config_file_path   | /etc/ansible  | Path where the ansible.cfg file can be found. <br> If `dnf` is   used, the default value is valid. If `pip` is used, the variable must be set   manually                                                                                             |
+| login_node_required        | TRUE          | Boolean indicating whether the login node is required or not                                                                                                                                                                                         |
+| domain_name                | omnia.test    | Sets the intended domain name                                                                                                                                                                                                                        |
+| realm_name                 | OMNIA.TEST    | Sets the intended realm name                                                                                                                                                                                                                         |
+| directory_manager_password |               | Password authenticating admin level access to the Directory for system   management tasks. It will be added to the instance of directory server   created for IPA. <br> Required Length: 8 characters. <br> The   password must not contain -,\, '," |
+| ipa_admin_password         |               | IPA server admin password                                                                                                                                                                                                                            |
+| enable_secure_login_node   |  **false**, true             | Boolean value deciding whether security features are enabled on the Login Node. For more information, see [here](docs/Security/Enable_Security_LoginNode.md).                                                                                                                                                                                                                           |
 	
-	If you do not want to configure the login node, then you can set the `login_node_required` variable to "false". Without the login node, Slurm jobs can be scheduled only through the manager node.
+	
+>> __NOTE:__  Without the login node, Slurm jobs can be scheduled only through the manager node.
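
The constraints listed in the table above can be checked before a playbook is run. The following is a minimal sketch, not part of Omnia: the function names are hypothetical, the configuration is assumed to be already loaded into a dictionary, "Required Length: 8 characters" is read as a minimum length, and the directory manager password is assumed to matter only when `login_node_required` is true.

```python
# Hypothetical pre-flight check for values taken from omnia_config.yml.
# Assumption: "Required Length: 8 characters" means "at least 8".
FORBIDDEN_PASSWORD_CHARS = set("-\\'\"")
ACCEPTED_K8S_VERSIONS = {"1.16.7", "1.19.3"}
ACCEPTED_K8S_CNI = {"calico", "flannel"}


def check_password(name, value, errors):
    """Append an error for each password rule that the value breaks."""
    if len(value) < 8:
        errors.append(f"{name}: must be at least 8 characters")
    if FORBIDDEN_PASSWORD_CHARS & set(value):
        errors.append(f"{name}: must not contain - \\ ' or \"")


def validate_omnia_config(cfg):
    """Return a list of problems found in a dict of omnia_config.yml values."""
    errors = []
    check_password("mariadb_password", cfg.get("mariadb_password", ""), errors)
    if str(cfg.get("k8s_version")) not in ACCEPTED_K8S_VERSIONS:
        errors.append("k8s_version: expected 1.16.7 or 1.19.3")
    if cfg.get("k8s_cni") not in ACCEPTED_K8S_CNI:
        errors.append("k8s_cni: expected calico or flannel")
    # docker_password is mandatory only when docker_username is provided.
    if cfg.get("docker_username") and not cfg.get("docker_password"):
        errors.append("docker_password: required when docker_username is set")
    # Assumption: the directory manager password is only needed with a login node.
    if cfg.get("login_node_required"):
        check_password(
            "directory_manager_password",
            cfg.get("directory_manager_password", ""),
            errors,
        )
    return errors


if __name__ == "__main__":
    sample = {
        "mariadb_password": "slurmdb123",
        "k8s_version": "1.19.3",
        "k8s_cni": "calico",
        "login_node_required": False,
    }
    print(validate_omnia_config(sample))
```

The function returns a list of messages rather than raising on the first problem, so all issues in a file can be reported in one pass.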
 
 4. Create an inventory file in the *omnia* folder. Add login node IP address under the *[login_node]* group, manager node IP address under the *[manager]* group, compute node IP addresses under the *[compute]* group, and NFS node IP address under the *[nfs_node]* group. A template file named INVENTORY is provided in the *omnia\docs* folder.  
 	**NOTE**: Ensure that all the four groups (login_node, manager, compute, nfs_node) are present in the template, even if the IP addresses are not updated under login_node and nfs_node groups. 
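
The requirement that all four groups appear in the inventory, even when some are left empty, can be verified with a few lines of code. This is an illustrative sketch only: the function name is hypothetical, the IP addresses are placeholders, and the inventory is assumed to use plain INI-style `[group]` headers.

```python
import re

# Groups the note above says must be present, even if left empty.
REQUIRED_GROUPS = {"login_node", "manager", "compute", "nfs_node"}


def missing_groups(inventory_text):
    """Return the required group names that have no [group] header."""
    headers = re.findall(
        r"^[ \t]*\[([^\]]+)\][ \t]*$", inventory_text, flags=re.MULTILINE
    )
    return REQUIRED_GROUPS - set(headers)


# Placeholder inventory: login_node and nfs_node are present but empty.
SAMPLE_INVENTORY = """\
[manager]
10.5.0.101

[compute]
10.5.0.102
10.5.0.103

[login_node]

[nfs_node]
"""

if __name__ == "__main__":
    print(missing_groups(SAMPLE_INVENTORY))
```

An empty result means every required header was found; the check says nothing about whether the listed hosts are reachable.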

File diffs are limited because there are too many
+ 8 - 11
docs/INSTALL_OMNIA_CONTROL_PLANE.md


+ 12 - 7
docs/README.md

@@ -4,12 +4,14 @@
 1.2
 
 #### Previous release version
-1.1.1
+1.1.2
 
 ## Blogs about Omnia
 - [Introduction to Omnia](https://infohub.delltechnologies.com/p/omnia-open-source-deployment-of-high-performance-clusters-to-run-simulation-ai-and-data-analytics-workloads/)
 - [Taming the Accelerator Cambrian Explosion with Omnia](https://infohub.delltechnologies.com/p/taming-the-accelerator-cambrian-explosion-with-omnia/)
 - [Containerized HPC Workloads Made Easy with Omnia and Singularity](https://infohub.delltechnologies.com/p/containerized-hpc-workloads-made-easy-with-omnia-and-singularity/)
+- [Solution Overview: Dell EMC Omnia Software](https://infohub.delltechnologies.com/section-assets/omnia-solution-overview)
+- [Solution Brief: Omnia Software](https://infohub.delltechnologies.com/section-assets/omnia-solution-brief)
 
 ## What Omnia does
 Omnia can build clusters that use Slurm or Kubernetes (or both!) for workload management. Omnia will install software from a variety of sources, including:
@@ -49,7 +51,7 @@ The following table lists the software and operating system requirements on the
 
 Requirements  |   Version
 ----------------------------------  |   -------
-OS pre-installed on the management station  |  CentOS 8.4/ Rocky 8.5/ Leap 15.3
+OS pre-installed on the management station  |  Rocky 8.5/ Leap 15.3
 OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | Rocky 8.5 Minimal Edition/ Leap 15.3
 Cobbler  |  3.2.2
 Ansible AWX  |  19.4.0
@@ -81,7 +83,6 @@ Software	|	License	|	Compatible Version	|	Description
 LeapOS 15.3	|	-	|	15.3|	Operating system on entire cluster
 CentOS Linux release 7.9.2009 (Core)	|	-	|	7.9	|	Operating system on entire cluster except for management station
 Rocky 8.5	|	-	|	8.5	|	Operating system on entire cluster except for management station
-CentOS Linux release 8.4.2105	|	-	|	8.4	|	Operating system on the management station	
 Rocky 8.5	|	-	|	8.5	|	Operating system on the management station
 MariaDB	|	GPL 2.0	|	5.5.68	|	Relational database used by Slurm
 Slurm	|	GNU General Public	|	20.11.7	|	HPC Workload Manager
@@ -113,11 +114,15 @@ Buildah	|	Apache-2.0	|	1.21.4	|	Tool to build and run container
 PostgreSQL	|	Copyright (c) 1996-2020, PostgreSQL Global Development Group	|	10.15	|	Database Management System
 Redis	|	BSD-3-Clause License	|	6.0.10	|	In-memory database
 NGINX	|	BSD-2-Clause License	|	1.14	|	-
-dellemc.openmanage	|	GNU-General Public License v3.0	|	3.5.0	|	It is a systems management and monitoring application that provides a comprehensive view of the Dell EMC servers, chassis, storage, and network switches on the enterprise network
 dellemc.os10	|	GNU-General Public License v3.1	|	1.1.1	|	It provides networking hardware abstraction through a common set of APIs
-Genisoimage-dnf	|	GPL v3	|	1.1.11	|	Genisoimage is a pre-mastering program for creating ISO-9660 CD-ROM  filesystem images
-OMSDK	|	Apache-2.0	|	1.2.456	|	Dell EMC OpenManage Python SDK (OMSDK) is a python library that helps developers and customers to automate the lifecycle management of PowerEdge Servers
-
+OMSDK	|	Apache-2.0	|	1.2.488	|	Dell EMC OpenManage Python SDK (OMSDK) is a python library that helps developers and customers to automate the lifecycle management of PowerEdge Servers
+| Loki                                  | Apache License 2.0               | 2.4.1  | Loki is a log aggregation system   designed to store and query logs from all your applications and   infrastructure                            |
+| Promtail                              | Apache License 2.1               | 2.4.1  | Promtail is an agent which ships the contents of local logs to   a private Grafana Loki instance or Grafana Cloud.                             |
+| kube-prometheus-stack                 | Apache License 2.0               | 25.0.0 | Kube Prometheus Stack is a collection of Kubernetes manifests,   Grafana dashboards, and Prometheus rules.                                     |
+| mailx                                 | MIT License                      | 12.5   | mailx is a Unix utility program for sending and receiving   mail.                                                                              |
+| postfix                               | IBM Public License               | 3.5.8  | Mail Transfer Agent (MTA) designed to determine routes and   send emails                                                                       |
+| xorriso                               | GPL version 3                    | 1.4.8  | xorriso copies file objects from POSIX compliant filesystems   into Rock Ridge enhanced ISO 9660 filesystems.                                  |
+| Dell EMC   OpenManage Ansible Modules | GNU- General Public License v3.0 | 5.0.0  | OpenManage Ansible Modules simplifies and automates   provisioning, deployment, and updates of PowerEdge servers and modular   infrastructure. |
 
 # Known issues  
 * **Issue**: Hosts are not displayed on the AWX UI.  

+ 15 - 0
docs/Security/Enable_Security_LoginNode.md

@@ -0,0 +1,15 @@
+# Enabling Security on the Login Node (RockyOS)
+
+* Ensure that `enable_secure_login_node` is set to **true** in `omnia_config.yml`
+* Set the following parameters in `omnia_security_config.yml`
+
+|  Parameter Name        |  Default Value  |  Additional Information                                                                                                                                          |
+|------------------------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| max_failures           | 3               | Failures allowed before lockout. <br> This value cannot currently   be changed.                                                                                  |
+| failure_reset_interval | 60              | Period (in seconds) after which the number of failed login attempts is   reset <br> Accepted Values: 30-60                                                       |
+| lockout_duration       | 10              | Period (in seconds) for which users are locked out. <br> Accepted   Values: 5-10                                                                                 |
+| session_timeout        | 180             | Period (in seconds) after which idle users get logged out automatically   <br> Accepted Values: 30-90                                                            |
+| alert_email_address    |                 | Email address used for sending alerts in case of authentication failure   <br> If this variable is left blank, authentication failure alerts will   be disabled. |
+| allow_deny             | Allow           | This variable sets whether the user list is Allowed or Denied. <br>   Accepted Values: Allow, Deny                                                               |
+| user                   |                 | Array of users that are allowed or denied based on the `allow_deny`   value. Multiple users must be separated by a space.                                        |
+
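
The accepted ranges in the table above lend themselves to a simple range check. The sketch below is hypothetical (names are illustrative and the values are assumed to be already loaded from `omnia_security_config.yml` into a dictionary). Note that the table lists a default of 180 for `session_timeout` next to an accepted range of 30-90; the sketch applies the range as written, so the listed default is reported as out of range.

```python
# Accepted ranges (in seconds) as documented in the table above.
ACCEPTED_RANGES = {
    "failure_reset_interval": (30, 60),
    "lockout_duration": (5, 10),
    "session_timeout": (30, 90),
}
ACCEPTED_ALLOW_DENY = {"Allow", "Deny"}


def validate_security_config(cfg):
    """Return a list of values that fall outside the documented limits."""
    errors = []
    for name, (low, high) in ACCEPTED_RANGES.items():
        value = cfg.get(name)
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name}: expected an integer in {low}-{high}, got {value!r}")
    if cfg.get("allow_deny") not in ACCEPTED_ALLOW_DENY:
        errors.append("allow_deny: expected Allow or Deny")
    return errors


if __name__ == "__main__":
    documented_defaults = {
        "failure_reset_interval": 60,
        "lockout_duration": 10,
        "session_timeout": 180,
        "allow_deny": "Allow",
    }
    for problem in validate_security_config(documented_defaults):
        print(problem)
```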

+ 26 - 14
docs/Security/Enable_Security_ManagementStation.md

@@ -1,25 +1,37 @@
-# Enabling Security on the Management Station and Login Node
+# Enabling Security on the Management Station
 
-Omnia uses FreeIPA to enable security features like authorisation and access control.
+Omnia uses FreeIPA on RockyOS to enable security features like authorisation and access control.
 
 ## Enabling Authentication on the Management Station:
 
 Set the parameter 'enable_security_support' to true in `base_vars.yml`
 
-## Prerequisites Before Enabling FreeIPA:
-* Enter the relevant values in `security_vars.yml`:
-
-| Parameter Name | Default Value | Additional Information                                                                                           |
-|----------------|---------------|------------------------------------------------------------------------------------------------------------------|
-| domain_name    | omnia.test    | The domain name should not contain an underscore ( _ )                                                           |
-| realm_name     | omnia.test    | The realm name should follow the following rules per https://www.freeipa.org/page/Deployment_Recommendations <br> * The realm name must not conflict with any other existing Kerberos realm name (e.g. name used by Active Directory). <br> * The realm name should be upper-case (EXAMPLE.COM) version of primary DNS domain name (example.com).  |
+## Prerequisites Before Enabling Security:
 
 * Enter the relevant values in `login_vars.yml`:
 
 | Parameter Name             | Default Value | Additional Information                                                                           |
 |----------------------------|---------------|--------------------------------------------------------------------------------------------------|
-| directory_manager_password |               | Password of the Directory Manager with full access to the directory for system management tasks. |
-| ipa_admin_password         |               | "admin" user password for the IPA server                                                         |
+| ms_directory_manager_password |               | Password of the Directory Manager with full access to the directory for system management tasks. |
+| ms_kerberos_admin_password         |               | "admin" user password for the IPA server on RockyOS. If LeapOS is in use, it is used as the "kerberos admin" user password for 389-ds <br> This field is not relevant to Management Stations running `LeapOS`                                                         |
+
+
+
+* Enter the relevant values in `security_vars.yml`:
+
+If `RockyOS` is in use on the Management Station:
+
+|  Parameter Name        |  Default Value  |  Additional Information                                                                                                                                                                                                                                                                                                                                      |
+|------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|  domain_name           |  omnia.test     |  The domain name should not contain   an underscore ( _ )                                                                                                                                                                                                                                                                                                    |
+|  realm_name            |  OMNIA.TEST     |  The realm name should follow the   following rules per https://www.freeipa.org/page/Deployment_Recommendations   <br> * The realm name must not conflict with any other existing   Kerberos realm name (e.g. name used by Active Directory). <br> * The   realm name should be upper-case (EXAMPLE.COM) version of primary DNS domain   name (example.com). |
+| max_failures           | 3               | Failures allowed before lockout. <br> This value cannot currently   be changed.                                                                                                                                                                                                                                                                              |
+| failure_reset_interval | 60              | Period (in seconds) after which the number of failed login attempts is   reset <br> Accepted Values: 30-60                                                                                                                                                                                                                                                   |
+| lockout_duration       | 10              | Period (in seconds) for which users are locked out. <br> Accepted   Values: 5-10                                                                                                                                                                                                                                                                             |
+| session_timeout        | 180             | Period (in seconds) after which idle users get logged out automatically   <br> Accepted Values: 30-90                                                                                                                                                                                                                                                        |
+| alert_email_address    |                 | Email address used for sending alerts in case of authentication failure. Currently, only one email address is supported in this field.   <br> If this variable is left blank, authentication failure alerts will   be disabled.                                                                                                                                                                                             |
+| allow_deny             | Allow           | This variable sets whether the user list is Allowed or Denied. <br>   Accepted Values: Allow, Deny                                                                                                                                                                                                                                                           |
+| user                   |                 | Array of users that are allowed or denied based on the `allow_deny`   value. Multiple users must be separated by a space.                                                                                                                                                                                                                                    |
 
 
 ## Log Aggregation via Grafana
@@ -34,12 +46,12 @@ Set the parameter 'enable_security_support' to true in `base_vars.yml`
 
 Loki uses basic regex based syntax to filter for specific jobs, dates or timestamps.
 
-* Select the Explore ![Explore Icon](Telemetry_Visualization/Images/ExploreIcon.PNG) tab to select control-plane-loki from the drop down.
+* Select the Explore ![Explore Icon](../Telemetry_Visualization/Images/ExploreIcon.PNG) tab to select control-plane-loki from the drop down.
 * Using [LogQL queries](https://grafana.com/docs/loki/latest/logql/log_queries/), all logs in `/var/log` can be accessed using filters (e.g. `{job="Omnia"}`).
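
As a rough illustration of how such a filter is put together, the sketch below builds a LogQL stream selector and a request URL for Loki's `query_range` HTTP endpoint. The helper names and the base URL are assumptions made for this example; how the Loki service is exposed in an Omnia deployment is not covered here. LogQL requires straight double quotes around label values.

```python
from urllib.parse import urlencode


def stream_selector(**labels):
    """Build a LogQL stream selector such as {job="Omnia"}.

    Label values are inserted as-is; values containing quotes are not escaped.
    """
    pairs = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def query_range_url(base_url, logql, limit=100):
    """Return a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    loki_url = "http://localhost:3100"  # hypothetical address of the Loki service
    # Select the Omnia job and keep only lines containing "error".
    query = stream_selector(job="Omnia") + ' |= "error"'
    print(query)
    print(query_range_url(loki_url, query))
```

The same query string can be pasted directly into the Explore tab; the URL form is only needed when querying Loki from a script.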
 
 ## Viewing Logs on the Dashboard
 
-All log files can be viewed via the Dashboard tab (![Dashboard Icon](Telemetry_Visualization/Images/DashBoardIcon.PNG)). The Default Dashboard displays `omnia.log` and `syslog`. Custom dashboards can be created per user requirements.
+All log files can be viewed via the Dashboard tab (![Dashboard Icon](../Telemetry_Visualization/Images/DashBoardIcon.PNG)). The Default Dashboard displays `omnia.log` and `syslog`. Custom dashboards can be created per user requirements.
 
 Below is a list of all logs available to Loki and can be accessed on the dashboard:
 
@@ -49,7 +61,7 @@ Below is a list of all logs available to Loki and can be accessed on the dashboa
 | syslogs            | /var/log/messages                         | System Logging               | This log is configured by Default                                                                  |
 | Audit Logs         | /var/log/audit/audit.log                  | All Login Attempts           | This log is configured by Default                                                                  |
 | CRON logs          | /var/log/cron                             | CRON Job Logging             | This log is configured by Default                                                                  |
-| Pods logs          | /var/log/pods/*/*/*log                    | k8s pods                     | This log is configured by Default                                                                  |
+| Pods logs          | /var/log/pods/ * / * / * log                    | k8s pods                     | This log is configured by Default                                                                  |
 | Access Logs        | /var/log/dirsrv/slapd-<Realm Name>/access | Directory Server Utilization | This log is available when FreeIPA is set up ( ie when   enable_security_support is set to 'true') |
 | Error Log          | /var/log/dirsrv/slapd-<Realm Name>/errors | Directory Server Errors      | This log is available when FreeIPA is set up ( ie when   enable_security_support is set to 'true') |
 | CA Transaction Log | /var/log/pki/pki-tomcat/ca/transactions   | FreeIPA PKI Transactions     | This log is available when FreeIPA is set up ( ie when   enable_security_support is set to 'true') |

docs/login_node/login_user_creation.md → docs/Security/login_user_creation.md


+ 24 - 8
docs/Telemetry_Visualization/Visualization.md

@@ -2,7 +2,7 @@
 
 Using Grafana, users can poll multiple devices and create graphs/visualizations of key system metrics such as temperature, System power consumption, Memory Usage, IO Usage, CPU Usage, Total Memory Power, System Output Power, Total Fan Power, Total Storage Power, System Input Power, Total CPU Power, RPM Readings, Total Heat Dissipation, Power to Cool ratio, System Air Flow Efficiency etc.
 
-A lot of these metrics are collected using iDRAC telemetry. iDRAC telemetry allows you to stream telemetry data from your servers to a centralized log/metrics servers. For more information on iDRAC telemetry, click [here](https://github.com/dell/iDRAC-Telemetry-Scripting).
+A lot of these metrics are collected using iDRAC telemetry. iDRAC telemetry allows you to stream telemetry data from your servers to a centralized log/metrics server. For more information on iDRAC telemetry, click [here](https://github.com/dell/iDRAC-Telemetry-Reference-Tools).
 
 ## Prerequisites
 
@@ -11,21 +11,21 @@ A lot of these metrics are collected using iDRAC telemetry. iDRAC telemetry allo
 
 | Parameter Name        | Default Value | Information |
 |-----------------------|---------------|-------------|
-| timescaledb_user      | postgres      |  Username used for connecting to timescale db. Minimum Legth: 2 characters.          |
-| timescaledb_password  | postgres      |  Password used for connecting to timescale db. Minimum Legth: 2 characters.           |
-| mysqldb_user          | mysql         |  Username used for connecting to mysql db. Minimum Legth: 2 characters.         |
-| mysqldb_password      | mysql         |  Password used for connecting to mysql db. Minimum Legth: 2 characters.            |
-| mysqldb_root_password | mysql         |  Password used for connecting to mysql db for root user. Minimum Legth: 2 characters.         |
+| timescaledb_user      |               |  Username used for connecting to timescale db. Minimum Length: 2 characters.          |
+| timescaledb_password  |               |  Password used for connecting to timescale db. Minimum Length: 2 characters.           |
+| mysqldb_user          |               |  Username used for connecting to mysql db. Minimum Length: 2 characters.         |
+| mysqldb_password      |               |  Password used for connecting to mysql db. Minimum Length: 2 characters.            |
+| mysqldb_root_password |               |  Password used for connecting to mysql db for root user. Minimum Length: 2 characters.         |
 
 3. All parameters in `telemetry/input_params/base_vars.yml` need to be filled in:
 
 | Parameter Name          | Default Value     | Information |
 |-------------------------|-------------------|-------------|
-| mount_location          | /mnt/omnia        | Sets the location all telemetry related files will be stored and both timescale and mysql databases will be mounted.            |
+| mount_location          | idrac_telemetrysource_services_db | Sets the location all telemetry related files will be stored and both timescale and mysql databases will be mounted.            |
 | idrac_telemetry_support | true              | This variable is used to enable iDRAC telemetry support and visualizations. Accepted Values: true/false            |
 | slurm_telemetry_support | true              | This variable is used to enable slurm telemetry support and visualizations. Slurm Telemetry support can only be activated when idrac_telemetry_support is set to true. Accepted Values: True/False.        |
 | timescaledb_name        | telemetry_metrics | Postgres DB with timescale extension is used for storing iDRAC and slurm telemetry metrics.            |
-| myscaledb_name          | mysql             | MySQL DB is used to store IPs and credentials of iDRACs having datacenter license           |
+| mysqldb_name			  | idrac_telemetrysource_services_db             | MySQL DB is used to store IPs and credentials of iDRACs having datacenter license           |
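
The two telemetry tables imply a pair of checks: each database credential needs at least 2 characters, and `slurm_telemetry_support` can only be enabled together with `idrac_telemetry_support`. A minimal sketch follows, assuming both files have already been loaded into dictionaries; the function name is hypothetical.

```python
# Credential parameters that the first table says need at least 2 characters.
CREDENTIAL_FIELDS = (
    "timescaledb_user",
    "timescaledb_password",
    "mysqldb_user",
    "mysqldb_password",
    "mysqldb_root_password",
)


def validate_telemetry(credentials, base_vars):
    """Return a list of problems in the telemetry credentials and base variables."""
    errors = []
    for name in CREDENTIAL_FIELDS:
        if len(str(credentials.get(name) or "")) < 2:
            errors.append(f"{name}: minimum length is 2 characters")
    # Slurm telemetry depends on iDRAC telemetry being enabled.
    if base_vars.get("slurm_telemetry_support") and not base_vars.get(
        "idrac_telemetry_support"
    ):
        errors.append(
            "slurm_telemetry_support: requires idrac_telemetry_support to be true"
        )
    return errors


if __name__ == "__main__":
    creds = {name: "example" for name in CREDENTIAL_FIELDS}  # placeholder values
    flags = {"idrac_telemetry_support": False, "slurm_telemetry_support": True}
    for problem in validate_telemetry(creds, flags):
        print(problem)
```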
 
 3. Find the IP of the Grafana UI using:
  
@@ -48,6 +48,22 @@ Use any one of the following browsers to access the Grafana UI (https://< Grafan
 * The slurm manager and compute nodes are fetched at run time from node_inventory.
 * Slurm should be installed on the nodes, if not there is no point in executing slurm telemetry.
 
+## Initiating Telemetry
+
+1. Once `control_plane.yml` and `telemetry.yml` are executed, run the following commands from `omnia/telemetry`:
+
+`ansible-playbook telemetry.yml`
+
+>> __Note:__ Telemetry Collection is only initiated on iDRACs on AWX that have a datacenter license and are running a firmware version of 4 or higher.
+
+## Adding a New Node to Telemetry
+After initiation, new nodes can be added to telemetry by running the following commands from `omnia/telemetry`:
+		
+` ansible-playbook add_idrac_node.yml `
+		
+
+
+