
Issue#751: Updating Folder structure for docs

Signed-off-by: cgoveas <cassandra.goveas@dell.com>
cgoveas 3 years ago
parent
commit
f892d3d5ba

+ 9 - 0
.all-contributorsrc

@@ -334,6 +334,15 @@
       "contributions": [
         "code"
       ]
+    },
+    {
+      "login": "Artlands",
+      "name": "Jie Li",
+      "avatar_url": "https://avatars.githubusercontent.com/u/31781106?v=4",
+      "profile": "https://github.com/Artlands",
+      "contributions": [
+        "code"
+      ]
     }
   ],
   "contributorsPerLine": 7,

File diff suppressed because it is too large
+ 11 - 10
README.md


+ 13 - 0
docs/FAQ.md

@@ -192,4 +192,17 @@ No. During Cobbler based deployment, only one OS is supported at a time. If the
 ## Why do Firmware Updates fail for some components with Omnia 1.1.1?
 Due to the latest `catalog.xml` file, Firmware updates fail for some components on server models R640 and R740. Omnia execution doesn't get interrupted but an error gets logged. For now, please download those individual updates manually.
 
+## Why does the Task [network_ib : Authentication failure response] fail with the message 'Status code was -1 and not [302]: Request failed: <urlopen error [Errno 111] Connection refused>' on Infiniband Switches when running `infiniband.yml`?
+To configure a new Infiniband switch, the HTTP and JSON gateways must be enabled. To verify that they are enabled, run:
+
+`show web` (To check if HTTP is enabled)
+
+`show json-gw` (To check if JSON Gateway is enabled)
+
+To correct the issue, run:
+
+`web http enable` (To enable the HTTP gateway)
+
+`json-gw enable` (To enable the JSON gateway)
+
 

File diff suppressed because it is too large
+ 8 - 1
docs/INSTALL_OMNIA_CONTROL_PLANE.md


+ 9 - 9
docs/README.md

@@ -51,7 +51,7 @@ Requirements  |   Version
 OS pre-installed on the management station  |  CentOS 8.4/ Rocky 8.5/ Leap 15.3
 OS deployed by Omnia on bare-metal Dell EMC PowerEdge Servers | Rocky 8.5 Minimal Edition/ Leap 15.3
 Cobbler  |  3.2.2
-Ansible AWX  |  19.1.0
+Ansible AWX  |  19.4.0
 Slurm Workload Manager  |  20.11.2
 Kubernetes on the management station  |  1.21.0
 Kubernetes on the manager and compute nodes	|	1.16.7 or 1.19.3
@@ -59,7 +59,7 @@ Kubeflow  |  1
 Prometheus  |  2.23.0
 Ansible  |  2.9.21
 Python  |  3.6.15
-CRI-O  |  1.17.3
+CRI-O  |  1.21.0
 
 ## Hardware managed by Omnia
 The following table lists the supported devices managed by Omnia. Devices not listed in the following table will still be discovered by Omnia, but Omnia features will not be applicable to them.
@@ -89,12 +89,12 @@ FreeIPA	|	GNU General Public License v3	|	4.6.8	|	Authentication system used in
 OpenSM	|	GNU General Public License 2	|	3.3.24	|	-
 NVIDIA container runtime	|	Apache-2.0	|	3.4.2	|	Nvidia container runtime library
 Python PIP	|	MIT License	|	21.1.2	|	Python Package
-Python3	|	-	|	3.6.8	|	-
-Kubelet	|	Apache-2.0	|	1.16.7,1.19,1.21	|	Provides external, versioned ComponentConfig API types for configuring the kubelet
-Kubeadm	|	Apache-2.0	|	1.16.7,1.19,1.21	|	"fast paths" for creating Kubernetes clusters
-Kubectl	|	Apache-2.0	|	1.16.7,1.19,1.21	|	Command line tool for Kubernetes
+Python3	|	-	|	3.6.8 (3.6.15 if LeapOS is being used)	|	-
+Kubelet	|	Apache-2.0	|	1.16.7, 1.19, 1.21 (LeapOS only supports 1.21)	|	Provides external, versioned ComponentConfig API types for configuring the kubelet
+Kubeadm	|	Apache-2.0	|	1.16.7, 1.19, 1.21 (LeapOS only supports 1.21)	|	"fast paths" for creating Kubernetes clusters
+Kubectl	|	Apache-2.0	|	1.16.7, 1.19, 1.21 (LeapOS only supports 1.21)	|	Command line tool for Kubernetes
 JupyterHub	|	Modified BSD License	|	1.1.0	|	Multi-user hub
-kubernetes Controllers	|	Apache-2.0	|	1.16.7,1.19,1.21	|	Orchestration tool	
+Kubernetes Controllers	|	Apache-2.0	|	1.16.7, 1.19 (1.21 if LeapOS is being used)	|	Orchestration tool
 Kfctl	|	Apache-2.0	|	1.0.2	|	CLI for deploying and managing Kubeflow
 Kubeflow	|	Apache-2.0	|	1	|	Cloud Native platform for machine learning
 Helm	|	Apache-2.0	|	3.5.0	|	Kubernetes Package Manager
@@ -104,8 +104,8 @@ Horovod	|	Apache-2.0	|	0.21.1	|	Distributed deep learning training framework for
 MPI	|	Copyright (c) 2018-2019 Triad National Security,LLC. All rights reserved.	|	0.3.0	|	HPC library
 CoreDNS	|	Apache-2.0	|	1.6.2	|	DNS server that chains plugins
 CNI	|	Apache-2.0	|	0.3.1	|	Networking for Linux containers
-AWX	|	Apache-2.0	|	19.1.0	|	Web-based User Interface
-AWX.AWX	|	Apache-2.0	|	19.1.0	|	Galaxy collection to perform awx configuration
+AWX	|	Apache-2.0	|	19.4.0	|	Web-based User Interface
+AWX.AWX	|	Apache-2.0	|	19.4.0	|	Galaxy collection to perform awx configuration
 AWXkit	|	Apache-2.0	|	to be updated	|	To perform configuration through CLI commands
 Cri-o	|	Apache-2.0	|	1.21	|	Container Service
 Buildah	|	Apache-2.0	|	1.21.4	|	Tool to build and run container

+ 23 - 0
docs/Security/Enable_Security_ManagementStation.md

@@ -0,0 +1,23 @@
+# Enabling Security on the Management Station and Login Node
+
+## Enabling FreeIPA on the Management Station:
+
+Set the parameter `enable_security_support` to true in `base_vars.yml` (see the sketch after the tables below).
+
+## Prerequisites Before Enabling FreeIPA:
+* Enter the relevant values in `security_vars.yml`:
+
+| Parameter Name | Default Value | Additional Information                                                                                           |
+|----------------|---------------|------------------------------------------------------------------------------------------------------------------|
+| domain_name    | omnia.test    | The domain name should not contain an underscore ( _ )                                                           |
+| realm_name     | omnia.test    | The realm name must follow these rules, per https://www.freeipa.org/page/Deployment_Recommendations: <br> * The realm name must not conflict with any other existing Kerberos realm name (e.g., the name used by Active Directory). <br> * The realm name should be the upper-case (EXAMPLE.COM) version of the primary DNS domain name (example.com).  |
+
+* Enter the relevant values in `login_vars.yml`:
+
+| Parameter Name             | Default Value | Additional Information                                                                           |
+|----------------------------|---------------|--------------------------------------------------------------------------------------------------|
+| directory_manager_password |               | Password of the Directory Manager with full access to the directory for system management tasks. |
+| ipa_admin_password         |               | "admin" user password for the IPA server                                                         |
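+
+A minimal sketch of the values described above (the domain and realm values are the documented defaults; the passwords are placeholders that you must set yourself):
+
+```yml
+# base_vars.yml (only the security toggle shown)
+enable_security_support: true
+
+# security_vars.yml
+domain_name: omnia.test
+realm_name: omnia.test      # per the FreeIPA recommendations, typically the upper-case form of the domain, e.g. OMNIA.TEST
+
+# login_vars.yml
+directory_manager_password: ""   # set a Directory Manager password
+ipa_admin_password: ""           # set the IPA "admin" user password
+```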
+
+
+

BIN
docs/TelemetryAndMonitoring/Images/DashBoardIcon.PNG


BIN
docs/TelemetryAndMonitoring/Images/ExploreIcon.PNG


BIN
docs/TelemetryAndMonitoring/Images/Prometheus_Dashboard.jpg


BIN
docs/TelemetryAndMonitoring/Images/Prometheus_DataSource.jpg


+ 51 - 0
docs/TelemetryAndMonitoring/Install_Telemetry.md

@@ -0,0 +1,51 @@
+# Setting Up Telemetry
+
+Using Grafana, users can poll multiple devices and create graphs/visualizations of key statistics.
+
+## Prerequisites
+
+1. To set up Grafana, ensure that `control_plane/input_params/login_vars.yml` is updated with the Grafana username and password.
+2. All parameters in `telemetry/input_params/login_vars.yml` need to be filled in (see the sketch after this list):
+
+| Parameter Name        | Default Value | Information |
+|-----------------------|---------------|-------------|
+| timescaledb_user      | postgres      |  Username used for connecting to the timescale db. Minimum length: 2 characters.          |
+| timescaledb_password  | postgres      |  Password used for connecting to the timescale db. Minimum length: 2 characters.           |
+| mysqldb_user          | mysql         |  Username used for connecting to the mysql db. Minimum length: 2 characters.         |
+| mysqldb_password      | mysql         |  Password used for connecting to the mysql db. Minimum length: 2 characters.            |
+| mysqldb_root_password | mysql         |  Password used for connecting to the mysql db as the root user. Minimum length: 2 characters.         |
+
+3. All parameters in `telemetry/input_params/base_vars.yml` need to be filled in (see the sketch after this list):
+
+| Parameter Name          | Default Value     | Information |
+|-------------------------|-------------------|-------------|
+| mount_location          | /mnt/omnia        | Sets the location where all telemetry-related files are stored and where both the timescale and mysql databases are mounted.            |
+| idrac_telemetry_support | true              | Enables iDRAC telemetry support and visualizations. Accepted values: true/false            |
+| slurm_telemetry_support | true              | Enables Slurm telemetry support and visualizations. Slurm telemetry support can only be activated when idrac_telemetry_support is set to true. Accepted values: true/false.        |
+| timescaledb_name        | telemetry_metrics | Name of the Postgres DB (with the timescale extension) used to store iDRAC and Slurm telemetry metrics.            |
+| mysqldb_name            | mysql             | Name of the MySQL DB used to store the IPs and credentials of iDRACs that have a datacenter license.           |
+
+4. Find the IP of the Grafana UI using:
+ 
+`kubectl get svc -n grafana`
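+
+A minimal sketch of the two input files from steps 2 and 3, using the default values listed in the tables above (replace the database credentials with your own):
+
+```yml
+# telemetry/input_params/login_vars.yml
+timescaledb_user: postgres
+timescaledb_password: postgres
+mysqldb_user: mysql
+mysqldb_password: mysql
+mysqldb_root_password: mysql
+
+# telemetry/input_params/base_vars.yml
+mount_location: /mnt/omnia
+idrac_telemetry_support: true
+slurm_telemetry_support: true
+timescaledb_name: telemetry_metrics
+mysqldb_name: mysql
+```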
+
+## Logging into Grafana
+
+Use any of the following browsers to access the Grafana UI (`https://<Grafana UI IP>:5000`):
+* Chrome/Chromium
+* Firefox
+* Safari
+* Microsoft Edge
+
+>> __Note:__ Always enable JavaScript in your browser. Running Grafana without JavaScript enabled in the browser is not supported.
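+
+A convenience sketch for opening the UI from the management station (hypothetical one-liner; it assumes the grafana service reports the UI IP in the EXTERNAL-IP column of `kubectl get svc`, so verify with the command above first):
+
+```
+GRAFANA_IP=$(kubectl get svc -n grafana --no-headers | awk '{print $4; exit}')
+firefox "https://${GRAFANA_IP}:5000" &
+```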
+
+## Prerequisites to Enabling Slurm Telemetry
+
+* Slurm telemetry cannot be executed without iDRAC support.
+* The Omnia control plane should be executed and node_inventory should be created in AWX.
+* The Slurm manager and compute nodes are fetched at run time from node_inventory.
+* Slurm should be installed on the nodes; otherwise, Slurm telemetry cannot be collected.
+
+
+
+

+ 105 - 0
docs/TelemetryAndMonitoring/MONITOR_CLUSTERS.md

@@ -0,0 +1,105 @@
+# Monitor Kubernetes and Slurm
+Omnia provides playbooks to configure additional software components for Kubernetes such as JupyterHub and Kubeflow. For workload management (submitting, controlling, and managing jobs) of HPC, AI, and Data Analytics clusters, you can access Kubernetes and Slurm dashboards and other supported applications.
+
+## Before accessing the dashboards
+To access any of the dashboards, ensure that a compatible web browser is installed. If you are connecting remotely to your Linux server using MobaXterm (version later than 8) or another X11 client through *ssh*, follow the steps below to launch the Firefox browser:  
+* On the management station:
+	1. Connect using *ssh*. Run `ssh <user>@<IP-address>`, where *IP-address* is the private IP of the management station.
+	2. `dnf install mesa-libGL-devel -y`
+	3. `dnf install firefox -y`
+	4. `dnf install xorg-x11-xauth`
+	5. `export DISPLAY=:10.0`
+	6. Log out and log in again.
+	7. To launch Firefox from terminal, run `firefox&`.  
+	
+* On the manager node:
+	1. Connect using *ssh*. Run `ssh <user>@<IP-address>`, where *IP-address* is the private IP of the manager node.
+	2. `yum install firefox -y`
+	3. `yum install xorg-x11-xauth`
+	4. `export DISPLAY=:10.0`
+	5. Log out and log in again.
+	6. To launch Firefox from terminal, run `firefox&`
+
+**NOTE**: Each time the PuTTY or MobaXterm session ends, you must run the **export DISPLAY=:10.0** command again; otherwise, Firefox cannot be launched.  
+
+## Access FreeIPA Dashboard  
+The FreeIPA Dashboard can be accessed from the management station, manager, and login nodes. To access the dashboard:
+1.	Install the Firefox Browser.
+2.	Open the Firefox Browser and enter the URL: `https://<hostname>`. For example, enter `https://manager.example.com`.
+3.	Enter the username and password. If the admin or user has obtained a Kerberos ticket, then the credentials need not be provided.  
+
+**Note**: To obtain a Kerberos ticket, perform the following actions:
+1. Enter `kinit <username>`
+2. When prompted, enter the password.
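+
+For example, assuming a hypothetical user `user01` (`klist` simply confirms that the ticket was granted):
+
+`kinit user01`
+
+`klist`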
+
+An administrator can create users on the login node using FreeIPA. The users will be prompted to change their passwords upon first login.
+
+## Access Kubernetes Dashboard
+1. To verify if the **Kubernetes-dashboard** service is in the Running state, run `kubectl get pods --namespace kubernetes-dashboard`.
+2. To start the Kubernetes dashboard, run `kubectl proxy`.
+3. To retrieve the encrypted token, run `kubectl get secret -n kubernetes-dashboard $(kubectl get serviceaccount admin-user -n kubernetes-dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode`.
+4. Copy the encrypted token value.
+5. On a web browser on the management station (for control_plane.yml) or manager node (for omnia.yml), enter http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/.
+6. Select the authentication method as __Token__.
+7. On the Kubernetes Dashboard, paste the copied encrypted token and click **Sign in** to access the Kubernetes Dashboard.
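+
+Steps 2 and 3 can also be chained into one copy-pasteable sequence (this simply reuses the commands above):
+
+```
+kubectl proxy &
+kubectl get secret -n kubernetes-dashboard \
+  $(kubectl get serviceaccount admin-user -n kubernetes-dashboard -o jsonpath="{.secrets[0].name}") \
+  -o jsonpath="{.data.token}" | base64 --decode
+```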
+
+## Access Kubeflow Dashboard
+1. Before accessing the Kubeflow Dashboard, run `kubectl -n kubeflow get applications -o yaml profiles`. Wait until **profiles-deployment** enters the Ready state.
+2. To retrieve the **External IP** or **CLUSTER IP**, run `kubectl get services istio-ingressgateway --namespace istio-system`.
+3. On a web browser installed on the manager node, enter the **External IP** or **Cluster IP** to open the Kubeflow Central Dashboard.  
+
+For more information about the Kubeflow Central Dashboard, see https://www.kubeflow.org/docs/components/central-dash/overview/.
+
+## Access JupyterHub Dashboard
+
+1. To verify if the JupyterHub services are running, run `kubectl get pods --namespace jupyterhub`.
+2. Ensure that the pod names starting with __hub__ and __proxy__ are in the **Running** state.
+3. To retrieve the **External IP** or **CLUSTER IP**, run `kubectl get services proxy-public --namespace jupyterhub`.
+4. On a web browser installed on the manager node, enter the **External IP** or **Cluster IP** to open the JupyterHub Dashboard.
+5. JupyterHub is running with a default dummy authenticator. Enter any username and password combination to access the dashboard.
+
+For more information about configuring username and password, and to access the JupyterHub Dashboard, see https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customization.html.
+
+## Access Prometheus UI
+
+Prometheus is installed:
+  * As a Kubernetes role (**A**), when both Slurm and Kubernetes are installed.
+  * On the host when only Slurm is installed (**B**).
+
+**A**. When Prometheus is installed as a Kubernetes role.  
+* Access Prometheus with local host:  
+    1. Run the following commands:  
+       `export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")`  
+       `echo $POD_NAME`  
+       `kubectl --namespace default port-forward $POD_NAME 9090`  
+    2. To launch the Prometheus UI, in the web browser, enter `http://localhost:9090`.
+  
+* Access Prometheus with a private IP address:
+    1. Run `kubectl get services --all-namespaces`.
+    2. From the list of services, find the **prometheus-xxxx-server** service under the **Name** column, and copy the **EXTERNAL-IP** address.  
+   For example, in the following list of services, `192.168.2.150` is the external IP address for the service `prometheus-1619158141-server`.
+   
+		NAMESPACE	|	NAME	|	TYPE	|	CLUSTER-IP	|	EXTERNAL-IP	|	PORT(S)	|	AGE  
+		---------	|	----	|	----	|	----------	|	-----------	|	-------	|	----  
+		default	|	kubernetes	|	ClusterIP	|	10.96.0.1	|	none	|	443/TCP	|	107m  
+		default	|	**prometheus-1619158141-server**	|	LoadBalancer	|	10.97.40.140	|	**192.168.2.150**	|	80:31687/TCP	|	106m  
+    3. To open Firefox, run `firefox&`.
+    4. Enter the copied external IP address to access Prometheus. For example, enter `192.168.2.150` to access the Prometheus UI.
+
+**B**. When Prometheus is installed on the host.
+1. Navigate to the Prometheus folder. The default path is `/var/lib/prometheus-2.23.0.linux-amd64/`.
+2. Start the web server: `./prometheus`.  
+3. To launch the Prometheus UI, in the web browser, enter `http://localhost:9090`. 
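+
+A minimal sketch of steps 1 and 2, using the documented default path (Prometheus reads `prometheus.yml` from the same folder by default):
+
+```
+cd /var/lib/prometheus-2.23.0.linux-amd64/
+./prometheus &
+```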
+
+__Note:__ 
+* If Prometheus is installed through Slurm without installing Kubernetes, then it will be removed when Kubernetes is installed because Prometheus would be running as a pod. 
+* Only a single instance of Prometheus is installed when both Kubernetes and Slurm are installed.
+
+## Accessing Prometheus data via Grafana UI (On the Management Station)
+
+* Once `control_plane.yml` is run, Prometheus is added to Grafana as a datasource (hpc-prometheus). This allows Grafana to display statistics from the compute nodes that Prometheus has polled.
+
+* Select the dashboard (![Dashboard Icon](Images/DashBoardIcon.PNG)) tab to view the list of Prometheus-based dashboards. Some default dashboards include CoreDNS, Prometheus Overview, Kubernetes Networking, etc.
+
+>> __Note:__ Both the control plane and HPC clusters can be monitored on these dashboards by toggling the datasource at the top of each dashboard. 
+

+ 67 - 0
docs/TelemetryAndMonitoring/Monitor_Control_Plane.md

@@ -0,0 +1,67 @@
+# Monitoring The Management Station
+
+To monitor the Management Station, Omnia uses the Grafana UI with a Loki integration (this can be set up using the steps provided [here](Install_Telemetry.md)).  
+
+
+## Accessing Loki via Grafana
+
+[Loki](https://grafana.com/docs/loki/latest/fundamentals/overview/) is a datastore used to efficiently hold log data for security purposes. Using the `promtail` agent, logs are collated and streamed via an HTTP API.
+
+>> __Note:__ When `control_plane.yml` is run, Loki is automatically set up as a data source on the Grafana UI.
+
+
+
+### Querying Loki 
+
+Loki uses a basic regex-based syntax to filter for specific jobs, dates, or timestamps.
+
+* Select the Explore ![Explore Icon](Images/ExploreIcon.PNG) tab and select control-plane-loki from the drop-down.
+* Using [LogQL queries](https://grafana.com/docs/loki/latest/logql/log_queries/), all logs in `/var/log` can be accessed using filters (e.g., `{job="Omnia"}`).
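+
+For example, the job filter above can be combined with a line filter to surface only error messages (the label value `Omnia` comes from the example above; the search string is arbitrary):
+
+`{job="Omnia"} |= "error"`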
+
+## Viewing Logs on the Dashboard
+
+All log files can be viewed via the Dashboard tab (![Dashboard Icon](Images/DashBoardIcon.PNG)). The Default Dashboard displays `omnia.log` and `syslog`. Custom dashboards can be created per user requirements.
+
+## Accessing Prometheus data via Grafana
+
+* Once `control_plane.yml` is run, Prometheus is added to Grafana as a datasource. This allows Grafana to display statistics from the control plane that Prometheus has polled.
+
+![Prometheus DataSource](Images/Prometheus_DataSource.jpg)
+
+* Select the dashboard (![Dashboard Icon](Images/DashBoardIcon.PNG)) tab to view the list of Prometheus-based dashboards. Some default dashboards include CoreDNS, Prometheus Overview, Kubernetes Networking, etc.
+
+>> __Note:__ Both the control plane and HPC clusters can be monitored on these dashboards by toggling the datasource at the top of each dashboard:
+
+| Data Source | Description | Source |
+|-------------|-------------|--------|
+|  hpc-prometheus-headnodeIP            | Monitors the Kubernetes and Slurm clusters on the manager and compute nodes.            |  This datasource is set up when `omnia.yml` is run.      |
+| control_plane_prometheus            | Monitors the single-node cluster running on the management station.            | This datasource is set up when `control_plane.yml` is run.        |
+
+
+![Prometheus DataSource](Images/Prometheus_Dashboard.jpg)
+
+
+
+
+| Type        | Subtype           | Dashboard Name                    | Available DataSources                               |
+|-------------|-------------------|-----------------------------------|-----------------------------------------------------|
+|             |                   | CoreDNS                           | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  |                   | API Types                         | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Compute Resources | Cluster                           | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Compute Resources | Namespace (Pods)                  | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Compute Resources | Node (Pods)                       | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Compute Resources | Pod                               | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Compute Resources | Workload                          | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  |                   | Kubelet                           | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Networking        | Cluster                           | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Networking        | Namespace (Pods)                  | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Networking        | Namespace (Workload)              | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Networking        | Pod                               | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  | Networking        | Workload                          | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  |                   | Scheduler                         | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Kubernetes  |                   | Stateful Sets                     | control-plane-prometheus, hpc-prometheus-headnodeIP |
+|             |                   | Prometheus Overview               | control-plane-prometheus, hpc-prometheus-headnodeIP |
+| Slurm       |                   | CPUs/GPUs, Jobs, Nodes, Scheduler | hpc-prometheus-headnodeIP                           |
+| Slurm       |                   | Node Exporter Server Metrics      | hpc-prometheus-headnodeIP                           |
+
+

+ 14 - 0
docs/control_plane/device_templates/CONFIGURE_INFINIBAND_SWITCHES.md

@@ -7,6 +7,20 @@ Omnia uses the server-based Subnet Manager (SM). SM runs as a Kubernetes pod on
 
 ## Setting up a new or factory reset switch
 
+Before running `infiniband.yml`, ensure that the HTTP and JSON gateways are enabled on your switch. This can be verified by running:
+
+`show web` (To check if HTTP is enabled)
+
+`show json-gw` (To check if JSON Gateway is enabled)
+
+If either service is disabled, run:
+
+`web http enable` (To enable the HTTP gateway)
+
+`json-gw enable` (To enable the JSON gateway)
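+
+A sketch of a typical session on a new switch (this assumes an MLNX-OS-style CLI where the configuration commands above must be run from configuration mode; consult your switch documentation if the prompts differ):
+
+```
+enable
+configure terminal
+show web
+show json-gw
+web http enable
+json-gw enable
+```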
+
+
+
 When connecting to a new or factory reset switch, the configuration wizard requests to execute an initial configuration:
 * **(Recommended)** If the user enters 'no', they still have to provide the admin and monitor passwords. 
 * If the user enters 'yes', they will also be prompted to enter the hostname for the switch, DHCP details, IPv6 details, etc.