The 'storage' section of the values.yaml file configures the locations where HPCC stores each category of data. Most of that configuration is provided within the list of storage planes. Each plane has three required fields - name, category and prefix. E.g. the following simplified list:
storage:
  planes:
  - name: dali
    category: dali
    prefix: "/var/lib/HPCCSystems/dalistorage"
  - name: dll
    category: dll
    prefix: "/var/lib/HPCCSystems/queries"
  - name: primarydata
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
The 'name' property is used to identify the storage plane in the helm charts. It is also visible to the user - to identify a storage location within eclwatch or ECL code. The name must be unique and must not include upper-case characters. It loosely corresponds to a cluster in the bare-metal version of the platform.
'category' is used to indicate the kind of data that is being stored in that location. Different planes are used for the different categories to isolate the different types of data from each other, but also because they often require different performance characteristics. A named plane may only store one category of data. The categories currently supported are dali, dll, data, lz (landing zone), spill and temp. Currently temp and spill are not completely implemented, but will be in future point releases. It is likely that other categories will be added in the future (for example, a location to store inter-subgraph spills).
The 'prefix' property most commonly defines the path within the container where the storage is mounted. In the example above they are all sub-directories of /var/lib/HPCCSystems.
HPCC also allows some file systems to be accessed through a URL syntax. For instance, the following landing zone uses Azure blob storage:
storage:
  planes:
  - name: azureblobs
    prefix: "azure://ghallidayblobs@data"
    category: lz
So far we have seen the properties that describe how the HPCC application views the storage, but how does Kubernetes associate those definitions with physical storage?
Ephemeral storage is allocated when the HPCC cluster is installed and deleted when the chart is uninstalled. It is useful for providing a clean environment for testing, or for a demonstration system that lets you experiment with the platform. It is not so useful for production systems - for this reason the helm chart generates a warning if it is used.
storageClass:
Which storage provisioner should be used to allocate the storage? A blank storage class indicates it should use the default provisioner.
storageSize:
How much space should be allocated for this storage?
E.g. An ephemeral data plane:
planes:
- name: data
  storageClass: ""
  storageSize: 1Gi
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  category: data
And to add an ephemeral landing zone (which you can upload files to via eclwatch) you could use:
planes:
- name: mylandingzone
  storageClass: ""
  storageSize: 1Gi
  prefix: "/var/lib/HPCCSystems/mylandingzone"
  category: lz
For persistent storage, the HPCC cluster uses persistent volume claims (PVCs) that have already been created, for example by installing another Kubernetes chart. Using a PVC allows the data stored on those volumes to outlive the HPCC cluster that uses them. The helm/examples directory contains charts to simplify defining persistent storage for a local machine, Azure, AWS, etc.
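As a minimal sketch, a persistent data plane might look like the following - this assumes a pre-existing persistent volume claim named data-hpcc-pvc (e.g. one created by a helm/examples chart), and that the plane references it via a 'pvc' property:
planes:
- name: data
  category: data
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  pvc: data-hpcc-pvc # name of the pre-existing persistent volume claim (illustrative name)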
The values file can contain more than one storage plane definition for each category. The first storage plane in the list for each category is used as the default location to store that category of data.
That default can be overridden on each component by specifying a property with the name "<category>Plane". For example, use dataPlane on an engine to override the default data plane, or daliPlane to override the plane used to store the dali data:
eclagent:
- name: hthor
  prefix: hthor
  dataPlane: premium-data # override the default data plane
dali:
- name: mydali
  daliPlane: primary-dali-plane # override the plane to store the dali data
It is also possible to override the target storage plane by using the PLANE option on an OUTPUT statement in the ECL language. This allows the ECL programmer to write data to different storage planes depending on how the data is going to be used. The engines can read data from any of the data planes.
forcePermissions: <boolean>
In some situations the default permissions for the mounted volumes do not allow the hpcc user to write to the storage. Setting this option ensures the ownership of the volume is changed before the main process is started.
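For instance, a sketch of a data plane with the permissions fix enabled (the plane name and prefix are illustrative):
planes:
- name: data
  category: data
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  forcePermissions: true # change the ownership of the volume before the main process starts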
subPath: <string>
This property provides an optional sub-directory within <prefix> to use as the root directory. Most of the time the different categories of data will be stored in different locations and this option is not needed. However, if there is a requirement to store two categories of data in the same location, then it is legal to have two storage planes use the same prefix/path and different categories as long as the rest of the plane definitions are identical (except for the name and the subPath). The subPath property allows the data to reside in separate directories so they cannot clash.
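For example, a sketch of two planes that share the same underlying storage but keep their data in separate sub-directories (the plane names and prefix are illustrative):
planes:
- name: dalidata
  category: dali
  prefix: "/var/lib/HPCCSystems/shared"
  subPath: dali # dali data stored under <prefix>/dali
- name: primarydata
  category: data
  prefix: "/var/lib/HPCCSystems/shared"
  subPath: hpcc-data # data files stored under <prefix>/hpcc-data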
secret: <string>
This provides the name of any secret that is required to access the plane's storage. It is currently unused, but may be required once inter-cluster remote file access is finished.
defaultSprayParts: <number>
Earlier we commented that storage planes are similar to clusters in bare-metal. One key difference is that bare-metal clusters are associated with a fixed size thor, whereas a storage plane is not. This property allows you to define the number of parts that a file is split into when it is imported/sprayed. The default is currently 1, but that will soon change to the size of the largest thor cluster.
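For example, a sketch of a landing zone plane where sprayed files are split into 4 parts (the plane name and prefix are illustrative):
planes:
- name: mylandingzone
  category: lz
  prefix: "/var/lib/HPCCSystems/mylandingzone"
  defaultSprayParts: 4 # files imported/sprayed from this landing zone are split into 4 parts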
cost:
This property allows you to specify the costs associated with the storage so that the platform can calculate an estimate of the costs associated with each file. Currently only the cost at rest is supported; transactional costs will be added later. E.g.
cost:
  storageAtRest: 0.113 # Storage at rest cost: cost per GiB/month
There are two aspects to using bare-metal storage in the Kubernetes system. The first is the 'hostGroups' entry in the storage section which provides named lists of hosts. The hostGroups entries can take one of two forms:
storage:
  hostGroups:
  - name: "The name of the host group process"
    hosts: [ "a list of host names" ]
This is the most common form, and directly associates a list of host names with a name. The second form:
storage:
  hostGroups:
  - name: "The name of the host group process"
    hostGroup: "Name of the hostgroup to create a subset of"
    count: "Number of hosts in the subset"
    offset: "The first host to include in the subset"
    delta: "Cycle offset to apply to the hosts"
allows one host group to be derived from another. Typical examples with bare-metal clusters are smaller subsets of the hosts, or the same hosts but with different parts stored on different nodes. E.g.
storage:
  hostGroups:
  - name: groupABCDE # Explicit list of hosts
    hosts: [A, B, C, D, E]
  - name: groupCDE # Subset of the group - the last 3 hosts
    hostGroup: groupABCDE
    count: 3
    offset: 2
  - name: groupDEC # Same set of hosts, but different part->host mapping
    hostGroup: groupCDE
    delta: 1
The second aspect is to add a property to the storage plane definition to indicate which hosts are associated with it.
There are two options: a 'hostGroup' property, which refers to a named host group defined in the hostGroups section, or a 'hosts' property, which provides an inline list of host names.
For example:
storage:
  planes:
  - name: demoOne
    category: data
    prefix: "/home/gavin/temp"
    hostGroup: groupABCD # The name of the hostGroup
  - name: myDropZone
    category: lz
    prefix: "/home/gavin/mydropzone"
    hosts: [ 'mylandingzone.com' ] # Inline reference to an external host.