
File storage in HPCC

The 'storage' section of the values.yaml file configures the locations where HPCC stores all categories of data. Most of that configuration is provided within the list of storage planes. Each plane has three required fields: name, category and prefix. For example, the following simplified list:

storage:
  planes:
  - name: dali
    category: dali
    prefix: "/var/lib/HPCCSystems/dalistorage"
  - name: dll
    category: dll
    prefix: "/var/lib/HPCCSystems/queries"
  - name: primarydata
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"

name

The 'name' property is used to identify the storage plane in the helm charts. It is also visible to the user, where it identifies a storage location within eclwatch or ECL code. The name must be unique and must not include upper-case characters. It loosely corresponds to a cluster in the bare-metal version of the platform.

category

'category' indicates the kind of data that is stored in that location. Different planes are used for the different categories to isolate the types of data from each other, and also because they often require different performance characteristics. A named plane may only store one category of data. The following categories are currently supported (with some notes about performance characteristics):

  • data
    Where are data files generated by HPCC stored? For Thor, storage costs are likely to be significant. Sequential access speed is important, but random access is much less so. For roxie, the speed of random access is likely to be important.
  • lz
    A landing zone where external users can read and write files. The HPCC system can import files from, or export files to, a landing zone. Performance is typically less of an issue, so this could be blob/s3 bucket storage, accessed either directly or via an NFS mount.
  • dali
    The location of the dali metadata store. Needs to support fast random access.
  • dll
    Where are the compiled ECL queries stored? The storage needs to allow shared objects to be directly loaded from it efficiently.
  • sasha
    Location to store archived workunits, etc. Typically less speed-critical, and should have lower storage costs.
  • spill (optional)
    Where are spill files from thor written? Local NVMe disks are potentially a good choice.
  • temp (optional)
    Where are temporary files written?

Currently temp and spill are not completely implemented, but will be in future point releases. It is likely that other categories will be added in the future (for example, a location to store inter-subgraph spills).
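
As a rough sketch, a plane list that covers each of the required categories might look like the following. The sasha and lz plane names and prefixes are illustrative assumptions rather than values taken from the chart, and the properties that bind each plane to physical storage are omitted for brevity.

```yaml
storage:
  planes:
  - name: dali
    category: dali                 # metadata store: needs fast random access
    prefix: "/var/lib/HPCCSystems/dalistorage"
  - name: dll
    category: dll                  # compiled queries: shared objects are loaded from here
    prefix: "/var/lib/HPCCSystems/queries"
  - name: primarydata
    category: data                 # data files generated by HPCC
    prefix: "/var/lib/HPCCSystems/hpcc-data"
  - name: sasha                    # hypothetical plane name and path
    category: sasha                # archived workunits: lower-cost storage
    prefix: "/var/lib/HPCCSystems/sasha"
  - name: mylandingzone            # hypothetical plane name and path
    category: lz                   # import/export area for external users
    prefix: "/var/lib/HPCCSystems/mylandingzone"
```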

prefix

In the most common case, this defines the path within the container where the storage is mounted. In the example above, the prefixes are all sub-directories of /var/lib/HPCCSystems.

HPCC also allows some file systems to be accessed through a URL syntax. For instance, the following landing zone uses Azure blob storage:

storage:
  planes:
  - name: azureblobs
    prefix: "azure://ghallidayblobs@data"
    category: lz

How is storage associated with a storage plane?

So far we have seen the properties that describe how the HPCC application views the storage, but how does Kubernetes associate those definitions with physical storage?

Ephemeral storage: (storageClass, storageSize)

Ephemeral storage is allocated when the HPCC cluster is installed and deleted when the chart is uninstalled. It is useful for providing a clean system for testing, or a demonstration system that lets you experiment. It is less suitable for production systems, so the helm chart generates a warning if it is used.

  • storageClass:
    Which storage provisioner should be used to allocate the storage? A blank storage class indicates it should use the default provisioner.

  • storageSize:
    How much storage capacity is required for this plane?

For example, an ephemeral data plane:

  planes:
  - name: data
    storageClass: ""
    storageSize: 1Gi
    prefix: "/var/lib/HPCCSystems/hpcc-data"
    category: data

And to add an ephemeral landing zone (which you can upload files to via eclwatch) you could use:

  planes:
  - name: mylandingzone
    storageClass: ""
    storageSize: 1Gi
    prefix: "/var/lib/HPCCSystems/mylandingzone"
    category: lz

Persistent storage (pvc)

For persistent storage, the HPCC cluster uses persistent volume claims that have already been created by installing another Kubernetes chart. Using a pvc allows the data stored on those volumes to outlive the HPCC cluster that uses them. The helm/examples directory contains charts to simplify defining persistent storage for a local machine, Azure, AWS etc.

  • pvc
    The pvc property names a Persistent Volume Claim created by another chart.
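
For instance, a data plane backed by an existing claim could be sketched as follows. The claim name 'my-data-pvc' is a hypothetical placeholder; substitute the name of a Persistent Volume Claim that another chart has already created. storageClass and storageSize are left out on the assumption that the existing claim already determines them.

```yaml
storage:
  planes:
  - name: data
    pvc: my-data-pvc               # hypothetical claim name, created by another chart
    prefix: "/var/lib/HPCCSystems/hpcc-data"
    category: data
```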

Default storage planes

The values file can contain more than one storage plane definition for each category. The first storage plane in the list for each category is used as the default location to store that category of data.
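
To illustrate, in the following sketch both planes have the data category, so the one listed first would be the default location for data. The plane names and prefixes are illustrative assumptions.

```yaml
storage:
  planes:
  - name: standard-data            # listed first, so the default plane for 'data'
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
  - name: premium-data             # used only when a component or query selects it
    category: data
    prefix: "/var/lib/HPCCSystems/premium-data"
```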

That default can be overridden on each component by specifying a property with the name "<category>Plane". For example, to override the default data plane for an eclagent with dataPlane, and the default dali plane for dali with daliPlane:

eclagent:
- name: hthor
  prefix: hthor
  dataPlane: premium-data               # override the default data plane
dali:
- name: mydali
  daliPlane: primary-dali-plane         # override the plane to store the dali data

It is also possible to override the target storage plane by using the PLANE option on an OUTPUT statement in the ECL language. This allows the ECL programmer to write data to different storage planes depending on how the data is going to be used. The engines can read data from any of the data planes.

Other storage.planes options

  • forcePermissions: <boolean>
    In some situations the default permissions for the mounted volumes do not allow the hpcc user to write to the storage. Setting this option ensures the ownership of the volume is changed before the main process is started.

  • subPath: <string>
    This property provides an optional sub-directory within <prefix> to use as the root directory. Most of the time the different categories of data will be stored in different locations and this option is not needed. However, if there is a requirement to store two categories of data in the same location, then it is legal to have two storage planes use the same prefix/path and different categories as long as the rest of the plane definitions are identical (except for the name and the subPath). The subPath property allows the data to reside in separate directories so they cannot clash.

  • secret: <string>
    This provides the name of any secret that is required to access the plane's storage. It is currently unused, but may be required once inter-cluster remote file access is finished.

  • defaultSprayParts: <number>
    Earlier we commented that storage planes are similar to clusters in bare-metal. One key difference is that bare-metal clusters are associated with a fixed size thor, whereas a storage plane is not. This property allows you to define the number of parts that a file is split into when it is imported/sprayed. The default is currently 1, but that will soon change to the size of the largest thor cluster.

  • cost:
    This property allows you to specify the costs associated with the storage so that the platform can calculate an estimate of the costs associated with each file. Currently only the cost at rest is supported; transactional costs will be added later. E.g.

      cost:
        storageAtRest: 0.113                # Storage at rest cost: cost per GiB/month
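
Returning to the subPath option, the sketch below stores dali and sasha data under one mount by giving the two planes the same prefix and different sub-directories. The plane names, the claim name and the paths are hypothetical; everything other than name, category and subPath is kept identical on both planes, as the description of subPath requires.

```yaml
storage:
  planes:
  - name: shared-dali              # hypothetical plane name
    category: dali
    pvc: shared-pvc                # hypothetical claim, identical on both planes
    prefix: "/var/lib/HPCCSystems/shared"
    subPath: dali                  # root directory is the 'dali' sub-directory of prefix
    forcePermissions: true         # change volume ownership before the main process starts
  - name: shared-sasha             # hypothetical plane name
    category: sasha
    pvc: shared-pvc
    prefix: "/var/lib/HPCCSystems/shared"
    subPath: sasha                 # separate sub-directory, so the data cannot clash
    forcePermissions: true
```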
    

Bare metal storage

There are two aspects to using bare-metal storage in the Kubernetes system. The first is the 'hostGroups' entry in the storage section which provides named lists of hosts. The hostGroups entries can take one of two forms:

storage:
  hostGroups:
  - name: "The name of the host group process"
    hosts: [ "a list of host names" ]

This is the most common form, and directly associates a list of host names with a name. The second form:

storage:
  hostGroups:
  - name: "The name of the host group process"
    hostGroup: "Name of the hostgroup to create a subset of"
    count: "Number of hosts in the subset"
    offset: "the first host to include in the subset"
    delta:  "Cycle offset to apply to the hosts"

allows one host group to be derived from another. Typical examples with bare-metal clusters are a smaller subset of the hosts, or the same hosts with different parts stored on different nodes. E.g.

storage:
  hostGroups:
  - name: groupABCDE              # Explicit list of hosts
    hosts: [A, B, C, D, E]
  - name: groupCDE                # Subset of the group: the last 3 hosts
    hostGroup: groupABCDE
    count: 3
    offset: 2
  - name: groupDEC                # Same set of hosts, but different part->host mapping
    hostGroup: groupCDE
    delta: 1
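
Assuming that offset is counted from zero and that delta rotates the host list by that many positions (an interpretation inferred from the group names in the example, not confirmed here), the two derived groups above would be equivalent to these explicit definitions:

```yaml
storage:
  hostGroups:
  - name: groupCDE                # count 3 starting at offset 2 of [A, B, C, D, E]
    hosts: [C, D, E]
  - name: groupDEC                # groupCDE rotated by a delta of 1
    hosts: [D, E, C]
```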

The second aspect is to add a property to the storage plane definition to indicate which hosts are associated with it.

There are two options:

  • hostGroup: <name>
    The name of the host group for bare metal. For historical reasons the name of the hostgroup must match the name of the storage plane.
  • hosts: <list-of-names>
    An inline list of hosts. Primarily useful for defining one-off external landing zones.

For example:

storage:
  planes:
  - name: demoOne
    category: data
    prefix: "/home/gavin/temp"
    hostGroup: groupABCDE            # The name of the hostGroup
  - name: myDropZone
    category: lz
    prefix: "/home/gavin/mydropzone"
    hosts: [ 'mylandingzone.com' ]  # Inline reference to an external host.