HPCC Data Handling
Boca Raton Documentation Team
We welcome your comments and feedback about this document via email to docfeedback@hpccsystems.com. Please include Documentation Feedback in the subject line and reference the document name, page numbers, and current Version Number in the text of the message.
LexisNexis and the Knowledge Burst logo are registered trademarks
of Reed Elsevier Properties Inc., used under license.
HPCC Systems® is a registered trademark
of LexisNexis Risk Data Management Inc.
Other products, logos, and services may be trademarks or
registered trademarks of their respective companies.
All names and example data used in this manual are fictitious. Any
similarity to actual persons, living or dead, is purely
coincidental.
Introduction
There are a number of different ways in which data may be
transferred to, from, or within an HPCC system. For each of these data
transfers, there are a few key parameters that must be known.
Prerequisites for most file
movements:
Logical filename
Physical filename
Record size (fixed)
Source directory
Destination directory
Dali IP address (source and/or destination)
Landing Zone IP address
The above parameters are used for these major data handling
methods:
Import - Spraying Data from the Landing Zone to Thor
Export - Despraying Data from Thor to Landing Zone
Copy - Replicating Data from Thor to Thor (within the same Dali
File System)
Remote Copy - Copying Data from Thor to Thor (between different
Dali File Systems)
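As an illustration, a fixed-record spray using the DFUPlus command-line tool brings most of these parameters together. This is a sketch only: the server address, Landing Zone IP, file names, record size, and cluster name below are placeholder values, not defaults.

```shell
#!/bin/bash
# Placeholder values -- substitute your own environment's details.
ESP_SERVER=http://192.168.1.10:8010     # ESP server (ECL Watch address)
LZ_IP=192.168.1.20                      # Landing Zone IP address
SRC_FILE=/var/lib/HPCCSystems/mydropzone/people.dat  # physical filename
DST_NAME=tutorial::yn::people           # logical filename
RECORD_SIZE=124                         # fixed record size in bytes

# Guard so the sketch is a no-op on machines without DFUPlus installed.
if command -v dfuplus >/dev/null 2>&1; then
    dfuplus action=spray server="$ESP_SERVER" srcip="$LZ_IP" \
            srcfile="$SRC_FILE" dstname="$DST_NAME" \
            dstcluster=mythor format=fixed recordsize="$RECORD_SIZE"
fi
```

A despray is the same call with action=despray and the source and destination roles reversed.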
HPCC Data Backups
Introduction
This section covers critical system data that requires regular
backup procedures to prevent data loss.
There are several components that should be considered:
The System Data Store (Dali data)
Environment Configuration files
Data Refinery (Thor) data files
Rapid Data Delivery Engine (Roxie) data files
Attribute Repositories
Landing Zone files
Dali data
The Dali Server data is typically mirrored to its backup node.
This location is specified in the environment configuration file using
the Configuration Manager.
Since the data is written simultaneously to both nodes, there is
no need for a manual backup procedure.
Environment Configuration files
There is only one active environment file, but you may have many
alternative configurations.
Configuration Manager only works on files in the
/etc/HPCCSystems/source/ folder. To make a configuration active, it is
copied to /etc/HPCCSystems/environment.xml on all nodes.
Configuration Manager automatically creates backup copies in the
/etc/HPCCSystems/source/backup/ folder.
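The copy-to-all-nodes step can be scripted. A minimal sketch, assuming passwordless SSH as root and a hypothetical nodes.txt file listing one node IP per line; the candidate configuration name is also a placeholder:

```shell
#!/bin/bash
# Push a newly activated configuration to every node in the cluster.
SRC=/etc/HPCCSystems/source/myenvironment.xml  # hypothetical candidate config
NODES_FILE=nodes.txt                           # hypothetical node list

# Guard so the sketch is a no-op where these files do not exist.
if [ -f "$SRC" ] && [ -f "$NODES_FILE" ]; then
    while read -r ip; do
        scp "$SRC" "root@${ip}:/etc/HPCCSystems/environment.xml"
    done < "$NODES_FILE"
fi
```

Components must be restarted before a pushed configuration takes effect.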
Thor data files
Thor clusters are normally configured to automatically replicate
data to a secondary location known as the mirror location. Usually, this
is on the second drive of the subsequent node.
If the data is not found at the primary location (for example, due
to drive failure or because a node has been swapped out), Thor reads it
from the mirror location instead. Writes go to the primary location and
then to the mirror. This provides continual redundancy and a quick means
to restore a system after a node swap.
A Thor data backup should be performed on a regularly scheduled
basis and on-demand after a node swap.
Manual backup
To run a backup manually, follow these steps:
Log in to the Thor Master node.
If you don't know which node is your Thor Master node, you
can look it up using ECL Watch.
Run this command:
sudo su hpcc
/opt/HPCCSystems/bin/start_backupnode <thor_cluster_name>
This starts the backup process.
Wait until completion. The process reports "backupnode finished"
when it is done.
Run the XREF utility in ECL Watch to verify that there are
no orphan files or lost files.
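The manual steps above can be wrapped in a small script. A sketch, assuming the default installation path and the placeholder cluster name mythor:

```shell
#!/bin/bash
# Run a Thor backup as the hpcc user and wait for it to finish.
CLUSTER=mythor                                     # placeholder cluster name
BACKUPNODE=/opt/HPCCSystems/bin/start_backupnode   # default install path

# Guard so the sketch is a no-op on machines without HPCC installed.
if [ -x "$BACKUPNODE" ]; then
    sudo -u hpcc "$BACKUPNODE" "$CLUSTER"
fi
```

The XREF check in ECL Watch still needs to be performed manually afterwards.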
Scheduled backup
The easiest way to schedule the backup process is to create a
cron job. Cron is a daemon that serves as a task scheduler.
A crontab (short for cron table) is a text file that contains
the task list. To edit it with the default editor, use the
command:
sudo crontab -e
Here is a sample cron tab entry:
30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor
30 represents the minute of the hour.
23 represents the hour of the day.
The asterisks (*) represent every day of the month, every month,
and every day of the week.
mythor is the cluster name.
To list the tasks scheduled, use the command:
sudo crontab -l
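For unattended setups, the sample entry can also be installed without opening an editor, since crontab accepts a new table on standard input. A sketch using the same entry:

```shell
#!/bin/bash
# Append the backup entry to the current crontab non-interactively.
ENTRY='30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor'

# Guard so the sketch is a no-op where cron is unavailable; the
# `|| true` keeps it from aborting in restricted environments.
if command -v crontab >/dev/null 2>&1; then
    ( crontab -l 2>/dev/null; echo "$ENTRY" ) | crontab - 2>/dev/null || true
fi
```

Note that this appends unconditionally: running it twice creates a duplicate entry, so in practice you may want to filter the existing table (for example, with grep -vF start_backupnode) before appending.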
Roxie data files
Roxie data is protected by three forms of redundancy:
Original Source Data File Retention: When a query is deployed,
the data is typically copied from a Thor cluster's hard drives.
Therefore, the Thor data can serve as backup, provided it is not
removed or altered on Thor. Thor data is typically retained for a
period of time sufficient to serve as a backup copy.
Peer-Node Redundancy: Each Slave node typically has one or
more peer nodes within its cluster. Each peer stores a copy of data
files it will read.
Sibling Cluster Redundancy: Although not required, Roxie
deployments may run multiple identically-configured Roxie clusters.
When two clusters are deployed for Production, each node has an
identical twin in the other cluster, holding the same data and
queries.
This provides multiple redundant copies of data files.
Attribute Repositories
Attribute repositories are stored on ECL developers' local hard
drives. They can represent many hours of work and should therefore be
backed up regularly. We also suggest using some form of source version
control.
Landing Zone files
Landing Zones contain raw data for input. They can also contain
output files. Depending on the size or complexity of these files, you
may want to retain copies for redundancy.