HPCC Data Handling
Boca Raton Documentation Team

We welcome your comments and feedback about this document via email to docfeedback@hpccsystems.com. Please include "Documentation Feedback" in the subject line and reference the document name, page numbers, and current version number in the text of the message.

LexisNexis and the Knowledge Burst logo are registered trademarks of Reed Elsevier Properties Inc., used under license. HPCC Systems® is a registered trademark of LexisNexis Risk Data Management Inc. Other products, logos, and services may be trademarks or registered trademarks of their respective companies.

All names and example data used in this manual are fictitious. Any similarity to actual persons, living or dead, is purely coincidental.

<emphasis>HPCC Data Handling</emphasis>

Introduction

Data may be transferred to, from, or within an HPCC system in a number of different ways. For each of these data transfers, a few key parameters must be known.

<emphasis role="bold">Prerequisites for most file movements:</emphasis>

Logical filename
Physical filename
Record size (fixed)
Source directory
Destination directory
Dali IP address (source and/or destination)
Landing Zone IP address

These parameters are used by the major data handling methods:

Import - Spraying data from the Landing Zone to Thor
Export - Despraying data from Thor to the Landing Zone
Copy - Replicating data from Thor to Thor (within the same Dali file system)
Copying data from Thor to Thor (between different Dali file systems)

<emphasis>HPCC Data Backups</emphasis>

Introduction

This section covers critical system data that requires regular backup procedures to prevent data loss. The components to back up are:

The System Data Store (Dali data)
Environment configuration files
Data Refinery (Thor) data files
Rapid Data Delivery Engine (Roxie) data files
Attribute repositories
Landing Zone files

Dali data

The Dali Server data is typically mirrored to its backup node.
This location is specified in the environment configuration file using the Configuration Manager. Because the data is written simultaneously to both nodes, there is no need for a manual backup procedure.

Environment configuration files

There is only one active environment file, but you may have many alternative configurations. Configuration Manager works only on files in the /etc/HPCCSystems/source/ folder. To make a configuration active, it is copied to /etc/HPCCSystems/environment.xml on all nodes. Configuration Manager automatically creates backup copies in the /etc/HPCCSystems/source/backup/ folder.

Thor data files

Thor clusters are normally configured to replicate data automatically to a secondary location known as the mirror location. Usually, this is on the second drive of the subsequent node. If the data is not found at the primary location (for example, due to drive failure or because a node has been swapped out), Thor reads it from the mirror directory. Writes go to the primary location and then to the mirror. This provides continual redundancy and a quick means to restore a system after a node swap.

A Thor data backup should be performed on a regularly scheduled basis, and on demand after a node swap.

Manual backup

To run a backup manually, follow these steps:

1. Log in to the Thor Master node. If you don't know which node is your Thor Master node, you can look it up in ECL Watch.

2. Run these commands:

sudo su hpcc
/opt/HPCCSystems/bin/start_backupnode <thor_cluster_name>

This starts the backup process.

3. Wait until completion. The process reports "backupnode finished" when done.

4. Run the XREF utility in ECL Watch to verify that there are no orphan files or lost files.

Scheduled backup

The easiest way to schedule the backup process is to create a cron job. Cron is a daemon that serves as a task scheduler. A crontab (short for cron table) is a text file that contains the task list.
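Before handing the backup over to cron, it can help to wrap the start_backupnode command in a small script that captures its output. The following is a minimal sketch only; the cluster name (mythor) and log file path are assumptions to adjust for your site:

```shell
#!/bin/bash
# Sketch: wrapper around start_backupnode suitable for cron scheduling.
# The cluster name and log path below are assumptions -- adjust for your site.
THOR_CLUSTER="mythor"
BACKUP_BIN="/opt/HPCCSystems/bin/start_backupnode"
LOG_FILE="/tmp/backupnode_${THOR_CLUSTER}.log"

if [ -x "$BACKUP_BIN" ]; then
    # Run the backup and append its output to the log so the
    # "backupnode finished" message can be checked afterwards.
    "$BACKUP_BIN" "$THOR_CLUSTER" >> "$LOG_FILE" 2>&1
else
    echo "start_backupnode not found at $BACKUP_BIN" >&2
fi
```

Pointing a crontab entry at a wrapper like this, rather than at start_backupnode directly, keeps a record of each run for later review.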
To edit the crontab with the default editor, use the command:

sudo crontab -e

Here is a sample crontab entry:

30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor

30 represents the minute of the hour.
23 represents the hour of the day.
The asterisks (*) represent every day of the month, every month, and every day of the week.
mythor is the cluster name.

To list the scheduled tasks, use the command:

sudo crontab -l

Roxie data files

Roxie data is protected by three forms of redundancy:

Original Source Data File Retention: When a query is deployed, the data is typically copied from a Thor cluster's hard drives. The Thor data can therefore serve as a backup, provided it is not removed or altered on Thor. Thor data is typically retained for a period of time sufficient to serve as a backup copy.

Peer-Node Redundancy: Each slave node typically has one or more peer nodes within its cluster. Each peer stores a copy of the data files it will read.

Sibling Cluster Redundancy: Although not required, Roxie deployments may run multiple identically configured Roxie clusters. When two clusters are deployed for production, each node has an identical twin, in terms of the data and queries stored on it, in the other cluster. This provides multiple redundant copies of data files.

Attribute repositories

Attribute repositories are stored on ECL developers' local hard drives. They can represent a significant number of hours of work and should therefore be backed up regularly. In addition, we suggest using some form of source version control.

Landing Zone files

Landing Zones contain raw data for input. They can also contain output files. Depending on the size or complexity of these files, you may want to retain copies for redundancy.
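Files on the Landing Zone are typically moved onto Thor with the spray operation described in the Introduction, which can be driven from the command line with the dfuplus client tool. The sketch below builds (and echoes, rather than executes, since it needs a live cluster) a fixed-width spray command; the ESP address, Landing Zone IP, file names, cluster name, and record size are all hypothetical examples:

```shell
#!/bin/bash
# Sketch: spraying a fixed-width file from the Landing Zone to Thor via dfuplus.
# Every host name, path, and size below is a placeholder -- substitute your own.
ESP_SERVER="http://192.168.1.10:8010"                 # ESP (ECL Watch) service
LANDING_ZONE_IP="192.168.1.20"                        # Landing Zone node
SRC_FILE="/var/lib/HPCCSystems/mydropzone/people.dat" # physical filename
DST_NAME="examples::people"                           # logical filename on Thor
DST_CLUSTER="mythor"                                  # target Thor cluster
RECORD_SIZE=124                                       # fixed record size, bytes

# Assemble the command from the prerequisite parameters listed
# in the Introduction (logical name, physical name, record size, IPs).
SPRAY_CMD="dfuplus action=spray server=${ESP_SERVER} \
srcip=${LANDING_ZONE_IP} srcfile=${SRC_FILE} \
dstname=${DST_NAME} dstcluster=${DST_CLUSTER} \
format=fixed recordsize=${RECORD_SIZE}"

echo "$SPRAY_CMD"
```

A despray is broadly the mirror image: action=despray with the logical file as the source and a Landing Zone IP and physical path as the destination. Consult the HPCC Client Tools documentation for the authoritative option list.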