<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book lang="en_US" xml:base="../">
<bookinfo>
<title>HPCC Data Handling</title>
<mediaobject>
<imageobject>
<imagedata fileref="images/redswooshWithLogo3.jpg" />
</imageobject>
</mediaobject>
<author>
<surname>Boca Raton Documentation Team</surname>
</author>
<legalnotice>
<para>We welcome your comments and feedback about this document via
email to <email>docfeedback@hpccsystems.com</email> </para>
<para>Please include <emphasis role="bold">Documentation
Feedback</emphasis> in the subject line and reference the document name,
page numbers, and current Version Number in the text of the
message.</para>
<para>LexisNexis and the Knowledge Burst logo are registered trademarks
of Reed Elsevier Properties Inc., used under license. </para>
<para>HPCC Systems is a registered trademark of LexisNexis Risk Data
Management Inc.</para>
<para>Other products, logos, and services may be trademarks or
registered trademarks of their respective companies. </para>
<para>All names and example data used in this manual are fictitious. Any
similarity to actual persons, living or dead, is purely
coincidental.</para>
<para></para>
</legalnotice>
<xi:include href="common/Version.xml" xpointer="FooterInfo"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="common/Version.xml" xpointer="DateVer"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<corpname>HPCC Systems</corpname>
<xi:include href="common/Version.xml" xpointer="Copyright"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<mediaobject role="logo">
<imageobject>
<imagedata fileref="images/LN_Rightjustified.jpg" />
</imageobject>
</mediaobject>
</bookinfo>
<chapter id="Data_Handling">
<title><emphasis>HPCC Data Handling</emphasis></title>
<sect1 id="Introduction" role="nobrk">
<title>Introduction</title>
<para>There are a number of different ways in which data may be
transferred to, from, or within an HPCC system. For each of these data
transfers, there are a few key parameters that must be known.</para>
<sect2 id="Prerequisites-for-most-file-movements">
<title><emphasis role="bold">Prerequisites for most file
movements:</emphasis></title>
<itemizedlist>
<listitem>
<para>Logical filename</para>
</listitem>
<listitem>
<para>Physical filename</para>
</listitem>
<listitem>
<para>Record size (fixed)</para>
</listitem>
<listitem>
<para>Source directory</para>
</listitem>
<listitem>
<para>Destination directory</para>
</listitem>
<listitem>
<para>Dali IP address (source and/or destination)</para>
</listitem>
<listitem>
<para>Landing Zone IP address</para>
</listitem>
</itemizedlist>
<para>The above parameters are used for these major data handling
methods:</para>
<itemizedlist>
<listitem>
<para>Import - Spraying Data from the Landing Zone to Thor</para>
</listitem>
<listitem>
<para>Export - Despraying Data from Thor to Landing Zone</para>
</listitem>
<listitem>
<para>Copy - Replicating Data from Thor to Thor (within same Dali
File System)</para>
</listitem>
<listitem>
<para>Copying Data from Thor to Thor (between different Dali File
Systems)</para>
</listitem>
</itemizedlist>
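<para>As a rough illustration of how the parameters above fit together,
the following shell sketch prints (without running) a dfuplus command
line for a fixed-width spray. The option names are as recalled from the
HPCC client tools documentation and should be checked against your
installed version. The IP addresses, file names, cluster name, record
size, and the build_spray_command function are hypothetical, and the ESP
server URL passed as server= is an assumption beyond the parameter list
above.</para>
<programlisting>#!/bin/sh
# Dry-run sketch: print (do not execute) a dfuplus fixed-width spray command.
# Every value used in the example call below is a hypothetical placeholder.

build_spray_command() {
    # $1 ESP server URL        $2 Landing Zone IP address
    # $3 physical source file  $4 logical destination filename
    # $5 destination cluster   $6 fixed record size in bytes
    printf 'dfuplus action=spray server=%s srcip=%s srcfile=%s dstname=%s dstcluster=%s format=fixed recordsize=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

build_spray_command http://192.168.0.10:8010 192.168.0.20 \
    /var/lib/HPCCSystems/mydropzone/people.dat \
    tutorial::example::people mythor 124</programlisting>
<para>Printing the command first makes it easy to review each parameter
before anything is transferred.</para>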
</sect2>
</sect1>
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Data_Handling_Terms"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Working_with_a_data_file"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Data_Handling_Methods"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Data_Handling_Using_ECL-Watch"
xmlns:xi="http://www.w3.org/2001/XInclude" />
</chapter>
<chapter>
<title><emphasis>HPCC Data Backups</emphasis></title>
<sect1 id="Introduction2" role="nobrk">
<title>Introduction</title>
<para>This section covers critical system data that requires regular
backup procedures to prevent data loss.</para>
<para>The following types of data are covered:</para>
<itemizedlist>
<listitem>
<para>The System Data Store (Dali data)</para>
</listitem>
<listitem>
<para>Environment Configuration files</para>
</listitem>
<listitem>
<para>Data Refinery (Thor) data files</para>
</listitem>
<listitem>
<para>Rapid Data Delivery Engine (Roxie) data files</para>
</listitem>
<listitem>
<para>Attribute Repositories</para>
</listitem>
<listitem>
<para>Landing Zone files</para>
</listitem>
</itemizedlist>
</sect1>
<sect1>
<title>Dali data</title>
<para>The Dali Server data is typically mirrored to its backup node.
This location is specified in the environment configuration file using
the Configuration Manager.</para>
<para>Since the data is written simultaneously to both nodes, there is
no need for a manual backup procedure.</para>
</sect1>
<sect1>
<title>Environment Configuration files</title>
<para>There is only one active environment file, but you may have many
alternative configurations.</para>
<para>Configuration Manager works only on files in the
/etc/HPCCSystems/source/ folder. To make a configuration active, copy
it to /etc/HPCCSystems/environment.xml on all nodes.</para>
<para>Configuration Manager automatically creates backup copies in the
/etc/HPCCSystems/source/backup/ folder.</para>
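<para>The activation step can be sketched as a small shell function
that keeps a timestamped copy of the active file before replacing it.
This is a minimal sketch and not a supported tool: the
activate_environment function and the timestamped file name are
hypothetical, it handles only the local node, and copying the file to
the other nodes is left out.</para>
<programlisting>#!/bin/sh
# Sketch: back up the active environment file, then activate an alternative.
# The folder is passed as an argument so the function can be tried on
# scratch folders before it is pointed at /etc/HPCCSystems.

activate_environment() {
    # $1 folder that holds environment.xml (for example /etc/HPCCSystems)
    # $2 alternative configuration file to make active
    etc_dir=$1
    new_conf=$2
    stamp=$(date +%Y%m%d%H%M%S)

    if [ ! -f "$new_conf" ]; then
        echo "missing configuration: $new_conf"
        return 1
    fi
    # Keep a timestamped copy of the currently active file, if one exists.
    if [ -f "$etc_dir/environment.xml" ]; then
        cp -p "$etc_dir/environment.xml" "$etc_dir/environment.xml.$stamp" || return 1
    fi
    cp "$new_conf" "$etc_dir/environment.xml"
}</programlisting>
<para>For example, calling activate_environment with /etc/HPCCSystems
and a file from /etc/HPCCSystems/source/ would replace the local active
configuration, assuming the caller has write access to that
folder.</para>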
</sect1>
<sect1>
<title>Thor data files</title>
<para>Thor clusters are normally configured to automatically replicate
data to a secondary location known as the mirror location. Usually, this
is on the second drive of the subsequent node.</para>
<para>If the data is not found at the primary location (for example, due
to drive failure or because a node has been swapped out), Thor looks in
the mirror directory to read the data. Any writes go to the primary and
then to the mirror. This provides continual redundancy and a quick means
to restore a system after a node swap.</para>
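<para>The read fallback described above can be pictured with a small
shell sketch. It is only a conceptual illustration: Thor handles the
fallback internally, and the locate_part function and its directory
arguments are hypothetical.</para>
<programlisting>#!/bin/sh
# Conceptual sketch only: Thor performs this lookup internally.
# Print the path a file part would be read from, trying the primary
# location first and the mirror location second.

locate_part() {
    # $1 primary directory, $2 mirror directory, $3 file part name
    if [ -f "$1/$3" ]; then
        echo "$1/$3"
    elif [ -f "$2/$3" ]; then
        echo "$2/$3"
    else
        # Neither copy exists: print nothing and report failure.
        return 1
    fi
}</programlisting>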
<para>A Thor data backup should be performed on a regularly scheduled
basis and on demand after a node swap.</para>
<sect2>
<title>Manual backup</title>
<para>To run a backup manually, follow these steps:</para>
<orderedlist>
<listitem>
<para>Log in to the Thor Master node.</para>
<para>If you don't know which node is your Thor Master node, you
can look it up using ECL Watch.</para>
</listitem>
<listitem>
<para>Run these commands:</para>
<programlisting>sudo su hpcc
/opt/HPCCSystems/bin/start_backupnode &lt;thor_cluster_name&gt; </programlisting>
<para>This starts the backup process.</para>
<para></para>
<graphic fileref="images/backupnode.jpg" />
<para>Wait until completion. The process reports "backupnode
finished" as shown above.</para>
</listitem>
<listitem>
<para>Run the XREF utility in ECL Watch to verify that there are
no orphan files or lost files.</para>
</listitem>
</orderedlist>
</sect2>
<sect2 role="brk">
<title>Scheduled backup</title>
<para>The easiest way to schedule the backup process is to create a
cron job. Cron is a daemon that serves as a task scheduler.</para>
<para>The cron tab (short for CRON TABle) is a text file that contains
the task list. To edit it with the default editor, use the
command:</para>
<programlisting>sudo crontab -e</programlisting>
<para>Here is a sample cron tab entry:</para>
<programlisting>30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor</programlisting>
<para>30 represents the minute of the hour.</para>
<para>23 represents the hour of the day (11 PM).</para>
<para>The asterisks (*) represent every day of the month, every month,
and every day of the week.</para>
<para>mythor is the cluster name.</para>
<para>To list the scheduled tasks, use the command:</para>
<programlisting>sudo crontab -l</programlisting>
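<para>To double-check how an entry will be read, the field layout
described above can be spelled out with a short shell sketch. The
describe_cron_entry function is hypothetical and handles only the simple
five-field form used in the sample entry. It does not validate values or
understand cron extensions.</para>
<programlisting>#!/bin/sh
# Sketch: label the five schedule fields and the command of one cron tab entry.

describe_cron_entry() {
    # Disable globbing so the asterisks are not expanded to file names
    # while the entry is split into words.
    set -f
    set -- $1
    set +f
    # An entry needs five schedule fields plus at least one command word.
    if [ $# -lt 6 ]; then
        return 1
    fi
    minute=$1; hour=$2; day=$3; month=$4; weekday=$5
    shift 5
    echo "minute=$minute hour=$hour day=$day month=$month weekday=$weekday"
    echo "command=$*"
}

describe_cron_entry "30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor"</programlisting>
<para>For the sample entry, the first line of output reads minute=30
hour=23 day=* month=* weekday=* and the second line shows the
start_backupnode command with the mythor cluster name.</para>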
<para></para>
</sect2>
</sect1>
<sect1 id="Roxie-Data-Backup">
<title>Roxie data files</title>
<para>Roxie data is protected by three forms of redundancy:</para>
<itemizedlist mark="bullet">
<listitem>
<para>Original Source Data File Retention: When a query is deployed,
the data is typically copied from a Thor cluster's hard drives.
Therefore, the Thor data can serve as backup, provided it is not
removed or altered on Thor. Thor data is typically retained for a
period of time sufficient to serve as a backup copy.</para>
</listitem>
<listitem>
<para>Peer-Node Redundancy: Each Slave node typically has one or
more peer nodes within its cluster. Each peer stores a copy of the
data files it will read.</para>
</listitem>
<listitem>
<para>Sibling Cluster Redundancy: Although not required, Roxie
deployments may run multiple identically configured Roxie clusters.
When two clusters are deployed for production, each node has an
identical twin in the other cluster, with the same data and queries
stored on it.</para>
</listitem>
</itemizedlist>
<para>Together, these provide multiple redundant copies of data
files.</para>
</sect1>
<sect1>
<title>Attribute Repositories</title>
<para>Attribute repositories are stored on ECL developers' local hard
drives. They can represent a significant number of hours of work and
therefore should be backed up regularly. In addition, we suggest using
some form of source version control.</para>
</sect1>
<sect1>
<title>Landing Zone files</title>
<para>Landing Zones contain raw data for input. They can also contain
output files. Depending on the size or complexity of these files, you
may want to retain copies for redundancy.</para>
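<para>One simple way to retain such a copy is a timestamped archive. The
following shell sketch is an illustration under stated assumptions: the
archive_landing_zone function, the archive naming scheme, and the choice
of a gzip-compressed tar file are hypothetical and are not part of the
HPCC tooling.</para>
<programlisting>#!/bin/sh
# Sketch: write a timestamped, compressed archive of a Landing Zone folder
# and print the path of the archive that was created.

archive_landing_zone() {
    # $1 Landing Zone directory, $2 folder that receives the archive
    stamp=$(date +%Y%m%d%H%M%S)
    archive="$2/landingzone-$stamp.tar.gz"
    # -C changes into the source folder so the archive holds relative paths.
    tar -czf "$archive" -C "$1" . || return 1
    echo "$archive"
}</programlisting>
<para>The destination folder should be on a different drive or machine
than the Landing Zone itself, otherwise the copy adds little
redundancy.</para>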
</sect1>
</chapter>
</book>