<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book lang="en_US" xml:base="../">
<bookinfo>
<title>HPCC Data Handling</title>
<mediaobject>
<imageobject>
<imagedata fileref="images/redswooshWithLogo3.jpg" />
</imageobject>
</mediaobject>
<author>
<surname>Boca Raton Documentation Team</surname>
</author>
<legalnotice>
<para>We welcome your comments and feedback about this document via
email to <email>docfeedback@hpccsystems.com</email>. Please include
<emphasis role="bold">Documentation Feedback</emphasis> in the subject
line and reference the document name, page numbers, and current Version
Number in the text of the message.</para>
<para>LexisNexis and the Knowledge Burst logo are registered trademarks
of Reed Elsevier Properties Inc., used under license. HPCC Systems is a
registered trademark of LexisNexis Risk Data Management Inc.</para>
<para>Other products, logos, and services may be trademarks or
registered trademarks of their respective companies. All names and
example data used in this manual are fictitious. Any similarity to
actual persons, living or dead, is purely coincidental.</para>
<para></para>
</legalnotice>
<xi:include href="common/Version.xml" xpointer="FooterInfo"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="common/Version.xml" xpointer="DateVer"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<corpname>HPCC Systems</corpname>
<xi:include href="common/Version.xml" xpointer="Copyright"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<mediaobject role="logo">
<imageobject>
<imagedata fileref="images/LN_Rightjustified.jpg" />
</imageobject>
</mediaobject>
</bookinfo>
<chapter id="Data_Handling">
<title><emphasis>HPCC Data Handling</emphasis></title>
<sect1 id="Introduction" role="nobrk">
<title>Introduction</title>
<para>Data can be transferred to, from, or within an HPCC system in
several ways. Each of these data transfers requires a few key
parameters.</para>
<sect2 id="Prerequisites-for-most-file-movements">
<title><emphasis role="bold">Prerequisites for most file
movements:</emphasis></title>
<itemizedlist>
<listitem>
<para>Logical filename</para>
</listitem>
<listitem>
<para>Physical filename</para>
</listitem>
<listitem>
<para>Record size (fixed)</para>
</listitem>
<listitem>
<para>Source directory</para>
</listitem>
<listitem>
<para>Destination directory</para>
</listitem>
<listitem>
<para>Dali IP address (source and/or destination)</para>
</listitem>
<listitem>
<para>Landing Zone IP address</para>
</listitem>
</itemizedlist>
<para>The above parameters are used for these major data handling
methods:</para>
<itemizedlist>
<listitem>
<para>Import - Spraying Data from the Landing Zone to Thor</para>
</listitem>
<listitem>
<para>Export - Despraying Data from Thor to Landing Zone</para>
</listitem>
<listitem>
<para>Copy - Replicating Data from Thor to Thor (within the same Dali
File System)</para>
</listitem>
<listitem>
<para>Copying Data from Thor to Thor (between different Dali File
Systems)</para>
</listitem>
</itemizedlist>
</sect2>
</sect1>
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Data_Handling_Terms"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Working_with_a_data_file"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="HPCCDataHandling/DH-Mods/DH-Mod1.xml"
xpointer="Data_Handling_Methods"
xmlns:xi="http://www.w3.org/2001/XInclude" />
</chapter>
<chapter>
<title><emphasis>HPCC Data Backups</emphasis></title>
<sect1 id="Introduction2" role="nobrk">
<title>Introduction</title>
<para>This section covers critical system data that requires regular
backup procedures to prevent data loss.</para>
<para>This data includes:</para>
<itemizedlist>
<listitem>
<para>The System Data Store (Dali data)</para>
</listitem>
<listitem>
<para>Environment Configuration files</para>
</listitem>
<listitem>
<para>Data Refinery (Thor) data files</para>
</listitem>
<listitem>
<para>Rapid Data Delivery Engine (Roxie) data files</para>
</listitem>
<listitem>
<para>Attribute Repositories</para>
</listitem>
<listitem>
<para>Landing Zone files</para>
</listitem>
</itemizedlist>
</sect1>
<sect1>
<title>Dali data</title>
<para>The Dali Server data is typically mirrored to its backup node.
This location is specified in the environment configuration file using
the Configuration Manager.</para>
<para>Since the data is written simultaneously to both nodes, there is
no need for a manual backup procedure.</para>
</sect1>
<sect1>
<title>Environment Configuration files</title>
<para>There is only one active environment file, but you may have many
alternative configurations.</para>
<para>Configuration Manager works only on files in the
/etc/HPCCSystems/source/ folder. To make a configuration active, it is
copied to /etc/HPCCSystems/environment.xml on all nodes.</para>
<para>Configuration Manager automatically creates backup copies in the
/etc/HPCCSystems/source/backup/ folder.</para>
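<para>If you also want to keep your own copy of the active environment
file, a timestamped copy is one simple option. The following is a minimal
sketch, not an HPCC utility: the function name backup_env_file and the
destination directory are assumptions made for illustration, and only the
/etc/HPCCSystems/environment.xml path comes from this manual.</para>
<programlisting>
```shell
#!/bin/sh
# Sketch: copy an environment file to a timestamped backup.
# backup_env_file and its arguments are hypothetical, not an HPCC tool.
backup_env_file() {
    src="$1"        # for example /etc/HPCCSystems/environment.xml
    dest_dir="$2"   # directory that receives the copy (an assumption)
    stamp="$(date +%Y%m%d-%H%M%S)"
    target="$dest_dir/$(basename "$src").$stamp"
    mkdir -p "$dest_dir" || return 1
    # -p preserves the timestamps and permissions of the original file.
    cp -p "$src" "$target" || return 1
    # Print the path of the copy so callers can log or verify it.
    printf '%s\n' "$target"
}
```
</programlisting>
<para>For example, calling backup_env_file with
/etc/HPCCSystems/environment.xml and a destination directory of your
choice writes a timestamped copy and prints its path.</para>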
</sect1>
<sect1>
<title>Thor data files</title>
<para>Thor clusters are normally configured to automatically replicate
data to a secondary location known as the mirror location. Usually, this
is on the second drive of the subsequent node.</para>
<para>If the data is not found at the primary location (for example, due
to drive failure or because a node has been swapped out), Thor looks in
the mirror directory to read the data. Any writes go to the primary and
then to the mirror. This provides continual redundancy and a quick means
to restore a system after a node swap.</para>
<para>A Thor data backup should be performed on a regularly scheduled
basis and on demand after a node swap.</para>
<sect2>
<title>Manual backup</title>
<para>To run a backup manually, follow these steps:</para>
<orderedlist>
<listitem>
<para>Log in to the Thor Master node.</para>
<para>If you don't know which node is your Thor Master node, you
can look it up using ECL Watch.</para>
</listitem>
<listitem>
<para>Run this command:</para>
<programlisting>sudo su hpcc
/opt/HPCCSystems/bin/start_backupnode &lt;thor_cluster_name&gt;</programlisting>
<para>This starts the backup process.</para>
<para></para>
<graphic fileref="images/backupnode.jpg" />
<para>Wait until the process completes. The output displays
"backupnode finished", as shown above.</para>
</listitem>
<listitem>
<para>Run the XREF utility in ECL Watch to verify that there are
no orphan files or lost files.</para>
</listitem>
</orderedlist>
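<para>If you capture the output of start_backupnode in a file, a short
check can confirm completion without watching the console. This is a
sketch under assumptions: it presumes the completion message appears in
the captured output exactly as "backupnode finished", and the helper name
backup_finished and the log file path are hypothetical.</para>
<programlisting>
```shell
#!/bin/sh
# Sketch: confirm that a captured backupnode log reports completion.
# backup_finished is a hypothetical helper; the log path is an assumption.
backup_finished() {
    log_file="$1"
    # grep -q exits 0 when the completion message appears anywhere in the log.
    grep -q "backupnode finished" "$log_file"
}
```
</programlisting>
<para>The function returns success only when the message is present, so
it can be used in an if statement before moving on to the XREF
step.</para>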
</sect2>
<sect2 role="brk">
<title>Scheduled backup</title>
<para>The easiest way to schedule the backup process is to create a
cron job. Cron is a daemon that serves as a task scheduler.</para>
<para>Crontab (short for CRON TABle) is a text file that contains the
task list. To edit it with the default editor, use the
command:</para>
<programlisting>sudo crontab -e</programlisting>
<para>Here is a sample crontab entry:</para>
<programlisting>30 23 * * * /opt/HPCCSystems/bin/start_backupnode mythor
</programlisting>
<para>30 represents the minute of the hour.</para>
<para>23 represents the hour of the day.</para>
<para>The asterisks (*) represent every day of the month, every month,
and every day of the week.</para>
<para>mythor is the cluster name.</para>
<para>To list the scheduled tasks, use the command:</para>
<programlisting>sudo crontab -l</programlisting>
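<para>To make the field order of the sample entry explicit, the following
sketch builds a crontab line from a minute, an hour, and a cluster name.
The helper name backup_cron_line is hypothetical and is not part of HPCC;
only the start_backupnode path comes from this manual.</para>
<programlisting>
```shell
#!/bin/sh
# Sketch: build a crontab entry for a daily Thor backup.
# backup_cron_line is a hypothetical helper, not part of HPCC.
backup_cron_line() {
    minute="$1"   # minute of the hour, 0-59
    hour="$2"     # hour of the day, 0-23
    cluster="$3"  # Thor cluster name, for example mythor
    # The three asterisks leave day of month, month, and weekday unrestricted.
    printf '%s %s * * * /opt/HPCCSystems/bin/start_backupnode %s\n' \
        "$minute" "$hour" "$cluster"
}
```
</programlisting>
<para>Calling backup_cron_line 30 23 mythor prints the sample entry shown
earlier in this section.</para>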
<para></para>
</sect2>
</sect1>
<sect1 id="Roxie-Data-Backup">
<title>Roxie data files</title>
<para>Roxie data is protected by three forms of redundancy:</para>
<itemizedlist mark="bullet">
<listitem>
<para>Original Source Data File Retention: When a query is deployed,
the data is typically copied from a Thor cluster's hard drives.
Therefore, the Thor data can serve as backup, provided it is not
removed or altered on Thor. Thor data is typically retained for a
period of time sufficient to serve as a backup copy.</para>
</listitem>
<listitem>
<para>Peer-Node Redundancy: Each Slave node typically has one or
more peer nodes within its cluster. Each peer stores a copy of the
data files it reads.</para>
</listitem>
<listitem>
<para>Sibling Cluster Redundancy: Although not required, Roxie
deployments may run multiple identically configured Roxie clusters.
When two clusters are deployed for Production, each node has an
identical twin in the other cluster in terms of the data and queries
stored on the node.</para>
</listitem>
</itemizedlist>
<para>This provides multiple redundant copies of data files.</para>
</sect1>
<sect1>
<title>Attribute Repositories</title>
<para>Attribute repositories are stored on ECL developers' local hard
drives. They can contain a significant number of hours of work and
therefore should be regularly backed up. In addition, we suggest using
some form of source version control.</para>
</sect1>
<sect1>
<title>Landing Zone files</title>
<para>Landing Zones contain raw data for input. They can also contain
output files. Depending on the size or complexity of these files, you
may want to retain copies for redundancy.</para>
</sect1>
</chapter>
</book>