  1. <?xml version="1.0" encoding="UTF-8"?>
  2. <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  3. "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
  4. <book xml:base="../">
  5. <title>HPCC System Administrator's Guide</title>
  6. <bookinfo>
  7. <title>HPCC System Administrator's Guide</title>
  8. <mediaobject>
  9. <imageobject>
  10. <imagedata fileref="images/redswooshWithLogo3.jpg" />
  11. </imageobject>
  12. </mediaobject>
  13. <author>
  14. <surname>Boca Raton Documentation Team</surname>
  15. </author>
  16. <legalnotice>
  17. <para>We welcome your comments and feedback about this document via
  18. email to <email>docfeedback@hpccsystems.com</email></para>
  19. <para>Please include <emphasis role="bold">Documentation
  20. Feedback</emphasis> in the subject line and reference the document name,
  21. page numbers, and current Version Number in the text of the
  22. message.</para>
  23. <para>LexisNexis and the Knowledge Burst logo are registered trademarks
  24. of Reed Elsevier Properties Inc., used under license.</para>
  25. <para>HPCC Systems<superscript>®</superscript> is a registered trademark
  26. of LexisNexis Risk Data Management Inc.</para>
  27. <para>Other products, logos, and services may be trademarks or
  28. registered trademarks of their respective companies.</para>
  29. <para>All names and example data used in this manual are fictitious. Any
  30. similarity to actual persons, living or dead, is purely
  31. coincidental.</para>
  32. <para></para>
  33. </legalnotice>
  34. <xi:include href="common/Version.xml" xpointer="FooterInfo"
  35. xmlns:xi="http://www.w3.org/2001/XInclude" />
  36. <xi:include href="common/Version.xml" xpointer="DateVer"
  37. xmlns:xi="http://www.w3.org/2001/XInclude" />
  38. <corpname>HPCC Systems<superscript>®</superscript></corpname>
  39. <xi:include href="common/Version.xml" xpointer="Copyright"
  40. xmlns:xi="http://www.w3.org/2001/XInclude" />
  41. <mediaobject role="logo">
  42. <imageobject>
  43. <imagedata fileref="images/LN_Rightjustified.jpg" />
  44. </imageobject>
  45. </mediaobject>
  46. </bookinfo>
  47. <chapter>
  48. <title>Introducing HPCC Systems<superscript>®</superscript>
49. Administration</title>
  50. <sect1 id="HPCC_SysAdminIntro" role="nobrk">
  51. <title>Introduction</title>
  52. <para>HPCC (High Performance Computing Cluster) is a massive
  53. parallel-processing computing platform that solves Big Data
  54. problems.</para>
  55. <para>HPCC stores and processes large quantities of data, processing
  56. billions of records per second using massive parallel processing
  57. technology. Large amounts of data across disparate data sources can be
  58. accessed, analyzed, and manipulated in fractions of seconds. HPCC
  59. functions as both a processing and a distributed data storage
  60. environment, capable of analyzing terabytes of information.</para>
  61. </sect1>
  62. <sect1 id="HPCC_Architectural_Overview">
  63. <title>Architectural Overview</title>
  64. <para>An HPCC Systems<superscript>®</superscript> Platform consists of
  65. the following components: Thor, Roxie, ESP Server, Dali, Sasha, DFU
  66. Server, and ECLCC Server. LDAP security is optionally available.</para>
  67. <para><figure>
  68. <title>HPCC Architectural Diagram</title>
  69. <mediaobject>
  70. <imageobject>
  71. <imagedata fileref="images/SA004.jpg" />
  72. </imageobject>
  73. </mediaobject>
  74. </figure></para>
  75. <?hard-pagebreak ?>
  76. <para>Data loading is controlled through the Distributed File Utility
  77. (DFU) server.</para>
  78. <para>Data typically arrives on the landing zone (for example, by FTP).
  79. File movement (across components) is initiated by DFU. Data is copied
  80. from the landing zone and is distributed (sprayed) to the Data Refinery
  81. (Thor) by the ECL code. Data can be further processed via ETL (Extract,
  82. Transform, and Load process) in the refinery.</para>
  83. <para>A single physical file is distributed into multiple physical files
  84. across the nodes of a cluster. The aggregate of the physical files
  85. creates one logical file that is addressed by the ECL code.</para>
  86. <para><figure>
  87. <title>Data Processing</title>
  88. <mediaobject>
  89. <imageobject>
  90. <imagedata fileref="images/SA002.jpg" />
  91. </imageobject>
  92. </mediaobject>
  93. </figure></para>
  94. <para>The data retrieval process (despraying) places the file back on
  95. the landing zone.</para>
  96. <sect2 role="brk">
  97. <title>Clusters</title>
98. <para>An HPCC environment contains clusters which you define and use
99. according to your needs. The types of clusters used in HPCC are:</para>
  100. <sect3>
  101. <title>Thor</title>
  102. <para>Data Refinery (Thor) – Used to process every one of billions
  103. of records in order to create billions of "improved" records. ECL
  104. Agent (hThor) is also used to process simple jobs that would be an
  105. inefficient use of the Thor cluster.</para>
  106. </sect3>
  107. <sect3>
  108. <title>Roxie</title>
  109. <para>Rapid Data Delivery Engine (Roxie) – Used to search quickly
  110. for a particular record or set of records.</para>
  111. <para>Queries are compiled and published, usually in ECL Watch. Data
  112. moves in parallel from Thor nodes to the receiving Roxie nodes.
  113. Parallel bandwidth utilization improves the speed of putting new
  114. data into play.</para>
  115. </sect3>
  116. <sect3>
  117. <title>ECL Agent</title>
  118. <para>The ECL Agent's primary function is to send the job to execute
  119. on the appropriate cluster. The ECL Agent can act as a single-node
  120. cluster. That is called spawning an hThor cluster. hThor is used to
  121. process simple jobs that would otherwise be an inefficient use of
  122. Thor. For simple tasks, the ECL Agent will make a determination and
  123. perform the execution itself by acting as an hThor cluster. <figure>
  124. <title>Clusters</title>
  125. <mediaobject>
  126. <imageobject>
  127. <imagedata fileref="images/SA003.jpg" />
  128. </imageobject>
  129. </mediaobject>
  130. </figure></para>
  131. </sect3>
  132. </sect2>
  133. <sect2 role="brk">
  134. <title>System Servers</title>
  135. <para>The System Servers are integral middleware components of an HPCC
  136. system. They are used to control workflow and intercomponent
  137. communication.</para>
  138. <sect3>
  139. <title>Dali</title>
  140. <para>Dali is also known as the system data store. It manages
141. workunit records, the logical file directory, and shared object
  142. services.</para>
  143. <para>It maintains the message queues that drive job execution and
144. scheduling. It also enforces all LDAP security
  145. restrictions.</para>
  146. </sect3>
  147. <sect3>
  148. <title>Sasha</title>
  149. <para>The Sasha server is a companion “housekeeping” server to the
  150. Dali server. It works independently of all other components. Sasha’s
  151. main function is to reduce the stress on the Dali server. Whenever
  152. possible, Sasha reduces the resource utilization on Dali.</para>
  153. <para>Sasha archives workunits (including DFU Workunits) which are
  154. stored in a series of folders.</para>
  155. <para>Sasha also performs routine housekeeping such as removing
  156. cached workunits and DFU recovery files.</para>
  157. </sect3>
  158. <sect3>
  159. <title>DFU Server</title>
  160. <para>DFU server controls the spraying and despraying operations
  161. used to move data in and out of Thor.</para>
  162. <para>DFU services are available from: <itemizedlist>
  163. <listitem>
  164. <para>Standard libraries in ECL code.</para>
  165. </listitem>
  166. <listitem>
  167. <para>Client interfaces: Eclipse, ECL Playground, ECL IDE, and
  168. the ECL command line interface.</para>
  169. </listitem>
  170. <listitem>
171. <para>DFU Plus command line interface (see the example below).</para>
  172. </listitem>
  173. </itemizedlist></para>
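<para>For illustration, a spray initiated from the DFU Plus command
line might take the following general form. This is a sketch only: the
server address, credentials, source IP, file path, and logical file
name are placeholders for values from your own environment.</para>
<programlisting>dfuplus action=spray server=http://192.168.0.10:8010 username=hpccuser password=mypass \
  srcip=192.168.0.20 srcfile=/var/lib/HPCCSystems/mydropzone/people.csv \
  dstcluster=mythor dstname=tutorial::example::people format=csv</programlisting>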
  174. </sect3>
  175. <sect3>
  176. <title>ECLCC Server</title>
  177. <para>ECLCC Server is the compiler that translates ECL code. When
  178. you submit ECL code, the ECLCC Server generates optimized C++ which
  179. is then compiled and executed. ECLCC Server controls the whole
  180. compilation process.</para>
  181. <para>When you submit workunits for execution on Thor, they are
  182. first converted to executable code by the ECLCC Server.</para>
  183. <para>When you submit a workunit to Roxie, code is compiled and
  184. later published to the Roxie cluster, where it is available to
  185. execute multiple times.</para>
  186. <para>ECLCC Server is also used when the ECL IDE requests a syntax
  187. check.</para>
188. <para>ECLCC Server uses a queue to convert workunits one at a time;
189. however, you can have multiple ECLCC Servers deployed in the system to
190. increase throughput, and they will automatically load balance as
  191. required.</para>
  192. </sect3>
  193. <sect3>
  194. <title>ECL Agent</title>
  195. <para>ECL Agent (hThor) is a single node process for executing
  196. simple ECL Queries.</para>
  197. <para>ECL Agent is an execution engine that processes workunits by
  198. sending them to the appropriate cluster. ECL Agent processes are
  199. spawned on-demand when you submit a workunit.</para>
  200. </sect3>
  201. <sect3>
  202. <title>ESP Server</title>
  203. <para>ESP (Enterprise Service Platform) Server is the
  204. inter-component communication server. ESP Server is a framework that
  205. allows multiple services to be “plugged in” to provide various types
  206. of functionality to client applications via multiple
  207. protocols.</para>
  208. <para>Examples of services that are plugged into ESP
  209. include:<itemizedlist>
  210. <listitem>
  211. <para><emphasis role="bold">WsECL:</emphasis> Interface to
  212. published queries on a Roxie, Thor, or hThor cluster.</para>
  213. </listitem>
  214. <listitem>
  215. <para><emphasis role="bold">ECL Watch:</emphasis> A web-based
  216. query execution, monitoring, and file management interface. It
  217. can be accessed via the ECL IDE or a web browser. See
  218. <emphasis>Using ECL Watch</emphasis>.</para>
  219. </listitem>
  220. </itemizedlist></para>
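<para>For reference, the two services listed above are typically
reached in a browser at the ESP node's address on their default ports
(the IP address below is a placeholder):</para>
<programlisting># ECL Watch (default port 8010)
http://192.168.0.10:8010
# WsECL (default port 8002)
http://192.168.0.10:8002</programlisting>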
221. <para>The ESP Server supports both XML and JSON formats.</para>
  222. <!--formerly : protocols - HTTP, HTTPS, SOAP, and JSON - -->
  223. </sect3>
  224. <sect3>
  225. <title>LDAP</title>
  226. <para>You can incorporate a Lightweight Directory Access Protocol
  227. (LDAP) server to work with Dali to enforce the security restrictions
  228. for file scopes, workunit scopes, and feature access.</para>
  229. <para>When LDAP is configured, you need to authenticate when
  230. accessing ECL Watch, WsECL, ECL IDE, or any other client tools.
  231. Those credentials are then used to authenticate any requests from
  232. those tools.</para>
  233. </sect3>
  234. <!-- *** COMMENTING OUT WHOLE Of MONITORING SECTION
  235. <sect3>
  236. <title>HPCC Reporting</title>
  237. <para>HPCC leverages the use of Ganglia reporting and monitoring
  238. components to monitor several aspects of the HPCC System.</para>
  239. <para>See <emphasis>HPCC Monitoring and Reporting</emphasis> for
  240. more information on how to add monitoring and reporting to your HPCC
  241. System.</para>
  242. <para>More to come***</para>
  243. </sect3>
  244. END COMMENT ***-->
  245. </sect2>
  246. <sect2>
  247. <title>Client Interfaces</title>
  248. <para>The following Client Interfaces are available to interact with
  249. the HPCC Platform.</para>
  250. <sect3>
  251. <title>Eclipse</title>
  252. <para>With the ECL plug-in for Eclipse, you can use the Eclipse IDE
  253. to create and execute queries into your data on an HPCC platform
254. using Enterprise Control Language (ECL). Eclipse is open-source and
255. multi-platform, and it can be used to interface with your data and
  256. workunits on HPCC. The ECL plug-in for Eclipse is also
  257. open-source.</para>
  258. </sect3>
  259. <sect3>
  260. <title>ECL IDE</title>
  261. <para>ECL IDE is a full-featured GUI providing access to your ECL
  262. code for ECL development. ECL IDE uses various ESP services via
  263. SOAP.</para>
  264. <para>The ECL IDE provides access to ECL Definitions to build your
  265. queries. These definitions are created by coding an expression that
  266. defines how some calculation or record set derivation is to be done.
  267. Once defined, they can be used in succeeding ECL definitions.</para>
  268. </sect3>
  269. <sect3>
  270. <title>ECL Watch</title>
  271. <para>ECL Watch is a web-based query execution, monitoring, and file
  272. management interface. It can be accessed via ECL IDE, Eclipse, or a
  273. web browser. ECL Watch allows you to see information about and
274. manipulate workunits. It also allows you to monitor cluster activity
  275. and perform other administrative tasks.</para>
  276. <para>Using ECL Watch you can:<itemizedlist>
  277. <listitem>
  278. <para>Browse through previously submitted workunits (WU). You
  279. can see a visual representation (graphs) of the data flow
  280. within the WU, complete with statistics which are updated as
  281. the job progresses.</para>
  282. </listitem>
  283. <listitem>
  284. <para>Search through files and see information including
  285. record counts and layouts or sample records.</para>
  286. </listitem>
  287. <listitem>
  288. <para>See the status of all system servers.</para>
  289. </listitem>
  290. <listitem>
  291. <para>View log files.</para>
  292. </listitem>
  293. <listitem>
  294. <para>Add users or groups and modify permissions.</para>
  295. </listitem>
  296. </itemizedlist></para>
  297. <para>See the <emphasis>Using ECL Watch </emphasis>Manual for more
  298. details.</para>
  299. </sect3>
  300. <sect3>
  301. <title><emphasis role="bold">Command Line Tools</emphasis></title>
  302. <para>Command line tools: <emphasis role="bold">ECL, DFU
  303. Plus</emphasis>, and <emphasis role="bold">ECL Plus</emphasis>
  304. provide command line access to functionality provided by the ECL
  305. Watch web pages. They work by communicating with the corresponding
  306. ESP service via SOAP.</para>
  307. <para>See the <emphasis>Client Tools </emphasis>Manual for more
  308. details.</para>
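<para>As a brief illustration, the ECL command line tool can submit a
query to a target cluster. This is a sketch of the general form; the
server address and file name are placeholders, and the complete option
list is documented in the <emphasis>Client Tools</emphasis>
Manual.</para>
<programlisting>ecl run thor myquery.ecl --server=192.168.0.10 --port=8010</programlisting>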
  309. </sect3>
  310. </sect2>
  311. </sect1>
  312. <!--Inclusion-from-ClientTool-As-Sect1: REMOVED-->
  313. </chapter>
  314. <chapter>
  315. <title>Hardware and Software Requirements</title>
316. <para>This chapter describes the hardware and software requirements on
317. which HPCC runs well. HPCC is designed to run on commodity hardware,
  318. which makes building and maintaining large scale (petabytes) clusters
  319. economically feasible. When planning your cluster hardware, you will need
  320. to balance a number of considerations.</para>
321. <para>This section provides some insight into the sort of hardware and
322. infrastructure on which HPCC works optimally. It is neither a
323. comprehensive set of instructions nor a mandate on what hardware you must
324. have. Consider this a guide to use when looking to implement or scale
  325. your HPCC system. These suggestions should be taken into consideration for
  326. your specific enterprise needs.</para>
  327. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/Hardware.xml"
  328. xpointer="HW-Switch"
  329. xmlns:xi="http://www.w3.org/2001/XInclude" />
  330. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/Hardware.xml"
  331. xpointer="HW-LoadBalancer"
  332. xmlns:xi="http://www.w3.org/2001/XInclude" />
  333. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/Hardware.xml"
  334. xpointer="Nodes-Hardware"
  335. xmlns:xi="http://www.w3.org/2001/XInclude" />
  336. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  337. xpointer="System_sizings"
  338. xmlns:xi="http://www.w3.org/2001/XInclude" />
  339. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/Hardware.xml"
  340. xpointer="Nodes-Software"
  341. xmlns:xi="http://www.w3.org/2001/XInclude" />
  342. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/Hardware.xml"
  343. xpointer="workstation-requirements"
  344. xmlns:xi="http://www.w3.org/2001/XInclude" />
  345. </chapter>
  346. <chapter>
  347. <title>Hardware and Component Sizing</title>
348. <para>This section provides some insight into the sort of hardware and
349. infrastructure on which HPCC works optimally. It is neither a
350. comprehensive set of instructions nor a mandate on what hardware you must
351. have. Consider this a guide to use when looking to implement or scale
  352. your HPCC system. These suggestions should be taken into consideration for
  353. your specific enterprise needs.</para>
  354. <para>HPCC is designed to run on commodity hardware, which makes building
  355. and maintaining large scale (petabytes) clusters economically feasible.
  356. When planning your cluster hardware, you will need to balance a number of
  357. considerations, including fail-over domains and potential performance
  358. issues. Hardware planning should include distributing HPCC across multiple
359. physical hosts, such as a cluster. Generally, it is a best practice
  360. to run HPCC processes of a particular type, for example Thor, Roxie, or
  361. Dali, on a host configured specifically for that type of process.</para>
  362. <sect1>
  363. <title>Thor Hardware</title>
  364. <para>Thor slave nodes require a proper balance of CPU, RAM, network,
  365. and disk I/O in order to operate most efficiently. A single Thor slave
  366. node works optimally when allocated 4 CPU cores, 8GB RAM, 1Gb/sec
  367. network and 200MB/sec sequential read/write disk I/O.</para>
  368. <para>Hardware architecture can provide higher value within a single
369. physical server. In such cases, you can use a multi-slave setup to configure your
  370. larger physical servers to run multiple Thor slave nodes per physical
  371. server.</para>
  372. <para>It is important to note that HPCC by nature is a parallel
  373. processing system and all Thor slave nodes will be exercising at
  374. precisely the same time. So when allocating more than one HPCC Thor
375. slave per physical machine, ensure that each slave meets the recommended
  376. requirements.</para>
  377. <para>For instance, 1 physical server with 48 cores, 96GB RAM, 10Gb/sec
  378. network and 2GB/sec sequential I/O would be capable of running ten (10)
  379. HPCC Thor slaves at optimal efficiency. The order of optimization for
  380. resource usage in a Thor slave node is disk I/O 60%, network 30%, and
  381. CPU 10%. Any increase in sequential I/O will have the most impact on
  382. speed, followed by improvements in network, followed by improvements in
  383. CPU.</para>
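<para>To see how that example works out, divide each resource on the
physical server by the per-slave recommendation; the smallest result
is the limiting factor:</para>
<programlisting>CPU cores:      48 cores   / 4 cores per slave    = 12 slaves
RAM:            96 GB      / 8 GB per slave       = 12 slaves
Network:        10 Gb/sec  / 1 Gb/sec per slave   = 10 slaves
Sequential I/O: 2 GB/sec   / 200 MB/sec per slave = 10 slaves

Limiting factor (minimum of the above)            = 10 Thor slaves</programlisting>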
  384. <para>Network architecture is also an important consideration. HPCC Thor
  385. nodes work optimally in a streamlined network architecture between all
  386. Thor slave processes.</para>
  387. <para>RAID is recommended and all RAID levels suitable for sequential
  388. read/write operations and high availability are acceptable. For example,
  389. RAID1, RAID10, RAID5 (preferred), and RAID6.</para>
  390. </sect1>
  391. <sect1>
  392. <title>Roxie Hardware Configurations</title>
393. <para>HPCC Roxie processes require a proper, yet different (from
  394. Thor) balance of CPU, RAM, network, and disk I/O in order to ensure
  395. efficient operations. A single HPCC Roxie node works optimally when
  396. allocated 6 or more CPU cores, 24GB RAM, 1Gb/sec network backbone, and
397. 400 4k random read IOPS.</para>
398. <para>Each HPCC Roxie node is presented with two hard drives, each capable
399. of 200 4k random seek IOPS. Hard drive recommendations for Roxie
400. efficiency are 15K SAS or SSD. A good rule of thumb: the more random
401. read IOPS, the better and faster your Roxie will perform.</para>
  402. <para>Running multiple HPCC Roxie nodes on a single physical server is
  403. not recommended, except in the cases of virtualization or
  404. containers.</para>
  405. </sect1>
  406. <sect1>
  407. <title>Dali and Sasha Hardware Configurations</title>
  408. <para>HPCC Dali processes store cluster metadata in RAM. For optimal
  409. efficiency, provide at least 48GB of RAM, 6 or more CPU cores, 1Gb/sec
  410. network interface and a high availability disk for a single HPCC Dali.
411. The Dali process is one of the few active/passive HPCC components.
  412. Using standard “swinging disk” clustering is recommended for a high
  413. availability setup. For a single HPCC Dali process, any suitable High
  414. Availability (HA) RAID level is fine.</para>
415. <para>Sasha does not store any original data; it reads data from Dali and
416. then processes it. Sasha does, however, store archived workunits (WUs) on disk.
  417. Allocating a larger disk for Sasha reduces the amount of housekeeping
  418. needed. Since Sasha assists Dali by performing housekeeping, it works
  419. best when on its own node. You should avoid putting Sasha and Dali on
  420. the same node.</para>
  421. </sect1>
  422. <sect1>
  423. <title>Other HPCC Components</title>
  424. <para>ECL Agent, ECLCC Server, DFU Server, the Thor master, and ECL
425. Watch are administrative processes that act as supporting
426. components for the main clusters.</para>
  427. <para>For maximum efficiency you should provide 24GB RAM, 6+ CPU cores,
  428. 1Gb/sec network and high availability disk(s). These components can be
  429. made highly available in an active/active fashion.</para>
  430. </sect1>
  431. </chapter>
  432. <chapter id="Routine_Maintenance">
  433. <title>Routine Maintenance</title>
  434. <para>In order to ensure that your HPCC system keeps running optimally,
  435. some care and maintenance is required. The following sections address
  436. routine maintenance tasks for your HPCC system.</para>
  437. <!--***SYSTEM HEALTH CHECK UP***TO COME***-->
  438. <sect1 role="nobrk">
  439. <title>Back Up Data</title>
  440. <para>An integral part of routine maintenance is the back up of
  441. essential data. Devise a back up strategy to meet the needs of your
  442. organization. This section is not meant to replace your current back up
443. strategy; instead, it supplements it by outlining special
  444. considerations for HPCC Systems<superscript>®</superscript>.</para>
  445. <sect2>
  446. <title>Back Up Considerations</title>
447. <para>You probably already have some sort of back up strategy in
448. place. When you add HPCC Systems<superscript>®</superscript> into your
449. operating environment, there are some additional considerations to be
  450. aware of. The following sections discuss back up considerations for
  451. the individual HPCC system components.</para>
  452. <sect3>
  453. <title>Dali</title>
454. <para>Dali can be configured to create its own back up; ideally, you
455. would want that back up kept on a different server or node. You can
456. specify the Dali back up folder location using the Configuration
457. Manager. You may want to keep multiple copies of that back up to be
458. able to restore to a certain point in time. For example, you may
459. want to take daily or weekly snapshots.</para>
  460. <para>You may want to keep back up copies at a system level using
  461. traditional back up methods.</para>
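<para>As a sketch of one system-level approach, you could copy the
configured Dali back up folder to another server on a schedule. The
paths and destination below are illustrative assumptions; substitute
the back up location you set in Configuration Manager.</para>
<programlisting># copy the Dali back up folder to a remote back up server (illustrative paths)
rsync -a /var/lib/HPCCSystems/hpcc-mirror/dali/ backupserver:/backups/dali/$(date +%F)/</programlisting>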
  462. </sect3>
  463. <sect3>
  464. <title>Sasha</title>
  465. <para>Sasha itself generates no original data but archives workunits
  466. to disks. Be aware that Sasha can create quite a bit of archive
  467. data. Once the workunits are archived they are no longer available
  468. in the Dali data store. The archives can still be retrieved, but
  469. that archive now becomes the only copy of these workunits.</para>
  470. <para>If you need high availability for these archived workunits,
  471. you should back them up at a system level using traditional back up
  472. methods.</para>
  473. </sect3>
  474. <sect3>
  475. <title>DFU Server</title>
  476. <para>DFU Server has no data. DFU workunits are stored in Dali until
  477. they are archived by Sasha.</para>
  478. </sect3>
  479. <sect3>
  480. <title>ECLCC Server</title>
  481. <para>ECLCC Server stores no data. ECL workunits are stored in Dali
  482. and archived by Sasha.</para>
  483. <!--***COMMENT:<para><emphasis role="bold">Note:</emphasis> No compiler is shipped
  484. with the HPCC System. The ECLCC Server compiles ECL code into C++,
  485. however you must have a C++ compiler to use on your system. </para> -->
  486. </sect3>
  487. <sect3>
  488. <title>ECL Agent</title>
  489. <para>ECL Agent stores no data.</para>
  490. </sect3>
  491. <sect3>
  492. <title>ECL Scheduler</title>
  493. <para>ECL Scheduler stores no data. ECL Workunits are stored in
  494. Dali.</para>
  495. </sect3>
  496. <sect3>
  497. <title>ESP Server</title>
  498. <para>ESP Server stores no data. If you are using SSL certificates,
499. the public and private keys should be backed up using traditional
  500. methods.</para>
  501. </sect3>
  502. <sect3>
  503. <title>Thor</title>
504. <para>Thor, the data refinery, is one of the critical components of
505. HPCC Systems<superscript>®</superscript> and needs to be backed up. Back
  506. up Thor by configuring replication and setting up a nightly back up
  507. cron task. Back up Thor on demand before and/or after any node swap
  508. or drive swap if you do not have a RAID configured.</para>
  509. <para>A very important part of administering Thor is to check the
  510. logs to ensure the previous back ups completed successfully.</para>
  511. <para><emphasis role="bold">Backupnode</emphasis></para>
  512. <para>Backupnode is a tool that is packaged with HPCC. Backupnode
  513. allows you to back up Thor nodes on demand or in a script. You can
  514. also use backupnode regularly in a crontab. You would always want to
  515. run it on the Thor master of that cluster.</para>
  516. <para>The following example is one suggested way for invoking
  517. backupnode manually.</para>
  518. <programlisting> /bin/su - hpcc -c "/opt/HPCCSystems/bin/start_backupnode thor" &amp; </programlisting>
  519. <para>The command line parameter must match the name of your Thor
  520. cluster. In your production environment, it is likely that you would
  521. provide descriptive names for your Thor clusters.</para>
  522. <para>For example, if your Thor cluster is named thor400_7s, you
  523. would call start_backupnode thor400_7s.</para>
  524. <programlisting> /bin/su - hpcc -c "/opt/HPCCSystems/bin/start_backupnode thor400_7s" &amp; </programlisting>
  525. <para>To run backupnode regularly you could use cron. For example,
  526. you may want a crontab entry (to back up thor400_7s) set to run at
  527. 1am daily:</para>
  528. <programlisting> 0 1 * * * /bin/su - hpcc -c "/opt/HPCCSystems/bin/start_backupnode thor400_7s" &amp; </programlisting>
  529. <para>Backupnode writes out its activity to a log file. That log can
  530. be found at /var/log/HPCCSystems/backupnode/MM_DD_YYYY_HH_MM_SS.log
  531. with the (MM) Month, (DD) Day, (YYYY) 4-digit Year, (HH) Hour, (MM)
  532. Minutes, and (SS) Seconds of the back up in the log file name. The
533. main log file exists on the Thor master node. It shows which nodes it
534. ran on and whether it finished. You can find other backupnode logs on
535. each of the Thor slaves showing what files, if any, it needed to
  536. “restore”.</para>
  537. <para>It is important to check the logs to ensure the previous back
  538. ups completed successfully. The following entry is from the
  539. backupnode log showing that back up completed successfully:</para>
  540. <programlisting>00000028 2014-02-19 12:01:08 26457 26457 "Completed in 0m 0s with 0 errors"
  541. 00000029 2014-02-19 12:01:08 26457 26457 "backupnode finished" </programlisting>
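<para>One way to confirm this from the command line is to look at the
end of the most recent backupnode log, assuming the default log
location described above:</para>
<programlisting>tail -2 $(ls -t /var/log/HPCCSystems/backupnode/*.log | head -1)</programlisting>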
  542. </sect3>
  543. <sect3>
  544. <title>Roxie</title>
  545. <para>Roxie data is protected by three forms of redundancy:</para>
  546. <itemizedlist>
  547. <listitem>
  548. <para><emphasis role="bold">Original Source Data File
  549. Retention:</emphasis> When a query is published, the data is
  550. typically copied from a remote site, either a Thor or a Roxie.
  551. The Thor data can serve as back up, provided it is not removed
  552. or altered on Thor. Thor data is typically retained for a period
  553. of time sufficient to serve as a back up copy.</para>
  554. </listitem>
  555. <listitem>
  556. <para><emphasis role="bold">Peer-Node Redundancy:</emphasis>
  557. Each Slave node typically has one or more peer nodes within its
  558. cluster. Each peer stores a copy of data files it will
  559. read.</para>
  560. </listitem>
  561. <listitem>
  562. <para><emphasis role="bold">Sibling Cluster
563. Redundancy:</emphasis> Although not required, you may run
  564. multiple identically-configured Roxie clusters. When two
  565. clusters are deployed for Production each node has an identical
  566. twin in terms of queries and/or data stored on the node in the
  567. other cluster. This configuration provides multiple redundant
  568. copies of data files. With three sibling Roxie clusters that
  569. have peer node redundancy, there are always six copies of each
570. file part at any given time, eliminating the need to use
  571. traditional back up procedures for Roxie data files.</para>
  572. </listitem>
  573. </itemizedlist>
  574. </sect3>
  575. <sect3>
  576. <title>Landing Zone</title>
  577. <para>The Landing Zone is used to host incoming and outgoing files.
  578. This should be treated similarly to an FTP server. Use traditional
  579. system level back ups.</para>
  580. </sect3>
  581. <sect3>
  582. <title>Misc</title>
  583. <para>Back up of any additional component add-ons, your environment
  584. files (environment.xml), or other custom configurations should be
  585. done according to traditional back up methods.</para>
  586. </sect3>
  587. </sect2>
  588. </sect1>
  589. <sect1 id="Log_Files">
  590. <title>Log Files</title>
  591. <para>You can review system messages and see any error messages as they
  592. are reported and captured in log files. Log files can help you in
593. understanding what is occurring on the system and are useful in
  594. troubleshooting.</para>
  595. <sect2 id="Component_Logs">
  596. <title>Component Logs</title>
  597. <para>There are log files for each component in directories below
  598. <emphasis role="bold">/var/log/HPCCSystems</emphasis> (default
  599. location). You can optionally configure the system to write the logs
  600. in a different directory. You should know where the log files are, and
  601. refer to the logs first when troubleshooting any issues.</para>
  602. <para>There are log files which record activity among the various
603. components. You can find the log files in subdirectories named to
604. correspond to the components that they track. For example, the Thor
605. logs would be found in a directory named mythor, the Sasha log would
606. be in the mysasha directory, and the ESP log in the myesp
  607. directory.</para>
  608. <para>In each of the component subdirectories, there are several log
  609. files. Most of the log files use a logical naming convention that
  610. includes the component name, the date, and time in the name of the log
  611. file. There is also usually a link for the component with a simple
612. name, such as esp.log, which is a shortcut to the latest log
  613. file for that component.</para>
  614. <para>Understanding the log files, and what is normally reported in
  615. the log files, helps in troubleshooting the HPCC system.</para>
  616. <para>As part of routine maintenance you may want to back up, archive,
  617. and remove the older log files.</para>
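<para>For example, from a terminal you could list a component's log
directory, follow its current log, or compress older logs. These
commands are illustrative and assume the default log location:</para>
<programlisting># list the ESP component's log directory
ls -l /var/log/HPCCSystems/myesp/
# follow the current ESP log
tail -f /var/log/HPCCSystems/myesp/esp.log
# compress logs older than 30 days as part of routine maintenance
find /var/log/HPCCSystems -name "*.log" -mtime +30 -exec gzip {} \;</programlisting>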
  618. </sect2>
  619. <sect2>
  620. <title>Accessing Log Files</title>
  621. <para>You can access and view the log files directly by going to the
  622. component log directory from a command prompt or a terminal
  623. application. You can also view the component log files through ECL
  624. Watch.</para>
  625. <para>To view logs on ECL Watch, click on the <emphasis
  626. role="bold">Operations</emphasis> icon, then click on the <emphasis
  627. role="bold">System Servers</emphasis> link. That opens the System
  628. Servers page in ECL Watch. There are several HPCC system components
  629. listed on that page. In the <emphasis role="bold">Directory</emphasis>
  630. column for each component there is a computer drive icon. Click the
  631. icon in the row for the component log you wish to view. <figure>
  632. <title>Logs in ECL Watch</title>
  633. <mediaobject>
  634. <imageobject>
  635. <imagedata fileref="images/SA005.jpg" />
  636. </imageobject>
  637. </mediaobject>
  638. </figure></para>
  639. <para>You can also view log files from the other links under the
  640. Operations icon in ECL Watch. <orderedlist>
  641. <listitem>
  642. <para>Click on the <emphasis role="bold">Target
  643. Clusters</emphasis> link to open the tab with links to your
  644. system's clusters.</para>
  645. </listitem>
  646. <listitem>
  647. <para>Click on the computer drive icon (circled in red in the
  648. above figure), in the row of the cluster and node of the
  649. component log you wish to view.</para>
  650. </listitem>
  651. </orderedlist></para>
  652. <para>To view cluster process logs: <orderedlist>
  653. <listitem>
  654. <para>Click on the <emphasis role="bold">Cluster
  655. Processes</emphasis> link to open the tab with links to your
  656. system's clusters processes.</para>
  657. </listitem>
  658. <listitem>
  659. <para>Click on the cluster process you wish to view more
  660. information about.</para>
  661. <para>For example, click on the <emphasis
  662. role="bold">myroxie</emphasis> link. You will then see a page of
663. all that component's nodes. You will see a computer drive icon in
  664. the row of each node. Click that icon to see the logs for the
  665. cluster process for that node.</para>
  666. </listitem>
  667. </orderedlist></para>
  668. <sect3 id="Workunit_Logs">
  669. <title>Log files in ECL Workunits</title>
  670. <para>You can also access the Thor or ECL Agent log files from the
671. ECL Workunits (not available for Roxie workunits). In ECL Watch, when
  672. examining the Workunit details, you will see a <emphasis
  673. role="bold">Helpers</emphasis> tab. Click on the Helpers tab to
  674. display the relevant log files for that particular workunit. <figure>
  675. <title>Logs in ECL Watch Workunits</title>
  676. <mediaobject>
  677. <imageobject>
  678. <imagedata fileref="images/SA006.jpg" />
  679. </imageobject>
  680. </mediaobject>
  681. </figure></para>
  682. </sect3>
  683. </sect2>
  684. </sect1>
  685. </chapter>
  686. <xi:include href="HPCCCertify/Cert-Mods/CertPreflight.xml"
  687. xpointer="Cert_Prelight"
  688. xmlns:xi="http://www.w3.org/2001/XInclude" />
  689. <chapter id="OnDemand_Maintenance">
  690. <title>System Configuration and Management</title>
691. <para>The HPCC system requires configuration. The Configuration Manager
692. tool (configmgr) included with the system software is invaluable for
693. setting up your HPCC system. The Configuration Manager is a graphical tool
694. that can be used to configure your system. Configuration Manager
695. has a wizard that you can run which will easily generate an environment
696. file to get you configured, up and running quickly. There is an advanced
697. option available through Configuration Manager which allows for a more
698. specific configuration, while still using the graphical interface. If
699. desired, you can edit the environment files using any XML or text editor;
700. however, the file structure must remain valid.</para>
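<para>For reference, Configuration Manager is typically started from
the command line on a node where the platform packages are installed,
and then accessed with a browser on port 8015 (the IP address below is
a placeholder):</para>
<programlisting>sudo /opt/HPCCSystems/sbin/configmgr
# then browse to http://192.168.0.10:8015</programlisting>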
  701. <para><figure>
  702. <title>Sample Production Configuration</title>
  703. <mediaobject>
  704. <imageobject>
  705. <imagedata fileref="images/SA008.jpg" />
  706. </imageobject>
  707. </mediaobject>
  708. </figure></para>
  709. <!--/*Including special SysAdmin Config Module -paras- */-->
  710. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  711. xpointer="cfgmgr_introP0"
  712. xmlns:xi="http://www.w3.org/2001/XInclude" />
  713. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  714. xpointer="cfgmgr_introP1"
  715. xmlns:xi="http://www.w3.org/2001/XInclude" />
  716. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  717. xpointer="cfgmgr_p1b"
  718. xmlns:xi="http://www.w3.org/2001/XInclude" />
  719. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  720. xpointer="cfgmgr_introP2"
  721. xmlns:xi="http://www.w3.org/2001/XInclude" />
  722. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  723. xpointer="cfgmgr_introP3"
  724. xmlns:xi="http://www.w3.org/2001/XInclude" />
  725. <!--/*Including special SysAdmin Config Module -Sect1- */-->
  726. <xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
  727. xpointer="configuring-a-multi-node-system"
  728. xmlns:xi="http://www.w3.org/2001/XInclude" />
  729. <sect1>
  730. <title>Environment.conf</title>
  731. <para>Another component of HPCC system configuration is the
  732. environment.conf file. Environment.conf contains some global definitions
  733. that the configuration manager uses to configure the HPCC system. In
  734. most cases, the defaults are sufficient.</para>
  735. <para><informaltable colsep="1" frame="all" rowsep="1">
  736. <tgroup cols="2">
  737. <colspec colwidth="49.50pt" />
  738. <colspec />
  739. <tbody>
  740. <row>
  741. <entry><inlinegraphic fileref="images/caution.png" /></entry>
  742. <entry><emphasis role="bold">WARNING</emphasis>: These
  743. settings are essential to proper system operation. Only expert
  744. level HPCC administrators should attempt to change any aspects
  745. of this file.</entry>
  746. </row>
  747. </tbody>
  748. </tgroup>
749. </informaltable>By default, the environment.conf file is
750. located in:</para>
  751. <programlisting>/etc/HPCCSystems</programlisting>
752. <para>Environment.conf is required upon startup of HPCC. The
753. environment.conf file is where the HPCC environment file is defined.</para>
  754. <programlisting>/opt/HPCCSystems/environment.xml</programlisting>
  755. <para>This is also where the working path is defined.</para>
  756. <programlisting>path=/opt/HPCCSystems</programlisting>
  757. <para>The working path is used by several aspects of the application,
758. and changing this could cause needless complications. By default, the
  759. application installs there, and sets many resources to that as
  760. well.</para>
761. <para>The default environment.conf:</para>
  762. <para><programlisting>## HPCC Systems default environment configuration file
  763. [DEFAULT SETTINGS]
  764. configs=/etc/HPCCSystems
  765. path=/opt/HPCCSystems
  766. classpath=/opt/HPCCSystems/classes
  767. runtime=/var/lib/HPCCSystems
  768. lock=/var/lock/HPCCSystems
  769. # Supported logging fields: AUD,CLS,DET,MID,TIM,DAT,PID,TID,NOD,JOB,USE,SES,
  770. # COD,MLT,MCT,NNT,COM,QUO,PFX,ALL,STD
  771. logfields=TIM+DAT+MLT+MID+PID+TID+COD+QUO+PFX
  772. pid=/var/run/HPCCSystems
  773. log=/var/log/HPCCSystems
  774. user=hpcc
  775. group=hpcc
  776. home=/Users
  777. environment=environment.xml
  778. sourcedir=/etc/HPCCSystems/source
  779. blockname=HPCCSystems
  780. interface=*
  781. # enable epoll method for notification events (true/false)
  782. use_epoll=true
  783. </programlisting></para>
  784. <sect2>
  785. <title>Path considerations</title>
  786. <para>Most of the directories are defined as absolute paths:</para>
  787. <programlisting>configs=/etc/HPCCSystems
  788. path=/opt/HPCCSystems
  789. classpath=/opt/HPCCSystems/classes
  790. runtime=/var/lib/HPCCSystems
  791. lock=/var/lock/HPCCSystems</programlisting>
  792. <para>HPCC will not run properly without the proper paths, and in some
  793. cases needs the absolute path. If a process or component can't find a
794. path, you will get an error message such as the following:</para>
  795. <programlisting>“There are no components configured to run on the node…” </programlisting>
796. <para>If the path is changed from HPCCSystems, it does NOT automatically
797. change in the environment.xml file. Any such change requires manually
798. modifying the environment.xml file.</para>
799. <para>The log file, <emphasis>hpcc-init.log</emphasis>, is written to
  800. the HPCCSystems path.</para>
  801. </sect2>
  802. <sect2>
  803. <title>Other Environment.conf items</title>
  804. <para>Some other items used by or referred to in
805. environment.conf:<variablelist>
  806. <varlistentry>
  807. <term>Use_epoll</term>
  808. <listitem>
  809. <para>It is an event mechanism to achieve better performance
810. in more demanding applications where the number of watched file
  811. descriptors is large.</para>
  812. </listitem>
  813. </varlistentry>
  814. <varlistentry>
  815. <term>Logfields</term>
  816. <listitem>
  817. <para>Categories available to be logged. These consist of
  818. Time(TIM), Date(DAT), Process ID (PID), Thread ID (TID),
  819. etc.</para>
  820. </listitem>
  821. </varlistentry>
  822. <varlistentry>
  823. <term>Interface</term>
  824. <listitem>
  825. <para>In the default environment.conf there is a value for
  826. interface. The default value for that is:</para>
  827. <programlisting>interface=*</programlisting>
  828. <para>The default value of * assigns the interface to an open
829. IP address, in any order. Specifying an interface, such as
830. eth0, will assign the specified node as the primary.<!--***Add More info... WHY DOES THIS MATTER?--></para>
  831. </listitem>
  832. </varlistentry>
  833. </variablelist></para>
  834. </sect2>
  835. </sect1>
  836. <!--Inclusions-As-Sect1-->
  837. <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/UserSecurityMaint.xml"
  838. xpointer="User_Security_Maint"
  839. xmlns:xi="http://www.w3.org/2001/XInclude" />
  840. <sect1>
  841. <title>Workunits and Active Directory</title>
  842. <para>The performance of your system can vary depending on how some
  843. components interact. One area which could impact performance is the
  844. relationship with users, groups, and Active Directory. If possible,
  845. having a separate Active Directory specific to HPCC could be a good
  846. policy. There have been a few instances where just one Active Directory
  847. servicing many, diverse applications has been less than optimal.</para>
  848. <para>HPCC makes setting up your Active Directory OU's relatively easy.
  849. ESP creates all the OU's for you when it starts up, based on the
  850. settings you defined in Configuration Manager. You can then start
  851. Dali/ESP and use ECLWatch to add or modify users or groups.</para>
852. <para>You can assign permissions to each user individually; however, it
853. is more manageable to assign these permissions to groups, and then add
854. users to these groups as appropriate. Create a group for developers and
855. power users (people with full read/write/delete access), another
856. group for users that have only read access, and perhaps another
  857. group that has both read and write access. Add any other groups as
  858. appropriate for your environment. Now you can assign users to their
  859. appropriate group(s).</para>
  860. <sect2>
861. <title>Active Directory and LDAP Commonality</title>
862. <para>There are components that are common to both Active Directory
863. and LDAP. There are a few relevant terms that may need some further
  864. explanation. <variablelist>
  865. <varlistentry>
  866. <term>filesBasedn</term>
  867. <listitem>
  868. <para>Deals with restricting access to files. Also referred to
  869. as “file scoping“.</para>
  870. </listitem>
  871. </varlistentry>
  872. <varlistentry>
  873. <term>groupsBasedn</term>
  874. <listitem>
  875. <para>Controls the groups associated with the environment. For
  876. example, administrators, developers, ws_ecl only, etc.</para>
  877. </listitem>
  878. </varlistentry>
  879. <varlistentry>
  880. <term>modulesBasedn</term>
  881. <listitem>
  882. <para>Specific to systems using a legacy central repository
  883. and controls access to specific modules. Any module you create
  884. in the application will create an entry in
  885. Eclwatch&gt;&gt;User/Permissions&gt;&gt;Repository
886. Modules.</para>
  887. </listitem>
  888. </varlistentry>
  889. <varlistentry>
  890. <term>sudoersBasedn</term>
  891. <listitem>
  892. <para>Deprecated.</para>
  893. </listitem>
  894. </varlistentry>
  895. <varlistentry>
  896. <term>workunitsBasedn</term>
  897. <listitem>
  898. <para>Controls access to workunits.</para>
  899. </listitem>
  900. </varlistentry>
  901. </variablelist></para>
  902. </sect2>
  903. </sect1>
  904. <sect1>
  905. <title>Data Handling</title>
  906. <para>When you start working with your HPCC system, you will want to
907. have some data on the system to process. Data gets transferred to
908. the HPCC system by a process called a spray. Likewise, to get data out
909. of an HPCC system, it must be desprayed.</para>
910. <para>As HPCC is a computer cluster, the data gets distributed over the
  911. nodes that make up the cluster. A <emphasis>spray</emphasis> or import
  912. is the relocation of a data file from one location (such as a Landing
  913. Zone) to a cluster. The term spray was adopted due to the nature of the
  914. file movement – the file is partitioned across all nodes within a
  915. cluster.</para>
  916. <para>A <emphasis>despray</emphasis> or export is the relocation of a
  917. data file from a Data Refinery cluster to a single machine location
  918. (such as a Landing Zone). The term despray was adopted due to the nature
  919. of the file movement – the file is reassembled from its parts on all
  920. nodes in the cluster and placed in a single file on the
  921. destination.</para>
  922. <para>A <emphasis>Landing Zone</emphasis> (or drop zone) is a physical
  923. storage location defined in your system's environment. There can be one
  924. or more of these locations defined. A daemon (dafilesrv) must be running
  925. on that server to enable file sprays and desprays. You can spray or
  926. despray some files to your landing zone through ECL Watch. To upload
  927. large files, you will need a tool that supports the secure copy
928. protocol, such as WinSCP.</para>
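<para>For example, a large file could be copied to the landing zone
with any secure copy client. This is a sketch; the user, host, and
drop zone path are placeholders for your environment.</para>
<programlisting>scp mylargefile.csv hpccuser@192.168.0.20:/var/lib/HPCCSystems/mydropzone/</programlisting>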
  929. <para>For more information about HPCC data handling see the
  930. <emphasis>HPCC Data Handling</emphasis> and the <emphasis>HPCC Data
  931. Tutorial</emphasis> documents.</para>
  932. </sect1>
  933. <!--add-certify-->
  934. </chapter>
  935. <chapter id="Best_Practices_Chapter">
  936. <title>Best Practices</title>
937. <para>This chapter outlines various best practices established by
938. long-time HPCC users and administrators running HPCC in a high
939. availability, demanding production environment. It is not required
940. that you run your environment in this manner, as your specific
941. requirements may vary. This section provides best practice
942. recommendations established after several years of running HPCC in a
943. demanding, intense production environment.</para>
  944. <sect1 id="BP_Cluster_Redundancy" role="nobrk">
  945. <title>Cluster Redundancy</title>
  946. <para>There are several aspects of cluster redundancy that should be
  947. considered when setting up your HPCC system.</para>
  948. <para><informaltable colsep="1" frame="all" rowsep="1">
  949. <?dbfo keep-together="always"?>
  950. <tgroup cols="2">
  951. <colspec colwidth="49.50pt" />
  952. <colspec />
  953. <tbody>
  954. <row>
  955. <entry><inlinegraphic fileref="images/tip.jpg" /></entry>
  956. <entry><para>Make sure you allocate ample resources to your
  957. key components. Dali is RAM intensive. ECL Agent and ECL
  958. Server are processor dependent. Thor should have a minimum of
  959. 4GB RAM per node.</para><para> </para></entry>
  960. </row>
  961. </tbody>
  962. </tgroup>
  963. </informaltable></para>
  964. <sect2>
  965. <title>Dali</title>
  966. <para>Dali should be run in an active/passive configuration.
967. Active/passive means you would have two Dalis running: one primary,
  968. or active, and the other passive. In this scenario all actions are run
  969. on the active Dali, but duplicated on the passive one. If the active
  970. Dali fails, then you can fail over to the passive Dali.</para>
  971. <para>Another suggested best practice is to use standard clustering
  972. with a quorum and a takeover VIP (a kind of load balancer). If the
  973. primary Dali fails, you move the VIP and data directory over to the
  974. passive node and restart the Dali service.</para>
  975. </sect2>
  976. <sect2>
  977. <title>DFU Server</title>
  978. <para>You can run multiple instances of the DFU Server. You can run
  979. all instances as active, as opposed to an active/passive
  980. configuration. There is no need for a load balancer or VIP. Each
  981. instance routinely queries the Dali for workunits. Should one fail,
  982. the other(s) will continue to pull new workunits.</para>
  983. </sect2>
  984. <sect2>
  985. <title>ECLCC Server</title>
  986. <para>You can run multiple active instances of the ECLCC Server for
  987. redundancy. There is no need for a load balancer or VIP for this
  988. either. Each instance will routinely check for workunits. Should one
  989. fail, the other(s) will continue to compile.</para>
  990. </sect2>
  991. <sect2>
  992. <title>ESP/ECL Watch/WsECL</title>
  993. <para>To establish redundancy, place the ESP Servers in a VIP. For an
  994. active/active design, you must use a load balancer. For active/passive
  995. you can use pacemaker/heartbeat. If you run active/active, you should
  996. maintain a single client's connection to a single server for the life
  997. of a session for ECL Watch (port 8010). Other services, such as WsECL
  998. (port 8002) do not require a persistent connection to a single
  999. server.</para>
  1000. </sect2>
  1001. <sect2>
  1002. <title>ECL Agent</title>
1003. <para>You can run multiple active instances of the ECL Agent. There is
1004. no need for a load balancer or VIP. Each instance routinely queries for
  1005. workunits. Should one fail, the other(s) will continue to pull new
  1006. workunits.</para>
  1007. </sect2>
  1008. <sect2>
  1009. <title>Sasha</title>
  1010. <para>Sasha should be run in an active/passive configuration.
1011. Active/passive means you would have two Sashas configured, one
  1012. primary (active), and the other standing by.</para>
  1013. </sect2>
  1014. <sect2>
  1015. <title>ECL Scheduler</title>
1016. <para>There is no need for a load balancer; the ECL Scheduler runs
1017. active/active. Each instance routinely queries for workunits. Should
1018. one fail, the other(s) will continue to schedule workunits.</para>
  1019. </sect2>
  1020. <sect2>
  1021. <title>Thormaster</title>
  1022. <para>Set up Thor in an active/passive configuration. Active/passive
1023. means you would have two instances running, one primary (active),
  1024. and the other passive. No load balancer needed. If the active instance
  1025. fails, then you can fail over to the passive. Failover then uses the
  1026. VIP (a kind of load balancer) to distribute any incoming
  1027. requests.</para>
  1028. </sect2>
<sect2>
<title>Dropzone</title>
<para>This is just a file server that runs the dafilesrv process.
Configure it in the same fashion as you would any active/passive file
server: one primary (active) and the other passive. No load
balancer is needed. If the active instance fails, you can fail over
to the passive one.</para>
</sect2>
</sect1>
<sect1 id="BP_High_Availability">
<title>High Availability and Disaster Recovery</title>
<para>If you require high availability for your HPCC system, there are
some additional considerations that you should be aware of. This is not a
comprehensive list, and it is not meant to be step-by-step instructions
for setting up disaster recovery. Instead, this section provides some
additional information to consider when incorporating HPCC into your
disaster recovery plan.</para>
<sect2 id="Thor_HA">
<title>Thor</title>
<para>When designing a Thor cluster for high availability, consider
how it actually works: a Thor cluster accepts jobs from a job queue.
If there are two Thor clusters handling the queue, one will continue
accepting jobs if the other one fails.</para>
<para>If a single component (thorslave or thormaster) fails, the other
will continue to process requests. With replication enabled, it will
be able to read data from the backup location of the broken Thor.
Other components (such as ECL Server or ESP) can also have multiple
instances. The remaining components, such as Dali or DFU Server, work
in a traditional shared-storage, high-availability failover
model.</para>
<sect3>
<title>The Downside</title>
<para>It initially costs twice as much, because you essentially have to
have two of everything.</para>
</sect3>
<sect3>
<title>The Upside</title>
<para>Almost 100% of the time you can use the additional
processing capacity. You can run more jobs, have more space,
etc.</para>
</sect3>
<sect3>
<title>Disaster Recovery Concerns</title>
<para>The important factor to consider for disaster recovery (DR) is
the bandwidth required to replicate your data. Your network
administrator should evaluate this aspect carefully.</para>
<para>If you have tens of gigabytes of delta each day, then rsync-style
replication or some sort of hybrid model should suffice. If you
have hundreds of gigabytes to petabytes of deltas, the real limit is
your budget.</para>
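<para>As a rough way to frame the bandwidth question, the following
sketch (illustrative only; the sizes, link speed, and efficiency factor
are assumptions, not recommendations) estimates how long one day's delta
would take to replicate over a WAN link.</para>
<programlisting><![CDATA[def replication_hours(delta_gb: float, wan_mbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to push one day's delta over a WAN link.

    delta_gb   -- size of the daily delta in gigabytes (assumed figure)
    wan_mbps   -- nominal link speed in megabits per second
    efficiency -- fraction of the nominal rate actually achieved
    """
    bits = delta_gb * 8 * 1000**3                      # decimal GB to bits
    seconds = bits / (wan_mbps * 1000**2 * efficiency)
    return seconds / 3600

# 50 GB/day over a 1 Gbps link: easily replicated within the day.
print(f"{replication_hours(50, 1000):.1f} hours")
# 20 TB/day over the same link: no longer practical to replicate the raw ingest.
print(f"{replication_hours(20000, 1000):.1f} hours")]]></programlisting>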
<para>A best practice is to find where the data is the smallest (at
ingestion, after normalization, at Roxie), replicate from that
point, and rerun the processing in both locations.</para>
<para>The key to getting disaster recovery right is to know your
data flow. For instance, suppose you are ingesting 20TB of raw data
daily and then rolling that raw data up, scoring it, and
indexing it. In that case, you would be better off replicating an
intermediate dataset (that we call base files) rather than
replicating the large ingest. If the opposite is occurring (a small
daily ingest that is then blown up in size), you would be
better off replicating the input and then re-running the
processing.</para>
<para>Thor has the ability to do a “Thor copy,” which copies data
from one cluster to another. You can also do this through ECL code.
Additionally, you may decide you do not want, or need, a “hot”
DR Thor. The most common minor disasters cause only a relatively
brief outage of less than a day. Since Thor is responsible for
creating data updates, it can take a day or a few to recover; the
data is just not quite as fresh, but as long as the Roxies are
replicated, the data is still flowing. A major disaster, such as a
major earthquake, a tidal wave, extended total power loss, or
multiple fiber cuts, could leave the systems out for a day or more,
but the likelihood of that occurring may not justify the cost of
protecting against it.</para>
</sect3>
<sect3>
<title>Conclusion</title>
<para>Disaster recovery is a calculation: the cost of failure, times
the likelihood per year of an event occurring, compared against the
cost of protecting against it. Taking all of that into consideration
can help you to put a sensible DR plan in place.</para>
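<para>Expressed as a sketch (the figures below are hypothetical and only
illustrate the arithmetic, not actual costs):</para>
<programlisting><![CDATA[def expected_annual_loss(cost_of_failure: float, events_per_year: float) -> float:
    """Expected yearly cost of doing nothing: cost of one failure times its annual likelihood."""
    return cost_of_failure * events_per_year

# Hypothetical figures for illustration only.
loss = expected_annual_loss(cost_of_failure=2_000_000, events_per_year=0.05)
prevention_cost = 250_000   # assumed annual cost of a hot DR site

if loss > prevention_cost:
    print(f"Expected loss ({loss:,.0f}) exceeds prevention cost; the DR investment pays for itself.")
else:
    print(f"Expected loss ({loss:,.0f}) is below prevention cost; a cheaper recovery plan may be sensible.")]]></programlisting>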
</sect3>
</sect2>
<sect2 id="HA_Roxie">
<title>Roxie</title>
<para>In the case of Roxie, a best practice is to have multiple Roxie
clusters and use a proxy to balance between them. To keep the data
in sync, a pull approach is best. The Roxie automatically pulls the
data it needs from the “source” listed in the package file. The data
can also be pulled from another Roxie or a Thor. In most cases you
would pull to your DR Roxie from the primary Roxie, out of the load
balancer, but it can also pull from a Thor in the primary
location.</para>
</sect2>
<sect2 id="HA_Middlewear">
<title>Middleware</title>
<para>Replication of some components (ECL Agent, ESP/ECL Watch, DFU
Server, etc.) is pretty straightforward, as they really don’t have
anything to replicate. Dali is the biggest consideration when it comes
to replication. In the case of Dali, you have Sasha as the local
backup. The Dali files can be replicated using rsync. A better
approach could be to use a synchronizing device (cluster WAN sync, SAN
block replication, etc.), put the Dali stores on that, and allow it
to replicate as designed.</para>
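<para>As an illustration of the rsync approach mentioned above, a
periodic job along the following lines could mirror the Dali store to the
DR site. The paths, host name, and options shown are assumptions for the
sketch, not values taken from this guide; verify the Dali data directory
configured in your environment.</para>
<programlisting><![CDATA[#!/usr/bin/env python3
"""Minimal sketch: mirror the Dali store to a DR host with rsync (assumed paths)."""
import subprocess

DALI_STORE = "/var/lib/HPCCSystems/mydali/"   # verify against your environment configuration
DR_TARGET = "hpcc@dr-dali.example.com:/var/lib/HPCCSystems/mydali/"

# -a preserves ownership, permissions, and timestamps; --delete mirrors removals;
# -z compresses the transfer over the WAN link.
subprocess.run(["rsync", "-az", "--delete", DALI_STORE, DR_TARGET], check=True)]]></programlisting>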
<para>There is no one-size-fits-all approach. Special care,
design, and planning are required to make an effective DR strategy
that doesn't “over synchronize” across slow WAN links, but still
provides you with an acceptable level of redundancy for your business
needs.</para>
</sect2>
</sect1>
<sect1>
<title>Best Practice Considerations</title>
<para>There are several other aspects to best practice considerations,
and these will change with your system requirements. The following
sections are some best practice considerations for some aspects of the
HPCC system. Keep in mind that these are merely suggestions and may not
be appropriate for your needs. A thorough review of the considerations
highlighted here can be very helpful if your needs align with the stated
considerations.</para>
<!--/*Further elaboration of both User permissions, and permission settings... also some hardware set up best practices. Suggested***/-->
<sect2>
<title>Multiple Thors</title>
<para>You can run multiple Thors on the same physical hardware.
Multiple Thors on the same hardware are independent and unaware of
each other. The Thors run jobs as they receive them, regardless of
what the other(s) is/are doing. The speed of a single job will never
be faster with multiple Thors, but the throughput can be. You can run
two Thors picking up jobs from two different queues or from the same
queue.</para>
<para>The downside of running multiple Thors on the same hardware is
that the physical memory on the nodes needs to be shared among the
Thors. This needs to be configured per Thor cluster
definition.</para>
<para>You must not place multiple Thors on hardware that does not
have enough CPU cores to support them. You should not have more Thors
than the number of cores. A good rule of thumb is that the number of
cores divided by two is the maximum number of Thor clusters to use, as
illustrated in the sketch below.</para>
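<para>A quick worked example of that rule of thumb (the core counts are
illustrative only):</para>
<programlisting><![CDATA[def max_thor_clusters(cores_per_node: int) -> int:
    """Rule of thumb from this section: cores divided by two, with a minimum of one."""
    return max(1, cores_per_node // 2)

# Illustrative node sizes, not recommendations.
for cores in (4, 8, 16):
    print(f"{cores} cores per node -> up to {max_thor_clusters(cores)} Thor cluster(s)")]]></programlisting>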
</sect2>
</sect1>
<xi:include href="HPCCSystemAdmin/SA-Mods/SysAdminConfigMod.xml"
            xpointer="Sample_Sizings"
            xmlns:xi="http://www.w3.org/2001/XInclude" />
</chapter>
<chapter id="Resources">
<title>System Resources</title>
<para>There are additional resources available for the HPCC System.</para>
<sect1 id="HPCC_Resources" role="nobrk">
<title>HPCC Resources</title>
<para>The Resources link in ECL Watch can be found under the Operations
icon. It links to the HPCC
Systems<superscript>®</superscript> web portal. Visit the HPCC
Systems<superscript>®</superscript> Web Portal at <ulink
url="http://hpccsystems.com/">http://hpccsystems.com/</ulink> for
software updates, plug-ins, support, documentation, and more. The portal
is where you can find resources useful for running and maintaining
HPCC.</para>
<para>ECL Watch provides a link to the HPCC portal's download page:
<ulink
url="http://hpccsystems.com/download">http://hpccsystems.com/download</ulink>.
This is the page where you can download installation packages, virtual
images, source code, documentation, and tutorials.</para>
</sect1>
<sect1>
<title>Additional Resources</title>
<para>Additional help for learning ECL is also available, including
online courses.</para>
<para><ulink
url="https://learn.lexisnexis.com/lexisnexis/resources/courses">https://learn.lexisnexis.com/lexisnexis/resources/courses
</ulink></para>
<para>There are also training videos online.</para>
<para><ulink
url="https://learn.lexisnexis.com/lexisnexis/resources/courses/HPCC/Summit2014/NewECLWatch50Features/NewECLWatch50Features.html">Legacy
ECL Watch and New 5.0 ECL Watch</ulink></para>
<para>This video gives a quick summary of the differences in the
interface and goes into particular detail in places. It is helpful for
learning how to deploy Roxies.</para>
</sect1>
</chapter>
</book>