|
@@ -175,7 +175,7 @@
|
|
|
<title>System Servers</title>
|
|
|
|
|
|
<para>The System Servers are integral middleware components of an HPCC
|
|
|
- system. They are used to control workflow and intercomponent
|
|
|
+ system. They are used to control workflow and inter-component
|
|
|
communication.</para>
|
|
|
|
|
|
<sect3 id="SysAdm_Dali">
|
|
@@ -183,26 +183,43 @@
|
|
|
|
|
|
<para>Dali is also known as the system data store. It manages
|
|
|
workunit records, logical file directory, and shared object
|
|
|
- services.</para>
|
|
|
+ services. It maintains the message queues that drive job execution
|
|
|
+ and scheduling.</para>
|
|
|
|
|
|
- <para>It maintains the message queues that drive job execution and
|
|
|
- scheduling. It also enforces the all LDAP security
|
|
|
- restrictions.</para>
|
|
|
+ <para>Dali also performs session management. It tracks all active
|
|
|
+ Dali client sessions registered in the environment, so that you
|
|
|
+ can list all clients and their roles (see <emphasis>dalidiag
|
|
|
+ -clients</emphasis>).</para>
|
|
|
+
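+ <para>For example, you could list the connected clients with a
+ command along these lines, following the same pattern as the
+ <emphasis>dalidiag . -save</emphasis> example shown later in this
+ document (the dot assumes the utility is run on the Dali node
+ itself):</para>
+
+ <programlisting> dalidiag . -clients </programlisting>
+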
|
|
|
+ <para>Another task Dali performs is to act as the locking manager.
|
|
|
+ HPCC uses Dali's locking manager to control shared and exclusive
|
|
|
+ locks on metadata.</para>
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="SysAdm_Sahsa">
|
|
|
<title>Sasha</title>
|
|
|
|
|
|
<para>The Sasha server is a companion “housekeeping” server to the
|
|
|
- Dali server. It works independently of all other components. Sasha’s
|
|
|
- main function is to reduce the stress on the Dali server. Whenever
|
|
|
- possible, Sasha reduces the resource utilization on Dali.</para>
|
|
|
+ Dali server. Sasha works independently of, yet in conjunction with
|
|
|
+ Dali. Sasha’s main function is to reduce the stress on the Dali
|
|
|
+ server. Wherever possible, Sasha reduces the resource utilization on
|
|
|
+ Dali. A very important aspect of Sasha is coalescing: saving Dali's
|
|
|
+ in-memory store to a new store edition.</para>
|
|
|
|
|
|
- <para>Sasha archives workunits (including DFU Workunits) which are
|
|
|
- stored in a series of folders.</para>
|
|
|
+ <para>Sasha archives workunits (including DFU Workunits) that are
|
|
|
+ then stored in folders on a disk.</para>
|
|
|
|
|
|
<para>Sasha also performs routine housekeeping such as removing
|
|
|
cached workunits and DFU recovery files.</para>
|
|
|
+
|
|
|
+ <para>Sasha can also run XREF to cross-reference physical files
|
|
|
+ with logical metadata to determine if there are lost/found/orphaned
|
|
|
+ files. It then presents options (via ECL Watch) for their recovery or
|
|
|
+ deletion.</para>
|
|
|
+
|
|
|
+ <para>Sasha is the component responsible for removing expired files
|
|
|
+ when the expiry criteria have been met. The EXPIRE option on ECL's OUTPUT or
|
|
|
+ PERSIST sets that condition.</para>
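+
+ <para>As a brief illustration, the following ECL sketch marks a
+ written file for expiry; the dataset, logical file name, and
+ seven-day period are illustrative only:</para>
+
+ <programlisting>Layout := {UNSIGNED4 id, STRING20 name};
+MyData := DATASET([{1, 'example'}], Layout);
+
+// The written file may be removed by Sasha once its 7-day expiry is met.
+// PERSIST accepts the same EXPIRE(n) option.
+OUTPUT(MyData,,'~thor::temp::dailyfile',EXPIRE(7));</programlisting>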
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="SysAdm_DFU">
|
|
@@ -303,21 +320,6 @@
|
|
|
Those credentials are then used to authenticate any requests from
|
|
|
those tools.</para>
|
|
|
</sect3>
|
|
|
-
|
|
|
- <!-- *** COMMENTING OUT WHOLE Of MONITORING SECTION
|
|
|
- <sect3>
|
|
|
- <title>HPCC Reporting</title>
|
|
|
-
|
|
|
- <para>HPCC leverages the use of Ganglia reporting and monitoring
|
|
|
- components to monitor several aspects of the HPCC System.</para>
|
|
|
-
|
|
|
- <para>See <emphasis>HPCC Monitoring and Reporting</emphasis> for
|
|
|
- more information on how to add monitoring and reporting to your HPCC
|
|
|
- System.</para>
|
|
|
-
|
|
|
- <para>More to come***</para>
|
|
|
- </sect3>
|
|
|
- END COMMENT ***-->
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="SysAdm_ClienInterfaces">
|
|
@@ -442,7 +444,7 @@
|
|
|
</chapter>
|
|
|
|
|
|
<chapter id="SysAdm_HWSizing">
|
|
|
- <title>Hardware and Component Sizing</title>
|
|
|
+ <title>Hardware and Components</title>
|
|
|
|
|
|
<para>This section provides some insight as to what sort of hardware and
|
|
|
infrastructure optimally HPCC works well on. This is not an exclusive
|
|
@@ -527,17 +529,85 @@
|
|
|
<para>HPCC Dali processes store cluster metadata in RAM. For optimal
|
|
|
efficiency, provide at least 48GB of RAM, 6 or more CPU cores, 1Gb/sec
|
|
|
network interface and a high availability disk for a single HPCC Dali.
|
|
|
- HPCC's Dali processes are one of the few active/passive components.
|
|
|
- Using standard “swinging disk” clustering is recommended for a high
|
|
|
- availability setup. For a single HPCC Dali process, any suitable High
|
|
|
- Availability (HA) RAID level is fine.</para>
|
|
|
-
|
|
|
- <para>Sasha does not store any data. Sasha reads data from Dali then
|
|
|
- processes it. Sasha does store archived workunits (WUs) on a disk.
|
|
|
- Allocating a larger disk for Sasha reduces the amount of housekeeping
|
|
|
- needed. Since Sasha assists Dali by performing housekeeping, it works
|
|
|
- best when on its own node. You should avoid putting Sasha and Dali on
|
|
|
- the same node.</para>
|
|
|
+ HPCC's Dali processes are one of the few native active/passive
|
|
|
+ components. Using standard “swinging disk” clustering is recommended for
|
|
|
+ a high availability setup. For a single HPCC Dali process, any suitable
|
|
|
+ High Availability (HA) RAID level is fine.</para>
|
|
|
+
|
|
|
+ <para>Sasha stores data only to locally available disks, reading data
|
|
|
+ from Dali and then processing it by archiving workunits (WUs) to disk. It is
|
|
|
+ beneficial to configure Sasha for a larger amount of archiving so that
|
|
|
+ Dali does not keep too many workunits in memory. This requires a larger
|
|
|
+ amount of disk space.</para>
|
|
|
+
|
|
|
+ <para>Allocating greater disk space for Sasha is sound practice, as
|
|
|
+ the more archiving Sasha does, the more it relieves Dali. Since Sasha
|
|
|
+ assists Dali by performing housekeeping, it works best when on its own
|
|
|
+ node. Ideally, you should avoid putting Sasha and Dali on the same node,
|
|
|
+ because the node that runs these components is extremely critical,
|
|
|
+ particularly when it comes to recovering from losses. Therefore, it
|
|
|
+ should be as robust as possible: RAID drives, fault-tolerant hardware,
|
|
|
+ etc.</para>
|
|
|
+
|
|
|
+ <sect2>
|
|
|
+ <title>Sasha/Dali Interactions</title>
|
|
|
+
|
|
|
+ <para>A critical role of Sasha is coalescing. When Dali shuts down,
|
|
|
+ it saves its in-memory store to a new store edition by creating a new
|
|
|
+ <emphasis>dalisdsXXXX.xml</emphasis>, where XXXX is incremented to the
|
|
|
+ new edition. The current edition is recorded by the filename
|
|
|
+ <emphasis>store.XXXX</emphasis>.</para>
|
|
|
+
|
|
|
+ <para>An explicit save can be requested using
|
|
|
+ <emphasis>dalidiag</emphasis>:</para>
|
|
|
+
|
|
|
+ <programlisting> dalidiag . -save </programlisting>
|
|
|
+
|
|
|
+ <para>An explicit save, as in the above example, creates a new edition
|
|
|
+ in the same way as a shutdown save does. During an explicit save, all changes to SDS are blocked.
|
|
|
+ Therefore all clients will block if they try to make any alteration
|
|
|
+ until the save is complete.</para>
|
|
|
+
|
|
|
+ <para>There are some options (though not commonly used) that can
|
|
|
+ configure Dali to detect quiet/idle time and force a save in exactly
|
|
|
+ the same way an explicit save request does, meaning that it will block
|
|
|
+ any write transactions while saving.</para>
|
|
|
+
|
|
|
+ <para>All Dali SDS changes are recorded in a delta transaction log (in
|
|
|
+ XML format) with a naming convention of
|
|
|
+ <emphasis>daliincXXXX.xml</emphasis>, where XXXX is the current store
|
|
|
+ edition. The log is also optionally mirrored to a backup location. The
|
|
|
+ transaction log grows indefinitely until the store is saved.</para>
|
|
|
+
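+ <para>As a purely illustrative sketch of this naming convention (the
+ edition number 0042 is invented for the example), a Dali data
+ directory at a given edition would hold files along these
+ lines:</para>
+
+ <programlisting>store.0042        marker recording the current edition
+dalisds0042.xml   saved SDS store for edition 0042
+daliinc0042.xml   delta transaction log applied on top of edition 0042</programlisting>
+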
|
|
|
+ <para>In the normal/recommended setup, Sasha is the primary creator of
|
|
|
+ new SDS store editions. It does so on a schedule and according to
|
|
|
+ other configuration options (for example, you could configure for a
|
|
|
+ minimum delta transaction log size). Sasha reads the last saved store
|
|
|
+ and the current transaction log and replays the transaction log over
|
|
|
+ the last saved store to form a new in-memory version, and then saves
|
|
|
+ it. Unlike the Dali saving process, this does not block or interfere
|
|
|
+ with Dali. In the event of abrupt termination of the Dali process
|
|
|
+ (such as being killed or a power loss), Dali uses the same delta
|
|
|
+ transaction log at restart in order to replay the last save and
|
|
|
+ changes to return to the last operational state.</para>
|
|
|
+
|
|
|
|
|
|
+
|
|
|
+ <!-- *** COMMENTING OUT WHOLE Of MONITORING SECTION
|
|
|
+ <sect3>
|
|
|
+ <title>HPCC Reporting</title>
|
|
|
+
|
|
|
+ <para>HPCC leverages the use of Ganglia reporting and monitoring
|
|
|
+ components to monitor several aspects of the HPCC System.</para>
|
|
|
+
|
|
|
+ <para>See <emphasis>HPCC Monitoring and Reporting</emphasis> for
|
|
|
+ more information on how to add monitoring and reporting to your HPCC
|
|
|
+ System.</para>
|
|
|
+
|
|
|
+ <para>More to come***</para>
|
|
|
+ </sect3>
|
|
|
+ END COMMENT ***-->
|
|
|
+ </sect2>
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="SysAdm_OtherHPCCcomponents">
|
|
@@ -602,46 +672,59 @@
|
|
|
<sect1 id="SysAdm_BackUpData" role="nobrk">
|
|
|
<title>Back Up Data</title>
|
|
|
|
|
|
- <para>An integral part of routine maintenance is the back up of
|
|
|
- essential data. Devise a back up strategy to meet the needs of your
|
|
|
- organization. This section is not meant to replace your current back up
|
|
|
- strategy, instead this section supplements it by outlining special
|
|
|
- considerations for HPCC Systems<superscript>®</superscript>.</para>
|
|
|
+ <para>An integral part of routine maintenance is the backup of essential
|
|
|
+ data. Devise a backup strategy to meet the needs of your organization.
|
|
|
+ This section is not meant to replace your current backup strategy,
|
|
|
+ but rather supplements it by outlining special considerations
|
|
|
+ for HPCC Systems<superscript>®</superscript>.</para>
|
|
|
|
|
|
<sect2 id="SysAdm_BackUpConsider">
|
|
|
- <title>Back Up Considerations</title>
|
|
|
+ <title>Backup Considerations</title>
|
|
|
|
|
|
- <para>You probably already have some sort of a back up strategy in
|
|
|
+ <para>You probably already have some sort of a backup strategy in
|
|
|
place, by adding HPCC Systems<superscript>®</superscript> into your
|
|
|
operating environment there are some additional considerations to be
|
|
|
- aware of. The following sections discuss back up considerations for
|
|
|
- the individual HPCC system components.</para>
|
|
|
+ aware of. The following sections discuss backup considerations for the
|
|
|
+ individual HPCC system components.</para>
|
|
|
|
|
|
<sect3 id="SysAdm_BkU_Dali">
|
|
|
<title>Dali</title>
|
|
|
|
|
|
- <para>Dali can be configured to create its own back up, ideally you
|
|
|
- would want that back up kept on a different server or node. You can
|
|
|
- specify the Dali back up folder location using the Configuration
|
|
|
- Manager. You may want to keep multiple copies that back up, to be
|
|
|
- able to restore to a certain point in time. For example, you may
|
|
|
- want to do daily snapshots, or weekly.</para>
|
|
|
-
|
|
|
- <para>You may want to keep back up copies at a system level using
|
|
|
- traditional back up methods.</para>
|
|
|
+ <para>Dali can be configured to create its own backup. It is
|
|
|
+ strongly recommended that the backup be kept on a different server
|
|
|
+ or node for disaster recovery purposes. You can specify the Dali
|
|
|
+ backup folder location using the Configuration Manager. You may want
|
|
|
+ to keep multiple generations of backups, to be able to restore to a
|
|
|
+ certain point in time. For example, you may want to take daily or
|
|
|
+ weekly snapshots.</para>
|
|
|
+
|
|
|
+ <para>You may want to keep backup copies at a system level using
|
|
|
+ traditional methods. Regardless of method or scheme, you would be
|
|
|
+ well advised to back up your Dali.</para>
|
|
|
+
|
|
|
+ <para>You should try to avoid putting Dali, Sasha, and even your
|
|
|
+ Thor Master on the same node. Ideally you want each of these
|
|
|
+ components to be on separate nodes, not only to reduce the stress on
|
|
|
+ the system hardware (allowing the system to operate better) but also
|
|
|
+ to enable you to recover your entire environment, files, and
|
|
|
+ workunits in the event of a loss. In addition, losing a node that
|
|
|
+ hosts several of these components would affect every other Thor/Roxie
|
|
|
+ cluster in the same environment.</para>
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="SysAdm_BkUp_Sasha">
|
|
|
<title>Sasha</title>
|
|
|
|
|
|
- <para>Sasha itself generates no original data but archives workunits
|
|
|
- to disks. Be aware that Sasha can create quite a bit of archive
|
|
|
- data. Once the workunits are archived they are no longer available
|
|
|
- in the Dali data store. The archives can still be retrieved, but
|
|
|
- that archive now becomes the only copy of these workunits.</para>
|
|
|
+ <para>Sasha is the component that does the SDS coalescing. It is
|
|
|
+ normally the sole component that creates new store editions. It is
|
|
|
+ also the component that creates the XREF metadata that ECL Watch
|
|
|
+ uses. Be aware that Sasha can create quite a bit of archive data.
|
|
|
+ Once the workunits are archived, they are no longer available in the
|
|
|
+ Dali data store. The archives can still be accessed through ECL
|
|
|
+ Watch by restoring them to Dali.</para>
|
|
|
|
|
|
- <para>If you need high availability for these archived workunits,
|
|
|
- you should back them up at a system level using traditional back up
|
|
|
+ <para>If you need high availability for archived workunits, you
|
|
|
+ should back them up at a system level using traditional backup
|
|
|
methods.</para>
|
|
|
</sect3>
|
|
|
|
|
@@ -688,18 +771,18 @@
|
|
|
<title>Thor</title>
|
|
|
|
|
|
<para>Thor, the data refinery, as one of the critical components of
|
|
|
- HPCC Systems<superscript>®</superscript> needs to be backed up. Back
|
|
|
- up Thor by configuring replication and setting up a nightly back up
|
|
|
- cron task. Back up Thor on demand before and/or after any node swap
|
|
|
- or drive swap if you do not have a RAID configured.</para>
|
|
|
+ HPCC Systems<superscript>®</superscript> needs to be backed up.
|
|
|
+ Back up Thor by configuring replication and setting up a nightly
|
|
|
+ backup cron task. Back up Thor on demand before and/or after any node
|
|
|
+ swap or drive swap if you do not have a RAID configured.</para>
|
|
|
|
|
|
<para>A very important part of administering Thor is to check the
|
|
|
- logs to ensure the previous back ups completed successfully.</para>
|
|
|
+ logs to ensure the previous backups completed successfully.</para>
|
|
|
|
|
|
<para><emphasis role="bold">Backupnode</emphasis></para>
|
|
|
|
|
|
<para>Backupnode is a tool that is packaged with HPCC. Backupnode
|
|
|
- allows you to back up Thor nodes on demand or in a script. You can
|
|
|
+ allows you to back up Thor nodes on demand or in a script. You can
|
|
|
also use backupnode regularly in a crontab. You would always want to
|
|
|
run it on the Thor master of that cluster.</para>
|
|
|
|
|
@@ -718,7 +801,7 @@
|
|
|
<programlisting> /bin/su - hpcc -c "/opt/HPCCSystems/bin/start_backupnode thor400_7s" & </programlisting>
|
|
|
|
|
|
<para>To run backupnode regularly you could use cron. For example,
|
|
|
- you may want a crontab entry (to back up thor400_7s) set to run at
|
|
|
+ you may want a crontab entry (to back up thor400_7s) set to run at
|
|
|
1am daily:</para>
|
|
|
|
|
|
<programlisting> 0 1 * * * /bin/su - hpcc -c "/opt/HPCCSystems/bin/start_backupnode thor400_7s" & </programlisting>
|
|
@@ -729,7 +812,7 @@
|
|
|
<para>/var/log/HPCCSystems/backupnode/MM_DD_YYYY_HH_MM_SS.log</para>
|
|
|
|
|
|
<para>The (MM) Month, (DD) Day, (YYYY) 4-digit Year, (HH) Hour, (MM)
|
|
|
- Minutes, and (SS) Seconds of the back up comprising the log file
|
|
|
+ Minutes, and (SS) Seconds of the backup comprise the log file
|
|
|
name.</para>
|
|
|
|
|
|
<para>The main log file exists on the Thor master node. It shows
|
|
@@ -737,9 +820,9 @@
|
|
|
backupnode logs on each of the Thor nodes showing what files, if
|
|
|
any, it needed to restore.</para>
|
|
|
|
|
|
- <para>It is important to check the logs to ensure the previous back
|
|
|
- ups completed successfully. The following entry is from the
|
|
|
- backupnode log showing that back up completed successfully:</para>
|
|
|
+ <para>It is important to check the logs to ensure the previous
|
|
|
+ backups completed successfully. The following entry is from the
|
|
|
+ backupnode log showing that the backup completed successfully:</para>
|
|
|
|
|
|
<programlisting>00000028 2014-02-19 12:01:08 26457 26457 "Completed in 0m 0s with 0 errors"
|
|
|
00000029 2014-02-19 12:01:08 26457 26457 "backupnode finished" </programlisting>
|
|
@@ -755,9 +838,9 @@
|
|
|
<para><emphasis role="bold">Original Source Data File
|
|
|
Retention:</emphasis> When a query is published, the data is
|
|
|
typically copied from a remote site, either a Thor or a Roxie.
|
|
|
- The Thor data can serve as back up, provided it is not removed
|
|
|
- or altered on Thor. Thor data is typically retained for a period
|
|
|
- of time sufficient to serve as a back up copy.</para>
|
|
|
+ The Thor data can serve as a backup, provided it is not removed or
|
|
|
+ altered on Thor. Thor data is typically retained for a period of
|
|
|
+ time sufficient to serve as a backup copy.</para>
|
|
|
</listitem>
|
|
|
|
|
|
<listitem>
|
|
@@ -777,7 +860,7 @@
|
|
|
copies of data files. With three sibling Roxie clusters that
|
|
|
have peer node redundancy, there are always six copies of each
|
|
|
file part at any given time; eliminating the need to use
|
|
|
- traditional back up procedures for Roxie data files.</para>
|
|
|
+ traditional backup procedures for Roxie data files.</para>
|
|
|
</listitem>
|
|
|
</itemizedlist>
|
|
|
</sect3>
|
|
@@ -787,15 +870,15 @@
|
|
|
|
|
|
<para>The Landing Zone is used to host incoming and outgoing files.
|
|
|
This should be treated similarly to an FTP server. Use traditional
|
|
|
- system level back ups.</para>
|
|
|
+ system level backups.</para>
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="SysAdm_BkUp_Misc">
|
|
|
<title>Misc</title>
|
|
|
|
|
|
- <para>Back up of any additional component add-ons, your environment
|
|
|
+ <para>Backup of any additional component add-ons, your environment
|
|
|
files (environment.xml), or other custom configurations should be
|
|
|
- done according to traditional back up methods.</para>
|
|
|
+ done according to traditional backup methods.</para>
|
|
|
</sect3>
|
|
|
</sect2>
|
|
|
</sect1>
|
|
@@ -834,7 +917,7 @@
|
|
|
<para>Understanding the log files, and what is normally reported in
|
|
|
the log files, helps in troubleshooting the HPCC system.</para>
|
|
|
|
|
|
- <para>As part of routine maintenance you may want to back up, archive,
|
|
|
+ <para>As part of routine maintenance you may want to back up, archive,
|
|
|
and remove the older log files.</para>
|
|
|
</sect2>
|
|
|
|
|
@@ -1116,11 +1199,11 @@ lock=/var/lock/HPCCSystems</programlisting>
|
|
|
<para>Configuring your system for remote file access over Transport
|
|
|
Layer Security (TLS) requires modifying the <emphasis
|
|
|
role="bold">dafilesrv</emphasis> setting in the
|
|
|
- <emphasis>environment.conf</emphasis> file. </para>
|
|
|
+ <emphasis>environment.conf</emphasis> file.</para>
|
|
|
|
|
|
<para>To do this either uncomment (if they are already there), or add
|
|
|
the following lines to the <emphasis>environment.conf</emphasis> file.
|
|
|
- Then set the values as appropriate for your system. </para>
|
|
|
+ Then set the values as appropriate for your system.</para>
|
|
|
|
|
|
<para><programlisting>#enable SSL for dafilesrv remote file access
|
|
|
dfsUseSSL=true
|
|
@@ -1129,7 +1212,7 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
role="blue">dfsUseSSL=true</emphasis> and set the value for the paths
|
|
|
to point to the certificate and key file paths on your system. Then
|
|
|
deploy the <emphasis>environment.conf</emphasis> file (and cert/key
|
|
|
- files) to all nodes as appropriate. </para>
|
|
|
+ files) to all nodes as appropriate.</para>
|
|
|
|
|
|
<para>When dafilesrv is enabled for TLS (port 7600), it can still
|
|
|
connect over a non-TLS connection (port 7100) to allow legacy clients
|
|
@@ -1326,7 +1409,7 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
Active/passive meaning you would have two Dalis running, one primary,
|
|
|
or active, and the other passive. In this scenario all actions are run
|
|
|
on the active Dali, but duplicated on the passive one. If the active
|
|
|
- Dali fails, then you can fail over to the passive Dali.</para>
|
|
|
+ Dali fails, then you can fail over to the passive Dali.<!--NOTE: Add steps for how to configure an Active/Passive Dali--></para>
|
|
|
|
|
|
<para>Another suggested best practice is to use standard clustering
|
|
|
with a quorum and a takeover VIP (a kind of load balancer). If the
|
|
@@ -1397,8 +1480,7 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
meaning you would have two instances running, one primary (active),
|
|
|
and the other passive. No load balancer needed. If the active instance
|
|
|
fails, then you can fail over to the passive. Failover then uses the
|
|
|
- VIP (a kind of load balancer) to distribute any incoming
|
|
|
- requests.</para>
|
|
|
+ VIP (a kind of load balancer) to distribute any incoming requests.<!--NOTE: Add steps for how to configure the Active/Passive Thor--></para>
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="SysAdm_BestPrac_DropZone">
|
|
@@ -1427,17 +1509,23 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
|
|
|
<para>When designing a Thor cluster for high availability, consider
|
|
|
how it actually works -- a Thor cluster accepts jobs from a job queue.
|
|
|
- If there are two Thor clusters handling the queue, one will continue
|
|
|
- accepting jobs if the other one fails.</para>
|
|
|
+ If there are two Thor clusters servicing the job queue, one will
|
|
|
+ continue accepting jobs if the other one fails.</para>
|
|
|
|
|
|
- <para>If a single component (thorslave or thormaster) fails, the other
|
|
|
- will continue to process requests. With replication enabled, it will
|
|
|
- be able to read data from the back up location of the broken Thor.
|
|
|
- Other components (such as ECL Server, or ESP) can also have multiple
|
|
|
+ <para>With replication enabled, the still-functioning Thor will be
|
|
|
+ able to read data from the backup location of the broken Thor. Other
|
|
|
+ components (such as ECL Server or ESP) can also have multiple
|
|
|
instances. The remaining components, such as Dali, or DFU Server, work
|
|
|
- in a traditional shared storage high availability fail over
|
|
|
+ in a traditional shared storage high availability failover
|
|
|
model.</para>
|
|
|
|
|
|
+ <para>Another important consideration is to keep your ESP and Dali on
|
|
|
+ separate nodes from your Thor master. This way if your Thor master
|
|
|
+ fails, you can replace it, bring up the replacement with the same IP
|
|
|
+ address, and it should then come up. Since Thor stores no workunit
|
|
|
+ data, Dali and ESP can provide the file metadata to recover your
|
|
|
+ workunits.</para>
|
|
|
+
|
|
|
<sect3 id="Thor_HA_Downside">
|
|
|
<title>The Downside</title>
|
|
|
|
|
@@ -1522,7 +1610,7 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
<para>Replication of some components (ECL Agent, ESP/Eclwatch, DFU
|
|
|
Server, etc.) are pretty straight forward as they really don’t have
|
|
|
anything to replicate. Dali is the biggest consideration when it comes
|
|
|
- to replication. In the case of Dali, you have Sasha as the back up
|
|
|
+ to replication. In the case of Dali, you have Sasha as the backup
|
|
|
locally. The Dali files can be replicated using rsync. A better
|
|
|
approach could be to use a synchronizing device (cluster WAN sync, SAN
|
|
|
block replication, etc.), and just put the Dali stores on that and
|
|
@@ -1571,7 +1659,24 @@ dfsSSLPrivateKeyFile=/keyfilepath/keyfile</programlisting>Set the <emphasis
|
|
|
number of cores divided by two is the maximum number of Thor clusters
|
|
|
to use.</para>
|
|
|
|
|
|
- <para></para>
|
|
|
+ <sect3>
|
|
|
+ <title>Multiple Nodes</title>
|
|
|
+
|
|
|
+ <para>Try to keep resources running on their own nodes whenever possible,
|
|
|
+ whether you run one or multiple Thor clusters. If running some kind of
|
|
|
+ active/passive high availability, don't keep your active and passive
|
|
|
+ master on the same node. Try to keep Dali and ESP on separate nodes.
|
|
|
+ Even if you don't have the luxury of very many nodes, you want the
|
|
|
+ Thor master and the Dali (at minimum) to be on separate nodes. The
|
|
|
+ best practice is to keep as many components as possible on their own
|
|
|
+ nodes.</para>
|
|
|
+
|
|
|
+ <para>Another consideration for a multiple-node system is to avoid
|
|
|
+ putting any of the components on nodes running Thor slaves. This is not a
|
|
|
+ best practice and leads to an unbalanced cluster, resulting in those
|
|
|
+ slaves with less memory/CPU taking longer than the rest and dragging
|
|
|
+ down the overall performance of the cluster.</para>
|
|
|
+ </sect3>
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="virtual-thor-slaves">
|