
HPCC-20344 Document HPCC Spark integration

Signed-off-by: G-Pan <greg.panagiotatos@lexisnexis.com>

+ 1 - 1
docs/BuildTools/cmake_config/HPCCSpark.txt

@@ -14,5 +14,5 @@
 #    limitations under the License.
 ################################################################################
 
-DOCBOOK_TO_PDF( ${FO_XSL} SparkHPCC.xml "Spark_HPCC_Connector_${DOC_LANG}")
+DOCBOOK_TO_PDF( ${FO_XSL} SparkHPCC.xml "HPCC_Spark_Integration_${DOC_LANG}")
 

+ 454 - 33
docs/EN_US/HPCCSpark/SparkHPCC.xml

@@ -3,7 +3,7 @@
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
 <book xml:base="../">
   <bookinfo>
-    <title>Spark-HPCCSystems Distributed Connector</title>
+    <title>HPCC / Spark Integration</title>
 
     <mediaobject>
       <imageobject>
@@ -59,6 +59,389 @@
   </bookinfo>
 
   <chapter>
+    <title>HPCC / Spark Installation and Configuration</title>
+
+    <para>The HPCC Systems Spark plug-in, hpccsystems-plugin-spark, integrates
+    Spark into your HPCC Systems platform. Once installed and configured, the
+    Sparkthor component manages the Integrated Spark cluster. It dynamically
+    configures, starts, and stops your Integrated Spark cluster when you start
+    or stop your HPCC Systems platform.</para>
+
+    <sect1 role="nobrk">
+      <title>Spark Installation</title>
+
+      <para>To add Spark integration to your HPCC Systems cluster, you must
+      have an HPCC Cluster running version 7.0.0 or later. Java 8 is also
+      required. You will need to configure the Sparkthor component. The
+      Sparkthor component needs to be associated with a valid existing Thor
+      cluster. The Spark slave nodes will be created alongside each Thor
+      slave. The Integrated Spark Master node will be designated during
+      configuration, along with any other Spark node resources. Then the
+      Sparkthor component will spawn an Integrated Spark cluster at startup.
+      A Spark-HPCC JAR connector will also be available.</para>
+
+      <para>Packages and plug-ins for the Integrated Spark component are
+      available from the HPCC Systems<superscript>®</superscript> web portal:
+      <ulink
+      url="https://hpccsystems.com/download/free-community-edition">https://hpccsystems.com/download/</ulink></para>
+
+      <para>Download the hpccsystems-plugin-spark package from the HPCC
+      Systems Portal.</para>
+
+      <sect2 id="installing-SparkPlugin">
+        <title>Installing the Spark Plug-in</title>
+
+        <para>The installation process and package that you download vary
+        depending on the operating system you plan to use. The installation
+        packages may fail to install if their dependencies are missing from
+        the target system. To install the package, follow the appropriate
+        installation instructions for your operating system:</para>
+
+        <sect3 id="Spark_CentOS">
+          <title>CentOS/Red Hat</title>
+
+          <para>For RPM-based systems, you can install the package using
+          yum.</para>
+
+          <para><programlisting>sudo yum install &lt;hpccsystems-plugin-spark&gt;  </programlisting></para>
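+
+          <para>For example, with a downloaded package file (the file name
+          shown here is only illustrative; the actual name varies by release
+          and distribution):</para>
+
+          <programlisting>sudo yum install ./hpccsystems-plugin-spark-&lt;version&gt;.el7.x86_64.rpm</programlisting>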
+        </sect3>
+
+        <sect3>
+          <title>Ubuntu/Debian</title>
+
+          <para>To install an Ubuntu/Debian package, use:</para>
+
+          <programlisting>sudo dpkg -i &lt;hpccsystems-plugin-spark&gt;  </programlisting>
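+
+          <para>For example, with a downloaded package file (again, the file
+          name is only illustrative):</para>
+
+          <programlisting>sudo dpkg -i ./hpccsystems-plugin-spark_&lt;version&gt;_amd64.deb</programlisting>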
+
+          <para>After installing the package, you should run the following to
+          update any dependencies.</para>
+
+          <para><programlisting>sudo apt-get install -f </programlisting></para>
+
+          <itemizedlist>
+            <listitem>
+              <para>You need to copy and install the plug-in onto all nodes.
+              This can be done using the install-cluster.sh script, which is
+              provided with HPCC. Use the following command:</para>
+
+              <programlisting>/opt/HPCCSystems/sbin/install-cluster.sh &lt;hpccsystems-plugin-spark&gt;</programlisting>
+
+              <para>More details, including other options that may be used
+              with this command, are included in the appendix of Installing
+              and Running the HPCC Platform, also available on the HPCC
+              Systems<superscript>®</superscript> web portal.</para>
+            </listitem>
+          </itemizedlist>
+        </sect3>
+      </sect2>
+    </sect1>
+
+    <sect1 id="Spark_Configuration">
+      <title>Spark Configuration</title>
+
+      <para>To configure your existing HPCC Systems platform to integrate
+      Spark, install the hpccsystems-plugin-spark package and modify your
+      existing environment file to add the Sparkthor component.</para>
+
+      <orderedlist>
+        <listitem>
+          <para>If it is running, stop the HPCC system, using this
+          command:</para>
+
+          <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/SysDStop.xml"
+                      xpointer="element(/1)"
+                      xmlns:xi="http://www.w3.org/2001/XInclude" />
+
+          <para><informaltable colsep="1" frame="all" rowsep="1">
+              <?dbfo keep-together="always"?>
+
+              <tgroup cols="2">
+                <colspec colwidth="49.50pt" />
+
+                <colspec />
+
+                <tbody>
+                  <row>
+                    <entry><inlinegraphic
+                    fileref="images/OSSgr3.png" /></entry>
+
+                    <entry>You can use this command to confirm HPCC processes
+                    are stopped:<para><programlisting>sudo systemctl status hpccsystems-platform.target</programlisting></para></entry>
+                  </row>
+                </tbody>
+              </tgroup>
+            </informaltable></para>
+        </listitem>
+
+        <listitem>
+          <para>Start the Configuration Manager service.<programlisting>sudo /opt/HPCCSystems/sbin/configmgr
+</programlisting></para>
+
+          <para><graphic fileref="images/gs_img_configmgrStart.jpg" /></para>
+        </listitem>
+
+        <listitem>
+          <para>Leave this window open. You can minimize it, if
+          desired.</para>
+        </listitem>
+
+        <listitem>
+          <para>Using a Web browser, go to the Configuration Manager's
+          interface:</para>
+
+          <programlisting>http://&lt;<emphasis>node ip </emphasis>&gt;:8015</programlisting>
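+
+          <para>For example (the IP address shown is only
+          illustrative):</para>
+
+          <programlisting>http://192.168.56.101:8015</programlisting>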
+        </listitem>
+
+        <listitem>
+          <para>Check the box by the Advanced View and select the environment
+          file to edit.</para>
+        </listitem>
+
+        <listitem>
+          <para>Enable write access (check the box at the top right of the
+          page).</para>
+        </listitem>
+
+        <listitem>
+          <para>Right-click on the Navigator panel on the left side.</para>
+
+          <para>Choose <emphasis role="bold">New Components</emphasis> then
+          <emphasis role="bold">Sparkthor</emphasis></para>
+        </listitem>
+
+        <listitem>
+          <?dbfo keep-together="always"?>
+
+          <para>Configure the attributes of your Spark Instance:</para>
+
+          <para><informaltable colsep="1" id="Th.t1" rowsep="1">
+              <tgroup align="left" cols="4">
+                <colspec colwidth="155pt" />
+
+                <colspec colwidth="2*" />
+
+                <colspec colwidth="1*" />
+
+                <colspec colwidth="0.5*" />
+
+                <thead>
+                  <row>
+                    <entry>Attribute</entry>
+
+                    <entry>Values</entry>
+
+                    <entry>Default</entry>
+
+                    <entry>Required</entry>
+                  </row>
+                </thead>
+
+                <tbody>
+                  <row>
+                    <entry>name</entry>
+
+                    <entry>Name for this process</entry>
+
+                    <entry>mysparkthor</entry>
+
+                    <entry>required</entry>
+                  </row>
+
+                  <row>
+                    <entry>ThorClusterName</entry>
+
+                    <entry>Thor cluster for workers to attach to*</entry>
+
+                    <entry>mythor*</entry>
+
+                    <entry>required</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_EXECUTOR_CORES</entry>
+
+                    <entry>Number of cores for executors</entry>
+
+                    <entry>1</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_EXECUTOR_MEMORY</entry>
+
+                    <entry>Memory per executor</entry>
+
+                    <entry>1G</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_MASTER_WEBUI_PORT</entry>
+
+                    <entry>Base port to use for master web interface</entry>
+
+                    <entry>8080</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_MASTER_PORT</entry>
+
+                    <entry>Base port to use for master</entry>
+
+                    <entry>7077</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_WORKER_CORES</entry>
+
+                    <entry>Number of cores for workers</entry>
+
+                    <entry>1</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_WORKER_MEMORY</entry>
+
+                    <entry>Memory per worker</entry>
+
+                    <entry>1G</entry>
+
+                    <entry>optional</entry>
+                  </row>
+
+                  <row>
+                    <entry>SPARK_WORKER_PORT</entry>
+
+                    <entry>Base port to use for workers</entry>
+
+                    <entry>7071</entry>
+
+                    <entry>optional</entry>
+                  </row>
+                </tbody>
+              </tgroup>
+            </informaltable></para>
+
+          <para>*ThorClusterName targets an existing Thor cluster. When
+          configuring, you must choose a valid existing Thor cluster for the
+          Integrated Spark cluster to mirror.</para>
+
+          <para><informaltable colsep="1" frame="all" rowsep="1">
+              <?dbfo keep-together="always"?>
+
+              <tgroup cols="2">
+                <colspec colwidth="49.50pt" />
+
+                <colspec />
+
+                <tbody>
+                  <row>
+                    <entry><inlinegraphic
+                    fileref="images/caution.png" /></entry>
+
+                    <entry>NOTE: You should leave at least 2 cores open for
+                    HPCC to use to provide Spark with data. The number of
+                    cores and memory you allocate to Spark depends on the
+                    workload. Do not allocate too many resources to Spark, or
+                    you could run into an issue of HPCC and Spark conflicting
+                    for resources.</entry>
+                  </row>
+                </tbody>
+              </tgroup>
+            </informaltable></para>
+        </listitem>
+
+        <listitem>
+          <para>Specify a Spark Master node: select the Spark Master Instances
+          tab, then right-click on the Instances table and choose <emphasis
+          role="bold">Add Instances</emphasis>.</para>
+
+          <para>Add the instance of your Spark master node.</para>
+
+          <para><informaltable colsep="1" frame="all" rowsep="1">
+              <?dbfo keep-together="always"?>
+
+              <tgroup cols="2">
+                <colspec colwidth="49.50pt" />
+
+                <colspec />
+
+                <tbody>
+                  <row>
+                    <entry><inlinegraphic
+                    fileref="images/caution.png" /></entry>
+
+                    <entry>NOTE: You can only have one Spark Master
+                    instance.</entry>
+                  </row>
+                </tbody>
+              </tgroup>
+            </informaltable></para>
+        </listitem>
+
+        <listitem>
+          <para>Save the environment file. Exit configmgr (Ctrl+C). Copy the
+          environment file from the source directory to the /etc/HPCCSystems
+          directory.</para>
+
+          <para><informaltable colsep="1" frame="all" rowsep="1">
+              <?dbfo keep-together="always"?>
+
+              <tgroup cols="2">
+                <colspec colwidth="49.50pt" />
+
+                <colspec />
+
+                <tbody>
+                  <row>
+                    <entry><inlinegraphic
+                    fileref="images/caution.png" /></entry>
+
+                    <entry>Be sure the system is stopped before attempting to
+                    move the environment.xml file.</entry>
+                  </row>
+                </tbody>
+              </tgroup>
+            </informaltable></para>
+
+          <para><programlisting>sudo cp /etc/HPCCSystems/source/&lt;new environment file.xml&gt; /etc/HPCCSystems/environment.xml</programlisting>and
+          distribute the new environment file to all the nodes in your
+          cluster.</para>
+
+          <para>You can use the provided hpcc-push.sh script to deploy the new
+          environment file. For example:</para>
+
+          <programlisting>sudo /opt/HPCCSystems/sbin/hpcc-push.sh -s &lt;sourcefile&gt; -t &lt;destinationfile&gt; </programlisting>
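+
+          <para>A typical invocation, assuming the new environment file has
+          already been copied to /etc/HPCCSystems/environment.xml as shown
+          above, might look like this:</para>
+
+          <programlisting>sudo /opt/HPCCSystems/sbin/hpcc-push.sh -s /etc/HPCCSystems/environment.xml -t /etc/HPCCSystems/environment.xml</programlisting>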
+        </listitem>
+      </orderedlist>
+
+      <para>Now you can start your HPCC Systems cluster and verify that
+      Sparkthor is alive.</para>
+
+      <para>To start your HPCC Systems platform:</para>
+
+      <xi:include href="Installing_and_RunningTheHPCCPlatform/Inst-Mods/SysDStart.xml"
+                  xpointer="element(/1)"
+                  xmlns:xi="http://www.w3.org/2001/XInclude" />
+
+      <para>Using a browser, navigate to your Integrated Spark Master instance
+      (the instance you added above) running on port 8080 of your HPCC
+      System.</para>
+
+      <para>For example, http://nnn.nnn.nnn.nnn:8080, where nnn.nnn.nnn.nnn is
+      your Integrated Spark Master node's IP address.</para>
+
+      <programlisting>http://192.168.56.101:8080</programlisting>
+    </sect1>
+  </chapter>
+
+  <chapter>
     <title>The Spark HPCC Systems Connector</title>
 
     <sect1 id="overview" role="nobrk">
@@ -84,17 +467,21 @@
       distributed dataset derived from the data on the HPCC Cluster and is
       created by the
       <emphasis>org.hpccsystems.spark.HpccFile.getRDD</emphasis>(…) method.
-      The <emphasis>HpccFile</emphasis> class uses the
-      <emphasis>org.hpccsystems.spark.HpccDataframeFactory</emphasis> class to
-      construct a <emphasis>Dataset&lt;Row&gt;</emphasis> object for the new
-      Spark interface.</para>
+      The <emphasis>HpccFile</emphasis> class supports loading data to
+      construct a <emphasis>Dataset&lt;Row&gt;</emphasis> object for the Spark
+      interface. This will first load the data into an RDD&lt;Row&gt; and then
+      convert this RDD to a Dataset&lt;Row&gt; through internal Spark
+      mechanisms.</para>
 
       <para>There are several additional artifacts of some interest. The
       <emphasis>org.hpccsystems.spark.ColumnPruner</emphasis> class is
       provided to enable retrieving only the columns of interest from the HPCC
-      Cluster. The <emphasis>org.hpccsystems.spark.thor.FileFilter</emphasis>
-      class is provided to enable retrieving only records of interest from the
-      HPCC Cluster.</para>
+      Cluster. The <emphasis>targetClusterList</emphasis> parameter on the
+      HpccFile constructor allows you to provide a string of comma-delimited
+      field paths for this same purpose. The
+      <emphasis>org.hpccsystems.spark.thor.FileFilter</emphasis> class is
+      provided to enable retrieving only records of interest from the HPCC
+      Cluster.</para>
 
       <para>The git repository includes two examples under the
       Examples/src/main/scala folder. The examples
@@ -107,6 +494,9 @@
       url="https://github.com/hpcc-systems/ecl-ml/blob/master/ML/Tests/Explanatory/IrisDS.ecl">https://github.com/hpcc-systems/ecl-ml/blob/master/ML/Tests/Explanatory/IrisDS.ecl</ulink>)
       can be executed to generate the Iris dataset in HPCC. A walk-through of
       the examples is provided in the Examples section.</para>
+
+      <para>The Spark-HPCCSystems Distributed Connector also supports PySpark.
+      It uses the same classes and APIs as Java.</para>
     </sect1>
 
     <sect1 id="primary-classes">
@@ -114,13 +504,12 @@
 
       <para>The <emphasis>HpccFile</emphasis> class and the
       <emphasis>HpccRDD</emphasis> classes are discussed in more detail below.
-      There are the primary classes used to access data from an HPCC Cluster.
-      The <emphasis>org.hpccsystems.spark.HpccDataframeFactory</emphasis>
-      class is employed by the
-      <emphasis>org.hpccsystems.spark.HpccFile</emphasis> class to produce a
-      <emphasis>Dataset&lt;Row&gt;</emphasis> instance employing an
-      <emphasis>org.hpccsystems.spark.HpccRDD</emphasis> instance to generate
-      the <emphasis>Dataset&lt;Row&gt;</emphasis> instance.</para>
+      These are the primary classes used to access data from an HPCC Cluster.
+      The <emphasis>HpccFile</emphasis> class supports loading data to
+      construct a <emphasis>Dataset&lt;Row&gt;</emphasis> object for the Spark
+      interface. This will first load the data into an RDD&lt;Row&gt; and then
+      convert this RDD to a Dataset&lt;Row&gt; through internal Spark
+      mechanisms.</para>
 
       <para>The <emphasis>org.hpccsystems.spark.HpccFile</emphasis> class has
       several constructors. All of the constructors take information about the
@@ -157,6 +546,48 @@
       block of data, and returns the first record. When the block is
       exhausted, the next block should be available on the socket and new read
       request is issued.</para>
+
+      <para>The <emphasis>HpccFileWriter</emphasis> is another primary class
+      used for writing data to an HPCC Cluster. It has a single constructor
+      with the following signature:</para>
+
+      <programlisting>public HpccFileWriter(String connectionString, String user, String pass) throws Exception { </programlisting>
+
+      <para>The first parameter, <emphasis>connectionString</emphasis>,
+      contains the same information as <emphasis>HpccFile</emphasis>, but as a
+      single connection string. It should be in the following format:
+      {http|https}://{ECLWATCHHOST}:{ECLWATCHPORT}</para>
+
+      <para>The constructor will attempt to connect to HPCC. This connection
+      will then be used for any subsequent calls to
+      <emphasis>saveToHPCC</emphasis>.</para>
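+
+      <para>For example, from the Spark shell (this sketch assumes the import
+      path follows the same org.hpccsystems.spark package as the other classes
+      discussed here; the host name and credentials are only
+      illustrative):</para>
+
+      <programlisting>import org.hpccsystems.spark.HpccFileWriter
+val writer = new HpccFileWriter("http://myeclwatchhost:8010", "myuser", "mypass")</programlisting>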
+
+      <programlisting>public long saveToHPCC(SparkContext sc, RDD&lt;Row&gt; scalaRDD, String clusterName, 
+                        String fileName) throws Exception {</programlisting>
+
+      <para>The <emphasis>saveToHPCC</emphasis> method only supports
+      RDD&lt;Row&gt; types. You may need to modify your data representation to
+      use this functionality. However, this data representation is what is
+      used by Spark SQL and by HPCC. Writing is only supported in a co-located
+      setup; Spark and HPCC must be installed on the same nodes. Reading, in
+      contrast, can read data from a remote HPCC cluster.</para>
+
+      <para>The <emphasis>clusterName</emphasis> in the above case is the
+      desired cluster to write data to, for example, the "mythor" Thor
+      cluster. Currently, there is only support for writing to Thor clusters.
+      Writing to a Roxie cluster is not supported and will return an
+      exception. The <emphasis>fileName</emphasis> in the above example is in
+      the HPCC format, for example: "~example::text".</para>
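+
+      <para>A minimal write, assuming <emphasis>writer</emphasis> was
+      constructed as shown above and <emphasis>myRDD</emphasis> is an
+      RDD&lt;Row&gt; (for example, one obtained from a Dataset&lt;Row&gt;
+      through its rdd method), might look like this:</para>
+
+      <programlisting>val result = writer.saveToHPCC(sc, myRDD, "mythor", "~example::text")</programlisting>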
+
+      <para>Internally, the saveToHPCC method spawns multiple Spark jobs.
+      Currently, it spawns two jobs. The first job maps the location of
+      partitions in the Spark cluster so it can provide this information to
+      HPCC. The second job does the actual writing of files. There are also
+      some internal calls to ESP to handle things like starting the writing
+      process by calling <emphasis>DFUCreateFile</emphasis> and publishing the
+      file once it has been written by calling
+      <emphasis>DFUPublishFile</emphasis>.</para>
     </sect1>
 
     <sect1 id="additional-classes-of-interest">
@@ -223,28 +654,18 @@
       examples.  </para>
 
       <para>These test programs are intended to be run from a development IDE
-      such as Eclipse whereas the examples below are dependent on the Spark
-      shell.</para>
+      such as Eclipse via the spark-submit application, whereas the examples
+      below depend on the Spark shell.</para>
 
       <sect2 id="iris_lr">
         <title>Iris_LR</title>
 
         <para>The Iris_LR example assumes a Spark Shell. You can use the
-        spark-submit command as well. The Spark shell command will need a
-        parameter for the location of the JAPI Jar and the Spark-HPCC
-        Jar.</para>
-
-        <programlisting> spark-shell –-jars /home/my_name/Spark-HPCC.jar,/home/my_name/japi.jar</programlisting>
-
-        <para>This brings up the Spark shell with the two jars pre-pended to
-        the class path. The first thing needed is to establish the imports for
-        the classes.</para>
+        spark-submit command if you intend to compile and package these
+        examples. If you are already logged onto a node of an Integrated
+        Spark-HPCC cluster, run the spark-shell:</para>
 
-        <programlisting> import org.hpccsystems.spark.HpccFile
- import org.hpccsystems.spark.HpccRDD
- import org.apache.spark.mllib.regression.LabeledPoint
- import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
- import org.apache.spark.mllib.evaluation.MulticlassMetrics</programlisting>
+        <programlisting> /opt/HPCCSystems/externals/spark-hadoop/bin/spark-shell</programlisting>
 
         <para>The next step is to establish your HpccFile and your RDD for
         that file. You need the name of the file, the protocol (http or
@@ -253,7 +674,7 @@
         value is the <emphasis>SparkContext</emphasis> object provided by the
         shell.</para>
 
-        <programlisting> val hpcc = new HpccFile("my_data", "http", "my_esp", "8010", "x", "*", "")
+        <programlisting> val hpcc = new HpccFile("myfile", "http", "myeclwatchhost", "8010", "myuser", "mypass", "")
  val myRDD = hpcc.getRDD(sc)</programlisting>
 
         <para>Now we have an RDD of the data. Nothing has actually happened at
@@ -324,7 +745,7 @@
         and is used instead of the <emphasis>SparkContext</emphasis>
         object.</para>
 
-        <programlisting> val hpcc = new HpccFile("my_data", "http", "my_esp", "8010", "x", "*", "")
+        <programlisting> val hpcc = new HpccFile("myfile", "http", "myeclwatchhost", "8010", "myuser", "mypass", "")
  val mt_df = hpcc.getDataframe(spark)</programlisting>
 
         <para>The Spark <emphasis>ml</emphasis> Machine Learning classes use