<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Smart_Stepping">
  <title><emphasis role="bold">Smart Stepping</emphasis></title>

  <sect2 id="PG_Overview">
    <title>Overview</title>

    <para>Smart Stepping is a set of indexing techniques that, taken together,
    provide a method of performing <emphasis>n</emphasis>-ary join/merge-join
    operations, where <emphasis>n</emphasis> is the number of datasets (two
    or more). Smart Stepping enables the supercomputer to efficiently join
    records from multiple filtered data sources, including subsets of the same
    dataset. It is particularly efficient when the matches are sparse and
    uncorrelated. Smart Stepping also supports matching records from M-of-N
    datasets.</para>

    <para>Before the advent of Smart Stepping, finding the intersection of
    records from multiple datasets meant extracting the potential matches
    from one dataset and then joining that candidate set to each of the other
    datasets in turn. The joins used various mechanisms, including index
    lookups, or reading the potential matches from a dataset and then joining
    them. As a result, joining multiple datasets required that at least one
    dataset be read in its entirety and then joined to the others. This could
    be very inefficient if the programmer didn't take care to select the most
    efficient order in which to read the datasets. Unfortunately, it is often
    impossible to know beforehand which order would be best, and often
    impossible to order the joins so that the two least frequent terms are
    joined first. It was also particularly difficult to implement the M-of-N
    join varieties efficiently.</para>

    <para>With Smart Stepping technology, these multiple dataset joins become
    a single efficient operation instead of a series of operations. Smart
    Stepping can only be used where the join condition is primarily an
    equality test between columns in the input datasets, and each input
    dataset must produce its output sorted by those columns.</para>
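
    <para>A minimal sketch of those two requirements, using the
    <emphasis>n</emphasis>-ary form of JOIN covered later in this article.
    The inline data and the names raw1, raw2, s1, s2, Inputs, and Pick are
    hypothetical and are not part of the SmartStepping example files:</para>

    <programlisting>Rec := RECORD
  STRING1 Letter;
  UNSIGNED1 DS;
END;

// Hypothetical inline data, deliberately out of order
raw1 := DATASET([{'C',1},{'A',1},{'B',1}],Rec);
raw2 := DATASET([{'B',2},{'D',2},{'A',2}],Rec);

// Requirement: every input is sorted by the join column
s1 := SORT(raw1,Letter);
s2 := SORT(raw2,Letter);
Inputs := [s1,s2];

Rec Pick(Rec L,DATASET(Rec) Matches) := TRANSFORM
  SELF.DS := MAX(Matches,DS);  // highest-numbered input that matched
  SELF := L;
END;

// Requirement: the STEPPED condition is an equality test on that column
j := JOIN(Inputs,
          STEPPED(LEFT.Letter=RIGHT.Letter),
          Pick(LEFT,ROWS(LEFT)),
          SORTED(Letter));
OUTPUT(j);</programlisting>

    <para>Only the letters present in both inputs (A and B) satisfy this inner
    join.</para>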

    <para>Smart Stepping also provides an efficient way of streaming
    information from a dataset, sorted by any trailing sort order. Previously,
    if you had a sorted dataset (often an index) that needed to be filtered by
    some leading components and then have the resulting rows sorted by the
    trailing components, you had to read the entire filtered result and then
    sort it.</para>

    <para>Smart Stepping can use significant amounts of temporary storage if
    used inappropriately. Therefore, care should be taken to use it
    properly.</para>
  </sect2>

  <sect2 id="Trailing_Field_Sorts">
    <title>Trailing Field Sorts</title>

    <para>The STEPPED function provides the ability to sort by trailing key
    component fields much more efficiently than sorting after filtering (the
    only previous method of accomplishing this). Stepping on the trailing key
    fields allows the sorted rows to be returned without reading the entire
    dataset.</para>

    <para>Prior to the advent of Smart Stepping, a sorted dataset or index
    could efficiently produce filtered rows, or rows sorted in the same order
    as the original sort order, but it could not efficiently produce rows
    sorted by a trailing sort order of the index (whether filtered or not).
    The filtering then post-sorting method required that all rows be read from
    the dataset before any sorted rows could be retrieved. Smart Stepping
    allows the sorted data to be read immediately (and therefore
    partially).</para>

    <para>The easiest way to see the effect is with this example (contained in
    SmartStepping1.ECL; this code must be run in hthor or Roxie, not
    Thor):</para>

    <programlisting>IMPORT $; 
IDX := $.DeclareData.IDX__Person_State_City_Zip_LastName_FirstName_Payload;
Filter := IDX.State = 'LA' AND IDX.City = 'ABBEVILLE';
 //filter by the leading index elements
 //and sort the output by a trailing element
OUTPUT(SORT(IDX(Filter),FirstName),ALL);  //the old way
OUTPUT(STEPPED(IDX(Filter),FirstName),ALL); //Smart Stepping </programlisting>

    <para>The previous method of accomplishing this meant producing the
    filtered result set, then using SORT to achieve the desired sort order.
    The new method looks very similar, using STEPPED instead of SORT, and both
    OUTPUTs produce the same result, but the efficiency of the methods by
    which those results are achieved is very different.</para>

    <para>Once you've successfully run this code and gotten your result, take
    a look at the Graphs page.</para>

    <para>Notice that the first OUTPUT's sub-graph contains three activities:
    the index read, the sort, and the output. But the second OUTPUT's
    sub-graph contains only two activities: the index read and the output. All
    of the Smart Stepping work to produce the result is done by the index
    read. If you then go to the ECL Watch page for the workunit and look at
    the timings, you should see that the second OUTPUT's graph1-1 time is
    significantly less than the first's graph1-2.</para>

    <para>This demonstrates the type of performance advantage Smart Stepping
    can have over previous methods. Of course, the real performance advantage
    shows up when you ask for only the first <emphasis>n</emphasis> records,
    as in this example (contained in SmartStepping1a.ECL):</para>

    <programlisting>IMPORT $; 
IDX := $.DeclareData.IDX__Person_State_City_Zip_LastName_FirstName_Payload;
Filter := IDX.State = 'LA' AND IDX.City = 'ABBEVILLE';
OUTPUT(CHOOSEN(SORT(IDX(Filter),FirstName),5));     //the old way
OUTPUT(CHOOSEN(STEPPED(IDX(Filter),FirstName),5));  //Smart Stepping </programlisting>

    <para>After running this code, check the timings on the ECL Watch page.
    You should again see quite a performance difference between the two
    methods, even with this small amount of data.</para>
  </sect2>

  <sect2 id="N-ary_JOINs">
    <title>N-ary JOINs</title>

    <para>The primary purpose of Smart Stepping is to enable
    <emphasis>n</emphasis>-ary merge/join operations to be accomplished as
    efficiently as possible. To that end the concept of a set of datasets (or
    indexes) has been added to the language. This allows JOIN to be extended
    to operate on multiple datasets, not just two.</para>

    <para>For example, given this data (contained in the SmartStepping2.ECL
    file):</para>

    <programlisting>Rec := RECORD,MAXLENGTH(4096)
  STRING1 Letter;
  UNSIGNED1 DS;
  UNSIGNED1 Matches := 1;
  UNSIGNED1 LastMatch := 1;
  SET OF UNSIGNED1 MatchDSs := [1];
END;
     
ds1 := DATASET([{'A',1},{'B',1},{'C',1},{'D',1},{'E',1}],Rec);
ds2 := DATASET([{'A',2},{'B',2},{'H',2},{'I',2},{'J',2}],Rec);
ds3 := DATASET([{'B',3},{'C',3},{'M',3},{'N',3},{'O',3}],Rec);
ds4 := DATASET([{'A',4},{'B',4},{'R',4},{'S',4},{'T',4}],Rec);
ds5 := DATASET([{'B',5},{'V',5},{'W',5},{'X',5},{'Y',5}],Rec); </programlisting>

    <para>To do an inner join on all five datasets using Smart Stepping, the
    code is this (also contained in the SmartStepping2.ECL file):</para>

    <programlisting>SetDS := [ds1,ds2,ds3,ds4,ds5];

Rec XF(Rec L,DATASET(Rec) Matches) := TRANSFORM
  SELF.Matches := COUNT(Matches);
  SELF.LastMatch := MAX(Matches,DS);
  SELF.MatchDSs := SET(Matches,DS);
  SELF := L;
END;
j1 := JOIN( SetDS,STEPPED(LEFT.Letter=RIGHT.Letter),XF(LEFT,ROWS(LEFT)),SORTED(Letter));

O1 := OUTPUT(j1);
</programlisting>

    <para>Without using Smart Stepping, the code is this (also contained in
    the SmartStepping2.ECL file):</para>

    <programlisting>Rec XF1(Rec L,Rec R,integer MatchSet) := TRANSFORM
  SELF.Matches := L.Matches + 1;
  SELF.LastMatch := MatchSet;
  SELF.MatchDSs := L.MatchDSs + [MatchSet];
  SELF := L;
END;
j2 := JOIN( ds1,ds2,LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,2));
j3 := JOIN( j2,ds3, LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,3));
j4 := JOIN( j3,ds4, LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,4));
j5 := JOIN( j4,ds5, LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,5));
O2 := OUTPUT(SORT(j5,Letter));
</programlisting>

    <para>Both of these examples produce the same one-record output (only the
    letter B appears in all five datasets), but without Smart Stepping you
    need four separate JOINs to accomplish the goal, and in “real world” code
    you might need a separate TRANSFORM for each, depending on what result you
    were trying to produce.</para>

    <para>In addition to the standard inner join between all the datasets, the
    Smart Stepping form of JOIN also supports the same type of LEFT OUTER and
    LEFT ONLY joins as the standard JOIN operation. This form also supports
    <emphasis>M</emphasis> of <emphasis>N</emphasis> joins (MOFN), where
    matching records must appear in at least a specified minimum number of the
    datasets. You may optionally specify a maximum number of datasets in which
    they may appear, as in these examples (also contained in the
    SmartStepping2.ECL file):</para>

    <programlisting>j6 := JOIN( SetDS,
            STEPPED(LEFT.Letter=RIGHT.Letter),
            XF(LEFT,ROWS(LEFT)),
            SORTED(Letter),
            LEFT OUTER);
j7 := JOIN( SetDS,
            STEPPED(LEFT.Letter=RIGHT.Letter),
            XF(LEFT,ROWS(LEFT)),
            SORTED(Letter),
            LEFT ONLY);
j8 := JOIN( SetDS,
            STEPPED(LEFT.Letter=RIGHT.Letter),
            XF(LEFT,ROWS(LEFT)),
            SORTED(Letter),
            MOFN(3));
j9 := JOIN( SetDS,
            STEPPED(LEFT.Letter=RIGHT.Letter),
            XF(LEFT,ROWS(LEFT)),
            SORTED(Letter),
            MOFN(3,4));
O3 := OUTPUT(j6);
O4 := OUTPUT(j7);
O5 := OUTPUT(j8);
O6 := OUTPUT(j9);
</programlisting>

    <para>The RANGE function is also available to limit which datasets in the
    set of datasets will be processed, as in this example (also contained in
    the SmartStepping2.ECL file):</para>

    <programlisting>j10 := JOIN( RANGE(SetDS,[1,3,5]),
             STEPPED(LEFT.Letter=RIGHT.Letter),
             XF(LEFT,ROWS(LEFT)),
             SORTED(Letter));
O7 := OUTPUT(j10);

SEQUENTIAL(O1,O2,O3,O4,O5,O6,O7);</programlisting>

    <para>This feature can be useful when you do not have the information
    needed to select from every dataset in the set, so that only some of the
    datasets should take part in the join.</para>
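
    <para>As a sketch of that situation, the positions passed to RANGE can be
    held in a separate definition, so that only the datasets for which
    criteria are available take part in the join. This assumes RANGE accepts
    any set-of-integers expression as its second parameter. The inline data
    and the names Wanted and Keep are hypothetical:</para>

    <programlisting>Rec := RECORD
  STRING1 Letter;
  UNSIGNED1 DS;
END;

// Hypothetical inline data, already sorted by Letter
ds1 := DATASET([{'A',1},{'B',1},{'C',1}],Rec);
ds2 := DATASET([{'A',2},{'B',2}],Rec);
ds3 := DATASET([{'B',3},{'C',3}],Rec);
SetDS := [ds1,ds2,ds3];

// Hypothetical: criteria were supplied only for the first and third datasets
Wanted := [1,3];

Rec Keep(Rec L,DATASET(Rec) Matches) := TRANSFORM
  SELF.DS := MAX(Matches,DS);  // highest-numbered dataset that matched
  SELF := L;
END;

// Only the datasets at the positions listed in Wanted are joined
j := JOIN(RANGE(SetDS,Wanted),
          STEPPED(LEFT.Letter=RIGHT.Letter),
          Keep(LEFT,ROWS(LEFT)),
          SORTED(Letter));
OUTPUT(j);</programlisting>

    <para>Because ds2 is left out of the join, the letters common to ds1 and
    ds3 (B and C) are the ones that match.</para>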

    <para>This next example demonstrates the most probable real-world use for
    this technology: finding the set of parent records where related child
    records exist that fit a specified set of filter criteria. That's exactly
    what this example (contained in the SmartStepping3.ECL file) does:</para>

    <programlisting>LinkRec := RECORD
 UNSIGNED1 Link;
END;
DS_Rec := RECORD(LinkRec)
  STRING10 Name;
  STRING10 Address;
END;
Child1_Rec := RECORD(LinkRec)
  UNSIGNED1 Nbr;
END;
Child2_Rec := RECORD(LinkRec)
  STRING10 Car;
END;
Child3_Rec := RECORD(LinkRec)
  UNSIGNED4 Salary;
END;
Child4_Rec := RECORD(LinkRec)
  STRING10 Domicile;
END;</programlisting>

    <para>Using this form of RECORD structure inheritance makes it very simple
    to define the linkage between the parent and child files. Note also that
    all these files have different formats.</para>

    <programlisting>ds := DATASET([{1,'Fred','123 Main'},{2,'George','456 High'},
               {3,'Charlie','789 Bank'},{4,'Danielle','246 Front'},
               {5,'Emily','613 Boca'},{6,'Oscar','942 Frank'},
               {7,'Felix','777 John'},{8,'Adele','543 Bank'},
               {9,'Johan','123 Front'},{10,'Ludwig','212 Front'}],
              DS_Rec);
     
Child1 := DATASET([{1,5},{2,8},{3,11},{4,14},{5,17},
                   {6,20},{7,23},{8,26},{9,29},{10,32}],Child1_Rec);

Child2 := DATASET([{1,'Ford'},{2,'Ford'},{3,'Chevy'},
                   {4,'Lexus'},{5,'Lexus'},{6,'Kia'},
                   {7,'Mercury'},{8,'Jeep'},{9,'Lexus'},
                   {9,'Ferrari'},{10,'Ford'}],
                  Child2_Rec);
     

Child3 := DATASET([{1,10000},{2,20000},{3,155000},{4,800000},
                   {5,250000},{6,75000},{7,200000},{8,15000},
                   {9,80000},{10,25000}],
                  Child3_Rec);
     
Child4 := DATASET([{1,'House'},{2,'House'},{3,'House'},{4,'Apt'},
                   {5,'Apt'},{6,'Apt'},{7,'Apt'},{8,'House'},
                   {9,'Apt'},{10,'House'}],
                  Child4_Rec);
     
TblRec := RECORD(LinkRec),MAXLENGTH(4096)
  UNSIGNED1 DS;
  UNSIGNED1 Matches := 0;
  UNSIGNED1 LastMatch := 0;
  SET OF UNSIGNED1 MatchDSs := [];
END;
     
Filter1 := Child1.Nbr % 2 = 0;
Filter2 := Child2.Car IN ['Ford','Chevy','Jeep'];
Filter3 := Child3.Salary &lt; 100000;
Filter4 := Child4.Domicile = 'House';
     
t1 := PROJECT(Child1(Filter1),TRANSFORM(TblRec,SELF.DS:=1,SELF:=LEFT));
t2 := PROJECT(Child2(Filter2),TRANSFORM(TblRec,SELF.DS:=2,SELF:=LEFT));
t3 := PROJECT(Child3(Filter3),TRANSFORM(TblRec,SELF.DS:=3,SELF:=LEFT));
t4 := PROJECT(Child4(Filter4),TRANSFORM(TblRec,SELF.DS:=4,SELF:=LEFT));</programlisting>

    <para>The PROJECT operation is a simple way to transform the results for
    all these different format files into a single standard layout that can be
    used by the Smart Stepping JOIN operation.</para>

    <programlisting>SetDS := [t1,t2,t3,t4];
     
TblRec XF(TblRec L,DATASET(TblRec) Matches) := TRANSFORM
  SELF.Matches := COUNT(Matches);
  SELF.LastMatch := MAX(Matches,DS);
  SELF.MatchDSs := SET(Matches,DS);
  SELF := L;
END;

j1 := JOIN( SetDS,STEPPED(LEFT.Link=RIGHT.Link),XF(LEFT,ROWS(LEFT)),SORTED(Link));     

OUTPUT(j1);
     
OUTPUT(ds(link IN SET(j1,link)));</programlisting>

    <para>The first OUTPUT simply displays the same kind of result as the
    previous example. The second OUTPUT produces the “real-world” result set
    of the base dataset records that match the filter criteria for each of the
    child datasets. With the data above, those are the parent records with
    Link values 2, 8, and 10.</para>
  </sect2>
</sect1>