RONCC
/
Big-Data-HPC-Platform


			
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534
							<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Using_ECL_Keys-INDEX_Files">
  <title><emphasis role="bold">Using ECL Keys (INDEX Files)</emphasis></title>

  <para>The ETL (Extract, Transform, and Load—standard data ingest processing)
  operations in ECL typically operate against all or most of the records in
  any given dataset, which makes the use of keys (INDEX files) of little use.
  Many queries do the same.</para>

  <para>However, production data delivery to end-users rarely requires
  accessing all records in a dataset. End-users always want “instant” access
  to the data they're interested in, and most often that data is a very small
  subset of the total set of records available. Therefore, using keys
  (INDEXes) becomes a requirement.</para>

  <para>The following attribute definitions used by the code examples in this
  article are declared in the DeclareData MODULE structure attribute in the
  DeclareData.ECL file:</para>

  <programlisting>EXPORT Person := MODULE
  EXPORT File := DATASET('~PROGGUIDE::EXAMPLEDATA::People',Layout_Person, THOR);
  EXPORT FilePlus := DATASET('~PROGGUIDE::EXAMPLEDATA::People',
                             {Layout_Person,
                              UNSIGNED8 RecPos{VIRTUAL(fileposition)}}, THOR);
END;                                          
EXPORT Accounts := DATASET('~PROGGUIDE::EXAMPLEDATA::Accounts',
                           {Layout_Accounts_Link,
                            UNSIGNED8 RecPos{VIRTUAL(fileposition)}}, THOR);
EXPORT PersonAccounts := DATASET('~PROGGUIDE::EXAMPLEDATA::PeopleAccts',
                                 {Layout_Combined,
                                  UNSIGNED8 RecPos{virtual(fileposition)}},THOR);

EXPORT IDX_Person_PersonID := INDEX(Person.FilePlus,{PersonID,RecPos},
                                    '~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID');
EXPORT IDX_Accounts_PersonID := INDEX(Accounts,{PersonID,RecPos},
                                      '~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID');

EXPORT IDX_Accounts_PersonID_Payload := 
        INDEX(Accounts,
              {PersonID},
              {Account,OpenDate,IndustryCode,AcctType,
               AcctRate,Code1,Code2,HighCredit,Balance,RecPos},
              '~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID.Payload');

EXPORT IDX_PersonAccounts_PersonID := 
        INDEX(PersonAccounts,{PersonID,RecPos},
              '~PROGGUIDE::EXAMPLEDATA::KEYS::PeopleAccts.PersonID');

EXPORT IDX__Person_LastName_FirstName := 
        INDEX(Person.FilePlus,{LastName,FirstName,RecPos},
              '~PROGGUIDE::EXAMPLEDATA::KEYS::People.LastName.FirstName');
EXPORT IDX__Person_PersonID_Payload := 
        INDEX(Person.FilePlus,{PersonID},
              {FirstName,LastName,MiddleInitial,
               Gender,Street,City,State,Zip,RecPos},
              '~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID.Payload');
</programlisting>

  <para>Although you can use an INDEX as if it were a DATASET, there are only
  two operations in ECL that directly use keys: FETCH and JOIN.</para>

  <sect2 id="Simple_FETCH">
    <title>Simple FETCH</title>

    <para>The FETCH is the simplest use of an INDEX. Its purpose is to
    retrieve records from a dataset by using an INDEX to directly access only
    the specified records.</para>

    <para>The example code below (contained in the IndexFetch.ECL file)
    illustrates the usual form:</para>

    <programlisting>IMPORT $;

F1 := FETCH($.DeclareData.Person.FilePlus,
            $.DeclareData.IDX_Person_PersonID(PersonID=1),  
            RIGHT.RecPos);
OUTPUT(F1);	</programlisting>

    <para>You will note that the DATASET named as the first parameter has no
    filter, while the INDEX named as the second parameter does have a filter.
    This is always the case with FETCH. The purpose of an INDEX in ECL is
    always to allow “direct” access to individual records in the base dataset,
    therefore filtering the INDEX is always required to define the exact set
    of records to retrieve. Given that, filtering the base dataset is
    unnecessary.</para>

    <para>As you can see, there is no TRANSFORM function in this code. For
    most typical uses of FETCH a transform function is unnecessary, although
    it is certainly appropriate if the result data requires formatting, as in
    this example (also contained in the IndexFetch.ECL file):</para>

    <programlisting>r := RECORD
  STRING FullName;
  STRING Address;
  STRING CSZ;
END;

r Xform($.DeclareData.Person.FilePlus L) := TRANSFORM
  SELF.Fullname := TRIM(L.Firstname) + TRIM(' ' + L.MiddleInitial) + ' ' + L.Lastname;
  SELF.Address  := L.Street;
  SELF.CSZ      := TRIM(L.City) + ', ' + L.State + ' ' + L.Zip;
END;

F2 := FETCH($.DeclareData.Person.FilePlus,
            $.DeclareData.IDX_Person_PersonID(PersonID=1),
            RIGHT.RecPos,
            Xform(LEFT));
OUTPUT(F2);
</programlisting>

    <para>Even with a TRANSFORM function, this code is still a very
    straight-forward “go get me the records, please” operation.</para>
  </sect2>

  <sect2 id="Full-keyed_JOIN">
    <title>Full-keyed JOIN</title>

    <para>As simple as FETCH is, using INDEXes in JOIN operations is a little
    more complex. The most obvious form is a "full-keyed" JOIN, specified by
    the KEYED option, which, nominates an INDEX into the right-hand recordset
    (the second JOIN parameter). The purpose for this form is to handle
    situations where the left-hand recordset (named as the first parameter to
    the JOIN) is a fairly small dataset that needs to join to a large, indexed
    dataset (the right-hand recordset). By using the KEYED option, the JOIN
    operation uses the specified INDEX to find the matching right-hand
    records. This means that the join condition must use the key fields in the
    INDEX to find matching records.</para>

    <para>This example code (contained in the IndexFullKeyedJoin.ECL file)
    illustrates the usual use of a full-keyed join:</para>

    <programlisting>IMPORT $;

r1 := RECORD
  $.DeclareData.Layout_Person;
  $.DeclareData.Layout_Accounts;
END;

r1 Xform1($.DeclareData.Person.FilePlus L, 
          $.DeclareData.Accounts R) := TRANSFORM
  SELF := L;
  SELF := R;
END;
J1 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
           $.DeclareData.Accounts,
           LEFT.PersonID=RIGHT.PersonID,
           Xform1(LEFT,RIGHT),
           KEYED($.DeclareData.IDX_Accounts_PersonID));

OUTPUT(J1,ALL);
</programlisting>

    <para>The right-hand Accounts file contains five million records, and with
    the specified filter condition the left-hand Person recordset contains
    exactly one hundred records. A standard JOIN between these two would
    normally require that all five million Accounts records be read to produce
    the result. However, by using the KEYED option the INDEX’s binary tree is
    used to find the entries with the appropriate key field values and get the
    pointers to the exact set of Accounts records required to produce the
    correct result. That means that the only records read from the right-hand
    file are those actually contained in the result.</para>
  </sect2>

  <sect2 id="Half-keyed_JOIN">
    <title>Half-keyed JOIN</title>

    <para>The half-keyed JOIN is a simpler version, wherein the INDEX is the
    right-hand recordset in the JOIN. Just as with the full-keyed JOIN, the
    join condition must use the key fields in the INDEX to do its work. The
    purpose of the half-keyed JOIN is the same as the full-keyed
    version.</para>

    <para>In fact, a full-keyed JOIN is, behind the curtains, actually the
    same as a half-keyed JOIN then a FETCH to retrieve the base dataset
    records. Therefore, a half-keyed JOIN and a FETCH are semantically and
    functionally equivalent, as shown in this example code (contained in the
    IndexHalfKeyedJoin.ECL file):</para>

    <programlisting>IMPORT $;

r1 := RECORD
  $.DeclareData.Layout_Person;
  $.DeclareData.Layout_Accounts;
END;
r2 := RECORD
  $.DeclareData.Layout_Person;
  UNSIGNED8 AcctRecPos;
END;

r2 Xform2($.DeclareData.Person.FilePlus L, 
          $.DeclareData.IDX_Accounts_PersonID R) := TRANSFORM
  SELF.AcctRecPos := R.RecPos;
  SELF := L;
END;

J2 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
           $.DeclareData.IDX_Accounts_PersonID,
           LEFT.PersonID=RIGHT.PersonID,
           Xform2(LEFT,RIGHT));		

r1 Xform3($.DeclareData.Accounts L, r2 R) := TRANSFORM
  SELF := L;
  SELF := R;
END;
F1 := FETCH($.DeclareData.Accounts,
            J2,  
            RIGHT.AcctRecPos,
            Xform3(LEFT,RIGHT));

OUTPUT(F1,ALL);
</programlisting>

    <para>This code produces the same result set as the previous
    example.</para>

    <para>The advantage of using half-keyed JOINs over the full-keyed version
    comes in where you may need to do several JOINs to fully perform whatever
    process is being run. Using the half-keyed form allows you to accomplish
    all the necessary JOINs before you explicitly do the FETCH to retrieve the
    final result records, thereby making the code more efficient.</para>
  </sect2>

  <sect2 id="Payload_INDEXes">
    <title>Payload INDEXes</title>

    <para>There is an extended form of INDEX that allows each entry to carry a
    “payload”—additional data not included in the set of key fields. These
    additional fields may simply be additional fields from the base dataset
    (not required as part of the search key), or they may contain the result
    of some preliminary computation (computed fields). Since the data in an
    INDEX is always compressed (using LZW compression), carrying the extra
    payload doesn't tax the system unduly.</para>

    <para>A payload INDEX requires two separate RECORD structures as the
    second and third parameters of the INDEX declaration. The second parameter
    RECORD structure lists the key fields on which the INDEX is built (the
    search fields), while the third parameter RECORD structure defines the
    additional payload fields.</para>

    <para>The <emphasis role="bold">virtual(fileposition)</emphasis> record
    pointer field must always be the last field listed in any type of INDEX,
    therefore, when you're defining a payload key it is always the last field
    in the third parameter RECORD structure.</para>

    <para>This example code (contained in the IndexHalfKeyedPayloadJoin.ECL
    file) once again duplicates the previous results, but does so using just
    the half-keyed JOIN (without the FETCH) by making use of a payload
    key:</para>

    <programlisting>IMPORT $;

r1 := RECORD
  $.DeclareData.Layout_Person;
  $.DeclareData.Layout_Accounts;
END;

r1 Xform($.DeclareData.Person.FilePlus L, $.DeclareData.IDX_Accounts_PersonID_Payload R) := 
  TRANSFORM
    SELF := L;
    SELF := R;
END;

J2 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
           $.DeclareData.IDX_Accounts_PersonID_Payload,
           LEFT.PersonID=RIGHT.PersonID,
           Xform(LEFT,RIGHT));
 
OUTPUT(J2,ALL);
</programlisting>

    <para>You can see that this makes for tighter code. By eliminating the
    FETCH operation you also eliminate the disk access associated with it,
    making your process faster. The requirement, of course, is to pre-build
    the payload keys so that the FETCH becomes unnecessary.</para>
  </sect2>

  <sect2 id="Computed_Fields_in_Payload_Keys">
    <title>Computed Fields in Payload Keys</title>

    <para>There is a trick to putting computed fields in the payload. Since a
    “computed field” by definition does not exist in the dataset, the
    technique required for their creation and use is to build the content of
    the INDEX beforehand.</para>

    <para>The following example code (contained in IndexPayloadFetch.ECL)
    illustrates how to accomplish this by building the content of some
    computed fields (derived from related child records) in a TABLE on which
    the INDEX is built:</para>

    <programlisting>IMPORT $;

PersonFile := $.DeclareData.Person.FilePlus;
AcctFile   := $.DeclareData.Accounts;
IDXname := '~$.DeclareData::EXAMPLEDATA::KEYS::Person.PersonID.CompPay';

r1 := RECORD
  PersonFile.PersonID;
  UNSIGNED8 AcctCount := 0;
  UNSIGNED8 HighCreditSum := 0;
  UNSIGNED8 BalanceSum := 0;
  PersonFile.RecPos;
END;

t1 := TABLE(PersonFile,r1);
st1 := DISTRIBUTE(t1,HASH32(PersonID));


r2 := RECORD
  AcctFile.PersonID;
  UNSIGNED8 AcctCount := COUNT(GROUP);
  UNSIGNED8 HighCreditSum := SUM(GROUP,AcctFile.HighCredit);
  UNSIGNED8 BalanceSum := SUM(GROUP,AcctFile.Balance);
END;

t2 := TABLE(AcctFile,r2,PersonID);
st2 := DISTRIBUTE(t2,HASH32(PersonID));

r1 countem(t1 L, t2 R) := TRANSFORM
  SELF := R;
  SELF := L;
END;

j := JOIN(st1,st2,LEFT.PersonID=RIGHT.PersonID,countem(LEFT,RIGHT),LOCAL);

Bld := BUILDINDEX(j,
                  {PersonID},
                  {AcctCount,HighCreditSum,BalanceSum,RecPos},
                  IDXname,OVERWRITE);


i := INDEX(PersonFile,
           {PersonID},
           {UNSIGNED8 AcctCount,UNSIGNED8 HighCreditSum,UNSIGNED8 BalanceSum,RecPos},
           IDXname);

f := FETCH(PersonFile,i(PersonID BETWEEN 1 AND 100),RIGHT.RecPos);

Get := OUTPUT(f,ALL);

SEQUENTIAL(Bld,Get);
</programlisting>

    <para>The first TABLE function gets all the key field values from the
    Person dataset for the INDEX and creates empty fields to contain the
    computed values. Note well that the RecPos virtual(fileposition) field
    value is also retrieved at this point.</para>

    <para>The second TABLE function calculates the values to go into the
    computed fields. The values in this example are coming from the related
    Accounts dataset. These computed field values will allow the final payload
    INDEX into the Person dataset to produce these child recordset values
    without any additional code (or disk access).</para>

    <para>The JOIN operation moves combines the result from two TABLEs into
    its final form. This is the data from which the INDEX is built.</para>

    <para>The BUILDINDEX action writes the INDEX to disk. The tricky part then
    is to declare the INDEX against the base dataset (not the JOIN result). So
    the key to this technique is to build the INDEX against a derived/computed
    set of data, then declare the INDEX against the base dataset from which
    that data was drawn.</para>

    <para>To demonstrate the use of a computed-field payload INDEX, this
    example code just does a simple FETCH to return the combined result
    containing all the fields from the Person dataset along with all the
    computed field values. In “normal” use, this type of payload key would
    generally be used in a half-keyed JOIN operation.</para>
  </sect2>

  <sect2 id="Computed_Fields_in_Search_Keys">
    <title>Computed Fields in Search Keys</title>

    <para>There is one situation where using a computed field as a search key
    is required—when the field you want to search on is a REAL or DECIMAL data
    type. Neither of these two is valid for use as a search key. Therefore,
    making the search key a computed STRING field containing the value to
    search on is a way to get around this limitation.</para>

    <para>The trick to computed fields in the payload is the same for search
    keys—build the content of the INDEX beforehand. The following example code
    (contained in IndexREALkey.ECL) illustrates how to accomplish this by
    building the content of computed search key fields on which the INDEX is
    built using a TABLE and PROJECT:</para>

    <programlisting>IMPORT $;

r := RECORD
  REAL8      Float := 0.0;
  DECIMAL8_3 Dec   := 0.0; 
  $.DeclareData.person.file;
END;
t := TABLE($.DeclareData.person.file,r);

r XF(r L) := TRANSFORM
  SELF.float := L.PersonID / 1000;
  SELF.dec := L.PersonID / 1000;
  SELF := L;
END;
p := PROJECT(t,XF(LEFT));

DSname   := '~PROGGUIDE::EXAMPLEDATA::KEYS::dataset';
IDX1name := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestIDX1';
IDX2name := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestIDX2';
OutName1 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout1';
OutName2 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout2';
OutName3 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout3';
OutName4 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout4';
OutName5 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout5';
OutName6 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout6';

DSout := OUTPUT(p,,DSname,OVERWRITE);

ds := DATASET(DSname,r,THOR);

idx1 := INDEX(ds,{STRING13 FloatStr := REALFORMAT(float,13,3)},{ds},IDX1name);
idx2 := INDEX(ds,{STRING13 DecStr := (STRING13)dec},{ds},IDX2name);

Bld1Out := BUILD(idx1,OVERWRITE);
Bld2Out := BUILD(idx2,OVERWRITE);

j1 := JOIN(idx1,idx2,LEFT.FloatStr = RIGHT.DecStr);
j2 := JOIN(idx1,idx2,KEYED(LEFT.FloatStr = RIGHT.DecStr));
j3 := JOIN(ds,idx1,KEYED((STRING10)LEFT.float = RIGHT.FloatStr));
j4 := JOIN(ds,idx2,KEYED((STRING10)LEFT.dec = RIGHT.DecStr));
j5 := JOIN(ds,idx1,KEYED((STRING10)LEFT.dec = RIGHT.FloatStr));
j6 := JOIN(ds,idx2,KEYED((STRING10)LEFT.float = RIGHT.DecStr));

JoinOut1 := OUTPUT(j1,,OutName1,OVERWRITE);
JoinOut2 := OUTPUT(j2,,OutName2,OVERWRITE);
JoinOut3 := OUTPUT(j3,,OutName3,OVERWRITE);
JoinOut4 := OUTPUT(j4,,OutName4,OVERWRITE);
JoinOut5 := OUTPUT(j5,,OutName5,OVERWRITE);
JoinOut6 := OUTPUT(j6,,OutName6,OVERWRITE);

SEQUENTIAL(DSout,Bld1Out,Bld2Out,JoinOut1,JoinOut2,JoinOut3,JoinOut4,JoinOut5,JoinOut6);
</programlisting>

    <para>This code starts with some filename definitions. The record
    structure adds two fields to the existing set of fields from our base
    dataset: a REAL8 field named “float” and a DECIMAL12_6 field named “dec.”
    These will contain our REAL and DECIMAL data that we want to search on.
    The PROJECT of the TABLE puts values into these two fields (in this case,
    just dividing the PersonID file by 1000 to achieve a floating point value
    to use that will be unique).</para>

    <para>The IDX1 INDEX definition creates the REAL search key as a STRING13
    computed field by using the REALFORMAT function to right-justify the
    floating point value into a 13-character STRING. This formats the value
    with exactly the number of decimal places specified in the REALFORMAT
    function.</para>

    <para>The IDX2 INDEX definition creates the DECIMAL search key as a
    STRING13 computed field by casting the DECIMAL data to a STRING13. Using
    the typecast operator simply left-justifies the value in the string. It
    may also drop trailing zeros, so the number of decimal places is not
    guaranteed to always be the same.</para>

    <para>Because of the two different methods of constructing the search key
    strings, the strings themselves are not equal, although the values used to
    create them are the same. This means that you cannot expect to “mix and
    match” between the two—you need to use each INDEX with the method used to
    create it. That's why the two JOIN operations that demonstrate their usage
    use the same method to create the string comparison value as was used to
    create the INDEX. This way, you are guaranteed to achieve matching
    values.</para>
  </sect2>

  <sect2 id="Using_an_INDEX_like_a_DATASET">
    <title>Using an INDEX like a DATASET</title>

    <para>Payload keys can also be used for standard DATASET-type operations.
    In this type of usage, the INDEX acts as if it were a dataset, with the
    advantage that it contains compressed data and a btree index. The key
    difference in this type of use is the use of KEYED and WILD in INDEX
    filters, which allows the INDEX read to make use of the btree instead of
    doing a full-table scan.</para>

    <para>The following example code (contained in IndexAsDataset.ECL)
    illustrates the use of an INDEX as if it were a DATASET, and compares the
    relative performance of INDEX versus DATASET use:</para>

    <programlisting>IMPORT $;

OutRec := RECORD
  INTEGER   Seq;
  QSTRING15 FirstName;
  QSTRING25 LastName;
  STRING2   State;
END;

IDX  := $.DeclareData.IDX__Person_LastName_FirstName_Payload;
Base := $.DeclareData.Person.File;

OutRec XF1(IDX L, INTEGER C) := TRANSFORM
  SELF.Seq := C;
  SELF := L;
END;

O1 := PROJECT(IDX(KEYED(lastname='COOLING'),
                  KEYED(firstname='LIZZ'),
              state='OK'),
              XF1(LEFT,COUNTER));
OUTPUT(O1,ALL);

OutRec XF2(Base L, INTEGER C) := TRANSFORM
  SELF.Seq := C;
  SELF := L;
END;

O2 := PROJECT(Base(lastname='COOLING',
                   firstname='LIZZ',
              state='OK'),
              XF2(LEFT,COUNTER));
OUTPUT(O2,ALL);
</programlisting>

    <para>Both PROJECT operations will produce exactly the same result, but
    the first one uses an INDEX and the second uses a DATASET. The only
    significant difference between the two is the use of KEYED in the INDEX
    filter. This indicates that the index read should use the btree to find
    the specific set of leaf node records to read. The DATASET version must
    read all the records in the file to find the correct one, making it a much
    slower process.</para>

    <para>If you check the workunit timings in ECL Watch, you should see a
    difference. In this test case, the difference may not appear to be
    significant (there's not that much test data), but in your real-world
    applications the difference between an index read operation and a
    full-table scan should prove meaningful.</para>
  </sect2>
</sect1>