Using ECL Keys (INDEX Files)
The ETL (Extract, Transform, and Load: the standard data ingest processing)
operations in ECL typically operate against all or most of the records in
any given dataset, which makes keys (INDEX files) of little value.
Many queries do the same.
However, production data delivery to end-users rarely requires
accessing all records in a dataset. End-users always want “instant” access
to the data they're interested in, and most often that data is a very small
subset of the total set of records available. Therefore, using keys
(INDEXes) becomes a requirement.
The following attribute definitions used by the code examples in this
article are declared in the DeclareData MODULE structure attribute in the
DeclareData.ECL file:
EXPORT Person := MODULE
  EXPORT File := DATASET('~PROGGUIDE::EXAMPLEDATA::People',Layout_Person,THOR);
  EXPORT FilePlus := DATASET('~PROGGUIDE::EXAMPLEDATA::People',
                             {Layout_Person,
                              UNSIGNED8 RecPos{VIRTUAL(fileposition)}},THOR);
END;
EXPORT Accounts := DATASET('~PROGGUIDE::EXAMPLEDATA::Accounts',
                           {Layout_Accounts_Link,
                            UNSIGNED8 RecPos{VIRTUAL(fileposition)}},THOR);
EXPORT PersonAccounts := DATASET('~PROGGUIDE::EXAMPLEDATA::PeopleAccts',
                                 {Layout_Combined,
                                  UNSIGNED8 RecPos{VIRTUAL(fileposition)}},THOR);
EXPORT IDX_Person_PersonID :=
  INDEX(Person.FilePlus,{PersonID,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID');
EXPORT IDX_Accounts_PersonID :=
  INDEX(Accounts,{PersonID,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID');
EXPORT IDX_Accounts_PersonID_Payload :=
  INDEX(Accounts,
        {PersonID},
        {Account,OpenDate,IndustryCode,AcctType,
         AcctRate,Code1,Code2,HighCredit,Balance,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID.Payload');
EXPORT IDX_PersonAccounts_PersonID :=
  INDEX(PersonAccounts,{PersonID,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::PeopleAccts.PersonID');
EXPORT IDX__Person_LastName_FirstName :=
  INDEX(Person.FilePlus,{LastName,FirstName,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::People.LastName.FirstName');
EXPORT IDX__Person_PersonID_Payload :=
  INDEX(Person.FilePlus,{PersonID},
        {FirstName,LastName,MiddleInitial,
         Gender,Street,City,State,Zip,RecPos},
        '~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID.Payload');
Although you can use an INDEX as if it were a DATASET, there are only
two operations in ECL that directly use keys: FETCH and JOIN.
Simple FETCH
The FETCH is the simplest use of an INDEX. Its purpose is to
retrieve records from a dataset by using an INDEX to directly access only
the specified records.
The example code below (contained in the IndexFetch.ECL file)
illustrates the usual form:
IMPORT $;
F1 := FETCH($.DeclareData.Person.FilePlus,
$.DeclareData.IDX_Person_PersonID(PersonID=1),
RIGHT.RecPos);
OUTPUT(F1);
You will note that the DATASET named as the first parameter has no
filter, while the INDEX named as the second parameter does have a filter.
This is always the case with FETCH. The purpose of an INDEX in ECL is
always to allow “direct” access to individual records in the base dataset,
therefore filtering the INDEX is always required to define the exact set
of records to retrieve. Given that, filtering the base dataset is
unnecessary.
As you can see, there is no TRANSFORM function in this code. For
most typical uses of FETCH a transform function is unnecessary, although
it is certainly appropriate if the result data requires formatting, as in
this example (also contained in the IndexFetch.ECL file):
r := RECORD
STRING FullName;
STRING Address;
STRING CSZ;
END;
r Xform($.DeclareData.Person.FilePlus L) := TRANSFORM
SELF.Fullname := TRIM(L.Firstname) + TRIM(' ' + L.MiddleInitial) + ' ' + L.Lastname;
SELF.Address := L.Street;
SELF.CSZ := TRIM(L.City) + ', ' + L.State + ' ' + L.Zip;
END;
F2 := FETCH($.DeclareData.Person.FilePlus,
$.DeclareData.IDX_Person_PersonID(PersonID=1),
RIGHT.RecPos,
Xform(LEFT));
OUTPUT(F2);
Even with a TRANSFORM function, this code is still a very
straightforward “go get me the records, please” operation.
Full-keyed JOIN
As simple as FETCH is, using INDEXes in JOIN operations is a little
more complex. The most obvious form is a "full-keyed" JOIN, specified by
the KEYED option, which nominates an INDEX into the right-hand recordset
(the second JOIN parameter). The purpose for this form is to handle
situations where the left-hand recordset (named as the first parameter to
the JOIN) is a fairly small dataset that needs to join to a large, indexed
dataset (the right-hand recordset). By using the KEYED option, the JOIN
operation uses the specified INDEX to find the matching right-hand
records. This means that the join condition must use the key fields in the
INDEX to find matching records.
This example code (contained in the IndexFullKeyedJoin.ECL file)
illustrates the usual use of a full-keyed join:
IMPORT $;
r1 := RECORD
$.DeclareData.Layout_Person;
$.DeclareData.Layout_Accounts;
END;
r1 Xform1($.DeclareData.Person.FilePlus L,
$.DeclareData.Accounts R) := TRANSFORM
SELF := L;
SELF := R;
END;
J1 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
$.DeclareData.Accounts,
LEFT.PersonID=RIGHT.PersonID,
Xform1(LEFT,RIGHT),
KEYED($.DeclareData.IDX_Accounts_PersonID));
OUTPUT(J1,ALL);
The right-hand Accounts file contains five million records, and with
the specified filter condition the left-hand Person recordset contains
exactly one hundred records. A standard JOIN between these two would
normally require that all five million Accounts records be read to produce
the result. However, by using the KEYED option the INDEX’s binary tree is
used to find the entries with the appropriate key field values and get the
pointers to the exact set of Accounts records required to produce the
correct result. That means that the only records read from the right-hand
file are those actually contained in the result.
Half-keyed JOIN
The half-keyed JOIN is a simpler version, wherein the INDEX is the
right-hand recordset in the JOIN. Just as with the full-keyed JOIN, the
join condition must use the key fields in the INDEX to do its work. The
purpose of the half-keyed JOIN is the same as the full-keyed
version.
In fact, a full-keyed JOIN is, behind the curtains, actually
performed as a half-keyed JOIN followed by a FETCH to retrieve the base
dataset records. Therefore, a half-keyed JOIN plus a FETCH is semantically
and functionally equivalent to a full-keyed JOIN, as shown in this example
code (contained in the IndexHalfKeyedJoin.ECL file):
IMPORT $;
r1 := RECORD
$.DeclareData.Layout_Person;
$.DeclareData.Layout_Accounts;
END;
r2 := RECORD
$.DeclareData.Layout_Person;
UNSIGNED8 AcctRecPos;
END;
r2 Xform2($.DeclareData.Person.FilePlus L,
$.DeclareData.IDX_Accounts_PersonID R) := TRANSFORM
SELF.AcctRecPos := R.RecPos;
SELF := L;
END;
J2 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
$.DeclareData.IDX_Accounts_PersonID,
LEFT.PersonID=RIGHT.PersonID,
Xform2(LEFT,RIGHT));
r1 Xform3($.DeclareData.Accounts L, r2 R) := TRANSFORM
SELF := L;
SELF := R;
END;
F1 := FETCH($.DeclareData.Accounts,
J2,
RIGHT.AcctRecPos,
Xform3(LEFT,RIGHT));
OUTPUT(F1,ALL);
This code produces the same result set as the previous
example.
The advantage of using half-keyed JOINs over the full-keyed version
comes when you need to perform several JOINs to complete a given process.
The half-keyed form allows you to accomplish all the necessary JOINs
before you explicitly do the FETCH to retrieve the final result records,
thereby making the code more efficient.
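For instance, the result of one half-keyed JOIN can feed directly into the next, deferring the single FETCH until the end. The following sketch builds on the J2 and r2 definitions from the previous example, but IDX_Other_PersonID is a hypothetical second index, not part of the DeclareData declarations:

```ecl
// Sketch only: IDX_Other_PersonID and OtherRecPos are hypothetical.
r3 := RECORD
  r2;                     // result layout of the first half-keyed JOIN (J2)
  UNSIGNED8 OtherRecPos;  // record pointer carried from the second INDEX
END;
r3 XformB(r2 L, $.DeclareData.IDX_Other_PersonID R) := TRANSFORM
  SELF.OtherRecPos := R.RecPos;
  SELF := L;
END;
J3 := JOIN(J2,                                // output of the first half-keyed JOIN
           $.DeclareData.IDX_Other_PersonID,  // second half-keyed JOIN
           LEFT.PersonID=RIGHT.PersonID,
           XformB(LEFT,RIGHT));
// Only now FETCH the final base-dataset records, once, using the
// accumulated record pointers.
```

The point of the sketch is that each intermediate JOIN touches only compressed INDEX entries; the disk reads against the base datasets are postponed until a single FETCH at the end.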
Payload INDEXes
There is an extended form of INDEX that allows each entry to carry a
“payload”—additional data not included in the set of key fields. These
additional fields may simply be additional fields from the base dataset
(not required as part of the search key), or they may contain the result
of some preliminary computation (computed fields). Since the data in an
INDEX is always compressed (using LZW compression), carrying the extra
payload doesn't tax the system unduly.
A payload INDEX requires two separate RECORD structures as the
second and third parameters of the INDEX declaration. The second parameter
RECORD structure lists the key fields on which the INDEX is built (the
search fields), while the third parameter RECORD structure defines the
additional payload fields.
The virtual(fileposition) record
pointer field must always be the last field listed in any type of INDEX;
therefore, when you're defining a payload key it is always the last field
in the third parameter RECORD structure.
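This pattern is visible in the IDX_Accounts_PersonID_Payload declaration shown earlier. Generically, a payload INDEX declaration takes the following shape (every name here is a placeholder, not part of the example files):

```ecl
MyPayloadIDX := INDEX(MyFile,
                      {SearchField1,SearchField2},          // 2nd parm: search (key) fields
                      {PayloadField1,PayloadField2,RecPos}, // 3rd parm: payload, RecPos last
                      '~MYSCOPE::KEYS::MyFile.Search.Payload');
```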
This example code (contained in the IndexHalfKeyedPayloadJoin.ECL
file) once again duplicates the previous results, but does so using just
the half-keyed JOIN (without the FETCH) by making use of a payload
key:
IMPORT $;
r1 := RECORD
$.DeclareData.Layout_Person;
$.DeclareData.Layout_Accounts;
END;
r1 Xform($.DeclareData.Person.FilePlus L, $.DeclareData.IDX_Accounts_PersonID_Payload R) :=
TRANSFORM
SELF := L;
SELF := R;
END;
J2 := JOIN($.DeclareData.Person.FilePlus(PersonID BETWEEN 1 AND 100),
$.DeclareData.IDX_Accounts_PersonID_Payload,
LEFT.PersonID=RIGHT.PersonID,
Xform(LEFT,RIGHT));
OUTPUT(J2,ALL);
You can see that this makes for tighter code. By eliminating the
FETCH operation you also eliminate the disk access associated with it,
making your process faster. The requirement, of course, is to pre-build
the payload keys so that the FETCH becomes unnecessary.
Computed Fields in Payload Keys
There is a trick to putting computed fields in the payload. Since a
“computed field” by definition does not exist in the dataset, the
technique required to create and use them is to build the content of
the INDEX beforehand.
The following example code (contained in IndexPayloadFetch.ECL)
illustrates how to accomplish this by building the content of some
computed fields (derived from related child records) in a TABLE on which
the INDEX is built:
IMPORT $;
PersonFile := $.DeclareData.Person.FilePlus;
AcctFile := $.DeclareData.Accounts;
IDXname := '~PROGGUIDE::EXAMPLEDATA::KEYS::Person.PersonID.CompPay';
r1 := RECORD
PersonFile.PersonID;
UNSIGNED8 AcctCount := 0;
UNSIGNED8 HighCreditSum := 0;
UNSIGNED8 BalanceSum := 0;
PersonFile.RecPos;
END;
t1 := TABLE(PersonFile,r1);
st1 := DISTRIBUTE(t1,HASH32(PersonID));
r2 := RECORD
AcctFile.PersonID;
UNSIGNED8 AcctCount := COUNT(GROUP);
UNSIGNED8 HighCreditSum := SUM(GROUP,AcctFile.HighCredit);
UNSIGNED8 BalanceSum := SUM(GROUP,AcctFile.Balance);
END;
t2 := TABLE(AcctFile,r2,PersonID);
st2 := DISTRIBUTE(t2,HASH32(PersonID));
r1 countem(t1 L, t2 R) := TRANSFORM
SELF := R;
SELF := L;
END;
j := JOIN(st1,st2,LEFT.PersonID=RIGHT.PersonID,countem(LEFT,RIGHT),LOCAL);
Bld := BUILDINDEX(j,
{PersonID},
{AcctCount,HighCreditSum,BalanceSum,RecPos},
IDXname,OVERWRITE);
i := INDEX(PersonFile,
{PersonID},
{UNSIGNED8 AcctCount,UNSIGNED8 HighCreditSum,UNSIGNED8 BalanceSum,RecPos},
IDXname);
f := FETCH(PersonFile,i(PersonID BETWEEN 1 AND 100),RIGHT.RecPos);
Get := OUTPUT(f,ALL);
SEQUENTIAL(Bld,Get);
The first TABLE function gets all the key field values from the
Person dataset for the INDEX and creates empty fields to contain the
computed values. Note well that the RecPos virtual(fileposition) field
value is also retrieved at this point.
The second TABLE function calculates the values to go into the
computed fields. The values in this example are coming from the related
Accounts dataset. These computed field values will allow the final payload
INDEX into the Person dataset to produce these child recordset values
without any additional code (or disk access).
The JOIN operation combines the results of the two TABLEs into
its final form. This is the data from which the INDEX is built.
The BUILDINDEX action writes the INDEX to disk. The tricky part then
is to declare the INDEX against the base dataset (not the JOIN result). So
the key to this technique is to build the INDEX against a derived/computed
set of data, then declare the INDEX against the base dataset from which
that data was drawn.
To demonstrate the use of a computed-field payload INDEX, this
example code just does a simple FETCH to return the combined result
containing all the fields from the Person dataset along with all the
computed field values. In “normal” use, this type of payload key would
generally be used in a half-keyed JOIN operation.
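As a hedged sketch of that “normal” use, the computed-field payload key i declared above could drive a half-keyed JOIN directly; the r layout and XF transform here are illustrative only, not part of the original example:

```ecl
// Illustrative half-keyed JOIN against the computed-field payload key 'i'.
r := RECORD
  UNSIGNED8 PersonID;
  UNSIGNED8 AcctCount;
  UNSIGNED8 HighCreditSum;
  UNSIGNED8 BalanceSum;
END;
r XF(PersonFile L, i R) := TRANSFORM
  SELF.PersonID := L.PersonID;
  SELF := R;   // computed payload fields come straight from the INDEX entry
END;
HK := JOIN(PersonFile(PersonID BETWEEN 1 AND 100),
           i,
           LEFT.PersonID=RIGHT.PersonID,
           XF(LEFT,RIGHT));
OUTPUT(HK,ALL);
```

Because the child-record aggregates ride along in the payload, this JOIN needs no FETCH and no access to the Accounts dataset at all.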
Computed Fields in Search Keys
There is one situation where using a computed field as a search key
is required—when the field you want to search on is a REAL or DECIMAL data
type. Neither of these two is valid for use as a search key. Therefore,
making the search key a computed STRING field containing the value to
search on is a way to get around this limitation.
The same trick used for computed payload fields applies to search
keys: build the content of the INDEX beforehand. The following example code
(contained in IndexREALkey.ECL) illustrates how to accomplish this by
using a TABLE and PROJECT to build the content of the computed search key
fields on which the INDEX is built:
IMPORT $;
r := RECORD
REAL8 Float := 0.0;
DECIMAL8_3 Dec := 0.0;
$.DeclareData.person.file;
END;
t := TABLE($.DeclareData.person.file,r);
r XF(r L) := TRANSFORM
SELF.float := L.PersonID / 1000;
SELF.dec := L.PersonID / 1000;
SELF := L;
END;
p := PROJECT(t,XF(LEFT));
DSname := '~PROGGUIDE::EXAMPLEDATA::KEYS::dataset';
IDX1name := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestIDX1';
IDX2name := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestIDX2';
OutName1 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout1';
OutName2 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout2';
OutName3 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout3';
OutName4 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout4';
OutName5 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout5';
OutName6 := '~PROGGUIDE::EXAMPLEDATA::KEYS::realkeytestout6';
DSout := OUTPUT(p,,DSname,OVERWRITE);
ds := DATASET(DSname,r,THOR);
idx1 := INDEX(ds,{STRING13 FloatStr := REALFORMAT(float,13,3)},{ds},IDX1name);
idx2 := INDEX(ds,{STRING13 DecStr := (STRING13)dec},{ds},IDX2name);
Bld1Out := BUILD(idx1,OVERWRITE);
Bld2Out := BUILD(idx2,OVERWRITE);
j1 := JOIN(idx1,idx2,LEFT.FloatStr = RIGHT.DecStr);
j2 := JOIN(idx1,idx2,KEYED(LEFT.FloatStr = RIGHT.DecStr));
j3 := JOIN(ds,idx1,KEYED((STRING10)LEFT.float = RIGHT.FloatStr));
j4 := JOIN(ds,idx2,KEYED((STRING10)LEFT.dec = RIGHT.DecStr));
j5 := JOIN(ds,idx1,KEYED((STRING10)LEFT.dec = RIGHT.FloatStr));
j6 := JOIN(ds,idx2,KEYED((STRING10)LEFT.float = RIGHT.DecStr));
JoinOut1 := OUTPUT(j1,,OutName1,OVERWRITE);
JoinOut2 := OUTPUT(j2,,OutName2,OVERWRITE);
JoinOut3 := OUTPUT(j3,,OutName3,OVERWRITE);
JoinOut4 := OUTPUT(j4,,OutName4,OVERWRITE);
JoinOut5 := OUTPUT(j5,,OutName5,OVERWRITE);
JoinOut6 := OUTPUT(j6,,OutName6,OVERWRITE);
SEQUENTIAL(DSout,Bld1Out,Bld2Out,JoinOut1,JoinOut2,JoinOut3,JoinOut4,JoinOut5,JoinOut6);
This code starts with some filename definitions. The record
structure adds two fields to the existing set of fields from our base
dataset: a REAL8 field named “float” and a DECIMAL8_3 field named “dec.”
These will contain the REAL and DECIMAL data we want to search on.
The PROJECT of the TABLE puts values into these two fields (in this case,
just dividing the PersonID field by 1000 to produce a unique floating
point value).
The IDX1 INDEX definition creates the REAL search key as a STRING13
computed field by using the REALFORMAT function to right-justify the
floating point value into a 13-character STRING. This formats the value
with exactly the number of decimal places specified in the REALFORMAT
function.
The IDX2 INDEX definition creates the DECIMAL search key as a
STRING13 computed field by casting the DECIMAL data to a STRING13. Using
the typecast operator simply left-justifies the value in the string. It
may also drop trailing zeros, so the number of decimal places is not
guaranteed to always be the same.
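The difference in justification is easy to see by outputting both forms of the same value. This is a hedged illustration of the behavior just described; the bracket characters simply make the padding visible:

```ecl
// REALFORMAT right-justifies the value within the 13 characters,
// while the typecast left-justifies it (and may drop trailing zeros).
OUTPUT('[' + REALFORMAT(1.234,13,3) + ']');       // value padded on the left
OUTPUT('[' + (STRING13)(DECIMAL8_3)1.234 + ']');  // value padded on the right
```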
Because of the two different methods of constructing the search key
strings, the strings themselves are not equal, although the values used to
create them are the same. This means that you cannot expect to “mix and
match” between the two—you need to use each INDEX with the method used to
create it. That's why the two JOIN operations that demonstrate their usage
use the same method to create the string comparison value as was used to
create the INDEX. This way, you are guaranteed to achieve matching
values.
Using an INDEX like a DATASET
Payload keys can also be used for standard DATASET-type operations.
In this type of usage, the INDEX acts as if it were a dataset, with the
advantage that it contains compressed data and a btree index. The key
difference in this type of use is the use of KEYED and WILD in INDEX
filters, which allows the INDEX read to make use of the btree instead of
doing a full-table scan.
The following example code (contained in IndexAsDataset.ECL)
illustrates the use of an INDEX as if it were a DATASET, and compares the
relative performance of INDEX versus DATASET use:
IMPORT $;
OutRec := RECORD
INTEGER Seq;
QSTRING15 FirstName;
QSTRING25 LastName;
STRING2 State;
END;
IDX := $.DeclareData.IDX__Person_LastName_FirstName_Payload;
Base := $.DeclareData.Person.File;
OutRec XF1(IDX L, INTEGER C) := TRANSFORM
SELF.Seq := C;
SELF := L;
END;
O1 := PROJECT(IDX(KEYED(lastname='COOLING'),
KEYED(firstname='LIZZ'),
state='OK'),
XF1(LEFT,COUNTER));
OUTPUT(O1,ALL);
OutRec XF2(Base L, INTEGER C) := TRANSFORM
SELF.Seq := C;
SELF := L;
END;
O2 := PROJECT(Base(lastname='COOLING',
firstname='LIZZ',
state='OK'),
XF2(LEFT,COUNTER));
OUTPUT(O2,ALL);
Both PROJECT operations will produce exactly the same result, but
the first one uses an INDEX and the second uses a DATASET. The only
significant difference between the two is the use of KEYED in the INDEX
filter. This indicates that the index read should use the btree to find
the specific set of leaf node records to read. The DATASET version must
read all the records in the file to find the correct one, making it a much
slower process.
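WILD complements KEYED: it explicitly wildcards a leading key field so that a trailing key field may still be KEYED. A hedged sketch against the same INDEX (the result would include every last name with a matching first name, so this is illustrative only):

```ecl
// Wildcard the leading LastName component so the FirstName filter
// can still use the btree rather than forcing a full index scan.
O3 := IDX(WILD(lastname), KEYED(firstname='LIZZ'), state='OK');
OUTPUT(O3);
```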
If you check the workunit timings in ECL Watch, you should see a
difference. In this test case, the difference may not appear to be
significant (there's not that much test data), but in your real-world
applications the difference between an index read operation and a
full-table scan should prove meaningful.