<emphasis role="bold">Creating Example Data</emphasis>

<emphasis role="bold">Creating Example Data</emphasis> Getting Code Files All the example code for the Programmer's Guide is available for download from the HPCC Systems website from the same page that the PDF is available (click here). To make this code available for use in the ECL IDE, you simply: Download the ECL_Code_Files.ZIP file. In the ECL IDE, highlight your "My Files" folder, right-click and select "Insert Folder" from the popup menu. Name your new folder "ProgrammersGuide" (please note -- spaces are NOT allowed in your ECL repository folder names). In the ECL IDE, highlight your "ProgrammersGuide" folder, right-click and select "Locate File in Explorer" from the popup menu. Extract all the files from the ECL_Code_Files.ZIP file into your new folder. Generating Files The code that generates the example data used by all the Programmer's Guide articles is contained in a file named Gendata.ECL. You simply need to open that file in the ECL IDE (select File > Open from the menu, select the Gendata.ECL file, and it will open in a Builder window) then press the Submit button to generate the data files. The process takes a couple of minutes to run. Here is the code, fully explained. Some Constants IMPORT std; P_Mult1 := 1000; P_Mult2 := 1000; TotalParents := P_Mult1 * P_Mult2; TotalChildren := 5000000; These constants define the numbers used to generate 1,000,000 parent records and 5,000,000 child records. By defining these once as attributes the code could be easily made to generate a smaller number of parent records (such as 10,000 by changing both multipliers from 1000 to 100). However, the code as written is designed for a maximum of 1,000,000 parent records and would have to be changed in several places to accommodate generating more. The number of child records can be changed either direction without any other changes to the code (although if pushed too far upward you may encounter runtime errors regarding the maximum variable record length for the nested child dataset). For the purposes of demonstrating the techniques in these Programmer's Guide articles, 1,000,000 parent and 5,000,000 child records are more than sufficient. The RECORD Structures Layout_Person := RECORD UNSIGNED3 PersonID; STRING15 FirstName; STRING25 LastName; STRING1 MiddleInitial; STRING1 Gender; STRING42 Street; STRING20 City; STRING2 State; STRING5 Zip; END; Layout_Accounts := RECORD STRING20 Account; STRING8 OpenDate; STRING2 IndustryCode; STRING1 AcctType; STRING1 AcctRate; UNSIGNED1 Code1; UNSIGNED1 Code2; UNSIGNED4 HighCredit; UNSIGNED4 Balance; END; Layout_Accounts_Link := RECORD UNSIGNED3 PersonID; Layout_Accounts; END; Layout_Combined := RECORD,MAXLENGTH(1000) Layout_Person; DATASET(Layout_Accounts) Accounts; END; These RECORD structures define the field layouts for three datasets: a parent file (Layout_Person), a child file (Layout_Accounts_Link), and a parent with nested child dataset (Layout_Combined). These are used to generate three separate data files. The Layout_Accounts_Link and Layout_Accounts structures are separate because the child records in the nested structure will not contain the linking field to the parent, whereas the separate child file must contain the link. Starting Point Data //define data for record generation: //100 possible middle initials, 52 letters and 48 blanks SetMiddleInitials := 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '; //1000 First names SET OF STRING14 SetFnames := [ 'TIMTOHY ','ALCIAN ','CHAMENE ', ... ]; //1000 Last names SET OF STRING16 SetLnames := [ 'BIALES ','COOLING ','CROTHALL ', ... ]; These sets define the data that will be used to generate the records. By providing 1,000 first and last names, this code can generate 1,000,000 unique names. //2400 street addresses to choose from SET OF STRING31 SetStreets := [ '1 SANDHURST DR ','1 SPENCER LN ', ... ]; //Matched sets of 9540 City,State, Zips SET OF STRING15 SetCity := [ 'ABBEVILLE ','ABBOTTSTOWN ','ABELL ', ... ]; SET OF STRING2 SetStates := [ 'LA','PA','MD','NC','MD','TX','TX','IL','MA','LA','WI','NJ', ... ]; SET OF STRING5 SetZips := [ '70510','17301','20606','28315','21005','79311','79604', ... ]; Having 2400 street addresses and 9540 (valid) city, state, zip combinations provides plenty of opportunity to generate a reasonable mix of addresses. Generating Parent Records BlankSet := DATASET([{0,'','','','','','','','',[]}], Layout_Combined); CountCSZ := 9540; Here is the beginning of the data generation code. The BlankSet is a single empty “seed” record, used to start the process. The CountCSZ attribute simply defines the maximum number of city, state, zip combinations that are available for use in subsequent calculations that will determine which to use in a given record. Layout_Combined CreateRecs(Layout_Combined L, INTEGER C, INTEGER W) := TRANSFORM SELF.FirstName := IF(W=1,SetFnames[C],L.FirstName); SELF.LastName := IF(W=2,SetLnames[C],L.LastName); SELF := L; END; base_fn := NORMALIZE(BlankSet,P_Mult1,CreateRecs(LEFT,COUNTER,1)); base_fln := NORMALIZE(base_fn ,P_Mult2,CreateRecs(LEFT,COUNTER,2)); The purpose of this code is to generate 1,000,000 unique first/last name records as a starting point. The NORMALIZE operation is unique in that its second parameter defines the number of times to call the TRANSFORM function for each input record. This makes it uniquely suited to generating the kind of “bogus” data we need. We're doing two NORMALIZE operations here. The first generates 1,000 records with unique first names from the single blank record in the BlankSet inline DATASET. Then the second takes the 1,000 records from the first NORMALIZE and creates 1,000 new records with unique last names for each input record, resulting in 1,000,000 unique first/last name records. One interesting “trick” here is the use of a single TRANSFORM function for both of the NORMALIZE operations. Defining the TRANSFORM to receive one “extra” (third) parameter than it normally takes is what allows this. This parameter simply flags which NORMALIZE pass the TRANSFORM is doing. Layout_Combined PopulateRecs(Layout_Combined L, Layout_Combined R, INTEGER HashVal) := TRANSFORM CSZ_Rec := (HashVal % CountCSZ) + 1; SELF.PersonID := IF(L.PersonID = 0, Thorlib.Node() + 1, L.PersonID + CLUSTERSIZE); SELF.MiddleInitial := SetMiddleInitials[(HashVal % 100) + 1 ]; SELF.Gender := CHOOSE((HashVal % 2) + 1,'F','M'); SELF.Street := SetStreets[(HashVal % 2400) + 1 ]; SELF.City := SetCity[CSZ_Rec]; SELF.State := SetStates[CSZ_Rec]; SELF.Zip := SetZips[CSZ_Rec]; SELF := R; END; base_fln_dist := DISTRIBUTE(base_fln,HASH32(FirstName,LastName)); base_people := ITERATE(base_fln_dist, PopulateRecs(LEFT, RIGHT, HASHCRC(RIGHT.FirstName,RIGHT.LastName)), LOCAL); base_people_dist := DISTRIBUTE(base_people,HASH32(PersonID)); Once the two NORMALIZE operations have done their work, the next task is to populate the rest of the fields. Since one of those fields is the PersonID, which is the unique identifier field for the record, the fastest way to populate it is with ITERATE using the LOCAL option. Using the Thorlib.Node() function and CLUSTERSIZE compiler directive, you can uniquely number each record in parallel on each node with ITERATE. You may end up with a few holes in the numbering towards the end, but since the only requirement here is uniqueness and not contiguity, those holes are irrelevant. Since the first two NORMALIZE operations took place on a single node (look at the data skews shown in the ECL Watch graph), the first thing to do is DISTRIBUTE the records so each node has a proportional chunk of the data to work with. Then the ITERATE can do its work on each chunk of records in parallel. To introduce an element of randomity to the data choices, the ITERATE passes a hash value to the TRANSFORM function as an “extra” third parameter. This is the same technique used previously, but passing calculated values instead of constants. The CSZ_Rec attribute definition illustrates the use of local attribute definitions inside TRANSFORM functions. Defining the expression once, then using it multiple times as needed to produce a valid city, state, zip combination. The rest of the fields are populated by data selected using the passed in hash value in their expressions. The modulus division operator (%—produces the remainder of the division) is used to ensure that a value is calculated that is in the valid range of the number of elements for the given set of data from which the field is populated. Generating Child Records BlankKids := DATASET([{0,'','','','','',0,0,0,0}], Layout_Accounts_Link); SetLinks := SET(base_people,PersonID); SetIndustryCodes := ['BB','DC','ON','FM','FP','FF','FC','FA','FZ', 'CG','FS','OC','ZZ','HZ','UT','HF','CS','DM', 'JA','FY','HT','UE','DZ','AT']; SetAcctRates := ['1','0','9','*','Z','5','B','2', '3','4','A','7','8','E','C']; SetDateYears := ['1987','1988','1989','1990','1991','1992','1993', '1994','1995','1996','1997','1998','1999','2000', '2001','2002','2003','2004','2005','2006']; SetMonthDays := [31,28,31,30,31,30,31,31,30,31,30,31]; SetNarrs := [229,158,2,0,66,233,123,214,169,248,67,127,168, 65,208,114,73,218,238,57,125,113,88, 247,244,121,54,220,98,97]; Once again, we start by defining a “seed” record for the process as an inline DATASET and several sets of appropriate data for the specific fields. The SET function builds a set of valid PersonID values to use to create the links between the parent and child records. Layout_Accounts_Link CreateKids(Layout_Accounts_Link L, INTEGER C) := TRANSFORM CSZ_IDX := C % CountCSZ + 1; HashVal := HASH32(SetCity[CSZ_IDX],SetStates[CSZ_IDX],SetZips[CSZ_IDX]); DateMonth := HashVal % 12 + 1; SELF.PersonID := CHOOSE(TRUNCATE(C / TotalParents ) + 1, IF(C % 2 = 0, SetLinks[C % TotalParents + 1], SetLinks[TotalParents - (C % TotalParents )]), IF(C % 3 <> 0, SetLinks[C % TotalParents + 1], SetLinks[TotalParents - (C % TotalParents )]), IF(C % 5 = 0, SetLinks[C % TotalParents + 1], SetLinks[TotalParents - (C % TotalParents )]), IF(C % 7 <> 0, SetLinks[C % TotalParents + 1], SetLinks[TotalParents - (C % TotalParents )]), SetLinks[C % TotalParents + 1]); SELF.Account := (STRING)HashVal; SELF.OpenDate := SetDateYears[DateMonth] + INTFORMAT(DateMonth,2,1) + INTFORMAT(HashVal % SetMonthDays[DateMonth]+1,2,1); SELF.IndustryCode := SetIndustrycodes[HashVal % 24 + 1]; SELF.AcctType := CHOOSE(HashVal%5+1,'O','R','I','9',' '); SELF.AcctRate := SetAcctRates[HashVal % 15 + 1]; SELF.Code1 := SetNarrs[HashVal % 15 + 1]; SELF.Code2 := SetNarrs[HashVal % 15 + 16]; SELF.HighCredit := HashVal % 50000; SELF.Balance := TRUNCATE((HashVal % 50000) * ((HashVal % 100 + 1) / 100)); END; base_kids := NORMALIZE( BlankKids, TotalChildren, CreateKids(LEFT,COUNTER)); base_kids_dist := DISTRIBUTE(base_kids,HASH32(PersonID)); This process is similar to the one used for the parent records. This time, instead of passing in a hash value, a local attribute does that work inside the TRANSFORM. Just as before, the hash value is used to select the actual data to go in each field of the record. The interesting bit here is the expression to determine the PersonID field value. Since we're generating 5,000,000 child records it would be very simple to just give each parent five children. However, real-world data rarely looks like that. Therefore, the CHOOSE function is used to select a different method for each set of a million child records. The first million uses the first IF expression, and the second million uses the second, and so on... This creates a varying number of children for each parent, ranging from one to nine. Create the Nested Child Dataset Records Layout_Combined AddKids(Layout_Combined L, base_kids R) := TRANSFORM SELF.Accounts := L.Accounts + ROW({R.Account,R.OpenDate,R.IndustryCode, R.AcctType,R.AcctRate,R.Code1, R.Code2,R.HighCredit,R.Balance}, Layout_Accounts); SELF := L; END; base_combined := DENORMALIZE( base_people_dist, base_kids_dist, LEFT.PersonID = RIGHT.PersonID, AddKids(LEFT, RIGHT)); Now that we have separate recordsets of parent and child records, the next step is to combine them into a single dataset with each parent's child data nested within the same physical record as the parent. The reason for nesting the child data this way is to allow easy parent-child queries in the Data Refinery and Rapid data Delivery Engine without requiring the use of separate JOIN steps to make the links between the parent and child records. To build the nested child dataset requires the DENORMALIZE operation. This operation finds the links between the parent records and their associated children, calling the TRANSFORM function as many times as there are child records for each parent. The interesting technique here is the use of the ROW function to construct each additional nested child record. This is done to eliminate the linking field (PersonID) from each child record stored in the combined dataset, since it is the same value as contained in the parent record's PersonID field. Write Files to Disk O1 := OUTPUT(PROJECT(base_people_dist,Layout_Person),,'~PROGGUIDE::EXAMPLEDATA::People',OVERWRITE); O2 := OUTPUT(base_kids_dist,,'~PROGGUIDE::EXAMPLEDATA::Accounts',OVERWRITE); O3 := OUTPUT(base_combined,,'~PROGGUIDE::EXAMPLEDATA::PeopleAccts',OVERWRITE); P1 := PARALLEL(O1,O2,O3); These OUTPUT attribute definitions will write the datasets to disk. They are written as attribute definitions because they will be used in a SEQUENTIAL action. The PARALLEL action attribute simply indicates that all these disk writes can occur “simultaneously” if the optimizer decides it can do that. The first OUTPUT uses a PROJECT to produce the parent records as a separate file because the data was originally generated into a RECORD structure that contains the nested child DATASET field (Accounts) in preparation for creating the third file. The PROJECT eliminates that empty Accounts field from the output for this dataset. D1 := DATASET('~PROGGUIDE::EXAMPLEDATA::People', {Layout_Person,UNSIGNED8 RecPos{virtual(fileposition)}}, THOR); D2 := DATASET('~PROGGUIDE::EXAMPLEDATA::Accounts', {Layout_Accounts_Link,UNSIGNED8 RecPos{virtual(fileposition)}},THOR); D3 := DATASET('~PROGGUIDE::EXAMPLEDATA::PeopleAccts', {,MAXLENGTH(1000) Layout_Combined,UNSIGNED8 RecPos{virtual(fileposition)}},THOR); These DATASET declarations are needed to be able to build indexes. The UNSIGNED8 RecPos fields are the virtual fields (they only exist at runtime and not on disk) that are the internal record pointers. They're declared here to be able to reference them in the subsequent INDEX declarations. I1 := INDEX(D1,{PersonID,RecPos},'~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID'); I2 := INDEX(D2,{PersonID,RecPos},'~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID'); I3 := INDEX(D3,{PersonID,RecPos},'~PROGGUIDE::EXAMPLEDATA::KEYS::PeopleAccts.PersonID'); B1 := BUILD(I1,OVERWRITE); B2 := BUILD(I2,OVERWRITE); B3 := BUILD(I3,OVERWRITE); P2 := PARALLEL(B1,B2,B3); These INDEX declarations allow the BUILD actions to use the single-parameter form. Once again, the PARALLEL action attribute indicates the index build may be done all at the same time. SEQUENTIAL(P1,P2); This SEQUENTIAL action simply says, “write all the data files to disk, and then build the indexes.” Defining the Files Once the datasets and indexes have been written to disk you must declare the files in order to use them in the example ECL code in the rest of the articles. These declarations are contained in the DeclareData.ECL file. To make them available to the rest of the example code you simply need to IMPORT it. Therefore, at the beginning of each example you will find this line of code: IMPORT $; This IMPORTs all the files in the ProgrammersGuide folder (including the DeclareData MODULE structure definition). Referencing anything from DeclareData is done by prepending $.DeclareData to the name of the EXPORT definition you need to use, like this: MyFile := $.DeclareData.Person.File; //rename $DeclareData.Person.File to MyFile to make //subsequent code simpler Here is some of the code contained in the DeclareData.ECL file: EXPORT DeclareData := MODULE EXPORT Layout_Person := RECORD UNSIGNED3 PersonID; STRING15 FirstName; STRING25 LastName; STRING1 MiddleInitial; STRING1 Gender; STRING42 Street; STRING20 City; STRING2 State; STRING5 Zip; END; EXPORT Layout_Accounts := RECORD STRING20 Account; STRING8 OpenDate; STRING2 IndustryCode; STRING1 AcctType; STRING1 AcctRate; UNSIGNED1 Code1; UNSIGNED1 Code2; UNSIGNED4 HighCredit; UNSIGNED4 Balance; END; EXPORT Layout_Accounts_Link := RECORD UNSIGNED3 PersonID; Layout_Accounts; END; SHARED Layout_Combined := RECORD,MAXLENGTH(1000) Layout_Person; DATASET(Layout_Accounts) Accounts; END; EXPORT Person := MODULE EXPORT File := DATASET('~PROGGUIDE::EXAMPLEDATA::People',Layout_Person, THOR); EXPORT FilePlus := DATASET('~PROGGUIDE::EXAMPLEDATA::People', {Layout_Person,UNSIGNED8 RecPos{virtual(fileposition)}}, THOR); END; EXPORT Accounts := DATASET('~PROGGUIDE::EXAMPLEDATA::Accounts', {Layout_Accounts_Link, UNSIGNED8 RecPos{virtual(fileposition)}}, THOR); EXPORT PersonAccounts:= DATASET('~PROGGUIDE::EXAMPLEDATA::PeopleAccts', {Layout_Combined, UNSIGNED8 RecPos{virtual(fileposition)}}, THOR); EXPORT IDX_Person_PersonID := INDEX(Person, {PersonID,RecPos}, '~PROGGUIDE::EXAMPLEDATA::KEYS::People.PersonID'); EXPORT IDX_Accounts_PersonID := INDEX(Accounts, {PersonID,RecPos}, '~PROGGUIDE::EXAMPLEDATA::KEYS::Accounts.PersonID'); EXPORT IDX_PersonAccounts_PersonID := INDEX(PersonAccounts, {PersonID,RecPos}, '~PROGGUIDE::EXAMPLEDATA::KEYS::PeopleAccts.PersonID'); END; By using a MODULE structure as a container, all the DATASET and INDEX declarations are in a single attribute editor window. This makes maintenance and update simple while allowing complete access to them all.