HPCC Data Tutorial
Boca Raton Documentation Team

We welcome your comments and feedback about this document via email to docfeedback@hpccsystems.com. Please include "Documentation Feedback" in the subject line and reference the document name, page numbers, and current version number in the text of the message.

LexisNexis and the Knowledge Burst logo are registered trademarks of Reed Elsevier Properties Inc., used under license. HPCC Systems® is a registered trademark of LexisNexis Risk Data Management Inc. Other products and services may be trademarks or registered trademarks of their respective companies.

All names and example data used in this manual are fictitious. Any similarity to actual persons, living or dead, is purely coincidental.
Introduction

The ECL Development Process

This tutorial provides a walk-through of the development process, from beginning to end, and is designed to be an introduction to working with data on any HPCC Systems platform. High Performance Computing Cluster (HPCC) is a massively parallel processing computing platform that solves Big Data problems; see http://www.hpccsystems.com/Why-HPCC/How-it-works for more details. We will write code in ECL to process our data and query it. Enterprise Control Language (ECL) is a declarative, data-centric programming language used to manage all aspects of the massive data joins, sorts, and builds that differentiate HPCC from other technologies in its ability to provide flexible data analysis on a massive scale.

This tutorial assumes:

• You have a running HPCC platform. This can be a VM Edition or a single-node or multi-node HPCC platform.
• You have the ECL IDE installed and configured. The ECL IDE (Integrated Development Environment) is the tool used to create queries into your data and the ECL files with which to build your queries.

In this tutorial, we will:

• Download a raw data file. Links to the data file are available at http://hpccsystems.com/community/docs/data-tutorial-guide. The download is approximately 30 MB (compressed) and is available in either ZIP or .tar.gz format; choose the appropriate link.
• Spray the file to a Data Refinery (Thor) cluster. A spray or import is the relocation of a data file from one location to an HPCC cluster. The term spray was adopted due to the nature of the file movement: the file is partitioned across all nodes within the cluster.
• Examine the data and determine the pre-processing we need to perform.
• Pre-process the data to produce a new data file.
• Determine the types of queries we want.
• Create the queries.
• Test the queries.
• Deploy them to a Rapid Data Delivery Engine (RDDE) cluster, also known as a Roxie cluster.
Working with Data

The Original Data

In this scenario, we receive a structured data file containing records with people's names and addresses. (The HPCC platform also supports unstructured data, but this example is simpler.) The file is documented in the following table:

Field Name    Type                 Description
FirstName     15 Character String  First name
LastName      25 Character String  Last name
MiddleName    15 Character String  Middle name
Zip           5 Character String   ZIP code
Street        42 Character String  Street address
City          20 Character String  City
State         2 Character String   State

This gives us a record length of 124 (the total of all field lengths). You will need to know this length for the file spray process.
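As a quick check, the fixed field widths sum to that record length:

15 + 25 + 15 + 5 + 42 + 20 + 2 = 124 bytes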
Load the Incoming Data File to your Landing Zone

A Landing Zone (or Drop Zone) is a physical storage location defined in your HPCC environment. A daemon (DaFileSrv) must be running on that server to enable file sprays and desprays. For smaller data files, you can use the upload/download file utility in ECL Watch (a Web-based interface to your HPCC platform). The sample data file is approximately 100 MB.

1. Download the sample data file from the HPCC Systems® portal. The data file is available from links found on http://hpccsystems.com/community/docs/data-tutorial-guide. The download is approximately 30 MB (compressed) and is available in either ZIP or tar.gz format (OriginalPerson.zip or OriginalPerson.tar.gz).
2. Extract it to a folder on your local machine.
3. In your browser, go to the ECL Watch URL. For example, http://nnn.nnn.nnn.nnn:8010, where nnn.nnn.nnn.nnn is your ESP Server's IP address. (The ESP, or Enterprise Services Platform, Server is the communication-layer server in your HPCC environment.) Your IP address could be different from the ones shown in the example images; use the IP address provided by your installation.
4. From the ECL Watch home page, click on the Files icon, then click the Landing Zones link from the navigation sub-menu.
5. Press the Upload action button on the Landing Zones tab. A dialog opens where you can choose a file to upload.
6. Browse the files on your local machine, select the file to upload, and then press the Open button. The file you selected displays in the File Uploader dialog.
7. Press the Start button to complete the file upload.
Spray the Data File to your Thor Cluster

To use the data file in our HPCC cluster, we must first "spray" it to a Thor cluster. A spray or import is the relocation of a data file from one location to a Thor cluster. The term spray was adopted due to the nature of the file movement: the file is partitioned across all nodes within a cluster.

In this example, the file is on your Landing Zone and is named OriginalPerson. We are going to spray it to our Thor cluster and give it a logical name of tutorial::YN::OriginalPerson, where YN are your initials. The Distributed File Utility maintains a list of logical files and their corresponding physical file locations.

1. Open ECL Watch in your browser using the following URL: http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the port; the default port is 8010).
2. From the ECL Watch home page, click on the Files icon, then click the Landing Zones link from the navigation sub-menu.
3. On the Landing Zones tab, click on the arrow next to your mydropzone container to expand the list of uploaded files.
4. Find the file you want to spray in the list (OriginalPerson) and check the box next to that file name to select it. Once you select the file, the Spray action buttons become enabled.
5. Press the Fixed action button. This indicates that you are spraying a fixed-width file. The Spray Fixed dialog displays, with the Target name field automatically filled in with the selected file.
6. Choose the mythor cluster from the Group drop list.
7. Fill in the Record Length (124).
8. Fill in the Target Scope using the naming convention described earlier: tutorial::YN (remember, YN are your initials).
9. Make sure the Replicate box is checked. Note: This option is only available on systems where replication has been enabled.
10. Press the Spray button. The workunit details page displays, where you can view the progress of the spray.
Once the spray is complete, we can proceed.

Begin Coding

In this portion of the tutorial, we will write ECL code to define the data file and execute simple queries on it, so we can evaluate it and determine any necessary pre-processing.

1. Start the ECL IDE (Start >> All Programs >> HPCC Systems >> ECL IDE).
2. Log in to your environment.
3. Right-click on the My Files folder in the Repository window, and select Insert Folder from the pop-up menu.
4. For the purposes of this tutorial, let's create a folder called TutorialYourName (where YourName is your name). Enter TutorialYourName for the label, then press the OK button.
5. Right-click on the TutorialYourName folder, and select Insert File from the pop-up menu.
6. Enter Layout_People for the label, then press the OK button. A Builder Window opens. Notice that some text has been written for you in the window. This helps you remember that the name of the file (Layout_People) must always exactly match the name of the single EXPORT definition (Layout_People) contained in that file. This is a requirement: one EXPORT definition per file, and its name must match the filename.
7. Write the following code in the Builder workspace:

EXPORT Layout_People := RECORD
STRING15 FirstName;
STRING25 LastName;
STRING15 MiddleName;
STRING5 Zip;
STRING42 Street;
STRING20 City;
STRING2 State;
END;

8. Press the syntax check button on the main toolbar (or press F7). It is always a good idea to check syntax before submitting.

This file defines the record structure for the data file. Next, we will examine the data.

Examine the Data

In this section, we will look at the data and determine what pre-processing, if any, we want to perform. This is the step in the development process where we convert the raw data into a form we can use.

1. Right-click on the TutorialYourName folder, and select Insert File from the pop-up menu.
2. Enter File_OriginalPerson for the label, then press the OK button. A Builder Window opens.
3. Write the following code (remember to replace YN with your initials):

IMPORT TutorialYourName;
EXPORT File_OriginalPerson :=
DATASET('~tutorial::YN::OriginalPerson',TutorialYourName.Layout_People,THOR);
4. Press the syntax check button on the main toolbar (or press F7) to check the syntax.

This defines the dataset. Next, we will examine the data.

5. Open a new Builder Window (CTRL+N) and write the following code (remember to replace YourName with your name):

IMPORT TutorialYourName;
COUNT(TutorialYourName.File_OriginalPerson);
6. Press the syntax check button on the main toolbar (or press F7) to check the syntax.
7. Make sure the selected cluster is your Thor cluster, then press the Submit button. Note that your target cluster might have a different name. When the workunit completes, it displays a green checkmark.
8. Select the Workunit tab (the one with the number next to the checkmark) and select the Result 1 tab (it may already be selected). This shows us that there are 841,400 records in the data file.
9. Select the Builder tab and change COUNT to OUTPUT, as shown below:

IMPORT TutorialYourName;
OUTPUT(TutorialYourName.File_OriginalPerson);

Note: Only the action changed, from COUNT to OUTPUT.

10. Check the syntax and, if there are no errors, press the Submit button.
11. When it completes, select the Workunit tab, then select the Result 1 tab. Notice the names are in mixed case. For our purposes, it will be easier to have all the names in uppercase. Converting them demonstrates one of the steps in the basic process of preparing data (Extract, Transform, and Load, or ETL) using ECL.
12. Close the Builder Window.
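If you only want a quick sample rather than the whole file in the result grid, you can limit how many records come back. A minimal sketch using the standard CHOOSEN function against the same dataset definition:

IMPORT TutorialYourName;
// Return only the first 50 records for a quick look at the raw data
OUTPUT(CHOOSEN(TutorialYourName.File_OriginalPerson, 50));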
Process the Data

In this section, we will write code to convert the original data so that all names are in uppercase. We will then write this new file to our Thor cluster.

1. Right-click on the TutorialYourName folder, and select Insert File from the pop-up menu.
2. Name this one BWR_ProcessRawData and write the following code (changing YN and YourName as before):

IMPORT TutorialYourName, Std;
TutorialYourName.Layout_People toUpperPlease(TutorialYourName.Layout_People pInput)
:= TRANSFORM
SELF.FirstName := Std.Str.ToUpperCase(pInput.FirstName);
SELF.LastName := Std.Str.ToUpperCase(pInput.LastName);
SELF.MiddleName := Std.Str.ToUpperCase(pInput.MiddleName);
SELF.Zip := pInput.Zip;
SELF.Street := pInput.Street;
SELF.City := pInput.City;
SELF.State := pInput.State;
END ;
OrigDataset := TutorialYourName.File_OriginalPerson;
UpperedDataset := PROJECT(OrigDataset,toUpperPlease(LEFT));
OUTPUT(UpperedDataset,,'~tutorial::YN::TutorialPerson',OVERWRITE);
3. Check the syntax and, if there are no errors, press the Submit button.
4. When it completes, select the Workunit tab, then select the Result 1 tab. The results show that the process has successfully converted the name fields to uppercase.
5. After you examine the results, close the Builder window.
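If the field-by-field copy above feels verbose, a TRANSFORM can also copy all remaining input fields in a single statement. A minimal variant of the same transform, assuming the same Layout_People definition (only the name fields are assigned explicitly; SELF := pInput fills in the rest):

IMPORT TutorialYourName, Std;
TutorialYourName.Layout_People toUpperPlease(TutorialYourName.Layout_People pInput) := TRANSFORM
  SELF.FirstName  := Std.Str.ToUpperCase(pInput.FirstName);
  SELF.LastName   := Std.Str.ToUpperCase(pInput.LastName);
  SELF.MiddleName := Std.Str.ToUpperCase(pInput.MiddleName);
  SELF := pInput;  // copies Zip, Street, City, and State unchanged
END;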
Using our New Data

Now that we have our data in a useful format and the file is in place, we can write more code to use the new data file. We will determine the indexes we need and create them. For this tutorial, let's assume the field we need to index is the ZIP code field.

In the DATASET definition, we will add a virtual field to the RECORD structure for the file position. This is required for indexes.

1. Insert a File into the TutorialYourName folder. Name it File_TutorialPerson and write this code (changing YN to your initials):

IMPORT TutorialYourName;
EXPORT File_TutorialPerson :=
DATASET('~tutorial::YN::TutorialPerson',
{TutorialYourName.Layout_People,
UNSIGNED8 fpos {virtual(fileposition)}},THOR);
2. Check the syntax and, if there are no errors, press the Submit button. When it completes, it displays a green checkmark.

Index the Data

Next, we will define the INDEX.

1. Insert a File into your Tutorial folder. Name it IDX_PeopleByZip and write this code (changing YN and YourName as before):

IMPORT TutorialYourName;
EXPORT IDX_PeopleByZIP :=
INDEX(TutorialYourName.File_TutorialPerson,{zip,fpos},'~tutorial::YN::PeopleByZipINDEX');
2. Check the syntax.

Next, we will build the index file.

3. Insert a File into the TutorialYourName folder, name it BWR_BuildPeopleByZip, and write this code (replacing YourName with your name):

IMPORT TutorialYourName;
BUILDINDEX(TutorialYourName.IDX_PeopleByZIP,OVERWRITE);
4. Check the syntax and, if there are no errors, press the Submit button.
5. Wait for the workunit to complete, then close the Builder window.
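Once the index has been built, you can read a few entries straight from it to confirm the key was populated. A minimal, optional check, assuming the definitions above; each row shows a ZIP value and the file position of the matching record:

IMPORT TutorialYourName;
// Read up to ten entries for one ZIP code directly from the new index
OUTPUT(CHOOSEN(TutorialYourName.IDX_PeopleByZIP(zip = '33024'), 10));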
Build a Query

Now that we have an index file, we will write a query that uses it.

1. Insert a File into your Tutorial folder. Name it BWR_FetchPeopleByZip and write this code (changing YourName as before):

IMPORT TutorialYourName;
ZipFilter :='33024';
FetchPeopleByZip :=
FETCH(TutorialYourName.File_TutorialPerson,
TutorialYourName.IDX_PeopleByZIP(zip=ZipFilter),
RIGHT.fpos);
OUTPUT(FetchPeopleByZip);
2. Check the syntax and, if there are no errors, press the Submit button.
3. When it completes, select the Workunit tab, then select the Result tab.
4. Examine the result, then close the Builder window.

Note: You can change the value of ZipFilter and resubmit the code to get results for different ZIP codes.
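For comparison, the same records could be retrieved without the index by filtering the dataset directly; the FETCH above instead uses the index to look up the matching file positions. A minimal sketch of the plain filter, assuming the same definitions:

IMPORT TutorialYourName;
// Filters by scanning the logical file rather than seeking through the index
OUTPUT(TutorialYourName.File_TutorialPerson(Zip = '33024'));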
Publishing your Thor Query

Now that we have created an indexed query, the next step is to enable access to it through a Web interface. STORED variables provide a means to pass values as query parameters. In this example, the user can supply the ZIP code so the results are people from that ZIP code.

1. Insert a File into the TutorialYourName folder and name it FetchPeopleByZipService.
2. Write this code (changing YourName as before):

IMPORT TutorialYourName;
STRING10 ZipFilter := '' :STORED('ZIPValue');
resultSet :=
FETCH(TutorialYourName.File_TutorialPerson,
TutorialYourName.IDX_PeopleByZIP(zip=ZipFilter),
RIGHT.fpos);
OUTPUT(resultset);
3. Check the syntax, and save the file.
4. Press the Submit button.
5. When the workunit completes, select the Workunit tab, then select the ECL Watch tab.
6. Press the Publish button on the ECL Watch tab. The Publish dialog displays, with the Job Name field automatically filled in. You can add a comment in the Comment field if you wish, then press Submit.

If there are no error messages, the workunit is published. Leave the builder window open; you will need it again later.

Execute using WsECL

Now that the query is published, we can run it using the WsECL Web service. WsECL provides a Web-based interface to your published query. It also automatically creates an entry form to execute the query.

1. In your browser, go to the following URL: http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the port; the default port is 8002).
2. Click on the + sign next to thor to expand the tree.
3. Click on the fetchpeoplebyzipservice hyperlink. The form for the service displays.
4. Provide a ZIP code (e.g., 33024) in the zipvalue field, select Output Tables from the drop list, then press the Submit button. The results display.
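You can also call the published query directly, without the form, by building the request into the URL. The example below is a sketch (the exact path segments can vary with platform version, and the IP, port, and target name must match your system); ZIPValue is the STORED name defined in the service code:

http://nnn.nnn.nnn.nnn:8002/WsEcl/submit/query/thor/fetchpeoplebyzipservice/json?ZIPValue=33024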
Compile and Publish the Roxie Query

The final step in this process is to publish the indexed query to a Rapid Data Delivery Engine (Roxie) cluster. We will recompile the code with Roxie as the target cluster, then publish it to a Roxie cluster.

1. In the ECL IDE, select the Builder tab on the FetchPeopleByZipService file builder window.
2. Using the Target drop list, select Roxie as the target cluster.
3. In the upper left corner of the Builder window, the Submit button has a drop-down arrow next to it. Select the arrow to expose the Compile option.
4. Select Compile. When the workunit finishes, it displays a green circle indicating it has compiled.

Publish the Roxie Query

Next, we will publish the query to a Roxie cluster.

1. Select the workunit tab for the FetchPeopleByZipService that you just compiled. This opens the workunit in an ECL Watch tab.
2. Press the Publish action button, then verify the information in the dialog and press Submit. This publishes the query.

Run the Roxie Query in WsECL

Now that the query is deployed to a Roxie cluster, we can run it using the WsECL service.

1. In your browser, go to the following URL: http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the port; the default port is 8002).
2. Click on the + sign next to myroxie to expand the tree.
3. Click on the fetchpeoplebyzipservice hyperlink. The form for the service displays.
4. Provide a ZIP code (e.g., 33024), select Output Tables from the drop list, and press the Submit button. The results display.
Summary

Now that you have successfully sprayed raw data onto a cluster, processed it, and deployed a query to an RDDE (Roxie) cluster, what's next? Here is a short list of suggestions for the path you might take from here:

• Create indexes on other fields and create queries using them.
• Write client applications to access your queries using the JSON or SOAP interfaces.
• Look at the resources available on the Links tab. The Links tab provides easy access to a form, a Sample Request, a Sample Response, the WSDL, the XML Schema (XSD), and more.
• Follow the procedures in this tutorial using your own data!