<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Working_with_XML_Data">
<title>Working with XML Data</title>
<para>Data is not always handed to you in nice, easy-to-work-with,
fixed-length flat files; it comes in many forms. One form growing in usage
every day is XML. ECL has a number of ways of handling XML data—some obvious
and some not so obvious.</para>
<para><emphasis role="bold">NOTE:</emphasis> XML reading and parsing can
consume a large amount of memory, depending on the usage. In particular, if
the specified XPATH matches a very large amount of data, then a large data
structure will be provided to the transform. Therefore, the more you match,
the more resources you consume per match. For example, if you have a very
large document and you match an element near the root that encompasses
virtually the whole document, then the entire document will be constructed
as a referenceable structure that your ECL code can access.</para>
<sect2 id="Simple_XML_Data_Handling">
<title>Simple XML Data Handling</title>
<para>The XML options on DATASET and OUTPUT allow you to easily work with
simple XML data. For example, consider an XML file that looks like this
(this data is generated by the code in GenData.ECL):</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
<area>
<code>
215
</code>
<state>
PA
</state>
<description>
Pennsylvania (Philadelphia area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
<area>
<code>
216
</code>
<state>
OH
</state>
<description>
Ohio (Cleveland area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
</timezones>
</programlisting>
<para>This file can be declared for use in your ECL code (it is defined as
the TimeZonesXML DATASET in the DeclareData MODULE structure) like
this:</para>
<programlisting>EXPORT TimeZonesXML :=
  DATASET('~PROGGUIDE::EXAMPLEDATA::XML_timezones',
          {STRING code,
           STRING state,
           STRING description,
           STRING timezone{XPATH('zone')}},
          XML('timezones/area'));
</programlisting>
<para>This makes the data contained within each XML tag in the file
available for use in your ECL code just like any flat-file dataset. The
field names in the RECORD structure (in this case, in-lined in the DATASET
declaration) duplicate the tag names in the file. The use of the XPATH
modifier on the timezone field allows us to specify that the field comes
from the <zone> tag. This mechanism allows us to name fields
differently from their tag names.</para>
<para>By defining the fields as STRING types without specifying their
length, you can be sure you're getting all the data—including any
carriage returns, line feeds, and tabs in the XML file that are contained
within the field tags (as are present in this file). This simple OUTPUT
shows the result (this and all subsequent code examples in this article
are contained in the XMLcode.ECL file).</para>
<programlisting>IMPORT $;
ds := $.DeclareData.timezonesXML;
OUTPUT(ds);</programlisting>
<para>Notice that the result displayed in the ECL IDE program contains
squares in the data—these are the carriage returns, line feeds, and tabs
from the file. You can get rid of these extraneous characters by simply
passing the records through a PROJECT operation, like this:</para>
<programlisting>StripIt(STRING str) := REGEXREPLACE('[\r\n\t]',str,'');
RECORDOF(ds) DoStrip(ds L) := TRANSFORM
  SELF.code := StripIt(L.code);
  SELF.state := StripIt(L.state);
  SELF.description := StripIt(L.description);
  SELF.timezone := StripIt(L.timezone);
END;
StrippedRecs := PROJECT(ds,DoStrip(LEFT));
OUTPUT(StrippedRecs);
</programlisting>
<para>The use of the REGEXREPLACE function makes the process very simple.
Its first parameter is a standard Perl regular expression representing the
characters to look for: carriage return (\r), line feed (\n), and tab
(\t).</para>
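<para>REGEXREPLACE can also re-arrange matched text by using capture
groups, referenced as $1, $2, and so on in its third parameter. This small
standalone sketch (not part of the example files) shows both the simple
strip used above and a capture-group replacement:</para>
<programlisting>//strip control characters -- no capture group needed
OUTPUT(REGEXREPLACE('[\r\n\t]','PA\t',''));                   //'PA'
//swap two words using capture groups
OUTPUT(REGEXREPLACE('(\\w+), (\\w+)','Jones, Fred','$2 $1')); //'Fred Jones'
</programlisting>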
<para>You can now operate on the StrippedRecs recordset (or
ProgGuide.TimeZonesXML dataset) just as you would with any other. For
example, you might want to simply filter out unnecessary fields and
records and write the result to a new XML file to pass on, something like
this:</para>
<programlisting>InterestingRecs := StrippedRecs((INTEGER)code BETWEEN 301 AND 303);
OUTPUT(InterestingRecs,{code,timezone},
       '~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
       XML('area',HEADING('<?xml version=1.0 ...?>\n<timezones>\n','</timezones>')),OVERWRITE);
</programlisting>
<para>The resulting XML file looks like this:</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
<area><code>301</code><zone>Eastern Time Zone</zone></area>
<area><code>302</code><zone>Eastern Time Zone</zone></area>
<area><code>303</code><zone>Mountain Time Zone</zone></area>
</timezones>
</programlisting>
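<para>If you later need to read that new file back in, the declaration is a
mirror image of the OUTPUT. This sketch (assuming the file just written
above, with field names matching its tag names) shows the idea:</para>
<programlisting>TZ300 := DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
                 {STRING code, STRING zone},
                 XML('timezones/area'));
OUTPUT(TZ300);
</programlisting>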
- </sect2>
<sect2 id="Complex_XML_Data_Handling">
<title>Complex XML Data Handling</title>
<para>You can create much more complex XML output by using the CSV option
on OUTPUT instead of the XML option. The XML option will only produce the
straightforward style of XML shown above. However, some applications
require the use of XML attributes inside the tags. This code demonstrates
how to produce that format:</para>
<programlisting>CRLF := (STRING)x'0D0A';
OutRec := RECORD
  STRING Line;
END;
OutRec DoComplexXML(InterestingRecs L) := TRANSFORM
  SELF.Line := '  <area code="' + L.code + '">' + CRLF +
               '    <zone>' + L.timezone + '</zone>' + CRLF +
               '  </area>';
END;
ComplexXML := PROJECT(InterestingRecs,DoComplexXML(LEFT));
OUTPUT(ComplexXML,,'~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
       CSV(HEADING('<?xml version=1.0 ...?>'+CRLF+'<timezones>'+CRLF,'</timezones>')),OVERWRITE);
</programlisting>
<para>The RECORD structure defines a single output field to contain each
logical XML record that you build with the TRANSFORM function. The PROJECT
operation builds all of the individual output records, then the CSV option
on the OUTPUT action specifies the file header and footer records (in this
case, the opening and closing XML file tags), producing the result shown
here:</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
  <area code="301">
    <zone>Eastern Time Zone</zone>
  </area>
  <area code="302">
    <zone>Eastern Time Zone</zone>
  </area>
  <area code="303">
    <zone>Mountain Time Zone</zone>
  </area>
</timezones>
</programlisting>
<para>So, if using the CSV option is the way to OUTPUT complex XML data
formats, how can you access existing complex-format XML data and use ECL
to work with it?</para>
<para>The answer lies in using the XPATH option on field definitions in
the input RECORD structure, like this:</para>
<programlisting>NewTimeZones :=
  DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
          {STRING area {XPATH('<>')}},
          XML('timezones/area'));
</programlisting>
<para>The specified {XPATH('<>')} option basically says “give me
everything that's in this XML tag, including the tags themselves” so that
you can then use ECL to parse through the text to do your work. The
NewTimeZones data records look like this one (since each includes all the
carriage returns and line feeds) when you do a simple OUTPUT and copy a
record to a text editor:</para>
<programlisting><area code="301">
<zone>Eastern Time Zone</zone>
</area></programlisting>
<para>You can then use any of the string handling functions in ECL or the
Service Library functions in StringLib or UnicodeLib (see the
<emphasis>Services Library Reference</emphasis>) to work with the text.
However, the most powerful ECL text parsing tool is the PARSE function,
which allows you to define regular expressions and/or ECL PATTERN
attribute definitions to process the data.</para>
<para>This example uses the TRANSFORM version of PARSE to get at the XML
data:</para>
<programlisting>{ds.code, ds.timezone} Xform(NewTimeZones L) := TRANSFORM
  SELF.code := XMLTEXT('@code');
  SELF.timezone := XMLTEXT('zone');
END;
ParsedZones := PARSE(NewTimeZones,area,Xform(LEFT),XML('area'));
OUTPUT(ParsedZones);
</programlisting>
<para>In this code we're using the XML form of PARSE and its associated
XMLTEXT function to parse the data from the complex XML structure. The
parameter to XMLTEXT is the XPATH to the data we're interested in (the
major subset of the XPATH standard that ECL supports is documented in the
Language Reference in the RECORD structure discussion).</para>
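<para>As a brief illustration (using the tags from the example above),
XMLTEXT accepts the same kinds of XPATH forms as the field modifier:</para>
<programlisting>XMLTEXT('@code')    //an attribute of the current tag
XMLTEXT('zone')     //the text content of a child tag
XMLTEXT('zone[1]')  //a specific instance of a repeated child tag
</programlisting>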
</sect2>
<sect2 id="Input_with_Complex_XML_Formats">
<title>Input with Complex XML Formats</title>
<para>XML data comes in many possible formats, and some of them make use
of “child datasets” such that a given tag may contain multiple instances
of other tags that contain individual field tags themselves.</para>
<para>Here's an example of such a complex structure using UCC data. An
individual Filing may contain one or more Transactions, which in turn may
contain multiple Debtor and SecuredParty records:</para>
<programlisting><UCC>
 <Filing number='5200105'>
  <Transaction ID='5'>
   <StartDate>08/01/2001</StartDate>
   <LapseDate>08/01/2006</LapseDate>
   <FormType>UCC 1 FILING STATEMENT</FormType>
   <AmendType>NONE</AmendType>
   <AmendAction>NONE</AmendAction>
   <EnteredDate>08/02/2002</EnteredDate>
   <ReceivedDate>08/01/2002</ReceivedDate>
   <ApprovedDate>08/02/2002</ApprovedDate>
   <Debtor entityId='19'>
    <IsBusiness>true</IsBusiness>
    <OrgName><![CDATA[BOGUS LABORATORIES, INC.]]></OrgName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[334 SOUTH 900 WEST]]></Address1>
    <Address4><![CDATA[SALT LAKE CITY 45 84104]]></Address4>
    <City><![CDATA[SALT LAKE CITY]]></City>
    <State>UTAH</State>
    <Zip>84104</Zip>
    <OrgType>CORP</OrgType>
    <OrgJurisdiction><![CDATA[SALT LAKE CITY]]></OrgJurisdiction>
    <OrgID>654245-0142</OrgID>
    <EnteredDate>08/02/2002</EnteredDate>
   </Debtor>
   <Debtor entityId='7'>
    <IsBusiness>false</IsBusiness>
    <FirstName><![CDATA[FRED]]></FirstName>
    <LastName><![CDATA[JONES]]></LastName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[1038 E. 900 N.]]></Address1>
    <Address4><![CDATA[OGDEN 45 84404]]></Address4>
    <City><![CDATA[OGDEN]]></City>
    <State>UTAH</State>
    <Zip>84404</Zip>
    <OrgType>NONE</OrgType>
    <EnteredDate>08/02/2002</EnteredDate>
   </Debtor>
   <SecuredParty entityId='20'>
    <IsBusiness>true</IsBusiness>
    <OrgName><![CDATA[WELLS FARGO BANK]]></OrgName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[ATTN: LOAN OPERATIONS CENTER]]></Address1>
    <Address3><![CDATA[P.O. BOX 9120]]></Address3>
    <Address4><![CDATA[BOISE 13 83707-2203]]></Address4>
    <City><![CDATA[BOISE]]></City>
    <State>IDAHO</State>
    <Zip>83707-2203</Zip>
    <Status>ACTIVE</Status>
    <EnteredDate>08/02/2002</EnteredDate>
   </SecuredParty>
   <Collateral>
    <Action>ADD</Action>
    <Description><![CDATA[ALL ACCOUNTS]]></Description>
    <EffectiveDate>08/01/2002</EffectiveDate>
   </Collateral>
  </Transaction>
  <Transaction ID='375799'>
   <StartDate>08/01/2002</StartDate>
   <LapseDate>08/01/2006</LapseDate>
   <FormType>UCC 3 AMENDMENT</FormType>
   <AmendType>TERMINATION BY DEBTOR</AmendType>
   <AmendAction>NONE</AmendAction>
   <EnteredDate>02/23/2004</EnteredDate>
   <ReceivedDate>02/18/2004</ReceivedDate>
   <ApprovedDate>02/23/2004</ApprovedDate>
  </Transaction>
 </Filing>
</UCC>
</programlisting>
<para>The key to working with this type of complex XML data is the RECORD
structures that define the layout of the XML data.</para>
<programlisting>CollateralRec := RECORD
  STRING Action {XPATH('Action')};
  STRING Description {XPATH('Description')};
  STRING EffectiveDate {XPATH('EffectiveDate')};
END;
PartyRec := RECORD
  STRING PartyID {XPATH('@entityId')};
  STRING IsBusiness {XPATH('IsBusiness')};
  STRING OrgName {XPATH('OrgName')};
  STRING FirstName {XPATH('FirstName')};
  STRING LastName {XPATH('LastName')};
  STRING Status {XPATH('Status[1]')};
  STRING Address1 {XPATH('Address1')};
  STRING Address2 {XPATH('Address2')};
  STRING Address3 {XPATH('Address3')};
  STRING Address4 {XPATH('Address4')};
  STRING City {XPATH('City')};
  STRING State {XPATH('State')};
  STRING Zip {XPATH('Zip')};
  STRING OrgType {XPATH('OrgType')};
  STRING OrgJurisdiction {XPATH('OrgJurisdiction')};
  STRING OrgID {XPATH('OrgID')};
  STRING10 EnteredDate {XPATH('EnteredDate')};
END;
TransactionRec := RECORD
  STRING TransactionID {XPATH('@ID')};
  STRING10 StartDate {XPATH('StartDate')};
  STRING10 LapseDate {XPATH('LapseDate')};
  STRING FormType {XPATH('FormType')};
  STRING AmendType {XPATH('AmendType')};
  STRING AmendAction {XPATH('AmendAction')};
  STRING10 EnteredDate {XPATH('EnteredDate')};
  STRING10 ReceivedDate {XPATH('ReceivedDate')};
  STRING10 ApprovedDate {XPATH('ApprovedDate')};
  DATASET(PartyRec) Debtors {XPATH('Debtor')};
  DATASET(PartyRec) SecuredParties {XPATH('SecuredParty')};
  CollateralRec Collateral {XPATH('Collateral')};
END;
UCC_Rec := RECORD
  STRING FilingNumber {XPATH('@number')};
  DATASET(TransactionRec) Transactions {XPATH('Transaction')};
END;
UCC := DATASET('~PROGGUIDE::EXAMPLEDATA::XML_UCC',UCC_Rec,XML('UCC/Filing'));
</programlisting>
<para>Building from the bottom up, these RECORD structures combine to
create the final UCC_Rec layout that defines the entire format of this XML
data.</para>
<para>The XML option on the final DATASET declaration specifies the XPATH
to the record tag (Filing), and the child DATASET “field” definitions in
the RECORD structures handle the multiple-instance issues. Because ECL is
case insensitive and XML syntax is case sensitive, it is necessary to use
XPATH to define all the field tags. The PartyRec RECORD structure
works with both the Debtors and SecuredParties child DATASET fields
because both contain the same tags and information.</para>
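<para>To illustrate the case-sensitivity point: a field defined without an
explicit XPATH gets a default XPATH of its lower-cased field name, so it
would miss a mixed-case tag. This fragment is a sketch of that pitfall
(SketchRec and ZipCode are hypothetical names, assuming the UCC data
above):</para>
<programlisting>SketchRec := RECORD
  STRING Zip;                    //default XPATH is 'zip' -- misses <Zip>
  STRING ZipCode {XPATH('Zip')}; //explicitly matches the <Zip> tag
END;
</programlisting>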
<para>Once you've defined the layout, how can you extract the data into a
normalized relational structure to work with it in the supercomputer?
NORMALIZE is the answer. NORMALIZE needs to know how many times to call
its TRANSFORM, so you must use the TABLE function to get the counts, like
this:</para>
<programlisting>XactTbl := TABLE(UCC,{INTEGER XactCount := COUNT(Transactions), UCC});
OUTPUT(XactTbl);</programlisting>
<para>This TABLE function gets the counts of the multiple Transaction
records per Filing so that we can use NORMALIZE to extract them into a
table of their own.</para>
<programlisting>Out_Transacts := RECORD
  STRING FilingNumber;
  STRING TransactionID;
  STRING10 StartDate;
  STRING10 LapseDate;
  STRING FormType;
  STRING AmendType;
  STRING AmendAction;
  STRING10 EnteredDate;
  STRING10 ReceivedDate;
  STRING10 ApprovedDate;
  DATASET(PartyRec) Debtors;
  DATASET(PartyRec) SecuredParties;
  CollateralRec Collateral;
END;
Out_Transacts Get_Transacts(XactTbl L, INTEGER C) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF := L.Transactions[C];
END;
Transacts := NORMALIZE(XactTbl,LEFT.XactCount,Get_Transacts(LEFT,COUNTER));
OUTPUT(Transacts);
</programlisting>
<para>This NORMALIZE extracts all the Transactions into a separate
recordset with just one Transaction per record, with the parent information
(the Filing number) appended. However, each record here still contains
multiple Debtor and SecuredParty child records.</para>
<programlisting>PartyCounts := TABLE(Transacts,
                     {INTEGER DebtorCount := COUNT(Debtors),
                      INTEGER PartyCount := COUNT(SecuredParties),
                      Transacts});
OUTPUT(PartyCounts);
</programlisting>
- <programlisting>Out_Parties := RECORD
- STRING FilingNumber;
- STRING TransactionID;
- PartyRec;
- END;
- Out_Parties Get_Debtors(PartyCounts L, INTEGER C) := TRANSFORM
- SELF.FilingNumber := L.FilingNumber;
- SELF.TransactionID := L.TransactionID;
- SELF := L.Debtors[C];
- END;
- TransactDebtors := NORMALIZE( PartyCounts,
- LEFT.DebtorCount,
- Get_Debtors(LEFT,COUNTER));
- OUTPUT(TransactDebtors);
- </programlisting>
- <para>This NORMALIZE extracts all the Debtors into a separate
- recordset.</para>
<programlisting>Out_Parties Get_Parties(PartyCounts L, INTEGER C) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF.TransactionID := L.TransactionID;
  SELF := L.SecuredParties[C];
END;
TransactParties := NORMALIZE(PartyCounts,
                             LEFT.PartyCount,
                             Get_Parties(LEFT,COUNTER));
OUTPUT(TransactParties);
</programlisting>
<para>This NORMALIZE extracts all the SecuredParties into a separate
recordset. With this, we've now broken out all the child records into
their own normalized relational structure that we can work with
easily.</para>
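<para>As an aside, NORMALIZE also has a second form that takes the child
dataset itself as its second parameter, so the TRANSFORM receives each
child record as RIGHT and no count is needed. This sketch (an alternative
to the counted form, not from the example files; GetDebtor and AltDebtors
are hypothetical names) would extract the Debtors without the TABLE
counting step:</para>
<programlisting>Out_Parties GetDebtor(Transacts L, PartyRec R) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF.TransactionID := L.TransactionID;
  SELF := R;  //the current child Debtor record
END;
AltDebtors := NORMALIZE(Transacts,LEFT.Debtors,GetDebtor(LEFT,RIGHT));
OUTPUT(AltDebtors);
</programlisting>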
</sect2>
<sect2 id="Piping_to_Third-Party_Tools">
<title>Piping to Third-Party Tools</title>
<para>One other way to work with XML data is to use third-party tools that
you have adapted for use in the supercomputer, giving you the advantage of
working with previously proven technology and the benefit of running that
technology in parallel on all the supercomputer nodes at once.</para>
<para>The technique is simple: just define the input file as a data stream
and use the PIPE option on DATASET to process the data in its native form.
Once the processing is complete, you can OUTPUT the result in whatever
form it comes out of the third-party tool, something like this example
code (non-functional):</para>
<programlisting>Rec := RECORD
  STRING1 char;
END;
TimeZones := DATASET('timezones.xml',Rec,PIPE('ThirdPartyTool.exe'));
OUTPUT(TimeZones,,'ProcessedTimezones.xml');
</programlisting>
<para>The key to this technique is the STRING1 field definition. This
makes the input and output just a one-byte-at-a-time data stream that
flows into the third-party tool and back out into your ECL code in its
native format. You don't even need to know what that format is. You could
also use this technique with the PIPE option on OUTPUT.</para>
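<para>A through-pipe on the way out might look something like this sketch
(again non-functional; the tool name is hypothetical, and the PIPE option
replaces the output file name):</para>
<programlisting>OUTPUT(TimeZones,,PIPE('ThirdPartyFormatter.exe'));
</programlisting>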
</sect2>
</sect1>