<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Working_with_XML_Data">
<title>Working with XML Data</title>
<para>Data is not always handed to you in nice, easy-to-work-with,
fixed-length flat files; it comes in many forms. One form growing in usage
every day is XML. ECL has a number of ways of handling XML data—some obvious
and some not so obvious.</para>
<para><emphasis role="bold">NOTE:</emphasis> XML reading and parsing can
consume a large amount of memory, depending on the usage. In particular, if
the specified XPATH matches a very large amount of data, then a large data
structure will be provided to the transform. Therefore, the more you match,
the more resources you consume per match. For example, if you have a very
large document and you match an element near the root that encompasses
virtually the whole document, then the entire document will be constructed
as a referenceable structure that your ECL code can access.</para>
<sect2 id="Simple_XML_Data_Handling">
<title>Simple XML Data Handling</title>
<para>The XML options on DATASET and OUTPUT allow you to easily work with
simple XML data. For example, consider an XML file that looks like this
(this data is generated by the code in GenData.ECL):</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
<area>
<code>
215
</code>
<state>
PA
</state>
<description>
Pennsylvania (Philadelphia area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
<area>
<code>
216
</code>
<state>
OH
</state>
<description>
Ohio (Cleveland area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
</timezones>
</programlisting>
<para>This file can be declared for use in your ECL code (it is defined as
the TimeZonesXML DATASET in the DeclareData MODULE structure) like
this:</para>
<programlisting>EXPORT TimeZonesXML :=
  DATASET('~PROGGUIDE::EXAMPLEDATA::XML_timezones',
          {STRING code,
           STRING state,
           STRING description,
           STRING timezone{XPATH('zone')}},
          XML('timezones/area'));
</programlisting>
<para>This makes the data contained within each XML tag in the file
available for use in your ECL code just like any flat-file dataset. The
field names in the RECORD structure (in this case, in-lined in the DATASET
declaration) duplicate the tag names in the file. The use of the XPATH
modifier on the timezone field allows us to specify that the field comes
from the <zone> tag. This mechanism allows us to name fields
differently from their tag names.</para>
<para>By defining the fields as STRING types without specifying their
length, you can be sure you're getting all the data—including any
carriage returns, line feeds, and tabs in the XML file that are contained
within the field tags (as are present in this file). This simple OUTPUT
shows the result (this and all subsequent code examples in this article
are contained in the XMLcode.ECL file).</para>
<programlisting>IMPORT $;
ds := $.DeclareData.timezonesXML;
OUTPUT(ds);</programlisting>
<para>Notice that the result displayed in the ECL IDE program contains
squares in the data—these are the carriage returns, line feeds, and tabs
from the file. You can get rid of these extraneous characters by simply
passing the records through a PROJECT operation, like this:</para>
<programlisting>StripIt(STRING str) := REGEXREPLACE('[\r\n\t]',str,'');
RECORDOF(ds) DoStrip(ds L) := TRANSFORM
  SELF.code := StripIt(L.code);
  SELF.state := StripIt(L.state);
  SELF.description := StripIt(L.description);
  SELF.timezone := StripIt(L.timezone);
END;
StrippedRecs := PROJECT(ds,DoStrip(LEFT));
OUTPUT(StrippedRecs);
</programlisting>
<para>The use of the REGEXREPLACE function makes the process very simple.
Its first parameter is a standard Perl regular expression representing the
characters to look for: carriage return (\r), line feed (\n), and tab
(\t).</para>
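<para>REGEXREPLACE can also re-arrange matched text by using capture
groups, referenced as $1, $2, and so on in its third parameter. This small
standalone sketch (not part of the example files) shows both the simple
strip used above and a capture-group replacement:</para>
<programlisting>//strip control characters -- no capture group needed
OUTPUT(REGEXREPLACE('[\r\n\t]','PA\t',''));                   //'PA'
//swap two words using capture groups
OUTPUT(REGEXREPLACE('(\\w+), (\\w+)','Jones, Fred','$2 $1')); //'Fred Jones'
</programlisting>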
<para>You can now operate on the StrippedRecs recordset (or
ProgGuide.TimeZonesXML dataset) just as you would with any other. For
example, you might want to simply filter out unnecessary fields and
records and write the result to a new XML file to pass on, something like
this:</para>
<programlisting>InterestingRecs := StrippedRecs((INTEGER)code BETWEEN 301 AND 303);
OUTPUT(InterestingRecs,{code,timezone},
       '~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
       XML('area',HEADING('<?xml version=1.0 ...?>\n<timezones>\n','</timezones>')),OVERWRITE);
</programlisting>
<para>The resulting XML file looks like this:</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
<area><code>301</code><zone>Eastern Time Zone</zone></area>
<area><code>302</code><zone>Eastern Time Zone</zone></area>
<area><code>303</code><zone>Mountain Time Zone</zone></area>
</timezones>
</programlisting>
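<para>If you later need to read that new file back in, the declaration is a
mirror image of the OUTPUT. This sketch (assuming the file just written
above, with field names matching its tag names) shows the idea:</para>
<programlisting>TZ300 := DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
                 {STRING code, STRING zone},
                 XML('timezones/area'));
OUTPUT(TZ300);
</programlisting>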
- </sect2>
<sect2 id="Complex_XML_Data_Handling">
<title>Complex XML Data Handling</title>
<para>You can create much more complex XML output by using the CSV option
on OUTPUT instead of the XML option. The XML option will only produce the
straightforward style of XML shown above. However, some applications
require the use of XML attributes inside the tags. This code demonstrates
how to produce that format:</para>
<programlisting>CRLF := (STRING)x'0D0A';
OutRec := RECORD
  STRING Line;
END;
OutRec DoComplexXML(InterestingRecs L) := TRANSFORM
  SELF.Line := '  <area code="' + L.code + '">' + CRLF +
               '    <zone>' + L.timezone + '</zone>' + CRLF +
               '  </area>';
END;
ComplexXML := PROJECT(InterestingRecs,DoComplexXML(LEFT));
OUTPUT(ComplexXML,,'~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
       CSV(HEADING('<?xml version=1.0 ...?>'+CRLF+'<timezones>'+CRLF,'</timezones>')),OVERWRITE);
</programlisting>
<para>The RECORD structure defines a single output field to contain each
logical XML record that you build with the TRANSFORM function. The PROJECT
operation builds all of the individual output records, then the CSV option
on the OUTPUT action specifies the file header and footer records (in this
case, the opening and closing XML file tags), producing the result shown
here:</para>
<programlisting><?xml version=1.0 ...?>
<timezones>
  <area code="301">
    <zone>Eastern Time Zone</zone>
  </area>
  <area code="302">
    <zone>Eastern Time Zone</zone>
  </area>
  <area code="303">
    <zone>Mountain Time Zone</zone>
  </area>
</timezones>
</programlisting>
<para>So, if using the CSV option is the way to OUTPUT complex XML data
formats, how can you access existing complex-format XML data and use ECL
to work with it?</para>
<para>The answer lies in using the XPATH option on field definitions in
the input RECORD structure, like this:</para>
<programlisting>NewTimeZones :=
  DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
          {STRING area {XPATH('<>')}},
          XML('timezones/area'));
</programlisting>
<para>The specified {XPATH('<>')} option basically says “give me
everything that's in this XML tag, including the tags themselves” so that
you can then use ECL to parse through the text to do your work. The
NewTimeZones data records look like this one (since each includes all the
carriage returns and line feeds) when you do a simple OUTPUT and copy a
record to a text editor:</para>
<programlisting><area code="301">
<zone>Eastern Time Zone</zone>
</area></programlisting>
<para>You can then use any of the string handling functions in ECL or the
Service Library functions in StringLib or UnicodeLib (see the
<emphasis>Services Library Reference</emphasis>) to work with the text.
However, the most powerful ECL text parsing tool is the PARSE function,
which allows you to define regular expressions and/or ECL PATTERN
attribute definitions to process the data.</para>
<para>This example uses the TRANSFORM version of PARSE to get at the XML
data:</para>
<programlisting>{ds.code, ds.timezone} Xform(NewTimeZones L) := TRANSFORM
  SELF.code := XMLTEXT('@code');
  SELF.timezone := XMLTEXT('zone');
END;
ParsedZones := PARSE(NewTimeZones,area,Xform(LEFT),XML('area'));
OUTPUT(ParsedZones);
</programlisting>
<para>In this code we're using the XML form of PARSE and its associated
XMLTEXT function to parse the data from the complex XML structure. The
parameter to XMLTEXT is the XPATH to the data we're interested in (the
major subset of the XPATH standard that ECL supports is documented in the
Language Reference in the RECORD structure discussion).</para>
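<para>As a brief illustration (using the tags from the example above),
XMLTEXT accepts the same kinds of XPATH forms as the field modifier:</para>
<programlisting>XMLTEXT('@code')    //an attribute of the current tag
XMLTEXT('zone')     //the text content of a child tag
XMLTEXT('zone[1]')  //a specific instance of a repeated child tag
</programlisting>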
</sect2>
<sect2 id="Input_with_Complex_XML_Formats">
<title>Input with Complex XML Formats</title>
<para>XML data comes in many possible formats, and some of them make use
of “child datasets” such that a given tag may contain multiple instances
of other tags that contain individual field tags themselves.</para>
<para>Here's an example of such a complex structure using UCC data. An
individual Filing may contain one or more Transactions, which in turn may
contain multiple Debtor and SecuredParty records:</para>
<programlisting><UCC>
 <Filing number='5200105'>
  <Transaction ID='5'>
   <StartDate>08/01/2001</StartDate>
   <LapseDate>08/01/2006</LapseDate>
   <FormType>UCC 1 FILING STATEMENT</FormType>
   <AmendType>NONE</AmendType>
   <AmendAction>NONE</AmendAction>
   <EnteredDate>08/02/2002</EnteredDate>
   <ReceivedDate>08/01/2002</ReceivedDate>
   <ApprovedDate>08/02/2002</ApprovedDate>
   <Debtor entityId='19'>
    <IsBusiness>true</IsBusiness>
    <OrgName><![CDATA[BOGUS LABORATORIES, INC.]]></OrgName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[334 SOUTH 900 WEST]]></Address1>
    <Address4><![CDATA[SALT LAKE CITY 45 84104]]></Address4>
    <City><![CDATA[SALT LAKE CITY]]></City>
    <State>UTAH</State>
    <Zip>84104</Zip>
    <OrgType>CORP</OrgType>
    <OrgJurisdiction><![CDATA[SALT LAKE CITY]]></OrgJurisdiction>
    <OrgID>654245-0142</OrgID>
    <EnteredDate>08/02/2002</EnteredDate>
   </Debtor>
   <Debtor entityId='7'>
    <IsBusiness>false</IsBusiness>
    <FirstName><![CDATA[FRED]]></FirstName>
    <LastName><![CDATA[JONES]]></LastName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[1038 E. 900 N.]]></Address1>
    <Address4><![CDATA[OGDEN 45 84404]]></Address4>
    <City><![CDATA[OGDEN]]></City>
    <State>UTAH</State>
    <Zip>84404</Zip>
    <OrgType>NONE</OrgType>
    <EnteredDate>08/02/2002</EnteredDate>
   </Debtor>
   <SecuredParty entityId='20'>
    <IsBusiness>true</IsBusiness>
    <OrgName><![CDATA[WELLS FARGO BANK]]></OrgName>
    <Status>ACTIVE</Status>
    <Address1><![CDATA[ATTN: LOAN OPERATIONS CENTER]]></Address1>
    <Address3><![CDATA[P.O. BOX 9120]]></Address3>
    <Address4><![CDATA[BOISE 13 83707-2203]]></Address4>
    <City><![CDATA[BOISE]]></City>
    <State>IDAHO</State>
    <Zip>83707-2203</Zip>
    <Status>ACTIVE</Status>
    <EnteredDate>08/02/2002</EnteredDate>
   </SecuredParty>
   <Collateral>
    <Action>ADD</Action>
    <Description><![CDATA[ALL ACCOUNTS]]></Description>
    <EffectiveDate>08/01/2002</EffectiveDate>
   </Collateral>
  </Transaction>
  <Transaction ID='375799'>
   <StartDate>08/01/2002</StartDate>
   <LapseDate>08/01/2006</LapseDate>
   <FormType>UCC 3 AMENDMENT</FormType>
   <AmendType>TERMINATION BY DEBTOR</AmendType>
   <AmendAction>NONE</AmendAction>
   <EnteredDate>02/23/2004</EnteredDate>
   <ReceivedDate>02/18/2004</ReceivedDate>
   <ApprovedDate>02/23/2004</ApprovedDate>
  </Transaction>
 </Filing>
</UCC>
</programlisting>
<para>The key to working with this type of complex XML data is the RECORD
structures that define the layout of the XML data.</para>
<programlisting>CollateralRec := RECORD
  STRING Action {XPATH('Action')};
  STRING Description {XPATH('Description')};
  STRING EffectiveDate {XPATH('EffectiveDate')};
END;
PartyRec := RECORD
  STRING PartyID {XPATH('@entityId')};
  STRING IsBusiness {XPATH('IsBusiness')};
  STRING OrgName {XPATH('OrgName')};
  STRING FirstName {XPATH('FirstName')};
  STRING LastName {XPATH('LastName')};
  STRING Status {XPATH('Status[1]')};
  STRING Address1 {XPATH('Address1')};
  STRING Address2 {XPATH('Address2')};
  STRING Address3 {XPATH('Address3')};
  STRING Address4 {XPATH('Address4')};
  STRING City {XPATH('City')};
  STRING State {XPATH('State')};
  STRING Zip {XPATH('Zip')};
  STRING OrgType {XPATH('OrgType')};
  STRING OrgJurisdiction {XPATH('OrgJurisdiction')};
  STRING OrgID {XPATH('OrgID')};
  STRING10 EnteredDate {XPATH('EnteredDate')};
END;
TransactionRec := RECORD
  STRING TransactionID {XPATH('@ID')};
  STRING10 StartDate {XPATH('StartDate')};
  STRING10 LapseDate {XPATH('LapseDate')};
  STRING FormType {XPATH('FormType')};
  STRING AmendType {XPATH('AmendType')};
  STRING AmendAction {XPATH('AmendAction')};
  STRING10 EnteredDate {XPATH('EnteredDate')};
  STRING10 ReceivedDate {XPATH('ReceivedDate')};
  STRING10 ApprovedDate {XPATH('ApprovedDate')};
  DATASET(PartyRec) Debtors {XPATH('Debtor')};
  DATASET(PartyRec) SecuredParties {XPATH('SecuredParty')};
  CollateralRec Collateral {XPATH('Collateral')};
END;
UCC_Rec := RECORD
  STRING FilingNumber {XPATH('@number')};
  DATASET(TransactionRec) Transactions {XPATH('Transaction')};
END;
UCC := DATASET('~PROGGUIDE::EXAMPLEDATA::XML_UCC',UCC_Rec,XML('UCC/Filing'));
</programlisting>
<para>Building from the bottom up, these RECORD structures combine to
create the final UCC_Rec layout that defines the entire format of this XML
data.</para>
<para>The XML option on the final DATASET declaration specifies the XPATH
to the record tag (Filing), and the child DATASET “field” definitions in
the RECORD structures handle the multiple-instance issues. Because ECL is
case insensitive and XML syntax is case sensitive, it is necessary to use
XPATH to define all the field tags. The PartyRec RECORD structure
works with both the Debtors and SecuredParties child DATASET fields
because both contain the same tags and information.</para>
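<para>To illustrate the case-sensitivity point: a field defined without an
explicit XPATH gets a default XPATH of its lower-cased field name, so it
would miss a mixed-case tag. This fragment is a sketch of that pitfall
(SketchRec and ZipCode are hypothetical names, assuming the UCC data
above):</para>
<programlisting>SketchRec := RECORD
  STRING Zip;                    //default XPATH is 'zip' -- misses <Zip>
  STRING ZipCode {XPATH('Zip')}; //explicitly matches the <Zip> tag
END;
</programlisting>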
<para>Once you've defined the layout, how can you extract the data into a
normalized relational structure to work with it in the supercomputer?
NORMALIZE is the answer. NORMALIZE needs to know how many times to call
its TRANSFORM, so you must use the TABLE function to get the counts, like
this:</para>
<programlisting>XactTbl := TABLE(UCC,{INTEGER XactCount := COUNT(Transactions), UCC});
OUTPUT(XactTbl);</programlisting>
<para>This TABLE function gets the counts of the multiple Transaction
records per Filing so that we can use NORMALIZE to extract them into a
table of their own.</para>
<programlisting>Out_Transacts := RECORD
  STRING FilingNumber;
  STRING TransactionID;
  STRING10 StartDate;
  STRING10 LapseDate;
  STRING FormType;
  STRING AmendType;
  STRING AmendAction;
  STRING10 EnteredDate;
  STRING10 ReceivedDate;
  STRING10 ApprovedDate;
  DATASET(PartyRec) Debtors;
  DATASET(PartyRec) SecuredParties;
  CollateralRec Collateral;
END;
Out_Transacts Get_Transacts(XactTbl L, INTEGER C) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF := L.Transactions[C];
END;
Transacts := NORMALIZE(XactTbl,LEFT.XactCount,Get_Transacts(LEFT,COUNTER));
OUTPUT(Transacts);
</programlisting>
<para>This NORMALIZE extracts all the Transactions into a separate
recordset with just one Transaction per record, with the parent information
(the Filing number) appended. However, each record here still contains
multiple Debtor and SecuredParty child records.</para>
<programlisting>PartyCounts := TABLE(Transacts,
                     {INTEGER DebtorCount := COUNT(Debtors),
                      INTEGER PartyCount := COUNT(SecuredParties),
                      Transacts});
OUTPUT(PartyCounts);
</programlisting>
- <programlisting>Out_Parties := RECORD
- STRING FilingNumber;
- STRING TransactionID;
- PartyRec;
- END;
- Out_Parties Get_Debtors(PartyCounts L, INTEGER C) := TRANSFORM
- SELF.FilingNumber := L.FilingNumber;
- SELF.TransactionID := L.TransactionID;
- SELF := L.Debtors[C];
- END;
- TransactDebtors := NORMALIZE( PartyCounts,
- LEFT.DebtorCount,
- Get_Debtors(LEFT,COUNTER));
- OUTPUT(TransactDebtors);
- </programlisting>
- <para>This NORMALIZE extracts all the Debtors into a separate
- recordset.</para>
<programlisting>Out_Parties Get_Parties(PartyCounts L, INTEGER C) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF.TransactionID := L.TransactionID;
  SELF := L.SecuredParties[C];
END;
TransactParties := NORMALIZE(PartyCounts,
                             LEFT.PartyCount,
                             Get_Parties(LEFT,COUNTER));
OUTPUT(TransactParties);
</programlisting>
<para>This NORMALIZE extracts all the SecuredParties into a separate
recordset. With this, we've now broken out all the child records into
their own normalized relational structure that we can work with
easily.</para>
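<para>As an aside, NORMALIZE also has a second form that takes the child
dataset itself as its second parameter, so the TRANSFORM receives each
child record as RIGHT and no count is needed. This sketch (an alternative
to the counted form, not from the example files; GetDebtor and AltDebtors
are hypothetical names) would extract the Debtors without the TABLE
counting step:</para>
<programlisting>Out_Parties GetDebtor(Transacts L, PartyRec R) := TRANSFORM
  SELF.FilingNumber := L.FilingNumber;
  SELF.TransactionID := L.TransactionID;
  SELF := R;  //the current child Debtor record
END;
AltDebtors := NORMALIZE(Transacts,LEFT.Debtors,GetDebtor(LEFT,RIGHT));
OUTPUT(AltDebtors);
</programlisting>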
</sect2>
<sect2 id="Piping_to_Third-Party_Tools">
<title>Piping to Third-Party Tools</title>
<para>One other way to work with XML data is to use third-party tools that
you have adapted for use in the supercomputer, giving you the advantage of
working with previously proven technology and the benefit of running that
technology in parallel on all the supercomputer nodes at once.</para>
<para>The technique is simple: just define the input file as a data stream
and use the PIPE option on DATASET to process the data in its native form.
Once the processing is complete, you can OUTPUT the result in whatever
form it comes out of the third-party tool, something like this example
code (non-functional):</para>
<programlisting>Rec := RECORD
  STRING1 char;
END;
TimeZones := DATASET('timezones.xml',Rec,PIPE('ThirdPartyTool.exe'));
OUTPUT(TimeZones,,'ProcessedTimezones.xml');
</programlisting>
<para>The key to this technique is the STRING1 field definition. This
makes the input and output just a one-byte-at-a-time data stream that
flows into the third-party tool and back out into your ECL code in its
native format. You don't even need to know what that format is. You could
also use this technique with the PIPE option on OUTPUT.</para>
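<para>A through-pipe on the way out might look something like this sketch
(again non-functional; the tool name is hypothetical, and the PIPE option
replaces the output file name):</para>
<programlisting>OUTPUT(TimeZones,,PIPE('ThirdPartyFormatter.exe'));
</programlisting>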
</sect2>
</sect1>