Working with XML Data
Data is not always handed to you in nice, easy-to-work-with,
fixed-length flat files; it comes in many forms. One form growing in usage
every day is XML. ECL has a number of ways of handling XML data—some obvious
and some not so obvious.
NOTE: XML reading and parsing can
consume a large amount of memory, depending on the usage. In particular, if
the specified XPATH matches a very large amount of data, then a large data
structure will be provided to the transform. Therefore, the more you match,
the more resources you consume per match. For example, if you have a very
large document and you match an element near the root that virtually
encompasses the whole document, then the entire document will be
constructed as a referenceable structure that your ECL code can access.
Simple XML Data Handling
The XML options on DATASET and OUTPUT allow you to easily work with
simple XML data. For example, an XML file that looks like this (this data
generated by the code in GenData.ECL):
<?xml version="1.0" ...?>
<timezones>
<area>
<code>
215
</code>
<state>
PA
</state>
<description>
Pennsylvania (Philadelphia area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
<area>
<code>
216
</code>
<state>
OH
</state>
<description>
Ohio (Cleveland area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
</timezones>
This file can be declared for use in your ECL code (as it is in the
DeclareData MODULE structure, where it is defined as the TimeZonesXML
DATASET) like this:
EXPORT TimeZonesXML :=
DATASET('~PROGGUIDE::EXAMPLEDATA::XML_timezones',
{STRING code,
STRING state,
STRING description,
STRING timezone{XPATH('zone')}},
XML('timezones/area') );
This makes the data contained within each XML tag in the file
available for use in your ECL code just like any flat-file dataset. The
field names in the RECORD structure (in this case, in-lined in the DATASET
declaration) duplicate the tag names in the file. The use of the XPATH
modifier on the timezone field allows us to specify that the field comes
from the <zone> tag. This mechanism allows us to name fields
differently from their tag names.
By defining the fields as STRING types without specifying their
length, you can be sure you're getting all the data—including any
carriage-returns, line feeds, and tabs in the XML file that are contained
within the field tags (as are present in this file). This simple OUTPUT
shows the result (this and all subsequent code examples in this article
are contained in the XMLcode.ECL file).
IMPORT $;
ds := $.DeclareData.timezonesXML;
OUTPUT(ds);
Notice that the result displayed in the ECL IDE program contains
squares in the data—these are the carriage-returns, line feeds, and tabs
in the data. You can get rid of the extraneous carriage-returns, line
feeds, and tabs by simply passing the records through a PROJECT operation,
like this:
StripIt(STRING str) := REGEXREPLACE('[\r\n\t]',str,'');
RECORDOF(ds) DoStrip(ds L) := TRANSFORM
SELF.code := StripIt(L.code);
SELF.state := StripIt(L.state);
SELF.description := StripIt(L.description);
SELF.timezone := StripIt(L.timezone);
END;
StrippedRecs := PROJECT(ds,DoStrip(LEFT));
OUTPUT(StrippedRecs);
The use of the REGEXREPLACE function makes the process very simple.
Its first parameter is a standard Perl regular expression representing the
characters to look for: carriage return (\r), line feed (\n), and tab
(\t). Its third parameter is the replacement text, here an empty string,
so every matched character is simply deleted.
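The same stripping logic can be sketched outside ECL. This is a minimal
Python equivalent (the sample value and function name are illustrative,
not from the original file) using the same character-class regular
expression:

```python
import re

# Hypothetical field value as it might arrive from the XML file,
# with embedded carriage returns, line feeds, and tabs.
raw = "\r\n\t\tPennsylvania (Philadelphia area)\r\n\t"

# Same idea as the ECL StripIt function: replace every \r, \n, or \t
# character with an empty string, deleting it from the value.
def strip_it(s: str) -> str:
    return re.sub(r"[\r\n\t]", "", s)

print(strip_it(raw))  # Pennsylvania (Philadelphia area)
```

Note that ordinary spaces are deliberately left alone; only the three
whitespace control characters are removed.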
You can now operate on the StrippedRecs recordset (or
ProgGuide.TimeZonesXML dataset) just as you would with any other. For
example, you might want to simply filter out unnecessary fields and
records and write the result to a new XML file to pass on, something like
this:
InterestingRecs := StrippedRecs((INTEGER)code BETWEEN 301 AND 303);
OUTPUT(InterestingRecs,{code,timezone},
'~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
XML('area',HEADING('<?xml version="1.0" ...?>\n<timezones>\n','</timezones>')),OVERWRITE);
The resulting XML file looks like this:
<?xml version="1.0" ...?>
<timezones>
<area><code>301</code><zone>Eastern Time Zone</zone></area>
<area><code>302</code><zone>Eastern Time Zone</zone></area>
<area><code>303</code><zone>Mountain Time Zone</zone></area>
</timezones>
Complex XML Data Handling
You can create much more complex XML output by using the CSV option
on OUTPUT instead of the XML option. The XML option will only produce the
straight-forward style of XML shown above. However, some applications
require the use of XML attributes inside the tags. This code demonstrates
how to produce that format:
CRLF := (STRING)x'0D0A';
OutRec := RECORD
STRING Line;
END;
OutRec DoComplexXML(InterestingRecs L) := TRANSFORM
SELF.Line := ' <area code="' + L.code + '">' + CRLF +
' <zone>' + L.timezone + '</zone>' + CRLF +
' </area>';
END;
ComplexXML := PROJECT(InterestingRecs,DoComplexXML(LEFT));
OUTPUT(ComplexXML,,'~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
CSV(HEADING('<?xml version="1.0" ...?>'+CRLF+'<timezones>'+CRLF,'</timezones>')),OVERWRITE);
The RECORD structure defines a single output field to contain each
logical XML record that you build with the TRANSFORM function. The PROJECT
operation builds all of the individual output records, then the CSV option
on the OUTPUT action specifies the file header and footer records (in this
case, the XML file tags) and you get the result shown here:
<?xml version="1.0" ...?>
<timezones>
<area code="301">
<zone>Eastern Time Zone</zone>
</area>
<area code="302">
<zone>Eastern Time Zone</zone>
</area>
<area code="303">
<zone>Mountain Time Zone</zone>
</area>
</timezones>
So, if using the CSV option is the way to OUTPUT complex XML data
formats, how can you access existing complex-format XML data and use ECL
to work with it?
The answer lies in using the XPATH option on field definitions in
the input RECORD structure, like this:
NewTimeZones :=
DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
{STRING area {XPATH('<>')}},
XML('timezones/area'));
The specified {XPATH('<>')} option basically says “give me
everything that's in this XML tag, including the tags themselves” so that
you can then use ECL to parse through the text to do your work. The
NewTimeZones data records look like this one (since it includes all the
carriage return/line feeds) when you do a simple OUTPUT and copy the
record to a text editor:
<area code="301">
<zone>Eastern Time Zone</zone>
</area>
You can then use any of the string handling functions in ECL or the
Service Library functions in StringLib or UnicodeLib (see the
Services Library Reference) to work with the text.
However, the more powerful ECL text parsing tool is the PARSE function,
which allows you to define regular expressions and/or ECL PATTERN attribute
definitions to process the data.
This example uses the TRANSFORM version of PARSE to get at the XML
data:
{ds.code, ds.timezone} Xform(NewTimeZones L) := TRANSFORM
SELF.code := XMLTEXT('@code');
SELF.timezone := XMLTEXT('zone');
END;
ParsedZones := PARSE(NewTimeZones,area,Xform(LEFT),XML('area'));
OUTPUT(ParsedZones);
In this code we're using the XML form of PARSE and its associated
XMLTEXT function to parse the data from the complex XML structure. The
parameter to XMLTEXT is the XPATH to the data we're interested in (the
major subset of the XPATH standard that ECL supports is documented in the
Language Reference in the RECORD structure discussion).
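The attribute-versus-element distinction that XMLTEXT makes with '@code'
and 'zone' can be sketched in Python with the standard library's
ElementTree module (the inline sample document mirrors the output file
built above; this is an illustration of the mapping, not ECL's
implementation):

```python
import xml.etree.ElementTree as ET

# Sample matching the complex-format records produced earlier:
# one attribute (code) plus one child element (zone) per area.
xml_text = """<timezones>
  <area code="301"><zone>Eastern Time Zone</zone></area>
  <area code="302"><zone>Eastern Time Zone</zone></area>
</timezones>"""

root = ET.fromstring(xml_text)
rows = [
    # XMLTEXT('@code') corresponds to the attribute lookup here;
    # XMLTEXT('zone') corresponds to the child-element text.
    {"code": area.get("code"), "timezone": area.findtext("zone")}
    for area in root.findall("area")
]
print(rows)
```

Each `area` match plays the role of one record passed to the PARSE
TRANSFORM, with the XPath-style lookups pulling out the fields.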
Input with Complex XML Formats
XML data comes in many possible formats, and some of them make use
of “child datasets” such that a given tag may contain multiple instances
of other tags that contain individual field tags themselves.
Here's an example of such a complex structure using UCC data. An
individual Filing may contain one or more Transactions, which in turn may
contain multiple Debtor and SecuredParty records:
<UCC>
<Filing number='5200105'>
<Transaction ID='5'>
<StartDate>08/01/2001</StartDate>
<LapseDate>08/01/2006</LapseDate>
<FormType>UCC 1 FILING STATEMENT</FormType>
<AmendType>NONE</AmendType>
<AmendAction>NONE</AmendAction>
<EnteredDate>08/02/2002</EnteredDate>
<ReceivedDate>08/01/2002</ReceivedDate>
<ApprovedDate>08/02/2002</ApprovedDate>
<Debtor entityId='19'>
<IsBusiness>true</IsBusiness>
<OrgName><![CDATA[BOGUS LABORATORIES, INC.]]></OrgName>
<Status>ACTIVE</Status>
<Address1><![CDATA[334 SOUTH 900 WEST]]></Address1>
<Address4><![CDATA[SALT LAKE CITY 45 84104]]></Address4>
<City><![CDATA[SALT LAKE CITY]]></City>
<State>UTAH</State>
<Zip>84104</Zip>
<OrgType>CORP</OrgType>
<OrgJurisdiction><![CDATA[SALT LAKE CITY]]></OrgJurisdiction>
<OrgID>654245-0142</OrgID>
<EnteredDate>08/02/2002</EnteredDate>
</Debtor>
<Debtor entityId='7'>
<IsBusiness>false</IsBusiness>
<FirstName><![CDATA[FRED]]></FirstName>
<LastName><![CDATA[JONES]]></LastName>
<Status>ACTIVE</Status>
<Address1><![CDATA[1038 E. 900 N.]]></Address1>
<Address4><![CDATA[OGDEN 45 84404]]></Address4>
<City><![CDATA[OGDEN]]></City>
<State>UTAH</State>
<Zip>84404</Zip>
<OrgType>NONE</OrgType>
<EnteredDate>08/02/2002</EnteredDate>
</Debtor>
<SecuredParty entityId='20'>
<IsBusiness>true</IsBusiness>
<OrgName><![CDATA[WELLS FARGO BANK]]></OrgName>
<Status>ACTIVE</Status>
<Address1><![CDATA[ATTN: LOAN OPERATIONS CENTER]]></Address1>
<Address3><![CDATA[P.O. BOX 9120]]></Address3>
<Address4><![CDATA[BOISE 13 83707-2203]]></Address4>
<City><![CDATA[BOISE]]></City>
<State>IDAHO</State>
<Zip>83707-2203</Zip>
<Status>ACTIVE</Status>
<EnteredDate>08/02/2002</EnteredDate>
</SecuredParty>
<Collateral>
<Action>ADD</Action>
<Description><![CDATA[ALL ACCOUNTS]]></Description>
<EffectiveDate>08/01/2002</EffectiveDate>
</Collateral>
</Transaction>
<Transaction ID='375799'>
<StartDate>08/01/2002</StartDate>
<LapseDate>08/01/2006</LapseDate>
<FormType>UCC 3 AMENDMENT</FormType>
<AmendType>TERMINATION BY DEBTOR</AmendType>
<AmendAction>NONE</AmendAction>
<EnteredDate>02/23/2004</EnteredDate>
<ReceivedDate>02/18/2004</ReceivedDate>
<ApprovedDate>02/23/2004</ApprovedDate>
</Transaction>
</Filing>
</UCC>
The key to working with this type of complex XML data lies in the RECORD
structures that define the layout of the XML data.
CollateralRec := RECORD
STRING Action {XPATH('Action')};
STRING Description {XPATH('Description')};
STRING EffectiveDate {XPATH('EffectiveDate')};
END;
PartyRec := RECORD
STRING PartyID {XPATH('@entityId')};
STRING IsBusiness {XPATH('IsBusiness')};
STRING OrgName {XPATH('OrgName')};
STRING FirstName {XPATH('FirstName')};
STRING LastName {XPATH('LastName')};
STRING Status {XPATH('Status[1]')};
STRING Address1 {XPATH('Address1')};
STRING Address2 {XPATH('Address2')};
STRING Address3 {XPATH('Address3')};
STRING Address4 {XPATH('Address4')};
STRING City {XPATH('City')};
STRING State {XPATH('State')};
STRING Zip {XPATH('Zip')};
STRING OrgType {XPATH('OrgType')};
STRING OrgJurisdiction {XPATH('OrgJurisdiction')};
STRING OrgID {XPATH('OrgID')};
STRING10 EnteredDate {XPATH('EnteredDate')};
END;
TransactionRec := RECORD
STRING TransactionID {XPATH('@ID')};
STRING10 StartDate {XPATH('StartDate')};
STRING10 LapseDate {XPATH('LapseDate')};
STRING FormType {XPATH('FormType')};
STRING AmendType {XPATH('AmendType')};
STRING AmendAction {XPATH('AmendAction')};
STRING10 EnteredDate {XPATH('EnteredDate')};
STRING10 ReceivedDate {XPATH('ReceivedDate')};
STRING10 ApprovedDate {XPATH('ApprovedDate')};
DATASET(PartyRec) Debtors {XPATH('Debtor')};
DATASET(PartyRec) SecuredParties {XPATH('SecuredParty')};
CollateralRec Collateral {XPATH('Collateral')};
END;
UCC_Rec := RECORD
STRING FilingNumber {XPATH('@number')};
DATASET(TransactionRec) Transactions {XPATH('Transaction')};
END;
UCC := DATASET('~PROGGUIDE::EXAMPLEDATA::XML_UCC',UCC_Rec,XML('UCC/Filing'));
Building from the bottom up, these RECORD structures combine to
create the final UCC_Rec layout that defines the entire format of this XML
data.
The XML option on the final DATASET declaration specifies the XPATH
to the record tag (Filing) then the child DATASET “field” definitions in
the RECORD structures handle the multiple instance issues. Because ECL is
case insensitive and XML syntax is case sensitive, it is necessary to use
the XPATH to define all the field tags. The PartyRec RECORD structure
works with both the Debtors and SecuredParties child DATASET fields
because both contain the same tags and information.
Once you've defined the layout, how can you extract the data into a
normalized relational structure to work with it in the supercomputer?
NORMALIZE is the answer. NORMALIZE needs to know how many times to call
its TRANSFORM, so you must use the TABLE function to get the counts, like
this:
XactTbl := TABLE(UCC,{INTEGER XactCount := COUNT(Transactions), UCC});
OUTPUT(XactTbl);
This TABLE function gets the counts of the multiple Transaction
records per Filing so that we can use NORMALIZE to extract them into a
table of their own.
Out_Transacts := RECORD
STRING FilingNumber;
STRING TransactionID;
STRING10 StartDate;
STRING10 LapseDate;
STRING FormType;
STRING AmendType;
STRING AmendAction;
STRING10 EnteredDate;
STRING10 ReceivedDate;
STRING10 ApprovedDate;
DATASET(PartyRec) Debtors;
DATASET(PartyRec) SecuredParties;
CollateralRec Collateral;
END;
Out_Transacts Get_Transacts(XactTbl L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF := L.Transactions[C];
END;
Transacts := NORMALIZE(XactTbl,LEFT.XactCount,Get_Transacts(LEFT,COUNTER));
OUTPUT(Transacts);
This NORMALIZE extracts all the Transactions into a separate
recordset with just one Transaction per record with the parent information
(the Filing number) appended. However, each record here still contains
multiple Debtor and SecuredParty child records.
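The effect of this NORMALIZE can be sketched in Python: each parent row
carries a nested list of children, and the flattened result repeats the
parent key on every child row (field names here are illustrative, not
the ECL definitions):

```python
# Nested structure analogous to the XactTbl recordset: each filing
# carries a list of child transaction records.
filings = [
    {"filing_number": "5200105",
     "transactions": [{"id": "5"}, {"id": "375799"}]},
]

# NORMALIZE calls its TRANSFORM once per child (XactCount times per
# parent); the flat result appends the parent's key to each child row.
transacts = [
    {"filing_number": f["filing_number"], "transaction_id": t["id"]}
    for f in filings
    for t in f["transactions"]
]
print(transacts)
```

The same pattern repeats one level down for the Debtor and SecuredParty
children of each transaction.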
PartyCounts := TABLE(Transacts,
{INTEGER DebtorCount := COUNT(Debtors),
INTEGER PartyCount := COUNT(SecuredParties),
Transacts});
OUTPUT(PartyCounts);
This TABLE function gets the counts of the multiple Debtor and
SecuredParty records for each Transaction.
Out_Parties := RECORD
STRING FilingNumber;
STRING TransactionID;
PartyRec;
END;
Out_Parties Get_Debtors(PartyCounts L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF.TransactionID := L.TransactionID;
SELF := L.Debtors[C];
END;
TransactDebtors := NORMALIZE( PartyCounts,
LEFT.DebtorCount,
Get_Debtors(LEFT,COUNTER));
OUTPUT(TransactDebtors);
This NORMALIZE extracts all the Debtors into a separate
recordset.
Out_Parties Get_Parties(PartyCounts L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF.TransactionID := L.TransactionID;
SELF := L.SecuredParties[C];
END;
TransactParties := NORMALIZE(PartyCounts,
LEFT.PartyCount,
Get_Parties(LEFT,COUNTER));
OUTPUT(TransactParties);
This NORMALIZE extracts all the SecuredParties into a separate
recordset. With this, we've now broken out all the child records into
their own normalized relational structure that we can work with
easily.
Piping to Third-Party Tools
One other way to work with XML data is to use third-party tools that
you have adapted for use in the supercomputer so that you have the
advantage of working with previously proven technology and the benefit of
running that technology in parallel on all the supercomputer nodes at
once.
The technique is simple: just define the input file as a data stream
and use the PIPE option on DATASET to process the data in its native form.
Once the processing is complete, you can OUTPUT the result in whatever
form it comes out of the third-party tool, something like this example
code (non-functional):
Rec := RECORD
STRING1 char;
END;
TimeZones := DATASET('timezones.xml',Rec,PIPE('ThirdPartyTool.exe'));
OUTPUT(TimeZones,,'ProcessedTimezones.xml');
The key to this technique is the STRING1 field definition. This
makes the input and output just a 1-byte-at-a-time data stream that flows
into the third-party tool and back out of your ECL code in its native
format. You don't even need to know what that format is. You could also
use this technique with the PIPE option on OUTPUT.
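The streaming idea can be sketched in Python with the standard
subprocess module. The "tool" below is a trivial upper-casing filter
standing in for a real third-party executable (purely illustrative; a
real deployment would name an actual program, as the PIPE option does):

```python
import subprocess
import sys

# Stand-in for 'ThirdPartyTool.exe': a tiny Python one-liner that
# upper-cases whatever is streamed through it.
tool = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]

xml_in = "<timezones><area><code>215</code></area></timezones>"

# Stream the raw text through the external tool and capture whatever
# comes back, without interpreting the format ourselves.
result = subprocess.run(tool, input=xml_in, capture_output=True, text=True)
print(result.stdout)
```

As with the STRING1 technique, the data passes through as an opaque
byte stream; the calling code never needs to understand the format the
tool consumes or produces.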