Working with XML Data
Data is not always handed to you in nice, easy-to-work-with,
fixed-length flat files; it comes in many forms. One form growing in usage
every day is XML. ECL has a number of ways of handling XML data—some obvious
and some not so obvious.
NOTE: XML reading and parsing can
consume a large amount of memory, depending on the usage. In particular, if
the specified XPATH matches a very large amount of data, then a large data
structure will be provided to the transform. Therefore, the more you match,
the more resources you consume per match. For example, if you have a very
large document and you match an element near the root that virtually
encompasses the whole document, then the entire document will be
constructed as a referenceable structure that your ECL code can access.
Simple XML Data Handling
The XML options on DATASET and OUTPUT allow you to easily work with
simple XML data. For example, an XML file that looks like this (this data
generated by the code in GenData.ECL):
<?xml version="1.0" ...?>
<timezones>
<area>
<code>
215
</code>
<state>
PA
</state>
<description>
Pennsylvania (Philadelphia area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
<area>
<code>
216
</code>
<state>
OH
</state>
<description>
Ohio (Cleveland area)
</description>
<zone>
Eastern Time Zone
</zone>
</area>
</timezones>
This file can be declared for use in your ECL code (as it is in the
DeclareData MODULE structure, where it is defined as the TimeZonesXML
DATASET) like this:
EXPORT TimeZonesXML :=
DATASET('~PROGGUIDE::EXAMPLEDATA::XML_timezones',
{STRING code,
STRING state,
STRING description,
STRING timezone{XPATH('zone')}},
XML('timezones/area') );
This makes the data contained within each XML tag in the file
available for use in your ECL code just like any flat-file dataset. The
field names in the RECORD structure (in this case, in-lined in the DATASET
declaration) duplicate the tag names in the file. The use of the XPATH
modifier on the timezone field allows us to specify that the field comes
from the <zone> tag. This mechanism allows us to name fields
differently from their tag names.
By defining the fields as STRING types without specifying their
length, you can be sure you're getting all the data—including any
carriage-returns, line feeds, and tabs in the XML file that are contained
within the field tags (as are present in this file). This simple OUTPUT
shows the result (this and all subsequent code examples in this article
are contained in the XMLcode.ECL file).
IMPORT $;
ds := $.DeclareData.timezonesXML;
OUTPUT(ds);
Notice that the result displayed in the ECL IDE program contains
squares in the data—these are the carriage-returns, line feeds, and tabs
in the data. You can get rid of the extraneous carriage-returns, line
feeds, and tabs by simply passing the records through a PROJECT operation,
like this:
StripIt(STRING str) := REGEXREPLACE('[\r\n\t]',str,'');
RECORDOF(ds) DoStrip(ds L) := TRANSFORM
SELF.code := StripIt(L.code);
SELF.state := StripIt(L.state);
SELF.description := StripIt(L.description);
SELF.timezone := StripIt(L.timezone);
END;
StrippedRecs := PROJECT(ds,DoStrip(LEFT));
OUTPUT(StrippedRecs);
The use of the REGEXREPLACE function makes the process very simple.
Its first parameter is a standard Perl regular expression representing the
characters to look for: carriage return (\r), line feed (\n), and tab
(\t). Its third parameter is the replacement text, here an empty string,
so every matched character is simply deleted.
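The same stripping logic can be sketched outside ECL. This is a minimal
Python equivalent (the sample value and function name are illustrative,
not from the original file) using the same character-class regular
expression:

```python
import re

# Hypothetical field value as it might arrive from the XML file,
# with embedded carriage returns, line feeds, and tabs.
raw = "\r\n\t\tPennsylvania (Philadelphia area)\r\n\t"

# Same idea as the ECL StripIt function: replace every \r, \n, or \t
# character with an empty string, deleting it from the value.
def strip_it(s: str) -> str:
    return re.sub(r"[\r\n\t]", "", s)

print(strip_it(raw))  # Pennsylvania (Philadelphia area)
```

Note that ordinary spaces are deliberately left alone; only the three
whitespace control characters are removed.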
You can now operate on the StrippedRecs recordset (or
ProgGuide.TimeZonesXML dataset) just as you would with any other. For
example, you might want to simply filter out unnecessary fields and
records and write the result to a new XML file to pass on, something like
this:
InterestingRecs := StrippedRecs((INTEGER)code BETWEEN 301 AND 303);
OUTPUT(InterestingRecs,{code,timezone},
'~PROGGUIDE::EXAMPLEDATA::OUT::timezones300',
XML('area',HEADING('<?xml version="1.0" ...?>\n<timezones>\n','</timezones>')),OVERWRITE);
The resulting XML file looks like this:
<?xml version="1.0" ...?>
<timezones>
<area><code>301</code><zone>Eastern Time Zone</zone></area>
<area><code>302</code><zone>Eastern Time Zone</zone></area>
<area><code>303</code><zone>Mountain Time Zone</zone></area>
</timezones>
Complex XML Data Handling
You can create much more complex XML output by using the CSV option
on OUTPUT instead of the XML option. The XML option will only produce the
straight-forward style of XML shown above. However, some applications
require the use of XML attributes inside the tags. This code demonstrates
how to produce that format:
CRLF := (STRING)x'0D0A';
OutRec := RECORD
STRING Line;
END;
OutRec DoComplexXML(InterestingRecs L) := TRANSFORM
SELF.Line := ' <area code="' + L.code + '">' + CRLF +
' <zone>' + L.timezone + '</zone>' + CRLF +
' </area>';
END;
ComplexXML := PROJECT(InterestingRecs,DoComplexXML(LEFT));
OUTPUT(ComplexXML,,'~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
CSV(HEADING('<?xml version="1.0" ...?>'+CRLF+'<timezones>'+CRLF,'</timezones>')),OVERWRITE);
The RECORD structure defines a single output field to contain each
logical XML record that you build with the TRANSFORM function. The PROJECT
operation builds all of the individual output records, then the CSV option
on the OUTPUT action specifies the file header and footer records (in this
case, the XML file tags) and you get the result shown here:
<?xml version="1.0" ...?>
<timezones>
<area code="301">
<zone>Eastern Time Zone</zone>
</area>
<area code="302">
<zone>Eastern Time Zone</zone>
</area>
<area code="303">
<zone>Mountain Time Zone</zone>
</area>
</timezones>
So, if using the CSV option is the way to OUTPUT complex XML data
formats, how can you access existing complex-format XML data and use ECL
to work with it?
The answer lies in using the XPATH option on field definitions in
the input RECORD structure, like this:
NewTimeZones :=
DATASET('~PROGGUIDE::EXAMPLEDATA::OUT::Complextimezones301',
{STRING area {XPATH('<>')}},
XML('timezones/area'));
The specified {XPATH('<>')} option basically says “give me
everything that's in this XML tag, including the tags themselves” so that
you can then use ECL to parse through the text to do your work. The
NewTimeZones data records look like this one (since it includes all the
carriage return/line feeds) when you do a simple OUTPUT and copy the
record to a text editor:
<area code="301">
<zone>Eastern Time Zone</zone>
</area>
You can then use any of the string handling functions in ECL or the
Service Library functions in StringLib or UnicodeLib (see the
Services Library Reference) to work with the text.
However, the more powerful ECL text parsing tool is the PARSE function,
which allows you to define regular expressions and/or ECL PATTERN attribute
definitions to process the data.
This example uses the TRANSFORM version of PARSE to get at the XML
data:
{ds.code, ds.timezone} Xform(NewTimeZones L) := TRANSFORM
SELF.code := XMLTEXT('@code');
SELF.timezone := XMLTEXT('zone');
END;
ParsedZones := PARSE(NewTimeZones,area,Xform(LEFT),XML('area'));
OUTPUT(ParsedZones);
In this code we're using the XML form of PARSE and its associated
XMLTEXT function to parse the data from the complex XML structure. The
parameter to XMLTEXT is the XPATH to the data we're interested in (the
major subset of the XPATH standard that ECL supports is documented in the
Language Reference in the RECORD structure discussion).
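The attribute-versus-element distinction that XMLTEXT makes with '@code'
and 'zone' can be sketched in Python with the standard library's
ElementTree module (the inline sample document mirrors the output file
built above; this is an illustration of the mapping, not ECL's
implementation):

```python
import xml.etree.ElementTree as ET

# Sample matching the complex-format records produced earlier:
# one attribute (code) plus one child element (zone) per area.
xml_text = """<timezones>
  <area code="301"><zone>Eastern Time Zone</zone></area>
  <area code="302"><zone>Eastern Time Zone</zone></area>
</timezones>"""

root = ET.fromstring(xml_text)
rows = [
    # XMLTEXT('@code') corresponds to the attribute lookup here;
    # XMLTEXT('zone') corresponds to the child-element text.
    {"code": area.get("code"), "timezone": area.findtext("zone")}
    for area in root.findall("area")
]
print(rows)
```

Each `area` match plays the role of one record passed to the PARSE
TRANSFORM, with the XPath-style lookups pulling out the fields.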
Input with Complex XML Formats
XML data comes in many possible formats, and some of them make use
of “child datasets” such that a given tag may contain multiple instances
of other tags that contain individual field tags themselves.
Here's an example of such a complex structure using UCC data. An
individual Filing may contain one or more Transactions, which in turn may
contain multiple Debtor and SecuredParty records:
<UCC>
<Filing number='5200105'>
<Transaction ID='5'>
<StartDate>08/01/2001</StartDate>
<LapseDate>08/01/2006</LapseDate>
<FormType>UCC 1 FILING STATEMENT</FormType>
<AmendType>NONE</AmendType>
<AmendAction>NONE</AmendAction>
<EnteredDate>08/02/2002</EnteredDate>
<ReceivedDate>08/01/2002</ReceivedDate>
<ApprovedDate>08/02/2002</ApprovedDate>
<Debtor entityId='19'>
<IsBusiness>true</IsBusiness>
<OrgName><![CDATA[BOGUS LABORATORIES, INC.]]></OrgName>
<Status>ACTIVE</Status>
<Address1><![CDATA[334 SOUTH 900 WEST]]></Address1>
<Address4><![CDATA[SALT LAKE CITY 45 84104]]></Address4>
<City><![CDATA[SALT LAKE CITY]]></City>
<State>UTAH</State>
<Zip>84104</Zip>
<OrgType>CORP</OrgType>
<OrgJurisdiction><![CDATA[SALT LAKE CITY]]></OrgJurisdiction>
<OrgID>654245-0142</OrgID>
<EnteredDate>08/02/2002</EnteredDate>
</Debtor>
<Debtor entityId='7'>
<IsBusiness>false</IsBusiness>
<FirstName><![CDATA[FRED]]></FirstName>
<LastName><![CDATA[JONES]]></LastName>
<Status>ACTIVE</Status>
<Address1><![CDATA[1038 E. 900 N.]]></Address1>
<Address4><![CDATA[OGDEN 45 84404]]></Address4>
<City><![CDATA[OGDEN]]></City>
<State>UTAH</State>
<Zip>84404</Zip>
<OrgType>NONE</OrgType>
<EnteredDate>08/02/2002</EnteredDate>
</Debtor>
<SecuredParty entityId='20'>
<IsBusiness>true</IsBusiness>
<OrgName><![CDATA[WELLS FARGO BANK]]></OrgName>
<Status>ACTIVE</Status>
<Address1><![CDATA[ATTN: LOAN OPERATIONS CENTER]]></Address1>
<Address3><![CDATA[P.O. BOX 9120]]></Address3>
<Address4><![CDATA[BOISE 13 83707-2203]]></Address4>
<City><![CDATA[BOISE]]></City>
<State>IDAHO</State>
<Zip>83707-2203</Zip>
<Status>ACTIVE</Status>
<EnteredDate>08/02/2002</EnteredDate>
</SecuredParty>
<Collateral>
<Action>ADD</Action>
<Description><![CDATA[ALL ACCOUNTS]]></Description>
<EffectiveDate>08/01/2002</EffectiveDate>
</Collateral>
</Transaction>
<Transaction ID='375799'>
<StartDate>08/01/2002</StartDate>
<LapseDate>08/01/2006</LapseDate>
<FormType>UCC 3 AMENDMENT</FormType>
<AmendType>TERMINATION BY DEBTOR</AmendType>
<AmendAction>NONE</AmendAction>
<EnteredDate>02/23/2004</EnteredDate>
<ReceivedDate>02/18/2004</ReceivedDate>
<ApprovedDate>02/23/2004</ApprovedDate>
</Transaction>
</Filing>
</UCC>
The key to working with this type of complex XML data lies in the RECORD
structures that define the layout of the XML data.
CollateralRec := RECORD
STRING Action {XPATH('Action')};
STRING Description {XPATH('Description')};
STRING EffectiveDate {XPATH('EffectiveDate')};
END;
PartyRec := RECORD
STRING PartyID {XPATH('@entityId')};
STRING IsBusiness {XPATH('IsBusiness')};
STRING OrgName {XPATH('OrgName')};
STRING FirstName {XPATH('FirstName')};
STRING LastName {XPATH('LastName')};
STRING Status {XPATH('Status[1]')};
STRING Address1 {XPATH('Address1')};
STRING Address2 {XPATH('Address2')};
STRING Address3 {XPATH('Address3')};
STRING Address4 {XPATH('Address4')};
STRING City {XPATH('City')};
STRING State {XPATH('State')};
STRING Zip {XPATH('Zip')};
STRING OrgType {XPATH('OrgType')};
STRING OrgJurisdiction {XPATH('OrgJurisdiction')};
STRING OrgID {XPATH('OrgID')};
STRING10 EnteredDate {XPATH('EnteredDate')};
END;
TransactionRec := RECORD
STRING TransactionID {XPATH('@ID')};
STRING10 StartDate {XPATH('StartDate')};
STRING10 LapseDate {XPATH('LapseDate')};
STRING FormType {XPATH('FormType')};
STRING AmendType {XPATH('AmendType')};
STRING AmendAction {XPATH('AmendAction')};
STRING10 EnteredDate {XPATH('EnteredDate')};
STRING10 ReceivedDate {XPATH('ReceivedDate')};
STRING10 ApprovedDate {XPATH('ApprovedDate')};
DATASET(PartyRec) Debtors {XPATH('Debtor')};
DATASET(PartyRec) SecuredParties {XPATH('SecuredParty')};
CollateralRec Collateral {XPATH('Collateral')};
END;
UCC_Rec := RECORD
STRING FilingNumber {XPATH('@number')};
DATASET(TransactionRec) Transactions {XPATH('Transaction')};
END;
UCC := DATASET('~PROGGUIDE::EXAMPLEDATA::XML_UCC',UCC_Rec,XML('UCC/Filing'));
Building from the bottom up, these RECORD structures combine to
create the final UCC_Rec layout that defines the entire format of this XML
data.
The XML option on the final DATASET declaration specifies the XPATH
to the record tag (Filing) then the child DATASET “field” definitions in
the RECORD structures handle the multiple instance issues. Because ECL is
case insensitive and XML syntax is case sensitive, it is necessary to use
the XPATH to define all the field tags. The PartyRec RECORD structure
works with both the Debtors and SecuredParties child DATASET fields
because both contain the same tags and information.
Once you've defined the layout, how can you extract the data into a
normalized relational structure to work with it in the supercomputer?
NORMALIZE is the answer. NORMALIZE needs to know how many times to call
its TRANSFORM, so you must use the TABLE function to get the counts, like
this:
XactTbl := TABLE(UCC,{INTEGER XactCount := COUNT(Transactions), UCC});
OUTPUT(XactTbl);
This TABLE function gets the counts of the multiple Transaction
records per Filing so that we can use NORMALIZE to extract them into a
table of their own.
Out_Transacts := RECORD
STRING FilingNumber;
STRING TransactionID;
STRING10 StartDate;
STRING10 LapseDate;
STRING FormType;
STRING AmendType;
STRING AmendAction;
STRING10 EnteredDate;
STRING10 ReceivedDate;
STRING10 ApprovedDate;
DATASET(PartyRec) Debtors;
DATASET(PartyRec) SecuredParties;
CollateralRec Collateral;
END;
Out_Transacts Get_Transacts(XactTbl L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF := L.Transactions[C];
END;
Transacts := NORMALIZE(XactTbl,LEFT.XactCount,Get_Transacts(LEFT,COUNTER));
OUTPUT(Transacts);
This NORMALIZE extracts all the Transactions into a separate
recordset with just one Transaction per record with the parent information
(the Filing number) appended. However, each record here still contains
multiple Debtor and SecuredParty child records.
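The effect of this NORMALIZE can be sketched in Python: each parent row
carries a nested list of children, and the flattened result repeats the
parent key on every child row (field names here are illustrative, not
the ECL definitions):

```python
# Nested structure analogous to the XactTbl recordset: each filing
# carries a list of child transaction records.
filings = [
    {"filing_number": "5200105",
     "transactions": [{"id": "5"}, {"id": "375799"}]},
]

# NORMALIZE calls its TRANSFORM once per child (XactCount times per
# parent); the flat result appends the parent's key to each child row.
transacts = [
    {"filing_number": f["filing_number"], "transaction_id": t["id"]}
    for f in filings
    for t in f["transactions"]
]
print(transacts)
```

The same pattern repeats one level down for the Debtor and SecuredParty
children of each transaction.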
PartyCounts := TABLE(Transacts,
{INTEGER DebtorCount := COUNT(Debtors),
INTEGER PartyCount := COUNT(SecuredParties),
Transacts});
OUTPUT(PartyCounts);
This TABLE function gets the counts of the multiple Debtor and
SecuredParty records for each Transaction.
Out_Parties := RECORD
STRING FilingNumber;
STRING TransactionID;
PartyRec;
END;
Out_Parties Get_Debtors(PartyCounts L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF.TransactionID := L.TransactionID;
SELF := L.Debtors[C];
END;
TransactDebtors := NORMALIZE( PartyCounts,
LEFT.DebtorCount,
Get_Debtors(LEFT,COUNTER));
OUTPUT(TransactDebtors);
This NORMALIZE extracts all the Debtors into a separate
recordset.
Out_Parties Get_Parties(PartyCounts L, INTEGER C) := TRANSFORM
SELF.FilingNumber := L.FilingNumber;
SELF.TransactionID := L.TransactionID;
SELF := L.SecuredParties[C];
END;
TransactParties := NORMALIZE(PartyCounts,
LEFT.PartyCount,
Get_Parties(LEFT,COUNTER));
OUTPUT(TransactParties);
This NORMALIZE extracts all the SecuredParties into a separate
recordset. With this, we've now broken out all the child records into
their own normalized relational structure that we can work with
easily.
Piping to Third-Party Tools
One other way to work with XML data is to use third-party tools that
you have adapted for use in the supercomputer so that you have the
advantage of working with previously proven technology and the benefit of
running that technology in parallel on all the supercomputer nodes at
once.
The technique is simple: just define the input file as a data stream
and use the PIPE option on DATASET to process the data in its native form.
Once the processing is complete, you can OUTPUT the result in whatever
form it comes out of the third-party tool, something like this example
code (non-functional):
Rec := RECORD
STRING1 char;
END;
TimeZones := DATASET('timezones.xml',Rec,PIPE('ThirdPartyTool.exe'));
OUTPUT(TimeZones,,'ProcessedTimezones.xml');
The key to this technique is the STRING1 field definition. This
makes the input and output just a 1-byte-at-a-time data stream that flows
into the third-party tool and back out of your ECL code in its native
format. You don't even need to know what that format is. You could also
use this technique with the PIPE option on OUTPUT.
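The streaming idea can be sketched in Python with the standard
subprocess module. The "tool" below is a trivial upper-casing filter
standing in for a real third-party executable (purely illustrative; a
real deployment would name an actual program, as the PIPE option does):

```python
import subprocess
import sys

# Stand-in for 'ThirdPartyTool.exe': a tiny Python one-liner that
# upper-cases whatever is streamed through it.
tool = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]

xml_in = "<timezones><area><code>215</code></area></timezones>"

# Stream the raw text through the external tool and capture whatever
# comes back, without interpreting the format ourselves.
result = subprocess.run(tool, input=xml_in, capture_output=True, text=True)
print(result.stdout)
```

As with the STRING1 technique, the data passes through as an opaque
byte stream; the calling code never needs to understand the format the
tool consumes or produces.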