<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Cross-Tab_Reports">
  <title><emphasis role="bold">Cross-Tab Reports</emphasis></title>

  <para>Cross-Tab reports are a useful way of discovering statistical
  information about the data that you work with. You can produce them with
  the TABLE function and the aggregate functions (COUNT, SUM, MIN, MAX, AVE,
  VARIANCE, COVARIANCE, CORRELATION). The resulting recordset contains a
  single record for each unique combination of values in the “group by”
  fields specified in the TABLE function, along with the statistics you
  generate with the aggregate functions.</para>

  <para>The TABLE function's “group by” parameters are duplicated as the
  first set of fields in the RECORD structure. They are followed by any
  number of aggregate function calls, each using the GROUP keyword in place
  of the recordset normally required as its first parameter. The GROUP
  keyword tells the aggregate function to operate on the records in the
  current group, and it is the key to creating a Cross-Tab report. The result
  is an output table containing a single row for each unique combination of
  “group by” values.</para>
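
  <para>As a minimal sketch of that pattern, the following self-contained
  code runs against a small inline dataset. The names SalesRec, Sales, and
  RegionStats, and the data values, are hypothetical and are not part of the
  example files:</para>

  <programlisting>SalesRec := RECORD
  STRING2   Region;
  UNSIGNED4 Amount;
END;

// Hypothetical inline data, used only for this illustration
Sales := DATASET([{'NE',100},{'NE',250},{'SW',75}], SalesRec);

RegionStats := RECORD
  Sales.Region;                            // the "group by" field comes first
  SaleCount   := COUNT(GROUP);             // number of records in each group
  TotalAmount := SUM(GROUP,Sales.Amount);  // sum of Amount within each group
  MaxAmount   := MAX(GROUP,Sales.Amount);  // largest Amount within each group
END;

// Region is the "group by" parameter: one output row per unique Region
OUTPUT(TABLE(Sales,RegionStats,Region));
</programlisting>

  <para>With this data the result should contain two rows, one for NE and one
  for SW, each carrying the statistics for its own group.</para>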

  <sect2 id="A_Simple_Crosstab">
    <title>A Simple Cross-Tab</title>

    <para>The example code below (contained in the CrossTab.ECL file)
    produces one row for each State/CountAccts combination. CountAccts is the
    number of records in each person's nested Accounts child dataset, created
    by the GenData.ECL code (see the <emphasis role="bold">Creating Example
    Data</emphasis> article), and GroupCount is the number of people in each
    combination:</para>

    <programlisting>IMPORT $;
Person := $.DeclareData.PersonAccounts;

CountAccts := COUNT(Person.Accounts);

MyReportFormat1 := RECORD
  State      := Person.State;
  A1         := CountAccts;
  GroupCount := COUNT(GROUP);
END;

RepTable1 := TABLE(Person,MyReportFormat1,State,CountAccts);
OUTPUT(RepTable1);

/* The result set would look something like this:
   State  A1  GroupCount
   AK     1   7
   AK     2   3
   AL     1   42
   AL     2   54
   AR     1   103
   AR     2   89
   AR     3   2
*/
</programlisting>

    <para>Slight modifications allow some more sophisticated statistics to be
    produced, such as:</para>

    <programlisting>MyReportFormat2 := RECORD
  State{cardinality(56)}  := Person.State;
  A1          := CountAccts;
  GroupCount  := COUNT(GROUP);
  MaleCount   := COUNT(GROUP,Person.Gender = 'M');
  FemaleCount := COUNT(GROUP,Person.Gender = 'F');
END;

RepTable2 := TABLE(Person,MyReportFormat2,State,CountAccts );

OUTPUT(RepTable2);
</programlisting>

    <para>This adds a breakdown of how many men and women there are in each
    category, by using the optional second parameter to COUNT (available only
    for use in RECORD structures where its first parameter is the GROUP
    keyword).</para>

    <para>Adding {cardinality(56)} to the State definition is a hint to the
    optimizer that exactly 56 distinct values are possible in that field,
    which allows it to select the best algorithm to produce the output as
    quickly as possible.</para>

    <para>The same pattern extends to any other statistics you need: add
    further aggregate function calls to the RECORD structure to generate them
    against any set of data.</para>
  </sect2>

  <sect2 id="A_More_Complex_Example">
    <title>A More Complex Example</title>

    <para>As a slightly more complex example, the following code produces a
    Cross-Tab result table with the average balance on a bankcard trade, the
    average high credit on a bankcard trade, and the average total balance on
    bankcards, tabulated by state and gender.</para>

    <para>This code demonstrates using separate aggregate attributes as the
    value parameters to the aggregate functions in the Cross-Tab.</para>

    <programlisting>IsValidType(STRING1 PassedType) := PassedType IN ['O', 'R', 'I'];

IsRevolv := Person.Accounts.AcctType = 'R' OR 
        (~IsValidType(Person.Accounts.AcctType) AND 
         Person.Accounts.Account[1] IN ['4', '5', '6']);

SetBankIndCodes := ['BB', 'ON', 'FS', 'FC'];

IsBank := Person.Accounts.IndustryCode IN SetBankIndCodes;

IsBankCard := IsBank AND IsRevolv;

AvgBal := AVE(Person.Accounts(isBankCard),Balance);
TotBal := SUM(Person.Accounts(isBankCard),Balance);
AvgHC  := AVE(Person.Accounts(isBankCard),HighCredit);

R1 := RECORD
  person.state;
  person.gender;
  Number          := COUNT(GROUP);
  AverageBal      := AVE(GROUP,AvgBal);
  AverageTotalBal := AVE(GROUP,TotBal);
  AverageHC       := AVE(GROUP,AvgHC);
END;

T1 := TABLE(person, R1,  state, gender);

OUTPUT(T1);
</programlisting>
  </sect2>

  <sect2 id="A_Statistical_Example">
    <title>A Statistical Example</title>

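    <para>The following example uses the VARIANCE, COVARIANCE, and
    CORRELATION functions to analyze grid points. It also shows the technique
    of putting the Cross-Tab into a MACRO, then calling the MACRO to generate
    the specific result for a given dataset. The second OUTPUT subtracts
    values computed from the raw sums from the built-in results, so each of
    its columns should be close to zero. The third OUTPUT reports the
    least-squares best-fit line.</para>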

    <programlisting>pointRec := { REAL x, REAL y };

analyze( ds ) := MACRO
  #uniquename(rec)
  %rec% := RECORD
    c     := COUNT(GROUP);
    sx    := SUM(GROUP, ds.x);
    sy    := SUM(GROUP, ds.y);
    sxx   := SUM(GROUP, ds.x * ds.x);
    sxy   := SUM(GROUP, ds.x * ds.y);
    syy   := SUM(GROUP, ds.y * ds.y);
    varx  := VARIANCE(GROUP, ds.x);
    vary  := VARIANCE(GROUP, ds.y);
    varxy := COVARIANCE(GROUP, ds.x, ds.y);
    rc    := CORRELATION(GROUP, ds.x, ds.y);
  END;
  #uniquename(stats)
  %stats% := TABLE(ds,%rec% );

  OUTPUT(%stats%);
  OUTPUT(%stats%, { varx - (sxx-sx*sx/c)/c,
                    vary - (syy-sy*sy/c)/c,
                    varxy - (sxy-sx*sy/c)/c,
                    rc - (varxy/SQRT(varx*vary)) });
  OUTPUT(%stats%, { 'bestFit: y='+(STRING)((sy-sx*varxy/varx)/c)+' + '+(STRING)(varxy/varx)+'x' });
ENDMACRO;

ds1 := DATASET([{1,1},{2,2},{3,3},{4,4},{5,5},{6,6}], pointRec);
ds2 := DATASET([{1.93896e+009, 2.04482e+009},
                {1.77971e+009, 8.54858e+008},
                {2.96181e+009, 1.24848e+009},
                {2.7744e+009,  1.26357e+009},
                {1.14416e+009, 4.3429e+008},
                {3.38728e+009, 1.30238e+009},
                {3.19538e+009, 1.71177e+009} ], pointRec);
ds3 := DATASET([{1, 1.00039},
                {2, 2.07702},
                {3, 2.86158},
                {4, 3.87114},
                {5, 5.12417},
                {6, 6.20283} ], pointRec);

analyze(ds1);
analyze(ds2);
analyze(ds3); 
</programlisting>
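
    <para>The variance identity that the macro checks can be worked by hand
    for ds1, whose six points all lie on the line y = x. The sketch below
    assumes, as the macro's own check does, that VARIANCE divides by the
    record count (a population variance). The names n, sumX, sumXX, and
    popVar are hypothetical and used only for this illustration:</para>

    <programlisting>// Hand check for ds1: x = y = 1, 2, 3, 4, 5, 6
n     := 6;    // COUNT(GROUP)
sumX  := 21;   // 1+2+3+4+5+6, the macro's sx (sy is identical)
sumXX := 91;   // 1+4+9+16+25+36, the macro's sxx (sxy and syy are identical)

// Same expression the macro subtracts from varx: (sxx-sx*sx/c)/c
popVar := (sumXX - sumX*sumX/n)/n;   // (91 - 73.5)/6, about 2.9167

OUTPUT(popVar);
</programlisting>

    <para>Because x and y are identical in ds1, the same arithmetic gives
    vary and varxy, so the correlation works out to 1 and the best-fit line
    has a slope of 1 and an intercept of 0.</para>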

  </sect2>
</sect1>