<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Using_the_GROUP_Function">
  <title>Using the GROUP Function</title>

  <para>The GROUP function provides important functionality when processing
  very large datasets. The basic concept is that the GROUP function breaks
  the dataset up into a number of smaller subsets, but the GROUPed dataset is
  still treated as a single entity in your ECL code.</para>

  <para>Operations on a GROUPed dataset are automatically performed on each
  subset, separately. Therefore, an operation on a GROUPed dataset appears
  in the ECL code as a single operation, but internally it is accomplished by
  serially performing the same operation against each subset in turn. The
  advantage of this approach is that each individual operation is much
  smaller and more likely to complete without spilling to disk, which means
  the total time to perform all the separate operations will typically be
  less than the time to perform the same operation against the entire
  dataset (sometimes dramatically so).</para>
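
  <para>As a minimal sketch of this behavior, the following code computes a
  running balance for each person. The inline dataset, the AcctRec layout,
  and every attribute name in it are hypothetical illustrations and are not
  part of the Programmer's Guide example files. The SORT and ITERATE each
  appear once in the code, but each should be applied to one PersonID group
  at a time, so the RunningTotal value would be expected to start over for
  each person. UNGROUP then returns an ordinary, ungrouped recordset.</para>

  <programlisting>// Hypothetical layout and data, for illustration only
AcctRec := RECORD
  UNSIGNED4 PersonID;
  STRING8   OpenDate;
  UNSIGNED4 Balance;
  UNSIGNED4 RunningTotal := 0;
END;

ds := DATASET([{1,'20200115',100,0},
               {2,'20190301',75,0},
               {1,'20210620',250,0},
               {2,'20220410',300,0},
               {2,'20180905',20,0}],AcctRec);

// Adds the current balance to the total carried from the previous record
AcctRec AddUp(AcctRec L, AcctRec R) := TRANSFORM
  SELF.RunningTotal := L.RunningTotal + R.Balance;
  SELF := R;
END;

byPerson := GROUP(SORT(ds,PersonID),PersonID); // one group per person
byDate   := SORT(byPerson,OpenDate);           // sorts within each group
totals   := ITERATE(byDate,AddUp(LEFT,RIGHT)); // iterates within each group
OUTPUT(UNGROUP(totals));
</programlisting>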

  <sect2 id="GROUP_vs_SORT">
    <title>GROUP vs. SORT</title>

    <para>The GROUP function does not automatically sort the records it
    operates on; it GROUPs based on the order of the records it is given.
    Therefore, you would usually SORT the records first by the field(s) on
    which you want to GROUP (except in circumstances where the GROUP field(s)
    are used only to break a single large operation up into a number of much
    smaller operations).</para>
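
    <para>The effect of record order on GROUP can be sketched with a tiny
    inline dataset. The RunRec layout, the data, and all attribute names
    below are hypothetical. The ITERATE numbers the records within each
    group, so counting the records numbered 1 gives the number of groups.
    Assuming GROUP behaves as described above, the unsorted input should
    produce a separate group for each run of consecutive matching PersonID
    values, while the sorted input should produce one group per
    person.</para>

    <programlisting>// Hypothetical data: PersonID 1 appears in two separate runs
RunRec := RECORD
  UNSIGNED1 PersonID;
  STRING1   Acct;
  UNSIGNED1 Seq := 0;
END;

visits := DATASET([{1,'A',0},{2,'B',0},{1,'C',0}],RunRec);

// Seq restarts at 1 for the first record of each group
RunRec NumberRecs(RunRec L, RunRec R) := TRANSFORM
  SELF.Seq := L.Seq + 1;
  SELF := R;
END;

// GROUP in the order given: the runs 1 / 2 / 1 should form three groups
asGiven := UNGROUP(ITERATE(GROUP(visits,PersonID),
                           NumberRecs(LEFT,RIGHT)));

// SORT first: both PersonID 1 records should land in the same group
sortedFirst := UNGROUP(ITERATE(GROUP(SORT(visits,PersonID),PersonID),
                               NumberRecs(LEFT,RIGHT)));

// Each record with Seq = 1 marks the start of a group
OUTPUT(COUNT(asGiven(Seq = 1)),NAMED('GroupsAsGiven'));
OUTPUT(COUNT(sortedFirst(Seq = 1)),NAMED('GroupsSortedFirst'));
</programlisting>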

    <para>For the set of operations that use TRANSFORM functions (such as
    ITERATE, PROJECT, ROLLUP, etc.), the operation is performed on each
    fragment (group) in a GROUPed recordset independently. This means that
    testing for boundary conditions is different than if you were working
    with a SORTed dataset. For example, the following code (contained in
    GROUPfunc.ECL) uses the GROUP function to rank people's accounts, based
    on the open date and balance. The account with the newest open date is
    ranked highest (if multiple accounts were opened on the same day, the
    one with the highest balance is ranked first). No boundary check is
    needed in the TRANSFORM function because the ITERATE starts over again
    with each person, so the L.Ranking field value for each new person group
    is zero (0).</para>

    <programlisting>IMPORT $;

accounts := $.DeclareData.Accounts;

rec := RECORD
  accounts.PersonID;
  accounts.Account;
  accounts.opendate;
  accounts.balance;
  UNSIGNED1 Ranking := 0;
END;

tbl := TABLE(accounts,rec);

// No boundary check: L.Ranking is zero at the start of each group
rec RankGrpAccts(rec L, rec R) := TRANSFORM
  SELF.Ranking := L.Ranking + 1;
  SELF := R;
END;

// GROUP by person, then SORT within each group (newest, largest first)
GrpRecs := SORT(GROUP(SORT(tbl,PersonID),PersonID),-Opendate,-Balance);
i1 := ITERATE(GrpRecs,RankGrpAccts(LEFT,RIGHT));
OUTPUT(i1);
</programlisting>

    <para>The following code uses only SORT to achieve the same record order
    as the previous code. Notice the boundary check code in the TRANSFORM
    function. This is required because the ITERATE performs a single
    operation against the entire dataset:</para>

    <programlisting>// Boundary check: restart the ranking whenever the PersonID changes
rec RankSrtAccts(rec L, rec R) := TRANSFORM
  SELF.Ranking := IF(L.PersonID = R.PersonID,L.Ranking + 1, 1);
  SELF := R;
END;

SortRecs := SORT(tbl,PersonID,-Opendate,-Balance);
i2 := ITERATE(SortRecs,RankSrtAccts(LEFT,RIGHT));
OUTPUT(i2);
</programlisting>

    <para>The different bounds checking in each example is a consequence of
    the fragmenting created by the GROUP function. The ITERATE operates
    separately on each fragment in the first example, and operates on the
    entire record set in the second.</para>
  </sect2>

  <sect2 id="PG_Performance_Considerations">
    <title>Performance Considerations</title>

    <para>There is also a major performance advantage to using the GROUP
    function. For example, SORT is an <emphasis>n log n</emphasis>
    operation, so breaking large record sets up into sets of smaller sets can
    dramatically reduce the time it takes to perform the sorting
    operation.</para>

    <para>Assume that a dataset contains 1 billion (1,000,000,000) 1,000-byte
    records and that you're operating on a 100-node supercomputer. Assuming
    also that the data is evenly distributed, you have 10 million records
    per node, occupying 10 gigabytes of memory on each node. Suppose you need
    to sort the data by three fields: personID, opendate, and balance. You
    could achieve the result in three possible ways: a global SORT, a
    distributed local SORT, or a GROUPed distributed local SORT.</para>

    <para>Here's an example that demonstrates all three methods (contained in
    GROUPfunc.ECL):</para>

    <programlisting>// build a larger test dataset: CLUSTERSIZE * 2 copies of each record
bf := NORMALIZE(accounts,
                CLUSTERSIZE * 2,
                TRANSFORM(RECORDOF(accounts),
                          SELF := LEFT));
ds0 := DISTRIBUTE(bf,RANDOM()) : PERSIST('~PROGGUIDE::PERSIST::TestGroupSort');
ds1 := DISTRIBUTE(ds0,HASH32(personid));

// do a global sort
s1 := SORT(ds0,personid,opendate,-balance);
a  := OUTPUT(s1,,'~PROGGUIDE::EXAMPLEDATA::TestGroupSort1',OVERWRITE);

// do a distributed local sort
s3 := SORT(ds1,personid,opendate,-balance,LOCAL);
b  := OUTPUT(s3,,'~PROGGUIDE::EXAMPLEDATA::TestGroupSort2',OVERWRITE);

// do a grouped local sort
s4 := SORT(ds1,personid,LOCAL);
g2 := GROUP(s4,personid,LOCAL);
s5 := SORT(g2,opendate,-balance);
c  := OUTPUT(s5,,'~PROGGUIDE::EXAMPLEDATA::TestGroupSort3',OVERWRITE);

SEQUENTIAL(a,b,c);
</programlisting>

    <para>The result sets for all of these SORT operations are identical.
    However, the time it takes to produce them is not. The above example
    operates only on 10 million 46-byte records per node, not the one billion
    1,000-byte records previously mentioned, but it certainly illustrates the
    techniques.</para>

    <para>For the hypothetical one-billion-record example, the performance of
    the global SORT is calculated by the formula: 1 billion times the
    base-10 log of 1 billion (9), resulting in a performance metric of 9
    billion. Each node performs a distributed local SORT on only its own 10
    million records, so its performance is calculated by the formula: 10
    million times the log of 10 million (7), resulting in a performance
    metric of 70 million. Assuming the GROUP operation created 1,000
    sub-groups of 10,000 records on each node, the performance of the
    grouped local SORT is calculated by the formula: 1,000 times (10,000
    times the log of 10,000 (4)), resulting in a performance metric of 40
    million.</para>

    <para>The performance metric numbers themselves are meaningless, but their
    ratios do indicate the difference in performance you can expect to see
    between SORT methods. The distributed local SORT should be roughly 129
    times faster than the global SORT (9 billion / 70 million), the grouped
    SORT roughly 225 times faster than the global SORT (9 billion / 40
    million), and the grouped SORT about 1.75 times faster than the
    distributed local SORT (70 million / 40 million).</para>
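
    <para>The arithmetic behind these ratios can be reproduced with a short
    sketch. The SortCost function and the attribute names are hypothetical
    helpers introduced only for this illustration. The sketch relies on the
    ECL LOG function returning a base-10 logarithm, which matches the
    figures used above, and it models relative cost only, not measured run
    times.</para>

    <programlisting>// Relative cost of an n log n sort over n records (base-10 log)
REAL8 SortCost(REAL8 n) := n * LOG(n);

globalCost  := SortCost(1000000000);   // one sort over all 1 billion records
localCost   := SortCost(10000000);     // each node sorts its 10 million records
groupedCost := 1000 * SortCost(10000); // 1,000 groups of 10,000 records per node

// Only the ratios are meaningful
OUTPUT(globalCost / localCost,NAMED('Global_vs_Local'));
OUTPUT(globalCost / groupedCost,NAMED('Global_vs_Grouped'));
OUTPUT(localCost / groupedCost,NAMED('Local_vs_Grouped'));
</programlisting>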
  </sect2>
</sect1>