<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<sect1 id="Smart_Stepping">
<title><emphasis role="bold">Smart Stepping</emphasis></title>
<sect2 id="PG_Overview">
<title>Overview</title>
<para>Smart Stepping is a set of indexing techniques that, taken together,
comprise a method of performing <emphasis>n</emphasis>-ary join/merge-join
operations, where <emphasis>n</emphasis> is the number of datasets (two or
more). Smart Stepping enables the supercomputer to efficiently join
records from multiple filtered data sources, including subsets of the same
dataset. It is particularly efficient when the matches are sparse and
uncorrelated. Smart Stepping also supports matching records that appear in
M of N datasets.</para>
<para>Before the advent of Smart Stepping, finding the intersection of
records from multiple datasets was performed by extracting the potential
matches from one dataset, and then joining that candidate set to each of
the other datasets in turn. The joins would use various mechanisms,
including index lookups, or reading the potential matches from a dataset
and then joining them. This meant that joining multiple datasets required
at least one dataset to be read in its entirety and then joined to the
others. This could be very inefficient if the programmer didn't take care
to select the most efficient order in which to read the datasets.
Unfortunately, it is often impossible to know beforehand which order would
be the best. It is also often impossible to order the joins so that the
two least frequent terms are joined first. It was also particularly
difficult to efficiently implement the M-of-N join varieties.</para>
<para>With Smart Stepping technology, these multiple dataset joins become
a single efficient operation instead of a series of separate operations.
Smart Stepping can only be used where the join condition is primarily an
equality test between columns in the input datasets, and the input
datasets must produce output sorted by those columns.</para>
<para>Smart Stepping also provides an efficient way of streaming
information from a dataset, sorted by any trailing sort order. Previously,
if you had a sorted dataset (often an index) that needed to be filtered by
some leading components, with the resulting rows sorted by the trailing
components, you had to read the entire filtered result and then sort it
afterwards.</para>
<para>Smart Stepping can use significant amounts of temporary storage if
used inappropriately, for example when the key fields that precede the
stepped fields are not well filtered. Therefore, care should be taken to
use it properly.</para>
</sect2>
<sect2 id="Trailing_Field_Sorts">
<title>Trailing Field Sorts</title>
<para>The STEPPED function provides the ability to sort by trailing key
component fields in a much more efficient manner than sorting after
filtering (previously the only method of accomplishing this). Stepping on
the trailing key fields allows the sorted rows to be returned without
reading the entire dataset.</para>
<para>Prior to the advent of Smart Stepping, a sorted dataset or index
could efficiently produce filtered rows, or rows sorted in the same order
as the original sort order, but it could not efficiently produce rows
sorted by a trailing sort order of the index (whether filtered or not).
The filter-then-sort method required that all rows be read from the
dataset before any sorted rows could be retrieved. Smart Stepping allows
the sorted data to be read immediately (and therefore partially).</para>
<para>The easiest way to see the effect is with this example (contained in
SmartStepping1.ECL; this code must be run in hthor or Roxie, not
Thor):</para>
<programlisting>IMPORT $;
IDX := $.DeclareData.IDX__Person_State_City_Zip_LastName_FirstName_Payload;
Filter := IDX.State = 'LA' AND IDX.City = 'ABBEVILLE';
//filter by the leading index elements
//and sort the output by a trailing element
OUTPUT(SORT(IDX(Filter),FirstName),ALL);    //the old way
OUTPUT(STEPPED(IDX(Filter),FirstName),ALL); //Smart Stepping</programlisting>
<para>The previous method of accomplishing this meant producing the
filtered result set, then using SORT to achieve the desired sort order.
The new method looks very similar, using STEPPED instead of SORT, and both
OUTPUTs produce the same result, but the two methods differ greatly in how
efficiently they achieve it.</para>
<para>Once you've successfully run this code and gotten your result, take
a look at the Graphs page.</para>
<para>Notice that the first OUTPUT's sub-graph contains three activities:
the index read, the sort, and the output. But the second OUTPUT's
sub-graph only contains two activities: the index read and the output. All
of the Smart Stepping work to produce the result is done by the index
read. If you then go to the ECL Watch page for the workunit and look at
the timings, you should see that the second OUTPUT's graph1-1 time is
significantly less than the first's graph1-2.</para>
<para>This demonstrates the type of performance advantage Smart Stepping
can have over previous methods. Of course, the real performance advantage
shows up when you ask for only the first <emphasis>n</emphasis> records,
as in this example (contained in SmartStepping1a.ECL):</para>
<programlisting>IMPORT $;
IDX := $.DeclareData.IDX__Person_State_City_Zip_LastName_FirstName_Payload;
Filter := IDX.State = 'LA' AND IDX.City = 'ABBEVILLE';
OUTPUT(CHOOSEN(SORT(IDX(Filter),FirstName),5));    //the old way
OUTPUT(CHOOSEN(STEPPED(IDX(Filter),FirstName),5)); //Smart Stepping</programlisting>
<para>After running this code, check the timings on the ECL Watch page.
You should again see quite a performance difference between the two
methods, even with this small amount of data.</para>
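<para>STEPPED also accepts more than one trailing field. The following is
a minimal sketch, not part of the SmartStepping example files, that
assumes the same index and filter used in the examples above; the
attribute name <emphasis>ByName</emphasis> is hypothetical. It should
return the first five filtered rows ordered by LastName and then
FirstName, and like the other examples in this section it is intended for
hthor or Roxie, not Thor:</para>
<programlisting>IMPORT $;
IDX := $.DeclareData.IDX__Person_State_City_Zip_LastName_FirstName_Payload;
Filter := IDX.State = 'LA' AND IDX.City = 'ABBEVILLE';
//hypothetical attribute: step by two trailing key fields instead of one
ByName := STEPPED(IDX(Filter),LastName,FirstName);
OUTPUT(CHOOSEN(ByName,5));</programlisting>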
</sect2>
<sect2 id="N-ary_JOINs">
<title>N-ary JOINs</title>
<para>The primary purpose of Smart Stepping is to enable
<emphasis>n</emphasis>-ary merge/join operations to be accomplished as
efficiently as possible. To that end, the concept of a set of datasets (or
indexes) has been added to the language. This allows JOIN to be extended
to operate on multiple datasets, not just two.</para>
<para>For example, given this data (contained in the SmartStepping2.ECL
file):</para>
<programlisting>Rec := RECORD,MAXLENGTH(4096)
  STRING1 Letter;
  UNSIGNED1 DS;
  UNSIGNED1 Matches := 1;
  UNSIGNED1 LastMatch := 1;
  SET OF UNSIGNED1 MatchDSs := [1];
END;
ds1 := DATASET([{'A',1},{'B',1},{'C',1},{'D',1},{'E',1}],Rec);
ds2 := DATASET([{'A',2},{'B',2},{'H',2},{'I',2},{'J',2}],Rec);
ds3 := DATASET([{'B',3},{'C',3},{'M',3},{'N',3},{'O',3}],Rec);
ds4 := DATASET([{'A',4},{'B',4},{'R',4},{'S',4},{'T',4}],Rec);
ds5 := DATASET([{'B',5},{'V',5},{'W',5},{'X',5},{'Y',5}],Rec);</programlisting>
<para>To do an inner join on all five datasets using Smart Stepping, the
code is this (also contained in the SmartStepping2.ECL file):</para>
<programlisting>SetDS := [ds1,ds2,ds3,ds4,ds5];
Rec XF(Rec L,DATASET(Rec) Matches) := TRANSFORM
  SELF.Matches := COUNT(Matches);
  SELF.LastMatch := MAX(Matches,DS);
  SELF.MatchDSs := SET(Matches,DS);
  SELF := L;
END;
j1 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Letter));
O1 := OUTPUT(j1);</programlisting>
<para>Without using Smart Stepping, the code is this (also contained in the
SmartStepping2.ECL file):</para>
<programlisting>Rec XF1(Rec L,Rec R,INTEGER MatchSet) := TRANSFORM
  SELF.Matches := L.Matches + 1;
  SELF.LastMatch := MatchSet;
  SELF.MatchDSs := L.MatchDSs + [MatchSet];
  SELF := L;
END;
j2 := JOIN(ds1,ds2,LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,2));
j3 := JOIN(j2,ds3,LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,3));
j4 := JOIN(j3,ds4,LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,4));
j5 := JOIN(j4,ds5,LEFT.Letter=RIGHT.Letter,XF1(LEFT,RIGHT,5));
O2 := OUTPUT(SORT(j5,Letter));</programlisting>
<para>Both of these examples produce the same one-record output (only the
letter B appears in all five datasets), but without Smart Stepping you
need four separate JOINs to accomplish the goal, and in “real world” code
you might need a separate TRANSFORM for each, depending on what result you
were trying to produce.</para>
<para>In addition to the standard inner join between all the datasets, the
Smart Stepping form of JOIN also supports the same type of LEFT OUTER and
LEFT ONLY joins as the standard JOIN operation. However, this form also
supports <emphasis>M</emphasis> of <emphasis>N</emphasis> joins (MOFN),
where matching records must appear in a specified minimum number of the
datasets and, optionally, in no more than a specified maximum number, as in
these examples (also contained in the SmartStepping2.ECL file):</para>
<programlisting>j6 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Letter),
           LEFT OUTER);
j7 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Letter),
           LEFT ONLY);
j8 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Letter),
           MOFN(3));   //match in at least 3 of the datasets
j9 := JOIN(SetDS,
           STEPPED(LEFT.Letter=RIGHT.Letter),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Letter),
           MOFN(3,4)); //match in at least 3 and at most 4 of the datasets
O3 := OUTPUT(j6);
O4 := OUTPUT(j7);
O5 := OUTPUT(j8);
O6 := OUTPUT(j9);</programlisting>
<para>The RANGE function is also available to limit which datasets in the
set of datasets will be processed, as in this example (also contained in
the SmartStepping2.ECL file):</para>
<programlisting>j10 := JOIN(RANGE(SetDS,[1,3,5]),
            STEPPED(LEFT.Letter=RIGHT.Letter),
            XF(LEFT,ROWS(LEFT)),
            SORTED(Letter));
O7 := OUTPUT(j10);
SEQUENTIAL(O1,O2,O3,O4,O5,O6,O7);</programlisting>
<para>This feature can be useful in situations where you may not have all
the information to select from all the datasets in the set.</para>
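<para>Because the second parameter to RANGE is a set of integers, the
choice of datasets can be passed in as a parameter. The following is a
minimal sketch, not part of SmartStepping2.ECL, that assumes the SetDS and
XF definitions shown above; the names <emphasis>JoinSubset</emphasis> and
<emphasis>Which</emphasis> are hypothetical. With the sample data, the call
shown should return the letters common to ds2 and ds4:</para>
<programlisting>//hypothetical helper: inner join only those datasets whose
//positions in SetDS are listed in Which
JoinSubset(SET OF INTEGER Which) := JOIN(RANGE(SetDS,Which),
                                         STEPPED(LEFT.Letter=RIGHT.Letter),
                                         XF(LEFT,ROWS(LEFT)),
                                         SORTED(Letter));
OUTPUT(JoinSubset([2,4]));</programlisting>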
<para>This next example demonstrates the most probable use for this
technology in the real world: finding the set of parent records where
related child records exist that fit a specified set of filter criteria.
That's exactly what this example (contained in the SmartStepping3.ECL
file) does:</para>
<programlisting>LinkRec := RECORD
  UNSIGNED1 Link;
END;
DS_Rec := RECORD(LinkRec)
  STRING10 Name;
  STRING10 Address;
END;
Child1_Rec := RECORD(LinkRec)
  UNSIGNED1 Nbr;
END;
Child2_Rec := RECORD(LinkRec)
  STRING10 Car;
END;
Child3_Rec := RECORD(LinkRec)
  UNSIGNED4 Salary;
END;
Child4_Rec := RECORD(LinkRec)
  STRING10 Domicile;
END;</programlisting>
<para>Using this form of RECORD structure inheritance makes it very simple
to define the linkage between the parent and child files. Note also that
all these files have different formats.</para>
<programlisting>ds := DATASET([{1,'Fred','123 Main'},{2,'George','456 High'},
               {3,'Charlie','789 Bank'},{4,'Danielle','246 Front'},
               {5,'Emily','613 Boca'},{6,'Oscar','942 Frank'},
               {7,'Felix','777 John'},{8,'Adele','543 Bank'},
               {9,'Johan','123 Front'},{10,'Ludwig','212 Front'}],
              DS_Rec);
Child1 := DATASET([{1,5},{2,8},{3,11},{4,14},{5,17},
                   {6,20},{7,23},{8,26},{9,29},{10,32}],Child1_Rec);
Child2 := DATASET([{1,'Ford'},{2,'Ford'},{3,'Chevy'},
                   {4,'Lexus'},{5,'Lexus'},{6,'Kia'},
                   {7,'Mercury'},{8,'Jeep'},{9,'Lexus'},
                   {9,'Ferrari'},{10,'Ford'}],
                  Child2_Rec);
Child3 := DATASET([{1,10000},{2,20000},{3,155000},{4,800000},
                   {5,250000},{6,75000},{7,200000},{8,15000},
                   {9,80000},{10,25000}],
                  Child3_Rec);
Child4 := DATASET([{1,'House'},{2,'House'},{3,'House'},{4,'Apt'},
                   {5,'Apt'},{6,'Apt'},{7,'Apt'},{8,'House'},
                   {9,'Apt'},{10,'House'}],
                  Child4_Rec);
TblRec := RECORD(LinkRec),MAXLENGTH(4096)
  UNSIGNED1 DS;
  UNSIGNED1 Matches := 0;
  UNSIGNED1 LastMatch := 0;
  SET OF UNSIGNED1 MatchDSs := [];
END;
Filter1 := Child1.Nbr % 2 = 0;
Filter2 := Child2.Car IN ['Ford','Chevy','Jeep'];
Filter3 := Child3.Salary &lt; 100000;
Filter4 := Child4.Domicile = 'House';
t1 := PROJECT(Child1(Filter1),TRANSFORM(TblRec,SELF.DS:=1,SELF:=LEFT));
t2 := PROJECT(Child2(Filter2),TRANSFORM(TblRec,SELF.DS:=2,SELF:=LEFT));
t3 := PROJECT(Child3(Filter3),TRANSFORM(TblRec,SELF.DS:=3,SELF:=LEFT));
t4 := PROJECT(Child4(Filter4),TRANSFORM(TblRec,SELF.DS:=4,SELF:=LEFT));</programlisting>
<para>The PROJECT operation is a simple way to transform the results for
all these different format files into a single standard layout that can be
used by the Smart Stepping JOIN operation.</para>
<programlisting>SetDS := [t1,t2,t3,t4];
TblRec XF(TblRec L,DATASET(TblRec) Matches) := TRANSFORM
  SELF.Matches := COUNT(Matches);
  SELF.LastMatch := MAX(Matches,DS);
  SELF.MatchDSs := SET(Matches,DS);
  SELF := L;
END;
j1 := JOIN(SetDS,
           STEPPED(LEFT.Link=RIGHT.Link),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Link));
OUTPUT(j1);
OUTPUT(ds(Link IN SET(j1,Link)));</programlisting>
<para>The first OUTPUT simply displays the same kind of result as the
previous example. The second OUTPUT produces the “real-world” result set
of the base dataset records that match the filter criteria for each of the
child datasets. With the sample data, only Link values 2, 8, and 10 pass
all four filters.</para>
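<para>The same set of filtered child datasets can also answer an M-of-N
question. The following is a minimal sketch, not part of
SmartStepping3.ECL, that assumes the SetDS, XF, and ds definitions shown
above; the attribute name <emphasis>j2</emphasis> is hypothetical. It asks
for parents whose child records satisfy at least three of the four
filters, so with the sample data it should return Link 1 in addition to
the parents found by the inner join, because Link 1 passes the car,
salary, and domicile filters but not the Nbr filter:</para>
<programlisting>//hypothetical variation: match in at least 3 of the 4 filtered child datasets
j2 := JOIN(SetDS,
           STEPPED(LEFT.Link=RIGHT.Link),
           XF(LEFT,ROWS(LEFT)),
           SORTED(Link),
           MOFN(3));
OUTPUT(j2);
OUTPUT(ds(Link IN SET(j2,Link)));</programlisting>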
</sect2>
</sect1>