
/*##############################################################################
HPCC SYSTEMS software Copyright (C) 2012 HPCC Systems®.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
############################################################################## */
Issues that aren't handled correctly at the moment:
o LEFT/RIGHT.
  Currently these can be ambiguous - because expressions at different levels of the tree can have the same
  logical representation. It means many transforms need to be context dependent, because the same expression
  in different locations may not have the same meaning. We currently get away with it most of the time.
  (See the example sketch at the end of this list.)
o No way to alias a dataset.
  - it is needed in some situations e.g., count(children.names, age > ave(children.names));
o no_setresult
  The representation is ugly for walking activity lists - much better if we could guarantee that a dataset was the first parameter:
  no_setresult(dataset, select, sequence).
o It is never simple to walk the activity tree because of exceptions like if(), setresult(), activerow etc.
o Should possibly represent selectors as different HqlExpressions - it might remove some of the complication from the transformers.
o Hoisting common items: conditions and dependencies
  - Generally if a cse is used in multiple places then we want to hoist it once
  - We don't want to hoist it too early, otherwise dependencies might get messed up
  - If a cse is used in a conditional branch then we don't want to hoist it globally unless it is (first) used non-globally
  - cses shouldn't be moved within sequentials because it may change the meaning. We can assume that all evaluations
    of the same expression produce the same result though
  - careful if hoisting, because we may also need to hoist items in no_comma that are further up the tree....
  - if a sequential is inside a condition, then it can't be commoned up with a later non-conditional cse.
  - Need comma expressions to be created in the best locations. Generally tagged onto the activity at a minimum.
  - problem with creating a comma locally and then hoisting without analysing is that they won't get commoned up.
o Correct place for matched/fileposition
  - currently the subquery tree is walked to work out the correct place to evaluate fileposition, but this needs reimplementing.
o Filter scoring
  - It would be useful to be able to score filters/datasets so we can reorder them/work out if optimizations are worthwhile.
o no_mapto
  This is a particularly bad representation for walking the tree because of cse issues etc.
  better would be no_case(value, [match_values], result_values {expanded});
  Removing case/map activities at least makes cse transforming simpler.
o no_xml flags in tree
  for xml etc. these should be inherited from child datasets, but ignored from child values. Also nested parse isn't allowed at the moment.
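o Example of the LEFT ambiguity (a contrived ECL sketch; the dataset, file and field names are invented).
  Both uses of LEFT below resolve against datasets with the same record, so they have the same logical
  representation even though one refers to a row of the outer PROJECT and the other to a row of the
  nested PROJECT:
    rec := RECORD
        UNSIGNED4 val;
    END;
    ds := DATASET('~demo::values', rec, THOR);    // hypothetical logical file

    result := PROJECT(ds,
        TRANSFORM(rec,
            // outer LEFT = current row of ds; inner LEFT = current row of the nested PROJECT
            SELF.val := LEFT.val +
                        SUM(PROJECT(ds, TRANSFORM(rec, SELF.val := LEFT.val * 2)), val)));
    OUTPUT(result);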
**********************************************************************************************************************************************************
-- Reordering filters --
o Should have a go at looking at HOLe, and trying to steal its code. Effectively we need to
  - work out the cardinality of each field.
  - functions to calculate how likely a filter condition is to be true
  - functions to score how expensive an operation is.
  - do filters on outer datasets first
  - reorder the filters to match.
  (An illustrative sketch follows at the end of this section.)
o HOLe also chooses the best order to do the joins in. That would be possible, but there aren't that many situations where
  reordering would be that helpful. It wouldn't be trivial, but I think it could be done reasonably easily.
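o Illustration of the scoring idea (an invented example; the dataset and filters are hypothetical).
  As written, the expensive regular-expression test comes first; cardinality/cost scoring would let the
  generator evaluate the cheap, highly selective age test before it:
    person := DATASET('~demo::person', {STRING40 surname, UNSIGNED1 age}, THOR);
    // reordering goal: evaluate (age > 99) before the costly REGEXFIND
    oldSmiths := person(REGEXFIND('smi?th', surname, NOCASE), age > 99);
    OUTPUT(oldSmiths);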
-- Related datasets --
o What is the issue? It is simplest to show with a couple of examples:
  address(exists(address.person.books(f)))
  (A fuller sketch of this kind of query follows at the end of this section.)
o The nested child dataset is relatively simple if the node created is select(select(address,person,new),books,new);
  we would be requested to build an iterator on address.person.
  - If no break is required, we can just use nested iterators
  - if a break is required we need a class that can iterate AND provide access to both person and book fields.
  - Similar problem as an outer level nested iterator, except we can use pointers to the parent record.
o Would be easier if we had a representation:
  address(exists(related(person(parent-join-condition), books(join-condition))(f)));
  related has the syntax:
  related(filter1, filter2, filter3, filter4, ...)
o related() iterator:
  i) need to create a class. The class will probably be of the same form as the HOLe iterators.
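o Sketch of this kind of nested child query in ECL (the record layouts and file name are invented, mirroring
  the example above):
    bookRec := RECORD
        STRING40 title;
        UNSIGNED2 year;
    END;
    personRec := RECORD
        STRING20 forename;
        DATASET(bookRec) books;
    END;
    addressRec := RECORD
        UNSIGNED4 id;
        STRING40 addr;
        DATASET(personRec) person;
    END;
    address := DATASET('~demo::address', addressRec, THOR);

    // addresses where someone there owns a recent book - the exists() iterates
    // person and books while the current address row stays in scope
    f := address(EXISTS(address.person.books(year > 2000)));
    OUTPUT(f);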
-- Related datasets --
o The problem, an example:
  address(exists(books(books.personid=(person(person.addrid = address.id)).id)(f)))
o The issue here is that we need to look in the filter condition to see how the join is done.
  We really need a graph which looks like:
  address(exists(relate(person(addr = address.id), books(personid=person.id))(f)));
  (A sketch of this situation in ECL follows at the end of this section.)
o First requirement is to support the explicit syntax. Second is to spot where it needs to be introduced.
o Main effects of the syntax are
  a) to ensure tables are encountered in the correct order
  b) to ensure we never encounter newtable.field
o Automatic spotting:
  - For aggregates etc.
  - Spot all uses of dataset.field in a filter where dataset is new and not a single row.
  - If these datasets provide a bridge from the current scope to the main dataset for the aggregate, then
    generate a relate(a,b,c) statement. Extract any filter which just relates to fields that are already
    in scope to the appropriate level.
  - Also applies to x[1] where x is an unrelated dataset.
  - Needs to be done after scope tagging so the table usage information is there, which means that
    a) scope checking needs to be happy with it. Not sure how it would really cope.
    b) the modified graph with the relates needs to also be correctly tagged.
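o Sketch of the unrelated-dataset case in ECL (layouts and file names are invented). person and books are
  flat datasets, and the links address->person and person->books exist only inside the filter conditions -
  which is what relate() would make explicit:
    addressRec := RECORD
        UNSIGNED4 id;
        STRING40 addr;
    END;
    personRec := RECORD
        UNSIGNED4 id;
        UNSIGNED4 addrid;
    END;
    bookRec := RECORD
        UNSIGNED4 personid;
        STRING40 title;
    END;
    address := DATASET('~demo::address', addressRec, THOR);
    person  := DATASET('~demo::person', personRec, THOR);
    books   := DATASET('~demo::books', bookRec, THOR);

    // addresses where a person living there owns a book whose title starts with 'A'
    interesting := address(EXISTS(books(title[1] = 'A',
                                        personid IN SET(person(addrid = address.id), id))));
    OUTPUT(interesting);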
-- Sub query related work ---
* ROW(row, record) - convert fields by name, with an error if no match is found.
* BUG: Hoisting datasets because they are scope invariant probably does too much, and
  doesn't create a valid structure.
* Add a debug option to always create compound activities regardless of sharing - for debugging code generation of sub-queries.
-- Related potential changes ---
* What work would be needed to allow records to be represented as records within a record,
  rather than expanding the record/dataset?
  What work would then be needed to allow compatibility between records and implicit children?
* EXISTS(x) etc. should really push x onto a top scope stack, where all are accessible. This would
  reduce the number of qualifying prefixes on fields.
* Distributed local activities after remote activities - this may potentially speed things up, but
  generally compound aggregating sources will have a larger advantage.
-------- SubQuery fundamental questions: ---------
* Is an explicit syntax needed when accessing related non-nested datasets? What would be
  sensible? Can it be deduced?
* How do you calculate an average in an enclosing scope? How does within relate?
!! Do aliases solve the problems they are proposed to, or do you need something to evaluate an expression in a different scope?
* In what sense are parent fields accessible when iterating through people at the root level?
  They could theoretically be serialised as part of the
  record, but removed by an output with no table. Implicit projects would then make it usable. Worth pursuing.
  >> Needs a representation in a record to indicate it also contains parent references.
* How do you alias a dataset? Otherwise we can't implement sqjoin for a disk-based item.
-----------------------------------------------------------------------
-- Sub query complications ---
*) generating a sub graph on a child dataset.
*) accessing fields from a parent dataset (unrelated child).
*) accessing fields from a grandparent (multi-level nesting).
*) accessing fields from within an inline sub-query (both levels).
*) Indicating whether subgraphs can be run locally or need to be distributed (this affects how
   cursors are serialised).
*) traversing the query at a child level.
*) Aggregating, and selecting as child activities.
*) counting, aggregating and group-aggregating source activities. (ensure both are tested)
*) CSE causing datasets to be returned. - Are they worth it?
*) Multiple sub-queries, with different fields being available at different nesting levels.
*) Optimizing counts and representation.
*) Are aliases ever needed? (Situations where global is currently used, I suspect.)
*) Grouped temporary recordsets.
*) hoisting table/context invariant expressions. E.g.,
   SELF.countMatch := count(d2(field = left.field))
   recoded as
   summary := table(d2, {cnt := count(group), field}, field);
   SELF.countMatch := sum(summary(field = left.field), cnt);
   where summary is only evaluated once per query. (A complete sketch follows at the end of this list.)
*) Executing multiple sub-queries at once.
   - Would it help?
   - What chaos would it cause?
   - How about if we only allowed an explicit syntax?
*) How do we replicate HOLe's join files? Effectively, disk read (or other) datasets are cloned on each machine.
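*) A complete sketch of the hoisting example above, in ECL (dataset names, layouts and files are invented;
   the field is called matchval here rather than 'field'). The first form evaluates a count of d2 for every
   input row; the rewritten form builds the summary table once per query and sums matching counts from it:
    rec := RECORD
        STRING20 matchval;
    END;
    outRec := RECORD
        STRING20 matchval;
        UNSIGNED4 countMatch;
    END;
    d1 := DATASET('~demo::d1', rec, THOR);
    d2 := DATASET('~demo::d2', rec, THOR);

    // naive form: a count of d2 is evaluated for every row of d1
    outRec naive(rec l) := TRANSFORM
        SELF.countMatch := COUNT(d2(matchval = l.matchval));
        SELF := l;
    END;
    perRowCounts := PROJECT(d1, naive(LEFT));

    // hoisted form: aggregate d2 once, then look the counts up per row
    summary := TABLE(d2, {UNSIGNED4 cnt := COUNT(GROUP), matchval}, matchval);
    outRec hoisted(rec l) := TRANSFORM
        SELF.countMatch := SUM(summary(matchval = l.matchval), cnt);
        SELF := l;
    END;
    lookupCounts := PROJECT(d1, hoisted(LEFT));

    OUTPUT(perRowCounts);
    OUTPUT(lookupCounts);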