====================
Eclcc/Code generator
====================

************
Introduction
************

Purpose
=======

The primary purpose of the code generator is to take an ECL query and convert it into a work unit
that is suitable for running by one of the engines.

Aims
====

The code generator has to do its job accurately. If the code generator does not correctly map the
ECL to the workunit it can lead to corrupt data and invalid results. Problems like that can often be
very hard and frustrating for the ECL users to track down.

There is also a strong emphasis on generating output that is as good as possible. Eclcc contains
many different optimization stages, and is extensible to allow others to be easily added.

Eclcc needs to be able to cope with reasonably large jobs. Queries that contain several megabytes of
ECL, generate tens of thousands of activities, and produce tens of megabytes of C++ are routine.
These queries need to be processed relatively quickly.

Key ideas
=========

Nearly all the processing of ECL is done using an expression graph. The representation of the
expression graph has some particular characteristics:

* Once the nodes in the expression graph have been created they are NEVER modified.
* Nodes in the expression graph are ALWAYS commoned up if they are identical.
* Each node in the expression graph is link counted (see below) to track its lifetime.
* If a modified graph is required, a new graph is created (sharing nodes from the old one).

The ECL language is a declarative language, and in general is assumed to be pure - i.e., there are no
side-effects, expressions can be evaluated lazily, and re-evaluating an expression causes no
problems. This allows eclcc to transform the graph in lots of interesting ways. (Life is never that
simple, so there are mechanisms for handling the exceptions.)

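As a concrete illustration of the first three points, here is a minimal sketch using the factory
functions and smart-pointer helpers declared in ecl/hql/hqlexpr.hpp (the exact overloads may vary)::

    #include "hqlexpr.hpp"

    // Build (a + a) twice. Because identical nodes are always commoned up,
    // both calls return a pointer to the SAME link-counted node.
    void demo(IHqlExpression * a)
    {
        OwnedHqlExpr sum1 = createValue(no_add, a->getType(), LINK(a), LINK(a));
        OwnedHqlExpr sum2 = createValue(no_add, a->getType(), LINK(a), LINK(a));
        assertex(sum1.get() == sum2.get());  // the same node, not merely an equal one

        // Nodes are never modified: a "change" means building a new node that
        // shares the unchanged children with the old graph. The OwnedHqlExpr
        // wrappers release their links when they go out of scope.
    }
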
From declarative to imperative
==============================

One of the main challenges with eclcc is converting the declarative ECL code into imperative C++
code. One key problem is that it needs to try to ensure that code is only evaluated when it is
required, but that it is also only evaluated once. It isn't always possible to satisfy both
constraints - for example, a global dataset expression used within a child query. Should it be
evaluated once before the activity containing the child query is called, or each time the child
query is called? If it is evaluated on demand then it may not be evaluated as efficiently...

This issue complicates many of the optimizations and transformations that are applied to the queries.
Long term the plan is to allow the engines to support more delayed lazy evaluation, so that whether
something is evaluated is decided dynamically rather than statically.

Flow of processing
==================

The idealised view of the processing within eclcc proceeds through the following stages:

* Parse the ECL into an expression graph.
* Expand out function calls.
* Normalize the expression graph so it is in a consistent format.
* Normalize the references to fields within datasets to tie them up with their scopes.
* Apply various global optimizations.
* Translate logical operations into the activities that will implement them.
* Resource and generate the global graph.
* For each activity, resource, optimize and generate its child graphs.

In practice the progression is not so clear cut. There tends to be some overlap between the
different stages, and some of them may occur in slightly different orders. However, the order
broadly holds.

*****************************
Working on the code generator
*****************************

The regression suite
====================

Before any change is accepted for the code generator it is always run against several regression suites to ensure that
it doesn't introduce any problems, and that the change has the desired effect. There are several different regression suites:

* testing/regress/ecl - the run-time regression suite.
* ecl/regress - a compiler regression suite. This contains tests that cannot be run, and tests that check error reporting.
* LN private suite - this contains a large selection (>10GB) of archived queries. They contain proprietary code, so unfortunately they cannot be released as open source.

The ecl/regress directory contains a script 'regress.sh' that is used for running the regression tests. It should be
executed in the directory containing the ecl files. The script generates the C++ code (and workunits) for each of the source
files to a target directory, and then executes a comparison program to compare the new results with a previous "golden"
reference set.

Before making any changes to the compiler, a reference set should be created by running the regression script and copying the
generated files to the reference directory.

Here is a sample command line:

``~/dev/hpcc/ecl/regress/regress.sh -t /regress/hpcc -e /home/<user>/buildr/Release/bin/eclcc -I /home/<user>/dev/hpcc/ecl/regress/modules -I /home/<user>/dev/hpcc/plugins/javaembed -I /home/<user>/dev/hpcc/plugins/v8embed -c /regress/hpcc.master -d bcompare``

(A version of this command resides in a shell script in each of my regression suite directories, with the -t and -c options adapted for each suite.)

For a full list of options execute the script with no parameters, or take a look at the script itself. A couple of useful options are:

* The script can be run on a single file by using the -q option.
* The -e option selects the path of the eclcc executable. This is particularly useful when running from the build
  directory (see below), or when using multiple build directories to compare behaviour between different versions.

We strongly recommend using a comparison program which allows rules to be defined to ignore certain differences (e.g., Beyond Compare).

Running directly from the build directory
=========================================

It is much quicker to run eclcc directly from the build directory, rather than deploying a system and running eclcc
from there. To do this you need to configure some options that eclcc requires, e.g., where the include files are found. The
options can be set either by setting environment variables or by specifying options in an eclcc.ini
file. The following are the names of the different options:

+-----------------------+------------------+
| Environment variable  | eclcc.ini option |
+=======================+==================+
| CL_PATH               | compilerPath     |
+-----------------------+------------------+
| ECLCC_LIBRARY_PATH    | libraryPath      |
+-----------------------+------------------+
| ECLCC_INCLUDE_PATH    | includePath      |
+-----------------------+------------------+
| ECLCC_PLUGIN_PATH     | plugins          |
+-----------------------+------------------+
| HPCC_FILEHOOKS_PATH   | filehooks        |
+-----------------------+------------------+
| ECLCC_TPL_PATH        | templatePath     |
+-----------------------+------------------+
| ECLCC_ECLLIBRARY_PATH | eclLibrariesPath |
+-----------------------+------------------+
| ECLCC_ECLBUNDLE_PATH  | eclBundlesPath   |
+-----------------------+------------------+

The eclcc.ini can either be a file in the local directory, or specified on the eclcc command line with -specs.
Including the settings in a local eclcc.ini file also makes it easy to debug eclcc directly from the build directory
within the Eclipse environment.

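For example, a minimal eclcc.ini for running from a build directory might look like the following
(a sketch assuming the simple name=value layout; the paths are purely illustrative and need adapting
to your own source and build trees)::

    compilerPath=/usr
    libraryPath=/home/<user>/build/Release/libs
    includePath=/home/<user>/dev/hpcc/rtl/include
    plugins=/home/<user>/build/Release/plugins
    templatePath=/home/<user>/dev/hpcc/rtl/eclrtpl
    eclLibrariesPath=/home/<user>/dev/hpcc/ecllibrary
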
Hints and tips
==============

* Logging

  There is an option for eclcc to output a logging file, and another to specify the level of detail in that logging
  file. If the detail level is above 500 then the expression tree for the query is output to the logging file after
  each of the code transformations. The tracing is very useful for tracking down at which stage inconsistencies are
  introduced into the expression graph, and also for learning how each transformation affects the query.

  The output format defaults to ECL - which is regenerated from the expression tree. (This ECL cannot generally be
  compiled without editing - partly because it contains extra annotations.) Use either of the following:

  ``eclcc myfile.ecl --logfile myfile.log --logdetail 999``

  ``regress.sh -q myfile.ecl -l myfile.log``

* -ftraceIR

  There is a debug option (-ftraceIR) that generates an intermediate representation of the expression graph rather than
  regenerating ECL. The output tends to be less compact and harder to read quickly, but has the advantage of being
  better structured, and contains more details of the internal representation. ecl/hql/hqlir.cpp contains
  more details of the format.

* Adding extra logging into the source code

  If you want to add tracing of expressions at any point in the code generation then adding either of the following
  calls will include the expression details in the log file:

  ``dbglogExpr(expr); // regenerate the ECL for an expression. See other functions in ecl/hql/hqlthql.hpp``

  ``EclIR::dbglogIR(expr); // regenerate the IR for an expression. See other functions in ecl/hql/hqlir.hpp``

* Logging while debugging

  If you are debugging inside gdb it is often useful to be able to dump out details of an expression. Calling
  ``EclIR::dump_ir(expr);`` will generate the IR on stdout. From within gdb use:

  ``p EclIR::dump_ir(expr)``

  The function can also be used with multiple parameters. Each expression will be dumped out, but common child nodes
  will only be generated once. This can be very useful when trying to determine the difference between two expressions.
  The quickest way is to call ``EclIR::dump_ir(expr1, expr2)``. The first difference between the expressions will
  be the expression that follows the first "return".

* Expression sequence ids

  Sometimes it can be hard to determine where a particular IHqlExpression node was created. If that is the case, then
  defining ``DEBUG_TRACK_INSTANCEID`` (in ecl/hql/hqlexpr.ipp) will add a unique sequence number to each IHqlExpression
  that is created. There is also a function checkSeqId() at the start of ecl/hql/hqlexpr.cpp which is called whenever
  an expression is created, linked, released, etc. Setting a breakpoint in that function can allow you to trace back
  exactly when and why a particular node was created.

***********
Expressions
***********

Expression graph representation
===============================

The key data structure within eclcc is the graph representation. The design has some key elements.

* Once a node is created it is never modified.

  Some derived information (e.g., sort order, number of records, unique hash, ...) might be
  calculated and stored in the class after it has been created, but that doesn't change what the node
  represents in any way.

  Some nodes are created in stages - e.g., records, modules. These nodes are marked as fully
  completed when closeExpr() is called, after which they cannot be modified.

* Nodes are always commoned up.

  If the same operator has the same arguments and type then there will be a unique IHqlExpression to
  represent it. This helps ensure that graphs stay as graphs and don't get converted to trees. It
  also helps with optimizations, and allows code duplicated in two different contexts to be brought
  together.

* The nodes are link counted.

  Link counts are used to control the lifetime of the expression objects. Whenever a reference to an
  expression node is held, its link count is increased, and decreased when no longer required. The
  node is freed when there are no more references. (This generally works well, but does give us problems
  with forward references.)

* Access to the graph is through interfaces.

  The main interfaces are IHqlExpression, IHqlDataset and IHqlScope. They are all defined in
  hqlexpr.hpp. The aim of the interfaces is to hide the implementation of the expression nodes so
  they can be restructured and changed without affecting any other code.

* The expression classes use interfaces and a type field rather than polymorphism.

  This could be argued to be bad object design... but.

  There are more than 500 different possible operators. If a class was created for each of them the
  system would quickly become unwieldy. Instead there are several different classes which model the
  different types of expression (dataset/expression/scope).

  The interfaces contain everything needed to create and interrogate an expression tree, but they do
  not contain functionality for directly processing the graph.

  To avoid some of the shortcomings of type fields there are various mechanisms for accessing derived
  attributes which avoid interrogating the type field directly.

* Memory consumption is critical.

  It is not unusual to have 10M or even 100M nodes in memory as a query is being processed. At that
  scale the memory consumption of each node matters - so great care should be taken when considering
  increasing the size of the objects. The node classes contain a class hierarchy which is there
  purely to reduce the memory consumption - not to reflect the functionality. With no memory
  constraints they wouldn't be there, but removing a single pointer per node can save 1GB of memory
  usage for very complex queries.

IHqlExpression
--------------

This is the interface that is used to walk and interrogate the expression graph once it has been created. Some of the main functions are:

getOperator()
    What does this node represent? It returns a member of the node_operator enumerated type.
numChildren()
    How many arguments does the node have?
queryChild(unsigned n)
    What is the nth child? If the argument is out of range it returns NULL.
queryType()
    The type of this node.
queryBody()
    Used to skip annotations (see below).
queryProperty()
    Does this node have a child which is an attribute that matches a given name? (See below for more about attributes.)
queryValue()
    For a no_constant node, return the value of the constant. It returns NULL otherwise.

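For illustration, here is a simplified sketch (not code from the source tree) of how these functions
are typically combined to walk a graph. Note the visited set: shared nodes must only be processed
once, so the graph is walked as a graph rather than a tree::

    #include <unordered_set>
    #include "hqlexpr.hpp"

    // Count the distinct nodes reachable from an expression. The visited set
    // is essential - without it the walk can take time exponential in the
    // depth of the graph, because shared subtrees are revisited repeatedly.
    static void countNodes(IHqlExpression * expr,
                           std::unordered_set<IHqlExpression *> & visited)
    {
        IHqlExpression * body = expr->queryBody();  // skip any annotations
        if (!visited.insert(body).second)
            return;                                 // already processed
        ForEachChild(i, body)
            countNodes(body->queryChild(i), visited);
    }
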
The nodes in the expression graph are created through factory functions. Some of the expression types
have specialised functions - e.g., createDataset, createRow, createDictionary - but scalar expressions
and actions are normally created with createValue().

Note: Generally, ownership of the arguments to the createX() functions is assumed to be taken over by
the newly created node.

The values of the enumeration constants in node_operator are used to calculate "crcs", which are used
to check whether the ECL for a query matches, and whether disk and index record formats match. The
enumeration contains quite a few legacy entries (no_unusedXXX) which can be used for new operators
(otherwise new operators must be added to the end).

IHqlSimpleScope
---------------

This interface is implemented by records, and is used to map names to the fields within the records.
If a record contains IFBLOCKs then each of the fields in the ifblock is defined in the
IHqlSimpleScope for the containing record.

IHqlScope
---------

Normally obtained by calling IHqlExpression::queryScope(). It is primarily used in the parser to
resolve fields from within modules.

The ECL is parsed on demand, so looking up a symbol may cause a cascade of ECL to be
compiled. The lookup context (HqlLookupContext) is passed to IHqlScope::lookupSymbol() for several
reasons:

* It contains information about the active repository - the source of the ECL which will be dynamically parsed.
* It contains caches of expanded functions - to avoid repeating expansion transforms.
* Some members are used for tracking which definitions are read, to build dependency graphs or archives of submitted queries.

The interface IHqlScope currently has some members that are used for creation; these should be
refactored and placed in a different interface.

IHqlDataset
-----------

This is normally obtained by calling IHqlExpression::queryDataset(). It has shrunk in size over
time, and could quite possibly be folded into IHqlExpression with little pain.

There is a distinction in the code generator between "tables" and "datasets". A table
(IHqlDataset::queryTable()) is a dataset operation that defines a new output record. Any operation
that has a transform or record that defines an output record (e.g., PROJECT, TABLE) is a table, whilst
those that don't (e.g., a filter, DEDUP) are not. There are a few apparent exceptions - e.g., IF.
(This is controlled by definesColumnList(), which returns true if the operator is a table.)

Properties and attributes
-------------------------

These are two related but slightly different concepts. An attribute refers to the explicit flags that
are added to operators (e.g., LOCAL, KEEP(n), etc. specified in the ECL, or some internal attributes
added by the code generator). There are a couple of different functions for creating attributes.
createExtraAttribute() should be used by default. createAttribute() is reserved for attributes
that never have any arguments, or for unusual situations where it is important that the arguments are
never transformed. Attributes are tested using queryAttribute()/hasAttribute() and represented by nodes of
kind no_attr/no_expr_attr.

The term "property" refers to computed information (e.g., record counts) that can be derived from the
operator, its arguments and attributes. Properties are covered in more detail below.

Field references
================

Fields can be selected from active rows of a dataset in three main ways:

* Some operators define LEFT/RIGHT to represent an input or processed dataset. Fields from these
  active rows are referenced with LEFT.<field-name>. Here LEFT or RIGHT is the "selector".
* Other operators use the input dataset as the selector. E.g., myFile(myFile.id != 0). Here the
  input dataset is the "selector".
* Often when the input dataset is used as the selector it can be omitted. E.g., myFile(id != 0).
  This is implicitly expanded by the parser to the second form.

A reference to a field is always represented in the expression graph as a node of kind no_select
(created with createSelectExpr). The first child is the selector, and the second is the field. Needless
to say there are some complications...

* LEFT/RIGHT.

  The problem is that the different uses of LEFT/RIGHT need to be disambiguated, since there may be
  several different uses of LEFT in a query. This is especially true when operations are executed in
  child queries. LEFT is represented by a node no_left(record, selSeq). Often the record is
  sufficient to disambiguate the uses, but there are situations where it isn't enough. So in
  addition no_left has a child which is a selSeq (selector sequence), which is also added as a child
  attribute of the PROJECT or other operator. At parse time it is a function of the input dataset;
  it is later normalized to a unique id to reduce the transformation work.

* Active datasets.

  This case is slightly more complicated, because the dataset used as the selector can
  be any upstream dataset up to the nearest table. So the following ECL code is legal::

    x := DATASET(...);
    y := x(x.id != 0);
    z := y(x.id != 100);

  Here the reference to x.id in the definition of z is referring to a field in the input dataset.

Because of these semantics the selector in a normalized tree is actually
inputDataset->queryNormalizedSelector() rather than inputDataset. This function currently returns the
table expression node (but it may change in the future - see below).

Attribute "new"
---------------

In some situations ECL allows child datasets to be treated as a dataset without an explicit
NORMALIZE, e.g., EXISTS(myDataset.ChildDataset).

This is primarily to enable efficient aggregates on disk files to be generated, but it adds some
complications with an expression of the form dataset.childdataset.grandchild. E.g.,::

    EXISTS(dataset(EXISTS(dataset.childdataset.grandchild)))

Or::

    EXISTS(dataset.childdataset(EXISTS(dataset.childdataset.grandchild)))

In the first example dataset.childdataset within dataset.childdataset.grandchild is a reference
to a dataset that doesn't have an active cursor (and needs to be iterated), whilst in the second it
refers to an active cursor.

To differentiate between the two, all references to fields within datasets/rows that don't have
active selectors have an additional attribute ("new") as a child of the select. So a no_select with a
"new" attribute requires the dataset to be created, while one without is a member of an active dataset
cursor.

If you have a nested row, the new attribute is added to the selection from the dataset, rather than
the selection from the nested row. The functions queryDatasetCursor() and querySelectorDataset()
are used to help interpret the meaning.

(An alternative would be to use a different node kind from no_select - possibly this should be
considered, as it would be more space efficient.)

The expression graph generated by the ECL parser doesn't contain any new attributes. These are added
as one of the first stages of normalizing the expression graph. Any code that works on normalized
expressions needs to take care to interpret no_selects correctly.

Transforming selects
--------------------

When an expression graph is transformed and none of the records are changed, the representation of
LEFT/RIGHT remains the same. This means any no_select nodes in the expression tree will also stay
the same.

However, if the transform modifies a table (highly likely) it means that the selector for the second
form of field selector will also change. Unfortunately this means that transforms often cannot be
short-circuited.

It could significantly reduce the extent of the graph that needs traversing, and the number of nodes
replaced in a transformed graph, if this could be avoided. One possibility is to use a different
value for dataset->queryNormalizedSelector(), using a unique id associated with the table. I think it
would be a good long term change, but it would require unique ids (similar to the selSeq) to be added
to all table expressions, and correctly preserved by any optimization.

Annotations
===========

Sometimes it is useful to add information into the expression graph (e.g., symbol names, position
information) that doesn't change the meaning, but should be preserved. Annotations allow information
to be added in this way.

An annotation's implementation of IHqlExpression generally delegates the majority of the methods
through to the annotated expression. This means that most code that interrogates the expression
graph can ignore their presence, which simplifies the caller significantly. However, transforms need
to be careful (see below).

Information about an annotation can be obtained by calling IHqlExpression::getAnnotationKind() and
IHqlExpression::queryAnnotation().

Associated side-effects
=======================

In legacy ECL you will see code like the following::

    EXPORT a(x) := FUNCTION
        Y := F(x);
        OUTPUT(Y);
        RETURN G(Y);
    END;

The assumption is that whenever a(x) is evaluated the value of Y will be output. However, that
doesn't particularly fit in with a declarative expression graph. The code generator creates a
special node (no_compound) with child(0) as the output action, and child(1) as the value to be
evaluated (G(Y)).

If the expression ends up being included in the final query then the action will also be included
(via the no_compound). At a later stage the action is migrated to a position in the graph where
actions are normally evaluated.

Derived properties
==================

There are many pieces of information that it is useful to know about a node in the expression graph -
many of which would be expensive to recompute each time they were required. Eclcc has several
mechanisms for caching derived information so it is available efficiently.

* Boolean flags - getInfoFlags()/getInfoFlags2().

  There are many Boolean attributes of an expression that are useful to know - e.g., is it
  constant, does it have side-effects, does it reference any fields from a dataset, etc. The
  bulk of these are calculated and stored in a couple of members of the expression class. They are
  normally retrieved via accessor functions, e.g., containsAssertKeyed(IHqlExpression*).

* Active datasets - gatherTablesUsed().

  It is very common to want to know which datasets an expression references. This information is
  calculated and cached on demand, and accessed via the IHqlExpression::gatherTablesUsed() functions.
  There are a couple of other functions, IHqlExpression::isIndependentOfScope() and
  IHqlExpression::usesSelector(), which provide efficient implementations for common uses.

* Information stored in the type.

  Currently datasets contain information about sort order, distribution and grouping as part of the
  expression type. This information should be accessed through the accessor functions applied to the
  expression (e.g., isGrouped(expr)). At some point in the future it is planned to move this
  information to a general derived property (see next).

* Other derived properties.

  There is a mechanism (in hqlattr) for calculating and caching an arbitrary derived property of an
  expression. It is currently used for the number of rows, the location-independent representation,
  the maximum record size, etc. There are typically accessor functions to access the cached
  information (rather than calling the underlying IHqlExpression::queryAttribute() function).

* Helper functions.

  Some information doesn't need to be cached because it isn't expensive to calculate, but rather than
  duplicating the code, a helper function is provided. E.g., queryOriginalRecord() and
  hasUnknownTransform(). They are not part of the interface because their number would make the
  interface unwieldy, and they can be completely calculated from the public functions.

  However, it can be very hard to find the function you are looking for, and they would greatly
  benefit from being grouped, e.g., into namespaces.

Transformations
===============

One of the key processes in eclcc is walking and transforming the expression graphs. Both of these
are covered by the term transformations. One of the key things to bear in mind is that you need to
walk the expression graph as a graph, not as a tree. If you have already examined a node once you
shouldn't repeat the work - otherwise the execution time may grow exponentially with the node depth.

Other things to bear in mind:

* If a node isn't modified, don't create a new one - return a link to the old one.
* You generally need to walk the graph and gather some information before creating a modified graph.
  Sometimes creating a new graph can be short-circuited if no changes will be required.
* Sometimes you can be tempted to try and short-circuit transforming part of a graph (e.g., the
  arguments to a dataset activity), but because of the way references to fields within datasets work,
  that often doesn't work.
* If an expression is moved to another place in the graph, you need to be very careful to check
  whether the original context was conditional and the new context is not.
* The meaning of expressions can be context dependent. E.g., references to active datasets can be
  ambiguous.
* Never walk the expressions as a tree, always as a graph!
* Be careful with annotations.

  It is essential that an expression that is used in different contexts with different annotations
  (e.g., two different named symbols) is consistently transformed. Otherwise it is possible for a
  graph to be converted into a tree. E.g.,::

    A := x; B := x; C := A + B;

  must not be converted to::

    A' := x'; B' := x''; C' := A' + B';

  For this reason most transformers will check if expr->queryBody() matches expr, and if not will
  transform the body (the unannotated expression), and then clone any annotations (see the sketch
  below).

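A minimal sketch of that body/annotation pattern follows. It is illustrative only - the real
transformer base classes in ecl/hql/hqltrans.ipp use a more efficient caching mechanism than a map
and handle many more details; transformNode() here stands in for whatever rewrite rule a particular
transformer implements::

    #include <unordered_map>
    #include "hqlexpr.hpp"

    // Hypothetical rewrite rule: given a body and its transformed children,
    // return the (linked) replacement node.
    extern IHqlExpression * transformNode(IHqlExpression * body, HqlExprArray & children);

    typedef std::unordered_map<IHqlExpression *, LinkedHqlExpr> TransformedMap;

    static IHqlExpression * transform(IHqlExpression * expr, TransformedMap & done)
    {
        IHqlExpression * body = expr->queryBody();
        if (expr != body)
        {
            // Transform the unannotated body, then clone the annotation onto
            // the result, so differently-annotated uses stay commoned up.
            OwnedHqlExpr newBody = transform(body, done);
            return expr->cloneAnnotation(newBody);
        }

        // Each node is transformed exactly once - walk the graph as a graph.
        auto match = done.find(body);
        if (match != done.end())
            return match->second.getLink();

        HqlExprArray children;
        ForEachChild(i, body)
            children.append(*transform(body->queryChild(i), done));
        OwnedHqlExpr result = transformNode(body, children);
        done[body].set(result);
        return result.getClear();
    }
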
Some examples of the work done by transformations are:

* Constant folding.
* Expanding function calls.
* Walking the graph and reporting warnings.
* Optimizing the order of activities and removing redundant ones.
* Reducing the fields flowing through the generated graph.
* Spotting common sub-expressions.
* Calculating the best location to evaluate an expression (e.g., globally instead of in a child query).
* Many, many others.

Some more details on the individual transforms are given below.

**********
Key Stages
**********

Parsing
=======

The first job of eclcc is to parse the ECL into an expression graph. The ECL can come
from various different sources (archive, source files, remote repository). The details are hidden
behind the IEclSource/IEclSourceCollection interfaces. The createRepository() function is then used
to resolve and parse the various source files on demand.

Several things occur while the ECL is being parsed:

* Function definitions are expanded inline.

  This is slightly unusual behaviour. It means that the expression tree is a fully expanded
  expression - which is better suited to processing and optimizing.

* Some limited constant folding occurs.

  When a function is expanded, often some of the test conditions become always true/false. To reduce
  the later transformations the condition may be folded early on.

* When a symbol is referenced from another module this will recursively cause the ECL for that module
  (or the definition within that module) to be parsed.

* Currently the semantic checking is done as the ECL is parsed.

  If we are going to fully support template functions and delayed expansion of functions this will
  probably have to change, so that a syntax tree is built first and the semantic checking is
  done later.

Normalizing
===========

There are various problems with the expression graph that comes out of the parser:

* Records can have values as children (e.g., { myField := infield.value }), but it causes chaos if
  record definitions can change while other transformations are going on. So the normalization
  removes values from fields.
* Some activities use records to define the values that output records should contain (e.g., TABLE).
  These are now converted to another form (e.g., no_newusertable).
* Sometimes expressions have multiple definition names. Symbols and annotations are rationalized and
  commoned up, to aid commoning up other expressions.
* Some PATTERN definitions are recursive by name. They are resolved to a form that works if all
  symbols are removed.
* The CASE/MAP representation for a dataset/action is awkward for the transforms to process. They
  are converted to nested IFs. (At some point a different representation might be a good idea.)
* EVALUATE is a weird syntax. Instances are replaced with equivalent code which is much easier to
  process subsequently.
* The datasets used in index definitions are primarily there to provide details of the fields. The
  dataset itself may be very complex and may not actually be used. The dataset input to an index is
  replaced with a dummy "null" dataset to avoid unnecessary graph transforming, and to avoid
  introducing any additional incorrect dependencies.

Scope checking
==============

Generally if you use LEFT/RIGHT then the input rows are going to be available wherever they are
used. However, if they are passed into a function, and that function uses them inside a definition
marked as global, then that is invalid (since by definition global expressions don't have any
context).

Similarly, if you use the syntax <dataset>.<field>, its validity and meaning depend on whether
<dataset> is active. The scope transformer ensures that all references to fields are legal, and adds
a "new" attribute to any no_selects where it is necessary.

Constant folding: foldHqlExpression
===================================

This transform simplifies the expression tree. Its aim is to simplify scalar expressions, and
dataset expressions that are valid whether or not the nodes are shared. Some examples are:

* 1 + 2 => 3, and any other operation on scalar constants.
* IF(true, x, y) => x
* COUNT(<empty-dataset>) => 0
* IF(a = b, 'c', 'd') = 'd' => IF(a = b, false, true) => a != b
* Simplifying sorts, projects and filters on empty datasets.

Most of the optimizations are fairly standard, but a few have been added to cover more esoteric
examples which have occurred in queries over the years.

This transform also supports the option to percolate constants through the graph. E.g., if a project
assigns the value 3 to a field, it can substitute the value 3 wherever that field is used in
subsequent activities. This can often lead to further opportunities for constant folding (and
removing fields in the implicit project).

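If you need to fold an expression from your own code, the entry point is declared in
ecl/hql/hqlfold.hpp and is used broadly as follows (a sketch using the simplest overload; other
overloads take an error receiver and folding options)::

    #include "hqlfold.hpp"

    // Returns a new link-counted expression; unchanged nodes are shared
    // with the original graph.
    OwnedHqlExpr folded = foldHqlExpression(expr);
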
Expression optimizer: optimizeHqlExpression
===========================================

This transformer is used to simplify, combine and reorder dataset expressions. The transformer takes
care to count the number of times each expression is used, to ensure that none of the transformations
cause duplication. E.g., swapping a filter with a sort is a good idea, but if there are two filters
of the same sort and they are both swapped you will now be duplicating the sort.

Some examples of the optimizations include:

* COUNT(SORT(x)) => COUNT(x)
* Moving filters over projects, joins, sorts.
* Combining adjacent projects, and projects with joins.
* Removing redundant sorts or distributes.
* Moving filters from JOINs to their inputs.
* Combining activities, e.g., CHOOSEN(SORT(x)) => TOPN(x)
* Sometimes moving filters into IFs.
* Expanding out a field selected from a single-row dataset.
* Combining filters and projects into compound disk read operations.

Implicit project: insertImplicitProjects
========================================

ECL tends to be written as general purpose definitions which can then be combined. This can lead to
potential inefficiencies - e.g., one definition may summarise some data in 20 different ways, and is
then used by another definition which only uses a subset of those results. The implicit project
transformer tracks the data flow at each point through the expression graph, and removes any fields
that are not required.

This often works in combination with the other optimizations. For instance, the constant percolation
can remove the need for fields, and removing fields can sometimes allow a left outer join to be
converted to a project.

*********
Workunits
*********

.. (Is "workunit" the correct term here, or should it be a query? This section should really be independent of this document.)

The code generator ultimately creates workunits. A workunit completely describes a generated query.
It consists of two parts. There is an xml component - this contains the workflow information, the
various execution graphs, and information about options. It also describes which inputs can be
supplied to the query and what results are generated. The other part is the generated shared object,
compiled from the generated C++. This contains functions and classes that are used by the engines to
execute the queries. Often the xml is compressed and stored as a resource within the shared object -
so the shared object contains a complete workunit.

Workflow
========

The actions in a workunit are divided up into individual workflow items. Details of when each
workflow item is executed, and what its dependencies are, are stored in the <Workflow> section of the
xml. The generated code also contains a class definition, with a method perform() which is used to
execute the actions associated with a particular workflow item. (The class instances are created by
calling the exported createProcess() factory function.)

The generated code for an individual workflow item will typically call back into the engine at some
point to execute a graph.

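The generated class has roughly the following shape (a hand-written approximation rather than actual
eclcc output; the interfaces involved are declared in the rtl/include headers)::

    // perform() is called by the engine with the id of the workflow item
    // whose actions should be executed.
    struct MyEclProcess : public EclProcess
    {
        virtual int perform(IGlobalCodeContext * gctx, unsigned wfid)
        {
            ICodeContext * ctx = gctx->queryCodeContext();
            switch (wfid)
            {
            case 1:
                // Call back into the engine to execute an activity graph.
                ctx->executeGraph("graph1", false, 0, NULL);
                break;
            }
            return 1;
        }
    };

    // Exported factory function used by the engines to create an instance.
    extern "C" IEclProcess * createProcess() { return new MyEclProcess; }
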
Graph
=====

The activity graphs are stored in the xml. The graph contains details of which activities are
required, how those activities link together, and what dependencies there are between the activities.
For each activity it might contain the following information:

* A unique id.
* The "kind" of the activity (from enum ThorActivityKind in eclhelper.hpp).
* The ECL that created the activity.
* The name of the original definition.
* The location (e.g., file, line number) of the original ECL.
* Information about the record size, number of rows, sort order, etc.
* Hints which control options for a particular activity (e.g., the number of threads to use while sorting).
* Record counts and stats once the job has executed.

Each activity in a graph also has a corresponding helper class instance in the generated code. (The
name of the class is cAc followed by the activity number, and the exported factory method is fAc
followed by the activity number.) These classes implement the interfaces defined in eclhelper.hpp.

The engine uses the information from the xml to produce a graph of activities that need to be
executed. It has a general purpose implementation of each activity kind, and it uses the class
instance to tailor that general activity to the specific use - e.g., what is the filter condition,
which fields are set up, what is the sort order?

Inputs and Results
==================

The workunit xml contains details of what inputs can be supplied when that workunit is run. These
correspond to STORED definitions in the ECL. The xml also contains the schema for the results
that the workunit will generate.

Once an instance of the workunit has been run, the values of the results may be written back into
dali's copy of the workunit so they can be retrieved and displayed.

Generated code
==============

Aims for the generated C++ code:

* Minimal include dependencies.

  Compile time is an issue - especially for small on-demand queries. To help reduce compile times
  (and dependencies on the rest of the system) the number of header files included by the generated
  code is kept to a minimum. In particular, references to jlib, boost and icu are kept within the
  implementation of the runtime functions, and are not included in the public dependencies.

* Thread-safe.

  It should be possible to use the members of an activity helper from multiple threads without
  issue. The helpers may contain some context dependent state, so different instances of the helpers
  are needed for concurrent use from different contexts (e.g., expansions of a graph).

* Concise.

  The code should be broadly readable, but the variable names etc. are chosen to generate compact
  code.

* Functional.

  Generally the generated code assigns to a variable once, and doesn't modify it afterwards. Some
  assignments may be conditional, but once the variable is evaluated it isn't updated. (There are of
  course a few exceptions - e.g., dataset iterators.)

**********************
Implementation details
**********************

First a few pointers to help understand the code within eclcc:

* It makes extensive use of link counting. You need to understand that concept to get very far.
* If something is done more than once then it is generally split into a helper function.

  The helper functions aren't generally added to the corresponding interface (e.g., IHqlExpression)
  because the interface would become bloated. Instead they are added as global functions. The big
  disadvantage of this approach is they can be hard to find. Even better would be for them to be
  rationalised and organised into namespaces.

* The code is generally thread-safe unless there would be a significant performance implication. In
  general, all the code used by the parser for creating expressions is thread-safe. Expression
  graph transforms are thread-safe, and can execute in parallel if a constant
  (NUM_PARALLEL_TRANSFORMS) is increased. The data structures used to represent the generated code
  are NOT thread-safe.
* Much of the code generation is structured fairly procedurally, with classes used to process the
  stages within it.
* There is a giant "God" class, HqlCppTranslator, which could really do with refactoring.

Parser
======

The eclcc parser uses the standard tools bison and flex to process the ECL and convert it to an
expression graph. There are a couple of idiosyncrasies with the way it is implemented.

* Macros with fully qualified scope.

  Slightly unusually, macros are defined in the same way as other definitions - in particular you
  can have references to macros in other modules. This means that there are references to macros
  within the grammar file (instead of macros being handled purely by a pre-processor). It also means
  the lexer keeps an active stack of macros being processed.

* Attributes on operators.

  Many of the operators have optional attributes (e.g., KEEP, INNER, LOCAL, ...). If these were all
  reserved words it would remove a significant number of keywords from use as symbols, and could also
  mean that when a new attribute was added it broke existing code. To avoid this the lexer looks
  ahead in the parser tables (by following the potential reductions) to see if the token really could
  come next. If it can't then it isn't reserved as a symbol.

**************
Generated code
**************

As the workunit is created the code generator builds up the generated code and the xml for the
workunit. Most of the xml generation is encapsulated within the IWorkUnit interface. The xml for
the graphs is created in an IPropertyTree, and added to the workunit as a block.

C++ Output structures
=====================

The C++ generation is ultimately controlled by some template files (thortpl.cpp). The templates are
plain text and contain references which allow named sections of code to be expanded at particular
points.

The code generator builds up some structures in memory for each of those named sections. Once the
generation is complete, some peephole optimization is applied to the code, and the structure is then
walked to expand each named section of code as required.

The BuildCtx class provides a cursor into the generated C++. It will either be created for a given
named section, or more typically from another BuildCtx. It has methods for adding the different
types of statements. Some are simple (e.g., addExpr()), whilst some create a compound statement
(e.g., addFilter()). The compound statements change the active selector, so any new statements are
added within that compound statement.

As well as building up a tree of expressions, this data structure also maintains a tree of
associations. For instance, when a value is evaluated and assigned to a temporary variable, the
logical value is associated with that temporary. If the same expression is required later, the
association is matched, and the temporary value is used instead of recalculating it. The
associations are also used to track the active datasets, classes generated for row-meta information,
activity classes, etc.

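Schematically, generating a conditional assignment might look like this (a sketch only: addFilter()
is the method mentioned above, and buildExpr()/buildExprAssign() are the translator methods described
under "Building expressions" below; the real code paths handle many more cases)::

    // Generate roughly:  if (<cond>) { <target> = <value>; }
    void buildConditionalAssign(HqlCppTranslator & translator, BuildCtx & ctx,
                                IHqlExpression * cond,
                                const CHqlBoundTarget & target,
                                IHqlExpression * value)
    {
        CHqlBoundExpr boundCond;
        translator.buildExpr(ctx, cond, boundCond);  // C++ form of the condition

        // addFilter() creates a compound if-statement; statements subsequently
        // added through subctx are emitted inside its body.
        BuildCtx subctx(ctx);
        subctx.addFilter(boundCond.expr);
        translator.buildExprAssign(subctx, target, value);
    }
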
Activity Helper
===============

Each activity in an expression graph will have an associated class generated in the C++. Each
different activity kind expects a helper that implements a particular IHThorArg interface. E.g., a
sort activity of kind TAKsort requires a helper that implements IHThorSortArg. The associated
factory function is used to create instances of the helper class.

The generated class takes one of two forms:

* A parameterised version of a library class. These are generated for simple helpers that don't have
  many variations (e.g., CLibrarySplitArg for TAKsplit), or for special cases that occur very
  frequently (CLibraryWorkUnitReadArg for internal results).
* A class derived from a skeleton implementation of that helper (typically CThorXYZ implementing
  interface IHThorXYZ). The base class has default implementations of some of the functions, and any
  exceptions are implemented in the derived class (see the sketch below).

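For the second form, the generated code looks roughly like this (a hand-written approximation using
the cAc/fAc naming convention described in the Graph section; the exact methods generated depend on
the activity kind)::

    #include <string.h>
    #include "eclhelper.hpp"        // IHThorArg, ICompare, ...
    #include "eclhelper_base.hpp"   // skeleton CThorXYZ implementations

    // Comparison object generated for the sort activity.
    struct cCmp3 : public ICompare
    {
        virtual int docompare(const void * left, const void * right) const
        {
            return memcmp(left, right, 4);  // real output compares field by field
        }
    } cmp3;

    // Helper for activity 3 (kind TAKsort): derives from the skeleton
    // CThorSortArg and overrides only what differs from the defaults.
    struct cAc3 : public CThorSortArg
    {
        virtual ICompare * queryCompare() { return &cmp3; }
    };

    // Exported factory: the engine creates helper instances through this.
    extern "C" IHThorArg * fAc3() { return new cAc3; }
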
Meta helper
===========

This is a class that is used by the engines to encapsulate all the information about a single row -
e.g., the format that each activity generates. It is an implementation of the IOutputMeta
interface. It includes functions to:

* Return the size of the row.
* Serialize and deserialize rows to and from disk.
* Destroy and clean up row instances.
* Convert rows to xml.
* Provide information about the contained fields.

Building expressions
====================

The same expression nodes are used for representing expressions in the generated C++ as in the
original ECL expression graph. It is important to keep track of whether an expression represents
untranslated ECL or the "translated" C++. For instance, ECL has 1-based indexes, while C++ is zero
based. If you processed the expression x[1] it might get translated to x[0] in C++. Translating it
again would incorrectly refer to x[-1].

There are two key classes used while building the C++ for an ECL expression:

CHqlBoundExpr
    This represents a value that has been converted to C++. Depending on the type, one or more of the
    fields will be filled in.

CHqlBoundTarget
    This represents the target of an assignment - the C++ variable(s) that are going to be assigned
    the result of evaluating an expression. It is almost always passed as a const parameter to a
    function, because the target is well-defined and the function needs to update that target.

A C++ expression is sometimes converted back to an ECL pseudo-expression by calling
getTranslatedExpr(). This creates an expression node of kind no_translated to indicate that the child
expression has already been converted.

Scalar expressions
------------------

The generation code for expressions has a hierarchy of calls. Each function is there to allow
optimal code to be generated - e.g., not creating a temporary variable if none is required. A
typical flow might be:

* buildExpr(ctx, expr, bound)

  Evaluate the ECL expression "expr" and save the C++ representation in the class "bound". This might
  then call through to...

* buildTempExpr(ctx, expr, bound)

  Create a temporary variable, evaluate expr, and assign the result to that temporary variable...
  which then calls...

* buildExprAssign(ctx, target, expr)

  Evaluate the expression, and ensure it is assigned to the C++ target "target".
  The default implementation might be to call buildExpr...

An operator must either be implemented in buildExpr() (calling a function called doBuildExprXXX) or
in buildExprAssign() (calling a function called doBuildAssignXXX). Some operators are implemented in
both places if there are different implementations that would be more efficient in each context.

Similarly there are several different assignment functions:

* buildAssign(ctx, <ecl-target>, <ecl-value>)
* buildExprAssign(ctx, <c++-target>, <ecl-value>)
* assign(ctx, <c++-target>, <c++-source>)

The different varieties exist because either the source value or the target may already have been
translated. (The names could be rationalised!)

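Schematically, the top of buildExpr() is a large switch over the operator, dispatching to the
doBuildExprXXX functions (a much-simplified sketch of the code in ecl/hqlcpp; the handler names
follow the convention above, and doBuildExprArith is a hypothetical stand-in)::

    void HqlCppTranslator::buildExpr(BuildCtx & ctx, IHqlExpression * expr, CHqlBoundExpr & tgt)
    {
        switch (expr->getOperator())
        {
        case no_constant:
            doBuildExprConstant(ctx, expr, tgt);  // translate the constant directly
            return;
        case no_add:
            doBuildExprArith(ctx, expr, tgt);     // hypothetical arithmetic handler
            return;
        // ... several hundred other operators ...
        default:
            // Operators only implemented via the assign path are evaluated
            // into a temporary variable instead.
            buildTempExpr(ctx, expr, tgt);
            return;
        }
    }
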
Datasets
--------

Most dataset operations are only implemented as activities (e.g., PARSE, DEDUP). If these are used
within a transform/filter then eclcc will generate a call to a child query, and an activity helper
for the appropriate operation will be generated.

However, a subset of the dataset operations can also be evaluated inline without calling a child
query. Some examples are filters, projects, and simple aggregation. This removes the overhead of the
child query call in the simple cases, and often generates more concise code.

When datasets are evaluated inline there is a similar hierarchy of function calls:

* buildDatasetAssign(ctx, target, expr)

  Evaluate the dataset expression, and assign it to the target (a builder interface).
  This may then call...

* buildIterate(ctx, expr)

  Iterate through each of the rows in the dataset expression in turn.
  Which may then call...

* buildDataset(ctx, expr, target, format)

  Build the entire dataset, and return it as a single value.

Some of the operations (e.g., aggregating a filtered dataset) can be done more efficiently by summing
and filtering an iterator than by forcing the filtered dataset to be evaluated first.

Dataset cursors
---------------

The IHqlCppDatasetCursor interface allows the code generator to iterate through a dataset, or select
a particular element from a dataset. It is used to hide the different representations of datasets,
e.g.:

* Blocked - the rows are in a contiguous block of memory, appended one after another.
* Array - the dataset is represented by an array of pointers to the individual rows.
* Link counted - similar to array, but each element is also link counted.
* Nested - sometimes the cursor may iterate through multiple levels of child datasets.

Generally, rows that are serialized (e.g., on disk) are in blocked format, and they are stored as
link counted rows in memory.

Field access classes
--------------------

The IReferenceSelector interface and the classes in hqltcppc[2] provide an interface for getting and
setting values within a row of a dataset. They hide the details of the layout - e.g., csv/xml/raw
data - and the details of exactly how each type is represented in the row.

Key filepos weirdness
---------------------

The current implementation of keys in HPCC uses a format with a separate 8-byte integer field
which was historically used to store the file position in the original file. Other complications are
that the integer fields are stored big-endian, and signed integer values are biased.

This introduces some complication in the way indexes are handled. You will often find that the
logical index definition is replaced with a physical index definition, followed by a project to
convert it to the logical view. A similar process occurs for disk files, to support
VIRTUAL(FILEPOSITION) etc.

***********
Source code
***********

The following are the main directories used by the ecl compiler.

+------------------+-----------------------------------------------------------------------------+
| Directory        | Contents                                                                    |
+==================+=============================================================================+
| rtl/eclrtpl      | Template text files used to generate the C++ code                          |
+------------------+-----------------------------------------------------------------------------+
| rtl/include      | Headers that declare interfaces implemented by the generated code          |
+------------------+-----------------------------------------------------------------------------+
| common/deftype   | Interfaces and classes for scalar types and values                         |
+------------------+-----------------------------------------------------------------------------+
| common/workunit  | Code for managing the representation of a work unit                        |
+------------------+-----------------------------------------------------------------------------+
| ecl/hql          | Classes and interfaces for parsing and representing an ECL expression graph|
+------------------+-----------------------------------------------------------------------------+
| ecl/hqlcpp       | Classes for converting an expression graph to a work unit (and C++)        |
+------------------+-----------------------------------------------------------------------------+
| ecl/eclcc        | The executable which ties everything together                              |
+------------------+-----------------------------------------------------------------------------+

**********
Challenges
**********

From declarative to imperative
==============================

As mentioned at the start of this document, one of the main challenges with eclcc is converting the
declarative ECL code into imperative C++ code. The direction we are heading in is to allow the
engines to support more lazy evaluation - so possibly, in the example given earlier, evaluating the
expression the first time it is used (although that may potentially be much less efficient). This
will allow the code generator to relax some of its current assumptions.

There are several example queries which already produce pathological behaviour from eclcc,
causing it to generate C++ functions which are many thousands of lines long.

The parser
==========

Currently the grammar for the parser is too specialised. In particular, the separate productions for
expressions, datasets and actions cause problems - e.g., it is impossible to properly allow sets of
datasets to be treated in the same way as other sets.

The semantic checking (and probably the semantic interpretation) is done too early. Really the parser
should build up a syntax tree, and then disambiguate it and perform the semantic checks on the syntax
tree.

The function calls should probably be expanded later than they are. I have tried this in the past and
hit problems, but I can't remember all the details. Some were related to the semantic checking.