- ====================
- Eclcc/Code generator
- ====================
- ************
- Introduction
- ************
- Purpose
- ========
- The primary purpose of the code generator is to take an ECL query and convert it into a work unit
- that is suitable for execution by one of the engines.
- Aims
- ====
- The code generator has to do its job accurately. If the code generator does not correctly map the
- ECL to the workunit it can lead to corrupt data and invalid results. Problems like that can often be
- very hard and frustrating for the ECL users to track down.
- There is also a strong emphasis on generating output that is as good as possible. Eclcc contains
- many different optimization stages, and is extensible to allow others to be easily added.
- Eclcc needs to be able to cope with reasonably large jobs. Queries that contain several megabytes of
- ECL, generate tens of thousands of activities, and produce tens of megabytes of C++ are routine.
- These queries need to be processed relatively quickly.
- Key ideas
- =========
- Nearly all the processing of ECL is done using an expression graph. The representation of the
- expression graph has some particular characteristics:
- * Once the nodes in the expression graph have been created they are NEVER modified.
- * Nodes in the expression graph are ALWAYS commoned up if they are identical.
- * Each node in the expression graph is link counted (see below) to track its lifetime.
- * If a modified graph is required, a new graph is created (sharing nodes from the old one).
- The ECL language is a declarative language, and in general is assumed to be pure - i.e. there are no
- side-effects, expressions can be evaluated lazily and re-evaluating an expression causes no
- problems. This allows eclcc to transform the graph in lots of interesting ways. (Life is never that
- simple so there are mechanisms for handling the exceptions.)
- From declarative to imperative
- ==============================
- One of the main challenges with eclcc is converting the declarative ECL code into imperative C++
- code. One key problem is that it needs to ensure that code is evaluated only when it is required,
- but also that it is evaluated only once. It isn't always possible to satisfy both constraints - for
- example, a global dataset expression used within a child query. Should it be evaluated once before
- the activity containing the child query is called, or each time the child query is called? If it is
- evaluated on demand then it may not be evaluated as efficiently...
- This issue complicates many of the optimizations and transformations that are done to the queries.
- Long term the plan is to allow the engines to support more delayed lazy-evaluation, so that whether
- something is evaluated is more dynamic rather than static.
- Flow of processing
- ==================
- The idealised view of the processing within eclcc consists of the following stages:
- * Parse the ECL into an expression graph.
- * Expand out function calls.
- * Normalize the expression graph so it is in a consistent format.
- * Normalize the references to fields within datasets to tie them up with their scopes.
- * Apply various global optimizations.
- * Translate logical operations into the activities that will implement them.
- * Resource and generate the global graph.
- * For each activity, resource, optimize and generate its child graphs.
- In practice the progression is not so clear cut. There tends to be some overlap between the
- different stages, and some of them may occur in slightly different orders. However the order broadly
- holds.
- *****************************
- Working on the code generator
- *****************************
- The regression suite
- ====================
- Before any change is accepted for the code generator it is always run against several regression suites to ensure that
- it doesn't introduce any problems, and that the change has the desired effect. There are several different regression suites:
- * testing/regress/ecl - The run time regression suite.
- * ecl/regress - a compiler regression suite. This contains tests that cannot be run, and tests that should generate errors.
- * LN private suite - This contains a large selection (>10GB) of archived queries. They contain proprietary code so unfortunately cannot be released as open source.
- The ecl/regress directory contains a script 'regress.sh' that is used for running the regression tests. It should be
- executed in the directory containing the ECL files. The script generates the C++ code (and workunits) for each of the source
- files to a target directory, and then executes a comparison program to compare the new results with a previous "golden"
- reference set.
- Before making any changes to the compiler, a reference set should be created by running the regression script and copying the
- generated files to the reference directory.
- Here is a sample command line:
- ``~/dev/hpcc/ecl/regress/regress.sh -t /regress/hpcc -e /home/<user>/buildr/Release/bin/eclcc -I /home/<user>/dev/hpcc/ecl/regress/modules -I /home/<user>/dev/hpcc/plugins/javaembed -I /home/<user>/dev/hpcc/plugins/v8embed -c /regress/hpcc.master -d bcompare``
- (A version of this command resides in a shell script in each of my regression suite directories, with the -t and -c options adapted for each suite.)
- For a full list of options execute the script with no parameters, or take a look at the script itself. A couple of useful options are:
- * The script can be run on a single file by using the -q option.
- * The -e option selects the path of the eclcc executable. This is particularly useful when running from the build
- directory (see below), or when using multiple build directories to compare behaviour between different versions.
- We strongly recommend using a comparison program which allows rules to be defined to ignore certain differences (e.g., Beyond Compare).
- Running directly from the build directory
- =========================================
- It is much quicker to run eclcc directly from the build directory, rather than deploying a system and running eclcc
- from there. To do this you need to configure some options that eclcc requires, e.g. where the include files are found. The
- options can be set by either setting environment variables or by specifying options in an eclcc.ini
- file. The following are the names of the different options:
- +-----------------------+-------------------+
- | Environment flag | Ini file option |
- +=======================+===================+
- | CL_PATH | compilerPath |
- +-----------------------+-------------------+
- | ECLCC_LIBRARY_PATH | libraryPath |
- +-----------------------+-------------------+
- | ECLCC_INCLUDE_PATH | includePath |
- +-----------------------+-------------------+
- | ECLCC_PLUGIN_PATH | plugins |
- +-----------------------+-------------------+
- | HPCC_FILEHOOKS_PATH | filehooks |
- +-----------------------+-------------------+
- | ECLCC_TPL_PATH | templatePath |
- +-----------------------+-------------------+
- | ECLCC_ECLLIBRARY_PATH | eclLibrariesPath |
- +-----------------------+-------------------+
- | ECLCC_ECLBUNDLE_PATH | eclBundlesPath |
- +-----------------------+-------------------+
- The eclcc.ini can either be a file in the local directory, or specified on the eclcc command line with -specs.
- Including the settings in a local eclcc.ini file also makes it easy to debug eclcc directly from the build
- directory within the Eclipse environment.
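- For example, a minimal eclcc.ini for running from a build directory might look like the following.
- (The paths are purely illustrative - adjust them to your own source and build trees; the option names
- are those listed in the table above.)::
- 
-     compilerPath=/usr
-     includePath=/home/<user>/dev/hpcc/rtl/include
-     templatePath=/home/<user>/dev/hpcc/rtl/eclrtpl
-     libraryPath=/home/<user>/build/Release/libs
-     eclLibrariesPath=/home/<user>/dev/hpcc/ecllibrary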
- Hints and tips
- ==============
- * Logging
- There is an option for eclcc to output a logging file, and another to specify the level of detail in that logging
- file. If the detail level is above 500 then the expression tree for the query is output to the logging file after
- each of the code transformations. The tracing is very useful for tracking down at which stage inconsistencies are
- introduced in the expression graph, and also for learning how each transformation affects the query.
- The output format defaults to ECL - which is regenerated from the expression tree. (This ECL cannot generally be
- compiled without editing - partly because it contains extra annotations.) Use either of the following:
- ``eclcc myfile.ecl --logfile myfile.log --logdetail 999``
- ``regress.sh -q myfile.ecl -l myfile.log``
- * -ftraceIR
- There is a debug option (-ftraceIR) that generates an intermediate representation of the expression graph rather than
- regenerating ECL. The output tends to be less compact and harder to read quickly, but has the advantage of being
- better structured, and contains more details of the internal representation. ecl/hql/hqlir.cpp contains
- more details of the format.
- * Adding extra logging into the source code
- If you want to add tracing of expressions at any point in the code generation then adding either of the following
- calls will include the expression details in the log file:
- ``dbglogExpr(expr); // regenerate the ecl for an expression. See other functions in ecl/hql/hqlthql.hpp``
- ``EclIR::dbglogIR(expr); // regenerate the IR for an expression. See other functions in ecl/hql/hqlir.hpp``
- * Logging while debugging
- If you are debugging inside gdb it is often useful to be able to dump out details of an expression. Calling
- EclIR::dump_ir(expr); will generate the IR to stdout.
- ``p EclIR::dump_ir(expr)``
- The function can also be used with multiple parameters. Each expression will be dumped out, but common child nodes
- will only be generated once. This can be very useful when trying to determine the difference between two expressions.
- The quickest way is to call ``EclIR::dump_ir(expr1, expr2)``. The first difference between the expressions will
- be the expression that follows the first "return".
- * Expression sequence ids.
- Sometimes it can be hard to determine where a particular IHqlExpression node was created. If that is the case, then
- defining ``DEBUG_TRACK_INSTANCEID`` (in ecl/hql/hqlexpr.ipp) will add a unique sequence number to each IHqlExpression
- that is created. There is also a function checkSeqId() at the start of ecl/hql/hqlexpr.cpp which is called whenever
- an expression is created, linked, released, etc. Setting a breakpoint in that function can allow you to trace back
- exactly when and why a particular node was created.
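- For example, a hypothetical gdb session might look like the following. (The parameter name used in
- the condition is an assumption - check the actual signature of checkSeqId() in ecl/hql/hqlexpr.cpp.)::
- 
-     (gdb) break checkSeqId
-     (gdb) condition 1 seqid == 1234
-     (gdb) run myfile.ecl --logfile myfile.log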
- ***********
- Expressions
- ***********
- Expression Graph representation
- ===============================
- The key data structure within eclcc is the graph representation. The design has some key elements.
- * Once a node is created it is never modified.
- Some derived information (e.g., sort order, number of records, unique hash, ...) might be
- calculated and stored in the class after it has been created, but that doesn't change what the node
- represents in any way.
- Some nodes are created in stages - e.g., records, modules. These nodes are marked as fully
- completed when closeExpr() is called, after which they cannot be modified.
- * Nodes are always commoned up.
- If the same operator has the same arguments and type then there will be a unique IHqlExpression to
- represent it. This helps ensure that graphs stay as graphs and don't get converted to trees. It
- also helps with optimizations, and allows code duplicated in two different contexts to be brought
- together.
- * The nodes are link counted.
- Link counts are used to control the lifetime of the expression objects. Whenever a reference to an
- expression node is held, its link count is increased, and decreased when no longer required. The
- node is freed when there are no more references. (This generally works well, but does give us problems
- with forward references.)
- * The access to the graph is through interfaces.
- The main interfaces are IHqlExpression, IHqlDataset and IHqlScope. They are all defined in
- hqlexpr.hpp. The aim of the interfaces is to hide the implementation of the expression nodes so
- they can be restructured and changed without affecting any other code.
- * The expression classes use interfaces and a type field rather than polymorphism.
- This could be argued to be bad object design... but there are reasons.
- There are more than 500 different possible operators. If a class was created for each of them the
- system would quickly become unwieldy. Instead there are several different classes which model the
- different types of expression (dataset/expression/scope).
- The interfaces contain everything needed to create and interrogate an expression tree, but they do
- not contain functionality for directly processing the graph.
- To avoid some of the shortcomings of type fields there are various mechanisms for accessing derived attributes which avoid interrogating the type field.
- * Memory consumption is critical.
- It is not unusual to have 10M or even 100M nodes in memory as a query is being processed. At that
- scale the memory consumption of each node matters - so great care should be taken when considering
- increasing the size of the objects. The node classes contain a class hierarchy which exists
- purely to reduce the memory consumption - not to reflect the functionality. With no memory
- constraints it wouldn't be there, but removing a single pointer per node can save 1GB of memory
- usage for very complex queries.
- IHqlExpression
- --------------
- This is the interface that is used to walk and interrogate the expression graph once it has been created. Some of the main functions are:
- * getOperator() - what does this node represent? It returns a member of the node_operator enumerated type.
- * numChildren() - how many arguments does the node have?
- * queryChild(unsigned n) - what is the nth child? If the argument is out of range it returns NULL.
- * queryType() - the type of this node.
- * queryBody() - used to skip annotations (see below).
- * queryProperty() - does this node have a child which is an attribute that matches a given name? (See below for more about attributes.)
- * queryValue() - for a no_constant node, return the value of the constant. It returns NULL otherwise.
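- As an illustration, here is a minimal sketch of walking a graph using only the functions listed
- above. (The walker itself is invented for the example; the visited set is what keeps the walk linear
- on a shared graph rather than exponential on the equivalent tree - see the section on
- transformations.)::
- 
-     #include <unordered_set>
- 
-     // Visit each distinct node in an expression graph exactly once.
-     static void walkGraph(IHqlExpression * expr, std::unordered_set<IHqlExpression *> & visited)
-     {
-         IHqlExpression * body = expr->queryBody();      // skip any annotations
-         if (!visited.insert(body).second)
-             return;                                     // already visited - never repeat the work
-         // ... e.g., switch on body->getOperator() here ...
-         for (unsigned i = 0; i < body->numChildren(); i++)
-             walkGraph(body->queryChild(i), visited);
-     }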
- The nodes in the expression graph are created through factory functions. Some of the expression types
- have specialised functions - e.g., createDataset, createRow, createDictionary, but scalar expressions
- and actions are normally created with createValue().
- Note: Generally, ownership of the arguments to the createX() functions is assumed to be taken over by
- the newly created node.
- The values of the enumeration constants in node_operator are used to calculate "crcs" which are used
- to check if the ECL for a query matches, and if disk and index record formats match. It contains
- quite a few legacy entries no_unusedXXX which can be used for new operators (otherwise new operators
- must be added to the end).
- IHqlSimpleScope
- ---------------
- This interface is implemented by records, and is used to map names to the fields within the records.
- If a record contains IFBLOCKs then each of the fields in the ifblock is defined in the
- IHqlSimpleScope for the containing record.
- IHqlScope
- ---------
- Normally obtained by calling IHqlExpression::queryScope(). It is primarily used in the parser to
- resolve fields from within modules.
- The ECL is parsed on demand so as the symbol is looked up it may cause a cascade of ECL to be
- compiled. The lookup context (HqlLookupContext) is passed to IHqlScope::lookupSymbol() for several
- reasons:
- * It contains information about the active repository - the source of the ECL which will be dynamically parsed.
- * It contains caches of expanded functions - to avoid repeating expansion transforms.
- * Some members are used for tracking definitions that are read to build dependency graphs, or archives of submitted queries.
- The interface IHqlScope currently has some members that are used for creation; this should be
- refactored and placed in a different interface.
- IHqlDataset
- -----------
- This is normally obtained by calling IHqlExpression::queryDataset(). It has shrunk in size over
- time, and could quite possibly be folded into IHqlExpression with little pain.
- There is a distinction in the code generator between "tables" and "datasets". A table
- (IHqlDataset::queryTable()) is a dataset operation that defines a new output record. Any operation
- that has a transform or record that defines an output record (e.g., PROJECT, TABLE) is a table, whilst
- those that don't (e.g., a filter, a dedup) are not. There are a few apparent exceptions - e.g., IF.
- (This is controlled by definesColumnList(), which returns true if the operator is a table.)
- Properties and attributes
- -------------------------
- There are two related but slightly different concepts. An attribute refers to the explicit flags that
- are added to operators (e.g., LOCAL, KEEP(n) etc. specified in the ECL, or some internal attributes
- added by the code generator). There are a couple of different functions for creating attributes.
- createExtraAttribute() should be used by default. createAttribute() is reserved for an attribute
- that never has any arguments, or in unusual situations where it is important that the arguments are
- never transformed. They are tested using queryAttribute()/hasAttribute() and represented by nodes of
- kind no_attr/no_expr_attr.
- The term "property" refers to computed information (e.g., record counts) that can be derived from the
- operator, its arguments and attributes. They are covered in more detail below.
- Field references
- ================
- Fields can be selected from active rows of a dataset in three main ways:
- * Some operators define LEFT/RIGHT to represent an input or processed dataset. Fields from these
- active rows are referenced with LEFT.<field-name>. Here LEFT or RIGHT is the "selector".
- * Other operators use the input dataset as the selector. E.g., myFile(myFile.id != 0). Here the
- input dataset is the "selector".
- * Often when the input dataset is used as the selector it can be omitted. E.g., myFile(id != 0).
- This is implicitly expanded by the parser to the second form.
- A reference to a field is always represented in the expression graph as a node of kind no_select
- (with createSelectExpr). The first child is the selector, and the second is the field. Needless
- to say there are some complications...
- * LEFT/RIGHT.
- The problem is that the different uses of LEFT/RIGHT need to be disambiguated since there may be
- several different uses of LEFT in a query. This is especially true when operations are executed in
- child queries. LEFT is represented by a node no_left(record, selSeq). Often the record is
- sufficient to disambiguate the uses, but there are situations where it isn't enough. So in
- addition no_left has a child which is a selSeq (selector sequence) which is added as a child
- attribute of the PROJECT or other operator. At parse time it is a function of the input dataset
- that is later normalized to a unique id to reduce the transformation work.
- * Active datasets. It is slightly more complicated - because the dataset used as the selector can
- be any upstream dataset up to the nearest table. So the following ECL code is legal:
- ::
- x := DATASET(...)
- y := x(x.id != 0);
- z := y(x.id != 100);
- Here the reference to x.id in the definition of z is referring to a field in the input dataset.
- Because of these semantics the selector in a normalized tree is actually
- inputDataset->queryNormalizedSelector() rather than inputDataset. This function currently returns the
- table expression node (but it may change in the future - see below).
- Attribute "new"
- ---------------
- In some situations ECL allows child datasets to be treated as a dataset without an explicit
- NORMALIZE. E.g., EXISTS(myDataset.ChildDataset);
- This is primarily to enable efficient aggregates on disk files to be generated, but it adds some
- complications with an expression of the form dataset.childdataset.grandchild. E.g.,::
- EXISTS(dataset(EXISTS(dataset.childdataset.grandchild)))
- Or::
- EXISTS(dataset.childdataset(EXISTS(dataset.childdataset.grandchild)))
- In the first example, dataset.childdataset within dataset.childdataset.grandchild is a reference
- to a dataset that doesn't have an active cursor and needs to be iterated, whilst in the second it
- refers to an active cursor.
- To differentiate between the two, all references to fields within datasets/rows that don't have
- active selectors have an additional attribute ("new") as a child of the select. So a no_select with a
- "new" attribute requires the dataset to be created, one without is a member of an active dataset
- cursor.
- If you have a nested row, the new attribute is added to the selection from the dataset, rather than
- the selection from the nested row. The functions queryDatasetCursor() and querySelectorDataset()
- are used to help interpret the meaning.
- (An alternative would be to use a different node from no_select - possibly this should be considered
- - it would be more space efficient.)
- The expression graph generated by the ECL parser doesn't contain any new attributes. These are added
- as one of the first stages of normalizing the expression graph. Any code that works on normalized
- expressions needs to take care to interpret no_selects correctly.
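- A minimal sketch of interpreting a no_select using the functions above. (The helper itself is
- hypothetical, and the assumption that the attribute is exposed as an atom named newAtom should be
- checked against the hql sources.)::
- 
-     // Does this select require the dataset to be created,
-     // rather than selecting from an active cursor?
-     static bool requiresDatasetCreation(IHqlExpression * select)
-     {
-         assertex(select->getOperator() == no_select);
-         return select->hasAttribute(newAtom);    // assumed atom name
-     }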
- Transforming selects
- --------------------
- When an expression graph is transformed and none of the records are changed, the representation of
- LEFT/RIGHT remains the same. This means any no_select nodes in the expression tree will also stay
- the same.
- However, if the transform modifies a table (highly likely) it means that the selector for the second
- form of field selector will also change. Unfortunately this means that transforms often cannot be
- short-circuited.
- Avoiding this could significantly reduce the extent of the graph that needs traversing, and the
- number of nodes replaced in a transformed graph. One possibility is to use a different
- value for dataset->queryNormalizedSelector(), based on a unique id associated with the table. I think it
- would be a good long-term change, but it would require unique ids (similar to the selSeq) to be added
- to all table expressions, and correctly preserved by any optimization.
- Annotations
- ===========
- Sometimes it is useful to add information into the expression graph (e.g., symbol names, position
- information) that doesn't change the meaning, but should be preserved. Annotations allow information
- to be added in this way.
- An annotation's implementation of IHqlExpression generally delegates the majority of the methods
- through to the annotated expression. This means that most code that interrogates the expression
- graph can ignore their presence, which simplifies the caller significantly. However transforms need
- to be careful (see below).
- Information about the annotation can be obtained by calling IHqlExpression::getAnnotationKind() and
- IHqlExpression::queryAnnotation().
- Associated side-effects
- =======================
- In legacy ECL you will see code like the following::
- EXPORT a(x) := FUNCTION
- Y := F(x);
- OUTPUT(Y);
- RETURN G(Y);
- END;
- The assumption is that whenever a(x) is evaluated the value of Y will be output. However that
- doesn't particularly fit in with a declarative expression graph. The code generator creates a
- special node (no_compound) with child(0) as the output action, and child(1) as the value to be
- evaluated (G(Y)).
- If the expression ends up being included in the final query then the action will also be included
- (via the no_compound). At a later stage the action is migrated to a position in the graph where
- actions are normally evaluated.
- Derived properties
- ==================
- There are many pieces of information that it is useful to know about a node in the expression graph - many
- of which would be expensive to recompute each time they were required. Eclcc has several
- mechanisms for caching derived information so it is available efficiently.
- * Boolean flags - getInfoFlags()/getInfoFlags2().
- There are many Boolean attributes of an expression that are useful to know - e.g., is it
- constant, does it have side-effects, does it reference any fields from a dataset etc. etc. The
- bulk of these are calculated and stored in a couple of members of the expression class. They are
- normally retrieved via accessor functions e.g., containsAssertKeyed(IHqlExpression*).
- * Active datasets - gatherTablesUsed().
- It is very common to want to know which datasets an expression references. This information is
- calculated and cached on demand and accessed via the IHqlExpression::gatherTablesUsed() functions.
- There are a couple of other functions IHqlExpression::isIndependentOfScope() and
- IHqlExpression::usesSelector() which provide efficient functions for common uses.
- * Information stored in the type.
- Currently datasets contain information about sort order, distribution and grouping as part of the
- expression type. This information should be accessed through the accessor functions applied to the
- expression (e.g., isGrouped(expr)). At some point in the future it is planned to move this
- information as a general derived property (see next).
- * Other derived properties.
- There is a mechanism (in hqlattr) for calculating and caching an arbitrary derived property of an
- expression. It is currently used for the number of rows, the location-independent representation, the
- maximum record size, etc. There are typically accessor functions to access the cached information (rather
- than calling the underlying IHqlExpression::queryAttribute() function).
- * Helper functions.
- Some information doesn't need to be cached because it isn't expensive to calculate, but rather than
- duplicating the code, a helper function is provided. E.g., queryOriginalRecord() and
- hasUnknownTransform(). They are not part of the interface because the number would make the
- interface unwieldy and they can be completely calculated from the public functions.
- However, it can be very hard to find the function you are looking for, and they would greatly
- benefit from being grouped e.g., into namespaces.
- Transformations
- ===============
- One of the key processes in eclcc is walking and transforming the expression graphs. Both of these
- are covered by the term transformations. One of the key things to bear in mind is that you need to
- walk the expression graph as a graph, not as a tree. If you have already examined a node once you
- shouldn't repeat the work - otherwise the execution time may be exponential with node depth.
- Other things to bear in mind:
- * If a node isn't modified don't create a new one - return a link to the old one.
- * You generally need to walk the graph and gather some information before creating a modified graph.
- Sometimes creating a new graph can be short-circuited if no changes will be required.
- * Sometimes you can be tempted to try to short-circuit transforming part of a graph (e.g., the
- arguments to a dataset activity), but because of the way references to fields within datasets work
- that often doesn't work.
- * If an expression is moved to another place in the graph, you need to be very careful to check if the
- original context was conditional and that the new context is not.
- * The meaning of expressions can be context dependent. E.g., References to active datasets can be
- ambiguous.
- * Never walk the expressions as a tree, always as a graph!
- * Be careful with annotations.
- It is essential that an expression that is used in different contexts with different annotations
- (e.g., two different named symbols) is consistently transformed. Otherwise it is possible for a
- graph to be converted into a tree. E.g.,::
- A := x; B := x; C := A + B;
- must not be converted to::
- A' := x'; B' := x''; C' := A' + B';
- For this reason most transformers will check if expr->queryBody() matches expr, and if not will
- transform the body (the unannotated expression), and then clone any annotations.
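- A sketch of that pattern follows. (transform() stands for whichever transformer is being applied;
- cloneAllAnnotations() is my assumption for the method that re-attaches annotations to a new body -
- check hqlexpr.hpp for the exact name. OwnedHqlExpr is the usual link-counted smart pointer.)::
- 
-     // Transform the unannotated body, then re-attach the annotations, so that
-     // differently-annotated references to one expression still share a node.
-     IHqlExpression * Transformer::transform(IHqlExpression * expr)
-     {
-         IHqlExpression * body = expr->queryBody();
-         if (expr != body)
-         {
-             OwnedHqlExpr newBody = transform(body);
-             return expr->cloneAllAnnotations(newBody);   // assumed helper
-         }
-         return transformBody(expr);                      // hypothetical: the real work
-     }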
- Some examples of the work done by transformations are:
- * Constant folding.
- * Expanding function calls.
- * Walking the graph and reporting warnings.
- * Optimizing the order and removing redundant activities.
- * Reducing the fields flowing through the generated graph.
- * Spotting common sub expressions.
- * Calculating the best location to evaluate an expression (e.g., globally instead of in a child query).
- * Many, many others.
- Some more details on the individual transforms are given below.
- **********
- Key Stages
- **********
- Parsing
- ========
- The first job of eclcc is to parse the ECL into an expression graph. The source for the ECL can come
- from various different sources (archive, source files, remote repository). The details are hidden
- behind the IEclSource/IEclSourceCollection interfaces. The createRepository() function is then used
- to resolve and parse the various source files on demand.
- Several things occur while the ECL is being parsed:
- * Function definitions are expanded inline.
- A slightly unusual behaviour. It means that the expression tree is a fully expanded expression -
- which is better suited to processing and optimizing.
- * Some limited constant folding occurs.
- When a function is expanded, often it means that some of the
- test conditions are always true/false. To reduce the transformations the condition may be folded
- early on.
- * When a symbol is referenced from another module this will recursively cause the ECL for that module
- (or definition within that module) to be parsed.
- * Currently the semantic checking is done as the ECL is parsed.
- If we are going to fully support template functions and delayed expansion of functions this will
- probably have to change so that a syntax tree is built first, and then the semantic checking is
- done later.
- Normalizing
- ===========
- There are various problems with the expression graph that comes out of the parser:
- * Records can have values as children (e.g., { myField := infield.value }), but it causes chaos if
- record definitions can change while other transformations are going on. So the normalization
- removes values from fields.
- * Some activities use records to define the values that output records should contain (e.g., TABLE).
- These are now converted to another form (e.g., no_newusertable).
- * Sometimes expressions have multiple definition names. Symbols and annotations are rationalized and
- commoned up to aid commoning up other expressions.
- * Some PATTERN definitions are recursive by name. They are resolved to a form that works if all
- symbols are removed.
- * The CASE/MAP representation for a dataset/action is awkward for the transforms to process. They
- are converted to nested IFs.
- (At some point a different representation might be a good idea.)
- * EVALUATE is a weird syntax. Instances are replaced with equivalent code which is much easier to
- subsequently process.
- * The datasets used in index definitions are primarily there to provide details of the fields. The
- dataset itself may be very complex and may not actually be used. The dataset input to an index is
- replaced with a dummy "null" dataset to avoid unnecessary graph transforming, and avoid introducing
- any additional incorrect dependencies.
- Scope checking
- ==============
- Generally if you use LEFT/RIGHT then the input rows are going to be available wherever they are
- used. However if they are passed into a function, and that function uses them inside a definition
- marked as global then that is invalid (since by definition global expressions don't have any context).
- Similarly if you use syntax <dataset>.<field>, its validity and meaning depends on whether <dataset>
- is active. The scope transformer ensures that all references to fields are legal, and adds a "new"
- attribute to any no_selects where it is necessary.
- Constant folding: foldHqlExpression
- ===================================
- This transform simplifies the expression tree. Its aim is to simplify scalar expressions, and
- dataset expressions that are valid whether or not the nodes are shared. Some examples are:
- * 1 + 2 => 3 and any other operation on scalar constants.
- * IF(true, x, y) => x
- * COUNT(<empty-dataset>) => 0
- * IF (a = b, 'c', 'd') = 'd' => IF(a=b, false, true) => a != b
- * Simplifying sorts, projects and filters on empty datasets.
- Most of the optimizations are fairly standard, but a few have been added to cover more esoteric
- examples which have occurred in queries over the years.
- This transform also supports the option to percolate constants through the graph. E.g., if a project
- assigns the value 3 to a field, it can substitute the value 3 wherever that field is used in
- subsequent activities. This can often lead to further opportunities for constant folding (and
- removing fields in the implicit project).
- Expression optimizer: optimizeHqlExpression
- ===========================================
- This transformer is used to simplify, combine and reorder dataset expressions. The transformer takes
- care to count the number of times each expression is used to ensure that none of the transformations
- cause duplication. E.g., swapping a filter with a sort is a good idea, but if there are two filters
- of the same sort and they are both swapped you will now be duplicating the sort.
- Some examples of the optimizations include:
- * COUNT(SORT(x)) => COUNT(x)
- * Moving filters over projects, joins, sorts.
- * Combining adjacent projects, projects and joins.
- * Removing redundant sorts or distributes
- * Moving filters from JOINs to their inputs.
- * Combining activities e.g., CHOOSEN(SORT(x)) => TOPN(x)
- * Sometimes moving filters into IFs
- * Expanding out a field selected from a single row dataset.
- * Combining filters and projects into compound disk read operations.
- Implicit project: insertImplicitProjects
- ========================================
- ECL tends to be written as general purpose definitions which can then be combined. This can lead to
- potential inefficiencies - e.g., one definition may summarise some data in 20 different ways, this is
- then used by another definition which only uses a subset of those results. The implicit project
- transformer tracks the data flow at each point through the expression graph, and removes any fields
- that are not required.
- This often works in combination with the other optimizations. For instance the constant percolation
- can remove the need for fields, and removing fields can sometimes allow a left outer join to be
- converted to a project.
- *********
- Workunits
- *********
- The code generator ultimately creates workunits. A workunit completely describes a generated query.
- It consists of two parts. There is an xml component - this contains the workflow information, the
- various execution graphs, and information about options. It also describes which inputs can be
- supplied to the query and what results are generated. The other part is the generated shared object
- compiled from the generated C++. This contains functions and classes that are used by the engines to
- execute the queries. Often the xml is compressed and stored as a resource within the shared object -
- so the shared object contains a complete workunit.
- Workflow
- ========
- The actions in a workunit are divided up into individual workflow items. Details of when each
- workflow item is executed, and what its dependencies are, are stored in the <Workflow> section of the xml.
- The generated code also contains a class definition, with a method perform() which is used to execute
- the actions associated with a particular workflow item. (The class instances are created by calling
- the exported createProcess() factory function).
- The generated code for an individual workflow item will typically call back into the engine at some
- point to execute a graph.
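- The generated code for this is broadly the following shape. (Illustrative only - the real output is
- driven by the templates in rtl/eclrtpl; the interfaces are from eclhelper.hpp as I understand them,
- so the details should be checked.)::
- 
-     // One case per workflow item; the engine supplies the workflow id.
-     struct MyEclProcess : public EclProcess
-     {
-         virtual int perform(IGlobalCodeContext * gctx, unsigned wfid)
-         {
-             ICodeContext * ctx = gctx->queryCodeContext();
-             switch (wfid)
-             {
-             case 1:
-                 ctx->executeGraph("graph1", false, 0, NULL);   // call back into the engine
-                 break;
-             }
-             return 1;
-         }
-     };
-     extern "C" IEclProcess * createProcess() { return new MyEclProcess; }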
- Graph
- =====
- The activity graphs are stored in the xml. The graph contains details of which activities are
- required, how those activities link together, and what dependencies there are between the activities.
- For each activity it might contain the following information:
- * A unique id.
- * The "kind" of the activity (from enum ThorActivityKind in eclhelper.hpp)
- * The ECL that created the activity.
- * Name of the original definition
- * Location (e.g., file, line number) of the original ECL.
- * Information about the record size, number of rows, sort order etc.
- * Hints, which control options for a particular activity (e.g., the number of threads to use while sorting).
- * Record counts and stats once the job has executed.
- Each activity in a graph also has a corresponding helper class instance in the generated code. (The
- name of the class is cAc followed by the activity number, and the exported factory method is fAc
- followed by the activity number.) These classes implement the interfaces defined in eclhelper.hpp.
- The engine uses the information from the xml to produce a graph of activities that need to be
- executed. It has a general purpose implementation of each activity kind, and it uses the class
- instance to tailor that general activity to the specific use e.g., what is the filter condition, what
- fields are set up, what is the sort order?
- Inputs and Results
- ==================
- The workunit xml contains details of what inputs can be supplied when that workunit is run. These
- correspond to STORED definitions in the ECL. The workunit xml also contains the schema for the results
- that the workunit will generate.
- Once an instance of the workunit has been run, the values of the results may be written back into
- Dali's copy of the workunit so they can be retrieved and displayed.
- Generated code
- ==============
- Aims for the generated C++ code:
- * Minimal include dependencies.
- Compile time is an issue - especially for small on-demand queries. To help reduce compile times
- (and dependencies with the rest of the system) the number of header files included by the generated
- code is kept to a minimum. In particular references to jlib, boost and icu are kept within the
- implementation of the runtime functions, and are not included in the public dependencies.
- * Thread-safe.
- It should be possible to use the members of an activity helper from multiple threads without
- issue. The helpers may contain some context dependent state, so different instances of the helpers
- are needed for concurrent use from different contexts (e.g., expansions of a graph).
- * Concise.
- The code should be broadly readable, but the variable names etc. are chosen to generate compact code.
- * Functional.
- Generally the generated code assigns to a variable once, and doesn't modify it afterwards. Some
- assignments may be conditional, but once the variable is evaluated it isn't updated. (There are of
- course a few exceptions - e.g., dataset iterators.)
- **********************
- Implementation details
- **********************
- First a few pointers to help understand the code within eclcc:
- * It makes extensive use of link counting. You need to understand that concept to get very far.
- * If something is done more than once then that is generally split into a helper function.
- The helper functions aren't generally added to the corresponding interface (e.g., IHqlExpression)
- because the interface would become bloated. Instead they are added as global functions. The big
- disadvantage of this approach is they can be hard to find. Even better would be for them to be
- rationalised and organised into namespaces.
- * The code is generally thread-safe unless there would be a significant performance implication. In
- general, all the code used by the parser for creating expressions is thread-safe. Expression
- graph transforms are thread-safe, and can execute in parallel if a constant
- (NUM_PARALLEL_TRANSFORMS) is increased. The data structures used to represent the generated code
- are NOT thread-safe.
- * Much of the code generation is structured fairly procedurally, with classes used to process the
- stages within it.
- * There is a giant "God" class HqlCppTranslator - which could really do with refactoring.
- Parser
- ======
- The eclcc parser uses the standard tools bison and flex to process the ECL and convert it to an
- expression graph. There are a couple of idiosyncrasies in the way it is implemented.
- * Macros with fully qualified scope.
- Slightly unusually, macros are defined in the same way that other definitions are - in particular you
- can have references to macros in other modules. This means that there are references to macros
- within the grammar file (instead of being purely handled by a pre-processor). It also means the
- lexer keeps an active stack of macros being processed.
- * Attributes on operators.
- Many of the operators have optional attributes (e.g., KEEP, INNER, LOCAL, ...). If these were all
- reserved words it would prevent a significant number of words from being used as symbols, and could also
- mean that adding a new attribute broke existing code. To avoid this the lexer looks
- ahead in the parser tables (by following the potential reductions) to see if the token really could
- come next. If it can't then it isn't reserved as a symbol.
- **************
- Generated code
- **************
- As the workunit is created the code generator builds up the generated code and the xml for the
- workunit. Most of the xml generation is encapsulated within the IWorkUnit interface. The xml for
- the graphs is created in an IPropertyTree, and added to the workunit as a block.
- C++ Output structures
- =====================
- The C++ generation is ultimately controlled by some template files (thortpl.cpp). The templates are
- plain text and contain references to allow named sections of code to be expanded at particular points.
- The code generator builds up some structures in memory for each of those named sections. Once the
- generation is complete some peephole optimization is applied to the code. This structure is walked
- to expand each named section of code as required.
- The BuildCtx class provides a cursor into that generated C++. It will either be created for a given
- named section, or more typically from another BuildCtx. It has methods for adding the different
- types of statements. Some are simple (e.g., addExpr()), whilst some create a compound statement
- (e.g., addFilter()). The compound statements change the active selector so any new statements are
- added within that compound statement.
- As well as building up a tree of expressions, this data structure also maintains a tree of
- associations. For instance when a value is evaluated and assigned to a temporary variable, the
- logical value is associated with that temporary. If the same expression is required later, the
- association is matched, and the temporary value is used instead of recalculating it. The
- associations are also used to track the active datasets, classes generated for row-meta information,
- activity classes etc. etc.
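- A sketch of the flavour, using the methods mentioned above. (The exact signatures should be checked
- in ecl/hqlcpp; addAssign() is assumed to exist alongside addExpr() and addFilter().)::
- 
-     // Generate "if (cond) { target = value; }". addFilter() creates the
-     // compound statement, and later additions land inside it.
-     void buildConditionalAssign(BuildCtx & ctx, IHqlExpression * cond,
-                                 IHqlExpression * target, IHqlExpression * value)
-     {
-         BuildCtx subctx(ctx);             // a new cursor at the same position
-         subctx.addFilter(cond);           // the compound statement becomes the active selector
-         subctx.addAssign(target, value);  // assumed method - added within the if block
-     }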
- Activity Helper
- ===============
- Each activity in an expression graph will have an associated class generated in the C++. Each
- different activity kind expects a helper that implements a particular IHThorArg interface. E.g., a
- sort activity of kind TAKsort requires a helper that implements IHThorSortArg. The associated
- factory function is used to create instances of the helper class.
- The generated class might take one of two forms:
- * A parameterised version of a library class. These are generated for simple helpers that don't have
- many variations (e.g., CLibrarySplitArg for TAKsplit), or for special cases that occur very
- frequently (CLibraryWorkUnitReadArg for internal results).
- * A class derived from a skeleton implementation of that helper (typically CThorXYZ implementing
- interface IHThorXYZ). The base class has default implementations of some of the functions, and any
- exceptions are implemented in the derived class.
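- For example, the generated helper for a sort activity might look broadly like the following. (An
- illustrative sketch - the real base classes and virtual methods are defined in eclhelper.hpp and its
- skeleton implementations, and the generated comparison body is omitted.)::
- 
-     // Helper for activity 1: TAKsort expects an IHThorSortArg implementation.
-     struct cAc1 : public CThorSortArg
-     {
-         struct MyCompare : public ICompare
-         {
-             virtual int docompare(const void * left, const void * right) const
-             {
-                 return 0;   // generated field-by-field comparison goes here
-             }
-         } compare;
-         virtual ICompare * queryCompare() { return &compare; }
-     };
-     extern "C" IHThorArg * fAc1() { return new cAc1; }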
- Meta helper
- ===========
- This is a class that is used by the engines to encapsulate all the information about a single row -
- e.g., the format that each activity generates. It is an implementation of the IOutputMeta
- interface. It includes functions to:
- * Return the size of the row.
- * Serialize and deserialize from disk.
- * Destroy and clean up row instances.
- * Convert to xml.
- * Provide information about the contained fields.
- Building expressions
- ====================
- The same expression nodes are used for representing expressions in the generated C++ as the original
- ECL expression graph. It is important to keep track of whether an expression represents untranslated
- ECL, or the "translated" C++. For instance, ECL has 1-based indexes, while C++ is zero-based. If you
- processed the expression x[1] it might get translated to x[0] in C++. Translating it again would
- incorrectly refer to x[-1].
- There are two key classes used while building the C++ for an ECL expression:
- CHqlBoundExpr.
- This represents a value that has been converted to C++. Depending on the type, one or more of the
- fields will be filled in.
- CHqlBoundTarget.
- This represents the target of an assignment - the C++ variable(s) that are going to be assigned the
- result of evaluating an expression. It is almost always passed as a const parameter to a function
- because the target is well-defined and the function needs to update that target.
- A C++ expression is sometimes converted back to an ECL pseudo-expression by calling
- getTranslatedExpr(). This creates an expression node of kind no_translated to indicate the child
- expression has already been converted.
- Scalar expressions
- ------------------
- The generation code for expressions has a hierarchy of calls. Each function is there to allow
- optimal code to be generated - e.g., not creating a temporary variable if none are required. A
- typical flow might be:
- * buildExpr(ctx, expr, bound).
- Evaluate the ECL expression "expr" and save the C++ representation in the class bound. This might
- then call through to...
- * buildTempExpr(ctx, expr, bound);
- Create a temporary variable, evaluate expr, and assign the result to that temporary variable. This in
- turn calls...
- * buildExprAssign(ctx, target, expr);
- Evaluate the expression, and ensure it is assigned to the C++ target "target".
- The default implementation might be to call buildExpr...
- An operator must either be implemented in buildExpr() (calling a function called doBuildExprXXX) or
- in buildExprAssign() (calling a function called doBuildAssignXXX). Some operators are implemented in
- both places if there are different implementations that would be more efficient in each context.
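- Schematically the dispatch looks something like this. (Simplified: the real code in ecl/hqlcpp
- covers hundreds of operators; no_add and doBuildExprAdd() are stand-ins for whichever
- operator/function pair is involved.)::
- 
-     void HqlCppTranslator::buildExpr(BuildCtx & ctx, IHqlExpression * expr, CHqlBoundExpr & bound)
-     {
-         switch (expr->getOperator())
-         {
-         case no_add:
-             doBuildExprAdd(ctx, expr, bound);   // operator-specific implementation
-             return;
-         // ... many other operators ...
-         default:
-             buildTempExpr(ctx, expr, bound);    // fall back to evaluation via a temporary
-             return;
-         }
-     }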
- Similarly there are several different assignment functions:
- * buildAssign(ctx, <ecl-target>, <ecl-value>);
- * buildExprAssign(ctx, <c++-target>, <ecl-value>);
- * assign(ctx, <C++target>, <c++source>)
- The different varieties are there depending on whether the source value or targets have already been
- translated. (The names could be rationalised!)
- Datasets
- --------
- Most dataset operations are only implemented as activities (e.g., PARSE, DEDUP). If these are used
- within a transform/filter then eclcc will generate a call to a child query. An activity helper for the
- appropriate operation will then be generated.
- However a subset of the dataset operations can also be evaluated inline without calling a child query.
- Some examples are filters, projects, and simple aggregation. It removes the overhead of the child query
- call in the simple cases, and often generates more concise code.
- When datasets are evaluated inline there is a similar hierarchy of function calls:
- * buildDatasetAssign(ctx, target, expr);
- Evaluate the dataset expression, and assign it to the target (a builder interface).
- This may then call....
- * buildIterate(ctx, expr)
- Iterate through each of the rows in the dataset expression in turn.
- Which may then call...
- * buildDataset(ctx, expr, target, format)
- Build the entire dataset, and return it as a single value.
- Some of the operations (e.g., aggregating a filtered dataset) can be done more efficiently by summing and
- filtering an iterator, than forcing the filtered dataset to be evaluated first.
- Dataset cursors
- ---------------
- The interface IHqlCppDatasetCursor allows the code generator to iterate through a dataset, or select
- a particular element from a dataset. It is used to hide the different representation of datasets,
- e.g.,
- * Blocked - the rows are in a contiguous block of memory appended one after another.
- * Array - the dataset is represented by an array of pointers to the individual rows.
- * Link counted - similar to array, but each element is also link counted.
- * Nested. Sometimes the cursor may iterate through multiple levels of child datasets.
- Generally rows that are serialized (e.g., on disk) are in blocked format, and they are stored as link
- counted rows in memory.
- Field access classes
- --------------------
- The IReferenceSelector interface and the classes in hqltcppc[2] provide an interface for getting and
- setting values within a row of a dataset. They hide the details of the layout - e.g., csv/xml/raw
- data, and the details of exactly how each type is represented in the row.
- Key filepos weirdness
- ---------------------
- The current implementation of keys in HPCC uses a format with a separate 8-byte integer field
- which was historically used to store the file position in the original file. Other complications are
- that the integer fields are stored big-endian, and signed integer values are biased.
- This introduces some complication in the way indexes are handled. You will often find that the
- logical index definition is replaced with a physical index definition, followed by a project to
- convert it to the logical view. A similar process occurs for disk files to support
- VIRTUAL(FILEPOSITION) etc.
- ***********
- Source code
- ***********
- The following are the main directories used by the ECL compiler.
- +------------------+-------------------------------------------------------------------------------------+
- | Directory | Contents |
- +==================+=====================================================================================+
- | rtl/eclrtpl | Template text files used to generate the C++ code |
- +------------------+-------------------------------------------------------------------------------------+
- | rtl/include | Headers that declare interfaces implemented by the generated code |
- +------------------+-------------------------------------------------------------------------------------+
- | common/deftype | Interfaces and classes for scalar types and values. |
- +------------------+-------------------------------------------------------------------------------------+
- | common/workunit | Code for managing the representation of a work unit. |
- +------------------+-------------------------------------------------------------------------------------+
- | ecl/hql | Classes and interfaces for parsing and representing an ecl expression graph |
- +------------------+-------------------------------------------------------------------------------------+
- | ecl/hqlcpp | Classes for converting an expression graph to a work unit (and C++) |
- +------------------+-------------------------------------------------------------------------------------+
- | ecl/eclcc | The executable which ties everything together. |
- +------------------+-------------------------------------------------------------------------------------+
- **********
- Challenges
- **********
- From declarative to imperative
- ==============================
- As mentioned at the start of this document, one of the main challenges with eclcc is converting the
- declarative ECL code into imperative C++ code. The direction we are heading in is to allow the
- engines to support more lazy evaluation - so possibly, in this instance, evaluating it the first time it
- is used (although that may potentially be much less efficient). This will allow the code generator
- to relax some of its current assumptions.
- There are several example queries which are already producing pathological behaviour from eclcc,
- causing it to generate C++ functions which are many thousands of lines long.
- The parser
- ==========
- Currently the grammar for the parser is too specialised. In particular the separate productions for
- expressions, datasets and actions cause problems - e.g., it is impossible to properly allow sets of
- datasets to be treated in the same way as other sets.
- The semantic checking (and probably semantic interpretation) is done too early. Really the parser
- should build up a syntax tree, and then disambiguate it and perform the semantic checks on the syntax
- tree.
- The function calls should probably be expanded later than they are. I have tried in the past and hit
- problems, but I can't remember all the details. Some are related to the semantic checking.
|