- /*##############################################################################
- HPCC SYSTEMS software Copyright (C) 2012 HPCC Systems®.
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- ############################################################################## */
- Issues that aren't handled correctly at the moment:
- o LEFT/RIGHT.
- Currently these can be ambiguous - because expressions at different levels of the tree can have the same
- logical representation. That means all transforms need to be context dependent, because the same expression
- in different locations may not have the same meaning. We currently get away with it most of the time.
- o No way to alias a dataset.
- - It is needed in some situations, e.g., count(children.names, age > ave(children.names)); see the alias sketch after this list.
- o no_setresult
- The representation is ugly for walking activity lists - much better if we could guarantee that a dataset was the first parameter:
- no_setresult(dataset, select, sequence).
- o It is never simple to walk the activity tree because of exceptions like if(), setresult(), activerow etc.
- o Should possibly represent selectors as different HqlExpressions - it might remove some of the complication from the transformers.
- o Hoisting common items: conditions and dependencies
- - Generally, if a cse is used in multiple places then we want to hoist it once.
- - We don't want to hoist it too early, otherwise dependencies might get messed up.
- - If a cse is used in a condition branch then we don't want to hoist it globally unless it is (first) used non-globally.
- - cses shouldn't be moved within sequentials because that may change the meaning. We can assume that all evaluations
- of the same expression produce the same result though.
- - Be careful when hoisting, because we may also need to hoist items in no_comma that are further up the tree....
- - If a sequential is inside a condition, then it can't be commoned up with a later non-conditional cse.
- - Need comma expressions to be created in the best locations. Generally tagged onto the activity at a minimum.
- - The problem with creating a comma locally and then hoisting without analysing is that the expressions won't get commoned up.
- o Correct place for matched/fileposition
- - Currently the subquery tree is walked to work out the correct place to evaluate fileposition, but this needs reimplementing.
- o Filter scoring
- - It would be useful to be able to score filters/datasets so we can reorder them and work out whether optimizations are worthwhile.
- o no_mapto
- This is a particularly bad representation for walking the tree because of cse issues etc.
- Better would be no_case(value, [match_values], result_values {expanded});
- Removing case/map activities at least makes cse transforming simpler.
- o no_xml flags in tree
- Flags for xml etc. should be inherited from child datasets, but ignored from child values. Also, nested parse isn't allowed at the moment.
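- A hedged sketch of the aliasing problem mentioned above (names hypothetical; the
- ALIAS() syntax is invented for illustration). A plain ECL definition doesn't help,
- because both names still denote the same expression and get commoned up:
-     // Intended: count the children whose age is above the average age taken
-     // over the complete child dataset - the inner children.names should mean
-     // the whole dataset, not the row currently being iterated:
-     COUNT(children.names(age > AVE(children.names, age)));
-     // With a hypothetical alias the two uses could be kept distinct:
-     // allNames := ALIAS(children.names);
-     // COUNT(children.names(age > AVE(allNames, age)));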
- **********************************************************************************************************************************************************
- -- Reordering filters --
- o Should have a go at looking at HOLe, and trying to steal its code. Effectively we need to:
- - work out the cardinality of each field.
- - functions to calculate how likely a filter is to occur
- - functions to score how expensive an operation is.
- - do filters on outer datasets first
- - reorder the filters to match (a sketch follows below).
- o HOLe also chooses the best order to do the joins in. That would be possible, but there aren't that many situations where
- reordering would be that helpful. It wouldn't be trivial, but I think it could be done reasonably easily.
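- A hedged ECL sketch of the reordering this scoring would enable (dataset and
- field names are hypothetical):
-     // A cost model would prefer to apply the cheap, selective equality test
-     // before the expensive regular expression match:
-     slow := people(REGEXFIND('^Smi', surname), gender = 'M');
-     // ...and would effectively rewrite it as:
-     fast := people(gender = 'M')(REGEXFIND('^Smi', surname));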
- -- Related datasets --
- o What is the issue? Simplest with a couple of examples:
- address(exists(address.person.books(f)))
- o The nested child dataset is relatively simple: if the node created is select(select(address, person, new), books, new)
- we would be requested to build an iterator on address.person.
- - If no break is required, we can just use nested iterators.
- - If a break is required, we need a class that can iterate AND provide access to both person and book fields.
- - Similar problem to the outer level nested iterator, except we can use pointers to the parent record.
- o Would be easier if we had a representation:
- address(exists(related(person(parent-join-condition), books(join-condition))(f)));
- related has syntax:
- related(filter1, filter2, filter3, filter4, ...)
- o related() iterator:
- i) We need to create a class. The class will probably be of the same form as the HOLe iterators.
- -- Related datasets --
- o The problem, an example:
- address(exists(books(books.personid=(person(person.addrid = address.id)).id)(f)))
- o The issue here is that we need to look in the filter condition for how the join is done.
- We really need a graph which looks like:
- address(exists(relate(person(addr = address.id), books(personid = person.id))(f)));
- o The first requirement is to support the explicit syntax. The second is to spot where it needs to be introduced.
- o The main effects of the syntax are
- a) to ensure tables are met in the correct order
- b) to ensure we never meet newtable.field
- o Automatic spotting:
- - For aggregates etc.
- - Spot all uses of dataset.field in filter where dataset is new and not a single row.
- - If these datasets provide a bridge from the current scope to the main dataset for the aggregate, then
- generate a relate(a,b,c) statement. Extract any filter which only relates to fields that are already
- in scope to the appropriate level (see the sketch after this list).
- - Also applies to x[1] where x is an unrelated dataset.
- - Needs to be done after scope tagging, so the table usage information is available, which means that
- a) scope checking needs to be happy with it. Not sure how it would really cope.
- b) modified graph with the relates needs to also be correctly tagged.
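- A hedged sketch of the filter extraction described above, using the proposed
- relate() pseudo-syntax and the names from the example:
-     // Before: the condition mixes the person-book bridge with a test that only
-     // uses fields of address, which is already in scope:
-     address(exists(relate(person(addr = address.id AND address.city = 'NY'), books(personid = person.id))(f)));
-     // After: the in-scope test is extracted up to the address level:
-     address(city = 'NY', exists(relate(person(addr = address.id), books(personid = person.id))(f)));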
- -- Sub query related work---
- * ROW(row, record) - convert fields by name with error if no match found.
- * BUG: Hoisting datasets because they are scope invariant probably does too much, and
- doesn't create a valid structure.
- * Add a debug option to always create compound activities regardless of sharing - for debugging code generation of subqueries.
- -- Related potential changes ---
- * What work would be needed to allow records to be represented as records within a record,
- rather than expanding the record/dataset?
- What work would then be needed to allow compatibility between records and implicit children?
- * EXISTS(x) etc. should really push x onto a top scope stack, where all are accessible. Would
- reduce the number of qualifying prefixes on fields.
- * Distributed local activities after remote activities - this may potentially speed things up, but
- generally compound aggregating sources will have a larger advantage.
- -------- SubQuery fundamental questions: ---------
- * Is an explicit syntax needed when accessing related non-nested datasets? What would be
- sensible? Can it be deduced?
- * How do you calculate an average in an enclosing scope? How does within relate?
- !! Do aliases solve the problems they are proposed to solve, or do you need a way to evaluate an expression in a different scope?
- * In what sense are parent fields accessible when iterating through people at root level?
- They could theoretically be serialised as part of the record, but removed by an output with no table.
- Implicit projects would then make this usable. Worth pursuing.
- >> Needs a representation in a record to indicate it also contains parent references (see the sketch after this list).
- * How do you alias a dataset? Otherwise we can't implement sqjoin for a disk based item.
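- A hedged sketch of what such a record representation might look like (the
- PARENTREF marker is entirely hypothetical syntax):
-     personRec := RECORD
-         STRING20 name;
-         UNSIGNED1 age;
-         // hypothetical marker: the serialised row also carries the parent
-         // row, which an output with no table / an implicit project can strip:
-         // PARENTREF(addressRec) parent;
-     END;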
- -----------------------------------------------------------------------
- -- Sub query complications ---
- *) generating a sub graph on a child dataset.
- *) accessing fields from a parent dataset (unrelated child).
- *) accessing fields from a grand parent (multi level nesting)
- *) accessing fields from within inline sub-query (both levels).
- *) Indicating whether subgraphs can be run locally or need to be distributed (affects how
- cursors are serialised).
- *) traversing query at a child level.
- *) Aggregating, and selecting as child activities.
- *) counting, aggregating and group-aggregating source activities. (ensure both tested)
- *) CSE causing datasets to be returned. - Are they worth it?
- *) Multiple sub-queries, with different fields being available at different nesting levels.
- *) Optimizing counts and representation.
- *) Are aliases ever needed? (Situations where global is currently used I suspect.)
- *) Grouped temporary recordsets.
- *) hoisting table/context invariant expressions. E.g.,
- SELF.countMatch := count(d2(field = left.field))
- recoded as
- summary := table(d2, {cnt := count(group), field}, field);
- SELF.countMatch := sum(summary(field = left.field), cnt);
- where summary is only evaluated once per query (a fuller sketch follows at the end of this list).
- *) Executing multiple sub queries at once.
- - Would it help?
- - What chaos would it cause?
- - How about if we only allowed an explicit syntax?
- *) How do we replicate HOLe's join files? Effectively disk read (or other) datasets are cloned on each machine.
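- A fuller, hedged version of the hoisting example above (record and dataset
- definitions are hypothetical):
-     rec := RECORD
-         STRING10 field;
-     END;
-     d2 := DATASET('~demo::d2', rec, THOR);
-     // COUNT(d2(field = LEFT.field)) depends on the current row only through
-     // field, so it can be precomputed once as a grouped aggregate:
-     summary := TABLE(d2, {field, UNSIGNED8 cnt := COUNT(GROUP)}, field);
-     // ...and inside the transform the per-row count becomes a lookup:
-     // SELF.countMatch := SUM(summary(field = LEFT.field), cnt);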