Controlling Roxie Queries
Several ECL functions are designed specifically to help optimize queries
for execution on Roxie: PRELOAD, ALLNODES, THISNODE, LOCAL, and NOLOCAL.
Understanding how these functions work together can make a big difference
in the performance of your Roxie queries.
How Graphs Execute
Writing efficient queries for Roxie or Thor can require an
understanding of how the different clusters operate. This brings up three
questions:
How does the graph execute, on a single node, or on all nodes in
parallel?
How are datasets accessed by each node executing the graph, only the
parts that are local to the node, or all parts on all nodes?
Does an operation coordinate with the same operation on other nodes,
or does each node operate independently?
Here's how queries “normally” execute on each type of
cluster:
Thor
Graphs execute on multiple slave nodes in
parallel.
Index/disk reads are done locally by each slave
node.
All other disk accesses (FETCH, keyed JOIN, etc.) effectively
go across all nodes.
Coordination with operations on other nodes is controlled
by the presence or absence of the LOCAL option on the
operation.
No support for child queries (this may change in future
releases).
hthor
Graphs execute on the single ECL Agent node.
All parts of the dataset/index are accessed by directly
accessing the disk drive of the node with the data—no other
interaction with the other nodes.
Child queries always execute on the same node as the
parent.
Roxie
Graphs execute on a single (Roxie server) node.
All parts of the dataset/index are accessed by directly
accessing the disk drive of the node with the data—no other
interaction with the other nodes.
Child queries might execute on a single agent node
instead of a Roxie server node.
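As a minimal sketch of the Thor coordination point above, the following contrasts a global SORT with a LOCAL one. The record layout, field names, and inline data are hypothetical and exist only for illustration.

```ecl
// Hypothetical layout and data, for illustration only.
Rec := RECORD
  STRING10  word;
  UNSIGNED4 cnt;
END;

ds := DATASET([{'roxie', 3}, {'thor', 1}, {'hthor', 2}], Rec);

// No LOCAL option: on Thor the sort coordinates with the same
// operation on the other nodes and yields one global ordering.
globalSorted := SORT(ds, word);

// LOCAL option: each node sorts only the records it already holds,
// with no coordination between nodes.
localSorted := SORT(ds, word, LOCAL);

OUTPUT(globalSorted, NAMED('GlobalSort'));
OUTPUT(localSorted, NAMED('LocalSort'));
```

With a small inline dataset both outputs will probably look the same; the difference shows up once the records are spread across several nodes.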
ALLNODES vs. THISNODE
In Roxie, graphs execute on a single Roxie server node unless the
ALLNODES() function is used. ALLNODES() causes the portion of the query it
encloses to execute on all agent nodes in parallel. The results are
calculated independently on each node and then merged together, without
ordering the records. It is generally used for complex remote
processing which only requires local index access, substantially reducing
the network traffic between the nodes.
By default, everything within the ALLNODES() will be executed on all
the nodes, but sometimes the ALLNODES() query requires some input or
arguments that shouldn't be executed on all the nodes—for example, the
previous best guess at the results, or some information controlling the
parallel query. The THISNODE() function can be used to surround elements
that are to be evaluated by the current node instead.
A typical usage would look like this:
bestSearchResults := ALLNODES(doRemoteSearch(THISNODE(searchWords),THISNODE(previousResults)));
Where 'searchWords' and 'previousResults' are effectively calculated
on the current node, and then passed as parameters to each instance of the
doRemoteSearch() executing in parallel on all nodes.
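The pattern above can be sketched more fully as follows. This is only an illustration under stated assumptions: the record layouts, the logical file names, and the body of doRemoteSearch() are all hypothetical, and the previousResults argument is left out to keep the sketch short. ALLNODES() and THISNODE() are Roxie-only, so this is meant to be published as a Roxie query.

```ecl
// Hypothetical layouts and logical file names; substitute your own.
WordRec := RECORD
  STRING20  word;
  UNSIGNED8 docId;
END;
SearchRec := RECORD
  STRING20 word;
END;

wordDS  := DATASET('~example::words', WordRec, THOR);
wordIdx := INDEX(wordDS, {word}, {docId}, '~example::key::word_doc');

// Hypothetical search routine: a JOIN against the index that,
// because of the LOCAL option, uses only the index part held by
// the node on which it runs.
doRemoteSearch(DATASET(SearchRec) words) :=
  JOIN(words, wordIdx,
       LEFT.word = RIGHT.word,
       TRANSFORM(RIGHT),
       LOCAL);

searchWords := DATASET([{'roxie'}, {'thor'}], SearchRec);

// searchWords is evaluated on the current node, then passed to each
// instance of doRemoteSearch() running in parallel on the agent nodes.
bestSearchResults := ALLNODES(doRemoteSearch(THISNODE(searchWords)));

OUTPUT(bestSearchResults);
```

Because the merged results are not ordered, add an explicit SORT outside the ALLNODES() if the caller depends on a particular order.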
LOCAL vs. NOLOCAL
The LOCAL option available on many functions (like JOIN, SORT, etc.)
and the LOCAL() and NOLOCAL() functions control whether the graphs running
on a particular node access all parts of a file/index or only those
associated with the particular node (LOCAL). Often within an ALLNODES()
context you only want to access local index parts from a single node
because each node is independently processing its associated parts.
Specifying that an index read or a keyed JOIN is LOCAL means that only the
local part is used on each node. A local read of a single-part INDEX will
only be evaluated on the first agent node (or the farmer node if not
within an ALLNODES).
Local evaluation can be specified in two ways:
1) As a dataset operation:
LOCAL(MyIndex)(myField = searchField)
2) As an option on the operation:
JOIN(... ,LOCAL)
FETCH(... ,LOCAL)
The LOCAL(dataset) function causes every
operation on the dataset to access the file/key
locally. For example,
LOCAL(JOIN(index1, index2,...))
will read index1 and index2 locally. This rule is recursively
applied until you reach one of the following:
Use of the NOLOCAL() function
A non-local attribute—the operation stays non-local, but children
are still marked as local as necessary
A GLOBAL() or THISNODE() or workflow operation—since they will be
evaluated in a different context
Use of the ALLNODES() function (as in a nested child query)
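The recursion rule and its first stopping condition can be sketched like this. The layouts, index names, and logical file names are hypothetical, and the sketch assumes that NOLOCAL() can wrap an index used as the right-hand side of a JOIN in the same way that LOCAL() can.

```ecl
// Hypothetical layouts and logical file names, for illustration only.
PersonRec := RECORD
  UNSIGNED8 id;
  STRING20  name;
END;
IdRec := RECORD
  UNSIGNED8 id;
END;

baseDS := DATASET('~example::people', PersonRec, THOR);
idx1   := INDEX(baseDS, {id}, {name}, '~example::key::people_a');
idx2   := INDEX(baseDS, {id}, {name}, '~example::key::people_b');

wanted := DATASET([{1}, {2}], IdRec);

// LOCAL() applies recursively: both the inner JOIN against idx1 and
// the outer JOIN against idx2 read only the local index parts.
allLocal := LOCAL(JOIN(JOIN(wanted, idx1,
                            LEFT.id = RIGHT.id, TRANSFORM(LEFT)),
                       idx2,
                       LEFT.id = RIGHT.id, TRANSFORM(RIGHT)));

// NOLOCAL() stops the recursion: idx1 is still read locally,
// while idx2 is read across all of its parts.
mixed := LOCAL(JOIN(JOIN(wanted, idx1,
                         LEFT.id = RIGHT.id, TRANSFORM(LEFT)),
                    NOLOCAL(idx2),
                    LEFT.id = RIGHT.id, TRANSFORM(RIGHT)));

OUTPUT(allLocal, NAMED('AllLocal'));
OUTPUT(mixed, NAMED('Mixed'));
```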
Note that:
JOIN(x, LOCAL(index1)...) is treated the same as JOIN(x, index1,
..., local).
LOCAL is also supported as an option on an INDEX, but the LOCAL()
function is preferred, because whether access to an index should be local
generally depends on the context in which the index is used.
A non-local attribute is supported everywhere that a LOCAL attribute
is allowed, so that it can override an enclosing LOCAL() function.
The use of LOCAL to indicate that dataset/key access is local does
not conflict with its use to control coordination of an operation with
other nodes, because there is no operation that potentially coordinates
with other nodes and also accesses indexes or datasets.
NOROOT Indexes
The ALLNODES() function is particularly useful if there is more than
one index co-distributed on a particular value so that all information
that relates to a particular key field value is associated with the same
node. However, indexes are generally globally sorted. Adding a NOROOT
option to a BUILD action or INDEX declaration indicates that the index is
not globally sorted, and there is no root index to indicate which part of
the index will contain a particular entry.
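A minimal sketch of building such an index might look like the following. The layout and logical file names are hypothetical, and the sketch assumes that a NOROOT build keeps each node's records in that node's index part, so that distributing the data on the key field first is what co-locates all entries for a given value.

```ecl
// Hypothetical layout and logical file names, for illustration only.
WordRec := RECORD
  STRING20  word;
  UNSIGNED8 docId;
END;

wordDS := DATASET('~example::words', WordRec, THOR);

// Put every record for a given word on the same node. Other indexes
// distributed on the same hash of the same value would then be
// co-distributed with this one.
distWords := DISTRIBUTE(wordDS, HASH32(word));

wordIdx := INDEX(distWords, {word}, {docId},
                 '~example::key::word_doc_noroot');

// NOROOT: the index is not globally sorted and has no root index
// saying which part contains a particular entry.
BUILD(wordIdx, NOROOT, OVERWRITE);
```

A Roxie query would then typically read such an index with LOCAL access inside an ALLNODES(), since without a root index there is no way to tell in advance which part holds a given key value.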