
Initial version of FUTURE document

Signed-off-by: Gavin Halliday <gavin.halliday@lexisnexis.com>
Gavin Halliday 13 years ago
commit b39eb133f9
1 changed files with 235 additions and 0 deletions
FUTURE

@@ -0,0 +1,235 @@
+This file contains an outline of some of the different areas we might pursue in the future.  Once the plans become
+concrete they will be added as issues with associated milestones.
+
+Technology changes
+==================
+What technology changes do we need to ensure we adapt to?
+
+Many cores
+* Better use of lots of threads.
+* Parallel PARSE, PROJECT and other activities that are cpu intensive.
+* Dynamically adapt to the number of available threads.
+* Ensure multithreading code is efficient.
+* Reduce critical sections and locking.
+* More read ahead threads.
+* Experiment with Sequential blocked read ahead.
+
+More memory
+* Dynamic resourcing in Thor.
+* Dynamic caching of spill files.
+
+SSDs
+* What do the greater seek rates allow us to do differently?
+
+Increased cost of power
+
+Improved network speed
+* The gap between network speed and disk speed is growing.  What assumptions does that change?
+
+Cloud support and Local v Remote files.
+
+Architectural
+=============
+What changes to the underlying architecture do we want to make?  Why?
+
+Full Windows support.
+* Solve problems with SSH under Windows.
+* Solve problems of how to get/build third party libraries.
+* Equivalent to the init system.
+* Support 64-bit Windows.
+
+Improvements to measurement and statistics
+* Do we know where all the time is going?
+  - In the code generator?
+  - In the run time engines?
+* What feedback would help ECL developers?
+
+Combine roxie and hthor
+* Extend roxie so it becomes a superset of hthor.
+* May require a flag to indicate the mode (e.g., spill handling) but code base should be one.
+* Allow roxie to listen on a dali queue.
+
+Common up the row reading interfaces between roxie and thor.
+* Makes it easier to pick up the system and work on it.
+* Make it easier to provide utility classes (e.g., readahead, activities).
+
+Support re-entrant global-graph execution
+* Allow a query to call a C++ function which might then call another executeGraph() call.
+* Opens up possibilities of more flexible code generation.
+* Requires changes to code gen and engines.
+* Requires parent extract supported by global graph execute.
+* Thor???
+
+Reduce the number of dlls in the system
+* The number of dlls and dependencies can almost certainly be simplified.
+
+Switch to OpenMPI or other framework
+* Does it provide the capabilities we need?
+* Would it be a suitable replacement for thor or roxie transport?
+
+Generate more than one dll for each work unit?
+* Allow more granular query compiling.
+* Reduce the data required by on-demand roxie slaves.
+* Allow remote filtering and projecting.
+
+Extensible system
+=================
+What changes can we make to make it easier for third parties to extend?
+What benefit might we get?
+
+File formats
+* Indexes
+    - Enable optimizations to our own formats.
+    - New implementations.
+    - Interfacing with external implementations.
+* Files
+    - Compressed
+    - Hadoop
+    - Embedded resources
+
+File locations/sources
+* Rationalise the current logical filename syntax, and extend it.
+
+Repositories
+- Allow more flexibility and extensibility in the sources of files used as input to eclcc.
+* Create a cleaner interface for accessing hierarchical ECL source.
+* Building directly from Github
+* Tar
+* Compressed archive
+
+C++ integration
+* Make it easier to link libraries/blocks of C++ into programs.
+* Improve support for C++ attributes (e.g., dependencies between attributes).
+* Streaming of datasets to and from C++.
+* Using third party libraries.
+
+Activities
+* Make it easier for 3rd parties to extend the activities in the system.
+* Allow user C++ activities to be defined.
+
+New capabilities
+================
+What capabilities can we add to the system to make it solve more problems?
+
+Better support for UTF8
+* Current support isn't even documented.
+* PARSE and a few other places (indexes?) need some more work.
+
+Unicode support
+* Better support in indexes.
+* Expose word break semantics and support in PARSE.
+* UTF8 DFAs in PARSE.
+
+Thor debugger.
+
+Support recursion
+
+New activities
+* DATASET(count, transform(counter))
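To make the intent of DATASET(count, transform(counter)) concrete, here is a minimal C++ sketch of what an engine-side helper might do: call the transform once per counter value and collect the rows. The names Row and generateDataset are hypothetical, and the 1-based counter is an assumption rather than something this document specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical row layout used only for this illustration.
struct Row
{
    std::uint64_t id;
    std::uint64_t square;
};

// Hypothetical helper: builds `count` rows by calling `transform` with
// counter values 1..count (1-based counting is an assumption).
inline std::vector<Row> generateDataset(std::uint64_t count,
                                        const std::function<Row(std::uint64_t)> &transform)
{
    assert(static_cast<bool>(transform)); // a transform must be supplied
    std::vector<Row> rows;
    rows.reserve(count);
    for (std::uint64_t counter = 1; counter <= count; ++counter)
        rows.push_back(transform(counter));
    return rows;
}
```

Because no input dataset is needed, an activity like this could act as a pure source in a graph, which is what makes it attractive for generating test data.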
+
+More problem domains
+- What would be required to support some of the following domains better?
+* Biological/Genetic.
+* Matrix processing/computationally intensive.
+* Unicode free text processing.
+* Better support for SAS/R.
+* What hooks can we provide to make it easy for 3rd parties to implement?
+
+Enterprise
+==========
+What extensions can we make to the enterprise system?
+
+Repository
+* Fix existing repository implementation - particularly cache issues.
+* Simplify and improve caching capabilities.
+* Fix the current directory scanning.
+* Allow more repository types (see extensibility).
+
+Legacy support tools
+* Support tool to add imports, and clean up other changes that are required.
+
+Encryption at rest
+* Do we need it?
+* How do we safely distribute the keys with the current system?
+
+Redundancy
+* Should we support 3 or more way redundancy?
+
+Clean up query deployment
+* Finish the query sets
+
+Better testing
+* Regression suite could do with a thorough overhaul.
+* Ideally some better coverage testing.
+* Some queries that can be run as a benchmark for the system speed.
+* File spray tests.
+
+Streamed input support in Thor.
+* Following on from the discussion with David and Dermot.
+
+SQL interface into the system.
+- Could this build on the mapping and joining fields for the roxie browser?
+
+Dali hot/warm failover redundancy
+
+Optimizations
+=============
+How can we improve the performance of the system?
+
+Optimize the complexity of the graphs that are run
+* Code generator could track how sort orders are used and optimize the activities generated.
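As an illustration of tracking sort orders, the sketch below walks a linear pipeline of activities, remembers the order the rows are known to be in, and drops any sort whose keys are a prefix of that order. Everything here (Kind, Activity, removeRedundantSorts) is hypothetical and far simpler than a real graph; it also conservatively assumes that a project loses the order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class Kind { Read, Sort, Filter, Project };

// Hypothetical, simplified activity description.
struct Activity
{
    Kind kind;
    std::vector<std::string> sortKeys; // only meaningful for Kind::Sort
};

// True if rows ordered by `known` are necessarily ordered by `wanted`.
inline bool isPrefixOf(const std::vector<std::string> &wanted,
                       const std::vector<std::string> &known)
{
    return wanted.size() <= known.size() &&
           std::equal(wanted.begin(), wanted.end(), known.begin());
}

// Returns the pipeline with sorts removed where the input order already
// satisfies them.
inline std::vector<Activity> removeRedundantSorts(const std::vector<Activity> &pipeline)
{
    std::vector<Activity> out;
    std::vector<std::string> knownOrder; // empty means the order is unknown
    for (const Activity &a : pipeline)
    {
        switch (a.kind)
        {
        case Kind::Sort:
            assert(!a.sortKeys.empty());
            if (isPrefixOf(a.sortKeys, knownOrder))
                continue; // input is already in this order: skip the sort
            knownOrder = a.sortKeys;
            break;
        case Kind::Filter:
            break; // a filter removes rows but keeps their relative order
        case Kind::Read:
        case Kind::Project:
            knownOrder.clear(); // conservatively assume the order is lost
            break;
        }
        out.push_back(a);
    }
    return out;
}
```

A real implementation would need per-activity knowledge of which fields survive a project, so that orders can be preserved rather than discarded.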
+
+Dynamic resourcing
+* Scope for Thor to select different implementations based on input data size.
+* Dynamic row caching / combine multiple subgraphs as one.
+
+New activity implementations
+* If the data is held on a Lustre file system, is there scope for new sort activities?
+
+Reduce data transfer
+* Local and remote helpers would significantly reduce the amount of data transferred for roxie keyed joins (and other activities).
+
+Row representation
+* More intelligent row serialize/deserialize (e.g. on index read slaves)
+* Enable packing/alignment on rows.
+* Allow sizes to be separate from their strings/datasets.
+* Datasets and strings with smaller record counts.
+* Link counted child rows (not just datasets)
+* Maxcount(1) optimization
+* Link counted strings
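The idea behind link counted strings and rows is that several owners share one heap object and adjust a counter instead of copying the data. A minimal sketch follows, assuming a hypothetical LinkedString class; it does not reflect the existing row or dataset classes.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical link-counted string: created on the heap with one link,
// destroyed when the last link is released.
class LinkedString
{
public:
    explicit LinkedString(std::string text) : value(std::move(text)) {}

    // Take an extra reference instead of copying the characters.
    void link() const { links.fetch_add(1); }

    // Drop one reference; returns true if this call destroyed the object.
    bool release() const
    {
        unsigned previous = links.fetch_sub(1);
        assert(previous > 0);
        if (previous == 1)
        {
            delete this;
            return true;
        }
        return false;
    }

    unsigned linkCount() const { return links.load(); }
    const std::string &text() const { return value; }

private:
    ~LinkedString() = default; // private: only release() may destroy
    mutable std::atomic<unsigned> links{1};
    std::string value;
};
```

The trade-off is a pointer-sized field in the row in place of inline characters, which pays off when the same long string is carried through many activities unchanged.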
+
+Speed up eclcc
+* For some queries (e.g., NCF) a large proportion of the development time must be spent compiling and deploying the queries.
+
+BCD library
+* Remove the critical section by using a thread variable for the stack
+* Improve code for the basic operations.
+* Consider switch to a non-stack implementation.
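The first bullet can be sketched as follows: if each thread owns its operand stack, pushes and pops need no critical section at all. The types and function names (Decimal, DecimalStack, decPush, decPop, decAdd) are hypothetical stand-ins, and a 64-bit integer replaces real BCD digits to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a decimal value; a real BCD type holds digits.
struct Decimal
{
    std::int64_t scaled;
};

// One operand stack per thread. No lock is needed because no other
// thread can reach this instance.
struct DecimalStack
{
    static constexpr std::size_t maxDepth = 32;
    Decimal slots[maxDepth];
    std::size_t depth = 0;
};

inline DecimalStack &threadStack()
{
    thread_local DecimalStack stack; // created lazily for each thread
    return stack;
}

inline void decPush(std::int64_t value)
{
    DecimalStack &s = threadStack();
    assert(s.depth < DecimalStack::maxDepth);
    s.slots[s.depth++] = Decimal{value};
}

inline std::int64_t decPop()
{
    DecimalStack &s = threadStack();
    assert(s.depth > 0);
    return s.slots[--s.depth].scaled;
}

// Replaces the top two operands with their sum.
inline void decAdd()
{
    std::int64_t rhs = decPop();
    std::int64_t lhs = decPop();
    decPush(lhs + rhs);
}
```

A non-stack implementation, as the last bullet suggests, would pass operands explicitly and avoid the hidden per-thread state altogether.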
+
+Conditional actions in graphs
+* Most of the work for this has been done; we should explicitly aim to support and enable it.
+* Need to finish WHEN support (e.g., implicit field projection from side-effects).
+
+Graph representation
+* Compress it
+* Don't include the graphs in the SDS, retrieve from the workunit instead.
+
+Improve implicit field projection
+- Currently doesn't optimize nested record structures.
+
+Improve the common sub expression processing in eclcc.
+
+Improve generation of conditional expressions.
+
+Implement costing for expressions and activities
+* Would improve decisions about whether reordering, substituting etc. is worthwhile.
+
+Variable in <set>
+* Should sometimes use a hash table.
+* Special case of an associative array [#50371]  E.g., MAP(dataset, { keyed } [,{extra}]);
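A rough sketch of the "sometimes use a hash table" idea: decide once, when the set is built, whether a linear scan or a hash lookup is likely to be cheaper. The class SetMatcher and the threshold of 8 are hypothetical; the right cut-over point would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper that picks a membership strategy at build time.
class SetMatcher
{
public:
    // hashThreshold is an illustrative tuning knob, not a measured value.
    explicit SetMatcher(std::vector<std::string> values, std::size_t hashThreshold = 8)
        : items(std::move(values)), useHash(items.size() > hashThreshold)
    {
        if (useHash)
            lookup.insert(items.begin(), items.end());
        assert(lookup.size() <= items.size()); // duplicates collapse in the hash set
    }

    bool contains(const std::string &key) const
    {
        if (useHash)
            return lookup.count(key) != 0;
        for (const std::string &v : items) // small set: a linear scan is cheap
            if (v == key)
                return true;
        return false;
    }

    bool usesHash() const { return useHash; }

private:
    std::vector<std::string> items;
    std::unordered_set<std::string> lookup;
    bool useHash;
};
```

The associative-array case mentioned above would extend the same structure from a set of keys to a map from keyed fields to the extra fields.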
+
+Fix resourcing of inline datasets
+* Currently the CSE for datasets executed within a transform is poor.
+
+Optimize overly conditional code
+* Often occurs when converting procedural code to ECL.  Too many guard conditions are added to the ECL.