14 years ago · b39eb133f9
--- a/FUTURE
+++ b/FUTURE
@@ -0,0 +1,235 @@
 
				+This file contains an outline of some of the different areas we might pursue in the future.  Once the plans become
			
 
				+concrete they will be added as issues with associated milestones.
			
 
				+
			
 
				+Technology changes
			
 
				+==================
			
 
				+What technology changes do we need to ensure we adapt to?
			
 
				+
			
 
				+Many cores
			
 
				+* Better use of lots of threads.
			
 
				+* Parallel PARSE, PROJECT and other activities that are cpu intensive.
			
 
				+* Dynamic adapting to number of available threads.
			
 
				+* Ensure multithreading code is efficient.
			
 
				+* Reduce critical sections and locking.
			
 
				+* More read ahead threads.
			
 
				+* Experiment with Sequential blocked read ahead.
			
 
				+
			
 
				+More memory
			
 
				+* Dynamic resourcing in Thor.
			
 
				+* Dynamic caching of spill files.
			
 
				+
			
 
				+SSDs
			
 
				+* What do the greater seek rates allow us to do differently?
			
 
				+
			
 
				+Increased cost of power
			
 
				+
			
 
				+Improved network speed
			
 
				+* The gap between network speed and disk speed is growing.  What assumptions does that change?
			
 
				+
			
 
				+Cloud support and Local v Remote files.
			
 
				+
			
 
				+Architectural
			
 
				+=============
			
 
				+What changes to the underlying architecture do we want to make.  Why?
			
 
				+
			
 
				+Full windows support.
			
 
				+* Solve problems with SSH under windows.
			
 
				+* Solve problems of how to get/build third party libraries.
			
 
				+* Equivalent to the init system.
			
 
				+* Support 64bit windows.
			
 
				+
			
 
				+Improvements to measurement and statistics
			
 
				+* Do we know where all the time is going?
			
 
				+  - In the code generator?
			
 
				+  - In the run time engines.
			
 
				+* What feedback would help ECL developers?
			
 
				+
			
 
				+Combine roxie and hthor
			
 
				+* Extend roxie so it becomes a superset of hthor.
			
 
				+* May require a flag to indicate the mode (e.g., spill handling) but code base should be one.
			
 
				+* Allow roxie to listen on a dali queue.
			
 
				+
			
 
				+Common up the row reading interfaces between roxie and thor.
			
 
				+* Makes it easier to pick up the system and work on it.
			
 
				+* Make it easier to provide utility classes (e.g., readahead, activities).
			
 
				+
			
 
				+Support re-entrant global-graph execution
			
 
				+* Allow a query to call a C++ function which might then call another executeGraph() call.
			
 
				+* Opens up possibilities of more flexible code generation.
			
 
				+* Requires changes to code gen and engines.
			
 
				+* Requires parent extract supported by global graph execute.
			
 
				+* Thor???
			
 
				+
			
 
				+Reduce the number of dlls in the system
			
 
				+* The number of dlls and dependencies can almost certainly be simplified.
			
 
				+
			
 
				+Switch to OpenMPI or other framework
			
 
				+* Does it provide the capabilities we need?
			
 
				+* Would it be a suitable replacement for thor or roxie transport.
			
 
				+
			
 
				+Generate more than one dll for each work unit?
			
 
				+* Allow more granular query compiling.
			
 
				+* Reduce the data required by on-demand roxie slaves.
			
 
				+* Allow remote filtering and projecting.
			
 
				+
			
 
				+Extensible system
			
 
				+=================
			
 
				+What changes can we make to make it easier for third parties to extend?
			
 
				+What benefit might we get?
			
 
				+
			
 
				+File formats
			
 
				+* Indexes
			
 
				+    - Enable optimizations to our own formats.
			
 
				+    - New implementations.
			
 
				+    - Interfacing with external implementations.
			
 
				+* Files
			
 
				+    - Compressed
			
 
				+    - Hadoop
			
 
				+    - Embedded resources
			
 
				+
			
 
				+File locations/sources
			
 
				+* Rationalise the current logical filename syntax, and extend it.
			
 
				+
			
 
				+Repositories
			
 
				+- Allow more flexibility and extendibility in the sources of files used as input to eclcc.
			
 
				+* Create a cleaner interface for accessing hierarchical ECL source.
			
 
				+* Building directly from Github
			
 
				+* Tar
			
 
				+* Compressed archive
			
 
				+
			
 
				+C++ integration
			
 
				+* Make it easier to link libraries/blocks of c++ into programs.
			
 
				+* Improve support for C++ attributes (e.g., dependencies between attributes).
			
 
				+* Streaming of datasets to and from C++.
			
 
				+* Using third party libraries.
			
 
				+
			
 
				+Activities
			
 
				+* Make it easier for 3rd parties to extend the activities in the system
			
 
				+* Allow user c++ activities to be defined.
			
 
				+
			
 
				+New capabilities
			
 
				+================
			
 
				+What capabilities can we add to the system to make it solve more problems?
			
 
				+
			
 
				+Better support for UTF8
			
 
				+* Current support isn't even documented.
			
 
				+* PARSE and a few other places (indexes?) need some more work.
			
 
				+
			
 
				+Unicode support
			
 
				+* Better support in indexes.
			
 
				+* Expose work break semantics and support in PARSE.
			
 
				+* UTF8 DFAs in PARSE.
			
 
				+
			
 
				+Thor debugger.
			
 
				+
			
 
				+Support recursion
			
 
				+
			
 
				+New activities
			
 
				+* DATASET(count, transform(counter))
			
 
				+
			
 
				+More problem domains
			
 
				+- What would be required to support some of the following domains better?
			
 
				+* Biological/Genetic.
			
 
				+* Matrix processing/computationally intensive.
			
 
				+* Unicode free text processing.
			
 
				+* Better support for SAS/R.
			
 
				+* What hooks can we provide to make it easy for 3rd parties to implement?
			
 
				+
			
 
				+Enterprise
			
 
				+==========
			
 
				+What extensions can we make to the enterprise system?
			
 
				+
			
 
				+Repository
			
 
				+* Fix existing repository implementation - particularly cache issues
			
 
				+* Simplify and improve caching capabilities.
			
 
				+* Fix the current directory scanning.
			
 
				+* Allow more repositories types (see extensibility)
			
 
				+
			
 
				+Legacy support tools
			
 
				+* Support tool to add imports, and clean up other changes that are required.
			
 
				+
			
 
				+Encryption at rest
			
 
				+* Do we need it?
			
 
				+* How do we safetly distrubute the keys with the current system.
			
 
				+
			
 
				+Redundancy
			
 
				+* Should we support 3 or more way redundancy?
			
 
				+
			
 
				+Clean up query deployment
			
 
				+* Finish the query sets
			
 
				+
			
 
				+Better testing
			
 
				+* Regression suite could do with a thorough overhall.
			
 
				+* Ideally some better coverage testing.
			
 
				+* Some queries that can be run as a benchmark for the system speed.
			
 
				+* File spray tests.
			
 
				+
			
 
				+Streamed input support in Thor.
			
 
				+* Following on from the discussion with David and Dermot.
			
 
				+
			
 
				+SQL interface into the system.
			
 
				+- Could this build on the mapping and joining fields for the roxie browser.
			
 
				+
			
 
				+Dali hot/warm failover redundancy
			
 
				+
			
 
				+Optimizations
			
 
				+=============
			
 
				+How can we improve the performance of the system
			
 
				+
			
 
				+Optimize the complexity of the graphs that are run
			
 
				+* Code generator could track how sort orders are used and optimize the activities generated.
			
 
				+
			
 
				+Dynamic resourcing
			
 
				+* Scope for Thor to select different implementations based on input data size.
			
 
				+* Dynamic row caching / combine multiple subgraphs as one.
			
 
				+
			
 
				+New activity implementations
			
 
				+* If the data is held on a lustre file system, is there scope for new sort activities?
			
 
				+
			
 
				+Reduce data transfer
			
 
				+* Local and remote helpers would significantly reduce the amount of data transferred for roxie keyed joins (and other activities).
			
 
				+
			
 
				+Row representation
			
 
				+* More intelligent row serialize/deserialize (e.g. on index read slaves)
			
 
				+* Enable packing/alignment on rows.
			
 
				+* Allow sizes to be separate from their strings/datasets.
			
 
				+* Datasets and strings with smaller record counts.
			
 
				+* Link counted child rows (not just datasets)
			
 
				+* Maxcount(1) optimization
			
 
				+* Link counted strings
			
 
				+
			
 
				+Speed up eclcc
			
 
				+* For some queries (e.g., NCF) a large proportion of the development time must be spent compiling and deploying the queries.
			
 
				+
			
 
				+BCD library
			
 
				+* Remove the critical section by using a thread variable for the stack
			
 
				+* Improve code for the basic operations.
			
 
				+* Consider switch to a non-stack implementation.
			
 
				+
			
 
				+Conditional actions in graphs
			
 
				+* Most of the work has been done for this we should explicitly aim to support and enable it.
			
 
				+* Need to finish WHEN support (e.g., implicit field projection from side-effects).
			
 
				+
			
 
				+Graph representation
			
 
				+* Compress it
			
 
				+* Don't include the graphs in the SDS, retrieve from the workunit instead.
			
 
				+
			
 
				+Improve implicit field projection
			
 
				+- Currently doesn't optimize nested record structures.
			
 
				+
			
 
				+Improve the common sub expression processing in eclcc.
			
 
				+
			
 
				+Improve generation of conditional expressions.
			
 
				+
			
 
				+Implement costing for expressions and activities
			
 
				+* Would improve whether it was worth reordering, substituting etc.
			
 
				+
			
 
				+Variable in <set>
			
 
				+* Should sometimes use a hash table.
			
 
				+* Special case of an associated array [#50371]  E.g., MAP(dataset, { keyed } [,{extra}]);
			
 
				+
			
 
				+Fix resourcing of inline datasets
			
 
				+* Currently the CSE for datasets executed within a transform is poor.
			
 
				+
			
 
				+Optimize overly conditional code
			
 
				+* Often occurs when converting procedural code to ECL.  Too many guard conditions are added to the ECL.