- This file contains an outline of some of the different areas we might pursue in the future. Once the plans become
- concrete they will be added as issues with associated milestones.
- Technology changes
- ==================
- What technology changes do we need to ensure we adapt to?
- Many cores
- * Better use of lots of threads.
- * Parallel PARSE, PROJECT and other activities that are cpu intensive.
- * Dynamically adapt to the number of available threads.
- * Ensure multithreading code is efficient.
- * Reduce critical sections and locking.
- * More read ahead threads.
- * Experiment with Sequential blocked read ahead.
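As a rough illustration of adapting to the number of available threads, the sketch below picks a worker count from the reported hardware concurrency and splits a row range between the workers. The names `chooseWorkerCount`, `splitRows` and the `minRowsPerWorker` parameter are hypothetical and are not taken from the existing engines; it is a minimal sketch, not a proposed design.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: decide how many workers to use for rowCount rows.
// std::thread::hardware_concurrency() may return 0 when the value is unknown,
// so that case falls back to a single worker.
inline std::size_t chooseWorkerCount(std::size_t rowCount,
                                     std::size_t minRowsPerWorker,
                                     std::size_t available = std::thread::hardware_concurrency())
{
    if (available == 0)
        available = 1;
    if (minRowsPerWorker == 0)
        minRowsPerWorker = 1;
    // Do not start more workers than there is useful work for.
    std::size_t useful = std::max<std::size_t>(1, rowCount / minRowsPerWorker);
    return std::min(available, useful);
}

// Hypothetical helper: split [0, rowCount) into half-open ranges, one per worker.
// Any remainder is spread over the first ranges so sizes differ by at most one.
inline std::vector<std::pair<std::size_t, std::size_t>> splitRows(std::size_t rowCount,
                                                                   std::size_t workers)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (workers == 0)
        workers = 1;
    std::size_t base = rowCount / workers;
    std::size_t extra = rowCount % workers;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < workers; ++i)
    {
        std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

An activity such as a parallel PROJECT could call `chooseWorkerCount` when it starts and hand each range from `splitRows` to one thread. Whether a fixed split or a work-stealing scheme suits the engines better is left open here.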
- More memory
- * Dynamic resourcing in Thor.
- * Dynamic caching of spill files.
- SSDs
- * What do the greater seek rates allow us to do differently?
- Increased cost of power
- Improved network speed
- * The gap between network speed and disk speed is growing. What assumptions does that change?
- Cloud support and Local v Remote files.
- Architectural
- =============
- What changes to the underlying architecture do we want to make, and why?
- Full Windows support.
- * Solve problems with SSH under Windows.
- * Solve problems of how to get/build third party libraries.
- * Equivalent to the init system.
- * Support 64-bit Windows.
- Improvements to measurement and statistics
- * Do we know where all the time is going?
- - In the code generator?
- - In the run time engines?
- * What feedback would help ECL developers?
- Combine roxie and hthor
- * Extend roxie so it becomes a superset of hthor.
- * May require a flag to indicate the mode (e.g., spill handling) but code base should be one.
- * Allow roxie to listen on a dali queue.
- Common up the row reading interfaces between roxie and thor.
- * Makes it easier to pick up the system and work on it.
- * Make it easier to provide utility classes (e.g., readahead, activities).
- Support re-entrant global-graph execution
- * Allow a query to call a C++ function which might then call another executeGraph() call.
- * Opens up possibilities of more flexible code generation.
- * Requires changes to code gen and engines.
- * Requires parent extract supported by global graph execute.
- * Thor???
- Reduce the number of dlls in the system
- * The number of dlls and dependencies can almost certainly be simplified.
- Switch to OpenMPI or other framework
- * Does it provide the capabilities we need?
- * Would it be a suitable replacement for the thor or roxie transport?
- Generate more than one dll for each work unit?
- * Allow more granular query compiling.
- * Reduce the data required by on-demand roxie slaves.
- * Allow remote filtering and projecting.
- Extensible system
- =================
- What changes can we make to make it easier for third parties to extend?
- What benefit might we get?
- File formats
- * Indexes
- - Enable optimizations to our own formats.
- - New implementations.
- - Interfacing with external implementations.
- * Files
- - Compressed
- - Hadoop
- - Embedded resources
- File locations/sources
- * Rationalise the current logical filename syntax, and extend it.
- Repositories
- - Allow more flexibility and extensibility in the sources of files used as input to eclcc.
- * Create a cleaner interface for accessing hierarchical ECL source.
- * Building directly from GitHub
- * Tar
- * Compressed archive
- C++ integration
- * Make it easier to link libraries/blocks of C++ into programs.
- * Improve support for C++ attributes (e.g., dependencies between attributes).
- * Streaming of datasets to and from C++.
- * Using third party libraries.
- Activities
- * Make it easier for 3rd parties to extend the activities in the system.
- * Allow user-defined C++ activities.
- New capabilities
- ================
- What capabilities can we add to the system to make it solve more problems?
- Better support for UTF8
- * Current support isn't even documented.
- * PARSE and a few other places (indexes?) need some more work.
- Unicode support
- * Better support in indexes.
- * Expose word break semantics and support in PARSE.
- * UTF8 DFAs in PARSE.
- Thor debugger.
- Support recursion
- New activities
- * DATASET(count, transform(counter))
- More problem domains
- - What would be required to support some of the following domains better?
- * Biological/Genetic.
- * Matrix processing/computationally intensive.
- * Unicode free text processing.
- * Better support for SAS/R.
- * What hooks can we provide to make it easy for 3rd parties to implement?
- Enterprise
- ==========
- What extensions can we make to the enterprise system?
- Repository
- * Fix existing repository implementation - particularly cache issues
- * Simplify and improve caching capabilities.
- * Fix the current directory scanning.
- * Allow more repository types (see extensibility).
- Legacy support tools
- * Support tool to add imports, and clean up other changes that are required.
- Encryption at rest
- * Do we need it?
- * How do we safely distribute the keys with the current system?
- Redundancy
- * Should we support 3 or more way redundancy?
- Clean up query deployment
- * Finish the query sets
- Better testing
- * Regression suite could do with a thorough overhaul.
- * Ideally some better coverage testing.
- * Some queries that can be run as a benchmark for the system speed.
- * File spray tests.
- Streamed input support in Thor.
- * Following on from the discussion with David and Dermot.
- SQL interface into the system.
- - Could this build on the mapping and joining fields for the roxie browser?
- Dali hot/warm failover redundancy
- Optimizations
- =============
- How can we improve the performance of the system?
- Optimize the complexity of the graphs that are run
- * Code generator could track how sort orders are used and optimize the activities generated.
- Dynamic resourcing
- * Scope for Thor to select different implementations based on input data size.
- * Dynamic row caching / combine multiple subgraphs as one.
- New activity implementations
- * If the data is held on a lustre file system, is there scope for new sort activities?
- Reduce data transfer
- * Local and remote helpers would significantly reduce the amount of data transferred for roxie keyed joins (and other activities).
- Row representation
- * More intelligent row serialize/deserialize (e.g. on index read slaves)
- * Enable packing/alignment on rows.
- * Allow sizes to be separate from their strings/datasets.
- * Datasets and strings with smaller record counts.
- * Link counted child rows (not just datasets)
- * Maxcount(1) optimization
- * Link counted strings
- Speed up eclcc
- * For some queries (e.g., NCF) a large proportion of the development time must be spent compiling and deploying the queries.
- BCD library
- * Remove the critical section by using a thread variable for the stack
- * Improve code for the basic operations.
- * Consider switching to a non-stack implementation.
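The idea of replacing the critical section with a per-thread stack can be sketched as follows. This is only an illustration under assumptions: `Decimal`, `DecimalStack`, `threadStack` and the push/add/pop helpers are hypothetical stand-ins, not the interfaces of the existing BCD library, and the value is held as a plain integer rather than real BCD digits.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a decimal value; real BCD digits are not modelled.
struct Decimal
{
    std::int64_t scaled;
};

// A fixed-depth operand stack with no locking of its own.
class DecimalStack
{
public:
    void push(Decimal d)
    {
        if (depth_ == slots_.size())
            throw std::overflow_error("decimal stack overflow");
        slots_[depth_++] = d;
    }
    Decimal pop()
    {
        if (depth_ == 0)
            throw std::underflow_error("decimal stack underflow");
        return slots_[--depth_];
    }
    std::size_t depth() const { return depth_; }

private:
    std::array<Decimal, 32> slots_{};
    std::size_t depth_ = 0;
};

// Each thread gets its own stack instance, so callers need no critical section.
inline DecimalStack &threadStack()
{
    thread_local DecimalStack stack;
    return stack;
}

// Hypothetical stack-style operations built on the per-thread stack.
inline void pushDecimal(std::int64_t v) { threadStack().push(Decimal{v}); }

inline void addDecimals()
{
    Decimal rhs = threadStack().pop();
    Decimal lhs = threadStack().pop();
    threadStack().push(Decimal{lhs.scaled + rhs.scaled});
}

inline std::int64_t popDecimal() { return threadStack().pop().scaled; }
```

Because the stack is `thread_local`, a sequence of push, operate and pop calls on one thread cannot interleave with another thread's sequence. The cost of accessing thread-local storage on each call would need measuring against the current locking before committing to this approach.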
- Conditional actions in graphs
- * Most of the work has been done for this; we should explicitly aim to support and enable it.
- * Need to finish WHEN support (e.g., implicit field projection from side-effects).
- Graph representation
- * Compress it
- * Don't include the graphs in the SDS, retrieve from the workunit instead.
- Improve implicit field projection
- - Currently doesn't optimize nested record structures.
- Improve the common sub expression processing in eclcc.
- Improve generation of conditional expressions.
- Implement costing for expressions and activities
- * Would improve decisions about whether reordering, substituting etc. is worthwhile.
- Variable in <set>
- * Should sometimes use a hash table.
- * Special case of an associative array [#50371] E.g., MAP(dataset, { keyed } [,{extra}]);
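To make the "sometimes use a hash table" point concrete, here is a minimal sketch of a membership test that scans small sets linearly and switches to a hash table above a size threshold. The class name `StringSetMatcher` and the default threshold of 8 are assumptions made for illustration; the right cut-over point would have to come from measurement, and this is not how the code generator currently implements IN.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical matcher for "value IN set" over string values.
class StringSetMatcher
{
public:
    // hashThreshold is an assumed tuning knob: sets larger than this are hashed.
    explicit StringSetMatcher(std::vector<std::string> values, std::size_t hashThreshold = 8)
        : values_(std::move(values))
    {
        if (values_.size() > hashThreshold)
            hashed_.insert(values_.begin(), values_.end());
    }

    bool contains(const std::string &v) const
    {
        // Large sets: expected constant-time lookup in the hash table.
        if (!hashed_.empty())
            return hashed_.count(v) != 0;
        // Small sets: a linear scan avoids the hashing overhead.
        return std::find(values_.begin(), values_.end(), v) != values_.end();
    }

    bool usesHash() const { return !hashed_.empty(); }

private:
    std::vector<std::string> values_;
    std::unordered_set<std::string> hashed_;
};
```

The same shape extends to the associative array case by storing a payload alongside each key, for example with `std::unordered_map` in place of `std::unordered_set`.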
- Fix resourcing of inline datasets
- * Currently the CSE for datasets executed within a transform is poor.
- Optimize overly conditional code
- * Often occurs when converting procedural code to ECL. Too many guard conditions are added to the ECL.