  1. This file contains an outline of some of the different areas we might pursue in the future. Once the plans become
  2. concrete they will be added as issues with associated milestones.
  3. Technology changes
  4. ==================
  5. What technology changes do we need to ensure we adapt to?
  6. Many cores
  7. * Better use of lots of threads.
  8. * Parallel PARSE, PROJECT and other activities that are cpu intensive.
  9. * Dynamically adapt to the number of available threads.
  10. * Ensure multithreading code is efficient.
  11. * Reduce critical sections and locking.
  12. * More read ahead threads.
  13. * Experiment with Sequential blocked read ahead.
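As a rough illustration of adapting to the number of available threads, the sketch below sizes a worker pool from the hardware and splits a job into one contiguous block per worker. The names chooseWorkerCount and splitWork are hypothetical and do not exist in the platform; only std::thread::hardware_concurrency() is a real library call, and it may report 0 when the count is unknown.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: choose a worker count from the hardware, never more
// than the number of items and never zero.
inline unsigned chooseWorkerCount(std::size_t numItems,
                                  unsigned available = std::thread::hardware_concurrency())
{
    if (available == 0)
        available = 1; // hardware_concurrency() may return 0 when unknown
    const std::size_t capped = std::min<std::size_t>(available, numItems);
    return static_cast<unsigned>(std::max<std::size_t>(capped, 1));
}

// Hypothetical helper: split [0, numItems) into contiguous half-open blocks,
// one per worker, with any remainder spread over the first blocks.
inline std::vector<std::pair<std::size_t, std::size_t>> splitWork(std::size_t numItems,
                                                                  unsigned workers)
{
    if (workers == 0)
        workers = 1;
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    const std::size_t base = numItems / workers;
    const std::size_t extra = numItems % workers;
    std::size_t start = 0;
    for (unsigned w = 0; w < workers; w++)
    {
        const std::size_t len = base + (w < extra ? 1 : 0);
        blocks.emplace_back(start, start + len);
        start += len;
    }
    return blocks;
}
```

Each block could then be handed to its own thread; passing an explicit second argument to chooseWorkerCount makes the choice easy to test or override from configuration.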
  14. More memory
  15. * Dynamic resourcing in Thor.
  16. * Dynamic caching of spill files.
  17. SSDs
  18. * What do the greater seek rates allow us to do differently?
  19. Increased cost of power
  20. Improved network speed
  21. * The gap between network speed and disk speed is growing. What assumptions does that change?
  22. Cloud support and Local v Remote files.
  23. Architectural
  24. =============
  25. What changes to the underlying architecture do we want to make? Why?
  26. Full Windows support.
  27. * Solve problems with SSH under Windows.
  28. * Solve problems of how to get/build third party libraries.
  29. * Equivalent to the init system.
  30. * Support 64-bit Windows.
  31. Improvements to measurement and statistics
  32. * Do we know where all the time is going?
  33. - In the code generator?
  34. - In the run-time engines?
  35. * What feedback would help ECL developers?
  36. Combine roxie and hthor
  37. * Extend roxie so it becomes a superset of hthor.
  38. * May require a flag to indicate the mode (e.g., spill handling) but code base should be one.
  39. * Allow roxie to listen on a dali queue.
  40. Common up the row reading interfaces between roxie and thor.
  41. * Makes it easier to pick up the system and work on it.
  42. * Make it easier to provide utility classes (e.g., readahead, activities).
  43. Support re-entrant global-graph execution
  44. * Allow a query to call a C++ function which might then call another executeGraph() call.
  45. * Opens up possibilities of more flexible code generation.
  46. * Requires changes to code gen and engines.
  47. * Requires parent extract supported by global graph execute.
  48. * Thor???
  49. Reduce the number of dlls in the system
  50. * The number of dlls and dependencies can almost certainly be simplified.
  51. Switch to OpenMPI or other framework
  52. * Does it provide the capabilities we need?
  53. * Would it be a suitable replacement for thor or roxie transport?
  54. Generate more than one dll for each work unit?
  55. * Allow more granular query compiling.
  56. * Reduce the data required by on-demand roxie slaves.
  57. * Allow remote filtering and projecting.
  58. Extensible system
  59. =================
  60. What changes can we make to make it easier for third parties to extend?
  61. What benefit might we get?
  62. File formats
  63. * Indexes
  64. - Enable optimizations to our own formats.
  65. - New implementations.
  66. - Interfacing with external implementations.
  67. * Files
  68. - Compressed
  69. - Hadoop
  70. - Embedded resources
  71. File locations/sources
  72. * Rationalise the current logical filename syntax, and extend it.
  73. Repositories
  74. - Allow more flexibility and extensibility in the sources of files used as input to eclcc.
  75. * Create a cleaner interface for accessing hierarchical ECL source.
  76. * Building directly from Github
  77. * Tar
  78. * Compressed archive
  79. C++ integration
  80. * Make it easier to link libraries/blocks of C++ into programs.
  81. * Improve support for C++ attributes (e.g., dependencies between attributes).
  82. * Streaming of datasets to and from C++.
  83. * Using third party libraries.
  84. Activities
  85. * Make it easier for 3rd parties to extend the activities in the system.
  86. * Allow user C++ activities to be defined.
  87. New capabilities
  88. ================
  89. What capabilities can we add to the system to make it solve more problems?
  90. Better support for UTF8
  91. * Current support isn't even documented.
  92. * PARSE and a few other places (indexes?) need some more work.
  93. Unicode support
  94. * Better support in indexes.
  95. * Expose word break semantics and support in PARSE.
  96. * UTF8 DFAs in PARSE.
  97. Thor debugger.
  98. Support recursion
  99. New activities
  100. * DATASET(count, transform(counter))
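One way to picture the proposed DATASET(count, transform(counter)) activity is as a generator that calls the transform once per row. The C++ sketch below is illustrative only: generateDataset is a hypothetical name, and it assumes the counter runs from 1 to count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of DATASET(count, transform(counter)): build `count` rows
// by calling the transform with counter values 1..count (1-based is an assumption).
template <typename Row, typename Transform>
std::vector<Row> generateDataset(std::uint64_t count, Transform transform)
{
    std::vector<Row> rows;
    rows.reserve(count);
    for (std::uint64_t counter = 1; counter <= count; counter++)
        rows.push_back(transform(counter));
    return rows;
}
```

For example, generateDataset<std::uint64_t>(4, [](std::uint64_t c) { return c * c; }) would produce four rows holding the squares of 1 to 4.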
  101. More problem domains
  102. - What would be required to support some of the following domains better?
  103. * Biological/Genetic.
  104. * Matrix processing/computationally intensive.
  105. * Unicode free text processing.
  106. * Better support for SAS/R.
  107. * What hooks can we provide to make it easy for 3rd parties to implement?
  108. Enterprise
  109. ==========
  110. What extensions can we make to the enterprise system?
  111. Repository
  112. * Fix existing repository implementation - particularly cache issues
  113. * Simplify and improve caching capabilities.
  114. * Fix the current directory scanning.
  115. * Allow more repository types (see extensibility)
  116. Legacy support tools
  117. * Support tool to add imports, and clean up other changes that are required.
  118. Encryption at rest
  119. * Do we need it?
  120. * How do we safely distribute the keys with the current system?
  121. Redundancy
  122. * Should we support 3 or more way redundancy?
  123. Clean up query deployment
  124. * Finish the query sets
  125. Better testing
  126. * Regression suite could do with a thorough overhaul.
  127. * Ideally some better coverage testing.
  128. * Some queries that can be run as a benchmark for the system speed.
  129. * File spray tests.
  130. Streamed input support in Thor.
  131. * Following on from the discussion with David and Dermot.
  132. SQL interface into the system.
  133. - Could this build on the mapping and joining fields for the roxie browser?
  134. Dali hot/warm failover redundancy
  135. Optimizations
  136. =============
  137. How can we improve the performance of the system?
  138. Optimize the complexity of the graphs that are run
  139. * Code generator could track how sort orders are used and optimize the activities generated.
  140. Dynamic resourcing
  141. * Scope for Thor to select different implementations based on input data size.
  142. * Dynamic row caching / combine multiple subgraphs as one.
  143. New activity implementations
  144. * If the data is held on a Lustre file system, is there scope for new sort activities?
  145. Reduce data transfer
  146. * Local and remote helpers would significantly reduce the amount of data transferred for roxie keyed joins (and other activities).
  147. Row representation
  148. * More intelligent row serialize/deserialize (e.g. on index read slaves)
  149. * Enable packing/alignment on rows.
  150. * Allow sizes to be separate from their strings/datasets.
  151. * Datasets and strings with smaller record counts.
  152. * Link counted child rows (not just datasets)
  153. * Maxcount(1) optimization
  154. * Link counted strings
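To make the "link counted strings" idea concrete, here is a minimal sketch of a string shared through an explicit link count rather than copied. LinkedString and its methods are hypothetical names chosen for illustration; they are not the platform's row or string classes, and a real version would probably store the characters inline instead of wrapping std::string.

```cpp
#include <atomic>
#include <string>

// Hypothetical link-counted string: readers share one copy of the text and
// the last release() frees it.
class LinkedString
{
public:
    static LinkedString * create(const std::string & text) { return new LinkedString(text); }

    // Take an extra reference.
    void link() const { links.fetch_add(1, std::memory_order_relaxed); }

    // Drop a reference; returns true if this call freed the string.
    bool release() const
    {
        if (links.fetch_sub(1, std::memory_order_acq_rel) == 1)
        {
            delete this;
            return true;
        }
        return false;
    }

    const std::string & str() const { return text; }
    unsigned linkCount() const { return links.load(std::memory_order_relaxed); }

private:
    explicit LinkedString(const std::string & t) : text(t) {}
    ~LinkedString() = default; // only release() may destroy the object

    const std::string text;
    mutable std::atomic<unsigned> links{1}; // the creator holds the first link
};
```

Passing such a string between activities costs one atomic increment rather than a copy of the data, which is the saving the bullet hints at.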
  155. Speed up eclcc
  156. * For some queries (e.g., NCF) a large proportion of the development time must be spent compiling and deploying the queries.
  157. BCD library
  158. * Remove the critical section by using a thread variable for the stack
  159. * Improve code for the basic operations.
  160. * Consider switch to a non-stack implementation.
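The first bullet can be sketched as follows: if each thread owns its own operand stack, pushes and pops need no critical section. This is only an illustration under stated assumptions. DecimalValue, DecimalStack, threadStack and addScaled are hypothetical names, and the value type is a plain scaled integer standing in for a real BCD representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a decimal value; a real BCD type would hold digits.
struct DecimalValue
{
    std::int64_t scaled;
};

// A small fixed-depth operand stack with no locking of its own.
class DecimalStack
{
public:
    void push(DecimalValue v)
    {
        assert(depth < maxDepth);
        values[depth++] = v;
    }
    DecimalValue pop()
    {
        assert(depth > 0);
        return values[--depth];
    }
    // Replace the top two operands with their sum.
    void add()
    {
        DecimalValue right = pop();
        DecimalValue left = pop();
        push({ left.scaled + right.scaled });
    }
    std::size_t size() const { return depth; }

private:
    static constexpr std::size_t maxDepth = 32;
    DecimalValue values[maxDepth];
    std::size_t depth = 0;
};

// One stack per thread, so callers never contend on a shared lock.
inline DecimalStack & threadStack()
{
    thread_local DecimalStack stack;
    return stack;
}

// Example operation that uses only the calling thread's stack.
inline std::int64_t addScaled(std::int64_t a, std::int64_t b)
{
    DecimalStack & stack = threadStack();
    stack.push({ a });
    stack.push({ b });
    stack.add();
    return stack.pop().scaled;
}
```

The trade-off is one stack of storage per thread in exchange for removing the lock from every decimal operation.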
  161. Conditional actions in graphs
  162. * Most of the work for this has been done; we should explicitly aim to support and enable it.
  163. * Need to finish WHEN support (e.g., implicit field projection from side-effects).
  164. Graph representation
  165. * Compress it
  166. * Don't include the graphs in the SDS, retrieve from the workunit instead.
  167. Improve implicit field projection
  168. - Currently doesn't optimize nested record structures.
  169. Improve the common sub expression processing in eclcc.
  170. Improve generation of conditional expressions.
  171. Implement costing for expressions and activities
  172. * Would improve decisions about whether reordering, substituting, etc. is worthwhile.
  173. Variable in <set>
  174. * Should sometimes use a hash table.
  175. * Special case of an associative array [#50371], e.g., MAP(dataset, { keyed } [,{extra}]);
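The "should sometimes use a hash table" point can be illustrated with a membership test that scans small sets linearly and switches to a hash set for larger ones. SetMembership is a hypothetical class, and the threshold of 16 is an arbitrary assumption rather than a measured crossover point.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical membership test for "value IN set": linear scan for small
// sets, hash lookup once the set reaches a size threshold.
class SetMembership
{
public:
    explicit SetMembership(std::vector<int> values, std::size_t hashThreshold = 16)
        : items(std::move(values))
    {
        if (items.size() >= hashThreshold)
            hashed.insert(items.begin(), items.end());
    }

    bool contains(int value) const
    {
        if (!hashed.empty())
            return hashed.count(value) != 0; // expected O(1) lookup
        return std::find(items.begin(), items.end(), value) != items.end(); // O(n) scan
    }

    bool usesHash() const { return !hashed.empty(); }

private:
    std::vector<int> items;
    std::unordered_set<int> hashed;
};
```

A code generator could make the same choice at compile time when the set is constant, or at run time when its size is only known then.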
  176. Fix resourcing of inline datasets
  177. * Currently the CSE for datasets executed within a transform is poor.
  178. Optimize overly conditional code
  179. * Often occurs when converting procedural code to ECL. Too many guard conditions are added to the ECL.