========================
Metrics Framework Design
========================

************
Introduction
************

This document describes the design of a metrics framework that allows HPCC Systems components to
implement a metric collection strategy. Metrics provide the following functionality:

* Alerts and monitoring

  An important DevOps function is to monitor the cluster and provide alerts when
  problems are detected. Aggregated metric values from multiple sources provide
  the necessary data to build a complete picture of cluster health that drives
  monitoring and alerts.

* Scaling

  As described above, aggregated metric data is also used to respond dynamically to changing
  cluster demands and load. Metrics provide the monitoring capability needed to react and take
  action.

* Fault diagnosis and resource monitoring

  Metrics provide historical data useful in diagnosing problems by profiling how demand and
  usage patterns may change prior to a fault. Predictive analysis can also be applied.

* Analysis of jobs/workunits and profiling

  With proper instrumentation, a robust dynamic metric strategy can track workunit
  processing. Internal problems with queries should be diagnosed from deep drill-down logging.

The document consists of several sections in order to provide requirements as well as
the design of framework components.
Definitions
===========

Some definitions are useful.

Metric
  A measurement defined by a component that represents an internal state that is useful in a system
  reliability engineering function. In the context of the framework, a metric is an object representing
  the above.

Metric Value
  The current value of a metric.

Metric Updating
  The component task of updating metric state.

Collection
  A framework process of selecting relevant metrics based on configuration and then retrieving
  their values.

Reporting
  A framework process of converting values obtained during a collection into a format suitable for
  ingestion by a collection system.

Trigger
  What causes the collection of metric values.

Collection System
  The store for metric values generated during the reporting framework process.
*************
Use Scenarios
*************

This section describes how components expect to use the framework. It is not a complete list of all
requirements but rather a sample.

Roxie
=====

Roxie needs to keep a count of many different internal values. Some examples are:

* Disk type operations such as seeks and reads

* Execution totals

  Roxie needs to track totals such as the number of successes and failures,
  as well as breaking some counts down by individual reason. For example, failures
  may need to be categorized as

  * Busy
  * Timeout
  * Bad input

  or even by priority (high, low, SLA, etc.)

* Current operational levels such as the length of internal queues

* The latency of operations such as queue results, agent responses, and gateway responses

Roxie also has the need to track internal memory usage beyond the pod/system level capabilities.
Tracking the state of its large fixed memory pool is necessary.

The Roxie buddy system also must track how often and who is completing requests. The "I Beat You To It"
set of metrics must be collected and exposed in order to detect pending node failure. While specific
action on these counts is not known up front, it appears clear that these values are useful and should
be collected.

There does not appear to be a need for creating and destroying metrics dynamically. The set of metrics
is most likely to be created at startup and remain active through the life of the Roxie. If, however,
stats collection seeps into the metrics framework, dynamic creation and destruction of stats metrics is
a likely requirement.
ESP
===

There are some interesting decisions with respect to ESP and collection of metrics. Different
applications within ESP present different use cases for collection. Ownership of a given task drives
some of these use cases. Take workunit queues. If ownership of the task, with respect to metrics, is
WsWorkunits, then use cases are centric to that component. However, if agents listening on the queue
are to report metrics, then a different set of use cases emerges. It is clear that additional work is
needed to establish clear ownership of metrics gathered by ESP and/or the tasks it performs.

ESP needs to report the *activeTransactions* value from the TxSummary class(es). This gives an
indication of how busy the ESP is in terms of client requests.

Direct measurement of response time in requests may not be useful since the type of request causes
different execution paths within ESP that are expected to take widely varying amounts of time. Creation
of metrics for each method is not recommended. However, two possible solutions are to a) create a
metric for each request type, or b) use a histogram to measure response time ranges. Another option
mentioned redefines the meaning of a bucket in a histogram. Instead of a numeric distribution,
each bucket represents a unique subtask within an overall "metric" representing a measured operation.
This should be explored, whether for operational or developmental purposes.

For tracking specific queries and their health, the feeling is that logging can accomplish this better
than metrics since the list of queries to monitor will vary between clusters. Additionally, operational
metrics solving the cases mentioned above will give a view into the overall health of ESP, which will
affect the execution of queries. Depending on actions taken based on these metrics, scaling may solve
overload conditions to keep cluster responsiveness acceptable.

For Roxie a workunit operates as a service. Measuring service performance using a histogram to capture
response times as a distribution may be appropriate. Extracting the 95th percentile of response time
may be useful as well.

There are currently no use cases requiring consistency between values of different metrics.

At this time the only concrete metric identified is the number of requests received. As the framework
design progresses and ESP is instrumented, the list will grow.

Dali Use Cases
==============

From information gathered, Dali plans to keep counts and rates for many of the items it manages.
****************
Framework Design
****************

This section covers the design and architecture of the framework. It discusses the main areas of the
design, the interactions between each area, and an overall process model of how the framework operates.

The framework consists of three major areas: metrics, sinks, and the glue logic. These areas work
together with the platform and the component to provide a reusable metrics collection function.

Metrics represent the quantifiable component state measurements used to track and assess the status
of the component. Metrics are typically scalar values that are easily aggregated by a collection system.
Aggregated values provide the necessary input to take component and cluster actions such as scaling
up and down. The component is responsible for creating metrics and instrumenting the code. The
framework provides the support for collecting and reporting the values. Metrics provide the following:

* Simple methods for the component to update the metric
* Simple methods for the framework to retrieve metric value(s)
* Handling of all synchronization between updating and retrieving metric values

In addition, the framework provides the support for retrieving values so that the component does not
participate in metric reporting. The component simply creates the metrics it needs, then instruments
the component to update the metric whenever its state changes. For example, the component may create
a metric that counts the total number of requests received. Then, wherever the component
receives a request, a corresponding update to the count is added. Nowhere in the component is any
code added to retrieve the count, as that is handled by the framework.
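This division of labor can be sketched with a minimal, self-contained counter. This is an
illustrative model only, not the jlib implementation; the class and method names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Illustrative sketch only -- not the actual jlib classes.
// The component calls inc(); only the framework calls queryValue() during
// collection. The atomic value provides the synchronization the framework
// promises, so the component adds no locking of its own.
class SketchCounterMetric
{
public:
    explicit SketchCounterMetric(std::string _name) : name(std::move(_name)) {}
    void inc(uint64_t delta) { value.fetch_add(delta, std::memory_order_relaxed); }  // component side
    uint64_t queryValue() const { return value.load(std::memory_order_relaxed); }    // framework side
private:
    std::string name;
    std::atomic<uint64_t> value{0};
};
```

The component only ever touches ``inc``; retrieval is invisible to it.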
Sinks provide a pluggable interface to hide the specifics of collection systems so that the metrics
framework is independent of those dependencies. Sinks:

* Operate independently of other sinks in the system
* Convert metric native values into collection system specific measurements and reports
* Drive the collection and reporting processes

The third area of the framework is the glue logic, referred to as the *MetricsManager*. It manages
the metrics system for the component. It provides the following:

* Handles framework initialization
* Loads sinks as required
* Manages the list of metrics for the component
* Handles collection and reporting with a set of convenience methods used by sinks

The framework is designed to be instantiated into a component as part of its process and address space.
All objects instantiated as part of the framework are owned by the component and are not shareable with
any other component, whether local or remote. Any coordination or consistency requirements that may
arise in the implementation of a sink shall be the sole responsibility of the sink.
************************
Framework Implementation
************************

The framework is implemented within jlib. The following sections describe each area of the
framework.

Metrics
=======

Components use metrics to measure their internal state. Metrics can represent everything from the
number of requests received to the average length of time some value remains cached. Components are
responsible for creating and updating metrics for each measured state. The framework shall provide a
set of metrics designed to cover the majority of component measurement requirements. All metrics share
a common interface to allow the framework to manage them in a common way.

To meet the requirement to manage metrics independent of the underlying metric state, all metrics
implement a common interface. Each metric then adds its specific methods to update and retrieve
internal state. Generally the component uses the update method(s) to update state, and the framework
uses retrieval methods to get the current state when reporting. The metric ensures synchronized access.

For components that already have an implementation that tracks a metric, the framework provides a way
to instantiate a custom metric. The custom metric allows the component to leverage the existing
implementation and give the framework access to the metric value for collection and reporting. Note
that custom metrics only support simple scalar metrics such as a counter or a gauge.
Sinks
=====

The framework defines a sink interface to support the different requirements of collection systems.
Examples of collection systems are Prometheus, Datadog, and Elasticsearch. Each has different
requirements for how and when measurements are ingested. The following are examples of different
collection system requirements:

* Polled vs periodic
* Single measurement vs multiple reports
* Report format (JSON, text, etc.)
* Push vs pull

Sinks are responsible for two main functions: initiating a collection and reporting
measurements to the collection system. The *Metrics Reporter* provides the support to complete
these functions.

The sink encapsulates all of the collection system requirements, providing a pluggable architecture
that isolates components from these differences. The framework supports multiple sinks concurrently,
each operating independently.

Instrumented components are not aware of the sink or sinks in use. Sinks can be changed without
requiring changes to a component. Therefore, components are independent of the collection system(s)
in use.
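To make the pluggable boundary concrete, the following is a minimal sketch of what a sink
abstraction might look like. The interface, class names, and measurement representation here are
hypothetical illustrations of the design, not the actual jlib sink interface.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical illustration of the sink boundary -- not the jlib interface.
// The framework hands each sink the collected name/value measurements; the
// sink alone knows how (and when) its collection system ingests them.
class SketchSink
{
public:
    virtual ~SketchSink() = default;
    // Called with the results of a collection; the sink converts and forwards
    // the measurements in whatever format its collection system requires.
    virtual void report(const std::map<std::string, uint64_t> &measurements) = 0;
};

// A trivial sink that just records the latest value it was asked to report.
class RecordingSink : public SketchSink
{
public:
    void report(const std::map<std::string, uint64_t> &measurements) override
    {
        for (const auto &m : measurements)
            latest[m.first] = m.second;
    }
    std::map<std::string, uint64_t> latest;
};
```

Because the component only ever sees the abstract boundary, a Prometheus sink and an Elasticsearch
sink could be swapped (or run side by side) without touching instrumented code.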
Metrics Reporter
================

The metrics reporter class provides all of the common functions to bind together the component,
the metrics it creates, and the sinks to which measurements are reported. It is responsible for
the following:

* Initialization of the framework
* Managing the metrics created by the component
* Handling collection and reporting as directed by configured sinks

Metrics Implementations
=======================

The sections that follow discuss metric implementations.

Counter Metric
--------------

A counter metric is a monotonically increasing value that "counts" the total occurrences of some event.
Examples include the number of requests received, or the number of cache misses. Once created, the
component instruments the code with updates to the count whenever appropriate.

Gauge Metric
------------

A gauge metric is a continuously updated value representing the current state of an interesting value
in the component. For example, the amount of memory used in an internal buffer, or the number of
requests waiting on a queue. A gauge metric may increase or decrease in value as needed. Reading the
value of a gauge is a stateless operation in that there are no dependencies on the previous reading.
The value returned shall always be the current state.

Once created, the component shall update the gauge anytime the state of what is measured changes.
The metric shall provide methods to increase and decrease the value. The sink reads the value during
collection and reporting.
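The increase/decrease behavior can be sketched with a minimal, self-contained gauge. As with the
counter sketch, this is an illustrative model with hypothetical names; the real jlib class may differ.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical gauge sketch -- not the jlib implementation.
// Unlike a counter, a gauge moves in both directions; reading it is
// stateless and always yields the current level.
class SketchGaugeMetric
{
public:
    void increase(int64_t delta) { value.fetch_add(delta, std::memory_order_relaxed); }
    void decrease(int64_t delta) { value.fetch_sub(delta, std::memory_order_relaxed); }
    int64_t queryValue() const { return value.load(std::memory_order_relaxed); }
private:
    std::atomic<int64_t> value{0};
};
```

A queue-length gauge, for instance, would call ``increase`` on enqueue and ``decrease`` on dequeue.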
Custom Metric
-------------

A custom metric is a class that allows a component to leverage existing metrics. The component creates
an instance of a custom metric (a templated class) and passes a reference to the underlying metric
value. When collection is performed, the custom metric simply reads the value of the metric using the
reference provided during construction. The component maintains full responsibility for updating the
metric value as the custom metric class provides no update methods. The component is also responsible
for ensuring atomic access to the value if necessary.
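The read-only relationship between a custom metric and the component's existing state can be
sketched as follows. This is an illustrative model with hypothetical names, not the jlib custom
metric class.

```cpp
#include <cstdint>

// Hypothetical sketch -- not the jlib class. The custom metric holds only a
// reference to state the component already maintains; it offers no update
// methods, so the component remains solely responsible for changing the value
// (and for any atomicity that requires).
template <typename T>
class SketchCustomMetric
{
public:
    explicit SketchCustomMetric(const T &_valueRef) : valueRef(_valueRef) {}
    // Called by the framework during collection.
    uint64_t queryValue() const { return static_cast<uint64_t>(valueRef); }
private:
    const T &valueRef;
};
```

The component keeps updating its own variable exactly as before; the framework simply gains a
window onto it at collection time.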
*************
Configuration
*************

This section discusses configuration. Since Helm charts are capable of combining configuration data
at a global level into a component's specific configuration, the combined configuration takes the
form shown below. Note that as the design progresses it is expected that there will be additions.

::

  component:
    metrics:
      sinks:
      - type: <sink_type>
        name: <sink name>
        settings:
          sink_setting1: sink_setting_value1
          sink_setting2: sink_setting_value2

Where (based on being a child of the current *component*):

metrics
  Metrics configuration for the component

metrics.sinks
  List of sinks defined for the component (may have been combined with global config)

metrics.sinks[].type
  The type for the sink. The type is substituted into the following pattern to determine the lib to load:

  libhpccmetrics<type><shared_object_extension>

metrics.sinks[].name
  A name for the sink.

metrics.sinks[].settings
  A set of key/value pairs passed to the sink when initialized. It should contain information
  necessary for the operation of the sink. Nested YAML is supported. Example settings are the
  Prometheus server name, or the collection period for a periodic sink.
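As an illustration, a filled-in configuration might look like the following. The sink type, name,
and settings shown here are hypothetical example values, not defaults defined by the framework.

::

  component:
    metrics:
      sinks:
      - type: prometheus
        name: myPrometheusSink
        settings:
          port: 8767
          collectionPeriodSeconds: 60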
*************
Metric Naming
*************

Metric names shall follow the convention outlined in this section. Because different collection systems
have different requirements for how metric value reports are generated, naming is split into two parts.
First, each metric is given a base name that describes what the underlying value is. Second, meta data
is assigned to each metric to further qualify the value. For example, a set of metrics may count the
number of requests a component has received. Each metric would have the same base name, but meta data
would separate types of request (GET vs POST), or disposition such as pass or fail.

Base Name
=========

The following convention defines how metric names are formed:

* Names consist of parts separated by a period (.)
* Each part shall use snake case (allows for compound names in each part)
* Each name shall begin with a prefix representing the scope of the metric
* Names for metric types shall be formed as follows (followed by examples):

  Gauges: <scope>.<plural-noun>.<state>
    esp.requests.waiting, esp.status_requests.waiting

  Counters: <scope>.<plural-noun>.<past-tense-verb>
    thor.requests.failed, esp.gateway_requests.queued

  Time: <scope>.<singular-noun>.<state or active-verb>.time
    dali.request.blocked.time, dali.request.process.time
Meta Data
=========

Meta data further qualifies a metric value. This allows metrics to have the same name, but different
scopes or categories. Generally, meta data is only used to further qualify metrics that would have the
same base name, but need further distinction. An example best describes a use case for meta data.
Consider a component that accepts HTTP requests, but needs to track GET and POST requests separately.
Instead of defining metrics with the names *post_requests.received* and *get_requests.received*, the
component creates two metrics with the base name *requests.received* and attaches meta data describing
the request type of POST to one and GET to the other.

Use of meta data allows aggregating both types of requests into a single combined count of received
requests while allowing a breakdown by type.

Meta data is represented as a key/value pair and is attached to the metric by the component during
metric creation. The sink is responsible for converting meta data into useful information for the
collection system during reporting.

The *Component Instrumentation* section covers how meta data is added to a metric.
*************************
Component Instrumentation
*************************

In order to instrument a component for metrics using the framework, a component must include the
metrics header from jlib (*jmetrics.hpp*) and add jlib as a dependent lib (if not already doing so).

The general steps for instrumentation are:

1. Create a metrics reporter object
2. Create metric objects for each internal state to measure and add each to the reporter
3. Add updates to each metric throughout the component wherever metric state changes

The *metrics reporter* is a singleton created using the platform-defined singleton pattern template.
The component must obtain a reference to the reporter, as in the following example:

::

  using namespace hpccMetrics;
  MetricsManager &metricsManager = queryMetricsManager();

Metrics are wrapped by a standard C++ shared pointer. The component is responsible for maintaining a
reference to each shared pointer during the lifetime of the metric. The framework keeps a weak pointer
to each metric and thus does not maintain a reference. The following is an example of creating a
counter metric and adding it to the reporter. The *using namespace* eliminates the need to prefix all
metrics types with *hpccMetrics*. Its use is assumed for all code examples that follow.

::

  std::shared_ptr<CounterMetric> pCounter = std::make_shared<CounterMetric>("metricName", "description");
  metricsManager.add(pCounter);

Note the metric type for both the shared pointer variable and in the *make_shared* template that
creates the metric and returns a shared pointer. Simply substitute other metric types and handle any
differences in the constructor arguments as needed.

Once created, add updates to the metric state throughout the component code where required. Using the
above example, the following line of code increments the counter metric by 1.

::

  pCounter->inc(1);

Note that only a single line of code is required to update the metric.

That's it! There are no component requirements related to collection or reporting of metric values.
That is handled by the framework and loaded sinks.

For convenience, there are function templates that handle creating the reporter, creating a metric,
and adding the metric to the reporter. For example, the above three lines of code that created the
reporter, a metric, and added it can be replaced by the following:

::

  auto pCount = createMetricAndAddToManager<CounterMetric>("metricName", "description");
For convenience a similar function template exists for creating custom metrics. For a custom metric
the framework must know the metric type and have a reference to the underlying state variable. The
following template function handles creating a custom metric and adding it to the reporter (which is
created if needed as well):

::

  auto pCustomMetric = createCustomMetricAndAddToManager("customName", "description", metricType, value);

Where:

* metricType

  A metric type as defined by the *MetricType* enum.

* value

  A reference to the underlying event state, which must be a scalar value convertible to a 64-bit
  unsigned integer (__uint64)
Adding Metric Meta Data
=======================

A component, depending on requirements, may attach meta data to further qualify created metrics.
Meta data takes the form of key/value pairs. The base metric class *MetricBase* constructor defines
a parameter for a vector of meta data. Metric subclasses also define meta data as a constructor
parameter; however, an empty vector is the default. The *IMetric* interface defines a method for
retrieving the meta data. Meta data is order dependent.

Below are two examples of constructing a metric with meta data. One creates the vector and passes it
as a parameter, the other constructs the vector in place.

::

  MetricMetaData metaData1{{"key1", "value1"}};
  std::shared_ptr<CounterMetric> pCounter1 =
      std::make_shared<CounterMetric>("requests.completed", "description", SMeasureCount, metaData1);

  std::shared_ptr<CounterMetric> pCounter2 =
      std::make_shared<CounterMetric>("requests.completed", "description", SMeasureCount, MetricMetaData{{"key1", "value2"}});
Metric Units
============

Metric units are treated separately from the base name and meta data. The reason is to allow the sink
to translate based on collection system requirements. The base framework provides a convenience method
for converting units into a string. However, the sink is free to do any conversions, both of the actual
units and of the string representation, as needed.

Metric units are defined using a subset of the *StatisticMeasure* enumeration values defined
in **jstatscodes.h**. The following values are currently used:

* SMeasureTimeNs - A time measurement in nanoseconds
* SMeasureCount - A count of events
* SMeasureSize - Size in bytes