Statistical Normalcy Determination Based on Data Categorization

Mazda A. Marvasti
VMware, Inc.
mazda@vmware.com

Arnak V. Poghosyan
VMware, Inc.
apoghosyan@vmware.com

Ashot N. Harutyunyan
VMware, Inc.
aharutyunyani@vmware.com

Naira M. Grigoryan
VMware, Inc.
ngrigoryan@vmware.com

Abstract

We introduce a statistical learning Normalcy Determination System (NDS) for data-agnostic management of monitoring flows. NDS performs data categorization with analysis tools that identify category-specific normalcy bounds in terms of dynamic thresholds. This information can be applied further for anomaly detection, prediction, capacity planning, and root-cause analysis. Keywords: monitoring, time series data, statistical process control, normalcy determination, dynamic thresholding, data categorization, parametric and non-parametric statistics.

1. Introduction

Today’s business and IT management face the problem of “big infrastructures” with millions of monitored metrics that need to be efficiently analyzed for gaining valuable insights in terms of underlying system control. The concept of control for ensuring the quality of services of different systems originates from the ideas of statistical process control charts, which provide a comprehensive tool to determine whether the process is in normalcy state or not. The foundation of these concepts was established by Shewart [1] where he developed methods to improve quality and lower costs. Fluctuations and deviations from standards are present everywhere, and the problem of constructing a relevant chart is in understanding which variations are normal and which are caused by a problem. In the classical theory of control, the underlying processes have a bell-shaped distribution, and in those cases control charts are based on the strong foundation of parametric statistics. The problems arise when the classical theory of control is applied to processes of other types.

The problem of construction of a relevant control tool is in identification of normalcy bounds of the processes. Since the developments of Shewart, an explosion in controlling techniques has occurred [2]–[4]. Different processes require different measures to be controlled, and each one leads to a new control chart with the corresponding normalcy states defined by thresholds [5]–[10]. Modern businesses and infrastructures are dynamic and, as a consequence, measured metrics are dynamic without any ad-hoc known behavior. These cause extension of the classical ideas to encompass the notion of dynamic normalcy behavior. In some applications a notion of normalcy bound in terms of dynamic threshold (DT) arises naturally [11]–[16]. An essentially different approach is determination of normalcy state in terms of correlated events, which leads to a directed graph revealing the fundamental structure of a system beyond the sources and processes [17] [18]. In this paper, we introduce a fully data-agnostic system [19] [20] for determining normalcy bounds in terms of DTs of monitoring time series data without presumed behavior. The system performs data categorization based on some parametric and non-parametric models and applies category-specific procedures for optimized normalcy determination via historical simulation. Although experimental results are obtained based on IT data, the approach is applicable to wider domains, because for different applications the data categories can be appropriately defined.

Determined DTs can be further applied for anomaly detection by construction of anomaly events. As soon as the DTs are historically constructed, they can be projected into the future as prediction for time-based normalcy ranges. Any data point appearing above or below those thresholds is an abnormal event. An approach described in [21]–[24] employs a directed virtual graph showing relationships between event pairs. An information-theoretic processing of this graph enables reliable prediction of root causes of problems, bottlenecks, and black swan events in IT systems.

The NDS described here is realized in VMware® vCenter™ Operations Manager™ [25] analytics, and the last section presents some results for real customer data.

2. General Description of the System

In this section, we present general principles of the NDS, which performs fully data-agnostic normalcy determination based on historical simulation. Flowchart 1 illustrates the general concept. The system sequentially utilizes different Data Quality Assurance (DQA) and Data Categorization (DC) routines that enable choosing the right procedure (right category of qualified data) for determination of data normalcy bounds in terms of DTs.

Flowchart 1. General Concept of the NDS.

Flowchart 1. General Concept of the NDS.

DQA filters data by checking different statistical characteristics defined for data qualification, and it further passes through DC. DC performs data identification into categories (e.g., trendy, cyclical, etc.). We repeat this cycle for each time series, performing category checking and identification with hierarchical/priority order until data is identified as belonging to some category or is identified as corrupted. The categorization order or the hierarchy is important, because different orders of iterative checking and identification will lead to different final categorization with differently specified normalcy states.

Flowchart 2 shows a specific realization of the general concept. Here, NDS consists of three DQA modules (Data Quality Detector, Data Density Detector, and Stability Detector) and two DC modules (Parametric Category Detector and Variability Detector).

Flowchart 2. A Specific Realization of NDS.

Flowchart 2. A Specific Realization of NDS.

As a final result, the initial data is interpreted as Parametric, Sparse, Low-Variability, and High-Variability. In each of those cases the normalcy determination method is different. For instance, Parametric Data can be of different categories (Transient, Multinomial, Semiconstant, and Trendy) with a specific normalcy analysis algorithm in each case.
The functional meanings of the abovementioned detectors are as follows:

Data Quality Detector performs a check of sufficient statistics. This lock classifies data as Qualified when available data points and length of data are sufficient for further analysis; otherwise data is classified as Corrupted.

Parametric Category Detector performs data categorization by verifying data against a selected statistical parametric model. If categorization is possible, data is named as Parametric Data; otherwise it is called Regular Data.

Data Density Detector filters Regular Data against gaps. Data with extremely an high percentage of gaps is Corrupted Data. Data with a low percentage of gaps is Dense Data. Data with a high percentage of gaps that are uniformly distributed in time is Sparse Data. Data with a high percentage of gaps that have localization in time is further processed through a gap filter whose output is Dense or Corrupted Data.

Stability Detector analyzes Dense Data in terms of statistical stability. If data is piecewise stable and the latest stable region is enough for further processing, this block performs a data selection; otherwise the data is Corrupted. Stable Data is then passed through Variability Detector.
Variability Detector calculates variability indicators and classifies data into High-Variability or Low-Variability.

In all categorization scenarios the data additionally is verified against periodicity for efficient construction of its normalcy bounds (see Flowchart 3).

Flowchart 3. Categorization in Terms of Periodicity.

Flowchart 3. Categorization in Terms of Periodicity.

3. Period Detector

The period determination procedure (see Period Detector in Flowchart 3) in NDS seeks similar patterns in the historical behavior of time series for accurate setting of its normalcy bounds based on the information on cycles.

Some classical techniques known in the literature include seasonality analysis and adjustment [26]–[30], spectral analysis, Fourier transform, discrete Fourier transform [31]–[35], data decomposition into cyclical and non-cyclical components, and the Prony method [36], [37]. Our procedure is closer to the approach described in [38], which is based on clustering principles.

The main steps of the period determination procedure are presented in Flowchart 4.

Flowchart 4. Main Steps in Period Determination.

Flowchart 4. Main Steps in Period Determination.

Data preprocessing performs data smoothing and outlier removal by various procedures. We refer to standard classical algorithms [39]–[42]. The purpose of this step is two-fold: eliminating extreme high or low outliers that can degrade the range information and smoothing of local fluctuations in data for more robust pattern recognition.
Data Quantization performs construction of the Footprint of the historical data for further cyclical pattern recognition. This is a two-step procedure:

1. Frame construction – The range of data (smoothed data) is divided into non-uniform parts by quantiles qk with k=k1,…,km, 0≤k1<…<km≤1, where parameter m and the values of kj are predefined. Evidently, the grid lines are dense where the data is dense. For division of data into parts along the time axis, two parameters time_unit and time_unit_parts are used. Time_unit is a basic parameter that defines the minimal length of possible cycles that can be found. Moreover, any cycle can be a factor only of the length of time_unit. The usual setting is time_unit=1 day. The time_unit_parts parameter shows the number of subintervals into which that time_unit must be divided. Actually, this parameter is the measure of resolution. The bigger the value of time_unit_parts, then more sensitive is the footprint of the historical data. Figure 1 shows an example of a frame. Gridlines are equidistant along the time axis and non-uniform along the range.

Figure 1. Example of a Frame.

Figure 1. Example of a Frame.

2. Percentage calculation – For the given framework, we calculate the percentage of data in every grid-cell and obtain the corresponding column for the given time interval. Collecting all columns, we construct the matrix of percentages for that particular frame. The final matrix is a two2-dimensional classical histogram of historical data. Then, for every column we calculate the corresponding cumulative sums, getting a cumulative distribution of data in each column. We call this matrix as a Footprint of historical data.

The Pattern Recognition procedure is described as: Let T=N×time_unit, N=1,2,… We collect the columns of the footprint matrix into subgroups with L=N×time_unit×time_unit_parts columns in every subgroup. The overall number of the subgroups equals to M=length(footprint)/L (footprint matrix can be extended by zero columns if needed). For each k-th k=1,…,L column in every subgroup, we check the similarity of columns by the well-known relative L2-norm:

marvasti-6

for some user-defined parameter closeness, then it is assumed that two columns are similar. Let users defines the parameters closeness (=0.2) and quality (=75%) parameters. Figure 2 shows an example is shown wherein which data is divided into T-cycles. For this particular example, we assumed that

     d(A,E)>closeness,d(A,I)>closeness, d(E,M)>closeness

Hence column A is not similar to columns E, I, and M, and we put zero under it. Now, we try column E. We assumed that

     d(E,I)≤closeness,d(E,M)≤closeness

Hence, column E is assumed to be similar to I and M , and we put ones under these columns. If the percentage of ones is not less than the value of the quality parameter quality then, we declare that the corresponding column of T-cycle is periodic; otherwise it is non-periodic. In our example, taking into account that three columns out of four compose 75%≥quality, we declare that the first column of the T-cycle is periodic and put one in the corresponding column (see Figure 3). We repeat the procedure for all columns (see the particular example in Figure 2) and check the periodicity of each column (see Figure 3).

Figure 2. T-Cycle Checking Procedure.

Figure 2. T-Cycle Checking Procedure.

Figure 3 shows that in this particular example, two columns are periodic (A,E,I,M and D,H,L,P) and two columns are non-periodic. Because periodic columns compose 50% of all columns, then T-cycle has 50% similarity of columns.

Figure 3. T-Cycle Similarity Calculation.

Figure 3. T-Cycle Similarity Calculation.

Repeating this procedure for all possible T-cycles, we compose a Cyclochart of data that shows the percentage of similarities against T. Figure 4 shows an example of a Cyclochart.

Figure 4. Example of a Cyclochart.

Figure 4. Example of a Cyclochart.

The period-determination procedure is based on the Cyclochart analysis and consists of four steps. Let us consider the particular example of Figure 4:

1. Finding out the local maximums in the Cyclochart with their corresponding similarities (see Table 1).

marvasti-10

2. Construction of the periods that correspond to every local maximum. Data with T-cycle also has kT-cycle for every natural k. Hence, for the first row in Table 1 together with the 2-day period, we expect also 4-day, 6-day, .… periods. So a 2-day period creates the following period series:

    2 → 2,4,6,8,10,….

The peak 4 creates another series

    4 → 4,8,12,16,...

and so on.

3. Calculation of the series characteristics for every period series. The following characteristics are assumed to be important: positive factor of the period series is the number of peaks in that series and negative factor of the period series is the number of members in that series which are not the peaks. Then, we set

    Strength = Positive factor – Negative factor.

Table 2 shows these characteristics for Table 1.

Table 2: Positive Factors, Negative Factors, Strengths and the Corresponding Similarities.

Table 2: Positive Factors, Negative Factors, Strengths and the Corresponding Similarities.

4. Period determination based on the defined characteristics by the following procedure. We select the periods with maximum strength. From that list, we choose the periods with minimum negative factor then with maximum similarity. Then, we pick the period with minimum length and check its similarity measure. The similarity of the determined final period must be greater than 20%; otherwise, data is claimed to be non-periodic. This procedure applied to the results in Table 2 leads to the 7-day period, because it has the maximum Strength=4

Figure 5. Normalcy Bounds for Periodic Data.

Figure 5. Normalcy Bounds for Periodic Data.

We will now discuss the general procedure of how the normalcy bounds can be determined based on periodicity, taking into account that for specific categories it can be modified appropriately. In the case of non-periodic data, normalcy bounds can be determined by the well-known whisker’s method. If data is claimed as periodic, then normalcy bounds are calculated based on cycle information (see Figure 5 for a specific example). More specifically, consider the case of cyclical data and the following four columns from the Footprint, which are shifted one from another by the period of data:

If

marvasti-13

d(A,B)≤closeness,d(A,C)≤closeness,
and
d(A,D)≤closeness,

then all columns make a cyclical subgroup and we calculate the bounds based on the four data columns. Then, if
d(A,B)≤closeness, d(A,C)≤closeness
but
d(A,D)>closeness

then only the columns A,B,C make a cyclical subgroup. Taking into account that quality = 75% and three columns out off four compose 75% , then we assume that column D is corrupted and consider only the three columns. If

d(A,D)≤closeness
but
d(A,B)>closeness

then we discard column A. If less than 75% of these four columns are similar, then we have a non-cyclical subgroup and take into consideration all columns A,B,C,D. Then, for each group of columns, normalcy bounds can be taken as (min,max) values of data (see Figure 6), taking into account the preliminary data smoothing factor.

Figure 6. Normalcy Bounds Construction Procedure from Footprint Taking into Account the Information on Cycles.

Figure 6. Normalcy Bounds Construction Procedure from Footprint Taking into
Account the Information on Cycles.

4. Data Quality and Parametric Category Detectors

Data Quality Detector performs a check for sufficient statistics. This block classifies data as Qualified when available data points and length of data are enough (for example, more than 20 points and longer than 1 week) for further analysis; otherwise data is classified as Corrupted.

Flowchart 5 shows the principal scheme of the Parametric Category Detector. It specifies data as either Parametric or Regular. Parametric Data can belong to different categories: Multinomial, Transient, Semi-Constant, Trendy, or any other user-defined category.

Flowchart 5. Parametric Category Detector.

Flowchart 5. Parametric Category Detector.

Multinomial Data – Flowchart 6 describes the process of Multinomial Data (MD) categorization. The module Checking Parameters module for MD calculates statistical parameters for comparison with the predefined measures. If the checking is positive. then data is classified as MD; otherwise the module Performing De-noising module performs data cleaning with sequential checking of predefined parameters. We consider the following predefined statistical measures. Let pj be the frequency of occurrences of the integer nj

marvasti-16

where N is the total number of integer values and m is the number of different integer values.

Flowchart 6. Categorization of Multinomial Data.

Flowchart 6. Categorization of Multinomial Data.

Data is multinomial if it takes fewer than m different integer values and at least s of them have frequencies greater than parameter H1. Two different de-noising procedures can be performed. The first procedure is filtering against non-integer values with smaller than H2 percentage (H1<H2). If this condition is satisfied, then in the remaining analysis the non-integer numbers are discarded. The second procedure is filtering against integer values with a small cumulative percentage. Sorting the percentages Pj in descending order. we define the cumulative sum Cj as:

c1=100, cj=pj+…+pm, cm=pm

Now, if ck<H3, ck-1≥H3, then the integer values nk,nk-1,…,nm can be discarded from further analysis.

Determination of normalcy bounds is started with periodicity analysis. Here, while constructing the Footprint, instead of the percentages of data in every cell, we are taking the values of ck in every column. Then, if data is claimed as periodic, points in similar columns are collected together and new values of the numbers ck are calculated. If ck-1<H, ck≥H , then the values n1,n2,…,nk constitute the most probable set (normalcy set) of similar columns. If data is determined as non-periodic, then the numbers ck are calculated for all data points and the normalcy set is determined similarly.

Transient Data is categorized by multimodality, modal inertia, and randomness of modes appearing along the time axis. Transient Data must have at least two modes. In this context, modal inertia means that data points in each mode must have some inertia. (They can’t oscillate from one mode to the other too quickly.) Actually, the inertia can be associated with the time duration that data points remain in the selected mode.

Detection of inertial modes is based on calculation of transition probabilities. By the first step we seek for a region/interval of sparse data values and for data with some inertia concentrated in upper and lower regions of this interval. We take two numbers a,b such that

x_min≤a<b≤x_max,

where x_min,x_max are minimum and maximum values of data, respectively. These numbers divide the interval[x_min,x_max] into three regions A≝[x_min,a], B≝(a,b), and C≝[b,x_max]. We calculate the following transition probabilities:

marvasti-18

where NA —is the number of points in [xmin,a), NB —is the number of points in [a,b], NC—is the number of data points in (b,xmax], NA→A) —is the number of points with the property x(ti )∈A and x(ti+1) ∈A, NB→B —is the number of points with the property x(ti )∈B and x(ti+1)∈B, Nc→C —is the number of points with the property x(ti )∈C and x(ti+1) ∈C.

Starting from the highest possible position and shifting the region B to the lowest possible, we calculate those three transition probabilities and stop the procedure until the following condition is fulfilled: p_(A→A)>H, p_(C→C)>H, p_(B→B)<h, and N_A,N_C≫1,
where the numbers H and h are some predefined parameters. In our experiments below we set H=0.75 and h=0.25. If this procedure ends without finding the needed interval we narrow the region B and repeat the procedure.

In our experiments we divide the interval [xmin,xmax] into N+1 equal parts xmin<x1<x2<⋯<xNc Δt are holes (gaps). c is a predefined parameter for the hole determination. It is assumed that for transient data the holes must be “uniformly” distributed along time axis. This can be checked by transition probabilities. Let Tk be the duration (in milliseconds, seconds, minutes, etc., but in the same measures as the monitoring time) of the k-th gapless data portion. For data without holes we have only one such portion and Tk=tn-t1. The sum

marvasti-19

is the duration of gapless data. Nt is the number of gapless data portions. Let Gk be duration (in the same measures as Tk) of the k-th hole. The sum

marvasti-20

is the duration of all holes in data. NG is the number of hole portions. Obviously , G+T = tn-t1.

By ρ we define the percentage of holes in data

marvasti-21

Calculation of Probabilities – By p11,p10,p00,p01 define the probabilities of data-to-data, data-to-gap, gap-to-gap and gap-to-data transitions, respectively

marvasti-22

and

marvasti-23

We seek for an inertial mode for which

ρ>P,p10>ε,p01>ε

Where P and ε are user defined parameters. If at least two inertial modes satisfy these conditions, then data is transient.
Normalcy determination is performed separately for each mode.
Semi-Constant Data – Data is Semi-Constant if

marvasti-24

where N corresponds to data length and iqr stands for interquartile range of xk=x(tk).

If data is not semi-constant but its latest enough long-enough time period satisfies the condition, then, we select it for further normalcy bounds determination.

Normalcy determination of the Semi-Constant Data can be performed as follows. For Semi-Constant Data every data point greater than q0.75 (quantile) or less than q0.25 is an outlier. If the percentage of outliers is greater than p% (p=15%), then we check for periodicity in outlier data by the procedure described above. For that, data points equal to the median are excluded from the analysis. In case of nonperiodic data the normalcy bounds are calculated by the whisker’s method. In the case of periodic data, the same procedure is applicable for each periodic column of the Footprint of data.

Trendy Data – Different classical methods are known for trend determination [43]—[48]. In our analysis, Trendy Data recognition and related determination of its normalcy bounds consists of three main steps (see Flowchart 7).

1. Trend identification by Trend Detector, which separates Qualified Data into Trendy and Non-Trendy Data.
2. Trendy Data goes through the Trend Recognition module, which classifies the trend into linear, log-linear and non-linear categories. The main purpose of this step is decomposition of the original time series f0 (t), consisting of N points, into sum of non-trendy time series f(t) and trend component trend(t) f0 (t)=f(t)+trend(t) that allows more accurate normalcy analysis based on f(t). 3. Specific normalcy bounds calculation for each category.

Flowchart 7. Trend Determination and Normalcy Analysis.

Flowchart 7. Trend Determination and Normalcy Analysis.

Trend Detector performs different classical tests for trend detection. The Mann-Kendall (MK) test is appropriate for our purposes. although other known tests are also possible to apply. MK statistic (S0) can be computed by the formula

marvasti-26

In general, the procedure consists of the following steps: data smoothing, and calculation of the MK statistic S0 for the smoothed data. If S0>0, then the trend can be increasing; otherwise (if S0<0) decreasing. Then, calculation of the trend measure is

marvasti-27

where

marvasti-28

Data is trendy if, for example p>40%.

Trend Recognition reveals the nature (linear, log-linear, or non-linear) of the trend. We are checking linear and log-linear trends by the linear regression analysis. Goodness of fit is checked by the following formula

marvasti-29

where Rregression is the sum of squares of the vertical distances of the points from the regression line and R0 is a similar quantity for the line with zero slop and passing through the mean of data (null hypothesis).

If R is, for example, greater than 0.6 then it is assumed that a trend is linear; otherwise the log-linearity is checked by the same procedure for f(ec t), where c is some constant. If the corresponding goodness of fit is greater than 0.6, then data is assumed to be log-linear; otherwise, data is non-linear trendy.

Normalcy is performed as follows:

Data with Linear Trend. We decompose original data f0 (t) into the form

f_0 (t)=f(t)+linear_trend(t)

where

linear_trend(t)=kt+b

with coefficients k and b determined by linear regression analysis and perform periodicity analysis for f(t) as we described above. If f(t) is non-periodic, then normalcy bounds of f0 (t) are straight lines (upper and lower dynamic thresholds) that we set up by maximization of the objective function. As an objective function we consider the following expression

marvasti-30

where S is the square of the area limited by tmin, tmax, and some lower and upper lines (see Figure 7),

Smax=h(tmax-tmin)

and P is the fraction of data within the upper and lower lines and a is a user- defined parameter. Then we calculate variability (standard deviation) of f(t) σ=std(f(t)) and consider the following set of lower and upper lines
[kt+b-zj σ,kt+b+zj σ], j=1,2,…

calculating each time the corresponding value gj of the objective function. Lines that correspond to max (gj) we take as appropriate normalcy bounds.

Figure 7. Auxiliary Drawing for Definition of the Objective Function.

Figure 7. Auxiliary Drawing for Definition of the Objective Function.

In our experiments we use the following values for zj:
z1=1, z2=1.5, z3=2, z4=3, z5=4

If f(t) is periodic. then the procedure described above can be performed for each set of similar columns by calculating variability (σm) of the m-th set and considering the following normalcy bounds:

[kt+b-zj σm, kt+b+zj σm ], j=1,2,…

Then maximum of the objective function will give the normalcy bounds of the m-th set.

Data with Log-Linear Trend. Taking into account that f(ec t) is data with linear trend the above described procedure is valid for this as well.

Data with Non-Linear trend. For this case we select the last reasonable portion of data and calculate the normalcy bounds according to the above described procedure for non-periodic case.

Figure 8 shows an example of trendy periodic data with the corresponding normalcy bounds.

Figure 8. Normalcy Bounds for Trendy Periodic Data (Red curve is upper threshold, green curve is lower threshold, and blue curve is the original data.)

Figure 8. Normalcy Bounds for Trendy Periodic Data (Red curve is upper threshold,
green curve is lower threshold, and blue curve is the original data.)

5. Data Density Detector

Data density recognition is based on probability calculation that reveals distribution of gap. According to our analysis we differentiate the following categories: Dense Data (relative to estimated monitoring time), Sparse Data (relative to estimated monitoring time). and data with technical gap (localized gap due to malfunction of device) that after data selection will belong to Dense Data cluster, and finally, Corrupted Data.

The principal scheme of density recognition and recovering (data selection) procedures is presented in Flowchart 8. For categorization purposes we deal with the following measures that characterize the nature of gap presence in data:

  • Percentage of gaps
  • Probabilities for gap-to-gap, data-to-data, gap-to-data, and data-to-gap transitions If the total percentage of gaps is acceptable, then data is categorized as Dense Data.

If the total percentage of gaps is higher than some limit and they have non-uniform distribution in time (which means that gaps have some localization in time), then the gap cleanup (data selection) procedure will give a Dense Data. If gaps have uniform distribution in time, then data belongs to a Sparse Data cluster. If gaps have such an extremely high percentage that further analysis is impossible, then the data belongs to a Corrupted Data cluster.

We omit the technical details because calculation of the transition probabilities can be performed as for Transient Data.

Flowchart 8.Data Density Detector.

Flowchart 8.Data Density Detector.

For normalcy determination data, is preliminary checked for periodicity. In the case of Sparse Data, duration of gaps is reasonable to take into account.

6. Stability Detector

The problem of change detection in time series [49]–[55] is a wellknown statistical problem. The Stability Detector (see Flowchart 9) performs data processing for statistical stability recognition. If data is stable or its stable portion can be selected, then the data (or selected portion) is defined as Stable Data; otherwise it is Corrupted.

Stability identification is accomplished by construction of a data StabiloChart that shows the stability intervals of time series and allows selection of the recent long-enough data region for further analysis. For every given m we calculate the quantity

marvasti-34

that shows the relative change (left-right) attached to the point xm in terms of the iqr measure, where n is some parameter (for example n=marvasti-25 , where T is the length of data).
If
sm<S
then we set sm=0 showing the stability against the point xm with the given sensitivity S (=70%);, otherwise we put sm=1 showing instability against the point xm.

The graph of sm values obtained along the moving (by a preset data points) xm is the Stabilochart of the data. The Stabilochart shows that if data is stable, the latest stable portion can be selected, or that data is corrupted.

Flowchart 9. Stability Detector.

Flowchart 9. Stability Detector.

7. Variability Detector

The Variability Detector performs data processing for variability recognition. Two different categories can be recognized: Low-
Variability and High-Variability. Based on the absolute jumps x′k of data points
x′k=|xk+1-xk|
the following measure R of variability is considered

marvasti-37

Data clustering is performed by the following comparison with parameter V (=20%): if R≤V , then data is from a Low-Variability cluster; otherwise it is from a High-Variability cluster. Normalcy determination for both categories is performed by different setup of preliminary parameters—less sensitive for High-Variability Data.

8. Experimental Results and Discussion

We present some results of experiments on an actual customer data set. First we performed experiments for short-term data with almost a one-month duration. NDS was applied to 3,215 time series metrics. Table 3 shows the distribution along different data categories. Table 4 shows the count of periodic and non-periodic data.

marvasti-38

marvasti-39

We also examined the distribution of periodic data in some categories. For 532 Semi-Constant Data metrics (see Table 3), 267 have percentage of outliers of less than 15%, and they are claimed as non-periodic without any further checking. The remaining 235 metrics are investigated for periodic structure, and in 212 of them periods are found. In the case of the High- Variability and Low-Variability categories, periods are found for 378 and 165 metrics, respectively.

Second, we performed experiments for long-term data with almost a three-month length. We obtained 3,956 metrics, and Table 5 shows distribution along different categories. Table 6 shows distribution of metrics along periodicity. For 586 Semi-Constant metrics, 324 have outliers less than 15% and they are categorized as non-periodic, 262 checked for periodicity, and for 221 periods are found. Then, for the High-Variability and Low-Variability categories, periods are found for 457 and 165 metrics, respectively.

It is worth noting that results obtained for the specific customer can’t be in any manner generalized to other cases. The results can vary widely from one customer to another without any intersection.

marvasti-40

marvasti-41

However, results obtained for a specific customer can provide useful information about the customer’s environment. In terms of our approach, this can also lead to some optimizations by excluding procedures for a specific category that is not common for the customer. For example, in Table 5, we see that the Sparse and Transient categories cover only 5.5% of the overall data set, and the system can be applied without specifying them.

Another important insight is that data category is not an invariant property. Change in length of data in general changes the category. Moreover, the data selection module picks up the last stable portion, and categorization is performed only on this portion. So, visually, data can be corrupted, but its latest stable portion belongs to some of the predefined categories.

Figures 10, 11, and 12 present reliably predicted normalcy bounds obtained by the NDS.

marvasti-42

References

1. W.A.Shewhart (1931), “Economic control of quality manufactured product,” New York: D. Van Nostrand Company, Republished in 1980 by the American Society of Quality Control.
2. D.J. Wheeler and D.S. Chambers (1986), “Understanding statistical process control,” Knoxville, TN: SPC Press.
3. D.J. Wheeler (1991), “Shewhart’s chart: myths, facts, and competitors,” 45th Annual Quality Congress Transactions.
4. F. Alt (1985), “Multivariate Quality Control,” Encyclopedia of Statistical Sciences, Volume 6.
5. C.A. Parris, “Hard disk drive infant mortality tets,” Patent application number: US 09/387,677. Filing date: Aug 31, 1999. Publication number: US6408406B1. Publication date: Jun18, 2002. Publication type: grant.
6. F.L. Paulson, “System and method for determining optimal sweep threshold parameters for demand deposit accounts,” Patent application number: US 08/825,012. Filing date: Mar 26, 1997. Publication number: US5893078 A. Publication date: Apr 6, 1999. Publication type: grant.
7. J.D. Luan, “Threshold alarms for processing errors in a multiplex communications system,” Patent application number: EP19880112793. Filing date: Aug 5, 1988. Publication number: EP0310781B1. Publication date: Mar 10, 1993. Publication type: grant.
8. P. Celka, “A method and system for determining the state of a person,” Patent application number: PCT/EP2013/064678. Filing date: Jul 11, 2013. Publication number: WO2014012839 A1. Publication date: Jan 23, 2014.
9. B.M. Jakobson, “Systems and methods for authenticating a user and device,” Patent application number: US 13/523, 425. Filing date: Jun 14, 2012. Publication number: US20130340052 A1. Publication date: Dec 19, 2013.
10. T. Blumensath and M.E. Davies (2009), “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, Vol. 27, no. 3, pp. 265-274.
11. H. Huang, H. Al-Azzawi, and H. Brani (2014), “Network traffic anomaly detection,” ArXiv:1402.0856v1.
12. D. Dang, A. Lefaive, J. Scarpelli, and S. Sodem, “Automatic determination of dynamic threshold for accurate detection of abnormalities,” Patent application number: US 12/847,391. Filing date: Jul 30, 2010. Publication number: US20110238376 A1. Publication date: Sep 29, 2011.
13. J. C. Jubin, V. Rajasimman, and N. Thadasina, “Hard handoff dynamic threshold determination,” Patent application number: US 1/459,317. Filing date: Jun 30, 2009. Publication number: US20100056149 A1. Publication date: Mar 4, 2010.