Short-term forecasting models for automated data backup system: segmented regression analysis

Information and Communication Technology (ICT) has become a critical area for business success, so organizations need to adopt additional measures to ensure the availability of their services. However, such services are often not planned, analyzed and monitored, which undermines the quality assurance offered to customers. Backup is the service addressed in this study, and the object of study is the automated data backup system in operation at the Federal University of Itajuba, Brazil. The main objective of this research was to present a logical sequence of steps to obtain short-term forecast models that estimate the point at which each recording medium reaches its storage capacity limit. The input data were collected from the metadata generated by the backup system, covering a two-year window. For the implementation of the models, simple univariate linear regression was employed in conjunction, in some cases, with simple segmented linear regression. To discover the breakpoint, an approach targeted at residual analysis was applied. The results obtained by the iterative execution of the proposed algorithm showed adherence to the characteristics of the analyzed series in terms of accuracy measures, regression significance, residual control charts and model fit, among other criteria. As a result, an algorithm was developed for integration into automated backup systems using the methodology described in this study.


Introduction
High availability of systems, applications and services is critical to organizations, so ICT increasingly becomes a critical area for business success, and additional measures have to be adopted. Several factors, such as cyber-attacks, network failures and hardware failures, could jeopardize the continuity of businesses that depend entirely or partially on ICT (Laudon & Laudon, 2012).
Lucio-Nieto, Colomo-Palacios, Soto-Acosta, Popa, and Amescua-Seco (2012) argue that organizations are demanding more efficient methods for managing services in order to offer high quality to internal and external customers. According to Owens, Wilson, and Abell (2019), information is increasingly seen as a valuable asset of an organization; examples include customer information, financial records, business processes and functional records. The global consumption of data stored by users, businesses and governments is increasing; a market analysis shows that overall revenue from data storage connected to business could reach $6.5 billion in 2015 (Pamies-Juarez, Datta, & Oggier, 2013), and information technology costs constitute a significant component of organizations' dependence on information technology (Kwiatkowski & Verhoef, 2013).
The data stored in databases continue to grow as a result of the need that organizations have to constantly obtain, store and generate more information. Much of the cost of securely maintaining such large amounts of recorded data lies in the storage media and in the resources used for system management (Muthukumar & Ravichandran, 2012).
A backup consists of a redundant copy of data, so that the information can be restored in case of any kind of data loss (Wolff, 2007). It is a security service in the information technology area. According to Burgess and Reitan (2007), disk backup is an important area for system policy, which must decide when to schedule data archiving in order to protect against data loss. Generally, data may be lost due to accidental or

Material and methods
ICT is still little explored in studies that use a forecasting methodology as a support tool for planning, monitoring and decision making. Historical data on system activity are very valuable because they can be analyzed to assist in planning the future growth of the system. Chamness (2014) describes the architecture of a forecasting tool for backup systems. According to the author, the error rate of a linear regression model can be significantly reduced by applying the regression to a subset (or segment) of data representing the most recent behavior of the series.
According to Nisbet, Miner, and Elder (2009), segmented linear regression performs a pre-processing of the series in which knots or breakpoints are first determined where there are perceptible changes of slope, as shown in Figure 1. In a second step, linear regression is applied to each segment obtained and, if necessary, the combined results are presented. To find the breakpoint, Chamness (2014) uses a technique based solely on searching for the highest value of R² (coefficient of determination). R² is an accuracy measure ranging between 0 and 1, as stated in Equation 1:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²   (1)

The ratio divides the sum of squared residuals Σ(yᵢ − ŷᵢ)² by the total sum of squares Σ(yᵢ − ȳ)², which is the sum of the squared differences between the mean and each observed value. Thus, selecting the model with the highest R² is equivalent to selecting the model that minimizes the sum of squared residuals. According to Montgomery, Jennings, and Kulahci (2011), high values of R² suggest a good fit to the historical data, but do not guarantee that forecast errors will be small. The technique of Chamness (2014) to find the breakpoint performs successive analyses of sets of observations close to the beginning of the series in search of the highest value of R². The prediction model is selected from the results of linear regression on the segments, arriving by trial and error at the set with the highest R². Figure 2 shows an example of how a breakpoint is found with this approach.
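As an illustration, the R²-driven trial-and-error search attributed to Chamness (2014) can be sketched as follows. This is a minimal sketch, not the author's implementation; the function names and the minimum segment size are illustrative.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of a simple linear fit y = b0 + b1*x."""
    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)
    ss_res = np.sum(residuals ** 2)           # sum of squared residuals
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot

def best_segment_by_r2(x, y, min_points=5):
    """Try every candidate start index and keep the segment whose
    linear fit has the highest R^2 (trial-and-error search)."""
    best_start, best_r2 = 0, -np.inf
    for start in range(0, len(x) - min_points + 1):
        r2 = r_squared(x[start:], y[start:])
        if r2 > best_r2:
            best_start, best_r2 = start, r2
    return best_start, best_r2
```

For a series whose slope changes partway through, the search tends to settle on the segment after the change, since a single line fits it almost perfectly.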
The segmented linear regression approach proposed by Chamness (2014) to obtain the breakpoint does not take into account other model accuracy measures, such as residual analysis. However, according to Mahmoud, Parker, Woodall, and Hawkins (2007), many authors have studied the breakpoint problem in the regression context for different objects of study, and the literature offers other ways to estimate the breakpoint adjusted to each specific case. Examples of the application of segmented regression in other areas of knowledge can be seen in Muggeo (2003), Malash and El-Khaiary (2010), Shao, Li, and Xu (2009), Lavazza and Morasca (2011) and Jin and Shi (2012).
Based on Chatterjee and Hadi (2013), Montgomery et al. (2011) and Chamness (2014), regarding regression analysis and the methodological application context, this work proposes the implementation of an algorithm to be applied as a backup forecasting tool in automated backup systems. The flow chart in Figure 3 shows the regression model analysis steps.

Data input
At this early stage, the process is fed with the preliminarily treated records collected for each series. The data must be arranged in matrix form, with one column for the observations of the response variable Y and one for the predictor variable X.

Algorithm part 1: Data Input
  comment: The data backup input stream
  global originalData[1..n; 1..n] : real;
  originalData ← Read(inputStream);

The variables used for this regression analysis are described in Table 1.

Data filtering
The purpose of this step is to remove data that contain registry errors. According to Chatfield (2003), it is essential to make a careful evaluation of all available data, answering three primary questions: a) were the variables recorded with the necessary precision? b) how useful and usable are the data? c) are there obvious input errors, outliers, or missing data? The quality of the analysis depends on how appropriately the data filtering is done.

Algorithm part 2: Data Filtering(originalData)
  comment: The data evaluation and filtering
  global filteredData[1..n; 1..n] : real;
  filteredData ← Filter(originalData);

In Algorithm part 2, the filter function receives the original data and eliminates the observations equal to zero and the repeated observations, which represent errors in the execution of the backup job service.
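A minimal sketch of the filter function, assuming (as the text suggests) that "repeated observations" means consecutive readings with the same value; the function name mirrors the pseudocode, but the exact matching rule is an assumption.

```python
def filter_series(observations):
    """Drop zero readings and consecutive repeats, which the text
    treats as failed backup-job records (the rule is illustrative)."""
    filtered = []
    previous = None
    for value in observations:
        if value == 0:           # registry error: nothing was written
            continue
        if value == previous:    # repeated reading: the job did not advance
            continue
        filtered.append(value)
        previous = value
    return filtered
```

For example, `filter_series([0, 10, 10, 20, 0, 30, 30, 40])` keeps only the four distinct, non-zero readings.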

Linear regression analysis
In this step the prediction model is obtained. Regression analysis is a widely used statistical tool, considered one of the more parsimonious methods for investigating the functional relationship between variables, and it is applied in many areas of knowledge. The relationship is expressed as an equation or model that connects the response or dependent variable Y to the predictor (explanatory) variable X. Regression analysis can be viewed as an iterative process, in which the outputs are used to diagnose, validate, criticize, and possibly modify the inputs. The process must be repeated until a satisfactory output is obtained (Chatterjee & Hadi, 2013). In Algorithm part 3, the simple linear regression model function is used, as given in Equation 2:

Y = β₀ + β₁X + ε   (2)

In Equation 2, Y is the response variable, X is the predictor, β₀ and β₁ are the model parameters (regression coefficients) and ε is an error term. The coefficient β₁, called the slope, can be interpreted as the change in Y caused by a unit change in X. The coefficient β₀, called the constant (intercept), is the expected value of Y when X = 0. To estimate the unknown parameters β₀ and β₁, the method of least squares can be employed. This method computes the line that minimizes the sum of squares of the vertical distances between each point and the line, i.e., the errors in the response variable (Chatterjee & Hadi, 2013). Equations 3 and 4 give the estimates of the parameters:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   (3)

β̂₀ = ȳ − β̂₁x̄   (4)

In this context, if the residuals are out of control, the step 'Apply Transformation' of the flow chart is triggered to correct or minimize those points. To remain parsimonious and effective, the algorithm applies only an individual control test to the residuals, considering outside the upper control limit (UCL) any residual more than 3 standard deviations above the center line, and outside the lower control limit (LCL) any residual more than 3 standard deviations below it, as in Equations 5, 6 and 7:

UCL = ē + 3σₑ   (5)
CL = ē   (6)
LCL = ē − 3σₑ   (7)

It is also necessary to analyze, through the p-value, the significance test of the regression and the significance test of the constant term (β₀). In a small percentage of the series analyzed in this paper, it was necessary to remove the term β₀ to improve the model. A term was considered significant when its p-value < 0.05.
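The least-squares estimators of Equations 3 and 4 and the 3-standard-deviation residual check can be sketched as below. This is an illustrative sketch: a textbook individuals chart estimates sigma from the moving range, while here, following the text, the plain standard deviation of the residuals is used.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Least-squares estimates (Equations 3 and 4):
    b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1*xbar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

def residual_control_limits(residuals, k=3.0):
    """LCL and UCL placed k standard deviations around the residual mean."""
    residuals = np.asarray(residuals, float)
    center = residuals.mean()
    spread = residuals.std(ddof=1)
    return center - k * spread, center + k * spread

def out_of_control(residuals):
    """Indices of residuals that fall outside the control limits."""
    lcl, ucl = residual_control_limits(residuals)
    return [i for i, e in enumerate(residuals) if e < lcl or e > ucl]
```

For a perfectly linear series the fit recovers the slope and intercept exactly, and a single gross residual among many small ones is flagged by the individual test.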

Apply forecast model
Once the model is deemed satisfactory in the step 'Model With Accuracy?', it can be applied to answer the research question: at what point will the backup media reach 100% of its capacity? That is, to find the value of X at the intended point Y = 1, according to Equation 8 and Algorithm part 5:

X = (1 − β̂₀) / β̂₁   (8)
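Inverting the fitted line at the target utilization (Equation 8) is a one-liner; the guard against non-positive slopes is an added assumption, since a series that is not growing never reaches capacity.

```python
def forecast_capacity_point(b0, b1, target=1.0):
    """Equation 8: invert Y = b0 + b1*X at the target utilization
    (Y = 1 means 100% of the media capacity)."""
    if b1 <= 0:
        raise ValueError("a non-positive slope never reaches the target")
    return (target - b0) / b1
```

For instance, with an intercept of 0.1 and a slope of 0.05 per observation, the medium is forecast to fill at X = 18.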

Changes in the data series and/or segmented regression analysis
In this step of the iterative process, a model which failed the step 'Model With Accuracy?' will be adapted through changes in the data series and, if necessary, a segmented regression analysis will be applied. The first corrective action is the transformation of the data series. However, even if the model has been approved in the step 'Model With Accuracy?', transformations may still be applied to the series (step 'Apply Transformation') in order to consider a possible improvement in the model before sending it to the step 'Apply Forecast Model'.

Transformations in series
A strategy usually adopted when out-of-control residuals are found is to apply transformations to the series. According to Montgomery et al. (2011), data transformations are often useful to stabilize the variance of the data, since non-constant variance is quite common in time series. A commonly applied correction is the Box-Cox transformation which, as Samagaio and Wolters (2010) state, can be useful to correct non-normality and residual variation. In this approach, the data are processed in accordance with Equations 9 and 10:

W = Y^λ,  λ ≠ 0   (9)
W = ln(Y),  λ = 0   (10)

The variable W receives the transformed value of the response variable Y under the parameter λ. For the implementation, the Minitab software was used to obtain the optimal value of λ, which takes into account the analysis of the range of points with the smallest standard deviation, as observed in Figure 5. After the transformation, the steps 'Linear Regression Analysis' and 'Model With Accuracy?' are performed again; if the model is still not satisfactory, a segmented regression analysis is applied to the series.
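A sketch of Equations 9 and 10 with a simple grid search for λ. The paper used Minitab to obtain the optimal λ; the geometric-mean standardization and the grid-search criterion below only approximate that procedure and are assumptions.

```python
import numpy as np

def box_cox(y, lam):
    """Equations 9 and 10: W = Y**lambda for lambda != 0, W = ln(Y) otherwise."""
    y = np.asarray(y, float)
    return np.log(y) if lam == 0 else y ** lam

def optimal_lambda(y, grid=np.linspace(-2, 2, 81)):
    """Grid search for the lambda giving the smallest standard deviation of
    the geometric-mean-standardized transform; the standardization makes
    spreads comparable across lambdas (an approximation of Minitab's rule)."""
    y = np.asarray(y, float)
    gm = np.exp(np.mean(np.log(y)))          # geometric mean of the series
    best_lam, best_sd = None, np.inf
    for lam in grid:
        if abs(lam) < 1e-12:
            w = gm * np.log(y)
        else:
            w = (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))
        sd = w.std(ddof=1)
        if sd < best_sd:
            best_lam, best_sd = lam, sd
    return best_lam
```

The transform itself follows the equations directly (e.g., λ = 2 squares the series); the search simply returns the candidate λ in the grid with the smallest standardized spread.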

Segmented regression analysis
The regression analysis is applied only to the current segment of the series, i.e., the one that best represents the latest change in data behavior. Segmentation occurs after the breakpoint is discovered and, to determine it, a technique based on the analysis of the model residuals is employed. Throughout the tests, it was observed that a change in the slope of the line had as starting point the observation showing the residual farthest outside the control limits (LCL and UCL). That is, when linear regression was applied again to the data segment starting from this observation, significant improvements in the model were found, especially regarding the residuals, which were often under control after the regression analysis restricted to the segment. In Figures 7 to 9, it is possible to observe the application to real data. Note that observation X = 30 has the residual farthest outside the control limits. When linear regression analysis is applied only to the segment starting from observation X = 30, the residuals are under control, representing more faithfully the new behavior of the series until its end. In some conditions, the process needs to be redone iteratively until the residuals are under control or, depending on the case, with the smallest possible number of points failing the test, following a sequential order of trial-and-error actions for breakpoint selection: i) the residual farthest from the LCL or UCL; ii) the next residual, in descending order of distance from the LCL or UCL; iii) when no more residuals are out of control, the chronological order. So that there is no possibility of an infinite loop, it is essential to establish a stop condition for the process: a minimum of 5 observations in the segment is defined as the stopping parameter for the regression analysis, as stated in Algorithm part 6.
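The residual-guided breakpoint search can be sketched as follows. It is a simplified sketch of the procedure described above: it always jumps to the single worst residual (restriction i) and enforces the 5-observation stop condition; the `max(worst, 1)` guard, which steps past a worst residual sitting at the very start of the segment, is an added loop-protection assumption.

```python
import numpy as np

def fit_residuals(x, y):
    """Residuals of a simple linear fit y = b0 + b1*x."""
    b1, b0 = np.polyfit(x, y, 1)
    return y - (b0 + b1 * x)

def residual_breakpoint(x, y, min_points=5, k=3.0):
    """Move the segment start to the observation whose residual lies
    farthest outside the +/- k*sigma limits, refit, and repeat until the
    residuals are under control or fewer than min_points remain."""
    start = 0
    while len(x) - start >= min_points:
        e = fit_residuals(x[start:], y[start:])
        distance = np.abs(e - e.mean()) - k * e.std(ddof=1)
        worst = int(np.argmax(distance))
        if distance[worst] <= 0:        # every residual inside the limits
            return start
        start += max(worst, 1)          # guard: always advance at least one point
    return start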
The example can be followed from Figure 6: - the individual control chart of the residuals presents several points out of control, suggesting that the model is not yet satisfactory, and the points around X = 30 indicate the residual farthest from the UCL, as can be observed in Figure 7; - the scatter plot shows a regression line better adjusted to each segment, emphasizing the second segment, which represents the most recent behavior of the series, as can be observed in Figure 8; - the individual control chart of the most recent segment's residuals, which passed the test, i.e., those under control, can be observed in Figure 9. All this iterative process in search of the best-adjusted model runs repeatedly through the algorithm from the step 'Model With Accuracy?' of Figure 3. The choice of the most appropriate forecast method depends largely on the nature of the time series, the a priori knowledge, the accuracy required and the available computing resources (Le Borgne, Santini, & Bontempi, 2007). The literature shows a continuous search for the best method; however, according to Chatfield (2003), no single method can outperform all others in all situations: the context is always crucial.

Object of study
In this study, each series of collected data represents the write cycle of a backup medium (data cartridge), from the beginning of the recording to the depletion of its storage capacity. Data collection was conducted from the metadata recorded by the automated backup system of the Federal University of Itajuba, Brazil. Altogether, 98 series containing 2,647 observations over two years were collected. The distribution of the number of observations per group is shown in Figure 10; most of the observed series exhibit a short-term profile.

Results
The results presented and discussed below were obtained by executing the proposed algorithm detailed in the material and methods section. In total, 98 series were analyzed, but the algorithm could be applied to only 83 of them, as shown in Figure 11. In the other 15 series, the algorithm could not be used for the following reasons: - Series containing fewer than 5 observations: this occurs when a backup job service demands a large amount of data to be recorded in a short time, using the full capacity of the recording media almost immediately and preventing any predictive analysis.
- Write failure: when the backup job service starts, the system detects failures (physical or logical errors) and aborts the task in the current media; the next task is then performed in the adjacent available media. In this way, the proposed algorithm was executed iteratively on 83 series of the collected data. Note in Figure 11 that in 9 out of the 83 regression models the obtained residuals were not under control. By analyzing the characteristics of these series, some causes of the residual problems were identified: - Abrupt pattern changes at the end of the series: observed when the last recorded data volume is much larger than the initial standard. This behavior decreases the model fit accuracy and increases the differences, as illustrated in Figure 12 for one of the analyzed backup media. The segmented linear regression was effective for cases of sudden change in the regression line slope.
However, for this application, it is necessary that the change occur in a less advanced stage of the records, as seen in Figure 13.
- Few observations in the series: in some analyses, the small number of media records (usually fewer than 10 observations) made it difficult to obtain better-adjusted models without out-of-control residuals.
Regarding the model fit, with the goal of finding the value of X for Y = 1, in 82 out of the 83 models the estimated X values were within the 95% confidence interval. However, in some cases a wider confidence interval was obtained, due to the characteristics of the analyzed series. The model accuracy metrics were also analyzed, and their values indicate the adjustment of the models. Figure 14 illustrates a comparison of the R², adjusted R² and predicted R² values.

Pseudocode
One of the specific objectives of this research is to conceive a pseudocode or generic algorithm to be used in automated backup systems. A similar approach can be found in Le Borgne et al. (2007). Pseudocode can be understood as a generic form of writing an algorithm using a more informal and parsimonious language without the need to know the syntax of any formal programming language (Roy, 2006).
The logical sequence of steps of the Algorithms part 1 to 6 represents the pseudocode and can be implemented in any programming language, to aid any automated backup tool. In order to minimize the amount of code, some of the procedures were encapsulated as functions.

Conclusion
With the focus on improving automated processes of a data backup system in operation, quantitative tools applied to quality were used. Based on observations and interpretations of results that encompass the total amount of analyzed series, residual, fit and accuracy measures, it was possible to verify the compliance of the analyzed data with the proposed generic algorithm that implements a logical sequence of steps.
The short-term forecasting models aim to estimate when each recording medium reaches its storage capacity limit, thus serving as a tool for decision making and systems monitoring.
The approaches presented herein may be replicated on similar data sets whose series present a positive linear relationship and in which, at a given moment, a change of behavior pattern causes a change in the line slope. Moreover, systems whose demand (represented by the dependent variable) progresses toward full depletion can also be adapted to this methodology.
However, it is not the central focus of this research to define an innovative segmented linear regression technique for finding the breakpoint. The main objective, achieved and justified by both the literature and the results of the practical application, consists in obtaining a parsimonious algorithm able to model and forecast the demand of a data backup system through analysis of the collected metadata.
Organizations have to adopt additional measures to ensure the availability of systems and applications. A monitoring tool that provides forecasts to the data backup system administrator can be considered one of the additional measures intended to ensure the availability of the data backup service.
There are still vast subareas of ICT that can be explored by studies addressing qualitative and quantitative quality-oriented methodologies. In the quality engineering area, it is possible to apply several tools that contribute to the continuous improvement of various ICT services. Within the context of this research, we suggest the following additional propositions: - The application of the proposed algorithm to other automated data backup systems and/or different institutions, with evaluation of the results.
- Proposing a wider forecast horizon (medium or long term), focused on helping the planning of investments and contracts, and not only the operational area.
- Performing predictive analyses with different techniques, including more complex ones, in order to compare the accuracy of the parsimonious approach proposed here against new techniques.
- Applying the proposed algorithm to forecast capacity demand on data sets with similar behavior from other productive areas, such as infrastructure, supplies and inventory, among others.