

ACCENT JOURNAL OF ECONOMICS ECOLOGY & ENGINEERING Peer Reviewed and Refereed Journal (International Journal) ISSN-2456-1037

Vol. 04, Special Issue 05 (ICIR-2019), September 2019. Available Online: www.ajeee.co.in/index.php/AJEEE

TIME SERIES DATA PREDICTION METHODS FOR MINING TEMPORAL DATA

1Mohammed Ali Shaik

1Research Scholar, Computer Science & Engineering Department, Dr. A. P. J. Abdul Kalam University, Indore, India

Abstract - In recent years, the evolving content, structure, and usage of interlinked documents have gained considerable significance, especially where the data is cited or indexed on the World Wide Web. In this paper I survey time series models for forecasting and consider them as tools for acquiring and analyzing data across interlinked documents. The dataset I consider comprises research papers from a data mining journal archive that are interlinked with the documents used by KDD, with the aim of predicting the citations of each paper in the area. The paper reports findings from the literature on emerging time series models that can be applied effectively to the linked structure of a data repository of distinct interlinked documents.

Keywords: Time series, Data mining, Data indexing, Data sequence matching, similarity measure, Temporal Analysis

1. INTRODUCTION

A time series is a group of values obtained from sequential measurements over a specific period of time. “Time series data mining” draws on the natural human ability to identify and visualize the shape of data [1]. Humans rely on complex schemes to carry out such tasks, which lets us avoid focusing on tiny fluctuations and instead recognize patterns over distinct time scales [2]. It has been argued that research in this area has been driven less by actual problems than by an interest in proposing novel approaches. The rapid growth of research on time series data mining techniques makes that view appear obsolete, since the techniques now cover a wide range of real-world problems in various fields of research [5].

Time series data mining has several facets of complexity. The main problems arise from the high dimensionality of time series data and from the difficulty of defining a similarity measure that agrees with human perception. In addition, with the rapid growth of digital sources of information, “time series mining algorithms” must cope with increasingly massive datasets. The major issues are:

• Data Representation [3]: how the fundamental shape characteristics of a time series can be represented. A satisfactory representation reduces the dimensionality of the data while retaining its essential characteristics.

• Similarity Measurement [4]: how two time series can be distinguished or matched, and how the distance between two series can be formalized. The measure should reflect a notion of similarity based on perceptual criteria, so that objects are recognized as alike even when they are not mathematically identical.

• Indexing Method [6]: how a massive set of time series should be organized to allow fast querying. The indexing mechanism should keep space consumption and computational complexity to a minimum.
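The representation issue above can be illustrated with a minimal sketch of Piecewise Aggregate Approximation (PAA). PAA is not named in this paper; it is used here only as an assumed example of a representation that reduces dimensionality while keeping the overall shape. The function name `paa` and its parameters are hypothetical.

```python
import numpy as np

def paa(series, n_segments):
    """Reduce a series to n_segments values by averaging roughly equal chunks.

    Assumption: PAA stands in for the (unspecified) representation method.
    """
    x = np.asarray(series, dtype=float)
    if not 1 <= n_segments <= len(x):
        raise ValueError("n_segments must be between 1 and len(series)")
    # array_split tolerates lengths that are not a multiple of n_segments.
    chunks = np.array_split(x, n_segments)
    return np.array([chunk.mean() for chunk in chunks])

# Eight points are summarized by four segment means.
reduced = paa([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(reduced)  # [1.5 3.5 5.5 7.5]
```

The reduced series can then be compared or indexed at a lower cost than the raw series, at the price of losing detail inside each segment.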

2. CLUSTERING

Clustering is the process of finding natural groups, called clusters, in a dataset. The objective is to find the most homogeneous clusters that are as distinct as possible from the other clusters, that is, to maximize “inter-cluster variance” while minimizing “intra-cluster variance”. The algorithm should identify the groups present in the dataset automatically. Time series clustering can be divided into two sub-tasks [7].

Given a distance measure, there are numerous approaches to clustering whole series of data, most of which adapt generic clustering algorithms such as Self-Organizing Maps (SOM) [8], Hidden Markov Models (HMM) [9], or Support Vector Machines (SVM) [10]. Alternatives to the Expectation Maximization (EM) algorithm have also been proposed.

The EM-based approach, however, has scalability issues and implicitly depends on an underlying model. A “Markov chain Monte Carlo (MCMC)” method [11] instead estimates the grouping of the series together with the parameters of the model for each group. Reviews of the area relate time series clustering to classical clustering methods for static data, which fall into five categories: “partitioning, hierarchical, density-based, grid-based and model-based” [12].

Clustering is also useful for time series data streams, where it has been carried out on clipped data using “Auto-Regressive (AR) models, k-means and k-center clustering” [13].
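As a rough illustration of whole-series clustering with a partitioning method, the sketch below runs k-means over equal-length series using the Euclidean distance. The function name `kmeans_series`, the random initialization, and the fixed seed are assumptions made for this example and are not taken from the cited work.

```python
import numpy as np

def kmeans_series(X, k, n_iter=50, seed=0):
    """Cluster equal-length series (rows of X) into k groups with k-means."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct series drawn at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Distance of every series to every center, shape (n_series, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

series = [[0, 0, 0], [0.1, 0, 0.1], [10, 10, 10], [10.1, 10, 9.9]]
labels, centers = kmeans_series(series, k=2)
print(labels)
```

Each update step lowers the within-cluster variance, which corresponds to the intra-cluster objective described at the start of this section.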

3. CLASSIFICATION

Classification assigns a label to each series in a set. The main difference from clustering is that the classes are known in advance and the algorithm is trained on an example dataset, with the goal of learning the features that distinguish the classes from one another. When an unlabeled dataset is then given to the system, it automatically determines the class to which each series belongs [14].

A traditional approach to time series classification is based on basic trends, but its results are difficult to interpret [15]. Approaches that are robust to noise and outliers have been proposed for high-dimensional data, for example using “Singular Value Decomposition” to select the relevant frequencies of the data, at a higher computational cost. Several classifier types have been evaluated: “nearest neighbor, support vector machines and decision forests” [16]. All of these methods appear to be valid, but their performance depends heavily on the dataset. The 1-NN classification algorithm with Dynamic Time Warping (DTW) appears to be the most widely used classifier: it is highly accurate, although its speed suffers from repeated DTW computations. This can be mitigated by a template construction algorithm based on the “Accurate Shape Averaging (ASA)” technique [17].
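The 1-NN classifier with DTW mentioned above can be sketched as follows. This is a minimal version under stated assumptions: the textbook DTW recurrence with a squared pointwise cost and no warping window, and no template averaging. The function names `dtw_distance` and `classify_1nn` are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # stretch b
                                 cost[i, j - 1],      # stretch a
                                 cost[i - 1, j - 1])  # advance both
    return float(np.sqrt(cost[n, m]))

def classify_1nn(query, train_series, train_labels):
    """Return the label of the training series closest to query under DTW."""
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

train = [[0, 0, 1, 2, 1, 0], [2, 2, 1, 0, 1, 2]]
labels = ["peak", "valley"]
print(classify_1nn([0, 1, 2, 2, 1, 0], train, labels))  # peak
```

Every query costs one quadratic-time DTW computation per training series, which is the speed issue that a single averaged template per class is meant to reduce.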

Each training class is then characterized by a single sequence, so an incoming series is compared with only one averaged template per class. Other techniques include “ARMA models or HMM” [18]; the latter can be enhanced by using discriminative HMMs to increase inter-class differences, or by using probabilistic transitions between fewer states.

A double-loop EM algorithm with a Mixture of Experts network structure has been used to identify epileptic seizures from EEG signals; overtraining on the dataset, however, leads to an inefficient model [19]. To improve the data selection process in the self-training phase of classification, a “time series reduction” process was proposed which extracts patterns that can be used as inputs to “traditional machine learning algorithms” [20].

4. SEGMENTATION

Segmentation, or summarization, aims to create an accurate approximation of a time series by reducing its dimensionality while retaining its essential features. The objective is to minimize the reconstruction error with respect to the original time series. A common technique is “Piecewise Linear Approximation (PLA)” [21], which divides the series into segments and fits a polynomial model to each segment. There are three basic approaches. In the “sliding window” approach, a segment grows until it exceeds a maximum error threshold specified by the user.

The sliding window approach, however, has shown poor performance on real-world datasets. The top-down approach partitions the time series iteratively until a stopping criterion is met, with a lower time complexity [22]. The third (bottom-up) approach starts from the finest approximation and merges segments iteratively. “Fast greedy algorithms” have been proposed to improve on these traditional approaches, together with statistical methods for selecting the segments.
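The sliding window approach described in this section can be sketched as below. The error measure (sum of squared residuals of a least-squares line) and the names `fit_error` and `sliding_window_segments` are assumptions for this example; the cited works may use a different error criterion.

```python
import numpy as np

def fit_error(y):
    """Sum of squared residuals of the least-squares line through y."""
    if len(y) < 3:
        return 0.0  # one or two points are always fitted exactly
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    return float(np.sum((y - (slope * t + intercept)) ** 2))

def sliding_window_segments(series, max_error):
    """Split a series into (start, end) index pairs, end exclusive.

    A segment keeps growing while its linear fit stays within max_error.
    """
    y = np.asarray(series, dtype=float)
    segments = []
    start = 0
    while start < len(y):
        end = start + 1
        # Try to absorb one more point; stop once the error budget is exceeded.
        while end < len(y) and fit_error(y[start:end + 1]) <= max_error:
            end += 1
        segments.append((start, end))
        start = end
    return segments

# A rising ramp followed by a falling ramp yields two segments.
print(sliding_window_segments([0, 1, 2, 3, 4, 3, 2, 1, 0], max_error=1e-9))
# [(0, 5), (5, 9)]
```

Because each segment boundary is decided using only the points seen so far, the method works online, but it cannot revise an earlier boundary.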


5. PREDICTION

For time series data, the prediction task aims to model the dependencies between variables explicitly in order to forecast the next values of a series. Prediction is one of the most extensively applied tasks across the fields of research that deal with time series. Each field tends to focus on its own learning methods, whose accuracy depends on statistical components and on statistical learning [23].

Many prediction techniques have been applied across applications. AR models have long been used for prediction tasks such as signal denoising and dynamic systems modeling. Neural networks have been used together with a cluster-based function approximation process. A polynomial architecture has been developed to improve multilayer neural networks by reducing higher-order terms to a simple product of linear functions, and the SOM algorithm has been used within an efficient supervised architecture [24, 25].
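To make the AR approach concrete, the following minimal sketch fits an AR(p) model with an intercept by ordinary least squares and produces a one-step forecast. Least squares is an assumption here (other estimators exist), and the names `fit_ar` and `predict_next` are hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Fit x[t] = c + phi_1*x[t-1] + ... + phi_p*x[t-p] by least squares.

    Returns the array [c, phi_1, ..., phi_p].
    """
    x = np.asarray(series, dtype=float)
    if p < 1 or len(x) <= p:
        raise ValueError("need p >= 1 and more than p observations")
    # Each row holds a constant term and the p most recent lags, newest first.
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    coef, *_ = np.linalg.lstsq(np.array(rows), x[p:], rcond=None)
    return coef

def predict_next(series, coef):
    """One-step-ahead forecast from the last p observations."""
    x = np.asarray(series, dtype=float)
    p = len(coef) - 1
    lags = x[-p:][::-1]  # newest first, matching fit_ar
    return float(coef[0] + coef[1:] @ lags)

# Data generated exactly by x[t] = 1 + 0.5 * x[t-1].
data = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
coef = fit_ar(data, p=1)
print(coef, predict_next(data, coef))
```

On noise-free data such as this, the fit recovers the generating coefficients; on real data the residuals indicate how much structure the chosen order p leaves unexplained.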

Bayesian prediction has been applied to time series subject to discrete breaks, handling the size and duration of possible breaks with a hierarchical HMM. A “dynamic genetic programming (GP) model” tailored to forecasting streams has also been proposed, which adapts incrementally using the knowledge it has obtained. Prediction is among the tasks most frequently applied in real-world applications, such as forecasting market behavior from transactional datasets [26].

Factor forecasting can be enhanced with “targeted predictors”, which are selected using hard and soft thresholding [27]. Prediction tasks have a wide scope of applications, ranging from tourism demand forecasting to medical surveillance, where the predictive accuracy of three methods has been compared: non-adaptive regression, adaptive regression, and the Holt-Winters method.

Large-scale comparisons of machine learning models for time series forecasting have covered methods such as “multilayer perceptron and Gaussian process regression”. Such models can also make predictions over a long horizon by using their own outputs as inputs [28].
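Long-horizon prediction from a model's own outputs can be sketched independently of the model type. In the example below, `iterated_forecast` and the `one_step` callable are hypothetical names; any one-step predictor (a perceptron, a Gaussian process, an AR model) could be passed in.

```python
def iterated_forecast(history, one_step, horizon):
    """Forecast `horizon` steps ahead by feeding predictions back as inputs.

    one_step: a callable mapping the list of values so far to the next value.
    """
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step(window)
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes an input
    return forecasts

# Toy one-step model (an assumption for illustration): double the last value.
print(iterated_forecast([1.0], lambda w: 2 * w[-1], horizon=3))  # [2.0, 4.0, 8.0]
```

Since every step builds on earlier predictions, errors accumulate with the horizon, which is why long-term accuracy is usually evaluated separately from one-step accuracy.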

6. IDENTIFICATION OF ANOMALIES

Anomaly detection aims to find abnormal behavior in time series data and has applications in almost every area, for example in intrusion detection [29]. The general procedure is to first build a model of the normal behavior of a series and then treat the data that deviate from it as anomalies, which links the task to prediction. If the next values of a time series can be forecast with high accuracy, outliers can be identified in a straightforward manner and flagged as anomalies, for example using a SOM model [31].
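The prediction-based view of anomaly detection can be sketched as follows. A moving average over the recent past is used as a deliberately simple stand-in for the model of normal behavior (it is not the SOM model cited above), and the names `flag_anomalies`, `window`, and `n_sigmas` are assumptions for this example.

```python
import numpy as np

def flag_anomalies(series, window=5, n_sigmas=3.0):
    """Flag points that deviate strongly from a moving-average forecast."""
    x = np.asarray(series, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for t in range(window, len(x)):
        past = x[t - window:t]
        predicted = past.mean()  # forecast of x[t] from recent history
        # Floor the spread so a perfectly flat history does not give zero tolerance.
        tolerance = n_sigmas * max(past.std(), 1e-8)
        flags[t] = abs(x[t] - predicted) > tolerance
    return flags

readings = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1]
print(np.flatnonzero(flag_anomalies(readings)))  # [7]
```

The first `window` points are never flagged because no forecast exists for them, and an outlier temporarily inflates the spread for the points that follow it.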

A machine learning framework for detection based on “Support Vector Regression (SVR)” dynamically adapts its model of normal behavior. “Block-based One-Class Neighbor Machine and recursive Kernel-based algorithms” have also been evaluated for anomaly detection. Other approaches detect anomalies through wavelet coefficients, or take a state-based approach that applies time point clustering to the normal behavior of a data series [32].

Another definition of an anomaly is the time series discord, the subsequence that is maximally different from all the other subsequences of the series. The idea is to find the most unusual subsequences using a single parameter, the required subsequence length. The search can be carried out with two linear scans over massive datasets, but the number of anomalous subsequences has to be specified before the search is performed [33].
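The discord definition can be illustrated with a brute-force search. This sketch compares all pairs of subsequences, so it is quadratic and is not the two-scan procedure referred to above; the name `find_discord` and the use of the plain Euclidean distance are assumptions.

```python
import numpy as np

def find_discord(series, m):
    """Return (start, distance) of the length-m subsequence whose nearest
    non-overlapping neighbor is farthest away."""
    x = np.asarray(series, dtype=float)
    n_sub = len(x) - m + 1
    best_start, best_dist = -1, -1.0
    for i in range(n_sub):
        nearest = np.inf
        for j in range(n_sub):
            if abs(i - j) >= m:  # ignore overlapping (trivial) matches
                d = np.linalg.norm(x[i:i + m] - x[j:j + m])
                nearest = min(nearest, d)
        if np.isfinite(nearest) and nearest > best_dist:
            best_start, best_dist = i, float(nearest)
    return best_start, best_dist

# A repeating 0, 1 pattern with one spike at index 7.
signal = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
print(find_discord(signal, m=2))  # (6, 4.0)
```

The only parameter is the subsequence length `m`, matching the single-parameter property described above.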

Comparing time series is required by almost every “time series data mining task”, and it needs a subtle notion of similarity between series that is based on an intuitive notion of shape. The “Euclidean distance”, like the other Lp norms, has pitfalls and clearly does not reach this level of abstraction. A similarity measure should instead have the following properties [34]:

1. It should provide recognition of perceptually similar objects even though they are not mathematically identical


3. It must emphasize the salient features at both local and global scales

4. The similarity measure must be universal, allowing arbitrary objects to be identified or distinguished

5. It must abstract from distortions and be invariant to transformations
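As a small illustration of invariance to transformations, the sketch below z-normalizes each series before taking the Euclidean distance, which removes differences in offset and amplitude. This covers only those two transformations and is one common choice rather than a method prescribed by the cited work; `znorm` and `znorm_distance` are hypothetical names.

```python
import numpy as np

def znorm(series):
    """Rescale a series to zero mean and unit standard deviation."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)  # a constant series has no shape to compare
    return (x - x.mean()) / std

def znorm_distance(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    return float(np.linalg.norm(znorm(a) - znorm(b)))

shape = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
shifted_and_scaled = 10 * shape + 100  # same shape, different scale and offset

print(np.linalg.norm(shape - shifted_and_scaled))  # large raw distance
print(znorm_distance(shape, shifted_and_scaled))   # close to zero
```

Distortions along the time axis, such as local stretching, are not handled by this normalization and need an elastic measure.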

7. RESEARCH ISSUES & TRENDS

Time series data mining is a growing and stimulating field of research in which many challenges remain open. Notable trends include:

7.1 Stream Analysis

Streaming technologies have evolved in recent years with higher bandwidth capabilities, and they now produce enormous amounts of fluctuating data. Analyzing and mining such data flows is an extremely demanding task, and how to mine and manage data streams is a major research issue [35]. Most of the proposed algorithms are designed for static datasets and are not sufficiently optimized for continuous volumes of data generated every second. Several algorithms have nevertheless been implemented to handle data streams, for tasks such as clustering [32], classification [31], segmentation [4], and anomaly detection [7].
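The difference between a static and a streaming algorithm can be shown with a one-pass statistic. The sketch below keeps a running mean and variance using Welford's update, so each arriving value is processed once and never stored. It is an assumed illustration of stream processing in general, not an algorithm from the cited works, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and population variance over a stream of values."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the statistics in O(1) time."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
print(stats.mean, stats.variance)
```

Memory use stays constant regardless of how many values arrive, which is the property that batch algorithms written for static datasets usually lack.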

7.2 Convergence and Hybrid Approaches

Many new tasks can be derived with relative ease by combining existing tasks. For example, polynomial, DFT, and probabilistic methods can be used to predict values that are not yet known to the system, so that queries can be answered from forecast data. This combines prediction with query optimization over data streams.

8. CONCLUSION

After an extensive study of papers published in major journals and conferences in the area of time series data prediction, I selected the 35 best papers and illustrated the extent and scope of the applications and techniques, which are tending toward more mature solutions that can handle issues that increase computational complexity. Most “time series data mining techniques” are presently being implemented in an incredible diversity of application fields, based on human assessments. This review of the field of time series data mining primarily provides an overview of the research, addressing the core implementation components of time series systems and classifying the existing literature.

REFERENCES

1. FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994. Fast subsequence matching in time-series databases. SIGMOD Record 23, 419–429.

2. WEISS, G. 2004. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6, 1, 7–19.

3. LIN, J., KEOGH, E., LONARDI, S., LANKFORD, J., AND NYSTROM, D. 2004. Visually mining and monitoring massive time series. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 460–469.

4. WEIGEND, A. AND GERSHENFELD, N. 1994. Time Series Prediction: forecasting the future and understanding the past. Addison Wesley.

5. SHIEH, J. AND KEOGH, E. 2008. iSAX: indexing and mining terabyte sized time series. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 623–631.

6. VASKO, K. AND TOIVONEN, H. 2002. Estimating the number of segments in time series data using permutation tests. In Proceedings of the IEEE International Conference on Data Mining. 466–473.

7. SMYTH, P. 1997. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems, 648–654.

8. CHAPPELIER, J. AND GRUMBACH, A. 1996. A Kohonen map for temporal sequences. In Proceedings of the Conference on Neural Networks and Their Applications. 104–110.

9. SMYTH, P. 1997. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems, 648–654.

10. YOON, H., YANG, K., AND SHAHABI, C. 2005. Feature subset selection and feature ranking for multivariate time series. IEEE transactions on knowledge and data engineering, 1186–1198.

11. FROHWIRTH-SCHNATTER, S. AND KAUFMANN, S. 2008. Model-based clustering of multiple time series. Journal of Business and Economic Statistics 26, 1, 78–89.


12. BERKHIN, P. 2006. A survey of clustering data mining techniques. Grouping Multidimensional Data, 25–71.

13. BAGNALL, A. AND JANACEK, G. 2005. Clustering time series with clipped data. Machine Learning 58, 2, 151–178.

14. BARONE, P., CARFORA, M., AND MARCH, R. 2009. Segmentation, Classification and Denoising of a Time Series Field by a Variational Method. Journal of Mathematical Imaging and Vision 34, 2, 152–164.

15. BAKSHI, B. AND STEPHANOPOULOS, G. 1994. Representation of process trends–IV. Induction of real-time patterns from operating data for diagnosis and supervisory control. Computers & Chemical Engineering 18, 4, 303–332.

16. RODRIGUEZ, J. AND KUNCHEVA, L. 2007. Time series classification: Decision forests and SVM on interval and DTW features. In Proc Workshop on Time Series Classification, 13th International Conference on Knowledge Discovery and Data mining.

17. ZHONG, S. AND GHOSH, J. 2002. HMMs and coupled HMMs for multi-channel EEG classification. In Proceedings of the IEEE International Joint Conference on Neural Networks. 1154–1159.

18. DENG, K., MOORE, A., AND NECHYBA, M. 1997. Learning to recognize time series: combining ARMA models with memory-based learning. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 1997. CIRA’97. 246–251.

19. LOWITZ, T., EBERT, M., MEYER, W., AND HENSEL, B. 2009. Hidden Markov Models for Classification of Heart Rate Variability in RR Time Series. In World Congress on Medical Physics and Biomedical Engineering. Springer, Munich, Germany, 1980–1983.

20. RATANAMAHATANA, C. AND WANICHSAN, D. 2008. Stopping Criterion Selection for Efficient Semi-supervised Time Series Classification. Studies in Computational Intelligence 149, 1–14.

21. SHATKAY, H. AND ZDONIK, S. 1996. Approximate queries and representations for large data sequences. In Data Engineering, 1996. Proceedings of the Twelfth International Conference on. 536–545.

22. KEOGH, E., CHU, S., HART, D., AND PAZZANI, M. 2003. Segmenting time series: A survey and novel approach. Data mining in time series databases, 1–21.

23. BROCKWELL, P. AND DAVIS, R. 2002. Introduction to time series and forecasting. Springer Verlag.

24. HARRIS, R. AND SOLLIS, R. 2003. Applied time series modelling and forecasting. J. Wiley.

25. TSAY, R. 2005. Analysis of financial time series. Wiley-Interscience.

26. PESARAN, M., PETTENUZZO, D., AND TIMMERMANN, A. 2006. Forecasting time series subject to multiple structural breaks. Review of Economic Studies 73, 4, 1057–1084.

27. BAI, J. AND NG, S. 2008. Forecasting economic time series using targeted predictors. Journal of Econometrics 146, 2, 304–317.

28. AHMED, N., ATIYA, A., EL GAYAR, N., EL-SHISHINY, H., AND GIZA, E. 2009. An empirical comparison of machine learning models for time series forecasting. Econometric Reviews 29, 5, 594–621.

29. ZHONG, S., KHOSHGOFTAAR, T., AND SELIYA, N. 2007. Clustering-based network intrusion detection. International Journal of Reliability Quality and Safety Engineering 14, 2, 169–187.

30. YPMA, A. AND DUIN, R. 1997. Novelty detection using self-organizing maps. Progress in Connectionist-Based Information Systems 2, 1322–1325.

31. AHMED, T., ORESHKIN, B., AND COATES, M. 2007. Machine learning approaches to network anomaly detection. In Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques. USENIX Association, Cambridge, MA, USA, 1–6.

32. Mohammed Ali Shaik, P. Praveen, R. Vijaya Prakash, “Novel Classification Scheme for Multi Agents”, Asian Journal of Computer Science and Technology, ISSN: 2249-0701, Vol.8 No.S3, 2019, pp. 54-58

33. Mohammed Ali Shaik, T. Sampath Kumar, P. Praveen, R. Vijayaprakash, “Research on Multi-Agent Experiment in Clustering”, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-8, Issue-1S4, June 2019

34. DING, H., TRAJCEVSKI, G., SCHEUERMANN, P., WANG, X., AND KEOGH, E. 2008. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment 1, 2, 1542–1552.

35. GOLAB, L. AND OZSU, M. 2003. Issues in data stream management. ACM Sigmod Record 32, 2, 5–14.