DATA MINING WITH SOFTWARE INDUSTRY PROJECT
DATA: A CASE STUDY
Topi Haapio and Tim Menzies
Lane Department of Computer Science, West Virginia University Morgantown, WV 26506-610, USA.
ABSTRACT
Increasingly, data mining is used to improve an organization’s software process quality, e.g. effort estimation. Data is collected from projects, and data miners are used to discover beneficial knowledge. Project management data suitable for mining can, however, be difficult to collect, and the collection frequently results in a small data set. This paper addresses the challenges of working with such a small data set, and how we overcame them. The paper reports, as a case study, a data mining experiment that both failed and succeeded. While the data did not support answers to the questions that prompted the experiment, we could find answers to other related, important business questions. We offer two conclusions. Firstly, it is important to control research expectations when setting up such a study, since not all questions are supported by the available data. Secondly, it may be necessary to tune the questions to the data, and not the other way around. We offer this second conclusion cautiously since it runs counter to previous empirical software engineering recommendations. Nevertheless, we believe it may be a useful approach when studying real-world software engineering data that may be limited in size, noisy, or skewed by local factors.
KEYWORDS
Case study, software industry, business intelligence, data mining, small data set.
1. INTRODUCTION
In the software industry, business intelligence (BI) information is produced for corporate management to improve software process quality, for instance in cost or effort estimation. One of the popular processes for producing BI information is data mining. Data mining tools are employed to model the data (Pyle 1999), e.g. by regression or with trees. The models of data behavior are generated with data mining learners, using appropriate data. The data can be gathered either manually or automatically. Whereas automated data gathering processes can produce vast amounts of data in some business areas (Pyle 1999), manual data gathering usually results in smaller and noisier data sets. Moreover, whereas some areas of software engineering (SE), e.g. defect analysis, show success in data collection (Menzies et al 2007), others struggle.
In fact, it can be quite difficult both to access and to find usable project management data. For example, after 26 years of trying, less than 200 sample projects have been collected for the COCOMO database (Boehm et al 2000). The NASA project repository na60, with 60 NASA projects, was collected over the last two decades (Chen et al 2005). The reason is quite simple: whereas there are many program modules available for defect analysis purposes, there are only a few projects to which these modules belong.
Nevertheless, our data mining experiment found factors significant for the business. While the detailed results of the data mining are a topic of another publication, in this paper we focus on the other main contribution of our study: addressing the reality and the challenges software companies face in collecting and utilizing project data, and how to overcome them. We argue that such an analysis is an open and urgent question in many SE fields, since suitable real software project data requires substantial collection effort.
The rest of this paper is structured as follows. Section 2 describes the case study, presenting our data mining experiment on effort predictors in software projects. Section 3 offers five conclusions, in which the research results and their implications are discussed.
2. CASE STUDY
2.1 Research Methodology, Site and Background
In this study we applied the case study method. Here, the case study methodology is understood and applied in a broader context than that of Yin (1994). We chose an exploratory rather than an explanatory approach for pragmatic reasons, to obtain answers to the questions “what?” and “what can we do about it?” rather than to the questions “why?” or “how?”; i.e. we attempt to point out challenges relevant to the software industry and provide a solution to overcome these challenges, rather than to understand the phenomena behind the problem.
The case study is based on the data collecting practices in Tieto Corporation during 1999-2008. Tieto is one of the largest IT services companies in Northern Europe, with 16,000 employees.
In 2003, the quality executives at Tieto were concerned about the influence that software project activities other than the actual software construction and project management activities have on software project effort and its estimation accuracy; i.e. these general software project activities had not received the attention they might have required. Management hypothesized that focusing only on software construction and project management in effort estimation, and neglecting or undervaluing the other project activities, can result in inaccurate estimates.
In (Haapio 2004), we noted that much of the effort estimation work focuses on the first two of the following three parts of a typical software project work breakdown structure (WBS):
1. Software construction involves the effort needed for the actual software construction in the project’s life-cycle frame, such as analysis, design, implementation, and testing. Without this effort, the software cannot be constructed, verified and validated for the customer hand-over.
2. Project management involves activities that are conducted solely by the project manager, such as project planning and monitoring, administrative tasks, and steering group meetings.
3. All the activities in the project’s life-cycle frame that do not belong to the other two main categories can be called non-construction activities. In (Haapio 2006), the category of non-construction activities was further decomposed into seven individual, but generic, software project activities: configuration management, customer-related activities, documentation, orientation, project-related activities, quality management, and miscellaneous activities.
Accordingly, we explored how non-construction activities affect effort estimates. Our research (Haapio 2004) showed that the amount of non-construction activity effort in software projects is not only significant (median 20.4% of total project effort) but also varies remarkably between projects. Could this, in fact, be the reason for the high effort estimation inaccuracy back then (the median and mean magnitude of relative error, MdMRE and MMRE, were reported to be 0.34 and 0.36, respectively (Haapio 2004))? We experimented with data mining to find out if this was indeed the case.
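As a side note on the accuracy statistics quoted above, the MRE-based measures can be computed as in the following minimal Python sketch; the effort figures are illustrative placeholders, not the study’s data.

    import numpy as np

    # Illustrative actual and estimated project efforts in hours
    # (placeholders, not the study's data).
    actual_hours    = np.array([1200.0,  950.0, 4300.0,  760.0])
    estimated_hours = np.array([ 900.0, 1100.0, 3000.0,  800.0])

    # Magnitude of relative error per project: |actual - estimate| / actual.
    mre = np.abs(actual_hours - estimated_hours) / actual_hours

    mmre  = mre.mean()      # mean magnitude of relative error (MMRE)
    mdmre = np.median(mre)  # median magnitude of relative error (MdMRE)

    print(f"MMRE = {mmre:.2f}, MdMRE = {mdmre:.2f}")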
2.2 Data Mining Experiment
In the case study, we performed a data mining experiment in which data mining techniques were used to assess the impact of different software project activities and to find predictors for the effort of different project activities. The learners used in the data mining come from the WEKA toolkit software (Holmes et al 1994). In this paper, the focus is on the challenges the software industry can encounter in such software data mining; we present the mining experiment and its generalized results in the following. Details of the experiment and its results will be published separately.
Data mining starts with data preparation. We applied the data preparation guidelines given by Pyle (1999, 2003) whenever possible. However, Pyle’s guidelines on choosing data are, to our mind, intended for a large data source and for choosing sample records from that large source, not for a small data set like ours. In our case, we collected every record (project data) that we had access to during 2003-2008 and that had sufficient and relevant data. Although there are hundreds of projects on-going in the company in question, most of the project data is inaccessible or not usable for data mining purposes.
The gathered data set consisted of 32 custom software development and enhancement projects which took place in 1999-2006. These projects were delivered to five different Nordic customers who operate mainly in the telecommunication business domain. The delivered software systems were based on different technical solutions. The duration of the projects was between 1.9 and 52.8 months. The projects required effort between 276.5 and 51,426.6 hours. The normal work iteration was included in the effort data. Effort caused by change management, however, was excluded from the data because change management effort and costs are not included in the original effort estimation and the realized costs. The effort estimation information was gathered for this research from the tender proposal documents, the contract documents, and the final report documents.
Pyle (1999, 2003) notes that the collected records should be distributed into three representative data sets: training, testing, and evaluation. Due to our small data set, this requirement was unreasonable. Instead, we performed the data mining with one single data set, using a 2/3:1/3 train/test cross-validation procedure, repeated ten times.
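A minimal sketch of such a repeated 2/3:1/3 train/test procedure, using scikit-learn in place of the Weka tooling the study actually used; the feature matrix X and response y below are random placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(32, 11))   # placeholder: 32 projects, 11 predictors
    y = rng.normal(size=32)         # placeholder: one continuous response

    scores = []
    for repeat in range(10):        # ten repetitions of the 2/3:1/3 split
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, random_state=repeat)
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))  # R^2 on the held-out third

    print("mean held-out R^2:", np.mean(scores))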
For the 32 projects in question, we gathered available variables common and related to most software projects. Our 11 predictors (also, independent variables or inputs) include project, organization, customer, size (effort), and non-construction activity related variables. Our response variables (also, dependent or output variables) were originally a set of continuous classes representing the effort proportions of the three major software project categories (software construction, project management, non-construction activities) and the six further decomposed non-construction activities. We excluded the ‘miscellaneous activities’ response class for two reasons: first, the frequency with which these activities appeared in a project was rather low (28.1%) and second, ‘miscellaneous activities’, having no common denominator, is a ‘dump’ category.
Pyle (1999, 2003) also gives instructions on missing values and data translations. We manipulated our data as little as possible: e.g. all missing values were left missing and were not replaced by a mean or median value. The values for the effort estimate in hours were recalculated on a log scale, which is a usual action (Chen et al 2005), and further derived into two predictors.
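The log recalculation of the effort estimate can be sketched as follows; the table and the two derived predictors are hypothetical, since the exact derivation used in the study is not detailed here.

    import numpy as np
    import pandas as pd

    # Hypothetical project table; only the estimated-effort column matters here.
    projects = pd.DataFrame({"estimated_effort_h": [276.5, 1200.0, 51426.6]})

    # Log-transform the raw hours (a usual step for skewed effort data), then
    # derive two predictors from the transformed value (illustrative choice).
    projects["log_effort"]    = np.log(projects["estimated_effort_h"])
    projects["log_effort_sq"] = projects["log_effort"] ** 2

    print(projects)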
Following the guidelines by Pyle (1999, 2003), we visualized the data (the predictors with respect to each response) using Weka’s visualization features, including cognitive nets, also referred to as cognitive maps (Pyle 1999). However, no prominent predictor for the responses could be found by applying these visualization methods.
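As a rough stand-in for Weka’s visualization features, each predictor can be plotted against a response to look for any obviously dominant predictor; the data and column names below are placeholders.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(32, 4)),
                      columns=["pred_1", "pred_2", "pred_3", "response"])

    # One scatter plot per predictor versus the response, to eyeball whether
    # any predictor stands out visually.
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, col in zip(axes, ["pred_1", "pred_2", "pred_3"]):
        ax.scatter(df[col], df["response"], s=15)
        ax.set_xlabel(col)
        ax.set_ylabel("response")
    plt.tight_layout()
    plt.show()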
After data preparation we, encouraged by the results in other studies (e.g. Hall and Holmes 2003; Chen et al 2005), employed Feature Subset Selection (FSS) to analyze the predictors before classifying them. We applied Weka’s Wrapper feature subset selector, based on Kohavi and John’s (1997) Wrapper algorithm, since experiments by other researchers strongly suggest that it is superior to many other variable pruning methods.
Our FSS used ten-fold cross-validation for accuracy estimation. The selection search was used to produce a list of attributes to be used for classification. Attribute selection was performed using the training data and the corresponding learning scheme for each machine learner. As a result, the Wrapper performed better on discrete classes than on continuous classes.
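A sketch of wrapper-style feature subset selection with ten-fold cross-validation; scikit-learn’s SequentialFeatureSelector is used here as a rough analogue of Weka’s Wrapper selector, and X, y are placeholders.

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(2)
    X = rng.normal(size=(32, 11))   # placeholder: 32 projects, 11 predictors
    y = np.array([0, 1] * 16)       # placeholder discretized response

    # Wrap the learner that will later do the classification and score candidate
    # feature subsets with 10-fold cross-validation, as in a wrapper approach.
    selector = SequentialFeatureSelector(
        GaussianNB(), n_features_to_select=3, direction="forward", cv=10)
    selector.fit(X, y)

    print("selected predictor indices:", np.flatnonzero(selector.get_support()))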
After FSS, we applied a range of learners to find predictors for software project activity effort. First, we applied three learners provided by Weka for the analysis of the continuous classes:
• Function-based Multilayer Perceptron (Quinlan 1992a; Ali and Smith 2006) and Linear Regression.
• Tree-based M5-Prime (M5P) by Quinlan (1992b).
None of these learners applied to the continuous response classes resulted in acceptable coefficients. Since continuous methods were not apparently helpful in this domain, we moved on to discretizations of the continuous space, dividing all our output variables into two discretized groups (‘under’ and ‘over’) according to their median values.
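The median split into ‘under’ and ‘over’ groups can be sketched as follows; the response values are illustrative, and ties with the median are put into ‘under’ here as an arbitrary choice.

    import numpy as np
    import pandas as pd

    # Illustrative continuous response, e.g. the proportion of quality
    # management effort per project (not the study's data).
    qm_share = pd.Series([0.02, 0.05, 0.11, 0.03, 0.08, 0.20, 0.07, 0.04])

    median = qm_share.median()
    qm_class = np.where(qm_share <= median, "under", "over")

    print("median:", median)
    print("discretized classes:", list(qm_class))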
For the discretized response classes, we applied a range of learners provided by Weka:
• Bayes-based Naïve Bayes (Hall and Holmes 2003; Ali and Smith 2006; Menzies et al 2007) and AODE (Witten and Frank 2005).
• Function-based Multilayer Perceptron.
• Tree-based J48, a Java implementation of Quinlan’s (1992a) C4.5 algorithm.
• Rule-based JRip, a Java implementation of Cohen’s (1995) RIPPER rule learner, and Holte’s (1993) OneR (Nevill-Manning et al 1995; Ali and Smith 2006).
Results were averages across a 2/3:1/3 cross-validation study, repeated ten times for each of the two learners and for each of nine possible target variables (in the case of the nine repeats, one output variable was selected and the other eight were removed from the data set). The prediction outcomes of the discretized classes were presented with confusion matrices. An ideal result has no false positive or false negative matrix values. In practice, such an ideal outcome happens very rarely, due to data idiosyncrasies. Hence, we were satisfied with an ‘acceptable result’ of most of the results on the diagonal with only a few off-diagonal entries. An ‘acceptable result’ appeared for only one response class, i.e. when the learners predicted the level of one of the non-construction activities, namely quality management, apparent in a project, whereas no ‘acceptable result’ appeared for the non-construction activities category as a whole. The data mining experiment result, interesting for both software business and quality management in general, was the following prediction: if a project is estimated to have a smaller effort than the median effort, the relative quality management effort will realize over its median value, and vice versa, if a project is estimated to have a larger effort than the median effort, the relative quality management effort will realize under its median value. We will return to the details of the experiment and its results in another publication.
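The evaluation step for one discretized response can be sketched as below, with scikit-learn’s GaussianNB and DecisionTreeClassifier standing in for the Weka learners named above and confusion matrices summed over ten 2/3:1/3 splits; X and y are placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(3)
    X = rng.normal(size=(32, 11))           # placeholder predictors
    y = np.array(["under", "over"] * 16)    # placeholder discretized response

    for name, clf in [("naive bayes", GaussianNB()),
                      ("decision tree", DecisionTreeClassifier(random_state=0))]:
        total = np.zeros((2, 2), dtype=int)
        for repeat in range(10):            # ten repetitions of the 2/3:1/3 split
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=1/3, random_state=repeat, stratify=y)
            pred = clf.fit(X_tr, y_tr).predict(X_te)
            total += confusion_matrix(y_te, pred, labels=["under", "over"])
        # An 'acceptable result' would put most counts on the diagonal.
        print(name)
        print(total)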
2.3 Analysis of Data Collection Practices at the Research Site
During the decade in which the data sample was collected (1999-2008), the fast pace of company acquisitions led to a shortage of good-quality project management data at the research site. The acquired companies had their own project management data systems and their own practices for how the data was structured. Also, most acquired companies remained independent subsidiaries, with no or very limited visibility into project management data outside the specific subsidiary.
The work breakdown structures created for the projects were based more on invoicing than on effort data utilization purposes, and in many cases the structures were proposed by the customer to assist their budgeting and cost control needs. Recently, investments in data utilization have increased as the company strives toward the highest CMMI maturity levels.
3. CONCLUSION
Based on our experience, we offer the following five conclusions. Firstly, based on our case study research site analysis, we recommend that software companies strive to remove the barriers related to project management data. In particular, the use of one corporate-wide project management data system, conformity of data structures within the system, and transparent data promote the success of data mining. In practice, however, it might be difficult to remove all barriers.
Secondly, a general result for industry is that we were able to draw conclusions despite a severe shortage in the amount of available data. Our pre-experimental concern was that we lacked sufficient data to make useful conclusions. For this study we collected project data over the last five years from a large North European software company. Even after an elaborate historical data collection covering the period 1999-2007, we could only find data on 32 projects. Nevertheless, even this small amount of data was sufficient to learn a useful effort predictor, beneficial BI knowledge for software business and quality management.
Thirdly, discretization revealed a signal in the qualitative space that was invisible in the quantitative space. In fact, it can be easier to find a dense target than a diffuse one. Median discretization batches up diffuse signals into a small number of buckets, in our case, two. Hence, we recommend that if a failure in continuous modeling is observed, discrete modeling can still turn out to be successful.
Fourthly, the expectations for the research need to be carefully managed at the start of a data mining experiment. If we cannot find the answers the stakeholder that commissioned the research wants to hear, but we can find other important factors, we do not want disappointment over the former to blind our commissioner to the value of the latter.
Fifthly, it can be useful to allow for a redirection halfway through a study. Just because a data set does not support answers to question X does not mean it cannot offer useful information about question Y. Hence, we advise tuning the question to the data and not the other way around. This advice is somewhat at odds with standard empirical SE theory and literature, which advises tuning data collection according to the goals of the study (van Solingen and Berghout 1999). For example, in the Goal/Question/Metric (GQM) paradigm (Basili and Rombach 1988), data collection is designed as follows (van Solingen and Berghout 1999):
1. Conceptual level (goal): A goal is defined for an object for a variety of reasons, with respect to various models of quality, from various points of view and relative to a particular environment.
2. Operational level (question): A set of questions is used to define models of the object of study and then focuses on that object to characterize the assessment or achievement of a specific goal.
3. Quantitative level (metric): A set of metrics, based on the models, is associated with every question in order to answer it in a measurable way.
In an ideal case, we can follow the above three steps. However, the pace of change in both the software industry and SE organizations in general can make this impractical, as it did at NASA’s Software Engineering Laboratory (SEL), for instance. With the shift from in-house production to outsourced, external-contractor production, and without an owner of the software development process, each project could adopt its own structure. The SEL’s earlier learning-organization experimentation became difficult since there was no longer a central model to build on (Basili et al 2002). The factors that led to the demise of the SEL are still active. SE practices are quite diverse, with no widely accepted or widely adopted definition of ‘best’ practices. The SEL failed because it could not provide added value when faced with software projects that did not fit its preconceived model of how a software project should be conducted. In the 21st century, we should not expect to find large uniform data sets where fields are well defined and filled in by the projects. Rather, we need to find ways for data mining to be a value-added service, despite idiosyncrasies in the data. Consequently, we recommend GQM when data collection can be designed before a project starts its work. Otherwise, as done in this paper, we recommend using data mining as a microscope to closely examine the data. While it is useful to start such a study with a goal in mind, an experimenter should be open to stumbling over other hypotheses.
To summarize, the main conclusion of our study is that the question must be tuned to the data. In the modern agile and outsourced SE world, many of the premises of prior SE research no longer hold. We cannot assume a static, well-structured domain where consistent data can be collected from multiple projects over many years for research and other utilization purposes. We need to recognize that data collection has its limits in organizations driven by minimizing costs and in which customers are the main stakeholder behind data collection. Thus, we conclude with a recommendation to analyze the initial standpoint for data mining, and then either continue with GQM or, as in our case, use data mining for data examination.
ACKNOWLEDGEMENT
REFERENCES
Ali, S. and Smith, K., 2006. On learning algorithm selection for classification. In Applied Soft Computing, Vol. 6, No. 2, pp. 119-138.
Basili, V. and Rombach, H., 1988. The TAME Project: Towards Improvement-Oriented Software Environments. In IEEE
Transactions on Software Engineering, Vol. 14, No. 6, pp. 758-773.
Basili, V. et al, 2002. Lessons learned from 25 years of process improvement: the rise and fall of the NASA software engineering laboratory. Proceedings of the 24th International Conference on Software Engineering (ICSE’02), Orlando, USA, pp. 69-79.
Boehm, B. et al, 2000. Software Cost Estimation with COCOMO II. Prentice-Hall, Upper Saddle River, USA.
Chen, Z. et al, 2005. Finding the Right Data for Software Cost Modeling. In IEEE Software, Vol. 22, No. 6, pp. 38-46.
Cohen, W., 1995. Fast Effective Rule Induction. Proceedings of the 12th International Conference on Machine Learning, Tahoe City, USA, pp. 115-123.
Haapio, T., 2004. The Effects of Non-Construction Activities on Effort Estimation. Proceedings of the 27th Information
Systems Research in Scandinavia (IRIS’27), Falkenberg, Sweden, [13].
Haapio, T., 2006. Generating a Work Breakdown Structure: A Case Study on the General Software Project Activities.
Proceedings of the 13th European Conference on European Systems & Software Process Improvement and Innovation (EuroSPI’2006), Joensuu, Finland, pp. 11.1-11.
Hall, M. and Holmes, G., 2003. Benchmarking attribute selection techniques for discrete class data mining. In IEEE
Transactions on Knowledge and Data Engineering, Vol. 15, No. 6, pp. 1437-1447.
Holmes, G. et al, 1994. WEKA: A Machine Learning Workbench. Proceedings of the 1994 Second Australian and New
Zealand Conference on Intelligent Information Systems, Brisbane, Australia, pp. 357-361.
Holte, R., 1993. Very simple classification rules perform well on most commonly used datasets. In Machine Learning, Vol. 11, pp. 63-91.
Kohavi, R. and John, G., 1997. Wrappers for feature subset selection. In Artificial Intelligence, Vol. 97, No. 1-2, pp. 273-324.
Menzies, T. et al, 2007. Data Mining Static Code Attributes to Learn Defect Predictors. In IEEE Transactions on
Software Engineering, Vol. 33, No. 1, pp. 2-13.
Nevill-Manning, C. et al, 1995. The Development of Holte’s 1R Classifier. Proceedings of the Second New Zealand
International Two-Stream Conference on Artificial Neural Networks and Expert Systems, Dunedin, New Zealand, pp.
239-242.
Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Inc., San Francisco, USA.
Pyle, D., 2003. Data Collection, Preparation, Quality, and Visualization. In The Handbook of Data Mining, pp. 365-391, Lawrence Erlbaum Associates, Inc., Mahwah, USA.
Quinlan, R., 1992a. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.
Quinlan, R., 1992b. Learning with Continuous Classes. Proceedings of the 5th Australian Joint Conference on Artificial
Intelligence, Hobart, Tasmania, pp. 343-348.
van Solingen, R. and Berghout, E., 1999. The Goal/Question/Metric Method. McGraw-Hill Education, London, UK.
Witten, I. and Frank, E., 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, Los Altos, USA.