
Common Pitfalls in Data Science Projects: Goal Setting and Risk Management



ASSIGNMENT 03

COURSE: FUNDAMENTALS OF DATA SCIENCE PROJECT PLANNING, CLASS A

PREPARED BY:

Aisyah Kirana Putri Isyanto (21083010065)

COURSE LECTURER:

Kartika Maulida Hindrayani, S.Kom., M.Kom.

DATA SCIENCE STUDY PROGRAM
FACULTY OF COMPUTER SCIENCE
UNIVERSITAS PEMBANGUNAN NASIONAL “VETERAN” JAWA TIMUR

2023


1. Summary of Chapter 9: Common Pitfalls of Data Science Projects

Avoiding the common risks of data science projects

The biggest risk in a data science project is failing to define the right goals, which can lead the team to build the wrong solution. To avoid this, the team should set specific, quantifiable goals that make it possible to distinguish a correct solution from an incorrect one.

To ensure that the project objectives are properly defined, use a checklist that covers: measurable business metrics; the technical metrics the business cares about (for example, pairing a business metric such as monthly churn rate with a correlated technical metric such as ROC AUC); a definition of the task in data science terminology; an understanding of the details of the process and problem domain; the availability of all necessary data sources; and documentation.

One common risk in data science projects is that changing experimental results can change the direction of the project. It is therefore important to record every experiment together with the decisions and conclusions drawn from it, and to track any changes in the data and requirements. Being aware of the general risks of data science projects helps avoid mistakes, but the devil is in the details.

Approaching research projects

A research project involves solving a new problem, and most data science projects include a research sub-project for modeling. However, research projects suffer from a lack of boundaries and external constraints. Boundaries must be clearly defined to ensure the project can be completed, while external constraints bound the budget, accuracy, and duration of the research. To plan a research project effectively, assess the team's research capacity and build an experiment backlog with team members. Each entry in the backlog should contain an idea for improving the model and meeting functionality goals, and should follow the SMART criteria.

SMART-compliant entries should have deadlines, links to data sets, computational resources, and metric recommendations.
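For illustration, such an entry can be captured as a small structured record. The schema below is a hypothetical example of how the fields listed above might be organized, not one prescribed by the chapter:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExperimentBacklogEntry:
    """One SMART entry in the research backlog (hypothetical schema)."""
    idea: str          # Specific: what to try and why
    metric: str        # Measurable: metric used to judge success
    target: float      # Achievable/Relevant: expected metric value
    dataset_uri: str   # Link to the dataset the experiment needs
    compute: str       # Computational resources required
    deadline: date     # Time-bound: when results are due

entry = ExperimentBacklogEntry(
    idea="Add TF-IDF features to the baseline logistic regression",
    metric="ROC AUC on the validation split",
    target=0.85,
    dataset_uri="s3://project-data/tickets-v3.parquet",
    compute="1 GPU node, ~2 hours",
    deadline=date(2023, 11, 30),
)
```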

It is also important to prioritize experiments according to their expected results. Some experiments may take a long time, but quick initial tests can save time. If unsure about prioritization, it is better to perform a quick initial quality check for each experiment in the current research iteration to estimate the time required. Make sure to track all the details of each experiment, such as input data, experiment times, code versions, output files, model parameters, and metrics. You can use shared documents or an experiment tracking framework to manage this information; a tracking framework becomes even more valuable when there are many experiments to run.
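A shared document is often enough at first. The sketch below shows one minimal way to log the details listed above to a JSON-lines file; the file name, record fields, and the use of a git commit hash as the code version are assumptions for illustration, and dedicated experiment tracking frameworks provide the same idea with more tooling:

```python
import json
import subprocess
from datetime import datetime, timezone

def log_experiment(params: dict, metrics: dict, output_files: list[str],
                   log_path: str = "experiments.jsonl") -> None:
    """Append one experiment record: inputs, time, code version, outputs."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Code version: current git commit hash (assumes a git repository)
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "params": params,          # model parameters and input data paths
        "metrics": metrics,        # evaluation metrics for later comparison
        "output_files": output_files,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_experiment(
    params={"input_data": "data/train-v3.csv", "max_depth": 6},
    metrics={"roc_auc": 0.83},
    output_files=["models/gbm-2023-10-01.pkl"],
)
```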

In order to ensure that experiments are reproducible, it is important to follow certain criteria such as:

1. Input data is easily accessible and can be discovered by anyone on the team.

2. The experiment code can be run on input data without errors.

3. You don't need to enter undocumented configuration parameters to run the experiment. All of the configuration variables are fixed in an experiment configuration (see the sketch after this list).

4. The experiment code contains documentation and is easily readable.

5. The experiment output is consistent.

6. The experiment output contains metrics that you can use for comparison with other experiments.

7. Conclusions from experiment results are present in the documentation, comments, or output files.
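Criterion 3 in particular means every tunable value should live in one versioned configuration file rather than in someone's shell history. A minimal sketch of this idea, assuming a JSON config file whose name and keys are hypothetical:

```python
import json
import random

import numpy as np

def load_config(path: str = "experiment_config.json") -> dict:
    """All configuration variables are fixed in one documented, versioned file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def set_seeds(seed: int) -> None:
    """Fix random seeds so repeated runs produce consistent output."""
    random.seed(seed)
    np.random.seed(seed)

# Example config contents (hypothetical):
# {"seed": 42, "input_data": "data/train.csv", "model": {"max_depth": 6}}
config = load_config()
set_seeds(config["seed"])
```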


Consistency of experiment results and comparison metrics are also key factors, and the conclusions drawn from the results should be recorded in the documentation, comments, or output files. To avoid pitfalls, define clear goals and success criteria, and set boundaries and constraints on time and budget. Fill out the experiment backlog and prioritize it based on expected results. Track all experiments and their data, make the code reproducible, and document your findings.

Dealing with prototypes and MVP projects

When it comes to prototyping in data science, it is important to approach each prototype as a minimum viable product (MVP). An MVP should have just enough core features to demonstrate a working solution, with additional features and enhancements implemented later. Focusing on core features does not mean the prototype must lack a polished user interface or striking data visualizations, if those will be the main strengths of the final product.

To identify the core features of the product, it is necessary to consider the target market and processes, and the problem that needs to be solved. By evaluating what features are essential for achieving the desired goal and what are not, it becomes possible to differentiate the MVP from competitors. It is important to have a differentiating feature or solve a differentiated problem, as this will allow the MVP to compete with existing products in the market.

Mitigating risks in production-oriented data science systems

End-to-end data science projects follow an iterative lifecycle and carry risks that need to be addressed. The first risk is the constant flow of changing requirements, which can be managed with Scrum tools. However, both the customer and the team must understand the process behind these tools for them to be effective.

Another risk is unexpected bugs caused by a lack of automated testing. Online testing is also important for assuring quality and monitoring performance changes in production. If production deployment is not planned in advance, complex issues related to nonfunctional requirements may arise.
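As a small illustration of the automated-testing point, a pytest-style quality gate can fail the build when model quality regresses. The synthetic data, threshold, and function names below are hypothetical stand-ins, not part of the chapter:

```python
# test_model_quality.py -- run with `pytest` (names below are hypothetical)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_and_score(X, y, seed: int = 42) -> float:
    """Train the candidate model and return its validation ROC AUC."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

def test_model_beats_quality_threshold():
    # Synthetic stand-in for the project's real validation dataset
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0
    assert train_and_score(X, y) > 0.8  # fail the build on quality regression
```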

System design and software architecture need to be considered from the beginning of the project. If key stakeholders do not see the benefits of the system, look for errors in the goal definition. Project objectives sometimes change halfway through, and the customer's views may shift, so keep checking that the task is still important and that the chosen solution method is still appropriate.

In the following table, we have enumerated the common risks and their solutions to sum up this chapter:

Risk group | Risk | Solution
Common | Vague goal definition. | Make sure that the goal's definition is complete and includes all the items from the checklist in this chapter.
Common | The project goal is not quantifiable. | Define quantifiable business metrics that can be understood by the customer. Define one or several technical metrics that correlate with your business metrics.
Common | Decision-making without keeping track of the record. | Document every major decision and conclusion you make throughout the project. Fix data and code versions in order to reproduce the results that led to your decisions.
Research | The team can't reproduce the experiment's results. | Track the experiment's results and data, along with the code.
Research | The research has no scope and plan of action. | Plan ahead using the research backlog. Prioritize entries in the research backlog and periodically check whether there are any obsolete entries that should be removed. Assess your expectations of each experiment by performing quick tests, if possible.
MVP | The prototype does not show how to solve the user's problem. | Think about every prototype as an MVP that solves your customers' problems. Define your scope by taking the minimum amount of functionality required to solve your customers' problems into account.
MVP | The MVP includes too many unnecessary features that take time to develop. | Use feature analysis to define the MVP scope.
MVP | The MVP takes a lot of time to develop. | If your team makes a lot of MVPs, think about creating rapid prototyping frameworks and project templates to speed up the development process.
Project development | The customer is constantly pushing the team to make urgent scope changes. | Advocate for the use of Agile development methodology for your project. Track project scope changes to show how they affect project deadlines.
Project development | The customer does not see how the system solves their problem. | Constantly review your project goals and make sure that your way of solving the problem has been confirmed by the customer.
Project development | New changes introduce a lot of bugs. | Write automated tests.
Production deployment | The model's quality is degraded in production and the system has no tools to solve this problem. | Develop an online testing module to track metrics in production. Validate incoming data. Periodically retrain your models on new data.
Production deployment | The system is not suitable for production usage. | Fix functional and nonfunctional requirements for your system. Prepare an architecture vision that provides a production-ready system design.
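As a sketch of the "validate incoming data" solution from the last rows of the table, simple schema and range checks can run on every batch before scoring. The expected columns and bounds here are assumptions for illustration:

```python
import pandas as pd

# Expected schema for scoring requests (hypothetical columns and bounds)
EXPECTED_COLUMNS = {"age": "int64", "income": "float64"}
VALUE_BOUNDS = {"age": (0, 120), "income": (0.0, 1e7)}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found in an incoming batch; empty means OK."""
    problems = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"wrong dtype for {col}: {df[col].dtype}")
    for col, (lo, hi) in VALUE_BOUNDS.items():
        if col in df.columns and not df[col].between(lo, hi).all():
            problems.append(f"out-of-range values in {col}")
    return problems

batch = pd.DataFrame({"age": [34, 150], "income": [52000.0, 61000.0]})
print(validate_batch(batch))  # ['out-of-range values in age']
```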


2. News references/journal articles on successful and failed data science projects

FAILED: Why Big Data Science & Data Analytics Projects Fail (datascience-pm.com)

There are many examples of failure in the fields of big data, data science, and data analytics.

For example,

● 85% of big data projects fail (Gartner, 2017)

● 87% of data science projects never make it to production (VentureBeat, 2019)

● “Through 2022, only 20% of analytic insights will deliver business outcomes” (Gartner, 2019)

This raises the question of why data science and analytics projects so often fail. One of the main reasons is the challenge of managing and processing data, but there are bigger issues that need to be addressed. To learn how to overcome these challenges and achieve better results in data science, let's explore some big data failure examples and what drives these failures.

8 Reasons Why Big Data Science and Analytics Projects Fail

1. Not having the Right Data

Data is a crucial component in any data science project, but collecting, creating, or purchasing data can pose its own set of challenges. Securing the data, ensuring its lack of bias, and using it ethically and legally are important considerations. Additionally, processing the data efficiently and at a reasonable cost, as well as cleaning and monitoring it for changes over time, can be arduous tasks. According to a 2020 survey by the International Data Corporation, the lack of sufficient volume and quality of training data remains a significant development challenge.

To address these issues, it is advised to have internal protocols in place, including policies, checklists, and reviews to ensure proper data usage. It is also crucial to assume that data is dirty unless proven otherwise and to invest in building a cloud-based system for data pipelines with proactive alerts and notifications. However, building such systems can be costly, requiring the involvement of both data and cloud engineers.
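The advice to assume data is dirty unless proven otherwise can be made operational with cheap profiling checks that feed proactive alerts. A minimal sketch with hypothetical thresholds:

```python
import pandas as pd

def dirty_data_report(df: pd.DataFrame, max_null_ratio: float = 0.05) -> dict:
    """Profile a dataset and flag columns that fail basic cleanliness checks."""
    report = {}
    for col in df.columns:
        null_ratio = df[col].isna().mean()
        if null_ratio > max_null_ratio:
            report[col] = f"{null_ratio:.0%} missing values"
    if df.duplicated().any():
        report["_rows"] = f"{int(df.duplicated().sum())} duplicate rows"
    return report  # a non-empty report would trigger an alert/notification

df = pd.DataFrame({"a": [1, 1, None, None], "b": [2, 2, 3, 4]})
print(dirty_data_report(df))  # {'a': '50% missing values', '_rows': '1 duplicate rows'}
```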

2. Not having the Right Talent

Finding and keeping the best tech workers is hard, especially since data science/analytics was the second most difficult skill set to find in 2020, according to QuantHub. It is not just about finding qualified workers: putting together a full data analysis project requires many different roles, from brainstorming to running machine learning, so finding the right mix of qualified people is essential.

3. Solving the Wrong Problem

Starting a project without a clear purpose or unrealistic goals can lead to wasted time and resources. According to Domino Data Labs, many organizations have made the mistake of hiring numerous PhDs without a clear business alignment, only to realize later on that the analysis they conducted was irrelevant due to a misunderstanding of the target variable.

To avoid this, it is important to ask the right questions and understand the underlying problem before beginning a project. Rather than accepting requests at face value, it is crucial to delve deeper and gain a comprehensive understanding of the problem that needs to be solved. By doing so, projects can be started on the right track, minimizing the risk of wasted effort and ensuring meaningful value is delivered.

4. Not Deploying Value

According to a Forbes article, a mere 15% of leading companies have successfully integrated AI capabilities into their production. The reason for this lack of implementation is attributed to the deployment gap, which entails a disparity in technical expertise and motivation between model developers and those responsible for maintaining them.


To address this issue, it is advised that those in charge of IT operations and model development work collaboratively throughout the project lifecycle to ensure effective deployment. Alternatively, companies can adopt a Machine Learning Operations mindset, focusing their team's efforts on seamless integration and continuous improvement.

5. Thinking Deployment is the Last Step

In traditional project management, a project is considered complete when its scope is fulfilled, which for data science projects usually means the deployment phase.

However, this does not mean that the work ends there. There are various challenges that may arise, such as adapting to changes in data and market conditions, ensuring data quality and integrity, maintaining stakeholder interest and user adoption, and conducting necessary system updates for security and uptime. These are issues that can greatly impact the success and value of a data science project.

Therefore, it is important to plan for the unexpected and have the right resources and focus in place to address these challenges. Proactive planning and the implementation of self-healing systems can help to mitigate these potential problems and ensure that the deployed model continues to provide value beyond its initial deployment.

6. Applying the Wrong (or No) Process

Organizations often struggle with data science project management due to a lack of clear methodologies, leading to inefficient information sharing and misinformed analyses. They often apply approaches from other fields, treating data science projects like ordinary software projects, which ignores their experimental nature. The best approach is to combine the data science life cycle with an agile collaboration framework.

7. Forgetting Ethics

AI models are extremely good at optimizing whatever objective they are given. However, this can raise serious ethical, branding, and legal issues, which can arise both from malicious actions and from a lack of oversight.

Data Science Failure Examples in Ethics

● Racist Health Risk Scoring: Healthcare providers used a health risk score to help determine whether they should offer proactive healthcare treatment to each patient. Good idea, right? However, the model used healthcare costs as a proxy for health risks, and because black patients tended to have lower healthcare costs, the risk score inadvertently prioritized white patients for proactive treatment. (sciencemag.org, 2019)

● Target Predicts Teen Pregnancy: A Target advanced analytics team was tasked with predicting whether a woman was pregnant so that they could offer targeted ads. They were successful in this prediction, very successful. But they ignored the wider privacy implications, which resulted in public backlash due to the "creepiness" factor. (forbes.com, 2012)

Therefore, it is important to identify and address ethical issues early on and throughout the lifecycle of AI products. One practical step is to work through the ten ethical questions posed in the source article and develop responsible AI.

8. Overlooking Culture

The NewVantage Partners 2021 survey reveals that cultural challenges, rather than technology challenges, are the biggest impediment to successful adoption of data initiatives and business outcomes. To overcome this, it is crucial to engage those who are uncomfortable with data initiatives, educate staff, emphasize data-driven decision-making, and support everyone involved in the change management process, which is often a bigger effort than the technical changes.


SUCCEEDED: Data Science for Digital Culture Improvement in Higher Education Using K-Means Clustering and Text Analytics | Maylawati | International Journal of Electrical and Computer Engineering (IJECE) (iaescore.com)

This article explores new insights and findings on digital culture in higher education. The purpose of the study is to examine significant patterns that could improve digital culture in higher education. The research methodology involves data mining techniques, specifically the K-means algorithm, together with text analytics. The experiment uses questionnaire data collected from 2,887 respondents at Universitas Islam Negeri (UIN) Sunan Gunung Djati in Bandung.

The data analysis and clustering results show that perceived usefulness and intention to use information systems are above average, while perceived ease of use and actual system use are relatively low. In addition, when text analytics is integrated, this study identifies a congruence between the results of exploratory data analysis (EDA) and K-means clustering, which is consistent with the expectations and desires of the academic community regarding the implementation of information systems.

In the context of this research, data mining is a technique for finding important information or useful knowledge in big data. Data mining has four main approaches: classification (supervised learning), clustering (unsupervised learning), association rules, and semi-supervised learning, which combines classification and clustering. Data mining is used to discover important hidden information that supports prediction and decision-making.

Clustering is not used for prediction the way classification is; instead, it generates insights from the data that are then analyzed and interpreted by humans. K-means is one of the most widely used clustering algorithms: it minimizes the distance between points within the same cluster, and it is a simple algorithm with fast processing time that produces good clusters.
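As a hedged illustration of the method the article describes, K-means clustering of questionnaire-style scores takes only a few lines with scikit-learn. The synthetic data, feature meanings, and cluster count below are assumptions, not the study's actual setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical questionnaire scores: perceived usefulness, perceived ease of
# use, intention to use, and actual system use (Likert-style 1-5 averages)
rng = np.random.default_rng(42)
X = rng.uniform(1, 5, size=(300, 4))

# Standardize so each item contributes equally to the distance metric,
# then cluster respondents into k groups (k=3 chosen for illustration)
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

print(kmeans.labels_[:10])       # cluster assignment per respondent
print(kmeans.cluster_centers_)   # centroids in standardized score space
```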

CONCLUSION

This research comprehensively evaluates the implementation of information technology in higher education. The results of the EDA and K-means experiments show that improving digital culture so that perceived ease of use and actual system use are high requires complete and comprehensive socialization, as well as a manual guide for each information system. This matches the hopes of end users, who need information, knowledge, and guidelines for the information systems they use. The digital culture already established through users' behavioral intention to use information systems should be maintained and improved by raising the quality of those systems so that they fulfill user requirements.


3. News references/journal articles about risk analysis in data science projects

The authors of this article explain the importance of evaluating and managing risks in data science projects. Using the Delphi method, they identified potential risks associated with data science projects. Interestingly, the outcomes show that over 50% of the common risks identified in data science projects overlap with those encountered in other IT projects.

The use of data science is growing rapidly, driven by big data and social media, increased computing power, lower computer storage costs, and more powerful data analysis and modeling methods such as deep learning. However, this growth also brings new risks to social values and constitutional rights. Threats related to data science include privacy violations, social media algorithms, and the Internet of Things.

This study also describes the steps involved in identifying and analyzing risks in data science projects. These steps include selecting experts, identifying the risks associated with the project, and analyzing the identified risks. By following this process, project managers can gain a better understanding of the potential risks in Data Science projects and take necessary actions to deal with them.

The purpose of this study is to increase awareness of risks in data science projects and to contribute to the reduction of risks in similar projects in Portugal. The Delphi technique was acknowledged as a valuable research tool for this purpose.
