The Big Book of Data Science Use Cases


Fine-grained demand forecasts have the potential to capture the patterns that influence demand closer to the level at which that demand needs to be met. As we move closer to the transactional level of granularity, we must also consider the external causal factors that influence individual customer demand.

Accessing the historical data

Examining the transactional data

This pattern appears to closely follow the pattern of maximum temperatures (in degrees Fahrenheit) for the city. Collectively, the models had a root mean square error (RMSE) of 5.44 with a mean absolute percentage error (MAPE) of 0.73. (Zero-valued actuals were adjusted to 1 for the MAPE calculation.)
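For readers who want to reproduce these error metrics, here is a minimal sketch of how RMSE and MAPE with that zero-value adjustment can be computed; the arrays are purely illustrative and not the book's data.

```python
import numpy as np

def rmse(actuals, preds):
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))

def mape(actuals, preds):
    # Zero-valued actuals are replaced with 1 so the percentage error is defined,
    # mirroring the adjustment described above.
    denom = np.where(actuals == 0, 1, actuals)
    return float(np.mean(np.abs(actuals - preds) / denom))

actuals = np.array([0.0, 4.0, 10.0, 7.0])   # illustrative values only
preds = np.array([1.0, 5.0, 9.0, 6.0])
print(rmse(actuals, preds), mape(actuals, preds))
```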

Alternative approaches to forecasting supply chain demand

Implications of the results

Additional retail/CPG and demand forecasting resources

How to Build a Quality of Service (QoS) Analytics Solution for Streaming Video

The importance of quality to streaming video services

Issues with the overall quality of streaming video (including buffering, latency, pixelation, jitter, packet loss and blank screens) have significant business implications, whether through increased subscriber churn or reduced video engagement. Quality of Service (QoS) helps make sense of these data streams so you can understand what is going wrong, where, and why.

Databricks QoS solution overview

When you start streaming, you realize just how many places breaks can happen and the viewer experience can suffer, whether it's an issue at the source, on the on-premises servers or in the cloud; in transit, at the CDN or ISP level or on the viewer's home network; or at the playback level, with player/client issues. And because of the nature of television, things will go wrong during the most important, high-profile events that draw the biggest audiences.

Video QoS solution architecture

The number of "hops" in this approach is directly affected by the number of downstream consumers and the complexity of the aggregations (e.g., Structured Streaming enforces certain constraints around chaining multiple aggregations).
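As a rough illustration of the pattern in Figure 2, the sketch below materializes a single streaming aggregation into a Delta table that downstream consumers can then read as their own stream, instead of chaining aggregations in one job. The table, column and checkpoint names are assumptions rather than the solution's actual schema, and `spark` is the session provided in a Databricks notebook.

```python
from pyspark.sql import functions as F

# Hypothetical bronze table of raw player events.
events = spark.readStream.format("delta").table("qos.bronze_player_events")

# One streaming aggregation, materialized to Delta; downstream consumers read
# this table as a new stream rather than chaining further aggregations here.
agg = (events
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "1 minute"), "cdn_pop")
       .agg(F.avg("time_to_first_frame").alias("avg_ttff")))

(agg.writeStream
    .format("delta")
    .outputMode("append")  # valid because the watermark bounds the state
    .option("checkpointLocation", "/tmp/checkpoints/qos_ttff")
    .toTable("qos.silver_ttff_by_pop"))
```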

Figure 2. A streaming architecture can still be achieved while materializing DataFrames in Delta tables.

Making your data ready for analytics

Video applications events

Using this pattern allows a large number of downstream consumers to process the data in a streaming paradigm without scaling the throughput of the Kinesis stream. Both the timestamp and the message type are extracted from the JSON event to partition the data and allow consumers to select the type of events they want to process.
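A minimal sketch of that extraction and partitioning step is shown below, assuming the raw JSON arrives in a `value` column of a bronze table; the schema and table names are illustrative only.

```python
from pyspark.sql import functions as F, types as T

# Illustrative schema; the real events carry many more fields.
event_schema = T.StructType([
    T.StructField("eventType", T.StringType()),
    T.StructField("eventTimestamp", T.TimestampType()),
    T.StructField("payload", T.StringType()),
])

raw = spark.readStream.format("delta").table("qos.bronze_raw_events")

parsed = (raw
          .withColumn("json", F.from_json(F.col("value").cast("string"), event_schema))
          .select("json.*")
          .withColumn("date", F.to_date("eventTimestamp")))

# Partitioning by date and event type lets consumers read only the events they need.
(parsed.writeStream
       .format("delta")
       .partitionBy("date", "eventType")
       .option("checkpointLocation", "/tmp/checkpoints/events_by_type")
       .toTable("qos.silver_events"))
```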

CDN Logs

As a side effect of using a Delta table as a sink (which supports OPTIMIZE), we don't have to worry about how the size of the processing window affects the number of files in the target table, aka the "small files" problem. Combining a single Kinesis stream for the events with a Delta "Events" table reduces operational complexity and makes it easier to scale during peak hours.
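Compaction can then be handled separately, for example with a periodic OPTIMIZE command; the table name below is the illustrative one used in the sketches above.

```python
# Periodic compaction of the events table to keep file sizes healthy.
spark.sql("OPTIMIZE qos.silver_events ZORDER BY (eventType)")
```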

Creating the dashboard/virtual network operation centers

Video loading time (time to first frame) provides better insight into performance at individual locations of your CDN (in this case, the AWS CloudFront edge nodes), which has a direct impact on your strategy for improving this KPI, whether by distributing traffic across multiple CDNs or perhaps just implementing dynamic origin selection, in the case of AWS CloudFront, with Lambda@Edge. Is the content the cause of the quality drop, or are there delivery issues?
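A query along these lines could back the dashboard panel in Figure 7; the table and column names are assumptions carried over from the earlier sketches, and `display` is the Databricks notebook helper.

```python
ttff_by_pop = spark.sql("""
  SELECT cdn_pop,
         approx_percentile(time_to_first_frame, 0.9) AS p90_ttff,
         avg(time_to_first_frame)                    AS avg_ttff
  FROM qos.silver_events
  WHERE eventType = 'playback_start'
  GROUP BY cdn_pop
  ORDER BY p90_ttff DESC
""")
display(ttff_by_pop)  # render as a chart in the notebook / SQL dashboard
```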

Figure 7. Visualization of the results of a SQL query.

Creating (near) real-time alerts

On the content side, in-app events can provide useful information about user behavior and overall quality of experience. On top of the basic email alert use case, the demo player includes three widgets updated in real time with AWS AppSync: the number of active users, the most popular videos, and the number of users watching the same video at the same time.
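For the alerting path, publishing a threshold breach to an SNS topic (which SQS-backed microservices can then consume, as in Figure 8) might look like the sketch below; the topic ARN, metric name and threshold are placeholders.

```python
import json
import boto3

sns = boto3.client("sns", region_name="us-east-1")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:qos-alerts",  # placeholder ARN
    Subject="QoS threshold breached",
    Message=json.dumps({"metric": "buffering_ratio", "value": 0.12, "threshold": 0.05}),
)
```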

Figure 8. Integrating microservices using Amazon SNS and Amazon SQS.

Next steps: machine learning

Point-of-failure prediction and remediation

Customer churn

Getting started with the Databricks Streaming Video QoS Solution

Mitigating Bias in Machine Learning

Fairlearn

What is bias?

Where does bias come from?

This blog will not address bias in data collection, although that is an important topic and one well within technologists' ability to fix. That leaves the models, which may be the heroes, rather than the villains, of bias in machine learning.

How can models help with bias?

This example looks at two commonly used fairness metrics (later applied using the Fairlearn tool): equal opportunity and equalized odds. Equalized odds further requires that the false positive rate (FPR) also be equal between groups.
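A small, self-contained sketch of how these per-group rates can be inspected with Fairlearn's MetricFrame follows; the arrays are synthetic stand-ins, not the COMPAS data.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, true_positive_rate, false_positive_rate

# Synthetic labels, predictions and a sensitive feature (two groups, "A" and "B").
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

mf = MetricFrame(
    metrics={"TPR": true_positive_rate, "FPR": false_positive_rate},
    y_true=y_true, y_pred=y_pred, sensitive_features=group,
)
print(mf.by_group)      # equalized odds asks that both TPR and FPR match across groups
print(mf.difference())  # the gap per metric between the groups
```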

Recidivism, and the COMPAS data set

They can help quantify the effect of features such as age and race on model output, using the SHAP library. This is an intriguing middle-ground approach that attempts to surgically remove unwanted effects from model predictions by explicitly learning them first.
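As a hedged illustration of that idea, the sketch below trains a gradient-boosted model on synthetic data with COMPAS-like column names and uses SHAP's TreeExplainer to attribute each prediction to its features; the real analysis uses the actual COMPAS features and model.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the COMPAS features; column names mirror those in the text.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 70, 500),
    "priors_count": rng.integers(0, 15, 500),
    "race_African_American": rng.integers(0, 2, 500),
    "gender_Male": rng.integers(0, 2, 500),
})
y = (rng.random(500) < 0.4).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer attributes each prediction to the input features, so the average
# contribution of race_African_American quantifies its effect on model output.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(pd.DataFrame(shap_values, columns=X.columns).mean())
```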

A first model

  • Do nothing
  • Ignore demographics
  • Equalized odds with Fairlearn
  • Mitigate with SHAP values

It is clear that the model makes a positive (recidivism) prediction more often for African American defendants (right columns). For African American defendants, SHAP indicated that the model increased the predicted probability of recidivism by 1-2%.

Finding anomalous cases with SHAP

There is a group in the lower left for which race_African_American caused a notably different predicted probability of recidivism, either especially on the high or on the low side (dark or light points). There is a group near the upper left for whom gender (gender_Male) had a markedly negative effect on probability; these are not male defendants, and they almost certainly represent female defendants.

Conclusion

In this blog, we discussed the need for real-time data in retail and how to overcome the challenges of delivering it. To learn more, check out our Real-Time Point-of-Sale Analytics Solution Accelerator.

The point-of-sale system

Before the pandemic, 71% of retailers cited the lack of real-time inventory visibility as one of the top barriers to achieving their omnichannel goals. Better access to real-time information is key to meeting consumer demands in the new normal.

Real-Time Point-of-Sale Analytics

Supply chain disruptions, from reduced product offerings to reduced inventory capacity, combined with rapidly changing consumer expectations for seamless omnichannel experiences, are driving retailers to rethink how they use data to manage their operations. The pandemic only increased the demand for integrated online and in-store experiences, putting even more pressure on retailers to present accurate product availability and manage order changes on the fly.

Modern connectivity improvements have enabled more retailers to move to a centralized, cloud-based POS system, while many others are developing near-real-time integrations between in-store systems and the corporate back office. The near-real-time availability of information means that retailers can constantly update their estimates of item availability.

Near real-time insights

The company no longer manages operations against its knowledge of the inventory state as it was the day before, but instead takes action based on its knowledge of the inventory state as it is now.

Carefully consider scope

If we limited our initial scope to this one need, the information requirements for our monitoring and analytics system would be dramatically reduced. Once a real-time inventory solution is in place and the business recognizes the value, we can expand our scope to consider other needs such as promotion monitoring and fraud detection, expanding the range of information assets used with each iteration.

Align transmission with patterns of data generation and time sensitivities

No matter how the data is exposed, the classic guideline holds true: identify a compelling business justification for your solution and use that to limit the scope of the information assets you consume initially.

Land the data in stages

Manage expectations

Implementing POS streaming

Personalization is a competitive differentiator for almost every financial services institution (FSI, for short), from banks to insurance and now investment management platforms. This blog will demonstrate how to establish a strong foundation for real-time insights for financial services use cases with the Databricks Lakehouse Platform, from change data capture (CDC) of OLTP data through to the reporting dashboard.

Why streaming ingestion is critical

While every FSI wants to deliver intelligent, real-time personalization to customers, the basics are often hidden or implemented on incomplete platforms, leading to stale insights, long time to market, and lost productivity due to the need to stitch streaming, AI, and reporting services together. In this blog, we focus on streaming for FSIs and show how these capabilities help improve new product differentiators and internal insights for FSIs.

Design Patterns for Real-Time Insights in Financial Services

FSI data flow and requirements

  • User app saves and updates data such as clickstream, user updates, and geolocation data — requires operational databases
  • Third-party behavioral data is delivered incrementally via object storage or is available in a database in a cloud account — requires streaming
  • FSI has an automated process to export all database data including user updates, clickstream, and user behavioral data into the data lake — requires change data capture (CDC)
  • Data engineering teams run automated data quality checks and ensure the data is fresh — requires data quality tool and native streaming
  • Data science teams use data for next best action or other predictive analytics — requires native ML capabilities
  • Analytics engineers and data analysts will materialize data models and use data for reporting — requires dashboard integration and native visualization

Note the temporal nature of the data: all banking or credit systems have time-ordered transaction data, and a reliable data source must account for late-arriving and out-of-order data.
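In Structured Streaming, that tolerance for late and out-of-order records is typically expressed with a watermark; a minimal sketch follows, with table, column and threshold values chosen purely for illustration.

```python
from pyspark.sql import functions as F

txns = spark.readStream.format("delta").table("fsi.bronze_transactions")

# The watermark tells the engine how late a transaction may arrive and still be
# included in its window, which is how late/out-of-order data is accounted for.
hourly_spend = (txns
    .withWatermark("txn_time", "2 hours")
    .groupBy(F.window("txn_time", "1 hour"), "customer_id")
    .agg(F.sum("amount").alias("spend")))

(hourly_spend.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/hourly_spend")
    .toTable("fsi.silver_hourly_spend"))
```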

Getting started with streaming

Data ingestion

Native Structured Streaming ingestion option

Partner ingestion option

The main difference between this option and the native streaming option is Confluent's use of the Delta Lake Sink Connector. It offers low-code data ingestion for data platform teams that are familiar with streaming partners (such as Confluent Kafka) and prefer to deliver data to Delta Lake without using Apache Spark™.

File-based ingestion

Both options abstract much of the underlying logic for reading raw data and landing it in Delta Lake using COPY INTO commands. Set up the Databricks Delta Lake Sink Connector for Confluent Cloud and link it to the appropriate topic.
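For the file-based path, a COPY INTO call might look like the sketch below; the landing path, target table and options are placeholders rather than the accelerator's actual configuration.

```python
# COPY INTO is idempotent, so this command can be re-run as new files land.
spark.sql("""
  COPY INTO fsi.bronze_behavioral_events
  FROM 's3://example-bucket/behavioral/'
  FILEFORMAT = JSON
  FORMAT_OPTIONS ('inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
```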

Change data capture

APPLY CHANGES INTO is the elegant CDC API provided by DLT, handling out-of-order and late-arriving data by maintaining state internally, without requiring users to write additional code or SQL commands: APPLY CHANGES INTO live.customer_patterns_silver FROM stream(live.customer_patterns_silver_copy) KEYS (customer_id).
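The same statement can be expressed with the DLT Python API; the sketch below assumes a sequencing column named event_timestamp, which is not shown in the excerpt, and it only runs inside a Delta Live Tables pipeline.

```python
import dlt

# Declare the target streaming table, then apply the change feed to it.
dlt.create_streaming_table("customer_patterns_silver")

dlt.apply_changes(
    target="customer_patterns_silver",
    source="customer_patterns_silver_copy",
    keys=["customer_id"],
    sequence_by="event_timestamp",  # assumed column used to order late-arriving changes
)
```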

Summarizing customer preferences and simple offers

This blog highlights several facets of the data ingestion process, each of which is important for supporting the various customer personalization use cases in financial services. More importantly, Databricks supports near-real-time use cases, offering fresh insights and abstracted APIs (Delta Live Tables) for handling change data, supporting both Python and SQL out of the box.

Building Patient Cohorts With NLP and Knowledge Graphs

Determining which patients belong to a cohort, testing the sensitivity of various inclusion and exclusion criteria on sample size, constructing a control group with propensity score matching techniques: these are just some of the processes that healthcare researchers and life sciences practitioners live through every day, and that is not likely to change anytime soon.

Optimizing clinical trial protocols

Site selection and patient recruitment

Design and run better clinical trials with John Snow Labs and Databricks

NLP preprocessing

Within our Lakehouse for Healthcare, this final DataFrame will be written to the Silver layer. See below for the first three records of the Silver layer dataset: the extracted Rx-related text, its NER chunks, the applicable RxNorm code, all related codes, RxNorm resolutions, and the final drug resolution.

Patients who are prescribed Lasix between May 2060 and May 2125

Dangerous drug combinations

Patients with hypertension or diabetes with chest pain

Start building your cohorts with knowledge graphs using NLP

Customers increasingly expect access to goods delivered to their front doors within tight time windows. Given constraints such as order-specific pickup and narrow delivery windows, differences in item perishability that limit the time some items can sit in a delivery vehicle, and the variable (but generally increasing) costs of fuel and labor, easy solutions remain elusive.

Solution Accelerator

Amazon has played a big role in driving the expansion of fast home delivery, but it too has struggled to recoup costs.

Scalable Route Generation With Databricks and OSRM

Accurate route information is critical

Scalability needed for route analytics

Fine-Grained Demand Forecasting

Other studies report similar results, leading to an erosion of trust within the organization and a continuation of planning activities based on expert opinion and gut feeling. But with the rise of data science as a business capability, there is renewed interest among many organizations in revisiting how forecasts are produced.

Figure 2. A comparison of an aggregated forecast allocated to the store-item level vs. a fine-grained forecast performed at the individual store-item level.

The R language provides access to new forecasting techniques

The cloud provides access to the computational resources

Databricks brings together R functionality with cloud scalability

By combining all the records for a given store and item within a partition, we isolate all the data needed to make predictions at the store-item level in one place. Using the SparkR package, we can use the gapply function to group the data and apply a user-defined function to each partition.
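The book's example uses SparkR's gapply; for readers working in Python, the equivalent grouped-UDF pattern is applyInPandas, sketched below with a deliberately naive placeholder forecast and hypothetical table names rather than the chapter's R forecasting code.

```python
import pandas as pd

def forecast_store_item(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder logic: a naive mean "forecast" per store-item partition.
    out = pdf[["store", "item"]].iloc[:1].copy()
    out["forecast"] = pdf["sales"].mean()
    return out

history = spark.table("forecasting.store_item_history")  # hypothetical input table

forecasts = (history
             .groupBy("store", "item")
             .applyInPandas(forecast_store_item,
                            schema="store int, item int, forecast double"))

forecasts.write.mode("overwrite").saveAsTable("forecasting.store_item_forecasts")
```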

Figure 3. The relationship between the Spark DataFrame, its partitions and individual records.

Sentiment analysis

GPU-Accelerated Sentiment Analysis

To summarize, sentiment analysis provides coarse-grained information, while aspect detection provides more information about certain aspects. Sentiment analysis can be used to determine a customer's feeling about a particular product, the public's reaction to an event, and so on.
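As a point of reference, a minimal GPU-backed sentiment classifier can be stood up with the Hugging Face pipeline API as sketched below; the model name is a common public checkpoint and not necessarily the one used in this chapter, and device=0 assumes a GPU is attached.

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # common public checkpoint
    device=0,  # first GPU; use device=-1 to fall back to CPU
)

print(classifier([
    "The delivery was late and the produce was bruised.",
    "Great selection and friendly staff!",
]))
```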

Types of sentiment analysis

(Multi-)GPU-enabled inference

Dataloader

Model class

Model parallelization and GPU dispatch

Iteration loop

Scalable inference for lots of files

Using the Databricks platform, with its readily available ML runtimes and access to the latest GPUs, makes it easy to experiment with and deploy these solutions. Supermarkets operate in an extremely competitive and complex space in which all components of the customer experience require optimization, including inventory, assortment selection, pricing strategies and product relevance to segment.

Jumbo Transforms How They Delight Customers With Data-Driven Personalized Experiences

When we launched our loyalty program, the volume of new customer data points that started coming in made our data teams rethink how we optimize the customer experience at scale. We needed a platform capable of working through customer-level data and training models at a scale far beyond what we could handle on our individual machines.

However, as data volumes grew exponentially, our analytics and ML capabilities began to slow down, as we were not equipped to handle such scale. Increasing data volumes meant increased complexity and more resources needed to handle them from an infrastructure perspective.

After years on a traditional statistical software package connected to a traditional RDBMS, with analytics in Jupyter notebooks, we knew that if we wanted to make the best use of this data and deliver shopping experiences that make a difference, it was time to modernize our approach and the underlying technologies that make it possible. In addition to modernizing our infrastructure to help us thrive with big data analytics, we also needed better ways to speed up the path from concept to production, reduce onboarding time for new people, collaborate, and provide self-service access to data insights for our analysts and business users, helping deliver insights around pricing, inventory, merchandising and customer preferences.

From foundational customer loyalty to exceptional customer experiences

We would not have been able to accelerate our modernization efforts without the expert training and technical guidance from the Databricks Academy and Customer Success Engineering, who act as a direct injection of knowledge into our data science department. This deeper understanding of how to leverage all of our data has led to significant improvements in how we manage our assortment and supply chain, make strategic decisions, and better support our customers.

Excellence has no ceiling when driven by data and AI

The process begins with a unified multi-tenant architecture, which uses Delta Lake to standardize access to all relevant data sources (both historical and real-time). For example, Delta Lake also helps us build data pipelines that enable scalable, real-time analytics to reduce in-store stockouts for customers while reducing unnecessary food waste from over-ordering perishables, such as fresh produce, that won't sell.

Ordnance Survey Explores Spatial Partitioning Using the British National Grid

While some of these indexing systems were developed specifically in the context of modern geoanalytics, and are therefore usually well supported by associated libraries and practical use cases, the British National Grid indexing system was defined in 1936 and is deeply embedded in the UK geospatial data ecosystem, yet it has not been exploited and made accessible for large-scale geoanalytics. Our secondary motivation here was therefore to show that it can be used directly to optimize spatial aggregations, avoiding the need to convert UK geospatial datasets to other indexing systems.

The point-in-polygon: how hard can it be?

The (not so) hidden cost?

Work smarter, not harder!

Spatial indices (PIP as a pseudo-equivalence)

The BNG Index System

For references identifying more detailed grid resolutions below 100 km, additional x and y integer values are added after the two letters that locate a child grid square within the parent grid square hierarchy.

Why BNG?

BNG as a spatial partitioning strategy

The baseline

The reason for this is that it is evaluated in the same way as a Cartesian join. Note that none of these datasets is large enough to be called big data.

Let’s frame our problem

We ran our polygon bounding box approach for different resolutions of the BNG index system and for different dataset sizes. Spark is now able to first run a sort-merge join on the BNG index ID and significantly reduce the overall number of comparisons.

Divide and conquer

Indexes that touch the boundary of the polygon require a point-in-polygon (PIP) filter after the index-to-index join. Indexes that do not touch the boundary and belong to the polygon interior do not require additional filtering.
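A rough PySpark sketch of that split follows: the cheap equi-join runs on the index ID, and the point-in-polygon predicate is applied only to boundary cells. Table names, column names and the is_core flag are assumptions made for illustration, and the PIP check here uses Shapely rather than the chapter's own implementation.

```python
from pyspark.sql import functions as F, types as T
from shapely import wkt

@F.udf(T.BooleanType())
def pip_contains(chip_wkt, point_wkt):
    # Point-in-polygon check, needed only for cells that touch the polygon boundary.
    return wkt.loads(chip_wkt).contains(wkt.loads(point_wkt))

points = spark.table("geo.points_indexed")     # columns: point_wkt, bng_id (assumed)
polygons = spark.table("geo.polygon_mosaic")   # columns: bng_id, is_core, chip_wkt (assumed)

joined = points.join(polygons, "bng_id")       # cheap equi-join on the BNG index ID

matched = joined.where(
    F.col("is_core")                            # interior cell: no further filtering
    | pip_contains("chip_wkt", "point_wkt")     # boundary cell: run the PIP filter
)
```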

Mosaics are pretty (fast!)

To put this into context, the equivalent bounding box join at the same index resolution was performed in 2,549 seconds. If we include the preparation times, which were 600 seconds for the mosaic approach and 317 seconds for the bounding box approach, we arrive at a total corrected performance improvement of 4.5x.
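A quick back-of-the-envelope check of those figures, using only the numbers quoted above, implies the mosaic join itself took on the order of tens of seconds:

```python
# All values in seconds, taken from the text; the mosaic join time is derived.
bbox_join, bbox_prep = 2549, 317
mosaic_prep, corrected_speedup = 600, 4.5

bbox_total = bbox_join + bbox_prep                 # 2866 s end to end
mosaic_total = bbox_total / corrected_speedup      # ~637 s end to end
mosaic_join = mosaic_total - mosaic_prep           # ~37 s implied for the join alone
print(bbox_total, round(mosaic_total), round(mosaic_join))
```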

A general approach

In a recent talk, Empower Splunk and other SIEMs with Databricks Lakehouse for Cybersecurity, HSBC highlighted the limitations of legacy SIEMs and how the Databricks Lakehouse platform is transforming cyber defense. In this blog post, we'll discuss the changing IT and cyber attack threat landscape, the benefits of SIEMs, the benefits of Databricks Lakehouse, and why SIEM + Lakehouse is becoming the new strategy for security operations teams.

HSBC Augments SIEM for Cybersecurity at Cloud Scale

But the explosion of cloud footprints raises the question: are SIEMs the right strategy in a cloud-scale world? Either way, it's generally accepted that every organization is involved in some kind of cloud transformation or evaluation, even in the public sector, where onboarding new technology is not an easy decision.

Security analytics in the cloud world

This is often because their current SIEM tool sits on one cloud service while their largest data producers are on another. And over the years, they've built complicated transport pipelines to move data from one cloud provider to another.

Lakehouse + SIEM : The pattern for cloud-scale security operations

Integrating the lakehouse with Splunk

Achieving scale, speed, security and collaboration

What’s next

The challenges of data quality

City-scale data processing

How the City of Spokane Improved Data Quality While Lowering Costs

Through this collaborative solution, the City of Spokane has increased government transparency and accountability and can provide citizens with information that encourages and invites public participation and feedback. With DQLabs and Azure Databricks, the City of Spokane has also achieved a 50% lower total cost of ownership (TCO) by reducing the amount of manual labor required to classify, organize, de-identify, de-duplicate and correct incoming data, lowering the cost of maintaining and operating their information systems as data volumes increase.

How DQLabs leverages Azure Databricks to improve data quality

With the integrated golden record view, datasets became easily accessible, improving reporting and analysis.

Get started with DQLabs and Azure Databricks to improve data quality

Customer insights from geospatial data

Challenges of scaling geospatial workloads

How Thasos Optimized and Scaled Geospatial Workloads With Mosaic on Databricks

A point matches a polygon either when the index ID of the point is in the set of index IDs of the polygon and the cell corresponding to this index ID is entirely within the polygon, or when the index ID of the point is in the set of index IDs of the polygon and the cell corresponding to this index ID is only partially within the polygon (in which case an explicit point-in-polygon check against that cell is still needed).

Figure 1. Aerial photograph of an open-air mall overlayed with geofence polygons (red) corresponding to the store location and parking lots of various companies of interest

Scalability and performance with Databricks and Mosaic

The significant performance improvements gave the Thasos team confidence in the scalability of the approach, and Mosaic is now an integral part of their production pipelines. Building these pipelines on the Databricks Lakehouse Platform has not only saved Thasos cloud computing costs, but has also unlocked the ability to bring data scientists into the team to develop new intelligence products and to integrate with the broader Databricks partner ecosystem.

Getting started
