Fine-grained demand forecasts have the potential to capture the patterns that influence demand closer to the level at which that demand needs to be met. As we move closer to the transactional level of granularity, we must also consider the external causal factors that influence individual customer demand.
Accessing the historical data
Examining the transactional data
This pattern appears to closely follow the pattern of maximum temperatures (in degrees Fahrenheit) for the city. Collectively, the models had a root mean square error (RMSE) of 5.44 and a mean absolute percentage error (MAPE) of 0.73. (Zero-valued actuals were adjusted to 1 for the MAPE calculation.)
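As a concrete illustration (a minimal sketch, not the accelerator's own evaluation code), RMSE and MAPE with that zero-to-one adjustment can be computed as follows:

import numpy as np

def evaluate_forecast(actuals: np.ndarray, forecasts: np.ndarray):
    # Root mean square error across all forecasts
    rmse = np.sqrt(np.mean((actuals - forecasts) ** 2))
    # Bump zero-valued actuals to 1 so the percentage error stays defined
    adjusted = np.where(actuals == 0, 1, actuals)
    mape = np.mean(np.abs(actuals - forecasts) / adjusted)
    return rmse, mape

rmse, mape = evaluate_forecast(np.array([0.0, 12.0, 18.0]), np.array([1.5, 10.0, 20.0]))
print(rmse, mape)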
Alternative approaches to forecasting supply chain demand
Implications of the results
Additional retail/CPG and demand forecasting resources
How to Build a Quality of Service (QoS) Analytics Solution for Streaming Video
The importance of quality to streaming video services
Issues with the overall quality of streaming video (including buffering, latency, pixelation, jitter, packet loss and screen blankness) have significant business implications, whether through increased subscriber churn or reduced video engagement. Quality of Service (QoS) helps make sense of these data streams so you can understand what's going wrong, where and why.
Databricks QoS solution overview
When you start streaming you realize that there are so many places where breaks can happen and the viewer experience can suffer, whether it's an issue at the source on the on-premises servers or in the cloud; on the road at the CDN or ISP level or on the viewer's home network; or at the playback level with player/client issues. And because of the nature of television, things will go wrong during the most important, high-profile events that draw the biggest audiences.
Video QoS solution architecture
The number of "hops" in this approach is directly affected by the number of downstream consumers and by the complexity of the aggregations (e.g., Structured Streaming enforces certain constraints around chaining multiple aggregations).
Making your data ready for analytics
Video application events
Using this pattern allows a large number of downstream consumers to process the data in a streaming paradigm without scaling the throughput of the Kinesis stream. Both the timestamp and the message type are extracted from the JSON event to partition the data and allow consumers to select the type of events they want to process.
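A sketch of that ingestion step is shown below; the stream name, region, JSON fields and table name are illustrative assumptions rather than the solution's actual configuration:

from pyspark.sql import functions as F

raw = (spark.readStream
       .format("kinesis")                          # Databricks Kinesis source
       .option("streamName", "video-app-events")
       .option("region", "us-east-1")
       .option("initialPosition", "latest")
       .load())

events = (raw
    .select(F.col("data").cast("string").alias("json"))
    .select(
        F.get_json_object("json", "$.eventType").alias("eventType"),
        F.to_date(F.to_timestamp(F.get_json_object("json", "$.timestamp"))).alias("eventDate"),
        "json"))

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/video_events")
    .partitionBy("eventDate", "eventType")         # consumers can read only the event types they need
    .toTable("events"))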
CDN Logs
As a side effect of using a Delta table as a sink (which supports optimization!), we don't have to worry about how the size of the processing window affects the number of files in the target table, aka the "small files" problem. Combining the single Kinesis stream for the events with a Delta "Events" table reduces operational complexity and makes it easier to scale during peak hours.
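Where compaction is still needed, the Delta table can be optimized periodically; the table name and Z-order column here are assumptions:

# Compact small files and cluster by a commonly filtered column
spark.sql("OPTIMIZE events ZORDER BY (eventType)")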
Creating the dashboard/virtual network operation centers
Video loading time (time to first frame) provides better insight into the performance of individual locations of your CDN – in this case, the AWS CloudFront edge nodes – and has a direct impact on your strategy for improving this KPI, whether by distributing traffic over multiple CDNs or by implementing dynamic origin selection (in the case of AWS CloudFront, with Lambda@Edge). Is the content itself the cause of the quality drops, or are there delivery issues?
Creating (near) real-time alerts
On the content side, in-app events can provide useful information about user behavior and overall quality of experience. On top of the basic email use case, the Demo Player includes three widgets updated in real time with AWS AppSync: the number of active users, the most popular videos, and the number of users watching a video at the same time.
Next steps: machine learning
Point-of-failure prediction and remediation
Customer churn
Getting started with the Databricks Streaming Video QoS Solution
Mitigating Bias in Machine Learning
Fairlearn
What is bias?
Where does bias come from?
This blog will not address bias in data collection, although it is an important topic that is well within technologists' ability to fix. This leaves the models, which may turn out to be the heroes, rather than the villains, of bias in machine learning.
How can models help with bias?
This example looks at two commonly used fairness metrics (later computed with the Fairlearn tool): equal opportunity and equalized odds. Equal opportunity requires that the true positive rate (TPR) be the same across groups; equalized odds further requires that the false positive rate (FPR) also be equal across groups.
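As a small, self-contained sketch of how these two metrics can be checked with Fairlearn (toy arrays standing in for real predictions and the sensitive attribute):

import numpy as np
from fairlearn.metrics import (MetricFrame, true_positive_rate,
                               false_positive_rate, equalized_odds_difference)

# Toy labels, predictions and sensitive attribute
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1])
race   = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

frame = MetricFrame(metrics={"TPR": true_positive_rate, "FPR": false_positive_rate},
                    y_true=y_true, y_pred=y_pred, sensitive_features=race)
print(frame.by_group)      # equal opportunity compares TPR across groups
# Worst-case gap in TPR/FPR across groups; 0 means equalized odds holds exactly
print(equalized_odds_difference(y_true, y_pred, sensitive_features=race))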
Recidivism, and the COMPAS data set
They can help quantify the effect of features such as age and race on the model's output, using the SHAP library. This is an intriguing middle-ground approach that attempts to surgically remove unwanted effects from the model's predictions by explicitly learning them first.
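A minimal sketch of that idea with SHAP is below; the tiny DataFrame and column names only stand in for the COMPAS features, and the model is an ordinary gradient-boosted classifier rather than the one trained later in this chapter:

import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-ins for the COMPAS features
X = pd.DataFrame({"age": [25, 40, 31, 52, 23, 37],
                  "race_African_American": [1, 0, 1, 0, 1, 0],
                  "priors_count": [3, 0, 5, 1, 2, 0]})
y = [1, 0, 1, 0, 1, 0]

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one contribution per row and feature (log-odds units here)

race_col = X.columns.get_loc("race_African_American")
print(shap_values[:, race_col])            # per-defendant effect attributed to the race feature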
A first model
- Do nothing
- Ignore demographics
- Equalized odds with Fairlearn
- Mitigate with SHAP values
It is clear that the model makes a positive prediction (i.e., predicted to reoffend) more often for African American defendants (right columns). For African American defendants, SHAP indicated that the model increased the predicted probability of recidivism by 1-2%.
Finding anomalous cases with SHAP
There is a cluster in the lower left for which race_African_American caused a noticeably different predicted probability of recidivism, either especially on the high or the low side (dark or light points). There is a group near the upper left for whom gender (gender_Male) had a markedly negative effect on the probability; since these points are not men, they almost certainly represent female defendants.
Conclusion
In this blog, we will discuss the need for real-time data in retail and how to meet it. To learn more, check out our Real-Time Point-of-Sale Analytics Solution Accelerator.
The point-of-sale system
Before the pandemic, 71% of retailers cited the lack of real-time inventory visibility as one of the top barriers to achieving their omnichannel goals. Better access to real-time information is key to meeting consumer demands in the new normal.
Real-Time Point-of-Sale Analytics

Supply chain disruptions—from reduced product offerings and reduced inventory capacity—combined with rapidly changing consumer expectations for seamless omnichannel experiences are causing retailers to rethink how they use data to manage their operations. The pandemic only increased the demand for integrated online and in-store experiences, putting even more pressure on retailers to present accurate product availability and manage order changes on the fly.
Modern connectivity improvements have enabled more retailers to move to a centralized, cloud-based POS system, while many others are developing near-real-time integrations between in-store systems and the corporate back office. The near-real-time availability of information means that retailers can constantly update their estimates of item availability.
Near real-time insights
The company no longer manages operations against its knowledge of the inventory state as it was the day before, but instead takes action based on its knowledge of the inventory state as it is now.
Carefully consider scope
If we limited our initial scope to this one need, the information requirements for our monitoring and analytics system would be dramatically reduced. Once a real-time inventory solution is in place and the business recognizes the value, we can expand our scope to consider other needs such as promotion monitoring and fraud detection, expanding the range of information assets used with each iteration.
Align transmission with patterns of data generation and time sensitivities
No matter how the data is exposed, the classic guideline holds true: identify a compelling business justification for your solution and use that to limit the scope of the information assets you consume initially.
Land the data in stages
Manage expectations
Implementing POS streaming
Personalization is a competitive differentiator for nearly every financial services institution (FSI for short), from banks to insurers and now investment management platforms. This blog will demonstrate how to establish a strong foundation for real-time insights for financial services use cases with the Databricks Lakehouse Platform, from change data capture (CDC) of OLTP data to the reporting dashboard.
Why streaming ingestion is critical
While every FSI wants to deliver intelligent, real-time personalization to customers, the basics are often hidden or implemented with incomplete platforms, leading to stale insights, long time to market, and lost productivity from the need to scale streaming, AI, and reporting services together. In this blog, we focus on streaming for FSIs and show how these capabilities help improve new product differentiators and internal insights for FSIs.
Design Patterns for Real-Time Insights in Financial Services
FSI data flow and requirements
- User app saves and updates data such as clickstream, user updates, and geolocation data — requires operational databases
- Third-party behavioral data is delivered incrementally via object storage or is available in a database in a cloud account — requires streaming
- FSI has an automated process to export all database data including user updates, clickstream, and user behavioral data into the data lake — requires change data capture (CDC)
- Data engineering teams run automated data quality checks and ensure the data is fresh — requires data quality tool and native streaming
- Data science teams use data for next best action or other predictive analytics — requires native ML capabilities
- Analytics engineers and data analysts will materialize data models and use data for reporting — requires dashboard integration and native visualization
Note the temporal nature of the data: all banking or credit systems have time-ordered transaction data, and a reliable data source means handling late-arriving and out-of-order data.
Getting started with streaming
Data ingestion
Native Structured Streaming ingestion option
Partner ingestion option
The main difference between this option and the native streaming option is Confluent's use of the Delta Lake Sink Connector. It offers low-code data ingestion for data platform teams that are familiar with streaming partners (such as Confluent Kafka) and prefer to land data in Delta Lake without writing Apache Spark™ code.
File-based ingestion
Both options abstract away much of the underlying logic for reading raw data and landing it in Delta Lake, using COPY INTO commands. Set up the Databricks Delta Lake Sink Connector for Confluent Cloud and link it to the appropriate topic.
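For the file-based path, a COPY INTO command of the following shape does the landing; the table name and bucket path are placeholders, not values from the blog:

# Incrementally load newly arrived JSON files into a Delta table
spark.sql("""
  COPY INTO customer_events
  FROM 's3://example-bucket/landing/customer_events/'
  FILEFORMAT = JSON
  COPY_OPTIONS ('mergeSchema' = 'true')
""")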
Change data capture
APPLY CHANGES INTO is the elegant CDC API provided by DLT, which handles out-of-order and late-arriving data by maintaining state internally – without requiring users to write extra code or SQL commands. For example: APPLY CHANGES INTO live.customer_patterns_silver FROM stream(live.customer_patterns_silver_copy) KEYS (customer_id).
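For teams using the DLT Python API rather than SQL, the equivalent call is dlt.apply_changes(); the sequencing column below (event_ts) is an assumption, since APPLY CHANGES needs some ordering column to resolve late and out-of-order records:

import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customer_patterns_silver")

dlt.apply_changes(
    target="customer_patterns_silver",
    source="customer_patterns_silver_copy",
    keys=["customer_id"],
    sequence_by=col("event_ts"))   # assumed ordering column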
Summarizing customer preferences and simple offers
This blog highlights several facets of the data ingestion process that are important for supporting various personalization use cases in financial services. More importantly, Databricks supports near-real-time use cases, offering fresh insights and abstracted APIs (Delta Live Tables) for handling change data, supporting both Python and SQL out of the box.
Building Patient Cohorts With NLP and Knowledge Graphs

Determining which patients belong to a cohort, testing the sensitivity of various inclusion and exclusion criteria to the sample size, constructing a control group with propensity score matching techniques: these are just some of the processes that health care and life sciences researchers live with every day, and that's not likely to change anytime soon.
Optimizing clinical trial protocols
Site selection and patient recruitment
Design and run better clinical trials with John Snow Labs and Databricks
NLP preprocessing
Within our Lakehouse for Healthcare, this final DataFrame will be written to the Silver layer. See below for the first three records of the Silver Layer dataset, the extracted Rx-related text, its NER chunks, the applicable RxNorm code, all related codes, RxNorm resolutions, and final drug resolution.
Patients who are prescribed Lasix between May 2060 and May 2125
Dangerous drug combinations
Patients with hypertension or diabetes with chest pain
Start building your cohorts with knowledge graphs using NLP
Customers increasingly expect access to goods delivered to their front doors within tight time windows. Given constraints such as order-specific pickup and narrow delivery windows, differences in item perishability that limit the time some items can sit in a delivery vehicle, and the variable (but generally increasing) costs of fuel and labor, easy solutions remain elusive.
Solution Accelerator
Amazon has played a big role in driving the expansion of fast home delivery, but it too has struggled to recoup costs.
Scalable Route Generation With Databricks and OSRM
Accurate route information is critical
Scalability needed for route analytics
Fine-Grained Demand Forecasting
Other studies report similar results, eroding trust in forecasts within the organization and perpetuating planning based on expert opinion and gut feel. But with the rise of data science as a business opportunity, many organizations are taking a renewed interest in revisiting their forecasting practices.
The R language provides access to new forecasting techniques
The cloud provides access to the computational resources
Databricks brings together R functionality with cloud scalability
By combining all the records for a given store and item within a partition, we isolated all the data we need to make predictions at the store item level in one place. Using the SparkR package, we can use the gapply function to group the data and apply a user-defined function to each partition.
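For readers working in Python rather than R, the same grouped-apply pattern looks like the sketch below with applyInPandas; sales_df, its columns and the naive per-series model are placeholders, not the original SparkR code:

import pandas as pd

def forecast_store_item(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder model: carry the latest observed sales value forward as the forecast
    last = pdf.sort_values("date")["sales"].iloc[-1]
    return pd.DataFrame({"store": [pdf["store"].iloc[0]],
                         "item": [pdf["item"].iloc[0]],
                         "forecast": [float(last)]})

forecasts = (sales_df
    .groupBy("store", "item")      # each group holds the full history for one store-item pair
    .applyInPandas(forecast_store_item,
                   schema="store string, item string, forecast double"))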
Sentiment analysis
GPU-Accelerated Sentiment Analysis
To summarize, sentiment analysis provides coarse-grained information, while aspect-based sentiment detection provides more information about particular aspects. Sentiment analysis can be used to determine a customer's feeling about a particular product, the public's reaction to an event, etc.
Types of sentiment analysis
(Multi) GPU-enabled inference
Dataloader
Model class
Model parallelization and GPU dispatch
Iteration loop
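Putting those pieces together, a minimal single-GPU version of the loop might look like the sketch below; the SST-2 DistilBERT checkpoint and the toy dataset are assumptions for illustration, and the multi-GPU dispatch is omitted:

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TextDataset(Dataset):
    """Wraps a plain list of strings so a DataLoader can batch it."""
    def __init__(self, texts):
        self.texts = texts
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return self.texts[idx]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).to(device).eval()

loader = DataLoader(TextDataset(["great product", "terrible service"]), batch_size=32)

preds = []
with torch.no_grad():                      # inference only: no gradients needed
    for batch in loader:
        enc = tokenizer(list(batch), padding=True, truncation=True,
                        return_tensors="pt").to(device)
        logits = model(**enc).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())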
Scalable inference for lots of files
Using the Databricks platform, with readily available ML runtimes and the latest GPUs, makes it easy to experiment with and deploy these solutions. Supermarkets operate in an extremely competitive and complex space in which every component of the customer experience requires optimization, including inventory, assortment selection, pricing strategies and product relevance by segment.
Jumbo Transforms How They Delight Customers With Data-Driven Personalized Experiences

When we launched our loyalty program, the volume of new customer data points that started coming in made our data teams rethink how we optimize the customer experience at scale. We needed a platform capable of working through customer-level data and training models at a scale far beyond what we could handle on our individual machines.

However, as data volumes grew exponentially, our analytics and ML capabilities began to slow down, as we were not equipped to handle such scale. Increasing data volumes meant increased complexity and more resources needed to handle them from an infrastructure perspective.
After years on a traditional statistical software package connected to a traditional RDBMS, with analytics in Jupyter notebooks, we knew that if we wanted to make the best use of this data and deliver shopping experiences that make a difference, it was time to modernize our approach and the underlying technologies that make it possible. In addition to modernizing our infrastructure to help us thrive with big data analytics, we also needed better ways to shorten the path from concept to production, reduce onboarding time for new people, collaborate, and provide self-service access to data insights, so our analysts and business users could generate insights around pricing, inventory, merchandising and customer preferences.
From foundational customer loyalty to exceptional customer experiences
We would not have been able to accelerate our modernization efforts without the expert training and technical guidance from the Databricks Academy and Customer Success Engineering, who act as a direct injection of knowledge into our data science department. This deeper understanding of how to leverage all of our data has led to significant improvements in how we manage our assortment and supply chain, make strategic decisions and better support our customers.
Excellence has no ceiling when driven by data and AI
The process begins with a unified multi-tenant architecture, which uses Delta Lake to standardize access to all relevant data sources (both historical and real-time). Delta Lake also helps us build data pipelines that enable scalable, real-time analytics to reduce in-store inventory for customers while reducing unnecessary food waste from over-ordering perishables, such as fresh produce, that won't sell.
Ordnance Survey Explores Spatial Partitioning Using the British National Grid
While some of these were developed specifically in the context of modern geoanalytics, and are therefore usually well supported by associated libraries and practical use cases in that context, the British National Grid indexing system was defined in 1936 and is deeply embedded in the UK geospatial data ecosystem, yet has not been made readily exploitable for large-scale geoanalytics. Our secondary motivation here was therefore to show that it can be used directly to optimize spatial aggregations, avoiding the need to convert UK geospatial datasets to other indexing systems.
The point-in-polygon: how hard can it be?
The (not so) hidden cost?
Work smarter, not harder!
Spatial indices (PIP as a pseudo-equivalence)
The BNG Index System
For references identifying more detailed grid resolutions below 100 km, additional x and y integer values are added after the two letters that locate a child grid square within the parent grid square hierarchy.
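A small sketch of how the digit count maps to resolution (quadrant letter suffixes are ignored here, and the function name is ours, not part of any BNG library):

def bng_resolution_metres(ref: str) -> int:
    digits = ref.replace(" ", "")[2:]      # drop the two grid letters
    assert len(digits) % 2 == 0, "easting and northing need the same number of digits"
    return 100_000 // (10 ** (len(digits) // 2))

print(bng_resolution_metres("TQ"))         # 100000 -> the 100 km parent square
print(bng_resolution_metres("TQ 38 77"))   # 1000   -> a 1 km child square within TQ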
Why BNG?
BNG as a spatial partitioning strategy
The baseline
The reason for this is that the join is evaluated in the same way as a Cartesian join. Note that none of these datasets is large enough to be called big data.
Let’s frame our problem
We ran the polygon bounding box approach for different resolutions of the BNG index system and for different dataset sizes. Spark is now able to first run an equi-join based on the BNG index ID and significantly reduce the total number of comparisons.
Divide and conquer
Index cells that touch the boundary of the polygon require a PIP filter after the index-to-index join. Index cells that do not touch the boundary and fall inside the polygon do not require any additional filtering.
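In PySpark terms, the pattern looks roughly like the sketch below; the DataFrames, column names and the point_in_polygon() helper are placeholders standing in for whichever geometry library performs the exact test:

from pyspark.sql import functions as F

# Cheap equi-join on the shared BNG index ID
candidates = points_df.join(polygon_chips_df, on="bng_index_id")

matched = candidates.where(
    F.col("is_core")                                                # chip fully inside the polygon: no PIP needed
    | point_in_polygon(F.col("point_geom"), F.col("chip_geom"))     # boundary chip: run the PIP filter
)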
Mosaics are pretty (fast!)
To put this into context, the equivalent bounding box join at the same index resolution was performed in 2,549 seconds. If we include the preparation times, which were 600 seconds for the mosaic approach and 317 seconds for the bounding box approach, we arrive at a total corrected performance improvement of 4.5x.
A general approach
In a recent talk, Empower Splunk and other SIEMs with Databricks Lakehouse for Cybersecurity, HSBC highlighted the limitations of legacy SIEMs and how the Databricks Lakehouse platform is transforming cyber defense. In this blog post, we'll discuss the changing IT and cyber attack threat landscape, the benefits of SIEMs, the benefits of Databricks Lakehouse, and why SIEM + Lakehouse is becoming the new strategy for security operations teams.
HSBC Augments SIEM for Cybersecurity at Cloud Scale
But the explosion of cloud footprints begs the question: are SIEMs the right strategy in a cloud-scale world? Either way, it's generally accepted that every organization is involved in some kind of cloud transformation or evaluation – even in the public sector, where adopting new technology is not an easy decision.
Security analytics in the cloud world
Their current SIEM tool runs on one cloud service while their largest data producers are on another. And over the years, they've built complicated transport pipelines to move data from one cloud provider to the other.
Lakehouse + SIEM : The pattern for cloud-scale security operations
Integrating the lakehouse with Splunk
Achieving scale, speed, security and collaboration
What’s next
The challenges of data quality
City-scale data processing
How the City of Spokane Improved Data Quality While Lowering Costs
Through this collaborative solution, the City of Spokane has increased government transparency and accountability and can provide citizens with information that encourages and invites public participation and feedback. With DQLabs and Azure Databricks, the City of Spokane has also achieved a 50% lower total cost of ownership (TCO) by reducing the amount of manual labor required to classify, organize, de-identify, de-duplicate and correct incoming data, lowering the cost to maintain and operate its information systems as data volumes increase.
How DQLabs leverages Azure Databricks to improve data quality
Using the integrated golden record view, datasets became easily accessible to improve reporting and analysis.
Get started with DQLabs and Azure Databricks to improve data quality
Customer insights from geospatial data
Challenges of scaling geospatial workloads
How Thasos Optimized and Scaled Geospatial Workloads With Mosaic on Databricks
A point is matched to a polygon when either:
- The index ID of the point is in the set of index IDs of the polygon, and the cell corresponding to this index ID lies entirely within the polygon.
- The index ID of the point is in the set of index IDs of the polygon, and the cell corresponding to this index ID lies only partially within the polygon, in which case an exact point-in-polygon check is still needed.
Scalability and performance with Databricks and Mosaic
The significant performance improvements gave the Thasos team confidence in the scalability of the approach, and Mosaic is now an integral part of their production pipelines. Building these pipelines on the Databricks Lakehouse Platform has not only saved Thasos on cloud computing costs, but has also unlocked the ability to bring data scientists onto the team to develop new intelligence products and to integrate with the broader Databricks partner ecosystem.
Getting started