AI & Automation Archives - Unravel

Databricks Data Observability Buyer’s Guide

Ray Villares — Tue, 06 May 2025 18:48:07 +0000

A Data Platform Leader’s Guide to Choosing the Right Data Observability Solution

Modern data platforms demand more than just basic monitoring. As Databricks adoption grows across the enterprise, platform owners are under pressure to optimize performance, control cloud costs, and deliver reliable data at scale. But with a fragmented vendor landscape and no one-size-fits-all solution, knowing where to start can be a challenge.

This guide simplifies the complex world of data observability and provides a clear, actionable framework for selecting and deploying the right solution for your Databricks environment.

Discover:

The five core data observability domains every enterprise needs to cover (based on Gartner’s 2024 framework)
How different solution types—DIY, FinOps, DevOps, native tools, and AI-native platforms—compare
How the emerging discipline of DataFinOps is more than cost governance
Which approach best aligns with your goals: cost control, data quality, performance tuning, and scalability
A phased deployment roadmap for rolling out your selected solution with confidence

If you’re evaluating your data observability options or looking to optimize your Databricks cost and performance, this guide will help you make the best choice for your needs.

The post Databricks Data Observability Buyer’s Guide appeared first on Unravel.

Building a FinOps Ethos

Unravel Data — Mon, 09 Dec 2024 21:02:22 +0000

3 Key Steps to Build a FinOps Ethos in Your Data Engineering Team

In today’s data-driven enterprises, the intersection of fiscal responsibility and technical innovation has never been more critical. As data processing costs continue to scale with business growth, building a FinOps culture within your Data Engineering team isn’t just about cost control, it’s about creating a mindset that views cost optimization as an integral part of technical excellence.

Unravel’s ‘actionability for everyone’ approach has enabled executive customers to adopt three transformative steps to embed FinOps principles into their Data Engineering team’s DNA, ensuring that cost awareness becomes as natural as code quality or data accuracy. In this article, we walk you through how executives can get the cost rating of their workloads from Uncontrolled to Optimized with clear, guided actionable pathways.

Step 1. Democratize cost visibility

The foundation of any successful FinOps implementation begins with transparency. However, raw cost data alone isn’t enough-it needs to be contextualized and actionable.

Breaking Down the Cost Silos

Unravel provides real-time cost attribution dashboards to map cloud spending to specific business units, teams, projects, and data pipelines.

The custom views allow different stakeholders, from engineers to executives to discover top actions to control cost spends.
The ability to track key metrics like cost/savings per job/query, time-to-value, and wasted costs due to idleness, wait times, etc., transforms cost data from a passive reporting tool to an active management instrument.

Provide tools for cost decision making

Modern data teams need to understand how their architectural and implementation choices affect the bottom line. Unravel provides visualizations and insights to guide these teams to implement:

Pre-deployment cost impact assessments for new data pipelines (for example, what is the cost impact of migrating this workload from All-purpose compute to Job compute?).
What-if analysis tools for infrastructure changes (for example, will changing from the current instance types to recommended instance types affect performance if I save on cost?).
Historical trend analysis to identify cost patterns, budget overrun costs, cost wasted due to optimizations neglected by their teams, etc.

Step 2. Embed cost intelligence into the development and operations lifecycles

The next evolution in FinOps maturity comes from making cost optimization an integral part of the development process, not a post-deployment consideration. Executives should consider leveraging specialized AI agents across their technology stack that helps to boost productivity and free up their teams’ time to focus on innovation. Unravel provides a suite of AI-agents driven features that foster cost ethos in the organization and maximizes operational excellence with auto-fix capabilities.

Automated Optimization Lifecycle
Unravel helps you establish a systematic approach to cost optimization with automated AI-agentic workflows to help your teams operationalize recommendations while getting a huge productivity boost. Below are some ways to drive automation with Unravel agents:

Implement automated fix suggestions for most code and configuration inefficiencies
Assign auto-fix actions to AI agents for effortless adoption or route them to a human reviewer for approval
Configure automated rollback capabilities for changes if unintended performance or cost degradation is detected

Push Down Cost Consciousness To Developers

Automated code reviews that flag potential cost inefficiencies
Smart cost savings recommendations based on historical usage patterns
Allow developers to see the impact of their change on future runs

Define Measurable Success Metrics
Executives can track the effectiveness of FinOps awareness and culture using Unravel through:

Cost efficiency improvements over time (WoW, MoM, YTD)
Team engagement and rate of adoption with Accountability dashboards
Time-to-resolution for code and configuration changes

Step 3. Create a self-sustaining FinOps culture

The final and most crucial step is transforming FinOps from an initiative into a cultural cornerstone of your data engineering practice.

Operationalize with AI agents

FinOps AI Agent

Implement timely alerting systems that help to drive value-oriented decisions for cost optimization and governance. Unravel provides email, Slack, Teams integrations for ensuring all necessary stakeholders get timely notifications and insights into opportunities and risks.

DataOps AI Agent

Pipeline optimization suggestions for better resource utilization and mitigate SLA risks
Job signature level cost and savings impact analysis to help with prioritization of recommendations
Intelligent workload migration recommendations

Data Engineering AI Agent

Storage tier optimization recommendations to avoid wasted costs due to cold tables
Partition strategy optimization for cost-effective querying
Avoid recurring failures and bottlenecks due to inefficiencies not acted upon for several weeks

Continuous Evolution

Finally, it is extremely important to foster and track the momentum of FinOps growth by:

Regularly performing FinOps retrospectives with wider teams
Revisiting which Business units and Cost Centers are contributing to wasted costs, neglected cost due to unadopted recommendations and budget overruns despite timely alerting

The path forward

Building a FinOps ethos in your Data Engineering team is a journey that requires commitment, tools, and cultural change. By following the above three key steps – democratizing cost visibility, embedding cost intelligence, and creating a self-sustaining culture – you can transform how your team thinks about and manages cloud costs.

The most successful implementations don’t treat FinOps as a separate discipline but rather as an integral part of technical excellence. When cost optimization becomes as natural as writing tests or documenting code, you have achieved true FinOps maturity. Unravel provides a comprehensive set of features and tools to aid your teams in accelerating FinOps best practices throughout the organization.

Remember, the goal isn’t just to reduce costs – it is to maximize the value derived from every dollar spent on your infrastructure. This mindset shift, combined with the right tools and processes, will position your data engineering team for sustainable growth and success in an increasingly cost-conscious technology landscape.

To learn more on how Unravel can help, contact us or request a demo.

The post Building a FinOps Ethos appeared first on Unravel.

Data ActionabilityTM DataOps Webinar

Unravel Data — Wed, 20 Nov 2024 21:39:37 +0000

Data Actionability: Boost Productivity with Unravel’s New DataOps AI Agent

The post Data Actionability^TM DataOps Webinar appeared first on Unravel.

Data ActionabilityTM Data Eng Webinar

Unravel Data — Wed, 20 Nov 2024 21:32:41 +0000

Data Actionability: Speed Up Analytics with Unravel’s New Data Engineering AI Agent

The post Data Actionability^TM Data Eng Webinar appeared first on Unravel.

Data ActionabilityTM Finops Webinar

Unravel Data — Wed, 20 Nov 2024 21:25:52 +0000

Data Actionability: Cost Governance with Unravel’s New FinOps AI Agent

The post Data Actionability^TM Finops Webinar appeared first on Unravel.

Data ActionabilityTM Webinar

Unravel Data — Wed, 20 Nov 2024 21:04:00 +0000

Data Actionability: Empower Your Team with Unravel’s New AI Agents

The post Data Actionability^TM Webinar appeared first on Unravel.

AI Agents: Empower Data Teams With ActionabilityTM for Transformative Results

Unravel Data — Thu, 15 Aug 2024 18:25:10 +0000

AI Agents for Data Teams

Data is the driving force of the world’s modern economies, but data teams are struggling to meet demand to support generative AI (GenAI), including rapid data volume growth and the increasing complexity of data pipelines. More than 88% of software engineers, data scientists, and SQL analysts surveyed say they are turning to AI for more effective bug-fixing and troubleshooting. And 84% of engineers who use AI said it frees up their time to focus on high-value activities.

AI Agents represent the next wave of AI innovation and have arrived just in time to help data teams make more efficient use of their limited bandwidth to build, operate, and optimize data pipelines and GenAI applications on modern data platforms.

Data Teams Grapple with High Demand for GenAI

A surge in adoption of new technologies such as GenAI is putting tremendous pressure on data teams, leading to broken apps and burnout. In order to support new GenAI products, data teams must deliver more production data pipelines and data apps, faster. The result is that data teams have too much on their plates, the pipelines are too complex, there is not enough time, and not everyone has the deep tech skills required. No surprise that 70% of organizations have difficulty integrating data into AI models and only 48% of AI projects get deployed into production.

Understanding AI Agents

Defining AI Agents

AI agents are software-based systems that gather information, provide recommended actions, initiate and complete tasks in collaboration with or on behalf of humans to achieve a goal. AI agents can act independently, utilizing components like perception and reasoning, provide step-by-step guidance to augment human abilities, or can provide supporting information to support complex human-led tasks. AI agents play a crucial role in automating tasks and simplifying data-driven decision-making, and achieving greater productivity and efficiency.

How AI Agents Work

AI agents operate by leveraging a wide range of data sources and signals, using algorithms and data processing to identify anomalies and actions, then interact with their environment and users to effectively achieve specific goals. AI agents can achieve >90% accuracy, primarily driven by the reliability, volume, and variety of input data and telemetry to which they have access.

Types of Intelligent Agents

Reactive and proactive agents are two primary categories of intelligent agents.
Some agents perform work for you, while others help complete tasks with you or provide information to support your work.
Each type of intelligent agent has distinct characteristics and applications tailored to specific functions, enhancing productivity and efficiency.

AI for Data Driven Organizations

Enhancing Decision Making

AI agents empower teams by improving data support decision-making processes for you, with you, or by you. Examples of how AI agents act on your behalf include reducing toil and handling routine decisions based on AI insights. In various industries, AI agents optimize decision-making and provide recommendations to support your decisions. For complex tasks, AI agents provide supporting information needed to build data pipelines, write SQL queries, and partition data.

Benefits of broader telemetry sources for AI agents

Integrating telemetry from various platforms and systems enhances AI agents’ ability to provide accurate recommendations. Incorporating AI agents into root cause analysis (RCA) systems offers significant benefits. Meta’s AI-based root cause analysis system shows how AI agents enhance tools and applications.

Overcoming Challenges

Enterprises running modern data stacks face common challenges like high costs, slow performance, and impaired productivity. Leveraging AI agents can automate tasks for you, with you, and by you. Unravel customers such as Equifax, Maersk, and Novartis have successfully overcome these challenges using AI.

The Value of AI Agents for Data Teams

Reducing Costs

When implementing AI agents, businesses benefit from optimized data stacks, reducing operational costs significantly. These agents continuously analyze telemetry data, adapting to new information dynamically. Unravel customers have successfully leveraged AI to achieve operational efficiency and cost savings.

Accelerating Performance

Performance is crucial in data analytics, and AI agents play a vital role in enhancing it. By utilizing these agents, enterprise organizations can make well-informed decisions promptly. Unravel customers have experienced accelerated data analytics performance through the implementation of AI technologies.

Improving Productivity

AI agents are instrumental in streamlining processes within businesses, leading to increased productivity levels. By integrating these agents into workflows, companies witness substantial productivity gains. Automation of repetitive tasks by AI agents simplify troubleshooting to boost overall productivity and efficiency.

Future Trends in AI Agents for FinOps, DataOps, and Data Engineering

Faster Innovation with AI Agents

By 2026 conversational AI will reduce agent labor costs by $80 billion. AI agents are advancing, providing accurate recommendations to address more issues automatically. This allows your team to focus on innovation. For example, companies like Meta use AI agents to simplify root cause analysis (RCA) for complex applications.

Accelerated Data Pipelines with AI Agents

Data processing is shifting towards real-time analytics, enabling faster revenue growth. However, this places higher demands on data teams. Equifax leverages AI to serve over 12 million daily requests in near real time.

Improved Data Analytics Efficiency with AI Agents

Data management is the fastest-growing segment of cloud spending. In the cloud, time is money; faster data processing reduces costs. One of the word’s largest logistics companies improved efficiency by up to 70% in just 6 months using Unravel’s AI recommendations.

Empower Your Team with AI Agents

Harnessing the power of AI agents can revolutionize your business operations, enhancing efficiency, decision-making, and customer experiences. Embrace this technology to stay ahead in the competitive landscape and unlock new opportunities for growth and innovation.

Learn more about our FinOps Agent, DataOps Agent, and Data Engineering Agent.

The post AI Agents: Empower Data Teams With Actionability^TM for Transformative Results appeared first on Unravel.

Unravel Data Unveils Data Industry’s First Purpose-Built Autonomous AI Agents

Unravel Data — Wed, 12 Jun 2024 15:02:30 +0000

Unravel AI Agents Empower DataOps, FinOps and Data Engineering Teams to Move Beyond Observability to Achieve Data Actionability^TM

PALO ALTO, Calif.– June 12, 2024 – Unravel Data, the first AI-enabled data actionability^TM and FinOps platform built to address the speed and scale of modern data platforms, today announced the release of three groundbreaking new AI agents: the Unravel DataOps Agent, the Unravel FinOps Agent, and the Unravel Data Engineering Agent.

While generative AI capabilities such as Large Language Models allow users to ask general questions and receive broad answers, AI agents are designed to perform specific, actionable tasks within their domains. These AI agents leverage domain-specific knowledge graphs and advanced automation to tackle precise problems faced by data teams, significantly enhancing efficiency and accuracy. All three new AI agents are included as part of the latest version of Unravel Platform.

The introduction of AI agents has the potential to transform various data-centric disciplines where the complexity of managing and making decisions about data pipelines often requires significant human intervention. In DataOps, AI agents can automate routine tasks like data pipeline monitoring and anomaly detection, freeing up human experts for more strategic endeavors. Meanwhile, AI agents for FinOps teams can be deployed to continuously track and analyze cloud expenditures for data storage, processing, and analytics, identifying cost-saving opportunities and potential budget overruns.

“Over the past decade, Unravel has been at the forefront of data observability, continuously innovating to meet the evolving needs of our enterprise customers. With this launch, we are taking customers beyond observability, to actionability,” said Kunal Agarwal, CEO and co-founder, Unravel Data. “The market is demanding solutions that don’t simply observe and tell you what’s happening, but make it actionable by telling you how to solve the problem. And better still, to fix it for them. This will allow resource crunched data teams to get more done through smart automation.”

AI agents hold immense potential for transforming data-related disciplines by automating the many time-consuming and tedious tasks that consume the time of under-resourced data teams. Some of the distinguishing capabilities of Unravel’s AI Agents include:

Domain-Specific Design: Unravel’s AI agents leverage deep domain-specific knowledge graphs to address the specific challenges data teams face daily.
Flexibility and Control: Users can choose their preferred level of AI automation, from highly supervised to fully automated, ensuring they are able to maintain the proper amount of oversight over critical data-driven processes.
Enhanced Focus for Data Engineers: By automating mundane tasks, these AI agents enable data engineers to reduce boring and repetitive tasks, maximize productivity and focus on solving critical, high-value problems.
Precision and Reliability: Built to provide accurate and reliable solutions, Unravel’s AI agents are finely tuned to handle precise data operations issues.
Comprehensive Cost Management: For FinOps teams, the AI agents facilitate automated cost governance, uncover savings opportunities, and proactively manage budgets.

Maersk, one of the world’s largest logistics companies, recognized the benefits of Unravel’s AI agents to streamline their data processes, optimize their cloud spending, and enhance their operational efficiency. “Unravel enables developers to proactively manage complex data systems effortlessly,” said Peter Rees, Lead Architect for Enterprise Data at Maersk. “By providing clear, actionable information through a conversational interface, we look forward to using these AI agents to help us detect and troubleshoot issues faster, propose configuration changes, and allow developers to approve and apply those changes seamlessly.”

To learn more about how Unravel agents can deliver actionable insights into your data pipelines, visit: www.unraveldata.com

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Unravel’s market-leading data actionability^TM and FinOps platform with purpose-built AI for each data platform, provides prescriptive recommendations needed for cost and performance data and AI pipeline efficiencies. A recent winner of the Best Data Tool & Platform of 2023 as part of the annual SIIA CODiE Awards, some of the world’s most recognized brands like Adobe, Maersk, Mastercard, Equifax, and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit https://www.unraveldata.com.

PR Contact:
Rob Nachbar
Kismet Communications for Unravel Data
unraveldata@zagcommunications.com

The post Unravel Data Unveils Data Industry’s First Purpose-Built Autonomous AI Agents appeared first on Unravel.

IDC Analyst Brief: The Role of Data Observability and Optimization in Enabling AI-Driven Innovation

Unravel Data — Mon, 15 Apr 2024 14:22:53 +0000

Harnessing Data Observability for AI-Driven Innovation

Organizations are now embarking on a journey to harness AI for significant business advancements, from new revenue streams to productivity gains. However, the complexity of delivering AI-powered software efficiently and reliably remains a challenge. With AI investments expected to surge beyond $520 billion by 2027, this brief underscores the necessity for a robust intelligence architecture, scalable digital operations, and specialized skills. Learn how AI-driven data observability can be leveraged as a strategic asset for businesses aiming to lead in innovation and operational excellence.

Get a copy of the IDC Analyst Brief by Research Director Nancy Gohring.

The post IDC Analyst Brief: The Role of Data Observability and Optimization in Enabling AI-Driven Innovation appeared first on Unravel.

Winning the AI Innovation Race

Stephen Lamont — Mon, 11 Mar 2024 10:24:06 +0000

Business leaders from every industry now find themselves under the gun to somehow, someway leverage AI into an actual product that companies (and individuals) can use. Yet, an estimated 70%-85% of artificial intelligence (AI) and machine learning (ML) projects fail.

In our thought-leadership white paper Winning the AI Innovation Race: How AI Helps Optimize Speed to Market and Cost Inefficiencies of AI Innovation, you will learn:

• Top pitfalls that impede speed and ROI for running AI and data pipelines in the cloud

• How the answers to addressing these impediments can be found at the code level

• How AI is paramount for optimization of cloud data workloads

• How Unravel helps

The post Winning the AI Innovation Race appeared first on Unravel.

Top 4 Challenges to Scaling Snowflake for AI

Stephen Lamont — Tue, 14 Nov 2023 12:00:39 +0000

Organizations are transforming their industries through the power of data analytics and AI. A recent McKinsey survey finds that 75% expect generative AI (GenAI) to “cause significant or disruptive change in the nature of their industry’s competition in the next three years.” AI enables businesses to launch innovative new products, gain insights into their business, and boost profitability through technologies that help them outperform competitors. Organizations that don’t leverage data and AI risk falling behind.

Despite all the opportunities with data and AI, many find ROI with advanced technologies like IoT, AI, and predictive analytics elusive. For example, companies find it difficult to get accurate and granular reporting on compute and storage for cloud data and analytics workloads. In speaking with enterprise customers, we hear several recurring barriers they face to achieve their desired ROI on the data cloud.

Cloud data spend is challenging to forecast

About 80% of 157 data management professionals express difficulty predicting data-related cloud costs. Data cloud spend can be difficult to reliably predict. Sudden spikes in data volumes, new analytics use cases, and new data products require additional cloud resources. In addition, cloud service providers can unexpectedly increase prices. Soaring prices and usage fluctuations can disrupt financial operations. Organizations frequently lack visibility into cloud data spending to effectively manage their data analytics and AI budgets.

Workload fluctuations: Snowflake data processing and storage costs are driven by the amount of compute and storage resources used. As data cloud usage increases for new applications, dashboards, and uses, it becomes challenging to accurately estimate the required data processing and storage costs. This unpredictability can result in budget overruns that affect 60% of infrastructure and operations (I&O) leaders.
Unanticipated expenses: Spikes in streaming data volumes, large amounts of unstructured and semi-structured data, and shared warehouse consumption can quickly exceed cloud data budgets. These unforeseen usage peaks can catch organizations off guard, leading to unexpected data cloud costs.
Limited visibility: Accurately allocating costs across the company requires detailed visibility into the data cloud bill. Without query-level or user-level reporting, it becomes impossible to accurately attribute costs to various teams and departments. The result is confusion, friction and finger-pointing between groups as leaders blame high chargeback costs on reporting discrepancies.

Organizations can establish spending guardrails and implement controls by implementing a FinOps approach and leveraging granular data to implement smart and effective controls over their data cloud spend, set up budgets, and utilize alerts to avoid data cloud cost overruns.

Data cloud workloads constrained by budget and staff limits

In 2024, IT organizations expect to shift their focus towards controlling costs, improving efficiency, and increasing automation. Cloud service provider price increases and growing usage add to existing economic pressures, while talent remains scarce and expensive. These cost and bandwidth factors are limiting the number of new data cloud workloads that can be launched.

“Data analytics, engineering & storage” are among the top 3 biggest skill gaps and 54% of data teams say the talent shortage and time required to upskill employees are the biggest challenges to adoption of their AI strategy.

Global demand for AI and machine learning professionals is expected to increase by 40% over the next five years. Approximately one million new jobs will be created as companies look to leverage data and AI for a wide variety of use cases—from automation and risk analysis, to security and supply chain forecasting.

AI adoption and data volume demand

Since ChatGPT broke usage records, generative AI is driving increased data cloud demand and usage. Data teams are struggling to maintain productivity as AI projects scale “due to increasing complexity, inefficient collaboration, and lack of standardized processes and tools” (McKinsey).

Data is foundational for AI and much of it is unstructured, yet IDC found most unstructured data is not leveraged by organizations. A lack of production-ready data pipelines for diverse data sources was the second-most-cited reason (31%) for AI project failure.

Discover your Snowflake savings with a free Unravel Health Check report
Request your report here

Data pipeline failures slow innovation

Data pipelines are becoming more complex, increasing the time required for root cause analysis (RCA) for breaks and delays. Data teams struggle most with data processing speed. Time is a critical factor that pulls skilled and valuable talent into unproductive firefighting. The more time they spend dealing with pipeline issues or failures, the greater the impact on productivity and delivery of new innovation.

Automated data pipeline monitoring and testing is essential for data cloud applications, since teams rapidly iterate and adapt to changing end-user needs and product requirements. Failed queries and data pipelines create data issues for downstream users and workloads such as analytics, BI dashboards, and AI/ML model training. These delays and failures can have a ripple effect that impacts end user decision-making and AI models that rely on accurate, timely content.

Conclusion

Unravel for Snowflake combines the power of AI and automation to help you overcome these challenges. With Unravel, Snowflake users get improved visibility to allocate costs for showback/chargeback, AI-driven recommendations to boost query efficiency, and real-time spend reporting and alerts to accurately predict costs. Unravel for Snowflake helps you optimize your workloads and get more value from your data cloud investments.

Take the next step and check out a self-guided tour or request a free Snowflake Health Check report.

The post Top 4 Challenges to Scaling Snowflake for AI appeared first on Unravel.

Why Optimizing Cost Is Crucial to AI/ML Success

Stephen Lamont — Fri, 03 Nov 2023 13:59:09 +0000

This article was originally published by the Forbes Technology Council, September 13, 2023.

When gold was discovered in California in 1848, it triggered one of the largest migrations in U.S. history, accelerated a transportation revolution and helped revitalize the U.S. economy. There’s another kind of Gold Rush happening today: a mad dash to invest in artificial intelligence (AI) and machine learning (ML).

The speed at which AI-related technologies have been embraced by businesses means that companies can’t afford to sit on the sidelines. Companies also can’t afford to invest in models that fail to live up to their promises.

But AI comes with a cost. McKinsey estimates that developing a single generative AI model costs up to $200 million, up to $10 million to customize an existing model with internal data and up to $2 million to deploy an off-the-shelf solution.

The volume of generative AI/ML workloads—and data pipelines that power them—has also skyrocketed at an exponential rate as various departments run differing use cases with this transformational technology. Bloomberg Intelligence reports that the generative AI market is poised to explode, growing to $1.3 trillion over the next 10 years from a market size of just $40 billion in 2022. And every job, every workload, and every data pipeline costs money.

Because of the cost factor, winning the AI race isn’t just about getting there first; it’s about making the best use of resources to achieve maximum business goals.

The Snowball Effect

There was a time when IT teams were the only ones utilizing AI/ML models. Now, teams across the enterprise—from marketing to risk to finance to product and supply chain—are all utilizing AI in some capacity, many of whom lack the training and expertise to run these models efficiently.

AI/ML models process exponentially more data, requiring massive amounts of cloud compute and storage resources. That makes them expensive: A single training run for GPT-3 costs $12 million.

Enterprises today may have upwards of tens—even hundreds—of thousands of pipelines running at any given time. Running sub-optimized pipelines in the cloud often causes costs to quickly spin out of control.

The most obvious culprit is oversized infrastructure, where users are simply guessing how much compute resources they need rather than basing it on actual usage requirements. Same thing with storage costs, where teams may be using more expensive options than necessary for huge amounts of data that they rarely use.

But data quality and inefficient code often cause costs to soar even higher: data schema, data skew and load imbalances, idle time, and a rash of other code-related issues that make data pipelines take longer to run than necessary—or even fail outright.

Like a snowball gathering size as it rolls down a mountain, the more data pipelines you have running, the more problems, headaches, and, ultimately, costs you’re likely to have.

And it’s not just cloud costs that need to be considered. Modern data pipelines and AI workloads are complex. It takes a tremendous amount of troubleshooting expertise just to keep models working and meeting business SLAs—and that doesn’t factor in the costs of downtime or brand damage. For example, if a bank’s fraud detection app goes down for even a few minutes, how much would that cost the company?

Optimized Data Workloads on the Cloud

Optimizing cloud data costs is a business strategy that ensures a company’s resources are being allocated appropriately and in the most cost-efficient manner. It’s fundamental to the success of an AI-driven company as it ensures that cloud data budgets are being used effectively and providing maximum ROI.

But business and IT leaders need to first understand exactly where resources are being used efficiently and where waste is occurring. To do so, keep in mind the following when developing a cloud data cost optimization strategy.

• Reuse building blocks. Everything costs money on the cloud. Every file you store, every record you access, every piece of code you run incur a cost. Data processing can usually be broken down into a series of steps, and a smart data team should be able to reuse those steps for other processing. For example, code written to move data about a company’s sales records could be reused by the pricing and product teams rather than both teams building their own code separately and incurring twice the cost.

• Truly leverage cloud capabilities. The cloud allows you to quickly adjust the resources needed to process data workloads. Unfortunately, too many companies operate under “just in case” scenarios that lead to allocating more resources than actually needed. By understanding usage patterns and leveraging cloud’s auto-scaling capabilities, it’s possible for companies to dynamically control how they scale up and, more importantly, create guardrails to manage the maximum.

• Analyze compute and storage spend by job and by user. The ability to really dig down to the granular details of who is spending what on which project will likely yield a few surprises. You might find that the most expensive jobs are not the ones that are making your company millions. You may find that you’re paying way more for exploration than for data models that will be put to good use. Or, you may find that the same group of users are responsible for the jobs with the biggest spend and the lowest ROI (in which case, it might be time to tighten up on some processes).

Given the data demands that generative AI models and use cases place on a company, business and IT leaders need to have a deep understanding of what’s going on under the proverbial hood. As generative AI evolves, business leaders will need to address new challenges. Keeping cloud costs under control shouldn’t be one of them.

The post Why Optimizing Cost Is Crucial to AI/ML Success appeared first on Unravel.

Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality

Stephen Lamont — Mon, 24 Jul 2023 14:09:01 +0000

Introduction

In the era of Big Data, the importance of data quality cannot be overstated. The vast volumes of information generated every second hold immense potential for organizations across industries. However, this potential can only be realized when the underlying data is accurate, reliable, and consistent. Data quality serves as the bedrock upon which crucial business decisions are made, insights are derived, and strategies are formulated. It empowers organizations to gain a comprehensive understanding of their operations, customers, and market trends. High-quality data ensures that analytics, machine learning, and artificial intelligence algorithms produce meaningful and actionable outcomes. From detecting patterns and predicting trends to identifying opportunities and mitigating risks, data quality is the driving force behind data-driven success. It instills confidence in decision-makers, fosters innovation, and unlocks the full potential of Big Data, enabling organizations to thrive in today’s data-driven world.

Overview

In this seven-part blog we will explore using ML/AI for data quality. Machine learning and artificial intelligence can be instrumental in improving the quality of data. Machine learning models like logistic regression, decision trees, random forests, gradient boosting machines, and neural networks predict categories of data based on past examples, correcting misclassifications. Linear regression, polynomial regression, support vector regression, and neural networks predict numeric values, filling in missing entries. Clustering techniques like K-means, hierarchical clustering, and DBSCAN identify duplicates or near-duplicates. Models such as Isolation Forest, Local Outlier Factor, and Auto-encoders detect outliers and anomalies. To handle missing data, k-Nearest Neighbors and Expectation-Maximization predict and fill in the gaps. NLP models like BERT, GPT, and RoBERTa process and analyze text data, ensuring quality through tasks like entity recognition and sentiment analysis. CNNs fix errors in image data, while RNNs and Transformer models handle sequence data. The key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors present in the data. We will focus on using the following models to apply critical data quality to our lakehouse.

Machine learning and artificial intelligence can be instrumental in improving the quality of data. Here are a few models and techniques that can be used for various use cases:

Classification Algorithms: Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled.
Regression Algorithms: Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set.
Clustering Algorithms: Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data.
Anomaly Detection Algorithms: Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.
Data Imputation Techniques: Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data.
Natural Language Processing (NLP): NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, text classification, which can be helpful in ensuring the quality of text data.
Deep Learning Techniques: Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

Remember that the key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors presented in the data.

In this first blog of seven we will focus on Classification Algorithms. The code examples provided below can be found in this GitHub location. Below is a simple example of using a classification algorithm in a Databricks Notebook to address a data quality issue using a classification algorithm.

Data Engineers Leveraging AI/ML for Data Quality

Machine learning (ML) and artificial intelligence (AI) play a crucial role in the field of data engineering. Data engineers leverage ML and AI techniques to process, analyze, and extract valuable insights from large and complex datasets. Overall, ML and AI provide data engineers with powerful tools and techniques to extract insights, improve data quality, automate processes, and enable data-driven decision-making. They enhance the efficiency and effectiveness of data engineering workflows, enabling organizations to unlock the full potential of their data assets.

AI/ML can help with numerous data quality use cases. Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled. Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set. Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.

Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, text classification, which can be helpful in ensuring the quality of text data. Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

Suppose we have a dataset with some missing categorical values. We can use logistic regression to fill in the missing values based on the other features. For the purpose of this example, let’s assume we have a dataset with ‘age’, ‘income’, and ‘job_type’ features. Suppose ‘job_type’ is a categorical variable with some missing entries.

Classification Algorithms — Using Logistic Regression to Fix the Data

Logistic regression is primarily used for binary classification problems, where the goal is to predict a binary outcome variable based on one or more input variables. However, it is not typically used directly for assessing data quality. Data quality refers to the accuracy, completeness, consistency, and reliability of data.

That being said, logistic regression can be indirectly used as a tool for identifying potential data quality issues. Here are some examples of how it can be used in data quality. Logistic regression can be used to define the specific data quality issue to be addressed. For example, you may be interested in identifying data records with missing values or outliers. Logistic regression can be used for feature engineering. This helps identify relevant features (variables) that may indicate the presence of the data quality issue. Data preparation is one of the most common uses of logistic regression. Here the ML helps prepare the dataset by cleaning, transforming, and normalizing the data as necessary. This step involves handling missing values, outliers, and any other data preprocessing tasks.

It’s important to note that logistic regression alone generally cannot fix data quality problems, but it can help identify potential issues by predicting their presence based on the available features. Addressing the identified data quality issues usually requires domain knowledge, data cleansing techniques, and appropriate data management processes. In the simplified example below we see a problem and then use logistic regression to predict what the missing values should be.

Step 1

Create a table with the columns “age”, “income”, and “job_type” in a SQL database, you can use the following SQL statement:

Step1: Create Table

Step 2

Load data to table. Notice that three records are missing job_type. This will be the column that we will use ML to predict. We load a very small set of data for this example. This same technique can be used against billions or trillions of rows. More data will almost always yield better results.

Step 2: Load Data to Table

Step 3

Load the data into a data frame. If you need to create a unique index for your data frame, please refer to this article.

Step 3: Load Data to Data Frame

Step 4

In the context of PySpark DataFrame operations, filter() is a transformation function used to filter the rows in a DataFrame based on a condition.

Step 4: Filter the Rows in a Data Frame Based on a Condition

df is the original DataFrame. We’re creating two new DataFrames, df_known and df_unknown, from this original DataFrame.

df_known = df.filter(df.job_type.isNotNull()) is creating a new DataFrame that only contains rows where the job_type is not null (i.e., rows where the job_type is known).
df_unknown = df.filter(df.job_type.isNull()) is creating a new DataFrame that only contains rows where the job_type is null (i.e., rows where the job_type is unknown).

By separating the known and unknown job_type rows into two separate DataFrames, we can perform different operations on each. For instance, we use df_known to train the machine learning model, and then use that model to predict the job_type for the df_unknown DataFrame.

Step 5

In this step we will vectorize the features. Vectorizing the features is a crucial pre-processing step in machine learning and AI. In the context of machine learning, vectorization refers to the process of converting raw data into a format that can be understood and processed by a machine learning algorithm. A vector is essentially an ordered list of values, which in machine learning represent the ‘features’ or attributes of an observation.

Step 5: Vectorize Features

Step 6

In this step we will convert categorical labels to indices. Converting categorical labels to indices is a common preprocessing step in ML and AI when dealing with categorical data. Categorical data represents information that is divided into distinct categories or classes, such as “red,” “blue,” and “green” for colors or “dog,” “cat,” and “bird” for animal types. Machine learning algorithms typically require numerical input, so converting categorical labels to indices allows these algorithms to process the data effectively.

Converting categorical labels to indices is important for ML and AI algorithms because it allows them to interpret and process the categorical data as numerical inputs. This conversion enables the algorithms to perform computations, calculate distances, and make predictions based on the numerical representations of the categories. It is worth noting that label encoding does not imply any inherent order or numerical relationship between the categories; it simply provides a numerical representation that algorithms can work with.

It’s also worth mentioning that in some cases, label encoding may not be sufficient, especially when the categorical data has a large number of unique categories or when there is no inherent ordinal relationship between the categories. In such cases, additional techniques like one-hot encoding or feature hashing may be used to represent categorical data effectively for ML and AI algorithms.

Step 6: Converting Categorical Labels to Indices

Step 7

In this step we will train the model. Training a logistic regression model involves the process of estimating the parameters of the model based on a given dataset. The goal is to find the best-fitting line or decision boundary that separates the different classes in the data.

The process of training a logistic regression model aims to find the optimal parameters that minimize the cost function and provide the best separation between classes in the given dataset. With the trained model, it becomes possible to make predictions on new data points and classify them into the appropriate class based on their features.

Step 7: Train the Model

Step 8

Predict the missing value in this case job_type. Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. It is used to predict the probability of an instance belonging to a particular class or category.

Logistic regression is widely used in various applications such as sentiment analysis, fraud detection, spam filtering, and medical diagnosis. It provides a probabilistic interpretation and flexibility in handling both numerical and categorical independent variables, making it a popular choice for classification tasks.

Step 8: Predict the Missing Value in This Case job_type

Step 9

Convert predicted indices back to original labels. Converting predicted indices back to original labels in AI and ML involves the reverse process of encoding categorical labels into numerical indices. When working with classification tasks, machine learning models often output predicted class indices instead of the original categorical labels.

It’s important to note that this reverse mapping process assumes a one-to-one mapping between the indices and the original labels. In cases where the original labels are not unique or there is a more complex relationship between the indices and the labels, additional handling may be required to ensure accurate conversion.

Step 9: Convert Predicted Indices Back to Original Labels

Recap

Machine learning (ML) and artificial intelligence (AI) have a significant impact on data engineering, enabling the processing, analysis, and extraction of insights from complex datasets. ML and AI empower data engineers with tools to improve data quality, automate processes, and facilitate data-driven decision-making. Logistic regression, decision trees, random forests, gradient boosting machines, and neural networks can predict categories based on past examples, aiding in correcting misclassified or improperly labeled data. Algorithms like linear regression, polynomial regression, support vector regression, or neural networks predict numeric values, addressing missing numeric entries. Clustering techniques like K-means, hierarchical clustering, or DBSCAN identify duplicates or near-duplicates. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders detect outliers or anomalies, handling errors in the data. Machine learning techniques such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM) predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa process text data for tasks like entity recognition and sentiment analysis. CNNs correct errors in image data, while RNNs or Transformer models handle sequence data.

Data engineers can use Databricks Notebook, for example, to address data quality issues by training a logistic regression model to predict missing categorical values. This involves loading and preparing the data, separating known and unknown instances, vectorizing features, converting categorical labels to indices, training the model, predicting missing values, and converting predicted indices back to original labels.

Ready to unlock the full potential of your data? Embrace the power of machine learning and artificial intelligence in data engineering. Improve data quality, automate processes, and make data-driven decisions with confidence. From logistic regression to neural networks, leverage powerful algorithms to predict categories, address missing values, detect anomalies, and more. Utilize clustering techniques to identify duplicates and near-duplicates. Process text, correct image errors, and handle sequential data effortlessly. Try out tools like Databricks Notebook to train models and resolve data quality issues. Empower your data engineering journey and transform your organization’s data into valuable insights. Take the leap into the world of ML and AI in data engineering today! You can start with this Notebook.

The post Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality appeared first on Unravel.

Healthcare leader uses AI insights to boost data pipeline efficiency

Stephen Lamont — Mon, 10 Jul 2023 19:16:15 +0000

One of the largest health insurance providers in the United States uses Unravel to ensure that its business-critical data applications are optimized for performance, reliability, and cost in its development environment—before they go live in production.

Data and data-driven statistical analysis have always been at the core of health insurance. But over the past few years the industry has seen an explosion in the volume, velocity, and variety of big data—electronic health records (EHR), electronic medical records (EMRs), and IoT data produced wearable medical devices and mobile health apps. As the company’s chief medical officer has said, “Sometimes I think we’re becoming more of a data analytics company than anything else.”

Like many Fortune 500 organizations, the company has a complex, hybrid, multi-everything data estate. Many workloads are still running on premises in Cloudera, but the company also has pipelines on Azure and Google Cloud Platform. Further, its Dev environment is fully on AWS. Says the key technology manager for the Enterprise Data Analytics Platform team, “Unravel is needed for us to ensure that the jobs run smoothly because these are critical data jobs,” and Unravel helps them understand and optimize performance and resource usage.

With the data team’s highest priority being able to guarantee that its 1,000s of data jobs deliver reliable results on time, every time, they find Unravel’s automated AI-powered Insights Engine invaluable. Unravel auto-discovers everything the company has running in its data estate (both in Dev and Prod), extracting millions of contextualized granular details from logs, traces, metrics, events and other metadata—horizontally and vertically—from the application down to infrastructure and everything in between. Then Unravel’s AI/ML correlates all this information into a holistic view that “connects the dots” as to how everything works together. AI and machine learning algorithms analyze millions of details in context to detect anomalous behavior in real time, pinpoint root causes in milliseconds, and automatically provide prescriptive recommendations on where and how to change configurations, containers, code, resource allocations, etc.

Application developers rely on Unravel to automatically analyze and validate their data jobs in Dev, before the apps ever go live in Prod by, first, identifying inefficient code—code that is most likely to break in production—and then, second, pinpointing to every individual data engineer exactly where and why code should be fixed, so they can tackle potential problems themselves via self-service optimization. The end-result ensures that performance inefficiencies never see the light of day.

The Unravel AI-powered Insights Engine similarly analyzes resource usage. The company leverages the chargeback report capability to understand how the various teams are using their resources. (But Unravel can also slice and dice the information to show how resources are being used by individual users, individual jobs, data products or projects, departments, Dev vs. Prod environments, budgets, etc.) For data workloads still running in Cloudera, this helps avoid resource contention and lets teams queue their jobs more efficiently. Unravel even enables teams to kill a job instantly if it is causing another mission-critical job to fail.

For workloads running in the cloud, Unravel provides precise, prescriptive AI-fueled recommendations on more efficient resource usage—usually downsizing requested resources to fewer or less costly alternatives that will still hit SLAs.

As the company’s technology manager for cloud infrastructure and interoperability says, “Some teams use humongous data, and every year our users are growing.” With such astronomical growth, it has become ever more important to tackle data workload efficiency more proactively—everything has simply gotten too big and too complex and business-critical to reactively respond. The company has leveraged Unravel’s automated guardrails and governance rules to trigger alerts whenever jobs are using more resources than necessary.

Top benefits:

Teams can now view their queues at a glance and run their jobs more efficiently, without conflict.

Chargeback reports show how teams are using their resources (or not), which helps set the timing for running jobs—during business vs. off hours. This has provided a lot of relief for all application teams.

The post Healthcare leader uses AI insights to boost data pipeline efficiency appeared first on Unravel.

AI-Driven Observability for Snowflake

Stephen Lamont — Wed, 21 Jun 2023 19:29:44 +0000

AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR SNOWFLAKE

Performance. Reliability. Cost-effectiveness.

Unravel’s automated, AI-powered data observability + FinOps platform for Snowflake and other modern data stacks provides 360° visibility to allocate costs with granular precision, accurately predict spend, run 50% more workloads at the same budget, launch new apps 3X faster, and reliably hit greater than 99% of SLAs.

Unravel Data Observability + FinOps for Snowflake you can:

Launch new apps 3X faster: End-to-end observability of data-native applications and pipelines. Automatic improvement of performance, cost efficiency, and reliability.
Run 50% more workloads for same budget: Break down spend and forecast accurately. Optimize apps and platforms by eliminating inefficiencies. Set guardrails and automate governance. Unravel’s AI helps you implement observability and FinOps to ensure you achieve efficiency goals.
Reduce firefighting time by 99% using AI-enabled troubleshooting: Detect anomalies, drift, skew, missing and incomplete data end-to-end. Integrate with multiple data quality solutions. All in one place.
Forecast budget with ⨦ 10% accuracy: Accurately anticipate cloud data spending to for more predictable ROI. Unravel helps you accurately forecast spending with granular cost allocation. Purpose-built AI, at job, user and workgroup levels, enables real-time visibility of ongoing usage.

To see Unravel Data for Snowflake in action contact us today!

The post AI-Driven Observability for Snowflake appeared first on Unravel.

Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving

Christine Della Penna — Tue, 04 Apr 2023 21:53:31 +0000

By Craig Wiley, Senior Director of Product Management, Databricks and Clinton Ford, Director of Product Marketing, Unravel Data

Introduction

Machine learning (ML) enables organizations to extract more value from their data than ever before. Companies who successfully deploy ML models into production are able to leverage that data value at a faster pace than ever before. But deploying ML models requires a number of key steps, each fraught with challenges:

Data preparation, cleaning, and processing
Feature engineering
Training and ML experiments
Model deployment
Model monitoring and scoring

Figure 1. Phases of the ML lifecycle with Databricks Machine Learning and Unravel Data

Challenges at each phase

Data preparation and processing

Data preparation is a data scientist’s most time-consuming task. While there are many phases in the data science lifecycle, an ML model can only be as good as the data that feeds it. Reliable and consistent data is essential for training and machine learning (ML) experiments. Despite advances in data processing, a significant amount of effort is required to load and prepare data for training and ML experimentation. Unreliable data pipelines slow the process of developing new ML models.

Training and ML experiments

Once data is collected, cleansed, and refined, it is ready for feature engineering, model training, and ML experiments. The process is often tedious and error-prone, yet machine learning teams also need a way to reproduce and explain their results for debugging, regulatory reporting, or other purposes. Recording all of the necessary information about data lineage, source code revisions, and experiment results can be time-consuming and burdensome. Before a model can be deployed into production, it must have all of the detailed information for audits and reproducibility, including hyperparameters and performance metrics.

Model deployment and monitoring

While building ML models is hard, deploying them into production is even more difficult. For example, data quality must be continuously validated and model results must be scored for accuracy to detect model drift. What makes this challenge even more daunting is the breadth of ML frameworks and the required handoffs between teams throughout the ML model lifecycle– from data preparation and training to experimentation and production deployment. Model experiments are difficult to reproduce as the code, library dependencies, and source data change, evolve, and grow over time.

The solution

The ultimate hack to productionize ML is data observability combined with scalable, serverless, and automated ML model serving. Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies the challenges of data operations, improves performance, saves critical engineering time, and optimizes resource utilization.

Databricks Model Serving deploys machine learning models as a REST API, enabling you to build real-time ML applications like personalized recommendations, customer service chatbots, fraud detection, and more – all without the hassle of managing serving infrastructure.

Databricks + data observability

Whether you are building a lakehouse with Databricks for ML model serving, ETL, streaming data pipelines, BI dashboards, or data science, Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies operations, increases efficiency, and boosts productivity. Unravel provides AI insights to proactively pinpoint and resolve data pipeline performance issues, ensure data quality, and define automated guardrails for predictability.

Scalable training and ML experiments with Databricks

Databricks uses pre-installed, optimized libraries to build and train machine learning models. With Databricks, data science teams can build and train machine learning models. Databricks provides pre-installed, optimized libraries. Examples include scikit-learn, TensorFlow, PyTorch, and XGBoost. MLflow integration with Databricks on AWS and Azure Databricks makes it easy to track experiments and store models in repositories.

MLflow monitors machine learning model training and running. Information about the source code, data, configuration information, and results are stored in a single location for quick and easy reference. MLflow also stores models and loads them in production. Because MLflow is built on open frameworks, many different services, applications, frameworks, and tools can access and consume the models and related details.

Serverless ML model deployment and serving

Databricks Serverless Model Serving accelerates data science teams’ path to production by simplifying deployments and reducing mistakes through integrated tools. With the new model serving service, you can do the following:

Deploy a model as an API with one click in a serverless environment.
Serve models with high availability and low latency using endpoints that can automatically scale up and down based on incoming workload.
Safely deploy the model using flexible deployment patterns such as progressive rollout or perform online experimentation using A/B testing.
Seamlessly integrate model serving with online feature store (hosted on Azure Cosmos DB), MLflow Model Registry, and monitoring, allowing for faster and error-free deployment.

Conclusion

You can now train, deploy, monitor, and retrain machine learning models, all on the same platform with Databricks Model Serving. Integrating the feature store with model serving and monitoring helps ensure that production models are leveraging the latest data to produce accurate results. The end result is increased availability and simplified operations for greater AI velocity and positive business impact.

Ready to get started and try it out for yourself? Watch this Databricks event to see it in action. You can read more about Databricks Model Serving and how to use it in the Databricks on AWS documentation and the Azure Databricks documentation. Learn more about data observability in the Unravel documentation.

The post Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving appeared first on Unravel.

Join Unravel at the AI & Big Data Expo

Stephen Lamont — Fri, 19 Aug 2022 19:53:52 +0000

Swing by the Unravel Data booth at the AI & Big Data Expo in Santa Clara on October 5-6. The world’s leading AI & Big Data event returns as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe.

And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on DataOps Observability. He’ll explain why AI & Big Data organizations need an observability platform designed specifically for data teams and their unique challenges, the limitations of trying to “borrow” other observability (like APM) or relying on a bunch of different point tools, and how DataOps observability cuts across and incorporates cross-sections of multiple observability domains (data applications/pipelines/model observability, operations observability, business and FinOps observability, data observability).

Stop by our booth and you’ll be able to:

Go under the hood with Unravel’s DataOps observability platform
Deep-dive into features and capabilities with our experts
Learn what your peers have been able to accomplish with Unravel

Our experts will run demos and be available for 1-on-1 conversations throughout the conference.

Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities. Explore features like:

Full-stack data pipeline observability
Automated root cause analysis, at both the job and pipeline level
Job-level AI optimization recommendations
Automated cloud cluster optimizations
Pinpoint-precise chargeback reports
Automated budget tracking
Proactive cost governance alerts
Cloud migration workload fit reports
Automated cluster discovery

Explore all our interactive guided tours here.

The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

You can register for the event here.

The post Join Unravel at the AI & Big Data Expo appeared first on Unravel.

Three Takeaways from the Data + AI Summit 2022

Stephen Lamont — Tue, 12 Jul 2022 14:23:36 +0000

Databricks recently hosted the Data + AI Summit 2022, attended by 5,000 people in person in San Francisco and some 60,000 virtually. Billed as the world’s largest data and AI conference, the event featured over 250 presentations from dozens of speakers, training sessions, and four keynote presentations.

Databricks made a slew of announcements falling into two buckets: enhancements to open-source technologies underpinning the Databricks platform, and previews, enhancements, and GA releases related to its proprietary capabilities. Day 1 focused on data engineering announcements, day 2 on ML announcements.

There was obviously a lot to take in, so here are just three takeaways from one attendee’s perspective.

Convergence of Data Analytics and ML

The predominant overarching (or underlying, depending on how you look at it) theme running as a red thread throughout all the presentations was the convergence of data analytics and ML. It’s well known how the “big boys” like Amazon, Google, Facebook, Netflix, Apple, and Microsoft have driven disruptive innovation through data, analytics, and AI. But now more and more enterprises are doing the same thing.

In his opening keynote presentation, Databricks Co-Founder and CEO Ali Ghodshi expressed this trend as the process of moving to the right-hand side of the Data Maturity Curve, moving from hindsight to foresight.

However, today most companies are not on the right-hand side of the curve. They are still struggling to find success at scale. Ghodshi says the big reason is that there’s a technology divide between two incompatible architectures that’s getting in the way. On one side are the data warehouse and BI tools for analysts; on the other, data lakes and AI technologies for data scientists. You wind up with disjointed and duplicative data silos, incompatible security and governance models, and incomplete support for use cases.

Bridging this chasm—where you get the best of both worlds—is of course the whole idea behind the Databricks lakehouse paradigm. But it wasn’t just Databricks who was talking about this convergence of data analytics and ML. Companies as diverse as John Deere, Intuit, Adobe, Nasdaq, Nike, and Akamai were saying the same thing. The future of data is AI/ML, and moving to the right-hand side of the curve is crucial for a better competitive advantage.

Databricks Delta Lake goes fully open source

Probably the most warmly received announcement was that Databricks is open-sourcing its Delta Lake APIs as part of its Delta Lake 2.0 release. Further, Databricks will contribute all its proprietary Delta Lake features and enhancements. This is big—while Databricks initially launched its lakehouse platform as open source back in 2019, many of its subsequent features were proprietary additions available only to its customers.

Make the most of your Databricks platform
Download Unravel for Databricks data sheet

Databricks cost optimization + Unravel

A healthy number of the 5,000 in-person attendees swung by the Unravel booth to discuss its DataOps observability platform. Some were interested in how Unravel accelerates migration of on-prem Hadoop clusters to the Databricks platform. Others were interested in how its vendor-agnostic automated troubleshooting and AI-enabled optimization recommendations could benefit teams already on Databricks. But the #1 topic of conversation was around cost optimization for Databricks.

Specifically, these Databricks customers are realizing the business value of the platform and are looking to expand their usage. But they are concerned about running up unnecessary costs—whether it was FinOps people or data team leaders looking for ways to govern costs with greater accuracy, Operations personnel who had uncovered occurrences of jobs where someone had requested oversized configurations, or data engineers who wanted to find out how to right-size their resource requests. Everybody wanted to discover a data-driven approach to optimizing jobs for cost. They all said the same thing, one way or another: the powers-that-be don’t mind spending money on Databricks, given the value it delivers; what they hate is spending more than they have to.

And this is something Databricks as a company appreciates from a business perspective. They want customers to be happy and successful—if customers have confidence that they’re spending only as much as absolutely necessary, then they’ll embrace increased usage without batting an eye.

This is where Unravel really stands out from the competition. Its deep observability intelligence understands how many and what size resources are actually needed, compares that to the configuration settings the user requested, and comes back with a crisp, precise recommendation on exactly what changes to make to optimize for cost—at either the job or cluster level. These AI recommendations are generated automatically, waiting for you at any number of dashboards or views within Unravel. Cost-optimization recommendations let you cut to the chase with literally just one click.

Check out this short interactive demo (takes about 90 seconds) to see Unravel’s Job-Level AI Recommendations capability or this one to see its Automated Cloud Cluster Optimization feature.

The post Three Takeaways from the Data + AI Summit 2022 appeared first on Unravel.

Join Unravel at Data + AI Summit 2022

Stephen Lamont — Tue, 14 Jun 2022 18:08:24 +0000

Stop by Unravel Data Booth #322 at the Data + AI Summit 2022 in San Francisco, June 27-30 for a chance to win a special giveaway prize!

When you stop by our booth on expo days June 28-29, you’ll have a golden opportunity to:

Go under the hood with Unravel’s DataOps observability platform
Get a sneak-peek first look at Unravel’s new Databricks-specific capabilities
Deep-dive into features with our experts
Learn what your peers have been able to accomplish with Unravel
Win a raffle prize

Our experts will run demos and be available for 1-on-1 conversations throughout the conference—and everyone who books a meeting will be entered into a raffle for a to-be-announced ah-mazing prize.

Sponsored by Databricks, the Data+AI Summit is the world’s largest data and AI conference—four days packed with keynotes by industry visionaries, technical sessions, hands-on training, and networking opportunities.

See the full agenda here.

This year, a new hybrid format features an expanded curriculum of half- and full-day in-person and virtual training classes and six different speaker sessions tracks: Data Analytics, BI and Visualization; Data Engineering; Data Lakes, Data Warehouses and Data Lakehouses; Data Science, Machine Learning and MLOps; Data Security and Governance; and Research.

What will you see at Unravel Booth #322?

Meet Unravel Data at booth #322

You’ll see Unravel in action: how our purpose-built AI-enabled observability platform helps you stop firefighting issues, control costs, and run faster data pipelines.

Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities. Explore features like:

Full-stack data pipeline observability
Automated root cause analysis, at both the job and pipeline level
Job-level AI optimization recommendations
Automated cloud cluster optimizations
Pinpoint-precise chargeback reports
Automated budget tracking
Proactive cost governance alerts
Cloud migration workload fit reports
Automated cluster discovery

Explore all our interactive guided tours here.

And here’s a quick overview how Unravel helps you make the most of our Databricks platform.

The post Join Unravel at Data + AI Summit 2022 appeared first on Unravel.

Unravel Data Unveils Most Advanced AI-Powered DataOps

Unravel Data — Tue, 01 Mar 2022 13:00:13 +0000

Unravel Data Unveils Most Advanced AI-Powered DataOps Observability Platform with Launch of its 2022 Winter Release

New edition enables data professionals to gain end-to-end visibility and more effectively optimize cost and performance of modern data stack

Palo Alto, CA – March 1, 2022 – Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced the general availability of its 2022 Winter Release of the Unravel Platform. With this new release, users of Unravel will be able to leverage AI-driven enhancements that reduce the complexity of their data pipelines, control cloud utilization costs for data applications, and realize faster time-to-value for data cloud migrations. More than 20 Fortune 100 brands, including two of the top five global pharmaceutical companies and three of the top 10 financial companies in the world, rely on Unravel to simplify the complexities of their modern data operations.

“Every company is a data company. However, the complexity of the modern data stack requires a completely new and intentional approach – one that can move at the speed and scale that today’s real-time enterprise demands,” said Kunal Agarwal, founder and CEO of Unravel Data. “With this latest release, Unravel enables organizations to accelerate their cloud data migration initiatives with unmatched performance and reliability. The insights delivered by the Unravel platform also enable our customers to gain a more complete understanding of their cloud costs, delivering on our promise to radically simplify data operations”.

Traditional observability platforms were not designed to meet the advanced requirements and complexity of modern data stacks. According to a recent 451 Research report, the lack of observability into cloud data costs means enterprises are leaving an estimated “$24 billion in cost savings on the table” and that with improved cost governance and observability capabilities, they could potentially realize “massive savings of 62% by mixing and matching cloud services across environments.”

Users of the Winter Edition of the Unravel Platform will realize some of the following key benefits:

Improved Cost Optimization and Utilization for Databricks: With Unravel’s new ‘Cost 360 for Modern Data Stack’ module, customers can achieve a unified view of their data cloud utilization costs and drive greater predictability into their budgeting process. The new Winter Release supports full cost observability, budgeting, forecasting and optimization on Databricks with plans to roll out improved cost optimization support for other cloud platforms in future editions.
Deep Observability for Google Cloud: Unravel users can now gain unprecedented visibility into their Google Cloud Dataproc and Google Cloud BigQuery environments, enabling them to visualize cluster resources, jobs and stages in Dataproc and improve the performance and troubleshooting of their BigQuery applications and infrastructure.
Broader Coverage of Modern Data Stacks: The latest edition features integrated support for Databricks Delta Lake and Amazon EMR to support cluster cost optimization and chargebacks, as well as support for GCP Dataproc and BigQuery managed Spark and Hadoop instances.
Simplified Cloud Data Migrations: Unravel helps IT leaders assess and mitigate the various risk factors inherent in complex cloud data migrations by providing data teams with unprecedented visibility across their entire data stack. Intelligence gathered from Unravel also helps decision makers rationalize their costs, work efforts, dependencies, and which data workloads should be prioritized for the migration.

Modern data teams are invited to create a free account here.

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading data observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, 84.51˚ (a Kroger company), and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit https://www.unraveldata.com.

Media Contact
Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post Unravel Data Unveils Most Advanced AI-Powered DataOps appeared first on Unravel.

Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data

Floyd Smith — Mon, 31 May 2021 04:20:30 +0000

Jeeves is the stereotypical English butler – and an AI chatbot that answers pertinent and important questions about Spark jobs in production. Shivnath Babu, CTO and co-founder of Unravel Data, spoke yesterday at Data + AI Summit, formerly known as Spark Summit, about the evolution of Jeeves, and how the technology has become a key supporting pillar within Unravel Data’s software.

Unravel is a leading platform for DataOps, bringing together a raft of seemingly disparate information to make it much easier to view, monitor, and manage pipelines. With Unravel, individual jobs and their pipelines become visible. But also, the interactions between jobs and pipelines become visible too.

It’s often these interactions, which are ephemeral, and very hard to track through traditional monitoring solutions, that cause jobs to cause or fail. Unravel makes them visible and actionable. On top of this, AI and machine learning help the software make proactive suggestions about improvements, and even head off trouble before it happens.

Both performance improvements and cost management become far easier with Unravel, even for DataOps personnel who don’t know all of the underlying technologies used by a given pipeline in detail.

Jeeves to the Rescue

An app failure in Spark may be difficult to even discover – let alone to trace, troubleshoot, repair, and retry. If the failure is due to interactions among multiple apps, a whole additional dimension of trouble arises. As data volumes and pipeline criticality rocket upward, no one in a busy IT department has time to dig around for the causes of problems and search for possible solutions.

But – Jeeves to the rescue! Jeeves acts as a chatbot for finding, understanding, fixing, and improving Spark jobs, and the configuration settings that define where, when, and how they run. The Jeeves demo in Shivnath’s talk shows how Jeeves comes up with the errant Spark job (by ID number), describes what happened, and recommends the needed adjustments – configuration changes, in this case – to fix the problem going forward. Jeeves can even resubmit the repaired job for you.

See how Unravel simplifies troubleshooting Spark

Create a free account

Wait, One More Thing…

But there’s more – much more. The technology underlying Jeeves has now been built into Unravel Data, with stellar results.

Modern data pipelines are ever more populated. In his talk, Shivnath shows us several views on the modern data landscape. His simplified diagram shows five silos, and 14 different top-level processes, between today’s data sources and a plethora of data consumers, both human and machine.

But Shivnath shines a bright light into this Aladdin’s cave of – well, either treasures or disasters, depending on your point of view, and whether everything is working or not. He describes each of the major processes that take place within the architecture, and highlights the plethora of technologies and products that are used to complete each process.

He sums it all up by showing the role of data pipelines in carrying out mission-critical tasks, how the data stack for all of this continues to get more complex, and how DataOps as a practice has emerged to try and get a handle on all of it.

This is where we move from Jeeves, a sophisticated bot, to Unravel, which incorporates the Jeeves functionality – and much more. Shivnath describes Unravel’s Pipeline Observer, which interacts with a large and growing range of pipeline technologies to monitor, manage, and recommend (through an AI and machine learning-powered engine) how to fix, improve, and optimize pipeline and workload functionality and reliability.

In an Unravel demo, Shivnath shows how to improve a pipeline that’s in danger of:

Breaking due to data quality problems
Missing its performance SLA
Cost overruns – check your latest cloud bill for examples of this one

If you’re in DataOps, you’ve undoubtedly experienced the pain of pipeline breaks, and that uneasy feeling of SLA misses, all reflected in your messaging apps, email, and performance reviews – not to mention the dreaded cost overruns, which don’t show up until you look at your cloud provider bills.

Shivnath concludes by offering a chance for you to create a free account; to contact the company for more information; or to reach out to Shivnath personally, especially if your career is headed in the direction of helping solve these and related problems. To get the full benefit of Shivnath’s perspective, dig into the context, and understand what’s happening in depth, please watch the presentation.

The post Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data appeared first on Unravel.

AI/ML without DataOps is just a pipe dream!

Floyd Smith — Fri, 23 Apr 2021 04:19:24 +0000

The following blog post appeared in its original form on Towards Data Science. It’s part of a series on DataOps for effective AI/ML. The author is CDO and VP Engineering here at Unravel Data. (GIF by giphy)

Let’s start with a real-world example from one of my past machine learning (ML) projects: We were building a customer churn model. “We urgently need an additional feature related to sentiment analysis of the customer support calls.” Creating the data pipeline to extract this dataset took about 4 months! Preparing, building, and scaling the Spark MLlib code took about 1.5-2 months! Later we realized that “an additional feature related to the time spent by the customer in accomplishing certain tasks in our app would further improve the model accuracy” — another 5 months gone in the data pipeline! Effectively, it took 2+ years to get the ML model deployed!

After driving dozens of ML initiatives (as well as advising multiple startups on this topic), I have reached the following conclusion: Given the iterative nature of AI/ML projects, having an agile process of building fast and reliable data pipelines (referred to as DataOps) has been the key differentiator in the ML projects that succeeded. (Unless there was a very exhaustive feature store available, which is typically never the case).

Behind every successful AI/ML product is a fast and reliable data pipeline developed using well-defined DataOps processes!

To level-set, what is DataOps? From Wikipedia: “DataOps incorporates the agile methodology to shorten the cycle time of analytics development in alignment with business goals.”

I define DataOps as a combination of process and technology to iteratively deliver reliable data pipelines with agility. Depending on the maturity of your data platform, you might be one of the following DataOps phases:

Ad-hoc: No clear processes for DataOps
Developing: Clear processes defined, but accomplished manually by the data team
Optimal: Clear processes with self-service automation for data scientists, analysts, and users.

Similar to software development, DataOps can be visualized as an infinity loop

The DataOps lifecycle – shown as an infinity loop above – represents the journey in transforming raw data to insights. Before discussing the key processes in each lifecycle stage, the following is a list of top-of-mind battle scars I have encountered in each of the stages:

Plan: “We cannot start a new project — we do not have the resources and need additional budget first”
Create: “The query joins the tables in the data samples. I didn’t realize the actual data had a billion rows! ”
Orchestrate: “Pipeline completes but the output table is empty — the scheduler triggered the ETL before the input table was populated”
Test & Fix: “Tested in dev using a toy dataset — processing failed in production with OOM (out of memory) errors”
Continuous Integration; “Poorly written data pipeline got promoted to production — the team is now firefighting”
Deploy: “Did not anticipate the scale and resource contention with other pipelines”
Operate & Monitor: “Not sure why the pipeline is running slowly today”
Optimize & Feedback: “I tuned the query one time — didn’t realize the need to do it continuously to account for data skew, scale, etc.”

To avoid these battle scars and more, it is critical to mature DataOps from ad hoc, to developing, to self-service.

This blog series will help you go from ad hoc to well-defined DataOps processes, as well as share ideas on how to make them self-service, so that data scientists and users are not bottlenecked by data engineers.

DataOps at scale with Unravel

Create a free account

For each stage of the DataOps lifecycle stage, follow the links for the key processes to define and the experiences in making them self-service (some of the links below are being populated, so please bookmark this blog post and come back over time):

Plan Stage

How to streamline finding datasets
Formulating the scope and success criteria of the AI/ML problem
How to select the right data processing technologies (batch, interactive, streaming) based on business needs

Create Stage

How to streamline accessing metadata properties of the datasets
How to streamline the data preparation process
How to make behavioral data self-service

Orchestrate Stage

Test & Fix Stage

Streamlining sandbox environment for testing
Identify and remove data pipeline bottlenecks
Verify data pipeline results for correctness, quality, performance, and efficiency

Continuous Integration & Deploy Stage

Smoke test for data pipeline code integration
Scheduling window selection for data pipelines
Changes rollback

Operate Stage

Detect anomalies to proactively avoid SLA violation
Managing data incidents in production
Alerting on rogue (resource hogging) jobs

Monitor Stage

Building end-to-end observability of data pipelines
Tracking lineage of data flows data
Enforcing data quality with circuit breakers

Optimize & Feedback Stage

Continuously optimize existing data pipelines
Alerting on budgets

In summary, DataOps is the key to delivering fast and reliable AI/ML! It is a team sport. This blog series aims to demystify the required processes as well as build a common understanding across Data Scientists, Engineers, Operations, etc.

DataOps as a team sport

To learn more, check out the recent DataOps Unleashed Conference, as well as innovations in DataOps Observability at Unravel Data. Come back to get notified when the links above are populated.

The post AI/ML without DataOps is just a pipe dream! appeared first on Unravel.

The Guide To DataOps, AIOps, and MLOPs in 2022

Unravel Data — Wed, 14 Apr 2021 15:22:11 +0000

Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those days are a distant memory now, and for good reason.

The Ops landscape has expanded beyond the generic “IT” to include DevOps, DataOps, AIOPs, MLOps, and more. Each of these Ops areas are cross-functional throughout an organization, and each provides a unique benefit. And each of the Ops areas emerges from the same general mechanism: Applying agile principles, originally created to guide software development, to the overlap of different flavors of software development, related technologies (data-driven applications, AI, and ML), and operations.

As with DevOps, the goal of DataOps, AIOps, and MLOps is to accelerate processes and improve the quality of data, analytics insights, and AI models respectively. In practice, we see DataOps as a superset of AIOps and MLOps with the latter two Ops overlapping each other.

Why is this? DataOps describes the flow of data, and the processing that takes place against that data, through one or more data pipelines. In this context, every data-driven app needs DataOps; those which are primarily driven by machine learning models also need MLOps, and those with additional AI capabilities need AIOps. (Confusingly, ML is sometimes considered to be separate from AI, and sometimes simply an important part of the technologies that are part of AI as a whole.)

The goal of this article is to help you understand these new terms and provide insight into where they came from, the similarities, differences, and who in an organization uses them. To start, we’ll look at DataOps.

What is DataOps

DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. DataOps is practiced by modern data teams, including data engineers, architects, analysts, scientists and operations.

The data products which power today’s companies range from advanced analytics, data pipelines, and machine learning models to embedded AI solutions. Using a DataOps methodology allows companies to move faster, more surely, and with greater cost-effectiveness in extracting value out of data.

A company can adopt a DataOps approach independently, without buying any specialty software or services. However, just as the term DevOps became strongly associated with the use of containers, as commercial software from Docker and as open source software from Kubernetes, the term DataOps is increasingly associated with data pipelines and applications.

DataOps at scale with Unravel

Create a free account

The origin of DataOps

DataOps is a successor methodology to DevOps, which is an approach to software development that optimizes for responsive, continuous deployment of software applications and software updates. DataOps applies this approach across the entire lifecycle of data applications and to even help productize data science.

As a term, DataOps has been in gradually increasing use for several years now. Gartner began to include it in their Hype Cycle for Data Management in 2018, and the term is now “moving up the Hype Cycle” as it becomes more widespread. The first industry conference devoted to DataOps, DataOps Unleashed, was launched in March 2021.

While DataOps is sometimes described as a set of best practices for continuous analytics, it is actually more holistic. DataOps includes identifying the data sources needed to solve a business problem, the processing needed to make that data ready for analytics, the analytics step(s) – which may be seen as including AI and ML – and the delivery of results to the people or apps that will use them. DataOps also includes making all this work fast enough to be useful, whether that means processing the weekly report in an hour or so, or approving a credit card transaction in a fraction of a second.

Who uses DataOps

DataOps is directly for data operations teams, data engineers, and software developers building data-powered applications and the software-defined infrastructure that supports them. The benefits of DataOps approaches are felt by the teams themselves; the IT team, which can now do more with less; the data team’s internal “customers,” who request and use analytics results; the organization as a whole; and the company’s customers. Ultimately, society benefits, as things simply work better, faster, and less expensively. (Compare booking a plane ticket online to going to a travel agent, or calling airlines yourself, out of the phone book, as a simple example.)

What is AIOps

AIOps stands for Artificial Intelligence for IT operations. It is a paradigm shift that allows machines to solve IT issues without the need of human assistance or interaction. AIOPs uses machine learning and analytics to analyze big data obtained via different tools, which allows for issues to be spotted automatically and dealt with in real time. (Confusingly, AIOps is also sometimes used to describe the operationalization of AI project, but we will stick with the definition used by Gartner and others, as described here.)

As part of DataOps, AIOps supports continuous integration and deployment for the core tech functions of machine learning and big data. AIOps helps automate operations across hybrid environments. AIOps includes the use of machine learning to detect patterns and reduce noise.

See Unravel AI and automation in action

Create a free account

The Origin of AIOps

AIOps was originally defined in 2017 by Gartner as a means to describe the growing interest and investment in applying a broad spectrum of AI capabilities to enterprise IT operations management challenges.

Gartner defines AIOps as platforms that utilize big data, machine learning, and other advanced analytics technologies to directly and indirectly enhance IT operations (such as monitoring, automation and service desk) functions with proactive, personal, and dynamic insight.

Put another way, AIOps refers to improving the way IT organizations manage data and information in their environments using artificial intelligence.

Who uses AIOps

From enterprises with large, complex environments, to cloud-native small and medium enterprises (SMEs), AIOps is being used globally by organizations of all sizes in a variety of industries. AIOps is most often bought in as products or services from specialist companies; few large organizations are using their own in-house AI expertise to solve IT operations problems.

Companies with extensive IT environments, spanning multiple technology types, are adopting AIOps, especially when they face issues of scale and complexity. AIOps can make a significant contribution when those challenges are layered on top of a business model that is dependent on IT. If the business needs to be agile and to quickly respond to market forces, IT will rely upon AIOps to help IT be just as agile in supporting the goals of the business.

AIOps is not just for massive enterprises, though. SMEs that need to develop and release software continuously are embracing AIOps as it allows them to continually improve their digital services, while preventing malfunctions and outages.

The ultimate goal of AIOps is to enhance IT Operations. AIOps delivers intelligent filtering of signals out of the noise in IT systems. AIOps intelligently observes IT operations data in order to identify root causes and recommend solutions quickly. In some cases, these recommendations can even be implemented without human interaction.

What is MLOps

MLOps, or Machine Learning Operations, helps simplify the management, logistics, and deployment of machine learning models between operations teams and machine learning researchers.

MLOps is like DataOps – the fusion of a discipline (machine learning in one case, data science in the other) and the operationalization of projects from that discipline. MLOps and DataOps are different from AIOps, which is the use of AI to improve AI operations.

MLOps takes the DevOps methodology of continuous integration and continuous deployment and applies it to machine learning. As in traditional development, there is code that needs to be written and deployed, as well as bug testing to be done, and changes in user requirements to be accommodated. Specific to the topic of machine learning, models are being trained with data, and new data is introduced to retrain the models again and again.

The Origin of MLOps

According to Forbes, the origins of MLOps date back to 2015, from a paper entitled “Hidden Technical Debt in Machine Learning Systems.” The paper offered the position that machine learning offered an incredibly powerful toolkit for building useful complex prediction systems quickly, but that it was dangerous to think of these quick wins as coming for free.

Who Uses MLOps

Data scientists tend to focus on the development of models that deliver valuable insights to your organization more quickly. Ops people tend to focus on running those models in production. MLOps unifies the two approaches into a single, flexible practice, focused on the delivery of business value through the use of machine learning models and relevant input data.

Because MLOps follows a similar pattern to DevOps, MLOps creates a seamless integration between your development cycle and your overall operations process that transforms how your organization handles the use of big data as input to machine learning models. MLOps helps drive insights that you can count on and put into practice.

Streamlining MLOps is critical to organizations that are developing Machine Learning models, as well as to end users who use applications that rely on these models.

According to research from Finances Online, machine learning applications and platforms account for 57% (or $42 Billion) in AI funding worldwide. Organizations that are making significant investments want to ensure they are deriving significant value.

As an example of the impact of MLOps, 97% of all mobile users use AI-powered voice assistants that depend on machine learning models, and thus benefit from MLOps. These voice assistants are the result of a subset of ML known as deep learning. This deep learning technology is built around machine learning and is at the core of platforms such as Apple’s Siri, Amazon Echo, and Google Assistant.

The goal of MLOps is to bridge the gap between data scientists and operations teams to deliver insights from machine learning models that can be put into use immediately.

Conclusion

Here at Unravel Data, we deliver a DataOps platform that uses AI-powered recommendations – AIOps – to help proactively identify and resolve operations problems. This platform complements the adoption of DataOps practices in an organization, with the end results including improved application performance, fewer operational hassles, lower costs, and the ability for IT to take on new initiatives that further benefit organizations.

Our platform does not explicitly support MLOps, though MLOps always occurs in a DataOps context. That is, machine learning models run on data – usually, on big data – and their outputs can also serve as inputs to additional processes, before business results are achieved.

As DataOps, AIOps, and MLOps proliferate – as working practices, and in the form of platforms and software tools that support agile XOPs approaches – complex stacks will be simplified and made to run much faster, with fewer problems, and at less cost. And overburdened IT organizations will truly be able to do more with less, leading to new and improved products and services that perhaps can’t all be imagined today.

If you want to know more about DataOps, check out the recorded sessions from the first DataOps industry conference, DataOps Unleashed. To learn more about the Unravel Data DataOps platform, you can create a free account or contact us.

The post The Guide To DataOps, AIOps, and MLOPs in 2022 appeared first on Unravel.

Rebuilding Reliable Modern Data Pipelines Using AI and DataOps

Unravel Data — Mon, 10 Feb 2020 19:47:32 +0000

Organizations today are building strategic applications using a wealth of internal and external data. Unfortunately, data-driven applications that combine customer data from multiple business channels can fail for many reasons. Identifying the cause and finding a fix is both challenging and time-consuming. With this practical ebook, DevOps personnel and enterprise architects will learn the processes and tools required to build and manage modern data pipelines.

Ted Malaska, Director of Enterprise Architecture at Capital One, examines the rise of modern data applications and guides you through a complete data operations framework. You’ll learn the importance of testing and monitoring when planning, building, automating, and managing robust data pipelines in the cloud, on premises, or in a hybrid configuration.

Plan, migrate, and operate modern data stack workloads and data pipelines using cloud-based and hybrid deployment models

Learn how performance management software can reduce the risk of running modern data applications
Take a deep dive into the components that comprise a typical data processing job
Use AI to provide insights, recommendations, and automation when operationalizing modern data systems and data applications
Plan, migrate, and operate modern data stack workloads and data pipelines using cloud-based and hybrid deployment models

The post Rebuilding Reliable Modern Data Pipelines Using AI and DataOps appeared first on Unravel.

Top 10 Bank Leverages Unravel’s AIOps To Tame Fraud Detection and Compliance Performance

Unravel Data — Mon, 09 Dec 2019 14:00:57 +0000

Unsurprisingly, Modern data apps have become crucial across all areas of the financial industry, with apps for fraud detection, claims processing and compliance amongst others playing a business-critical role. Unravel has been deployed by several of the world’s largest financial services organizations to ensure these critical apps perform reliably at all times. One recent example is one of America’s ten largest banks, a corporation that encompasses over 3,000 retail branches, 5000 ATMs and over 70,000 employees. This is what happened.

This bank has been using big data for a variety of purposes, but its two most important apps are fraud detection and compliance. They deployed Informatica broadly in order to run ETL jobs. This was a massive focus for the bank’s DataOps team, which had many workflows running multiple Hive queries. They also made heavy use of Spark and Kafka Streaming in order to process tons of real-time streaming data for their fraud detection app.

Unravel Kafka Dashboard (main)

The bank suffered constant headaches before they deployed Unravel. First, their data apps tended to be slow and failed frequently. In order to figure out why, they had to dig through an avalanche of raw data logs, a process that could take weeks. Once they had identified the problem, they would have to do a long trial-and-error process to determine how to fix it. This again could take weeks, if they were fortunate enough to even find a fix for the issue.

There was another monitoring issue at the cluster usage level. They knew they weren’t optimally consuming their compute resources but had no visibility into how to improve utilization. The team only fully became aware of how critical their compute utilization problem was when it caused a critical data app to fail.

After deploying Unravel, the bank was able to quickly alleviate these problems. To begin, the platform’s reporting capabilities changed things dramatically. The team was able to monitor and understand its many different modern data stack technologies (Hive, Spark, Workflows, Kafka) from a single interface rather than relying on siloed views that didn’t enable correlation or many useful insights. The bank’s Kafka Streaming deployment had been particularly hard to monitor due to the massive input of streaming data. In addition, they previously had no way to track if Informatica and Hive queries for ETL jobs were hitting SLAs. Unravel changed all of that, delivering detailed insights that told the team how every workload was performing.

The insights were just as valuable at the cluster usage level, with Unravel providing cluster optimization opportunities to further boost performance and reduce wasteful resource consumption. This was the first time the bank really felt they understood what was happening in each and every cluster.

On top of the monitoring and visibility capabilities, Unravel yielded a significant boost in app performance. This is where the platform’s AI and automated recommendations were huge boon for the customer. After first automatically diagnosing several root cause issues, Unravel delivered cleanup recommendations for almost half million Hive tables, resulting in tremendous performance improvements. The platform also enabled the team to set notifications for specific failures and gave them the option to run automated fixes in these circumstances.

See Unravel AI and automation in action

Try Unravel for free

Examples of “Stalled” and “Lagging” Consumer Groups (name = “demo”) showing in the Unravel UI

While the bank isn’t currently deploying data apps in the cloud, they do have plans to migrate soon. One of the hardest parts of any cloud migration is the planning phase. Unravel’s cloud assessment capabilities give the bank detailed insights to streamline this painstaking preliminary phase: The assessment mapped out the bank’s on-premises big data pipelines and then told them which apps are best fit for the cloud and how those apps should be configured using specific instance type recommendations and forecasting costs and consumption. This move saved the customer from having to hire an expensive consulting firm to evaluate and advise their move to the cloud, accelerated their decision timeline and critically provided data driven insights instead of relying on guesswork.

Modern data apps are the backbone of any major financial institution. Unravel’s AI-driven DataOps platform allowed this bank to leverage these critical data apps to their full potential for the first time. Unravel has been so transformative that the customer has been able to open their data lake to broader business users, democratizing data apps so they provide value to the team outside of the developer and IT operations staff. In the bank’s own words, Unravel is helping drive a cultural shift by ensuring big data delivers on its true potential and is future proofing architectural decisions as hybrid cloud deployments are evaluated.

The post Top 10 Bank Leverages Unravel’s AIOps To Tame Fraud Detection and Compliance Performance appeared first on Unravel.

Using Machine Learning to understand Kafka runtime behavior

George Demarest — Wed, 29 May 2019 19:16:43 +0000

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit in London. You can review the session slides or read the transcript below.

Transcript

Nate Snapp:

All right. Thanks. I’ll just give a quick introduction. My name is Nate Snapp. And I do big data infrastructure and engineering at companies such as Adobe, Palo Alto Networks, and Omniture. And I have had quite a bit of experience in streaming even outside of the Kafka realm about 12 years. I’ve worked at some of the big scale efforts that we’ve done for web analytics at Adobe and Omniture, working in the space of web analytics for a lot of the big companies out there, about 9 out of the 10 Fortune 500.

I’ve have dealt with major release events, from new iPads to all these things that have to do with big increases in data and streaming that in a timely fashion. Done systems that have over 10,000 servers in that [00:01:00] proprietary stack. But the exciting thing is, last couple years, moving to Kafka, and being able to apply some of those same principles in Kafka. And I’ll explain some of those today. And then I like to blog as well, natesnapp.com.

Shivnath Babu:

Hello, everyone. I’m Shivnath Babu. I’m cofounder and CTO at Unravel. What my day job looks like is building a platform like what is shown on the slide where we collect monitoring information from the big data systems, receiver systems like Kafka, like Spark, and correlate the data, analyze it. And some of the techniques that we will talk about today are inspired by that work to help operations teams as well as application developers, better manage, easily troubleshoot, and to do better capacity planning and operations for their receiver systems.

Nate:

All right. So to start with some of the Kafka work that I have been doing, we have [00:02:00] typically about…we have several clusters of Kafka that we run that have about 6 to 12 brokers, up to about 29 brokers on one of them. But we run the Kafka 5.2.1, and we’re upgrading the rest of the systems to that version. And we have about 1,700 topics across all these clusters. And we have pretty varying rates of ingestion, but topping out about 20,000 messages a second…we actually gotten higher than that. But what I explained this for is as we get into what kind of systems and events we see, the complexity involved often has some interesting play with how do you respond to these operational events.

So one of the aspects is that we have a lot of self-service that’s involved in this. We have users that set up their own topics, they setup their own pipelines. And we allow them to do [00:03:00] that to help make them most proficient and able to get them up to speed easily. And so because of that, we have a varying environment. And we have quite a big skew for how they bring it in. They do it with other…we have several other Kafka clusters that feed in. We have people that use the Java API, we have others that are using the REST API. Does anybody out here use the REST API for ingestion to Kafka? Not very many. That’s been a challenge for us.

But we also been for the Egress have a lot of custom endpoints, and a big one is to use HDFS and Hive. So as I get into this, I’ll be explaining some of the things that we see that are really dynamically challenging and why it’s not as simple as EFL Statements and how you triage and respond to these events. And then Shiv will talk more about how you can get to a higher level of using the actual ML to solve some of these challenging [00:04:00] issues.

So I’d like to start with a simple example. And in fact, this is what we often use. When we first get somebody new to the system, or we’re interviewing somebody to work in this kind of environment, as we start with a very simple example, and take it from a high level, I like to do something like this on the whiteboard and say, “Okay. You’re working in an environment that has data that’s constantly coming in that’s going through multiple stages of processing, and then eventually getting into a location where will be reported on, where people are gonna dive into it.”

And when I do this, it’s to show that there’s many choke points that can occur when this setup is made. And so you look at this and you say, “That’s great,” you have the data flowing through here. But when you hit different challenges along the way, it can back things up in interesting ways. And often, we talk about that as cascading failures of latency and latency-sensitive systems, you’re counting [00:05:00] latency is things back up. And so what I’ll do is explain to them, what if we were to take, for instance, you’ve got these three vats, “Okay, let’s take this last vat or thus last bin,” and if that’s our third stage, let’s go ahead and see if there’s a choke point in the pipe there that’s not processing. What happens then, and how do you respond to that?

And often, a new person will say, “Well, okay, I’m going to go and dive in,” then they’re gonna say, “Okay. What’s the problem with that particular choke point? And what can I do to alleviate that?” And I’ll say, “Okay. But what are you doing to make sure the rest of the system is processing okay? Is that choke point really just for one source of data? How do you make sure that it doesn’t start back-filling and cause other issues in the entire environment?” And so when I do that, they’ll often come back with, “Oh, okay, I’ll make sure that I’ve got enough capacity at the stage before the things don’t back up.”

And this is actually has very practical implications. You take the simple model, and it applies to a variety of things [00:06:00] that happen in the real world. So for every connection you have to a broker, and for every source that’s writing at the end, you actually have to account for two ports. It’s basic things, but as you scale up, it matters. So I have a port with the data being written in. I have a port that I’m managing for writing out. And as I have 5,000, 10,000 connections, I’m actually two X on the number of ports that I’m managing.

And what we found recently was we’re hitting the ephemeral port, what was it, the max ephemeral port range that Linux allows, and all sudden, “Okay. It doesn’t matter if you felt like you had capacity and maybe the message storage state, but that we actually hit boundaries at different areas. And so I think it’s important to say these have to stay a little different at time, so that you’re thinking in terms [00:07:00] of what other resources are we not accounting for, that can back up? So we find that there’s very important considerations on understanding the business logic behind it as well.

Data transparency, data governance actually can be important knowing where that data is coming from. And as I’ll talk about later, the ordering effects that you can have, it helps with being able to manage that at the application layer a lot better. So as I go through this, I want to highlight that it’s not so much a matter of just having a static view of the problem, but really understanding streams for the dynamic nature that they have, and that you may have planned for a certain amount of capacity. And then when you have a spike and that goes up, knowing how the system will respond at that level and what actions to take, having the ability to view that becomes really important.

So I’ll go through a couple of [00:08:00] data challenges here, practical challenges that I’ve seen. And as I go through that, then when we get into the ML part, the fun part, you’ll see what kind of algorithms we can use to better understand the varying signals that we have. So first challenge is variance in flow. And we actually had this with a customer of ours. And they would come to us saying, “Hey, everything look good as far as the number of messages that they’ve received.” But when they went to look at a certain…basically another topic for that, they were saying, “Well, some of the information about these visits actually looks wrong.”

And so they actually showed me a chart that look something like this. You could look at this like a five business day window here. And you look at that and you say, “Yeah, I mean, I could see. There’s a drop after Monday, the first day drops a little low.” That may be okay. Again, this is where you have to know what does a business trying to do with this data. And as an operator like myself, I can’t always make those as testaments really well. [00:09:00] Going to the third peak there, you see that go down and have a dip, they actually point out something like that, although I guess it was much smaller one and they’d circle this. And they said, “This is a huge problem for us.”

And I come to find out it had to do with sales of product and being able to target the right experience for that user. And so looking at that, and knowing what those anomalies are, and what they pertain to, having that visibility, again, going to the data transparency, you can help you to make the right decisions and say, “Okay.” And in this case with Kafka, what we do find is that at times, things like that can happen because we have, in this case, a bunch of different partitions.

One of those partitions backed up. And now, most of the data is processing, but some isn’t, and it happens to be bogged down. So what decisions can we make to be able to account for that and be able to say that, “Okay. This partition backed up, why is that partition backed up?” And the kind of ML algorithms you choose are the ones that help you [00:10:00] answer those five whys. If you talk about, “Okay. It backed up. It backed up because,” and you keep going backwards until you get to the root cause.

Challenge number two. We’ve experienced negative effects from really slow data. Slow data can happen for a variety of reasons. Two of them that we principally have found is users that have data that staging and temporary that they’re gonna be testing another pipeline, and then rolling that into production. But there’s actually some in production that are slow data, and it’s more or less how that stream is supposed to behave.

So think in terms again of what is the business trying to run on the stream? In our case, we would have some data that we wouldn’t want to have frequently like cancellations of users for our product. We hope that that stay really low. We hope that you have very few bumps in the road with that. But what we find with Kafka is that you have to have a very good [00:11:00] understanding of your offset expiration and your retention periods. We find that if you have those offsets expire too soon, then you have to go into a guesswork of, “Am I going to be aggressive,” and try and look back really far and reread that data, in which case, you may double count? Or am I gonna be really conservative, which case you may miss critical events, especially if you’re looking at cancellations, something that you need to understand about your users.

And so we found that to be a challenge. Keeping with that and going into this third idea is event sourcing. And this is something that I’ve heard argued either way should Kafka be used for this event sourcing and what that is. We have different events that we want to represent in the system. Do I create a topic for event and then have a whole bunch of topics, or do I have a single topic because we provide more guarantees on the partitions in that topic?

And it’s arguable, depends on what you’re trying to accomplish, which is the right method. But what we’ve seen [00:12:00] with users is because we have a self-service environment, we wanna give the transparency back to them on here’s what’s going on when you set it up in a certain way. And so if they say, “Okay, I wanna have,” for instance for our products, “a purchase topic representing the times that things were purchased.” And then a cancellation topic to represent the times that they decided not to use certain…they decided to cancel the product, what we can see is some ordering issues here. And I represented those matching with the colors up there. So a brown purchase followed by cancellation on that same user. You could see the, you know, the light purple there, but you can see one where the cancellation clearly comes before the purchase.

And so that is confusing when you do the processing, “Okay. Now, they’re out of order.” So having the ability to expose that back to the user and say, “Here’s the data that you’re looking at and why the data is confusing. What can you do to respond to that,” and actually pushing that back to the users is critical. So I’ll just cover a couple of surprises that [00:13:00] we’ve had along the way. And then I’ll turn the time over to Shiv. But does anybody here use Schema Registry? Any Schema Registry users? Okay, we’ve got a few. Anybody ever have users change the schema unexpectedly and still cause issues, even though you’ve used Schema Registry? Is it just me? Had a few, okay.

We found that users have this idea that they wanna have a solid schema, except when they wanna change it. And so we’ve coined this term flexible-rigid schema. They want a flexible-rigid schema. They want the best in both worlds. But what they have to understand is, “Okay, you introduce a new Boolean value, but your events don’t have that.” I can’t pick one. I can’t default to one. Limit the wrong decision, I guarantee you. It’s a 50/50 chance. And so we have this idea of can we expose back to them what they’re seeing, and when those changes occur. And they don’t always have control over changing the events at the source. And so they may not even be control of their destiny of having a [00:14:00] schema throughout the system changed.

Timeouts, leader affinity, I’m gonna skip over some of these or not spend a lot of time on it. But timeouts is a big one. As we write to HDFS, we see that the backups that occur from the Hive meta store when we’re writing with the HDFS sync can cause everything to go to a rebalanced state, which is really expensive, and now becomes a cascading issue, which was a single broker having an issue. So, again, others with leader affinity, poor assignments. There’s a randomizes where those get assigned to. We’d like to have concepts of the system involved. We wanna be able to understand the state of the system. Windows choices are being made. And if we can affect that, and Shiv will cover how that kind of visibility is helpful with some of the algorithms that can be used. And then basically, those things, all just lead to, why do we need to have better visibility into our data problems with Kafka? [00:15:00] Thanks.

Shivnath:

Thanks, Nate. So what Nate just actually talked about, and I’m sure most of you at this conference, the reason you’re here is that the streaming applications, Kafka-based architectures are becoming very, very popular, very common, and driving mission critical applications across the manufacturing, across ecommerce, many, many industries. So for this talk, I’m just going to approximate streaming architecture as something like this. There’s a stream store, something like Kafka that is being used, or poser, and maybe status actually kept in a NoSQL system, like HBase or Cassandra. And then there’s the computational element. This could be in Kafka itself with KStream, but it could be a Spark streaming, Flink as well.

When things are great, everything is fine. But when things start to break, maybe as Nate mentioned, we’re not getting your results on time. Things are actually congested, [00:16:00] backlog, all of these things can happen. And unfortunately, in any architecture like this, what happens is they’re all receiver systems. So problems could happen in many different places such as, it could be an application problem, maybe the structure, the schema, how things where…like Nate gave an example of a joint across streams, which…There might be problems there, but that might also be problems that the Kafka level, may be the partitioning, may be brokers, may be contention, or things that as a Spark level, no resource level problems, or configuration problems, all of these become very hard to troubleshoot.

And I’m sure many of you have run into problems like this. How many of you here can relate to problems what I’m showing on the slide here? Quite a few hands. As you’re running things in production, these challenges happen. And unfortunately, what happens is, given that these are all individual systems, often there is no single correlated view event that connects the streaming [00:17:00] computational side with the story side, or maybe with the NoSQL sides. And that poses a lot of issues. One of the crucial things is that when a problem happens, it takes a lot of time to resolve.

And wouldn’t it be great if there’s some tool out there, some solution that can give you visibility, as well as to do things along the following lines and empowered like the DevOps teams? First and foremost, there are metrics all over the place, that metrics, logs, you name it. And from all these different levels, especially the platform, the application, and all the interactions that happen in between. Metrics along these can be brought into one single platform, and at least have a good, nice correlated view. So again, you can go from the app, to the platform, or vice-versa, depending on the problem.

And what we will try to cover today is with all these advances that are happening in machine learning, how in applying machine learning to some of this data can help you find out problems quicker, [00:18:00] as well as, in some cases, using artificial intelligence, using ability to take automated actions either prevent the problems from happening the first place, or at least if those problems happen gonna to be able to quickly recover and fix it.

And what we will really try to cover in the stack is basically what I’m showing the slide. There’s tons and tons of interesting machine learning algorithms. And the same time, you have all of these different problems that happened with Kafka streaming architectures. How do you connect both worlds? How can you bring based on the goal that you’re actually trying to solve the right algorithm to bear on the problem? And the way we’ll structure that is, again, DevOps teams have lots of different goals. Broadly, there’s the app side of the fence, and there’s the operations side of the fence.

As a person who owns Kafka streaming application, you might have goals related to latency. I need this kind of latency, or this much amount of the throughput, or maybe might be a combination of this along with, “Hey, I can only [00:19:00] tolerate this much amount of data loss,” and all those talks that have happened very much in this room, when the two different aspects of this and how replicas and parameters help you get all of these goals.

On the operation side, maybe your goals are around ensuring that the cluster reliable like a particular loss of a rack doesn’t really cause data loss and things like that, or maybe they are related on the cloud ensuring that you’re getting the right bite performance and all of those things. And on the other side are all of these interesting advances that happened in ML. There are algorithms for detecting outliers, anomalies, or actually doing correlation. So the real focus is going to be like, let’s take some of these goals. Let’s work with the different algorithms, and then seeing how you can get these things to meet. And along the way, we’ll try to describe some our own experiences. What would worked kind of what didn’t work, as well the other things that are worth exploring?

[00:20:00] So let’s start with the very first one, the outlier detection algorithms. And why would you care? There are two very, very simple problems. And if you remember from the earlier in the talk, Nate talked about one of the critical challenges they had, where the problem was exactly this, which is, hey, there could be some imbalance among my brokers. There could be some imbalance and a topic among the partitions, some partition really getting backed up. Very, very common problems that happen. How can you very quickly instead of manually looking at graphs, and trying to stare and figure things out? Apply some automation to it.

Here’s a quick screenshot that lead us to through the problem. If you look at the graph that have highlighted on the bottom, this is a small cluster with three brokers, capital one, two, and three. And one of the brokers is actually having much lower throughput. It could be the other way. One of the having a really high throughput. Can we detect this automatically? This is about having to figure it after the fact. [00:21:00] And there’s a problem where there are tons of algorithms for outlier detection from statistics and now more recently in machine learning.

And there are algorithms and differ based on, do they deal with one parameter at a time, do they deal with multiple parameters of the time, univariate, multivariate, or algorithms that can actually take temporal things into account, algorithms that are more looking at a snapshot in time. The very, very simple technique, which actually works surprisingly well as this score where the main idea is to take…let’s say, you have different brokers, or you have hundred partitions in a topic, you can take any metric. It might be in the bites and metric, and vector Gaussian and distribution to the data.

And anything that is actually a few standard deviations away as an outlier. Very, very simple technique. The problem in this technique is, it does assume that is the distribution, but sometimes may not be the [00:22:00] case. In the case of in a brokers and partitions that is usually a safe assumption to make. But if that technique doesn’t work, there are other techniques. One of the more popular ones that we have had some success with, especially when you’re looking at multiple time series of the same time is the DBSCAN. It’s basically a density-based clustering technique. I’m not going to the all the details, but the key ideas, it basically uses some notion of distance to group points into clusters, and anything that doesn’t fall into clusters and outlier.

Then there are tons of other very interesting techniques using like in a binary trees to find outliers called the isolation forests. And in the deep learning world, there is a lot of interesting work happening with auto encoders, which tried to learn representation of the data. And again, once you’ve learned the representation from all the training data that is available, things don’t fit the representation are outliers.

So this is the template I’m going to [00:23:00] follow in the rest of the talk. Basically, the template, I pick a technique. Next, I’m gonna look at forecasting and give you some example use cases that having a good forecasting technique can help you in that Kafka DevOps world, and then tell you about some of these techniques. So for forecasting, two places where it makes those…having a good technique makes a lot of difference. One is avoiding this reactive firefighting. When you have something like a latency SLA, if you could get a sense, things are backing up, and there’s a chance in a certain amount of time, the SLAs will get missed, then you can actually take action quickly.

It could even be bringing on normal brokers and things like that, or basically doing some partition reassignment and whatnot. But getting heads up this often very, very useful. The other is more long-term forecasting for things like capacity planning. So I’m gonna use actually a real life example here, an example that one of our telecom customers actually worked with [00:24:00] where it was a sentiment analysis, sentiment extraction used case based on tweets. So the overall architecture consisted of tweets coming in real-time like in those SLA Kafka, and then the computation was happening in Spark streaming with some state actually being stored in a database.

So here’s a quick screenshot of how things actually play out there. In sentiment analysis, especially for customer service, customer support really the used cases, there’s some kind of SLA. In this case, their SLA was around three minutes. What that means is, by the time you learn about the particular incident, if you will, which you can think of these tweets coming in Kafka. Within a three-minute interval, data has to be processed. What are you seeing on screen here, all those green bars represent the rate in which data is coming, and the black line in between as the time series indicating the actual delay, [00:25:00] the end-t-end delay, the processing delay between data arrival and data being processed. And there is a SLA of three-minutes.

So if you can see the line, it trending up, and applying…there was a good forecasting technique that could be applied on this line, you can actually forecast and stay within a certain interval of time, maybe it’s a few hours, maybe even less than that. The rate at which this latency is trending up, my SLA can get missed. And it’s a great use case for having a good forecasting technique. So again, forecasting, another area that has been very well-studied. And there are tons of interesting techniques, from the time-series forecasting of the more statistical bend, there are techniques like ARIMA, which stands for autoregressive integrated moving average. And there’s lots of variance around that, which uses the trend and data differences between data elements and patterns you forecast with smoothing and taking in historic data into account, and all of that good stuff.

On the other [00:26:00] extreme, there has been a lot of like I’ve said, advances recently in using neural networks, because time-series data is one thing that is very, very easy to get, too. So there’s this long short-term memory, the LSCM, and recurrent neural networks, which have been pretty good at this, that we have actually had a lot of I would say success is with a technique that was originally it was something that Facebook released this open source called the Prophet Algorithm, which is not very different from the ARIMA and the older family of forecasting techniques. It defaults in some subtle ways.

The main idea here was what is called a generative additive model. I put in a very simple example here. So the idea is to model this time-series, whichever time-series you are picking as a combination of the trend in the time-series data and extract all the trend. The seasonality, maybe there is a yearly seasonality, maybe there’s monthly seasonality, weekly [00:27:00] seasonality, maybe even daily seasonality. This is a key thing. I’ve used the term shocky [SP]. So if you’re thinking about forecasting and ecommerce setting, like Black Friday or the Christmas days, these are actually times when the time-series will have a very different behavior.

So in Kafka or in the operational context, if you are rebooting, if you are installing a new patch or upgrading, this often end up shifting the patterns of the time-series, and they have to be explicitly modeled. Otherwise, the forecasting can go wrong, and then the rest is error. The recent Prophet has actually worked really well for us apart from the ones I mentioned. It fits quickly and all of that good stuff, but it is very, I would say, customizable, where your domain knowledge about the problem can be incorporated, instead of if it’s something that gives you a result, and then all that remains is parameter tuning.

So Prophet is something I’ll definitely ask all of you [00:28:00] to take a look. And defaults work relatively well, but forecasting is something where we have seen that. You can just trust the machine alone. It needs some guidance, it needs some data science, it needs some domain knowledge to be put along with the machine learning to actually get good forecast. So that’s the forecasting. So we saw outlier detection how that applies, we saw forecasting. Now let’s get into something even more interesting. Anomalies, detecting anomalies.

So a place where anomaly detection, and possible was an anomaly, you can think of an anomaly as unexpected change, something that if you were to expect it, then it’s not an anomaly, something unexpected that needs your attention. That’s what I’m gonna characterize as an anomaly. Where can it help? Actually smart alerts, alerts where you don’t have to configure threshold and all of that things, and worry about your workload changing or new upgrades happening, and all of that stuff, wouldn’t it be great if these anomalies can [00:29:00] be auto-detect it. But that’s also very challenging. By no means, it’s a trivial because if you’re not careful, then your smart alerts will turn out to be really dump. And you might get a lot of false alerts. And that way, you lose confidence, or it might miss a real problems that are happening.

So, I don’t know, detection is something which is pretty hard to get right in practice. And here’s one simple, but one very illustrative example. With Kafka you always see no lags. So here, what I’m plotting here is increasing lag. Is that really an expected one? Maybe there could be both in data arrival, and maybe these lags might have built up, like at some point of time every day, maybe it’s okay. When it is an anomaly that I really need to take a look at. So that’s when these anomaly detection techniques become very important.

Many [00:30:00] different schools have thought excess on how to build a good anomaly detection technique, including the ones I talked about earlier with outlier detection. One approach that has worked well in the past for us is, when you can have really modern anomalies as I’m forecasting something based on what I know. And if what I see the first one to forecasts, then that’s an anomaly. So you can pick your favorite forecasting technique, or the one that actually worked, ARIMA, or Prophet, or whatnot, use that technique to do the forecasting, and then deviations become interesting and anomalous.

Whatever sitting here is a simple example of the technique and for Prophet, or more common one that we have seen actually does work relatively well, this thing called STL. It stands for Seasonal Trend Decomposition using a smoothing function called LOWESS. So you have the time series, extract out and separate out the trend from it first. So without the trend, so that leaves [00:31:00] the time series without the trend and then extract out all the seasonalities. And once you have done that, whatever reminder or residual has called, even if you put some, like a threshold on that, it’s reasonably good. I wouldn’t say it’s perfect but reasonably good at extracting out these anomalies.

Next one, correlation analysis, getting even more complex. So once basically you have detect an anomaly. Or you have a problem that you want the root cause, why did it happen? What do you need to do to fix it. Here’s a great example. I saw my anomaly something shot up, maybe there’s the lag, that is actually building up, it’s definitely looks anomalous. And now what do you really want us…okay, maybe we address where the baseline is much higher. But can I root cause it? Can I pinpoint what is causing that? Is it just the fact that there was a burst of data or something else, maybe resource allocation issues, maybe some hot spotting and the [00:32:00] brokers?

And here, you can start to apply time-series correlation, which time series your lower level correlates best with the higher level time-series where your application latency increased? Challenge here is, there are hundreds, if not thousands of times-series if you look at Kafka, with every broker has so many kind of time-series it can give you from every level. And it quickly all of these adds up. So here, it’s a pretty hard problem. So if you just throw a time-series correlation techniques, even time-series which just have some trend in there they look correlated. So you have to be very careful.

The things to keep in mind are things like pick a good similarity function across time-series. For example using something like Euclidean, which is a straight up well-defined, well-understood distance function between points or between time series. We have had a lot of success with something called dynamic time warping, which is very good to deal with time-series, which might be slightly out of [00:33:00] sync. If you remember all the things that need mentioned, you just gone that Kafka world and streaming world in asynchronous, even world, you just see him all the time-series and nicely synchronized. So time warping is a good technique to extract distances in such a context.

But by far, the other part is just like you saw with Prophet. You have to really instead of just throwing some machine learning technique and praying that it works, you have to really try to understand the problem. And the way in which we have tried to break it down into something usable is, for a lot of these time-series, you can split the space into time-series that are related to the application performance, time-series are related to resource and contention, and then apply correlation within these bucket. So try to scope the problem.

Last technique, model learning. And this turns out to be the hardest. But if you have a good modeling technique, good model that can answer what-if questions, then things like what we saw in the previous talk with [00:34:00] all of those reassignment and whatnot. You can actually find out what’s the good reassignment and what impact will that have. Or, as Nate mentioned, this rebalanced-type, consumer times, timeouts, and rebalancing storms can actually kick in, “What’s the good timeout?”

So a lot of these places where you have to pick some threshold or some strategy, but having a model that can quickly do what-if avoid is better, or can even rank them can be very, very useful and powerful. And this is the thing that is needed for enabling action automatically. Here’s a quick example. There’s a lag. We can figure out where the problem actually is happening. But then wouldn’t be great if something is suggesting how to fix that? Increase the number of partitions from X to Y,. And that will fix the problem.

The modeling at the end of the day is function that you’re fitting to the data. And in modeling, I don’t have time to actually go into this in great detail because of time. You carefully pick the right [00:35:00] input features. And what is very, very important is to ensure that you have the right training data. For some problems, just collecting data from the production cluster is good trading data. Sometimes, it is not because you have only seen certain regions of the space. So with that, I’m actually going to wrap up.

So what we try to do in the talk is to really give you a sense, as much as we can do in a 40-minute talk of having all of these interesting Kafka DevOps challenges, so meaning application challenges, and how to map that to something where you can use machine learning or some elements of AI to make your life easier, at least guide you so that you’re not wasting a lot of time trying to look for a needle in a haystack. And with that, I’m going to wrap up. There is this area called AIOps, which is very interesting and trying to bring AI and ML with data to solve these DevOps challenges. We have a booth here. Please, drop by to see some of the techniques we have. And yes, if you’re interested in working on this interesting challenges, streaming, building [00:36:00] these applications, or applying techniques like this, we are hiring. Thank you.

Okay. Thanks, guys. So we actually have about three or four minutes for any question, mini Q&A.

Attendee 1:

I know nothing about machine learning. But I can see it helping me really help with debugging problems. As a noob, how can I get started?

Shivnath:

I can take that question. So the question was, so the whole point of this talk, and gets in that question, which is, again, on one hand, if you go to a Spark Summit, or Spark, you’ll see a lot of machine learning algorithms, and this and that, right? If we come to a Kafka Summit, then it’s all about Kafka and DevOps challenges. How do these worlds meet? That’s exactly the thing that we tried to cover in the talk. If you were listening looking at a lot of the techniques, they’re not fancy machine learning techniques. So our experience has been that once you understand the use case, then there are fairly good techniques from even statistics that can solve the problem reasonably well.

And once you have first startup experience, then look for better and better techniques. So hopefully, this talk gives you a sense of the techniques just get started with, and you get a better understanding of the problem and the [00:38:00] data, and you can actually improve and apply more of those deep learning techniques and whatnot. But most of the time, you don’t need that for these problems. Nate?

Nate:

No. Absolutely same thing.

Attendee 2:

Hi, Shiv, nice talk. Thank you. Quick question is for those machine learning algorithms, can they be applied cross domain? Or if you are moving from DevOps of Kafka to Spark streaming, for example, do you have to hand-picking those algorithms and tuning the parameters again?

Shivnath:

So the question is, how much of these algorithms just apply to something we’ve found for Kafka? Does it apply for Spark streaming? Would that apply for high performance Impala, or Redshift for that matter? So again, the real hard truth is that no one size fits all. So you can’t just have…I mentioned outlier detection, that might be a technique that can be applied to any [00:39:00] load imbalance problem. But then that’s start getting into anomalies or correlation, Some amount of expert knowledge about the system has to be combined with the machine learning technique to really get good results.

So again, it’s not as if it has to be all export rules, but some combination. So if you pick a financial space, a data scientist who exists, understands the domain, and knows how to work with data. Just like that even in the DevOps world, the biggest success will come if somebody understand receiver system, and has the knack of working with these algorithms reasonably well, and they can combine both of that. So something like a performance data scientist.

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

Unravel Introduces the First AI Powered DataOps Solution for the Cloud

George Demarest — Tue, 26 Mar 2019 21:14:09 +0000

It’s indisputable, new data-driven applications are moving to, or starting life running in the cloud. The increasing automation and resilience of current cloud infrastructure is an ideal environment for running modern data pipelines. For many companies and institutions, their cloud first strategy is becoming a cloud only strategy.

Native online business’ such as Netflix as well as mainstream Enterprises such as Capital One have multi $Billion valuations and almost no physical data centers. Public cloud providers will account for over 60% of all capital expenditures on cloud infrastructure – disks, CPUs, network switches and the like. Given this momentum, there is increased pressure on IT teams to prove that they are getting the most of their cloud and big data investments.

Against this backdrop Unravel Introduces the Industry’s First AI-Powered Cloud Platform Operations and Workload Migration Solution for Data Applications, delivering new AI-powered and Unified performance optimization for planning, migrating, and managing modern data applications on the AWS, Azure and Google Cloud Platforms.

Some of the new capabilities that IT teams will gain from this latest release include:

Unified management of the full modern data stack on all deployment platforms – Unravel Cloud Migration covers AWS, Azure and Google clouds, as well as on-premises, hybrid environments and multi-cloud settings. Customers get AI-powered troubleshooting, auto-tuning and automated remediation of failures and slowdowns with the same user interface.

Full stack visibility – Unravel uses automation to provide detailed reports and metrics on app usage, performance, cost and chargebacks in the cloud.

Recommendations for the best apps to migrate – Unravel baselines on-premises performance of the full modern data stack and uses AI to identify the best app candidates for migration to cloud. Organizations can avoid migrating apps that aren’t ideal for the cloud and having to repatriate them later.

Mapping on-premises infrastructure to cloud server instances – Unravel helps customers choose cloud instance types for their migration based on three strategies:

Lift and shift – A one-to-one mapping from physical to virtual servers ensures that a cloud deployment will have the same (or more) resources available. This minimizes any risks associated with migrating to the cloud.
Cost reduction – Provides the most cost-effective instance recommendations based on detailed dependency understanding for minimizing wasted capacity and over provisioning.
Workload fit – Takes into account data collected over time from the on-premises environment, making recommendations for instance types based on the actual workload of applications running in a data center. These recommendations will be based on the VCore, memory, and storage requirements of a customer’s typical runtime environment.

Cloud capacity planning and chargeback reporting – Unravel can predict cloud storage requirements up to six months out and can provide a detailed accounting of resource consumption and chargeback by user, department or other criteria.

Migration validation – Unravel can provide a before and after assessment of cloud applications by comparing on-premises performance and resource consumption to the same metrics in the cloud, thereby validating the relative success of the migration.

All indications point to a massive shift in data deployments to the cloud, but there are too many unknowns around cost, visibility and migration that have prevented this transition to the cloud from occurring more quickly.

We are incredibly proud of this latest release and the value we believe it can deliver as organizations either begin their cloud journey for their modern data applications or look to optimize performance and cost efficiencies for those data workloads already operating in the cloud.

The post Unravel Introduces the First AI Powered DataOps Solution for the Cloud appeared first on Unravel.

AI-Powered Data Operations for Modern Data Applications

Unravel Data — Thu, 14 Feb 2019 23:11:41 +0000

Today, more than 10,000 enterprise businesses worldwide use a complex stack composed of a combination of distributed systems like Spark, Kafka, Hadoop, NoSQL databases, and SQL access technologies.

At Unravel, we have worked with many of these businesses across all major industries. These customers are deploying modern data applications in their data centers, in private cloud deployments, in public cloud deployments, and in hybrid combinations of these.

This paper addresses the requirements that arise in driving reliable performance in these complex environments. We provide an overview of these requirements both at the level of individual applications as well as in holistic clusters and workloads. We also present a platform that can deliver automated solutions to address these requirements as well as taking a deeper dive into a few of these solutions.

The post AI-Powered Data Operations for Modern Data Applications appeared first on Unravel.