Troubleshooting & DataOps Archives - Unravel
https://www.unraveldata.com/resources/troubleshooting/

Data Observability: The Missing Link for Data Teams
https://www.unraveldata.com/resources/dataops-observability-the-missing-link-for-data-teams/

As organizations invest ever more heavily in modernizing their data stacks, data teams—the people who actually deliver the value of data to the business—are finding it increasingly difficult to manage the performance, cost, and quality of these complex systems.

Data teams today find themselves in much the same boat as software teams were 10+ years ago. Software teams have dug themselves out of that hole with DevOps best practices and tools—chief among them full-stack observability.

But observability for data applications is a completely different animal from observability for web applications. Effective DataOps demands observability that’s designed specifically for the different challenges facing different data team members throughout the DataOps lifecycle.

That’s data observability for DataOps.

In this white paper, you’ll learn:

  • What is data observability for DataOps?
  • What data teams can learn from DevOps
  • Why observability designed for web apps is insufficient for data apps
  • What data teams need from data observability for DataOps
  • The 5 capabilities of data observability for DataOps

Download the white paper here.

The Biggest Spark Troubleshooting Challenges in 2024
https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/

Spark has become one of the most important tools for processing data – especially non-relational data – and deriving value from it. And Spark serves as a platform for the creation and delivery of analytics, AI, and machine learning applications, among others. But troubleshooting Spark applications is hard – and we’re here to help.

In this blog post, we’ll describe ten challenges that arise frequently in troubleshooting Spark applications. We’ll start with issues at the job level, encountered by most people on the data team – operations people/administrators, data engineers, and data scientists, as well as analysts. Then, we’ll look at problems that apply across a cluster. These problems are usually handled by operations people/administrators and data engineers.

For more on Spark and its use, please see this piece in Infoworld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below.

Five Reasons Why Troubleshooting Spark Applications is Hard

Some of the things that make Spark great also make it hard to troubleshoot. Here are some key Spark features, and some of the issues that arise in relation to them:

Memory-resident

Spark gets much of its speed and power by using memory, rather than disk, for interim storage of source data and results. However, this can cost a lot of resources and money, which is especially visible in the cloud. It can also make it easy for jobs to crash due to lack of sufficient available memory. And it makes problems hard to diagnose – only traces written to disk survive after crashes.

Parallel processing

Spark takes your job and applies it, in parallel, to all the data partitions assigned to your job. (You specify the data partitions, another tough and important decision.) But when a processing workstream runs into trouble, it can be hard to find and understand the problem among the multiple workstreams running at once.

Variants

Spark is open source, so it can be tweaked and revised in innumerable ways. There are major differences between the Spark 1 series, Spark 2.x, and the newer Spark 3. And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks, which is available across the major public clouds. Each variant offers some of its own challenges and a somewhat different set of tools for solving them.

Configuration options

Spark has hundreds of configuration options. And Spark interacts with the hardware and software environment it’s running in, each component of which has its own configuration options. Getting one or two critical settings right is hard; when several related settings have to be correct, guesswork becomes the norm, and over-allocation of resources, especially memory and CPUs (see below), becomes the safe strategy.

Trial and error approach

With so many configuration options, how to optimize? Well, if a job currently takes six hours, you can change one, or a few, options, and run it again. That takes six hours, plus or minus. Repeat this three or four times, and it’s the end of the week. You may have improved the configuration, but you probably won’t have exhausted the possibilities as to what the best settings are.

Sparkitecture diagram – the Spark application is the Driver Process, and the job is split up across executors. (Source: Apache Spark for the Impatient on DZone.)

Three Issues with Spark Jobs, On-Premises and in the Cloud

Spark jobs can require troubleshooting against three main kinds of issues:

  1. Failure. Spark jobs can simply fail. Sometimes a job will fail on one try, then work again after a restart. Just finding out that the job failed can be hard; finding out why can be harder. (Since the job is memory-resident, failure makes the evidence disappear.)
  2. Poor performance. A Spark job can run slower than you would like it to; slower than an external service level agreement (SLA); or slower than it would do if it were optimized. It’s very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster.
  3. Excessive cost or resource use. The resource use or, especially in the cloud, the hard dollar cost of a job may raise concerns. As with performance, it’s hard to know how much the resource use and cost “should” be, until you put work into optimizing and see where you’ve gotten to.

All of the issues and challenges described here apply to Spark across all platforms, whether it’s running on-premises, in Amazon EMR, or on Databricks (across AWS, Azure, or GCP). However, there are a few subtle differences:

  • Move to cloud. There is a big movement of big data workloads from on-premises (largely running Spark on Hadoop) to the cloud (largely running Spark on Amazon EMR or Databricks). Moving to cloud provides greater flexibility and faster time to market, as well as access to built-in services found on each platform.
  • Move to on-premises. There is a small movement of workloads from the cloud back to on-premises environments. When a cloud workload “settles down,” such that flexibility is less important, then it may become significantly cheaper to run it on-premises instead.
  • On-premises concerns. Resources (and costs) on-premises tend to be relatively fixed; there can be a lead time of months to years to significantly expand on-premises resources. So the main concern on-premises is maximizing the existing estate: making more jobs run in existing resources, and getting jobs to complete reliably and on time, to maximize the pay-off from the existing estate.
  • Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability but at a greater cost.
  • On-premises Spark vs Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks.
  • On-premises Spark vs Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud.

All of these concerns are accompanied by a distinct lack of needed information. Companies often make crucial decisions – on-premises vs. cloud, EMR vs. Databricks, “lift and shift” vs. refactoring – with only guesses available as to what different options will cost in time, resources, and money.

The Biggest Spark Troubleshooting Challenges in 2024

Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast.

Some challenges occur at the job level; these challenges are shared right across the data team. They include:

  1. How many executors should each job use?
  2. How much memory should I allocate for each job?
  3. How do I find and eliminate data skew?
  4. How do I make my pipelines work better?
  5. How do I know if a specific job is optimized?

Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

  1. How do I size my nodes, and match them to the right servers/instance types?
  2. How do I see what’s going on across the Spark stack and apps?
  3. Is my data partitioned correctly for my SQL queries?
  4. When do I take advantage of auto-scaling?
  5. How do I get insights into jobs that have problems?

Section 1: Five Job-Level Challenges

These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level.

However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing.

For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs.

1. How many executors and cores should a job use?

One of the key advantages of Spark is parallelization – you run your job’s code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

A Spark job uses three cores to parallelize output. Up to three tasks run simultaneously, and seven tasks are completed in a fixed period of time. (Source: Lisa Hua, Spark Overview, Slideshare.)

You want high usage of cores, high usage of memory per core, and data partitioning appropriate to the job. (Usually, partitioning on the field or fields you’re querying on.) This beginner’s guide for Hadoop suggests two to three cores per executor, but not more than five; this expert’s guide to Spark tuning on AWS suggests that you use three executors per node, with five cores per executor, as your starting point for all jobs. (!)

You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and the experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone or optimize it. We recommend that you optimize it, because optimization:

  • Helps you save resources and money (not over-allocating)
  • Helps prevent crashes, because you right-size the resources (not under-allocating)
  • Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better
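
To make these starting points concrete, here is a minimal sketch of how such settings might be passed when a Spark session is created. The values are illustrative assumptions based on the rules of thumb above, not recommendations for any particular cluster:

    from pyspark.sql import SparkSession

    # Illustrative starting point only: 5 cores per executor, with the executor
    # count and memory chosen to fit the nodes you actually have.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-sketch")
        .config("spark.executor.instances", "6")  # e.g., 2 nodes x 3 executors per node
        .config("spark.executor.cores", "5")      # parallel tasks per executor
        .config("spark.executor.memory", "34g")   # see the memory discussion below
        .getOrCreate()
    )

With these settings, up to 6 × 5 = 30 tasks can run at once; whether that is right depends entirely on your data partitioning and your node sizes.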

2. How much memory should I allocate for each job?

Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs.

If you have three executors in a 128GB cluster, and 16GB is taken up by the cluster, that leaves 37GB per executor. However, a few GB will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory.
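
As a back-of-the-envelope sketch of that arithmetic (the node size, reserved memory, and overhead heuristic are illustrative assumptions, not fixed rules):

    node_memory_gb = 128          # total memory on the node
    system_reserved_gb = 16       # cluster manager and system resources
    executors_per_node = 3

    per_executor_gb = (node_memory_gb - system_reserved_gb) / executors_per_node
    overhead_gb = 0.10 * per_executor_gb          # a common ~10% overhead heuristic
    heap_gb = per_executor_gb - overhead_gb       # roughly what spark.executor.memory gets

    print(round(per_executor_gb, 1), round(heap_gb, 1))  # 37.3 33.6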

This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see the previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises.

For more on memory management, see this widely read article, Spark Memory Management, by our own Rishitesh Mishra.

3. How do I handle data skew and small files?

Data skew and small files are complementary problems. Data skew tends to describe large files – where one key-value, or a few, have a large share of the total data associated with them. This can force Spark, as it’s processing the data, to move data around in the cluster, which can slow down your task, cause low utilization of CPU capacity, and cause out-of-memory errors which abort your job. Several techniques for handling very large files which appear as a result of data skew are given in the popular article, Data Skew and Garbage Collection, by Rishitesh Mishra of Unravel.

Small files are partly the other end of data skew – a share of partitions will tend to be small. And Spark, since it is a parallel processing system, may generate many small files from parallel processes. Also, some processes you use, such as file compression, may cause a large number of small files to appear, causing inefficiencies. You may need to reduce parallelism (undercutting one of the advantages of Spark), repartition (an expensive operation you should minimize), or start adjusting your parameters, your data, or both (see details here).
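
As a rough illustration of both ideas, the sketch below salts a skewed join key so that a hot key is spread across several tasks, then compacts output to avoid a flood of small files. The DataFrame names, join key, bucket count, and output path are all hypothetical:

    from pyspark.sql import functions as F

    SALT_BUCKETS = 16  # pick based on how concentrated the hot keys are

    # Spread rows for a hot key across SALT_BUCKETS sub-keys before the join.
    salted_fact = fact_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    salted_dim = dim_df.withColumn(
        "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)]))
    )
    joined = salted_fact.join(salted_dim, ["customer_id", "salt"])

    # Compact the output so the write does not produce thousands of tiny files.
    joined.coalesce(64).write.mode("overwrite").parquet("/data/joined")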

Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets.

4. How do I optimize at the pipeline level?

Spark pipelines are made up of DataFrames, connected by Transformers (which calculate new data from existing data) and Estimators. Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it’s easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs.

Existing Transformers create new Dataframes, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.)

Many pipeline components are “tried and trusted” individually, and are thereby less likely to cause problems than new components you create yourself. However, interactions between pipeline steps can cause novel problems.

Just as job issues roll up to the cluster level, they also roll up to the pipeline level. Pipelines are increasingly the unit of work for DataOps, but it takes truly deep knowledge of your jobs and your cluster(s) for you to work effectively at the pipeline level. This article, which tackles the issues involved in some depth, describes pipeline debugging as an “art.”

5. How do I know if a specific job is optimized?

Neither Spark nor, for that matter, SQL is designed for ease of optimization. Spark comes with a monitoring and management interface, Spark UI, which can help. But Spark UI can be challenging to use, especially for the types of comparisons – over time, across jobs, and across a large, busy cluster – that you need to really optimize a job. And there is no “SQL UI” that specifically tells you how to optimize your SQL queries.

There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.

Spark’s Catalyst optimizer, described here, does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side.
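
One low-cost habit that helps here is asking Catalyst to show you its plan before you spend cluster time. A minimal sketch follows; the table and column names are assumptions, and the mode argument requires Spark 3.x:

    events = spark.table("events").filter("event_date = '2021-08-01'")
    joined = events.join(spark.table("users"), "user_id")

    # Prints the parsed, analyzed, optimized, and physical plans, so you can see
    # whether you are getting a broadcast join, a sort-merge join, pushed-down
    # filters, and so on.
    joined.explain(mode="formatted")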

Section 2: Cluster-Level Challenges

Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands) of jobs, in cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources.

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that’s running unoptimized, poorly understood, slowdown-prone, and crash-prone jobs is impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.)

Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for a blog post, but here are some of the issues that come up, and a few comments on each:

6. Are Nodes Matched Up to Servers or Cloud Instances?

A Spark node – a physical server or a cloud instance – will have an allocation of CPUs and physical memory. (The whole point of Spark is to run things in actual memory, so this is crucial.) You have to fit your executors and memory allocations into nodes that are carefully matched to existing resources, on-premises, or in the cloud. (You can allocate more or fewer Spark cores than there are available CPUs, but matching them makes things more predictable, uses resources better, and may make troubleshooting easier.)

On-premises, poor matching between nodes, physical servers, executors, and memory results in inefficiencies, but these may not be very visible; as long as the total physical resource is sufficient for the jobs running, there’s no obvious problem. However, issues like this can cause data centers to be very poorly utilized, meaning there’s big overspending going on – it’s just not noticed. (Ironically, the impending prospect of cloud migration may cause an organization to freeze on-prem spending, shining a spotlight on costs and efficiency.)

In the cloud, “pay as you go” pricing shines a different type of spotlight on efficient use of resources – inefficiency shows up in each month’s bill. You need to match nodes, cloud instances, and job CPU and memory allocations very closely indeed, or incur what might amount to massive overspending. This article gives you some guidelines for running Apache Spark cost-effectively on AWS EC2 instances and is worth a read even if you’re running on-premises, or on a different cloud provider.

You still have big problems here. In the cloud, with costs both visible and variable, cost allocation is a big issue. It’s hard to know who’s spending what, let alone what business results go with each unit of spending. But tuning workloads against server resources and/or instances is the first step in gaining control of your spending, across all your data estates.

7. How Do I See What’s Going on in My Cluster?

“Spark is notoriously difficult to tune and maintain,” according to an article in The New Stack. Clusters need to be “expertly managed” to perform well, or all the good characteristics of Spark can come crashing down in a heap of frustration and high costs. (In people’s time and in business losses, as well as direct, hard dollar costs.)

Key Spark advantages include accessibility to a wide range of users and the ability to run in memory. But the most popular tool for Spark monitoring and management, Spark UI, doesn’t really help much at the cluster level. You can’t, for instance, easily tell which jobs consume the most resources over time. So it’s hard to know where to focus your optimization efforts. And Spark UI doesn’t support more advanced functionality – such as comparing the current job run to previous runs, issuing warnings, or making recommendations, for example.

Logs on cloud clusters are lost when a cluster is terminated, so problems that occur in short-running clusters can be that much harder to debug. More generally, managing log files is itself a big data management and data accessibility issue, making debugging and governance harder. This occurs in both on-premises and cloud environments. And, when workloads are moved to the cloud, you no longer have a fixed-cost data estate, nor the “tribal knowledge” accrued from years of running a gradually changing set of workloads on-premises. Instead, you have new technologies and pay-as-you-go billing. So cluster-level management, hard as it is, becomes critical.

8. Is my data partitioned correctly for my SQL queries? (and other inefficiencies)

Operators can get quite upset, and rightly so, over “bad” or “rogue” queries that can cost way more, in resources or cost, than they need to. One colleague describes a team he worked on that went through more than $100,000 of cloud costs in a weekend of crash-testing a new application – a discovery made after the fact. (But before the job was put into production, where it would have really run up some bills.)

SQL is not designed to tell you how much a query is likely to cost, and more elegant-looking SQL queries (i.e., fewer statements) may well be more expensive. The same is true of all kinds of code you have running.

So you have to do some or all of three things:

  • Learn something about SQL, and about coding languages you use, especially how they work at runtime
  • Understand how to optimize your code and partition your data for good price/performance
  • Experiment with your app to understand where the resource use/cost “hot spots” are, and reduce them where possible

All this fits in the “optimize” recommendations from challenges 1 and 2 above. We’ll talk more about how to carry out optimization in Part 2 of this blog post series.
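
As a small illustration of the partitioning point, compare two hypothetical queries against a table partitioned by sale_date. Only the second lets Spark prune partitions, because the filter is on the partition column itself rather than on an expression derived from it; the table and column names are assumptions:

    # Scans every partition when executed: the predicate wraps the partition
    # column in a function, so partition pruning cannot be applied.
    spark.sql("SELECT user_id, amount FROM sales WHERE year(sale_date) = 2021")

    # Prunes partitions: the predicate is on the raw partition column.
    spark.sql("""
        SELECT user_id, amount
        FROM sales
        WHERE sale_date BETWEEN '2021-01-01' AND '2021-12-31'
    """)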

9. When do I take advantage of auto-scaling?

The ability to auto-scale – to assign resources to a job just while it’s running, or to increase resources smoothly to meet processing peaks – is one of the most enticing features of the cloud. It’s also one of the most dangerous; there is no practical limit to how much you can spend. You need some form of guardrails, and some form of alerting, to remove the risk of truly gigantic bills.

The need for auto-scaling might, for instance, determine whether you move a given workload to the cloud, or leave it running, unchanged, in your on-premises data center. But to help an application benefit from auto-scaling, you have to profile it, then cause resources to be allocated and de-allocated to match the peaks and valleys. And you have some calculations to make, because cloud providers charge you more for on-demand resources – those you grab and let go of, as needed – than for reserved or committed-use resources that you keep running for a long time. On-demand resources may cost two or three times as much as their reserved equivalents.

The first step, as you might have guessed, is to optimize your application, as in the previous sections. Auto-scaling is a price/performance optimization, and a potentially resource-intensive one. You should do other optimizations first.

Then profile your optimized application. You need to calculate ongoing and peak memory and processor usage, figure out how long you need each, and the resource needs and cost for each state. And then decide whether it’s worth auto-scaling the job, whenever it runs, and how to do that. You may also need to find quiet times on a cluster to run some jobs, so the job’s peaks don’t overwhelm the cluster’s resources.

To help, Databricks has two types of clusters, and the second type works well with auto-scaling. Most jobs start out in an interactive cluster, which is like an on-premises cluster; multiple people use a set of shared resources. It is, by definition, very difficult to avoid seriously underusing the capacity of an interactive cluster.

So you are meant to move each of your repeated, resource-intensive, and well-understood jobs off to its own, dedicated, job-specific cluster. A job-specific cluster spins up, runs its job, and spins down. This is a form of auto-scaling already, and you can also scale the cluster’s resources to match job peaks, if appropriate. But note that you want your application profiled and optimized before moving it to a job-specific cluster.

10. How Do I Find and Fix Problems?

Just as it’s hard to fix an individual Spark job, there’s no easy way to know where to look for problems across a Spark cluster. And once you do find a problem, there’s very little guidance on how to fix it. Is the problem with the job itself, or the environment it’s running in? For instance, over-allocating memory or CPUs for some Spark jobs can starve others. In the cloud, the noisy neighbors problem can slow down a Spark job run to the extent that it causes business problems on one outing – but leaves the same job to finish in good time on the next run.

The better you handle the other challenges listed in this blog post, the fewer problems you’ll have, but it’s still very hard to know how to most productively spend Spark operations time. For instance, a slow Spark job on one run may be worth fixing in its own right and may be warning you of crashes on future runs. But it’s very hard just to see what the trend is for a Spark job in performance, let alone to get some idea of what the job is accomplishing vs. its resource use and average time to complete. So Spark troubleshooting ends up being reactive, with all too many furry, blind little heads popping up for operators to play Whack-a-Mole with.

Impacts of these Challenges

If you meet the above challenges effectively, you’ll use your resources efficiently and cost-effectively. However, our observation here at Unravel Data is that most Spark clusters are not run efficiently.

What we tend to see most are the following problems – at a job level, within a cluster, or across all clusters:

  • Under-allocation. It can be tricky to allocate your resources efficiently on your cluster, partition your datasets effectively, and determine the right level of resources for each job. If you under-allocate (either for a job’s driver or the executors), a job is likely to run too slowly or crash. As a result, many developers and operators resort to…
  • Over-allocation. If you assign too many resources to your job, you’re wasting resources (on-premises) or money (cloud). We hear about jobs that need, for example, 2GB of memory but are allocated much more – in one case, 85GB.

Applications can run slowly, because they’re under-allocated – or because some apps are over-allocated, causing others to run slowly. Data teams then spend much of their time fire-fighting issues that may come and go, depending on the particular combination of jobs running that day. With every level of resource in shortage, new, business-critical apps are held up, so the cash needed to invest against these problems doesn’t show up. IT becomes an organizational headache, rather than a source of business capability.

Conclusion

To jump ahead to the end of this series a bit, our customers here at Unravel are easily able to spot and fix over-allocation and inefficiencies. They can then monitor their jobs in production, finding and fixing issues as they arise. Developers even get on board, checking their jobs before moving them to production, then teaming up with Operations to keep them tuned and humming.

One Unravel customer, Mastercard, has been able to reduce usage of their clusters by roughly half, even as data sizes and application density have moved steadily upward during the global pandemic. And everyone gets along better, and has more fun at work, while achieving these previously unimagined results.

So, whether you choose to use Unravel or not, develop a culture of right-sizing and efficiency in your work with Spark. It will seem to be a hassle at first, but your team will become much stronger, and you’ll enjoy your work life more, as a result.

You need a sort of X-ray of your Spark jobs, better cluster-level monitoring, environment information, and to correlate all of these sources into recommendations. In Troubleshooting Spark Applications, Part 2: Solutions, we will describe the most widely used tools for Spark troubleshooting – including the Spark Web UI and our own offering, Unravel Data – and how to assemble and correlate the information you need.

Download The Unravel Guide to DataOps
https://www.unraveldata.com/resources/unravel-guide-to-dataops/

Read The Unravel Guide to DataOps

The Unravel Guide to DataOps gives you the information and tools you need to sharpen your DataOps practices and increase data application performance, reduce costs, and remove operational headaches. The ten-step DataOps lifecycle shows you everything you need to do to create data products that run and run.

The Guide introduces Unravel Data software as a powerful platform, helping you to create a robust and effective DataOps culture within your organization. Three customer use cases show you the power of DataOps, and Unravel, in the hands of your data teams. As one Unravel customer says in the Guide, “I’m sleeping a lot easier than I did a year ago.”

Download the Guide to help you as you master the current state, and future possibilities, of DataOps adoption in organizations worldwide.

Why Memory Management is Causing Your Spark Apps To Be Slow or Fail
https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/

Spark applications are easy to write and easy to understand when everything goes according to plan. However, it becomes very difficult when Spark applications start to slow down or fail. (See our blog Spark Troubleshooting, Part 1 – Ten Biggest Challenges.) Sometimes a well-tuned application might fail due to a data change, or a data layout change. Sometimes an application that was running well so far, starts behaving badly due to resource starvation. The list goes on and on.

It’s important to understand not only a Spark application itself, but also its underlying runtime components – disk usage, network usage, contention, and so on – so that we can make informed decisions when things go bad.

In this series of articles, I aim to capture some of the most common reasons why a Spark application fails or slows down. The first and most common is memory management.

If we were to get all Spark developers to vote, out-of-memory (OOM) conditions would surely be the number one problem everyone has faced.  This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:

  • Incorrect usage of Spark
  • High concurrency
  • Inefficient queries
  • Incorrect configuration

To avoid these problems, we need to have a basic understanding of Spark and our data. There are certain things that can be done that will either prevent OOM or rectify an application which failed due to OOM.  Spark’s default configuration may or may not be sufficient or accurate for your applications. Sometimes even a well-tuned application may fail due to OOM as the underlying data has changed.

Out of memory issues can be observed for the driver node, executor nodes, and sometimes even for the node manager. Let’s take a look at each case.

Out of memory at the driver level

A driver in Spark is the JVM where the application’s main control flow runs. More often than not, the driver fails with an OutOfMemory error due to incorrect usage of Spark. Spark is an engine to distribute workload among worker machines. The driver should only be considered as an orchestrator. In typical deployments, a driver is provisioned less memory than executors. Hence we should be careful what we are doing on the driver.

Common causes which result in driver OOM are:

1. rdd.collect()
2. sparkContext.broadcast
3. Driver memory configured too low for the application’s requirements
4. Misconfiguration of spark.sql.autoBroadcastJoinThreshold. Spark uses this limit to broadcast a relation to all the nodes in case of a join operation. At the very first usage, the whole relation is materialized at the driver node. Sometimes multiple tables are also broadcasted as part of the query execution.

Try to write your application in such a way that you can avoid all explicit result collection at the driver. You can very well delegate this work to the executors. E.g., if you want to save the results to a particular file, rather than collecting them at the driver, you can have the executors write them out for you, as sketched below.
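
A minimal sketch of the difference (the DataFrame and path names are hypothetical):

    # Anti-pattern: pulls the entire result into driver memory.
    # rows = result_df.collect()

    # Better: the executors write the result in parallel, so nothing large
    # ever needs to fit on the driver.
    result_df.write.mode("overwrite").parquet("hdfs:///warehouse/results")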

If you are using Spark SQL and the driver is OOM due to broadcasting relations, then you can either increase the driver memory, if possible, or reduce the “spark.sql.autoBroadcastJoinThreshold” value so that your join operations will use the more memory-friendly sort-merge join.
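
For example, a hedged sketch of dialing the threshold down, or disabling auto-broadcast entirely with -1; the values shown are illustrative:

    # Broadcast only relations smaller than ~10 MB (the Spark default);
    # anything larger falls back to a sort-merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

    # Or disable automatic broadcasting altogether:
    # spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)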

Out of memory at the executor level

This is a very common issue with Spark applications which may be due to various reasons. Some of the most common reasons are high concurrency, inefficient queries, and incorrect configuration. Let’s look at each in turn.

High concurrency

Before understanding why high concurrency might be a cause of OOM, let’s try to understand how Spark executes a query or job and what are the components that contribute to memory consumption.

Spark jobs or queries are broken down into multiple stages, and each stage is further divided into tasks. The number of tasks depends on various factors like which stage is getting executed, which data source is getting read, etc. If it’s a map stage (Scan phase in SQL), typically the underlying data source partitions are honored.

For example, if a Hive ORC table has 2000 partitions, then 2000 tasks get created for the map stage for reading the table, assuming partition pruning did not come into play. If it’s a reduce stage (shuffle stage), then Spark will use either the “spark.default.parallelism” setting for RDDs or “spark.sql.shuffle.partitions” for Datasets to determine the number of tasks. How many tasks execute in parallel on each executor depends on the “spark.executor.cores” property. If this value is set too high without due consideration of memory, executors may fail with OOM. Now let’s see what happens under the hood while a task is getting executed, along with some probable causes of OOM.

Let’s say we are executing a map task or the scanning phase of SQL from an HDFS file or a Parquet/ORC table. For HDFS files, each Spark task will read a 128 MB block of data. So if 10 parallel tasks are running, then the memory requirement is at least 128 MB × 10 just for storing the partitioned data. This is again ignoring any data compression, which might cause data to blow up significantly depending on the compression algorithm.

Spark reads Parquet in a vectorized format. To put it simply, each task of Spark reads data from the Parquet file batch by batch. As Parquet is columnar, these batches are constructed for each of the columns.  It accumulates a certain amount of column data in memory before executing any operation on that column. This means Spark needs some data structures and bookkeeping to store that much data. Also, encoding techniques like dictionary encoding have some state saved in memory. All of them require memory.

 


Figure: Spark task and memory components while scanning a table

So with more concurrency, the overhead increases. Also, if there is a broadcast join involved, then the broadcast variables will also take some memory. The above diagram shows a simple case where each executor is executing two tasks in parallel.
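
A minimal sketch of the concurrency knobs mentioned above, set at session creation time; the numbers are placeholders to tune, not recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "4")             # concurrent tasks per executor
        .config("spark.default.parallelism", "400")      # shuffle parallelism for RDDs
        .config("spark.sql.shuffle.partitions", "400")   # shuffle partitions for DataFrames/Datasets
        .getOrCreate()
    )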

Inefficient queries

While Spark’s Catalyst engine tries to optimize a query as much as possible, it can’t help if the query itself is badly written. E.g., selecting all the columns of a Parquet/ORC table. As seen in the previous section, each column needs some in-memory column batch state. If more columns are selected, then more will be the overhead.

Try to read as few columns as possible. Try to use filters wherever possible, so that less data is fetched to executors. Some of the data sources support partition pruning. If your query can be converted to use partition column(s), then it will reduce data movement to a large extent.
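
A small sketch of both habits together; the path and column names are assumptions:

    # Read only the columns you need, and filter as early as possible so the
    # predicate can be pushed down to the Parquet reader. If event_date is a
    # partition column, untouched partitions are skipped entirely.
    df = (
        spark.read.parquet("s3://bucket/events")
        .select("user_id", "event_date", "amount")
        .filter("event_date = '2019-03-01'")
    )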

Incorrect configuration

Incorrect configuration of memory and caching can also cause failures and slowdowns in Spark applications. Let’s look at some examples.

Executor & Driver memory

Each application’s memory requirement is different. Depending on the requirement, each app has to be configured differently. You should ensure correct spark.executor.memory or spark.driver.memory values depending on the workload.  As obvious as it may seem, this is one of the hardest things to get right. We need the help of tools to monitor the actual memory usage of the application. Unravel does this pretty well.

Memory Overhead

Sometimes it’s not the executor memory but rather the YARN container memory overhead that causes OOM, or the node gets killed by YARN. A typical “YARN kill” message reports that the container was killed for exceeding memory limits.

YARN runs each Spark component like executors and drivers inside containers. Overhead memory is the off-heap memory used for JVM overheads, interned strings and other metadata of JVM. In this case, you need to configure spark.yarn.executor.memoryOverhead to a proper value. Typically 10% of total executor memory should be allocated for overhead.
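
For instance, a hedged sketch pairing a 20 GB executor heap with roughly 10% overhead; the values are illustrative, and on Spark 2.3+ the property was renamed to spark.executor.memoryOverhead:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.executor.memory", "20g")
        .config("spark.yarn.executor.memoryOverhead", "2048")  # in MB, ~10% of the heap
        .getOrCreate()
    )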

Caching Memory

If your application uses Spark caching to store some datasets, then it’s worthwhile to consider Spark’s memory manager settings. Spark’s memory manager is written in a very generic fashion to cater to all workloads. Hence, there are several knobs to set it correctly for a particular workload.

Spark has defined memory requirements as two types: execution and storage. Storage memory is used for caching purposes and execution memory is acquired for temporary structures like hash tables for aggregation, joins etc.

Both execution and storage memory are obtained from a configurable fraction of (total heap memory – 300MB). That setting is “spark.memory.fraction”, with a default of 60%. Of that, by default, 50% is assigned to storage (configurable by “spark.memory.storageFraction”) and the rest to execution.

There are situations where each of the above pools of memory, namely execution and storage, may borrow from each other if the other pool is free. Also, storage memory can be evicted to a limit if it has borrowed memory from execution. However, without going into those complexities, we can configure our program such that our cached data which fits in storage memory should not cause a problem for execution.

If we don’t want all our cached data to sit in memory, then we can configure “spark.memory.storageFraction” to a lower value so that extra data would get evicted and execution would not face memory pressure.
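
To make the arithmetic concrete, here is a rough breakdown for a hypothetical 20 GB executor heap using the default fractions described above:

    heap_gb = 20.0
    unified_gb = (heap_gb - 0.3) * 0.6      # spark.memory.fraction = 0.6, after 300 MB reserved
    storage_gb = unified_gb * 0.5           # spark.memory.storageFraction = 0.5
    execution_gb = unified_gb - storage_gb

    print(round(unified_gb, 2), round(storage_gb, 2), round(execution_gb, 2))  # 11.82 5.91 5.91

Lowering spark.memory.storageFraction (to 0.3, say) shifts that boundary, so cached data is evicted sooner and execution faces less memory pressure.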

Out of memory at Node Manager

Spark applications that do data shuffling as part of group-by or join-like operations incur significant overhead. Normally, the data shuffling process is handled by the executor process. If the executor is busy or under heavy GC load, then it can’t cater to the shuffle requests. This problem is alleviated to some extent by using an external shuffle service.

External shuffle service runs on each worker node and handles shuffle requests from executors. Executors can read shuffle files from this service rather than reading from each other. This helps requesting executors to read shuffle files even if the producing executors are killed or slow. Also, when dynamic allocation is enabled, it’s mandatory to enable the external shuffle service.

When the Spark external shuffle service is configured with YARN, the NodeManager starts an auxiliary service which acts as the external shuffle service provider. By default, NodeManager memory is around 1 GB. However, applications which do heavy data shuffling might fail due to the NodeManager going out of memory. It’s imperative to properly configure your NodeManager if your applications fall into the above category.
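
A minimal sketch of the Spark side of that setup; the YARN side also needs the spark_shuffle auxiliary service configured in yarn-site.xml, and a larger NodeManager heap if shuffle traffic is heavy:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.enabled", "true")  # requires the external shuffle service
        .getOrCreate()
    )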

End of Part I – Thanks for the Memory

Spark’s in-memory processing is a key part of its power. Therefore, effective memory management is a critical factor to get the best performance, scalability, and stability from your Spark applications and data pipelines. However, Spark’s default settings are often insufficient. Depending on the application and environment, certain key configuration parameters must be set correctly to meet your performance goals. Having a basic idea about them and how they can affect the overall application helps.

I have provided some insights into what to look for when considering Spark memory management. This is an area that the Unravel platform understands and optimizes very well, with little, if any, human intervention needed. See how Unravel can make Spark perform better and more reliably here.

Even better, schedule a demo to see Unravel in action. The performance speedups we are seeing for Spark apps are pretty significant.

In Part II of this series, Why Your Spark Apps are Slow or Failing: Data Skew and Garbage Collection, I will discuss how data organization, data skew, and garbage collection impact Spark performance.

Data Actionability™ DataOps Webinar
https://www.unraveldata.com/resources/data-actionability-dataops-webinar/


Data Actionability™: Boost Productivity with Unravel’s New DataOps AI Agent

Data Actionability™ Webinar
https://www.unraveldata.com/resources/data-actionability-webinar/


Data Actionability™: Empower Your Team with Unravel’s New AI Agents

AI Agents: Empower Data Teams With Actionability™ for Transformative Results
https://www.unraveldata.com/resources/ai-agents-empower-data-teams-with-actionability-for-transformative-results/


AI Agents for Data Teams

Data is the driving force of the world’s modern economies, but data teams are struggling to meet demand to support generative AI (GenAI), including rapid data volume growth and the increasing complexity of data pipelines. More than 88% of software engineers, data scientists, and SQL analysts surveyed say they are turning to AI for more effective bug-fixing and troubleshooting. And 84% of engineers who use AI said it frees up their time to focus on high-value activities.

AI Agents represent the next wave of AI innovation and have arrived just in time to help data teams make more efficient use of their limited bandwidth to build, operate, and optimize data pipelines and GenAI applications on modern data platforms.

Data Teams Grapple with High Demand for GenAI

A surge in adoption of new technologies such as GenAI is putting tremendous pressure on data teams, leading to broken apps and burnout. In order to support new GenAI products, data teams must deliver more production data pipelines and data apps, faster. The result is that data teams have too much on their plates, the pipelines are too complex, there is not enough time, and not everyone has the deep tech skills required. No surprise that 70% of organizations have difficulty integrating data into AI models and only 48% of AI projects get deployed into production.

Understanding AI Agents

Defining AI Agents

AI agents are software-based systems that gather information, recommend actions, and initiate and complete tasks in collaboration with or on behalf of humans to achieve a goal. AI agents can act independently, using components like perception and reasoning; provide step-by-step guidance to augment human abilities; or supply supporting information for complex human-led tasks. AI agents play a crucial role in automating tasks, simplifying data-driven decision-making, and achieving greater productivity and efficiency.

How AI Agents Work

AI agents operate by leveraging a wide range of data sources and signals, using algorithms and data processing to identify anomalies and actions, then interacting with their environment and users to effectively achieve specific goals. AI agents can achieve >90% accuracy, primarily driven by the reliability, volume, and variety of input data and telemetry to which they have access.

Types of Intelligent Agents

  • Reactive and proactive agents are two primary categories of intelligent agents.
  • Some agents perform work for you, while others help complete tasks with you or provide information to support your work.
  • Each type of intelligent agent has distinct characteristics and applications tailored to specific functions, enhancing productivity and efficiency.

AI for Data-Driven Organizations

Enhancing Decision Making

AI agents empower teams by improving data-supported decision-making processes, whether they act for you, with you, or simply inform decisions made by you. Examples of how AI agents act on your behalf include reducing toil and handling routine decisions based on AI insights. In various industries, AI agents optimize decision-making and provide recommendations to support your decisions. For complex tasks, AI agents provide the supporting information needed to build data pipelines, write SQL queries, and partition data.

Benefits of Broader Telemetry Sources for AI Agents

Integrating telemetry from various platforms and systems enhances AI agents’ ability to provide accurate recommendations. Incorporating AI agents into root cause analysis (RCA) systems offers significant benefits. Meta’s AI-based root cause analysis system shows how AI agents enhance tools and applications.

Overcoming Challenges

Enterprises running modern data stacks face common challenges like high costs, slow performance, and impaired productivity. Leveraging AI agents can automate tasks for you, with you, and by you. Unravel customers such as Equifax, Maersk, and Novartis have successfully overcome these challenges using AI.

The Value of AI Agents for Data Teams

Reducing Costs

When implementing AI agents, businesses benefit from optimized data stacks, reducing operational costs significantly. These agents continuously analyze telemetry data, adapting to new information dynamically. Unravel customers have successfully leveraged AI to achieve operational efficiency and cost savings.

Accelerating Performance

Performance is crucial in data analytics, and AI agents play a vital role in enhancing it. By utilizing these agents, enterprise organizations can make well-informed decisions promptly. Unravel customers have experienced accelerated data analytics performance through the implementation of AI technologies.

Improving Productivity

AI agents are instrumental in streamlining processes within businesses, leading to increased productivity levels. By integrating these agents into workflows, companies witness substantial productivity gains. Automation of repetitive tasks by AI agents simplifies troubleshooting and boosts overall productivity and efficiency.

Future Trends in AI Agents for FinOps, DataOps, and Data Engineering

Faster Innovation with AI Agents

By 2026, conversational AI will reduce agent labor costs by $80 billion. AI agents are advancing, providing accurate recommendations to address more issues automatically. This allows your team to focus on innovation. For example, companies like Meta use AI agents to simplify root cause analysis (RCA) for complex applications.

Accelerated Data Pipelines with AI Agents

Data processing is shifting towards real-time analytics, enabling faster revenue growth. However, this places higher demands on data teams. Equifax leverages AI to serve over 12 million daily requests in near real time.

Improved Data Analytics Efficiency with AI Agents

Data management is the fastest-growing segment of cloud spending. In the cloud, time is money; faster data processing reduces costs. One of the world’s largest logistics companies improved efficiency by up to 70% in just 6 months using Unravel’s AI recommendations.

Empower Your Team with AI Agents

Harnessing the power of AI agents can revolutionize your business operations, enhancing efficiency, decision-making, and customer experiences. Embrace this technology to stay ahead in the competitive landscape and unlock new opportunities for growth and innovation.

Learn more about our FinOps Agent, DataOps Agent, and Data Engineering Agent.

Unravel Data Security and Trust
https://www.unraveldata.com/resources/unravel-data-security-and-trust/


UNRAVEL DATA SECURITY AND TRUST
ENABLE DATA ACTIONABILITY™ + FINOPS WITH CONFIDENCE

Privacy and security are top priorities for Unravel and our customers. At Unravel, we help organizations better understand and improve the performance, quality, and cost efficiency of their data and AI pipelines. As a data business, we appreciate the scope and implications of privacy and security threats.

This data sheet provides details to help information security (InfoSec) teams make informed decisions. Specifically, it includes:

  • An overview of our approach to security and trust
  • An architectural diagram with connectivity descriptions
  • Details about Unravel compliance and certifications
  • Common questions about Unravel privacy and security

For additional details, please reach out to our security experts.

Unravel a Representative Vendor in the 2024 Gartner® Market Guide for Data Observability Tools https://www.unraveldata.com/resources/unravel-a-representative-vendor-in-the-2024-gartner-market-guide-for-data-observability-tools/ https://www.unraveldata.com/resources/unravel-a-representative-vendor-in-the-2024-gartner-market-guide-for-data-observability-tools/#respond Wed, 24 Jul 2024 16:52:23 +0000 https://www.unraveldata.com/?p=16034 abstract image with numbers

Unravel Data, the first AI-enabled data actionability™ and FinOps platform built to address the speed and scale of modern data platforms, today announced it has been named in the 2024 Gartner® Market Guide for Data Observability […]

The post Unravel a Representative Vendor in the 2024 Gartner® Market Guide for Data Observability Tools appeared first on Unravel.

]]>
abstract image with numbers

Unravel Data, the first AI-enabled data actionability™ and FinOps platform built to address the speed and scale of modern data platforms, today announced it has been named in the 2024 Gartner® Market Guide for Data Observability Tools.

According to Gartner, “By 2026, 50% of enterprises implementing distributed data architectures will have adopted data observability tools to improve visibility over the state of the data landscape, up from less than 20% in 2024.”1

Gartner also notes that “One of the leading causes for the high demand for data observability is the huge demand for D&A leaders to implement emerging technologies, particularly generative AI (GenAI), in their organization.”1

Unravel’s Perspective

We believe that the recognition of Unravel as a Representative Vendor underscores our commitment to innovation and excellence in providing comprehensive data observability solutions that empower organizations to drive performance, efficiency, and cost-effectiveness in their data operations.

The Importance of Data Observability

Data observability is crucial in today’s data-driven world. It involves monitoring, tracking, and analyzing data to ensure the health and performance of data systems. Effective data observability helps organizations detect and resolve issues quickly, maintain service-level agreements (SLAs), improve resource efficiency, and speed time-to-market of AI and other data-driven initiatives.

Demand is increasing for robust data observability tools to manage the growing complexity of modern data environments. As data volumes and velocities increase, so does the need for tools that can provide comprehensive visibility and actionable insights. Data observability platforms meet these needs by offering advanced features that support the governance and optimization of performance, costs, and productivity.

Figure 1. Options for data observability and FinOps. Source: Unravel Data.

Go Beyond Data Observability to Actionability™

Unravel’s unique approach to data observability enables organizations to take action, automate, and streamline toilsome cloud data analytics platform monitoring, troubleshooting, and cost management. Unravel helps data teams overcome key challenges to accelerate time to value.

Figure 2. The critical features of data observability. Source: Gartner1.

Unravel’s Comprehensive Data Actionability™ Platform

Unravel Data’s AI-powered actionability™ platform is designed to provide deep visibility and enable teams to go beyond observability and take action to improve their modern data stacks, including platforms like Databricks, Snowflake, and BigQuery. AI-powered insights and automation help organizations optimize their data pipelines and infrastructure.

Accelerating Value Across the Organization

Although data observability is typically thought of as being synonymous with data quality, the role of data observability is expanding. As per Gartner, “The increasing complexity in data ecosystems will favor comprehensive data observability tools that can provide additional utilities beyond the monitoring and detection of data issues across platforms.”1 Data observability now includes a variety of new observation categories, including data content, data flow and pipelines, infrastructure and compute, users, usage, and utilization1.

With the introduction of its new AI agents—Unravel FinOps AI Agent, Unravel DataOps AI Agent, and Unravel Data Engineering AI Agent—Unravel enables each role within the organization to simplify and automate recommended actions.

These AI agents are built to tackle specific challenges faced by data teams:

Unravel FinOps AI Agent: This agent provides real-time insights into cloud expenditures, identifying cost-saving opportunities and ensuring budget adherence. It automates financial governance, allowing organizations to manage their data-related costs effectively.

Unravel DataOps AI Agent: By automating routine tasks such as data pipeline monitoring and anomaly detection, this agent frees up human experts to focus on strategic endeavors, enhancing overall productivity and efficiency.

Unravel Data Engineering AI Agent: This agent reduces the burden of mundane tasks for data engineers, enabling them to concentrate on high-value problem-solving. It enhances productivity and ensures precise and reliable data operations.

Figure 3. Unravel’s differentiated value. Source: Unravel Data.

Achieving Data Actionability™ with Unravel

Unravel’s platform not only observes but also enables teams to take action. By leveraging AI-driven insights, it provides recommendations and automated solutions to optimize data operations. This proactive approach allows organizations to stay ahead of potential issues and continuously improve their data performance and efficiency.

The recent announcement of AI agents exemplifies how Unravel is pushing the boundaries of what data observability can achieve. These agents are designed to lower the barriers to expertise, enabling even under-resourced data teams to operate at peak efficiency.

Get Started Today

Experience the benefits of Unravel’s data actionability™ platform with a free Health Check Report. This assessment provides insights into the current state of your data operations and quantifies potential improvements to boost performance, cost efficiency, and productivity.

Figure 4. Unravel’s Health Check Report. Source: Unravel Data.

1Gartner, Market Guide for Data Observability Tools, By Melody Chien, Jason Medd, Lydia Ferguson, Michael Simone, 25 June 2024
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

The post Unravel a Representative Vendor in the 2024 Gartner® Market Guide for Data Observability Tools appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-a-representative-vendor-in-the-2024-gartner-market-guide-for-data-observability-tools/feed/ 0
Empowering Data Agility: Equifax’s Journey to Operational Excellence https://www.unraveldata.com/resources/empowering-data-agility-equifaxs-journey-to-operational-excellence/ https://www.unraveldata.com/resources/empowering-data-agility-equifaxs-journey-to-operational-excellence/#respond Tue, 25 Jun 2024 16:14:24 +0000 https://www.unraveldata.com/?p=15931 Mesh Sphere Background

In the data-driven world where real-time decision-making and innovation are not just goals but necessities, global data analytics and technology companies like Equifax must navigate a complex environment to achieve success. Equifax sets the standard for […]

The post Empowering Data Agility: Equifax’s Journey to Operational Excellence appeared first on Unravel.

]]>
Mesh Sphere Background

In the data-driven world where real-time decision-making and innovation are not just goals but necessities, global data analytics and technology companies like Equifax must navigate a complex environment to achieve success. Equifax sets the standard for operational excellence by enabling real-time decision-making, accelerating innovation, scaling efficiently, consistently achieving service level agreements (SLAs), and building reliable data pipelines. Let’s delve into the strategies that lead to such a resounding transformation, with a spotlight on the pivotal role of data observability and FinOps.

When Speed Meets Precision: Revolutionizing Credit Scoring

Kumar Menon, the Equifax CTO for Data Mesh and Decision-Making Technologies, faced a formidable challenge: enabling sub-second credit scoring. And he is not alone. “Many organizations find it challenging to turn their data into insights quickly enough to generate real and timely results” (Deloitte). With Unravel’s data observability and FinOps platform, Kumar’s team overcame this hurdle to deliver faster insights and decisions while fortifying Equifax’s competitive edge.

Unifying Data with Vision and Strategy

The necessity to integrate data across six cloud regions cannot be overstated for a company operating at Equifax’s scale. Gartner discovered that many organizations struggle with “high-cost and low-value data integration cycles.” Kumar Menon and his team, leveraging the tools and methodologies of data observability and FinOps, streamlined this intricate process. The result? Faster and more economical products capable of satisfying the needs of a constantly shifting market.

Mastering the Data Deluge

Managing over 10 petabytes of data is no small feat. IDC noted enterprises’ need for “responsiveness, scalability, and resiliency of the digital infrastructure.” Equifax, with Kumar Menon’s foresight, embraced the power of Unravel’s data observability and FinOps frameworks to adapt and grow. By efficiently managing their cloud resource usage, the team was able to scale data processing with the appropriate cloud resources.

Delivering Real-Time Decisions

The Equifax team needed to support 12 million daily inquiries. Operating production services at this scale can be overwhelming for any system not prepared for such a deluge. In fact, Gartner uncovered a significant challenge around the ability to “detect errors and lower the cost to fix and shorten the resulting downtime.” Data observability and FinOps serve as Kumar Menon’s frameworks to not only confront these challenges but to ensure that Equifax can consistently deliver accurate, real-time decisions such as credit scores and employment verifications.

Streamlining Massive Data Ingestion

The Equifax data team’s colossal task of ingesting over 15 billion observations per day could potentially entangle any team in a web of complexity. Again, Gartner articulates the frustration many organizations face with “complex, convoluted data pipelines that require toilsome remediation to manage errors.” Unravel’s platform provided Kumar’s team the means to build and maintain reliable and robust data pipelines, assuring the integrity of Equifax’s reliable and responsive data products.

The Path Forward with Unravel

In a data-centric industry, Equifax exemplifies leadership through precision, agility, and efficiency. The company’s journey demonstrates their capacity to enable real-time decision-making, accelerate innovation, and ensure operational scale and reliability.

At Unravel, we understand and empathize with data teams facing substantial challenges like the ones described above. As your ally in data performance, productivity, and cost management, we’re committed to equipping you with tools that not only remove obstacles but also enhance your operational prowess. Harness the power of data observability and FinOps with Unravel Data through a self-guided tour that shows how you can:

  • Deliver faster insights and decisions with Unravel’s pipeline bottleneck analysis.
  • Bring faster and more efficient products to market by using Unravel’s speed, cost, and reliability optimizer.
  • Scale data processing efficiently with Unravel’s correlated data observability.
  • Achieve and exceed your SLAs using Unravel’s out-of-the-box reporting.
  • Build performant data pipelines with Unravel’s pipeline bottleneck analysis.

Ready to unlock the full potential of your data strategies and operations? See how you can achieve more with data observability and FinOps with Unravel Data in a self-guided tour.

The post Empowering Data Agility: Equifax’s Journey to Operational Excellence appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/empowering-data-agility-equifaxs-journey-to-operational-excellence/feed/ 0
How to Stop Burning Money (or at Least Slow the Burn) https://www.unraveldata.com/resources/how-to-stop-burning-money-or-at-least-slow-the-burn/ https://www.unraveldata.com/resources/how-to-stop-burning-money-or-at-least-slow-the-burn/#respond Tue, 25 Jun 2024 15:32:38 +0000 https://www.unraveldata.com/?p=15929

A Recap of the Unravel Data Customer Session at Gartner Data & Analytics Summit 2024 The Gartner Data & Analytics Summit 2024 (“D&A”) in London is a pivotal event for Chief Data Analytics Officers (CDAOs) and […]

The post How to Stop Burning Money (or at Least Slow the Burn) appeared first on Unravel.

]]>

A Recap of the Unravel Data Customer Session at Gartner Data & Analytics Summit 2024

The Gartner Data & Analytics Summit 2024 (“D&A”) in London is a pivotal event for Chief Data Analytics Officers (CDAOs) and data and analytics leaders, drawing together a global community eager to dive deep into the insights, strategies, and frameworks that are shaping the future of data and analytics. With an unprecedented assembly of over 3,000 attendees, spanning 150+ knowledge-packed sessions and joined by 90+ innovative partners, the D&A Summit was designed to catalyze data transformation within organizations, offering attendees unique insights to think big and drive real change.

The Unravel Data customer session, titled “How to Stop Burning Money (or At Least Slow the Burn)”, emerged as a highlight of the D&A Summit, drawing attention to the pressing need for cost-efficient data processing in today’s rapid digital evolution. The session, presented by one of the largest logistics companies in the world (over 100,000 employees and a fleet of 700+ container vessels operating across 130 countries), drew more than 150 CDAOs and data and analytics leaders. The audience came from 140+ companies across 30+ industries such as banking, retail, and pharma, spanning 110+ cities in 20+ countries. This compelling turnout underscored the universal relevance and urgency of cost-optimization strategies in data engineering. Watch the highlight reel here.

The session was presented by Peter Rees, Director of GenAI, AI, Data and Integration at Maersk, and garnered unprecedented accolades, including attendance 170% higher than average. Peter Rees’ session was the third highest-rated of all 40 partner sessions. These results reflect the session’s relevance and the invaluable insights shared on revolutionizing the efficiency of data processing pipelines, cost allocation, and optimization techniques.

The Gartner Data & Analytics Summit 2024, and particularly the Unravel Data customer session, brought together organizations striving to align their data engineering costs with the value of their data and analytics. Unravel Data’s innovative approach, showcased through the success of a world-leading logistics company, provides a blueprint for organizations across industries looking to dramatically enhance the speed, productivity, and efficiency of their data processing and AI investments.

We invite you to explore how your organization can benefit from Unravel Data’s groundbreaking platform. Take the first step towards transforming your data processing strategy by scheduling an Unravel Data Health Check. Embark on your journey towards optimal efficiency and cost savings today.

The post How to Stop Burning Money (or at Least Slow the Burn) appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/how-to-stop-burning-money-or-at-least-slow-the-burn/feed/ 0
Unravel Data was Mentioned in a Graphic source in the 2024 Gartner® Report https://www.unraveldata.com/resources/unravel-data-was-mentioned-in-a-graphic-source-in-the-2024-gartner-report/ https://www.unraveldata.com/resources/unravel-data-was-mentioned-in-a-graphic-source-in-the-2024-gartner-report/#respond Tue, 21 May 2024 14:27:12 +0000 https://www.unraveldata.com/?p=15532

In a recently published report, “Beyond FinOps: Optimizing Your Public Cloud Costs”, Gartner shares a graphic which is adapted from Unravel. Unravel’s Perspective How FinOps Helps Unravel helps organizations adopt a FinOps approach to improve cloud […]

The post Unravel Data was Mentioned in a Graphic source in the 2024 Gartner® Report appeared first on Unravel.

]]>

In a recently published report, “Beyond FinOps: Optimizing Your Public Cloud Costs”, Gartner shares a graphic which is adapted from Unravel.

Figure: Sources of inefficiency in cloud D&A platforms (adapted from Unravel Data).

Unravel’s Perspective

How FinOps Helps

Unravel helps organizations adopt a FinOps approach to improve cloud data spending efficiency. FinOps helps organizations address overspend, including under- and over-provisioned cloud services, suboptimal architectures, and inefficient pricing strategies. Infrastructure and operations (I&O) leaders and practitioners can use FinOps principles to optimize cloud service design, configuration, and spending commitments to reduce costs. Data and analytics (D&A) leaders and teams are using FinOps to achieve predictable spend for their cloud data platforms.

Introducing Cloud Financial Management and Governance

Cloud Financial Management (CFM) and Governance empowers organizations to quickly adapt in today’s dynamic landscape, ensuring agility and competitiveness. CFM principles help organizations take advantage of the cloud’s variable consumption model through purpose-built tools tailored for your modern cloud finance needs. A well-defined cloud cost governance model lets cloud users monitor and manage their cloud costs and balance cloud spending vs. performance and end-user experience.

Three Keys to Optimizing Your Modern Data Stack

  1. Cloud data platform usage analysis plays a crucial role in cloud financial management by providing insights into usage patterns, cost allocation, and optimization opportunities. By automatically gathering, analyzing, and correlating data from various sources, such as traces, logs, metrics, source code, and configuration files, organizations can identify areas for cost savings and efficiency improvements. Unravel’s purpose-built AI reduces the toil required to manually examine metrics, such as resource utilization, spending trends, and performance benchmarks for modern data stacks such as Databricks, Snowflake, and BigQuery.
  2. Cost allocation, showback, and chargeback are essential for effective cloud cost optimization. Organizations need to accurately assign costs to different departments or projects based on their actual resource consumption. This ensures accountability and helps in identifying areas of overspending or underutilization. Automated tools and platforms like Unravel can streamline the cost allocation process for cloud data platforms such as Databricks, Snowflake, and BigQuery, making it easier to track expenses and optimize resource usage. A minimal showback illustration follows this list.
  3. Budget forecasting and management is another critical aspect of cloud financial management. By analyzing historical data and usage trends, organizations can predict future expenses more accurately. This enables them to plan budgets effectively, avoid unexpected costs, and allocate resources efficiently. Implementing robust budget forecasting processes can help organizations achieve greater financial control and optimize their cloud spending.
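As a simple illustration of the showback idea above, the sketch below aggregates tagged spend by team and workload using pandas. The costs.csv file and its columns are hypothetical and do not reflect the billing export schema of Databricks, Snowflake, BigQuery, or Unravel itself.

import pandas as pd

# Hypothetical showback report: sum platform spend by team and workload tag.
# Expects columns: date, team, workload, cost_usd (illustrative schema).
costs = pd.read_csv("costs.csv")

showback = (
    costs.groupby(["team", "workload"], as_index=False)["cost_usd"]
    .sum()
    .sort_values("cost_usd", ascending=False)
)

print(showback.head(10))  # top cost drivers by team and workload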

Next Steps

You now have the essentials of cloud financial management and governance needed to optimize spending on your cloud data platform. Embrace these concepts to enhance your cloud strategy and drive success. Take charge of your Databricks, Snowflake, and BigQuery optimization today with a free health check.

Gartner, Beyond FinOps: Optimizing Your Public Cloud Costs, By Marco Meinardi, 21 March 2024

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

The post Unravel Data was Mentioned in a Graphic source in the 2024 Gartner® Report appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-was-mentioned-in-a-graphic-source-in-the-2024-gartner-report/feed/ 0
IDC Analyst Brief: The Role of Data Observability and Optimization in Enabling AI-Driven Innovation https://www.unraveldata.com/resources/idc-analyst-brief-the-role-of-data-observability-and-optimization-in-enabling-ai-driven-innovation/ https://www.unraveldata.com/resources/idc-analyst-brief-the-role-of-data-observability-and-optimization-in-enabling-ai-driven-innovation/#respond Mon, 15 Apr 2024 14:22:53 +0000 https://www.unraveldata.com/?p=15139 Data Graph

Harnessing Data Observability for AI-Driven Innovation Organizations are now embarking on a journey to harness AI for significant business advancements, from new revenue streams to productivity gains. However, the complexity of delivering AI-powered software efficiently and […]

The post IDC Analyst Brief: The Role of Data Observability and Optimization in Enabling AI-Driven Innovation appeared first on Unravel.

]]>
Data Graph

Harnessing Data Observability for AI-Driven Innovation

Organizations are now embarking on a journey to harness AI for significant business advancements, from new revenue streams to productivity gains. However, the complexity of delivering AI-powered software efficiently and reliably remains a challenge. With AI investments expected to surge beyond $520 billion by 2027, this brief underscores the necessity for a robust intelligence architecture, scalable digital operations, and specialized skills. Learn how AI-driven data observability can be leveraged as a strategic asset for businesses aiming to lead in innovation and operational excellence.

Get a copy of the IDC Analyst Brief by Research Director Nancy Gohring.

The post IDC Analyst Brief: The Role of Data Observability and Optimization in Enabling AI-Driven Innovation appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/idc-analyst-brief-the-role-of-data-observability-and-optimization-in-enabling-ai-driven-innovation/feed/ 0
Winning the AI Innovation Race https://www.unraveldata.com/resources/winning-the-ai-innovation-race/ https://www.unraveldata.com/resources/winning-the-ai-innovation-race/#respond Mon, 11 Mar 2024 10:24:06 +0000 https://www.unraveldata.com/?p=15007 Applying AI to Automate Application Performance Tuning

Business leaders from every industry now find themselves under the gun to somehow, someway leverage AI into an actual product that companies (and individuals) can use. Yet, an estimated 70%-85% of artificial intelligence (AI) and machine […]

The post Winning the AI Innovation Race appeared first on Unravel.

]]>
Applying AI to Automate Application Performance Tuning

Business leaders from every industry now find themselves under the gun to somehow, someway leverage AI into an actual product that companies (and individuals) can use. Yet, an estimated 70%-85% of artificial intelligence (AI) and machine learning (ML) projects fail.

In our thought-leadership white paper Winning the AI Innovation Race: How AI Helps Optimize Speed to Market and Cost Inefficiencies of AI Innovation, you will learn:

• Top pitfalls that impede speed and ROI for running AI and data pipelines in the cloud

• How the answers to addressing these impediments can be found at the code level

• How AI is paramount for optimization of cloud data workloads

• How Unravel helps

The post Winning the AI Innovation Race appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/winning-the-ai-innovation-race/feed/ 0
Data Observability + FinOps for Snowflake Engineers https://www.unraveldata.com/resources/data-observability-finops-for-snowflake-engineers/ https://www.unraveldata.com/resources/data-observability-finops-for-snowflake-engineers/#respond Fri, 26 Jan 2024 20:22:57 +0000 https://www.unraveldata.com/?p=14632 Abstract light image

AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR SNOWFLAKE DATA ENGINEERS Snowflake data engineers are under enormous pressure to deliver results. This data sheet provides more context about the challenges data engineers face and how Unravel helps them […]

The post Data Observability + FinOps for Snowflake Engineers appeared first on Unravel.

]]>
Abstract light image

AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR SNOWFLAKE DATA ENGINEERS

Snowflake data engineers are under enormous pressure to deliver results. This data sheet provides more context about the challenges data engineers face and how Unravel helps them address these challenges.

Specifically, it discusses:

  • Key Snowflake data engineering roadblocks
  • Unravel’s purpose-built AI for Snowflake
  • Data engineering benefits

With Unravel, Snowflake data engineers can speed data pipeline development and analytics initiatives with granular, real-time cost visibility, predictive spend forecasting, and performance insights for their data cloud.

To see Unravel Data for Snowflake in action, contact our data experts.

The post Data Observability + FinOps for Snowflake Engineers appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-observability-finops-for-snowflake-engineers/feed/ 0
Open Source Overwatch VS Unravel Free Comparison Infographic https://www.unraveldata.com/resources/open-source-vs-unravel-free-comparison-infographic/ https://www.unraveldata.com/resources/open-source-vs-unravel-free-comparison-infographic/#respond Wed, 15 Nov 2023 23:05:51 +0000 https://www.unraveldata.com/?p=14377

5 key differences between Overwatch and Unravel’s free observability for Databricks. Before investing significant time and effort to configure Overwatch, compare how Unravel’s free data observability solution gives you end-to-end, real-time granular visibility into your Databricks […]

The post Open Source Overwatch VS Unravel Free Comparison Infographic appeared first on Unravel.

]]>

5 key differences between Overwatch and Unravel’s free observability for Databricks.

Before investing significant time and effort to configure Overwatch, compare how Unravel’s free data observability solution gives you end-to-end, real-time granular visibility into your Databricks Lakehouse platform out of the box.

The post Open Source Overwatch VS Unravel Free Comparison Infographic appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/open-source-vs-unravel-free-comparison-infographic/feed/ 0
How Unravel Enhances Airflow https://www.unraveldata.com/resources/how-unravel-enhances-airflow/ https://www.unraveldata.com/resources/how-unravel-enhances-airflow/#respond Mon, 16 Oct 2023 15:28:01 +0000 https://www.unraveldata.com/?p=13750 iceberg

In today’s data-driven world, there is a huge amount of data flowing into the business. Engineers spend a large part of their time in building pipelines—to collect the data from different sources, process it, and transform […]

The post How Unravel Enhances Airflow appeared first on Unravel.

]]>
iceberg

In today’s data-driven world, there is a huge amount of data flowing into the business. Engineers spend a large part of their time building pipelines to collect the data from different sources, process it, and transform it into useful datasets that can be sent to business intelligence applications or machine learning models. Tools like Airflow are used to orchestrate complex data pipelines by programmatically authoring, scheduling, and monitoring workflow pipelines. From a 10,000-foot view, Airflow may look like just a powerful cron, but it has additional capabilities like monitoring, generating logs, retrying jobs, adding dependencies between tasks, and supporting a large number of operators to run different types of tasks.

Pipelines become increasingly complex because of the interdependence of the several tools needed to complete a workflow. Day-to-day pipeline work calls for a number of different technologies to process data for different purposes: running a Spark job, executing a Snowflake query, transferring data across platforms (for example, from a GCS bucket to an AWS S3 bucket), and pushing data into a queue system like SQS.

Airflow orchestrates pipelines that run several third-party tools and supports about 80 different providers. But you need to monitor what’s going on underneath pipeline operations across all those tools and get deeper insights about pipeline execution. Unravel supports this by bringing in all the information monitored from Airflow via API, StatsD, and notifications. Unravel also collects metrics from Spark, BigQuery, and Snowflake to connect the dots and show how pipeline execution in Airflow impacts other systems.

Let’s examine two common use cases to illustrate how Unravel stitches together all pipeline details to solve problems that you can’t address with Airflow alone.

View DAG Run Information

Let’s say a financial services company is using Airflow to run Spark jobs on their instances and notices that a few of the DAG runs are taking longer to complete. The Airflow UI shows the duration and status of each task run, but that’s about it. For deeper insight into why the runs are taking longer, you’d need to gather detailed information about the Spark application ID created by each task run.
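As a point of reference, here is a minimal sketch of the kind of DAG involved, submitting a Spark job through the SparkSubmitOperator from the Apache Spark provider package. The DAG name echoes the test_spark example discussed below; the application path, connection ID, schedule, and Spark settings are placeholders, not the company’s actual configuration.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative DAG that submits a Spark job on a schedule; all values are placeholders.
with DAG(
    dag_id="test_spark",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_spark_job = SparkSubmitOperator(
        task_id="spark_submit_task",
        application="/path/to/spark_job.py",   # placeholder path to the Spark application
        conn_id="spark_default",               # Spark connection defined in Airflow
        conf={"spark.executor.memory": "4g"},  # example tuning setting
    )

Airflow records only the task’s duration and state for this run; the Spark-side details (stages, executors, configuration) live in Spark, which is the gap described next.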

By tightly integrating with the Spark ecosystem, Unravel fills that gap and gives you deeper insights about the Spark job run, its duration, and the resources utilized. These details, along with information collected from Airflow, can help you see the bigger picture of what’s happening underneath, inside your data pipelines.

This is an example test_spark DAG that runs the Spark job via the SparkSubmitOperator in the company’s environment. The SparkSubmitOperator task is taking longer to execute, and the screenshot below shows the long duration of the Spark submit task.

SparkSubmitOperator task

This information flows into Unravel and is visualized in the Pipeline Observer UI. A summary list of failed/degraded pipeline runs helps filter the DAG runs with problems to debug. Clicking on a specific pipeline run ID drills down into more information about the pipeline, e.g., detailed information about the task, duration of each task, total duration, baseline compared to previous runs, and stack traces about the failure.

failed Airflow pipelines

Unravel also shows a list of all pipeline runs.

all Airflow pipelines

Clicking on one of the Spark DAG runs with a long duration reveals more detailed information about the DAG and allows comparison with previous DAG runs. You can also find information about the Spark ID that is run for this particular DAG.

Unravel Spark ID details

Clicking on the Spark app ID takes you to the Unravel UI with more information about the Spark job that is collected from Spark: the resources used to run the Spark job, any error logs, configurations made to run that Spark job, duration of the Spark jobs, and more.

Spark run details in Unravel UI

Here we have more detail about the Spark job and the Spark ID created during the DAG run. This helps you verify the list of Spark jobs, understand which DAG run created a specific Spark job, see the resources that job consumed, and find out how it was configured.

Tracking down a specific Spark job from a DAG run is difficult because Airflow only initiates the task run while Spark actually executes it. Unravel monitors and collects details from Spark, identifies the Spark ID, and correlates the Spark job to the Airflow DAG run that initiated it.

See Unravel’s Automated Pipeline Bottleneck Analysis in 60-second tour
Take tour

Delay in Scheduling Pipeline Run 

Unravel helps pinpoint the delay in scheduling DAG runs.

The same financial services company runs multiple DAGs as part of their pipeline system. While the DAG runs themselves are working fine, the team stumbles upon a different problem: a DAG that is expected to start at its scheduled time is actually starting late. This delay cascades into subsequent DAG runs, so tasks are not completed on time.

Airflow has the capability to alert if a Task/DAG that’s running misses its SLA. But there are cases when the DAG run is not initiated at all in the first place—e.g., no slot is available or the number of DAGs that can run in parallel exceeds the configured maximum number.  
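For context, Airflow’s native SLA alerting is configured roughly as in the sketch below; the DAG, task, threshold, and callback are illustrative. As noted above, this only covers tasks inside runs that Airflow actually creates, so it does not catch runs that are never scheduled in the first place.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Illustrative callback; a real implementation might page an on-call channel.
    print(f"SLA missed in {dag.dag_id}: {task_list}")

with DAG(
    dag_id="sla_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
) as dag:
    # Flag this task if it has not finished within 30 minutes of the scheduled run time.
    BashOperator(
        task_id="hourly_load",
        bash_command="echo 'load step'",
        sla=timedelta(minutes=30),
    )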

Unravel helps bring the problem to the surface by rendering the underlying details about the DAG run delay right in the UI. For a given DAG run ID, Unravel shows useful information collected from StatsD, such as the total parse time for processing the DAG and the DAG run schedule delay. These pipeline insights help identify why there is a gap between the DAG’s scheduled time and its actual execution time.
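The StatsD metrics referenced here come from Airflow’s own emitter, which is off by default. Assuming Airflow 2.x with the statsd extra installed, it can be switched on in airflow.cfg roughly as shown below (host, port, and prefix are placeholders); Airflow then publishes metrics such as dag_processing.total_parse_time and dagrun.schedule_delay.<dag_id> that tools like Unravel can consume.

# Illustrative airflow.cfg settings; statsd host, port, and prefix are placeholders.
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow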

delayed Airflow DAG

The Pipeline Details page presents DAG run ID information, with pipeline insights about the schedule delay in the right-hand pane. It clearly shows that the DAG run was delayed well beyond the acceptable threshold. 

Unravel helps keep pipeline runs in check and automatically delivers insights about what’s happening inside Airflow. Unravel monitors systems like Spark, BigQuery, and Airflow, collects granular data about each of them, and connects that information to render the right insights, making it a powerful tool for monitoring cloud data systems.

See Unravel’s Automated Pipeline Bottleneck Analysis in 60-second tour
Take tour

The post How Unravel Enhances Airflow appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/how-unravel-enhances-airflow/feed/ 0
Rev Up Your Lakehouse: Lap the Field with a Databricks Operating Model https://www.unraveldata.com/resources/rev-up-your-lakehouse-lap-the-field-with-a-databricks-operating-model/ https://www.unraveldata.com/resources/rev-up-your-lakehouse-lap-the-field-with-a-databricks-operating-model/#respond Thu, 12 Oct 2023 18:19:05 +0000 https://www.unraveldata.com/?p=14042

In this fast-paced era of artificial intelligence (AI), the need for data is multiplying. The demand for faster data life cycles has skyrocketed, thanks to AI’s insatiable appetite for knowledge. According to a recent McKinsey survey, […]

The post Rev Up Your Lakehouse: Lap the Field with a Databricks Operating Model appeared first on Unravel.

]]>

In this fast-paced era of artificial intelligence (AI), the need for data is multiplying. The demand for faster data life cycles has skyrocketed, thanks to AI’s insatiable appetite for knowledge. According to a recent McKinsey survey, 75% of respondents expect generative AI (GenAI) to “cause significant or disruptive change in the nature of their industry’s competition in the next three years.”

Next-gen AI craves unstructured, streaming, industry-specific data. Although the pace of innovation is relentless, “when it comes to generative AI, data really is your moat.”

But here’s the twist: efficiency is now the new cool kid in town. Data product profitability hinges on optimizing every step of the data life cycle, from ingestion and transformation to processing, curating, and refining. It’s no longer just about gathering mountains of information; it’s about collecting the right data efficiently.

As new, industry-specific GenAI use cases emerge, there is an urgent need for large data sets for training, validation, verification, and drift analysis. GenAI requires flexible, scalable, and efficient data architecture, infrastructure, code, and operating models to achieve success.

Leverage a Scalable Operating Model to Accelerate Your Data Life Cycle Velocity

To optimize your data life cycle, it’s crucial to leverage a scalable operating model that can accelerate the velocity of your data processes. By following a systematic approach and implementing efficient strategies, you can effectively manage your data from start to finish.

Databricks recently introduced a scalable operating model for data and AI to help customers achieve a positive Return on Data Assets (RODA).

Databricks AI operating pipeline

Databricks’ iterative end-to-end operating pipeline

 

Define Use Cases and Business Requirements

Before diving into the data life cycle, it’s essential to clearly define your use cases and business requirements. This involves understanding what specific problems or goals you plan to address with your data. By identifying these use cases and related business requirements, you can determine the necessary steps and actions needed throughout the entire process.

Build, Test, and Iterate the Solution

Once you have defined your use cases and business requirements, it’s time to build, test, and iterate the solution. This involves developing the necessary infrastructure, tools, and processes required for managing your data effectively. It’s important to continuously test and iterate on your solution to ensure that it meets your desired outcomes.

During this phase, consider using agile methodologies that allow for quick iterations and feedback loops. This will enable you to make adjustments as needed based on real-world usage and feedback from stakeholders.

Scale Efficiently

As your data needs grow over time, it’s crucial to scale efficiently. This means ensuring that your architecture can handle increased volumes of data without sacrificing performance or reliability.

Consider leveraging cloud-based technologies that offer scalability on-demand. Cloud platforms provide flexible resources that can be easily scaled up or down based on your needs. Employing automation techniques such as machine learning algorithms or artificial intelligence can help streamline processes and improve efficiency.

By scaling efficiently, you can accommodate growing datasets while maintaining high-quality standards throughout the entire data life cycle.

Elements of the Business Use Cases and Requirements Phase

In the data life cycle, the business requirements phase plays a crucial role in setting the foundation for successful data management. This phase involves several key elements that contribute to defining a solution and ensuring measurable outcomes. Let’s take a closer look at these elements:

  • Leverage design thinking to define a solution for each problem statement: Design thinking is an approach that focuses on understanding user needs, challenging assumptions, and exploring innovative solutions. In this phase, it is essential to apply design thinking principles to identify and define a single problem statement that aligns with business objectives.
  • Validate the business case and define measurable outcomes: Before proceeding further, it is crucial to validate the business case for the proposed solution. This involves assessing its feasibility, potential benefits, and alignment with strategic goals. Defining clear and measurable outcomes helps in evaluating project success.
  • Map out the MVP end user experiences: To ensure user satisfaction and engagement, mapping out Minimum Viable Product (MVP) end-user experiences is essential. This involves identifying key touchpoints and interactions throughout the data life cycle stages. By considering user perspectives early on, organizations can create intuitive and effective solutions.
  • Understand the data requirements: A thorough understanding of data requirements is vital for successful implementation. It includes identifying what types of data are needed, their sources, formats, quality standards, security considerations, and any specific regulations or compliance requirements.
  • Gather required capabilities with platform architects: Collaborating with platform architects helps gather insights into available capabilities within existing infrastructure or technology platforms. This step ensures compatibility between business requirements and technical capabilities while minimizing redundancies or unnecessary investments.
  • Establish data management roles, responsibilities, and procedures: Defining clear roles and responsibilities within the organization’s data management team is critical for effective execution. Establishing procedures for data observability, stewardship practices, access controls, and privacy policies ensures consistency in managing data throughout its life cycle.

By following these elements in the business requirements phase, organizations can lay a solid foundation for successful data management and optimize the overall data life cycle. It sets the stage for subsequent phases, including data acquisition, storage, processing, analysis, and utilization.

Build, Test, and Iterate the Solution

To successfully implement a data life cycle, it is crucial to focus on building, testing, and iterating the solution. This phase involves several key steps that ensure the development and deployment of a robust and efficient system.

  • Plan development and deployment: The first step in this phase is to carefully plan the development and deployment process. This includes identifying the goals and objectives of the project, defining timelines and milestones, and allocating resources effectively. By having a clear plan in place, the data team can streamline their efforts towards achieving desired outcomes.
  • Gather end-user feedback at every stage: Throughout the development process, it is essential to gather feedback from end users at every stage. This allows for iterative improvements based on real-world usage scenarios. By actively involving end users in providing feedback, the data team can identify areas for enhancement or potential issues that need to be addressed.
  • Define CI/CD pipelines for fast testing and iteration: Implementing Continuous Integration (CI) and Continuous Deployment (CD) pipelines enables fast testing and iteration of the solution. These pipelines automate various stages of software development such as code integration, testing, deployment, and monitoring. By automating these processes, any changes or updates can be quickly tested and deployed without manual intervention.
  • Data preparation, cleaning, and processing: Before training machine learning models or conducting experiments with datasets, it is crucial to prepare, clean, and process the data appropriately. This involves tasks such as removing outliers or missing values from datasets to ensure accurate results during model training.
  • Feature engineering: Feature engineering plays a vital role in enhancing model performance by selecting relevant features from raw data or creating new features based on domain knowledge. It involves transforming raw data into meaningful representations that capture essential patterns or characteristics.
  • Training and ML experiments: In this stage of the data life cycle, machine learning models are trained using appropriate algorithms on prepared datasets. Multiple experiments may be conducted, testing different algorithms or hyperparameters to find the best-performing model.
  • Model deployment: Once a satisfactory model is obtained, it needs to be deployed in a production environment. This involves integrating the model into existing systems or creating new APIs for real-time predictions.
  • Model monitoring and scoring: After deployment, continuous monitoring of the model’s performance is essential. Tracking key metrics and scoring the model’s outputs against ground truth data helps identify any degradation in performance or potential issues that require attention.

By following these steps and iterating on the solution based on user feedback, data teams can ensure an efficient and effective data life cycle from development to deployment and beyond.
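To make the training, deployment, and monitoring steps above more concrete, here is a minimal sketch using MLflow’s tracking API with a toy scikit-learn dataset. The dataset, model, parameters, and metric are placeholders, not a prescription for any particular Databricks workload.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Data preparation: split a toy dataset into training and holdout sets.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    # Training and ML experiments: fit a candidate model and log its parameters.
    model = RandomForestRegressor(n_estimators=200, max_depth=6, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_params({"n_estimators": 200, "max_depth": 6})

    # Model monitoring and scoring: log a holdout metric that later production
    # scoring results can be compared against to detect degradation.
    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_metric("mae", mae)

    # Model deployment: persist the model artifact so it can be promoted
    # through a registry or served behind an API.
    mlflow.sklearn.log_model(model, artifact_path="model")

Each run is then visible in the MLflow tracking UI, which supports the CI/CD-style iteration and end-user feedback loops described above.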

Efficiently Scale and Drive Adoption with Your Operating Model

To efficiently scale your data life cycle and drive adoption, you need to focus on several key areas. Let’s dive into each of them:

  • Deploy into production: Once you have built and tested your solution, it’s time to deploy it into production. This step involves moving your solution from a development environment to a live environment where end users can access and utilize it.
  • Evaluate production results: After deploying your solution, it is crucial to evaluate its performance in the production environment. Monitor key metrics and gather feedback from users to identify any issues or areas for improvement.
  • Socialize data observability and FinOps best practices: To ensure the success of your operating model, it is essential to socialize data observability and FinOps best practices among your team. This involves promoting transparency, accountability, and efficiency in managing data operations.
  • Acknowledge engineers who “shift left” performance and efficiency: Recognize and reward engineers who prioritize performance and efficiency early in the development process. Encourage a culture of proactive optimization by acknowledging those who contribute to improving the overall effectiveness of the data life cycle.
  • Manage access, incidents, support, and feature requests: Efficiently scaling your operating model requires effective management of access permissions, incident handling processes, support systems, and feature requests. Streamline these processes to ensure smooth operations while accommodating user needs.
  • Track progress towards business outcomes by measuring and sharing KPIs: Measuring key performance indicators (KPIs) is vital for tracking progress towards business outcomes. Regularly measure relevant metrics related to adoption rates, user satisfaction levels, cost savings achieved through efficiency improvements, etc., then share this information across teams for increased visibility.

By implementing these strategies within your operating model, you can efficiently scale your data life cycle while driving adoption among users. Remember that continuous evaluation and improvement are critical for optimizing performance throughout the life cycle.

Unravel for Databricks now free!
Create free account

Drive for Performance with Purpose-Built AI

Unravel helps with many elements of the Databricks operating model:

  • Quickly identify failed and inefficient Databricks jobs: One of the key challenges is identifying failed and inefficient Databricks jobs. However, with AI purpose-built for Databricks, this task becomes much easier. By leveraging advanced analytics and monitoring capabilities, you can quickly pinpoint any issues in your job executions.
  • Creating ML models vs deploying them into production: Creating machine learning models is undoubtedly challenging, but deploying them into production is even harder. It requires careful consideration of factors like scalability, performance, and reliability. With purpose-built AI tools, you can streamline the deployment process by automating various tasks such as model versioning, containerization, and orchestration.
  • Leverage Unravel’s Analysis tab for insights: To gain deeper insights into your application’s performance during job execution, leverage the analysis tab provided by purpose-built AI solutions. This feature allows you to examine critical details like memory usage errors or other bottlenecks that may be impacting job efficiency.

    Unravel’s AI-powered analysis automatically provides deep, actionable insights.

 

  • Share links for collaboration: Collaboration plays a crucial role in data management and infrastructure optimization. Unravel enables you to share links with data scientists, developers, and data engineers to provide detailed information about specific test runs or failed Databricks jobs. This promotes collaboration and facilitates a better understanding of why certain jobs may have failed.
  • Cloud data cost management made easy: Cloud cost management, also known as FinOps, is another essential aspect of data life cycle management. Purpose-built AI solutions simplify this process by providing comprehensive insights into cost drivers within your Databricks environment. You can identify the biggest cost drivers such as users, clusters, jobs, and code segments that contribute significantly to cloud costs.
  • AI recommendations for optimization: To optimize your data infrastructure further, purpose-built AI platforms offer valuable recommendations across various aspects, including infrastructure configuration, parallelism settings, handling data skewness issues, optimizing Python/SQL/Scala/Java code snippets, and more. These AI-driven recommendations help you make informed decisions to enhance performance and efficiency.

Learn More & Next Steps

Unravel hosted a virtual roundtable, Accelerate the Data Analytics Life Cycle, with a panel of Unravel and Databricks experts. Unravel VP Clinton Ford moderated the discussion with Sanjeev Mohan, principal at SanjMo and former VP at Gartner, Subramanian Iyer, Unravel training and enablement leader and Databricks SME, and Don Hilborn, Unravel Field CTO and former Databricks lead strategic solutions architect.

FAQs

How can I implement a scalable operating model for my data life cycle?

To implement a scalable operating model for your data life cycle, start by clearly defining roles and responsibilities within your organization. Establish efficient processes and workflows that enable seamless collaboration between different teams involved in managing the data life cycle. Leverage automation tools and technologies to streamline repetitive tasks and ensure consistency in data management practices.

What are some key considerations during the Business Requirements Phase?

During the Business Requirements Phase, it is crucial to engage stakeholders from various departments to gather comprehensive requirements. Clearly define project objectives, deliverables, timelines, and success criteria. Conduct thorough analysis of existing systems and processes to identify gaps or areas for improvement.

How can I drive adoption of my data life cycle operational model?

To drive adoption of your data management solution, focus on effective change management strategies. Communicate the benefits of the solution to all stakeholders involved and provide training programs or resources to help them understand its value. Encourage feedback from users throughout the implementation process and incorporate their suggestions to enhance usability and address any concerns.

What role does AI play in optimizing the data life cycle?

AI can play a significant role in optimizing the data life cycle by automating repetitive tasks, improving data quality through advanced analytics and machine learning algorithms, and providing valuable insights for decision-making. AI-powered tools can help identify patterns, trends, and anomalies in large datasets, enabling organizations to make data-driven decisions with greater accuracy and efficiency.

How do I ensure performance while implementing purpose-built AI?

To ensure performance while implementing purpose-built AI, it is essential to have a well-defined strategy. Start by clearly defining the problem you want to solve with AI and set measurable goals for success. Invest in high-quality training data to train your AI models effectively. Continuously monitor and evaluate the performance of your AI system, making necessary adjustments as needed.

The post Rev Up Your Lakehouse: Lap the Field with a Databricks Operating Model appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/rev-up-your-lakehouse-lap-the-field-with-a-databricks-operating-model/feed/ 0
Overcoming Friction & Harnessing the Power of Unravel: Try It for Free https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/ https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/#respond Wed, 11 Oct 2023 13:52:00 +0000 https://www.unraveldata.com/?p=13911

Overview In today’s digital landscape, data-driven decisions form the crux of successful business strategies. However, the path to harnessing data’s full potential is strewn with challenges. Let’s delve into the hurdles organizations face and how Unravel […]

The post Overcoming Friction & Harnessing the Power of Unravel: Try It for Free appeared first on Unravel.

]]>

Overview

In today’s digital landscape, data-driven decisions form the crux of successful business strategies. However, the path to harnessing data’s full potential is strewn with challenges. Let’s delve into the hurdles organizations face and how Unravel is the key to unlocking seamless data operations.

The Roadblocks in the Fast Lane of Data Operations

In today’s data-driven landscape, organizations grapple with erratic spending, cloud constraints, AI complexities, and prolonged MTTR, urgently seeking solutions to navigate these challenges efficiently. The four most common roadblocks are:

  • Data Spend Forecasting: Imagine a roller coaster with unpredictable highs and lows. That’s how most organizations view their data spend forecasting. Such unpredictability wreaks havoc on financial planning, making operational consistency a challenge.
  • Constraints in Adding Data Workloads: Imagine tying an anchor to a speedboat. That’s how the constraints feel when trying to adopt cloud data solutions, holding back progress and limiting agility.
  • Surge in AI Model Complexity: AI’s evolutionary pace is exponential. As it grows, so do the intricacies surrounding data volume and pipelines, which strain budgetary limitations.
  • The MTTR Bottleneck: The multifaceted nature of modern tech stacks means longer Mean Time to Repair (MTTR). This slows down processes, consumes valuable resources, and stalls innovation.

By acting as a comprehensive data observability and FinOps solution, Unravel Data empowers businesses to move past the challenges and frictions that typically hinder data operations, ensuring smoother, more efficient data-driven processes. Here’s how Unravel Data aids in navigating the roadblocks in the high-speed lane of data operations:

  • Predictive Data Spend Forecasting: With its advanced analytics, Unravel Data can provide insights into data consumption patterns, helping businesses forecast their data spending more accurately. This eliminates the roller coaster of unpredictable costs.
  • Simplifying Data Workloads: Unravel Data optimizes and automates workload management. Instead of being anchored down by the weight of complex data tasks, businesses can efficiently run and scale their data processes in the cloud.
  • Managing AI Model Complexity: Unravel offers visibility and insights into AI data pipelines. Analyzing and optimizing these pipelines ensure that growing intricacies do not overwhelm resources or budgets.
  • Reducing MTTR: By providing a clear view of the entire tech stack and pinpointing bottlenecks or issues, Unravel Data significantly reduces Mean Time to Repair. With its actionable insights, teams can address problems faster, reducing downtime and ensuring smooth data operations.
  • Streamlining Data Pipelines: Unravel Data offers tools to diagnose and improve data pipeline performance. This ensures that even as data grows in volume and complexity, pipelines remain efficient and agile.
  • Efficiency and ROI: With its clear insights into resource consumption and requirements, Unravel Data helps businesses run 50% more workloads in their existing Databricks environments, ensuring they only pay for what they need, reducing wastage and excess expenditure.

The Skyrocketing Growth of Cloud Data Management

As the digital realm expands, cloud data management usage is soaring, with data services accounting for a significant chunk. According to the IDC, the public cloud IaaS and PaaS market is projected to reach $400 billion by 2025, growing at a 28.8% CAGR from 2021 to 2025. Some highlights are:

  • Data management and application development account for 39% and 20% of the market, respectively, and are the main workloads backed by PaaS solutions, capturing a major share of PaaS revenue.
  • In IaaS revenue, IT infrastructure leads with 25%, trailed by business applications (21%) and data management (20%).
  • Unstructured data analytics and media streaming are predicted to be the top-growing segments with CAGRs of 41.9% and 41.2%, respectively.
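
Taken at face value, those headline figures make the growth trajectory easy to sanity-check. The quick compounding sketch below assumes the $400 billion endpoint and 28.8% CAGR quoted above and derives the implied 2021 base for illustration; it is not a figure from the IDC report.

```python
# Rough compounding check on the projection cited above.
# Assumptions: $400B is the 2025 endpoint and 28.8% is the annual growth
# rate over 2021-2025 (four compounding periods). The 2021 base is derived
# here for illustration, not quoted from IDC.

end_value_2025 = 400.0   # $B, projected market size
cagr = 0.288             # 28.8% compound annual growth rate
years = 4                # 2021 -> 2025

implied_2021_base = end_value_2025 / (1 + cagr) ** years
print(f"Implied 2021 market size: ~${implied_2021_base:.0f}B")

# Year-by-year trajectory under constant 28.8% growth
value = implied_2021_base
for year in range(2021, 2026):
    print(year, f"~${value:.0f}B")
    value *= 1 + cagr
```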

Unravel provides a comprehensive solution to address the growth associated with cloud data management. Here’s how:

  • Visibility and Transparency: Unravel offers in-depth insights into your cloud operations, allowing you to understand where and how costs are accruing, ensuring no hidden fees or unnoticed inefficiencies.
  • Optimization Tools: Through its suite of analytics and AI-driven tools, Unravel pinpoints inefficiencies, recommends optimizations, and automates the scaling of resources to ensure you’re only using (and paying for) what you need.
  • Forecasting: With predictive analytics, Unravel provides forecasts of data usage and associated costs, enabling proactive budgeting and financial planning.
  • Workload Management: Unravel ensures that data workloads run efficiently and without wastage, reducing both computational costs and storage overhead.
  • Performance Tuning: By optimizing query performance and data storage strategies, Unravel ensures faster results using fewer resources, translating into the ability to run 50% more workloads on the same infrastructure.
  • Monitoring and Alerts: Real-time monitoring paired with intelligent alerts ensures that any resource-intensive operations or anomalies are flagged immediately, allowing for quick intervention and rectification.

By employing these strategies and tools, Unravel acts as a financial safeguard for businesses, ensuring that the ever-growing cloud data bill remains predictable, manageable, and optimized for efficiency.

The Tightrope Walk of Efficiency Tuning and Talent

Modern enterprises hinge on data and AI, but shrinking budgets and talent gaps threaten them. Gartner pinpoints overprovisioning and skills shortages as major roadblocks, while Google and IDC underscore the high demand for data analytics skills and the untapped potential of unstructured data. Here are some of the problems modern organizations face:

  • Production environments are statically overprovisioned and therefore underutilized. On-premises, 30% utilization is common, but it’s all capital expenditures (capex), and as long as it’s within budget, no one has traditionally cared about the waste. However, in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste. – Gartner
  • The cloud skills gap has reached a crisis level in many organizations – Gartner
  • Revenue creation through digital transformation requires talent engagement that is currently scarce and difficult to acquire and maintain. – Gartner
  • Lack of skills remains the biggest barrier to infrastructure modernization initiatives, with many organizations finding they cannot hire outside talent to fill these skills gaps. IT organizations will not succeed unless they prioritize organic skills growth. – Gartner
  • Data analytics skills are in demand across industries as businesses of all types around the world recognize that strong analytics improve business performance.- Google via Coursera

Unravel Data addresses the delicate balancing act of budget and talent in several strategic ways:

  • Operational Efficiency: Purpose-built AI provides actionable insights into data operations across Databricks, Spark, EMR, BigQuery, Snowflake, and more, reducing the need for trial-and-error and time-consuming manual intervention. At the core of Unravel’s data observability platform is our AI-powered Insights Engine, which combines AI techniques, algorithms, and tools to process and analyze vast amounts of data, learn from patterns, and make predictions or decisions based on that learning. This not only improves operational efficiency but also ensures that talented personnel spend their time innovating rather than on routine tasks.
  • Skills Gap Bridging: The platform’s intuitive interface and AI-driven insights mean that even those without deep expertise in specific data technologies can navigate, understand, and optimize complex data ecosystems. This eases the pressure to hire or train ultra-specialized talent.
  • Predictive Analysis: With Unravel’s ability to predict potential bottlenecks or inefficiencies, teams can proactively address issues, leading to more efficient budget allocation and resource utilization.
  • Cost Insights: Unravel provides detailed insights into the efficiency of various data operations, allowing organizations to make informed decisions on where to invest and where to cut back.
  • Automated Optimization: By automating many of the tasks traditionally performed by data engineers, like performance tuning or troubleshooting, Unravel ensures teams can do more with less, optimizing both budget and talent.
  • Talent Focus Shift: With mundane tasks automated and insights available at a glance, skilled personnel can focus on higher-value activities, like data innovation, analytics, and strategic projects.

By enhancing efficiency, providing clarity, and streamlining operations, Unravel Data ensures that organizations can get more from their existing budgets while maximizing the potential of their talent, turning the tightrope walk into a more stable journey.

The Intricacies of Data-Centric Organizations

Data-centric organizations grapple with the complexities of managing vast and fast-moving data in the digital age. Ensuring data accuracy, security, and compliance, while integrating varied sources, is challenging. They must balance data accessibility with protecting sensitive information, all while adapting to evolving technologies, addressing talent gaps, and extracting actionable insights from their data reservoirs. Here is some relevant research on the topic:

  • “Data is foundational to AI” yet “unstructured data remains largely untapped.” – IDC
  • Even as organizations rush to adopt data-centric operations, challenges persist. For instance, manufacturing data projects often hit roadblocks due to outdated legacy technology, as observed by the World Economic Forum.
  • Generative AI is supported by large language models (LLMs), which require powerful and highly scalable computing capabilities to process data in real-time. – Gartner

Unravel Data provides a beacon for data-centric organizations amid complex challenges. Offering a holistic view of data operations, it simplifies management using AI-driven tools. It ensures data security, accessibility, and optimized performance. With its intuitive interface, Unravel bridges talent gaps and navigates the data maze, turning complexities into actionable insights.

Embarking on the Unravel Journey: Your Step-By-Step Guide

  • Beginning your data journey with Unravel is as easy as 1-2-3. We guide you through the sign-up process, ensuring a smooth and hassle-free setup.
  • Visit the Unravel for Databricks page to get started with the free version.

Level Up with Unravel Premium

Ready for an enhanced data experience? Unravel’s premium account offers a plethora of advanced features that the free version can’t match. Investing in this upgrade isn’t just about more tools; it’s about supercharging your data operations and ROI.

Wrap-Up

Although rising demands on the modern data landscape are challenging, they are not insurmountable. With tools like Unravel, organizations can navigate these complexities, ensuring that data remains a catalyst for growth, not a hurdle. Dive into the Unravel experience and redefine your data journey today.

Unravel acts as a business’s performance sentinel in the cloud, proactively ensuring that burgeoning cloud data expenses are not only predictable and manageable but also primed for significant savings. It transforms the precarious balance of budget and talent into a streamlined, efficient journey, and it lights the path for data-centric organizations by streamlining operations with AI tools, safeguarding data security, and optimizing performance. Its intuitive interface simplifies complex data landscapes, bridging talent gaps and converting challenges into actionable insights.

The post Overcoming Friction & Harnessing the Power of Unravel: Try It for Free appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/feed/ 0
Blind Spots in Your System: The Grave Risks of Overlooking Observability https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/ https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/#respond Mon, 28 Aug 2023 20:14:12 +0000 https://www.unraveldata.com/?p=13692 Columbia shuttle

I was an Enterprise Data Architect at Boeing NASA Systems the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on my career and on how I look at data. Recently I have […]

The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

]]>
Columbia shuttle

I was an Enterprise Data Architect at Boeing NASA Systems the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on my career and on how I look at data. Recently I have been struck by how, had it been available back then, purpose-built AI observability might have altered the course of events.

The Day of the Disaster

With complete reverence and respect for the Columbia disaster, I can remember that day with incredible clarity. I was standing in front of my TV watching the Shuttle Columbia disintegrate on reentry. As the Enterprise Data Architect at Boeing NASA Systems, I received a call from my boss: “We have to pull and preserve all the Orbiter data. I will meet you at the office.” Internally, Boeing referred to the Shuttles as Orbiters. It was only months earlier that I had the opportunity to share punch and cookies with some of the crew. To say I was flooded with emotions would be a tremendous understatement.

When I arrived at Boeing NASA Systems, near Johnson Space Center, the mood in the hallowed aeronautical halls was somber. I sat intently, with eyes scanning lines of data from decades of space missions. The world had just witnessed a tragedy — the Columbia Shuttle disintegrated on reentry, taking the lives of seven astronauts. The world looked on in shock and sought answers. As a major contractor for NASA, Boeing was at the forefront of this investigation.

As the Enterprise Data Architect, I was one of an army helping dissect the colossal troves of data associated with the shuttle, looking for anomalies, deviations — anything that could give a clue as to what had gone wrong. Days turned into nights, and nights turned into weeks and months as we tirelessly pieced together Columbia’s final moments. But as we delved deeper, a haunting reality began to emerge.

Every tiny detail of the shuttle was monitored, from the heat patterns of its engines to the radio signals it emitted. But there was a blind spot, an oversight that no one had foreseen. In the myriad of data sets, there was nothing that indicated the effects of a shuttle’s insulation tiles colliding with a piece of Styrofoam, especially at speeds exceeding 500 miles per hour.

The actual incident was seemingly insignificant — a piece of foam insulation had broken off and struck the shuttle’s left wing. But in the vast expanse of space and the brutal conditions of reentry, even minor damage could prove catastrophic.

Video footage confirmed: the foam had struck the shuttle. But without concrete data on what such an impact would do, the team was left to speculate and reconstruct potential scenarios. The lack of this specific data had turned into a gaping void in the investigation.

As a seasoned Enterprise Data Architect, I always believed in the power of information. I absolutely believed that in the numbers, in the bytes and bits, we find the stories that the universe whispers to us. But this time, the universe had whispered a story that we didn’t have the data to understand fully.

Key Findings

After the accident, NASA formed the Columbia Accident Investigation Board (CAIB) to investigate the disaster. The board consisted of experts from various fields outside of NASA to ensure impartiality.

1. Physical Cause: The CAIB identified the direct cause of the accident as a piece of foam insulation from the shuttle’s external fuel tank that broke off during launch. This foam struck the leading edge of the shuttle’s left wing. Although foam shedding was a known issue, it had been wrongly perceived as non-threatening because of prior flights where the foam was lost but didn’t lead to catastrophe.

2. Organizational Causes: Beyond the immediate physical cause, the CAIB highlighted deeper institutional issues within NASA. They found that there were organizational barriers preventing effective communication of safety concerns. Moreover, safety concerns had been normalized over time due to prior incidents that did not result in visible failure. Essentially, since nothing bad had happened in prior incidents where the foam was shed, the practice had been erroneously deemed “safe.”

3. Decision Making: The CAIB pointed out issues in decision-making processes that relied heavily on past success as a predictor of future success, rather than rigorous testing and validation.

4. Response to Concerns: There were engineers who were concerned about the foam strike shortly after Columbia’s launch, but their requests for satellite imagery to assess potential damage were denied. The reasons were multifaceted, ranging from beliefs that nothing could be done even if damage was found, to a misunderstanding of the severity of the situation.

The CAIB made a number of recommendations to NASA to improve safety for future shuttle flights. These included:

1. Physical Changes: Improving the way the shuttle’s external tank was manufactured to prevent foam shedding and enhancing on-orbit inspection and repair techniques to address potential damage.

2. Organizational Changes: Addressing the cultural issues and communication barriers within NASA that led to the accident, and ensuring that safety concerns were more rigorously addressed.

3. Continuous Evaluation: Establishing an independent Technical Engineering Authority responsible for technical requirements and all waivers to them, and building an independent safety program that oversees all areas of shuttle safety.

Could Purpose-built AI Observability Have Helped?

In the aftermath, NASA grounded the shuttle fleet for more than two years and implemented the CAIB’s recommendations before resuming shuttle flights. Columbia’s disaster, along with the Challenger explosion in 1986, is a stark reminder of the risks of space travel and the importance of a diligent and transparent safety culture. The lessons from Columbia shaped many of the safety practices NASA follows in its current human spaceflight programs.

The Columbia disaster led to profound changes in how space missions were approached, with a renewed emphasis on data collection and eliminating informational blind spots. But for me, it became a deeply personal mission. I realized that sometimes, the absence of data could speak louder than the most glaring of numbers. It was a lesson I would carry throughout my career, ensuring that no stone was left unturned, and no data point overlooked.

The Columbia disaster, at its core, was a result of both a physical failure (foam insulation striking the shuttle’s wing) and organizational oversights (inadequate recognition and response to potential risks). Purpose-built AI Observability, which involves leveraging artificial intelligence to gain insights into complex systems and predict failures, might have helped in several key ways:

1. Real-time Anomaly Detection: Modern AI systems can analyze vast amounts of data in real time to identify anomalies. If an AI-driven observability platform had been monitoring the shuttle’s various sensors and systems, it might have detected unexpected changes or abnormalities in the shuttle’s behavior after the foam strike, potentially even subtle ones that humans might overlook.

2. Historical Analysis: An AI system with access to all previous shuttle launch and flight data might have detected patterns or risks associated with foam-shedding incidents, even if they hadn’t previously resulted in a catastrophe. The system could then raise these as potential long-term risks.

3. Predictive Maintenance: AI-driven tools can predict when components of a system are likely to fail based on current and historical data. If applied to the shuttle program, such a system might have provided early warnings about potential vulnerabilities in the shuttle’s design or wear-and-tear.

4. Decision Support: AI systems could have aided human decision-makers in evaluating the potential risks of continuing the mission after the foam strike, providing simulations, probabilities of failure, or other key metrics to help guide decisions.

5. Enhanced Imaging and Diagnosis: If equipped with sophisticated imaging capabilities, AI could analyze images of the shuttle (from external cameras or satellites) to detect potential damage, even if it’s minor, and then assess the risks associated with such damage.

6. Overcoming Organizational Blind Spots: One of the major challenges in the Columbia disaster was the normalization of deviance, where foam shedding became an “accepted” risk because it hadn’t previously caused a disaster. An AI system, being objective, doesn’t suffer from these biases. It would consistently evaluate risks based on data, not on historical outcomes.

7. Alerts and Escalations: An AI system can be programmed to escalate potential risks to higher levels of authority, ensuring that crucial decisions don’t get caught in bureaucratic processes.

While AI Observability could have provided invaluable insights and might have changed the course of events leading to the Columbia disaster, it’s essential to note that the integration of such AI systems also requires organizational openness to technological solutions and a proactive attitude toward safety. The technology is only as effective as the organization’s willingness to act on its findings.

The tragedy served as a grim reminder for organizations worldwide: It’s not just about collecting data; it’s about understanding the significance of what isn’t there. Because in those blind spots, destiny can take a drastic turn.

In Memory and In Action

The Columbia crew and their families deserve our utmost respect and admiration for their unwavering commitment to space exploration and the betterment of humanity. 

  • Rick D. Husband: As the Commander of the mission, Rick led with dedication, confidence, and unparalleled skill. His devotion to space exploration was evident in every decision he made. We remember him not just for his expertise, but for his warmth and his ability to inspire those around him. His family’s strength and grace, in the face of the deepest pain, serve as a testament to the love and support they provided him throughout his journey.
  • William C. McCool: As the pilot of Columbia, William’s adeptness and unwavering focus were essential to the mission. His enthusiasm and dedication were contagious, elevating the spirits of everyone around him. His family’s resilience and pride in his achievements are a reflection of the man he was — passionate, driven, and deeply caring.
  • Michael P. Anderson: As the payload commander, Michael’s role was vital, overseeing the myriad of experiments and research aboard the shuttle. His intellect was matched by his kindness, making him a cherished member of the team. His family’s courage and enduring love encapsulate the essence of Michael’s spirit — bright, optimistic, and ever-curious.
  • Ilan Ramon: As the first Israeli astronaut, Ilan represented hope, unity, and the bridging of frontiers. His enthusiasm for life was infectious, and he inspired millions with his historic journey. His family’s grace in the face of the unthinkable tragedy is a testament to their shared dream and the values that Ilan stood for.
  • Kalpana Chawla: Known affectionately as ‘KC’, Kalpana’s journey from a small town in India to becoming a space shuttle mission specialist stands as an inspiration to countless dreamers worldwide. Her determination, intellect, and humility made her a beacon of hope for many. Her family’s dignity and strength, holding onto her legacy, reminds us all of the power of dreams and the sacrifices made to realize them.
  • David M. Brown: As a mission specialist, David brought with him a zest for life, a passion for learning, and an innate curiosity that epitomized the spirit of exploration. He ventured where few dared and achieved what many only dreamt of. His family’s enduring love and their commitment to preserving his memory exemplify the close bond they shared and the mutual respect they held for each other.
  • Laurel B. Clark: As a mission specialist, Laurel’s dedication to scientific exploration and discovery was evident in every task she undertook. Her warmth, dedication, and infectious enthusiasm made her a beloved figure within her team and beyond. Her family’s enduring spirit, cherishing her memories and celebrating her achievements, is a tribute to the love and support that were foundational to her success.

To each of these remarkable individuals and their families, we extend our deepest respect and gratitude. Their sacrifices and contributions will forever remain etched in the annals of space exploration, reminding us of the human spirit’s resilience and indomitable will.

For those of us close to the Columbia disaster, it was more than a failure; it was a personal loss. Yet, in memory of those brave souls, we are compelled to look ahead. In the stories whispered to us by data, and in the painful lessons from their absence, we seek to ensure that such tragedies remain averted in the future.

While no technology can turn back time, the promise of AI Observability beckons a future where every anomaly is caught, every blind spot illuminated, and every astronaut returns home safely.

The above narrative seeks to respect the gravity of the Columbia disaster while emphasizing the potential of AI Observability. It underlines the importance of data, both in understanding tragedies and in preventing future ones.

The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/feed/ 0
DBS Bank Uplevels Individual Engineers at Scale https://www.unraveldata.com/resources/dbs-bank-uplevels-individual-engineers-at-scale/ https://www.unraveldata.com/resources/dbs-bank-uplevels-individual-engineers-at-scale/#respond Tue, 11 Jul 2023 12:56:46 +0000 https://www.unraveldata.com/?p=12909

DBS Bank leverages Unravel to identify inefficiencies across 10,000s of data applications/pipelines and guide individual developers and engineers on how, where, and what to improve—no matter what technology or platform they’re using.   DBS Bank is […]

The post DBS Bank Uplevels Individual Engineers at Scale appeared first on Unravel.

]]>

DBS Bank leverages Unravel to identify inefficiencies across 10,000s of data applications/pipelines and guide individual developers and engineers on how, where, and what to improve—no matter what technology or platform they’re using.  

DBS Bank

DBS Bank is one of the largest financial services institutions in Asia—the biggest in Southeast Asia—with a presence in 19 markets. Headquartered in Singapore, DBS is recognized globally for its technological innovation and leadership, having been named World’s Best Bank by both Global Finance and Euromoney, Global Bank of the Year by The Banker, World’s Best Digital Bank by Euromoney (three times), and one of the world’s top 10 business transformations of the decade by Harvard Business Review. In addition, DBS has been given Global Finance’s Safest Bank in Asia award for 14 consecutive years. 

DBS is, in its words, “redefining the future of banking as a technology company.” Says the bank’s Executive Director of Automation, Infrastructure for Big Data, AI, and Analytics Luis Carlos Cruz Huertas, “Technology is in our DNA, from the top down. Six years ago, when we started our data journey, the philosophy was that we’re going to be a data-driven organization, 100%. Everything we decide is through data.” 

Realizing innovation and efficiency at scale and speed

DBS backs up its commitment to being a leading data-forward company. Almost 40% of the bank’s employees—some 23,000—are native developers. The volume, variety, and velocity of DBS’ data products, initiatives, and analytics fuel an incredibly complex environment. The bank has developed more than 200 data-use cases to deliver business value: detecting fraud, providing value to customers via their DBS banking app, and providing hyper personalized “nudges” that guide customers in making more informed banking and investment decisions.

As with all financial services enterprises, the DBS data estate is a multi-everything mélange: on-prem, hybrid, and several cloud providers; all the principal platforms; open source, commercial, and proprietary solutions; and a wide variety of technologies, old and new. Adding to this complexity are the various country-specific requirements and restrictions throughout the bank’s markets. As Luis says, “There are many ways to do the cloud, but there’s also the banking way to do cloud. Finance is ring-fenced by regulation all over the world. To do cloud in Asia is quite complex. We don’t [always] get to choose the cloud, sometimes the cloud chooses us.” (For example, India now requires all data to run in India. Google was the only cloud available in Indonesia at the time. Data cannot leave China; there are no hyperscalers there other than Chinese.)

Watch Luis’ full talk, “Leading Cultural Change for Data Efficiency, Agility and Cost Optimization,” with Unravel CEO and co-founder Kunal Agarwal
See the conversation here
complex data pipeline

The pace of today’s innovation at a data-forward bank like DBS makes ensuring efficient, reliable performance an ever-growing challenge.

Today DBS runs more than 40,000 data jobs, with tens of thousands of ML models. “And that actually keeps growing,” says Luis. “We have exponentially grown—almost 100X what we used to run five years ago. There’s a lot of innovation we’re bringing, month to month.”

As Luis points out, “Sometimes innovation is a disruptor of stability.” With thousands  of developers all using whatever technologies are best-suited for their particular workload, DBS has to contend with the triple-headed issues of exponentially increasing data volumes (think: OpenAI); increasingly more technologies, platforms, and cloud providers; and the ever-present challenge of having enough skilled people with enough time to ensure that everything is running reliably and efficiently on time, every time—at population scale.  

DBS empowers self-service optimization

One of DBS’ lessons learned in its modern data journey is that, as Luis says, “a lot of the time the efficiencies relied on the engineering team to fix the problems. We didn’t really have a viable mechanism to put into the [business] users’ hands for them to analyze and understand their code, their deficiencies, and how to improve.”

That’s where Unravel comes in. The Unravel platform harnesses deep full-stack visibility, contextual awareness, AI-powered intelligence, and automation to not only quickly show what’s going on and why, but provide crisp, prescriptive recommendations on exactly how to make things better and then keep them that way proactively.

“In Unravel, [we] have a source to identify areas of improvement for our current operational jobs. By leveraging Unravel, we’ve been able to put an analytical tool in the hands of business users to evaluate themselves before they actually do a code release. [They now get] a lot of additional insights they can use to learn, and it puts a lot more responsibility into how they’re doing their job.”

AI recommendation to pinpoint code problems

To learn more about how Unravel puts self-service performance and cost optimization capabilities into the hands of individual engineers, see the Unravel Platform Overview.

Marrying data engineering and data science

DBS is perpetually innovating with data—40% of the bank’s workforce are data science developers—yet the scale and speed of DBS’ data operations bring the data engineering side of things to the forefront. As Luis points out, there are thousands of developers committing code but only a handful of data engineers to make sure everything runs reliably and efficiently.

While few organizations have such a high percentage of developers, virtually everyone has the same lopsided developer : engineer ratio. Tackling this reality is at the heart of DataOps and means new roles, where data science and data engineering meet. Unravel has helped facilitate this new intersection of roles and responsibilities by making it easier to pinpoint code inefficiencies and provide insights into what, where, and how to optimize code — without having to “throw it over the fence” to already overburdened engineers.

Luis discusses how DBS addresses the issue: “In hardware, you have something called system performance engineers. They’re dedicated to optimizing and tuning how a processor operates, until it’s pristine. I said, ‘Why don’t we apply that concept and bring it over to data?’ What we need is a very good data scientist who wants to learn data engineering. And a data engineer who wants to become a data scientist. Because I need to connect and marry both worlds. It’s the only way a person can actually say, ‘Oh, this is why pandas doesn’t work in Spark.’ For a data engineer, it’s very clear. For a data scientist, it’s not. Because the first thing they learned in Python is pandas. And they love it. And Spark hates it. It’s a perfect divorce. So we need to teach data scientists how to break their pandas into Spark. Our system performance engineers became masters at doing this. They use Unravel to highlight all the data points that we need to attack in the code, so let’s just go see it. 

“We call data jobs ‘mutants,’ because they mutate from the hands of one data scientist to another. You can literally see the differences on how they write the code. Some of it might be perfect, with Markdown files, explainability, and then there’s an entire chunk of code that doesn’t have that. So we tune and optimize the code. Unravel helps us facilitate the journey on deriving what to do first, what to attack first, from the optimization perspective.”
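
As a rough illustration of the pandas-to-Spark translation Luis describes, the sketch below expresses the same aggregation both ways. The column names and data are hypothetical, not DBS code; it simply shows how a single-machine pandas idiom maps onto a distributed Spark DataFrame operation.

```python
# Hypothetical example of "breaking pandas into Spark": the same
# aggregation written both ways. Column names and data are made up.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

rows = [("retail", 120.0), ("retail", 80.0), ("wealth", 300.0)]

# pandas: convenient, but limited to a single machine's memory
pdf = pd.DataFrame(rows, columns=["segment", "txn_amount"])
pandas_result = pdf.groupby("segment")["txn_amount"].mean()

# PySpark: the equivalent logic, executed as a distributed job
spark = SparkSession.builder.appName("pandas-to-spark-sketch").getOrCreate()
sdf = spark.createDataFrame(rows, ["segment", "txn_amount"])
spark_result = sdf.groupBy("segment").agg(F.avg("txn_amount").alias("avg_txn"))

print(pandas_result)
spark_result.show()
```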

DBS usually builds their own products—why did they buy Unravel?

DBS has made the bold decision to develop its own proprietary technology stack as wrappers for its governance, control plane, data plane, etc. “We ring-fence all compute, storage, services—every resource the cloud provides. The reason we create our own products is that we’ve been let down way too many times. We don’t really control what happens in the market. It’s the world we live in. We accept that. Our data protection mechanism used to be BlueTalon, which was then acquired by Microsoft. And Microsoft decided to dispose of [BlueTalon] and use their own product. Our entire security framework depended on BlueTalon. . . . We decided to just build our own [framework].

“In a way DBS protects itself from being forced to just do what the providers want us to do. And there’s a resistance from us—a big resistance—to oblige that.” Luis uses a cooking analogy to describe the DBS approach. At one extreme is home cooking, where you go buy all the ingredients and make meals yourself. At the other end of the spectrum is going out to restaurants, where you choose items from the menu and you get a meal the way they prepare it. The DBS data platform is more like a cooking studio—Luis  provides the kitchen, developers pick the ingredients they want, and then cook the recipe with their own particular flavor. “That’s what we provide to our lines of business,” says Luis. “The good thing is, we can plug in anything new at any given point in time. The downside is that you need very, very good technical people. How do we sustain the pace of rebuilding new products, and keep up with the open source side, which moves astronomically fast—we’re very connected to the open source mission, close to 50% of our developers contribute to the open source world—and at the same time keep our [internal and external] customers happy?”

How does DBS empower self-service engineering with Unravel?
Find out here

Luis explains the build vs. buy decision boils down to whether the product “drives the core of what we do, whether it’s the ‘spinal cord’ of what we do. We’ve had Unravel for a long, long time. It’s one of those products where we said, ‘Should we do it [i.e., build it ourselves]? Or should we find something on the market?’ With Unravel, it was an easy choice to say, ‘If they can do what they claim they can do, bring ’em in.’ We don’t need to develop our own [data observability solution]. Unravel has demonstrated value in [eliminating] hours of toil—nondeterministic tasks performed by engineers because of repetitive incidents. So, long story short: that [Unravel] is still with us demonstrates that there is value from the lines of business.”

Luis emphasizes his data engineering team are not the users of Unravel. “The users are the business units that are creating their jobs. We just drive adoption to make sure everyone in the bank uses it. Because ultimately our data engineers cannot ‘overrun’ a troop of 3,000 data scientists. So we gave [Unravel] to them, they can use it, they can optimize, and control their own fate.”

DBS realizes high ROI from Unravel

Unravel’s granular data observability, automation, and AI enable DBS to control its cloud data usage at the speed and scale demanded by today’s FinServ environment. “Unravel has saved potentially several $100,000s in [unnecessary] compute capacity,” says Luis. “This is a very big transformational year for us. We’re moving more and more into some sort of data mesh and to a more containerized world. Unravel is part of that journey and we expect that we’ll get the same level of return we’ve gotten so far—or even better.” 

 

The post DBS Bank Uplevels Individual Engineers at Scale appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dbs-bank-uplevels-individual-engineers-at-scale/feed/ 0
Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/ https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/#respond Wed, 21 Jun 2023 21:58:08 +0000 https://www.unraveldata.com/?p=12972

Unravel Data Observability Platform earns prestigious industry recognition for Best Data Tool & Platform Palo Alto, CA — June 21, 2023 —Unravel Data, the first data observability platform built to meet the needs of modern data teams, […]

The post Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards appeared first on Unravel.

]]>

Unravel Data Observability Platform earns prestigious industry recognition for Best Data Tool & Platform

Palo Alto, CA — June 21, 2023 — Unravel Data, the first data observability platform built to meet the needs of modern data teams, was named Best Data Tool & Platform of 2023 as part of the annual SIIA CODiE Awards. The prestigious CODiE Awards recognize the companies producing the most innovative Business Technology products across the country and around the world.

“We are deeply honored to win a CODiE Award for Best Data Tool & Platform. Today, as companies put data products and AI/ML innovation front and center of their growth and customer service strategies, the volume of derailed projects and the costs associated are ascending astronomically. Companies need a way to increase performance of their data pipelines and a way to manage costs for effective ROI,” said Kunal Agarwal, CEO and co-founder, Unravel Data. “The Unravel Data Platform brings pipeline performance management and FinOps to the modern data stack. Our AI-driven Insights Engine provides recommendations that allow data teams to make smarter decisions that optimize pipeline performance along with the associated cloud data spend, making innovation more efficient for organizations.”

Cloud-first companies are seeing cloud data costs exceed 40% of their total cloud spend. These organizations lack the visibility into queries, code, configurations, and infrastructure required to manage data workloads effectively, which in turn, leads to over-provisioned capacity for data jobs, an inability to quickly detect pipeline failures and slowdowns, and wasted cloud data spend.

“The 2023 Business Technology CODiE Award Winners maintain the vital legacy of the CODiEs in spotlighting the best and most impactful apps, services and products serving the business tech market,” said SIIA President Chris Mohr. “We are so proud to recognize this year’s honorees – the best of the best! Congratulations to all of this year’s CODiE Award winners!”

The Software & Information Industry Association (SIIA), the principal trade association for the software and digital content industries, announced the full slate of CODiE winners during a virtual winner announcement. Awards were given for products and services deployed specifically for business technology professionals, including the top honor of the Best Overall Business Technology Solution.

A SIIA CODiE Award win is a prestigious honor, following rigorous reviews by expert judges whose evaluations determined the finalists. SIIA members then vote on the finalist products, and the scores from both rounds are tabulated to select the winners.

Details about the winning products can be found at https://siia.net/codie/2023-codie-business-technology-winners/.

Learn more about Unravel’s award-winning data observability platform.

About the CODiE Awards

The SIIA CODiE Awards is the only peer-reviewed program to showcase business and education technology’s finest products and services. Since 1986, thousands of products, services and solutions have been recognized for achieving excellence. For more information, visit siia.net/CODiE.

About Unravel Data

Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading data observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, Maersk, Mastercard, Equifax, and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit https://www.unraveldata.com.

Media Contact

Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/feed/ 0
Demystifying Data Observability https://www.unraveldata.com/resources/demystifying-data-observability/ https://www.unraveldata.com/resources/demystifying-data-observability/#respond Fri, 16 Jun 2023 02:44:21 +0000 https://www.unraveldata.com/?p=12828 Data Graph

Check out the 2023 Intellyx Analyst Guide for Unravel, Demystifying Data Observability, for an independent discussion on the specific requirements and bottlenecks of data-dependent applications/pipelines that are addressed by data observability. Discover: Why DataOps needs its […]

The post Demystifying Data Observability appeared first on Unravel.

]]>
Data Graph

Check out the 2023 Intellyx Analyst Guide for Unravel, Demystifying Data Observability, for an independent discussion on the specific requirements and bottlenecks of data-dependent applications/pipelines that are addressed by data observability.

Discover:

  • Why DataOps needs its own observability
  • How DevOps and DataOps are similar–and how they’re very different
  • How the emerging discipline of DataFinOps is more than cost governance
  • Unique considerations of DataFinOps
  • DataOps resiliency and tracking down toxic workloads

The post Demystifying Data Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/demystifying-data-observability/feed/ 0
The Modern Data Ecosystem: Choose the Right Instance https://www.unraveldata.com/resources/the-modern-data-ecosystem-choose-the-right-instance/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-choose-the-right-instance/#respond Thu, 01 Jun 2023 01:33:14 +0000 https://www.unraveldata.com/?p=12707

Overview: The Right Instance Type This is the first blog in a five-blog series. For an overview of this blog series please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of […]

The post The Modern Data Ecosystem: Choose the Right Instance appeared first on Unravel.

]]>

Overview: The Right Instance Type

This is the first blog in a five-blog series. For an overview of this blog series please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of Time.

Choosing the right instance type in the cloud is an important decision that can have a significant impact on the performance, cost, and scalability of your applications. Here are some steps to help you choose the right instance types:

  1. Determine your application requirements. Start by identifying the resource requirements of your application, such as CPU, memory, storage, and network bandwidth. You should also consider any special hardware or software requirements, such as GPUs for machine learning workloads.
  2. Evaluate the available instance types. Each cloud provider offers a range of instance types with varying amounts of resources, performance characteristics, and pricing. Evaluate the available instance types and their specifications to determine which ones best match your application requirements.
  3. Consider the cost. The cost of different instance types can vary significantly, so it’s important to consider the cost implications of your choices. You should consider not only the hourly rate but also the overall cost over time, taking into account any discounts or usage commitments.
  4. Optimize for scalability. As your application grows, you may need to scale up or out by adding more instances. Choose instance types that can easily scale horizontally (i.e., adding more instances) or vertically (i.e., upgrading to a larger instance type).
  5. Test and optimize. Once you have chosen your instance types, test your application on them to ensure that it meets your performance requirements. Monitor the performance of your application and optimize your instance types as necessary to achieve the best balance of performance and cost.

Choosing the right instance types in the cloud requires careful consideration of your application requirements, available instance types, cost, scalability, and performance. By following these steps, you can make informed decisions and optimize your cloud infrastructure to meet your business needs.

Determine Application Requirements

Determining your application’s CPU, memory, storage, and network requirements is a crucial step in choosing the right instance types in the cloud. Here are some steps to help you determine these requirements:

  1. CPU requirements. Start by identifying the CPU-intensive tasks in your application, such as video encoding, machine learning, or complex calculations. Determine the number of cores and clock speed required for these tasks, as well as any requirements for CPU affinity or hyperthreading.
  2. Memory requirements. Identify the memory-intensive tasks in your application, such as caching, database operations, or in-memory processing. Determine the amount of RAM required for these tasks, as well as any requirements for memory bandwidth or latency.
  3. Storage requirements. Determine the amount of storage required for your application data, as well as any requirements for storage performance (e.g., IOPS, throughput) or durability (e.g., replication, backup).
  4. Network requirements. Identify the network-intensive tasks in your application, such as data transfer, web traffic, or distributed computing. Determine the required network bandwidth, latency, and throughput, as well as any requirements for network security (e.g., VPN, encryption).

To determine these requirements, you can use various monitoring and profiling tools to analyze your application’s resource usage, such as CPU and memory utilization, network traffic, and storage I/O. You can also use benchmarks and performance tests to simulate different workloads and measure the performance of different instance types.
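
As one minimal sketch of that kind of profiling, the snippet below samples CPU, memory, and network counters with the psutil library; the sampling window and the metrics chosen are illustrative, and a real profiling run would cover representative workloads over a much longer period.

```python
# Minimal resource-usage sampler to help size CPU, memory, and network
# requirements. psutil is assumed to be installed; the short sampling
# window is an illustrative choice only.
import time
import psutil

samples = []
for _ in range(5):
    samples.append({
        "cpu_percent": psutil.cpu_percent(interval=1),          # CPU load over 1s
        "memory_used_gb": psutil.virtual_memory().used / 1e9,   # RAM in use
        "net_bytes_sent": psutil.net_io_counters().bytes_sent,  # cumulative network I/O
    })
    time.sleep(1)

peak_cpu = max(s["cpu_percent"] for s in samples)
peak_mem = max(s["memory_used_gb"] for s in samples)
print(f"Peak CPU: {peak_cpu:.0f}%  Peak memory: {peak_mem:.1f} GB")
```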

By understanding your application’s resource requirements, you can choose the right instance types in the cloud that provide the necessary CPU, memory, storage, and network resources to meet your application’s performance and scalability needs.

Evaluate the Available Instance Types

Evaluating the available instance types in a cloud provider requires careful consideration of several factors, such as the workload requirements, the performance characteristics of the instances, and the cost. Here are some steps you can take to evaluate the available instance types in a cloud provider:

  1. Identify your workload requirements. Before evaluating instance types, you should have a clear understanding of your workload requirements. For example, you should know the amount of CPU, memory, and storage your application needs, as well as any specific networking or GPU requirements.
  2. Review the instance types available. Cloud providers offer a range of instance types with varying configurations and performance characteristics. You should review the available instance types and select the ones that are suitable for your workload requirements.
  3. Evaluate performance characteristics. Each instance type has its own performance characteristics, such as CPU speed, memory bandwidth, and network throughput. You should evaluate the performance characteristics of each instance type to determine if they meet your workload requirements.
  4. Consider cost. The cost of each instance type varies based on the configuration and performance characteristics. You should evaluate the cost of each instance type and select the ones that are within your budget.
  5. Conduct benchmarks. Once you have selected a few instance types that meet your workload requirements and budget, you should conduct benchmarks to determine which instance type provides the best performance for your workload.
  6. Consider other factors. Apart from performance and cost, you should also consider other factors such as availability, reliability, and support when evaluating instance types.

The best way to evaluate the available instance types in a cloud provider is to carefully consider your workload requirements, performance characteristics, and cost, and conduct benchmarks to determine the best instance type for your specific use case.
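
On AWS, for example, the instance specifications discussed above can be pulled programmatically and compared against your workload requirements. The sketch below uses boto3 and assumes AWS credentials are configured; the candidate instance types and the vCPU/memory thresholds are placeholders, not recommendations.

```python
# Sketch: compare a few candidate AWS instance types against placeholder
# workload requirements. Requires boto3 and AWS credentials; the candidate
# list and thresholds are illustrative assumptions.
import boto3

REQUIRED_VCPUS = 8
REQUIRED_MEMORY_GIB = 32
candidates = ["m5.2xlarge", "r5.xlarge", "c5.2xlarge"]

ec2 = boto3.client("ec2")
response = ec2.describe_instance_types(InstanceTypes=candidates)

for it in response["InstanceTypes"]:
    vcpus = it["VCpuInfo"]["DefaultVCpus"]
    mem_gib = it["MemoryInfo"]["SizeInMiB"] / 1024
    fits = vcpus >= REQUIRED_VCPUS and mem_gib >= REQUIRED_MEMORY_GIB
    verdict = "meets" if fits else "below"
    print(f'{it["InstanceType"]:<12} {vcpus} vCPU, {mem_gib:.0f} GiB -> {verdict} requirements')
```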

Consider the Cost

When evaluating cloud instance types, it is important to consider both the hourly rate and the overall cost over time, as these factors can vary significantly depending on the provider and the specific instance type. Here are some steps you can take to determine the hourly rate and overall cost over time for cloud instance types:

  1. Identify the instance types you are interested in. Before you can determine the hourly rate and overall cost over time, you need to identify the specific instance types you are interested in.
  2. Check the hourly rate. Most cloud providers offer a pricing calculator that allows you to check the hourly rate for a specific instance type. You can use this calculator to compare the hourly rate for different instance types and providers.
  3. Consider the length of time you will use the instance. While the hourly rate is an important consideration, it is also important to consider the length of time you will use the instance. If you plan to use the instance for a long period of time, it may be more cost-effective to choose an instance type with a higher hourly rate but lower overall cost over time.
  4. Look for cost-saving options. Many cloud providers offer cost-saving options such as reserved instances or spot instances. These options can help reduce the overall cost over time, but may require a longer commitment or be subject to availability limitations.
  5. Factor in any additional costs. In addition to the hourly rate, there may be additional costs such as data transfer fees or storage costs that can significantly impact the overall cost over time.
  6. Consider the potential for scaling. If you anticipate the need for scaling in the future, you should also consider the potential cost implications of adding additional instances over time.

By carefully considering the hourly rate, overall cost over time, and any additional costs or cost-saving options, you can make an informed decision about the most cost-effective instance type for your specific use case.
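
A simple way to reason about the hourly rate versus the overall cost over time is to compare pay-as-you-go pricing against a committed rate at your expected utilization. The numbers below are made-up placeholders, not quotes from any provider; real comparisons should come from the provider's pricing calculator.

```python
# Back-of-the-envelope comparison of on-demand vs. a one-year commitment.
# All rates are placeholders; substitute real prices from your provider.
HOURS_PER_MONTH = 730

on_demand_rate = 0.40   # $/hour, pay-as-you-go
committed_rate = 0.26   # $/hour effective, one-year commitment
months_in_use = 12
utilization = 0.60      # fraction of hours the instance actually runs

# On-demand: you pay only for the hours you use
on_demand_cost = on_demand_rate * HOURS_PER_MONTH * months_in_use * utilization

# Committed: you pay for every hour of the term, used or not
committed_cost = committed_rate * HOURS_PER_MONTH * months_in_use

print(f"On-demand : ${on_demand_cost:,.0f}")
print(f"Committed : ${committed_cost:,.0f}")
print("Commitment pays off" if committed_cost < on_demand_cost
      else "On-demand is cheaper at this utilization")
```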

Optimize for Scalability

To optimize resources for scalability in the cloud, you can follow these best practices:

  1. Design for scalability. When designing your architecture, consider the needs of your application and design it to scale horizontally. This means adding more resources, such as servers or containers, to handle an increase in demand.
  2. Use auto-scaling. Auto-scaling allows you to automatically increase or decrease the number of resources based on the demand for your application. This helps ensure that you are using only the necessary resources at any given time, and can also save costs by reducing resources during low demand periods.
  3. Use load balancing. Load balancing distributes incoming traffic across multiple resources, which helps prevent any one resource from being overloaded. This also helps with failover and disaster recovery.
  4. Use caching. Caching can help reduce the load on your servers by storing frequently accessed data in a cache. This reduces the number of requests that need to be processed by your servers, which can improve performance and reduce costs.
  5. Use cloud monitoring. Cloud monitoring tools can help you identify potential performance issues and bottlenecks before they become problems. This can help you optimize your resources more effectively and improve the overall performance of your application.
  6. Use serverless architecture. With serverless architecture, you don’t need to manage servers or resources. Instead, you pay only for the resources that your application uses, which can help you optimize your resources and reduce costs.

By following the above best practices, you can optimize your resources for scalability in the cloud and ensure that your application can handle an increase in demand.
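
For the auto-scaling practice above, the sketch below attaches a target-tracking policy to an existing AWS Auto Scaling group with boto3; the group name and the 50% CPU target are assumptions for illustration, not values from this post.

```python
# Sketch: attach a target-tracking scaling policy to an existing
# Auto Scaling group so capacity follows demand automatically.
# "web-app-asg" and the 50% CPU target are illustrative assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-app-asg",
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # add or remove instances to hold ~50% average CPU
    },
)
```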

Test & Optimize

Testing and optimizing your cloud environment is a critical aspect of ensuring that your applications and services are performing optimally and that you’re not overspending on cloud resources. Here are some steps you can take to test and optimize your cloud environment:

  1. Set up monitoring. Before you can start testing and optimizing, you need to have visibility into your cloud environment. Set up monitoring tools that can give you insights into key metrics such as CPU utilization, network traffic, and storage usage.
  2. Conduct load testing. Conduct load testing to determine how your applications and services perform under different levels of traffic. This can help you identify bottlenecks and performance issues, and make optimizations to improve performance.
  3. Optimize resource allocation. Make sure that you’re not overspending on cloud resources by optimizing resource allocation. This includes things like resizing virtual machines, choosing the right storage options, and using auto-scaling to automatically adjust resource allocation based on demand.
  4. Implement security measures. Make sure that your cloud environment is secure by implementing appropriate security measures such as firewalls, access controls, and encryption.
  5. Use automation. Automating routine tasks can help you save time and reduce the risk of errors. This includes things like automating deployments, backups, and resource provisioning.
  6. Review cost optimization options. Consider reviewing your cloud provider’s cost optimization options, such as reserved instances or spot instances. These can help you save money on your cloud bill while still maintaining performance.
  7. Continuously monitor and optimize. Continuous monitoring and optimization is key to ensuring that your cloud environment is performing optimally. Set up regular reviews to identify opportunities for optimization and ensure that your cloud environment is meeting your business needs.

By following these steps, you can test and optimize your cloud environment to ensure that it’s secure, cost-effective, and performing optimally.
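
As a minimal sketch of the load-testing step, the snippet below fires concurrent requests at an endpoint using only the standard library; the URL, worker count, and request count are placeholders, and a purpose-built load-testing tool is preferable for anything beyond a quick smoke test.

```python
# Tiny concurrent load test: send N requests at a URL and report latency.
# The URL, worker count, and request count are placeholders; use a real
# load-testing tool for production-grade tests.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "https://example.com/health"   # placeholder endpoint
REQUESTS = 50
WORKERS = 10

def timed_request(_):
    start = time.perf_counter()
    with urlopen(URL, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"requests={REQUESTS}  avg={sum(latencies)/len(latencies):.3f}s  "
      f"slowest={latencies[-1]:.3f}s")
```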

Recap

Following the above steps will help you make informed decisions and optimize your cloud infrastructure to meet your business needs. Start by carefully considering your workload requirements: evaluating performance characteristics and cost, and running benchmarks, all depend on understanding your application’s resource requirements. With those requirements in hand, you can choose the right instance types in the cloud that provide the necessary CPU, memory, storage, and network resources to meet your application’s performance and scalability needs.

The post The Modern Data Ecosystem: Choose the Right Instance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-choose-the-right-instance/feed/ 0
The Modern Data Ecosystem: Monitor Cloud Resources https://www.unraveldata.com/resources/the-modern-data-ecosystem-monitor-cloud-resources/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-monitor-cloud-resources/#respond Thu, 01 Jun 2023 01:32:59 +0000 https://www.unraveldata.com/?p=12740

Monitor Cloud Resources When monitoring cloud resources, there are several factors to consider: Performance It is essential to monitor the performance of your cloud resources, including their availability, latency, and throughput. You can use metrics such […]

The post The Modern Data Ecosystem: Monitor Cloud Resources appeared first on Unravel.

]]>

Monitor Cloud Resources

When monitoring cloud resources, there are several factors to consider:

  1. Performance. It is essential to monitor the performance of your cloud resources, including their availability, latency, and throughput. You can use metrics such as CPU usage, network traffic, and memory usage to measure the performance of your resources.
  2. Scalability. You should monitor the scalability of your cloud resources to ensure that they can handle changes in demand. You can use tools such as auto-scaling to automatically adjust the resources based on demand.
  3. Security. You must monitor the security of your cloud resources to ensure that they are protected from unauthorized access or attacks. You can use tools such as intrusion detection systems and firewalls to monitor and protect your resources.
  4. Cost. It is important to monitor the cost of your cloud resources to ensure that you are not overspending on resources that are not being used. You can use tools such as cost optimization and billing alerts to manage your costs.
  5. Compliance. If you are subject to regulatory compliance requirements, you should monitor your cloud resources to ensure that you are meeting those requirements. You can use tools such as audit logs and compliance reports to monitor and maintain compliance.
  6. Availability. It is important to monitor the availability of your cloud resources to ensure that they are up and running when needed. You can use tools such as load balancing and failover to ensure high availability.
  7. User Experience. You should also monitor the user experience of your cloud resources to ensure that they are meeting the needs of your users. You can use tools such as user monitoring and feedback to measure user satisfaction and identify areas for improvement.

Performance Monitoring

Here are some best practices for performance monitoring in the cloud:

  1. Establish baselines. Establish baseline performance metrics for your applications and services. This will allow you to identify and troubleshoot performance issues more quickly.
  2. Monitor resource utilization. Monitor resource utilization such as CPU usage, memory usage, network bandwidth, and disk I/O. This will help you identify resource bottlenecks and optimize resource allocation.
  3. Use automated monitoring tools. Use automated monitoring tools such as CloudWatch, DataDog, and New Relic to collect performance metrics and analyze them in real time. This will allow you to identify and address performance issues as they arise.
  4. Set alerts. Set up alerts for critical performance metrics such as CPU utilization, memory utilization, and network bandwidth. This will allow you to proactively address performance issues before they impact end users.
  5. Use load testing. Use load testing tools to simulate heavy loads on your applications and services. This will help you identify performance bottlenecks and optimize resource allocation before going live.
  6. Monitor end-user experience. Monitor end-user experience using tools such as synthetic monitoring and real user monitoring (RUM). This will allow you to identify and address performance issues that impact end users.
  7. Analyze logs. Analyze logs to identify potential performance issues. This will help you identify the root cause of performance issues and optimize resource allocation.
  8. Continuously optimize. Continuously optimize your resources based on performance metrics and end-user experience. This will help you ensure that your applications and services perform at their best at all times.
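
As a small illustration of items 2 and 4 above, the following Python/boto3 sketch creates a CloudWatch alarm that fires when average EC2 CPU utilization stays above 80% for three consecutive five-minute periods. The instance ID and SNS topic ARN are hypothetical placeholders, and AWS is used only as an example; Azure Monitor and Google Cloud Monitoring expose equivalent alerting APIs.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical identifiers: replace with your own instance and SNS topic.
    INSTANCE_ID = "i-0123456789abcdef0"
    ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

    cloudwatch.put_metric_alarm(
        AlarmName="high-cpu-utilization",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        Statistic="Average",
        Period=300,              # five-minute averages
        EvaluationPeriods=3,     # sustained for three periods before alarming
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[ALERT_TOPIC_ARN],  # notify the team before users notice
    )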

Scalability Monitoring

Here are some best practices for scalability monitoring in the cloud:

  1. Establish baselines. Establish baseline performance metrics for your applications and services. This will allow you to identify and troubleshoot scalability issues more quickly.
  2. Monitor auto-scaling. Monitor auto-scaling metrics to ensure that your resources are scaling up or down according to demand. This will help you ensure that you have the right amount of resources available to meet demand.
  3. Use load testing. Use load testing tools to simulate heavy loads on your applications and services. This will help you identify scalability bottlenecks and optimize resource allocation before going live.
  4. Set alerts. Set up alerts for critical scalability metrics such as CPU utilization, memory utilization, and network bandwidth. This will allow you to proactively address scalability issues before they impact end users.
  5. Use horizontal scaling. Use horizontal scaling to add more instances of your application or service to handle increased traffic. This will allow you to scale quickly and efficiently.
  6. Use vertical scaling. Use vertical scaling to increase the size of your resources to handle increased traffic. This will allow you to scale quickly and efficiently.
  7. Analyze logs. Analyze logs to identify potential scalability issues. This will help you identify the root cause of scalability issues and optimize resource allocation.
  8. Continuously optimize. Continuously optimize your resources based on scalability metrics and end-user experience. This will help you ensure that your applications and services can handle any level of demand.
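
To ground item 2 above, here is a minimal Python/boto3 sketch that lists recent scaling activities for an EC2 Auto Scaling group, so you can confirm that instances really are being added and removed as demand changes. The group name is a hypothetical placeholder, and AWS is only one example of a provider with such an API.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical Auto Scaling group name: replace with your own.
    GROUP_NAME = "web-app-asg"

    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=GROUP_NAME,
        MaxRecords=20,
    )

    for activity in response["Activities"]:
        # Each activity records when scaling happened, why, and whether it
        # completed successfully.
        print(activity["StartTime"], activity["StatusCode"], activity["Cause"])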

Security Monitoring

Here are some best practices for handling security monitoring in the cloud:

  1. Use security services. Use cloud-based security services such as AWS Security Hub, Azure Security Center, and Google Cloud Security Command Center to centralize and automate security monitoring across your cloud environment.
  2. Monitor user activity. Monitor user activity across your cloud environment, including login attempts, resource access, and changes to security policies. This will help you identify potential security threats and ensure that access is granted only to authorized users.
  3. Use encryption. Use encryption to protect data at rest and in transit. This will help you protect sensitive data from unauthorized access.
  4. Set up alerts. Set up alerts for critical security events such as failed login attempts, unauthorized access, and changes to security policies. This will allow you to respond quickly to security threats.
  5. Use multi-factor authentication. Use multi-factor authentication to add an extra layer of security to user accounts. This will help prevent unauthorized access even if a user’s password is compromised.
  6. Use firewalls. Use firewalls to control network traffic to and from your cloud resources. This will help you prevent unauthorized access and ensure that only authorized traffic is allowed.
  7. Implement access controls. Implement access controls to ensure that only authorized users have access to your cloud resources. This will help you prevent unauthorized access and ensure that access is granted only to those who need it.
  8. Perform regular security audits. Perform regular security audits to identify potential security threats and ensure that your cloud environment is secure. This will help you identify and address security issues before they become major problems.
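
As one concrete example of items 6 and 8 above, the short Python/boto3 sketch below scans EC2 security groups for inbound rules that are open to the whole internet (0.0.0.0/0), a common finding in security audits. It is an illustrative check only and assumes AWS credentials with read access to EC2.

    import boto3

    ec2 = boto3.client("ec2")

    # Flag security group rules that allow inbound traffic from anywhere.
    for group in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in group.get("IpPermissions", []):
            for ip_range in rule.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    print(f"{group['GroupId']} ({group.get('GroupName')}) "
                          f"allows ports {rule.get('FromPort')}-{rule.get('ToPort')} "
                          f"from 0.0.0.0/0")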

Cost Monitoring

Here are some best practices for monitoring cost in the cloud:

  1. Use cost management tools. Use cloud-based cost management tools such as AWS Cost Explorer, Azure Cost Management, and Google Cloud Billing to monitor and optimize your cloud costs.
  2. Set budgets. Set budgets for your cloud spending to help you stay within your financial limits. This will help you avoid unexpected charges and ensure that you are using your cloud resources efficiently.
  3. Monitor usage. Monitor your cloud resource usage to identify any unnecessary or underutilized resources. This will help you identify opportunities for optimization and cost savings.
  4. Analyze cost data. Analyze your cost data to identify trends and areas of high spending. This will help you identify opportunities for optimization and cost savings.
  5. Use cost allocation. Use cost allocation to assign costs to individual users, teams, or projects. This will help you identify which resources are being used most and which users or teams are driving up costs.
  6. Use reserved instances. Use reserved instances to save money on long-term cloud usage. This will help you save money on your cloud costs over time.
  7. Optimize resource allocation. Optimize your resource allocation to ensure that you are using the right amount of resources for your needs. This will help you avoid over-provisioning and under-provisioning.
  8. Implement cost optimization strategies. Implement cost optimization strategies such as using spot instances, turning off non-critical resources when not in use, and using serverless architectures. This will help you save money on your cloud costs without sacrificing performance or reliability.
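
To illustrate items 1 and 4 above, here is a minimal Python/boto3 sketch that queries AWS Cost Explorer for one month of spend broken down by service. The date range is illustrative, and AWS is just one example; Azure Cost Management and Google Cloud Billing have comparable reporting APIs.

    import boto3

    ce = boto3.client("ce")

    # One month of spend, grouped by AWS service (dates are illustrative).
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{service}: ${amount:,.2f}")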

Compliance Monitoring

Here are some best practices for monitoring compliance in the cloud:

  1. Understand compliance requirements. Understand the compliance requirements that apply to your organization and your cloud environment, such as HIPAA, PCI-DSS, or GDPR.
  2. Use compliance services. Use cloud-based compliance services such as AWS Artifact, Azure Compliance Manager, and Google Cloud Compliance to streamline compliance management and ensure that you are meeting your regulatory requirements.
  3. Conduct regular audits. Conduct regular audits to ensure that your cloud environment is in compliance with regulatory requirements. This will help you identify and address compliance issues before they become major problems.
  4. Implement security controls. Implement security controls such as access controls, encryption, and multi-factor authentication to protect sensitive data and ensure compliance with regulatory requirements.
  5. Monitor activity logs. Monitor activity logs across your cloud environment to identify potential compliance issues, such as unauthorized access or data breaches. This will help you ensure that you are meeting your regulatory requirements and protect sensitive data.
  6. Use automation. Use automation tools to help you enforce compliance policies and ensure that your cloud environment is compliant with regulatory requirements.
  7. Establish incident response plans. Establish incident response plans to help you respond quickly to compliance issues or data breaches. This will help you minimize the impact of any incidents and ensure that you are meeting your regulatory requirements.
  8. Train your employees. Train your employees on compliance policies and procedures to ensure that they understand their roles and responsibilities in maintaining compliance with regulatory requirements. This will help you ensure that everyone in your organization is working together to maintain compliance in the cloud.
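
As a small example of item 5 above, the following Python/boto3 sketch pulls recent console login events from AWS CloudTrail’s event history, a simple starting point for reviewing who is accessing your environment. It assumes CloudTrail’s default 90-day event history is available and that the fields shown are present in the response.

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Recent console logins; unusual users or times are worth a closer look.
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"},
        ],
        MaxResults=50,
    )

    for event in response["Events"]:
        print(event["EventTime"], event.get("Username"), event["EventName"])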

Monitor Availability

Here are some best practices for monitoring resource availability in the cloud:

  1. Use monitoring services. Use cloud-based monitoring services such as AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring to monitor the availability of your cloud resources.
  2. Set up alerts. Set up alerts to notify you when there are issues with resource availability, such as when a server goes down or a service becomes unresponsive. This will help you respond quickly to issues and minimize downtime.
  3. Monitor performance metrics. Monitor performance metrics such as CPU usage, memory usage, and network latency to identify potential issues before they become major problems. This will help you ensure that your resources are performing optimally and prevent performance issues from affecting availability.
  4. Conduct regular load testing. Conduct regular load testing to ensure that your resources can handle the expected levels of traffic and usage. This will help you identify potential performance bottlenecks and ensure that your resources are available when you need them.
  5. Use high availability architectures. Use high availability architectures such as load balancing, auto-scaling, and multi-region deployments to ensure that your resources are available even in the event of a failure. This will help you minimize downtime and ensure that your resources are always available.
  6. Monitor service-level agreements (SLAs). Monitor SLAs to ensure that your cloud providers are meeting their service-level commitments. This will help you hold your providers accountable and ensure that your resources are available as expected.
  7. Conduct disaster recovery drills. Conduct disaster recovery drills to ensure that you can recover from major outages or disasters. This will help you minimize downtime and ensure that your resources are available even in the event of a major failure.
  8. Implement redundancy. Implement redundancy for critical resources to ensure that they are always available. This can include redundant servers, databases, or storage systems. This will help you ensure that your critical resources are always available and minimize downtime.

Monitor User Experience

Here are some best practices for monitoring user experience in the cloud:

  1. Define user experience metrics. Define user experience metrics that are important to your business, such as page load times, error rates, and response times. This will help you track user experience and identify areas for improvement.
  2. Use synthetic transactions. Use synthetic transactions to simulate user interactions with your applications and services. This will help you identify performance issues and ensure that your applications and services are delivering a good user experience.
  3. Monitor real user traffic. Monitor real user traffic to identify issues that may not be apparent in synthetic transactions. This will help you understand how your users are actually using your applications and services and identify any performance issues that may be impacting the user experience.
  4. Monitor third-party services. Monitor third-party services that your applications and services rely on, such as payment gateways and content delivery networks. This will help you identify any issues that may be impacting the user experience and ensure that your users have a seamless experience.
  5. Use application performance management (APM) tools. Use APM tools to monitor application performance and identify potential issues that may be impacting the user experience. This will help you quickly identify and resolve issues that may be impacting your users.
  6. Monitor network latency. Monitor network latency to ensure that your applications and services are delivering a good user experience. This will help you identify any network-related issues that may be impacting the user experience.
  7. Set up alerts. Set up alerts to notify you when user experience metrics fall below acceptable levels. This will help you respond quickly to issues and ensure that your users have a good experience.
  8. Continuously test and optimize. Continuously test and optimize your applications and services to ensure that they are delivering a good user experience. This will help you identify and fix issues before they impact your users and ensure that your applications and services are always performing optimally.
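
To make items 1 and 2 above concrete, here is a minimal synthetic check written in plain Python: it times a single request to an endpoint and compares the result against a response-time budget. The URL and threshold are hypothetical placeholders; a real synthetic monitoring setup would run checks like this on a schedule from several locations.

    import time
    import urllib.request

    # Hypothetical endpoint and response-time budget: substitute your own.
    URL = "https://example.com/health"
    THRESHOLD_SECONDS = 1.0

    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as response:
            status = response.status
    except Exception as exc:  # treat network errors as a failed check
        print(f"Synthetic check FAILED: {exc}")
    else:
        elapsed = time.monotonic() - start
        ok = status == 200 and elapsed <= THRESHOLD_SECONDS
        print(f"Synthetic check {'OK' if ok else 'DEGRADED'}: "
              f"status={status}, elapsed={elapsed:.2f}s")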

Recap

When monitoring cloud resources, there are several factors to consider. Performance comes first: monitor the availability, latency, and throughput of your resources so you can identify and address issues before they affect end users. Keep an eye on cost as well, using cost optimization tools and billing alerts to avoid unexpected charges and to make sure you are only paying for resources you actually use. Conduct regular load testing to confirm that your resources can handle expected levels of traffic, and define the user experience metrics that matter to your business, such as page load times, error rates, and response times.

The post The Modern Data Ecosystem: Monitor Cloud Resources appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-monitor-cloud-resources/feed/ 0
The Modern Data Ecosystem: Use Managed Services https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-managed-services/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-managed-services/#respond Thu, 01 Jun 2023 01:32:43 +0000 https://www.unraveldata.com/?p=12735

Use Managed Services Using managed services in the cloud can help you reduce your operational burden, increase efficiency, and improve scalability. However, to fully realize the benefits of managed services, you need to follow some best […]

The post The Modern Data Ecosystem: Use Managed Services appeared first on Unravel.

]]>

Use Managed Services

Using managed services in the cloud can help you reduce your operational burden, increase efficiency, and improve scalability. However, to fully realize the benefits of managed services, you need to follow some best practices. Here are some best practices to consider when using managed services in the cloud:

  1. Understand your service-level agreements (SLAs). Before using any managed service, you should understand the SLAs offered by your cloud provider. This will help you understand the availability and reliability of the service, as well as any limitations or restrictions that may impact your use of the service.
  2. Choose the right service. You should choose the right managed service for your needs. This means selecting a service that meets your requirements and offers the features and functionality you need. You should also consider the cost of the service and how it will impact your budget.
  3. Plan for scalability. Managed services in the cloud are designed to be highly scalable, so you should plan for scalability when using them. This means understanding how the service will scale as your needs change and ensuring that you can easily scale the service up or down as required.
  4. Monitor service performance. You should monitor the performance of your managed services to ensure that they are meeting your expectations. This may involve setting up monitoring tools to track service usage, performance, and availability. You should also define appropriate thresholds and alerts to notify you when issues arise.
  5. Secure your services. Security is critical when using managed services in the cloud. You should ensure that your services are secured according to best practices, such as using strong passwords, encrypting data in transit and at rest, and implementing access controls. You should also regularly audit your services to ensure that they remain secure.
  6. Stay up-to-date. Managed services in the cloud are continually evolving, so you should stay up-to-date with the latest features, updates, and releases. This will help you take advantage of new features and ensure that your services are up-to-date and secure.

By following these best practices, you can ensure that your managed services in the cloud are efficient, reliable, and secure.

Understand Your Service-Level Agreements (SLAs)

Understanding cloud service-level agreements (SLAs) is crucial when you use cloud services. SLAs define the level of service you can expect from a cloud provider and outline the terms and conditions of the service. Here are some ways to help you understand cloud SLAs:

  1. Read the SLA. The best way to understand cloud SLAs is to read the SLA itself. The SLA provides details on what services are offered, how they are delivered, and what level of availability you can expect. It also outlines the terms and conditions of the service and what you can expect in the event of an outage or other issues.
  2. Understand the metrics. Cloud SLAs typically include metrics that define the level of service you can expect. These metrics may include availability, performance, and response time. You should understand how these metrics are measured and what level of service is guaranteed for each metric.
  3. Know the guarantees. The SLA also specifies the guarantees that the cloud provider offers for each metric. You should understand what happens if the provider fails to meet these guarantees, such as compensation or penalties.
  4. Identify exclusions. The SLA may also include exclusions or limitations to the service, such as scheduled maintenance, force majeure events, or issues caused by your own actions. You should understand what these exclusions are and how they may impact your use of the service.
  5. Ask questions. If you are unsure about any aspect of the SLA, you should ask questions. The cloud provider should be able to provide clarification on any issues and help you understand the SLA better.
  6. Get expert help. If you are still unsure about the SLA or need help negotiating SLAs with multiple providers, consider getting expert help. Cloud consultants or legal advisors can help you understand the SLA better and ensure that you get the best possible terms for your needs.

By following these steps, you can better understand cloud SLAs and make informed decisions about the cloud services you use.

Choose the Right Service

Choosing the right cloud service is a critical decision that can have a significant impact on your organization. Here are some factors to consider when choosing a cloud service:

  1. Business needs. The first step in choosing a cloud service is to understand your business needs. What are your specific requirements? What do you need the cloud service to do? Consider factors such as scalability, security, compliance, and cost when evaluating your options.
  2. Reliability and availability. Reliability and availability are critical when choosing a cloud service. Look for a provider with a strong track record of uptime and availability. Also, ensure that the provider has a robust disaster recovery plan in place in case of service disruptions or outages.
  3. Security. Security is a top priority when using cloud services. Choose a provider that has robust security measures in place, such as encryption, access controls, and multi-factor authentication. Also, consider whether the provider meets any relevant compliance requirements, such as HIPAA or GDPR.
  4. Cost. Cost is another critical factor to consider when choosing a cloud service. Look for a provider that offers transparent pricing and provides a clear breakdown of costs. Also, consider any hidden fees or charges, such as data transfer costs or support fees.
  5. Support. Choose a cloud service provider that offers robust support options, such as 24/7 customer support or online resources. Ensure that the provider has a reputation for providing excellent support and responding quickly to issues.
  6. Integration. Ensure that the cloud service provider integrates with any existing systems or applications that your organization uses. This can help reduce the complexity of your IT environment and improve productivity.
  7. Scalability. Choose a cloud service provider that can scale as your needs change. Ensure that the provider can handle your growth and provides flexibility in terms of scaling up or down.

By considering these factors, you can choose the right cloud service that meets your business needs, is secure, reliable, and scalable, and provides good value.

Plan for Scalability

Scalability is a key advantage of cloud computing, allowing organizations to quickly and easily increase or decrease resources as needed. Here are some best practices for planning for scalability in the cloud:

  1. Start with a solid architecture. A solid architecture is essential for building scalable cloud applications. Ensure that your architecture is designed to support scalability from the beginning, by leveraging horizontal scaling, load balancing, and other cloud-native capabilities.
  2. Automate everything. Automation is critical for scaling in the cloud. Automate deployment, configuration, and management tasks to reduce manual effort and increase efficiency. Use cloud orchestration or infrastructure-as-code (IaC) tools to automate the provisioning and configuration of resources.
  3. Use elasticity. Elasticity is the ability to automatically adjust resource capacity to meet changes in demand. Use auto-scaling to automatically increase or decrease resources based on utilization or other metrics. This can help ensure that you always have the right amount of resources to handle traffic spikes or fluctuations.
  4. Monitor and optimize. Monitoring is critical for maintaining scalability. Use monitoring tools to track application performance and identify potential bottlenecks or areas for optimization. Optimize your applications, infrastructure, and processes to ensure that you can scale as needed without encountering issues.
  5. Plan for failure. Scalability also means being prepared for failure. Ensure that your application is designed to handle failures and that you have a plan in place for dealing with outages or other issues. Use fault tolerance and high availability to ensure that your application can continue running even if a component fails.

By following these best practices, you can plan for scalability in the cloud and ensure that your applications can handle changes in demand without encountering issues.

Monitor Service Performance

Monitoring service performance is essential to ensure that your cloud applications are running smoothly and meeting service-level agreements (SLAs). Here are some best practices for monitoring service performance in the cloud:

  1. Define key performance indicators (KPIs). Define KPIs that are relevant to your business needs, such as response time, throughput, and error rates. These KPIs will help you track how well your applications are performing and whether they are meeting your SLAs.
  2. Use monitoring tools. Use monitoring tools to collect and analyze data on your application’s performance. These tools can help you identify issues before they become critical and track how well your application is meeting your KPIs.
  3. Set alerts. Set up alerts based on your KPIs to notify you when something goes wrong. This can help you quickly identify and resolve issues before they impact your application’s performance.
  4. Monitor end-to-end performance. Monitor end-to-end performance, including network latency, database performance, and third-party services, to identify any potential bottlenecks or issues that could impact your application’s performance.
  5. Analyze and optimize. Analyze performance data to identify patterns and trends. Use this information to optimize your application’s performance and identify areas for improvement. Optimize your application code, database queries, and network configurations to improve performance.
  6. Use machine learning. Leverage machine learning to analyze performance data and identify anomalies or issues automatically. This can help you identify issues before they become critical and take proactive steps to resolve them.

By following these best practices, you can monitor service performance in the cloud effectively and ensure that your applications are meeting your SLAs and delivering the best possible user experience.

Secure Your Services

Securing your services in the cloud is critical to protect your data and applications from cyber threats. Here are some best practices for securing your services in the cloud:

  1. Implement strong access control. Implement strong access control mechanisms to restrict access to your cloud resources. Use least privilege principles to ensure that users only have access to the resources they need. Implement multi-factor authentication (MFA) and use strong passwords to protect against unauthorized access.
  2. Encrypt your data. Encrypt your data both at rest and in transit to protect against data breaches. Use SSL/TLS protocols for data in transit and encryption mechanisms like AES or RSA for data at rest. Additionally, consider encrypting data before it is stored in the cloud to provide an additional layer of protection.
  3. Implement network security. Implement network security measures to protect against network-based attacks, such as DDoS attacks, by using firewalls, intrusion detection/prevention systems (IDS/IPS), and VPNs. Segregate your network into logical segments to reduce the risk of lateral movement by attackers.
  4. Use security groups and network ACLs. Use security groups and network ACLs to control inbound and outbound traffic to your resources. Implement granular rules to restrict traffic to only what is necessary, and consider using security groups and network ACLs together to provide layered security.
  5. Implement logging and monitoring. Implement logging and monitoring to detect and respond to security incidents. Use cloud-native tools like CloudTrail or CloudWatch to monitor activity in your environment and alert you to any unusual behavior.
  6. Perform regular security audits. Perform regular security audits to identify potential vulnerabilities and ensure that your security controls are effective. Conduct regular penetration testing and vulnerability assessments to identify and remediate any weaknesses in your environment.

By following these best practices, you can secure your services in the cloud and protect your applications and data from cyber threats.
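
As one way to picture item 2 above, the sketch below uploads an object to Amazon S3 with server-side encryption enabled, so the data is encrypted at rest with S3-managed keys (encryption in transit comes from the HTTPS endpoint the SDK uses by default). The bucket name, key, and payload are hypothetical placeholders; use "aws:kms" instead of "AES256" if you manage keys in KMS.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket, key, and payload: replace with your own.
    s3.put_object(
        Bucket="example-app-data",
        Key="reports/2024-05-report.csv",
        Body=b"col_a,col_b\n1,2\n",
        ServerSideEncryption="AES256",  # encrypt at rest with S3-managed keys
    )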

Stay Up-to-Date on Managed Services

Staying up to date on managed services in the cloud is essential to ensure that you are using the latest features and capabilities and making the most out of your cloud investment. Here are some ways to stay up-to-date on managed services in the cloud:

  1. Subscribe to cloud service providers’ blogs. Cloud service providers like AWS, Google Cloud, and Microsoft Azure regularly publish blog posts announcing new features and services. By subscribing to these blogs, you can stay up-to-date on the latest developments and updates.
  2. Attend cloud conferences. Attending cloud conferences like AWS re:Invent, Google Cloud Next, and Microsoft Ignite is an excellent way to learn about new and upcoming managed services in the cloud. These events feature keynote speeches, technical sessions, and hands-on workshops that can help you stay up-to-date with the latest trends and technologies.
  3. Join cloud user groups. Joining cloud user groups like AWS User Group, Google Cloud User Group, and Azure User Group is a great way to connect with other cloud professionals and learn about new managed services in the cloud. These groups often hold meetings and events where members can share their experiences and discuss the latest developments.
  4. Participate in online communities. Participating in online communities like Reddit, Stack Overflow, and LinkedIn Groups is an excellent way to stay up-to-date on managed services in the cloud. These communities often have active discussions about new features and services, and members can share their experiences and insights.
  5. Follow industry experts. Following industry experts on social media platforms like Twitter, LinkedIn, and Medium is an excellent way to stay up-to-date on managed services in the cloud. These experts often share their thoughts and insights on the latest developments and can provide valuable guidance and advice.

By following these methods, you can stay up-to-date on managed services in the cloud and ensure that you are using the latest features and capabilities to achieve your business goals.

Recap

Understand your cloud SLAs so you can make informed decisions about the services you use, and choose providers that meet your business needs while remaining secure, reliable, scalable, and good value. Plan for scalability so your applications can handle changes in demand without encountering issues, and monitor service performance continuously to confirm that you are meeting your SLAs and delivering the best possible user experience. Finally, secure your services to protect your applications and data from cyber threats. Following these practices helps ensure that your managed services in the cloud stay efficient, reliable, and secure, and ultimately supports the key business objectives of every team using the cloud.

The post The Modern Data Ecosystem: Use Managed Services appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-managed-services/feed/ 0
The Modern Data Ecosystem: Optimize Your Storage https://www.unraveldata.com/resources/the-modern-data-ecosystem-optimize-your-storage/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-optimize-your-storage/#respond Thu, 01 Jun 2023 01:32:28 +0000 https://www.unraveldata.com/?p=12728

Optimize Storage There are several ways to optimize cloud storage, depending on your specific needs and circumstances. Here are some general tips that can help: Understand your data. Before you start optimizing, it’s important to understand […]

The post The Modern Data Ecosystem: Optimize Your Storage appeared first on Unravel.

]]>

Optimize Storage

There are several ways to optimize cloud storage, depending on your specific needs and circumstances. Here are some general tips that can help:

  1. Understand your data. Before you start optimizing, it’s important to understand what data you have and how it’s being used. This can help you identify which files or folders are taking up the most space, and which ones are being accessed the most frequently.
  2. Use storage compression. Compression can reduce the size of your files, which can save you storage space and reduce the amount of data you need to transfer over the network. However, keep in mind that compressed files may take longer to access and may not be suitable for all types of data.
  3. Use deduplication. Deduplication can identify and eliminate duplicate data, which can save you storage space and reduce the amount of data you need to transfer over the network. However, keep in mind that deduplication may increase the amount of CPU and memory resources required to manage your data.
  4. Choose the right storage class. Most cloud storage providers offer different storage classes that vary in performance, availability, and cost. Choose the storage class that best meets your needs and budget.
  5. Set up retention policies. Retention policies can help you automatically delete old or outdated data, which can free up storage space and reduce your storage costs. However, be careful not to delete data that you may need later.
  6. Monitor your usage. Regularly monitor your cloud storage usage to ensure that you’re not exceeding your storage limits or paying for more storage than you need. You can use cloud storage monitoring tools or third-party services to help you with this.
  7. Consider a multi-cloud strategy. If you have very large amounts of data, you may want to consider using multiple cloud storage providers to spread your data across multiple locations. This can help you optimize performance, availability, and cost, while also reducing the risk of data loss.

Overall, optimizing cloud storage requires careful planning, monitoring, and management. By following these tips, you can reduce your storage costs, improve your data management, and get the most out of your cloud storage investment.

Understand Your Data

Analyzing data in the cloud can be a powerful way to gain insights and extract value from large datasets. Here are some best practices for analyzing data in the cloud:

  1. Choose the right cloud platform. There are several cloud platforms available, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Choose the one that suits your needs and budget.
  2. Store data in a scalable, secure, and cost-effective way. You can store data in cloud-based databases, data lakes, or data warehouses. Make sure that you choose a storage solution that is scalable, secure, and cost-effective.
  3. Choose the right data analysis tool. There are several cloud-based data analysis tools available, such as Amazon SageMaker, Microsoft Azure Machine Learning, and Google Cloud AI Platform. Choose the one that suits your needs and budget.
  4. Prepare data for analysis. Data preparation involves cleaning, transforming, and structuring the data for analysis. This step is crucial for accurate analysis results.
  5. Choose the right analysis technique. Depending on the nature of the data and the business problem you are trying to solve, you may choose from various analysis techniques such as descriptive, diagnostic, predictive, or prescriptive.
  6. Visualize data. Visualization helps to communicate insights effectively. Choose a visualization tool that suits your needs and budget.
  7. Monitor and optimize performance. Monitor the performance of your data analysis system and optimize it as necessary. This step helps to ensure that you get accurate and timely insights from your data.

Overall, analyzing data in the cloud can be a powerful way to gain insights and extract value from large datasets. By following these best practices, you can ensure that you get the most out of your cloud-based data analysis system.

Use Storage Compression

Storage compression is a useful technique for reducing storage costs and improving performance in the cloud. Here are some best practices for using storage compression in the cloud:

  1. Choose the right compression algorithm. There are several compression algorithms available, such as gzip, bzip2, and LZ4. Choose the algorithm that suits your needs and budget. Consider factors such as compression ratio, speed, and memory usage.
  2. Compress data at the right time. Compress data when it is written to storage or when it is not frequently accessed. Avoid compressing data that is frequently accessed, as this can slow down performance.
  3. Monitor compression performance. Monitor the performance of your compression system to ensure that it is not slowing down performance. Use tools such as monitoring dashboards to track the performance of your system.
  4. Test compression performance. Test the performance of your compression system with different types of data to ensure that it is effective. Consider testing with data that has varying levels of redundancy, such as log files or images.
  5. Use compression in conjunction with other techniques. Consider using compression in conjunction with other storage optimization techniques such as deduplication, tiering, and archiving. This can further reduce storage costs and improve performance.
  6. Consider the cost of decompression. Decompressing data can be a resource-intensive process. Consider the cost of decompression when choosing a compression algorithm and when designing your storage architecture.

Overall, using storage compression in the cloud can be an effective way to reduce storage costs and improve performance. By following these best practices, you can ensure that you get the most out of your storage compression system.
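
To see how the tradeoffs above play out, the small Python sketch below compresses a block of fairly repetitive data with gzip from the standard library and reports the resulting ratio. Running something like this against a sample of your own data is a cheap way to judge whether compression is worth enabling before you change your storage pipeline.

    import gzip

    # Repetitive data (for example, log lines) compresses very well;
    # already-compressed media typically does not.
    original = ("2024-05-01 INFO request handled in 12ms\n" * 10_000).encode()
    compressed = gzip.compress(original)

    ratio = len(compressed) / len(original)
    print(f"original={len(original)} bytes, "
          f"compressed={len(compressed)} bytes, ratio={ratio:.2%}")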

Data Deduplication 

Deduplication is a technique used to reduce the amount of data stored in the cloud by identifying and removing duplicate data. Here are some best practices for deduplicating cloud data:

  1. Choose the right deduplication algorithm. There are several deduplication algorithms available, such as content-defined chunking and fixed-size chunking. Choose the algorithm that suits your needs and budget. Consider factors such as data type, deduplication ratio, and resource usage.
  2. Deduplicate data at the right time. Deduplicate data when it is written to storage or when it is not frequently accessed. Avoid deduplicating data that is frequently accessed, as this can slow down performance.
  3. Monitor deduplication performance. Monitor the performance of your deduplication system to ensure that it is not slowing down performance. Use tools such as monitoring dashboards to track the performance of your system.
  4. Test deduplication performance. Test the performance of your deduplication system with different types of data to ensure that it is effective. Consider testing with data that has varying levels of redundancy, such as log files or images.
  5. Consider the tradeoff between storage cost and compute cost. Deduplicating data can be a resource-intensive process. Consider the tradeoff between storage cost and compute cost when choosing a deduplication algorithm and when designing your storage architecture.
  6. Use deduplication in conjunction with other techniques. Consider using deduplication in conjunction with other storage optimization techniques such as compression, tiering, and archiving. This can further reduce storage costs and improve performance.

Overall, deduplicating cloud data can be an effective way to reduce storage costs and improve performance. By following these best practices, you can ensure that you get the most out of your deduplication system.
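
As a toy illustration of the fixed-size chunking approach mentioned in item 1, the Python sketch below splits data into 4 KB chunks, hashes each chunk, and stores only one copy per unique hash. Real deduplication systems add persistence, collision handling, and content-defined chunking, so treat this purely as a sketch of the idea.

    import hashlib

    def dedupe_chunks(data: bytes, chunk_size: int = 4096):
        """Split data into fixed-size chunks and keep one copy of each."""
        store = {}   # chunk hash -> chunk bytes (the deduplicated store)
        recipe = []  # ordered hashes needed to reconstruct the original data
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            recipe.append(digest)
        return store, recipe

    # Four chunks of input, but only two unique patterns to store.
    data = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = dedupe_chunks(data)
    print(f"{len(recipe)} chunks referenced, {len(store)} unique chunks stored")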

Use the Right Storage Class

Choosing the right storage class for data in the cloud involves considering factors such as access frequency, durability, availability, and cost. Here are some steps to follow when choosing the right storage class for your data in the cloud:

  1. Determine your access needs. Consider how frequently you need to access your data. If you need to access your data frequently, you should choose a storage class that provides low latency and high throughput. If you don’t need to access your data frequently, you can choose a storage class that provides lower performance and lower cost.
  2. Consider your durability needs. Durability refers to the probability of losing data due to hardware failure. If your data is critical and needs high durability, you should choose a storage class that provides high durability, such as Amazon S3 Standard or Google Cloud Storage Nearline.
  3. Evaluate your availability needs. Availability refers to the ability to access your data when you need it. If your data is critical and needs high availability, you should choose a storage class that provides high availability, such as Amazon S3 Standard or Google Cloud Storage Standard.
  4. Determine your cost needs. Cost is also an important factor when choosing a storage class. If you have a limited budget, you should choose a storage class that provides lower cost, such as Amazon S3 Infrequent Access or Google Cloud Storage Coldline.
  5. Consider any compliance requirements. Some industries have compliance requirements that dictate how data must be stored. If you have compliance requirements, you should choose a storage class that meets those requirements.
  6. Consider data lifecycle management. Depending on the type of data, you may need to store it for a certain period of time before deleting it. Some storage classes may provide lifecycle management features to help you manage your data more efficiently.

By considering these factors, you can choose the right storage class for your data in the cloud that meets your needs and helps you save costs.

Set Data Retention Policies

Setting up retention policies for your cloud data is an important step in managing your data and ensuring that you are in compliance with regulatory requirements. Here are some steps you can follow to set up retention policies for your cloud data:

  1. Identify the types of data you need to retain. The first step in setting up retention policies is to identify the types of data that you need to retain. This could include data related to financial transactions, employee records, customer information, and other types of data that are important for your business.
  2. Determine the retention periods. Next, you will need to determine how long each type of data needs to be retained. This will depend on the regulatory requirements for your industry as well as your own internal policies. 
  3. Decide on the retention strategy. There are several different retention strategies you can use for your cloud data. For example, you could choose to retain all data for a certain period of time, or you could choose to delete data after a certain period of time has elapsed. You could also choose to retain data based on certain triggers, such as when a legal or regulatory inquiry is initiated.
  4. Implement the retention policies. Once you have determined the types of data you need to retain, the retention periods, and the retention strategy, you can implement your retention policies in your cloud storage provider. Most cloud storage providers have built-in tools for setting up retention policies.
  5. Monitor the retention policies. It’s important to regularly monitor your retention policies to ensure that they are working as intended. You should periodically review the types of data being retained, the retention periods, and the retention strategy to ensure that they are still appropriate. You should also regularly audit your retention policies to ensure that they are in compliance with any changes in regulatory requirements.

By following these steps, you can set up retention policies for your cloud data that will help you manage your data effectively, ensure compliance with regulatory requirements, and reduce your risk of data breaches or loss.
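
To show what an implemented retention policy can look like (step 4 above), here is a minimal Python/boto3 sketch that applies an S3 lifecycle rule: objects under a logs/ prefix move to cheaper storage tiers as they age and are deleted after one year. The bucket name, prefix, and time periods are hypothetical; set them according to your own regulatory and business requirements.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and prefix; tune the ages to your own requirements.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-app-data",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "retain-logs-one-year",
                    "Filter": {"Prefix": "logs/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                    "Expiration": {"Days": 365},  # delete after one year
                }
            ]
        },
    )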

Monitor Usage 

Monitoring your cloud usage is essential for managing your costs, optimizing your resources, and ensuring the security of your data. Here are some of the best ways to monitor your cloud usage:

  1. Cloud provider monitoring tools. Most cloud providers offer built-in monitoring tools that allow you to track your usage, monitor your costs, and receive alerts when you approach your resource limits. These tools typically provide real-time insights into your cloud usage and can help you identify areas where you can optimize your resources.
  2. Third-party monitoring tools. There are many third-party monitoring tools available that can help you monitor your cloud usage across multiple cloud providers. These tools offer more advanced features and can help you identify usage patterns, forecast future usage, and detect anomalies that may indicate security threats or performance issues.
  3. Cost optimization tools. Cost optimization tools can help you identify areas where you can reduce your costs, such as by using more efficient resource configurations or by identifying idle resources that can be decommissioned. These tools typically integrate with your cloud provider’s monitoring tools to provide a comprehensive view of your usage and costs.
  4. Security and compliance tools. Security and compliance tools can help you monitor your cloud usage for security threats and compliance violations. These tools typically monitor your cloud resources for suspicious activity, such as unauthorized access attempts, and can help you stay in compliance with regulatory requirements.
  5. Regular audits. Regular audits of your cloud usage can help you identify areas where you can optimize your resources, reduce costs, and improve security. You should periodically review your cloud usage and costs, and adjust your resources and policies as necessary to ensure that you are getting the most value from your cloud investment.

By using these monitoring tools and strategies, you can gain better visibility into your cloud usage, optimize your resources, reduce costs, and ensure the security and compliance of your cloud resources.
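
As a simple example of the built-in monitoring tools mentioned in item 1, the Python/boto3 sketch below reads the BucketSizeBytes metric that Amazon S3 publishes to CloudWatch (roughly once a day) for a bucket’s Standard-class storage. The bucket name is a hypothetical placeholder.

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical bucket name: replace with your own.
    BUCKET = "example-app-data"

    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": BUCKET},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        StartTime=now - datetime.timedelta(days=2),
        EndTime=now,
        Period=86400,              # S3 storage metrics are daily
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"{point['Average'] / 1e9:.2f} GB")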

Pursue a Multi-Cloud Strategy

Pursuing a multi-cloud strategy can offer several benefits, such as increased resilience, reduced vendor lock-in, and improved performance. However, there are several considerations you should keep in mind before pursuing a multi-cloud strategy. Here are some of the key considerations:

  1. Business objectives. The first consideration is your business objectives. You need to determine why you want to pursue a multi-cloud strategy and what you hope to achieve. For example, you may be looking to improve the resilience of your applications or reduce vendor lock-in.
  2. Compatibility. The next consideration is the compatibility of your applications and workloads across different cloud providers. You need to ensure that your applications and workloads are compatible with the different cloud providers you plan to use. You may need to modify your applications and workloads to ensure they can run on multiple cloud platforms.
  3. Data management. Another important consideration is data management. You need to ensure that your data is managed securely and efficiently across all the cloud providers you use. This may involve implementing data management policies and tools to ensure that your data is always available and protected.
  4. Cost management. Managing costs is also a critical consideration. You need to ensure that you can manage costs effectively across all the cloud providers you use. This may involve using cost management tools and monitoring usage and costs to identify areas where you can optimize spending.
  5. Security. Security is always a key consideration, but it becomes even more important when using multiple cloud providers. You need to ensure that your applications and data are secure across all the cloud providers you use. This may involve implementing security policies and using security tools to detect and respond to security threats.
  6. Skills and resources. Finally, you need to consider the skills and resources required to manage a multi-cloud environment. This may involve hiring additional staff or up-skilling existing staff to ensure that they have the necessary expertise to manage a multi-cloud environment.

By considering these key factors, you can develop a successful multi-cloud strategy that meets your business objectives and helps you achieve your goals.

Recap

Optimizing cloud storage requires careful planning, monitoring, and management. Analyzing your data helps you understand what you are storing and how it is used, while compression and deduplication reduce storage costs and improve performance. Monitoring tools and regular audits give you visibility into usage so you can optimize resources, control costs, and maintain security and compliance, and a well-planned multi-cloud strategy can further support your business objectives. Taken together, these practices reduce your storage costs, improve your data management, and help you get the most out of your cloud storage investment.

The post The Modern Data Ecosystem: Optimize Your Storage appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-optimize-your-storage/feed/ 0
The Modern Data Ecosystem: Use Auto-Scaling https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-auto-scaling/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-auto-scaling/#respond Thu, 01 Jun 2023 01:32:16 +0000 https://www.unraveldata.com/?p=12715 Abstract Chart Background

Auto-Scaling Overview This is the second blog in a five-blog series. For an overview of this blog series, please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of Time. The series […]

The post The Modern Data Ecosystem: Use Auto-Scaling appeared first on Unravel.

]]>
Abstract Chart Background

Auto-Scaling Overview

This is the second blog in a five-blog series. For an overview of this blog series, please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of Time. The series should be read in order.

Auto-scaling is a powerful feature of cloud computing that allows you to automatically adjust the resources allocated to your applications based on changes in demand. Here are some best practices for using auto-scaling in the cloud:

  1. Set up appropriate triggers. Set up triggers based on metrics such as CPU utilization, network traffic, or memory usage to ensure that your application scales up or down when needed.
  2. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and reliability. This will also help you to avoid any potential single points of failure.
  3. Start with conservative settings. Start with conservative settings for scaling policies to avoid over-provisioning or under-provisioning resources. You can gradually increase the thresholds as you gain more experience with your application.
  4. Monitor your auto-scaling. Regularly monitor the performance of your auto-scaling policies to ensure that they are working as expected. You can use monitoring tools such as CloudWatch to track metrics and troubleshoot any issues.
  5. Use automated configuration management. Use tools such as Chef, Puppet, or Ansible to automate the configuration management of your application. This will make it easier to deploy and scale your application across multiple instances.
  6. Test your auto-scaling policies. Test your auto-scaling policies under different load scenarios to ensure that they can handle sudden spikes in traffic. You can use load testing tools such as JMeter or Gatling to simulate realistic load scenarios.

By following these best practices, you can use auto-scaling in the cloud to improve the availability, reliability, and scalability of your applications.

Set Up Appropriate Triggers

Setting up appropriate triggers is an essential step when using auto-scaling in the cloud. Here are some best practices for setting up triggers:

  1. Identify the right metrics. Start by identifying the metrics that are most relevant to your application. Common metrics include CPU utilization, network traffic, and memory usage. You can also use custom metrics based on your application’s specific requirements.
  2. Determine threshold values. Once you have identified the relevant metrics, determine the threshold values that will trigger scaling. For example, you might set a threshold of 70% CPU utilization to trigger scaling up, and 30% CPU utilization to trigger scaling down.
  3. Set up alarms. Set up CloudWatch alarms to monitor the relevant metrics and trigger scaling based on the threshold values you have set. For example, you might set up an alarm to trigger scaling up when CPU utilization exceeds 70% for a sustained period of time.
  4. Use hysteresis. To avoid triggering scaling up and down repeatedly in response to minor fluctuations in metrics, use hysteresis. Hysteresis introduces a delay before triggering scaling in either direction, helping to ensure that scaling events are only triggered when they are really needed.
  5. Consider cooldown periods. Cooldown periods introduce a delay between scaling events, helping to prevent over-provisioning or under-provisioning of resources. When a scaling event is triggered, a cooldown period is started during which no further scaling events will be triggered. This helps to ensure that the system stabilizes before further scaling events are triggered.

By following these best practices, you can set up appropriate triggers for scaling in the cloud, ensuring that your application can scale automatically in response to changes in demand.
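
Putting the pieces above together, the Python/boto3 sketch below creates a simple scale-out policy for an EC2 Auto Scaling group and a CloudWatch alarm that triggers it when average CPU stays above 70% for three consecutive five-minute periods; the sustained evaluation periods provide the hysteresis and the Cooldown value provides the cooldown period described above. The group and policy names are hypothetical placeholders, and target-tracking policies are a common alternative to this alarm-driven approach.

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    GROUP_NAME = "web-app-asg"  # hypothetical Auto Scaling group

    # Scale-out policy: add one instance, then wait out a cooldown period
    # before any further scaling action can fire.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=GROUP_NAME,
        PolicyName="scale-out-on-high-cpu",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )

    # The trigger: average CPU above 70% for three consecutive 5-minute
    # periods (the sustained evaluation acts as hysteresis).
    cloudwatch.put_metric_alarm(
        AlarmName="asg-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": GROUP_NAME}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )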

Use Multiple Availability Zones

Using multiple availability zones is a best practice in the cloud to improve the availability and reliability of your application. Here are some best practices for using multiple availability zones:

  1. Choose an appropriate region. Start by choosing a region that is geographically close to your users to minimize latency. Consider the regulatory requirements, cost, and availability of resources when choosing a region.
  2. Deploy across multiple availability zones. Deploy your application across multiple availability zones within the same region to ensure high availability and fault tolerance. Availability zones are isolated data centers within a region that are designed to be independent of each other.
  3. Use load balancers. Use load balancers to distribute traffic across multiple instances in different availability zones. This helps to ensure that if one availability zone goes down, traffic can be automatically redirected to other availability zones.
  4. Use cross-zone load balancing. Enable cross-zone load balancing to distribute traffic evenly across all available instances, regardless of which availability zone they are in. This helps to ensure that instances in all availability zones are being utilized evenly.
  5. Monitor availability zones. Regularly monitor the availability and performance of instances in different availability zones. You can use CloudWatch to monitor metrics such as latency, network traffic, and error rates, and to set up alarms to alert you to any issues.
  6. Use automatic failover. Configure automatic failover for your database and other critical services to ensure that if one availability zone goes down, traffic can be automatically redirected to a standby instance in another availability zone.

By following these best practices, you can use multiple availability zones in the cloud to improve the availability and reliability of your application, and to minimize the impact of any potential disruptions.
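
As a small sketch of item 2 above, the following Python/boto3 call creates an EC2 Auto Scaling group whose instances are spread across two subnets in different availability zones. The group name, launch template, and subnet IDs are hypothetical placeholders; in practice you would also attach the group to a load balancer as described in items 3 and 4.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # All names and subnet IDs are hypothetical placeholders. The two subnets
    # live in different availability zones, so instances are spread across
    # zones automatically.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-app-asg",
        LaunchTemplate={"LaunchTemplateName": "web-app-template",
                        "Version": "$Latest"},
        MinSize=2,
        MaxSize=6,
        DesiredCapacity=2,
        VPCZoneIdentifier="subnet-0aaa1111bbb2222cc,subnet-0ddd3333eee4444ff",
    )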

Start with Conservative Settings

Over-provisioning or under-provisioning resources can lead to wasted resources or poor application performance, respectively. Here are some best practices to avoid these issues:

  1. Monitor resource usage. Regularly monitor the resource usage of your application, including CPU, memory, storage, and network usage. Use monitoring tools such as CloudWatch to collect and analyze metrics, and set up alarms to alert you to any resource constraints.
  2. Set appropriate thresholds. Set appropriate thresholds for scaling based on your application’s resource usage. Start with conservative thresholds, and adjust them as needed based on your monitoring data.
  3. Use automation. Use automation tools such as AWS Auto Scaling to automatically adjust resource provisioning based on demand. This can help to ensure that resources are provisioned efficiently and that you are not over-provisioning or under-provisioning.
  4. Use load testing. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your application’s performance. This can help you to identify any performance issues before they occur in production.
  5. Optimize application architecture. Optimize your application architecture to reduce resource usage, such as by using caching, minimizing database queries, and using efficient algorithms.
  6. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and fault tolerance, and to minimize the impact of any potential disruptions.

By following these best practices, you can ensure that you are not over-provisioning or under-provisioning resources in your cloud infrastructure, and that your application is running efficiently and reliably.
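
One way to start conservatively is a target tracking policy with a modest utilization target that you tighten later as monitoring data accumulates. The boto3 sketch below assumes a hypothetical Auto Scaling group named web-asg; the 50% target and warmup time are starting-point assumptions, not recommendations for every workload.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Target tracking keeps average CPU near the target; 50% is a conservative
    # starting point that leaves headroom while you gather monitoring data.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="cpu-target-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 50.0,
        },
        EstimatedInstanceWarmup=300,  # seconds before a new instance counts toward the metric
    )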

Monitor and Auto-Scale Your Cloud

The best way to monitor and auto-scale your cloud applications is by using a combination of monitoring tools, scaling policies, and automation tools. Here are some best practices for monitoring and auto-scaling your cloud apps:

  1. Monitor application performance. Use monitoring tools such as AWS CloudWatch to monitor the performance of your application. Collect metrics such as CPU utilization, memory usage, and network traffic, and set up alarms to notify you of any performance issues.
  2. Define scaling policies. Define scaling policies for each resource type based on the performance metrics you are monitoring. This can include policies for scaling based on CPU utilization, network traffic, or other metrics.
  3. Set scaling thresholds. Set conservative thresholds for scaling based on your initial analysis of resource usage, and adjust them as needed based on your monitoring data.
  4. Use automation tools. Use automation tools to automatically adjust resource provisioning based on demand. This can help to ensure that resources are provisioned efficiently and that you are not over-provisioning or under-provisioning.
  5. Use load testing. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your application’s performance. This can help you to identify any performance issues before they occur in production.
  6. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and fault tolerance, and to minimize the impact of any potential disruptions.
  7. Monitor and optimize. Regularly monitor the performance of your application and optimize your scaling policies based on the data you collect. This will help you to ensure that your application is running efficiently and reliably.

By following these best practices, you can ensure that your cloud applications are monitored and auto-scaled effectively, helping you to optimize performance and minimize the risk of downtime.
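
For step 1, a CloudWatch alarm on a core metric is a simple starting point. The sketch below uses boto3 with placeholder names and a hypothetical SNS topic; it raises an alarm when average CPU stays above 80% for three consecutive five-minute periods.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="web-asg-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
        Statistic="Average",
        Period=300,               # five-minute samples
        EvaluationPeriods=3,      # sustained for 15 minutes before alarming
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
    )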

Use Automated Configuration Management

Automated configuration management in the cloud can help you manage and provision your infrastructure efficiently and consistently. Here are some best practices for using automated configuration management in the cloud:

  1. Use infrastructure as code. Use infrastructure as code tools such as AWS CloudFormation or Terraform to define your infrastructure as code. This can help to ensure that your infrastructure is consistent across different environments and can be easily reproduced.
  2. Use configuration management tools. Use configuration management tools such as Chef, Puppet, or Ansible to automate the configuration of your servers and applications. These tools can help you ensure that your infrastructure is configured consistently and can be easily scaled.
  3. Use version control. Use version control tools such as Git to manage your infrastructure code and configuration files. This can help you to track changes to your infrastructure and roll back changes if necessary.
  4. Use testing and validation. Use testing and validation tools to ensure that your infrastructure code and configuration files are valid and that your infrastructure is properly configured. This can help you to avoid configuration errors and reduce downtime.
  5. Use monitoring and logging. Use monitoring and logging tools to track changes to your infrastructure and to troubleshoot any issues that arise. This can help you to identify problems quickly and resolve them before they impact your users.
  6. Use automation. Use automation tools such as AWS OpsWorks or AWS CodeDeploy to automate the deployment and configuration of your infrastructure. This can help you to deploy changes quickly and efficiently.

By following these best practices, you can use automated configuration management in the cloud to manage your infrastructure efficiently and consistently, reducing the risk of configuration errors and downtime, and enabling you to scale your infrastructure easily as your needs change.
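
To illustrate infrastructure as code and automation (points 1 and 6), here is a minimal sketch that defines a single S3 bucket as a CloudFormation template and creates the stack with boto3. A real template would be kept in version control and validated before deployment; the stack and bucket names here are placeholders.

    import json
    import boto3

    cloudformation = boto3.client("cloudformation", region_name="us-east-1")

    # A deliberately tiny template: one S3 bucket. Real templates grow from here
    # and live in version control alongside application code.
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "DataBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": "example-data-bucket-123456"},
            }
        },
    }

    cloudformation.create_stack(
        StackName="data-platform-dev",
        TemplateBody=json.dumps(template),
    )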

Testing Auto-Scaling Policies

Testing your auto-scaling policies is an important step in ensuring that your cloud infrastructure can handle changes in demand effectively. Here are some best practices for testing your auto-scaling policies:

  1. Use realistic test scenarios. Use realistic test scenarios to simulate the traffic patterns and demand that your application may experience in production. This can help you to identify any potential issues and ensure that your auto-scaling policies can handle changes in demand effectively.
  2. Test different scenarios. Test your auto-scaling policies under different scenarios, such as high traffic loads or sudden spikes in demand. This can help you to ensure that your policies are effective in a variety of situations.
  3. Monitor performance. Monitor the performance of your application during testing to identify any performance issues or bottlenecks. This can help you to optimize your infrastructure and ensure that your application is running efficiently.
  4. Validate results. Validate the results of your testing to ensure that your auto-scaling policies are working as expected. This can help you to identify any issues or areas for improvement.
  5. Use automation tools. Use automation tools such as AWS CloudFormation or Terraform to automate the testing process and ensure that your tests are consistent and reproducible.
  6. Use load testing tools. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your auto-scaling policies under different scenarios.

By following these best practices, you can ensure that your auto-scaling policies are effective and can handle changes in demand effectively, reducing the risk of downtime and ensuring that your application is running efficiently.
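
JMeter and Gatling are the usual tools here, but even a small script can generate enough concurrent traffic to watch a scaling policy react in a test environment. The sketch below is a hypothetical Python load generator using only the standard library; the target URL, concurrency, and request count are assumptions to adjust for your own test.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    TARGET_URL = "https://test.example.com/health"  # placeholder test endpoint
    CONCURRENCY = 50
    TOTAL_REQUESTS = 5000

    def hit(_: int) -> float:
        """Issue one request and return its latency in seconds (-1.0 on error)."""
        start = time.time()
        try:
            with urlopen(TARGET_URL, timeout=10) as response:
                response.read()
            return time.time() - start
        except Exception:
            return -1.0

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = list(pool.map(hit, range(TOTAL_REQUESTS)))

    ok = [l for l in latencies if l >= 0]
    print(f"ok: {len(ok)}/{len(latencies)}, avg latency: {sum(ok) / max(len(ok), 1):.3f}s")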

Recap

Auto-scaling can be leveraged to improve the availability, reliability, and scalability of your applications. Deploying across multiple availability zones improves the availability and reliability of cloud applications and minimizes the impact of any potential disruptions. Make changes conservatively and increase resources incrementally so that you don’t oversize. Managing your infrastructure efficiently and consistently reduces the risk of configuration errors and downtime, and conservative scaling lets you grow your infrastructure easily as your needs change. Together, these practices help you handle changes in demand effectively, reduce the risk of downtime, and keep your application running efficiently.

The post The Modern Data Ecosystem: Use Auto-Scaling appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-auto-scaling/feed/ 0
All Data Ecosystems Are Real Time, It Is Just a Matter of Time https://www.unraveldata.com/resources/all-data-ecosystems-are-real-time-it-is-just-a-matter-of-time/ https://www.unraveldata.com/resources/all-data-ecosystems-are-real-time-it-is-just-a-matter-of-time/#respond Thu, 01 Jun 2023 01:32:03 +0000 https://www.unraveldata.com/?p=12478 Data Pipeline Abstract

Overview: Six-Part Blog In this six-part blog I will demonstrate why what I call Services Oriented Data Architecture (SΘDΔ®) is the right data architecture for now and the foreseeable future. I will drill into specific examples […]

The post All Data Ecosystems Are Real Time, It Is Just a Matter of Time appeared first on Unravel.

]]>
Data Pipeline Abstract

Overview: Six-Part Blog

In this six-part blog I will demonstrate why what I call Services Oriented Data Architecture (SΘDΔ®) is the right data architecture for now and the foreseeable future. I will drill into specific examples of how to build the most optimal cloud data architecture regardless of your cloud provider. This will lay the foundation for SΘDΔ®. We will also define the Data Asset Management System (DΔḾṢ)®. DΔḾṢ is the modern data management system approach for advanced data ecosystems. The modern data ecosystem must focus on interchangeable, interoperable services and let the system focus on optimally storing, retrieving, and processing data. DΔḾṢ takes care of this for the modern data ecosystem.

We will drill into the exercises necessary to optimize the full stack of your cloud data ecosystem. These exercises will work regardless of the cloud provider. We will look at the best ways to store data regardless of type. Then we will drill into how to optimize your compute in the cloud. The compute is generally the most expensive of all cloud assets. We will also drill into how to optimize memory use. Finally, we will wrap up with examples of SΘDΔ®.

Modern data architecture is a framework for designing, building, and managing data systems that can effectively support modern data-driven business needs. It is focused on achieving scalability, flexibility, reliability, and cost-effectiveness, while also addressing modern data requirements such as real-time data processing, machine learning, and analytics.

Some of the key components of modern data architecture include:

  1. Data ingestion and integration. This involves collecting and integrating data from various sources, including structured and unstructured data, and ingesting it into the data system.
  2. Data storage and processing. This involves storing and processing data in a scalable, distributed, and fault-tolerant manner using technologies such as cloud storage, data lakes, and data warehouses.
  3. Data management and governance. This involves ensuring that data is properly managed, secured, and governed, including policies around data access, privacy, and compliance.
  4. Data analysis and visualization. This involves leveraging advanced analytics tools and techniques to extract insights from data and present them in a way that is understandable and actionable.
  5. Machine learning and artificial intelligence. This involves leveraging machine learning and AI technologies to build predictive models, automate decision-making, and enable advanced analytics.
  6. Data streaming and real-time processing. This involves processing and analyzing data in real time, allowing organizations to respond quickly to changing business needs.

Overall, modern data architecture is designed to help organizations leverage data as a strategic asset and gain a competitive advantage by making better data-driven decisions.

Cloud Optimization Best Practices

Running efficiently on the large cloud providers requires careful consideration of various factors, including your application’s requirements, the size and type of instances needed, and which services to leverage.

Here are some general tips to help you run efficiently on the large cloud providers:

  1. Choose the right instance types. The large cloud providers offer a wide range of instance types optimized for different workloads. Choose the instance type that best fits your application’s requirements to avoid over-provisioning or under-provisioning.
  2. Use auto-scaling. Auto-scaling allows you to scale your infrastructure up or down based on demand. This ensures that you have enough capacity to handle traffic spikes while minimizing costs during periods of low usage.
  3. Optimize your storage. The large cloud providers offer various storage options, each with its own performance characteristics and costs. Select the storage type that best fits your application’s needs.
  4. Use managed services. The large cloud providers offer various managed services that allow you to focus on your application’s business logic while the provider takes care of the underlying infrastructure. SaaS vendors manage the software, and PaaS vendors manage the platform.
  5. Monitor your resources. The major cloud providers provide various monitoring and logging tools that allow you to track your application’s performance and troubleshoot issues quickly. Use these tools to identify bottlenecks and optimize your infrastructure.
  6. Use a content delivery network (CDN). If your application serves static content, consider using a CDN to cache content closer to your users, reducing latency and improving performance.

By following these best practices, you can ensure that your application runs efficiently on the large cloud providers, providing a great user experience while minimizing costs.

The Optimized Way to Store Data in the Cloud

The best structure for storing data for reporting depends on various factors, including the type and volume of data, the reporting requirements, and the performance considerations. Here are some general guidelines for choosing a suitable structure for storing data for reporting:

  1. Use a dimensional modeling approach. Dimensional modeling is a database design technique that organizes data into dimensions and facts. It is optimized for reporting and analysis and can help simplify complex queries and improve performance. The star schema and snowflake schema are popular dimensional modeling approaches.
  2. Choose a suitable database type. Depending on the size and type of data, you can choose a suitable database type for storing data for reporting. Relational databases are the most common type of database used for reporting, but NoSQL databases can also be used for certain reporting scenarios.
  3. Normalize data appropriately. Normalization is the process of organizing data in a database to minimize data redundancy and improve data integrity. However, over-normalization can make querying complex and slow down reporting. Therefore, it is important to normalize data appropriately based on the reporting requirements.
  4. Use indexes to improve query performance. Indexes can help improve query performance by allowing the database to quickly find the data required for a report. Choose appropriate indexes based on the reporting requirements and the size of the data.
  5. Consider partitioning. Partitioning involves splitting large tables into smaller, more manageable pieces. It can improve query performance by allowing the database to quickly access the required data.
  6. Consider data compression. Data compression can help reduce the storage requirements of data and improve query performance by reducing the amount of data that needs to be read from disk.

Overall, the best structure for storing data for reporting depends on various factors, and it is important to carefully consider the reporting requirements and performance considerations when choosing a suitable structure.
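
To make the dimensional modeling and partitioning points concrete, here is a hedged PySpark sketch: a fact table of orders is joined to a small customer dimension and written as compressed Parquet, partitioned by date so reporting queries that filter on date read only the partitions they need. The table names, columns, and paths are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reporting-load").getOrCreate()

    # Fact and dimension tables in a simple star-schema layout (illustrative paths).
    orders = spark.read.parquet("s3://example-lake/raw/orders/")        # fact table
    customers = spark.read.parquet("s3://example-lake/dim/customers/")  # dimension table

    # Denormalize only what reporting needs, keeping the dimension small.
    report_ready = orders.join(customers, on="customer_id", how="left")

    # Partition by date and use a compressed columnar format so date-filtered
    # reports scan only the partitions (and columns) they need.
    (report_ready
        .write
        .mode("overwrite")
        .partitionBy("order_date")
        .option("compression", "snappy")
        .parquet("s3://example-lake/reporting/orders_enriched/"))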

Optimal Processing of Data in the Cloud

The best way to process data in the cloud depends on various factors, including the type and volume of data, the processing requirements, and the performance considerations. Here are some general guidelines for processing data in the cloud:

  1. Use cloud-native data processing services. Cloud providers offer a wide range of data processing services, such as AWS Lambda, GCP Cloud Functions, and Azure Functions, which allow you to process data without managing the underlying infrastructure. These services are highly scalable and can be cost-effective for small- to medium-sized workloads.
  2. Use serverless computing. Serverless computing is a cloud computing model in which the cloud provider manages the infrastructure and automatically scales the resources based on the workload. Serverless computing can be a cost-effective and scalable solution for processing data, especially for sporadic or bursty workloads.
  3. Use containerization. Containerization allows you to package your data processing code and dependencies into a container image and deploy it to a container orchestration platform, such as Kubernetes or Docker Swarm. This approach can help you achieve faster deployment, better resource utilization, and improved scalability.
  4. Use distributed computing frameworks. Distributed computing frameworks, such as Apache Hadoop, Spark, and Flink, allow you to process large volumes of data in a distributed manner across multiple nodes. These frameworks can be used for batch processing, real-time processing, and machine learning workloads.
  5. Use data streaming platforms. Data streaming platforms, such as Apache Kafka and GCP Pub/Sub, allow you to process data in real time and respond quickly to changing business needs. These platforms can be used for real-time processing, data ingestion, and event-driven architectures.
  6. Use machine learning and AI services. Cloud providers offer a wide range of machine learning and AI services, such as AWS SageMaker, GCP AI Platform, and Azure Machine Learning, which allow you to build, train, and deploy machine learning models in the cloud. These services can be used for predictive analytics, natural language processing, computer vision, and other machine learning workloads.

Overall, the best way to process data in the cloud depends on various factors, and it is important to carefully consider the processing requirements and performance considerations when choosing a suitable approach.
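
As a small illustration of the serverless option (points 1 and 2), here is a hypothetical AWS Lambda handler in Python that is triggered by an S3 upload and counts the records in the new object. The event shape follows the standard S3 notification format, but the bucket layout and record format are assumptions; a production function would add error handling and write its output somewhere durable.

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by an S3 'ObjectCreated' event; counts JSON-lines records."""
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]

        print(f"processed {len(rows)} records from s3://{bucket}/{key}")
        return {"bucket": bucket, "key": key, "record_count": len(rows)}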

Optimize Memory

The best memory size for processing 1 terabyte of data depends on the specific processing requirements and the type of processing being performed. In general, the memory size required for processing 1 terabyte of data can vary widely depending on the data format, processing algorithms, and performance requirements. For example, if you are processing structured data in a relational database, the memory size required will depend on the specific SQL query being executed and the size of the result set. In this case, the memory size required may range from a few gigabytes to several hundred gigabytes or more, depending on the complexity of the query and the number of concurrent queries being executed.

On the other hand, if you are processing unstructured data, such as images or videos, the memory size required will depend on the specific processing algorithm being used and the size of the data being processed. In this case, the memory size required may range from a few gigabytes to several terabytes or more, depending on the complexity of the algorithm and the size of the input data.

Therefore, it is not possible to give a specific memory size recommendation for processing 1 terabyte of data without knowing more about the specific processing requirements and the type of data being processed. It is important to carefully consider the memory requirements when designing the processing system and to allocate sufficient memory resources to ensure optimal performance.
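
While no single number fits every workload, a rough back-of-envelope estimate can anchor the discussion. The sketch below assumes an in-memory expansion factor and a fraction of the data held in memory at once; both numbers are assumptions you would replace with measurements from your own jobs.

    def estimate_memory_gb(data_tb: float,
                           expansion_factor: float = 2.5,
                           resident_fraction: float = 0.25) -> float:
        """Very rough memory estimate for a processing job.

        expansion_factor: how much larger data gets once deserialized in memory
                          (an assumption; measure for your format and engine).
        resident_fraction: share of the dataset held in memory at any one time
                           (an assumption; depends on partitioning and algorithm).
        """
        data_gb = data_tb * 1024
        return data_gb * expansion_factor * resident_fraction

    # 1 TB of input with the default assumptions works out to roughly 640 GB
    # of memory across the cluster.
    print(f"{estimate_memory_gb(1.0):.0f} GB")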

Services Oriented Data Architecture Is the Future for Data Ecosystems

A Services Oriented Data Architecture (SΘDΔ®) is an architectural approach used in cloud computing that focuses on creating and deploying software systems as a set of interconnected services. Each service performs a specific business function, and communication between services occurs over a network, typically using web-based protocols such as RESTful APIs.

In the cloud, SΘDΔ can be implemented using a variety of cloud computing technologies, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). In an SΘDΔ-based cloud architecture, services are hosted on cloud infrastructure, such as virtual machines or containers, and can be dynamically scaled up or down based on demand.

One of the key benefits of SΘDΔ in the cloud is its ability to enable greater agility and flexibility in software development and deployment. By breaking down a complex software system into smaller, more manageable services, SΘDΔ makes it easier to build, test, and deploy new features and updates. It also allows for more granular control over resource allocation, making it easier to optimize performance and cost.

Overall, service-based architecture is a powerful tool for building scalable, flexible, and resilient software systems in the cloud, especially data ecosystems.
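
As a toy illustration of a single service in such an architecture, here is a minimal Flask sketch that exposes one business function, looking up a customer, over a RESTful API. Everything here (route, data, port) is hypothetical; in practice each service would sit behind the platform’s load balancing, authentication, and autoscaling.

    from flask import Flask, jsonify, abort

    app = Flask(__name__)

    # Stand-in for the service's own data store; a real service would own a
    # database or call the platform's data layer.
    CUSTOMERS = {1: {"id": 1, "name": "Acme Corp", "tier": "gold"}}

    @app.route("/customers/<int:customer_id>", methods=["GET"])
    def get_customer(customer_id: int):
        """One narrowly scoped business function exposed over REST."""
        customer = CUSTOMERS.get(customer_id)
        if customer is None:
            abort(404)
        return jsonify(customer)

    if __name__ == "__main__":
        app.run(port=8080)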

Recap

In this blog we began a conversation about the modern data ecosystem. By following best practices, we can ensure that our cloud applications run efficiently on the large cloud providers, providing a great user experience while minimizing costs. We covered the following:

  1. The modern data architecture is designed to help organizations leverage data as a strategic asset and gain a competitive advantage by making better data-driven decisions.
  2. The best way to process data in the cloud depends on various factors, and it is important to carefully consider the processing requirements and performance considerations when choosing a suitable approach.
  3. Overall, service-based architecture is a powerful tool for building scalable, flexible, and resilient software systems in the cloud, especially data ecosystems.

The post All Data Ecosystems Are Real Time, It Is Just a Matter of Time appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/all-data-ecosystems-are-real-time-it-is-just-a-matter-of-time/feed/ 0
The Modern Data Ecosystem: Leverage Content Delivery Network (CDN) https://www.unraveldata.com/resources/the-modern-data-ecosystem-leverage-content-delivery-network-cdn/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-leverage-content-delivery-network-cdn/#respond Thu, 01 Jun 2023 01:31:46 +0000 https://www.unraveldata.com/?p=12744 Data Pipelines

Leverage Content Delivery Network Here are some best practices when using a Content Delivery Network (CDN): Choose the right CDN provider. Choose a CDN provider that is reliable, scalable, and has a global network of servers. […]

The post The Modern Data Ecosystem: Leverage Content Delivery Network (CDN) appeared first on Unravel.

]]>
Data Pipelines

Leverage Content Delivery Network

Here are some best practices when using a Content Delivery Network (CDN):

  1. Choose the right CDN provider. Choose a CDN provider that is reliable, scalable, and has a global network of servers. Look for providers that offer features such as caching, load balancing, and DDoS protection.
  2. Configure your CDN properly. Configure your CDN properly to ensure that it is delivering content efficiently and securely. This may include setting up caching rules, configuring SSL/TLS encryption, and configuring firewall rules.
  3. Monitor your CDN. Monitor your CDN to ensure that it is performing optimally and delivering content efficiently. This may include monitoring CDN usage, network latency, and caching effectiveness.
  4. Optimize your CDN. Optimize your CDN to ensure that it is delivering content as efficiently as possible. This may include using compression, optimizing images, and minifying JavaScript and CSS files.
  5. Use multiple CDNs. Consider using multiple CDNs to ensure that your content is always available and is delivered quickly. This may include using multiple providers or using multiple CDNs from the same provider.
  6. Test your CDN. Test your CDN to ensure that it is delivering content as expected. This may include conducting load testing to ensure that your CDN can handle expected levels of traffic and testing performance from different geographical locations.
  7. Use CDN analytics. Use CDN analytics to track CDN usage and monitor performance. This will help you identify any issues and optimize your CDN for better performance.
  8. Implement security measures. Implement security measures to protect your content and your users. This may include configuring SSL/TLS encryption, setting up firewalls, and using DDoS protection.

By following these best practices, you can ensure that your CDN is delivering content efficiently and securely, and is providing a positive user experience.

Choosing the Right CDN Provider

Here are some best practices when choosing the right Content Delivery Network (CDN) provider in the cloud:

  1. Understand your requirements. Before selecting a CDP, determine your requirements for content delivery. This includes identifying the types of content you want to deliver, the expected traffic volume, and the geographical locations of your users.
  2. Research multiple providers. Research multiple CDN providers to determine which one best meets your requirements. Look for providers with a global network of servers, high reliability, and good performance.
  3. Evaluate performance. Evaluate the performance of each provider by conducting tests from different geographical locations. This will help you determine which provider delivers content quickly and efficiently to your users.
  4. Consider cost. Consider the cost of each provider, including the cost of data transfer, storage, and other associated fees. Look for providers that offer flexible pricing models and transparent pricing structures.
  5. Evaluate security features. Evaluate the security features of each provider, including DDoS protection, SSL/TLS encryption, and firewalls. Look for providers that offer comprehensive security features to protect your content and your users.
  6. Check for integration. Check if the provider integrates well with your existing infrastructure and tools, such as your content management system, analytics tools, and monitoring tools.
  7. Look for analytics. Look for providers that offer analytics and reporting tools to help you track content delivery performance and optimize your content delivery.
  8. Check for support. Check the level of support offered by each provider, including support for troubleshooting, upgrades, and maintenance. Look for providers with a responsive support team that can help you resolve issues quickly.

By following these best practices, you can select the right CDN provider that meets your requirements and delivers content efficiently and securely to your users.

Configure Your CDN Properly

Here are some best practices when configuring your Content Delivery Network (CDN) properly in the cloud:

  1. Configure caching. Caching is a key feature of CDNs that enables content to be delivered quickly and efficiently. Configure caching rules to ensure that frequently accessed content is cached and delivered quickly.
  2. Configure content compression. Configure content compression to reduce the size of your files and improve performance. Gzip compression is a popular option that can be configured at the CDN level.
  3. Configure SSL/TLS encryption. Configure SSL/TLS encryption to ensure that content is delivered securely. Look for CDNs that offer free SSL/TLS certificates or have the option to use your own certificate.
  4. Configure firewall rules. Configure firewall rules to protect your content and your users. This includes setting up rules to block traffic from malicious sources and to restrict access to your content.
  5. Use multiple CDNs. Consider using multiple CDNs to improve performance and ensure availability. Use a multi-CDN strategy to distribute traffic across multiple CDNs, which can reduce latency and increase reliability.
  6. Configure DNS settings. Configure DNS settings to ensure that traffic is directed to your CDN. This includes configuring CNAME records or using a DNS provider that integrates with your CDN.
  7. Test your CDN configuration. Test your CDN configuration to ensure that it is working properly. This includes testing performance, testing from different geographical locations, and testing content delivery from different devices.
  8. Monitor your CDN configuration. Monitor your CDN configuration to ensure that it is delivering content efficiently and securely. This includes monitoring CDN usage, network latency, caching effectiveness, and security events.

By following these best practices, you can configure your CDN properly to ensure that content is delivered quickly, efficiently, and securely to your users.
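
Caching and compression settings often start at the origin. The sketch below uses boto3 to upload a pre-compressed stylesheet to an S3 origin with a Cache-Control header, so a CDN in front of the bucket can cache it and browsers know how to decode it. The bucket, key, and file names are placeholders.

    import gzip
    import boto3

    s3 = boto3.client("s3")

    css = open("styles.css", "rb").read()

    s3.put_object(
        Bucket="example-static-origin",           # placeholder origin bucket
        Key="assets/styles.css",
        Body=gzip.compress(css),                  # compress once at upload time
        ContentType="text/css",
        ContentEncoding="gzip",                   # tells browsers to decompress
        CacheControl="public, max-age=86400",     # let the CDN cache for a day
    )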

Monitor Your CDN

Here are some best practices when monitoring your Content Delivery Network (CDN) in the cloud:

  1. Monitor CDN usage. Monitor CDN usage to track how much content is being delivered and where it is being delivered. This can help you identify potential issues and optimize content delivery.
  2. Monitor network latency. Monitor network latency to ensure that content is being delivered quickly and efficiently. Use tools like Pingdom or KeyCDN’s real-time monitoring to identify network latency issues.
  3. Monitor caching effectiveness. Monitor caching effectiveness to ensure that frequently accessed content is being cached and delivered quickly. Use CDN analytics and monitoring tools to track caching effectiveness.
  4. Monitor security events. Monitor security events to ensure that your content and your users are protected from security threats. This includes monitoring for DDoS attacks, intrusion attempts, and other security events.
  5. Monitor CDN performance. Monitor CDN performance to ensure that content is being delivered efficiently and that the CDN is performing optimally. This includes monitoring server response times, cache hit rates, and other performance metrics.
  6. Monitor user experience. Monitor user experience to ensure that users are able to access content quickly and efficiently. Use tools like Google Analytics or Pingdom to monitor user experience and identify issues.
  7. Monitor CDN costs. Monitor CDN costs to ensure that you are staying within your budget. Use CDN cost calculators to estimate costs and monitor usage to identify potential cost savings.
  8. Set up alerts. Set up alerts to notify you when issues arise, such as network latency spikes, security events, or server downtime. Use CDN monitoring tools to set up alerts and notifications.

By following these best practices, you can monitor your CDN effectively to ensure that content is being delivered quickly, efficiently, and securely to your users, while staying within your budget.
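
If your CDN is Amazon CloudFront, many of these signals are available as CloudWatch metrics. The sketch below pulls request volume and overall error rate for one distribution over the last hour; the distribution ID is a placeholder, and metric names will differ on other CDNs.

    from datetime import datetime, timedelta
    import boto3

    # CloudFront metrics are published to CloudWatch in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    now = datetime.utcnow()

    for metric in ("Requests", "TotalErrorRate"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/CloudFront",
            MetricName=metric,
            Dimensions=[
                {"Name": "DistributionId", "Value": "E1234567890ABC"},  # placeholder
                {"Name": "Region", "Value": "Global"},
            ],
            StartTime=now - timedelta(hours=1),
            EndTime=now,
            Period=300,
            Statistics=["Sum" if metric == "Requests" else "Average"],
        )
        print(metric, stats["Datapoints"])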

Optimize Your Content Delivery Network

Optimizing your Content Delivery Network (CDN) in the cloud is crucial to ensure fast and reliable content delivery to your users. Here are some best practices to follow:

  1. Choose the right CDN provider. Research different CDN providers and choose the one that meets your needs in terms of cost, performance, and geographical coverage.
  2. Use a multi-CDN approach. Consider using multiple CDN providers to improve your content delivery performance and reliability.
  3. Optimize your CDN configuration. Configure your CDN to serve static assets, such as images, videos, and files, directly from the CDN cache, while serving dynamic content, such as HTML pages, from your origin server.
  4. Use caching effectively. Set appropriate cache control headers for your content to ensure that it is cached by the CDN and served quickly to your users.
  5. Monitor your CDN performance. Monitor your CDN performance regularly and identify any issues or bottlenecks that may be affecting your content delivery performance.
  6. Use compression. Use compression techniques, such as gzip compression, to reduce the size of your content and improve its delivery speed.
  7. Optimize DNS resolution. Use a global DNS service to optimize DNS resolution and reduce the time it takes for your users to access your content.
  8. Implement HTTPS. Implement HTTPS to ensure secure content delivery and improve your search engine ranking.
  9. Consider using edge computing. Consider using edge computing to offload some of the processing and caching tasks to the CDN edge servers, which can help reduce the load on your origin servers and improve content delivery performance.

By following these best practices, you can optimize your CDN in the cloud and ensure fast, reliable, and secure content delivery to your users.

Use Multiple Content Delivery Networks

Using multiple Content Delivery Networks (CDNs) in the cloud can improve the performance and reliability of your content delivery, but it also requires careful management to ensure optimal results. Here are some best practices to follow when using multiple CDNs:

  1. Use a multi-CDN management platform. Consider using a multi-CDN management platform to manage your CDNs and monitor their performance. These platforms allow you to configure your CDNs and dynamically route traffic to the optimal CDN based on real-time performance metrics.
  2. Define a clear CDN selection strategy. Develop a clear strategy for selecting the CDN to use for each request. Consider factors such as geographical proximity, latency, and availability when selecting a CDN.
  3. Test your CDNs. Regularly test the performance of your CDNs using real-world scenarios. This will help you identify any issues and ensure that your CDNs are performing optimally.
  4. Implement consistent caching policies. Ensure that your caching policies are consistent across all your CDNs. This will help avoid cache misses and improve content delivery performance.
  5. Implement a failover strategy. Define a failover strategy to ensure that your content is always available even if one of your CDNs experiences an outage. This may involve dynamically routing traffic to a backup CDN or using your origin server as a fallback.
  6. Monitor and optimize costs. Monitor your CDN costs and optimize your usage to ensure that you are getting the best value for your investment.
  7. Ensure security. Implement appropriate security measures, such as SSL encryption and DDoS protection, to ensure that your content is delivered securely and your CDNs are protected from attacks.

By following these best practices, you can successfully use multiple CDNs in the cloud to improve your content delivery performance and reliability.
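
A failover strategy can be sketched in a few lines, although real multi-CDN routing is usually handled by a management platform or DNS service rather than application code. The toy Python function below probes each CDN hostname in preference order and falls back to the origin; all hostnames and the health path are hypothetical.

    import urllib.request

    # Ordered by preference; in production the health signal would come from your
    # multi-CDN management platform or synthetic monitoring, not an inline probe.
    CDN_HOSTS = ["cdn-a.example.com", "cdn-b.example.com"]
    ORIGIN_HOST = "origin.example.com"

    def healthy(host: str) -> bool:
        """Toy health probe: can we fetch a tiny test object from this CDN?"""
        try:
            with urllib.request.urlopen(f"https://{host}/health.txt", timeout=2) as r:
                return r.status == 200
        except Exception:
            return False

    def pick_host() -> str:
        """Return the first healthy CDN, falling back to the origin server."""
        for host in CDN_HOSTS:
            if healthy(host):
                return host
        return ORIGIN_HOST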

Test Your Content Delivery Network

Testing your Content Delivery Network (CDN) in the cloud is crucial to ensure that it is performing optimally and delivering content quickly and reliably to your users. Here are some best practices to follow when testing your CDN:

  1. Define clear testing objectives. Define clear objectives for your CDN testing, such as identifying bottlenecks or measuring performance metrics. This will help you design your test scenarios and evaluate the results effectively.
  2. Use real-world testing scenarios. Use real-world scenarios to test your CDN, such as simulating traffic from different geographic locations, devices, and network conditions. This will help you identify any issues that may affect your users’ experience.
  3. Measure key performance metrics. Measure key performance metrics such as response time, latency, throughput, and cache hit rate. This will help you identify any areas of improvement and optimize your CDN performance.
  4. Test caching behavior. Test the caching behavior of your CDN by measuring cache hit and miss rates for different types of content. This will help you optimize your caching policies and reduce your origin server load.
  5. Test CDN failover and disaster recovery. Test your CDN failover and disaster recovery mechanisms to ensure that your content is always available even if one or more CDNs experience an outage.
  6. Use automated testing tools. Use automated testing tools to simulate traffic and measure performance metrics. This will help you perform tests more efficiently and accurately.
  7. Test regularly. Regularly test your CDN to ensure that it is performing optimally and identify any issues as soon as possible.

By following these best practices, you can effectively test your CDN in the cloud and ensure fast and reliable content delivery to your users.

Use CDN Analytics

Analytics play a critical role in optimizing your Content Delivery Network (CDN) in the cloud, as they provide insights into your CDN performance and user behavior. Here are some best practices to follow when using CDN analytics in the cloud:

  1. Define clear goals. Define clear goals for your CDN analytics, such as identifying areas of improvement or measuring user engagement. This will help you select the appropriate analytics tools and collect the right data.
  2. Use a multi-CDN analytics platform. If you are using multiple CDNs, consider using a multi-CDN analytics platform to consolidate data from multiple CDNs and provide a unified view of your content delivery performance.
  3. Monitor key performance metrics. Monitor key performance metrics such as page load time, bounce rate, and conversion rate. This will help you identify any performance issues and optimize your content delivery.
  4. Use real-time analytics. Use real-time analytics to monitor your CDN performance and user behavior in real time. This will help you identify any issues as soon as they occur and take immediate action.
  5. Use A/B testing. Use A/B testing to test different CDN configurations or content variations and measure the impact on user engagement and performance.
  6. Use data visualization tools. Use data visualization tools to help you visualize and analyze your CDN performance data effectively. This will help you identify trends and patterns that may be difficult to detect using raw data.
  7. Optimize CDN configuration based on analytics insights. Use insights from your CDN analytics to optimize your CDN configuration, such as adjusting caching policies, optimizing content delivery routes, or reducing page load time.

By following these best practices, you can use CDN analytics in the cloud to optimize your content delivery performance, improve user engagement, and achieve your business goals.

Implement Security Measures

Implementing security measures in the cloud is essential to protect your applications, data, and infrastructure from cyber threats. Here are some best practices to follow when implementing security measures in the cloud:

  1. Use strong authentication and access controls. Implement strong authentication mechanisms such as multi-factor authentication and use access controls to restrict access to your cloud resources based on the principle of least privilege.
  2. Implement network security controls. Use network security controls such as firewalls, intrusion detection and prevention systems (IDPS), and virtual private networks (VPNs) to protect your cloud infrastructure from network-based attacks.
  3. Implement encryption. Use encryption to protect sensitive data, both in transit and at rest. Implement encryption for data stored in the cloud, and use secure protocols such as HTTPS and TLS for data in transit.
  4. Regularly apply security patches and updates. Regularly apply security patches and updates to your cloud infrastructure and applications to ensure that you are protected against known vulnerabilities.
  5. Implement security monitoring and logging. Implement security monitoring and logging to detect and respond to security events. Use tools that provide visibility into your cloud environment and alert you to potential security threats.
  6. Use cloud-native security tools. Use cloud-native security tools and services such as security groups, network ACLs, and security information and event management (SIEM) to secure your cloud environment.
  7. Develop an incident response plan. Develop an incident response plan that outlines how you will respond to security incidents, including containment, investigation, and remediation.

By following these best practices, you can effectively implement security measures in the cloud and protect your applications, data, and infrastructure from cyber threats.

The post The Modern Data Ecosystem: Leverage Content Delivery Network (CDN) appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-modern-data-ecosystem-leverage-content-delivery-network-cdn/feed/ 0
DBS Empowers Self-Service Engineering with Unravel https://www.unraveldata.com/resources/dbs-empowers-self-service-engineering-with-unravel/ https://www.unraveldata.com/resources/dbs-empowers-self-service-engineering-with-unravel/#respond Thu, 25 May 2023 22:30:28 +0000 https://www.unraveldata.com/?p=12758

The post DBS Empowers Self-Service Engineering with Unravel appeared first on Unravel.

]]>

The post DBS Empowers Self-Service Engineering with Unravel appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dbs-empowers-self-service-engineering-with-unravel/feed/ 0
DBS Discusses Data+FinOps for Banking https://www.unraveldata.com/resources/dbs-bank-discusses-data-finops-for-banking/ https://www.unraveldata.com/resources/dbs-bank-discusses-data-finops-for-banking/#respond Thu, 25 May 2023 22:30:14 +0000 https://www.unraveldata.com/?p=12762 Abstract Chart Background

The post DBS Discusses Data+FinOps for Banking appeared first on Unravel.

]]>
Abstract Chart Background

The post DBS Discusses Data+FinOps for Banking appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dbs-bank-discusses-data-finops-for-banking/feed/ 0
DataOps Resiliency: Tracking Down Toxic Workloads https://www.unraveldata.com/resources/dataops-resiliency-tracking-down-toxic-workloads/ https://www.unraveldata.com/resources/dataops-resiliency-tracking-down-toxic-workloads/#respond Thu, 11 May 2023 21:12:07 +0000 https://www.unraveldata.com/?p=12163

By Jason Bloomberg, Managing Partner, Intellyx Part 4 of the Demystifying Data Observability Series for Unravel Data In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the […]

The post DataOps Resiliency: Tracking Down Toxic Workloads appeared first on Unravel.

]]>

By Jason Bloomberg, Managing Partner, Intellyx
Part 4 of the Demystifying Data Observability Series for Unravel Data

In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the connection between DevOps and DataOps, and data-centric FinOps best practices.

In this concluding article in the series, I’ll explore DataOps resiliency – not simply how to prevent data-related problems, but also how to recover from them quickly, ideally without impacting the business and its customers.

Observability is essential for any kind of IT resiliency – you can’t fix what you can’t see – and DataOps is no exception. Failures can occur anywhere in the stack, from the applications on down to the hardware. Understanding the root causes of such failures is the first step to fixing, or ideally preventing, them.

The same sorts of resiliency problems that impact the IT environment at large can certainly impact the data estate. Even so, traditional observability and incident management tools don’t address specific problems unique to the world of data processing.

In particular, DataOps resiliency must address the problem of toxic workloads.

Understanding Toxic Workloads

Toxic data workloads are as old as relational database management systems (RDBMSs), if not older. Anyone who works with SQL on large databases knows there are some queries that will cause the RDBMS to slow dramatically or completely grind to a halt.

The simplest example: SELECT * FROM TRANSACTIONS where the TRANSACTIONS table has millions of rows. Oops! Your resultset also has millions of rows!

JOINs, of course, are more problematic, because they are difficult to construct, and it’s even more difficult to predict their behavior in databases with complex structures.

Such toxic workloads caused problems in the days of single on-premises databases. As organizations implemented data warehouses, the risks compounded, requiring increasing expertise from a scarce cadre of query-building experts.

Today we have data lakes as well as data warehouses, often running in the cloud where the meter is running all the time. Organizations also leverage streaming data, as well as complex data pipelines that mix different types of data in real time.

With all this innovation and complexity, the toxic workload problem hasn’t gone away. In fact, it has gotten worse, as the nuances of such workloads have expanded.

Breaking Down the Toxic Workload

Poorly constructed queries are only one of the causes of a modern toxic workload. Other root causes include:

  • Poor quality data – one table with NULL values, for example, can throw a wrench into seemingly simple queries. Expand that problem to other problematic data types and values across various cloud-based data services and streaming data sources, and small data quality problems can easily explode into big ones.
  • Coding issues – Data engineers must create data pipelines following traditional coding practices – and whenever there’s coding, there are software bugs. In the data warehouse days, tracking down toxic workloads usually revealed problematic queries. Today, coding issues are just as likely to be the root cause.
  • Infrastructure issues – Tracking down the root causes of toxic workloads means looking everywhere – including middleware, container infrastructure, networks, hypervisors, operating systems, and even the hardware. Just because a workload runs too slow doesn’t mean it’s a data issue. You have to eliminate as many possible root causes as you can – and quickly.
  • Human issues – Human error may be the root cause of any of the issues above – but there is more to this story. In many cases, root causes of toxic workloads boil down to a shortage of appropriate skills among the data team or a lack of effective collaboration within the team. Human error will always crop up on occasion, but a skills or collaboration issue will potentially cause many toxic workloads over time.

The bottom line: DataOps resiliency includes traditional resiliency challenges but extends to data-centric issues that require data observability to address.
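
This is not how a data observability product works under the hood, but a toy pre-flight check makes the guardrail idea concrete: inspect a query before it runs and flag the classic red flags, such as SELECT * with no filter or limit. Real-world detection also has to weigh runtime metrics, data volumes, and history; the heuristics below are purely illustrative.

    import re

    def preflight_warnings(sql: str) -> list[str]:
        """Flag obvious toxic-workload patterns in a SQL string (toy heuristic)."""
        text = " ".join(sql.lower().split())
        warnings = []
        if re.search(r"select\s+\*", text):
            warnings.append("SELECT *: consider projecting only the needed columns")
        if "where" not in text and "limit" not in text:
            warnings.append("no WHERE or LIMIT: the result set may be huge")
        if text.count("join") >= 3:
            warnings.append("3+ JOINs: verify join keys and expected row counts")
        return warnings

    print(preflight_warnings("SELECT * FROM transactions"))
    # ['SELECT *: consider projecting only the needed columns',
    #  'no WHERE or LIMIT: the result set may be huge']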

Data Resiliency at Mastercard

Mastercard recently addressed its toxic workload problem on Hadoop, as well as Impala, Spark, and Hive.

The payment processor has petabytes of data across hundreds of nodes, as well as thousands of users who access the data in an ad hoc fashion – that is, they build their own queries.

Mastercard’s primary issue was poorly constructed queries, a combination of users’ inexperience as well as the complexity of the required queries.

In addition, the company faced various infrastructure issues, from overburdened data pipelines to maxed-out storage and disabled daemons.

All these problems led to application failures, system slowdowns and crashes, and resource bottlenecks of various types.

To address these issues, Mastercard brought in Unravel Data. Unravel quickly identified hundreds of unused data tables. Freeing up the associated resources improved query performance dramatically.

Mastercard also uses Unravel to help users tune their own query workloads as well as automate the monitoring of toxic workloads in progress, preventing the most dangerous ones from running in the first place.

Overall, Unravel helped Mastercard improve its mean time to recover (MTTR) – the best indicator of DataOps Resiliency.

The Intellyx Take

The biggest mistake an organization can make around DataOps observability and resiliency is to assume these topics are special cases of the broader discussion of IT observability and resiliency.

In truth, the areas overlap – after all, infrastructure issues are often the root causes of data-related problems – but without the particular focus on DataOps, many problems would fall through the cracks.

The need for this focus is why tools like Unravel’s are so important. Unravel adds AI optimization and automated governance to its core data observability capabilities, helping organizations optimize the cost, performance, and quality of their data estates.

DataOps resiliency is one of the important benefits of Unravel’s approach – not in isolation, but within the overall context for resiliency that is so essential to modern IT.

Copyright © Intellyx LLC. Unravel Data is an Intellyx customer. None of the other organizations mentioned in this article is an Intellyx customer. Intellyx retains final editorial control of this article. No AI was used in the production of this article.

The post DataOps Resiliency: Tracking Down Toxic Workloads appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-resiliency-tracking-down-toxic-workloads/feed/ 0
Data observability’s newest frontiers: DataFinOps and DataBizOps https://www.unraveldata.com/resources/datafinops-and-databizops/ https://www.unraveldata.com/resources/datafinops-and-databizops/#respond Thu, 20 Apr 2023 20:02:20 +0000 https://www.unraveldata.com/?p=11901 Computer Network Background Abstract

Check out Sanjeev Mohan’s, Principal, SanjMo & Former Gartner Research VP, Big Data and Advanced Analytics, chapter on “Data observability’s newest frontiers: DataFinOps and DataBizOps” in the book, Data Observability, The Reality. What you’ll learn from […]

The post Data observability’s newest frontiers: DataFinOps and DataBizOps appeared first on Unravel.

]]>
Computer Network Background Abstract

Check out the chapter on “Data observability’s newest frontiers: DataFinOps and DataBizOps” by Sanjeev Mohan, Principal at SanjMo and former Gartner Research VP for Big Data and Advanced Analytics, in the book Data Observability, The Reality.

What you’ll learn from the former VP of Research at Gartner:

  • DataFinOps defined
  • Why you need DataFinOps
  • The challenges of DataFinOps
  • DataFinOps case studies & more!

Don’t miss out on your chance to read this chapter and gain valuable insights from a top industry leader.

The post Data observability’s newest frontiers: DataFinOps and DataBizOps appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/datafinops-and-databizops/feed/ 0
Data Observability, The Reality eBook https://www.unraveldata.com/resources/data-observability-the-reality-ebook/ https://www.unraveldata.com/resources/data-observability-the-reality-ebook/#respond Tue, 18 Apr 2023 14:07:13 +0000 https://www.unraveldata.com/?p=11829 Abstract Chart Background

Thrilled to announce that Unravel is contributing a chapter in Ravit Jain’s new ebook, Data Observability, The Reality. What you’ll learn from reading this ebook: What data observability is and why it’s important Identify key components […]

The post Data Observability, The Reality eBook appeared first on Unravel.

]]>
Abstract Chart Background

Thrilled to announce that Unravel is contributing a chapter in Ravit Jain’s new ebook, Data Observability, The Reality.

What you’ll learn from reading this ebook:

  • What data observability is and why it’s important
  • Identify key components of an observability framework
  • Learn how to design and implement a data observability strategy
  • Explore real-world use cases and best practices for data observability
  • Discover tools and techniques for monitoring and troubleshooting data pipelines

Don’t miss out on your chance to read our chapter, “Automation and AI Are a Must for Data Observability,” and gain valuable insights from top industry leaders such as Sanjeev Mohan and others.

The post Data Observability, The Reality eBook appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-observability-the-reality-ebook/feed/ 0
Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving https://www.unraveldata.com/resources/solving-key-challenges-ml-lifecycle-data-observability-databricks-model-serving/ https://www.unraveldata.com/resources/solving-key-challenges-ml-lifecycle-data-observability-databricks-model-serving/#respond Tue, 04 Apr 2023 21:53:31 +0000 https://www.unraveldata.com/?p=11691

By Craig Wiley, Senior Director of Product Management, Databricks and Clinton Ford, Director of Product Marketing, Unravel Data Introduction Machine learning (ML) enables organizations to extract more value from their data than ever before. Companies who […]

The post Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving appeared first on Unravel.

]]>

By Craig Wiley, Senior Director of Product Management, Databricks and Clinton Ford, Director of Product Marketing, Unravel Data

Introduction

Machine learning (ML) enables organizations to extract more value from their data than ever before. Companies who successfully deploy ML models into production are able to leverage that data value at a faster pace than ever before. But deploying ML models requires a number of key steps, each fraught with challenges:

  • Data preparation, cleaning, and processing
  • Feature engineering
  • Training and ML experiments
  • Model deployment
  • Model monitoring and scoring

Figure 1. Phases of the ML lifecycle with Databricks Machine Learning and Unravel Data

Challenges at each phase

Data preparation and processing

Data preparation is a data scientist’s most time-consuming task. While there are many phases in the data science lifecycle, an ML model can only be as good as the data that feeds it. Reliable and consistent data is essential for training and machine learning (ML) experiments. Despite advances in data processing, a significant amount of effort is required to load and prepare data for training and ML experimentation. Unreliable data pipelines slow the process of developing new ML models.

Training and ML experiments

Once data is collected, cleansed, and refined, it is ready for feature engineering, model training, and ML experiments. The process is often tedious and error-prone, yet machine learning teams also need a way to reproduce and explain their results for debugging, regulatory reporting, or other purposes. Recording all of the necessary information about data lineage, source code revisions, and experiment results can be time-consuming and burdensome. Before a model can be deployed into production, it must have all of the detailed information for audits and reproducibility, including hyperparameters and performance metrics.

Model deployment and monitoring

While building ML models is hard, deploying them into production is even more difficult. For example, data quality must be continuously validated and model results must be scored for accuracy to detect model drift. What makes this challenge even more daunting is the breadth of ML frameworks and the required handoffs between teams throughout the ML model lifecycle – from data preparation and training to experimentation and production deployment. Model experiments are difficult to reproduce as the code, library dependencies, and source data change, evolve, and grow over time.

The solution

The ultimate hack to productionize ML is data observability combined with scalable, serverless, and automated ML model serving. Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies the challenges of data operations, improves performance, saves critical engineering time, and optimizes resource utilization.

Databricks Model Serving deploys machine learning models as a REST API, enabling you to build real-time ML applications like personalized recommendations, customer service chatbots, fraud detection, and more – all without the hassle of managing serving infrastructure.

Databricks + data observability

Whether you are building a lakehouse with Databricks for ML model serving, ETL, streaming data pipelines, BI dashboards, or data science, Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies operations, increases efficiency, and boosts productivity. Unravel provides AI insights to proactively pinpoint and resolve data pipeline performance issues, ensure data quality, and define automated guardrails for predictability.

Scalable training and ML experiments with Databricks

With Databricks, data science teams can build and train machine learning models using pre-installed, optimized libraries such as scikit-learn, TensorFlow, PyTorch, and XGBoost. MLflow integration with Databricks on AWS and Azure Databricks makes it easy to track experiments and store models in repositories.

MLflow tracks machine learning models through training and production. Information about the source code, data, configuration, and results is stored in a single location for quick and easy reference. MLflow also stores models and loads them in production. Because MLflow is built on open frameworks, many different services, applications, frameworks, and tools can access and consume the models and related details.
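
As a rough illustration of the experiment tracking described above, the sketch below logs a scikit-learn run with MLflow. The dataset, run name, and hyperparameters are illustrative; on Databricks the tracking server and experiment context are typically preconfigured for the notebook, while elsewhere you would configure them yourself.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a feature-engineered training set
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 8}

with mlflow.start_run(run_name="rf-baseline"):  # run name is illustrative
    model = RandomForestRegressor(**params, random_state=42).fit(X_train, y_train)

    # Record hyperparameters, a metric, and the model artifact so the
    # experiment can be audited and reproduced later.
    mlflow.log_params(params)
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")
```

Once a run is logged this way, its parameters, metrics, and model artifact appear alongside one another, and the same record can feed model registration and serving downstream.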

Serverless ML model deployment and serving

Databricks Serverless Model Serving accelerates data science teams’ path to production by simplifying deployments and reducing mistakes through integrated tools. With the new model serving service, you can do the following:

  • Deploy a model as an API with one click in a serverless environment.
  • Serve models with high availability and low latency using endpoints that can automatically scale up and down based on incoming workload.
  • Safely deploy the model using flexible deployment patterns such as progressive rollout or perform online experimentation using A/B testing.
  • Seamlessly integrate model serving with online feature store (hosted on Azure Cosmos DB), MLflow Model Registry, and monitoring, allowing for faster and error-free deployment.

Conclusion

You can now train, deploy, monitor, and retrain machine learning models, all on the same platform with Databricks Model Serving. Integrating the feature store with model serving and monitoring helps ensure that production models are leveraging the latest data to produce accurate results. The end result is increased availability and simplified operations for greater AI velocity and positive business impact.

Ready to get started and try it out for yourself? Watch this Databricks event to see it in action. You can read more about Databricks Model Serving and how to use it in the Databricks on AWS documentation and the Azure Databricks documentation. Learn more about data observability in the Unravel documentation.

The post Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving appeared first on Unravel.

Data Teams Summit Survey Infographic https://www.unraveldata.com/resources/data-teams-summit-survey-infographic/ https://www.unraveldata.com/resources/data-teams-summit-survey-infographic/#respond Wed, 29 Mar 2023 14:29:46 +0000 https://www.unraveldata.com/?p=11673 Data Graph

Thank you for your interest in the Data Teams Summit 2023 Survey Infographic.

Download the full pdf.

Take a look at last year’s survey results here.

The post Data Teams Summit Survey Infographic appeared first on Unravel.

DataFinOps: More on the menu than data cost governance https://www.unraveldata.com/resources/datafinops-more-on-the-menu-than-data-cost-governance/ https://www.unraveldata.com/resources/datafinops-more-on-the-menu-than-data-cost-governance/#respond Fri, 24 Feb 2023 20:14:25 +0000 https://www.unraveldata.com/?p=11573 Potato Menu

By Jason English, Principal Analyst, Intellyx
Part 3 in the Demystifying Data Observability Series, by Intellyx for Unravel Data

IT and data executives find themselves in a quandary about deciding how to wrangle an exponentially increasing volume of data to support their business requirements – without breaking an increasingly finite IT budget.

Like an overeager diner at a buffet who’s already loaded their plate with the cheap carbs of potatoes and noodles before they reach the protein-packed entrees, they need to survey all of the data options on the menu before formulating their plans for this trip.

In our previous chapters of this series, we discussed why DataOps needs its own kind of observability, and then how DataOps is a natural evolution of DevOps practices. Now there’s a whole new set of options in the data observability menu to help DataOps teams track the intersection of value and cost.

From ROI to FinOps

Executives can never seem to get their fill of ROI insights from IT projects, so they can measure bottom-line results or increase top-line revenue associated with each budget line item. After all, predictions about ROI can shape the perception of a company for its investors and customers.

Unfortunately, ROI metrics are often discussed at the start of a major technology product or services contract – and then forgotten as soon as the next initiative gets underway.

The discipline of FinOps burst onto the scene over the last few years, as a strategy to address the see-saw problem of balancing the CFO’s budget constraints with the CIO’s technology delivery requirements to best meet the current and future needs of customers and employees.

FinOps focuses on improving technology spending decisions of an enterprise using measurements that go beyond ROI, to assess the value of business outcomes generated through technology investments.

Some considerations frequently seen on the FinOps menu include:

  • Based on customer demand or volatility in our consumption patterns, should we buy capacity on-demand or reserve more cloud capacity?
  • Which FinOps tools should we buy, and what functionality should we build ourselves, to deliver this important new capability?
  • Which cloud cost models are preferred for capital expenditures (capex) projects and operational expenditures (opex)?
  • What is the potential risk and cost of known and unknown usage spikes, and how much should we reasonably invest in analysts and tools for preventative purposes?

As a discipline, FinOps has come a long way, building communities of interest among expert practitioners, product, business, and finance teams as well as solution providers through its own FinOps Foundation and instructional courses on the topic.

FinOps + DataOps = DataFinOps?

Real-time analytics and AI-based operational intelligence are enabling revolutionary business capabilities, enterprise-wide awareness, and innovative machine learning-driven services. All of this is possible thanks to a smorgasbord of cloud data storage and processing, cloud data lakes, cloud data warehouse, and cloud lakehouse options.

Unfortunately, the rich streams of data required for such sophisticated functionality bring along the unwanted side effect of elastically expanding budgetary waistbands, due to ungoverned cloud storage and compute consumption costs. Nearly a third of all data science projects go more than 40% over budget on cloud data, according to a recent survey – a huge delta between cost expectations and reality.

How can better observability into data costs help the organization wring more value from data assets without cutting into results, or risking cost surprises?

As it turns out, data has its own unique costs, benefits, and value considerations. Combining the disciplines of FinOps and DataOps – which I’ll dub DataFinOps just for convenience here – can yield a unique new set of efficiencies and benefits for the enterprise’s data estate.

Some of the unique considerations of DataFinOps:

  • Which groups within our company are the top spenders on cloud data analytics, and is anything anomalous about their spending patterns versus the expected budgets?
  • What is the value of improving data performance or decreasing the latency of our data estate by region or geography, in order to improve local accuracy, reduce customer and employee attrition and improve retention?
  • If we are moving to a multi-cloud, hybrid approach, what is an appropriate and realistic mix of reserved instances and spot resources for processing data of different service level agreements (SLAs)?
  • Where are we paying excessive ingress/egress fees within our data estate? Would it be more cost-effective to process workloads close to where the data lives, or to move the data elsewhere?
  • How much labor do our teams spend building and maintaining data pipelines, and what is that time worth?
  • Are cloud instances being intelligently right-sized and auto-scaled to meet demand?

Systems-oriented observability platforms such as DataDog and Dynatrace can measure system- or service-level telemetry, which is useful for a DevOps team looking at application-level cloud capacity and cost/performance ratios. Unfortunately, these tools do not dig into enough detail to answer data analytics-specific FinOps questions.

Taming a market of data options

Leading American grocery chain Kroger launched its 84.51° customer experience and data analytics startup to provide predictive data insights and precision marketing for its parent company and other retailers, across multiple cloud data warehouses such as Snowflake and Databricks, using data storage in multiple clouds such as Azure and GCP.

Using the Unravel platform for data observability, they were able to get a grip on data costs and value across multiple data platforms and clouds without having to train up more experts on the gritty details of data job optimization within each system.

“The end result is giving us tremendous visibility into what is happening within our platforms. Unravel gave recommendations to us that told us what was good and bad. It simply cut to the chase and told us what we really needed to know about the users and sessions that were problematic. It not only identified them, but then made recommendations that we could test and implement.”

– Jeff Lambert, Vice President Engineering, 84.51°

It’s still early days for this transformation, but a data cost reduction of up to 50% would go a long way toward extracting value from deep customer analytics, as transaction data volumes continue to increase by 2x or 3x a year as more sources come online.

The Intellyx Take

It would be nice if CFOs could just tell CIOs and CDOs to simply stop consuming and storing so much data, and have that reduce their data spend. But just like in real life, crash diets will never produce long-term results if the ‘all-you-can-eat’ data consumption pattern isn’t changed.

The hybrid IT underpinnings of advanced data-driven applications evolve almost every day. To achieve sustainable improvements in cost/benefit returns on data, analysts and data scientists would have to become experts on the inner workings of each public cloud and data warehousing vendor.

DataFinOps practices should encourage data team accountability for value improvements, but more importantly, they should give data teams the data observability, AI-driven recommendations, and governance controls necessary to both contain costs and stay ahead of the organization’s growing business demand for data across hybrid IT data resources and clouds.

©2023 Intellyx LLC. Intellyx is editorially responsible for the content of this document. At the time of writing, Unravel is an Intellyx customer. Image source: crayion.ai

The post DataFinOps: More on the menu than data cost governance appeared first on Unravel.

The Evolution from DevOps to DataOps https://www.unraveldata.com/resources/the-evolution-from-devops-to-dataops/ https://www.unraveldata.com/resources/the-evolution-from-devops-to-dataops/#respond Wed, 22 Feb 2023 16:11:51 +0000 https://www.unraveldata.com/?p=11556 Mesh Sphere Background

By Jason Bloomberg, President, Intellyx
Part 2 of the Demystifying Data Observability Series for Unravel Data

In part one of this series, fellow Intellyx analyst Jason English explained the differences between DevOps and DataOps, drilling down into the importance of DataOps observability.

The question he left open for this article: how did we get here? How did DevOps evolve to what it is today, and what parallels or differences can we find in the growth of DataOps?

DevOps Precursors

The traditional, pre-cloud approach to building custom software in large organizations separated the application development (‘dev’) teams from the IT operations (‘ops’) personnel responsible for running software in the corporate production environment.

In between these two teams, organizations would implement a plethora of processes and gates to ensure the quality of the code and that it would work properly in production before handing it to the ops folks to deploy and manage.

Such ‘throw it over the wall’ processes were slow and laborious, leading to deployment cycles many months long. The importance of having software that worked properly, so the reasoning went, was sufficient reason for such onerous delays.

Then came the Web. And the cloud. And enterprise digital transformation initiatives. All of these macro-trends forced enterprises to rethink their plodding software lifecycles.

Not only were they too slow to deliver increasingly important software capabilities, but business requirements would evolve far too quickly for the deployed software to keep up.

Such pressures led to the rise of agile software methodologies, cloud native computing, and DevOps.

Finding the Essence of DevOps

The original vision of DevOps was to pull together the dev and ops teams to foster greater collaboration, in hopes that software deployment cadences would accelerate while maintaining or improving the quality of the resulting software.

Over time, this ‘kumbaya’ vision of seamless collaboration itself evolved. Today, we can distill the essence of modern DevOps into these five elements:

  • A cultural and organizational shift away from the ‘throw it over the wall’ mentality to greater collaboration across the software lifecycle
  • A well-integrated, comprehensive automation suite that supports CI/CD activities, along with specialists who manage and configure such technologies, i.e., DevOps engineers
  • A proactive, shift-left mentality that seeks to represent production behavior declaratively early in the lifecycle for better quality control and rapid deployment
  • Full-lifecycle observability that shifts problem resolution to the left via better prediction of problematic behavior and preemptive mitigation of issues in production
  • Lean practices to deliver value and improve efficiency throughout the software development lifecycle

Furthermore, DevOps doesn’t live in a vacuum. Rather, it is consistent with and supports other modern software best practices, including infrastructure-as-code, GitOps, and the ‘cattle not pets’ approach to supporting the production environment via metadata representations that drive deployment.

The Evolution of DataOps

Before information technology (IT), organizations had management of information systems (MIS). And before MIS, at the dawn of corporate computing, enterprises implemented data processing (DP).

The mainframes at the heart of enterprise technology as far back as the 1960s were all about processing data – crunching numbers in batch jobs that yielded arcane business results, typically dot-matrix printed on green and white striped paper.

Today, IT covers a vast landscape of technology infrastructure, applications, and hybrid on-premises and cloud environments – but data processing remains at the heart of what IT is all about.

Early in the evolution of DP, it became clear that the technologies necessary for processing transactions were different from the technology the organization required to provide business intelligence to line-of-business (LoB) professionals.

Enterprises required parallel investments in online transaction processing (OLTP) and online analytical processing (OLAP), respectively. OLAP proved the tougher nut to crack, because enterprises generated voluminous quantities of transactional data, while LoB executives required complex insights that would vary over time – thus taxing the ability of the data infrastructure to respond to the business need for information.

To address this need, data warehouses exploded onto the scene, separating the work of OLAP into two phases: transforming and loading data into the warehouses and supporting business intelligence via queries of the data in them.

Operating these early OLAP systems was relatively straightforward, centering on administering the data warehouses. In contrast, today’s data estate – the sum total of all the data infrastructure in a modern enterprise – is far more varied than in the early data warehousing days.

Motivations for DataOps

Operating this data estate has also become increasingly complex, as the practice of DataOps rises in today’s organizations.

Complexity, however, is only one motivation for DataOps. There are more reasons why today’s data estate requires it:

  • Increased mission-criticality of data, as digital transformations rework organizations into digital enterprises
  • Increased importance of real-time data, a capability that data warehouses never delivered
  • Greater diversity of data-centric use cases beyond basic business intelligence
  • Increased need for dynamic applications of data, as different LoBs need an ever-growing variety of data-centric solutions
  • Growing need for operational cost predictability, optimization, and governance

Driving these motivations is the rise of AI, which shifts software behavior from being code-based to data-based. In other words, AI is more than just another data-centric use case. It repositions data as the central driver of software functionality for the enterprise.

The Intellyx Take

For all these reasons, DataOps can no longer follow the simplistic data warehouse administration pattern of the past. Today’s data estate is dynamic, diverse, and increasingly important, requiring organizations to take a full-lifecycle approach to collecting, transforming, storing, querying, managing, and consuming data.

As a result, DataOps requires the application of core DevOps practices along the data lifecycle. DataOps requires the cross-lifecycle collaboration, full-lifecycle automation and observability, and the shift-left mentality that DevOps brings to the table – only now applied to the enterprise data estate.

Thinking of DataOps as ‘DevOps for data’ may be too simplistic an explanation of the role DataOps should play. Instead, it might be more accurate to say that as data increasingly becomes the driver of software behavior, DataOps becomes the new DevOps.

Next up in part 3 of this series: DataFinOps: More on the menu than data cost governance

Copyright © Intellyx LLC. Unravel Data is an Intellyx customer. Intellyx retains final editorial control of this article.

The post The Evolution from DevOps to DataOps appeared first on Unravel.

Why do we need DataOps Observability? https://www.unraveldata.com/resources/why-do-we-need-dataops-observability/ https://www.unraveldata.com/resources/why-do-we-need-dataops-observability/#respond Mon, 13 Feb 2023 14:36:50 +0000 https://www.unraveldata.com/?p=11547 Field Over Needles by Phil O'Driscoll

By Jason English, Principal Analyst, Intellyx
Part 1 of the Demystifying Data Observability Series for Unravel Data

Don’t we already have DevOps?

DevOps was started more than a decade ago as a movement, not a product or solution category.

DevOps offered us a way of collaborating between development and operations teams, using automation and optimization practices to continually accelerate the release of code, measure everything, lower costs, and improve the quality of application delivery to meet customer needs.

Today, almost every application delivery shop naturally aspires to take flight with DevOps practices, and operate with more shared empathy and a shared commitment to progress through faster feature releases and feedback cycles.

DevOps practices also include better management practices such as self-service environments, test and release automation, monitoring, and cost optimization.

On the journey toward DevOps, teams who apply this methodology deliver software more quickly, securely, reliably, and with less burnout.

For dynamic applications to deliver a successful user experience at scale, we still need DevOps to keep delivery flowing. But as organizations increasingly view data as a primary source of business value, data teams are tasked with building and delivering reliable data products and data applications. Just as DevOps principles emerged to enable efficient and reliable delivery of applications by software development teams, DataOps best practices are helping data teams solve a new set of data challenges.

What is DataOps?

If “data is the new oil,” as pundits like to say, then data is also the most valuable resource in today’s modern data-driven application world.

The combination of commodity hardware, ubiquitous high-bandwidth networking, cloud data warehouses, and infrastructure abstraction methods like containers and Kubernetes creates an exponential rise in our ability to use data itself to dynamically compose functionality such as running analytics and informing machine learning-based inference within applications.

Enterprises recognized data as a valuable asset, welcoming the newly minted CDO (chief data officer) role to the C-suite, with responsibility for data and data quality across the organization. While leading-edge companies like Google, Uber, and Apple increased their return on data investment by mastering DataOps, many leaders struggled to staff up with enough data scientists, data analysts, and data engineers to properly capitalize on this trend.

Progressive DataOps companies began to drain data swamps by pouring massive amounts of data (and investment) into a new modern ecosystem of cloud data warehouses and data lakes, from open source Hadoop and Kafka clusters to vendor-managed services like Databricks, Snowflake, Amazon EMR, BigQuery, and others.

The elastic capacity and scalability of cloud resources allowed new kinds of structured, semi-structured, and unstructured data to be stored, processed and analyzed, including streaming data for real-time applications.

As these cloud resources quickly grew and scaled, they became a complex tangle of data sources, pipelines, dashboards, and machine learning models, with a variety of interdependencies, owners, stakeholders, and products with SLAs. Getting additional cloud resources and launching new data pipelines was easy, but operating them well required a lot of effort, and making sense of the business value of any specific component to prioritize data engineering efforts became a huge challenge.

Software teams went through the DevOps revolution more than a decade ago, and even before that, there were well-understood software engineering disciplines for design/build/deploy/change, as well as monitoring and observability. Before DataOps, data teams didn’t typically think about test and release cycles, or misconfiguration of the underlying infrastructure itself.

Where DevOps optimized the lifecycle of software from coding to release, DataOps is about the flow of data, breaking data out of work silos to collaborate on the movement of data from its inception, to its arrival, processing, and use within modern data architectures to feed production BI and machine learning applications.

DataOps jobs, especially in a cloudy, distributed data estate, aren’t the same as DevOps jobs. For instance, if a cloud application becomes unavailable, DevOps teams might need to reboot the server, adjust an API, or restart the K8s cluster.

If a DataOps-led application starts failing, it may show incorrect results instead of simply crashing, and cause leaders informed by faulty analytics and AI inferences to make disastrous business decisions. Figuring out the source of data errors and configuration problems can be maddeningly difficult, and DataOps teams may even need to restore the whole data estate – including values moving through ephemeral containers and pipelines – back to a valid, stable state for that point in time.

Why does DataOps need its own observability?

Once software observability started finding its renaissance within DevOps practices and early microservices architectures five years ago, we also started seeing some data management vendors pivoting to offer ‘data observability’ solutions.

The original concept of data observability was concerned with database testing, properly modeling, addressing and scaling databases, and optimizing the read/write performance and security of both relational and cloud back ends.

In much the same way that the velocity and automated release cadence of DevOps meant dev and ops teams needed to shift component and integration testing left, data teams need to tackle data application performance and data quality earlier in the DataOps lifecycle.

In essence, DataOps teams are using agile and other methodologies to develop and deliver analytics and machine learning at scale. Therefore, they need DataOps observability to clarify the complex inner plumbing of the apps, pipelines, and clusters handling that moving data. Savvy DataOps teams must monitor ever-increasing numbers of unique data objects moving through data pipelines.

The KPIs for measuring success in DataOps observability include metrics and metadata that standard observability tools would never see: differences or anomalies in data layout, table partitioning, data source lineages, degrees of parallelism, data job and subroutine runtimes and resource utilization, interdependencies and relationships between data sets and cloud infrastructures – and the business tradeoffs between speed, performance and cost (or FinOps) of implementing recommended changes.

The Intellyx Take

A recent survey noted that 97 percent of data engineers ‘feel burned out’ at their current jobs, and 70 percent say they are likely to quit within a year! That’s a wakeup call for why DataOps observability matters now more than ever.

We must maintain the morale of understaffed and overworked data teams, given that these experts take a long time to train and are almost impossible to replace in today’s tight technical recruiting market.

Any enterprise that intends to deliver modern DataOps should first consider equipping data teams with DataOps observability capabilities. Observability should go beyond the traditional metrics and telemetry of application code and infrastructure, empowering DataOps teams to govern data and the resources used to refine and convert raw data into business value as it flows through their cloud application estates.

Next up in part 2 of this series: Jason Bloomberg on the transformation from DevOps to DataOps!

©2022 Intellyx LLC. Intellyx retains editorial control of this document. At the time of writing, Unravel Data is an Intellyx customer. Image credit: Phil O’Driscoll, flickr CC2.0 license, compositing pins by author.

The post Why do we need DataOps Observability? appeared first on Unravel.

Enabling Strong Engineering Practices at Maersk https://www.unraveldata.com/resources/enabling-strong-engineering-practices-at-maersk/ https://www.unraveldata.com/resources/enabling-strong-engineering-practices-at-maersk/#respond Thu, 09 Feb 2023 17:32:45 +0000 https://www.unraveldata.com/?p=11478

As DataOps moves along the maturity curve, many organizations are deciphering how to best balance the success of running critical jobs with optimized time and cost governance.

Watch the fireside chat from Data Teams Summit where Mark Sear, Head of Data Platform Optimization for Maersk, shares how his team is driving towards enabling strong engineering practices, design tenets, and culture at one of the largest shipping and logistics companies in the world. Transcript below.

Transcript

Kunal Agarwal:

Very excited to have a fireside chat here with Mark Sear. Mark, you’re the director of data integration, AI, machine learning, and analytics at Maersk. And Maersk is one of the largest shipping line and logistics companies in the world. Based out of Copenhagen, but with subsidiaries and offices across 130 countries with about 83,000 employees worldwide. We know that we always think about logistics and shipping as something just working harmoniously, transparently in the background, but in the recent past, given all of the supply chain pressures that have happened with the pandemic and beyond, and even that ship getting stuck in the Suez Canal, I think a lot more people are paying attention to this industry as well. So I was super excited to have you here, Mark, to hear more about yourself, you as the leader of data teams, and about what Maersk is doing with data analytics. Thank you so much for joining us.

Mark Sear:

It’s an absolute pleasure. You’ve just illustrated the perils of Wikipedia. Maersk is not just one of the largest shipping companies in the world, but we’re also actually one of the largest logistics companies in the world. We have our own airline. We’ve got hundreds of warehouses globally. We’re expanding massively, so we are there and of course we are a leader in decarbonation. We’ve got a pledge to be carbon-neutral way before just about anybody else. So it’s a fantastic company to work at. Often I say to my kids, we don’t just deliver stuff, we are doing something to help the planet. It’s a bigger mission than just delivering things, so it’s a pleasure to be here.

Kunal Agarwal:

That’s great. Mark, before we get into Maersk, we’d love to learn about you. So you have an amazing background and accumulation of all of these different experiences. Would you help the audience to understand some of your interests and how you got to be in the role that you currently are at? And what does your role comprise inside of Maersk?

Mark Sear:

Wow. It’s a long story. I’m an old guy, so I’m just couple of years over 60 now, which you could say you don’t look it, but don’t worry about it.

Kunal Agarwal:

You don’t look it at all, only 40.

Mark Sear:

I’m a generation that didn’t, not many of us went to university, so let me start there. So I left school at 18, did a bit of time in the basic military before going to what you would call, I suppose fundamentally, a crypto analyst school. They would detect how smart you were, whether you had a particular thing for patterns, and they sent me there. Did that, and then since then I’ve worked in banking, in trading in particular. I ran a big trading group for a major bank, which was great fun, so we were using data all the time to look for both, not just arbitrage, but other things. Fundamentally, my life has been about data.

Kunal Agarwal:

Right.

Mark Sear:

Even as a kid, my dad had a very small business and because he didn’t know anything about computers, I would do the computing for him and work out the miles per gallon that his trucks were getting and what the trade-in was.

Kunal Agarwal:

Sure.

Mark Sear:

And things like that. So data’s been part of my life and I love everything about data and what it can do for people, companies, everything. Yeah, that’s it. Data.

Kunal Agarwal:

That’s great, Mark. Obviously this is a conference spot, a data team, so it’s great to hear from the data guy who’s been doing it for a really long time. So, Mark, to begin, Maersk, as you said, is one of the largest shipping and logistics companies in the world. How has data transformed your company?

Mark Sear:

One thing, this is a great question. How has it transformed and how will it transform?

Kunal Agarwal:

Yes.

Mark Sear:

I think that for the first time in the last couple of years, and I’ve been very lucky, I’ve only been with the company three years, but shortly after I joined, we had a new tech leader, a gentleman called Navneet Kapoor. The guy is a visionary. If you imagine shipping was seen for many years, there’s a bit of a backwater really. You move containers from one country to another country on ships, that was it. Navneet has changed the game for us all and made people realize that data is pervasive in logistics. It’s literally everywhere. If you think about our biggest ship, ship class, for example, it’s called an E-Class. That can take over 18,000 shipping containers on one journey from China to Europe, 18,000.

Kunal Agarwal:

Oh wow.

Mark Sear:

Think about that. So that’s absolutely huge. Now, to put that into context, in one journey, one of those ships will move more goods than was moved in the entire 19th century between continents, one journey. And we got six of them and they’re going backwards and forwards all the time. So the data has risen exponentially and what you can do with it, we are now just starting to get to grips with it, that’s what so exciting. Consider, we have companies that do want to know how much carbon is being produced as part of their products. We have things like that. We just have an incredibly diverse set of products.

To give you an example, I worked on a project about 18 months ago where we worked out, working in tandem with a couple of nature organizations, that if a ship hits a whale at 12 knots and above, that whale will largely die. If you hit it below 12 knots, it will live. It’s a bit like hitting an adult at 30 miles an hour versus 20. The company puts some money in so we could use the data for where the whales were to slow the ships down. So this is an example of where this company doesn’t just think about what can we do to make money. This is a company that thinks about how can we use data to better the company, make us more profitable, and at the same time, put back into the planet that gave us the ability to have this business.

Kunal Agarwal:

Let’s not forget that we’re human, most importantly.

Mark Sear:

Yeah, it’s super exciting, right? You can make all the money in the world. If you trash the planet, there’s not a lot left to enjoy as part of it. And I love that about this company.

Kunal Agarwal:

Absolutely. And I’m guessing with the pandemic and post-pandemic, and all of the other data sets that you guys are gathering anyways from sensors or from the shipping lines or from all the efficiencies, with all the proliferation of all this data inside your organization, what challenges has your team faced or does the Maersk data team face?

Mark Sear:

Well, my team is in the enterprise architecture team. We therefore deal with all the other teams that are dealing with data, and I think we got the same challenges as everybody. We’ve got the data quality right? Do we know where that data comes from? Are we processing it efficiently? Do we have the right ideas to work on the right insights to get value out of that data? I think they’re common industry things, and as with everything, it’s a learning process. So one man’s high-quality data is another woman’s low-quality data.

And depending on who you are and what you want to do with that data, people have to understand how that quality affects other people downstream. And of course, because you’re quite right, we did have a pandemic, and in the pandemic shipping rates went a little bit nuts and they’re normalizing now. But, of course, if you think about introducing predictive algorithms where the price is going vertically and the algorithm may not know that there’s a pandemic on, it just sees price. So I think what we find is challenging, same as everybody else, is how do you put that human edge around data? Very challenging. How do you build really high-performing teams? How do you get teams to truly work together and develop that esprit de corps? So there are a lot of human problems that go alongside the data problems.

Kunal Agarwal:

Yeah. Mark, give us a sense of your size. In terms of teams, applications, whatever would help us understand what you guys were, where you guys are, and where you guys headed.

Mark Sear:

Three years ago when I joined there were 1,900 people in tech; we’ve now got nearly 6,000. We had a huge amount of outsourcing; now we’re insourcing, we’re moving to an open-source-first, event-based company. We’ve been very acquisitive. We’ve bought some logistics companies, so we’ve gone on the end-to-end journey now with the logistics integrator of choice globally. We’ve got our own airline. So you have to think about a lot of things that play together.

My team is a relatively tiny team. We’ve got about 12, but we liaise with, for example, our global data and analytics team that has got 600 people in it. We then organized into platforms, which are vertically problem solving, but fully horizontally integrated passing events between them. And each one of those has their own data team in it as well. So overall, I would guess we’ve got 3,000 people working directly with data in IT and then of course many thousands more.

Kunal Agarwal:

Wow.

Mark Sear:

Out in the organization. So it’s big organizations. Super exciting. Should say, now I’m going to get a quick commercial in. If you are watching this and you are a top data talent, please do hit me up with your resume.

Kunal Agarwal:

There’s a couple of thousand people watching this live, so you’ll definitely.

Mark Sear:

Hey, there you go, man. So listen, as long as they’re quality, I don’t care.

Kunal Agarwal:

From Mark, he’s a great boss as well. So when you think about the maturity curve of data operations, where do you think Maersk is at and what stands in your way to be fully matured?

Mark Sear:

Okay, so let’s analyze that. I think the biggest problem in any maturity curve is not defining the curve. It’s not producing a pyramid to say we are here and a dial to say, well, you rank as a one, you want to rank as a five.

Kunal Agarwal:

Sure.

Mark Sear:

The biggest problem to me is the people that actually formulate that curve. Now everyone’s got staff turnover and everyone or the majority of people know that they’re part of a team. But the question is how do you get that team to work with other teams and how do you disseminate that knowledge and get that group think of what is best practice for DataOps? What is best practice for dealing with these problems?

Kunal Agarwal:

It’s almost a spectrum on the talent side, isn’t it?

Mark Sear:

It’s a spectrum on the talent side, there’s a high turnover because certainly in the last 12 to 18 months, salaries have been going crazy, so you’ve had crazy turnover rates in some areas, not so much in other areas. So the human side of this is one part of the problem, and it’s not just the human side to how do you keep them engaged, it’s how do you share that knowledge and how do you get that exponential learning organization going?

And perhaps when we get into how we’ve arrived at tools like Unravel, I’ll explain to you what my theory is on that, but it’s almost a swarm learning that you need here, an ants style learning of how to solve problems. And that’s the hardest thing, is getting everybody in that boat swimming in the same direction before you can apply best practices because everybody says this is best practice. Sure, but if it was as simple as looking at a Gartner or whoever thing and saying, “Oh, there are the five lines we need to do,” everybody would do it. There’d be no need for anybody to innovate because we could do it; human beings aren’t very good at following rules, right?

Kunal Agarwal:

Yeah. So what kind of shifts and changes did you have to make in your big data operations and tools that you had to put into place for getting that maturity to where you expected it to be?

Mark Sear:

I think the first thing we’ve got to do, we’ve got to get people thinking slightly shorter timeframe. So everybody talks about Agile, Agile, Agile.

Kunal Agarwal:

Right.

Mark Sear:

Agile means different things to different people. We had some people who thought that Agile was, “Well, you’re going to get a fresh data set at the end of the day, so what the heck are you complaining about? When I started 15 years ago, you got it weekly.” That’s not agile. Equally, you’ve got people who say, I need real-time data. Well, do you really need real-time data if you’re actually dealing with an expense account? You probably don’t.

Kunal Agarwal:

Right.

Mark Sear:

Okay, so the first thing we’ve got to do is level set expectations of our users and then we’ve got to dovetail what we can deliver into those. You’ve got to be business focused, you’ve got to bring value. And that’s a journey. It’s a journey for the business users.

Kunal Agarwal:

Sure.

Mark Sear:

It’s a journey for our users. It’s about learning. So that’s what we’re doing. It’s taking time. Yeah, it’s taking time, but it’s like a snowball. It is rolling and it is getting bigger and it’s getting better, getting faster.

Kunal Agarwal:

And then when you think about the tools, Mark, are there any that you have to put into place to accelerate this?

Mark Sear:

I mean, we’ve probably got one of everything to start and now we’re shrinking. If I take . . . am I allowed to talk about Unravel?

Kunal Agarwal:

Sure.

Mark Sear:

So I’ll talk about–

Kunal Agarwal:

As much as you would.

Mark Sear:

–Unravel for a few seconds. So if you think about what we’ve got, let’s say we’ve got 3,000 people, primarily relatively young, inexperienced people churning out Spark code, let’s say Spark Databricks code, and they all sit writing it. And of course if you are in a normal environment, you can ask the person next to you, how would you do this? You ask the person over there, how would you do this? We’ve had 3,000 engineers working from home for two years, even now, they don’t want to come into the office per se, because it’s inconvenient, number one, because you might be journeying in an hour in and an hour home, and also it’s not actually, truly as productive. So the question is how do you harvest that group knowledge and how do people learn?

So for us, we put Unravel in to look at and analyze every single line of code we write and come up with those micro suggestions and indeed macro suggestions that you would miss. And believe me, we’ve been through everything like code walkthroughs, code dives, all those things. They’re all standard practice. If you’ve got 2,000 people and they write, let’s say, 10 lines of code a day each, 20,000 lines of code, you are never going to walk through all of that code. You are never going to be able to level set expectations. And this is key to me, be able to go back to an individual data engineer and say, “Hey, dude, listen, about these couple of lines of code. Did you realize if you did it like this, you could be 10 times as efficient?” And it’s about giving that feedback in a way that allows them to learn themselves.

And that’s what I love about Unravel: you can get the feedback, but it’s not like when you get that feedback, nobody says, “Come into my office, let’s have a chat about these lines of code.” You go into your private workspace, it gives you the suggestions, you deal with the suggestions, you learn, you move on, you don’t make the mistakes again. And they may not even be mistakes. They might just be things you didn’t know about.

Kunal Agarwal:

Right.

Mark Sear:

And so because Unravel takes data from lots of other organizations as well, as I see it, we’re in effect, harvesting the benefits of hundreds of thousands of coders globally, of data engineers globally. And we are gaining the insights that we couldn’t possibly gain by being even the best self-analysis on the planet, you couldn’t do it without that. And that to me is the advantage of it. It’s like that swarm mentality. If you’ve ever, anybody watching this, had a look at swarm AI, which is to predict, you can use it to predict events. It’s like if you take a soccer game, and I’ve worked in gambling, if you take a soccer match and you take a hundred people, I’ll call it soccer, even though the real name for it is football, you Americans.

Kunal Agarwal:

It’s football, I agree too.

Mark Sear:

It’s football, so we’re going to call it football, association football to give you its full name. If you ask a hundred football fans to predict a score, you’ll get a curve, and you’ll generally, from that predictor, get a good result. Way more accurate than asking 10 so-called experts, such as with code. And that’s what we’re finding with Unravel is that sometimes it’s the little nuances that just pop up that are giving us more benefits.

Kunal Agarwal:

Right.

Mark Sear:

So it’s pivotal to how we are going to get benefits out over the longer term of what we’re doing.

Kunal Agarwal:

That’s great. And we always see a spectrum of skills inside an organization. So our mission is trying to level the playing field so anybody, even a business user, can log in without knowing the internals of all of these complex data technologies. So it’s great to hear the way Maersk is actually using it. We spoke a little bit about making these changes. We’d love to double click on some of these hurdles, right? Because you said it was a journey to get to people to this mature or fast-moving data operations, if you may, or more agile data operations if you may. If we can double click for a second, what has been the biggest hurdle? Is it the mindset? Is it managing the feedback loop? Is it changing the practices? Is it getting new types of people on board? What has been the biggest hurdle?

Mark Sear:

Tick all of the above.

Kunal Agarwal:

Okay.

Mark Sear:

But I think–

Kunal Agarwal:

Pick for option E.

Mark Sear:

Yeah, so let me give you an example. There are several I’ve had with people that have said to me, “I’ve been doing this 25 years. There’s nothing, I’ve been doing it 25 years.” That presupposes that 25 years of knowledge and experience is better than 10 minutes with a tool that’s got 100,000 years of learning.

Kunal Agarwal:

Right.

Mark Sear:

Over a 12-month period. So that I classify that as the ego problem. Sometimes people need their ego brushing, sometimes they need their ego crushing. It’s about looking the person in the eye, working out what’s the best strategy of dealing with them and saying to them, “Look, get on board.” This isn’t about saying you are garbage or anything else. This is about saying to you, learn and keep mentoring other people as you learn.

Kunal Agarwal:

Yeah.

Mark Sear:

I remember another person said to me, “Oh my god, I’ve seen what this tool can do. Is AI going to take my job?” And I said to them, no, AI isn’t going to take your job, but if you’re not careful, somebody, a human being that is using AI will take it, and that doesn’t apply to me. That applies just in general to the world. You cannot be a Luddite, you cannot fight progress. And as we’ve seen with Chat GPT and things like that recently, the power of the mass of having hundreds and thousands and millions of nodes analyzing stuff is precisely what will bring that. For example, my son who’s 23, smart kid, well, so he tells me. Smart kid, good uni, good university, blah blah blah. He said to me, “Oh Tesla, they make amazing cars.” And I said to him, Tesla isn’t even a car company. Tesla is a data company that happens to build a fairly average electric car.

Kunal Agarwal:

Absolutely.

Mark Sear:

That’s it. It’s all about data. And I keep saying to my data engineers, to be the best version of you at work and even outside work, keep picking up data about everything, about your life, about your girlfriend, the way she feels. About your boyfriend, the way he feels. About your wife, your mother. Everything is data. And that’s the mindset. And the biggest thing for me, the biggest issue has been getting everybody to think and recognize how vital data is in their life, and to be open to change. And we all know throughout go through this cycle of humanity, a lack of openness to change is what’s held humanity back. I seek to break that as well.

Kunal Agarwal:

I love that Mark. Switching gears, we spoke a little bit about developer productivity. We spoke about agility and data operations. Maersk obviously runs, like you were explaining, a lot of their data operations on the cloud. And as we see a lot of organizations when they start to get bigger and bigger and bigger in use cases on the cloud, cost becomes a front and center or a first-class citizen conversation to have. Shed some light on that for us. What is that maturity inside of Maersk, or how do you think about managing costs and budgets and forecast on the cloud, and what’s the consequence of not doing that correctly?

Mark Sear:

Well, there are some things that I can’t discuss because they’re obviously internal, but I think, let’s say I speak to a lot of people in a lot of companies, and there seem to be some themes that run everywhere, which is there’s a rush towards certain technologies, and people, they test it out on something tiny and say, “Hey, isn’t this amazing? Look how productive I am.” Then they get into production and somebody else says, “That’s really amazing. You were very productive. But have you seen what comes out the other end? It’s a cost, a bazillion dollars an hour to run it.” Then you’ve got this, I think they called it the Steve Jobs reality distortion field, where both sets of people go into this weird thing of, “Well, I’m producing value because I’m generating code and isn’t it amazing?” And the other side is saying, “Yeah, but physically the company’s going to spend all its money on the cloud. We won’t be able to do any other business.”

Kunal Agarwal:

Yeah.

Mark Sear:

So we are now getting to a stage where we have some really nice cost control mechanisms coming in. For me, it’s all in the audit. And crucial to this is do it upfront. Do it in your dev environment. Don’t go into production, get a giant bill and then say, how do I cut that bill? Which is again, where we’ve put Unravel now, right in the front of our development environment. So nothing even goes into production unless we know it’s going to work at the right cost price. Because otherwise, you’ve just invented the world’s best mechanism for closing the stable door after the cost horse has bolted, right?

Kunal Agarwal:

Right.

Mark Sear:

And then that’s always a pain because post-giant-bill examinations are really painful, it’s a bit like medicine. I don’t know if you know, but in China, you only pay a doctor when you are well. As soon as you are sick, you stop paying bills and they have to take care of you. So that to me is how we need to look at cost.

Kunal Agarwal:

I love that. Love that analogy.

Mark Sear:

Do it upfront. Keep people well, don’t ever end up with a cost problem. So that’s again, part of the mindset. Get your data early, deal with it quickly. And that’s the level of maturity we are getting to now. It’s taking time to get there. We’re not the only people, I know it’s everywhere. But I would say to anybody, I was going to say lucky enough to be watching this, but that’s a little bit cocky, isn’t it? Anybody watching this? Whatever you do, get in there early, get your best practice in as early as possible. Go live with fully costed jobs. Don’t go live, work out what the job cost is and then go, how the hell do I cut it?

Kunal Agarwal:

Yeah.

Mark Sear:

Go live with fully costed jobs and work out well, if it costs this much in dev test, what’s it going to cost in prod? Then check it as soon as it goes live and say, yeah, okay, the delta’s right, game on. That’s it.

Kunal Agarwal:

So measure twice, cut once, and then you’re almost shifting left. So you’re leaving it for the data engineers to go and figure this out. So there’s a practice that’s emerging called FinOps, which is really a lot of these different groups of teams getting together to exactly solve this problem of understanding what the cost is, optimizing what the cost is, and then governing what the cost is as well. So who within your team does what? I’m sure the audience would love to hear that a little bit.

Mark Sear:

Pretty much everybody will do everything, every individual data engineer, man, woman, child, whatever will be, but we’re not using child labor incidentally, that was.

Kunal Agarwal:

Yeah, let’s clarify that one for the audience.

Mark Sear:

That’s a joke. Edit that out. Every person will take it on themselves to do that because ultimately, I have a wider belief that every human being wants to do the right thing, given everything else being equal, they want to do the right thing. So I will say to the people that I speak to as data engineers, as new data engineers, I will say to them, we will empower you to create the best systems in the world. Only you can empower yourself to make them the most efficient systems in the world.

Kunal Agarwal:

Interesting.

Mark Sear:

And by giving it to them and saying, “This is a matter of personal pride, guys,” at the end of the day, am I going to look at every line of your code and say, “You wouldn’t have got away with that in my day.” Of course not. When I started in it, this is how depressingly sad it is. We had 16K of main memory on the main computer for a bank in an IBM mainframe, and you had to write out a form if you wanted 1K of disk. So I was in a similar program in those days. Now I’ve got a phone with God knows how much RAM on it.

Kunal Agarwal:

Right, and anybody can spin up a cloud environment.

Mark Sear:

Absolutely. I can push a button, spin up whatever I want.

Kunal Agarwal:

Right.

Mark Sear:

But I think the way to deal with this problem is to, again, push it left. Don’t have somebody charging in from finance waiving a giant bill saying, “Guys, you are costing a fortune.” Say to people, let’s just keep that finance dude or lady out of the picture. Take it on yourself, yourself. Show a bit of pride, develop this esprit de corps, and let’s do it together.

Kunal Agarwal:

Love it. Mark, last question. This is a fun one and I know you’re definitely going to have some fun answer over here. So what are your predictions for this data industry for this year and beyond? What are we going to see?

Mark Sear:

Wow, what do I think? Basically–

Kunal Agarwal:

Since you’ve got such a pulse on the overall industry and market.

Mark Sear:

So to me, the data industry, obviously it’ll continue to grow. I don’t believe that technology in many levels, I’ll give you over a couple of years, technology in many levels, we’re actually a fashion industry. If the fashion is to outsource, everybody outsources. So if the fashion is to in-source, everybody does. Women’s skirts go up, fashion changes, they come down. Guys wear flared trousers, guys wear narrow trousers, and nobody wants to be out of fashion. What I think’s going to happen is data is going to continue to scale, quantum computing will take off within a few years. What’s going to happen is your CEO is going to say, “Why have I got my data in the cloud and in really expensive data centers when someone has just said that I can put the whole of our organization on this and keep it in the top drawer of my desk?”

And you will have petabyte, zettabyte scale in something that can fit in a shoebox. And at that point it’ll change everything. I will probably either be dead, or at least hopefully retired and doing something by then. But I think it is for those people that are new to this industry, this is an industry that’s going to go forever. I personally hope I get to have an implant in my head at some point from Elon. I will be going for, I’m only going to go for version two. I’m not going for version one and hopefully–

Kunal Agarwal:

Yeah, you never want to go for V1.

Mark Sear:

Exactly, absolutely right. But, guys, ladies, everybody watching this, you are in the most exciting part, not just of technology, of humanity itself. I really believe that, of humanity itself, you can make a difference that very few people on the planet get to make.

Kunal Agarwal:

And on that note, the big theme we have going in this series is that we strongly feel data teams are running the world and will continue to run the world. Mark, thank you so much for sharing these exciting insights, and it’s always fun having you. Thank you for making the time.

Mark Sear:

Complete pleasure.

The post Enabling Strong Engineering Practices at Maersk appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/enabling-strong-engineering-practices-at-maersk/feed/ 0
3 Takeaways from the 2023 Data Teams Summit https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/ https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/#respond Thu, 02 Feb 2023 13:45:19 +0000 https://www.unraveldata.com/?p=11362

The 2023 Data Teams Summit (formerly DataOps Unleashed) was a smashing success, with over 2,000 participants from 1,600 organizations attending 23 expert-led breakout sessions, panel discussions, case studies, and keynote presentations covering a wide range of […]

The post 3 Takeaways from the 2023 Data Teams Summit appeared first on Unravel.

]]>

The 2023 Data Teams Summit (formerly DataOps Unleashed) was a smashing success, with over 2,000 participants from 1,600 organizations attending 23 expert-led breakout sessions, panel discussions, case studies, and keynote presentations covering a wide range of thought leadership and best practices.

There were a lot of sessions devoted to different strategies and considerations for building a high-performing data team, how to become a data team leader, where data engineering is heading, and emerging trends in DataOps (asset-based orchestration, data contracts, data mesh, digital twins, data centers of excellence). And winding through almost every presentation as a common theme was that top-of-mind topic: FinOps and how to get control over galloping cloud data costs.

Some of the highlight sessions are available now on demand (no form to fill out) on our Data Teams Summit 2023 page. More are coming soon. 

There was a lot to see (and full disclosure: I didn’t get to a couple of sessions), but here are 3 sessions that I found particularly interesting.

Enabling strong engineering practices at Maersk


The fireside chat between Unravel CEO and Co-founder Kunal Agarwal and Mark Sear, Head of Data Platform Optimization at Maersk, one of the world’s largest logistics companies, is entertaining and informative. Kunal and Mark cut through the hype to simplify complex issues in commonsensical, no-nonsense language about:

  • The “people problem” that nobody’s talking about
  • How Maersk was able to upskill its data teams at scale
  • Maersk’s approach to rising cloud data costs
  • Best practices for implementing FinOps for data teams

Check out their talk here

Maximize business results with FinOps


Unravel DataOps Champion and FinOps Certified Practitioner Clinton Ford and FinOps Foundation Ambassador Thiago Gil explain how and why the emerging cloud financial management discipline of FinOps is particularly relevant—and challenging—for data teams. They cover:

  • The hidden costs of cloud adoption
  • Why observability matters
  • How FinOps empowers data teams
  • How to maximize business results 
  • The state of production ML

See their session here

Situational awareness in a technology ecosystem


Charles Boicey, Chief Innovation Officer and Co-founder of Clearsense, a healthcare data platform company, explores the various components of a healthcare-centric data ecosystem and how situational awareness in the clinical environment has been transferred to the technical realm. He discusses:

  • What clinical situational awareness looks like
  • The concept of human- and technology-assisted observability
  • The challenges of getting “focused observability” in a complex hybrid, multi-cloud, multi-platform modern data architecture for healthcare
  • How Clearsense leverages observability in practice
  • Observability at the edge

Watch his presentation here

How to get more session recordings on demand

  1. To see other session recordings without any registration, visit the Unravel Data Teams Summit 2023 page. 
  2. To see all Data Teams Summit 2023 recordings, register for access here.

And please share your favorite takeaways, see what resonated with your peers, and join the discussion on LinkedIn.  

The post 3 Takeaways from the 2023 Data Teams Summit appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/feed/ 0
Gartner® Market Guide for Dataops Tools… Unravel Data Highlighted https://www.unraveldata.com/resources/unravel-data-named-representative-vendor-in-gartner-market-guide-for-dataops-tools/ https://www.unraveldata.com/resources/unravel-data-named-representative-vendor-in-gartner-market-guide-for-dataops-tools/#respond Tue, 31 Jan 2023 09:00:37 +0000 https://www.unraveldata.com/?p=11223 DataOps Abstract Background

Palo Alto, CA – Jan. 31, 2023 – Unravel Data, the first data observability platform built to meet the needs of modern data teams, announced today that it was recognized as a Representative Vendor in the […]

The post Gartner® Market Guide for Dataops Tools… Unravel Data Highlighted appeared first on Unravel.

]]>

Palo Alto, CA – Jan. 31, 2023 – Unravel Data, the first data observability platform built to meet the needs of modern data teams, announced today that it was recognized as a Representative Vendor in the DataOps Market in the 2022 Gartner Market Guide for DataOps Tools.

“In recent years, organizations have witnessed significant increases in the number of data sources and data consumers as they deploy more applications and use cases. DataOps has risen as a set of best practices and technologies to streamline and efficiently handle this onslaught of new upcoming workloads,” said Sanjeev Mohan, principal with SanjMo. “Just as DevOps helped streamline web application development and make software teams more productive, DataOps aims to do the same thing for data applications.”

According to a Gartner strategic planning assumption from this Market Guide, “by 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps.” In this Market Guide for DataOps Tools, Gartner examines the DataOps market and explains the various capabilities of DataOps tools, paints a picture of the DataOps tool landscape, and offers recommendations.

“Data teams are struggling to keep up with the increased volume, velocity, variety, and complexity of their data applications and data pipelines. These teams are facing many of the same generic challenges that software teams did 10-plus years ago,” said Kunal Agarwal, CEO of Unravel Data. “We’re proud to be recognized by Gartner in the DataOps Market. Unravel Data provides unprecedented visibility across users’ data stacks, proactively troubleshooting and optimizing data workloads, and defining guardrails to govern costs and improve predictability.”

Today’s modern data stack is a complex collection of different systems, platforms, technologies, and environments. Enterprises need a DataOps solution that works with every type of workload and addresses the performance, cost, and quality issues facing data teams today. Founded by data pioneers Kunal Agarwal and Dr. Shivnath Babu, Unravel Data was created on the notion that the exponential growth of data combined with the broad adoption of the public cloud requires an entirely new way to manage and optimize the data pipelines that support the real-time analytics needs of data-driven enterprises.

Customers who have deployed the Unravel platform have been able to double the productivity of data teams and ensure that data applications run on time, while scaling costs efficiently in the cloud. To learn more about how Unravel is leading the DataOps observability space, visit the new library of demonstration videos.

*Gartner, “Market Guide for DataOps Tools”, December 5, 2022.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading data observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, 84.51˚ (a Kroger company), and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit www.unraveldata.com.

Media Contact
Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post Gartner® Market Guide for Dataops Tools… Unravel Data Highlighted appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-named-representative-vendor-in-gartner-market-guide-for-dataops-tools/feed/ 0
Eckerson Report: Data Observability for Modern Digital Enterprises https://www.unraveldata.com/resources/eckerson-report-data-observability/ https://www.unraveldata.com/resources/eckerson-report-data-observability/#respond Tue, 24 Jan 2023 19:01:29 +0000 https://www.unraveldata.com/?p=11121 Eckerson Report Background

Analyst Deep Dive into Unravel This Eckerson Group report gives you a good understanding of how the Unravel platform addresses multiple categories of data observability—application/pipeline performance, cluster/platform performance, data quality, and, most significant, FinOps cost governance—with […]

The post Eckerson Report: Data Observability for Modern Digital Enterprises appeared first on Unravel.

]]>

Analyst Deep Dive into Unravel

This Eckerson Group report gives you a good understanding of how the Unravel platform addresses multiple categories of data observability—application/pipeline performance, cluster/platform performance, data quality, and, most significant, FinOps cost governance—with automation and AI-driven recommendations.

Eckerson Group, a leading global research, consulting, and advisory firm that focuses solely on data and analytics, profiles various elements of Unravel Data:

  • Target customers and use cases
  • Product functionality
  • Differentiators
  • Product architecture
  • Pricing

You’ll walk away with a clear picture of what Unravel does, how it does it, and why its features and capabilities are crucial for today’s DataOps and FinOps teams.

Eckerson Group profiles provide independent and objective research on products they believe offer exceptional value to customers.

The post Eckerson Report: Data Observability for Modern Digital Enterprises appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/eckerson-report-data-observability/feed/ 0
Data Observability Datasheet https://www.unraveldata.com/resources/unravel-datasheet/ https://www.unraveldata.com/resources/unravel-datasheet/#respond Tue, 10 Jan 2023 11:00:21 +0000 https://www.unraveldata.com/?p=5217

Observability designed for data teams The Unravel platform harnesses full-stack visibility, contextual awareness, AI-powered intelligence, and automation to go “beyond observability”— to not only show you what’s going on, but why and exactly how to make […]

The post Data Observability Datasheet appeared first on Unravel.

]]>

Observability designed for data teams

The Unravel platform harnesses full-stack visibility, contextual awareness, AI-powered intelligence, and automation to go “beyond observability”— to not only show you what’s going on, but why and exactly how to make things better and then keep them that way proactively. Unravel is designed for every member of your data team.

Key Use Cases

  • Advanced cost governance
  • AI-enabled optimization
  • Automated troubleshooting
  • Flexible data quality
  • Cloud migration

Read this datasheet to learn more.

Get the help you need to simplify your modern dataops.

The post Data Observability Datasheet appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-datasheet/feed/ 0
Panel recap: What Is DataOps observability? https://www.unraveldata.com/resources/panel-recap-what-is-dataops-observability/ https://www.unraveldata.com/resources/panel-recap-what-is-dataops-observability/#respond Fri, 06 Jan 2023 14:39:13 +0000 https://www.unraveldata.com/?p=10927

Data teams and their business-side colleagues now expect—and need—more from their observability solutions than ever before. Modern data stacks create new challenges for performance, reliability, data quality, and, increasingly, cost. And the challenges faced by operations […]

The post Panel recap: What Is DataOps observability? appeared first on Unravel.

]]>

Data teams and their business-side colleagues now expect—and need—more from their observability solutions than ever before. Modern data stacks create new challenges for performance, reliability, data quality, and, increasingly, cost. And the challenges faced by operations engineers are going to be different from those for data analysts, which are different from those that people on the business side care about. That’s where DataOps observability comes in.

But what is DataOps observability, exactly? And what does it look like in a practical sense for the day-to-day life of data application developers or data engineers or data team business leaders? 

In the Unravel virtual panel discussion What Is Data Observability? Sanjeev Mohan, principal with SanjMo and former Research Vice President at Gartner, lays out the business context and driving forces behind DataOps observability, and Chris Santiago, Unravel Vice President of Solutions Engineering, shows how different roles use the Unravel DataOps observability platform to address performance, cost, and quality challenges. 

Why (and what is) DataOps observability?

Sanjeev opens the conversation by discussing the top three driving trends he’s seeing from talking with data-driven organizations, analysts, vendors, and fellow leaders in the data space. Specifically, how current economic headwinds are causing reduced IT spend—except in cloud computing and, in particular, data and analytics. Second, with the explosion of innovative new tools and technologies, companies are having difficulty in finding people who can tie together all of these moving pieces and are looking to simplify this increasing complexity. Finally, more global data privacy regulations are coming into force while data breaches continue unabated. Because of these market forces, Sanjeev says, “We are seeing a huge emphasis on integrating, simplifying, and understanding what happens between a data producer and a data consumer. What happens between these two is data management, but how well we do the data management is what we call DataOps.”

the 3 components of DataOps observability

Sanjeev presents his definition of DataOps, why it has matured more slowly than its cousin DevOps, and the kind(s) of observability that is critical to DataOps—data pipeline reliability and trust in data quality—and how his point of view has evolved to now include a third dimension: demonstrating the business value of data through BizOps and FinOps. Taken together, these three aspects (performance, cost, quality) give all the different personas within a data team the observability they need. This is what Unravel calls DataOps observability.

See Sanjeev’s full presentation–no registration!
Video on demand

DataOps observability in practice with Unravel

Chris Santiago walks through how the various players on a data team—business leadership, application/pipeline developers and engineers, data analysts—use Unravel across the three critical vectors of DataOps observability: performance (reliability of data applications/pipelines), quality (trust in the data), and cost (value/ROI modern data stack investments).

Cost (value, ROI)

First up is how Unravel DataOps observability provides deep visibility and actionable intelligence into cost governance. As opposed to the kind of dashboards that cloud providers themselves offer—which are batch-processed aggregated summaries—Unravel lets you drill down from that 10,000-foot view into granular details to see exactly where the money is going. Chris uses a Databricks chargeback report example, but it would be similar for Snowflake, EMR, or GCP. He shows how with just a click, you can filter all the information collected by Unravel to see with granular precision which workspaces, teams, projects, even individual users, are consuming how many resources across your entire data estate—in real time. 

From there, Chris demonstrates how Unravel can easily set up budget tracking and automated guardrails for, say, a user (or team or project or any other tagging category that makes sense to your particular business). Say you want to track usage by the metric of DBUs; you set the budget/guardrail at a predefined threshold and get real-time status insight into whether that user is on track or is in danger of exceeding the DBU budget. You can set up alerts to get ahead of usage instead of getting notified only after you’ve blown a monthly budget.
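To make the mechanism concrete, here is a minimal, hypothetical sketch of that kind of budget guardrail logic in Python. This is not Unravel’s API; the owner tag, DBU figures, and thresholds are illustrative assumptions, and the projection is a deliberately naive linear extrapolation.

    from dataclasses import dataclass

    @dataclass
    class Budget:
        owner: str                    # user, team, project, or any other tag
        monthly_dbu_limit: float      # budgeted DBUs for the month
        warn_fraction: float = 0.8    # warn when projected spend passes 80% of budget

    def project_month_end(dbus_so_far: float, day_of_month: int, days_in_month: int) -> float:
        """Naive linear projection of month-end DBU consumption."""
        return dbus_so_far / day_of_month * days_in_month

    def check_budget(budget: Budget, dbus_so_far: float, day_of_month: int, days_in_month: int = 30) -> str:
        projected = project_month_end(dbus_so_far, day_of_month, days_in_month)
        if projected >= budget.monthly_dbu_limit:
            return f"ALERT: {budget.owner} projected at {projected:.0f} DBUs, over the {budget.monthly_dbu_limit:.0f} DBU budget"
        if projected >= budget.warn_fraction * budget.monthly_dbu_limit:
            return f"WARNING: {budget.owner} is trending toward the {budget.monthly_dbu_limit:.0f} DBU budget ({projected:.0f} projected)"
        return f"OK: {budget.owner} is on track ({projected:.0f} of {budget.monthly_dbu_limit:.0f} DBUs projected)"

    # Example: 600 DBUs consumed by day 12 against a 1,200 DBU monthly budget triggers an alert early.
    print(check_budget(Budget(owner="jane.doe", monthly_dbu_limit=1200), dbus_so_far=600, day_of_month=12))

The point of the sketch is the ordering: the projection and alert fire mid-month, before the budget is actually blown.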

See more self-paced interactive product tours here

Performance (reliability)

Chris then shows how Unravel DataOps observability helps the people who are actually on the hook for data pipeline reliability, making sure everything is working as expected. When applications or pipelines fail, it’s a cumbersome task to hunt down and cobble together all the information from disparate systems (Databricks, Airflow, etc.) to understand the root cause and figure out the next steps. Chris shows how Unravel collects and correlates all the granular details about what’s going on and why from a pipeline view. And then how you can drill down into specific Spark jobs. From a single pane of glass, you have all the pertinent logs, errors, metrics, configuration settings, etc., from Airflow or Spark or any other component. All the heavy lifting has been done for you automatically.

But where Unravel stands head and shoulders above everyone else is its AI-powered analysis. Automated root cause analysis diagnoses why jobs failed in seconds, and pinpoints exactly where in a pipeline something went wrong. For a whole class of problems, Unravel goes a step further and provides crisp, prescriptive recommendations for changing configurations or code to improve performance and meet SLAs. 

See more self-paced interactive product tours here

Data quality (trust)

Chris then pivots away from the processing side of things to look at observability of the data itself—especially how Unravel provides visibility into the characteristics of data tables and helps prevent bad data from wreaking havoc downstream. From a single view, you can understand how large tables are partitioned, size of the data, who’s using the data tables (and how frequently), and a lot more information. But what may be most valuable is the automated analysis of the data tables. Unravel integrates external data quality checks (starting with the open source Great Expectations) so that if certain expectations are not met—null values, ranges, number of final columns—Unravel flags the failures and can automatically take user-defined actions, like alert the pipeline owner or even kill a job that fails a data quality check. At the very least, Unravel’s lineage capability enables you to track down where the rogue data got introduced and what dependencies are affected.
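For readers who have not used Great Expectations, here is a minimal sketch of the kinds of checks described above, written against the classic Pandas-backed Great Expectations API (newer releases use a different entry point). The table and column names are hypothetical, and this shows only the open source checks themselves, not Unravel’s integration with them.

    import great_expectations as ge
    import pandas as pd

    # Hypothetical batch of an orders table
    orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "customer_id": ["a1", "a2", None],   # a null that should trip the check
        "order_total": [25.0, 310.5, 42.0],
    })

    batch = ge.from_pandas(orders)

    # The expectation types mentioned above: null values, ranges, number of columns
    batch.expect_column_values_to_not_be_null("customer_id")
    batch.expect_column_values_to_be_between("order_total", min_value=0, max_value=10000)
    batch.expect_table_column_count_to_equal(3)

    results = batch.validate()
    if not results.success:
        # A circuit breaker would alert the pipeline owner or stop the job at this point
        failed = [r.expectation_config.expectation_type for r in results.results if not r.success]
        print("Data quality checks failed:", failed)

In a pipeline, that failed result (rather than a print) is what would feed the kind of policy-based action described above, such as notifying the owner or killing the job.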

See Chris’s entire walk-through–no registration!
Video on demand

Whether it’s engineering teams supporting data pipelines, developers themselves making sure they hit their SLAs, budget owners looking to control costs, or business leaders who funded the technology looking to realize full value—everyone who constitutes a “data team”—DataOps observability is crucial to ensuring that data products get delivered reliably on time, every time, in the most cost-efficient manner.

The Q&A session

As anybody who’s ever attended a virtual panel discussion knows, sometimes the Q&A session is a snoozefest, sometimes it’s great. This one is the latter. Some of the questions that Sanjeev and Chris field:

  • If I’m migrating to a modern data stack, do I still need DataOps observability, or is it baked in?
  • How is DataOps observability different from application observability?
  • When should we implement DataOps observability?
  • You talked about FinOps and pipeline reliability. What other use cases come up for DataOps observability?
  • Does Unravel give us a 360 degree view of all of the tools in the ecosystem, or does it only focus on data warehouses like Snowflake?

To jump directly to the Q&A portion of the event, click on the video image below.

The post Panel recap: What Is DataOps observability? appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/panel-recap-what-is-dataops-observability/feed/ 0
Sneak Peek into Data Teams Summit 2023 Agenda https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/ https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/#respond Thu, 05 Jan 2023 22:47:29 +0000 https://www.unraveldata.com/?p=10895

The Data Teams Summit 2023 is just around the corner! This year, on January 25, 2023, we’re taking the peer-to-peer empowerment of data teams one step further, transforming DataOps Unleashed into Data Teams Summit to better […]

The post Sneak Peek into Data Teams Summit 2023 Agenda appeared first on Unravel.

]]>

The Data Teams Summit 2023 is just around the corner!

This year, on January 25, 2023, we’re taking the peer-to-peer empowerment of data teams one step further, transforming DataOps Unleashed into Data Teams Summit to better reflect our focus on the people—data teams—who unlock the value of data.

Data Teams Summit is an annual, full-day virtual conference led by data rockstars at future-forward organizations, covering how they’re establishing predictability, increasing reliability, and creating economic efficiencies with their data pipelines.

Check out the full agenda and register
Get free ticket

Join us for sessions on:

  • DataOps best practices
  • Data team productivity and self-service
  • DataOps observability
  • FinOps for data teams
  • Data quality and governance
  • Data modernizations and infrastructure

The peer-built agenda is packed, with over 20 panel discussions and breakout sessions. Here’s a sneak peek at some of the most highly anticipated presentations:

Keynote Panel: Winning strategies to unleash your data team

Data Teams Summit 2023 keynote speakers

Great data outcomes depend on successful data teams. Every single day, data teams deal with hundreds of different problems arising from the volume, velocity, variety—and complexity—of the modern data stack.

Learn best practices and winning strategies about what works (and what doesn’t) to help data teams tackle the top day-to-day challenges and unleash innovation.

Breakout Session: Maximize business results with FinOps

Data Teams Summit 2023 FinOps speakers

As organizations run more data applications and pipelines in the cloud, they look for ways to avoid the hidden costs of cloud adoption and migration. Teams seek to maximize business results through cost visibility, forecast accuracy, and financial predictability.

In this session, learn why observability matters and how a FinOps approach empowers DataOps and business teams to collaboratively achieve shared business goals. This approach uses the FinOps Framework, taking advantage of the cloud’s variable cost model, and distributing ownership and decision-making through shared visibility to get the biggest return on their modern data stack investments.

See how organizations apply agile and lean principles using the FinOps framework to boost efficiency, productivity, and innovation.

Breakout Session: Going from DevOps to DataOps

Data Teams Summit 2023 Ali Khalid

DevOps has had a massive impact on the web services world. Learn how to leverage those lessons and take them further to improve the quality and speed of delivery for analytics solutions.

Ali’s talk will serve as a blueprint for the fundamentals of implementing DataOps, laying out some principles to follow from the DevOps world and, importantly, adding subject areas required to get to DataOps—which participants can take back and apply to their teams.

Breakout Session: Becoming a data engineering team leader

Data Teams Summit 2023 Matthew Weingarten

As you progress up the career ladder for data engineering, responsibilities shift as you start to become more hands-off and look at the overall picture rather than a project in particular.

How do you ensure your team’s success? It starts with focusing on the team members themselves.

In this talk, Matt Weingarten, a lead Data Engineer at Disney Streaming, will walk through some of his suggestions and best practices for how to be a leader in the data engineering world.

Attendance is free! Sign up here for a free ticket

The post Sneak Peek into Data Teams Summit 2023 Agenda appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/feed/ 0
Our 4 key takeaways from the 2022 Gartner® Market Guide for DataOps Tools https://www.unraveldata.com/resources/our-4-key-takeaways-from-the-2022-gartner-market-guide-for-dataops-tools/ https://www.unraveldata.com/resources/our-4-key-takeaways-from-the-2022-gartner-market-guide-for-dataops-tools/#respond Thu, 15 Dec 2022 02:28:27 +0000 https://www.unraveldata.com/?p=10667 Abstract Infinity Loop Background

Unravel is recognized as a Representative Vendor in the DataOps Market in the 2022 Gartner Market Guide for DataOps Tools. Data teams are struggling to keep pace with the increased volume, velocity, variety—and complexity—of their data applications/pipelines. […]

The post Our 4 key takeaways from the 2022 Gartner® Market Guide for DataOps Tools appeared first on Unravel.

]]>

Unravel is recognized as a Representative Vendor in the DataOps Market in the 2022 Gartner Market Guide for DataOps Tools.

Data teams are struggling to keep pace with the increased volume, velocity, variety—and complexity—of their data applications/pipelines. They are facing many of the same (generic) challenges that software teams did 10+ years ago. Just as DevOps helped streamline web application development and make software teams more productive, DataOps aims to do the same thing for data applications.

According to a Gartner strategic planning assumption from this Market Guide, “by 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps.”

The report states that “the DataOps market is highly fragmented.” This Gartner Market Guide analyzes the DataOps market and explains the various capabilities of DataOps tools, paints a picture of the DataOps tool landscape, and offers recommendations.

Our understanding of DataOps and some of the key points we took away from the Gartner Market Guide:

The way data teams are doing things today isn’t working.

One of the key findings in the Gartner report analysis reveals that “a DataOps tool is a necessity to reduce the use of custom solutions and manual efforts around data pipeline operations. Buyers seek DataOps tools to streamline their data operations.” In our opinion, manual effort/custom solutions require prohibitively vast amounts of time and expertise—both of which are already in short supply within enterprise data teams. 

No single DataOps tool does everything.

DataOps covers a lot of ground. Gartner defines the core capabilities of DataOps tools as orchestration (including connectivity, workflow automation, data lineage, scheduling, logging, troubleshooting and alerting), observability (monitoring live/historic workflows, insights into workflow performance and cost metrics, impact analysis), environment management (infrastructure as code, resource provisioning, environment repository templates, credentials management), deployment automation, and test automation. We think it’s clear that no one tool does it all. The report recommends, “Follow the decision guidelines and avoid multibuy circumstances by understanding the diverse market landscape and focusing on a desired set of core capabilities.”

DataOps tools break down silos.

A consequence of modern data stack complexity is that disparate pockets of experts all run their own particular tools of choice in silos. Communication and collaboration break down, and you wind up with an operational Tower of Babel that leads to missed SLAs, friction and finger-pointing, and everybody spending way too much time firefighting. DataOps tools are designed specifically to avoid this unsustainable situation.

Choose DataOps tools that give a single pane of glass view.

The Gartner report recommends that you “prioritize DataOps tools that give you a ‘single pane of glass’ for diverse data workloads across heterogeneous technologies with orchestration, lineage, and automation capabilities.” The modern data stack is a complex collection of different systems, platforms, technologies, and environments. Most enterprises use a combination of them all. You need a DataOps tool that works with all kinds of workloads in this heterogeneous stack for different capabilities and different reasons—which can be boiled down to the three dimensions of performance, cost, and quality.

We are excited that Gartner recognized Unravel as a Representative Vendor in the DataOps market, and we couldn’t agree more with their Market Guide for DataOps Tools recommendations.

 

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

The post Our 4 key takeaways from the 2022 Gartner® Market Guide for DataOps Tools appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/our-4-key-takeaways-from-the-2022-gartner-market-guide-for-dataops-tools/feed/ 0
Unravel Data Accelerates DataOps Time to Value https://www.unraveldata.com/resources/unravel-data-accelerates-dataops-time-to-value/ https://www.unraveldata.com/resources/unravel-data-accelerates-dataops-time-to-value/#respond Wed, 09 Nov 2022 14:00:04 +0000 https://www.unraveldata.com/?p=10622 Cloud Graph Background

Unravel Data Accelerates DataOps Time to Value with Launch of its 2022 Fall Release Latest Edition of the Unravel Platform Includes BigQuery Support and Cost 360 for Amazon EMR to Help Customers Optimize the Cost, Efficiency, […]

The post Unravel Data Accelerates DataOps Time to Value appeared first on Unravel.

]]>

Unravel Data Accelerates DataOps Time to Value with Launch of its 2022 Fall Release

Latest Edition of the Unravel Platform Includes BigQuery Support and Cost 360 for Amazon EMR to Help Customers Optimize the Cost, Efficiency, and Performance of their Data Applications

Palo Alto, CA – Nov 9, 2022 – Unravel Data, the first DataOps observability platform built to meet the needs of modern data teams, today announced the general availability of its 2022 Fall Release of the Unravel Platform. With this new release, users of the Unravel Platform are now able to leverage several new capabilities including support for Google Cloud BigQuery and Cost 360 for Amazon EMR. These new capabilities are designed to help users boost the efficiency of their public cloud spend, simplify troubleshooting across their big data ecosystem, and improve the overall performance of their business-critical data applications.

“Data teams have a clear mandate to ensure that the data pipelines that support their data analytics programs are fully optimized, running efficiently and staying within budget. However, given the complexity of their data ecosystem, getting answers about the health and performance of their data pipelines is harder than ever,” said Kunal Agarwal, founder and CEO of Unravel Data. “Whether they’re migrating more workloads to platforms like BigQuery or Amazon EMR or already running it as part of their data ecosystem, enterprise data teams are struggling to control costs and accurately forecast their resource requirements that are ultimately impacting their ability to execute on their strategic data analytics initiatives. With this latest edition, Unravel customers will be better able to gain the full-stack observability they need to optimize performance and manage their costs according to budget.”

Some of the new capabilities in the Fall Edition of the Unravel Platform include:

  • Full Cost Governance for Amazon EMR: With Unravel’s new ‘Cost 360 for Amazon EMR’ capability, customers can enjoy full visibility and gain critical insights into their spending on Amazon EMR. A new cost page has been added that enables customers to observe their EMR cost and usage trends and identify anomalies, slice and dice chargeback views with a variety of filters, and actively monitor costs against user-defined budgets that can automatically trigger warnings if costs are projected to exceed predefined thresholds.
  • Advanced EMR Cluster Management: To streamline the management of EMR clusters, a new Clusters page is available inside the Unravel UI for Amazon EMR, allowing users to actively monitor all their EMR clusters and cost data from a single location. The Clusters page also helps users quickly visualize, debug, and troubleshoot issues at both the cluster and application levels.
  • Unified View into BigQuery Deployments: Unravel enables customers to now view all information about any number of Google Cloud Platform projects and all BigQuery queries in these respective projects from a single interface. Key performance indicators and metadata are collected for every query. The new edition now supports advanced relevance-based ranked search, faceted search based on metadata and performance indicators, time-based search, as well as drill downs into the individual query level.
  • Enhancements for Databricks: When scheduling a job to extract metadata from Databricks, Unravel users can now specify multiple database names and can extract up to 25,000 Delta tables in a single job. Other Databricks enhancements include custom pricing for workload and tier combinations, APIs to detect anomalies and metrics, as well as the ability to encrypt passwords and tokens to harden application security.

More than 20 enterprise companies in the Fortune 100, including two of the top five global pharmaceutical companies and three of the top 10 financial institutions in the world, today rely on Unravel Data to gain unprecedented visibility across their data stacks, proactively troubleshoot and optimize their data workloads, and define guardrails to govern costs and improve predictability.

Data teams are invited to sign up for a free trial of the Unravel Platform at: https://www.unraveldata.com/create-account/

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading DataOps observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, 84.51 (a Kroger company), and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit www.unraveldata.com.

Media Contact
Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post Unravel Data Accelerates DataOps Time to Value appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-accelerates-dataops-time-to-value/feed/ 0
India’s First DataOps Observability Conference in Bangalore https://www.unraveldata.com/indias-first-dataops-observability-conference-bangalore/ https://www.unraveldata.com/indias-first-dataops-observability-conference-bangalore/#respond Wed, 12 Oct 2022 15:24:48 +0000 https://www.unraveldata.com/?p=10534

Unravel Data to Host India’s First DataOps Observability Conference in Bangalore The day-long conference on Friday, October 14, will decode DataOps observability and showcase the latest trends and developments in observability October 12, 2022, Bangalore: Unravel […]

The post India’s First DataOps Observability Conference in Bangalore appeared first on Unravel.

]]>

Unravel Data to Host India’s First DataOps Observability Conference in Bangalore

The day-long conference on Friday, October 14, will decode DataOps observability and showcase the latest trends and developments in observability

October 12, 2022, Bangalore: Unravel Data, the first DataOps observability platform built to meet the needs of modern data teams, today announced that it is hosting India’s first DataOps Observability Conference on Friday, October 14, 2022 in Bangalore. The day-long conference is hosted in partnership with JetBrains and The Fifth Elephant. Hundreds of data engineers, data leaders, architects, and operations engineers from leading technology companies will be attending the conference.

DataOps and observability are two of the biggest trends in modern data management. This first-of-its-kind conference dedicated to these trends will explain why DataOps observability has become a mission-critical requirement for managing modern data stacks.

The conference will feature leading data professionals from companies like Uber, Dell Technologies, Infosys and Intuit as speakers. The agenda for the event includes panel discussions and talks on managing the modern data stack and applying DataOps observability to all data and analytics use cases.

Dr. Shivnath Babu, co-founder and CTO of Unravel Data, will deliver the keynote address at the event. Industry experts who will be speaking on DataOps are Anuj Gupta – Chief AI Officer at Vahan Inc, Atri Sharma – Senior Staff Engineer at Uber, Sona Samad – Staff Engineer at Intuit, and Pallavi Angraje – Software Engineer at Intuit. A panel discussion on “DataOps Observability: Old Wine in a New Bottle or the Next Big Thing?” will be hosted by Dr. Shivnath Babu. The panelists include Govinda Sambamurthy – Head of Engineering, Enterprise Data Portability at Atlassian, Akhila Prabhakaran – Consultant Data Scientist at Dell, and Giriraj Bagdi – VP Engineering Operations, Data/AI Platform at Unravel.

“We are excited to host the first ever DataOps Observability Conference for data leaders and professionals in India and demonstrate the different aspects of DataOps observability, including application and pipeline performance, operational observability into how the entire platform or system is running end-to-end, and business observability aspects such as ROI and—most significantly—FinOps insights to govern and control escalating cloud costs,” said Shivnath.

Unravel Data estimates that data engineers and operations teams spend around 70-75% of their time tracking down and resolving problems through manual detective work and time-consuming trial and error. Further, as more workloads migrate to the cloud, team leaders are finding that costs are getting out of control—often leading to migration initiatives stalling out completely because of budget overruns.

“Data is eating the world, but complexity is eating data teams who are struggling with cloud costs, pipeline performance, and data quality issues. True DataOps observability can empower these teams to easily tackle even the most complex modern data stack challenges, so they can spend more time on innovation and less time troubleshooting issues,” Shivnath added.

Unravel’s DataOps Observability Platform extracts, correlates, and analyzes information from every system in the modern data stack, within and across some of the most popular data ecosystems. The platform is designed to show data application and pipeline performance, cloud costs, and quality; and proactively suggests precise, prescriptive fixes to quickly and efficiently solve problems. The AI-powered solution helps enterprises realize greater return on their investment in the modern data stack by delivering faster troubleshooting, better performance to meet service level agreements, self-service features that allow applications to get out of development and into production faster and more reliably, and reduced cloud costs.

Data teams can preview how Unravel supports a wide variety of daily observability challenges by viewing the new library of quick demonstration videos on YouTube here.

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading DataOps Observability Platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, 84.51 (a Kroger company), and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit www.unraveldata.com.

Media Contact
Bhawana Singh
+917204277001
bhawanas@evoc.in

The post India’s First DataOps Observability Conference in Bangalore appeared first on Unravel.

]]>
https://www.unraveldata.com/indias-first-dataops-observability-conference-bangalore/feed/ 0
Unravel Automated Data Quality Datasheet https://www.unraveldata.com/resources/unravel-automated-data-quality-datasheet/ https://www.unraveldata.com/resources/unravel-automated-data-quality-datasheet/#respond Fri, 07 Oct 2022 17:11:52 +0000 https://www.unraveldata.com/?p=10510 Woman reviewing code at computer

Unravel now pulls in data quality checks from external tools into its single-pane-of-glass full-stack observability view. All data details in one place Have all the granular data details at your fingertips, correlated in context. One-click drill […]

The post Unravel Automated Data Quality Datasheet appeared first on Unravel.

]]>

Unravel now pulls in data quality checks from external tools into its single-pane-of-glass full-stack observability view.

All data details in one place

Have all the granular data details at your fingertips, correlated in context. One-click drill down into lineage, schema, timelines, partitions, size, usage, and more.

Understand the impact in seconds

Automated insights show exactly what’s affected downstream, where bad data got introduced upstream, and how to fix the issue.

Automated circuit-breaker guardrails

Automated alerts let you know when—and where—quality checks fail. Customizable system- and user-defined, policy-based guardrails prevent bad data from seeing the light of day.

Download our Data Quality Datasheet here.

The post Unravel Automated Data Quality Datasheet appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-automated-data-quality-datasheet/feed/ 0
3-Minute Recap: Unlocking the Value of Cloud Data and Analytics https://www.unraveldata.com/resources/unlocking-the-value-of-cloud-data-and-analytics/ https://www.unraveldata.com/resources/unlocking-the-value-of-cloud-data-and-analytics/#respond Wed, 05 Oct 2022 20:54:06 +0000 https://www.unraveldata.com/?p=10453 Abstract Blue Light Background

DBTA recently hosted a roundtable webinar with four industry experts on “Unlocking the Value of Cloud Data and Analytics.” Moderated by Stephen Faig, Research Director, Unisphere Research and DBTA, the webinar featured presentations from Progress, Ahana, […]

The post 3-Minute Recap: Unlocking the Value of Cloud Data and Analytics appeared first on Unravel.

]]>

DBTA recently hosted a roundtable webinar with four industry experts on “Unlocking the Value of Cloud Data and Analytics.” Moderated by Stephen Faig, Research Director, Unisphere Research and DBTA, the webinar featured presentations from Progress, Ahana, Reltio, and Unravel.

You can see the full 1-hour webinar “Unlocking the Value of Cloud Data and Analytics” below.

Here’s a quick recap of what each presentation covered.

Todd Wright, Global Product Marketing Manager at Progress, in his talk “More Data Makes for Better Analytics” showed how Progress DataDirect connectors let users get to their data from more sources securely without adding a complex software stack in between. He quoted former Google research director Peter Norvig (now fellow at the Stanford Institute for Human-Centered Artificial Intelligence) about how more data beats clever algorithms: “Simple models and a lot of data trump more elaborate models based on less data.” He then outlined how Progress DataDirect and its Hybrid Data Pipeline platform uses standard-based connectors to expand connectivity options of BI and analytics tools, access all your data using a single connector, and make on-prem data available to cloud BI and analytics tools without exposing ports or implementing costly and complex VPN tunnels. He also addressed how to secure all data sources behind your corporate authentication/identity to mitigate the risk of exposing sensitive private information (e.g., which tables and columns to expose to which people) and keep tabs on data usage for auditing and compliance.

Rachel Pedreschi, VP of Technical Services at Ahana, presented “4 Easy Tricks to Save Big Money on Big Data in the Cloud.” She first simplified the cloud data warehouse into its component parts (data on disk + some kind of metadata + a query layer + a system that authenticates users and allows them to do stuff). Then she broke it down to see where you could save some money. Starting at the bottom, the storage layer, she said data lakes are a more cost-effective way of providing data to users throughout the organization. At the metadata level, Hive Metastore or AWS Glue are less expensive options. For authentication, she mentioned Apache Ranger or AWS Lake Formation. But what about SQL? For that she has Presto, an open source project that came out of Facebook as a successor to Hive. Presto is a massively scalable distributed in-memory system that allows you to write queries not just against data lake files but also against other databases. She calls this collectively the Open SQL Data Lakehouse. Ahana is an AWS-based service that gets you up and running with this Open Data Lakehouse in 30 minutes.
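As a rough illustration of that query layer, here is a minimal sketch that runs a SQL query against a Presto coordinator from Python using the open source presto-python-client package. The host, catalog, schema, and table names are assumptions for illustration; the same query could come from any SQL client or BI tool pointed at the coordinator.

    import prestodb  # pip install presto-python-client

    # Connect to a (hypothetical) Presto coordinator fronting a data lake catalog
    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",    # e.g., Hive Metastore or AWS Glue over files in object storage
        schema="sales",
    )

    cur = conn.cursor()
    # Standard SQL against files in the lake, with no separate warehouse load step
    cur.execute("""
        SELECT region, sum(order_total) AS revenue
        FROM orders
        GROUP BY region
        ORDER BY revenue DESC
        LIMIT 10
    """)
    for region, revenue in cur.fetchall():
        print(region, revenue)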

How can DataOps observability help unlock the value of cloud data and analytics?
Download the white paper

Mike Frasca, Field Chief Technology Officer at Reltio, discussed the value and benefits of a modern master data management platform. He echoed previous presenters’ points about how the volume, velocity, and variety of data have become almost overwhelming, especially given how highly siloed and fragmented data is today. Data teams spend most of their time getting the data ready for insights—data consolidation, aggregation, and cleansing; synchronizing and standardizing data, ensuring data quality, timeliness, and accuracy, etc.—rather than actually delivering insights from analytics. He outlined the critical functions a master data management (MDM) platform should provide to deliver precision data. Entity management automatically unifies data into a dynamic enterprise source of truth, including context-aware master records. Data quality management continuously validates, cleans, and standardizes the data via custom validation rules. Data integration receives input from and distributes mastered data to any application or data store in real time and at high volumes. He emphasized that what characterizes a “modern” MDM is the ability to access data in real time—so you’re not making decisions based on data that’s a week or a month old. Cloud MDM extends the capabilities to relationship management, data governance, and reference data management. He wound up his presentation with compelling real-life examples of customer efficiency gains and effectiveness, including Forrester TEI (total economic impact) results on ROI.

Chris Santiago, VP of Solutions Engineering at Unravel, presented how AI-enabled optimization and automatic insights help unlock the value of cloud data. He noted how nowadays every company is a data company. If they’re not leveraging data analytics to create strategic business advantages, they’re falling behind. But he illustrated how the complexity of the modern data stack is slowing companies down. He broke down the challenges into three C’s: cost, where enterprises are experiencing significant budget overruns; resource constraints—not infrastructure resources, but human resources—the talent gap and mismatch between supply and demand for expertise; and complexity of the stack, where 87% of data science projects never make it into production because it’s so challenging to implement them.

But it all comes down to people—all the different roles on data teams. Analysts, data scientists, engineers on the application side; architects, operations teams, FinOps, and business stakeholders on the operations side. Everybody needs to be working off a “single source of truth” to break down silos, enable collaboration, eliminate finger-pointing, and empower more self-service. Software teams have APM to untangle this complexity—for web apps. But data apps are a totally different beast. You need observability designed specifically for data teams. You could try to stitch together the details you need for performance, cost, data quality from a smorgasbord of point tools or solutions that do some of what you need (but not all), but that’s time-consuming and misses the holistic view of how everything works together so necessary to connect the dots when you’re looking to troubleshoot faster, control spiraling cloud costs, automate AI-driven optimization (for performance and cost), or migrate to the cloud on budget and on time. That’s exactly where Unravel comes in.

Check out the full webinar, including attendee Q&A here!

The post 3-Minute Recap: Unlocking the Value of Cloud Data and Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unlocking-the-value-of-cloud-data-and-analytics/feed/ 0
Get Ready for the Next Generation of DataOps Observability https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/ https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/#respond Wed, 05 Oct 2022 00:15:52 +0000 https://www.unraveldata.com/?p=10303 Data Pipelines

This blog was originally published by Unravel CEO Kunal Agarwal on LinkedIn in September 2022. I was chatting with Sanjeev Mohan, Principal and Founder of SanjMo Consulting and former Research Vice President at Gartner, about how […]

The post Get Ready for the Next Generation of DataOps Observability appeared first on Unravel.

]]>

This blog was originally published by Unravel CEO Kunal Agarwal on LinkedIn in September 2022.

I was chatting with Sanjeev Mohan, Principal and Founder of SanjMo Consulting and former Research Vice President at Gartner, about how the emergence of DataOps is changing people’s idea of what “data observability” means. Not in any semantic sense or a definitional war of words, but in terms of what data teams need to stay on top of an increasingly complex modern data stack. While much ink has been spilled over how data observability is much more than just data profiling and quality monitoring, until only very recently the term has pretty much been restricted to mean observing the condition of the data itself. 

But now DataOps teams are thinking about data observability more comprehensively as embracing other “flavors” of observability like application and pipeline performance, operational observability into how the entire platform or system is running end-to-end, and business observability aspects such as ROI and—most significantly—FinOps insights to govern and control escalating cloud costs.

That’s what we at Unravel call DataOps observability.

5-way crossfire facing DataOps teams

Data teams are getting bogged down

Data teams are struggling, overwhelmed by the increased volume, velocity, variety, and complexity of today’s data workloads. These data applications are simultaneously becoming ever more difficult to manage and ever more business-critical. And as more workloads migrate to the cloud, team leaders are finding that costs are getting out of control—often leading to migration initiatives stalling out completely because of budget overruns.

The way data teams are doing things today isn’t working.

Data engineers and operations teams spend way too much time firefighting “by hand.” Something like 70-75% of their time is spent tracking down and resolving problems through manual detective work and a lot of trial and error. And with 20x more people creating data applications than fixing them when something goes wrong, the backlog of trouble tickets gets longer, SLAs get missed, friction among teams creeps in, and the finger-pointing and blame game begins.

This less-than-ideal situation is a natural consequence of inherent process bottlenecks and working in silos. There are only a handful of experts who can untangle the wires to figure out what’s going on, so invariably problems get thrown “over the wall” to them. Self-service remediation and optimization is just a pipe dream. Different team members each use their own point tools, seeing only part of the overall picture, and everybody gets a different answer to the same problem. Communication and collaboration among the team breaks down, and you’re left operating in a Tower of Babel.

Check out our white paper DataOps Observability: The Missing Link for Data Teams
Download here

Accelerating next-gen DataOps observability

These problems aren’t new. DataOps teams are facing some of the same general challenges as their DevOps counterparts did a decade ago. Just as DevOps united the practice of software development and operations and transformed the application lifecycle, today’s data teams need the same observability but tailored to their unique needs. And while application performance management (APM) vendors have done a good job of collecting, extracting, and correlating details into a single pane of glass, they’re designed for web applications and give data teams only a fraction of what they need.

DevOps tools don't work for DataOps

System point tools and cloud provider tools all provide some of the information data teams need, but not all. Most of this information is hidden in plain sight—it just hasn’t been extracted, correlated, and analyzed by a single system designed specifically for data teams.

That’s where Unravel comes in.

Data teams need what Unravel delivers—observability designed to show data application/pipeline performance, cost, and quality coupled with precise, prescriptive fixes that will allow you to quickly and efficiently solve the problem and get on to the real business of analyzing data. Our AI-powered solution helps enterprises realize greater return on their investment in the modern data stack by delivering faster troubleshooting, better performance to meet service level agreements, self-service features that allow applications to get out of development and into production faster and more reliably, and reduced cloud costs.

I’m excited, therefore, to share that earlier this week, we closed a $50 million Series D round of funding that will allow us to take DataOps observability to the next level and extend the Unravel platform to help connect the dots from every system in the modern data stack—within and across some of the most popular data ecosystems. 

Unlocking the door to success

By empowering data teams to spend more time on innovation and less time firefighting, Unravel helps them take a page out of their software counterparts’ playbook and tackle their problems with a solution that goes beyond observability to not just show you what’s going on and why, but actually tell you exactly what to do about it. It’s time for true DataOps observability.

To learn more about how Unravel Data is helping data teams tackle some of today’s most complex modern data stack challenges, visit: www.unraveldata.com. 

The post Get Ready for the Next Generation of DataOps Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/feed/ 0
The Data Challenge Nobody’s Talking About: An Interview from CDAO UK https://www.unraveldata.com/resources/the-data-challenge-nobodys-talking-about-an-interview-from-cdao-uk-2022/ https://www.unraveldata.com/resources/the-data-challenge-nobodys-talking-about-an-interview-from-cdao-uk-2022/#respond Wed, 28 Sep 2022 17:57:33 +0000 https://www.unraveldata.com/?p=10315

Chief Data & Analytics Officer UK (CDAO UK) is the United Kingdom’s premier event for senior data and analytics executives. The three-day event, with more than 200 attendees and 50+ industry-leading speakers, was packed with case […]

The post The Data Challenge Nobody’s Talking About: An Interview from CDAO UK appeared first on Unravel.

]]>

Chief Data & Analytics Officer UK (CDAO UK) is the United Kingdom’s premier event for senior data and analytics executives. The three-day event, with more than 200 attendees and 50+ industry-leading speakers, was packed with case studies, thought leadership, and practical advice around data culture, data quality and governance, building a data workforce, data strategy, metadata management, AI/MLOps, self-service strategies, and more.


Catherine King of Corinium Global Intelligence interviews Chris Santiago, Unravel Data VP of Solutions Engineering

Chris Santiago, Unravel VP of Solutions Engineering, sat down with Catherine King from event host Corinium Global Intelligence to discuss what’s top-of-mind for data and analytics leaders.

Here are the highlights from their 1-on-1 interview.

Catherine: What are you hearing in the market at the moment? What are people coming up to you and having a chat with you about today?

Chris: I think that, big picture-wise, there are a lot of things being talked about that are very insightful. But some of the stuff that hasn’t really been talked about are the things that people don’t want to talk about. They have this grand vision, but how are they going to get there? How are they going to execute? How are they going to have the processes in place, the right people hired—that sort of thing—to take advantage of data and all the great technology that’s out there and actually execute on the vision?

Catherine: I think you’ve hit the nail on the head there. I think the technology piece, which was so prevalent a few years ago—do we have the right tech, the right tools, to go out and do these things?—that’s almost been ticked off and done. Now it’s actually: do you have the processes in place to achieve it? So from your perspective, what would you like to see businesses do differently?


Chris: The technology has made leaps and bounds over the last few years. If you think about big data people who came from the Hadoop world, one of the challenges of that stack was that it did require expertise. Companies back then struggled to get the ROI on performance, on getting some sort of business value at the end of the day. Fast-forward to today, especially with the advancements in the cloud, they have solved a lot of the challenges. Ease of use? I can just literally just log in, click a button, and have an environment. Storage is cheaper. A lot of the problems back then have been solved with today’s technology.

I do think that the one thing that hasn’t been 100% solved yet, though, is the people, the skills. Right now, the things technology is not necessarily addressing directly are: Do we have the right skill set? Do we have the right people? As more folks are using these newer technologies, the gap in skills to do things the right way and best practices are not being directly addressed. The way that most customers are handling it right now is that they’ll bring in the experts, consultants like Accenture, Avanade, Deloitte, etc.

In order to achieve the true ROI on these data stacks, it comes down to addressing the people problem.

Catherine: What do you see coming in the next 12 months?

Chris: I think that if we look at the industry as a whole, a lot of the technology is still considered new(ish). You have a lot of folks who are still using DevOps tools. So they’re trying to work with observability tools that are focused on the issues that aren’t necessarily what data teams want. So I think in the next six to twelve months we’re going to have a proliferation of observability trying to solve this problem specifically for data teams. Because they’re different problems [than software application issues]. You can’t use the same [APM] tooling for folks who are running Databricks or Snowflake. There are going to be different problems, different challenges.


Obviously, Unravel is in that space now, but I do think the industry will recognize more and more that this is actually not just a small problem, but a major problem. Companies are starting to realize that [observability designed for data teams] is not a nice-to-have anymore, it’s a must-have—and needs to be addressed right now.

Everybody has a vision. Everybody has an idea of what they want to do with data, whether it’s having that strategic business advantage or getting insights that they didn’t know about—everybody trying to do really cool stuff—but everybody always seems to forget about how to execute. There’s lots of interesting talks about how we’re going to measure things, what KPIs we’re going to be tracking, what methodologies we should have in place—all great stuff—but if you truly want to be successful, it’s all about execution and [doing] the stuff people don’t want to talk about. That’s what will set up companies to be successful or not with their data initiatives, getting into the weeds and solving these hard challenges.

Check out the full interview with Chris from CDAO UK

The post The Data Challenge Nobody’s Talking About: An Interview from CDAO UK appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-data-challenge-nobodys-talking-about-an-interview-from-cdao-uk-2022/feed/ 0
DataOps Observability Designed for Data Teams https://www.unraveldata.com/resources/dataops-observability-designed-for-data-teams/ https://www.unraveldata.com/resources/dataops-observability-designed-for-data-teams/#respond Mon, 19 Sep 2022 15:10:09 +0000 https://www.unraveldata.com/?p=10292 Abstract Chart Background

The post DataOps Observability Designed for Data Teams appeared first on Unravel.

]]>

The post DataOps Observability Designed for Data Teams appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-observability-designed-for-data-teams/feed/ 0
Unravel Goes on the Road at These Upcoming Events https://www.unraveldata.com/resources/unravel-upcoming-events/ https://www.unraveldata.com/resources/unravel-upcoming-events/#respond Thu, 01 Sep 2022 19:41:27 +0000 https://www.unraveldata.com/?p=10066 Big Data Survey 2019

Join us at an event near you or attend virtually to discover our DataOps observability platform, discuss your challenges with one of our DataOps experts, go under the hood to check out platform capabilities, and see […]

The post Unravel Goes on the Road at These Upcoming Events appeared first on Unravel.

]]>

Join us at an event near you or attend virtually to discover our DataOps observability platform, discuss your challenges with one of our DataOps experts, go under the hood to check out platform capabilities, and see what your peers have been able to accomplish with Unravel.

September 21-22: Big Data LDN (London) 

Big Data LDN is the UK’s leading free-to-attend data & analytics conference and exhibition, hosting leading data and analytics experts, ready to arm you with the tools to deliver your most effective data-driven strategy. Stop by the Unravel booth (stand #724) to see how Unravel is observability designed specifically for the unique needs of today’s data teams. 

Register for Big Data LDN here

And be sure to stop by the Unravel booth at 5PM on Day 1 for the Data Drinks Happy Hour for drinks and snacks (while supplies last!)

October 5-6: AI & Big Data Expo – North America (Santa Clara) 


The world’s leading AI & Big Data event returns to Santa Clara as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe. The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

Register for AI & Big Data Expo here

And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on the different challenges facing different AI & Big Data team members and how multidimensional observability (performance, cost, quality) designed specifically for the modern data stack can help.

October 10-12: Chief Data & Analytics Officers (CDAO) Fall (Boston)


The premier in-person gathering for data & analytics leaders in North America, CDAO Fall offers focus tracks on data infrastructure, data governance, data protection & privacy; analytics, insights, and business intelligence; and data science, artificial intelligence, and machine learning. Exclusive industry summit days for data and analytics professionals in Financial Services, Insurance, Healthcare, and Retail/CPG.

Register for CDAO Fall here

October 14: DataOps Observability Conf India 2022 (Bengaluru)


India’s first DataOps observability conference, this event brings together data professionals to collaborate and discuss best practices and trends in the modern data stack, analytics, AI, and observability.

Join leading DataOps observability experts to:

  • Understand what DataOps is and why it’s important
  • Learn why DataOps observability has become a mission-critical need in the modern data stack
  • Discover how AI is transforming DataOps and observability

Register for DataOps Observability Conf India 2022 here

November 1-3: ODSC West (San Francisco)

The Open Data Science Conference (ODSC) is essential for anyone who wants to connect to the data science community and contribute to the open source applications it uses every day. A hybrid in-person/virtual event, ODSC West features 250 speakers, with 300 hours of content, including keynote presentations, breakout talk sessions, hands-on tutorials and workshops, partner demos, and more. 

Register for ODSC West here

Sneak peek into what you’ll see from Unravel 

Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities. Explore features like:

Explore all our interactive guided tours here.

The post Unravel Goes on the Road at These Upcoming Events appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-upcoming-events/feed/ 0
Expert Panel: Challenges with Modern Data Pipelines https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/ https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/#respond Thu, 01 Sep 2022 14:47:26 +0000 https://www.unraveldata.com/?p=10053 Mesh Sphere Background

Modern data pipelines have become more business-critical than ever. Every company today is a data company, looking to leverage data analytics as a competitive advantage. But the complexity of the modern data stack imposes some significant […]

The post Expert Panel: Challenges with Modern Data Pipelines appeared first on Unravel.

]]>

Modern data pipelines have become more business-critical than ever. Every company today is a data company, looking to leverage data analytics as a competitive advantage. But the complexity of the modern data stack imposes some significant challenges that are hindering organizations from realizing their goals and realizing the value of data.

TDWI recently hosted an expert panel on modern data pipelines, moderated by Fern Halper, VP and Senior Director of Advanced Analytics at TDWI, with guests Kunal Agarwal, Co-Founder and CEO of Unravel Data, and Krish Krishnan, Founder of Sixth Sense Advisors.

Dr. Halper opened the discussion with a quick overview of the trends she’s seeing in the changing data landscape, characteristics of the modern data pipeline (automation, universal connectivity, scalable/flexible/fast, comprehensive and cloud-native), and some of the challenges with pipeline processes: it takes teams weeks or longer to access new data; organizations want to enrich data with new data sources, but can’t; there aren’t enough data engineers; pipeline growth causes problems such as errors and management overhead; and it’s hard for different personas to make use of pipelines for self-service. A quick poll of attendees showed a pretty even split among the different challenges.

top challenges with modern data pipelines

The panel was asked how they define a modern data pipeline, and what challenges their customers have that modern pipelines help solve.

Kunal talked about the different use cases data-driven companies have for their data to help define what a pipeline is. Banks are using big data to detect and prevent fraud, retailers are running multidimensional recommendation engines (products, price). Software companies are measuring their SaaS subscriptions.

“All these decisions, and all these insights, are now gotten through data products. And data pipelines are the backbone of running any of these business-critical processes,” he said. “So, ultimately, a data pipeline is a sequence of actions that’s gathering, collecting, moving this data from all the different sources to a destination. And the destination could be a data product or a dashboard. And a pipeline is all the stages it takes to clean up the data, transform the data, connect it together, and give it to the data scientist or business analyst to be able to make use of it.”

He said that with the increased volume, velocity, and variety of data, modern data pipelines need to be scalable and extensible—to add new data sources, to move to a different processing paradigm, for example. “And there are tons of vendors who are trying to crack this problem in unique ways. But besides the tools, it should all go back to what you’re trying to achieve, and what you’re trying to drive from the data, that dictates what kind of data pipeline or architecture you should have,” Kunal said.

Krish sees the modern data pipeline as the integration point enabling the true multi-cloud vision. It’s no longer just one system or another, but a mix-and-match “system of systems.” And the challenges revolve around moving workloads to the cloud. “If a company is an on-premises shop and they’re going to the cloud, it’s a laborious exercise,” he said. “There is no universal lift-and-shift mechanism for going to the cloud. That’s where pipelines come into play.”

Observability and the Modern Data Pipeline

The panel discussed the components and capabilities of the modern data pipeline, again circling back to the challenges spawned by the increased complexity. Kunal noted that one common capability among organizations that are running modern data pipelines successfully is observability.

“One key component we see along all the different stages of a pipeline is observability. Observability helps you ultimately improve reliability of these pipelines—that they work on time, every time—and improve the productivity of all the data team members. The poll results show that there aren’t enough data engineers, the demand far exceeds the supply. And what we see is that the data engineers who are present are spending more than 50% of their time just firefighting issues. So observability can help you eliminate—or at least get ahead of—all these different issues. And last but not least, we see that businesses tame their data ambitions by looking at their ballooning cloud bills and cloud costs. And that also happens because of the complexity that these data technologies present. Observability can also help get control and governance around cloud costs so you can start to scale your data operations in a much more efficient manner.”

See the entire panel discussion on demand

Watch the entire conversation with Kunal, Krish, and Fern from TDWI’s Expert Panel: Modern Data Pipelines on-demand replay.

The post Expert Panel: Challenges with Modern Data Pipelines appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/feed/ 0
Takeaways from CDO TechVent on Data Observability https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/ https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/#respond Tue, 30 Aug 2022 21:54:13 +0000 https://www.unraveldata.com/?p=10047 Applying AI to Automate Application Performance Tuning

The Eckerson Group recently presented a CDO TechVent that explored data observability, “Data Observability: Managing Data Quality and Pipelines for the Cloud Era.” Hosted by Wayne Eckerson, president of Eckerson Group, Dr. Laura Sebastian-Coleman, Data Quality […]

The post Takeaways from CDO TechVent on Data Observability appeared first on Unravel.

]]>

The Eckerson Group recently presented a CDO TechVent that explored data observability, “Data Observability: Managing Data Quality and Pipelines for the Cloud Era.” Hosted by Wayne Eckerson, president of Eckerson Group, Dr. Laura Sebastian-Coleman, Data Quality Director at Prudential Financial, and Eckerson VP of Research Kevin Petrie, the virtual event kicked off with a keynote overview of data observability products and best practices, followed by a technology panel discussion, “How to Evaluate and Select a Data Observability Platform,” with four industry experts.

Here are some of the highlights and interesting insights from the roundtable discussion on the factors data leaders should consider when looking at data observability solutions.

Josh Benamram, CEO of Databand, said it really depends on what problem you’re facing, or which team is trying to solve which problem. It’s going to be different for different stakeholders. For example, if your challenge is maintaining SLAs, your evaluation criteria would lean more towards Ops-oriented solutions that cater to data platform teams that need to manage the overall system. On the other hand, if your observability requirements are more targeted towards analysts, your criteria would be more oriented towards understanding the health of data tables rather than the overall pipeline. 

This idea of identifying the problems you’re trying to solve first, or as Wayne put it, “not putting the tools before the horse,” was a consistent refrain among all panelists.

Seth Rao, CEO of FirstEigen, said he would start by asking three questions: Where do you want to observe? What do you want to observe? How much do you want to automate? If you’re running Snowflake, there are a bunch of vendor solutions; but if you’re talking about data lakes, there’s a different set of solutions. Pipelines are yet another group of solutions. If you’re looking to observe the data itself, that’s a different type of observability altogether. Different solutions automate different pieces of observability. He suggested not trying to “boil the ocean” with one product that tries to do everything. He feels that you’ll get only an average product for all functions. Rather, he said, get flexibility of tooling—like Lego blocks that connect with other Lego blocks in your ecosystem.

This point drew the biggest reaction from the attendees (at least as evidenced by the Q&A chat). Who’s responsible for integrating all the different tools? We already don’t have enough time! A couple of panelists tackled the argument head-on, either in the panel discussion or in breakout sessions. 

Specifically, Rohit Choudhary, CEO of Acceldata, said that the purpose of observability is to simplify everything data teams have to do. You don’t have enough data engineers as it is, and now you’re asking data leaders to invest in a bunch of different data observability tools. Instead of actually helping them solve problems, you’re handing them more problems. He said to look at two things when evaluating data observability solutions: what it is capable of today, and what its roadmap looks like and what use cases it will support moving forward. Observability means different things to different people, and it all depends on whether the offering fits your maturity model. Smaller organizations with analytics teams of 10-20 people are probably fine with point solutions. But large enterprises that are dealing with data pipelines at petabyte scale are dealing with much greater complexity. For them, it would be prohibitively expensive to build their own observability solution. 

Chris Santiago, Unravel Data VP of Solutions Engineering, was of the same opinion but looked at things from a different slant. He agreed that different tools—system-specific point tools, native cloud vendor capabilities, various data quality monitoring solutions—all have strengths and weaknesses, with insight into different “pieces of the puzzle.” But rather than connect them all together as discrete building blocks, observability would be better realized by extracting all the relevant granular details, correlating them into a holistic context, and analyzing them with ML and other analytical algorithms so that data teams get the intelligence they need in one place. The problems data teams face—around performance, quality, reliability, cost—are all interconnected, so you’re saving a lot of valuable time and reducing manual effort to have as much insight as possible in a single pane of glass. He refers to such comprehensive capabilities as DataOps observability.

The dimension of cost was something Eckerson analyst Kevin Petrie highlighted in the wrap-up as a key emerging factor. He’s seeing an increased focus on FinOps capabilities, which Chris called out specifically: it’s not just making sure pipelines are running smoothly, but understanding where the spend is going and who the “big spenders” are, so that observability can uncover opportunities to optimize for cost and control/govern the cloud spend.

That’s the cost side for the business, but he said it’s also crucial to understand the profit side. Companies are investing millions of dollars in their modern data stack, but how are we measuring whether they’re getting the value they expected from their investment? Can the observability platform help make sense of all the business metrics in some way? Because at the end of the day, all these data projects have to deliver value. 

Check out Unravel’s breakout presentation, A DataOps Observability Dialogue: Empowering DevOps for Data Teams.

The post Takeaways from CDO TechVent on Data Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/feed/ 0
Join Unravel at the AI & Big Data Expo https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/ https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/#respond Fri, 19 Aug 2022 19:53:52 +0000 https://www.unraveldata.com/?p=9989 Abstract Blue Light Background

Swing by the Unravel Data booth at the AI & Big Data Expo in Santa Clara on October 5-6. The world’s leading AI & Big Data event returns as a hybrid in-person and virtual event, with […]

The post Join Unravel at the AI & Big Data Expo appeared first on Unravel.

]]>

Swing by the Unravel Data booth at the AI & Big Data Expo in Santa Clara on October 5-6. The world’s leading AI & Big Data event returns as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe. 

And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on DataOps Observability. He’ll explain why AI & Big Data organizations need an observability platform designed specifically for data teams and their unique challenges, the limitations of trying to “borrow” other observability (like APM) or relying on a bunch of different point tools, and how DataOps observability cuts across and incorporates cross-sections of multiple observability domains (data applications/pipelines/model observability, operations observability, business and FinOps observability, data observability).

Stop by our booth and you’ll be able to: 

  • Go under the hood with Unravel’s DataOps observability platform
  • Deep-dive into features and capabilities with our experts 
  • Learn what your peers have been able to accomplish with Unravel

Our experts will run demos and be available for 1-on-1 conversations throughout the conference. 

Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities. Explore features like:

Explore all our interactive guided tours here.

The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

You can register for the event here.

 

The post Join Unravel at the AI & Big Data Expo appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/feed/ 0
Tuning Spark applications: Detect and fix common issues with Spark driver https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/ https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/#respond Wed, 20 Jul 2022 00:01:04 +0000 https://www.unraveldata.com/?p=9894 Fireworks Sparkler Background

Learn more about Apache Spark drivers and how to tune Spark applications quickly. During the lifespan of a Spark application, the driver should distribute most of the work to the executors, instead of doing the work […]

The post Tuning Spark applications: Detect and fix common issues with Spark driver appeared first on Unravel.

]]>

Learn more about Apache Spark drivers and how to tune Spark applications quickly.

During the lifespan of a Spark application, the driver should distribute most of the work to the executors instead of doing the work itself. This is one of the advantages of using Python with Spark over Pandas. The Contended Driver event is detected when a Spark application spends far more time on the driver than in the Spark jobs on the executors. Leaving the executors idle can waste a lot of time and money, especially in a cloud environment.

Here is a sample application shown in Unravel for Azure Databricks. Note in the Gantt chart that there was a huge gap of about 2 hours 40 minutes between Job-1 and Job-2, while the duration of most of the Spark jobs was under 1 minute. Based on the data it collected, Unravel detected the bottleneck: a Contended Driver event.

Further digging into the Python code itself, we found that it actually tried to ingest data from the MongoDB server on the driver node alone. This left all the executors idle while the meter was still ticking.

Unravel Sample Azure Databricks Overview

A network issue caused the MongoDB ingestion to slow down from 15 minutes to more than 2 hours. Once this issue was resolved, cost dropped by about 93%. An alternative solution is to move the MongoDB ingestion out of the Spark application altogether; if it has no dependency on previous Spark jobs, we can run it before the Spark application starts.

If there is a dependency, we can split the Spark application into two. Unravel also collects job run status such as duration and I/O, as shown below, so we can easily see the history of all the job runs and monitor them.

Unravel Instance Summary

In conclusion, we must pay attention to the Contended Driver event in a Spark application, so we can save money and time by not leaving the executors idle for long stretches.
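To make the anti-pattern concrete, here is a minimal, hypothetical PySpark sketch. The connection URI, database, collection, and output path are placeholders, and the read options assume the MongoDB Spark connector 10.x, which may differ from what this particular application used. It contrasts pulling every document through the driver with letting the connector distribute the read across the executors, another common remedy alongside moving the ingestion out of the Spark application.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-ingest-demo").getOrCreate()

# Anti-pattern (roughly what caused the Contended Driver event): fetch every
# document on the driver with a client library, then hand it to Spark.
# All executors sit idle while the driver does the ingestion.
#
#   import pymongo
#   docs = list(pymongo.MongoClient("mongodb://host:27017")["db"]["events"].find())
#   df = spark.createDataFrame(docs)

# Distributed alternative: each executor reads its own share of the collection
# in parallel via the MongoDB Spark connector.
df = (
    spark.read.format("mongodb")                       # connector 10.x format name
    .option("connection.uri", "mongodb://host:27017")  # placeholder URI
    .option("database", "db")
    .option("collection", "events")
    .load()
)

df.write.mode("overwrite").parquet("/mnt/landing/events")  # example sink path
```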

 

Next steps

The post Tuning Spark applications: Detect and fix common issues with Spark driver appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/feed/ 0
Unravel for Google BigQuery Datasheet https://www.unraveldata.com/resources/unravel-for-google-bigquery-datasheet/ https://www.unraveldata.com/resources/unravel-for-google-bigquery-datasheet/#respond Wed, 06 Jul 2022 20:29:39 +0000 https://www.unraveldata.com/?p=9746

AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR GOOGLE BIGQUERY Performance. Reliability. Cost-effectiveness. Unravel’s automated, AI-powered data observability + FinOps platform for Google Cloud BigQuery and other modern data stacks provides 360° visibility to allocate costs with granular […]

The post Unravel for Google BigQuery Datasheet appeared first on Unravel.

]]>

AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR GOOGLE BIGQUERY

Performance. Reliability. Cost-effectiveness.

Unravel’s automated, AI-powered data observability + FinOps platform for Google Cloud BigQuery and other modern data stacks provides 360° visibility to allocate costs with granular precision, accurately predict spend, run 50% more workloads at the same budget, launch new apps 3X faster, and reliably hit greater than 99% of SLAs.

With Unravel Data Observability + FinOps for BigQuery you can:

  • Launch new apps 3X faster: End-to-end observability of data-native applications and pipelines. Automatic improvement of performance, cost efficiency, and reliability.
  • Run 50% more workloads for same budget: Break down spend and forecast accurately. Optimize apps and platforms by eliminating inefficiencies. Set guardrails and automate governance. Unravel’s AI helps you implement observability and FinOps to ensure you achieve efficiency goals.
  • Reduce firefighting time by 99% using AI-enabled troubleshooting: Detect anomalies, drift, skew, missing and incomplete data end-to-end. Integrate with multiple data quality solutions. All in one place.
  • Forecast budget with ±10% accuracy: Accurately anticipate cloud data spending for more predictable ROI. Unravel helps you accurately forecast spending with granular cost allocation. Purpose-built AI, at job, user, and workgroup levels, enables real-time visibility of ongoing usage (an illustrative sketch of per-user cost allocation follows this list).
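As a rough, hedged illustration of what granular cost allocation can mean in practice, the Python sketch below sums billed bytes per user from BigQuery’s INFORMATION_SCHEMA and applies an assumed on-demand rate. The project name, region qualifier, and $/TiB price are placeholders to adjust for your own environment; this is not how Unravel itself computes costs.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project ID
PRICE_PER_TIB = 6.25                             # assumed on-demand $/TiB rate

query = """
SELECT user_email,
       SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY tib_billed DESC
"""

# Print an approximate 30-day spend per user, largest consumers first.
for row in client.query(query).result():
    print(f"{row.user_email}: ~${row.tib_billed * PRICE_PER_TIB:,.2f} over the last 30 days")
```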

 

To see Unravel Data for BigQuery in action contact: Data experts  | 650 741-3442

The post Unravel for Google BigQuery Datasheet appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-for-google-bigquery-datasheet/feed/ 0
Unravel at Data Summit 2022 Recap: Beyond Observability https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/ https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/#respond Wed, 25 May 2022 14:02:44 +0000 https://www.unraveldata.com/?p=9556

At the recent Data Summit 2022 conference in Boston, Unravel presented “The DataOps Journey: Beyond Observability”—the most highly attended session within the conference’s DataOps Boot Camp track. In a nutshell, Unravel VPs Keith Alsheimer and Chris […]

The post Unravel at Data Summit 2022 Recap: Beyond Observability appeared first on Unravel.

]]>

At the recent Data Summit 2022 conference in Boston, Unravel presented “The DataOps Journey: Beyond Observability”—the most highly attended session within the conference’s DataOps Boot Camp track. In a nutshell, Unravel VPs Keith Alsheimer and Chris Santiago discussed:

  • The top challenges facing different types of data teams today
  • Why web APM and point tools leave fundamentally big observability gaps
  • How observability purpose-built for the modern data stack closes these gaps
  • What going “beyond observability” looks like

Data teams get caught in a 4-way crossfire 

It’s been said many times, many ways, but it’s true: nowadays every company has become a data company. Data pipelines are creating strategic business advantages, and companies are depending on their data pipelines more than ever before. And everybody wants in—every department in every business unit in every kind of company wants to leverage the insights and opportunities of big data.

everybody wants in on data

But data teams are finding themselves caught in a 4-way crossfire: not only are data pipelines more important than ever, but more people are demanding more output from them, there’s more data to process, people want their data faster, and the modern data stack has gotten ever more complex.

The increased volume, velocity, and variety of today’s data pipelines coupled with their increasing complexity is leaving battle scars across all teams. Data engineers find it harder to troubleshoot apps/pipelines, meet SLAs, and get quality right. Operations teams have too many issues to solve and not enough time—they’re drowning in a flood of service tickets. Data architects have to figure out how to scale these complex pipelines efficiently. Business unit leaders are seeing cloud costs skyrocket and are under the gun to deliver better ROI.

See how Unravel can help you keep pace with the volume, velocity and variety of today’s data pipelines
Try for free

It’s becoming increasingly harder for data teams to keep pace, much less do everything faster, better, cheaper. All this complexity is taking its toll. Data teams are bogged down and burning out. A recent survey shows that 70% of data engineers are likely to quit in the next 12 months. Demand for talent has always exceeded supply, and it’s getting worse. McKinsey research reveals that data analytics has the greatest need to address potential skills gaps, with 15 open jobs for every active job seeker. The latest LinkedIn Emerging Jobs Report says that 5 of the top 10 jobs most in demand are data-related.

And complexity continues in the cloud. First off, the variety of choices is huge—with more coming every day. Too often everybody is doing their own thing in the cloud, with no clear sense of what’s going on. This makes things hard to rein in and scale out, and most companies find their cloud costs are getting out of control.

DataOps is the answer, but  . . .

Adopting a DataOps approach can help solve these challenges. There are varying viewpoints on exactly what DataOps is (and isn’t), but we can certainly learn from how software developers and operations teams tackled a similar issue via DevOps—specifically, leveraging and enhancing collaboration, automation, and continuous improvement. 

One of the key foundational needs for this approach is observability. You absolutely must have a good handle on what’s going on in your system, and why. After all, you can’t improve things—much less simplify and streamline—if you don’t really know everything that’s happening. And you can’t break down silos and embrace collaboration without the ability to share information.

At DataOps Unleashed 2022 we surveyed some 2,000 participants about their top challenges, and by far the #1 issue was the lack of visibility across the environment (followed by lack of proactive alerts, the expense of runaway jobs/pipelines, lack of experts, and no visibility into cloud cost/usage).

top 5 DataOps challenges

One of the reasons teams have so much difficulty gaining visibility into their data estate is that the kind of observability that works great for DevOps doesn’t cut it for DataOps. You need observability purpose-built for the modern data stack.

Get the right kind of observability

When it comes to the modern data stack, traditional observability tools leave a “capabilities gap.” Some of the details you need live in places like the Spark UI, other information can be found in native cloud tools or other point solutions, but all too often you have to jump from tool to tool, stitching together disparate data by yourself manually. Application performance monitoring (APM) does this correlation for web applications, but falls far short for data applications/pipelines. 

First and foremost, the whole computing framework is different for data applications. Data workloads get broken down into multiple, smaller, often similar parts each processed concurrently on a separate node, with the results re-combined upon completion — parallel processing. In contrast, web applications are a tangle of discrete request-response services processed individually.

Consequently, there’s a totally different class of problems, root causes, and remediation for data apps vs. web apps. When doing your detective work into a slow or failed app, you’re looking at a different kind of culprit for a different type of crime, and need different clues. You need a whole new set of data points, different KPIs, from distinct technologies, visualized in another way, and correlated in a uniquely modern data stack–specific context.

You need to see details at a highly granular level — for each sub-task within each sub-part of each job — and then marry them together into a single pane of glass that comprises the bigger picture at the application, pipeline, platform, and cluster levels.

That’s where Unravel comes in.

Unravel extracts all the relevant raw data from all the various components of your data stack and correlates it to paint a picture of what’s happening. All the information captured by telemetry data (logs, metrics, events, traces) is pulled together in context and in the “language you’re speaking”—the language of data applications and pipelines—within the context of your workloads. 

full stack observability data


Going beyond observability

Full-stack observability tells you what’s going on. But to understand why, you need to go beyond observability and apply AI/ML to correlate patterns, identify anomalies, derive meaningful insights, and perform automated root cause analysis. And for that, you need lots of granular data feeding ML models and AI algorithms purpose-built specifically for modern data apps/pipelines.

The size and complexity of modern data pipelines—with hundreds (even thousands) of intricate interdependencies, jobs broken down into stages processing in parallel—could lead to a lot of trial-and-error resolution effort even if you know what’s happening and why. What you really need to know is what to do.

That’s where Unravel does what other observability solutions can only dream about.

Unravel’s AI recommendation engine tells you exactly what your next step is in crisp, precise detail. For example, if there’s one container in one part of one job that’s improperly sized and so causing the entire pipeline to fail, Unravel not only pinpoints the guilty party but tells you what the proper configuration settings would be. Or another example is that Unravel can tell you exactly why a pipeline is slow and how you can speed it up.

AI can identify exactly where and how to optimize performance

AI recommendations tell you exactly what to do to optimize for performance.
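For a sense of what such a fix looks like in practice, here is a purely illustrative sketch: the values below are invented for the example (not actual Unravel output), but container right-sizing for a Spark job typically comes down to a handful of settings like these.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("right-sized-job")
    # Before (hypothetical): 16g executors with most memory unused,
    # wasting capacity on every container in the job.
    .config("spark.executor.memory", "6g")                  # sized to observed usage
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.executor.cores", "4")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")   # cap runaway scale-out
    .getOrCreate()
)
```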

Observability is all about context. Traditional APM provides observability in context for web applications, but not for data applications and pipelines. But Unravel does. Its telemetry, correlation, anomaly detection, root cause analysis capabilities,  and AI-powered remediation recommendations are all by design built specifically to understand, troubleshoot, and optimize modern data workloads.

See how Unravel goes beyond observability
Try for free

The post Unravel at Data Summit 2022 Recap: Beyond Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/feed/ 0
Why Legacy Observability Tools Don’t Work for Modern Data Stacks https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/ https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/#comments Fri, 13 May 2022 13:13:01 +0000 https://www.unraveldata.com/?p=9406

Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics […]

The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

]]>

Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics has gone from being a science project to becoming the backbone of business operations to generate insights, fuel innovation, improve customer satisfaction, and drive revenue growth. But none of that can happen if data applications and pipelines aren’t running well.

Yet data-driven organizations find themselves caught in a crossfire: their data applications/pipelines are more important than ever, but managing them is more difficult than ever. As more data is generated, processed, and analyzed in an increasingly complex environment, businesses are finding the tools that served them well in the past or in other parts of their technology stack simply aren’t up to the task.

Modern data stacks are a different animal

Would you want an auto mechanic (no matter how excellent) to diagnose and fix a jet engine problem before you took flight? Of course not. You’d want an aviation mechanic working on it. Even though the basic mechanical principles and symptoms (engine trouble) are similar, automobiles and airplanes are very different under the hood. The same is true with observability for your data application stacks and your web application stacks. The process is similar, but they are totally different animals.

At first glance, it may seem that the leading APM monitoring and observability tools like Datadog, New Relic, Dynatrace, AppDynamics, etc., do the same thing as a modern data stack observability platform like Unravel. And in the sense that both capture and correlate telemetry data to help you understand issues, that’s true. But one is designed for web apps, while the other for modern data pipelines and applications.

Observability for the modern data stack is indeed completely different from observability for web (or mobile) applications. They are built and behave differently, face different types of issues for different reasons, requiring different analyses to resolve problems. To fully understand, troubleshoot, and optimize (for both performance and cost) data applications and pipelines, you need an observability platform that’s built from the ground up to tackle the specific complexities of the modern data stack. Here’s why.

What’s different about modern data applications?

First and foremost, the whole computing framework is different for data applications. Data workloads get broken down into multiple, smaller, often similar parts each processed concurrently on a separate node, with the results re-combined upon completion: parallel processing. And this happens at each successive stage of the workflow as a whole. Dependencies within data applications/pipelines are deep and layered. It’s crucial that everything (execution time, scheduling and orchestration, data lineage, and layout) be in sync.

In contrast, web applications are a tangle of discrete request-response services processed individually. Each service does its own thing and operates relatively independently. What’s most important is the response time of each service request and how that contributes to the overall response time of a user transaction. Dependencies within web apps are not especially deep but are extremely broad.

web apps vs data apps

Web apps are request-response; data apps process in parallel.
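A toy PySpark sketch makes the contrast concrete: a single action fans out into 64 parallel tasks (one per partition) on the executors, and their partial results are combined at the end, which is nothing like a chain of discrete request-response calls. The example is minimal and assumes a running Spark environment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()

# One logical job, split into 64 partitions processed in parallel on executors.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=64)

# The map runs as 64 concurrent tasks; the sum combines their partial results.
total = rdd.map(lambda x: x * x).sum()
print(total)
```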

Consequently, there’s a totally different class of problems, root causes, and remediation for data apps vs. web apps. When doing your detective work into a slow or failed app, you’re looking at a different kind of culprit for a different type of crime, and need different clues (observability data). You need a whole new set of data points, different KPIs, from distinct technologies, visualized in another way, and correlated in a uniquely modern data stack–specific context.

The flaw with using traditional APM tools for modern data stacks

What organizations that try to use traditional APM for the modern data stack find is that they wind up getting only a tiny fraction of the information they need from a solution like Dynatrace or Datadog or AppDynamics, such as infrastructure and services-level metrics. But over 90% of the information data teams need is buried in places where web APM simply doesn’t go; you need an observability platform specifically designed to dig through all the systems to get this data and then stitch it together into a unified context.

This is where the complexity of modern data applications and pipelines rears its head. The modern data stack is not a single system, but a collection of systems. You might have Kafka or Kinesis or Data Factory for data ingestion, some sort of data lake to store it, then possibly dozens of different components for different types of processing: Druid for real-time processing, Databricks for AI/ML processing, BigQuery or Snowflake for data warehouse processing, another technology for batch processing; the list goes on. So you need to capture deep information horizontally across all the various systems that make up your data stack. But you also need to capture deep information vertically, from the application down to infrastructure and everything in between (pipeline definitions, data usage/lineage, data types and layout, job-level details, execution settings, container-level information, resource allocation, etc.).

Cobbling it together manually via “swivel chair whiplash,” jumping from screen to screen, is a time-consuming, labor-intensive effort that can take hours, even days, for a single problem. And still there’s a high risk that it won’t be completely accurate. There is simply too much data to make sense of, in too many places. Trying to correlate everything on your own, whether by hand or with a homegrown jury-rigged solution, requires two things that are always in short supply: time and expertise. Even if you know what you’re looking for, trolling through hundreds of log files is like looking for a needle in a stack of needles.

An observability platform purpose-built for the modern data stack can do all that for you automatically. Trying to make traditional APM observe data stacks is simply using the wrong tool for the job at hand.

DevOps APM vs. DataOps observability in practice

With the growing complexity in today’s modern data systems, any enterprise-grade observability solution should do 13 things:

  1. Capture full-stack data (both horizontally and vertically) from the various systems that make up the stack, including engines, schedulers, services, and cloud provider
  2. Capture information about all data applications: pipelines, warehousing, ETL/ELT, machine learning models, etc.
  3. Capture information about datasets, lineage, users, business units, computing resources, infrastructure, and more
  4. Correlate, not just aggregate, data collected into meaningful context
  5. Understand all application/pipeline dependencies on data, resources, and other apps
  6. Visualize data pipelines end to end from source to output
  7. Provide a centralized view of your entire data stack, for governance, usage analytics, etc.
  8. Identify baseline patterns and detect anomalies
  9. Automatically analyze and pinpoint root causes
  10. Proactively alert to prevent problems
  11. Provide recommendations and remedies to solve issues
  12. Automate resolution or self-healing
  13. Show the business impact

While the principles are the same for data app and web app observability, how to go about this and what it looks like are markedly dissimilar.

Everything starts with the data, and correlating it

If you don’t capture the right kind of telemetry data, nothing else matters.

APM solutions inject agents that run 24×7 to monitor the runtime and behavior of applications written in .NET, Java, Node.js, PHP, Ruby, Go, and dozens of other languages. These agents collect data on all the individual services as they snake through the application ecosystem. Then APM stitches together all the data to understand which services the application calls and how the performance of each discrete service call impacts performance of the overall transaction. The various KPIs revolve around response times, availability (up/down, green/red), and the app users’ digital experience. The volume of data to be captured is incredibly broad, but not especially deep.

Datadog metrics

APM is primarily concerned with response times and availability. Here, Datadog shows red/green status and aggregated metrics.

AppDynamic service map

Here, AppDynamics shows the individual response times for various interconnected services.

The telemetry details to be captured and correlated for data applications/pipelines, on the other hand, need to be both broad and extremely deep. A modern data workload comprises hundreds of jobs, each broken down into parallel-processing parts, with each part executing various tasks. And each job feeds into hundreds of other jobs and applications not only in this particular pipeline but all the other pipelines in the system.

Today’s pipelines are built on an assortment of distributed processing engines; each might be able to monitor its own application’s jobs but not show you how everything works as a whole. You need to see details at a highly granular levelfor each sub-task within each sub-part of each joband then marry them together into a single pane of glass that comprises the bigger picture at the application, pipeline, platform, and cluster levels.

app level metrics

DataOps observability (here, Unravel) looks at completely different metrics at the app level . . .

pipeline metrics

. . . as well as the pipeline level.

Let’s take troubleshooting a slow Spark application as an example. The information you need to investigate the problem lives in a bunch of different places, and the various tools for getting this data give you only some of what you need, not all.

The Spark UI can tell you about the status of individual jobs but lacks infrastructure and configuration details and other information to connect together a full pipeline view. Spark logs help you retrospectively find out what happened to a given job (and even what was going on with other jobs at the same time) but don’t have complete information about resource usage, data partitioning, container configuration settings, and a host of other factors that can affect performance. And, of course, Spark tools are limited to Spark. But Spark jobs might have data coming in from, say, Kafka and run alongside a dozen other technologies.

Conversely, platform-specific interfaces (Databricks, Amazon EMR, Dataproc, BigQuery, Snowflake) have the information about resource usage and the status of various services at the cluster level, but not the granular details at the application or job level.

Data pipeline details that APM doesn't capture

Having all the information specific to data apps is a great start, but it isn’t especially helpful if it’s not all put into context. The data needs to be correlated, visualized, and analyzed in a purposeful way that lets you get to the information you need easily and immediately.

Then there’s how data is visualized and analyzed

Even the way you need to look at a data application environment is different. A topology map for a web application shows dependencies like a complex spoke-and-wheel diagram. When visualizing web app environments, you need to see the service-to-service interrelationships in a map like this:

Dynatrace map

How Dynatrace visualizes service dependencies in a topology map.

With drill-down details on service flows and response metrics:

Dynatrace drill down

Dynatrace drill-down details

For a modern data environment, you need to see how all the pipelines are interdependent and in what order. The view is more like a complex system of integrated funnels:

pipeline complexity

A modern data estate involves many interrelated application and pipeline dependencies (Source: Sandeep Uttamchandani)

You need full observability into not only how all the pipelines relate to one another, but all the dependencies of multiple applications within each pipeline . . .

visibility into individual applications within data pipeline

An observability platform purpose-built for modern data stacks provides visibility into all the individual applications within a particular pipeline

. . . with granular drill-down details into the various jobs within each application. . .

visibility into all the jobs within each application in a data pipeline

 . . and the sub-parts of each job processing in parallel . . .

granular visibility into sub-parts of jobs in an application within a data pipeline

How things get fixed

Monitoring and observability tell you what’s going on. But to understand why, you need to go beyond observability and apply AI/ML to correlate patterns, identify anomalies, derive meaningful insights, and perform automated root cause analysis. “Beyond observability” is a continuous and incremental spectrum, from understanding why something happened to knowing what to do about it to automatically fixing the issue. But to make that leap from good to great, you need ML models and AI algorithms purpose-built for the task at hand. And that means you need complete data about everything in the environment.

spectrum of automated observability from correlation to self-healing

The best APM tools have some sort of AI/ML-based engine (some are more sophisticated than others) to analyze millions of data points and dependencies, spot anomalies, and alert on them.

For data applications/pipelines, the type of problems and their root causes are completely different than web apps. The data points and dependencies needing to be analyzed are completely different. The patterns, anomalous behavior, and root causes are different. Consequently, the ML models and AI algorithms need to be different.

In fact, DataOps observability needs to go even further than APM. The size of modern data pipelines and the complexities of multi-layered dependencies (from clusters to platforms and frameworks to applications to jobs within applications to sub-parts of those jobs to the various tasks within each sub-part) could lead to a lot of trial-and-error resolution effort even if you know what’s happening and why. What you really need to know is what to do.

An AI-driven recommendation engine like Unravel goes beyond the standard idea of observability to tell you how to fix a problem. For example, if there’s one container in one part of one job that’s improperly sized and so causing the entire pipeline to fail, Unravel not only pinpoints the guilty party but tells you what the proper configuration settings would be. Or another example is that Unravel can tell you exactly why a pipeline is slow and how you can speed it up. This is because Unravel’s AI has been trained over many years to understand the specific intricacies and dependencies of modern data stacks.

AI can identify exactly where and how to optimize performance

AI recommendations tell you exactly what to do to optimize for performance.

Business impact

Sluggish or broken web applications cost organizations money in terms of lost revenue and customer dissatisfaction. Good APM tools are able to put the problem into a business context by providing a lot of details about how many customer transactions were affected by an app problem.

As more and more of an organization’s operations and decision-making revolve around data analytics, data pipelines that miss SLAs or fail outright have an increasingly significant (negative) impact on the company’s revenue, productivity, and agility. Businesses must be able to depend on their data applications, so their applications need to have predictable, reliable behavior.

For example: If a fraud prevention data pipeline stops working for a bank, it can cause billions of dollars in fraudulent transactions going undetected. Or a slow healthcare analysis pipeline may increase risk for patients by failing to provide timely responses. Measuring and optimizing performance of data applications and pipelines correlates directly to how well the business is performing.

Businesses need proactive alerts when pipelines deviate from their normal behavior. But going “beyond observability” would tell them automatically why this is happening and what they can do to get the application back on track. This allows businesses to have reliable and predictable performance.

There's also an immediate bottom-line impact that businesses need to consider: maximizing their return on investment and controlling/optimizing cloud spend. Modern data applications process a lot of data, which usually consumes a large amount of resources, and the meter is always running. This means the cloud bills can rack up fast.

To keep costs from spiraling out of control, businesses need actionable intelligence on how best to optimize their data pipelines. An AI recommendations engine can take all the profile and other key information it has about applications and pipelines and identify where jobs are overprovisioned or could be tuned for improvement. For example: optimizing code to remove inefficiencies, right-sizing containers to avoid wastage, providing the best data partitioning based on goals, and much more.
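
As a rough illustration of the right-sizing idea (not Unravel's actual algorithm; the helper and numbers below are hypothetical), a recommendation boils down to comparing the resources a job requested against what it actually used at peak, plus some headroom:

  # Hypothetical illustration of container right-sizing: compare the memory a
  # job actually used at peak against what was requested, add headroom, and
  # estimate the waste. Real engines weigh many more signals (skew, GC, spills).
  def recommend_container_gb(requested_gb, observed_peak_gb, headroom=0.20):
      recommended = round(observed_peak_gb * (1 + headroom), 1)
      return min(recommended, requested_gb)

  requested, peak = 16.0, 6.2            # GB per container, made-up numbers
  rec = recommend_container_gb(requested, peak)
  print(f"recommend {rec} GB per container (~{1 - rec / requested:.0%} less memory)")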

AI can identify exactly where and how to optimize cloud costs

AI recommendations pinpoint exactly where and how to optimize for cost.

AI recommendations and deep insights lay the groundwork for putting in place some automated cost guardrails for governance. Governance is really all about converting the AI recommendations and insights into impact. Automated guardrails (per user, app, business unit, project) would alert operations teams about unapproved spend, potential budget overruns, jobs that run over a certain time/cost threshold, and the like. You can then proactively manage your budget, rather than getting hit with sticker shock after the fact.
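
A guardrail of this kind can be pictured as a simple budget check that runs continuously. The sketch below is hypothetical (the project names, budgets, and the 80% alert threshold are made up), but it shows the basic shape:

  # Hypothetical budget guardrail: compare month-to-date spend per project
  # against its budget and raise an alert when a threshold is crossed.
  BUDGETS = {"marketing-analytics": 40_000, "fraud-detection": 75_000}   # USD per month

  def check_guardrails(month_to_date_spend, alert_at=0.80):
      alerts = []
      for project, spend in month_to_date_spend.items():
          budget = BUDGETS.get(project)
          if budget and spend >= alert_at * budget:
              alerts.append(f"{project}: ${spend:,.0f} is {spend / budget:.0%} of budget")
      return alerts

  print(check_guardrails({"marketing-analytics": 37_500, "fraud-detection": 30_000}))
  # ['marketing-analytics: $37,500 is 94% of budget']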

In a nutshell

Application monitoring and observability solutions like Datadog, Dynatrace, and AppDynamics are excellent tools for web applications. Their telemetry, correlation, anomaly detection, and root cause analysis capabilities do a good job of helping you understand, troubleshoot, and optimize most areas of your digital ecosystem, the one exception being the modern data stack. They are by design built for general-purpose observability of user interactions.

In contrast, an observability platform for the modern data stack like Unravel is more specialized. Its telemetry, correlation, anomaly detection, and root cause analysis capabilities (and, uniquely in the case of Unravel, its AI-powered remediation recommendations, automated guardrails, and automated remediation) are by design built specifically to understand, troubleshoot, and optimize modern data workloads.

Observability is all about context. Traditional APM provides observability in context for web applications, but not for data applications and pipelines. That’s not a knock on these APM solutions. Far from it. They do an excellent job at what they were designed for. They just weren’t built for observability of the modern data stack. That requires another kind of solution designed specifically for a different kind of animal.

The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/feed/ 1
Visit Unravel at Data Summit 2022 https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/ https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/#respond Fri, 13 May 2022 13:12:46 +0000 https://www.unraveldata.com/?p=9429 Sparkler Blurred Background

Get a quick Unravel product demo at Data Summit 2022 Tuesday & Wednesday, May 17-18, at the Hyatt Regency Boston. Presented by our friends from DBTA (Database Trends and Applications), Data Summit 2022 is a unique […]

The post Visit Unravel at Data Summit 2022 appeared first on Unravel.

]]>
Sparkler Blurred Background

Get a quick Unravel product demo at Data Summit 2022 Tuesday & Wednesday, May 17-18, at the Hyatt Regency Boston.

Data Summit 2022

Presented by our friends from DBTA (Database Trends and Applications), Data Summit 2022 is a unique conference that brings together IT practitioners and business stakeholders from all types of organizations to learn, share, and celebrate the trends and technologies shaping the future of data. See where the world of Big Data and data science is going, and how to get there fast. 

Featuring workshops, panel discussions, and provocative talks, Data Summit 2022 provides a comprehensive educational experience designed to guide you through all of today’s key issues in data management and analysis. Whether your interests lie in the technical possibilities and challenges of new and emerging technologies or using Big Data for business intelligence, analytics, and other business strategies, we have something for you!

And Unravel is there as a sponsor, so be sure to stop by and check out how we’re going beyond observability to solve some of today’s biggest challenges with:

And we're running a raffle – stop by our booth for a chance to win an Oculus VR headset.

The conference has different talk tracks over the two days—Modern Data Strategy Essentials Today, Emerging Technologies & Trends in Data & Analytics, What’s Next in Data & Analytics Architecture, The Future of Data Warehouses, Data Lakes & Data Hubs—as well as special day-long programs: DataOps Boot Camp, Database & DevOps Boot Camp, and AI & Machine Learning Summit.

Doug Laney, Data & Strategy Innovation Fellow at West Monroe and author of Infonomics, is the keynote speaker with his presentation Data Is Not the New Oil.

On Tuesday, May 17, don’t miss Unravel Co-Founder & CEO Kunal Agarwal’s feature presentation, Exploiting the Multi-Cloud Opportunity With DataOps, in which he

  • details the common obstacles that data teams encounter in data migration
  • explains why next-generation data tools must evolve beyond simple observability to provide prescriptive insights
  • shares best practices for optimizing big data costs
  • demonstrates through real-world case studies how a mature DataOps practice can accelerate even the most complex cloud migration projects.

So stop by the Unravel booth to go beyond observability, win an Oculus VR headset, and see how Unravel's AI-enabled, purpose-built observability for the modern data stack can help your data team monitor, observe, manage, troubleshoot, and optimize the performance and cost of large-scale modern data applications.

Register here

The post Visit Unravel at Data Summit 2022 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/feed/ 0
Webinar Recap: Succeeding with DataOps https://www.unraveldata.com/resources/webinar-recap-succeeding-with-dataops/ https://www.unraveldata.com/resources/webinar-recap-succeeding-with-dataops/#respond Fri, 06 May 2022 15:13:22 +0000 https://www.unraveldata.com/?p=9366 DataOps Abstract Background

The term DataOps (like its older cousin, DevOps) means different things to different people. But one thing that everyone agrees on is its objective: accelerate the delivery of data-driven insights to the business quickly and reliably. […]

The post Webinar Recap: Succeeding with DataOps appeared first on Unravel.

]]>
DataOps Abstract Background

The term DataOps (like its older cousin, DevOps) means different things to different people. But one thing that everyone agrees on is its objective: accelerate the delivery of data-driven insights to the business quickly and reliably.

As the business now expects—or more accurately, demands—increasingly more insights from data, on increasingly shorter timelines, with increasingly more data being generated, in increasingly more complex data estates, it becomes ever more difficult to keep pace with the new levels of speed and scale.

Eliminating as much manual effort as possible is key. In a recent Database Trends and Applications (DBTA)-hosted roundtable webinar, Succeeding with DataOps: Implementing, Managing, and Scaling, three engineering leaders with practical experience in enterprise-level DataOps present solutions that simplify and accelerate:

  • Security. Satori Chief Scientist Ben Herzberg discusses how you can meet data security, privacy, and compliance requirements faster with a single universal approach.
  • Analytics. Incorta VP of Product Management Ashwin Warrier shows how you can make more data available to the business faster and more frequently.
  • Observability. Unravel VP of Solutions Engineering Chris Santiago explains how AI-enabled full-stack observability accelerates data application/pipeline optimization and troubleshooting, automates cost governance, and delivers the cloud migration intelligence you need before, during, and after your move to the cloud.

Incorta Satori Unravel

Universal data access and security

Businesses today are sitting on a treasure trove of data for analytics. But within that gold mine is a lot of personally identifiable information (PII) that must be safeguarded. That brings up an inherent conflict: making the data available to the company's data teams in a timely manner without exposing the company to risk. The accelerated volume, velocity, and variety of data operations only make the challenge greater.

Satori’s Ben Herzberg shared that recently one CISO said he has visibility over his network and his endpoints, but data was like a black box—it was hard for the Security team to understand what was going on. Security policies usually had to be implemented by other teams (data engineering or DataOps, database administrators). 

Satori's 5 principles of DataSecOps

The 5 capabilities of Satori’s DataSecOps

So what Satori does is provide a universal data access and security platform where organizations can manage all their security policies across their entire data environment from one single place in a simplified way. Ben outlined the five capabilities that Satori enables to realize DataSecOps:

  1. Continuous data discovery and classification
    Data keeps on changing, with more people touching it every day. Auditing or mapping your sensitive data once a year, even once a quarter or once a month, may not be sufficient.
  2. Dynamic data masking
    Different data consumers need different levels of data anonymization. Usually to dynamically mask data, you have to rely on the different data platforms themselves.
  3. Granular data access policies
    A key way to safeguard data is via Attribute-based Access Control (ABAC) policies. For example, you may want data analysts to be able to access data stores using Tableau or Looker but not with any other technologies (a generic sketch of this kind of policy, combined with dynamic masking, follows this list).
  4. Access control and workflows
    Satori automates and simplifies the way access is granted or revoked. A common integration is with Slack, so the Security team can approve or deny access requests quickly.
  5. Monitoring, audit & compliance
    Satori sits between the data consumers and the data that's being accessed, so monitoring/auditing for compliance is transparent without ever touching the data itself.
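
To illustrate the general pattern behind capabilities 2 and 3, here is a generic sketch of attribute-based access combined with dynamic masking. This is not Satori's API; the roles, attributes, and masking rule are hypothetical:

  # Generic illustration (not Satori's API): the same row is returned
  # differently depending on the requester's attributes and roles.
  def mask_email(value):
      user, _, domain = value.partition("@")
      return f"{user[0]}***@{domain}"

  def read_customer(row, requester):
      if requester.get("tool") not in {"tableau", "looker"}:     # ABAC-style check
          raise PermissionError("analysts may only connect via approved BI tools")
      if "pii-reader" not in requester.get("roles", ()):         # dynamic masking
          row = {**row, "email": mask_email(row["email"])}
      return row

  row = {"id": 42, "email": "jane.doe@example.com", "spend": 1234}
  print(read_customer(row, {"tool": "tableau", "roles": ["analyst"]}))
  # {'id': 42, 'email': 'j***@example.com', 'spend': 1234}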

Keeping data safe is always a top-of-mind concern, but no organization can afford to delay projects for months in order to apply security.

Scaling analytics with DataOps

When you think of the complexity of the modern data stack, you’re looking at a proliferation of cloud-based applications but also many legacy on-prem applications that continue to be the backbone of a large enterprise (SAP, Oracle). You have all these sources rich with operational data that business users need to build analytics on top of. So the first thing you see in a modern data stack is the data being moved to a data lake or data warehouse, to make it readily accessible along with making refinements along the way.

But to Incorta's Ashwin Warrier, this presents two big challenges when it comes to enterprise analytics and operational reporting. First, getting through this entire process takes weeks, if not months. Second, once you finally get the data into the hands of business users, it's aggregated. Much of the data's original fidelity is gone because of summarizations and aggregations. Building a bridge between the analytics plane and the operations plane requires some work.

Incorta Direct Data Platform

Incorta’s Direct Data Platform approach

Where Incorta is different is that it does enrichments but doesn't try to create transformations to take your source third normal form data and convert that into star schemas. It takes an exact copy with a minimal amount of change. Having the same fidelity means that now your business users can get access to their data at scale in just a matter of hours or days instead of weeks or months.

Ashwin related the example of a Fortune 50 customer with about 2,000 users developing over 5,000 unique reports. It had daily refreshes, with an average query time of over 2 minutes, and took over 8 days to deliver insights. With Incorta, it reduced average query time to just a few seconds, now refreshes data 96 times every day, and can deliver insights in less than 1 day.

Unravel is purpose-built for the modern data stack
Create a free account

Taming the beast of modern data stacks

Unravel's Chris Santiago opened with an overview of how the compounding complexity of modern data stacks causes problems for different members of data teams. Data engineers' apps aren't hitting their SLAs; how can they make them more reliable? Operations teams are getting swamped with tickets; how can they troubleshoot faster? Data architects are falling behind schedule and going over budget; how can they speed migrations without more risk? Business leaders are seeing skyrocketing cloud bills; how can they better govern costs?

Then he went into the challenges of complexity. A single modern data pipeline is complex enough on its own, with dozens or even hundreds of interdependent jobs on different technologies (Kafka, Airflow, Spark, Hive, all the barn animals in Hadoop). But most enterprises run large volumes of these complex pipelines continuously—now we’re talking about the number of jobs getting into the thousands. Plus you’ve got multiple instances and data stored in different geographical locations around the globe. Already it’s a daunting task to figure out where and why something went wrong. And now there’s multi-cloud.

Managing this environment with point tools is going to be a challenge: they are very specific to the service they’re running, and the crucial information you need is spread across dozens of different systems and tools. Chris points out four key reasons why managing modern data stacks with point tools falls flat:

where point tools fail data stack observability

4 ways point tools fall flat for managing the modern data stack

  1. You get only a partial picture.
    The Spark UI, for instance, has granular details about individual jobs but not at the cluster level. Public cloud vendors' tools have some of this information, but nothing at the job level.
  2. You don’t get to the root cause.
    You’ll get a lot of graphs and charts, but what you really need to know is what went wrong, why—and how to fix it.
  3. You’re blind to where you’re overspending.
    You need to know exactly which jobs are using how much resources, whether that’s appropriate for the job at hand, and how you can optimize for performance and cost.
  4. You can’t understand your cloud migration needs.
    Things are always changing, and changing fast. You need to always be thinking about the next move.

That’s where Unravel comes in.

Unravel is purpose-built for the modern data stack
Create a free account

Unravel is purpose-built to collect granular data app/pipeline-specific details from every system in your data estate, correlate it all in a “workload-aware” context automatically, analyze everything for you, and provide actionable insights and precise AI recommendations on what to do next.

4 capabilities of Unravel

Unravel is purpose-built for the modern data stack

  1. Get single-pane-of-glass visibility across all environments.
    You’ll have complete visibility to understand what’s happening at the job level up to the cluster level. 
  2. Optimize automatically.
    Unravel's AI engine is like having a Spark, Databricks, or Amazon EMR expert telling you exactly what you need to do to optimize performance to meet SLAs or change instance configurations to control cloud costs.
  3. Fine-grained insight into cloud costs.
    See at a granular level exactly where the money is going, set some budgets, track spend month over month—by project, team, even individual job or user—and have AI uncover opportunities to save.
  4. Migrate on time, on budget.
    Move to the cloud with confidence, knowing that you have complete and accurate insight into how long migration will take, the level of effort involved, and what it’ll cost once you migrate.

As businesses become ever more data-driven, build out more modern data stack workloads, and adopt newer cloud technologies, it will become ever more important to be able to see everything in context, let AI take much of the heavy lifting and manual investigation off the shoulders of data teams already stretched too thin, and manage, troubleshoot, and optimize data applications/pipelines at scale.

Check out the full webinar here.

The post Webinar Recap: Succeeding with DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/webinar-recap-succeeding-with-dataops/feed/ 0
Beyond Observability for the Modern Data Stack https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/ https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/#respond Tue, 26 Apr 2022 16:04:21 +0000 https://www.unraveldata.com/?p=9253 Abstract 3D Tunnel Binary Code

The term “observability” means many things to many people. A lot of energy has been spent—particularly among vendors offering an observability solution—in trying to define what the term means in one context or another. But instead […]

The post Beyond Observability for the Modern Data Stack appeared first on Unravel.

]]>
Abstract 3D Tunnel Binary Code

The term “observability” means many things to many people. A lot of energy has been spent—particularly among vendors offering an observability solution—in trying to define what the term means in one context or another.

But instead of getting bogged down in the “what” of observability, I think it’s more valuable to address the “why.” What are we trying to accomplish with observability? What is the end goal? 

At Unravel, I’m not only the co-founder and CTO but also the head of our Customer Success team, so when thinking about modern data stack observability, I look at the issue through the lens of customer problems and what will alleviate their pain and solve their challenges. 

I start by considering the ultimate end goal or Holy Grail: autonomous systems. Data teams just want things to work; they want everything to be taken care of, automatically. They want the system itself to be watching over issues, resolving problems without any human intervention at all. It’s the true spirit of AI: all they have to do is tell the system what they want to achieve. It’s going from reactive to proactive. No one ever has to troubleshoot the problem, because it was “fixed” before it ever saw the light of day. The system recognizes that there will be a problem and auto-corrects the issue invisibly.

As an industry, we’re not completely at that point yet for the modern data stack. But we’re getting closer.

We are on a continuous and incremental spectrum of progressively less toil and more automation: going from the old ways of manually stitching together logs, metrics, traces, and events from disparate systems; to accurate extraction and correlation of all that data in context; to automatic insights and identification of significant patterns; to AI-driven actionable recommendations; to autonomous governance.

spectrum of observability and beyond

“Beyond observability” is on a spectrum from manual to autonomous

See how Unravel goes “beyond observability” for the modern data stack
Create a free account

Beyond observability: to optimization and on to governance

Let’s say you have a data pipeline from Marketing Analytics called ML_Model_Update_and_Scoring that’s missing its SLA. The SLA specifies that the pipeline must run in less than 20 minutes, but it’s taking 25 minutes. What happened? Why is the data pipeline now too slow? And how can you tune things to meet the SLA? This particular application is pretty complex, with multiple jobs processing in parallel and several components (orchestration, compute, auto-scaling clusters, streaming platforms, dashboarding), so the problem could be anywhere along the line.

It’s virtually impossible to manually pore through the thousands of logs that are generated at each stage of the pipeline from the various tools—Spark and Kafka and Airflow logs, Databricks cluster logs, etc.—to “untangle the wires” and figure out where the slowdown could be. But where should you even start? Even if you have an idea of what you’re looking for, it can take hours—even days or weeks for highly complex workflows—to stitch together all the raw metrics/events/logs/traces data to figure out what’s meaningful to why your data pipeline is missing its SLA. Just a single app can run 10,000 containers on 10,000 nodes, with 10,000 logs. 

Complexity of modern data pipelines

Modern data stacks are simply too complex for humans to manage by hand

That’s where observability comes in.

Observability tells you “what”

Instead of having to sift through reams of logs and cobble together everything manually, full-stack observability extracts all the relevant raw data from all the various components of your data stack and correlates it to paint a picture of what’s going on. All the information captured by telemetry data (logs, metrics, events, traces) is pulled together in context and in the “language you’re speaking”—the language of data applications and pipelines.

full stack observability data

Full-stack observability correlates data from all systems to provide a clear picture of what happened

In this case, observability shows you that while the application completed successfully, it took almost 23 minutes (22m 57s)—violating the SLA of 20 minutes. Here, Unravel takes you to the exact spot in the complex pipeline shown earlier and, on the left, has pulled together granular details about the various jobs processing in parallel. You can toggle on a Gantt chart view to get a better view of the degree of parallelism:

see data jobs processing in parallel

A Gantt chart view breaks down all the individual jobs and sub-tasks processing in parallel

So now you know what caused the pipeline to miss the SLA and where it happened (jobs #0 and #3), but you don’t know why. You’ve saved a lot of investigation time—you get the relevant data in minutes, not hours—and can be confident that you haven’t missed anything, but you still need to do some detective work to analyze the information and figure out what went wrong.
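
As a rough illustration of the detective work this correlation saves (the job timings below are hypothetical, not the actual values from this pipeline), finding the jobs that dominate a parallel stage is essentially a comparison of each job's duration against the stage's wall-clock time:

  # Hypothetical timings: given start/end minutes for jobs running in
  # parallel, find which jobs dominate the wall-clock time of the stage.
  jobs = {
      "job_0": (0.0, 14.5),
      "job_1": (0.0, 6.2),
      "job_2": (0.0, 5.8),
      "job_3": (6.2, 22.9),
  }

  wall_clock = max(end for _, end in jobs.values())
  dominant = [name for name, (start, end) in jobs.items()
              if end - start > 0.4 * wall_clock]
  print(f"stage wall clock: {wall_clock} min; dominant jobs: {dominant}")
  # stage wall clock: 22.9 min; dominant jobs: ['job_0', 'job_3']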

observability, optimization and governance

Optimization tells you “why”—and what to do about it

The better observability tools also point out, based on patterns, things you should pay attention to—or areas you don’t need to investigate. By applying machine learning and statistical algorithms, it essentially throws some math at the correlated data to identify significant patterns—what’s changed, what hasn’t, what’s different. This pipeline runs regularly; why was it slow this time? It’s the same kind of analysis a human expert would do, only done automatically with the help of ML and statistical algorithms.

While it would certainly be helpful to get some generalized insight into why the pipeline slowed—it’s a memory issue—what you really need to know is what to do about it. 

AI-enabled observability goes beyond telling you what happened and why, to pinpointing the root cause and providing recommendations on exactly what you need to do to fix it.

AI-driven recommendations to fix slow data pipeline

AI-driven recommendations pinpoint exactly where—and how—to optimize (click on image to expand)

AI-driven recommendations provide specific configuration parameters that need to be applied in order for the pipeline to run faster and meet the 20-minute SLA. After the AI recommendations are implemented, we can see that the pipeline now runs in under 11 minutes—a far cry from the almost 23 minutes before.

Applying AI recommendations to meet SLAs

AI recommendations improved pipeline performance from 23m to <11m, now meeting its SLA
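
For a sense of what configuration-level recommendations look like in practice, here is an illustrative before/after for a memory-bound Spark job. The property names are standard Spark settings, but the values are hypothetical, not the actual recommendations from this example:

  # Illustrative only: the kind of current-vs-recommended Spark settings an
  # AI engine might surface for a memory-bound job. Values are hypothetical.
  current = {
      "spark.executor.memory":        "4g",
      "spark.executor.instances":     "8",
      "spark.sql.shuffle.partitions": "200",
  }
  recommended = {
      "spark.executor.memory":        "8g",    # stop spilling to disk
      "spark.executor.instances":     "12",    # more parallelism for the slow jobs
      "spark.sql.shuffle.partitions": "400",   # smaller, more even shuffle tasks
  }

  for key in current:
      print(f"{key}: {current[key]} -> {recommended[key]}")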

Too often getting to the point of actually fixing the problem is another time-consuming effort of trial and error. Actionable AI recommendations won’t fix things automatically for you—taking the action still requires human intervention—but all the Sherlock Holmes work is done for you. The fundamental questions of what went wrong, why, and what to do about it are answered automatically.

Beyond optimizing performance, AI recommendations can also identify where cost could be improved. Say your pipeline is completing within your SLA commitments, but you’re spending much more on cloud resources than you need to. An AI engine can determine how many or what size containers you actually need to run each individual component of the job—vs. what you currently have configured. Most enterprises soon realize that they’re overspending by as much as 50%. 

AI recommendations on optimizing cost

Start getting AI recommendations for your data estate today
Create a free account

Governance helps avoid problems in the first place

These capabilities save a ton of time and money, but they’re still relatively reactive. What every enterprise wants is a more proactive approach to making their data applications and pipelines run better, faster, cheaper. Spend less time firefighting because there was no fire to put out. The system itself would understand that the memory configurations were insufficient and automatically take action so that pipeline performance would meet the SLA.

For a whole class of data application problems, this is already happening. AI-powered recommendations and insights lay the groundwork for putting in place some automated governance policies that take action on your behalf.

what "beyond observability" does

Governance is really all about converting the AI recommendations and insights into impact. In other words, have the system run automatic responses that implement fixes and remediations for you. No human intervention is needed. Instead of reviewing the AI recommendation and then pulling the trigger, have the system apply the recommendation automatically. 

automated guardrails for data pipelines

Automated alerts proactively identify SLA violations

Policy-based governance rules could be as benign as sending an alert to the user if the data table exceeds a certain size threshold or as aggressive as automatically requesting a configuration change for a container with more memory. 
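
A minimal sketch of how such policy rules might be expressed, pairing a condition with an action; the table name, thresholds, and actions are hypothetical:

  # Hypothetical governance rules: each rule pairs a condition with an action,
  # from a benign alert up to requesting a larger container.
  def alert(msg):
      print(f"ALERT: {msg}")

  def request_memory(job):
      print(f"requesting a larger container for {job}")

  RULES = [
      (lambda m: m["table_size_gb"] > 500,
       lambda m: alert(f"table {m['table']} exceeds 500 GB")),
      (lambda m: m["peak_mem_pct"] > 95,
       lambda m: request_memory(m["job"])),
  ]

  metrics = {"table": "clickstream_raw", "table_size_gb": 612,
             "job": "ML_Model_Update_and_Scoring", "peak_mem_pct": 97}
  for condition, action in RULES:
      if condition(metrics):
          action(metrics)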

This is true AIOps. You don’t have to wait until after the fact to address an issue or perpetually scan the horizon for emerging problems. The system applies AI/ML to all the telemetry data extracted and correlated from everywhere in your modern data stack to not only tell you what went wrong, why, and what to do about it, but it predicts and prevents problems altogether without any human having to touch a thing.

The post Beyond Observability for the Modern Data Stack appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/feed/ 0
Building vs. Buying Your Modern Data Stack: A Panel Discussion https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/ https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/#respond Thu, 21 Apr 2022 19:55:09 +0000 https://www.unraveldata.com/?p=9214 Abstract Infinity Loop Background

One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers […]

The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

]]>
Abstract Infinity Loop Background

One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers of the enterprise infrastructure stack. But in the last five years, and even in just the last year alone, it's hard to think of a part of IT that has seen more dramatic change than the modern data stack.

DataOps Unleashed build vs buy panel

These transformations shape how today's businesses engage and work with data. Moderated by Lightspeed Venture Partners' Nnamdi Iregbulem, the panel's three conversation partners (Andrei Lopatenko, VP of Engineering at Zillow; Gokul Prabagaren, Software Engineering Manager at Capital One; and Aaron Richter, Data Engineer at Squarespace) weighed in on the build versus buy question and walked us through their thoughts:

  • What motivates companies to build instead of buy?
  • How do particular technologies and/or goals affect their decision?

These issues and other considerations were discussed. A few of the highlights follow, but the entire session is available on demand here.

What are the key variables to consider when deciding whether to build or buy in the data stack? 

Gokul: I think the things which we probably consider most are what kind of customization a particular product offers or what we uniquely need. Then there are the cases in which we may need unique data schemas and formats to ingest the data. We must consider how much control we have of the product and also our processing and regulatory needs. We have to ask how we will be able to answer those kinds of questions if we are building in-house or choosing to adopt an outsourced product.

Gokul Prabagaren quote build vs buy

Aaron: Thinking from the organizational perspective, there are a few factors that come from just purchasing or choosing to invest in something. Money is always a factor. It’s going to depend on the organization and how much you’re willing to invest. 

Beyond that a key factor is the expertise of the organization or the team. If a company has only a handful of analysts doing the heavy-lifting data work, to go in and build an orchestration tool would take them away from their focus and their expertise of providing insights to the business. 

Andrei: Another important thing to consider is the quality of the solution. Not all the data products on the market have high quality from different points of view. So sometimes it makes sense to build something, to narrow the focus of the product. Compatibility with your operations environment is another crucial consideration when choosing build versus buy.

What’s the more compelling consideration: saving headcount or increasing productivity of the existing headcount? 

Aaron: In general, everybody’s oversubscribed, right? Everybody always has too much work to do. And we don’t have enough people to accomplish that work. From my perspective, the compelling part is, we’re going to make you more efficient, we’re going to give you fewer headaches, and you’ll have fewer things to manage. 

Gokul: I probably feel the same. It depends more on where we want to invest and if we’re ready to change where we’re investing: upfront costs or running costs. 

Andrei: And development costs: do we want to buy this, or invest in building? And again, consider the human equation. It’s not just the number of people in your headcount. Maybe you have a small number of engineers, but then you have to invest more of their time into data science or data engineering or analytics. Saving time is a significant factor when making these choices.

Andrei Lopatenko quote build vs buy

How does the decision matrix change when the cloud becomes part of the consideration set in terms of build versus buy? 

Gokul: I feel like it’s trending towards a place where it’s more managed. That may not be the same question as build or buy. But it skews more towards the manage option, because of that compatibility, where all these things are available within the same ecosystem. 

Aaron: I think about it in terms of a cloud data warehouse: some kind of processing tool, like dbt; and then some kind of orchestration tool, like Airflow or Prefect; and there’s probably one pillar on that side, where you would never think to build it yourself. And that’s the cloud data warehouse. So you’re now kind of always going to be paying for a cloud vendor, whether it’s Snowflake or BigQuery or something of that nature.

Aaron Richter quote build vs buy 

So you already have your foot in the door there, and you’re already buying, right? So then that opens the door now to buying more things, adding things on that integrate really easily. This approach helps the culture shift. If a culture is very build-oriented, this allows them to be more okay with buying things. 

Andrei: Theoretically you want to have your infrastructure independent on cloud, but it never happens, for multiple reasons. Firstly, cloud company tools make integration work much easier. Second, of course, once you have to think about multi-cloud, you must address privacy and security concerns. In principle, it’s possible to be independent, but you’ll often run into a lot of technical problems. There are multiple different factors when cloud becomes key in deciding what you will make and what tools to use.

See the entire Build vs. Buy roundtable discussion on demand
Watch now

The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/feed/ 0
DataOps Unleashed Survey Infographic https://www.unraveldata.com/resources/dataops-unleashed-survey-infographic/ https://www.unraveldata.com/resources/dataops-unleashed-survey-infographic/#respond Thu, 14 Apr 2022 13:08:08 +0000 https://www.unraveldata.com/?p=9152 DataOps Abstract Background

Thank you for your interest in the DataOps Unleashed 2022 Survey Infographic. You can download your copy here. Looking for the 2023 results? Find the infographic here.

The post DataOps Unleashed Survey Infographic appeared first on Unravel.

]]>
DataOps Abstract Background

Thank you for your interest in the DataOps Unleashed 2022 Survey Infographic.

You can download your copy here.

Looking for the 2023 results? Find the infographic here.

The post DataOps Unleashed Survey Infographic appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-unleashed-survey-infographic/feed/ 0
Key Findings of the 2022 DataOps Unleashed Survey https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/ https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/#respond Thu, 14 Apr 2022 13:05:40 +0000 https://www.unraveldata.com/?p=9148 DataOps Abstract Background

At the recent DataOps Unleashed 2022 virtual conference, we surveyed several hundred leading DataOps professionals across multiple industries in North America and Europe on a range of issues, including: Current adoption of DataOps approaches Top challenges […]

The post Key Findings of the 2022 DataOps Unleashed Survey appeared first on Unravel.

]]>
DataOps Abstract Background

At the recent DataOps Unleashed 2022 virtual conference, we surveyed several hundred leading DataOps professionals across multiple industries in North America and Europe on a range of issues, including:

  • Current adoption of DataOps approaches
  • Top challenges in operating/managing their data stack
  • How long they estimate cloud migration will take
  • Where they are prioritizing automation
  • On which tasks they spend their time

DataOps Unleashed Survey

Download our infographic here.


Highlights and takeaways

DataOps as a practice is hitting an inflection point

DataOps has gained momentum over the past 12 months. While a DataOps approach is still in the early innings for most companies—almost 4 out of 5 respondents said they are "having active discussions" or "progressing"—this is an 80% jump from last year.

Visibility into data pipelines remains the top challenge

For the second consecutive year, “lack of visibility across the environment” was the #1 challenge. Interestingly, “lack of proactive alerts” leapfrogged “expensive runaway jobs/pipelines” as the second biggest challenge, followed by “lack of experts” and “no visibility into cost/usage.”

Projected cloud migration schedules are longer

When we asked respondents to estimate how long their cloud migration will take, only 28% said just one year—with the vast majority saying 2 years or more. This is a 150% jump from last year, reflecting the growing realization of the complexity in moving from on-prem to cloud.

Automation continues to be a key driver

Over 75% of survey respondents said that being able to “automatically test and verify before moving jobs/pipelines to production” was their top automation priority. Right behind was automating the troubleshooting of pipeline issues (65%), followed by a three-way tie (about 33% each) among automatically troubleshooting platform issues, troubleshooting jobs, and automatically reducing usage costs.

Teams spend more time building pipelines than maintaining/deploying them

Like last year, we saw that data teams spend more time building pipelines (43% in 2022, up from 39% in 2021) than maintaining/troubleshooting them (30% in 2022 vs. 34% in 2021) or deploying pipelines (holding steady at 27%).

Interested in more DataOps trends? 

Check out the summary recap of the DataOps Unleashed 2022 keynote session, Three Venture Capitalists Weigh In on the State of DataOps 2022, or watch the full roundtable discussion on demand.

The post Key Findings of the 2022 DataOps Unleashed Survey appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/feed/ 0
New Survey Reveals Poor Visibility and Lack of Prescriptive Insights as Top Challenges for DataOps in 2022 https://www.unraveldata.com/resources/new-survey-reveals-poor-visibility-and-lack-of-prescriptive-insights-as-top-challenges-for-dataops-in-2022/ https://www.unraveldata.com/resources/new-survey-reveals-poor-visibility-and-lack-of-prescriptive-insights-as-top-challenges-for-dataops-in-2022/#respond Thu, 14 Apr 2022 13:04:15 +0000 https://www.unraveldata.com/?p=9163 DataOps Abstract Background

Benchmarked Survey from 2022 DataOps Unleashed Event Highlights the Key Priorities and Pressing Challenges Facing the Burgeoning DataOps Market Palo Alto, CA – April 14, 2022 – Unravel Data, the only data operations platform providing full-stack […]

The post New Survey Reveals Poor Visibility and Lack of Prescriptive Insights as Top Challenges for DataOps in 2022 appeared first on Unravel.

]]>
DataOps Abstract Background

Benchmarked Survey from 2022 DataOps Unleashed Event Highlights the Key Priorities and Pressing Challenges Facing the Burgeoning DataOps Market

Palo Alto, CA – April 14, 2022Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today released key findings from a survey administered to several hundred attendees at the most recent DataOps Unleashed event in order to gauge the priorities, challenges, and progress of leading data teams as they seek to modernize their big data management and analytics capabilities so they can fully realize their real-time data analytics ambitions. To view and download an infographic of the key findings from this survey, click here.

“Data is the lifeblood of the modern enterprise and those organizations who have dedicated the resources and budget to modernizing their data stacks are the ones who will be best positioned to drive innovations in the coming decade,” said Kunal Agarwal, co-founder and CEO of Unravel Data. “The results of this latest survey show just how complex the modern data stack has become and illustrates the many unanticipated challenges that come with efficiently managing and optimizing data pipelines across multiple public cloud providers and platforms. It also serves to validate that there is an obvious demand for a purpose-built solution that can help these teams gain the critical visibility they need to drive the most value from their data operations.”

Some of the key findings collected from the most recent survey, which benchmarks responses from the year prior, include:

  • DataOps as a practice is hitting an inflection point: There was an almost 80% increase from the year prior of respondents who said they are in the active stage of adopting a formal DataOps approach to manage and optimize their data pipelines. This year, more than 41% of attendees reported they are actively employing DataOps methodologies, compared to just less than a quarter (24%) in 2021.
  • Visibility into data pipelines remains the top challenge: For the second year in a row, when participants were asked what they viewed as the top challenge with operating their data stack, respondents cited the lack of visibility across their environment as their most significant obstacle. Whereas in the previous year respondents reported that "controlling runaway costs" was the second biggest challenge, this year the "lack of proactive alerts" was noted as the second most challenging aspect.
  • Cloud migrations are more time-consuming than previously thought: Sixty percent of respondents from this year's event estimated that their cloud migration project would take between 12 and 24 months, representing a 150% increase over the prior year's projection. The challenge of forecasting the duration of these cloud migration initiatives reveals the vast amount of complexity and uncertainty that data teams face when attempting to map out these critical projects.
  • Automation continues to be a key driver: When asked about the role of automation in managing their DataOps pipelines, three in four DataOps professionals in both years reported that the ability to “automatically test and verify before moving jobs/pipelines to production” was the most important automation priority when compared to other aspects such as automating troubleshooting of platform or pipeline issues.
  • Data teams spend more time building than deploying/managing pipelines: For both years, data professionals reported that they spent the majority of their day building their data pipelines (39% in 2021 and 43% in 2022). In 2022, respondents reported spending slightly less time maintaining or troubleshooting their pipelines (30%) than the year prior (34%) while the time spent deploying data pipelines remained the same at 27% for both years.

Now in its second year, DataOps Unleashed is a one-day virtual conference, underwritten by Unravel Data, that seeks to build a vibrant peer-based community for DataOps professionals to share best practices. Data professionals can productively collaborate with one another and deliver on the promise of what it means to be a data-led company. This past year's event attracted more than 2,300 data professionals and featured presentations from data leaders at some of the most data-forward enterprise organizations in the world, including AWS, DBS, Google, Johnson & Johnson, and many others. Recordings of all 20 sessions are available to watch for free and on-demand on the DataOps Unleashed site at: www.dataopsunleashed.com.

About Unravel Data
Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading data observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, 84.51˚ (a Kroger company), and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit https://www.unraveldata.com.

Media Contact
Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post New Survey Reveals Poor Visibility and Lack of Prescriptive Insights as Top Challenges for DataOps in 2022 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/new-survey-reveals-poor-visibility-and-lack-of-prescriptive-insights-as-top-challenges-for-dataops-in-2022/feed/ 0
DBS Bank Goes “Beyond Observability” for DataOps https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/ https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/#respond Thu, 03 Mar 2022 18:07:23 +0000 https://www.unraveldata.com/?p=8605 Abstract Chart Background

At the DataOps Unleashed 2022 conference, Luis Carlos Cruz Huertas, Head of Technology Infrastructure & Automation at DBS Bank, discussed how the bank has developed a framework whereby they translate millions of telemetry data points into […]

The post DBS Bank Goes “Beyond Observability” for DataOps appeared first on Unravel.

]]>
Abstract Chart Background

At the DataOps Unleashed 2022 conference, Luis Carlos Cruz Huertas, Head of Technology Infrastructure & Automation at DBS Bank, discussed how the bank has developed a framework whereby they translate millions of telemetry data points into actionable recommendations and even self-healing capabilities.

Shivnath Babu Observability Quote

The session, Beyond observability and how to get there, opens with Dr. Shivnath Babu, Unravel Co-Founder and CTO, setting the context for observability challenges in the modern data stack—how simple applications grow to become so complex—and walking through the successive stages of what’s “beyond observability.” How can we go from extracting and correlating events, logs, metrics, and traces from applications, datasets, infrastructure, and tenants to get to self-healing systems?

evolution of "beyond observability"

The evolution of “beyond observability” goes from correlation to causation.

Then Luis shows how DBS is doing just that—how they leverage what he calls “cognitive capabilities” to deliver faster root cause analysis insights and automated corrective actions.

DBS Bank ATMs

Why DBS went beyond observability

Luis Carlos Cruz Huertas Observability Quote

“When you have an always-on environment where banking applications are fully needed, you come to a point where observability [by itself] doesn’t cut it. You come to a point where your NCI [negative customer impact] truly becomes a key valuable indicator on how your systems are relating. We’re no longer in the game of measuring systems to be able to monitor, we’re measuring systems with the intent to provide a better customer experience,” says Luis.

Given the complexity of the bank’s IT ecosystem (not just its data stack), DBS made a strategic decision to not focus on tools developed by third-party vendors but rather build an “overarching brain” that could collect and understand the metrics from the diversity of tools in place without forcing the organization to rip and replace for something new. The objective was to speed root cause analysis across the board, provide less “noise,” reduce manual effort (toil), and get to proactive, predictive alerting on emerging issues.

See how DBS built its “beyond observability” self-healing platform

Watch on-demand session

How DBS built its cognitive capability platform

“You have applications that are collecting different telemetry through different systems or different log collectors—node exporters, metric beats, file beats, you can have an ELK stack. But ultimately what you want to do is create an open platform that you can ingest all this data,” Luis says. And for that you need three elements, he explains:

  • a historical repository, where you can collect and cross-check data
  • a real-time series database, because time becomes the de facto metadata to identify a critical incident and its correlations
  • a log aggregator

Luis notes that one of the things he gets asked constantly is, How do you define the ingestion? 

“We do it all based on metadata. We define the metadata before it actually gets ingested. And then we park that into the [system] data lake. On top of that we provide an [ML-driven] analytical engine. Ultimately, what our system does is basically provide a recommendation to our site reliability engineers. And it gives them a list of elements, saying I’ve identified this set of errors or incidents that have happened over the last month and are repetitive and continuous. And then our site reliability engineers need to marry that with our automation engine. You build the right scripting—the right automation—to properly fix and remediate the problem. So that every time an incident is identified, it maps Incident A to Automation B to get Outcome C.”

DBS insights-as-a-service

DBS turns diverse telemetry data into auto-corrective actions.
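
The "Incident A to Automation B to get Outcome C" mapping Luis describes can be pictured as a lookup from incident signature to remediation script. The sketch below is hypothetical, not DBS's actual platform; the signatures and remediation functions are made up:

  # Hypothetical sketch of mapping recurring incident signatures to automated
  # remediations, in the spirit of "Incident A -> Automation B -> Outcome C".
  def restart_service(incident):
      return f"restarted {incident['service']}"

  def expand_disk(incident):
      return f"expanded volume on {incident['host']}"

  RUNBOOK = {
      "service_unresponsive": restart_service,
      "disk_nearly_full":     expand_disk,
  }

  def remediate(incident):
      automation = RUNBOOK.get(incident["signature"])
      return automation(incident) if automation else "escalate to the SRE team"

  print(remediate({"signature": "disk_nearly_full", "host": "hdp-node-17"}))
  # expanded volume on hdp-node-17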

Luis adds that with Unravel, the telemetry data has already been correlated. He says, “Unravel is huge for us. I don’t need to worry about marrying the correlation. I can just consume it right away.”

Luis concludes: “So in the end, we’re not changing tools, we’re collecting the metrics from all of the tools. We are providing a higher overarching mapping of the data being collected across all the tools in our environment, mapping them through metadata, and leveraging that to provide the right ML.”

The bottom line? DBS Bank is able to go “beyond observability” and leverage machine learning to get closer to the ultimate goal of a self-healing system that doesn’t require any human intervention at all.

Luis Carlos Cruz Huertas 3 Elements Quote

Check out the full presentation from DataOps Unleashed on demand here.

See the DBS schematics for its Cognitive Technology Services Platform, tools mapping to the architectural components, solution data flows, overview of data sources, and more.

The post DBS Bank Goes “Beyond Observability” for DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/feed/ 0
Three Venture Capitalists Weigh In on the State of DataOps 2022 https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/ https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/#respond Thu, 03 Mar 2022 17:55:45 +0000 https://www.unraveldata.com/?p=8616 Abstract Infinity Loop Background

The keynote presentation at DataOps Unleashed 2022 featured a roundtable panel discussion on the State of DataOps and the Modern Data Stack. Moderated by Unravel Data Co-Founder and CEO Kunal Agarwal, this session features insights from […]

The post Three Venture Capitalists Weigh In on the State of DataOps 2022 appeared first on Unravel.

]]>
Abstract Infinity Loop Background

The keynote presentation at DataOps Unleashed 2022 featured a roundtable panel discussion on the State of DataOps and the Modern Data Stack. Moderated by Unravel Data Co-Founder and CEO Kunal Agarwal, this session features insights from three investors who have a unique vantage point on what’s changing and emerging in the modern data world, the effects of these changes, and the opportunities being created.

The panel’s three venture capitalists—Matt Turck, Managing Director at FirstMark, Venky Ganesan, Partner at Menlo Ventures, and Glenn Solomon, Managing Partner at GGV Capital—all invest in companies that are both users and creators of the modern data stack. They are either crunching massive amounts of data and converting it into insights or are helping other companies do that at scale.

DataOps Unleashed 2022 keynote speakers

Today every company is a data company. Data pipelines, AI, and machine learning models are creating tremendous strategic value, and companies are depending on them more than ever before. Now, while the advantages of becoming data-driven are clear, what's often hard to grasp is what's changing in this landscape.

 The discussion covered a broad range of topics, loosely revolving around a couple of key areas. Here are just a handful of the interesting observations and insights they shared.

What are the top data industry macro-trends?

Glenn Solomon: I don’t think companies are nearly as advanced as you’re likely to believe. Most companies are in the early innings of trying to figure out how to manage data. We see that even with born-digital companies. And the complexity is compounded by the fact that there is a lot of noise in the startup universe. Figuring out the decisions you make as a company is difficult and challenging. So I think this best-of-breed vs. platform, balancing act that companies had to go through in software is also going to happen in the data stack.

Matt Turck: The big driver is the rise of modern cloud data warehouses, and the lake houses as well. So for me, that’s been the big unlock in the space. We finally have that base level in the data hierarchy of needs, where we can take all this data, put it somewhere, and actually process it at scale. Now this whole thing is becoming very real, no longer experimental. Now the whole thing needs to work.

Venky Ganesan: The digital transformation that was happening just got super accelerated by the pandemic. All these analog business processes were digitized. And now that they are digitized, they can be tracked, stored, analyzed, evaluated and acted upon. I think the data stack has got to be one of the most important stacks in a company because your success long term is going to be based on how good is your data stack? How good is your DataOps? And then how do you build the analytics on top of it?

What trends are you seeing within DataOps specifically?

Venky Ganesan DataOps Quote

Venky: I would say the biggest trend I’m seeing is pushing these data workloads to the cloud. And I think it’s a really interesting game-changer. One of the things we are seeing now as we move to the cloud is suddenly you can separate out the storage and compute, have the infrastructure handle it, and then have the data warehouses such as Snowflake, Databricks. Now there are new sets of problems that come into play around DataOps when you move the data to Snowflake or Databricks or any other cloud providers, which is that you need to still understand the workloads, still need to optimize them. I think there’s going to be a whole DataOps category that helps you both migrate workloads to the cloud and also monitor them, because you can’t have the fox guarding the henhouse, you need some third party there to help you make sure you’re optimizing the workload, because the cloud provider is not interested in optimizing the workload for efficiency.

Glenn Solomon DataOps Quote

Glenn: A driver that I’m seeing accelerate, and gain momentum, is quite simply just the need for speed. In organizations there’s a tremendous amount of momentum around real-time streaming, real time analytics. Companies are growing the number of business processes for which they want real-time data to make decisions. And that is having a big impact on this whole world.

Another trend I’d point out is the rise of open source and the impact open source is having on many, many areas within the DataOps world. It looks like Kafka has had a massive impact on streaming as a result. That shows me that open source can really standardize markets. It can standardize technologies and standardize workflows. We’ll have to see how this all plays out, but I think open source—when it works well—is a very, very powerful trend.

Watch full panel discussion: The State of DataOps and the Modern Data Stack

Watch session on demand

Moving data to the cloud—why or why not?

Venky: I think if you are a company that has data on premises, you just have numerous issues. On prem is very heterogeneous: heterogeneous in hardware, heterogeneous in environment. And in a world of labor shortages, what happens if you don’t have the people? If you have the kind of turnover you’re seeing, can you get the people to manage it? So if you don’t move to the cloud, you’re going to be trapped on an island with fewer and fewer resources that cost more and more.

But I actually think the most important part of moving into the cloud is that it gives you an opportunity to standardize data, think about the data you want. And then once you move it to the cloud, you can unlock new generations of AI technologies that come into the cloud and allow you to get more insights from data. And so to me, eventually, data is worthless if it doesn’t translate to insights. Your best way of getting that insight is to figure out a scalable way of moving to the cloud, cheaper, and also unlock a lot of the new AI techniques to get insight from it.

Glenn: I think we’re on a continuum where there are still lots of companies who are reticent to move all their data to the cloud. I think the view is, hey, we have regulatory obligations and there’s risk if we don’t manage things ourselves. For data that is viewed as too sensitive, too risky, too valuable to move to the cloud—it’s just a matter of time. The value that can be driven from having data up in the cloud is just too great. But how do you safely move data into the cloud? And then once it’s there, how do you manage the applications that consume that data in a way that is rational? And if you want to use [cloud services] rationally, and use them the right way, and in a cost-efficient way, then you really do need other tools to make sure that things don’t run away from you. 

Matt Turck DataOps Quote

Matt: For many years, there was an almost cognitive dissonance, where everything I read, and all the conversations I was having with execs and people in the industry, was all cloud, cloud, cloud. But our customers all wanted to be on prem—actually, zero people wanted to be in the cloud. It feels like in the last year and a half that cognitive dissonance has disappeared, and suddenly I started seeing all these customers, almost all at once—and the pandemic certainly accelerated all this—saying, okay, now is the time I will move those workloads and the data to the cloud. So it feels like there's been an inflection point of some sort. And it is very anecdotal, I realize, but it is very, very clear.

The only nuance to all of this is I think there’s a little bit of a growing realization and concern around the cost of being in the cloud. When you start in the cloud, you actually save a lot of money. Once you’ve configured your organization to actually run in the cloud, you save a lot of money for a while. But then there’s a moment when it starts actually being pretty expensive. And I think that’s a problem that’s starting to come to the fore. 

What’s the impact of the talent shortage?

Glenn: It’s very difficult to amass the kind of talent you need to really effectively both manage the data and then ultimately evaluate and analyze it for good purpose in your business. If you split the world into managing data—DataOps and the data engineer and all the challenges and complexities there, where there’s definitely a labor shortage—and then analyzing the data, getting value from it (data scientists up through business analysts), where there’s also a shortage—we have a human problem in both. One of my colleagues used the term “unbundling the data engineer.” If you look at all the tasks a data engineer would need to do to get a well-functioning data stack in place, there just aren’t enough of those people. Companies are picking off and automating various aspects of that workflow.

But on the other side, on analyzing the data, I think there are a lot of interesting things to be done there. How do you make data scientists, because there aren’t enough of them, more efficient? What tools and technologies do they need? I think we’ll see more solutions on the analysis side because we have that same human capital problem there too.

Matt: It’s an obvious problem that is only going to get worse, because the rate that we as a society produce technical people—data engineers, ML engineers, data scientists—is nowhere near the pace we need to meet the demand. Again, every company is becoming not just a software company but a data company. That has two consequences. One, we need products and platforms that abstract away the complexity. That’s empowering people who are somewhat technical but not engineers to do more and more of what’s needed to make the whole machinery work. And the second, related consequence is the rise of automation—making a lot of those technologies just work in a way where no human is required. There’s plenty of opportunity there, especially AI-driven automation of system optimization, tuning, anomaly detection, auto-repair, and the like.

Venky: Whether we’re talking about DataOps or security, these are things that will get automated at scale. It won’t replace humans. They will be complemented by technology that does most of the mundane stuff, and humans will deal with the exceptions. The mundane stuff gets done automatically, the exceptions get kicked to humans. That’s the only way forward. There’s no way to build the human capital required.

Watch the entire panel discussion on demand here. Hear more war stories, anecdotes, and expert insights, including predictions for the coming year.

The post Three Venture Capitalists Weigh In on the State of DataOps 2022 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/feed/ 0
DataOps Unleashed Returns, Even Bigger and Better https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/ https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/#respond Thu, 13 Jan 2022 17:00:18 +0000 https://www.unraveldata.com/?p=8253 DataOps Abstract Background

DataOps Unleashed is back! Founded and hosted by Unravel Data, DataOps Unleashed is a free, full-day virtual event that took place on Wednesday, February 2, bringing together DataOps, CloudOps, AIOps, MLOps, and other professionals to […]

The post DataOps Unleashed Returns, Even Bigger and Better appeared first on Unravel.

]]>
DataOps Abstract Background

DataOps Unleashed is back! Founded and hosted by Unravel Data, DataOps Unleashed is a free, full-day virtual event that took place on Wednesday, February 2, bringing together DataOps, CloudOps, AIOps, MLOps, and other professionals to connect and collaborate on trends and best practices for running, managing, and monitoring data pipelines and data-intensive workloads. Once again, thousands of your peers signed up for what turned out to be the most high-impact big data summit of the year.

DataOps Unleashed 2022 registration

All DataOps Unleashed sessions now available on demand

Check out sessions here

A combination of industry thought leadership and practical guidance, the sessions included talks by DataOps professionals at such leading organizations as Slack, Zillow, Cisco, and IBM (and dozens of others), detailing how they’re establishing data predictability, increasing reliability, and reducing costs.

Here’s just a taste of what DataOps Unleashed 2022 covered.

  • In the keynote address, the state of DataOps and the modern data stack, Unravel CEO Kunal Agarwal moderates a roundtable panel discussion with three prominent venture capitalists (Matt Turck from FirstMark, Glenn Solomon from GGV Capital, Venky Ganesan from Menlo Ventures) to cut through the hype and talk about what they’re seeing in reality within the DataOps world. They talk about macro-trends, how DataOps is emerging, the pros and cons of moving data workloads to the cloud, how the industry is dealing with the talent shortage, and more.
  • Ryan Kinney, Senior Data Engineer at Slack, shares how his team has streamlined its DataOps stack with Slack to not only collaborate, but to observe their data pipelines and orchestration, enforce data quality and governance, manage their CloudOps, and unlock the entire data science and analytics platform for their customers and stakeholders.
  • See how Torsten Steinbach, Cloud Data Architect at IBM, has incorporated different open tech into a state-of-the-art cloud-native data lakehouse platform. Torsten shares practical tips on table formats for consistency, metastores and catalogs for usability, encryption for data protection, data skipping indexes for performance, and more.
  • Michael DePrizio, Principal Architect at Akamai, and Jordan Tigani, Chief Product Officer at SingleStore, discuss How Akamai handled 13x data growth and moved from batch to near-real-time visibility and analytics.
  • Shivnath Babu, Co-Founder, Chief Technology Officer, and Head of Customer Success at Unravel, discusses what’s beyond observability and how to get there. Luis Carlos Cruz Huertas, Head of Technology Infrastructure and Automation at DBS Bank, then walks through how DBS has gone “beyond observability” to build a framework for automated root cause analysis and auto-corrections.

See the entire lineup and agenda here.

Check out the recorded DataOps Unleashed 2022 sessions today!

The post DataOps Unleashed Returns, Even Bigger and Better appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/feed/ 0
DataOps Unleashed 2022 Keynote https://www.unraveldata.com/resources/dataops-unleashed-keynote/ https://www.unraveldata.com/resources/dataops-unleashed-keynote/#respond Tue, 11 Jan 2022 22:28:27 +0000 https://www.unraveldata.com/?p=8245 DataOps Abstract Background

Unravel Data CEO to Keynote Second “DataOps Unleashed” Virtual Event on February 2, 2022 Peer-to-Peer Event Dedicated to Helping Data Professionals Untangle the Complexity of Modern Data Clouds and Simplify their Data Operations WHAT: Unravel Data […]

The post DataOps Unleashed 2022 Keynote appeared first on Unravel.

]]>
DataOps Abstract Background

Unravel Data CEO to Keynote Second “DataOps Unleashed” Virtual Event on February 2, 2022

Peer-to-Peer Event Dedicated to Helping Data Professionals Untangle the Complexity of Modern Data Clouds and Simplify their Data Operations

WHAT: Unravel Data announced that its CEO and Co-founder, Kunal Agarwal, will deliver the keynote address at “DataOps Unleashed,” a free, full-day virtual summit taking place on February 2nd, 2022, that will showcase some of the most prominent voices from across the burgeoning DataOps community.

The most successful enterprise organizations of tomorrow will be the ones who can effectively harness data from a broad array of sources and operating environments and rapidly transform it into actionable intelligence that supports data-driven decision making. DataOps Unleashed is a growing, cross-industry community where data professionals can productively collaborate with one another, share industry best practices, and deliver on the promise of what it means to be a data-led company.

The event will feature compelling presentations, conversation, and peer-sharing between technical practitioners, data scientists, and executive strategists from some of the most data-forward enterprise organizations in the world, including Slack, Zillow, Cisco, IBM, and many other recognized global brands. Over the course of the day, attendees will learn how a modernized approach to DataOps can transform their operations, improve data predictability, increase reliability, and create economic efficiencies with their data pipelines. Leading vendors from the DataOps market including Census, Metaplane, Airbyte, and Manta will be joining Unravel Data in supporting this community event.

Speakers at the DataOps Unleashed event to include:

More information about the event can be found at: https://dataopsunleashed.com/

WHEN: Wednesday, February 2nd beginning at 9AM PST

COST: Free

WHERE: Register at: https://dataopsunleashed.com/

WHO: Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Adobe and Deutsche Bank. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

The post DataOps Unleashed 2022 Keynote appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-unleashed-keynote/feed/ 0
Unravel Google Cloud GCP https://www.unraveldata.com/resources/unravel-google-cloud-gcp/ https://www.unraveldata.com/resources/unravel-google-cloud-gcp/#respond Fri, 17 Dec 2021 00:55:30 +0000 https://www.unraveldata.com/?p=8093 abstract image with numbers

Thank you for your interest in the Unravel Google Cloud Solution Brief. You can download it here.

The post Unravel Google Cloud GCP appeared first on Unravel.

]]>
abstract image with numbers

Thank you for your interest in the Unravel Google Cloud Solution Brief.

You can download it here.

The post Unravel Google Cloud GCP appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-google-cloud-gcp/feed/ 0
Spark Troubleshooting Guides https://www.unraveldata.com/resources/spark-troubleshooting-guides/ https://www.unraveldata.com/resources/spark-troubleshooting-guides/#respond Tue, 02 Nov 2021 20:17:13 +0000 https://www.unraveldata.com/?p=7798 Sparkler Blurred Background

Thanks for your interest in the Spark Troubleshooting Guides. You can download it here. This 3 part series is your one-stop guide to all things Spark troubleshooting. In Part 1, we describe the ten biggest challenges […]

The post Spark Troubleshooting Guides appeared first on Unravel.

]]>
Sparkler Blurred Background

Thanks for your interest in the Spark Troubleshooting Guides.

You can download it here.

This three-part series is your one-stop guide to all things Spark troubleshooting. In Part 1, we describe the ten biggest challenges for troubleshooting Spark jobs across levels. In Part 2, we describe the major categories of tools and types of solutions for solving those challenges.

Lastly, Part 3 of the guide builds on the other two to show you how to address the problems we described, and more, with a single tool that combines the best of what single-purpose tools do – our DataOps platform, Unravel Data.

The post Spark Troubleshooting Guides appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/spark-troubleshooting-guides/feed/ 0
Data Science and Analytics in the Cloud https://www.unraveldata.com/resources/data-science-and-analytics-in-the-cloud/ https://www.unraveldata.com/resources/data-science-and-analytics-in-the-cloud/#respond Tue, 02 Nov 2021 01:58:26 +0000 https://www.unraveldata.com/?p=7791 Cloud Graph Background

Thank you for your interest in the 451 Research Report, Data Science and analytics in the cloud set to grow three times faster than on-premises. You can download it here. 451 Research: Data Science and analytics […]

The post Data Science and Analytics in the Cloud appeared first on Unravel.

]]>
Cloud Graph Background

Thank you for your interest in the 451 Research Report, Data Science and analytics in the cloud set to grow three times faster than on-premises.

You can download it here.

451 Research: Data Science and analytics in the cloud set to grow three times faster than on-premises
Published: September 28, 2021

Introduction
Data science and analytics, as well as the data abstraction and acceleration offerings that underpin them, represented a $29bn market in 2020, according to 451 Research’s Data, AI & Analytics Market Monitor: Data Science & Analytics. Moreover, this market is growing thanks to the critical role of data, AI and analytics in speeding up enterprise data-driven decision-making for faster time to insight. Cloud services, in particular, are exhibiting strong growth – a trend that has been underway for some time and has been accelerated by the COVID-19 pandemic.

Get the 451 Take. Download Report.

The post Data Science and Analytics in the Cloud appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-science-and-analytics-in-the-cloud/feed/ 0
Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level https://www.unraveldata.com/resources/troubleshooting-apache-spark/ https://www.unraveldata.com/resources/troubleshooting-apache-spark/#respond Wed, 13 Oct 2021 18:05:18 +0000 https://www.unraveldata.com/?p=7666 Sparkler Abstract Background

Apache Spark is the leading technology for big data processing, on-premises and in the cloud. Spark powers advanced analytics, AI, machine learning, and more. Spark provides a unified infrastructure for all kinds of professionals to work […]

The post Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level appeared first on Unravel.

]]>
Sparkler Abstract Background

Apache Spark is the leading technology for big data processing, on-premises and in the cloud. Spark powers advanced analytics, AI, machine learning, and more. Spark provides a unified infrastructure for all kinds of professionals to work together to achieve outstanding results.

Technologies such as Cloudera’s offerings, Amazon EMR, and Databricks are largely used to run Spark jobs. However, as Spark’s importance grows, so does the importance of Spark reliability – and troubleshooting Spark problems is hard. Information you need for troubleshooting is scattered across multiple, voluminous log files. The right log files can be hard to find, and even harder to understand. There are other tools, each providing part of the picture, leaving it to you to try to assemble the jigsaw puzzle yourself.

Would your organization benefit from rapid troubleshooting for your Spark workloads? If you’re running significant workloads on Spark, then you may be looking for ways to find and fix problems faster and better – and to find new approaches that steadily reduce your problems over time.

This blog post is adapted from the recent webinar, Troubleshooting Apache Spark, part of the Unravel Data “Optimize” Webinar Series. In the webinar, Unravel Data’s Global Director of Solutions Engineering, Chris Santiago, runs through common challenges faced when troubleshooting Spark and shows how Unravel Data can help.

Spark: The Good, the Bad & the Ugly

Chris has been at Unravel for almost four years. Throughout that time, when it comes to Spark, he has seen it all – the good, the bad, and the ugly. On one hand, Spark as a community has been growing exponentially. Millions of users are adopting Spark, with no end in sight. Cloud platforms such as Amazon EMR and Databricks are largely used to run Spark jobs.

Spark is here to stay, and use cases are rising. There are countless product innovations powered by Spark, such as Netflix recommendations, targeted ads on Facebook and Instagram, or the “Trending” feature on Twitter. On top of its great power, the barrier to entry for Spark is now lower than ever before. But unfortunately, with the good comes the bad, and the number one common issue is troubleshooting.

Troubleshooting Spark is complex for a multitude of different reasons. First off, there are multiple points of failure. A typical Spark data pipeline could be using orchestration tools, such as Airflow or Oozie, as well as built-in tools, such as Spark UI. You also may be using cluster management technologies, such as Cluster Manager or Ambari.

A failure may not always start on Spark; it could rather be a failure within a network layer on the cloud, for example.

Because Spark touches so many tools, not only are there multiple points of failure, but you must also correlate information from various sources across these platforms. This requires expertise. You need experience in order to understand not only the basics of Spark, but all the other platforms that support Spark as well.

Lastly, when running Spark workloads, the priority is often to meet SLAs at all costs. To meet SLAs you may, for example, double your resources, but there will always be a downstream effect. Determining what’s an appropriate action to take in order to make SLAs can be tricky.

Want to experience Unravel for Spark?

Create a free account

The Three Levels of Spark Troubleshooting

There are multiple levels when it comes to troubleshooting Spark. First there is the Job Level, which deals with the inner workings of Spark itself, from executors to drivers to memory allocation to logs. The job level is about determining best practices for using the tools that we have today to make sure that Spark jobs are performing properly. Next is the Pipeline Level. Troubleshooting at the pipeline level is about managing multiple Spark runs and stages to make sure you’re getting in front of issues and using different tools to your advantage. Lastly, there is the Cluster Level, which deals with infrastructure. Troubleshooting at the cluster level is about understanding the platform in order to get an end-to-end view of troubleshooting Spark jobs.

Troubleshooting: Job Level

A Spark job refers to actions such as doing work in a workbook or analyzing a Spark SQL query. One tool used on the Spark job level is Spark UI. Spark UI can be described as an interface for understanding Spark at the job level.

Spark UI is useful in giving granular details about the breakdown of tasks, breakdown of stages, the amount of time it takes workers to complete tasks, etc. Spark UI is a powerful dataset that you can use to understand every detail about what happened to a particular Spark job.

Challenges people often face are manual correlation; understanding the overall architecture of Spark; and, more importantly, things such as understanding what logs you need to look into and how one job correlates with other jobs. While Spark UI is the default starting point to determine what is going on with Spark jobs, there is still a lot of interpretation that needs to be done, which takes experience.

Further expanding on Spark job-level challenges, one thing people often face difficulty with is diving into the logs. If you truly want to understand what caused a job to fail, you must get down to the log details. However, looking at logs is the most verbose way of troubleshooting, because every component in Spark produces logs. Therefore, you have to look at a lot of errors and timestamps across multiple servers. Looking at logs gives you the most information to help understand why a job has failed, but sifting through all that information is time-consuming.
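
To make that manual scanning concrete, here is a minimal Python sketch of the kind of search a troubleshooter ends up doing by hand, assuming the driver and executor logs have already been collected locally (for example via YARN log aggregation); the directory name and error patterns are illustrative assumptions, not part of the webinar:

```python
import re
from pathlib import Path

# Illustrative assumption: driver and executor logs were already pulled down
# locally (e.g., via YARN log aggregation) into ./collected_logs/.
LOG_DIR = Path("collected_logs")
PATTERN = re.compile(r"ERROR|Exception|Container killed|OutOfMemoryError")

for log_file in sorted(LOG_DIR.glob("**/*")):
    if not log_file.is_file():
        continue
    for line_no, line in enumerate(log_file.read_text(errors="ignore").splitlines(), start=1):
        if PATTERN.search(line):
            # Print file, line number, and the matching line so errors from
            # different executors can be compared side by side.
            print(f"{log_file}:{line_no}: {line.strip()}")
```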

It also may be challenging to determine where to start on the job level. Spark was born out of the Hadoop ecosystem, so it has a lot of big data concepts. If you’re not familiar with big data concepts, determining a starting point may be difficult. Understanding the concepts behind Spark takes time, effort, and experience.

Lastly, when it comes to Spark jobs, there are often multiple job runs that make up a data pipeline. Figuring out how one Spark job affects another is tricky and must be done manually. In his experience working with the best engineers at large organizations, Chris finds that even they sometimes decide not to finish troubleshooting, and instead just keep on running and re-running a job until it’s completed or meets the SLA. While troubleshooting is ideal, it is extremely time-consuming. Therefore, having a better tool for troubleshooting on the job level would be helpful.

Troubleshooting: Pipeline Level

In Chris’ experience, most organizations don’t have just one Spark job that does everything; rather, multiple stages and jobs are needed to carry out Spark workloads. To manage all these steps and jobs you need an orchestration tool. One popular orchestration tool is Airflow, which allows you to sequence out specific jobs.

Orchestration tools like Airflow are useful in managing pipelines. But while these tools are helpful for creating complex pipelines and mapping where points of failure are, they are lacking when it comes to providing detailed information about why a specific step may have failed. Orchestration tools are more focused on the higher, orchestration level, rather than the Spark job level. Orchestration tools tell you where and when something has failed, so they are useful as a starting point to troubleshoot data pipeline jobs on Spark.

Those who are running Spark on Hadoop, however, often use Oozie. Like Airflow, Oozie gives you a high-level view, alerting you when a job has failed but not providing the type of information needed to answer questions such as “Where is the bottleneck?” or “Why did the job break down?” To answer these questions, it’s up to the user to manually correlate the information that orchestration tools provide with information from job-level tools, which again requires expertise. For example, you may have to determine which Spark run that you see in Spark UI correlates to a certain step in Airflow. This can be very time-consuming and prone to errors.
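
To show how the two levels divide the work, here is a hedged, minimal example of an Airflow DAG that chains two Spark jobs with the SparkSubmitOperator from Airflow’s Apache Spark provider. The task names, application paths, and connection ID are placeholders; the point is that Airflow records which step failed and when, while the “why” still lives in Spark UI and the Spark logs:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical two-step pipeline; application paths and the Spark connection
# ID below are placeholders, not taken from the webinar.
with DAG(
    dag_id="example_spark_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = SparkSubmitOperator(
        task_id="ingest_raw_events",
        application="/jobs/ingest_raw_events.py",
        conn_id="spark_default",
    )
    aggregate = SparkSubmitOperator(
        task_id="aggregate_daily_metrics",
        application="/jobs/aggregate_daily_metrics.py",
        conn_id="spark_default",
    )
    # Airflow will mark the failing task, but diagnosing *why* it failed still
    # means going back to the Spark-level tools for that run.
    ingest >> aggregate
```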

Troubleshooting: Cluster Level

The cluster level for Spark refers to things such as infrastructure, VMs, allocated resources, and Hadoop. Hadoop’s ResourceManager is a great tool for managing applications as they come in. ResourceManager is also useful for determining what the resource usage is, and where a Spark job is in the queue.

However, one shortcoming of ResourceManager is that you don’t get historical data. You cannot view the past state of ResourceManager from twelve or twenty-four hours ago, for example. Every time you open ResourceManager you have a view of how jobs are consuming resources at that specific time, as shown below.

ResourceManager Example
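
One common workaround for the lack of history is to poll the YARN ResourceManager REST API yourself and keep periodic snapshots. The sketch below assumes a standard YARN setup; the ResourceManager host name and the output file are placeholders:

```python
import json
import time
from urllib.request import urlopen

# Placeholder host; point this at your own ResourceManager web address.
RM_APPS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?states=RUNNING"

def snapshot():
    """Capture one point-in-time view of running applications, since the
    ResourceManager UI itself keeps no history of past resource usage."""
    with urlopen(RM_APPS_URL) as resp:
        payload = json.load(resp)
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        {
            "time": time.time(),
            "id": app.get("id"),
            "name": app.get("name"),
            "queue": app.get("queue"),
            "allocatedMB": app.get("allocatedMB"),
        }
        for app in apps
    ]

if __name__ == "__main__":
    # Append each snapshot to a file so the state can be reviewed later.
    with open("rm_snapshots.jsonl", "a") as out:
        for row in snapshot():
            out.write(json.dumps(row) + "\n")
```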

Another challenge when troubleshooting Spark on the cluster level is that while tools such as Cluster Manager or Ambari give a holistic view of what’s going on with the entire estate, you cannot see how cluster-level information, such as CPU consumption, I/O consumption, or network I/O consumption, relates to Spark jobs.

Lastly, and similarly to the challenges faced when troubleshooting on the job and pipeline level, manual correlation is also a problem when it comes to troubleshooting on the cluster level. Manual correlation takes time and effort that a data science team could instead be putting towards product innovations.

But what if there was a tool that takes all these troubleshooting challenges, on the job, pipeline, and cluster level, into consideration? Well, luckily, Unravel Data does just that. Chris next gives examples of how Unravel can be used to mitigate Spark troubleshooting issues, which we’ll go over in the remainder of this blog post.

Want to experience Unravel for Spark?

Create a free account

Demo: Unravel for Spark Troubleshooting

The beauty of Unravel is that it provides a single pane of glass where you can look at logs, the type of information provided by Spark UI, Oozie, and all the other tools mentioned throughout this blog, and data uniquely available through Unravel, all in one view. At this point in the webinar, Chris takes us through a demo to show how Unravel aids in troubleshooting at all Spark levels – job, pipeline, and cluster. For a deeper dive into the demo, view the webinar.

Job Level

At the job level, one area where Unravel can be leveraged is in determining why a job failed. The image below is a Spark run that is monitored by Unravel.

Unravel Dashboard Example Spark

On the left hand side of the dashboard, you can see that Job 3 has failed, indicated by the orange bar. With Unravel, you can click on the failed job and see what errors occurred. On the right side of the dashboard, within the Errors tab, you can see why Job 3 failed, as highlighted in blue. Unravel is showing the Spark logs that give information on the failure.

Pipeline Level

Using Unravel for troubleshooting at the data pipeline level, you can look at a specific workflow, rather than looking at the specifics of one job. The image shows the dashboard when looking at data pipelines.

Unravel Dashboard Example Data Pipelines

On the left, the blue lines represent instance runs. The black dot represents a job that ran for two minutes and eleven seconds. You could use information on run duration to determine if you meet your SLA. If your SLA is under two minutes, for example, the highlighted run would miss the SLA. With Unravel you can also look at changes in I/O, as well as dive deeper into specific jobs to determine why they lasted a certain amount of time. The information in the screenshot gives us insight into why the job mentioned prior ran for two minutes and eleven seconds.

Unravel Analysis Tab Example

The Unravel Analysis tab, shown above, carries out analysis to detect issues and make recommendations on how to mitigate those issues.

Cluster Level

Below is the view of Unravel when troubleshooting at the cluster level, specifically focusing on the same job mentioned previously. The job, which again lasted two minutes and eleven seconds, took place on July 5th at around 7PM. So what happened?

Unravel Troubleshooting Cluster Example

The image above shows information about the data security queue at the time when the job we’re interested in was running. The table at the bottom of the dashboard shows the state of the jobs that were running on July 5th at 7PM, allowing you to see which job, if any, was taking up too many resources. In this case, Chris’ job, highlighted in yellow, wasn’t using a large amount of resources. From there, Chris can conclude that perhaps the issue is instead on the application side and something needs to be fixed within the code. The best way to determine what needs to be fixed is to use the Analysis tab mentioned previously.

Conclusion

There are many ways to troubleshoot Spark, whether it be on the job, pipeline, or cluster level. Unravel can be your one-stop shop to determine what is going on with your Spark jobs and data pipelines, as well as give you proactive intelligence that allows you to quickly troubleshoot your jobs. Unravel can help you meet your SLAs in a resourceful and efficient manner.

We hope you have enjoyed, and learned from, reading this blog post. If you think Unravel Data can help you troubleshoot Spark and would like to know more, you can create a free account or contact Unravel.

The post Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/troubleshooting-apache-spark/feed/ 0
Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/ https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/#respond Thu, 09 Sep 2021 21:53:14 +0000 https://www.unraveldata.com/?p=7220

Note: This guide applies to running Spark jobs on any platform, including Cloudera platforms; cloud vendor-specific platforms – Amazon EMR, Microsoft HDInsight, Microsoft Synapse, Google DataProc; Databricks, which is on all three major public cloud providers; […]

The post Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools appeared first on Unravel.

]]>

Note: This guide applies to running Spark jobs on any platform, including Cloudera platforms; cloud vendor-specific platforms – Amazon EMR, Microsoft HDInsight, Microsoft Synapse, Google DataProc; Databricks, which is on all three major public cloud providers; and Apache Spark on Kubernetes, which runs on nearly all platforms, including on-premises and cloud.

Introduction

Spark is known for being extremely difficult to debug. But this is not all Spark’s fault. Problems in running a Spark job can be the result of problems with the infrastructure Spark is running on, an inappropriate configuration of Spark, Spark issues, the currently running Spark job, other Spark jobs running at the same time – or interactions among these layers. But Spark jobs are very important to the success of the business; when a job crashes, or runs slowly, or contributes to a big increase in the bill from your cloud provider, you have no choice but to fix the problem.

Widely used tools generally focus on part of the environment – the Spark job, infrastructure, the network layer, etc. These tools don’t present a holistic view. But that’s just what you need to truly solve problems. (You also need the holistic view when you’re creating the Spark job, and as a check before you start running it, to help you avoid having problems in the first place. But that’s another story.)

In this guide, Part 2 in a series, we’ll show ten major tools that people use for Spark troubleshooting. We’ll show what they do well, and where they fall short. In Part 3, the final piece, we’ll introduce Unravel Data, which makes solving many of these problems easier.

Life as Spark Developer Diagram

What’s the Problem(s)?

The problems we mentioned in Part 1 of this series have many potential solutions. The methods people usually use to try to solve them often derive from that person’s role on the data team. The person who gets called when a Spark job crashes, such as the job’s developer, is likely to look at the Spark job. The person who is responsible for making sure the cluster is healthy will look at that level. And so on.

In this guide, we highlight five types of solutions that people use – often in various combinations – to solve problems with Spark jobs:

  • Spark UI
  • Spark logs
  • Platform-level tools such as Cloudera Manager, the Amazon EMR UI, Cloudwatch, the Databricks UI, and Ganglia
  • APM tools
  • DataOps platforms such as Unravel Data

As an example of solving problems of this type, let’s look at the problem of an application that’s running too slowly – a very common Spark problem that may be caused by one or more of the issues listed in the chart. Here, we’ll look at how existing tools might be used to try to solve it.

Note: Many of the observations and images in this guide have been drawn from the  presentation Beyond Observability: Accelerate Performance on Databricks, by Patrick Mawyer, Systems Engineer at Unravel Data. We recommend this webinar to anyone interested in Spark troubleshooting and Spark performance management, whether on Databricks or on other platforms.

Solving Problems Using Spark UI

Spark UI is the first tool most data team members use when there’s a problem with a Spark job. It shows a snapshot of currently running jobs, the stages jobs are in, storage usage, and more. It does a good job, but is seen as having some faults. It can be hard to use, with a low signal-to-noise ratio and a long learning curve. It doesn’t tell you things like which jobs are taking up more or less of a cluster’s resources, nor deliver critical observations such as CPU, memory, and I/O usage.

In the case of a slow Spark application, Spark UI will show you what the current status of that job is. You can also use Spark UI for past jobs, if the logs for those jobs still exist, and if they were configured to log events properly. Also, the Spark history server tends to crash. When this is all working, it can help you find out how long an application took to run in the past – you need to do this kind of investigative work just to determine what “slow” is.
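
Whether that historical view exists at all depends on event logging being enabled when the job runs. As a minimal PySpark sketch of the relevant settings (the event log directory below is a placeholder, and it should point at durable storage that the history server can also read):

```python
from pyspark.sql import SparkSession

# Event logging must be enabled for Spark UI / the history server to rebuild
# a past job's view. The directory is a placeholder; use HDFS, S3, or another
# durable location that the history server is configured to read.
spark = (
    SparkSession.builder
    .appName("event-logging-example")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-event-logs")
    .getOrCreate()
)
```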

The following screenshot is for a Spark 1.4.1 job with a two-node cluster. It shows a Spark Streaming job that steadily uses more memory over time, which might cause the job to slow down. And the job eventually – over a matter of days – runs out of memory.

Spark Streaming Job Example (Source: Stack Overflow)

To solve this problem, you might do several things. Here’s a brief list of possible solutions, and the problems they might cause elsewhere (a hedged configuration sketch follows the list):

  • Increase available memory for each worker. You can increase the value of the spark.executor.memory variable to increase the memory for each worker. This will not necessarily speed the job up, but will defer the eventual crash. However, you are either taking memory away from other jobs in your cluster or, if you’re in the cloud, potentially running up the cost of the job.
  • Increase the storage fraction. You can change the value of spark.storage.memoryFraction, which varies from 0 to 1, to a higher fraction. Since the Java virtual machine (JVM) uses memory for caching RDDs and for shuffle memory, you are increasing caching memory at the expense of shuffle memory. This will cause a different failure if, at some point, the job needs shuffle memory that you allocated away at this step.
  • Increase the parallelism of the job. For a Spark Cassandra Connector job, for example, you can change spark.cassandra.input.split.size to a smaller value. (It’s a different variable for other RDD types.) Increasing parallelism decreases the data set size for each worker, requiring less memory per worker. But more workers means more resources used. In a fixed-resources environment, this takes resources away from other jobs; in a dynamic environment, such as a Databricks job cluster, it directly runs up your bill.
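
As a rough illustration of where these knobs live, here is a hedged PySpark sketch that sets all three. The values are arbitrary; spark.storage.memoryFraction applies only to the legacy (pre-unified) memory manager used by Spark 1.x jobs like the one in the screenshot, and the exact Cassandra split-size property name depends on the connector version:

```python
from pyspark import SparkConf, SparkContext

# Arbitrary example values; as the list above notes, each change trades one
# risk (an eventual OOM) for another (cost, shuffle pressure, or contention).
conf = (
    SparkConf()
    .setAppName("tuning-sketch")
    .set("spark.executor.memory", "6g")                 # more memory per worker
    .set("spark.storage.memoryFraction", "0.7")         # legacy memory manager only
    .set("spark.cassandra.input.split.size", "50000")   # Cassandra Connector jobs: smaller splits
)

sc = SparkContext(conf=conf)
```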

The point here is that everything you might do has a certain amount of guesswork to it, because you don’t have complete information. You have to use trial and error approaches to see what might work – both which approach to try/variable to change, and how much to change the variable(s) involved. And, whichever approach you choose, you are putting the job in line for other, different problems – including later failure, failure for other reasons, or increased cost. And, when you’re done, this specific job may be fine – but at the expense of other jobs that then fail. And those failures will also be hard to troubleshoot.

Spark UI Completed Tasks

Here’s a look at the Stages section of Spark UI. It gives you a list of metrics across executors. However, there’s no overview or big picture view to help guide you in finding problems. And the tool doesn’t make recommendations to help you solve problems, or avoid them in the first place.

Spark UI is limited to Spark, but a Spark job may, for example, have data coming in from Kafka, and run alongside other technologies. Each of those has its own monitoring and management tools, or does without; Spark UI doesn’t work with those tools. It also lacks pro-active alerting, automatic actions, and AI-driven insights, all found in Unravel.

Spark UI is very useful for what it does, but its limitations – and the limitations of the other tool types described here – lead many organizations to build homegrown tools or toolsets, often built on Grafana. These solutions are resource-intensive, hard to extend, hard to support, and hard to keep up-to-date.

A few individuals and organizations even offer their homegrown tools as open source software for other organizations to use. However, support, documentation, and updates are limited to non-existent. Several such tools, such as Sparklint and DrElephant, do not support recent versions of Spark. At this writing, they have not had many, if any, fresh commits in recent months or even years.

Spark Logs

Spark logs are the underlying resource for troubleshooting Spark jobs. As mentioned above, Spark UI can even use Spark logs, if available, to rebuild a view of the Spark environment on an historical basis. You can use the logs related to the job’s driver and executors to retrospectively find out what happened to a given job, and even some information about what was going on with other jobs at the same time.

If you have a slow app, for instance, you can painstakingly assemble a picture to tell you if the slowness was in one task versus the other by scanning through multiple log files. But answering why and finding the root cause is hard. These logs don’t have complete information about resource usage, data partitioning, correct configuration settings, and many other factors that can affect performance. There are also many potential issues that don’t show up in Spark logs, such as “noisy neighbor” or networking issues that sporadically reduce resource availability within your Spark environment.

Spark Driver Logs Example

Spark logs are a tremendous resource, and are always a go-to for solving problems with Spark jobs. However, if you depend on logs as a major component of your troubleshooting toolkit, several problems come up, including:

  • Access and governance difficulties. In highly secure environments, it can take time to get permission to access logs, or you may need to ask someone with the proper permissions to access the file for you. In some highly regulated companies, such as financial institutions, it can take hours per log to get access.
  • Multiple files. You may need to look at the logs for a driver and several executors, for instance, to solve job-level problems. And your brain is the comparison and integration engine that pulls the information together, makes sense of it, and develops insights into possible causes and solutions.
  • Voluminous files. The file for one component of a job can be very large, and all the files for all the components of a job can be huge – especially for long-running jobs. Again, you are the one who has to find and retain each part of the information needed, develop a complete picture of the problem, and try different solutions until one works.
  • Missing files. Governance rules and data storage considerations take files away, as files are moved to archival media or simply lost to deletion. More than one large organization deletes files every 90 days, which makes quarterly summaries very difficult. Comparisons to, say, the previous year’s holiday season or tax season become impossible.
  • Only Spark-specific information. Spark logs are just that – logs from Spark. They don’t include much information about the infrastructure available, resource allocation, configuration specifics, etc. Yet this information is vital to solving a great many of the problems that hamper Spark jobs.

Because Spark logs don’t cover infrastructure and related information, it’s up to the operations person to find as much information as they can on those other important areas, then try to integrate it all and determine the source of the problem. (Which may be the result of a complex interaction among different factors, with multiple changes needed to fix it.)

Platform-Level Solutions

There are platform-level solutions that work on a given Spark platform, such as Cloudera Manager, the Amazon EMR UI, and Databricks UI. In general, these interfaces allow you to work at the cluster level. They tell you information about resource usage and the status of various services.

If you have a slow app, for example, these tools will give you part of the picture you need to put together to determine the actual problem, or problems. But these tools do not have the detail-level information in the tools above, nor do they even have all the environmental information you need. So again, it’s up to you to decide how much time to spend researching, pulling all the information together, and trying to determine a solution. A quick fix might take a few hours; a comprehensive, long-term solution may take days of research and experimentation.

This screenshot shows Databricks UI. It gives you a solid overview of jobs and shows you status, cluster type, and so on. Like other platform-level solutions, it doesn’t help you much with historical runs, nor in working at the pipeline level, across the multiple jobs that make up the pipeline.

Databricks UI Clusters Example

Another monitoring tool for Spark, which is included as open source within Databricks, is called Ganglia. It’s largely complementary to Databricks UI, though it also works at the cluster and, in particular, at the node level. You can see host-level metrics such as CPU consumption, memory consumption, disk usage, and network-level I/O – all host-level factors that can affect the stability and performance of your job.

This can allow you to see if your nodes are configured appropriately, to institute manual scaling or auto-scaling, or to change instance types. (Though someone trying to fix a specific job is not inclined to take on issues that affect other jobs, other users, resource availability, and cost.) Ganglia does not have job-specific insights, nor work with pipelines. And there are no good output options; you might be reduced to taking a screen snapshot to get a JPEG or PNG image of the current status.

Ganglia Cluster Overview

Support from the open-source community is starting to shift toward more modern observability platforms like Prometheus, which works well with Kubernetes. And cloud providers offer their own solutions – AWS Cloudwatch, for example, and Azure Log Monitoring and Analytics. These tools are all oriented toward web applications; they lack modern data stack application and pipeline information which is essential to understand what’s happening to your job and how your job is affecting things on the cluster or workspace.

AWS Cloudwatch Overview

Platform-level solutions can be useful for solving the root causes of problems such as out-of-memory errors. However, they don’t go down to the job level, leaving that to resources such as Spark logs and tools such as Spark UI. Therefore, to solve a problem, you are often using platform-level solutions in combination with job-level tools – and again, it’s your brain that has to do the comparisons and data integration needed to get a complete picture and solve the problem.

Like job-level tools, these solutions are not comprehensive, nor integrated. They offer snapshots, but not history, and they don’t make proactive recommendations. And, to solve a problem on Databricks, for example, you may be using Spark logs, Spark UI, Databricks UI, and Ganglia, along with Cloudwatch on AWS, or Azure Log Monitoring and Analytics. None of these tools integrate with the others.

APM Tools

There is a wide range of monitoring tools, generally known as Application Performance Management (APM) tools. Many organizations have adopted one or more tools from this category, though they can be expensive, and they provide very limited metrics on Spark and other modern data technologies. Leading tools in this category include Datadog, Dynatrace, and Cisco AppDynamics.

For a slow app, for instance, an APM tool might tell you if the system as a whole is busy, slowing your app, or if there were networking issues, slowing down all the jobs. While helpful, they’re oriented toward monitoring and observability for Web applications and middleware, not data-intensive operations such as Spark jobs. They tend to lack information about pipelines, specific jobs, data usage, configuration settings, and much more, as they are not designed to deal with the complexity of modern data applications.

Correlation is the Issue

To sum up, there are several types of existing tools:

  • DIY with Spark logs. Spark keeps a variety of types of logs, and you can parse them, in a do it yourself (DIY) fashion, to help solve problems. But this lacks critical infrastructure, container, and other metrics.
  • Open source tools. Spark UI comes with Spark itself, and there are other Spark tools from the open source community. But these lack infrastructure, configuration and other information. They also do not help connect together a full pipeline view, as you need for Spark – and even more so if you are using technologies such as Kafka to bring data in.
  • Platform-specific tools. The platforms that Spark runs on – notably Cloudera platforms, Amazon EMR, and Databricks – each have platform-specific tools that help with Spark troubleshooting. But these lack application-level information and are best used for troubleshooting platform services.
  • Application performance monitoring (APM) tools. APM tools monitor the interactions of applications with their environment, and can help with troubleshooting and even with preventing problems. But the applications these APM tools are built for are technologies such as .NET, Java, and Ruby, not technologies that work with data-intensive applications such as Spark.
  • DataOps platforms. DataOps – applying Agile principles to both writing and running Spark, and other big data jobs – is catching on, and new platforms are emerging to embody these principles.

Each tool in this plethora of tools takes in and processes different, but overlapping, data sets. No one tool provides full visibility, and even if you use one or more tools of each type, full visibility is still lacking.

You need expertise in each tool to get the most out of that tool. But the most important work takes place in the expert user’s head: spotting a clue in one tool, which sends you looking at specific log entries and firing up other tools, to come up with a hypothesis as to the problem. You then have to try out the potential solution, often through several iterations of trial and error, before arriving at a “good enough” answer to the problem.

Or, you might pursue two tried and trusted, but ineffective, “solutions”: ratcheting up resources and retrying the job until it works, either due to the increased resources or by luck; or simply giving up, which our customers tell us they often had to do before they started using Unravel Data.

The situation is much worse in the kind of hybrid data clouds that organizations use today. To troubleshoot on each platform, you need expertise in the toolset for that platform, and all the others. (Since jobs often have cross-platform interdependencies, and the same team has to support multiple platforms.) In addition, when you find a solution for a problem on one platform, you should apply what you’ve learned on all platforms, taking into account their differences. You also have issues that are inherently multi-platform, such as moving jobs from one platform to a platform that is better, faster, or cheaper for a given job. Taking on all this with the current, fragmented, and incomplete toolsets available is a mind-numbing prospect.

The biggest need is for a platform that integrates the capabilities from several existing tools, performing a five-step process:

  1. Ingest all of the data used by the tools above, plus additional, application-specific and pipeline data.
  2. Integrate all of this data into an internal model of the current environment, including pipelines.
  3. Provide live access to the model.
  4. Continually store model data in an internally maintained history.
  5. Correlate information across the ingested data sets, the current, “live” model, and the stored historical background, to derive insights and make recommendations to the user.

This tool must also provide the user with the ability to put “triggers” onto current processes that can trigger either alerts or automatic actions. In essence, the tool’s inbuilt intelligence and the user are then working together to make the right things happen at the right time.

A simple example of how such a platform can help is by keeping information per pipeline, not just per job – then spotting, and automatically letting you know, when the pipeline suddenly starts running slower than it had previously. The platform will also make recommendations as to how you can solve the problem. All this lets you take any needed action before the job is delayed.
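
The underlying idea is easy to sketch: keep a history of run durations per pipeline and flag the latest run when it drifts well past its own baseline. The toy Python function below illustrates the concept only; the threshold, the baseline choice, and where the history lives are arbitrary assumptions, not a description of any product’s implementation:

```python
from statistics import mean

def check_pipeline_duration(history_minutes, latest_minutes, tolerance=1.5):
    """Return an alert message if the latest run is much slower than the
    pipeline's own recent baseline; return None otherwise.

    Toy illustration of the 'trigger' idea described above.
    """
    if len(history_minutes) < 3:
        return None  # not enough history to establish a baseline
    baseline = mean(history_minutes)
    if latest_minutes > tolerance * baseline:
        return (f"Pipeline ran {latest_minutes / baseline:.1f}x slower than its "
                f"recent average ({latest_minutes:.1f} vs {baseline:.1f} minutes)")
    return None

# Example: the last five runs took about 20 minutes, the newest took 38.
print(check_pipeline_duration([19, 21, 20, 22, 18], 38))
```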

The post Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/feed/ 0
Troubleshooting EMR https://www.unraveldata.com/resources/troubleshooting-emr/ https://www.unraveldata.com/resources/troubleshooting-emr/#respond Tue, 17 Aug 2021 20:57:06 +0000 https://www.unraveldata.com/?p=8111

The post Troubleshooting EMR appeared first on Unravel.

]]>

The post Troubleshooting EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/troubleshooting-emr/feed/ 0
Driving Data Governance and Data Products at ING Bank France https://www.unraveldata.com/resources/driving-data-governance-and-data-products-ing-bank-france/ https://www.unraveldata.com/resources/driving-data-governance-and-data-products-ing-bank-france/#respond Thu, 12 Aug 2021 15:45:59 +0000 https://www.unraveldata.com/?p=7173 Line Graph Chart

Data+AI Battlescars Takeaways: Driving Data Governance and Data Products at ING Bank France In this episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Samir Boualla, CDO at ING Bank France, one of the […]

The post Driving Data Governance and Data Products at ING Bank France appeared first on Unravel.

]]>
Line Graph Chart

Data+AI Battlescars Takeaways: Driving Data Governance and Data Products at ING Bank France

In this episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Samir Boualla, CDO at ING Bank France, one of the largest banks in the world. They cover his battlescars in Driving Data Governance Across Business Teams and Building Data Products.

Samir Boualla

At ING Bank France, Samir is the Chief Data Officer. He’s responsible for several teams that govern, develop, and manage data infrastructure and data assets to deliver value to the business. With more than 20 years of experience on various data topics, Samir shares interesting battle-tested techniques in this podcast, including using a process catalog, having a “data minimum standard,” and having a change management mindset. Here are the key takeaways from their chat.

Data Governance

Preparing for Data Governance

  • The first step to preparing for data governance is defining data governance frameworks, guidelines, and principles as part of your organization’s data strategy and discussing them with various stakeholders.
  • Next, you need to get approval and validation from your directors. It is important to have support and commitment from senior leaders since it will be impacting the whole organization.
  • After that, you can identify the appropriate people that should be assigned to roles such as data steward, data owner, or process lead.
  • It is important to implement data governance in parallel with building a data architecture. You can identify and close gaps quickly so you don’t develop something which later on is not compliant with a regulation or with other frameworks.

Having a Data and a Process Catalog

  • In order to monitor compliance proactively, you must know your data. This is where a data catalog plays an important role.
  • A data catalog allows you to have uniform definitions in place, have a data owner for each of the specific data categories, and have people who have more in-depth knowledge and the capability to manage data acting as data stewards.
  • Having a process catalog in addition to a data catalog allows you to seamlessly track the data life cycle.
  • In a process catalog, it’s documented, for each process, what sources are used, who is doing what, and which data is being used in those processes.
  • The process catalog is linked with your data catalog and used to manage your life cycle and apply additional data capabilities, like retention and deletion, on the appropriate systems or appropriate process steps.

Data Minimum Standards

  • Data minimum standards are key frameworks that can give you insight and input on the controls and checks that should be part of your minimum standards for governance.
  • Having these data minimum standards, such as GDPR and BCBS 239, allows you to proactively monitor for compliance and apply governance.
  • These data minimum standards are part of your compliance framework, so you are able to apply them to all kinds of different processes or departments.
  • ING Bank is constantly assessing if the controls work, testing them for effectiveness, and auditing them every once in a while.

Data Products

Building External-Facing Data Products

  • In preparation for building external-facing data products, you need to have a set of standardized APIs as a product, which you can deliver to third parties for external consumers.
  • At the same time, you should also be using another product as your data catalog to make sure that the data being defined and flowing through those APIs is made unique.
  • The biggest challenge ING faces when building external-facing data products is making sure they are acting more or less on the edge of technology and architecture, while also ensuring that they are working towards their goal of becoming a data-driven bank.
  • They encounter several challenges in making sure that their platforms are compatible and can exchange data in the right formats and in the right structure, but also in a way that the infrastructure remains scalable.

Challenges of Building Data Products

  • Sometimes a product depends on data quality: while Samir hopes to identify any data quality issues upfront, somewhere down the line the consumer may still identify issues or have questions regarding the quality of the data. This is where another data product, called remediation, can come into play.
  • Remediation is when a consumer can address data quality issues directly to the appropriate data stewards in the organization. Using other complementary products, a consumer can look into certain data or to a certain report to identify which data point came from where and who’s the data steward or data owner of that specific data. They can then address it automatically in a workflow environment, and request remediation.
  • When building a data product, you may run into manual, legacy processes that have not yet been redesigned.

Change Management

  • Having a change management mindset means that you are willing to implement something new or change something from the legacy based on new data products.
  • A standard data model and data catalog are essential when it comes to change management.
  • By having a single data model and a data catalog, you have a decoupling layer that helps and supports you in identifying exactly which data point reflects the truth.
  • ING has a broad framework, which allows them to work in a similar, agile way across the organization, and across the globe, whenever something needs to be adjusted.
  • Their data products also allow them to minimize the amount of effort that needs to be put into a change to make it available.

While we’ve highlighted the key talking points here, Sandeep and Samir talked about so much more in the full podcast. Be sure to check out the full episode and the rest of the Data+AI Battlescars podcast series!

The post Driving Data Governance and Data Products at ING Bank France appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/driving-data-governance-and-data-products-ing-bank-france/feed/ 0
Troubleshooting Databricks – Challenges, Fixes, and Solutions https://www.unraveldata.com/resources/troubleshooting-databricks/ https://www.unraveldata.com/resources/troubleshooting-databricks/#respond Wed, 11 Aug 2021 21:00:22 +0000 https://www.unraveldata.com/?p=8114

The post Troubleshooting Databricks – Challenges, Fixes, and Solutions appeared first on Unravel.

]]>

The post Troubleshooting Databricks – Challenges, Fixes, and Solutions appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/troubleshooting-databricks/feed/ 0
Simplifying Data Management at LinkedIn – What is Data Quality https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-2/ https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-2/#respond Fri, 18 Jun 2021 13:00:31 +0000 https://www.unraveldata.com/?p=7010 Data Technology Background

In the second of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In part one, they covered LinkedIn’s challenges related […]

The post Simplifying Data Management at LinkedIn – What is Data Quality appeared first on Unravel.

]]>
Data Technology Background

In the second of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In part one, they covered LinkedIn’s challenges related to metadata management and data access APIs. This second part dives deep into data quality.

Kapil has 20+ years of experience in data and infrastructure, at large companies such as Oracle and at multiple startups. At LinkedIn, Kapil has been leading the next generation of big data infrastructure, platforms, tools, and applications to empower data scientists, AI engineers, and app developers to extract value from data. Kapil's team has been at the forefront of innovation driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the second part of their insightful chat.

What Does “Data Quality” Mean?

  • Data quality spans many dimensions and can be really broad in its coverage. It answers the questions: Is your data accurate? Do you have data integrity, validity, security, timeliness, and completeness? Is the data consistent? Is it interpretable?
  • LinkedIn measures tons of different metrics, aiming to understand the health of the business, products, and systems.
  • When KPIs differ from the norm, it becomes a question of data quality.
  • When determining the root cause of poor data quality, it can differ for each dimension of quality.
  • For example, when there is metric inconsistency, you must ask yourself if you have an accurate source of truth for your metrics.
  • Timeliness and completeness problems often happen as a result of infrastructure issues. In complex ecosystems you have a lot of moving parts, so a lot can go wrong that impacts timeliness and completeness.
  • Data quality problems often don’t actually manifest themselves as data quality problems in obvious ways. It takes time to monitor, detect, and effectively analyze the root cause of these issues, and remediate the issues.
  • When assessing and improving data quality, it often helps to categorize it in three buckets. There is (1) the monitoring and observability aspect, (2) anomaly detection and root cause analysis for anomalies, and (3) preventing data quality issues in the first place. The last bucket is the best-case scenario.
  • In any complex ecosystem, when something goes wrong for any reason in a pipeline or in a single stage of a dataset or a data flow, it can have real consequences on the entire downstream chain. So your goal is to detect problems as close to the source as possible (a minimal sketch of such a source-side check follows this list).
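
The episode doesn't describe LinkedIn's internal checks in code, but as a rough, minimal sketch of a source-side completeness check, the PySpark snippet below compares a new partition's row count against a trailing seven-day average. The table name, partition column, and threshold are assumptions for illustration, not anything from LinkedIn.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical table and partition column, used only for illustration.
TABLE = "events.page_views"

def completeness_alert(spark: SparkSession, ds: str, tolerance: float = 0.5) -> bool:
    """Return True if the new partition's row count falls well below the trailing average."""
    history = (
        spark.table(TABLE)
        .where(F.col("ds") < ds)      # earlier partitions only
        .groupBy("ds").count()
        .orderBy(F.col("ds").desc())
        .limit(7)                     # trailing seven days
    )
    baseline = history.agg(F.avg("count")).first()[0] or 0
    today = spark.table(TABLE).where(F.col("ds") == ds).count()
    return baseline > 0 and today < tolerance * baseline

if __name__ == "__main__":
    spark = SparkSession.builder.appName("completeness-check").getOrCreate()
    if completeness_alert(spark, "2021-06-18"):
        raise RuntimeError("Partition looks incomplete; investigate before consumers read it")
```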

How Does LinkedIn Maintain Data Quality?

Unified Metrics Platform

  • It is important to have an evolving schema of your datasets and signals for when something goes awry, to act as markers of data quality.
  • At LinkedIn they built a platform for metric definition and the entire life cycle management of those metrics, called the Unified Metrics Platform. The platform processes all of LinkedIn’s critical metrics – to the point that if it’s not produced by the platform, it wouldn’t be considered a trusted metric. The Unified Metrics Platform defines their source of truth.
  • The company turned to machine learning techniques to improve the detection of anomalies and alerting based on user feedback.

Data Sentinel

  • You can have situations where the overall metric that you’re monitoring may not have a significant deviation, but when you look into the sub spaces within that metric, you find significant deviations. To solve this problem, LinkedIn leveraged algorithms to automatically build structures based on the nature of the data itself and build a multi-dimensional data cube.
  • When you’re unable to pinpoint the root cause of a deviation, it becomes a matter of identifying the space of the key drivers which might impact the particular metric. You narrow that space, present it to users for their feedback, and then continuously refine the system.
  • To detect issues based on the known properties of the data, LinkedIn built a system called the Data Sentinel. This system can take the developer's knowledge about the dataset, expressed as declarative rules, and then takes on the responsibility of generating the code to perform the data validations (a hedged sketch of this declarative idea follows this list).
  • LinkedIn is considering making Data Sentinel open source in the future.
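
Data Sentinel's actual rule language isn't shown in the episode, so the snippet below is only a hedged illustration of the general pattern: declare the properties a developer expects of a dataset, then generate simple PySpark checks from those declarations. The column names and rule keys are hypothetical.

```python
from pyspark.sql import DataFrame

# Hypothetical declarative rules: each entry maps a column to the properties
# a developer expects it to satisfy.
RULES = {
    "member_id": {"not_null": True, "unique": True},
    "country_code": {"allowed_values": ["US", "IN", "DE", "FR"]},
}

def validate(df: DataFrame) -> list:
    """Turn the declarative rules into concrete checks and return any violations."""
    violations = []
    total = df.count()
    for col, rule in RULES.items():
        if rule.get("not_null") and df.where(df[col].isNull()).count() > 0:
            violations.append(f"{col}: null values found")
        if rule.get("unique") and df.select(col).distinct().count() != total:
            violations.append(f"{col}: duplicate values found")
        allowed = rule.get("allowed_values")
        if allowed and df.where(~df[col].isin(allowed)).count() > 0:
            violations.append(f"{col}: values outside allowed set")
    return violations
```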

Building a Data Quality Team

  • At LinkedIn, they make sure that team members take the time to treat event schemas as code. They have to be reviewed and checked on. The same goes for metrics. This requires collaboration between different teams. They are constantly talking to each other and coming together to improve not just tools, but also processes.
  • What is accepted as state-of-the-art today is almost guaranteed not to be state-of-the-art tomorrow. So when hiring for a data quality or data management team, it is important to look for people who are naturally curious and have a growth mindset.

If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Be sure to check out Part 1 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!

 

The post Simplifying Data Management at LinkedIn – What is Data Quality appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-2/feed/ 0
Simplifying Data Management at LinkedIn – Metadata Management and APIs https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-1/ https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-1/#respond Thu, 17 Jun 2021 13:00:49 +0000 https://www.unraveldata.com/?p=7007

In the first of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In this first part, they cover LinkedIn’s challenges […]

The post Simplifying Data Management at LinkedIn – Metadata Management and APIs appeared first on Unravel.

]]>

In the first of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In this first part, they cover LinkedIn’s challenges related to Metadata Management and Data Access APIs. Part 2 will dive deep into data quality.

Kapil has 20+ years of experience in data and infrastructure, at large companies such as Oracle and at multiple startups. At LinkedIn, Kapil has been leading the next generation of big data infrastructure, platforms, tools, and applications to empower data scientists, AI engineers, and app developers to extract value from data. Kapil's team has been at the forefront of innovation driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the first part of their insightful chat.

Metadata Management

The Problem: So Much Data

  • LinkedIn manages over a billion data points and over 50 billion relationships.
  • As the number of datasets skyrocketed at LinkedIn, the company found that internal users were spending an inordinate amount of time searching for and trying to understand hundreds of thousands, if not millions, of datasets.
  • LinkedIn could no longer rely on manually generated information about the datasets. The company needed a central metadata repository and a metadata management strategy.

Solution 1: WhereHows

  • The company initiated an in-depth analysis of its Hadoop data lake and asked questions such as: What are the datasets? Where do they come from? Who produced the dataset? Who owns it? What are the associated SLAs? What other sets does a particular dataset depend on?
  • Similar questions were asked about the jobs: What are the inputs and outputs? Who owns the jobs?
  • The first step was the development of an open source system called WhereHows, a central metadata repository to capture metadata across the diverse datasets, with a search engine.

Solution 2: Pegasus

  • LinkedIn discovered that it was not enough just to focus on the datasets and the jobs. The human element had to be accounted for as well. A broader view was necessary, accommodating both static and dynamic metadata.
  • In order to expand the capabilities of the metadata model, the company realized it needed to take a “push” approach rather than a metadata “scraping” approach.
  • The company then built a library called Pegasus to create a push-based model that improved both efficiency and latency (a generic sketch of the push pattern appears after this list).
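
The episode doesn't cover Pegasus's actual interfaces, so the following is only a generic sketch of the push pattern it enabled: the job that produces a dataset emits metadata about it at publish time, instead of waiting for a crawler to scrape it later. The endpoint, payload shape, and function are hypothetical.

```python
import json
from datetime import datetime, timezone

import requests

# Hypothetical metadata service endpoint; not LinkedIn's actual API.
METADATA_ENDPOINT = "https://metadata.example.com/api/datasets"

def emit_dataset_metadata(name: str, owner: str, schema: dict, upstreams: list) -> None:
    """Push a metadata record for a dataset as soon as the producing job writes it."""
    payload = {
        "name": name,
        "owner": owner,
        "schema": schema,
        "upstreams": upstreams,
        "published_at": datetime.now(timezone.utc).isoformat(),
    }
    resp = requests.post(
        METADATA_ENDPOINT,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()

# Example: a pipeline would call this right after writing its output table.
# emit_dataset_metadata("warehouse.page_views_daily", "data-eng",
#                       {"ds": "string", "views": "long"}, ["events.page_views"])
```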

The Final Solution: DataHub

  • Kapil’s team then found that you need the ability to query metadata through APIs. You must be able to query the metadata online, so other services can integrate.
  • The team went back to the drawing board and re-architected the system from the ground up based on these learnings.
  • The result was DataHub, the latest version of the company’s open source metadata management system, released last year.
  • A major benefit of this metadata approach is the ability to drive a lot of other functions in the organization that depend on access to metadata.

Data Access APIs

  • LinkedIn recently completely rebuilt its mobile experience, user tracking, and associated data models, essentially needing to “change engines in mid-flight.”
  • The company needed a data API to meet this challenge. One did not exist, however, so they created a data access API named “Dali API” to provide an access layer to offline data.
  • The company used Hive to build the infrastructure and get to market quickly. But in hindsight, using Hive added some dependencies.
  • So LinkedIn built a new, more flexible library called Coral and made it available as open source software. This removed the Hive dependency, and the company will benefit from improvements made to Coral by the community.

If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Keep an eye out for takeaways from the second part of this chat, and be sure to check out Part 2 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!

The post Simplifying Data Management at LinkedIn – Metadata Management and APIs appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/simplifying-data-management-at-linkedin-part-1/feed/ 0
Recruiting and Building the Data Science Team at Etsy https://www.unraveldata.com/resources/recruiting-building-data-science-team-etsy/ https://www.unraveldata.com/resources/recruiting-building-data-science-team-etsy/#respond Wed, 16 Jun 2021 13:00:47 +0000 https://www.unraveldata.com/?p=6999 Business People City Background

Data+AI Battlescars Takeaways: Recruiting and Building the Data Science Team at Etsy In this episode of Data+AI Battlescars (formerly CDO Battlescars), Sandeep Uttamchandani talks to Chu-Cheng, CDO at Etsy. This episode focuses on Chu-Cheng’s battlescars related […]

The post Recruiting and Building the Data Science Team at Etsy appeared first on Unravel.

]]>
Business People City Background

Data+AI Battlescars Takeaways: Recruiting and Building the Data Science Team at Etsy

In this episode of Data+AI Battlescars (formerly CDO Battlescars), Sandeep Uttamchandani talks to Chu-Cheng, CDO at Etsy. This episode focuses on Chu-Cheng's battlescars related to recruiting and building a data science team.

Chu-Cheng leads the global data organization at Etsy. He’s responsible for data science, AI innovation, machine learning and data infrastructure. Prior to Etsy, Chu-Cheng has led various data roles, including at Amazon, Intuit, Rakuten and eBay. Here are the key talking points from their chat.

Building a Data Science Team: The Early Stages

  • At the early stages of building a data science team, it may be more useful to hire people who are generalists rather than, for example, specialized data scientists or machine learning engineers.
  • In any successful team, you need a mix of different experience levels and skill sets.
  • When building a data science team, Chu-Cheng actually looks for people who probably wouldn’t pass a traditional data science interview, but can still get the job done.
  • For first hires, Chu-Cheng generally targets people who have at least a few years of work experience and have a lot of patience and willingness to learn.
  • A good candidate is someone who can explain a difficult concept in a way that anyone can understand. You want to find someone who knows how to tell people what they are thinking.
  • For example, you can look them up on LinkedIn and see their profile and how they write their own self-description, so you get an idea even before the interview.
  • In the past, Chu-Cheng often unconsciously looked for someone with whom he had a similar background.
  • To counteract this bias, he learned to create a checklist of criteria prior to interviewing a candidate. He uses that checklist to evaluate the candidate’s qualifications, rather than just picking someone who has a similar background to him.

Building a Data Science Team: The Later Stages

  • Eventually when you have a bigger team, you must start moving from hiring generalists to hiring specialists. If you only hire generalists, you’ll eventually run into a wall, because you have a bunch of people who are fungible.
  • In interviews, Chu-Cheng tries to balance technical and soft skills, even when hiring scientists and engineers.
  • If you are interviewing someone for a manager position, prioritize their mentoring, coaching, conflict resolution, and communication skills. The manager’s success is defined by the team’s success.
  • As a manager, when coaching someone, instead of trying to give out or prescribe an answer, focus on how you can ask the right questions so that the person can come up with a solution on their own. Switch from telling to asking. Allow people to make mistakes so you can coach them to grow and learn a lesson from it.

Innovation at Etsy

  • Chu-Cheng tries to teach his team how to write papers and patents. At Etsy, they encourage this innovation by sending a recognition gift or an innovation award.
  • Papers and patents are not the only types of innovation, however. Innovation also involves the process of making something easier. Not everything can be patented or become a paper, but you can, for instance, write a blog sharing your learning. Innovation is a mindset.
  • When considering a new technology, it is important to get a sense of the circumstances under which you should not use the technology, as well as when to use it. You must know when to use what.

If you’re interested in any of the topics discussed here, Sandeep and Chu-Cheng talked about even more in the full podcast. Be sure to check out the full episode and the rest of the Data+AI Battlescars podcast series!

The post Recruiting and Building the Data Science Team at Etsy appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/recruiting-building-data-science-team-etsy/feed/ 0
AI/ML without DataOps is just a pipe dream! https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/ https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/#respond Fri, 23 Apr 2021 04:19:24 +0000 https://www.unraveldata.com/?p=6791 Abstract Infinity Loop Background

The following blog post appeared in its original form on Towards Data Science. It’s part of a series on DataOps for effective AI/ML. The author is CDO and VP Engineering here at Unravel Data. (GIF by giphy) […]

The post AI/ML without DataOps is just a pipe dream! appeared first on Unravel.

]]>
Abstract Infinity Loop Background

The following blog post appeared in its original form on Towards Data Science. It’s part of a series on DataOps for effective AI/ML. The author is CDO and VP Engineering here at Unravel Data. (GIF by giphy)

Let’s start with a real-world example from one of my past machine learning (ML) projects: We were building a customer churn model. “We urgently need an additional feature related to sentiment analysis of the customer support calls.” Creating the data pipeline to extract this dataset took about 4 months! Preparing, building, and scaling the Spark MLlib code took about 1.5-2 months! Later we realized that “an additional feature related to the time spent by the customer in accomplishing certain tasks in our app would further improve the model accuracy” — another 5 months gone in the data pipeline! Effectively, it took 2+ years to get the ML model deployed!
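
For readers less familiar with that last step, here is a minimal sketch of the kind of Spark MLlib (pyspark.ml) churn pipeline being described; the table, feature columns, and label are assumptions for illustration, not the project's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

# Hypothetical training table with engineered features, including the
# sentiment score extracted from support calls.
train_df = spark.table("features.customer_churn_training")

assembler = VectorAssembler(
    inputCols=["support_call_sentiment", "days_since_last_login", "monthly_spend"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

model = Pipeline(stages=[assembler, lr]).fit(train_df)
model.write().overwrite().save("/models/customer_churn")
```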

After driving dozens of ML initiatives (as well as advising multiple startups on this topic), I have reached the following conclusion: Given the iterative nature of AI/ML projects, having an agile process of building fast and reliable data pipelines (referred to as DataOps) has been the key differentiator in the ML projects that succeeded. (Unless there was a very exhaustive feature store available, which is typically never the case).

Behind every successful AI/ML product is a fast and reliable data pipeline developed using well-defined DataOps processes!

To level-set, what is DataOps? From Wikipedia: “DataOps incorporates the agile methodology to shorten the cycle time of analytics development in alignment with business goals.”

I define DataOps as a combination of process and technology to iteratively deliver reliable data pipelines with agility. Depending on the maturity of your data platform, you might be in one of the following DataOps phases:

  • Ad-hoc: No clear processes for DataOps
  • Developing: Clear processes defined, but accomplished manually by the data team
  • Optimal: Clear processes with self-service automation for data scientists, analysts, and users.

Similar to software development, DataOps can be visualized as an infinity loop

The DataOps lifecycle – shown as an infinity loop above – represents the journey in transforming raw data to insights. Before discussing the key processes in each lifecycle stage, the following is a list of top-of-mind battle scars I have encountered in each of the stages:

  • Plan: “We cannot start a new project — we do not have the resources and need additional budget first”
  • Create: “The query joins the tables in the data samples. I didn’t realize the actual data had a billion rows! ”
  • Orchestrate: “Pipeline completes but the output table is empty — the scheduler triggered the ETL before the input table was populated” (see the orchestration guard sketch after this list)
  • Test & Fix: “Tested in dev using a toy dataset — processing failed in production with OOM (out of memory) errors”
  • Continuous Integration: “A poorly written data pipeline got promoted to production — the team is now firefighting”
  • Deploy: “Did not anticipate the scale and resource contention with other pipelines”
  • Operate & Monitor: “Not sure why the pipeline is running slowly today”
  • Optimize & Feedback: “I tuned the query one time — didn’t realize the need to do it continuously to account for data skew, scale, etc.”
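
As a concrete illustration of the Orchestrate scar, the sketch below shows a simple guard, written as plain PySpark rather than in any particular scheduler's API, that polls for the upstream partition and fails loudly instead of quietly producing an empty output table. The table and partition filter are assumptions.

```python
import time

from pyspark.sql import SparkSession

def wait_for_input(spark: SparkSession, table: str, partition_filter: str,
                   poll_seconds: int = 300, timeout_seconds: int = 7200) -> None:
    """Block until the upstream partition has rows, or fail instead of running the ETL."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if spark.table(table).where(partition_filter).limit(1).count() > 0:
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"{table} ({partition_filter}) was never populated; ETL not started")

# Hypothetical usage inside a pipeline driver:
# wait_for_input(spark, "raw.orders", "ds = '2021-04-23'")
# run_etl(spark)
```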

To avoid these battle scars and more, it is critical to mature DataOps from ad hoc, to developing, to optimal (self-service).

This blog series will help you go from ad hoc to well-defined DataOps processes, as well as share ideas on how to make them self-service, so that data scientists and users are not bottlenecked by data engineers.

DataOps at scale with Unravel

Create a free account

For each stage of the DataOps lifecycle, follow the links for the key processes to define and the experiences in making them self-service (some of the links below are being populated, so please bookmark this blog post and come back over time):

Plan Stage

  • How to streamline finding datasets
  • Formulating the scope and success criteria of the AI/ML problem
  • How to select the right data processing technologies (batch, interactive, streaming) based on business needs

Create Stage

Orchestrate Stage

Test & Fix Stage

  • Streamlining sandbox environment for testing
  • Identify and remove data pipeline bottlenecks
  • Verify data pipeline results for correctness, quality, performance, and efficiency

Continuous Integration & Deploy Stage

  • Smoke test for data pipeline code integration
  • Scheduling window selection for data pipelines
  • Changes rollback

Operate Stage

Monitor Stage

Optimize & Feedback Stage

  • Continuously optimize existing data pipelines
  • Alerting on budgets

In summary, DataOps is the key to delivering fast and reliable AI/ML! It is a team sport. This blog series aims to demystify the required processes as well as build a common understanding across Data Scientists, Engineers, Operations, etc.

DataOps as a team sport

To learn more, check out the recent DataOps Unleashed Conference, as well as innovations in DataOps Observability at Unravel Data. Come back to get notified when the links above are populated.

The post AI/ML without DataOps is just a pipe dream! appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/feed/ 0
DataOps vs DevOps and Their Benefits Towards Scaling Delivery https://www.unraveldata.com/resources/dataops-vs-devops/ https://www.unraveldata.com/resources/dataops-vs-devops/#respond Thu, 15 Apr 2021 15:07:24 +0000 https://www.unraveldata.com/?p=6733 Abstract Infinity Loop Background

The exponential adoption of IT technologies over the past several decades has had a profound impact on organizations of all sizes. Whether it is a small, medium, or large enterprise, the need to create web applications […]

The post DataOps vs DevOps and Their Benefits Towards Scaling Delivery appeared first on Unravel.

]]>
Abstract Infinity Loop Background

The exponential adoption of IT technologies over the past several decades has had a profound impact on organizations of all sizes. Whether it is a small, medium, or large enterprise, the need to create web applications while managing an extensive set of data effectively is high on every CIO’s priority list.

As a result, there has been an ongoing effort to implement better approaches to software development, data analysis, and data management.

The efforts are so pervasive across industries that these approaches have been given names of their own. The approach to better manage software development and delivery is known as DevOps. The end-to-end approach to efficiently and effectively deliver data products – from responses to SQL queries, to data pipelines, to machine learning models and AI-powered insights – is known as DataOps.

DataOps and DevOps are similar in that they both aim to solve the need to scale delivery. The key difference is that DataOps focuses on the flow of data, and the use of data in analytics, rather than on the software development and delivery lifecycle.

There’s also a difference in impact. Strong DataOps practices are vital to the successful development and delivery of AI-powered applications, including machine learning models. AI and ML are powerful areas of innovation, perhaps the most important in decades. DataOps as a discipline is necessary for the successful development and deployment of AI.

To help you gain an understanding of DataOps vs DevOps, it’s helpful to provide an overview of both, discuss their respective goals, and then highlight the key differences between the two.

DataOps Overview

DataOps is sometimes seen as simply making data flows through an organization, and the transformations applied to that data, work correctly. A related misconception, that DataOps is just DevOps applied to data analytics, is also common. In reality, DataOps is a holistic approach to solving business problems.

According to Gartner,

DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.

As a concept, DataOps was first introduced by InformationWeek’s Lenny Liebmann in 2014. The term appeared in a blog post on the IBM Big Data & Analytics Hub, titled “3 reasons why DataOps is essential for big data success”.

A few years later, in 2017, DataOps began to get its own ecosystem, dedicated analyst coverage, increased keyword searches, inclusion in surveys, mention in books, and use in open source projects.

Google Trends Chart DataOps

DataOps is Not Just Analytics

DataOps is sometimes seen as a set of best practices around analytics. Analytics can be considered to include most of AI and ML, so improving analytics functionality is not a trivial goal. But DataOps is much more than just DevOps applied to data analytics.

Data analytics happens at the end of a data pipeline, while DataOps encompasses nearly a dozen steps, including data ingestion and the entire data pipeline before analytics happens. DataOps also includes the delivery of analytics results and their ultimate business impact. And it serves as a framework for the development and delivery of useful capabilities from AI and ML.

As complex data environments are constantly changing, it is critical for an organization to possess and maintain a solid understanding of their data assets, and to add to their data assets when needed for business success. Understanding the origin of the data, analyzing data dependencies, and keeping documentation up to date are each resource-intensive, yet critical.

Having a high-performing DataOps team in place can help an organization accelerate the DataOps lifecycle – developing powerful, data-centric apps to deliver accurate insights to both internal and external customers.

The Complete Guide to DataOps

Download guide

DevOps Overview

Now that we’ve briefly described DataOps, let’s discuss DevOps. According to Atlassian, the DevOps movement started to come together between 2007 and 2008. At the time, software development and IT operations communities started raising concerns about an increase in what they felt was a near-fatal level of dysfunction in the industry.

The primary dysfunction these two groups saw was that in a traditional software development model, those who wrote the code were functionally and organizationally separate and apart from those who deployed and supported the code.

As such, software developers and IT Operations teams had competing objectives, different leadership, and different KPIs by which each group was measured and evaluated. As a result, teams were siloed and only concerned with what had a direct impact on them.

The result: poorly executed software releases, unhappy customers, and often unhappy development and IT Operations teams.

Over time, DevOps evolved to solve the pain caused by these siloed teams and poor lines of communication between software developers and IT Operations.

What is DevOps

DevOps describes an approach to software development that accelerates the build lifecycle using automation. By focusing on continuous integration and delivery of software, and by automating the integration, testing, and deployment of code, enterprises start to see many benefits. Specifically, this approach of merging DEVelopment and OPerationS reduces time to deployment, shortens time to market, keeps defects or bugs at a minimum, and generally shortens the time required to resolve any defects.

6 Primary Goals of DevOps

There are six key principles or goals that DevOps aims to deliver.

DevOps Goal 1: Speed

In order to quickly adapt to customer needs, changes in the market, or new business goals, the application development process and release capabilities need to be extremely fast. Practices such as continuous delivery and continuous integration help make it possible to maintain that speed throughout the development and operations phases.

DevOps Goal 2: Reliability

Continuous delivery and integration not only improve the time to market with new code, they also improve the overall stability and reliability of software. Integrating automated testing and exception handling helps software development teams identify problems immediately, minimizing the chances of errors being introduced and exposed to end users.

DevOps Goal 3: Rapid Delivery

DevOps aims to increase the pace and frequency of new software application releases, enabling development teams to improve an application as often as they’d like. Performing frequent, fast-paced releases ensures that the turnaround time for any given bug fix or new feature release is as short as possible.

DevOps Goal 4: Scalability

DevOps focuses on creating applications and infrastructure platforms that quickly and easily scale to address the constantly changing needs and demands of both the business and end users. A practice that is gaining in popularity and that helps scale applications is infrastructure as code: managing and provisioning data center resources through machine-readable definitions so that capacity can be added for an application immediately.

DevOps Goal 5: Security

DevOps encourages strong security practices by automating compliance policies. This simplifies the configuration process and introduces detailed security controls. This programmatic approach ensures that any resources that fall out of compliance are noticed immediately, so they can be evaluated by the development team and brought back into compliance quickly.

DevOps Goal 6: Collaboration

Just like other agile-based software development methodologies, DevOps strongly encourages collaboration throughout the entire software development life cycle. This leads to software development teams that are up to 25% more productive and 50% faster to market than non-agile teams.

Differences between DataOps and DevOps

As outlined above, DevOps is the transformation in the ability and capability of software development teams to continuously deliver and integrate their code.

DataOps focuses on the transformation of intelligence systems and analytic models by data analysts and data engineers.

DevOps brings software development teams and IT operations together with the primary goal to reduce the cost and time spent on the development and release cycle.

DataOps goes one step further, integrating data so that data teams can acquire the data, transform it, model it, and obtain insights that are highly actionable.

DataOps is not limited to making existing data pipelines work effectively, getting reports and Artificial Intelligence and Machine Learning inputs and outputs delivered as needed, and so on. DataOps actually includes all parts of the data management lifecycle.

DataOps, DevOps, and Competitive Advantage

DevOps, as a term and as a practice, grew rapidly in interest and activity throughout the last decade, but has plateaued recently. A decade ago, or even five years ago, aggressive adoption of DevOps as a practice could give an organization a significant competitive advantage. But DevOps is now “table stakes” for modern software development and delivery.

It’s now DataOps that’s in a phase of rapid growth. A big part of the reason for this is the need for strong DataOps practices in developing, and delivering value from, AI and ML. For IT practitioners and management, there are new skills to learn, new ways to deliver value, and in a sense, whole new worlds to conquer, all based on the development, growth, and institutionalization of new practices around data.

At the organizational level, DataOps gives companies the opportunity to innovate, to better serve customers, and to gain competitive advantage by rapid, effective adoption and innovation around DataOps as a practice.

Many of today’s largest and fastest-growing companies are DataOps innovators. Facebook, Apple, Alphabet, Netflix, and Uber are just the best-known among companies which have grown to a previously unheard-of degree, largely based on their innovative (and, often, controversial) practices around the use of data.

Adobe is an example of a company that has grown rapidly – increasing its market value by 4x in the last few years – by adding a data-driven business, the marketing-centered Experience Cloud, to their previously existing portfolio.

Algorithms are hard to protect, competitively, while a company’s data is its own. And, while AI algorithms and machine learning models that depend on this data can be shared, they don’t mean much without the flow of data that powers them. So all this means that innovation based on a company’s data, accelerated by the adoption and implementation of DataOps, is more able to yield lasting and protectable competitive advantage, and contribute to a company’s growth.

Test-drive Unravel for DataOps

Create a free account

Conclusion

It’s fair to say that “DataOps is the new DevOps” – not because one replaces the other, but because DataOps is the hot new area for innovation and competitive advantage. The main difference is that it’s easier for competitive advantage based on data, and on making the best possible use of that data via DataOps, to be enduring.

Unravel Data customers are more easily able to work through all steps in the DataOps cycle, because they can see and work with their data-driven applications holistically and effectively. If you're interested in assessing Unravel for your own data-driven applications, you can create a free account or contact us to learn how we can help.

The post DataOps vs DevOps and Their Benefits Towards Scaling Delivery appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-vs-devops/feed/ 0
The Guide To DataOps, AIOps, and MLOPs in 2022 https://www.unraveldata.com/resources/dataops-aiops-and-mlops/ https://www.unraveldata.com/resources/dataops-aiops-and-mlops/#respond Wed, 14 Apr 2021 15:22:11 +0000 https://www.unraveldata.com/?p=6676 Line Graph Chart

Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those […]

The post The Guide To DataOps, AIOps, and MLOPs in 2022 appeared first on Unravel.

]]>
Line Graph Chart

Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those days are a distant memory now, and for good reason.

Google Trends Chart DataOps AIOps MLOps

The Ops landscape has expanded beyond the generic “IT” to include DevOps, DataOps, AIOps, MLOps, and more. Each of these Ops areas is cross-functional throughout an organization, and each provides a unique benefit. And each of the Ops areas emerges from the same general mechanism: applying agile principles, originally created to guide software development, to the overlap of different flavors of software development, related technologies (data-driven applications, AI, and ML), and operations.

As with DevOps, the goal of DataOps, AIOps, and MLOps is to accelerate processes and improve the quality of data, analytics insights, and AI models respectively. In practice, we see DataOps as a superset of AIOps and MLOps with the latter two Ops overlapping each other.

Why is this? DataOps describes the flow of data, and the processing that takes place against that data, through one or more data pipelines. In this context, every data-driven app needs DataOps; those which are primarily driven by machine learning models also need MLOps, and those with additional AI capabilities need AIOps. (Confusingly, ML is sometimes considered to be separate from AI, and sometimes simply an important part of the technologies that are part of AI as a whole.)

DataOps MLOps AIOps Venn Diagram

The goal of this article is to help you understand these new terms and provide insight into where they came from, the similarities, differences, and who in an organization uses them. To start, we’ll look at DataOps.

What is DataOps

DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. DataOps is practiced by modern data teams, including data engineers, architects, analysts, scientists and operations.

The data products which power today’s companies range from advanced analytics, data pipelines, and machine learning models to embedded AI solutions. Using a DataOps methodology allows companies to move faster, more surely, and with greater cost-effectiveness in extracting value out of data.

A company can adopt a DataOps approach independently, without buying any specialty software or services. However, just as the term DevOps became strongly associated with the use of containers, as commercial software from Docker and as open source software from Kubernetes, the term DataOps is increasingly associated with data pipelines and applications.

DataOps at scale with Unravel

Create a free account

The origin of DataOps

DataOps is a successor methodology to DevOps, which is an approach to software development that optimizes for responsive, continuous deployment of software applications and software updates. DataOps applies this approach across the entire lifecycle of data applications and even helps productize data science.

As a term, DataOps has been in gradually increasing use for several years now. Gartner began to include it in their Hype Cycle for Data Management in 2018, and the term is now “moving up the Hype Cycle” as it becomes more widespread. The first industry conference devoted to DataOps, DataOps Unleashed, was launched in March 2021.

While DataOps is sometimes described as a set of best practices for continuous analytics, it is actually more holistic. DataOps includes identifying the data sources needed to solve a business problem, the processing needed to make that data ready for analytics, the analytics step(s) – which may be seen as including AI and ML – and the delivery of results to the people or apps that will use them. DataOps also includes making all this work fast enough to be useful, whether that means processing the weekly report in an hour or so, or approving a credit card transaction in a fraction of a second.

Who uses DataOps

DataOps is directly for data operations teams, data engineers, and software developers building data-powered applications and the software-defined infrastructure that supports them. The benefits of DataOps approaches are felt by the teams themselves; the IT team, which can now do more with less; the data team’s internal “customers,” who request and use analytics results; the organization as a whole; and the company’s customers. Ultimately, society benefits, as things simply work better, faster, and less expensively. (Compare booking a plane ticket online to going to a travel agent, or calling airlines yourself, out of the phone book, as a simple example.)

What is AIOps

AIOps stands for Artificial Intelligence for IT operations. It is a paradigm shift that allows machines to solve IT issues without the need for human assistance or interaction. AIOps uses machine learning and analytics to analyze big data obtained via different tools, which allows issues to be spotted automatically and dealt with in real time. (Confusingly, AIOps is also sometimes used to describe the operationalization of AI projects, but we will stick with the definition used by Gartner and others, as described here.)

As part of DataOps, AIOps supports continuous integration and deployment for the core tech functions of machine learning and big data. AIOps helps automate operations across hybrid environments. AIOps includes the use of machine learning to detect patterns and reduce noise.

See Unravel AI and automation in action

Create a free account

The Origin of AIOps

AIOps was originally defined in 2017 by Gartner as a means to describe the growing interest and investment in applying a broad spectrum of AI capabilities to enterprise IT operations management challenges.

Gartner defines AIOps as platforms that utilize big data, machine learning, and other advanced analytics technologies to directly and indirectly enhance IT operations (such as monitoring, automation and service desk) functions with proactive, personal, and dynamic insight.

Put another way, AIOps refers to improving the way IT organizations manage data and information in their environments using artificial intelligence.

Who uses AIOps

From enterprises with large, complex environments, to cloud-native small and medium enterprises (SMEs), AIOps is being used globally by organizations of all sizes in a variety of industries. AIOps is most often bought in as products or services from specialist companies; few large organizations are using their own in-house AI expertise to solve IT operations problems.

Companies with extensive IT environments, spanning multiple technology types, are adopting AIOps, especially when they face issues of scale and complexity. AIOps can make a significant contribution when those challenges are layered on top of a business model that is dependent on IT. If the business needs to be agile and to quickly respond to market forces, IT will rely upon AIOps to help IT be just as agile in supporting the goals of the business.

AIOps is not just for massive enterprises, though. SMEs that need to develop and release software continuously are embracing AIOps as it allows them to continually improve their digital services, while preventing malfunctions and outages.

The ultimate goal of AIOps is to enhance IT Operations. AIOps delivers intelligent filtering of signals out of the noise in IT systems. AIOps intelligently observes IT operations data in order to identify root causes and recommend solutions quickly. In some cases, these recommendations can even be implemented without human interaction.

What is MLOps

MLOps, or Machine Learning Operations, helps simplify the management, logistics, and deployment of machine learning models between operations teams and machine learning researchers.

MLOps is like DataOps – the fusion of a discipline (machine learning in one case, data science in the other) and the operationalization of projects from that discipline. MLOps and DataOps are different from AIOps, which is the use of AI to improve IT operations.

MLOps takes the DevOps methodology of continuous integration and continuous deployment and applies it to machine learning. As in traditional development, there is code that needs to be written and deployed, as well as bug testing to be done, and changes in user requirements to be accommodated. Specific to the topic of machine learning, models are being trained with data, and new data is introduced to retrain the models again and again.

The Origin of MLOps

According to Forbes, the origins of MLOps date back to 2015, to a paper entitled “Hidden Technical Debt in Machine Learning Systems.” The paper argued that machine learning offered an incredibly powerful toolkit for building useful complex prediction systems quickly, but that it was dangerous to think of these quick wins as coming for free.

Who Uses MLOps

Data scientists tend to focus on the development of models that deliver valuable insights to your organization more quickly. Ops people tend to focus on running those models in production. MLOps unifies the two approaches into a single, flexible practice, focused on the delivery of business value through the use of machine learning models and relevant input data.

Because MLOps follows a similar pattern to DevOps, MLOps creates a seamless integration between your development cycle and your overall operations process that transforms how your organization handles the use of big data as input to machine learning models. MLOps helps drive insights that you can count on and put into practice.

Streamlining MLOps is critical to organizations that are developing Machine Learning models, as well as to end users who use applications that rely on these models.

According to research from Finances Online, machine learning applications and platforms account for 57% (or $42 Billion) in AI funding worldwide. Organizations that are making significant investments want to ensure they are deriving significant value.

As an example of the impact of MLOps, 97% of all mobile users use AI-powered voice assistants that depend on machine learning models, and thus benefit from MLOps. These voice assistants are the result of a subset of ML known as deep learning. This deep learning technology is built around machine learning and is at the core of platforms such as Apple’s Siri, Amazon Echo, and Google Assistant.

The goal of MLOps is to bridge the gap between data scientists and operations teams to deliver insights from machine learning models that can be put into use immediately.

Conclusion

Here at Unravel Data, we deliver a DataOps platform that uses AI-powered recommendations – AIOps – to help proactively identify and resolve operations problems. This platform complements the adoption of DataOps practices in an organization, with the end results including improved application performance, fewer operational hassles, lower costs, and the ability for IT to take on new initiatives that further benefit organizations.

Our platform does not explicitly support MLOps, though MLOps always occurs in a DataOps context. That is, machine learning models run on data – usually, on big data – and their outputs can also serve as inputs to additional processes, before business results are achieved.

As DataOps, AIOps, and MLOps proliferate – as working practices, and in the form of platforms and software tools that support agile XOPs approaches – complex stacks will be simplified and made to run much faster, with fewer problems, and at less cost. And overburdened IT organizations will truly be able to do more with less, leading to new and improved products and services that perhaps can’t all be imagined today.

If you want to know more about DataOps, check out the recorded sessions from the first DataOps industry conference, DataOps Unleashed. To learn more about the Unravel Data DataOps platform, you can create a free account or contact us.

The post The Guide To DataOps, AIOps, and MLOPs in 2022 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-aiops-and-mlops/feed/ 0
DataOps Has Unleashed, Part 2 https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/ https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/#respond Wed, 31 Mar 2021 02:52:03 +0000 https://www.unraveldata.com/?p=6558 DataOps Background

The DataOps Unleashed conference, founded this year by Unravel Data, was full of interesting presentations and discussions. We described the initial keynote and some highlights from the first half of the day in our DataOps Unleashed […]

The post DataOps Has Unleashed, Part 2 appeared first on Unravel.

]]>
DataOps Background

The DataOps Unleashed conference, founded this year by Unravel Data, was full of interesting presentations and discussions. We described the initial keynote and some highlights from the first half of the day in our DataOps Unleashed Part 1 blog post. Here, in Part 2, we bring you highlights through the end of the day: more about what DataOps is, and case studies as to how DataOps is easing and speeding data-related workflows in big, well-known companies.

You can freely access videos from DataOps Unleashed – most just 30 minutes in length, with a lot of information packed into hot takes on indispensable technologies and tools, key issues, and best practices. We highlight some of the best-received talks here, but we also urge you to check out any and all sessions that are relevant to you and your work.  

Mastercard Pre-Empts Harmful Workloads

See the Mastercard session now, on-demand. 

Chinmay Sagade and Srinivasa Gajula of Mastercard are responsible for helping the payments giant respond effectively to the flood of transactions that has come their way due to the pandemic, with cashless, touchless, and online transactions all skyrocketing. And much of the change is likely to be permanent, as approaches to doing business that were new and emerging before 2020 become mainstream. 

Hadoop is a major player at Mastercard. Their largest cluster is petabytes in size, and they use Impala for SQL workloads, as well as Spark and Hive. But they have in the past been plagued by services being unavailable, applications failing, and bottlenecks caused by suboptimal use of workloads.

Mastercard has used Unravel to help application owners self-tune their workloads and to create an automated monitoring system to detect toxic workloads and automatically intervene to prevent serious problems. For instance, they proactively detect large broadcast joins in Impala, which tend to consume tremendous resources. They also detect cross-joins in queries. 
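
The talk doesn't detail Mastercard's internal rules, and Unravel detects these patterns from the runtime data it collects, but as a rough illustration of the idea, even a simple heuristic scan over SQL text can flag explicit cross joins or joins written without a join condition before they run. The patterns below are illustrative, not exhaustive.

```python
import re

# Illustrative heuristics only; a production check would inspect the query plan
# rather than the raw SQL text.
def flag_risky_query(sql: str) -> list:
    findings = []
    if re.search(r"\bcross\s+join\b", sql, re.IGNORECASE):
        findings.append("explicit CROSS JOIN")
    if re.search(r"\bjoin\b", sql, re.IGNORECASE) and not re.search(r"\b(on|using)\b", sql, re.IGNORECASE):
        findings.append("JOIN with no ON/USING clause")
    return findings

# Example:
print(flag_risky_query("SELECT * FROM txns t CROSS JOIN merchants m"))
# -> ['explicit CROSS JOIN', 'JOIN with no ON/USING clause']
```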

Their work has delivered tremendous business value:

  • Vastly improved reliability
  • Better configurations free up resources
  • Reduced problems free up time for troubleshooting recurring issues
  • Better capacity usage and forecasting saves infrastructure costs

To the team’s surprise, users were hugely supportive of restrictions, because they could see the positive impact on performance and reliability. And the entire estate now works much better, freeing resources for new initiatives. 

Take the hassle out of petabyte-scale DataOps

Create a free account

Gartner CDO DataOps Panel Shows How to Maximize Opportunities

See the Gartner CDO Panel now, on-demand.

As the VP in charge of big data and advanced analytics at leading industry analyst firm Gartner, Sanjeev Mohan has seen it all. So he had some incisive questions for his panel of four CDOs. A few highlights follow.

Paloma Gonzalez Martinez is CDO at AlphaCredit, one of the fastest-growing financial technology companies in Latin America. She was asked: How has data architecture evolved? And, if you had a chance to do your big data efforts over again, how would you do things differently? 

Paloma shared that her company actually is revisiting their whole approach. The data architecture was originally designed around data; AlphaCredit is now re-architecting around customer and business needs. 

David Lloyd is CDO at Ceridian, a human resources (HR) services provider in the American Midwest. David weighed in on the following: What are the hardest roles to fill on your data team? And, how are these roles changing?

David said that one of the guiding principles he uses in hiring is to see how a candidate’s eyes light up around data. What opportunities do they see? How do they want to help take advantage of them? 

Kumar Menon is SVP of Data Fabric and Decision Science Technology at Equifax, a leading credit bureau. With new candidates, Kumar looks for the intersection of engineering and insights. How does one build platforms that identify crucial features, then share them quickly and effectively? When does a model need to be optimized, and when does it need to be rebuilt?

Sarah Gadd is Head of Semantic Technology, Analytics and Machine Intelligence at Credit Suisse. (Credit Suisse recently named Unravel Data a winner of their 2021 Disruptive Technologies award.) Technical problems prevented Sarah from participating live, but she contributed answers to the questions that were discussed.

Sarah looks for storytellers to help organize data and analytics understandably, and is always on the lookout for technical professionals who deeply understand the role of the semantic layer in data models. And in relation to data architecture, the team faces a degree of technical debt, so innovation is incremental rather than sweeping at this point. 

84.51°/Kroger Solves Contention and the Small Files Problem with DataOps Techniques

See the 84.51°/Kroger session now, on-demand. 

Jeff Lambert and Suresh Devarakonda are DataOps leaders at the 84.51° analytics business of retailing giant Kroger. Their entire organization is all about deriving value from data and delivering that value to Kroger, their customers, partners, and suppliers. They use Yarn and Impala as key tools in their data stack. 

They had a significant problem with jobs that created hundreds of small files, which consumed system resources way out of proportion to the file sizes. They have built executive dashboards that have helped stop finger-pointing and begin solving problems based on shared, trusted information. 

Unravel Data has been a key tool in helping 84.51° to adopt a DataOps approach and get all of this done. They are expanding their cloud presence on Azure, using Databricks and Snowflake. Unravel gives them visibility, management capability, and automatically generated actions and recommendations, making their data pipelines work much better. 84.51° has just completed a proof of concept (PoC) for Unravel on Azure and Databricks, and is making heavy use of the recently introduced Spark 3.0 support. 

In one incident, resource contention was caused by a rogue app that spiked memory usage. Using Unravel, 84.51° quickly found the offending app, killed it, and worked with the owner to prevent the problem in the future. 84.51° now proactively scans for small files and other concerning patterns using Unravel, heading off problems in advance. Unravel also helps move problems up to a higher level of abstraction, so operations work doesn't require that operators be expert in all of the technologies they're responsible for managing. 
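
Unravel automates this kind of detection, but the basic shape of a small-files scan is easy to picture. The sketch below is a hypothetical, simplified version (the path and size threshold are invented, and it assumes the hdfs CLI is available on the host; a production scanner would work from NameNode metadata or the fsimage instead):

    import subprocess
    from collections import Counter

    # Simplified sketch: count files below a size threshold per directory by
    # parsing "hdfs dfs -ls -R" output. Path and threshold are illustrative.
    SMALL_FILE_BYTES = 1 * 1024 * 1024  # flag files under 1 MB

    def small_file_counts(root_path):
        listing = subprocess.run(
            ["hdfs", "dfs", "-ls", "-R", root_path],
            capture_output=True, text=True, check=True,
        ).stdout
        counts = Counter()
        for line in listing.splitlines():
            parts = line.split()
            # Typical line: permissions, replication, owner, group, size, date, time, path
            if len(parts) >= 8 and not parts[0].startswith("d"):
                size, path = int(parts[4]), parts[7]
                if size < SMALL_FILE_BYTES:
                    counts[path.rsplit("/", 1)[0]] += 1
        return counts

    for directory, count in small_file_counts("/data/warehouse").most_common(10):
        print(directory, count, "small files")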

At 84.51°, Unravel has helped the team improve not only their own work, but what they deliver to the company:

  • Solving the small files problem improves performance and reliability
  • Spotting and resolving issues early prevents disruption and missed SLAs
  • Improved resource usage and availability saves money and increases trust 
  • More production from less investment allows innovation to replace disruption 

Cutting Cloud Costs with Amazon EMR and Unravel Data

See the AWS session now, on-demand. 

As our own Sandeep Uttamchandani says, “Once you get into the cloud, ‘pay as you go’ takes on a whole new meaning: as you go, you pay.” But AWS Solutions Architect Angelo Carvalho is here to help. AWS understands that their business will grow healthier if customers are wringing the most value out of their cloud costs, and Angelo uses this talk to help people do so. 

Angelo described the range of AWS services around big data, including EMR support for Spark, Hive, Presto, HBase, Flink, and more. He emphasized EMR Managed Scaling, which makes scaling automatic and takes advantage of cloud features to save money compared to on-premises, where you need to have enough servers all the time to support peak workloads that only occur some of the time. (And where you can easily be overwhelmed by unexpected spikes.)  
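
For readers curious what turning this on looks like, here is a rough boto3 sketch of attaching a managed scaling policy to an existing EMR cluster; the cluster ID and capacity limits are placeholders, not recommended values:

    import boto3

    # Rough sketch: attach an EMR managed scaling policy to an existing cluster.
    # The cluster ID and capacity limits below are placeholders, not recommendations.
    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",            # or "VCPU" / "InstanceFleetUnits"
                "MinimumCapacityUnits": 2,          # keep a small baseline footprint
                "MaximumCapacityUnits": 20,         # cap the scale-out during spikes
                "MaximumOnDemandCapacityUnits": 5,  # rely on Spot capacity above this
            }
        },
    )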

Angelo was followed by Kunal Agarwal of Unravel Data, who showed how Unravel optimizes EMR. Unravel creates an interconnected model of the data in EMR and applies predictive analytics and machine learning to it. Unravel automatically optimizes some areas, offers alerts for others, and provides dashboards and reports to help you manage both price and performance. 

Kunal then showed how this actually works in Unravel, demonstrating a few key features, such as automatically generated, proactive recommendations for right-sizing resource use and managing jobs. The lessons from this session apply well beyond EMR, and even beyond the cloud, to anyone who needs to run their jobs with the highest performance and the most efficient use of available resources. 

Need to get cloud overspending under control?

Create a free account

Microsoft Describes How to Modernize your Data Estate

See the Microsoft session now, on-demand.

According to Priya Vijayarajendran, VP for Data and AI at Microsoft, a modern, cloud-based strategy underpins success in digital transformation. Microsoft is enabling this shift for customers and undertaking a similar journey themselves. 

Priya describes data as a strategic asset. Managing data is not a liability or a problem, but a major opportunity. She shows how even a simplified data estate is very complex, requiring constant attention. 

Priya tackled the “what is DataOps” challenge, describing the use of DevOps techniques, agile methods, and statistical process control to intelligently manage data as a strategic asset. She displayed a reference architecture for continuous integration and continuous delivery on the Azure platform. 

Priya ended by offering to interact with the community around developing ever-better answers to the challenges and opportunities that data provides, whether on Microsoft platforms or more broadly. Microsoft is offering multi-cloud products that work on AWS and Google Cloud Platform as well as Azure. She said that governance should not be restrictive, but instead should enable people to do more. 

A Superset of Advanced Topics for Data Engineers

See the Apache Superset session now, on-demand.

Max Beauchemin is the original creator of both Apache Airflow (mentioned in our previous Unleashed blog post) and Apache Superset, and is currently CEO at Preset, the company bringing Superset to market. Superset is the leading open-source analytics platform and is widely used at major companies. Preset makes Superset available as a managed service in the cloud. 

Max discussed advanced, high-impact topics around Superset. He began with a demo of SQL Lab, a SQL development environment built in React, then showed how to build a visualization plugin, how to create alerts, reports, charts, and dashboards, and how to use the Superset REST API. 

Templating is a key feature in SQL Lab, allowing users to build a framework that they can easily adapt to a wide variety of SQL queries. Built on Python, Jinja allows you to use macros in your SQL code. Jinja integrates with Superset, Airflow, and other open source technologies. A parameter can be specified as type JSON, so values can be filled in at runtime. 
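
The general idea can be illustrated with the standalone jinja2 package rendering a parameterized query. Note that this is only a stand-in: in SQL Lab, Superset supplies the template context (filters, current user, URL parameters) itself, and the table and parameters below are invented for illustration.

    from jinja2 import Template

    # Stand-in example: render a parameterized SQL string with Jinja.
    # In Superset's SQL Lab, the template context is supplied by Superset itself
    # rather than passed in by hand like this.
    sql_template = Template("""
    SELECT order_date, SUM(amount) AS revenue
    FROM   sales
    WHERE  region = '{{ region }}'
      AND  order_date >= '{{ start_date }}'
    GROUP BY order_date
    """)

    print(sql_template.render(region="EMEA", start_date="2021-01-01"))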

With this talk, Max gave the audience the information they need to plan, and begin to implement, ambitious Superset projects that work across a range of technologies. 

Soda Delivers a Technologist’s Call to Arms for DataOps

See the Soda DataOps session now, on-demand.

What does DataOps really mean to practitioners? Vijay Karan, Head of Data Engineering at Soda, shows how DataOps applies at each stage of moving data across the stack, from ingest to analytics. 

Soda is a data monitoring platform that supports data integrity, so Vijay is in a good position to understand the importance of DataOps. He discusses core principles of DataOps and how to apply those principles in your own projects. 

Vijay begins with the best brief description of DataOps, from a practitioner’s point of view, that we’ve heard yet:

What is DataOps?

A process framework that helps data teams deliver high quality, reliable data insights with high velocity.

At just sixteen words, this is admirably concise. In fact, to boil it down to just seven words, “A process framework that helps data teams” is not a bad description.

Vijay goes on to share half a dozen core DataOps principles, and then delivers a deep dive on each of them. 

Here at Unravel, part of what we deliver is in his fourth principle:

Improve Observability

Monitor quality and performance metrics across data flows

In this one area alone, if everyone did what Vijay suggests – defining metrics, visualizing them, configuring meaningful alerts – the world would be a calmer and more productive place. 

Conclusion

This wraps up our overview description of DataOps Unleashed. If you haven’t already done so, please check out Part 1, highlighting the keynote and talks discussing Adobe, Cox Automotive, Airflow, Great Expectations, DataOps, and moving Snowflake to space. 

However, while this blog post gives you some idea as to what happened, nothing can take the place of “attending” sessions yourself, by viewing the recordings. You can view the videos from DataOps Unleashed here. You can also download The Unravel Guide to DataOps, which was made available for the first time during the conference. 

Finding Out More

Read our blog post Why DataOps Is Critical for Your Business.

The post DataOps Has Unleashed, Part 2 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/feed/ 0
DataOps Has Unleashed, Part 1 https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/ https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/#respond Wed, 24 Mar 2021 04:10:27 +0000 https://www.unraveldata.com/?p=6433 DataOps Background

DataOps Unleashed launched as a huge success, with scores of speakers, thousands of registrants, and way too many talks for anyone to take in all at once. Luckily, as a virtual event, all sessions are available […]

The post DataOps Has Unleashed, Part 1 appeared first on Unravel.

]]>
DataOps Background

DataOps Unleashed launched as a huge success, with scores of speakers, thousands of registrants, and way too many talks for anyone to take in all at once. Luckily, since it was a virtual event, all sessions are available for instant viewing, and attendees can keep coming back for more. (You can click here to see some of the videos, or visit Part 2 of this blog post.)

Kunal Agarwal, CEO of Unravel Data, kicked off the event with a rousing keynote, describing DataOps in depth. Kunal introduced the DataOps Infinity Loop, with ten integrated and overlapping stages. He showed how teams work together, across and around the loop, to solve the problems that arise as both data volumes and business challenges escalate. 

Kunal introduced three primary challenges that DataOps addresses, and that everyone assembled needs to solve, in order to make progress:

  • Complexity. A typical modern stack and pipeline has about a dozen components, and the data estate as a whole has many more. All this capability brings power – and complexity. 
  • Crew. Small data teams – the crew – face staggering workloads. Finding qualified, experienced people, and empowering them, may be the biggest challenge. 
  • Coordination. The secret to team productivity is coordination. DataOps, and the DataOps lifecycle, are powerful coordination frameworks. 

These challenges resonated across the day’s presentations. Adobe, Kroger, Cox Automotive, Mastercard, Astronomer, and Superconductive described the Unravel Data platform as an important ally in tackling complexity. And Vijay Karan, Head of Data Engineering at Soda, emphasized the role of collaboration in making teams effective. The lack of available talent to expand one’s Crew – and the importance of empowering one’s team members – came up again and again. 

There were many highlights in each presentation. A few that stand out from the morning sessions are Adobe, moving their entire business to the cloud; Airflow, a leader in orchestration; Cox Automotive, running a global business with a seven-person data team; and Great Expectations, which focuses on data reliability. 

How Adobe is Moving a $200B Business to the Cloud

Adobe was up next, with an in-depth case study covering the company’s initial steps toward cloud migration. Adobe is one of the original godfathers of today’s digital world, powering so much of the creativity seen in media, entertainment, on websites, and everywhere one looks. The company was founded in the early 1980s, and gained their original fame with PostScript and the laser printer. But the value of the business did not really take off until the past ten years, when the company moved from boxed software to a subscription business, using a SaaS model. 

Now, Adobe is moving their entire business to the cloud. They describe four lessons they’ve learned in the move:

  • Ignorance is not bliss. In planning their move to the cloud, Adobe realized that they had a large blind spot about what work was running on-premises. This may seem surprising until you check and realize that your company may have this problem too. 
  • Comparison-shop now. You need to compare your on-premises cost and performance per workload to what you can expect in the cloud. The only way to do this is to use a tool that profiles your on-premises workloads and maps each to the appropriate infrastructure and costs on each potential cloud platform. 
  • Optimize first. Moving an unoptimized workload to the cloud is asking for trouble – and big bills. It’s critically important to optimize workloads on-premises to reduce the hassle of cloud migration and the expense of running the workload in the cloud. 
  • Manage effectively. On-premises workloads may run poorly without too much hassle, but running workloads poorly in the cloud adds immediately, and unendingly, to costs. Effective management tools are needed to help keep performance high and costs under budget. 

Kevin Davis, Application Engineering Manager, was very clear that Adobe has gained the clarity they need only through the use of Unravel Data for cloud migration and for performance management, both on-premises and in the cloud. Unravel allows Adobe to profile their on-premises workloads; map each workload to appropriate services in the cloud; compare cost and performance on-premises to what they could expect in the cloud; optimize workloads on-premises before the move; and carefully manage workload cost and performance, after migration, in the cloud. Unravel’s automatic insights increase the productivity of their DataOps crew. 

Cloud DataOps at scale with Unravel

Try Unravel for free

Cox Automotive Scales Worldwide with Databricks

Cox Automotive followed with a stack optimization case study. Cox Automotive is a global business, with wholesale and retail offerings supporting the lifecycle of motor vehicles. Their data services team, a mighty team of only seven people, supports the UK and Europe, offering reporting, analytics, and data science services to the businesses. 

The data services team is moving from Hadoop clusters, deployed manually, to a full platform-as-a-service (PaaS) setup using Databricks on Microsoft Azure. As they execute this transition, they are automating everything possible. Databricks allows them to spin up Spark environments quickly, and Unravel helps them automate pipeline health management. 

Data comes from a range of sources – mostly as files, with streaming expected soon. Cox finds Unravel particularly useful for optimization: choosing virtual machine types in Databricks, watching memory usage by jobs, making jobs run faster, optimizing cost. These are all capabilities the team has trouble finding in other tools, and can’t readily build themselves. They have an ongoing need for the visibility and observability that Unravel gives them. Unravel helps with optimization and is “really strong for observability in our platform.” 

Great Expectations for Data Reliability

Great Expectations shared best practices on data reliability. Great Expectations is the leading open-source package for data reliability, which is crucial to DataOps success. An expectation is an assertion about data; when data falls outside these expectations, an exception is raised, making it easier to manage outliers and errors. Great Expectations integrates with a range of DataOps tools, making its developers DataOps insiders. Superconductive provides support for Great Expectations and is a leading contributor to the open source project.
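
As a minimal illustration of the idea (the file and column names are invented, and Great Expectations’ API has evolved across versions), the classic pandas-based interface lets you assert expectations directly against a dataset:

    import great_expectations as ge

    # Minimal sketch (invented file and columns): load a dataset and assert
    # expectations about it, reporting any that fail.
    df = ge.read_csv("transactions.csv")

    checks = [
        df.expect_column_values_to_not_be_null("customer_id"),
        df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000),
    ]

    for result in checks:
        if not result.success:
            print("Expectation failed:", result)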

The stages where Great Expectations works map directly to the DataOps infinity loop. Static data must be prepared, cleaned, stored, and used for model development and testing. Then, when the model is deployed, live data must be cleansed, in real time, and run into and through the model. Results go out to users and apps, and quality concerns feed back to operations and development. 

Airflow Enhances Observability

Astronomer conducted a master class on observability. Astronomer was founded to help organizations adopt Apache Airflow. Airflow is open source software for programmatically creating, scheduling, and monitoring complex workflows, including core DataOps tasks such as data pipelines used to feed machine learning models. To construct workflows, users create task flows called directed acyclic graphs (DAGs) in Python. 

The figure shown in the session illustrates a neat integration between Airflow and Unravel: specifically, how Unravel can provide end-to-end observability and automatic optimization for Airflow pipelines. In this example, it’s a simple DAG containing a few Hive and Spark stages. Data is passed from Airflow into Unravel via REST APIs, which creates an easy-to-understand interface and allows Unravel to generate automated insights for these pipelines. 
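
Independent of the Unravel integration, a DAG of that shape might be declared roughly as follows; the operator import path assumes Airflow 2.x, and the Hive scripts and spark-submit command are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    # Hypothetical pipeline with Hive and Spark stages; the HQL files and
    # spark-submit command are placeholders.
    with DAG(
        dag_id="daily_sales_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stage_raw = BashOperator(
            task_id="stage_raw_data",
            bash_command="hive -f /opt/etl/stage_raw_sales.hql",
        )
        transform = BashOperator(
            task_id="spark_transform",
            bash_command="spark-submit --deploy-mode cluster /opt/etl/transform_sales.py",
        )
        publish = BashOperator(
            task_id="publish_aggregates",
            bash_command="hive -f /opt/etl/publish_aggregates.hql",
        )

        stage_raw >> transform >> publish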

DataOps Powers the New Data Stack

Unravel Data co-founder Shivnath Babu described and demystified the new data stack that is the focus of today’s DataOps practice. This stack easily supports new technologies such as advanced analytics and machine learning. However, this stack, while powerful, is complex, and operational challenges can derail its success. 

Shivnath showed an example of the new data stack, with Databricks providing Spark support, Azure Data Lake for storage, Airflow for orchestration, dbt for data transformation, and Great Expectations for data quality and validation. Slack provides communications and displays alerts, and Unravel Data provides end-to-end observability and automated recommendations for configuration management, troubleshooting, and optimization. 

In Shivnath’s demo, he showed pipelines in danger of missing performance SLAs, overrunning on costs, or hanging due to data quality problems. Unravel’s Pipeline Observer allows close monitoring, and alerts feed into Slack. 
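
The alert-to-Slack leg of such a setup is straightforward to picture. Below is a bare-bones sketch using a Slack incoming webhook; the webhook URL, pipeline name, and SLA threshold are placeholders, and Unravel’s Pipeline Observer handles this natively rather than requiring hand-rolled code:

    import requests

    # Bare-bones sketch: post an SLA alert to Slack via an incoming webhook.
    # The webhook URL, pipeline name, and threshold are placeholders.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"

    def alert_if_sla_at_risk(pipeline, elapsed_minutes, sla_minutes):
        if elapsed_minutes <= 0.8 * sla_minutes:
            return  # still comfortably within the SLA
        message = (
            f":warning: Pipeline '{pipeline}' has been running {elapsed_minutes:.0f} min "
            f"against an SLA of {sla_minutes:.0f} min."
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

    alert_if_sla_at_risk("daily_sales_pipeline", elapsed_minutes=55, sla_minutes=60)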

The goal, in Shivnath’s talk and across all of DataOps, is for companies to move up the data pipeline maturity scale – from problems detected after the fact, and fixed weeks later, to problems detected proactively, root-cause analyzed (RCA’d) automatically, and even self-healing. 

Simplify modern data pipeline complexity

Try Unravel for free

OneWeb Takes the Infinity Loop to the Stars

To finish the first half of the day, OneWeb showed how to take Snowflake beyond the clouds – the ones that you can see in the sky over your head. OneWeb is a global satellite network provider that takes Internet connectivity to space, reaching anywhere on the globe. They are going to near-Earth orbit with Snowflake, using a boost from DataOps.live. 

OneWeb connects to their satellites with antenna arrays that require lots of room, isolation – and near-perfect connectivity. Since customer connectivity can’t drop, reliability is crucial across their estate, and a DataOps-powered approach is a necessity for keeping OneWeb online. 

One More Thing…

All of this is just part of what happened – in the first half of the day! We’ll provide a further update soon, including – wait for it – the state of DataOps, how to create a data-driven culture, customer case studies from 84.51°/Kroger and Mastercard, and much, much more. You can view the videos from DataOps Unleashed here. You can also download The Unravel Guide to DataOps, which was made available for the first time during the conference. 

Finding Out More

Read our blog post Why DataOps Is Critical for Your Business.

The post DataOps Has Unleashed, Part 1 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/feed/ 0
Why DataOps is Critical For The Success Of Your Business https://www.unraveldata.com/resources/why-dataops-is-critical/ https://www.unraveldata.com/resources/why-dataops-is-critical/#respond Thu, 18 Mar 2021 18:08:14 +0000 https://www.unraveldata.com/?p=6368

What is DataOps? DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. Data products include data pipelines, data-dependent apps, dashboards, machine learning models, other AI-powered software, and […]

The post Why DataOps is Critical For The Success Of Your Business appeared first on Unravel.

]]>

What is DataOps?

DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. Data products include data pipelines, data-dependent apps, dashboards, machine learning models, other AI-powered software, and even answers to ad hoc SQL queries. DataOps is practiced by modern data teams, including data engineers, architects, analysts, scientists and operations.

What are Data Products?

A data product is any tool or application that processes data and generates results. The primary objective of data products is to manage, organize, and make sense of the large amount of data that an organization generates and collects.

What is a DataOps Platform?

DataOps is more of a methodology than a specific, discrete platform. However, a platform that supports multiple aspects of DataOps practices can assist in the adoption and effectiveness of DataOps.

Every organization today is in the process of harnessing the power of their data using advanced analytics, which is likely running on a modern data stack. On top of the data stack, organizations create “data products.”

These data products range from advanced analytics, data pipelines, and machine learning models to embedded AI solutions. All of these work together to help organizations gain insights, make decisions, and power applications. To get the most out of these data products, companies employ a DataOps methodology that allows them to efficiently extract value from their data.

What does DataOps do for an organization?

DataOps is a set of agile-based development practices that make it faster, easier, and less costly to develop, deploy, and optimize data-powered applications.

Using an agile approach, an organization identifies a problem to solve, and then breaks it down into smaller pieces. Each piece is then assigned to a team that schedules the work to solve the problem into a defined period of time – usually called a sprint – that includes planning, work, deployment, and review.

Who benefits from DataOps?

DataOps is for organizations that want to not only succeed, but to outpace the competition. With DataOps, an organization is continually striving to create better ways to manage their data, which should lead to better, data-informed decisions for the business.

The practice of DataOps can benefit organizations by fostering cross-functional collaboration between teams of data scientists, data engineers, data analysts, operations, and product owners. Each of these roles needs to be in sync in order to use the data in the most efficient manner, and DataOps strives to accomplish this.

Research by Forrester indicates that companies that embed analytics and data science into their operating models to bring actionable knowledge into every decision are at least twice as likely to be in a market-leading position than their industry peers.

10 Steps of the DataOps Lifecycle

DataOps is not limited to making existing data pipelines work effectively, ensuring that reports and the inputs and outputs of artificial intelligence and machine learning models appear as needed, and so on. DataOps actually includes all parts of the data management lifecycle.

The DataOps lifecycle takes data teams on a journey from raw data to insights. Where possible, DataOps stages are automated to accelerate time to value. The steps below show the full lifecycle of a data-driven application.

  1. Plan. Define how a business problem can be solved using data analytics. Identify the needed sources of data and the processing and analytics steps that will be required to solve the problem. Then select the right technologies, along with the delivery platform, and specify available budget and performance requirements.
  2. Create. Create the data pipelines and application code that will ingest, transform, and analyze the data. Based on the desired outcome, data applications are written using SQL, Scala, Python, R, or Java, among others.
  3. Orchestrate. Connect the stages that need to work together to produce the desired result. Schedule code execution, with reference to when the results are needed; when cost-effective processing is most available; and when related jobs (inputs and outputs, or steps in a pipeline) are running.
  4. Test & Fix. Simulate the process of running the code against the data sources in a sandbox environment. Identify and remove any bottlenecks in data pipelines. Verify results for correctness, quality, performance, and efficiency.
  5. Continuous Integration. Verify that the revised code meets established criteria to be promoted into production. Integrate the latest, tested and verified code and data sources incrementally, to speed improvements and reduce risk.
  6. Deploy. Select the best scheduling window for job execution based on SLAs and budget. Verify that the changes are an improvement; if not, roll them back, and revise.
  7. Operate. Code runs against data, solving the business problem, and stakeholder feedback is solicited. Detect and fix deviations in performance to ensure that SLAs are met.
  8. Monitor. Observe the full stack, including data pipelines and code execution, end-to-end. Data operators and engineers use tools to observe the progress of code running against data in a busy environment, solving problems as they arise.
  9. Optimize. Constantly improve the performance, quality, cost, and business outcomes of data applications and pipelines. Team members work together to optimize the application’s resource usage and improve its performance and effectiveness.
  10. Feedback. The team gathers feedback from all stakeholders – the data team itself, app users, and line of business owners. The team compares results to business success criteria and delivers input to the Plan phase.

There are two overarching characteristics of DataOps that apply to every stage in the DataOps lifecycle: end-to-end observability and real-time collaboration.

End-to-End Observability

End-to-end observability is key to delivering high-quality data products, on time and under budget. You need to be able to measure key KPIs about your data-driven applications, the data sets they process, and the resources they consume. Key metrics include application / pipeline latency, SLA score, error rate, result correctness, cost of run, resource usage, data quality, and data usage.

You need this visibility horizontally – across every stage and service of the data pipeline – and vertically, to see whether it is the application code, service, container, data set, infrastructure or another layer that is experiencing problems. End-to-end observability provides a single, trusted “source of truth” for data teams and data product users to collaborate around.
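
As a toy illustration of capturing two of these metrics (latency and error status) for a pipeline stage, consider the decorator below; the stage name and logging sink are arbitrary, and a real deployment would ship these measurements to an observability platform rather than a log:

    import logging
    import time
    from functools import wraps

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline_metrics")

    # Toy sketch: record latency and success/error status for each pipeline stage.
    def observed_stage(stage_name):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.time()
                try:
                    result = func(*args, **kwargs)
                    log.info("stage=%s status=success latency_s=%.2f",
                             stage_name, time.time() - start)
                    return result
                except Exception:
                    log.error("stage=%s status=error latency_s=%.2f",
                              stage_name, time.time() - start)
                    raise
            return wrapper
        return decorator

    @observed_stage("transform_sales")
    def transform_sales():
        time.sleep(0.1)  # stand-in for real work

    transform_sales()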

Real-Time Collaboration

Real-time collaboration is crucial to agile techniques; dividing work into short sprints, for instance, provides a work rhythm across teams. The DataOps lifecycle helps teams identify where in the loop they’re working, and to reach out to other stages as needed to solve problems – both in the moment, and for the long term.

Real-time collaboration requires open discussion of results as they occur. The observability platform provides a single source of truth that grounds every discussion in shared facts. Only through real-time collaboration can a relatively small team have an outsized impact on the daily and long-term delivery of high-quality data products.

Conclusion

Through the use of a DataOps approach to their work, and careful attention to each step in the DataOps lifecycle, data teams can improve their productivity and the quality of the results they deliver to the organization.

As the ability to deliver predictable and reliable business value from data assets increases, the business as a whole will be able to make more and better use of data in decision-making, product development, and service delivery.

Advanced technologies, such as artificial intelligence and machine learning, can be implemented faster and with better results, leading to competitive differentiation and, in many cases, industry leadership.

The post Why DataOps is Critical For The Success Of Your Business appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-dataops-is-critical/feed/ 0
Unravel Data2021 Infographic https://www.unraveldata.com/resources/unravel-data2021-infographic/ https://www.unraveldata.com/resources/unravel-data2021-infographic/#respond Tue, 16 Mar 2021 22:20:59 +0000 https://www.unraveldata.com/?p=6360 abstract image with numbers

Thank you for your interest in the Unravel Data2021 Infographic. You can download it here.

The post Unravel Data2021 Infographic appeared first on Unravel.

]]>
abstract image with numbers

Thank you for your interest in the Unravel Data2021 Infographic.

You can download it here.

The post Unravel Data2021 Infographic appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data2021-infographic/feed/ 0
Developing Data Literacy and Standardized Business Metrics at Tailored Brands https://www.unraveldata.com/resources/developing-data-literacy-and-standardized-business-metrics-at-tailored-brands/ https://www.unraveldata.com/resources/developing-data-literacy-and-standardized-business-metrics-at-tailored-brands/#respond Thu, 11 Mar 2021 14:00:44 +0000 https://www.unraveldata.com/?p=6238 Laptops Two Men Brainstorming

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Meenal Iyer, Sr. Director of Enterprise Analytics and Data at Tailored Brands. They discuss battlescars in two areas, data and metrics: Growing Data […]

The post Developing Data Literacy and Standardized Business Metrics at Tailored Brands appeared first on Unravel.

]]>
Laptops Two Men Brainstorming

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Meenal Iyer, Sr. Director of Enterprise Analytics and Data at Tailored Brands. They discuss battlescars in two areas, data and metrics: Growing Data Literacy and Developing a Data-Driven Culture and Standardization of Business Metrics.

Meenal brings in 20+ years of data analytics experience across retail, travel, and financial services. Meenal has been transforming enterprises to become data-driven and shares interesting, domain-agnostic lessons from her experience. Here are the key talking points from their chat.

Growing Data Literacy and Developing a Data-Driven Culture

What Does it Mean to be Data-Driven?
At a very fundamental level, being data-driven means that actions are taken based on insights derived from data.

In order for an enterprise to be truly classified as data-driven, there are a few qualifications that need to be met:

  • Leadership must have a data-driven mentality.
  • The organization has to be data-literate, meaning that everyone in the organization knows the initiative they are pulling data for.
  • You need to have a foundational framework of your data.

How to Create a Data-Driven Culture and Increase Data Literacy
The biggest challenge Meenal faced in shifting the culture to a data-driven one is the fact that people often have the mindset that, ”if something is already working, why do we need to fix it?”

  • To change that mindset, it is important to collaborate and communicate with all parties the reasons for making changes in an application.
  • Allow everyone, including leadership, middle management, and end-users, to lay out pros and cons, and address it all, while keeping emotions off the table. Be transparent about both the capabilities and limitations of each application.
  • To increase data literacy in an organization, the education must be top-down. Leadership must communicate why the organization needs to be data-literate, making the end goal clear.

Standardization of Business Metrics

Currently, Meenal is working on building the data foundation needed to build the data science platform at Tailored Brands. One of the biggest challenges she is facing is maintaining consistency in business metrics.

  • It may be challenging to come to a single definition for an enterprise KPI or metric and then identify the data steward for that metric. Sometimes you have to take the lead and choose who will be the data steward.
  • You need to make sure that the source of data that is informing the metric is the right dataset. This dataset should come out of a central organization rather than multiple organizations. This works well because if it is wrong, it is wrong in just that one place.
  • Once a metric is defined, it is built into reporting applications.

If you’re interested in any of the topics discussed here, Sandeep and Meenal talked about even more in the full podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

The post Developing Data Literacy and Standardized Business Metrics at Tailored Brands appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/developing-data-literacy-and-standardized-business-metrics-at-tailored-brands/feed/ 0
Creating a Data Strategy & Self-Service Data Platform in FinTech https://www.unraveldata.com/resources/creating-a-data-strategy-self-service-data-platform-in-fintech/ https://www.unraveldata.com/resources/creating-a-data-strategy-self-service-data-platform-in-fintech/#respond Tue, 23 Feb 2021 13:00:11 +0000 https://www.unraveldata.com/?p=6142

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Keyur Desai, CDO of TD Ameritrade. They discuss battlescars in two areas: Building a Data Strategy and Pervasive Self-Service Analytics Platforms. Keyur is […]

The post Creating a Data Strategy & Self-Service Data Platform in FinTech appeared first on Unravel.

]]>

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Keyur Desai, CDO of TD Ameritrade. They discuss battlescars in two areas: Building a Data Strategy and Pervasive Self-Service Analytics Platforms.

Keyur is a data executive with over 30 years of experience managing and monetizing data and analytics. He has created data-driven organizations and driven enterprise-wide data strategies in areas including data literacy, modern data governance, machine learning and data science, pervasive self-service analytics, and several other areas.

He has experience across multiple industries including insurance, technology, healthcare, and retail. Keyur shares some really valuable lessons based on his extensive experience. Here are the key talking points from their chat.

Building a Data Strategy

The Problem: Disconnect Between Business Goals & Technical Infrastructure

  • A data analytics strategy is never singularly built by a data analytics organization. It is absolutely co-created between the business and the data analytics organization.
  • To the business, a data and analytics strategy is a set of data initiatives and analytics initiatives that will be brought together to help them achieve their business outcomes.
  • However, building your data analytics initiatives based on desired business outcomes doesn’t guarantee that your technical infrastructure would be neatly built as well.

The Solution 1: A Data Economist

To solve this problem, Keyur felt that there was a need for a new role, called a data economist.

  • The data economist’s job is to create a mathematical model of the outcome that they’re trying to achieve with the business and backtrack that into attributes of data, the analytics system, and the technical system.
  • Through these mathematical models, you can estimate not only whether you’re going to be able to meet the business objectives or the business outcomes, but also model out the impact within the company, measured in earnings per share.
  • You can use this model to propose a fact-based plan and engage with cross-functional executives in a conversation about why you’re proposing it that way.

The Solution 2: Data Literacy

Keyur also stressed the importance of establishing data literacy on the business side.

  • A lot of companies will try to create a data literacy program that is overarching across the entire company. However, Keyur has found a lot of success in segmenting the business user base, as well as the technical user base, based on the types of capabilities each segment will need when it comes to data and analytics.
  • A fluency program needs to enable each segment to translate what they’re seeing in the tool at hand into an insight, and then translate that insight into its business implications.
  • Literacy is not just about understanding the data; it’s also about practices to keep it safe and private, as well as being able to effectively tell a story with the data.
  • Establishing data literacy across the business allows them to determine what types of outcomes are even possible with data and to begin to figure out which outcomes they want to chase.

Pervasive Self-Service Analytics Platforms

The Role of the Self-Service Analytics Team

  • The self-service analytics team’s role has shifted from creating reporting assets to enabling data fluency across the organization and watching what data sets are being used by whom to solve what types of problems.
  • There is a balancing act between wanting to provide self-service to everybody while, at the same time, making sure that everyone is doing it in a secure way that does not open up a risk for the corporation.
  • It comes down to making sure that you’ve got a corporate-wide access framework where all subsystems that have data of any sort have the potential to store, move, or share the data.

Data Prep Environments
Up to now, the missing link to developing an integrated, end-to-end self-service model was a self-service data preparation environment that is as easy to use as Excel. We’re now getting there!

  • Data prep environments now allow non-technical business users to be able to get past some of the big bottlenecks they had in the past, like lacking the technical skill to clean up the data.
  • The AI running in the background of data prep environments, combined with what the users around you are doing, is now smart enough to propose some of the cleaning actions you should perform.
  • Through the data prep environment, you can get dashboards on what data is being used for what purposes, or what kinds of metrics are being created, across the organization.

The Role of a Business Leader in the Self-Service World
This new self-service world revolutionizes how we go about sharing data, and even accelerates it.

  • With the unified access framework, you can now ensure that people are not only sharing, but are seeing the things they should. More importantly, business intelligence leaders can now watch and see what people are doing and ensure that everything is going smoothly.
  • All leaders must be aligned around the concept of respect. Respect allows for a team to get to a level of interaction where they can easily bounce ideas off of each other. This breeds innovation.
  • A leader needs to ensure that they trust the people they hire on their team and their colleagues enough to be able to give them enough autonomy to get the job done.
  • In addition to laying out the end goal, leaders must also lay out the intermediate milestones required to reach that goal.
  • Lastly, a leader should be able to see how previous leaders have failed, and bring a sense of purpose to how the new approaches will actually create business value.

If you’re interested in any of the topics discussed here, Sandeep and Keyur talked about even more in the full podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

The post Creating a Data Strategy & Self-Service Data Platform in FinTech appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/creating-a-data-strategy-self-service-data-platform-in-fintech/feed/ 0
Achieving Top Efficiency in Cloud Data Operations https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/ https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/#respond Fri, 05 Feb 2021 22:34:53 +0000 https://www.unraveldata.com/?p=8156

The post Achieving Top Efficiency in Cloud Data Operations appeared first on Unravel.

]]>

The post Achieving Top Efficiency in Cloud Data Operations appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/feed/ 0
Why You Need DataOps in Your Organization https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/ https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/#respond Fri, 05 Feb 2021 22:27:33 +0000 https://www.unraveldata.com/?p=8154

The post Why You Need DataOps in Your Organization appeared first on Unravel.

]]>

The post Why You Need DataOps in Your Organization appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/feed/ 0
Discover Your Datasets The Self-Service Data Roadmap https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/ https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/#respond Fri, 05 Feb 2021 22:20:12 +0000 https://www.unraveldata.com/?p=8149

The post Discover Your Datasets The Self-Service Data Roadmap appeared first on Unravel.

]]>

The post Discover Your Datasets The Self-Service Data Roadmap appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/feed/ 0
CDO Battlescars Podcast Series https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/ https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/#respond Wed, 27 Jan 2021 13:00:28 +0000 https://www.unraveldata.com/?p=5852

Thank you for your interest in the CDO Battlescars Podcast Series. You can access the series here.

The post CDO Battlescars Podcast Series appeared first on Unravel.

]]>

Thank you for your interest in the CDO Battlescars Podcast Series.

You can access the series here.

The post CDO Battlescars Podcast Series appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/feed/ 0
10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/ https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/#respond Fri, 11 Dec 2020 17:29:05 +0000 https://www.unraveldata.com/?p=5622

Managing online teams has become the new normal! In an online world, how do you give effective feedback, have a difficult conversation, increase team accountability, communicate to stakeholders effectively, and so on? At Unravel, we are […]

The post 10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era appeared first on Unravel.

]]>

Managing online teams has become the new normal! In an online world, how do you give effective feedback, have a difficult conversation, increase team accountability, communicate to stakeholders effectively, and so on?

At Unravel, we are a fast-growing AI startup with a globally distributed engineering team across the US, EMEA, and India. Even before the pandemic this year, the global nature of our team has prepared us for effectively leading outcomes across online engineering teams.

To help fellow Engineering Leaders (Managers, Tech Leads, ICs), we are making available our 10X Engineering Leadership Video series. Instead of hours of training sessions (where very little is ingested and retained), we developed a new format – 15 min playbooks with concrete actionable levers leaders can apply right away!

This micro-learning async approach has served us well and allows leaders to pick topics most relevant to their needs. Each playbook has an assignment — the intention is for leaders to discuss and learn from peer leaders. The playbooks help create a shared terminology among leaders especially required in an online setting.

We discovered there are three categories where engineering leaders (including seasoned ones) often struggle when it comes to online teams — creating clarity, driving execution accountability, and coaching team members to deliver the best work of their lives.

Playbooks for Creating Clarity

“Everyone in the team is running fast” — “But in different directions!” It is very easy to get out of sync, especially in online teams. Also, clarity is required for team members to effectively balance tactical fires while ensuring long-term initiatives are delivered.

  • Defining Why: Is your team suffering from the myth, “Build it, and they (customers) will come”? The mantra today is, “If they will come, we will build it.” This playbook covers three plays (that are best taught at Intuit): Follow-me-home, customer need/ideal state statement, Leap-of-Faith-Assumptions statement.
  • Defining What: The field team is expecting an elephant, the product management is thinking of a giraffe, and engineering delivers a horse. How to align everyone on the same “what.” This playbook covers three plays: User story Jiras, Working backward, Feature-kickoff meeting.
  • Defining How: Is the team aligned on the new feature scope, tradeoffs, dependencies? Are they thinking about long pole tasks proactively and front-loading risk? To help align on How, this playbook covers four plays.
  • Clarifying priorities as OKRs: In an online setting, objectives and key results (OKRs) are a great way for leaders to communicate their priorities. Are you using OKRs effectively – or are they treated as just another tracking overhead? This playbook covers top-down and bottom-up plays for defining OKRs effectively.
  • Effective stakeholder updates: “How is the customer issue resolution coming along?” Leaders need to learn to provide an online (Slack) response in 2-3 sentences. Whether it is a 50-second response or a 50-min planning meeting, this playbook covers the corresponding plays: SCQA Pyramid, Amazon 6-pager, Weekly progress updates.

Driving Execution Accountability

“Vision without execution… is just hallucination,” or “Vision without action is a daydream.” Inspiring outcomes requires leaders to have the rigor for execution across the team.

  • Your operating checklist: All leaders have the same 24-hour suitcase — what you fill it with defines your leadership effectiveness. This playbook covers plays for creating an operating rhythm of regular checkpoints with the team and helps you get organized to tackle unplanned work that inevitably will show up on your plate.
  • Applying Extreme Ownership: “There are no bad teams, only bad leaders” — as a leader, accepting responsibility for everything that impacts the outcome is a foundational building block, especially in an online setting. This playbook covers extreme ownership behaviors critical to demonstrate in high-performance teams.
  • Getting better with retrospectives: Similar to individuals, teams need to have a growth-mindset — getting better with each sprint and taking on bigger and bigger challenges together. This playbook covers plays on conducting retrospectives effectively with the lens of both what and how tasks were accomplished.
  • Effective delegation: Are leaders scaling themselves effectively in an online setting? Effective delegation is also important for leaders to grow their teams. This playbook covers Andy Grove’s Task-Relevant Maturity (TRM) and how to apply it effectively!
  • Clarifying roles: In an online setting, it is essential to be very clear on the roles of the individuals on the team for a given project — who is the driver, approver, collaborator, or informed party? This playbook covers the DACI/RACI/RAPID frameworks, one-way and two-way doors, disagree and commit, and effective escalations.

Coaching Team

Ara Parseghian says, "A good coach will make his players see what they can be rather than what they are."

  • Effective online weekly check-ins: For team members, one of the biggest motivators and sources of satisfaction comes from making progress on meaningful goals. The playbook covers how to have an effective 15-min check-in with everyone on the team each week.
  • Giving actionable feedback: Leaders typically are not comfortable sharing feedback online. This playbook covers SBI (Situation-Behavior-Impact) and related plays.
  • Structured interviewing process: For leaders, effective hiring is the foundational building block of effective outcomes. How do you hire the right candidates via online interviewing? Hint: it’s relying less on your gut for decision-making. This playbook covers a structured interviewing approach you need to implement.
  • Increasing team trust & effectiveness: Trust among team members is the foundational building block for effective outcomes. Leaders need to work at developing trust and accountability. This playbook provides actionable levers to apply.
  • Having difficult conversations: An employee is consistently missing timelines or is making others in the team uncomfortable with their actions. While most leaders avoid difficult conversations even in office settings, online makes this even more difficult. This playbook covers how to create shared understanding.
  • Sharpening EQ: An effective leader needs to have IQ+Technical Skills+EQ. While this is a very broad topic, the playbook covers key actionable next steps.
  • Working on your mindset: Leaders need to be aware of their mindsets and biases. The lens with which they view themselves and others is important to understand. This playbook covers plays on mindsets to cultivate and those to watch out for.

We continue to add more playbooks to the series and are committed to investing in and growing our leaders within Unravel, as well as helping the community. Subscribe to the channel to make sure you do not miss out on these playbooks.

The post 10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/feed/ 0
Eckerson Report DataOps Deep Dive https://www.unraveldata.com/resources/eckerson-report-dataops-deep-dive/ https://www.unraveldata.com/resources/eckerson-report-dataops-deep-dive/#respond Thu, 10 Dec 2020 17:29:25 +0000 https://www.unraveldata.com/?p=5573

Get DataOps Deep Dive Nowadays, companies of all types spend heavily on compute power across a wide range of data technologies, but frequently don’t place enough emphasis pre-planning where that money will be allocated. This is […]

The post Eckerson Report DataOps Deep Dive appeared first on Unravel.

]]>

Get DataOps Deep Dive

Nowadays, companies of all types spend heavily on compute power across a wide range of data technologies, but frequently don’t place enough emphasis on pre-planning where that money will be allocated. This is particularly true for organizations that are moving to the cloud.

As uncovered in the Eckerson report DataOps: Deep Dive, providing insights into efficiency, from the level of KPIs down to lines of code, helps organizations understand and improve the ROI of their data initiatives. At the same time, the integration of AI to automate efficiency improvements saves developers valuable time in the DataOps lifecycle. And, this is an area where Unravel excels.

As described in the report, “The Unravel Data Operations Platform is, in many respects, a next-gen application performance monitor tailored to the unique challenges of development in the modern big data context. Unravel works to alleviate this complexity by providing full-stack visibility across the data ecosystem.”

Download the report to learn how to add advanced monitoring capabilities to your DataOps strategy, whether your high-volume data environment is on-premises, hybrid, or in the cloud.

The Unravel Data Operations Platform is ideal for customers that want to:

  • Optimize data resources to reduce overhead when migrating workloads to AWS, Microsoft Azure, or GCP
  • Manage pipeline deployment efficiency
  • Track the ROI of their investments in data initiatives
  • Improve the quality of developers’ code through automation

The post Eckerson Report DataOps Deep Dive appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/eckerson-report-dataops-deep-dive/feed/ 0
Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/ https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/#respond Wed, 25 Nov 2020 16:34:01 +0000 https://www.unraveldata.com/?p=5517

Unravel provides full-stack coverage and a unified, end-to-end view of everything going on in your environment, plus recommendations from our rules-based model and our AI engine. Unravel works on-premises, in the cloud, and for cloud migration.  […]

The post Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant appeared first on Unravel.

]]>

Unravel provides full-stack coverage and a unified, end-to-end view of everything going on in your environment, plus recommendations from our rules-based model and our AI engine. Unravel works on-premises, in the cloud, and for cloud migration. 

Unravel provides direct support for platforms such as Cloudera Hadoop (CDH), Hortonworks Data Platform (HDP), Cloudera Data Platform (CDP), and a wide range of cloud solutions, including AWS infrastructure as a service (IaaS), Amazon EMR, Microsoft Azure IaaS, Azure HDInsight, and Databricks on both cloud platforms, as well as GCP IaaS, Dataproc, and BigQuery. We have grown to support scores of well-known customers and to engage in productive partnerships with both AWS and Microsoft Azure. 

We have an ambitious engineering agenda and a relatively large team, with more than half the company in the engineering org. We want our engineering process to be as forward-looking as the product we deliver. 

We constantly strive to develop adaptive, end-to-end testing strategies. For testing, Unravel started with a modest customer deployment; we now support scores of large customer deployments with 2000 nodes and 18 clusters, and we had to conquer the giant challenges posed by this massive increase in scale. 

Since testing is an integral part of every release cycle, we give top priority to developing a systematic, automated, scalable, and yet customizable approach for driving the entire release cycle. For a new startup, the obvious and quickest approach is to follow the traditional testing model and manually test and certify each module or product. 

Well, this structure sometimes works satisfactorily when the features in the product are few. However, a growing customer base, increasing features, and the need to support multiple platforms give rise to proportionally more and more testing. At this stage, testing becomes a time-consuming and cumbersome process. So if you and your organization are struggling with the traditional, manual testing approach for modern data stack pipelines, and are looking for a better solution, then read on. 

In this blog, we will walk you through our journey about:

  • How we evolved our robust testing strategies and methodologies.
  • The measures we took and the best practices that we applied to make our test infrastructure the best fit for our increasing scale and growing customer base.

Take the Unravel tour

Try Unravel for free

Evolution of Unravel’s Test Model

Like any other startup, Unravel had a test infrastructure that followed the traditional testing approach of manual testing, as depicted in the following image:           

Initially, with just a few customers, Unravel mainly focused on release certification through manual testing. Different platforms and configurations were manually tested, which took roughly 4-6 weeks per release cycle. With increasing scale, this cycle kept growing, making the release train longer and less predictable. 

This type of testing model has quite a few stumbling blocks and does not work well with scaling data sizes and product features. Common problems with the traditional approach include:

  • Late discovery of defects, leading to:         
      • Last-minute code changes and bug fixes    
      • Frantic communication and hurried testing  
      • New regressions introduced by those rushed fixes
  • Deteriorating testing quality, due to:
      • Manual end-to-end testing of the modern data stack pipeline, which is error-prone and tends to miss corner cases, concurrency issues, etc.
      • Difficulty catching lag issues in modern data stack pipelines
  • Longer, unpredictable release trains, which lead to:
      • Stretched deadlines, since testing time increases proportionally with the number of builds across multiple platforms
      • Increased cost due to high resource requirements, such as more person-hours and heavily equipped test environments

Spotting the defects at a later stage becomes a risky affair, since the cost of fixing defects increases exponentially across the software development life cycle (SDLC). 

While the traditional testing model has its cons, it also has some pros. A couple of key advantages:

  • Manual testing can reproduce customer-specific scenarios 
  • It can catch bugs where you least expect them

So we resisted the temptation to move fully to what most organizations now implement, a completely mechanized approach. To cope with the challenges faced in the traditional model, we introduced a new test model, a hybrid approach that has, for our purposes, the best of both worlds. 

This model is built on the following strategy, which adapts to scale while retaining a robust testing framework.

Our Strategy

Unravel’s hybrid test strategy is the foundation for our new test model.

New Testing Model 

Our current test model is depicted in the following image:

This approach mainly focuses on end-to-end automation testing, which provides the following benefits:

  • Runs an automated daily regression suite on every new release build, with end-to-end tests for all the components in the Unravel stack
  • Provides a holistic view of the regression results through a rich reporting dashboard 
  • The automation framework works for all release types (point releases, GA releases), making it flexible, robust, and scalable

A reporting dashboard and an automated regression summary email are key differentiators of the new test model. 
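
To make the idea concrete, here is a minimal sketch of what such a nightly end-to-end regression harness might look like, written with pytest. The platform list, the environment variable, and the submit_sanity_pipeline() helper are hypothetical placeholders for illustration, not Unravel's actual test framework.

```python
# Minimal sketch of a nightly end-to-end regression harness (pytest).
# PLATFORMS, RELEASE_BUILD_URL, and submit_sanity_pipeline() are hypothetical
# placeholders, not the actual Unravel test framework.
import os
import pytest

PLATFORMS = ["cdh", "hdp", "emr", "hdinsight", "databricks"]  # assumed matrix

def submit_sanity_pipeline(platform: str, build_url: str) -> dict:
    """Deploy the build and run a canned end-to-end workload on the given
    platform; stubbed here so the sketch is self-contained."""
    return {"platform": platform, "status": "SUCCEEDED", "duration_sec": 420}

@pytest.mark.parametrize("platform", PLATFORMS)
def test_end_to_end_sanity(platform):
    build_url = os.environ.get("RELEASE_BUILD_URL", "http://example/build/latest")
    result = submit_sanity_pipeline(platform, build_url)
    # Fail the nightly run (and hold the release train) if any platform regresses.
    assert result["status"] == "SUCCEEDED"
    assert result["duration_sec"] < 3600  # guard against silent slowdowns
```

Results from each nightly run can then be rolled up into the reporting dashboard and the automated regression summary email described above.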

The new test model provides a number of key advantages, though it comes with some trade-offs of its own.

KPI Comparison – Traditional Vs New Model

The following bar chart is derived from the KPI values for deployment and execution time, captured for both the traditional and the new model.

The following graph compares deployment, execution, and resource time savings:

Release Certification Layout

The new testing model comes with a new Release Certification layout, as shown in the image below. The process involved in the certification of a release cycle is summarized in the Release Cycle Summary table. 

Release Cycle Summary

Conclusion

Today, Unravel has a rich set of unit tests: more than 1,000 tests run for every commit as part of the CI/CD pipeline, alongside 1,500+ functional sanity test cases covering our end-to-end data pipelines and integration scenarios. This testing strategy significantly reduces the impact on integrated functionality by proactively surfacing issues in pre-checks. 

To cut a long story short, it is indeed a difficult and tricky task to build a flexible, robust, and scalable test infrastructure that caters to varying scales, especially for a cutting-edge product like Unravel, with a team that strives for the highest quality in every build. 

In this post, we have highlighted commonly faced hurdles in testing modern data stack pipelines. We have also showcased the reliable testing strategies we have developed to efficiently test and certify modern data stack ecosystems. Armed with these test approaches, just like us, you can also effectively tame the scaling giant!


The post Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/feed/ 0
Getting Real With Data Analytics https://www.unraveldata.com/getting-real-with-data-analytics/ https://www.unraveldata.com/getting-real-with-data-analytics/#respond Thu, 30 Jul 2020 14:00:59 +0000 https://www.unraveldata.com/?p=4777 abstract image with numbers

CDO Sessions: Getting Real with Data Analytics Acting on Market Risks & Opportunities On July 9th, big data leaders, Harinder Singh from Anheuser-Busch, Kumar Menon at Equifax, and Unravel’s Sandeep Uttamchandani, joined together for our CDO […]

The post Getting Real With Data Analytics appeared first on Unravel.

]]>
abstract image with numbers

CDO Sessions: Getting Real with Data Analytics Acting on Market Risks & Opportunities

On July 9th, big data leaders, Harinder Singh from Anheuser-Busch, Kumar Menon at Equifax, and Unravel’s Sandeep Uttamchandani, joined together for our CDO Session, hosted by our co-founder and CEO, Kunal Agarwal, to discuss how their companies have adapted and evolved during these challenging times.


Transcript

Kunal: Hi guys, welcome to this chat today with big data leaders. Thank you everybody for joining today’s call. We’re all going through very uncertain, unprecedented times, to say the least, with the pandemic, combined with all the geo-social unrest. Business models and analytics that have developed over the last several years have perhaps been thrown out of the window, or need to be reworked for what we call “the new normal”. We’ve seen companies take a defensive position for the first couple of weeks when this pandemic hit and now we’re looking at companies taking an offensive position. We have an excellent panel here to discuss how they’re using data within their companies to be able to uncover both risks and opportunities. Can we start with your background and your current role?

Harinder: Hey guys, it’s Harinder Singh. I lead data strategy and architecture at Anheuser-Busch InBev, also known as AB InBev. AB InBev is a $55 billion revenue company with 500 plus brands, including Corona, Budweiser, and Stella. We operate in hundreds of countries. Personally, I’ve been in this industry for about 20 years and prior to being at AB InBev, I was at Walmart, Stanford, and a few other companies. I’m very excited to be here as part of the panel.

Kumar: Hey guys, it’s Kumar Menon. I lead Data Fabric at Equifax. I’ve been at Equifax for a couple of years and we are a very critical part of the credit life cycle within the economy in almost all the regions that we operate in. So it puts us in an interesting position during situations like these. I’ve been in the industry for 25 years, doing data work primarily in two major, highly regulated industries, life sciences and pharmaceuticals, and financial services. That’s the experience that I’ve been able to bring to Equifax to be able to really rethink how we use data and analytics to deliver new value to our customers.

Sandeep: Hey everyone, it’s a pleasure to be here. I’m Sandeep Uttamchandani, the VP of Engineering and the Chief Data Officer at Unravel Data. My career has basically been a blend of building data products, as well as running data engineering, at large scale, most recently at Intuit. My passion is basically: how do you democratize data? How do you make data self-serve and really build a data-driven culture for enterprises? At Intuit, I was responsible for data for Intuit QuickBooks, a $3 billion franchise, and continuing that passion at Unravel, we’re really building out what we refer to as the self-serve platform, looking at telemetry data combined across resources, clusters, jobs, and applications, and really making sense of it as a true data product. I’m really excited to be talking more about what changes we’ve gone through and how we bounce back.

Kunal: Thank you, Sandeep. To gather some context, why don’t we go around the room and understand what your companies are doing at this particular time? What’s top of mind for your businesses?

Harinder: What’s on top of our mind today for our business is the following: Number one, taking care of our people. We operate in 126 countries and we don’t just ship our products, we actually brew locally, sell locally, meaning everything from the grain, the farmer, the brewery, is local in that country or city. For us, taking care of our people is our number one priority and there are different ways to do that. For example, in our corporate offices, it can be as simple as sending office chairs to people’s houses.

Our second priority is taking care of our customers. When I say customers, I’m not talking just about consumers, I’m talking actually about bars and restaurants, our B2B customers. They have been significantly impacted because people are not going into bars and restaurants. We have done that by using credit, doing advanced purchases, coming up with alliances with other companies, and creating gift cards that people can buy to use later. And finally, another thing that’s on top of mind is taking care of finances. We’re a big company and we don’t know how long this will last, so we want to make sure that we’re in it for the long run. We’re very fortunate that our leadership in the C level has very strong finance backgrounds and even in technology, people actually come from finance. So that’s definitely something that we are doing as a business for sure.

Kunal: That’s fascinating, Harinder. That was the sequence of things that we expect from a big company. Hopefully, once live sports start again, the sales will start picking back up. Kumar, I’d love to hear from you as well.

Kumar: Before our current situation, back in late 2017, Equifax had a significant, unfortunate breach, after which we as a company made a serious commitment to look at everything we do as a business, in terms of our product strategy, our platform strategy, and our security posture, and to transform the company to really help our customers and, in turn, the consumer. We were in the middle of a massive transformation and are still going through that. We were already moving at a quick pace to rebuild all of our data and application platforms, refactor our portfolio, rework our engagement with our customers, etc. The situation, I would say, has helped us even more.

When you look at the pandemic scenario, the credit economy has taken on a life of its own, in terms of how banks and other financial institutions are looking at data and at the impact of the pandemic on their portfolios and customers. As a company, we’ve been really looking at the more macroeconomic impact of the situation on the consumers in the regions that we operate. Traditional data indicators or signals don’t really hold as much value as they would in normal times, in these unique times, and we’re constantly looking at new indicators that would help not only our customers, but eventually the consumers, go through the situation in a much better way.

We are actually helping banks and other financial institutions reassess, and while we’ve seen business uptake in certain areas, in other areas of course, we see lower volumes. But overall, as a company we’ve actually done pretty well. And in executing the transformation, we’ve actually had to make some interesting choices. We’ve had to accelerate things and we’re taking a really offensive strategy, where we quickly deploy the new capabilities that we’re building. This will help businesses in regions with slower uptake execute better and serve their customers and eventually the consumers better.

Kunal: We would love to hear some of those changes that you’re making in our next round of questions. Sandeep?

Sandeep: I’ll answer this question by drawing from my own experience as well as my discussions talking to a broader set of CDO’s on this topic. There are three high level aspects that are loud and clear. First, I think data is more important than ever. During these uncertain times, it is important to use data to make decisions, which is clear across pretty much every vertical that I’m tracking.

The other aspect here is the need for agility and speed. If you look at the traditional analytical models and ML models built over years and years of data, the question is, do they represent the new normal? How do you now make sure that what you’re representing in the form of predictive algorithms and forecasting is even valid? Are those assumptions right? This is leading to the need to build a whole set of analytics, ML models, and data pipelines really quickly and I’ve seen people finding and looking for signals and data which they otherwise wouldn’t have.

The last piece of the puzzle, as a result of this, is more empowerment of the end users. End users here would be the beta users, the data analysts, the data scientists. In fact, everyone within the enterprise needs data at this time, whether I’m a marketer trying to decide a campaign, or if I’m in sales trying to decide pricing. Data allows you to determine how you react to the competition, how you react to changing demand, and how you react to changing means.

Big Data Projects

Kunal: So let’s talk some data. What data projects have you started or accelerated during these times? And how do you continue to leverage your data at your different companies while thinking about cutting costs and extending your resources?

Harinder: As times are changing, everything we do as a company has to change and align with that. To start off, we were already investing quite heavily in digital transformation to take us to the next level. The journey to taking a business view end to end, including everything from farmers, to customers, to breweries and corporations, has already begun. And due to COVID, we have really expedited the process.

Second, we had some really big projects to streamline our ERP systems. AB InBev grew heavily through M&A, we acquired many companies and partnered with many companies small and big. Each company was a successful business in its own right, meaning they had their own data and technology.

Definitely, the journey to Cloud is big as well. Some aspects of our organization were already multi-Cloud, but if you look at the positive side of this crisis, it really pushed us hard to move faster on the Cloud journey as well. The same is true for remote work: something that would have taken three to five years to execute happened overnight.

So the question then becomes, well, how do we manage the costs? Because all of these things that I’m talking about expediting require a budget to go with it. One thing we’ve done is reprioritize some of our initiatives. While these things that I talked about earlier have gone from, let’s say, priority three to priority one, we have some initiatives that we were working on that have been pushed to the backburner.

Let me give you some examples of managing or cutting costs. I run the data platform and the approach there was to scale, scale, scale, because the business is growing and we’re bringing all these different countries and zones online into Cloud. We still want to grow, but we’ve gone from looking at scale to focusing on how we can optimize and have more of a real time inventory of what’s needed, rather than having it three months ahead. The fact that you’re in the Cloud enables you to do that. It’s a positive thing on both sides because it helps expedite the journey to the Cloud, while moving to the Cloud helps you keep your inventory lean. Then, in terms of just doing some basic sanity checks, are there systems that have been around, but just not significantly used? Or are there software that we need less of? Or if there are things, in terms of technology, hardware, software or applications, that we need more of because of COVID, can we negotiate better because of scale again?

Kunal: Alright Harinder, so you’ve been just scrutinizing everything, scrutinizing size, scrutinizing projects, and making sure that you’re scaling in an optimized fashion, and not scaling out fast for unforeseen loads and warranties, if I summarized that correctly.

Harinder: You’re absolutely right. I think we were in a unique position to do that because our company follows a zero-based budget model, which essentially means that at the start of each year, we don’t just build upon from where we were last year. We start from scratch every single year, so that’s already in our culture, our DNA. And once or twice a year, we just had to take the playbooks out and do it again. That’s actually quite easy for us as a company to do versus, I can imagine, big companies that may have a tough time doing that.

Kunal: One last question before we move on to Kumar. What about some of the challenges that Cloud presented to you?

Harinder: Anybody going into the Cloud has to keep in mind two things. One is that it’s a double edged sword. It gives convenience when it’s time to market fast, but you also have to be very careful about security. All of these Cloud vendors, Google, Amazon, or Azure, spend more in security than companies can. So, the Cloud security out of the box is much better compared to an on-prem system. But you also have to be careful about how you manage it, configure it, enforce it, so on and so forth.

The second part to me is the cost. If you do a true comparison and don’t manage your cost properly, then Cloud costs can be much higher. If used properly and managed properly, Cloud costs are much better for business. A lot of people and companies that I talked to say that they are going to move to Cloud to save costs, but while moving to the Cloud is part of that, that’s step one. You must also make sure you manage the cost and watch out for it, especially in the very beginning, and prioritize the cost equally. Those two things, when done in combination, really kind of take care of the bottleneck issue with moving to Cloud.

Kunal: Yeah, Cloud definitely needs guardrails. Harinder, thank you so much for that.

Sandeep: I just want to quickly add to Harinder’s points. Just from our own experience, when we entered the Cloud in the past, we had to repurpose, using one instance for 10 hours versus 10 instances for one hour. I completely resonate with that point, Harinder. You also mentioned multi-Cloud and I would love to learn more.

Kunal: How about you, Kumar?

Kumar: For us, since we were already executing this blazing transformation, we didn’t really have to start anything specifically new. We went through some reprioritization of our roadmaps and were already executing at a serious pace, looking to complete this global transformation in a two year timeframe. So what we really focused on from an acceleration or a reprioritization perspective was deploying the capabilities as quickly as possible into the global footprint. Once the pandemic hit, we had to think about the impact on our portfolio. Most of our customers are big financial institutions and we quickly realized that traditional data points are no longer as predictive for understanding the current scenario, as Sandeep mentioned before. So we had to really reevaluate and look at how we can bring our data together in a much faster way, in a much more frequent manner, that can help our customers understand portfolios better. And obviously, how does this impact our traditional predictive models that we deploy for credit decisioning, fraud, and other areas where we saw some significant uptake in certain ways? All this required the capability to be deployed much faster.

Our transformation was based on a Cloud-first strategy, so we are 100% on the Cloud. That helped us accelerate pushing these capabilities out into the global regions at a much faster pace and we completed the global deployment of several of our platforms over the last couple of months or so.

From a data projects perspective, our goal throughout this transformation has been to enable faster ingestion of new data into our ecosystem, bringing this data together and driving better insights for customers. So we’re constantly looking for new data sources that we can acquire that can add value to the more traditional and the very high fidelity data sources we already have. When you look at our footprint in a particular region, we actually have some of the most important data about a consumer within the region that we operate in. In a traditional environment, that data is very unique and very powerful, but when you look at a scenario like the pandemic situation that we’re in, we have to bring in data and figure out how the current situation impacts customers, therefore understanding consumers better.

Also, anything that we produce has to be explainable. While we absolutely have a lot of desire to use, and currently do use, very advanced techniques around analytics, using ML and AI for several things, for some of our regulated businesses, everything has to be explainable. So we’ve accelerated some of our work in the explainable AI space and we think that’s going to be an interesting play in the market as more and more regulations start governing how we use data and how we help our customers or the consumers eventually own the data. We, in fact, own a patent in the industry that allows for credit decisioning using explainable AI capabilities.

Kunal: We’d love to hear about some of the signals that weren’t considered earlier that are now considered. Would you be able to share some of those, Kumar?

Kumar: Absolutely. So, we have some new data sets that not many of the credit rating agencies or other financial institution data providers have today. For example, the standard credit exchange is all banking loan data that we get at the consumer level that every credit rating agency has. But we also have a very highly valuable data asset called the work number, which is information about people’s employment and income. We also have a utilities exchange, where we get utility payment information of consumers. I can talk about some insights that you don’t even have to be a genius to think about, opportunities that you can literally unravel through combining this data.

If you were to just look at a traditional credit score that is based on credit data, as an example, I could say, “Kumar Menon worked in the restaurant business and has a credit score of 800”. In a traditional way of looking at credit scoring, I would be still a fairly worthy customer to lend money to. Looking at the industry that I’m working in, maybe there is a certain element of risk that is now introduced into the equation, because I’m in the restaurant business, which is obviously not doing well. So what does that mean when I look at Kumar Menon as a consumer? There are things that you can do to understand the portfolio and plan better. I’m not saying that all data points are valid, but understanding the portfolio helps financial institutions prepare better, help consumers, work with consumers to better understand forbearance scenarios, and help them work out scenarios where you don’t have to go into default. I mean, the goal of the financial institution was to bring more people into the credit industry, which is what we are trying to enable more.

By providing financial institutions with more data, we’re helping them become more aware of potential risks or scenarios that may not be visible in a traditional paradigm.

Kunal: That’s very interesting. Thanks so much for sharing, Kumar. Moving on to you, Sandeep.

Sandeep: Piggybacking on what Harinder and Kumar touched on, one of the key projects has been accelerated movement to the Cloud. When you think about moving to the Cloud, it’s a highly non-trivial process. On one side, you have your data and thousands of data sets; on the other side, you have these active pipelines that run daily: pipelines, dashboards, and ML models feeding into the product. So the question really becomes, what is the sequence to move these? Some pipelines are fairly isolated data sets with, I would say, trivial query constructs being used, but on the other side, you’ll be using some query constructs which are deeply embedded in the on-prem system, highly non-trivial to move, requiring rethinking of the schema, rethinking the partitioning logic, the whole nine yards. This is a huge undertaking. How do you sequence? How do you minimize risk? These are live systems and the analogy here is, how do you change the engine of the plane while the plane is running? We don’t have the luxury to say, “okay, these pipelines, these dashboards or models won’t refresh for a week”.

The other aspect is the meaning of data. Traditionally, I think as data professionals, the number one question is, where is the data and how do I get to the data? With data sets, which attribute is authentic and which attribute has been refreshed? During the pandemic, the question is now slightly changing into not just “where is my data?” but “is this the right data to use?” Are these the right signals I should be relying on? This new normal requires a combination of both the understanding of the business and the expertise there, combined with the critical aspects of data and the data platform. So there’s clearly increasing synergy in several places as people think about, “okay, how do I rework my models?” It’s a combination of building this out as well as using the right data.

The last piece is how you shorten the whole process of getting a pipeline or an insight developed. We are writing apps to do things which we haven’t done before, no matter which vertical you’re in, and the moment you have these apps coming out at a fast pace, in production, there are a lot of discoveries. In terms of misusing the data, scanning, or a join across a billion-row table, all these aspects can inundate any system. Comparing it to a drainage system, one bad query is like a clog that stops everything, affecting everything below. I think that’s the other aspect, increasing awareness of how we fortify the CICD and the process of improving that.

Kumar: That’s a very interesting point you bring up because when we look at our data ecosystem, all the way from ingestion of raw data to what we call, purposing the data for a specific set of products, we must ensure that the models and other insights that execute on that data all stay in sync. So how do we monitor that entire ecosystem? How do we ingest data faster, deploy new models, monitor it, and understand if it’s performing in the right way? We looked at that ecosystem and we want it to be almost like a push button scenario where analysts can develop models when they’re looking at data schemas that are exactly similar to what is running in production, so that there is no rewiring of the data required. And the deployment of the models is seamless, so you’re not rewriting the models.

In many of the on-prem systems, you actually end up rewriting the model in a production system because of performance challenges, etc. So, do you really want to extend the CICD pipeline concept to the analytic space, where there is an automated mechanism for data scientists to be able to build and deploy in a way that a traditional data engineer would deploy some pipeline code? And how do we make sure that that synergy is available for us to deploy seamlessly? It’s something that we’ve actually looked at very consciously and are building it into our stack. It’s a very relevant industry problem that I think many companies are trying to solve.

Big Data Challenges and Bottlenecks

Kunal: To summarize what Kumar and Sandeep were saying, we’re growing data projects and somebody, at the end of the day, needs to make sure it runs properly. We’re looking at operations and Kumar made a comment, comparing it with the more mature DevOps lifecycle, which we are thinking about as a DataOps lifecycle. What are some of these challenges and bottlenecks that are slowing you guys down, Harinder?

Harinder: I would like to start off by giving our definition of DataOps. We define DataOps as end-to-end management and support of a data platform, data pipelines, data science models, and essentially the full lifecycle of consumption and production of information. Now, when we look at this sort of life cycle, there are the basics of people, process, technology, and data.

Starting with our people, we started building this team about three years ago, so there’s a lot of experienced talent with a blend of new and upcoming individuals. We were still in the growth phase of the team, but I think that the current situation has slowed down that process a bit.

The technology was always there but it’s more so about the adoption of it because when you have to strike the balance between growth in data projects and more need for data, usually you will have people in technology scale with it. In our case, like I said, the people team is not able to grow as fast because of the situation, so we are looking for automation. How can we utilize our team better? CICD was there in some parts of the platform while in some, it wasn’t. So we are finding those opportunities, automating the platform, and applying full CICD.

When we talk about Cloud, there are scenarios where you can move to the Cloud in different ways. You can move as an infrastructure, as a platform, as full SaaS, so we always wanted to be sort of platform agnostic and multi-Cloud. There are some things we have done, mostly on the infrastructure side, but now we are taking the middle ground a bit, moving away from infrastructure to more of a platform as a service model so that, again, going back to people, we can save some time to market by moving to that model.

On the process side, it’s about striking the right balance between governance and time to market. When you have to move fast, governance always slows down. The industry is very regulated and that means you still have to maintain a minimum requirement on compliance. It depends on which country we are in: in the US, not so much, but in other countries where we operate, there’s always GDPR. So those requirements have to be met while we move fast to meet the demands of our internal customers for data analytics and insights. When we talk about this whole process end to end, I think it’s about how we continue to scale and meet the needs of our business, while also doing our best to strike the balance just because of the space we are in. And when I talk about regulation, I’m not just talking about the required regulation or compliance, it’s also just good data hygiene, maintaining the catalog, maintaining the glossaries. Right now it’s just that complication of sometimes speed taking over and other times governance taking over, so we’re trying to find the right balance there.

Kunal: As is every organization, Harinder, so you’re not alone there for sure. Kumar, anything to add there?

Kumar: I think he covered it pretty well. For us, when moving to the Cloud, you really have to have a different philosophy when you’re building Cloud native applications versus what you’re building on-prem. It’s really about how do you improve the skill sets of your people to think more broadly. Now you take a developer who has been developing on Java, on-prem, and she or he now has to have a little bit of an understanding of infrastructure, a little bit about the network, a little bit about how Cloud security works so we can actually have security in the stack, versus just an overlay on the application. A lot of on-prem applications are built that way, relying on perimeter security by the network. How do you actually engineer the right IAM policies into every layer of the services you’re building? How do you make sure that the encryption and decryption capabilities that you enable for the application are enterprise-wide policies?

I’ve come back often to the ability to deploy into the Cloud. How do you ensure that your deployment is compliant? How do you make sure that everything is code in the Cloud, infrastructure is code, security is code, your application is code? How do you check in your CICD pipeline that you have all your controls in place so that your build fails and you don’t actually deploy if you’re violating policy? So we actually started to implement policy as code within our CICD pipeline to ensure that no bad behavior really manifests itself in production.
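
As a rough illustration of the policy-as-code idea Kumar describes, here is a minimal sketch of a CI gate written as a standalone Python check. The manifest schema, required keys, and allowed regions are assumptions made up for this example; they are not Equifax's actual policies or tooling.

```python
# Illustrative policy-as-code gate: fail the CI build if a deployment
# manifest violates baseline rules. The manifest schema and the rules
# below are assumptions for illustration only.
import json
import sys

REQUIRED_KEYS = {"encryption_at_rest", "iam_role", "region"}
ALLOWED_REGIONS = {"us-east-1", "ap-southeast-1"}  # example residency rule

def violations(manifest: dict) -> list:
    problems = []
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not manifest.get("encryption_at_rest", False):
        problems.append("encryption at rest must be enabled")
    if manifest.get("region") not in ALLOWED_REGIONS:
        problems.append(f"region {manifest.get('region')!r} violates residency policy")
    return problems

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        manifest = json.load(f)
    problems = violations(manifest)
    if problems:
        print("Policy check failed:\n  - " + "\n  - ".join(problems))
        sys.exit(1)  # non-zero exit fails the pipeline, blocking the deploy
    print("Policy check passed")
```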

We’ve also been ruthlessly looking at security because of the situation we were in before, as well as the fact that we hold some very valuable high fidelity data. How do you ensure that what our security policy is on the data is also on the technology stack that operates on the data? So those have been some very interesting learnings and I wouldn’t say this has slowed us down, but these things are mandatory and we must learn and be able to master them as a company.

Regulations are ever changing. We’ve encountered new regulations as we’re building this. New privacy laws are coming into existence, like the CCPA in California, and I think there’ll be other states pursuing similar privacy laws. Obviously, that will impact you globally when you extrapolate that to GDPR and other regional laws. So when you’re deploying to the Cloud, how do you make sure you’re adhering to the data residency requirements within those regions, as well as the privacy laws? How you build an architecture that can adapt and be flexible to that change is really the big challenge.

Kunal: Thank you for sharing all of that. Sandeep, any thoughts there?

Sandeep: I define DataOps as a point in the project where the honeymoon phase ends and reality sets in, the phase of making a prototype and building out the models and analytics.

On a single weekend, I’ve seen a bad query accumulate more than a hundred thousand dollars in cost. That’s an example where if you don’t really have the right guardrails, just one weekend with high end GPUs in play, trying to do ML training for a model that honestly we did not even require, you get a bill of $100,000.

I think the other thing is just the sheer root cause analysis and debugging. There are so many technologies out there and on one side, there is the philosophy of using the right tool for the job, which is actually the right thing, there is no silver bullet. But then, if you look at the other side of it, the ugly side of it, you need the skill sets required to understand how Presto works, versus how Hive works, versus how Spark works, and tune it to really figure out where issues are happening. It’s much more difficult; how do you package that? Figuring it out is one of those issues which has always been there, but is now becoming even more critical.

The last thing to wrap up, and I think Kumar touched on this, is a very different way to think about some of the newer technologies. If you think of these serverless technologies like Google BigQuery or AWS Athena, they have different pricing models. Here, you’re actually being charged by the amount of data scanned and imagine a query that is basically doing a massive scan of data, incurring significant costs. So you need to incorporate all of these aspects, be it compliance, cost, root cause analysis, tuning, and so on, early on so that DataOps is seamless and you can avoid surprises.
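
As a back-of-the-envelope illustration of Sandeep's point about scan-based pricing, the sketch below assumes an on-demand rate of roughly $5 per terabyte scanned (in the ballpark of published BigQuery and Athena on-demand pricing at the time); the table sizes and query shapes are invented for the example.

```python
# Back-of-the-envelope cost of a scan-priced query (e.g., BigQuery or Athena
# on-demand), assuming roughly $5 per TB scanned -- always check current pricing.
PRICE_PER_TB = 5.00

def query_cost(bytes_scanned: float) -> float:
    return bytes_scanned / 1e12 * PRICE_PER_TB

full_scan = query_cost(50e12)   # SELECT * over a hypothetical 50 TB table
pruned    = query_cost(0.2e12)  # same question, limited to one partition/column set

print(f"full scan: ${full_scan:,.2f}, pruned: ${pruned:,.2f}")
# full scan: $250.00, pruned: $1.00 -- and a dashboard re-running the
# unpruned query hourly would cost roughly $6,000 per day.
```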

How Big Data Professionals Can Adjust to Current Times

Kunal: Thank you. We’ll have one, one minute rapid fire round question for everybody as a parting thought. There’s several hundred people listening in right now, so what should all of us data professionals plan for as we’re thinking through this prolonged period of uncertain times? What is that one thing that they should be doing right now that they have not in the past?

Harinder: I actually have not one, but five, but they’re all very quick. First of all, empathy. We are in completely different times, so make sure you have empathy towards your team and towards your business partners.

Number two, move fast. It’s not the time to think too hard or plan, you just have to move fast and adapt.

Number three, manage your costs.

Number four, focus on your business partners internally and try to understand what their needs are, because it’s not just you, everybody is in a unique situation. So focus on your internal customers, what do they need from you, in terms of data analytics?

And finally, focus on your external customers, try to understand their needs. One of the most important things would be maybe changing the delivery model of your product or service and meeting where the customer is instead of expecting customers to come to you.

Kumar: I totally agree with focusing on internal customers. Obviously focus on the ecosystem you’re operating in so it’s your customers as well as potentially your customers’ customers. Definitely make sure that you connect a lot more with your customers and your coworkers to keep the momentum going.

I think, in several scenarios there are new opportunities that are being unearthed in the market space, so really watch out for where those opportunities lie, especially when you’re in the data space. There are new signals coming up, new ways of looking at data that can provide you better insights. So how do you constantly look at that?

Finally, I would say to keep an eye out for how fast regulations are changing. I’m sure new regulations will be in play with this new normal, so just make sure that what you build today can withstand the challenge of time.

Kunal: Thank you, Sandeep?

Sandeep: One piece of advice for professionals would be to also focus on data literacy and explainable insights within your organization. Not everyone understands data the way you do, and when you think about insights, it’s basically three parts: what’s going on, why it is happening, and how to get out of it. Not everyone will have the skills and expertise to do all three. The “what” part, what’s going on in the business, how to think about it, how to slice and dice, data professionals have a unique opportunity here to really educate and build that literacy within their enterprise for better decision making. And everything that Harinder and Kumar mentioned is spot on.

Kunal: Thank you. Guys again, this was a fantastic one hour. We had a ton of viewers here today. I hope we all took away something from these data professionals, I certainly learned a lot. Harinder, Kumar, Sandeep, thank you so much for taking time out during such crazy times and sharing your experiences, all the practical advice, and strategies with the entire data community.

The post Getting Real With Data Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/getting-real-with-data-analytics/feed/ 0
CDO Sessions: Getting Real with Data Analytics https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/ https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/#respond Tue, 28 Jul 2020 19:06:58 +0000 https://www.unraveldata.com/?p=5033

The post CDO Sessions: Getting Real with Data Analytics appeared first on Unravel.

]]>

The post CDO Sessions: Getting Real with Data Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/feed/ 0
Transforming DataOps in Banking https://www.unraveldata.com/transforming-dataops-in-banking/ https://www.unraveldata.com/transforming-dataops-in-banking/#respond Thu, 16 Jul 2020 13:00:39 +0000 https://www.unraveldata.com/?p=4715 DBS Bank ATMs

CDO Sessions: A Chat with DBS Bank On July 9th 2020, Unravel CDO Sandeep Uttamchandani joined Matteo Pelati, Executive Director, DBS in a fireside chat to discuss transforming DataOps in Banking. Watch the 45 minute video […]

The post Transforming DataOps in Banking appeared first on Unravel.

]]>
DBS Bank ATMs

CDO Sessions: A Chat with DBS Bank

On July 9th 2020, Unravel CDO Sandeep Uttamchandani joined Matteo Pelati, Executive Director, DBS in a fireside chat to discuss transforming DataOps in Banking. Watch the 45 minute video below or review the transcript.


Transcript

Sandeep: Welcome Matteo, really excited to have you, can you briefly introduce yourself, tell us a bit about your background and your role at DBS.

Matteo: Thank you for the opportunity. I’m leading DBS’s data platform and have been with the bank for the last three years. Over the last three years, we’ve built the entire data platform from the ground up. We started with nothing with a team of about five people and now we are over one hundred. My last 20 years has been in this field with many different companies, mostly startups. DBS is actually my first bank.

Sandeep: That’s phenomenal and I know DBS is a big company being one of the key banks in Singapore, so it’s really phenomenal that you and your team kicked off the building of your data platform.

Matteo: As DBS is going through a company-wide digitization initiative, we initially started by outsourcing development, but now we retain much more development in-house with the use of open source technologies. We’re also contributing back to open source, too!

Sandeep: That’s phenomenal and I have seen some of your other work. Matt, you’re truly a thought leader in this space. So really, I would say, spearheading a whole bunch of things!

Matteo: Thank you very much.

Sandeep: Top of mind for everyone is COVID-19. This has been something that every enterprise is grappling with and maybe a good starting point. Matteo, can you share your thoughts on the COVID-19 situation impacting banks in Singapore?

Matteo: Obviously there is an economic impact that is inevitable everywhere in the world. Definitely there is a big impact on the organization because banks don’t traditionally have a remote workforce. All of a sudden we found ourselves having to work remotely as ordered by the government. We had to adapt and we’ve done well adapting to the transition to home-based working. There were challenges in the beginning, such as remote access to systems and the suddenness of all of this, but we handled and are handling it pretty well.

Sandeep: That’s definitely good news. Matteo, do you have thoughts on other industries in Singapore and how are they recovering?

Matteo: In Singapore we didn’t really have the strict lockdown other countries experienced. We did, however, limit social contact, and the government instructed people to stay at home. There are some businesses that have been directly impacted by this, e.g. Singapore Airlines; the airlines have all shut down. I’m from Italy, and COVID-19 has been hugely disruptive to people’s lives there, as all businesses were shut down. It did happen here in Singapore, but with a lesser impact. As things start to ease up and restrictions start to loosen, hopefully the situation will get better soon.

Sandeep: From a DBS standpoint, Matteo, what is top of mind for you from the broader aspect as you are planning ahead from the impacts of the pandemic.

Matteo: Planning ahead, we’re looking at remote working as a long-term arrangement as there are many requests for it. We’re also exploring cloud migration as most of our systems have always been on-premise. As you already know, banks and companies with PII data may find it challenging to move sensitive data to the cloud.

The current COVID-19 pandemic has accelerated the planning for the cloud. It is a journey that will take time, it won’t happen overnight, but having a remote workforce has helped us understand actual use cases. We’re investing in tokenization and encryption of data so that it can be in the Cloud. There are lots of investments in that direction, and they have probably been sped up by the pandemic.

How DBS Bank Leverages Unravel Data

Sandeep: In addition to moving to the cloud, what new data project priorities can you shed a light on?

Matteo: As you know, we are investing a lot in the data group as we’re running the data platform. Building a platform is very much about putting the pieces together and making them work with one another. We decided in the beginning to invest a lot in building a complete solution, and we started doing a lot of development as well. We built this platform from the ground up, adopting open source software and building, to an extent, a full end-to-end self-service portal for the users. This took time, obviously, but the ROI was worth it because our users are now able to ingest data more easily, enabling them to simply build compute jobs.

Let me give you an example of where we leveraged Unravel. We have compute jobs that are built by our users directly on our cluster, on a web UI too. Once they have done the build, they can take that artifact and easily promote it to the next environment. We can go to User Acceptance Testing (UAT), QA, and production in a fully automated way using the UI of a component that we wrote. This has now become our application lifecycle manager, where we have integrated with Unravel, giving us the ability to automatically validate the quality of jobs.

We leverage Unravel to analyze jobs and their previous runs, and, thanks to Unravel, basically block the promotion of a job if it doesn’t satisfy certain criteria. For us, it’s not just bringing in tools and installing them, but building an entire ecosystem of integrated tools. We have integration with Airflow and many other applications that we’re building and fully automating. Having gone through this experience, we’ve learned a lot as we bring the same user experience to model productionization. What we’ve done with Spark data pipelines, etc., we are going to do with models.
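
A minimal sketch of the kind of promotion gate Matteo describes is shown below. The get_recent_runs() call is a placeholder for whatever job-analysis source feeds the check (not Unravel's actual API), and the thresholds are invented for illustration.

```python
# Illustrative promotion gate: block a job from moving to the next
# environment if its recent runs don't meet quality criteria.
# get_recent_runs() is a placeholder for the job-analysis backend that
# feeds the check; thresholds are made up for illustration.
from statistics import mean

MAX_AVG_DURATION_MIN = 60
MAX_FAILURE_RATE = 0.05
MAX_VCORE_HOURS = 500

def get_recent_runs(job_id: str) -> list:
    # Placeholder: in practice this would query the analysis backend.
    return [{"duration_min": 42, "failed": False, "vcore_hours": 180}]

def can_promote(job_id: str):
    runs = get_recent_runs(job_id)
    reasons = []
    if mean(r["duration_min"] for r in runs) > MAX_AVG_DURATION_MIN:
        reasons.append("average duration exceeds limit")
    if sum(r["failed"] for r in runs) / len(runs) > MAX_FAILURE_RATE:
        reasons.append("failure rate too high")
    if max(r["vcore_hours"] for r in runs) > MAX_VCORE_HOURS:
        reasons.append("resource usage exceeds budget")
    return (not reasons, reasons)

if __name__ == "__main__":
    ok, reasons = can_promote("etl_daily_positions")  # hypothetical job id
    print("promote" if ok else f"blocked: {reasons}")
```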

We want our users to be able to build and iterate on machine learning models and productionize them more easily, replicating the same push-button experience we have for ETL and compute jobs for machine learning models as well. That’s the next step we’re working on right now.

Sandeep: You talked a bit about the whole self-service portal, and making it easy for the users to accomplish whatever insights they’re trying to extract which is really interesting. When you think about your existing users, how are you satisfying their needs? What are some of the gaps that you’re seeing that you’re working towards?

Matteo: There are definitely gaps because there are always new feature requests and new bugs, as with any product. There are always new feature requests coming from customers and users. We do try to preempt what they need by analyzing their behavior, such as usage patterns and historic requests.

For example, we’re heavily investing in streaming now, but historically banks have always been aligned to batch processing, restricted by their legacy systems like mainframes and databases. Fast forward to now: we are starting to have systems that can produce real-time streams, which meant changing the platform to support streaming data, something we introduced more than two years ago.

This changes the whole paradigm, because you don’t just want to build a platform that supports streaming, but one that supports it natively, so that we can have end-to-end streaming applications. Traditionally, all the applications are built using batch processing, SQL, etc. Now the paradigm has shifted, which changes the requirements for machine learning: the deployment of a model becomes independent from the serving mechanism, transport layer, etc.

While many organizations package deployment into the application and deploy as a REST API, here we say “okay, let’s isolate the model itself from the application”. So basically, once the data scientist builds a model, we can deploy the model and build the tools for the discoverability of the models too. This enables me to use my model, deploy my model as a REST API, embed the model into my application, deploy the model as a streaming component, or deploy the model as a UDF inside a Spark job. This is how we facilitate reusability of models, and the journey we’re going through has started to pay back.
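
To illustrate the pattern of isolating the model from its serving mode, here is a minimal sketch in Python: one registered artifact, wrapped for REST serving, stream processing, or use as a Spark UDF. The class and method names are invented for illustration and are not DBS's platform code.

```python
# Sketch of separating a model artifact from its serving mode: one model,
# several deployment targets. All names here are invented for illustration.
class ModelArtifact:
    """A trained model plus metadata, registered independently of serving."""
    def __init__(self, name: str, version: str, predict_fn):
        self.name, self.version, self.predict_fn = name, version, predict_fn

    def as_rest_handler(self):
        # Wrap for an HTTP framework: request payload in, score out.
        def handler(payload: dict) -> dict:
            return {"score": self.predict_fn(payload["features"])}
        return handler

    def as_stream_processor(self):
        # Wrap for a streaming runtime: scores each event as it arrives.
        def process(event: dict) -> dict:
            event["score"] = self.predict_fn(event["features"])
            return event
        return process

    def as_spark_udf(self):
        # Wrap as a Spark UDF so the same model runs inside batch jobs
        # (import is deferred so the rest of the sketch runs without Spark).
        from pyspark.sql.functions import udf
        from pyspark.sql.types import DoubleType
        return udf(lambda features: float(self.predict_fn(features)), DoubleType())

# Example: the same artifact served different ways.
model = ModelArtifact("churn", "1.2.0", predict_fn=lambda feats: sum(feats) / len(feats))
rest = model.as_rest_handler()
print(rest({"features": [0.2, 0.4, 0.9]}))
```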

Three years ago we started with a very simple UI to allow users to clean their SQL so it would ease the migration of existing Teradata & Exadata jobs to our data platform. As the users became more skilled they needed more features on reusability. So the platform evolved with the users at heart and now we are at a very good stage. We get good feedback from what we have built.

Cloud Migration Strategy

Sandeep: I’ve heard some of your talks and it’s good stuff. Share some of the detailed challenges that you’re facing when you think about the cloud migration.

Matteo: We’re at the early stages of Cloud migration, you could say the exploration phase. The biggest challenge is access to data. What we are working on uses encryption and tokenization at large scale, and we are expanding their use throughout the entire data platform. So data access, wherever the data sits, will be governed by these technologies.

We have to handle it holistically, incorporating our own data ingestion framework. To an extent, things are simplified by the work we have done previously, because every component that reads or writes data to our platform goes through a layer that we built, our data access layer, which handles all these details; for example, tokenization, access validation, and authorization are all handled by the data access layer. Since all our users go through this data access layer, it gives us an opportunity to implement a feature across all users in a very easy way, so that’s basically our abstraction layer.
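
A minimal sketch of that data access layer pattern follows, with invented names and a toy token scheme purely for illustration; it is not DBS's implementation.

```python
# Sketch of a data access layer: every read/write passes through one layer
# that enforces authorization and tokenization. Names and the token scheme
# are purely illustrative.
import hashlib

SENSITIVE_COLUMNS = {"customer_id", "account_number"}

def _tokenize(value: str) -> str:
    # Stand-in for a real vault-backed, reversible tokenization service.
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:16]

class DataAccessLayer:
    def __init__(self, authz):
        self.authz = authz  # callable: (user, dataset, action) -> bool

    def write(self, user: str, dataset: str, rows: list) -> list:
        if not self.authz(user, dataset, "write"):
            raise PermissionError(f"{user} cannot write {dataset}")
        # Tokenize sensitive fields before they ever land in storage.
        return [
            {k: _tokenize(str(v)) if k in SENSITIVE_COLUMNS else v for k, v in row.items()}
            for row in rows
        ]

    def read(self, user: str, dataset: str, rows: list) -> list:
        if not self.authz(user, dataset, "read"):
            raise PermissionError(f"{user} cannot read {dataset}")
        return rows  # detokenization would happen here for entitled users

# Example usage with a permissive authorization stub.
dal = DataAccessLayer(authz=lambda user, dataset, action: True)
print(dal.write("alice", "loans", [{"customer_id": "C123", "balance": 1000}]))
```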

Security and the hybrid cloud model is a challenge at the moment. How are we going to share the data between on prem and cloud? How are we going to handle the movement of data? Part of the data will be in the cloud, part of the data will be on prem, so it’s not easy to define the rules which determine what is going to be on prem, what is going to be on the cloud. So we are evaluating different technologies to help us move the data across data centers, such as abstraction layers, abstracting the file system using a caching layer. I must say that these are probably the two challenges we’re facing now, and we are at the very beginning of that, actually, so I see many more challenges on our journey.

Sandeep: Having done cloud migration a few times before, I can totally vouch on the complexity. Can you share the other ways in which Unravel is providing visibility to data operations that are helping you out?

Matteo: We use Unravel for two different purposes. One, as I mentioned, is the integration with CICD for all the validation of the jobs and the other is more for analyzing and debugging the jobs. We definitely leverage Unravel while building our ingestion framework. I also see a lot of usage from users that are writing jobs and deploying into the platform, so they can leverage Unravel to understand more about the quality of the jobs, if they did something wrong, etc, etc.

Unravel has become really useful for understanding the impact of our users’ queries on the system. It’s very easy to migrate a SQL query that was written for Oracle or Teradata and end up with operations like joining twenty or thirty tables; these operations are very expensive on a distributed system like Spark, and the user might not necessarily know it. Unravel has become extremely useful for letting users understand the impact of the operations they’re orchestrating. As you know, we have our own CI/CD integration that prevents users from, let’s say, putting expensive jobs into production. So this plus Unravel is a very powerful combination, as we empower the user. First we stop the user from messing up the platform, and second we empower the user to debug their own things and analyze their own jobs. Unravel gives users that have traditionally been DBMS users the possibility to understand more about their complex jobs.

Sandeep: Can you share what was it prior to deploying Unravel? What are some of the key operational challenges you were facing and what was the starting point for an Unravel deployment.

Matteo: Through the control checks that we implemented recently, we saw too many poor quality jobs on the platform and that obviously had a huge impact on the platform. So before introducing Unravel, we saw the platform using too many cores and jobs very inefficiently executed.

We taught users how to use Unravel, which enabled them to spend time understanding their jobs and going back to Unravel to find out what the issues were. People were not following that process previously; as you know, optimization is always a hard task, since people want to deliver fast. So the control checks basically started to push users back to Unravel to gain performance insights before putting jobs into production.

Advice for Big Data Community

Sandeep: Matteo what do you see coming in terms of technology evolution in DataOps? Earlier you mentioned about adoption of machine learning and AI, can you share some thoughts on how you’re thinking about building out the platform and potentially extending some of the capabilities you have in that domain?

Matteo: We have had different debates about how to organize the platform, and we have built a lot of stuff in-house. Now that we are challenged with moving to the cloud, the biggest question is: shall we keep and leverage the current stack that we have, staying cloud agnostic, or should we rely on services provided by the cloud providers?

We don’t have a clear answer yet and we’re still debating. I believe that you can get a lot of benefits on what the cloud can give you natively and you can basically make your platform work.

Talking about technology, we are investing a lot in Kubernetes, actually, and most of our workload is on Spark; that’s where we’re planning to go. Today our entire platform runs Spark on YARN, and we are investing a lot in experimenting with Spark on Kubernetes and migrating all apps to Kubernetes. This will simplify the integration with machine learning jobs as well. Running machine learning jobs is much easier on Kubernetes because you have containers, and full integration is what we need.

We are also exploring technologies like Kubeflow, for example, for productionizing machine learning pipelines. To an extent, it’s like scrapping a lot of stuff that has been built over the last three years and rebuilding it, because we are using different technologies.

I see a lot of hype around other languages too. Traditionally the Hadoop and Spark stack has revolved around the JVM, Java, and Scala, and more recently I started exploring with Python. We’ve also seen a lot of work using other languages like Golang and Rust. So I think there will be a huge change in the entire ecosystem, because of the limitations that the JVM has. People are starting to realize that going back to a much smaller executable, like in Golang or Rust, with much simpler garbage collection or no garbage collection at all, can simplify things a great deal.

Sandeep: I think there’s definitely a revival of the functional programming languages. You made an interesting point about a cloud agnostic platform, and one of the things that Unravel focuses a lot on is supporting technologies across on-prem as well as the cloud. For instance, we support all three major cloud providers as well as technologies. One of the aspects we’ve added is the migration planner; any thoughts on that, Matteo? Knowing what data to move into the cloud versus what data to keep local? How are you sort of solving that?

Matteo: We are exploring different technologies and different patterns, and we have some technical limitations and policy limitations. To give you an example, all the data that we encrypt and tokenize, if they are tokenized on-prem and they need to be moved to the cloud, they actually need to be re-encrypted and re-tokenized with different access keys. So that’s one of the things that we are discussing that makes, obviously, the data movement harder.

One thing that we are exploring is having a virtualized file system layer across not just the file system, but a virtual cluster on-prem and in the cloud. For example, to visualize our file system, we’re using Alluxio. With Alluxio we are experimenting having an Alluxio cluster that spawns across the two data centers, on-prem and the cloud. We are doing the same with database technologies, as we are heavily using Aerospike and experimenting the same in Aerospike.

We have to be really careful, because when you span data centers, the bandwidth between on-prem and the cloud is not unlimited. I’m not sure if this will be our final solution, because we still have to face some technical challenges, like re-tokenization of data.

Re-tokenization and re-encryption of data, with this automatic movement, is too expensive, so we are also exploring ingesting the data on-prem and in the cloud, and letting the user decide where the data should be. So, we are experimenting with these two options. We haven’t come to any conclusion yet because it’s in the R&D phase now.

Sandeep: Thank you so much for sharing. So to wrap up, Matteo, I just wanted to end with this: do you have any final thoughts on some of the topics we discussed? Anything that you’d like to add to the discussion?

Matteo: No, not particularly. To summarize, we run the platform like a product company. We have product managers, and we have our own roadmap that is decided by us and not by the users.

This has turned out to be very successful in two respects. One is integration: because we are building a product, we make sure that every piece is fully integrated with the others and we can give the user a unified experience, from the infrastructure to the UI. The second is that it has helped a lot with the retention of the engineering team, because building a product creates much more engagement than doing random projects. This has been very impactful.

I think about all the integrations that we’ve done, the automation that we have done, and there are multiple aspects. For us, building our platform and our services as a product has been extremely beneficial, with the payback coming after some time. You need to give the investment time to return, but once you get to that stage, you’re gonna get your ROI.

Sandeep: That’s a phenomenal point, Matteo, especially treating the platform as a product and really investing and driving it. I couldn’t agree more. I think that’s really very insightful and, from your own experience, you have been clearly driving this very successfully. Thank you so much, Matteo.

Matteo: Thank you, that was great.

FINDING OUT MORE

Download the Unravel Guide to DataOps. Contact us or try out Unravel for free.

The post Transforming DataOps in Banking appeared first on Unravel.

]]>
https://www.unraveldata.com/transforming-dataops-in-banking/feed/ 0
CDO Sessions: Transforming DataOps in Banking https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/ https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/#respond Wed, 15 Jul 2020 19:19:35 +0000 https://www.unraveldata.com/?p=5040

The post CDO Sessions: Transforming DataOps in Banking appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/feed/ 0
Webinar Achieving Top Efficiency in Cloud Big Data Operations https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/ https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/#respond Thu, 23 Apr 2020 19:22:02 +0000 https://www.unraveldata.com/?p=5042

The post Webinar Achieving Top Efficiency in Cloud Big Data Operations appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/feed/ 0
Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/ https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/#respond Tue, 14 Apr 2020 14:49:50 +0000 https://www.unraveldata.com/?p=4600 Man with surgical mask

As the global business climate is experiencing rapid change due to the health crisis, the role of data in providing much needed solutions to urgent issues is being highlighted throughout the world. Helping customers manage critical […]

The post Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? appeared first on Unravel.

]]>
Man with surgical mask

As the global business climate is experiencing rapid change due to the health crisis, the role of data in providing much needed solutions to urgent issues is being highlighted throughout the world. Having helped customers manage critical modern data systems for years, Unravel sees heightened interest in fortifying the reliability of business operations in healthcare, logistics, financial services and telecommunications.

DATA TO THE RESCUE

Complex issues – from closely tracking population spread to accelerating clinical trials for vaccines – take priority. Other innovations in anomaly detection, such as the application of image classification to detect COVID patients through lung scans, provide hope of near-instantaneous detection. Further, big data has been used to personalize patient treatments since even before the crisis, so those most at risk will receive more curated treatment based on more than simply age and comorbidity. Other novel uses of AI to track the pandemic have been reported (facial recognition, heat signatures to detect fevers), with great success in flattening the curve of infection.

Outside of healthcare, the rapid change in the global business climate is putting strain on modern data systems, which are being pushed to unprecedented levels: logistics engines for demand for essential goods need to recalibrate on a sub-second basis; financial institutions need to update risk models to incorporate a 24/7 stream of rapidly evolving stimulus policies; and in social media, a wave of misinformation and conspiracy theories is giving rise to phishing and malware attacks perpetrated by bad actors. AI systems help identify these offenders to minimize the damage.

EXCELLENCE IN DATA OPERATIONS IS NO LONGER A LUXURY

No doubt we are seeing unprecedented and accelerating demand for solutions to complex, business-critical challenges. Modern data systems are becoming even more crucial, and the task of providing reliability is more important than ever. As an organization keenly focused on data and operations management, we believe it’s paramount to keep systems performing optimally. From a business perspective, now is the time to assess operational readiness. Companies that lean in now will be ready to leverage and accelerate their business and marketplace value more quickly once market events subside.

UNRAVEL PROVIDES COMPLETE MANAGEMENT OF EVERY ASPECT OF YOUR DATA PIPELINES AND PLATFORM:

  • Is application code executing optimally (or failing)?
  • Are cluster resources being used efficiently?
  • How do I lower my cloud costs, while running more workloads?
  • Which workloads should I scale out to cloud?
  • Which user, app and use case is using most of the cluster?
  • How do I ensure all applications meet their SLAs?
  • How can I proactively prevent performance and reliability issues?

THESE ISSUES APPLY AS MUCH TO SYSTEMS LOCATED IN THE CLOUD AS THEY DO TO SYSTEMS ON-PREMISES. THIS IS TRUE FOR THE BREADTH OF PUBLIC CLOUD DEPLOYMENT TYPES:

Cloud-Native: Products like Amazon Redshift, Azure Databricks, AWS Databricks, Snowflake, etc.

Serverless: Ready-to-use services that require no setup like Amazon Athena, Google BigQuery, Google Cloud DataFlow, etc.

PaaS (Platform as a Service): Managed Hadoop/Spark Platforms like AWS Elastic MapReduce (EMR), Azure HDInsight, Google Cloud Dataproc, etc.

IaaS (Infrastructure as a Service): Cloudera, Hortonworks or MapR data platforms deployed on cloud VMs where your modern data applications are running

For those interested in learning more about specific services offered by the cloud platform providers we recently posted a blog on “Understanding Cloud Data Services.”

CONSIDER A HEALTHCHECK AS A FIRST STEP

As you consider your operational posture, we have a team available to run a complimentary data operations and application performance diagnostic. Consider it a Business Ops Check-up. Leveraging our platform, you’ll quickly see how we can help lower the cost of support while bolstering the performance in your modern data applications stack. If you are interested in an Unravel DataOps healthcheck, please contact us at hello@unraveldata.com.

The post Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? appeared first on Unravel.

]]>
https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/feed/ 0
Supermarkets Optimizing Supply Chains with Unravel DataOps https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/ https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/#respond Tue, 07 Apr 2020 13:00:10 +0000 https://www.unraveldata.com/?p=4563 Shopping Cart

Retailers are using big data to report on consumer demand, inventory availability, and supply chain performance in real time. Big data provides a convenient, easy way for retail organizations to quickly ingest petabytes of data and […]

The post Supermarkets Optimizing Supply Chains with Unravel DataOps appeared first on Unravel.

]]>
Shopping Cart

Retailers are using big data to report on consumer demand, inventory availability, and supply chain performance in real time. Big data provides a convenient, easy way for retail organizations to quickly ingest petabytes of data and apply machine learning techniques for efficiently moving consumer goods. A top supermarket retailer has recently used Unravel to monitor its vast trove of customer data to stock the right product for the right customer, at the right time.

The supermarket retailer needed to bring point-of-sale, online sales, demographic and global economic data together in real-time and give the data team a single tool to analyze and take action on the data. The organization needed all the systems in their data pipelines to be monitored and managed end-to-end to ensure proper system and application performance and reliability. Existing methods were largely manual, error prone and lacked actionable insights.

Unravel Platform Overview

Unravel Extensible Data Operations Platform

 

After failing to find alternative solutions for cluster performance management, the customer chose Unravel to help remove risks from their cloud journey. During implementation, Unravel worked closely with the customer’s ITOps team, iterating in collaboration on the insights and recommendations that Unravel provided. This enabled both companies to triage and troubleshoot issues faster.

Get answers, not more charts and graphs

Try Unravel for free

Bringing All The Data Into A Single Interface

The customer utilized a number of modern open source data projects in its data engineering workflow – Spark, MapReduce, HBase, YARN and Kafka. These components were needed to ingest and properly process millions of transactions a day. Hive query performance was a particular concern, as numerous downstream business intelligence reports depended on timely completion of these queries. Previously, the devops team spent several days to a week troubleshooting job failure issues, often blaming the operations team for improper cluster configuration settings. The operations team would in turn ask the devops team to re-check SQL query syntax for Cartesian joins and other inefficient code. Unravel was able to shed light on these types of issues, providing usage-based reporting that helped both teams pinpoint inefficiencies quickly.

Unravel was able to leverage its AI and automated recommendations engine to clean up hundreds of Hive tables, greatly enhancing performance. A feature that the company found particularly useful is the ability to generate custom failure reports using Unravel’s flexible API. In addition to custom reports, Unravel is able to deliver timely notifications through e-mail, ServiceNow, and PagerDuty.

Happy with the level of control Unravel was able to provide for Hive, the customer deployed Unravel for all other components – Spark, MapReduce, HBase, YARN and Kafka – and made it a standard tool for DataOps management across the organization. Upon deploying Unravel, the team was presented with an end-to-end dashboard of insights and recommendations across the entire stack (Kafka, Spark, Hive/Tez, HBase) from a single interface, which allowed them to correlate thousands of pages of logs automatically. Previously, the team had performed this analysis manually, with unmanaged spreadsheet tracking tools.

In addition to performance management, the organization was looking for an elegant means to isolate users who were consistently wasteful with the compute resources on the Hadoop clusters. Such reporting is difficult to put together, and requires cluster telemetry to not only be collected across multiple components, but also correlated to a specific job and user. Using Unravel’s chargeback feature, the customer was able to report not only the worst offenders who were over-utilizing resources, but the specific cost ramifications of inefficient Hive and Spark jobs. It’s a feature that enabled the company to recoup any procurement costs in a matter of months.

Examples of cluster utilization showing in the Unravel UI

 

Scalable modern data applications on the cloud are critical to the success of retail organizations. Using Unravel’s AI-driven DataOps platform, a top retail organization was able to confidently optimize its supply chain. By providing full visibility of their applications and operations, Unravel helped the retail organization to ensure their modern data apps are effectively architected and operational. This enabled the customer to minimize excess inventory and deliver high demand goods on time (such as water, bread, milk, eggs) and maintain long term growth.

FINDING OUT MORE

Download the Unravel Guide to DataOps. Contact us or create a free account.

The post Supermarkets Optimizing Supply Chains with Unravel DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/feed/ 0
The journey to democratize data continues https://www.unraveldata.com/the-journey-to-democratize-data-continues/ https://www.unraveldata.com/the-journey-to-democratize-data-continues/#respond Wed, 01 Apr 2020 13:00:12 +0000 https://www.unraveldata.com/?p=4572 Paved Mountain Road

Data is the new oil and a critical differentiator in generating retrospective, interactive, and predictive ML insights. There has been an exponential growth in the amount of data in the form of structured, semi-structured, and unstructured […]

The post The journey to democratize data continues appeared first on Unravel.

]]>
Paved Mountain Road

Data is the new oil and a critical differentiator in generating retrospective, interactive, and predictive ML insights. There has been an exponential growth in the amount of structured, semi-structured, and unstructured data collected within the enterprise. Harnessing this data today is difficult — typically data in the lakes is not consistent, interpretable, accurate, timely, standardized, or sufficient. Sculley et al. from Google highlight that when implementing ML in production, less than 5% of the effort is spent on the actual ML algorithms. The remaining 95% of the effort is spent on data engineering related to discovering, collecting, and preparing data, as well as building and deploying the models in production.

As a result of this complexity, enterprises today are data rich but insights poor. Gartner predicts that 80% of analytics insights will not deliver business outcomes through 2022. Another study highlights that 87% of data projects never make it to production deployment.

Over the last two years, I have been leading an awesome team in the journey to democratize data at Intuit Quickbooks. The focus has been to radically improve the time it takes to complete the journey map from raw data into insights (defined as time to insight). Our approach has been to systematically break down the journey map, automate the corresponding data engineering patterns, and make them self-service for citizen data users. We modernized the data fabric to leverage the cloud, and developed several tools and frameworks as a part of the overall self-serve data platform.

The team has been sharing these automation frameworks both as talks at key conferences and as three open-source projects. Check out the list of talks and open-source projects at the end of the blog. It makes me really proud to see how the team has truly changed the trajectory of the data platform. A huge shoutout and thank you to the team — all of you rock!

In the journey to democratize data platforms, I recently moved to Unravel Data. Today there is no “one-size-fits-all,” so enterprises adopt polyglot datastores and query engines both on-premises and in the cloud. Configuring and optimizing queries for performance, SLAs, cost, and root-cause diagnosis is highly non-trivial and requires deep expertise. Data users such as data analysts, data scientists, and data citizens essentially need a turn-key solution to analyze and automatically configure their jobs and applications.

I am very excited to be joining Unravel Data to drive the technology of its AI-powered data operations platform for performance management, resource and cost optimization, and cloud operations and migration. The mission to democratize data platforms continues …

The full press release can be viewed below.

————————————————————————————————————————————–

Unravel Hires Data Industry Leader with over 40 Patents as New Chief Data Officer and VP of Engineering

The new CDO will draw on experience from IBM, VMware and Intuit QuickBooks to help Unravel customers accelerate their modern data workloads

PALO ALTO, CALIFORNIA – April 1, 2020 – Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced that it has hired Sandeep Uttamchandani as its new Chief Data Officer and VP of Engineering. Uttamchandani will help boost Unravel’s capabilities for optimizing data apps and end-to-end data pipelines, with special focus on driving innovations for cloud and machine learning workloads. He will also lead and expand the company’s world-class data and engineering team.

Uttamchandani brings over 20 years of critical industry experience in building enterprise software and running petabyte-scale data platforms for analytics and artificial intelligence. He most recently served as Chief Data Architect and Global Head of Data and AI at Intuit QuickBooks, where he led the transformation of the data platform used to power transactional databases, data analytics and ML products. Before that, he held engineering leadership roles for over 16 years at IBM and VMWare. Uttamchandani has spent his career delivering innovations that provide tangible business value for customers.

“We’re thrilled to have someone with Sandeep’s track record on board the Unravel team. Sandeep has led critical big data, AI and ML efforts at some of the world’s biggest and most successful tech companies. He’s thrived everywhere he’s gone,” said Kunal Agarwal, CEO, Unravel Data. “Sandeep will make an immediate impact and help advance Unravel’s mission to radically simplify the way businesses understand and optimize the performance of their modern data applications and data pipelines, whether they’re on-premises, in the cloud or in a hybrid setting. He’s the perfect fit to lead Unravel’s data and engineering team in 2020 and beyond.”

In addition to his achievements at Intuit QuickBooks, IBM and VMWare, Uttamchandani has also led outside the office. He has received 42 total patents involving systems management, virtualization platforms, and data and storage systems, and has written 25 conference publications, including an upcoming O’Reilly Media book on self-service data strategies. Uttamchandani earned a Ph.D. in computer science from the University of Illinois at Urbana-Champaign, one of the top computer science programs in the world. He currently serves as co-Chair of Gartner’s CDO Executive Summit.

“My career has always been focused on developing customer-centric solutions that foster a data-driven culture, and this experience has made me uniquely prepared for this new role at Unravel. I’m excited to help organizations boost their businesses by getting the most out of their modern data workloads,” said Sandeep Uttamchandani, CDO and VP of Engineering, Unravel Data. “In addition to driving product innovations and leading the data and engineering team, I look forward to collaborating directly with customer CDOs to assist them in bypassing any roadblocks they face in democratizing data platforms within the enterprise.”

About Unravel Data
Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

Copyright Statement
The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

PR Contact
Jordan Tewell, 10Fold
unravel@10fold.com
1-415-666-6066

The post The journey to democratize data continues appeared first on Unravel.

]]>
https://www.unraveldata.com/the-journey-to-democratize-data-continues/feed/ 0
4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/ https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/#respond Wed, 26 Feb 2020 14:00:53 +0000 https://www.unraveldata.com/?p=4334

This article discusses four bottlenecks in modern data stack applications and introduces a number of tools, some of which are new, for identifying and removing them. These bottlenecks could occur in any framework but a particular […]

The post 4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler appeared first on Unravel.

]]>

This article discusses four bottlenecks in modern data stack applications and introduces a number of tools, some of which are new, for identifying and removing them. These bottlenecks could occur in any framework but a particular emphasis will be given to Apache Spark and PySpark.

The applications/riddles discussed below have something in common: They require around 10 minutes wall clock time when a “local version” of them is run on a commodity notebook. Using more or more powerful processors or machines for their execution would not significantly reduce their run time. But there are also important differences: Each riddle contains a different kind of bottleneck that is responsible for the slowness, and each of these bottlenecks will be identified with a different approach. Some of these analytical tools are innovative or not widely used; references to source code are included in the second part of this article. The first section will discuss the riddles in a “black box” manner:

The Fatso

The fatso occurs frequently in the modern data stack. A symptom of running a local version of it is noise – the fans of my notebook are very active for almost 10 minutes, the entire application lifetime. Since we can rarely listen to machines in a cluster computing environment, we need a different approach to identify a fatso:

JVM Profile

The following code snippet (full version here) combines two information sources, the output of a JVM profiler and normal Spark logs, into a single visualization:


from typing import List

# The Scatter/Figure/plot objects below match Plotly's offline API, which appears to
# be what produces the interactive HTML graphs; ProfileParser and SparkLogParser
# come from the log/profile parsing package in the author's repo (not shown here).
from plotly.graph_objs import Figure, Scatter
from plotly.offline import plot

profile_file = './data/ProfileFatso/CpuAndMemoryFatso.json.gz'  # Output from JVM profiler
profile_parser = ProfileParser(profile_file, normalize=True)
data_points: List[Scatter] = profile_parser.make_graph()

logfile = './data/ProfileFatso/JobFatso.log.gz'  # standard Spark logs
log_parser = SparkLogParser(logfile)
stage_interval_markers: Scatter = log_parser.extract_stage_markers()
data_points.append(stage_interval_markers)

layout = log_parser.extract_job_markers(700)
fig = Figure(data=data_points, layout=layout)
plot(fig, filename='fatso.html')

 

 

The interactive graph produced by running this script can be analyzed in its full glory here, a smaller snapshot is displayed below:

Script Graph

Spark’s execution model consists of different units of different “granularity levels,” and some of these are displayed above: boundaries of Spark jobs are represented as vertical dashed lines, while start and end points of Spark stages are displayed as transparent blue dots on the x-axis, which also show the full stage names/IDs. This scheduling information does not add a lot of insight here, since Fatso consists of only one Spark job which in turn consists of just a single Spark stage (comprised of three tasks), but, as shown below, knowing such time points can be very helpful when analyzing more complex applications.

For all graphs in this article, the x-axis shows the application run time as UNIX Epoch time (milliseconds passed since 1 January 1970). The y-axis represents different normalized units for different metrics: For graph lines representing memory metrics such as total heap memory used (“heapMemoryTotalUsed”, ocher green line above), it represents gigabytes; for time measurements like MarkSweep GC collection time (“MarkSweepCollTime”, orange line above), data points on the y-axis represent milliseconds. More details can be found in this data structure which can be changed or extended with new metrics from different profilers.
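
As an illustration only (the real data structure in the repo may use different names and scaling factors), such a metric-to-unit mapping could be as simple as:

METRIC_UNITS = {
    'heapMemoryTotalUsed': {'unit': 'GB', 'scale': 1 / (1024 ** 3)},   # raw bytes -> gigabytes
    'heapMemoryCommitted': {'unit': 'GB', 'scale': 1 / (1024 ** 3)},
    'MarkSweepCollTime':   {'unit': 'ms', 'scale': 1.0},               # already milliseconds
    'ScavengeCollCount':   {'unit': 'events', 'scale': 1.0},
}

def normalize(metric_name: str, raw_value: float) -> float:
    """Scale a raw profiler value so that different metrics share one y-axis."""
    return raw_value * METRIC_UNITS[metric_name]['scale']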

One available metric, ScavengeCollCount, is absent from the snapshot above but present in the original. It counts minor garbage collection events and increases almost linearly up to 20000 during Fatso’s execution. In other words, the application ran for roughly 11 minutes – from epoch 1550420474091 (= 17/02/2019 16:21:14) until epoch 1550421148780 (= 17/02/2019 16:32:28) – and more than 20000 minor garbage collection events and almost 70 major GC events (“MarkSweepCollCount”, green line) occurred.
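
As a quick aside, the epoch milliseconds quoted above translate into human-readable timestamps with a couple of lines of Python:

from datetime import datetime, timezone

# Convert the epoch milliseconds quoted above into readable UTC timestamps.
start_ms, end_ms = 1550420474091, 1550421148780
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
print(start, end, end - start)   # roughly 2019-02-17 16:21 UTC to 16:32 UTC, about 11 minutes apart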

When the application was launched, no configuration parameters were manually set so the default Spark settings applied. This means that the maximum memory available to the program was 1GB. Having a closer look at the two heap memory metrics heapMemoryCommitted and heapMemoryTotalUsed reveals that both lines approach this 1GB ceiling near the end of the application.

The intermediate conclusion that can be drawn from the discussion so far is that the application is very memory hungry and a lot of GC activity is going on, but the exact reason for this is still unclear. A second tool can help now:

JVM FlameGraph

The profiler also collected stacktraces which can be folded and transformed into flame graphs with the help of my fold_stacks.py script and this external script:

Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/ProfileFatso/StacktraceFatso.json.gz > Fatso.folded
Phils-MacBook-Pro:analytics a$ perl flamegraph.pl Fatso.folded > FatsoFlame.svg

Opening FatsoFlame.svg in a browser shows the following; the full version, which is also searchable (top right corner), is located at this location:

Click for Full Size

A rule of thumb for the interpretation of flame graphs is: the more spiky the shape, the better. We see many plateaus above with native Spark/Java functions like sun.misc.unsafe.park sitting on top (first plateau), or low-level functions from packages like io.netty occurring near the top; the latter is a third-party library that Spark depends on for network communication/IO. The only functions in the picture that are defined by me are located in the center plateau; searching for the package name profile.sparkjob in the top right corner will prove this claim. On top of these user-defined functions are native Java Array and String functions; a closer look at the definitions of fatFunctionOuter and fatFunctionInner would reveal that they create many String objects in an inefficient way, so we have identified the two Fatso methods that need to be optimized.
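
For readers who prefer code to flame graphs, the general shape of such a hotspot is easy to sketch. The snippet below is purely illustrative and not the author's fatFunction implementation (it is written in Python, the language of the PySpark edition discussed next):

# Illustrative only, not the author's actual fatFunction code: building a large
# string by repeated concatenation allocates a new object on every iteration,
# which is exactly the kind of churn that shows up as heavy GC activity.
def fat_function_inner(tokens):
    result = ""
    for token in tokens:
        result = result + token + " "   # new intermediate string per iteration
    return result

# A leaner variant performs a single allocation at the end.
def lean_function_inner(tokens):
    return " ".join(tokens)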

Python/PySpark Profiles

What about Spark applications written in Python? I created several PySpark profilers that try to provide some of the functionality of Uber’s JVM profiler. Because of the architecture of PySpark, it might be beneficial to generate both Python and JVM profiles in order to get a good grasp of the overall resource usage. This can be accomplished for the Python edition of Fatso by using the following launch command (abbreviated, full command here):

~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
--conf spark.python.profile=true \
--conf spark.driver.extraJavaOptions=-javaagent:/.../=sampleInterval=1000,metricInterval=100,reporter=...outputDir=... \
./spark_jobs/job_fatso.py cpumemstack /users/phil/phil_stopwatch/analytics/data/profile_fatso > Fatso_PySpark.log

The --conf parameter in the third line is responsible for attaching the JVM profiler. The --conf parameter in the second line as well as the two script arguments in the last line are Python-specific and required for PySpark profiling: the cpumemstack argument selects a PySpark profiler that captures both CPU/memory usage and stack traces, and providing a second script argument in the form of a directory path ensures that the profile records are written into separate output files instead of being printed to standard output.

Similar to its Scala cousin, the PySpark edition of Fatso completes in around 10 minutes on my MacBook and creates several JSON files in the specified output directory. The JVM profile could be visualized independently of the Python profile, but it might be more insightful to create a single combined graph from them. This can be accomplished easily and is shown in the second half of this script. The full combined graph is located here.

Fatso PySpark Chart

The clever reader will already have a hunch about the high memory consumption and who is responsible for it: the garbage collection activity of the JVM, again represented by MarkSweepCollCount and ScavengeCollCount, is much lower here than in the “pure” Spark run described in the previous paragraphs (more than 20000 events above versus fewer than 20 GC events now). The two inefficient fatso functions are now implemented in Python and are therefore not managed by the JVM, leading to far lower JVM memory usage and far fewer GC events. A PySpark flame graph should confirm our hunch:

Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/profile_fatso/s_8_stack.json  > FatsoPyspark.folded
Phils-MacBook-Pro:analytics a$ perl flamegraph.pl  FatsoPyspark.folded  > FatsoPySparkFlame.svg

Opening FatsoPySparkFlame.svg in a browser displays …

Click for Full Size

And indeed, the two fatso methods sit on top of the stack for almost 90% of all measurements, burning most of the CPU cycles. It would be easy to create a combined JVM/Python flame graph by concatenating the respective stack trace files. This would be of limited use here, though, since the JVM flame graph will likely consist entirely of native Java/Spark functions over which a Python coder has no control. One scenario I can think of where merging JVM with PySpark stack traces might be especially useful is when Java code or libraries are registered and called from PySpark/Python code, which is getting easier and easier in newer versions of Spark. In the discussion of Slacker later on, I will present a combined stack trace of Python and Scala code.

The Straggler

The Straggler is deceiving: it appears as if all resources are fully utilized most of the time, and only closer analysis can reveal that this is the case for only a small subset of the system or for a limited period of time. The following graph, created from this script, combines two CPU metrics with information about task and stage boundaries extracted from the standard logging output of a typical straggler run; the full-size graph can be investigated here.

Straggler Chart

 

The associated application consisted of one Spark job, which is represented as vertical dashed lines at the left and right. This single job comprised a single stage, shown as transparent blue dots on the x-axis that coincide with the job start and end points. But there were three tasks within that stage, so we can see three horizontal task lines. The naming schema of this execution hierarchy is not arbitrary:

  • The stage name in the graph is 0.0@0 because it refers to a stage with the ID 0.0 that belonged to the job with ID 0. The first part of a stage or task name is a floating-point number; this reflects the apparent naming convention in Spark logs whereby new attempts of failed tasks or stages are baptized with an incremented fraction part.
  • The task names are 0.0@0.0@0, 1.0@0.0@0, and 2.0@0.0@0 because three tasks were launched that were all members of stage 0.0@0 that in turn belonged to job 0

The three tasks have the same start time which almost coincides with the application’s invocation but very different end times: Tasks 1.0@0.0@0 and 2.0@0.0@0 finish within the first fifth of the application’s lifetime whereas task 0.0@0.0@0 stays alive for almost the entire application since its start and end points are located at the left and right borders of this graph. The orange and light blue lines visualize two CPU metrics (system cpu load and process cpu load) whose fluctuations correspond with the task activity: We can observe that the CPU load drops right after tasks 1.0@0.0@0 and 2.0@0.0@0 end. It stays at around 20% for 4/5 of the time, when only straggler task 0.0@0.0@0 is running.

Concurrency Profiles

When an application consists of more than just one stage with three tasks like Straggler, it might be more illuminating to calculate and represent the total number of tasks that were running at any point during the application’s lifetime. The “concurrency profile” of a modern data stack workload might then look more like the following:

The source code that is the basis for the graph can be found in this script. The big difference to the Straggler toy example before is that in real-life applications many different log files are produced (one for each container/Spark executor), and there is only one “master” log file which contains the scheduling and task boundary information needed to build concurrency profiles. The script uses an AppParser class that handles this automatically by creating a list of LogParser objects (one per container) and then parsing them to determine the master log.

Just by looking at this concurrency profile, we can attempt a back-of-the-envelope calculation to increase the efficiency of the application: if around 80 physical CPU cores were actually used (given multiple peaks of ~80 active tasks), we can hypothesize that the application was “overallocated” by at least 20 CPU cores, or 4 to 7 Spark executors, or one to three nodes, as Spark executors are often configured to use 3 to 5 physical CPU cores. Reducing the machines reserved for this application should not increase its execution time, but it will give more resources to other users in a shared cluster setting or save some $$ in a cloud environment.
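
Purely to illustrate the counting idea behind such a profile (the AppParser class mentioned above handles this automatically; the sketch below is not its implementation), the concurrency at each task boundary can be derived from the task intervals like this:

from typing import Dict, List, Tuple

def concurrency_profile(task_intervals: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each task start/end timestamp to the number of tasks running at that moment."""
    events = []
    for start, end in task_intervals:
        events.append((start, 1))     # a task becomes active
        events.append((end, -1))      # a task finishes
    events.sort()

    running, profile = 0, {}
    for timestamp, delta in events:
        running += delta
        profile[timestamp] = running
    return profile

# Toy example: three tasks starting together, one straggling five times longer
print(concurrency_profile([(0, 100), (0, 110), (0, 500)]))
# {0: 3, 100: 2, 110: 1, 500: 0}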

A Fratso

What about the specs of the actual compute nodes used? The memory profile for a similar app, created via this code segment, is chaotic yet illuminating, since more than 50 Spark executors/containers were launched by the application and each one left its mark in the graph in the form of a memory metric line (original located here).

Big Job Memory Chart

The peak heap memory used is a little more than 10 GB; one executor crosses this threshold twice (top right) while most other executors use 8-9 GB or less. Removing the memory usage from the picture and displaying scheduling information like task boundaries instead results in the following graph:

 

Big Job Tasks Chart

 

The application initially launches several small Spark jobs, as indicated by the multiple dashed lines near the left border. However, more than 90% of the total execution time is consumed by a single big job with the ID 8. A closer look at the blue dots on the x-axis that represent boundaries of Spark stages reveals that there are two longer stages within job 8. During the first stage, there are four task waves without stragglers – concurrent tasks that together look like solid blue rectangles when visualized this way. The second stage of job 8 does have a straggler task, as there is one horizontal blue task line that is active much longer than its “neighbour” tasks. Looking back at the memory graph of this application, it is likely that this straggler task is also responsible for the heap memory peak of >10GB that we discovered. We might have identified a “fratso” here (a straggling fatso), and this task/stage should definitely be analyzed in more detail when improving the associated application.

The script that generated all three previous plots can be found here.

The Heckler: CoreNLP & spaCy

Applying NLP or machine learning methods often involves the use of third-party libraries which in turn create quite memory-intensive objects. There are several different ways of constructing such heavy classifiers in Spark so that each task can access them; the first version of the Heckler code that is the topic of this section does that in the worst possible way. I am not aware of a metric currently exposed by Spark that could directly show such inefficiencies; something similar to a measure of network transfer from master to executors would be required for one case below. The identification of this bottleneck must therefore happen indirectly by applying some more sophisticated string matching and collapsing logic to Spark’s standard logs:

from typing import List, Tuple

# SparkLogParser again comes from the author's log-parsing package.
log_file = './data/ProfileHeckler1/JobHeckler1.log.gz'
log_parser = SparkLogParser(log_file)
collapsed_ranked_log: List[Tuple[int, List[str]]] = log_parser.get_top_log_chunks()
for line in collapsed_ranked_log[:5]:  # print the 5 most frequently occurring log chunks
    print(line)

Executing the script containing this code segment produces the following output:


Phils-MacBook-Pro:analytics a$ python3 extract_heckler.py
 
^^ Identified time format for log file: %Y-%m-%d %H:%M:%S
 
(329, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
'StanfordCoreNLP:88 - Adding annotator parse'])
(257, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
'StanfordCoreNLP:88 - Adding annotator parse'])
(223, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
'StanfordCoreNLP:88 - Adding annotator parse'])
(221, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
'StanfordCoreNLP:88 - Adding annotator parse'])
(197, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
'StanfordCoreNLP:88 - Adding annotator parse'])

These are the 5 most frequent log chunks found in the log file; each one is a pair [int, List[str]]. The left integer signifies the total number of times the right list of log segments occurred in the file; each individual member of the list occurred in a separate log line. Hence the return value of the method get_top_log_chunks that created the output above has the type annotation List[Tuple[int, List[str]]]: it extracts a ranked list of contiguous log segments.

The top record can be interpreted the following way: the five strings


StanfordCoreNLP:88 - Adding annotator tokenize
StanfordCoreNLP:88 - Adding annotator ssplit 
StanfordCoreNLP:88 - Adding annotator pos
StanfordCoreNLP:88 - Adding annotator lemma
StanfordCoreNLP:88 - Adding annotator parse

occurred as infixes in this order 329 times in total in the log file. They were likely part of longer log lines, as normalization and collapsing logic was applied by the extraction algorithm; an example occurrence of the first part of the chunk (StanfordCoreNLP:88 - Adding annotator tokenize) would be

2019-02-16 08:44:30 INFO StanfordCoreNLP:88 - Adding annotator tokenize
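
The actual matching and collapsing logic lives in the author's repo; purely as a rough illustration of the normalization step (the regex and function below are hypothetical, not the real implementation), one could strip the volatile prefix before counting:

import re
from collections import Counter

# Stripping the volatile timestamp/log-level prefix lets repeated messages
# collapse onto a single key before counting.
PREFIX = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+\w+\s+')

def normalize_line(line: str) -> str:
    return PREFIX.sub('', line.strip())

lines = [
    '2019-02-16 08:44:30 INFO StanfordCoreNLP:88 - Adding annotator tokenize',
    '2019-02-16 08:44:31 INFO StanfordCoreNLP:88 - Adding annotator tokenize',
]
print(Counter(normalize_line(line) for line in lines))
# Counter({'StanfordCoreNLP:88 - Adding annotator tokenize': 2})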

What does this tell us? The associated Spark app seems to have performed some NLP tagging since log4j messages from the Stanford CoreNLP project can be found as part of the Spark logs. Initializing a StanfordCoreNLP object …


  // Imports needed by this fragment (CoreNLP pipeline and annotation classes)
  import java.util.Properties
  import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
  import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
  import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation

  val props = new Properties()
  props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")

  val pipeline = new StanfordCoreNLP(props)
  val annotation = new Annotation("This is an example sentence")

  pipeline.annotate(annotation)
  val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
  println(parseTree.toString) // prints the constituency parse tree of the sentence

… produces log4j output like the following:

0 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator tokenize
9 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator ssplit
13 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator pos
847 [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger  - Loading POS tagger from [...] done [0.8 sec].
848 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator lemma
849 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator parse
1257 [main] INFO edu.stanford.nlp.parser.common.ParserGrammar  - Loading parser from serialized [...] ... done [0.4 sec].

… which tells us that five annotators (tokenize, ssplit, pos, lemma, parse) are created and wrapped inside a single StanfordCoreNLP object. Concerning the use of CoreNLP with Spark, the number of cores/tasks used in Heckler is three (as it is in all other riddles), which means that we should find at most three occurrences of these annotator messages in the corresponding Spark log file. But we already saw more than 1000 occurrences when only the top 5 log chunks were investigated above. Having a closer look at the Heckler source code resolves this contradiction: the implementation is bad since one classifier object is recreated for every input sentence that will be syntactically annotated. There are 60000 input sentences in total, so a StanfordCoreNLP object will be constructed a staggering 60000 times. Due to the distributed/concurrent nature of Heckler, we don’t always see the annotator messages in the order tokenize – ssplit – pos – lemma – parse, because log messages of task (1) might interleave with log messages of tasks (2) and (3) in the actual log file, which is also the reason for the slightly reordered log chunks in the top 5 list.

Improving this inefficient implementation is not too difficult: creating the classifier inside a mapPartitions instead of a map function, as done here, will only create three StanfordCoreNLP objects overall. However, this is not the minimum; I will now set the record for creating the smallest number of tagger objects with the minimum amount of network transfer. Since StanfordCoreNLP is not serializable per se, it needs to be wrapped inside a class that is, in order to prevent a java.io.NotSerializableException when broadcasting it later:


class DistributedStanfordCoreNLP extends Serializable {
  val props = new Properties()
  props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")
  lazy val pipeline = new StanfordCoreNLP(props)  // lazy: instantiated on first use, i.e. on the executors, not on the driver
}
[...]
val pipelineWrapper = new DistributedStanfordCoreNLP()
val pipelineBroadcast: Broadcast[DistributedStanfordCoreNLP] = session.sparkContext.broadcast(pipelineWrapper)
[...]
val parsedStrings3 = stringsDS.map(string => {
   val annotation = new Annotation(string)
   pipelineBroadcast.value.pipeline.annotate(annotation)
   val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
   parseTree.toString
})

The proof lies in the logs:


19/02/23 18:48:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/23 18:48:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/23 18:48:45 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator tokenize
19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator ssplit
19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator pos
19/02/23 18:48:46 INFO MaxentTagger: Loading POS tagger from [...] ... done [0.6 sec].
19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator lemma
19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator parse
19/02/23 18:48:47 INFO ParserGrammar: Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
19/02/23 18:59:07 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1590 bytes result sent to driver
19/02/23 18:59:07 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1590 bytes result sent to driver
19/02/23 18:59:07 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1590 bytes result sent to driver

I’m not sure about the multi-threading capabilities of StanfordCoreNLP so it might turn out that the second “per partition” solution is superior performance-wise to the third. In any case, we reduced the number of tagging objects created from 60000 to three or one, not bad.

spaCy on PySpark

The PySpark version of Heckler will use spaCy (written in Cython/Python) as its NLP library instead of CoreNLP. From the perspective of a JVM aficionado, packaging in Python is odd in itself, and spaCy doesn’t seem to be very chatty. Therefore I created an initialization function that prints more log messages and addresses potential issues when running spaCy in a distributed environment, as its model files need to be present on every Spark executor.

As expected, the “bad” implementation of Heckler recreates one spaCy NLP model per input sentence as proven by this logging excerpt:


[Stage 0:>                                                          3 / 3]
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
[...] 

Inspired by the Scala edition of Heckler, the “per partition” PySpark solution initializes only three spaCy NLP objects during the application’s lifetime; the complete log file of that run is short:


[Stage 0:>                                                          (0 + 3) / 3]
^^ Using spaCy 2.0.18
^^ Using spaCy 2.0.18
^^ Using spaCy 2.0.18
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
^^ Created model en_core_web_sm
1500
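
For reference, a minimal sketch of the per-partition pattern in PySpark could look like the snippet below. It is illustrative only, not the author's actual Heckler code, and it assumes that spaCy and the en_core_web_sm model are installed on every executor:

from pyspark.sql import SparkSession
import spacy

def tag_partition(sentences):
    nlp = spacy.load("en_core_web_sm")            # one model load per partition
    for sentence in sentences:
        yield [(token.text, token.pos_) for token in nlp(sentence)]

spark = SparkSession.builder.appName("heckler-per-partition").getOrCreate()
sentences = spark.sparkContext.parallelize(["This is an example sentence."] * 1500, 3)
tagged = sentences.mapPartitions(tag_partition)
print(tagged.count())                             # 1500 sentences, only 3 model loads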

Finding failure messages

The functionality introduced in the previous paragraphs can be modified to facilitate the investigation of failed applications: The reason for a crash is often not immediately apparent and requires sifting through log files. Resource-intensive applications will create numerous log files (one per container/Spark executor) so search functionality along with deduplication and pattern matching logic should come in handy here: The function extract_errors from the AppParser class tries to deduplicate potential exceptions and error messages and will print them out in reverse chronological order. An exception or error message might occur several times during a run with slight variations (e.g., different timestamps or code line numbers) but the last occurrence is the most important one for debugging purposes since it might be the direct cause for the failure.


from typing import Deque, List, Tuple

# AppParser comes from the author's log-parsing package.
app_path = './data/application_1549675138635_0005'
app_parser = AppParser(app_path)
app_errors: Deque[Tuple[str, List[str]]] = app_parser.extract_errors()

for error in app_errors:
    print(error)

^^ Identified app path with log files
^^ Identified time format for log file: %y/%m/%d %H:%M:%S
^^ Warning: Not all tasks completed successfully: {(16.0, 9.0, 8), (16.1, 9.0, 8), (164.0, 9.0, 8), ...}
^^ Extracting task intervals
^^ Extracting stage intervals
^^ Extracting job intervals

Error messages found, most recent ones first:

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:35 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted.’, ‘org.apache.spark.SparkException: Job aborted.’, ‘at org.apache.spark.sql.execution.datasources.FileFormatWriteranonfun$write$1.apply(FileFormatWriter.scala:166)’, ‘at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)’, ‘at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)’, ‘at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)’, ‘at org.apache.spark.sql.execution.SparkPlananonfun$executeQuery$1.apply(SparkPlan.scala:138)’, ‘at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)’, ‘at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)’, ‘at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)’, ‘at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)’, ‘at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)’, ‘at org.apache.spark.sql.execution.SparkPlananonfun$executeQuery$1.apply(SparkPlan.scala:138)’, ‘at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)’, ‘at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)’, ‘at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)’, ‘at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)’, ‘at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)’, ‘at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)’, […] ‘… 48 more’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000124/stderr.gz’, [‘18/02/01 21:49:34 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN YarnAllocator: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 ERROR TaskSetManager: Task 30 in stage 9.0 failed 4 times; aborting job’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 ERROR YarnClusterScheduler: Lost executor 62 on ip-172-18-39-28.ec2.internal: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN TaskSetManager: Lost task 4.3 in stage 9.0 (TID 610, ip-172-18-39-28.ec2.internal, executor 62): ExecutorLostFailure (executor 62 exited caused by one of the running tasks) Reason: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])

(‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0’])

[…]

Each error record printed out in this fashion consists of two elements: the first one is a path to the source log in which the second element, the actual error chunk, was found. The error chunk is a single-line or multiline error message to which collapsing logic was applied, and it is stored in a list of strings. The application that threw the errors shown above seems to have crashed because problems occurred during a write phase, since classes like FileFormatWriter occur in the final stack trace that was produced by executor container_1549_0001_02_000002. It is likely that not enough output partitions were used when materializing output records to a storage layer. The total number of error chunks in all container log files associated with this application was more than 350; the deduplication logic of AppParser.extract_errors boiled this high number down to fewer than 20.
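
The real deduplication lives in AppParser.extract_errors; a minimal sketch of the underlying idea (the helper below is hypothetical, not the author's code) is to mask volatile numbers and keep only the latest occurrence of each resulting pattern:

import re
from typing import List

# Mask volatile digits such as timestamps, IDs and line numbers, and keep only
# the last occurrence of each resulting pattern. The real logic is more involved.
def dedupe_errors(error_lines: List[str]) -> List[str]:
    latest = {}
    for position, line in enumerate(error_lines):      # assumed chronological order
        key = re.sub(r'\d+', '#', line)                 # '21:49:34 ERROR ...' -> '##:##:## ERROR ...'
        latest[key] = (position, line)                  # later occurrences overwrite earlier ones
    # emit the surviving occurrences, most recent first
    return [line for _, line in sorted(latest.values(), reverse=True)]

errors = [
    '18/02/01 21:49:12 ERROR TaskSetManager: Task 30 in stage 9.0 failed 4 times',
    '18/02/01 21:49:34 ERROR TaskSetManager: Task 30 in stage 9.0 failed 4 times',
    '18/02/01 21:49:35 ERROR ApplicationMaster: User class threw exception',
]
print(dedupe_errors(errors))   # two entries: the 21:49:35 exception, then the 21:49:34 task failure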

The Slacker

The Slacker does exactly what the name suggests – not a lot. Let’s collect and investigate the maximum values of its most important metrics:


from typing import Dict

# ProfileParser again comes from the author's parsing package; the same combined
# JSON file holds records from both the JVM profiler and the PySpark profilers.
combined_file = './data/profile_slacker/CombinedCpuAndMemory.json.gz'  # Output from JVM & PySpark profilers

jvm_parser = ProfileParser(combined_file)
jvm_parser.manually_set_profiler('JVMProfiler')

pyspark_parser = ProfileParser(combined_file)
pyspark_parser.manually_set_profiler('pyspark')

jvm_maxima: Dict[str, float] = jvm_parser.get_maxima()
pyspark_maxima: Dict[str, float] = pyspark_parser.get_maxima()

print('JVM max values:')
print(jvm_maxima)
print('\nPySpark max values:')
print(pyspark_maxima)

The output is …


JVM max values:
{'ScavengeCollTime': 0.0013, 'MarkSweepCollTime': 0.00255, 'MarkSweepCollCount': 3.0, 'ScavengeCollCount': 10.0,
'systemCpuLoad': 0.64, 'processCpuLoad': 0.6189945167759597, 'nonHeapMemoryTotalUsed': 89.079,
'nonHeapMemoryCommitted': 90.3125, 'heapMemoryTotalUsed': 336.95, 'heapMemoryCommitted': 452.0}

PySpark max values:
{'pmem_rss': 78.50390625, 'pmem_vms': 4448.35546875, 'cpu_percent': 0.4}

These are low values given the baseline overhead of running Spark, especially when comparing them to the profiles for Fatso above – for example, only 13 GC events happened and the peak CPU load for the entire run was less than 65%. Visualizing all CPU data points shows that these maxima occurred at the beginning and end of the application, when there is always a lot of initialization and cleanup work going on regardless of the actual code being executed (bigger version here):

CPU Usage Chart

So the system is almost idle for the majority of the time. The Slacker in this pure form is a rare sight; in real-life workloads, slacking most likely occurs in stages that interact with an external system – for example, querying a database for records that will be joined with Datasets/RDDs later on, or materializing output records to a storage layer like HDFS without enough write partitions. A combined flame graph of JVM and Python stack traces will reveal the slacking part:

Click for Full Size

In the first plateau, which is also the longest, two custom Python functions sit at the top. After inspecting their implementation here and there, the low system utilization should not be surprising anymore: The second function from the top, job_slacker.py:slacking, is basically a simple loop that calls a function helper.py:secondsSleep from an external helper package many times. This function has a sample presence of almost 20% (seen in the original) and, since it sits atop the plateau, is the one executed by the CPU most of the time. As its name suggests, it causes the program to sleep for one second. So Slacker is essentially a 10-minute-long system sleep. In real-world modern data stack applications that have slacking phases, we can expect the top of many plateaus to be occupied by “write” functions like FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala) or by functions related to DB queries.

Source Code & Links & Goodies

Quite a lot of things were discussed and code exists that implements every idea presented:

Riddle source code

The Spark source code for the riddles is located in this repository:

The PySpark editions, along with all log and profile parsing logic, can be found in the main repo:

Profiling the JVM

Uber's recently open-sourced JVM profiler isn't the first of its kind, but it has a number of features that are very handy for the cases described in this article: It is "non-invasive," so source code doesn't need to be changed at all in order to collect metrics. Any JVM can be profiled, which means that this project is suitable for tracking a Spark master as well as its associated executors. Internally, this profiler uses the java.lang.management interface that was introduced with Java 1.5 and accesses several Bean objects.
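
To make that last point more concrete, here is a minimal sketch of mine (not code from Uber's profiler) that reads the same kinds of beans such a profiler samples: heap usage, GC counts and times, and CPU load. The CPU load getters come from the com.sun.management extension of the OS bean, which I am assuming is available (it is on a HotSpot JDK 8):

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object BeanSnapshot {
  def main(args: Array[String]): Unit = {
    // Heap usage, comparable to heapMemoryTotalUsed/heapMemoryCommitted in the profiler output
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    println(s"heap used: ${heap.getUsed / 1024.0 / 1024.0} MB, committed: ${heap.getCommitted / 1024.0 / 1024.0} MB")

    // One bean per collector, e.g. PS Scavenge and PS MarkSweep on a default JDK 8 heap
    ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { gc =>
      println(s"${gc.getName}: count=${gc.getCollectionCount}, time=${gc.getCollectionTime} ms")
    }

    // processCpuLoad / systemCpuLoad live in the com.sun.management extension of the OS bean
    ManagementFactory.getOperatingSystemMXBean match {
      case os: com.sun.management.OperatingSystemMXBean =>
        println(s"processCpuLoad=${os.getProcessCpuLoad}, systemCpuLoad=${os.getSystemCpuLoad}")
      case _ => println("extended OS bean not available on this JVM")
    }
  }
}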

When running the riddles mentioned above in local mode, only one JVM is launched and the master subsumes the executors, so only a --conf spark.driver.extraJavaOptions= has to be added to the launch command; a distributed application also requires a second --conf spark.executor.extraJavaOptions=. Full launch commands are included in my project's repo.

Phil's PySpark profilers

I implemented three custom PySpark profilers here which should provide functionality similar to that of Uber's JVM profiler for a PySpark user: a CPU/memory profiler, a stack profiler for creating PySpark flame graphs, and a combination of the two. Details about how to integrate them into an application can be found in the project's readme file.

Going distributed

If the JVM profiler is used as described above, three different types of records are generated which, in case the FileOutputReporter is used, are written to three separate JSON files: ProcessInfo.json, CpuAndMemory.json, and Stacktrace.json. The ProcessInfo.json file contains meta information and is not used in this article. Similarly, my PySpark profilers will create one or two different types of output records that are stored in at least two JSON files with the pattern s_X_stack.json or s_X_cpumem.json when sparkContext.dump_profiles(dump_path) is called. If sparkContext.show_profiles() is used instead, all profile records are written to the standard output.

In a distributed/cloud environment, Uber’s and my FileOutputReporter might not be able to create output files on storage systems like HDFS or S3 so the profiler records might need to be written to the standard output files (stdout.gz) instead. Since the design of the profile and application parsers in parsers.py is compositional, this is not a problem. A demonstration of how to extract both metrics and scheduling info from all standard output files belonging to an application is here.

When launching a distributed application, Spark executors run on multiple nodes in a cluster and produce several log files, one per executor/container. In a cloud environment like AWS, these log files will be organized in the following structure:


s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_001/
                                                                                         stderr.gz
                                                                                         stdout.gz
s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_002/
                                                                                         stderr.gz
                                                                                         stdout.gz
[...]
s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_N/
                                                                                         stderr.gz
                                                                                         stdout.gz

[...]

s3://aws-logs/elasticmapreduce/clusterid-M/containers/application_K_0001/container_K_L/
                                                                                         stderr.gz
                                                                                         stdout.gz

An EMR cluster like clusterid-1 might run several Spark applications consecutively, each one as its own step. Each application launched a number of containers; application_1_0001, for example, launched executors container_1_001, container_1_002, …, container_1_N. Each of these containers created a standard error and a standard output file on S3. In order to analyze a particular application like application_1_0001 above, all of its associated log files like …/application_1_0001/container_1_001/stderr.gz and …/application_1_0001/container_1_001/stdout.gz are needed. The easiest way is to collect all files under the application folder using a command like …


aws s3 cp --recursive s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/ ./application_1_0001/

… and then to create an AppParser object like


from parsers import AppParser
app_path = './application_1_0001/'  # path to the application directory downloaded from s3 above
app_parser = AppParser(app_path)

This object creates a number of SparkLogParser objects internally (one for each container) and automatically identifies the “master” log file created by the Spark driver (likely located under application_1_0001/container_1_001/). Several useful functions are now made available by the app_parser object; examples can be found in this script and more detailed explanations are in the readme file.

The post 4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler appeared first on Unravel.

]]>
https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/feed/ 0
Data Structure Zoo https://www.unraveldata.com/data-structure-zoo/ https://www.unraveldata.com/data-structure-zoo/#respond Thu, 13 Feb 2020 23:35:12 +0000 https://www.unraveldata.com/?p=4338

Solving a problem programmatically often involves grouping data items together so they can be conveniently operated on or copied as a single unit – the items are collected in a data structure. Many different data structures […]

The post Data Structure Zoo appeared first on Unravel.

]]>

Solving a problem programmatically often involves grouping data items together so they can be conveniently operated on or copied as a single unit – the items are collected in a data structure. Many different data structures have been designed over the past decades, some store individual items like phone numbers, others store more complex objects like name/phone number pairs. Each has strengths and weaknesses and is more or less suitable for a specific use case. In this article, I will describe and attempt to classify some of the most popular data structures and their actual implementations on three different abstraction levels starting from a Platonic ideal and ending with actual code that is benchmarked:

  • Theoretical level: Data structures/collection types are described irrespective of any concrete implementation and the asymptotic behavior of their core operations are listed.
  • Implementation level: It will be shown how the container classes of a specific programming language relate to the data structures introduced at the previous level – e.g., despite their name similarity, Java’s Vector is different from Scala’s or Clojure’s Vector implementation. In addition, asymptotic complexities of core operations will be provided per implementing class.
  • Empirical level: Two aspects of the efficiency of data structures will be measured: The memory footprints of the container classes will be determined under different configurations. The runtime performance of operations will be measured which will show to what extent asymptotic advantages manifest themselves in concrete scenarios and what the relative performances of asymptotically equal structures are.

Theoretical Level

Before providing actual speed and space measurement results in the third section, execution time and space can be described in an abstract way as a function of the number of items that a data structure might store. This is traditionally done via Big O notation and the following abbreviations are used throughout the tables:

  • C is constant time, O(1)
  • aC is amortized constant time
  • eC is effective constant time
  • Log is logarithmic time, O(log n)
  • L is linear time, O(n)

The green, yellow or red background colors in the table cells will indicate how “good” the time complexity of a particular data structure/operation combination is relative to the other combinations.

Click for Full Size

The first five entries of Table 1 are linear data structures: They have a linear ordering and can only be traversed in one way. By contrast, Trees can be traversed in different ways, they consist of hierarchically linked data items that each have a single parent except for the root item. Trees can also be classified as connected graphs without cycles; a data item (= node or vertex) can be connected to more than two other items in a graph.

Data structures provide many operations for manipulating their elements. The most important ones are the following four core operations which are included above and studied throughout this article:

  • Access: Read an element located at a certain position
  • Search: Search for a certain element in the whole structure
  • Insertion: Add an element to the structure
  • Deletion: Remove a certain element

Table 1 includes two probabilistic data structures, Bloom Filter and Skip List.

Implementation Level – Java & Scala Collections Framework

The following table classifies almost all members of both the official Java Collection and Scala Collection libraries in addition to a number of relevant classes like Array or String that are not canonical members. The actual class names are placed in the second column, a name that starts with im. or m. refers to a Scala class, other prefixes refer to Java classes. The fully qualified class names are shortened by using the following abbreviations:

  • u. stands for the package java.util
  • c. stands for the package java.util.concurrent
  • lang. stands for the package java.lang
  • m. stands for the package scala.collection.mutable
  • im. stands for the package scala.collection.immutable

The actual method names and logic of the four core operations (Access, Search, Insertion and Deletion) are dependent on a concrete implementation. In the table below, these method names are printed right before the asymptotic times in italics (they will also be used in the core operation benchmarks later). For example: Row number eleven describes the implementation u.ArrayList (second column), which refers to the Java collection class java.util.ArrayList. In order to access an item at a particular position (fourth column, Random Access), the method get can be called on an object of the ArrayList class with an integer argument that indicates the position. A particular element can be searched for with the method indexOf, and an item can be added or deleted via add or remove. Scala's closest equivalent is the class scala.collection.mutable.ArrayBuffer, which is described two rows below ArrayList: To retrieve the element in the third position from an ArrayBuffer, Scala's apply method can be used, which allows an object to be used in function notation, so we would write val thirdElement = bufferObject(2). Searching for an item can be done via find, and appending or removing an element from an ArrayBuffer is possible by calling the methods += and -= respectively.
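
As a quick illustration of these method names, the four core operations look as follows for java.util.ArrayList and scala.collection.mutable.ArrayBuffer (a toy snippet of mine; element values are chosen arbitrarily):

import java.util.ArrayList
import scala.collection.mutable.ArrayBuffer

val javaList = new ArrayList[String]()
javaList.add("a"); javaList.add("b"); javaList.add("c")   // insertion via add
val second = javaList.get(1)                              // random access via get
val position = javaList.indexOf("c")                      // search via indexOf
javaList.remove("a")                                      // deletion via remove

val buffer = ArrayBuffer("a", "b", "c")
val third = buffer(2)                                     // random access via apply
val found = buffer.find(_ == "b")                         // search via find
buffer += "d"                                             // insertion via +=
buffer -= "a"                                             // deletion via -=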

Click for Full Size

Subclass and wrapping relationships between two classes are represented via <e) and <w). For example, the class java.util.Stack extends java.util.Vector and the Scala class scala.collection.mutable.StringBuilder wraps the Java class java.lang.StringBuilder in order to provide idiomatic functions and additional operations.

General features of Java & Scala structures

Several collection properties are not explicitly represented in the table above since they either apply to almost all elements or a simple rule exists:

Almost all data structures that store key/value pairs have the characters Map as part of their class name in the second column. The sole exception to this naming convention is java.util.Hashtable which is a retrofitted legacy class born before Java 2 that also stores key/value pairs.

Almost all Java Collections are mutable: They can be destroyed, elements can be removed from or added to them, and their data values can be modified in place; mutable structures can therefore lose their original/previous state. By contrast, Scala provides a dedicated immutable package (scala.collection.immutable) whose members, in contrast to scala.collection.mutable and the Java collections, cannot be changed in place. All members of this immutable package are also persistent: Modifications will produce an updated version via structural sharing and/or path copying while also preserving the original version. Examples of immutable but non-persistent data structures from third party providers are mentioned below.
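
A small illustration of this persistence property: updating an immutable Vector returns a new version that shares most of its structure with the old one, while the original remains intact:

val v1 = scala.collection.immutable.Vector(1, 2, 3)
val v2 = v1.updated(1, 42)   // new version, created via structural sharing

println(v1)   // Vector(1, 2, 3)  -- the original is preserved
println(v2)   // Vector(1, 42, 3)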

Mutability can lead to problems when concurrency comes into play. Most classes in Table 2 that do not have the prefix c. (abbreviating the package java.util.concurrent) are unsynchronized. In fact, one of the design decisions made in the Java Collections Framework was to not synchronize most members of the java.util package since single-threaded or read-only uses of data structures are pervasive. In case synchronization for these classes is required, java.util.Collections provides a cascade of synchronized* methods that accept a given collection and return a synchronized, thread-safe version.
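
For example, an unsynchronized java.util.ArrayList can be wrapped into a thread-safe view like this (a minimal snippet of mine showing one of those synchronized* methods):

import java.util.{ArrayList, Collections, List => JList}

val plain: JList[Integer] = new ArrayList[Integer]()                   // unsynchronized by default
val threadSafe: JList[Integer] = Collections.synchronizedList(plain)   // synchronized wrapper view
threadSafe.add(Integer.valueOf(42))                                    // safe to call from multiple threads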

Due to the nature of immutability, the (always unsynchronized) immutable structures in Table 2 are thread-safe.

All entries in Table 2 are eager except for scala.collection.immutable.Stream which is a lazy list that only computes elements that are accessed.

Java supports the eight primitive data types byte, short, int, long, float, double, boolean and char. Things are a bit more complicated with Scala but the same effectively also applies there at the bytecode level. Both languages provide primitive and object arrays but the Java and Scala Collection libraries are object collections which always store object references: When primitives like 3 or 2.3F are inserted, the values get autoboxed so the respective collections will hold a reference to numeric objects (a wrapper class like java.lang.Integer) and not the primitive values themselves:

int[] javaArrayPrimitive = new int[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
Integer[] javaArrayObject = new Integer[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
// javaArrayPrimitive occupies just 64 Bytes, javaArrayObject 240 Bytes

List<Integer> javaList1 = new ArrayList<>(11); // initial capacity of 11
List<Integer> javaList2 = new ArrayList<>(11);
for (int i : javaArrayPrimitive)
javaList1.add(i);
for (int i : javaArrayObject)
javaList2.add(i);
// javaList1 is 264 bytes in size now as is javaList2

Similar results for Scala:

val scalaArrayPrimitive = Array[Int](1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
val scalaArrayObject = scalaArrayPrimitive.map(new java.lang.Integer(_))
// scalaArrayPrimitive occupies just 64 Bytes, scalaArrayObject 240 Bytes

val scalaBuffer1 = scalaArrayPrimitive.toBuffer
val scalaBuffer2 = scalaArrayObject.toBuffer
// scalaBuffer1 is 264 bytes in size now as is scalaBuffer2

Several third party libraries provide primitive collection support on the JVM allowing the 8 primitives mentioned above to be directly stored in data structures. This can have a big impact on the memory footprint – the creators of Apache Spark recommend in their official tuning guide to

Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.

We will see below whether FastUtil is really the most suitable alternative.
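
To show what using such a primitive collection looks like, here is a minimal snippet of mine based on fastutil's IntArrayList, which keeps raw ints in a backing int[] instead of boxed Integer objects:

import it.unimi.dsi.fastutil.ints.IntArrayList

val primitives = new IntArrayList()                  // backed by an int[], no boxing
(1 to 11).foreach(i => primitives.add(i))            // add(int) stores the raw primitive
val third: Int = primitives.getInt(2)                // read back without unboxing a wrapper
println(s"size=${primitives.size()}, third=$third")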

Empirical Level

Hardly any concrete memory sizes or runtime numbers have been mentioned so far; these two measurements are in fact very different: Estimating memory usage is a deterministic task compared to measuring runtime performance, since the latter might be influenced by several non-deterministic factors, especially when operations run on an adaptive virtual machine that performs online optimizations.

Memory measurements for JVM objects

Determining the memory footprint of a complex object is far from trivial since JVM languages don't provide a direct API for that purpose. Apache Spark has an internal function that implements the suggestions of an older JavaWorld article. I ported the code and modified it a bit here so this memory-measuring functionality can be conveniently used outside of Spark:

val objectSize = JvmSizeEstimator.estimate(new Object())
println(objectSize) // will print 16 since one flat object instance occupies 16 bytes

Measurements for the most important classes from Table 2 with different element types and element sizes are shown below. The number of elements will be 0, 1, 4, 16, 64, 100, 256, 1024, 4096, 10000, 16192, 65536, 100000, 262144, 1048576, 4194304, 10000000, 33554432 and 50000000 in all configurations. For data structures that store individual elements, the two element types are int and String. For structures operating with key/value pairs, the combinations int/int and float/String will be used. The raw sizes of these element types are 4 bytes in the case of an individual int or float (16 bytes in boxed form) and, since all Strings used here will be 8 characters long, 56 bytes per String object.

The same package abbreviations as in Table 2 above will be used for the Java/Scala classes under measurement. In addition, some classes from the following 3rd party libraries are also used in their latest edition at the time of writing:

Concerning the environment, jdk1.8.0_171.jdk on MacOS High Sierra 10.13 was used. The JVM flag +UseCompressedOops can affect object memory sizes and was enabled here; it is enabled by default in Java 8.

Measurements of single element structures

Below are the measurement results for the various combinations; every cell contains the object size in bytes for the particular data structure in the corresponding row, filled with the number of elements indicated in the column. Some mutable classes provide the option to specify an initial capacity at construction time, which can sometimes lead to a smaller overall object footprint after the structure is filled up. I included an additional + capacity row in cases where the data structure in the previous row provides such an option and a difference could be measured.

Java/Scala structures storing integers:

Java/Scala structures storing strings:

3rd party structures storing integers:

3rd party structures storing strings:

Measurements for key/value structures:

For some mutable key/value structures like Java’s HashMap, a load factor that determines when to rehash can be specified in addition to an initial capacity. Similar to the logic in the previous tables, a row with + capacity will indicate that the data structure from the previous row was initialized using a capacity.
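
To make the + capacity rows concrete, this is how such pre-sized instances are constructed (a small snippet of mine; the capacity and load factor values are just examples):

import java.util.{ArrayList, HashMap}

// Pre-sizing avoids intermediate growth steps, so the backing array/table
// never overshoots what is actually needed, which can shrink the final footprint.
val sizedList = new ArrayList[Integer](10000)                // initial capacity only
val sizedMap  = new HashMap[String, Integer](10000, 0.75f)   // initial capacity + load factor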

Java/Scala structures storing integer/integer pairs:

Java/Scala structures storing strings/float pairs:

3rd party structures storing integer/integer pairs:

3rd party structures storing strings/float pairs:

The post Data Structure Zoo appeared first on Unravel.

]]>
https://www.unraveldata.com/data-structure-zoo/feed/ 0
Rebuilding Reliable Modern Data Pipelines Using AI and DataOps https://www.unraveldata.com/resources/rebuilding-reliable-modern-data-pipelines-using-ai-and-dataops/ https://www.unraveldata.com/resources/rebuilding-reliable-modern-data-pipelines-using-ai-and-dataops/#respond Mon, 10 Feb 2020 19:47:32 +0000 https://www.unraveldata.com/?p=8036 Cloud Pastel Background

Organizations today are building strategic applications using a wealth of internal and external data. Unfortunately, data-driven applications that combine customer data from multiple business channels can fail for many reasons. Identifying the cause and finding a […]

The post Rebuilding Reliable Modern Data Pipelines Using AI and DataOps appeared first on Unravel.

]]>
Cloud Pastel Background

Organizations today are building strategic applications using a wealth of internal and external data. Unfortunately, data-driven applications that combine customer data from multiple business channels can fail for many reasons. Identifying the cause and finding a fix is both challenging and time-consuming. With this practical ebook, DevOps personnel and enterprise architects will learn the processes and tools required to build and manage modern data pipelines.

Ted Malaska, Director of Enterprise Architecture at Capital One, examines the rise of modern data applications and guides you through a complete data operations framework. You’ll learn the importance of testing and monitoring when planning, building, automating, and managing robust data pipelines in the cloud, on premises, or in a hybrid configuration.

Plan, migrate, and operate modern data stack workloads and data pipelines using cloud-based and hybrid deployment models

  • Learn how performance management software can reduce the risk of running modern data applications
  • Take a deep dive into the components that comprise a typical data processing job
  • Use AI to provide insights, recommendations, and automation when operationalizing modern data systems and data applications
  • Plan, migrate, and operate modern data stack workloads and data pipelines using cloud-based and hybrid deployment models

The post Rebuilding Reliable Modern Data Pipelines Using AI and DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/rebuilding-reliable-modern-data-pipelines-using-ai-and-dataops/feed/ 0
Benchmarks, Data Spark and Graal https://www.unraveldata.com/benchmarks-spark-and-graal/ https://www.unraveldata.com/benchmarks-spark-and-graal/#respond Mon, 20 Jan 2020 15:30:56 +0000 https://www.unraveldata.com/?p=4167

This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website. A very important question is how long something takes and the answer to that […]

The post Benchmarks, Data Spark and Graal appeared first on Unravel.

]]>

This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website.

A very important question is how long something takes and the answer to that is fairly straightforward in normal life: Check the current time, then perform the unit of work that should be measured, then check the time again. The end time minus the start time would equal the amount of time that the task took, the elapsed time or wallclock time. The programmatic version of this simple measuring technique could look like

def measureTime[A](computation: => A): Long = {
  val startTime = System.currentTimeMillis()
  computation  // force evaluation of the by-name parameter
  val endTime = System.currentTimeMillis()
  val elapsedTime = endTime - startTime
  elapsedTime
}

In the case of Apache Spark, the computation would likely be of type Dataset[_] or RDD[_]. In fact, the two third party benchmarking frameworks for Spark mentioned below are based on a function similar to the one shown above for measuring the execution time of a Spark job.
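
For illustration, here is a minimal usage sketch of mine that times a simple Spark job with the function above; an action such as count() has to be triggered inside the measured block, because Dataset/RDD transformations are lazy and would otherwise only build a query plan:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("timing-demo")
  .master("local[*]")
  .getOrCreate()

val elapsed = measureTime {
  spark.range(0L, 100000000L).count()   // the action forces the actual work
}
println(s"count over 100 million rows took $elapsed ms")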

It is surprisingly hard to accurately predict how long something will take in programming: The result from a single invocation of the naive method above is likely not very reliable since numerous non-deterministic factors can interfere with a measurement, especially when the underlying runtime, like the Java Virtual Machine, applies dynamic optimizations. Even the usage of a dedicated microbenchmarking framework like JMH only solves part of the problem – the user is reminded of that caveat every time JMH completes:


[info] REMEMBER: The numbers below are just data. To gain reusable 
[info] insights, you need to follow up on why the numbers are the 
[info] way they are. Use profilers (see -prof, -lprof), design 
[info] factorial experiments, perform baseline and negative tests 
[info] that provide experimental control, make sure the 
[info] benchmarking environment is safe on JVM/OS/HW level, ask 
[info] for reviews from the domain experts. Do not assume the 
[info] numbers tell you what you want them to tell.

From the Apache Spark creators: spark-sql-perf

If benchmarking a computation on a local machine is already hard, then doing this for a distributed computation/environment should be very hard. spark-sql-perf is the official performance testing framework for Spark 2. The following twelve benchmarks are particularly interesting since they target various features and APIs of Spark; they are organized into three classes:

DatasetPerformance compares the same workloads expressed via RDD, Dataframe and Dataset API:

range just creates 100 million integers of datatype Long which are wrapped in a case class in the case of RDDs and Datasets
filter applies four consecutive filters to 100 million Longs
map applies an increment operation to 100 million Longs four times
average computes the average of one million Longs using a user-defined function for Datasets, a built-in SQL function for DataFrames and a map/reduce combination for RDDs.

JoinPerformance is based on three data sets with one million, 100 million and one billion Longs:
singleKeyJoins: joins each one of the three basic data sets in a left table with each one of the three basic data sets in a right table via four different join types (Inner Join, Right Join, Left Join and Full Outer Join)
varyDataSize: joins two tables consisting of 100 million integers each with a ‘data’ column containing Strings of 5 different lengths (1, 128, 256, 512 and 1024 characters)
varyKeyType: joins two tables consisting of 100 million integers and casts it to four different data types (String, Integer, Long and Double)
varyNumMatches

AggregationPerformance:

varyNumGroupsAvg and twoGroupsAvg both compute the average of one table column and group them by the other column. They differ in the cardinality and shape of the input tables used.
The next two aggregation benchmarks use the three data sets that are also used in the DataSetPerformance benchmark described above:
complexInput: For each of the three integer tables, adds a single column together nine times
aggregates: aggregates a single column via four different aggregation types (SUM, AVG, COUNT and STDDEV)

Running these benchmarks produces… almost nothing; most of them are broken or will crash in the current state of the official master branch due to various problems (issues with reflection, missing table registrations, wrong UDF pattern, …):

$ bin/run --benchmark AggregationPerformance
[...]
[error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.AggregationPerformance
[error] at java.lang.Class.newInstance(Class.java:427)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
[error] at scala.util.Try.getOrElse(Try.scala:79)
[...]
$ bin/run --benchmark JoinPerformance
[...]
[error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.JoinPerformance
[error] at java.lang.Class.newInstance(Class.java:427)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
[error] at scala.util.Try.getOrElse(Try.scala:79)

I repaired these issues and was able to run all twelve benchmarks successfully; the fixed edition can be downloaded from my personal repo here, and a PR was also submitted. Enough complaints. The first results generated via

$ bin/run --benchmark DatasetPerformance

that compare the same workloads expressed in RDD, Dataset and Dataframe APIs are:

name |minTimeMs| maxTimeMs| avgTimeMs| stdDev
-------------------------|---------|---------|---------|---------
DF: average | 36.53 | 119.91 | 56.69 | 32.31
DF: back-to-back filters | 2080.06 | 2273.10 | 2185.40 | 70.31
DF: back-to-back maps | 1984.43 | 2142.28 | 2062.64 | 62.38
DF: range | 1981.36 | 2155.65 | 2056.18 | 70.89
DS: average | 59.59 | 378.97 | 126.16 | 125.39
DS: back-to-back filters | 3219.80 | 3482.17 | 3355.01 | 88.43
DS: back-to-back maps | 2794.68 | 2977.08 | 2890.14 | 59.55
DS: range | 2000.36 | 3240.98 | 2257.03 | 484.98
RDD: average | 20.21 | 51.95 | 30.04 | 11.31
RDD: back-to-back filters| 1704.42 | 1848.01 | 1764.94 | 56.92
RDD: back-to-back maps | 2552.72 | 2745.86 | 2678.29 | 65.86
RDD: range | 593.73 | 689.74 | 665.13 | 36.92

This is rather surprising and counterintuitive given the focus of the architecture changes and performance improvements in Spark 2 – the RDD API performs best (= lowest numbers in the fourth column) for three out of four workloads; Dataframes only outperform the two other APIs in the back-to-back maps benchmark, with 2062 ms versus 2890 ms in the case of Datasets and 2678 ms in the case of RDDs.

The results for the two other benchmark classes are as follows:

bin/run --benchmark AggregationPerformance
name | minTimeMs | maxTimeMs | avgTimeMs | stdDev
------------------------------------|-----------|-----------|-----------|--------
aggregation-complex-input-100milints| 19917.71 |23075.68 | 21604.91 | 1590.06
aggregation-complex-input-1bilints | 227915.47 |228808.97 | 228270.96 | 473.89
aggregation-complex-input-1milints | 214.63 |315.07 | 250.08 | 56.35
avg-ints10 | 213.14 |1818.041 | 763.67 | 913.40
avg-ints100 | 254.02 |690.13 | 410.96 | 242.38
avg-ints1000 | 650.53 |1107.93 | 812.89 | 255.94
avg-ints10000 | 2514.60 |3273.21 | 2773.66 | 432.72
avg-ints100000 | 18975.83 |20793.63 | 20016.33 | 937.04
avg-ints1000000 | 233277.99 |240124.78 | 236740.79 | 3424.07
avg-twoGroups100000 | 218.86 |405.31 | 283.57 | 105.49
avg-twoGroups1000000 | 194.57 |402.21 | 276.33 | 110.62
avg-twoGroups10000000 | 228.32 |409.40 | 303.74 | 94.25
avg-twoGroups100000000 | 627.75 |733.01 | 673.69 | 53.88
avg-twoGroups1000000000 | 4773.60 |5088.17 | 4910.72 | 161.11
avg-twoGroups10000000000 | 43343.70 |47985.57 | 45886.03 | 2352.40
single-aggregate-AVG-100milints | 20386.24 |21613.05 | 20803.14 | 701.50
single-aggregate-AVG-1bilints | 209870.54 |228745.61 | 217777.11 | 9802.98
single-aggregate-AVG-1milints | 174.15 |353.62 | 241.54 | 97.73
single-aggregate-COUNT-100milints | 10832.29 |11932.39 | 11242.52 | 601.00
single-aggregate-COUNT-1bilints | 94947.80 |103831.10 | 99054.85 | 4479.29
single-aggregate-COUNT-1milints | 127.51 |243.28 | 166.65 | 66.36
single-aggregate-STDDEV-100milints | 20829.31 |21207.90 | 20994.51 | 193.84
single-aggregate-STDDEV-1bilints | 205418.40 |214128.59 | 211163.34 | 4976.13
single-aggregate-STDDEV-1milints | 181.16 |246.32 | 205.69 | 35.43
single-aggregate-SUM-100milints | 20191.36 |22045.60 | 21281.71 | 969.26
single-aggregate-SUM-1bilints | 216648.77 |229335.15 | 221828.33 | 6655.68
single-aggregate-SUM-1milints | 186.67 |1359.47 | 578.54 | 676.30
bin/run --benchmark JoinPerformance
name |minTimeMs |maxTimeMs |avgTimeMs |stdDev
------------------------------------------------|----------|----------|----------|--------
singleKey-FULL OUTER JOIN-100milints-100milints | 44081.59 |46575.33 | 45418.33 |1256.54
singleKey-FULL OUTER JOIN-100milints-1milints | 36832.28 |38027.94 | 37279.31 |652.39
singleKey-FULL OUTER JOIN-1milints-100milints | 37293.99 |37661.37 | 37444.06 |192.69
singleKey-FULL OUTER JOIN-1milints-1milints | 936.41 |2509.54 | 1482.18 |890.29
singleKey-JOIN-100milints-100milints | 41818.86 |42973.88 | 42269.81 |617.71
singleKey-JOIN-100milints-1milints | 20331.33 |23541.67 | 21692.16 |1660.02
singleKey-JOIN-1milints-100milints | 22028.82 |23309.41 | 22573.63 |661.30
singleKey-JOIN-1milints-1milints | 708.12 |2202.12 | 1212.86 |856.78
singleKey-LEFT JOIN-100milints-100milints | 43651.79 |46327.19 | 44658.37 |1455.45
singleKey-LEFT JOIN-100milints-1milints | 22829.34 |24482.67 | 23633.77 |827.56
singleKey-LEFT JOIN-1milints-100milints | 32674.77 |34286.75 | 33434.05 |810.04
singleKey-LEFT JOIN-1milints-1milints | 682.51 |773.95 | 715.53 |50.73
singleKey-RIGHT JOIN-100milints-100milints | 44321.99 |45405.85 | 44965.93 |570.00
singleKey-RIGHT JOIN-100milints-1milints | 32293.54 |32926.62 | 32554.74 |330.73
singleKey-RIGHT JOIN-1milints-100milints | 22277.12 |24883.91 | 23551.74 |1304.34
singleKey-RIGHT JOIN-1milints-1milints | 683.04 |935.88 | 768.62 |144.85

 

From Phil: Spark & JMH

The surprising results from the DatasetPerformance benchmark above should make us skeptical – probably the benchmarking code or setup itself is to blame for the odd measurement, not the actual Spark APIs. A popular and quasi-official benchmarking framework for languages targeting the JVM is JMH so why not use it for the twelve Spark benchmarks? I “translated” them into JMH versions here and produced new results, among them the previously odd DatasetPerformance cases:

Phils-MacBook-Pro: pwd
/Users/Phil/IdeaProjects/jmh-spark
Phils-MacBook-Pro: ls
README.md benchmarks build.sbt project src target

Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs1
[...]
Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs2
Benchmark (start) Mode Cnt Score Error Units
Bench_APIs1.rangeDataframe 1 avgt 20 2618.631 ± 59.210 ms/op
Bench_APIs1.rangeDataset 1 avgt 20 1646.626 ± 33.230 ms/op
Bench_APIs1.rangeDatasetJ 1 avgt 20 2069.763 ± 76.444 ms/op
Bench_APIs1.rangeRDD 1 avgt 20 448.281 ± 85.781 ms/op
Bench_APIs2.averageDataframe 1 avgt 20 24.614 ± 1.201 ms/op
Bench_APIs2.averageDataset 1 avgt 20 41.799 ± 2.012 ms/op
Bench_APIs2.averageRDD 1 avgt 20 12.280 ± 1.532 ms/op
Bench_APIs2.filterDataframe 1 avgt 20 2395.985 ± 36.333 ms/op
Bench_APIs2.filterDataset 1 avgt 20 2669.160 ± 81.043 ms/op
Bench_APIs2.filterRDD 1 avgt 20 2776.382 ± 62.065 ms/op
Bench_APIs2.mapDataframe 1 avgt 20 2020.671 ± 136.371 ms/op
Bench_APIs2.mapDataset 1 avgt 20 5218.872 ± 177.096 ms/op
Bench_APIs2.mapRDD 1 avgt 20 2957.573 ± 26.458 ms/op

These results are more in line with expectations: Dataframes perform best in two out of four benchmarks, while the Spark-internal functionality used for the other two (average and range) indeed favours RDDs.
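
For readers who want to see the shape of such a JMH benchmark, here is a hedged sketch of mine (class, method and setting names are illustrative, not copied from my repo); the sbt-jmh plugin turns any class with @Benchmark methods into a runnable benchmark:

import java.util.concurrent.TimeUnit
import org.apache.spark.sql.SparkSession
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class RangeBenchSketch {

  var spark: SparkSession = _

  @Setup
  def setUp(): Unit = {
    spark = SparkSession.builder().master("local[*]").appName("jmh-sketch").getOrCreate()
  }

  @TearDown
  def tearDown(): Unit = spark.stop()

  // Returning the count keeps the JIT from optimizing the work away,
  // and count() forces the lazy Dataset to be evaluated inside the measurement.
  @Benchmark
  def rangeDataset(): Long = spark.range(0L, 100000000L).count()
}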

From IBM: spark-bench
To be published

From CERN:
To be published

Enter GraalVM

Most computer programs nowadays are written in higher-level languages so humans can create them faster and understand them more easily. But since a machine can only “understand” numerical languages, these high-level artifacts cannot be executed directly by a processor, so typically one or more additional steps are required before a program can be run. Some programming languages transform the user's source code into an intermediate representation which then gets compiled into or interpreted as machine code. The languages of interest in this article (roughly) follow this strategy: The programmer only writes high-level source code which is then automatically transformed into a file ending in .class that contains platform-independent bytecode. This bytecode is further compiled down to machine code by the Java Virtual Machine, while hardware-specific aspects are fully taken care of and, depending on the compiler used, optimizations are applied. Finally, this machine code is executed in the JVM runtime.

One of the most ambitious software projects of the past years has probably been the development of a general-purpose virtual machine, Oracle's Graal project, “one VM to rule them all.” There are several aspects to this technology; two of the highlights are the goal of providing seamless interoperability between (JVM and non-JVM) programming languages while running them efficiently on the same JVM, and a new, high-performance Java compiler. Twitter already uses the enterprise edition in production and saves around 8% of CPU utilization. The Community edition can be downloaded and used for free; more details below.

Graal and Scala

Graal works at the bytecode level. In order to run Scala code via Graal, I created a toy example that is inspired by the benchmarks described above: The source code snippet below creates 10 million integers, increments each number by one, removes all odd elements and finally sums up all of the remaining even numbers. These four operations are repeated 100 times and during each step the execution time and the sum (which stays the same across all 100 iterations) are printed out. Before the program terminates, the total run time will also be printed. The following source code implements this logic – not in the most elegant way but with optimization potential for the final compiler phase where Graal will come into play:

object ProcessNumbers {
 // Helper functions:
 def increment(xs: Array[Int]) = xs.map(_ + 1)
 def filterOdd(xs: Array[Int]) = xs.filter(_ % 2 == 0)
 def sum(xs: Array[Int]) = xs.foldLeft(0L)(_ + _)
 // Main part:
 def main(args: Array[String]): Unit = {
   var totalTime = 0L
   // loop 100 times
  for (iteration <- 1 to 100) {
     val startTime = System.currentTimeMillis()
     val numbers = (0 until 10000000).toArray
     // transform numbers and sum them up
     val incrementedNumbers = increment(numbers)
     val evenNumbers = filterOdd(incrementedNumbers)
     val summed = sum(evenNumbers)
     // calculate times and print out
     val endTime = System.currentTimeMillis()
     val elapsedTime = endTime - startTime
     totalTime += elapsedTime
     println(s"Iteration $iteration took $elapsedTime milliseconds\t\t$summed")
   }
   println("*********************************")
     println(s"Total time: $totalTime ms")
   }
}

The transformation of this code to the intermediate bytecode representation is done as usual, via scalac ProcessNumbers.scala. The resulting bytecode file is not directly interpretable by humans, but those JVM instructions can be made more intelligible by disassembling them with the command javap -c -cp. The original source code above has fewer than 30 lines; the disassembled version has more than 200 lines, but in a simpler structure and with a small instruction set:

javap -c -cp ./ ProcessNumbers$
public final class ProcessNumbers$ {
[...]

public void main(java.lang.String[]);
Code:
0: lconst_0
1: invokestatic #137 // Method scala/runtime/LongRef.create:(J)Lscala/runtime/LongRef;
4: astore_2
5: getstatic #142 // Field scala/runtime/RichInt$.MODULE$:Lscala/runtime/RichInt$;
8: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
11: iconst_1
12: invokevirtual #145 // Method scala/Predef$.intWrapper:(I)I
15: bipush 100
17: invokevirtual #149 // Method scala/runtime/RichInt$.to$extension0:(II)Lscala/collection/immutable/Range$Inclusive;
20: aload_2
21: invokedynamic #160, 0 // InvokeDynamic #3:apply$mcVI$sp:(Lscala/runtime/LongRef;)Lscala/runtime/java8/JFunction1$mcVI$sp;
26: invokevirtual #164 // Method scala/collection/immutable/Range$Inclusive.foreach$mVc$sp:(Lscala/Function1;)V
29: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
32: ldc #166 // String *********************************
34: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
37: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
40: new #172 // class java/lang/StringBuilder
43: dup
44: ldc #173 // int 15
46: invokespecial #175 // Method java/lang/StringBuilder."":(I)V
49: ldc #177 // String Total time:
51: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
54: aload_2
55: getfield #185 // Field scala/runtime/LongRef.elem:J
58: invokevirtual #188 // Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
61: ldc #190 // String ms
63: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
66: invokevirtual #194 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
69: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
72: return
[...]
}

Now we come to the Graal part: My system JDK is

Phils-MacBook-Pro $ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

I downloaded the community edition of Graal from here and placed it in a folder along with Scala libraries and the files for the toy benchmarking example mentioned above:

Phils-MacBook-Pro: ls
ProcessNumbers$.class ProcessNumbers.class ProcessNumbers.scala graalvm scalalibs

Phils-MacBook-Pro: ./graalvm/Contents/Home/bin/java -version
openjdk version "1.8.0_172"
OpenJDK Runtime Environment (build 1.8.0_172-20180626105433.graaluser.jdk8u-src-tar-g-b11)
GraalVM 1.0.0-rc5 (build 25.71-b01-internal-jvmci-0.46, mixed mode)

Phils-MacBook-Pro: ls scalalibs/
jline-2.14.6.jar scala-library.jar scala-reflect.jar scala-xml_2.12-1.0.6.jar
scala-compiler.jar scala-parser-combinators_2.12-1.0.7.jar scala-swing_2.12-2.0.0.jar scalap-2.12.6.jar

Let’s run this benchmark with the “normal” JDK via java -cp ./lib/scala-library.jar:. ProcessNumbers. Around 31 seconds are needed as can be seen below (only the first and last iterations are shown)

Phils-MacBook-Pro: java -cp ./lib/scala-library.jar:. ProcessNumbers

Iteration 1 took 536 milliseconds 25000005000000
Iteration 2 took 533 milliseconds 25000005000000
Iteration 3 took 350 milliseconds 25000005000000
Iteration 4 took 438 milliseconds 25000005000000
Iteration 5 took 345 milliseconds 25000005000000
[...]
Iteration 95 took 293 milliseconds 25000005000000
Iteration 96 took 302 milliseconds 25000005000000
Iteration 97 took 333 milliseconds 25000005000000
Iteration 98 took 282 milliseconds 25000005000000
Iteration 99 took 308 milliseconds 25000005000000
Iteration 100 took 305 milliseconds 25000005000000
*********************************
Total time: 31387 ms

And here a run that invokes Graal as JIT compiler:

Phils-MacBook-Pro:testo a$ ./graalvm/Contents/Home/bin/java -cp ./lib/scala-library.jar:. ProcessNumbers

Iteration 1 took 1287 milliseconds 25000005000000
Iteration 2 took 264 milliseconds 25000005000000
Iteration 3 took 132 milliseconds 25000005000000
Iteration 4 took 120 milliseconds 25000005000000
Iteration 5 took 128 milliseconds 25000005000000
[...]
Iteration 95 took 111 milliseconds 25000005000000
Iteration 96 took 124 milliseconds 25000005000000
Iteration 97 took 122 milliseconds 25000005000000
Iteration 98 took 123 milliseconds 25000005000000
Iteration 99 took 120 milliseconds 25000005000000
Iteration 100 took 149 milliseconds 25000005000000
*********************************
Total time: 14207 ms

14 seconds compared to 31 seconds means a 2x speedup with Graal, not bad. The first iteration takes much longer but then a turbo boost seems to kick in – most iterations from 10 to 100 take around 100 to 120 ms in the Graal run compared to 290-310 ms in the vanilla Java run. Graal itself has an option to deactivate the optimization via the -XX:-UseJVMCICompiler flag; trying that results in numbers similar to the first run:

Phils-MacBook-Pro: /Users/a/graalvm/Contents/Home/bin/java -XX:-UseJVMCICompiler -cp ./lib/scala-library.jar:. ProcessNumbers
Iteration 1 took 566 milliseconds 25000005000000
Iteration 2 took 508 milliseconds 25000005000000
Iteration 3 took 376 milliseconds 25000005000000
Iteration 4 took 456 milliseconds 25000005000000
Iteration 5 took 310 milliseconds 25000005000000
[...]
Iteration 95 took 301 milliseconds 25000005000000
Iteration 96 took 301 milliseconds 25000005000000
Iteration 97 took 285 milliseconds 25000005000000
Iteration 98 took 302 milliseconds 25000005000000
Iteration 99 took 296 milliseconds 25000005000000
Iteration 100 took 296 milliseconds 25000005000000
*********************************
Total time: 30878 ms

Graal and Spark

Why not invoke Graal for Spark jobs? Let's do this for my benchmarking project introduced above with the -jvm flag:

Phils-MacBook-Pro:jmh-spark $ sbt
Loading settings for project jmh-spark-build from plugins.sbt ...
Loading project definition from /Users/Phil/IdeaProjects/jmh-spark/project
Loading settings for project jmh-spark from build.sbt ...
Set current project to jmh-spark (in build file:/Users/Phil/IdeaProjects/jmh-spark/)
sbt server started at local:///Users/Phil/.sbt/1.0/server/c980c60cda221235db06/sock

sbt:jmh-spark> benchmarks/jmh:run -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java
Running (fork) spark_benchmarks.MyRunner -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java

The post Benchmarks, Data Spark and Graal appeared first on Unravel.

]]>
https://www.unraveldata.com/benchmarks-spark-and-graal/feed/ 0
Doing more with Data and evolving to DataOps https://www.unraveldata.com/doing-more-with-data-and-evolving-to-dataops/ https://www.unraveldata.com/doing-more-with-data-and-evolving-to-dataops/#respond Tue, 03 Dec 2019 23:16:45 +0000 https://www.unraveldata.com/?p=3938

As technology evolves at a rapid pace, the healthcare industry is transforming quickly along with it. Tech breakthroughs like IoT, advanced imaging, genomics mapping, artificial intelligence and machine learning are some of the key items re-shaping […]

The post Doing more with Data and evolving to DataOps appeared first on Unravel.

]]>

As technology evolves at a rapid pace, the healthcare industry is transforming quickly along with it. Tech breakthroughs like IoT, advanced imaging, genomics mapping, artificial intelligence and machine learning are some of the key items re-shaping the space. The result is better patient care and health outcomes. To facilitate this shift to the next generation of healthcare services – and to deliver on the promise of improved patient care – organizations are adopting modern data technologies to support new use cases.

We are a large company operating healthcare facilities across the US and employing over 20,000 people. Despite our size, we understand that we must be nimble and adapt fast to keep delivering cutting-edge healthcare services. We only began leveraging big data about three years ago, but we've grown fast and built out a significant modern data stack, including Kafka, Impala and Spark Streaming deployments, among others. Our focus here was always on the applications, developer needs and, ultimately, business value.

We've built a number of innovative data apps on top of our growing data pipelines, providing great new services and insights for our customers and employees alike. During this process, though, we realized that it's extremely difficult to manually troubleshoot and manage such data apps. We have a very developer-focused culture – the programmers are building the very apps that are ushering in the next generation of healthcare, and we put them front and center. We were concerned when we noticed these developers were sinking huge chunks of their time into fixing and diagnosing failing apps, taking their focus away from creating new apps to drive core business value.

Impala and Spark Streaming are two modern data stack technologies that our developers increasingly employed to support next generation use cases. These two technologies are commonly used to build apps that leverage streaming data, which is prevalent in our industry. Unfortunately, both Impala and Spark Streaming are difficult to manage. Apps built with these two were experiencing frequent slowdowns and intermittent crashes. Spark Streaming, in particular, was very hard to even monitor.

Our key data apps were not performing the way we expected and our programmers were wasting tons of time trying to troubleshoot them. When we deployed Unravel Data, it changed things swiftly, providing new insights into aspects of our data apps we previously had no visibility of and drastically improving app performance.

Impala – improving performance by 12-18x!

Impala is a distributed analytic SQL engine running on the Hadoop platform. Unravel provided critical metrics that helped us to better understand how Impala was being used, including:

  • Impala memory consumption
  • Impala queries
  • Detail for queries using drill-down functionality
  • Recommendations on how to make queries run faster, use data across nodes more efficiently, and more

Unravel analyzed the query pattern (insert, select, data, data locality across the Hadoop cluster) against Impala and offered a few key insights. For one, Unravel saw that most of the time was spent scanning Hadoop's file system across nodes and combining the results. After computing stats on the underlying table – a simple operation – we were able to dramatically improve performance by 12-18x.

Unravel provides detailed insights for Impala

 

Spark Streaming – Reducing memory requirements by 80 percent!

Spark Streaming is a lightning-quick processing and analytics engine that's perfect for handling enormous quantities of streaming data. As with Impala, Unravel provided insights and recommendations that alleviated the headaches we were having with the technology. The platform told us we didn't have enough memory for many Spark Streaming jobs, which was ultimately causing all the slowdowns and crashes. Unravel then provided specific recommendations on how to re-configure Spark Streaming, a process that's typically complicated and replete with costly missteps. In addition, Unravel found that we could save significant CPU resources by sending parallel tasks to cores.

Overall, two critical Spark Streaming jobs saw memory reductions of 74 percent and 80 percent. Unravel’s parallelism recommendations saved us 8.63 hours of CPU per day.

Spark Streaming performance recommendation

 

The Bigger Picture

Unravel is straightforward to implement and immediately delivers results. The platform's recommendations are all configuration changes and don't require any changes to code. We were stunned that we could improve app performance so considerably without making a single tweak to the code, yielding an immediate boost to critical business apps. Unravel's full-stack platform delivers insights and recommendations for our entire modern data stack deployment, eliminating the need to manage any data pipeline with a siloed tool.

Modern data apps are fueling healthcare’s technical transformation. By improving data app performance, We have been able to continue delivering a pioneering healthcare experience, achieving better patient outcomes, new services and greater business value. Without a platform like Unravel, our developers and IT team would be bogged down troubleshooting these apps rather than creating new ones and revolutionizing our business. Unravel has helped create a deep cultural shift to do more with our data and evolve to a DataOps mindset.

The post Doing more with Data and evolving to DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/doing-more-with-data-and-evolving-to-dataops/feed/ 0
Unravel Data For Google Cloud Dataproc https://www.unraveldata.com/resources/unravel-for-google-cloud-dataproc/ https://www.unraveldata.com/resources/unravel-for-google-cloud-dataproc/#respond Tue, 22 Oct 2019 03:45:03 +0000 https://www.unraveldata.com/?p=5214 abstract image with numbers

Thank you for your interest in the Unravel Data for Google Cloud Dataproc Datasheet. You can download it here.

The post Unravel Data For Google Cloud Dataproc appeared first on Unravel.

]]>
abstract image with numbers

Thank you for your interest in the Unravel Data for Google Cloud Dataproc Datasheet.

You can download it here.

The post Unravel Data For Google Cloud Dataproc appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-for-google-cloud-dataproc/feed/ 0
Understanding DataOps Impact on Application Quality https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/ https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/#respond Tue, 16 Jul 2019 19:46:57 +0000 https://www.unraveldata.com/?p=5064

The post Understanding DataOps Impact on Application Quality appeared first on Unravel.

]]>

The post Understanding DataOps Impact on Application Quality appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/feed/ 0
Data Pipelines and the Promise of Data https://www.unraveldata.com/data-pipelines-and-the-promise-of-data/ https://www.unraveldata.com/data-pipelines-and-the-promise-of-data/#respond Thu, 27 Jun 2019 15:21:13 +0000 https://www.unraveldata.com/?p=3241 DM Radio

This is a blog by Keith D. Foote, Writer Researcher for Dataversity. This blog was first published on the DM Radio Dataversity site. The flow of data can be perilous. Any number of problems can develop […]

The post Data Pipelines and the Promise of Data appeared first on Unravel.

]]>
DM Radio

This is a blog by Keith D. Foote, Writer Researcher for Dataversity. This blog was first published on the DM Radio Dataversity site.

The flow of data can be perilous. Any number of problems can develop during the transport of data from one system to another. Data flows can hit bottlenecks resulting in latency; data can become corrupted, or datasets may conflict or contain duplicates. The more complex the environment and the more intricate the requirements, the more the potential for these problems increases. Volume also increases the potential for problems. Transporting data between systems often requires several steps. These include copying data, moving it to another location, and reformatting and/or joining it to other datasets. With good reason, data teams are focusing on the end-to-end performance and reliability of their data pipelines.

If massive amounts of data are streaming in from a variety of sources and passing through different data ecosystem software during different stages, it is reasonable to expect periodic problems with the data flow. A well-designed and well-tuned data pipeline ensures that all the steps needed to provide reliable performance are taken care of. The necessary steps should be automated, and most organizations will require at least one or two engineers to maintain the systems, repair failures, and make updates as the needs of the business evolve.

DataOps and Application Performance Management

Not long ago, data from different sources would be sent to separate silos, which often provided limited access. During transit, data could not be viewed, interpreted, or analyzed. Data was typically processed in batches on a daily basis, and the concept of real-time processing was unrealistic.

“Luckily, for enterprises today, such a reality has changed,” said Shivnath Babu, co-founder and Chief Technology Officer at Unravel, in a recent interview with DATAVERSITY®.

“Now data pipelines can process and analyze huge amounts of data in real time. Data pipelines should therefore be designed to minimize the number of manual steps to provide a smooth, automated flow of data from one stage to the next.”

The first stage in a pipeline defines how, where, and what data is collected. The pipeline should then automate the processes used to extract the data, transform it, validate it, aggregate it and load it for further analysis. A data pipeline provides operational velocity by eliminating errors and correcting bottlenecks. A good data pipeline can handle multiple data streams simultaneously and has become an absolute necessity for many data-driven enterprises.

DataOps teams leverage Application Performance Management (APM) tools to monitor the performance of apps written in specific languages (Java, Python, Ruby, .NET, PHP, node.js). As data moves through the application, three key metrics are collected:

  • Error rate (errors per minute)
  • Load (data flow per minute)
  • Latency (average response time)
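
To make these three metrics concrete, here is a toy sketch (the record format, field names, and numbers are invented, not taken from any particular APM tool) that computes error rate, load, and latency over one minute's worth of request records; a real APM agent collects and reports these continuously from inside the application.

```python
from statistics import mean

# Hypothetical minute's worth of request records captured by an APM agent.
# Each record: (http_status, response_time_ms)
requests_last_minute = [(200, 120), (500, 340), (200, 95), (200, 210), (503, 800)]

load = len(requests_last_minute)                                             # data flow per minute
error_rate = sum(1 for status, _ in requests_last_minute if status >= 500)   # errors per minute
latency = mean(rt for _, rt in requests_last_minute)                         # average response time (ms)

print(f"load={load}/min  error_rate={error_rate}/min  latency={latency:.0f} ms")
```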

According to Babu, Unravel provides an AI-powered DataOps/APM solution that is specifically designed for the modern data stack, including Spark, Kafka, NoSQL, Impala, and Hadoop. A spectrum of industries—such as financial, telecom, healthcare, and technology—use Unravel to optimize their data pipelines. It is known for improving the reliability of applications and the productivity of data operations teams, and for reducing costs. Babu commented that:

“The modern applications that truly are driving the promise of data, that are delivering or helping a company make sense of the data—they are running on a complex stack, which is critical and crucial to delivering on the promise of data. And that means, now you need software. You need tooling that can ensure these systems that are part of those applications end up being reliable, that you can troubleshoot them, and ensure their performance can be depended upon. You can detect and fix it. Ideally, the problem should never happen in the first place. They should be avoided, via the Machine learning algorithms.”

How Unravel AI Automates DataOps

Try Unravel for free

Data Pipelines

Data pipelines are the digital reflection of the goals of data teams and business leaders, with each pipeline having unique characteristics and serving different purposes depending on those goals. For example, in a marketing-feedback scenario, real-time tools would be more useful than tools designed for moving data to the cloud. Pipelines are not mutually exclusive and can be optimized for both the cloud and real-time, or other combinations. The following list describes the most popular kinds of pipelines available:

  • Cloud-native pipelines: Deploying data pipelines in the cloud helps with cloud-based data and applications. Generally, cloud vendors provide many of the tools needed to create these pipelines, thereby saving the client time and money on building out infrastructure.
  • Real-time pipelines: Designed to process data as it arrives, in real time. Real-time processing requires data coming from a streaming source.
  • Batch pipelines: Batch processing is generally most useful when moving large amounts of data at regularly scheduled intervals. It is not a process that supports receiving data in real time.

Ideally, a data pipeline handles any data as though it were streaming data and allows for flexible schemas. Whether the data comes from static sources (e.g. flat-file databases) or whether it comes from real-time sources (e.g. online retail transactions), a data pipeline can separate each data stream into narrower streams, which are then processed in parallel. It should be noted that processing in parallel uses significant computing power, said Babu:

“Companies get more value from their data by using modern applications. And these modern applications are running on this complex series of systems, be it the cloud or in data centers. In case any problem happens, you want to detect and fix it—and ideally fix the problem before it happens. That solution comes from Unravel. This is how Unravel helps companies deliver on the promise of data.”

The big problem in the industry is having to hire human experts, who aren’t there 24/7, he commented. “If the application is slow at 2 a.m., or the real-time application is not capable of delivering in true real time, and somebody has to troubleshoot it and fix it, that person is very hard to find. We have created technology that can collect what we call full-stack performance information at every level of this complex procedural stack. From the application, from the platform, from infrastructure, from storage, we can collect performance information.”

Cloud Migration Containers

There has been a significant rise in the use of data containers. As use of the cloud has gained in popularity, methods for transferring data and their processing instructions have become important, with data containers providing a viable solution. Data containers organize and store “virtual objects” (self-contained entities made up of both data and the instructions needed for controlling the data).

“Containers do, however, come with limitations,” Babu remarked. While containers are very easy to transport, they can only be used with servers having compatible operating system “kernels,” which limits the kinds of servers that can be used:

“We’re moving toward microservices. We have created a technology called Sensors, which can basically pop open any container. Whenever a container comes up, in effect, it opens what we call a sensor. It can sense everything that is happening within the container, and then stream that data back to Unravel. Sensor enables us to make these correlations between Spark applications, that are streaming from a Kafka topic. Or writing to a S3 storage, where we are able to draw these connections. And that is in conjunction with the data that we are tapping into, where the APIs go with each system.”

DataOps

DataOps is a process-oriented practice that focuses on communicating, integrating, and automating the ingest, preparation, analysis, and lifecycle management of data infrastructure. DataOps manages the technology used to automate data delivery, with attention to the appropriate levels of quality, metadata, and security needed to improve the value of data.

“We apply AI and machine learning algorithms to solve a very critical problem that affects modern day applications. A lot of the time, companies are creating new business intelligence applications, or streaming applications. Once they create the application, it takes them a long time to make the application reliable and production-ready. So, there are mission-critical technical applications, very important to the overall company’s bottom line.”

But the application itself often ends up running on multiple distinct systems: systems to collect data in a streaming fashion, and systems to store very large and valuable types of data. The data architecture most companies have ends up being a complex collection of these pieces, and on top of it, “we have mission-critical technical applications running.”

Unravel has applied AI and machine learning to logs, metrics, execution plans, and configurations to automatically identify where crucial problems and inefficiencies lie, and to fix many of these problems automatically, so that companies running these applications can count on their mission-critical applications being reliable.

“So, this is how we help companies deliver on the promise of data,” said Babu in closing. “Data pipelines are certainly making data flows and data volumes easier to deal with, and we remove those blind spots in an organization’s data pipelines. We give them visibility, AI-powered recommendations, and much more reliable performance.”

The post Data Pipelines and the Promise of Data appeared first on Unravel.

]]>
https://www.unraveldata.com/data-pipelines-and-the-promise-of-data/feed/ 0
Meeting SLAs for Data Pipelines on Amazon EMR With Unravel https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/ https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/#respond Wed, 26 Jun 2019 22:31:02 +0000 https://www.unraveldata.com/?p=3255 Data Pipelines

Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog. A household name in global media analytics – let’s call […]

The post Meeting SLAs for Data Pipelines on Amazon EMR With Unravel appeared first on Unravel.

]]>
Data Pipelines

Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog.

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. Amazon EMR was an easy choice for MTI as the platform to run all their analytics. To start with, getting up and running is simple: there is nothing to install, no configuration required, and you can get to a functional cluster in a few clicks. This enabled MTI to focus most of their efforts on building out analytics that would benefit their business instead of having to spend time and money acquiring the skill set needed to set up and maintain Hadoop deployments themselves. MTI very quickly got to the point where they were running tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing modern data stack clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units, and projects. Unravel then aggregates and correlates this data into the Unravel data model and applies a variety of analytical techniques to put that data into a useful context. Unravel utilizes EMR bootstrap actions to distribute non-intrusive sensors on each node of a new cluster, which are needed to collect the granular application-level data used to optimize applications.
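
For readers unfamiliar with EMR bootstrap actions, the boto3 sketch below shows the general mechanism: a script is registered as a bootstrap action so it runs on every node as the cluster comes up. The S3 script path, instance types, and roles here are placeholders, not Unravel's actual installer or MTI's configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal cluster definition; the bootstrap script path is a placeholder,
# not the real Unravel sensor installer.
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-monitoring-sensor",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install_sensor.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```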

See Unravel for EMR in Action

Try Unravel for free

Unravel’s Amazon AWS/EMR architecture

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing Data Operations
For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

  • Reduce time needed for troubleshooting and resolving issues.
  • Improve cluster efficiency and performance.
  • Improve visibility into cluster workloads.
  • Provide usage analysis.

Reducing Time to Identify and Resolve Issues
One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is through the elimination of the time-consuming steps of data collection and correlation. Unravel collects granular cluster and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting that data has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application is bringing your cluster to its knees.

Unravel’s top-level operations dashboard

Improving Visibility Into Cluster Operations
In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resources usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster Level Reporting
Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel’s cluster infrastructure dashboard

One observation from reports like the above was that memory and CPU usage in the cluster peaked from time to time, potentially leading to application failures and slowdowns. To resolve this issue, MTI utilized the EMR automatic scaling feature so that instances were automatically added and removed as needed to ensure adequate resources at all times. This also ensured that they were not incurring unnecessary costs from underutilized resources.
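
As an illustration of the kind of configuration involved (not MTI's actual settings), EMR's managed scaling, the newer variant of this capability, can be attached to a running cluster with a few lines of boto3; the cluster ID and capacity limits below are placeholder values.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Placeholder cluster ID and capacity limits; tune the limits to the workload's peaks.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 20,
            "MaximumCoreCapacityUnits": 10,
        }
    },
)
```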

Application and Workflow Tagging
Unravel provides rich functionality for monitoring applications and users in the cluster. Unravel provides cluster and application reporting by user, queue, application type, and custom tags such as project and department. These tags are preconfigured so that MTI can instantly filter their view by these tags. The ability to add custom tags is unique to Unravel and enables customers to tag various applications based on custom rules specific to their business requirements (e.g., project, business unit, etc.).

Unravel application tagging by department

 

Usage Analysis and Capacity Planning
MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage and data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

Unravel Chargeback Reporting
You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by application type, user, queue, and tags. The window is divided into three sections:

  • Donut graphs showing the top results for the Group by selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of running YARN applications.

Unravel Data’s chargeback reporting
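
To make the chargeback idea concrete, the toy pandas sketch below (the job records and cost figures are invented) rolls up per-job cost by a department tag, which is essentially what a chargeback report does across application type, user, queue, and tags.

```python
import pandas as pd

# Invented job-level usage records; a real report derives cost from vCore- and memory-hours.
jobs = pd.DataFrame(
    [
        {"app_type": "spark", "user": "alice", "dept_tag": "marketing", "cost_usd": 12.40},
        {"app_type": "hive",  "user": "bob",   "dept_tag": "finance",   "cost_usd": 3.10},
        {"app_type": "spark", "user": "carol", "dept_tag": "marketing", "cost_usd": 8.75},
    ]
)

# Group-by on the tag gives the per-department cost roll-up.
chargeback = jobs.groupby("dept_tag")["cost_usd"].sum().sort_values(ascending=False)
print(chargeback)
```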

Improving Cluster Efficiency and Performance
MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

Get answers to EMR issues, not more charts and graphs

Try Unravel for free

Unravel Auto-actions and Alerting
Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions setup

The simplicity of the Auto-actions screen belies the deep automation and functionality of autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also create an HTTP post that gives users a lot of powerful options.
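
The HTTP post option is essentially a webhook: when an alert fires, a JSON payload is posted to an endpoint your team controls, where you can trigger tickets, runbooks, or further automation. The sketch below is purely illustrative; the endpoint URL and payload fields are made up and are not Unravel's actual webhook schema.

```python
import requests

# Made-up payload fields and endpoint; a real integration would follow the schema
# and URL configured in the auto-action definition.
payload = {
    "event": "rogue_job_detected",
    "cluster": "prod-emr",
    "application_id": "application_1234567890_0042",
    "action_taken": "moved_to_low_priority_queue",
}

resp = requests.post("https://hooks.example.com/dataops-alerts", json=payload, timeout=10)
resp.raise_for_status()
```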

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

Pre-configured Unravel auto-action templates

Applications
Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased Visibility Into Application Runtime and Trends
MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
  • View application component (pipeline stages) breakdown and their associated performance metrics
  • Understand execution of Map Reduce jobs, Spark applications and the degree of parallelism and resource usage as well as obtain insights and recommendations for optimal performance and efficiency

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

Unravel Monitoring, Tuning, and Troubleshooting

Intuitive drill-down from the Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top level list of applications to drill down to jobs, and further drill down to individual stages of a Spark, Hive, MapReduce or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

Unravel Self-service Optimization of Spark Applications
MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and management of their data application portfolio. Their aim is for development teams to reduce their dependency on IT while improving collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Intelligent recommendations and insights as well as auto-tuning

Optimizing Application Resource Efficiency
In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel can help you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

  • Underutilization of container resources, CPU, or memory
  • Too few partitions with respect to available parallelism
  • Mappers/reducers requesting too much memory
  • Too many map tasks and/or too many reduce tasks
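
As a rough illustration of what acting on the "too few partitions" class of insight looks like in practice (generic PySpark, not an Unravel API; the paths and the value 200 are placeholders), a developer would raise the shuffle parallelism and repartition the dataset before the expensive aggregation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning-example").getOrCreate()

# Raise shuffle parallelism so aggregations use the cluster's available cores;
# 200 is a placeholder, not a recommendation for any specific cluster.
spark.conf.set("spark.sql.shuffle.partitions", "200")

df = spark.read.parquet("s3://my-bucket/events/")  # placeholder input path

# Repartition an under-partitioned dataset before an expensive aggregation or join.
df = df.repartition(200, "customer_id")
daily = df.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3://my-bucket/reports/daily_counts/")  # placeholder output path
```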

Solution Highlights
Work on all of these operational goals is ongoing with MTI and Unravel, but to date, they have made significant progress on both operational and application goals. After running for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also got great insight into the number and causes of inefficiently running applications. Unravel detected a significant number of inefficient applications: 38,190 events after analyzing the 30,378 MapReduce jobs that they executed, and 44,176 events for the 21,799 Spark jobs that they executed. They were also able to detect resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive on their applications, MTI found multiple inefficient jobs where Unravel provided recommendations for repartitioning the data. They were also able to identify many jobs that wasted CPU and memory resources.

The post Meeting SLAs for Data Pipelines on Amazon EMR With Unravel appeared first on Unravel.

]]>
https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/feed/ 0
Unraveling the Complex Streaming Data Pipelines of Cybersecurity https://www.unraveldata.com/unraveling-the-complex-streaming-data-pipelines-of-cybersecurity/ https://www.unraveldata.com/unraveling-the-complex-streaming-data-pipelines-of-cybersecurity/#respond Wed, 12 Jun 2019 04:55:49 +0000 https://www.unraveldata.com/?p=3050

Earlier this year, Unravel released the results of a survey that looked at how organizations are using modern data apps and general trends in big data. There were many interesting findings, but I was most struck by […]

The post Unraveling the Complex Streaming Data Pipelines of Cybersecurity appeared first on Unravel.

]]>

Earlier this year, Unravel released the results of a survey that looked at how organizations are using modern data apps and general trends in big data. There were many interesting findings, but I was most struck by what the survey revealed about security. First, respondents indicated that they get the most value from big data when leveraging it for use in security applications. Fraud detection was listed as the single most effective use case for big data, while cybersecurity intelligence was third. This was hardly surprising, as security is at the top of everyone’s minds today and modern security apps like fraud detection rely heavily on AI and machine learning to work properly. However, despite that value, respondents also indicated that security analytics was the modern data application they struggled most to get right. This also didn’t surprise me, as it reflects the complexity of managing the real-time streaming data common in security apps.

Cybersecurity is difficult from a big data point of view, and many organizations are struggling with it. The hardest part is managing all of the streaming data that comes pouring in from the internet, IoT devices, sensors, edge platforms, and other endpoints. Streaming data arrives continuously, piles up quickly, and is complex. To properly manage this data and deliver working security apps, you need the right solution, one that provides trustworthy workload management for cybersecurity analytics and offers the ability to track, diagnose, and troubleshoot end-to-end across all of your data pipelines.

Real-time processing is not a new concept, but the ability to run real-time apps reliably and at scale is. The development of open-source technologies such as Kafka, Spark Streaming, Flink and HBase have enabled developers to create real-time apps that scale, further accelerating their proliferation and business value. Cybersecurity is critical for the well-being of enterprises and large organizations, but many don’t have the right data operations platform to do it correctly.
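
For readers less familiar with these components, the sketch below shows the basic shape of a real-time security pipeline: Spark Structured Streaming reading telemetry from Kafka and computing a simple per-source aggregate. The broker address, topic name, and JSON schema are placeholders; production security pipelines add enrichment, model scoring, and indexing stages on top of this skeleton.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("security-events-stream").getOrCreate()

# Illustrative event schema; real telemetry is far richer.
schema = (
    StructType()
    .add("src_ip", StringType())
    .add("dst_port", LongType())
    .add("bytes", LongType())
)

# Placeholder brokers and topic name.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "network-telemetry")
    .load()
)

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Simple per-source-IP byte counts; real detection would feed features to a model.
counts = events.groupBy("src_ip").sum("bytes")

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```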

Apache Metron is an example of a complex security data pipeline

To analyze streaming traffic data, generate statistical features, and train machine learning models that help detect cybersecurity threats on large-scale networks, such as malicious hosts in botnets, big data systems require complex and resource-consuming monitoring methods. Security analysts may apply multiple detection methods simultaneously to the same massive incoming data for pre-processing, selective sampling, and feature generation, adding to the existing complexity and performance challenges. Keep in mind that these applications often span multiple systems (e.g., interacting with Spark for computation, with YARN for resource allocation and scheduling, with HDFS or S3 for data access, and with Kafka or Flink for streaming) and may contain independent, user-defined programs, making it inefficient to repeat the data pre-processing and feature generation that are common to multiple applications, especially with large-scale traffic data.

These inefficiencies create bottlenecks in application execution, hog the underlying systems, cause suboptimal resource utilization, increase failures (e.g., due to out-of-memory errors), and more importantly, may decrease the chances to detect a threat or a malicious attempt in time.

Unravel’s full stack platform addresses these challenges and provides a compelling solution for operationalizing security apps. Leveraging artificial intelligence, Unravel has introduced a variety of capabilities for enabling better workload management for cybersecurity analytics, all delivered from a single pane of glass. These include:

  • Automatically identifying applications that share common characteristics and requirements and grouping them based on relevant data colocation (e.g., a combination of port usage entropy, IP region or geolocation, time or flow duration)
  • Recommendations on how to segregate applications with different requirements (e.g., disk i/o heavy preprocessing tasks vs. computational heavy feature selection) submitted by different users (e.g., SOC level 1 vs. level 3 analysts)
  • Recommendations on how to allocate applications with increased sharing opportunities and computational similarities to appropriate execution pools/queues
  • Automatic fixes for failed applications drawing on rich historic data of successful and failed runs of the application
  • Recommendations for alternative configurations to get failed applications quickly to a running state, followed by getting the application to a resource-efficient running state

Security apps running on a highly distributed modern data stack are too complex to monitor and manage manually. And when these apps fail, organizations don’t just suffer inconvenience or some lost revenue. Their entire business is put at risk. Unravel ensures these apps are reliable and operate at optimal performance. Ours is the only platform that makes such effective use of AI to scrutinize application execution, identify the cause of potential failure, and to generate recommendations for improving performance and resource usage, all automatically.

The post Unraveling the Complex Streaming Data Pipelines of Cybersecurity appeared first on Unravel.

]]>
https://www.unraveldata.com/unraveling-the-complex-streaming-data-pipelines-of-cybersecurity/feed/ 0
Case Study: Meeting SLAs for Data Pipelines on Amazon EMR https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/ https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/#respond Thu, 30 May 2019 20:44:44 +0000 https://www.unraveldata.com/?p=2988 Welcoming Point72 Ventures and Harmony Partners to the Unravel Family

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and […]

The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

]]>
Welcoming Point72 Ventures and Harmony Partners to the Unravel Family

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. MTI runs tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing big data clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units and projects. Unravel then aggregates and correlates this data into the Unravel data model and then applies a variety of analytical techniques to put that data into a useful context.

Unravel architecture for Amazon AWS/EMR

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing data operations

For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

  • Reduce time needed for troubleshooting and resolving issues.
  • Improve cluster efficiency and performance.
  • Improve visibility into cluster workloads.
  • Provide usage analysis

Reducing time to identify and resolve issues

One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is through the elimination of the time-consuming steps of data collection and correlation. Unravel collects granular cluster and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting that data has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application is bringing your cluster to its knees.

Unravel top level operations dashboard

Improving visibility into cluster operations

In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resources usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster level reporting

Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel cluster infrastructure dashboard

Application and workflow tagging

Unravel provides rich functionality for monitoring applications and users in the cluster. Unravel provides cluster and application reporting by user, queue, application type, and custom tags such as project and department. These tags are preconfigured so that MTI can instantly filter their view by these tags. The ability to add custom tags is unique to Unravel and enables customers to tag various applications based on custom rules specific to their business requirements (e.g., project, business unit, etc.).

Unravel application tagging by department

Usage analysis and capacity planning

MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage and data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

Unravel chargeback reporting

You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by application type, user, queue, and tags. The window is divided into three sections:

  • Donut graphs showing the top results for the Group by selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of running YARN applications.

Unravel chargeback reporting

Improving cluster efficiency and performance

MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

Unravel Auto-actions and alerting

Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions set up

The simplicity of the Auto-actions screen belies the deep automation and functionality of autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also create an HTTP post that gives users a lot of powerful options.

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

Preconfigured Unravel auto-action templates

Applications

Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased visibility into application runtime and trends

MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
  • View application component (pipeline stages) breakdown and their associated performance metrics
  • Understand execution of Map Reduce jobs, Spark applications and the degree of parallelism and resource usage as well as obtain insights and recommendations for optimal performance and efficiency
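
One low-level building block behind this kind of visibility is Spark's own monitoring REST API, exposed by the History Server. The sketch below (the host and port are placeholders) pulls the application list and counts completed stages per application; a unified monitoring tool then has to correlate this kind of data across every engine in the pipeline.

```python
import requests

# Placeholder History Server endpoint; /api/v1 is Spark's standard monitoring API.
BASE = "http://history-server.example.com:18080/api/v1"

apps = requests.get(f"{BASE}/applications", timeout=10).json()
for app in apps[:5]:
    stages = requests.get(f"{BASE}/applications/{app['id']}/stages", timeout=10).json()
    completed = sum(1 for s in stages if s.get("status") == "COMPLETE")
    print(app["id"], app.get("name"), f"completed_stages={completed}")
```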

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

Unravel monitoring, tuning and troubleshooting

Intuitive drill-down from Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top level list of applications to drill down to jobs, and further drill down to individual stages of a Spark, Hive, MapReduce or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

Unravel self-service optimization of Spark applications

MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and management of their data application portfolio. Their aim is for development teams to reduce their dependency on IT while improving collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Unravel provides intelligent recommendations and insights as well as auto-tuning.

Optimizing Application Resource Efficiency

In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel can help you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

Unravel Insight: Under-utilization of container resources, CPU or memory

Unravel Insight: Too few partitions with respect to available parallelism

Unravel Insight: Mapper/Reducers requesting too much memory

Unravel Insight: Too many map tasks and/or too many reduce tasks

Solution Highlights

Work on all of these operational goals is ongoing with MTI and Unravel, but to date, they have made significant progress on both operational and application goals. After running for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also got great insight into the number and causes of inefficiently running applications. Unravel detected a significant number of inefficient applications: 38,190 events after analyzing the 30,378 MapReduce jobs that they executed, and 44,176 events for the 21,799 Spark jobs that they executed. They were also able to detect resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive on their applications, MTI found multiple inefficient jobs where Unravel provided recommendations for repartitioning the data. They were also able to identify many jobs that wasted CPU and memory resources.

The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/feed/ 0
Unravel Data For Informatica Big Data Management https://www.unraveldata.com/resources/unravel-for-informatica-big-data-management/ https://www.unraveldata.com/resources/unravel-for-informatica-big-data-management/#respond Sat, 25 May 2019 03:06:22 +0000 https://www.unraveldata.com/?p=5184

Thank you for your interest in the Unravel Data for Informatica Big Data Management Datasheet. You can download it here.

The post Unravel Data For Informatica Big Data Management appeared first on Unravel.

]]>

Thank you for your interest in the Unravel Data for Informatica Big Data Management Datasheet.

You can download it here.

The post Unravel Data For Informatica Big Data Management appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-for-informatica-big-data-management/feed/ 0
The Guide To Understanding Cloud Data Services in 2022 https://www.unraveldata.com/understanding-cloud-data-services/ https://www.unraveldata.com/understanding-cloud-data-services/#comments Fri, 24 May 2019 15:05:48 +0000 https://www.unraveldata.com/?p=2894

In the past five years, a shift in Cloud Vendor offerings has fundamentally changed how companies buy, deploy and run big data systems. Cloud Vendors have absorbed more back-end data storage and transformation technologies into their […]

The post The Guide To Understanding Cloud Data Services in 2022 appeared first on Unravel.

]]>

In the past five years, a shift in Cloud Vendor offerings has fundamentally changed how companies buy, deploy and run big data systems. Cloud Vendors have absorbed more back-end data storage and transformation technologies into their core offerings and are now highlighting their data pipeline, analysis, and modeling tools. This is great news for companies deploying, migrating, or upgrading big data systems. Companies can now focus on generating value from data and Machine Learning (ML), rather than building teams to support hardware, infrastructure, and application deployment/monitoring.

The following chart shows how more and more of the cloud platform stack is becoming the responsibility of the Cloud Vendors (shown in blue). The new value for companies working with big data is the maturation of Cloud Vendor Function as a Service (FaaS), also known as serverless, and Software as a Service (SaaS) offerings. For FaaS (serverless), the Cloud Vendor manages the applications and users focus on data and functions/features. With SaaS, features and data management become the Cloud Vendor’s responsibility. Google Analytics, Workday, and Marketo are examples of SaaS offerings.

As the technology gets easier to deploy, and the Cloud Vendor data services mature, it becomes much easier to build data-centric applications and provide data and tools to the enterprise. This is good news: companies looking to migrate from on-premise systems to the cloud are no longer required to directly purchase or manage hardware, storage, networking, virtualization, applications, and databases. In addition, this changes the operational focus for big data systems from infrastructure and application management (DevOps) to pipeline optimization and data governance (DataOps). The following table shows the different roles required to build and run Cloud Vendor-based big data systems.

This article is aimed at helping big data systems leaders moving from on-premise or native IaaS (compute, storage, and networking) deployments understand the current Cloud Vendor offerings. Those readers new to big data, or Cloud Vendor services, will get a high-level understanding of big data system architecture, components, and offerings. To facilitate discussion we provide an end-to-end taxonomy for big data systems and show how the three leading Cloud Vendors (AWS, Azure and GCP) align to the model:

  • Amazon Web Services (AWS)
  • Microsoft Azure (Azure)
  • Google Cloud Platform (GCP)

APPLYING A COMMON TAXONOMY

Understanding Cloud Vendor offerings and big data systems can be very confusing. The same service may have multiple names across Cloud Vendors and, to complicate things even more, each Cloud Vendor has multiple services that provide similar functionality. However, the Cloud Vendors’ big data offerings align to a common architecture and set of workflows.

Each big data offering is set up to receive high volumes of data to be stored and processed for real-time and batch analytics as well as more complex ML/AI modeling. In order to provide clarity amidst the chaos, we provide a two-level taxonomy. The first-level includes five stages that sit between data sources and data consumers: CAPTURE, STORE, TRANSFORM, PUBLISH, and CONSUME. The second-level taxonomy includes multiple service offerings for each stage to provide a consistent language for aligning Cloud Vendor solutions.

The following sections provide details for each stage and the related service offerings.

CAPTURE

Persistent and resilient data CAPTURE is the first step in any big data system. Cloud Vendors and the community also describe data CAPTURE as ingest, extract, collect, or more generally as data movement. Data CAPTURE includes ingestion of both batch and streaming data. Streaming event data becomes more valuable by being blended with transactional data from internal business applications like SAP, Siebel, Salesforce, and Marketo. Business application data usually resides within a proprietary data model and needs to be brought into the big data system as changes/transactions occur.

Cloud Vendors provide many tools for bringing large batches of data into their platforms. This includes database migration/replication, processing of transactional changes, and physical transfer devices when data volumes are too big to send efficiently over the internet. Batch data transfer is common for moving on-premise data sources and bringing in data from internal business applications, both SaaS and on-premise. Batch data can be run once as part of an application migration or in near real-time as transactional updates are made in business systems.

The focus of many big data pipeline implementations is the capture of real-time data streaming in from application clickstreams, product usage events, application logs, and IoT sensor events. Properly capturing streaming data requires configuration on the edge device or application. For example, collecting clickstream data from a mobile or web application requires events to be instrumented and sent back to an endpoint listening for the events. This is similar for IoT devices, which may also perform some data processing on the edge device prior to sending it back to an endpoint.
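
As a minimal sketch of the endpoint side of streaming CAPTURE on AWS (the stream name and event fields are placeholders), an instrumented application can push clickstream events into a Kinesis stream as shown below; the rough equivalents are Pub/Sub on GCP and Event Hubs on Azure.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Placeholder stream name and event payload.
event = {"user_id": "u-123", "action": "add_to_cart", "ts": 1561650000}

kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```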

STORE

For big data systems the STORE stage focuses on the concept of a data lake, a single location where structured, semi-structured, unstructured data and objects are stored together. The data lake is also a place to store the output from extract, transform, load (ETL) and ML pipelines running in the TRANSFORM stage. Vendors focus on scalability and resilience over read/write performance. To increase data access and analytics performance, data should be highly aggregated in the data lake or organized and placed into higher performance data warehouses, massively parallel processing (MPP) databases, or key-value stores as described in the PUBLISH stage. In addition, some data streams have such high event volume, or the data are only relevant at the time of capture, that the data stream may be processed without ever entering the data lake.

Cloud Vendors have recently put more focus on the concept of the data lake by adding functionality to their object stores and creating much tighter integration with TRANSFORM and CONSUME service offerings. For example, Azure created Data Lake Storage on top of its existing object store, with additional services for end-to-end analytics pipelines. Also, AWS now provides Lake Formation to make it easier to set up a data lake on its core object store, S3.

TRANSFORM

The heart of any big data implementation is the ability to create data pipelines in order to clean, prepare, and TRANSFORM complex multi-modal data into valuable information. Data TRANSFORM is also described as preparing, massaging, processing, organizing, and analyzing among other things. The TRANSFORM stage is where value is created and, as a result, Cloud Vendors, start-ups, and traditional database and ETL vendors provide many tools. The TRANSFORM stage has three main data pipeline offerings including Batch Processing, Machine Learning, and Stream Processing. In addition, we include the Orchestration offering because complex data pipelines require tools to stage, schedule, and monitor deployments.

Batch TRANSFORM uses traditional extract, TRANSFORM, and load techniques that have been around for decades and are the purview of traditional RDBMS and ETL vendors. However, with the increase in data volumes and velocity, TRANSFORM now commonly comes after extraction and loading into the data lake. This is referred to as extract, load, and transform or ELT. Batch TRANSFORM uses Apache Spark or Hadoop to distribute compute across multiple nodes to process and aggregate large volumes of data.

ML/AI uses many of the same Batch Processing tools and techniques for data preparation and for the development and training of predictive models. Machine Learning also takes advantage of numerous libraries and packages to help optimize data science workflows and provide pre-built algorithms.
Big data systems also provide tools to query continuous data streams in near real time. Some data has immediate value that would be lost waiting for a batch process to run, for example predictive models for fraud detection or alerts based on data from an IoT sensor. In addition, streaming data is commonly processed, and portions are loaded into a data lake.

Cloud Vendor offerings for TRANSFORM are evolving quickly and it can be difficult to understand which tools to use. All three Cloud Vendors have versions of Spark/Hadoop that scale on their IaaS compute nodes. However, all three now provide serverless offerings that make it much simpler to build and deploy data pipelines for batch, ML, and streaming workflows. For example, AWS EMR, GCP Cloud Dataproc, and Azure Databricks provide Spark/Hadoop that scale by adding additional compute resources. However, they also offer the serverless AWS Glue, GCP Cloud Dataflow, and Azure Data Factory, which abstract away the need to manage compute nodes and orchestration tools. In addition, they now all provide end-to-end tools to build, train, and deploy machine learning models quickly. This includes data preparation, algorithm development, model training, and deployment tuning and optimization.
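
To make the ELT pattern concrete, here is a minimal PySpark batch job of the kind these managed Spark services run (the bucket paths and column names are invented): raw JSON is loaded from the data lake, lightly cleaned and aggregated, and written back as partitioned Parquet for the PUBLISH stage.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("daily-elt").getOrCreate()

# Placeholder lake paths and columns.
raw = spark.read.json("s3://my-lake/raw/orders/")

clean = (
    raw.dropDuplicates(["order_id"])
    .withColumn("order_date", to_date(col("order_ts")))
    .filter(col("amount") > 0)
)

daily_revenue = clean.groupBy("order_date", "region").sum("amount")

daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-lake/curated/daily_revenue/"
)
```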

PUBLISH

Once through the data CAPTURE and TRANSFORM stages it is necessary to PUBLISH the output from batch, ML, or streaming pipelines for users and applications to CONSUME. PUBLISH is also described as deliver or serve, and comes in the form of Data Warehouses, Data Catalogs, or Real-Time Stores.

Data Warehouse solutions are abundant in the market, and the choice depends on the data scale and complexity as well as performance requirements. Serverless relational databases are a common choice for Business Intelligence applications and for publishing data for other systems to consume. They provide scale and performance and, most of all, SQL-based access to the prepared data. Cloud Vendor examples include AWS Redshift, Google BigQuery, and Azure SQL Data Warehouse. These work great for moderately sized and relatively simple data structures. For higher performance and complex relational data models, massively parallel processing (MPP) databases store large volumes of data in-memory and can be blazing fast, but often at a steep price.
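
Once data lands in one of these warehouses, consumption is plain SQL. For example, with BigQuery's Python client (the project, dataset, and table names below are placeholders), a query against a published table looks like this; Redshift and Azure SQL Data Warehouse expose a similar SQL surface through their own drivers.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset/table names.
sql = """
    SELECT region, SUM(amount) AS revenue
    FROM `my_project.curated.daily_revenue`
    WHERE order_date = '2019-05-01'
    GROUP BY region
    ORDER BY revenue DESC
"""

for row in client.query(sql).result():
    print(row["region"], row["revenue"])
```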

As the tools for TRANSFORM and CONSUME become easier to use, data analyses, models, and metrics proliferate. It becomes harder to find valuable, governed, and standardized metrics in the mass of derived tables and analyses. A well-managed and up-to-date data catalog is necessary for both technical and non-technical users to manage and explore published tables and metrics. Cloud Vendor Data Catalog offerings are still relatively immature. Many companies build their own or use third party catalogs like Alation or Waterline. More technical users including data engineers and data scientists explore both raw and transformed data directly in the data lake. For these users the data catalog, or metastore, is the key for various compute options to understand where data is and how it is structured.

Many streaming applications require a Real-Time Store to meet millisecond response times. Hundreds of optimized data stores exist in the market. As with Data Warehouse solutions, picking a Real-Time Store depends on the type and complexity of the application and data. Cloud Vendor examples include AWS DynamoDB, Google Bigtable, and Azure Cosmos DB, providing wide-column or key-value data stores. These are applied as high-performance serving databases and improve the performance of data processing and analytics workloads.
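
For the key-value case, the access pattern is a simple put and get by key. The DynamoDB sketch below (the table and attribute names are placeholders) shows the shape; Bigtable and Cosmos DB clients follow a similar pattern.

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user-profile-store")  # placeholder table name

# Write a profile computed by the TRANSFORM stage...
table.put_item(Item={"user_id": "u-123", "segment": "high_value", "churn_score": "0.12"})

# ...and read it back at request time.
resp = table.get_item(Key={"user_id": "u-123"})
print(resp.get("Item"))
```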

CONSUME

The value of any big data system comes together in the hands of technical and non-technical users, and in the hands of customers using data-centric applications and products. Vendors also refer to CONSUME as use, harness, explore, model, infuse, and sandbox. We discuss three CONSUME models: Advanced Analytics, Business Intelligence (BI), and Real-Time APIs.

Aggregated data does not always allow for deeper data exploration and understanding, so advanced analytics users CONSUME both raw and processed data either directly from the data lake or from a Data Warehouse. These users rely on similar tools from the TRANSFORM stage, including Spark- and Hadoop-based distributed compute. In addition, notebook technologies are a popular tool that allows data engineers and data scientists to create documents containing live code, equations, visualizations, and text. Notebooks allow users to code in a variety of languages, run packages, and share the results. All three Cloud Vendors offer notebook solutions, most based on the popular open source Jupyter project.

BI tools have been in the market for a couple of decades and are now being optimized to work with larger data sets, new types of compute, and directly in the cloud. Each of the three cloud vendors now provide a BI tool optimized to work with their stack. These include AWS Quicksight, GCP Data Studio, and Microsoft Power BI. However, several more mature BI tools exist in the market that work with data from most vendors. BI tools are optimized to work with published data and usage improves greatly with an up-to-date data catalog and some standardization of tables and metrics.

Applications, products, and services also CONSUME raw and transformed data through APIs built on the Real-Time Store or predictive ML models. The same Cloud Vendor ML offerings used to explore and build models also provide Real-Time APIs for alerting, analysis, and personalization. Example use cases include fraud detection, system/sensor alerting, user classification, and product personalization.

CLOUD VENDOR OFFERINGS

AWS, GCP and AZURE have very complex cloud offerings based on their core networking, storage and compute. In addition, they provide vertical offerings for many markets, and within the big data systems and ML/AI verticals they each provide multiple offerings. In the following chart we align the Cloud Vendor offerings within the two-tier big data system taxonomy defined in the second section.

The following table includes some additional Cloud Vendor offerings as well as open source and selected third party tools that provide similar functionality.

THE TIME IS NOW

Most companies that deployed big data systems and data-centric applications in the past 5-10 years did so on-premises (or in colocation) or on top of the Cloud Vendor core infrastructure services, including storage, networking, and compute. Much has changed in the Cloud Vendor offerings since these early deployments. Cloud Vendors now provide a nearly complete set of serverless big data services, and more and more companies see the value of these offerings and are trusting their mission-critical data and applications to run on them. So, now is the time to think about migrating big data applications from on-premises or upgrading bespoke systems built on Cloud Vendor infrastructure services. In order to make the best use of these offerings, get a deep understanding of existing systems, develop a clear migration strategy, and establish a data operations center of excellence.

In order to prepare for migration to Cloud Vendor big data offerings, it is necessary for an organization to get a clear picture of its current big data system. This can be difficult depending on the heterogeneity of existing systems, the types of data-centric products it supports, and the number of teams or people using the system. Fortunately, tools (such as Unravel) exist to monitor, optimize, and plan migrations for big data systems and pipelines. During migration planning it is common to discover inefficient, redundant, and even unnecessary pipelines actively running, chewing up compute, and costing the organization time and money. So, during the development of a migration strategy companies commonly find ways to clean up and optimize their data pipelines and overall data architecture.

It is helpful that all three Cloud Vendors are interested in getting a company's data and applications onto their platforms. To this end, they provide a variety of tools and services to help move data or lift and shift applications and databases. For example, AWS provides a Migration Hub to help plan and execute migrations along with tools like the AWS Database Migration Service. Azure provides free Migration Assessments as well as several tools. And GCP provides a variety of migration strategies and tools like Anthos and Velostrata, depending on a company's current and future system requirements.

Please take a look at the Cloud Vendor migration support sites below.

No matter whether a company runs an on-premises system, a fully managed serverless environment, or some hybrid combination, it needs to build a core competence in data operations. DataOps is a rapidly emerging discipline that companies need to own; it is difficult to outsource. Most data implementations utilize tools from multiple vendors, maintain hybrid cloud/on-premises systems, or rely on more than one Cloud Vendor, so it is difficult to rely on a single company or Cloud Vendor to manage all the DataOps tasks for an organization.

Typical scope includes:

  • Data quality
  • Metadata management
  • Pipeline optimization
  • Cost management and chargeback
  • Performance management
  • Resource management
  • Business stakeholder management
  • Data governance
  • Data catalogs
  • Data security & compliance
  • ML/AI model management
  • Corporate metrics and reporting

Wherever you are on your cloud adoption and workload migration journey, now is the time to start or accelerate your strategic thinking and execution planning for cloud-based data services. Serverless offerings are maturing quickly and give companies faster time to value, increased standardization, and overall lower people and technology costs. However, as migration goes from planning to reality, ensure you invest in the critical skills, technology, and process changes needed to establish a data operations center of excellence.

The post The Guide To Understanding Cloud Data Services in 2022 appeared first on Unravel.

]]>
https://www.unraveldata.com/understanding-cloud-data-services/feed/ 1
Unravel Data Operations Guide For Amazon AWS https://www.unraveldata.com/resources/unravel-data-operations-guide-for-amazon-aws/ https://www.unraveldata.com/resources/unravel-data-operations-guide-for-amazon-aws/#respond Fri, 12 Apr 2019 03:31:41 +0000 https://www.unraveldata.com/?p=5204

Thank you for your interest in the Unravel Data Operations Guide for Amazon AWS. You can download it here.

The post Unravel Data Operations Guide For Amazon AWS appeared first on Unravel.

]]>

Thank you for your interest in the Unravel Data Operations Guide for Amazon AWS.

You can download it here.

The post Unravel Data Operations Guide For Amazon AWS appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-operations-guide-for-amazon-aws/feed/ 0
Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail https://www.unraveldata.com/common-failures-slowdowns-part-ii/ https://www.unraveldata.com/common-failures-slowdowns-part-ii/#respond Tue, 09 Apr 2019 05:55:13 +0000 https://www.unraveldata.com/?p=2494 WHY YOUR SPARK APPS ARE SLOW OR FAILING PART II DATA SKEW AND GARBAGE COLLECTION

The second part of our series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many […]

The post Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail appeared first on Unravel.

]]>
WHY YOUR SPARK APPS ARE SLOW OR FAILING PART II DATA SKEW AND GARBAGE COLLECTION

The second part of our series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many performance challenges with Spark, the symptoms increase as the scale of data handled by the application increases.

What is Data Skew?

In an ideal Spark application run, when Spark wants to perform a join, for example, join keys would be evenly distributed and each partition would get a nicely organized chunk of data to process. However, real business data is rarely so neat and cooperative. We often end up with less-than-ideal data organization across the Spark cluster, which results in degraded performance due to data skew.

Data skew is not an issue with Spark per se; rather, it is a data problem. It is caused by an uneven distribution of the underlying data, and uneven partitioning is sometimes unavoidable given the overall data layout or the nature of the query.

For joins and aggregations, Spark needs to co-locate all records of a single key in a single partition; records for other keys are distributed across other partitions. If one key has far more records than the rest, its partition becomes very large, and that skew is problematic for any query engine unless it is handled specially.

Dealing with data skew

Data skew problems are more apparent in situations where data needs to be shuffled in an operation such as a join or an aggregation. Shuffle is an operation done by Spark to keep related data (data pertaining to a single key) in a single partition. For this, Spark needs to move data around the cluster. Hence shuffle is considered the most costly operation.

Common symptoms of data skew are:

  • Stuck stages & tasks
  • Low utilization of CPU
  • Out of memory errors

There are several tricks we can employ to deal with the data skew problem in Spark.

Identifying and resolving data skew

Spark users often observe all tasks finishing within a reasonable amount of time, only to have one task take forever. In all likelihood, this indicates that your dataset is skewed. This behavior also results in overall underutilization of the cluster, which is especially a problem when running Spark in the cloud, where over-provisioned cluster resources are wasteful and costly.

If skew is at the data source level (e.g., a Hive table is partitioned on a _month key and the table has far more records for a particular _month), it will cause skewed processing in the stage that reads from the table. In such a case, restructuring the table with a different partition key (or keys) helps. However, sometimes that is not feasible, as the table might be used by other data pipelines in the enterprise.

In such cases, there are several things that we can do to avoid skewed data processing.

Data Broadcast

If we are doing a join operation on a skewed dataset, one of the tricks is to increase the “spark.sql.autoBroadcastJoinThreshold” value so that smaller tables get broadcast. If you do this, make sure there is sufficient driver and executor memory to hold the broadcast table.
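
To make this concrete, here is a minimal sketch in Scala. The table and column names (sales_facts, store_dims, store_id) are hypothetical, and the ~100 MB threshold is only an illustrative value; tune it to the size of your small table and the memory available to the driver and executors.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("BroadcastJoinExample")
  // Raise the auto-broadcast threshold to ~100 MB so small dimension
  // tables are shipped to every executor instead of being shuffled.
  .config("spark.sql.autoBroadcastJoinThreshold", "104857600")
  .getOrCreate()

val facts = spark.table("sales_facts")   // large table with skewed join keys
val dims  = spark.table("store_dims")    // small lookup table

// The broadcast() hint forces a broadcast join regardless of the threshold,
// which avoids shuffling the skewed side of the join entirely.
val joined = facts.join(broadcast(dims), Seq("store_id"))
joined.count()
```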

Data Preprocess

If there are too many null values in a join or group-by key, they will skew the operation. Try to preprocess the null values by replacing them with random ids and handling them in the application, as in the sketch below.
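
A minimal sketch of this idea in Scala, assuming an active SparkSession named spark and hypothetical tables events and users joined on a string column user_id:

```scala
import org.apache.spark.sql.functions._

val events = spark.table("events")   // many rows have a NULL user_id
val users  = spark.table("users")

// Replace NULL join keys with unique throwaway values so these rows spread
// across partitions instead of all hashing to the same one. They can never
// match a real user, so an inner join simply drops them afterwards.
val eventsSpread = events.withColumn(
  "user_id",
  when(col("user_id").isNull,
       concat(lit("null_"), monotonically_increasing_id().cast("string")))
    .otherwise(col("user_id"))
)

val joined = eventsSpread.join(users, Seq("user_id"))
```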

Salting

In a SQL join operation, the join key is changed to redistribute data in an even manner so that processing a single partition does not take disproportionately long. This technique is called salting. Let's take an example to check the outcome of salting. In a join or group-by operation, Spark maps a key to a particular partition id by taking the hash code of the key modulo the number of shuffle partitions.

Let’s assume there are two tables with the following schema.

Let’s consider a case where a particular key is skewed heavily e.g. key 1, and we want to join both the tables and do a grouping to get a count. For example,

After the shuffle stage induced by the join operation, all the rows having the same key need to be in the same partition. Look at the above diagram: all the rows of key 1 are in partition 1, and similarly, all the rows with key 2 are in partition 2. It is quite natural that processing partition 1 will take more time, as that partition contains more data. Let's check Spark's UI for the shuffle stage run time of the above query.

As we can see, one task took a lot more time than the others, and with more data the difference would be even more significant. This can also cause application instability in terms of memory usage, as one partition would be heavily loaded.

Can we add something to the data so that our dataset will be more evenly distributed? Most users with a skew problem use the salting technique. Salting adds random values to the join key of one of the tables; in the other table, we replicate the rows to match the random keys. The idea is that if the join condition is satisfied by key1 == key1, it should also be satisfied by key1_<salt> = key1_<salt>. The salt value helps the dataset become more evenly distributed.

Here is an example of how to do that in our use case; a sketch appears below. Note the number 20, used both in the random function and when exploding the dataset: it is the number of distinct sub-keys we want for our skewed key. This is a very basic example and can be improved to salt only the keys that are actually skewed.
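
The original code listing is not reproduced here, so the following is only a hedged sketch of the same pattern in Scala. It assumes an active SparkSession, two hypothetical tables with a string join column named key, and the same salt factor of 20 described above:

```scala
import org.apache.spark.sql.functions._

val numSalts = 20                          // distinct sub-keys per skewed key
val left  = spark.table("skewed_left")     // large table, key 1 heavily skewed
val right = spark.table("small_right")     // smaller table joined on "key"

// 1. Salt the skewed side: append a random suffix 0..19 to every key.
val leftSalted = left.withColumn(
  "salted_key",
  concat(col("key"), lit("_"), (rand() * numSalts).cast("int").cast("string"))
)

// 2. Explode the other side: replicate each row once per possible salt so
//    key1_0 ... key1_19 all find a match.
val rightExploded = right
  .withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))
  .withColumn("salted_key", concat(col("key"), lit("_"), col("salt").cast("string")))
  .drop("key", "salt")

// 3. Join on the salted key; the hot key's rows now spread over ~20 partitions.
// 4. Group by the original key to recover the final counts.
val counts = leftSalted
  .join(rightExploded, Seq("salted_key"))
  .groupBy("key")
  .count()
```

Whether salting pays off depends on the data; as noted below, run the job with and without the salt before settling on an approach.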

Now let's check the Spark UI again. As we can see, processing time is more even now.

Note that for smaller data the performance difference won't be very noticeable. Sometimes shuffle compression also plays a role in the overall runtime: skewed data can be compressed heavily due to its repetitive nature, so the overall disk I/O and network transfer are reduced. We need to run the app both without salt and with salt to decide which approach best fits our case.

Garbage Collection

Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java's memory management and garbage collection (GC). Therefore, GC can be a major issue affecting many Spark applications.

Common symptoms of excessive GC in Spark are:

  • Slowness of application
  • Executor heartbeat timeout
  • GC overhead limit exceeded error

Spark's memory-centric approach and data-intensive applications make GC a more common issue than in other Java applications. Thankfully, it's easy to diagnose whether your Spark application is suffering from a GC problem: the Spark UI marks executors in red if they have spent too much time doing GC.

You can determine whether Spark executors are spending a significant share of CPU cycles on garbage collection by looking at the “Executors” tab in the Spark application UI. Spark will mark an executor in red if it has spent more than 10% of its task time in garbage collection, as you can see in the diagram below.

The Spark UI indicates excessive GC in red

Addressing garbage collection issues

Here are some of the basic things we can do to try to address GC issues.

Data structures

In RDD-based applications, use data structures with fewer objects; for example, use an array instead of a list, as in the sketch below.
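
For instance, a boxed Scala List[Int] allocates a cons cell (plus a boxed integer) per element, while Array[Int] stores primitives contiguously; the difference adds up quickly across billions of records. A rough sketch, assuming an active SparkSession named spark:

```scala
// Boxed linked list: one cons cell and one boxed Int per element, i.e. lots of
// short-lived objects for the garbage collector to track.
val asList: List[Int] = (1 to 1000000).toList

// Primitive array: a single contiguous block of ints, producing almost no garbage.
val asArray: Array[Int] = (1 to 1000000).toArray

// The same idea inside an RDD transformation: prefer arrays for per-record
// intermediate results.
val rdd = spark.sparkContext.parallelize(1 to 1000)
val triples = rdd.map(i => Array(i, i * 2, i * 3))   // rather than List(i, i * 2, i * 3)
```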

Specialized data structures

If you are dealing with primitive data types, consider using specialized data structures like Koloboke or fastutil. These structures optimize memory usage for primitive types.

Storing data off-heap

The Spark execution engine and Spark storage can both store data off-heap. You can switch on off-heap storage with the following settings:

  • --conf spark.memory.offHeap.enabled=true
  • --conf spark.memory.offHeap.size=Xgb

Be careful when using off-heap storage: it does not reduce the on-heap memory size, i.e., it won't shrink heap memory. So, to stay within an overall memory limit, assign a correspondingly smaller heap size, as in the sketch below.
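
The settings can be passed with --conf at submit time or set when building the session. The sizes below are purely illustrative; the point is that the on-heap allocation is kept smaller because the off-heap budget comes on top of it:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sizes only: 4 GB on-heap plus 4 GB off-heap per executor.
// Off-heap space is allocated in addition to the heap, so the heap is kept
// deliberately smaller to respect the executor's overall memory budget.
val spark = SparkSession.builder()
  .appName("OffHeapExample")
  .config("spark.executor.memory", "4g")            // reduced on-heap allocation
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")
  .getOrCreate()
```

Keep in mind that the executor's total footprint (heap plus off-heap plus overhead) still has to fit within whatever container limit the resource manager enforces.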

Built-in vs User Defined Functions (UDFs)

If you are using Spark SQL, use the built-in functions as much as possible rather than writing new UDFs. Most of Spark's built-in functions can work directly on UnsafeRow and don't need to convert values to wrapper data types. This avoids creating garbage and also plays well with code generation; see the sketch below.
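
A small sketch of the difference, using a hypothetical customers table with a string name column:

```scala
import org.apache.spark.sql.functions._

val df = spark.table("customers")

// A hand-rolled UDF forces Spark to convert each internal row into JVM
// objects, creating short-lived garbage and opting out of codegen:
val toUpperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val withUdf = df.withColumn("name_upper", toUpperUdf(col("name")))

// The built-in upper() operates on Spark's internal row format directly,
// creates far less garbage, and participates in whole-stage code generation:
val withBuiltin = df.withColumn("name_upper", upper(col("name")))
```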

Be stingy about object creation

Remember that we may be working with billions of rows. If we create even a small temporary object of 100 bytes for each row, we will generate 1 billion * 100 bytes (roughly 100 GB) of garbage. One way to avoid this is sketched below.
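
One common pattern is to hoist per-row allocations out of the tight loop, for example with mapPartitions. The path and date format below are hypothetical:

```scala
import java.text.SimpleDateFormat

val lines = spark.sparkContext.textFile("hdfs:///logs/events")   // hypothetical path

// Wasteful: a new SimpleDateFormat per record, i.e. billions of temporary objects.
val wasteful = lines.map { line =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  fmt.parse(line.substring(0, 10)).getTime
}

// Frugal: one formatter per partition, reused for every record in that partition.
val frugal = lines.mapPartitions { iter =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  iter.map(line => fmt.parse(line.substring(0, 10)).getTime)
}
```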

End of Part II

So far, we have focused on memory management, data skew, and garbage collection as causes of slowdowns and failures in your Spark applications. For Part III of the series, we will turn our attention to resource management and cluster configuration where issues such as data locality, IO-bound workloads, partitioning, and parallelism can cause some real headaches unless you have good visibility and intelligence about your data runtime.

If you found this blog useful, you may wish to view Part I of this series Why Your Spark Apps are Slow or Failing: Part I Memory Management. Also see our blog Spark Troubleshooting, Part 1 – Ten Challenges.

The post Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail appeared first on Unravel.

]]>
https://www.unraveldata.com/common-failures-slowdowns-part-ii/feed/ 0
Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/ https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/#respond Tue, 05 Mar 2019 07:32:57 +0000 https://www.unraveldata.com/?p=2284 Big Data Predictions 2019 by Tasso Argyros

Unravel lucked out with the quality and strategic Impact of our advisory board. Collectively, they hold a phenomenal track record of entrepreneurship, leadership, and product innovation, and I am pleased to introduce them to the Unravel […]

The post Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros appeared first on Unravel.

]]>
Big Data Predictions 2019 by Tasso Argyros

Unravel lucked out with the quality and strategic impact of our advisory board. Collectively, they hold a phenomenal track record of entrepreneurship, leadership, and product innovation, and I am pleased to introduce them to the Unravel community. Looking into the year ahead, we asked two of our advisors for their perspective on what 2019 holds as we dive into the next few quarters. Our first guest, Herb Cunitz, was featured in Part 1 of our Prediction series (read it here) and discussed breakout modern data stack technologies, the role of artificial intelligence (AI) and automation in the modern data stack, and the increasing modern data stack skills gap. Now, in Part 2, Tasso Argyros, founder and CEO of ActionIQ, will outline his take on the upcoming trends for 2019.

Tasso is a widely recognized and respected innovator, with awards and accolades from the World Economic Forum, BusinessWeek, and Forbes, and has a storied career with more than a decade's experience working with data-focused organizations and advising companies on how to accelerate growth and become market leaders. He is CEO and founder of ActionIQ, a company giving marketers direct access to their customer data, and he previously founded Aster Data, a modern data stack pioneer that was ultimately acquired by Teradata. He was also a founder of Data Elite, an early-stage big data seed fund that helped incubate Unravel Data.

In 2018, big data matured and hit a point of inflection, with an increasing number of Fortune 1000 enterprises deploying critical modern data applications that depend on the modern data stack into production. What effect will this have on the product innovation pipeline and adoption for 2019 and beyond? This is an excerpt of my recent conversation with Tasso:

Unravel: Looking back in the ‘rear-view mirror’ to the past year, what were the most exciting developments and innovations in the modern data stack?

TA: While some say innovation has slowed down in big data, I’m seeing the opposite and believe it has accelerated. When we started Aster Data in 2005, many thought that database innovation was dead. Between Oracle, IBM, and some specialty players like Teradata, investors and pundits believed that all problems had been solved and that there was nothing else to do. Instead, it was the perfect time to start a database company as the seeds of the Big Data revolution were about to be planted.

Since then, the underlying infrastructure has experienced massive, continual changes at roughly three-to-four-year intervals. For example, in the mid-2000s, the primary industry trend was moving from expensive proprietary hardware to more cost-effective commodity hardware, and in the early 2010s, the industry spotlighted open source data software. Now, for the past few years, the industry has been focused on the introduction of and transition to cloud solutions, the increasing volume of streaming data, and the debut of Internet of Things (IoT) technologies.

As we focus on finding better ways to manage data, introduce new technologies and databases, and explore the ecosystem that sits on top of the big data layer, these will be the underlying trends that continue to drive innovation in 2019 and beyond. Whereas initially big data was more about collection, aggregation, and experimentation, in 2018 it became clear that big data is a crucial, mission-critical aspect of the next generation of applications, and there is much more to learn.

Unravel: What breakout technology will move to the forefront in 2019?

TA: There has been a definite increase in the number and variety of data-centric applications (versus data infrastructure) that are being created and are in use today. As a result, there is rising interest in the industry in learning how to manage data for specific systems and in different environments, including on-premises, hybrid, and across multiple clouds. In 2019, the industry will start empowering these organizations with tools that help non-experts become self-sufficient at managing their data operations processes end-to-end, irrespective of where code is executing.

Unravel: Which industries or use cases have delivered the most value and have seen the most significant adoption of a mature modern data ecosystem?

TA: Digital-native organizations were the first companies to jump in at scale, which is not a surprise as they have historically advanced more rapidly and been ahead of those who have some form of legacy to consider. Although heavily regulated, financial services institutions saw the value of an effective modern data strategy early on, as did industries that struggled with the cost and complexity of traditional data warehousing approaches when the 2008-2009 recession hit. In fact, few realize that the big recession was one of the key catalysts that accelerated the adoption of new modern data stack technologies.

Big data started with a heavy analytics focus and then, as it matured, turned operational. Now it is coming to the point where streaming data is driving innovation, and many different industries and verticals are set to benefit from this next step. For example, one compelling modern data use case is delivering improved customer experiences through real-time customer data gathering, inference, and personalization.

Moreover, the convergence of data science and big data has accelerated adoption, as it activates the use of big data for critical business decision-making through optimized machine learning. By offering the ability to filter and prepare data, extract insights from large data sets, and capture complex patterns to develop models, big data becomes a critical value driver for modern data application areas like fraud and risk detection, and for industries like telecom and healthcare.

Unravel: Is 2019 the year when ‘Big Data’ gives way to just ‘Data’, as the lines and technologies between the two become increasingly hard to separate? A data pipeline is a data pipeline, after all.

TA: In the early days, there was confusion between big data and data warehousing. Data warehousing was the buzzword during the two decades prior, whereas big data became the hot trend more recently. A data warehouse is a central repository of integrated data; it is rigid and organized. A technology category such as big data is instead a means to store and manage large amounts of data, from many different sources and at a lower cost, in order to make better decisions and more accurate predictions. In short, modern data stack technologies have been more efficient at processing today's high-volume, highly variable data pipelines that continue to grow at ever-increasing rates.

With that in mind, however, nothing stands still for very long, especially with technology innovation. The delineation between categories, as with any maturing market, continues to evolve, and a high degree of fragmentation, often led by open source committers, is juxtaposed with the evolution of existing, adopted technologies. SQL is a good example of this, where the landscape of traditional SQL, NoSQL, NewSQL, and serverless solutions like AWS Athena starts to blur the lines between what is ‘big’ and what is just ‘data’. One thing is for sure: we have come a long way in a short space of time, and ‘Big Data’ is much more than on-premises Hadoop.

Unravel: What role will AI and automation, and capabilities like AIOps, play in the modern data stack in the coming year?

TA: Technologies like Spark, Hive, and Kafka are very complex under the hood, and when an application fails, it requires a specialist with a unique skill set to comb through massive amounts of data and determine the cause and solution. Data operations frameworks need to mature to permit separation of roles rather than relying on a single data engineer to solve all of the problems. Self-service for application owners will relieve part of this bottleneck, but fully operationalizing a growing number of production data pipelines will require a different approach that relies heavily on machine learning and artificial intelligence.

In 2019, as the industry continues to strive for higher efficiency, automation will rise as a solution to the modern data stack skills problem. For example, AI for Operations (AIOps), which combines big data, artificial intelligence, and machine learning functionality, can augment or replace many IT operations processes to, for example, accelerate the identification of performance issues, proactively tune resources to reduce cost, or automatically adjust configurations to prevent application failures.

Unravel: What major vendor shake-ups do you predict in 2019?

TA: The industry now understands that there is more to a big data ecosystem than just Hadoop. Hadoop was the leading open source framework for many years, but the rising popularity of Spark and Kafka has proven that the stack will continue to evolve rapidly in ways we have not yet thought of. Complexity will be with us for a very long time, and along with that, some incredibly innovative new companies, a newly emerging incumbent (Cloudera/Hortonworks), and the cloud giants will jockey for customer mindshare.

The post Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros appeared first on Unravel.

]]>
https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/feed/ 0
AI-Powered Data Operations for Modern Data Applications https://www.unraveldata.com/resources/ai-powered-data-operations-for-modern-data-applications/ https://www.unraveldata.com/resources/ai-powered-data-operations-for-modern-data-applications/#respond Thu, 14 Feb 2019 23:11:41 +0000 https://www.unraveldata.com/?p=5322

Today, more than 10,000 enterprise businesses worldwide use a complex stack composed of a combination of distributed systems like Spark, Kafka, Hadoop, NoSQL databases, and SQL access technologies. At Unravel, we have worked with many of […]

The post AI-Powered Data Operations for Modern Data Applications appeared first on Unravel.

]]>

Today, more than 10,000 enterprise businesses worldwide use a complex stack composed of a combination of distributed systems like Spark, Kafka, Hadoop, NoSQL databases, and SQL access technologies.

At Unravel, we have worked with many of these businesses across all major industries. These customers are deploying modern data applications in their data centers, in private cloud deployments, in public cloud deployments, and in hybrid combinations of these.

This paper addresses the requirements that arise in driving reliable performance in these complex environments. We provide an overview of these requirements both at the level of individual applications and across holistic clusters and workloads. We also present a platform that can deliver automated solutions to address these requirements, and we take a deeper dive into a few of those solutions.

The post AI-Powered Data Operations for Modern Data Applications appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/ai-powered-data-operations-for-modern-data-applications/feed/ 0
CIDR 2019 https://www.unraveldata.com/resources/ai-for-operations-aiops/ https://www.unraveldata.com/resources/ai-for-operations-aiops/#respond Wed, 13 Feb 2019 13:15:15 +0000 https://www.unraveldata.com/?p=2128

As enterprises deploy complex data pipelines into full production, AI for Operations (AIOps) is key to ensuring reliability and performance. I recently traveled down to Asilomar, Calif., to speak at the Conference on Innovative Data Systems Research (CIDR), […]

The post CIDR 2019 appeared first on Unravel.

]]>

As enterprises deploy complex data pipelines into full production, AI for Operations (AIOps) is key to ensuring reliability and performance.

I recently traveled down to Asilomar, Calif., to speak at the Conference on Innovative Data Systems Research (CIDR), a gathering where researchers and practicing IT professionals discuss the latest innovative and visionary ideas in data. There was a diverse range of speakers and plenty of fascinating talks this year, leaving me with some valuable insights and new points of view to consider. However, despite all of the innovative and cutting-edge material, the main theme of the event was an affirmation of what some of us already know: the challenges of managing distributed data systems (the problems we've been addressing at Unravel for years) are very real and are being experienced in both academia and the enterprise, in both small and large businesses, and in the government and private sectors. Moreover, organizations feel like they have no choice but to react to these problems as they occur, rather than preparing in advance and avoiding them altogether.

My presentation looked at the common headaches in running modern applications that are built on many distributed systems. Apps fail, SLAs are missed, cloud costs spiral out of control, etc. There’s a wide variety of causes for these issues, such as bad joins, inaccurate configuration settings, bugs, misconfigured containers, machine degradation, and many more. It’s tough (and sometimes impossible) to identify and diagnose these problems manually, because monitoring data is highly siloed and often not available at all.

As a result, enterprises take a slow, reactive approach to addressing these issues. According to a survey from AppDynamics, most enterprise IT teams first discover that an app has failed when users call or email their help desk or a non-IT member alerts the IT department. Some don’t even realize there’s an issue until users post on social media about it! This of course results in a high mean time to resolve problems (an average of seven hours in AppDynamics’ survey). Clearly, this is not a sustainable way to manage apps.

Unravel's approach starts by collecting all the disparate monitoring data in a unified platform, eliminating silo issues. The platform then applies algorithms to analyze the data and, whenever possible, takes action automatically, providing fixes for any of the problems listed above. The use of AIOps and automation is what really differentiates this approach and provides so much value. Take root cause analysis, for example: manually determining the root cause of an app failure is time consuming, often requires domain expertise, and can last days. Using AI and our inference engine, Unravel can complete root cause analysis in seconds.

How does this work? We draw on a large set of root cause patterns learned from customers and partners. This data is constantly updated. We then continuously inject this root cause data to train and test models for root-cause prediction and proactive remediation.

During the Q&A portion of the session, an engineering lead from Amazon asked a great question about what Unravel is doing to keep their AIOps techniques up to date as modern data stack systems evolve rapidly. Simply, the answer is that the platform doesn’t stop learning. We consistently perform careful probes to identify places where we can enhance the training data for learning, then collect more of that data to do so.

There were a couple of other conference talks that also nicely highlighted the value of AIOps:

  • SageDB: A Learned Database System: Traditional data processing systems have been designed to be general purpose. SageDB presents a new type of data processing system, one which highly specializes to an application through code synthesis and machine learning. (Joint work from Massachusetts Institute of Technology, Google, and Brown University)
  • Learned Cardinalities: Estimating Correlated Joins with Deep Learning: This talk addressed a critical challenge of cardinality estimation, namely correlated data. The presenters have developed a new deep learning method for cardinality estimation. (Joint work from the Technical University of Munich, University of Amsterdam, and Centrum Wiskunde & Informatica)

Organizations are deploying highly distributed data pipelines into full production now. These aren’t science projects, they’re for real, and the bar has been raised even higher for accuracy and performance. And these organizations aren’t just growing data lakes like they were five years ago—they’re now trying to get tangible value from that data by using it to develop a range of next-generation applications. Organizations are facing serious hurdles daily as they take that next step with data, and AIOps is emerging as the clear answer to help them with it.

Big data is no longer just the future, it’s the present, too.

The post CIDR 2019 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/ai-for-operations-aiops/feed/ 0
Unravel Cloud Operations Guide https://www.unraveldata.com/resources/unravel-cloud-operations-guide/ https://www.unraveldata.com/resources/unravel-cloud-operations-guide/#respond Sat, 09 Feb 2019 03:32:32 +0000 https://www.unraveldata.com/?p=5206

Thank you for your interest in the Unravel Cloud Operations Guide. You can download it here.  

The post Unravel Cloud Operations Guide appeared first on Unravel.

]]>

Thank you for your interest in the Unravel Cloud Operations Guide.

You can download it here.

 

The post Unravel Cloud Operations Guide appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-cloud-operations-guide/feed/ 0
AppDynamics co-founder thinks this Big Data startup may repeat his success https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/ https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/#respond Fri, 25 Jan 2019 10:41:18 +0000 https://www.unraveldata.com/?p=1664

The chairman of the app management tech company Cisco bought for $3.7 billion earlier this year thinks a Menlo Park startup can have the same kind of success helping customers manage their Big Data projects.

The post AppDynamics co-founder thinks this Big Data startup may repeat his success appeared first on Unravel.

]]>

The chairman of the app management tech company Cisco bought for $3.7 billion earlier this year thinks a Menlo Park startup can have the same kind of success helping customers manage their Big Data projects.

The post AppDynamics co-founder thinks this Big Data startup may repeat his success appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/feed/ 0
“Above the Trend Line” – Your Industry Rumor Central https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/ https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/#respond Fri, 25 Jan 2019 10:26:07 +0000 https://www.unraveldata.com/?p=1634

Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry […]

The post “Above the Trend Line” – Your Industry Rumor Central appeared first on Unravel.

]]>

Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry alignments, rumors and general scuttlebutt floating around the big data, data science and machine learning industries including behind-the-scenes anecdotes and curious buzz. Our intent is to provide our readers a one-stop source of late-breaking news to help keep you abreast of this fast-paced ecosystem. We’re working hard on your behalf with our extensive vendor network to give you all the latest happenings. Heard of something yourself? Tell us! Just e-mail me at: daniel@insidebigdata.com. Be sure to Tweet Above the Trend Line articles using the hashtag: #abovethetrendline.

The post “Above the Trend Line” – Your Industry Rumor Central appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/feed/ 0
The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/ https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/#respond Fri, 25 Jan 2019 10:04:03 +0000 https://www.unraveldata.com/?p=1609

The past year was a blockbuster one for those working in the data space. Businesses have wrapped their fates around data analytics in an even tighter embrace as competition intensifies and the drive for greater innovation […]

The post The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance appeared first on Unravel.

]]>

The past year was a blockbuster one for those working in the data space. Businesses have wrapped their fates around data analytics in an even tighter embrace as competition intensifies and the drive for greater innovation becomes a top priority.

The year ahead promises to get even more interesting, especially for data managers and professionals. Leading experts in the field have witnessed a number of data trends emerge in 2016, and now see new developments coming into view for 2017.

The post The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/feed/ 0
Data lakes and brick walls, big data predictions for 2017 https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/ https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/#respond Fri, 25 Jan 2019 09:59:28 +0000 https://www.unraveldata.com/?p=1601

There’s been a lot of talk about big data in the past year. But many companies are still struggling with implementing big data projects and getting useful results from their information. In this part of our […]

The post Data lakes and brick walls, big data predictions for 2017 appeared first on Unravel.

]]>

There’s been a lot of talk about big data in the past year. But many companies are still struggling with implementing big data projects and getting useful results from their information.

In this part of our series on 2017 predictions we look at what the experts think will affect the big data landscape in the coming year.

Steve Wilkes, co-founder and CTO at Striim believes we’ll see increasing commoditization, with on-premise data lakes giving way to cloud-based big data storage and analytics utilizing vanilla open-source products like Hadoop and Spark.

The post Data lakes and brick walls, big data predictions for 2017 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/feed/ 0
Lack of talent and compliance worries among cloud predictions for 2017 https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/ https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/#respond Fri, 25 Jan 2019 09:54:53 +0000 https://www.unraveldata.com/?p=1594

This is the time of year when industry experts like to come up with predictions for the coming 12 months. Last week we looked at some of their security forecasts, today it’s the turn of the […]

The post Lack of talent and compliance worries among cloud predictions for 2017 appeared first on Unravel.

]]>

This is the time of year when industry experts like to come up with predictions for the coming 12 months. Last week we looked at some of their security forecasts, today it’s the turn of the cloud to get the crystal ball gazing treatment.

So, what do experts think are going to be the cloud trends of 2017?

The post Lack of talent and compliance worries among cloud predictions for 2017 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/feed/ 0
“Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/ https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/#respond Fri, 25 Jan 2019 09:47:34 +0000 https://www.unraveldata.com/?p=1583

Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry […]

The post “Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 appeared first on Unravel.

]]>

Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry alignments, rumors and general scuttlebutt floating around the big data, data science and machine learning industries including behind-the-scenes anecdotes and curious buzz. Our intent is to provide our readers a one-stop source of late-breaking news to help keep you abreast of this fast-paced ecosystem. We’re working hard on your behalf with our extensive vendor network to give you all the latest happenings. Heard of something yourself? Tell us! Just e-mail me at: daniel@insidebigdata.com. Be sure to Tweet Above the Trend Line articles using the hashtag: #abovethetrendline.

The post “Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/feed/ 0
Reduce Apache Spark Troubleshooting Time from Days to Seconds https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/ https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/#respond Fri, 25 Jan 2019 06:53:27 +0000 https://www.unraveldata.com/?p=1427

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, […]

The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

]]>

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.

Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure in these messy, raw, and distributed logs is hard for Spark experts and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing insights to Spark users, like what is shown in Figure 1.

Figure 1: Insights from automatic root cause analysis improve Spark user productivity

Spark platform providers like the Amazon, Azure, Databricks, and Google clouds, as well as Application Performance Management (APM) solution providers like Unravel, have access to a large and growing dataset of logs from millions of Spark application failures. This dataset is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails [1]. Such actionable root-cause identification improves the productivity of Spark users significantly.

Clues in the logs

A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application, as well as how the application interacts with the rest of the Spark platform. These logs form the key dataset that Spark users scan for clues to understand why an application failed.

However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals make things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.

Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:

  • Continuously collecting logs from a variety of Spark application failures
  • Converting logs into feature vectors
  • Learning a predictive model for RCA from these feature vectors

Of course, as with any intelligent solution that uses AI and ML techniques, the devil is in the details!

Figure 2: Root cause analysis of Spark application failures

Data collection for training: As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.

Structured versus unstructured data: Logs are mostly unstructured data. To keep the accuracy of model predictions high in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to also collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details for Scala, Hadoop, the OS, and so on.

Labels: ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:

  • Configuration errors
  • Deployment errors
  • Resource errors
  • Data errors
  • Application errors
  • Unknown factors

The leaves of this taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand (like the message in Figure 1), and (b) recommended fixes for this root cause. We will cover the root-cause taxonomy in a future blog.

The labels are associated with the logs in one of two ways. First, the root cause is already known when the logs are generated, as a result of injecting a specific root cause we have designed to produce an application failure in our lab framework. The second way in which a label is given to the logs for an application failure is when a Spark domain expert manually diagnoses the root cause of the failure.

Input Features: Once the logs are available, there are various ways in which the feature vector can be extracted from these logs. One way is to transform the logs into a bit vector (e.g., 1001100001). Each bit in this vector represents whether a specific message template is present in the respective logs. A prerequisite to this approach is to extract all possible message templates from the logs. A more traditional approach for feature vectors from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is mostly similar to the bit vector approach except for a couple of differences: (i) each bit in the vector now corresponds to a word instead of a message template, and (ii) instead of 0’s and 1’s, it is more common to use numeric values generated using techniques like TF-IDF.

More recent advances in ML have popularized vector embeddings. In particular, we use the doc2vec technique [2]. At a high level, these vector embeddings map words (or paragraphs, or entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The doc2vec technique uses a three-layer neural network to gauge the context of the document and relate similar content together.

Once the feature vectors are generated along with the labels, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow and deep learning techniques, including random forests, support vector machines, Bayesian classifiers, and neural networks; a minimal sketch of one such pipeline appears below.
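
As an illustration only (this is not Unravel's actual implementation), the bag-of-words/TF-IDF route described above can be prototyped with Spark MLlib in a few lines of Scala. The table and column names (labeled_failure_logs, log_text, root_cause, new_failure_logs) are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}

// One row per failed application: the concatenated driver/executor log text
// plus the root-cause label assigned in the lab or by a domain expert.
val training = spark.table("labeled_failure_logs")   // columns: log_text, root_cause

val tokenizer = new Tokenizer().setInputCol("log_text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val labeler   = new StringIndexer().setInputCol("root_cause").setOutputCol("label")
val forest    = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

val model = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, idf, labeler, forest))
  .fit(training)

// Predict the most likely root-cause class for newly failed applications.
val predictions = model.transform(spark.table("new_failure_logs"))
```

Vector embeddings such as doc2vec, or the deep models mentioned above, would replace the tokenizer and TF-IDF stages with a learned document representation, but the overall train-then-predict flow is the same.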

Conclusion

The overall results produced by our solution are very promising and will be presented at Strata 2017 in New York. We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the root cause predicted by the model in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The bottleneck currently is in generating labels. We are working on active learning techniques [3] that nicely prioritize the human efforts required in generating labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information to build an accurate model. The expert labels these instances and then the predictive model is rebuilt.

Manual failure diagnosis in Spark is not only time-consuming but highly challenging due to correlated failures that can occur simultaneously. Our unique RCA solution enables the diagnosis process to function effectively even in the presence of multiple concurrent failures, as well as noisy data prevalent in production environments. Through automated failure diagnosis, we remove the burden of manually troubleshooting failed applications from the hands of Spark users and developers, enabling them to focus entirely on solving business problems with Spark.

To learn more about how to run Spark in production reliably, contact us.

[1] S. Duan, S. Babu, and K. Munagala, “Fa: A System for Automating Failure Diagnosis”, International Conference on Data Engineering, 2009

[2] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents”, International Conference on Machine Learning, 2014

[3] S. Duan and S. Babu, “Guided Problem Diagnosis through Active Learning“, International Conference on Autonomic Computing, 2008

The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/feed/ 0
Eckerson Report Best Practices in DataOps https://www.unraveldata.com/resources/eckerson-report-best-practices-in-dataops/ https://www.unraveldata.com/resources/eckerson-report-best-practices-in-dataops/#respond Fri, 01 Jun 2018 21:51:16 +0000 https://www.unraveldata.com/?p=5318

Data professionals go through gyrations to extract, ingest, move, clean, format, integrate, transform, calculate, and aggregate data before releasing it to the business community. These “data pipelines” are inefficient and error prone: data hops across multiple […]

The post Eckerson Report Best Practices in DataOps appeared first on Unravel.

]]>

Data professionals go through gyrations to extract, ingest, move, clean, format, integrate, transform, calculate, and aggregate data before releasing it to the business community. These “data pipelines” are inefficient and error prone: data hops across multiple systems and is processed by various software programs. Humans intervene to apply manual workarounds to fix recalcitrant transaction data that was never designed to be combined, aggregated, and analyzed by knowledge workers. Business users wait months for data sets or reports. The hidden costs of data operations are immense.

Read this guide to learn how DataOps can streamline the process of building, changing, and managing data pipelines.

The post Eckerson Report Best Practices in DataOps appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/eckerson-report-best-practices-in-dataops/feed/ 0