Performance & Data Eng Archives – Unravel
https://www.unraveldata.com/resources/application-performance/

Data Observability: The Missing Link for Data Teams
https://www.unraveldata.com/resources/dataops-observability-the-missing-link-for-data-teams/ (September 15, 2022)


As organizations invest ever more heavily in modernizing their data stacks, data teams—the people who actually deliver the value of data to the business—are finding it increasingly difficult to manage the performance, cost, and quality of these complex systems.

Data teams today find themselves in much the same boat software teams were in 10+ years ago. Software teams dug themselves out of that hole with DevOps best practices and tools—chief among them full-stack observability.

But observability for data applications is a completely different animal from observability for web applications. Effective DataOps demands observability that’s designed specifically for the different challenges facing different data team members throughout the DataOps lifecycle.

That’s data observability for DataOps.

In this white paper, you’ll learn:

  • What is data observability for DataOps?
  • What data teams can learn from DevOps
  • Why observability designed for web apps is insufficient for data apps
  • What data teams need from data observability for DataOps
  • The 5 capabilities of data observability for DataOps

Download the white paper here.

Download The Unravel Guide to DataOps
https://www.unraveldata.com/resources/unravel-guide-to-dataops/ (March 17, 2021)


Read The Unravel Guide to DataOps

The Unravel Guide to DataOps gives you the information and tools you need to sharpen your DataOps practices and increase data application performance, reduce costs, and remove operational headaches. The ten-step DataOps lifecycle shows you everything you need to do to create data products that run and run.

The Guide introduces the Unravel Data platform as a powerful way to build a robust and effective DataOps culture within your organization. Three customer use cases show you the power of DataOps, and of Unravel, in the hands of your data teams. As one Unravel customer says in the Guide, “I’m sleeping a lot easier than I did a year ago.”

Download the Guide to help you as you master the current state, and future possibilities, of DataOps adoption in organizations worldwide.

How to intelligently monitor Kafka/Spark Streaming data pipelines
https://www.unraveldata.com/how-to-intelligently-monitor-kafka-spark-streaming-data-pipelines/ (June 6, 2019)


Identifying issues in distributed applications like a Spark streaming application is not trivial. This blog describes how Unravel helps you connect the dots across streaming applications to identify bottlenecks.

Spark Streaming is widely used in real-time data processing, especially with Apache Kafka. A typical scenario involves a Kafka producer application writing to a Kafka topic. The Spark application then subscribes to the topic and consumes records. The records might be further processed downstream using operations like map and foreachRDD, or saved into a datastore.
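For readers who want to picture the consumer side of such a pipeline, here is a minimal, hedged sketch using Spark Structured Streaming (the newer API, rather than the DStream operations mentioned above); the broker address is hypothetical and the topic name is only illustrative.

```python
# Minimal sketch of a Spark job consuming a Kafka topic (Structured Streaming).
# Assumes the spark-sql-kafka connector is on the classpath; the broker address
# is hypothetical and the topic name is only illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-consumer-sketch").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "tweetSentiment-1000")          # illustrative topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers binary key/value columns; cast the value to a string and
# write each micro-batch to a sink (console here, a datastore in practice).
query = (
    stream.select(col("value").cast("string").alias("record"))
    .writeStream
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```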

Below are two scenarios illustrating how you can use Unravel’s APMs to inspect, understand, correlate, and finally debug issues around a Spark Streaming app consuming a Kafka topic. They demonstrate how Unravel’s APMs can help you perform an end-to-end analysis of data pipelines and their applications. In turn, this wealth of information helps you debug and resolve issues effectively and efficiently, where doing so otherwise would require far more time and effort.

Use Case 1: Producer Failure

Consider a running Spark application that has not processed any data for a significant amount of time. The application’s previous iterations successfully processed data, and you know for a certainty that the current iteration should be processing data.

Detect Spark is processing no data

When you notice the Spark application has been running without processing any data, you want to check the application’s I/O.

Checking I/O Consumption

Click on the application to bring it up and see its I/O KPIs.

View Spark input details

When you see the Spark application is consuming from Kafka, you need to examine the Kafka side of the process to determine the issue. Click on a batch to find the topic it is consuming.

View Kafka topic details

Bring up the Kafka topic details, which graph Bytes In Per Second, Bytes Out Per Second, Messages In Per Second, and Total Fetch Requests Per Second. In this case, the Bytes In Per Second and Messages In Per Second graphs show a steep decline; this indicates the Kafka topic is no longer receiving any data.

The Spark application is not consuming data because there is no data in the pipeline to consume: with nothing flowing into the topic, there is nothing for Spark to process downstream. You should notify the owner of the producer application that writes to this particular topic; upon notification, they can drill down to determine and resolve the underlying issues causing writes to this topic to fail.

Use Case 2: Slow Spark app, offline Kafka partitions

In this scenario, an application’s run has processed significantly less data than the expected norm. In the previous scenario, the analysis was fairly straightforward; in cases like this one, the underlying root cause can be far more subtle to identify. The Unravel APM can help you quickly and efficiently root-cause such issues.

Detect inconsistent performance

When you see a considerable difference in the data processed by consecutive runs of a Spark App, bring up the APMs for each application.

Detect drop in input records

Examine the trend lines to identify a time window in which there is a drop in input records for the slow app. The image of the slow app shows a drop-off in consumption.

Drill down to I/O drop off

Drill down into the slow application’s stream graph to narrow the time period (window) in which the I/O dropped off.

Drill down to suspect time interval

Unravel displays the batches that were running during the problematic time period. Inspect the time interval to see the batches’ activity. In this example, no input records were processed during the suspect interval. You can dig further into the issue by selecting a batch and examining its input tab. In the case below, you can infer that no new offsets have been read, based on the input source’s description, “Offsets 1343963 to Offset 1343963”.

View KPIs on the Kafka monitor screen

You can further debug the problem on the Kafka side by viewing the Kafka cluster monitor. Navigate to OPERATIONS > USAGE DETAILS > KAFKA. At a glance, the cluster’s KPIs convey important and pertinent information.

Here, the first two metrics show that the cluster is facing issues which need to be resolved as quickly as possible.

# of under replicated partitions is a broker-level metric: the count of partitions for which the broker is the leader replica and one or more follower replicas have not yet caught up. In this case there are 131 under-replicated partitions.

# of offline partitions is a broker-level metric provided by the cluster’s controller broker. It is a count of the partitions that currently have no leader; such partitions are unavailable for both producing and consuming. In this case there are two (2) offline partitions.
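If you want to sanity-check these two partition metrics outside of Unravel, the hedged sketch below uses the confluent-kafka Python client (assumed to be installed); the broker address is hypothetical, and this is an illustration rather than anything described in the original post.

```python
# Hedged sketch: count offline and under-replicated partitions directly from
# cluster metadata using the confluent-kafka client. Broker address is hypothetical.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker1:9092"})
metadata = admin.list_topics(timeout=10)

offline = 0
under_replicated = 0
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        if partition.leader == -1:                           # no leader: offline partition
            offline += 1
        elif len(partition.isrs) < len(partition.replicas):  # ISR smaller than replica set
            under_replicated += 1

print(f"offline partitions: {offline}, under-replicated partitions: {under_replicated}")
```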

You can select the BROKER tab to see the broker table. Here the table hints that the broker Ignite1.kafka2 is facing an issue; therefore, the broker’s status should be checked.

  • Broker Ignite1.kafka2 has two (2) offline partitions. The broker table shows that it is/was the controller for the cluster; we could have inferred this because only the controller can have offline partitions. Examining the table further, we see that Ignite.kafka1 also reports an active controller count of one (1).
  • The Kafka KPI # of Controllers lists the Active Controller as one (1), which should indicate a healthy cluster. The fact that two (2) brokers each report themselves as the active controller indicates the cluster is in an inconsistent state.

Identify cluster controller

The broker table confirms what the KPIs suggested: Ignite1.kafka2 has two offline partitions and an active controller count of 1, yet Ignite.kafka1 also reports a controller count of 1, leaving the cluster in an inconsistent state. This hints that the Ignite1.kafka2 broker is facing some issue, and the first thing to check is the status of that broker.

View consumer group status

You can further corroborate the hypothesis by checking the status of the consumer group for the Kafka topic the Spark app is consuming from. In this example, the consumer group’s status indicates it is currently stalled. The topic the stalled group is consuming is tweetSentiment-1000.

Topic drill down

To drill down into the consumer group’s topic, click on the TOPIC tab in the Kafka cluster manager and search for the topic. In this case, the topic’s trend lines for the time range in which the Spark application’s consumption dropped off show a sharp decrease in Bytes In Per Second and Bytes Out Per Second. This decrease explains why the Spark app is not processing any records.

Consumer group details

To view the consumer group’s lag and offset consumption trends, click on the consumer groups listed for the topic. Below, the log end offset trend line for the topic’s consumer group, stream-consumer-for-tweetSentimet1000, is a constant (flat) line showing that no new offsets have been consumed over time. This further supports our hypothesis that something is wrong with the Kafka cluster, and especially with broker Ignite1.kafka2.

Conclusion

These are but two examples of how Unravel helps you identify, analyze, and debug Spark Streaming applications consuming from Kafka topics. Unravel’s APMs collate, consolidate, and correlate information from various stages in the data pipeline (Spark and Kafka), allowing you to troubleshoot applications without ever having to leave Unravel.

 

Why Memory Management is Causing Your Spark Apps To Be Slow or Fail
https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/ (March 20, 2019)


Spark applications are easy to write and easy to understand when everything goes according to plan. However, things become very difficult when Spark applications start to slow down or fail. (See our blog Spark Troubleshooting, Part 1 – Ten Biggest Challenges.) Sometimes a well-tuned application might fail due to a data change or a data layout change. Sometimes an application that had been running well starts behaving badly due to resource starvation. The list goes on and on.

It’s not only important to understand a Spark application, but also its underlying runtime components like disk usage, network usage, contention, etc., so that we can make an informed decision when things go bad.

In this series of articles, I aim to capture some of the most common reasons why a Spark application fails or slows down. The first and most common is memory management.

If we were to get all Spark developers to vote, out-of-memory (OOM) conditions would surely be the number one problem everyone has faced.  This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:

  • Incorrect usage of Spark
  • High concurrency
  • Inefficient queries
  • Incorrect configuration

To avoid these problems, we need to have a basic understanding of Spark and our data. There are certain things that can be done that will either prevent OOM or rectify an application which failed due to OOM.  Spark’s default configuration may or may not be sufficient or accurate for your applications. Sometimes even a well-tuned application may fail due to OOM as the underlying data has changed.

Out of memory issues can be observed for the driver node, executor nodes, and sometimes even for the node manager. Let’s take a look at each case.

See How Unravel identifies Spark memory issues
Create a free account

Out of memory at the driver level

A driver in Spark is the JVM where the application’s main control flow runs. More often than not, the driver fails with an OutOfMemory error due to incorrect usage of Spark. Spark is an engine for distributing workload among worker machines, and the driver should only be considered an orchestrator. In typical deployments, the driver is provisioned with less memory than the executors, so we should be careful about what we do on the driver.

Common causes which result in driver OOM are:

1. rdd.collect()
2. sparkContext.broadcast
3. Driver memory configured too low for the application’s requirements
4. Misconfiguration of spark.sql.autoBroadcastJoinThreshold. Spark uses this limit to broadcast a relation to all the nodes in a join operation. At its first usage, the whole relation is materialized at the driver node. Sometimes multiple tables are broadcast as part of the query execution.

Try to write your application in such a way that you avoid all explicit result collection at the driver; you can delegate that work to the executors. For example, if you want to save the results to a particular file, you can either collect them at the driver or have the executors write them out directly, as in the sketch below.
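A minimal sketch of that advice follows; the SparkSession setup, paths, and filter condition are hypothetical.

```python
# Hedged sketch: let executors write results to storage instead of collecting
# them at the driver. Paths and the filter condition are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-oom-sketch").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input path

# Risky: pulls the entire result set into the driver JVM and can OOM the driver.
# rows = df.filter("status = 'ERROR'").collect()

# Safer: the executors write results directly, so nothing large has to fit
# in driver memory.
df.filter("status = 'ERROR'").write.mode("overwrite").parquet("/data/errors")
```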


If you are using Spark SQL and the driver goes OOM due to broadcast relations, then you can either increase the driver memory if possible, or reduce the “spark.sql.autoBroadcastJoinThreshold” value so that your join operations use the more memory-friendly sort-merge join.
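A hedged sketch of that configuration change; the threshold values are illustrative, not recommendations.

```python
# Hedged sketch: lower or disable the auto-broadcast threshold so large joins
# fall back to sort-merge join instead of materializing a relation at the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-threshold-sketch").getOrCreate()

# -1 disables automatic broadcast joins entirely...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# ...or keep auto-broadcast but cap it at roughly 10 MB:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
```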

Out of memory at the executor level

This is a very common issue with Spark applications which may be due to various reasons. Some of the most common reasons are high concurrency, inefficient queries, and incorrect configuration. Let’s look at each in turn.

High concurrency

Before understanding why high concurrency might be a cause of OOM, let’s try to understand how Spark executes a query or job and what are the components that contribute to memory consumption.

Spark jobs or queries are broken down into multiple stages, and each stage is further divided into tasks. The number of tasks depends on various factors like which stage is getting executed, which data source is getting read, etc. If it’s a map stage (Scan phase in SQL), typically the underlying data source partitions are honored.

For example, if a Hive ORC table has 2000 partitions, then 2000 tasks get created for the map stage when reading the table, assuming partition pruning did not come into play. If it’s a reduce stage (shuffle stage), then Spark uses either the “spark.default.parallelism” setting for RDDs or “spark.sql.shuffle.partitions” for Datasets to determine the number of tasks. How many tasks execute in parallel on each executor depends on the “spark.executor.cores” property. If this value is set too high without due consideration of the available memory, executors may fail with OOM. Now let’s see what happens under the hood while a task is executing, and some probable causes of OOM.

Let’s say we are executing a map task or the scanning phase of SQL from an HDFS file or a Parquet/ORC table. For HDFS files, each Spark task reads a 128 MB block of data. So if 10 parallel tasks are running, the memory requirement is at least 128 MB × 10 just for storing the partitioned data. And that ignores data compression, which might cause the data to blow up significantly once decompressed, depending on the compression algorithm.

Spark reads Parquet in a vectorized format. To put it simply, each task of Spark reads data from the Parquet file batch by batch. As Parquet is columnar, these batches are constructed for each of the columns.  It accumulates a certain amount of column data in memory before executing any operation on that column. This means Spark needs some data structures and bookkeeping to store that much data. Also, encoding techniques like dictionary encoding have some state saved in memory. All of them require memory.

 

Figure: Spark task and memory components while scanning a table

So with more concurrency, the overhead increases. Also, if there is a broadcast join involved, then the broadcast variables will also take some memory. The above diagram shows a simple case where each executor is executing two tasks in parallel.
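To make the concurrency knobs from this section concrete, here is a hedged sketch of how they might be set at session startup; the values are purely illustrative, would normally be derived from the workload, and are often passed via spark-submit instead.

```python
# Hedged sketch: concurrency-related settings discussed above, with illustrative values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("concurrency-sketch")
    .config("spark.executor.cores", "4")            # parallel tasks per executor
    .config("spark.executor.memory", "8g")          # sized with those tasks in mind
    .config("spark.sql.shuffle.partitions", "400")  # reduce-side tasks for DataFrames/Datasets
    .config("spark.default.parallelism", "400")     # reduce-side tasks for RDDs
    .getOrCreate()
)
```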

Inefficient queries

While Spark’s Catalyst engine tries to optimize a query as much as possible, it can’t help if the query itself is badly written. For example, selecting all the columns of a Parquet/ORC table: as seen in the previous section, each column needs some in-memory column batch state, so the more columns are selected, the greater the overhead.

Try to read as few columns as possible, and use filters wherever possible so that less data is fetched to the executors. Some data sources support partition pruning; if your query can be converted to filter on partition column(s), it will reduce data movement to a large extent.
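A hedged sketch of column and partition pruning; the paths, column names, and partition column are hypothetical.

```python
# Hedged sketch: read only needed columns and push filters (including a
# partition column) down to the scan. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

df = (
    spark.read.parquet("/data/orders")        # hypothetical partitioned table
    .select("order_id", "amount", "ds")       # column pruning instead of SELECT *
    .where(col("ds") == "2019-03-01")         # partition pruning on the 'ds' column
    .where(col("amount") > 100)               # ordinary filter pushed toward the scan
)
df.write.mode("overwrite").parquet("/data/large_orders")
```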

Incorrect configuration

Incorrect configuration of memory and caching can also cause failures and slowdowns in Spark applications. Let’s look at some examples.

Executor & Driver memory

Each application’s memory requirement is different. Depending on the requirement, each app has to be configured differently. You should ensure correct spark.executor.memory or spark.driver.memory values depending on the workload.  As obvious as it may seem, this is one of the hardest things to get right. We need the help of tools to monitor the actual memory usage of the application. Unravel does this pretty well.

Memory Overhead

Sometimes it’s not executor memory but rather YARN container memory overhead that causes OOM, or gets the node killed by YARN. “YARN kill” messages typically look like this:

[Screenshot: example YARN container-kill message]

YARN runs each Spark component, such as executors and drivers, inside containers. Overhead memory is the off-heap memory used for JVM overheads, interned strings, and other JVM metadata. In this case, you need to configure spark.yarn.executor.memoryOverhead to a proper value; typically about 10% of total executor memory should be allocated for overhead.
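A hedged sketch of that setting; the values are illustrative, and note that newer Spark versions use the name spark.executor.memoryOverhead in place of the YARN-specific name mentioned above.

```python
# Hedged sketch: reserve roughly 10% of executor memory as container overhead.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.executor.memory", "10g")
    .set("spark.executor.memoryOverhead", "1g")   # ~10% of executor memory (Spark 2.3+ name)
    # .set("spark.yarn.executor.memoryOverhead", "1024")  # older YARN-specific name, in MB
)
spark = SparkSession.builder.config(conf=conf).appName("overhead-sketch").getOrCreate()
```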

Caching Memory

If your application uses Spark caching to store some datasets, then it’s worthwhile to consider Spark’s memory manager settings. Spark’s memory manager is written in a very generic fashion to cater to all workloads. Hence, there are several knobs to set it correctly for a particular workload.

Spark has defined memory requirements as two types: execution and storage. Storage memory is used for caching purposes and execution memory is acquired for temporary structures like hash tables for aggregation, joins etc.

Both execution and storage memory come from a configurable fraction of (total heap memory – 300 MB). That setting is “spark.memory.fraction”, and the default is 60%. Of that, by default 50% is assigned to storage (configurable via “spark.memory.storageFraction”) and the rest is assigned to execution.

Each of these pools, execution and storage, may borrow from the other if that pool is free, and storage memory can be evicted up to a limit if it has borrowed from execution. Without going into those complexities, we can configure our program so that cached data fits in storage memory and does not cause a problem for execution.

If we don’t need all our cached data to sit in memory, we can set “spark.memory.storageFraction” to a lower value so that extra cached data gets evicted and execution does not face memory pressure.
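A hedged sketch of those two settings; the fractions are illustrative, not recommendations.

```python
# Hedged sketch: shrink the protected storage pool so extra cached data is
# evicted sooner and execution memory faces less pressure.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-fraction-sketch")
    .config("spark.memory.fraction", "0.6")         # unified pool = 60% of (heap - 300 MB)
    .config("spark.memory.storageFraction", "0.3")  # only 30% of that pool protected for cache
    .getOrCreate()
)
```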

Out of memory at Node Manager

Spark applications that shuffle data as part of group-by or join-like operations incur significant overhead. Normally the data shuffling is done by the executor process; if the executor is busy or under heavy GC load, it can’t cater to shuffle requests. This problem is alleviated to some extent by using an external shuffle service.

The external shuffle service runs on each worker node and handles shuffle requests from executors. Executors can read shuffle files from this service rather than from each other, which lets requesting executors read shuffle files even if the producing executors are killed or slow. Also, when dynamic allocation is enabled, it is mandatory to enable the external shuffle service.

When the Spark external shuffle service is configured with YARN, the NodeManager starts an auxiliary service which acts as the external shuffle service provider. By default, NodeManager memory is around 1 GB. However, applications that do heavy data shuffling might fail due to the NodeManager going out of memory, so it’s imperative to configure your NodeManager properly if your applications fall into this category.
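For reference, here is a hedged sketch of the Spark-side settings involved; the executor counts are illustrative, and the NodeManager auxiliary service itself is configured in yarn-site.xml, which is outside this snippet.

```python
# Hedged sketch: enable the external shuffle service alongside dynamic allocation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-service-sketch")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # illustrative floor
    .config("spark.dynamicAllocation.maxExecutors", "20")   # illustrative ceiling
    .getOrCreate()
)
```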

End of Part I – Thanks for the Memory

Spark’s in-memory processing is a key part of its power. Therefore, effective memory management is critical to getting the best performance, scalability, and stability from your Spark applications and data pipelines. However, Spark’s default settings are often insufficient: depending on the application and environment, certain key configuration parameters must be set correctly to meet your performance goals. Having a basic idea of these parameters and how they affect the overall application helps.

I have provided some insights into what to look for when considering Spark memory management. This is an area that the Unravel platform understands and optimizes very well, with little, if any, human intervention needed. See how Unravel can make Spark perform better and more reliably here.

Even better, schedule a demo to see Unravel in action. The performance speedups we are seeing for Spark apps are pretty significant.

See how Unravel simplifies Spark Memory Management.

Create a free account

In the next installment of this series, Why Your Spark Apps are Slow or Failing: Part II – Data Skew and Garbage Collection, I discuss how data organization, data skew, and garbage collection impact Spark performance.

Building a FinOps Ethos
https://www.unraveldata.com/resources/building-a-finops-ethos/ (December 9, 2024)


3 Key Steps to Build a FinOps Ethos in Your Data Engineering Team

In today’s data-driven enterprises, the intersection of fiscal responsibility and technical innovation has never been more critical. As data processing costs continue to scale with business growth, building a FinOps culture within your Data Engineering team isn’t just about cost control; it’s about creating a mindset that views cost optimization as an integral part of technical excellence.

Unravel’s ‘actionability for everyone’ approach has enabled executive customers to adopt three transformative steps to embed FinOps principles into their Data Engineering team’s DNA, ensuring that cost awareness becomes as natural as code quality or data accuracy. In this article, we walk you through how executives can move the cost rating of their workloads from Uncontrolled to Optimized with clear, guided, actionable pathways.

[Screenshot: Unravel Data dashboard]

Step 1. Democratize cost visibility

The foundation of any successful FinOps implementation is transparency. However, raw cost data alone isn’t enough; it needs to be contextualized and actionable.

Breaking Down the Cost Silos

  • Unravel provides real-time cost attribution dashboards to map cloud spending to specific business units, teams, projects, and data pipelines.

  • Custom views allow different stakeholders, from engineers to executives, to discover the top actions for controlling cost.
  • The ability to track key metrics like cost/savings per job/query, time-to-value, and wasted costs due to idleness, wait times, etc., transforms cost data from a passive reporting tool to an active management instrument.

Provide tools for cost decision making

Modern data teams need to understand how their architectural and implementation choices affect the bottom line. Unravel provides visualizations and insights to guide these teams to implement:

  • Pre-deployment cost impact assessments for new data pipelines (for example, what is the cost impact of migrating this workload from All-purpose compute to Job compute?).
  • What-if analysis tools for infrastructure changes (for example, will changing from the current instance types to recommended instance types affect performance if I save on cost?).
  • Historical trend analysis to identify cost patterns, budget overrun costs, cost wasted due to optimizations neglected by their teams, etc.

Step 2. Embed cost intelligence into the development and operations lifecycles

The next evolution in FinOps maturity comes from making cost optimization an integral part of the development process, not a post-deployment consideration. Executives should consider leveraging specialized AI agents across their technology stack to boost productivity and free up their teams’ time to focus on innovation. Unravel provides a suite of AI-agent-driven features that foster a cost ethos in the organization and maximize operational excellence with auto-fix capabilities.

Automated Optimization Lifecycle
Unravel helps you establish a systematic approach to cost optimization with automated AI-agentic workflows to help your teams operationalize recommendations while getting a huge productivity boost. Below are some ways to drive automation with Unravel agents:

  • Implement automated fix suggestions for most code and configuration inefficiencies
  •  Assign auto-fix actions to AI agents for effortless adoption or route them to a human reviewer for approval
  • Configure automated rollback capabilities for changes if unintended performance or cost degradation is detected

Push Down Cost Consciousness To Developers 

  • Automated code reviews that flag potential cost inefficiencies
  • Smart cost savings recommendations based on historical usage patterns
  • Allow developers to see the impact of their change on future runs

Define Measurable Success Metrics
Executives can track the effectiveness of FinOps awareness and culture using Unravel through:

  • Cost efficiency improvements over time (WoW, MoM, YTD)
  • Team engagement and rate of adoption with Accountability dashboards
  • Time-to-resolution for code and configuration changes

Step 3. Create a self-sustaining FinOps culture

The final and most crucial step is transforming FinOps from an initiative into a cultural cornerstone of your data engineering practice.

Operationalize with AI agents

FinOps AI Agent

Implement timely alerting systems that help drive value-oriented decisions for cost optimization and governance. Unravel provides email, Slack, and Teams integrations to ensure all necessary stakeholders get timely notifications and insights into opportunities and risks.

DataOps AI Agent

  • Pipeline optimization suggestions for better resource utilization and SLA risk mitigation
  • Job signature level cost and savings impact analysis to help with prioritization of recommendations
  • Intelligent workload migration recommendations

Data Engineering AI Agent

  • Storage tier optimization recommendations to avoid wasted costs due to cold tables
  • Partition strategy optimization for cost-effective querying
  • Avoidance of recurring failures and bottlenecks caused by inefficiencies left unaddressed for weeks

Continuous Evolution

Finally, it is extremely important to foster and track the momentum of FinOps growth by:

  • Regularly performing FinOps retrospectives with wider teams
  • Revisiting which business units and cost centers are contributing to wasted costs, costs neglected due to unadopted recommendations, and budget overruns despite timely alerting

The path forward

Building a FinOps ethos in your Data Engineering team is a journey that requires commitment, tools, and cultural change. By following the above three key steps – democratizing cost visibility, embedding cost intelligence, and creating a self-sustaining culture – you can transform how your team thinks about and manages cloud costs.

The most successful implementations don’t treat FinOps as a separate discipline but rather as an integral part of technical excellence. When cost optimization becomes as natural as writing tests or documenting code, you have achieved true FinOps maturity. Unravel provides a comprehensive set of features and tools to aid your teams in accelerating FinOps best practices throughout the organization.

Remember, the goal isn’t just to reduce costs – it is to maximize the value derived from every dollar spent on your infrastructure. This mindset shift, combined with the right tools and processes, will position your data engineering team for sustainable growth and success in an increasingly cost-conscious technology landscape.

To learn more on how Unravel can help, contact us or request a demo.

Data Actionability™ Data Eng Webinar
https://www.unraveldata.com/resources/data-actionability-data-eng-webinar/ (November 20, 2024)

Data Actionability™: Speed Up Analytics with Unravel’s New Data Engineering AI Agent

Data Actionability™ Webinar
https://www.unraveldata.com/resources/data-actionability-webinar/ (November 20, 2024)

Data Actionability™: Empower Your Team with Unravel’s New AI Agents

Configuration Management in Modern Data Platforms
https://www.unraveldata.com/resources/configuration-management-in-modern-data-platforms/ (November 6, 2024)


Navigating the Maze of Configuration Management in Modern Data Platforms: Problems, Challenges and Solutions

In the world of big data, configuration management is often the unsung hero of platform performance and cost-efficiency. Whether you’re working with Snowflake, Databricks, BigQuery, or any other modern data platform, effective configuration management can mean the difference between a sluggish, expensive system and a finely-tuned, cost-effective one.

This blog post explores the complexities of configuration management in data platforms, the challenges in optimizing these settings, and how automated solutions can simplify this critical task.

The Configuration Conundrum

1. Cluster and Warehouse Sizing

Problem: Improper sizing of compute resources (like Databricks clusters or Snowflake warehouses) can lead to either performance bottlenecks or unnecessary costs.

Diagnosis Challenge: Determining the right size for your compute resources is not straightforward. It depends on workload patterns, data volumes, and query complexity, all of which can vary over time. Identifying whether performance issues or high costs are due to improper sizing requires analyzing usage patterns across multiple dimensions.

Resolution Difficulty: Adjusting resource sizes often involves a trial-and-error process. Too small, and you risk poor performance; too large, and you’re wasting money. The impact of changes may not be immediately apparent and can affect different workloads in unexpected ways.
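To make the sizing levers concrete, here is a hedged sketch of an autoscaling cluster definition using field names from the Databricks Clusters API; the runtime version, instance type, and worker counts are illustrative assumptions, not recommendations.

```python
# Hedged sketch: an autoscaling Databricks cluster spec. Letting the platform
# scale between a floor and a ceiling, plus auto-termination, addresses both
# the "too small" and "too large" failure modes described above.
cluster_spec = {
    "cluster_name": "etl-nightly",               # hypothetical name
    "spark_version": "13.3.x-scala2.12",         # illustrative runtime
    "node_type_id": "i3.xlarge",                 # illustrative instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,               # release idle compute
}
# This dict would typically be sent to the Clusters API (clusters/create) or
# managed through infrastructure-as-code; both are outside this sketch.
```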

2. Caching and Performance Optimization Settings

Problem: Suboptimal caching strategies and performance settings can lead to repeated computations and slow query performance.

Diagnosis Challenge: The effectiveness of caching and other performance optimizations can be highly dependent on specific workload characteristics. Identifying whether poor performance is due to cache misses, inappropriate caching strategies, or other factors requires deep analysis of query patterns and platform-specific metrics.

Resolution Difficulty: Tuning caching and performance settings often requires a delicate balance. Aggressive caching might improve performance for some queries while causing staleness issues for others. Each adjustment needs to be carefully evaluated across various workload types.

3. Security and Access Control Configurations

Problem: Overly restrictive security settings can hinder legitimate work, while overly permissive ones can create security vulnerabilities.

Diagnosis Challenge: Identifying the root cause of access issues can be complex, especially in platforms with multi-layered security models. Is a performance problem due to a query issue, or is it because of an overly restrictive security policy?

Resolution Difficulty: Adjusting security configurations requires careful consideration of both security requirements and operational needs. Changes need to be thoroughly tested to ensure they don’t inadvertently create security holes or disrupt critical workflows.

4. Cost Control and Resource Governance

Problem: Without proper cost control measures, data platform expenses can quickly spiral out of control.

Diagnosis Challenge: Understanding the cost implications of various platform features and usage patterns is complex. Is a spike in costs due to inefficient queries, improper resource allocation, or simply increased usage?

Resolution Difficulty: Implementing effective cost control measures often involves setting up complex policies and monitoring systems. It requires balancing cost optimization with the need for performance and flexibility, which can be a challenging trade-off to manage.

The Manual Configuration Management Struggle

Traditionally, managing these configurations involves:

1. Continuously monitoring platform usage, performance metrics, and costs
2. Manually adjusting configurations based on observed patterns
3. Conducting extensive testing to ensure changes don’t negatively impact performance or security
4. Constantly staying updated with platform-specific best practices and new features
5. Repeating this process as workloads and requirements evolve

This approach is not only time-consuming but also reactive. By the time an issue is noticed and diagnosed, it may have already impacted performance or inflated costs. Moreover, the complexity of modern data platforms means that the impact of configuration changes can be difficult to predict, leading to a constant cycle of tweaking and re-adjusting.

Embracing Automation in Configuration Management

Given these challenges, many organizations are turning to automated solutions to manage and optimize their data platform configurations. Platforms like Unravel can help by:

Continuous Monitoring: Automatically tracking resource utilization, performance metrics, and costs across all aspects of the data platform.

Intelligent Analysis: Using machine learning to identify patterns and anomalies in platform usage and performance that might indicate configuration issues.

Predictive Optimization: Suggesting configuration changes based on observed usage patterns and predicting their impact before implementation.

Automated Adjustment: In some cases, automatically adjusting configurations within predefined parameters to optimize performance and cost.

Policy Enforcement: Helping to implement and enforce governance policies consistently across the platform.

Cross-Platform Optimization: For organizations using multiple data platforms, providing a unified view and consistent optimization approach across different environments.

By leveraging automated solutions, data teams can shift from a reactive to a proactive configuration management approach. Instead of constantly fighting fires, teams can focus on strategic initiatives while ensuring their data platforms remain optimized, secure, and cost-effective.

Conclusion

Configuration management in modern data platforms is a complex, ongoing challenge that requires continuous attention and expertise. While the problems are multifaceted and the manual management process can be overwhelming, automated solutions offer a path to simplify and streamline these efforts.

By embracing automation in configuration management, organizations can more effectively optimize their data platform performance, enhance security, control costs, and free up their data teams to focus on extracting value from data rather than endlessly tweaking platform settings.

Remember, whether using manual methods or automated tools, effective configuration management is an ongoing process. As your data volumes grow, workloads evolve, and platform features update, staying on top of your configurations will ensure that your data platform continues to meet your business needs efficiently and cost-effectively.

To learn more about how Unravel can help manage and optimize your data platform configurations with Databricks, Snowflake, and BigQuery: request a health check report, view a self-guided product tour, or request a demo.

How Shopify Fixed a $1 Million Single Query
https://www.unraveldata.com/resources/how-shopify-fixed-a-1-million-single-query/ (January 26, 2024)


A while back, Shopify posted a story about how they avoided a $1 million query in BigQuery. They detail the data engineering that reduced the cost of the query to $1,300 and share tips for lowering costs in BigQuery.

Kunal Agarwal, CEO and Co-founder of Unravel Data, walks through Shopify’s approach to clustering tables as a way to bring the price of a highly critical query down to a more reasonable monthly cost. But clustering is not the only data engineering technique available to make workloads run far more efficiently—for cost, performance, and reliability—and Kunal brings up six or seven others.
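For readers curious what table clustering looks like in practice, here is a hedged sketch using BigQuery DDL submitted through the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical, and this is not Shopify’s actual table or query.

```python
# Hedged sketch: create a partitioned, clustered copy of a table so queries that
# filter on the cluster keys scan far less data. All identifiers are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

ddl = """
CREATE OR REPLACE TABLE analytics.events_clustered
PARTITION BY DATE(event_timestamp)
CLUSTER BY shop_id, event_type
AS SELECT * FROM analytics.events
"""

client.query(ddl).result()  # blocks until the DDL job finishes
```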

Few organizations have the data engineering muscle of Shopify. All the techniques to keep costs low and performance high can entail a painstaking, toilsome slog through thousands of telemetry data points, logs, error messages, etc.

Unravel’s customers understand that they cannot possibly have people go through hundreds of thousands or more lines of code to find the areas to optimize, but rather that this is all done better and faster with automation and AI, whether your data runs on BigQuery, Databricks, or Snowflake.

Kunal shows what that looks like in a 1-minute drive-by.

Get Unravel Standard data observability + FinOps for free. Forever.
Get started here

Overcoming Friction & Harnessing the Power of Unravel: Try It for Free
https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/ (October 11, 2023)


Overview

In today’s digital landscape, data-driven decisions form the crux of successful business strategies. However, the path to harnessing data’s full potential is strewn with challenges. Let’s delve into the hurdles organizations face and how Unravel is the key to unlocking seamless data operations.

The Roadblocks in the Fast Lane of Data Operations

In today’s data-driven landscape, organizations grapple with erratic spending, cloud constraints, AI complexities, and prolonged MTTR, urgently seeking solutions to navigate these challenges efficiently. The four most common roadblocks are:

  • Data Spend Forecasting: Imagine a roller coaster with unpredictable highs and lows. That’s how most organizations view their data spend forecasting. Such unpredictability wreaks havoc on financial planning, making operational consistency a challenge.
  • Constraints in Adding Data Workloads: Imagine tying an anchor to a speedboat. That’s how the constraints feel when trying to adopt cloud data solutions, holding back progress and limiting agility.
  • Surge in AI Model Complexity: AI’s evolutionary pace is exponential. As it grows, so do the intricacies surrounding data volume and pipelines, which strain budgetary limitations.
  • The MTTR Bottleneck: The multifaceted nature of modern tech stacks means longer Mean Time to Repair (MTTR). This slows down processes, consumes valuable resources, and stalls innovation.

By acting as a comprehensive data observability and FinOps solution, Unravel Data empowers businesses to move past the challenges and frictions that typically hinder data operations, ensuring smoother, more efficient data-driven processes. Here’s how Unravel Data aids in navigating the roadblocks in the high-speed lane of data operations:

  • Predictive Data Spend Forecasting: With its advanced analytics, Unravel Data can provide insights into data consumption patterns, helping businesses forecast their data spending more accurately. This eliminates the roller coaster of unpredictable costs.
  • Simplifying Data Workloads: Unravel Data optimizes and automates workload management. Instead of being anchored down by the weight of complex data tasks, businesses can efficiently run and scale their data processes in the cloud.
  • Managing AI Model Complexity: Unravel offers visibility and insights into AI data pipelines. Analyzing and optimizing these pipelines ensure that growing intricacies do not overwhelm resources or budgets.
  • Reducing MTTR: By providing a clear view of the entire tech stack and pinpointing bottlenecks or issues, Unravel Data significantly reduces Mean Time to Repair. With its actionable insights, teams can address problems faster, reducing downtime and ensuring smooth data operations.
  • Streamlining Data Pipelines: Unravel Data offers tools to diagnose and improve data pipeline performance. This ensures that even as data grows in volume and complexity, pipelines remain efficient and agile.
  • Efficiency and ROI: With its clear insights into resource consumption and requirements, Unravel Data helps businesses run 50% more workloads in their existing Databricks environments, ensuring they only pay for what they need, reducing wastage and excess expenditure.

The Skyrocketing Growth of Cloud Data Management

As the digital realm expands, cloud data management usage is soaring, with data services accounting for a significant chunk. According to the IDC, the public cloud IaaS and PaaS market is projected to reach $400 billion by 2025, growing at a 28.8% CAGR from 2021 to 2025. Some highlights are:

  • Data management and application development account for 39% and 20% of the market, respectively, and are the main workloads backed by PaaS solutions, capturing a major share of its revenue.
  • In IaaS revenue, IT infrastructure leads with 25%, trailed by business applications (21%) and data management (20%).
  • Unstructured data analytics and media streaming are predicted to be the top-growing segments with CAGRs of 41.9% and 41.2%, respectively.

Unravel provides a comprehensive solution to address the growth associated with cloud data management. Here’s how:

  • Visibility and Transparency: Unravel offers in-depth insights into your cloud operations, allowing you to understand where and how costs are accruing, ensuring no hidden fees or unnoticed inefficiencies.
  • Optimization Tools: Through its suite of analytics and AI-driven tools, Unravel pinpoints inefficiencies, recommends optimizations, and automates the scaling of resources to ensure you’re only using (and paying for) what you need.
  • Forecasting: With predictive analytics, Unravel provides forecasts of data usage and associated costs, enabling proactive budgeting and financial planning.
  • Workload Management: Unravel ensures that data workloads run efficiently and without wastage, reducing both computational costs and storage overhead.
  • Performance Tuning: By optimizing query performance and data storage strategies, Unravel ensures faster results using fewer resources, translating to 50% more workloads.
  • Monitoring and Alerts: Real-time monitoring paired with intelligent alerts ensures that any resource-intensive operations or anomalies are flagged immediately, allowing for quick intervention and rectification.

By employing these strategies and tools, Unravel acts as a financial safeguard for businesses, ensuring that the ever-growing cloud data bill remains predictable, manageable, and optimized for efficiency.

The Tightrope Walk of Efficiency Tuning and Talent

Modern enterprises hinge on data and AI, but shrinking budgets and talent gaps threaten them. Gartner pinpoints overprovisioning and skills shortages as major roadblocks, while Google and IDC underscore the high demand for data analytics skills and the untapped potential of unstructured data. Here are some of the problems modern organizations face:

  • Production environments are statically overprovisioned and therefore underutilized. On-premises, 30% utilization is common, but it’s all capital expenditures (capex), and as long as it’s within budget, no one has traditionally cared about the waste. However, in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste. – Gartner
  • The cloud skills gap has reached a crisis level in many organizations – Gartner
  • Revenue creation through digital transformation requires talent engagement that is currently scarce and difficult to acquire and maintain. – Gartner
  • Lack of skills remains the biggest barrier to infrastructure modernization initiatives, with many organizations finding they cannot hire outside talent to fill these skills gaps. IT organizations will not succeed unless they prioritize organic skills growth. – Gartner
  • Data analytics skills are in demand across industries as businesses of all types around the world recognize that strong analytics improve business performance. – Google via Coursera

Unravel Data addresses the delicate balancing act of budget and talent in several strategic ways:

  • Operational Efficiency: Purpose-built AI provides actionable insights into data operations across Databricks, Spark, EMR, BigQuery, Snowflake, etc. Unravel Data reduces the need for trial-and-error and time-consuming manual interventions. At the core of Unravel’s data observability platform is our AI-powered Insights Engine. This sophisticated Artificial Intelligence engine incorporates AI techniques, algorithms, and tools to process and analyze vast amounts of data, learn from patterns, and make predictions or decisions based on that learning. This not only improves operational efficiency but also ensures that talented personnel spend their time innovating rather than on routine tasks.
  • Skills Gap Bridging: The platform’s intuitive interface and AI-driven insights mean that even those without deep expertise in specific data technologies can navigate, understand, and optimize complex data ecosystems. This eases the pressure to hire or train ultra-specialized talent.
  • Predictive Analysis: With Unravel’s ability to predict potential bottlenecks or inefficiencies, teams can proactively address issues, leading to more efficient budget allocation and resource utilization.
  • Cost Insights: Unravel provides detailed insights into the efficiency of various data operations, allowing organizations to make informed decisions on where to invest and where to cut back.
  • Automated Optimization: By automating many of the tasks traditionally performed by data engineers, like performance tuning or troubleshooting, Unravel ensures teams can do more with less, optimizing both budget and talent.
  • Talent Focus Shift: With mundane tasks automated and insights available at a glance, skilled personnel can focus on higher-value activities, like data innovation, analytics, and strategic projects.

By enhancing efficiency, providing clarity, and streamlining operations, Unravel Data ensures that organizations can get more from their existing budgets while maximizing the potential of their talent, turning the tightrope walk into a more stable journey.

The Intricacies of Data-Centric Organizations

Data-centric organizations grapple with the complexities of managing vast and fast-moving data in the digital age. Ensuring data accuracy, security, and compliance, while integrating varied sources, is challenging. They must balance data accessibility with protecting sensitive information, all while adapting to evolving technologies, addressing talent gaps, and extracting actionable insights from their data reservoirs. Here is some relevant research on the topic:

  • “Data is foundational to AI” yet “unstructured data remains largely untapped.” – IDC
  • Even as organizations rush to adopt data-centric operations, challenges persist. For instance, manufacturing data projects often hit roadblocks due to outdated legacy technology, as observed by the World Economic Forum.
  • Generative AI is supported by large language models (LLMs), which require powerful and highly scalable computing capabilities to process data in real-time. – Gartner

Unravel Data provides a beacon for data-centric organizations amid complex challenges. Offering a holistic view of data operations, it simplifies management using AI-driven tools. It ensures data security, accessibility, and optimized performance. With its intuitive interface, Unravel bridges talent gaps and navigates the data maze, turning complexities into actionable insights.

Embarking on the Unravel Journey: Your Step-By-Step Guide

  • Beginning your data journey with Unravel is as easy as 1-2-3. We guide you through the sign-up process, ensuring a smooth and hassle-free setup.
  • Unravel for Databricks page

Level Up with Unravel Premium

Ready for an enhanced data experience? Unravel’s premium account offers a plethora of advanced features that the free version can’t match. Investing in this upgrade isn’t just about more tools; it’s about supercharging your data operations and ROI.

Wrap-Up

Although rising demands on the modern data landscape are challenging, they are not insurmountable. With tools like Unravel, organizations can navigate these complexities, ensuring that data remains a catalyst for growth, not a hurdle. Dive into the Unravel experience and redefine your data journey today.

Unravel is a business’s performance sentinel in the cloud realm, proactively ensuring that burgeoning cloud data expenses are not only predictable and manageable but also primed for significant cost savings. Unravel Data transforms the precarious balance of budget and talent into a streamlined, efficient journey for organizations. Unravel Data illuminates the path for data-centric organizations, streamlining operations with AI tools, ensuring data security, and optimizing performance. Its intuitive interface simplifies complex data landscapes, bridging talent gaps and converting challenges into actionable insights.

The post Overcoming Friction & Harnessing the Power of Unravel: Try It for Free appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/feed/ 0
Blind Spots in Your System: The Grave Risks of Overlooking Observability https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/ https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/#respond Mon, 28 Aug 2023 20:14:12 +0000 https://www.unraveldata.com/?p=13692 Columbia shuttle

I was an Enterprise Data Architect at Boeing NASA Systems the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on how I look at data throughout my career. Recently I have […]

The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

]]>
Columbia shuttle

I was an Enterprise Data Architect at Boeing NASA Systems the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on how I look at data throughout my career. Recently I have been struck by how, had it been available back then, purpose-built AI observability might have altered the course of events.

The Day of the Disaster

With complete reverence and respect for the Columbia disaster, I can remember that day with incredible clarity. I was standing in front of my TV watching the Shuttle Columbia disintegrate on reentry. As the Enterprise Data Architect at Boeing NASA Systems, I received a call from my boss: “We have to pull and preserve all the Orbitor data. I will meet you at the office.” Internally, Boeing referred to the Shuttles as Orbitors. It was only months earlier that I had the opportunity to share punch and cookies with some of the crew. To say I was flooded with emotions would be a tremendous understatement.

When I arrived at Boeing NASA Systems, near Johnson Space Center, the mood in the hallowed aeronautical halls was somber. I sat intently, with eyes scanning lines of data from decades of space missions. The world had just witnessed a tragedy — the Columbia Shuttle disintegrated on reentry, taking the lives of seven astronauts. The world looked on in shock and sought answers. As a major contractor for NASA, Boeing was at the forefront of this investigation.

As the Enterprise Data Architect, I was one of an army helping dissect the colossal troves of data associated with the shuttle, looking for anomalies, deviations — anything that could give a clue as to what had gone wrong. Days turned into nights, and nights turned into weeks and months as we tirelessly pieced together Columbia’s final moments. But as we delved deeper, a haunting reality began to emerge.

Every tiny detail of the shuttle was monitored, from the heat patterns of its engines to the radio signals it emitted. But there was a blind spot, an oversight that no one had foreseen. In the myriad of data sets, there was nothing that indicated the effects of a shuttle’s insulation tiles colliding with a piece of Styrofoam, especially at speeds exceeding 500 miles per hour.

The actual incident was seemingly insignificant — a piece of foam insulation had broken off and struck the shuttle’s left wing. But in the vast expanse of space and the brutal conditions of reentry, even minor damage could prove catastrophic.

Video footage confirmed: the foam had struck the shuttle. But without concrete data on what such an impact would do, the team was left to speculate and reconstruct potential scenarios. The lack of this specific data had turned into a gaping void in the investigation.

As a seasoned Enterprise Data Architect, I always believed in the power of information. I absolutely believed that in the numbers, in the bytes and bits, we find the stories that the universe whispers to us. But this time, the universe had whispered a story that we didn’t have the data to understand fully.

Key Findings

After the accident, NASA formed the Columbia Accident Investigation Board (CAIB) to investigate the disaster. The board consisted of experts from various fields outside of NASA to ensure impartiality.

1. Physical Cause: The CAIB identified the direct cause of the accident as a piece of foam insulation from the shuttle’s external fuel tank that broke off during launch. This foam struck the leading edge of the shuttle’s left wing. Although foam shedding was a known issue, it had been wrongly perceived as non-threatening because of prior flights where the foam was lost but didn’t lead to catastrophe.

2. Organizational Causes: Beyond the immediate physical cause, the CAIB highlighted deeper institutional issues within NASA. They found that there were organizational barriers preventing effective communication of safety concerns. Moreover, safety concerns had been normalized over time due to prior incidents that did not result in visible failure. Essentially, since nothing bad had happened in prior incidents where the foam was shed, the practice had been erroneously deemed “safe.”

3. Decision Making: The CAIB pointed out issues in decision-making processes that relied heavily on past success as a predictor of future success, rather than rigorous testing and validation.

4. Response to Concerns: There were engineers who were concerned about the foam strike shortly after Columbia’s launch, but their requests for satellite imagery to assess potential damage were denied. The reasons were multifaceted, ranging from beliefs that nothing could be done even if damage was found, to a misunderstanding of the severity of the situation.

The CAIB made a number of recommendations to NASA to improve safety for future shuttle flights. These included:

1. Physical Changes: Improving the way the shuttle’s external tank was manufactured to prevent foam shedding and enhancing on-orbit inspection and repair techniques to address potential damage.

2. Organizational Changes: Addressing the cultural issues and communication barriers within NASA that led to the accident, and ensuring that safety concerns were more rigorously addressed.

3. Continuous Evaluation: Establishing an independent Technical Engineering Authority responsible for technical requirements and all waivers to them, and building an independent safety program that oversees all areas of shuttle safety.

Could Purpose-built AI Observability Have Helped?

In the aftermath, NASA grounded the shuttle fleet for more than two years and implemented the CAIB’s recommendations before resuming shuttle flights. The Columbia disaster, along with the Challenger explosion in 1986, is a stark reminder of the risks of space travel and the importance of a diligent and transparent safety culture. The lessons from Columbia shaped many of the safety practices NASA follows in its current human spaceflight programs.

The Columbia disaster led to profound changes in how space missions were approached, with a renewed emphasis on data collection and eliminating informational blind spots. But for me, it became a deeply personal mission. I realized that sometimes, the absence of data could speak louder than the most glaring of numbers. It was a lesson I would carry throughout my career, ensuring that no stone was left unturned, and no data point overlooked.

The Columbia disaster, at its core, was a result of both a physical failure (foam insulation striking the shuttle’s wing) and organizational oversights (inadequate recognition and response to potential risks). Purpose-built AI Observability, which involves leveraging artificial intelligence to gain insights into complex systems and predict failures, might have helped in several key ways:

1. Real-time Anomaly Detection: Modern AI systems can analyze vast amounts of data in real time to identify anomalies. If an AI-driven observability platform had been monitoring the shuttle’s various sensors and systems, it might have detected unexpected changes or abnormalities in the shuttle’s behavior after the foam strike, potentially even subtle ones that humans might overlook.

2. Historical Analysis: An AI system with access to all previous shuttle launch and flight data might have detected patterns or risks associated with foam-shedding incidents, even if they hadn’t previously resulted in a catastrophe. The system could then raise these as potential long-term risks.

3. Predictive Maintenance: AI-driven tools can predict when components of a system are likely to fail based on current and historical data. If applied to the shuttle program, such a system might have provided early warnings about potential vulnerabilities in the shuttle’s design or wear-and-tear.

4. Decision Support: AI systems could have aided human decision-makers in evaluating the potential risks of continuing the mission after the foam strike, providing simulations, probabilities of failure, or other key metrics to help guide decisions.

5. Enhanced Imaging and Diagnosis: If equipped with sophisticated imaging capabilities, AI could analyze images of the shuttle (from external cameras or satellites) to detect potential damage, even if it’s minor, and then assess the risks associated with such damage.

6. Overcoming Organizational Blind Spots: One of the major challenges in the Columbia disaster was the normalization of deviance, where foam shedding became an “accepted” risk because it hadn’t previously caused a disaster. An AI system, being objective, doesn’t suffer from these biases. It would consistently evaluate risks based on data, not on historical outcomes.

7. Alerts and Escalations: An AI system can be programmed to escalate potential risks to higher levels of authority, ensuring that crucial decisions don’t get caught in bureaucratic processes.

While AI Observability could have provided invaluable insights and might have changed the course of events leading to the Columbia disaster, it’s essential to note that the integration of such AI systems also requires organizational openness to technological solutions and a proactive attitude toward safety. The technology is only as effective as the organization’s willingness to act on its findings.

The tragedy served as a grim reminder for organizations worldwide: It’s not just about collecting data; it’s about understanding the significance of what isn’t there. Because in those blind spots, destiny can take a drastic turn.

In Memory and In Action

The Columbia crew and their families deserve our utmost respect and admiration for their unwavering commitment to space exploration and the betterment of humanity. 

  • Rick D. Husband: As the Commander of the mission, Rick led with dedication, confidence, and unparalleled skill. His devotion to space exploration was evident in every decision he made. We remember him not just for his expertise, but for his warmth and his ability to inspire those around him. His family’s strength and grace, in the face of the deepest pain, serve as a testament to the love and support they provided him throughout his journey.
  • William C. McCool: As the pilot of Columbia, William’s adeptness and unwavering focus were essential to the mission. His enthusiasm and dedication were contagious, elevating the spirits of everyone around him. His family’s resilience and pride in his achievements are a reflection of the man he was — passionate, driven, and deeply caring.
  • Michael P. Anderson: As the payload commander, Michael’s role was vital, overseeing the myriad of experiments and research aboard the shuttle. His intellect was matched by his kindness, making him a cherished member of the team. His family’s courage and enduring love encapsulate the essence of Michael’s spirit — bright, optimistic, and ever-curious.
  • Ilan Ramon: As the first Israeli astronaut, Ilan represented hope, unity, and the bridging of frontiers. His enthusiasm for life was infectious, and he inspired millions with his historic journey. His family’s grace in the face of the unthinkable tragedy is a testament to their shared dream and the values that Ilan stood for.
  • Kalpana Chawla: Known affectionately as ‘KC’, Kalpana’s journey from a small town in India to becoming a space shuttle mission specialist stands as an inspiration to countless dreamers worldwide. Her determination, intellect, and humility made her a beacon of hope for many. Her family’s dignity and strength, holding onto her legacy, reminds us all of the power of dreams and the sacrifices made to realize them.
  • David M. Brown: As a mission specialist, David brought with him a zest for life, a passion for learning, and an innate curiosity that epitomized the spirit of exploration. He ventured where few dared and achieved what many only dreamt of. His family’s enduring love and their commitment to preserving his memory exemplify the close bond they shared and the mutual respect they held for each other.
  • Laurel B. Clark: As a mission specialist, Laurel’s dedication to scientific exploration and discovery was evident in every task she undertook. Her warmth, dedication, and infectious enthusiasm made her a beloved figure within her team and beyond. Her family’s enduring spirit, cherishing her memories and celebrating her achievements, is a tribute to the love and support that were foundational to her success.

To each of these remarkable individuals and their families, we extend our deepest respect and gratitude. Their sacrifices and contributions will forever remain etched in the annals of space exploration, reminding us of the human spirit’s resilience and indomitable will.

For those of us close to the Columbia disaster, it was more than a failure; it was a personal loss. Yet, in memory of those brave souls, we are compelled to look ahead. In the stories whispered to us by data, and in the painful lessons from their absence, we seek to ensure that such tragedies remain averted in the future.

While no technology can turn back time, the promise of AI Observability beckons a future where every anomaly is caught, every blind spot illuminated, and every astronaut returns home safely.

The above narrative seeks to respect the gravity of the Columbia disaster while emphasizing the potential of AI Observability. It underlines the importance of data, both in understanding tragedies and in preventing future ones.

The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/feed/ 0
Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality https://www.unraveldata.com/resources/unleashing-the-power-of-data-how-data-engineers-can-harness-aiml-to-achieve-essential-data-quality/ https://www.unraveldata.com/resources/unleashing-the-power-of-data-how-data-engineers-can-harness-aiml-to-achieve-essential-data-quality/#respond Mon, 24 Jul 2023 14:09:01 +0000 https://www.unraveldata.com/?p=12932

Introduction In the era of Big Data, the importance of data quality cannot be overstated. The vast volumes of information generated every second hold immense potential for organizations across industries. However, this potential can only be […]

The post Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality appeared first on Unravel.

]]>

Introduction

In the era of Big Data, the importance of data quality cannot be overstated. The vast volumes of information generated every second hold immense potential for organizations across industries. However, this potential can only be realized when the underlying data is accurate, reliable, and consistent. Data quality serves as the bedrock upon which crucial business decisions are made, insights are derived, and strategies are formulated. It empowers organizations to gain a comprehensive understanding of their operations, customers, and market trends. High-quality data ensures that analytics, machine learning, and artificial intelligence algorithms produce meaningful and actionable outcomes. From detecting patterns and predicting trends to identifying opportunities and mitigating risks, data quality is the driving force behind data-driven success. It instills confidence in decision-makers, fosters innovation, and unlocks the full potential of Big Data, enabling organizations to thrive in today’s data-driven world.

Overview

In this seven-part blog we will explore using ML/AI for data quality. Machine learning and artificial intelligence can be instrumental in improving the quality of data. Machine learning models like logistic regression, decision trees, random forests, gradient boosting machines, and neural networks predict categories of data based on past examples, correcting misclassifications. Linear regression, polynomial regression, support vector regression, and neural networks predict numeric values, filling in missing entries. Clustering techniques like K-means, hierarchical clustering, and DBSCAN identify duplicates or near-duplicates. Models such as Isolation Forest, Local Outlier Factor, and Auto-encoders detect outliers and anomalies. To handle missing data, k-Nearest Neighbors and Expectation-Maximization predict and fill in the gaps. NLP models like BERT, GPT, and RoBERTa process and analyze text data, ensuring quality through tasks like entity recognition and sentiment analysis. CNNs fix errors in image data, while RNNs and Transformer models handle sequence data. The key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors present in the data. We will focus on using the following models to apply critical data quality to our lakehouse.

Machine learning and artificial intelligence can be instrumental in improving the quality of data. Here are a few models and techniques that can be used for various use cases:

  • Classification Algorithms: Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled.
  • Regression Algorithms: Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set.
  • Clustering Algorithms: Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data.
  • Anomaly Detection Algorithms: Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.
  • Data Imputation Techniques: Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data.
  • Natural Language Processing (NLP): NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, and text classification, which can be helpful in ensuring the quality of text data.
  • Deep Learning Techniques: Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

Remember that the key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors present in the data.

In this first blog of seven, we will focus on Classification Algorithms. The code examples provided below can be found in this GitHub location. Below is a simple example of using a classification algorithm in a Databricks Notebook to address a data quality issue.

Data Engineers Leveraging AI/ML for Data Quality

Machine learning (ML) and artificial intelligence (AI) play a crucial role in the field of data engineering. Data engineers leverage ML and AI techniques to process, analyze, and extract valuable insights from large and complex datasets. Overall, ML and AI provide data engineers with powerful tools and techniques to extract insights, improve data quality, automate processes, and enable data-driven decision-making. They enhance the efficiency and effectiveness of data engineering workflows, enabling organizations to unlock the full potential of their data assets.

AI/ML can help with numerous data quality use cases. Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled. Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set. Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.

Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, text classification, which can be helpful in ensuring the quality of text data. Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

Suppose we have a dataset with some missing categorical values. We can use logistic regression to fill in the missing values based on the other features. For the purpose of this example, let’s assume we have a dataset with ‘age’, ‘income’, and ‘job_type’ features. Suppose ‘job_type’ is a categorical variable with some missing entries.

Classification Algorithms Using Logistic Regression to Fix the Data

Logistic regression is primarily used for binary classification problems, where the goal is to predict a binary outcome variable based on one or more input variables. However, it is not typically used directly for assessing data quality. Data quality refers to the accuracy, completeness, consistency, and reliability of data. 

That being said, logistic regression can be used indirectly as a tool for identifying potential data quality issues. A typical workflow looks like this. First, define the specific data quality issue to be addressed; for example, you may be interested in identifying data records with missing values or outliers. Next, perform feature engineering to identify the relevant features (variables) that may indicate the presence of the data quality issue. Then prepare the dataset by cleaning, transforming, and normalizing the data as necessary; this step involves handling missing values, outliers, and any other data preprocessing tasks. Finally, train the logistic regression model to flag or fill in the affected records.

It’s important to note that logistic regression alone generally cannot fix data quality problems, but it can help identify potential issues by predicting their presence based on the available features. Addressing the identified data quality issues usually requires domain knowledge, data cleansing techniques, and appropriate data management processes. In the simplified example below we see a problem and then use logistic regression to predict what the missing values should be.

Step 1

To create a table with the columns “age”, “income”, and “job_type” in a SQL database, you can use the following SQL statement:

Step 1: Create Table
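
The original post shows this step as a screenshot. As a stand-in, here is a minimal sketch of what the statement might look like when run from a Databricks notebook via spark.sql(); the table name employee_data and the column types are illustrative assumptions, not the exact code from the original notebook:

    # Databricks notebooks provide a ready-made SparkSession named `spark`.
    # Hypothetical table name; adjust the catalog/schema to your environment.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS employee_data (
            age INT,
            income DOUBLE,
            job_type STRING
        )
    """)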

Step 2

Load data to the table. Notice that three records are missing job_type; this is the column we will use ML to predict. We load a very small set of data for this example, but the same technique can be used against billions or trillions of rows. More data will almost always yield better results.

Step 2: Load Data to Table
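
Again, the post’s code is a screenshot, so the snippet below is only a sketch with made-up values; the one property that matters is that three rows have a NULL job_type, matching the description above:

    # Small illustrative dataset; three rows deliberately have a NULL job_type.
    rows = [
        (25, 32000.0, "clerical"),
        (38, 64000.0, "engineer"),
        (45, 81000.0, "manager"),
        (29, 40000.0, "clerical"),
        (52, 95000.0, "manager"),
        (33, 58000.0, None),   # job_type unknown
        (41, 72000.0, None),   # job_type unknown
        (27, 36000.0, None),   # job_type unknown
    ]
    df_seed = spark.createDataFrame(rows, schema="age INT, income DOUBLE, job_type STRING")
    df_seed.write.mode("append").saveAsTable("employee_data")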

Step 3

Load the data into a data frame. If you need to create a unique index for your data frame, please refer to this article.

Step 3: Load Data to Data Frame
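
A sketch of the load, assuming the employee_data table created above:

    # Read the table back into a Spark DataFrame.
    df = spark.table("employee_data")
    df.show()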

Step 4

In the context of PySpark DataFrame operations, filter() is a transformation function used to filter the rows in a DataFrame based on a condition.

Step 4: Filter the Rows in a Data Frame Based on a Condition
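
The filtering described in this step would look roughly like the following, assuming the DataFrame df loaded in step 3:

    # Rows with a known job_type will train the model;
    # rows with a NULL job_type are the ones we want to predict.
    df_known = df.filter(df.job_type.isNotNull())
    df_unknown = df.filter(df.job_type.isNull())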

df is the original DataFrame. We’re creating two new DataFrames, df_known and df_unknown, from this original DataFrame.

  • df_known = df.filter(df.job_type.isNotNull()) is creating a new DataFrame that only contains rows where the job_type is not null (i.e., rows where the job_type is known).
  • df_unknown = df.filter(df.job_type.isNull()) is creating a new DataFrame that only contains rows where the job_type is null (i.e., rows where the job_type is unknown).

By separating the known and unknown job_type rows into two separate DataFrames, we can perform different operations on each. For instance, we use df_known to train the machine learning model, and then use that model to predict the job_type for the df_unknown DataFrame.

Step 5

In this step we will vectorize the features. Vectorizing the features is a crucial pre-processing step in machine learning and AI. In the context of machine learning, vectorization refers to the process of converting raw data into a format that can be understood and processed by a machine learning algorithm. A vector is essentially an ordered list of values, which in machine learning represent the ‘features’ or attributes of an observation.

Step 5: Vectorize Features
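
A minimal sketch of this step, assuming age and income are the input features and reusing df_known and df_unknown from step 4:

    from pyspark.ml.feature import VectorAssembler

    # Combine the numeric columns into a single 'features' vector column.
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
    df_known_vec = assembler.transform(df_known)
    df_unknown_vec = assembler.transform(df_unknown)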

Step 6

In this step we will convert categorical labels to indices. Converting categorical labels to indices is a common preprocessing step in ML and AI when dealing with categorical data. Categorical data represents information that is divided into distinct categories or classes, such as “red,” “blue,” and “green” for colors or “dog,” “cat,” and “bird” for animal types. Machine learning algorithms typically require numerical input, so converting categorical labels to indices allows these algorithms to process the data effectively.

Converting categorical labels to indices is important for ML and AI algorithms because it allows them to interpret and process the categorical data as numerical inputs. This conversion enables the algorithms to perform computations, calculate distances, and make predictions based on the numerical representations of the categories. It is worth noting that label encoding does not imply any inherent order or numerical relationship between the categories; it simply provides a numerical representation that algorithms can work with.

It’s also worth mentioning that in some cases, label encoding may not be sufficient, especially when the categorical data has a large number of unique categories or when there is no inherent ordinal relationship between the categories. In such cases, additional techniques like one-hot encoding or feature hashing may be used to represent categorical data effectively for ML and AI algorithms.

Step 6: Converting Categorical Labels to Indices
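
In PySpark this is typically done with StringIndexer. The sketch below assumes df_known_vec from the previous step; keeping a handle on the fitted indexer matters because its learned label list is needed again in step 9:

    from pyspark.ml.feature import StringIndexer

    # Map each distinct job_type string to a numeric label index.
    indexer = StringIndexer(inputCol="job_type", outputCol="label")
    indexer_model = indexer.fit(df_known_vec)
    df_train = indexer_model.transform(df_known_vec)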

Step 7

In this step we will train the model. Training a logistic regression model involves the process of estimating the parameters of the model based on a given dataset. The goal is to find the best-fitting line or decision boundary that separates the different classes in the data.

The process of training a logistic regression model aims to find the optimal parameters that minimize the cost function and provide the best separation between classes in the given dataset. With the trained model, it becomes possible to make predictions on new data points and classify them into the appropriate class based on their features.

Step 7: Train the Model
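
Training on the labeled, vectorized rows might look like this (df_train comes from the step 6 sketch; maxIter is an arbitrary illustrative setting):

    from pyspark.ml.classification import LogisticRegression

    # Multinomial logistic regression over the assembled features.
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=100)
    lr_model = lr.fit(df_train)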

Step 8

Predict the missing value, in this case job_type. Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. It is used to predict the probability of an instance belonging to a particular class or category.

Logistic regression is widely used in various applications such as sentiment analysis, fraud detection, spam filtering, and medical diagnosis. It provides a probabilistic interpretation and flexibility in handling both numerical and categorical independent variables, making it a popular choice for classification tasks.

Step 8: Predict the Missing Value in This Case job_type
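
Applying the trained model to the rows with a missing job_type (df_unknown_vec from the step 5 sketch):

    # transform() adds 'prediction' and 'probability' columns to the DataFrame.
    df_pred = lr_model.transform(df_unknown_vec)
    df_pred.select("age", "income", "prediction").show()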

Step 9

Convert predicted indices back to original labels. Converting predicted indices back to original labels in AI and ML involves the reverse process of encoding categorical labels into numerical indices. When working with classification tasks, machine learning models often output predicted class indices instead of the original categorical labels.

It’s important to note that this reverse mapping process assumes a one-to-one mapping between the indices and the original labels. In cases where the original labels are not unique or there is a more complex relationship between the indices and the labels, additional handling may be required to ensure accurate conversion.

Step 9: Convert Predicted Indices Back to Original Labels
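
In PySpark, IndexToString reverses the StringIndexer mapping, using the label list learned from the training data in the step 6 sketch:

    from pyspark.ml.feature import IndexToString

    # Map predicted numeric indices back to the original job_type strings.
    converter = IndexToString(inputCol="prediction",
                              outputCol="predicted_job_type",
                              labels=indexer_model.labels)
    df_filled = converter.transform(df_pred)
    df_filled.select("age", "income", "predicted_job_type").show()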

Recap

Machine learning (ML) and artificial intelligence (AI) have a significant impact on data engineering, enabling the processing, analysis, and extraction of insights from complex datasets. ML and AI empower data engineers with tools to improve data quality, automate processes, and facilitate data-driven decision-making. Logistic regression, decision trees, random forests, gradient boosting machines, and neural networks can predict categories based on past examples, aiding in correcting misclassified or improperly labeled data. Algorithms like linear regression, polynomial regression, support vector regression, or neural networks predict numeric values, addressing missing numeric entries. Clustering techniques like K-means, hierarchical clustering, or DBSCAN identify duplicates or near-duplicates. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders detect outliers or anomalies, handling errors in the data. Machine learning techniques such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM) predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa process text data for tasks like entity recognition and sentiment analysis. CNNs correct errors in image data, while RNNs or Transformer models handle sequence data.

Data engineers can use Databricks Notebook, for example, to address data quality issues by training a logistic regression model to predict missing categorical values. This involves loading and preparing the data, separating known and unknown instances, vectorizing features, converting categorical labels to indices, training the model, predicting missing values, and converting predicted indices back to original labels.

Ready to unlock the full potential of your data? Embrace the power of machine learning and artificial intelligence in data engineering. Improve data quality, automate processes, and make data-driven decisions with confidence. From logistic regression to neural networks, leverage powerful algorithms to predict categories, address missing values, detect anomalies, and more. Utilize clustering techniques to identify duplicates and near-duplicates. Process text, correct image errors, and handle sequential data effortlessly. Try out tools like Databricks Notebook to train models and resolve data quality issues. Empower your data engineering journey and transform your organization’s data into valuable insights. Take the leap into the world of ML and AI in data engineering today! You can start with this Notebook.

The post Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unleashing-the-power-of-data-how-data-engineers-can-harness-aiml-to-achieve-essential-data-quality/feed/ 0
Healthcare leader uses AI insights to boost data pipeline efficiency https://www.unraveldata.com/resources/healthcare-leader-uses-ai-insights-to-boost-data-pipeline-efficiency/ https://www.unraveldata.com/resources/healthcare-leader-uses-ai-insights-to-boost-data-pipeline-efficiency/#respond Mon, 10 Jul 2023 19:16:15 +0000 https://www.unraveldata.com/?p=12979

One of the largest health insurance providers in the United States uses Unravel to ensure that its business-critical data applications are optimized for performance, reliability, and cost in its development environment—before they go live in production. […]

The post Healthcare leader uses AI insights to boost data pipeline efficiency appeared first on Unravel.

]]>

One of the largest health insurance providers in the United States uses Unravel to ensure that its business-critical data applications are optimized for performance, reliability, and cost in its development environment—before they go live in production.

Data and data-driven statistical analysis have always been at the core of health insurance. But over the past few years the industry has seen an explosion in the volume, velocity, and variety of big data—electronic health records (EHRs), electronic medical records (EMRs), and IoT data produced by wearable medical devices and mobile health apps. As the company’s chief medical officer has said, “Sometimes I think we’re becoming more of a data analytics company than anything else.”

Like many Fortune 500 organizations, the company has a complex, hybrid, multi-everything data estate. Many workloads are still running on premises in Cloudera, but the company also has pipelines on Azure and Google Cloud Platform. Further, its Dev environment is fully on AWS. Says the key technology manager for the Enterprise Data Analytics Platform team, “Unravel is needed for us to ensure that the jobs run smoothly because these are critical data jobs,” and Unravel helps them understand and optimize performance and resource usage. 

With the data team’s highest priority being able to guarantee that its 1,000s of data jobs deliver reliable results on time, every time, they find Unravel’s automated AI-powered Insights Engine invaluable. Unravel auto-discovers everything the company has running in its data estate (both in Dev and Prod), extracting millions of contextualized granular details from logs, traces, metrics, events and other metadata—horizontally and vertically—from the application down to infrastructure and everything in between. Then Unravel’s AI/ML correlates all this information into a holistic view that “connects the dots” as to how everything works together.  AI and machine learning algorithms analyze millions of details in context to detect anomalous behavior in real time, pinpoint root causes in milliseconds, and automatically provide prescriptive recommendations on where and how to change configurations, containers, code, resource allocations, etc.

Platform Technology Diagram

Application developers rely on Unravel to automatically analyze and validate their data jobs in Dev, before the apps ever go live in Prod by, first, identifying inefficient code—code that is most likely to break in production—and then, second, pinpointing to every individual data engineer exactly where and why code should be fixed, so they can tackle potential problems themselves via self-service optimization. The end-result ensures that performance inefficiencies never see the light of day.

The Unravel AI-powered Insights Engine similarly analyzes resource usage. The company leverages the chargeback report capability to understand how the various teams are using their resources. (But Unravel can also slice and dice the information to show how resources are being used by individual users, individual jobs, data products or projects, departments, Dev vs. Prod environments, budgets, etc.) For data workloads still running in Cloudera, this helps avoid resource contention and lets teams queue their jobs more efficiently. Unravel even enables teams to kill a job instantly if it is causing another mission-critical job to fail. 

For workloads running in the cloud, Unravel provides precise, prescriptive AI-fueled recommendations on more efficient resource usage—usually downsizing requested resources to fewer or less costly alternatives that will still hit SLAs. 

As the company’s technology manager for cloud infrastructure and interoperability says, “Some teams use humongous data, and every year our users are growing.” With such astronomical growth, it has become ever more important to tackle data workload efficiency more proactively—everything has simply gotten too big and too complex and business-critical to reactively respond. The company has leveraged Unravel’s automated guardrails and governance rules to trigger alerts whenever jobs are using more resources than necessary.

Top benefits:

  • Teams can now view their queues at a glance and run their jobs more efficiently, without conflict.
  • Chargeback reports show how teams are using their resources (or not), which helps set the timing for running jobs—during business vs. off hours. This has provided a lot of relief for all application teams.

The post Healthcare leader uses AI insights to boost data pipeline efficiency appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/healthcare-leader-uses-ai-insights-to-boost-data-pipeline-efficiency/feed/ 0
Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/ https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/#respond Wed, 21 Jun 2023 21:58:08 +0000 https://www.unraveldata.com/?p=12972

Unravel Data Observability Platform earns prestigious industry recognition for Best Data Tool & Platform Palo Alto, CA — June 21, 2023 —Unravel Data, the first data observability platform built to meet the needs of modern data teams, […]

The post Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards appeared first on Unravel.

]]>

Unravel Data Observability Platform earns prestigious industry recognition for Best Data Tool & Platform

Palo Alto, CA — June 21, 2023 — Unravel Data, the first data observability platform built to meet the needs of modern data teams, was named Best Data Tool & Platform of 2023 as part of the annual SIIA CODiE Awards. The prestigious CODiE Awards recognize the companies producing the most innovative Business Technology products across the country and around the world.

“We are deeply honored to win a CODiE Award for Best Data Tool & Platform. Today, as companies put data products and AI/ML innovation front and center of their growth and customer service strategies, the volume of derailed projects and the costs associated are ascending astronomically. Companies need a way to increase performance of their data pipelines and a way to manage costs for effective ROI,” said Kunal Agarwal, CEO and co-founder, Unravel Data. “The Unravel Data Platform brings pipeline performance management and FinOps to the modern data stack. Our AI-driven Insights Engine provides recommendations that allow data teams to make smarter decisions that optimize pipeline performance along with the associated cloud data spend, making innovation more efficient for organizations.”

Cloud-first companies are seeing cloud data costs exceed 40% of their total cloud spend. These organizations lack the visibility into queries, code, configurations, and infrastructure required to manage data workloads effectively, which in turn, leads to over-provisioned capacity for data jobs, an inability to quickly detect pipeline failures and slowdowns, and wasted cloud data spend.

“The 2023 Business Technology CODiE Award Winners maintain the vital legacy of the CODiEs in spotlighting the best and most impactful apps, services and products serving the business tech market,” said SIIA President Chris Mohr. “We are so proud to recognize this year’s honorees – the best of the best! Congratulations to all of this year’s CODiE Award winners!”

The Software & Information Industry Association (SIIA), the principal trade association for the software and digital content industries, announced the full slate of CODiE winners during a virtual winner announcement. Awards were given for products and services deployed specifically for business technology professionals, including the top honor of Best Overall Business Technology Solution.

A SIIA CODiE Award win is a prestigious honor, following rigorous reviews by expert judges whose evaluations determined the finalists. SIIA members then vote on the finalist products, and the scores from both rounds are tabulated to select the winners.

Details about the winning products can be found at https://siia.net/codie/2023-codie-business-technology-winners/.

Learn more about Unravel’s award-winning data observability platform.

About the CODiE Awards

The SIIA CODiE Awards is the only peer-reviewed program to showcase business and education technology’s finest products and services. Since 1986, thousands of products, services and solutions have been recognized for achieving excellence. For more information, visit siia.net/CODiE.

About Unravel Data

Unravel Data radically transforms the way businesses understand and optimize the performance and cost of their modern data applications – and the complex data pipelines that power those applications. Providing a unified view across the entire data stack, Unravel’s market-leading data observability platform leverages AI, machine learning, and advanced analytics to provide modern data teams with the actionable recommendations they need to turn data into insights. Some of the world’s most recognized brands like Adobe, Maersk, Mastercard, Equifax, and Deutsche Bank rely on Unravel Data to unlock data-driven insights and deliver new innovations to market. To learn more, visit https://www.unraveldata.com.

Media Contact

Blair Moreland
ZAG Communications for Unravel Data
unraveldata@zagcommunications.com

The post Unravel Data Recognized by SIIA as Best Data Tool & Platform at 2023 CODiE Awards appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-data-recognized-by-siia-as-best-data-tool-platform-at-2023-codie-awards/feed/ 0
Data Observability, The Reality eBook https://www.unraveldata.com/resources/data-observability-the-reality-ebook/ https://www.unraveldata.com/resources/data-observability-the-reality-ebook/#respond Tue, 18 Apr 2023 14:07:13 +0000 https://www.unraveldata.com/?p=11829 Abstract Chart Background

Thrilled to announce that Unravel is contributing a chapter in Ravit Jain’s new ebook, Data Observability, The Reality. What you’ll learn from reading this ebook: What data observability is and why it’s important Identify key components […]

The post Data Observability, The Reality eBook appeared first on Unravel.

]]>
Abstract Chart Background

Thrilled to announce that Unravel is contributing a chapter in Ravit Jain’s new ebook, Data Observability, The Reality.

What you’ll learn from reading this ebook:

  • What data observability is and why it’s important
  • Identify key components of an observability framework
  • Learn how to design and implement a data observability strategy
  • Explore real-world use cases and best practices for data observability
  • Discover tools and techniques for monitoring and troubleshooting data pipelines

Don’t miss out on your chance to read our chapter, “Automation and AI Are a Must for Data Observability,” and gain valuable insights from top industry leaders such as Sanjeev Mohan and others.

The post Data Observability, The Reality eBook appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-observability-the-reality-ebook/feed/ 0
Three Companies Driving Better Business Outcomes from Data Analytics https://www.unraveldata.com/resources/three-companies-driving-better-business-outcomes-from-data-analytics/ https://www.unraveldata.com/resources/three-companies-driving-better-business-outcomes-from-data-analytics/#respond Thu, 09 Feb 2023 19:00:08 +0000 https://www.unraveldata.com/?p=11485 Abstract light image

You’re unlikely to be able to build a business without data. But how can you use it effectively? There are so many ways you can use data in your business, from creating better products and services […]

The post Three Companies Driving Better Business Outcomes from Data Analytics appeared first on Unravel.

]]>
Abstract light image

You’re unlikely to be able to build a business without data.

But how can you use it effectively?

There are so many ways you can use data in your business, from creating better products and services for customers, to improving efficiency and reducing waste.

Enter data observability. Using agile development practices, companies can create, deliver, and optimize data products quickly and cost-effectively. Organizations can easily identify a problem to solve and then break it down into smaller pieces. Each piece is then assigned to a team that breaks down the work to solve the problem into a defined period of time – usually called a sprint – that includes planning, work, deployment, and review.

Marion Shaw, Head of Data and Analytics at Chaucer Group, and Unravel’s Ben Cooper presented on transforming data analytics to build better products and services and on making data analytics more efficient, respectively, at a recent Chief Disruptor virtual event.

Following on from the presentation, members joined a roundtable discussion and took part in a number of polls, in order to share their experiences. Here are just some examples of how companies have used data analytics to drive innovation:

  • Improved payments. An influx of customer calls asking “where’s my money or payment?” prompted a company to introduce a “track payments” feature as a way of digitally understanding the payment status. As a result, the volume of callers decreased, while users of the new feature actually eclipsed the amount of original complaints, which proved there was a batch of customers who couldn’t be bothered to report problems but still found the feature useful. “If you make something easy for your customers, they will use it.“
  • Cost reduction and sustainability: Moving from plastic to paper cups reduced costs and improved sustainability for one company, showing how companies can use their own data to make business decisions.
  • New products: Using AI and data collaboration in drug discovery to explore disease patterns can help pharmaceutical companies find new treatments for diseases, with the potential for high returns, although the cost of discovery remains high given the expense of working with big data sets.

The key takeaways from the discussion were:

  • Make it simple. When you make an action easy for your customers, they will use it.
  • Lean on the data. If there isn’t data behind someone’s viewpoint, then it is simply an opinion.
  • Get buy-in. Data teams need to buy into the usage of data—just because a data person owns the data side of things does not mean that they are responsible for the benefits or failings of it.

Using data analytics effectively with data observability is key. Companies across all industries are using data observability to create better products and services, reduce waste, and improve productivity.

But data observability is no longer just about data quality or observing the condition of the data itself. Today it encompasses much more, and you can’t “borrow” your software teams’ observability solution. Discover more in our report DataOps Observability: The Missing Link for Data Teams.

The post Three Companies Driving Better Business Outcomes from Data Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/three-companies-driving-better-business-outcomes-from-data-analytics/feed/ 0
3 Takeaways from the 2023 Data Teams Summit https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/ https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/#respond Thu, 02 Feb 2023 13:45:19 +0000 https://www.unraveldata.com/?p=11362

The 2023 Data Teams Summit (formerly DataOps Unleashed) was a smashing success, with over 2,000 participants from 1,600 organizations attending 23 expert-led breakout sessions, panel discussions, case studies, and keynote presentations covering a wide range of […]

The post 3 Takeaways from the 2023 Data Teams Summit appeared first on Unravel.

]]>

The 2023 Data Teams Summit (formerly DataOps Unleashed) was a smashing success, with over 2,000 participants from 1,600 organizations attending 23 expert-led breakout sessions, panel discussions, case studies, and keynote presentations covering a wide range of thought leadership and best practices.

There were a lot of sessions devoted to different strategies and considerations when building a high-performing data team, how to become a data team leader, where data engineering is heading, and emerging trends in DataOps (asset-based orchestration, data contracts, data mesh, digital twins, data centers of excellence). And running as a common theme throughout almost every presentation was that top-of-mind topic: FinOps and how to get control over galloping cloud data costs.

Some of the highlight sessions are available now on demand (no form to fill out) on our Data Teams Summit 2023 page. More are coming soon. 

There was a lot to see (and full disclosure: I didn’t get to a couple of sessions), but here are 3 sessions that I found particularly interesting.

Enabling strong engineering practices at Maersk

Enabling strong engineering practices at Maersk

The fireside chat between Unravel CEO and Co-founder Kunal Agarwal and Mark Sear, Head of Data Platform Optimization at Maersk, one of the world’s largest logistics companies, is entertaining and informative. Kunal and Mark cut through the hype to simplify complex issues in commonsensical, no-nonsense language about:

  • The “people problem” that nobody’s talking about
  • How Maersk was able to upskill its data teams at scale
  • Maersk’s approach to rising cloud data costs
  • Best practices for implementing FinOps for data teams

Check out their talk here

Maximize business results with FinOps

Maximize business results with FinOps

Unravel DataOps Champion and FinOps Certified Practitioner Clinton Ford and FinOps Foundation Ambassador Thiago Gil explain how and why the emerging cloud financial management discipline of FinOps is particularly relevant—and challenging—for data teams. They cover:

  • The hidden costs of cloud adoption
  • Why observability matters
  • How FinOps empowers data teams
  • How to maximize business results 
  • The state of production ML

See their session here

Situational awareness in a technology ecosystem

Situational awareness in a technology ecosystem

Charles Boicey, Chief Innovation Officer and Co-founder of Clearsense, a healthcare data platform company, explores the various components of a healthcare-centric data ecosystem and how situational awareness in the clinical environment has been transferred to the technical realm. He discusses:

  • What clinical situational awareness looks like
  • The concept of human- and technology-assisted observability
  • The challenges of getting “focused observability” in a complex hybrid, multi-cloud, multi-platform modern data architecture for healthcare
  • How Clearsense leverages observability in practice
  • Observability at the edge

Watch his presentation here

How to get more session recordings on demand

  1. To see other session recordings without any registration, visit the Unravel Data Teams Summit 2023 page. 
  2. To see all Data Teams Summit 2023 recordings, register for access here.

And please share your favorite takeaways, see what resonated with your peers, and join the discussion on LinkedIn.  

The post 3 Takeaways from the 2023 Data Teams Summit appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/feed/ 0
Sneak Peek into Data Teams Summit 2023 Agenda https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/ https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/#respond Thu, 05 Jan 2023 22:47:29 +0000 https://www.unraveldata.com/?p=10895

The Data Teams Summit 2023 is just around the corner! This year, on January 25, 2023, we’re taking the peer-to-peer empowerment of data teams one step further, transforming DatOps Unleashed into Data Teams Summit to better […]

The post Sneak Peek into Data Teams Summit 2023 Agenda appeared first on Unravel.

]]>

The Data Teams Summit 2023 is just around the corner!

This year, on January 25, 2023, we’re taking the peer-to-peer empowerment of data teams one step further, transforming DataOps Unleashed into Data Teams Summit to better reflect our focus on the people—data teams—who unlock the value of data.

Data Teams Summit is an annual, full-day virtual conference, led by data rockstars at future-forward organizations about how they’re establishing predictability, increasing reliability, and creating economic efficiencies with their data pipelines.

Check out the full agenda and register for a free ticket.

Join us for sessions on:

  • DataOps best practices
  • Data team productivity and self-service
  • DataOps observability
  • FinOps for data teams
  • Data quality and governance
  • Data modernizations and infrastructure

The peer-built agenda is packed, with over 20 panel discussions and breakout sessions. Here’s a sneak peek at some of the most highly anticipated presentations:

Keynote Panel: Winning strategies to unleash your data team

Data Teams Summit 2023 keynote speakers

Great data outcomes depend on successful data teams. Every single day, data teams deal with hundreds of different problems arising from the volume, velocity, variety—and complexity—of the modern data stack.

Learn best practices and winning strategies about what works (and what doesn’t) to help data teams tackle the top day-to-day challenges and unleash innovation.

Breakout Session: Maximize business results with FinOps

Data Teams Summit 2023 FinOps speakers

As organizations run more data applications and pipelines in the cloud, they look for ways to avoid the hidden costs of cloud adoption and migration. Teams seek to maximize business results through cost visibility, forecast accuracy, and financial predictability.

In this session, learn why observability matters and how a FinOps approach empowers DataOps and business teams to collaboratively achieve shared business goals. This approach uses the FinOps Framework, taking advantage of the cloud’s variable cost model, and distributing ownership and decision-making through shared visibility to get the biggest return on their modern data stack investments.

See how organizations apply agile and lean principles using the FinOps framework to boost efficiency, productivity, and innovation.

Breakout Session: Going from DevOps to DataOps

Data Teams Summit 2023 Ali Khalid

DevOps has had a massive impact on the web services world. Learn how to leverage those lessons and take them further to improve the quality and speed of delivery for analytics solutions.

Ali’s talk will serve as a blueprint for the fundamentals of implementing DataOps, laying out some principles to follow from the DevOps world and, importantly, adding subject areas required to get to DataOps—which participants can take back and apply to their teams.

Breakout Session: Becoming a data engineering team leader

Data Teams Summit 2023 Matthew Weingarten

As you progress up the data engineering career ladder, responsibilities shift: you become more hands-off and look at the overall picture rather than any particular project.

How do you ensure your team’s success? It starts with focusing on the team members themselves.

In this talk, Matt Weingarten, a lead Data Engineer at Disney Streaming, will walk through some of his suggestions and best practices for how to be a leader in the data engineering world.

Attendance is free! Sign up here for a free ticket

The post Sneak Peek into Data Teams Summit 2023 Agenda appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/feed/ 0
Get Ready for the Next Generation of DataOps Observability https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/ https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/#respond Wed, 05 Oct 2022 00:15:52 +0000 https://www.unraveldata.com/?p=10303 Data Pipelines

This blog was originally published by Unravel CEO Kunal Agarwal on LinkedIn in September 2022. I was chatting with Sanjeev Mohan, Principal and Founder of SanjMo Consulting and former Research Vice President at Gartner, about how […]

The post Get Ready for the Next Generation of DataOps Observability appeared first on Unravel.

]]>
Data Pipelines

This blog was originally published by Unravel CEO Kunal Agarwal on LinkedIn in September 2022.

I was chatting with Sanjeev Mohan, Principal and Founder of SanjMo Consulting and former Research Vice President at Gartner, about how the emergence of DataOps is changing people’s idea of what “data observability” means. Not in any semantic sense or a definitional war of words, but in terms of what data teams need to stay on top of an increasingly complex modern data stack. While much ink has been spilled over how data observability is much more than just data profiling and quality monitoring, until very recently the term was pretty much restricted to mean observing the condition of the data itself.

But now DataOps teams are thinking about data observability more comprehensively as embracing other “flavors” of observability like application and pipeline performance, operational observability into how the entire platform or system is running end-to-end, and business observability aspects such as ROI and—most significantly—FinOps insights to govern and control escalating cloud costs.

That’s what we at Unravel call DataOps observability.

5-way crossfire facing DataOps teams

Data teams are getting bogged down

Data teams are struggling, overwhelmed by the increased volume, velocity, variety, and complexity of today’s data workloads. These data applications are simultaneously becoming ever more difficult to manage and ever more business-critical. And as more workloads migrate to the cloud, team leaders are finding that costs are getting out of control—often leading to migration initiatives stalling out completely because of budget overruns.

The way data teams are doing things today isn’t working.

Data engineers and operations teams spend way too much time firefighting “by hand.” Something like 70-75% of their time is spent tracking down and resolving problems through manual detective work and a lot of trial and error. And with 20x more people creating data applications than fixing them when something goes wrong, the backlog of trouble tickets gets longer, SLAs get missed, friction among teams creeps in, and the finger-pointing and blame game begins.

This less-than-ideal situation is a natural consequence of inherent process bottlenecks and working in silos. There are only a handful of experts who can untangle the wires to figure out what’s going on, so invariably problems get thrown “over the wall” to them. Self-service remediation and optimization is just a pipe dream. Different team members each use their own point tools, seeing only part of the overall picture, and everybody gets a different answer to the same problem. Communication and collaboration among the team breaks down, and you’re left operating in a Tower of Babel.

Check out our white paper DataOps Observability: The Missing Link for Data Teams
Download here

Accelerating next-gen DataOps observability

These problems aren’t new. DataOps teams are facing some of the same general challenges as their DevOps counterparts did a decade ago. Just as DevOps united the practice of software development and operations and transformed the application lifecycle, today’s data teams need the same observability but tailored to their unique needs. And while application performance management (APM) vendors have done a good job of collecting, extracting, and correlating details into a single pane of glass for web applications, they’re designed for web applications and give data teams only a fraction of what they need.

DevOps tools don't work for DataOps

System point tools and cloud provider tools all provide some of the information data teams need, but not all. Most of this information is hidden in plain sight—it just hasn’t been extracted, correlated, and analyzed by a single system designed specifically for data teams.

That’s where Unravel comes in.

Data teams need what Unravel delivers—observability designed to show data application/pipeline performance, cost, and quality coupled with precise, prescriptive fixes that will allow you to quickly and efficiently solve the problem and get on to the real business of analyzing data. Our AI-powered solution helps enterprises realize greater return on their investment in the modern data stack by delivering faster troubleshooting, better performance to meet service level agreements, self-service features that allow applications to get out of development and into production faster and more reliably, and reduced cloud costs.

I’m excited, therefore, to share that earlier this week, we closed a $50 million Series D round of funding that will allow us to take DataOps observability to the next level and extend the Unravel platform to help connect the dots from every system in the modern data stack—within and across some of the most popular data ecosystems. 

Unlocking the door to success

By empowering data teams to spend more time on innovation and less time firefighting, Unravel helps data teams take a page out of their software counterparts’ playbook and tackle their problems with a solution that goes beyond observability: it doesn’t just show you what’s going on and why, it actually tells you exactly what to do about it. It’s time for true DataOps observability.

To learn more about how Unravel Data is helping data teams tackle some of today’s most complex modern data stack challenges, visit: www.unraveldata.com. 

The post Get Ready for the Next Generation of DataOps Observability appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/feed/ 0
Unravel Goes on the Road at These Upcoming Events https://www.unraveldata.com/resources/unravel-upcoming-events/ https://www.unraveldata.com/resources/unravel-upcoming-events/#respond Thu, 01 Sep 2022 19:41:27 +0000 https://www.unraveldata.com/?p=10066 Big Data Survey 2019

Join us at an event near you or attend virtually to discover our DataOps observability platform, discuss your challenges with one of our DataOps experts, go under the hood to check out platform capabilities, and see […]

The post Unravel Goes on the Road at These Upcoming Events appeared first on Unravel.

]]>
Big Data Survey 2019

Join us at an event near you or attend virtually to discover our DataOps observability platform, discuss your challenges with one of our DataOps experts, go under the hood to check out platform capabilities, and see what your peers have been able to accomplish with Unravel.

September 21-22: Big Data LDN (London) 

Big Data LDN is the UK’s leading free-to-attend data & analytics conference and exhibition, hosting leading data and analytics experts, ready to arm you with the tools to deliver your most effective data-driven strategy. Stop by the Unravel booth (stand #724) to see how Unravel delivers observability designed specifically for the unique needs of today’s data teams.

Register for Big Data LDN here

And be sure to stop by the Unravel booth at 5PM on Day 1 for the Data Drinks Happy Hour, with drinks and snacks (while supplies last!).

October 5-6: AI & Big Data Expo – North America (Santa Clara) 

AI & Big Data Expo North America

The world’s leading AI & Big Data event returns to Santa Clara as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe. The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

Register for AI & Big Data Expo here

And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on the different challenges facing different AI & Big Data team members and how multidimensional observability (performance, cost, quality) designed specifically for the modern data stack can help.

October 10-12: Chief Data & Analytics Officers (CDAO) Fall (Boston)

CDAO Fall 2022

The premier in-person gathering for data & analytics leaders in North America, CDAO Fall offers focus tracks on data infrastructure, data governance, data protection & privacy; analytics, insights, and business intelligence; and data science, artificial intelligence, and machine learning. Exclusive industry summit days for data and analytics professionals in Financial Services, Insurance, Healthcare, and Retail/CPG.

Register for CDAO Fall here

October 14: DataOps Observability Conf India 2022 (Bengaluru)

DataOps Observability Conf India 2022

India’s first DataOps observability conference, this event brings together data professionals to collaborate and discuss best practices and trends in the modern data stack, analytics, AI, and observability.

Join leading DataOps observability experts to:

  • Understand what DataOps is and why it’s important
  • Learn why DataOps observability has become a mission-critical need in the modern data stack
  • Discover how AI is transforming DataOps and observability

Register for DataOps Observability Conf India 2022 here

November 1-3: ODSC West (San Francisco)

The Open Data Science Conference (ODSC) is essential for anyone who wants to connect to the data science community and contribute to the open source applications it uses every day. A hybrid in-person/virtual event, ODSC West features 250 speakers, with 300 hours of content, including keynote presentations, breakout talk sessions, hands-on tutorials and workshops, partner demos, and more. 

Register for ODSC West here

Sneak peek into what you’ll see from Unravel 

Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities.

Explore all our interactive guided tours here.

The post Unravel Goes on the Road at These Upcoming Events appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-upcoming-events/feed/ 0
Tips to optimize Spark jobs to improve performance https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/ https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/#respond Tue, 23 Aug 2022 01:48:56 +0000 https://www.unraveldata.com/?p=9992

Summary: Sometimes the insight you’re shown isn’t the one you were expecting. Unravel DataOps observability provides the right, actionable insights to unlock the full value and potential of your Spark application. One of the key […]

The post Tips to optimize Spark jobs to improve performance appeared first on Unravel.

]]>

Summary: Sometimes the insight you’re shown isn’t the one you were expecting. Unravel DataOps observability provides the right, actionable insights to unlock the full value and potential of your Spark application.

One of the key features of Unravel is our automated insights. This is the feature where Unravel analyzes the finished Spark job and then presents its findings to the user. Sometimes those findings can be layered and not exactly what you expect.

Why Apache Spark Optimization?

Let’s take a scenario where you are developing a new Spark job and doing some testing. The goal of this testing is to ensure the Spark application is properly optimized. To see whether you have tuned it appropriately, you bump up the resource allocation really high, expecting Unravel to show a “Container Sizing” event or something along those lines. Instead of a “Container Sizing” event, you see a few other events, in our case “Contended Driver” and “Long Idle Time for Executors.” Let’s take a look at why this might be the case!

Recommendations to optimize Spark

When Unravel presents recommendations, it presents them based on what is most helpful for the current state of the application. When it can, it will also present only the best case for improving the application. In certain scenarios this leads to insights that are not shown, because acting on them would end up causing more harm than good. There can be many reasons for this, but let’s take a look at how Unravel presents this particular application.

Below are the resource utilization graphs for the two runs of the application we are developing:

Original run: Resource Utilization Graph

New run: Resource Utilization Graph

The most glaring issue is this idle time. We can see the job is doing bursts of work and then just sitting around doing nothing. Addressing the idle time from an application perspective would improve the job’s performance tremendously, and it is most likely masking other potential performance improvements. If we dig into the job a bit further we can see this:
Job Processing Stages Pie Chart

The above is a breakdown of what the job was doing for the whole time. Nearly 50% was spent on ScheduledWaitTime! This leads to the Executor idle time recommendation:
Executor Idle Time Recommendation

Taking all of the above information we can see that the application was coded in such a way that it’s waiting around for something for a while. At this point you could hop into the “Program” tab within Unravel to take a look at the actual source code associated with this application. We can say the same thing about the Contended Driver:
Contended Driver Recommendation

With this one we should examine why the code is spending so much time with the driver. It goes hand-in-hand with the idle time because while the executors are idle, the driver is still working away. Once we take a look at our code and resolve these two items, we would see a huge increase in this job’s performance!
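As a rough illustration of the kind of code change these insights point to, here is a minimal, hypothetical PySpark sketch (not taken from the job in this post): the commented-out driver-side loop is the sort of pattern that produces a contended driver and idle executors, while the distributed version keeps the work on the executors. Paths and column names are placeholders.

```python
# Hypothetical sketch: a common source of "Contended Driver" and
# "Long Idle Time for Executors" findings is pulling data back to the
# driver and processing it in a Python loop.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("idle-executor-demo").getOrCreate()
df = spark.read.parquet("s3://example-bucket/events/")  # placeholder path

# Anti-pattern: executors sit idle while the driver grinds through rows.
# totals = {}
# for row in df.collect():
#     totals[row["user_id"]] = totals.get(row["user_id"], 0) + row["amount"]

# Better: express the same logic as distributed transformations so the
# work stays on the executors.
totals = (
    df.groupBy("user_id")
      .agg(F.sum("amount").alias("total_amount"))
)
totals.write.mode("overwrite").parquet("s3://example-bucket/totals/")
```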

Another item that surfaced when looking at this job was this:
Processing stages summary and garbage collection time for executors

This is telling us that the Java processes spent a lot of time in garbage collection. This can lead to thrashing and other types of issues. More than likely it’s related to the recommendations Unravel made that need to be examined more deeply.

Among all the recommendations we saw, we didn’t see the expected resource recommendations. That is because Unravel presents the most important insights: the things that the developer actually needs to look at.

These issues run deeper than settings changes. Interestingly, both jobs we looked at did require some resource tuning; we could see that even the original job was overallocated. The problem with these jobs, though, is that showing just the settings changes isn’t the whole story.

If Unravel presented just the resource change and it was implemented, the job might fail, for example by being killed because of garbage collection issues. Unravel is attempting to steer the developer in the right direction and give them the tools they need to determine where the issue is in their code. Once they fix the larger issues, like high GC or a contended driver, they can start to really tune and optimize their job.
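For completeness, here is a hedged sketch of what revisiting resource settings might look like once the code-level issues are addressed. The property names are standard Spark configuration keys, but the values are illustrative placeholders, not recommendations from Unravel.

```python
# Hypothetical sketch: the kinds of settings you might revisit only after
# the code-level issues (high GC, contended driver) are fixed.
# Values are placeholders for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right-sized-job")
    # Right-size executors instead of bumping allocation really high.
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "4g")
    # Let idle executors be released instead of sitting around.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```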

Next steps:

Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

  • Never worry about monitoring and observing your entire data estate.
  • Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
  • Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

How can you get started?

Create your free account today.
Book a demo to see how Unravel simplifies modern data stack management.

The post Tips to optimize Spark jobs to improve performance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/feed/ 0
Kafka best practices: Monitoring and optimizing the performance of Kafka applications https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/ https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/#respond Mon, 18 Jul 2022 21:04:02 +0000 https://www.unraveldata.com/?p=2872

What is Apache Kafka? Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Administrators, developers, and data engineers who use […]

The post Kafka best practices: Monitoring and optimizing the performance of Kafka applications appeared first on Unravel.

]]>

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Administrators, developers, and data engineers who use Kafka clusters struggle to understand what is happening in their Kafka implementations. Out-of-the-box solutions like Ambari and Cloudera Manager provide some high-level monitoring, but most customers find these tools insufficient for troubleshooting purposes. These tools also fail to provide the right visibility down to the applications acting as consumers that are processing Kafka data streams.

For Unravel Data users, Kafka monitoring and insights come out of the box with your installation. This blog is geared towards showing you best practices for using Unravel as a Kafka monitoring tool to detect and resolve issues faster. We assume you have a baseline knowledge of Kafka concepts and architecture.

Kafka monitoring tools

There are several ways to monitor Kafka applications: open source tools, proprietary tools, and purpose-built tools, each with its own advantages. Based on the availability of skilled resources and organizational priorities, you can choose the best option.

Importance of using Kafka Monitoring tools

An enterprise-grade Kafka monitoring tool like Unravel saves your critical resources from wasting hours manually configuring monitoring for Apache Kafka, giving them more time to deliver value to end users. Locating an issue can be a challenging task without a monitoring solution in place: your data engineering team will have to test different use cases and educated guesses before they can get to the root of issues. A proprietary Kafka monitoring tool like Unravel, by contrast, provides AI-enabled insights and recommendations and automates remediation.

Monitoring Cluster Health

Unravel provides color-coded KPIs per cluster to give you a quick overall view of the health of your Kafka cluster. The colors represent the following:

  • Green = Healthy
  • Red = Unhealthy and is something you will want to investigate
  • Blue = Metrics on cluster activity

Let’s walk through a few scenarios and how we can use Unravel’s KPIs to help us through troubleshooting issues.

Under-replicated partitions 

Under-replicated partitions tell us that replication is not going as fast as configured, which adds latency, as consumers don’t get the data they need until messages are replicated. It also suggests that we are more vulnerable to losing data if we have a master failure. Any under-replicated partitions at all are a bad sign and something we’ll want to root-cause to avoid data loss. Generally speaking, under-replicated partitions usually point to an issue on a specific broker.

To investigate under-replicated partitions showing up in your cluster, we’ll need to understand how often this is occurring. For that we can use the following graphs to get a better understanding:

# of under replicated partitions
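If you want to cross-check what the graph shows from the command line of your own tooling, here is a minimal, hypothetical sketch using the confluent-kafka Python client (not part of Unravel) that flags partitions whose in-sync replica set is smaller than the configured replica set. The broker address is a placeholder.

```python
# Hypothetical sketch: list under-replicated partitions by comparing each
# partition's in-sync replicas (ISR) with its configured replicas.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder
metadata = admin.list_topics(timeout=10)

for topic_name, topic in metadata.topics.items():
    for partition_id, partition in topic.partitions.items():
        if len(partition.isrs) < len(partition.replicas):
            print(
                f"Under-replicated: {topic_name}-{partition_id} "
                f"replicas={partition.replicas} isr={partition.isrs}"
            )
```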

Another useful metric to monitor is the “Log flush latency, 99th percentile” graph, which shows the time it takes for the brokers to flush logs to disk:

Log Flush Latency, 99th Percentile

Log flush latency is important because the longer it takes to flush the log to disk, the more the pipeline backs up, and the worse our latency and throughput become. When this number goes up, even from 10ms to 20ms, end-to-end latency balloons, which can also lead to under-replicated partitions.

If we notice that latency fluctuates greatly, we’ll want to identify which broker(s) are contributing to this. To do so, go back to the “Broker” tab and click on each broker to refresh the graph for the selected broker:

Once we identify the broker(s) with wildly fluctuating latency we will want to get further insight by investigating the logs for that broker. If the OS is having a hard time keeping up with log flush then we may want to consider adding more brokers to our Kafka architecture.

Offline partitions

Generally speaking, you want to avoid offline partitions in your cluster. If you see this metric > 0, there are broker-level issues that need to be addressed.

Unravel provides the “# Offline Partitions” graph to understand when offline partitions occur:

This metric provides the total number of topic partitions in the cluster that are offline. This can happen if the brokers with replicas are down, or if unclean leader election is disabled and the replicas are not in sync, so none can be elected leader (which may be desirable to ensure no messages are lost).

Controller Health

This KPI displays the number of brokers in the cluster reporting as the active controller in the last interval. Controller assignment can change over time as shown in the “# Active Controller Trend” graph:

  • “# Controller” = 0
    • There is no active controller. You want to avoid this!
  • “# Controller” = 1
    • There is one active controller. This is the state that you want to see.
  • “# Controller” > 1
    • Can be good or bad. During steady state there should be only one active controller per cluster. If this is greater than 1 for only one minute, then it probably means the active controller switched from one broker to another. If this persists for more than one minute, troubleshoot the cluster for “split brain”.

If the active controller count is anything other than 1, we’ll want to investigate logs at the broker level for further insight.

Cluster Activity

The last three KPIs show cluster activity within the last 24 hours. They will always be colored blue because these metrics are neither good nor bad. These KPIs are useful in gauging activity in your Kafka cluster over the last 24 hours. You can also view these metrics via their respective graphs below:

These metrics can be useful in understanding cluster capacity, for example to:

  • Add additional brokers to keep up with data velocity
  • Evaluate performance of topic architecture on your brokers
  • Evaluate performance of partition architecture for a topic

The next section provides best practices in using Unravel to evaluate performance of your topics / brokers.

Topic Partition Strategy

A common challenge for Kafka admins is designing a topic/partition architecture that can support the data velocity coming from producers. Most times this is a shot in the dark, because out-of-the-box solutions do not provide any actionable insight on activity for your topics and partitions. Let’s investigate how Unravel can help shed some light on this.

Producer architecture

On the producer side it can be a challenge to decide how to partition topics at the Kafka level. Producers can choose to send messages by key, or use a round-robin strategy when no key has been defined for a message. Choosing the correct number of partitions for your data velocity is important in ensuring a real-time architecture that is performant.
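To make the keyed versus round-robin distinction concrete, here is a small hypothetical sketch using the kafka-python client (not part of Unravel); the broker address and topic name are placeholders.

```python
# Hypothetical sketch: the same producer can write keyed messages (all
# events for a key land on one partition, preserving per-key ordering)
# or unkeyed messages, which the client spreads across partitions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder
    key_serializer=lambda k: k.encode("utf-8") if k else None,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Keyed: partition chosen by hashing the key, so per-key ordering holds.
producer.send("orders", key="user-42", value={"amount": 19.99})

# Unkeyed: messages are spread across partitions, which balances load
# but gives up per-key ordering.
producer.send("orders", value={"amount": 5.00})

producer.flush()
```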

Let’s use Unravel to get a better understanding of how the current architecture is performing. We can then use this insight to guide us in choosing the correct number of partitions.

To investigate – click on the topic tab and scroll to the bottom to see the list of topics in your Kafka cluster:

In this table we can quickly identify topics with heavy traffic, where we may want to understand how our partitions are performing.

Let’s click on a topic to get further insight:

Consumer Group Architecture

We can also make changes on the consumer group side to scale according to your topic/partition architecture. Unravel provides a convenient view of the consumer groups for each topic, so you can quickly get a health status of the consumer groups in your Kafka cluster. If a consumer group is a Spark Streaming application, Unravel can also provide insights into that application, thereby providing an end-to-end monitoring solution for your streaming applications.

See Kafka Insights for a use case on monitoring consumer groups, and lagging/stalled partitions or consumer groups.
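As a rough complement to that view, here is a hypothetical sketch using the kafka-python client (not part of Unravel) that computes per-partition lag for a consumer group by comparing committed offsets with the latest broker offsets. The topic, group, and broker address are placeholders.

```python
# Hypothetical sketch: rough per-partition consumer lag =
# latest broker offset - last committed offset for the group.
from kafka import KafkaConsumer, TopicPartition

topic = "orders"          # placeholder topic
group_id = "orders-etl"   # placeholder consumer group

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # placeholder
    group_id=group_id,
    enable_auto_commit=False,
)

partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    print(f"{topic}-{tp.partition}: committed={committed} end={end_offsets[tp]} lag={lag}")

consumer.close()
```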

Next steps:

Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

  • Never worry about monitoring and observing your entire data estate.
  • Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
  • Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

How can you get started?

Create your free account today.

Book a demo to see how Unravel simplifies modern data stack management.

References:

https://www.signalfx.com/blog/how-we-monitor-and-run-kafka-at-scale/ 

https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster 

https://docs.confluent.io/current/control-center/systemhealth.html

The post Kafka best practices: Monitoring and optimizing the performance of Kafka applications appeared first on Unravel.

]]>
https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/feed/ 0
Why Legacy Observability Tools Don’t Work for Modern Data Stacks https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/ https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/#comments Fri, 13 May 2022 13:13:01 +0000 https://www.unraveldata.com/?p=9406

Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics […]

The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

]]>

Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics has gone from being a science project to becoming the backbone of business operations to generate insights, fuel innovation, improve customer satisfaction, and drive revenue growth. But none of that can happen if data applications and pipelines aren’t running well.

Yet data-driven organizations find themselves caught in a crossfire: their data applications/pipelines are more important than ever, but managing them is more difficult than ever. As more data is generated, processed, and analyzed in an increasingly complex environment, businesses are finding the tools that served them well in the past or in other parts of their technology stack simply aren’t up to the task.

Modern data stacks are a different animal

Would you want an auto mechanic (no matter how excellent) to diagnose and fix a jet engine problem before you took flight? Of course not. You’d want an aviation mechanic working on it. Even though the basic mechanical principles and symptoms (engine trouble) are similar, automobiles and airplanes are very different under the hood. The same is true with observability for your data application stacks and your web application stacks. The process is similar, but they are totally different animals.

At first glance, it may seem that the leading APM monitoring and observability tools like Datadog, New Relic, Dynatrace, AppDynamics, etc., do the same thing as a modern data stack observability platform like Unravel. And in the sense that both capture and correlate telemetry data to help you understand issues, that’s true. But one is designed for web apps, while the other for modern data pipelines and applications.

Observability for the modern data stack is indeed completely different from observability for web (or mobile) applications. They are built and behave differently, face different types of issues for different reasons, requiring different analyses to resolve problems. To fully understand, troubleshoot, and optimize (for both performance and cost) data applications and pipelines, you need an observability platform that’s built from the ground up to tackle the specific complexities of the modern data stack. Here’s why.

What’s different about modern data applications?

First and foremost, the whole computing framework is different for data applications. Data workloads get broken down into multiple, smaller, often similar parts, each processed concurrently on a separate node, with the results re-combined upon completion: parallel processing. And this happens at each successive stage of the workflow as a whole. Dependencies within data applications/pipelines are deep and layered. It’s crucial that everything (execution time, scheduling and orchestration, data lineage, and layout) be in sync.

In contrast, web applications are a tangle of discrete request-response services processed individually. Each service does its own thing and operates relatively independently. What’s most important is the response time of each service request and how that contributes to the overall response time of a user transaction. Dependencies within web apps are not especially deep but are extremely broad.

web apps vs data apps

Web apps are request-response; data apps process in parallel.

Consequently, there’s a totally different class of problems, root causes, and remediation for data apps vs. web apps. When doing your detective work into a slow or failed app, you’re looking at a different kind of culprit for a different type of crime, and need different clues (observability data). You need a whole new set of data points, different KPIs, from distinct technologies, visualized in another way, and correlated in a uniquely modern data stack–specific context.

The flaw with using traditional APM tools for modern data stacks

What organizations who try to use traditional APM for the modern data stack find is that they wind up getting only a tiny fraction of the information they need from a solution like Dynatrace or Datadog or AppDynamics, such as infrastructure and services-level metrics. But over 90% of the information data teams need is buried in places where web APM simply doesn’t go; you need an observability platform specifically designed to dig through all the systems to get this data and then stitch it together into a unified context.

This is where the complexity of modern data applications and pipelines rears its head. The modern data stack is not a single system, but a collection of systems. You might have Kafka or Kinesis or Data Factory for data ingestion, some sort of data lake to store it, then possibly dozens of different components for different types of processing: Druid for real-time processing, Databricks for AI/ML processing, BigQuery or Snowflake for data warehouse processing, another technology for batch processing; the list goes on. So you need to capture deep information horizontally across all the various systems that make up your data stack. But you also need to capture deep information vertically, from the application down to infrastructure and everything in between (pipeline definitions, data usage/lineage, data types and layout, job-level details, execution settings, container-level information, resource allocation, etc.).

Cobbling it together manually via “swivel chair whiplash,” jumping from screen to screen, is a time-consuming, labor-intensive effort that can take hours, even days, for a single problem. And still there’s a high risk that it won’t be completely accurate. There is simply too much data to make sense of, in too many places. Trying to correlate everything on your own, whether by hand or with a homegrown jury-rigged solution, requires two things that are always in short supply: time and expertise. Even if you know what you’re looking for, trolling through hundreds of log files is like looking for a needle in a stack of needles.

An observability platform purpose-built for the modern data stack can do all that for you automatically. Trying to make traditional APM observe data stacks is simply using the wrong tool for the job at hand.

DevOps APM vs. DataOps observability in practice

With the growing complexity in today’s modern data systems, any enterprise-grade observability solution should do 13 things:

  1. Capture full-stack data, both horizontally and vertically, from the various systems that make up the stack, including engines, schedulers, services, and cloud provider
  2. Capture information about all data applications: pipelines, warehousing, ETL/ELT, machine learning models, etc.
  3. Capture information about datasets, lineage, users, business units, computing resources, infrastructure, and more
  4. Correlate, not just aggregate, data collected into meaningful context
  5. Understand all application/pipeline dependencies on data, resources, and other apps
  6. Visualize data pipelines end to end from source to output
  7. Provide a centralized view of your entire data stack, for governance, usage analytics, etc.
  8. Identify baseline patterns and detect anomalies
  9. Automatically analyze and pinpoint root causes
  10. Proactively alert to prevent problems
  11. Provide recommendations and remedies to solve issues
  12. Automate resolution or self-healing
  13. Show the business impact

While the principles are the same for data app and web app observability, how to go about this and what it looks like are markedly dissimilar.

Everything starts with the dataand correlating it

If you don’t capture the right kind of telemetry data, nothing else matters.

APM solutions inject agents that run 24×7 to monitor the runtime and behavior of applications written in .NET, Java, Node.js, PHP, Ruby, Go, and dozens of other languages. These agents collect data on all the individual services as they snake through the application ecosystem. Then APM stitches together all the data to understand which services the application calls and how the performance of each discrete service call impacts performance of the overall transaction. The various KPIs revolve around response times, availability (up/down, green/red), and the app users’ digital experience. The volume of data to be captured is incredibly broad, but not especially deep.

Datadog metrics

APM is primarily concerned with response times and availability. Here, Datadog shows red/green status and aggregated metrics.

AppDynamic service map

Here, AppDynamics shows the individual response times for various interconnected services.

The telemetry details to be captured and correlated for data applications/pipelines, on the other hand, need to be both broad and extremely deep. A modern data workload comprises hundreds of jobs, each broken down into parallel-processing parts, with each part executing various tasks. And each job feeds into hundreds of other jobs and applications not only in this particular pipeline but all the other pipelines in the system.

Today’s pipelines are built on an assortment of distributed processing engines; each might be able to monitor its own application’s jobs but not show you how everything works as a whole. You need to see details at a highly granular level (for each sub-task within each sub-part of each job) and then marry them together into a single pane of glass that comprises the bigger picture at the application, pipeline, platform, and cluster levels.

app level metrics

DataOps observability (here, Unravel) looks at completely different metrics at the app level . .

pipeline metrics

. . . as well as the pipeline level.

Let’s take troubleshooting a slow Spark application as an example. The information you need to investigate the problem lives in a bunch of different places, and the various tools for getting this data give you only some of what you need, not all.

The Spark UI can tell you about the status of individual jobs but lacks infrastructure and configuration details and other information to connect together a full pipeline view. Spark logs help you retrospectively find out what happened to a given job (and even what was going on with other jobs at the same time) but don’t have complete information about resource usage, data partitioning, container configuration settings, and a host of other factors that can affect performance. And, of course, Spark tools are limited to Spark. But Spark jobs might have data coming in from, say, Kafka and run alongside a dozen other technologies.

Conversely, platform-specific interfaces (Databricks, Amazon EMR, Dataproc, BigQuery, Snowflake) have the information about resource usage and the status of various services at the cluster level, but not the granular details at the application or job level.
Data pipeline details that APM doesn't capture
Having all the information specific for data apps is a great start, but it isn’t especially helpful if it’s not all put into context. The data needs to be correlated, visualized, and analyzed in a purposeful way that lets you get to the information you need easily and immediately.

Then there’s how data is visualized and analyzed

Even the way you need to look at a data application environment is different. A topology map for a web application shows dependencies like a complex spoke-and-wheel diagram. When visualizing web app environments, you need to see the service-to-service interrelationships in a map like this:

Dynatrace map

How Dynatrace visualizes service dependencies in a topology map.

With drill-down details on service flows and response metrics:

Dynatrace drill down

Dynatrace drill-down details

For a modern data environment, you need to see how all the pipelines are interdependent and in what order. The view is more like a complex system of integrated funnels:

pipeline complexity

A modern data estate involves many interrelated application and pipeline dependencies (Source: Sandeep Uttamchandani)

You need full observability into not only how all the pipelines relate to one another, but all the dependencies of multiple applications within each pipeline . . .

visibility into individual applications within data pipeline

An observability platform purpose-built for modern data stacks provides visibility into all the individual applications within a particular pipeline

. . . with granular drill-down details into the various jobs within each application. . .

visibility into all the jobs within each application in a data pipeline

 . . and the sub-parts of each job processing in parallel . . .

granular visibility into sub-parts of jobs in an application within a data pipeline

How things get fixed

Monitoring and observability tell you what’s going on. But to understand why, you need to go beyond observability and apply AI/ML to correlate patterns, identify anomalies, derive meaningful insights, and perform automated root cause analysis. “Beyond observability” is a continuous and incremental spectrum, from understanding why something happened to knowing what to do about it to automatically fixing the issue. But to make that leap from good to great, you need ML models and AI algorithms purpose-built for the task at hand. And that means you need complete data about everything in the environment.

spectrum of automated observability from correlation to self-healing
The best APM tools have some sort of AI/ML-based engine (some are more sophisticated than others) to analyze millions of data points and dependencies, spot anomalies, and alert on them.

For data applications/pipelines, the type of problems and their root causes are completely different than web apps. The data points and dependencies needing to be analyzed are completely different. The patterns, anomalous behavior, and root causes are different. Consequently, the ML models and AI algorithms need to be different.

In fact, DataOps observability needs to go even further than APM. The size of modern data pipelines and the complexities of multi-layered dependencies (from clusters to platforms and frameworks to applications to jobs within applications to sub-parts of those jobs to the various tasks within each sub-part) could lead to a lot of trial-and-error resolution effort even if you know what’s happening and why. What you really need to know is what to do.

An AI-driven recommendation engine like Unravel goes beyond the standard idea of observability to tell you how to fix a problem. For example, if there’s one container in one part of one job that’s improperly sized and so causing the entire pipeline to fail, Unravel not only pinpoints the guilty party but tells you what the proper configuration settings would be. Or another example is that Unravel can tell you exactly why a pipeline is slow and how you can speed it up. This is because Unravel’s AI has been trained over many years to understand the specific intricacies and dependencies of modern data stacks.

AI can identify exactly where and how to optimize performance

AI recommendations tell you exactly what to do to optimize for performance.

Business impact

Sluggish or broken web applications cost organizations money in terms of lost revenue and customer dissatisfaction. Good APM tools are able to put the problem into a business context by providing a lot of details about how many customer transactions were affected by an app problem.

As more and more of an organization’s operations and decision-making revolve around data analytics, data pipelines that miss SLAs or fail outright have an increasingly significant (negative) impact on the company’s revenue, productivity, and agility. Businesses must be able to depend on their data applications, so their applications need to have predictable, reliable behavior.

For example: If a fraud prevention data pipeline stops working for a bank, it can cause billions of dollars in fraudulent transactions going undetected. Or a slow healthcare analysis pipeline may increase risk for patients by failing to provide timely responses. Measuring and optimizing performance of data applications and pipelines correlates directly to how well the business is performing.

Businesses need proactive alerts when pipelines deviate from their normal behavior. But going “beyond observability” would tell them automatically why this is happening and what they can do to get the application back on track. This allows businesses to have reliable and predictable performance.

There’s also an immediate bottom-line impact that businesses need to consider: maximizing their return on investment and controlling/optimizing cloud spend. Modern data applications process a lot of data, which usually consumes a large amount of resources, and the meter is always running. This means the cloud bills can rack up fast.

To keep costs from spiraling out of control, businesses need actionable intelligence on how best to optimize their data pipelines. An AI recommendations engine can take all the profile and other key information it has about applications and pipelines and identify where jobs are overprovisioned or could be tuned for improvement. For example: optimizing code to remove inefficiencies, right-sizing containers to avoid wastage, providing the best data partitioning based on goals, and much more.

AI can identify exactly where and how to optimize cloud costs

AI recommendations pinpoint exactly where and how to optimize for cost.

AI recommendations and deep insights lay the groundwork for putting in place some automated cost guardrails for governance. Governance is really all about converting the AI recommendations and insights into impact. Automated guardrails (per user, app, business unit, project) would alert operations teams about unapproved spend, potential budget overruns, jobs that run over a certain time/cost threshold, and the like. You can then proactively manage your budget, rather than getting hit with sticker shock after the fact.

In a nutshell

Application monitoring and observability solutions like Datadog, Dynatrace, and AppDynamics are excellent tools for web applications. Their telemetry, correlation, anomaly detection, and root cause analysis capabilities do a good job of helping you understand, troubleshoot, and optimize most areas of your digital ecosystem, one exception being the modern data stack. They are by design built for general-purpose observability of user interactions.

In contrast, an observability platform for the modern data stack like Unravel is more specialized. Its telemetry, correlation, anomaly detection, and root cause analysis capabilities (and, in Unravel’s case uniquely, its AI-powered remediation recommendations, automated guardrails, and automated remediation) are by design built specifically to understand, troubleshoot, and optimize modern data workloads.

Observability is all about context. Traditional APM provides observability in context for web applications, but not for data applications and pipelines. That’s not a knock on these APM solutions. Far from it. They do an excellent job at what they were designed for. They just weren’t built for observability of the modern data stack. That requires another kind of solution designed specifically for a different kind of animal.

The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/feed/ 1
Building vs. Buying Your Modern Data Stack: A Panel Discussion https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/ https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/#respond Thu, 21 Apr 2022 19:55:09 +0000 https://www.unraveldata.com/?p=9214 Abstract Infinity Loop Background

One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers […]

The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

]]>
Abstract Infinity Loop Background

One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers of the enterprise infrastructure stack. But in the last five years, even in just the last year alone, it’s hard to think of a part of IT that has seen more dramatic change than that of the modern data stack.

DataOps Unleashed build vs buy panel

These transformations shape how today’s businesses engage and work with data. Moderated by Lightspeed Venture Partners’ Nnamdi Iregbulem, the panel’s three conversation partners (Andrei Lopatenko, VP of Engineering at Zillow; Gokul Prabagaren, Software Engineering Manager at Capital One; and Aaron Richter, Data Engineer at Squarespace) weighed in on the build versus buy question and walked us through their thoughts:

  • What motivates companies to build instead of buy?
  • How do particular technologies and/or goals affect their decision?

These issues and other considerations were discussed. A few of the highlights follow, but the entire session is available on demand here.

What are the key variables to consider when deciding whether to build or buy in the data stack? 

Gokul: I think the things which we probably consider most are what kind of customization a particular product offers or what we uniquely need. Then there are the cases in which we may need unique data schemas and formats to ingest the data. We must consider how much control we have of the product and also our processing and regulatory needs. We have to ask how we will be able to answer those kinds of questions if we are building in-house or choosing to adopt an outsourced product.

Gokul Prabagaren quote build vs buy

Aaron: Thinking from the organizational perspective, there are a few factors that come from just purchasing or choosing to invest in something. Money is always a factor. It’s going to depend on the organization and how much you’re willing to invest. 

Beyond that a key factor is the expertise of the organization or the team. If a company has only a handful of analysts doing the heavy-lifting data work, to go in and build an orchestration tool would take them away from their focus and their expertise of providing insights to the business. 

Andrei: Another important thing to consider is the quality of the solution. Not all the data products on the market have high quality from different points of view. So sometimes it makes sense to build something, to narrow the focus of the product. Compatibility with your operations environment is another crucial consideration when choosing build versus buy.

What’s the more compelling consideration: saving headcount or increasing productivity of the existing headcount? 

Aaron: In general, everybody’s oversubscribed, right? Everybody always has too much work to do. And we don’t have enough people to accomplish that work. From my perspective, the compelling part is, we’re going to make you more efficient, we’re going to give you fewer headaches, and you’ll have fewer things to manage. 

Gokul: I probably feel the same. It depends more on where we want to invest and if we’re ready to change where we’re investing: upfront costs or running costs. 

Andrei: And development costs: do we want to buy this, or invest in building? And again, consider the human equation. It’s not just the number of people in your headcount. Maybe you have a small number of engineers, but then you have to invest more of their time into data science or data engineering or analytics. Saving time is a significant factor when making these choices.

Andrei Lopatenko quote build vs buy

How does the decision matrix change when the cloud becomes part of the consideration set in terms of build versus buy? 

Gokul: I feel like it’s trending towards a place where it’s more managed. That may not be the same question as build or buy. But it skews more towards the manage option, because of that compatibility, where all these things are available within the same ecosystem. 

Aaron: I think about it in terms of a cloud data warehouse: some kind of processing tool, like dbt; and then some kind of orchestration tool, like Airflow or Prefect; and there’s probably one pillar on that side, where you would never think to build it yourself. And that’s the cloud data warehouse. So you’re now kind of always going to be paying for a cloud vendor, whether it’s Snowflake or BigQuery or something of that nature.

Aaron Richter quote build vs buy 

So you already have your foot in the door there, and you’re already buying, right? So then that opens the door now to buying more things, adding things on that integrate really easily. This approach helps the culture shift. If a culture is very build-oriented, this allows them to be more okay with buying things. 

Andrei: Theoretically you want to have your infrastructure independent on cloud, but it never happens, for multiple reasons. Firstly, cloud company tools make integration work much easier. Second, of course, once you have to think about multi-cloud, you must address privacy and security concerns. In principle, it’s possible to be independent, but you’ll often run into a lot of technical problems. There are multiple different factors when cloud becomes key in deciding what you will make and what tools to use.

See the entire Build vs. Buy roundtable discussion on demand
Watch now

The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/feed/ 0
Data Pipeline HealthCheck and Analysis https://www.unraveldata.com/resources/data-pipeline-healthcheck/ https://www.unraveldata.com/resources/data-pipeline-healthcheck/#respond Tue, 17 Aug 2021 20:07:31 +0000 https://www.unraveldata.com/?p=7181 Abstract 3D Tunnel Binary Code

Why Data Pipelines Become Complex Companies across a multitude of industries, including entertainment, finance, transportation, and healthcare, are on their way to becoming data companies. These companies are creating advanced data products with insights generated using […]

The post Data Pipeline HealthCheck and Analysis appeared first on Unravel.

]]>
Abstract 3D Tunnel Binary Code

Why Data Pipelines Become Complex

Companies across a multitude of industries, including entertainment, finance, transportation, and healthcare, are on their way to becoming data companies. These companies are creating advanced data products with insights generated using data pipelines. To build data pipelines, you need a modern stack that involves a variety of systems, such as Airflow for orchestration, Snowflake or Redshift as the data warehouse, Databricks and Presto for advanced analytics on a data lake, and Kafka and Spark Streaming to process data in real-time, just to name a few. Combining all these systems naturally causes your data pipelines to become complex, however. In this talk, Shivnath shares a recipe for dealing with this complexity and keeping your data pipelines healthy.

What Is a Healthy Data Pipeline?

Pipeline health can be viewed along three dimensions: correctness, performance, and cost. To understand pipeline health, Shivnath described what an unhealthy pipeline would look like along the three dimensions.

In this presentation, Shivnath not only shares tips and tricks to monitor the health of your pipelines, and to give you confidence that your pipelines are healthy, but he also shares ways to troubleshoot and fix problems that may arise if and when pipelines fall ill.

 

HealthChecks Diagram

HealthCheck for Data Pipelines

Correctness

Users cannot be expected to write and define all health checks, however. For example, a user may not define a check because the check is implicit. But if this check is not evaluated, a false negative can arise, meaning that your pipeline might generate incorrect results that will be hard to resolve. Luckily, you can avoid this problem by having a system that runs automatic checks – for example, automatically detecting anomalies or changes. It is important to note, however, that automatic checks can instead induce false positives. Balancing false positives and false negatives remains a challenge today. A good practice is to design your checks in parallel with designing the pipeline.
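As a simple illustration (not from the talk itself), here is a hypothetical Python sketch of turning an implicit expectation into an explicit correctness check that runs as its own pipeline step. The dataset path, column names, and thresholds are made up for the example.

```python
# Hypothetical sketch: an explicit correctness check for one batch,
# run as its own pipeline step so failures surface in context.
from typing import List

import pandas as pd

def check_orders_batch(df: pd.DataFrame) -> List[str]:
    """Return a list of failed-check messages for one batch."""
    failures = []
    if df.empty:
        failures.append("batch is empty")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if df["amount"].lt(0).any():
        failures.append("negative amounts")
    if df["customer_id"].isna().any():
        failures.append("missing customer_id")
    return failures

batch = pd.read_parquet("s3://example-bucket/orders/2021-08-17.parquet")  # placeholder
failed = check_orders_batch(batch)
if failed:
    # Fail the stage so the problem is caught upstream of downstream consumers.
    raise ValueError(f"Correctness checks failed: {failed}")
```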

But what do you do when these checks fail? Of course you have to troubleshoot the problem and fix it, but it is also important to capture failed checks in the context of the pipeline execution. There are two reasons why. One, the root cause of the failure may lie upstream in the pipeline, so you must understand the lineage of the pipeline. And two, a lot of checks fail because of changes in your pipeline. Having a history of these pipeline runs is key to understanding what has changed.

Performance

Shivnath has some good news for you: performance checks involve less context and application semantics than correctness checks. The best practice is to define end-to-end performance checks in the form of pipeline SLAs. You can also define checks at different stages of the pipeline. For example, with Airflow you can easily specify the maximum time a task can take and define a check on that limit. Automatic checks for performance are useful because, just as with correctness checks, users don’t have to specify every check themselves. Again, keeping in mind the caveat about false positives and false negatives, it is both critical and easy to build the appropriate baselines and detect deviations from them across pipeline runs.
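
As a toy illustration of such a stage-level performance check (a minimal sketch, not Unravel’s or Airflow’s API; the stage name and SLA value are made up), you can time a pipeline stage and fail fast when it breaches its SLA:

// Wrap a pipeline stage, measure its duration, and raise an alert when the SLA is breached
def withSlaCheck[T](stage: String, slaMillis: Long)(body: => T): T = {
  val start = System.currentTimeMillis()
  val result = body
  val elapsedMillis = System.currentTimeMillis() - start
  if (elapsedMillis > slaMillis)
    throw new RuntimeException(
      s"SLA check failed: stage '$stage' took $elapsedMillis ms, SLA is $slaMillis ms")
  result
}

// Example: the hypothetical "transform" stage must finish within 10 minutes
val rowCount = withSlaCheck("transform", slaMillis = 10 * 60 * 1000L) {
  // ... run the stage (e.g., a Spark job) and return something measurable
  42L
}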

The timing of checks is also important. For example, it probably wouldn’t be helpful if a check only fails after the pipeline SLA has already been missed. A best practice is to keep pipeline stages short and frequent so checks can also be seen and evaluated often.

When it comes to troubleshooting and tuning, Shivnath notes that it is important to have what he calls a “single pane of glass”. Pipelines can be complex with many moving parts, so having an end-to-end view is very important to troubleshoot problems. For example, due to multi-tenancy, an app that is otherwise unrelated to your pipeline may affect your pipeline’s performance. This is another example of where having automatic insights about what caused a problem is vital.

Cost

Just like performance checks, HealthChecks for cost also require less context and application semantics compared to correctness checks. But when it comes to cost checks, early detection is especially important. If a check is not executed, or a failed check is not fixed, and as a result there is a cost overrun, there can be severe consequences. So it’s very important to troubleshoot and fix problems as soon as checks fail.

Costs can be incurred in many different ways, including cost of storage or cost of compute. Therefore, a single pane of glass is again useful, as well as a detailed cost breakdown. Lastly, automated insights to remediate problems are also critical.

HealthCheck Demos

After Shivnath describes the different kinds of HealthChecks and the impact they have on helping you monitor and track your pipelines, he demos a couple of different scenarios where HealthChecks fail – first at the performance level, then at the cost level, and lastly at the correctness level.

That was a short demo of how HealthChecks can make it very easy to manage complex pipelines. And this blog just gives you a taste of how HealthChecks can help you find and fix problems, as well as streamline your pipelines. You can view Shivnath’s full session from Airflow Summit 2021 here. If you’re interested in assessing Unravel for your own data-driven applications, you can create a free account or contact us to learn how we can help.

The post Data Pipeline HealthCheck and Analysis appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/data-pipeline-healthcheck/feed/ 0
Accelerate Amazon EMR for Spark & More https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/ https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/#respond Mon, 21 Jun 2021 21:54:59 +0000 https://www.unraveldata.com/?p=8119

The post Accelerate Amazon EMR for Spark & More appeared first on Unravel.

]]>

The post Accelerate Amazon EMR for Spark & More appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/feed/ 0
Strategies To Accelerate Performance for Databricks https://www.unraveldata.com/resources/accelerate-performance-for-databricks/ https://www.unraveldata.com/resources/accelerate-performance-for-databricks/#respond Fri, 18 Jun 2021 21:57:29 +0000 https://www.unraveldata.com/?p=8121

The post Strategies To Accelerate Performance for Databricks appeared first on Unravel.

]]>

The post Strategies To Accelerate Performance for Databricks appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/accelerate-performance-for-databricks/feed/ 0
Effective Cost and Performance Management for Amazon EMR https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/ https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/#respond Wed, 28 Apr 2021 22:06:57 +0000 https://www.unraveldata.com/?p=8132

The post Effective Cost and Performance Management for Amazon EMR appeared first on Unravel.

]]>

The post Effective Cost and Performance Management for Amazon EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/feed/ 0
How To Get the Best Performance & Reliability Out of Kafka & Spark Applications https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/ https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/#respond Thu, 11 Mar 2021 22:10:06 +0000 https://www.unraveldata.com/?p=8136

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

]]>

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/feed/ 0
Standardizing Business Metrics & Democratizing Experimentation at Intuit https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/ https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/#respond Wed, 27 Jan 2021 13:00:24 +0000 https://www.unraveldata.com/?p=5841

CDO Battlescars Takeaways: Standardizing Business Metrics & Democratizing Experimentation at Intuit CDO Battlescars is a podcast series hosted by Sandeep Uttamchandani, Unravel Data’s CDO. He talks to data leaders across data engineering, analytics, and data science […]

The post Standardizing Business Metrics & Democratizing Experimentation at Intuit appeared first on Unravel.

]]>

CDO Battlescars Takeaways: Standardizing Business Metrics & Democratizing Experimentation at Intuit

CDO Battlescars is a podcast series hosted by Sandeep Uttamchandani, Unravel Data’s CDO. He talks to data leaders across data engineering, analytics, and data science about the challenges they encountered in their journey of transforming raw data into insights. The motivation of this podcast series is to give back to the data community the hard-learned lessons that Sandeep and his peer data leaders have learned over the years. (Thus, the reference to battlescars.)

In this episode, Sandeep talked to Anil Madan about battlescars in two areas: Standardization of Business Metrics and Democratization of Experimentation at Intuit.

At Intuit, Anil was the VP of Data and Analytics for Intuit’s Small Business and Self Employed group. He has over 25 years of experience in the data space across Intuit, PayPal, and eBay, and is now the VP of Data Platforms at Walmart. Anil is a pioneer in building data infrastructure and creating value from data across products, experimentation, digital marketing, payments, Fintech, and many other areas. Here are the key talking points from this insightful chat!

Standardization of Business Metrics

The Problem: Slow and Inconsistent Processing

  • Anil led data and analytics for Intuit’s QuickBooks software. The goal of QuickBooks is to create a platform where they can power all the distinct needs of small businesses seamlessly.
  • As they looked at their customer base, the key metric Anil’s team measured was signups and subscriptions. This metric needed to be consistent in order to have standards that could be relied on to drive business.
  • This metric used to be inconsistent, and the time-to-value ranged from 24 to 48 hours because their foundational pipelines had several hubs.

The Solution: Simplify, Measure, Improve, and Move to Cloud
To solve this problem, Anil shared the following insights:

  • Determine what the true metric definition was and establish the right counting rules and documentation.
  • Define the right SLA for each data source.
  • Invest deeply in programmatic tools, building a lineage called Superglue, which would start traversing right from the source all the way into reporting.
  • Create micro-Spark pipelines, moving away from traditional MapReduce and monolithic ways of processing.
  • Migrate to the cloud to help them auto-scale their compute capacity.
  • Establish the right governance so that schema changes in any source would be detected through their programs.

The analytics teams, business leaders, and finance teams all looked at business definitions. As they launch new product offerings, or new monetization and consumption patterns, they review the metrics against these new patterns and ensure that business definitions are still holding true.

Democratization of Experimentation

Moving to democratizing experimentation, Anil was involved in significantly transforming the experimentation trajectory.

Anil breaks this transformation into people, processes, and technology:

  • From a people perspective: understanding what the challenges were and where the handoffs were happening, determining whether the right skills were in place, and hiring data analysts who specialize in the experimentation space.
  • On the process side: ensuring there was a single place where they could measure what’s really happening at every step.
  • On the technology side: looking at several examples from the industry to decide where to invest in platform capabilities. They invested in things like metrics generation to establish overall evaluation criteria for each experiment. They also looked at techniques to move faster, like sequential testing, interleaving, and even quasi-experimentation.

Focusing on key metrics and ensuring that there are no mistakes in setting up and running the experiment became extremely important. They invested in a lot of programmatic ways to do that.

Anil’s team looked into where time is spent in the overall experimentation process and then focused on addressing those areas through a combination of building programs, leveraging programs that are open source, and even buying the needed software.

Focusing on the foundational pieces around how your pipelines work and how you analyze things, and investing in those foundational pieces, is key.

Though we highlighted the key talking points, Sandeep and Anil talked about so much more in the 29-minute podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

Listen to Podcast on Spotify

The post Standardizing Business Metrics & Democratizing Experimentation at Intuit appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/feed/ 0
“More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/ https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/#respond Fri, 18 Dec 2020 21:42:41 +0000 https://www.unraveldata.com/?p=5694

  Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…” Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. […]

The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

]]>

 

Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…”

Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. And these #datalovers generated some valuable data – including the interesting fact that more than 60% of surveyed customers have SLAs for either “more than 50% of their pipelines” (42%) or “all of their pipelines” (21%). 

All of this ties together. Unravel makes it much easier for customers to set attainable SLAs for their pipelines, and to meet those SLAs once they’re set. Let’s look at some more data-driven findings from the conference. 

And, if you’re an Unravel Data customer, reach out to access much more information, including webinar-type recordings of all five customer and industry expert presentations, with polling and results presentations interspersed throughout. If you are not yet a customer, but want to know more, you can create a free account or contact us

Unravel Data CEO Kunal Agarwal kicking off Untold. 

Note: “The plural of anecdotes is not data,” the saying goes – so, nostra culpa. The findings discussed herein are polling results from our customers who attended Untold, and they fall short of statistical significance. But they do represent the opinions of some of the most intense #datalovers in leading corporations worldwide – Unravel Data’s customer base. (“The great ones skate to where the puck’s going to be,” after Wayne Gretzky…) 

More Than 60% of Customer Pipelines Have SLAs

Using Unravel, more than 60% of the pipelines managed by our customers have SLAs:

  • More than 20% have SLAs for all their pipelines. 
  • More than 40% have SLAs for more than half of their pipelines. 
  • Fewer than 40% have SLAs for fewer than half their pipelines (29%) or fewer than a quarter of them (8%). 

Pipelines were, understandably, a huge topic of conversation at Untold. Complex pipelines, and the need for better tools to manage them, are very common amongst our customers. And Unravel makes it possible for our customers to set, and meet, SLAs for their pipelines. 

What percentage of your data pipelines have SLAs?
  • <25% – 8.3%
  • >25-50% – 29.2%
  • >50% – 41.7%
  • All of them – 20.8%

 

Bad Applications Cause Cost Overruns

We asked our attendees the biggest reason for cost overruns:

  • Roughly three-quarters replied, “Bad applications taking too many resources.” Finding out which applications all the resources are being consumed by is a key feature of Unravel software. 
  • One in five replied, “Oversized containers.” Now, not everyone is using containers yet, so we are left to wonder just how many container users are unnecessarily supersizing their containers. Most of them?
  • And the remaining answer was, “small files.” One of the strongest presentations at the Untold conference was about the tendency of bad apps to generate a myriad of small files that consume a disproportionate share of resources; you can use Unravel to generate a small files report and track these problems down. 

What is usually the biggest reason for cost overruns?
  • Bad applications taking too many resources – 75.0%
  • Oversized containers – 20.0%
  • Small files – 5.0%

 

Two-Thirds Know Their Most Expensive App/User

Amongst Unravel customers, almost two-thirds can identify their most expensive app(s) and user(s). Identifying costly apps and users is one of the strengths of Unravel Data:

  • On-premises, expensive apps and users consume far more than their share of system resources, slowing other jobs and contributing to instability and crashes. 
  • In moving to cloud, knowing who’s costing you in your on-premises estate – and whether the business results are worth the expense – is crucial to informed decision-making. 
  • In the cloud, “pay as you go” means that, as one of our presenters described it, “When you go, you pay.” It’s very easy for 20% of your apps and users to generate 80% of your cloud bill, and it’s very unlikely for that inflated bill to also represent 80% of your IT value. 
Do you know who is the most expensive user / app on your system?
  • Yes – 65.0%
  • No – 25.0%
  • No, but would be great to know – 10.0%

 

An Ounce of Prevention is Worth a Pound of Cure

Knowing whether you have rogue users and/or apps on your system is very valuable:

  • A plurality (43%) of Unravel customers have rogue users/apps on their cluster “all the time.” 
  • A minority (19%) see this about once a day. 
  • A near-plurality (38%) only see this behavior once a week or so. 

We would bet good money that non-Unravel customers see a lot more rogue behavior than our customers do. With Unravel, you can know exactly who and what is “going rogue” – and you can help stakeholders get the same results with far less use of system resources and cost. This cuts down on rogue behavior, freeing up room for existing jobs, and for new applications to run with hitherto unattainable performance. 

How often do you have rogue users/apps on your cluster?
  • All the time! – 42.9%
  • Once a day – 19.0%
  • Once a week – 38.1%

 

Unravel Customers Are Fending Off Bad Apps

Once you find and improve the bad apps that are currently in production, the logical next step is to prevent bad apps from even reaching production in the first place. Unravel customers are meeting this challenge:

  • More than 90% of attendees find automation helpful in preventing bad quality apps from being promoted into production. 
  • More than 80% have a quality gate when promoting apps from Dev to QA, and from QA to production. 
  • More than half have a well-defined DataOps/SDLC (software development life cycle) process, and nearly a third have a partially-defined process. Only about one-eighth have neither. 
  • About one-quarter have operations people/sysadmins troubleshooting their data pipelines; another quarter put the responsibility onto the developers or data engineers who create the apps. Nearly half make a game-time decision, depending on the type of issue, or use an “all hands on deck” approach with everyone helping. 

The Rest of the Story

  • More than two-thirds are finding their data costs to be running over budget. 
  • More than half are in the process of migrating to cloud – though only a quarter have gotten to the point of actually moving apps and data, then optimizing and scaling the result. 
  • Half find that automating problem identification, root cause analysis, and resolution saves them 1-5 hours per issue; the other half save from 6-10 hours or more. 
  • Somewhat more than half find their clusters to be on the over-provisioned side. 
  • Nearly half have ten or more technologies in representative data pipelines. 

Finding Out More

Unravel Data customers – #datalovers all – look to be well ahead of the industry in managing issues that relate to big data, streaming data, and moving to the cloud. If you’re interested in joining this select group, you can create a free account or contact us. (There may still be some Untold conference swag left for new customers!) 

The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/feed/ 0
Spark APM – What is Spark Application Performance Management https://www.unraveldata.com/resources/what-is-spark-apm/ https://www.unraveldata.com/resources/what-is-spark-apm/#respond Thu, 17 Dec 2020 21:01:51 +0000 https://www.unraveldata.com/?p=5675

What is Spark? Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API […]

The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

]]>

What is Spark?

Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API for developers.

Designed specifically for data science, Spark has evolved to support more use cases, including real-time stream event processing. Spark is also widely used in AI and machine learning applications.

There are six main reasons to use Spark.

1. Speed – Spark has an advanced engine that can deliver up to 100 times faster processing than traditional MapReduce jobs on Hadoop.

2. Ease of use – Spark supports Python, Java, R, and Scala natively, and offers dozens of high-level operators that make it easy to build applications.

3. General purpose – Spark offers the ability to combine all of its components to make a robust and comprehensive data product.

4. Platform agnostic – Spark runs in nearly any environment.

5. Open source – Spark and its components and complementary technologies are free and open source; users can change the code as needed.

6. Widely used – Spark is the industry standard for many tasks, so expertise and help are widely available.

A Brief History of Spark

Today, when it comes to parallel big data analytics, Spark is the dominant framework that developers turn to for their data applications. But what came before Spark?

In the early 2000s, Doug Cutting and Mike Cafarella started working on an open, distributed computing platform as part of an open source web-crawler project. A few years later – with heavy backing from Yahoo! – this work was spun out as an open source project called Hadoop.

Around the same time, Google published its MapReduce paper, describing the programming model it used to process its enormous volumes of data. While Hadoop grew in popularity for its ability to store and process massive volumes of data, developers at Facebook wanted to provide their data science team with an easier way to work with their data in Hadoop. As a result, they created Hive, a data warehousing framework built on top of Hadoop.

Even though Hadoop was gaining wide adoption at this point, there really weren’t any good interfaces for analysts and data scientists to use. So, in 2009, a group of researchers at the UC Berkeley AMPLab started a new project to solve this problem. Thus Spark was born – and it was released as open source a year later.

What is Spark APM?

Spark enables rapid innovation and high performance in your applications. But as your applications grow in complexity, inefficiencies are bound to be introduced. These inefficiencies add up to significant performance losses and increased processing costs.

For example, a Spark cluster may have idle time between batches because of slow data writes on the server. The cluster sits idle because the next batch can’t start until all of the previous batch’s tasks have completed. Your Spark jobs are “bottlenecked on writes.”

When this happens, you can’t scale your application horizontally – adding more servers to help with processing won’t improve your application performance. Instead, you’d just be increasing the idle time of your clusters.

Unravel Data for Spark APM

This is where Unravel Data comes in to save the day. Unravel Data for Spark provides a comprehensive full-stack, intelligent, and automated approach to Spark operations and application performance management across your modern data stack architecture.

The Unravel platform helps you analyze, optimize, and troubleshoot Spark applications and pipelines in a seamless, intuitive user experience. Operations personnel, who have to manage a wide range of technologies, don’t need to learn Spark in great depth in order to significantly improve the performance and reliability of Spark applications.

See Unravel for Spark in Action

Create a free account

Example of Spark APM with Unravel Data

A Spark application consists of one or more jobs, each of which in turn has one or more stages. A job corresponds to a Spark action – for example, count, take, foreach, etc. Within the Unravel platform, you can view all of the details of your Spark application.
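
As a brief illustration of that relationship (a minimal sketch; the application name and data are arbitrary), each action below triggers its own Spark job, while the transformations before it remain lazy:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("apm-demo").getOrCreate()
import spark.implicits._

val df = spark.range(1, 1000000).toDF("id")    // transformation: lazy, no job yet
val even = df.filter($"id" % 2 === 0)          // transformation: still no job
val total = even.count()                       // action #1: triggers a Spark job (with its own stages)
val sample = even.take(5)                      // action #2: triggers another job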

Unravel’s Spark APM lets you:

  • Quickly see which jobs and stages consume the most resources
  • View your app as a resilient distributed dataset (RDD) execution graph
  • Drill into the source code from the stage tile, Spark stream batch tile, or the execution graph to locate the problems

Unravel Data for Spark APM can then help you:

  • Resolve inefficiencies, bottlenecks, and reasons for failure within apps
  • Optimize resource allocation for Spark drivers and executors
  • Detect and fix poor partitioning
  • Detect and fix inefficient and failed Spark apps
  • Use recommended settings to tune for speed and efficiency

Improve Your Spark Application Performance

To learn how Unravel can help improve the performance of your applications, create a free account and take it for a test drive on your Spark applications. Or, contact Unravel.

The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/what-is-spark-apm/feed/ 0
Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/ https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/#respond Sat, 12 Dec 2020 03:44:11 +0000 https://www.unraveldata.com/?p=5632 Logo for Apache Spark 3.0

The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often […]

The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

]]>
Logo for Apache Spark 3.0

The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often treat Catalyst as a black box that just magically works. Moreover, only a handful of resources are available that explain its inner workings in an accessible manner.

When discussing Catalyst, many presentations and articles reference this architecture diagram, which was included in the original blog post from Databricks that introduced Catalyst to a wider audience. However, that diagram depicts the physical planning phase inadequately, and it is ambiguous about which kinds of optimizations are applied at which point.

The following sections explain Catalyst’s logic, the optimizations it performs, its most crucial constructs, and the plans it generates. In particular, the scope of cost-based optimizations is outlined. These were advertised as “state-of-the-art” when the framework was introduced, but, as described below, they come with significant limitations. And, last but not least, we have redrawn the architecture diagram:

The Spark Catalyst Pipeline

This diagram and the description below focus on the second half of the optimization pipeline and do not cover the parsing, analyzing, and caching phases in much detail.

Diagram 1: The Catalyst pipeline

The input to Catalyst is a SQL/HiveQL query or a DataFrame/Dataset object which invokes an action to initiate the computation. This starting point is shown in the top left corner of the diagram. The relational query is converted into a high-level plan, against which many transformations are applied. The plan becomes better-optimized and is filled with physical execution information as it passes through the pipeline. The output consists of an execution plan that forms the basis for the RDD graph that is then scheduled for execution.

Nodes and Trees

Six different plan types are represented in the Catalyst pipeline diagram. They are implemented as Scala trees that descend from TreeNode. This abstract top-level class declares a children field of type Seq[BaseType]. A TreeNode can therefore have zero or more child nodes that are also TreeNodes in turn. In addition, multiple higher-order functions, such as transformDown, are defined, which are heavily used by Catalyst’s transformations.

This functional design allows optimizations to be expressed in an elegant and type-safe fashion; an example is provided below. Many Catalyst constructs inherit from TreeNode or operate on its descendants. Two important abstract classes that were inspired by the logical and physical plans found in databases are LogicalPlan and SparkPlan. Both are of type QueryPlan, which is a direct subclass of TreeNode.

In the Catalyst pipeline diagram, the first four plans from the top are LogicalPlans, while the bottom two – Spark Plan and Selected Physical Plan – are SparkPlans. The nodes of logical plans are algebraic operators like Join; the nodes of physical plans (i.e. SparkPlans) correspond to low-level operators like ShuffledHashJoinExec or SortMergeJoinExec. The leaf nodes read data from sources such as files on stable storage or in-memory lists. The tree root represents the computation result.

Transformations

An example of Catalyst’s trees and transformation patterns is provided below. We have used expressions to avoid verbosity in the code. Catalyst expressions derive new values and are also trees. They can be connected to logical and physical nodes; an example would be the condition of a filter operator.

The following snippet transforms the arithmetic expression –((11 – 2) * (9 – 5)) into – ((1 + 0) + (1 + 5)):


import org.apache.spark.sql.catalyst.expressions._
val firstExpr: Expression = UnaryMinus(Multiply(Subtract(Literal(11), Literal(2)), Subtract(Literal(9), Literal(5))))
val transformed: Expression = firstExpr transformDown {
  case BinaryOperator(l, r) => Add(l, r)
  case IntegerLiteral(i) if i > 5 => Literal(1)
  case IntegerLiteral(i) if i < 5 => Literal(0)
}

The root node of the Catalyst tree referenced by firstExpr has the type UnaryMinus and points to a single child, Multiply. This node has two children, both of type Subtract.

The first Subtract node has two Literal child nodes that wrap the integer values 11 and 2, respectively, and are leaf nodes. firstExpr is refactored by calling the predefined higher-order function transformDown with custom transformation logic: The curly braces enclose a function with three pattern matches. They convert all binary operators, such as multiplication, into addition; they also map all numbers that are greater than 5 to 1, and those that are smaller than 5 to zero.

Notably, the rule in the example gets successfully applied to firstExpr without covering all of its nodes: UnaryMinus is not a binary operator (having a single child) and 5 is not accounted for by the last two pattern matches. The parameter type of transformDown is responsible for this positive behavior: It expects a partial function that might only cover subtrees (or not match any node) and returns “a copy of this node where `rule` has been recursively applied to it and all of its children (pre-order)”.

This example appears simple, but the functional techniques are powerful. At the other end of the complexity scale, for example, is a logical rule that, when it fires, applies a dynamic programming algorithm to refactor a join.

Catalyst Plans

The following (intentionally bad) code snippet will be the basis for describing the Catalyst plans in the next sections:


val result = session.read.option("header", "true").csv(outputLoc)
   .select("letter", "index")
   .filter($"index" < 10) 
   .filter(!$"letter".isin("h") && $"index" > 6)
result.show()

 

The complete program can be found here. Some plans that Catalyst generates when evaluating this snippet are presented in a textual format below; they should be read from the bottom up. A trick needs to be applied when using pythonic DataFrames, as their plans are hidden; this is described in our upcoming follow-up article, which also features a general interpretation guide for these plans.
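
In Scala, no trick is needed: the plans shown below can be printed directly from the DataFrame (a minimal sketch, reusing the result DataFrame from the snippet above):

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan
result.explain(true)

// Or access the individual plans via the QueryExecution object
val qe = result.queryExecution
println(qe.analyzed)       // Analyzed Logical Plan
println(qe.optimizedPlan)  // Optimized Logical Plan
println(qe.executedPlan)   // Physical (executed) Plan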

Parsing and Analyzing

The user query is first transformed into a parse tree called Unresolved or Parsed Logical Plan:


'Filter (NOT 'letter IN (h) && ('index > 6))
+- Filter (cast(index#28 as int) < 10)
   +- Project [letter#26, index#28]
             +- Relation[letter#26,letterUppercase#27,index#28] csv

 

This initial plan is then analyzed, which involves tasks such as checking the syntactic correctness, or resolving attribute references like names for columns and tables, which results in an Analyzed Logical Plan.

In the next step, a cache manager is consulted. If a previous query was persisted, and if it matches segments in the current plan, then the respective segment is replaced with the cached result. Nothing is persisted in our example, so the Analyzed Logical Plan will not be affected by the cache lookup.

A textual representation of both plans is included below:


Filter (NOT letter#26 IN (h) && (cast(index#28 as int) > 6))
+- Filter (cast(index#28 as int) < 10)
   +- Project [letter#26, index#28]
            +- Relation[letter#26,letterUppercase#27,index#28] csv

 

Up to this point, the operator order of the logical plan reflects the transformation sequence in the source program, if the former is interpreted from the bottom up and the latter is read, as usual, from the top down. A read, corresponding to Relation, is followed by a select (mapped to Project) and two filters. The next phase can change the topology of the logical plan.

Logical Optimizations

Catalyst applies logical optimization rules to the Analyzed Logical Plan with cache data. They are part of rule groups which are internally called batches. There are 25 batches with 109 rules in total in the Optimizer (Spark 2.4.7). Some rules are present in more than one batch, so the number of distinct logical rules boils down to 69. Most batches are executed once, but some can run repeatedly until the plan converges or a predefined maximum number of iterations is reached. Our program CollectBatches can be used to retrieve and print this information, along with a list of all rule batches and individual rules; the output can be found here.

Major rule categories are operator pushdowns/combinations/replacements and constant foldings. Twelve logical rules will fire in total when Catalyst evaluates our example. Among them are rules from each of the four aforementioned categories:

  • Two applications of PushDownPredicate that move the two filters before the column selection
  • A CombineFilters that fuses the two adjacent filters into one
  • An OptimizeIn that replaces the lookup in a single-element list with an equality comparison
  • A RewritePredicateSubquery which rearranges the elements of the conjunctive filter predicate that CombineFilters created

These rules reshape the logical plan shown above into the following Optimized Logical Plan:


Project [letter#26, index#28], Statistics(sizeInBytes=162.0 B)
 +- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6)), Statistics(sizeInBytes=230.0 B)
    +- Relation[letter#26,letterUppercase#27,index#28] csv, Statistics(sizeInBytes=230.0 B)

 

Quantitative Optimizations

Spark’s codebase contains a dedicated Statistics class that can hold estimates of various quantities per logical tree node. They include:

  • Size of the data that flows through a node
  • Number of rows in the table
  • Several column-level statistics:
    • Number of distinct values and nulls
    • Minimum and maximum value
    • Average and maximum length of the values
    • An equi-height histogram of the values

Estimates for these quantities are eventually propagated and adjusted throughout the logical plan tree. A filter or aggregate, for example, can significantly reduce the number of records; the planning of a downstream join might benefit from using this modified size, rather than the original input size, of the leaf node.

Two estimation approaches exist:

  • The simpler, size-only approach is primarily concerned with the first bullet point, the physical size in bytes. In addition, row count values may be set in a few cases.
  • The cost-based approach can compute the column-level dimensions for Aggregate, Filter, Join, and Project nodes, and may improve their size and row count values.

For other node types, the cost-based technique just delegates to the size-only methods. The approach chosen depends on the spark.sql.cbo.enabled property. If you intend to traverse the logical tree with a cost-based estimator, then set this property to true. Besides, the table and column statistics should be collected for the advanced dimensions prior to the query execution in Spark. This can be achieved with the ANALYZE command. However, a severe limitation seems to exist for this collection process, according to the latest documentation: “currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run”.
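
For completeness, here is a minimal sketch of how the CBO could be switched on and the statistics collected; the table name letters is hypothetical and, per the limitation quoted above, this assumes the data has been registered as a Hive metastore table rather than read from raw CSV files:

// Enable cost-based optimization and stats-based join reordering
session.conf.set("spark.sql.cbo.enabled", "true")
session.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table-level and column-level statistics before running queries
session.sql("ANALYZE TABLE letters COMPUTE STATISTICS")
session.sql("ANALYZE TABLE letters COMPUTE STATISTICS FOR COLUMNS letter, letterUppercase")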

The estimated statistics can be used by two logical rules, namely CostBasedJoinReorder and ReorderJoin (via StarSchemaDetection), and by the JoinSelection strategy in the subsequent phase, which is described further below.

The Cost-Based Optimizer

A fully fledged Cost-Based Optimizer (CBO) seems to be a work in progress, as indicated here: “For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs”. The CBO facilitates the CostBasedJoinReorder rule and may improve the quality of estimates; both can lead to better planning of joins. Concerning column-based statistics, only the two count stats (distinctCount and nullCount) appear to participate in the optimizations; the other advanced stats may improve the estimations of the quantities that are directly used.

The scope of quantitative optimizations does not seem to have significantly expanded with the introduction of the CBO in Spark 2.2. It is not a separate phase in the Catalyst pipeline, but improves the join logic in several important places. This is reflected in the Catalyst optimizer diagram by the smaller green rectangles, since the stats-based optimizations are outnumbered by the rule-based/heuristic ones.

The CBO will not affect the Catalyst plans in our example for three reasons:

  • The property spark.sql.cbo.enabled is not modified in the source code and defaults to false.
  • The input consists of raw CSV files for which no table/column statistics were collected.
  • The program operates on a single dataset without performing any joins.

The textual representation of the optimized logical plan shown above includes three Statistics segments which only hold values for sizeInBytes. This further indicates that size-only estimations were used exclusively; otherwise attributeStats fields with multiple advanced stats would appear.

Physical Planning

The optimized logical plan is handed over to the query planner (SparkPlanner), which refactors it into a SparkPlan with the help of strategies — Spark strategies are matched against logical plan nodes and map them to physical counterparts if their constraints are satisfied. After trying to apply all strategies (possibly in several iterations), the SparkPlanner returns a list of valid physical plans with at least one member.

Most strategy clauses replace one logical operator with one physical counterpart and insert PlanLater placeholders for logical subtrees that do not get planned by the strategy. These placeholders are transformed into physical trees later on by reapplying the strategy list on all subtrees that contain them.

As of Spark 2.4.7, the query planner applies ten distinct strategies (six more for streams). These strategies can be retrieved with our CollectStrategies program. Their scope includes the following items:

  • The push-down of filters to, and pruning of, columns in the data source
  • The planning of scans on partitioned/bucketed input files
  • The aggregation strategy (SortAggregate, HashAggregate, or ObjectHashAggregate)
  • The choice of the join algorithm (broadcast hash, shuffle hash, sort merge, broadcast nested loop, or cartesian product)

The SparkPlan of the running example has the following shape:


Project [letter#26, index#28]
+- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
   +- FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>

 

A DataSource strategy was responsible for planning the FileScan leaf node and we see multiple entries in its PushedFilters field. Pushing filter predicates down to the underlying source can avoid scans of the entire dataset if the source knows how to apply them; the pushed IsNotNull and Not(EqualTo) predicates will have a small or nil effect, since the program reads from CSV files and only one cell with the letter h exists.

The original architecture diagram depicts a Cost Model that is supposed to choose the SparkPlan with the lowest execution cost. However, such plan space explorations are still not realized in Spark 3.0.1; the code trivially picks the first available SparkPlan that the query planner returns after applying the strategies.

A simple, “localized” cost model might be implicitly defined in the codebase by the order in which rules and strategies and their internal logic get applied. For example, the JoinSelection strategy assigns the highest precedence to the broadcast hash join. This tends to be the best join algorithm in Spark when one of the participating datasets is small enough to fit into memory. If its conditions are met, the logical node is immediately mapped to a BroadcastHashJoinExec physical node, and the plan is returned. Alternative plans with other join nodes are not generated.
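
Developers can influence that selection. The sketch below (the ordersDf and countriesDf DataFrames and their join key are hypothetical) shows the broadcast hint together with the size threshold that JoinSelection consults:

import org.apache.spark.sql.functions.broadcast

// JoinSelection compares the estimated size of each join side against this threshold (default 10 MB)
session.conf.set("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// The broadcast hint marks the small side explicitly, nudging the planner toward BroadcastHashJoinExec
val joined = ordersDf.join(broadcast(countriesDf), Seq("countryCode"))
joined.explain()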

Physical Preparation

The Catalyst pipeline concludes by transforming the Spark Plan into the Physical/Executed Plan with the help of preparation rules. These rules can affect the number of physical operators present in the plan. For example, they may insert shuffle operators where necessary (EnsureRequirements), deduplicate existing Exchange operators where appropriate (ReuseExchange), and merge several physical operators where possible (CollapseCodegenStages).

The final plan for our example has the following format:


*(1) Project [letter#26, index#28]
 +- *(1) Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
    +- *(1) FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>

 

This textual representation of the Physical Plan is identical to the preceding one for the Spark Plan, with one exception: Each physical operator is now prefixed with a *(1). The asterisk indicates that the aforementioned CollapseCodegenStages preparation rule, which “compiles a subtree of plans that support codegen together into a single Java function,” fired.

The number after the asterisk is a counter for the CodeGen stage index. This means that Catalyst has compiled the three physical operators in our example together into the body of one Java function. This comprises the first and only WholeStageCodegen stage, which in turn will be executed in the first and only Spark stage. The code for the collapsed stage will be generated by the Janino compiler at run-time.
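
If you want to see the Java source that whole-stage code generation produces, Spark ships a debugging helper for exactly this purpose; a minimal sketch, applied to the result DataFrame from above:

import org.apache.spark.sql.execution.debug._

// Prints the WholeStageCodegen subtrees together with the generated Java source for each of them
result.debugCodegen()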

Adaptive Query Execution in Spark 3

One of the major enhancements introduced in Spark 3.0 is Adaptive Query Execution (AQE), a framework that can improve query plans during run-time. Instead of passing the entire physical plan to the scheduler for execution, Catalyst tries to slice it up into subtrees (“query stages”) that can be executed in an incremental fashion.

The adaptive logic begins at the bottom of the physical plan, with the stage(s) containing the leaf nodes. Upon completion of a query stage, its run-time data statistics are used for reoptimizing and replanning its parent stages. The scope of this new framework is limited: The official SQL guide states that “there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skew join optimization.” A query must not be a streaming query and its plan must contain at least one Exchange node (induced by a join, for example) or a subquery; otherwise, an adaptive execution will not even be attempted – and even if a query satisfies these requirements, it may still not get improved by AQE.
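
Since AQE is disabled by default in Spark 3.0, the following sketch shows the main configuration switches for the framework and its major features (property names as of Spark 3.0):

// Master switch for Adaptive Query Execution (off by default in Spark 3.0)
session.conf.set("spark.sql.adaptive.enabled", "true")

// Feature switches, evaluated only when AQE itself is enabled
session.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // coalesce post-shuffle partitions
session.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            // skew join optimization
session.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")  // optimize shuffle reads when a sort-merge join is converted to a broadcast join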

A large part of the adaptive logic is implemented in a special physical root node called AdaptiveSparkPlanExec. Its method getFinalPhysicalPlan is used to traverse the plan tree recursively while creating and launching query stages. The SparkPlan that was derived by the preceding phases is divided into smaller segments at Exchange nodes (i.e., shuffle boundaries), and query stages are formed from these subtrees. When these stages complete, fresh run-time stats become available, and can be used for reoptimization purposes.

A dedicated, slimmed-down logical optimizer with just one rule, DemoteBroadcastHashJoin, is applied to the logical plan associated with the current physical one. This special rule attempts to prevent the planner from converting a sort merge join into a broadcast hash join if it detects a high ratio of empty partitions. In this specific scenario, the first join type will likely be more performant, so a no-broadcast-hash-join hint is inserted.

The new optimized logical plan is fed into the default query planner (SparkPlanner), which applies its strategies and returns a SparkPlan. Finally, a small number of physical operator rules are applied that take care of subqueries and missing ShuffleExchangeExecs. This results in a refactored physical plan, which is then compared to the original version. The physical plan with the lowest number of shuffle exchanges is considered cheaper to execute and is therefore chosen.

When query stages are created from the plan, four adaptive physical rules are applied that include CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader. The logic of AQE’s major new optimizations is implemented in these case classes. Databricks explains these in more detail here. A second list of dedicated physical optimizer rules is applied right after the creation of a query stage; these mostly make internal adjustments.

The preceding paragraphs mentioned four adaptive optimizations, which are implemented in one logical rule (DemoteBroadcastHashJoin) and three physical ones (CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader). All four make use of a special statistics class that holds the output size of each map output partition. The statistics mentioned in the Quantitative Optimizations section above do get refreshed, but they are not leveraged by the standard logical rule batches, as AQE uses a custom logical optimizer.

This article has described Spark’s Catalyst optimizer in detail. Developers who are unhappy with its default behavior can add their own logical optimizations and strategies or exclude specific logical optimizations. However, it can become complicated and time-consuming to devise customized logic for query patterns, and other factors, such as the choice of machine types, can also have a significant impact on the performance of applications. Unravel Data can automatically analyze Spark workloads and provide tuning recommendations, so developers are less burdened with studying query plans and identifying optimization opportunities.

The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/feed/ 0
End-to-end Monitoring of HBase Databases and Clusters https://www.unraveldata.com/end-to-end-monitoring-of-hbase/ https://www.unraveldata.com/end-to-end-monitoring-of-hbase/#respond Fri, 02 Aug 2019 18:42:19 +0000 https://www.unraveldata.com/?p=3318

End-to-end Monitoring of HBase databases and clusters Running real time data injection and multiple concurrent workloads on HBase clusters in production is always challenging. There are multiple factors that affect a cluster’s performance or health and […]

The post End-to-end Monitoring of HBase Databases and Clusters appeared first on Unravel.

]]>

End-to-end Monitoring of HBase databases and clusters

Running real-time data ingestion and multiple concurrent workloads on HBase clusters in production is always challenging. There are multiple factors that affect a cluster’s performance or health, and dealing with them is not easy. Timely, up-to-date, and detailed data is crucial to locating and fixing issues to maintain a cluster’s health and performance.

Most cluster managers provide high-level metrics, which, while helpful, are not enough for understanding cluster and performance issues. Unravel provides detailed data and metrics to help you identify the root causes of cluster and performance issues, specifically hotspotting. This tutorial examines how you can use Unravel’s HBase application performance management (APM) capabilities to debug issues in your HBase cluster and improve its performance.

Cluster health

Unravel provides per-cluster metrics that give an overview of your HBase clusters in the OPERATIONS > USAGE DETAILS > HBase tab, where all the HBase clusters are listed. Click on a cluster name to bring up the cluster’s detailed information.

The metrics are color coded so you can quickly ascertain your cluster’s health and what, if any, issues you need to drill into.

  • Green = Healthy
  • Red = Unhealthy, alert for metrics and investigation required

Below we examine the HBase metrics and explain how to use them to monitor your HBase cluster through Unravel.

Overall cluster activity
HBase Cluster activity

  • Live Region Servers: the number of running region servers.
  • Dead Region Servers: the number of dead region servers.
  • Cluster Requests: the number of read and write requests aggregated across the entire cluster.
  • Average Load: the average number of regions per region server across all servers.
  • Rit Count: the number of regions in transition.
  • RitOver Threshold: the number of regions that have been in transition longer than a threshold time (default: 60 seconds).
  • RitOldestAge: the age, in milliseconds, of the longest region in transition.

Note

Region Server refers to the servers (hosts) while region refers to the specific regions on the servers.
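
If you want to pull the same cluster-level KPIs programmatically – for example, to feed your own alerting – a minimal sketch against the HBase 1.x Java client API could look like the following; treat the exact method set as version-dependent:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()                  // reads hbase-site.xml from the classpath
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin
val status = admin.getClusterStatus                     // point-in-time snapshot of the cluster

println(s"Live region servers: ${status.getServers.size}")
println(s"Dead region servers: ${status.getDeadServerNames.size}")
println(s"Cluster requests:    ${status.getRequestsCount}")
println(s"Average load:        ${status.getAverageLoad}")
println(s"Region count:        ${status.getRegionsCount}")

admin.close()
connection.close()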

Dead Region Servers

This metric gives you insight into the health of the region servers. In a healthy cluster this should be zero (0). When the number of dead region servers is one (1) or greater you know something is wrong in the cluster.

Dead region servers can cause an imbalance in the cluster. When a server has stopped, its regions are redistributed across the running region servers. This consolidation means some region servers handle a larger number of regions and consequently a correspondingly higher number of read, write, and store operations. This can result in some servers processing a huge number of operations while others sit idle, causing hotspotting.

Average Load

This metric is the average number of regions on each region server. Like Dead Region Servers, this metric helps you to triage imbalances in the cluster and optimize the cluster’s performance.

Below, for the same cluster, you can see the impact on the Average Load when Dead Region Servers is 0 and 4.

Dead Region Servers=0
In this case, the Average Load is 2k.

HBase dead region server

Dead Region Servers=4
As the number of dead region servers increased, so did the corresponding Average Load, which is now 3.23k, an increase of about 60%.

The next section of the tab contains a region server table which shows the number of regions hosted by each region server. You can see the delta (Average Load – Region Count); a large delta indicates an imbalance in the cluster and reduces the cluster’s performance. You can resolve this issue by either of the following (see the sketch after this list):

  • Moving the regions onto another region server.
  • Removing regions from the current region server(s), at which point the master immediately deploys them on another available region server.
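
The first remedy, followed by a balancer run, can be scripted with the same kind of Admin handle used in the sketch above; the encoded region name and target server below are hypothetical placeholders of the kind shown in the HBase master UI:

import org.apache.hadoop.hbase.util.Bytes

// Move one region to an explicitly chosen region server
val encodedRegionName = "d7a259d450b5049dcc4c0d0a2de55899"            // hypothetical encoded region name
val targetServer = "regionserver-3.example.com,16020,1565000000000"   // hypothetical server name
admin.move(Bytes.toBytes(encodedRegionName), Bytes.toBytes(targetServer))

// Then trigger a balancer run to even out the region counts across the live servers
admin.balancer()
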
Region server

Unravel provides a list of region servers, their metrics, and Unravel’s insight into each server’s health for all region servers across the cluster at a specific point in time.

For each region server, the table lists the REGION SERVER NAME and the server metrics READ REQUEST COUNT, WRITE REQUEST COUNT, STORE FILE SIZE, PERCENT FILES LOCAL, COMPACTION QUEUE LENGTH, REGION COUNT, and HEALTH. These metrics and the health status are helpful in monitoring activity in your HBase cluster.

The HEALTH status is color coded so you can see at a glance when a server is in trouble. Hover over the server’s health status to see a tooltip listing the hotspotting notifications with their associated threshold (average value * 1.2). If any value is above the threshold, the region server is hotspotting and Unravel shows the health as bad.

Region server metric graphs

Below the table are four (4) graphs: readRequestCount, writeRequestCount, storeFileSize, and percentFilesLocal. These graphs cover the metrics across the entire cluster. The time period the metrics are graphed over is noted above the graphs.

Tables

The last item in the Cluster tab is a list of tables. This list covers all the tables across all the region servers in the entire cluster. Unravel uses these metrics to attempt to detect an imbalance. Averages are calculated within each category and alerts are raised accordingly. Just as with the region servers, you can tell at a glance the health of each table.

The list is searchable and displays the TABLE NAME, TABLE SIZE, REGION COUNT, AVERAGE REGION SIZE, STORE FILE SIZE, READ REQUEST COUNT, WRITE REQUEST COUNT, and finally HEALTH. Hover over the health status to see a tooltip listing the hotspotting notifications. Bad health indicates that a large number of store operations from different sources are being redirected to this table. In this example, the Store File Size exceeds the threshold (more than 20% above the average).


You can use this list to drill down into the tables and get their details, which can be useful for monitoring your cluster.

Click on a table to view its details, which include graphs of the table’s metrics over time, a list of the regions using the table, and the applications accessing the table. Below is an example of the graphed metrics: regionCount, readRequestCount, and writeRequestCount.

Region

Once in the TABLE view, click on the REGION tab to see the list of regions accessing the table.

The table list shows the REGION NAME, REGION SERVER NAME, and the region metrics STORE FILE SIZE, READ REQUEST COUNT, WRITE REQUEST COUNT, and HEALTH. These metrics are useful in gauging activity and load in the region. The region health is important in deciding whether the region is functioning properly. If any metric value crosses its threshold, the status is listed as bad. A bad status indicates you should start investigating to locate and fix hotspotting.


The post End-to-end Monitoring of HBase Databases and Clusters appeared first on Unravel.

]]>
https://www.unraveldata.com/end-to-end-monitoring-of-hbase/feed/ 0
Meeting SLAs for Data Pipelines on Amazon EMR With Unravel https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/ https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/#respond Wed, 26 Jun 2019 22:31:02 +0000 https://www.unraveldata.com/?p=3255 Data Pipelines

Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog. A household name in global media analytics – let’s call […]

The post Meeting SLAs for Data Pipelines on Amazon EMR With Unravel appeared first on Unravel.

]]>
Data Pipelines

Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog.

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. Amazon EMR was an easy choice for MTI as the platform to run all their analytics. To start with, getting up and running is simple: there is nothing to install, no configuration required, and you can get to a functional cluster in a few clicks. This enabled MTI to focus most of their efforts on building out analytics that would benefit their business, instead of having to spend time and money acquiring the skill set needed to set up and maintain Hadoop deployments themselves. MTI very quickly got to a point where they were running tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop – or, more specifically, Hive/MapReduce.

Among the most common complaints and concerns about optimizing modern data stack clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units, and projects. Unravel then aggregates and correlates this data into the Unravel data model and applies a variety of analytical techniques to put that data into a useful context. Unravel utilizes EMR bootstrap actions to distribute (non-intrusive) sensors to each node of a new cluster; these sensors collect the granular application-level data that is in turn used to optimize applications.
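For reference, an EMR bootstrap action is simply a script that runs on every node when the cluster is created. Below is a minimal sketch of wiring a sensor-install script into a cluster launch with boto3; the script path, its arguments, and the collector host are hypothetical placeholders, not Unravel's actual distribution mechanism.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Sketch: launch an EMR cluster with a bootstrap action that installs
# a monitoring sensor on every node. Script path and args are placeholders.
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-monitoring-sensor",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install_sensor.sh",   # placeholder
                "Args": ["--collector-host", "unravel-host.example.com"],  # placeholder
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```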

See Unravel for EMR in Action

Try Unravel for free


Unravel’s Amazon AWS/EMR architecture

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing Data Operations
For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

● Reduce time needed for troubleshooting and resolving issues.

● Improve cluster efficiency and performance.

● Improve visibility into cluster workloads.

● Provide usage analysis.

Reducing Time to Identify and Resolve Issues
One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting it has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application, is bringing your cluster to its knees.

Unravel’s top-level operations dashboard

Improving Visibility Into Cluster Operations
In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resource usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster Level Reporting
Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel’s cluster infrastructure dashboard

One observation from reports like the one above was that memory and CPU usage in the cluster peaked from time to time, potentially leading to application failures and slowdowns. To resolve this issue, MTI utilized the EMR automatic scaling feature so that instances were automatically added and removed as needed to ensure adequate resources at all times. This also ensured that they were not incurring unnecessary costs from underutilized resources.
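EMR automatic scaling (for instance-group clusters) is driven by CloudWatch-metric rules attached to an instance group. A minimal sketch with boto3 follows; the cluster and instance-group IDs, capacity limits, and the 15% memory threshold are illustrative assumptions rather than MTI's actual policy.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Sketch: add one node to an instance group when available YARN memory stays low.
# IDs, capacities, and thresholds are placeholders.
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupId="ig-XXXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "scale-out-on-low-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```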

Application and Workflow Tagging
Unravel provides rich functionality for monitoring applications and users in the cluster, with cluster and application reporting by user, queue, application type, and custom tags such as project and department. These tags are preconfigured so that MTI can instantly filter their view by them. The ability to add custom tags is unique to Unravel and enables customers to tag applications based on custom rules specific to their business requirements (e.g., project, business unit, etc.).

Unravel application tagging by department

Usage Analysis and Capacity Planning
MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage, as well as data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

Unravel Chargeback Reporting
You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by the Group By options: application type, user, queue, and tags. The window is divided into three sections:

  • Donut graphs showing the top results for the Group By selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of YARN applications running.

Unravel Data’s chargeback reporting
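The same kind of group-by roll-up is easy to picture in a few lines of pandas; this is a generic sketch of chargeback-style aggregation, where the column names, rates, and records are illustrative assumptions and not Unravel's actual cost model.

```python
import pandas as pd

# Illustrative per-application usage records; columns and rates are assumptions.
apps = pd.DataFrame([
    {"app_id": "app-1", "user": "etl_svc", "queue": "prod",  "tag": "marketing",
     "vcore_hours": 120.0, "memory_gb_hours": 480.0},
    {"app_id": "app-2", "user": "ds_team", "queue": "adhoc", "tag": "research",
     "vcore_hours": 40.0,  "memory_gb_hours": 160.0},
    {"app_id": "app-3", "user": "etl_svc", "queue": "prod",  "tag": "marketing",
     "vcore_hours": 75.0,  "memory_gb_hours": 300.0},
])

VCORE_RATE = 0.05  # assumed $ per vcore-hour
MEM_RATE = 0.01    # assumed $ per GB-hour

apps["cost"] = apps["vcore_hours"] * VCORE_RATE + apps["memory_gb_hours"] * MEM_RATE

# Chargeback rolled up by any "Group By" dimension: user, queue, or tag.
for dimension in ["user", "queue", "tag"]:
    print(apps.groupby(dimension)["cost"].sum().sort_values(ascending=False), "\n")
```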

Improving Cluster Efficiency and Performance
MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

Get answers to EMR issues, not more charts and graphs

Try Unravel for free

Unravel Auto-actions and Alerting
Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions setup

The simplicity of the Auto-actions screen belies the deep automation and functionality behind autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also issue an HTTP POST, which gives users a lot of powerful options.
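As a rough illustration of what a policy-based action with an HTTP POST amounts to (this is not Unravel's Auto-actions engine; the webhook URL, threshold, and payload are assumed for the sketch):

```python
import requests

LAG_THRESHOLD = 100_000                              # assumed policy threshold
WEBHOOK_URL = "https://hooks.example.com/alerts"     # placeholder endpoint

def check_and_alert(queue: str, pending_containers: int) -> None:
    """Post an alert payload when a policy threshold is breached."""
    if pending_containers > LAG_THRESHOLD:
        payload = {
            "severity": "warning",
            "message": f"Queue {queue} has {pending_containers} pending containers",
            "suggested_action": "kill or move offending job",
        }
        requests.post(WEBHOOK_URL, json=payload, timeout=5)

check_and_alert("prod", 250_000)
```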

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

Pre-configured Unravel Auto-action templates

Applications
Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased Visibility Into Application Runtime and Trends
MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
  • View application component (pipeline stages) breakdown and their associated performance metrics
  • Understand the execution of MapReduce jobs and Spark applications, the degree of parallelism and resource usage, and obtain insights and recommendations for optimal performance and efficiency

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

Unravel Monitoring, Tuning, and Troubleshooting

Intuitive drill-down from the Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top-level list of applications down to jobs, and further down to individual stages of Spark, Hive, MapReduce, or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

Unravel Self-service Optimization of Spark Applications
MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and management of their data application portfolio. Their aim is for development teams to reduce their dependency on IT and, at the same time, to improve collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Intelligent recommendations and insights as well as auto-tuning

Optimizing Application Resource Efficiency
In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams can routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel can help you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

  • Underutilization of container resources, CPU, or memory
  • Low utilization of memory resources
  • Too few partitions with respect to available parallelism
  • Mappers/reducers requesting too much memory
  • Too many map tasks and/or too many reduce tasks

Solution Highlights
Work on all of these operational goals is ongoing between MTI and Unravel, but to date they have made significant progress on both operational and application goals. After running Unravel for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also got great insights into the number and causes of inefficiently running applications. Unravel detected a significant number of inefficient applications: 38,190 events after analyzing the 30,378 MapReduce jobs that they executed, and 44,176 events for the 21,799 Spark jobs that they executed. They were also able to detect resource contention which was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive into their applications, MTI found multiple inefficient jobs where Unravel provided recommendations for repartitioning the data. They were also able to identify many jobs that wasted CPU and memory resources.

The post Meeting SLAs for Data Pipelines on Amazon EMR With Unravel appeared first on Unravel.

]]>
https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/feed/ 0
Case Study: Meeting SLAs for Data Pipelines on Amazon EMR https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/ https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/#respond Thu, 30 May 2019 20:44:44 +0000 https://www.unraveldata.com/?p=2988 Welcoming Point72 Ventures and Harmony Partners to the Unravel Family

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and […]

The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

]]>
Welcoming Point72 Ventures and Harmony Partners to the Unravel Family

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. MTI runs tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing big data clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units and projects. Unravel then aggregates and correlates this data into the Unravel data model and then applies a variety of analytical techniques to put that data into a useful context.

Unravel architecture for Amazon AWS/EMR

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing data operations

For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

  • Reduce time needed for troubleshooting and resolving issues.
  • Improve cluster efficiency and performance.
  • Improve visibility into cluster workloads.
  • Provide usage analysis.

Reducing time to identify and resolve issues

One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting it has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application, is bringing your cluster to its knees.

Unravel top level operations dashboard

Improving visibility into cluster operations

In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resource usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster level reporting

Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel cluster infrastructure dashboard

Application and workflow tagging

Unravel provides rich functionality for monitoring applications and users in the cluster, with cluster and application reporting by user, queue, application type, and custom tags such as project and department. These tags are preconfigured so that MTI can instantly filter their view by them. The ability to add custom tags is unique to Unravel and enables customers to tag applications based on custom rules specific to their business requirements (e.g., project, business unit, etc.).

Unravel application tagging by department

Usage analysis and capacity planning

MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage, as well as data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

Unravel chargeback reporting

You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by the Group By options: application type, user, queue, and tags. The window is divided into three sections:

  • Donut graphs showing the top results for the Group By selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of YARN applications running.

Unravel chargeback reporting

Improving cluster efficiency and performance

MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

Unravel Auto-actions and alerting

Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions set up

The simplicity of the Auto-actions screen belies the deep automation and functionality of autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also create an HTTP post that gives users a lot of powerful options.

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

Preconfigured Unravel auto-action templates

Applications

Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased visibility into application runtime and trends

MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
  • View application component (pipeline stages) breakdown and their associated performance metrics
  • Understand the execution of MapReduce jobs and Spark applications, the degree of parallelism and resource usage, and obtain insights and recommendations for optimal performance and efficiency

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

Unravel monitoring, tuning and troubleshooting

Intuitive drill-down from Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top-level list of applications down to jobs, and further down to individual stages of Spark, Hive, MapReduce, or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

Unravel self-service optimization of Spark applications

MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and management of their data application portfolio. Their aim is for development teams to reduce their dependency on IT and, at the same time, to improve collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Unravel provides intelligent recommendations and insights as well as auto-tuning.

Optimizing Application Resource Efficiency

In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams can routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel can help you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

Unravel Insight: Under-utilization of container resources, CPU or memory

Unravel Insight: Too few partitions with respect to available parallelism

Unravel Insight: Mapper/Reducers requesting too much memory

Unravel Insight: Too many map tasks and/or too many reduce tasks

Solution Highlights

Work on all of these operational goals is ongoing between MTI and Unravel, but to date they have made significant progress on both operational and application goals. After running Unravel for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also got great insights into the number and causes of inefficiently running applications. Unravel detected a significant number of inefficient applications: 38,190 events after analyzing the 30,378 MapReduce jobs that they executed, and 44,176 events for the 21,799 Spark jobs that they executed. They were also able to detect resource contention which was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive into their applications, MTI found multiple inefficient jobs where Unravel provided recommendations for repartitioning the data. They were also able to identify many jobs that wasted CPU and memory resources.

The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/feed/ 0
Using Machine Learning to understand Kafka runtime behavior https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/ https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/#respond Wed, 29 May 2019 19:16:43 +0000 https://www.unraveldata.com/?p=2973

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit […]

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

]]>

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit in London. You can review the session slides or read the transcript below. 


Transcript

Nate Snapp:

All right. Thanks. I’ll just give a quick introduction. My name is Nate Snapp. And I do big data infrastructure and engineering at companies such as Adobe, Palo Alto Networks, and Omniture. And I have had quite a bit of experience in streaming, even outside of the Kafka realm, for about 12 years. I’ve worked on some of the big scale efforts that we’ve done for web analytics at Adobe and Omniture, working in the space of web analytics for a lot of the big companies out there, about 9 out of 10 of the Fortune 500.

I’ve dealt with major release events, from new iPads to all these things that have to do with big increases in data, and streaming that in a timely fashion. I’ve done systems that have over 10,000 servers in that [00:01:00] proprietary stack. But the exciting thing is, the last couple of years, moving to Kafka, and being able to apply some of those same principles in Kafka. And I’ll explain some of those today. And then I like to blog as well, at natesnapp.com.

Shivnath Babu:

Hello, everyone. I’m Shivnath Babu. I’m cofounder and CTO at Unravel. What my day job looks like is building a platform like what is shown on the slide where we collect monitoring information from the big data systems, receiver systems like Kafka, like Spark, and correlate the data, analyze it. And some of the techniques that we will talk about today are inspired by that work to help operations teams as well as application developers, better manage, easily troubleshoot, and to do better capacity planning and operations for their receiver systems.

Nate:

All right. So to start with some of the Kafka work that I have been doing, we have [00:02:00] typically about…we have several clusters of Kafka that we run that have about 6 to 12 brokers, up to about 29 brokers on one of them. But we run the Kafka 5.2.1, and we’re upgrading the rest of the systems to that version. And we have about 1,700 topics across all these clusters. And we have pretty varying rates of ingestion, but topping out about 20,000 messages a second…we actually gotten higher than that. But what I explained this for is as we get into what kind of systems and events we see, the complexity involved often has some interesting play with how do you respond to these operational events.

So one of the aspects is that we have a lot of self-service that’s involved in this. We have users that set up their own topics, they setup their own pipelines. And we allow them to do [00:03:00] that to help make them most proficient and able to get them up to speed easily. And so because of that, we have a varying environment. And we have quite a big skew for how they bring it in. They do it with other…we have several other Kafka clusters that feed in. We have people that use the Java API, we have others that are using the REST API. Does anybody out here use the REST API for ingestion to Kafka? Not very many. That’s been a challenge for us.

But we have also, for the egress, a lot of custom endpoints, and a big one is HDFS and Hive. So as I get into this, I’ll be explaining some of the things that we see that are really dynamically challenging, and why it’s not as simple as if/else statements for how you triage and respond to these events. And then Shiv will talk more about how you can get to a higher level of using actual ML to solve some of these challenging [00:04:00] issues.

So I’d like to start with a simple example. And in fact, this is what we often use. When we first get somebody new to the system, or we’re interviewing somebody to work in this kind of environment, as we start with a very simple example, and take it from a high level, I like to do something like this on the whiteboard and say, “Okay. You’re working in an environment that has data that’s constantly coming in that’s going through multiple stages of processing, and then eventually getting into a location where will be reported on, where people are gonna dive into it.”

And when I do this, it’s to show that there’s many choke points that can occur when this setup is made. And so you look at this and you say, “That’s great,” you have the data flowing through here. But when you hit different challenges along the way, it can back things up in interesting ways. And often, we talk about that as cascading failures of latency and latency-sensitive systems, you’re counting [00:05:00] latency is things back up. And so what I’ll do is explain to them, what if we were to take, for instance, you’ve got these three vats, “Okay, let’s take this last vat or thus last bin,” and if that’s our third stage, let’s go ahead and see if there’s a choke point in the pipe there that’s not processing. What happens then, and how do you respond to that?

And often, a new person will say, “Well, okay, I’m going to go and dive in,” then they’re gonna say, “Okay. What’s the problem with that particular choke point? And what can I do to alleviate that?” And I’ll say, “Okay. But what are you doing to make sure the rest of the system is processing okay? Is that choke point really just for one source of data? How do you make sure that it doesn’t start back-filling and cause other issues in the entire environment?” And so when I do that, they’ll often come back with, “Oh, okay, I’ll make sure that I’ve got enough capacity at the stage before the things don’t back up.”

And this actually has very practical implications. You take the simple model, and it applies to a variety of things [00:06:00] that happen in the real world. So for every connection you have to a broker, and for every source that’s writing at the end, you actually have to account for two ports. It’s a basic thing, but as you scale up, it matters. So I have a port with the data being written in. I have a port that I’m managing for writing out. And as I have 5,000, 10,000 connections, I’m actually at two X on the number of ports that I’m managing.

And what we found recently was we were hitting the max ephemeral port range that Linux allows, and all of a sudden, okay, it doesn’t matter if you felt like you had capacity in, say, message storage; we actually hit boundaries in different areas. And so I think it’s important to step back a little at times, so that you’re thinking in terms [00:07:00] of what other resources we are not accounting for that can back up. So we find that there are very important considerations on understanding the business logic behind it as well.
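On Linux, the ephemeral range referred to here is easy to inspect; below is a minimal sketch (the proc paths are standard Linux, while the 80% warning level and the rough socket count are illustrative assumptions):

```python
# Sketch: compare the Linux ephemeral port range against current IPv4 TCP sockets.
# This is a rough approximation; /proc/net/tcp counts sockets in all states.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

with open("/proc/net/tcp") as f:
    open_sockets = sum(1 for _ in f) - 1  # subtract the header line

capacity = high - low + 1
print(f"ephemeral range {low}-{high} ({capacity} ports), ~{open_sockets} IPv4 TCP sockets")
if open_sockets > 0.8 * capacity:  # assumed warning level
    print("warning: approaching ephemeral port exhaustion")
```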

Data transparency, data governance actually can be important knowing where that data is coming from. And as I’ll talk about later, the ordering effects that you can have, it helps with being able to manage that at the application layer a lot better. So as I go through this, I want to highlight that it’s not so much a matter of just having a static view of the problem, but really understanding streams for the dynamic nature that they have, and that you may have planned for a certain amount of capacity. And then when you have a spike and that goes up, knowing how the system will respond at that level and what actions to take, having the ability to view that becomes really important.

So I’ll go through a couple of [00:08:00] data challenges here, practical challenges that I’ve seen. And as I go through that, then when we get into the ML part, the fun part, you’ll see what kind of algorithms we can use to better understand the varying signals that we have. So first challenge is variance in flow. And we actually had this with a customer of ours. And they would come to us saying, “Hey, everything look good as far as the number of messages that they’ve received.” But when they went to look at a certain…basically another topic for that, they were saying, “Well, some of the information about these visits actually looks wrong.”

And so they actually showed me a chart that looked something like this. You could look at this like a five business day window here. And you look at that and you say, “Yeah, I mean, I could see. There’s a drop after Monday, the first day drops a little low.” That may be okay. Again, this is where you have to know what the business is trying to do with this data. And as an operator like myself, I can’t always make those assessments really well. [00:09:00] Going to the third peak there, you see that go down and have a dip, they actually pointed out something like that, although I guess it was a much smaller one, and they’d circled this. And they said, “This is a huge problem for us.”

And I come to find out it had to do with sales of product and being able to target the right experience for that user. And so looking at that, and knowing what those anomalies are, and what they pertain to, having that visibility, again, going to the data transparency, you can help you to make the right decisions and say, “Okay.” And in this case with Kafka, what we do find is that at times, things like that can happen because we have, in this case, a bunch of different partitions.

One of those partitions backed up. And now, most of the data is processing, but some isn’t, and it happens to be bogged down. So what decisions can we make to be able to account for that and be able to say that, “Okay. This partition backed up, why is that partition backed up?” And the kind of ML algorithms you choose are the ones that help you [00:10:00] answer those five whys. If you talk about, “Okay. It backed up. It backed up because,” and you keep going backwards until you get to the root cause.

Challenge number two. We’ve experienced negative effects from really slow data. Slow data can happen for a variety of reasons. Two of them that we principally have found are users that have data that is staging and temporary, that they’re going to be testing in another pipeline and then rolling into production. But there’s actually some in production that is slow data, and it’s more or less how that stream is supposed to behave.

So think in terms again of what is the business trying to run on the stream? In our case, we would have some data that we wouldn’t want to have frequently like cancellations of users for our product. We hope that that stay really low. We hope that you have very few bumps in the road with that. But what we find with Kafka is that you have to have a very good [00:11:00] understanding of your offset expiration and your retention periods. We find that if you have those offsets expire too soon, then you have to go into a guesswork of, “Am I going to be aggressive,” and try and look back really far and reread that data, in which case, you may double count? Or am I gonna be really conservative, which case you may miss critical events, especially if you’re looking at cancellations, something that you need to understand about your users.

And so we found that to be a challenge. Keeping with that and going into this third idea is event sourcing. And this is something that I’ve heard argued either way should Kafka be used for this event sourcing and what that is. We have different events that we want to represent in the system. Do I create a topic for event and then have a whole bunch of topics, or do I have a single topic because we provide more guarantees on the partitions in that topic?

And it’s arguable, depends on what you’re trying to accomplish, which is the right method. But what we’ve seen [00:12:00] with users is because we have a self-service environment, we wanna give the transparency back to them on here’s what’s going on when you set it up in a certain way. And so if they say, “Okay, I wanna have,” for instance for our products, “a purchase topic representing the times that things were purchased.” And then a cancellation topic to represent the times that they decided not to use certain…they decided to cancel the product, what we can see is some ordering issues here. And I represented those matching with the colors up there. So a brown purchase followed by cancellation on that same user. You could see the, you know, the light purple there, but you can see one where the cancellation clearly comes before the purchase.

And so that is confusing when you do the processing, “Okay. Now, they’re out of order.” So having the ability to expose that back to the user and say, “Here’s the data that you’re looking at and why the data is confusing. What can you do to respond to that,” and actually pushing that back to the users is critical. So I’ll just cover a couple of surprises that [00:13:00] we’ve had along the way. And then I’ll turn the time over to Shiv. But does anybody here use Schema Registry? Any Schema Registry users? Okay, we’ve got a few. Anybody ever have users change the schema unexpectedly and still cause issues, even though you’ve used Schema Registry? Is it just me? Had a few, okay.

We found that users have this idea that they wanna have a solid schema, except when they wanna change it. And so we’ve coined this term flexible-rigid schema. They want a flexible-rigid schema. They want the best of both worlds. But what they have to understand is, “Okay, you introduce a new Boolean value, but your events don’t have that.” I can’t pick one. I can’t default to one. I’ll make the wrong decision, I guarantee you. It’s a 50/50 chance. And so we have this idea of, can we expose back to them what they’re seeing, and when those changes occur. And they don’t always have control over changing the events at the source. And so they may not even be in control of their destiny of having a [00:14:00] schema changed throughout the system.

Timeouts, leader affinity, I’m gonna skip over some of these or not spend a lot of time on them. But timeouts is a big one. As we write to HDFS, we see that the backups that occur from the Hive metastore when we’re writing with the HDFS sink can cause everything to go into a rebalanced state, which is really expensive, and now what was a single broker having an issue becomes a cascading issue. So, again, others are leader affinity and poor assignments; there’s randomness in where those get assigned. We’d like to have context of the system involved. We wanna be able to understand the state of the system when those choices are being made, and whether we can affect that. And Shiv will cover how that kind of visibility is helpful with some of the algorithms that can be used. And then basically, those things all just lead to: why do we need to have better visibility into our data problems with Kafka? [00:15:00] Thanks.

Shivnath:

Thanks, Nate. So what Nate just actually talked about, and I’m sure most of you at this conference, the reason you’re here is that streaming applications, Kafka-based architectures, are becoming very, very popular, very common, and driving mission critical applications across manufacturing, across ecommerce, many, many industries. So for this talk, I’m just going to approximate a streaming architecture as something like this. There’s a stream store, something like Kafka that is being used, or Pulsar, and maybe state actually kept in a NoSQL system, like HBase or Cassandra. And then there’s the computational element. This could be in Kafka itself with KStreams, but it could be Spark Streaming or Flink as well.

When things are great, everything is fine. But when things start to break, maybe, as Nate mentioned, you’re not getting your results on time. Things are actually congested, [00:16:00] backlogged, all of these things can happen. And unfortunately, in any architecture like this, what happens is they’re all individual systems. So problems could happen in many different places: it could be an application problem, maybe the structure, the schema, how things are put together…like Nate gave an example of a join across streams. There might be problems there, but there might also be problems at the Kafka level, maybe the partitioning, maybe brokers, maybe contention, or things at the Spark level, resource-level problems, or configuration problems. All of these become very hard to troubleshoot.

And I’m sure many of you have run into problems like this. How many of you here can relate to the problems I’m showing on the slide here? Quite a few hands. As you’re running things in production, these challenges happen. And unfortunately, what happens is, given that these are all individual systems, often there is no single correlated view that connects the streaming [00:17:00] computational side with the storage side, or maybe with the NoSQL side. And that poses a lot of issues. One of the crucial things is that when a problem happens, it takes a lot of time to resolve.

And wouldn’t it be great if there were some tool out there, some solution that can give you visibility, as well as do things along the following lines and empower the DevOps teams? First and foremost, there are metrics all over the place: metrics, logs, you name it, from all these different levels, especially the platform, the application, and all the interactions that happen in between. Metrics along all of these can be brought into one single platform, and at least give you a good, nicely correlated view. So again, you can go from the app to the platform, or vice versa, depending on the problem.

And what we will try to cover today is, with all these advances that are happening in machine learning, how applying machine learning to some of this data can help you find problems quicker, [00:18:00] as well as, in some cases, using artificial intelligence and the ability to take automated actions to either prevent the problems from happening in the first place, or at least, if those problems do happen, to be able to quickly recover and fix them.

And what we will really try to cover in this talk is basically what I’m showing on the slide. There are tons and tons of interesting machine learning algorithms. And at the same time, you have all of these different problems that happen with Kafka streaming architectures. How do you connect both worlds? How can you, based on the goal that you’re actually trying to solve, bring the right algorithm to bear on the problem? And the way we’ll structure that is, again, DevOps teams have lots of different goals. Broadly, there’s the app side of the fence, and there’s the operations side of the fence.

As a person who owns a Kafka streaming application, you might have goals related to latency: I need this kind of latency, or this much throughput, or maybe it might be a combination of this along with, “Hey, I can only [00:19:00] tolerate this much data loss,” and the talks that have happened in this room have covered different aspects of this and how replicas and parameters help you get all of these goals.

On the operations side, maybe your goals are around ensuring that the cluster is reliable, that losing a particular rack doesn’t really cause data loss and things like that, or maybe, on the cloud, they are related to ensuring that you’re getting the right price performance and all of those things. And on the other side are all of these interesting advances that have happened in ML. There are algorithms for detecting outliers, anomalies, or actually doing correlation. So the real focus is going to be: let’s take some of these goals, let’s work with the different algorithms, and then see how you can get these things to meet. And along the way, we’ll try to describe some of our own experiences, what worked and what didn’t work, as well as other things that are worth exploring.

[00:20:00] So let’s start with the very first one, the outlier detection algorithms. And why would you care? Here are two very, very simple problems. And if you remember from earlier in the talk, Nate talked about one of the critical challenges they had, where the problem was exactly this: hey, there could be some imbalance among my brokers, there could be some imbalance in a topic among the partitions, some partition really getting backed up. Very, very common problems that happen. How can you, instead of manually looking at graphs and trying to stare and figure things out, very quickly apply some automation to it?

Here’s a quick screenshot that leads us through the problem. If you look at the graph that I have highlighted on the bottom, this is a small cluster with three brokers, one, two, and three. And one of the brokers is actually having much lower throughput. It could be the other way, one of them having a really high throughput. Can we detect this automatically, instead of having to figure it out after the fact? [00:21:00] And there are tons of algorithms for outlier detection, from statistics and now, more recently, machine learning.

And the algorithms differ based on whether they deal with one parameter at a time or multiple parameters at a time, univariate versus multivariate, and whether they can take temporal behavior into account or look more at a snapshot in time. A very, very simple technique, which actually works surprisingly well, is the z-score, where the main idea is to take…let’s say you have different brokers, or you have a hundred partitions in a topic; you can take any metric, it might be the bytes-in metric, and fit a Gaussian distribution to the data.

And anything that is actually a few standard deviations away is an outlier. Very, very simple technique. The problem with this technique is that it does assume a Gaussian distribution, which sometimes may not be the [00:22:00] case. In the case of brokers and partitions, that is usually a safe assumption to make. But if that technique doesn’t work, there are other techniques. One of the more popular ones that we have had some success with, especially when you’re looking at multiple time series at the same time, is DBSCAN. It’s basically a density-based clustering technique. I’m not going into all the details, but the key idea is, it basically uses some notion of distance to group points into clusters, and anything that doesn’t fall into a cluster is an outlier.
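A minimal sketch of both ideas on per-broker throughput, using NumPy for the z-score and scikit-learn for DBSCAN; the sample values, the 2-sigma cutoff, and the eps setting are made-up assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Bytes-in per broker over the last interval (made-up numbers).
bytes_in = np.array([510.0, 495.0, 120.0, 505.0, 500.0, 498.0])

# 1) Z-score: flag anything a few standard deviations from the mean.
z = (bytes_in - bytes_in.mean()) / bytes_in.std()
zscore_outliers = np.where(np.abs(z) > 2.0)[0]
print("z-score outliers:", zscore_outliers)

# 2) DBSCAN: density-based clustering; points assigned to no cluster
#    (label -1) are outliers. eps is the distance that defines "dense".
labels = DBSCAN(eps=25.0, min_samples=2).fit_predict(bytes_in.reshape(-1, 1))
dbscan_outliers = np.where(labels == -1)[0]
print("DBSCAN outliers:", dbscan_outliers)
```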

Then there are tons of other very interesting techniques, for example using binary trees to find outliers, called isolation forests. And in the deep learning world, there is a lot of interesting work happening with autoencoders, which try to learn a representation of the data. And again, once you’ve learned the representation from all the training data that is available, things that don’t fit the representation are outliers.

So this is the template I’m going to [00:23:00] follow in the rest of the talk. Basically, the template is: I pick a technique. Next, I’m gonna look at forecasting and give you some example use cases where having a good forecasting technique can help you in that Kafka DevOps world, and then tell you about some of these techniques. So for forecasting, there are two places where having a good technique makes a lot of difference. One is avoiding this reactive firefighting. When you have something like a latency SLA, if you could get a sense that things are backing up, and there’s a chance that in a certain amount of time the SLA will get missed, then you can actually take action quickly.

It could even be bringing on more brokers and things like that, or basically doing some partition reassignment and whatnot. But getting a heads up is often very, very useful. The other is more long-term forecasting for things like capacity planning. So I’m gonna use actually a real-life example here, an example that one of our telecom customers actually worked with, [00:24:00] where it was a sentiment analysis, sentiment extraction use case based on tweets. So the overall architecture consisted of tweets coming in real time into Kafka, and then the computation was happening in Spark Streaming with some state actually being stored in a database.

So here’s a quick screenshot of how things actually play out there. In sentiment analysis, especially for customer service and customer support use cases, there’s some kind of SLA. In this case, their SLA was around three minutes. What that means is, from the time you learn about a particular incident, if you will, which you can think of as these tweets coming into Kafka, within a three-minute interval the data has to be processed. What you are seeing on screen here: all those green bars represent the rate at which data is coming in, and the black line in between is the time series indicating the actual delay, [00:25:00] the end-to-end delay, the processing delay between data arrival and data being processed. And there is an SLA of three minutes.

So you can see the line is trending up, and if a good forecasting technique is applied to this line, you can actually forecast that within a certain interval of time, maybe a few hours, maybe even less than that, at the rate at which this latency is trending up, my SLA will get missed. And it’s a great use case for having a good forecasting technique. So again, forecasting is another area that has been very well studied, and there are tons of interesting techniques. On the time-series forecasting side of a more statistical bent, there are techniques like ARIMA, which stands for autoregressive integrated moving average. And there are lots of variants around that, which use the trend in the data, differences between data elements, and patterns to forecast, with smoothing and taking historic data into account, and all of that good stuff.

On the other [00:26:00] extreme, there have been a lot of, like I said, advances recently in using neural networks, because time-series data is one thing that is very, very easy to get. So there’s the long short-term memory, the LSTM, and recurrent neural networks, which have been pretty good at this. But we have actually had a lot of, I would say, success with a technique that was originally something Facebook released as open source, called the Prophet algorithm, which is not very different from ARIMA and the older family of forecasting techniques. It differs in some subtle ways.

The main idea here is what is called a generalized additive model. I put in a very simple example here. So the idea is to model this time series, whichever time series you are picking, as a combination of components: extract out the trend in the time-series data, then the seasonality, maybe there is a yearly seasonality, maybe there’s monthly seasonality, weekly [00:27:00] seasonality, maybe even daily seasonality. This is a key thing: I’ve used the term shocks. So if you’re thinking about forecasting in an ecommerce setting, Black Friday or the Christmas days, these are actually times when the time series will have a very different behavior.

So in Kafka, or in the operational context, if you are rebooting, if you are installing a new patch or upgrading, these often end up shifting the patterns of the time series, and they have to be explicitly modeled. Otherwise, the forecasting can go wrong, and then the rest is error. Prophet has actually worked really well for us, apart from the ones I mentioned. It fits quickly and all of that good stuff, but it is very, I would say, customizable, where your domain knowledge about the problem can be incorporated, instead of it being something that just gives you a result and then all that remains is parameter tuning.

So Prophet is something I’ll definitely ask all of you [00:28:00] to take a look at. The defaults work relatively well, but forecasting is something where we have seen that you can’t just trust the machine alone. It needs some guidance, it needs some data science, it needs some domain knowledge to be put along with the machine learning to actually get good forecasts. So that’s the forecasting. We saw how outlier detection applies, we saw forecasting. Now let’s get into something even more interesting: detecting anomalies.
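A minimal sketch of fitting Prophet to a latency time series like the one above and flagging an upcoming SLA breach; the input file, the seasonalities, the six-hour horizon, and the SLA value (latency assumed to be recorded in seconds) are assumptions:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Time series of end-to-end processing delay, one row per minute,
# with Prophet's required column names: ds (timestamp) and y (value).
df = pd.read_csv("latency_minutes.csv")  # assumed file with columns ds, y

m = Prophet(daily_seasonality=True, weekly_seasonality=True)
m.fit(df)

# Forecast the next 6 hours at 1-minute granularity.
future = m.make_future_dataframe(periods=6 * 60, freq="min")
forecast = m.predict(future)

# yhat is the forecast; yhat_lower/upper give the uncertainty interval.
# Flag the first future point where the forecast crosses the 3-minute SLA.
SLA_SECONDS = 180
upcoming = forecast.tail(6 * 60)
breach = upcoming[upcoming["yhat"] > SLA_SECONDS].head(1)
print(breach[["ds", "yhat", "yhat_lower", "yhat_upper"]])
```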

So what is an anomaly? You can think of an anomaly as an unexpected change; something that, if you were expecting it, would not be an anomaly. It’s something unexpected that needs your attention. That’s what I’m gonna characterize as an anomaly. Where can it help? Actually, smart alerts: alerts where you don’t have to configure thresholds and all of those things, and worry about your workload changing or new upgrades happening, and all of that stuff. Wouldn’t it be great if these anomalies can [00:29:00] be auto-detected? But that’s also very challenging, by no means trivial, because if you’re not careful, then your smart alerts will turn out to be really dumb. And you might get a lot of false alerts, and that way you lose confidence, or it might miss real problems that are happening.

So anomaly detection is something which is pretty hard to get right in practice. And here’s one simple but very illustrative example. With Kafka, you always see lags. So here, what I’m plotting is an increasing lag. Is that really an expected one? Maybe there are bursts in data arrival, and maybe these lags build up at some point of time every day; maybe it’s okay. When is it an anomaly that I really need to take a look at? So that’s when these anomaly detection techniques become very important.

Many [00:30:00] different schools of thought exist on how to build a good anomaly detection technique, including the ones I talked about earlier with outlier detection. One approach that has worked well in the past for us is, you can really model anomalies as: I’m forecasting something based on what I know, and if what I see differs from the forecast, then that’s an anomaly. So you can pick your favorite forecasting technique, or the one that actually worked, ARIMA, or Prophet, or whatnot, use that technique to do the forecasting, and then deviations become interesting and anomalous.

What I’m showing here is a simple example of the technique, and apart from Prophet, a more common one that we have seen actually work relatively well is this thing called STL. It stands for Seasonal and Trend decomposition using Loess, a smoothing function. So you have the time series; extract out and separate the trend from it first, so that leaves [00:31:00] the time series without the trend, and then extract out all the seasonalities. And once you have done that, whatever remainder, or residual as it’s called, is left, even if you just put a threshold on that, it’s reasonably good. I wouldn’t say it’s perfect, but reasonably good at extracting out these anomalies.
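A minimal sketch of the residual-threshold idea using statsmodels’ STL implementation; the synthetic series, the daily period, and the 3-sigma threshold are assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic hourly consumer lag with a daily cycle, plus an injected spike.
rng = pd.date_range("2019-05-01", periods=24 * 14, freq="h")
lag = 100 + 30 * np.sin(2 * np.pi * np.arange(len(rng)) / 24) + np.random.normal(0, 5, len(rng))
lag[200] += 150  # anomalous spike
series = pd.Series(lag, index=rng)

# Decompose into trend + seasonal + residual, then threshold the residual.
result = STL(series, period=24, robust=True).fit()
resid = result.resid
threshold = 3 * resid.std()
anomalies = series[np.abs(resid) > threshold]
print(anomalies)
```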

Next one, correlation analysis, getting even more complex. So basically, once you have detected an anomaly, or you have a problem that you want to root cause: why did it happen? What do you need to do to fix it? Here’s a great example. I saw my anomaly, something shot up, maybe it’s the lag that is actually building up; it definitely looks anomalous. And now what you really want is…okay, maybe we addressed why the baseline is much higher, but can I root cause it? Can I pinpoint what is causing that? Is it just the fact that there was a burst of data, or something else, maybe resource allocation issues, maybe some hotspotting on the [00:32:00] brokers?

And here, you can start to apply time-series correlation: which lower-level time series correlates best with the higher-level time series where your application latency increased? The challenge here is that there are hundreds, if not thousands, of time series if you look at Kafka; every broker has so many kinds of time series it can give you from every level, and all of these quickly add up. So it's a pretty hard problem. If you just throw time-series correlation techniques at it, even time series which merely share some trend will look correlated. So you have to be very careful.

The things to keep in mind are things like picking a good similarity function across time series. For example, you could use something like Euclidean distance, which is a straightforward, well-defined, well-understood distance function between points or between time series. We have had a lot of success with something called dynamic time warping, which is very good at dealing with time series that might be slightly out of [00:33:00] sync. If you remember the things Nate mentioned, in the Kafka world and the streaming world, an asynchronous, event-driven world, you just don't see all the time series nicely synchronized. So time warping is a good technique to extract distances in such a context.
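
To make the comparison concrete, here is a small, unoptimized sketch of dynamic time warping next to Euclidean distance on two synthetic, slightly shifted series; in practice a library implementation with a warping-window constraint would be used for speed.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic time warping distance between two 1-D series.

    Unlike Euclidean distance, DTW tolerates series that are slightly out of
    sync, which is common for metrics collected asynchronously from different
    brokers and consumers.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Two versions of the same signal, one shifted by a fraction of a cycle.
t = np.linspace(0, 10, 200)
latency = np.sin(t)
broker_metric = np.sin(t - 0.5)

print("Euclidean:", np.linalg.norm(latency - broker_metric))
print("DTW:      ", dtw_distance(latency, broker_metric))
```

Euclidean distance compares points strictly index by index, while DTW is free to align each point with a slightly earlier or later point in the other series, which is what makes it attractive for out-of-sync streaming metrics.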

But by far the bigger part, just like you saw with Prophet, is that instead of just throwing some machine learning technique at the problem and praying that it works, you have to really try to understand the problem. The way we have tried to break it down into something usable is, for a lot of these time series, you can split the space into time series that are related to application performance and time series that are related to resources and contention, and then apply correlation within these buckets. So try to scope the problem.

Last technique: model learning. And this turns out to be the hardest. But if you have a good modeling technique, a good model that can answer what-if questions, then for things like what we saw in the previous talk with [00:34:00] all of those partition reassignments and whatnot, you can actually find out what a good reassignment is and what impact it will have. Or, as Nate mentioned, consumer timeouts and rebalancing storms can actually kick in, so what's a good timeout?

In a lot of these places you have to pick some threshold or some strategy, and having a model that can quickly tell which what-if option is better, or can even rank them, can be very, very useful and powerful. This is the thing that is needed for enabling action automatically. Here's a quick example. There's a lag. We can figure out where the problem is actually happening. But then wouldn't it be great if something suggested how to fix it? Increase the number of partitions from X to Y, and that will fix the problem.
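
Purely as a toy illustration of such a what-if model (not Unravel's actual approach), the sketch below fits a regression on a made-up training set and then ranks candidate partition counts; the features and the synthetic latency relationship are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: (partition count, input rate msgs/s) -> p99 latency in ms.
# In a real system these rows would come from observed runs across many configurations.
rng = np.random.default_rng(0)
partitions = rng.integers(4, 64, 500)
rate = rng.uniform(1e3, 1e5, 500)
X = np.column_stack([partitions, rate])
y = rate / partitions * 0.01 + rng.normal(0, 5, 500)  # toy latency relationship

model = GradientBoostingRegressor().fit(X, y)

# What-if: at the current input rate, what happens as we change the partition count?
current_rate = 5e4
for p in (8, 16, 32):
    pred = model.predict([[p, current_rate]])[0]
    print(f"{p:>3} partitions -> predicted p99 latency ~{pred:.1f} ms")
```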

The model, at the end of the day, is a function that you're fitting to the data. I don't have time to go into this in great detail, but you have to carefully pick the right [00:35:00] input features, and what is very, very important is to ensure that you have the right training data. For some problems, just collecting data from the production cluster gives good training data. Sometimes it does not, because you have only seen certain regions of the space. So with that, I'm actually going to wrap up.

So what we tried to do in this talk, as much as we can in a 40-minute talk, is to give you a sense of all of these interesting Kafka DevOps and streaming application challenges, and how to map them to something where you can use machine learning or some elements of AI to make your life easier, or at least guide you so that you're not wasting a lot of time looking for a needle in a haystack. There is this area called AIOps, which is very interesting, trying to bring AI and ML together with data to solve these DevOps challenges. We have a booth here; please drop by to see some of the techniques we have. And yes, if you're interested in working on these interesting challenges, streaming, building [00:36:00] these applications, or applying techniques like this, we are hiring. Thank you.

Okay. Thanks, guys. So we actually have about three or four minutes for questions, a mini Q&A.

Attendee 1:

I know nothing about machine learning, but I can see it really helping me with debugging problems. As a noob, how can I get started?

Shivnath:

I can take that question. So the whole point of this talk gets at that question. On one hand, if you go to a Spark Summit, you'll see a lot of machine learning algorithms and this and that, right? If you come to a Kafka Summit, then it's all about Kafka and DevOps challenges. How do these worlds meet? That's exactly the thing that we tried to cover in the talk. If you were looking at a lot of the techniques, they're not fancy machine learning techniques. Our experience has been that once you understand the use case, there are fairly good techniques, even from statistics, that can solve the problem reasonably well.

And once you have that first bit of experience, then look for better and better techniques. So hopefully this talk gives you a sense of the techniques to just get started with; as you get a better understanding of the problem and the [00:38:00] data, you can actually improve and apply more of those deep learning techniques and whatnot. But most of the time, you don't need that for these problems. Nate?

Nate:

No. Absolutely same thing.

Attendee 2:

Hi, Shiv, nice talk. Thank you. Quick question: for those machine learning algorithms, can they be applied across domains? Or if you are moving from DevOps for Kafka to Spark Streaming, for example, do you have to hand-pick those algorithms and tune the parameters again?

Shivnath:

So the question is, how much do these algorithms we've built for Kafka just apply elsewhere? Do they apply to Spark Streaming? Would they apply to high-performance Impala, or Redshift for that matter? The hard truth is that no one size fits all. You can't just have… I mentioned outlier detection; that might be a technique that can be applied to any [00:39:00] load imbalance problem. But once you start getting into anomalies or correlation, some amount of expert knowledge about the system has to be combined with the machine learning technique to really get good results.

So again, it’s not as if it has to be all export rules, but some combination. So if you pick a financial space, a data scientist who exists, understands the domain, and knows how to work with data. Just like that even in the DevOps world, the biggest success will come if somebody understand receiver system, and has the knack of working with these algorithms reasonably well, and they can combine both of that. So something like a performance data scientist.

 

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

]]>
https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/feed/ 0
Using Unravel to tune Spark data skew and partitioning https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/ https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/#respond Wed, 22 May 2019 21:20:40 +0000 https://www.unraveldata.com/?p=2859

This blog provides some best practices for how to use Unravel to tackle issues of data skew and partitioning for Spark applications. In Spark, it’s very important that the partitions of an RDD are aligned with the […]

The post Using Unravel to tune Spark data skew and partitioning appeared first on Unravel.

]]>

This blog provides some best practices for how to use Unravel to tackle issues of data skew and partitioning for Spark applications. In Spark, it's very important that the partitions of an RDD are aligned with the compute resources available: Spark assigns one task per partition, and each core can process one task at a time.

By default, the number of partitions is set to the total number of cores on all the nodes hosting executors. Having too few partitions leads to less concurrency, processing skew, and improper resource utilization, whereas having too many leads to low throughput and high task scheduling overhead.
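
As a quick way to see how a job's partitioning compares with the parallelism actually available, a PySpark session exposes both numbers directly; the input path below is hypothetical, and the 2x multiplier is a common rule of thumb rather than a hard rule.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Total parallelism available to executors; too few partitions leaves cores idle.
cores = spark.sparkContext.defaultParallelism
print("defaultParallelism:", cores)

df = spark.read.parquet("/data/events")  # hypothetical input path
print("input partitions:  ", df.rdd.getNumPartitions())

# If the partition count is far below the available cores, repartition to restore
# concurrency (at the cost of a shuffle).
df = df.repartition(cores * 2)
```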

The first step in tuning a Spark application is to identify the stages which represent bottlenecks in the execution. In Unravel, this can be done easily by using the Gantt Chart view of the app's execution. We can see at a glance that there is a longer-running stage:

Using Unravel to tune Spark data skew and partitioning

Once I’ve navigated to the stage, I can navigate to the Timeline view where the duration and I/O of each task within the stage are readily apparent. The histogram charts are very useful to identify outlier tasks, it is clear in this case that 199 tasks take at most 5 minutes to complete however one task takes 35-40 min to complete!

Using Unravel to tune Spark data

When we select the first bucket of 199 tasks, another clear representation of the effect of this skew is visible within the Timeline: many executors are sitting idle:

Using Unravel to tune Spark data skew and partitioning

When we select the outlier bucket that took over 35 minutes to complete, we can see the duration of the associated executor is almost equal to the duration of the entire app:

Using Unravel to tune Spark data skew and partitioning

In the Graphs > Containers view, we can also observe the bursting of containers at the time the long-running executor started. Adding more partitions via repartition() can help distribute the data set more evenly among the executors.

Using Unravel to tune Spark data skew and partitioning

Unravel can provide recommendations for optimizations in some of the cases where join key(s) or group by key(s) are skewed.

In the below Spark SQL example two dummy data sources are used, both of them are partitioned.

The join operation between the customer and orders tables is on the cust_id column, which is heavily skewed. Looking at the code, it can easily be identified that key "1000" has the most entries in the orders table, so one of the reduce partitions will contain all the "1000" entries. In such cases we can apply some techniques to avoid skewed processing.

  • Increase the spark.sql.autoBroadcastJoinThreshold value so that the smaller "customer" table gets broadcast. This should be done only after ensuring there is sufficient driver memory.
  • If executor memory is sufficient, we can decrease spark.sql.shuffle.partitions to accommodate more data per reduce partition. This lets all the reduce tasks finish in more or less the same time.
  • If possible, find out which keys are skewed and process them separately using filters.

Let’s try the #2 approach

With Spark’s default spark.sql.shuffle.partitions=200.

Observe the lone task which takes much longer during the shuffle. That means the next stage can't start and executors are sitting idle.

Using Unravel to tune Spark data skew and partitioning

Now let’s change the spark.sql.shuffle.partitions = 10. As the shuffle input/output is well within executor memory sizes we can safely do this change.

Using Unravel to tune Spark data skew and partitioning

In real-life deployments, not all skew problems can be solved by configuration changes and repartitioning; some require modifying the underlying data layout. If the data source itself is skewed, then the tasks which read from it can't be optimized. Sometimes, at the enterprise level, changing the layout isn't possible because the same data source is used by different tools and pipelines.


Sam Lachterman is a Manager of Solutions Engineering at Unravel.

Unravel Principal Engineer Rishitesh Mishra also contributed to this blog. Take a look at Rishi's other Unravel blogs:

Why Your Spark Jobs are Slow and Failing; Part I Memory Management

Why Your Spark Jobs are Slow and Failing; Part II Data Skew and Garbage Collection

The post Using Unravel to tune Spark data skew and partitioning appeared first on Unravel.

]]>
https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/feed/ 0
Unravel Architect Solution Guide APM For Big Data Apps And Platforms https://www.unraveldata.com/resources/unravel-architect-solution-guide-apm-for-big-data-apps-and-platforms/ https://www.unraveldata.com/resources/unravel-architect-solution-guide-apm-for-big-data-apps-and-platforms/#respond Wed, 08 May 2019 03:36:59 +0000 https://www.unraveldata.com/?p=5208 abstract image with numbers

Thank you for your interest in the Unravel Architect Solutions Guide APM for Big Data Apps and Platforms. You can download it here.

The post Unravel Architect Solution Guide APM For Big Data Apps And Platforms appeared first on Unravel.

]]>
abstract image with numbers

Thank you for your interest in the Unravel Architect Solutions Guide APM for Big Data Apps and Platforms.

You can download it here.

The post Unravel Architect Solution Guide APM For Big Data Apps And Platforms appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-architect-solution-guide-apm-for-big-data-apps-and-platforms/feed/ 0
Unravel APM Solution Guide https://www.unraveldata.com/resources/unravel-apm-solution-guide/ https://www.unraveldata.com/resources/unravel-apm-solution-guide/#respond Fri, 15 Feb 2019 03:39:40 +0000 https://www.unraveldata.com/?p=5210 Cache

Thank you for your interest in the Unravel APM Solution Guide. You can download it here.

The post Unravel APM Solution Guide appeared first on Unravel.

]]>
Cache

Thank you for your interest in the Unravel APM Solution Guide.

You can download it here.

The post Unravel APM Solution Guide appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-apm-solution-guide/feed/ 0
Maximize Big Data Application Performance and ROI https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/ https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/#respond Tue, 29 Jan 2019 05:55:33 +0000 https://www.unraveldata.com/?p=1701

The post Maximize Big Data Application Performance and ROI appeared first on Unravel.

]]>

The post Maximize Big Data Application Performance and ROI appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/feed/ 0
Monitor and optimize data pipelines and workflows end-to-end https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/ https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/#respond Fri, 25 Jan 2019 11:00:57 +0000 https://www.unraveldata.com/?p=1682

How do you monitor, optimize, and really operationalize the data pipeline architecture and workflows in your modern data stack? There are different types of problems that occur every day when you’re running these workflows at scale. […]

The post Monitor and optimize data pipelines and workflows end-to-end appeared first on Unravel.

]]>

How do you monitor, optimize, and really operationalize the data pipeline architecture and workflows in your modern data stack? There are different types of problems that occur every day when you’re running these workflows at scale. How do you go about resolving those common issues?

Understanding Data Pipeline Architecture

First, let’s quickly understand what we mean by a data pipeline architecture or a workflow. A data pipeline and a workflow are interchangeable terms, and they are comprised of several stages that are stitched together to drive a recommendation engine, a report, a dashboard, etc. These data pipelines or workflows may then be scheduled on top of Oozie or AirFlow or BMC, etc., and may be running repeatedly maybe once every day, or several times during the day.

For example, if I’m trying to create a recommendation engine, I may have all of my raw data in separate files, separate tables, separate systems, that I may put together using Kafka, and then I may load this data into HDFS or S3, after which I’m doing some cleaning and transformation jobs on top of it, merging these data sets together, trying to create a single table out of this, or just trying to clean it up and make sure that all the data’s consistent across these different tables. And then I may run a couple of Spark applications on top, to create my model or analyze this data to create some reports and views. This end-to-end process is an example of a data pipeline or workflow.

data pipeline architecture

A lot of industries are already running data pipelines and workflows in production. In the telecom industry, we see a lot of companies using these pipelines for churn prevention. In the automotive industry, a lot of companies are using data pipeline architecture for predictive maintenance with IoT applications. For instance, data could be ingested through Kafka in real time, with Spark streaming applications running on top to predict faults or problems that may occur on the manufacturing line. A financial services company may use this for credit card fraud detection. Additional examples of data pipeline architecture include e-commerce companies' recommendation engines, and healthcare companies improving clinical decision making.

Data Pipeline Architecture Common Problems

Data pipelines are mission-critical applications, so getting them to run well, and getting them to run reliably, is very important. However, that's often not the case. When we create data pipelines and workflows on modern data stack systems, four common problems occur again and again.

  1. You may have runtime problems, meaning an application just won’t work.
  2. You may have configuration problems, where you are trying to understand how to set the many configuration settings to make these applications run as efficiently as possible.
  3. You may have problems with scale, meaning this application ran with 10 GB of data just fine, but what about when it needs to run with 100 GB or a petabyte? Will it still finish in the same amount of time?
  4. You may have multi-tenancy issues where potentially hundreds of users, and tens of thousands of applications, jobs, and workflows, are all running together in the same cluster. As a result, other applications may affect the performance of your application.

In this blog series, we will look into all these different types of problems in a little more depth and understand how they can be solved.

COMING SOON

The post Monitor and optimize data pipelines and workflows end-to-end appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/feed/ 0
Intelligent Application Performance Management (APM) Platform for the Modern Data Stack https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/ https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/#respond Fri, 25 Jan 2019 10:31:29 +0000 https://www.unraveldata.com/?p=1647

Unravel Data has unveiled a new set of automated actions to improve modern data stack operations and performance. The solution was designed with input from more than 100 enterprise customers and prospects to uncover their biggest […]

The post Intelligent Application Performance Management (APM) Platform for the Modern Data Stack appeared first on Unravel.

]]>

Unravel Data has unveiled a new set of automated actions to improve modern data stack operations and performance. The solution was designed with input from more than 100 enterprise customers and prospects to uncover their biggest modern data stack challenges and to make DataOps more proactive and productive by automating problem discovery, root-cause analysis, and resolution across the entire modern data stack, while improving ROI and time to value of Big Data investments.

The post Intelligent Application Performance Management (APM) Platform for the Modern Data Stack appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/feed/ 0
Skills Developers Need To Optimize Performance https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/ https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/#respond Fri, 25 Jan 2019 10:23:29 +0000 https://www.unraveldata.com/?p=1627

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Skills Developers Need To Optimize Performance appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “What skills do developers need to have to optimize the performance and ease of monitoring their applications?”

The post Skills Developers Need To Optimize Performance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/feed/ 0
Additional Considerations Regarding Performance And Monitoring https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/ https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:22:28 +0000 https://www.unraveldata.com/?p=1625

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Additional Considerations Regarding Performance And Monitoring appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “What have we failed to ask you that you think we need to consider with regards to performance and monitoring?”

The post Additional Considerations Regarding Performance And Monitoring appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/feed/ 0
Opportunities to Improve Performance and Monitoring https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/ https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:21:53 +0000 https://www.unraveldata.com/?p=1623

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Opportunities to Improve Performance and Monitoring appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “Where do you think the biggest opportunities are for improvement in performance optimization and monitoring?”

The post Opportunities to Improve Performance and Monitoring appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/feed/ 0
Common Issues With Performance and Monitoring https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/ https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:20:48 +0000 https://www.unraveldata.com/?p=1621

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Common Issues With Performance and Monitoring appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “What are the most common issues you see affecting performance optimization and monitoring?”

The post Common Issues With Performance and Monitoring appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/feed/ 0
Real-World Problems Solved By Performance Monitoring and Optimization https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/ https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/#respond Fri, 25 Jan 2019 10:14:27 +0000 https://www.unraveldata.com/?p=1619

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Real-World Problems Solved By Performance Monitoring and Optimization appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “What real-world problems are you, or your clients, solving with performance optimization and monitoring?”

We have an ad services client monitoring service quality by monitoring latency and the flow of ads being served. The path through the internet varies. It can be influenced and changed. It’s important to be aware of the path being taken.

The post Real-World Problems Solved By Performance Monitoring and Optimization appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/feed/ 0
Most Frequently Used Performance Tools https://www.unraveldata.com/resources/most-frequently-used-performance-tools/ https://www.unraveldata.com/resources/most-frequently-used-performance-tools/#respond Fri, 25 Jan 2019 10:10:21 +0000 https://www.unraveldata.com/?p=1617

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

The post Most Frequently Used Performance Tools appeared first on Unravel.

]]>

To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

Here’s what they told us when we asked, “What technical solutions do you use beyond your own?”

Standard tools. APM = AppDynamics and New Relic. It’s open source and commercial. For synthetic, it’s Catchpoint and Dynatrace. We don’t see our clients using one overriding set of solutions.

The post Most Frequently Used Performance Tools appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/most-frequently-used-performance-tools/feed/ 0
2017 Application Performance Management Predictions – Part 4 https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/ https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/#respond Fri, 25 Jan 2019 09:53:37 +0000 https://www.unraveldata.com/?p=1592

APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial […]

The post 2017 Application Performance Management Predictions – Part 4 appeared first on Unravel.

]]>

APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM and related technologies will evolve and impact business in 2017. Part 4 covers cloud, containers and microservices.

The post 2017 Application Performance Management Predictions – Part 4 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/feed/ 0
2017 Application Performance Management Predictions – Part 2 https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/ https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/#respond Fri, 25 Jan 2019 09:50:40 +0000 https://www.unraveldata.com/?p=1589

APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial […]

The post 2017 Application Performance Management Predictions – Part 2 appeared first on Unravel.

]]>

APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM and related technologies will evolve and impact business in 2017. Part 2 covers the expanding scope of Application Performance Management (APM) and Network Performance Management (NPM).

The post 2017 Application Performance Management Predictions – Part 2 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/feed/ 0
Organizations Turn Focus to Data Reliability, Security and Governance https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/ https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/#respond Fri, 25 Jan 2019 09:43:58 +0000 https://www.unraveldata.com/?p=1575

With the rapidly growing investments in data analytics, many organizations complain they lack a combined view of all the information being created. At the recent Strata & Hadoop World Conference in New York, Information Management asked […]

The post Organizations Turn Focus to Data Reliability, Security and Governance appeared first on Unravel.

]]>

With the rapidly growing investments in data analytics, many organizations complain they lack a combined view of all the information being created. At the recent Strata & Hadoop World Conference in New York, Information Management asked Kunal Agarwal, chief executive officer at Unravel Data, what this means for the industry, and his company.

The post Organizations Turn Focus to Data Reliability, Security and Governance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/feed/ 0
Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? https://www.unraveldata.com/resources/uk-research-big-data/ https://www.unraveldata.com/resources/uk-research-big-data/#respond Fri, 25 Jan 2019 08:10:37 +0000 https://www.unraveldata.com/?p=1515

Ahead of Big Data London in November, Unravel Data worked with Sapio Research to find out how business leaders really feel about big data and how they are doing in executing against their deployment plans. We’ve […]

The post Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? appeared first on Unravel.

]]>

Ahead of Big Data London in November, Unravel Data worked with Sapio Research to find out how business leaders really feel about big data and how they are doing in executing against their deployment plans.

We’ve heard first hand from many of the UK’s largest financial services, telecommunications, and technology enterprises that they are facing significant challenges in harnessing the power of data to transform their businesses and create more compelling customer experiences. This new research suggests there is plenty of optimism from businesses about putting data to work. Yet addressing operational challenges as they morph and evolve is a pressing concern.

Although 84 percent of respondents claim their big data projects usually deliver on expectations, only 17 percent currently rate the performance of their modern data stack as ‘optimal’. There are also stark differences between how managers and senior teams see their modern data stacks: 13 percent of VPs, directors and C-suite members report their stack only meets half of its KPIs, but more than double the number of managers (29%) say the same.

It also seems businesses aren’t yet using modern data stack applications to grow their businesses, instead seeing protection and compliance as the most worthwhile goals. The top four most valuable and effective uses of big data currently, according to business leaders, are:

  • Cybersecurity intelligence (42%);
  • Risk, regulatory, compliance reporting (41%);
  • Predictive analytics for preventative maintenance (35%);
  • Fraud detection and prevention (35%).

Further findings from the research:

  • According to survey respondents, the two greatest aspirations of big data teams are delivering on the promise of big data and improving big data performance
  • Lack of skills was cited as the top challenge (49%) to big data success in the enterprise, followed closely by the challenge of handling data volume, variety, and velocity (44%)
  • Lack of big data architects (45%) and big data engineers (43%) top the list of the most pressing skills that are lacking
  • Data analysis is the top priority for improvement, cited by 43% of respondents. This is followed by data transformation (39%) and data visualization (37%)
  • Cost reduction is the biggest expected benefit for big data applications, cited by 41% of respondents, followed by faster application release timings (37%)
  • 78% of organizations are already running big data workloads in the cloud, and 82% have a strategy to move existing big data applications into the cloud
  • Only one in five have an ‘all cloud’ strategy, with more than half (54%) using a mix of cloud and on-premise applications
  • 99% of business leaders report that their big data projects deliver on business goals at least ‘some of the time’

The top level insight that we can derive from this primary research is that most organizations believe in the promise of big data – but the operational challenges are holding back enterprises from realising the full potential. This is due to a combination of factors, most notably performance and a shortage of experienced talent.

The challenge now is to ensure the modern data stack performs reliably and efficiently, and that big data teams have the tools and expertise to deliver the next generation of applications, analytics, AI and Machine Learning. This is the Unravel mission: To radically simplify the tuning and optimization of modern data stack applications and infrastructure.

Unravel provides a single view of performance across platforms, whether on-premises or in the cloud, and delivers automated insights and recommendations through machine learning to ensure SLAs are met and operational teams are as efficient as possible.

The full research report will be available in December.

The survey was conducted with 200 IT decision makers involved in big data in organizations with over 1,000 employees. The interviews were conducted in October 2018 via email and online panels.

The post Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/uk-research-big-data/feed/ 0
Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/ https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/#respond Fri, 25 Jan 2019 07:55:02 +0000 https://www.unraveldata.com/?p=1487

While many enterprises across a wide range of verticals—finance, healthcare, technology—build applications on the modern data stack, some are not fully aware of the application performance management (APM) and operational challenges that often arise. Over the […]

The post Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 appeared first on Unravel.

]]>

While many enterprises across a wide range of verticals—finance, healthcare, technology—build applications on the modern data stack, some are not fully aware of the application performance management (APM) and operational challenges that often arise. Over the course of a two-part blog series, we’ll address the requirements at both the individual application level, as well as holistic clusters and workloads, and explore what type of architecture can provide automated solutions for these complex environments.

The Modern Data Stack and Operational Performance Requirements

Composed of multiple distributed systems, the modern data stack in an enterprise typically goes through the following evolution:

  • Big data ETL: storage systems, such as Azure Blob Store (ABS), house the large volumes of structured, semi-structured and unstructured data. Distributed processing engines, like MapReduce, come in for data extraction, cleaning and transformation of the data.
  • Big data BI: MPP SQL systems, such as Impala, are added to the stack, which power the interactive SQL queries that are common in BI workloads.
  • Big data science: with maturity of use of a modern data stack, more workloads leverage machine learning (ML) and artificial intelligence (AI).
  • Big data streaming: Systems such as Kafka are added to the modern data stack to support applications that ingest and process data in a continuous streaming fashion.

Evolution of the modern data stack in an enterprise

Industry analysts estimate that there are more than 10,000 enterprises worldwide running applications in production on a modern data stack, comprised of three or more distributed systems.

Naturally, performance challenges arise, related to failures, speed, SLAs, or behavior. Typical questions that are nontrivial to answer include:

  • What caused an application to fail and how can I fix it?
  • This application seems to have made little progress in the last hour. Where is it stuck?
  • Will this application ever finish, or will it finish in a reasonable time?
  • Will this application meet its SLA?
  • Does this application behave differently than in the past? If so, in what way and why?
  • Is this application causing problems in my cluster?
  • Is the performance of this application being affected by one or more other applications?

Many operational performance requirements are needed at the “macro” level compared to the level of individual applications. These include:

  • Configuring resource allocation policies to meet SLAs in multi-tenant clusters
  • Detecting rogue applications that can affect the performance of SLA-bound applications through a variety of low-level resource interactions
  • Configuring the hundreds of configuration settings that distributed systems are notoriously known for having in order to get the desired performance
  • Tuning data partitioning and storage layout
  • Optimizing dollar costs on the cloud
  • Capacity planning using predictive analysis in order to account for workload growth proactively

The architecture of a Performance Management Solution

Addressing performance challenges requires a sophisticated architecture that includes the following components:

  • Full data stack collection: to answer questions like what caused this application to fail or will this application ever meet its SLA, monitoring data will be needed from every level of the stack. However, collecting such data in a non-intrusive or low-overhead manner from production clusters is a major technical challenge.
  • Event-driven data processing: deployments can generate tens of terabytes of logs and metrics every day, which can present variety and consistency challenges, thus calling for the data processing layer to be based on event-driven processing algorithms whose outputs converge to the same final state irrespective of the timeliness and order in which the monitoring data arrives.
  • Machine learning-driven insights and policy-driven actions: Enabling all of the monitoring data to be collected and stored in a single place allows for statistical analysis and machine learning to be applied to the data. Such algorithms can generate insights that can then be applied to address the performance requirements.

Architecture of a performance management platform for Modern data applications

There are several technical challenges involved in this process. Examples include:

  • What is the optimal way to correlate events across independent systems? In other words, if an application accesses multiple systems and the events pass through them, how can end-to-end correlation be done?
  • How can we deal with telemetry noise and data quality issues which are common in such complex environments? For example, records from multiple systems may not be completely aligned and may not be standardized.
  • How can we train ML models in closed-loop, complex deployments? How can we split training and test data?

Now that we’ve tackled the modern data stack and reviewed operational challenges and requirements for a performance management solution to meet those needs, we look forward to discussing the features that a solution should offer in our next post.

The post Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/feed/ 0
Architecting APM Solution for Modern Data Applications — Part 2 https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/ https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/#respond Fri, 25 Jan 2019 07:37:26 +0000 https://www.unraveldata.com/?p=1468

Architecting Automated Performance Management (APM) Solution In the last APM blog, we broke down the modern data stack, reviewed some of the stack’s biggest operational challenges, and explained what’s required for a performance management solution to […]

The post Architecting APM Solution for Modern Data Applications — Part 2 appeared first on Unravel.

]]>

Architecting Automated Performance Management (APM) Solution

In the last APM blog, we broke down the modern data stack, reviewed some of the stack's biggest operational challenges, and explained what's required for a performance management solution to meet those needs. In this piece, we'll dive deeper into the features that solutions should offer in order to support modern data stacks comprised of multiple distributed systems. Specifically, we'll look at two of the main components these solutions should deliver: diagnosing application failures and optimizing clusters.

Application Failures

Apps break down for many different reasons, particularly in highly distributed environments. Traditionally, it’s up to the users to identify and fix the cause of the failure and get the app running smoothly again. Applications in distributed systems are comprised of many components, resulting in lots of raw logs appearing when failure occurs. With thousands of messages, including errors and stack traces, these logs are highly complex and can be impossible for most users to comb through. It’s incredibly difficult and time-consuming even for experts to discover the root cause of app failure through all of this noise. To generate insights into a failed application in a multi-engine modern data stack, a performance management solution needs to deliver the following:

  • Automatic identification of the root cause of application failure: This is one of the most important tools to mitigate challenges. The solution should continuously collect logs from a variety of application failures, convert those logs into feature vectors, then learn a predictive model for root cause analysis (RCA) from these feature vectors. Figure X1 shows a complete solution to deal with such problems and automate RCA for application failures.
  • Data collection for training: We all know the chief axiom of analytics: garbage in, garbage out. With this principle in mind, RCA models must be trained on representative input data. For example, if you’re performing RCA on a Spark failure, it’s imperative that the solution has been trained on log data from actual Spark failures across many different deployments.
  • Accurate model prediction: ML techniques for prediction can be classified as supervised and unsupervised learning; we use both in our solutions. We attach root-cause labels to the logs collected from an application failure. Labels come from a taxonomy of root causes such as the one illustrated in Figure X2. Once the logs are available, we can construct feature vectors using, for example, the Doc2Vec technique, which uses a three-layer neural network to gauge the context of the document and relate similar content together. After the feature vectors are generated along with a label, we can apply a variety of learning techniques for automatic RCA, from shallow to deep learning, including random forests, support vector machines, Bayesian classifiers, and neural networks (a toy sketch follows this list). Example results produced by our solution are shown in Figure X3.
  • Automatic fixes for failed applications: This capability uses an algorithm to automatically find and implement solutions to failing apps. This algorithm is based on data from both successful and failed runs of a given app. Providing automatic fixes is one of the most critical features of a performance management solution as it saves users the time and headache of manually troubleshooting broken apps. Figure X4 shows an automated tuning process of a failed Spark application.
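
The toy sketch below shows the shape of such a pipeline using gensim's Doc2Vec and a scikit-learn random forest; the log tokens and labels are invented and tiny, so it illustrates the mechanics only, not the production-scale training process described above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

# Invented examples: tokenized log lines from failed runs plus a root-cause label.
# A real pipeline would train on thousands of labeled failures from many deployments.
failures = [
    (["java", "lang", "outofmemoryerror", "gc", "overhead", "limit"], "out_of_memory"),
    (["container", "killed", "by", "yarn", "exceeding", "memory", "limits"], "container_memory"),
    (["filenotfoundexception", "input", "path", "does", "not", "exist"], "missing_input"),
]

# Learn fixed-length feature vectors from the variable-length log documents.
docs = [TaggedDocument(words=w, tags=[i]) for i, (w, _) in enumerate(failures)]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

X = [d2v.infer_vector(words) for words, _ in failures]
y = [label for _, label in failures]

# Any standard classifier can sit on top of the vectors; a random forest is shown here.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_log = ["gc", "overhead", "limit", "exceeded"]
print(clf.predict([d2v.infer_vector(new_log)])[0])
```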

Figure X1: Approach for automatic root cause analysis

Figure X2: Taxonomy of failures

Figure X3: Feature vector generation

Figure X4: Automated tuning of a failed Spark application

Cluster Optimization

In addition to understanding app failures, the other key component of an effective big data performance management solution is cluster optimization. This includes performance management, autoscaling, and cost optimization. When properly delivered, these objectives benefit both IT operations and DevOps teams. However, when dealing with distributed big data, it’s not easy to execute cluster-level workload analysis and optimization. Here’s what we need to do:

  • Operational insights: This refers to the analysis of metrics data to yield actionable insights. This includes insights on app performance issues (i.e., whether a failure is due to bad code or resource contention), insights on cluster tuning based on aggregation of app data (i.e., finding out whether a compute cluster is properly tuned at both the cluster and application level), and insights on cluster utilization, cloud usage, and autoscaling. We also provide users with tools to help them understand how they are using their compute resources, for example, comparing cluster activity between two time periods, viewing aggregated cluster workload, summary reports for cluster usage, chargeback reports, and so on. In addition, we provide cluster-level recommendations to fine-tune cluster-wide parameters and maximize a cluster's efficiency based on the cluster's typical workload. To do so, we (a) collect performance data of prior completed applications, (b) analyze the applications with respect to the cluster's current configuration, (c) generate recommended cluster parameter changes, and (d) predict and quantify the impact that these changes will have on applications that will execute in the future. Figure Y1 shows example recommendations for tuning the size of map containers (top) and reduce containers (bottom) on a production cluster, and in particular the allocated memory in MB.
  • Workload analysis: This takes a closer look at queue usage to provide a holistic picture of how apps run and how they affect one another on a set of clusters. Workload analysis highlights usage trends, sub-optimal queue designs, workloads that run sub-optimally on queues, convoys, slow apps, problem users (e.g., users who frequently run applications that reach max capacity for a long period), and queue usage per application type or user project. Figure Y2 shows example resource utilization charts for two queues over a time range. In one queue, the running workload does not use all the resources allocated, while the workload in the other queue needs more resources. In this figure, the purple line shows resource requests and the black line shows resource allocations. Based on such findings, we generate queue-level insights and recommendations, including queue design/settings modifications (e.g., change the resource budget for a queue or its max/min limits), workload reassignment to different queues (e.g., move an application or a workload from one queue to another), queue usage forecasting, etc. A typical big data deployment involves hundreds of queues, and without automation such a task can be tedious.
  • Forecasting: Forecasting leverages historical operational and application metrics to determine future needs for provisioning, usage, cost, job scheduling, and other attributes. This is important for capacity planning. Figure Y3 shows an example disk capacity forecasting chart; the black line shows actual utilization and the light blue line shows a forecast within an error bound.

Figure Y1: Example set of cluster-wide recommendations

Figure Y2: Workload analysis

Figure Y3: Disk capacity forecasting

Solving the Biggest Big Data Challenges

Modern data applications face a number of unique performance management challenges. These challenges exist at the individual application level, as well as at the workload and cluster levels. To solve these problems at every level, a performance management solution needs to offer a range of capabilities to tell users why apps fail and enable them to optimize entire clusters. As organizations look to get more and more out of their modern data stack—including the use of artificial intelligence and machine learning—it’s imperative that they can address these common issues.

For more technical details, we refer the interested reader to our research paper to be presented at the biennial Conference on Innovative Data Systems Research (CIDR 2019).

The post Architecting APM Solution for Modern Data Applications — Part 2 appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/feed/ 0
Why is APM a Must-Have for Big Data Self-Service Analytics? https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/ https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/#respond Fri, 25 Jan 2019 07:22:49 +0000 https://www.unraveldata.com/?p=1465

Self-Service Big Data Analytics is the key to enabling a data driven organization. Enterprises understand the value of this capability and are investing heavily in technology, tools, processes and people to make this a reality. However, […]

The post Why is APM a Must-Have for Big Data Self-Service Analytics? appeared first on Unravel.

]]>

Self-Service Big Data Analytics is the key to enabling a data driven organization. Enterprises understand the value of this capability and are investing heavily in technology, tools, processes and people to make this a reality. However, as rightfully pointed out by the recent Gartner report on big data self-service analytics and BI, several challenges remain. While I agree with many of the assessments from Gartner, I also see some key requirements for enabling a self-serve analytics paradigm:

1. (Self-Serve) Platform & Infrastructure

We need modern Data Management platforms to be agile to address the self-serve needs. We are headed towards a model of “Query-As-a-Service” where end users expect the right infrastructure and analytic frameworks (batch, real-time, streaming etc.) provisioned appropriately to support their analytic needs, regardless of scale and complexity.

2. Security/Governance

Modern data stack platforms are democratizing access to data. We now have the capability to store all of the data in a single scalable platform and make it accessible for a wide range of analysis. This presents both an opportunity and challenge. Enterprises need to ensure that they have the right “controls” and “governance” structures in place, so only the right data is accessible for the right analysis and relevant users.

3. Self-Service Application Performance Management

While this is often an afterthought, end users should have an easy way to understand and rationalize the performance of their analytic applications, and in many cases also be able to optimize them for the desired SLA. In the absence of this "layer", end users spend a lot of time depending on their IT organization to do these tasks for them, or jump through hoops to get the appropriate information (logs, metrics, configuration data) to understand their application performance. This slows down or defeats the whole self-service approach!

At Unravel, we are squarely focused on solving the Self-Service Application Performance Management (APM) challenges for modern analytic applications built on modern data stack platforms. Unravel is built ground-up to provide a complete 360 degree view of applications and provide insights and recommendation on what needs to be done to improve performance and address failures associated with these applications – all in a self-serve fashion.

Unravel deployment is streamlined to integrate easily with a self-serve platform/infrastructure, and Unravel integrates with existing security layers both at the infrastructure level (e.g., Kerberos) and the platform level (e.g., Apache Sentry, Apache Ranger). Another unique capability in Unravel is role-based access control (RBAC) for users. This feature enables end users to view only their own applications or those related to their business organization (mapped to a queue, project, tenant, department, etc.).

You can learn more about how Unravel can enable your organization to achieve the self-serve analytics paradigm by visiting us at www.unraveldata.com or meeting us at the Strata San Jose 2018 Conference at booth #1329.


The post Why is APM a Must-Have for Big Data Self-Service Analytics? appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/feed/ 0
Fighting for precious resources on a multi-tenant big data cluster https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/ https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/#respond Fri, 25 Jan 2019 07:16:23 +0000 https://www.unraveldata.com/?p=1449

Over the coming few weeks, we’ll provide some practical technical insights and advice on typical problems we encounter in working with many complex data pipelines. In this blog, we’ll talk about multi-tenant cluster contention issues. Check […]

The post Fighting for precious resources on a multi-tenant big data cluster appeared first on Unravel.

]]>

Over the coming few weeks, we’ll provide some practical technical insights and advice on typical problems we encounter in working with many complex data pipelines. In this blog, we’ll talk about multi-tenant cluster contention issues. Check back in for future topics.

Modern data clusters are becoming increasingly commonplace and essential to your business. However, a wide variety of workloads typically run on a single cluster, making it a nightmare to manage and operate, similar to managing traffic in a busy city. We feel the pain of the operations folks out there who have to manage Spark, Hive, Impala, and Kafka apps running on the same cluster, where they have to worry about each app's resource requirements, the time distribution of the cluster workloads, and the priority levels of each app or user, and then make sure everything runs like a predictable, well-oiled machine.

We at Unravel have a unique perspective on this kind of thorny problem, since we spend countless hours, day in and day out, studying the behavior of giant production clusters to discover how to improve their performance, predictability, and stability, whether it is a thousand-node Hadoop cluster running batch jobs or a five-hundred-node Spark cluster running AI, ML, or advanced real-time analytics. Or, more likely, 1,000 nodes of Hadoop connected via a 50-node Kafka cluster to a 500-node Spark cluster for processing. What could go wrong?

As you are most likely painfully well aware, plenty! Here are some of the most common problems that we see in the field:

  • Oversubscribed clusters – too many apps or jobs to run, just not enough resources
  • Bad container sizing – too big or too small
  • Poor queue management – sizes of queues are inappropriate
  • Resource hogging users or apps – bad apples in the cluster

So how do you go about solving each of these issues?

Measure and Analyze

To understand which of the above issues is plaguing your cluster, you must first understand what's happening under the hood. Modern data clusters have several precious resources that the operations team must constantly watch. These include memory, CPU, and the NameNode. When monitoring these resources, make sure to measure both the total available and the amount consumed at any given time.

Next, break down these resource charts by user, app, department, and project to truly understand who contributes how much to the total usage (a minimal sketch of this kind of per-tenant breakdown follows the list below). This kind of analytical exploration can quickly reveal:

  • Whether any one tenant (user, app, department, project) is responsible for the majority of cluster usage, which may then require further investigation to determine if that tenant is using or abusing resources
  • Which resources are under constant threat of being oversubscribed
  • Whether you need to expand your big data cluster, or tune apps and the system to get more out of what you have
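
A minimal sketch of the per-tenant breakdown described above, written against the YARN ResourceManager REST API. The ResourceManager host/port, the choice of states, and the ujson JSON library are assumptions to adapt to your environment; this is illustrative, not Unravel's implementation.

import scala.io.Source

// Pull per-application resource usage from the YARN ResourceManager REST API
// and aggregate it by user. Host name and the ujson dependency are assumptions.
object TenantUsageBreakdown {
  def main(args: Array[String]): Unit = {
    val rmUrl = "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING,FINISHED"
    val body  = Source.fromURL(rmUrl).mkString
    val apps  = ujson.read(body)("apps")("app").arr

    // memorySeconds = MB of memory held, multiplied by the seconds it was held
    val gbHoursByUser = apps
      .groupBy(_("user").str)
      .map { case (user, as) => user -> as.map(_("memorySeconds").num).sum / (1024 * 3600) }

    gbHoursByUser.toSeq.sortBy(-_._2).foreach { case (user, gbHours) =>
      println(f"$user%-20s $gbHours%10.1f GB-hours")
    }
  }
}

The same grouping can be repeated by queue or by a tag that maps apps to departments or projects, which is how the "who contributes how much" question gets answered in practice.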

Make apps better multi-tenant citizens

Configuration settings at the cluster and app level dictate how many system resources each app gets. For example, if we set 8 GB containers at the master level, then each app gets 8 GB containers whether it needs them or not. Now imagine that most of your apps only need 4 GB containers: your system would show it is at maximum capacity when it could be running twice as many apps.

In addition to inefficient memory sizing, big data apps can be bad multi-tenant citizens due to other poor configuration settings (CPU, number of containers, heap size, etc.), inefficient code, and bad data layout.

Therefore it’s crucial to measure and understand each of these resource-hogging factors for every app on the system and ensure they are using and not abusing resources.
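
As a sketch of what right-sizing can look like at the application level, here is a Spark-on-YARN example. It assumes Spark 2.3+ (where spark.executor.memoryOverhead is the overhead setting), and the 4g/2-core/10-executor figures are illustrative only; your values should come from measured peak usage, not guesses.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Request containers sized to what this job actually needs, instead of
// inheriting an oversized cluster-wide default.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")           // right-sized heap instead of 8g everywhere
  .set("spark.executor.memoryOverhead", "512m") // off-heap overhead YARN adds to the container
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "10")

val spark = SparkSession.builder()
  .appName("right-sized-etl")   // hypothetical app name
  .config(conf)
  .getOrCreate()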

Define queues and priority levels

Your big data cluster will have a resource management tool built in, for example YARN or Kubernetes. These tools allow you to divide your cluster into queues. This works well for separating production workloads from experiments, Spark from HBase, or high-priority users from low-priority ones. The trick, though, is to get the levels of these queues right.

This is where "measure and analyze" techniques help again. You should analyze the usage of system resources by users, departments, or any other tenant you see fit, to determine the minimum, maximum, and average resources that they usually demand. This will give you some common-sense levels for your queues.

However, queue levels may need to be adjusted dynamically for best results. For example, a mission-critical app may need more resources on a day when it processes 5x more data than usual. Therefore, having a sense of seasonality is also essential when allocating these levels. A heatmap of cluster usage will let you get more precise about these allocations.

Proactively find and fix rogue users or apps

Even after you follow the steps above, your cluster will experience rogue usage from time to time. Rogue usage is defined as bad behavior on the cluster by an application or user, such as hogging resources from a mission-critical app, taking more CPU or memory than needed for timely execution, having a very long idle shell, etc. In a multi-tenant environment, this type of behavior affects all users and ultimately reduces the reliability of the overall platform.

Therefore setting boundaries for acceptable behavior is very important to keep your big data cluster humming. A few examples of these are:

  • The time limit for application execution
  • CPU, memory, containers limit for each application or user

Set the thresholds for these boundaries after analyzing your cluster patterns over a month or so, to determine the average or accepted values. These values may also differ for different days of the week. Also, think about what happens when these boundaries are breached: should the user and admin get an alert? Should the rogue application be killed, or moved to a lower-priority queue?
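
Here is a hedged sketch of such a boundary check against the YARN ResourceManager REST API. The thresholds, the host name, and the ujson library are assumptions; what you do with a flagged app (alert, demote, kill) is a policy decision.

import scala.io.Source

// Flag running apps that exceed an agreed time or memory boundary.
val maxElapsedMs   = 4L * 3600 * 1000   // e.g., 4-hour runtime limit
val maxAllocatedMb = 512L * 1024        // e.g., 512 GB of allocated memory

val rmUrl = "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING"
val apps  = ujson.read(Source.fromURL(rmUrl).mkString)("apps")("app").arr

val rogue = apps.filter { app =>
  app("elapsedTime").num > maxElapsedMs || app("allocatedMB").num > maxAllocatedMb
}

rogue.foreach { app =>
  // From here you might alert the user and admin, move the app to a
  // low-priority queue, or kill it via the ResourceManager after a grace period.
  println(s"Rogue candidate ${app("id").str}: user=${app("user").str}, " +
          s"elapsed=${app("elapsedTime").num.toLong} ms, allocatedMB=${app("allocatedMB").num.toLong}")
}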

Let us know if this was helpful, or if there are topics you would like us to address in future postings. And check back in for the next issue, which will be on the subject of rogue applications.

The post Fighting for precious resources on a multi-tenant big data cluster appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/feed/ 0
How to resolve performance issues of big data applications https://www.unraveldata.com/resources/resolve-performance-issues-quickly/ https://www.unraveldata.com/resources/resolve-performance-issues-quickly/#respond Fri, 25 Jan 2019 07:01:44 +0000 https://www.unraveldata.com/?p=1438

I didn’t grow up celebrating Christmas, but every time I watched Chevy Chase, as Clark Griswold in National Lampoon’s Christmas Vacation, trying to unravel his Christmas lights I could feel his pain. For those who had […]

The post How to resolve performance issues of big data applications appeared first on Unravel.

]]>

I didn't grow up celebrating Christmas, but every time I watched Chevy Chase, as Clark Griswold in National Lampoon's Christmas Vacation, trying to unravel his Christmas lights, I could feel his pain. For those who have had to put up lights, you might remember how time-consuming it was to untangle them. But that wasn't the biggest problem. No, the biggest issue was troubleshooting why one or more lights were out. It could be an issue with the socket, the wiring, or just one bad bulb knocking out a whole section of good lights. Figuring out the root cause of the problem was a trial-and-error process that was very frustrating.

Today's approach to diagnosing and resolving performance issues in big data systems is just like dealing with those pesky Christmas lights. Current performance monitoring and management tools don't pinpoint the root cause of a problem or show how it affects other systems or components running across a big data platform. As a result, troubleshooting and resolving performance issues, such as rogue users and jobs impacting cluster performance, missed SLAs, stuck jobs, failed queries, or a lack of insight into cluster usage and application performance, is very time-consuming and cannot scale to support big data applications in a production deployment.

There’s a better way to resolve Big Data performance issues than spending hours sifting through monitoring graphs and logs

Managing big data operations in a multi-tenant cluster is complex, and it's hard to diagnose some of the problems listed above. It's also hard to track who is doing what, understand cluster usage and application performance, justify resource demands, and forecast capacity needs.

Gaining full visibility across the big data stack is difficult because there is no single pane that gives operations and users insight into what’s going on. Even with monitoring tools like Cloudera Manager, Ambari, and MapR Control System, people have to use logs, monitoring graphs, configuration files, and so on to try to resolve application performance issues.

The Unravel platform gives you this important insight across multiple systems.

The post How to resolve performance issues of big data applications appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/resolve-performance-issues-quickly/feed/ 0
Organizations Need APM to fully Harness Big Data https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/ https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/#respond Fri, 25 Jan 2019 06:50:07 +0000 https://www.unraveldata.com/?p=1423

Gartner recently released the report Monitor Data and Analytics Platforms to Drive Digital Business. In the introduction, the authors explain, “Unfortunately, most application performance monitoring (APM) and infrastructure monitoring technologies are not equipped to deal with the […]

The post Organizations Need APM to fully Harness Big Data appeared first on Unravel.

]]>

Gartner recently released the report Monitor Data and Analytics Platforms to Drive Digital Business. In the introduction, the authors explain, “Unfortunately, most application performance monitoring (APM) and infrastructure monitoring technologies are not equipped to deal with the challenges of monitoring business data and analytics platforms. And most IT operations professionals lack the skills required to understand and manage the performance problems and incidents that are likely to arise as these platforms are put through their paces by an increasingly data-hungry business community.”

Moving and managing big data applications in production can be a nightmare, because many problems can occur that cripple an operations team. The growth and diversity of applications in a multi-tenant cluster make it hard to diagnose problems such as rogue applications, missed SLAs, cascading cluster failures, application slowdowns, stuck jobs, and failed queries. It is also becoming hard to track who is doing what, understand cluster usage and application performance, and optimize and plan in order to justify resource demands and forecast capacity needs.

For operations teams, addressing these challenges is a must, because too many downstream customers depend on things running smoothly. But gaining full visibility across the modern data stack is really hard to do, because there is no single pane that gives operations and users insight into what's going on, or a way to collaborate; instead, people have to look at code, logs, monitoring graphs, configuration files, and so on in an attempt to figure it out.

Operations professionals need more. The Gartner report recommends that leaders tasked with optimizing and managing their data and analytics platforms should:

  • Monitor business data and analytics platform performance holistically by combining data from infrastructure and application performance tools and tasks. This will ensure good traceability, improve performance and support the business uptake of analytics

Unravel provides the only tool that analyzes the entire modern data stack and enables operations teams to understand, improve, and control application performance in production. What makes Unravel unique is that the solution is designed to automatically detect, diagnose, and resolve issues to improve productivity, guarantee reliability, and reduce costs for running modern data stack applications in any environment: on-premises, in the cloud, or hybrid.

Learn more about why big data APM is mission-critical for modern data stack applications in production.

Citations:
Gartner, Monitor Data and Analytics Platforms to Drive Digital Business, 05 April 2017

The post Organizations Need APM to fully Harness Big Data appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/feed/ 0
Common Modern Data Stack Application and Operation Challenges https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/ https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/#respond Fri, 25 Jan 2019 06:36:44 +0000 https://www.unraveldata.com/?p=1405

Over the last one year, we at Unravel Data have spoken to several enterprise customers (with cluster sizes ranging from 10’s to 1000’s of nodes) about their modern data stack application challenges. Most of them leverage […]

The post Common Modern Data Stack Application and Operation Challenges appeared first on Unravel.

]]>

Over the last year, we at Unravel Data have spoken with several enterprise customers (with cluster sizes ranging from tens to thousands of nodes) about their modern data stack application challenges. Most of them use Hadoop in a typical multi-tenant fashion, with different applications (ETL, business intelligence, machine learning apps, etc.) accessing a common data substrate.

The conversations have typically been with both the operations teams, who manage and maintain these platforms, and the developers/end users, who are building applications on the modern data stack (Hive, Spark, MapReduce, Oozie, etc.).

Broadly speaking, the challenges can be grouped into an application perspective and a cluster perspective. We have highlighted them below with actual examples discussed with these customers.

Challenges faced by Developers/Data Engineers

Some of the most common application challenges we have seen or heard about include:

  • An ad hoc application (submitted via Hive, Spark, or MapReduce) gets stuck (makes no forward progress) or fails after a while
    Example 1: A Spark job hangs at the last task and eventually fails
    Example 2: A Spark job fails with an executor OOM at a particular stage
  • An application suddenly starts performing poorly
    Example 1: A Hive query that used to take ~6 hrs is now taking > 10 hrs
  • No clear understanding of which "knobs" (configuration parameters) to change to improve application performance and resource usage (see the short Spark example at the end of this section)
  • The need for a self-serve platform to understand, end to end, how their specific application(s) are behaving

Today, engineers end up going to five different sources (e.g., the CM/Ambari UI, Job History UI, application logs, Spark Web UI, and AM/RM UIs/logs) to get an end-to-end understanding of application behavior and performance challenges. Further, these sources may be insufficient to truly understand the bottlenecks associated with an application (e.g., detailed container execution profiles, or visibility into the transformations that execute within a Spark stage). See an example here on how to debug a Spark application. It's cumbersome!

To add to these challenges, many developers do not have access to the above sources and have to go through their ops team, which adds significant delays.
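
To make the "knobs" point above concrete, here is a hedged Spark example of the handful of settings that most often decide whether a job is fast, slow, or dead. The values are illustrative starting points, not recommendations; the right numbers depend on data volume, cluster capacity, and the job itself.

import org.apache.spark.sql.SparkSession

// Illustrative only: common Spark tuning knobs set at session build time.
val spark = SparkSession.builder()
  .appName("tuned-app")                              // hypothetical app name
  .config("spark.executor.memory", "6g")             // per-executor heap
  .config("spark.executor.cores", "3")               // concurrent tasks per executor
  .config("spark.sql.shuffle.partitions", "400")     // parallelism after shuffles
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()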

Challenges faced by Data Operations Team(s)

Some of the most common challenges that are top of mind for operations teams include:

  • Lack of visibility into cluster usage from an application perspective
    Example 1: Which application(s) cause my cluster usage (CPU, memory) to spike?
    Example 2: Are queues being used optimally?
    Example 3: Are the various data sets actually being used?
    Example 4: The need for a comprehensive chargeback/showback view for planning and budgeting purposes
  • Not having good visibility and understanding when data pipelines miss SLAs, and not knowing where to start triaging these issues
    Example 1: An Oozie-orchestrated data pipeline that needs to complete by 4 AM every morning is now consistently delayed and only completes by 6 AM
  • Being unable to control or manage runaway jobs that take far more resources than needed, affecting the overall cluster and starving other applications
    Example 1: Needing an automated way to track and manage these runaway jobs
  • No clear understanding of which "knobs" (configuration parameters) to change at the cluster level to improve overall application performance and resource usage
  • Difficulty quickly identifying inefficient applications that can be reviewed and acted upon
    Example 1: Most ops team members do not have deep Hadoop application expertise, so having a way to quickly triage and understand the root causes of application performance degradation is very helpful

The above application and operations challenges are real blockers that prevent enterprise customers from making their Hadoop applications production-ready. This, in turn, slows down the ROI enterprises see from their big data investments.

Unravel's vision is to address these broad challenges, and more, for modern data stack applications and operations: a full-stack performance intelligence and self-serve platform that can improve your modern data stack operations, make your applications reliable, and improve overall cluster utilization. In subsequent blog posts we will detail how we go about doing this and the (data) science behind these solutions.

The post Common Modern Data Stack Application and Operation Challenges appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/feed/ 0
To Cache or Not to Cache RDDs in Spark https://www.unraveldata.com/resources/to-cache-or-not-to-cache/ https://www.unraveldata.com/resources/to-cache-or-not-to-cache/#respond Fri, 25 Jan 2019 06:31:25 +0000 https://www.unraveldata.com/?p=1395 Cache

This post is the first part of a series of posts on caching, and it covers basic concepts for caching data in Spark applications. Following posts will cover more how-to’s for caching, such as caching DataFrames, […]

The post To Cache or Not to Cache RDDs in Spark appeared first on Unravel.

]]>
Cache

This post is the first part of a series on caching, and it covers basic concepts for caching data in Spark applications. Following posts will cover more caching how-to's, such as caching DataFrames, more information on the internals of Spark's caching implementation, and automatic recommendations for what to cache, based on our work with many production Spark applications. For a more general overview of the causes of Spark performance issues, as well as an orientation to our learning to date, refer to our page on Spark Performance Management.

Caching RDDs in Spark

Caching is one mechanism for speeding up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() caches the RDD in memory, whereas persist(level) can cache it in memory, on disk, or in off-heap memory, according to the caching strategy specified by level. persist() without an argument is equivalent to cache(). We discuss caching strategies later in this post. Freeing up space from storage memory is done with unpersist().
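
A minimal sketch of these calls, assuming an existing SparkContext sc (as in the listing later in this post); the input path and filter are placeholders.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/events")   // placeholder input path
             .filter(_.contains("ERROR"))

logs.cache()                                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
// or pick the strategy explicitly instead of cache():
// logs.persist(StorageLevel.MEMORY_AND_DISK)

println(logs.count())   // first action: evaluates the RDD and fills the cache
println(logs.count())   // second action: served from the cached blocks

logs.unpersist()        // frees the storage memory once the RDD is no longer needed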

When to use caching

As suggested in this post, it is recommended to use caching in the following situations:

  • RDD re-use in iterative machine learning applications
  • RDD re-use in standalone Spark applications
  • When RDD computation is expensive, caching can help in reducing the cost of recovery in the case one executor fails
1.  val sc = new SparkContext(sparkConf)
2.  val ranGen = new Random(2016)
3.
4.  val pairs1 = sc.parallelize(0 until numMappers, numPartitions).flatMap { p =>
5.
6.  // map output sizes linearly increase from the 1st to the last
7.  val numKVPairsLocal = (1.0 * (p+1) / numMappers * numKVPairs).toInt
8.
9.  val arr1 = new Array[(Int, Array[Byte])](numKVPairsLocal)
10. for (i <- 0 until numKVPairsLocal) {
11.  val byteArr = new Array[Byte](valSize) 
12.  ranGen.nextBytes(byteArr) 
13.
14.  // generate more like a zipfian skew 
15.  var key: Int = 0 
16.  if (ranGen.nextInt(100) > 50)
17.   key = 0
18.  else
19.   key = ranGen.nextInt(numKVPairsLocal)
20.
21.  arr1(i) = (key, byteArr)
22. }
23. arr1
24. }
25. .cache()
26.
27. pairs1.count()
28.
29. println("Number of pairs: " + pairs1.count)
30. println("Count result " + pairs1.groupByKey(numReducers).count())
31.
32. sc.stop()

Caching example

Let us consider the example application shown in the listing above (numMappers, numPartitions, numKVPairs, valSize, and numReducers are parameters assumed to be defined earlier in the application). Let us also assume that we have enough memory to cache any RDD. The example application generates an RDD on line 4, "pairs1", which consists of a set of key-value pairs. The RDD is split into several partitions (i.e., numPartitions). RDDs are lazily evaluated in Spark, so the "pairs1" RDD is not evaluated until an action is called. Neither cache() nor persist() is an action. Several actions are called on this RDD, as seen on lines 27, 29, and 30. The data is cached during the first action call, on line 27. All subsequent actions (lines 29 and 30) find the RDD blocks in memory. Without the cache() statement on line 25, each action on "pairs1" would evaluate the RDD by re-running the intermediate steps that generate the data. For this example, caching clearly speeds up execution by avoiding RDD re-evaluation in the last two actions.

Block eviction

Now let us also consider the situation where some of the block partitions are so large that they quickly fill up the storage memory used for caching. The partitions generated in the example above are skewed, i.e., more key-value pairs are allocated to the partitions with higher IDs. Highly skewed partitions are the first candidates that will not fit into the cache storage.

When the storage memory becomes full, an eviction policy (i.e., Least Recently Used) is used to make room for new blocks. This situation is not ideal, as cached partitions may be evicted before actually being re-used. Depending on the caching strategy adopted, evicted blocks are cached on disk. A better approach is to cache only RDDs that are expensive to re-evaluate and have a modest size, such that they fit entirely in memory. Making such decisions before application execution can be challenging, as it is unclear which RDDs will fit in the cache and which caching strategy is better to use (i.e., where to cache: in memory, on disk, off-heap, or a combination of the above) in order to achieve the best performance. Generally, a caching strategy that caches blocks in memory and on disk is preferred: cached blocks evicted from memory are written to disk, and reading the data back from disk is relatively fast compared with re-evaluating the RDD [1].

Under the covers

Internally, caching is performed at the block level. That means that each RDD consists of multiple blocks, and each block is cached independently of the others. Caching is performed on the node that generated that particular RDD block. Each executor in Spark has an associated BlockManager that is used to cache RDD blocks. The memory available to the BlockManager is given by the storage memory fraction (i.e., spark.memory.storageFraction), which is a fraction of the memory pool allocated to the Spark engine itself (i.e., specified by spark.memory.fraction). A summary of memory management in Spark can be found here. The BlockManager manages cached partitions as well as intermediate shuffle outputs. The storage, the place where blocks are actually stored, can be specified through the StorageLevel (e.g., persist(level: StorageLevel)). Once the storage level of an RDD has been defined, it cannot be changed. An RDD block can be cached in memory, on disk, or off-heap, as specified by level.
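
A short sketch of the two settings mentioned above; the values shown are the Spark 2.x defaults, and whether to change them depends on whether caching or shuffle/execution memory is your bottleneck.

import org.apache.spark.SparkConf

// The two settings that bound how much executor memory the BlockManager
// can use for cached blocks.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")         // unified execution + storage pool
  .set("spark.memory.storageFraction", "0.5")  // portion of that pool protected for storage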

Caching strategies (StorageLevel): RDD blocks can be cached in multiple stores (memory, disk, off-heap), in serialized or non-serialized format.

  • MEMORY_ONLY: Data is cached in memory only in non-serialized format.
  • MEMORY_AND_DISK: Data is cached in memory. If enough memory is not available, evicted blocks from memory are serialized to disk. This mode of operation is recommended when re-evaluation is expensive and memory resources are scarce.
  • DISK_ONLY: Data is cached on disk only in serialized format.
  • OFF_HEAP: Blocks are cached off-heap, e.g., on Alluxio [2].
  • The caching strategies above can also use serialization to store the data in serialized format. Serialization increases the processing cost but reduces the memory footprint of large datasets. These variants append the "_SER" suffix to the above schemes, e.g., MEMORY_ONLY_SER, MEMORY_AND_DISK_SER. DISK_ONLY and OFF_HEAP always write data in serialized format.
  • Data can also be replicated to another node by appending the "_2" suffix to the StorageLevel, e.g., MEMORY_ONLY_2, MEMORY_AND_DISK_SER_2 (see the short example after this list). Replication is useful for speeding up recovery in the case that one node of the cluster (or an executor) fails.
  • A full description of caching strategies can be found here.
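
A quick illustration of the naming conventions above; sc is an existing SparkContext, and the paths and RDD names are placeholders.

import org.apache.spark.storage.StorageLevel

val expensiveRdd = sc.textFile("hdfs:///data/wide_table").map(_.split('\t'))
val hotLookupRdd = sc.textFile("hdfs:///data/lookup")

expensiveRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spills to disk
hotLookupRdd.persist(StorageLevel.MEMORY_ONLY_2)        // deserialized, replicated to a second node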

Summary

  • Caching is very useful for applications that re-use an RDD multiple times. Iterative machine learning applications include such RDDs that are re-used in each iteration.
  • Caching all of the generated RDDs is not a good strategy as useful cached blocks may be evicted from the cache well before being re-used. For such cases, additional computation time is required to re-evaluate the RDD blocks evicted from the cache.
  • Given a large list of RDDs that are being used multiple times, deciding which ones to cache may be challenging. When memory is scarce, it is recommended to use MEMORY_AND_DISK caching strategy such that evicted blocks from cache are saved to disk. Reading the blocks from disk is generally faster than re-evaluation. If extra processing cost can be afforded, MEMORY_AND_DISK_SER can further reduce the memory footprint of the cached RDDs.
  • If certain RDDs have a very large evaluation cost, it is recommended to replicate them to another node. This will significantly boost performance in the case of a node failure, since re-evaluation can be skipped.

Further Reading

Those interested in Spark might find these two technical blogs useful in understanding performance issues for the Spark Platform:

Why Your Spark Applications are Slow or Failing: Part I Memory Management

Why Your Spark Apps Are Slow or Failing Part II Data Skew and Garbage Collection

References:

[1] https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/td-p/30763
[2] Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. "Learning Spark: Lightning-Fast Big Data Analysis." O'Reilly Media, 2015.

*Image from https://www.storagereview.com/

The post To Cache or Not to Cache RDDs in Spark appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/to-cache-or-not-to-cache/feed/ 0
Auto-Tuning Tez Applications https://www.unraveldata.com/resources/auto-tuning-tez-applications/ https://www.unraveldata.com/resources/auto-tuning-tez-applications/#respond Fri, 25 Jan 2019 06:20:46 +0000 https://www.unraveldata.com/?p=1387

In a previous blog and webinar, my colleague Eric Chu discussed Unravel’s powerful recommendation capabilities in the Hive on MapReduce setting. To recap briefly, for these Hive on MR applications, Unravel surfaced inefficiencies and associated performance problems […]

The post Auto-Tuning Tez Applications appeared first on Unravel.

]]>

In a previous blog and webinar, my colleague Eric Chu discussed Unravel's powerful recommendation capabilities in the Hive on MapReduce setting. To recap briefly: for these Hive on MR applications, Unravel surfaced inefficiencies and the associated performance problems caused by the excessive number of mapper and reducer tasks spawned to physically execute the query.

In this article, we’ll focus on another execution engine for Hive queries, Apache Tez. Tez provides a DAG-based model for physical execution of Hive queries and ships with Hortonworks’s HDP distribution as well as Amazon EMR.

We'll see that reducing the number of tasks is an important optimization in this setting as well; however, it's important to note that Tez uses a different mechanism for input splits, which means different parameters must be tuned. Of course, Unravel simplifies this nuance by directly recommending efficient settings for the Tez-specific parameters in play, using our ML/AI-informed infrastructure monitoring tools.

Before diving into some hands-on tuning from a data engineer’s point-of-view, I want to note that there are many factors that can lead to variability in execution performance and resource consumption of Hive on Tez queries.

Once Unravel’s intelligent APM functionality provides deep insights into the nature of multi-tenant workloads running in modern data stack environments, operators can rapidly identify the causes of this variability, which could include:

  • Data skew
  • Resource contention
  • Changes in volume of processed data

Towards the end of this blog, we will highlight some of the operator-centric features in Unravel, providing full fidelity visibility that can be leveraged to yield such insights.

We'll use a TPC-DS query on an HDP 2.6.3 cluster.

When we execute a query of this form, with aggregation, sort, and limit semantics, we can see that Tez uses the Map-Reduce-Reduce pattern, which is a nice improvement over the MR execution engine, as this pattern isn't possible in pure MapReduce.

As discussed above and in Eric’s aforementioned blog, there is a balancing act in distributed processing that must be maintained between the degree of parallelism of the processing and how much work is being completed by each parallel process. Tez attempts to achieve this balance via the grouping of splits. The task parallelism in Tez is determined by the grouping algorithm discussed here.

The key thing to note is that the number of splits per task must be aligned with the configuration settings for tez.grouping.min-size and tez.grouping.max-size.

Unravel first recommends tuning these grouped splits so that the cluster allocates a smaller number of Tez map tasks, gaining better throughput with each task processing a larger amount of data relative to the pre-tuned execution.
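
A hedged sketch of applying such grouping bounds at the session level, here through a Hive JDBC connection. The HiveServer2 host, credentials, and byte values are assumptions (and the Hive JDBC driver must be on the classpath); Unravel's actual recommendation is derived from the measured input of the Tez map vertices.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2:10000/default", "etl_user", "")
val stmt = conn.createStatement()

stmt.execute("SET hive.execution.engine=tez")
stmt.execute("SET tez.grouping.min-size=268435456")   // 256 MB lower bound per grouped split
stmt.execute("SET tez.grouping.max-size=1073741824")  // 1 GB upper bound per grouped split
// ...run the query under test through the same statement, then:
conn.close()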

Returning to the Map-Reduce-Reduce DAG that the optimizer identified for this query, we see that the first reducer stage is taking a long time to process.

At this point, without Unravel, a data engineer would usually dig into the physical execution plan, in order to identify how much data each reducer task is processing. Next, the engineer would need to understand how to properly configure the number of reducers to yield better efficiency and performance.

This is non-trivial, given the number of parameters in play: hive.tez.auto.reducer.parallelism, hive.tez.min.partition.factor, hive.tez.max.partition.factor, hive.exec.reducers.max, hive.exec.reducers.bytes.per.reducer, and more (take a look at the number of Tez configuration parameters available, many of which can affect performance).

In fact, with auto reducer parallelism turned on (the default in HDP), Tez samples the source vertices’ output sizes and adjusts the number of reducer tasks dynamically. When there is a large volume of input data, but the reducer stage output is small relative to this input, the default heuristics can lead to too few reducers.

This is a good example of the subtleties that can arise with performance tuning: in the Hive on MR case, we had a query using far too many reducer tasks, leading to inefficiencies; in this case, we have a query using too few. Furthermore, more reducers does not necessarily mean better performance here either!

In fact, in further testing, we found that tuning hive.exec.reducers.bytes.per.reducer to a somewhat higher value, resulting in fewer reducers, can squeeze out slightly better performance in this case.
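
The reducer-side parameters can be nudged in the same way; continuing the hypothetical JDBC session from the sketch above, with illustrative values rather than recommendations:

// Continuing the stmt from the earlier JDBC sketch. A larger bytes-per-reducer
// target asks Tez for fewer, larger reducers; reducers.max remains a hard ceiling.
stmt.execute("SET hive.tez.auto.reducer.parallelism=true")
stmt.execute("SET hive.exec.reducers.bytes.per.reducer=536870912")  // 512 MB per reducer
stmt.execute("SET hive.exec.reducers.max=1009")                     // ceiling on reducer count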

As we discussed, it’s a complex balancing act to play with the number of tasks processing data in parallel and the amount of data being processed by each. Savvy data engineers spend an inordinate amount of time running experiments in order to hand-tune these parameters before deploying their applications to Production.

How Unravel Helps Auto-Tuning Tez Applications

Fortunately, with Unravel, we have a lot less work to do as data engineers. Unravel intelligently navigates this complex response surface using machine learning algorithms, to identify and surface optimal configurations automatically. It’s up to you how to optimize your output, using the full-stack infrastructure monitoring tools provided.

Let’s move on to the operator-centric point-of-view in Unravel to briefly discuss the insights we can gain when we determine there is considerable variability in the efficiency and performance of our Tez applications.

The first item we identified is data skew. How can we use Unravel to identify this issue? First, we can quickly see the total run time broken down by Tez task in the vertex timeline:


Secondly, we have a sortable cross-table we can leverage to easily see which tasks took the longest.

This can be the result of processing skew or data skew, and Unravel provides some additional histograms to provide insights into this. These histograms bucket the individual tasks by duration and by processed I/O.

Non-uniformity in the distribution of these tasks yields insights into the nature of the skew, which cluster operators and developers can leverage to take actionable tuning and troubleshooting steps.

Fantastic. How about resource contention? First, we can take a look at an app that executed quickly because it was allocated all of its requested containers by YARN right away:

Our second figure shows what can occur when other tenants in the cluster are requesting containers for their applications: the slower execution is due to waiting for containers from the ResourceManager:

Finally, as for changes in input data volume, we can use Unravel's Workflow functionality to determine when the volume changes considerably, leading to variance in application behavior. In a future blog, we will discuss techniques to address this concern.

Create a free account.

The post Auto-Tuning Tez Applications appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/auto-tuning-tez-applications/feed/ 0
The Operations Guide To Big Data Application Performance https://www.unraveldata.com/resources/the-operations-guide-to-big-data-application-performance/ https://www.unraveldata.com/resources/the-operations-guide-to-big-data-application-performance/#respond Thu, 17 Jan 2019 03:40:22 +0000 https://www.unraveldata.com/?p=5212

Thank you for your interest in The Operations Guide to Big Data Application Performance. You can download it here.

The post The Operations Guide To Big Data Application Performance appeared first on Unravel.

]]>

Thank you for your interest in The Operations Guide to Big Data Application Performance.

You can download it here.

The post The Operations Guide To Big Data Application Performance appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/the-operations-guide-to-big-data-application-performance/feed/ 0