The Spark Troubleshooting Solution is The Unravel DataOps Platform

Current practice for Spark troubleshooting is messy. Part of this is due to Spark's very popularity; it's widely used on platforms as varied as self-managed open source Apache Spark; Cloudera's Hadoop offerings (on-premises and in the cloud); Amazon EMR, Azure Synapse, and Google Dataproc; and Databricks, which runs on all three public clouds. (Which means you have to be able to address Spark's interaction with all of these very different environments.)

Because Spark does so much, on so many platforms, "Spark troubleshooting" covers a wide range of problems – jobs that halt; pipelines that fail to deliver, leaving you to find the issue; performance that's too slow; and jobs that use too many resources, either in the data center (where your clusters can suck up all available resources) or in the cloud (where resources are always available, but your costs rise, or even skyrocket).

Where Are the Issues – and the Solutions?

Problems in running Spark jobs occur at the job and pipeline levels, as well as at the cluster level, as described in Part 1 of this three-part series: the top ten problems you encounter in working with Spark. And there are several solutions that can help, as we described in Part 2: five types of solutions used for Spark troubleshooting. (You can also see our recent webinar, Troubleshooting Apache Spark, for an overview and demo.)

Table: What each level of tool shows you – and what’s missing

 

Existing tools provide incomplete, siloed information. We created Unravel Data as a go-to DataOps platform that includes much of the best of existing tools. In this blog post we’ll give examples of problems at the job, pipeline, and cluster levels, and show how to solve them with Unravel Data. We’ll also briefly describe how Unravel Data helps you prevent problems, providing AI-powered, proactive recommendations.

The Unravel Data platform gathers more information than existing tools by adding its own sensors to your stack, and by using all previously existing metrics, traces, logs, and available API calls. It gathers this robust information set together and correlates pipeline information, for example, across jobs.

The types of issues that Unravel covers are, broadly speaking: fixing bottlenecks; meeting and beating SLAs; cost optimization; fixing failures; and addressing slowdowns, helping you improve performance. Within each of these broad areas, Unravel has the ability to spot hundreds of different types of factors contributing to an issue. These contributing factors include data skew, bad joins, load imbalance, incorrectly sized containers, poor configuration settings, and poorly written code, as well as a variety of other issues.

Fixing Job-Level Problems with Unravel

Here’s an example of a Spark job or application run that’s monitored by Unravel.

In Unravel, you first see automatic recommendations, analysis, and insights for any given job. This allows users to quickly understand what the problem is, why it happened, and how to resolve it. In the example below, resolving the problem will take about a minute.

Unravel Spark Dashboard Example

Let’s dive into the insights for an application run, as shown below.

Unravel App Summary Screenshot

You can see here that Unravel has spotted bottlenecks, and also room for improving the performance of this app. It has narrowed down the particular problem with this application and how to resolve it. In this case, it has recommended doubling the number of executors and reducing the memory for each executor, which will improve performance by about 30%, meeting the SLA.

Additionally, Unravel has also spotted some bad joins which are slowing this application down, as shown below.

Unravel Bottlenecks Screenshot Example

In addition to helping speed this application up, Unravel also recommends resource settings that will lower the cost of running it, as shown below – reductions of roughly 50% in executor memory and driver memory, cutting the total memory cost in half. Again, Unravel delivers pinpoint recommendations. Users avoid a lengthy trial-and-error exercise; instead, they can solve the problem in about a minute.

Unravel App Analysis Example

Unravel can also help with jobs or applications that simply failed. It uses a similar approach to the one above to help data engineers and operators get to the root cause of the problem and resolve it quickly.

Unravel Spark Analysis

Unravel App Analysis Tab Example

In this example, the job or application failed because of an out-of-memory exception. Unravel surfaces this problem instantly and pinpoints exactly where the problem is.

For further information, and to support investigation, Unravel provides distilled and easy-to-access logs and error messages, so users and data engineers have all the relevant information they need at hand.

And once data teams start using Unravel, they can do everything with more confidence. For instance, if they try to save money by keeping resource allocations low, but overdo that a little bit, they’ll get an out-of-memory error. Previously, it might have taken many hours to resolve the error, so the team might not risk tight allocations. But fixing the error only takes a couple of minutes with Unravel, so the data team can cut costs effectively.

Examples of the logs and error-message screens that Unravel provides for easy access follow.

Unravel Errors Example

 

Unravel Logs Tab Example

Unravel strives to help users solve their problems with a click of a button. At the same time, Unravel provides a great deal of detail about each job and application, including displaying code execution, displaying DAGs, showing resource usage, tracking task execution, and more. This allows users to drill down to whatever depth needed to understand and solve the problem.

Unravel Task Stage Metrics View

Task stage metrics in Unravel Data

 

As another example, this screen shows details for task stage information:

  • Left-hand side: task metrics. This includes the job stage task metrics of Spark, much like what you would see from Spark UI. However, Unravel keeps history on this information; stores critical log information for easy access; presents multiple logs coherently; and ties problems to specific log locations.
  • Right-hand side: holistic KPIs. Information such as job start and end time, run-time durations, I/O in KB – and whether each job succeeded or failed.

Data Pipeline Problems

The tools people use for troubleshooting Spark jobs tend to focus on one level of the stack or another – the networking level, the cluster level, or the job level, for instance. None of these approaches helps much with Spark pipelines. A pipeline is likely to have many stages, involving many separate Spark jobs.

Here’s an example. One Spark job can handle data ingest; a second job, transformation; a third job may send the data to Kafka; and a final job can be reading the data from Kafka and then putting it into a distributed store, like Amazon S3 or HDFS.
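As a rough sketch of that final stage – assuming a PySpark batch job with the Spark–Kafka integration package available, and with placeholder broker, topic, and bucket names – the read-from-Kafka-and-land-in-S3 step might look like this:

# Final pipeline stage (sketch): read records from a Kafka topic and land them in S3.
# Broker address, topic name, and bucket path are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-s3-stage")
         .getOrCreate())

# Batch-read whatever is currently available on the topic.
raw = (spark.read
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "transformed-events")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers key/value as binary; cast them to strings before writing out.
events = raw.select(col("key").cast("string"), col("value").cast("string"))

# Land the records in a distributed store such as Amazon S3 (or an HDFS path).
events.write.mode("append").parquet("s3a://my-bucket/events/")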

Airflow DAGs Screenshot

Airflow being used to create and organize a Spark pipeline.

 

The two most important orchestration tools are Oozie, which tends to be used with on-premises Hadoop, and Airflow, which is used more often in the cloud. They will help you create and manage a pipeline; and, when the pipeline breaks down, they’ll show you which job the problem occurred in.
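As a minimal sketch of that wiring – assuming Airflow with the apache-airflow-providers-apache-spark package installed; the DAG name, schedule, and application paths are placeholders – a pipeline like the one above might be defined as:

# Sketch of an Airflow DAG chaining the Spark jobs of the pipeline described above.
# Assumes the Apache Spark provider package is installed; all names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest = SparkSubmitOperator(task_id="ingest", application="/jobs/ingest.py")
    transform = SparkSubmitOperator(task_id="transform", application="/jobs/transform.py")
    to_kafka = SparkSubmitOperator(task_id="to_kafka", application="/jobs/to_kafka.py")
    to_store = SparkSubmitOperator(task_id="kafka_to_s3", application="/jobs/kafka_to_s3.py")

    # Airflow will flag which of these tasks failed - but not why the Spark job inside it failed.
    ingest >> transform >> to_kafka >> to_store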

But orchestration tools don’t help you drill down into that job; that’s up to you. You have to find the specific Spark run where the failure occurred. You have to use other tools, such as Spark UI or logs, and look at timestamps, using your detailed knowledge of each job to cross-correlate and, hopefully, find the issue. As you see, just finding the problem is messy, intense, time-consuming, expert work; fixing it is even more effort.

Oozie Pipeline Screenshot

Oozie also gives you a big-picture view of pipelines.

 

Unravel, by contrast, provides pipeline-specific views that first connect all the components – Spark, and everything else in your modern data stack – and runs of the data pipeline together in one place. Unravel then allows you to drill down into the slow, failed, or inefficient job, identify the actual problem, and fix it quickly. And it gets even better; Unravel’s AI-powered recommendations will help you prevent a pipeline problem from even happening in the first place.

You don't have to look at Spark UI, dig through Spark logs, and then check Oozie or Airflow. All the information is correlated into one view – a single pane of glass.

Unravel Jobs View

This view shows details for several jobs. In the graphic, each line shows an instance run. The longest duration shown here is three minutes and one second. If the SLA is "under two minutes," then the job failed to meet its SLA. (Because some jobs run scores or hundreds of times a day, missing an SLA by more than a minute – especially when that means a roughly 50% overshoot against the SLA – can become a very big deal.)

Unravel then provides history and correlated information, using all of this to deliver AI-powered recommendations. You can also set AutoActions against a wide variety of conditions and get cloud migration support.

Cluster Issues

Resources are allocated at the cluster level. The screenshot shows ResourceManager (RM), which tracks resources, schedules jobs such as Spark jobs, and so on. You can see the virtual machines assigned to your Spark jobs, what resources they’re using, and status – started or not started, completed or not completed.

Apache Hadoop ResourceManager

Apache Hadoop ResourceManager

 

The first problem is that there's no way to see what actual resources your job is consuming. Nor can you see whether those resources are being used efficiently or not. So you can be over-allocated, wasting resources – or running very close to your resource limit, with the job likely to crash in the future.

Nor can you compare past to present; ResourceManager keeps no history. You can pull logs at this level – the YARN level – to look at what was happening, but that's aggregated data, not the detail you're looking for. You also can't dig into potential conflicts with neighbors sharing resources in the cluster.

You can use platform tools like CloudWatch, Cloudera Manager, or Ambari. They provide a useful holistic view at the cluster level – total CPU consumption, disk I/O consumption, and network I/O consumption. But, as with some of the pipeline views we discussed above, you can't take this down to the job level.

You may have a spike in cluster disk I/O. Was it your job that started that, or someone else’s? Again, you’re looking at Spark UI, you’re looking at Spark logs, hoping maybe to get a bit lucky and figure out what the problem is. Troubleshooting becomes a huge intellectual and practical challenge. And this is all taking away from time making your environment better or doing new projects that move the business forward.

It’s common for a job to be submitted, then held because the cluster’s resources are already tied up. The bigger the job, the more likely it will have to wait. But existing tools make it hard to see how busy the cluster is. So later, when the job that had to wait finishes late, no one knows why that happened.

Unravel Cluster User View

A cluster-level view showing vCores, specific users, and a specific queue

 

By contrast, in this screenshot from Unravel, you see cluster-level details. This job was in the data security queue, and it was submitted on July 5th, around 7:30pm. These two rows show vCores and memory – overall resource consumption on this Hadoop cluster. The orange line shows maximum usage, and the blue line shows what's available.

Unravel Cluster Level View

At this point in time, usage (blue line) did not exceed available resources (orange line)

 

You can also get more granular and look at a specific user. You can go to the date and time that the job was launched and see what was running at that point in time. And voilà – there were actually enough resources available.

So it’s not a cluster-level problem; you need to examine the job itself. And Unravel, as we’ve described, gives you the tools to do that. You can see that we’ve eliminated a whole class of potential problems for this slowdown – not in hours or days, and with no trial-and-error experimentation needed. We just clicked around in Unravel for a few minutes.

Unravel Data: An Ounce of Prevention

For the issues above, such as slowdowns, failures, missed SLAs or just expensive runs, a developer would have to be looking at YARN logs, ResourceManager logs, and Spark logs, possibly spending hours figuring it all out. Within Unravel, though, they would not need to jump between all those screens; they would get all the information in one place. They can then use Unravel’s built-in intelligence to automatically root-cause the problem and resolve it.

Unravel Data solves the problem of Spark troubleshooting at all three levels – at the job, pipeline, and cluster levels. It handles the correlation problem – tying together cluster, pipeline, and job information – for you. Then it uses that information to give unique views at every level of your environment. Unravel makes AI-powered recommendations to help you head off problems; allows you to create AutoActions that execute on triggers you define; and makes troubleshooting much easier.

Unravel solves systemic problems with Spark. For instance, Spark tends to cause overallocation: assigning very large amounts of resources to every run of a Spark job, to try to avoid crashes on any run of that job over time. The biggest datasets or most congested conditions set the tone for all runs of the job or pipeline. But with Unravel, you can flexibly right-size the allocation of resources.

Unravel frees up your experts to do more productive work. And Unravel often enables newer and more junior-level people to be as effective as an expert would have been, using the ability to drill down, and the proactive insights and recommendations that Unravel provides.

Unravel even feeds back into software development. Once you find problems, you can work with the development team to implement new best practices, heading off problems before they appear. Unravel will then quickly tell you which new or revised jobs are making the grade.

Unravel Data Advantage Diagram

The Unravel advantage – on-premises and all public clouds

 

Another hidden virtue of Unravel is that it serves as a single source of truth for different roles in the organization. If a developer, or an operations person, finds a problem, they can use Unravel to highlight just what the issue is, and how to fix it – and not only how to fix it this time, for this job, but how to reduce the incidence of that class of problem across the whole organization. The same goes for business intelligence (BI) tool users such as analysts and data scientists – everyone. Unravel gives you a kind of X-ray of problems, so you can cooperate in solving them.

With Unravel, you have the job history, the cluster history, and the interaction with the environment as a whole – whether it be on-premises, or using Databricks or native services on AWS, Azure, or Google Cloud Platform. In most cases you don’t have to try to remember, or discover, what tools you might have available in a given environment. You just click around in Unravel, largely the same way in any environment, and solve your problem.

Between the problems you avoid, and your new-found ability to quickly solve the problems that do arise, you can start meeting your SLAs in a resource-efficient manner. You can create your jobs, run them, and be a rockstar Spark developer or operations person within your organization.

Conclusion

In this blog post, we’ve given you a wide-ranging tour of how you can use Unravel Data to troubleshoot Spark jobs – on-premises and in the cloud, at the job, pipeline, and cluster levels, working across all levels, efficiently, from a single pane of glass.

In Troubleshooting Spark Applications, Part 1: Ten Challenges, we described the ten biggest challenges for troubleshooting Spark jobs across levels. And in Spark Troubleshooting, Part 2: Five Types of Solutions, we described the major categories of tools, several of which we touched on here.

This blog post, Part 3, builds on the other two to show you how to address the problems we described, and more, with a single tool that does the best of what single-purpose tools do, and more – our DataOps platform, Unravel Data.

The Biggest Spark Troubleshooting Challenges in 2024


Spark has become one of the most important tools for processing data – especially non-relational data – and deriving value from it. And Spark serves as a platform for the creation and delivery of analytics, AI, and machine learning applications, among others. But troubleshooting Spark applications is hard – and we’re here to help.

In this blog post, we’ll describe ten challenges that arise frequently in troubleshooting Spark applications. We’ll start with issues at the job level, encountered by most people on the data team – operations people/administrators, data engineers, and data scientists, as well as analysts. Then, we’ll look at problems that apply across a cluster. These problems are usually handled by operations people/administrators and data engineers.

For more on Spark and its use, please see this piece in Infoworld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below.

Five Reasons Why Troubleshooting Spark Applications is Hard

Some of the things that make Spark great also make it hard to troubleshoot. Here are some key Spark features, and some of the issues that arise in relation to them:

Memory-resident

Spark gets much of its speed and power by using memory, rather than disk, for interim storage of source data and results. However, this can cost a lot of resources and money, which is especially visible in the cloud. It can also make it easy for jobs to crash due to lack of sufficient available memory. And it makes problems hard to diagnose – only traces written to disk survive after crashes.

Parallel processing

Spark takes your job and applies it, in parallel, to all the data partitions assigned to your job. (You specify the data partitions, another tough and important decision.) But when a processing workstream runs into trouble, it can be hard to find and understand the problem among the multiple workstreams running at once.

Variants

Spark is open source, so it can be tweaked and revised in innumerable ways. There are major differences between the Spark 1 series, Spark 2.x, and the newer Spark 3. And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks, which is available across the major public clouds. Each variant offers some of its own challenges and a somewhat different set of tools for solving them.

Configuration options

Spark has hundreds of configuration options. And Spark interacts with the hardware and software environment it’s running in, each component of which has its own configuration options. Getting one or two critical settings right is hard; when several related settings have to be correct, guesswork becomes the norm, and over-allocation of resources, especially memory and CPUs (see below) becomes the safe strategy.

Trial and error approach

With so many configuration options, how do you optimize? Well, if a job currently takes six hours, you can change one, or a few, options, and run it again. That takes six hours, plus or minus. Repeat this three or four times, and it's the end of the week. You may have improved the configuration, but you probably won't have exhausted the possibilities as to what the best settings are.

Sparkitecture Diagram

Sparkitecture diagram – the Spark application is the Driver Process, and the job is split up across executors. (Source: Apache Spark for the Impatient on DZone.)

Three Issues with Spark Jobs, On-Premises and in the Cloud

Spark jobs can require troubleshooting against three main kinds of issues:

  1. Failure. Spark jobs can simply fail. Sometimes a job will fail on one try, then work again after a restart. Just finding out that the job failed can be hard; finding out why can be harder. (Since the job is memory-resident, failure makes the evidence disappear.)
  2. Poor performance. A Spark job can run slower than you would like it to; slower than an external service level agreement (SLA); or slower than it would do if it were optimized. It’s very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster.
  3. Excessive cost or resource use. The resource use or, especially in the cloud, the hard dollar cost of a job may raise concerns. As with performance, it’s hard to know how much the resource use and cost “should” be, until you put work into optimizing and see where you’ve gotten to.

All of the issues and challenges described here apply to Spark across all platforms, whether it’s running on-premises, in Amazon EMR, or on Databricks (across AWS, Azure, or GCP). However, there are a few subtle differences:

  • Move to cloud. There is a big movement of big data workloads from on-premises (largely running Spark on Hadoop) to the cloud (largely running Spark on Amazon EMR or Databricks). Moving to cloud provides greater flexibility and faster time to market, as well as access to built-in services found on each platform.
  • Move to on-premises. There is a small movement of workloads from the cloud back to on-premises environments. When a cloud workload “settles down,” such that flexibility is less important, then it may become significantly cheaper to run it on-premises instead.
  • On-premises concerns. Resources (and costs) on-premises tend to be relatively fixed; there can be a lead time of months to years to significantly expand on-premises resources. So the main concern on-premises is maximizing the existing estate: making more jobs run on existing resources, and getting jobs to complete reliably and on time, to maximize the payoff from the existing estate.
  • Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability but at a greater cost.
  • On-premises Spark vs Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks.
  • On-premises Spark vs Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud.

All of these concerns are accompanied by a distinct lack of needed information. Companies often make crucial decisions – on-premises vs. cloud, EMR vs. Databricks, “lift and shift” vs. refactoring – with only guesses available as to what different options will cost in time, resources, and money.

The Biggest Spark Troubleshooting Challenges in 2024

Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast.

Some challenges occur at the job level; these challenges are shared right across the data team. They include:

  1. How many executors should each job use?
  2. How much memory should I allocate for each job?
  3. How do I find and eliminate data skew?
  4. How do I make my pipelines work better?
  5. How do I know if a specific job is optimized?

Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

  1. How do I size my nodes, and match them to the right servers/instance types?
  2. How do I see what’s going on across the Spark stack and apps?
  3. Is my data partitioned correctly for my SQL queries?
  4. When do I take advantage of auto-scaling?
  5. How do I get insights into jobs that have problems?

Section 1: Five Job-Level Challenges

These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level.

However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing.

For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs.

1. How many executors and cores should a job use?

One of the key advantages of Spark is parallelization – you run your job's code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

Spark Programming Tasks Diagram

A Spark job uses three cores to parallelize output. Up to three tasks run simultaneously, and seven tasks are completed in a fixed period of time. (Source: Lisa Hua, Spark Overview, Slideshare.)

You want high usage of cores, high usage of memory per core, and data partitioning appropriate to the job. (Usually, partitioning on the field or fields you're querying on.) This beginner's guide for Hadoop suggests two to three cores per executor, but not more than five; this expert's guide to Spark tuning on AWS suggests that you use three executors per node, with five cores per executor, as your starting point for all jobs. (!)
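For illustration only – the node count below is made up, and the three-executors, five-cores figures are just the starting point quoted above, not a universal rule – translating that heuristic into Spark settings might look like this:

# Sizing sketch: apply the "three executors per node, five cores per executor"
# starting point to a hypothetical 10-node cluster. All numbers are illustrative.
from pyspark.sql import SparkSession

nodes = 10
executors_per_node = 3
cores_per_executor = 5
num_executors = nodes * executors_per_node   # 30 executors in total

spark = (SparkSession.builder
         .appName("sizing-starting-point")
         .config("spark.executor.instances", str(num_executors))
         .config("spark.executor.cores", str(cores_per_executor))
         .getOrCreate())

# Maximum parallel workstreams = executors x cores per executor.
print(num_executors * cores_per_executor)    # 150 tasks can run at once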

You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and the experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone or optimize it. We recommend that you optimize it, because optimization:

  • Helps you save resources and money (not over-allocating)
  • Helps prevent crashes, because you right-size the resources (not under-allocating)
  • Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better

2. How much memory should I allocate for each job?

Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs.

If you have three executors on a 128GB node, and 16GB is taken up by the cluster manager and system resources, that leaves about 37GB per executor. However, a few GB of that will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory.
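As a back-of-the-envelope sketch of that arithmetic (the 10% overhead fraction below is an assumption for illustration; the real spark.executor.memoryOverhead default depends on your platform and settings):

# Back-of-the-envelope memory sizing for the 128GB-node example above.
node_memory_gb = 128
system_reserved_gb = 16          # cluster manager + system resources
executors_per_node = 3

usable_gb = node_memory_gb - system_reserved_gb       # 112 GB left for executors
per_executor_gb = usable_gb // executors_per_node     # ~37 GB each

# Part of each executor's share goes to overhead rather than the JVM heap;
# 10% here is purely illustrative, not a recommendation.
overhead_gb = max(1, int(per_executor_gb * 0.10))
heap_gb = per_executor_gb - overhead_gb

print(per_executor_gb, overhead_gb, heap_gb)          # 37 3 34

# These numbers would then feed settings such as
# spark.executor.memory (e.g. "34g") and spark.executor.memoryOverhead (e.g. "3g").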

This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see the previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises.

For more on memory management, see this widely read article, Spark Memory Management, by our own Rishitesh Mishra.


3. How do I handle data skew and small files?

Data skew and small files are complementary problems. Data skew tends to describe large files – where one key-value, or a few, have a large share of the total data associated with them. This can force Spark, as it’s processing the data, to move data around in the cluster, which can slow down your task, cause low utilization of CPU capacity, and cause out-of-memory errors which abort your job. Several techniques for handling very large files which appear as a result of data skew are given in the popular article, Data Skew and Garbage Collection, by Rishitesh Mishra of Unravel.
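One common mitigation for a skewed join key is "salting" – spreading a hot key across several artificial sub-keys so no single partition carries most of the data. Below is a minimal sketch with made-up table and column names; note that Spark 3's adaptive query execution can also handle some skewed joins automatically.

# Salting sketch: spread a skewed join key across N artificial sub-keys so a
# single hot key no longer lands in one overloaded partition.
# Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

facts = spark.table("facts")        # large table, skewed on "customer_id"
dims = spark.table("customers")     # smaller dimension table

num_salts = 16

# Tag each fact row with a random salt...
salted_facts = facts.withColumn("salt", (F.rand() * num_salts).cast("int"))

# ...and replicate the dimension side so every salt value finds a match.
salts = spark.range(num_salts).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts)

# The join key is now (customer_id, salt), which spreads the hot key's rows
# across up to num_salts partitions.
joined = salted_facts.join(salted_dims, on=["customer_id", "salt"])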

Small files are partly the other end of data skew – a share of partitions will tend to be small. And Spark, since it is a parallel processing system, may generate many small files from parallel processes. Also, some processes you use, such as file compression, may cause a large number of small files to appear, causing inefficiencies. You may need to reduce parallelism (undercutting one of the advantages of Spark), repartition (an expensive operation you should minimize), or start adjusting your parameters, your data, or both (see details here).
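On the small-files side, a frequent pattern is to reduce the number of output partitions just before writing – a sketch with illustrative names and numbers:

# Small-files sketch: collapse many tiny output partitions into fewer, larger
# files before writing. Table name, path, and the target of 64 files are
# illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-sketch").getOrCreate()
result = spark.table("events_processed")

(result
 .coalesce(64)        # cheaper than repartition(): avoids a full shuffle
 .write
 .mode("overwrite")
 .parquet("s3a://my-bucket/output/"))

# If rows also need redistributing (for example by a date column), repartition()
# does a full shuffle but yields evenly sized files:
#   result.repartition(64, "event_date").write.partitionBy("event_date").parquet(...)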

Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets.

4. How do I optimize at the pipeline level?

Spark pipelines are made up of DataFrames, connected by Transformers (which calculate new data from existing data) and Estimators. Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it's easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs.

DataFrame Diagram

Existing Transformers create new Dataframes, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.)

Many pipeline components are “tried and trusted” individually, and are thereby less likely to cause problems than new components you create yourself. However, interactions between pipeline steps can cause novel problems.

Just as job issues roll up to the cluster level, they also roll up to the pipeline level. Pipelines are increasingly the unit of work for DataOps, but it takes truly deep knowledge of your jobs and your cluster(s) for you to work effectively at the pipeline level. This article, which tackles the issues involved in some depth, describes pipeline debugging as an “art.”

5. How do I know if a specific job is optimized?

Neither Spark nor, for that matter, SQL is designed for ease of optimization. Spark comes with a monitoring and management interface, Spark UI, which can help. But Spark UI can be challenging to use, especially for the types of comparisons – over time, across jobs, and across a large, busy cluster – that you need to really optimize a job. And there is no “SQL UI” that specifically tells you how to optimize your SQL queries.

There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.

Spark’s Catalyst optimizer, described here, does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side.
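One low-tech way to see what Catalyst has (and hasn't) done for a query is to print its plans – a sketch with a hypothetical table and columns:

# Plan-inspection sketch: show what the Catalyst optimizer produced for a query.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

df = (spark.table("sales")
      .filter("region = 'EMEA'")
      .groupBy("product_id")
      .sum("amount"))

# Prints the parsed, analyzed, and optimized logical plans plus the physical
# plan; look for pushed-down filters, join strategies, and exchanges (shuffles).
df.explain(mode="extended")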

Section 2: Cluster-Level Challenges

Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands) of jobs, in cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources.

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that's running unoptimized, poorly understood, slowdown-prone, and crash-prone jobs is all but impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.)

Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for a blog post, but here are some of the issues that come up, and a few comments on each:

6. Are Nodes Matched Up to Servers or Cloud Instances?

A Spark node – a physical server or a cloud instance – will have an allocation of CPUs and physical memory. (The whole point of Spark is to run things in actual memory, so this is crucial.) You have to fit your executors and memory allocations into nodes that are carefully matched to existing resources, on-premises, or in the cloud. (You can allocate more or fewer Spark cores than there are available CPUs, but matching them makes things more predictable, uses resources better, and may make troubleshooting easier.)

On-premises, poor matching between nodes, physical servers, executors, and memory results in inefficiencies, but these may not be very visible; as long as the total physical resource is sufficient for the jobs running, there’s no obvious problem. However, issues like this can cause data centers to be very poorly utilized, meaning there’s big overspending going on – it’s just not noticed. (Ironically, the impending prospect of cloud migration may cause an organization to freeze on-prem spending, shining a spotlight on costs and efficiency.)

In the cloud, “pay as you go” pricing shines a different type of spotlight on efficient use of resources – inefficiency shows up in each month’s bill. You need to match nodes, cloud instances, and job CPU and memory allocations very closely indeed, or incur what might amount to massive overspending. This article gives you some guidelines for running Apache Spark cost-effectively on AWS EC2 instances and is worth a read even if you’re running on-premises, or on a different cloud provider.

You still have big problems here. In the cloud, with costs both visible and variable, cost allocation is a big issue. It’s hard to know who’s spending what, let alone what the business results that go with each unit of spending are. But tuning workloads against server resources and/or instances is the first step in gaining control of your spending, across all your data estates.

7. How Do I See What’s Going on in My Cluster?

“Spark is notoriously difficult to tune and maintain,” according to an article in The New Stack. Clusters need to be “expertly managed” to perform well, or all the good characteristics of Spark can come crashing down in a heap of frustration and high costs. (In people’s time and in business losses, as well as direct, hard dollar costs.)

Key Spark advantages include accessibility to a wide range of users and the ability to run in memory. But the most popular tool for Spark monitoring and management, Spark UI, doesn’t really help much at the cluster level. You can’t, for instance, easily tell which jobs consume the most resources over time. So it’s hard to know where to focus your optimization efforts. And Spark UI doesn’t support more advanced functionality – such as comparing the current job run to previous runs, issuing warnings, or making recommendations, for example.

Logs on cloud clusters are lost when a cluster is terminated, so problems that occur in short-running clusters can be that much harder to debug. More generally, managing log files is itself a big data management and data accessibility issue, making debugging and governance harder. This occurs in both on-premises and cloud environments. And, when workloads are moved to the cloud, you no longer have a fixed-cost data estate, nor the “tribal knowledge” accrued from years of running a gradually changing set of workloads on-premises. Instead, you have new technologies and pay-as-you-go billing. So cluster-level management, hard as it is, becomes critical.


8. Is my data partitioned correctly for my SQL queries? (and other inefficiencies)

Operators can get quite upset, and rightly so, over “bad” or “rogue” queries that can cost way more, in resources or cost, than they need to. One colleague describes a team he worked on that went through more than $100,000 of cloud costs in a weekend of crash-testing a new application – a discovery made after the fact. (But before the job was put into production, where it would have really run up some bills.)

SQL is not designed to tell you how much a query is likely to cost, and more elegant-looking SQL queries (i.e., fewer statements) may well be more expensive. The same is true of all kinds of code you have running.

So you have to do some or all of three things:

  • Learn something about SQL, and about coding languages you use, especially how they work at runtime
  • Understand how to optimize your code and partition your data for good price/performance
  • Experiment with your app to understand where the resource use/cost “hot spots” are, and reduce them where possible

All this fits in the “optimize” recommendations from 1. and 2. above. We’ll talk more about how to carry out optimization in Part 2 of this blog post series.

9. When do I take advantage of auto-scaling?

The ability to auto-scale – to assign resources to a job just while it’s running, or to increase resources smoothly to meet processing peaks – is one of the most enticing features of the cloud. It’s also one of the most dangerous; there is no practical limit to how much you can spend. You need some form of guardrails, and some form of alerting, to remove the risk of truly gigantic bills.

The need for auto-scaling might, for instance, determine whether you move a given workload to the cloud, or leave it running, unchanged, in your on-premises data center. But to help an application benefit from auto-scaling, you have to profile it, then cause resources to be allocated and de-allocated to match the peaks and valleys. And you have some calculations to make, because cloud providers charge you more for on-demand resources – those you grab and let go of, as needed – than for reserved or committed-use resources that you keep running for a long time. On-demand resources may cost two or three times as much as reserved ones.

The first step, as you might have guessed, is to optimize your application, as in the previous sections. Auto-scaling is a price/performance optimization, and a potentially resource-intensive one. You should do other optimizations first.

Then profile your optimized application. You need to calculate ongoing and peak memory and processor usage, figure out how long you need each, and the resource needs and cost for each state. And then decide whether it’s worth auto-scaling the job, whenever it runs, and how to do that. You may also need to find quiet times on a cluster to run some jobs, so the job’s peaks don’t overwhelm the cluster’s resources.

To help, Databricks has two types of clusters, and the second type works well with auto-scaling. Most jobs start out in an interactive cluster, which is like an on-premises cluster; multiple people use a set of shared resources. It is, by definition, very difficult to avoid seriously underusing the capacity of an interactive cluster.

So you are meant to move each of your repeated, resource-intensive, and well-understood jobs off to its own, dedicated, job-specific cluster. A job-specific cluster spins up, runs its job, and spins down. This is a form of auto-scaling already, and you can also scale the cluster’s resources to match job peaks, if appropriate. But note that you want your application profiled and optimized before moving it to a job-specific cluster.
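As a sketch of what that job-specific, auto-scaling cluster looks like in practice – the field names follow the Databricks clusters/jobs API as commonly documented, and the Spark version, node type, and worker counts are placeholder values, not recommendations:

# Sketch of a job-specific Databricks cluster definition with autoscaling enabled.
# All concrete values here are placeholders.
job_cluster_spec = {
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "autoscale": {
            "min_workers": 2,   # floor for quiet stretches of the job
            "max_workers": 8,   # ceiling doubles as a cost guardrail
        },
    }
}

# The cluster spins up when the job starts, scales between the bounds while it
# runs, and terminates when the job finishes.
print(job_cluster_spec)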

10. How Do I Find and Fix Problems?

Just as it’s hard to fix an individual Spark job, there’s no easy way to know where to look for problems across a Spark cluster. And once you do find a problem, there’s very little guidance on how to fix it. Is the problem with the job itself, or the environment it’s running in? For instance, over-allocating memory or CPUs for some Spark jobs can starve others. In the cloud, the noisy neighbors problem can slow down a Spark job run to the extent that it causes business problems on one outing – but leaves the same job to finish in good time on the next run.

The better you handle the other challenges listed in this blog post, the fewer problems you’ll have, but it’s still very hard to know how to most productively spend Spark operations time. For instance, a slow Spark job on one run may be worth fixing in its own right and may be warning you of crashes on future runs. But it’s very hard just to see what the trend is for a Spark job in performance, let alone to get some idea of what the job is accomplishing vs. its resource use and average time to complete. So Spark troubleshooting ends up being reactive, with all too many furry, blind little heads popping up for operators to play Whack-a-Mole with.

Impacts of these Challenges

If you meet the above challenges effectively, you’ll use your resources efficiently and cost-effectively. However, our observation here at Unravel Data is that most Spark clusters are not run efficiently.

What we tend to see most are the following problems – at a job level, within a cluster, or across all clusters:

  • Under-allocation. It can be tricky to allocate your resources efficiently on your cluster, partition your datasets effectively, and determine the right level of resources for each job. If you under-allocate (either for a job’s driver or the executors), a job is likely to run too slowly or crash. As a result, many developers and operators resort to…
  • Over-allocation. If you assign too many resources to your job, you’re wasting resources (on-premises) or money (cloud). We hear about jobs that need, for example, 2GB of memory but are allocated much more – in one case, 85GB.

Applications can run slowly, because they’re under-allocated – or because some apps are over-allocated, causing others to run slowly. Data teams then spend much of their time fire-fighting issues that may come and go, depending on the particular combination of jobs running that day. With every level of resource in shortage, new, business-critical apps are held up, so the cash needed to invest against these problems doesn’t show up. IT becomes an organizational headache, rather than a source of business capability.

Conclusion

To jump ahead to the end of this series a bit, our customers here at Unravel are easily able to spot and fix over-allocation and inefficiencies. They can then monitor their jobs in production, finding and fixing issues as they arise. Developers even get on board, checking their jobs before moving them to production, then teaming up with Operations to keep them tuned and humming.

One Unravel customer, Mastercard, has been able to reduce usage of their clusters by roughly half, even as data sizes and application density have moved steadily upward during the global pandemic. And everyone gets along better, and has more fun at work, while achieving these previously unimagined results.

So, whether you choose to use Unravel or not, develop a culture of right-sizing and efficiency in your work with Spark. It will seem to be a hassle at first, but your team will become much stronger, and you’ll enjoy your work life more, as a result.

You need a sort of X-ray of your Spark jobs, better cluster-level monitoring, environment information, and to correlate all of these sources into recommendations. In Troubleshooting Spark Applications, Part 2: Solutions, we will describe the most widely used tools for Spark troubleshooting – including the Spark Web UI and our own offering, Unravel Data – and how to assemble and correlate the information you need.

The Spark 3.0 Performance Impact of Different Kinds of Partition Pruning


In this blog post, I’ll set up and run a couple of experiments to demonstrate the effects of different kinds of partition pruning in Spark.

Big data has a complex relationship with SQL, which has long served as the standard query language for traditional databases – Oracle, Microsoft SQL Server, IBM DB2, and a vast number of others.

Only relational databases support true SQL natively, and many big data databases fall into the NoSQL – i.e., non-relational – camp. For these databases, there are a number of near-SQL alternatives.

When dealing with big data, the most crucial part of processing and answering any SQL or near-SQL query is likely to be the I/O cost – that is, moving data around, including making one or more copies of the same data, as the query processor gathers needed data and completes the response to the query.

A great deal of effort has gone into reducing I/O costs for queries. Some of the techniques used are indexes, columnar data storage, data skipping, etc.

Partition pruning, described below, is one of the data skipping techniques used by most of the query engines like Spark, Impala, and Presto. One of the advanced ways of partition pruning is called dynamic partition pruning. In this blog post, we will work to understand both of these concepts and how they impact query execution, while reducing I/O requirements.

What is Partition Pruning?

Let’s first understand what partition pruning is, how it works, and the implications of this for performance.

If a table is partitioned and one of the query predicates is on the table partition column(s), then only partitions that contain data which satisfy the predicate get scanned. Partition pruning eliminates scanning of the partitions which are not required by the query.

There are two types of partition pruning.

  • Static partition pruning. Partition filters are determined at query analysis/optimize time.
  • Dynamic partition pruning. Partition filters are determined at runtime, looking at the actual values returned by some of the predicates.
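A quick way to see whether static pruning kicked in for one of your own queries is to look at the scan node in the physical plan. Below is a minimal sketch; the table name and partition column are hypothetical, and the table is assumed to be partitioned on event_date.

# Sketch: check whether a literal predicate on the partition column prunes
# partitions. "events" and "event_date" are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# With a predicate on the partition column, the scan node of the formatted plan
# should show a partition filter and a reduced partition count.
spark.sql("SELECT val FROM events WHERE event_date = '2021-04-01'").explain(mode="formatted")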

Experiment Setup

For this experiment we have created three Hive tables, to mimic real-life workloads on a smaller scale. Although the number of tables is small, and their data size almost insignificant, the findings are valuable for clearly understanding the impact of partition pruning, and for gaining insight into some of its intricacies.

Spark version used: 3.0.1

Important configuration settings for our experiment:

  • spark.sql.cbo.enabled = false (default)
  • spark.sql.cbo.joinReorder.enabled = false (default)
  • spark.sql.adaptive.enabled = false (default)
  • spark.sql.optimizer.dynamicPartitionPruning.enabled = false (default is true)

 

We have disabled dynamic partition pruning for the first half of the experiment to see the effect of static partition pruning.

Table Name | Columns | Partition Column | Row Count | Num Partitions
T1 | key, val | key | 21,000,000 | 1000
T2 | dim_key, val | dim_key | 800 | 400
T3 | key, val | key | 11,000,000 | 1000
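For readers who want to recreate a comparable setup, here is one possible sketch. This is not the exact code behind the experiment – the data-generation logic and file format are our own assumptions – but it produces three Hive tables partitioned the same way, with the DPP setting disabled as above.

# Sketch of a comparable setup (not the exact code used for the experiment):
# disable dynamic partition pruning and create three partitioned Hive tables.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("partition-pruning-experiment")
         .enableHiveSupport()
         .getOrCreate())

# CBO, join reordering, and AQE are left at their defaults (false);
# dynamic partition pruning is explicitly disabled for the first half.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")

def build(name, rows, partitions):
    """Create a table partitioned on 'key' with roughly even partition sizes."""
    df = (spark.range(rows)
          .withColumn("key", (F.col("id") % partitions).cast("int"))
          .withColumn("val", F.col("id").cast("string"))
          .drop("id"))
    df.write.mode("overwrite").format("parquet").partitionBy("key").saveAsTable(name)

build("t1", 21_000_000, 1000)
build("t3", 11_000_000, 1000)

# T2 is the small dimension table, partitioned on dim_key.
t2 = (spark.range(800)
      .withColumn("dim_key", (F.col("id") % 400).cast("int"))
      .withColumn("val", F.col("id").cast("string"))
      .drop("id"))
t2.write.mode("overwrite").format("parquet").partitionBy("dim_key").saveAsTable("t2")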

 

Static Partition Pruning

We will begin with the benefits of static partition pruning, and how they affect table scans.

Let’s first try with a simple join query between T1 & T2.

select t1.key from t1,t2 where t1.key=t2.dim_key

 

Query Plan

This query scans all 1000 partitions of T1 and all 400 partitions of T2. So in this case no partitions were pruned.

Let’s try putting a predicate on T1’s partition column.

select t1.key from t1 ,t2 where t1.key=t2.dim_key and t1.key = 40

What happened here? As we added the predicate “t1.key = 40,” and “key” is a partitioning column, that means the query requires data from only one partition, which is reflected in the “number of partitions read” on the scan operator for T1.

But there is another interesting thing. If you observe the scan operator for T2, it also says only one partition is read. Why?

If we deduce logically, we need:

  • All rows from T1 where t1.key = 40
  • All rows from T2 where t2.dim_key = t1.key
  • Hence, all rows from T2 where t2.dim_key = 40
  • All rows of T2 where t2.dim_key=40 can only be in one partition of T2, as it's a partitioning column

What if the predicate would have been on T2, rather than T1? The end result would be the same. You can check yourself.

select t1.key from t1,t2 where t1.key=t2.dim_key and t2.dim_key = 40

 

What if the predicate satisfies more than one partition? The ultimate effect would be the same. The query processor would eliminate all the partitions which don’t satisfy the predicate.

select t1.key from t1 ,t2 where t1.key=t2.dim_key and t2.dim_key > 40 and t2.dim_key < 100

 

As we can see, there are 59 partitions which are scanned from each table. Still a significant saving as compared to scanning 1000 partitions.

So far, so good. Now let’s try adding another table, T3, to the mix.

select t1.key from t1, t3, t2
where t1.key=t3.key and t1.val=t3.val and t1.key=t2.dim_key and t2.dim_key = 40

 

Something strange happened here. As expected, only one partition was scanned for both T1 and T2. But 1000 partitions were scanned for T3.

This is a bit odd. If we extend our logic above, only one partition should have been picked up for T3 as well. It was not so.

Let’s try the same query, arranged a bit differently. Here we have just changed the order of join tables. The rest of the query is the same.

select  t1.key from t1, t2, t3
where t1.key=t3.key and t1.val=t3.val and t1.key=t2.dim_key and t2.dim_key = 40

 


Now we can see the query scanned one partition from each of the tables. So is the join order important for static partition pruning? It should not be, but we see that it is. This looks like it’s a limitation in Spark 3.0.1.

Dynamic Partition Pruning

So far, we have looked at queries which have one predicate on a partitioning column(s) of tables – key, dim_key etc. What will happen if the predicate is on a non-partitioned column of T2, like “val”?

Let’s try such a query.

select t1.key from t1, t2 where t1.key=t2.dim_key and t2.val > 40

 

As you might have guessed, all the partitions of both tables are scanned. Can we optimize this, given that the size of T2 is very small? Can we eliminate certain partitions from T1, knowing that predicates on T2 may select data from only a subset of its partitions?

This is where dynamic partition pruning will help us. Let’s re-enable it.

set spark.sql.optimizer.dynamicPartitionPruning.enabled=true

 

Then, re-run the above query.


As we can see, although all 400 partitions were scanned for T2, only 360 partitions were scanned for T1. We won’t go through the details of dynamic partition pruning (DPP). A lot of materials already exist on the web, such as the following:

https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89

Dynamic Partition Pruning in Apache Spark – Bogdan Ghit and Juliusz Sompolski (Databricks).

You can read about how DPP is implemented in the above blogs. Our focus is its impact on the queries.

Now that we have executed a simple case, let’s fire off a query that’s a bit more complex.

select t1.key from t1, t2, t3
where t1.key=t3.key and t3.key=t2.dim_key and t1.key=t2.dim_key and t2.val > 40 and t2.val < 100

 


A lot of interesting things are going on here. As before, scans of T1 got the benefit of partition pruning. Only 59 partitions of T1 got scanned. But what happened to T3? All 1000 partitions were scanned.

To investigate the problem, I checked Spark’s code. I found that, at the moment, only a single DPP subquery is supported in Spark.

Ok, but how does this affect query writers? Now it becomes more important to get our join order right so that a more costly table can get help from DPP. For example, if the above query were written as shown below, with the order of T1 and T3 changed, what would be the impact?

select t1.key from t3, t2, t1
where t1.key=t3.key and t3.key=t2.dim_key and t1.key=t2.dim_key and t2.val > 40 and t2.val < 100

You guessed it! DPP will be applied to T3 rather than T1, which will not be very helpful for us, as T1 is almost twice the size of T3.

That means in certain cases join order is very important. Isn’t it the job of the optimizer, more specifically the “cost-based optimizer” (CBO), to do this?

Let’s now switch on the cost-based optimizer for our queries and see what happens.

set spark.sql.cbo.enabled=true

set spark.sql.cbo.joinReorder.enabled=true
ANALYZE TABLE t1 partition (key) compute statistics NOSCAN
ANALYZE TABLE t3 partition (key) compute statistics NOSCAN
ANALYZE TABLE t2 partition (dim_key) compute statistics NOSCAN
ANALYZE TABLE t2 compute statistics for columns dim_key
ANALYZE TABLE t1 compute statistics for columns key
ANALYZE TABLE t3 compute statistics for columns key

Let's re-run the above query and check whether T1 is picked up for DPP. As we can see below, there is no change in the plan. We can conclude that the CBO is not able to change where DPP is applied.

Conclusion

As we saw, even with as few as three simple tables, there are many permutations of running queries, each behaving quite differently from the others. As SQL developers we can try to optimize our queries ourselves and understand the impact of each one. But imagine dealing with a million queries a day.

Here at Unravel Data, we have customers who run more than a million SQL queries a day across Spark, Impala, Hive, and other engines. Each of these engines has its own optimizer, and each optimizer behaves differently.

Optimization is a hard problem to solve, and during query runtime it’s even harder due to time constraints. Moreover, as datasets grow, collecting statistics, on which optimizers depend, becomes a big overhead.

But fortunately, not all queries are ad hoc. Most are repetitive and can be tuned for future runs.

We at Unravel are solving this problem, making it much easier to tune large numbers of query executions quickly.

In upcoming articles we will discuss the approaches we are taking to tackle these problems.

The post The Spark 3.0 Performance Impact of Different Kinds of Partition Pruning appeared first on Unravel.

How to intelligently monitor Kafka/Spark Streaming data pipelines https://www.unraveldata.com/how-to-intelligently-monitor-kafka-spark-streaming-data-pipelines/ https://www.unraveldata.com/how-to-intelligently-monitor-kafka-spark-streaming-data-pipelines/#respond Thu, 06 Jun 2019 18:41:06 +0000 https://www.unraveldata.com/?p=3020


Identifying issues in distributed applications like a Spark streaming application is not trivial. This blog describes how Unravel helps you connect the dots across streaming applications to identify bottlenecks.

Spark streaming is widely used in real-time data processing, especially with Apache Kafka. A typical scenario involves a Kafka producer application writing to a Kafka topic. The Spark application then subscribes to the topic and consumes records. The records might be further processed downstream using operations like map and foreachRDD, or saved into a datastore.

Below are two scenarios illustrating how you can use Unravel's APMs to inspect, understand, correlate, and finally debug issues around a Spark streaming app consuming a Kafka topic. They demonstrate how Unravel's APMs can help you perform an end-to-end analysis of your data pipelines and their applications. In turn, this wealth of information helps you effectively and efficiently debug and resolve issues that would otherwise require more time and effort.

Use Case 1: Producer Failure

Consider a running Spark application that has not processed any data for a significant amount of time. The application's previous iterations successfully processed data, and you know for certain that the current iteration should be processing data.

Detect Spark is processing no data

When you notice the Spark application has been running without processing any data, you want to check the application’s I/O.

Checking I/O Consumption

Click on the application to bring it up and see its I/O KPIs.

View Spark input details

When you see the Spark application is consuming from Kafka, you need to examine the Kafka side of the process to determine the issue. Click on a batch to find the topic it is consuming.

View Kafka topic details

Bring up the Kafka topic details page, which graphs Bytes In Per Second, Bytes Out Per Second, Messages In Per Second, and Total Fetch Requests Per Second. In this case, the Bytes In Per Second and Messages In Per Second graphs show a steep decline, indicating that no new data is arriving in the topic.

The Spark application is not consuming data because there is no data in the pipeline to consume. Notify the owner of the producer application that writes to this particular topic; upon notification, they can drill down to determine and then resolve the underlying issues causing writes to this topic to fail.

Use Case 2: Slow Spark app, offline Kafka partitions

In this scenario, an application's run has processed significantly less data than the expected norm. In the previous scenario, the analysis was fairly straightforward; in cases like this one, the underlying root cause can be far more subtle to identify. Unravel's APM can help you root-cause such issues quickly and efficiently.

Detect inconsistent performance

When you see a considerable difference in the data processed by consecutive runs of a Spark App, bring up the APMs for each application.

Detect drop in input records

Examine the trend lines to identify a time window in which there is a drop in input records for the slow app. The image of the slow app shows a drop-off in consumption.

Drill down to I/O drop off

Drill down into the slow application’s stream graph to narrow the time period (window) in which the I/O dropped off.

Drill down to suspect time interval

Unravel displays the batches that were running during the problematic time period. Inspect the time interval to see the batches' activity. In this example, no input records were processed during the suspect interval. You can dig further into the issue by selecting a batch and examining its Input tab. In the case below, you can infer that no new offsets have been read, based on the input source's description, “Offsets 1343963 to Offset 1343963”.

View KPIs on the Kafka monitor screen

You can further debug the problem on the Kafka side by viewing the Kafka cluster monitor. Navigate to OPERATIONS > USAGE DETAILS > KAFKA. At a glance, the cluster’s KPIs convey important and pertinent information.

Here, the first two metrics show that the cluster is facing issues which need to be resolved as quickly as possible.

# of under replicated partitions is a broker-level metric: the count of partitions for which the broker is the leader replica and for which one or more follower replicas have not yet caught up. In this case there are 131 under-replicated partitions.

# of offline partitions is a broker-level metric provided by the cluster's controller broker. It is a count of the partitions that currently have no leader; such partitions are not available for producing or consuming. In this case there are two (2) offline partitions.

You can select the BROKER tab to see the broker table. Here the table hints that the broker Ignite1.kafka2 is facing some issue, so the broker's status should be checked.

  • Broker Ignite1.kafka2 has two (2) offline partitions. The broker table shows that it is (or was) the controller for the cluster; we could have inferred this, because only the controller reports offline partitions. Examining the table further, we see that Ignite.kafka1 also shows an active controller count of one (1).
  • The Kafka KPI # of Controllers lists the active controller count as one (1), which should indicate a healthy cluster. The fact that two (2) brokers each report an active controller count of one (1) indicates the cluster is in an inconsistent state.

Identify cluster controller

In the broker table, we see that broker Ignite1.kafka2 has two offline Kafka partitions and an active controller count of 1, which indicates that this broker is (or was) the controller for the cluster (offline partitions are a metric exclusive to the controller broker). Notice that the controller count for Ignite.kafka1 is also 1, which again indicates that the cluster itself is in an inconsistent state.

This hints that the Ignite1.kafka2 broker is facing some issues, and the first thing to check is the status of that broker.

View consumer group status

You can further corroborate the hypothesis by checking the status of the consumer group for the Kafka topic the Spark app is consuming from. In this example, the consumer group's status indicates it is currently stalled. The topic the stalled group is consuming is tweetSentiment-1000.

Topic drill down

To drill down into the consumer group's topic, click on the TOPIC tab in the Kafka cluster manager and search for the topic. In this case, the topic's trend lines for the time range in which the Spark application's consumption dropped off show a sharp decrease in Bytes In Per Second and Bytes Out Per Second. This decrease explains why the Spark app is not processing any records.

Consumer group details

To view the consumer group's lag and offset consumption trends, click on the consumer group listed for the topic. Below, the log-end-offset trend line for the topic's consumer group, stream-consumer-for-tweetSentimet1000, is a constant (flat) line, showing that no new offsets have been consumed over time. This further supports our hypothesis that something is wrong with the Kafka cluster, and especially with broker Ignite.kafka2.

Conclusion

These are but two examples of how Unravel helps you to identify, analyze, and debug Spark Streaming applications consuming from Kafka topics. Unravel's APMs collate, consolidate, and correlate information from various stages in the data pipeline (Spark and Kafka), thereby allowing you to troubleshoot applications without ever having to leave Unravel.

 

The post How to intelligently monitor Kafka/Spark Streaming data pipelines appeared first on Unravel.

Why Memory Management is Causing Your Spark Apps To Be Slow or Fail https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/ https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/#respond Wed, 20 Mar 2019 00:25:31 +0000 https://www.unraveldata.com/?p=2428 Common Reasons Your Spark Apps Are Slow or Failing

Common Reasons Your Spark Apps Are Slow or Failing

Spark applications are easy to write and easy to understand when everything goes according to plan. However, it becomes very difficult when Spark applications start to slow down or fail. (See our blog Spark Troubleshooting, Part 1 – Ten Biggest Challenges.) Sometimes a well-tuned application might fail due to a data change or a data layout change. Sometimes an application that had been running well starts behaving badly due to resource starvation. The list goes on and on.

It’s not only important to understand a Spark application, but also its underlying runtime components like disk usage, network usage, contention, etc., so that we can make an informed decision when things go bad.

In this series of articles, I aim to capture some of the most common reasons why a Spark application fails or slows down. The first and most common is memory management.

If we were to get all Spark developers to vote, out-of-memory (OOM) conditions would surely be the number one problem everyone has faced.  This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:

  • Incorrect usage of Spark
  • High concurrency
  • Inefficient queries
  • Incorrect configuration

To avoid these problems, we need to have a basic understanding of Spark and our data. There are certain things that can be done that will either prevent OOM or rectify an application which failed due to OOM.  Spark’s default configuration may or may not be sufficient or accurate for your applications. Sometimes even a well-tuned application may fail due to OOM as the underlying data has changed.

Out of memory issues can be observed for the driver node, executor nodes, and sometimes even for the node manager. Let’s take a look at each case.

See How Unravel identifies Spark memory issues
Create a free account

Out of memory at the driver level

A driver in Spark is the JVM where the application’s main control flow runs. More often than not, the driver fails with an OutOfMemory error due to incorrect usage of Spark. Spark is an engine to distribute workload among worker machines. The driver should only be considered as an orchestrator. In typical deployments, a driver is provisioned less memory than executors. Hence we should be careful what we are doing on the driver.

Common causes which result in driver OOM are:

1. rdd.collect()
2. sparkContext.broadcast
3. Driver memory configured too low for the application's requirements
4. Misconfiguration of spark.sql.autoBroadcastJoinThreshold. Spark uses this limit to broadcast a relation to all the nodes in case of a join operation. At the very first usage, the whole relation is materialized at the driver node. Sometimes multiple tables are also broadcast as part of the query execution.

Try to write your application in such a way that you avoid explicit result collection at the driver; you can delegate that work to the executors. For example, if you want to save the results to a particular file, rather than collecting them at the driver, have the executors write the output directly, as in the sketch below.
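A minimal PySpark sketch of that advice (the DataFrame name and output path are illustrative):

# Anti-pattern: pulls the entire result set into the driver's memory.
rows = result_df.collect()

# Better: have the executors write the output directly to storage,
# so the full result never has to fit on the driver.
result_df.write.mode("overwrite").parquet("hdfs:///tmp/job_output")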


If you are using Spark SQL and the driver runs out of memory due to broadcast relations, then you can either increase the driver memory, if possible, or reduce the spark.sql.autoBroadcastJoinThreshold value so that your join operations use the more memory-friendly sort-merge join.
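For example (the threshold value is illustrative; it is specified in bytes, and -1 disables automatic broadcast altogether):

# Cap automatic broadcasts at 10 MB so larger relations fall back to
# a sort-merge join instead of being materialized on the driver.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Or disable automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)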

Out of memory at the executor level

This is a very common issue with Spark applications which may be due to various reasons. Some of the most common reasons are high concurrency, inefficient queries, and incorrect configuration. Let’s look at each in turn.

High concurrency

Before understanding why high concurrency might be a cause of OOM, let’s try to understand how Spark executes a query or job and what are the components that contribute to memory consumption.

Spark jobs or queries are broken down into multiple stages, and each stage is further divided into tasks. The number of tasks depends on various factors like which stage is getting executed, which data source is getting read, etc. If it’s a map stage (Scan phase in SQL), typically the underlying data source partitions are honored.

For example, if a Hive ORC table has 2000 partitions, then 2000 tasks get created for the map stage to read the table, assuming partition pruning does not come into play. If it's a reduce stage (shuffle stage), then Spark will use either the spark.default.parallelism setting for RDDs or spark.sql.shuffle.partitions for DataSets to determine the number of tasks. How many tasks are executed in parallel on each executor depends on the spark.executor.cores property. If this value is set too high without due consideration of the available memory, executors may fail with OOM. Now let's see what happens under the hood while a task is executing, and some probable causes of OOM.
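As a quick reference before we go under the hood, here is how the knobs just mentioned are typically set (the values are purely illustrative; the right numbers depend on data volume and executor sizing):

# Number of reduce-side (shuffle) tasks for DataFrame/Dataset operations;
# this one can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", 400)

# spark.default.parallelism (for RDDs) and spark.executor.cores are fixed at
# launch time, for example:
#   spark-submit --conf spark.default.parallelism=400 --conf spark.executor.cores=4 ...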

Let's say we are executing a map task or the scanning phase of SQL from an HDFS file or a Parquet/ORC table. For HDFS files, each Spark task will read a 128 MB block of data. So if 10 parallel tasks are running, then the memory requirement is at least 128 MB * 10, just for storing the partitioned data. This again ignores any data compression, which might cause the data to blow up significantly depending on the compression algorithm.

Spark reads Parquet in a vectorized format. To put it simply, each task of Spark reads data from the Parquet file batch by batch. As Parquet is columnar, these batches are constructed for each of the columns.  It accumulates a certain amount of column data in memory before executing any operation on that column. This means Spark needs some data structures and bookkeeping to store that much data. Also, encoding techniques like dictionary encoding have some state saved in memory. All of them require memory.

 

Figure: Spark task and memory components while scanning a table

So with more concurrency, the overhead increases. Also, if there is a broadcast join involved, then the broadcast variables will also take some memory. The above diagram shows a simple case where each executor is executing two tasks in parallel.

Inefficient queries

While Spark's Catalyst engine tries to optimize a query as much as possible, it can't help if the query itself is badly written – for example, selecting all the columns of a Parquet/ORC table. As seen in the previous section, each column needs some in-memory column batch state; the more columns are selected, the greater the overhead.

Try to read as few columns as possible, and use filters wherever possible so that less data is fetched to the executors. Some data sources support partition pruning; if your query can be converted to use the partition column(s), it will reduce data movement to a large extent.
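A small PySpark sketch of those suggestions (the path, column names, and partition column are illustrative):

# Read only the needed columns and filter early; if the table is partitioned
# by event_date, the filter also triggers partition pruning at the source.
events = (spark.read.parquet("s3://my-bucket/events")
          .select("event_id", "event_date", "amount")
          .where("event_date = '2021-10-01'"))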

Incorrect configuration

Incorrect configuration of memory and caching can also cause failures and slowdowns in Spark applications. Let’s look at some examples.

Executor & Driver memory

Each application’s memory requirement is different. Depending on the requirement, each app has to be configured differently. You should ensure correct spark.executor.memory or spark.driver.memory values depending on the workload.  As obvious as it may seem, this is one of the hardest things to get right. We need the help of tools to monitor the actual memory usage of the application. Unravel does this pretty well.

Memory Overhead

Sometimes it's not the executor memory, but rather the YARN container memory overhead, that causes OOM, or the node gets killed by YARN. “YARN kill” messages typically look like this:

(Screenshot: a typical YARN kill message)

YARN runs each Spark component, like executors and drivers, inside containers. Overhead memory is the off-heap memory used for JVM overheads, interned strings, and other JVM metadata. In this case, you need to configure spark.yarn.executor.memoryOverhead to a proper value; typically, 10% of total executor memory should be allocated for overhead.
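A hedged example of sizing these settings when building the session (values are illustrative; the overhead property is named spark.yarn.executor.memoryOverhead on older Spark versions and spark.executor.memoryOverhead on newer ones, and driver memory is usually passed at submit time since the driver JVM may already be running when the session is created):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-sizing-example")
         .config("spark.executor.memory", "8g")          # per-executor heap
         .config("spark.executor.memoryOverhead", "1g")  # roughly 10% of executor memory for off-heap overhead
         .config("spark.driver.memory", "4g")            # often set via spark-submit --driver-memory instead
         .getOrCreate())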

Caching Memory

If your application uses Spark caching to store some datasets, then it’s worthwhile to consider Spark’s memory manager settings. Spark’s memory manager is written in a very generic fashion to cater to all workloads. Hence, there are several knobs to set it correctly for a particular workload.

Spark has defined memory requirements as two types: execution and storage. Storage memory is used for caching purposes and execution memory is acquired for temporary structures like hash tables for aggregation, joins etc.

Both execution and storage memory are obtained from a configurable fraction of (total heap memory – 300 MB), controlled by the spark.memory.fraction setting, which defaults to 60%. Of that, by default 50% is assigned to storage (configurable via spark.memory.storageFraction) and the rest is assigned to execution.

There are situations where each of the above pools of memory, namely execution and storage, may borrow from each other if the other pool is free. Also, storage memory can be evicted to a limit if it has borrowed memory from execution. However, without going into those complexities, we can configure our program such that our cached data which fits in storage memory should not cause a problem for execution.

If we don’t want all our cached data to sit in memory, then we can configure “spark.memory.storageFraction” to a lower value so that extra data would get evicted and execution would not face memory pressure.
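For example (the defaults are noted in the comments; the lowered storage fraction is illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Fraction of (heap - 300MB) shared by execution and storage; default 0.6.
         .config("spark.memory.fraction", "0.6")
         # Portion of that fraction protected for cached data; default 0.5.
         # Lowering it lets cached blocks be evicted sooner, easing execution-memory pressure.
         .config("spark.memory.storageFraction", "0.3")
         .getOrCreate())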

Out of memory at Node Manager

Spark applications that shuffle data as part of group-by or join-like operations incur significant overhead. Normally the data shuffling is done by the executor process; if the executor is busy or under heavy GC load, it can't serve shuffle requests. This problem is alleviated to some extent by using an external shuffle service.

The external shuffle service runs on each worker node and handles shuffle requests from executors. Executors can read shuffle files from this service rather than from each other. This lets requesting executors read shuffle files even if the producing executors are killed or slow. Also, when dynamic allocation is enabled, it's mandatory to enable the external shuffle service.

When the Spark external shuffle service is configured with YARN, the NodeManager starts an auxiliary service that acts as the external shuffle service provider. By default, NodeManager memory is around 1 GB; however, applications that do heavy data shuffling might fail due to the NodeManager going out of memory. It's imperative to properly configure your NodeManager if your applications fall into this category.
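On the Spark side, enabling the service looks like the sketch below; the YARN NodeManager must also be configured to run the spark_shuffle auxiliary service, and its heap increased if shuffle volume is heavy (those NodeManager settings are not shown here):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.shuffle.service.enabled", "true")     # use the external shuffle service
         .config("spark.dynamicAllocation.enabled", "true")   # dynamic allocation requires it
         .getOrCreate())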

End of Part I – Thanks for the Memory

Spark's in-memory processing is a key part of its power. Therefore, effective memory management is a critical factor in getting the best performance, scalability, and stability from your Spark applications and data pipelines. However, the Spark default settings are often insufficient. Depending on the application and environment, certain key configuration parameters must be set correctly to meet your performance goals. Having a basic idea about them, and how they can affect the overall application, helps.

I have provided some insights into what to look for when considering Spark memory management. This is an area that the Unravel platform understands and optimizes very well, with little, if any, human intervention needed. See how Unravel can make Spark perform better and more reliably here.

Even better, schedule a demo to see Unravel in action. The performance speedups we are seeing for Spark apps are pretty significant.

See how Unravel simplifies Spark Memory Management.

Create a free account

In Part II of this series Why Your Spark Apps are Slow or Failing: Part II Data Skew and Garbage Collection, I will be discussing how data organization, data skew, and garbage collection impact Spark performance.

The post Why Memory Management is Causing Your Spark Apps To Be Slow or Fail appeared first on Unravel.

Tips to optimize Spark jobs to improve performance https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/ https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/#respond Tue, 23 Aug 2022 01:48:56 +0000 https://www.unraveldata.com/?p=9992


Summary: Sometimes the insight you’re shown isn’t the one you were expecting. Unravel DataOps observability provides the right, and actionable, insights to unlock the full value and potential of your Spark application.

One of the key features of Unravel is our automated insights. This is the feature where Unravel analyzes the finished Spark job and then presents its findings to the user. Sometimes those findings can be layered and not exactly what you expect.

Why Apache Spark Optimization?

Let's take a scenario where you are developing a new Spark job and doing some testing. The goal of this testing is to ensure the Spark application is properly optimized. You want to see whether you have tuned it appropriately, so you bump the resource allocation up really high, expecting Unravel to show a “Container Sizing” event, or something along those lines. Instead of a “Container Sizing” event, a few other events appear – in our case, “Contended Driver” and “Long Idle Time for Executors”. Let's take a look at why this might be the case!

Recommendations to optimize Spark

When Unravel presents recommendations, it presents them based on what is most helpful for the current state of the application. It will also, when it can, present only the best case for improving the application. In certain scenarios this means some insights are not shown, because acting on them would end up causing more harm than good. There can be many reasons for this, but let's take a look at how Unravel presents this particular application.

Below are the resource utilization graphs for the two runs of the application we are developing:

Original run: (resource utilization graph)

New run: (resource utilization graph)

The most glaring issue is the idle time. We can see the job is doing bursts of work and then just sitting around doing nothing. If the idle time were addressed from an application perspective, it would improve the job's performance tremendously; it is most likely masking other potential performance improvements. If we dig into the job a bit further, we can see this:

(Job processing stages pie chart)

The above is a breakdown of what the job was doing for the whole time. Nearly 50% was spent on ScheduledWaitTime! This leads to the Executor idle time recommendation:

(Executor idle time recommendation)

Taking all of the above information together, we can see that the application was coded in such a way that it waits around for something for a long while. At this point you could hop into the “Program” tab within Unravel to take a look at the actual source code associated with this application. We can say the same thing about the contended driver:

(Contended driver recommendation)

With this one we should examine why the code is spending so much time with the driver. It goes hand-in-hand with the idle time because while the executors are idle, the driver is still working away. Once we take a look at our code and resolve these two items, we would see a huge increase in this job’s performance!

Another item which was surfaced when looking at this job was this:

(Processing stages summary and executor garbage collection time)

This is telling us that the Java processes spent a lot of time in garbage collection. This can lead to thrashing and other types of issues. More than likely it’s related to the recommendations Unravel made that need to be examined more deeply.
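One way to dig into executor GC behavior outside Unravel is to turn on GC logging for the executors; a hedged sketch is below (the flags shown are JDK 8-style and would need adjusting for newer JVMs):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions",
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
         .getOrCreate())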

Among all of the recommendations we saw, the expected resource recommendation never appeared. That is because Unravel presents the most important insights – the things the developer actually needs to look at.

These issues run deeper than settings changes. An interesting point is that both jobs we looked at required some resource tuning; we could see that even the original job was overallocated. The problem with these jobs, though, is that showing just the setting changes isn't the whole story.

If Unravel just presented the resource change and it were implemented, it might lead to the job failing – for example, being killed because of garbage collection issues. Unravel is attempting to steer developers in the right direction and give them the tools they need to determine where the issue is in their code. If they fix the larger issues, like high GC or a contended driver, then they can start to really tune and optimize their job.

Next steps:

Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

  • Never worry about monitoring and observing your entire data estate.
  • Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
  • Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

How can you get started?

Create your free account today.
Book a demo to see how Unravel simplifies modern data stack management.

The post Tips to optimize Spark jobs to improve performance appeared first on Unravel.

Tuning Spark applications: Detect and fix common issues with Spark driver https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/ https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/#respond Wed, 20 Jul 2022 00:01:04 +0000 https://www.unraveldata.com/?p=9894 Fireworks Sparkler Background


Learn more about Apache Spark drivers and how to tune spark application quickly.

During the lifespan of a Spark application, the driver should distribute most of the work to the executors, instead of doing the work itself. (This is one of the advantages of using Python with Spark over Pandas.) The Contended Driver event is detected when a Spark application spends far more time on the driver than in the Spark jobs on the executors. Leaving the executors idle can waste a lot of time and money, especially in a cloud environment.

Here is a sample application shown in Unravel for Azure Databricks. Note in the Gantt chart that there was a huge gap of about 2 hours 40 minutes between Job-1 and Job-2, while the duration of most of the Spark jobs was under 1 minute. Based on the data Unravel collected, Unravel detected the bottleneck: the Contended Driver event.

Further digging into the Python code itself, we found that it actually tried to ingest data from the MongoDB server on the driver node alone. This left all the executors idle while the meter was still ticking.

Unravel Sample Azure Databricks Overview
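In code, the difference between the two patterns looks roughly like the sketch below (a hedged illustration: the connection URI, database, and collection names are placeholders, and the connector format and option names vary by MongoDB Spark connector version):

import pymongo

# Driver-only ingestion: every document flows through the driver while the
# executors sit idle.
client = pymongo.MongoClient("mongodb://host:27017")
docs = list(client["mydb"]["mycoll"].find())
df = spark.createDataFrame(docs)

# Distributed read via the MongoDB Spark connector: the executors read
# partitions of the collection in parallel.
df = (spark.read.format("mongodb")            # "mongo" on older connector versions
      .option("connection.uri", "mongodb://host:27017")
      .option("database", "mydb")
      .option("collection", "mycoll")
      .load())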

A network issue had caused the MongoDB ingestion to slow down from 15 minutes to more than 2 hours. Once this issue was resolved, there was about a 93% reduction in cost. An alternative solution is to move the MongoDB ingestion out of the Spark application; if it has no dependency on previous Spark jobs, it can run before the Spark application.

If there is a dependency, we can split the Spark application into two. Unravel also collects job run status such as duration and I/O, as shown below, so we can easily see the history of all the job runs and monitor the job.

Unravel Instance Summary

In conclusion, we must pay attention to the Contended Driver event in a Spark application, so we can save money and time without leaving the executors idle for a long time.

 

Next steps

The post Tuning Spark applications: Detect and fix common issues with Spark driver appeared first on Unravel.

Spark Troubleshooting Guides https://www.unraveldata.com/resources/spark-troubleshooting-guides/ https://www.unraveldata.com/resources/spark-troubleshooting-guides/#respond Tue, 02 Nov 2021 20:17:13 +0000 https://www.unraveldata.com/?p=7798 Sparkler Blurred Background


Thanks for your interest in the Spark Troubleshooting Guides.

You can download it here.

This 3 part series is your one-stop guide to all things Spark troubleshooting. In Part 1, we describe the ten biggest challenges for troubleshooting Spark jobs across levels. In Part 2, we describe the major categories of tools and types of solutions to solve challenges.

Lastly, Part 3 of the guide builds on the other two to show you how to address the problems we described, and more, with a single tool that does the best of what single-purpose tools do – our DataOps platform, Unravel Data.

The post Spark Troubleshooting Guides appeared first on Unravel.

Twelve Best Cloud & DataOps Articles https://www.unraveldata.com/resources/twelve-best-cloud-dataops-articles/ https://www.unraveldata.com/resources/twelve-best-cloud-dataops-articles/#respond Thu, 28 Oct 2021 19:27:17 +0000 https://www.unraveldata.com/?p=7774 Computer Network Background Abstract


Our resource picks for October!
Prescriptive Insights On Cloud & DataOps Topics

Interested in learning about different technologies and methodologies, such as Databricks, Amazon EMR, cloud computing and DataOps? A good place to start is reading articles that give tips, tricks, and best practices for working with these technologies.

Here are some of our favorite articles from experts on cloud migration, cloud management, Spark, Databricks, Amazon EMR, and DataOps!

Cloud Migration

Cloud-migration Opportunity: Business Value Grows but Missteps Abound
(Source: McKinsey & Company)
Companies aim to embrace the cloud more fully, but many are already failing to reap the sizable rewards. Outperformers have shown what it takes to overcome the costly hurdles and could potentially unlock $1 trillion in value, according to a recent McKinsey article.

4 Major Mistakes That Can Derail a Cloud Migration (Source: MDM)
If your organization is thinking of moving to the cloud, it’s important to know both what to do and what NOT to do. This article details four common missteps that can hinder your journey to the cloud. One such mistake is not having a cloud migration strategy.

Check out the full article on the Modern Distribution Management (MDM) site to learn about other common mistakes, their impacts, and ways to avoid them.

Plan Your Move: Three Tips For Efficient Cloud Migrations (Source: Forbes)
Think about the last time you moved to a new place. Moving is usually exciting, but the logistics can get complicated. The same can be said for moving to the cloud.

Just as a well-planned move is often the smoothest, the same holds true for cloud migrations.

As you’re packing up your data and workloads to transition business services to the cloud, check out this article on Forbes for three best practices for cloud migration planning.

(Bonus resource: Check out our Ten Steps to Cloud Migration post. If your company is considering making the move, these steps will help!)

Cloud Management

How to Improve Cloud Management (Source: DevOps)
The emergence of technologies like AI and IoT as well as the spike in remote work due to the COVID-19 pandemic have accelerated cloud adoption.

With this growth comes a need for a cloud management strategy in order to avoid unnecessarily high costs and security or compliance violations. This DevOps article shares insights on how to build a successful cloud management strategy.

The Challenges of Cloud Data Management (Source: TechTarget)
Cloud spend and the amount of data in the cloud continues to grow at an unprecedented rate. This rapid expansion is causing organizations to also face new cloud management challenges as they try to keep up with cloud data management advancements.

Head over to TechTarget to learn about cloud management challenges, including data governance and adhering to regulatory compliance frameworks.

Spark

Spark: A Data Engineer’s Best Friend (Source: CIO)
Spark is the ultimate tool for data engineers. It simplifies the work environment by providing a platform to organize and execute complex data pipelines and powerful tools for storing, retrieving, and transforming data.

This CIO article describes different things data engineers can do with Spark, touches on what makes Spark unique, and explains why it is so beneficial for data engineers.

Is There Life After Hadoop? The Answer is a Resounding Yes. (Source: CIO)
Many organizations who invested heavily in the Hadoop ecosystem have found themselves wondering what life after Hadoop is like and what lies ahead.

This article addresses life after Hadoop and lays out a strategy for organizations entering the post-Hadoop era, including reasons why you may want to embrace Spark as an alternative. Head to the CIO site for more!

Databricks

5 Ways to Boost Query Performance with Databricks and Spark (Source: Key2 Consulting)
When running Spark jobs on Databricks, do you often find yourself frustrated by slow query times?

Check out this article from Key2 Consulting to discover 5 rules for speeding up query times. The rules include:

  • Cache intermediate big dataframes for repetitive use.
  • Monitor the Spark UI within a cluster where a Spark job is running.

For more information on these rules and to find out the remaining three, check out the full article.

What is a Data Lakehouse and Why Should You Care? (Source: S&P Global)
A data lakehouse is an environment designed to combine the data structure and data management features of a data warehouse with the low-cost storage of a data lake.

Databricks offers a couple data lakehouses, including Delta Lake and Delta Engine. This article from S&P Global, gives a more comprehensive explanation of what a data lakehouse is, its benefits, and what lakehouses you can use on Databricks.

Amazon EMR

What is Amazon EMR? – Amazon Elastic MapReduce Tutorial (Source: ADMET)
Amazon EMR is among the most popular cloud-based big data platforms. It provides a managed framework for running data processing frameworks simply, cost-effectively, and securely.

In this ADMET blog, learn what Amazon Elastic MapReduce is and how it can be used to deal with a variety of issues.

DataOps

3 Steps for Successfully Implementing Industrial DataOps (Source: eWeek)
DataOps has been growing in popularity over the past few years. Today, we see many industrial operations realizing the value of DataOps.

This article explains three steps for successfully implementing industrial DataOps:

1. Make industrial data available
2. Make data useful
3. Make data valuable

Head over to eWeek for a deeper dive into the benefits of implementing industrial DataOps and what these three steps really mean.

Using DataOps To Maximize Value For Your Business (Source: Forbes)
Everybody is talking about artificial intelligence and data, but how do you make it real for your business? That’s where DataOps comes in.

From this Forbes article, learn how DataOps can be used to solve common business challenges, including:

  • A process mismatch between traditional data management and newer techniques such as AI.
  • A lack of collaboration to drive a successful cultural shift and support operational readiness.
  • Unclear approach to measure success across the organization.

In Conclusion

Knowledge is power! We hope our data community enjoys these resources and they provide valuable insights to help you in your current role and beyond.

Be sure to visit our library of resources on DataOps, Cloud Migration, Cloud Management (and more) for best practices, happenings, and expert tips and techniques. If you want to know more about Unravel Data, you can sign up for a free account or contact us.

The post Twelve Best Cloud & DataOps Articles appeared first on Unravel.

Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level https://www.unraveldata.com/resources/troubleshooting-apache-spark/ https://www.unraveldata.com/resources/troubleshooting-apache-spark/#respond Wed, 13 Oct 2021 18:05:18 +0000 https://www.unraveldata.com/?p=7666 Sparkler Abstract Background


Apache Spark is the leading technology for big data processing, on-premises and in the cloud. Spark powers advanced analytics, AI, machine learning, and more. Spark provides a unified infrastructure for all kinds of professionals to work together to achieve outstanding results.

Technologies such as Cloudera’s offerings, Amazon EMR, and Databricks are largely used to run Spark jobs. However, as Spark’s importance grows, so does the importance of Spark reliability – and troubleshooting Spark problems is hard. Information you need for troubleshooting is scattered across multiple, voluminous log files. The right log files can be hard to find, and even harder to understand. There are other tools, each providing part of the picture, leaving it to you to try to assemble the jigsaw puzzle yourself.

Would your organization benefit from rapid troubleshooting for your Spark workloads? If you’re running significant workloads on Spark, then you may be looking for ways to find and fix problems faster and better – and to find new approaches that steadily reduce your problems over time.

This blog post is adapted from the recent webinar, Troubleshooting Apache Spark, part of the Unravel Data “Optimize” Webinar Series. In the webinar, Unravel Data's Global Director of Solutions Engineering, Chris Santiago, runs through common challenges faced when troubleshooting Spark and shows how Unravel Data can help.

Spark: The Good, the Bad & the Ugly

Chris has been at Unravel for almost four years. Throughout that time, when it comes to Spark, he has seen it all – the good, the bad, and the ugly. On one hand, Spark as a community has been growing exponentially. Millions of users are adopting Spark, with no end in sight. Cloud platforms such as Amazon EMR and Databricks are largely used to run Spark jobs.

Spark is here to stay, and use cases are rising. There are countless product innovations powered by Spark, such as Netflix recommendations, targeted ads on Facebook and Instagram, or the “Trending” feature on Twitter. On top of its great power, the barrier to entry for Spark is now lower than ever before. But unfortunately, with the good comes the bad, and the number one common issue is troubleshooting.

Troubleshooting Spark is complex for a multitude of different reasons. First off, there are multiple points of failure. A typical Spark data pipeline could be using orchestration tools, such as Airflow or Oozie, as well as built-in tools, such as Spark UI. You also may be using cluster management technologies, such as Cluster Manager or Ambari.

A failure may not always start on Spark; it could rather be a failure within a network layer on the cloud, for example.

Because Spark uses so many tools, not only does that introduce multiple points of failure, but there is also a lot of correlating information from various sources that you must carry out across these platforms. This requires expertise. You need experience in order to understand not only the basics of Spark, but all the other platforms that can support Spark as well.

Lastly, when running Spark workloads, the priority is often to meet SLAs at all costs. To meet SLAs you may, for example, double your resources, but there will always be a downstream effect. Determining what’s an appropriate action to take in order to make SLAs can be tricky.

Want to experience Unravel for Spark?

Create a free account

The Three Levels of Spark Troubleshooting

There are multiple levels when it comes to troubleshooting Spark. First there is the Job Level, which deals with the inner workings of Spark itself, from executors to drivers to memory allocation to logs. The job level is about determining best practices for using the tools that we have today to make sure that Spark jobs are performing properly. Next is the Pipeline Level. Troubleshooting at the pipeline level is about managing multiple Spark runs and stages to make sure you're getting in front of issues and using different tools to your advantage. Lastly, there is the Cluster Level, which deals with infrastructure. Troubleshooting at the cluster level is about understanding the platform in order to get an end-to-end view of troubleshooting Spark jobs.

Troubleshooting: Job Level

A Spark job refers to actions such as doing work in a workbook or analyzing a Spark SQL query. One tool used on the Spark job level is Spark UI. Spark UI can be described as an interface for understanding Spark at the job level.

Spark UI is useful in giving granular details about the breakdown of tasks, breakdown of stages, the amount of time it takes workers to complete tasks, etc. Spark UI is a powerful dataset that you can use to understand every detail about what happened to a particular Spark job.

Challenges people often face are manual correlation; understanding the overall architecture of Spark; and, more importantly, things such as understanding what logs you need to look into and how one job correlates with other jobs. While Spark UI is the default starting point to determine what is going on with Spark jobs, there is still a lot of interpretation that needs to be done, which takes experience.

Further expanding on Spark job-level challenges, one thing people often face difficulty with is diving into the logs. If you truly want to understand what caused a job to fail, you must get down to the log details. However, looking at logs is the most verbose way of troubleshooting, because every component in Spark produces logs. Therefore, you have to look at a lot of errors and timestamps across multiple servers. Looking at logs gives you the most information to help understand why a job has failed, but sifting through all that information is time-consuming.

It also may be challenging to determine where to start on the job level. Spark was born out of the Hadoop ecosystem, so it has a lot of big data concepts. If you’re not familiar with big data concepts, determining a starting point may be difficult. Understanding the concepts behind Spark takes time, effort, and experience.

Lastly, when it comes to Spark jobs, there are often multiple job runs that make up a data pipeline. Figuring out how one Spark job affects another is tricky and must be done manually. In his experience working with the best engineers at large organizations, Chris finds that even they sometimes decide not to finish troubleshooting, and instead just keep on running and re-running a job until it’s completed or meets the SLA. While troubleshooting is ideal, it is extremely time-consuming. Therefore, having a better tool for troubleshooting on the job level would be helpful.

Troubleshooting: Pipeline Level

In Chris’ experience, most organizations don’t have just one Spark job that does everything, but there are rather multiple stages and jobs that are needed to carry out Spark workloads. To manage all these steps and jobs you’d need an orchestration tool. One popular orchestration tool is Airflow, which allows you to sequence out specific jobs.

Orchestration tools like Airflow are useful in managing pipelines. But while these tools are helpful for creating complex pipelines and mapping where points of failure are, they are lacking when it comes to providing detailed information about why a specific step may have failed. Orchestration tools are more focused on the higher, orchestration level, rather than the Spark job level. Orchestration tools tell you where and when something has failed, so they are useful as a starting point to troubleshoot data pipeline jobs on Spark.

Those who are running Spark on Hadoop, however, often use Oozie. Similarly to Airflow, Oozie gives you a high level view, alerting you when a job has failed, but not providing the type of information needed to answer questions such as “Where is the bottleneck?” or “Why did the job break down?”. To answer these questions, it’s up to the user to manually correlate the information that orchestration tools provide with information from job-level tools, which again requires expertise. For example, you may have to determine which Spark run that you see in Spark UI correlates to a certain step in Airflow. This can be very time consuming and prone to errors.

Troubleshooting: Cluster Level

The cluster level for Spark refers to things such as infrastructure, VMs, allocated resources, and Hadoop. Hadoop's ResourceManager is a great tool for managing applications as they come in. ResourceManager is also useful for determining what the resource usage is, and where a Spark job is in the queue.

However, one shortcoming of ResourceManager is that you don't get historical data. You cannot view the past state of ResourceManager from twelve or twenty-four hours ago, for example. Every time you open ResourceManager, you have a view of how jobs are consuming resources at that specific moment, as shown below.

ResourceManager Example

Another challenge when troubleshooting Spark at the cluster level is that while tools such as Cluster Manager or Ambari give a holistic view of what's going on with the entire estate, you cannot see how cluster-level information, such as CPU consumption, I/O consumption, or network I/O consumption, relates to Spark jobs.

Lastly, and similarly to the challenges faced when troubleshooting on the job and pipeline level, manual correlation is also a problem when it comes to troubleshooting on the cluster level. Manual correlation takes time and effort that a data science team could instead be putting towards product innovations.

But what if there was a tool that takes all these troubleshooting challenges, on the job, pipeline, and cluster level, into consideration? Well, luckily, Unravel Data does just that. Chris next gives examples of how Unravel can be used to mitigate Spark troubleshooting issues, which we’ll go over in the remainder of this blog post.

Want to experience Unravel for Spark?

Create a free account

Demo: Unravel for Spark Troubleshooting

The beauty of Unravel is that it provides a single pane of glass where you can look at logs, the type of information provided by Spark UI, Oozie, and all the other tools mentioned throughout this blog, and data uniquely available through Unravel, all in one view. At this point in the webinar, Chris takes us through a demo to show how Unravel aids in troubleshooting at all Spark levels – job, pipeline, and cluster. For a deeper dive into the demo, view the webinar.

Job Level

At the job level, one area where Unravel can be leveraged is in determining why a job failed. The image below is a Spark run that is monitored by Unravel.

Unravel Dashboard Example Spark

On the left hand side of the dashboard, you can see that Job 3 has failed, indicated by the orange bar. With Unravel, you can click on the failed job and see what errors occurred. On the right side of the dashboard, within the Errors tab, you can see why Job 3 failed, as highlighted in blue. Unravel is showing the Spark logs that give information on the failure.

Pipeline Level

Using Unravel for troubleshooting at the data pipeline level, you can look at a specific workflow, rather than looking at the specifics of one job. The image shows the dashboard when looking at data pipelines.

Unravel Dashboard Example Data Pipelines

On the left, the blue lines represent instance runs. The black dot represents a job that ran for two minutes and eleven seconds. You could use information on run duration to determine if you meet your SLA. If your SLA is under two minutes, for example, the highlighted run would miss the SLA. With Unravel you can also look at changes in I/O, as well as dive deeper into specific jobs to determine why they lasted a certain amount of time. The information in the screenshot gives us insight into why the job mentioned prior ran for two minutes and eleven seconds.

Unravel Analysis Tab Example

The Unravel Analysis tab, shown above, carries out analysis to detect issues and make recommendations on how to mitigate those issues.

Cluster Level

Below is the view of Unravel when troubleshooting at the cluster level, specifically focusing on the same job mentioned previously. The job, which again lasted two minutes and eleven seconds, took place on July 5th at around 7PM. So what happened?

Unravel Troubleshooting Cluster Example

The image above shows information about the data security queue at the time when the job we're interested in was running. The table at the bottom of the dashboard shows the state of the jobs that were running on July 5th at 7PM, allowing you to see which job, if any, was taking up too many resources. In this case, Chris' job, highlighted in yellow, wasn't using a large amount of resources. From there, Chris can conclude that perhaps the issue is instead on the application side and something needs to be fixed within the code. The best way to determine what needs to be fixed is to use the Analysis tab mentioned previously.

Conclusion

There are many ways to troubleshoot Spark, whether it be on the job, pipeline, or cluster level. Unravel can be your one-stop shop to determine what is going on with your Spark jobs and data pipelines, as well as give you proactive intelligence that allows you to quickly troubleshoot your jobs. Unravel can help you meet your SLAs in a resourceful and efficient manner.

We hope you have enjoyed, and learned from, reading this blog post. If you think Unravel Data can help you troubleshoot Spark and would like to know more, you can create a free account or contact Unravel.

The post Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level appeared first on Unravel.

Managing Costs for Spark on Amazon EMR https://www.unraveldata.com/resources/managing-costs-for-spark-on-amazon-emr/ https://www.unraveldata.com/resources/managing-costs-for-spark-on-amazon-emr/#respond Tue, 28 Sep 2021 20:43:11 +0000 https://www.unraveldata.com/?p=8102

The post Managing Costs for Spark on Amazon EMR appeared first on Unravel.

Managing Costs for Spark on Databricks https://www.unraveldata.com/resources/managing-costs-for-spark-on-databricks/ https://www.unraveldata.com/resources/managing-costs-for-spark-on-databricks/#respond Fri, 17 Sep 2021 20:51:08 +0000 https://www.unraveldata.com/?p=8105

The post Managing Costs for Spark on Databricks appeared first on Unravel.

Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/ https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/#respond Thu, 09 Sep 2021 21:53:14 +0000 https://www.unraveldata.com/?p=7220


Note: This guide applies to running Spark jobs on any platform, including Cloudera platforms; cloud vendor-specific platforms – Amazon EMR, Microsoft HDInsight, Microsoft Synapse, Google DataProc; Databricks, which is on all three major public cloud providers; and Apache Spark on Kubernetes, which runs on nearly all platforms, including on-premises and cloud.

Introduction

Spark is known for being extremely difficult to debug. But this is not all Spark's fault. Problems in running a Spark job can be the result of problems with the infrastructure Spark is running on, an inappropriate Spark configuration, issues within Spark itself, the code of the currently running Spark job, other Spark jobs running at the same time – or interactions among these layers. But Spark jobs are very important to the success of the business; when a job crashes, runs slowly, or contributes to a big increase in the bill from your cloud provider, you have no choice but to fix the problem.

Widely used tools generally focus on part of the environment – the Spark job, infrastructure, the network layer, etc. These tools don’t present a holistic view. But that’s just what you need to truly solve problems. (You also need the holistic view when you’re creating the Spark job, and as a check before you start running it, to help you avoid having problems in the first place. But that’s another story.)

In this guide, Part 2 in a series, we'll look at the major tools that people use for Spark troubleshooting, grouped into five types. We'll show what they do well, and where they fall short. In Part 3, the final piece, we'll introduce Unravel Data, which makes solving many of these problems easier.

Life as Spark Developer Diagram

What’s the Problem(s)?

The problems we mentioned in Part 1 of this series have many potential solutions. The methods people usually use to try to solve them often derive from that person’s role on the data team. The person who gets called when a Spark job crashes, such as the job’s developer, is likely to look at the Spark job. The person who is responsible for making sure the cluster is healthy will look at that level. And so on.

In this guide, we highlight five types of solutions that people use – often in various combinations – to solve problems with Spark jobs:

  • Spark UI
  • Spark logs
  • Platform-level tools such as Cloudera Manager, the Amazon EMR UI, Cloudwatch, the Databricks UI, and Ganglia
  • APM tools
  • DataOps platforms such as Unravel Data

As an example of solving problems of this type, let's look at the problem of an application that's running too slowly – a very common Spark problem that may be caused by one or more of the issues listed in the chart. Here, we'll look at how existing tools might be used to try to solve it.

Note: Many of the observations and images in this guide have been drawn from the presentation Beyond Observability: Accelerate Performance on Databricks, by Patrick Mawyer, Systems Engineer at Unravel Data. We recommend this webinar to anyone interested in Spark troubleshooting and Spark performance management, whether on Databricks or on other platforms.

Solving Problems Using Spark UI

Spark UI is the first tool most data team members use when there’s a problem with a Spark job. It shows a snapshot of currently running jobs, the stages jobs are in, storage usage, and more. It does a good job, but is seen as having some faults. It can be hard to use, with a low signal-to-noise ratio and a long learning curve. It doesn’t tell you things like which jobs are taking up more or less of a cluster’s resources, nor deliver critical observations such as CPU, memory, and I/O usage.

In the case of a slow Spark application, Spark UI will show you the current status of that job. You can also use Spark UI for past jobs – if the event logs for those jobs still exist, if the jobs were configured to log events properly, and if the Spark history server (which tends to crash) is available. When this is all working, it can help you find out how long an application took to run in the past – you need to do this kind of investigative work just to determine what "slow" is.
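
If you want Spark UI and the history server to be able to show past jobs at all, event logging has to be switched on up front. Here is a minimal sketch in Scala; the log directory is an illustrative placeholder, and should point at durable storage that your history server is also configured to read:

import org.apache.spark.sql.SparkSession

// Enable event logging so the history server can rebuild the UI for finished jobs.
// "hdfs:///spark-event-logs" is an example path, not a recommendation.
val spark = SparkSession.builder()
  .appName("event-logging-example")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-event-logs")
  .getOrCreate()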

The following screenshot is for a Spark 1.4.1 job with a two-node cluster. It shows a Spark Streaming job that steadily uses more memory over time, which might cause the job to slow down. And the job eventually – over a matter of days – runs out of memory.

Spark Streaming Job Example (Source: Stack Overflow)

To solve this problem, you might do several things. Here's a brief list of possible solutions, and the problems they might cause elsewhere (a configuration sketch follows the list):

  • Increase available memory for each worker. You can increase the value of the spark.executor.memory variable to increase the memory for each worker. This will not necessarily speed the job up, but will defer the eventual crash. However, you are either taking memory away from other jobs in your cluster or, if you’re in the cloud, potentially running up the cost of the job.
  • Increase the storage fraction. You can change the value of spark.storage.memoryFraction, which varies from 0 to 1, to a higher fraction. Since the Java virtual machine (JVM) uses memory for caching RDDs and for shuffle memory, you are increasing caching memory at the expense of shuffle memory. This will cause a different failure if, at some point, the job needs shuffle memory that you allocated away at this step.
  • Increase the parallelism of the job. For a Spark Cassandra Connector job, for example, you can change spark.cassandra.input.split.size to a smaller value. (It’s a different variable for other RDD types.) Increasing parallelism decreases the data set size for each worker, requiring less memory per worker. But more workers means more resources used. In a fixed-resources environment, this takes resources away from other jobs; in a dynamic environment, such as a Databricks job cluster, it directly runs up your bill.
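
As a rough illustration of where the settings mentioned above live, here is a hedged sketch in Scala. The values are purely illustrative, not recommendations, and spark.storage.memoryFraction only applies to the legacy memory manager used by older Spark versions such as the 1.4.1 example above:

import org.apache.spark.SparkConf

// Illustrative values only; each change trades one resource or risk for another,
// as described in the list above.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")               // more memory per worker defers the eventual crash
  .set("spark.storage.memoryFraction", "0.7")       // legacy setting: more cache memory, less shuffle memory
  .set("spark.cassandra.input.split.size", "50000") // smaller splits raise parallelism for the Cassandra connector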

The point here is that everything you might do involves a certain amount of guesswork, because you don't have complete information. You have to use trial-and-error approaches to see what might work – both which approach to try and which variable to change, and how much to change the variable(s) involved. And, whichever approach you choose, you are putting the job in line for other, different problems – including later failure, failure for other reasons, or increased cost. And, when you're done, this specific job may be fine – but at the expense of other jobs that then fail. And those failures will also be hard to troubleshoot.

Spark UI Completed Tasks

Here’s a look at the Stages section of Spark UI. It gives you a list of metrics across executors. However, there’s no overview or big picture view to help guide you in finding problems. And the tool doesn’t make recommendations to help you solve problems, or avoid them in the first place.

Spark UI is limited to Spark, but a Spark job may, for example, have data coming in from Kafka, and run alongside other technologies. Each of those has its own monitoring and management tools, or does without; Spark UI doesn’t work with those tools. It also lacks pro-active alerting, automatic actions, and AI-driven insights, all found in Unravel.

Spark UI is very useful for what it does, but its limitations – and the limitations of the other tool types described here – lead many organizations to build homegrown tools or toolsets, often built on Grafana. These solutions are resource-intensive, hard to extend, hard to support, and hard to keep up-to-date.

A few individuals and organizations even offer their homegrown tools as open source software for other organizations to use. However, support, documentation, and updates are limited to non-existent. Several such tools, such as Sparklint and DrElephant, do not support recent versions of Spark. At this writing, they have not had many, if any, fresh commits in recent months or even years.

Spark Logs

Spark logs are the underlying resource for troubleshooting Spark jobs. As mentioned above, Spark UI can even use Spark logs, if available, to rebuild a view of the Spark environment on an historical basis. You can use the logs related to the job’s driver and executors to retrospectively find out what happened to a given job, and even some information about what was going on with other jobs at the same time.

If you have a slow app, for instance, you can painstakingly assemble a picture of whether the slowness was in one task or another by scanning through multiple log files. But answering why, and finding the root cause, is hard. These logs don't have complete information about resource usage, data partitioning, correct configuration settings, and many other factors that can affect performance. There are also many potential issues that don't show up in Spark logs, such as "noisy neighbor" or networking issues that sporadically reduce resource availability within your Spark environment.

Spark Driver Logs Example

Spark logs are a tremendous resource, and are always a go-to for solving problems with Spark jobs. However, if you depend on logs as a major component of your troubleshooting toolkit, several problems come up, including:

  • Access and governance difficulties. In highly secure environments, it can take time to get permission to access logs, or you may need to ask someone with the proper permissions to access the file for you. In some highly regulated companies, such as financial institutions, it can take hours per log to get access.
  • Multiple files. You may need to look at the logs for a driver and several executors, for instance, to solve job-level problems. And your brain is the comparison and integration engine that pulls the information together, makes sense of it, and develops insights into possible causes and solutions.
  • Voluminous files. The file for one component of a job can be very large, and all the files for all the components of a job can be huge – especially for long-running jobs. Again, you are the one who has to find and retain each part of the information needed, develop a complete picture of the problem, and try different solutions until one works.
  • Missing files. Governance rules and data storage considerations take files away, as files are moved to archival media or simply lost to deletion. More than one large organization deletes files every 90 days, which makes quarterly summaries very difficult. Comparisons to, say, the previous year’s holiday season or tax season become impossible.
  • Only Spark-specific information. Spark logs are just that – logs from Spark. They don’t include much information about the infrastructure available, resource allocation, configuration specifics, etc. Yet this information is vital to solving a great many of the problems that hamper Spark jobs.

Because Spark logs don't cover infrastructure and related information, it's up to the operations person to find as much information as they can on those other important areas, then try to integrate it all and determine the source of the problem. (Which may be the result of a complex interaction among different factors, with multiple changes needed to fix it.)

Platform-Level Solutions

There are platform-level solutions that work on a given Spark platform, such as Cloudera Manager, the Amazon EMR UI, and Databricks UI. In general, these interfaces allow you to work at the cluster level. They tell you information about resource usage and the status of various services.

If you have a slow app, for example, these tools will give you part of the picture you need to put together to determine the actual problem, or problems. But these tools do not have the detail-level information found in the tools above, nor do they have all the environmental information you need. So again, it's up to you to decide how much time to spend researching, pulling all the information together, and trying to determine a solution. A quick fix might take a few hours; a comprehensive, long-term solution may take days of research and experimentation.

This screenshot shows Databricks UI. It gives you a solid overview of jobs and shows you status, cluster type, and so on. Like other platform-level solutions, it doesn’t help you much with historical runs, nor in working at the pipeline level, across the multiple jobs that make up the pipeline.

Databricks UI Clusters Example

Another monitoring tool for Spark, which is included as open source within Databricks, is called Ganglia. It's largely complementary to Databricks UI, though it also works at the cluster and, in particular, at the node level. You can see host-level metrics such as CPU consumption, memory consumption, disk usage, and network I/O – all host-level factors that can affect the stability and performance of your job.

This can allow you to see if your nodes are configured appropriately, to institute manual scaling or auto-scaling, or to change instance types. (Though someone trying to fix a specific job is not inclined to take on issues that affect other jobs, other users, resource availability, and cost.) Ganglia does not have job-specific insights, nor work with pipelines. And there are no good output options; you might be reduced to taking a screen snapshot to get a JPEG or PNG image of the current status.

Ganglia Cluster Overview

Support from the open-source community is starting to shift toward more modern observability platforms like Prometheus, which works well with Kubernetes. And cloud providers offer their own solutions – AWS Cloudwatch, for example, and Azure Log Monitoring and Analytics. These tools are all oriented toward web applications; they lack the modern data stack application and pipeline information that is essential to understanding what's happening to your job and how your job is affecting things on the cluster or workspace.

AWS Cloudwatch Overview

Platform-level solutions can be useful for solving the root causes of problems such as out-of-memory errors. However, they don’t go down to the job level, leaving that to resources such as Spark logs and tools such as Spark UI. Therefore, to solve a problem, you are often using platform-level solutions in combination with job-level tools – and again, it’s your brain that has to do the comparisons and data integration needed to get a complete picture and solve the problem.

Like job-level tools, these solutions are not comprehensive, nor integrated. They offer snapshots, but not history, and they don’t make proactive recommendations. And, to solve a problem on Databricks, for example, you may be using Spark logs, Spark UI, Databricks UI, and Ganglia, along with Cloudwatch on AWS, or Azure Log Monitoring and Analytics. None of these tools integrate with the others.

APM Tools

There is a wide range of monitoring tools, generally known as Application Performance Management (APM) tools. Many organizations have adopted one or more tools from this category, though they can be expensive, and provide very limited metrics on Spark and other modern data technologies. Leading tools in this category include Datadog, Dynatrace, and Cisco AppDynamics.

For a slow app, for instance, an APM tool might tell you if the system as a whole is busy, slowing your app, or if there were networking issues, slowing down all the jobs. While helpful, these tools are oriented toward monitoring and observability for web applications and middleware, not data-intensive operations such as Spark jobs. They tend to lack information about pipelines, specific jobs, data usage, configuration settings, and much more, as they are not designed to deal with the complexity of modern data applications.

Correlation is the Issue

To sum up, there are several types of existing tools:

  • DIY with Spark logs. Spark keeps a variety of types of logs, and you can parse them, in a do-it-yourself (DIY) fashion, to help solve problems. But this lacks critical infrastructure, container, and other metrics.
  • Open source tools. Spark UI comes with Spark itself, and there are other Spark tools from the open source community. But these lack infrastructure, configuration and other information. They also do not help connect together a full pipeline view, as you need for Spark – and even more so if you are using technologies such as Kafka to bring data in.
  • Platform-specific tools. The platforms that Spark runs on – notably Cloudera platforms, Amazon EMR, and Databricks – each have platform-specific tools that help with Spark troubleshooting. But these lack application-level information and are best used for troubleshooting platform services.
  • Application performance monitoring (APM) tools. APM tools monitor the interactions of applications with their environment, and can help with troubleshooting and even with preventing problems. But the applications these APM tools are built for are technologies such as .NET, Java, and Ruby, not technologies that work with data-intensive applications such as Spark.
  • DataOps platforms. DataOps – applying Agile principles to both writing and running Spark, and other big data jobs – is catching on, and new platforms are emerging to embody these principles.

Each of these tools takes in and processes different, but overlapping, data sets. No one tool provides full visibility, and even if you use one or more tools of each type, full visibility is still lacking.

You need expertise in each tool to get the most out of that tool. But the most important work takes place in the expert user’s head: spotting a clue in one tool, which sends you looking at specific log entries and firing up other tools, to come up with a hypothesis as to the problem. You then have to try out the potential solution, often through several iterations of trial and error, before arriving at a “good enough” answer to the problem.

Or, you might pursue two tried and trusted, but ineffective, “solutions”: ratcheting up resources and retrying the job until it works, either due to the increased resources or by luck; or simply giving up, which our customers tell us they often had to do before they started using Unravel Data.

The situation is much worse in the kind of hybrid data clouds that organizations use today. To troubleshoot on each platform, you need expertise in the toolset for that platform, and all the others, since jobs often have cross-platform interdependencies and the same team has to support multiple platforms. In addition, when you find a solution for a problem on one platform, you should apply what you've learned on all platforms, taking into account their differences. You also face issues that are inherently multi-platform, such as moving jobs from one platform to another that is better, faster, or cheaper for a given job. Taking on all this with the current, fragmented, and incomplete toolsets available is a mind-numbing prospect.

The biggest need is for a platform that integrates the capabilities from several existing tools, performing a five-step process:

  1. Ingest all of the data used by the tools above, plus additional, application-specific and pipeline data.
  2. Integrate all of this data into an internal model of the current environment, including pipelines.
  3. Provide live access to the model.
  4. Continually store model data in an internally maintained history.
  5. Correlate information across the ingested data sets, the current, “live” model, and the stored historical background, to derive insights and make recommendations to the user.

This tool must also let the user put "triggers" onto current processes that fire either alerts or automatic actions. In essence, the tool's inbuilt intelligence and the user are then working together to make the right things happen at the right time.

A simple example of how such a platform can help is by keeping information per pipeline, not just per job – then spotting, and automatically letting you know, when the pipeline suddenly starts running slower than it had previously. The platform will also make recommendations as to how you can solve the problem. All this lets you take any needed action before the job is delayed.

The post Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/feed/ 0
Managing Cost & Resources Usage for Spark https://www.unraveldata.com/resources/managing-cost-resources-usage-for-spark/ https://www.unraveldata.com/resources/managing-cost-resources-usage-for-spark/#respond Wed, 08 Sep 2021 20:55:29 +0000 https://www.unraveldata.com/?p=8107

The post Managing Cost & Resources Usage for Spark appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/managing-cost-resources-usage-for-spark/feed/ 0
Accelerate Amazon EMR for Spark & More https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/ https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/#respond Mon, 21 Jun 2021 21:54:59 +0000 https://www.unraveldata.com/?p=8119

The post Accelerate Amazon EMR for Spark & More appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/feed/ 0
Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data https://www.unraveldata.com/resources/jeeves-grows-up-how-an-ai-chatbot-became-part-of-unravel-data/ https://www.unraveldata.com/resources/jeeves-grows-up-how-an-ai-chatbot-became-part-of-unravel-data/#respond Mon, 31 May 2021 04:20:30 +0000 https://www.unraveldata.com/?p=6969

Jeeves is the stereotypical English butler – and an AI chatbot that answers pertinent and important questions about Spark jobs in production. Shivnath Babu, CTO and co-founder of Unravel Data, spoke yesterday at Data + AI […]

The post Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data appeared first on Unravel.

]]>

Jeeves is the stereotypical English butler – and an AI chatbot that answers pertinent and important questions about Spark jobs in production. Shivnath Babu, CTO and co-founder of Unravel Data, spoke yesterday at Data + AI Summit, formerly known as Spark Summit, about the evolution of Jeeves, and how the technology has become a key supporting pillar within Unravel Data’s software. 

Unravel is a leading platform for DataOps, bringing together a raft of seemingly disparate information to make it much easier to view, monitor, and manage pipelines. With Unravel, individual jobs and their pipelines become visible. But also, the interactions between jobs and pipelines become visible too. 

It's often these interactions, which are ephemeral and very hard to track through traditional monitoring solutions, that cause jobs to slow down or fail. Unravel makes them visible and actionable. On top of this, AI and machine learning help the software make proactive suggestions about improvements, and even head off trouble before it happens.

Both performance improvements and cost management become far easier with Unravel, even for DataOps personnel who don’t know all of the underlying technologies used by a given pipeline in detail. 

Jeeves to the Rescue

An app failure in Spark may be difficult to even discover – let alone to trace, troubleshoot, repair, and retry. If the failure is due to interactions among multiple apps, a whole additional dimension of trouble arises. As data volumes and pipeline criticality rocket upward, no one in a busy IT department has time to dig around for the causes of problems and search for possible solutions. 

But – Jeeves to the rescue! Jeeves acts as a chatbot for finding, understanding, fixing, and improving Spark jobs, and the configuration settings that define where, when, and how they run. The Jeeves demo in Shivnath’s talk shows how Jeeves comes up with the errant Spark job (by ID number), describes what happened, and recommends the needed adjustments – configuration changes, in this case – to fix the problem going forward. Jeeves can even resubmit the repaired job for you. 

See how Unravel simplifies troubleshooting Spark

Create a free account

Wait, One More Thing…

But there’s more – much more. The technology underlying Jeeves has now been built into Unravel Data, with stellar results. 

Modern data pipelines are ever more densely populated with tools and processes. In his talk, Shivnath shows us several views of the modern data landscape. His simplified diagram shows five silos, and 14 different top-level processes, between today's data sources and a plethora of data consumers, both human and machine.

But Shivnath shines a bright light into this Aladdin’s cave of – well, either treasures or disasters, depending on your point of view, and whether everything is working or not. He describes each of the major processes that take place within the architecture, and highlights the plethora of technologies and products that are used to complete each process. 

He sums it all up by showing the role of data pipelines in carrying out mission-critical tasks, how the data stack for all of this continues to get more complex, and how DataOps as a practice has emerged to try and get a handle on all of it. 

This is where we move from Jeeves, a sophisticated bot, to Unravel, which incorporates the Jeeves functionality – and much more. Shivnath describes Unravel’s Pipeline Observer, which interacts with a large and growing range of pipeline technologies to monitor, manage, and recommend (through an AI and machine learning-powered engine) how to fix, improve, and optimize pipeline and workload functionality and reliability. 

In an Unravel demo, Shivnath shows how to improve a pipeline that’s in danger of:

  • Breaking due to data quality problems
  • Missing its performance SLA
  • Cost overruns – check your latest cloud bill for examples of this one 

If you’re in DataOps, you’ve undoubtedly experienced the pain of pipeline breaks, and that uneasy feeling of SLA misses, all reflected in your messaging apps, email, and performance reviews – not to mention the dreaded cost overruns, which don’t show up until you look at your cloud provider bills.

Shivnath concludes by offering a chance for you to create a free account; to contact the company for more information; or to reach out to Shivnath personally, especially if your career is headed in the direction of helping solve these and related problems. To get the full benefit of Shivnath’s perspective, dig into the context, and understand what’s happening in depth, please watch the presentation.

The post Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/jeeves-grows-up-how-an-ai-chatbot-became-part-of-unravel-data/feed/ 0
How To Get the Best Performance & Reliability Out of Kafka & Spark Applications https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/ https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/#respond Thu, 11 Mar 2021 22:10:06 +0000 https://www.unraveldata.com/?p=8136

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/feed/ 0
Going Beyond Observability for Spark Applications & Databricks Environments https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/ https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/#respond Tue, 23 Feb 2021 22:10:57 +0000 https://www.unraveldata.com/?p=8140

The post Going Beyond Observability for Spark Applications & Databricks Environments appeared first on Unravel.

]]>

https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/feed/ 0
“More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/ https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/#respond Fri, 18 Dec 2020 21:42:41 +0000 https://www.unraveldata.com/?p=5694

  Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…” Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. […]

The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

]]>

 

Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…”

Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. And these #datalovers generated some valuable data – including the interesting fact that more than 60% of surveyed customers have SLAs for either “more than 50% of their pipelines” (42%) or “all of their pipelines” (21%). 

All of this ties together. Unravel makes it much easier for customers to set attainable SLAs for their pipelines, and to meet those SLAs once they’re set. Let’s look at some more data-driven findings from the conference. 

And, if you’re an Unravel Data customer, reach out to access much more information, including webinar-type recordings of all five customer and industry expert presentations, with polling and results presentations interspersed throughout. If you are not yet a customer, but want to know more, you can create a free account or contact us

Unravel Data CEO Kunal Agarwal kicking off Untold. 

Note: “The plural of anecdotes is not data,” the saying goes – so, nostra culpa. The findings discussed herein are polling results from our customers who attended Untold, and they fall short of statistical significance. But they do represent the opinions of some of the most intense #datalovers in leading corporations worldwide – Unravel Data’s customer base. (“The great ones skate to where the puck’s going to be,” after Wayne Gretzky…) 

More Than 60% of Customer Pipelines Have SLAs

Using Unravel, more than 60% of the pipelines managed by our customers have SLAs:

  • More than 20% have SLAs for all their pipelines. 
  • More than 40% have SLAs for more than half of their pipelines. 
  • Fewer than 40% have SLAs for fewer than half their pipelines (29%) or fewer than a quarter of them (8%). 

Pipelines were, understandably, a huge topic of conversation at Untold. Complex pipelines, and the need for better tools to manage them, are very common amongst our customers. And Unravel makes it possible for our customers to set, and meet, SLAs for their pipelines. 

What percentage of your data pipelines have SLAs?

  • <25% – 8.3%
  • >25-50% – 29.2%
  • >50% – 41.7%
  • All of them – 20.8%

 

Bad Applications Cause Cost Overruns

We asked our attendees the biggest reason for cost overruns:

  • Roughly three-quarters replied, "Bad applications taking too many resources." Finding out which applications are consuming all the resources is a key feature of Unravel software. 
  • Nearly a quarter replied, “Oversized containers.” Now, not everyone is using containers yet, so we are left to wonder just how many container users are unnecessarily supersizing their containers. Most of them?
  • And the remaining answer was, “small files.” One of the strongest presentations at the Untold conference was about the tendency of bad apps to generate a myriad of small files that consume a disproportionate share of resources; you can use Unravel to generate a small files report and track these problems down. 

What is usually the biggest reason for cost overruns?

  • Bad applications taking too many resources – 75.0%
  • Oversized containers – 20.0%
  • Small files – 5.0%

 

Two-Thirds Know Their Most Expensive App/User

Amongst Unravel customers, almost two-thirds can identify their most expensive app(s) and user(s). Identifying costly apps and users is one of the strengths of Unravel Data:

  • On-premises, expensive apps and users consume far more than their share of system resources, slowing other jobs and contributing to instability and crashes. 
  • In moving to cloud, knowing who’s costing you in your on-premises estate – and whether the business results are worth the expense – is crucial to informed decision-making. 
  • In the cloud, “pay as you go” means that, as one of our presenters described it, “When you go, you pay.” It’s very easy for 20% of your apps and users to generate 80% of your cloud bill, and it’s very unlikely for that inflated bill to also represent 80% of your IT value. 

Do you know who is the most expensive user / app on your system?

  • Yes – 65.0%
  • No – 25.0%
  • No, but would be great to know – 10.0%

 

An Ounce of Prevention is Worth a Pound of Cure

Knowing whether you have rogue users and/or apps on your system is very valuable:

  • A plurality (43%) of Unravel customers have rogue users/apps on their cluster “all the time.” 
  • A minority (19%) see this about once a day. 
  • A near-plurality (38%) only see this behavior once a week or so. 

We would bet good money that non-Unravel customers see a lot more rogue behavior than our customers do. With Unravel, you can know exactly who and what is “going rogue” – and you can help stakeholders get the same results with far less use of system resources and cost. This cuts down on rogue behavior, freeing up room for existing jobs, and for new applications to run with hitherto unattainable performance. 

How often do you have rogue users/apps on your cluster?

  • All the time! – 42.9%
  • Once a day – 19.0%
  • Once a week – 38.1%

 

Unravel Customers Are Fending Off Bad Apps

Once you find and improve the bad apps that are currently in production, the logical next step is to prevent bad apps from even reaching production in the first place. Unravel customers are meeting this challenge:

  • More than 90% of attendees find automation helpful in preventing bad quality apps from being promoted into production. 
  • More than 80% have a quality gate when promoting apps from Dev to QA, and from QA to production. 
  • More than half have a well-defined DataOps/SDLC (software development life cycle) process, and nearly a third have a partially-defined process. Only about one-eighth have neither. 
  • About one-quarter have operations people/sysadmins troubleshooting their data pipelines; another quarter put the responsibility onto the developers or data engineers who create the apps. Nearly half make a game-time decision, depending on the type of issue, or use an “all hands on deck” approach with everyone helping. 

The Rest of the Story

  • More than two-thirds are finding their data costs to be running over budget. 
  • More than half are in the process of migrating to cloud – though only a quarter have gotten to the point of actually moving apps and data, then optimizing and scaling the result. 
  • Half find that automating problem identification, root cause analysis, and resolution saves them 1-5 hours per issue; the other half save from 6-10 hours or more. 
  • Somewhat more than half find their clusters to be on the over-provisioned side. 
  • Nearly half have ten or more technologies in representative data pipelines. 

Finding Out More

Unravel Data customers – #datalovers all – look to be well ahead of the industry in managing issues that relate to big data, streaming data, and moving to the cloud. If you’re interested in joining this select group, you can create a free account or contact us. (There may still be some Untold conference swag left for new customers!) 

The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/feed/ 0
Spark APM – What is Spark Application Performance Management https://www.unraveldata.com/resources/what-is-spark-apm/ https://www.unraveldata.com/resources/what-is-spark-apm/#respond Thu, 17 Dec 2020 21:01:51 +0000 https://www.unraveldata.com/?p=5675

What is Spark? Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API […]

The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

]]>

What is Spark?

Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API for developers.

Designed specifically for data science, Spark has evolved to support more use cases, including real-time stream event processing. Spark is also widely used in AI and machine learning applications.

There are six main reasons to use Spark.

1. Speed – Spark has an advanced engine that can deliver up to 100 times faster processing than traditional MapReduce jobs on Hadoop.

2. Ease of use – Spark supports Python, Java, R, and Scala natively, and offers dozens of high-level operators that make it easy to build applications.

3. General purpose – Spark offers the ability to combine all of its components to make a robust and comprehensive data product.

4. Platform agnostic – Spark runs in nearly any environment.

5. Open source – Spark and its component and complementary technologies are free and open source; users can change the code as needed.

6. Widely used – Spark is the industry standard for many tasks, so expertise and help are widely available.

A Brief History of Spark

Today, when it comes to parallel big data analytics, Spark is the dominant framework that developers turn to for their data applications. But what came before Spark?

In 2003, several developers, mostly based at Yahoo!, started working on an open, distributed computing platform. A few years later, these developers released their work as an open source project called Hadoop.

This is also approximately the same time that Google published MapReduce, a programming model it used to process its huge volumes of data; Hadoop provided an open source implementation of the same idea in Java. While Hadoop grew in popularity for its ability to store massive volumes of data, developers at Facebook wanted to provide their data science team with an easier way to work with their data in Hadoop. As a result, they created Hive, a data warehousing framework based on Hadoop.

Even though Hadoop was gaining wide adoption at this point, there really weren't any good interfaces for analysts and data scientists to use. So, in 2009, a group of researchers at the University of California, Berkeley's AMPLab started a new project to solve this problem. Thus Spark was born – and was released as open source a year later.

What is Spark APM?

Spark enables rapid innovation and high performance in your applications. But as your applications grow in complexity, inefficiencies are bound to be introduced. These inefficiencies add up to significant performance losses and increased processing costs.

For example, a Spark cluster may have idle time between batches because of slow data writes on the server. Batches sit idle because the next batch can't start until all of the previous batch's tasks have completed. Your Spark jobs are "bottlenecked on writes."

When this happens, you can’t scale your application horizontally – adding more servers to help with processing won’t improve your application performance. Instead, you’d just be increasing the idle time of your clusters.

Unravel Data for Spark APM

This is where Unravel Data comes in to save the day. Unravel Data for Spark provides a comprehensive full-stack, intelligent, and automated approach to Spark operations and application performance management across your modern data stack architecture.

The Unravel platform helps you analyze, optimize, and troubleshoot Spark applications and pipelines in a seamless, intuitive user experience. Operations personnel, who have to manage a wide range of technologies, don’t need to learn Spark in great depth in order to significantly improve the performance and reliability of Spark applications.

See Unravel for Spark in Action

Create a free account

Example of Spark APM with Unravel Data

A Spark application consists of one or more jobs, each of which in turn has one or more stages. A job corresponds to a Spark action – for example, count, take, foreach, etc. Within the Unravel platform, you can view all of the details of your Spark application.
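
As a small, self-contained sketch of that breakdown (illustrative code, not taken from the Unravel product): each action below triggers its own Spark job, and wide operations such as groupBy add shuffle boundaries that split a job into more stages.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jobs-and-stages").master("local[*]").getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("n")

// Action 1: count() triggers a job of its own.
val total = df.count()

// Action 2: collect() triggers a second job; the groupBy introduces a shuffle,
// so this job runs in more than one stage.
val buckets = df.filter($"n" % 2 === 0).groupBy($"n" % 10).count().collect()

spark.stop()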

Unravel’s Spark APM lets you:

  • Quickly see which jobs and stages consume the most resources
  • View your app as a resilient distributed dataset (RDD) execution graph
  • Drill into the source code from the stage tile, Spark stream batch tile, or the execution graph to locate the problems

Unravel Data for Spark APM can then help you:

  • Resolve inefficiencies, bottlenecks, and reasons for failure within apps
  • Optimize resource allocation for Spark drivers and executors
  • Detect and fix poor partitioning
  • Detect and fix inefficient and failed Spark apps
  • Use recommended settings to tune for speed and efficiency

Improve Your Spark Application Performance

To learn how Unravel can help improve the performance of your applications, create a free account and take it for a test drive on your Spark applications. Or, contact Unravel.

The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/what-is-spark-apm/feed/ 0
Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/ https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/#respond Sat, 12 Dec 2020 03:44:11 +0000 https://www.unraveldata.com/?p=5632 Logo for Apache Spark 3.0

The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often […]

The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

]]>
Logo for Apache Spark 3.0

The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often treat Catalyst as a black box that just magically works. Moreover, only a handful of resources are available that explain its inner workings in an accessible manner.

When discussing Catalyst, many presentations and articles reference this architecture diagram, which was included in the original blog post from Databricks that introduced Catalyst to a wider audience. However, the diagram depicts the physical planning phase inadequately, and it is ambiguous about what kinds of optimizations are applied at which point.

The following sections explain Catalyst's logic, the optimizations it performs, its most crucial constructs, and the plans it generates. In particular, the scope of cost-based optimizations is outlined. These were advertised as "state-of-the-art" when the framework was introduced, but, as this article will describe, they have significant limitations. And, last but not least, we have redrawn the architecture diagram:

The Spark Catalyst Pipeline

This diagram and the description below focus on the second half of the optimization pipeline and do not cover the parsing, analyzing, and caching phases in much detail.

Improved pipeline diagram for Spark Catalyst optimizer

Diagram 1: The Catalyst pipeline 

The input to Catalyst is a SQL/HiveQL query or a DataFrame/Dataset object which invokes an action to initiate the computation. This starting point is shown in the top left corner of the diagram. The relational query is converted into a high-level plan, against which many transformations are applied. The plan becomes better-optimized and is filled with physical execution information as it passes through the pipeline. The output consists of an execution plan that forms the basis for the RDD graph that is then scheduled for execution.

Nodes and Trees

Six different plan types are represented in the Catalyst pipeline diagram. They are implemented as Scala trees that descend from TreeNode. This abstract top-level class declares a children field of type Seq[BaseType]. A TreeNode can therefore have zero or more child nodes that are also TreeNodes in turn. In addition, multiple higher-order functions, such as transformDown, are defined, which are heavily used by Catalyst’s transformations.

This functional design allows optimizations to be expressed in an elegant and type-safe fashion; an example is provided below. Many Catalyst constructs inherit from TreeNode or operate on its descendants. Two important abstract classes that were inspired by the logical and physical plans found in databases are LogicalPlan and SparkPlan. Both are of type QueryPlan, which is a direct subclass of TreeNode.

In the Catalyst pipeline diagram, the first four plans from the top are LogicalPlans, while the bottom two – Spark Plan and Selected Physical Plan – are SparkPlans. The nodes of logical plans are algebraic operators like Join; the nodes of physical plans (i.e. SparkPlans) correspond to low-level operators like ShuffledHashJoinExec or SortMergeJoinExec. The leaf nodes read data from sources such as files on stable storage or in-memory lists. The tree root represents the computation result.

Transformations

An example of Catalyst’s trees and transformation patterns is provided below. We have used expressions to avoid verbosity in the code. Catalyst expressions derive new values and are also trees. They can be connected to logical and physical nodes; an example would be the condition of a filter operator.

The following snippet transforms the arithmetic expression –((11 – 2) * (9 – 5)) into – ((1 + 0) + (1 + 5)):


import org.apache.spark.sql.catalyst.expressions._
 val firstExpr: Expression = UnaryMinus(Multiply(Subtract(Literal(11), Literal(2)), Subtract(Literal(9), Literal(5))))
 val transformed: Expression = firstExpr transformDown {
   case BinaryOperator(l, r) => Add(l, r)
   case IntegerLiteral(i) if i > 5 => Literal(1)
   case IntegerLiteral(i) if i < 5 => Literal(0)
 }

The root node of the Catalyst tree referenced by firstExpr has the type UnaryMinus and points to a single child, Multiply. This node has two children, both of type Subtract.

The first Subtract node has two Literal child nodes that wrap the integer values 11 and 2, respectively, and are leaf nodes. firstExpr is refactored by calling the predefined higher-order function transformDown with custom transformation logic: The curly braces enclose a function with three pattern matches. They convert all binary operators, such as multiplication, into addition; they also map all numbers that are greater than 5 to 1, and those that are smaller than 5 to zero.

Notably, the rule in the example gets successfully applied to firstExpr without covering all of its nodes: UnaryMinus is not a binary operator (having a single child) and 5 is not accounted for by the last two pattern matches. The parameter type of transformDown is responsible for this positive behavior: It expects a partial function that might only cover subtrees (or not match any node) and returns “a copy of this node where `rule` has been recursively applied to it and all of its children (pre-order)”.

This example appears simple, but the functional techniques are powerful. At the other end of the complexity scale, for example, is a logical rule that, when it fires, applies a dynamic programming algorithm to refactor a join.

Catalyst Plans

The following (intentionally bad) code snippet will be the basis for describing the Catalyst plans in the next sections:


val result = session.read.option("header", "true").csv(outputLoc)
   .select("letter", "index")
   .filter($"index" < 10) 
   .filter(!$"letter".isin("h") && $"index" > 6)
result.show()

 

The complete program can be found here. Some plans that Catalyst generates when evaluating this snippet are presented in a textual format below; they should be read from the bottom up. A trick needs to be applied when using pythonic DataFrames, as their plans are hidden; this is described in our upcoming follow-up article, which also features a general interpretation guide for these plans.
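
For readers who want to reproduce the plans below, here is a minimal sketch using standard Spark APIs on the result DataFrame from the snippet above:

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
result.explain(true)

// The individual plans can also be accessed programmatically:
val qe = result.queryExecution
println(qe.analyzed)      // Analyzed Logical Plan
println(qe.optimizedPlan) // Optimized Logical Plan
println(qe.executedPlan)  // Physical/Executed Plan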

Parsing and Analyzing

The user query is first transformed into a parse tree called Unresolved or Parsed Logical Plan:


'Filter (NOT 'letter IN (h) && ('index > 6))
+- Filter (cast(index#28 as int) < 10)
   +- Project [letter#26, index#28]
             +- Relation[letter#26,letterUppercase#27,index#28] csv

 

This initial plan is then analyzed, which involves tasks such as checking the syntactic correctness, or resolving attribute references like names for columns and tables, which results in an Analyzed Logical Plan.

In the next step, a cache manager is consulted. If a previous query was persisted, and if it matches segments in the current plan, then the respective segment is replaced with the cached result. Nothing is persisted in our example, so the Analyzed Logical Plan will not be affected by the cache lookup.

A textual representation of both plans is included below:


Filter (NOT letter#26 IN (h) && (cast(index#28 as int) > 6))
+- Filter (cast(index#28 as int) < 10)
   +- Project [letter#26, index#28]
            +- Relation[letter#26,letterUppercase#27,index#28] csv

 

Up to this point, the operator order of the logical plan reflects the transformation sequence in the source program, if the former is interpreted from the bottom up and the latter is read, as usual, from the top down. A read, corresponding to Relation, is followed by a select (mapped to Project) and two filters. The next phase can change the topology of the logical plan.

Logical Optimizations

Catalyst applies logical optimization rules to the Analyzed Logical Plan with cache data. They are part of rule groups which are internally called batches. There are 25 batches with 109 rules in total in the Optimizer (Spark 2.4.7). Some rules are present in more than one batch, so the number of distinct logical rules boils down to 69. Most batches are executed once, but some can run repeatedly until the plan converges or a predefined maximum number of iterations is reached. Our program CollectBatches can be used to retrieve and print this information, along with a list of all rule batches and individual rules; the output can be found here.

Major rule categories are operator pushdowns/combinations/replacements and constant foldings. Twelve logical rules will fire in total when Catalyst evaluates our example. Among them are rules from each of the four aforementioned categories:

  • Two applications of PushDownPredicate that move the two filters before the column selection
  • A CombineFilters that fuses the two adjacent filters into one
  • An OptimizeIn that replaces the lookup in a single-element list with an equality comparison
  • A RewritePredicateSubquery which rearranges the elements of the conjunctive filter predicate that CombineFilters created

These rules reshape the logical plan shown above into the following Optimized Logical Plan:


Project [letter#26, index#28], Statistics(sizeInBytes=162.0 B)
 +- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6)), Statistics(sizeInBytes=230.0 B)
    +- Relation[letter#26,letterUppercase#27,index#28] csv, Statistics(sizeInBytes=230.0 B)

 

Quantitative Optimizations

Spark’s codebase contains a dedicated Statistics class that can hold estimates of various quantities per logical tree node. They include:

  • Size of the data that flows through a node
  • Number of rows in the table
  • Several column-level statistics:
    • Number of distinct values and nulls
    • Minimum and maximum value
    • Average and maximum length of the values
    • An equi-height histogram of the values

Estimates for these quantities are eventually propagated and adjusted throughout the logical plan tree. A filter or aggregate, for example, can significantly reduce the number of records; the planning of a downstream join might benefit from using this modified size, rather than the original input size, of the leaf node.

Two estimation approaches exist:

  • The simpler, size-only approach is primarily concerned with the first bullet point, the physical size in bytes. In addition, row count values may be set in a few cases.
  • The cost-based approach can compute the column-level dimensions for Aggregate, Filter, Join, and Project nodes, and may improve their size and row count values.

For other node types, the cost-based technique just delegates to the size-only methods. The approach chosen depends on the spark.sql.cbo.enabled property. If you intend to traverse the logical tree with a cost-based estimator, then set this property to true. In addition, the table and column statistics should be collected for the advanced dimensions prior to the query execution in Spark. This can be achieved with the ANALYZE command. However, a severe limitation seems to exist for this collection process, according to the latest documentation: "currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run".
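
As a hedged sketch of how this could look in practice – the table and column names below are placeholders, and session is the SparkSession from the example program – the CBO can be switched on per session and statistics collected ahead of time:

// Enable cost-based optimization for this session.
session.conf.set("spark.sql.cbo.enabled", "true")

// Collect table-level and column-level statistics before running the query.
session.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
session.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

// Inspect the collected column statistics.
session.sql("DESCRIBE EXTENDED sales customer_id").show(truncate = false)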

The estimated statistics can be used by two logical rules, namely CostBasedJoinReorder and ReorderJoin (via StarSchemaDetection), and by the JoinSelection strategy in the subsequent phase, which is described further below.

The Cost-Based Optimizer

A fully fledged Cost-Based Optimizer (CBO) seems to be a work in progress, as indicated here: “For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs”. The CBO facilitates the CostBasedJoinReorder rule and may improve the quality of estimates; both can lead to better planning of joins. Concerning column-based statistics, only the two count stats (distinctCount and nullCount) appear to participate in the optimizations; the other advanced stats may improve the estimations of the quantities that are directly used.

The scope of quantitative optimizations does not seem to have significantly expanded with the introduction of the CBO in Spark 2.2. It is not a separate phase in the Catalyst pipeline, but improves the join logic in several important places. This is reflected in the Catalyst optimizer diagram by the smaller green rectangles, since the stats-based optimizations are outnumbered by the rule-based/heuristic ones.

The CBO will not affect the Catalyst plans in our example for three reasons:

  • The property spark.sql.cbo.enabled is not modified in the source code and defaults to false.
  • The input consists of raw CSV files for which no table/column statistics were collected.
  • The program operates on a single dataset without performing any joins.

The textual representation of the optimized logical plan shown above includes three Statistics segments which only hold values for sizeInBytes. This further indicates that size-only estimations were used exclusively; otherwise attributeStats fields with multiple advanced stats would appear.

Physical Planning

The optimized logical plan is handed over to the query planner (SparkPlanner), which refactors it into a SparkPlan with the help of strategies — Spark strategies are matched against logical plan nodes and map them to physical counterparts if their constraints are satisfied. After trying to apply all strategies (possibly in several iterations), the SparkPlanner returns a list of valid physical plans with at least one member.

Most strategy clauses replace one logical operator with one physical counterpart and insert PlanLater placeholders for logical subtrees that do not get planned by the strategy. These placeholders are transformed into physical trees later on by reapplying the strategy list on all subtrees that contain them.

As of Spark 2.4.7, the query planner applies ten distinct strategies (six more for streams). These strategies can be retrieved with our CollectStrategies program. Their scope includes the following items:

  • The push-down of filters to, and pruning of, columns in the data source
  • The planning of scans on partitioned/bucketed input files
  • The aggregation strategy (SortAggregate, HashAggregate, or ObjectHashAggregate)
  • The choice of the join algorithm (broadcast hash, shuffle hash, sort merge, broadcast nested loop, or cartesian product)

The SparkPlan of the running example has the following shape:


Project [letter#26, index#28]
+- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
   +- FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>

 

A DataSource strategy was responsible for planning the FileScan leaf node and we see multiple entries in its PushedFilters field. Pushing filter predicates down to the underlying source can avoid scans of the entire dataset if the source knows how to apply them; the pushed IsNotNull and Not(EqualTo) predicates will have a small or nil effect, since the program reads from CSV files and only one cell with the letter h exists.

The original architecture diagram depicts a Cost Model that is supposed to choose the SparkPlan with the lowest execution cost. However, such plan space explorations are still not realized as of Spark 3.0.1; the code trivially picks the first available SparkPlan that the query planner returns after applying the strategies.

A simple, “localized” cost model might be implicitly defined in the codebase by the order in which rules and strategies are applied and by their internal logic. For example, the JoinSelection strategy assigns the highest precedence to the broadcast hash join. This tends to be the best join algorithm in Spark when one of the participating datasets is small enough to fit into memory. If its conditions are met, the logical node is immediately mapped to a BroadcastHashJoinExec physical node, and the plan is returned. Alternative plans with other join nodes are not generated.
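To see this preference in action, and to influence it from user code, here is a small sketch that can be pasted into spark-shell; the dataset shapes and the threshold value are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("JoinSketch").master("local[*]").getOrCreate()
import spark.implicits._

val facts = spark.range(1000000L).toDF("id")               // larger side
val dims  = Seq((1L, "a"), (2L, "b")).toDF("id", "label")  // small side

// JoinSelection auto-broadcasts any side whose size estimate is below this threshold (10MB by default) ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// ... and an explicit hint marks the small side for a broadcast hash join directly.
val joined = facts.join(broadcast(dims), "id")
joined.explain()   // the physical plan should contain a BroadcastHashJoin node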

Physical Preparation

The Catalyst pipeline concludes by transforming the Spark Plan into the Physical/Executed Plan with the help of preparation rules. These rules can affect the number of physical operators present in the plan. For example, they may insert shuffle operators where necessary (EnsureRequirements), deduplicate existing Exchange operators where appropriate (ReuseExchange), and merge several physical operators where possible (CollapseCodegenStages).

The final plan for our example has the following format:


*(1) Project [letter#26, index#28]
 +- *(1) Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
    +- *(1) FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>

 

This textual representation of the Physical Plan is identical to the preceding one for the Spark Plan, with one exception: Each physical operator is now prefixed with a *(1). The asterisk indicates that the aforementioned CollapseCodegenStages preparation rule, which “compiles a subtree of plans that support codegen together into a single Java function,” fired.

The number after the asterisk is a counter for the CodeGen stage index. This means that Catalyst has compiled the three physical operators in our example together into the body of one Java function. This comprises the first and only WholeStageCodegen stage, which in turn will be executed in the first and only Spark stage. The code for the collapsed stage will be generated by the Janino compiler at run-time.
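The generated code can be inspected directly from an application or a spark-shell session; a small sketch (the input path and filter expression are assumptions that roughly mirror the running example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._   // brings debugCodegen() into scope

val spark = SparkSession.builder().appName("CodegenSketch").master("local[*]").getOrCreate()

// Any DataFrame works; this one roughly mirrors the running example.
val df = spark.read.csv("/example")
  .filter("_c1 > 6 AND _c1 < 10 AND _c0 != 'h'")

df.debugCodegen()   // prints the Java source produced for each WholeStageCodegen subtree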

Adaptive Query Execution in Spark 3

One of the major enhancements introduced in Spark 3.0 is Adaptive Query Execution (AQE), a framework that can improve query plans during run-time. Instead of passing the entire physical plan to the scheduler for execution, Catalyst tries to slice it up into subtrees (“query stages”) that can be executed in an incremental fashion.

The adaptive logic begins at the bottom of the physical plan, with the stage(s) containing the leaf nodes. Upon completion of a query stage, its run-time data statistics are used for reoptimizing and replanning its parent stages. The scope of this new framework is limited: The official SQL guide states that “there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skew join optimization.” A query must not be a streaming query and its plan must contain at least one Exchange node (induced by a join, for example) or a subquery; otherwise, an adaptive execution will not even be attempted. And even if a query satisfies these requirements, it may still not get improved by AQE.
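For reference, a minimal sketch of switching the framework on (the property names are the standard Spark 3.0 ones; the surrounding session settings are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AqeSketch")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")                      // master switch for AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   // merge small post-shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")             // split skewed join partitions
  .config("spark.sql.adaptive.localShuffleReader.enabled", "true")   // prefer local shuffle readers where possible
  .getOrCreate()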

A large part of the adaptive logic is implemented in a special physical root node called AdaptiveSparkPlanExec. Its method getFinalPhysicalPlan is used to traverse the plan tree recursively while creating and launching query stages. The SparkPlan that was derived by the preceding phases is divided into smaller segments at Exchange nodes (i.e., shuffle boundaries), and query stages are formed from these subtrees. When these stages complete, fresh run-time stats become available, and can be used for reoptimization purposes.

A dedicated, slimmed-down logical optimizer with just one rule, DemoteBroadcastHashJoin, is applied to the logical plan associated with the current physical one. This special rule attempts to prevent the planner from converting a sort merge join into a broadcast hash join if it detects a high ratio of empty partitions. In this specific scenario, the first join type will likely be more performant, so a no-broadcast-hash-join hint is inserted.

The new optimized logical plan is fed into the default query planner (SparkPlanner), which applies its strategies and returns a SparkPlan. Finally, a small number of physical operator rules are applied that take care of the subqueries and missing ShuffleExchangeExecs. This results in a refactored physical plan, which is then compared to the original version. The physical plan with the lower number of shuffle exchanges is considered cheaper to execute and is therefore chosen.

When query stages are created from the plan, four adaptive physical rules are applied that include CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader. The logic of AQE’s major new optimizations is implemented in these case classes. Databricks explains these in more detail here. A second list of dedicated physical optimizer rules is applied right after the creation of a query stage; these mostly make internal adjustments.

The last paragraph has mentioned four adaptive optimizations which are implemented in one logical rule (DemoteBroadcastHashJoin) and three physical ones (CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader). All four make use of a special statistics class that holds the output size of each map output partition. The statistics mentioned in the Quantitative Optimization section above do get refreshed, but will not be leveraged by the standard logical rule batches, as a custom logical optimizer is used by AQE.

This article has described Spark’s Catalyst optimizer in detail. Developers who are unhappy with its default behavior can add their own logical optimizations and strategies, or exclude specific logical optimizations. However, devising customized logic for query patterns can become complicated and time-consuming, and other factors, such as the choice of machine types, may also have a significant impact on the performance of applications. Unravel Data can automatically analyze Spark workloads and provide tuning recommendations, so developers are less burdened with studying query plans and identifying optimization potential.

The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

“My Fitbit Score is Now 88” – Sharing Success Stories at the First Unravel Customer Conference https://www.unraveldata.com/resources/my-fitbit-score-is-now-88-sharing-success-stories-at-the-first-unravel-customer-conference/ https://www.unraveldata.com/resources/my-fitbit-score-is-now-88-sharing-success-stories-at-the-first-unravel-customer-conference/#respond Thu, 10 Dec 2020 07:02:32 +0000 https://www.unraveldata.com/?p=5564 Untold Blog


Also see our blog post with stats from our Untold conference polls, “More than 60% of our Pipelines have SLAs…”

Unravel Data recently held our first-ever customer conference, Untold. Untold was a four-hour virtual event exclusively for and by Unravel customers, with five talks and an audience of 200 enthusiastic customers. The event featured powerful presentations, lively Q&A sessions, and interactive discussions on a special Slack channel. 

Unravel customers can view the recorded presentations and communicate with the speakers, as well as their peers who attended the event. If you’re among the hundreds of existing Unravel customers, including three of the top five North American banks, reach out to your customer success representative for access. If you are not yet a customer, you can create a free account or contact us

The talks included: 

  • Using Unravel for DataOps and SDLC enhancement
  • Keeping complex data pipelines healthy
  • Solving the small files problem (with great decreases in resource usage and large increases in throughput)
  • Managing thousands of nodes in scores of ever-busier modern data clusters

One speaker even described a successful cloud migration and the key pain points that they had to overcome.

In a “triple threat” presentation, with three speakers addressing different use cases for Unravel in the same organization, the middle speaker captured much of the flavor of the day in a single comment: “Especially with cloud costs, I think Unravel could be a game changer for every one of us.” 

Another speaker summed up “the Unravel effect” in their organization: “We can invest those dollars that we’re saving elsewhere, which makes everybody in our organization super happy.”

Brief descriptions of each talk, with highlights from each of our speakers, follow. Look for case studies on these uses of Unravel in the weeks ahead. 

Using Unravel to Improve the Software Development Life Cycle and DataOps

A major bank uses DataOps and Unravel to support the software development life cycle (SDLC) – to create, validate, and promote apps to production. And Unravel helps operators to orchestrate the apps in production. Unravel significantly shortens development time and improves software quality and performance. 

Highlights from the presentation include:

    • “We have Unravel integrated with our entire software development life cycle. Developers get all the insights into how well their code will perform before they go to production.”
    • “The Unravel toolset is something like a starting point to log into our system, and do all the work, from starting building software to deployment level to production.”
    • “When less experienced users look at Spark code, they don’t understand what they’re seeing. When they look at Unravel recommendations, they understand quickly, and then they go fix it.”
    • “As we move to the cloud, Unravel’s APIs will be useful. Today you run your code and spend $10. Tomorrow, you’ll check your code in Unravel, you’ll implement all these recommendations, you’ll spend $5 instead of $10.”

Test-drive Unravel for DataOps

Create a free account

Managing Mission-Critical Pipelines at Scale

A leading software publisher manages thousands of complex, mission-critical pipelines, with tight SLAs – and costs that can range up into the millions of dollars associated with any failures. Unravel gives this organization fine-grained visibility into their pipelines in production, allowing issues to be detected and fixed before they become problems. 

Key points from the talk:

  • “At my company, this is a simple pipeline. (Speaker shows slide with Rube Goldberg pipeline image – Ed.) One of our complicated pipelines has a 50-page user manual. We partnered with the Unravel team to stitch these pipelines together.” 
  • “Unravel is helping us to actually predict the problems in advance and early in the process, ensuring we can bring these pipelines’ health back to normal, and making sure our SLAs are met.”
  • “Last year, to be frank, we were around 80% SLA accomplishments. But, through the two quarters, we are 98% and above.”

Easing Operations and Cutting Costs 

A fast-growing marketing services company uses Unravel to manage operations, reduce costs, and support leadership use cases, all while orchestrating and managing cloud migration – and saves hugely on resources by solving the small files problem. 

This talk actually involved three separate perspectives. A few highlights follow:

  • “On-prem Hadoop, that’s where the Unraveled story started for us.”
  • “We had around 1,800 cores allocated. After the config changes recommended by Unravel, the cores allocated went down to 120.”
  • “Unravel helped us build an executive dashboard… the response time from Unravel on it was great. The whole entire experience was wonderful for me.”
  • “Proactive alerting – AutoActions – catches runaway queries, rogue jobs, and resource contention issues. It makes our Hadoop clusters more efficient, and our users good citizens.” 
  • “We were over-allocating for jobs in Spark, and I’m not an expert Spark user. Unravel recommendations and AutoActions helped solve the problem of over-allocation, without my having to learn more about Spark.”
  • “Having the data from Unravel makes our conversations with our tenants much more productive, and has helped to build trust among our teams.” 
  • “We can invest those dollars that we’re saving elsewhere, which makes everybody in our organization super happy.”

A Successful Move to Cloud 

A leading provider of consumer financial software moved to the cloud successfully – but they had to use inadequate tools, including tracking the move in Excel spreadsheets and saving data in CSV files. They were able to complete the move in less than a year, “leaving no app behind.”

As the speaker described, it was a huge job: 

  • “Think of our move from on-premises to the cloud as changing the engines of the plane while the plane is flying.”
  • “We actually started the exercise with more than 20,000 tables, but careful analysis showed that half of them were unused.”
  • “We have many, many critical pipelines, and there were millions of dollars at stake for these things not to be correct, complete, or within SLA.” 
  • “The cloud is not forgiving when it comes to cost. Linear increases in usage lead to linear increases in costs.” 
  • “The cloud offers pay as you go, which sounds great. But when you go, you pay.” 

The move made clear the need for Unravel Data as a supportive platform for assessing the on-premises data estate, matching on-premises software tools to their closest equivalents on each of the major public cloud platforms, and tracking the success of the move. 

Simplify the complexity of data cloud migration

Create a free account

 

Unravel Right-Sizes Resource Consumption – as the Pandemic Sends Traffic Soaring

One of the world’s largest financial services companies sees usage passing critical thresholds – and then the pandemic sends their data processing needs through the roof. They used Unravel to steadily reduce memory and CPU usage, even as data processing volumes grew rapidly. 

  • “Before Unravel, our memory and CPU allocations were way greater than our actual usage. And we spent many hours troubleshooting problems because we couldn’t see what was going on inside our apps.” 
  • “With Unravel, we saw that a lot of vCores were unused. And we were able to drop almost 40,000 tables… that helped us a lot.”
  • “Before Unravel, we were uncomfortably past the 80% line in capacity, and memory was always pegged. With Unravel, we were able to cut usage roughly in half for the same throughput.”
  • “Before Unravel, we couldn’t give users – including the CEO – a real good reason on why they weren’t getting what they wanted.”
  • “Comprehensive job visibility, such as the configuration page in Unravel, has improved resolution times.”
  • “(Unravel) provides us a reasonable rate of growth in our use of resources compared to workloads processed – a rate which I can sell to my management team.”
  • “I’m sleeping a lot better than I was a year ago. My Fitbit sleep score is now 88. It’s been a good journey so far.” (A Fitbit sleep score of 88 is well above most Fitbit users – good, bordering on excellent – Ed.)

Untold #datalovers swag for first Unravel customer conference

Finding Out More

Unravel customers can view the talks, communicate with industry peers who gave and attended the talks, and more. (There may still be some swag available!) If you’re interested, you can create a free account or contact us

The post “My Fitbit Score is Now 88” – Sharing Success Stories at the First Unravel Customer Conference appeared first on Unravel.

The Guide To Apache Spark Memory Optimization https://www.unraveldata.com/apache-spark-and-memory/ https://www.unraveldata.com/apache-spark-and-memory/#respond Thu, 23 Jan 2020 21:49:31 +0000 https://www.unraveldata.com/?p=4316 man programming multiple monitors


Memory mysteries

I recently read an excellent blog series about Apache Spark but one article caught my attention as its author states:

Let’s try to figure out what happens with the application when the source file is much bigger than the available memory. The memory in the below tests is limited to 900MB […]. Naively we could think that a file bigger than available memory will fail the processing with OOM memory error. And this supposition is true:

It would be bad if Spark could only process input that is smaller than the available memory – in a distributed environment, it implies that an input of 15 Terabytes in size could only be processed when the number of Spark executors multiplied by the amount of memory given to each executor equals at least 15TB. I can say from experience that this is fortunately not the case so let’s investigate the example from the article above in more detail and see why an OutOfMemory exception occurred.

The input to the failed Spark application used in the article referred to above is a text file (generated_file_1_gb.txt) that is created by a script similar to this. This file is 1GB in size and has ten lines; each line simply consists of a line number (starting with zero) that is repeated 100 million times. The on-disk size of each line is easy to calculate: it is one byte for the line number multiplied by 100 million, or ~100MB. The program that processes this file launches a local Spark executor with three cores, and the memory available to it is limited to 900MB since the JVM arguments -Xms900m -Xmx900m are used. This results in an OOM error after a few seconds, so this little experiment seems to validate the initial hypothesis that “we can’t process datasets bigger than the memory limits”.

I played around with the Python script that created the original input file here and …
  • created a second input file that is twice the disk size of the original (generated_file_1_gb.txt) but will be processed successfully by ProcessFile.scala
  • switched to the DataFrame API instead of the RDD API which again crashes the application with an OOM Error
  • created a third file that is less than a third of the size of generated_file_1_gb.txt but that crashes the original application
  • reverted back to the original input file but made one small change in the application code which now processes it successfully (using .master(“local[1]”))
The first and last change directly contradict the original hypothesis and the other changes make the memory mystery even bigger.

Memory compartments explained

Visualizations will be useful for illuminating this mystery; the following pictures show Spark’s memory compartments when running ProcessFile.scala on my MacBook:

 

According to the system spec, my MacBook has four physical cores that amount to eight vCores. Since the application was initialized with .master(“local[3]”), three out of those eight virtual cores will participate in the processing. As reflected in the picture above, the JVM heap size is limited to 900MB and default values for both spark.memory fraction properties (spark.memory.fraction and spark.memory.storageFraction) are used. The sizes for the two most important memory compartments from a developer perspective can be calculated with these formulas:

Execution Memory = (1.0 – spark.memory.storageFraction) * Usable Memory = 0.5 * 360MB = 180MB
Storage Memory = spark.memory.storageFraction * Usable Memory = 0.5 * 360MB = 180MB
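A tiny sketch that reproduces these numbers from the configuration values (the 300MB of reserved memory and the defaults of 0.6 for spark.memory.fraction and 0.5 for spark.memory.storageFraction are standard Spark values, stated here as assumptions):

object MemoryCompartments {
  def main(args: Array[String]): Unit = {
    val heapMb          = 900.0   // -Xmx900m
    val reservedMb      = 300.0   // Spark's fixed reserved memory
    val memoryFraction  = 0.6     // spark.memory.fraction (default)
    val storageFraction = 0.5     // spark.memory.storageFraction (default)

    val usableMb    = (heapMb - reservedMb) * memoryFraction   // 360 MB
    val executionMb = (1.0 - storageFraction) * usableMb       // 180 MB
    val storageMb   = storageFraction * usableMb               // 180 MB

    println(s"usable=$usableMb MB, execution=$executionMb MB, storage=$storageMb MB")
  }
}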

Execution Memory is used for objects and computations that are typically short-lived, like the intermediate buffers of a shuffle operation, whereas Storage Memory is used for long-lived data that might be reused in downstream computations. However, there is no static boundary but an eviction policy – if there is no cached data, then Execution Memory will claim all the space of Storage Memory and vice versa. If there is stored data and a computation is performed, cached data will be evicted as needed up until the Storage Memory amount which denotes a minimum that will not be spilled to disk. The reverse does not hold true though: execution is never evicted by storage.

This dynamic memory management strategy has been in use since Spark 1.6, previous releases drew a static boundary between Storage and Execution Memory that had to be specified before run time via the configuration properties spark.shuffle.memoryFraction, spark.storage.memoryFraction, and spark.storage.unrollFraction. These have become obsolete and using them will not have an effect unless the user explicitly requests the old static strategy by setting spark.memory.useLegacyMode to true.

The example application does not cache any data so Execution Memory will eat up all of Storage Memory but this is still not enough:

 

We can finally see the root cause for the application failure, and the culprit is not the total input size but the individual record size: Each record consists of 100 million numbers (0 to 9) from which a java.lang.String is created. The size of such a String is twice its “native” size (each character consumes 2 bytes) plus some overhead for headers and fields, which amortizes to a Memory Expansion Rate of 2. As already mentioned, the Spark Executor processing the text file uses three cores, which results in three tasks trying to load the first three lines of the input into memory at the same time. Each active task gets an equal share of the Execution Memory pool (360MB in total here), thus
Execution Memory per Task = (Usable Memory – Storage Memory) / spark.executor.cores = (360MB – 0MB) / 3 = 360MB / 3 = 120MB

Based on the previous paragraph, the memory size of an input record can be calculated by
Record Memory Size = Record size (disk) * Memory Expansion Rate
= 100MB * 2 = 200MB
… which is significantly above the available Execution Memory per Task hence the observed application failure.

Assigning just one core to the Spark executor will prevent the Out Of Memory exception as shown in the following picture:

 

Now there is only one active task that can use all Execution Memory and each record fits comfortably into the available space since 200MB < 360MB. This defeats the whole point of using Spark of course, since there is no parallelism; all records are now processed consecutively.

With the formulas developed above, we can estimate the largest record size which would not crash the original version of the application (which uses .master(“local[3]”)): We have around 120MB per task available so any record can only consume up to 120MB of memory. Given our special circumstances, this implies that each line in the file should be at most 120/200 = 0.6 times as long as in the original file. I created a slightly modified script that creates such a maximum input; it uses a factor of 0.6 and the resulting file can still be processed without an OOM error. Using a factor of 0.7 though would create an input that is too big and crash the application again, thus validating the thoughts and formulas developed in this section.

Going distributed: Spark inside YARN containers

Things become even more complicated in a distributed environment. Suppose we run on AWS/EMR and use a cluster of m4.2xlarge instance types; then every node has eight vCPUs (four physical CPUs) and 32GB memory according to https://aws.amazon.com/ec2/instance-types/. YARN will be responsible for resource allocations and each Spark executor will run inside a YARN container. Additional memory properties have to be taken into account since YARN needs some resources for itself:

Out of the 32GB of total node memory of an m4.2xlarge instance, 24GB can be used for containers/Spark executors by default (property yarn.nodemanager.resource.memory-mb), and the largest container/executor could use all of this memory (property yarn.scheduler.maximum-allocation-mb); these values are taken from https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html. Each YARN container needs some overhead in addition to the memory reserved for the Spark executor that runs inside it; the default value of this spark.yarn.executor.memoryOverhead property is 384MB or 0.1 * Container Memory, whichever value is bigger. The memory available to the Spark executor would therefore be 0.9 * Container Memory in this scenario.

How is this Container Memory determined? It is actually not a property that is explicitly set: Let’s say we use two Spark executors and two cores per executor (--executor-cores 2) as reflected in the image above. Then the Container Memory is
Container Memory = yarn.scheduler.maximum-allocation-mb / Number of Spark executors per node = 24GB / 2 = 12GB

Therefore each Spark executor has 0.9 * 12GB available (equivalent to the JVM Heap sizes in the images above) and the various memory compartments inside it could now be calculated based on the formulas introduced in the first part of this article. The virtual core count of two was just chosen for this example; it wouldn’t make much sense in real life since four vcores are idle under this configuration. The best setup for m4.2xlarge instance types might be to just use one large Spark executor with seven cores, as one core should always be reserved for the Operating System and other background processes on the node.
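The same arithmetic as a small sketch (the EMR defaults quoted above are plugged in; the 384MB-or-10% overhead rule follows the description in the text):

object YarnSizing {
  def main(args: Array[String]): Unit = {
    val yarnMaxAllocationMb = 24 * 1024   // yarn.scheduler.maximum-allocation-mb on an m4.2xlarge
    val executorsPerNode    = 2

    val containerMemoryMb = yarnMaxAllocationMb / executorsPerNode           // 12288 MB
    val overheadMb        = math.max(384, (0.1 * containerMemoryMb).toInt)   // 1228 MB
    val executorHeapMb    = containerMemoryMb - overheadMb                   // ~11 GB, i.e. roughly 0.9 * container

    println(s"container=$containerMemoryMb MB, overhead=$overheadMb MB, executor heap=$executorHeapMb MB")
  }
}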

Spark and Standalone Mode

Things become a bit easier again when Spark is deployed without YARN in Standalone Mode, as is the case with services like Azure Databricks:

Only one Spark executor will run per node and the cores will be fully used. In this case, the available memory can be calculated for instances like DS4 v2 with the following formulas:
Container Memory = (Instance Memory * 0.97 – 4800)
spark.executor.memory = (0.8 * Container Memory)

Memory and partitions in real life workloads

Determining the “largest” record that might lead to an OOM error is much more complicated than in the previous scenario for a typical workload: The line lengths of all input files used (like generated_file_1_gb.txt) were the same so there was no “smallest” or “largest” record. Finding the maximum would be much harder if not practically impossible when transformations and aggregations occur. One approach might consist in searching the input or intermediate data that was persisted to stable storage for the “largest” record and creating an object of the right type (the schema used during a bottleneck like a shuffle) from it. The memory size of this object can then be directly determined by passing a reference to SizeEstimator.estimate, a version of this function that can be used outside of Spark can be found here.
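A hedged sketch of that approach (the Record case class and its payload are placeholders for whatever the real bottleneck schema looks like):

import org.apache.spark.util.SizeEstimator

object EstimateRecordSize {
  // Placeholder for the schema used during the bottleneck (e.g. a shuffle).
  case class Record(letter: String, index: Int, payload: Array[Byte])

  def main(args: Array[String]): Unit = {
    val largestCandidate = Record("h", 9, Array.fill(1024 * 1024)(0.toByte))
    // Estimated in-memory footprint of the record in bytes.
    println(SizeEstimator.estimate(largestCandidate))
  }
}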

Once an application succeeds, it might be useful to determine the average memory expansion rate for performance reasons as this could influence the choice of the number of (shuffle) partitions: One of the clearest indications that more partitions should be used is the presence of “spilled tasks” during a shuffle stage. In that case, the Spark Web UI should show two spilling entries (Shuffle spill (disk) and Shuffle spill (memory)) with positive values when viewing the details of a particular shuffle stage by clicking on its Description entry inside the Stage section. The presence of these two metrics indicates that not enough Execution Memory was available during the computation phase so records had to be evicted to disk, a process that is bad for the application performance. A “good” side effect of this costly spilling is that the memory expansion rate can be easily approximated by dividing the value for Shuffle spill (memory) by Shuffle spill (disk) since both metrics are based on the same records and denote how much space they take up in-memory versus on-disk, therefore:
Memory Expansion Rate ≈ Shuffle spill (memory) / Shuffle spill (disk)

This rate can now be used to approximate the total in-memory shuffle size of the stage or, in case a Spark job contains several shuffles, of the biggest shuffle stage. An estimation is necessary since this value is not directly exposed in the web interface but can be inferred from the on-disk size (field Shuffle Read shown in the details view of the stage) multiplied by the Memory Expansion Rate:
Shuffle size in memory = Shuffle Read * Memory Expansion Rate

Finally, the number of shuffle partitions should be set to the ratio of the Shuffle size (in memory) and the memory that is available per task, the formula for deriving the last value was already mentioned in the first section (“Execution Memory per Task”). So
Shuffle Partition Number = Shuffle size in memory / Execution Memory per task
This value can now be used for the configuration property spark.sql.shuffle.partitions whose default value is 200 or, in case the RDD API is used, for spark.default.parallelism or as second argument to operations that invoke a shuffle like the *byKey functions.

The intermediate values needed for the last formula might be hard to determine in practice in which case the following alternative calculation can be used; it only uses values that are directly provided by the Web UI: The Shuffle Read Size per task for the largest shuffle stage should be around 150MB so the number of shuffle partitions would be equal to the value of Shuffle Read divided by it:
Shuffle Partition Number = Shuffle size on disk (= Shuffle Read) / 150
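Putting the two calculations together in a small sketch (all Web UI numbers below are made-up placeholders):

import org.apache.spark.sql.SparkSession

object ShufflePartitionSizing {
  def main(args: Array[String]): Unit = {
    // Values read off the largest shuffle stage in the Web UI (placeholders).
    val shuffleSpillMemoryMb = 3200.0   // Shuffle spill (memory)
    val shuffleSpillDiskMb   = 800.0    // Shuffle spill (disk)
    val shuffleReadMb        = 6000.0   // Shuffle Read of the biggest shuffle stage
    val execMemoryPerTaskMb  = 120.0    // from the first section

    val expansionRate    = shuffleSpillMemoryMb / shuffleSpillDiskMb        // ~4
    val shuffleSizeMemMb = shuffleReadMb * expansionRate                    // ~24000 MB in memory
    val partitions       = math.ceil(shuffleSizeMemMb / execMemoryPerTaskMb).toInt

    val spark = SparkSession.builder().appName("PartitionSizing").master("local[*]").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", partitions.toLong)
    println(s"expansion rate ~ $expansionRate, shuffle partitions = $partitions")
  }
}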

The post The Guide To Apache Spark Memory Optimization appeared first on Unravel.

Spark APIs: RDD, DataFrame, DataSet in Scala, Java, Python https://www.unraveldata.com/spark-apis-rdd-dataframe-dataset-in-scala-java-python/ https://www.unraveldata.com/spark-apis-rdd-dataframe-dataset-in-scala-java-python/#respond Wed, 22 Jan 2020 21:41:47 +0000 https://www.unraveldata.com/?p=4267 digital code background


This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website.

Once upon a time there was only one way to use Apache Spark but support for additional programming languages and APIs has been introduced in recent times. A novice can be confused by the different options that have become available since Spark 1.6 and intimidated by the idea of setting up a project to explore these APIs. I’m going to show how to use Spark with three different APIs in three different programming languages by implementing the Hello World of Big Data, Word Count, for each combination.

Project Setup: IntelliJ, Maven, JVM, Fat JARs

This repo contains everything needed for running Spark or Pyspark locally and can be used as a template for more complex projects. Apache Maven is used as a build tool so dependencies, build configurations etc. are specified in a POM file. After cloning, the project can be opened and executed with Maven from the terminal or imported with a modern IDE like IntelliJ via: File => New => Project From Existing Sources => Open => Import project from external model => Maven

To build the project without an IDE, go to its source directory and execute the command mvn clean package, which compiles an “uber” or “fat” JAR at bdrecipes/target/top-modules-1.0-SNAPSHOT.jar. This file is a self-contained unit that is executable so it will contain all dependencies specified in the POM. If you run your application on a cluster, the Spark dependencies are already “there” and shouldn’t be included in the JAR that contains the program code and other non-Spark dependencies. This could be done by adding a provided scope to the two Spark dependencies (spark-sql_2.11 and spark-core_2.11) in the POM.

Run an application using the command

java -cp target/top-modules-1.0-SNAPSHOT.jar

or

scala -cp target/top-modules-1.0-SNAPSHOT.jar 

plus the fully qualified name of the class that should be executed. To run the DataSet API example for both Scala and Java, use the following commands:

scala -cp target/top-modules-1.0-SNAPSHOT.jar spark.apis.wordcount.Scala_DataSet

java -cp target/top-modules-1.0-SNAPSHOT.jar spark.apis.wordcount.Java_DataSet

PySpark Setup

Python is not a JVM-based language and the Python scripts that are included in the repo are actually completely independent from the Maven project and its dependencies. In order to run the Python examples, you need to install pyspark which I did on MacOS via pip3 install pyspark. The scripts can be run from an IDE or from the terminal via python3 python_dataframe.py

Implementation
Each Spark program will implement the same simple Word Count logic:

1. Read the lines of a text file; Moby Dick will be used here

2. Extract all words from those lines and normalize them. For simplicity, we will just split the lines on whitespace here and lowercase the words

3. Count how often each element occurs

4. Create an output file that contains the element and its occurrence frequency
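A minimal Scala RDD sketch of these four steps (file paths and the local master setting are assumptions; the linked solutions below are the authoritative versions):

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]")
      .getOrCreate()

    val lines  = spark.sparkContext.textFile("mobydick.txt")         // 1. read the lines
    val words  = lines.flatMap(_.split("\\s+")).map(_.toLowerCase)   // 2. split on whitespace, lowercase
    val counts = words.map(word => (word, 1L)).reduceByKey(_ + _)    // 3. count occurrences
    counts.saveAsTextFile("wordcount-output")                        // 4. write element + frequency

    spark.stop()
  }
}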

The solutions for the various combinations using the most recent version of Spark (2.3) can be found here:

Scala + RDD
Scala + DataFrame
Scala + DataSet
Python + RDD
Python + DataFrame
Java + RDD
Java + DataFrame
Java + DataSet

These source files should contain enough comments so there is no need to describe the code in detail here.

The post Spark APIs: RDD, DataFrame, DataSet in Scala, Java, Python appeared first on Unravel.

Benchmarks, Data Spark and Graal https://www.unraveldata.com/benchmarks-spark-and-graal/ https://www.unraveldata.com/benchmarks-spark-and-graal/#respond Mon, 20 Jan 2020 15:30:56 +0000 https://www.unraveldata.com/?p=4167


This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website.

A very important question is how long something takes and the answer to that is fairly straightforward in normal life: Check the current time, then perform the unit of work that should be measured, then check the time again. The end time minus the start time would equal the amount of time that the task took, the elapsed time or wallclock time. The programmatic version of this simple measuring technique could look like

def measureTime[A](computation: => A): Long = {
  val startTime = System.currentTimeMillis()
  computation                               // force the by-name block to run
  val endTime = System.currentTimeMillis()
  val elapsedTime = endTime - startTime
  elapsedTime                               // wall-clock duration in milliseconds
}

In the case of Apache Spark, the computation would likely be of type Dataset[_] or RDD[_]. In fact, the two third party benchmarking frameworks for Spark mentioned below are based on a function similar to the one shown above for measuring the execution time of a Spark job.
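For instance, assuming the measureTime helper above is in scope (e.g. in a spark-shell session), timing a Spark job might look like this; the workload itself is just an illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NaiveTiming").master("local[*]").getOrCreate()

// count() is an action, so the whole job (not just the lazy transformations) is timed.
val elapsedMs = measureTime {
  spark.range(10000000L).filter("id % 2 = 0").count()
}
println(s"job took $elapsedMs ms")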

It is surprisingly hard to accurately measure how long something takes in programming: The result from a single invocation of the naive method above is likely not very reliable since numerous non-deterministic factors can interfere with a measurement, especially when the underlying runtime applies dynamic optimizations, as the Java Virtual Machine does. Even the usage of a dedicated microbenchmarking framework like JMH only solves part of the problem – the user is reminded of that caveat every time JMH completes:


[info] REMEMBER: The numbers below are just data. To gain reusable 
[info] insights, you need to follow up on why the numbers are the 
[info] way they are. Use profilers (see -prof, -lprof), design 
[info] factorial experiments, perform baseline and negative tests 
[info] that provide experimental control, make sure the 
[info] benchmarking environment is safe on JVM/OS/HW level, ask 
[info] for reviews from the domain experts. Do not assume the 
[info] numbers tell you what you want them to tell.

From the Apache Spark creators: spark-sql-perf

If benchmarking a computation on a local machine is already hard, then doing this for a distributed computation/environment should be very hard. spark-sql-perf is the official performance testing framework for Spark 2. The following twelve benchmarks are particularly interesting since they target various features and APIs of Spark; they are organized into three classes:

DatasetPerformance compares the same workloads expressed via RDD, Dataframe and Dataset API:

  • range just creates 100 million integers of datatype Long which are wrapped in a case class in the case of RDDs and Datasets
  • filter applies four consecutive filters to 100 million Longs
  • map applies an increment operation to 100 million Longs four times
  • average computes the average of one million Longs using a user-defined function for Datasets, a built-in sql function for DataFrames and a map/reduce combination for RDDs.

JoinPerformance is based on three data sets with one million, 100 million and one billion Longs:
  • singleKeyJoins: joins each one of the three basic data sets in a left table with each one of the three basic data sets in a right table via four different join types (Inner Join, Right Join, Left Join and Full Outer Join)
  • varyDataSize: joins two tables consisting of 100 million integers each with a ‘data’ column containing Strings of 5 different lengths (1, 128, 256, 512 and 1024 characters)
  • varyKeyType: joins two tables consisting of 100 million integers and casts it to four different data types (String, Integer, Long and Double)
  • varyNumMatches

AggregationPerformance:

  • varyNumGroupsAvg and twoGroupsAvg both compute the average of one table column and group it by the other column. They differ in the cardinality and shape of the input tables used.
The next two aggregation benchmarks use the three data sets that are also used in the DataSetPerformance benchmark described above:
  • complexInput: For each of the three integer tables, adds a single column together nine times
  • aggregates: aggregates a single column via four different aggregation types (SUM, AVG, COUNT and STDDEV)

Running these benchmarks produces… almost nothing: most of them are broken or will crash in the current state of the official master branch due to various problems (issues with reflection, missing table registrations, wrong UDF pattern, …):

$ bin/run --benchmark AggregationPerformance
[...]
[error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.AggregationPerformance
[error] at java.lang.Class.newInstance(Class.java:427)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
[error] at scala.util.Try.getOrElse(Try.scala:79)
[...]
$ bin/run --benchmark JoinPerformance
[...]
[error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.JoinPerformance
[error] at java.lang.Class.newInstance(Class.java:427)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
[error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
[error] at scala.util.Try.getOrElse(Try.scala:79)

I repaired these issues and was able to run all twelve benchmarks successfully; the fixed edition can be downloaded from my personal repo here, and a PR was also submitted. Enough complaints: the first results generated via

$ bin/run --benchmark DatasetPerformance

that compare the same workloads expressed in RDD, Dataset and Dataframe APIs are:

name |minTimeMs| maxTimeMs| avgTimeMs| stdDev
-------------------------|---------|---------|---------|---------
DF: average | 36.53 | 119.91 | 56.69 | 32.31
DF: back-to-back filters | 2080.06 | 2273.10 | 2185.40 | 70.31
DF: back-to-back maps | 1984.43 | 2142.28 | 2062.64 | 62.38
DF: range | 1981.36 | 2155.65 | 2056.18 | 70.89
DS: average | 59.59 | 378.97 | 126.16 | 125.39
DS: back-to-back filters | 3219.80 | 3482.17 | 3355.01 | 88.43
DS: back-to-back maps | 2794.68 | 2977.08 | 2890.14 | 59.55
DS: range | 2000.36 | 3240.98 | 2257.03 | 484.98
RDD: average | 20.21 | 51.95 | 30.04 | 11.31
RDD: back-to-back filters| 1704.42 | 1848.01 | 1764.94 | 56.92
RDD: back-to-back maps | 2552.72 | 2745.86 | 2678.29 | 65.86
RDD: range | 593.73 | 689.74 | 665.13 | 36.92

This is rather surprising and counterintuitive given the focus of the architecture changes and performance improvements in Spark 2 – the RDD API performs best (= lowest numbers in the fourth column) for three out of four workloads, Dataframes only outperform the two other APIs in the back-to-back maps benchmark with 2062 ms versus 2890 ms in the case of Datasets and 2678 in the case of RDDs.

The results for the two other benchmark classes are as follows:

bin/run --benchmark AggregationPerformance
name | minTimeMs | maxTimeMs | avgTimeMs | stdDev
------------------------------------|-----------|-----------|-----------|--------
aggregation-complex-input-100milints| 19917.71 |23075.68 | 21604.91 | 1590.06
aggregation-complex-input-1bilints | 227915.47 |228808.97 | 228270.96 | 473.89
aggregation-complex-input-1milints | 214.63 |315.07 | 250.08 | 56.35
avg-ints10 | 213.14 |1818.041 | 763.67 | 913.40
avg-ints100 | 254.02 |690.13 | 410.96 | 242.38
avg-ints1000 | 650.53 |1107.93 | 812.89 | 255.94
avg-ints10000 | 2514.60 |3273.21 | 2773.66 | 432.72
avg-ints100000 | 18975.83 |20793.63 | 20016.33 | 937.04
avg-ints1000000 | 233277.99 |240124.78 | 236740.79 | 3424.07
avg-twoGroups100000 | 218.86 |405.31 | 283.57 | 105.49
avg-twoGroups1000000 | 194.57 |402.21 | 276.33 | 110.62
avg-twoGroups10000000 | 228.32 |409.40 | 303.74 | 94.25
avg-twoGroups100000000 | 627.75 |733.01 | 673.69 | 53.88
avg-twoGroups1000000000 | 4773.60 |5088.17 | 4910.72 | 161.11
avg-twoGroups10000000000 | 43343.70 |47985.57 | 45886.03 | 2352.40
single-aggregate-AVG-100milints | 20386.24 |21613.05 | 20803.14 | 701.50
single-aggregate-AVG-1bilints | 209870.54 |228745.61 | 217777.11 | 9802.98
single-aggregate-AVG-1milints | 174.15 |353.62 | 241.54 | 97.73
single-aggregate-COUNT-100milints | 10832.29 |11932.39 | 11242.52 | 601.00
single-aggregate-COUNT-1bilints | 94947.80 |103831.10 | 99054.85 | 4479.29
single-aggregate-COUNT-1milints | 127.51 |243.28 | 166.65 | 66.36
single-aggregate-STDDEV-100milints | 20829.31 |21207.90 | 20994.51 | 193.84
single-aggregate-STDDEV-1bilints | 205418.40 |214128.59 | 211163.34 | 4976.13
single-aggregate-STDDEV-1milints | 181.16 |246.32 | 205.69 | 35.43
single-aggregate-SUM-100milints | 20191.36 |22045.60 | 21281.71 | 969.26
single-aggregate-SUM-1bilints | 216648.77 |229335.15 | 221828.33 | 6655.68
single-aggregate-SUM-1milints | 186.67 |1359.47 | 578.54 | 676.30
bin/run --benchmark JoinPerformance
name |minTimeMs |maxTimeMs |avgTimeMs |stdDev
------------------------------------------------|----------|----------|----------|--------
singleKey-FULL OUTER JOIN-100milints-100milints | 44081.59 |46575.33 | 45418.33 |1256.54
singleKey-FULL OUTER JOIN-100milints-1milints | 36832.28 |38027.94 | 37279.31 |652.39
singleKey-FULL OUTER JOIN-1milints-100milints | 37293.99 |37661.37 | 37444.06 |192.69
singleKey-FULL OUTER JOIN-1milints-1milints | 936.41 |2509.54 | 1482.18 |890.29
singleKey-JOIN-100milints-100milints | 41818.86 |42973.88 | 42269.81 |617.71
singleKey-JOIN-100milints-1milints | 20331.33 |23541.67 | 21692.16 |1660.02
singleKey-JOIN-1milints-100milints | 22028.82 |23309.41 | 22573.63 |661.30
singleKey-JOIN-1milints-1milints | 708.12 |2202.12 | 1212.86 |856.78
singleKey-LEFT JOIN-100milints-100milints | 43651.79 |46327.19 | 44658.37 |1455.45
singleKey-LEFT JOIN-100milints-1milints | 22829.34 |24482.67 | 23633.77 |827.56
singleKey-LEFT JOIN-1milints-100milints | 32674.77 |34286.75 | 33434.05 |810.04
singleKey-LEFT JOIN-1milints-1milints | 682.51 |773.95 | 715.53 |50.73
singleKey-RIGHT JOIN-100milints-100milints | 44321.99 |45405.85 | 44965.93 |570.00
singleKey-RIGHT JOIN-100milints-1milints | 32293.54 |32926.62 | 32554.74 |330.73
singleKey-RIGHT JOIN-1milints-100milints | 22277.12 |24883.91 | 23551.74 |1304.34
singleKey-RIGHT JOIN-1milints-1milints | 683.04 |935.88 | 768.62 |144.85

 

From Phil: Spark & JMH

The surprising results from the DatasetPerformance benchmark above should make us skeptical – probably the benchmarking code or setup itself is to blame for the odd measurement, not the actual Spark APIs. A popular and quasi-official benchmarking framework for languages targeting the JVM is JMH, so why not use it for the twelve Spark benchmarks? I “translated” them into JMH versions here; a sketch of the approach follows, and after it the new results, among them the previously odd DatasetPerformance cases:
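A minimal sketch of what wrapping a Spark workload in JMH can look like (class, package, and workload are illustrative and assume the sbt-jmh plugin; the real benchmark classes live in the linked repo):

package spark_benchmarks

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.openjdk.jmh.annotations._

// One benchmark method per workload; JMH handles warmup, measurement iterations and statistics.
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class RangeBench {

  var spark: SparkSession = _

  @Setup(Level.Trial)
  def setup(): Unit = {
    spark = SparkSession.builder()
      .appName("RangeBench")
      .master("local[3]")
      .getOrCreate()
  }

  @TearDown(Level.Trial)
  def teardown(): Unit = spark.stop()

  @Benchmark
  def rangeDataset(): Long =
    // Materialize 100 million Longs through the Dataset API; count() forces execution
    // and returns a value so JMH does not treat the work as dead code.
    spark.range(100000000L).count()
}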

Phils-MacBook-Pro: pwd
/Users/Phil/IdeaProjects/jmh-spark
Phils-MacBook-Pro: ls
README.md benchmarks build.sbt project src target

Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs1
[...]
Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs2
Benchmark (start) Mode Cnt Score Error Units
Bench_APIs1.rangeDataframe 1 avgt 20 2618.631 ± 59.210 ms/op
Bench_APIs1.rangeDataset 1 avgt 20 1646.626 ± 33.230 ms/op
Bench_APIs1.rangeDatasetJ 1 avgt 20 2069.763 ± 76.444 ms/op
Bench_APIs1.rangeRDD 1 avgt 20 448.281 ± 85.781 ms/op
Bench_APIs2.averageDataframe 1 avgt 20 24.614 ± 1.201 ms/op
Bench_APIs2.averageDataset 1 avgt 20 41.799 ± 2.012 ms/op
Bench_APIs2.averageRDD 1 avgt 20 12.280 ± 1.532 ms/op
Bench_APIs2.filterDataframe 1 avgt 20 2395.985 ± 36.333 ms/op
Bench_APIs2.filterDataset 1 avgt 20 2669.160 ± 81.043 ms/op
Bench_APIs2.filterRDD 1 avgt 20 2776.382 ± 62.065 ms/op
Bench_APIs2.mapDataframe 1 avgt 20 2020.671 ± 136.371 ms/op
Bench_APIs2.mapDataset 1 avgt 20 5218.872 ± 177.096 ms/op
Bench_APIs2.mapRDD 1 avgt 20 2957.573 ± 26.458 ms/op

These results are more in line with expectations: Dataframes perform best in two out of four benchmarks, and the Spark-internal functionality used for the other two (average and range) indeed favours RDDs.

From IBM: spark-bench
To be published

From CERN:
To be published

Enter GraalVM

Most computer programs nowadays are written in higher-level languages so humans can create them faster and understand them more easily. But since a machine can only “understand” numerical languages, these high-level artifacts cannot directly be executed by a processor, so typically one or more additional steps are required before a program can be run. Some programming languages transform their user’s source code into an intermediate representation which then gets compiled again into and interpreted as machine code. The languages of interest in this article (roughly) follow this strategy: The programmer only writes high-level source code which is then automatically transformed to a file ending in .class that contains platform-independent bytecode. This bytecode file is further compiled down to machine code by the Java Virtual Machine while hardware-specific aspects are fully taken care of and, depending on the compiler used, optimizations are applied. Finally, this machine code is executed in the JVM runtime.

One of the most ambitious software projects of the past years has probably been the development of a general-purpose virtual machine, Oracle’s Graal project, “one VM to rule them all.” There are several aspects to this technology; two of the highlights are the goal of providing seamless interoperability between (JVM and non-JVM) programming languages while running them efficiently on the same JVM, and a new, high-performance Java compiler. Twitter already uses the enterprise edition in production and saves around 8% of CPU utilization. The Community edition can be downloaded and used for free; more details below.

Graal and Scala

Graal works at the bytecode level. In order to run Scala code via Graal, I created a toy example that is inspired by the benchmarks described above: The source code snippet below creates 10 million integers, increments each number by one, removes all odd elements and finally sums up all of the remaining even numbers. These four operations are repeated 100 times and during each step the execution time and the sum (which stays the same across all 100 iterations) are printed out. Before the program terminates, the total run time will also be printed. The following source code implements this logic – not in the most elegant way but with optimization potential for the final compiler phase where Graal will come into play:

object ProcessNumbers {
 // Helper functions:
 def increment(xs: Array[Int]) = xs.map(_ + 1)
 def filterOdd(xs: Array[Int]) = xs.filter(_ % 2 == 0)
 def sum(xs: Array[Int]) = xs.foldLeft(0L)(_ + _)
 // Main part:
 def main(args: Array[String]): Unit = {
   var totalTime = 0L
   // loop 100 times
  for (iteration <- 1 to 100) {
     val startTime = System.currentTimeMillis()
     val numbers = (0 until 10000000).toArray
     // transform numbers and sum them up
     val incrementedNumbers = increment(numbers)
     val evenNumbers = filterOdd(incrementedNumbers)
     val summed = sum(evenNumbers)
     // calculate times and print out
     val endTime = System.currentTimeMillis()
     val elapsedTime = endTime - startTime
     totalTime += elapsedTime
     println(s"Iteration $iteration took $elapsedTime milliseconds\t\t$summed")
   }
   println("*********************************")
     println(s"Total time: $totalTime ms")
   }
}

The transformation of this code to the intermediate bytecode representation is done as usual, via scalac ProcessNumbers.scala. The resulting bytecode file is not directly interpretable by humans but those JVM instructions can be made more intelligible by disassembling them with the command javap -c -cp. The original source code above has fewer than 30 lines; the disassembled version has more than 200 lines, but in a simpler structure and with a small instruction set:

javap -c -cp ./ ProcessNumbers$
public final class ProcessNumbers$ {
[...]

public void main(java.lang.String[]);
Code:
0: lconst_0
1: invokestatic #137 // Method scala/runtime/LongRef.create:(J)Lscala/runtime/LongRef;
4: astore_2
5: getstatic #142 // Field scala/runtime/RichInt$.MODULE$:Lscala/runtime/RichInt$;
8: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
11: iconst_1
12: invokevirtual #145 // Method scala/Predef$.intWrapper:(I)I
15: bipush 100
17: invokevirtual #149 // Method scala/runtime/RichInt$.to$extension0:(II)Lscala/collection/immutable/Range$Inclusive;
20: aload_2
21: invokedynamic #160, 0 // InvokeDynamic #3:apply$mcVI$sp:(Lscala/runtime/LongRef;)Lscala/runtime/java8/JFunction1$mcVI$sp;
26: invokevirtual #164 // Method scala/collection/immutable/Range$Inclusive.foreach$mVc$sp:(Lscala/Function1;)V
29: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
32: ldc #166 // String *********************************
34: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
37: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
40: new #172 // class java/lang/StringBuilder
43: dup
44: ldc #173 // int 15
46: invokespecial #175 // Method java/lang/StringBuilder."<init>":(I)V
49: ldc #177 // String Total time:
51: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
54: aload_2
55: getfield #185 // Field scala/runtime/LongRef.elem:J
58: invokevirtual #188 // Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
61: ldc #190 // String ms
63: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
66: invokevirtual #194 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
69: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
72: return
[...]
}

Now we come to the Graal part: My system JDK is

Phils-MacBook-Pro $ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

I downloaded the community edition of Graal from here and placed it in a folder along with Scala libraries and the files for the toy benchmarking example mentioned above:

Phils-MacBook-Pro: ls
ProcessNumbers$.class ProcessNumbers.class ProcessNumbers.scala graalvm scalalibs

Phils-MacBook-Pro: ./graalvm/Contents/Home/bin/java -version
openjdk version "1.8.0_172"
OpenJDK Runtime Environment (build 1.8.0_172-20180626105433.graaluser.jdk8u-src-tar-g-b11)
GraalVM 1.0.0-rc5 (build 25.71-b01-internal-jvmci-0.46, mixed mode)

Phils-MacBook-Pro: ls scalalibs/
jline-2.14.6.jar scala-library.jar scala-reflect.jar scala-xml_2.12-1.0.6.jar
scala-compiler.jar scala-parser-combinators_2.12-1.0.7.jar scala-swing_2.12-2.0.0.jar scalap-2.12.6.jar

Let’s run this benchmark with the “normal” JDK via java -cp ./lib/scala-library.jar:. ProcessNumbers. Around 31 seconds are needed as can be seen below (only the first and last iterations are shown)

Phils-MacBook-Pro: java -cp ./lib/scala-library.jar:. ProcessNumbers

Iteration 1 took 536 milliseconds 25000005000000
Iteration 2 took 533 milliseconds 25000005000000
Iteration 3 took 350 milliseconds 25000005000000
Iteration 4 took 438 milliseconds 25000005000000
Iteration 5 took 345 milliseconds 25000005000000
[...]
Iteration 95 took 293 milliseconds 25000005000000
Iteration 96 took 302 milliseconds 25000005000000
Iteration 97 took 333 milliseconds 25000005000000
Iteration 98 took 282 milliseconds 25000005000000
Iteration 99 took 308 milliseconds 25000005000000
Iteration 100 took 305 milliseconds 25000005000000
*********************************
Total time: 31387 ms

And here is a run that invokes Graal as JIT compiler:

Phils-MacBook-Pro:testo a$ ./graalvm/Contents/Home/bin/java -cp ./lib/scala-library.jar:. ProcessNumbers

Iteration 1 took 1287 milliseconds 25000005000000
Iteration 2 took 264 milliseconds 25000005000000
Iteration 3 took 132 milliseconds 25000005000000
Iteration 4 took 120 milliseconds 25000005000000
Iteration 5 took 128 milliseconds 25000005000000
[...]
Iteration 95 took 111 milliseconds 25000005000000
Iteration 96 took 124 milliseconds 25000005000000
Iteration 97 took 122 milliseconds 25000005000000
Iteration 98 took 123 milliseconds 25000005000000
Iteration 99 took 120 milliseconds 25000005000000
Iteration 100 took 149 milliseconds 25000005000000
*********************************
Total time: 14207 ms

14 seconds compared to 31 seconds is roughly a 2x speedup with Graal, not bad. The first iteration takes much longer, but then a turbo boost seems to kick in: most iterations from 10 to 100 take around 100 to 120 ms in the Graal run, compared to 290-310 ms in the vanilla Java run. Graal also has an option to deactivate the optimization via the -XX:-UseJVMCICompiler flag; trying that results in numbers similar to the first run:

Phils-MacBook-Pro: /Users/a/graalvm/Contents/Home/bin/java -XX:-UseJVMCICompiler -cp ./lib/scala-library.jar:. ProcessNumbers
Iteration 1 took 566 milliseconds 25000005000000
Iteration 2 took 508 milliseconds 25000005000000
Iteration 3 took 376 milliseconds 25000005000000
Iteration 4 took 456 milliseconds 25000005000000
Iteration 5 took 310 milliseconds 25000005000000
[...]
Iteration 95 took 301 milliseconds 25000005000000
Iteration 96 took 301 milliseconds 25000005000000
Iteration 97 took 285 milliseconds 25000005000000
Iteration 98 took 302 milliseconds 25000005000000
Iteration 99 took 296 milliseconds 25000005000000
Iteration 100 took 296 milliseconds 25000005000000
*********************************
Total time: 30878 ms

Graal and Spark

Why not invoke Graal for Spark jobs? Let’s do this for the benchmarking project introduced above, using the -jvm flag:

Phils-MacBook-Pro:jmh-spark $ sbt
Loading settings for project jmh-spark-build from plugins.sbt ...
Loading project definition from /Users/Phil/IdeaProjects/jmh-spark/project
Loading settings for project jmh-spark from build.sbt ...
Set current project to jmh-spark (in build file:/Users/Phil/IdeaProjects/jmh-spark/)
sbt server started at local:///Users/Phil/.sbt/1.0/server/c980c60cda221235db06/sock

sbt:jmh-spark> benchmarks/jmh:run -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java
Running (fork) spark_benchmarks.MyRunner -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java
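
For context, JMH’s -jvm option tells it to fork the benchmark JVMs using the given java binary (here the GraalVM one) instead of the JVM that launched sbt, so the Spark benchmark code is measured under Graal’s JIT. Purely as a hypothetical illustration of the shape such a benchmark class might take with sbt-jmh (the class and method names are assumptions, not the actual project code):

package spark_benchmarks

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class SumEvenNumbers {

  @Benchmark
  def sumEven(): Long =
    (1L to 10000000L).filter(_ % 2 == 0).sum   // the same kind of toy workload as above
}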

The post Benchmarks, Spark and Graal appeared first on Unravel.

]]>
https://www.unraveldata.com/benchmarks-spark-and-graal/feed/ 0
Unravel and Azure Databricks: Monitor, Troubleshoot and Optimize in the Cloud https://www.unraveldata.com/unravel-and-azure-databricks/ https://www.unraveldata.com/unravel-and-azure-databricks/#respond Wed, 04 Sep 2019 11:00:18 +0000 https://www.unraveldata.com/?p=3658

Monitor, Troubleshoot and Optimize Apache Spark Applications Using Microsoft Azure Databricks  We are super excited to announce our support for Azure Databricks! We continue to build out the capabilities of the Unravel Data Operations platform and […]

The post Unravel and Azure Databricks: Monitor, Troubleshoot and Optimize in the Cloud appeared first on Unravel.

]]>

Monitor, Troubleshoot and Optimize Apache Spark Applications Using Microsoft Azure Databricks 

We are super excited to announce our support for Azure Databricks! We continue to build out the capabilities of the Unravel Data Operations platform, and specifically our support for the Microsoft Azure data and AI ecosystem. The business and technical imperative to strategically and tactically architect your organization’s journey to the cloud has never been stronger. Businesses are increasingly dependent on data for decision making, and by extension on the services and platforms, such as Azure HDI and Azure Databricks, that underpin these modern data applications.

The large-scale industry adoption of Spark, and of cloud services from Azure and other platforms, represents the heart of the modern data operations program for the next decade. The combination of Microsoft and Databricks, and the resulting Azure Databricks offering, is a natural response: a deployment platform for AI, machine learning, and streaming data applications.

Spark has largely eclipsed Hadoop/MapReduce as the development paradigm of choice to develop a new generation of data applications that provide new insights and user experiences. Databricks has added a rich development and operations environment for running Apache Spark applications in the cloud, while Microsoft Azure has rapidly evolved into an enterprise favorite for migrating and running these new data applications in the cloud. 

It is against this backdrop that Unravel announces support for the Azure Databricks platform to provide our AI-powered data operations solution for Spark applications and data pipelines running on Azure Databricks. While Azure Databricks provides a state of the art platform for developing and running Spark apps and data pipelines, Unravel provides the relentless monitoring, interrogating, modeling, learning, and guided tuning and troubleshooting to create the optimal conditions for Spark to perform and operate at its peak potential.

Unravel is able to ask and answer the questions about Azure Databricks that are essential to providing the level of intelligence required to:

  • Provide a unified view across all of your Azure Databricks instances and workspaces
  • Understand Spark runtime behavior and how it interacts with Azure infrastructure, and adjacent technologies like Apache Kafka
  • Detect and avoid costly human error in configuration, tuning, and root cause analysis 
  • Accurately report cluster usage patterns and be able to adjust resource usage on the fly with Unravel insights
  • Set and guarantee enterprise service levels, based on correlated operational metadata

The Unravel Platform is constantly learning, and our training models keep adapting. The intelligence you glean from Unravel today continues to extend and adapt over time as application and user behaviors themselves change and adapt to new business demands. These in-built capabilities of the Unravel platform, together with our extensible APIs, enable us to move fast in supporting customer demands for an expanding range of data and AI services such as Azure Databricks. More importantly, they provide the insights, recommendations, and automation to ensure your journey to the cloud is accelerated and your ongoing cloud operations are fully optimized for cost and performance.

Take the hassle out of managing data pipelines in the cloud

Try Unravel for free

Read on to learn more about today’s news from Unravel.

Unravel Data Introduces AI-powered Data Operations Solution to Monitor, Troubleshoot and Optimize Apache Spark Applications Using Microsoft Azure Databricks

New Offering Enables Azure Databricks Customers to Quickly Operationalize Spark Data Engineering Workloads with Unprecedented Visibility and Radically Simpler Remediation of Failures and Slowdowns

PALO ALTO, Calif. – Sep. 4, 2019 —Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced Unravel for Azure Databricks, a  solution to deliver comprehensive monitoring, troubleshooting, and application performance management for Azure Databricks environments. The new offering leverages AI to enable Azure Databricks customers to significantly improve performance of Spark jobs while providing unprecedented visibility into runtime behavior, resource usage, and cloud costs.

“Spark, Azure, and Azure Databricks have become foundational technologies in the modern data stack landscape, with more and more Fortune 1000 organizations using them to build their modern data pipelines,” said Kunal Agarwal, CEO, Unravel Data. “Unravel is uniquely positioned to empower Azure Databricks customers to maximize the performance, reliability and return on investment of their Spark workloads.”

Unravel for Azure Databricks helps operationalize Spark apps on the platform: Azure Databricks customers will shorten the cycle of getting Spark applications into production by relying on the visibility, operational intelligence, and data driven insights and recommendations that only Unravel can provide. Users will enjoy greater productivity by eliminating the time spent on tedious, low value tasks such as log data collection, root cause analysis and application tuning.

“Unravel’s full-stack DataOps platform has already helped Azure customers get the most out of their cloud-based big data deployments. We’re excited to extend that relationship to Azure Databricks,” said Yatharth Gupta, principal group manager, Azure Data at Microsoft. “Unravel adds tremendous value by delivering an AI-powered solution for Azure Databricks customers that are looking to troubleshoot challenging operational issues and optimize cost and performance of their Azure Databricks workloads.”

Key features of Unravel for Azure Databricks include:

  • Application Performance Management for Azure Databricks – Unravel delivers visibility and understanding of Spark applications, clusters, workflows, and the underlying software stack
  • Automated root cause analysis of Spark apps – Unravel can automatically identify, diagnose, and remediate Spark jobs and the full Spark stack, achieving simpler and faster resolution of issues for Spark applications on Azure Databricks clusters
  • Comprehensive reporting, alerting, and dashboards – Azure Databricks users can now enjoy detailed insights, plain-language recommendations, and a host of new dashboards, alerts, and reporting on chargeback accounting, cluster resource usage,  Spark runtime behavior and much more.

Azure Databricks is a Spark-based analytics platform optimized for Microsoft Azure. Azure Databricks provides one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

An early access release of Unravel for Azure Databricks is available now.

About Unravel Data

Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

Copyright Statement

The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

Contacts

Jordan Tewell, 10Fold
unravel@10fold.com
1-415-666-6066

 

The post Unravel and Azure Databricks: Monitor, Troubleshoot and Optimize in the Cloud appeared first on Unravel.

]]>
https://www.unraveldata.com/unravel-and-azure-databricks/feed/ 0
Using Unravel to tune Spark data skew and partitioning https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/ https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/#respond Wed, 22 May 2019 21:20:40 +0000 https://www.unraveldata.com/?p=2859

This blog provides some best practices for how to use Unravel to tackle issues of data skew and partitioning for Spark applications. In Spark, it’s very important that the partitions of an RDD are aligned with the […]

The post Using Unravel to tune Spark data skew and partitioning appeared first on Unravel.

]]>

This blog provides some best practices for how to use Unravel to tackle issues of data skew and partitioning for Spark applications. In Spark, it’s very important that the partitions of an RDD are aligned with the number of cores available to execute tasks. Spark assigns one task per partition, and each core can process one task at a time.

By default, the number of partitions is set to the total number of cores on all the nodes hosting executors. Having too few partitions leads to less concurrency, processing skew, and improper resource utilization, whereas having too many leads to low throughput and high task scheduling overhead.

The first step in tuning a Spark application is to identify the stages that represent bottlenecks in the execution. In Unravel, this can be done easily by using the Gantt Chart view of the app’s execution. We can see at a glance that there is a longer-running stage:

Using Unravel to tune Spark data skew and partitioning

Once I’ve navigated to the stage, I can open the Timeline view, where the duration and I/O of each task within the stage are readily apparent. The histogram charts are very useful for identifying outlier tasks; in this case it is clear that 199 tasks take at most 5 minutes to complete, while one task takes 35-40 minutes!

Using Unravel to tune Spark data

When we select the first bucket of 199 tasks, another clear representation of the effect of this skew is visible within the Timeline, with many executors sitting idle:

Using Unravel to tune Spark data skew and partitioning

When we select the outlier bucket that took over 35 minutes to complete, we can see the duration of the associated executor is almost equal to the duration of the entire app:

Using Unravel to tune Spark data skew and partitioning

In the Graphs > Containers view, we can also observe a burst of containers at the time the long-running executor started. Adding more partitions via repartition() can help distribute the data set among the executors; a sketch follows the figure below.

Using Unravel to tune Spark data skew and partitioning
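
As a rough illustration of that remedy (the input path and partition count are placeholders, not values from this application):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
val df = spark.read.parquet("/data/events")   // placeholder input

// Round-robin repartitioning spreads rows evenly across more partitions,
// so every executor core gets work before the next wide transformation.
val balanced = df.repartition(400)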

Unravel can provide recommendations for optimizations in some of the cases where join key(s) or group by key(s) are skewed.

In the Spark SQL example below, two dummy data sources are used; both of them are partitioned.
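
The post’s original snippet is shown as a screenshot; the sketch below is a hedged reconstruction with assumed names and sizes (and it omits the partitioning of the two sources), kept small enough to run locally:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("skewed-join-example").getOrCreate()

// customer: 10,000 distinct customers
val customer = spark.range(1, 10001)
  .select(col("id").cast("string").as("cust_id"),
          concat(lit("name_"), col("id").cast("string")).as("name"))

// orders: 1,000,000 rows, roughly 90% of which carry cust_id "1000"
val orders = spark.range(0, 1000000)
  .select(col("id").as("order_id"),
          when(col("id") % 10 =!= 0, lit("1000"))
            .otherwise((col("id") % 10000 + 1).cast("string")).as("cust_id"))

// The join and group-by pile every "1000" row into a single reduce partition
val counts = orders.join(customer, "cust_id").groupBy("cust_id").count()
counts.show()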

The join operation between the customer and orders tables is on the cust_id column, which is heavily skewed. Looking at the code, it is easy to see that key "1000" has the most entries in the orders table, so one of the reduce partitions will contain all of the "1000" entries. In such cases we can apply some techniques to avoid skewed processing.

  • Increase the spark.sql.autoBroadcastJoinThreshold value so that the smaller table, "customer", gets broadcast. This should be done only after ensuring there is sufficient driver memory.
  • If executor memory is sufficient, we can decrease spark.sql.shuffle.partitions to accommodate more data per reduce partition. This will let all the reduce tasks finish in more or less the same time.
  • If possible, find out which keys are skewed and process them separately by using filters.

Let’s try the #2 approach

First, with Spark’s default spark.sql.shuffle.partitions=200:

Observe the lone task that takes much more time during the shuffle. That means the next stage can’t start and executors are lying idle.

Using Unravel to tune Spark data skew and partitioning

Now let’s change spark.sql.shuffle.partitions to 10. As the shuffle input/output is well within executor memory sizes, we can safely make this change.
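
One way to apply the change, assuming an existing SparkSession named spark (the broadcast-threshold alternative from the list above is shown as a comment):

// Approach #2: fewer, larger reduce partitions for this query
spark.conf.set("spark.sql.shuffle.partitions", 10)

// Approach #1 (alternative): raise the broadcast threshold so the small customer table is broadcast
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)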

Using Unravel to tune Spark data skew and partitioning

In real-life deployments, not all skew problems can be solved by configuration changes and repartitioning; some require modifying the underlying data layout. If the data source itself is skewed, then the tasks that read from it can’t be optimized. Sometimes, at the enterprise level, that is not possible, because the same data source is used by different tools and pipelines.


Sam Lachterman is a Manager of Solutions Engineering at Unravel.

Unravel Principal Engineer Rishitesh Mishra also contributed to this blog.  Take a look at Rishi’s other Unravel blogs:

Why Your Spark Jobs are Slow and Failing; Part I Memory Management

Why Your Spark Jobs are Slow and Failing; Part II Data Skew and Garbage Collection

The post Using Unravel to tune Spark data skew and partitioning appeared first on Unravel.

]]>
https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/feed/ 0
Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail https://www.unraveldata.com/common-failures-slowdowns-part-ii/ https://www.unraveldata.com/common-failures-slowdowns-part-ii/#respond Tue, 09 Apr 2019 05:55:13 +0000 https://www.unraveldata.com/?p=2494 WHY YOUR SPARK APPS ARE SLOW OR FAILING PART II DATA SKEW AND GARBAGE COLLECTION

The second part of our series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many […]

The post Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail appeared first on Unravel.

]]>
WHY YOUR SPARK APPS ARE SLOW OR FAILING PART II DATA SKEW AND GARBAGE COLLECTION

The second part of our series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many performance challenges with Spark, the symptoms increase as the scale of data handled by the application increases.

What is Data Skew?

In an ideal Spark application run, when Spark wants to perform a join, for example, the join keys would be evenly distributed and each partition would get a nicely organized chunk of data to process. However, real business data is rarely so neat and cooperative. We often end up with less than ideal data organization across the Spark cluster, which results in degraded performance due to data skew.

Data skew is not an issue with Spark per se, rather it is a data problem. The cause of the data skew problem is the uneven distribution of the underlying data. Uneven partitioning is sometimes unavoidable in the overall data layout or the nature of the query.

For joins and aggregations, Spark needs to co-locate records of a single key in a single partition. Records for a given key will always be in a single partition; similarly, records for other keys will be distributed across other partitions. If a single partition becomes very large, it causes data skew, which will be problematic for any query engine if no special handling is done.

Dealing with data skew

Data skew problems are more apparent in situations where data needs to be shuffled in an operation such as a join or an aggregation. Shuffle is an operation done by Spark to keep related data (data pertaining to a single key) in a single partition. For this, Spark needs to move data around the cluster. Hence shuffle is considered the most costly operation.

Common symptoms of data skew are:

  • Stuck stages & tasks
  • Low utilization of CPU
  • Out of memory errors

There are several tricks we can employ to deal with the data skew problem in Spark.

Identifying and resolving data skew

Spark users often observe all tasks finish within a reasonable amount of time, only to have one task take forever. In all likelihood, this is an indication that your dataset is skewed.  This behavior also results in the overall underutilization of the cluster. This is especially a problem when running Spark in the cloud, where over-provisioning of  cluster resources is wasteful and costly.

If the skew is at the data source level (e.g. a Hive table is partitioned on a _month key and the table has a lot more records for a particular _month), this will cause skewed processing in the stage that reads from the table. In such a case, restructuring the table with a different partition key (or keys) helps. However, sometimes that is not feasible, as the table might be used by other data pipelines in the enterprise.

In such cases, there are several things that we can do to avoid skewed data processing.

Data Broadcast

If we are doing a join operation on a skewed dataset, one of the tricks is to increase the “spark.sql.autoBroadcastJoinThreshold” value so that smaller tables get broadcast. This should be done only after ensuring there is sufficient driver and executor memory.
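
A minimal sketch of both options, assuming an existing SparkSession named spark; largeDf, smallDf, and the 100 MB threshold are placeholders:

import org.apache.spark.sql.functions.broadcast

// Raise the automatic broadcast threshold (in bytes) so small tables are broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

// Or force a broadcast explicitly when you know one side is small
val joined = largeDf.join(broadcast(smallDf), "join_key")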

Data Preprocess

If there are too many null values in a join or group-by key, they will skew the operation. Try to preprocess the null values with random IDs and handle them in the application.
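
One hedged way to do that, assuming a string join key and a DataFrame named df (both placeholders):

import org.apache.spark.sql.functions._

// Replace null join keys with unique throwaway values so they spread across partitions
// instead of all hashing to the same one; filter them out or ignore them after the join.
val prepped = df.withColumn(
  "join_key",
  when(col("join_key").isNull,
       concat(lit("null_"), monotonically_increasing_id().cast("string")))
    .otherwise(col("join_key"))
)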

Salting

In a SQL join operation, the join key is changed to redistribute data in an even manner so that processing a partition does not take disproportionately more time. This technique is called salting. Let’s take an example to check the outcome of salting. In a join or group-by operation, Spark maps a key to a particular partition ID by computing a hash code on the key and taking it modulo the number of shuffle partitions.

Let’s assume there are two tables with the following schema.

Let’s consider a case where a particular key, e.g. key 1, is heavily skewed, and we want to join both tables and do a grouping to get a count. For example:
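
The original post shows the schema and the query as images; the sketch below is a hedged reconstruction with made-up names and sizes, in which key 1 carries most of the rows of the larger table:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("skew-demo").getOrCreate()

// table1: small table with keys 1..1000
val table1 = spark.range(1, 1001)
  .select(col("id").as("key"), concat(lit("value_"), col("id").cast("string")).as("value"))

// table2: large table in which roughly 90% of the rows carry the skewed key 1
val table2 = spark.range(0, 5000000)
  .select(when(col("id") % 10 =!= 0, lit(1L)).otherwise(col("id") % 1000 + 1).as("key"),
          col("id").as("payload"))

// Join on the skewed key and count per key; the shuffle sends every key = 1 row
// to a single partition, which becomes the long-running task seen in the Spark UI.
val counts = table2.join(table1, "key").groupBy("key").count()
counts.show()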

After the shuffle stage induced by the join operation, all the rows having the same key need to be in the same partition. Look at the above diagram: all the rows of key 1 are in partition 1, and similarly, all the rows with key 2 are in partition 2. It is quite natural that processing partition 1 will take more time, as that partition contains more data. Let’s check Spark’s UI for the shuffle stage run time for the above query.

As we can see, one task took a lot more time than the others. With more data the difference would be even more significant. This might also cause application instability in terms of memory usage, as one partition would be heavily loaded.

Can we add something to the data so that our dataset will be more evenly distributed? Most users with a skew problem use the salting technique. Salting is a technique where we add random values to the join key of one of the tables. In the other table, we need to replicate the rows to match the random keys. The idea is that if the join condition is satisfied by key1 == key1, it should also be satisfied by key1_<salt> == key1_<salt>. The salt value will help the dataset become more evenly distributed.

Here is an example of how to do that in our use case. Note the number 20, used both in the random function and when exploding the dataset; this is the number of distinct divisions we want for our skewed key. This is a very basic example and can be improved to salt only the keys that are skewed.
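
Again a reconstruction rather than the post’s own snippet, continuing the hypothetical table1/table2 from the sketch above; 20 is the salt range:

import org.apache.spark.sql.functions._

// Salt the skewed side: append a random suffix 0..19 to the key
val saltedBig = table2
  .withColumn("salted_key", concat(col("key").cast("string"), lit("_"),
                                   (rand() * 20).cast("int").cast("string")))

// Replicate the other side 20 times, once per possible salt value
val saltedSmall = table1
  .withColumn("salt", explode(array((0 until 20).map(lit): _*)))
  .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt").cast("string")))
  .drop("key", "salt")

// The join now spreads the former key = 1 rows across 20 partitions;
// grouping by the original key still yields the same counts as before.
val saltedCounts = saltedBig.join(saltedSmall, "salted_key").groupBy("key").count()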

Now let’s check the Spark UI again. As we can see, processing time is much more even now.

Note that for smaller data the performance difference won’t be as pronounced. Shuffle compression also sometimes plays a role in the overall runtime: for skewed data, the shuffled data can be compressed heavily due to its repetitive nature, so the overall disk I/O and network transfer are also reduced. We need to run our app both without salt and with salt to decide which approach best fits our case.

Garbage Collection

Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java’s memory management and garbage collection (GC). Garbage collection can therefore be a major issue affecting many Spark applications.

Common symptoms of excessive GC in Spark are:

  • Slowness of application
  • Executor heartbeat timeout
  • GC overhead limit exceeded error

Spark’s memory-centric approach and data-intensive applications make GC a more common issue than in other Java applications. Thankfully, it’s easy to diagnose whether your Spark application is suffering from a GC problem. The Spark UI marks executors in red if they have spent too much time doing GC.

Whether Spark executors are spending a significant share of CPU cycles on garbage collection can be determined by looking at the “Executors” tab in the Spark application UI. Spark will mark an executor in red if it has spent more than 10% of its task time in garbage collection, as you can see in the diagram below.

The Spark UI indicates excessive GC in red

Addressing garbage collection issues

Here are some of the basic things we can do to try to address GC issues.

Data structures

If using RDD based applications, use data structures with fewer objects. For example, use an array instead of a list.

Specialized data structures

If you are dealing with primitive data types, consider using specialized data structures like Koloboke or fastutil. These structures optimize memory usage for primitive types.

Storing data off-heap

The Spark execution engine and Spark storage can both store data off-heap. You can switch on off-heap storage using

  • --conf spark.memory.offHeap.enabled=true
  • --conf spark.memory.offHeap.size=Xg

Be careful when using off-heap storage, as it does not affect the on-heap memory size, i.e. it won’t shrink the heap. So, to stay within an overall memory limit, assign a smaller heap size.
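
A minimal sketch of switching this on, assuming you build the session yourself; these are static settings that must be in place before the SparkContext starts, and the 4g size is an arbitrary example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")   // remember to shrink the executor heap accordingly
  .getOrCreate()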

Built-in vs User Defined Functions (UDFs)

If you are using Spark SQL, try to use the built-in functions as much as possible rather than writing new UDFs. Most of Spark’s built-in functions can work directly on UnsafeRow and don’t need to convert data to wrapper types. This avoids creating garbage, and it also plays well with code generation.
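
A small, hypothetical illustration of that advice (df and the name column are placeholders):

import org.apache.spark.sql.functions._

// Avoid: a handwritten UDF deserializes each value into a JVM String and creates per-row garbage
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val viaUdf = df.withColumn("name_uc", upperUdf(col("name")))

// Prefer: the built-in upper() works on Spark's internal row format and benefits from codegen
val viaBuiltin = df.withColumn("name_uc", upper(col("name")))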

Be stingy about object creation

Remember that we may be working with billions of rows. If we create even a small temporary object of 100 bytes for each row, that is 1 billion * 100 bytes, or roughly 100 GB, of garbage.

End of Part II

So far, we have focused on memory management, data skew, and garbage collection as causes of slowdowns and failures in your Spark applications. For Part III of the series, we will turn our attention to resource management and cluster configuration where issues such as data locality, IO-bound workloads, partitioning, and parallelism can cause some real headaches unless you have good visibility and intelligence about your data runtime.

If you found this blog useful, you may wish to view Part I of this series Why Your Spark Apps are Slow or Failing: Part I Memory Management. Also see our blog Spark Troubleshooting, Part 1 – Ten Challenges.

The post Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail appeared first on Unravel.

]]>
https://www.unraveldata.com/common-failures-slowdowns-part-ii/feed/ 0
Reduce Apache Spark Troubleshooting Time from Days to Seconds https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/ https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/#respond Fri, 25 Jan 2019 06:53:27 +0000 https://www.unraveldata.com/?p=1427

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, […]

The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

]]>

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.

Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure in these messy, raw, and distributed logs is hard for Spark experts, and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing insights to Spark users like what is shown in Figure 1.

Figure 1: Insights from automatic root cause analysis improve Spark user productivity

Spark platform providers like the Amazon, Azure, Databricks, and Google clouds, as well as Application Performance Management (APM) solution providers like Unravel, have access to a large and growing dataset of logs from millions of Spark application failures. This dataset is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails [1]. Such actionable root-cause identification improves the productivity of Spark users significantly.

Clues in the logs

A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application, as well as how the application interacts with the rest of the Spark platform. These logs form the key dataset that Spark users scan for clues to understand why an application failed.

However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals makes things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from the Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.

Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:

  • Continuously collecting logs from a variety of Spark application failures
  • Converting logs into feature vectors
  • Learning a predictive model for RCA from these feature vectors

Of course, as with any intelligent solution that uses AI and ML techniques, the devil is in the details!

Figure 2: Root cause analysis of Spark application failures

Data collection for training: As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.

Structured versus unstructured data: Logs are mostly unstructured data. To keep the accuracy of model predictions to a high level in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details of Scala, Hadoop, OS, and so on.

Labels: ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:

  • Configuration errors
  • Deployment errors
  • Resource errors
  • Data errors
  • Application errors
  • Unknown factors

The leaves of this taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand (like the message in Figure 1), and (b) recommended fixes for this root cause. We will cover the root-cause taxonomy in a future blog.

The labels are associated with the logs in one of two ways. First, the root cause is already known when the logs are generated, as a result of injecting a specific root cause we have designed to produce an application failure in our lab framework. The second way in which a label is given to the logs for an application failure is when a Spark domain expert manually diagnoses the root cause of the failure.

Input Features: Once the logs are available, there are various ways in which the feature vector can be extracted from these logs. One way is to transform the logs into a bit vector (e.g., 1001100001). Each bit in this vector represents whether a specific message template is present in the respective logs. A prerequisite to this approach is to extract all possible message templates from the logs. A more traditional approach for feature vectors from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is mostly similar to the bit vector approach except for a couple of differences: (i) each bit in the vector now corresponds to a word instead of a message template, and (ii) instead of 0’s and 1’s, it is more common to use numeric values generated using techniques like TF-IDF.

More recent advances in ML have popularized vector embeddings. In particular, we use the doc2vec technique [2]. At a high level, these vector embeddings map words (or paragraphs, or entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The doc2vec technique uses a 3-layer neural network to gauge the context of the document and relate similar content together.

Once the feature vectors are generated along with the label, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow as well as deep learning techniques including random forests, support vector machines, Bayesian classifiers, and neural networks.
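
As a hedged, minimal sketch of the kind of pipeline described above, using Spark MLlib with a bag-of-words/TF-IDF representation and a random forest; the column names and input path are placeholders, not Unravel’s actual implementation:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rca-training").getOrCreate()
val training = spark.read.parquet("/data/labeled_failure_logs")   // columns: log_text, root_cause

val tokenizer = new Tokenizer().setInputCol("log_text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("raw_features").setNumFeatures(1 << 18)
val idf = new IDF().setInputCol("raw_features").setOutputCol("features")
val labeler = new StringIndexer().setInputCol("root_cause").setOutputCol("label")
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

// Fit the whole chain; the resulting model maps new failure logs to predicted root-cause labels
val model = new Pipeline().setStages(Array(tokenizer, tf, idf, labeler, rf)).fit(training)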

Conclusion

The overall results produced by our solution are very promising and will be presented at Strata 2017 in New York. We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the root cause predicted by the model in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The bottleneck currently is in generating labels. We are working on active learning techniques [3] that nicely prioritize the human efforts required in generating labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information to build an accurate model. The expert labels these instances and then the predictive model is rebuilt.

Manual failure diagnosis in Spark is not only time-consuming but highly challenging due to correlated failures that can occur simultaneously. Our unique RCA solution enables the diagnosis process to function effectively even in the presence of multiple concurrent failures, as well as noisy data prevalent in production environments. Through automated failure diagnosis, we remove the burden of manually troubleshooting failed applications from the hands of Spark users and developers, enabling them to focus entirely on solving business problems with Spark.

To learn more about how to run Spark in production reliably, contact us.

[1] S. Duan, S. Babu, and K. Munagala, “Fa: A System for Automating Failure Diagnosis”, International Conference on Data Engineering, 2009

[2] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents”, International Conference on Machine Learning, 2014

[3] S. Duan and S. Babu, “Guided Problem Diagnosis through Active Learning“, International Conference on Autonomic Computing, 2008

The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/feed/ 0
How to Gain 5x Performance for Hadoop and Apache Spark Jobs https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/ https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/#respond Fri, 25 Jan 2019 06:39:00 +0000 https://www.unraveldata.com/?p=1408

Improve performance of Hadoop jobs Unravel Data can help improve the performance of your Spark and Hadoop jobs by up to 5x. The Unravel platform can help drive this performance improvement in several ways, including: Save […]

The post How to Gain 5x Performance for Hadoop and Apache Spark Jobs appeared first on Unravel.

]]>

Improve performance of Hadoop jobs

Unravel Data can help improve the performance of your Spark and Hadoop jobs by up to 5x. The Unravel platform can help drive this performance improvement in several ways, including:

  • Saving resources by reducing the number of tasks for queries
  • Drastically speeding up queries
  • Removing the reliance on default configurations
  • Recommending optimized configurations that minimize the number of tasks
  • Automatically detecting problems and suggesting optimal values

Let’s look at a real-world example from an Unravel customer.

Too many tasks, too many problems

We recently heard from a customer who was experiencing performance issues; by using Unravel, he was able to speed up his query from 41 minutes to approximately 8 minutes. He did this by using our automatic inefficiency diagnoses and implementing the suggested fixes.

The Unravel platform identified that a number of this customer’s queries were using way more tasks than were needed. This problem is not unique to his application.

When running big data applications, we often find the following common pitfalls:

  • Using too many tasks for a query, which increases overhead
  • Increased wait time for resource allocation
  • Increased query duration
  • Resource contention at the cluster level, which affects concurrent queries

The result after fine-tuning the Hadoop jobs with Unravel

In this case, the customer was able to fine-tune their queries based on Unravel’s suggestions and see a 5x speed improvement by using only one-tenth of the original number of tasks.

Let’s look at this example in more detail to review the problems facing this customer and how Unravel helped resolve them. Our examples will be Hive with MapReduce applications running on a YARN cluster, as this is what the customer was using, but the problem is applicable to other types of YARN applications, such as Spark and Hive with Tez, which Unravel also supports.

Customer Experience

The Hive query in this customer example took 40 minutes on average to complete, and could take longer (more than 1 hour) or less (about 29 minutes) depending on how busy the cluster was. Note that he did not change any YARN, Hive, or MapReduce configurations for these runs, so the cluster’s default settings were used.

After installing Unravel, he saw that for this query, Unravel automatically detected that it was using too many map and reduce tasks (Figures 1 and 2). Moreover, it provided actionable suggestions to address the inefficiency: it specified new values for the configurations mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize, and hive.exec.reducers.bytes.per.reducer that would result in fewer map and reduce tasks in a future run.
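
The recommendations above are Hive-on-MapReduce settings. For Spark SQL jobs, which suffer from the same too-many-tasks pattern, the roughly analogous knobs could be set like this; the values are purely illustrative and not the customer’s actual recommendations (spark is an existing SparkSession):

// Illustrative only; tune to your data sizes rather than copying these numbers
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)   // larger input splits, fewer scan tasks
spark.conf.set("spark.sql.shuffle.partitions", 64)                        // fewer, larger shuffle (reduce-side) tasks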

Significant improvements were seen immediately after using the new configurations. As shown in Q1 in Table 1, the number of tasks used by the query dropped from 4368 to 407, while query duration went down from 41 minutes 18 seconds to 8 minutes 2 seconds. Three other queries in this workload were also flagged by Unravel for using too many tasks. The user proceeded to use the settings recommended for these queries and got significant improvements in both resource efficiency and speedup.

Q2 to Q4 in Table 1 show the results. We could see that for all of the queries, the new runs used approximately one-tenth of the original number of tasks, but achieved a query speedup of 2 to 5X. It is also worth noting that the significant reduction in tasks by these queries freed up resources for other jobs to run on the cluster, thereby improving the overall throughput of the cluster.

Figure 1. Query used too many mappers. Unravel shows that this query has one job using too many map tasks, and provides a recommendation for the configurations mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to use fewer number of map tasks in the next run.

Figure 2. Query used too many reducers. Unravel shows that this query has one job using too many reduce tasks, and provides a recommendation for the configuration hive.exec.reducers.bytes.per.reducer to use fewer reduce tasks in the next run.

Table 1. The duration and tasks used before and after using Unravel’s suggested configurations.

Detecting and Mitigating Too Many Tasks in Spark and Hadoop jobs

Unravel analyzes every application running on the cluster for various types of inefficiencies. To determine whether a query is using too many tasks, Unravel checks your Spark and Hadoop jobs to see whether a large number of their tasks each process too little data.

As shown in Figures 1 and 2, Unravel explained that a majority of the tasks ran for too little time (average map time is 44 seconds and average reduce computation time is 1 second). Figure 3 provides another view of the symptom. We see that all 3265 map tasks finished in less than 2 minutes, and most of them processed between 200 and 250 MB of data. On the reduce side, more than 1000 reduce tasks finished in less than 1.5 minutes, and most of them processed less than 100 KB of data each.

Upon detecting the inefficiency, Unravel automatically computes new configuration values that users can apply to get fewer tasks in a future run. In contrast to Figure 3, Figure 4 shows the same histograms for the run with the new configurations. This time there were only 395 map tasks, an 8X reduction, and most of them took longer than 2 minutes (though all of them still finished within 3.5 minutes) and processed between 2 and 2.5 GB of data. On the reduce side, only 8 reduce tasks were used, and each processed between 500 and 800 KB. Most importantly, the job duration went from 39 minutes 37 seconds to 7 minutes 11 seconds.

Figure 3. Histograms showing the distribution of map and reduce task duration, input, and output size. More than 3000 map tasks and more than 1000 reduce tasks are allocated for this job, which took almost 40 minutes, but each task runs for very little time.

Figure 4. Drastic reduction in job duration with fewer tasks processing more data per task. Fewer than 400 map tasks and only 8 reduce tasks were allocated to do the same amount of work in the new run. The job finished in just over 7 minutes.

How Using Fewer Tasks Improves Query Speedup for Spark and Hadoop jobs

Several reasons explain why using fewer tasks resulted in shorter duration, not longer as many might expect. First, in the original run, a larger number of tasks were allocated but each task processed very little data. Allocating and starting a task incurs an overhead, but this overhead is about the same whether a task ends up processing 100KB or 1GB of data. Therefore, using 1000 tasks to process 100GB, with each task processing 0.1GB, would have a much larger overhead than doing the same with 100 tasks, with each task processing 1GB.

Second, there is a limit to how many containers an application can get at a time. Not only does the cluster have limited resources, but also it is typically multi-tenant, so concurrent applications have to share (fight for) containers – recall that the query with the default configuration could vary in duration from 29 minutes to more than an hour depending on how busy the cluster was.

Suppose a query needs 200 containers but only 100 are available at the time. The query will get the 100 containers, and will wait till resources become available for the other 100. Often, this means it will have to wait until its first 100 containers are finished, freed, and allocated again for the second “wave” of the 100 containers. Therefore, requesting fewer tasks means that it is faster to get all the tasks needed.

Third, the default configurations often result in too many tasks for many applications. Unfortunately, most users submit applications with the default configurations because they may not know that resources can be tuned, or which parameters to tune. Savvier users who know which parameters to tune may try to change them, often arbitrarily and through trial and error, but this process takes a long time because they do not know what values to set.

This is where Unravel shines: in addition to detecting an inefficiency, it can tell users which configurations to change, and what values to set, to fix it. In fact, besides the post-query analysis that I discussed in this post, Unravel can detect issues while the query is still running, and users can configure rules to automatically take predefined actions when a problem is detected. We will discuss that further in a future post.

The post How to Gain 5x Performance for Hadoop and Apache Spark Jobs appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/feed/ 0
Unravel Data For Spark Datasheet https://www.unraveldata.com/resources/unravel-for-spark-datasheet/ https://www.unraveldata.com/resources/unravel-for-spark-datasheet/#respond Tue, 20 Nov 2018 02:55:14 +0000 https://www.unraveldata.com/?p=5177

Thank you for your interest in the Unravel Data for Spark Datasheet. You can download it here.

The post Unravel Data For Spark Datasheet appeared first on Unravel.

]]>

Thank you for your interest in the Unravel Data for Spark Datasheet.

You can download it here.

The post Unravel Data For Spark Datasheet appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-for-spark-datasheet/feed/ 0
Unravel Data Solution Brief Spark https://www.unraveldata.com/resources/unravel-solution-brief-spark/ https://www.unraveldata.com/resources/unravel-solution-brief-spark/#respond Sat, 10 Nov 2018 03:50:54 +0000 https://www.unraveldata.com/?p=5219 digital code background

Thank you for your interest in the Unravel Data Solution Brief Spark. You can download it here.  

The post Unravel Data Solution Brief Spark appeared first on Unravel.

]]>
digital code background

Thank you for your interest in the Unravel Data Solution Brief Spark.

You can download it here.

 

The post Unravel Data Solution Brief Spark appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/unravel-solution-brief-spark/feed/ 0