Amazon EMR Archives - Unravel
https://www.unraveldata.com/resources/emr/

Data Observability for AWS Datasheet
https://www.unraveldata.com/resources/unravel-for-aws-datasheet/ | Thu, 09 Feb 2023


AI-DRIVEN DATA OBSERVABILITY + FINOPS FOR AWS

Performance. Reliability. Cost-effectiveness.

Unravel’s automated, AI-powered data observability + FinOps platform for AWS and other modern data stacks provides 360° visibility to allocate costs with granular precision, accurately predict spend, run 50% more workloads at the same budget, launch new apps 3X faster, and reliably hit greater than 99% of SLAs.

With Unravel Data Observability + FinOps for AWS you can:

  • Launch new apps 3X faster: End-to-end observability of data-native applications and pipelines. Automatic improvement of performance, cost efficiency, and reliability.
  • Run 50% more workloads for same budget: Break down spend and forecast accurately. Optimize apps and platforms by eliminating inefficiencies. Set guardrails and automate governance. Unravel’s AI helps you implement observability and FinOps to ensure you achieve efficiency goals.
  • Reduce firefighting time by 99% using AI-enabled troubleshooting: Detect anomalies, drift, skew, missing and incomplete data end-to-end. Integrate with multiple data quality solutions. All in one place.
  • Forecast budget with ±10% accuracy: Accurately anticipate cloud data spending for more predictable ROI. Unravel helps you accurately forecast spending with granular cost allocation. Purpose-built AI, at job, user, and workgroup levels, enables real-time visibility of ongoing usage.

To see Unravel Data for AWS in action, contact: Data experts | 650 741-3442

Amazon EMR cost optimization and governance
https://www.unraveldata.com/resources/amazon-emr-cost-optimization-and-governance/ | Thu, 04 Aug 2022


Dozens of AWS cost optimization tools exist today. Here’s the one purpose-built for AWS EMR: begin monitoring immediately to gain control of your AWS EMR costs and continuously optimize resource performance.

What is Amazon EMR (Elastic MapReduce)?

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.

Depending on the workload and application type, EMR can process huge amounts of data using EC2 instances running the Hadoop Distributed File System (HDFS) and EMRFS, which is backed by Amazon S3. Depending on the workload, these EC2 instances can be configured with any instance type and purchased on-demand and/or on the spot market.
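As a rough illustration of mixing purchase options (assuming boto3 with EMR permissions; the cluster name, S3 log bucket, and instance counts below are placeholders, not recommendations), a cluster can keep the master and core nodes on-demand while running task nodes on spot capacity:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small cluster: on-demand master/core nodes, spot task nodes (all values illustrative).
response = emr.run_job_flow(
    Name="example-etl-cluster",              # placeholder name
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-bucket/emr-logs/",  # placeholder bucket
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```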

AWS EMR is a great platform, but as more and more workloads get added to it, understanding pricing can become a challenge. It’s hard to govern the cost and easy to lose track of parts of your monthly spend. In this article, we share tips for governing and optimizing your AWS EMR costs and resources.

Amazon EMR costs

With so many choices for instance types and cluster configuration, understanding EMR pricing can become cumbersome. And because the EMR service inherently uses other AWS services (EC2, EBS, and others), the usage cost of all of those services is also factored into your bill.

Best practices for optimizing the cost of your AWS EMR cluster

Here is a list of best practices and techniques for monitoring and optimizing the cost of your EMR cluster:

1. Always tag your resources
A tag is a label consisting of a key-value pair that lets you assign metadata to your AWS resources, making it easy to manage, identify, organize, search for, and filter them. It is therefore important to use meaningful, purpose-built tags.
For example, create tags to categorize resources by purpose, owner, department, or other criteria, as shown below.

EMR Cluster View
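Tags can also be applied programmatically so every cluster carries cost-allocation metadata from the start. A minimal sketch, assuming boto3 and a placeholder cluster ID and tag values:

```python
import boto3

emr = boto3.client("emr")

# Attach cost-allocation tags to an existing cluster (cluster ID and tag values are placeholders).
emr.add_tags(
    ResourceId="j-XXXXXXXXXXXXX",
    Tags=[
        {"Key": "owner", "Value": "data-platform-team"},
        {"Key": "department", "Value": "marketing-analytics"},
        {"Key": "purpose", "Value": "nightly-etl"},
    ],
)
```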

2. Pick the right cluster type
AWS EMR offers two cluster types – permanent and transient.

For transient clusters, the compute unit is decoupled from the storage. HDFS on local storage is best used for caching the intermediate results and EMRFS is the final destination for storing persistent data in AWS S3. Once the computation is done and the results are stored safely in AWS S3, the resources on transient clusters can be reclaimed.

For permanent clusters, the data in HDFS is stored on EBS volumes and cannot easily be shared outside of the cluster. Small-file issues in the Hadoop NameNode will still be present, just as in on-premises Hadoop clusters.

3. Size your cluster appropriately
You absolutely want to avoid undersized or oversized clusters. The EMR platform provides auto-scaling capabilities, but it is important to right-size your clusters first to avoid higher costs and inefficient workload execution. To anticipate these issues, you can calculate the number and type of nodes that your workloads will need.
Master node: Computational requirements for this node are low, so a single node is sufficient.
Core node: These nodes process data and store it in HDFS, so it is important to size them correctly. Following Amazon’s guiding principle, multiply the number of nodes by the EBS storage capacity of each node to get total HDFS capacity.

For example, if you define 10 core nodes to process 1 TB of data, and you have selected m5.xlarge instance type (with 64 GiB of EBS storage), you have 10 nodes*64 GiB, or 640 GiB of capacity. Based on the HDFS replication factor of three, your data size is replicated three times in the nodes, so 1 TB of data requires a capacity of 3 TB.

For the above scenario, you have two options:
a. Since you have only 640 GiB, to run your workload optimally you must increase the number of nodes or change the instance type until you have a capacity of 3 TB.
b. Alternatively, switching from m5.xlarge to m5.4xlarge instances (256 GiB of EBS storage each) and selecting 12 instances provides enough capacity.

12 instances * 256 GiB = 3072 GiB = 3.29 TB available

Task node: These nodes only run tasks and do not store data, so to calculate the number of task nodes you only need to estimate memory usage. Because this memory capacity can be distributed between the core and task nodes, you can calculate the number of task nodes by subtracting the core node memory from the total requirement.

Per Amazon’s best-practice guidelines, you multiply the memory needed by three.
For example, suppose you have 28 processes of 20 GiB each; your total memory requirement would then be:

3*28 processes*20 GiB of memory = 1680 GiB of memory

For this example, your core nodes have 64 GiB of memory (m5.4xlarge instances), and your task nodes have 32 GiB of memory (m5.2xlarge instances). Your core nodes provide 64 GiB * 12 nodes = 768 GiB of memory, which is not enough in this example. To find the shortage, subtract this memory from the total memory required: 1680 GiB – 768 GiB = 912 GiB. You can set up the task nodes to provide the remaining 912 GiB of memory. Divide the shortage memory by the instance type memory to obtain the number of task nodes needed.

912 GiB / 32 GiB = 28.5 task nodes
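The arithmetic above is easy to script. The sketch below is illustrative only; it reuses the replication factor of three, the memory multiplier of three, and the example instance sizes from this section:

```python
import math

HDFS_REPLICATION = 3   # data is stored three times in HDFS
MEMORY_MULTIPLIER = 3  # per the guideline above, multiply the memory needed by three

def core_nodes_needed(data_tib, ebs_per_node_gib):
    """Core nodes needed so HDFS can hold the data at the replication factor."""
    required_gib = data_tib * 1024 * HDFS_REPLICATION
    return math.ceil(required_gib / ebs_per_node_gib)

def task_nodes_needed(processes, gib_per_process, core_nodes, core_mem_gib, task_mem_gib):
    """Task nodes needed to cover memory not already provided by the core nodes."""
    total_mem_gib = MEMORY_MULTIPLIER * processes * gib_per_process  # 3 * 28 * 20 = 1680
    shortage_gib = max(total_mem_gib - core_nodes * core_mem_gib, 0) # 1680 - 768 = 912
    return math.ceil(shortage_gib / task_mem_gib)                    # 912 / 32 = 28.5 -> 29

print(core_nodes_needed(1, 256))              # 12 m5.4xlarge core nodes for 1 TiB of data
print(task_nodes_needed(28, 20, 12, 64, 32))  # 29 m5.2xlarge task nodes
```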
4. Based on your workload size, always pick the right instance type and size

Network Hardware Summary

The task node example from the previous section shows why this matters: with m5.4xlarge core nodes (64 GiB of memory each) and m5.2xlarge task nodes (32 GiB each), 12 core nodes cover only 768 GiB of the 1680 GiB memory requirement, and the remaining 912 GiB works out to 912 GiB / 32 GiB = 28.5 task nodes (round up to 29). Picking an instance type whose memory, CPU, and network profile matches the workload determines both how many nodes you need and what that capacity costs.

5. Use autoscaling as needed
Based on your workload size, Amazon EMR can programmatically scale out applications like Apache Spark and Apache Hive to utilize additional nodes for increased performance and scale in the number of nodes in your cluster to save costs when utilization is low.

For example, you can set minimum and maximum capacity limits, an on-demand limit, and a maximum core node count so the cluster dynamically scales up and down based on the running workload.

Cluster Scaling Policy
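These limits correspond to EMR managed scaling’s compute limits, which can also be set programmatically. A minimal sketch with a placeholder cluster ID and illustrative limit values:

```python
import boto3

emr = boto3.client("emr")

# Let EMR scale the cluster between 2 and 20 instances, capping on-demand and core capacity.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 20,
            "MaximumOnDemandCapacityUnits": 10,  # capacity beyond this can come from spot
            "MaximumCoreCapacityUnits": 4,       # additional capacity is added as task nodes
        }
    },
)
```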

6. Always have a cluster termination policy set
When you add an auto-termination policy to a cluster, you specify the amount of idle time after which the cluster should automatically shut down. This ability allows you to orchestrate cluster cleanup without the need to monitor and manually terminate unused clusters.

Auto Termination

You can attach an auto-termination policy when you create a cluster, or add a policy to an existing cluster. To change or disable auto-termination, you can update or remove the policy.
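For example, a policy that terminates the cluster after an hour of idle time might look like the following boto3 sketch (placeholder cluster ID; the timeout value is illustrative):

```python
import boto3

emr = boto3.client("emr")

# Terminate the cluster automatically after 3600 seconds (1 hour) of idle time.
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},
)

# To change the timeout, call put_auto_termination_policy again with a new value;
# to disable it, call emr.remove_auto_termination_policy(ClusterId="j-XXXXXXXXXXXXX").
```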

7. Monitor cost with Cost Explorer

AWS Cost Management View

To manage and meet your costs within your budget, you need to diligently monitor your costs. One tool that AWS offers you here is AWS Cost Explorer, which allows you to visualize, understand, and manage your AWS costs and usage over time.

With Cost Explorer, you can build custom applications to directly access and query your cost and usage data, and build interactive, ad hoc analytics reports at daily or monthly granularity. You can even create a forecast by selecting a future time range for your reports, estimate your AWS bill, and set alarms and budgets based on predictions.
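The same cost data can be pulled programmatically through the Cost Explorer API. A rough sketch, assuming Cost Explorer is enabled on the account and using a placeholder date range:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Monthly unblended cost for Amazon EMR over an example date range.
# "Amazon Elastic MapReduce" is the service name as it typically appears in billing data;
# verify the exact string in your own account.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-08-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic MapReduce"]}},
)

for period in response["ResultsByTime"]:
    cost = period["Total"]["UnblendedCost"]
    print(period["TimePeriod"]["Start"], cost["Amount"], cost["Unit"])
```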

Unravel can help!

Without doubt, AWS helps you manage your EMR clusters and their costs through the approaches listed above, and Cost Explorer is a great tool for monitoring your monthly bill. However, it all comes at a price: you have to spend precious time checking and monitoring things manually, or writing custom scripts to fetch the data and then running that data by your data science and finance ops teams for detailed analysis.

Further, the cost data Cost Explorer provides for your EMR clusters is not real-time (there is roughly a 24-hour delay), and it is difficult to assess your EMR cluster costs alongside the costs of the other services involved. There is a better solution available today: Unravel Data’s DataOps observability product frees you from worrying about EMR cluster management and costs by giving you a real-time, holistic, and fully automated way to manage your clusters.

AWS EMR Cost Management is made easy with Unravel Data

Although AWS and other companies offer many tools to manage your EMR cluster costs, Unravel stands out by providing a single pane of glass and ease of use.

Unravel provides automated observability for your modern data stack.

Unravel’s purpose-built observability for modern data stacks helps you stop firefighting issues, control costs, and run faster data pipelines, all monitored and observed via a single pane of glass.

One unique value that Unravel provides is real-time chargeback detail for EMR clusters, with a cost breakdown across the underlying services (EMR, EC2, and EBS volumes) for each configured AWS account. In addition, you get a holistic view of your cluster covering resource utilization, chargeback, and instance health, along with AI-driven cluster cost-saving recommendations.

AWS EMR Monitoring with Unravel’s DataOps Observability

Unravel 4.7.4 can holistically monitor your EMR cluster. It collects and monitors a range of data points for various KPIs and metrics, from which it builds a knowledge base used to derive resource and cost-saving insights and recommendations.

AWS EMR chargeback and showback

The image below shows the cost breakdown for the EMR, EC2, and EBS services.

Chargeback Report EMR

Monitoring AWS EMR cost trends

For your EMR cluster usage, it is important to see how costs are trending based on your usage and workload size. Unravel helps you understand your costs via its chargeback page. Our agents constantly fetch the relevant metrics used to analyze cluster cost usage and resource utilization, giving you an instant chargeback view in real time. These collected metrics are then fed into our AI engine to generate recommended insights.

The image below shows the cost trends per cluster type, average costs, and total costs.

Cluster Total Cost Diagrams

EMR Diagram

Complete AWS EMR monitoring insights

Unravel Insights Screenshot

As seen in the image above, Unravel has analyzed the resource utilization (both memory and CPU) of the clusters and the instance types configured for your cluster. Further, based on the size of the executed workloads, Unravel has produced a set of recommendations to help you save costs by downsizing your node instance types.

Do you want to lower your AWS EMR cost?

Avoid overspending on AWS EMR. If you’re not sure how to lower your AWS EMR cost, or simply don’t have time, Unravel’s DataOps observability tool can help you save.
Schedule a free consultation.
Create your free account today.
Watch video: 5 Best Practices for Optimizing Your Big Data Costs With Amazon EMR

References
Amazon EMR | AWS Tags | Amazon EC2 instance Types | AWS Cost Explorer | Unravel | Unravel EMR Chargeback | Unravel EMR Insights

Roundtable Recap: DataOps Just Wanna Have Fun
https://www.unraveldata.com/resources/roundtable-recap-dataops-just-wanna-have-fun/ | Fri, 06 May 2022


We like to keep things light at Unravel. In a recent event, we hosted a group of industry experts for a night of laughs and drinks as we discussed cloud migration and heard from our friends at Don’t Tell Comedy.

Unravel VP of Solutions Engineering Chris Santiago and AWS Sr. Worldwide Business Development Manager for Analytics Kiran Guduguntla moderated a discussion with data professionals from Black Knight, TJX Companies, AT&T Systems, Georgia Pacific, and IBM, among others.

This post briefly recaps that discussion.

The cloud journey

To start, Chris asked attendees where they were in their cloud journey. The top responses were tied, with hybrid cloud and on-prem the most popular. Following those were cloud-native, cloud-enabled, and multi-cloud workloads.

Kiran, who focuses primarily on migrations to EMR, responded to these results noting that he wasn’t surprised. The industry has seen significant churn in the past few years, especially people moving from Hadoop. As clients move to the cloud, EMR continues to lead the pack as a top choice for users.

Migration goals

As a follow-up question, we conducted a second poll to learn about the migration goals of our attendees. Are they prioritizing cost optimization? Seeking greater visibility? Boosting performance? Or are they looking for ways to better automate and decrease time to resolution?

Unsurprisingly, the number one migration goal was reducing and optimizing resource usage. Cost is king. Kiran explained the results of an IDC study that followed users as they migrated their workloads from on-premises environments to EMR. The study found that customers saved about 57%, and the ROI over five years rose to 350%.

He emphasized that cost isn’t the only benefit of migration from on-prem to the cloud. The shift allows for better management, reduced administration, and better performance. Customers can run their workloads two or three times faster because of the optimization included in EMR frameworks.


Data security and privacy in the cloud

One attendee moved the conversation along by raising the questions many clients are asking: How can I be sure of data security? Their priority is meeting regulatory compliance and taking every step to ensure they aren’t hacked. The main concern is not how to use the cloud, but how to secure the cloud.

Kiran agreed, emphasizing that security is paramount at AWS. He explained the security features AWS implements to promote data security:

1. Data encryption

AWS encrypts data either while in S3 or as it’s in motion to S3.

2. Access control

Fine-grained access control using Lake Formation, combined with robust audit controls, limits data access.

3. Compliance

AWS meets every major compliance requirement, including GDPR.

He continued, noting that making these features available is good, but it is essential to architect them to meet the user’s or clients’ particular requirements.

 Interested in learning more about Unravel for EMR migrations? Start here.

Webinar Recap: Functional strategies for migrating from Hadoop to AWS
https://www.unraveldata.com/resources/webinar-recap-functional-strategies-for-migrating-from-hadoop-to-aws/ | Thu, 21 Apr 2022


In a recent webinar, Functional (& Funny) Strategies for Modern Data Architecture, we combined comedy and practical strategies for migrating from Hadoop to AWS. 

Unravel Co-Founder and CTO Shivnath Babu moderated a discussion with AWS Principal Architect, Global Specialty Practice, Dipankar Ghosal and WANdisco CTO Paul Scott-Murphy. Here are some of the key takeaways from the event.

Hadoop migration challenges

Business computing workloads are moving to the cloud en masse to achieve greater business agility, gain access to modern technology, and reduce operational costs. But identifying what you have running, understanding how it all works together, and mapping it to a cloud topology is extremely difficult when you have hundreds, if not thousands, of data pipelines and are dealing with tens of thousands of data sets.

Hadoop-to-AWS migration poll

We asked attendees what their top challenges have been during cloud migration and how they would describe the current state of their cloud journey. Not surprisingly, the complexity of their environment was the #1 challenge (71%), followed by the “talent gap” (finding people with the right skills).

The unfortunate truth is that most cloud migrations run over time and over budget. However, when done right, moving to the cloud can realize spectacular results.

How AWS approaches migration

Dipankar talked about translating a data strategy into a data architecture, emphasizing that the data architecture must be forward-looking: it must be able to scale in terms of both size and complexity, with the flexibility to accommodate polyglot access. That’s why AWS does not recommend a one-size-fits-all approach, as it eventually leads to compromise. With that in mind, he talked about different strategies for migration.

Hadoop to AWS reference architecture

Hadoop to Amazon EMR Migration Reference Architecture

He recommends a data-first strategy for complex environments where it’s a challenge to find the right owners to define why the system is in place. Plus, he said, “At the same time, it gives the business the data availability on the cloud, so that they can start consuming the data right away.”

The other approach is a workload-first strategy, which is favored when migrating a relatively specialized part of the business that needs to be refactored (e.g., Pig to Spark).

He wrapped up with a process built around different “swim lanes,” where every persona has skin in the game for a migration to be successful.

Why a data-first strategy?

Paul followed up with a deeper dive into a data-first strategy. Specifically, he pointed out that in a cloud migration, “people are unfamiliar with what it takes to move their data at scale to the cloud. They’re typically doing this for the first time, it’s a novel experience for them. So traditional approaches to copying data, moving data, or planning a migration between environments may not be applicable.” The traditional lift-and-shift application-first approach is not well suited to the type of architecture in a big data migration to the cloud.

a data-first approach to cloud migration

Paul said that the WANdisco data-first strategy looks at things from three perspectives:

  • Performance: Obviously moving data to the cloud faster is important so you can start taking advantage of a platform like AWS sooner. You need technology that supports the migration of large-scale data and allows you to continue to use it while migration is under way. There cannot be any downtime or business interruption. 
  • Predictability: You need to be able to determine when data migration is complete and plan for workload migration around it.
  • Automation: Make the data migration as straightforward and simple as possible, to give you faster time to insight, to give you the flexibility required to migrate your workloads efficiently, and to optimize workloads effectively.

How Unravel helps before, during, and after migration

Shivnath went through the pitfalls encountered at each stage of a typical migration to AWS (assess; plan; test/fix/verify; optimize, manage, scale). He pointed out that it all starts with careful and accurate planning, then continuous optimization to make sure things don’t go off the rails as more and more workloads migrate over.

Unravel helps at each stage of cloud migration

And to plan properly, you need to assess what is a very complex environment. All too often, this is a highly manual, expensive, and error-filled exercise. Unravel’s full-stack observability collects, correlates, and contextualizes everything that’s running in your Hadoop environment, including identifying all the dependencies for each application and pipeline.

Then once you have this complete application catalog, with baselines to compare against after workloads move to the cloud, Unravel generates a wave plan for migration. Having such accurate and complete data-based information is crucial to formulating your plan. Usually when migrations go off schedule and over budget, it’s because the original plan itself was inaccurate.

Then after workloads migrate, Unravel provides deep insights into the correctness, performance, and cost. Performance inefficiencies and over-provisioned resources are identified automatically, with AI-driven recommendations on exactly what to fix and how.

As more workloads migrate, Unravel empowers you to apply policy-based governance and automated alerts and remediation so you can manage, troubleshoot, and optimize at scale.

Case study: GoDaddy

The Unravel-WANdisco-Amazon partnership has proven success in migrating a Hadoop environment to EMR. GoDaddy had to move petabytes of actively changing “live” data when the business depends on the continued operation of applications in the cluster and access to its data. They had to move an 800-node Hadoop cluster with 2.5PB of customer data that was growing by more than 4TB every day. The initial (pre-Unravel) manual assessment took several weeks and proved incomplete: only 300 scripts were discovered, whereas Unravel identified over 800.

GoDaddy estimated that its lift-and-shift migration would cost $7 million to operate, but Unravel AI optimization capabilities identified savings that brought the cloud costs down to $2.9 million. Using WANdisco’s data-first strategy, GoDaddy was able to complete its migration process on time and under budget while maintaining normal business operations at all times.

Q&A

The webinar wrapped up with a lively Q&A session where attendees asked questions such as:

  • We’re moving from on-premises Oracle to AWS. What would be the best strategy?
  • What kind of help can AWS provide in making data migration decisions?
  • What is your DataOps strategy for cloud migration?
  • How do you handle governance in the cloud vs. on-premises?

To hear our experts’ specific individual responses to these questions as well as the full presentation, click here to get the webinar on demand (no form to fill out!)

 

What’s New in Amazon EMR Unveiled at DataOps Unleashed 2022
https://www.unraveldata.com/resources/whats-new-in-amazon-emr-unveiled-at-dataops-unleashed-2022/ | Fri, 08 Apr 2022


At the DataOps Unleashed 2022 virtual conference, AWS Principal Solutions Architect Angelo Carvalho presented How AWS & Unravel help customers modernize their Big Data workloads with Amazon EMR. The full session recording is available on demand, but here are some of the highlights.

Angelo opened his session with a quick recap of some of the trends and challenges in big data today: the ever increasing size and scale of data; the variety of sources and stores and silos; people of different skill sets needing to access this data easily balanced against the need for security, privacy, and compliance; the expertise challenge in managing open source projects; and, of course, cost considerations.

He went on to give an overview of how Amazon EMR makes it easy to process petabyte-scale data using the latest open source frameworks such as Spark, Hive, Presto, Trino, HBase, Hudi, and Flink. But the lion’s share of his session delved into what’s new in Amazon EMR within the areas of cost and performance, ease of use, transactional data lakes, and security; the different EMR deployment options; and the EMR Migration Program.

What’s new in Amazon EMR?

Cost and performance

EMR takes advantage of the new Amazon Graviton2 instances to provide differentiated performance at lower cost—up to 30% better price-performance. Angelo presented some compelling statistics:

  • Up to 3X faster performance than standard Apache Spark at 40% of the cost
  • Up to 2.6X faster performance than open-source Presto at 80% of the cost
  • 11.5% average performance improvement with Graviton2
  • 25.7% average cost reduction with Graviton2

And you can realize these improvements out of the box while remaining 100% compatible with the open-source APIs.

Ease of use

EMR Studio now supports Presto. EMR Studio is a fully managed integrated development environment (IDE) based on Jupyter notebooks that makes it easy for data scientists and data engineers to develop, visualize, and debug applications on an EMR cluster without having to log into the AWS console. So basically, you can attach and detach notebooks to and from the clusters using a single click at any time. 

Transactional data lakes

Amazon EMR has supported Apache Hudi for some time to enable transactional data lakes, but now it has added support for Spark SQL and Apache Iceberg. Iceberg is a high-performance format for huge analytic tables at massive scale. Originally created at Netflix, with Apple among its major contributors, it brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to work safely with the same tables at the same time.
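As a rough sketch of what this looks like in practice, the PySpark session below configures an Iceberg catalog and creates a table with Spark SQL. The configuration keys come from the open-source Iceberg Spark integration; the catalog name, S3 warehouse bucket, and table are placeholders, and the exact setup on a given EMR release may differ (the Iceberg runtime must be available on the cluster):

```python
from pyspark.sql import SparkSession

# Configure a Spark session with an Iceberg catalog backed by S3 (catalog/bucket names are placeholders).
spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://example-bucket/iceberg-warehouse/")
    .getOrCreate()
)

# Create a namespace and an Iceberg table, then perform a transactional write and read.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM demo.db.events").show()
```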

Security

Amazon EMR has a comprehensive set of security features and functions, including isolation, authentication, authorization, encryption, and auditing. The latest version adds user execution role authorizations, as well as fine-grained access controls (FGAC) using AWS Lake Formation, and auditing using Lake Formation via AWS CloudTrail.

Amazon EMR deployment options

Options for deploying Amazon EMR

There are multiple options for deploying Amazon EMR:

  • Deployment on Amazon EC2 allows customers to choose instances that offer optimal price and performance ratios for specific workloads.
  • Deployment on AWS Outposts allows customers to manage and scale Amazon EMR in on-premises environments, just as they would in the cloud.
  • Deployment on containers on top of Amazon Elastic Kubernetes Service (EKS). But note that at this time, Spark is the only big data framework supported by EMR on EKS.
  • Amazon EMR Serverless is a new deployment option that lets customers run petabyte-scale data analytics in the cloud without having to manage or operate server clusters (a short sketch follows this list).
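A minimal sketch of the serverless option using boto3 (the application name and release label are placeholders, and appropriate IAM permissions are assumed):

```python
import boto3

serverless = boto3.client("emr-serverless")

# Create a Spark application that can scale to zero when idle (name and release label are placeholders).
app = serverless.create_application(
    name="example-spark-app",
    releaseLabel="emr-6.6.0",
    type="SPARK",
)
print(app["applicationId"])

# Jobs are then submitted with start_job_run(), referencing the application ID,
# an IAM execution role, and the Spark job's entry point.
```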

Using Amazon’s EMR migration program

The EMR migration program was launched to help customers streamline their migration and answer questions like, How do I move this massive data set to EMR? What will my TCO look like if I move to EMR? How do we implement security requirements? 

Amazon EMR Migration Program outcomes

Taking a data-driven approach to determine the optimal migration strategy, the Amazon EMR Migration Program (EMP) consists of three main steps:

1. The assessment stage begins with creating an initial TCO report, conducting discovery meetings, and using Unravel to quickly discover everything about the data estate.

2. The mobilization stage involves delivering an assessment insights summary, qualifying for incentives, and developing a migration readiness plan.

3. The migration stage itself includes executing the lift-and-shift migration of applications and data, before modernizing the migrated applications.

Amazon relies on Unravel to perform a comprehensive AI-powered cloud migration assessment. As Angelo explained, “We partner with Unravel Data to take a faster, more data-driven approach to migration planning. We collect utilization data for about two to four weeks depending on the size of the cluster and the complexity of the workloads. 

“During this phase, we are looking to get a summary of all the applications running on the on-premises environment, which provides a breakdown of all workloads and jobs in the customer environment. We identify killed or failed jobs—applications that fail due to resource contention and/or lack of resources—and bursty applications or pipelines.

“For example, we would locate bursty apps to move to EMR, where they can have sufficient resources every time those jobs are run, in a cost-effective way via auto-scaling. We can also estimate migration complexity and effort required to move applications automatically. And lastly, we can identify tools suited for separate clusters. For example, if we identify long-running batch jobs that run at specific intervals, they might be good candidates for spinning a transient cluster only for that job.”

Unravel is equally valuable during and after migration. Its AI-powered recommendations for optimizing applications simplifies tuning and its full-stack insights accelerate troubleshooting.

GoDaddy results with Amazon EMR and Unravel

To illustrate, Angelo concluded with an Amazon EMR-Unravel success story: GoDaddy was moving 900 scripts to Amazon EMR, and each one had to be optimized for performance and cost in a long, time-consuming manual process. But with Unravel’s automated optimization for EMR, they spent 99% less time tuning jobs—from 10+ hours to 8 minutes—saving 2700 hours of data engineering time. Performance improved by up to 72%, and GoDaddy realized $650,000 savings in resource usage costs.


Managing Costs for Spark on Amazon EMR
https://www.unraveldata.com/resources/managing-costs-for-spark-on-amazon-emr/ | Tue, 28 Sep 2021

Managing Cost & Resources Usage for Spark
https://www.unraveldata.com/resources/managing-cost-resources-usage-for-spark/ | Wed, 08 Sep 2021

Troubleshooting EMR
https://www.unraveldata.com/resources/troubleshooting-emr/ | Tue, 17 Aug 2021

Accelerate Amazon EMR for Spark & More
https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/ | Mon, 21 Jun 2021

Effective Cost and Performance Management for Amazon EMR
https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/ | Wed, 28 Apr 2021

Optimizing big data costs with Amazon EMR & Unravel
https://www.unraveldata.com/resources/amazon-emr-insider-series-optimizing-big-data-costs-with-amazon-emr-unravel/ | Sat, 25 Jul 2020

EMR Cost Optimization
https://www.unraveldata.com/resources/emr-cost-optimization/ | Wed, 22 Jul 2020

Meeting SLAs for Data Pipelines on Amazon EMR With Unravel
https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/ | Wed, 26 Jun 2019


Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog.

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. Amazon EMR was an easy choice for MTI as the platform to run all their analytics. To start with, getting up and running is simple: there is nothing to install and no configuration required, and you can get to a functional cluster in a few clicks. This let MTI focus most of their effort on building analytics that would benefit their business instead of spending time and money acquiring the skills needed to set up and maintain Hadoop deployments themselves. MTI very quickly got to the point where they were running tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing modern data stack clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units, and projects. Unravel then aggregates and correlates this data into the Unravel data model and applies a variety of analytical techniques to put that data into a useful context. Unravel utilizes EMR bootstrap actions to distribute non-intrusive sensors to each node of a new cluster; these sensors collect the granular application-level data that is in turn used to optimize applications.
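Mechanically, an EMR bootstrap action is just a script that runs on every node while the cluster is provisioning. The boto3 sketch below is purely illustrative: the S3 path to the install script is a hypothetical placeholder, not an actual Unravel artifact.

```python
import boto3

emr = boto3.client("emr")

# Illustrative only: run an agent/sensor install script on every node as the cluster is provisioned.
response = emr.run_job_flow(
    Name="cluster-with-bootstrap-sensors",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-monitoring-sensor",
            "ScriptBootstrapAction": {
                # Hypothetical placeholder path, not a real Unravel artifact.
                "Path": "s3://example-bucket/bootstrap/install-sensor.sh",
                "Args": [],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```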


Unravel’s Amazon AWS/EMR architecture

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing Data Operations
For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

● Reduce time needed for troubleshooting and resolving issues.

● Improve cluster efficiency and performance.

● Improve visibility into cluster workloads.

● Provide usage analysis

Reducing Time to Identify and Resolve Issues
One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster- and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting that data has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application is bringing your cluster to its knees.

Unravel’s top-level operations dashboard

Improving Visibility Into Cluster Operations
In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resources usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster Level Reporting
Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel’s cluster infrastructure dashboard

One observation from reports like the above was that the memory and CPU usage in the cluster was peaking from time to time, potentially leading to application failures and slowdowns. To resolve this issue, MTI utilized EMR Automatic scaling feature so that instances were automatically added and removed as needed to ensure adequate resources at all times. This also ensured that they were not incurring unnecessary costs from underutilized resources.

Application and Workflow Tagging
Unravel provides rich functionality for monitoring applications and users in the cluster. Unravel provides cluster and application reporting by user, queue, application type and custom tags like Project, Department etc. These tags are preconfigured so that MTI can instantly filter their view by these tags. The ability to add custom tags is unique to Unravel and enables customers to tag various applications based on custom rules specific to their business requirements (e.g. Project, business unit, etc.).

Unravel application tagging by department

 

Usage Analysis and Capacity Planning
MTI wants to be able to maintain service levels over the long term, and thus require reporting on cluster resource usage, and data on future capacity requirements for their program. Unravel provides this type of intelligence through the Chargeback/showback reporting.

Unravel Chargeback Reporting
You can generate chargeback reports in Unravel for multi-tenant cluster usage costs grouped by the Group By options: application type, user, queue, and tags. The window is divided into three sections:

  • Donut graphs showing the top results for the Group by selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of Yarn applications running.

Unravel Data’s chargeback reporting

Improving Cluster Efficiency and Performance
MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.


Unravel Auto-actions and Alerting
Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions setup

The simplicity of the Auto-actions screen belies the deep automation and functionality of autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also create an HTTP post that gives users a lot of powerful options.

Unravel also provides a number of powerful pre-built Auto-action templates that can give users a big head start on crafting the precise automation they want for their environment.

Pre-configured Unravel auto-action templates

Applications
Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased Visibility Into Application Runtime and Trends
MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
  • View application component (pipeline stages) breakdown and their associated performance metrics
  • Understand execution of Map Reduce jobs, Spark applications and the degree of parallelism and resource usage as well as obtain insights and recommendations for optimal performance and efficiency

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

Unravel Monitoring, Tuning, and Troubleshooting

Intuitive drill-down from the Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top-level list of applications down to jobs, and further down to individual stages of Spark, Hive, MapReduce, or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

Unravel Self-service Optimization of Spark Applications
MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and managing their data application portfolio. Their aim is for development teams to reduce their dependency on IT and at the same time improve collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step.

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Intelligent recommendations and insights as well as auto-tuning

Optimizing Application Resource Efficiency
In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams can routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource wastage, Unravel can help you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

  • Underutilization of container resources, CPU, or memory
  • Low utilization of memory resources
  • Too few partitions with respect to available parallelism
  • Mappers/reducers requesting too much memory
  • Too many map tasks and/or too many reduce tasks
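To make these categories concrete, here is a generic, illustrative PySpark configuration (the values are placeholders, not Unravel recommendations) showing the kinds of settings such insights typically target: executor memory requests and partition counts matched to the available parallelism:

```python
from pyspark.sql import SparkSession

# Illustrative only: request what the job needs instead of over-allocating, and match
# partition counts to the parallelism actually available in the cluster.
spark = (
    SparkSession.builder.appName("right-sized-job")
    .config("spark.executor.memory", "4g")           # avoid asking for far more memory than used
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "200")   # tune shuffle partitions to the cluster size
    .getOrCreate()
)

df = spark.read.parquet("s3://example-bucket/input/")  # placeholder path

# Repartition under-partitioned data so tasks spread evenly across executors.
df = df.repartition(200)
df.write.mode("overwrite").parquet("s3://example-bucket/output/")  # placeholder path
```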

Solution Highlights
Work on all of these operational goals is ongoing with MTI and Unravel, but to date, they have made significant progress on both operational and application goals. After running for over a month on their production computation cluster, MTI were able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also got great insights into the number and causes of inefficiently running applications. Unravel detected 38,190 events after analyzing the 30,378 MapReduce jobs they executed, and 44,176 events for the 21,799 Spark jobs they executed. It also detected resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive on their applications, MTI found multiple inefficient jobs where Unravel provided recommendations for repartitioning the data. They were also able to identify many jobs which waste CPU and memory resources.

Case Study: Meeting SLAs for Data Pipelines on Amazon EMR
https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/ | Thu, 30 May 2019


A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. MTI runs tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing big data clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units and projects. Unravel then aggregates and correlates this data into the Unravel data model and then applies a variety of analytical techniques to put that data into a useful context.

Unravel architecture for Amazon AWS/EMR

MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

Optimizing data operations

For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

  • Reduce time needed for troubleshooting and resolving issues.
  • Improve cluster efficiency and performance.
  • Improve visibility into cluster workloads.
  • Provide usage analysis

Reducing time to identify and resolve issues

One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster- and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting that data has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application is bringing your cluster to its knees.

Unravel top level operations dashboard

Improving visibility into cluster operations

In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resources usage, I/O, etc. Unravel provides a solution to all of these requirements.

Cluster level reporting

Cluster-level reporting, with drill-down to individual nodes, jobs, queues, and more, is a basic feature of Unravel.

Unravel cluster infrastructure dashboard

Application and workflow tagging

Unravel provides rich functionality for monitoring applications and users in the cluster, including cluster and application reporting by user, queue, application type, and custom tags such as project or department. These tags are preconfigured so that MTI can instantly filter their view by them. The ability to add custom tags is unique to Unravel and enables customers to tag applications based on rules specific to their business requirements (e.g., project or business unit); a sketch of what such tags look like at the engine level follows the screenshot below.

Unravel application tagging by department
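
Unravel’s tag rules are defined in the product itself, but the underlying idea can be illustrated with the tagging hooks the engines already expose. The PySpark sketch below is an assumption-laden example, not Unravel configuration: the application name and tag values are made up, and it uses Spark on YARN’s spark.yarn.tags property, which attaches tags to the YARN application report so that jobs can later be grouped by project or department.

    from pyspark.sql import SparkSession

    # Illustrative application name and tag values only.
    spark = (
        SparkSession.builder
        .appName("nightly-ratings-rollup")
        .config("spark.yarn.tags", "project:media-analytics,dept:research")
        .getOrCreate()
    )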

Usage analysis and capacity planning

MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage as well as data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

Unravel chargeback reporting

You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by the Group By options: application type, user, queue, and tags (a rough sketch of the roll-up arithmetic behind such a report follows the screenshot below). The window is divided into three sections:

  • Donut graphs showing the top results for the Group By selection.
  • Chargeback report showing costs, sorted by the Group By choice(s).
  • List of the YARN applications running.

Unravel chargeback reporting
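
To see what such a roll-up involves, the sketch below groups the YARN usage records from the earlier collection example by queue (or user, or tag) and converts memory-seconds and vcore-seconds into an approximate cost. The unit rates are assumptions for illustration only; Unravel’s actual chargeback model, and any real rates, depend on how an organization prices shared cluster capacity.

    from collections import defaultdict

    # Assumed unit rates, for illustration only.
    COST_PER_VCORE_HOUR = 0.05
    COST_PER_GB_HOUR = 0.01

    def chargeback(apps, group_key="queue"):
        """Approximate cost per group from YARN usage records.

        `apps` is a list of dicts shaped like the output of fetch_finished_apps()
        in the earlier sketch (vcore_seconds, memory_mb_seconds, user, queue, ...).
        """
        totals = defaultdict(float)
        for a in apps:
            vcore_hours = a["vcore_seconds"] / 3600.0
            gb_hours = a["memory_mb_seconds"] / 1024.0 / 3600.0
            totals[a[group_key]] += (
                vcore_hours * COST_PER_VCORE_HOUR + gb_hours * COST_PER_GB_HOUR
            )
        return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))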

Improving cluster efficiency and performance

MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

Unravel Auto-actions and alerting

Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions set up

The simplicity of the Auto-actions screen belies the deep automation behind autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also issue an HTTP POST, which gives users a lot of powerful options.

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment; a minimal sketch of what one such policy amounts to follows the screenshot below.

Preconfigured Unravel auto-action templates
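
To make the idea concrete, here is a bare-bones sketch of the kind of policy an auto-action encapsulates: find applications that have run longer than a threshold, notify the team, and optionally kill the offender. It is an illustration under stated assumptions – the ResourceManager address, Slack webhook URL, and threshold are all hypothetical – and it omits the policy management, scoping, and safety checks that a product like Unravel layers on top.

    import subprocess
    import requests

    # Hypothetical endpoints and threshold, for illustration only.
    RM_URL = "http://rm-host:8088"
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
    MAX_RUNTIME_MS = 2 * 60 * 60 * 1000  # two hours

    def check_long_running_apps(kill=False):
        """Alert on, and optionally kill, YARN apps exceeding a runtime threshold."""
        resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps",
                            params={"states": "RUNNING"}, timeout=10)
        resp.raise_for_status()
        apps = (resp.json().get("apps") or {}).get("app") or []
        for a in apps:
            if a["elapsedTime"] > MAX_RUNTIME_MS:
                minutes = a["elapsedTime"] // 60000
                msg = f"App {a['id']} ({a['user']}/{a['queue']}) has been running for {minutes} min"
                requests.post(SLACK_WEBHOOK, json={"text": msg}, timeout=10)
                if kill:
                    # Equivalent to running `yarn application -kill <app-id>` by hand.
                    subprocess.run(["yarn", "application", "-kill", a["id"]], check=False)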

Applications

Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

Increased visibility into application runtime and trends

MTI data teams, like many others, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

  • Visualize app performance trends, viewing metrics such as application start time, duration, state, I/O, memory usage, etc.
  • View the breakdown of application components (pipeline stages) and their associated performance metrics
  • Understand the execution of MapReduce jobs and Spark applications – including the degree of parallelism and resource usage – and obtain insights and recommendations for optimal performance and efficiency

Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have its own monitoring capabilities, but a unified view is needed to monitor and manage all the components together.

Unravel monitoring, tuning and troubleshooting

Intuitive drill-down from Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top-level list of applications down to individual jobs, and further down to the individual stages of Spark, Hive, MapReduce, or Impala applications.

Unravel Gantt chart view of a Hive query

Unravel provides a number of intuitive navigational and reporting elements in the user interface, including a Gantt chart of application components to understand the execution and parallelism of your applications.
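
The same drill-down can be reasoned about in terms of the engines’ public monitoring APIs. The sketch below is illustrative rather than anything Unravel exposes: it assumes a Spark History Server at a hypothetical history-server:18080 (the usual port on EMR), lists applications, and then pulls the completed stages of one application – stage IDs, names, submission and completion times, task counts – which is exactly the raw material a Gantt-style view of execution and parallelism is built from.

    import requests

    # Hypothetical Spark History Server address.
    HISTORY_SERVER = "http://history-server:18080"

    def list_applications(limit=20):
        """Return the most recent Spark applications known to the history server."""
        resp = requests.get(f"{HISTORY_SERVER}/api/v1/applications",
                            params={"limit": limit}, timeout=10)
        resp.raise_for_status()
        return resp.json()

    def stage_timeline(app_id):
        """Return (stageId, name, submissionTime, completionTime, numTasks) tuples."""
        resp = requests.get(
            f"{HISTORY_SERVER}/api/v1/applications/{app_id}/stages",
            params={"status": "complete"},
            timeout=10,
        )
        resp.raise_for_status()
        return [
            (s["stageId"], s["name"], s.get("submissionTime"),
             s.get("completionTime"), s["numTasks"])
            for s in sorted(resp.json(), key=lambda s: s["stageId"])
        ]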

Unravel self-service optimization of Spark applications

MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and management of their data application portfolio. Their aim is for development teams to reduce their dependency on IT while improving collaboration with their peers. Their targets in this area include:

  • Reducing troubleshooting and resolution time by providing self-service tuning
  • Improving application efficiency and performance with minimal IT intervention
  • Providing Spark developers with performance insights that relate directly to the lines of code associated with a given step

MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Unravel provides intelligent recommendations and insights as well as auto-tuning.

Optimizing Application Resource Efficiency

In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operation of the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel helps you run your data applications more efficiently with AI-driven insights and recommendations such as the following (a sketch of what acting on two of these looks like in PySpark follows the list):

Unravel Insight: Under-utilization of container resources, CPU or memory

Unravel Insight: Too few partitions with respect to available parallelism

Unravel Insight: Mapper/Reducers requesting too much memory

Unravel Insight: Too many map tasks and/or too many reduce tasks
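
The insight names above are Unravel’s; what follows is only a hedged illustration of the kind of change such an insight typically translates into at the code or configuration level. The input path, application name, and the specific conditions are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

    # Hypothetical input; the point is the shape of the fix, not the dataset.
    df = spark.read.parquet("s3://example-bucket/events/")

    # "Too few partitions with respect to available parallelism": compare the
    # actual partition count against the cluster's default parallelism and
    # repartition when the data is too coarse to keep all executor cores busy.
    available = spark.sparkContext.defaultParallelism
    current = df.rdd.getNumPartitions()
    if current < available:
        df = df.repartition(available)

    # "Under-utilization of container resources, CPU or memory" maps to
    # right-sizing submit-time settings such as spark.executor.memory and
    # spark.executor.cores rather than to code changes.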

Solution Highlights

Work on all of these operational and application goals is ongoing between MTI and Unravel, but to date they have made significant progress on both fronts. After running Unravel for over a month on their production computation cluster, MTI was able to capture metrics for every MapReduce and Spark job that was executed.

MTI also gained insight into the number and causes of inefficiently running applications. Unravel detected a significant number of inefficient applications: 38,190 events across the 30,378 MapReduce jobs it analyzed, and 44,176 events across 21,799 Spark jobs. It also detected resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive into their applications, MTI found multiple inefficient jobs for which Unravel provided recommendations to repartition the data. They were also able to identify many jobs that wasted CPU and memory resources.

The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/feed/ 0