Articles Archives - Unravel

Mastering Cost Management: From Reactive Spending to Proactive Optimization

According to Forrester, accurately forecasting cloud costs remains a significant challenge for 80% of data management professionals. This struggle often stems from a lack of granular visibility, control over usage, and ability to optimize code and infrastructure for cost and performance. Organizations utilizing modern data platforms like Snowflake, BigQuery, and Databricks often face unexpected budget overruns, missed performance SLAs, and inefficient resource allocation.

Transitioning from reactive spending to proactive optimization is crucial for effective cost management in modern data stack environments.

This shift requires a comprehensive approach that encompasses several key strategies:

1. Granular Visibility
Gain comprehensive insights into expenses by unifying fragmented data and breaking down silos, enabling precise financial planning and resource allocation for effective cost control. This unified approach allows teams to identify hidden cost drivers and inefficiencies across the entire data ecosystem.

By consolidating data from various sources, organizations can create a holistic view of their spending patterns, facilitating more accurate budget forecasting and informed decision-making. Additionally, this level of visibility empowers teams to pinpoint opportunities for optimization, such as underutilized resources or redundant processes, leading to significant cost savings over time.

2. ETL Pipeline Optimization
Design cost-effective pipelines from the outset, implementing resource utilization best practices and ongoing performance monitoring to identify and address inefficiencies. This approach involves carefully architecting ETL processes to minimize resource usage while maintaining optimal performance.

By employing advanced performance tuning techniques, such as optimizing query execution plans and leveraging built-in optimizations, organizations can significantly reduce processing time and associated costs. Continuous monitoring of pipeline performance allows for the early detection of bottlenecks or resource-intensive operations, enabling timely adjustments and ensuring sustained efficiency over time.

3. Intelligent Resource Management
Implement intelligent autoscaling to dynamically adjust resources based on workload demands, optimizing costs in real time while maintaining performance. Efficiently manage data lake and compute resources to minimize unnecessary expenses during scaling. This approach allows organizations to automatically provision and de-provision resources as needed, ensuring optimal utilization and cost-efficiency.

By setting appropriate scaling policies and thresholds, you can avoid over-provisioning during periods of low demand and ensure sufficient capacity during peak usage times. Additionally, separating storage and compute resources enables more granular control over costs, allowing you to scale each component independently based on specific requirements.

4. FinOps Culture
Foster collaboration between data and finance teams, implementing cost allocation strategies like tagging and chargeback mechanisms to attribute expenses to specific projects or teams accurately. This approach creates a shared responsibility for cloud costs and promotes organizational transparency.

By establishing clear communication channels and regular meetings between technical and financial stakeholders, teams can align their efforts to optimize resource utilization and spending. A robust tagging system also allows for detailed cost breakdowns, enabling more informed decision-making and budget allocation based on actual usage patterns.

5. Advanced Forecasting
Develop sophisticated forecasting techniques and flexible budgeting strategies using historical data and AI-driven analytics to accurately predict future costs and create adaptive budgets that accommodate changing business needs. Organizations can identify trends and seasonal variations that impact costs by analyzing past usage patterns and performance metrics.

This data-driven approach enables more precise resource allocation and helps teams anticipate potential cost spikes, allowing for proactive adjustments to prevent budget overruns. Additionally, implementing AI-powered forecasting models can provide real-time insights and recommendations, enabling continuous optimization of environments as workloads and business requirements evolve.

Mastering these strategies can help you transform your approach to cost management from reactive to proactive, ensuring you maximize the value of your cloud investments while maintaining financial control.

To learn more about implementing these cost management strategies in your modern data environment, join our upcoming webinar series, “Controlling Cloud Costs.” This ten-part series will explore each aspect of effective cost management, providing actionable insights and best practices to gain control over your data platform costs.

Register for Controlling Databricks Cloud Cost webinars.

Register for Controlling Snowflake Cloud Cost webinars.

Building a FinOps Ethos


3 Key Steps to Build a FinOps Ethos in Your Data Engineering Team

In today’s data-driven enterprises, the intersection of fiscal responsibility and technical innovation has never been more critical. As data processing costs continue to scale with business growth, building a FinOps culture within your Data Engineering team isn’t just about cost control; it’s about creating a mindset that views cost optimization as an integral part of technical excellence.

Unravel’s ‘actionability for everyone’ approach has enabled executive customers to adopt three transformative steps to embed FinOps principles into their Data Engineering team’s DNA, ensuring that cost awareness becomes as natural as code quality or data accuracy. In this article, we walk through how executives can move the cost rating of their workloads from Uncontrolled to Optimized with clear, guided, actionable pathways.

Unravel Data Dashboard

Step 1. Democratize cost visibility

The foundation of any successful FinOps implementation begins with transparency. However, raw cost data alone isn’t enough; it needs to be contextualized and actionable.

Breaking Down the Cost Silos

  • Unravel provides real-time cost attribution dashboards to map cloud spending to specific business units, teams, projects, and data pipelines.

  • The custom views allow different stakeholders, from engineers to executives, to discover the top actions for controlling cost.
  • The ability to track key metrics like cost/savings per job/query, time-to-value, and wasted costs due to idleness, wait times, etc., transforms cost data from a passive reporting tool to an active management instrument.

Provide tools for cost decision making

Modern data teams need to understand how their architectural and implementation choices affect the bottom line. Unravel provides visualizations and insights to guide these teams to implement:

  • Pre-deployment cost impact assessments for new data pipelines (for example, what is the cost impact of migrating this workload from All-purpose compute to Job compute?).
  • What-if analysis tools for infrastructure changes (for example, will switching from the current instance types to the recommended instance types save cost without degrading performance?).
  • Historical trend analysis to identify cost patterns, budget overruns, costs wasted on neglected optimizations, and more.

Step 2. Embed cost intelligence into the development and operations lifecycles

The next evolution in FinOps maturity comes from making cost optimization an integral part of the development process, not a post-deployment consideration. Executives should consider leveraging specialized AI agents across their technology stack that help boost productivity and free up their teams’ time to focus on innovation. Unravel provides a suite of AI-agent-driven features that foster a cost ethos in the organization and maximize operational excellence with auto-fix capabilities.

Automated Optimization Lifecycle
Unravel helps you establish a systematic approach to cost optimization with automated AI-agentic workflows to help your teams operationalize recommendations while getting a huge productivity boost. Below are some ways to drive automation with Unravel agents:

  • Implement automated fix suggestions for most code and configuration inefficiencies
  • Assign auto-fix actions to AI agents for effortless adoption or route them to a human reviewer for approval
  • Configure automated rollback capabilities for changes if unintended performance or cost degradation is detected

Push Down Cost Consciousness To Developers 

  • Automated code reviews that flag potential cost inefficiencies
  • Smart cost savings recommendations based on historical usage patterns
  • Allow developers to see the impact of their change on future runs

Define Measurable Success Metrics
Executives can track the effectiveness of FinOps awareness and culture using Unravel through:

  • Cost efficiency improvements over time (WoW, MoM, YTD)
  • Team engagement and rate of adoption with Accountability dashboards
  • Time-to-resolution for code and configuration changes

Step 3. Create a self-sustaining FinOps culture

The final and most crucial step is transforming FinOps from an initiative into a cultural cornerstone of your data engineering practice.

Operationalize with AI agents

FinOps AI Agent

Implement timely alerting systems that help drive value-oriented decisions for cost optimization and governance. Unravel provides email, Slack, and Teams integrations to ensure all necessary stakeholders get timely notifications and insights into opportunities and risks.

DataOps AI Agent

  • Pipeline optimization suggestions to improve resource utilization and mitigate SLA risks
  • Job-signature-level cost and savings impact analysis to help prioritize recommendations
  • Intelligent workload migration recommendations

Data Engineering AI Agent

  • Storage tier optimization recommendations to avoid wasted costs due to cold tables
  • Partition strategy optimization for cost-effective querying
  • Avoidance of recurring failures and bottlenecks caused by inefficiencies that go unaddressed for weeks

Continuous Evolution

Finally, it is extremely important to foster and track the momentum of FinOps growth by:

  • Regularly performing FinOps retrospectives with wider teams
  • Revisiting which business units and cost centers are contributing to wasted costs, costs neglected due to unadopted recommendations, and budget overruns despite timely alerting

The path forward

Building a FinOps ethos in your Data Engineering team is a journey that requires commitment, tools, and cultural change. By following the above three key steps – democratizing cost visibility, embedding cost intelligence, and creating a self-sustaining culture – you can transform how your team thinks about and manages cloud costs.

The most successful implementations don’t treat FinOps as a separate discipline but rather as an integral part of technical excellence. When cost optimization becomes as natural as writing tests or documenting code, you have achieved true FinOps maturity. Unravel provides a comprehensive set of features and tools to aid your teams in accelerating FinOps best practices throughout the organization.

Remember, the goal isn’t just to reduce costs – it is to maximize the value derived from every dollar spent on your infrastructure. This mindset shift, combined with the right tools and processes, will position your data engineering team for sustainable growth and success in an increasingly cost-conscious technology landscape.

To learn more on how Unravel can help, contact us or request a demo.

BigQuery Cost Management


Mastering BigQuery Cost Management and FinOps: A Comprehensive Checklist

Effective cost management becomes crucial as organizations increasingly rely on Google BigQuery for their data warehousing and analytics needs. This checklist delves into the intricacies of cost management and FinOps for BigQuery, exploring strategies to inform, govern, and optimize usage while taking a holistic approach that considers queries, datasets, infrastructure, and more.

While this checklist is comprehensive and very impactful when implemented fully, it can also be overwhelming to implement with limited staffing and resources. AI-driven insights and automation can solve this problem and are also explored at the bottom of this guide.

Understanding Cost Management for BigQuery

BigQuery’s pricing model is primarily based on data storage and query processing. While this model offers flexibility, it also requires careful management to ensure costs align with business value. Effective cost management for BigQuery is about more than reducing expenses—it’s also about optimizing spend, ensuring efficient resource utilization, and aligning costs with business outcomes. This comprehensive approach falls under the umbrella of FinOps (Financial Operations).

The Holistic Approach: Key Areas to Consider


1. Query Optimization

Are queries optimized? Efficient queries are fundamental to cost-effective BigQuery usage:

Query Structure: Write efficient SQL queries that minimize data scanned.
Partitioning and Clustering: Implement appropriate table partitioning and clustering strategies to reduce query costs.
Materialized Views: Use materialized views for frequently accessed or complex query results.
Query Caching: Leverage BigQuery’s query cache to avoid redundant processing.
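
As a rough illustration of the query-structure and partitioning/clustering points above, the sketch below uses a hypothetical analytics.events table; the dataset, table, and column names are assumptions, not references to any particular environment.

  -- Create a date-partitioned, clustered table so BigQuery can prune partitions
  -- and scan fewer bytes (billed bytes drop accordingly).
  CREATE TABLE IF NOT EXISTS analytics.events
  PARTITION BY DATE(event_ts)
  CLUSTER BY customer_id AS
  SELECT * FROM analytics.raw_events;

  -- Select only the columns you need and filter on the partitioning column
  -- so the scan is limited to the relevant partitions.
  SELECT customer_id, COUNT(*) AS event_count
  FROM analytics.events
  WHERE event_ts >= TIMESTAMP '2025-01-01'
    AND event_ts < TIMESTAMP '2025-02-01'
  GROUP BY customer_id;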

2. Dataset Management

Are datasets managed correctly? Proper dataset management is crucial for controlling costs:

Data Lifecycle Management: Implement policies for data retention and expiration to manage storage costs.
Table Expiration: Set up automatic table expiration for temporary or test datasets.
Data Compression: Use appropriate compression methods to reduce storage costs.
Data Skew: Address data skew issues to prevent performance bottlenecks and unnecessary resource consumption.
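
Table and dataset expiration, for example, can be set directly in SQL. This is a minimal sketch against hypothetical analytics and scratch datasets; the retention windows are illustrative only.

  -- Expire a temporary staging table automatically after seven days.
  ALTER TABLE analytics.tmp_daily_stage
  SET OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  );

  -- Give every new table in a scratch dataset a default 30-day lifetime.
  ALTER SCHEMA scratch
  SET OPTIONS (default_table_expiration_days = 30);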

3. Infrastructure Optimization

Is infrastructure optimized? While BigQuery is a managed service, there are still infrastructure considerations:

Slot Reservations: Evaluate and optimize slot reservations for predictable workloads.
Flat-Rate Pricing: Consider flat-rate pricing for high-volume, consistent usage patterns.
Multi-Region Setup: Balance data residency requirements with cost implications of multi-region setups.

4. Access and Governance

Are the right policies and governance in place? Proper access controls and governance are essential for cost management:

IAM Roles: Implement least privilege access using Google Cloud IAM roles.
Resource Hierarchies: Utilize resource hierarchies (organizations, folders, projects) for effective cost allocation.
VPC Service Controls: Implement VPC Service Controls to manage data access and potential egress costs.

Implementing FinOps Practices

To master cost management for BigQuery, consider these FinOps practices:


1. Visibility and Reporting

Implement comprehensive labeling strategies for resources.
Create custom dashboards in Google Cloud Console or Data Studio for cost visualization.
Set up budget alerts and export detailed billing data for analysis.
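
One way to ground this reporting is to query BigQuery’s INFORMATION_SCHEMA jobs views; the sketch below ranks users by bytes billed over the last seven days (the region qualifier and time window are assumptions to adjust for your environment).

  SELECT
    user_email,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
    COUNT(*) AS query_count
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND job_type = 'QUERY'
  GROUP BY user_email
  ORDER BY tib_billed DESC;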

2. Optimization

Regularly review and optimize queries based on BigQuery’s query explanation and job statistics.
Implement automated processes to identify and optimize high-cost queries.
Foster a culture of cost awareness among data analysts and engineers.

3. Governance

Establish clear policies for dataset creation, query execution, and resource provisioning.
Implement approval workflows for high-cost operations or large-scale data imports.
Create and enforce organizational policies to prevent costly misconfigurations.

Setting Up Guardrails

Implementing guardrails is crucial to prevent unexpected costs:

Query Limits: Set daily query limit quotas at the project or user level.
Cost Controls: Implement custom cost controls using Cloud Functions and the BigQuery API.
Data Access Controls: Use column-level and row-level security to restrict access to sensitive or high-volume data.
Budgets and Alerts: Set up project-level budgets and alerts in Google Cloud Console.

The Need for Automated Observability and FinOps Solutions

Given the scale and complexity of modern data operations, automated solutions can significantly enhance cost management efforts. Automated observability and FinOps solutions can provide the following:

Real-time cost visibility across your entire BigQuery environment.
Automated recommendations for query optimization and cost reduction.
Anomaly detection to quickly identify unusual spending patterns.
Predictive analytics to forecast future costs and resource needs.

These solutions can offer insights that would be difficult or impossible to obtain manually, helping you make data-driven decisions about your BigQuery usage and costs.

BigQuery-Specific Cost Optimization Techniques

Avoid SELECT *: Instead, specify only the columns you need to reduce data processed.
Use Approximate Aggregation Functions: For large-scale aggregations where precision isn’t critical, use approximate functions like APPROX_COUNT_DISTINCT().
Optimize JOIN Operations: Ensure the larger table is on the left side of the JOIN to potentially reduce shuffle and processing time.
Leverage BigQuery ML: Use BigQuery ML for in-database machine learning to avoid data movement costs.
Use Scripting: Utilize BigQuery scripting to perform complex operations without multiple query executions.
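
As a small, hedged illustration of the first two techniques, using an assumed analytics.events table:

  -- Scans every column (more bytes processed and billed):
  SELECT * FROM analytics.events;

  -- Scans only the needed columns and uses an approximate distinct count
  -- where exact precision is not required:
  SELECT
    DATE(event_ts) AS event_date,
    APPROX_COUNT_DISTINCT(customer_id) AS approx_customers
  FROM analytics.events
  GROUP BY event_date;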

Conclusion

Effective BigQuery cost management and FinOps require a holistic approach that considers all aspects of your data operations. By optimizing queries, managing datasets efficiently, leveraging appropriate pricing models, and implementing robust FinOps practices, you can ensure that your BigQuery investment delivers maximum value to your organization.

Remember, the goal isn’t just to reduce costs, but to optimize spend and align it with business objectives. With the right strategies and tools in place, you can transform cost management from a challenge into a competitive advantage, enabling your organization to make the most of BigQuery’s powerful capabilities while maintaining control over expenses.

To learn more about how Unravel can help with BigQuery cost management, request a health check report, view a self-guided product tour, or request a demo.

Databricks Cost Management


Mastering Databricks Cost Management and FinOps: A Comprehensive Checklist

In the era of big data and cloud computing, organizations increasingly turn to platforms like Databricks to handle their data processing and analytics needs. However, with great power comes great responsibility – and, in this case, the responsibility of managing costs effectively.

This checklist dives deep into cost management and FinOps for Databricks, exploring how to inform, govern, and optimize your usage while taking a holistic approach that considers code, configurations, datasets, and infrastructure.

While this checklist is comprehensive and very impactful when implemented fully, it can also be overwhelming to implement with limited staffing and resources. AI-driven insights and automation can solve this problem and are also explored at the bottom of this guide.

Understanding Databricks Cost Management

Before we delve into strategies for optimization, it’s crucial to understand that Databricks cost management isn’t just about reducing expenses. It’s about gaining visibility into where your spend is going, ensuring resources are being used efficiently, and aligning costs with business value. This comprehensive approach is often referred to as FinOps (Financial Operations).

The Holistic Approach: Key Areas to Consider

1. Code Optimization

Is code optimized? Efficient code is the foundation of cost-effective Databricks usage. Consider the following:

Query Optimization: Ensure your Spark SQL queries are optimized for performance. Use explain plans to understand query execution and identify bottlenecks.
Proper Data Partitioning: Implement effective partitioning strategies to minimize data scans and improve query performance.
Caching Strategies: Utilize Databricks’ caching mechanisms judiciously to reduce redundant computations.
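
The sketch below illustrates a few of these practices in Databricks SQL against an assumed Delta table sales.orders: inspecting a plan, improving data layout with Delta’s OPTIMIZE/ZORDER (a close cousin of partitioning for data skipping), and caching a reused result. Treat it as an example pattern, not a prescription for any specific workload.

  -- Inspect the plan before tuning.
  EXPLAIN FORMATTED
  SELECT o.customer_id, SUM(o.amount) AS total
  FROM sales.orders o
  WHERE o.order_date >= '2025-01-01'
  GROUP BY o.customer_id;

  -- Delta Lake: compact small files and co-locate data on a frequently
  -- filtered column to reduce the data scanned per query.
  OPTIMIZE sales.orders ZORDER BY (customer_id);

  -- Cache a result that several downstream queries reuse.
  CACHE TABLE orders_recent AS
  SELECT * FROM sales.orders WHERE order_date >= '2025-01-01';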

2. Configuration Management

Are configurations managed appropriately? Proper configuration can significantly impact costs:

Cluster Sizing: Right-size your clusters based on workload requirements. Avoid over-provisioning resources.
Autoscaling: Implement autoscaling to dynamically adjust cluster size based on demand.
Instance Selection: Choose the appropriate instance types for your workloads, considering both performance and cost.

3. Dataset Management

Are datasets managed correctly? Efficient data management is crucial for controlling costs:

Data Lifecycle Management: Implement policies for data retention and archiving to avoid unnecessary storage costs.
Data Format Optimization: Use efficient file formats like Parquet or ORC to reduce storage and improve query performance.
Data Skew Handling: Address data skew issues to prevent performance bottlenecks and unnecessary resource consumption.
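
A hedged sketch of the format and lifecycle points above, with assumed table names and retention windows (note that VACUUM permanently removes history older than the retention period):

  -- Store data as a Delta (Parquet-based) table.
  CREATE TABLE IF NOT EXISTS sales.orders_delta
  USING DELTA
  AS SELECT * FROM sales.orders_raw;

  -- Control how much transaction-log and file history is retained.
  ALTER TABLE sales.orders_delta SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  );

  -- Remove data files that are no longer referenced by the table.
  VACUUM sales.orders_delta RETAIN 168 HOURS;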

4. Infrastructure Optimization

Is infrastructure optimized? Optimize your underlying infrastructure for cost-efficiency:

Storage Tiering: Utilize appropriate storage tiers (e.g., DBFS, S3, Azure Blob Storage) based on data access patterns.
Spot Instances: Leverage spot instances for non-critical workloads to reduce costs.
Reserved Instances: Consider purchasing reserved instances for predictable, long-running workloads.

Implementing FinOps Practices

To truly master Databricks cost management, implement these FinOps practices:

1. Visibility and Reporting

Implement comprehensive cost allocation and tagging strategies.
Create dashboards to visualize spend across different dimensions (teams, projects, environments).
Set up alerts for unusual spending patterns or budget overruns.

2. Optimization

Regularly review and optimize resource usage based on actual consumption patterns.
Implement automated policies for shutting down idle clusters.
Encourage a culture of cost awareness among data engineers and analysts.

3. Governance

Establish clear policies and guidelines for resource provisioning and usage.
Implement role-based access control (RBAC) to ensure appropriate resource access.
Create approval workflows for high-cost operations or resource requests.

Setting Up Guardrails

Guardrails are essential for preventing cost overruns and ensuring responsible usage:

Budget Thresholds: Set up budget alerts at various thresholds (e.g., 50%, 75%, 90% of budget).
Usage Quotas: Implement quotas for compute hours, storage, or other resources at the user or team level.
Automated Policies: Use Databricks’ Policy Engine to enforce cost-saving measures automatically.
Cost Centers: Implement chargeback or showback models to make teams accountable for their spend.

The Need for Automated Observability and FinOps Solutions

While manual oversight is important, the scale and complexity of modern data operations often necessitate automated solutions. Tools like Unravel can provide:

Real-time cost visibility across your entire Databricks environment.
Automated recommendations for cost optimization.
Anomaly detection to identify unusual spending patterns quickly.
Predictive analytics to forecast future costs and resource needs.

These solutions can significantly enhance your ability to manage costs effectively, providing insights that would be difficult or impossible to obtain manually.

Conclusion

Effective cost management and FinOps for Databricks require a holistic approach considering all aspects of your data operations. By optimizing code, configurations, datasets, and infrastructure, and implementing robust FinOps practices, you can ensure that your Databricks investment delivers maximum value to your organization. Remember, the goal isn’t just to reduce costs, but to optimize spend and align it with business objectives. With the right strategies and tools in place, you can turn cost management from a challenge into a competitive advantage.

To learn more about how Unravel can help with Databricks cost management, request a health check report, view a self-guided product tour, or request a demo.

BigQuery Code Optimization


The Complexities of Code Optimization in BigQuery: Problems, Challenges and Solutions

Google BigQuery offers a powerful, serverless data warehouse solution that can handle massive datasets with ease. However, this power comes with its own set of challenges, particularly when it comes to code optimization.

This blog post delves into the complexities of code optimization in BigQuery, the difficulties in diagnosing and resolving issues, and how automated solutions can simplify this process.

The BigQuery Code Optimization Puzzle

1. Query Performance and Cost Management

Problem: In BigQuery, query performance and cost are intimately linked. Inefficient queries can not only be slow but also extremely expensive, especially when dealing with large datasets.

Diagnosis Challenge: Identifying the root cause of a poorly performing and costly query is complex. Is it due to inefficient JOIN operations, suboptimal table structures, or simply the sheer volume of data being processed? BigQuery provides query explanations, but interpreting these for complex queries and understanding their cost implications requires significant expertise.

Resolution Difficulty: Optimizing BigQuery queries often involves a delicate balance between performance and cost. Techniques like denormalizing data might improve query speed but increase storage costs. Each optimization needs to be carefully evaluated for its impact on both performance and billing, which can be a time-consuming and error-prone process.

2. Partitioning and Clustering Challenges

Problem: Improper partitioning and clustering can lead to excessive data scanning, resulting in slow queries and unnecessary costs.

Diagnosis Challenge: The effects of suboptimal partitioning and clustering may not be immediately apparent and can vary depending on query patterns. Identifying whether poor performance is due to partitioning issues, clustering issues, or something else entirely requires analyzing query patterns over time and understanding the intricacies of BigQuery’s architecture.

Resolution Difficulty: Changing partitioning or clustering strategies is not a trivial operation, especially for large tables. It requires careful planning and can temporarily impact query performance during the restructuring process. Determining the optimal strategy often requires extensive A/B testing and monitoring across various query types and data sizes.
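
A common restructuring pattern is to build a partitioned, clustered copy of the table, validate query performance and cost against it, and then cut over; a minimal sketch with assumed names:

  CREATE TABLE analytics.events_partitioned
  PARTITION BY DATE(event_ts)
  CLUSTER BY customer_id
  AS SELECT * FROM analytics.events;

  -- After validation, downstream views or scheduled jobs are typically
  -- repointed to the new table before the original is retired.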

3. Nested and Repeated Fields Complexity

Problem: While BigQuery’s support for nested and repeated fields offers flexibility, it can lead to complex queries that are difficult to optimize and prone to performance issues.

Diagnosis Challenge: Understanding the performance characteristics of queries involving nested and repeated fields is like solving a multidimensional puzzle. The query explanation may not provide clear insights into how these fields are being processed, making it difficult to identify bottlenecks.

Resolution Difficulty: Optimizing queries with nested and repeated fields often requires restructuring the data model or rewriting queries in non-intuitive ways. This process can be time-consuming and may require significant changes to ETL processes and downstream analytics.
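
For reference, queries over repeated fields typically use UNNEST; the sketch below assumes a hypothetical orders table with a repeated line_items STRUCT column, and filtering inside the join keeps the array explosion from inflating later stages.

  SELECT
    o.order_id,
    item.sku,
    item.quantity
  FROM analytics.orders AS o,
  UNNEST(o.line_items) AS item
  WHERE item.quantity > 10;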

4. UDF and Stored Procedure Performance

Problem: User-Defined Functions (UDFs) and stored procedures in BigQuery can lead to unexpected performance issues if not implemented carefully.

Diagnosis Challenge: The impact of UDFs and stored procedures on query performance isn’t always clear from the query explanation. Identifying whether these are the source of performance issues requires careful analysis and benchmarking.

Resolution Difficulty: Optimizing UDFs and stored procedures often involves rewriting them from scratch or finding ways to eliminate them altogether. This can be a complex process, especially if these functions are widely used across your BigQuery projects.

Manual Optimization Struggle

Traditionally, addressing these challenges involves a cycle of:

1. Manually analyzing query explanations and job statistics
2. Conducting time-consuming A/B tests with different query structures and table designs
3. Carefully monitoring the impact of changes on both performance and cost
4. Continuously adjusting as data volumes and query patterns evolve

This process is not only time-consuming but also requires deep expertise in BigQuery’s architecture, SQL optimization techniques, and cost management strategies. Even then, optimizations that work today might become inefficient as your data and usage patterns change.

Harnessing Automation for BigQuery Optimization

Given the complexities and ongoing nature of these challenges, many organizations are turning to automated solutions to streamline their BigQuery optimization efforts. Tools like Unravel can help by:

Continuous Performance and Cost Monitoring: Automatically tracking query performance, resource utilization, and cost metrics across your entire BigQuery environment.

Intelligent Query Analysis: Using machine learning algorithms to identify patterns and anomalies in query performance and cost that might be missed by manual analysis.

Root Cause Identification: Quickly pinpointing the source of performance issues, whether they’re related to query structure, data distribution, or BigQuery-specific features like partitioning and clustering.

Optimization Recommendations: Providing actionable suggestions for query rewrites, partitioning and clustering strategies, and cost-saving measures.

Impact Prediction: Estimating the potential performance and cost impacts of suggested changes before you implement them.

Automated Policy Enforcement: Helping enforce best practices and cost controls automatically across your BigQuery projects.

By leveraging such automated solutions, data teams can focus their expertise on deriving insights from data while ensuring their BigQuery environment remains optimized and cost-effective. Instead of spending hours digging through query explanations and job statistics, teams can quickly identify and resolve issues, or even prevent them from occurring in the first place.

Conclusion

Code optimization in BigQuery is a complex, ongoing challenge that requires continuous attention and expertise. While the problems are multifaceted and the manual diagnosis and resolution process can be daunting, automated solutions offer a path to simplify and streamline these efforts. By leveraging such tools, organizations can more effectively manage their BigQuery performance and costs, improve query efficiency, and allow their data teams to focus on delivering value rather than constantly grappling with optimization challenges.

Remember, whether you’re using manual methods or automated tools, optimization in BigQuery is an ongoing process. As your data volumes grow and query patterns evolve, staying on top of performance and cost management will ensure that your BigQuery implementation continues to deliver the insights your business needs, efficiently and cost-effectively.

To learn more about how Unravel can help with BigQuery code optimization, request a health check report, view a self-guided product tour, or request a demo.

Snowflake Cost Management


Mastering Snowflake Cost Management and FinOps: A Comprehensive Checklist

Effective cost management becomes paramount as organizations leverage Snowflake’s powerful cloud data platform for their analytics and data warehousing needs. This comprehensive checklist explores the intricacies of cost management and FinOps for Snowflake, delving into strategies to inform, govern, and optimize usage while taking a holistic approach that considers queries, storage, compute resources, and more.

While this checklist is comprehensive and very impactful when implemented fully, it can also be overwhelming to implement with limited staffing and resources. AI-driven insights and automation can solve this problem and are also explored at the bottom of this guide.

Understanding Cost Management for Snowflake

Snowflake’s unique architecture separates compute and storage, offering a flexible pay-as-you-go model. While this provides scalability and performance benefits, it also requires careful management to ensure costs align with business value.

Effective Snowflake cost management is about more than reducing expenses—it’s also about optimizing spend, ensuring efficient resource utilization, and aligning costs with business outcomes. This comprehensive approach falls under the umbrella of FinOps (Financial Operations).

The Holistic Approach: Key Areas to Consider

1. Compute Optimization

Are compute resources allocated efficiently?

Virtual Warehouse Sizing: Right-size your virtual warehouses based on workload requirements.
Auto-suspend and Auto-resume: Leverage Snowflake’s auto-suspend and auto-resume features to minimize idle time.
Query Optimization: Write efficient SQL queries to reduce compute time and costs.
Materialized Views: Use materialized views for frequently accessed or complex query results.
Result Caching: Utilize Snowflake’s result caching to avoid redundant computations.
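
Warehouse sizing, auto-suspend, and auto-resume are plain DDL settings; in the sketch below the warehouse name and thresholds are assumptions to adapt to your workloads.

  CREATE WAREHOUSE IF NOT EXISTS analytics_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

  -- Resize later only if sustained workload growth justifies it.
  ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';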

2. Resource Monitoring and Governance

Are the right policies and governance in place? Proper monitoring and governance are essential for cost management:

Resource Monitors: Set up resource monitors to track and limit credit usage.
Account Usage and Information Schema Views: Utilize these views to gain insights into usage patterns and costs.
Role-Based Access Control (RBAC): Implement RBAC to ensure appropriate resource access and usage.
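
A resource monitor can be created and attached in a few statements; the credit quota, thresholds, and warehouse name below are illustrative only.

  CREATE RESOURCE MONITOR IF NOT EXISTS monthly_credit_cap
    WITH CREDIT_QUOTA = 1000
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS
      ON 75 PERCENT DO NOTIFY     -- warn platform and finance owners early
      ON 100 PERCENT DO SUSPEND;  -- stop new queries once the quota is reached

  ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_credit_cap;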

3. Storage Management

Is storage managed efficiently? While storage is typically a smaller portion of Snowflake costs, it’s still important to manage efficiently:

Data Lifecycle Management: Implement policies for data retention and archiving.
Time Travel and Fail-safe: Optimize usage of Time Travel and Fail-safe features based on your data recovery needs.
Zero-copy Cloning: Leverage zero-copy cloning for testing and development to avoid duplicating storage costs.
Data Compression: Use appropriate compression methods to reduce storage requirements.
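
A short sketch of the Time Travel and zero-copy cloning points above, with assumed table names:

  -- Reduce Time Travel retention for a large, easily reproducible table.
  ALTER TABLE sales.orders SET DATA_RETENTION_TIME_IN_DAYS = 1;

  -- Clone for development without duplicating storage; costs accrue only
  -- as the clone's data diverges from the source.
  CREATE TABLE sales.orders_dev CLONE sales.orders;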

4. Data Sharing and Marketplace

Are data sharing and marketplace usage optimized?

Secure Data Sharing: Leverage Snowflake’s secure data sharing to reduce data movement and associated costs.
Marketplace Considerations: Carefully evaluate the costs and benefits of data sets or applications from Snowflake Marketplace.

Implementing FinOps Practices

To master Snowflake cost management, consider these FinOps practices:

1. Visibility and Reporting

Implement comprehensive tagging strategies for resources.
Create custom dashboards using Snowsight or third-party BI tools for cost visualization.
Set up alerts for unusual spending patterns or budget overruns.

2. Optimization

Regularly review and optimize warehouse configurations and query performance.
Implement automated processes to identify and optimize high-cost queries or inefficient warehouses.
Foster a culture of cost awareness among data analysts, engineers, and scientists.

3. Governance

Establish clear policies for warehouse creation, data ingestion, and resource provisioning.
Implement approval workflows for high-cost operations or large-scale data imports.
Create and enforce organizational policies to prevent costly misconfigurations.

Setting Up Guardrails

Implementing guardrails is crucial to prevent unexpected costs:

Resource Monitors: Set up resource monitors with actions (suspend or notify) when thresholds are reached.
Warehouse Size Limits: Establish policies on maximum warehouse sizes for different user groups.
Query Timeouts: Configure appropriate query timeouts to prevent runaway queries.
Data Retention Policies: Implement automated data retention and archiving policies.
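
Query timeouts, for instance, can be enforced per warehouse; the values below are illustrative.

  ALTER WAREHOUSE analytics_wh SET
    STATEMENT_TIMEOUT_IN_SECONDS = 3600          -- cancel queries running over 1 hour
    STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600;   -- cancel queries queued over 10 minutes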

The Need for Automated Observability and FinOps Solutions

Given the complexity of modern data operations, automated solutions can significantly enhance cost management efforts. Automated observability and FinOps solutions can provide the following:

Real-time cost visibility across your entire Snowflake environment.
Automated recommendations for query optimization and warehouse right-sizing.
Anomaly detection to quickly identify unusual spending patterns.
Predictive analytics to forecast future costs and resource needs.

These solutions can offer insights that would be difficult or impossible to obtain manually, helping you make data-driven decisions about your Snowflake usage and costs.

Snowflake-Specific Cost Optimization Techniques

Cluster Keys: Properly define cluster keys to improve data clustering and query performance.
Search Optimization: Use search optimization service for tables with frequent point lookup queries.
Multi-cluster Warehouses: Leverage multi-cluster warehouses for concurrency without over-provisioning.
Resource Classes: Utilize resource classes to manage priorities and costs for different workloads.
Snowpipe: Consider Snowpipe for continuous, cost-effective data ingestion.
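
Hedged examples of the first three techniques, against assumed tables and a warehouse (multi-cluster warehouses are only available on editions that support them):

  -- Define a clustering key on columns that are frequently used in filters.
  ALTER TABLE analytics.events CLUSTER BY (event_date, customer_id);

  -- Enable search optimization for selective point-lookup queries.
  ALTER TABLE analytics.customers ADD SEARCH OPTIMIZATION;

  -- Absorb concurrency spikes with additional clusters instead of
  -- permanently over-sizing a single warehouse.
  ALTER WAREHOUSE analytics_wh SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3
    SCALING_POLICY = 'STANDARD';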

Conclusion

Effective Snowflake cost management and FinOps require a holistic approach considering all aspects of your data operations. By optimizing compute resources, managing storage efficiently, implementing robust governance, and leveraging Snowflake-specific features, you can ensure that your Snowflake investment delivers maximum value to your organization.

Remember, the goal isn’t just to reduce costs, but to optimize spend and align it with business objectives. With the right strategies and tools in place, you can transform cost management from a challenge into a competitive advantage, enabling your organization to make the most of Snowflake’s powerful capabilities while maintaining control over expenses.

By continuously monitoring, optimizing, and governing your Snowflake usage, you can achieve a balance between performance, flexibility, and cost-efficiency, ultimately driving better business outcomes through data-driven decision-making.

To learn more about how Unravel can help optimize your Snowflake cost, request a health check report, view a self-guided product tour, or request a personalized demo.

Databricks Code Optimization


The Complexities of Code Optimization in Databricks: Problems, Challenges and Solutions

Databricks, with its unified analytics platform, offers powerful capabilities for big data processing and machine learning. However, with great power comes great responsibility – and in this case, the responsibility of efficient code optimization.

This blog post explores the complexities of code optimization in Databricks across SQL, Python, and Scala, the difficulties in diagnosing and resolving issues, and how automated solutions can simplify this process.

The Databricks Code Optimization Puzzle

1. Spark SQL Optimization Challenges

Problem: Inefficient Spark SQL queries can lead to excessive shuffling, out-of-memory errors, and slow execution times.

Diagnosis Challenge: Identifying the root cause of a slow Spark SQL query is complex. Is it due to poor join conditions, suboptimal partitioning, or inefficient use of Spark’s catalyst optimizer? The Spark UI provides a wealth of information, but parsing through stages, tasks, and shuffles to pinpoint the exact issue requires deep expertise and time.

Resolution Difficulty: Optimizing Spark SQL often involves a delicate balance. Techniques like broadcast joins might work well for small-to-medium datasets but fail spectacularly for large ones. Each optimization technique needs to be carefully tested across various data scales, which is time-consuming and can lead to performance regressions if not done meticulously.
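
For instance, a broadcast join hint is one of the techniques described above; the sketch below uses assumed fact and dimension tables and, as noted, only helps when the broadcast side is genuinely small.

  -- Broadcasting the small dimension table avoids shuffling the large fact table.
  SELECT /*+ BROADCAST(d) */
    f.order_id,
    d.region_name,
    f.amount
  FROM sales.fact_orders f
  JOIN sales.dim_region d
    ON f.region_id = d.region_id;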

2. Python UDF Performance Issues

Problem: While Python UDFs (User-Defined Functions) offer flexibility, they can be a major performance bottleneck due to serialization overhead and lack of Spark’s optimizations.

Diagnosis Challenge: The impact of Python UDFs isn’t always immediately apparent. A UDF that works well on a small dataset might become a significant bottleneck as data volumes grow. Identifying which UDFs are causing issues and why requires careful profiling and analysis of Spark jobs.

Resolution Difficulty: Optimizing Python UDFs often involves rewriting them in Scala or using Pandas UDFs, which requires a different skill set. Balancing between code readability, maintainability, and performance becomes a significant challenge, especially in teams with varying levels of Spark expertise.

3. Scala Code Complexity

Problem: While Scala can offer performance benefits, complex Scala code can lead to issues like object serialization problems, garbage collection pauses, and difficulty in code maintenance.

Diagnosis Challenge: Scala’s powerful features, like lazy evaluation and implicits, can make it difficult to trace the execution flow and identify performance bottlenecks. Issues like serialization problems might only appear in production environments, making them particularly challenging to diagnose.

Resolution Difficulty: Optimizing Scala code often requires a deep understanding of both Scala and the internals of Spark. Solutions might involve changing fundamental aspects of the code structure, which can be risky and time-consuming. Balancing between idiomatic Scala and Spark-friendly code is an ongoing challenge.

4. Memory Management Across Languages

Problem: Inefficient memory management, particularly in long-running Spark applications, can lead to out-of-memory errors or degraded performance over time.

Diagnosis Challenge: Memory issues in Databricks can be particularly elusive. Is the problem in the JVM heap, off-heap memory, or perhaps in Python’s memory management? Understanding the interplay between Spark’s memory management and the specifics of SQL, Python, and Scala requires expertise in all these areas.

Resolution Difficulty: Resolving memory issues often involves a combination of code optimization, configuration tuning, and sometimes fundamental architectural changes. This process can be lengthy and may require multiple iterations of testing in production-like environments.

The Manual Optimization Struggle

Traditionally, addressing these challenges involves a cycle of:


1. Manually analyzing Spark UI, logs, and metrics
2. Conducting time-consuming performance tests across various data scales
3. Carefully refactoring code and tuning configurations
4. Monitoring the impact of changes across different workloads and data sizes
5. Rinse and repeat

This process is not only time-consuming but also requires a rare combination of skills across SQL, Python, Scala, and Spark internals. Even for experts, keeping up with the latest best practices and Databricks features is an ongoing challenge.

Leveraging Automation for Databricks Optimization

Given the complexities and ongoing nature of these challenges, many organizations are turning to automated solutions to streamline their Databricks optimization efforts. Tools like Unravel can help by:

1. Cross-Language Performance Monitoring: Automatically tracking performance metrics across SQL, Python, and Scala code in a unified manner.

2. Intelligent Bottleneck Detection: Using machine learning to identify performance bottlenecks, whether they’re in SQL queries, Python UDFs, or Scala code.

3. Root Cause Analysis: Quickly pinpointing the source of performance issues, whether they’re related to code structure, data skew, or resource allocation.

4. Code-Aware Optimization Recommendations: Providing language-specific suggestions for code improvements, such as replacing Python UDFs with Pandas UDFs or optimizing Scala serialization.

5. Predictive Performance Modeling: Estimating the impact of code changes across different data scales before deployment.

6. Automated Tuning: In some cases, automatically adjusting Spark configurations based on workload patterns and performance goals.

By leveraging such automated solutions, data teams can focus their expertise on building innovative data products while ensuring their Databricks environment remains optimized and cost-effective. Instead of spending hours digging through Spark UIs and log files, teams can quickly identify and resolve issues, or even prevent them from occurring in the first place.

Conclusion

Code optimization in Databricks is a multifaceted challenge that spans across SQL, Python, and Scala. While the problems are complex and the manual diagnosis and resolution process can be daunting, automated solutions offer a path to simplify and streamline these efforts. By leveraging such tools, organizations can more effectively manage their Databricks performance, improve job reliability, and allow their data teams to focus on delivering value rather than constantly battling optimization challenges.

Remember, whether you’re using manual methods or automated tools, optimization in Databricks is an ongoing process. As your data volumes grow and processing requirements evolve, staying on top of performance management will ensure that your Databricks implementation continues to deliver the insights and data products your business needs, efficiently and reliably.

To learn more about how Unravel can help with Databricks code optimization, request a health check report, view a self-guided product tour, or request a demo.

Snowflake Code Optimization


The Complexities of Code Optimization in Snowflake: Problems, Challenges, and Solutions

In the world of Snowflake data warehousing, code optimization is crucial for managing costs and ensuring efficient resource utilization. However, this process is fraught with challenges that can leave even experienced data teams scratching their heads.

This blog post explores the complexities of code optimization in Snowflake, the difficulties in diagnosing and resolving issues, and how automated solutions can simplify this process.

The Snowflake Code Optimization Puzzle

1. Inefficient JOIN Operations

Problem: Large table joins often lead to excessive data shuffling and prolonged query times, significantly increasing credit consumption.

Diagnosis Challenge: Pinpointing the exact cause of a slow JOIN is like finding a needle in a haystack. Is it due to poor join conditions, lack of proper clustering, or simply the volume of data involved? The query plan might show a large data shuffle, but understanding why it’s happening and how to fix it requires deep expertise and time-consuming investigation.

Resolution Difficulty: Optimizing JOINs often involves a trial-and-error process. You might need to experiment with different join types, adjust clustering keys, or even consider restructuring your data model. Each change requires careful testing to ensure it doesn’t negatively impact other queries or downstream processes.

2. Suboptimal Data Clustering

Problem: Poor choices in clustering keys lead to inefficient data access patterns, increasing query times and, consequently, costs.

Diagnosis Challenge: The effects of suboptimal clustering are often subtle and vary depending on query patterns. A clustering key that works well for one set of queries might be terrible for another. Identifying the root cause requires analyzing a wide range of queries over time, a task that’s both time-consuming and complex.

Resolution Difficulty: Changing clustering keys is not a trivial operation. It requires careful planning, as it can temporarily increase storage costs and impact query performance during the re-clustering process. Determining the optimal clustering strategy often requires extensive A/B testing and monitoring.
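
One way to make this analysis concrete is Snowflake’s clustering information function; the table and candidate key below are assumptions.

  -- Inspect how well the table is clustered for a candidate key.
  SELECT SYSTEM$CLUSTERING_INFORMATION('analytics.events', '(event_date)');

  -- If the analysis supports it, define or change the clustering key.
  ALTER TABLE analytics.events CLUSTER BY (event_date);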

3. Inefficient Use of UDFs

Problem: While powerful, User-Defined Functions (UDFs) can lead to unexpected performance issues and increased credit consumption if not used correctly.

Diagnosis Challenge: UDFs are often black boxes from a performance perspective. Traditional query profiling tools might show that a UDF is slow, but they can’t peer inside to identify why. This opacity makes it extremely difficult to pinpoint the root cause of UDF-related performance issues.

Resolution Difficulty: Optimizing UDFs often requires rewriting them from scratch, which can be time-consuming and risky. You might need to balance between UDF performance and maintainability, and in some cases, completely rethink your approach to the problem the UDF was solving.

4. Complex, Monolithic Queries

Problem: Large, complex queries can be incredibly difficult to optimize and may not leverage Snowflake’s MPP architecture effectively, leading to increased execution times and costs.

Diagnosis Challenge: Understanding the performance characteristics of a complex query is like solving a multidimensional puzzle. Each part of the query interacts with others in ways that can be hard to predict. Traditional query planners may struggle to provide useful insights for such queries.

Resolution Difficulty: Optimizing complex queries often requires breaking them down into smaller, more manageable parts. This process can be incredibly time-consuming and may require significant refactoring of not just the query, but also the surrounding ETL processes and downstream dependencies.
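
A common decomposition pattern is to materialize intermediate results, for example as temporary tables, so each stage can be profiled and tuned separately; a minimal sketch with assumed tables:

  -- Stage the expensive filter once.
  CREATE TEMPORARY TABLE recent_orders AS
  SELECT order_id, customer_id, amount
  FROM sales.orders
  WHERE order_date >= DATEADD(day, -30, CURRENT_DATE);

  -- Downstream aggregations read the smaller staged result.
  SELECT c.segment, SUM(r.amount) AS total_amount
  FROM recent_orders r
  JOIN sales.customers c
    ON r.customer_id = c.customer_id
  GROUP BY c.segment;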

The Manual Optimization Struggle

Traditionally, addressing these challenges involves a cycle of:

1. Manually sifting through query histories and execution plans
2. Conducting time-consuming A/B tests
3. Carefully monitoring the impact of changes across various workloads
4. Rinse and repeat

This process is not only time-consuming but also prone to human error. It requires deep expertise in Snowflake’s architecture, SQL optimization techniques, and your specific data model. Even then, optimizations that work today might become inefficient as your data volumes and query patterns evolve.

The Power of Automation in Snowflake Optimization

Given the complexities and ongoing nature of these challenges, many organizations are turning to automated solutions to simplify and streamline their Snowflake optimization efforts. Tools like Unravel can help by:

Continuous Monitoring: Automatically tracking query performance, resource utilization, and cost metrics across your entire Snowflake environment.

Intelligent Analysis: Using machine learning algorithms to identify patterns and anomalies that might be missed by manual analysis.

Root Cause Identification: Quickly pinpointing the source of performance issues, whether they’re related to query structure, data distribution, or resource allocation.

Optimization Recommendations: Providing actionable suggestions for query rewrites, clustering key changes, and resource allocation adjustments.

Impact Prediction: Estimating the potential performance and cost impacts of suggested changes before you implement them.

Automated Tuning: In some cases, automatically applying optimizations based on predefined rules and thresholds.

By leveraging such automated solutions, data teams can focus their expertise on higher-value tasks while ensuring their Snowflake environment remains optimized and cost-effective. Instead of spending hours digging through query plans and execution logs, teams can quickly identify and resolve issues, or even prevent them from occurring in the first place.

Conclusion

Code optimization in Snowflake is a complex, ongoing challenge that requires continuous attention and expertise. While the problems are multifaceted and the manual diagnosis and resolution process can be daunting, automated solutions offer a path to simplify and streamline these efforts. By leveraging such tools, organizations can more effectively manage their Snowflake costs, improve query performance, and allow their data teams to focus on delivering value rather than constantly fighting optimization battles.

Remember, whether you’re using manual methods or automated tools, optimization is an ongoing process. As your data volumes grow and query patterns evolve, staying on top of performance and cost management will ensure that your Snowflake implementation continues to deliver the insights your business needs, efficiently and cost-effectively.

To learn more about how Unravel can help optimize your code in Snowflake, request a health check report, view a self-guided product tour, or request a demo.

Configuration Management in Modern Data Platforms | https://www.unraveldata.com/resources/configuration-management-in-modern-data-platforms/ | Wed, 06 Nov 2024

Navigating the Maze of Configuration Management in Modern Data Platforms: Problems, Challenges and Solutions

In the world of big data, configuration management is often the unsung hero of platform performance and cost-efficiency. Whether you’re working with Snowflake, Databricks, BigQuery, or any other modern data platform, effective configuration management can mean the difference between a sluggish, expensive system and a finely-tuned, cost-effective one.

This blog post explores the complexities of configuration management in data platforms, the challenges in optimizing these settings, and how automated solutions can simplify this critical task.

The Configuration Conundrum

1. Cluster and Warehouse Sizing

Problem: Improper sizing of compute resources (like Databricks clusters or Snowflake warehouses) can lead to either performance bottlenecks or unnecessary costs.

Diagnosis Challenge: Determining the right size for your compute resources is not straightforward. It depends on workload patterns, data volumes, and query complexity, all of which can vary over time. Identifying whether performance issues or high costs are due to improper sizing requires analyzing usage patterns across multiple dimensions.

Resolution Difficulty: Adjusting resource sizes often involves a trial-and-error process. Too small, and you risk poor performance; too large, and you’re wasting money. The impact of changes may not be immediately apparent and can affect different workloads in unexpected ways.
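
To make this trade-off concrete, a rough back-of-the-envelope estimate is often a useful starting point before any resizing experiment. The Python sketch below is illustrative only: it assumes Snowflake's standard credits-per-hour ladder by warehouse size, while the credit price, cluster count, and hours of uptime are placeholder assumptions you would replace with your own contract and usage numbers.

```python
# Illustrative Snowflake warehouse cost estimate -- a sketch, not a vendor pricing tool.
# Credits per hour follow Snowflake's standard size ladder; credit price and usage
# figures are assumptions to replace with your own contract and workload numbers.

CREDITS_PER_HOUR = {
    "XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128,
}

def monthly_warehouse_cost(size: str, clusters: int, hours_per_day: float,
                           credit_price_usd: float = 3.0, days: int = 30) -> float:
    """Estimate monthly spend for a warehouse that runs `clusters` clusters
    for `hours_per_day` hours per day."""
    credits = CREDITS_PER_HOUR[size] * clusters * hours_per_day * days
    return credits * credit_price_usd

# Example: a Large, 2-cluster warehouse resumed 10 hours/day at $3/credit
print(monthly_warehouse_cost("L", clusters=2, hours_per_day=10))  # 8*2*10*30*3 = 14400.0
```

Even a simple model like this shows how quickly an oversized or always-on warehouse compounds into a large monthly bill, which is why sizing changes deserve measurement rather than guesswork.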

2. Caching and Performance Optimization Settings

Problem: Suboptimal caching strategies and performance settings can lead to repeated computations and slow query performance.

Diagnosis Challenge: The effectiveness of caching and other performance optimizations can be highly dependent on specific workload characteristics. Identifying whether poor performance is due to cache misses, inappropriate caching strategies, or other factors requires deep analysis of query patterns and platform-specific metrics.

Resolution Difficulty: Tuning caching and performance settings often requires a delicate balance. Aggressive caching might improve performance for some queries while causing staleness issues for others. Each adjustment needs to be carefully evaluated across various workload types.

3. Security and Access Control Configurations

Problem: Overly restrictive security settings can hinder legitimate work, while overly permissive ones can create security vulnerabilities.

Diagnosis Challenge: Identifying the root cause of access issues can be complex, especially in platforms with multi-layered security models. Is a performance problem due to a query issue, or is it because of an overly restrictive security policy?

Resolution Difficulty: Adjusting security configurations requires careful consideration of both security requirements and operational needs. Changes need to be thoroughly tested to ensure they don’t inadvertently create security holes or disrupt critical workflows.

4. Cost Control and Resource Governance

Problem: Without proper cost control measures, data platform expenses can quickly spiral out of control.

Diagnosis Challenge: Understanding the cost implications of various platform features and usage patterns is complex. Is a spike in costs due to inefficient queries, improper resource allocation, or simply increased usage?

Resolution Difficulty: Implementing effective cost control measures often involves setting up complex policies and monitoring systems. It requires balancing cost optimization with the need for performance and flexibility, which can be a challenging trade-off to manage.

The Manual Configuration Management Struggle

Traditionally, managing these configurations involves:

1. Continuously monitoring platform usage, performance metrics, and costs
2. Manually adjusting configurations based on observed patterns
3. Conducting extensive testing to ensure changes don’t negatively impact performance or security
4. Constantly staying updated with platform-specific best practices and new features
5. Repeating this process as workloads and requirements evolve

This approach is not only time-consuming but also reactive. By the time an issue is noticed and diagnosed, it may have already impacted performance or inflated costs. Moreover, the complexity of modern data platforms means that the impact of configuration changes can be difficult to predict, leading to a constant cycle of tweaking and re-adjusting.

Embracing Automation in Configuration Management

Given these challenges, many organizations are turning to automated solutions to manage and optimize their data platform configurations. Platforms like Unravel can help by:

Continuous Monitoring: Automatically tracking resource utilization, performance metrics, and costs across all aspects of the data platform.

Intelligent Analysis: Using machine learning to identify patterns and anomalies in platform usage and performance that might indicate configuration issues.

Predictive Optimization: Suggesting configuration changes based on observed usage patterns and predicting their impact before implementation.

Automated Adjustment: In some cases, automatically adjusting configurations within predefined parameters to optimize performance and cost.

Policy Enforcement: Helping to implement and enforce governance policies consistently across the platform.

Cross-Platform Optimization: For organizations using multiple data platforms, providing a unified view and consistent optimization approach across different environments.

By leveraging automated solutions, data teams can shift from a reactive to a proactive configuration management approach. Instead of constantly fighting fires, teams can focus on strategic initiatives while ensuring their data platforms remain optimized, secure, and cost-effective.

Conclusion

Configuration management in modern data platforms is a complex, ongoing challenge that requires continuous attention and expertise. While the problems are multifaceted and the manual management process can be overwhelming, automated solutions offer a path to simplify and streamline these efforts.

By embracing automation in configuration management, organizations can more effectively optimize their data platform performance, enhance security, control costs, and free up their data teams to focus on extracting value from data rather than endlessly tweaking platform settings.

Remember, whether using manual methods or automated tools, effective configuration management is an ongoing process. As your data volumes grow, workloads evolve, and platform features update, staying on top of your configurations will ensure that your data platform continues to meet your business needs efficiently and cost-effectively.

To learn more about how Unravel can help manage and optimize your data platform configurations with Databricks, Snowflake, and BigQuery: request a health check report, view a self-guided product tour, or request a demo.

AI Agents: Empower Data Teams With Actionability™ for Transformative Results | https://www.unraveldata.com/resources/ai-agents-empower-data-teams-with-actionability-for-transformative-results/ | Thu, 15 Aug 2024

AI Agents for Data Teams

Data is the driving force of the world’s modern economies, but data teams are struggling to meet demand to support generative AI (GenAI), including rapid data volume growth and the increasing complexity of data pipelines. More than 88% of software engineers, data scientists, and SQL analysts surveyed say they are turning to AI for more effective bug-fixing and troubleshooting. And 84% of engineers who use AI said it frees up their time to focus on high-value activities.

AI Agents represent the next wave of AI innovation and have arrived just in time to help data teams make more efficient use of their limited bandwidth to build, operate, and optimize data pipelines and GenAI applications on modern data platforms.

Data Teams Grapple with High Demand for GenAI

A surge in adoption of new technologies such as GenAI is putting tremendous pressure on data teams, leading to broken apps and burnout. In order to support new GenAI products, data teams must deliver more production data pipelines and data apps, faster. The result is that data teams have too much on their plates, the pipelines are too complex, there is not enough time, and not everyone has the deep tech skills required. No surprise that 70% of organizations have difficulty integrating data into AI models and only 48% of AI projects get deployed into production.

Understanding AI Agents

Defining AI Agents

AI agents are software-based systems that gather information, recommend actions, and initiate and complete tasks in collaboration with or on behalf of humans to achieve a goal. AI agents can act independently using components like perception and reasoning, provide step-by-step guidance that augments human abilities, or supply supporting information for complex human-led tasks. AI agents play a crucial role in automating tasks, simplifying data-driven decision-making, and achieving greater productivity and efficiency.

How AI Agents Work

AI agents operate by leveraging a wide range of data sources and signals, using algorithms and data processing to identify anomalies and actions, and then interacting with their environment and users to achieve specific goals effectively. AI agents can achieve >90% accuracy, primarily driven by the reliability, volume, and variety of input data and telemetry to which they have access.

Types of Intelligent Agents

  • Reactive and proactive agents are two primary categories of intelligent agents.
  • Some agents perform work for you, while others help complete tasks with you or provide information to support your work.
  • Each type of intelligent agent has distinct characteristics and applications tailored to specific functions, enhancing productivity and efficiency.

AI for Data Driven Organizations

Enhancing Decision Making

AI agents empower teams by improving data-supported decision-making processes for you, with you, or by you. Examples of how AI agents act on your behalf include reducing toil and handling routine decisions based on AI insights. In various industries, AI agents optimize decision-making and provide recommendations to support your decisions. For complex tasks, AI agents provide the supporting information needed to build data pipelines, write SQL queries, and partition data.

Benefits of Broader Telemetry Sources for AI Agents

Integrating telemetry from various platforms and systems enhances AI agents’ ability to provide accurate recommendations. Incorporating AI agents into root cause analysis (RCA) systems offers significant benefits. Meta’s AI-based root cause analysis system shows how AI agents enhance tools and applications.

Overcoming Challenges

Enterprises running modern data stacks face common challenges like high costs, slow performance, and impaired productivity. Leveraging AI agents can automate tasks for you, with you, and by you. Unravel customers such as Equifax, Maersk, and Novartis have successfully overcome these challenges using AI.

The Value of AI Agents for Data Teams

Reducing Costs

When implementing AI agents, businesses benefit from optimized data stacks, reducing operational costs significantly. These agents continuously analyze telemetry data, adapting to new information dynamically. Unravel customers have successfully leveraged AI to achieve operational efficiency and cost savings.

Accelerating Performance

Performance is crucial in data analytics, and AI agents play a vital role in enhancing it. By utilizing these agents, enterprise organizations can make well-informed decisions promptly. Unravel customers have experienced accelerated data analytics performance through the implementation of AI technologies.

Improving Productivity

AI agents are instrumental in streamlining processes within businesses, leading to increased productivity levels. By integrating these agents into workflows, companies witness substantial productivity gains. Automating repetitive tasks with AI agents simplifies troubleshooting and boosts overall productivity and efficiency.

Future Trends in AI Agents for FinOps, DataOps, and Data Engineering

Faster Innovation with AI Agents

By 2026, conversational AI is projected to reduce agent labor costs by $80 billion. AI agents are advancing, providing accurate recommendations that address more issues automatically, which frees your team to focus on innovation. For example, companies like Meta use AI agents to simplify root cause analysis (RCA) for complex applications.

Accelerated Data Pipelines with AI Agents

Data processing is shifting towards real-time analytics, enabling faster revenue growth. However, this places higher demands on data teams. Equifax leverages AI to serve over 12 million daily requests in near real time.

Improved Data Analytics Efficiency with AI Agents

Data management is the fastest-growing segment of cloud spending. In the cloud, time is money; faster data processing reduces costs. One of the world's largest logistics companies improved efficiency by up to 70% in just 6 months using Unravel's AI recommendations.

Empower Your Team with AI Agents

Harnessing the power of AI agents can revolutionize your business operations, enhancing efficiency, decision-making, and customer experiences. Embrace this technology to stay ahead in the competitive landscape and unlock new opportunities for growth and innovation.

Learn more about our FinOps Agent, DataOps Agent, and Data Engineering Agent.

How to Stop Burning Money (or at Least Slow the Burn) | https://www.unraveldata.com/resources/how-to-stop-burning-money-or-at-least-slow-the-burn/ | Tue, 25 Jun 2024

A Recap of the Unravel Data Customer Session at Gartner Data & Analytics Summit 2024

The Gartner Data & Analytics Summit 2024 (“D&A”) in London is a pivotal event for Chief Data Analytics Officers (CDAOs) and data and analytics leaders, drawing together a global community eager to dive deep into the insights, strategies, and frameworks that are shaping the future of data and analytics. With an unprecedented assembly of over 3,000 attendees, spanning 150+ knowledge-packed sessions and joined by 90+ innovative partners, the D&A Summit was designed to catalyze data transformation within organizations, offering attendees unique insights to think big and drive real change.

The Unravel Data customer session, titled “How to Stop Burning Money (or At Least Slow the Burn)”, emerged as a highlight of the D&A Summit, drawing attention to the pressing need for cost-efficient data processing in today’s rapid digital evolution. The session, presented by one of the largest logistics companies in the world, with over 100,000 employees and a fleet of 700+ container vessels operating across 130 countries, captivated more than 150 attendees, including CDAOs and data and analytics leaders. The audience represented 140+ companies across 30+ industries such as banking, retail, and pharma, spanning 110+ cities across 20+ countries. This compelling turnout underscored the universal relevance and urgency of cost-optimization strategies in data engineering. Watch the highlight reel here.

The session was presented by Peter Rees, Director of GenAI, AI, Data and Integration at Maersk, who garnered unprecedented accolades, including a 170% higher-than-average attendance. Peter Rees’ session was the third highest-rated of all 40 partner sessions. These results reflect the session’s relevance and the invaluable insights shared on revolutionizing the efficiency of data processing pipelines, cost allocation, and optimization techniques.

The Gartner Data & Analytics Summit 2024, and particularly the Unravel Data customer session, brought together organizations striving to align their data engineering costs with the value of their data and analytics. Unravel Data’s innovative approach, showcased through the success of a world-leading logistics company, provides a blueprint for organizations across industries looking to dramatically enhance the speed, productivity, and efficiency of their data processing and AI investments.

We invite you to explore how your organization can benefit from Unravel Data’s groundbreaking platform. Take the first step towards transforming your data processing strategy by scheduling an Unravel Data Health Check. Embark on your journey towards optimal efficiency and cost savings today.

Unravel Data was Mentioned in a Graphic source in the 2024 Gartner® Report | https://www.unraveldata.com/resources/unravel-data-was-mentioned-in-a-graphic-source-in-the-2024-gartner-report/ | Tue, 21 May 2024

In a recently published report, “Beyond FinOps: Optimizing Your Public Cloud Costs”, Gartner shares a graphic which is adapted from Unravel.

Chart: Sources of Inefficiency in Cloud D&A Platforms (adapted from Unravel).

Unravel’s Perspective

How FinOps Helps

Unravel helps organizations adopt a FinOps approach to improve cloud data spending efficiency. FinOps helps organizations address overspend, including under- and over-provisioned cloud services, suboptimal architectures, and inefficient pricing strategies. Infrastructure and operations (I&O) leaders and practitioners can use FinOps principles to optimize cloud service design, configuration, and spending commitments to reduce costs. Data and analytics (D&A) leaders and teams are using FinOps to achieve predictable spend for their cloud data platforms.

Introducing Cloud Financial Management and Governance

Cloud Financial Management (CFM) and Governance empowers organizations to quickly adapt in today’s dynamic landscape, ensuring agility and competitiveness. CFM principles help organizations take advantage of the cloud’s variable consumption model through purpose-built tools tailored to modern cloud finance needs. A well-defined cloud cost governance model lets cloud users monitor and manage their cloud costs and balance cloud spending against performance and end-user experience.

Three Keys to Optimizing Your Modern Data Stack

  1. Cloud data platform usage analysis plays a crucial role in cloud financial management by providing insights into usage patterns, cost allocation, and optimization opportunities. By automatically gathering, analyzing, and correlating data from various sources, such as traces, logs, metrics, source code, and configuration files, organizations can identify areas for cost savings and efficiency improvements. Unravel’s purpose-built AI reduces the toil required to manually examine metrics, such as resource utilization, spending trends, and performance benchmarks for modern data stacks such as Databricks, Snowflake, and BigQuery.
  2. Cost allocation, showback, and chargeback are essential for effective cloud cost optimization. Organizations need to accurately assign costs to different departments or projects based on their actual resource consumption. This ensures accountability and helps in identifying areas of overspending or underutilization. Automated tools and platforms like Unravel can streamline the cost allocation process for cloud data platforms such as Databricks, Snowflake, and BigQuery, making it easier to track expenses and optimize resource usage.
  3. Budget forecasting and management is another critical aspect of cloud financial management. By analyzing historical data and usage trends, organizations can predict future expenses more accurately. This enables them to plan budgets effectively, avoid unexpected costs, and allocate resources efficiently. Implementing robust budget forecasting processes can help organizations achieve greater financial control and optimize their cloud spending.

Next Steps

You now grasp the essence of cloud financial management and governance to optimize cloud spending and your cloud data platform. Start your journey and embrace these concepts to enhance your cloud strategy and drive success. Take charge of your Databricks, Snowflake, and BigQuery optimization today with a free health check.

Gartner, Beyond FinOps: Optimizing Your Public Cloud Costs, By Marco Meinardi, 21 March 2024

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Maersk data leaders speaking at Gartner Data & Analytics Summit London | https://www.unraveldata.com/resources/maersk-data-leaders-speaking-at-gartner-data-analytics-summit-london/ | Wed, 24 Apr 2024

Palo Alto, CA – Apr. 24, 2024 – Unravel Data is proud to announce A.P. Moller – Maersk, a global leader in container logistics, is participating at the upcoming Gartner Data & Analytics Summit. Scheduled from May 13-15, 2024, at ExCeL London, this summit is renowned for gathering visionaries and innovators in the realm of data and analytics.

Peter Rees, Lead Architect at Maersk, together with Mark Sear, Director of Insight Analytics, Data, and Integration at Maersk, will be leading a must-attend session titled “Unravel Data: How to stop burning money (or at least slow the burn).” This session is meticulously designed to address the escalating concern of unpredictable growth in data analytics costs which can significantly hinder the progression of data-driven innovation.

Session Overview:

Data analytics costs are spiraling, and businesses are searching for methodologies to efficiently manage and optimize these expenses without compromising on innovation or operational agility. Maersk, leveraging Unravel Data’s cutting-edge solutions, has pioneered a cost-optimization framework that not only streamlines development and data delivery processes but also aligns with the enterprise’s mission to deliver a more connected, agile, and sustainable future for global logistics.

During the session, attendees will gain exclusive insights into how Maersk has successfully harnessed the power of Unravel Data within its infrastructure to ensure the business remains at the forefront of cost efficiency while bolstering its data-driven decision-making capabilities.

Speaker Highlights:

  • Peter Rees, as Maersk’s Lead Architect specializing in Enterprise Data & AI/ML, brings a wealth of knowledge in data mesh and event-driven architectures. His extensive track record in AI strategies and analytics, complemented by an innovative mindset, positions him as a cornerstone in the conversation on bridging data technology with business value.
  • Mark Sear elevates the discourse with his profound expertise in digital transformation and business intelligence as Maersk’s Director of Insight Analytics, Data, and Integration. Mark’s academic and professional achievements underscore his commitment to leveraging data for actionable insights, thus fostering strategic business growth and operational efficiencies.

Event Details:

  • What: Gartner Data & Analytics Summit
  • When: May 13-15, 2024
  • Where: ExCeL London, UK
  • Session Title: Unravel Data: How to stop burning money (or at least slow the burn)

Unravel Data invites all attendees who are looking to navigate the challenges of data analytics cost growth to join this session. It promises to be an enlightening exploration of practical solutions, real-world applications, and visionary strategies for any organization aiming to optimize their data-driven initiatives and investments.

About A.P. Moller – Maersk:
A.P. Moller – Maersk is an integrated container logistics company working to connect and simplify its customers’ supply chains. As a global leader in shipping services, the company operates in 130 countries and employs over 80,000 people. For further information, visit www.maersk.com.

About Unravel Data:
Unravel’s automated, AI-powered data observability + FinOps provides 360° visibility to allocate costs with granular precision, accurately predict spend, run 50% more workloads at the same budget, launch new apps 3X faster, and reliably hit greater than 99% of SLAs. For further information, visit www.unraveldata.com.

Media Contact:

Keith Alsheimer
CMO, Unravel Data
hello@unraveldata.com

Understanding BigQuery Cost | https://www.unraveldata.com/resources/understanding-bigquery-cost/ | Thu, 22 Feb 2024

This article explains the two pricing models (on-demand and capacity) for the compute component of BigQuery pricing, the challenges of calculating chargeback on compute, and what Unravel can do.

  1. On-demand compute pricing: You are charged based on the billed bytes used by your queries. So if the price is $X/TiB and a query uses Y-TiB of billed bytes, you will be billed $(X*Y) for that query.
  2. Capacity compute pricing: You buy slots and you are charged based on the number of slots and the time for which slots were made available. 

The following section describes compute pricing in more detail.

Capacity Pricing

To use capacity pricing, you start by creating reservations. You can create one or more reservations inside an admin project. All the costs related to the slots for the reservations will be attributed to the admin project in the bill.

Reservations

While creating a reservation, you need to specify “baseline slots” and “max slots,” in multiples of 100. You will be charged for the baseline slots at a minimum for the duration of the reservation. When all the baseline slots have been utilized by queries running in that reservation, more slots can be made available via autoscaling. Autoscaling happens in multiples of 100 slots. The number of slots available for autoscaling is (max slots – baseline slots).

You can modify the baseline and max slots after you create the reservation. You can increase the values at any point in time, but you cannot decrease the values within 1 hour of creating a reservation or updating the reservation.

Assignments

After you have created a reservation, to enable the queries in projects to use slots from that reservation, you have to create assignments. You can assign projects, folders, or organizations to a reservation. Whenever a query is started, it will first try to use the slots from this reservation. You can create or delete assignments at any point in time.

Pricing

BigQuery provides three editions: Standard, Enterprise, and Enterprise Plus. These editions have different capabilities and different pricing rates. The rates are defined in terms of slot-hours. For example, the rate is $0.04 per slot-hour for the Standard edition in the US and $0.06 per slot-hour for the Enterprise edition in the same region.

In capacity pricing, you are charged for the number of slots made available and the time for which slots are made available. Suppose you have a reservation with 100 baseline slots and 500 max slots in the Standard edition. Consider the following usage:

  • In the first hour, no queries are running, so the slot requirement is 0.
  • In the second hour, there are queries running, but the slot requirement is less than 100.
  • In the third hour, more queries are running and the slot requirement is 150.

In the first 2 hours, even though the slot requirement is less than 100, you will still be charged for 100 slots (the baseline slots) for each of the first 2 hours.

In the third hour, we need 50 more slots than the baseline, so autoscaling kicks in to provide more slots. Since autoscaling only scales up or down in multiples of 100, 100 more slots are added. Hence, a total of 200 slots (100 baseline + 100 from autoscaling) are made available in this hour. 

The number of slot-hours from this 3-hour period is 100 + 100 + 200 = 400. With a rate of $0.04 per slot-hour for Standard edition, you will be charged 0.04*400 = $16 for this usage.
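
The arithmetic above can be reproduced with a short sketch. This is an illustrative approximation rather than BigQuery's billing engine: it assumes autoscaling adds slots in 100-slot increments above the baseline, and the 80-slot demand in the second hour is a stand-in for "less than 100 slots."

```python
import math

def billed_slots(demand: int, baseline: int, max_slots: int) -> int:
    """Slots made available for one hour: at least the baseline, plus
    autoscaled slots in increments of 100, capped at max_slots."""
    if demand <= baseline:
        return baseline
    extra = math.ceil((demand - baseline) / 100) * 100
    return min(baseline + extra, max_slots)

rate = 0.04                    # $/slot-hour, Standard edition (US)
hourly_demand = [0, 80, 150]   # slot requirement in each of the three hours
slot_hours = sum(billed_slots(d, baseline=100, max_slots=500) for d in hourly_demand)
print(slot_hours, round(slot_hours * rate, 2))  # 400 slot-hours -> $16.0
```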

Pay-as-you-go

Recall that you can create/delete/update reservations and baseline/max slots whenever you want. Also, you will be charged for just the number of slots made available to you and for the time the slots are made available. This model is called pay-as-you-go, as you are paying only for what you use.

Capacity Commitment

If you expect to use a certain number of slots over a long period of time, you can make a commitment of X slots over a 1-year or 3-year period, for a better rate. You will be charged for those slots for the entire period regardless of whether you use them or not. This model is called capacity commitment.

Consider the following example of capacity commitment. Let’s say you have:

  • 1-year commitment of 1600 slots in the Enterprise edition. 
  • Created 1 reservation with max size of 1500 slots and baseline of 1000 slots. 
  • Hence your autoscaling slots are 1500-1000 = 500.
  • Pay-as-you-go price for enterprise edition is $0.06 per slot-hour.
  • 1-year commitment price for the enterprise edition is $0.048 per slot hour.

Consider this scenario:

  • In the first hour, the requirement is less than 1000 slots.
  • In the second hour, the requirement is 1200 slots.
  • In the third hour, the requirement is 1800 slots.

In the first hour, the baseline slots of 1000 are made available for the reservation; these slots are available from the commitment slots. Since we have a commitment of 1600 slots, all the 1600 slots are actually available. The 1000 slots are available for the reservation as baseline. The remaining 600 are called idle slots and are also charged. So for the first hour, we are charged for 1600 slots as per commitment price, with a cost of $(1600 * 0.048).

In the second hour, since the requirement is 1200 slots, there is an additional requirement of 200 slots beyond the baseline of 1000 slots. Since 600 idle slots are available from the committed capacity, the additional requirement of 200 slots will come from these idle slots, while the remaining 400 slots will remain idle. Notice that autoscaling was not needed in this case. Before going for autoscaling, BigQuery will try to use idle slots (unless ignore_idle_slots config is set to True for that reservation). So how much are we charged for the second hour? The answer is 1600 slots, since that is what is committed. These 1600 slots are charged as per the commitment price, so the cost for the second hour is $(1600 * 0.048).

In the third hour, the requirement is 1800 slots: the first 1600 slots will come from commitment slots, and the other 200 will now come from autoscaling slots. The 1600 slots will be charged as per 1-year commit pricing, and the 200 slots coming from autoscale slots will be charged as per pay-as-you-go pricing at $0.06/slot-hour in this case. Therefore, the cost for the third hour is $((1600 * 0.048) + (200 * 0.06)).
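
The same three-hour calculation can be sketched as follows. It approximates the logic described above: the 1,600 committed slots are billed every hour at the commitment rate whether fully used or not, and any demand above the commitment is assumed to autoscale in 100-slot increments at the pay-as-you-go rate. The 800-slot figure is a stand-in for "less than 1000 slots."

```python
import math

COMMIT_SLOTS = 1600   # 1-year commitment, Enterprise edition
COMMIT_RATE = 0.048   # $/slot-hour, 1-year commitment price
PAYG_RATE = 0.06      # $/slot-hour, pay-as-you-go price

def hourly_cost(demand: int) -> float:
    """Committed slots are billed whether used or not; demand above the
    commitment is served by autoscaling (100-slot increments) at PAYG rates."""
    autoscaled = 0
    if demand > COMMIT_SLOTS:
        autoscaled = math.ceil((demand - COMMIT_SLOTS) / 100) * 100
    return COMMIT_SLOTS * COMMIT_RATE + autoscaled * PAYG_RATE

for demand in (800, 1200, 1800):                  # hours 1, 2, and 3
    print(demand, round(hourly_cost(demand), 2))  # 76.8, 76.8, 88.8
```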

Notes

Some points to note regarding capacity pricing:

  1. The slots are charged with a maximum granularity of 1 second, and the charge is for a minimum of 1 minute.
  2. Autoscaling always happens in increments/decrements of 100 slots.
  3. Queries running in a reservation automatically use idle slots from other reservations within the same admin project, unless ignore_idle_slots is set to True for the reservation.
  4. The capacity commitment is specific to a region, organization, and edition. The idle slots can’t be shared across regions or editions.
  5. Idle slot capacity is not shared between reservations in different admin projects.

The Cost Chargeback Problem

In an organization, typically there are one or more capacity commitments, one or more admin projects, and multiple reservations in these admin projects. GCP provides billing data that gives you the hourly reservation costs for a given admin project, edition, and location combination. However, there are multiple possible ways to map a team to a reservation: a team can be assigned to one reservation, multiple teams can be assigned to the same reservation, or multiple teams can be assigned to multiple reservations.

In any case, it is a tricky task to find out which team or user is contributing how much to your cost. How do you map the different teams and users to the different projects, editions, and locations? And how do you then track the cost incurred by these teams and users? Common chargeback approaches, such as chargeback by accounts, projects, and reservations, simply cannot provide clarity at the user or team level. There is also no direct data source that gives this information from GCP.

Unravel provides cost chargeback at the user and team levels by combining data from different sources such as billing, apps, and tags. The crux of our proprietary approach is providing an accurate cost estimate at the query level. We then associate the query costs with users and teams (via tags) to derive the user-level or team-level chargeback. 
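
Once a cost estimate exists for every query, rolling it up to users and teams is a simple aggregation, as in the sketch below. The per-query records and tags shown here are purely hypothetical; the hard part, discussed next, is producing accurate per-query estimates in the first place.

```python
from collections import defaultdict

# Hypothetical per-query cost estimates: (user, team tag, estimated cost in $)
query_costs = [
    ("alice", "marketing", 4.20),
    ("bob",   "finance",   1.10),
    ("alice", "marketing", 0.75),
    ("carol", "finance",   2.40),
]

by_user, by_team = defaultdict(float), defaultdict(float)
for user, team, cost in query_costs:
    by_user[user] += cost
    by_team[team] += cost

print({u: round(c, 2) for u, c in by_user.items()})  # {'alice': 4.95, 'bob': 1.1, 'carol': 2.4}
print({t: round(c, 2) for t, c in by_team.items()})  # {'marketing': 4.95, 'finance': 3.5}
```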

Computing query-level cost estimates involves addressing a number of challenges. Some of the challenges include:

  • A query may have different stages where chargeback could be different.
  • A query may use slots with commitment pricing or pay-as-you-go pricing.
  • Capacity is billed at one minute minimum.
  • Autoscaling increments by multiples of 100 slots.
  • Chargeback policy for idle resources.

Let’s understand these challenges further with a few scenarios. In all the examples below, we assume that we use the Enterprise edition in the US region, with rates of $0.06/slot-hour for pay-as-you-go, and $0.048/slot-hour for a 1-year commitment.

Scenario 1: Slot-hours billed differently at different stages of a query 

Looking at the total slot-hours for chargeback could be misleading because the slot-hours at different stages of a query may be billed differently.

Consider the following scenario:

  • A reservation with a baseline of 0 slot and max 200 slots.
  • No capacity commitment.
  • Query Q1 runs from 5am to 6am with 200 slots for the whole hour.
  • Query Q2 runs from 6am to 8am in 2 stages:
    • In the first stage, from 6am to 7am, it uses 150 slots for the whole hour.
    • In the second stage, from 7am to 8am, it uses 50 slots for the whole hour.

Both queries use the exact same slot-hours total, i.e., 200 slot-hours, and use the same reservation and edition. Hence we may think the chargeback to both queries should be the same.

But if you look closely, the two queries do not incur the same amount of cost.

Q1 uses 200 slots for 1 hour. Given the reservation with a baseline of 0 slot and a max of 200 slots, 200 slots are made available in this hour, and the cost of the query is $(200*0.06) = $12.

In contrast, Q2’s usage is split into 150 slots for the first hour and 50 slots for the second hour. Since slots are autoscaled in increments of 100, to run Q2, 200 slots are made available in the first hour and 100 slots are made available in the second hour. The total slot-hours for Q2 is therefore 300, and the cost is $(300*0.06) = $18.

Summary: The cost chargeback to a query needs to account for how many slots are used in different stages of the query and not just the total slot-hours (or slot-ms) used.
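
The difference between the two queries can be reproduced with a short sketch, assuming a 0-slot baseline so that every stage is served by autoscaling in 100-slot increments and billed at the pay-as-you-go rate used above.

```python
import math

PAYG_RATE = 0.06  # $/slot-hour, Enterprise pay-as-you-go

def stage_slot_hours(slots_used: int, hours: float) -> float:
    """With a 0-slot baseline, autoscaling provisions slots in multiples of 100."""
    return math.ceil(slots_used / 100) * 100 * hours

q1 = stage_slot_hours(200, 1)                            # one 1-hour stage
q2 = stage_slot_hours(150, 1) + stage_slot_hours(50, 1)  # two 1-hour stages
print(q1, round(q1 * PAYG_RATE, 2))  # 200 slot-hours -> $12.0
print(q2, round(q2 * PAYG_RATE, 2))  # 300 slot-hours -> $18.0
```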

Scenario 2: Capacity commitment pricing vs. pay-as-you-go pricing from autoscaling

At different stages, a query may use slots from capacity commitment or from autoscaling that are charged at the pay-as-you-go price. 

Consider the following scenario:

  • 1 reservation with a baseline of 100 slots and max slots as 300 slots.
  • 1-year capacity commitment of 100 slots.
  • Query Q1 runs from 5am to 6am and uses 300 slots for the whole 1 hour.
  • Query Q2 runs from 6am to 8am in 2 stages. It uses 100 slots from 6am to 7am, and uses 200 slots from 7am to 8am.

Once again, the total slot-hours for both queries are the same, i.e., 300 slot hours, and we might chargeback the same cost to both queries.

But if you look closely, the queries do not incur the same amount of cost.

For Q1, 100 slots come from committed capacity and are charged at the 1-year commit price ($0.048/slot-hour), whereas 200 are autoscale slots that are charged at the pay-as-you-go price ($0.06/slot-hour). So the cost of Q1 is  $((100*0.048) + (200*0.06)) = $16.80.

For Q2, from 6-7am, 100 slots come from committed capacity and are charged at the 1-year commit price ($0.048/slot-hour), so the cost for 6-7am is $(100*0.048) = $4.80.

From 7-8am, 100 slots from committed capacity are charged at the 1-year commit price ($0.048/slot-hour), and the other 100 are autoscale slots charged at the pay-as-you-go price ($0.06/slot-hour). So the cost from 7-8am is $(100*0.048) + (100*0.06) = $10.80.

Hence the cost between 6-8am (the duration when Q2 is running) is $4.80 + $10.80 = $15.60.

Summary: The cost chargeback to a query needs to account for whether the slots come from committed capacity or from autoscaling charged at pay-as-you-go price. A query may use both at different stages.
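
A sketch of this scenario follows, assuming the first 100 slots in each hour come from the commitment and anything above that is autoscaled at the pay-as-you-go rate (the demands here are already multiples of 100, so no rounding is needed).

```python
COMMIT_SLOTS, COMMIT_RATE, PAYG_RATE = 100, 0.048, 0.06  # Enterprise edition, US region

def hourly_cost(slots_needed: int) -> float:
    committed = min(slots_needed, COMMIT_SLOTS)       # billed at the 1-year commit price
    autoscaled = max(slots_needed - COMMIT_SLOTS, 0)  # billed at pay-as-you-go
    return committed * COMMIT_RATE + autoscaled * PAYG_RATE

q1 = hourly_cost(300)                     # 5-6am
q2 = hourly_cost(100) + hourly_cost(200)  # 6-7am plus 7-8am
print(round(q1, 2))  # 100*0.048 + 200*0.06 = 16.8
print(round(q2, 2))  # 4.8 + 10.8 = 15.6
```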

Scenario 3: Minimum slot capacity and increment in autoscaling

A query may be billed for more than the resource it actually needs because of the minimum slot capacity and the minimum increment in autoscaling.

  • 1 reservation with a baseline of 0 slots and max slots as 300 slots.
  • No capacity commitment.
  • Query Q1 uses 50 slots for 10 seconds, from 05:00:00 to 05:00:10.
  • There is no other query running between 04:59:00 to 05:02:00.

If you were to chargeback by slot-ms, you would say that the query uses 50 slots for 10 seconds, or  500,000 slot-ms.

However, this assumption is flawed because of these two conditions:

  1. Slot capacity is billed for a minimum of 1 minute before being billed per second.
  2. Autoscaling happens in increments of 100 slots.

For Q1, 100 slots (not 50) are actually made available, for 1 minute (60,000 ms) and hence you will be charged for 6,000,000 slot-ms in your bill. 

Summary: The cost chargeback needs to account for minimum slot capacity and autoscaling increments. 
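
The gap between actual and billed slot-ms can be captured in a few lines, assuming the 1-minute billing minimum and 100-slot autoscaling increment described above.

```python
import math

MIN_BILL_MS = 60_000  # slot capacity is billed for a minimum of 1 minute
AUTOSCALE_STEP = 100  # slots are added in increments of 100

def actual_slot_ms(slots: int, duration_ms: int) -> int:
    return slots * duration_ms

def billed_slot_ms(slots: int, duration_ms: int) -> int:
    provisioned = math.ceil(slots / AUTOSCALE_STEP) * AUTOSCALE_STEP
    return provisioned * max(duration_ms, MIN_BILL_MS)

print(actual_slot_ms(50, 10_000))  # 500,000 slot-ms actually used by Q1
print(billed_slot_ms(50, 10_000))  # 6,000,000 slot-ms actually billed
```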

Scenario 4: Chargeback policy for idle resource

In the previous scenario, we see that a query that actually uses 500,000 slot-ms is billed for 6,000,000 slot-ms. Here we make the assumption that whatever resource is made available but not used is also included in the chargeback of the queries running at the same time. What happens if there are multiple queries running concurrently, with unused resources? Continuing with the example in Scenario 3, if there is another query, Q2, that uses 50 slots for 30s, from 05:00:10 to 05:00:40, then: 

  • Q1 still uses 500,000 slot-ms like before.
  • Q2 uses 1,500,000 slot-ms.
  • The total bill remains 6,000,000 slot-ms as before, because slot capacity is billed for a minimum of 1 min and autoscaling increments by 100 slots.

There are several ways to consider the chargeback to Q1 and Q2:

  1. Charge each query by its actual slot-ms, and have a separate “idle” category. In this case, Q1 is billed for 500,000 slot-ms, Q2 is billed for 1,500,000 slot-ms, and the remaining 4,000,000 slot-ms is attributed to the “idle” category.
  2. Divide idle resources equally among the queries. In this case, Q1 is billed 2,500,000 slot-ms, and Q2 is billed 3,500,000 slot-ms.
  3. Divide idle resources proportionally among the queries based on the queries’ usage. In this case, Q1 uses 1,833,333 slot-ms, while Q2 uses 4,166,667 slot-ms.

Summary: Chargeback policy needs to consider how to handle idle resources. Without a clear policy, there could be mismatches between users’ assumptions and the implementation, even leading to inconsistencies, such as the sum of the queries’ costs deviating from the bill. Moreover, different organizations may prefer different chargeback policies, and there’s no one-size-fits-all approach.
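
As an illustration, the sketch below computes the first two policies: a separate "idle" bucket, and an equal split of idle capacity among the concurrent queries. A proportional policy follows the same pattern, with the split driven by whatever usage-based weighting the organization chooses.

```python
total_billed = 6_000_000                   # slot-ms billed for the 1-minute window
usage = {"Q1": 500_000, "Q2": 1_500_000}   # actual slot-ms used per query
idle = total_billed - sum(usage.values())  # 4,000,000 slot-ms of unused capacity

# Policy 1: charge each query its actual usage, keep idle in its own bucket
policy_1 = {**usage, "idle": idle}

# Policy 2: split idle capacity equally among the concurrent queries
share = idle / len(usage)
policy_2 = {q: used + share for q, used in usage.items()}

print(policy_1)  # {'Q1': 500000, 'Q2': 1500000, 'idle': 4000000}
print(policy_2)  # {'Q1': 2500000.0, 'Q2': 3500000.0}
```

Which policy is right is a business decision; the important thing is to pick one explicitly so that per-query chargeback always reconciles with the bill.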

Conclusion

To conclude, providing accurate and useful chargeback for an organization’s usage of BigQuery presents a number of challenges. The common approaches of chargeback by accounts, reservations, and projects are often insufficient for most organizations, as they need user-level and team-level chargeback. However, chargeback by users and teams requires us to be able to provide query-level cost estimates, and then aggregate by users and teams (via tags). Computing the query-level cost estimate is another tricky puzzle where simply considering the total slot usage of a query will not work. Instead, we need to consider various factors such as different billing for different stages of the same query, commitment pricing vs. pay-as-you-go pricing from autoscaling, minimum slot capacity and minimum autoscaling increments, and idle policy.

Fortunately, Unravel has information for all the pieces of the puzzle. Its proprietary algorithm intelligently combines these pieces of information and considers the scenarios discussed. Unravel recognizes that chargeback often doesn’t have a one-size-fits-all approach, and can work with customers to adapt its algorithm to specific requirements and use cases.

Open Source Overwatch VS Unravel Free Comparison Infographic | https://www.unraveldata.com/resources/open-source-vs-unravel-free-comparison-infographic/ | Wed, 15 Nov 2023

5 key differences between Overwatch and Unravel’s free observability for Databricks.

Before investing significant time and effort to configure Overwatch, compare how Unravel’s free data observability solution gives you end-to-end, real-time granular visibility into your Databricks Lakehouse platform out of the box.

Announcing Unravel for Snowflake: Faster Time to Business Value in the Data Cloud | https://www.unraveldata.com/resources/announcing-unravel-for-snowflake-faster-time-to-business-value-in-the-data-cloud/ | Tue, 14 Nov 2023

Snowflake’s data cloud has expanded to become a top choice among organizations looking to leverage data and AI, including large language models (LLMs) and other types of generative AI, to deliver innovative new products to end users and customers. However, the democratization of AI often leads to inefficient usage that results in a cost explosion and decreases the business value of Snowflake. The inefficient usage of Snowflake can occur at various levels. Below are just some examples.

  • Warehouses: A warehouse that is too large or has too many clusters for a given workload will be underutilized and incur a higher cost than necessary, whereas the opposite (the warehouse being too small or having too few clusters) will not do the work fast enough. 
  • Workload: The democratization of AI results in a rapidly increasing number of SQL users, many of whom focus on getting value out of the data and do not think about the performance and the cost aspects of running SQL on Snowflake. This often leads to costly practices such as:
    • SELECT * just to check out the schema
    • Running a long, timed-out query repeatedly without checking why it timed out
    • Expensive joins such as cartesian products
  • Data: No or poorly chosen cluster or partition keys lead to many table scans. Unused tables accumulate over time and by the time users notice the data explosion, they have a hard time knowing which tables may be deleted safely.

Snowflake, like other leading cloud data platforms, provides various compute options and automated features to help with the control of resource usage and spending. However, you need to understand the characteristics of your workload and other KPIs, and have domain expertise, to pick the right options and settings for these features, not to mention there’s no native support to identify bad practices and anti-patterns in SQL. Lastly, optimization is not a one-time exercise. As business needs evolve, so do workloads and data; optimizing for cost and performance becomes part of a continued governance of data and AI operations.

Introducing Unravel for Snowflake

Unravel for Snowflake is a purpose-built AI-driven data observability and FinOps solution that enables organizations to get maximum business value from Snowflake by achieving high cost efficiency and performance. It does so by providing deep, correlated visibility into cost, workload, and performance, with AI-powered insights and automated guardrails for continued optimization and governance. Expanding the existing portfolio of purpose-built AI solutions for Databricks, BigQuery, Amazon EMR, and Cloudera, Unravel for Snowflake is the latest data observability and FinOps product from Unravel Data.

The new Unravel for Snowflake features align with the FinOps phases of inform, optimize, operate:

INFORM

  • A “Health Check” that provides a top-down summary view of cost, usage, insights to improve inefficiencies, and projected annualized savings from applying these insights
  • A Cost 360 view that captures complete cost across compute, storage and data transfer, and shows chargeback and trends of cost and usage across warehouses, users, tags, and queries
  • Top K most expensive warehouses, users, queries
  • Detailed warehouse and query views with extensive KPIs
  • Side-by-side comparison of queries

OPTIMIZE

  • Warehouse insights 
  • SQL insights
  • Data and storage insights

OPERATE

  • OpenSearch-based alerts on query duration and credits
  • Alert customization: ability to create custom alerts

Let us first take a look at the Health Check feature that kick-starts the journey of cost and performance optimization.

Health Check for Cost and Performance Inefficiencies

Screenshot: Unravel for Snowflake Health Check, showing a dashboard-level summary of usage and costs, AI insights into inefficiencies, and projected savings.

The Health Check feature automatically analyzes the workload and cost over the past 15 days. It generates a top-down summary of the cost and usage during this period and, more important, shows insights to improve the inefficiencies for cost and performance, along with the projected annualized savings from applying these insights.

Screenshot: Unravel for Snowflake Most Expensive Queries report, showing at a glance your most expensive “query signatures,” with AI-driven insights on reducing costs.

Users can easily spot the most impactful insights at the warehouse and query levels, and drill down to find out the details. They can also view the top 10 most expensive “query signatures,” or groups of similar queries. Lastly, it recommends alerting policies specific to your workload. 

Users can use the Health Check feature regularly to find new insights and their impact in savings. As the workloads evolve with new business use cases, new inefficiencies may arise and require continued monitoring and governance.

Digging Deep into Understanding Spending

Screenshot: Unravel for Snowflake Cost 360 for users and queries, showing cost chargeback breakdown and trends by users and queries.

Unravel also enables you to easily visualize and understand where money is spent and whether there are anomalies that you should investigate. The Cost 360 view provides cost breakdown and trends across warehouses, users, queries, and tags. It also shows top offenders by listing the most expensive warehouses, users, and query signatures, so that users can address them first.

Screenshot: Unravel for Snowflake Cost 360 for warehouses, showing cost chargeback breakdown and trends by warehouses and tags.

Debugging and Optimizing Failed, Costly, and Slow Queries

Screenshot: Unravel for Snowflake query cost and performance view, with drill-down AI-driven insights and recommendations into query cost and performance.

Unravel captures extensive data and metadata about cost, workload, and data, and automatically applies AI to generate insights and recommendations for each query and warehouse. Users can filter queries based on status, insights, duration, etc., to find interesting queries, and drill down to look at query details, including the insights for cost and performance optimization. They can also see similar queries to a given query and do side-by-side comparison to spot the difference between two runs.

Get Started with Unravel for Snowflake

To conclude, Unravel supports a variety of use cases in FinOps, from understanding cost and usage, to optimizing inefficiencies and providing alerts for governance. Learn more about Unravel for Snowflake by reading the Unravel for Snowflake docs and requesting a personalized Snowflake Health Check report.

Role: FinOps Practitioner
Scenario: Understand what we pay for Snowflake down to the user/app level in real time, and accurately forecast future spend with confidence.
Unravel benefits: Granular visibility at the warehouse, query, and user level enables FinOps practitioners to perform cost allocation, estimate annual data cloud application costs, cost drivers, break-even, and ROI analysis.

Role: FinOps Practitioner / Engineering / Operations
Scenario: Identify the most impactful recommendations to optimize overall cost and performance.
Unravel benefits: AI-powered performance and cost optimization recommendations enable FinOps and data teams to rapidly upskill team members, implement cost efficiency SLAs, and optimize Snowflake pricing tier usage to maximize the company’s cloud data ROI.

Role: Engineering Lead / Product Owner
Scenario: Identify the most impactful recommendations to optimize the cost and performance of a warehouse.
Unravel benefits: AI-driven insights and recommendations enable product and data teams to improve slot utilization, boost SQL query performance, and leverage table partitioning and column clustering to achieve cost efficiency SLAs and launch more data queries within the same warehouse budget.

Role: Engineering / Operations
Scenario: Live monitoring with alerts.
Unravel benefits: Live monitoring with alerts speeds mean time to repair (MTTR) and prevents outages before they happen.

Role: Data Engineer
Scenario: Debugging a query and comparing queries.
Unravel benefits: Automatic troubleshooting guides data teams directly to the source of query failures, down to the line of code or SQL query, along with AI recommendations to fix it and prevent future issues.

Role: Data Engineer
Scenario: Identify expensive, inefficient, or failed queries.
Unravel benefits: Proactively improve cost efficiency, performance, and reliability before deploying queries into production. Compare two queries side-by-side to find any metrics that differ between the two runs, even if the queries are different.

The post Announcing Unravel for Snowflake: Faster Time to Business Value in the Data Cloud appeared first on Unravel.

Top 4 Challenges to Scaling Snowflake for AI https://www.unraveldata.com/resources/top-4-challenges-to-scaling-snowflake-for-ai/ https://www.unraveldata.com/resources/top-4-challenges-to-scaling-snowflake-for-ai/#respond Tue, 14 Nov 2023 12:00:39 +0000 https://www.unraveldata.com/?p=14227 Computer Network Background Abstract


Organizations are transforming their industries through the power of data analytics and AI. A recent McKinsey survey finds that 75% expect generative AI (GenAI) to “cause significant or disruptive change in the nature of their industry’s competition in the next three years.” AI enables businesses to launch innovative new products, gain insights into their business, and boost profitability through technologies that help them outperform competitors. Organizations that don’t leverage data and AI risk falling behind.

Despite all the opportunities with data and AI, many find ROI with advanced technologies like IoT, AI, and predictive analytics elusive. For example, companies find it difficult to get accurate and granular reporting on compute and storage for cloud data and analytics workloads. In speaking with enterprise customers, we hear several recurring barriers they face to achieve their desired ROI on the data cloud.

Cloud data spend is challenging to forecast

About 80% of 157 data management professionals express difficulty predicting data-related cloud costs. Data cloud spend can be difficult to reliably predict. Sudden spikes in data volumes, new analytics use cases, and new data products require additional cloud resources. In addition, cloud service providers can unexpectedly increase prices. Soaring prices and usage fluctuations can disrupt financial operations. Organizations frequently lack visibility into cloud data spending to effectively manage their data analytics and AI budgets.

  • Workload fluctuations: Snowflake data processing and storage costs are driven by the amount of compute and storage resources used. As data cloud usage increases for new applications, dashboards, and uses, it becomes challenging to accurately estimate the required data processing and storage costs. This unpredictability can result in budget overruns that affect 60% of infrastructure and operations (I&O) leaders.
  • Unanticipated expenses: Spikes in streaming data volumes, large amounts of unstructured and semi-structured data, and shared warehouse consumption can quickly exceed cloud data budgets. These unforeseen usage peaks can catch organizations off guard, leading to unexpected data cloud costs.
  • Limited visibility: Accurately allocating costs across the company requires detailed visibility into the data cloud bill. Without query-level or user-level reporting, it becomes impossible to accurately attribute costs to various teams and departments. The result is confusion, friction and finger-pointing between groups as leaders blame high chargeback costs on reporting discrepancies.

By adopting a FinOps approach and leveraging granular usage data, organizations can establish spending guardrails, apply smart and effective controls over their data cloud spend, set up budgets, and use alerts to avoid data cloud cost overruns.
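As a concrete illustration of what a simple guardrail might look like, the sketch below projects month-to-date spend to a month-end run rate and flags a budget breach. It is a toy example: the spend figures, budget, and alerting mechanism (here just a printed warning) are all assumptions, and a real FinOps setup would drive this from actual billing or usage data and a proper alerting channel.

```python
# Toy budget guardrail: project month-end spend from month-to-date daily spend
# and flag a breach. All numbers here are made-up example data.
import calendar
from datetime import date

MONTHLY_BUDGET_USD = 50_000          # assumption
ALERT_THRESHOLD = 0.9                # warn at 90% of budget

daily_spend_usd = [1_450, 1_620, 1_580, 2_900, 3_100, 1_700, 1_650]  # example data

def projected_month_end(spend: list, today: date) -> float:
    """Naively extrapolate average daily spend across the whole month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    avg_daily = sum(spend) / len(spend)
    return avg_daily * days_in_month

projection = projected_month_end(daily_spend_usd, date.today())

if projection >= MONTHLY_BUDGET_USD:
    print(f"ALERT: projected spend ${projection:,.0f} exceeds budget ${MONTHLY_BUDGET_USD:,.0f}")
elif projection >= ALERT_THRESHOLD * MONTHLY_BUDGET_USD:
    print(f"WARNING: projected spend ${projection:,.0f} is within 10% of budget")
else:
    print(f"OK: projected spend ${projection:,.0f} is under budget")
```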

Data cloud workloads constrained by budget and staff limits

In 2024, IT organizations expect to shift their focus towards controlling costs, improving efficiency, and increasing automation. Cloud service provider price increases and growing usage add to existing economic pressures, while talent remains scarce and expensive. These cost and bandwidth factors are limiting the number of new data cloud workloads that can be launched.

“Data analytics, engineering & storage” are among the top three skill gaps, and 54% of data teams say the talent shortage and the time required to upskill employees are the biggest challenges to adopting their AI strategy.

Global demand for AI and machine learning professionals is expected to increase by 40% over the next five years. Approximately one million new jobs will be created as companies look to leverage data and AI for a wide variety of use cases, from automation and risk analysis to security and supply chain forecasting.

AI adoption and data volume demand

Since ChatGPT broke usage records, generative AI is driving increased data cloud demand and usage. Data teams are struggling to maintain productivity as AI projects scale “due to increasing complexity, inefficient collaboration, and lack of standardized processes and tools” (McKinsey).

Data is foundational for AI and much of it is unstructured, yet IDC found most unstructured data is not leveraged by organizations. A lack of production-ready data pipelines for diverse data sources was the second-most-cited reason (31%) for AI project failure.

Discover your Snowflake savings with a free Unravel Health Check report
Request your report here

Data pipeline failures slow innovation

Data pipelines are becoming more complex, increasing the time required for root cause analysis (RCA) for breaks and delays. Data teams struggle most with data processing speed. Time is a critical factor that pulls skilled and valuable talent into unproductive firefighting. The more time they spend dealing with pipeline issues or failures, the greater the impact on productivity and delivery of new innovation.

Automated data pipeline monitoring and testing is essential for data cloud applications, since teams rapidly iterate and adapt to changing end-user needs and product requirements. Failed queries and data pipelines create data issues for downstream users and workloads such as analytics, BI dashboards, and AI/ML model training. These delays and failures can have a ripple effect that impacts end user decision-making and AI models that rely on accurate, timely content.
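For teams orchestrating pipelines with a scheduler such as Airflow, a first line of defense is wiring alerting hooks directly into the DAG definition. The sketch below shows the general pattern (task-level retries, an SLA, and a failure callback); the notify_team helper is a stand-in for whatever alerting channel a team actually uses, and this is generic example code, not Unravel-specific.

```python
# Sketch of built-in monitoring hooks in an Airflow DAG: retries, a task SLA,
# and a failure callback. notify_team() is a placeholder for a real alert channel.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_team(context):
    # Placeholder: send to Slack/PagerDuty/email in a real deployment.
    print(f"Pipeline task failed: {context['task_instance'].task_id}")

def extract_and_load():
    print("extracting and loading data...")  # stand-in for real pipeline work

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),               # flag runs that exceed 1 hour
    "on_failure_callback": notify_team,
}

with DAG(
    dag_id="monitored_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```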

Conclusion

Unravel for Snowflake combines the power of AI and automation to help you overcome these challenges. With Unravel, Snowflake users get improved visibility to allocate costs for showback/chargeback, AI-driven recommendations to boost query efficiency, and real-time spend reporting and alerts to accurately predict costs. Unravel for Snowflake helps you optimize your workloads and get more value from your data cloud investments.

Take the next step and check out a self-guided tour or request a free Snowflake Health Check report.

The post Top 4 Challenges to Scaling Snowflake for AI appeared first on Unravel.

Why Optimizing Cost Is Crucial to AI/ML Success https://www.unraveldata.com/resources/why-optimizing-cost-is-crucial-to-aiml-success/ https://www.unraveldata.com/resources/why-optimizing-cost-is-crucial-to-aiml-success/#respond Fri, 03 Nov 2023 13:59:09 +0000 https://www.unraveldata.com/?p=14121 Transitioning big data workloads to the cloud


This article was originally published by the Forbes Technology Council, September 13, 2023.

When gold was discovered in California in 1848, it triggered one of the largest migrations in U.S. history, accelerated a transportation revolution and helped revitalize the U.S. economy. There’s another kind of Gold Rush happening today: a mad dash to invest in artificial intelligence (AI) and machine learning (ML).

The speed at which AI-related technologies have been embraced by businesses means that companies can’t afford to sit on the sidelines. Companies also can’t afford to invest in models that fail to live up to their promises.

But AI comes with a cost. McKinsey estimates that developing a single generative AI model costs up to $200 million, up to $10 million to customize an existing model with internal data and up to $2 million to deploy an off-the-shelf solution.

The volume of generative AI/ML workloads—and data pipelines that power them—has also skyrocketed at an exponential rate as various departments run differing use cases with this transformational technology. Bloomberg Intelligence reports that the generative AI market is poised to explode, growing to $1.3 trillion over the next 10 years from a market size of just $40 billion in 2022. And every job, every workload, and every data pipeline costs money.

Because of the cost factor, winning the AI race isn’t just about getting there first; it’s about making the best use of resources to achieve maximum business goals.

The Snowball Effect

There was a time when IT teams were the only ones utilizing AI/ML models. Now, teams across the enterprise—from marketing to risk to finance to product and supply chain—are all utilizing AI in some capacity, many of whom lack the training and expertise to run these models efficiently.

AI/ML models process exponentially more data, requiring massive amounts of cloud compute and storage resources. That makes them expensive: a single training run for GPT-3 reportedly cost around $12 million.

Enterprises today may have upwards of tens—even hundreds—of thousands of pipelines running at any given time. Running sub-optimized pipelines in the cloud often causes costs to quickly spin out of control.

The most obvious culprit is oversized infrastructure, where users are simply guessing how much compute resources they need rather than basing it on actual usage requirements. Same thing with storage costs, where teams may be using more expensive options than necessary for huge amounts of data that they rarely use.

But data quality and inefficient code often cause costs to soar even higher: data schema, data skew and load imbalances, idle time, and a rash of other code-related issues that make data pipelines take longer to run than necessary—or even fail outright.

Like a snowball gathering size as it rolls down a mountain, the more data pipelines you have running, the more problems, headaches, and, ultimately, costs you’re likely to have.

And it’s not just cloud costs that need to be considered. Modern data pipelines and AI workloads are complex. It takes a tremendous amount of troubleshooting expertise just to keep models working and meeting business SLAs—and that doesn’t factor in the costs of downtime or brand damage. For example, if a bank’s fraud detection app goes down for even a few minutes, how much would that cost the company?

Optimized Data Workloads on the Cloud

Optimizing cloud data costs is a business strategy that ensures a company’s resources are being allocated appropriately and in the most cost-efficient manner. It’s fundamental to the success of an AI-driven company as it ensures that cloud data budgets are being used effectively and providing maximum ROI.

But business and IT leaders need to first understand exactly where resources are being used efficiently and where waste is occurring. To do so, keep in mind the following when developing a cloud data cost optimization strategy.

  • Reuse building blocks. Everything costs money in the cloud. Every file you store, every record you access, and every piece of code you run incurs a cost. Data processing can usually be broken down into a series of steps, and a smart data team should be able to reuse those steps for other processing. For example, code written to move data about a company’s sales records could be reused by the pricing and product teams rather than both teams building their own code separately and incurring twice the cost.

• Truly leverage cloud capabilities. The cloud allows you to quickly adjust the resources needed to process data workloads. Unfortunately, too many companies operate under “just in case” scenarios that lead to allocating more resources than actually needed. By understanding usage patterns and leveraging cloud’s auto-scaling capabilities, it’s possible for companies to dynamically control how they scale up and, more importantly, create guardrails to manage the maximum.

  • Analyze compute and storage spend by job and by user. The ability to dig down to the granular details of who is spending what on which project will likely yield a few surprises. You might find that the most expensive jobs are not the ones making your company millions. You may find that you’re paying far more for exploration than for data models that will be put to good use. Or you may find that the same group of users is responsible for the jobs with the biggest spend and the lowest ROI (in which case, it might be time to tighten up some processes). A rough sketch of this kind of breakdown follows below.
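To make that last point concrete, here is a minimal sketch of a spend-by-user-and-job breakdown using pandas. The job-cost records are invented example data; in practice the input would be a billing or usage export from your cloud or data platform.

```python
# Example: rank spend by user and job from a (hypothetical) cost export.
import pandas as pd

costs = pd.DataFrame(
    [
        {"user": "alice", "job": "daily_sales_etl",    "cost_usd": 1_240},
        {"user": "alice", "job": "ad_hoc_exploration", "cost_usd": 3_980},
        {"user": "bob",   "job": "fraud_model_train",  "cost_usd": 2_150},
        {"user": "bob",   "job": "ad_hoc_exploration", "cost_usd": 4_420},
    ]
)

by_user_job = (
    costs.groupby(["user", "job"], as_index=False)["cost_usd"]
    .sum()
    .sort_values("cost_usd", ascending=False)
)
print(by_user_job)

# What share of total spend goes to exploration vs. production jobs?
share = costs.groupby("job")["cost_usd"].sum() / costs["cost_usd"].sum()
print(share.sort_values(ascending=False).round(2))
```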

Given the data demands that generative AI models and use cases place on a company, business and IT leaders need to have a deep understanding of what’s going on under the proverbial hood. As generative AI evolves, business leaders will need to address new challenges. Keeping cloud costs under control shouldn’t be one of them.

The post Why Optimizing Cost Is Crucial to AI/ML Success appeared first on Unravel.

Unravel CI/CD Integration for Databricks https://www.unraveldata.com/resources/unravel-cicd-integration-for-databricks/ https://www.unraveldata.com/resources/unravel-cicd-integration-for-databricks/#respond Wed, 18 Oct 2023 04:21:00 +0000 https://www.unraveldata.com/?p=14059


“Someone’s sitting in the shade today because someone planted a tree a long time ago.” —Warren Buffett

 

CI/CD, a software development strategy, combines the methodologies of Continuous Integration and Continuous Delivery/Continuous Deployment to safely and reliably deliver new versions of code in short, iterative cycles. The practice bridges the gap between developers and operations teams by streamlining the building, testing, and deployment of code and automating the series of steps involved in this otherwise complex process. Traditionally used to speed up the software development life cycle, CI/CD is now gaining popularity among data scientists and data engineers because it enables cross-team collaboration and rapid, secure integration and deployment of libraries, scripts, notebooks, and other ML workflow assets.

One recent report found that 80% of organizations have adopted agile practices, but for nearly two-thirds of developers it takes at least one week to get committed code successfully running in production. Implementing CI/CD can streamline data pipeline development and deployment, accelerating release times and frequency while improving code quality.

The evolving need for CI/CD for data teams

AI’s rapid adoption is driving the demand for fresh and reliable data for training, validation, verification, and drift analysis. Implementing CI/CD enhances your Databricks development process, streamlines pipeline deployment, and accelerates time-to-market. CI/CD revolutionizes how you build, test, and deploy code within your Databricks environment, helping you automate tasks, ensure a smooth transition from development to production, and enable lakehouse data engineering and data science teams to work more efficiently. And when it comes to cloud data platforms like Databricks, performance equals cost. The more optimized your pipelines are, the more optimized your Databricks spend will be.

Why incorporate Unravel into your existing DevOps workflow?

Unravel Data is the AI-powered data observability and FinOps platform for Databricks. By using Unravel’s CI/CD integration for Databricks, developers can catch performance problems early in the development and deployment life cycle and proactively take action to mitigate issues. This has been shown to significantly reduce the time it takes data teams to act on critical, timely insights. Unravel’s AI-powered efficiency recommendations, now embedded right into DevOps environments, help foster a cost-conscious culture that encourages developers to follow performance- and cost-driven coding best practices. They also raise awareness of resource usage, configuration changes, and data layout issues that could impact service level agreements (SLAs) when the code is deployed in production. Accepting or ignoring the insights suggested by Unravel promotes accountability for developers’ actions and creates transparency for DevOps and FinOps practitioners to attribute cost-saving wins and losses.

With the advent of Generative Pre-Trained Transformer (GPT) AI models, data teams have started using coding co-pilots to generate accurate and efficient code. With Unravel, this experience is a notch better, with real-time visibility into code inefficiencies that can translate into production performance problems like bottlenecks, performance anomalies, missed SLAs, and cost overruns. While other code-assist tools like GitHub Copilot are limited to rewrite suggestions based on static code analysis, Unravel’s AI-driven Insights Engine built for Databricks considers the performance and cost impact of code and configuration changes and recommends optimal changes. This helps you streamline your development process, identify bottlenecks, and ensure optimal performance throughout the life cycle of your data pipelines.

Unravel’s AI-powered analysis automatically provides deep, actionable insights.

Next, let’s look into what key benefits are provided by the Unravel integration into your DevOps workflows. 

Achieve operational excellence 

Unravel’s CI/CD integration for Databricks enhances data team and developer efficiency by seamlessly providing real-time, AI-powered insights to help optimize performance and troubleshoot issues in your data pipelines.  

Unravel integrates with your favorite CI/CD tools such as Azure DevOps and GitHub. When developers make changes to code and submit them via a pull request, Unravel automatically runs AI-powered checks to ensure the code is performant and efficient (a simplified sketch of this kind of CI gate appears after the list below). This helps developers:

  • Maximize resource utilization by gaining valuable insights into pipeline efficiency
  • Achieve performance and cost goals by analyzing critical metrics during development
  • Leverage specific, actionable recommendations to improve code for cost and performance optimization
  • Identify and resolve bottlenecks promptly, reducing development time
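The exact checks Unravel runs are driven by its AI engine, but the shape of the integration is easy to picture: a CI step evaluates the metrics of a test run against agreed thresholds and fails the pull request if they are breached. The sketch below is a generic, hypothetical version of such a gate (the run_metrics.json file and the threshold values are assumptions), suitable for calling from an Azure DevOps or GitHub Actions step.

```python
# Hypothetical CI gate: fail the build if a test run's metrics breach thresholds.
# run_metrics.json and the threshold values are assumptions for illustration.
import json
import sys

THRESHOLDS = {
    "duration_s": 1_800,          # fail if the test run took longer than 30 minutes
    "estimated_cost_usd": 25,
    "spilled_gb": 0,
}

def main(path: str = "run_metrics.json") -> int:
    with open(path) as f:
        metrics = json.load(f)    # e.g. {"duration_s": 950, "estimated_cost_usd": 12, ...}

    failures = [
        f"{key}={metrics[key]} exceeds limit {limit}"
        for key, limit in THRESHOLDS.items()
        if metrics.get(key, 0) > limit
    ]
    for failure in failures:
        print(f"CI gate failure: {failure}")
    return 1 if failures else 0   # a nonzero exit code fails the PR check

if __name__ == "__main__":
    sys.exit(main())
```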

Leverage developer pull request (PR) reviews

Developers play a crucial role in achieving cost efficiency through PR reviews. Encourage them to adopt best practices and follow established guidelines when submitting code for review. This ensures that all tests are run and results are thoroughly evaluated before merging into the main project branch.

By actively involving developers in the review process, you tap into their knowledge and experience to identify potential areas for cost savings within your pipelines. Their insights can help streamline workflows, improve resource allocation, and eliminate inefficiencies. Involving developers in PR reviews fosters collaboration among team members and encourages feedback, creating a culture of continuous improvement. 

Here are several ways developer PR reviews can enhance the reliability of data pipelines:

  • Ensure code quality: Developer PR reviews serve as an effective mechanism to maintain high code-quality standards. Through these reviews, developers can catch coding errors, identify potential bugs, and suggest improvements before the code is merged into the production repository.
  • Detect issues early: By involving developers in PR reviews, you ensure that potential issues are identified early in the development process. This allows for prompt resolution and prevents problems from propagating further down the pipeline.
  • Mitigate risks: Faulty or inefficient code changes can have significant impacts on your pipelines and overall system stability. With developer PR reviews, you involve experts who understand the intricacies of the pipeline and can help mitigate risks by providing valuable insights and suggestions.
  • Foster a collaborative environment: Developer PR reviews create a collaborative environment where team members actively engage with one another’s work. Feedback provided during these reviews promotes knowledge sharing, improves individual skills, and enhances overall team performance.

Real-world examples of CI/CD integration for Databricks

Companies in finance, healthcare, e-commerce, and more have successfully implemented CI/CD practices with Databricks. Enterprise organizations across industries leverage Unravel to ensure that code is performant and efficient before it goes into production.

  • Financial services: A Fortune Global 500 bank provides Unravel to their developers as a way to evaluate their pipelines before they do a code release.
  • Healthcare: One of the largest health insurance providers in the United States uses Unravel to ensure that its business-critical data applications are optimized for performance, reliability, and cost in its development environment—before they go live in production.
  • Logistics: One of the world’s largest logistics companies leverages Unravel to upskill their data teams at scale. They put Unravel in their CI/CD process to ensure that all code and queries are reviewed to ensure they meet the desired quality and efficiency bar before they go into production.
Self-guided tours of Unravel AI-powered health checks
Check it out!

Unravel CI/CD integration for Databricks use cases

Incorporating Unravel’s real-time, AI insights into PR reviews helps developers ensure the reliability, performance, and cost efficiency of data pipelines before they go into production. This practice ensures that any code changes are thoroughly reviewed before being merged into the main project branch. By catching potential issues early on, you can prevent pipeline breaks, bottlenecks, and wasted compute tasks from running in production. 

Ensure pipeline reliability

Unravel’s purpose-built AI helps augment your PR reviews to ensure code quality and reliability in your release pipelines. Unravel integration into your Databricks CI/CD process helps developers identify potential issues early on and mitigate risks associated with faulty or inefficient code changes. Catching breaking changes in development and test environments helps developers improve productivity and helps ensure that you achieve your SLAs.

1-minute tour: Unravel’s AI-powered Speed, Cost, Reliability Optimizer

Achieve cost efficiency

Unravel provides immediate feedback and recommendations to improve cost efficiency. This enables you to catch inefficient code, and developers can make any necessary adjustments for optimal resource utilization before it impacts production environments. Using Unravel as part of PR reviews helps your organization optimize resource allocation and reduce cloud waste.

1-minute tour: Unravel’s AI-powered Databricks Cost Optimization

Boost pipeline performance

Collaborative code reviews provide an opportunity to identify bottlenecks, optimize code, and enhance data processing efficiency. By including Unravel’s AI recommendations in the review process, developers benefit from AI-powered insights to ensure code changes achieve performance objectives. 

1-minute tour: Unravel’s AI-powered Pipeline Bottleneck Analysis

Get started with Unravel CI/CD integration for Databricks

Supercharge your CI/CD process for Databricks using Unravel’s AI. By leveraging this powerful combination, you can significantly improve developer productivity, ensure pipeline reliability, achieve cost efficiency, and boost overall pipeline performance. Whether you choose to automate PR reviews with Azure DevOps or GitHub, Unravel’s CI/CD integration for Databricks has got you covered.

Now it’s time to take action and unleash the full potential of your Databricks environment. Integrate Unravel’s CI/CD solution into your workflow and experience the benefits firsthand. Don’t miss out on the opportunity to streamline your development process, save costs, and deliver high-quality code faster than ever before.

Next steps to learn more

Read Unravel’s CI/CD integration documentation

Watch this video

Book a live demo

The post Unravel CI/CD Integration for Databricks appeared first on Unravel.

How Unravel Enhances Airflow https://www.unraveldata.com/resources/how-unravel-enhances-airflow/ https://www.unraveldata.com/resources/how-unravel-enhances-airflow/#respond Mon, 16 Oct 2023 15:28:01 +0000 https://www.unraveldata.com/?p=13750 iceberg


In today’s data-driven world, a huge amount of data flows into the business. Engineers spend a large part of their time building pipelines: collecting data from different sources, processing it, and transforming it into useful datasets that can be sent to business intelligence applications or machine learning models. Tools like Airflow are used to orchestrate complex data pipelines by programmatically authoring, scheduling, and monitoring workflow pipelines. From a 10,000-foot view, Airflow may look like just a powerful cron, but it has additional capabilities such as monitoring, generating logs, retrying jobs, adding dependencies between tasks, and supporting a huge number of operators to run different types of tasks.

Pipelines become increasingly complex because of the interdependence of several tools needed to complete a workflow. Common tasks done in day-to-day pipeline activities call for a number of different technologies to process data for different purposes: running a Spark job, executing a Snowflake query, executing a data transfer across different platforms (like from a GCS bucket to an AWS S3 bucket), and pushing data into a queue system like SQS. 

Airflow orchestrates pipelines running several third-party tools and supports about 80 different providers. But you need to monitor what’s going on underneath pipeline operations across all those tools and get deeper insights about pipeline execution. Unravel supports this by bringing in all the information monitored from Airflow via API, StatsD, and notifications. Unravel also collects metrics from Spark, BigQuery, and Snowflake to connect the dots and determine how pipeline execution in Airflow impacts other systems.

Let’s examine two common use cases to illustrate how Unravel stitches together all pipeline details to solve problems that you can’t address with Airflow alone.

View DAG Run Information

Let’s say a financial services company is using Airflow to run Spark jobs on their instances and notices that a few of the DAG runs are taking longer to complete. Airflow UI shows information about the duration of each task run and the status of the task run, but that’s about it. For deeper insight into why the runs are taking longer, you’d need to gather detailed information about the Spark ID that is created via this task run. 

By tightly integrating with the Spark ecosystem, Unravel fills that gap and gives you deeper insights about the Spark job run, its duration, and the resource utilized. These details, along with information collected from Airflow, can help you see the bigger picture of what’s happening underneath, inside your data pipelines. 

This is an example test_spark DAG that runs the Spark job via the SparkSubmitOperator in the company’s environment. The SparkSubmitOperator task is taking longer to execute, and the screenshot below shows the long duration of the Spark submit task.

SparkSubmitOperator task

This information flows into Unravel and is visualized in the Pipeline Observer UI. A summary list of failed/degraded pipeline runs helps filter the DAG runs with problems to debug. Clicking on a specific pipeline run ID drills down into more information about the pipeline: the details of each task, the duration of each task, the total duration, a baseline comparison against previous runs, and stack traces for failures.

Failed Airflow pipelines

Unravel also shows a list of all pipeline runs.

All Airflow pipelines

Clicking on one of the Spark DAG runs with a long duration reveals more detailed information about the DAG and allows comparison with previous DAG runs. You can also find the Spark ID that was created for this particular DAG run.

Unravel Spark ID details

Clicking on the Spark app ID takes you to the Unravel UI with more information collected from Spark: the resources used to run the Spark job, any error logs, the configuration used to run it, the duration of the job, and more.

Spark run details in Unravel UI

Here we have more detail about the Spark job and the Spark ID created during the DAG run, which helps verify the list of Spark jobs, understand which DAG run created a specific Spark job, see the resources consumed by that job, and find its configuration.

Tracking down a specific Spark job from a DAG run is difficult because, although Airflow initiates the task, Spark actually runs it. Unravel can monitor and collect details from Spark, identify the Spark ID, and correlate the Spark job to the Airflow DAG run that initiated it.
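For reference, a DAG along the lines of the test_spark example above can be defined in a few lines with Airflow’s Spark provider. This is a generic sketch, not the customer’s actual DAG; the application path, connection ID, and Spark settings are placeholders.

```python
# Minimal sketch of a DAG that submits a Spark job via SparkSubmitOperator.
# The application path, conn_id, and conf values are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="test_spark",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="spark_submit_task",
        application="/opt/jobs/process_transactions.py",  # placeholder path
        conn_id="spark_default",
        conf={"spark.executor.memory": "4g", "spark.executor.cores": "2"},
    )
```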

See Unravel’s Automated Pipeline Bottleneck Analysis in 60-second tour
Take tour

Delay in Scheduling Pipeline Run 

Unravel helps pinpoint the delay in scheduling DAG runs.

The same financial services company runs multiple DAGs as part of its pipeline system. While the DAG runs themselves work fine, the team stumbles upon a different problem: a DAG that is expected to start at its scheduled time is starting late. This delay pushes back subsequent DAG runs, resulting in tasks not being completed on time.

Airflow has the capability to alert if a Task/DAG that’s running misses its SLA. But there are cases when the DAG run is not initiated at all in the first place—e.g., no slot is available or the number of DAGs that can run in parallel exceeds the configured maximum number.  

Unravel helps bring the problem to the surface by rendering underlying details about the DAG run delay right in the UI. For a DAG run ID, Unravel shows useful information collected from StatsD, such as the total parse time for processing the DAG and the DAG run schedule delay. These pipeline insights help identify why there is a delay in scheduling the DAG run relative to its expected execution time.

delayed Airflow DAG

The Pipeline Details page presents DAG run ID information, with pipeline insights about the schedule delay in the right-hand pane. It clearly shows that the DAG run was delayed well beyond the acceptable threshold. 

Unravel helps keep pipeline runs in check and automatically delivers insights about what’s happening inside Airflow. Unravel monitors systems like Spark, BigQuery, and Airflow, collects granular data about each of them, and connects that information to surface the right insights, making it a powerful tool for monitoring cloud data systems.

See Unravel’s Automated Pipeline Bottleneck Analysis in 60-second tour
Take tour

The post How Unravel Enhances Airflow appeared first on Unravel.

Rev Up Your Lakehouse: Lap the Field with a Databricks Operating Model https://www.unraveldata.com/resources/rev-up-your-lakehouse-lap-the-field-with-a-databricks-operating-model/ https://www.unraveldata.com/resources/rev-up-your-lakehouse-lap-the-field-with-a-databricks-operating-model/#respond Thu, 12 Oct 2023 18:19:05 +0000 https://www.unraveldata.com/?p=14042


In this fast-paced era of artificial intelligence (AI), the need for data is multiplying. The demand for faster data life cycles has skyrocketed, thanks to AI’s insatiable appetite for knowledge. According to a recent McKinsey survey, 75% expect generative AI (GenAI) to “cause significant or disruptive change in the nature of their industry’s competition in the next three years.”

Next-gen AI craves unstructured, streaming, industry-specific data. Although the pace of innovation is relentless, “when it comes to generative AI, data really is your moat.”

But here’s the twist: efficiency is now the new cool kid in town. Data product profitability hinges on optimizing every step of the data life cycle, from ingestion and transformation to processing, curating, and refining. It’s no longer just about gathering mountains of information; it’s about collecting the right data efficiently.

As new, industry-specific GenAI use cases emerge, there is an urgent need for large data sets for training, validation, verification, and drift analysis. GenAI requires flexible, scalable, and efficient data architecture, infrastructure, code, and operating models to achieve success.

Leverage a Scalable Operating Model to Accelerate Your Data Life Cycle Velocity

To optimize your data life cycle, it’s crucial to leverage a scalable operating model that can accelerate the velocity of your data processes. By following a systematic approach and implementing efficient strategies, you can effectively manage your data from start to finish.

Databricks recently introduced a scalable operating model for data and AI to help customers achieve a positive Return on Data Assets (RODA).

Databricks AI operating pipeline

Databricks’ iterative end-to-end operating pipeline

 

Define Use Cases and Business Requirements

Before diving into the data life cycle, it’s essential to clearly define your use cases and business requirements. This involves understanding what specific problems or goals you plan to address with your data. By identifying these use cases and related business requirements, you can determine the necessary steps and actions needed throughout the entire process.

Build, Test, and Iterate the Solution

Once you have defined your use cases and business requirements, it’s time to build, test, and iterate the solution. This involves developing the necessary infrastructure, tools, and processes required for managing your data effectively. It’s important to continuously test and iterate on your solution to ensure that it meets your desired outcomes.

During this phase, consider using agile methodologies that allow for quick iterations and feedback loops. This will enable you to make adjustments as needed based on real-world usage and feedback from stakeholders.

Scale Efficiently

As your data needs grow over time, it’s crucial to scale efficiently. This means ensuring that your architecture can handle increased volumes of data without sacrificing performance or reliability.

Consider leveraging cloud-based technologies that offer scalability on-demand. Cloud platforms provide flexible resources that can be easily scaled up or down based on your needs. Employing automation techniques such as machine learning algorithms or artificial intelligence can help streamline processes and improve efficiency.

By scaling efficiently, you can accommodate growing datasets while maintaining high-quality standards throughout the entire data life cycle.

Elements of the Business Use Cases and Requirements Phase

In the data life cycle, the business requirements phase plays a crucial role in setting the foundation for successful data management. This phase involves several key elements that contribute to defining a solution and ensuring measurable outcomes. Let’s take a closer look at these elements:

  • Leverage design thinking to define a solution for each problem statement: Design thinking is an approach that focuses on understanding user needs, challenging assumptions, and exploring innovative solutions. In this phase, it is essential to apply design thinking principles to identify and define a single problem statement that aligns with business objectives.
  • Validate the business case and define measurable outcomes: Before proceeding further, it is crucial to validate the business case for the proposed solution. This involves assessing its feasibility, potential benefits, and alignment with strategic goals. Defining clear and measurable outcomes helps in evaluating project success.
  • Map out the MVP end user experiences: To ensure user satisfaction and engagement, mapping out Minimum Viable Product (MVP) end-user experiences is essential. This involves identifying key touchpoints and interactions throughout the data life cycle stages. By considering user perspectives early on, organizations can create intuitive and effective solutions.
  • Understand the data requirements: A thorough understanding of data requirements is vital for successful implementation. It includes identifying what types of data are needed, their sources, formats, quality standards, security considerations, and any specific regulations or compliance requirements.
  • Gather required capabilities with platform architects: Collaborating with platform architects helps gather insights into available capabilities within existing infrastructure or technology platforms. This step ensures compatibility between business requirements and technical capabilities while minimizing redundancies or unnecessary investments.
  • Establish data management roles, responsibilities, and procedures: Defining clear roles and responsibilities within the organization’s data management team is critical for effective execution. Establishing procedures for data observability, stewardship practices, access controls, privacy policies ensures consistency in managing data throughout its life cycle.

By following these elements in the business requirements phase, organizations can lay a solid foundation for successful data management and optimize the overall data life cycle. It sets the stage for subsequent phases, including data acquisition, storage, processing, analysis, and utilization.

Build, Test, and Iterate the Solution

To successfully implement a data life cycle, it is crucial to focus on building, testing, and iterating the solution. This phase involves several key steps that ensure the development and deployment of a robust and efficient system.

  • Plan development and deployment: The first step in this phase is to carefully plan the development and deployment process. This includes identifying the goals and objectives of the project, defining timelines and milestones, and allocating resources effectively. By having a clear plan in place, the data team can streamline their efforts towards achieving desired outcomes.
  • Gather end-user feedback at every stage: Throughout the development process, it is essential to gather feedback from end users at every stage. This allows for iterative improvements based on real-world usage scenarios. By actively involving end users in providing feedback, the data team can identify areas for enhancement or potential issues that need to be addressed.
  • Define CI/CD pipelines for fast testing and iteration: Implementing Continuous Integration (CI) and Continuous Deployment (CD) pipelines enables fast testing and iteration of the solution. These pipelines automate various stages of software development such as code integration, testing, deployment, and monitoring. By automating these processes, any changes or updates can be quickly tested and deployed without manual intervention.
  • Data preparation, cleaning, and processing: Before training machine learning models or conducting experiments with datasets, it is crucial to prepare, clean, and process the data appropriately. This involves tasks such as removing outliers or missing values from datasets to ensure accurate results during model training.
  • Feature engineering: Feature engineering plays a vital role in enhancing model performance by selecting relevant features from raw data or creating new features based on domain knowledge. It involves transforming raw data into meaningful representations that capture essential patterns or characteristics.
  • Training and ML experiments: In this stage of the data life cycle, machine learning models are trained using appropriate algorithms on prepared datasets. Multiple experiments may be conducted, testing different algorithms or hyperparameters to find the best-performing model.
  • Model deployment: Once a satisfactory model is obtained, it needs to be deployed in a production environment. This involves integrating the model into existing systems or creating new APIs for real-time predictions.
  • Model monitoring and scoring: After deployment, continuous monitoring of the model’s performance is essential. Tracking key metrics and scoring the model’s outputs against ground truth data helps identify any degradation in performance or potential issues that require attention.

By following these steps and iterating on the solution based on user feedback, data teams can ensure an efficient and effective data life cycle from development to deployment and beyond.
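As a small illustration of the “training and ML experiments” and “model monitoring and scoring” steps above, the sketch below logs a training run with MLflow, which ships with Databricks and is commonly used for experiment tracking. The dataset, model choice, and experiment name are arbitrary placeholders rather than a prescribed setup.

```python
# Illustrative experiment-tracking sketch with scikit-learn and MLflow.
# Experiment name, model, and data are placeholders for the example.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("demo_churn_experiment")  # placeholder experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy)   # tracked for later monitoring/scoring
    mlflow.sklearn.log_model(model, "model")  # versioned artifact for deployment

    print(f"logged run with accuracy={accuracy:.3f}")
```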

Efficiently Scale and Drive Adoption with Your Operating Model

To efficiently scale your data life cycle and drive adoption, you need to focus on several key areas. Let’s dive into each of them:

  • Deploy into production: Once you have built and tested your solution, it’s time to deploy it into production. This step involves moving your solution from a development environment to a live environment where end users can access and utilize it.
  • Evaluate production results: After deploying your solution, it is crucial to evaluate its performance in the production environment. Monitor key metrics and gather feedback from users to identify any issues or areas for improvement.
  • Socialize data observability and FinOps best practices: To ensure the success of your operating model, it is essential to socialize data observability and FinOps best practices among your team. This involves promoting transparency, accountability, and efficiency in managing data operations.
  • Acknowledge engineers who “shift left” performance and efficiency: Recognize and reward engineers who prioritize performance and efficiency early in the development process. Encourage a culture of proactive optimization by acknowledging those who contribute to improving the overall effectiveness of the data life cycle.
  • Manage access, incidents, support, and feature requests: Efficiently scaling your operating model requires effective management of access permissions, incident handling processes, support systems, and feature requests. Streamline these processes to ensure smooth operations while accommodating user needs.
  • Track progress towards business outcomes by measuring and sharing KPIs: Measuring key performance indicators (KPIs) is vital for tracking progress towards business outcomes. Regularly measure relevant metrics related to adoption rates, user satisfaction levels, cost savings achieved through efficiency improvements, etc., then share this information across teams for increased visibility.

By implementing these strategies within your operating model, you can efficiently scale your data life cycle while driving adoption among users. Remember that continuous evaluation and improvement are critical for optimizing performance throughout the life cycle.

Unravel for Databricks now free!
Create free account

Drive for Performance with Purpose-Built AI

Unravel helps with many elements of the Databricks operating model:

  • Quickly identify failed and inefficient Databricks jobs: One of the key challenges is identifying failed and inefficient Databricks jobs. However, with AI purpose-built for Databricks, this task becomes much easier. By leveraging advanced analytics and monitoring capabilities, you can quickly pinpoint any issues in your job executions.
  • Creating ML models vs deploying them into production: Creating machine learning models is undoubtedly challenging, but deploying them into production is even harder. It requires careful consideration of factors like scalability, performance, and reliability. With purpose-built AI tools, you can streamline the deployment process by automating various tasks such as model versioning, containerization, and orchestration.
  • Leverage Unravel’s Analysis tab for insights: To gain deeper insights into your application’s performance during job execution, leverage the analysis tab provided by purpose-built AI solutions. This feature allows you to examine critical details like memory usage errors or other bottlenecks that may be impacting job efficiency.

    Unravel’s AI-powered analysis automatically provides deep, actionable insights.

 

  • Share links for collaboration: Collaboration plays a crucial role in data management and infrastructure optimization. Unravel enables you to share links with data scientists, developers, and data engineers to provide detailed information about specific test runs or failed Databricks jobs. This promotes collaboration and facilitates a better understanding of why certain jobs may have failed.
  • Cloud data cost management made easy: Cloud cost management, also known as FinOps, is another essential aspect of data life cycle management. Purpose-built AI solutions simplify this process by providing comprehensive insights into cost drivers within your Databricks environment. You can identify the biggest cost drivers such as users, clusters, jobs, and code segments that contribute significantly to cloud costs.
  • AI recommendations for optimization: To optimize your data infrastructure further, purpose-built AI platforms offer valuable recommendations across various aspects, including infrastructure configuration, parallelism settings, handling data skewness issues, optimizing Python/SQL/Scala/Java code snippets, and more. These AI-driven recommendations help you make informed decisions to enhance performance and efficiency.

Learn More & Next Steps

Unravel hosted a virtual roundtable, Accelerate the Data Analytics Life Cycle, with a panel of Unravel and Databricks experts. Unravel VP Clinton Ford moderated the discussion with Sanjeev Mohan, principal at SanjMo and former VP at Gartner, Subramanian Iyer, Unravel training and enablement leader and Databricks SME, and Don Hilborn, Unravel Field CTO and former Databricks lead strategic solutions architect.

FAQs

How can I implement a scalable operating model for my data life cycle?

To implement a scalable operating model for your data life cycle, start by clearly defining roles and responsibilities within your organization. Establish efficient processes and workflows that enable seamless collaboration between different teams involved in managing the data life cycle. Leverage automation tools and technologies to streamline repetitive tasks and ensure consistency in data management practices.

What are some key considerations during the Business Requirements Phase?

During the Business Requirements Phase, it is crucial to engage stakeholders from various departments to gather comprehensive requirements. Clearly define project objectives, deliverables, timelines, and success criteria. Conduct thorough analysis of existing systems and processes to identify gaps or areas for improvement.

How can I drive adoption of my data life cycle operational model?

To drive adoption of your data management solution, focus on effective change management strategies. Communicate the benefits of the solution to all stakeholders involved and provide training programs or resources to help them understand its value. Encourage feedback from users throughout the implementation process and incorporate their suggestions to enhance usability and address any concerns.

What role does AI play in optimizing the data life cycle?

AI can play a significant role in optimizing the data life cycle by automating repetitive tasks, improving data quality through advanced analytics and machine learning algorithms, and providing valuable insights for decision-making. AI-powered tools can help identify patterns, trends, and anomalies in large datasets, enabling organizations to make data-driven decisions with greater accuracy and efficiency.

How do I ensure performance while implementing purpose-built AI?

To ensure performance while implementing purpose-built AI, it is essential to have a well-defined strategy. Start by clearly defining the problem you want to solve with AI and set measurable goals for success. Invest in high-quality training data to train your AI models effectively. Continuously monitor and evaluate the performance of your AI system, making necessary adjustments as needed.

The post Rev Up Your Lakehouse: Lap the Field with a Databricks Operating Model appeared first on Unravel.

Overcoming Friction & Harnessing the Power of Unravel: Try It for Free https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/ https://www.unraveldata.com/resources/overcoming-friction-harnessing-the-power-of-unravel-try-it-for-free/#respond Wed, 11 Oct 2023 13:52:00 +0000 https://www.unraveldata.com/?p=13911


Overview

In today’s digital landscape, data-driven decisions form the crux of successful business strategies. However, the path to harnessing data’s full potential is strewn with challenges. Let’s delve into the hurdles organizations face and how Unravel is the key to unlocking seamless data operations.

The Roadblocks in the Fast Lane of Data Operations

In today’s data-driven landscape, organizations grapple with erratic spending, cloud constraints, AI complexities, and prolonged MTTR, urgently seeking solutions to navigate these challenges efficiently. The four most common roadblocks are:

  • Data Spend Forecasting: Imagine a roller coaster with unpredictable highs and lows. That’s how most organizations view their data spend forecasting. Such unpredictability wreaks havoc on financial planning, making operational consistency a challenge.
  • Constraints in Adding Data Workloads: Imagine tying an anchor to a speedboat. That’s how the constraints feel when trying to adopt cloud data solutions, holding back progress and limiting agility.
  • Surge in AI Model Complexity: AI’s evolutionary pace is exponential. As it grows, so do the intricacies surrounding data volume and pipelines, which strain budgetary limitations.
  • The MTTR Bottleneck: The multifaceted nature of modern tech stacks means longer Mean Time to Repair (MTTR). This slows down processes, consumes valuable resources, and stalls innovation.

By acting as a comprehensive data observability and FinOps solution, Unravel Data empowers businesses to move past the challenges and frictions that typically hinder data operations, ensuring smoother, more efficient data-driven processes. Here’s how Unravel Data aids in navigating the roadblocks in the high-speed lane of data operations:

  • Predictive Data Spend Forecasting: With its advanced analytics, Unravel Data can provide insights into data consumption patterns, helping businesses forecast their data spending more accurately. This eliminates the roller coaster of unpredictable costs.
  • Simplifying Data Workloads: Unravel Data optimizes and automates workload management. Instead of being anchored down by the weight of complex data tasks, businesses can efficiently run and scale their data processes in the cloud.
  • Managing AI Model Complexity: Unravel offers visibility and insights into AI data pipelines. Analyzing and optimizing these pipelines ensure that growing intricacies do not overwhelm resources or budgets.
  • Reducing MTTR: By providing a clear view of the entire tech stack and pinpointing bottlenecks or issues, Unravel Data significantly reduces Mean Time to Repair. With its actionable insights, teams can address problems faster, reducing downtime and ensuring smooth data operations.
  • Streamlining Data Pipelines: Unravel Data offers tools to diagnose and improve data pipeline performance. This ensures that even as data grows in volume and complexity, pipelines remain efficient and agile.
  • Efficiency and ROI: With its clear insights into resource consumption and requirements, Unravel Data helps businesses run 50% more workloads in their existing Databricks environments, ensuring they only pay for what they need, reducing wastage and excess expenditure.

The Skyrocketing Growth of Cloud Data Management

As the digital realm expands, cloud data management usage is soaring, with data services accounting for a significant chunk. According to IDC, the public cloud IaaS and PaaS market is projected to reach $400 billion by 2025, growing at a 28.8% CAGR from 2021 to 2025. Some highlights:

  • Data management and application development account for 39% and 20% of the market, respectively, and are the main workloads backed by PaaS solutions, capturing a major share of its revenue.
  • In IaaS revenue, IT infrastructure leads with 25%, trailed by business applications (21%) and data management (20%).
  • Unstructured data analytics and media streaming are predicted to be the top-growing segments with CAGRs of 41.9% and 41.2%, respectively.

Unravel provides a comprehensive solution to address the growth associated with cloud data management. Here’s how:

  • Visibility and Transparency: Unravel offers in-depth insights into your cloud operations, allowing you to understand where and how costs are accruing, ensuring no hidden fees or unnoticed inefficiencies.
  • Optimization Tools: Through its suite of analytics and AI-driven tools, Unravel pinpoints inefficiencies, recommends optimizations, and automates the scaling of resources to ensure you’re only using (and paying for) what you need.
  • Forecasting: With predictive analytics, Unravel provides forecasts of data usage and associated costs, enabling proactive budgeting and financial planning.
  • Workload Management: Unravel ensures that data workloads run efficiently and without wastage, reducing both computational costs and storage overhead.
  • Performance Tuning: By optimizing query performance and data storage strategies, Unravel ensures faster results using fewer resources, translating to 50% more workloads.
  • Monitoring and Alerts: Real-time monitoring paired with intelligent alerts ensures that any resource-intensive operations or anomalies are flagged immediately, allowing for quick intervention and rectification.

By employing these strategies and tools, Unravel acts as a financial safeguard for businesses, ensuring that the ever-growing cloud data bill remains predictable, manageable, and optimized for efficiency.

The Tightrope Walk of Efficiency Tuning and Talent

Modern enterprises hinge on data and AI, but shrinking budgets and talent gaps threaten them. Gartner pinpoints overprovisioning and skills shortages as major roadblocks, while Google and IDC underscore the high demand for data analytics skills and the untapped potential of unstructured data. Here are some of the problems modern organizations face:

  • Production environments are statically overprovisioned and therefore underutilized. On-premises, 30% utilization is common, but it’s all capital expenditures (capex), and as long as it’s within budget, no one has traditionally cared about the waste. However, in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste. – Gartner
  • The cloud skills gap has reached a crisis level in many organizations – Gartner
  • Revenue creation through digital transformation requires talent engagement that is currently scarce and difficult to acquire and maintain. – Gartner
  • Lack of skills remains the biggest barrier to infrastructure modernization initiatives, with many organizations finding they cannot hire outside talent to fill these skills gaps. IT organizations will not succeed unless they prioritize organic skills growth. – Gartner
  • Data analytics skills are in demand across industries as businesses of all types around the world recognize that strong analytics improve business performance. – Google via Coursera

Unravel Data addresses the delicate balancing act of budget and talent in several strategic ways:

  • Operational Efficiency: Purpose-built AI provides actionable insights into data operations across Databricks, Spark, EMR, BigQuery, Snowflake, and more, reducing the need for trial-and-error and time-consuming manual interventions. At the core of Unravel’s data observability platform is our AI-powered Insights Engine, which combines AI techniques, algorithms, and tools to process and analyze vast amounts of data, learn from patterns, and make predictions or decisions based on that learning. This not only improves operational efficiency but also ensures that talented personnel spend their time innovating rather than on routine tasks.
  • Skills Gap Bridging: The platform’s intuitive interface and AI-driven insights mean that even those without deep expertise in specific data technologies can navigate, understand, and optimize complex data ecosystems. This eases the pressure to hire or train ultra-specialized talent.
  • Predictive Analysis: With Unravel’s ability to predict potential bottlenecks or inefficiencies, teams can proactively address issues, leading to more efficient budget allocation and resource utilization.
  • Cost Insights: Unravel provides detailed insights into the efficiency of various data operations, allowing organizations to make informed decisions on where to invest and where to cut back.
  • Automated Optimization: By automating many of the tasks traditionally performed by data engineers, like performance tuning or troubleshooting, Unravel ensures teams can do more with less, optimizing both budget and talent.
  • Talent Focus Shift: With mundane tasks automated and insights available at a glance, skilled personnel can focus on higher-value activities, like data innovation, analytics, and strategic projects.

By enhancing efficiency, providing clarity, and streamlining operations, Unravel Data ensures that organizations can get more from their existing budgets while maximizing the potential of their talent, turning the tightrope walk into a more stable journey.

The Intricacies of Data-Centric Organizations

Data-centric organizations grapple with the complexities of managing vast and fast-moving data in the digital age. Ensuring data accuracy, security, and compliance, while integrating varied sources, is challenging. They must balance data accessibility with protecting sensitive information, all while adapting to evolving technologies, addressing talent gaps, and extracting actionable insights from their data reservoirs. Here is some relevant research on the topic:

  • “Data is foundational to AI” yet “unstructured data remains largely untapped.” – IDC
  • Even as organizations rush to adopt data-centric operations, challenges persist. For instance, manufacturing data projects often hit roadblocks due to outdated legacy technology, as observed by the World Economic Forum.
  • Generative AI is supported by large language models (LLMs), which require powerful and highly scalable computing capabilities to process data in real-time. – Gartner

Unravel Data provides a beacon for data-centric organizations amid complex challenges. Offering a holistic view of data operations, it simplifies management using AI-driven tools. It ensures data security, accessibility, and optimized performance. With its intuitive interface, Unravel bridges talent gaps and navigates the data maze, turning complexities into actionable insights.

Embarking on the Unravel Journey: Your Step-By-Step Guide

  • Beginning your data journey with Unravel is as easy as 1-2-3. We guide you through the sign-up process, ensuring a smooth and hassle-free setup.
  • Visit the Unravel for Databricks page to get started.

Level Up with Unravel Premium

Ready for an enhanced data experience? Unravel’s premium account offers a plethora of advanced features that the free version can’t match. Investing in this upgrade isn’t just about more tools; it’s about supercharging your data operations and ROI.

Wrap-Up

Although rising demands on the modern data landscape are challenging, they are not insurmountable. With tools like Unravel, organizations can navigate these complexities, ensuring that data remains a catalyst for growth, not a hurdle. Dive into the Unravel experience and redefine your data journey today.

Unravel is a business’s performance sentinel in the cloud realm, proactively ensuring that burgeoning cloud data expenses are not only predictable and manageable but also primed for significant cost savings. It transforms the precarious balance of budget and talent into a streamlined, efficient journey, and it lights the path for data-centric organizations by streamlining operations with AI tools, ensuring data security, and optimizing performance. Its intuitive interface simplifies complex data landscapes, bridging talent gaps and converting challenges into actionable insights.

The post Overcoming Friction & Harnessing the Power of Unravel: Try It for Free appeared first on Unravel.

Databricks Cost Efficiency: 5 Reasons Why Unravel Free Observability vs. Databricks Free Observability https://www.unraveldata.com/resources/databricks-cost-efficiency-5-reasons-why-unravel-free-observability-vs-databricks-free-observability/ https://www.unraveldata.com/resources/databricks-cost-efficiency-5-reasons-why-unravel-free-observability-vs-databricks-free-observability/#respond Tue, 26 Sep 2023 13:33:35 +0000 https://www.unraveldata.com/?p=13920

“Data is the new oil, and analytics is the combustion engine.” – Peter Sondergaard Cloud data analytics is the key to maximizing value from your data. The lakehouse has emerged as a flexible and efficient architecture, […]

The post Databricks Cost Efficiency: 5 Reasons Why Unravel Free Observability vs. Databricks Free Observability appeared first on Unravel.

Transitioning big data workloads to the cloud

“Data is the new oil, and analytics is the combustion engine.” – Peter Sondergaard

Cloud data analytics is the key to maximizing value from your data. The lakehouse has emerged as a flexible and efficient architecture, and Databricks has emerged as a popular choice. However, data lakehouse processing volumes can fluctuate, leading to unpredictable surges in cloud data spending that impact budgeting and profitability. Executives want to make sure they are getting the most from their lakehouse investments and not overspending.

Implementing a proactive data observability and FinOps approach early in your lakehouse journey helps ensure you achieve your business objectives and bring predictability to your financial planning. Choosing the right lakehouse observability and FinOps tool sets your team up for success. Since the goal is efficiency, starting with free tools makes sense. Two free options stand out:

  • Overwatch – the open source Databricks observability tool
  • Unravel Standard – the free version of Unravel’s data observability and FinOps platform

Below are 5 reasons to choose Unravel free observability vs. Databricks free observability:

Reason #1: Complete observability

Many organizations take a do-it-yourself approach, building piecemeal observability solutions in-house by cobbling together a variety of data sources using open source tools. The problem is that it takes months or even years to get something usable up and running. Unravel’s data observability and FinOps platform helps you get results fast.

Unravel provides a 360° out-of-the-box solution

Unravel provides a holistic view of your Databricks estate, reducing time to value. Gain deep insights into cluster performance, job execution, resource utilization, and cost drivers through comprehensive lakehouse observability, with detailed visibility into how each of your Databricks clusters is performing.

Unravel for Databricks insights overview dashboard

Insights overview dashboard in Unravel Standard

Ensure no blind spots in your analysis by leveraging Unravel’s end-to-end visibility across all aspects of your Databricks environment. View your lakehouse landscape at a glance with the Insights Overview dashboard. You can see the overall health of your Databricks estate, including the number of clusters that are over- or under-provisioned, the total number of inefficient and failed apps, and other summary statistics to guide your efforts to optimize your lakehouse towards better performance and cost efficiency.

Purpose-built correlation

Unravel’s purpose-built correlation models help you identify inefficient jobs at code, data layout/partitioning, and infrastructure levels. Databricks logs, metrics, events, traces, and source code are automatically evaluated to simplify root cause analysis and issue resolution. You can dive deep into the execution details of your Databricks jobs, track the progress of each job, and see resource usage details. This helps you identify long-running and resource-intensive jobs that might be impacting the overall performance and efficiency of your lakehouse estate.

End-to-end visibility

Visual summaries provide a way to look across all the jobs and clusters in your Databricks workspace. No need to click around your Databricks workspace looking for issues, run queries, or pull details into a spreadsheet to summarize results. Unravel helps you easily see all the details in one place. 

Reason #2: Real-time visibility

A single errant job or underutilized cluster can derail your efficiency goals and delay critical data pipelines. The ability to see job and cluster performance and efficiency in real time provides an early warning system.

Live updates for running jobs and clusters

React promptly to any anomalies or bottlenecks in your clusters and jobs to ensure efficiency. Unravel’s real-time insights allow you to catch long-running jobs before they impact pipeline performance or consume unnecessary resources.
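
To make the early-warning idea concrete, here is a minimal sketch (not Unravel’s implementation) that polls the Databricks Jobs API for active runs and flags any run that has exceeded a duration threshold. The workspace host, token environment variables, and the 60-minute threshold are placeholder assumptions; a real alerting workflow would route the result to a pager or chat channel rather than printing it.

import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. your workspace URL (placeholder)
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token (placeholder)
THRESHOLD_MIN = 60                      # illustrative duration threshold

# List currently active job runs via the Jobs 2.1 API.
resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"active_only": "true", "limit": 25},
    timeout=30,
)
resp.raise_for_status()

now_ms = time.time() * 1000
for run in resp.json().get("runs", []):
    elapsed_min = (now_ms - run["start_time"]) / 60_000
    if elapsed_min > THRESHOLD_MIN:
        print(f"Long-running run {run['run_id']} on job {run['job_id']}: "
              f"{elapsed_min:.0f} min and still active; investigate before it delays downstream pipelines")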

Unravel for Databricks workspace trends dashboard

Workspace Trends dashboard in Unravel Standard

See DBU usage and cluster session trends

By understanding the real-time performance of your Databricks workloads, you can identify where changes will improve efficiency without sacrificing performance. Leverage Unravel’s real-time insights to make data-driven decisions for better resource allocation and workload management.

Drill down to see DBU usage and tasks for a specific day

Quickly find resource consumption outliers by day to understand how usage patterns are driving costs. Unravel helps you identify opportunities to reduce waste and increase cluster utilization. By having visibility into the real-time cost implications of your jobs and clusters, you can make faster decisions to boost performance and improve business results.

User-level reporting for showback/chargeback

Granular reporting to the user and job level helps you produce accurate and timely showback and chargeback reports. With Unravel’s real-time visibility into your Databricks workloads, you have the power to see which teams are consuming the most resources and proactively manage costs to achieve efficient operations. Reacting quickly to anomalies and leveraging real-time, user-level insights enables better decision-making for resource allocation and utilization. Unravel enables central data platform and operations teams to provide a reliable, single source of truth for showback and chargeback reporting.
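
To illustrate the kind of roll-up behind a showback report, here is a minimal sketch, assuming a hypothetical usage export named databricks_usage_export.csv with date, user, job_name, and dbu_cost columns (illustrative names, not a specific Unravel or Databricks schema):

import pandas as pd

# Hypothetical usage export: one row per job run with its attributed cost.
usage = pd.read_csv("databricks_usage_export.csv")

# Showback detail: total attributed cost per user and job, largest first.
showback = (
    usage.groupby(["user", "job_name"], as_index=False)["dbu_cost"]
         .sum()
         .sort_values("dbu_cost", ascending=False)
)
print(showback.head(10))

# Monthly roll-up per user for the finance report.
usage["month"] = pd.to_datetime(usage["date"]).dt.to_period("M")
monthly = usage.pivot_table(index="user", columns="month", values="dbu_cost", aggfunc="sum", fill_value=0)
print(monthly)

In practice, attributing costs accurately to users, jobs, and tags is the hard part; once attribution is in place, the reporting itself is a simple aggregation like the one above.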

Try Unravel for free
Sign up for free account

Reason #3: Automated Cluster Discovery

You can’t fix problems you can’t see. It all begins with getting visibility across all your workspace clusters and jobs. Unravel automates this process to save you time and ensure you don’t miss anything.

Easily connect to all of your clusters in the workspace

Simplify the process of connecting to your Databricks clusters with Unravel’s automated cluster discovery. This streamlines the observability and management of your compute clusters, so you can focus on resource optimization to boost productivity. Unravel lets you easily see all of your clusters without adding dependencies.

Unravel for Databricks cluster dashboard

Compute dashboard shows clusters in Unravel Standard

Quickly discover clusters with errors, delays, and failures

Unravel lets you see clusters grouped by event type (e.g., Contended Driver, High I/O, Data Skew, Node Downsizing). This helps you quickly identify patterns in compute clusters that are not being fully utilized. This eliminates the need for manual monitoring and analysis, saving you time and effort.

View cluster resource trends

Unravel’s intelligent automation continuously monitors cluster activity and resource utilization over time. This helps you spot changing workload requirements and helps ensure optimal performance while keeping costs in check by avoiding overprovisioning or underutilization to make the most of your cloud infrastructure investments.
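
The basic check behind such insights is simple to sketch. The example below flags clusters whose average utilization stays low, assuming a hypothetical export named cluster_utilization.csv with cluster_id, avg_cpu_pct, and avg_mem_pct columns; the 30% CPU and 40% memory thresholds are illustrative:

import pandas as pd

# Hypothetical export: average utilization per cluster over the last 7 days.
metrics = pd.read_csv("cluster_utilization.csv")

overprovisioned = metrics[(metrics["avg_cpu_pct"] < 30) & (metrics["avg_mem_pct"] < 40)]
for _, row in overprovisioned.iterrows():
    print(f"Cluster {row['cluster_id']}: CPU {row['avg_cpu_pct']:.0f}%, "
          f"memory {row['avg_mem_pct']:.0f}% on average; candidate for downsizing or autoscaling")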

Reason #4: Ease of Entry

Open source and DIY solutions typically have a steep learning curve to ensure everything is correctly configured and connected. Frequent changes and updates add to your team’s already full load. Unravel offers a simpler approach.

Unravel is quick to set up and get started with minimal learning curve

Integrating Unravel into your existing Databricks environment is a breeze. No complex setup or configuration required. With Unravel, you can seamlessly bring data observability and FinOps capabilities to your data lakehouse estate without breaking a sweat.

Unravel SaaS makes setup and configuration a breeze

But what exactly does this mean for you? It means that you can focus on what matters most—getting the most out of your Databricks platform while keeping costs in check. Unravel’s data observability and FinOps capabilities are provided as a fully managed service, giving you the power to optimize performance and resources, spot efficiency opportunities, and ensure smooth operation of your data pipelines and data applications.

No DIY coding or development required

Unravel is trusted by large enterprise customers across many industries for its ease of integration into their Databricks environments. Whether you’re a small team or an enterprise organization, Unravel’s data observability and FinOps platform is designed to meet your specific needs and use cases without the need to build anything from scratch.

Try Unravel for free
Sign up for free account

Reason #5: Avoid lock-in

A lakehouse architecture gives you flexibility. As your data analytics and data processing needs grow and evolve, you may choose additional analytics tools to complement your cloud data estate. Your data observability and FinOps tool should support those tools as well.

Unravel is purpose-built for Databricks, Snowflake, BigQuery, and other modern data stacks

Each cloud data platform is different and requires a deep understanding of its inner workings in order to provide the visibility you need to run efficiently. Unravel is designed from the ground up to help you get the most out of each modern data platform, leveraging the most relevant and valuable metadata sources and correlating them all into a unified view of your data estate.

No need to deploy a separate tool as your observability needs grow

Unravel provides a consistent approach to data observability and FinOps to minimize time spent deploying and learning new tools. Data teams spend less time upskilling and more time getting valuable insights.

Independent reporting for FinOps

Data analytics is the fastest growing segment of cloud computing as organizations invest in new use cases such as business intelligence (BI), AI and machine learning. Organizations are adopting FinOps practices to ensure transparency in resource allocation, usage, and reporting. Unravel provides an independent perspective of lakehouse utilization and efficiency to ensure objective, data-driven decisions.

Compare Unravel and Databricks free observability

Databricks vs. Unravel free observability

Get started today

Achieve predictable spend and gain valuable insights into your Databricks usage. Get started today with Unravel’s complete data observability and FinOps platform for Databricks that provides real-time visibility, automated cluster discovery, ease of entry, and independent analysis to help you take control of your costs while maximizing the value of your Databricks investments. Create your free Unravel account today.


Unravel for Databricks FAQ 

Can I use Unravel’s data observability platform with other cloud providers?

Yes. Unravel’s data observability platform is designed to work seamlessly across multiple cloud providers including AWS, Azure, and Google Cloud. So regardless of which cloud provider you choose for your data processing needs, Unravel can help you optimize costs and gain valuable insights.

How does automated cluster discovery help in managing Databricks costs?

Automated cluster discovery provided by solutions like Unravel enables you to easily identify underutilized or idle clusters within your Databricks environment. By identifying these clusters, you can make informed decisions about resource allocation and ensure that you are only paying for what you actually need.

Does Unravel offer real-time visibility into my Databricks usage?

Yes. With Unravel’s real-time visibility feature, you can monitor your Databricks usage in real time. This allows you to quickly identify any anomalies or issues that may impact cost efficiency and take proactive measures to address them.

Can Unravel help me optimize my Databricks costs for different workloads?

Yes. Unravel’s data observability platform provides comprehensive insights into the performance and cost of various Databricks workloads. By analyzing this data, you can identify areas for optimization and make informed decisions to ensure cost efficiency across different workloads.

How easy is it to get started with Unravel’s data observability platform?

Getting started with Unravel is quick and easy. Simply sign up for a free account on our website, connect your Databricks environment, and start gaining valuable insights into your usage and costs. Our intuitive interface and user-friendly features make it simple for anyone to get started without any hassle.

The post Databricks Cost Efficiency: 5 Reasons Why Unravel Free Observability vs. Databricks Free Observability appeared first on Unravel.

5 Key Ingredients to Accurate Cloud Data Budget Forecasting https://www.unraveldata.com/resources/5-key-ingredients-to-accurate-cloud-data-budget-forecasting/ https://www.unraveldata.com/resources/5-key-ingredients-to-accurate-cloud-data-budget-forecasting/#respond Mon, 28 Aug 2023 20:14:45 +0000 https://www.unraveldata.com/?p=13685

Hey there! Have you ever found yourself scratching your head over unpredictable cloud data costs? It’s no secret that accurately forecasting cloud data spend can be a real headache. Fluctuating costs make it challenging to plan […]

The post 5 Key Ingredients to Accurate Cloud Data Budget Forecasting appeared first on Unravel.

Data Pipeline Abstract

Hey there! Have you ever found yourself scratching your head over unpredictable cloud data costs? It’s no secret that accurately forecasting cloud data spend can be a real headache. Fluctuating costs make it challenging to plan and allocate resources effectively, leaving businesses vulnerable to budget overruns and financial challenges. But don’t worry, we’ve got you covered!

Uncontrolled fluctuations in cloud data spend can hinder growth and profitability, disrupting the smooth sailing of your business operations. That’s why it’s crucial to gain control over your cloud data workload forecasts. By understanding the changes in your cloud data spend and having a clear picture of your billing data, you can make informed decisions that align with your company’s goals.

We’ll discuss practical strategies to improve forecast accuracy, identify data pipelines and analytics workloads that are above or below budget, and enhance accountability across different business units.

So let’s dive right in and discover how you can steer your business towards cost-effective cloud management!

Unanticipated cloud data spend

Last year, over $16 billion was wasted in cloud spend. Data management is the largest and fastest-growing category of cloud spending, representing 39% of the typical cloud bill. Gartner noted that in 2022, 98% of the overall database management system (DBMS) market growth came from cloud-based database platforms. Cloud data costs are often the most difficult to predict due to fluctuating workloads. 82% of 157 data management professionals surveyed by Forrester cited difficulty predicting data-related cloud costs. On top of the fluctuations that are inherent with data workloads, a lack of visibility into cloud data spend makes it challenging to manage budgets effectively.

  • Fluctuating workloads: Data processing and storage costs are driven by the amount of data stored and analyzed. With varying workloads, it becomes challenging to accurately estimate the required data processing and storage costs. This unpredictability can result in budget overruns that affect 60% of infrastructure and operations (I&O) leaders.
  • Unexpected expenses: Streaming data, large amounts of unstructured and semi-structured data, and shared slot pool consumption can quickly drive up cloud data costs. These factors contribute to unforeseen spikes in usage that may catch organizations off guard, leading to unexpected expenses on their cloud bills.
  • Lack of visibility: Without granular visibility into cloud data analytics billing information, businesses have no way to accurately allocate costs down to the job or user level. This makes it difficult for them to track usage patterns and identify areas where budgets will be over- or under-spent, or where performance and cost optimization are needed.

By implementing a FinOps approach, businesses can gain better control over their cloud data spend, optimize their budgets effectively, and avoid unpleasant surprises when it comes time to pay the bill.

Why cloud data costs are so unpredictable

Cloud data costs can be a source of frustration for many businesses due to their unpredictability. Cloud and data platform providers often have complex pricing models that make it challenging to accurately forecast expenses. Here are some key reasons why cloud data analytics costs can be so difficult to predict:

  • Variety of factors affect analytics costs: Cloud and data platform providers offer a range of services and pricing options, making it hard to determine the exact cost of using specific resources. Factors such as compute instance and cluster sizes, regional pricing, and additional features all contribute to the final cloud and data platform bill.
  • Usage patterns impact cost: Fluctuations in usage patterns can significantly affect cloud data costs. Peak demand periods or sudden increases in data volume can result in unexpected expenses. Without proper planning, businesses may find themselves facing higher bills during these periods.
  • Lack of visibility into resource utilization: Inefficient workload management and a lack of visibility into resource utilization can lead to higher expenses. Without the ability to monitor and optimize resource allocation, businesses may end up paying for unused or underutilized resources.
  • Inability to allocate historical spend: A lack of granular visibility into data costs at the job, project, and user level makes it nearly impossible to accurately allocate historical spend or forecast future investments. This makes budgeting and financial planning challenging for businesses that rely on cloud data platforms.
  • Changes in technology or service offerings: Cloud and data platform providers frequently introduce new technologies or adjust their service offerings, which can impact cost structures. Businesses must stay updated on these changes as they may influence their overall cloud expenditure.

Navigating the complexities of cloud data forecasting requires careful analysis and proactive management of resource consumption fluctuations and cost unpredictability. By understanding usage patterns, optimizing capacity utilization, and staying informed about changes from cloud and data platform providers, businesses can gain better control over their cloud data costs.

5 key ingredients of an accurate cloud data cost forecast

To ensure an accurate cloud data cost forecast, several key ingredients must be considered. These include:

  1. Comprehensive understanding of historical usage patterns and trends: Analyzing past usage data provides valuable insights into resource consumption and enables more accurate predictions of future spending.
  2. Granular visibility into data resource usage: It is essential to have detailed visibility into the utilization of resources down to the job and user level. This level of granularity enables a more precise estimation of costs associated with specific tasks or individuals.
  3. Analysis of current platform configurations and workload requirements: Evaluating the existing data platform settings, data access patterns, and workload demands help predict growth rates and changes in cloud data spend.
  4. Consideration of external factors: External factors such as market conditions or regulatory changes can significantly impact cloud data processing costs. Incorporating these variables into the forecasting model ensures a more realistic projection.
  5. Utilization of advanced forecasting techniques and algorithms: Leveraging advanced techniques and algorithms enhances the accuracy of predictions by accounting for various factors simultaneously, resulting in more reliable forecasts.

        By incorporating these key ingredients into your cloud data forecasting strategy, you can gain better control over your forecast and achieve higher accuracy in predicting future expenses. With a comprehensive understanding of historical patterns, granular visibility into resource usage, analysis of configurations and workload requirements, consideration of external influences, and advanced forecasting techniques, you can make informed decisions to increase the accuracy of your cloud data spend forecasts.

        Remember that accurate cloud data cost forecasting is crucial for effective financial planning within your organization’s cloud environment.

        Explore different methods and tools for accurate cloud data forecasting

        Statistical modeling techniques like time series analysis can be used to predict future trends based on patterns in historical data. These predictive models help improve forecast accuracy by identifying recurring patterns and extrapolating them into the future.
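
As a minimal example of the technique, the sketch below fits a Holt-Winters model (via statsmodels) to a hypothetical export named daily_data_spend.csv containing date and cost_usd columns, and projects the next 30 days of spend; the weekly seasonality assumption reflects typical batch-pipeline schedules:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical export of daily cloud data spend: two columns, date and cost_usd.
daily = (pd.read_csv("daily_data_spend.csv", parse_dates=["date"])
           .set_index("date")["cost_usd"]
           .asfreq("D")
           .fillna(0.0))   # treat days with no recorded jobs as zero spend

# Capture both trend and weekly seasonality, then project 30 days ahead.
model = ExponentialSmoothing(daily, trend="add", seasonal="add", seasonal_periods=7).fit()
forecast = model.forecast(30)
print(f"Projected spend over the next 30 days: ${forecast.sum():,.0f}")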

        Machine learning algorithms offer another powerful tool for cloud data forecasting. By analyzing vast amounts of information, these algorithms can generate accurate forecasts by identifying complex relationships and trends within the data. This enables organizations to make informed estimates of their cloud data processing patterns to help anticipate cyclical usage and growth.

        Cloud management platforms provide built-in forecasting capabilities. However, it is important to note that forecasts are only as good as the underlying data. Without granular visibility into the jobs, projects, and users utilizing the cloud resources, forecast models cannot take key drivers into account. To ensure accurate predictions, it is crucial to distinguish one-time or end-of-period data processing from ongoing processing related to growing customer and end-user activity.

        How purpose-built AI improves cloud data cost forecasting

        Purpose-built AI is a game-changer. By leveraging advanced algorithms and machine learning capabilities, businesses can unlock the full potential of their cloud data cost management. 

        Here’s how purpose-built AI improves cloud data cost forecasting:

        • Identifying hidden cost drivers: Purpose-built AI can analyze vast amounts of data and identify subtle factors that contribute to increased costs. It goes beyond surface-level analysis and uncovers underlying patterns, enabling businesses to accurately anticipate cloud analytics resource needs (a minimal anomaly-flagging sketch follows this list).
        • Continuous learning for improved accuracy: Machine learning models are continuously trained on past performance, enabling them to learn from historical data and improve accuracy over time. This means that as more data becomes available, the forecasts become increasingly reliable.
        • Proactive decision-making with predictive analytics: Predictive analytics powered by purpose-built AI enable businesses to make proactive decisions regarding their cloud expenditure. By analyzing trends and patterns, organizations can anticipate future costs and take necessary steps to avoid unnecessary expenses or mitigate risks associated with fluctuating costs.
        • AI-driven recommendations for enhanced cost efficiency: Purpose-built AI provides valuable recommendations based on its analysis of cloud data cost patterns. These recommendations help businesses improve their overall cost efficiency, ensuring that resources are allocated optimally to accelerate cloud data platform usage.
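
The anomaly-flagging sketch referenced above is a deliberately simple stand-in for what purpose-built AI does at scale: it compares each day’s spend in a hypothetical daily_data_spend.csv export (date and cost_usd columns) against its recent 14-day trend and flags large deviations:

import pandas as pd

# Hypothetical daily spend series; flag days that deviate sharply from the recent trend.
daily = (pd.read_csv("daily_data_spend.csv", parse_dates=["date"])
           .set_index("date")["cost_usd"])

rolling_mean = daily.rolling(14).mean()
rolling_std = daily.rolling(14).std()
z_score = (daily - rolling_mean) / rolling_std

anomalies = daily[z_score.abs() > 3]   # more than 3 standard deviations from the 14-day trend
print(anomalies)

Purpose-built AI replaces the fixed window and threshold with models that learn seasonality, correlate spend with specific jobs and users, and explain why a spike happened.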

        Using Unravel to Improve Cloud Data Cost Forecast Accuracy to within ±10%

        These different approaches help improve the accuracy of your cloud data cost forecasts. But how can you take it a step further? That’s where Unravel comes in.

        Unravel is a purpose-built AI platform that can revolutionize the accuracy of your cloud data cost forecasts. Unravel provides real-time insights into your data usage patterns, identifies budget trends, and predicts future costs with remarkable accuracy. With its intuitive interface and easy-to-use features, Unravel empowers you to make informed decisions about resource allocation, budget planning, and overall cost management in the cloud.

        Ready to take control of your cloud data costs? Start using Unravel today and unlock the full potential of accurate cloud data cost forecasting.

        FAQs

        Q: How does Unravel improve cloud data cost forecasting?

        A: Unravel leverages advanced machine learning algorithms to analyze historical usage patterns and identify trends in your cloud data costs. By understanding these patterns, it can accurately predict future costs and provide actionable insights for optimizing resource allocation.

        Q: Can I integrate Unravel with my existing cloud data platform?

        A: Yes! Unravel seamlessly integrates with popular cloud platforms such as AWS, Azure, and Google Cloud Platform. It supports both on-premises and hybrid environments as well.

        Q: Is Unravel suitable for businesses of all sizes?

        A: Absolutely! Whether you’re a small startup or a large enterprise, Unravel caters to businesses of all sizes that have sizable data estates and want to leverage data to maximize business value. Unravel’s scalable architecture handles the needs of organizations running tens of thousands of data platform jobs, such as DBS Bank.

        Q: How long does it take to see results with Unravel?

        A: You’ll start seeing immediate benefits once you integrate Unravel into your infrastructure. Its real-time insights and actionable recommendations allow you to make informed decisions right from the get-go.

        Q: Can Unravel help with other aspects of cloud data management?

        A: Yes, Unravel offers a comprehensive suite of features for end-to-end cloud data performance management, FinOps for your data cloud, intelligent data quality, and forecast accuracy. From performance optimization to cost governance, Unravel provides a holistic solution for your cloud analytics needs.

        The post 5 Key Ingredients to Accurate Cloud Data Budget Forecasting appeared first on Unravel.

        Blind Spots in Your System: The Grave Risks of Overlooking Observability https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/ https://www.unraveldata.com/resources/blind-spots-in-your-system-the-grave-risks-of-overlooking-observability/#respond Mon, 28 Aug 2023 20:14:12 +0000 https://www.unraveldata.com/?p=13692

        I was an Enterprise Data Architect at Boeing NASA Systems the day of the Columbia Shuttle disaster. The tragedy has had a profound impact in my career on how I look at data. Recently I have […]

        The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

        Columbia shuttle

        I was an Enterprise Data Architect at Boeing NASA Systems on the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on how I have looked at data throughout my career. Recently I have been struck by how, had it been available back then, purpose-built AI observability might have altered the course of events.

        The Day of the Disaster

        With complete reverence and respect for the Columbia disaster, I can remember that day with incredible clarity. I was standing in front of my TV watching the Shuttle Columbia disintegrate on reentry. As the Enterprise Data Architect at Boeing NASA Systems, I received a call from my boss: “We have to pull and preserve all the Orbiter data. I will meet you at the office.” Internally, Boeing referred to the Shuttles as Orbiters. It was only months earlier that I had the opportunity to share punch and cookies with some of the crew. To say I was being flooded with emotions would be a tremendous understatement.

        When I arrived at Boeing NASA Systems, near Johnson Space Center, the mood in the hallowed aeronautical halls was somber. I sat intently, with eyes scanning lines of data from decades of space missions. The world had just witnessed a tragedy — the Columbia Shuttle disintegrated on reentry, taking the lives of seven astronauts. The world looked on in shock and sought answers. As a major contractor for NASA, Boeing was at the forefront of this investigation.

        As the Enterprise Data Architect, I was one of an army helping dissect the colossal troves of data associated with the shuttle, looking for anomalies, deviations — anything that could give a clue as to what had gone wrong. Days turned into nights, and nights turned into weeks and months as we tirelessly pieced together Columbia’s final moments. But as we delved deeper, a haunting reality began to emerge.

        Every tiny detail of the shuttle was monitored, from the heat patterns of its engines to the radio signals it emitted. But there was a blind spot, an oversight that no one had foreseen. In the myriad of data sets, there was nothing that indicated the effects of a shuttle’s insulation tiles colliding with a piece of Styrofoam, especially at speeds exceeding 500 miles per hour.

        The actual incident was seemingly insignificant — a piece of foam insulation had broken off and struck the shuttle’s left wing. But in the vast expanse of space and the brutal conditions of reentry, even minor damage could prove catastrophic.

        Video footage confirmed: the foam had struck the shuttle. But without concrete data on what such an impact would do, the team was left to speculate and reconstruct potential scenarios. The lack of this specific data had turned into a gaping void in the investigation.

        As a seasoned Enterprise Data Architect, I always believed in the power of information. I absolutely believed that in the numbers, in the bytes and bits, we find the stories that the universe whispers to us. But this time, the universe had whispered a story that we didn’t have the data to understand fully.

        Key Findings

        After the accident, NASA formed the Columbia Accident Investigation Board (CAIB) to investigate the disaster. The board consisted of experts from various fields outside of NASA to ensure impartiality.

        1. Physical Cause: The CAIB identified the direct cause of the accident as a piece of foam insulation from the shuttle’s external fuel tank that broke off during launch. This foam struck the leading edge of the shuttle’s left wing. Although foam shedding was a known issue, it had been wrongly perceived as non-threatening because of prior flights where the foam was lost but didn’t lead to catastrophe.

        2. Organizational Causes: Beyond the immediate physical cause, the CAIB highlighted deeper institutional issues within NASA. They found that there were organizational barriers preventing effective communication of safety concerns. Moreover, safety concerns had been normalized over time due to prior incidents that did not result in visible failure. Essentially, since nothing bad had happened in prior incidents where the foam was shed, the practice had been erroneously deemed “safe.”

        3. Decision Making: The CAIB pointed out issues in decision-making processes that relied heavily on past success as a predictor of future success, rather than rigorous testing and validation.

        4. Response to Concerns: There were engineers who were concerned about the foam strike shortly after Columbia’s launch, but their requests for satellite imagery to assess potential damage were denied. The reasons were multifaceted, ranging from beliefs that nothing could be done even if damage was found, to a misunderstanding of the severity of the situation.

        The CAIB made a number of recommendations to NASA to improve safety for future shuttle flights. These included:

        1. Physical Changes: Improving the way the shuttle’s external tank was manufactured to prevent foam shedding and enhancing on-orbit inspection and repair techniques to address potential damage.

        2. Organizational Changes: Addressing the cultural issues and communication barriers within NASA that led to the accident, and ensuring that safety concerns were more rigorously addressed.

        3. Continuous Evaluation: Establishing an independent Technical Engineering Authority responsible for technical requirements and all waivers to them, and building an independent safety program that oversees all areas of shuttle safety.

        Could Purpose-built AI Observability Have Helped?

        In the aftermath, NASA grounded the shuttle fleet for more than two years after the disaster. They then implemented the CAIB’s recommendations before resuming shuttle flights. Columbia’s disaster, along with the Challenger explosion in 1986, are stark reminders of the risks of space travel and the importance of a diligent and transparent safety culture. The lessons from Columbia shaped many of the safety practices NASA follows in its current human spaceflight programs.

        The Columbia disaster led to profound changes in how space missions were approached, with a renewed emphasis on data collection and eliminating informational blind spots. But for me, it became a deeply personal mission. I realized that sometimes, the absence of data could speak louder than the most glaring of numbers. It was a lesson I would carry throughout my career, ensuring that no stone was left unturned, and no data point overlooked.

        The Columbia disaster, at its core, was a result of both a physical failure (foam insulation striking the shuttle’s wing) and organizational oversights (inadequate recognition and response to potential risks). Purpose-built AI Observability, which involves leveraging artificial intelligence to gain insights into complex systems and predict failures, might have helped in several key ways:

        1. Real-time Anomaly Detection: Modern AI systems can analyze vast amounts of data in real time to identify anomalies. If an AI-driven observability platform had been monitoring the shuttle’s various sensors and systems, it might have detected unexpected changes or abnormalities in the shuttle’s behavior after the foam strike, potentially even subtle ones that humans might overlook.

        2. Historical Analysis: An AI system with access to all previous shuttle launch and flight data might have detected patterns or risks associated with foam-shedding incidents, even if they hadn’t previously resulted in a catastrophe. The system could then raise these as potential long-term risks.

        3. Predictive Maintenance: AI-driven tools can predict when components of a system are likely to fail based on current and historical data. If applied to the shuttle program, such a system might have provided early warnings about potential vulnerabilities in the shuttle’s design or wear-and-tear.

        4. Decision Support: AI systems could have aided human decision-makers in evaluating the potential risks of continuing the mission after the foam strike, providing simulations, probabilities of failure, or other key metrics to help guide decisions.

        5. Enhanced Imaging and Diagnosis: If equipped with sophisticated imaging capabilities, AI could analyze images of the shuttle (from external cameras or satellites) to detect potential damage, even if it’s minor, and then assess the risks associated with such damage.

        6. Overcoming Organizational Blind Spots: One of the major challenges in the Columbia disaster was the normalization of deviance, where foam shedding became an “accepted” risk because it hadn’t previously caused a disaster. An AI system, being objective, doesn’t suffer from these biases. It would consistently evaluate risks based on data, not on historical outcomes.

        7. Alerts and Escalations: An AI system can be programmed to escalate potential risks to higher levels of authority, ensuring that crucial decisions don’t get caught in bureaucratic processes.

        While AI Observability could have provided invaluable insights and might have changed the course of events leading to the Columbia disaster, it’s essential to note that the integration of such AI systems also requires organizational openness to technological solutions and a proactive attitude toward safety. The technology is only as effective as the organization’s willingness to act on its findings.

        The tragedy served as a grim reminder for organizations worldwide: It’s not just about collecting data; it’s about understanding the significance of what isn’t there. Because in those blind spots, destiny can take a drastic turn.

        In Memory and In Action

        The Columbia crew and their families deserve our utmost respect and admiration for their unwavering commitment to space exploration and the betterment of humanity. 

        • Rick D. Husband: As the Commander of the mission, Rick led with dedication, confidence, and unparalleled skill. His devotion to space exploration was evident in every decision he made. We remember him not just for his expertise, but for his warmth and his ability to inspire those around him. His family’s strength and grace, in the face of the deepest pain, serve as a testament to the love and support they provided him throughout his journey.
        • William C. McCool: As the pilot of Columbia, William’s adeptness and unwavering focus were essential to the mission. His enthusiasm and dedication were contagious, elevating the spirits of everyone around him. His family’s resilience and pride in his achievements are a reflection of the man he was — passionate, driven, and deeply caring.
        • Michael P. Anderson: As the payload commander, Michael’s role was vital, overseeing the myriad of experiments and research aboard the shuttle. His intellect was matched by his kindness, making him a cherished member of the team. His family’s courage and enduring love encapsulate the essence of Michael’s spirit — bright, optimistic, and ever-curious.
        • Ilan Ramon: As the first Israeli astronaut, Ilan represented hope, unity, and the bridging of frontiers. His enthusiasm for life was infectious, and he inspired millions with his historic journey. His family’s grace in the face of the unthinkable tragedy is a testament to their shared dream and the values that Ilan stood for.
        • Kalpana Chawla: Known affectionately as ‘KC’, Kalpana’s journey from a small town in India to becoming a space shuttle mission specialist stands as an inspiration to countless dreamers worldwide. Her determination, intellect, and humility made her a beacon of hope for many. Her family’s dignity and strength, holding onto her legacy, reminds us all of the power of dreams and the sacrifices made to realize them.
        • David M. Brown: As a mission specialist, David brought with him a zest for life, a passion for learning, and an innate curiosity that epitomized the spirit of exploration. He ventured where few dared and achieved what many only dreamt of. His family’s enduring love and their commitment to preserving his memory exemplify the close bond they shared and the mutual respect they held for each other.
        • Laurel B. Clark: As a mission specialist, Laurel’s dedication to scientific exploration and discovery was evident in every task she undertook. Her warmth, dedication, and infectious enthusiasm made her a beloved figure within her team and beyond. Her family’s enduring spirit, cherishing her memories and celebrating her achievements, is a tribute to the love and support that were foundational to her success.

        To each of these remarkable individuals and their families, we extend our deepest respect and gratitude. Their sacrifices and contributions will forever remain etched in the annals of space exploration, reminding us of the human spirit’s resilience and indomitable will.

        For those of us close to the Columbia disaster, it was more than a failure; it was a personal loss. Yet, in memory of those brave souls, we are compelled to look ahead. In the stories whispered to us by data, and in the painful lessons from their absence, we seek to ensure that such tragedies remain averted in the future.

        While no technology can turn back time, the promise of AI Observability beckons a future where every anomaly is caught, every blind spot illuminated, and every astronaut returns home safely.

        The above narrative seeks to respect the gravity of the Columbia disaster while emphasizing the potential of AI Observability. It underlines the importance of data, both in understanding tragedies and in preventing future ones.

        The post Blind Spots in Your System: The Grave Risks of Overlooking Observability appeared first on Unravel.

        Unlocking Success with FinOps: Top Insights from Expert Virtual Event https://www.unraveldata.com/resources/unlocking-success-with-finops-top-insights-from-expert-virtual-event/ https://www.unraveldata.com/resources/unlocking-success-with-finops-top-insights-from-expert-virtual-event/#respond Fri, 11 Aug 2023 16:26:15 +0000 https://www.unraveldata.com/?p=13538

        The data landscape is constantly evolving, and with it come new challenges and opportunities for data teams. While generative AI and large language models (LLMs) seem to be all everyone is talking about, they are just […]

        The post Unlocking Success with FinOps: Top Insights from Expert Virtual Event appeared first on Unravel.


        The data landscape is constantly evolving, and with it come new challenges and opportunities for data teams. While generative AI and large language models (LLMs) seem to be all everyone is talking about, they are just the latest manifestation of a trend that has been evolving over the past several years: organizations tapping into petabyte-scale data volumes and running increasingly massive data pipelines to deliver ever more data analytics projects and AI/ML models. 

        The scale and pace of data consumption are rising exponentially.

        Data is now core to every business. Untapping its potential to uncover business insights, drive operational efficiencies, and develop new data products is a key competitive differentiator that separates winners from losers. But running the enormous data workloads that fuel such innovation is expensive. Businesses are already struggling to keep their cloud data costs within budget. So, how can companies increase business-critical data output without sending cloud data costs through the roof?

        Effective cost management becomes paramount. That’s where FinOps principles come into play, helping organizations to optimize their cloud resources and align them with business goals. 

        A recent virtual fireside chat with Sanjeev Mohan, former Gartner Research VP for Big Data & Advanced Analytics and founder/principal of SanjMo, Unravel VP of Solutions Engineering Chris Santiago, and DataOps Champion and Certified FinOps Practitioner Clinton Ford discussed five tips for getting started with FinOps for data workloads. 

        Why FinOps for Data Makes Sense

        The virtual event kicked off discussing Gartner analyst Lydia Leong’s argument that building a dedicated FinOps team is a waste of time. Our panelists broke down why FinOps for data teams is actually crucial for companies running big data workloads in the cloud. Then they talked about how Spotify fixed a $1 million query, using that as an example of FinOps principles in practice. 

        The experts emphasized that having a clear strategy and plan in place is critical for ensuring that resources are allocated effectively and in line with business objectives.

        Five Best Practices for DataFinOps

        To set data teams on the path to FinOps success, Sanjeev Mohan and Chris Santiago shared five practical tips during the presentation:

        1. Discover: Begin by tracking and visualizing your costs at three different layers: the query level, user level, and workload level. You must first know where the money is going, to ensure that resources are properly aligned.
        2. Understand Cost Drivers: Dig into the top cost drivers—what’s actually incurring the most cost—at a granular level. Not only is this vital for accurately forecasting budgets, but you can prioritize workloads based on their strategic value, ensuring that you’re focusing on tasks that contribute meaningfully to your bottom line.
        3. Collaborate: FinOps is a team sport among Finance, Engineering, and the business. For many organizations, this is a cultural shift as much as anything. A good start is to leverage chargeback/showback models to hold teams accountable and encourage better management of resources.
        4. Build Early Warning Systems: Implement guardrails to nip cost overruns in the bud. It’s best to catch performance/cost problems early in development, whether as part of your CI/CD process or even something as simple as triggering an alert when a cost/performance threshold is violated (see the sketch after this list).
        5. Automate and Optimize: Continuously monitor, automate, and optimize key processes to minimize waste, save time, and achieve better results.
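
Here is the minimal guardrail sketch referenced in tip 4. It assumes a hypothetical month-to-date export (mtd_spend.csv, one row per job with a cost_usd column) and a placeholder monthly budget; a production guardrail would read live billing data and post to an alerting channel instead of printing:

import calendar
import datetime as dt

import pandas as pd

MONTHLY_BUDGET_USD = 50_000                    # placeholder budget for one team or workspace

mtd = pd.read_csv("mtd_spend.csv")             # hypothetical month-to-date export with a cost_usd column
today = dt.date.today()
days_in_month = calendar.monthrange(today.year, today.month)[1]

spent = mtd["cost_usd"].sum()
projected = spent / today.day * days_in_month  # naive straight-line projection to month end

if projected > MONTHLY_BUDGET_USD:
    print(f"ALERT: projected month-end spend ${projected:,.0f} "
          f"exceeds the ${MONTHLY_BUDGET_USD:,.0f} budget; review the top cost drivers now")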

        Audience Questions

        In addition to discussing FinOps best practices, the panelists fielded several questions from the audience. They addressed topics such as calculating unit costs, selecting impactful visualization tools, and employing cost reduction strategies tailored for their organizations. Throughout the session, the experts emphasized collaboration and partnership, showcasing Unravel’s commitment to empowering data teams to reach their full potential.

        The Unravel Advantage

        With its AI-powered Insights Engine built specifically for modern data platforms like Databricks, Snowflake, Google Cloud BigQuery, and Amazon EMR, Unravel provides data teams with the performance and cost optimization insights and automation they need to thrive in a competitive data landscape. As the recent case study of a leading health insurance provider demonstrated, Unravel helps organizations optimize code and infrastructure so they can run more data workloads without increasing their budget. The five FinOps best practices shared by Sanjeev and Chris offer actionable guidance for data teams looking to optimize costs, drive efficiency, and achieve their goals in an ever-changing data landscape.

        With Unravel as your trusted partner, you can approach FinOps with confidence, knowing that you have access to the expertise, tools, and support required to succeed.

        Next Steps

         

        The post Unlocking Success with FinOps: Top Insights from Expert Virtual Event appeared first on Unravel.

        Harnessing Google Cloud BigQuery for Speed and Scale: Data Observability, FinOps, and Beyond https://www.unraveldata.com/resources/harnessing-google-cloud-bigquery-for-speed-and-scale-data-observability-finops-and-beyond/ https://www.unraveldata.com/resources/harnessing-google-cloud-bigquery-for-speed-and-scale-data-observability-finops-and-beyond/#respond Thu, 10 Aug 2023 12:05:21 +0000 https://www.unraveldata.com/?p=13395

        Data is a powerful force that can generate business value with immense potential for businesses and organizations across industries. Leveraging data and analytics has become a critical factor for successful digital transformation that can accelerate revenue […]

        The post Harnessing Google Cloud BigQuery for Speed and Scale: Data Observability, FinOps, and Beyond appeared first on Unravel.


        Data is a powerful force with immense potential to generate business value for organizations across industries. Leveraging data and analytics has become a critical factor in successful digital transformation, accelerating revenue growth and AI innovation. Data and AI leaders enable business insights, product and service innovation, and game-changing technology that helps them outperform their peers in operational efficiency, revenue, and customer retention, among other key business metrics. Organizations that fail to harness the power of data risk falling behind their competitors.

        Despite all the benefits of data and AI, businesses face common challenges.

        Unanticipated cloud data spend

        Last year, over $16 billion was wasted in cloud spend. Data management is the largest and fastest-growing category of cloud spending, representing 39% of the typical cloud bill. Gartner noted that in 2022, 98% of the overall database management system (DBMS) market growth came from cloud-based database platforms. Cloud data costs are often the most difficult to predict due to fluctuating workloads. 82% of 157 data management professionals surveyed by Forrester cited difficulty predicting data-related cloud costs. On top of the fluctuations that are inherent with data workloads, a lack of visibility into cloud data spend makes it challenging to manage budgets effectively.

        • Fluctuating workloads: Google Cloud BigQuery data processing and storage costs are driven by the amount of data stored and analyzed. With varying workloads, it becomes challenging to accurately estimate the required data processing and storage costs. This unpredictability can result in budget overruns that affect 60% of infrastructure and operations (I&O) leaders.
        • Unexpected expenses: Streaming data, large amounts of unstructured and semi-structured data, and shared slot pool consumption can quickly drive up cloud data costs. These factors contribute to unforeseen spikes in usage that may catch organizations off guard, leading to unexpected expenses on their cloud bills.
        • Lack of visibility: Without granular visibility into cloud data analytics billing information, businesses have no way to accurately allocate costs down to the job or user level. This makes it difficult for them to track usage patterns and identify areas where budgets will be over- or under-spent, or where performance and cost optimization are needed.

        By implementing a FinOps approach, businesses can gain better control over their cloud data spend, optimize their budgets effectively, and avoid unpleasant surprises when it comes time to pay the bill.

        Budget and staff constraints limit new data workloads

        In 2023, CIOs are expecting an average increase of only 5.1% in their IT budgets, which is lower than the projected global inflation rate of 6.5%. Economic pressures, scarcity and high cost of talent, and ongoing supply challenges are creating urgency to achieve more value in less time.

        Limited budget and staffing resources can hinder the implementation of new data workloads. For example, “lack of resources/knowledge to scale” is the leading reason preventing IoT data deployments. Budget and staffing constraints pose real risks to launching profitable data and AI projects.

        Exponential data volume growth for AI

        The rapid growth of disruptive technologies such as generative AI has led to an exponential increase in cloud computing data volumes. However, managing and analyzing massive amounts of data poses significant challenges for organizations.

        Data is foundational for AI and much of it is unstructured, yet IDC found most unstructured data is not leveraged by organizations. A lack of production-ready data pipelines for diverse data sources was the second most cited reason (31%) for AI project failure.

        Data pipeline failures slow innovation

        Data pipelines are becoming increasingly complex, increasing the Mean Time To Repair (MTTR) for breaks and delays. Time is a critical factor that pulls skilled and valuable talent into unproductive firefighting. The more time they spend dealing with pipeline issues or failures, the greater the impact on productivity and new innovation.

        Manually testing and running release process checklists are heavy burdens for new and growing data engineering teams. With all of the manual toil, it is no surprise that over 70% of data projects in manufacturing stall at Proof of Concept (PoC) stage and do not see sustainable value realization.

        Downtime resulting from pipeline disruptions can have a significant negative impact on service level agreements (SLAs). It not only affects the efficiency of data processing, but also impacts downstream tasks like analysis and reporting. These slowdowns directly affect the ability of team members and business leaders to make timely decisions based on data insights.

        Conclusion

        Unravel 4.8.1 for BigQuery provides improved visibility to accelerate performance, boost query efficiency, allocate costs, and accurately predict spend. This launch aligns with the recent BigQuery pricing model change. With Unravel for BigQuery, customers can easily choose the best pricing plan to match their usage. Unravel helps you optimize your workloads and get more value from your cloud data investments.

        The post Harnessing Google Cloud BigQuery for Speed and Scale: Data Observability, FinOps, and Beyond appeared first on Unravel.

        Announcing Unravel 4.8.1: Maximize business value with Google Cloud BigQuery Editions pricing https://www.unraveldata.com/resources/announcing-unravel-481-maximize-business-value-with-google-cloud-bigquery-editions-pricing/ https://www.unraveldata.com/resources/announcing-unravel-481-maximize-business-value-with-google-cloud-bigquery-editions-pricing/#respond Thu, 10 Aug 2023 12:03:50 +0000 https://www.unraveldata.com/?p=13397

        Google recently introduced significant changes to its existing BigQuery pricing models, affecting both compute and storage. It announced the end of sale for flat-rate and flex slots for all BigQuery customers not currently in a contract, and increased the price of on-demand analysis by 25% across all regions, starting on July 5, 2023.

        Main Components of BigQuery Pricing

        Understanding the pricing structure of BigQuery is crucial to effectively manage expenses. There are two main components to BigQuery pricing:

        • Compute (analysis) pricing is the cost to process queries, including SQL queries, user-defined functions, scripts, and certain data manipulation language (DML) and data definition language (DDL) statements
        • Storage pricing is the cost to store data that you load into BigQuery. Storage options are logical (the default) or physical. If data storage is converted from logical to physical, customers cannot go back to logical storage.

        Selecting the appropriate edition and accurately forecasting data processing needs is essential to cloud data budget planning and maximizing the value derived from Google Cloud BigQuery.
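        To make the compute (analysis) component concrete, here is a rough sketch of how on-demand analysis cost relates to the bytes a query scans. The per-TiB rate below is an assumption for illustration only; check current BigQuery pricing for your edition and region.

```python
# Rough sketch: estimate on-demand analysis cost from bytes scanned.
ON_DEMAND_PRICE_PER_TIB = 6.25  # assumed rate (USD/TiB); varies by region and over time

def estimate_on_demand_cost(bytes_processed: int) -> float:
    """Return the estimated USD cost of a query that scans `bytes_processed` bytes."""
    return (bytes_processed / 2**40) * ON_DEMAND_PRICE_PER_TIB

# Example: a query scanning 1.5 TiB
print(f"${estimate_on_demand_cost(int(1.5 * 2**40)):.2f}")
```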

        Introducing Unravel 4.8.1 for BigQuery

        Unravel 4.8.1 for BigQuery includes AI-driven FinOps and performance optimization features and enhancements, empowering Google Cloud BigQuery customers to see and better manage their cloud data costs. Unravel helps users understand specific cost drivers, gain allocation insights, and optimize the performance and cost of SQL queries. The new Unravel features align with the FinOps phases:

        Inform

        • Compute and storage costs
        • Unit costs and trends for projects, users, and jobs

        Optimize

        • Reservation insights
        • SQL insights
        • Data and storage insights
        • Scheduling insights

        Operate

        • OpenSearch-based alerts on job duration and slot-ms
        • Alert customization: ability to create custom alerts

        Improving visibility, optimizing data performance, and automating spending guardrails can help organizations overcome resource limitations to get more out of their existing data environments.

        Visibility into BigQuery compute and storage spend

        Getting insights into your cloud data spending starts with understanding your cloud bill. With Unravel, BigQuery users can see their overall spend as well as spending trends for their selected time window, such as the past 30 days.

        Unravel for BigQuery cost dashboard

        The cost dashboard shows details and trends, including compute, storage, and services by pricing tier, project, job, and user.

        Unravel provides cost analysis, including the average cost of both compute and storage per project, job, and user over time. Compute spending can be further split between on-demand and reserved capacity pricing.

        Armed with this detail, BigQuery customers can better understand both infrastructure and pricing tier usage as well as efficiencies by query, user, department, and project. This granular visibility enables accurate, precise cost allocation, trend visualization, and forecasting.

        Unravel for BigQuery project costs dashboard

        This dashboard provides BigQuery project chargeback details and trends, including a breakdown by compute and storage tier.

        Unravel’s AI-driven cloud cost optimization for BigQuery draws on deep observability at the job, user, and code level to deliver cost optimization recommendations for slots and SQL queries, including slot provisioning, query duration, autoscaling efficiencies, and more.

        With Unravel, BigQuery users can speed cloud transformation initiatives by having real-time cost visibility, predictive spend forecasting, and performance insights for their workloads.

        AI-driven cloud cost optimization for BigQuery

        At the core of Unravel Data’s data observability and FinOps platform is the AI-powered Insights Engine. It is purpose-built for data platforms—including BigQuery—to understand all the unique aspects and capabilities of each modern data stack and the underlying infrastructure to optimize efficiency and performance.

        Unravel’s AI-powered Insights Engine continuously ingests and interprets millions of metadata inputs to provide real-time insights into application and system performance, along with recommendations to improve performance and efficiency for faster results and greater positive business impact for your existing cloud data spend.

        Unravel provides insights and recommendations to optimize BigQuery reservations.

        Using Unravel’s cost and performance optimization intelligence based on its deep observability at the job, user, and code level, users get recommendations such as:

        • Reservation sizing that achieves optimal cost efficiency and performance
        • SQL insights and anti-patterns to avoid
        • Scheduling insights for recurring jobs
        • Quota insights with respect to workload patterns
        • and more

        With Unravel, BigQuery customers can speed cloud transformation initiatives by having predictive cost and performance insights of existing workloads prior to moving them to the cloud.

        Visualization dashboards and unit costs

        Visualizing unit costs not only simplifies cost management but also enhances decision-making processes within your organization. With clear insights into spending patterns and resource utilization, you can make informed choices regarding optimization strategies or budget allocation.

        With Unravel, BigQuery customers can customize dashboards and alerts with easy-to-use widgets that enable at-a-glance and drill-down views of:

        • Spend
        • Performance
        • Unit economics
        Unravel insights into BigQuery user costs

        Unravel displays user count and cost trends by compute pricing tier.

        From a unit economics perspective, BigQuery customers can build dashboards to show unit costs in terms of average cost per user, per project, and per job.

        Take advantage of visualization dashboards in Unravel for BigQuery to effortlessly gain valuable insights into unit costs.
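        The unit-economics idea behind these dashboards can be sketched in a few lines. The sample rows and field names below are hypothetical; they simply show how average cost per user, project, or job is derived from raw billing data.

```python
# Minimal sketch: average cost per user / project from billing rows (hypothetical data).
from collections import defaultdict

billing_rows = [
    {"user": "ana", "project": "marketing", "job_id": "j1", "cost_usd": 4.20},
    {"user": "ana", "project": "marketing", "job_id": "j2", "cost_usd": 1.10},
    {"user": "raj", "project": "fraud",     "job_id": "j3", "cost_usd": 9.75},
]

def average_cost_per(key: str) -> dict:
    """Average cost per distinct value of `key` (e.g., 'user' or 'project')."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in billing_rows:
        totals[row[key]] += row["cost_usd"]
        counts[row[key]] += 1
    return {k: round(totals[k] / counts[k], 2) for k in totals}

print(average_cost_per("user"))     # average cost per user
print(average_cost_per("project"))  # average cost per project
```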

        Additional features included in this release

        Unravel 4.8.1 includes additional features, such as showback/chargeback reports, SQL insights, and anti-pattern detection. You can also compare two jobs side by side to spot any metrics that differ between the two runs, even if the queries are different.

        With this release, you also get:

        • Top-K projects, users, and jobs
        • Showback by compute and storage types, services, pricing plans, etc.
        • Chargeback by projects and users
        • Out-of-the-box and custom alerts and dashboards
        • Project/Job views of insights and details
        • Side-by-side job comparisons
        • Data KPIs, metrics, and insights such as size and number of tables and partitions, access by jobs, hot/warm/cold tables

        Use case scenarios

        Unravel for BigQuery provides a single source of truth to improve collaboration across functional teams and accelerates workflows for common use cases. Below are just a few examples of how Unravel helps BigQuery users for specific situations:

        Role: FinOps Practitioner
        Scenario: Understand what we pay for BigQuery down to the user/app level in real time, and accurately forecast future spend with confidence.
        Unravel benefits: Granular visibility at the project, job, and user level enables FinOps practitioners to perform cost allocation and to estimate annual cloud data application costs, cost drivers, break-even, and ROI analysis.

        Role: FinOps Practitioner / Engineering / Operations
        Scenario: Identify the most impactful recommendations to optimize overall cost and performance.
        Unravel benefits: AI-powered performance and cost optimization recommendations enable FinOps and data teams to rapidly upskill team members, implement cost efficiency SLAs, and optimize BigQuery pricing tier usage to maximize the company’s cloud data ROI.

        Role: Engineering Lead / Product Owner
        Scenario: Identify the most impactful recommendations to optimize the cost and performance of a project.
        Unravel benefits: AI-driven insights and recommendations enable product and data teams to improve slot utilization, boost SQL query performance, and leverage table partitioning and column clustering to achieve cost efficiency SLAs and launch more data jobs within the same project budget.

        Role: Engineering / Operations
        Scenario: Live monitoring with alerts.
        Unravel benefits: Live monitoring with alerts speeds MTTR and prevents outages before they happen.

        Role: Data Engineer
        Scenario: Debugging a job and comparing jobs.
        Unravel benefits: Automatic troubleshooting guides data teams directly to the source of job failures, down to the line of code or SQL query, along with AI recommendations to fix it and prevent future issues.

        Role: Data Engineer
        Scenario: Identify expensive, inefficient, or failed jobs.
        Unravel benefits: Proactively improve cost efficiency, performance, and reliability before deploying jobs into production. Compare two jobs side by side to find any metrics that differ between the two runs, even if the queries are different.

        Get Started with Unravel for BigQuery

        Learn more about Unravel for BigQuery by reviewing the docs and creating your own free account.

        The post Announcing Unravel 4.8.1: Maximize business value with Google Cloud BigQuery Editions pricing appeared first on Unravel.

        Unlocking Cost Optimization: Insights from FinOps Camp Episode #1 https://www.unraveldata.com/resources/unlocking-cost-optimization-insights-from-finops-camp-episode-1/ https://www.unraveldata.com/resources/unlocking-cost-optimization-insights-from-finops-camp-episode-1/#respond Fri, 04 Aug 2023 17:30:57 +0000 https://www.unraveldata.com/?p=13391

        With the dramatic increase in the volume, velocity, and variety of data analytics projects, understanding costs and optimizing expenditure is crucial for success. Data teams often face challenges in effectively managing costs, accurately attributing them, and finding ways to enhance cost efficiency. Fortunately, Unravel Data, with its comprehensive platform, addresses these pain points and empowers data teams to unlock their full cost optimization potential—enabling them to run more cloud data workloads without increasing their budget. In this blog, we will delve into the key takeaways from the recent FinOps Camp Episode #1, focusing on the importance of understanding costs, attributing costs, and achieving cost efficiency with Unravel.

        Understanding Costs: The Foundation of Cost Optimization

        One of the main challenges faced by data teams is gaining deep insights into cost drivers. Traditional observability platforms like Cloudability, or vendor-specific tools like AWS Cost Explorer and Azure Cost Dashboard, often provide limited visibility, focusing solely on infrastructure costs. This lack of granular insights hinders data teams from making informed decisions about resource allocation and identifying areas for cost optimization. Unravel, however, offers a comprehensive dashboard that provides a holistic view of cost breakdowns. This enables data teams to understand exactly where costs are being incurred, facilitating smarter decision-making around resource allocation and optimization efforts.

        Accurate Cost Attribution: The Key to Fairness and Transparency

        Accurately attributing costs is another hurdle for data teams. Shared services, such as MySQL or Kafka, require meticulous cost allocation to ensure fairness and transparency within an organization. Unravel understands this challenge and provides a solution that simplifies cost allocation. By seamlessly attributing costs to individual users, jobs, teams/departments/lines of business, projects, applications and pipelines, clusters, etc., Unravel enables data teams to break down the proportional costs of shared services. This not only enhances cost tracking but also promotes accountability and fairness within the organization.

        Unlocking Cost Efficiency: Recommendations for Optimization

        Cost efficiency is the goal of every data team striving for excellence. In the virtual event, Unravel highlighted its powerful feature of identifying inefficiencies within data jobs and providing actionable recommendations for optimization. By analyzing tasks with its AI-powered Recommendation Engine designed specifically for platforms like Databricks, Snowflake, BigQuery, and Amazon EMR, Unravel can pinpoint resources that have been oversized and lines of code that contribute to performance bottlenecks. Armed with these insights, data teams can collaborate with developers to address these inefficiencies effectively, resulting in improved resource utilization, reduced costs, and accelerated application performance. Unravel’s proactive optimization recommendations enable data teams to achieve peak cost efficiency and deliver exceptional results.

        Operationalizing Unravel: Going From Reactive to Proactive

        Unravel’s platform lets data teams go beyond responding to cost and performance inefficiencies after the fact to getting ahead of cost issues beforehand. Unravel empowers ongoing cost governance by enabling team leaders to set up automated guardrails that trigger an alert (or even autonomous “circuit breaker” actions) whenever a predefined threshold is violated—be it projected budget overrun, jobs exceeding a certain size, runtime, cost, etc. Essentially, these automated guardrails take the granular cost allocation information and the AI-powered recommendations and apply them to context-specific workloads to track real-time spending against budgets, prevent runaway jobs and rogue users, identify optimizations to meet SLAs, and nip cost overruns in the bud.

        Conclusion: Unleash Your Cost Optimization Potential with Unravel

        Understanding costs, attributing costs accurately, and achieving cost efficiency are critical components of any successful data analytics strategy. In the FinOps Camp Episode #1, Unravel showcased its ability to address these concerns and empower data teams to optimize costs effectively. By providing in-depth insights, seamless cost attribution, and proactive optimization recommendations, Unravel enables data teams to understand at a deep level exactly where their cloud data spend is going, predict spending with data-driven accuracy, and optimize data applications/pipelines so organizations can run more high-value workloads within existing budgets. Unravel unlocks your cost optimization potential and maximizes the value of your data analytics efforts. Together, we can transform cost optimization from a challenge into a competitive advantage.

         

        Next Steps:

        The post Unlocking Cost Optimization: Insights from FinOps Camp Episode #1 appeared first on Unravel.

        Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality https://www.unraveldata.com/resources/unleashing-the-power-of-data-how-data-engineers-can-harness-aiml-to-achieve-essential-data-quality/ https://www.unraveldata.com/resources/unleashing-the-power-of-data-how-data-engineers-can-harness-aiml-to-achieve-essential-data-quality/#respond Mon, 24 Jul 2023 14:09:01 +0000 https://www.unraveldata.com/?p=12932

        Introduction

        In the era of Big Data, the importance of data quality cannot be overstated. The vast volumes of information generated every second hold immense potential for organizations across industries. However, this potential can only be realized when the underlying data is accurate, reliable, and consistent. Data quality serves as the bedrock upon which crucial business decisions are made, insights are derived, and strategies are formulated. It empowers organizations to gain a comprehensive understanding of their operations, customers, and market trends. High-quality data ensures that analytics, machine learning, and artificial intelligence algorithms produce meaningful and actionable outcomes. From detecting patterns and predicting trends to identifying opportunities and mitigating risks, data quality is the driving force behind data-driven success. It instills confidence in decision-makers, fosters innovation, and unlocks the full potential of Big Data, enabling organizations to thrive in today’s data-driven world.

        Overview

        In this seven-part blog series we will explore using ML/AI for data quality. Machine learning and artificial intelligence can be instrumental in improving the quality of data. Machine learning models like logistic regression, decision trees, random forests, gradient boosting machines, and neural networks predict categories of data based on past examples, correcting misclassifications. Linear regression, polynomial regression, support vector regression, and neural networks predict numeric values, filling in missing entries. Clustering techniques like K-means, hierarchical clustering, and DBSCAN identify duplicates or near-duplicates. Models such as Isolation Forest, Local Outlier Factor, and Auto-encoders detect outliers and anomalies. To handle missing data, k-Nearest Neighbors and Expectation-Maximization predict and fill in the gaps. NLP models like BERT, GPT, and RoBERTa process and analyze text data, ensuring quality through tasks like entity recognition and sentiment analysis. CNNs fix errors in image data, while RNNs and Transformer models handle sequence data. The key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors present in the data. We will focus on using the following models to apply critical data quality to our lakehouse.

        Machine learning and artificial intelligence can be instrumental in improving the quality of data. Here are a few models and techniques that can be used for various use cases:

        • Classification Algorithms: Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled.
        • Regression Algorithms: Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set.
        • Clustering Algorithms: Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data.
        • Anomaly Detection Algorithms: Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.
        • Data Imputation Techniques: Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data.
        • Natural Language Processing (NLP): NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, text classification, which can be helpful in ensuring the quality of text data.
        • Deep Learning Techniques: Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

        Remember that the key to ensuring data quality with these models is a well-labeled and accurate training set. Without good training data, the models may learn to reproduce the errors present in the data.

        In this first blog of seven, we will focus on Classification Algorithms. The code examples provided below can be found in this GitHub location. Below is a simple example of using a classification algorithm in a Databricks Notebook to address a data quality issue.

        Data Engineers Leveraging AI/ML for Data Quality

        Machine learning (ML) and artificial intelligence (AI) play a crucial role in the field of data engineering. Data engineers leverage ML and AI techniques to process, analyze, and extract valuable insights from large and complex datasets. Overall, ML and AI provide data engineers with powerful tools and techniques to extract insights, improve data quality, automate processes, and enable data-driven decision-making. They enhance the efficiency and effectiveness of data engineering workflows, enabling organizations to unlock the full potential of their data assets.

        AI/ML can help with numerous data quality use cases. Models such as logistic regression, decision trees, random forests, gradient boosting machines, or neural networks can be used to predict categories of data based on past examples. This can be especially useful in cases where data entries have been misclassified or improperly labeled. Similarly, algorithms like linear regression, polynomial regression, or more complex techniques like support vector regression or neural networks can predict numeric values in the data set. This can be beneficial for filling in missing numeric values in a data set. Techniques like K-means clustering, hierarchical clustering, or DBSCAN can be used to identify similar entries in a data set. This can help identify duplicates or near-duplicates in the data. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders can be used to detect outliers or anomalies in the data. This can be beneficial in identifying and handling outliers or errors in the data set.

        Missing data is a common problem in many data sets. Machine learning techniques, such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM), can be used to predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa can be used to process and analyze text data. These models can handle tasks such as entity recognition, sentiment analysis, text classification, which can be helpful in ensuring the quality of text data. Convolutional Neural Networks (CNNs) can be used for image data to identify and correct errors, while Recurrent Neural Networks (RNNs) or Transformer models can be useful for sequence data.

        Suppose we have a dataset with some missing categorical values. We can use logistic regression to fill in the missing values based on the other features. For the purpose of this example, let’s assume we have a dataset with ‘age’, ‘income’, and ‘job_type’ features. Suppose ‘job_type’ is a categorical variable with some missing entries.

        Classification Algorithms Using Logistic Regression to Fix the Data

        Logistic regression is primarily used for binary classification problems, where the goal is to predict a binary outcome variable based on one or more input variables. However, it is not typically used directly for assessing data quality. Data quality refers to the accuracy, completeness, consistency, and reliability of data. 

        That being said, logistic regression can be used indirectly as a tool for identifying potential data quality issues. Here are some examples of how it fits into data quality work. Logistic regression can help define the specific data quality issue to be addressed; for example, you may be interested in identifying data records with missing values or outliers. It can also support feature engineering, helping identify relevant features (variables) that may indicate the presence of the data quality issue. Data preparation is another common part of the workflow: here ML helps prepare the dataset by cleaning, transforming, and normalizing the data as necessary, which involves handling missing values, outliers, and any other data preprocessing tasks.

        It’s important to note that logistic regression alone generally cannot fix data quality problems, but it can help identify potential issues by predicting their presence based on the available features. Addressing the identified data quality issues usually requires domain knowledge, data cleansing techniques, and appropriate data management processes. In the simplified example below we see a problem and then use logistic regression to predict what the missing values should be.

        Step 1

        To create a table with the columns “age”, “income”, and “job_type” in a SQL database, you can use the following SQL statement:

        Step 1: Create Table
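        The code for each step appears as a screenshot in the original post (and in the linked GitHub notebook). The snippets that follow for Steps 1–9 are minimal sketches of what each step might look like, not the author’s exact notebook code. For Step 1, run from a Databricks notebook where `spark` is the pre-provided SparkSession:

```python
# Step 1 (sketch): create the table with age, income, and job_type columns.
# `spark` is the SparkSession that Databricks notebooks provide by default.
spark.sql("""
    CREATE TABLE IF NOT EXISTS employees (
        age INT,
        income DOUBLE,
        job_type STRING
    )
""")
```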

        Step 2

        Load data to table. Notice that three records are missing job_type. This will be the column that we will use ML to predict. We load a very small set of data for this example. This same technique can be used against billions or trillions of rows. More data will almost always yield better results.

        Step 2: Load Data to Table
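        Continuing the sketch (table name and sample values are assumptions), Step 2 might look like this, with three rows deliberately missing job_type:

```python
# Step 2 (sketch): load a small sample; three rows have a NULL job_type.
spark.sql("""
    INSERT INTO employees VALUES
        (25, 40000, 'clerk'),
        (38, 90000, 'manager'),
        (29, 52000, 'analyst'),
        (45, 110000, 'manager'),
        (31, 48000, NULL),
        (52, 95000, NULL),
        (23, 39000, NULL)
""")
```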

        Step 3

        Load the data into a data frame. If you need to create a unique index for your data frame, please refer to this article.

        Step 3: Load Data to Data Frame
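        A sketch of loading the table into a DataFrame:

```python
# Step 3 (sketch): load the table into a Spark DataFrame.
df = spark.table("employees")
df.show()
```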

        Step 4

        In the context of PySpark DataFrame operations, filter() is a transformation function used to filter the rows in a DataFrame based on a condition.

        Step 4: Filter the Rows in a Data Frame Based on a Condition

        df is the original DataFrame. We’re creating two new DataFrames, df_known and df_unknown, from this original DataFrame.

        • df_known = df.filter(df.job_type.isNotNull()) is creating a new DataFrame that only contains rows where the job_type is not null (i.e., rows where the job_type is known).
        • df_unknown = df.filter(df.job_type.isNull()) is creating a new DataFrame that only contains rows where the job_type is null (i.e., rows where the job_type is unknown).

        By separating the known and unknown job_type rows into two separate DataFrames, we can perform different operations on each. For instance, we use df_known to train the machine learning model, and then use that model to predict the job_type for the df_unknown DataFrame.
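        A sketch of the two filters described above:

```python
# Step 4 (sketch): split rows with a known job_type from rows where it is missing.
df_known = df.filter(df.job_type.isNotNull())   # training data
df_unknown = df.filter(df.job_type.isNull())    # rows whose job_type we will predict
```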

        Step 5

        In this step we will vectorize the features. Vectorizing the features is a crucial pre-processing step in machine learning and AI. In the context of machine learning, vectorization refers to the process of converting raw data into a format that can be understood and processed by a machine learning algorithm. A vector is essentially an ordered list of values, which in machine learning represent the ‘features’ or attributes of an observation.

        Step 5: Vectorize Features
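        A sketch of the vectorization step using PySpark’s VectorAssembler, assuming age and income are the input features:

```python
# Step 5 (sketch): assemble the numeric columns into a single feature vector.
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
df_known_vec = assembler.transform(df_known)
df_unknown_vec = assembler.transform(df_unknown)
```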

        Step 6

        In this step we will convert categorical labels to indices. Converting categorical labels to indices is a common preprocessing step in ML and AI when dealing with categorical data. Categorical data represents information that is divided into distinct categories or classes, such as “red,” “blue,” and “green” for colors or “dog,” “cat,” and “bird” for animal types. Machine learning algorithms typically require numerical input, so converting categorical labels to indices allows these algorithms to process the data effectively.

        Converting categorical labels to indices is important for ML and AI algorithms because it allows them to interpret and process the categorical data as numerical inputs. This conversion enables the algorithms to perform computations, calculate distances, and make predictions based on the numerical representations of the categories. It is worth noting that label encoding does not imply any inherent order or numerical relationship between the categories; it simply provides a numerical representation that algorithms can work with.

        It’s also worth mentioning that in some cases, label encoding may not be sufficient, especially when the categorical data has a large number of unique categories or when there is no inherent ordinal relationship between the categories. In such cases, additional techniques like one-hot encoding or feature hashing may be used to represent categorical data effectively for ML and AI algorithms.

        Step 6: Converting Categorical Labels to Indices
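        A sketch of label indexing with StringIndexer, fitted on the rows where job_type is known:

```python
# Step 6 (sketch): convert the categorical job_type label into a numeric index.
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="job_type", outputCol="label")
indexer_model = indexer.fit(df_known_vec)
df_known_indexed = indexer_model.transform(df_known_vec)
```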

        Step 7

        In this step we will train the model. Training a logistic regression model involves the process of estimating the parameters of the model based on a given dataset. The goal is to find the best-fitting line or decision boundary that separates the different classes in the data.

        The process of training a logistic regression model aims to find the optimal parameters that minimize the cost function and provide the best separation between classes in the given dataset. With the trained model, it becomes possible to make predictions on new data points and classify them into the appropriate class based on their features.

        Step 7: Train the Model
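        A sketch of training the logistic regression model on the labeled, vectorized rows:

```python
# Step 7 (sketch): fit a logistic regression model on the known rows.
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(df_known_indexed)
```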

        Step 8

        Predict the missing value in this case job_type. Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. It is used to predict the probability of an instance belonging to a particular class or category.

        Logistic regression is widely used in various applications such as sentiment analysis, fraud detection, spam filtering, and medical diagnosis. It provides a probabilistic interpretation and flexibility in handling both numerical and categorical independent variables, making it a popular choice for classification tasks.

        Step 8: Predict the Missing Value in This Case job_type
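        A sketch of predicting the missing job_type indices for the unknown rows:

```python
# Step 8 (sketch): predict a label index for every row with a missing job_type.
predictions = lr_model.transform(df_unknown_vec)  # adds a numeric "prediction" column
```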

        Step 9

        Convert predicted indices back to original labels. Converting predicted indices back to original labels in AI and ML involves the reverse process of encoding categorical labels into numerical indices. When working with classification tasks, machine learning models often output predicted class indices instead of the original categorical labels.

        It’s important to note that this reverse mapping process assumes a one-to-one mapping between the indices and the original labels. In cases where the original labels are not unique or there is a more complex relationship between the indices and the labels, additional handling may be required to ensure accurate conversion.

        Step 9: Convert Predicted Indices Back to Original Labels
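        A sketch of mapping the predicted indices back to the original job_type labels with IndexToString, reusing the labels learned by the indexer in Step 6:

```python
# Step 9 (sketch): convert predicted indices back to job_type strings.
from pyspark.ml.feature import IndexToString

converter = IndexToString(inputCol="prediction",
                          outputCol="predicted_job_type",
                          labels=indexer_model.labels)
df_filled = converter.transform(predictions)
df_filled.select("age", "income", "predicted_job_type").show()
```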

        Recap

        Machine learning (ML) and artificial intelligence (AI) have a significant impact on data engineering, enabling the processing, analysis, and extraction of insights from complex datasets. ML and AI empower data engineers with tools to improve data quality, automate processes, and facilitate data-driven decision-making. Logistic regression, decision trees, random forests, gradient boosting machines, and neural networks can predict categories based on past examples, aiding in correcting misclassified or improperly labeled data. Algorithms like linear regression, polynomial regression, support vector regression, or neural networks predict numeric values, addressing missing numeric entries. Clustering techniques like K-means, hierarchical clustering, or DBSCAN identify duplicates or near-duplicates. Models like Isolation Forest, Local Outlier Factor (LOF), or Auto-encoders detect outliers or anomalies, handling errors in the data. Machine learning techniques such as k-Nearest Neighbors (KNN) or Expectation-Maximization (EM) predict and fill in missing data. NLP models like BERT, GPT, or RoBERTa process text data for tasks like entity recognition and sentiment analysis. CNNs correct errors in image data, while RNNs or Transformer models handle sequence data.

        Data engineers can use Databricks Notebook, for example, to address data quality issues by training a logistic regression model to predict missing categorical values. This involves loading and preparing the data, separating known and unknown instances, vectorizing features, converting categorical labels to indices, training the model, predicting missing values, and converting predicted indices back to original labels.

        Ready to unlock the full potential of your data? Embrace the power of machine learning and artificial intelligence in data engineering. Improve data quality, automate processes, and make data-driven decisions with confidence. From logistic regression to neural networks, leverage powerful algorithms to predict categories, address missing values, detect anomalies, and more. Utilize clustering techniques to identify duplicates and near-duplicates. Process text, correct image errors, and handle sequential data effortlessly. Try out tools like Databricks Notebook to train models and resolve data quality issues. Empower your data engineering journey and transform your organization’s data into valuable insights. Take the leap into the world of ML and AI in data engineering today! You can start with this Notebook.

        The post Unleashing the Power of Data: How Data Engineers Can Harness AI/ML to Achieve Essential Data Quality appeared first on Unravel.

        DBS Bank Uplevels Individual Engineers at Scale https://www.unraveldata.com/resources/dbs-bank-uplevels-individual-engineers-at-scale/ https://www.unraveldata.com/resources/dbs-bank-uplevels-individual-engineers-at-scale/#respond Tue, 11 Jul 2023 12:56:46 +0000 https://www.unraveldata.com/?p=12909

        DBS Bank leverages Unravel to identify inefficiencies across 10,000s of data applications/pipelines and guide individual developers and engineers on how, where, and what to improve—no matter what technology or platform they’re using.  

        DBS Bank

        DBS Bank is one of the largest financial services institutions in Asia—the biggest in Southeast Asia—with a presence in 19 markets. Headquartered in Singapore, DBS is recognized globally for its technological innovation and leadership, having been named World’s Best Bank by both Global Finance and Euromoney, Global Bank of the Year by The Banker, World’s Best Digital Bank by Euromoney (three times), and one of the world’s top 10 business transformations of the decade by Harvard Business Review. In addition, DBS has been given Global Finance’s Safest Bank in Asia award for 14 consecutive years. 

        DBS is, in its words, “redefining the future of banking as a technology company.” Says the bank’s Executive Director of Automation, Infrastructure for Big Data, AI, and Analytics Luis Carlos Cruz Huertas, “Technology is in our DNA, from the top down. Six years ago, when we started our data journey, the philosophy was that we’re going to be a data-driven organization, 100%. Everything we decide is through data.” 

        Realizing innovation and efficiency at scale and speed

        DBS backs up its commitment to being a leading data-forward company. Almost 40% of the bank’s employees—some 23,000—are native developers. The volume, variety, and velocity of DBS’ data products, initiatives, and analytics fuel an incredibly complex environment. The bank has developed more than 200 data-use cases to deliver business value: detecting fraud, providing value to customers via their DBS banking app, and providing hyper personalized “nudges” that guide customers in making more informed banking and investment decisions.

        As with all financial services enterprises, the DBS data estate is a multi-everything mélange: on-prem, hybrid, and several cloud providers; all the principal platforms; open source, commercial, and proprietary solutions; and a wide variety of technologies, old and new. Adding to this complexity are the various country-specific requirements and restrictions throughout the bank’s markets. As Luis says, “There are many ways to do the cloud, but there’s also the banking way to do cloud. Finance is ring-fenced by regulation all over the world. To do cloud in Asia is quite complex. We don’t [always] get to choose the cloud, sometimes the cloud chooses us.” (For example, India now requires all data to run in India. Google was the only cloud available in Indonesia at the time. Data cannot leave China; there are no hyperscalers there other than Chinese.)

        Watch Luis’ full talk, “Leading Cultural Change for Data Efficiency, Agility and Cost Optimization,” with Unravel CEO and Co-founder Kunal Agarwal
        See the conversation here

        The pace of today’s innovation at a data-forward bank like DBS makes ensuring efficient, reliable performance an ever-growing challenge.

        Today DBS runs more than 40,000 data jobs, with tens of thousands of ML models. “And that actually keeps growing,” says Luis. “We have exponentially grown—almost 100X what we used to run five years ago. There’s a lot of innovation we’re bringing, month to month.”

        As Luis points out, “Sometimes innovation is a disruptor of stability.” With thousands of developers all using whatever technologies are best suited for their particular workload, DBS has to contend with the triple-headed issues of exponentially increasing data volumes (think: OpenAI); increasingly more technologies, platforms, and cloud providers; and the ever-present challenge of having enough skilled people with enough time to ensure that everything is running reliably and efficiently on time, every time—at population scale.

        DBS empowers self-service optimization

        One of DBS’ lessons learned in its modern data journey is that, as Luis says, “a lot of the time the efficiencies relied on the engineering team to fix the problems. We didn’t really have a viable mechanism to put into the [business] users’ hands for them to analyze and understand their code, their deficiencies, and how to improve.”

        That’s where Unravel comes in. The Unravel platform harnesses deep full-stack visibility, contextual awareness, AI-powered intelligence, and automation to not only quickly show what’s going on and why, but provide crisp, prescriptive recommendations on exactly how to make things better and then keep them that way proactively.

        “In Unravel, [we] have a source to identify areas of improvement for our current operational jobs. By leveraging Unravel, we’ve been able to put an analytical tool in the hands of business users to evaluate themselves before they actually do a code release. [They now get] a lot of additional insights they can use to learn, and it puts a lot more responsibility into how they’re doing their job.”

        AI recommendation to pinpoint code problems

        To learn more about how Unravel puts self-service performance and cost optimization capabilities into the hands of individual engineers, see the Unravel Platform Overview.

        Marrying data engineering and data science

        DBS is perpetually innovating with data—40% of the bank’s workforce are data science developers—yet the scale and speed of DBS’ data operations bring the data engineering side of things to the forefront. As Luis points out, there are thousands of developers committing code but only a handful of data engineers to make sure everything runs reliably and efficiently.

        While few organizations have such a high percentage of developers, virtually everyone has the same lopsided developer : engineer ratio. Tackling this reality is at the heart of DataOps and means new roles, where data science and data engineering meet. Unravel has helped facilitate this new intersection of roles and responsibilities by making it easier to pinpoint code inefficiencies and provide insights into what, where, and how to optimize code — without having to “throw it over the fence” to already overburdened engineers.

        Luis discusses how DBS addresses the issue: “In hardware, you have something called system performance engineers. They’re dedicated to optimizing and tuning how a processor operates, until it’s pristine. I said, ‘Why don’t we apply that concept and bring it over to data?’ What we need is a very good data scientist who wants to learn data engineering. And a data engineer who wants to become a data scientist. Because I need to connect and marry both worlds. It’s the only way a person can actually say, ‘Oh, this is why pandas doesn’t work in Spark.’ For a data engineer, it’s very clear. For a data scientist, it’s not. Because the first thing they learned in Python is pandas. And they love it. And Spark hates it. It’s a perfect divorce. So we need to teach data scientists how to break their pandas into Spark. Our system performance engineers became masters at doing this. They use Unravel to highlight all the data points that we need to attack in the code, so let’s just go see it. 

        “We call data jobs ‘mutants,’ because they mutate from the hands of one data scientist to another. You can literally see the differences on how they write the code. Some of it might be perfect, with Markdown files, explainability, and then there’s an entire chunk of code that doesn’t have that. So we tune and optimize the code. Unravel helps us facilitate the journey on deriving what to do first, what to attack first, from the optimization perspective.”
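        As a generic illustration of the pandas-to-Spark rewrite Luis describes (not DBS code; table and column names are hypothetical), the same aggregation written both ways might look like this:

```python
# Hypothetical example: the same groupby/sum in pandas and in PySpark.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas version — bound to a single machine's memory
pdf = pd.DataFrame({"branch": ["SG", "SG", "ID"], "amount": [100.0, 250.0, 80.0]})
pandas_result = pdf.groupby("branch")["amount"].sum()

# PySpark version — the equivalent logic, executed across the cluster
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
spark_result = sdf.groupBy("branch").agg(F.sum("amount").alias("amount"))
spark_result.show()
```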

        DBS usually builds their own products—why did they buy Unravel?

        DBS has made the bold decision to develop its own proprietary technology stack as wrappers for its governance, control plane, data plane, etc. “We ring-fence all compute, storage, services—every resource the cloud provides. The reason we create our own products is that we’ve been let down way too many times. We don’t really control what happens in the market. It’s the world we live in. We accept that. Our data protection mechanism used to be BlueTalon, which was then acquired by Microsoft. And Microsoft decided to dispose of [BlueTalon] and use their own product. Our entire security framework depended on BlueTalon. . . . We decided to just build our own [framework].

        “In a way DBS protects itself from being forced to just do what the providers want us to do. And there’s a resistance from us—a big resistance—to oblige that.” Luis uses a cooking analogy to describe the DBS approach. At one extreme is home cooking, where you go buy all the ingredients and make meals yourself. At the other end of the spectrum is going out to restaurants, where you choose items from the menu and you get a meal the way they prepare it. The DBS data platform is more like a cooking studio—Luis  provides the kitchen, developers pick the ingredients they want, and then cook the recipe with their own particular flavor. “That’s what we provide to our lines of business,” says Luis. “The good thing is, we can plug in anything new at any given point in time. The downside is that you need very, very good technical people. How do we sustain the pace of rebuilding new products, and keep up with the open source side, which moves astronomically fast—we’re very connected to the open source mission, close to 50% of our developers contribute to the open source world—and at the same time keep our [internal and external] customers happy?”

        How does DBS empower self-service engineering with Unravel?
        Find out here

        Luis explains the build vs. buy decision boils down to whether the product “drives the core of what we do, whether it’s the ‘spinal cord’ of what we do. We’ve had Unravel for a long, long time. It’s one of those products where we said, ‘Should we do it [i.e., build it ourselves]? Or should we find something on the market?’ With Unravel, it was an easy choice to say, ‘If they can do what they claim they can do, bring ’em in.’ We don’t need to develop our own [data observability solution]. Unravel has demonstrated value in [eliminating] hours of toil—nondeterministic tasks performed by engineers because of repetitive incidents. So, long story short: that [Unravel] is still with us demonstrates that there is value from the lines of business.”

        Luis emphasizes his data engineering team are not the users of Unravel. “The users are the business units that are creating their jobs. We just drive adoption to make sure everyone in the bank uses it. Because ultimately our data engineers cannot ‘overrun’ a troop of 3,000 data scientists. So we gave [Unravel] to them, they can use it, they can optimize, and control their own fate.”

        DBS realizes high ROI from Unravel

        Unravel’s granular data observability, automation, and AI enable DBS to control its cloud data usage at the speed and scale demanded by today’s FinServ environment. “Unravel has saved potentially several $100,000s in [unnecessary] compute capacity,” says Luis. “This is a very big transformational year for us. We’re moving more and more into some sort of data mesh and to a more containerized world. Unravel is part of that journey and we expect that we’ll get the same level of return we’ve gotten so far—or even better.” 

         

        The post DBS Bank Uplevels Individual Engineers at Scale appeared first on Unravel.

        The Modern Data Ecosystem: Choose the Right Instance https://www.unraveldata.com/resources/the-modern-data-ecosystem-choose-the-right-instance/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-choose-the-right-instance/#respond Thu, 01 Jun 2023 01:33:14 +0000 https://www.unraveldata.com/?p=12707

        Overview: The Right Instance Type

        This is the first blog in a five-blog series. For an overview of this blog series please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of Time.

        Choosing the right instance type in the cloud is an important decision that can have a significant impact on the performance, cost, and scalability of your applications. Here are some steps to help you choose the right instance types:

        1. Determine your application requirements. Start by identifying the resource requirements of your application, such as CPU, memory, storage, and network bandwidth. You should also consider any special hardware or software requirements, such as GPUs for machine learning workloads.
        2. Evaluate the available instance types. Each cloud provider offers a range of instance types with varying amounts of resources, performance characteristics, and pricing. Evaluate the available instance types and their specifications to determine which ones best match your application requirements.
        3. Consider the cost. The cost of different instance types can vary significantly, so it’s important to consider the cost implications of your choices. You should consider not only the hourly rate but also the overall cost over time, taking into account any discounts or usage commitments.
        4. Optimize for scalability. As your application grows, you may need to scale up or out by adding more instances. Choose instance types that can easily scale horizontally (i.e., adding more instances) or vertically (i.e., upgrading to a larger instance type).
        5. Test and optimize. Once you have chosen your instance types, test your application on them to ensure that it meets your performance requirements. Monitor the performance of your application and optimize your instance types as necessary to achieve the best balance of performance and cost.

        Choosing the right instance types in the cloud requires careful consideration of your application requirements, available instance types, cost, scalability, and performance. By following these steps, you can make informed decisions and optimize your cloud infrastructure to meet your business needs.

        Determine Application Requirements

        Determining your application’s CPU, memory, storage, and network requirements is a crucial step in choosing the right instance types in the cloud. Here are some steps to help you determine these requirements:

        1. CPU requirements: Start by identifying the CPU-intensive tasks in your application, such as video encoding, machine learning, or complex calculations. Determine the number of cores and clock speed required for these tasks, as well as any requirements for CPU affinity or hyperthreading.
        2. Memory requirements: Identify the memory-intensive tasks in your application, such as caching, database operations, or in-memory processing. Determine the amount of RAM required for these tasks, as well as any requirements for memory bandwidth or latency.
        3. Storage requirements: Determine the amount of storage required for your application data, as well as any requirements for storage performance (e.g., IOPS, throughput) or durability (e.g., replication, backup).
        4. Network requirements: Identify the network-intensive tasks in your application, such as data transfer, web traffic, or distributed computing. Determine the required network bandwidth, latency, and throughput, as well as any requirements for network security (e.g., VPN, encryption).

        To determine these requirements, you can use various monitoring and profiling tools to analyze your application’s resource usage, such as CPU and memory utilization, network traffic, and storage I/O. You can also use benchmarks and performance tests to simulate different workloads and measure the performance of different instance types.
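
To ground this in something runnable, the sketch below is a minimal profiling pass using the open-source psutil package. It is only a starting point: the one-minute sampling window is an assumption, and you would run it while the application is under representative load.

```python
# Minimal resource-profiling sketch using psutil (pip install psutil).
# The 60-second window is an arbitrary assumption; profile under real load.
import psutil

samples = []
for _ in range(60):                                   # ~1 sample per second
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    samples.append({
        "cpu_pct": psutil.cpu_percent(interval=1),    # CPU utilization %
        "mem_pct": psutil.virtual_memory().percent,   # RAM utilization %
        "disk_bytes": disk.read_bytes + disk.write_bytes,   # cumulative since boot
        "net_bytes": net.bytes_sent + net.bytes_recv,        # cumulative since boot
    })

peak_cpu = max(s["cpu_pct"] for s in samples)
peak_mem = max(s["mem_pct"] for s in samples)
print(f"Peak CPU: {peak_cpu:.0f}%  Peak memory: {peak_mem:.0f}%")
```

Peak and sustained utilization numbers like these give you a floor for the vCPU and memory sizes worth benchmarking.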

        By understanding your application’s resource requirements, you can choose the right instance types in the cloud that provide the necessary CPU, memory, storage, and network resources to meet your application’s performance and scalability needs.

        Evaluate the Available Instance Types

        Evaluating the available instance types in a cloud provider requires careful consideration of several factors, such as the workload requirements, the performance characteristics of the instances, and the cost. Here are some steps you can take to evaluate the available instance types in a cloud provider:

        1. Identify your workload requirements. Before evaluating instance types, you should have a clear understanding of your workload requirements. For example, you should know the amount of CPU, memory, and storage your application needs, as well as any specific networking or GPU requirements.
        2. Review the instance types available. Cloud providers offer a range of instance types with varying configurations and performance characteristics. You should review the available instance types and select the ones that are suitable for your workload requirements.
        3. Evaluate performance characteristics. Each instance type has its own performance characteristics, such as CPU speed, memory bandwidth, and network throughput. You should evaluate the performance characteristics of each instance type to determine if they meet your workload requirements.
        4. Consider cost. The cost of each instance type varies based on the configuration and performance characteristics. You should evaluate the cost of each instance type and select the ones that are within your budget.
        5. Conduct benchmarks. Once you have selected a few instance types that meet your workload requirements and budget, you should conduct benchmarks to determine which instance type provides the best performance for your workload.
        6. Consider other factors. Apart from performance and cost, you should also consider other factors such as availability, reliability, and support when evaluating instance types.

        The best way to evaluate the available instance types in a cloud provider is to carefully consider your workload requirements, performance characteristics, and cost, and conduct benchmarks to determine the best instance type for your specific use case.
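
As a hedged sketch of that comparison on AWS (assuming boto3 is installed, credentials are configured, and the candidate instance types listed are placeholders), you can pull the published specifications programmatically and compare them side by side:

```python
# Compare a few candidate EC2 instance types (candidates are placeholders).
import boto3

ec2 = boto3.client("ec2")
candidates = ["m5.xlarge", "c5.xlarge", "r5.xlarge"]

resp = ec2.describe_instance_types(InstanceTypes=candidates)
for it in resp["InstanceTypes"]:
    name = it["InstanceType"]
    vcpus = it["VCpuInfo"]["DefaultVCpus"]
    mem_gib = it["MemoryInfo"]["SizeInMiB"] / 1024
    network = it["NetworkInfo"]["NetworkPerformance"]
    print(f"{name}: {vcpus} vCPUs, {mem_gib:.0f} GiB RAM, network {network}")
```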

        Consider the Cost

        When evaluating cloud instance types, it is important to consider both the hourly rate and the overall cost over time, as these factors can vary significantly depending on the provider and the specific instance type. Here are some steps you can take to determine the hourly rate and overall cost over time for cloud instance types:

        1. Identify the instance types you are interested in. Before you can determine the hourly rate and overall cost over time, you need to identify the specific instance types you are interested in.
        2. Check the hourly rate. Most cloud providers offer a pricing calculator that allows you to check the hourly rate for a specific instance type. You can use this calculator to compare the hourly rate for different instance types and providers.
3. Consider the length of time you will use the instance. While the hourly rate matters, so does how long you will run the instance. For long-running workloads, a pricing option with a longer commitment, or an instance type with a higher hourly rate that completes the work faster, can still yield a lower overall cost over time.
        4. Look for cost-saving options. Many cloud providers offer cost-saving options such as reserved instances or spot instances. These options can help reduce the overall cost over time, but may require a longer commitment or be subject to availability limitations.
        5. Factor in any additional costs. In addition to the hourly rate, there may be additional costs such as data transfer fees or storage costs that can significantly impact the overall cost over time.
        6. Consider the potential for scaling. If you anticipate the need for scaling in the future, you should also consider the potential cost implications of adding additional instances over time.

        By carefully considering the hourly rate, overall cost over time, and any additional costs or cost-saving options, you can make an informed decision about the most cost-effective instance type for your specific use case.
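
To make the hourly-rate-versus-overall-cost trade-off concrete, here is a small back-of-the-envelope calculation. The rates are hypothetical placeholders, not published prices, and a one-year commitment is assumed for the reserved option.

```python
# Hypothetical rates -- substitute your provider's published prices.
on_demand_per_hour = 0.40      # pay-as-you-go rate
reserved_per_hour = 0.25       # effective hourly rate with a 1-year commitment
hours_per_month = 730

for months in (1, 3, 6, 12):
    od = on_demand_per_hour * hours_per_month * months
    rs = reserved_per_hour * hours_per_month * 12    # you pay for the full year
    better = "reserved" if rs < od else "on-demand"
    print(f"{months:>2} months of use: on-demand ${od:,.0f} vs reserved ${rs:,.0f} -> {better}")
```

With these placeholder numbers the commitment only pays off once the instance runs most of the year, which is exactly the kind of break-even you want to know before signing up.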

        Optimize for Scalability

        To optimize resources for scalability in the cloud, you can follow these best practices:

        1. Design for scalability. When designing your architecture, consider the needs of your application and design it to scale horizontally. This means adding more resources, such as servers or containers, to handle an increase in demand.
        2. Use auto-scaling. Auto-scaling allows you to automatically increase or decrease the number of resources based on the demand for your application. This helps ensure that you are using only the necessary resources at any given time, and can also save costs by reducing resources during low demand periods.
        3. Use load balancing. Load balancing distributes incoming traffic across multiple resources, which helps prevent any one resource from being overloaded. This also helps with failover and disaster recovery.
        4. Use caching. Caching can help reduce the load on your servers by storing frequently accessed data in a cache. This reduces the number of requests that need to be processed by your servers, which can improve performance and reduce costs.
        5. Use cloud monitoring. Cloud monitoring tools can help you identify potential performance issues and bottlenecks before they become problems. This can help you optimize your resources more effectively and improve the overall performance of your application.
        6. Use serverless architecture. With serverless architecture, you don’t need to manage servers or resources. Instead, you pay only for the resources that your application uses, which can help you optimize your resources and reduce costs.

        By following the above best practices, you can optimize your resources for scalability in the cloud and ensure that your application can handle an increase in demand.
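
As a minimal sketch of the auto-scaling practice above, assuming AWS, boto3, and an existing Auto Scaling group (the group name and the 50% CPU target are placeholders you would tune for your workload):

```python
# Attach a target-tracking scaling policy to an existing Auto Scaling group.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-data-asg",          # placeholder group name
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,                     # scale out above, scale in below
    },
)
```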

        Test & Optimize

        Testing and optimizing your cloud environment is a critical aspect of ensuring that your applications and services are performing optimally and that you’re not overspending on cloud resources. Here are some steps you can take to test and optimize your cloud environment:

        1. Set up monitoring. Before you can start testing and optimizing, you need to have visibility into your cloud environment. Set up monitoring tools that can give you insights into key metrics such as CPU utilization, network traffic, and storage usage.
        2. Conduct load testing. Conduct load testing to determine how your applications and services perform under different levels of traffic. This can help you identify bottlenecks and performance issues, and make optimizations to improve performance.
        3. Optimize resource allocation. Make sure that you’re not overspending on cloud resources by optimizing resource allocation. This includes things like resizing virtual machines, choosing the right storage options, and using auto-scaling to automatically adjust resource allocation based on demand.
        4. Implement security measures. Make sure that your cloud environment is secure by implementing appropriate security measures such as firewalls, access controls, and encryption.
        5. Use automation. Automating routine tasks can help you save time and reduce the risk of errors. This includes things like automating deployments, backups, and resource provisioning.
        6. Review cost optimization options. Consider reviewing your cloud provider’s cost optimization options, such as reserved instances or spot instances. These can help you save money on your cloud bill while still maintaining performance.
        7. Continuously monitor and optimize. Continuous monitoring and optimization is key to ensuring that your cloud environment is performing optimally. Set up regular reviews to identify opportunities for optimization and ensure that your cloud environment is meeting your business needs.

        By following these steps, you can test and optimize your cloud environment to ensure that it’s secure, cost-effective, and performing optimally.
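
One concrete way to support the resource-allocation step is to pull recent utilization from your monitoring service before resizing anything. The sketch below assumes AWS CloudWatch, a placeholder instance ID, and an arbitrary 20% threshold for flagging a down-sizing candidate:

```python
# Flag an EC2 instance as a down-sizing candidate if its average CPU stayed
# below 20% for the last 14 days (instance ID and threshold are placeholders).
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,                 # hourly datapoints
    Statistics=["Average"],
)
datapoints = stats["Datapoints"]
avg_cpu = sum(d["Average"] for d in datapoints) / max(len(datapoints), 1)
print(f"14-day average CPU: {avg_cpu:.1f}%"
      + (" -> consider a smaller instance type" if avg_cpu < 20 else ""))
```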

        Recap

Following the steps above will help you make informed decisions and optimize your cloud infrastructure to meet your business needs. Start by understanding your application’s resource requirements; with those in hand you can evaluate performance characteristics and cost, run benchmarks, and choose instance types that provide the CPU, memory, storage, and network resources your application needs to meet its performance and scalability goals.

        The post The Modern Data Ecosystem: Choose the Right Instance appeared first on Unravel.

        The Modern Data Ecosystem: Monitor Cloud Resources https://www.unraveldata.com/resources/the-modern-data-ecosystem-monitor-cloud-resources/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-monitor-cloud-resources/#respond Thu, 01 Jun 2023 01:32:59 +0000 https://www.unraveldata.com/?p=12740


        Monitor Cloud Resources

        When monitoring cloud resources, there are several factors to consider:

1. Performance. It is essential to monitor the performance of your cloud resources, including their availability, latency, and throughput. You can use metrics such as CPU usage, network traffic, and memory usage to measure the performance of your resources.
2. Scalability. You should monitor the scalability of your cloud resources to ensure that they can handle changes in demand. You can use tools such as auto-scaling to automatically adjust the resources based on demand.
3. Security. You must monitor the security of your cloud resources to ensure that they are protected from unauthorized access or attacks. You can use tools such as intrusion detection systems and firewalls to monitor and protect your resources.
4. Cost. It is important to monitor the cost of your cloud resources to ensure that you are not overspending on resources that are not being used. You can use tools such as cost optimization and billing alerts to manage your costs.
5. Compliance. If you are subject to regulatory compliance requirements, you should monitor your cloud resources to ensure that you are meeting those requirements. You can use tools such as audit logs and compliance reports to monitor and maintain compliance.
6. Availability. It is important to monitor the availability of your cloud resources to ensure that they are up and running when needed. You can use tools such as load balancing and failover to ensure high availability.
7. User experience. You should also monitor the user experience of your cloud resources to ensure that they are meeting the needs of your users. You can use tools such as user monitoring and feedback to measure user satisfaction and identify areas for improvement.

        Performance Monitoring

        Here are some best practices for performance monitoring in the cloud:

        1. Establish baselines. Establish baseline performance metrics for your applications and services. This will allow you to identify and troubleshoot performance issues more quickly.
        2. Monitor resource utilization. Monitor resource utilization such as CPU usage, memory usage, network bandwidth, and disk I/O. This will help you identify resource bottlenecks and optimize resource allocation.
3. Use automated monitoring tools. Use automated monitoring tools such as CloudWatch, Datadog, and New Relic to collect performance metrics and analyze them in real time. This will allow you to identify and address performance issues as they arise.
4. Set alerts. Set up alerts for critical performance metrics such as CPU utilization, memory utilization, and network bandwidth. This will allow you to proactively address performance issues before they impact end users (see the example after this list).
        5. Use load testing. Use load testing tools to simulate heavy loads on your applications and services. This will help you identify performance bottlenecks and optimize resource allocation before going live.
6. Monitor end-user experience. Monitor end-user experience using tools such as synthetic monitoring and real user monitoring (RUM). This will allow you to identify and address performance issues that impact end users.
        7. Analyze logs. Analyze logs to identify potential performance issues. This will help you identify the root cause of performance issues and optimize resource allocation.
        8. Continuously optimize. Continuously optimize your resources based on performance metrics and end-user experience. This will help you ensure that your applications and services perform at their best at all times.
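
As an illustration of the alerting practice above (assuming AWS CloudWatch, with a placeholder instance ID, SNS topic, and thresholds you would set to match your own baselines):

```python
# Create a CloudWatch alarm that fires when CPU stays above 80% for
# 15 minutes (instance ID, SNS topic ARN, and thresholds are placeholders).
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-i-0123456789abcdef0",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                   # 5-minute datapoints
    EvaluationPeriods=3,          # three consecutive breaches = 15 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```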

        Scalability Monitoring

        Here are some best practices for scalability monitoring in the cloud:

        1. Establish baselines. Establish baseline performance metrics for your applications and services. This will allow you to identify and troubleshoot scalability issues more quickly.
        2. Monitor auto-scaling. Monitor auto-scaling metrics to ensure that your resources are scaling up or down according to demand. This will help you ensure that you have the right amount of resources available to meet demand.
        3. Use load testing. Use load testing tools to simulate heavy loads on your applications and services. This will help you identify scalability bottlenecks and optimize resource allocation before going live.
        4. Set alerts. Set up alerts for critical scalability metrics such as CPU utilization, memory utilization, and network bandwidth. This will allow you to proactively address scalability issues before they impact end users.
        5. Use horizontal scaling. Use horizontal scaling to add more instances of your application or service to handle increased traffic. This will allow you to scale quickly and efficiently.
6. Use vertical scaling. Use vertical scaling to increase the size of individual resources to handle increased traffic. This is often simpler for stateful components, but it is bounded by the largest available instance size and may require a restart.
        7. Analyze logs. Analyze logs to identify potential scalability issues. This will help you identify the root cause of scalability issues and optimize resource allocation.
        8. Continuously optimize. Continuously optimize your resources based on scalability metrics and end-user experience. This will help you ensure that your applications and services can handle any level of demand.

        Security Monitoring

        Here are some best practices for handling security monitoring in the cloud:

        1. Use security services. Use cloud-based security services such as AWS Security Hub, Azure Security Center, and Google Cloud Security Command Center to centralize and automate security monitoring across your cloud environment.
        2. Monitor user activity. Monitor user activity across your cloud environment, including login attempts, resource access, and changes to security policies. This will help you identify potential security threats and ensure that access is granted only to authorized users.
        3. Use encryption. Use encryption to protect data at rest and in transit. This will help you protect sensitive data from unauthorized access.
        4. Set up alerts. Set up alerts for critical security events such as failed login attempts, unauthorized access, and changes to security policies. This will allow you to respond quickly to security threats.
        5. Use multi-factor authentication. Use multi-factor authentication to add an extra layer of security to user accounts. This will help prevent unauthorized access even if a user’s password is compromised.
        6. Use firewalls. Use firewalls to control network traffic to and from your cloud resources. This will help you prevent unauthorized access and ensure that only authorized traffic is allowed.
        7. Implement access controls. Implement access controls to ensure that only authorized users have access to your cloud resources. This will help you prevent unauthorized access and ensure that access is granted only to those who need it.
        8. Perform regular security audits. Perform regular security audits to identify potential security threats and ensure that your cloud environment is secure. This will help you identify and address security issues before they become major problems.

        Cost Monitoring

        Here are some best practices for monitoring cost in the cloud:

        1. Use cost management tools. Use cloud-based cost management tools such as AWS Cost Explorer, Azure Cost Management, and Google Cloud Billing to monitor and optimize your cloud costs.
        2. Set budgets. Set budgets for your cloud spending to help you stay within your financial limits. This will help you avoid unexpected charges and ensure that you are using your cloud resources efficiently.
        3. Monitor usage. Monitor your cloud resource usage to identify any unnecessary or underutilized resources. This will help you identify opportunities for optimization and cost savings.
        4. Analyze cost data. Analyze your cost data to identify trends and areas of high spending. This will help you identify opportunities for optimization and cost savings.
        5. Use cost allocation. Use cost allocation to assign costs to individual users, teams, or projects. This will help you identify which resources are being used most and which users or teams are driving up costs.
        6. Use reserved instances. Use reserved instances to save money on long-term cloud usage. This will help you save money on your cloud costs over time.
        7. Optimize resource allocation. Optimize your resource allocation to ensure that you are using the right amount of resources for your needs. This will help you avoid over-provisioning and under-provisioning.
        8. Implement cost optimization strategies. Implement cost optimization strategies such as using spot instances, turning off non-critical resources when not in use, and using serverless architectures. This will help you save money on your cloud costs without sacrificing performance or reliability.

        Compliance Monitoring

        Here are some best practices for monitoring compliance in the cloud:

        1. Understand compliance requirements. Understand the compliance requirements that apply to your organization and your cloud environment, such as HIPAA, PCI-DSS, or GDPR.
        2. Use compliance services. Use cloud-based compliance services such as AWS Artifact, Azure Compliance Manager, and Google Cloud Compliance to streamline compliance management and ensure that you are meeting your regulatory requirements.
        3. Conduct regular audits. Conduct regular audits to ensure that your cloud environment is in compliance with regulatory requirements. This will help you identify and address compliance issues before they become major problems.
        4. Implement security controls. Implement security controls such as access controls, encryption, and multi-factor authentication to protect sensitive data and ensure compliance with regulatory requirements.
        5. Monitor activity logs. Monitor activity logs across your cloud environment to identify potential compliance issues, such as unauthorized access or data breaches. This will help you ensure that you are meeting your regulatory requirements and protect sensitive data.
        6. Use automation. Use automation tools to help you enforce compliance policies and ensure that your cloud environment is compliant with regulatory requirements.
        7. Establish incident response plans. Establish incident response plans to help you respond quickly to compliance issues or data breaches. This will help you minimize the impact of any incidents and ensure that you are meeting your regulatory requirements.
        8. Train your employees. Train your employees on compliance policies and procedures to ensure that they understand their roles and responsibilities in maintaining compliance with regulatory requirements. This will help you ensure that everyone in your organization is working together to maintain compliance in the cloud.

        Monitor Availability

        Here are some best practices for monitoring resource availability in the cloud:

        1. Use monitoring services. Use cloud-based monitoring services such as AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring to monitor the availability of your cloud resources.
        2. Set up alerts. Set up alerts to notify you when there are issues with resource availability, such as when a server goes down or a service becomes unresponsive. This will help you respond quickly to issues and minimize downtime.
        3. Monitor performance metrics. Monitor performance metrics such as CPU usage, memory usage, and network latency to identify potential issues before they become major problems. This will help you ensure that your resources are performing optimally and prevent performance issues from affecting availability.
        4. Conduct regular load testing. Conduct regular load testing to ensure that your resources can handle the expected levels of traffic and usage. This will help you identify potential performance bottlenecks and ensure that your resources are available when you need them.
        5. Use high availability architectures. Use high availability architectures such as load balancing, auto-scaling, and multi-region deployments to ensure that your resources are available even in the event of a failure. This will help you minimize downtime and ensure that your resources are always available.
        6. Monitor service-level agreements (SLAs). Monitor SLAs to ensure that your cloud providers are meeting their service-level commitments. This will help you hold your providers accountable and ensure that your resources are available as expected.
        7. Conduct disaster recovery drills. Conduct disaster recovery drills to ensure that you can recover from major outages or disasters. This will help you minimize downtime and ensure that your resources are available even in the event of a major failure.
        8. Implement redundancy. Implement redundancy for critical resources to ensure that they are always available. This can include redundant servers, databases, or storage systems. This will help you ensure that your critical resources are always available and minimize downtime.

        Monitor User Experience

        Here are some best practices for monitoring user experience in the cloud:

        1. Define user experience metrics. Define user experience metrics that are important to your business, such as page load times, error rates, and response times. This will help you track user experience and identify areas for improvement.
        2. Use synthetic transactions. Use synthetic transactions to simulate user interactions with your applications and services. This will help you identify performance issues and ensure that your applications and services are delivering a good user experience.
        3. Monitor real user traffic. Monitor real user traffic to identify issues that may not be apparent in synthetic transactions. This will help you understand how your users are actually using your applications and services and identify any performance issues that may be impacting the user experience.
        4. Monitor third-party services. Monitor third-party services that your applications and services rely on, such as payment gateways and content delivery networks. This will help you identify any issues that may be impacting the user experience and ensure that your users have a seamless experience.
        5. Use application performance management (APM) tools. Use APM tools to monitor application performance and identify potential issues that may be impacting the user experience. This will help you quickly identify and resolve issues that may be impacting your users.
        6. Monitor network latency. Monitor network latency to ensure that your applications and services are delivering a good user experience. This will help you identify any network-related issues that may be impacting the user experience.
        7. Set up alerts. Set up alerts to notify you when user experience metrics fall below acceptable levels. This will help you respond quickly to issues and ensure that your users have a good experience.
        8. Continuously test and optimize. Continuously test and optimize your applications and services to ensure that they are delivering a good user experience. This will help you identify and fix issues before they impact your users and ensure that your applications and services are always performing optimally.

        Recap

When monitoring cloud resources, there are several factors to consider. Performance comes first: monitor the availability, latency, and throughput of your resources so you can identify and address issues before they impact end users. Keep an equally close eye on cost, using tools such as cost optimization and billing alerts to avoid unexpected charges and ensure that resources are used efficiently. Conduct regular load testing to confirm that your resources can handle expected levels of traffic and usage, and define the user experience metrics that matter to your business, such as page load times, error rates, and response times.

        The post The Modern Data Ecosystem: Monitor Cloud Resources appeared first on Unravel.

        The Modern Data Ecosystem: Use Managed Services https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-managed-services/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-managed-services/#respond Thu, 01 Jun 2023 01:32:43 +0000 https://www.unraveldata.com/?p=12735


        Use Managed Services

        Using managed services in the cloud can help you reduce your operational burden, increase efficiency, and improve scalability. However, to fully realize the benefits of managed services, you need to follow some best practices. Here are some best practices to consider when using managed services in the cloud:

        1. Understand your service-level agreements (SLAs). Before using any managed service, you should understand the SLAs offered by your cloud provider. This will help you understand the availability and reliability of the service, as well as any limitations or restrictions that may impact your use of the service.
        2. Choose the right service. You should choose the right managed service for your needs. This means selecting a service that meets your requirements and offers the features and functionality you need. You should also consider the cost of the service and how it will impact your budget.
        3. Plan for scalability. Managed services in the cloud are designed to be highly scalable, so you should plan for scalability when using them. This means understanding how the service will scale as your needs change and ensuring that you can easily scale the service up or down as required.
        4. Monitor service performance. You should monitor the performance of your managed services to ensure that they are meeting your expectations. This may involve setting up monitoring tools to track service usage, performance, and availability. You should also define appropriate thresholds and alerts to notify you when issues arise.
        5. Secure your services. Security is critical when using managed services in the cloud. You should ensure that your services are secured according to best practices, such as using strong passwords, encrypting data in transit and at rest, and implementing access controls. You should also regularly audit your services to ensure that they remain secure.
        6. Stay up-to-date. Managed services in the cloud are continually evolving, so you should stay up-to-date with the latest features, updates, and releases. This will help you take advantage of new features and ensure that your services are up-to-date and secure.

        By following these best practices, you can ensure that your managed services in the cloud are efficient, reliable, and secure.

        Understand Your Service-Level Agreements (SLAs)

        Understanding cloud service-level agreements (SLAs) is crucial when you use cloud services. SLAs define the level of service you can expect from a cloud provider and outline the terms and conditions of the service. Here are some ways to help you understand cloud SLAs:

        1. Read the SLA. The best way to understand cloud SLAs is to read the SLA itself. The SLA provides details on what services are offered, how they are delivered, and what level of availability you can expect. It also outlines the terms and conditions of the service and what you can expect in the event of an outage or other issues.
        2. Understand the metrics. Cloud SLAs typically include metrics that define the level of service you can expect. These metrics may include availability, performance, and response time. You should understand how these metrics are measured and what level of service is guaranteed for each metric.
        3. Know the guarantees. The SLA also specifies the guarantees that the cloud provider offers for each metric. You should understand what happens if the provider fails to meet these guarantees, such as compensation or penalties.
        4. Identify exclusions. The SLA may also include exclusions or limitations to the service, such as scheduled maintenance, force majeure events, or issues caused by your own actions. You should understand what these exclusions are and how they may impact your use of the service.
        5. Ask questions. If you are unsure about any aspect of the SLA, you should ask questions. The cloud provider should be able to provide clarification on any issues and help you understand the SLA better.
        6. Get expert help. If you are still unsure about the SLA or need help negotiating SLAs with multiple providers, consider getting expert help. Cloud consultants or legal advisors can help you understand the SLA better and ensure that you get the best possible terms for your needs.

        By following these steps, you can better understand cloud SLAs and make informed decisions about the cloud services you use.

        Choose the Right Service

        Choosing the right cloud service is a critical decision that can have a significant impact on your organization. Here are some factors to consider when choosing a cloud service:

1. Business needs. The first step in choosing a cloud service is to understand your business needs. What are your specific requirements? What do you need the cloud service to do? Consider factors such as scalability, security, compliance, and cost when evaluating your options.
2. Reliability and availability. Reliability and availability are critical when choosing a cloud service. Look for a provider with a strong track record of uptime and availability. Also, ensure that the provider has a robust disaster recovery plan in place in case of service disruptions or outages.
3. Security. Security is a top priority when using cloud services. Choose a provider that has robust security measures in place, such as encryption, access controls, and multi-factor authentication. Also, consider whether the provider meets any relevant compliance requirements, such as HIPAA or GDPR.
4. Cost. Cost is another critical factor to consider when choosing a cloud service. Look for a provider that offers transparent pricing and provides a clear breakdown of costs. Also, consider any hidden fees or charges, such as data transfer costs or support fees.
5. Support. Choose a cloud service provider that offers robust support options, such as 24/7 customer support or online resources. Ensure that the provider has a reputation for providing excellent support and responding quickly to issues.
6. Integration. Ensure that the cloud service provider integrates with any existing systems or applications that your organization uses. This can help reduce the complexity of your IT environment and improve productivity.
7. Scalability. Choose a cloud service provider that can scale as your needs change. Ensure that the provider can handle your growth and provides flexibility in terms of scaling up or down.

        By considering these factors, you can choose the right cloud service that meets your business needs, is secure, reliable, and scalable, and provides good value.

        Plan for Scalability

        Scalability is a key advantage of cloud computing, allowing organizations to quickly and easily increase or decrease resources as needed. Here are some best practices for planning for scalability in the cloud:

        1. Start with a solid architecture. A solid architecture is essential for building scalable cloud applications. Ensure that your architecture is designed to support scalability from the beginning, by leveraging horizontal scaling, load balancing, and other cloud-native capabilities.
        2. Automate everything. Automation is critical for scaling in the cloud. Automate deployment, configuration, and management tasks to reduce manual effort and increase efficiency. Use tools like cloud orchestration or infrastructure-as-code (IAC) to automate the provisioning and configuration of resources.
        3. Use elasticity. Elasticity is the ability to automatically adjust resource capacity to meet changes in demand. Use auto-scaling to automatically increase or decrease resources based on utilization or other metrics. This can help ensure that you always have the right amount of resources to handle traffic spikes or fluctuations.
        4. Monitor and optimize. Monitoring is critical for maintaining scalability. Use monitoring tools to track application performance and identify potential bottlenecks or areas for optimization. Optimize your applications, infrastructure, and processes to ensure that you can scale as needed without encountering issues.
        5. Plan for failure. Scalability also means being prepared for failure. Ensure that your application is designed to handle failures and that you have a plan in place for dealing with outages or other issues. Use fault tolerance and high availability to ensure that your application can continue running even if a component fails.

        By following these best practices, you can plan for scalability in the cloud and ensure that your applications can handle changes in demand without encountering issues.

        Monitor Service Performance

        Monitoring service performance is essential to ensure that your cloud applications are running smoothly and meeting service-level agreements (SLAs). Here are some best practices for monitoring service performance in the cloud:

        1. Define key performance indicators (KPIs). Define KPIs that are relevant to your business needs, such as response time, throughput, and error rates. These KPIs will help you track how well your applications are performing and whether they are meeting your SLAs.
        2. Use monitoring tools. Use monitoring tools to collect and analyze data on your application’s performance. These tools can help you identify issues before they become critical and track how well your application is meeting your KPIs.
        3. Set alerts. Set up alerts based on your KPIs to notify you when something goes wrong. This can help you quickly identify and resolve issues before they impact your application’s performance.
        4. Monitor end-to-end performance. Monitor end-to-end performance, including network latency, database performance, and third-party services, to identify any potential bottlenecks or issues that could impact your application’s performance.
        5. Analyze and optimize. Analyze performance data to identify patterns and trends. Use this information to optimize your application’s performance and identify areas for improvement. Optimize your application code, database queries, and network configurations to improve performance.
        6. Use machine learning. Leverage machine learning to analyze performance data and identify anomalies or issues automatically. This can help you identify issues before they become critical and take proactive steps to resolve them.

        By following these best practices, you can monitor service performance in the cloud effectively and ensure that your applications are meeting your SLAs and delivering the best possible user experience.

        Secure Your Services

        Securing your services in the cloud is critical to protect your data and applications from cyber threats. Here are some best practices for securing your services in the cloud:

        1. Implement strong access control. Implement strong access control mechanisms to restrict access to your cloud resources. Use least privilege principles to ensure that users only have access to the resources they need. Implement multi-factor authentication (MFA) and use strong passwords to protect against unauthorized access.
        2. Encrypt your data. Encrypt your data both at rest and in transit to protect against data breaches. Use SSL/TLS protocols for data in transit and encryption mechanisms like AES or RSA for data at rest. Additionally, consider encrypting data before it is stored in the cloud to provide an additional layer of protection.
        3. Implement network security. Implement network security measures to protect against network-based attacks, such as DDoS attacks, by using firewalls, intrusion detection/prevention systems (IDS/IPS), and VPNs. Segregate your network into logical segments to reduce the risk of lateral movement by attackers.
        4. Use security groups and network ACLs. Use security groups and network ACLs to control inbound and outbound traffic to your resources. Implement granular rules to restrict traffic to only what is necessary, and consider using security groups and network ACLs together to provide layered security.
        5. Implement logging and monitoring. Implement logging and monitoring to detect and respond to security incidents. Use cloud-native tools like CloudTrail or CloudWatch to monitor activity in your environment and alert you to any unusual behavior.
        6. Perform regular security audits. Perform regular security audits to identify potential vulnerabilities and ensure that your security controls are effective. Conduct regular penetration testing and vulnerability assessments to identify and remediate any weaknesses in your environment.

        By following these best practices, you can secure your services in the cloud and protect your applications and data from cyber threats.
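
As a small sketch of the security-group practice above, assuming AWS, boto3, and a placeholder security group ID and CIDR range, the snippet below opens inbound HTTPS only and leaves everything else closed by default:

```python
# Allow inbound HTTPS only on an existing security group; all other inbound
# traffic stays blocked by default (the group ID and CIDR are placeholders).
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.0.0/16",      # internal network only
                      "Description": "HTTPS from the corporate VPC"}],
    }],
)
```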

        Stay Up-to-Date on Managed Services

        Staying up to date on managed services in the cloud is essential to ensure that you are using the latest features and capabilities and making the most out of your cloud investment. Here are some ways to stay up-to-date on managed services in the cloud:

        1. Subscribe to cloud service providers’ blogs. Cloud service providers like AWS, Google Cloud, and Microsoft Azure regularly publish blog posts announcing new features and services. By subscribing to these blogs, you can stay up-to-date on the latest developments and updates.
        2. Attend cloud conferences. Attending cloud conferences like AWS re:Invent, Google Cloud Next, and Microsoft Ignite is an excellent way to learn about new and upcoming managed services in the cloud. These events feature keynote speeches, technical sessions, and hands-on workshops that can help you stay up-to-date with the latest trends and technologies.
        3. Join cloud user groups. Joining cloud user groups like AWS User Group, Google Cloud User Group, and Azure User Group is a great way to connect with other cloud professionals and learn about new managed services in the cloud. These groups often hold meetings and events where members can share their experiences and discuss the latest developments.
        4. Participate in online communities. Participating in online communities like Reddit, Stack Overflow, and LinkedIn Groups is an excellent way to stay up-to-date on managed services in the cloud. These communities often have active discussions about new features and services, and members can share their experiences and insights.
        5. Follow industry experts. Following industry experts on social media platforms like Twitter, LinkedIn, and Medium is an excellent way to stay up-to-date on managed services in the cloud. These experts often share their thoughts and insights on the latest developments and can provide valuable guidance and advice.

        By following these methods, you can stay up-to-date on managed services in the cloud and ensure that you are using the latest features and capabilities to achieve your business goals.

        Recap

Understanding your cloud SLAs lets you make informed decisions about the services you use, and following these best practices helps ensure that your managed services in the cloud are efficient, reliable, and secure. Plan for scalability so your applications can handle changes in demand, stay vigilant so they continue to meet SLAs and deliver the best possible user experience, and secure your services to protect your applications and data from cyber threats. By weighing the right factors, you can choose a cloud service that meets your business needs, is secure, reliable, and scalable, and provides good value, which ultimately helps you achieve your key business objectives in the cloud.

        The post The Modern Data Ecosystem: Use Managed Services appeared first on Unravel.

        The Modern Data Ecosystem: Optimize Your Storage https://www.unraveldata.com/resources/the-modern-data-ecosystem-optimize-your-storage/ https://www.unraveldata.com/resources/the-modern-data-ecosystem-optimize-your-storage/#respond Thu, 01 Jun 2023 01:32:28 +0000 https://www.unraveldata.com/?p=12728


        Optimize Storage

        There are several ways to optimize cloud storage, depending on your specific needs and circumstances. Here are some general tips that can help:

        1. Understand your data. Before you start optimizing, it’s important to understand what data you have and how it’s being used. This can help you identify which files or folders are taking up the most space, and which ones are being accessed the most frequently.
        2. Use storage compression. Compression can reduce the size of your files, which can save you storage space and reduce the amount of data you need to transfer over the network. However, keep in mind that compressed files may take longer to access and may not be suitable for all types of data.
        3. Use deduplication. Deduplication can identify and eliminate duplicate data, which can save you storage space and reduce the amount of data you need to transfer over the network. However, keep in mind that deduplication may increase the amount of CPU and memory resources required to manage your data.
        4. Choose the right storage class. Most cloud storage providers offer different storage classes that vary in performance, availability, and cost. Choose the storage class that best meets your needs and budget.
        5. Set up retention policies. Retention policies can help you automatically delete old or outdated data, which can free up storage space and reduce your storage costs. However, be careful not to delete data that you may need later.
        6. Monitor your usage. Regularly monitor your cloud storage usage to ensure that you’re not exceeding your storage limits or paying for more storage than you need. You can use cloud storage monitoring tools or third-party services to help you with this.
        7. Consider a multi-cloud strategy. If you have very large amounts of data, you may want to consider using multiple cloud storage providers to spread your data across multiple locations. This can help you optimize performance, availability, and cost, while also reducing the risk of data loss.

        Overall, optimizing cloud storage requires careful planning, monitoring, and management. By following these tips, you can reduce your storage costs, improve your data management, and get the most out of your cloud storage investment.

        Understand Your Data

        Analyzing data in the cloud can be a powerful way to gain insights and extract value from large datasets. Here are some best practices for analyzing data in the cloud:

        1. Choose the right cloud platform. There are several cloud platforms available, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Choose the one that suits your needs and budget.
        2. Store data in a scalable, secure, and cost-effective way. You can store data in cloud-based databases, data lakes, or data warehouses. Make sure that you choose a storage solution that is scalable, secure, and cost-effective.
        3. Choose the right data analysis tool. There are several cloud-based data analysis tools available, such as Amazon SageMaker, Microsoft Azure Machine Learning, and Google Cloud AI Platform. Choose the one that suits your needs and budget.
        4. Prepare data for analysis. Data preparation involves cleaning, transforming, and structuring the data for analysis. This step is crucial for accurate analysis results.
        5. Choose the right analysis technique. Depending on the nature of the data and the business problem you are trying to solve, you may choose from various analysis techniques such as descriptive, diagnostic, predictive, or prescriptive.
        6. Visualize data. Visualization helps to communicate insights effectively. Choose a visualization tool that suits your needs and budget.
        7. Monitor and optimize performance. Monitor the performance of your data analysis system and optimize it as necessary. This step helps to ensure that you get accurate and timely insights from your data.

        Overall, analyzing data in the cloud can be a powerful way to gain insights and extract value from large datasets. By following these best practices, you can ensure that you get the most out of your cloud-based data analysis system.

        Use Storage Compression

        Storage compression is a useful technique for reducing storage costs and improving performance in the cloud. Here are some best practices for using storage compression in the cloud:

        1. Choose the right compression algorithm. There are several compression algorithms available, such as gzip, bzip2, and LZ4. Choose the algorithm that suits your needs and budget. Consider factors such as compression ratio, speed, and memory usage.
        2. Compress data at the right time. Compress data when it is written to storage or when it is not frequently accessed. Avoid compressing data that is frequently accessed, as this can slow down performance.
        3. Monitor compression performance. Monitor the performance of your compression system to ensure that it is not slowing down performance. Use tools such as monitoring dashboards to track the performance of your system.
        4. Test compression performance. Test the performance of your compression system with different types of data to ensure that it is effective. Consider testing with data that has varying levels of redundancy, such as log files or images.
        5. Use compression in conjunction with other techniques. Consider using compression in conjunction with other storage optimization techniques such as deduplication, tiering, and archiving. This can further reduce storage costs and improve performance.
        6. Consider the cost of decompression. Decompressing data can be a resource-intensive process. Consider the cost of decompression when choosing a compression algorithm and when designing your storage architecture.

        Overall, using storage compression in the cloud can be an effective way to reduce storage costs and improve performance. By following these best practices, you can ensure that you get the most out of your storage compression system.
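
A quick way to test compression effectiveness before committing to a policy is to measure the ratio on representative files. The sketch below uses Python's standard gzip module; the file path and compression level are placeholders.

```python
# Measure how well a sample file compresses before committing to a policy
# (the path is a placeholder; test with data representative of yours).
import gzip

path = "sample.log"
with open(path, "rb") as f:
    raw = f.read()

compressed = gzip.compress(raw, compresslevel=6)
ratio = len(raw) / max(len(compressed), 1)
print(f"{path}: {len(raw):,} bytes -> {len(compressed):,} bytes "
      f"(compression ratio {ratio:.1f}x)")
```

Text-heavy data such as logs typically compresses well, while already-compressed formats like Parquet or JPEG will show ratios near 1x and are rarely worth compressing again.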

        Data Deduplication 

        Deduplication is a technique used to reduce the amount of data stored in the cloud by identifying and removing duplicate data. Here are some best practices for deduplicating cloud data:

        1. Choose the right deduplication algorithm. There are several deduplication algorithms available, such as content-defined chunking and fixed-size chunking. Choose the algorithm that suits your needs and budget. Consider factors such as data type, deduplication ratio, and resource usage.
        2. Deduplicate data at the right time. Deduplicate data when it is written to storage or when it is not frequently accessed. Avoid deduplicating data that is frequently accessed, as this can slow down performance.
        3. Monitor deduplication performance. Monitor the performance of your deduplication system to ensure that it is not slowing down performance. Use tools such as monitoring dashboards to track the performance of your system.
        4. Test deduplication performance. Test the performance of your deduplication system with different types of data to ensure that it is effective. Consider testing with data that has varying levels of redundancy, such as log files or images.
        5. Consider the tradeoff between storage cost and compute cost. Deduplicating data can be a resource-intensive process. Consider the tradeoff between storage cost and compute cost when choosing a deduplication algorithm and when designing your storage architecture.
        6. Use deduplication in conjunction with other techniques. Consider using deduplication in conjunction with other storage optimization techniques such as compression, tiering, and archiving. This can further reduce storage costs and improve performance.

        Overall, deduplicating cloud data can be an effective way to reduce storage costs and improve performance. By following these best practices, you can ensure that you get the most out of your deduplication system.
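
The sketch below illustrates fixed-size chunking with content hashing, one of the approaches mentioned above, against a local file. The 4 MiB chunk size and file name are assumptions, and a production system would also maintain an index from chunk hash to stored location.

```python
# Estimate how much a file would shrink under fixed-size-chunk deduplication
# (4 MiB chunks and SHA-256 hashing are assumptions; real systems also keep
# an index from chunk hash to stored location).
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024
seen = set()
total_chunks = unique_chunks = 0

with open("backup.img", "rb") as f:           # placeholder file name
    while chunk := f.read(CHUNK_SIZE):
        total_chunks += 1
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_chunks += 1

print(f"{total_chunks} chunks read, {unique_chunks} unique "
      f"({100 * unique_chunks / max(total_chunks, 1):.0f}% would be stored)")
```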

        Use the Right Storage Class

        Choosing the right storage class for data in the cloud involves considering factors such as access frequency, durability, availability, and cost. Here are some steps to follow when choosing the right storage class for your data in the cloud:

        1. Determine your access needs. Consider how frequently you need to access your data. If you need to access your data frequently, you should choose a storage class that provides low latency and high throughput. If you don’t need to access your data frequently, you can choose a storage class that provides lower performance and lower cost.
        2. Consider your durability needs. Durability refers to the probability of losing data due to hardware failure. If your data is critical and needs high durability, you should choose a storage class that provides high durability, such as Amazon S3 Standard or Google Cloud Storage Nearline.
3. Evaluate your availability needs. Availability refers to the ability to access your data when you need it. If your data is critical and needs high availability, choose a storage class designed for frequent access and a strong availability SLA, such as Amazon S3 Standard or Google Cloud Storage Standard (ideally in a multi-region location); infrequent-access and archival classes trade some availability for lower cost.
        4. Determine your cost needs. Cost is also an important factor when choosing a storage class. If you have a limited budget, you should choose a storage class that provides lower cost, such as Amazon S3 Infrequent Access or Google Cloud Storage Coldline.
        5. Consider any compliance requirements. Some industries have compliance requirements that dictate how data must be stored. If you have compliance requirements, you should choose a storage class that meets those requirements.
        6. Consider data lifecycle management. Depending on the type of data, you may need to store it for a certain period of time before deleting it. Some storage classes may provide lifecycle management features to help you manage your data more efficiently.

        By considering these factors, you can choose the right storage class for your data in the cloud that meets your needs and helps you save costs.
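
As a hedged example of acting on that decision on AWS (bucket, key, and file names are placeholders), an object you expect to read rarely can be written directly into a cheaper storage class:

```python
# Write an infrequently accessed object straight into S3 Standard-IA
# (bucket, key, and local file names are placeholders).
import boto3

s3 = boto3.client("s3")
with open("q1-summary.parquet", "rb") as body:
    s3.put_object(
        Bucket="my-analytics-archive",
        Key="reports/2023/q1-summary.parquet",
        Body=body,
        StorageClass="STANDARD_IA",   # lower storage cost, per-retrieval fee
    )
```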

        Set Data Retention Policies

        Setting up retention policies for your cloud data is an important step in managing your data and ensuring that you are in compliance with regulatory requirements. Here are some steps you can follow to set up retention policies for your cloud data:

        1. Identify the types of data you need to retain. The first step in setting up retention policies is to identify the types of data that you need to retain. This could include data related to financial transactions, employee records, customer information, and other types of data that are important for your business.
        2. Determine the retention periods. Next, you will need to determine how long each type of data needs to be retained. This will depend on the regulatory requirements for your industry as well as your own internal policies. 
        3. Decide on the retention strategy. There are several different retention strategies you can use for your cloud data. For example, you could choose to retain all data for a certain period of time, or you could choose to delete data after a certain period of time has elapsed. You could also choose to retain data based on certain triggers, such as when a legal or regulatory inquiry is initiated.
        4. Implement the retention policies. Once you have determined the types of data you need to retain, the retention periods, and the retention strategy, you can implement your retention policies in your cloud storage provider. Most cloud storage providers have built-in tools for setting up retention policies.
        5. Monitor the retention policies. It’s important to regularly monitor your retention policies to ensure that they are working as intended. You should periodically review the types of data being retained, the retention periods, and the retention strategy to ensure that they are still appropriate. You should also regularly audit your retention policies to ensure that they are in compliance with any changes in regulatory requirements.

        By following these steps, you can set up retention policies for your cloud data that will help you manage your data effectively, ensure compliance with regulatory requirements, and reduce your risk of data breaches or loss.
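
        To make step 4 concrete, here is a minimal sketch, assuming Amazon S3 as the storage provider, that expresses a retention policy as a lifecycle rule with boto3. The bucket name, prefix, and retention periods are illustrative only.

            import boto3

            s3 = boto3.client("s3")

            s3.put_bucket_lifecycle_configuration(
                Bucket="my-compliance-archive",  # hypothetical bucket name
                LifecycleConfiguration={
                    "Rules": [
                        {
                            "ID": "financial-records-retention",
                            "Filter": {"Prefix": "financial/"},
                            "Status": "Enabled",
                            # Move records to a colder tier once they are rarely accessed.
                            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                            # Delete records after roughly seven years (about 2,555 days).
                            "Expiration": {"Days": 2555},
                        }
                    ]
                },
            )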

        Monitor Usage 

        Monitoring your cloud usage is essential for managing your costs, optimizing your resources, and ensuring the security of your data. Here are some of the best ways to monitor your cloud usage:

        1. Cloud provider monitoring tools. Most cloud providers offer built-in monitoring tools that allow you to track your usage, monitor your costs, and receive alerts when you approach your resource limits. These tools typically provide real-time insights into your cloud usage and can help you identify areas where you can optimize your resources.
        2. Third-party monitoring tools. There are many third-party monitoring tools available that can help you monitor your cloud usage across multiple cloud providers. These tools offer more advanced features and can help you identify usage patterns, forecast future usage, and detect anomalies that may indicate security threats or performance issues.
        3. Cost optimization tools. Cost optimization tools can help you identify areas where you can reduce your costs, such as by using more efficient resource configurations or by identifying idle resources that can be decommissioned. These tools typically integrate with your cloud provider’s monitoring tools to provide a comprehensive view of your usage and costs.
        4. Security and compliance tools. Security and compliance tools can help you monitor your cloud usage for security threats and compliance violations. These tools typically monitor your cloud resources for suspicious activity, such as unauthorized access attempts, and can help you stay in compliance with regulatory requirements.
        5. Regular audits. Regular audits of your cloud usage can help you identify areas where you can optimize your resources, reduce costs, and improve security. You should periodically review your cloud usage and costs, and adjust your resources and policies as necessary to ensure that you are getting the most value from your cloud investment.

        By using these monitoring tools and strategies, you can gain better visibility into your cloud usage, optimize your resources, reduce costs, and ensure the security and compliance of your cloud resources.
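
        As one example of the first category, the sketch below uses Python and the AWS Cost Explorer API to break down daily spend by service, which makes unexpected growth easy to spot. The account setup and date range are assumptions; other providers expose similar billing APIs.

            import boto3

            ce = boto3.client("ce")  # AWS Cost Explorer

            response = ce.get_cost_and_usage(
                TimePeriod={"Start": "2023-05-01", "End": "2023-05-31"},
                Granularity="DAILY",
                Metrics=["UnblendedCost"],
                GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
            )

            # Print each day's spend per service so spikes stand out.
            for day in response["ResultsByTime"]:
                for group in day["Groups"]:
                    service = group["Keys"][0]
                    cost = group["Metrics"]["UnblendedCost"]["Amount"]
                    print(day["TimePeriod"]["Start"], service, cost)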

        Pursue a Multi-Cloud Strategy

        Pursuing a multi-cloud strategy can offer several benefits, such as increased resilience, reduced vendor lock-in, and improved performance. However, there are several considerations you should keep in mind before pursuing a multi-cloud strategy. Here are some of the key considerations:

        1. Business objectives. The first consideration is your business objectives. You need to determine why you want to pursue a multi-cloud strategy and what you hope to achieve. For example, you may be looking to improve the resilience of your applications or reduce vendor lock-in.
        2. Compatibility. The next consideration is the compatibility of your applications and workloads across different cloud providers. You need to ensure that your applications and workloads are compatible with the different cloud providers you plan to use. You may need to modify your applications and workloads to ensure they can run on multiple cloud platforms.
        3. Data management. Another important consideration is data management. You need to ensure that your data is managed securely and efficiently across all the cloud providers you use. This may involve implementing data management policies and tools to ensure that your data is always available and protected.
        4. Cost management. Managing costs is also a critical consideration. You need to ensure that you can manage costs effectively across all the cloud providers you use. This may involve using cost management tools and monitoring usage and costs to identify areas where you can optimize spending.
        5. Security. Security is always a key consideration, but it becomes even more important when using multiple cloud providers. You need to ensure that your applications and data are secure across all the cloud providers you use. This may involve implementing security policies and using security tools to detect and respond to security threats.
        6. Skills and resources. Finally, you need to consider the skills and resources required to manage a multi-cloud environment. This may involve hiring additional staff or up-skilling existing staff to ensure that they have the necessary expertise to manage a multi-cloud environment.

        By considering these key factors, you can develop a successful multi-cloud strategy that meets your business objectives and helps you achieve your goals.

        Recap

        Optimizing cloud storage requires careful planning, monitoring, and management. Analyzing data in the cloud is a powerful way to gain insights and extract value, and storage compression is an effective way to reduce storage costs and improve performance. Monitoring tools, retention policies, and regular audits give you visibility into usage so you can optimize resources, control costs, and keep your cloud environment secure and compliant. If you pursue a multi-cloud strategy, weigh compatibility, data management, cost, security, and skills against your business objectives. Together, these practices reduce your storage costs, improve your data management, and help you get the most out of your cloud storage investment.

        The post The Modern Data Ecosystem: Optimize Your Storage appeared first on Unravel.

        The Modern Data Ecosystem: Use Auto-Scaling https://www.unraveldata.com/resources/the-modern-data-ecosystem-use-auto-scaling/ Thu, 01 Jun 2023 01:32:16 +0000


        Auto-Scaling Overview

        This is the second blog in a five-blog series. For an overview of this blog series, please review my post All Data Ecosystems Are Real-Time, It Is Just a Matter of Time. The series should be read in order.

        Auto-scaling is a powerful feature of cloud computing that allows you to automatically adjust the resources allocated to your applications based on changes in demand. Here are some best practices for using auto-scaling in the cloud:

        1. Set up appropriate triggers. Set up triggers based on metrics such as CPU utilization, network traffic, or memory usage to ensure that your application scales up or down when needed.
        2. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and reliability. This will also help you to avoid any potential single points of failure.
        3. Start with conservative settings. Start with conservative settings for scaling policies to avoid over-provisioning or under-provisioning resources. You can gradually increase the thresholds as you gain more experience with your application.
        4. Monitor your auto-scaling. Regularly monitor the performance of your auto-scaling policies to ensure that they are working as expected. You can use monitoring tools such as CloudWatch to track metrics and troubleshoot any issues.
        5. Use automated configuration management. Use tools such as Chef, Puppet, or Ansible to automate the configuration management of your application. This will make it easier to deploy and scale your application across multiple instances.
        6. Test your auto-scaling policies. Test your auto-scaling policies under different load scenarios to ensure that they can handle sudden spikes in traffic. You can use load testing tools such as JMeter or Gatling to simulate realistic load scenarios.

        By following these best practices, you can use auto-scaling in the cloud to improve the availability, reliability, and scalability of your applications.

        Set Up Appropriate Triggers

        Setting up appropriate triggers is an essential step when using auto-scaling in the cloud. Here are some best practices for setting up triggers:

        1. Identify the right metrics. Start by identifying the metrics that are most relevant to your application. Common metrics include CPU utilization, network traffic, and memory usage. You can also use custom metrics based on your application’s specific requirements.
        2. Determine threshold values. Once you have identified the relevant metrics, determine the threshold values that will trigger scaling. For example, you might set a threshold of 70% CPU utilization to trigger scaling up, and 30% CPU utilization to trigger scaling down.
        3. Set up alarms. Set up CloudWatch alarms to monitor the relevant metrics and trigger scaling based on the threshold values you have set. For example, you might set up an alarm to trigger scaling up when CPU utilization exceeds 70% for a sustained period of time.
        4. Use hysteresis. To avoid triggering scaling up and down repeatedly in response to minor fluctuations in metrics, use hysteresis. Hysteresis introduces a delay before triggering scaling in either direction, helping to ensure that scaling events are only triggered when they are really needed.
        5. Consider cooldown periods. Cooldown periods introduce a delay between scaling events, helping to prevent over-provisioning or under-provisioning of resources. When a scaling event is triggered, a cooldown period is started during which no further scaling events will be triggered. This helps to ensure that the system stabilizes before further scaling events are triggered.

        By following these best practices, you can set up appropriate triggers for scaling in the cloud, ensuring that your application can scale automatically in response to changes in demand.
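
        Here is a minimal Python sketch of steps 2 and 3 on AWS: a step scaling policy plus a CloudWatch alarm that fires when average CPU stays above 70% for two five-minute periods. The Auto Scaling group and policy names are hypothetical, and the thresholds should be tuned to your own workload.

            import boto3

            autoscaling = boto3.client("autoscaling")
            cloudwatch = boto3.client("cloudwatch")

            ASG_NAME = "web-app-asg"  # hypothetical Auto Scaling group

            # Scale out by one instance whenever the alarm below fires.
            policy = autoscaling.put_scaling_policy(
                AutoScalingGroupName=ASG_NAME,
                PolicyName="scale-out-on-high-cpu",
                PolicyType="StepScaling",
                AdjustmentType="ChangeInCapacity",
                StepAdjustments=[{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1}],
            )

            # Alarm when average CPU exceeds 70% for two consecutive 5-minute periods.
            cloudwatch.put_metric_alarm(
                AlarmName="web-app-high-cpu",
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                Statistic="Average",
                Period=300,
                EvaluationPeriods=2,
                Threshold=70.0,
                ComparisonOperator="GreaterThanThreshold",
                AlarmActions=[policy["PolicyARN"]],
            )

        A matching scale-in policy with an alarm at a lower threshold (for example, 30%) would complete the pair.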

        Use Multiple Availability Zones

        Using multiple availability zones is a best practice in the cloud to improve the availability and reliability of your application. Here are some best practices for using multiple availability zones:

        1. Choose an appropriate region. Start by choosing a region that is geographically close to your users to minimize latency. Consider the regulatory requirements, cost, and availability of resources when choosing a region.
        2. Deploy across multiple availability zones. Deploy your application across multiple availability zones within the same region to ensure high availability and fault tolerance. Availability zones are isolated data centers within a region that are designed to be independent of each other.
        3. Use load balancers. Use load balancers to distribute traffic across multiple instances in different availability zones. This helps to ensure that if one availability zone goes down, traffic can be automatically redirected to other availability zones.
        4. Use cross-zone load balancing. Enable cross-zone load balancing to distribute traffic evenly across all available instances, regardless of which availability zone they are in. This helps to ensure that instances in all availability zones are being utilized evenly.
        5. Monitor availability zones. Regularly monitor the availability and performance of instances in different availability zones. You can use CloudWatch to monitor metrics such as latency, network traffic, and error rates, and to set up alarms to alert you to any issues.
        6. Use automatic failover. Configure automatic failover for your database and other critical services to ensure that if one availability zone goes down, traffic can be automatically redirected to a standby instance in another availability zone.

        By following these best practices, you can use multiple availability zones in the cloud to improve the availability and reliability of your application, and to minimize the impact of any potential disruptions.
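
        For example, on AWS an Auto Scaling group spread across several subnets, one per availability zone, gives you this fault tolerance automatically. The Python sketch below is illustrative; the launch template, subnet IDs, and target group ARN are placeholders.

            import boto3

            autoscaling = boto3.client("autoscaling")

            autoscaling.create_auto_scaling_group(
                AutoScalingGroupName="web-app-asg",
                LaunchTemplate={"LaunchTemplateName": "web-app-template", "Version": "$Latest"},
                MinSize=3,
                MaxSize=9,
                # One subnet per availability zone spreads instances across AZs.
                VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
                # Register instances with a load balancer target group (placeholder ARN).
                TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-app/abc123"],
                HealthCheckType="ELB",
                HealthCheckGracePeriod=120,
            )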

        Start with Conservative Settings

        Over-provisioning or under-provisioning resources can lead to wasted resources or poor application performance, respectively. Here are some best practices to avoid these issues:

        1. Monitor resource usage. Regularly monitor the resource usage of your application, including CPU, memory, storage, and network usage. Use monitoring tools such as CloudWatch to collect and analyze metrics, and set up alarms to alert you to any resource constraints.
        2. Set appropriate thresholds. Set appropriate thresholds for scaling based on your application’s resource usage. Start with conservative thresholds, and adjust them as needed based on your monitoring data.
        3. Use automation. Use automation tools such as AWS Auto Scaling to automatically adjust resource provisioning based on demand. This can help to ensure that resources are provisioned efficiently and that you are not over-provisioning or under-provisioning.
        4. Use load testing. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your application’s performance. This can help you to identify any performance issues before they occur in production.
        5. Optimize application architecture. Optimize your application architecture to reduce resource usage, such as by using caching, minimizing database queries, and using efficient algorithms.
        6. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and fault tolerance, and to minimize the impact of any potential disruptions.

        By following these best practices, you can ensure that you are not over-provisioning or under-provisioning resources in your cloud infrastructure, and that your application is running efficiently and reliably.

        Monitor and Auto-Scale Your Cloud

        The best way to monitor and auto-scale your cloud applications is by using a combination of monitoring tools, scaling policies, and automation tools. Here are some best practices for monitoring and auto-scaling your cloud apps:

        1. Monitor application performance. Use monitoring tools such as AWS CloudWatch to monitor the performance of your application. Collect metrics such as CPU utilization, memory usage, and network traffic, and set up alarms to notify you of any performance issues.
        2. Define scaling policies. Define scaling policies for each resource type based on the performance metrics you are monitoring. This can include policies for scaling based on CPU utilization, network traffic, or other metrics.
        3. Set scaling thresholds. Set conservative thresholds for scaling based on your initial analysis of resource usage, and adjust them as needed based on your monitoring data.
        4. Use automation tools. Use automation tools to automatically adjust resource provisioning based on demand. This can help to ensure that resources are provisioned efficiently and that you are not over-provisioning or under-provisioning.
        5. Use load testing. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your application’s performance. This can help you to identify any performance issues before they occur in production.
        6. Use multiple availability zones. Deploy your application across multiple availability zones to ensure high availability and fault tolerance, and to minimize the impact of any potential disruptions.
        7. Monitor and optimize. Regularly monitor the performance of your application and optimize your scaling policies based on the data you collect. This will help you to ensure that your application is running efficiently and reliably.

        By following these best practices, you can ensure that your cloud applications are monitored and auto-scaled effectively, helping you to optimize performance and minimize the risk of downtime.

        Use Automated Configuration Management

        Automated configuration management in the cloud can help you manage and provision your infrastructure efficiently and consistently. Here are some best practices for using automated configuration management in the cloud:

        1. Use infrastructure as code. Use infrastructure as code tools such as AWS CloudFormation or Terraform to define your infrastructure as code. This can help to ensure that your infrastructure is consistent across different environments and can be easily reproduced.
        2. Use configuration management tools. Use configuration management tools such as Chef, Puppet, or Ansible to automate the configuration of your servers and applications. These tools can help you ensure that your infrastructure is configured consistently and can be easily scaled.
        3. Use version control. Use version control tools such as Git to manage your infrastructure code and configuration files. This can help you to track changes to your infrastructure and roll back changes if necessary.
        4. Use testing and validation. Use testing and validation tools to ensure that your infrastructure code and configuration files are valid and that your infrastructure is properly configured. This can help you to avoid configuration errors and reduce downtime.
        5. Use monitoring and logging. Use monitoring and logging tools to track changes to your infrastructure and to troubleshoot any issues that arise. This can help you to identify problems quickly and resolve them before they impact your users.
        6. Use automation. Use automation tools such as AWS OpsWorks or AWS CodeDeploy to automate the deployment and configuration of your infrastructure. This can help you to deploy changes quickly and efficiently.

        By following these best practices, you can use automated configuration management in the cloud to manage your infrastructure efficiently and consistently, reducing the risk of configuration errors and downtime, and enabling you to scale your infrastructure easily as your needs change.
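
        As a small example of combining infrastructure as code, version control, and automation, the Python sketch below deploys a CloudFormation template that is kept in a Git repository. The stack name, template path, and parameters are hypothetical.

            import boto3

            cloudformation = boto3.client("cloudformation")

            # The template file lives in version control alongside application code.
            with open("infrastructure/web-app.yaml") as f:
                template_body = f.read()

            cloudformation.create_stack(
                StackName="web-app-prod",
                TemplateBody=template_body,
                Parameters=[{"ParameterKey": "Environment", "ParameterValue": "prod"}],
                Capabilities=["CAPABILITY_NAMED_IAM"],
            )

            # Block a deployment pipeline until the stack has finished creating.
            cloudformation.get_waiter("stack_create_complete").wait(StackName="web-app-prod")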

        Testing Auto-Scaling Policies

        Testing your auto-scaling policies is an important step in ensuring that your cloud infrastructure can handle changes in demand effectively. Here are some best practices for testing your auto-scaling policies:

        1. Use realistic test scenarios. Use realistic test scenarios to simulate the traffic patterns and demand that your application may experience in production. This can help you to identify any potential issues and ensure that your auto-scaling policies can handle changes in demand effectively.
        2. Test different scenarios. Test your auto-scaling policies under different scenarios, such as high traffic loads or sudden spikes in demand. This can help you to ensure that your policies are effective in a variety of situations.
        3. Monitor performance. Monitor the performance of your application during testing to identify any performance issues or bottlenecks. This can help you to optimize your infrastructure and ensure that your application is running efficiently.
        4. Validate results. Validate the results of your testing to ensure that your auto-scaling policies are working as expected. This can help you to identify any issues or areas for improvement.
        5. Use automation tools. Use automation tools such as AWS CloudFormation or Terraform to automate the testing process and ensure that your tests are consistent and reproducible.
        6. Use load testing tools. Use load testing tools such as JMeter or Gatling to simulate realistic traffic loads and test your auto-scaling policies under different scenarios.

        By following these best practices, you can ensure that your auto-scaling policies are effective and can handle changes in demand effectively, reducing the risk of downtime and ensuring that your application is running efficiently.
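
        Dedicated tools such as JMeter or Gatling give the most realistic scenarios, but even a rough sketch like the one below, written with only Python's standard library, can generate enough concurrent traffic to confirm that a scale-out alarm fires. The endpoint URL and request counts are illustrative.

            import time
            from concurrent.futures import ThreadPoolExecutor
            from urllib.request import urlopen

            URL = "https://test.example.com/health"  # hypothetical test endpoint

            def hit_endpoint(_):
                start = time.time()
                with urlopen(URL, timeout=10) as resp:
                    resp.read()
                return time.time() - start

            # Fire 200 requests with up to 50 in flight at once and report basic latency stats.
            with ThreadPoolExecutor(max_workers=50) as pool:
                latencies = list(pool.map(hit_endpoint, range(200)))

            print(f"avg: {sum(latencies) / len(latencies):.3f}s  max: {max(latencies):.3f}s")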

        Recap

        Auto-scaling can be leveraged to improve the availability, reliability, and scalability of your applications. Deploying across multiple availability zones improves availability and reliability and minimizes the impact of potential disruptions. Make changes conservatively and increase resources incrementally so you do not over-provision. Managing infrastructure efficiently and consistently reduces the risk of configuration errors and downtime, and conservative scaling lets you grow your infrastructure easily as your needs change. Together, these practices help you handle changes in demand, reduce the risk of downtime, and keep your application running efficiently.

        The post The Modern Data Ecosystem: Use Auto-Scaling appeared first on Unravel.

        All Data Ecosystems Are Real Time, It Is Just a Matter of Time https://www.unraveldata.com/resources/all-data-ecosystems-are-real-time-it-is-just-a-matter-of-time/ Thu, 01 Jun 2023 01:32:03 +0000


        Overview: Six-Part Blog

        In this six-part blog I will demonstrate why what I call Services Oriented Data Architecture (SΘDΔ®) is the right data architecture for now and the foreseeable future. I will drill into specific examples of how to build the most optimal cloud data architecture regardless of your cloud provider. This will lay the foundation for SΘDΔ®. We will also define the Data Asset Management System (DΔḾṢ)®. DΔḾṢ is the modern data management system approach for advanced data ecosystems. The modern data ecosystem must focus on interchangeable, interoperable services and let the system focus on optimally storing, retrieving, and processing data. DΔḾṢ takes care of this for the modern data ecosystem.

        We will drill into the exercises necessary to optimize the full stack of your cloud data ecosystem. These exercises will work regardless of the cloud provider. We will look at the best ways to store data regardless of type. Then we will drill into how to optimize your compute in the cloud. The compute is generally the most expensive of all cloud assets. We will also drill into how to optimize memory use. Finally, we will wrap up with examples of SΘDΔ®.

        Modern data architecture is a framework for designing, building, and managing data systems that can effectively support modern data-driven business needs. It is focused on achieving scalability, flexibility, reliability, and cost-effectiveness, while also addressing modern data requirements such as real-time data processing, machine learning, and analytics.

        Some of the key components of modern data architecture include:

        1. Data ingestion and integration. This involves collecting and integrating data from various sources, including structured and unstructured data, and ingesting it into the data system.
        2. Data storage and processing. This involves storing and processing data in a scalable, distributed, and fault-tolerant manner using technologies such as cloud storage, data lakes, and data warehouses.
        3. Data management and governance. This involves ensuring that data is properly managed, secured, and governed, including policies around data access, privacy, and compliance.
        4. Data analysis and visualization. This involves leveraging advanced analytics tools and techniques to extract insights from data and present them in a way that is understandable and actionable.
        5. Machine learning and artificial intelligence. This involves leveraging machine learning and AI technologies to build predictive models, automate decision-making, and enable advanced analytics.
        6. Data streaming and real-time processing. This involves processing and analyzing data in real time, allowing organizations to respond quickly to changing business needs.

        Overall, modern data architecture is designed to help organizations leverage data as a strategic asset and gain a competitive advantage by making better data-driven decisions.

        Cloud Optimization Best Practices

        Running efficiently on the large cloud providers requires careful consideration of various factors, including your application’s requirements, the size and type of instances needed, and the services you choose to leverage.

        Here are some general tips to help you run efficiently on the large cloud providers’ cloud:

        1. Choose the right instance types. The large cloud providers offer a wide range of instance types optimized for different workloads. Choose the instance type that best fits your application’s requirements to avoid over-provisioning or under-provisioning.
        2. Use auto-scaling. Auto-scaling allows you to scale your infrastructure up or down based on demand. This ensures that you have enough capacity to handle traffic spikes while minimizing costs during periods of low usage.
        3. Optimize your storage. The large cloud providers offer various storage options, each with its own performance characteristics and costs. Select the storage type that best fits your application’s needs.
        4. Use managed services. The large cloud providers offer various managed services. These services allow you to focus on your application’s business logic while the provider takes care of the underlying infrastructure. SaaS vendors manage the software, and PaaS vendors manage the platform.
        5. Monitor your resources. The major cloud providers provide various monitoring and logging tools that allow you to track your application’s performance and troubleshoot issues quickly. Use these tools to identify bottlenecks and optimize your infrastructure.
        6. Use a content delivery network (CDN). If your application serves static content, consider using a CDN to cache content closer to your users, reducing latency and improving performance.

        By following these best practices, you can ensure that your application runs efficiently on the large cloud providers, providing a great user experience while minimizing costs.

        The Optimized Way to Store Data in the Cloud

        The best structure for storing data for reporting depends on various factors, including the type and volume of data, the reporting requirements, and the performance considerations. Here are some general guidelines for choosing a suitable structure for storing data for reporting:

        1. Use a dimensional modeling approach. Dimensional modeling is a database design technique that organizes data into dimensions and facts. It is optimized for reporting and analysis and can help simplify complex queries and improve performance. The star schema and snowflake schema are popular dimensional modeling approaches.
        2. Choose a suitable database type. Depending on the size and type of data, you can choose a suitable database type for storing data for reporting. Relational databases are the most common type of database used for reporting, but NoSQL databases can also be used for certain reporting scenarios.
        3. Normalize data appropriately. Normalization is the process of organizing data in a database to minimize data redundancy and improve data integrity. However, over-normalization can make querying complex and slow down reporting. Therefore, it is important to normalize data appropriately based on the reporting requirements.
        4. Use indexes to improve query performance. Indexes can help improve query performance by allowing the database to quickly find the data required for a report. Choose appropriate indexes based on the reporting requirements and the size of the data.
        5. Consider partitioning. Partitioning involves splitting large tables into smaller, more manageable pieces. It can improve query performance by allowing the database to quickly access the required data.
        6. Consider data compression. Data compression can help reduce the storage requirements of data and improve query performance by reducing the amount of data that needs to be read from disk.

        Overall, the best structure for storing data for reporting depends on various factors, and it is important to carefully consider the reporting requirements and performance considerations when choosing a suitable structure.
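
        As an example of points 5 and 6, the PySpark sketch below writes a fact table partitioned by date and compressed with Snappy, so reports that filter on the partition column read only the data they need. The paths and column names are hypothetical.

            from pyspark.sql import SparkSession

            spark = SparkSession.builder.appName("reporting-load").getOrCreate()

            # Hypothetical fact table of sales transactions landed in a raw zone.
            sales = spark.read.json("s3a://raw-zone/sales/")

            # Partition by sale_date and compress with Snappy; reporting queries that
            # filter on sale_date will prune partitions instead of scanning everything.
            (
                sales.write
                .mode("overwrite")
                .partitionBy("sale_date")
                .option("compression", "snappy")
                .parquet("s3a://curated-zone/fact_sales/")
            )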

        Optimal Processing of Data in the Cloud

        The best way to process data in the cloud depends on various factors, including the type and volume of data, the processing requirements, and the performance considerations. Here are some general guidelines for processing data in the cloud:

        1. Use cloud-native data processing services. Cloud providers offer a wide range of data processing services, such as AWS Lambda, GCP Cloud Functions, and Azure Functions, which allow you to process data without managing the underlying infrastructure. These services are highly scalable and can be cost-effective for small- to medium-sized workloads.
        2. Use serverless computing. Serverless computing is a cloud computing model in which the cloud provider manages the infrastructure and automatically scales the resources based on the workload. Serverless computing can be a cost-effective and scalable solution for processing data, especially for sporadic or bursty workloads.
        3. Use containerization. Containerization allows you to package your data processing code and dependencies into a container image and deploy it to a container orchestration platform, such as Kubernetes or Docker Swarm. This approach can help you achieve faster deployment, better resource utilization, and improved scalability.
        4. Use distributed computing frameworks. Distributed computing frameworks, such as Apache Hadoop, Spark, and Flink, allow you to process large volumes of data in a distributed manner across multiple nodes. These frameworks can be used for batch processing, real-time processing, and machine learning workloads.
        5. Use data streaming platforms. Data streaming platforms, such as Apache Kafka and GCP Pub/Sub, allow you to process data in real time and respond quickly to changing business needs. These platforms can be used for real-time processing, data ingestion, and event-driven architectures.
        6. Use machine learning and AI services. Cloud providers offer a wide range of machine learning and AI services, such as AWS SageMaker, GCP AI Platform, and Azure Machine Learning, which allow you to build, train, and deploy machine learning models in the cloud. These services can be used for predictive analytics, natural language processing, computer vision, and other machine learning workloads.

        Overall, the best way to process data in the cloud depends on various factors, and it is important to carefully consider the processing requirements and performance considerations when choosing a suitable approach.
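
        To make the first two options concrete, here is a minimal AWS Lambda handler sketch in Python that processes objects as they land in S3. The bucket notification wiring and object format are assumptions; the transformation itself is only a placeholder.

            import json

            import boto3

            s3 = boto3.client("s3")

            def lambda_handler(event, context):
                # Invoked by an S3 "object created" notification.
                for record in event["Records"]:
                    bucket = record["s3"]["bucket"]["name"]
                    key = record["s3"]["object"]["key"]

                    obj = s3.get_object(Bucket=bucket, Key=key)
                    rows = json.loads(obj["Body"].read())

                    # Placeholder transformation: count the records and log the result.
                    print(f"Processed {len(rows)} records from s3://{bucket}/{key}")

                return {"status": "ok"}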

        Optimize Memory

        The best memory size for processing 1 terabyte of data depends on the specific processing requirements and the type of processing being performed. In general, the memory size required for processing 1 terabyte of data can vary widely depending on the data format, processing algorithms, and performance requirements. For example, if you are processing structured data in a relational database, the memory size required will depend on the specific SQL query being executed and the size of the result set. In this case, the memory size required may range from a few gigabytes to several hundred gigabytes or more, depending on the complexity of the query and the number of concurrent queries being executed.

        On the other hand, if you are processing unstructured data, such as images or videos, the memory size required will depend on the specific processing algorithm being used and the size of the data being processed. In this case, the memory size required may range from a few gigabytes to several terabytes or more, depending on the complexity of the algorithm and the size of the input data.

        Therefore, it is not possible to give a specific memory size recommendation for processing 1 terabyte of data without knowing more about the specific processing requirements and the type of data being processed. It is important to carefully consider the memory requirements when designing the processing system and to allocate sufficient memory resources to ensure optimal performance.

        Services Oriented Data Architecture Is the Future for Data Ecosystems

        A Services Oriented Data Architecture (SΘDΔ®) is an architectural approach used in cloud computing that focuses on creating and deploying software systems as a set of interconnected services. Each service performs a specific business function, and communication between services occurs over a network, typically using web-based protocols such as RESTful APIs.

        In the cloud, SΘDΔ can be implemented using a variety of cloud computing technologies, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). In an SΘDΔ-based cloud architecture, services are hosted on cloud infrastructure, such as virtual machines or containers, and can be dynamically scaled up or down based on demand.

        One of the key benefits of SΘDΔ in the cloud is its ability to enable greater agility and flexibility in software development and deployment. By breaking down a complex software system into smaller, more manageable services, SΘDΔ makes it easier to build, test, and deploy new features and updates. It also allows for more granular control over resource allocation, making it easier to optimize performance and cost.

        Overall, service-based architecture is a powerful tool for building scalable, flexible, and resilient software systems in the cloud, especially data ecosystems.

        Recap

        In this blog we began a conversation about the modern data ecosystem. By following best practices, we can ensure that our cloud applications run efficiently on the large cloud providers, providing a great user experience while minimizing costs. We covered the following:

        1. The modern data architecture is designed to help organizations leverage data as a strategic asset and gain a competitive advantage by making better data-driven decisions.
        2. The best way to process data in the cloud depends on various factors, and it is important to carefully consider the processing requirements and performance considerations when choosing a suitable approach.
        3. Overall, service-based architecture is a powerful tool for building scalable, flexible, and resilient software systems in the cloud, especially data ecosystems.

        The post All Data Ecosystems Are Real Time, It Is Just a Matter of Time appeared first on Unravel.

        The Modern Data Ecosystem: Leverage Content Delivery Network (CDN) https://www.unraveldata.com/resources/the-modern-data-ecosystem-leverage-content-delivery-network-cdn/ Thu, 01 Jun 2023 01:31:46 +0000


        Leverage Content Delivery Network

        Here are some best practices when using a Content Delivery Network (CDN):

        1. Choose the right CDN provider. Choose a CDN provider that is reliable, scalable, and has a global network of servers. Look for providers that offer features such as caching, load balancing, and DDoS protection.
        2. Configure your CDN properly. Configure your CDN properly to ensure that it is delivering content efficiently and securely. This may include setting up caching rules, configuring SSL/TLS encryption, and configuring firewall rules.
        3. Monitor your CDN. Monitor your CDN to ensure that it is performing optimally and delivering content efficiently. This may include monitoring CDN usage, network latency, and caching effectiveness.
        4. Optimize your CDN. Optimize your CDN to ensure that it is delivering content as efficiently as possible. This may include using compression, optimizing images, and minifying JavaScript and CSS files.
        5. Use multiple CDNs. Consider using multiple CDNs to ensure that your content is always available and is delivered quickly. This may include using multiple providers or using multiple CDNs from the same provider.
        6. Test your CDN. Test your CDN to ensure that it is delivering content as expected. This may include conducting load testing to ensure that your CDN can handle expected levels of traffic and testing performance from different geographical locations.
        7. Use CDN analytics. Use CDN analytics to track CDN usage and monitor performance. This will help you identify any issues and optimize your CDN for better performance.
        8. Implement security measures. Implement security measures to protect your content and your users. This may include configuring SSL/TLS encryption, setting up firewalls, and using DDoS protection.

        By following these best practices, you can ensure that your CDN is delivering content efficiently and securely, and is providing a positive user experience.

        Choosing the Right CDN Provider

        Here are some best practices when choosing the right Content Delivery Network (CDN) provider in the cloud:

        1. Understand your requirements. Before selecting a CDN provider, determine your requirements for content delivery. This includes identifying the types of content you want to deliver, the expected traffic volume, and the geographical locations of your users.
        2. Research multiple providers. Research multiple CDN providers to determine which one best meets your requirements. Look for providers with a global network of servers, high reliability, and good performance.
        3. Evaluate performance. Evaluate the performance of each provider by conducting tests from different geographical locations. This will help you determine which provider delivers content quickly and efficiently to your users.
        4. Consider cost. Consider the cost of each provider, including the cost of data transfer, storage, and other associated fees. Look for providers that offer flexible pricing models and transparent pricing structures.
        5. Evaluate security features. Evaluate the security features of each provider, including DDoS protection, SSL/TLS encryption, and firewalls. Look for providers that offer comprehensive security features to protect your content and your users.
        6. Check for integration. Check if the provider integrates well with your existing infrastructure and tools, such as your content management system, analytics tools, and monitoring tools.
        7. Look for analytics. Look for providers that offer analytics and reporting tools to help you track content delivery performance and optimize your content delivery.
        8. Check for support. Check the level of support offered by each provider, including support for troubleshooting, upgrades, and maintenance. Look for providers with a responsive support team that can help you resolve issues quickly.

        By following these best practices, you can select the right CDN provider that meets your requirements and delivers content efficiently and securely to your users.

        Configure Your CDN Properly

        Here are some best practices when configuring your Content Delivery Network (CDN) properly in the cloud:

        1. Configure caching. Caching is a key feature of CDNs that enables content to be delivered quickly and efficiently. Configure caching rules to ensure that frequently accessed content is cached and delivered quickly.
        2. Configure content compression. Configure content compression to reduce the size of your files and improve performance. Gzip compression is a popular option that can be configured at the CDN level.
        3. Configure SSL/TLS encryption. Configure SSL/TLS encryption to ensure that content is delivered securely. Look for CDNs that offer free SSL/TLS certificates or have the option to use your own certificate.
        4. Configure firewall rules. Configure firewall rules to protect your content and your users. This includes setting up rules to block traffic from malicious sources and to restrict access to your content.
        5. Use multiple CDNs. Consider using multiple CDNs to improve performance and ensure availability. Use a multi-CDN strategy to distribute traffic across multiple CDNs, which can reduce latency and increase reliability.
        6. Configure DNS settings. Configure DNS settings to ensure that traffic is directed to your CDN. This includes configuring CNAME records or using a DNS provider that integrates with your CDN.
        7. Test your CDN configuration. Test your CDN configuration to ensure that it is working properly. This includes testing performance, testing from different geographical locations, and testing content delivery from different devices.
        8. Monitor your CDN configuration. Monitor your CDN configuration to ensure that it is delivering content efficiently and securely. This includes monitoring CDN usage, network latency, caching effectiveness, and security events.

        By following these best practices, you can configure your CDN properly to ensure that content is delivered quickly, efficiently, and securely to your users.

        Monitor Your CDN

        Here are some best practices when monitoring your Content Delivery Network (CDN) in the cloud:

        1. Monitor CDN usage. Monitor CDN usage to track how much content is being delivered and where it is being delivered. This can help you identify potential issues and optimize content delivery.
        2. Monitor network latency. Monitor network latency to ensure that content is being delivered quickly and efficiently. Use tools like Pingdom or KeyCDN’s real-time monitoring to identify network latency issues.
        3. Monitor caching effectiveness. Monitor caching effectiveness to ensure that frequently accessed content is being cached and delivered quickly. Use CDN analytics and monitoring tools to track caching effectiveness.
        4. Monitor security events. Monitor security events to ensure that your content and your users are protected from security threats. This includes monitoring for DDoS attacks, intrusion attempts, and other security events.
        5. Monitor CDN performance. Monitor CDN performance to ensure that content is being delivered efficiently and that the CDN is performing optimally. This includes monitoring server response times, cache hit rates, and other performance metrics.
        6. Monitor user experience. Monitor user experience to ensure that users are able to access content quickly and efficiently. Use tools like Google Analytics or Pingdom to monitor user experience and identify issues.
        7. Monitor CDN costs. Monitor CDN costs to ensure that you are staying within your budget. Use CDN cost calculators to estimate costs and monitor usage to identify potential cost savings.
        8. Set up alerts. Set up alerts to notify you when issues arise, such as network latency spikes, security events, or server downtime. Use CDN monitoring tools to set up alerts and notifications.

        By following these best practices, you can monitor your CDN effectively to ensure that content is being delivered quickly, efficiently, and securely to your users, while staying within your budget.

        Optimize Your Content Delivery Network

        Optimizing your Content Delivery Network (CDN) in the cloud is crucial to ensure fast and reliable content delivery to your users. Here are some best practices to follow:

        1. Choose the right CDN provider. Research different CDN providers and choose the one that meets your needs in terms of cost, performance, and geographical coverage.
        2. Use a multi-CDN approach. Consider using multiple CDN providers to improve your content delivery performance and reliability.
        3. Optimize your CDN configuration. Configure your CDN to serve static assets, such as images, videos, and files, directly from the CDN cache, while serving dynamic content, such as HTML pages, from your origin server.
        4. Use caching effectively. Set appropriate cache control headers for your content to ensure that it is cached by the CDN and served quickly to your users.
        5. Monitor your CDN performance. Monitor your CDN performance regularly and identify any issues or bottlenecks that may be affecting your content delivery performance.
        6. Use compression. Use compression techniques, such as gzip compression, to reduce the size of your content and improve its delivery speed.
        7. Optimize DNS resolution. Use a global DNS service to optimize DNS resolution and reduce the time it takes for your users to access your content.
        8. Implement HTTPS. Implement HTTPS to ensure secure content delivery and improve your search engine ranking.
        9. Consider using edge computing. Consider using edge computing to offload some of the processing and caching tasks to the CDN edge servers, which can help reduce the load on your origin servers and improve content delivery performance.

        By following these best practices, you can optimize your CDN in the cloud and ensure fast, reliable, and secure content delivery to your users.
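
        Points 4 and 6 often come down to how assets are published to the origin. The Python sketch below, assuming an S3 bucket used as the origin behind a CDN such as CloudFront, pre-compresses a static asset and sets a cache lifetime so edge servers can serve it without returning to the origin. Bucket and file names are illustrative.

            import gzip

            import boto3

            s3 = boto3.client("s3")

            # Pre-compress a static asset before publishing it to the origin bucket.
            with open("dist/app.js", "rb") as f:
                compressed = gzip.compress(f.read())

            s3.put_object(
                Bucket="my-static-origin",  # hypothetical origin bucket
                Key="assets/app.js",
                Body=compressed,
                ContentType="application/javascript",
                ContentEncoding="gzip",
                CacheControl="public, max-age=86400",  # let the CDN cache the file for 24 hours
            )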

        Use Multiple Content Delivery Networks

        Using multiple Content Delivery Networks (CDNs) in the cloud can improve the performance and reliability of your content delivery, but it also requires careful management to ensure optimal results. Here are some best practices to follow when using multiple CDNs:

        1. Use a multi-CDN management platform. Consider using a multi-CDN management platform to manage your CDNs and monitor their performance. These platforms allow you to configure your CDNs and dynamically route traffic to the optimal CDN based on real-time performance metrics.
        2. Define a clear CDN selection strategy. Develop a clear strategy for selecting the CDN to use for each request. Consider factors such as geographical proximity, latency, and availability when selecting a CDN.
        3. Test your CDNs. Regularly test the performance of your CDNs using real-world scenarios. This will help you identify any issues and ensure that your CDNs are performing optimally.
        4. Implement consistent caching policies. Ensure that your caching policies are consistent across all your CDNs. This will help avoid cache misses and improve content delivery performance.
        5. Implement a failover strategy. Define a failover strategy to ensure that your content is always available even if one of your CDNs experiences an outage. This may involve dynamically routing traffic to a backup CDN or using your origin server as a fallback.
        6. Monitor and optimize costs. Monitor your CDN costs and optimize your usage to ensure that you are getting the best value for your investment.
        7. Ensure security. Implement appropriate security measures, such as SSL encryption and DDoS protection, to ensure that your content is delivered securely and your CDNs are protected from attacks.

        By following these best practices, you can successfully use multiple CDNs in the cloud to improve your content delivery performance and reliability.

        Test Your Content Delivery Network

        Testing your Content Delivery Network (CDN) in the cloud is crucial to ensure that it is performing optimally and delivering content quickly and reliably to your users. Here are some best practices to follow when testing your CDN:

        1. Define clear testing objectives. Define clear objectives for your CDN testing, such as identifying bottlenecks or measuring performance metrics. This will help you design your test scenarios and evaluate the results effectively.
        2. Use real-world testing scenarios. Use real-world scenarios to test your CDN, such as simulating traffic from different geographic locations, devices, and network conditions. This will help you identify any issues that may affect your users’ experience.
        3. Measure key performance metrics. Measure key performance metrics such as response time, latency, throughput, and cache hit rate. This will help you identify any areas of improvement and optimize your CDN performance.
        4. Test caching behavior. Test the caching behavior of your CDN by measuring cache hit and miss rates for different types of content. This will help you optimize your caching policies and reduce your origin server load.
        5. Test CDN failover and disaster recovery. Test your CDN failover and disaster recovery mechanisms to ensure that your content is always available even if one or more CDNs experience an outage.
        6. Use automated testing tools. Use automated testing tools to simulate traffic and measure performance metrics. This will help you perform tests more efficiently and accurately.
        7. Test regularly. Regularly test your CDN to ensure that it is performing optimally and identify any issues as soon as possible.

        By following these best practices, you can effectively test your CDN in the cloud and ensure fast and reliable content delivery to your users.

        Use CDN Analytics

        Analytics play a critical role in optimizing your Content Delivery Network (CDN) in the cloud, as they provide insights into your CDN performance and user behavior. Here are some best practices to follow when using CDN analytics in the cloud:

        1. Define clear goals. Define clear goals for your CDN analytics, such as identifying areas of improvement or measuring user engagement. This will help you select the appropriate analytics tools and collect the right data.
        2. Use a multi-CDN analytics platform. If you are using multiple CDNs, consider using a multi-CDN analytics platform to consolidate data from multiple CDNs and provide a unified view of your content delivery performance.
        3. Monitor key performance metrics. Monitor key performance metrics such as page load time, bounce rate, and conversion rate. This will help you identify any performance issues and optimize your content delivery.
        4. Use real-time analytics. Use real-time analytics to monitor your CDN performance and user behavior in real time. This will help you identify any issues as soon as they occur and take immediate action.
        5. Use A/B testing. Use A/B testing to test different CDN configurations or content variations and measure the impact on user engagement and performance.
        6. Use data visualization tools. Use data visualization tools to help you visualize and analyze your CDN performance data effectively. This will help you identify trends and patterns that may be difficult to detect using raw data.
        7. Optimize CDN configuration based on analytics insights. Use insights from your CDN analytics to optimize your CDN configuration, such as adjusting caching policies, optimizing content delivery routes, or reducing page load time.

        By following these best practices, you can use CDN analytics in the cloud to optimize your content delivery performance, improve user engagement, and achieve your business goals.

        Implement Security Measures

        Implementing security measures in the cloud is essential to protect your applications, data, and infrastructure from cyber threats. Here are some best practices to follow when implementing security measures in the cloud:

        1. Use strong authentication and access controls. Implement strong authentication mechanisms such as multi-factor authentication and use access controls to restrict access to your cloud resources based on the principle of least privilege.
        2. Implement network security controls. Use network security controls such as firewalls, intrusion detection and prevention systems (IDPS), and virtual private networks (VPNs) to protect your cloud infrastructure from network-based attacks.
        3. Implement encryption. Use encryption to protect sensitive data, both in transit and at rest. Implement encryption for data stored in the cloud, and use secure protocols such as HTTPS and TLS for data in transit.
        4. Regularly apply security patches and updates. Regularly apply security patches and updates to your cloud infrastructure and applications to ensure that you are protected against known vulnerabilities.
        5. Implement security monitoring and logging. Implement security monitoring and logging to detect and respond to security events. Use tools that provide visibility into your cloud environment and alert you to potential security threats.
        6. Use cloud-native security tools. Use cloud-native security tools and services such as security groups, network ACLs, and security information and event management (SIEM) to secure your cloud environment.
        7. Develop an incident response plan. Develop an incident response plan that outlines how you will respond to security incidents, including containment, investigation, and remediation.

        By following these best practices, you can effectively implement security measures in the cloud and protect your applications, data, and infrastructure from cyber threats.

        The post The Modern Data Ecosystem: Leverage Content Delivery Network (CDN) appeared first on Unravel.

        DataOps Resiliency: Tracking Down Toxic Workloads https://www.unraveldata.com/resources/dataops-resiliency-tracking-down-toxic-workloads/ Thu, 11 May 2023 21:12:07 +0000


        By Jason Bloomberg, Managing Partner, Intellyx
        Part 4 of the Demystifying Data Observability Series for Unravel Data

        In the first three articles in this four-post series, my colleague Jason English and I explored DataOps observability, the connection between DevOps and DataOps, and data-centric FinOps best practices.

        In this concluding article in the series, I’ll explore DataOps resiliency – not simply how to prevent data-related problems, but also how to recover from them quickly, ideally without impacting the business and its customers.

        Observability is essential for any kind of IT resiliency – you can’t fix what you can’t see – and DataOps is no exception. Failures can occur anywhere in the stack, from the applications on down to the hardware. Understanding the root causes of such failures is the first step to fixing, or ideally preventing, them.

        The same sorts of resiliency problems that impact the IT environment at large can certainly impact the data estate. Even so, traditional observability and incident management tools don’t address specific problems unique to the world of data processing.

        In particular, DataOps resiliency must address the problem of toxic workloads.

        Understanding Toxic Workloads

        Toxic data workloads are as old as relational database management systems (RDBMSs), if not older. Anyone who works with SQL on large databases knows there are some queries that will cause the RDBMS to slow dramatically or completely grind to a halt.

        The simplest example: SELECT * FROM TRANSACTIONS where the TRANSACTIONS table has millions of rows. Oops! Your resultset also has millions of rows!

JOINs, of course, are even more problematic: they are harder to construct, and their behavior in databases with complex structures is harder still to predict.
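To make the point concrete, here is a minimal PySpark sketch contrasting the unbounded pattern above with a version that prunes columns, pushes a filter down, and caps the rows returned; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bounded-query-example").getOrCreate()

# Toxic pattern: every column, every row -- millions of rows come back.
all_rows = spark.table("transactions")        # SELECT * FROM TRANSACTIONS
# all_rows.collect()  # would pull the entire table back to the driver

# Bounded alternative: filter on a partition column, select only the columns
# you need, and cap the number of rows returned.
recent_totals = (
    spark.table("transactions")
    .where(F.col("txn_date") >= "2023-01-01")  # predicate pushdown / partition pruning
    .select("account_id", "amount")            # column pruning
    .limit(1000)                               # hard cap on the result set
)
recent_totals.show()
```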

        Such toxic workloads caused problems in the days of single on-premises databases. As organizations implemented data warehouses, the risks compounded, requiring increasing expertise from a scarce cadre of query-building experts.

        Today we have data lakes as well as data warehouses, often running in the cloud where the meter is running all the time. Organizations also leverage streaming data, as well as complex data pipelines that mix different types of data in real time.

        With all this innovation and complexity, the toxic workload problem hasn’t gone away. In fact, it has gotten worse, as the nuances of such workloads have expanded.

        Breaking Down the Toxic Workload

        Poorly constructed queries are only one of the causes of a modern toxic workload. Other root causes include:

• Poor quality data – one table with NULL values, for example, can throw a wrench into seemingly simple queries (see the profiling sketch after this list). Expand that problem to other problematic data types and values across various cloud-based data services and streaming data sources, and small data quality problems can easily explode into big ones.
        • Coding issues – Data engineers must create data pipelines following traditional coding practices – and whenever there’s coding, there are software bugs. In the data warehouse days, tracking down toxic workloads usually revealed problematic queries. Today, coding issues are just as likely to be the root cause.
        • Infrastructure issues – Tracking down the root causes of toxic workloads means looking everywhere – including middleware, container infrastructure, networks, hypervisors, operating systems, and even the hardware. Just because a workload runs too slow doesn’t mean it’s a data issue. You have to eliminate as many possible root causes as you can – and quickly.
        • Human issues – Human error may be the root cause of any of the issues above – but there is more to this story. In many cases, root causes of toxic workloads boil down to a shortage of appropriate skills among the data team or a lack of effective collaboration within the team. Human error will always crop up on occasion, but a skills or collaboration issue will potentially cause many toxic workloads over time.
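Taking the data quality bullet as an example, a lightweight profiling pass can surface NULL-heavy columns before they poison downstream queries. A minimal PySpark sketch, assuming a hypothetical table name:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("null-profile").getOrCreate()

df = spark.table("transactions")  # hypothetical table

# Count NULLs per column in a single pass over the data.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
)
null_counts.show()
```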

        The bottom line: DataOps resiliency includes traditional resiliency challenges but extends to data-centric issues that require data observability to address.

        Data Resiliency at Mastercard

Mastercard recently addressed its toxic workload problem across Hadoop, Impala, Spark, and Hive.

        The payment processor has petabytes of data across hundreds of nodes, as well as thousands of users who access the data in an ad hoc fashion – that is, they build their own queries.

        Mastercard’s primary issue was poorly constructed queries, a combination of users’ inexperience as well as the complexity of the required queries.

        In addition, the company faced various infrastructure issues, from overburdened data pipelines to maxed-out storage and disabled daemons.

        All these problems led to application failures, system slowdowns and crashes, and resource bottlenecks of various types.

        To address these issues, Mastercard brought in Unravel Data. Unravel quickly identified hundreds of unused data tables. Freeing up the associated resources improved query performance dramatically.

        Mastercard also uses Unravel to help users tune their own query workloads as well as automate the monitoring of toxic workloads in progress, preventing the most dangerous ones from running in the first place.

Overall, Unravel helped Mastercard improve its mean time to recover (MTTR) – the best indicator of DataOps resiliency.

        The Intellyx Take

        The biggest mistake an organization can make around DataOps observability and resiliency is to assume these topics are special cases of the broader discussion of IT observability and resiliency.

        In truth, the areas overlap – after all, infrastructure issues are often the root causes of data-related problems – but without the particular focus on DataOps, many problems would fall through the cracks.

        The need for this focus is why tools like Unravel’s are so important. Unravel adds AI optimization and automated governance to its core data observability capabilities, helping organizations optimize the cost, performance, and quality of their data estates.

        DataOps resiliency is one of the important benefits of Unravel’s approach – not in isolation, but within the overall context for resiliency that is so essential to modern IT.

        Copyright © Intellyx LLC. Unravel Data is an Intellyx customer. None of the other organizations mentioned in this article is an Intellyx customer. Intellyx retains final editorial control of this article. No AI was used in the production of this article.

        The post DataOps Resiliency: Tracking Down Toxic Workloads appeared first on Unravel.

        Data Observability Is Big(Data 100) News https://www.unraveldata.com/resources/data-observability-is-bigdata-100-news/ https://www.unraveldata.com/resources/data-observability-is-bigdata-100-news/#respond Wed, 10 May 2023 02:38:09 +0000 https://www.unraveldata.com/?p=12118


        Here at Unravel, we are all about data observability. We live it. We breathe it. And increasingly, others are recognizing the vital role it plays in not only controlling costs, but helping companies realize and optimize their investments in cloud data projects. It’s become such a critical component of the data space that CRN created a new Data Observability category in this year’s BigData 100 list. We are excited, therefore, to announce that not only have we been named to the list for the fifth consecutive year, but we were among the inaugural group of Coolest Data Observability Companies of 2023.

        CRN’s BigData 100 is an annual list that recognizes the technology vendors that go above and beyond by delivering innovation-driven products and services that solution providers can use to help organizations of all sizes gain the competitive advantages of becoming data-driven companies. The list identifies IT vendors that have consistently made technical innovation a top priority through their portfolios of big data management and integration tools; systems and platforms; and data operations and observability offerings.

        We provide the market with the only platform that delivers AI-powered answers and precise recommendations for data teams. It’s no wonder that increasing numbers of data-driven enterprises rely on Unravel to help them bring their cloud costs under control. It’s safe to say that data observability has become a must-have for companies to succeed in today’s data-driven and cost-conscious environment.

        Not only does data observability improve the speed and quality of data delivery, but it makes it easier for IT and business teams to work together and make smarter data-driven decisions, faster. By giving teams a unified view of data performance, cost, and quality across the entire organization, it breaks down silos and empowers individuals to make their own decisions on how to best reduce and/or optimize their cloud spend while still meeting SLAs.

        Unravel provides data observability for the modern data stack. Learn more about how Unravel can help modern enterprises understand, troubleshoot, and optimize their data workloads.

        The post Data Observability Is Big(Data 100) News appeared first on Unravel.

        What’s New In Unravel 4.7.8.0 https://www.unraveldata.com/resources/whats-new-in-unravel-4780/ https://www.unraveldata.com/resources/whats-new-in-unravel-4780/#respond Tue, 11 Apr 2023 16:19:44 +0000 https://www.unraveldata.com/?p=11712


        As usual, the latest release of Unravel (4.7.8.0) introduces new features and a slew of improvements and enhancements. As always, these new capabilities reflect the emerging needs of data team practitioners and their executive teams. Three of the new features are summarized below. Complete Release Notes for Unravel v4.7.8.0 can be found here.

The top three things any data-forward enterprise cares about are performance, cost, and quality: reliable results delivered on time, every time, in the most cost-effective manner. Unravel’s strength has always been in performance observability, and we’ve taken the lead in cost governance for data workloads (DataFinOps). Now we bring in the third pillar: data quality.

        Unravel integrates with Data Quality solutions

Unravel now integrates and correlates external data quality checks within its AI-driven insights. Rather than building yet another data quality tool inside Unravel, we have taken the path of integrating data quality checks that run outside of Unravel, including all the homegrown checks and assertions built in any number of excellent solution vendor tools.

        The first data quality integration is with the open source leader, Great Expectations.

Now data teams have insights and details about performance, cost, and quality in a single pane of glass. No more jumping from tool to tool. And as different personas care about these different dimensions, everybody is working from the same single source of truth. With one click, drill down into quality check results, timelines, lineage, partitions, schema, usage, size, users, and more—right alongside AI recommendations and insights for performance and cost.

        Data Quality End-to-End Observability

        AutoActions for Amazon EMR help control cloud costs

AutoActions can now monitor EMR apps and clusters. You can set an AutoAction policy to generate alerts when an EMR cluster exceeds a dollar threshold or a running-time threshold, or when it has been idle for a long time.

        Unravel gets EMR cost data while clusters are running, so you get a notification in real time that the cluster is exceeding a threshold that impacts costs—not after the fact.

You can get a list of all threshold violations across your clusters, along with their costs and drill-down details for each cluster.
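For illustration only, the kind of running-time check such a policy automates can be sketched with AWS's boto3 SDK. This is a generic example with an arbitrary threshold, not Unravel's AutoActions implementation, which also correlates real-time cost data.

```python
from datetime import datetime, timezone

import boto3

emr = boto3.client("emr", region_name="us-east-1")
MAX_HOURS = 6  # arbitrary running-time threshold

# Flag clusters that have been up longer than the threshold.
for summary in emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]:
    created = summary["Status"]["Timeline"]["CreationDateTime"]
    hours_up = (datetime.now(timezone.utc) - created).total_seconds() / 3600
    if hours_up > MAX_HOURS:
        print(f"ALERT: cluster {summary['Id']} ({summary['Name']}) has been running "
              f"{hours_up:.1f}h, exceeding the {MAX_HOURS}h threshold")
```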

        Check out this video with a sample use case.

        Get proactive budget-tracking alerts and notifications

Unravel can now proactively send notifications—via email or Slack—to specified individuals or groups whenever certain conditions are triggered. Perhaps most usefully, this feature gives both individual users and their managers a heads-up when projected costs are at risk of exceeding budget (or have already done so). A common, reusable notification module can serve multiple purposes, and a single notification may be used by multiple budgets.

        Costs may be expressed in DBUs (for Databricks) or dollars (for Amazon EMR). To avoid alert storms and notification fatigue, only one alert is generated per month for budgets at risk, and one per month for exceeded budgets. (But if budgets are changed, alerts can be triggered again for the same month.)
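As a rough sketch of what such a notification boils down to (the webhook URL, budget name, and dollar figures are hypothetical; Unravel's module handles the cost projections and the de-duplication described above):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical

def notify_budget_risk(budget_name: str, projected: float, limit: float) -> None:
    """Post an alert when projected spend is on track to exceed the budget."""
    if projected <= limit:
        return
    message = (f":warning: Budget '{budget_name}' at risk: projected ${projected:,.0f} "
               f"vs. limit ${limit:,.0f}")
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

notify_budget_risk("databricks-prod", projected=128_000, limit=100_000)
```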

        Explore this feature further in this video.

         

        The post What’s New In Unravel 4.7.8.0 appeared first on Unravel.

        Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving https://www.unraveldata.com/resources/solving-key-challenges-ml-lifecycle-data-observability-databricks-model-serving/ https://www.unraveldata.com/resources/solving-key-challenges-ml-lifecycle-data-observability-databricks-model-serving/#respond Tue, 04 Apr 2023 21:53:31 +0000 https://www.unraveldata.com/?p=11691


        By Craig Wiley, Senior Director of Product Management, Databricks and Clinton Ford, Director of Product Marketing, Unravel Data

        Introduction

Machine learning (ML) enables organizations to extract more value from their data than ever before. Companies that successfully deploy ML models into production can capture that value at a much faster pace. But deploying ML models requires a number of key steps, each fraught with challenges:

        • Data preparation, cleaning, and processing
        • Feature engineering
        • Training and ML experiments
        • Model deployment
        • Model monitoring and scoring

        Figure 1. Phases of the ML lifecycle with Databricks Machine Learning and Unravel Data

        Challenges at each phase

        Data preparation and processing

Data preparation is a data scientist’s most time-consuming task. While there are many phases in the data science lifecycle, an ML model can only be as good as the data that feeds it, and reliable, consistent data is essential for training and ML experiments. Despite advances in data processing, significant effort is still required to load and prepare data, and unreliable data pipelines slow the development of new ML models.

        Training and ML experiments

        Once data is collected, cleansed, and refined, it is ready for feature engineering, model training, and ML experiments. The process is often tedious and error-prone, yet machine learning teams also need a way to reproduce and explain their results for debugging, regulatory reporting, or other purposes. Recording all of the necessary information about data lineage, source code revisions, and experiment results can be time-consuming and burdensome. Before a model can be deployed into production, it must have all of the detailed information for audits and reproducibility, including hyperparameters and performance metrics.

        Model deployment and monitoring

While building ML models is hard, deploying them into production is even more difficult. For example, data quality must be continuously validated and model results must be scored for accuracy to detect model drift. What makes this challenge even more daunting is the breadth of ML frameworks and the required handoffs between teams throughout the ML model lifecycle, from data preparation and training to experimentation and production deployment. Model experiments are difficult to reproduce as the code, library dependencies, and source data change, evolve, and grow over time.

        The solution

        The ultimate hack to productionize ML is data observability combined with scalable, serverless, and automated ML model serving. Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies the challenges of data operations, improves performance, saves critical engineering time, and optimizes resource utilization.

        Databricks Model Serving deploys machine learning models as a REST API, enabling you to build real-time ML applications like personalized recommendations, customer service chatbots, fraud detection, and more – all without the hassle of managing serving infrastructure.

        Databricks + data observability

        Whether you are building a lakehouse with Databricks for ML model serving, ETL, streaming data pipelines, BI dashboards, or data science, Unravel’s AI-powered data observability for Databricks on AWS and Azure Databricks simplifies operations, increases efficiency, and boosts productivity. Unravel provides AI insights to proactively pinpoint and resolve data pipeline performance issues, ensure data quality, and define automated guardrails for predictability.

        Scalable training and ML experiments with Databricks

With Databricks, data science teams can build and train machine learning models using pre-installed, optimized libraries such as scikit-learn, TensorFlow, PyTorch, and XGBoost. MLflow integration with Databricks on AWS and Azure Databricks makes it easy to track experiments and store models in repositories.

MLflow tracks machine learning model training and execution. Information about the source code, data, configuration, and results is stored in a single location for quick and easy reference. MLflow also stores models and loads them in production. Because MLflow is built on open frameworks, many different services, applications, frameworks, and tools can access and consume the models and related details.
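A minimal MLflow tracking sketch, with an illustrative scikit-learn model, shows how hyperparameters, metrics, and the trained model artifact all land in one place:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X, y)

    mlflow.log_params(params)                                # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))   # experiment result
    mlflow.sklearn.log_model(model, "model")                 # model artifact
```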

        Serverless ML model deployment and serving

        Databricks Serverless Model Serving accelerates data science teams’ path to production by simplifying deployments and reducing mistakes through integrated tools. With the new model serving service, you can do the following:

        • Deploy a model as an API with one click in a serverless environment.
• Serve models with high availability and low latency using endpoints that can automatically scale up and down based on incoming workload (see the invocation sketch after this list).
        • Safely deploy the model using flexible deployment patterns such as progressive rollout or perform online experimentation using A/B testing.
        • Seamlessly integrate model serving with online feature store (hosted on Azure Cosmos DB), MLflow Model Registry, and monitoring, allowing for faster and error-free deployment.
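For example, once an endpoint is live, scoring it is a plain HTTPS request. The sketch below is illustrative: the workspace host, endpoint name, and feature columns are placeholders, and the access token is read from an environment variable.

```python
import os

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
ENDPOINT = "fraud-detector"                              # placeholder endpoint name
TOKEN = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"{HOST}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dataframe_records": [{"amount": 42.50, "merchant_id": "m-123"}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # model predictions
```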

        Conclusion

        You can now train, deploy, monitor, and retrain machine learning models, all on the same platform with Databricks Model Serving. Integrating the feature store with model serving and monitoring helps ensure that production models are leveraging the latest data to produce accurate results. The end result is increased availability and simplified operations for greater AI velocity and positive business impact.

        Ready to get started and try it out for yourself? Watch this Databricks event to see it in action. You can read more about Databricks Model Serving and how to use it in the Databricks on AWS documentation and the Azure Databricks documentation. Learn more about data observability in the Unravel documentation.

        The post Solving key challenges in the ML lifecycle with Unravel and Databricks Model Serving appeared first on Unravel.

DataFinOps: Holding individuals accountable for their own cloud data costs https://www.unraveldata.com/resources/datafinops-holding-individuals-accountable-for-their-own-cloud-data-costs/ https://www.unraveldata.com/resources/datafinops-holding-individuals-accountable-for-their-own-cloud-data-costs/#respond Mon, 06 Mar 2023 20:32:16 +0000 https://www.unraveldata.com/?p=11615


Most organizations spend at least 37% (sometimes over 50%) more than they need to on their cloud data workloads. A lot of costs are incurred down at the individual job level, and this is usually where the biggest chunk of overspending occurs. Two of the biggest culprits are oversized resources and inefficient code. But for an organization running 10,000s or 100,000s of jobs, finding and fixing bad code or right-sizing resources is shoveling sand against the tide: there are too many jobs, and they take too much time and too much expertise.

That’s why more and more organizations are progressing to the next logical step, DataFinOps, and leveraging observability, automation, and AI to find–and ideally, fix–all those thousands of places where costs could and should be optimized. Without AI, this is work that takes even experts hours, days, or even weeks to figure out. In a nutshell, DataFinOps empowers “self-service” optimization, where AI does the heavy lifting to show people exactly what they need to do to get cost right from the get-go.

        DataFinOps and job-level “spending decisions”

        One of the highly effective core principles of DataFinOps–the overlap and marriage between DataOps and FinOps–is the idea of holding every individual accountable for their own cloud usage and cost. Essentially, shift cost-control responsibility left, to the people who are actually incurring the expenses.

        Now that’s a big change. The teams developing and running all the data applications/pipelines are, of course, still on the hook for delivering reliable results on time, every time–but now cost is right up there with performance and quality as a co-equal SLA. But it’s a smart change: let’s get the cost piece right, everywhere, then keep it that way.

        At any given time, your organization probably has hundreds of people running 1,000s of individual data jobs in the cloud–building a Spark job, or Kafka, or doing something in dbt or Databricks. And the meter is always ticking, all the time. How high that meter runs depends a lot on thousands of individual data engineering “spending decisions” about how a particular job is set up to run. In our experience with data-forward organizations over the years, as much as 60% of their cost savings were found by optimizing things at the job level. 

        complex data pipeline

        Enterprises have 100,000s of places where cloud data spending decisions need to be made.

        When thinking about optimizing cloud data costs at the job level, what springs to mind immediately is infrastructure. And running oversized/underutilized cloud resources is a big problem–one that the DataFinOps approach is especially good at tackling. Everything runs on a machine, sooner or later, and data engineers have to make all sorts of decisions about the number, size, type of machine they should request, how much memory to call for, and a slew of other configuration considerations. And every decision carries a price tag. 

        There’s a ton of opportunity to eliminate inefficiency and waste here. But the folks making these spending decisions are not experts at making these decisions. Very few people are. Even a Fortune 500 enterprise could probably count on one hand the number of experts who can “right-size” all the configuration details accurately. There just aren’t enough of these people to tackle the problem at scale. 

        But it’s not just “placing the infrastructure order” that drives up the cost of cloud data operations unnecessarily. Bad code costs money, but finding and fixing it takes–again–time and expertise.

        So, we have the theory of FinOps kind of hitting a brick wall when it comes into contact with the reality of today’s data estate operations. We want to hold the individual engineers accountable for their cloud usage and spend, but they don’t have the information at their fingertips to be able to use and spend cloud resources wisely.

        In fact, “getting engineers to take action on cost optimization” remains the #1 challenge according to the FinOps Foundation’s State of FinOps 2023 survey. But that stat masks just how difficult it is. And to get individuals to make the most cost-effective choices about configuring and running their particular jobs–which is where a lot of the money is being spent–it has to be easy for them to do the right thing. That means showing them what the right thing is. Otherwise, even if they knew exactly what they were looking for, sifting through thousands of logs and cloud vendor billing data is time-sucking toil. They don’t need more charts and graphs that, while showing a lot of useful information, still leave it to you to figure out how to fix things. 

        top FinOps challenges 2023

        That’s where DataFinOps comes in. Combining end-to-end, full-stack observability data (at a granular level), a high degree of automation, and AI capabilities, DataFinOps identifies a wide range of cost control opportunities–based on, say, what resources you have allocated vs. what resources you actually need to run the job successfully–then automatically provides a prescriptive AI-generated recommendation on what, where, how to make things more cost-effective.

        Using AI to solve oversized resources and inefficient code at the job level

DataFinOps uses observability, automation, and AI to do a few things. First, it collects all sorts of information that is “hidden in plain sight”–performance metrics, logs, traces, events, and metadata from the dozens of components you have running in your modern data stack, at both the job (Spark, Kafka) and platform (Databricks, Snowflake, BigQuery, Amazon EMR, etc.) level; details from the cloud vendors about what is running, where, and how much it costs; details about the datasets themselves, including lineage and quality–and then stitches it all together into an easily understandable, correlated context. Next, DataFinOps applies math to all that data, using hundreds of AI algorithms and ML models that have been trained over many years, to analyze usage and cost. AI can do this kind of detective work faster and more accurately than humans, especially at scale, when we’re looking at thousands and thousands of individual spending decisions.

        Two areas where AI helps individuals cut to the chase and actually do something about eliminating overspending at the job level–which is where the biggest and longer-term cost savings opportunities can be found–are oversized resources and inefficient code.


        Oversized resources are simply where somebody (or lots of somebodies) requested more resources from the cloud provider than are actually needed to run the job successfully. What DataFinOps AI does is analyze everything that’s running, pinpoint among the thousands of jobs where the number, size, type of resources you’re using is more than you need, calculate the cost implications, and deliver up a prescriptive recommendation (in plain English) for a more cost-effective configuration.

        For example, in the screenshot below, the AI has identified more cost-effective infrastructure for this particular job at hand, based on real-time data from your environment. The AI recommendation specifies exactly what type of resource, with exactly how much memory and how many cores, would be less expensive and still get the job done. It’s not quite a self-healing system, but it’s pretty close. 

        AI recommendation to downsize instance type
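The underlying idea is straightforward to express, even if doing it accurately at scale is not. The heuristic below is purely illustrative, not Unravel's actual logic; the instance catalog, prices, and headroom factor are hypothetical.

```python
# Hypothetical instance catalog: name -> (memory_gb, hourly_cost_usd)
INSTANCE_TYPES = {
    "r5.4xlarge": (128, 1.008),
    "r5.2xlarge": (64, 0.504),
    "r5.xlarge": (32, 0.252),
}

def recommend_instance(current: str, peak_memory_gb: float, headroom: float = 1.3):
    """Suggest the cheapest instance whose memory covers peak usage plus headroom."""
    required = peak_memory_gb * headroom
    candidates = [
        (cost, name) for name, (mem, cost) in INSTANCE_TYPES.items() if mem >= required
    ]
    best = min(candidates)[1] if candidates else current
    return best if best != current else None  # None means "no change recommended"

# Example: a job that peaked at 40 GB on an r5.4xlarge (128 GB).
print(recommend_instance("r5.4xlarge", peak_memory_gb=40))  # -> r5.2xlarge
```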

        Code problems may be an even more insidious contributor to overspending. Workloads are constantly moving from on-prem Cloudera/Hadoop environments to the cloud. There’s been an explosion of game-changing technologies and platforms. But we’ve kind of created a monster: everybody seems to be using a different system, and it’s often different from the one they used yesterday. Everything’s faster, and there’s more of it. Everybody is trying to do code checks as best they can, but the volume and velocity of today’s data applications/pipelines make it a losing battle. 

        DataFinOps pinpoints where in all that code there are problems causing cloud costs to rise unnecessarily. The ML model recognizes anti-patterns in the code–it’s learned from millions of jobs just like this one–and flags the issue.

In the example below, there’s a slow SQL stage: Query 7 took almost 6½ minutes. The AI/ML has identified this as taking too long: an anomaly, or at least something that needs to be corrected. It performs an automated root cause analysis to point directly to what in the code needs to be optimized.

        AI recommendation to pinpoint code problems

        Every company that’s running cloud data workloads is already trying to crack this nut of how to empower individual data engineers with this kind of “self-service” ability to optimize for costs themselves. It’s their spending decisions here at the job level that have such a big impact on overall cloud data costs. But you have to have all the data, and you have to have a lot of automation and AI to make self-service both easy and impactful.

        AI recommendations show the individual engineers what to do, so being held accountable for cloud usage and cost is no longer the showstopper obstacle. From a wider team perspective, the same information can be rolled up, sliced-and-diced, into a dashboard that gives a clear picture on the overall state of cost optimization opportunities. You can see how you’re doing (and what still needs to be done) with controlling costs at the cluster or job or even individual user level.

        dashboard listing all job-level cost optimizations

        Bottom line: The best way to control cloud data costs in a meaningful way, at scale and speed, with real and lasting impact, is the DataFinOps approach of full-stack observability, automation, and AI.

        The post DataFinOps: Holding individuals accountable for their own cloud data costs appeared first on Unravel.

DataFinOps: More on the menu than data cost governance https://www.unraveldata.com/resources/datafinops-more-on-the-menu-than-data-cost-governance/ https://www.unraveldata.com/resources/datafinops-more-on-the-menu-than-data-cost-governance/#respond Fri, 24 Feb 2023 20:14:25 +0000 https://www.unraveldata.com/?p=11573


        By Jason English, Principal Analyst, Intellyx
        Part 3 in the Demystifying Data Observability Series, by Intellyx for Unravel Data

        IT and data executives find themselves in a quandary about deciding how to wrangle an exponentially increasing volume of data to support their business requirements – without breaking an increasingly finite IT budget.

        Like an overeager diner at a buffet who’s already loaded their plate with the cheap carbs of potatoes and noodles before they reach the protein-packed entrees, they need to survey all of the data options on the menu before formulating their plans for this trip.

        In our previous chapters of this series, we discussed why DataOps needs its own kind of observability, and then how DataOps is a natural evolution of DevOps practices. Now there’s a whole new set of options in the data observability menu to help DataOps teams track the intersection of value and cost.

        From ROI to FinOps

        Executives can never seem to get their fill of ROI insights from IT projects, so they can measure bottom-line results or increase top-line revenue associated with each budget line item. After all, predictions about ROI can shape the perception of a company for its investors and customers.

        Unfortunately, ROI metrics are often discussed at the start of a major technology product or services contract – and then forgotten as soon as the next initiative gets underway.

        The discipline of FinOps burst onto the scene over the last few years, as a strategy to address the see-saw problem of balancing the CFO’s budget constraints with the CIO’s technology delivery requirements to best meet the current and future needs of customers and employees.

        FinOps focuses on improving technology spending decisions of an enterprise using measurements that go beyond ROI, to assess the value of business outcomes generated through technology investments.

        Some considerations frequently seen on the FinOps menu include:

        • Based on customer demand or volatility in our consumption patterns, should we buy capacity on-demand or reserve more cloud capacity?
        • Which FinOps tools should we buy, and what functionality should we build ourselves, to deliver this important new capability?
        • Which cloud cost models are preferred for capital expenditures (capex) projects and operational expenditures (opex)?
        • What is the potential risk and cost of known and unknown usage spikes, and how much should we reasonably invest in analysts and tools for preventative purposes?

        As a discipline, FinOps has come a long way, building communities of interest among expert practitioners, product, business, and finance teams as well as solution providers through its own FinOps Foundation and instructional courses on the topic.

        FinOps + DataOps = DataFinOps?

        Real-time analytics and AI-based operational intelligence are enabling revolutionary business capabilities, enterprise-wide awareness, and innovative machine learning-driven services. All of this is possible thanks to a smorgasbord of cloud data storage and processing, cloud data lakes, cloud data warehouse, and cloud lakehouse options.

        Unfortunately, the rich streams of data required for such sophisticated functionality bring along the unwanted side effect of elastically expanding budgetary waistbands, due to ungoverned cloud storage and compute consumption costs. Nearly a third of all data science projects go more than 40% over budget on cloud data, according to a recent survey–a huge delta between cost expectations and reality.

        How can better observability into data costs help the organization wring more value from data assets without cutting into results, or risking cost surprises?

        As it turns out, data has its own unique costs, benefits, and value considerations. Combining the disciplines of FinOps and DataOps – which I’ll dub DataFinOps just for convenience here – can yield a unique new set of efficiencies and benefits for the enterprise’s data estate.

Some of the unique considerations of DataFinOps include:

        • Which groups within our company are the top spenders on cloud data analytics, and is anything anomalous about their spending patterns versus the expected budgets?
        • What is the value of improving data performance or decreasing the latency of our data estate by region or geography, in order to improve local accuracy, reduce customer and employee attrition and improve retention?
        • If we are moving to a multi-cloud, hybrid approach, what is an appropriate and realistic mix of reserved instances and spot resources for processing data of different service level agreements (SLAs)?
• Where are we paying excessive ingress/egress fees within our data estate? Would it be more cost-effective to process data where it resides, or to move it elsewhere?
        • How much labor do our teams spend building and maintaining data pipelines, and what is that time worth?
        • Are cloud instances being intelligently right-sized and auto-scaled to meet demand?

        Systems-oriented observability platforms such as DataDog and Dynatrace can measure system or service level telemetry, which is useful for a DevOps team looking at application-level cloud capacity and cost/performance ratios. Unfortunately these tools do not dig into enough detail to answer data analytics-specific FinOps questions.

        Taming a market of data options

        Leading American grocery chain Kroger launched its 84.51° customer experience and data analytics startup to provide predictive data insights and precision marketing for its parent company and other retailers, across multiple cloud data warehouses such as Snowflake and Databricks, using data storage in multiple clouds such as Azure and GCP.

        Using the Unravel platform for data observability, they were able to get a grip on data costs and value across multiple data platforms and clouds without having to train up more experts on the gritty details of data job optimization within each system.

        “The end result is giving us tremendous visibility into what is happening within our platforms. Unravel gave recommendations to us that told us what was good and bad. It simply cut to the chase and told us what we really needed to know about the users and sessions that were problematic. It not only identified them, but then made recommendations that we could test and implement.”

        – Jeff Lambert, Vice President Engineering, 84.51°

        It’s still early days for this transformation, but a data cost reduction of up to 50% would go a long way toward extracting value from deep customer analytics, as transaction data volumes continue to increase by 2x or 3x a year as more sources come online.

        The Intellyx Take

        It would be nice if CFOs could just tell CIOs and CDOs to simply stop consuming and storing so much data, and have that reduce their data spend. But just like in real life, crash diets will never produce long-term results, if the ‘all-you-can-eat’ data consumption pattern isn’t changed.

The hybrid IT underpinnings of advanced data-driven applications evolve almost every day. To achieve sustainable improvements in cost/benefit returns on data, analysts and data scientists would have to become experts on the inner workings of each public cloud and data warehousing vendor.

        DataFinOps practices should encourage data team accountability for value improvements, but more importantly, it should give them the data observability, AI-driven recommendations, and governance controls necessary to both contain costs, and stay ahead of the organization’s growing business demand for data across hybrid IT data resources and clouds.

        ©2023 Intellyx LLC. Intellyx is editorially responsible for the content of this document. At the time of writing, Unravel is an Intellyx customer. Image source: crayion.ai

        The post DataFinOps: More on the menu than data cost governance appeared first on Unravel.

The Evolution from DevOps to DataOps https://www.unraveldata.com/resources/the-evolution-from-devops-to-dataops/ https://www.unraveldata.com/resources/the-evolution-from-devops-to-dataops/#respond Wed, 22 Feb 2023 16:11:51 +0000 https://www.unraveldata.com/?p=11556


        By Jason Bloomberg, President, Intellyx
        Part 2 of the Demystifying Data Observability Series for Unravel Data

        In part one of this series, fellow Intellyx analyst Jason English explained the differences between DevOps and DataOps, drilling down into the importance of DataOps observability.

        The question he left open for this article: how did we get here? How did DevOps evolve to what it is today, and what parallels or differences can we find in the growth of DataOps?

        DevOps Precursors

        The traditional, pre-cloud approach to building custom software in large organizations separated the application development (‘dev’) teams from the IT operations (‘ops’) personnel responsible for running software in the corporate production environment.

        In between these two teams, organizations would implement a plethora of processes and gates to ensure the quality of the code and that it would work properly in production before handing it to the ops folks to deploy and manage.

        Such ‘throw it over the wall’ processes were slow and laborious, leading to deployment cycles many months long. The importance of having software that worked properly, so the reasoning went, was sufficient reason for such onerous delays.

        Then came the Web. And the cloud. And enterprise digital transformation initiatives. All of these macro-trends forced enterprises to rethink their plodding software lifecycles.

        Not only were they too slow to deliver increasingly important software capabilities, but business requirements would evolve far too quickly for the deployed software to keep up.

        Such pressures led to the rise of agile software methodologies, cloud native computing, and DevOps.

        Finding the Essence of DevOps

        The original vision of DevOps was to pull together the dev and ops teams to foster greater collaboration, in hopes that software deployment cadences would accelerate while maintaining or improving the quality of the resulting software.

        Over time, this ‘kumbaya’ vision of seamless collaboration itself evolved. Today, we can distill the essence of modern DevOps into these five elements:

        • A cultural and organizational shift away from the ‘throw it over the wall’ mentality to greater collaboration across the software lifecycle
        • A well-integrated, comprehensive automation suite that supports CI/CD activities, along with specialists who manage and configure such technologies, i.e., DevOps engineers
        • A proactive, shift-left mentality that seeks to represent production behavior declaratively early in the lifecycle for better quality control and rapid deployment
        • Full-lifecycle observability that shifts problem resolution to the left via better prediction of problematic behavior and preemptive mitigation of issues in production
        • Lean practices to deliver value and improve efficiency throughout the software development lifecycle

        Furthermore, DevOps doesn’t live in a vacuum. Rather, it is consistent with and supports other modern software best practices, including infrastructure-as-code, GitOps, and the ‘cattle not pets’ approach to supporting the production environment via metadata representations that drive deployment.

        The Evolution of DataOps

        Before information technology (IT), organizations had management of information systems (MIS). And before MIS, at the dawn of corporate computing, enterprises implemented data processing (DP).

        The mainframes at the heart of enterprise technology as far back as the 1960s were all about processing data – crunching numbers in batch jobs that yielded arcane business results, typically dot-matrix printed on green and white striped paper.

        Today, IT covers a vast landscape of technology infrastructure, applications, and hybrid on-premises and cloud environments – but data processing remains at the heart of what IT is all about.

        Early in the evolution of DP, it became clear that the technologies necessary for processing transactions were different from the technology the organization required to provide business intelligence to line-of-business (LoB) professionals.

        Enterprises required parallel investments in online transaction processing (OLTP) and online analytical processing (OLAP), respectively. OLAP proved the tougher nut to crack, because enterprises generated voluminous quantities of transactional data, while LoB executives required complex insights that would vary over time – thus taxing the ability of the data infrastructure to respond to the business need for information.

        To address this need, data warehouses exploded onto the scene, separating the work of OLAP into two phases: transforming and loading data into the warehouses and supporting business intelligence via queries of the data in them.

        Operating these early OLAP systems was relatively straightforward, centering on administering the data warehouses. In contrast, today’s data estate – the sum total of all the data infrastructure in a modern enterprise – is far more varied than in the early data warehousing days.

        Motivations for DataOps

        Operating this data estate has also become increasingly complex, as the practice of DataOps rises in today’s organizations.

        Complexity, however, is only one motivation for DataOps. There are more reasons why today’s data estate requires it:

        • Increased mission-criticality of data, as digital transformations rework organizations into digital enterprises
        • Increased importance of real-time data, a capability that data warehouses never delivered
        • Greater diversity of data-centric use cases beyond basic business intelligence
        • Increased need for dynamic applications of data, as different LoBs need an ever-growing variety of data-centric solutions
        • Growing need for operational cost predictability, optimization, and governance

        Driving these motivations is the rise of AI, as it drives the shift from code-based to data-based software behavior. In other words, AI is more than just another data-centric use case. It repositions data as the central driver of software functionality for the enterprise.

        The Intellyx Take

        For all these reasons, DataOps can no longer follow the simplistic data warehouse administration pattern of the past. Today’s data estate is dynamic, diverse, and increasingly important, requiring organizations to take a full-lifecycle approach to collecting, transforming, storing, querying, managing, and consuming data.

        As a result, DataOps requires the application of core DevOps practices along the data lifecycle. DataOps requires the cross-lifecycle collaboration, full-lifecycle automation and observability, and the shift-left mentality that DevOps brings to the table – only now applied to the enterprise data estate.

        Thinking of DataOps as ‘DevOps for data’ may be too simplistic an explanation of the role DataOps should play. Instead, it might be more accurate to say that as data increasingly becomes the driver of software behavior, DataOps becomes the new DevOps.

        Next up in part 3 of this series: DataFinOps: More on the menu than data cost governance

        Copyright © Intellyx LLC. Unravel Data is an Intellyx customer. Intellyx retains final editorial control of this article.

        The post The Evolution from DevOps to DataOps appeared first on Unravel.

Why do we need DataOps Observability? https://www.unraveldata.com/resources/why-do-we-need-dataops-observability/ https://www.unraveldata.com/resources/why-do-we-need-dataops-observability/#respond Mon, 13 Feb 2023 14:36:50 +0000 https://www.unraveldata.com/?p=11547


        By Jason English, Principal Analyst, Intellyx
        Part 1 of the Demystifying Data Observability Series for Unravel Data

        Don’t we already have DevOps?

        DevOps was started more than a decade ago as a movement, not a product or solution category.

        DevOps offered us a way of collaborating between development and operations teams, using automation and optimization practices to continually accelerate the release of code, measure everything, lower costs, and improve the quality of application delivery to meet customer needs.

        Today, almost every application delivery shop naturally aspires to take flight with DevOps practices, and operate with more shared empathy and a shared commitment to progress through faster feature releases and feedback cycles.

        DevOps practices also include better management practices such as self-service environments, test and release automation, monitoring, and cost optimization.

        On the journey toward DevOps, teams who apply this methodology deliver software more quickly, securely, reliably, and with less burnout.

        For dynamic applications to deliver a successful user experience at scale, we still need DevOps to keep delivery flowing. But as organizations increasingly view data as a primary source of business value, data teams are tasked with building and delivering reliable data products and data applications. Just as DevOps principles emerged to enable efficient and reliable delivery of applications by software development teams, DataOps best practices are helping data teams solve a new set of data challenges.

        What is DataOps?

        If “data is the new oil,” as pundits like to say, then data is also the most valuable resource in today’s modern data-driven application world.

        The combination of commodity hardware, ubiquitous high-bandwidth networking, cloud data warehouses, and infrastructure abstraction methods like containers and Kubernetes creates an exponential rise in our ability to use data itself to dynamically compose functionality such as running analytics and informing machine learning-based inference within applications.

        Enterprises recognized data as a valuable asset, welcoming the newly minted CDO (chief data officer) role to the E-suite, with responsibility for data and data quality across the organization. While leading-edge companies like Google, Uber and Apple increased their return on data investment by mastering DataOps, many leaders struggled to staff up with enough data scientists, data analysts, and data engineers to properly capitalize on this trend.

        Progressive DataOps companies began to drain data swamps by pouring massive amounts of data (and investment) into a new modern ecosystem of cloud data warehouses and data lakes from open source Hadoop and Kafka clusters to vendor-managed services like Databricks, Snowflake, Amazon EMR, BigQuery, and others.

        The elastic capacity and scalability of cloud resources allowed new kinds of structured, semi-structured, and unstructured data to be stored, processed and analyzed, including streaming data for real-time applications.

        As these cloud resources quickly grew and scaled, they became a complex tangle of data sources, pipelines, dashboards, and machine learning models, with a variety of interdependencies, owners, stakeholders, and products with SLAs. Getting additional cloud resources and launching new data pipelines was easy, but operating them well required a lot of effort, and making sense of the business value of any specific component to prioritize data engineering efforts became a huge challenge.

        Software teams went through the DevOps revolution more than a decade ago, and even before that, there were well-understood software engineering disciplines for design/build/deploy/change, as well as monitoring and observability. Before DataOps, data teams didn’t typically think about test and release cycles, or misconfiguration of the underlying infrastructure itself.

        Where DevOps optimized the lifecycle of software from coding to release, DataOps is about the flow of data, breaking data out of work silos to collaborate on the movement of data from its inception, to its arrival, processing, and use within modern data architectures to feed production BI and machine learning applications.

        DataOps jobs, especially in a cloudy, distributed data estate, aren’t the same as DevOps jobs. For instance, if a cloud application becomes unavailable, DevOps teams might need to reboot the server, adjust an API, or restart the K8s cluster.

        If a DataOps-led application starts failing, it may show incorrect results instead of simply crashing, and cause leaders informed by faulty analytics and AI inferences to make disastrous business decisions. Figuring out the source of data errors and configuration problems can be maddeningly difficult, and DataOps teams may even need to restore the whole data estate – including values moving through ephemeral containers and pipelines – back to a valid, stable state for that point in time.

        Why does DataOps need its own observability?

        Once software observability started finding its renaissance within DevOps practices and early microservices architectures five years ago, we also started seeing some data management vendors pivoting to offer ‘data observability’ solutions.

        The original concept of data observability was concerned with database testing, properly modeling, addressing and scaling databases, and optimizing the read/write performance and security of both relational and cloud back ends.

        In much the same way that the velocity and automated release cadence of DevOps meant dev and ops teams needed to shift component and integration testing left, data teams need to tackle data application performance and data quality earlier in the DataOps lifecycle.

        In essence, DataOps teams are using agile and other methodologies to develop and deliver analytics and machine learning at scale. Therefore they need DataOps observability to clarify the complex inner plumbing of apps, pipelines and clusters handling that moving data. Savvy DataOps teams must monitor ever-increasing numbers of unique data objects moving through data pipelines.

        The KPIs for measuring success in DataOps observability include metrics and metadata that standard observability tools would never see: differences or anomalies in data layout, table partitioning, data source lineages, degrees of parallelism, data job and subroutine runtimes and resource utilization, interdependencies and relationships between data sets and cloud infrastructures – and the business tradeoffs between speed, performance and cost (or FinOps) of implementing recommended changes.

        The Intellyx Take

A recent survey noted that 97 percent of data engineers ‘feel burned out’ at their current jobs, and 70 percent say they are likely to quit within a year! That’s a wake-up call for why DataOps observability matters now more than ever.

We must maintain the morale of understaffed and overworked data teams, especially when these experts take a long time to train and are almost impossible to replace in today’s tight technical recruiting market.

        Any enterprise that intends to deliver modern DataOps should first consider equipping data teams with DataOps observability capabilities. Observability should go beyond the traditional metrics and telemetry of application code and infrastructure, empowering DataOps teams to govern data and the resources used to refine and convert raw data into business value as it flows through their cloud application estates.

        Next up in part 2 of this series: Jason Bloomberg on the transformation from DevOps to DataOps!

        ©2022 Intellyx LLC. Intellyx retains editorial control of this document. At the time of writing, Unravel Data is an Intellyx customer. Image credit: Phil O’Driscoll, flickr CC2.0 license, compositing pins by author.

        The post Why do we need DataOps Observability? appeared first on Unravel.

Three Companies Driving Better Business Outcomes from Data Analytics https://www.unraveldata.com/resources/three-companies-driving-better-business-outcomes-from-data-analytics/ https://www.unraveldata.com/resources/three-companies-driving-better-business-outcomes-from-data-analytics/#respond Thu, 09 Feb 2023 19:00:08 +0000 https://www.unraveldata.com/?p=11485

        You’re unlikely to be able to build a business without data.

        But how can you use it effectively?

        There are so many ways you can use data in your business, from creating better products and services for customers, to improving efficiency and reducing waste.

Enter data observability. Using agile development practices, companies can create, deliver, and optimize data products quickly and cost-effectively. Organizations can easily identify a problem to solve and then break it down into smaller pieces. Each piece is then assigned to a team that organizes the work to solve the problem into a defined block of time – usually called a sprint – that includes planning, work, deployment, and review.

        Marion Shaw, Head of Data and Analytics at Chaucer Group, and Unravel’s Ben Cooper presented on transforming data analytics to build better products and services and on making data analytics more efficient, respectively, at a recent Chief Disruptor virtual event.

        Following on from the presentation, members joined a roundtable discussion and took part in a number of polls, in order to share their experiences. Here are just some examples of how companies have used data analytics to drive innovation:

• Improved payments. An influx of customer calls asking “where’s my money or payment?” prompted a company to introduce a “track payments” feature as a way of digitally understanding payment status. As a result, call volume decreased, while users of the new feature actually exceeded the number of original complaints, which proved there was a group of customers who couldn’t be bothered to report problems but still found the feature useful. “If you make something easy for your customers, they will use it.”
• Cost reduction and sustainability: Switching from plastic to paper cups both reduced costs and improved sustainability for one company, showing how companies can use their own data to make business decisions.
• New products: Using AI and data collaboration in drug discovery to explore disease patterns can help pharmaceutical companies find new treatments for diseases, with the potential for high returns, though discovery remains costly given the massive data sets involved.

        The key takeaways from the discussion were:

        • Make it simple. When you make an action easy for your customers, they will use it.
        • Lean on the data. If there isn’t data behind someone’s viewpoint, then it is simply an opinion.
        • Get buy-in. Data teams need to buy into the usage of data—just because a data person owns the data side of things does not mean that they are responsible for the benefits or failings of it.

Using data analytics effectively with data observability is key. Companies across all industries are using data observability to create better products and services, reduce waste, and improve productivity.

        But data observability is no longer just about data quality or observing the condition of the data itself. Today it encompasses much more, and you can’t “borrow” your software teams’ observability solution. Discover more in our report DataOps Observability: The Missing Link for Data Teams.

        The post Three Companies Driving Better Business Outcomes from Data Analytics appeared first on Unravel.

Taming Cloud Costs for Data Analytics with FinOps https://www.unraveldata.com/resources/taming-cloud-costs-for-data-analytics-with-finops/ https://www.unraveldata.com/resources/taming-cloud-costs-for-data-analytics-with-finops/#respond Fri, 03 Feb 2023 15:15:11 +0000 https://www.unraveldata.com/?p=11409

        Uncontrolled cloud costs pose an enormous risk for any organization. The longer these costs go ungoverned, the greater your risk. Volatile, unforeseen expenses eat into profits. Budgets become unstable. Waste and inefficiency go unchecked. Making strategic decisions becomes difficult, if not impossible. Uncertainty reigns.

        Everybody’s cloud bill continues to get steeper month over month, and the most rapidly escalating (and often the single largest) slice of the pie is cloud costs for data analytics—usually at least 40%. With virtually every organization asking for ever more data analytics, and more data workloads moving to the cloud, uncontrolled cloud data costs increasingly become a bigger part of the overall problem.

Data workloads are the fastest-growing cloud cost category and, if they’re not already, will soon be the #1 cloud expense. (Source: IDC)

         

        All too often, business leaders don’t even have a clear understanding of where the money is going—or why—for cloud data expenditures, much less a game plan for bringing these costs under control.


        Ungoverned cloud data usage and costs result in multiple, usually coexisting, business vulnerabilities.

         

        Consider these common scenarios:

        • You’re a C-suite executive who is ultimately responsible for cloud costs, but can’t get a good explanation from your data team leaders about why your Azure Databricks or AWS or Google Cloud costs in particular are skyrocketing.
        • You’re a data team leader who’s on the hook to explain to the C-suite these soaring cloud data costs, but can’t. You don’t have a good handle on exactly who’s spending how much on what. This is a universal problem, no matter what platform or cloud provider: 70% of organizations aren’t sure what they spend their cloud budget on.
        • You’re in Finance and are getting ambushed by your AWS (or Databricks or GCP) bill every month. Across companies of every size and sector, usage and costs are wildly variable and unpredictable—one survey has found that cloud costs are higher than expected for 6 of every 10 organizations.
        • You’re a business product owner who needs additional budget to meet the organization’s increasing demand for more data analytics but don’t really know how much more money you’ll need. Forecasting—whether for Snowflake, Databricks, Amazon EMR, BigQuery, or any other cloud service—becomes a guessing game: Forrester has found that 80% of companies have difficulty predicting data-related cloud costs, and it’s becoming ever more problematic to keep asking for more budget every 3-6 months.
        • You’re an Engineering/Operations team lead who knows there’s waste and inefficiency in your cloud usage but don’t really know how much or exactly where, much less what to do about it. Your teams are groping in the dark, playing blind man’s bluff. And it’s getting worse: 75% of organizations report that cloud waste is increasing.
        • Enterprise architecture teams are seeing their cloud migration and modernization initiatives stall out. It’s common for a project’s three-year budget to be blown by Q1 of Year 2. And increasingly companies are pulling the plug. A recent report states that 81% of IT teams have been directed by the C-suite to cut or halt cloud spending. You find yourself paralyzed, unable to move either forward or back—like being stranded in a canoe halfway through a river crossing.
        • Data and Finance VPs don’t know the ROI of their modern data stack investments—or even how to figure that out. But whether you measure it or not, the ROI of your modern data stack investments nosedives as costs soar. You find it harder and harder to justify your decision to move to Databricks or Snowflake or Amazon EMR or BigQuery. PricewaterhouseCoopers has found that over half (53%) of enterprises have yet to realize substantial value from their cloud investments.
        • With seemingly no end in sight to escalating cloud costs, data executives may even be considering the radical step of sacrificing agility and speed-to-market gains and going back to on-prem (repatriation). The inability to control costs is a leading reason why 71% of enterprises expect to move all or some of their workloads to on-prem environments.

        How’d we get into this mess?

        The problem of ungoverned cloud data costs is universal and has been with us for a while: 83% of enterprises cite managing cloud spend as their top cloud challenge, and optimizing cloud usage is the top cloud initiative for the sixth straight year, according to the 2022 State of the Cloud Report.

        Some of that is simply due to the increased volume and velocity of data and analytics. In just a few short years, data analytics has gone from a science project to an integral business-critical function. More data workloads are running in the cloud, often more than anticipated. Gartner has noted that overall cloud usage for data workloads “almost always exceeds initial expectations,” stating that workloads may grow 4X or more in the first year alone.

        If you’re like most data-driven enterprises, you’ve likely invested millions in the innovative data platforms and cloud offerings that make up your modern data stack—Databricks, Snowflake, Amazon EMR, BigQuery, Dataproc, etc.—and have realized greater agility and go-to-market speed. But those benefits have come at the expense of understanding and controlling the associated costs. You’re also seeing your cloud costs lurch unpredictably upwards month over month. And that’s the case no matter which cloud provider(s) you’re on, or which platform(s) you’re running. Your modern data stack consumes ever more budget, and it never seems to be enough. You’re under constant threat of your cloud data costs jumping 30-40% in just six months. It’s become a bit like the Wild West, where everybody is spinning up clusters and incurring costs left and right, but nobody is in control to govern what’s going on.

The only thing predictable about modern data stack costs is that they seem to always go up. (Cloud spending, 2017-23. Source: Statista)

         

Many organizations that have been wrestling with uncontrolled cloud data costs have begun adopting a FinOps approach. Yet they are struggling to put these commonsensical FinOps principles into practice for data teams. They find themselves hamstrung by generic FinOps tools and are hitting a brick wall when it comes to putting foundational capabilities into action.

        So why are organizations having trouble implementing DataFinOps (FinOps for data teams)?

        FinOps for data teams

        Just as DevOps and DataOps are “cousin” approaches—bringing agile and lean methodologies to software development and data management, tackling similar types of challenges, but needing very distinct types of information and analyses to get there—FinOps and DataFinOps are related but different. In much the same way (and for similar reasons) DevOps observability built for web applications doesn’t work for data pipelines and applications, DataFinOps brings FinOps best practices to data management, taking the best of FinOps to help data teams measure and improve cost effectiveness of data pipelines and data applications.


        DataOps + FinOps = DataFinOps

         

        FinOps principles and approach

        As defined by the FinOps Foundation, FinOps is “an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.”

        It’s important to bear in mind that FinOps isn’t just about lowering your cloud bill—although you will wind up saving money in a lot of areas, running things more cost-effectively, and being able to do more with less (or at least the same). It’s about empowering engineers and business teams to make better choices about their cloud usage and deriving the most value from their cloud investments.

        There are a few underlying “north star” principles that guide all FinOps activities:

        • Cloud cost governance is a team sport. Too often controlling costs devolves into Us vs. Them friction, frustration, and finger-pointing. It can’t be done by just one group alone, working with their own set of data (and their own tools) from their own perspective. Finance, Engineering, Operations, technology teams, and business stakeholders all need to be on the same page, pulling in the same direction to the same destination, working together collaboratively.
        • Spending decisions are driven by business value. Not all cloud usage is created equal. Not everything merits the same priority or same level of investment/expenditure. Value to the business is the guiding criterion for making collaborative and intelligent decisions about trade-offs between performance, quality, and cost.
• Everyone takes ownership of their cloud usage. Holding individuals accountable for their own cloud usage—and costs—essentially shifts budget responsibility left, onto the folks who actually incur the expenses. This is crucial to controlling cloud costs at scale, but to do so, you absolutely must empower engineers and Operations with the self-service optimization capabilities to “do the right thing” themselves quickly and easily.
• Reports are accessible and timely. To make data-driven decisions, you need accurate, real-time data. The various players collaborating on these decisions all bring their own wants and needs to the table, and everybody needs to be working from the same information, seeing the issue(s) the same way—a single pane of glass for a single source of truth. Dashboards and reports must be visible, understandable, and practical for a wide range of people making these day-to-day cost-optimization spending decisions.

        Applying FinOps to data management

        Data teams can put these FinOps principles into practice with a three-phase iterative DataFinOps life cycle:

        • Observability, or visibility, is understanding what is going on in your environment, measuring everything, tracking costs, and identifying exactly where the money is going.
        • Optimization is identifying where you can eliminate waste and inefficiencies, take advantage of less-expensive cloud pricing options, or otherwise run your cloud operations more cost-efficiently—and then empowering individuals to actually make improvements.
        • Governance is all about going from reactive problem-solving to proactive problem-preventing, sustaining iterative improvements, and implementing guardrails and alerts.

        The DataFinOps life cycle: observability, optimization, and governance.

         

        The essential principles and three-phase approach of FinOps are highly relevant to data teams. But just as DataOps requires a different set of information than DevOps, data teams need unique and data-specific details to achieve observability, optimization, and governance in their everyday DataFinOps practice.

        What makes DataFinOps different—and difficult

        The following challenges all manifest themselves in one way or another at each stage of the observability/optimization/governance DataFinOps life cycle. Be aware that while you can’t do anything until you have full visibility/observability into your cloud data costs, the optimization and governance phases are usually the most difficult and most valuable to put into action.

        First off, just capturing and correlating the volume of information across today’s modern data stacks can be overwhelming. The sheer size and complexity of data applications/pipelines—all with multiple sub-jobs and sub-tasks processing in parallel—generates millions of metrics, metadata, events, logs, etc. Then everything has to be stitched together somehow in a way that makes sense to Finance and Data Engineering and DataOps and the business side.

        Even a medium-sized data analytics operation has 100,000s of individual “data-driven spending decisions” to make. That’s the bad news. The good news is that this same complexity means there are myriad opportunities to optimize costs.

        Second, the kinds of details (and the level of granularity) your teams need to make informed, intelligent DataFinOps decisions simply are not captured or visualized by cloud cost-management FinOps tools or platform-specific tools like AWS Cost Explorer, OverWatch, or dashboards and reports from GCP and Microsoft Cost Management. It’s the highly granular details about what’s actually going on in your data estate (performance, cost) that will uncover opportunities for cloud cost optimizations. But those granular details are scattered across dozens of different tools, technologies, and platforms.

        Third, there’s usually no single source of truth. Finance, data engineers, DataOps, and the business product owners all use their own particular tools of choice to manage different aspects of cloud resources and spending. Without a common mechanism (the proverbial single pane of glass) to see and measure cost efficiency, it’s nearly impossible for any of them to make the right call.

Finally, you need to be able to see the forest for the trees. Seemingly simple, innocuous changes to a single cloud data application/pipeline can have a huge blast radius. They’re all highly connected and interdependent: the output of something upstream is the input to something else downstream or, quite likely, another application/pipeline for somebody somewhere else in the company. Everything must be understood within a holistic business context so that everybody involved in DataFinOps understands how everything is working as a whole.

        Implementing DataFinOps in practice: A 6-step game plan

        At its core, DataFinOps elevates cost as a first-class KPI metric, right alongside performance and quality. Most data team SLAs revolve around performance and quality: delivering reliable results on time, every time. But with cloud spend spiraling out of control, now cost must be added to reliability and speed as a co-equal consideration.

        1. Set yourself up for success

        Before launching DataFinOps into practice, you need to lay some groundwork around organizational alignment and expectations. The collaborative approach and the principle of holding individuals accountable for their cloud usage are a cultural sea-change. DataFinOps won’t work as a top-down mandate without buy-in from team members further down the ladder. Recognize that improvements will grow proportionally over time as your DataFinOps practice gains momentum. It’s best to adopt a crawl-walk-run approach, where you roll out a pilot project to discover best practices (and pitfalls) and get a quick win that demonstrates the benefits. Find a data analyst or engineer with an interest in the financial side who can be the flag-bearer, maybe take some FinOps training (the FinOps Foundation is an excellent place to start), and then dig into one of your more expensive workloads to see how to apply DataFinOps principles in practice. Similarly, get at least one person from the finance and business teams who is willing to get onboard and discover how DataFinOps works from their side.

Check out this 15-minute discussion on how JPMorgan Chase & Co. tackled FinOps at scale.

        2. Get crystal-clear visibility into where the money is going

        Before you can even begin to control your cloud costs, you have to understand—with clarity and precision—where the money is going. Most organizations have only hazy visibility into their overall cloud spend. The 2022 State of Cloud Cost Report states that gaining visibility into cloud usage is the single biggest challenge to controlling costs. The 2022 State of Cloud Cost Intelligence Report further finds that only 3 out of 10 organizations know exactly where their spend is going, with the majority either guesstimating or having no idea. And the larger the company, the more significant the cost visibility problem.

A big part of the problem is the cloud bills themselves. They’re either opaque or inscrutable; either way, they lack the business context that makes sense to your particular organization. Cloud vendor billing consoles (and third-party cost-management tools) give you only an aggregated view of total spend for different services (compute, storage, platform, etc.).


Cloud bills don’t answer questions like: Who spent on these services? Which departments, teams, and users are the top spenders?

         

        Or you have to dedicate a highly trained expert or two to decode hundreds of thousands (millions, even) of individual billing line items and figure out who submitted which jobs (by department, team, individual user) for what purpose—who’s spending what, when, and why.


        Decoding 1,000,000s of individual lines of billing data is a full-time job.

         

        The lack of visibility is compounded by today’s heterogeneous multi-cloud, multi-platform reality. Most enterprise-scale organizations use a combination of different cloud providers and different data platforms, often the same platform on different cloud providers. While this is by no means a bad thing from an agility perspective, it does make comprehending overall costs across different providers even more difficult.

        So, the first practical step to implementing DataFinOps is understanding your costs with clarity and precision. You need to be able to understand at a glance, in terms that make sense to you, which projects, applications, departments, teams, and individual users are spending exactly how much, and why.

        You get precise visualization of cloud data costs through a combination of tagging and full-stack observability. Organizations need to apply some sort of cost-allocation tagging taxonomy to every piece of their data estate in order to categorize and track cloud usage. Then you need to capture a wide and deep range of granular performance details, down to the individual job level, along with cost information.
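
To make the tagging idea concrete, here is a minimal sketch of applying a cost-allocation taxonomy when spinning up a Databricks cluster through its Clusters REST API (custom tags propagate to the underlying cloud resources, so they show up later in billing data). The tag keys, values, and instance details below are illustrative assumptions, not a prescribed standard.

```python
# Illustrative only: attach a cost-allocation tag taxonomy at cluster-creation time.
# All names and values here (tags, runtime version, instance type) are hypothetical.
import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g., https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "13.3.x-scala2.12",           # placeholder runtime
    "node_type_id": "i3.xlarge",                   # placeholder instance type
    "num_workers": 4,
    "custom_tags": {                               # the cost-allocation taxonomy
        "cost_center": "cc-1234",
        "project": "customer-360",
        "team": "data-engineering",
        "owner": "jane.doe@example.com",
        "environment": "prod",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```

The same principle applies on other platforms: tag (or label) every cluster, warehouse, and job with a consistent taxonomy so that costs can later be attributed automatically.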

        All this information is already available in different systems, hidden in plain sight. What you need is a way to pull it all together, correlating everything in a data model and mapping resources to the tagged applications, teams, individuals, etc.

        Once you have captured and correlated all performance/cost details and have everything tagged, you can slice and dice the data to do a number of things with a high degree of accuracy:

        • See who the “big spenders” are—which departments, teams, applications, users are consuming the most resources—so you can prioritize where to focus first on cost optimization, for the biggest impact.
        • Track actual spend against budgets—again, by project or group or individual. Know ahead of time whether usage is projected to be on budget, is at risk, or has already gone over. Avoid “sticker shock” surprises when the monthly bill arrives.
        • Forecast with accuracy and confidence. You can run regular reports that analyze historical usage and trends (e.g., peaks and valleys, auto-scaling) to predict future capacity needs. Base projections on data, not guesswork.
        • Allocate costs with pinpoint precision. Generate cost-allocation reports that tell you exactly—down to the penny—who’s spending how much, where, and on what.
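
As a simplified illustration of the kind of slicing this enables, the sketch below rolls tagged, job-level cost records up into a chargeback view and flags teams trending over budget. The file, column names, and budget figures are hypothetical placeholders for whatever your observability tooling exports.

```python
# A minimal chargeback/budget-tracking sketch over tagged cost records.
# Assumes a CSV of job-level costs carrying the tag taxonomy (team, project, user).
import pandas as pd

costs = pd.read_csv("job_costs.csv", parse_dates=["run_date"])  # hypothetical export

# Who are the big spenders this month?
today = pd.Timestamp.today()
this_month = costs[costs["run_date"].dt.to_period("M") == today.to_period("M")]
by_team = this_month.groupby("team")["cost_usd"].sum().sort_values(ascending=False)
print(by_team.head(10))

# Track actual spend against (hypothetical) monthly budgets and flag risk early.
budgets = {"data-engineering": 40_000, "analytics": 25_000, "ml-platform": 30_000}
for team, spend in by_team.items():
    budget = budgets.get(team)
    if budget is None:
        continue
    projected = spend / today.day * today.days_in_month   # naive straight-line projection
    status = "AT RISK" if projected > budget else "on track"
    print(f"{team}: spent ${spend:,.0f}, projected ${projected:,.0f} vs budget ${budget:,.0f} -> {status}")
```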

        3. Understand why the costs are what they are

Beyond knowing what’s being spent and by whom, there’s the question of how money is being spent—specifically, what resources were allocated for the various workload jobs and how much they cost. Your data engineers and DataOps teams need to be able to drill down into the application- and cluster-level configuration details to identify which individuals are using cloud data resources, the size and number of resources they’re using, which jobs are being run, and what their performance looks like. For example, some data jobs are constrained by network bandwidth, while others are memory- or CPU-bound. Running those jobs on mismatched clusters and instance types may not maximize the return on your cloud data investments. If you think about cost optimization as a before-and-after exercise, your teams need a clear and comprehensive picture of the “before.”

        With so many different cloud pricing models, instance types, platforms, and technologies available to choose from, you need everybody to have a fully transparent 360° view into the way expenses are being incurred at any given time so that the next step—cost optimization and collaborative decision-making about cloud spend—can happen intelligently.

        4. Identify opportunities to do things more cost-effectively

        If the observability phase gives you visibility into where costs are going, how, and why, the optimization phase is about taking that information and figuring out where you can eliminate waste, remove inefficiencies, and leverage different options that are less expensive but still meet business needs.

Most companies have very limited insight (if any) into where they’re spending more than they need to. Everyone knows on an abstract level that cloud waste is a big problem in their organization, but identifying concrete examples that can be remediated is another story. Various surveys and reports peg the amount of cloud budget going to waste at between 30% and 50%, but the true percentage may well be higher—especially for data teams—because most companies are just guessing and tend to underestimate their waste. For example, after implementing cloud data cost optimization, most Unravel customers have seen their cloud spend lowered by 50-60%.

        An enterprise running 10,000s (or 100,000s) of data jobs every month has literally millions of decisions to make—at the application, pipeline, and cluster level—about where, when, and how to run those jobs. And each individual decision about resources carries a price tag.


        Enterprises have 100,000s of places where cloud data spending decisions need to be made.

         

        Only about 10% of cost-optimization opportunities are easy to see, for example, shutting down idle clusters (deployed but no-longer-in-use resources). The remaining 90% lie below the surface.


        90% of cloud data cost-optimization opportunities are buried deep, out of immediate sight.

         

        You need to go beyond knowing how much you are currently spending, to knowing how much you should be spending in a cost-optimized environment. You need insight into what resources are actually needed to run the different applications/pipelines vs. what resources are being used.

        Data engineers are not wasting money intentionally; they simply don’t have the insights to run their jobs most cost-efficiently. Identifying waste and inefficiencies needs two things in already short supply: time and expertise. Usually you need to pull engineering or DataOps superstars away from what they’re doing—often the most complex or business-critical projects—to do the detective work of analyzing usage vs. actual need. This kind of cost-optimization analysis does get done today, here and there, on an ad hoc basis for a handful of applications, but doing so at enterprise scale is something that remains out of reach for most companies.

        The complexity is overwhelming: for example, AWS alone has more than 200 cloud services and over 600 instance types available, and has changed its prices 107 times since its launch in 2006. For even your top experts, this can be laborious, time-consuming work, with continual trial-and-error, hit-or-miss attempts.

        Spoiler alert: You need AI

        AI is crucial to understanding where cost optimization is needed—at least, at enterprise scale and to make an impact. It can take hours, days, sometimes weeks, for even your best people to tune a single application for cost efficiency. AI can do all this analysis automatically by “throwing some math” at all the observability performance and cost data to uncover where there’s waste and inefficiencies.

        • Overprovisioned resources. This is most often why budgets go off the rails, and the primary source of waste. The size and number of instances and containers is greater than necessary, having been allocated based on perceived need rather than actual usage. Right-sizing cloud resources alone can save an organization 50-60% of its cloud bill.
        • Instance pricing options. If you understand what resources are needed to run a particular application (how that application interacts with other applications), the DataFinOps team can make informed decisions about when to use on-demand, reserved, or spot instances. Leveraging spot instances can be up to 90% less expensive than on-demand—but you have to have the insights to know which jobs are good candidates.
        • Bad code. AI understands what the application is trying to do and can tell you that it’s been submitted in a way that’s not efficient. We’ve seen how a single bad join on a multi-petabyte table kept a job running all weekend and wound up costing the company over $500,000.
        • Intelligent auto-scaling. The cloud’s elasticity is great but comes at a cost, and isn’t always needed. AI analyzes usage trends to help predict when auto-scaling is appropriate—or when rescheduling the job may be a more cost-effective option.
        • Data tiering. You probably have petabytes of data. But you’re not using all of it all of the time. AI can tell you which datasets are (not) being used, applying cold/warm/hot labels based on age or usage, so you understand which ones haven’t been touched in months yet still sit on expensive storage. Moving cold data to less-expensive options can save 80-90% on storage costs.
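
As a rough illustration of the tiering idea in that last point, the sketch below labels tables hot, warm, or cold from how recently they were read. The metadata source and the 30/90-day thresholds are assumptions; real tooling would derive access history from audit logs or a catalog rather than a hard-coded dictionary.

```python
# A simplified data-tiering classifier: label datasets by days since last read.
# The table metadata and thresholds below are hypothetical, for illustration only.
from datetime import datetime, timezone
from typing import Optional

last_access = {  # hypothetical table -> last read timestamp
    "sales.orders": datetime(2023, 1, 28, tzinfo=timezone.utc),
    "sales.orders_2019": datetime(2022, 3, 2, tzinfo=timezone.utc),
    "marketing.clickstream": datetime(2023, 1, 30, tzinfo=timezone.utc),
}

def tier(last_read: datetime, now: Optional[datetime] = None) -> str:
    """Return 'hot', 'warm', or 'cold' based on days since the table was last read."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - last_read).days
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    return "cold"  # candidate for cheaper storage (e.g., object-store archive tiers)

for table, ts in last_access.items():
    print(f"{table}: {tier(ts)}")
```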

        5. Empower individuals with self-service optimization

        Holding individual users accountable for their own cloud usage and cost is a bedrock principle of DataFinOps. But to do that, you also have to give them the insights and actionable intelligence for them to make better choices.

You have to make it easy for them to do the right thing. They can’t just be thrown a bunch of charts, graphs, and metrics and be expected to figure out what to do. They need quick and easy-to-understand prescriptive recommendations on exactly how to run their applications reliably, quickly, and cost-efficiently.

        Ideally, this is where AI can also help. Taking the analyses of what’s really needed vs. what’s been requested, AI could generate advice on exactly where and how to change settings or configurations or code to optimize for cost.


        Leveraging millions of data points and hundreds of cost and performance algorithms, AI offers precise, prescriptive recommendations for optimizing costs throughout the modern data stack.

         

        6. Go from reactive to proactive

        Optimizing costs reactively is of course highly beneficial—and desperately needed when costs are out of control—but even better is actively governing them. DataFinOps is not a straight line, with a beginning and end, but rather a circular life cycle. Observability propels optimization, whose outcomes become baselines for proactive governance and feed back into observability.

Governance is all about getting ahead of cost issues beforehand rather than after the fact. Data team leaders should implement automated guardrails that give a heads-up when thresholds for any business dimension (projected budget overruns, jobs that exceed a certain size, time, or cost) are crossed. Alerts could be triggered whenever a guardrail constraint is violated, notifying the individual user—or sent up the chain of command—that their job will miss its SLA or cost too much money and they need to find less expensive options, rewrite it to be more efficient, reschedule it, etc. Or guardrail violations could trigger preemptive corrective “circuit breaker” actions to kill jobs or applications altogether, request configuration changes, etc., to rein in rogue users, put the brakes on runaway jobs, and nip cost overruns in the bud.
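
A guardrail of this kind is conceptually simple. The hedged sketch below shows the shape of a threshold check that warns on an approaching limit and “circuit-breaks” a runaway job; the cost feed, notification hook, and kill action are placeholder functions, not any particular platform’s API.

```python
# A minimal sketch of an automated cost guardrail / circuit breaker.
# kill_job() and notify() are hypothetical placeholders for real platform
# and alerting integrations.
from dataclasses import dataclass

@dataclass
class Guardrail:
    max_cost_usd: float            # hard ceiling for a single job
    warn_at_fraction: float = 0.8  # send a heads-up at 80% of the ceiling

def kill_job(job_id: str) -> None:
    print(f"[action] cancelling {job_id}")        # placeholder

def notify(message: str) -> None:
    print(f"[alert] {message}")                   # placeholder

def enforce(job_id: str, current_cost_usd: float, rail: Guardrail) -> str:
    """Return the action taken for a job given the cost it has accrued so far."""
    if current_cost_usd >= rail.max_cost_usd:
        kill_job(job_id)                          # circuit breaker: stop the runaway job
        notify(f"Job {job_id} killed at ${current_cost_usd:,.0f} (limit ${rail.max_cost_usd:,.0f})")
        return "killed"
    if current_cost_usd >= rail.warn_at_fraction * rail.max_cost_usd:
        notify(f"Job {job_id} at ${current_cost_usd:,.0f}, approaching its ${rail.max_cost_usd:,.0f} limit")
        return "warned"
    return "ok"

print(enforce("etl-weekend-join", current_cost_usd=4_200, rail=Guardrail(max_cost_usd=5_000)))
```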

        Controlling particular users, apps, or business units from exceeding certain behaviors has a profound impact on reining in cloud spend.

        Conclusion

        The volatility, obscurity, and lack of governance over rapidly growing cloud data costs introduce a high degree of unpredictability—and risk—into your organization’s data and analytics operations. Expenses must be brought under control, but reducing costs by halting activities can actually increase business risks in the form of lost revenue, SLA breaches, or even brand and reputation damage. Taking the radical step of going back on-prem may restore control over costs but sacrifices agility and speed-to-market, which also introduces the risk of losing competitive edge.

        A better approach, DataFinOps, is to use your cloud data spend more intelligently and effectively. Make sure that your cloud investments are providing business value, that you’re getting the highest possible ROI from your modern data stack. Eliminate inefficiency, stop the rampant waste, and make business-driven decisions—based on real-time data, not guesstimates—about the best way to run your workloads in the cloud.

        That’s the driving force behind FinOps for data teams, DataFinOps. Collaboration between engineering, DataOps, finance, and business teams. Holding individual users accountable for their cloud usage (and costs).

        But putting DataFinOps principles into practice is a big cultural and organizational shift. Without the right DataFinOps tools, you’ll find it tough sledding to understand exactly where your cloud data spend is going (by whom and why) and identify opportunities to operate more cost-efficiently. And you’ll need AI to help empower individuals to optimize for cost themselves.

        Then you can regain control over your cloud data costs, restore stability and predictability to cloud budgets, be able to make strategic investments with confidence, and drastically reduce business risk.

        The post Taming Cloud Costs for Data Analytics with FinOps appeared first on Unravel.

        3 Takeaways from the 2023 Data Teams Summit https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/ https://www.unraveldata.com/resources/3-takeaways-from-the-2023-data-teams-summit/#respond Thu, 02 Feb 2023 13:45:19 +0000 https://www.unraveldata.com/?p=11362

        The 2023 Data Teams Summit (formerly DataOps Unleashed) was a smashing success, with over 2,000 participants from 1,600 organizations attending 23 expert-led breakout sessions, panel discussions, case studies, and keynote presentations covering a wide range of thought leadership and best practices.

There were a lot of sessions devoted to different strategies and considerations when building a high-performing data team, how to become a data team leader, where data engineering is heading, and emerging trends in DataOps (asset-based orchestration, data contracts, data mesh, digital twins, data centers of excellence). And running as a common theme through almost every presentation was that top-of-mind topic: FinOps and how to get control over galloping cloud data costs.

        Some of the highlight sessions are available now on demand (no form to fill out) on our Data Teams Summit 2023 page. More are coming soon. 

        There was a lot to see (and full disclosure: I didn’t get to a couple of sessions), but here are 3 sessions that I found particularly interesting.

        Enabling strong engineering practices at Maersk


        The fireside chat between Unravel CEO and Co-founder Kunal Agarwal and Mark Sear, Head of Data Platform Optimization at Maersk, one of the world’s largest logistics companies, is entertaining and informative. Kunal and Mark cut through the hype to simplify complex issues in commonsensical, no-nonsense language about:

        • The “people problem” that nobody’s talking about
        • How Maersk was able to upskill its data teams at scale
        • Maersk’s approach to rising cloud data costs
        • Best practices for implementing FinOps for data teams

        Check out their talk here

        Maximize business results with FinOps


        Unravel DataOps Champion and FinOps Certified Practitioner Clinton Ford and FinOps Foundation Ambassador Thiago Gil explain how and why the emerging cloud financial management discipline of FinOps is particularly relevant—and challenging—for data teams. They cover:

        • The hidden costs of cloud adoption
        • Why observability matters
        • How FinOps empowers data teams
        • How to maximize business results 
        • The state of production ML

        See their session here

        Situational awareness in a technology ecosystem


        Charles Boicey, Chief Innovation Officer and Co-founder of Clearsense, a healthcare data platform company, explores the various components of a healthcare-centric data ecosystem and how situational awareness in the clinical environment has been transferred to the technical realm. He discusses:

        • What clinical situational awareness looks like
        • The concept of human- and technology-assisted observability
        • The challenges of getting “focused observability” in a complex hybrid, multi-cloud, multi-platform modern data architecture for healthcare
        • How Clearsense leverages observability in practice
        • Observability at the edge

        Watch his presentation here

        How to get more session recordings on demand

        1. To see other session recordings without any registration, visit the Unravel Data Teams Summit 2023 page. 
        2. To see all Data Teams Summit 2023 recordings, register for access here.

        And please share your favorite takeaways, see what resonated with your peers, and join the discussion on LinkedIn.  

        The post 3 Takeaways from the 2023 Data Teams Summit appeared first on Unravel.

        Panel recap: What Is DataOps observability? https://www.unraveldata.com/resources/panel-recap-what-is-dataops-observability/ https://www.unraveldata.com/resources/panel-recap-what-is-dataops-observability/#respond Fri, 06 Jan 2023 14:39:13 +0000 https://www.unraveldata.com/?p=10927

Data teams and their business-side colleagues now expect—and need—more from their observability solutions than ever before. Modern data stacks create new challenges for performance, reliability, data quality, and, increasingly, cost. And the challenges faced by operations engineers are going to be different from those of data analysts, which are different from those that people on the business side care about. That’s where DataOps observability comes in.

        But what is DataOps observability, exactly? And what does it look like in a practical sense for the day-to-day life of data application developers or data engineers or data team business leaders? 

        In the Unravel virtual panel discussion What Is Data Observability? Sanjeev Mohan, principal with SanjMo and former Research Vice President at Gartner, lays out the business context and driving forces behind DataOps observability, and Chris Santiago, Unravel Vice President of Solutions Engineering, shows how different roles use the Unravel DataOps observability platform to address performance, cost, and quality challenges. 

        Why (and what is) DataOps observability?

        Sanjeev opens the conversation by discussing the top three driving trends he’s seeing from talking with data-driven organizations, analysts, vendors, and fellow leaders in the data space. Specifically, how current economic headwinds are causing reduced IT spend—except in cloud computing and, in particular, data and analytics. Second, with the explosion of innovative new tools and technologies, companies are having difficulty in finding people who can tie together all of these moving pieces and are looking to simplify this increasing complexity. Finally, more global data privacy regulations are coming into force while data breaches continue unabated. Because of these market forces, Sanjeev says, “We are seeing a huge emphasis on integrating, simplifying, and understanding what happens between a data producer and a data consumer. What happens between these two is data management, but how well we do the data management is what we call DataOps.”


        Sanjeev presents his definition of DataOps, why it has matured more slowly than its cousin DevOps, and the kind(s) of observability that is critical to DataOps—data pipeline reliability and trust in data quality—and how his point of view has evolved to now include a third dimension: demonstrating the business value of data through BizOps and FinOps. Taken together, these three aspects (performance, cost, quality) give all the different personas within a data team the observability they need. This is what Unravel calls DataOps observability.

        See Sanjeev’s full presentation–no registration!
        Video on demand

        DataOps observability in practice with Unravel

        Chris Santiago walks through how the various players on a data team—business leadership, application/pipeline developers and engineers, data analysts—use Unravel across the three critical vectors of DataOps observability: performance (reliability of data applications/pipelines), quality (trust in the data), and cost (value/ROI modern data stack investments).

        Cost (value, ROI)

        First up is how Unravel DataOps observability provides deep visibility and actionable intelligence into cost governance. As opposed to the kind of dashboards that cloud providers themselves offer—which are batch-processed aggregated summaries—Unravel lets you drill down from that 10,000-foot view into granular details to see exactly where the money is going. Chris uses a Databricks chargeback report example, but it would be similar for Snowflake, EMR, or GCP. He shows how with just a click, you can filter all the information collected by Unravel to see with granular precision which workspaces, teams, projects, even individual users, are consuming how many resources across your entire data estate—in real time. 

        From there, Chris demonstrates how Unravel can easily set up budget tracking and automated guardrails for, say, a user (or team or project or any other tagging category that makes sense to your particular business). Say you want to track usage by the metric of DBUs; you set the budget/guardrail at a predefined threshold and get real-time status insight into whether that user is on track or is in danger of exceeding the DBU budget. You can set up alerts to get ahead of usage instead of getting notified only after you’ve blown a monthly budget.

        See more self-paced interactive product tours here

        Performance (reliability)

        Chris then showed how Unravel DataOps observability helps the people who are actually on the hook for data pipeline reliability, making sure everything is working as expected. When applications or pipelines fail, it’s a cumbersome task to hunt down and cobble together all the information from disparate systems (Databricks, Airflow, etc.) to understand the root cause and figure out the next steps. Chris shows how Unravel collects and correlates all the granular details about what’s going on and why from a pipeline view. And then how you can drill down into specific Spark jobs. From a single pane of glass, you have all the pertinent logs, errors, metrics, configuration settings, etc., from Airflow or Spark or any other component. All the heavy lifting has been done for you automatically.

        But where Unravel stands head and shoulders above everyone else is its AI-powered analysis. Automated root cause analysis diagnoses why jobs failed in seconds, and pinpoints exactly where in a pipeline something went wrong. For a whole class of problems, Unravel goes a step further and provides crisp, prescriptive recommendations for changing configurations or code to improve performance and meet SLAs. 

        See more self-paced interactive product tours here

        Data quality (trust)

        Chris then pivots away from the processing side of things to look at observability of the data itself—especially how Unravel provides visibility into the characteristics of data tables and helps prevent bad data from wreaking havoc downstream. From a single view, you can understand how large tables are partitioned, size of the data, who’s using the data tables (and how frequently), and a lot more information. But what may be most valuable is the automated analysis of the data tables. Unravel integrates external data quality checks (starting with the open source Great Expectations) so that if certain expectations are not met—null values, ranges, number of final columns—Unravel flags the failures and can automatically take user-defined actions, like alert the pipeline owner or even kill a job that fails a data quality check. At the very least, Unravel’s lineage capability enables you to track down where the rogue data got introduced and what dependencies are affected.
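
To make the idea of externally defined checks concrete, here is a small sketch using Great Expectations’ classic Pandas API to run the kinds of checks mentioned above (nulls, ranges, column counts) and fail the run before bad data propagates. This illustrates the open source library on its own, not Unravel’s integration; the file, columns, and expected values are hypothetical.

```python
# A hedged sketch of standalone data quality checks with Great Expectations'
# classic Pandas API. Table, column names, and expected values are hypothetical.
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_parquet("daily_orders.parquet"))

results = [
    df.expect_column_values_to_not_be_null("order_id"),
    df.expect_column_values_to_be_between("order_total", min_value=0, max_value=100_000),
    df.expect_table_column_count_to_equal(12),
]

failures = [r for r in results if not r.success]
if failures:
    # In a pipeline, this is where you would alert the owner or stop the job.
    for r in failures:
        print("FAILED:", r.expectation_config.expectation_type)
    raise SystemExit(1)

print("All data quality checks passed")
```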

        See Chris’s entire walk-through–no registration!
        Video on demand

        Whether it’s engineering teams supporting data pipelines, developers themselves making sure they hit their SLAs, budget owners looking to control costs, or business leaders who funded the technology looking to realize full value—everyone who constitutes a “data team”—DataOps observability is crucial to ensuring that data products get delivered reliably on time, every time, in the most cost-efficient manner.

        The Q&A session

        As anybody who’s ever attended a virtual panel discussion knows, sometimes the Q&A session is a snoozefest, sometimes it’s great. This one is the latter. Some of the questions that Sanjeev and Chris field:

        • If I’m migrating to a modern data stack, do I still need DataOps observability, or is it baked in?
        • How is DataOps observability different from application observability?
        • When should we implement DataOps observability?
        • You talked about FinOps and pipeline reliability. What other use cases come up for DataOps observability?
        • Does Unravel give us a 360 degree view of all of the tools in the ecosystem, or does it only focus on data warehouses like Snowflake?

        To jump directly to the Q&A portion of the event, click on the video image below.

        The post Panel recap: What Is DataOps observability? appeared first on Unravel.

        Sneak Peek into Data Teams Summit 2023 Agenda https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/ https://www.unraveldata.com/resources/sneak-peek-into-data-teams-summit-2023-agenda/#respond Thu, 05 Jan 2023 22:47:29 +0000 https://www.unraveldata.com/?p=10895

        The Data Teams Summit 2023 is just around the corner!

This year, on January 25, 2023, we’re taking the peer-to-peer empowerment of data teams one step further, transforming DataOps Unleashed into Data Teams Summit to better reflect our focus on the people—data teams—who unlock the value of data.

        Data Teams Summit is an annual, full-day virtual conference, led by data rockstars at future-forward organizations about how they’re establishing predictability, increasing reliability, and creating economic efficiencies with their data pipelines.

        Check out full agenda and register
        Get free ticket

        Join us for sessions on:

        • DataOps best practices
        • Data team productivity and self-service
        • DataOps observability
        • FinOps for data teams
        • Data quality and governance
        • Data modernizations and infrastructure

        The peer-built agenda is packed, with over 20 panel discussions and breakout sessions. Here’s a sneak peek at some of the most highly anticipated presentations:

        Keynote Panel: Winning strategies to unleash your data team


        Great data outcomes depend on successful data teams. Every single day, data teams deal with hundreds of different problems arising from the volume, velocity, variety—and complexity—of the modern data stack.

        Learn best practices and winning strategies about what works (and what doesn’t) to help data teams tackle the top day-to-day challenges and unleash innovation.

        Breakout Session: Maximize business results with FinOps


        As organizations run more data applications and pipelines in the cloud, they look for ways to avoid the hidden costs of cloud adoption and migration. Teams seek to maximize business results through cost visibility, forecast accuracy, and financial predictability.

        In this session, learn why observability matters and how a FinOps approach empowers DataOps and business teams to collaboratively achieve shared business goals. This approach uses the FinOps Framework, taking advantage of the cloud’s variable cost model, and distributing ownership and decision-making through shared visibility to get the biggest return on their modern data stack investments.

        See how organizations apply agile and lean principles using the FinOps framework to boost efficiency, productivity, and innovation.

        Breakout Session: Going from DevOps to DataOps


        DevOps has had a massive impact on the web services world. Learn how to leverage those lessons and take them further to improve the quality and speed of delivery for analytics solutions.

        Ali’s talk will serve as a blueprint for the fundamentals of implementing DataOps, laying out some principles to follow from the DevOps world and, importantly, adding subject areas required to get to DataOps—which participants can take back and apply to their teams.

        Breakout Session: Becoming a data engineering team leader


        As you progress up the career ladder for data engineering, responsibilities shift as you start to become more hands-off and look at the overall picture rather than a project in particular.

        How do you ensure your team’s success? It starts with focusing on the team members themselves.

        In this talk, Matt Weingarten, a lead Data Engineer at Disney Streaming, will walk through some of his suggestions and best practices for how to be a leader in the data engineering world.

        Attendance is free! Sign up here for a free ticket

        The post Sneak Peek into Data Teams Summit 2023 Agenda appeared first on Unravel.

5 highlights from the Unravel Roadmap 2023 preview https://www.unraveldata.com/resources/5-highlights-from-the-unravel-roadmap-2023-preview/ https://www.unraveldata.com/resources/5-highlights-from-the-unravel-roadmap-2023-preview/#respond Wed, 21 Dec 2022 21:33:41 +0000 https://www.unraveldata.com/?p=10693

        What are the (rapidly) evolving trends in the data space? And how is Unravel responding—especially for FinOps? What’s on the product roadmap for 2023? How can you get the most out of your Unravel deployment today?

        Unravel leadership presented a virtual year-end update to talk about emerging industry developments, give a sneak peek into what’s in store for 2023, and share our customers’ best practices on how they’re using Unravel to maximize business value.

        Here are some of the highlights, with video clips.

        The rapid rise of FinOps

        CEO Kunal Agarwal discusses his vision for Unravel and how cost governance and cost optimization has become a top challenge for data teams running workloads in the cloud. Some 60% of organizations encounter cloud cost overruns. Some of that is due to increased workloads, but another big reason is that companies just have difficulty understanding how to run more efficiently. But that kind of “actionable intelligence” is the foundation of FinOps—engineering, finance, and business teams making collaborative data-driven decisions about spending.

It’s the same kind of insights—at both a granular and holistic level—that you need to tackle performance issues and/or to migrate to the cloud reliably and efficiently. Kunal shows how customers are using Unravel to realize tangible performance and cost benefits, and recaps our recent $50 million Series D funding.

        Unravel roadmap for 2023

        Vice President of Product Eric Chu gives a glimpse into what’s ahead for the Unravel platform. Over the next year, you’ll see Unravel expand its capabilities to include integration with data quality solutions, expand its coverage on the modern data stack, and introduce additional FinOps observability, intelligence, and automation features. Specifically:

• Snowflake support
        • Databricks SQL and Unity Catalog support
        • Amazon EMR Serverless, as well as Glue and Kubernetes
        • Integration with data quality solutions, starting with Great Expectations
        • Integration with ServiceNow, PagerDuty, Jira, and more
        • Top-down FinOps dashboards, as well as enhanced AI insights

        We are collaborating with customers now to get design input and user feedback, enabling early access to preview features.

        Unlocking value with Unravel

        There are probably features in Unravel you already have access to but may not be leveraging to their fullest extent. Chris Santiago, Unravel VP of Solutions Engineering, digs into some hidden treasures and best practices that other customers use on a daily basis.

Chris walks through a FinOps-focused use case demo to show how Unravel’s capabilities help at each stage of the cost-optimization lifecycle: reporting (visualize, track, allocate), self-service insights (reduce, improve, empower), and guardrails & automation (control, prevent).

His demo takes us from a bird’s-eye view of Databricks costs, then walks through the various Unravel reports you can access to drill down and get precise insights—cost breakdown, user and usage, node downsizing recommendations, event analysis—into where the spend is going at a granular level, where there’s waste, where optimization can have the greatest impact, and so on. Then Chris touches on how Unravel’s AI recommendations fit in, and runs through how to set up automated guardrails and real-time budget-tracking alerts to proactively control cloud costs.

        The demo is specific to Databricks, but there are dozens of reports available for any environment Unravel supports. Some of them are in beta, so reach out to your technical account manager (TAM) for access to or training on these reports. In the meantime, explore these key Unravel features.

        New professional services offering

Unravel’s Global Head of Customer Success, Matt Sangar, introduces our new professional services to help you leverage Unravel’s full capabilities. Whether it’s improving application performance, lowering cloud costs, reducing MTTR and support ticket volume, or improving data team productivity, Unravel professional services can help you get the most out of Unravel quickly. Two new roles have been added to your Customer Success team, an Unravel engineer and a project manager, who will work with you as an extension of your team.

        Expanded training options

Unravel is also introducing new on-site training enablement for in-person learning, at basic, advanced, and custom levels, for Admin and Ops teams, AppDev and Support teams, and FinOps—budget management training for Finance counterparts.

        Sometimes you might need extra help, or there are other teams who want/need to learn about Unravel. We now offer half- or full-day Unravel Data Days for on-site, in-person (or Zoom) workshops. We are also introducing office hours, when users can sit down with Unravel engineers to troubleshoot apps in real time.

        For more information about our new training programs, professional services, access to beta reports, or anything else, please contact your TAM and we’ll take care of you!

        The post 5 highlights from the Unravel Roadmap 2023 preview appeared first on Unravel.

Our 4 key takeaways from the 2022 Gartner® Market Guide for DataOps Tools https://www.unraveldata.com/resources/our-4-key-takeaways-from-the-2022-gartner-market-guide-for-dataops-tools/ https://www.unraveldata.com/resources/our-4-key-takeaways-from-the-2022-gartner-market-guide-for-dataops-tools/#respond Thu, 15 Dec 2022 02:28:27 +0000 https://www.unraveldata.com/?p=10667


        Unravel is recognized as a Representative Vendor in the DataOps Market in the 2022 Gartner Market Guide for DataOps Tools.

        Data teams are struggling to keep pace with the increased volume, velocity, variety—and complexity—of their data applications/pipelines. They are facing many of the same (generic) challenges that software teams did 10+ years ago. Just as DevOps helped streamline web application development and make software teams more productive, DataOps aims to do the same thing for data applications.

        According to a Gartner strategic planning assumption from this Market Guide, “by 2025, a data engineering team guided by DataOps practices and tools will be 10 times more productive than teams that do not use DataOps.”

        The report states that “the DataOps market is highly fragmented.” This Gartner Market Guide analyzes the DataOps market and explains the various capabilities of DataOps tools, paints a picture of the DataOps tool landscape, and offers recommendations.

Here is our understanding of DataOps and some of the key points we took away from the Gartner Market Guide:

        The way data teams are doing things today isn’t working.

One of the key findings in the Gartner report is that “a DataOps tool is a necessity to reduce the use of custom solutions and manual efforts around data pipeline operations. Buyers seek DataOps tools to streamline their data operations.” In our opinion, manual effort and custom solutions require prohibitively vast amounts of time and expertise—both of which are already in short supply within enterprise data teams.

        No single DataOps tool does everything.

        DataOps covers a lot of ground. Gartner defines the core capabilities of DataOps tools as orchestration (including connectivity, workflow automation, data lineage, scheduling, logging, troubleshooting and alerting), observability (monitoring live/historic workflows, insights into workflow performance and cost metrics, impact analysis), environment management (infrastructure as code, resource provisioning, environment repository templates, credentials management), deployment automation, and test automation. We think it’s clear that no one tool does it all. The report recommends, “Follow the decision guidelines and avoid multibuy circumstances by understanding the diverse market landscape and focusing on a desired set of core capabilities.”

        DataOps tools break down silos.

        A consequence of modern data stack complexity is that disparate pockets of experts all run their own particular tools of choice in silos. Communication and collaboration break down, and you wind up with an operational Tower of Babel that leads to missed SLAs, friction and finger-pointing, and everybody spending way too much time firefighting. DataOps tools are designed specifically to avoid this unsustainable situation.

        Choose DataOps tools that give a single pane of glass view.

The Gartner report recommends to “prioritize DataOps tools that give you a ‘single pane of glass’ for diverse data workloads across heterogeneous technologies with orchestration, lineage, and automation capabilities.” The modern data stack is a complex collection of different systems, platforms, technologies, and environments, and most enterprises use a combination of them all. You need a DataOps tool that works with all kinds of workloads in this heterogeneous stack for different capabilities and different reasons—which can be boiled down to the three dimensions of performance, cost, and quality.

        We are excited that Gartner recognized Unravel as a Representative Vendor in the DataOps market, and we couldn’t agree more with their Market Guide for DataOps Tools recommendations.

         

        GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

The post Our 4 key takeaways from the 2022 Gartner® Market Guide for DataOps Tools appeared first on Unravel.

        How Unravel Helps FinOps Control Data Cloud Costs https://www.unraveldata.com/resources/how-unravel-helps-finops-control-data-cloud-costs/ https://www.unraveldata.com/resources/how-unravel-helps-finops-control-data-cloud-costs/#respond Mon, 31 Oct 2022 13:41:28 +0000 https://www.unraveldata.com/?p=10573


        As most organizations that have started to run a lot of data applications and pipelines in the cloud have found out, it’s really easy for things to get really expensive, really fast. It’s not unusual to see monthly budget overruns of 40% or more, or for companies to have burned through their three-year data cloud budget by early in year two. Consequently, we’re seeing that cloud migration projects and modernization initiatives are stalling out. Plans to scale up usage of modern platforms (think: Databricks and Snowflake, Amazon EMR, Google BigQuery and Dataproc, Azure) are hitting a wall. 

Cloud bills are usually an organization’s biggest IT expense, and the sheer massive size of data workloads is driving most of the cloud bill. Many companies are feeling ambushed by their monthly cloud bills, and simply understanding where the money is going is a challenge—let alone figuring out where and how to rein in those costs and keep them under control. Capacity/budget forecasting becomes a guessing game. There’s a general lack of individual accountability for waste/abuse of cloud resources. It feels like the Wild West, where everybody is spinning up instances left and right, without enough governance or control over how cloud resources are being used.

        Enter the emerging discipline of FinOps. Sometimes called cloud cost management or cloud optimization, FinOps is an evolving cloud financial management practice that, in the words of the FinOps Foundation, “enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.”

        The principles behind the FinOps lifecycle

        It’s important to bear in mind that a FinOps approach isn’t just about slashing costs—although you almost invariably will wind up saving money. It’s about empowering data engineers and business teams to make better choices about their cloud usage and derive the most value from their modern data stack investments.


        Controlling costs consists of three iterative phases along a FinOps lifecycle:

        • Observability: Getting visibility into where the money is going, measuring what’s happening in your cloud environment, understanding what’s going on in a “workload-aware” context 
        • Optimization: Seeing patterns emerge where you can eliminate waste, removing inefficiencies, actually making things better
        • Governance: Going from reactive problem-solving to proactive problem-preventing, sustaining iterative improvements, automating guardrails, enabling self-service optimization

        Each phase builds upon the former, to create a virtuous cycle of continuous improvement and empowerment for individual team members—regardless of expertise—to make better decisions about their cloud usage while still hitting their SLAs. In essence, this shifts the budget left, pulling accountability for controlling costs forward.

        How Unravel puts FinOps lifecycle principles into practice

        Unravel helps turn this conceptual FinOps framework into practical cost governance reality. You need four things to make this FinOps lifecycle work—all of which Unravel is uniquely able to provide:

        1. You need to capture the right kind of details at a highly granular level from all the various systems in your data stack—horizontally and vertically—from the application down to infrastructure and everything in between. 
        2. All this deep information needs to be correlated into a “workload-aware” business context—cost governance is not just an infrastructure issue, and you need to get a holistic understanding of how everything works together: apps, pipelines, users, data, as well as infrastructure resources.
        3. You need to be able to automatically identify opportunities to optimize—oversized resources, inefficient code, data tiering—and then make it easy for engineers to implement those optimizations.
        4. Go from reactive to proactive—not just perpetually scan the horizon for optimization opportunities and respond to them after the fact, but leverage AI to predict capacity needs accurately, implement automated governance guardrails to keep things under control, and even launch corrective actions automatically.

        Observability: Understanding costs in context

        The first step to controlling costs is to understand what’s happening in your cloud environment—who’s spending what, where, why. The key here is to measure everything with precision and in context. Cloud vendor billing consoles (and third-party cost management tools) give you only an aggregated view of total spend on various services (compute, storage, platform), and monthly cloud bills can be pretty opaque with hundreds of individual line items that, again, don’t go any deeper than service type. There’s nothing that tells you which applications are running, who’s submitting them, which datasets are actually touched, and other contextual details.

        To counter such obscurity, many organizations employ some sort of cost-allocation tagging taxonomy to categorize and track resource usage. If you already have such a classification in place, Unravel can easily adopt it without having to reinvent the wheel. If you don’t, it’s simple to implement such an approach in Unravel.

        Two things to consider when looking at tagging capabilities: 

        • How deep and atomic is the tagging? 
        • What’s the frequency?

        Unravel lets you apply tags at a highly granular level: by department, team, workload, application, even down to the individual job (or sub-part of a job) or specific user. And it happens in real time—you don’t have to wait around for the end of the monthly billing cycle. 
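To make this concrete, here is a minimal sketch of what a cost-allocation tag taxonomy can look like when attached to a cluster definition. The tag keys, values, and the Databricks-style custom_tags field are illustrative assumptions for the example, not Unravel's tagging schema:

```python
# Hypothetical cost-allocation taxonomy. Keys and values are illustrative only.
cost_tags = {
    "department": "marketing-analytics",
    "team": "growth-data-eng",
    "workload": "daily-attribution-etl",
    "owner": "jane.doe@example.com",
    "environment": "prod",
}

# Attach the tags to a Databricks-style cluster spec. Most cloud data platforms
# propagate custom tags down to the underlying cloud resources, so the tags
# show up as dimensions you can group by when the bill arrives.
cluster_spec = {
    "cluster_name": "attribution-etl",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "custom_tags": cost_tags,
}
```

Because the tags travel with the workload itself, every job the cluster runs can be rolled up by department, team, workload, or owner.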

        These twin capabilities of capturing and correlating highly granular details and delivering real-time information are not merely nice-to-haves, but must-haves when it comes to the practical implementation of the FinOps lifecycle framework. They enable Unravel to:

• Track actual spend vs. budget. Know ahead of time whether workload, user, cluster, etc., usage is projected to be on target, is at risk, or has already gone over budget. Preemptively prevent budget overruns rather than getting hit with sticker shock at the end of the month. (A toy version of this kind of projection is sketched after this list.)

Check out the 90-second Automated Budget Tracking interactive demo.

• Identify the big spenders. Know which projects, teams, applications, tables, queues, users, etc. are consuming the most cloud resources.

        • Understand trends and patterns. Visualize how the cost of clusters changes over time. Understand seasonality-driven peaks and troughs—are you using a ton of resources only on Saturday night? Same time every day? When is there quiet time?—to identify opportunities for improvement.


        • Implement precise chargebacks/showbacks. Automatically generate cost-allocation reports down to the penny. Pinpoint who’s spending how much, where, and why. Because Unravel captures all the deep, nitty-gritty details about what’s running across your data stack—horizontally and vertically—and correlates them all in a “workload-aware” business context, you can finally solve the problem of allocating costs of shared resources.

Check out the 90-second Chargeback Report interactive demo.

        • Forecast capacity with confidence. Run regularly scheduled reports that analyze historical usage and trends to predict future needs.

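As a back-of-the-envelope illustration of the budget-tracking idea above (not Unravel's implementation), projecting month-end spend from month-to-date costs and flagging the status can be as simple as the sketch below; the numbers and thresholds are made up:

```python
from datetime import date
import calendar

def project_month_end_spend(daily_costs, budget, today):
    """Project month-end spend from month-to-date daily costs (USD) and flag status."""
    spent = sum(daily_costs)
    days_elapsed = len(daily_costs)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spent / days_elapsed * days_in_month if days_elapsed else 0.0

    if spent > budget:
        return f"OVER BUDGET: ${spent:,.0f} spent vs. ${budget:,.0f} budgeted"
    if projected > budget:
        return f"AT RISK: projected ${projected:,.0f} vs. ${budget:,.0f} budgeted"
    return f"ON TARGET: projected ${projected:,.0f} vs. ${budget:,.0f} budgeted"

# Example: 18 days into a 30-day month, ~$52K already spent against an $80K budget.
print(project_month_end_spend([2_900.0] * 18, 80_000.0, date(2022, 11, 18)))
# -> AT RISK: projected $87,000 vs. $80,000 budgeted
```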

        The observability phase is all about empowering data teams with visibility, precise allocation, real-time budgeting information, and accurate forecasting for cost governance. It provides a 360° view of what’s going on in your cloud environment and how much it’s costing you.

        Optimization: Having AI do the heavy lifting

        If the observability phase tells you what’s happening and why, the optimization phase is about taking that information to identify where you can eliminate waste, remove inefficiencies, or leverage different instance types that are less expensive without sacrificing SLAs.

        In theory, that sounds pretty obvious and straightforward. In practice, it’s anything but. First of all, figuring out where there’s waste is no small task. In an enterprise that runs tens (or hundreds) of thousands of jobs every month, there are countless decisions—at the application, pipeline, and cluster level—to make about where, when, and how to run those jobs. And each individual decision carries a price tag. 

And the people making those decisions are not experts in allocating resources; their primary concern and responsibility is to make sure the job runs successfully and meets its SLA. Even an enterprise-scale organization could probably count the number of people with such operational expertise on one hand.


        Most cost management solutions can only identify idle clusters. Deployed but no-longer-in-use resources are an easy target but represent only 10% of potential savings. The other 90% lie below the surface.

        Digging into the weeds to identify waste and inefficiencies is a laborious, time-consuming effort that these already overburdened experts don’t really have time for.

        AI is a must

        Yet generating and sharing timely information about how and where to optimize so that individuals can assume ownership and accountability for cloud usage/cost is a bedrock principle of FinOps. This is where AI is mandatory, and only Unravel has the AI-powered insights and prescriptive recommendations to empower data teams to take self-service optimization action quickly and easily.

        What Unravel does is take all the full-stack observability information and “throw some math at it”—machine learning and statistical algorithms—to understand application profiles and analyze what resources are actually required to run them vs. the resources that they’re currently consuming. 

        This is where budgets most often go off the rails: overprovisioned (or underutilized) clusters and jobs due to instances and containers that are based on perceived need rather than on actual usage. What makes Unravel unique in the marketplace is that its AI not only continuously scours your data environment to identify exactly where you have allocated too many or oversized resources but gives you crisp, prescriptive recommendations on precisely how to right-size the resources in question.
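As a rough sketch of the right-sizing idea (Unravel's AI models are considerably richer), a recommendation boils down to comparing what a job asked for with what it actually used, plus some headroom; the job names, numbers, and thresholds below are invented for illustration:

```python
from typing import Optional

def rightsize_memory(requested_gb: float, peak_used_gb: float,
                     headroom: float = 0.25, floor_gb: float = 4.0) -> Optional[float]:
    """Return a suggested memory setting in GB, or None if already well sized."""
    suggested = max(peak_used_gb * (1 + headroom), floor_gb)
    # Only recommend a change if it frees a meaningful share of the allocation.
    if suggested < requested_gb * 0.8:
        return float(round(suggested))
    return None

jobs = {
    "nightly_sessionization": (32.0, 11.5),   # (requested GB, observed peak GB)
    "hourly_clickstream_agg": (16.0, 14.2),
}
for name, (req, used) in jobs.items():
    rec = rightsize_memory(req, used)
    print(f"{name}: " + (f"downsize {req:g} GB -> {rec:g} GB" if rec else "leave as is"))
```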

Check out the 90-second AI Recommendations interactive demo.

        But unoptimized applications are not solely infrastructure issues. Sometimes it’s just bad code. Inefficient or problematic performance wastes money, too. We’ve seen how a single bad join on a multi-petabyte table kept a job running all weekend and wound up costing the company over $500,000. Unravel can prevent this from happening in the first place; the AI understands what your app is trying to do and can tell you this app submitted in this fashion is not efficient—pointing you to the specific line of code causing problems.

Check out the 90-second Code-Level Insights interactive demo.

Every cloud provider has auto-scaling options. But what should you auto-scale, to what, and when? Because Unravel has all this comprehensive, granular data in “workload-aware” context, the AI understands trends, usage, and access patterns to help you predict the days of the week and times of day when auto-scaling is appropriate (or find better times to run jobs). Workload heatmaps based on actual usage make it easy to visualize when, where, and how to scale resources intelligently.
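A crude version of such a usage heatmap can be built by bucketing job start times into day-of-week and hour-of-day slots; the timestamps here are made up for the example:

```python
from collections import Counter
from datetime import datetime

starts = [
    "2022-10-01 02:05", "2022-10-01 02:40", "2022-10-02 23:10",
    "2022-10-03 09:15", "2022-10-08 02:20", "2022-10-08 02:55",
]
usage = Counter()
for s in starts:
    t = datetime.strptime(s, "%Y-%m-%d %H:%M")
    usage[(t.strftime("%a"), t.hour)] += 1

# Busiest (day, hour) slots are candidates for scheduled scale-up; quiet slots
# are where aggressive scale-down is safest.
for (day, hour), n in usage.most_common(3):
    print(f"{day} {hour:02d}:00 -> {n} job starts")
```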

        You can save 80-90% on storage costs through data tiering, moving less frequently used data to less expensive options. Most enterprises have petabytes of data, and they’re not using all of it all of the time. Unravel shows which datasets are (not) being used, applying cold/warm/hot labels based on age or usage, so you understand which ones haven’t been touched in months yet still sit on expensive storage.
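The hot/warm/cold labeling can be illustrated with a simple age-based rule; the 30- and 120-day thresholds, table names, and dates below are assumptions for the sketch, not Unravel's defaults:

```python
from datetime import datetime, timedelta

def tier(last_access: datetime, now: datetime) -> str:
    """Label a dataset by how recently it was accessed."""
    age = now - last_access
    if age <= timedelta(days=30):
        return "hot"
    if age <= timedelta(days=120):
        return "warm"
    return "cold"   # candidate for a cheaper storage tier

now = datetime(2022, 10, 31)
tables = {
    "sales.orders_current": datetime(2022, 10, 29),
    "sales.orders_q2_snapshot": datetime(2022, 8, 15),
    "marketing.campaign_2021_archive": datetime(2021, 11, 14),
}
for name, last in tables.items():
    print(f"{name}: {tier(last, now)}")
```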

        The optimization phase is where you take action. This is the hardest part of the FinOps lifecycle to put into practice and where Unravel shines. The enormity and complexity of today’s data applications/pipelines comprise hundreds of thousands of places where waste, inefficiencies, or better alternatives exist. Rooting them out and then getting the insights to optimize require the two things no company has enough of: time and expertise. Unravel automatically identifies where you could do better—at the user, job, cluster level—and tells you the exact parameters to apply that would improve costs.

        Governance: Going from reactive to proactive

        Reducing costs reactively is good. Controlling them proactively is better. Finding and “fixing” waste and inefficiencies too often relies on the same kind of manual firefighting for FinOps that bogs down data engineers who need to troubleshoot failed or slow data applications. In fact, resolving cost issues usually relies on the same handful of scarce experts who have the know-how to resolve performance issues. In many respects, cost and performance are flip sides of the same coin—you’re looking at the same kind of granular details correlated in a holistic workload context, only from a slightly different angle. 

        AI and automation are crucial. Data applications are simply too big, too complex, too dynamic for even the most skilled humans to manage by hand. What makes FinOps for data so difficult is that the thousands upon thousands of cost optimization opportunities are constantly recasting themselves. Unlike software applications, which are relatively static products, data applications are fluid and ever-changing. Data cloud cost optimization is not a straight line with a beginning and end, but rather a circular loop: the AI-powered insights used to gain contextual full-stack observability and actually implement cost optimization are harnessed to give everyone on the data team at-a-glance understanding of their costs, implement automated guardrails and governance policies, and enable non-expert engineers to make expert-level optimizations via self-service. After all, the best way to spend less time firefighting is to avoid there being a fire in the first place.

• With Unravel, you can set up customizable, governance policy-based automated guardrails that set boundaries for any business dimension (looming budget overruns, jobs that exceed size/time/cost thresholds, etc.). Preventing particular users, apps, or business units from exceeding certain behaviors has a profound impact on reining in cloud spend. (A toy policy sketch follows this list.)
        • Proactive alerts can be triggered whenever a guardrail constraint is violated. Alerts can be sent to the individual user—or sent up the chain of command—to let them know their job will miss its SLA or cost too much money and they need to find less expensive options, rewrite it to be more efficient, reschedule it, etc. 
        • Governance policy violations can even trigger preemptive corrective actions. Unravel can automatically take “circuit breaker” remediation to kill jobs or applications (even clusters), request configuration changes, etc., to put the brakes on rogue users, runaway jobs, overprovisioned resources, and the like.
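To show the shape of such a guardrail, here is a toy policy evaluator. The field names, scopes, and thresholds are invented for illustration and are not Unravel's AutoAction schema:

```python
# Each policy expresses a boundary (scope + metric + threshold) and an action
# (alert, or kill as a "circuit breaker"). All values are made up.
policies = [
    {"scope": "user:interns", "metric": "job_cost_usd", "threshold": 100, "action": "alert"},
    {"scope": "cluster:adhoc", "metric": "job_runtime_min", "threshold": 180, "action": "kill"},
]

def evaluate(event):
    """Return the actions triggered by a single job event."""
    triggered = []
    for p in policies:
        in_scope = p["scope"] in event.get("scopes", [])
        if in_scope and event.get(p["metric"], 0) > p["threshold"]:
            triggered.append(f'{p["action"]} ({p["metric"]} > {p["threshold"]})')
    return triggered

weekend_job = {"scopes": ["user:interns", "cluster:adhoc"],
               "job_cost_usd": 450, "job_runtime_min": 260}
print(evaluate(weekend_job))
# -> ['alert (job_cost_usd > 100)', 'kill (job_runtime_min > 180)']
```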

The governance phase of the FinOps lifecycle deals with getting ahead of cost issues before the fact rather than afterwards. Establishing control in a highly distributed environment is a challenge for any discipline, but Unravel empowers individual users with self-service capabilities that enable them to take individual accountability for their cloud usage, automatically alerts on potential budget-busting, and can even take preemptive corrective action without any human involvement. It’s not quite self-healing, but it’s the next-best thing—the true spirit of AIOps.

        The post How Unravel Helps FinOps Control Data Cloud Costs appeared first on Unravel.

3-Minute Recap: Unlocking the Value of Cloud Data and Analytics https://www.unraveldata.com/resources/unlocking-the-value-of-cloud-data-and-analytics/ https://www.unraveldata.com/resources/unlocking-the-value-of-cloud-data-and-analytics/#respond Wed, 05 Oct 2022 20:54:06 +0000 https://www.unraveldata.com/?p=10453


        DBTA recently hosted a roundtable webinar with four industry experts on “Unlocking the Value of Cloud Data and Analytics.” Moderated by Stephen Faig, Research Director, Unisphere Research and DBTA, the webinar featured presentations from Progress, Ahana, Reltio, and Unravel.

        You can see the full 1-hour webinar “Unlocking the Value of Cloud Data and Analytics” below.

        Here’s a quick recap of what each presentation covered.

        Todd Wright, Global Product Marketing Manager at Progress, in his talk “More Data Makes for Better Analytics” showed how Progress DataDirect connectors let users get to their data from more sources securely without adding a complex software stack in between. He quoted former Google research director Peter Norvig (now fellow at the Stanford Institute for Human-Centered Artificial Intelligence) about how more data beats clever algorithms: “Simple models and a lot of data trump more elaborate models based on less data.” He then outlined how Progress DataDirect and its Hybrid Data Pipeline platform uses standard-based connectors to expand connectivity options of BI and analytics tools, access all your data using a single connector, and make on-prem data available to cloud BI and analytics tools without exposing ports or implementing costly and complex VPN tunnels. He also addressed how to secure all data sources behind your corporate authentication/identity to mitigate the risk of exposing sensitive private information (e.g., which tables and columns to expose to which people) and keep tabs on data usage for auditing and compliance.

Rachel Pedreschi, VP of Technical Services at Ahana, presented “4 Easy Tricks to Save Big Money on Big Data in the Cloud.” She first simplified the cloud data warehouse into its component parts (data on disk + some kind of metadata + a query layer + a system that authenticates users and allows them to do stuff). Then she broke it down to see where you could save some money. Starting at the bottom, the storage layer, she said data lakes are a more cost-effective way of providing data to users throughout the organization. At the metadata level, Hive Metastore or AWS Glue are less expensive options. For authentication, she mentioned Apache Ranger or AWS Lake Formation. But what about SQL? For that she has Presto, an open source project that came out of Facebook as a successor to Hive. Presto is a massively scalable distributed in-memory system that allows you to write queries not just against data lake files but against other databases as well. She calls this collectively the Open SQL Data Lakehouse. Ahana is an AWS-based service that gets you up and running with this Open Data Lakehouse in 30 minutes.

        How can DataOps observability help unlock the value of cloud data and analytics?
        Download the white paper

Mike Frasca, Field Chief Technology Officer at Reltio, discussed the value and benefits of a modern master data management platform. He echoed previous presenters’ points about how the volume, velocity, and variety of data have become almost overwhelming, especially given how highly siloed and fragmented data is today. Data teams spend most of their time getting the data ready for insights—data consolidation, aggregation, and cleansing; synchronizing and standardizing data; ensuring data quality, timeliness, and accuracy, etc.—rather than actually delivering insights from analytics. He outlined the critical functions a master data management (MDM) platform should provide to deliver precision data. Entity management automatically unifies data into a dynamic enterprise source of truth, including context-aware master records. Data quality management continuously validates, cleans, and standardizes the data via custom validation rules. Data integration receives input from and distributes mastered data to any application or data store in real time and at high volumes. He emphasized that what characterizes a “modern” MDM is the ability to access data in real time—so you’re not making decisions based on data that’s a week or a month old. Cloud MDM extends the capabilities to relationship management, data governance, and reference data management. He wound up his presentation with compelling real-life examples of customer efficiency gains and effectiveness, including Forrester TEI (total economic impact) results on ROI.

Chris Santiago, VP of Solutions Engineering at Unravel, presented how AI-enabled optimization and automatic insights help unlock the value of cloud data. He noted how nowadays every company is a data company. If they’re not leveraging data analytics to create strategic business advantages, they’re falling behind. But he illustrated how the complexity of the modern data stack is slowing companies down. He broke down the challenges into three C’s: cost, where enterprises are experiencing significant budget overruns; resource constraints—not infrastructure resources, but human resources—the talent gap and mismatch between supply and demand for expertise; and complexity of the stack, where 87% of data science projects never make it into production because it’s so challenging to implement them.

        But it all comes down to people—all the different roles on data teams. Analysts, data scientists, engineers on the application side; architects, operations teams, FinOps, and business stakeholders on the operations side. Everybody needs to be working off a “single source of truth” to break down silos, enable collaboration, eliminate finger-pointing, and empower more self-service. Software teams have APM to untangle this complexity—for web apps. But data apps are a totally different beast. You need observability designed specifically for data teams. You could try to stitch together the details you need for performance, cost, data quality from a smorgasbord of point tools or solutions that do some of what you need (but not all), but that’s time-consuming and misses the holistic view of how everything works together so necessary to connect the dots when you’re looking to troubleshoot faster, control spiraling cloud costs, automate AI-driven optimization (for performance and cost), or migrate to the cloud on budget and on time. That’s exactly where Unravel comes in.

        Check out the full webinar, including attendee Q&A here!

        The post 3-Minute Recap: Unlocking the Value of Cloud Data and Analytics appeared first on Unravel.

Get Ready for the Next Generation of DataOps Observability https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/ https://www.unraveldata.com/resources/get-ready-for-the-next-generation-of-dataops-observability/#respond Wed, 05 Oct 2022 00:15:52 +0000 https://www.unraveldata.com/?p=10303


        This blog was originally published by Unravel CEO Kunal Agarwal on LinkedIn in September 2022.

        I was chatting with Sanjeev Mohan, Principal and Founder of SanjMo Consulting and former Research Vice President at Gartner, about how the emergence of DataOps is changing people’s idea of what “data observability” means. Not in any semantic sense or a definitional war of words, but in terms of what data teams need to stay on top of an increasingly complex modern data stack. While much ink has been spilled over how data observability is much more than just data profiling and quality monitoring, until only very recently the term has pretty much been restricted to mean observing the condition of the data itself. 

        But now DataOps teams are thinking about data observability more comprehensively as embracing other “flavors” of observability like application and pipeline performance, operational observability into how the entire platform or system is running end-to-end, and business observability aspects such as ROI and—most significantly—FinOps insights to govern and control escalating cloud costs.

        That’s what we at Unravel call DataOps observability.


        Data teams are getting bogged down

        Data teams are struggling, overwhelmed by the increased volume, velocity, variety, and complexity of today’s data workloads. These data applications are simultaneously becoming ever more difficult to manage and ever more business-critical. And as more workloads migrate to the cloud, team leaders are finding that costs are getting out of control—often leading to migration initiatives stalling out completely because of budget overruns.

        The way data teams are doing things today isn’t working.

        Data engineers and operations teams spend way too much time firefighting “by hand.” Something like 70-75% of their time is spent tracking down and resolving problems through manual detective work and a lot of trial and error. And with 20x more people creating data applications than fixing them when something goes wrong, the backlog of trouble tickets gets longer, SLAs get missed, friction among teams creeps in, and the finger-pointing and blame game begins.

        This less-than-ideal situation is a natural consequence of inherent process bottlenecks and working in silos. There are only a handful of experts who can untangle the wires to figure out what’s going on, so invariably problems get thrown “over the wall” to them. Self-service remediation and optimization is just a pipe dream. Different team members each use their own point tools, seeing only part of the overall picture, and everybody gets a different answer to the same problem. Communication and collaboration among the team breaks down, and you’re left operating in a Tower of Babel.

        Check out our white paper DataOps Observability: The Missing Link for Data Teams
        Download here

        Accelerating next-gen DataOps observability

        These problems aren’t new. DataOps teams are facing some of the same general challenges as their DevOps counterparts did a decade ago. Just as DevOps united the practice of software development and operations and transformed the application lifecycle, today’s data teams need the same observability but tailored to their unique needs. And while application performance management (APM) vendors have done a good job of collecting, extracting, and correlating details into a single pane of glass for web applications, they’re designed for web applications and give data teams only a fraction of what they need.


        System point tools and cloud provider tools all provide some of the information data teams need, but not all. Most of this information is hidden in plain sight—it just hasn’t been extracted, correlated, and analyzed by a single system designed specifically for data teams.

        That’s where Unravel comes in.

        Data teams need what Unravel delivers—observability designed to show data application/pipeline performance, cost, and quality coupled with precise, prescriptive fixes that will allow you to quickly and efficiently solve the problem and get on to the real business of analyzing data. Our AI-powered solution helps enterprises realize greater return on their investment in the modern data stack by delivering faster troubleshooting, better performance to meet service level agreements, self-service features that allow applications to get out of development and into production faster and more reliably, and reduced cloud costs.

        I’m excited, therefore, to share that earlier this week, we closed a $50 million Series D round of funding that will allow us to take DataOps observability to the next level and extend the Unravel platform to help connect the dots from every system in the modern data stack—within and across some of the most popular data ecosystems. 

        Unlocking the door to success

By empowering data teams to spend more time on innovation and less time firefighting, Unravel helps data teams take a page out of their software counterparts’ playbook and tackle their problems with a solution that goes beyond observability to not just show you what’s going on and why, but actually tell you exactly what to do about it. It’s time for true DataOps observability.

        To learn more about how Unravel Data is helping data teams tackle some of today’s most complex modern data stack challenges, visit: www.unraveldata.com. 

        The post Get Ready for the Next Generation of DataOps Observability appeared first on Unravel.

Reflections on “The Great Data Debate 2022” from Big Data London https://www.unraveldata.com/resources/reflections-on-the-great-data-debate-2022-from-big-data-london/ https://www.unraveldata.com/resources/reflections-on-the-great-data-debate-2022-from-big-data-london/#respond Tue, 04 Oct 2022 13:47:57 +0000 https://www.unraveldata.com/?p=10429


        This year’s Big Data LDN (London) was huge. Over 150 exhibitors, 300 expert speakers across 12 technical and business-led conference theaters. It was like being the proverbial kid in a candy store, and I had to make some tough decisions on which presentations to attend (and which I’d miss out on).

One that looked particularly promising was “The Great Data Debate 2022,” a panel discussion hosted by industry analyst and conference chair Mike Ferguson, with panelists Benoit Dageville, Co-founder and President of Products at Snowflake; Shinji Kim, Select Star Founder and CEO; Chris Gladwin, Ocient CEO; and Tomer Shiran, Co-founder and Chief Product Officer of Dremio. You can watch the 1-hour Great Data Debate 2022 recording below.


        The panel covered a lot of ground, everything from the rise of component-based development, how the software development approach has gate-crashed the data and analytics world, the challenges around integrating all the new tools, best-of-breed vs. single platform, the future of data mesh, metadata standards, data security and governance, and much more.

        Sometimes the panelists agreed with each other, sometimes not, but the discussion was always lively. The parts I found most interesting revolved around migrating to the cloud and controlling costs once there.

        Moderator Mike Ferguson opened up the debate by asking the panelists how the current economic climate has changed companies’ approach—whether they’re accelerating their move to the cloud, focusing more on cost reduction or customer retention, etc.

        All the panelists agreed that more companies are increasingly migrating workloads to the cloud. Said Benoit Dageville: “We’re seeing an acceleration to moving to the cloud, both because of cost—you can really lower your cost—and because you can do much more in the cloud.” 

        Chris Gladwin added that the biggest challenge among hyperscale companies is that “they want to grow faster and be more efficient.” Shinji Kim echoed this sentiment, though from a different viewpoint, saying that many organisations are looking at how they want to structure the team—focusing more effort on automation or tooling to make everyone more productive in their own role. Tomer Shiran made the point that “a lot of customers now are leveraging data to either save money or increase their revenue. And there’s more focus on people asking if the path of spending they’re on with current data infrastructure is sustainable for the future.”

        We at Unravel are also seeing an increased focus on making data teams more productive and on leveraging automation to break down silos, promote more collaboration, reduce toilsome troubleshooting, and accelerate the DataOps lifecycle. But piggy-backing on Tomer’s point: While the numbers certainly bear out that more workloads are indeed moving to the cloud, we are seeing that among more mature data-driven organisations—those that already have 2-3 years of experience running data workloads in the cloud under their belt—migration initiatives are “hitting a wall” and stalling out. Cloud costs are spiraling out of control, and companies find themselves burning through their budgets with little visibility into where the spend is going or ability to govern expenses.

        As Mike put it: “As an analyst, I get to talk to CFOs and a lot of them have no idea what the invoice is going to be like at the end of the month. So the question really is, how does a CFO get control over this whole data and analytics ecosystem?”

        Chris was first to answer. “In the hyperscale segment, there are a lot of things that are different. Every customer is the size of a cloud, every application is the size of a cloud. Our customers have not been buying on a per usage basis—if you’re hammering away all day on a cluster of clusters, you want a price based on the core. They want to know in advance what it’s going to cost so they can plan for it. They don’t want to be disincented from using the platform more and more because it’ll cost more and more.”

        Benoit offered a different take: “Every organisation wants to become really data-driven, and it pushes a lot of computation to that data. I believe the cloud and its elasticity is the most cost-effective way to do that. And you can do much more at lower costs. We have to help the CFO and the organisation at large understand where the money is spent to really control [costs], to define budget, have a way to charge back to the different business units, and be very transparent to where the cost is going. So you have to have what we call cost governance. And we tell all our customers when they start [using Snowflake] that they have to put in place guardrails. It’s not a free lunch.”

        Added Shinji: “It’s more important than ever to track usage and monitor how things are actually going, not just as a one-time cost reduction initiative but something that actually runs continuously.”

Benoit summed it up by saying, “Providing the data, the monitoring, the governance of costs is a very big focus for all of us [on the panel], at different levels.”

        It’s interesting to hear leaders from modern data stack vendors as diverse as Snowflake, Select Star, and Dremio emphasise the need for automated cost governance guardrails. Because nobody does cloud cost governance for data applications and pipelines better than Unravel.

        Check out the full The Great Data Debate 2022 panel discussion.

        The post Reflections on “The Great Data Debate 2022” from Big Data London appeared first on Unravel.

        The Data Challenge Nobody’s Talking About: An Interview from CDAO UK https://www.unraveldata.com/resources/the-data-challenge-nobodys-talking-about-an-interview-from-cdao-uk-2022/ https://www.unraveldata.com/resources/the-data-challenge-nobodys-talking-about-an-interview-from-cdao-uk-2022/#respond Wed, 28 Sep 2022 17:57:33 +0000 https://www.unraveldata.com/?p=10315


        Chief Data & Analytics Officer UK (CDAO UK) is the United Kingdom’s premier event for senior data and analytics executives. The three-day event, with more than 200 attendees and 50+ industry-leading speakers, was packed with case studies, thought leadership, and practical advice around data culture, data quality and governance, building a data workforce, data strategy, metadata management, AI/MLOps, self-service strategies, and more.


        Catherine King of Corinium Global Intelligence interviews Chris Santiago, Unravel Data VP of Solutions Engineering

        Chris Santiago, Unravel VP of Solutions Engineering, sat down with Catherine King from event host Corinium Global Intelligence to discuss what’s top-of-mind for data and analytics leaders.

        Here are the highlights from their 1-on-1 interview.

        Catherine: What are you hearing in the market at the moment? What are people coming up to you and having a chat with you about today?

        Chris: I think that, big picture-wise, there are a lot of things being talked about that are very insightful. But some of the stuff that hasn’t really been talked about are the things that people don’t want to talk about. They have this grand vision, but how are they going to get there? How are they going to execute? How are they going to have the processes in place, the right people hired—that sort of thing—to take advantage of data and all the great technology that’s out there and actually execute on the vision?

        Catherine: I think you’ve hit the nail on the head there. I think the technology piece, which was so prevalent a few years ago—do we have the right tech, the right tools, to go out and do these things?—that’s almost been ticked off and done. Now it’s actually, do you have the processes in place to achieve it. So from your perspective, what would you like to see businesses do differently?


        Chris: The technology has made leaps and bounds over the last few years. If you think about big data people who came from the Hadoop world, one of the challenges of that stack was that it did require expertise. Companies back then struggled to get the ROI on performance, on getting some sort of business value at the end of the day. Fast-forward to today, especially with the advancements in the cloud, they have solved a lot of the challenges. Ease of use? I can just literally just log in, click a button, and have an environment. Storage is cheaper. A lot of the problems back then have been solved with today’s technology.

        I do think that the one thing that hasn’t been 100% solved yet, though, is the people, the skills. Right now, the things technology is not necessarily addressing directly are: Do we have the right skill set? Do we have the right people? As more folks are using these newer technologies, the gap in skills to do things the right way and best practices are not being directly addressed. The way that most customers are handling it right now is that they’ll bring in the experts, consultants like Accenture, Avanade, Deloitte, etc.

In order to achieve the true ROI in these data stacks, it’s about addressing the people problem.

        Catherine: What do you see coming in the next 12 months?

        Chris: I think that if we look at the industry as a whole, a lot of the technology is still considered new(ish). You have a lot of folks who are still using DevOps tools. So they’re trying to work with observability tools that are focused on the issues that aren’t necessarily what data teams want. So I think in the next six to twelve months we’re going to have a proliferation of observability trying to solve this problem specifically for data teams. Because they’re different problems [than software application issues]. You can’t use the same [APM] tooling for folks who are running Databricks or Snowflake. There are going to be different problems, different challenges.


        Obviously, Unravel is in that space now, but I do think the industry will recognize more and more that this is actually not just a small problem, but a major problem. Companies are starting to realize that [observability designed for data teams] is not a nice-to-have anymore, it’s a must-have—and needs to be addressed right now.

        Everybody has a vision. Everybody has an idea of what they want to do with data, whether it’s having that strategic business advantage or getting insights that they didn’t know about—everybody trying to do really cool stuff—but everybody always seems to forget about how to execute. There’s lots of interesting talks about how we’re going to measure things, what KPIs we’re going to be tracking, what methodologies we should have in place—all great stuff—but if you truly want to be successful, it’s all about execution and [doing] the stuff people don’t want to talk about. That’s what will set up companies to be successful or not with their data initiatives, getting into the weeds and solving these hard challenges.

        Check out the full interview with Chris from CDAO UK

        The post The Data Challenge Nobody’s Talking About: An Interview from CDAO UK appeared first on Unravel.

Data Observability: The Missing Link for Data Teams https://www.unraveldata.com/resources/dataops-observability-the-missing-link-for-data-teams/ https://www.unraveldata.com/resources/dataops-observability-the-missing-link-for-data-teams/#respond Thu, 15 Sep 2022 20:16:38 +0000 https://www.unraveldata.com/?p=10251


        As organizations invest ever more heavily in modernizing their data stacks, data teams—the people who actually deliver the value of data to the business—are finding it increasingly difficult to manage the performance, cost, and quality of these complex systems.

Data teams today find themselves in much the same boat as software teams were 10+ years ago. Software teams have dug themselves out of the hole with DevOps best practices and tools—chief among them full-stack observability.

        But observability for data applications is a completely different animal from observability for web applications. Effective DataOps demands observability that’s designed specifically for the different challenges facing different data team members throughout the DataOps lifecycle.

        That’s data observability for DataOps.

        In this white paper, you’ll learn:

        • What is data observability for DataOps?
        • What data teams can learn from DevOps
        • Why observability designed for web apps is insufficient for data apps
        • What data teams need from data observability for DataOps
• The 5 capabilities of data observability for DataOps

        Download the white paper here.

        The post Data Observability: The Missing Link for Data Teams appeared first on Unravel.

Unravel Goes on the Road at These Upcoming Events https://www.unraveldata.com/resources/unravel-upcoming-events/ https://www.unraveldata.com/resources/unravel-upcoming-events/#respond Thu, 01 Sep 2022 19:41:27 +0000 https://www.unraveldata.com/?p=10066


        Join us at an event near you or attend virtually to discover our DataOps observability platform, discuss your challenges with one of our DataOps experts, go under the hood to check out platform capabilities, and see what your peers have been able to accomplish with Unravel.

        September 21-22: Big Data LDN (London) 

        Big Data LDN is the UK’s leading free-to-attend data & analytics conference and exhibition, hosting leading data and analytics experts, ready to arm you with the tools to deliver your most effective data-driven strategy. Stop by the Unravel booth (stand #724) to see how Unravel is observability designed specifically for the unique needs of today’s data teams. 

        Register for Big Data LDN here

        And be sure to stop by the Unravel booth at 5PM on Day 1 for the Data Drinks Happy Hour for drinks and snacks (while supplies last!)

        October 5-6: AI & Big Data Expo – North America (Santa Clara) 

        AI & Big Data Expo North America

        The world’s leading AI & Big Data event returns to Santa Clara as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe. The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

        Register for AI & Big Data Expo here

        And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on the different challenges facing different AI & Big Data team members and how multidimensional observability (performance, cost, quality) designed specifically for the modern data stack can help.

        October 10-12: Chief Data & Analytics Officers (CDAO) Fall (Boston)

        CDAO Fall 2022

The premier in-person gathering for data & analytics leaders in North America, CDAO Fall offers focus tracks on data infrastructure, data governance, data protection & privacy; analytics, insights, and business intelligence; and data science, artificial intelligence, and machine learning. It also features exclusive industry summit days for data and analytics professionals in Financial Services, Insurance, Healthcare, and Retail/CPG.

        Register for CDAO Fall here

        October 14: DataOps Observability Conf India 2022 (Bengaluru)

        DataOps Observability Conf India 2022

        India’s first DataOps observability conference, this event brings together data professionals to collaborate and discuss best practices and trends in the modern data stack, analytics, AI, and observability.

        Join leading DataOps observability experts to:

        • Understand what DataOps is and why it’s important
        • Learn why DataOps observability has become a mission-critical need in the modern data stack
        • Discover how AI is transforming DataOps and observability

        Register for DataOps Observability Conf India 2022 here

        November 1-3: ODSC West (San Francisco)

        The Open Data Science Conference (ODSC) is essential for anyone who wants to connect to the data science community and contribute to the open source applications it uses every day. A hybrid in-person/virtual event, ODSC West features 250 speakers, with 300 hours of content, including keynote presentations, breakout talk sessions, hands-on tutorials and workshops, partner demos, and more. 

        Register for ODSC West here

        Sneak peek into what you’ll see from Unravel 

Want a taste of what we'll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities.

        Explore all our interactive guided tours here.

        The post Unravel Goes on the Road at These Upcoming Events appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-upcoming-events/feed/ 0
        Expert Panel: Challenges with Modern Data Pipelines https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/ https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/#respond Thu, 01 Sep 2022 14:47:26 +0000 https://www.unraveldata.com/?p=10053 Mesh Sphere Background

        Modern data pipelines have become more business-critical than ever. Every company today is a data company, looking to leverage data analytics as a competitive advantage. But the complexity of the modern data stack imposes some significant […]

        The post Expert Panel: Challenges with Modern Data Pipelines appeared first on Unravel.

        ]]>

Modern data pipelines have become more business-critical than ever. Every company today is a data company, looking to leverage data analytics as a competitive advantage. But the complexity of the modern data stack imposes some significant challenges that are hindering organizations from achieving their goals and realizing the value of data.

        TDWI recently hosted an expert panel on modern data pipelines, moderated by Fern Halper, VP and Senior Director of Advanced Analytics at TDWI, with guests Kunal Agarwal, Co-Founder and CEO of Unravel Data, and Krish Krishnan, Founder of Sixth Sense Advisors.

Dr. Halper opened the discussion with a quick overview of the trends she's seeing in the changing data landscape, the characteristics of the modern data pipeline (automation, universal connectivity, scalable/flexible/fast, comprehensive, and cloud-native), and some of the challenges with pipeline processes: it takes weeks or longer to access new data; organizations want to enrich data with new data sources, but can't; there aren't enough data engineers; pipeline growth causes errors and management problems; and it's hard for different personas to use pipelines for self-service. A quick poll of attendees showed a pretty even split among these challenges.

        top challenges with modern data pipelines

        The panel was asked how they define a modern data pipeline, and what challenges their customers have that modern pipelines help solve.

Kunal talked about the different use cases data-driven companies have for their data to help define what a pipeline is: banks are using big data to detect and prevent fraud, retailers are running multidimensional recommendation engines (products, pricing), and software companies are measuring their SaaS subscriptions.

        “All these decisions, and all these insights, are now gotten through data products. And data pipelines are the backbone of running any of these business-critical processes,” he said. “So, ultimately, a data pipeline is a sequence of actions that’s gathering, collecting, moving this data from all the different sources to a destination. And the destination could be a data product or a dashboard. And a pipeline is all the stages it takes to clean up the data, transform the data, connect it together, and give it to the data scientist or business analyst to be able to make use of it.”

        He said that with the increased volume, velocity, and variety of data, modern data pipelines need to be scalable and extensible—to add new data sources, to move to a different processing paradigm, for example. “And there are tons of vendors who are trying to crack this problem in unique ways. But besides the tools, it should all go back to what you’re trying to achieve, and what you’re trying to drive from the data, that dictates what kind of data pipeline or architecture you should have,” Kunal said.

        Krish sees the modern data pipeline as the integration point enabling the true multi-cloud vision. It’s no longer just one system or another, but a mix-and-match “system of systems.” And the challenges revolve around moving workloads to the cloud. “If a company is an on-premises shop and they’re going to the cloud, it’s a laborious exercise,” he said. “There is no universal lift-and-shift mechanism for going to the cloud. That’s where pipelines come into play.”

        Observability and the Modern Data Pipeline

        The panel discussed the components and capabilities of the modern data pipeline, again circling back to the challenges spawned by the increased complexity. Kunal noted that one common capability among organizations that are running modern data pipelines successfully is observability.

        “One key component we see along all the different stages of a pipeline is observability. Observability helps you ultimately improve reliability of these pipelines—that they work on time, every time—and improve the productivity of all the data team members. The poll results show that there aren’t enough data engineers, the demand far exceeds the supply. And what we see is that the data engineers who are present are spending more than 50% of their time just firefighting issues. So observability can help you eliminate—or at least get ahead of—all these different issues. And last but not least, we see that businesses tame their data ambitions by looking at their ballooning cloud bills and cloud costs. And that also happens because of the complexity that these data technologies present. Observability can also help get control and governance around cloud costs so you can start to scale your data operations in a much more efficient manner.”

        See the entire panel discussion on demand

        Watch the entire conversation with Kunal, Krish, and Fern from TDWI’s Expert Panel: Modern Data Pipelines on-demand replay.

        The post Expert Panel: Challenges with Modern Data Pipelines appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/tdwi-expert-panel-challenges-with-modern-data-pipelines/feed/ 0
        Takeaways from CDO TechVent on Data Observability https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/ https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/#respond Tue, 30 Aug 2022 21:54:13 +0000 https://www.unraveldata.com/?p=10047 Applying AI to Automate Application Performance Tuning

        The Eckerson Group recently presented a CDO TechVent that explored data observability, “Data Observability: Managing Data Quality and Pipelines for the Cloud Era.” Hosted by Wayne Eckerson, president of Eckerson Group, Dr. Laura Sebastian-Coleman, Data Quality […]

        The post Takeaways from CDO TechVent on Data Observability appeared first on Unravel.

        ]]>

        The Eckerson Group recently presented a CDO TechVent that explored data observability, “Data Observability: Managing Data Quality and Pipelines for the Cloud Era.” Hosted by Wayne Eckerson, president of Eckerson Group, Dr. Laura Sebastian-Coleman, Data Quality Director at Prudential Financial, and Eckerson VP of Research Kevin Petrie, the virtual event kicked off with a keynote overview of data observability products and best practices, followed by a technology panel discussion, “How to Evaluate and Select a Data Observability Platform,” with four industry experts.

        Here are some of the highlights and interesting insights from the roundtable discussion on the factors data leaders should consider when looking at data observability solutions.

        Josh Benamram, CEO of Databand, said it really depends on what problem you’re facing, or which team is trying to solve which problem. It’s going to be different for different stakeholders. For example, if your challenge is maintaining SLAs, your evaluation criteria would lean more towards Ops-oriented solutions that cater to data platform teams that need to manage the overall system. On the other hand, if your observability requirements are more targeted towards analysts, your criteria would be more oriented towards understanding the health of data tables rather than the overall pipeline. 

        This idea of identifying the problems you’re trying to solve first, or as Wayne put it, “not putting the tools before the horse,” was a consistent refrain among all panelists.

        Seth Rao, CEO of FirstEigen, said he would start by asking three questions: Where do you want to observe? What do you want to observe? How much do you want to automate? If you’re running Snowflake, there are a bunch of vendor solutions; but if you’re talking about data lakes, there’s a different set of solutions. Pipelines are yet another group of solutions. If you’re looking to observe the data itself, that’s a different type of observability altogether. Different solutions automate different pieces of observability. He suggested not trying to “boil the ocean” with one product that tries to do everything. He feels that you’ll get only an average product for all functions. Rather, he said, get flexibility of tooling—like Lego blocks that connect with other Lego blocks in your ecosystem.

This point drew the biggest reaction from the attendees (at least as evidenced by the Q&A chat). Who's responsible for integrating all the different tools? We already don't have enough time! A couple of panelists tackled the argument head-on, either in the panel discussion or in breakout sessions.

Specifically, Rohit Choudhary, CEO of Acceldata, said that the purpose of observability is to simplify everything data teams have to do. You don't have enough data engineers as it is, and now you're asking data leaders to invest in a bunch of different data observability tools. Instead of actually helping them solve problems, you're handing them more problems. He said to look at two things when evaluating data observability solutions: what it is capable of today, and what its roadmap and future use-case support look like. Observability means different things to different people, and it all depends on whether the offering fits your maturity model. Smaller organizations with analytics teams of 10-20 people are probably fine with point solutions. But large enterprises dealing with data pipelines at petabyte scale face much greater complexity. For them, it would be prohibitively expensive to build their own observability solution.

        Chris Santiago, Unravel Data VP of Solutions Engineering, was of the same opinion but looked at things from a different slant. He agreed that different tools—system-specific point tools, native cloud vendor capabilities, various data quality monitoring solutions—all have strengths and weaknesses, with insight into different “pieces of the puzzle.” But rather than connect them all together as discrete building blocks, observability would be better realized by extracting all the relevant granular details, correlating them into a holistic context, and analyzing them with ML and other analytical algorithms so that data teams get the intelligence they need in one place. The problems data teams face—around performance, quality, reliability, cost—are all interconnected, so you’re saving a lot of valuable time and reducing manual effort to have as much insight as possible in a single pane of glass. He refers to such comprehensive capabilities as DataOps observability.

        The dimension of cost was something Eckerson analyst Kevin Petrie highlighted in the wrap-up as a key emerging factor. He’s seeing an increased focus on FinOps capabilities, which Chris called out specifically: it’s not just making sure pipelines are running smoothly, but understanding where the spend is going and who the “big spenders” are, so that observability can uncover opportunities to optimize for cost and control/govern the cloud spend.

        That’s the cost side for the business, but he said it’s also crucial to understand the profit side. Companies are investing millions of dollars in their modern data stack, but how are we measuring whether they’re getting the value they expected from their investment? Can the observability platform help make sense of all the business metrics in some way? Because at the end of the day, all these data projects have to deliver value. 

        Check out Unravel’s breakout presentation, A DataOps Observability Dialogue: Empowering DevOps for Data Teams.

        The post Takeaways from CDO TechVent on Data Observability appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/key-takeaways-from-cdo-techvent-panel-discussion-on-evaluating-data-observability-platforms/feed/ 0
        Tips to optimize Spark jobs to improve performance https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/ https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/#respond Tue, 23 Aug 2022 01:48:56 +0000 https://www.unraveldata.com/?p=9992

        Summary: Sometimes the insight you’re shown isn’t the one you were expecting. Unravel DataOps observability provides the right, and actionable, insights to unlock the full value and potential of your Spark application. One of the key […]

        The post Tips to optimize Spark jobs to improve performance appeared first on Unravel.

        ]]>

Summary: Sometimes the insight you're shown isn't the one you were expecting. Unravel DataOps observability provides the right, actionable insights to unlock the full value and potential of your Spark applications.

        One of the key features of Unravel is our automated insights. This is the feature where Unravel analyzes the finished Spark job and then presents its findings to the user. Sometimes those findings can be layered and not exactly what you expect.

        Why Apache Spark Optimization?

Let's take a scenario where you are developing a new Spark job and testing it to make sure the application is properly optimized. To check whether you have tuned it appropriately, you bump the resource allocation up really high, expecting Unravel to show a "Container Sizing" event or something along those lines. Instead, a few other events appear, in our case "Contended Driver" and "Long Idle Time for Executors." Let's take a look at why that might be.

        Recommendations to optimize Spark

When Unravel presents recommendations, it bases them on what is most helpful for the current state of the application, and it presents only the best case for improving the application when it can. In certain scenarios this means some insights are not shown because acting on them would cause more harm than good. There can be many reasons for this, so let's take a look at how Unravel presents this particular application.

        Below are the resource utilization graphs for the two runs of the application we are developing:

Original run: Resource Utilization Graph

New run: Resource Utilization Graph

The most glaring issue is the idle time. We can see the job doing bursts of work and then just sitting around doing nothing. Addressing the idle time at the application level would improve the job's performance tremendously, and it is most likely masking other potential performance improvements. If we dig into the job a bit further, we see the following:
Job Processing Stages Pie Chart

The above is a breakdown of what the job was doing the whole time. Nearly 50% was spent on ScheduledWaitTime! This leads to the executor idle time recommendation:
Executor Idle Time Recommendation

Taking all of the above information together, we can see that the application was coded in such a way that it waits around for something for a while. At this point you could hop into the "Program" tab within Unravel to look at the actual source code associated with this application. We can say the same thing about the contended driver:
Contended Driver Recommendation

        With this one we should examine why the code is spending so much time with the driver. It goes hand-in-hand with the idle time because while the executors are idle, the driver is still working away. Once we take a look at our code and resolve these two items, we would see a huge increase in this job’s performance!

Another item that surfaced when looking at this job was the following:
Processing Stages Summary
Garbage Collection Time Executors

        This is telling us that the Java processes spent a lot of time in garbage collection. This can lead to thrashing and other types of issues. More than likely it’s related to the recommendations Unravel made that need to be examined more deeply.

Among all the recommendations we saw, the expected resource recommendation never appeared. That is because Unravel presents the most important insights first: the things the developer actually needs to look at.

These issues run deeper than settings changes. Interestingly, both runs we looked at did require some resource tuning; we could see that even the original job was over-allocated. The problem is that showing just the settings changes isn't the whole story.

If Unravel presented only the resource change and that change were implemented, the job might fail, for example by being killed by the JVM because of garbage collection issues. Unravel steers the developer in the right direction and gives them the tools they need to determine where the issue is in their code. Once they fix the larger issues, like high GC or a contended driver, they can start to really tune and optimize their job (see the sketch below).
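As a rough illustration only (these are generic Spark properties, not the specific recommendations Unravel generated for this job), right-sizing after the code fixes might look like trimming the oversized request and letting dynamic allocation reclaim idle executors:

    from pyspark.sql import SparkSession

    # Hypothetical, scaled-down settings: the point is that sizing is tuned only
    # after the code-level issues (contended driver, high GC) have been addressed.
    # Dynamic allocation also needs shuffle tracking or an external shuffle
    # service, depending on your platform.
    spark = (
        SparkSession.builder.appName("etl-job")
        .config("spark.executor.memory", "8g")              # instead of an over-allocated request
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.enabled", "true")  # release executors that sit idle
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )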

        Next steps:

        Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

        • Never worry about monitoring and observing your entire data estate.
        • Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
        • Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

        How can you get started?

        Create your free account today.
        Book a demo to see how Unravel simplifies modern data stack management.

        The post Tips to optimize Spark jobs to improve performance appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/tips-to-optimize-spark-jobs-to-improve-performance/feed/ 0
        Join Unravel at the AI & Big Data Expo https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/ https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/#respond Fri, 19 Aug 2022 19:53:52 +0000 https://www.unraveldata.com/?p=9989 Abstract Blue Light Background

        Swing by the Unravel Data booth at the AI & Big Data Expo in Santa Clara on October 5-6. The world’s leading AI & Big Data event returns as a hybrid in-person and virtual event, with […]

        The post Join Unravel at the AI & Big Data Expo appeared first on Unravel.

        ]]>

        Swing by the Unravel Data booth at the AI & Big Data Expo in Santa Clara on October 5-6. The world’s leading AI & Big Data event returns as a hybrid in-person and virtual event, with more than 5,000 attendees expected to join from across the globe. 

        And don’t miss Unravel Co-Founder and CEO Kunal Agarwal’s feature presentation on DataOps Observability. He’ll explain why AI & Big Data organizations need an observability platform designed specifically for data teams and their unique challenges, the limitations of trying to “borrow” other observability (like APM) or relying on a bunch of different point tools, and how DataOps observability cuts across and incorporates cross-sections of multiple observability domains (data applications/pipelines/model observability, operations observability, business and FinOps observability, data observability).

        Stop by our booth and you’ll be able to: 

        • Go under the hood with Unravel’s DataOps observability platform
        • Deep-dive into features and capabilities with our experts 
        • Learn what your peers have been able to accomplish with Unravel

        Our experts will run demos and be available for 1-on-1 conversations throughout the conference. 

Want a taste of what we'll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities.

        Explore all our interactive guided tours here.

        The expo will showcase the most cutting-edge technologies from 250+ speakers sharing their unparalleled industry knowledge and real-life experiences, in the forms of solo presentations, expert panel discussions and in-depth fireside chats.

        You can register for the event here.

         

        The post Join Unravel at the AI & Big Data Expo appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/join-unravel-at-the-ai-big-data-expo/feed/ 0
        Amazon EMR cost optimization and governance https://www.unraveldata.com/resources/amazon-emr-cost-optimization-and-governance/ https://www.unraveldata.com/resources/amazon-emr-cost-optimization-and-governance/#respond Thu, 04 Aug 2022 15:07:23 +0000 https://www.unraveldata.com/?p=9959 Data Pipelines

        There are now dozens of AWS cost optimization tools that exist today. Here’s the purpose-built one for AWS EMR: Begin monitoring immediately to gain control of your AWS EMR costs and continuously optimize resource performance. What […]

        The post Amazon EMR cost optimization and governance appeared first on Unravel.

        ]]>

Dozens of AWS cost optimization tools exist today. Here's the purpose-built one for AWS EMR: begin monitoring immediately to gain control of your AWS EMR costs and continuously optimize resource performance.

        What is Amazon EMR (Elastic MapReduce)?

        Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.

Depending on the workload and application type, EMR can process huge amounts of data using EC2 instances running the Hadoop Distributed File System (HDFS) and EMRFS backed by Amazon S3. Based on the workload type, these EC2 instances can be configured with any instance type, purchased on-demand and/or from the spot market.

AWS EMR is a great platform, but as more and more workloads get added to it, understanding pricing can become a challenge. It's hard to govern the cost and easy to lose track of parts of your monthly spend. In this article, we share tips for governing and optimizing your AWS EMR costs and resources.

        Amazon EMR costs

With so many choices for selecting instance types and configuring the EMR cluster, understanding EMR pricing can become cumbersome. And because the EMR service inherently uses other AWS services (EC2, EBS, and others), the usage costs of those services are factored into your bill as well.

        Best practices for optimizing the cost of your AWS EMR cluster

Here is a list of best practices and techniques for monitoring and optimizing the cost of your EMR cluster:

        1. Always tag your resources
A tag is a label consisting of a key-value pair that lets you assign metadata to your AWS resources, giving you the ability to easily manage, identify, organize, search for, and filter them. It is therefore important to use meaningful, purpose-built tags.
For example, create tags to categorize resources by purpose, owner, department, or other criteria, as shown below.

        EMR Cluster View
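Tags can also be applied programmatically. A minimal sketch with boto3 (the cluster ID and tag values are placeholders, not values from the example above):

    import boto3

    # Tag an existing EMR cluster so its costs can be attributed to a team,
    # project, and cost center. Cluster ID and tag values are placeholders.
    emr = boto3.client("emr", region_name="us-east-1")

    emr.add_tags(
        ResourceId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
        Tags=[
            {"Key": "team", "Value": "data-engineering"},
            {"Key": "project", "Value": "recommendation-engine"},
            {"Key": "cost-center", "Value": "1234"},
        ],
    )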

        2. Pick the right cluster type
        AWS EMR offers two cluster types – permanent and transient.

        For transient clusters, the compute unit is decoupled from the storage. HDFS on local storage is best used for caching the intermediate results and EMRFS is the final destination for storing persistent data in AWS S3. Once the computation is done and the results are stored safely in AWS S3, the resources on transient clusters can be reclaimed.

For permanent clusters, the data in HDFS is stored on EBS volumes and cannot easily be shared outside of the cluster. Small-file issues in the Hadoop NameNode will still be present, just as with on-premises Hadoop clusters.

        3. Size your cluster appropriately
You absolutely want to avoid undersized or oversized clusters. The EMR platform provides auto-scaling capabilities, but it is important to right-size your clusters first to avoid higher costs and inefficient workload execution. To anticipate these issues, calculate the number and type of nodes your workloads will need.
Master node: Computational requirements for this node are low, so a single node is sufficient.
Core nodes: These nodes process data and store it in HDFS, so it is important to right-size them. Per Amazon's guidance, you can estimate HDFS capacity by multiplying the number of nodes by the EBS storage capacity of each node.

        For example, if you define 10 core nodes to process 1 TB of data, and you have selected m5.xlarge instance type (with 64 GiB of EBS storage), you have 10 nodes*64 GiB, or 640 GiB of capacity. Based on the HDFS replication factor of three, your data size is replicated three times in the nodes, so 1 TB of data requires a capacity of 3 TB.

For the above scenario, you have two options:
a. Since you have only 640 GiB, increase the number of nodes or change the instance type until you have a capacity of at least 3 TB.
b. Alternatively, switching from m5.xlarge to m5.4xlarge instances and selecting 12 instances provides enough capacity.

        12 instances * 256 GiB = 3072 GiB = 3.29 TB available

Task nodes: These nodes only run tasks and do not store data, so to calculate the number of task nodes you only need to estimate memory usage. Because memory capacity is distributed between the core and task nodes, you can calculate the number of task nodes by subtracting the core node memory from the total required.

Per Amazon's best-practice guidelines, multiply the memory needed by three.
For example, suppose you have 28 processes of 20 GiB each; your total memory requirement would be as follows:

        3*28 processes*20 GiB of memory = 1680 GiB of memory

For this example, your core nodes have 64 GiB of memory (m5.4xlarge instances) and your task nodes have 32 GiB of memory (m5.2xlarge instances). Your core nodes provide 64 GiB * 12 nodes = 768 GiB of memory, which is not enough. To find the shortage, subtract this from the total memory required: 1680 GiB - 768 GiB = 912 GiB. You can set up the task nodes to provide the remaining 912 GiB. Divide the memory shortage by the memory of the task instance type to obtain the number of task nodes needed.

        912 GiB / 32 GiB = 28.5 task nodes
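The arithmetic above is easy to capture in a small helper so you can re-run it for your own workloads. The sketch below simply reproduces the calculations from this section (HDFS replication factor of three, memory multiplied by three), treating 1 TB as 1024 GiB as the example does; the numbers are illustrative, not recommendations:

    import math

    def core_nodes_needed(data_tb, storage_gib_per_node, replication_factor=3):
        """Estimate core nodes from HDFS storage: data size x replication factor."""
        required_gib = data_tb * 1024 * replication_factor
        return math.ceil(required_gib / storage_gib_per_node)

    def task_nodes_needed(processes, gib_per_process, core_nodes, core_gib_per_node,
                          task_gib_per_node, memory_multiplier=3):
        """Estimate task nodes from the memory shortfall after core-node memory."""
        total_required = memory_multiplier * processes * gib_per_process  # 3 * 28 * 20 = 1680 GiB
        shortfall = total_required - core_nodes * core_gib_per_node       # 1680 - 768 = 912 GiB
        return math.ceil(shortfall / task_gib_per_node)                   # 912 / 32 -> 29 nodes

    # Values from the worked example above:
    print(core_nodes_needed(data_tb=1, storage_gib_per_node=256))   # 12 m5.4xlarge core nodes
    print(task_nodes_needed(28, 20, core_nodes=12,
                            core_gib_per_node=64, task_gib_per_node=32))  # 29 (28.5 rounded up)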
        4. Based on your workload size, always pick the right instance type and size

        Network Hardware Summary

The task-node calculation in the previous section (and the sizing helper above) shows how this works in practice: the workload's memory requirement (3 * 28 processes * 20 GiB = 1680 GiB in that example) and the memory per instance (64 GiB for m5.4xlarge core nodes, 32 GiB for m5.2xlarge task nodes) together determine how many instances of each type you need. Run the same arithmetic against your own workload's memory and storage profile before settling on an instance type and size.

        5. Use autoscaling as needed
        Based on your workload size, Amazon EMR can programmatically scale out applications like Apache Spark and Apache Hive to utilize additional nodes for increased performance and scale in the number of nodes in your cluster to save costs when utilization is low.

For example, with EMR managed scaling you can set a minimum and maximum capacity, an on-demand limit, and a maximum core node count, and the cluster scales up and down dynamically based on the running workload (see the sketch below).

        Cluster Scaling Policy
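The same limits can be set programmatically. A minimal sketch with boto3, assuming the cluster uses EMR managed scaling (the cluster ID and limit values are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Example limits: scale between 2 and 10 instances, cap on-demand usage at 5
    # instances and core nodes at 3, letting extra task capacity run on spot.
    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,
                "MaximumCapacityUnits": 10,
                "MaximumOnDemandCapacityUnits": 5,
                "MaximumCoreCapacityUnits": 3,
            }
        },
    )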

        6. Always have a cluster termination policy set
        When you add an auto-termination policy to a cluster, you specify the amount of idle time after which the cluster should automatically shut down. This ability allows you to orchestrate cluster cleanup without the need to monitor and manually terminate unused clusters.

        Auto Termination

        You can attach an auto-termination policy when you create a cluster, or add a policy to an existing cluster. To change or disable auto-termination, you can update or remove the policy.
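A minimal sketch of setting this policy with boto3, assuming your EMR release supports auto-termination policies (the cluster ID and one-hour timeout are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Shut the cluster down automatically after one hour of idle time.
    emr.put_auto_termination_policy(
        ClusterId="j-XXXXXXXXXXXXX",                   # placeholder cluster ID
        AutoTerminationPolicy={"IdleTimeout": 3600},   # idle seconds before termination
    )

    # To disable auto-termination later:
    # emr.remove_auto_termination_policy(ClusterId="j-XXXXXXXXXXXXX")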

7. Monitor cost with Cost Explorer

        AWS Cost Management View

To keep your costs within budget, you need to monitor them diligently. One tool AWS offers here is AWS Cost Explorer, which allows you to visualize, understand, and manage your AWS costs and usage over time.

With Cost Explorer, you can build custom applications that directly access and query your cost and usage data, and build interactive, ad hoc analytics reports at daily or monthly granularity. You can even create a forecast by selecting a future time range for your reports, estimate your AWS bill, and set alarms and budgets based on predictions.
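As a rough sketch of that kind of custom query with boto3 (the SERVICE filter value and tag key are assumptions; adjust them to match your billing data):

    import boto3

    ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer

    # Daily unblended EMR cost for one month, grouped by a cost-allocation tag.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2022-07-01", "End": "2022-08-01"},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic MapReduce"]}},
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )

    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            print(day["TimePeriod"]["Start"], group["Keys"],
                  group["Metrics"]["UnblendedCost"]["Amount"])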

        Unravel can help!

Without a doubt, AWS helps you manage your EMR clusters and their costs through the approaches listed above. Cost Explorer is a great tool for monitoring your monthly bill, but it comes at a price: you have to spend your own time checking and monitoring things manually, or writing custom scripts to fetch the data and then running that data by your data science and finance ops teams for detailed analysis.

Further, the cost data Cost Explorer provides for your EMR clusters is not real-time (it has a turnaround delay of roughly 24 hours), and it is difficult to assess your EMR cluster costs alongside the costs of the other services involved. There is a better option: Unravel Data's DataOps observability product gives you a real-time, holistic, and fully automated way to manage your clusters and their costs.

        AWS EMR Cost Management is made easy with Unravel Data

Although AWS and other companies offer many tools to manage your EMR cluster costs, Unravel stands out by providing a single pane of glass and ease of use.

Unravel provides automated observability for your modern data stack.

        Unravel’s purpose-built observability for modern data stacks helps you stop firefighting issues, control costs, and run faster data pipelines, all monitored and observed via a single pane of glass.

One unique value Unravel provides is real-time chargeback detail for EMR clusters, with a cost breakdown across services (EMR, EC2, and EBS volumes) for each configured AWS account. In addition, you get a holistic view of your cluster's resource utilization, chargeback, and instance health, along with automated, AI-driven cluster cost-saving recommendations and suggestions.

        AWS EMR Monitoring with Unravel’s DataOps Observability

Unravel 4.7.4 can holistically monitor your EMR cluster. It collects a range of data points for various KPIs and metrics, from which it builds a knowledge base to derive resource and cost-saving insights and recommendations.

        AWS EMR chargeback and showback

        The image below shows the cost breakdown for EMR, EC2 and EBS services

        Chargeback Report EMR

        Monitoring AWS EMR cost trends

It is important to see how your EMR costs are trending based on usage and workload size. Unravel helps you understand your costs via its chargeback page: agents continuously fetch the relevant metrics for analyzing cluster cost and resource utilization, giving you an instant chargeback view in real time. These collected metrics are then fed into our AI engine to generate recommended insights.

        The image below shows the cost trends per cluster type, avg costs and total costs

        Cluster Total Cost Diagrams

        EMR Diagram

Complete AWS EMR monitoring insights

        Unravel Insights Screenshot

As seen in the image above, Unravel has analyzed the clusters' resource utilization (both memory and CPU) and the configured instance types. Based on the size of the executed workloads, Unravel then recommends ways to save costs, for example by downsizing node instance types.

        Do you want to lower your AWS EMR cost?

Avoid overspending on AWS EMR. If you're not sure how to lower your AWS EMR cost, or simply don't have time, Unravel's DataOps observability platform can help you reduce costs.
        Schedule a free consultation.
        Create your free account today.
        Watch video: 5 Best Practices for Optimizing Your Big Data Costs With Amazon EMR

        References
        Amazon EMR | AWS Tags | Amazon EC2 instance Types | AWS Cost Explorer | Unravel | Unravel EMR Chargeback | Unravel EMR Insights

        The post Amazon EMR cost optimization and governance appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/amazon-emr-cost-optimization-and-governance/feed/ 0
        Tuning Spark applications: Detect and fix common issues with Spark driver https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/ https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/#respond Wed, 20 Jul 2022 00:01:04 +0000 https://www.unraveldata.com/?p=9894 Fireworks Sparkler Background

        Learn more about Apache Spark drivers and how to tune spark application quickly. During the lifespan of a Spark application, the driver should distribute most of the work to the executors, instead of doing the work […]

        The post Tuning Spark applications: Detect and fix common issues with Spark driver appeared first on Unravel.

        ]]>

Learn more about Apache Spark drivers and how to tune Spark applications quickly.

During the lifespan of a Spark application, the driver should distribute most of the work to the executors instead of doing the work itself; this is one of the advantages of using Python with Spark over single-node Pandas. The Contended Driver event is detected when a Spark application spends far more time on the driver than in the Spark jobs on the executors. Leaving the executors idle can waste a lot of time and money, especially in a cloud environment.

Here is a sample application shown in Unravel for Azure Databricks. Note in the Gantt chart there was a huge gap of about 2 hours 40 minutes between Job-1 and Job-2, while the duration of most of the Spark jobs was under 1 minute. Based on the data it collected, Unravel detected the bottleneck: a Contended Driver event.

        Further digging into the Python code itself, we found that it actually tried to ingest data from the MongoDB server on the driver node alone. This left all the executors idle while the meter was still ticking.

        Unravel Sample Azure Databricks Overview

A network issue caused the MongoDB ingestion to slow down from 15 minutes to more than 2 hours. Once this issue was resolved, cost dropped by about 93%. The alternative solution is to move the MongoDB ingestion out of the Spark application: if there is no dependency on previous Spark jobs, we can do it before the Spark application runs; if there is a dependency, we can split the Spark application into two (see the sketch below).
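Here is a minimal sketch of the general fix, reading from MongoDB in parallel on the executors rather than on the driver, assuming a MongoDB Spark connector is available; the format name, options, and paths are illustrative, not the customer's actual code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mongo-ingest").getOrCreate()

    # Anti-pattern: pulling the collection on the driver (e.g., via pymongo) and then
    # parallelizing it keeps every executor idle while the driver does the ingestion.

    # Preferred: let the executors read from MongoDB in parallel via a Spark connector.
    # The format name and option keys depend on the connector version you use.
    df = (
        spark.read.format("mongodb")
        .option("connection.uri", "mongodb://host:27017")
        .option("database", "mydb")
        .option("collection", "events")
        .load()
    )
    df.write.mode("overwrite").parquet("/data/landing/events")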

Unravel also collects all the job run statistics, such as duration and I/O, as shown below, so we can easily see the history of all the job runs and monitor the job.

        Unravel Instance Summary

In conclusion, pay attention to the Contended Driver event in a Spark application so you can save money and time instead of leaving the executors idle for long stretches.

         

        Next steps

        The post Tuning Spark applications: Detect and fix common issues with Spark driver appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/tuning-spark-applications-detect-and-fix-common-issues-with-spark-driver/feed/ 0
        Kafka best practices: Monitoring and optimizing the performance of Kafka applications https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/ https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/#respond Mon, 18 Jul 2022 21:04:02 +0000 https://www.unraveldata.com/?p=2872

        What is Apache Kafka? Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Administrators, developers, and data engineers who use […]

        The post Kafka best practices: Monitoring and optimizing the performance of Kafka applications appeared first on Unravel.

        ]]>

        What is Apache Kafka?

        Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Administrators, developers, and data engineers who use Kafka clusters struggle to understand what is happening in their Kafka implementations. Out-of-the-box solutions like Ambari and Cloudera Manager provide some high-level monitoring, but most customers find these tools insufficient for troubleshooting. They also fail to provide the right visibility down to the applications acting as consumers of Kafka data streams.

For Unravel Data users, Kafka monitoring and insights come out of the box with your installation. This blog shows best practices for using Unravel as a Kafka monitoring tool to detect and resolve issues faster. We assume you have a baseline knowledge of Kafka concepts and architecture.

        Kafka monitoring tools

There are several ways to monitor Kafka applications: open source tools, proprietary tools, and purpose-built tools, each with its own advantages. Based on the availability of skilled resources and organizational priorities, you can choose the option that fits best.

        Importance of using Kafka Monitoring tools

An enterprise-grade Kafka monitoring tool like Unravel saves your critical resources from wasting hours manually configuring monitoring for Apache Kafka, leaving them more time to deliver value to end users. Without a monitoring solution in place, locating an issue is challenging: your data engineering team has to work through different use cases and educated guesses before getting to the root of a problem. A purpose-built Kafka monitoring tool like Unravel, by contrast, provides AI-enabled insights, recommendations, and automated remediation.

        Monitoring Cluster Health

Unravel provides color-coded KPIs per cluster to give you a quick overall view of the health of your Kafka cluster. The colors represent the following:

        • Green = Healthy
        • Red = Unhealthy and is something you will want to investigate
        • Blue = Metrics on cluster activity

        Let’s walk through a few scenarios and how we can use Unravel’s KPIs to help us through troubleshooting issues.

        Under-replicated partitions 

Under-replicated partitions tell us that replication is not going as fast as configured, which adds latency because consumers don't get the data they need until messages are replicated. It also suggests that we are more vulnerable to losing data if a broker fails. Any under-replicated partitions at all are a bad sign and something we'll want to root-cause to avoid data loss. Generally speaking, under-replicated partitions point to an issue on a specific broker.

If under-replicated partitions are showing up in your cluster, the first thing to understand is how often this is occurring. The following graph helps with that:

        # of under replicated partitions

Another useful metric to monitor is the "Log flush latency, 99th percentile" graph, which shows how long it takes the brokers to flush logs to disk:

        Log Flush Latency, 99th Percentile

Log flush latency is important because the longer it takes to flush the log to disk, the more the pipeline backs up, and the worse our latency and throughput become. When this number goes up, even from 10 ms to 20 ms, end-to-end latency balloons, which can also lead to under-replicated partitions.

If latency fluctuates greatly, we'll want to identify which broker(s) are contributing. To do this, go back to the "Broker" tab and click on each broker to refresh the graph for the selected broker:

        Once we identify the broker(s) with wildly fluctuating latency we will want to get further insight by investigating the logs for that broker. If the OS is having a hard time keeping up with log flush then we may want to consider adding more brokers to our Kafka architecture.
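If you want to cross-check under-replicated partitions outside Unravel, here is a rough sketch with the kafka-python admin client, assuming its describe_topics call returns per-partition replica and ISR lists; method and field names vary between client versions, so treat it as illustrative:

    from kafka import KafkaAdminClient

    # A partition is under-replicated when its in-sync replica (ISR) list is
    # shorter than its replica list. Broker address is a placeholder.
    admin = KafkaAdminClient(bootstrap_servers="broker1:9092")

    for topic in admin.describe_topics():
        for p in topic["partitions"]:
            if len(p["isr"]) < len(p["replicas"]):
                print(f"Under-replicated: {topic['topic']} partition {p['partition']} "
                      f"ISR={p['isr']} replicas={p['replicas']}")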

        Offline partitions

Generally speaking, you want to avoid offline partitions in your cluster. If this metric is greater than 0, there are broker-level issues that need to be addressed.

        Unravel provides the “# Offline Partitions” graph to understand when offline partitions occur:

This metric provides the total number of topic partitions in the cluster that are offline. This can happen if the brokers holding the replicas are down, or if unclean leader election is disabled and the replicas are not in sync, so none can be elected leader (which may be desirable to ensure no messages are lost).

        Controller Health

        This KPI displays the number of brokers in the cluster reporting as the active controller in the last interval. Controller assignment can change over time as shown in the “# Active Controller Trend” graph:

        • “# Controller” = 0
          • There is no active controller. You want to avoid this!
        • “# Controller” = 1
          • There is one active controller. This is the state that you want to see.
        • “# Controller” > 1
          • Can be good or bad. During steady state there should be only one active controller per cluster. If this is greater than 1 for only one minute, then it probably means the active controller switched from one broker to another. If this persists for more than one minute, troubleshoot the cluster for “split brain”.

If the active controller count is anything other than 1, investigate the broker-level logs for further insight.

        Cluster Activity

The last three KPIs show cluster activity within the last 24 hours. They are always colored blue because these metrics are neither good nor bad in themselves; they are useful for gauging activity in your Kafka cluster. You can also view these metrics via their respective graphs below:

These metrics can be useful in understanding cluster capacity, for example to:

        • Add additional brokers to keep up with data velocity
        • Evaluate performance of topic architecture on your brokers
        • Evaluate performance of partition architecture for a topic

The next section provides best practices for using Unravel to evaluate the performance of your topics and brokers.

        Topic Partition Strategy

A common challenge for Kafka admins is designing a topic/partition architecture that can support the data velocity coming from producers. Often this is a shot in the dark, because out-of-the-box solutions do not provide actionable insight into activity on your topics and partitions. Let's look at how Unravel can help shed light on this.

        Producer architecture

On the producer side, it can be a challenge to decide how to partition topics at the Kafka level. Producers can send messages with a key (so all messages with the same key land on the same partition) or fall back to a round-robin strategy when no key is defined. Choosing the correct number of partitions for your data velocity is important for a performant real-time architecture; a minimal sketch of the two producer strategies follows.
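The sketch below uses kafka-python; the broker address and topic name are placeholders:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker1:9092")

    # Keyed send: messages with the same key hash to the same partition,
    # preserving per-key ordering (e.g., all events for one customer).
    producer.send("orders", key=b"customer-42", value=b'{"order_id": 1001}')

    # Keyless send: the producer spreads messages across partitions (round-robin
    # or sticky, depending on client version), favoring balance over ordering.
    producer.send("orders", value=b'{"order_id": 1002}')

    producer.flush()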

Let's use Unravel to get a better understanding of how the current architecture is performing; that insight can then guide us in choosing the correct number of partitions.

To investigate, click on the Topic tab and scroll to the bottom to see the list of topics in your Kafka cluster:

In this table we can quickly identify topics with heavy traffic, where we may want to understand how our partitions are performing.

        Let’s click on a topic to get further insight:

        Consumer Group Architecture

We can also make changes on the consumer group side to scale according to the topic/partition architecture. Unravel provides a convenient view of the consumer groups for each topic, so you can quickly gauge the health of the consumer groups in your Kafka cluster. If a consumer group is a Spark Streaming application, Unravel can also provide insights into that application, giving you an end-to-end monitoring solution for your streaming applications.

        See Kafka Insights for a use case on monitoring consumer groups, and lagging/stalled partitions or consumer groups.
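For a rough sense of what consumer lag means at the client level, here is a hedged sketch with kafka-python (group, topic, and broker names are placeholders; API details can vary by version):

    from kafka import KafkaAdminClient, KafkaConsumer

    BOOTSTRAP = "broker1:9092"

    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
    consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

    # Committed offsets for the consumer group vs. the latest offsets on the brokers.
    committed = admin.list_consumer_group_offsets("orders-processor")  # {TopicPartition: OffsetAndMetadata}
    latest = consumer.end_offsets(list(committed.keys()))              # {TopicPartition: int}

    for tp, meta in committed.items():
        lag = latest[tp] - meta.offset
        print(f"{tp.topic}[{tp.partition}] lag={lag}")

A growing lag on a partition suggests the consumer group cannot keep up with the producers, which is exactly the kind of condition Unravel surfaces automatically.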

        Next steps:

        Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

        • Never worry about monitoring and observing your entire data estate.
        • Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
        • Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

        How can you get started?

        Create your free account today.

        Book a demo to see how Unravel simplifies modern data stack management.

        References:

        https://www.signalfx.com/blog/how-we-monitor-and-run-kafka-at-scale/ 

        https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster 

        https://docs.confluent.io/current/control-center/systemhealth.html

        The post Kafka best practices: Monitoring and optimizing the performance of Kafka applications appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/using-unravel-for-end-to-end-monitoring-of-kafka/feed/ 0
        Three Takeaways from the Data + AI Summit 2022 https://www.unraveldata.com/resources/three-takeaways-from-the-data-ai-summit-2022/ https://www.unraveldata.com/resources/three-takeaways-from-the-data-ai-summit-2022/#respond Tue, 12 Jul 2022 14:23:36 +0000 https://www.unraveldata.com/?p=9754 Computer Network Background Abstract

        Databricks recently hosted the Data + AI Summit 2022, attended by 5,000 people in person in San Francisco and some 60,000 virtually. Billed as the world’s largest data and AI conference, the event featured over 250 […]

        The post Three Takeaways from the Data + AI Summit 2022 appeared first on Unravel.

        ]]>

        Databricks recently hosted the Data + AI Summit 2022, attended by 5,000 people in person in San Francisco and some 60,000 virtually. Billed as the world’s largest data and AI conference, the event featured over 250 presentations from dozens of speakers, training sessions, and four keynote presentations.

        Databricks made a slew of announcements falling into two buckets: enhancements to open-source technologies underpinning the Databricks platform, and previews, enhancements, and GA releases related to its proprietary capabilities. Day 1 focused on data engineering announcements, day 2 on ML announcements.

        There was obviously a lot to take in, so here are just three takeaways from one attendee’s perspective.

        Convergence of Data Analytics and ML

        The predominant overarching (or underlying, depending on how you look at it) theme running as a red thread throughout all the presentations was the convergence of data analytics and ML. It’s well known how the “big boys” like Amazon, Google, Facebook, Netflix, Apple, and Microsoft have driven disruptive innovation through data, analytics, and AI. But now more and more enterprises are doing the same thing. 

In his opening keynote presentation, Databricks Co-Founder and CEO Ali Ghodsi described this trend as the process of moving to the right-hand side of the Data Maturity Curve, from hindsight to foresight.

        data maturity curve

However, today most companies are not on the right-hand side of the curve. They are still struggling to find success at scale. Ghodsi says the big reason is that there's a technology divide between two incompatible architectures that's getting in the way. On one side are the data warehouse and BI tools for analysts; on the other, data lakes and AI technologies for data scientists. You wind up with disjointed and duplicative data silos, incompatible security and governance models, and incomplete support for use cases.

        Bridging this chasm—where you get the best of both worlds—is of course the whole idea behind the Databricks lakehouse paradigm. But it wasn’t just Databricks who was talking about this convergence of data analytics and ML. Companies as diverse as John Deere, Intuit, Adobe, Nasdaq, Nike, and Akamai were saying the same thing. The future of data is AI/ML, and moving to the right-hand side of the curve is crucial for a better competitive advantage.

        Databricks Delta Lake goes fully open source

Probably the most warmly received announcement was that Databricks is open-sourcing its Delta Lake APIs as part of the Delta Lake 2.0 release, and will contribute all of its proprietary Delta Lake features and enhancements. This is big: while Delta Lake was initially released as open source back in 2019, many of the subsequent features were proprietary additions available only to Databricks customers.

        Make the most of your Databricks platform
        Download Unravel for Databricks data sheet

        Databricks cost optimization + Unravel

        A healthy number of the 5,000 in-person attendees swung by the Unravel booth to discuss its DataOps observability platform. Some were interested in how Unravel accelerates migration of on-prem Hadoop clusters to the Databricks platform. Others were interested in how its vendor-agnostic automated troubleshooting and AI-enabled optimization recommendations could benefit teams already on Databricks. But the #1 topic of conversation was around cost optimization for Databricks.

        Specifically, these Databricks customers are realizing the business value of the platform and are looking to expand their usage. But they are concerned about running up unnecessary costs—whether it was FinOps people or data team leaders looking for ways to govern costs with greater accuracy, Operations personnel who had uncovered occurrences of jobs where someone had requested oversized configurations, or data engineers who wanted to find out how to right-size their resource requests. Everybody wanted to discover a data-driven approach to optimizing jobs for cost. They all said the same thing, one way or another: the powers-that-be don’t mind spending money on Databricks, given the value it delivers; what they hate is spending more than they have to.

        And this is something Databricks as a company appreciates from a business perspective. They want customers to be happy and successful—if customers have confidence that they’re spending only as much as absolutely necessary, then they’ll embrace increased usage without batting an eye. 

        This is where Unravel really stands out from the competition. Its deep observability intelligence understands how many and what size resources are actually needed, compares that to the configuration settings the user requested, and comes back with a crisp, precise recommendation on exactly what changes to make to optimize for cost—at either the job or cluster level. These AI recommendations are generated automatically and are waiting for you in any number of dashboards and views within Unravel; cost-optimization recommendations let you cut to the chase with just one click.
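        To make the idea concrete, here is a minimal sketch of the kind of requested-versus-observed comparison such a recommendation rests on. The helper function and the numbers are hypothetical illustrations, not Unravel’s actual engine:

            def recommend_memory(requested_gb, peak_used_gb, headroom=1.2):
                """Suggest a memory setting based on observed peak usage plus some headroom."""
                suggested_gb = round(peak_used_gb * headroom, 1)
                savings_pct = round(100 * (1 - suggested_gb / requested_gb), 1)
                return suggested_gb, savings_pct

            # Hypothetical job: 64 GB executors requested, but peak usage never exceeded 22 GB.
            suggested, savings = recommend_memory(requested_gb=64, peak_used_gb=22)
            print(f"Suggest {suggested} GB per executor (~{savings}% smaller than requested)")

        The point is simply that the recommendation compares what was asked for against what the workload actually used, then hands back a concrete setting instead of a vague warning.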

        Unravel AI-enabled cost optimization

        Check out this short interactive demo (takes about 90 seconds) to see Unravel’s Job-Level AI Recommendations capability or this one to see its Automated Cloud Cluster Optimization feature.

        The post Three Takeaways from the Data + AI Summit 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/three-takeaways-from-the-data-ai-summit-2022/feed/ 0
        Unravel Data Is SOC 2 Type II Compliant https://www.unraveldata.com/unravel-earns-prestigious-soc-2-security-certification/ https://www.unraveldata.com/unravel-earns-prestigious-soc-2-security-certification/#respond Fri, 24 Jun 2022 11:00:08 +0000 https://www.unraveldata.com/?p=4232

        Security is top of mind for every enterprise these days. There are so many threats they can hardly be counted, but one commonality exists: data is always the target. Unravel’s mission is to help organizations better […]

        The post Unravel Data Is SOC 2 Type II Compliant appeared first on Unravel.

        ]]>

        Security is top of mind for every enterprise these days. There are so many threats they can hardly be counted, but one commonality exists: data is always the target. Unravel’s mission is to help organizations better understand and improve the performance of their data applications and pipelines. We’re a data business, so we appreciate the scope and implications of these threats. Unravel Data is dedicated to making sure that both current and potential customers are aware of the advanced features and security of Unravel Data’s AI-enabled DataOps Observability for modern data stacks, and as part of that commitment we have our company’s policies and procedures examined and verified by impartial third parties. That’s why we’re excited to announce that the Unravel Data platform has earned a Service Organization Control (SOC) 2, Type II certification for 2022.

        What is SOC 2?

        The SOC 2 certification is among the most rigorous and respected security certifications a software platform can earn. The certification process is governed by the American Institute of CPAs (AICPA). According to the organization, the SOC 2 certification is “intended to meet the needs of a broad range of users that need detailed information and assurance about the controls at a service organization relevant to security, availability, and processing integrity of the systems the service organization uses to process users’ data and the confidentiality and privacy of the information processed by these systems.” The SOC 2, Type II certification focuses on “management’s description of a service organization’s system and the suitability of the design and operating effectiveness of controls.” In plain terms: does a company have the right policies and procedures in place to keep its users safe from attacks and vulnerabilities and to maintain their data privacy?

        Why is this important?

        High-leverage data teams choose Unravel as a data observability platform that helps them save time and preserve trust. Customers that use our platform can trust that their metadata is secure.

        For Unravel to achieve this certification, independent auditor A-LIGN Assurance performed a thorough test of the product and examined the company’s operations to ensure Unravel has adequate safeguards to prevent or mitigate any security threats. The process was meticulous and comprehensive, taking several months to complete. Essentially, the SOC 2, Type II certification means that Unravel customers can rest assured knowing that our product is safe and that we take significant proactive measures to protect user data.

        If you would like to view the full Unravel SOC 2 report under NDA, please contact us here.

         

        The post Unravel Data Is SOC 2 Type II Compliant appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/unravel-earns-prestigious-soc-2-security-certification/feed/ 0
        Join Unravel at Data + AI Summit 2022 https://www.unraveldata.com/resources/join-unravel-at-data-ai-summit-2022/ https://www.unraveldata.com/resources/join-unravel-at-data-ai-summit-2022/#respond Tue, 14 Jun 2022 18:08:24 +0000 https://www.unraveldata.com/?p=9691

        Stop by Unravel Data Booth #322 at the Data + AI Summit 2022 in San Francisco, June 27-30 for a chance to win a special giveaway prize! When you stop by our booth on expo days […]

        The post Join Unravel at Data + AI Summit 2022 appeared first on Unravel.

        ]]>

        Unravel is silver sponsor of Data + AI Summit 2022

        Stop by Unravel Data Booth #322 at the Data + AI Summit 2022 in San Francisco, June 27-30 for a chance to win a special giveaway prize!

        When you stop by our booth on expo days June 28-29, you’ll have a golden opportunity to: 

        • Go under the hood with Unravel’s DataOps observability platform
        • Get a sneak-peek first look at Unravel’s new Databricks-specific capabilities
        • Deep-dive into features with our experts 
        • Learn what your peers have been able to accomplish with Unravel
        • Win a raffle prize

        Our experts will run demos and be available for 1-on-1 conversations throughout the conference—and everyone who books a meeting will be entered into a raffle for a to-be-announced ah-mazing prize. 

        Sponsored by Databricks, the Data+AI Summit is the world’s largest data and AI conference—four days packed with keynotes by industry visionaries, technical sessions, hands-on training, and networking opportunities. 

        See the full agenda here.

        This year, a new hybrid format features an expanded curriculum of half- and full-day in-person and virtual training classes and six different speaker session tracks: Data Analytics, BI and Visualization; Data Engineering; Data Lakes, Data Warehouses and Data Lakehouses; Data Science, Machine Learning and MLOps; Data Security and Governance; and Research.

        What will you see at Unravel Booth #322?

        Data AI Summit Floor Map

        Meet Unravel Data at booth #322

         

        You’ll see Unravel in action: how our purpose-built AI-enabled observability platform helps you stop firefighting issues, control costs, and run faster data pipelines. 

        Want a taste of what we’ll be showing? Check out our 2-minute guided-tour interactive demos of our unique capabilities. Explore features like:

        Explore all our interactive guided tours here.

        And here’s a quick overview of how Unravel helps you make the most of your Databricks platform.

        The post Join Unravel at Data + AI Summit 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/join-unravel-at-data-ai-summit-2022/feed/ 0
        Unravel at Data Summit 2022 Recap: Beyond Observability https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/ https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/#respond Wed, 25 May 2022 14:02:44 +0000 https://www.unraveldata.com/?p=9556

        At the recent Data Summit 2022 conference in Boston, Unravel presented “The DataOps Journey: Beyond Observability”—the most highly attended session within the conference’s DataOps Boot Camp track. In a nutshell, Unravel VPs Keith Alsheimer and Chris […]

        The post Unravel at Data Summit 2022 Recap: Beyond Observability appeared first on Unravel.

        ]]>

        At the recent Data Summit 2022 conference in Boston, Unravel presented “The DataOps Journey: Beyond Observability”—the most highly attended session within the conference’s DataOps Boot Camp track. In a nutshell, Unravel VPs Keith Alsheimer and Chris Santiago discussed:

        • The top challenges facing different types of data teams today
        • Why web APM and point tools leave fundamentally big observability gaps
        • How observability purpose-built for the modern data stack closes these gaps
        • What going “beyond observability” looks like

        Data teams get caught in a 4-way crossfire 

        It’s been said many times, many ways, but it’s true: nowadays every company has become a data company. Data pipelines are creating strategic business advantages, and companies are depending on their data pipelines more than ever before. And everybody wants in—every department in every business unit in every kind of company wants to leverage the insights and opportunities of big data.

        everybody wants in on data

        But data teams are finding themselves caught in a 4-way crossfire: not only are data pipelines more important than ever, but more people are demanding more output from them, there’s more data to process, people want their data faster, and the modern data stack has gotten ever more complex.

        The increased volume, velocity, and variety of today’s data pipelines coupled with their increasing complexity is leaving battle scars across all teams. Data engineers find it harder to troubleshoot apps/pipelines, meet SLAs, and get quality right. Operations teams have too many issues to solve and not enough time—they’re drowning in a flood of service tickets. Data architects have to figure out how to scale these complex pipelines efficiently. Business unit leaders are seeing cloud costs skyrocket and are under the gun to deliver better ROI.

        See how Unravel can help you keep pace with the volume, velocity and variety of today’s data pipelines
        Try for free

        It’s becoming increasingly harder for data teams to keep pace, much less do everything faster, better, cheaper. All this complexity is taking its toll. Data teams are bogged down and burning out. A recent survey shows that 70% of data engineers are likely to quit in the next 12 months. Demand for talent has always exceeded supply, and it’s getting worse. McKinsey research reveals that data analytics has the greatest need to address potential skills gaps, with 15 open jobs for every active job seeker. The latest LinkedIn Emerging Jobs Report says that 5 of the top 10 jobs most in demand are data-related.

        And complexity continues in the cloud. First off, the variety of choices is huge—with more coming every day. Too often everybody is doing their own thing in the cloud, with no clear sense of what’s going on. This makes things hard to rein in and scale out, and most companies find their cloud costs are getting out of control.

        DataOps is the answer, but  . . .

        Adopting a DataOps approach can help solve these challenges. There are varying viewpoints on exactly what DataOps is (and isn’t), but we can certainly learn from how software developers and operations teams tackled a similar issue via DevOps—specifically, leveraging and enhancing collaboration, automation, and continuous improvement. 

        One of the key foundational needs for this approach is observability. You absolutely must have a good handle on what’s going on in your system, and why. After all, you can’t improve things—much less simplify and streamline—if you don’t really know everything that’s happening. And you can’t break down silos and embrace collaboration without the ability to share information.

        At DataOps Unleashed 2022 we surveyed some 2,000 participants about their top challenges, and by far the #1 issue was the lack of visibility across the environment (followed by lack of proactive alerts, the expense of runaway jobs/pipelines, lack of experts, and no visibility into cloud cost/usage).

        top 5 DataOps challenges

        One of the reasons teams have so much difficulty gaining visibility into their data estate is that the kind of observability that works great for DevOps doesn’t cut it for DataOps. You need observability purpose-built for the modern data stack.

        Get the right kind of observability

        When it comes to the modern data stack, traditional observability tools leave a “capabilities gap.” Some of the details you need live in places like the Spark UI, other information can be found in native cloud tools or other point solutions, but all too often you have to jump from tool to tool, stitching together disparate data by yourself manually. Application performance monitoring (APM) does this correlation for web applications, but falls far short for data applications/pipelines. 

        First and foremost, the whole computing framework is different for data applications. Data workloads get broken down into multiple, smaller, often similar parts each processed concurrently on a separate node, with the results re-combined upon completion — parallel processing. In contrast, web applications are a tangle of discrete request-response services processed individually.

        Consequently, there’s a totally different class of problems, root causes, and remediation for data apps vs. web apps. When doing your detective work into a slow or failed app, you’re looking at a different kind of culprit for a different type of crime, and need different clues. You need a whole new set of data points, different KPIs, from distinct technologies, visualized in another way, and correlated in a uniquely modern data stack–specific context.

        You need to see details at a highly granular level — for each sub-task within each sub-part of each job — and then marry them together into a single pane of glass that comprises the bigger picture at the application, pipeline, platform, and cluster levels.

        That’s where Unravel comes in.

        Unravel extracts all the relevant raw data from all the various components of your data stack and correlates it to paint a picture of what’s happening. All the information captured by telemetry data (logs, metrics, events, traces) is pulled together in context and in the “language you’re speaking”—the language of data applications and pipelines—within the context of your workloads. 

        full stack observability data


        Going beyond observability

        Full-stack observability tells you what’s going on. But to understand why, you need to go beyond observability and apply AI/ML to correlate patterns, identify anomalies, derive meaningful insights, and perform automated root cause analysis. And for that, you need lots of granular data feeding ML models and AI algorithms purpose-built specifically for modern data apps/pipelines.

        The size and complexity of modern data pipelines—with hundreds (even thousands) of intricate interdependencies, jobs broken down into stages processing in parallel—could lead to a lot of trial-and-error resolution effort even if you know what’s happening and why. What you really need to know is what to do.

        That’s where Unravel does what other observability solutions can only dream about.

        Unravel’s AI recommendation engine tells you exactly what your next step is in crisp, precise detail. For example, if there’s one container in one part of one job that’s improperly sized and so causing the entire pipeline to fail, Unravel not only pinpoints the guilty party but tells you what the proper configuration settings would be. Or another example is that Unravel can tell you exactly why a pipeline is slow and how you can speed it up.

        AI can identify exactly where and how to optimize performance

        AI recommendations tell you exactly what to do to optimize for performance.

        Observability is all about context. Traditional APM provides observability in context for web applications, but not for data applications and pipelines. But Unravel does. Its telemetry, correlation, anomaly detection, root cause analysis, and AI-powered remediation recommendations are all built by design specifically to understand, troubleshoot, and optimize modern data workloads.

        See how Unravel goes beyond observability
        Try for free

        The post Unravel at Data Summit 2022 Recap: Beyond Observability appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-at-data-summit-2022-recap-beyond-observability/feed/ 0
        Why Legacy Observability Tools Don’t Work for Modern Data Stacks https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/ https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/#comments Fri, 13 May 2022 13:13:01 +0000 https://www.unraveldata.com/?p=9406

        Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics […]

        The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

        ]]>

        Whether they know it or not, every company has become a data company. Data is no longer just a transactional byproduct, but a transformative enabler of business decision-making. In just a few years, modern data analytics has gone from being a science project to becoming the backbone of business operations to generate insights, fuel innovation, improve customer satisfaction, and drive revenue growth. But none of that can happen if data applications and pipelines aren’t running well.

        Yet data-driven organizations find themselves caught in a crossfire: their data applications/pipelines are more important than ever, but managing them is more difficult than ever. As more data is generated, processed, and analyzed in an increasingly complex environment, businesses are finding the tools that served them well in the past or in other parts of their technology stack simply aren’t up to the task.

        Modern data stacks are a different animal

        Would you want an auto mechanic (no matter how excellent) to diagnose and fix a jet engine problem before you took flight? Of course not. You’d want an aviation mechanic working on it. Even though the basic mechanical principles and symptoms (engine trouble) are similar, automobiles and airplanes are very different under the hood. The same is true with observability for your data application stacks and your web application stacks. The process is similar, but they are totally different animals.

        At first glance, it may seem that the leading APM monitoring and observability tools like Datadog, New Relic, Dynatrace, AppDynamics, etc., do the same thing as a modern data stack observability platform like Unravel. And in the sense that both capture and correlate telemetry data to help you understand issues, that’s true. But one is designed for web apps, while the other for modern data pipelines and applications.

        Observability for the modern data stack is indeed completely different from observability for web (or mobile) applications. They are built and behave differently, face different types of issues for different reasons, requiring different analyses to resolve problems. To fully understand, troubleshoot, and optimize (for both performance and cost) data applications and pipelines, you need an observability platform that’s built from the ground up to tackle the specific complexities of the modern data stack. Here’s why.

        What’s different about modern data applications?

        First and foremost, the whole computing framework is different for data applications. Data workloads get broken down into multiple, smaller, often similar parts each processed concurrently on a separate node, with the results re-combined upon completion (parallel processing). And this happens at each successive stage of the workflow as a whole. Dependencies within data applications/pipelines are deep and layered. It’s crucial that everything (execution time, scheduling and orchestration, data lineage, and layout) be in sync.
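        As a rough illustration of that split-process-recombine pattern (assuming a local Spark installation; the data and partition count are made up), a single Spark job fans work out across partitions and merges the partial results back together:

            from pyspark.sql import SparkSession

            spark = SparkSession.builder.master("local[*]").appName("parallel-sketch").getOrCreate()

            # One logical job: Spark splits the data into partitions, runs a task per partition
            # (potentially on separate nodes), then recombines the partial results.
            rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)  # 8 parallel parts
            total = rdd.map(lambda x: x * x).sum()  # map runs in parallel, the sum combines at the end

            print(f"partitions: {rdd.getNumPartitions()}, result: {total}")
            spark.stop()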

        In contrast, web applications are a tangle of discrete request-response services processed individually. Each service does its own thing and operates relatively independently. What’s most important is the response time of each service request and how that contributes to the overall response time of a user transaction. Dependencies within web apps are not especially deep but are extremely broad.

        web apps vs data apps

        Web apps are request-response; data apps process in parallel.

        Consequently, there’s a totally different class of problems, root causes, and remediation for data apps vs. web apps. When doing your detective work into a slow or failed app, you’re looking at a different kind of culprit for a different type of crime, and need different clues (observability data). You need a whole new set of data points, different KPIs, from distinct technologies, visualized in another way, and correlated in a uniquely modern data stack–specific context.

        The flaw with using traditional APM tools for modern data stacks

        What organizations who try to use traditional APM for the modern data stack find is that they wind up getting only a tiny fraction of the information they need from a solution like Dynatrace or Datadog or AppDynamics, such as infrastructure and services-level metrics. But over 90% of the information data teams need is buried in places where web APM simply doesn’t go; you need an observability platform specifically designed to dig through all the systems to get this data and then stitch it together into a unified context.

        This is where the complexity of modern data applications and pipelines rears its head. The modern data stack is not a single system, but a collection of systems. You might have Kafka or Kinesis or Data Factory for data ingestion, some sort of data lake to store it, then possibly dozens of different components for different types of processing: Druid for real-time processing, Databricks for AI/ML processing, BigQuery or Snowflake for data warehouse processing, another technology for batch processing; the list goes on. So you need to capture deep information horizontally across all the various systems that make up your data stack. But you also need to capture deep information vertically, from the application down to infrastructure and everything in between (pipeline definitions, data usage/lineage, data types and layout, job-level details, execution settings, container-level information, resource allocation, etc.).

        Cobbling it together manually via “swivel chair whiplash,” jumping from screen to screen, is a time-consuming, labor-intensive effort that can take hours, even days, for a single problem. And still there’s a high risk that it won’t be completely accurate. There is simply too much data to make sense of, in too many places. Trying to correlate everything on your own, whether by hand or with a homegrown jury-rigged solution, requires two things that are always in short supply: time and expertise. Even if you know what you’re looking for, trolling through hundreds of log files is like looking for a needle in a stack of needles.

        An observability platform purpose-built for the modern data stack can do all that for you automatically. Trying to make traditional APM observe data stacks is simply using the wrong tool for the job at hand.

        DevOps APM vs. DataOps observability in practice

        With the growing complexity in today’s modern data systems, any enterprise-grade observability solution should do 13 things:

        1. Capture full-stack data, both horizontally and vertically, from the various systems that make up the stack, including engines, schedulers, services, and cloud provider
        2. Capture information about all data applications: pipelines, warehousing, ETL/ELT, machine learning models, etc.
        3. Capture information about datasets, lineage, users, business units, computing resources, infrastructure, and more
        4. Correlate, not just aggregate, data collected into meaningful context
        5. Understand all application/pipeline dependencies on data, resources, and other apps
        6. Visualize data pipelines end to end from source to output
        7. Provide a centralized view of your entire data stack, for governance, usage analytics, etc.
        8. Identify baseline patterns and detect anomalies
        9. Automatically analyze and pinpoint root causes
        10. Proactively alert to prevent problems
        11. Provide recommendations and remedies to solve issues
        12. Automate resolution or self-healing
        13. Show the business impact

        While the principles are the same for data app and web app observability, how to go about this and what it looks like are markedly dissimilar.

        Everything starts with the data (and correlating it)

        If you don’t capture the right kind of telemetry data, nothing else matters.

        APM solutions inject agents that run 24×7 to monitor the runtime and behavior of applications written in .NET, Java, Node.js, PHP, Ruby, Go, and dozens of other languages. These agents collect data on all the individual services as they snake through the application ecosystem. Then APM stitches together all the data to understand which services the application calls and how the performance of each discrete service call impacts performance of the overall transaction. The various KPIs revolve around response times, availability (up/down, green/red), and the app users’ digital experience. The volume of data to be captured is incredibly broad, but not especially deep.

        Datadog metrics

        APM is primarily concerned with response times and availability. Here, Datadog shows red/green status and aggregated metrics.

        AppDynamic service map

        Here, AppDynamics shows the individual response times for various interconnected services.

        The telemetry details to be captured and correlated for data applications/pipelines, on the other hand, need to be both broad and extremely deep. A modern data workload comprises hundreds of jobs, each broken down into parallel-processing parts, with each part executing various tasks. And each job feeds into hundreds of other jobs and applications not only in this particular pipeline but all the other pipelines in the system.

        Today’s pipelines are built on an assortment of distributed processing engines; each might be able to monitor its own application’s jobs but not show you how everything works as a whole. You need to see details at a highly granular level (for each sub-task within each sub-part of each job) and then marry them together into a single pane of glass that comprises the bigger picture at the application, pipeline, platform, and cluster levels.

        app level metrics

        DataOps observability (here, Unravel) looks at completely different metrics at the app level . . .

        pipeline metrics

        . . . as well as the pipeline level.

        Let’s take troubleshooting a slow Spark application as an example. The information you need to investigate the problem lives in a bunch of different places, and the various tools for getting this data give you only some of what you need, not all.

        The Spark UI can tell you about the status of individual jobs but lacks infrastructure and configuration details and other information to connect together a full pipeline view. Spark logs help you retrospectively find out what happened to a given job (and even what was going on with other jobs at the same time) but don’t have complete information about resource usage, data partitioning, container configuration settings, and a host of other factors that can affect performance. And, of course, Spark tools are limited to Spark. But Spark jobs might have data coming in from, say, Kafka and run alongside a dozen other technologies.
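        To see how piecemeal that detective work gets, here is a hedged sketch of pulling stage metrics straight from Spark’s monitoring REST API (assuming a driver UI reachable at localhost:4040). It surfaces the slowest stages, but says nothing about Kafka, the scheduler, container configurations, or cluster-level resource usage:

            import requests

            BASE = "http://localhost:4040/api/v1"   # assumption: Spark driver UI on the default port

            apps = requests.get(f"{BASE}/applications", timeout=10).json()
            app_id = apps[0]["id"]

            # Stage-level metrics only: no ingestion lag, no scheduler context, no cost data.
            stages = requests.get(f"{BASE}/applications/{app_id}/stages", timeout=10).json()
            for stage in sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)[:3]:
                print(stage["stageId"], stage["name"], stage.get("executorRunTime"), "ms")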

        Conversely, platform-specific interfaces (Databricks, Amazon EMR, Dataproc, BigQuery, Snowflake) have the information about resource usage and the status of various services at the cluster level, but not the granular details at the application or job level.

        Data pipeline details that APM doesn't capture

        Having all the information specific to data apps is a great start, but it isn’t especially helpful if it’s not all put into context. The data needs to be correlated, visualized, and analyzed in a purposeful way that lets you get to the information you need easily and immediately.

        Then there’s how data is visualized and analyzed

        Even the way you need to look at a data application environment is different. A topology map for a web application shows dependencies like a complex spoke-and-wheel diagram. When visualizing web app environments, you need to see the service-to-service interrelationships in a map like this:

        Dynatrace map

        How Dynatrace visualizes service dependencies in a topology map.

        With drill-down details on service flows and response metrics:

        Dynatrace drill down

        Dynatrace drill-down details

        For a modern data environment, you need to see how all the pipelines are interdependent and in what order. The view is more like a complex system of integrated funnels:

        pipeline complexity

        A modern data estate involves many interrelated application and pipeline dependencies (Source: Sandeep Uttamchandani)

        You need full observability into not only how all the pipelines relate to one another, but all the dependencies of multiple applications within each pipeline . . .

        visibility into individual applications within data pipeline

        An observability platform purpose-built for modern data stacks provides visibility into all the individual applications within a particular pipeline

        . . . with granular drill-down details into the various jobs within each application. . .

        visibility into all the jobs within each application in a data pipeline

        . . . and the sub-parts of each job processing in parallel . . .

        granular visibility into sub-parts of jobs in an application within a data pipeline

        How things get fixed

        Monitoring and observability tell you what’s going on. But to understand why, you need to go beyond observability and apply AI/ML to correlate patterns, identify anomalies, derive meaningful insights, and perform automated root cause analysis. “Beyond observability” is a continuous and incremental spectrum, from understanding why something happened to knowing what to do about it to automatically fixing the issue. But to make that leap from good to great, you need ML models and AI algorithms purpose-built for the task at hand. And that means you need complete data about everything in the environment.

        spectrum of automated observability from correlation to self-healing

        The best APM tools have some sort of AI/ML-based engine (some are more sophisticated than others) to analyze millions of data points and dependencies, spot anomalies, and alert on them.

        For data applications/pipelines, the type of problems and their root causes are completely different than web apps. The data points and dependencies needing to be analyzed are completely different. The patterns, anomalous behavior, and root causes are different. Consequently, the ML models and AI algorithms need to be different.

        In fact, DataOps observability needs to go even further than APM. The size of modern data pipelines and the complexities of multi-layered dependencies (from clusters to platforms and frameworks to applications to jobs within applications to sub-parts of those jobs to the various tasks within each sub-part) could lead to a lot of trial-and-error resolution effort even if you know what’s happening and why. What you really need to know is what to do.

        An AI-driven recommendation engine like Unravel goes beyond the standard idea of observability to tell you how to fix a problem. For example, if there’s one container in one part of one job that’s improperly sized and so causing the entire pipeline to fail, Unravel not only pinpoints the guilty party but tells you what the proper configuration settings would be. Or another example is that Unravel can tell you exactly why a pipeline is slow and how you can speed it up. This is because Unravel’s AI has been trained over many years to understand the specific intricacies and dependencies of modern data stacks.

        AI can identify exactly where and how to optimize performance

        AI recommendations tell you exactly what to do to optimize for performance.

        Business impact

        Sluggish or broken web applications cost organizations money in terms of lost revenue and customer dissatisfaction. Good APM tools are able to put the problem into a business context by providing a lot of details about how many customer transactions were affected by an app problem.

        As more and more of an organization’s operations and decision-making revolve around data analytics, data pipelines that miss SLAs or fail outright have an increasingly significant (negative) impact on the company’s revenue, productivity, and agility. Businesses must be able to depend on their data applications, so their applications need to have predictable, reliable behavior.

        For example: If a fraud prevention data pipeline stops working for a bank, it can cause billions of dollars in fraudulent transactions going undetected. Or a slow healthcare analysis pipeline may increase risk for patients by failing to provide timely responses. Measuring and optimizing performance of data applications and pipelines correlates directly to how well the business is performing.

        Businesses need proactive alerts when pipelines deviate from their normal behavior. But going “beyond observability” would tell them automatically why this is happening and what they can do to get the application back on track. This allows businesses to have reliable and predictable performance.

        There’s also an immediate bottom-line impact that businesses need to consider: maximizing their return on investment and controlling/optimizing cloud spend. Modern data applications process a lot of data, which usually consumes a large amount of resources, and the meter is always running. This means the cloud bills can rack up fast.

        To keep costs from spiraling out of control, businesses need actionable intelligence on how best to optimize their data pipelines. An AI recommendations engine can take all the profile and other key information it has about applications and pipelines and identify where jobs are overprovisioned or could be tuned for improvement. For example: optimizing code to remove inefficiencies, right-sizing containers to avoid wastage, providing the best data partitioning based on goals, and much more.
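        As one small illustration of the “remove inefficiencies” category, a job that emits thousands of tiny output files can be compacted before the write; the S3 paths and file count below are hypothetical:

            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            df = spark.read.parquet("s3://example-bucket/events/")   # hypothetical source table

            # Default task partitioning can emit one tiny file per task; collapsing to a
            # sensible number of output files cuts both write time and downstream read cost.
            df.coalesce(64).write.mode("overwrite").parquet("s3://example-bucket/events_compacted/")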

        AI can identify exactly where and how to optimize cloud costs

        AI recommendations pinpoint exactly where and how to optimize for cost.

        AI recommendations and deep insights lay the groundwork for putting in place some automated cost guardrails for governance. Governance is really all about converting the AI recommendations and insights into impact. Automated guardrails (per user, app, business unit, project) would alert operations teams about unapproved spend, potential budget overruns, jobs that run over a certain time/cost threshold, and the like. You can then proactively manage your budget, rather than getting hit with sticker shock after the fact.

        In a nutshell

        Application monitoring and observability solutions like Datadog, Dynatrace, and AppDynamics are excellent tools for web applications. Their telemetry, correlation, anomaly detection, and root cause analysis capabilities do a good job of helping you understand, troubleshoot, and optimize most areas of your digital ecosystem, one exception being the modern data stack. They are by design built for general-purpose observability of user interactions.

        In contrast, an observability platform for the modern data stack like Unravel is more specialized. Its telemetry, correlation, anomaly detection, and root cause analysis capabilities (and, uniquely in the case of Unravel, its AI-powered remediation recommendations, automated guardrails, and automated remediation) are built by design specifically to understand, troubleshoot, and optimize modern data workloads.

        Observability is all about context. Traditional APM provides observability in context for web applications, but not for data applications and pipelines. That’s not a knock on these APM solutions. Far from it. They do an excellent job at what they were designed for. They just weren’t built for observability of the modern data stack. That requires another kind of solution designed specifically for a different kind of animal.

        The post Why Legacy Observability Tools Don’t Work for Modern Data Stacks appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/why-legacy-observability-tools-dont-work-for-modern-datastack/feed/ 1
        Visit Unravel at Data Summit 2022 https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/ https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/#respond Fri, 13 May 2022 13:12:46 +0000 https://www.unraveldata.com/?p=9429

        Get a quick Unravel product demo at Data Summit 2022 Tuesday & Wednesday, May 17-18, at the Hyatt Regency Boston. Presented by our friends from DBTA (Database Trends and Applications), Data Summit 2022 is a unique […]

        The post Visit Unravel at Data Summit 2022 appeared first on Unravel.

        ]]>

        Get a quick Unravel product demo at Data Summit 2022 Tuesday & Wednesday, May 17-18, at the Hyatt Regency Boston.

        Data Summit 2022

        Presented by our friends from DBTA (Database Trends and Applications), Data Summit 2022 is a unique conference that brings together IT practitioners and business stakeholders from all types of organizations to learn, share, and celebrate the trends and technologies shaping the future of data. See where the world of Big Data and data science is going, and how to get there fast. 

        Featuring workshops, panel discussions, and provocative talks, Data Summit 2022 provides a comprehensive educational experience designed to guide you through all of today’s key issues in data management and analysis. Whether your interests lie in the technical possibilities and challenges of new and emerging technologies or using Big Data for business intelligence, analytics, and other business strategies, we have something for you!

        And Unravel is there as a sponsor, so be sure to stop by and check out how we’re going beyond observability to solve some of today’s biggest challenges with:

        And we’re running a raffle – stop by our booth for a chance to win an Oculus VR headset.

        The conference has different talk tracks over the two days—Modern Data Strategy Essentials Today, Emerging Technologies & Trends in Data & Analytics, What’s Next in Data & Analytics Architecture, The Future of Data Warehouses, Data Lakes & Data Hubs—as well as special day-long programs: DataOps Boot Camp, Database & DevOps Boot Camp, and AI & Machine Learning Summit.

        Doug Laney, Data & Strategy Innovation Fellow at West Monroe and author of Infonomics, is the keynote speaker with his presentation Data Is Not the New Oil.

        On Tuesday, May 17, don’t miss Unravel Co-Founder & CEO Kunal Agarwal’s feature presentation, Exploiting the Multi-Cloud Opportunity With DataOps, in which he

        • details the common obstacles that data teams encounter in data migration
        • explains why next-generation data tools must evolve beyond simple observability to provide prescriptive insights
        • shares best practices for optimizing big data costs
        • demonstrates through real-world case studies how a mature DataOps practice can accelerate even the most complex cloud migration projects.

        So stop by the Unravel booth to go beyond observability, win an Oculus VR headset, and see how Unravel’s AI-enabled purpose-built observability for modern data stack can help your data team monitor, observe, manage, troubleshoot, and optimize the performance and cost of large-scale modern data applications. 

        Register here

        The post Visit Unravel at Data Summit 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/visit-unravel-at-data-summit-2022/feed/ 0
        Roundtable Recap: DataOps Just Wanna Have Fun https://www.unraveldata.com/resources/roundtable-recap-dataops-just-wanna-have-fun/ https://www.unraveldata.com/resources/roundtable-recap-dataops-just-wanna-have-fun/#respond Fri, 06 May 2022 12:04:42 +0000 https://www.unraveldata.com/?p=9378

        We like to keep things light at Unravel. In a recent event, we hosted a group of industry experts for a night of laughs and drinks as we discussed cloud migration and heard from our friends […]

        The post Roundtable Recap: DataOps Just Wanna Have Fun appeared first on Unravel.

        ]]>

        We like to keep things light at Unravel. In a recent event, we hosted a group of industry experts for a night of laughs and drinks as we discussed cloud migration and heard from our friends at Don’t Tell Comedy.

        Unravel VP of Solutions Engineering Chris Santiago and AWS Sr. Worldwide Business Development Manager for Analytics Kiran Guduguntla moderated a discussion with data professionals from Black Knight, TJX Companies, AT&T Systems, Georgia Pacific, and IBM, among others.

        This post briefly recaps that discussion.

        The cloud journey

        To start, Chris asked attendees where they were in their cloud journey. The top responses were tied, with hybrid cloud and on-prem being the most popular. Following those were cloud-native, cloud-enabled, and multi-cloud workloads.

        Kiran, who focuses primarily on migrations to EMR, responded to these results noting that he wasn’t surprised. The industry has seen significant churn in the past few years, especially people moving from Hadoop. As clients move to the cloud, EMR continues to lead the pack as a top choice for users.

        Migration goals

        As a follow-up question, we conducted a second poll to learn about the migration goals of our attendees. Are they prioritizing cost optimization? Seeking greater visibility? Boosting performance? Or are they looking for ways to better automate and decrease time to resolution?

        Unsurprisingly, the number one migration goal was reducing and optimizing resource usage. Cost is king. Kiran explained the results of an IDC study that followed users as they migrated their workloads from on-premises environments to EMR. The study found that customers saved about 57%, and the ROI over five years rose to 350%.

        He emphasized that cost isn’t the only benefit of migration from on-prem to the cloud. The shift allows for better management, reduced administration, and better performance. Customers can run their workloads two or three times faster because of the optimization included in EMR frameworks.

        Thinking about migrating to AWS? Start with Unravel
        Discover why

        Data security and privacy in the cloud

        One attendee moved the conversation along by raising the question many clients are asking: How can I be sure of data security? Their priority is meeting regulatory compliance and taking every step to ensure they aren’t hacked. The main concern is not how to use the cloud, but how to secure the cloud.

        Kiran agreed, emphasizing that security is paramount at AWS. He explained the security features AWS implements to promote data security:

        1. Data encryption

        AWS encrypts data either while in S3 or as it’s in motion to S3.

        2. Access control

        Providing fine-grained access control using Lake Formation, combined with robust audit controls, limits data access.

        3. Compliance

        AWS meets every major compliance requirement, including GDPR.

        He continued, noting that making these features available is good, but it is essential to architect them to meet the user’s or clients’ particular requirements.

         Interested in learning more about Unravel for EMR migrations? Start here.

        The post Roundtable Recap: DataOps Just Wanna Have Fun appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/roundtable-recap-dataops-just-wanna-have-fun/feed/ 0
        Beyond Observability for the Modern Data Stack https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/ https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/#respond Tue, 26 Apr 2022 16:04:21 +0000 https://www.unraveldata.com/?p=9253

        The term “observability” means many things to many people. A lot of energy has been spent—particularly among vendors offering an observability solution—in trying to define what the term means in one context or another. But instead […]

        The post Beyond Observability for the Modern Data Stack appeared first on Unravel.

        ]]>

        The term “observability” means many things to many people. A lot of energy has been spent—particularly among vendors offering an observability solution—in trying to define what the term means in one context or another.

        But instead of getting bogged down in the “what” of observability, I think it’s more valuable to address the “why.” What are we trying to accomplish with observability? What is the end goal? 

        At Unravel, I’m not only the co-founder and CTO but also the head of our Customer Success team, so when thinking about modern data stack observability, I look at the issue through the lens of customer problems and what will alleviate their pain and solve their challenges. 

        I start by considering the ultimate end goal or Holy Grail: autonomous systems. Data teams just want things to work; they want everything to be taken care of, automatically. They want the system itself to be watching over issues, resolving problems without any human intervention at all. It’s the true spirit of AI: all they have to do is tell the system what they want to achieve. It’s going from reactive to proactive. No one ever has to troubleshoot the problem, because it was “fixed” before it ever saw the light of day. The system recognizes that there will be a problem and auto-corrects the issue invisibly.

        As an industry, we’re not completely at that point yet for the modern data stack. But we’re getting closer.

        We are on a continuous and incremental spectrum of progressively less toil and more automation: going from the old ways of manually stitching together logs, metrics, traces, and events from disparate systems; to accurate extraction and correlation of all that data in context; to automatic insights and identification of significant patterns; to AI-driven actionable recommendations; to autonomous governance.

        spectrum of observability and beyond

        “Beyond observability” is on a spectrum from manual to autonomous

        See how Unravel goes “beyond observability” for the modern data stack
        Create a free account

        Beyond observability: to optimization and on to governance

        Let’s say you have a data pipeline from Marketing Analytics called ML_Model_Update_and_Scoring that’s missing its SLA. The SLA specifies that the pipeline must run in less than 20 minutes, but it’s taking 25 minutes. What happened? Why is the data pipeline now too slow? And how can you tune things to meet the SLA? This particular application is pretty complex, with multiple jobs processing in parallel and several components (orchestration, compute, auto-scaling clusters, streaming platforms, dashboarding), so the problem could be anywhere along the line.
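        For context, an SLA like this is usually declared in the orchestrator so a miss at least gets flagged. The sketch below shows how the 20-minute target might look in Airflow; the DAG name, schedule, command, and callback are hypothetical, not this pipeline’s actual definition:

            from datetime import datetime, timedelta

            from airflow import DAG
            from airflow.operators.bash import BashOperator

            def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
                print(f"SLA missed by: {task_list}")   # stand-in for paging the on-call channel

            with DAG(
                dag_id="ml_model_update_and_scoring",      # hypothetical DAG name
                start_date=datetime(2022, 1, 1),
                schedule_interval="@hourly",
                sla_miss_callback=notify_sla_miss,
                catchup=False,
            ) as dag:
                score = BashOperator(
                    task_id="score_models",
                    bash_command="spark-submit score.py",  # hypothetical entry point
                    sla=timedelta(minutes=20),             # the 20-minute SLA from this example
                )

        Declaring the SLA tells you that the run is late; it does nothing to explain why, which is where the rest of this walkthrough picks up.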

        It’s virtually impossible to manually pore through the thousands of logs that are generated at each stage of the pipeline from the various tools—Spark and Kafka and Airflow logs, Databricks cluster logs, etc.—to “untangle the wires” and figure out where the slowdown could be. But where should you even start? Even if you have an idea of what you’re looking for, it can take hours—even days or weeks for highly complex workflows—to stitch together all the raw metrics/events/logs/traces data and figure out what’s relevant to why your data pipeline is missing its SLA. Just a single app can run 10,000 containers on 10,000 nodes, with 10,000 logs.

        Complexity of modern data pipelines

        Modern data stacks are simply too complex for humans to manage by hand

        That’s where observability comes in.

        Observability tells you “what”

        Instead of having to sift through reams of logs and cobble together everything manually, full-stack observability extracts all the relevant raw data from all the various components of your data stack and correlates it to paint a picture of what’s going on. All the information captured by telemetry data (logs, metrics, events, traces) is pulled together in context and in the “language you’re speaking”—the language of data applications and pipelines.

        full stack observability data

        Full-stack observability correlates data from all systems to provide a clear picture of what happened

        In this case, observability shows you that while the application completed successfully, it took almost 23 minutes (22m 57s)—violating the SLA of 20 minutes. Here, Unravel takes you to the exact spot in the complex pipeline shown earlier and, on the left, has pulled together granular details about the various jobs processing in parallel. You can toggle on a Gantt chart view to get a better view of the degree of parallelism:

        see data jobs processing in parallel

        A Gantt chart view breaks down all the individual jobs and sub-tasks processing in parallel

        So now you know what caused the pipeline to miss the SLA and where it happened (jobs #0 and #3), but you don’t know why. You’ve saved a lot of investigation time—you get the relevant data in minutes, not hours—and can be confident that you haven’t missed anything, but you still need to do some detective work to analyze the information and figure out what went wrong.

        observability, optimization and governance

        Optimization tells you “why”—and what to do about it

        The better observability tools also point out, based on patterns, things you should pay attention to—or areas you don’t need to investigate. By applying machine learning and statistical algorithms, it essentially throws some math at the correlated data to identify significant patterns—what’s changed, what hasn’t, what’s different. This pipeline runs regularly; why was it slow this time? It’s the same kind of analysis a human expert would do, only done automatically with the help of ML and statistical algorithms.
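        A toy version of that pattern analysis, using made-up run durations and a simple z-score in place of the richer models a real engine applies, looks like this:

            from statistics import mean, stdev

            history = [19.5, 20.1, 18.9, 19.8, 20.4, 19.2, 20.0]   # recent run durations, in minutes
            latest = 22.9                                           # today's run

            mu, sigma = mean(history), stdev(history)
            z = (latest - mu) / sigma
            if z > 3:
                print(f"Anomalous run: {latest} min is {z:.1f} standard deviations above normal")
            else:
                print(f"Within normal variation (z = {z:.1f})")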

        While it would certainly be helpful to get some generalized insight into why the pipeline slowed—it’s a memory issue—what you really need to know is what to do about it. 

        AI-enabled observability goes beyond telling you what happened and why, to pinpointing the root cause and providing recommendations on exactly what you need to do to fix it.

        AI-driven recommendations to fix slow data pipeline

        AI-driven recommendations pinpoint exactly where—and how—to optimize

        AI-driven recommendations provide specific configuration parameters that need to be applied in order for the pipeline to run faster and meet the 20-minute SLA. After the AI recommendations are implemented, we can see that the pipeline now runs in under 11 minutes—a far cry from the almost 23 minutes before.
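        To give a feel for what such a recommendation contains, here is an illustrative set of Spark memory and shuffle settings applied to a session. These particular values are hypothetical, not the ones generated for this pipeline:

            from pyspark.sql import SparkSession

            recommended_conf = {
                "spark.executor.memory": "8g",           # more memory per executor to stop spills and retries
                "spark.executor.memoryOverhead": "1g",
                "spark.sql.shuffle.partitions": "200",   # align shuffle partitions with the data volume
            }

            builder = SparkSession.builder.appName("ml_model_update_and_scoring")
            for key, value in recommended_conf.items():
                builder = builder.config(key, value)
            spark = builder.getOrCreate()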

        Applying AI recommendations to meet SLAs

        AI recommendations improved pipeline performance from 23m to <11m, now meeting its SLA

        Too often getting to the point of actually fixing the problem is another time-consuming effort of trial and error. Actionable AI recommendations won’t fix things automatically for you—taking the action still requires human intervention—but all the Sherlock Holmes work is done for you. The fundamental questions of what went wrong, why, and what to do about it are answered automatically.

        Beyond optimizing performance, AI recommendations can also identify where cost could be improved. Say your pipeline is completing within your SLA commitments, but you’re spending much more on cloud resources than you need to. An AI engine can determine how many or what size containers you actually need to run each individual component of the job—vs. what you currently have configured. Most enterprises soon realize that they’re overspending by as much as 50%. 
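        The arithmetic behind that figure is straightforward. With invented utilization numbers and an assumed blended hourly rate, right-sizing a cluster from 20 workers to the 10 that peak load actually needs cuts the bill roughly in half:

            configured_workers = 20
            needed_workers = 10            # what observed peak utilization actually supports
            hours_per_month = 300
            cost_per_worker_hour = 0.90    # assumed blended VM + platform rate, in USD

            current = configured_workers * hours_per_month * cost_per_worker_hour
            optimized = needed_workers * hours_per_month * cost_per_worker_hour
            savings_pct = 100 * (current - optimized) / current
            print(f"current ${current:,.0f}/mo vs right-sized ${optimized:,.0f}/mo ({savings_pct:.0f}% saved)")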

        AI recommendations on optimizing cost

        Start getting AI recommendations for your data estate today
        Create a free account

        Governance helps avoid problems in the first place

        These capabilities save a ton of time and money, but they’re still relatively reactive. What every enterprise wants is a more proactive approach to making their data applications and pipelines run better, faster, cheaper. Spend less time firefighting because there was no fire to put out. The system itself would understand that the memory configurations were insufficient and automatically take action so that pipeline performance would meet the SLA.

        For a whole class of data application problems, this is already happening. AI-powered recommendations and insights lay the groundwork for putting in place some automated governance policies that take action on your behalf.

        what "beyond observability" does

        Governance is really all about converting the AI recommendations and insights into impact. In other words, have the system run automatic responses that implement fixes and remediations for you. No human intervention is needed. Instead of reviewing the AI recommendation and then pulling the trigger, have the system apply the recommendation automatically. 

        automated guardrails for data pipelines

        Automated alerts proactively identify SLA violations

        Policy-based governance rules could be as benign as sending an alert to the user if the data table exceeds a certain size threshold or as aggressive as automatically requesting a configuration change for a container with more memory. 
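        A minimal sketch of such a policy check, with hypothetical thresholds, table names, and notification hooks, might look like this:

            def notify(message):
                print(f"[alert] {message}")                       # stand-in for Slack, email, or a pager

            def request_config_change(job, setting, value):
                print(f"[action] {job}: set {setting}={value}")   # stand-in for an API call to the scheduler

            def apply_policies(table, table_size_gb, job, peak_mem_gb, container_mem_gb):
                # Benign policy: warn when a table crosses a size threshold.
                if table_size_gb > 500:
                    notify(f"{table} is {table_size_gb:.0f} GB, above the 500 GB threshold")
                # Aggressive policy: ask for bigger containers when memory use is near the limit.
                if peak_mem_gb > 0.9 * container_mem_gb:
                    request_config_change(job, "spark.executor.memory", f"{int(container_mem_gb * 1.5)}g")

            apply_policies("marketing.events", 612.0, "ML_Model_Update_and_Scoring", 7.5, 8.0)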

        This is true AIOps. You don’t have to wait until after the fact to address an issue or perpetually scan the horizon for emerging problems. The system applies AI/ML to all the telemetry data extracted and correlated from everywhere in your modern data stack to not only tell you what went wrong, why, and what to do about it, but it predicts and prevents problems altogether without any human having to touch a thing.

        The post Beyond Observability for the Modern Data Stack appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/beyond-observability-for-the-modern-data-stack/feed/ 0
        Building vs. Buying Your Modern Data Stack: A Panel Discussion https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/ https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/#respond Thu, 21 Apr 2022 19:55:09 +0000 https://www.unraveldata.com/?p=9214

        One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers […]

        The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

        ]]>
        Abstract Infinity Loop Background

        One of the highlights of the DataOps Unleashed 2022 virtual conference was a roundtable panel discussion on building versus buying when it comes to your data stack. Build versus buy is a question for all layers of the enterprise infrastructure stack. But in the last five years, and even in just the last year alone, it’s hard to think of a part of IT that has seen more dramatic change than the modern data stack.

        DataOps Unleashed build vs buy panel

        These transformations shape how today’s businesses engage and work with data. Moderated by Lightspeed Venture Partners’ Nnamdi Iregbulem, the panel’s three conversation partners (Andrei Lopatenko, VP of Engineering at Zillow; Gokul Prabagaren, Software Engineering Manager at Capital One; and Aaron Richter, Data Engineer at Squarespace) weighed in on the build versus buy question and walked us through their thoughts:

        • What motivates companies to build instead of buy?
        • How do particular technologies and/or goals affect their decision?

        The panel discussed these and other considerations. A few of the highlights follow, but the entire session is available on demand here.

        What are the key variables to consider when deciding whether to build or buy in the data stack? 

        Gokul: I think the things which we probably consider most are what kind of customization a particular product offers or what we uniquely need. Then there are the cases in which we may need unique data schemas and formats to ingest the data. We must consider how much control we have of the product and also our processing and regulatory needs. We have to ask how we will be able to answer those kinds of questions if we are building in-house or choosing to adopt an outsourced product.

        Gokul Prabagaren quote build vs buy

        Aaron: Thinking from the organizational perspective, there are a few factors that come from just purchasing or choosing to invest in something. Money is always a factor. It’s going to depend on the organization and how much you’re willing to invest. 

        Beyond that a key factor is the expertise of the organization or the team. If a company has only a handful of analysts doing the heavy-lifting data work, to go in and build an orchestration tool would take them away from their focus and their expertise of providing insights to the business. 

        Andrei: Another important thing to consider is the quality of the solution. Not all the data products on the market have high quality from different points of view. So sometimes it makes sense to build something, to narrow the focus of the product. Compatibility with your operations environment is another crucial consideration when choosing build versus buy.

        What’s the more compelling consideration: saving headcount or increasing productivity of the existing headcount? 

        Aaron: In general, everybody’s oversubscribed, right? Everybody always has too much work to do. And we don’t have enough people to accomplish that work. From my perspective, the compelling part is, we’re going to make you more efficient, we’re going to give you fewer headaches, and you’ll have fewer things to manage. 

        Gokul: I probably feel the same. It depends more on where we want to invest and if we’re ready to change where we’re investing: upfront costs or running costs. 

        Andrei: And development costs: do we want to buy this, or invest in building? And again, consider the human equation. It’s not just the number of people in your headcount. Maybe you have a small number of engineers, but then you have to invest more of their time into data science or data engineering or analytics. Saving time is a significant factor when making these choices.

        Andrei Lopatenko quote build vs buy

        How does the decision matrix change when the cloud becomes part of the consideration set in terms of build versus buy? 

        Gokul: I feel like it’s trending towards a place where it’s more managed. That may not be the same question as build or buy. But it skews more towards the manage option, because of that compatibility, where all these things are available within the same ecosystem. 

        Aaron: I think about it in terms of three pillars: a cloud data warehouse; some kind of processing tool, like dbt; and then some kind of orchestration tool, like Airflow or Prefect. There’s probably one pillar there that you would never think to build yourself, and that’s the cloud data warehouse. So you’re now kind of always going to be paying for a cloud vendor, whether it’s Snowflake or BigQuery or something of that nature.

        Aaron Richter quote build vs buy 

        So you already have your foot in the door there, and you’re already buying, right? So then that opens the door now to buying more things, adding things on that integrate really easily. This approach helps the culture shift. If a culture is very build-oriented, this allows them to be more okay with buying things. 

        Andrei: Theoretically you want to have your infrastructure independent of the cloud, but it never happens, for multiple reasons. First, cloud company tools make integration work much easier. Second, of course, once you have to think about multi-cloud, you must address privacy and security concerns. In principle, it’s possible to be independent, but you’ll often run into a lot of technical problems. There are multiple different factors when cloud becomes key in deciding what you will make and what tools to use.

        See the entire Build vs. Buy roundtable discussion on demand
        Watch now

        The post Building vs. Buying Your Modern Data Stack: A Panel Discussion appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/building-vs-buying-your-modern-data-stack/feed/ 0
        Key Findings of the 2022 DataOps Unleashed Survey https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/ https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/#respond Thu, 14 Apr 2022 13:05:40 +0000 https://www.unraveldata.com/?p=9148 DataOps Abstract Background

        At the recent DataOps Unleashed 2022 virtual conference, we surveyed several hundred leading DataOps professionals across multiple industries in North America and Europe on a range of issues, including: Current adoption of DataOps approaches Top challenges […]

        The post Key Findings of the 2022 DataOps Unleashed Survey appeared first on Unravel.

        ]]>
        DataOps Abstract Background

        At the recent DataOps Unleashed 2022 virtual conference, we surveyed several hundred leading DataOps professionals across multiple industries in North America and Europe on a range of issues, including:

        • Current adoption of DataOps approaches
        • Top challenges in operating/managing their data stack
        • How long they estimate cloud migration will take
        • Where they are prioritizing automation
        • On which tasks they spend their time

        DataOps Unleashed Survey

        Download our infographic here.


        Highlights and takeaways

        DataOps as a practice is hitting an inflection point

        DataOps has gained momentum over the past 12 months. While a DataOps approach is still in the early innings for most companies, almost 4 out of 5 respondents said they are “having active discussions” or “progressing,” an 80% jump from last year.

        Visibility into data pipelines remains the top challenge

        For the second consecutive year, “lack of visibility across the environment” was the #1 challenge. Interestingly, “lack of proactive alerts” leapfrogged “expensive runaway jobs/pipelines” as the second biggest challenge, followed by “lack of experts” and “no visibility into cost/usage.”

        Projected cloud migration schedules are longer

        When we asked respondents to estimate how long their cloud migration will take, only 28% said just one year—with the vast majority saying 2 years or more. This is a 150% jump from last year, reflecting the growing realization of the complexity in moving from on-prem to cloud.

        Automation continues to be a key driver

        Over 75% of survey respondents said that being able to “automatically test and verify before moving jobs/pipelines to production” was their top automation priority. Right behind was automating the troubleshooting of pipeline issues (65%), followed by a three-way tie (about 33% each) among automatically troubleshooting platform issues, troubleshooting jobs, and automatically reducing usage costs.

        Teams spend more time building pipelines than maintaining/deploying them

        Like last year, we saw that data teams spend more time building pipelines (43% in 2022, up from 39% in 2021) than maintaining/troubleshooting them (30% in 2022 vs. 34% in 2021) or deploying pipelines (holding steady at 27%).

        Interested in more DataOps trends? 

        Check out the summary recap of the DataOps Unleashed 2022 keynote session, Three Venture Capitalists Weigh In on the State of DataOps 2022, or watch the full roundtable discussion on demand.

        The post Key Findings of the 2022 DataOps Unleashed Survey appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/key-findings-of-the-2022-dataops-unleashed-survey/feed/ 0
        What’s New in Amazon EMR Unveiled at DataOps Unleashed 2022 https://www.unraveldata.com/resources/whats-new-in-amazon-emr-unveiled-at-dataops-unleashed-2022/ https://www.unraveldata.com/resources/whats-new-in-amazon-emr-unveiled-at-dataops-unleashed-2022/#respond Fri, 08 Apr 2022 15:51:36 +0000 https://www.unraveldata.com/?p=9125 Geometric Design Pattern Background

        At the DataOps Unleashed 2022 virtual conference, AWS Principal Solutions Architect Angelo Carvalho presented How AWS & Unravel help customers modernize their Big Data workloads with Amazon EMR. The full session recording is available on demand, […]

        The post What’s New in Amazon EMR Unveiled at DataOps Unleashed 2022 appeared first on Unravel.

        ]]>
        Geometric Design Pattern Background

        At the DataOps Unleashed 2022 virtual conference, AWS Principal Solutions Architect Angelo Carvalho presented How AWS & Unravel help customers modernize their Big Data workloads with Amazon EMR. The full session recording is available on demand, but here are some of the highlights.

        Angelo opened his session with a quick recap of some of the trends and challenges in big data today: the ever increasing size and scale of data; the variety of sources and stores and silos; people of different skill sets needing to access this data easily balanced against the need for security, privacy, and compliance; the expertise challenge in managing open source projects; and, of course, cost considerations.

        He went on to give an overview of how Amazon EMR makes it easy to process petabyte-scale data using the latest open source frameworks such as Spark, Hive, Presto, Trino, HBase, Hudi, and Flink. But the lion’s share of his session delved into what’s new in Amazon EMR within the areas of cost and performance, ease of use, transactional data lakes, and security; the different EMR deployment options; and the EMR Migration Program.

        What’s new in Amazon EMR?

        Cost and performance

        EMR takes advantage of the new Amazon Graviton2 instances to provide differentiated performance at lower cost—up to 30% better price-performance. Angelo presented some compelling statistics:

        • Up to 3X faster performance than standard Apache Spark at 40% of the cost
        • Up to 2.6X faster performance than open-source Presto at 80% of the cost
        • 11.5% average performance improvement with Graviton2
        • 25.7% average cost reduction with Graviton2

        And you can realize these improvements out of the box while still remaining 100% compliant with open-source APIs. 

        Ease of use

        EMR Studio now supports Presto. EMR Studio is a fully managed integrated development environment (IDE) based on Jupyter notebooks that makes it easy for data scientists and data engineers to develop, visualize, and debug applications on an EMR cluster without having to log into the AWS console. So basically, you can attach and detach notebooks to and from the clusters using a single click at any time. 

        Transactional data lakes

        Amazon EMR has supported Apache Hudi for some time to enable transactional data lakes, but now it has added support for Spark SQL and Apache Iceberg. Iceberg is a high-performance format for huge analytic tables at massive scale. Created by Netflix and Apple, it brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to work safely in the same tables at the same time.
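        
        As a rough sketch of what that looks like in practice, the PySpark example below creates and queries an Iceberg table. It assumes an EMR cluster whose Spark session is already configured with an Iceberg catalog; the catalog, database, and table names are placeholders.
        
        # Minimal Iceberg example; assumes Spark on EMR with an Iceberg catalog
        # configured under the (placeholder) name "glue_catalog".
        from pyspark.sql import SparkSession
        
        spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()
        
        spark.sql("""
            CREATE TABLE IF NOT EXISTS glue_catalog.demo.events (
                event_id BIGINT,
                event_type STRING,
                event_ts TIMESTAMP
            )
            USING iceberg
        """)
        
        # Iceberg's snapshot-based metadata is what lets Spark, Trino, Flink, and
        # other engines read and write the same table concurrently and safely.
        spark.sql("INSERT INTO glue_catalog.demo.events VALUES (1, 'click', current_timestamp())")
        spark.sql("SELECT event_type, COUNT(*) FROM glue_catalog.demo.events GROUP BY event_type").show()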

        Security

        Amazon EMR has a comprehensive set of security features and functions, including isolation, authentication, authorization, encryption, and auditing. The latest version adds user execution role authorizations, as well as fine-grained access controls (FGAC) using AWS Lake Formation, and auditing using Lake Formation via AWS CloudTrail.

        Amazon EMR deployment options

        Different options for deploying Amazon EMR

        Options for deploying EMR

        There are multiple options for deploying Amazon EMR:

        • Deployment on Amazon EC2 allows customers to choose instances that offer optimal price and performance ratios for specific workloads.
        • Deployment on AWS Outposts allows customers to manage and scale Amazon EMR in on-premises environments, just as they would in the cloud.
        • Deployment on containers on top of Amazon Elastic Kubernetes Service (EKS). But note that at this time, Spark is the only big data framework supported by EMR on EKS.
        • Amazon EMR Serverless is a new function that lets customers run petabyte-scale data analytics in the cloud without having to manage or operate server clusters.
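        
        As a rough illustration of the serverless option, the boto3 sketch below creates an EMR Serverless application and submits a Spark job. The release label, IAM role ARN, and S3 paths are placeholders, and the exact parameters should be checked against the current AWS SDK documentation.
        
        # Hypothetical EMR Serverless example using boto3; all identifiers are
        # placeholders and should be replaced with real values.
        import boto3
        
        emr = boto3.client("emr-serverless", region_name="us-east-1")
        
        app = emr.create_application(
            name="demo-spark-app",
            releaseLabel="emr-6.6.0",
            type="SPARK",
        )
        
        job = emr.start_job_run(
            applicationId=app["applicationId"],
            executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
            jobDriver={
                "sparkSubmit": {
                    "entryPoint": "s3://my-bucket/scripts/etl_job.py",
                    "sparkSubmitParameters": "--conf spark.executor.memory=4g",
                }
            },
        )
        print("Started job run:", job["jobRunId"])
        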
        See how Unravel helps optimize Amazon EMR
        Create a free account

        Using Amazon’s EMR migration program

        The EMR migration program was launched to help customers streamline their migration and answer questions like, How do I move this massive data set to EMR? What will my TCO look like if I move to EMR? How do we implement security requirements? 

        EMR Migration Program outcomes

        Amazon EMR Migration program outcomes

        Taking a data-driven approach to determine the optimal migration strategy, the Amazon EMR Migration Program (EMP) consists of three main steps:

        1. Assessing the migration process begins with creating an initial TCO report, conducting discovery meetings, and using Unravel to quickly discover everything about the data estate. 

        2. The mobilization stage involves delivering an assessment insights summary, qualifying for incentives, and developing a migration readiness plan.

        3. The migration stage itself includes executing the lift-and-shift migration of applications and data, before modernizing the migrated applications.

        Amazon relies on Unravel to perform a comprehensive AI-powered cloud migration assessment. As Angelo explained, “We partner with Unravel Data to take a faster, more data-driven approach to migration planning. We collect utilization data for about two to four weeks depending on the size of the cluster and the complexity of the workloads. 

        “During this phase, we are looking to get a summary of all the applications running on the on-premises environment, which provides a breakdown of all workloads and jobs in the customer environment. We identify killed or failed jobs—applications that fail due to resource contention and/or lack of resources—and bursty applications or pipelines.

        “For example, we would locate bursty apps to move to EMR, where they can have sufficient resources every time those jobs are run, in a cost-effective way via auto-scaling. We can also estimate migration complexity and effort required to move applications automatically. And lastly, we can identify tools suited for separate clusters. For example, if we identify long-running batch jobs that run at specific intervals, they might be good candidates for spinning a transient cluster only for that job.”

        Unravel is equally valuable during and after migration. Its AI-powered recommendations for optimizing applications simplify tuning, and its full-stack insights accelerate troubleshooting.

        GoDaddy results with Amazon EMR and Unravel

        To illustrate, Angelo concluded with an Amazon EMR-Unravel success story: GoDaddy was moving 900 scripts to Amazon EMR, and each one had to be optimized for performance and cost in a long, time-consuming manual process. But with Unravel’s automated optimization for EMR, they spent 99% less time tuning jobs—from 10+ hours to 8 minutes—saving 2700 hours of data engineering time. Performance improved by up to 72%, and GoDaddy realized $650,000 savings in resource usage costs.

        See the entire DataOps Unleashed session on demand
        Watch on demand

        The post What’s New in Amazon EMR Unveiled at DataOps Unleashed 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/whats-new-in-amazon-emr-unveiled-at-dataops-unleashed-2022/feed/ 0
        Roundtable Recap: Sharing Challenges in Migrating to Databricks https://www.unraveldata.com/resources/roundtable-recap-sharing-challenges-in-migrating-to-databricks/ https://www.unraveldata.com/resources/roundtable-recap-sharing-challenges-in-migrating-to-databricks/#respond Thu, 07 Apr 2022 21:30:10 +0000 https://www.unraveldata.com/?p=9116 Mesh Sphere Background

        Unravel Co-Founder and CTO Shivnath Babu recently hosted an informal roundtable discussion with representatives from Fortune 500 insurance firms, healthcare providers, and other enterprise companies. It was a chance for some peer-to-peer conversation about the challenges […]

        The post Roundtable Recap: Sharing Challenges in Migrating to Databricks appeared first on Unravel.

        ]]>
        Mesh Sphere Background

        Unravel Co-Founder and CTO Shivnath Babu recently hosted an informal roundtable discussion with representatives from Fortune 500 insurance firms, healthcare providers, and other enterprise companies. It was a chance for some peer-to-peer conversation about the challenges of migrating to the cloud and talk shop about Databricks. Here are some of the highlights.

        Where are you on your cloud journey?

        Everybody is at a different place in their migration to the cloud. So, the first thing we wanted to understand is exactly where each participant stood. While cloud migration challenges are pretty much universally the same for anyone, the specific top-of-mind challenges are a bit different at each stage of the game.

        where are you on your cloud journey

        About one-third of the participants said that they are running their data applications on a mix of on-prem and cloud environments. This is not surprising, as most of the attendees work in sectors that have been leveraging big data for some time—specifically, insurance and healthcare—and so have plenty of legacy systems to contend with. As one contributor noted, “If a company has been around for more than 10 years, they are going to have multiple systems and they’re going to have legacy systems. Every company is in some type of digital transformation. They start small with what they can afford. And as they grow, they’re able to buy better products.”

        Half indicated that they are in multi-cloud environments—which presents a different set of challenges. One participant who has been on the cloud migration path for 3-4 years said that along the way, “we’ve accumulated new tech debt because we have both AWS Lake Formation and Databricks. This generated some complexity, having two flavors of data catalog. So we’re trying to figure out how to deal with that and get things back on track so we can have better governance over data access.”

        What are your biggest challenges in migrating to Databricks?

        We polled the participants on what their primary goals are in moving to the cloud and what the top challenges they are experiencing. The responses showed a dead heat (43% each) between “better cost governance, chargeback & forecasting” and “improve performance of critical data pipelines with SLAs,” with “reduce the number of tickets and time to resolution” coming in third.

        what are your top cloud migration challenges

         

        Again, not surprising results given that the majority of the audience were data engineers on the hook for ensuring data application performance and/or people who have been running data pipelines in the cloud for a while. It’s usually not until you start running more and more data jobs in the cloud that cost becomes a big headache.

        How does Unravel help with your Databricks environment?

        Within the context of the poll questions, Shivnath addressed—at a very high level—how Unravel helps solve the roundtable’s most pressing challenges.

        If you think about an entire data pipeline converting data into insights, it can be a multi-system operation: data might be ingested through Kafka and land in a data lake or lakehouse, while the actual processing happens in Databricks. It’s not uncommon to see a dozen or more different technologies among the various components of a modern data pipeline.

        What happens quickly is that you end up with an architecture that is very modern, very agile, very cloud-friendly, and very elastic, but where it’s very difficult to even understand who’s using what and how many resources it takes. An enterprise may have hundreds of data users, with more constantly being added.

        Data gets to be like a drug, where the company wants more and more of it, driving more and more insights. And when you compare the number of engineers tasked with fixing problems with the number of data users, it’s clear that there just isn’t enough operational muscle (or enough people with the operational know-how and skill sets) to tackle the challenges. 

        See how Unravel helps manage Databricks with full-stack observability
        Create a free account

        Unravel helps fill that gap, and it solves the problem with AI and ML. In a nutshell, what Unravel does is collect all the telemetry information from every system—legacy, cloud, SQL, data pipelines, access to data, the infrastructure—and stream it into the Unravel platform, where it is analyzed with AI/ML to convert that telemetry data into insights to answer questions like

        • Where are all my costs going?
        • Are they being efficiently used?
        • And if not, how do I improve?

        Shivnath pointed out that he sees a lot of customers running very large Databricks clusters when they don’t need to. Maybe they’re over-provisioning jobs because of an inefficiency in how the data is stored, or in how the SQL is actually written. Unravel helps with this kind of tuning optimization, reducing costs and building better governance policies from the get-go—so applications are already optimized when they are pushed into production. 

        Unravel helps different members of data teams. It serves the data operations folks, the SREs, the architects who are responsible for designing the right architecture and setting governance policies, the data engineers who are creating the applications. 

        And everything is based on applying AI/ML to telemetry data.

        Want to see Unravel AI in action? Create a free account here.

        The post Roundtable Recap: Sharing Challenges in Migrating to Databricks appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/roundtable-recap-sharing-challenges-in-migrating-to-databricks/feed/ 0
        From Data to Value: Building a Scalable Data Platform with Google Cloud https://www.unraveldata.com/resources/from-data-to-value-building-a-scalable-data-platform-with-google-cloud/ https://www.unraveldata.com/resources/from-data-to-value-building-a-scalable-data-platform-with-google-cloud/#respond Fri, 25 Mar 2022 19:50:21 +0000 https://www.unraveldata.com/?p=8766 Mesh Sphere Background

        At the DataOps Unleashed 2022 conference, Google Cloud’s Head of Product Management for Open Source Data Analytics Abhishek Kashyap discussed how businesses are using Google Cloud to build secure and scalable data platforms. This article summarizes […]

        The post From Data to Value: Building a Scalable Data Platform with Google Cloud appeared first on Unravel.

        ]]>
        Mesh Sphere Background

        At the DataOps Unleashed 2022 conference, Google Cloud’s Head of Product Management for Open Source Data Analytics Abhishek Kashyap discussed how businesses are using Google Cloud to build secure and scalable data platforms. This article summarizes key takeaways from his presentation, Building a Scalable Data Platform with Google Cloud.

        Data is the fuel that drives informed decision-making and digital transformation. Yet major challenges remain when it comes to tackling the demands of data at scale.

        The world’s datasphere reached 33 zettabytes in 2018. IDC estimates that it will grow more than fivefold by 2025 to reach 175 zettabytes.

        size of datasphere by 2025

        IDC estimates the datasphere to grow to 175ZB by 2025.

        From data to value: A modern journey

        “Building a [modern] data platform requires moving from very expensive traditional [on-prem] storage to inexpensive managed cloud storage because you’re likely going to get to petabytes—hundreds of petabytes of data—fairly soon. And you just cannot afford it with traditional storage,” says Abhishek.

        He adds that you need real-time data processing for many jobs. To tie all the systems together—from applications to databases to analytics to machine learning to BI—you need “an open and flexible universal system. You cannot have systems which do not talk to each other, which do not share metadata, and then expect that you’ll be able to build something that scalable,” he adds.

        Further, you need governed and scalable data sharing systems, and you need to build machine learning into your pipelines, your processes, and your analytics platform. “If you consider your data science team as a separate team that does special projects and needs to download data into their own cluster of VMs, it’s not going to work out,” says Abhishek.

        And finally, he notes, “You have to present this in an integrated manner to all your users to platforms that they use and to user interfaces they are familiar with.”

        traditional vs modern data platforms

        The benefits of a modern data platform vs. traditional on-prem

        The three tenets of a modern data platform

        Abhishek outlined three key tenets of a modern data platform:

        • Scalability
        • Security and governance
        • End-to-end integration

        And showed how Google Cloud Platform addresses each of these challenges.

        Scalability

        Abhishek cited an example of a social media company that has to process more than a trillion messages each day. It has over 300 petabytes of data stored, mostly in BigQuery, and is using more than half a million compute cores. This scale is achievable because of two things: the separation of compute and storage, and being serverless. 

        The Google Cloud Platform divorces storage from compute

        By segregating storage from compute, “you can run your processing and analytics engine of choice that will scale independently of the storage,” explains Abhishek. Google offers a low-cost object store as well as an optimized analytical store for data warehousing workloads, BigQuery. You can run Spark on Dataproc, Dataflow for streaming, BigQuery SQL for your ETL pipelines, Data Fusion, and your machine learning models on Cloud AI Platform—all without tying any of that to the underlying data, so you can scale the two separately. Google has multiple customers who have run queries on 100,000,000,000,000 (one hundred trillion) rows on BigQuery, with one running 10,000 concurrent queries.
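        
        To illustrate the serverless, storage-separated model, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names are placeholders; the point is that the caller provisions no cluster, because BigQuery scales the query engine independently of where the data sits.
        
        # Minimal BigQuery example; project/dataset/table names are placeholders.
        from google.cloud import bigquery
        
        client = bigquery.Client(project="my-gcp-project")
        
        query = """
            SELECT event_type, COUNT(*) AS events
            FROM `my-gcp-project.analytics.events`
            WHERE event_date = CURRENT_DATE()
            GROUP BY event_type
            ORDER BY events DESC
        """
        
        # The query runs on BigQuery's serverless engine; only results come back.
        for row in client.query(query).result():
            print(row.event_type, row.events)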

        “Talk to any DBA or data engineer who’s working on a self-managed Hadoop cluster, or data warehouse on prem, and they’ll tell you how much time they spent in thinking about virtual machines resource provisioning, fine tuning, putting in purchase orders, because they’ll need to scale six months from now, etc, etc. All that needs to go away,” says Abhishek. He says if users want to use the data warehouse, all they think about is SQL queries. If they use Spark, all they think about is code. 

        Google has serverless products for each step of the data pipeline. To ingest and distribute data reliably, there’s the serverless auto-scaling messaging queue Pub/Sub, Dataproc for serverless Spark, Dataflow for serverless Beam, and BigQuery (the first serverless data warehouse) for analysis. 

        See the entire presentation on Building a Scalable Data Platform with Google Cloud
        Watch presentation on demand

        Security and governance

        With data at this scale, with users spread across business units and data spread across a variety of sources, security and governance must be automated. You can’t manually file tickets for every access request or manually audit everything; that just doesn’t scale.

        Google has a product called Dataplex, which essentially allows you to take a logical view of all your data spread across data warehouses, data lakes, and data marts, and build a centralized, governed set of data lakes whose life cycle you can manage. Let’s say you have structured, semi-structured, and streaming data stored in a variety of places, and you have analytics happening through BigQuery SQL or through Spark or Dataflow. Dataplex provides a layer in the middle that allows you to set up automatic data discovery, harvest metadata, do things like surface a file in object storage as a table in BigQuery, track where that metadata goes, and ensure data quality.

        So you store your data where your applications need it, split your applications from the data, and have Dataplex manage security, governance, and lifecycle management for these data lakes. 

        End-to-end integration

        Effective data analytics ultimately serve to make data-driven insights available to anyone in the business who needs them. End-to-end integration with the modern data stack is crucial so that the platform accommodates the various tools that different teams are using—not the other way around.

        The Google Cloud Platform does this by enhancing the capabilities of the enterprise data warehouse—specifically, BigQuery. BigQuery consolidates data from a variety of places to make it available for analytics in a secure way through the AI Platform, Dataproc for Spark jobs, Dataflow for streaming, Data Fusion for code-free ETL. All BI platforms work with it, and there’s a natural language interface so citizen data scientists can work with it. 

        This allows end users to all work through the same platform, but doing their work through languages and interfaces of their choice.

        Google Cloud Platform for big data

        Google offering for each stage of the data pipeline

        Want AI-enabled full-stack observability for GCP?
        Request a free account

        Wrapping up

        Abhishek concludes with an overview of the products that Google offers for each stage of the data pipeline. “You might think that there are too many products,” Abhishek says. “But if you go to any customer that’s even medium size, what you’ll find is different groups have different needs, they work through different tools. You might find one group that loves Spark, you might find another group that needs to use Beam, or a third group that wants to use SQL. So to really help this customer build an end-to-end integrated data platform, we provide this fully integrated, very sophisticated set of products that has an interface for each of the users who needs to use this for the applications that they build.”

        See how Unravel takes the risk out of managing your Google Cloud Platform migrations.

        The post From Data to Value: Building a Scalable Data Platform with Google Cloud appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/from-data-to-value-building-a-scalable-data-platform-with-google-cloud/feed/ 0
        The New Cloud Migration Playbook: Strategies for Data Cloud Migration and Management https://www.unraveldata.com/resources/the-new-cloud-migration-playbook-strategies-for-data-cloud-migration-and-management/ https://www.unraveldata.com/resources/the-new-cloud-migration-playbook-strategies-for-data-cloud-migration-and-management/#respond Wed, 09 Mar 2022 17:53:33 +0000 https://www.unraveldata.com/?p=8648

        Experts from Microsoft, WANdisco, and Unravel Data recently outlined a step-by-step playbook—utilizing a data-driven approach—for migrating and managing data applications in the cloud. Unravel Data Co-Founder and CTO Shivnath Babu moderated a discussion with Microsoft Chief […]

        The post The New Cloud Migration Playbook: Strategies for Data Cloud Migration and Management appeared first on Unravel.

        ]]>

        new cloud migration playbook gif

        Experts from Microsoft, WANdisco, and Unravel Data recently outlined a step-by-step playbook—utilizing a data-driven approach—for migrating and managing data applications in the cloud.

        Unravel Data Co-Founder and CTO Shivnath Babu moderated a discussion with Microsoft Chief Architect Amit Agrawal and WANdisco CTO Paul Scott-Murphy on how Microsoft, WANdisco, and Unravel complement each other in accelerating the migration of complex data applications like Hadoop to cloud environments like Azure.

        Shivnath is an industry leader in making large-scale data platforms and distributed systems easy to manage. At Microsoft, Amit is responsible for driving cross-cloud solutions and is an expert in Hadoop migrations. Paul spearheads WANdisco’s “data-first strategy,” especially in how to migrate fast-changing and rapidly growing datasets to the cloud in a seamless fashion.

        We’re all seeing how data workloads are moving to the cloud en masse. There are several reasons, with higher business agility probably topping the list. Plus you get access to the latest and greatest modern software to power AI/ML initiatives. If you optimize properly, the cloud can be hugely cost-beneficial. And then there’s the issue of Cloudera stopping support for its older on-prem software.

        But getting to the cloud is not easy. Time and time again, we hear how companies’ migration to the cloud goes over budget and behind schedule. Most enterprises have complex on-prem environments, which are hard to understand. Mapping them to the best cloud topology is even harder—especially when there isn’t enough talent available with the expertise to migrate workloads correctly. All too often, migration involves a lot of time-consuming, error-prone manual effort. And one of the bigger challenges is simply the lack of prioritization from the business.

        During the webinar, the audience was asked, “What challenges have you faced during your cloud migration journey?” Nearly two-thirds responded with “complexity of my current environment,” followed by “finding people with the right skills.”

        cloud migration poll

        These challenges are exactly why Microsoft, WANdisco, and Unravel have created a framework to help you accelerate your migration to Azure.

        See the entire presentation on new strategies for data cloud migration & management
        Watch webinar

        Hadoop Migration Framework

        As enterprises build up their data environment over the years, it becomes increasingly difficult to understand what jobs are running and whether they’re optimized—especially for cost, as there’s a growing recognition that jobs which are not optimized can lead to cloud costs spiraling out of control very quickly—and then tie migration efforts to business priorities.

        Azure Hadoop Migration Framework

        The Microsoft Azure framework for migrating complex on-prem data workloads to the cloud

        Amit laid out the framework that Azure uses when first talking to customers about migration. It starts with an engagement discovery that basically “meets the customer where they are” to identify the starting point for the migration journey. Then they take a data-first approach by looking at four key areas:

        • Environment mapping: This is where Unravel really helps by telling you exactly what kind of services you’re running, where they’re running, which queues they belong to, etc. This translates into a map of how to migrate into Azure, so you have the right TCO from the start. You have a blueprint of what things will look like in Azure and a step-by-step process of where to migrate your workloads.
        • Application migration: With a one-to-one map of on-prem applications to cloud applications in place, Microsoft can give customers a clear sense of how, say, an end-of-life on-prem application can be retired in lieu of a modern application.
        • Workload assessment: Usually customers want to “test the waters” by migrating over one or two workloads to Azure. They are naturally concerned about what their business-critical applications will look like in the cloud. So Microsoft does an end-to-end assessment of the workload to see where it fits in, what needs to be done, and thereby give both the business stakeholder and IT the peace of mind that their processes will not break during migration.
        • Data migration: This is where WANdisco can be very powerful, with your current data estate migrated or replicated over to Azure and your data scientists starting to work on creating more use cases and delivering new insights that drive better business processes or new business streams.

        Then, once all this is figured out, they determine what the business priorities are, what the customer goals are. Amit has found that this framework fits any and all customers and helps them realize value very quickly.

        WANdisco data-first strategy

        Data-first strategy vs. traditional approach to cloud migration

        Data-First Strategy

        A data-first strategy doesn’t mean you need to move your data before doing anything else, but it does mean that having data available in the cloud is critical to quickly gaining value from your target environment.

        Without the data being available in some form, you can’t work with it or take advantage of the capabilities that a cloud environment like Azure offers.

        A data-first migration differs from a traditional approach, which tends to be more application-centric—where entire sets of applications and their data in combination need to be moved before a hard cutover to use in the cloud environment.

        As Paul Scott-Murphy from WANdisco explains, “That type of approach takes much longer to complete and doesn’t allow you to use data while a migration is under way. It also doesn’t allow you to continue to operate the source environment while you’re conducting a migration. So that may be well suited to smaller types of systems—transactional systems and online processing environments—but it’s typically not very well suited to the sort of information data sets and the work done against them from large-scale data platforms built around a data lake.

        “So really what we’re saying here is that the migration of analytics infrastructure from data lake environments on premises like Hadoop and Spark and other distributed environments is extremely well suited to a data-first strategy, and what that strategy means is that your data become available for use in the cloud environment in as short a time as possible, really accelerating the pace with which you can leverage the value and giving you the ability to get faster outcomes from that data.

        “The flexibility it offers in terms of automating the migration of your data sets and allowing you to continue operating those systems while the migration is underway is also really critical to the data-first migration approach.”

        That’s why WANdisco developed its data-first strategy around four fundamental requirements:

        1. The technology must efficiently handle arbitrary volumes of data. Data lakes can span potentially exabyte-scale data systems, and traditional tools for copying static information aren’t going to satisfy the needs of migrating data at scale.
        2. You need a modern approach to support frequent and actively changing data. If you have data at scale, it is always changing—you’re ingesting that data set all the time, you’re constantly modifying information in your data lake.
        3. You don’t want to suffer downtime in your business systems just for the purpose of migrating to the cloud. Incurring additional costs beyond the effort involved in migration is likely to be unacceptable to the business.
        4. You need to validate or have a means of ensuring that your data migrated in full.

        With those elements as the technical basis, WANdisco has developed LiveData Migrator for Azure to support data-first migration. The approach this technology enables is central to getting the most out of your migration effort.

        Check out the case study of how WANdisco took a data-first approach to migrate a global telecom from on-prem Hadoop to Azure
        See case study
        Unravel stages of cloud migration

        Unravel accelerates every stage of migration

        How to Accelerate the Migration Journey

        When teaming up with Microsoft and WANdisco, Unravel splits the entire migration into four stages:

        1. Assess

        This is really about assessing the current environment. Assessing a complex environment can take months—it’s not uncommon to see a 6-month assessment—which gets expensive and is often inaccurate about all the dependencies across all the apps and data and resources and pipelines. Identifying the intricate dependencies is critical to the next phase—we’ve seen how 80% of the time and effort in a migration can be just getting the dependencies right.

        2. Plan

        This is the most important stage. Because if the plan is flawed, with inaccurate or incomplete data, there is no way your migration will be done on time and within budget. It’s not unusual to see a migration have to be aborted partway through and then restarted.

        3. Test, Fix, Verify

        Usually the first thing you ask after a workload has migrated is, Is it correct? Are you getting the same or better performance than you got on-prem? And are you getting that performance at the right cost? Again, if you don’t get it right, the entire cost of migration can run significantly over budget.

        4. Optimize, Manage, Scale

        As more workloads come over to the cloud, scaling becomes a big issue. It can be a challenge to understand the entire environment and its cost, especially around governance. Because if you don’t bring in best practices and set up guardrails from the get-go, it can be very difficult to fix things later on. That can lead to low reliability, or maybe the clusters are auto-scaling and the expenses are much higher than they need to be—actually, a lot of things can go wrong.

        Unravel can help with all of these challenges. And it helps by accelerating each stage. How can you accelerate assessment? Unravel automatically generates an X-ray of the complex on-prem environment, capturing all the dependencies and converting that information into a quick total cost of ownership analysis. That can be then used to justify and prioritize the business to actually move to the cloud pretty quickly. Then you can drill down from the high-level cost estimates into where the big bottlenecks are (and why).

        This then feeds into the planning, where it’s really all about ensuring that the catalog, or inventory, of apps, data, and resources that has been built—along with all the dependencies that have been captured—is converted into a sprint-by-sprint plan for the migration.

        As you go through these sprints, that’s where the testing, fixing, and verifying are done—especially getting a single pane of glass that can show the workloads as they are running on-prem and then the migrated counterparts on the cloud so that correctness, performance, and cost can be easily checked and verified against the baseline.

        Then everything gets magnified as more and more apps are brought over. Unravel provides full-stack observability along with the ability to govern and fix problems. But most important, Unravel can help ensure that these problems in terms of performance, correctness, and cost never happen in the first place.

        Unravel complements Microsoft and WANdisco at each step of the new data-first cloud migration playbook.

        Check out the entire New Cloud Migration Playbook webinar on demand.

        The post The New Cloud Migration Playbook: Strategies for Data Cloud Migration and Management appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/the-new-cloud-migration-playbook-strategies-for-data-cloud-migration-and-management/feed/ 0
        DBS Bank Goes “Beyond Observability” for DataOps https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/ https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/#respond Thu, 03 Mar 2022 18:07:23 +0000 https://www.unraveldata.com/?p=8605 Abstract Chart Background

        At the DataOps Unleashed 2022 conference, Luis Carlos Cruz Huertas, Head of Technology Infrastructure & Automation at DBS Bank, discussed how the bank has developed a framework whereby they translate millions of telemetry data points into […]

        The post DBS Bank Goes “Beyond Observability” for DataOps appeared first on Unravel.

        ]]>
        Abstract Chart Background

        At the DataOps Unleashed 2022 conference, Luis Carlos Cruz Huertas, Head of Technology Infrastructure & Automation at DBS Bank, discussed how the bank has developed a framework whereby they translate millions of telemetry data points into actionable recommendations and even self-healing capabilities.

        Shivnath Babu Observability Quote

        The session, Beyond observability and how to get there, opens with Dr. Shivnath Babu, Unravel Co-Founder and CTO, setting the context for observability challenges in the modern data stack—how simple applications grow to become so complex—and walking through the successive stages of what’s “beyond observability.” How can we go from extracting and correlating events, logs, metrics, and traces from applications, datasets, infrastructure, and tenants to get to self-healing systems?

        evolution of "beyond observability"

        The evolution of “beyond observability” goes from correlation to causation.

        Then Luis shows how DBS is doing just that—how they leverage what he calls “cognitive capabilities” to deliver faster root cause analysis insights and automated corrective actions.

        DBS Bank ATMs

        Why DBS went beyond observability

        Luis Carlos Cruz Huertas Observability Quote

        “When you have an always-on environment where banking applications are fully needed, you come to a point where observability [by itself] doesn’t cut it. You come to a point where your NCI [negative customer impact] truly becomes a key valuable indicator on how your systems are relating. We’re no longer in the game of measuring systems to be able to monitor, we’re measuring systems with the intent to provide a better customer experience,” says Luis.

        Given the complexity of the bank’s IT ecosystem (not just its data stack), DBS made a strategic decision to not focus on tools developed by third-party vendors but rather build an “overarching brain” that could collect and understand the metrics from the diversity of tools in place without forcing the organization to rip and replace for something new. The objective was to speed root cause analysis across the board, provide less “noise,” reduce manual effort (toil), and get to proactive, predictive alerting on emerging issues.

        See how DBS built its “beyond observability” self-healing platform

        Watch on-demand session

        How DBS built its cognitive capability platform

        “You have applications that are collecting different telemetry through different systems or different log collectors—node exporters, metric beats, file beats, you can have an ELK stack. But ultimately what you want to do is create an open platform that you can ingest all this data,” Luis says. And for that you need three elements, he explains:

        • a historical repository, where you can collect and cross-check data
        • a real-time time-series database, because time becomes the de facto metadata to identify a critical incident and its correlations
        • a log aggregator

        Luis notes that one of the things he gets asked constantly is, How do you define the ingestion? 

        “We do it all based on metadata. We define the metadata before it actually gets ingested. And then we park that into the [system] data lake. On top of that we provide an [ML-driven] analytical engine. Ultimately, what our system does is basically provide a recommendation to our site reliability engineers. And it gives them a list of elements, saying I’ve identified this set of errors or incidents that have happened over the last month and are repetitive and continuous. And then our site reliability engineers need to marry that with our automation engine. You build the right scripting—the right automation—to properly fix and remediate the problem. So that every time an incident is identified, it maps Incident A to Automation B to get Outcome C.”

        DBS insights-as-a-service

        DBS turns diverse telemetry data into auto-corrective actions.
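        
        As a purely illustrative sketch of that “Incident A to Automation B to Outcome C” mapping, the Python below keys remediation handlers off an incident’s metadata signature. The signatures, handlers, and outcomes are hypothetical; this is not DBS’s implementation.
        
        # Illustrative incident-to-automation registry. Hypothetical names only.
        def restart_stuck_consumer(incident):
            return f"restarted consumer group {incident['metadata']['consumer_group']}"
        
        def expand_disk_volume(incident):
            return f"expanded volume on host {incident['metadata']['host']}"
        
        AUTOMATION_REGISTRY = {
            "kafka.consumer.lag.high": restart_stuck_consumer,
            "host.disk.usage.critical": expand_disk_volume,
        }
        
        def remediate(incident):
            handler = AUTOMATION_REGISTRY.get(incident["signature"])
            if handler is None:
                return "no automation mapped: route to an SRE for review"
            return handler(incident)
        
        incident = {"signature": "host.disk.usage.critical", "metadata": {"host": "etl-worker-07"}}
        print(remediate(incident))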

        Luis adds that with Unravel, the telemetry data has already been correlated. He says, “Unravel is huge for us. I don’t need to worry about marrying the correlation. I can just consume it right away.”

        Luis concludes: “So in the end, we’re not changing tools, we’re collecting the metrics from all of the tools. We are providing a higher overarching mapping of the data being collected across all the tools in our environment, mapping them through metadata, and leveraging that to provide the right ML.”

        The bottom line? DBS Bank is able to go “beyond observability” and leverage machine learning to get closer to the ultimate goal of a self-healing system that doesn’t require any human intervention at all.

        Luis Carlos Cruz Huertas 3 Elements Quote

        Check out the full presentation from DataOps Unleashed on demand here.

        See the DBS schematics for its Cognitive Technology Services Platform, tools mapping to the architectural components, solution data flows, overview of data sources, and more.

        The post DBS Bank Goes “Beyond Observability” for DataOps appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/dbs-bank-goes-beyond-observability-for-dataops/feed/ 0
        Three Venture Capitalists Weigh In on the State of DataOps 2022 https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/ https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/#respond Thu, 03 Mar 2022 17:55:45 +0000 https://www.unraveldata.com/?p=8616 Abstract Infinity Loop Background

        The keynote presentation at DataOps Unleashed 2022 featured a roundtable panel discussion on the State of DataOps and the Modern Data Stack. Moderated by Unravel Data Co-Founder and CEO Kunal Agarwal, this session features insights from […]

        The post Three Venture Capitalists Weigh In on the State of DataOps 2022 appeared first on Unravel.

        ]]>
        Abstract Infinity Loop Background

        The keynote presentation at DataOps Unleashed 2022 featured a roundtable panel discussion on the State of DataOps and the Modern Data Stack. Moderated by Unravel Data Co-Founder and CEO Kunal Agarwal, this session features insights from three investors who have a unique vantage point on what’s changing and emerging in the modern data world, the effects of these changes, and the opportunities being created.

        The panel’s three venture capitalists—Matt Turck, Managing Director at FirstMark; Venky Ganesan, Partner at Menlo Ventures; and Glenn Solomon, Managing Partner at GGV Capital—all invest in companies that are both users and creators of the modern data stack. They are either crunching massive amounts of data and converting it into insights or are helping other companies do that at scale.

        DataOps Unleashed 2022 keynote speakers

        Today every company is a data company. Data pipelines, AI, and machine learning models are creating tremendous strategic value, and companies depend on them more than ever before. Now, while the advantages of becoming data-driven are clear, what’s often hard to grasp is what’s changing in this landscape.

         The discussion covered a broad range of topics, loosely revolving around a couple of key areas. Here are just a handful of the interesting observations and insights they shared.

        What are the top data industry macro-trends?

        Glenn Solomon: I don’t think companies are nearly as advanced as you’re likely to believe. Most companies are in the early innings of trying to figure out how to manage data. We see that even with born-digital companies. And the complexity is compounded by the fact that there is a lot of noise in the startup universe. Figuring out the decisions you make as a company is difficult and challenging. So I think this best-of-breed vs. platform balancing act that companies had to go through in software is also going to happen in the data stack.

        Matt Turck: The big driver is the rise of modern cloud data warehouses, and the lake houses as well. So for me, that’s been the big unlock in the space. We finally have that base level in the data hierarchy of needs, where we can take all this data, put it somewhere, and actually process it at scale. Now this whole thing is becoming very real, no longer experimental. Now the whole thing needs to work.

        Venky Ganesan: The digital transformation that was happening just got super accelerated by the pandemic. All these analog business processes were digitized. And now that they are digitized, they can be tracked, stored, analyzed, evaluated and acted upon. I think the data stack has got to be one of the most important stacks in a company because your success long term is going to be based on how good is your data stack? How good is your DataOps? And then how do you build the analytics on top of it?

        What trends are you seeing within DataOps specifically?

        Venky Ganesan DataOps Quote

        Venky: I would say the biggest trend I’m seeing is pushing these data workloads to the cloud. And I think it’s a really interesting game-changer. One of the things we are seeing now as we move to the cloud is suddenly you can separate out the storage and compute, have the infrastructure handle it, and then have the data warehouses such as Snowflake, Databricks. Now there are new sets of problems that come into play around DataOps when you move the data to Snowflake or Databricks or any other cloud providers, which is that you need to still understand the workloads, still need to optimize them. I think there’s going to be a whole DataOps category that helps you both migrate workloads to the cloud and also monitor them, because you can’t have the fox guarding the henhouse, you need some third party there to help you make sure you’re optimizing the workload, because the cloud provider is not interested in optimizing the workload for efficiency.

        Glenn Solomon DataOps Quote

        Glenn: A driver that I’m seeing accelerate, and gain momentum, is quite simply just the need for speed. In organizations there’s a tremendous amount of momentum around real-time streaming, real time analytics. Companies are growing the number of business processes for which they want real-time data to make decisions. And that is having a big impact on this whole world.

Another trend I’d point out is the rise of open source and the impact open source is having on many, many areas within the DataOps world. Kafka, for example, has had a massive impact on streaming. That shows me that open source can really standardize markets. It can standardize technologies and standardize workflows. We’ll have to see how this all plays out, but I think open source—when it works well—is a very, very powerful trend.

        Watch full panel discussion: The State of DataOps and the Modern Data Stack

        Watch session on demand

        Moving data to the cloud—why or why not?

        Venky: I think if you are a company that has data on premises, you just have numerous issues. On prem is very heterogeneous: heterogeneous in hardware, heterogeneous in environment. And in a world of labor shortages, what happens if you don’t have the people? If you have the kind of turnover you’re seeing, can you get the people to manage it? So if you don’t move to the cloud, you’re going to be trapped on an island with fewer and fewer resources that cost more and more.

But I actually think the most important part of moving into the cloud is that it gives you an opportunity to standardize data, think about the data you want. And then once you move it to the cloud, you can unlock new generations of AI technologies that come into the cloud and allow you to get more insights from data. And so to me, eventually, data is worthless if it doesn’t translate to insights. Your best way of getting that insight is to figure out a scalable, cheaper way of moving to the cloud and then unlock a lot of the new AI techniques to get insight from it.

Glenn: I think we’re on a continuum where there are still lots of companies who are reticent to move all their data to the cloud. I think the view is, hey, we have regulatory obligations and there’s risk if we don’t manage things ourselves. Even for data that is viewed as too sensitive, too risky, or too valuable to move to the cloud, it’s just a matter of time. The value that can be driven from having data up in the cloud is just too great. But how do you safely move data into the cloud? And then once it’s there, how do you manage the applications that consume that data in a way that is rational? And if you want to use [cloud services] rationally, and use them the right way, and in a cost-efficient way, then you really do need other tools to make sure that things don’t run away from you.

Matt Turck DataOps Quote

Matt: For many years, there was an almost cognitive dissonance, where everything I read, and all the conversations I was having with execs and people in the industry, was all cloud, cloud, cloud. But our customers all wanted to be on prem—actually, zero people wanted to be in the cloud. It feels like in the last year and a half that cognitive dissonance has disappeared, and suddenly I started seeing all these customers, almost all at once—and the pandemic certainly accelerated all this—saying, okay, now is the time I will move those workloads and the data to the cloud. So it feels like there’s been an inflection point of some sort. It is very anecdotal, I realize, but it is very, very clear.

        The only nuance to all of this is I think there’s a little bit of a growing realization and concern around the cost of being in the cloud. When you start in the cloud, you actually save a lot of money. Once you’ve configured your organization to actually run in the cloud, you save a lot of money for a while. But then there’s a moment when it starts actually being pretty expensive. And I think that’s a problem that’s starting to come to the fore. 

        What’s the impact of the talent shortage?

        Glenn: It’s very difficult to amass the kind of talent you need to really effectively both manage the data and then ultimately evaluate and analyze it for good purpose in your business. If you split the world into managing data—DataOps and the data engineer and all the challenges and complexities there, where there’s definitely a labor shortage—and then analyzing the data, getting value from it (data scientists up through business analysts), where there’s also a shortage—we have a human problem in both. One of my colleagues used the term “unbundling the data engineer.” If you look at all the tasks a data engineer would need to do to get a well-functioning data stack in place, there just aren’t enough of those people. Companies are picking off and automating various aspects of that workflow.

        But on the other side, on analyzing the data, I think there are a lot of interesting things to be done there. How do you make data scientists, because there aren’t enough of them, more efficient? What tools and technologies do they need? I think we’ll see more solutions on the analysis side because we have that same human capital problem there too.

Matt: It’s an obvious problem that is only going to get worse, because the rate that we as a society produce technical people—data engineers, ML engineers, data scientists—is nowhere near the pace we need to meet the demand. Again, every company is becoming not just a software company but a data company. That has two consequences. One, we need products and platforms that abstract away the complexity. That’s empowering people who are somewhat technical but not engineers to do more and more of what’s needed to make the whole machinery work. And the second, related consequence is the rise of automation—making a lot of those technologies just work in a way where no human is required. There’s plenty of opportunity there, especially AI-driven automation of system optimization, tuning, anomaly detection, auto-repair, and the like.

Venky: Whether we’re talking about DataOps or security, these are things that will get automated at scale. It won’t replace humans. They will be complemented by technology that does most of the mundane stuff, and humans will deal with the exceptions. The mundane stuff gets done automatically, the exceptions get kicked to humans. That’s the only way forward. There’s no way to build the human capital required.

Watch the entire panel discussion on demand here. Hear more war stories, anecdotes, and expert insights, including predictions for the coming year.

        The post Three Venture Capitalists Weigh In on the State of DataOps 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/three-venture-capitalists-weigh-in-on-the-state-of-dataops-2022/feed/ 0
Top Cloud Data Migration Challenges and How to Fix Them https://www.unraveldata.com/resources/cloud-migration-challenges-and-fixes/ https://www.unraveldata.com/resources/cloud-migration-challenges-and-fixes/#respond Wed, 02 Feb 2022 18:26:07 +0000 https://www.unraveldata.com/?p=8383

        We recently sat down with Sandeep Uttamchandani, Chief Product Officer at Unravel, to discuss the top cloud data migration challenges in 2022. No question, the pace of data pipelines moving to the cloud is accelerating. But […]

        The post Top Cloud Data Migration Challenges and How to Fix Them appeared first on Unravel.

        ]]>

        We recently sat down with Sandeep Uttamchandani, Chief Product Officer at Unravel, to discuss the top cloud data migration challenges in 2022. No question, the pace of data pipelines moving to the cloud is accelerating. But as we see more enterprises moving to the cloud, we also hear more stories about how migrations went off the rails. One report says that 90% of CIOs experience failure or disruption of data migration projects due to the complexity of moving from on-prem to the cloud. Here are Dr. Uttamchandani’s observations on the top obstacles to ensuring that data pipeline migrations are successful.

        1. Knowing what should migrate to the cloud

        The #1 challenge is getting an accurate inventory of all the different workloads that you’re currently running on-prem. The first question to be answered when you want to migrate data applications or pipelines is, Migrate what?

        In most organizations, this is a highly manual, time-consuming exercise that depends on tribal knowledge and crowd-sourcing information. It’s basically going to every team running data jobs and asking what they’ve got going. Ideally, there would be a well-structured, single centralized repository where you could automatically discover everything that’s running. But in reality, we’ve never seen this. Virtually every enterprise has thousands of people come and go over the years—all building different dashboards for different reasons, using a wide range of technologies and applications—so things wind up being all over the place. The more brownfield the deployment, the more probable that workloads are siloed and opaque.

        It’s important to distinguish between what you have and what you’re actually running. Sandeep recalls one large enterprise migration where they cataloged 1200+ dashboards but discovered that some 40% of them weren’t being used! They were just sitting around—it would be a waste of time to migrate them over. The obvious analogy is moving from one house to another: you accumulate a lot of junk over the years, and there’s no point in bringing it with you to the new place. 

        single pane of glass view of data pipelines

        Understanding exactly what’s actually running is the first step to cloud migration.
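To make the have-vs.-run distinction concrete, here is a minimal sketch that flags inventory items that haven’t executed recently. The dashboard names, the last_run field, and the 90-day threshold are all illustrative assumptions, not a prescription for any particular catalog or scheduler.

```python
from datetime import datetime, timedelta

# Hypothetical inventory pulled from a BI catalog or scheduler export.
# Field names ("name", "last_run") are illustrative assumptions.
dashboards = [
    {"name": "daily_revenue", "last_run": datetime(2022, 1, 28)},
    {"name": "legacy_churn_report", "last_run": datetime(2020, 6, 3)},
    {"name": "ops_latency", "last_run": datetime(2022, 1, 30)},
]

# Treat anything not executed in the last 90 days as a "review before migrating" item.
cutoff = datetime(2022, 2, 1) - timedelta(days=90)

active = [d for d in dashboards if d["last_run"] >= cutoff]
stale = [d for d in dashboards if d["last_run"] < cutoff]

print(f"Migrate now: {[d['name'] for d in active]}")
print(f"Review before migrating: {[d['name'] for d in stale]}")
```

In practice the inventory itself is the hard part; once it exists, separating active from dormant workloads is straightforward bookkeeping.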

        How Unravel helps identify what to migrate

        Unravel’s always-on full-stack observability capabilities automatically discover and monitor everything that’s running in your environment, in one single view. You can zero in on what jobs are actually operational, without having to deal with all the clutter of what’s not. If it’s running, Unravel finds it—and learns as it goes. If we apply the 80/20 rule, that means Unravel finds 80% of your workloads immediately because you’re running them now; the other 20% get discovered when they run every month, three months, six months, etc. Keep Unravel on and you’ll automatically always have an up-to-date inventory of what’s running.

        How Unravel helps cloud migrations

        Create a free account

        2. Determining what’s feasible to migrate to the cloud

        After you know everything that you’re currently running on-premises, you need to figure out what to migrate. Not everything is appropriate to run in the cloud. But identifying which workloads are feasible to run in the cloud and which are not is no small task. Assessing whether it makes sense to move a workload to the cloud normally requires a series of trial-and-error experimentation. It’s really a “what if” analysis that involves a lot of heavy lifting. You take a chunk of your workload, make a best-guess estimate on the proper configuration, run the workload, and see how it performs (and at what cost). Rinse and repeat, balancing cost and performance.

        data pipeline workflow analysis

        Look before you leap into the cloud—gather intelligence to make the right choices.
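The trial-and-error loop described above can be framed as a simple search over candidate configurations. The sketch below is a minimal illustration, assuming you already have (or can estimate) runtime and hourly price for each candidate; the instance names, prices, and SLA are made-up figures, and this is not how Unravel performs its what-if analysis.

```python
# Minimal what-if sketch: score candidate cloud configurations for one workload.
candidates = [
    {"instance": "4x m5.xlarge",  "est_runtime_hr": 3.0, "price_per_hr": 0.768},
    {"instance": "2x m5.2xlarge", "est_runtime_hr": 2.4, "price_per_hr": 0.768},
    {"instance": "2x r5.2xlarge", "est_runtime_hr": 1.9, "price_per_hr": 1.008},
]

sla_hours = 2.5  # assumed SLA for this workload

# Keep only configurations that meet the SLA, then pick the cheapest per run.
feasible = [c for c in candidates if c["est_runtime_hr"] <= sla_hours]
for c in feasible:
    c["est_cost"] = c["est_runtime_hr"] * c["price_per_hr"]

best = min(feasible, key=lambda c: c["est_cost"])
print(f"Cheapest config meeting the SLA: {best['instance']} "
      f"(~${best['est_cost']:.2f} per run)")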

         

        How Unravel helps determine what to move

        With intelligence gathered from its underlying data model and enriched with experience managing data pipelines at large scale, Unravel can perform this what-if analysis for you. It’s more directionally accurate than any other technique (other than moving things to the cloud and seeing what happens). You’ll get a running head start into your feasibility assessment, reducing your configuration tuning efforts significantly.

        3. Defining a migration strategy 

        Once you determine whether moving a particular workload is a go/no-go from a cost and business feasibility/agility perspective, the sequence of moving jobs to the cloud—what to migrate when—must be carefully ordered. Workloads are all intertwined, with highly complex interdependencies. 

        complex dependencies of modern data pipelines

        Source: Sandeep Uttamchandani, Medium: Wrong AI, How We Deal with Data Quality Using Circuit Breakers 

        You can’t simply move any job to the cloud randomly, because it may break things. That job or workload may depend on data tables that are potentially sitting back on-prem. The sequencing exercise is all about how to carve out very contained units of data and processing that you can then move to the cloud, one pipeline at a time. You obviously can’t migrate the whole enchilada at once—moving hundreds of petabytes of data in one go is impossible—but understanding which jobs and which tables must migrate together can take months to figure out.
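One way to reason about that sequencing is to treat jobs and the tables they read or write as a dependency graph, then carve out units that can move together. The sketch below uses Python’s graphlib on a hypothetical dependency map; real pipelines have far messier graphs, and the job names here are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical "depends on" map: each job lists the upstream jobs whose output it reads.
dependencies = {
    "ingest_orders": set(),
    "clean_orders": {"ingest_orders"},
    "build_customer_dim": {"ingest_orders"},
    "daily_revenue_report": {"clean_orders", "build_customer_dim"},
}

# A topological order gives one valid migration sequence that never moves a job
# before the upstream jobs (and tables) it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print("Candidate migration order:", order)
```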

        How Unravel helps you define your migration strategy

        Unravel gives you a complete understanding of all dependencies in a particular pipeline. It maps dependencies between apps, datasets, users, and resources so you can easily analyze everything you need to migrate to avoid breaking a pipeline.

        4. Tuning data workloads for the cloud

Once you have workloads migrated, you need to optimize them for this new (cloud) environment. Nothing works out of the box. Getting everything to work effectively and efficiently in the cloud is a challenge. The same Spark query, the same big data program that you have written, will need to be tuned differently for the cloud. This is a function of complexity: the more complex your workload, the more likely it will need to be tuned specifically for the cloud. This is because the model is fundamentally different. On-prem is a small number of large clusters; the cloud is a large number of small clusters.

This is where two different philosophical approaches to cloud migration come into play: lift & shift vs. lift & modernize. While lift & shift simply takes what you have running on-prem and moves it to the cloud, lift & modernize essentially means first rewriting workloads or tuning them so that they are optimized for the cloud before you move them over. It’s like a dirt road that gets damaged with potholes and gullies after a heavy rain. You can patch it up time after time, or you can pave it to get a “new” road.

        But say you have migrated 800 workloads. It would take months to tune all 800 workloads—nobody has the time or people for that. So the key is to prioritize. Tackle the most complex workloads first because that’s where you get “more bang for your buck.” That is, the impact is worth the effort. If you take the least complex query and optimize it, you’ll get only a 5% improvement. But tuning a 60-hour query to run in 6 hours has a huge impact.
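A crude way to pick that tuning order is to rank workloads by potential hours saved. The numbers below are invented for illustration; in practice the expected improvement would come from profiling or an analysis tool rather than a guess.

```python
# Rank workloads by estimated impact of tuning (hours saved per run x runs per month).
workloads = [
    {"name": "nightly_etl",   "runtime_hr": 60,  "expected_speedup": 0.90, "runs_per_month": 30},
    {"name": "ad_hoc_query",  "runtime_hr": 0.2, "expected_speedup": 0.05, "runs_per_month": 400},
    {"name": "weekly_rollup", "runtime_hr": 8,   "expected_speedup": 0.50, "runs_per_month": 4},
]

for w in workloads:
    w["hours_saved_per_month"] = w["runtime_hr"] * w["expected_speedup"] * w["runs_per_month"]

# Tackle the biggest savings first.
for w in sorted(workloads, key=lambda w: w["hours_saved_per_month"], reverse=True):
    print(f"{w['name']}: ~{w['hours_saved_per_month']:.0f} cluster-hours saved per month")
```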

        AI recommendations for data pipelines

        AI-powered recommendations show where applications can be optimized.

        How Unravel helps tune data pipelines

Unravel has a complexity report that shows you at a glance which workloads are complex and which are not. Then Unravel helps you tune complex cloud data pipelines in minutes instead of hours (maybe even days or weeks). Because it’s designed specifically for modern data pipelines, you have immediate access to all the observability data about every data job you’re running at your fingertips without having to stitch together logs, metrics, traces, APIs, platform UI information, etc., manually.

        But Unravel goes a step further. Instead of simply presenting all this information about what’s happening and where—and leaving you to figure out what to do next—Unravel’s “workload-aware” AI engine provides actionable recommendations on specific ways to optimize performance and costs. You get pinpoint how-to about code you can rewrite, configurations you can tweak, and more.

        Cloud migration made much, much easier

        Create a free account

        5. Keeping cloud migration costs under control

The moment you move to the cloud, you’ll start burning a hole in your wallet. Given the on-demand nature of the cloud, most organizations lose control of their costs. When first moving to the cloud, almost everyone spends more than budgeted—sometimes as much as 2X. So you need some sort of good cost governance. One approach to managing costs is to simply turn off workloads when you threaten to run over budget, but this doesn’t really help. At some point, those jobs need to run. Most overspending is due to overprovisioned cloud resources, so a better approach is to configure instances based on actual usage requirements rather than on perceived need.

        How Unravel helps keep cost under control

        Because Unravel has granular, job-level intelligence about what resources are actually required to run individual workloads, it can identify where you’ve requested more or larger instances than needed. Its AI recommendations automatically let you know what a more appropriately “right-sized” configuration would be. Other so-called cloud cost management tools can tell you how much you’re spending at an aggregated infrastructure level—which is okay—but only Unravel can pinpoint at a workload level exactly where and when you’re spending more than you have to, and what to do about it.

        6. Getting data teams to think differently

        Not so much a technology issue but more a people and process matter, training data teams to adopt a new mind-set can actually be a big obstacle. Some of what is very important when running data pipelines on-prem—like scheduling—is less crucial in the cloud. That running one instance for 10 hours is the same as running 10 instances for one hour is a big change in the way people think. On the other hand, some aspects assume greater importance in the cloud—like having to think about the number and size of instances to configure.

        Some of this is just growing pains, like moving from a typewriter to a word processor. There was a learning curve in figuring out how to navigate through Word. But over time as people got more comfortable with it, productivity soared. Similarly, training teams to ramp up on running data pipelines in the cloud and overcoming their initial resistance is not something that can be accomplished with a quick fix.

        How Unravel helps clients think differently

        Unravel helps untangle the complexity of running data pipelines in the cloud. Specifically, with just the click of a button, non-expert users get AI-driven recommendations in plain English about what steps to take to optimize the performance of their jobs, what their instance configurations should look like, how to troubleshoot a failed job, and so forth. It doesn’t require specialized under-the-hood expertise to look at charts and graphs and data dumps to figure out what to do next. Delivering actionable insights automatically makes managing data pipelines in the cloud a lot less intimidating.

        Next steps

        Check out our 10-step strategy to a successful move to the cloud in Ten Steps to Cloud Migration, or create a free account and see for yourself how Unravel helps simplify the complexity of cloud migration.

        The post Top Cloud Data Migration Challenges and How to Fix Them appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/cloud-migration-challenges-and-fixes/feed/ 0
A Better Approach to Controlling Modern Data Cloud Costs https://www.unraveldata.com/resources/a-better-approach-to-controlling-modern-data-cloud-costs/ https://www.unraveldata.com/resources/a-better-approach-to-controlling-modern-data-cloud-costs/#respond Fri, 14 Jan 2022 20:09:12 +0000 https://www.unraveldata.com/?p=8275

        As anyone running modern data applications in the cloud knows, costs can mushroom out of control very quickly and easily. Getting these costs under control is really all about not spending more than you have to. […]

        The post A Better Approach to Controlling Modern Data Cloud Costs appeared first on Unravel.

        ]]>

        As anyone running modern data applications in the cloud knows, costs can mushroom out of control very quickly and easily. Getting these costs under control is really all about not spending more than you have to. Unfortunately, the common approach to managing these expenses—which looks at things only at an aggregated infrastructure level—helps control only about 5% of your cloud spend. You’re blind to the remaining 95% of cost-saving opportunities because there’s a huge gap in your ability to understand exactly what all your applications,  pipelines, and users are doing, how much they’re costing you (and why), and whether those costs can be brought down. 

        Controlling cloud costs has become a business imperative for any enterprise running modern data stack applications. Industry analysts estimate that at least 30% of cloud spend is “wasted” each year—some $17.6 billion. For modern data pipelines in the cloud, the percentage of waste is much higher—closer to 50%. 

Because cloud providers have made it so much easier to spin up new instances, it’s also much easier for costs to spiral out of control. The sheer size of modern data workloads amplifies the problem exponentially. We recently saw a case where a single sub-optimized data job that ran over the weekend wound up costing the company an unnecessary $1 million. But the good news is that there are a lot of opportunities for you to save money.

        Every data team and IT budget holder recognizes at an abstract, theoretical level what they need to do to control data cloud costs:

        • Shut off unused always-on resources (idle clusters)
        • Leverage spot instance discounts
        • Optimize the configuration of auto-scaling data jobs

        Understanding what to do in principle is easy. Knowing where, when, and how to do it in practice is far more complicated. And this is where the common cost-management approach for modern data clouds falls short. 

        What’s needed is a new approach to controlling modern data cloud costs—a “workload aware” approach that harnesses precise, granular information at the application level to develop deep intelligence about what workloads you are running, who’s running them, which days and time of day they run, and, most important, what resources are actually required for each particular workload. 

        Pros and Cons of Existing Cloud Cost Management

        Allocating Costs—Okay, Not Great

        The first step to controlling costs is, of course, understanding where the money is going. Most approaches to cloud cost management today focus primarily on an aggregated view of the total infrastructure expense: monthly spend on compute, storage, platform, etc. While this is certainly useful from a high-level, bird’s-eye view for month-over-month budget management, it doesn’t really identify where the cost-saving opportunities lie. Tagging resources by department, function, application, user etc., is a good first step in understanding where the money is going, but tagging alone doesn’t help you understand if the cost is good or bad—and whether it could be brought down (and how). 

        high-level cloud cost summary dashboard

        Here’s a cloud budget management dashboard that does a good job of summarizing at a high level where cloud spend is going. But where could costs be brought down?

        Native cloud vendor tools (and even some third-party solutions) can break down individual instance costs—how much you’re spending on each machine—but don’t tell you which jobs or workflows are running there. It’s just another view of the same unanswered question: Are you overspending?

        Eliminating Idle Resources—Good

        When it comes to shutting down idle clusters, all the cloud platform providers (AWS, Azure, Google Cloud Platform, Databricks), as well as a host of so-called cloud cost management tools from third-party vendors, do a really good job of identifying resources you’re not using anymore and can terminate them automatically.

        Reducing Costs—Poor

Where many enterprises’ approach leaves a lot to be desired is resource and application optimization. This is where you can save the big bucks—reducing costs with spot instances and auto-scaling, for example—but it requires a new approach.

        Check out the Unravel approach to controlling cloud costs

        Create a free account

        A Smarter “Workload-Aware” Approach

        An enterprise with 100,000+ modern data jobs has literally a million decisions—at the application, pipeline, and cluster level—to make about where, when, and how to run those jobs. And each individual decision carries a price tag. 

        Preventing Overspending

        The #1 culprit in data cloud overspending is overprovisioning/underutilizing instances. 

        Every enterprise has thousands and thousands of jobs that are running on more expensive instances than necessary. Either more resources are requested than actually needed or the instances are larger than needed for the job at hand. Frequently the resources allocated to run jobs in the cloud are dramatically different than what is actually required, and those requirements can vary greatly depending upon usage patterns or seasonality. Without visibility into the actual resource requirements of each job over time, it is just a guessing game which level of machine to allocate in the cloud.  

        That’s why a more effective approach to controlling cloud costs must start at the job level. Whichever data team members are requesting instances must be empowered with accurate, easy-to-understand metrics about actual usage requirements—CPU, memory, I/O, duration, containers, etc.—so that they can specify right-sized configurations based on actual utilization requirements rather than having to “guesstimate” based on perceived capacity need.

        For example, suppose you’re running a Spark job and estimate that you’ll need six containers with 32 GB of memory. That’s your best guess, and you certainly don’t want to get “caught short” and have the job fail. But if you had application-level information at your fingertips that in reality you need only three containers that use only 5.2 GB of memory (max), you could request a right-sized instance configuration and avoid unnecessarily overspending. Instead of paying, say, $1.44 for a 32GB machine (x6), you pay only $0.30 for a 16GB machine (x3). Multiply the savings across several hundred users running hundreds of thousands of jobs every month, and you’re talking big bucks.

App-level insight into actual usage

        Application-level intelligence about actual utilization allows you specify with precision the number and size of instances really needed for the job at hand.
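To make the arithmetic from the Spark example above concrete, here is a minimal sketch of the savings calculation. The helper function and per-instance prices are assumptions chosen to match the illustrative figures in the text ($0.24/hr for a 32 GB machine, $0.10/hr for a 16 GB one), not any provider’s actual rates.

```python
def hourly_cost(num_instances: int, price_per_instance_hr: float) -> float:
    """Cost of keeping a set of identical instances up for one hour."""
    return num_instances * price_per_instance_hr

# Guesstimated request vs. a configuration sized from observed usage.
requested = hourly_cost(num_instances=6, price_per_instance_hr=0.24)    # 6 x 32 GB
right_sized = hourly_cost(num_instances=3, price_per_instance_hr=0.10)  # 3 x 16 GB

savings = requested - right_sized
print(f"Requested: ${requested:.2f}/hr, right-sized: ${right_sized:.2f}/hr, "
      f"savings: ${savings:.2f}/hr ({savings / requested:.0%})")
```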

Further, a workload-aware approach with application-level intelligence can also tell you how long the job ran.

Job-level insight into actual duration

        Knowing the duration of each job empowers you to shut down clusters based on execution state rather than waiting for an idle state.

        Here, you can see that this particular job took about 6½ minutes to run. With this intelligence, you know exactly how long you need this instance. You can tell when you can utilize the instance for another job or, even better, shut it down altogether.

        Leveraging Spot Instances 

        This really helps with determining when to use spot instances. Everybody wants to take advantage of spot instance pricing discounts. While spot instances can save you big bucks—up to 90% cheaper than on-demand pricing—they come with the caveat that the cloud provider can “pull the plug” and terminate the instance with as little as a 30-second warning. So, not every job is a good candidate to run on spot. A simple SQL query that takes 10-15 seconds is great. A 3-hour job that’s part of a larger pipeline workflow, not so much. Nobody wants their job to terminate after 2½ hours, unfinished. The two extremes are pretty easy yes/no decisions, but that leaves a lot of middle ground. Actual utilization information at the job level is needed to know what you can run on spot (and, again, how many and what size instances) without compromising reliability. One of the top questions cloud providers hear from their customers is how to know what jobs can be run safely on spot instances. App-level intelligence that’s workload-aware is the only way.
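A simple heuristic version of that decision might look like the sketch below, which flags short, retry-tolerant jobs as spot candidates. The threshold and job attributes are assumptions for illustration; a workload-aware tool would base them on observed run history rather than fixed rules.

```python
def spot_candidate(duration_minutes: float, retry_tolerant: bool,
                   max_spot_minutes: float = 30.0) -> bool:
    """Rough heuristic: short jobs that can safely be re-run are good spot candidates."""
    return retry_tolerant and duration_minutes <= max_spot_minutes

# Hypothetical jobs with invented durations and retry tolerance.
jobs = [
    {"name": "quick_sql_lookup",   "duration_minutes": 0.25, "retry_tolerant": True},
    {"name": "3hr_pipeline_stage", "duration_minutes": 180,  "retry_tolerant": False},
    {"name": "hourly_aggregation", "duration_minutes": 12,   "retry_tolerant": True},
]

for job in jobs:
    verdict = "spot" if spot_candidate(job["duration_minutes"], job["retry_tolerant"]) else "on-demand"
    print(f"{job['name']}: run on {verdict}")
```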

        Auto-Scaling Effectively

This granular app-level information can also be leveraged at the cluster level to know with precision how, when, and where to take advantage of auto-scaling. One of the main benefits of the cloud is the ability to auto-scale workloads, spinning up (and down) resources as needed. But this scalability isn’t free. You don’t want to spend more than you absolutely have to. A heatmap like the one below shows how many jobs are running at each time of the day on each day of the week (or month or whatever cadence is appropriate), with drill-down capabilities into how much memory and CPU is needed at any given point in time. Instead of guesstimating configurations based on perceived capacity needs, you can precisely specify resources based on actual usage requirements. Once you understand what you really need, you can tell what you don’t need.

        cluster utilization heatmap

        To take advantage of cost savings with auto-scaling, you have to know exactly when you need max capacity and for how long—at the application level.
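A basic version of such a heatmap can be built from job-start timestamps alone, as in the pandas sketch below. The timestamps here are synthetic, and a real analysis would also fold in per-job CPU and memory rather than just job counts.

```python
import pandas as pd

# Synthetic job-start timestamps; in practice these would come from scheduler logs.
starts = pd.to_datetime([
    "2022-01-10 02:15", "2022-01-10 02:40", "2022-01-10 14:05",
    "2022-01-11 02:20", "2022-01-12 02:10", "2022-01-12 18:30",
])
df = pd.DataFrame({"start": starts})
df["day"] = df["start"].dt.day_name()
df["hour"] = df["start"].dt.hour

# Rows = day of week, columns = hour of day, values = number of jobs started.
heatmap = df.pivot_table(index="day", columns="hour", values="start",
                         aggfunc="count", fill_value=0)
print(heatmap)
```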

        Unravel is full-stack “workload-aware”

        Create a free account

        Conclusion

        Most cloud cost management tools on the market today do a pretty good job of aggregating cloud expenses at a high level, tracking current spend vs. budget, and presenting cloud provider pricing options (i.e., reserved vs. on-demand vs. spot instances). Some can even go a step further and show you instance-level costs, what you’re spending machine by machine. While this approach has some (limited) value for accounting purposes, it doesn’t help you actually control costs.

When talking about “controlling costs,” most budget holders are referring to the bottom-line objective: reducing costs, or at least not spending more than you have to. But for that, you need more than just aggregated information about infrastructure spend; you need application-level intelligence about all the individual jobs that are running, who’s running them (and when), and what resources each job is actually consuming compared to what resources have been configured.

        As much as 50% of data cloud instances are overprovisioned. When you have 100,000s of jobs being run by thousands of users—most of whom “guesstimate” auto-scaling configurations based on perceived capacity needs rather than on actual usage requirements—it’s easy to see how budgets blow up. Trying to control costs by looking at the overall infrastructure spend is impossible. Only by knowing with precision at a granular job level what you do need can you understand what you don’t need and trim away the fat. 

        To truly control data cloud costs, not just see how much you spent month over month, you need to tackle things with a bottom-up, application-level approach rather than the more common top-down, infrastructure-level method.  

         

        See what a “workload-aware” approach to controlling modern data pipelines in the cloud looks like. Create a free account.

        The post A Better Approach to Controlling Modern Data Cloud Costs appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/a-better-approach-to-controlling-modern-data-cloud-costs/feed/ 0
DataOps Unleashed Returns, Even Bigger and Better https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/ https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/#respond Thu, 13 Jan 2022 17:00:18 +0000 https://www.unraveldata.com/?p=8253

        DataOps Unleashed is back! Founded and hosted by Unravel Data, DataOps Unleashed is a free, full-day virtual event, that took place on Wednesday, February 2, that brings together DataOps, CloudOps, AIOps, MLOps, and other professionals to […]

        The post DataOps Unleashed Returns, Even Bigger and Better appeared first on Unravel.

        ]]>

DataOps Unleashed is back! Founded and hosted by Unravel Data, DataOps Unleashed is a free, full-day virtual event that took place on Wednesday, February 2, bringing together DataOps, CloudOps, AIOps, MLOps, and other professionals to connect and collaborate on trends and best practices for running, managing, and monitoring data pipelines and data-intensive workloads. Once again, thousands of your peers signed up for what turned out to be the most high-impact big data summit of the year.

        DataOps Unleashed 2022 registration

        All DataOps Unleashed sessions now available on demand

        Check out sessions here

A combination of industry thought leadership and practical guidance, sessions included talks by DataOps professionals at such leading organizations as Slack, Zillow, Cisco, and IBM (and dozens of others), detailing how they’re establishing data predictability, increasing reliability, and reducing costs.

        Here’s just a taste of what DataOps Unleashed 2022 covered.

• In the keynote address, the state of DataOps and the modern data stack, Unravel CEO Kunal Agarwal moderates a roundtable panel discussion with three prominent venture capitalists (Matt Turck from FirstMark, Glenn Solomon from GGV Capital, Venky Ganesan from Menlo Ventures) to cut through the hype and talk about what they’re seeing in reality within the DataOps world. They talk about macro-trends, how DataOps is emerging, the pros and cons of moving data workloads to the cloud, how the industry is dealing with the talent shortage, and more.
        • Ryan Kinney, Senior Data Engineer at Slack, shares how his team has streamlined its DataOps stack with Slack to not only collaborate, but to observe their data pipelines and orchestration, enforce data quality and governance, manage their CloudOps, and unlock the entire data science and analytics platform for their customers and stakeholders.
        • See how Torsten Steinbach, Cloud Data Architect at IBM, has incorporated different open tech into a state-of-the-art cloud-native data lakehouse platform. Torsten shares practical tips on table formats for consistency, metastores and catalogs for usability, encryption for data protection, data skipping indexes for performance, and more.
• Michael DePrizio, Principal Architect at Akamai, and Jordan Tigani, Chief Product Officer at SingleStore, discuss how Akamai handled 13x data growth and moved from batch to near-real-time visibility and analytics.
        • Shivnath Babu, Co-Founder, Chief Technology Officer, and Head of Customer Success at Unravel, discusses what’s beyond observability and how to get there. Luis Carlos Cruz Huertas, Head of Technology Infrastructure and Automation at DBS Bank, then walks through how DBS has gone “beyond observability” to build a framework for automated root cause analysis and auto-corrections.

        See the entire lineup and agenda here.

        Check out the recorded DataOps Unleashed 2022 sessions today!

        The post DataOps Unleashed Returns, Even Bigger and Better appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-unleashed-returns-even-bigger-and-better/feed/ 0
DataOps Unleashed 2022 Keynote https://www.unraveldata.com/resources/dataops-unleashed-keynote/ https://www.unraveldata.com/resources/dataops-unleashed-keynote/#respond Tue, 11 Jan 2022 22:28:27 +0000 https://www.unraveldata.com/?p=8245

        Unravel Data CEO to Keynote Second “DataOps Unleashed” Virtual Event on February 2, 2022 Peer-to-Peer Event Dedicated to Helping Data Professionals Untangle the Complexity of Modern Data Clouds and Simplify their Data Operations WHAT: Unravel Data […]

        The post DataOps Unleashed 2022 Keynote appeared first on Unravel.

        ]]>

        Unravel Data CEO to Keynote Second “DataOps Unleashed” Virtual Event on February 2, 2022

        Peer-to-Peer Event Dedicated to Helping Data Professionals Untangle the Complexity of Modern Data Clouds and Simplify their Data Operations

WHAT: Unravel Data announced that its CEO and Co-founder, Kunal Agarwal, will deliver the keynote address at “DataOps Unleashed,” a free, full-day virtual summit taking place on February 2nd, 2022, that will showcase some of the most prominent voices from across the burgeoning DataOps community.

        The most successful enterprise organizations of tomorrow will be the ones who can effectively harness data from a broad array of sources and operating environments and rapidly transform it into actionable intelligence that supports data-driven decision making. DataOps Unleashed is a growing, cross-industry community where data professionals can productively collaborate with one another, share industry best practices, and deliver on the promise of what it means to be a data-led company.

        The event will feature compelling presentations, conversation, and peer-sharing between technical practitioners, data scientists, and executive strategists from some of the most data-forward enterprise organizations in the world, including Slack, Zillow, Cisco, IBM, and many other recognized global brands. Over the course of the day, attendees will learn how a modernized approach to DataOps can transform their operations, improve data predictability, increase reliability, and create economic efficiencies with their data pipelines. Leading vendors from the DataOps market including Census, Metaplane, Airbyte, and Manta will be joining Unravel Data in supporting this community event.

        Speakers at the DataOps Unleashed event to include:

        More information about the event can be found at: https://dataopsunleashed.com/

        WHEN: Wednesday, February 2nd beginning at 9AM PST

        COST: Free

        WHERE: Register at: https://dataopsunleashed.com/

        WHO: Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Adobe and Deutsche Bank. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        The post DataOps Unleashed 2022 Keynote appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-unleashed-keynote/feed/ 0
Big Data Meets the Cloud https://www.unraveldata.com/resources/big-data-meets-the-cloud/ https://www.unraveldata.com/resources/big-data-meets-the-cloud/#respond Tue, 21 Dec 2021 15:37:28 +0000 https://www.unraveldata.com/?p=8185

        This article by Unravel CEO Kunal Agarwal originally appeared as a Forbes Technology Council post under the title The Future of Big Data Is in the Hybrid Cloud: Part 2 and has been updated to reflect […]

        The post Big Data Meets the Cloud appeared first on Unravel.

        ]]>

        This article by Unravel CEO Kunal Agarwal originally appeared as a Forbes Technology Council post under the title The Future of Big Data Is in the Hybrid Cloud: Part 2 and has been updated to reflect 2021 statistics.

        With interest in big data and cloud increasing around the same time, it wasn’t long until big data began being deployed in the cloud. Big data comes with some challenges when deployed in traditional, on-premises settings. There’s significant operational complexity, and, worst of all, scaling deployments to meet the continued exponential growth of data is difficult, time-consuming, and costly.

        The cloud provides the perfect solution to this problem since it was built for convenience and scalability. In the cloud, you don’t have to tinker around trying to manually configure and troubleshoot complicated open-source technology. When it comes to growing your deployments, you can simply hit a few buttons to instantly roll out more instances of Hadoop, Spark, Kafka, Cloudera, or any other big data app. This saves money and headaches by eliminating the need to physically grow your infrastructure and then service and manage that larger deployment. Moreover, the cloud allows you to roll back these deployments when you don’t really need them—a feature that’s ideal for big data’s elastic computing nature.

        Big data’s elastic compute requirements mean that organizations will have a great need to process big data at certain times but little need to process it at other times. Consider the major retail players. They likely saw massive surges of traffic on their websites this past Cyber Monday, which generated a reported $10.7 billion in sales. These companies probably use big data platforms to provide real-time recommendations for shoppers as well as to analyze and catalog their actions. In a traditional big data infrastructure, a company would need to deploy physical servers to support this activity. These servers would likely not be needed the other 364 days of the year, resulting in wasted expenditures. However, in the cloud, retail companies could simply spin up the big data and the resources that are needed and then get rid of them when traffic subsides.

        This sort of elasticity occurs on a day-to-day basis for many companies that are driving the adoption of big data. Most websites experience a few hours of peak traffic and few hours of light traffic each day. Think of social media, video streaming, or dating sites. Elasticity is a major feature of big data, and the cloud provides the elasticity to keep those sites performing under any conditions.

        One important thing to keep in mind when deploying big data in the cloud is cost assurance. In situations like the ones described above, organizations suddenly use a lot more compute and other resources. It’s important to have set controls when operating in the cloud to prevent unforeseen, massive cost overruns. In short, a business’s autoscaling rules must operate within its larger business context so it’s not running over budget during traffic spikes. And it’s not just the sudden spikes you need to worry about. A strict cost assurance strategy needs to be in place even as you gradually migrate apps and grow your cloud deployments. Costs can rise quickly based on tiered pricing, and there’s not always a lot of visibility depending on the cloud platform.
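As a rough illustration of autoscaling rules that operate within a business context, the sketch below caps a scale-out request once projected month-to-date spend would exceed budget. The figures and the function interface are hypothetical; real guardrails would hook into your cloud provider's autoscaling and billing APIs.

```python
def allowed_extra_nodes(requested_nodes: int, node_hourly_cost: float,
                        hours_remaining_in_month: float,
                        spend_to_date: float, monthly_budget: float) -> int:
    """Cap a scale-out request so projected spend stays within the monthly budget."""
    headroom = monthly_budget - spend_to_date
    if headroom <= 0:
        return 0
    max_affordable = int(headroom / (node_hourly_cost * hours_remaining_in_month))
    return min(requested_nodes, max_affordable)

# Hypothetical traffic-spike scenario with invented figures.
granted = allowed_extra_nodes(requested_nodes=20, node_hourly_cost=0.50,
                              hours_remaining_in_month=200,
                              spend_to_date=9_000, monthly_budget=10_000)
print(f"Autoscaler may add {granted} nodes without blowing the budget")
```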

        Reasons Why Big Data Migrations Fail—and Ways to Succeed

        Watch video presentation

        A Hybrid Future

        Of course, the cloud isn’t ideal for all big data deployments. Some amount of sensitive data, such as financial or government records, will always be stored on-premises. Also, in specific environments such as high-performance computing (HPC), data will often be kept on-premises to meet rigorous speed and latency requirements. But for most big data deployments, the cloud is the best way to go.

        As a result, we can expect to see organizations adopt a hybrid cloud approach in which they deploy more and more big data in the cloud but keep certain applications in their own data centers. Hybrid is the way of the future, and the market seems to be bearing that out. A hybrid approach allows enterprises to keep their most sensitive and heaviest data on-premises while moving other workloads to the public cloud.

        It’s important to note that this hybrid future will also be multi-cloud, with organizations putting big data in a combination of AWS, Azure, and Google Cloud. These organizations will have the flexibility to operate seamlessly between public clouds and on-premises. The different cloud platforms have different strengths and weaknesses, so it makes sense for organizations embracing the cloud to use a combination of platforms to best accommodate their diverse needs. In doing so, they can also help optimize costs by migrating apps to the cloud that is cheapest for that type of workload. A multi-cloud approach is also good for protecting data, enabling customers to keep apps backed up in another platform. Multi-cloud also helps avoid one of the bigger concerns about the cloud: vendor lock-in.

        Cloud adoption is a complex, dynamic life cycle—there aren’t firm start and finish dates like with other projects. Moving to the cloud involves phases such as planning, migration, and operations that, in a way, are always ongoing. Once you’ve gotten apps to the cloud, you’re always trying to optimize them. Nothing is stationary, as your organization will continue to migrate more apps, alter workload profiles, and roll out new services. In order to accommodate the fluidity of the cloud, you need the operational capacity to monitor, adapt, and automate the entire process.

        The promise of big data was always about the revolutionary insights it offers. As the blueprint for how best to deploy, scale, and optimize big data becomes clearer, enterprises can focus more on leveraging insights from that data to drive new business value. Embracing the cloud may seem complex, but the cloud’s scale and agility allow organizations to mine those critical insights at greater ease and lower cost.

        Next Steps

        Be sure to check out Unravel Director of Solution Engineering Chris Santiago’s on-demand webinar recording—no form to fill out—on Reasons Why Big Data Cloud Migrations Fail and Ways to Succeed.

        The post Big Data Meets the Cloud appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/big-data-meets-the-cloud/feed/ 0
Spark Troubleshooting Guides https://www.unraveldata.com/resources/spark-troubleshooting-guides/ https://www.unraveldata.com/resources/spark-troubleshooting-guides/#respond Tue, 02 Nov 2021 20:17:13 +0000 https://www.unraveldata.com/?p=7798

        Thanks for your interest in the Spark Troubleshooting Guides. You can download it here. This 3 part series is your one-stop guide to all things Spark troubleshooting. In Part 1, we describe the ten biggest challenges […]

        The post Spark Troubleshooting Guides appeared first on Unravel.

        ]]>

        Thanks for your interest in the Spark Troubleshooting Guides.

        You can download it here.

This three-part series is your one-stop guide to all things Spark troubleshooting. In Part 1, we describe the ten biggest challenges for troubleshooting Spark jobs at the job, pipeline, and cluster levels. In Part 2, we describe the major categories of tools and types of solutions for addressing those challenges.

Lastly, Part 3 of the guide builds on the other two to show you how to address the problems we described, and more, with a single tool that combines the best of what single-purpose tools do: our DataOps platform, Unravel Data.

        The post Spark Troubleshooting Guides appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/spark-troubleshooting-guides/feed/ 0
Twelve Best Cloud & DataOps Articles https://www.unraveldata.com/resources/twelve-best-cloud-dataops-articles/ https://www.unraveldata.com/resources/twelve-best-cloud-dataops-articles/#respond Thu, 28 Oct 2021 19:27:17 +0000 https://www.unraveldata.com/?p=7774

        Our resource picks for October! Prescriptive Insights On Cloud & DataOps Topics Interested in learning about different technologies and methodologies, such as Databricks, Amazon EMR, cloud computing and DataOps? A good place to start is reading […]

        The post Twelve Best Cloud & DataOps Articles appeared first on Unravel.

        ]]>

        Our resource picks for October!
        Prescriptive Insights On Cloud & DataOps Topics

        Interested in learning about different technologies and methodologies, such as Databricks, Amazon EMR, cloud computing and DataOps? A good place to start is reading articles that give tips, tricks, and best practices for working with these technologies.

        Here are some of our favorite articles from experts on cloud migration, cloud management, Spark, Databricks, Amazon EMR, and DataOps!

        Cloud Migration

        Cloud-migration Opportunity: Business Value Grows but Missteps Abound
        (Source: McKinsey & Company)
        Companies aim to embrace the cloud more fully, but many are already failing to reap the sizable rewards. Outperformers have shown what it takes to overcome the costly hurdles and could potentially unlock $1 trillion in value, according to a recent McKinsey article.

        4 Major Mistakes That Can Derail a Cloud Migration (Source: MDM)
        If your organization is thinking of moving to the cloud, it’s important to know both what to do and what NOT to do. This article details four common missteps that can hinder your journey to the cloud. One such mistake is not having a cloud migration strategy.

        Check out the full article on the Modern Distribution Management (MDM) site to learn about other common mistakes, their impacts, and ways to avoid them.

        Plan Your Move: Three Tips For Efficient Cloud Migrations (Source: Forbes)
        Think about the last time you moved to a new place. Moving is usually exciting, but the logistics can get complicated. The same can be said for moving to the cloud.

        Just as a well-planned move is often the smoothest, the same holds true for cloud migrations.

        As you’re packing up your data and workloads to transition business services to the cloud, check out this article on Forbes for three best practices for cloud migration planning.

        (Bonus resource: Check out our Ten Steps to Cloud Migration post. If your company is considering making the move, these steps will help!)

        Cloud Management

        How to Improve Cloud Management (Source: DevOps)
        The emergence of technologies like AI and IoT as well as the spike in remote work due to the COVID-19 pandemic have accelerated cloud adoption.

        With this growth comes a need for a cloud management strategy in order to avoid unnecessarily high costs and security or compliance violations. This DevOps article shares insights on how to build a successful cloud management strategy.

        The Challenges of Cloud Data Management (Source: TechTarget)
        Cloud spend and the amount of data in the cloud continues to grow at an unprecedented rate. This rapid expansion is causing organizations to also face new cloud management challenges as they try to keep up with cloud data management advancements.

        Head over to TechTarget to learn about cloud management challenges, including data governance and adhering to regulatory compliance frameworks.

        Spark

        Spark: A Data Engineer’s Best Friend (Source: CIO)
        Spark is the ultimate tool for data engineers. It simplifies the work environment by providing a platform to organize and execute complex data pipelines and powerful tools for storing, retrieving, and transforming data.

        This CIO article describes different things data engineers can do with Spark, touches on what makes Spark unique, and explains why it is so beneficial for data engineers.

        Is There Life After Hadoop? The Answer is a Resounding Yes. (Source: CIO)
        Many organizations who invested heavily in the Hadoop ecosystem have found themselves wondering what life after Hadoop is like and what lies ahead.

        This article addresses life after Hadoop and lays out a strategy for organizations entering the post-Hadoop era, including reasons why you may want to embrace Spark as an alternative. Head to the CIO site for more!

        Databricks

        5 Ways to Boost Query Performance with Databricks and Spark (Source: Key2 Consulting)
        When running Spark jobs on Databricks, do you often find yourself frustrated by slow query times?

        Check out this article from Key2 Consulting to discover 5 rules for speeding up query times. The rules include:

        • Cache intermediate big dataframes for repetitive use.
        • Monitor the Spark UI within a cluster where a Spark job is running.

        For more information on these rules and to find out the remaining three, check out the full article.

        What is a Data Lakehouse and Why Should You Care? (Source: S&P Global)
        A data lakehouse is an environment designed to combine the data structure and data management features of a data warehouse with the low-cost storage of a data lake.

Databricks offers a couple of data lakehouse technologies, including Delta Lake and Delta Engine. This article from S&P Global gives a more comprehensive explanation of what a data lakehouse is, its benefits, and what lakehouses you can use on Databricks.

        Amazon EMR

        What is Amazon EMR? – Amazon Elastic MapReduce Tutorial (Source: ADMET)
Amazon EMR is among the most popular cloud big data platforms. It provides a managed framework for running data processing frameworks simply, cost-effectively, and securely.

        In this ADMET blog, learn what Amazon Elastic MapReduce is and how it can be used to deal with a variety of issues.

        DataOps

        3 Steps for Successfully Implementing Industrial DataOps (Source: eWeek)
        DataOps has been growing in popularity over the past few years. Today, we see many industrial operations realizing the value of DataOps.

        This article explains three steps for successfully implementing industrial DataOps:

        1. Make industrial data available
        2. Make data useful
        3. Make data valuable

        Head over to eWeek for a deeper dive into the benefits of implementing industrial DataOps and what these three steps really mean.

        Using DataOps To Maximize Value For Your Business (Source: Forbes)
        Everybody is talking about artificial intelligence and data, but how do you make it real for your business? That’s where DataOps comes in.

        From this Forbes article, learn how DataOps can be used to solve common business challenges, including:

        • A process mismatch between traditional data management and newer techniques such as AI.
        • A lack of collaboration to drive a successful cultural shift and support operational readiness.
• An unclear approach to measuring success across the organization.

        In Conclusion

        Knowledge is power! We hope our data community enjoys these resources and they provide valuable insights to help you in your current role and beyond.

Be sure to visit our library of resources on DataOps, Cloud Migration, Cloud Management (and more) for best practices, happenings, and expert tips and techniques. If you want to know more about Unravel Data, you can sign up for a free account or contact us.

        The post Twelve Best Cloud & DataOps Articles appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/twelve-best-cloud-dataops-articles/feed/ 0
The Spark Troubleshooting Solution is The Unravel DataOps Platform https://www.unraveldata.com/resources/spark-troubleshooting-part-3-the-answer-is-unravel/ https://www.unraveldata.com/resources/spark-troubleshooting-part-3-the-answer-is-unravel/#respond Wed, 20 Oct 2021 23:33:02 +0000 https://www.unraveldata.com/?p=7732

        Current practice for Spark troubleshooting is messy. Part of this is due to Spark’s very popularity; it’s widely used on platforms as varied as open source Apache Spark, on all platforms; Cloudera’s Hadoop offerings (on-premises and […]

        The post The Spark Troubleshooting Solution is The Unravel DataOps Platform appeared first on Unravel.

        ]]>

        Current practice for Spark troubleshooting is messy. Part of this is due to Spark’s very popularity; it’s widely used on platforms as varied as open source Apache Spark, on all platforms; Cloudera’s Hadoop offerings (on-premises and in the cloud); Amazon EMR, Azure Synapse, and Google Dataproc; and Databricks, which runs on all three public clouds. (Which means you have to be able to address Spark’s interaction with all of these very different environments.)

        Because Spark does so much, on so many platforms, “Spark troubleshooting” covers a wide range of problems – jobs that halt; pipelines that fail to deliver, so you have to find the issue; performance that’s too slow; or using too many resources, either in the data center (where your clusters can suck up all available resources) or in the cloud (where resources are always available, but your costs rise, or even skyrocket.)

        Where Are the Issues – and the Solutions?

        Problems in running Spark jobs occur at the job and pipeline levels, as well as at the cluster level, as described in Part 1 of this three-part series: the top ten problems you encounter in working with Spark. And there are several solutions that can help, as we described in Part 2: five types of solutions used for Spark troubleshooting. (You can also see our recent webinar, Troubleshooting Apache Spark, for an overview and demo.)

        Table: What each level of tool shows you – and what’s missing

         

        Existing tools provide incomplete, siloed information. We created Unravel Data as a go-to DataOps platform that includes much of the best of existing tools. In this blog post we’ll give examples of problems at the job, pipeline, and cluster levels, and show how to solve them with Unravel Data. We’ll also briefly describe how Unravel Data helps you prevent problems, providing AI-powered, proactive recommendations.

        The Unravel Data platform gathers more information than existing tools by adding its own sensors to your stack, and by using all previously existing metrics, traces, logs, and available API calls. It gathers this robust information set together and correlates pipeline information, for example, across jobs.

        The types of issues that Unravel covers are, broadly speaking: fixing bottlenecks; meeting and beating SLAs; cost optimization; fixing failures; and addressing slowdowns, helping you improve performance. Within each of these broad areas, Unravel has the ability to spot hundreds of different types of factors contributing to an issue. These contributing factors include data skew, bad joins, load imbalance, incorrectly sized containers, poor configuration settings, and poorly written code, as well as a variety of other issues.

        Fixing Job-Level Problems with Unravel

        Here’s an example of a Spark job or application run that’s monitored by Unravel.

        In Unravel, you first see automatic recommendations, analysis, and insights for any given job. This allows users to quickly understand what the problem is, why it happened, and how to resolve it. In the example below, resolving the problem will take about a minute.

        Unravel Spark Dashboard Example

        Let’s dive into the insights for an application run, as shown below.

        Unravel App Summary Screenshot

You can see here that Unravel has spotted bottlenecks, and also room for improving the performance of this app. It has narrowed down what the particular problem is with this application and how to resolve it. In this case, it has recommended doubling the number of executors and reducing the memory for each executor, which will improve performance by about 30%, meeting the SLA.
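
As a rough sketch of what acting on a recommendation like that can look like (the before-and-after values here are hypothetical, not taken from the screenshot), the change amounts to a couple of configuration settings:

from pyspark.sql import SparkSession

# Hypothetical tuning: double the executors, reduce per-executor memory.
# On Databricks or YARN these settings usually live in the cluster or job
# configuration rather than in application code; they are shown here only
# to make the change concrete.
spark = (
    SparkSession.builder
    .appName("tuned-app")
    .config("spark.executor.instances", "8")  # was 4
    .config("spark.executor.memory", "4g")    # was 8g
    .getOrCreate()
)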

        Additionally, Unravel has also spotted some bad joins which are slowing this application down, as shown below.

Unravel Bottlenecks Screenshot Example

In addition to helping speed this application up, Unravel is also recommending resource settings which will lower the cost of running this application, as shown below – reductions of roughly 50% in executor memory and driver memory, cutting out half the total memory cost. Again, Unravel is delivering pinpoint recommendations. Users avoid a lengthy trial-and-error exercise; instead, they can solve the problem in about a minute.

Unravel App Analysis Example

Unravel can also help with jobs or applications that just didn’t work and failed. It uses a similar approach as above to help data engineers and operators get to the root cause of the problem and resolve it quickly.

        Unravel Spark Analysis

Unravel App Analysis Tab Example

In this example, the job or application failed because of an out-of-memory exception. Unravel surfaces this problem instantly and pinpoints exactly where it occurred.

        For further information, and to support investigation, Unravel provides distilled and easy-to-access logs and error messages, so users and data engineers have all the relevant information they need at hand.

        And once data teams start using Unravel, they can do everything with more confidence. For instance, if they try to save money by keeping resource allocations low, but overdo that a little bit, they’ll get an out-of-memory error. Previously, it might have taken many hours to resolve the error, so the team might not risk tight allocations. But fixing the error only takes a couple of minutes with Unravel, so the data team can cut costs effectively.

        Examples of logs that Unravel provides for easy access and error message screens follow.

        Unravel Errors Example

         

        Unravel Logs Tab Example

        Unravel strives to help users solve their problems with a click of a button. At the same time, Unravel provides a great deal of detail about each job and application, including displaying code execution, displaying DAGs, showing resource usage, tracking task execution, and more. This allows users to drill down to whatever depth needed to understand and solve the problem.

        Unravel Task Stage Metrics View

        Task stage metrics in Unravel Data

         

        As another example, this screen shows details for task stage information:

        • Left-hand side: task metrics. This includes the job stage task metrics of Spark, much like what you would see from Spark UI. However, Unravel keeps history on this information; stores critical log information for easy access; presents multiple logs coherently; and ties problems to specific log locations.
        • Right-hand side: holistic KPIs. Information such as job start and end time, run-time durations, I/O in KB – and whether each job succeeded or failed.

        Data Pipeline Problems

        The tools people use for troubleshooting Spark jobs tend to focus on one level of the stack or another – the networking level, the cluster level, or the job level, for instance. None of these approaches helps much with Spark pipelines. A pipeline is likely to have many stages, involving many separate Spark jobs.

        Here’s an example. One Spark job can handle data ingest; a second job, transformation; a third job may send the data to Kafka; and a final job can be reading the data from Kafka and then putting it into a distributed store, like Amazon S3 or HDFS.

        Airflow DAGs Screenshot

        Airflow being used to create and organize a Spark pipeline.
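
A minimal sketch of what such a pipeline might look like as an Airflow DAG follows (assuming Airflow 2.x; the DAG name, scripts, and schedule are hypothetical, and each task simply calls spark-submit on one of the Spark jobs described above):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_spark_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="spark-submit /jobs/ingest.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit /jobs/transform.py")
    to_kafka = BashOperator(task_id="to_kafka", bash_command="spark-submit /jobs/to_kafka.py")
    kafka_to_store = BashOperator(task_id="kafka_to_store", bash_command="spark-submit /jobs/kafka_to_s3.py")

    # When a run breaks, Airflow shows which task failed -- but drilling into
    # the underlying Spark job is still up to you.
    ingest >> transform >> to_kafka >> kafka_to_store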

         

        The two most important orchestration tools are Oozie, which tends to be used with on-premises Hadoop, and Airflow, which is used more often in the cloud. They will help you create and manage a pipeline; and, when the pipeline breaks down, they’ll show you which job the problem occurred in.

        But orchestration tools don’t help you drill down into that job; that’s up to you. You have to find the specific Spark run where the failure occurred. You have to use other tools, such as Spark UI or logs, and look at timestamps, using your detailed knowledge of each job to cross-correlate and, hopefully, find the issue. As you see, just finding the problem is messy, intense, time-consuming, expert work; fixing it is even more effort.

        Oozie Pipeline Screenshot

        Oozie also gives you a big-picture view of pipelines.

         

        Unravel, by contrast, provides pipeline-specific views that first connect all the components – Spark, and everything else in your modern data stack – and runs of the data pipeline together in one place. Unravel then allows you to drill down into the slow, failed, or inefficient job, identify the actual problem, and fix it quickly. And it gets even better; Unravel’s AI-powered recommendations will help you prevent a pipeline problem from even happening in the first place.

        You didn’t have to look at Spark UI, plus dig through Spark logs, then check Oozie or Airflow. All the information is correlated into one view – a single pane of glass.

        Unravel Jobs View

        This view shows details for several jobs. In the graphic, each line has an instance run. The longest duration shown here is three minutes and 1 second. If the SLA is “under two minutes,” then the job failed to meet its SLA. (Because some jobs run scores or hundreds of times a day, missing an SLA by more than a minute – especially when that means a roughly 50% overshoot against the SLA – can become a very big deal.)
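
To put a number on that overshoot, using the durations quoted above:

sla_seconds = 120            # "under two minutes"
run_seconds = 3 * 60 + 1     # longest run shown: 3 minutes and 1 second
overshoot = (run_seconds - sla_seconds) / sla_seconds
print(f"{overshoot:.0%}")    # 51% -- roughly a 50% overshoot against the SLA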

        Unravel then provides history and correlated information, using all of this to deliver AI-powered recommendations. You can also set AutoActions against a wide variety of conditions and get cloud migration support.

        Cluster Issues

        Resources are allocated at the cluster level. The screenshot shows ResourceManager (RM), which tracks resources, schedules jobs such as Spark jobs, and so on. You can see the virtual machines assigned to your Spark jobs, what resources they’re using, and status – started or not started, completed or not completed.

        Apache Hadoop ResourceManager

        Apache Hadoop ResourceManager

         

        The first problem is that there’s no way to see what actual resources your job is consuming. Nor can you see whether those resources are being used efficiently or not. So you can be over-allocated, wasting resources – or running very close to your resources limit, with the job likely to crash in the future.

        Nor can you compare past to present; ResourceManager does not have history in it. Now you can pull logs at this level – the YARN level – to look at what was happening, but that’s aggregated data, not the detail you’re looking for. You also can’t dig into potential conflicts with neighbors sharing resources in the cluster.

You can use platform tools like Cloudwatch, Cloudera Manager, or Ambari. They provide a useful holistic view, at the cluster level – total CPU consumption, disk I/O consumption, and network I/O consumption. But, as with some of the pipeline views we discussed above, you can’t take this down to the job level.

        You may have a spike in cluster disk I/O. Was it your job that started that, or someone else’s? Again, you’re looking at Spark UI, you’re looking at Spark logs, hoping maybe to get a bit lucky and figure out what the problem is. Troubleshooting becomes a huge intellectual and practical challenge. And this is all taking away from time making your environment better or doing new projects that move the business forward.

        It’s common for a job to be submitted, then held because the cluster’s resources are already tied up. The bigger the job, the more likely it will have to wait. But existing tools make it hard to see how busy the cluster is. So later, when the job that had to wait finishes late, no one knows why that happened.

        Unravel Cluster User View

        A cluster-level view showing vCores, specific users, and a specific queue

         

By contrast, in this screenshot from Unravel, you see cluster-level details. This job was in the data security queue, and it was submitted on July 5th, around 7:30pm. These two rows show vCores – overall compute consumption on this Hadoop cluster. The orange line shows the maximum available, and the blue line shows actual usage.

        Unravel Cluster Level View

        At this point in time, usage (blue line) did not exceed available resources (orange line)

         

        You can also get more granular and look at a specific user. You can go to the date and time that the job was launched and see what was running at that point in time. And voilà – there were actually enough resources available.

        So it’s not a cluster-level problem; you need to examine the job itself. And Unravel, as we’ve described, gives you the tools to do that. You can see that we’ve eliminated a whole class of potential problems for this slowdown – not in hours or days, and with no trial-and-error experimentation needed. We just clicked around in Unravel for a few minutes.

        Unravel Data: An Ounce of Prevention

        For the issues above, such as slowdowns, failures, missed SLAs or just expensive runs, a developer would have to be looking at YARN logs, ResourceManager logs, and Spark logs, possibly spending hours figuring it all out. Within Unravel, though, they would not need to jump between all those screens; they would get all the information in one place. They can then use Unravel’s built-in intelligence to automatically root-cause the problem and resolve it.

        Unravel Data solves the problem of Spark troubleshooting at all three levels – at the job, pipeline, and cluster levels. It handles the correlation problem – tying together cluster, pipeline, and job information – for you. Then it uses that information to give unique views at every level of your environment. Unravel makes AI-powered recommendations to help you head off problems; allows you to create AutoActions that execute on triggers you define; and makes troubleshooting much easier.

        Unravel solves systemic problems with Spark. For instance, Spark tends to cause overallocation: assigning very large amounts of resources to every run of a Spark job, to try to avoid crashes on any run of that job over time. The biggest datasets or most congested conditions set the tone for all runs of the job or pipeline. But with Unravel, you can flexibly right-size the allocation of resources.

        Unravel frees up your experts to do more productive work. And Unravel often enables newer and more junior-level people to be as effective as an expert would have been, using the ability to drill down, and the proactive insights and recommendations that Unravel provides.

        Unravel even feeds back into software development. Once you find problems, you can work with the development team to implement new best practices, heading off problems before they appear. Unravel will then quickly tell you which new or revised jobs are making the grade.

        Unravel Data Advantage Diagram

        The Unravel advantage – on-premises and all public clouds

         

Another hidden virtue of Unravel is that it serves as a single source of truth for different roles in the organization. If the developer, or an operations person, finds a problem, then they can use Unravel to highlight just what the issue is, and how to fix it. And not only how to fix it this time, for this job, but to reduce the incidence of that class of problem across the whole organization. The same goes for business intelligence (BI) tool users such as analysts and data scientists – everyone. Unravel gives you a kind of X-ray of problems, so you can cooperate in solving them.

        With Unravel, you have the job history, the cluster history, and the interaction with the environment as a whole – whether it be on-premises, or using Databricks or native services on AWS, Azure, or Google Cloud Platform. In most cases you don’t have to try to remember, or discover, what tools you might have available in a given environment. You just click around in Unravel, largely the same way in any environment, and solve your problem.

        Between the problems you avoid, and your new-found ability to quickly solve the problems that do arise, you can start meeting your SLAs in a resource-efficient manner. You can create your jobs, run them, and be a rockstar Spark developer or operations person within your organization.

        Conclusion

        In this blog post, we’ve given you a wide-ranging tour of how you can use Unravel Data to troubleshoot Spark jobs – on-premises and in the cloud, at the job, pipeline, and cluster levels, working across all levels, efficiently, from a single pane of glass.

        In Troubleshooting Spark Applications, Part 1: Ten Challenges, we described the ten biggest challenges for troubleshooting Spark jobs across levels. And in Spark Troubleshooting, Part 2: Five Types of Solutions, we describe the major categories of tools, several of which we touched on here.

        This blog post, Part 3, builds on the other two to show you how to address the problems we described, and more, with a single tool that does the best of what single-purpose tools do, and more – our DataOps platform, Unravel Data.

        The post The Spark Troubleshooting Solution is The Unravel DataOps Platform appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/spark-troubleshooting-part-3-the-answer-is-unravel/feed/ 0
        Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level https://www.unraveldata.com/resources/troubleshooting-apache-spark/ https://www.unraveldata.com/resources/troubleshooting-apache-spark/#respond Wed, 13 Oct 2021 18:05:18 +0000 https://www.unraveldata.com/?p=7666 Sparkler Abstract Background

        Apache Spark is the leading technology for big data processing, on-premises and in the cloud. Spark powers advanced analytics, AI, machine learning, and more. Spark provides a unified infrastructure for all kinds of professionals to work […]

        The post Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level appeared first on Unravel.

        ]]>

        Apache Spark is the leading technology for big data processing, on-premises and in the cloud. Spark powers advanced analytics, AI, machine learning, and more. Spark provides a unified infrastructure for all kinds of professionals to work together to achieve outstanding results.

        Technologies such as Cloudera’s offerings, Amazon EMR, and Databricks are largely used to run Spark jobs. However, as Spark’s importance grows, so does the importance of Spark reliability – and troubleshooting Spark problems is hard. Information you need for troubleshooting is scattered across multiple, voluminous log files. The right log files can be hard to find, and even harder to understand. There are other tools, each providing part of the picture, leaving it to you to try to assemble the jigsaw puzzle yourself.

        Would your organization benefit from rapid troubleshooting for your Spark workloads? If you’re running significant workloads on Spark, then you may be looking for ways to find and fix problems faster and better – and to find new approaches that steadily reduce your problems over time.

This blog post is adapted from the recent webinar, Troubleshooting Apache Spark, part of the Unravel Data “Optimize” Webinar Series. In the webinar, Unravel Data’s Global Director of Solutions Engineering, Chris Santiago, runs through common challenges faced when troubleshooting Spark and shows how Unravel Data can help.

        Spark: The Good, the Bad & the Ugly

        Chris has been at Unravel for almost four years. Throughout that time, when it comes to Spark, he has seen it all – the good, the bad, and the ugly. On one hand, Spark as a community has been growing exponentially. Millions of users are adopting Spark, with no end in sight. Cloud platforms such as Amazon EMR and Databricks are largely used to run Spark jobs.

        Spark is here to stay, and use cases are rising. There are countless product innovations powered by Spark, such as Netflix recommendations, targeted ads on Facebook and Instagram, or the “Trending” feature on Twitter. On top of its great power, the barrier to entry for Spark is now lower than ever before. But unfortunately, with the good comes the bad, and the number one common issue is troubleshooting.

Troubleshooting Spark is complex for a multitude of different reasons. First off, there are multiple points of failure. A typical Spark data pipeline could be using orchestration tools, such as Airflow or Oozie, as well as built-in tools, such as Spark UI. You also may be using cluster management technologies, such as Cloudera Manager or Ambari.

        A failure may not always start on Spark; it could rather be a failure within a network layer on the cloud, for example.

Because Spark uses so many tools, not only are there multiple points of failure, but there is also a lot of correlation of information from various sources that you must carry out across these platforms. This requires expertise. You need experience in order to understand not only the basics of Spark, but all the other platforms that can support Spark as well.

        Lastly, when running Spark workloads, the priority is often to meet SLAs at all costs. To meet SLAs you may, for example, double your resources, but there will always be a downstream effect. Determining what’s an appropriate action to take in order to make SLAs can be tricky.


        The Three Levels of Spark Troubleshooting

There are multiple levels when it comes to troubleshooting Spark. First there is the Job Level, which deals with the inner workings of Spark itself, from executors to drivers to memory allocation to logs. The job level is about determining best practices for using the tools that we have today to make sure that Spark jobs are performing properly. Next is the Pipeline Level. Troubleshooting at the pipeline level is about managing multiple Spark runs and stages to make sure you’re getting in front of issues and using different tools to your advantage. Lastly, there is the Cluster Level, which deals with infrastructure. Troubleshooting at the cluster level is about understanding the platform in order to get an end-to-end view of troubleshooting Spark jobs.

        Troubleshooting: Job Level

        A Spark job refers to actions such as doing work in a workbook or analyzing a Spark SQL query. One tool used on the Spark job level is Spark UI. Spark UI can be described as an interface for understanding Spark at the job level.

        Spark UI is useful in giving granular details about the breakdown of tasks, breakdown of stages, the amount of time it takes workers to complete tasks, etc. Spark UI is a powerful dataset that you can use to understand every detail about what happened to a particular Spark job.

Challenges people often face include manual correlation; understanding the overall architecture of Spark; and, more importantly, knowing which logs to look into and how one job correlates with other jobs. While Spark UI is the default starting point for determining what is going on with Spark jobs, there is still a lot of interpretation that needs to be done, which takes experience.

        Further expanding on Spark job-level challenges, one thing people often face difficulty with is diving into the logs. If you truly want to understand what caused a job to fail, you must get down to the log details. However, looking at logs is the most verbose way of troubleshooting, because every component in Spark produces logs. Therefore, you have to look at a lot of errors and timestamps across multiple servers. Looking at logs gives you the most information to help understand why a job has failed, but sifting through all that information is time-consuming.

        It also may be challenging to determine where to start on the job level. Spark was born out of the Hadoop ecosystem, so it has a lot of big data concepts. If you’re not familiar with big data concepts, determining a starting point may be difficult. Understanding the concepts behind Spark takes time, effort, and experience.

        Lastly, when it comes to Spark jobs, there are often multiple job runs that make up a data pipeline. Figuring out how one Spark job affects another is tricky and must be done manually. In his experience working with the best engineers at large organizations, Chris finds that even they sometimes decide not to finish troubleshooting, and instead just keep on running and re-running a job until it’s completed or meets the SLA. While troubleshooting is ideal, it is extremely time-consuming. Therefore, having a better tool for troubleshooting on the job level would be helpful.

        Troubleshooting: Pipeline Level

        In Chris’ experience, most organizations don’t have just one Spark job that does everything, but there are rather multiple stages and jobs that are needed to carry out Spark workloads. To manage all these steps and jobs you’d need an orchestration tool. One popular orchestration tool is Airflow, which allows you to sequence out specific jobs.

        Orchestration tools like Airflow are useful in managing pipelines. But while these tools are helpful for creating complex pipelines and mapping where points of failure are, they are lacking when it comes to providing detailed information about why a specific step may have failed. Orchestration tools are more focused on the higher, orchestration level, rather than the Spark job level. Orchestration tools tell you where and when something has failed, so they are useful as a starting point to troubleshoot data pipeline jobs on Spark.

Those who are running Spark on Hadoop, however, often use Oozie. Similar to Airflow, Oozie gives you a high-level view, alerting you when a job has failed, but not providing the type of information needed to answer questions such as “Where is the bottleneck?” or “Why did the job break down?” To answer these questions, it’s up to the user to manually correlate the information that orchestration tools provide with information from job-level tools, which again requires expertise. For example, you may have to determine which Spark run that you see in Spark UI corresponds to a certain step in Airflow. This can be very time-consuming and prone to errors.

        Troubleshooting: Cluster Level

The cluster level for Spark refers to things such as infrastructure, VMs, allocated resources, and Hadoop. Hadoop’s ResourceManager is a great tool for managing applications as they come in. ResourceManager is also useful for determining what the resource usage is, and where a Spark job is in the queue.

However, one shortcoming of ResourceManager is that you don’t get historical data. You cannot view the past state of ResourceManager from twelve or twenty-four hours ago, for example. Every time you open ResourceManager you have a view of how jobs are consuming resources at that specific time, as shown below.

        ResourceManager Example

Another challenge when troubleshooting Spark on the cluster level is that while tools such as Cloudera Manager or Ambari give a holistic view of what’s going on with the entire estate, you cannot see how cluster-level information, such as CPU consumption, I/O consumption, or network I/O consumption, relates to Spark jobs.

        Lastly, and similarly to the challenges faced when troubleshooting on the job and pipeline level, manual correlation is also a problem when it comes to troubleshooting on the cluster level. Manual correlation takes time and effort that a data science team could instead be putting towards product innovations.

        But what if there was a tool that takes all these troubleshooting challenges, on the job, pipeline, and cluster level, into consideration? Well, luckily, Unravel Data does just that. Chris next gives examples of how Unravel can be used to mitigate Spark troubleshooting issues, which we’ll go over in the remainder of this blog post.


        Demo: Unravel for Spark Troubleshooting

        The beauty of Unravel is that it provides a single pane of glass where you can look at logs, the type of information provided by Spark UI, Oozie, and all the other tools mentioned throughout this blog, and data uniquely available through Unravel, all in one view. At this point in the webinar, Chris takes us through a demo to show how Unravel aids in troubleshooting at all Spark levels – job, pipeline, and cluster. For a deeper dive into the demo, view the webinar.

        Job Level

        At the job level, one area where Unravel can be leveraged is in determining why a job failed. The image below is a Spark run that is monitored by Unravel.

        Unravel Dashboard Example Spark

        On the left hand side of the dashboard, you can see that Job 3 has failed, indicated by the orange bar. With Unravel, you can click on the failed job and see what errors occurred. On the right side of the dashboard, within the Errors tab, you can see why Job 3 failed, as highlighted in blue. Unravel is showing the Spark logs that give information on the failure.

        Pipeline Level

        Using Unravel for troubleshooting at the data pipeline level, you can look at a specific workflow, rather than looking at the specifics of one job. The image shows the dashboard when looking at data pipelines.

        Unravel Dashboard Example Data Pipelines

        On the left, the blue lines represent instance runs. The black dot represents a job that ran for two minutes and eleven seconds. You could use information on run duration to determine if you meet your SLA. If your SLA is under two minutes, for example, the highlighted run would miss the SLA. With Unravel you can also look at changes in I/O, as well as dive deeper into specific jobs to determine why they lasted a certain amount of time. The information in the screenshot gives us insight into why the job mentioned prior ran for two minutes and eleven seconds.

Unravel Analysis Tab Example

        The Unravel Analysis tab, shown above, carries out analysis to detect issues and make recommendations on how to mitigate those issues.

        Cluster Level

Below is the view of Unravel when troubleshooting at the cluster level, specifically focusing on the same job mentioned previously. The job, which again lasted two minutes and eleven seconds, took place on July 5th at around 7PM. So what happened?

Unravel Troubleshooting Cluster Example

The image above shows information about the data security queue at the time when the job we’re interested in was running. The table at the bottom of the dashboard shows the state of the jobs that were running on July 5th at 7PM, allowing you to see which job, if any, was taking up too many resources. In this case, Chris’ job, highlighted in yellow, wasn’t using a large amount of resources. From there, Chris can conclude that perhaps the issue is instead on the application side and something needs to be fixed within the code. The best way to determine what needs to be fixed is to use the Analysis tab mentioned previously.

        Conclusion

There are many ways to troubleshoot Spark, whether it be on the job, pipeline, or cluster level. Unravel can be your one-stop shop to determine what is going on with your Spark jobs and data pipelines, as well as give you proactive intelligence that allows you to quickly troubleshoot your jobs. Unravel can help you meet your SLAs in a resource-efficient manner.

        We hope you have enjoyed, and learned from, reading this blog post. If you think Unravel Data can help you troubleshoot Spark and would like to know more, you can create a free account or contact Unravel.

        The post Troubleshooting Apache Spark – Job, Pipeline, & Cluster Level appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/troubleshooting-apache-spark/feed/ 0
        Unravel Data Features in HPE Ezmeral Marketplace https://www.unraveldata.com/resources/unravel-data-features-in-hpe-ezmeral-marketplace/ https://www.unraveldata.com/resources/unravel-data-features-in-hpe-ezmeral-marketplace/#respond Fri, 17 Sep 2021 23:22:11 +0000 https://www.unraveldata.com/?p=8173 HPE Ezmeral background

        Unravel Data can now be found in the Hewlett-Packard Enterprise (HPE) Ezmeral Marketplace. This is the third cloud marketplace to feature Unravel, following multiple entries in the AWS Marketplace (for Amazon EMR and Databricks) and Microsoft […]

        The post Unravel Data Features in HPE Ezmeral Marketplace appeared first on Unravel.

        ]]>

        Unravel Data can now be found in the Hewlett-Packard Enterprise (HPE) Ezmeral Marketplace. This is the third cloud marketplace to feature Unravel, following multiple entries in the AWS Marketplace (for Amazon EMR and Databricks) and Microsoft Azure (for Azure HDInsight and Databricks).

Ezmeral is based on Kubernetes. It’s described as an “instant hybrid cloud platform” that’s optimized for edge computing, incorporating customer-owned on-premises and colocated servers, along with HPE servers on the edge. Ezmeral is conceptually similar to Tanzu, from VMware (owned by Dell), and OpenShift, from Red Hat. However, HPE makes it easy to include HPE hardware with already-owned servers and access to Ezmeral’s Kubernetes platform and the Ezmeral marketplace.

        Unravel in HPE Ezmeral Marketplace

        By including Unravel Data in the HPE Ezmeral software portfolio, HPE makes it easier for its Ezmeral users to monitor, manage, and improve big data and streaming data operations – DataOps – on the new platform, which recently won the 2020 CRN Tech Innovator Award.

        To learn more about HPE Ezmeral and the Ezmeral Marketplace, visit the HPE Ezmeral homepage. If you’d like to learn more about Unravel Data directly, you can create a free account or contact us.

        The post Unravel Data Features in HPE Ezmeral Marketplace appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-data-features-in-hpe-ezmeral-marketplace/feed/ 0
        Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/ https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/#respond Thu, 09 Sep 2021 21:53:14 +0000 https://www.unraveldata.com/?p=7220

        Note: This guide applies to running Spark jobs on any platform, including Cloudera platforms; cloud vendor-specific platforms – Amazon EMR, Microsoft HDInsight, Microsoft Synapse, Google DataProc; Databricks, which is on all three major public cloud providers; […]

        The post Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools appeared first on Unravel.

        ]]>

        Note: This guide applies to running Spark jobs on any platform, including Cloudera platforms; cloud vendor-specific platforms – Amazon EMR, Microsoft HDInsight, Microsoft Synapse, Google DataProc; Databricks, which is on all three major public cloud providers; and Apache Spark on Kubernetes, which runs on nearly all platforms, including on-premises and cloud.

        Introduction

        Spark is known for being extremely difficult to debug. But this is not all Spark’s fault. Problems in running a Spark job can be the result of problems with the infrastructure Spark is running on, an inappropriate configuration of Spark, Spark issues, the currently running Spark job, other Spark jobs running at the same time – or interactions among these layers. But Spark jobs are very important to the success of the business; when a job crashes, or runs slowly, or contributes to a big increase in the bill from your cloud provider, you have no choice but to fix the problem.

        Widely used tools generally focus on part of the environment – the Spark job, infrastructure, the network layer, etc. These tools don’t present a holistic view. But that’s just what you need to truly solve problems. (You also need the holistic view when you’re creating the Spark job, and as a check before you start running it, to help you avoid having problems in the first place. But that’s another story.)

        In this guide, Part 2 in a series, we’ll show ten major tools that people use for Spark troubleshooting. We’ll show what they do well, and where they fall short. In Part 3, the final piece, we’ll introduce Unravel Data, which makes solving many of these problems easier.

        Life as Spark Developer Diagram

        What’s the Problem(s)?

        The problems we mentioned in Part 1 of this series have many potential solutions. The methods people usually use to try to solve them often derive from that person’s role on the data team. The person who gets called when a Spark job crashes, such as the job’s developer, is likely to look at the Spark job. The person who is responsible for making sure the cluster is healthy will look at that level. And so on.

In this guide, we highlight five types of solutions that people use – often in various combinations – to solve problems with Spark jobs:

        • Spark UI
        • Spark logs
        • Platform-level tools such as Cloudera Manager, the Amazon EMR UI, Cloudwatch, the Databricks UI, and Ganglia
        • APM tools
        • DataOps platforms such as Unravel Data

As an example of solving problems of this type, let’s look at the problem of an application that’s running too slowly – a very common Spark problem that may be caused by one or more of the issues listed in the chart. Here, we’ll look at how existing tools might be used to try to solve it.

        Note: Many of the observations and images in this guide have been drawn from the  presentation Beyond Observability: Accelerate Performance on Databricks, by Patrick Mawyer, Systems Engineer at Unravel Data. We recommend this webinar to anyone interested in Spark troubleshooting and Spark performance management, whether on Databricks or on other platforms.

        Solving Problems Using Spark UI

        Spark UI is the first tool most data team members use when there’s a problem with a Spark job. It shows a snapshot of currently running jobs, the stages jobs are in, storage usage, and more. It does a good job, but is seen as having some faults. It can be hard to use, with a low signal-to-noise ratio and a long learning curve. It doesn’t tell you things like which jobs are taking up more or less of a cluster’s resources, nor deliver critical observations such as CPU, memory, and I/O usage.

        In the case of a slow Spark application, Spark UI will show you what the current status of that job is. You can also use Spark UI for past jobs, if the logs for those jobs still exist, and if they were configured to log events properly. Also, the Spark history server tends to crash. When this is all working, it can help you find out how long an application took to run in the past – you need to do this kind of investigative work just to determine what “slow” is.

        The following screenshot is for a Spark 1.4.1 job with a two-node cluster. It shows a Spark Streaming job that steadily uses more memory over time, which might cause the job to slow down. And the job eventually – over a matter of days – runs out of memory.

        Spark Streaming Job Example (Source: Stack Overflow)

        To solve this problem, you might do several things. Here’s a brief list of possible solutions, and the problems they might cause elsewhere:

        • Increase available memory for each worker. You can increase the value of the spark.executor.memory variable to increase the memory for each worker. This will not necessarily speed the job up, but will defer the eventual crash. However, you are either taking memory away from other jobs in your cluster or, if you’re in the cloud, potentially running up the cost of the job.
        • Increase the storage fraction. You can change the value of spark.storage.memoryFraction, which varies from 0 to 1, to a higher fraction. Since the Java virtual machine (JVM) uses memory for caching RDDs and for shuffle memory, you are increasing caching memory at the expense of shuffle memory. This will cause a different failure if, at some point, the job needs shuffle memory that you allocated away at this step.
        • Increase the parallelism of the job. For a Spark Cassandra Connector job, for example, you can change spark.cassandra.input.split.size to a smaller value. (It’s a different variable for other RDD types.) Increasing parallelism decreases the data set size for each worker, requiring less memory per worker. But more workers means more resources used. In a fixed-resources environment, this takes resources away from other jobs; in a dynamic environment, such as a Databricks job cluster, it directly runs up your bill.

The point here is that everything you might do has a certain amount of guesswork to it, because you don’t have complete information. You have to use trial and error approaches to see what might work – both which approach to try/variable to change, and how much to change the variable(s) involved. And, whichever approach you choose, you are putting the job in line for other, different problems – including later failure, failure for other reasons, or increased cost. And, when you’re done, this specific job may be fine – but at the expense of other jobs that then fail. And those failures will also be hard to troubleshoot.
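
To make the three knobs in the list above concrete, here is a sketch of how they might be set for a job like the Spark 1.x streaming example in the screenshot. The values are illustrative only; spark.storage.memoryFraction applies to the legacy Spark 1.x memory manager, and spark.cassandra.input.split.size is specific to the Spark Cassandra Connector, so the exact names depend on your versions.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("streaming-job")
    # 1. More memory per worker defers, but does not fix, the eventual OOM.
    .set("spark.executor.memory", "6g")                 # illustrative; was 4g
    # 2. A higher storage fraction grows the RDD cache at the expense of shuffle memory.
    .set("spark.storage.memoryFraction", "0.7")         # legacy Spark 1.x setting; default 0.6
    # 3. Smaller input splits mean more, smaller tasks -- less memory per worker,
    #    but more total resources consumed.
    .set("spark.cassandra.input.split.size", "50000")   # Spark Cassandra Connector setting
)
sc = SparkContext(conf=conf)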

        Spark UI Completed Tasks

        Here’s a look at the Stages section of Spark UI. It gives you a list of metrics across executors. However, there’s no overview or big picture view to help guide you in finding problems. And the tool doesn’t make recommendations to help you solve problems, or avoid them in the first place.

Spark UI is limited to Spark, but a Spark job may, for example, have data coming in from Kafka, and run alongside other technologies. Each of those has its own monitoring and management tools, or does without; Spark UI doesn’t work with those tools. It also lacks proactive alerting, automatic actions, and AI-driven insights, all found in Unravel.

        Spark UI is very useful for what it does, but its limitations – and the limitations of the other tool types described here – lead many organizations to build homegrown tools or toolsets, often built on Grafana. These solutions are resource-intensive, hard to extend, hard to support, and hard to keep up-to-date.

        A few individuals and organizations even offer their homegrown tools as open source software for other organizations to use. However, support, documentation, and updates are limited to non-existent. Several such tools, such as Sparklint and DrElephant, do not support recent versions of Spark. At this writing, they have not had many, if any, fresh commits in recent months or even years.

        Spark Logs

        Spark logs are the underlying resource for troubleshooting Spark jobs. As mentioned above, Spark UI can even use Spark logs, if available, to rebuild a view of the Spark environment on an historical basis. You can use the logs related to the job’s driver and executors to retrospectively find out what happened to a given job, and even some information about what was going on with other jobs at the same time.

If you have a slow app, for instance, you can painstakingly assemble a picture that tells you whether the slowness was in one task versus another by scanning through multiple log files. But answering why, and finding the root cause, is hard. These logs don’t have complete information about resource usage, data partitioning, configuration settings, and many other factors that can affect performance. There are also many potential issues that don’t show up in Spark logs, such as “noisy neighbor” or networking issues that sporadically reduce resource availability within your Spark environment.
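
As an illustration of how manual that scanning is, here is a small sketch that pulls error and warning lines out of a set of driver and executor logs and interleaves them by timestamp. The log directory and filename pattern are hypothetical, and the regular expression assumes the default log4j timestamp layout.

import glob
import re

# Hypothetical layout: one driver log plus one log per executor for a single application.
log_files = glob.glob("/var/log/spark/app-20211020-0001/*.log")

pattern = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(ERROR|WARN)\s+(.*)")
events = []

for path in log_files:
    with open(path, errors="replace") as f:
        for line in f:
            match = pattern.match(line)
            if match:
                events.append((match.group(1), path, match.group(3)))

# Interleave messages from all components by timestamp -- the "comparison and
# integration engine" that makes sense of the merged output is still you.
for timestamp, path, message in sorted(events):
    print(timestamp, path, message)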

        Spark Driver Logs Example

        Spark logs are a tremendous resource, and are always a go-to for solving problems with Spark jobs. However, if you depend on logs as a major component of your troubleshooting toolkit, several problems come up, including:

        • Access and governance difficulties. In highly secure environments, it can take time to get permission to access logs, or you may need to ask someone with the proper permissions to access the file for you. In some highly regulated companies, such as financial institutions, it can take hours per log to get access.
        • Multiple files. You may need to look at the logs for a driver and several executors, for instance, to solve job-level problems. And your brain is the comparison and integration engine that pulls the information together, makes sense of it, and develops insights into possible causes and solutions.
        • Voluminous files. The file for one component of a job can be very large, and all the files for all the components of a job can be huge – especially for long-running jobs. Again, you are the one who has to find and retain each part of the information needed, develop a complete picture of the problem, and try different solutions until one works.
        • Missing files. Governance rules and data storage considerations take files away, as files are moved to archival media or simply lost to deletion. More than one large organization deletes files every 90 days, which makes quarterly summaries very difficult. Comparisons to, say, the previous year’s holiday season or tax season become impossible.
        • Only Spark-specific information. Spark logs are just that – logs from Spark. They don’t include much information about the infrastructure available, resource allocation, configuration specifics, etc. Yet this information is vital to solving a great many of the problems that hamper Spark jobs.

        Because Spark logs don’t cover infrastructure and related information, it’s up to the operations person to find as much information they can on those other important areas, then try to integrate it all and determine the source of the problem. (Which may be the result of a complex interaction among different factors, with multiple changes needed to fix it.)

        Platform-Level Solutions

        There are platform-level solutions that work on a given Spark platform, such as Cloudera Manager, the Amazon EMR UI, and Databricks UI. In general, these interfaces allow you to work at the cluster level. They tell you information about resource usage and the status of various services.

        If you have a slow app, for example, these tools will give you part of the picture you need to put together to determine the actual problem, or problems. But these tools do not have the detail-level information in the tools above, nor do they even have all the environmental information you need. So again, it’s up to you to decide how much time to spend researching, pulling all the information together, and trying to determine a solution. A quick fix might take a few hours; a comprehensive, long-term solution may take days of research and experimentation.

        This screenshot shows Databricks UI. It gives you a solid overview of jobs and shows you status, cluster type, and so on. Like other platform-level solutions, it doesn’t help you much with historical runs, nor in working at the pipeline level, across the multiple jobs that make up the pipeline.

        Databricks UI Clusters Example

Another monitoring tool for Spark, which is included as open source within Databricks, is called Ganglia. It’s largely complementary to Databricks UI, though it also works at the cluster and, in particular, at the node level. You can see host-level metrics such as CPU consumption, memory consumption, disk usage, network-level IO – all host-level factors that can affect the stability and performance of your job.

This can allow you to see if your nodes are configured appropriately, to institute manual scaling or auto-scaling, or to change instance types. (Though someone trying to fix a specific job is not inclined to take on issues that affect other jobs, other users, resource availability, and cost.) Ganglia does not have job-specific insights, nor does it work with pipelines. And there are no good output options; you might be reduced to taking a screen snapshot to get a JPEG or PNG image of the current status.

        Ganglia Cluster Overview

        Support from the open-source community is starting to shift toward more modern observability platforms like Prometheus, which works well with Kubernetes. And cloud providers offer their own solutions – AWS Cloudwatch, for example, and Azure Log Monitoring and Analytics. These tools are all oriented toward web applications; they lack modern data stack application and pipeline information which is essential to understand what’s happening to your job and how your job is affecting things on the cluster or workspace.

        AWS Cloudwatch Overview

        Platform-level solutions can be useful for solving the root causes of problems such as out-of-memory errors. However, they don’t go down to the job level, leaving that to resources such as Spark logs and tools such as Spark UI. Therefore, to solve a problem, you are often using platform-level solutions in combination with job-level tools – and again, it’s your brain that has to do the comparisons and data integration needed to get a complete picture and solve the problem.

        Like job-level tools, these solutions are not comprehensive, nor integrated. They offer snapshots, but not history, and they don’t make proactive recommendations. And, to solve a problem on Databricks, for example, you may be using Spark logs, Spark UI, Databricks UI, and Ganglia, along with Cloudwatch on AWS, or Azure Log Monitoring and Analytics. None of these tools integrate with the others.

        APM Tools

There is a wide range of monitoring tools, generally known as Application Performance Management (APM) tools. Many organizations have adopted one or more tools from this category, though they can be expensive, and provide very limited metrics on Spark and other modern data technologies. Leading tools in this category include Datadog, Dynatrace, and Cisco AppDynamics.

        For a slow app, for instance, an APM tool might tell you if the system as a whole is busy, slowing your app, or if there were networking issues, slowing down all the jobs. While helpful, they’re oriented toward monitoring and observability for Web applications and middleware, not data-intensive operations such as Spark jobs. They tend to lack information about pipelines, specific jobs, data usage, configuration setting, and much more, as they are not designed to deal with the complexity of modern data applications.

        Correlation is the Issue

        To sum up, there are several types of existing tools:

        • DIY with Spark logs. Spark keeps a variety of types of logs, and you can parse them, in a do it yourself (DIY) fashion, to help solve problems. But this lacks critical infrastructure, container, and other metrics.
        • Open source tools. Spark UI comes with Spark itself, and there are other Spark tools from the open source community. But these lack infrastructure, configuration and other information. They also do not help connect together a full pipeline view, as you need for Spark – and even more so if you are using technologies such as Kafka to bring data in.
        • Platform-specific tools. The platforms that Spark runs on – notably Cloudera platforms, Amazon EMR, and Databricks – each have platform-specific tools that help with Spark troubleshooting. But these lack application-level information and are best used for troubleshooting platform services.
        • Application performance monitoring (APM) tools. APM tools monitor the interactions of applications with their environment, and can help with troubleshooting and even with preventing problems. But the applications these APM tools are built for are technologies such as .NET, Java, and Ruby, not technologies that work with data-intensive applications such as Spark.
        • DataOps platforms. DataOps – applying Agile principles to both writing and running Spark, and other big data jobs – is catching on, and new platforms are emerging to embody these principles.

        Each tool in this plethora of tools takes in and processes different, but overlapping, data sets. No one tool provides full visibility, and even if you use one or more tools of each type, full visibility is still lacking.

        You need expertise in each tool to get the most out of that tool. But the most important work takes place in the expert user’s head: spotting a clue in one tool, which sends you looking at specific log entries and firing up other tools, to come up with a hypothesis as to the problem. You then have to try out the potential solution, often through several iterations of trial and error, before arriving at a “good enough” answer to the problem.

        Or, you might pursue two tried and trusted, but ineffective, “solutions”: ratcheting up resources and retrying the job until it works, either due to the increased resources or by luck; or simply giving up, which our customers tell us they often had to do before they started using Unravel Data.

The situation is much worse in the kind of hybrid data clouds that organizations use today. To troubleshoot on each platform, you need expertise in the toolset for that platform, and all the others. (Jobs often have cross-platform interdependencies, and the same team has to support multiple platforms.) In addition, when you find a solution for a problem on one platform, you should apply what you’ve learned on all platforms, taking into account their differences. And some issues are inherently multi-platform, such as moving jobs from one platform to another that is better, faster, or cheaper for a given job. Taking on all this with the current, fragmented, and incomplete toolsets available is a mind-numbing prospect.

        The biggest need is for a platform that integrates the capabilities from several existing tools, performing a five-step process:

        1. Ingest all of the data used by the tools above, plus additional, application-specific and pipeline data.
        2. Integrate all of this data into an internal model of the current environment, including pipelines.
        3. Provide live access to the model.
        4. Continually store model data in an internally maintained history.
        5. Correlate information across the ingested data sets, the current, “live” model, and the stored historical background, to derive insights and make recommendations to the user.

        This tool must also provide the user with the ability to put “triggers” onto current processes that can trigger either alerts or automatic actions. In essence, the tool’s inbuilt intelligence and the user are then working together to make the right things happen at the right time.

        A simple example of how such a platform can help is by keeping information per pipeline, not just per job – then spotting, and automatically letting you know, when the pipeline suddenly starts running slower than it had previously. The platform will also make recommendations as to how you can solve the problem. All this lets you take any needed action before the job is delayed.
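
As a hedged illustration of that last example – this is not Unravel’s API, just a sketch of the underlying idea – a pipeline-slowdown trigger boils down to comparing the latest run against stored history and firing an alert or action when the deviation crosses a threshold:

from statistics import mean

def check_pipeline_slowdown(history_minutes, latest_minutes, threshold=1.5, alert=print):
    """Fire an alert when the latest run is much slower than the historical baseline.

    history_minutes: durations of previous runs of the pipeline (not a single job).
    threshold: how many times the baseline the latest run may take before alerting.
    """
    if not history_minutes:
        return  # no baseline yet
    baseline = mean(history_minutes)
    if latest_minutes > threshold * baseline:
        alert(
            f"Pipeline slowdown: latest run took {latest_minutes:.1f} min "
            f"vs. a baseline of {baseline:.1f} min"
        )

# Example: a pipeline that normally finishes in about 12 minutes suddenly takes 21.
check_pipeline_slowdown([11.5, 12.0, 12.4, 11.8], latest_minutes=21.0)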

        The post Spark Troubleshooting Solutions – DataOps, Spark UI or logs, Platform or APM Tools appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/spark-troubleshooting-part-2-five-types-of-solutions/feed/ 0
Migrating Data Pipelines from Enterprise Schedulers to Airflow https://www.unraveldata.com/resources/migrating-data-pipelines-from-enterprise-schedulers-to-airflow/ https://www.unraveldata.com/resources/migrating-data-pipelines-from-enterprise-schedulers-to-airflow/#respond Thu, 02 Sep 2021 17:16:42 +0000 https://www.unraveldata.com/?p=7197

        What is a Data Pipeline?

        Data pipelines convert rich, varied, and high-volume data sources into insights that power the innovative data products that many of us run today. Shivnath represents a typical data pipeline using the diagram below.

        Data Pipeline Chart

In a data pipeline, data is continuously captured and then stored in a distributed storage system, such as a data lake or data warehouse. From there, a great deal of computation happens on the data to transform it into the key insights that you want to extract. These insights are then published and made available for consumption.

        Modernizing Data Stacks and Pipelines

Many enterprises have already built data pipelines on stacks such as Hadoop, using solutions such as Hive and HDFS. Many of these pipelines are orchestrated with enterprise schedulers, such as Autosys, Tidal, Informatica, and Pentaho, or with native schedulers. For example, Hadoop comes with a native scheduler called Oozie.

        In these environments, there are common challenges people face when it comes to their data pipelines. These problems include:

        • Large clusters supporting multiple apps and tenants: Clusters tend to be heavily multi-tenant and some apps may struggle for resources.
        • Less agility: In these traditional environments, there tends to be less agility in terms of adding more capabilities and releasing apps quickly.
        • Harder to scale: In these environments, data pipelines tend to be in large data centers where you may not be able to add resources easily.

        These challenges are causing many enterprises to modernize their stacks. In the process, they are picking innovative schedulers, such as Airflow, and they’re changing their stacks to incorporate systems like Databricks, Snowflake, or Amazon EMR. With modernization, companies are often striving for:

        • Smaller, decentralized, app-focused clusters: Instead of running large clusters, companies are trying to run smaller, more focused environments.
        • More agility and easier scalability: When clusters are smaller, they also tend to be more agile and easier to scale. This is because you can decouple storage from compute, then allocate resources when you need them.

        Data Stacks Pipeline Examples

        Shivnath shares even more goals of modernization, including removing resources as a constraint when it comes to how fast you can release apps and drive ROI, as well as reducing cost.

        So why does Airflow often get picked as part of modernization? The goals that motivated the creation of Airflow often tie in very nicely with the goals of modernization efforts. Airflow enables agile development and is better for cloud-native architectures compared to traditional schedulers, especially in terms of how fast you can customize or extend it. Keeping with the modern methodology of agility, Airflow is also available as a service from companies like Amazon and Astronomer.
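For readers new to Airflow, a pipeline is expressed as a DAG of tasks in Python. The sketch below uses the Airflow 2.x API; the DAG name, schedule, and task bodies are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("run the transformation logic")

with DAG(
    dag_id="example_daily_pipeline",   # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task   # extract runs before transform

Because the whole pipeline is ordinary Python, it can be versioned, reviewed, and extended like any other code, which is a large part of Airflow’s appeal in these modernization efforts.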

Diving deeper into the process of modernization, there are two main phases at a high level: Phase 1: Assess and Plan, and Phase 2: Migrate, Validate, and Optimize. The rest of the presentation dives deep into the key lessons that Shivnath and Hari have learned from helping a large number of enterprises migrate from their traditional enterprise schedulers and stacks to Airflow and modern data stacks.

        Lessons Learned

        Phase 1: Assess and Plan

        The assessment and planning phase of modernization is made up of a series of other phases, including:

        • Pipeline discovery: First you have to discover all the pipelines that need to be migrated.
        • Resource usage analysis: You have to understand the resource usage of the pipelines.
        • Dependency analysis: More importantly, you have to understand all the dependencies that the pipelines may have.
        • Complexity analysis: You need to understand the complexity of modernizing the pipelines. For example, some pipelines that run on-prem can actually have thousands of stages and run for many hours.
        • Mapping to the target environment.
        • Cost estimation for target environment.
        • Migration effort estimation.

        Shivnath said that he has learned two main lessons from the assessment and planning phase:

        Lesson 1: Don’t underestimate the complexity of pipeline discovery

        Multiple schedulers may be used, such as Autosys, Informatica, Oozie, Pentaho, Tidal, etc. And worse, there may not be any common pattern in how these pipelines work, access data, schedule and name apps, or allocate resources.

Lesson 2: You need very fine-grained tracking from a telemetry data perspective

        Due to the complexity of data pipeline discovery, tracking is needed in order to do a good job at resource usage estimation, dependency analysis, and to map the complexity and cost of running pipelines in a newer environment.

        After describing the two lessons, Shivnath goes through an example to further illustrate what he has learned.

        Shivnath then passes it on to Hari, who speaks about the lessons learned during the migration, validation, and optimization phase of modernization.

        Phase 2: Migrate, Validate, and Optimize

        While Shivnath shared various methodologies that have to do with finding artifacts and discovering the dependencies between them, there is also a need to instill a sense of confidence in the entire migration process. This confidence can be achieved by validating the operational side of the migration journey.

        Data pipelines, regardless of where they live, are prone to suffer from the same issues, such as:

        • Failures and inconsistent results
        • Missed SLAs and growing lag/backlog
        • Cost overruns (especially prevalent in the cloud)

        To maintain the overall quality of your data pipelines, Hari recommends constantly evaluating pipelines using three major factors: correctness, performance, and cost. Here’s a deeper look into each of these factors:

• Correctness: This refers to data quality. Artifacts such as tables, views, or CSV files are generated at almost every stage of the pipeline. We need to lay down the right data checks at these stages, so that we can make sure that things are consistent across the board. For example, a check could be that each partition of a table should have at least n records; another could be that a specific column of a table should never have null values (see the sketch after this list).
        • Performance: Evaluating performance has to do with setting SLAs and maintaining baselines for your pipeline to ensure that performance needs are met after the migration. Most orchestrators have SLA monitoring baked in. For example, in Airflow the notion of an SLA is incorporated in the operator itself. Additionally, if your resource allocations have been properly estimated during the migration assessment and planning phase, often you’ll see that SLAs are similar and maintained. But in a case where something unexpected arises, tools like Unravel can help maintain baselines, and help troubleshoot and tune pipelines, by identifying bottlenecks and suggesting performance improvements.
        • Cost: When planning migration, one of the most important parts is estimating the cost that the pipelines will incur and, in many cases, budgeting for it. Unravel can actually help monitor the cost in the cloud. And by collecting telemetry data and interfacing with cloud vendors, Unravel can offer vendor-specific insights that can help minimize the running cost of these pipelines in the cloud.
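Below is a minimal sketch of the correctness checks described above, written in PySpark; the table, partition column, and thresholds are hypothetical, and a production pipeline would more likely express these as tests in a data quality framework than as hand-rolled assertions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-correctness-checks").getOrCreate()

orders = spark.table("analytics.orders")   # hypothetical table produced mid-pipeline

# Check 1: every partition (here, by load date) should have at least N records.
MIN_ROWS_PER_PARTITION = 1000
thin_partitions = (orders.groupBy("load_date").count()
                   .filter(F.col("count") < MIN_ROWS_PER_PARTITION))
assert thin_partitions.count() == 0, "Some partitions have suspiciously few rows"

# Check 2: a key column should never contain null values.
null_keys = orders.filter(F.col("customer_id").isNull()).count()
assert null_keys == 0, f"{null_keys} rows are missing customer_id"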

        Hari then demos several use cases where he can apply the lessons learned. To set the stage, a lot of Unravel’s enterprise customers are migrating from the more traditional on-prem pipelines, such as Oozie and Tidal, to Airflow. The examples in this demo are actually motivated by real scenarios that customers have faced in their migration journey.

Data Pipeline HealthCheck and Analysis https://www.unraveldata.com/resources/data-pipeline-healthcheck/ https://www.unraveldata.com/resources/data-pipeline-healthcheck/#respond Tue, 17 Aug 2021 20:07:31 +0000 https://www.unraveldata.com/?p=7181

        Why Data Pipelines Become Complex

        Companies across a multitude of industries, including entertainment, finance, transportation, and healthcare, are on their way to becoming data companies. These companies are creating advanced data products with insights generated using data pipelines. To build data pipelines, you need a modern stack that involves a variety of systems, such as Airflow for orchestration, Snowflake or Redshift as the data warehouse, Databricks and Presto for advanced analytics on a data lake, and Kafka and Spark Streaming to process data in real-time, just to name a few. Combining all these systems naturally causes your data pipelines to become complex, however. In this talk, Shivnath shares a recipe for dealing with this complexity and keeping your data pipelines healthy.

        What Is a Healthy Data Pipeline?

Pipeline health can be viewed along three dimensions: correctness, performance, and cost. To explain pipeline health, Shivnath describes what an unhealthy pipeline would look like along each of the three dimensions.

        In this presentation, Shivnath not only shares tips and tricks to monitor the health of your pipelines, and to give you confidence that your pipelines are healthy, but he also shares ways to troubleshoot and fix problems that may arise if and when pipelines fall ill.

         

        HealthChecks Diagram

        HealthCheck for Data Pipelines

        Correctness

        Users cannot be expected to write and define all health checks, however. For example, a user may not define a check because the check is implicit. But if this check is not evaluated, a false negative can arise, meaning that your pipeline might generate incorrect results that will be hard to resolve. Luckily, you can avoid this problem by having a system that runs automatic checks – for example, automatically detecting anomalies or changes. It is important to note, however, that automatic checks can instead induce false positives. Balancing false positives and false negatives remains a challenge today. A good practice is to design your checks in parallel with designing the pipeline.

        But what do you do when these checks fail? Of course you have to troubleshoot the problem and fix it, but it is also important to capture failed checks in the context of the pipeline execution. There are two reasons why. One, the root cause of the failure may lie upstream in the pipeline, so you must understand the lineage of the pipeline. And two, a lot of checks fail because of changes in your pipeline. Having a history of these pipeline runs is key to understanding what has changed.

        Performance

Shivnath has some good news for you; performance checks involve less context and application semantics compared to correctness checks. The best practice for performance checks is to define end-to-end performance checks in the form of pipeline SLAs. You can also define checks at different stages of the pipeline. For example, with Airflow you can easily specify the maximum time a task can take and define a check at that time. Automatic checks for performance are useful because the users don’t have to specify all the checks, just as with correctness checks. Again, keeping in mind the caveat about false positives and false negatives, it is critical, and relatively easy, to build appropriate baselines and detect deviations from these baselines through pipeline runs.
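As a minimal sketch of stage-level performance checks in Airflow, the task below carries both a hard runtime cap and an SLA; the DAG name, script path, and time limits are assumptions for illustration.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_aggregation",          # hypothetical pipeline
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_aggregates = BashOperator(
        task_id="build_aggregates",
        bash_command="spark-submit /jobs/build_aggregates.py",   # hypothetical job script
        execution_timeout=timedelta(minutes=45),   # fail the task if it runs longer than this
        sla=timedelta(hours=2),                    # record an SLA miss if it finishes later than this
    )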

        The timing of checks is also important. For example, it probably wouldn’t be helpful if a check fails after the pipeline SLA was missed. A best practice is to keep pipeline stages short and frequent so checks can also be seen and evaluated often.

        When it comes to troubleshooting and tuning, Shivnath notes that it is important to have what he calls a “single pane of glass”. Pipelines can be complex with many moving parts, so having an end-to-end view is very important to troubleshoot problems. For example, due to multi-tenancy, an app that is otherwise unrelated to your pipeline may affect your pipeline’s performance. This is another example of where having automatic insights about what caused a problem is vital.

        Cost

        Just like performance checks, HealthChecks for cost also require less context and application semantics compared to correctness checks. But when it comes to cost checks, early detection is especially important. If a check is not executed, or a failed check is not fixed, and as a result there is a cost overrun, there can be severe consequences. So it’s very important to troubleshoot and fix problems as soon as checks fail.

        Costs can be incurred in many different ways, including cost of storage or cost of compute. Therefore, a single pane of glass is again useful, as well as a detailed cost breakdown. Lastly, automated insights to remediate problems are also critical.

        HealthCheck Demos

After Shivnath describes the different kinds of HealthChecks and the impact they have on helping you monitor and track your pipelines, he demos a couple of different scenarios where HealthChecks fail – first at the performance level, then at the cost level, and lastly at the correctness level.

        That was a short demo of how HealthChecks can make it very easy to manage complex pipelines. And this blog just gives you a taste of how HealthChecks can help you find and fix problems, as well as streamline your pipelines. You can view Shivnath’s full session from Airflow Summit 2021 here. If you’re interested in assessing Unravel for your own data-driven applications, you can create a free account or contact us to learn how we can help.

        The Biggest Spark Troubleshooting Challenges in 2024 https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/ https://www.unraveldata.com/resources/spark-troubleshooting-part-1-ten-challenges/#respond Fri, 06 Aug 2021 22:04:36 +0000 https://www.unraveldata.com/?p=7131

        Spark has become one of the most important tools for processing data – especially non-relational data – and deriving value from it. And Spark serves as a platform for the creation and delivery of analytics, AI, and machine learning applications, among others. But troubleshooting Spark applications is hard – and we’re here to help.

        In this blog post, we’ll describe ten challenges that arise frequently in troubleshooting Spark applications. We’ll start with issues at the job level, encountered by most people on the data team – operations people/administrators, data engineers, and data scientists, as well as analysts. Then, we’ll look at problems that apply across a cluster. These problems are usually handled by operations people/administrators and data engineers.

        For more on Spark and its use, please see this piece in Infoworld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below.

        Five Reasons Why Troubleshooting Spark Applications is Hard

        Some of the things that make Spark great also make it hard to troubleshoot. Here are some key Spark features, and some of the issues that arise in relation to them:

        Memory-resident

        Spark gets much of its speed and power by using memory, rather than disk, for interim storage of source data and results. However, this can cost a lot of resources and money, which is especially visible in the cloud. It can also make it easy for jobs to crash due to lack of sufficient available memory. And it makes problems hard to diagnose – only traces written to disk survive after crashes.

        Parallel processing

        Spark takes your job and applies it, in parallel, to all the data partitions assigned to your job. (You specify the data partitions, another tough and important decision.) But when a processing workstream runs into trouble, it can be hard to find and understand the problem among the multiple workstreams running at once.

        Variants

        Spark is open source, so it can be tweaked and revised in innumerable ways. There are major differences between the Spark 1 series, Spark 2.x, and the newer Spark 3. And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks, which is available across the major public clouds. Each variant offers some of its own challenges and a somewhat different set of tools for solving them.

        Configuration options

        Spark has hundreds of configuration options. And Spark interacts with the hardware and software environment it’s running in, each component of which has its own configuration options. Getting one or two critical settings right is hard; when several related settings have to be correct, guesswork becomes the norm, and over-allocation of resources, especially memory and CPUs (see below) becomes the safe strategy.

        Trial and error approach

With so many configuration options, how to optimize? Well, if a job currently takes six hours, you can change one, or a few, options, and run it again. That takes six hours, plus or minus. Repeat this three or four times, and it’s the end of the week. You may have improved the configuration, but you probably won’t have exhausted the possibilities as to what the best settings are.

        Sparkitecture diagram – the Spark application is the Driver Process, and the job is split up across executors. (Source: Apache Spark for the Impatient on DZone.)

        Three Issues with Spark Jobs, On-Premises and in the Cloud

        Spark jobs can require troubleshooting against three main kinds of issues:

        1. Failure. Spark jobs can simply fail. Sometimes a job will fail on one try, then work again after a restart. Just finding out that the job failed can be hard; finding out why can be harder. (Since the job is memory-resident, failure makes the evidence disappear.)
        2. Poor performance. A Spark job can run slower than you would like it to; slower than an external service level agreement (SLA); or slower than it would do if it were optimized. It’s very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster.
        3. Excessive cost or resource use. The resource use or, especially in the cloud, the hard dollar cost of a job may raise concerns. As with performance, it’s hard to know how much the resource use and cost “should” be, until you put work into optimizing and see where you’ve gotten to.

        All of the issues and challenges described here apply to Spark across all platforms, whether it’s running on-premises, in Amazon EMR, or on Databricks (across AWS, Azure, or GCP). However, there are a few subtle differences:

        • Move to cloud. There is a big movement of big data workloads from on-premises (largely running Spark on Hadoop) to the cloud (largely running Spark on Amazon EMR or Databricks). Moving to cloud provides greater flexibility and faster time to market, as well as access to built-in services found on each platform.
        • Move to on-premises. There is a small movement of workloads from the cloud back to on-premises environments. When a cloud workload “settles down,” such that flexibility is less important, then it may become significantly cheaper to run it on-premises instead.
        • On-premises concerns. Resources (and costs) on-premises tend to be relatively fixed; there can be a leadtime of months to years to significantly expand on-premises resources. So the main concern on-premises is maximizing the existing estate: making more jobs run in existing resources, and getting jobs complete reliably and on-time, to maximize the pay-off from the existing estate.
        • Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability but at a greater cost.
        • On-premises Spark vs Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks.
        • On-premises Spark vs Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud.

        All of these concerns are accompanied by a distinct lack of needed information. Companies often make crucial decisions – on-premises vs. cloud, EMR vs. Databricks, “lift and shift” vs. refactoring – with only guesses available as to what different options will cost in time, resources, and money.

        The Biggest Spark Troubleshooting Challenges in 2024

        Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast.

        Some challenges occur at the job level; these challenges are shared right across the data team. They include:

        1. How many executors should each job use?
        2. How much memory should I allocate for each job?
        3. How do I find and eliminate data skew?
        4. How do I make my pipelines work better?
        5. How do I know if a specific job is optimized?

        Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

        1. How do I size my nodes, and match them to the right servers/instance types?
        2. How do I see what’s going on across the Spark stack and apps?
        3. Is my data partitioned correctly for my SQL queries?
        4. When do I take advantage of auto-scaling?
        5. How do I get insights into jobs that have problems?

        Section 1: Five Job-Level Challenges

        These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level.

        However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing.

        For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs.

        1. How many executors and cores should a job use?

One of the key advantages of Spark is parallelization – you run your job’s code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

        A Spark job uses three cores to parallelize output. Up to three tasks run simultaneously, and seven tasks are completed in a fixed period of time. (Source: Lisa Hua, Spark Overview, Slideshare.)

You want high usage of cores, high usage of memory per core, and data partitioning appropriate to the job. (Usually, partitioning on the field or fields you’re querying on.) This beginner’s guide for Hadoop suggests two to three cores per executor, but not more than five; this expert’s guide to Spark tuning on AWS suggests that you use three executors per node, with five cores per executor, as your starting point for all jobs. (!)
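As a hedged illustration of such a starting point, the settings below express the three-executors-per-node, five-cores-per-executor rule of thumb for a small three-node cluster; the exact values should come from your own nodes and workloads, not from this sketch.

from pyspark.sql import SparkSession

# Starting-point settings only; tune against your own cluster and job.
spark = (
    SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.instances", "9")   # 3 executors per node x 3 nodes (illustrative)
    .config("spark.executor.cores", "5")       # the oft-quoted five-cores-per-executor guideline
    .config("spark.executor.memory", "32g")    # see the memory arithmetic in the next section
    .getOrCreate()
)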

        You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and the experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone or optimize it. We recommend that you optimize it, because optimization:

        • Helps you save resources and money (not over-allocating)
        • Helps prevent crashes, because you right-size the resources (not under-allocating)
        • Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better

        2. How much memory should I allocate for each job?

        Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs.

If you have three executors on a 128GB node, and 16GB is taken up by the cluster manager and system resources, that leaves roughly 37GB per executor. However, a few GB will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory.
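Written out as a quick back-of-the-envelope calculation (all numbers are illustrative, and Spark’s default overhead heuristic is only approximated):

node_memory_gb = 128       # total memory on the node or instance
system_reserved_gb = 16    # cluster manager and OS reservation from the example above
executors_per_node = 3

per_executor_share = (node_memory_gb - system_reserved_gb) / executors_per_node   # ~37 GB

# Spark charges memory overhead on top of spark.executor.memory
# (by default roughly max(384 MB, 10% of executor memory)), so leave headroom for it.
executor_memory_gb = per_executor_share / 1.10
print(f"share per executor: {per_executor_share:.1f} GB; "
      f"set spark.executor.memory to roughly {executor_memory_gb:.0f}g")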

        This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see the previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises.

        For more on memory management, see this widely read article, Spark Memory Management, by our own Rishitesh Mishra.


        3. How do I handle data skew and small files?

        Data skew and small files are complementary problems. Data skew tends to describe large files – where one key-value, or a few, have a large share of the total data associated with them. This can force Spark, as it’s processing the data, to move data around in the cluster, which can slow down your task, cause low utilization of CPU capacity, and cause out-of-memory errors which abort your job. Several techniques for handling very large files which appear as a result of data skew are given in the popular article, Data Skew and Garbage Collection, by Rishitesh Mishra of Unravel.

        Small files are partly the other end of data skew – a share of partitions will tend to be small. And Spark, since it is a parallel processing system, may generate many small files from parallel processes. Also, some processes you use, such as file compression, may cause a large number of small files to appear, causing inefficiencies. You may need to reduce parallelism (undercutting one of the advantages of Spark), repartition (an expensive operation you should minimize), or start adjusting your parameters, your data, or both (see details here).
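Two common first moves, sketched in PySpark: turning on Spark 3’s adaptive query execution so skewed joins are split automatically, and compacting a directory of small files by rewriting it with fewer partitions. The paths and partition count are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-and-small-files").getOrCreate()

# Spark 3.x: let adaptive query execution coalesce shuffle partitions and split skewed joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Compact a directory full of tiny files by rewriting it with fewer, larger partitions.
events = spark.read.parquet("s3://my-bucket/raw/events/")          # placeholder path
events.coalesce(64).write.mode("overwrite").parquet("s3://my-bucket/compacted/events/")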

        Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets.

        4. How do I optimize at the pipeline level?

Spark pipelines are made up of DataFrames, connected by Transformers (which calculate new data from existing data) and Estimators. Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it’s easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs.

        Existing Transformers create new Dataframes, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.)

        Many pipeline components are “tried and trusted” individually, and are thereby less likely to cause problems than new components you create yourself. However, interactions between pipeline steps can cause novel problems.

        Just as job issues roll up to the cluster level, they also roll up to the pipeline level. Pipelines are increasingly the unit of work for DataOps, but it takes truly deep knowledge of your jobs and your cluster(s) for you to work effectively at the pipeline level. This article, which tackles the issues involved in some depth, describes pipeline debugging as an “art.”

        5. How do I know if a specific job is optimized?

        Neither Spark nor, for that matter, SQL is designed for ease of optimization. Spark comes with a monitoring and management interface, Spark UI, which can help. But Spark UI can be challenging to use, especially for the types of comparisons – over time, across jobs, and across a large, busy cluster – that you need to really optimize a job. And there is no “SQL UI” that specifically tells you how to optimize your SQL queries.

        There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.

        Spark’s Catalyst optimizer, described here, does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side.
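One low-effort way to see what Catalyst decided, and where a query is likely to spend its time, is to print the physical plan before running the job at scale; the tables below are hypothetical, and formatted mode requires Spark 3.x.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

transactions = spark.table("sales.transactions")   # hypothetical tables
customers = spark.table("sales.customers")
joined = transactions.join(customers, on="customer_id")

# The formatted plan shows per-operator details, where broadcast vs. shuffle joins
# and full table scans become visible before you pay for them.
joined.explain(mode="formatted")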

        Section 2: Cluster-Level Challenges

        Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands) of jobs, in cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources.

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that’s running unoptimized, poorly understood, slowdown-prone, and crash-prone jobs is impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.)

        Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for a blog post, but here are some of the issues that come up, and a few comments on each:

        6. Are Nodes Matched Up to Servers or Cloud Instances?

        A Spark node – a physical server or a cloud instance – will have an allocation of CPUs and physical memory. (The whole point of Spark is to run things in actual memory, so this is crucial.) You have to fit your executors and memory allocations into nodes that are carefully matched to existing resources, on-premises, or in the cloud. (You can allocate more or fewer Spark cores than there are available CPUs, but matching them makes things more predictable, uses resources better, and may make troubleshooting easier.)

        On-premises, poor matching between nodes, physical servers, executors, and memory results in inefficiencies, but these may not be very visible; as long as the total physical resource is sufficient for the jobs running, there’s no obvious problem. However, issues like this can cause data centers to be very poorly utilized, meaning there’s big overspending going on – it’s just not noticed. (Ironically, the impending prospect of cloud migration may cause an organization to freeze on-prem spending, shining a spotlight on costs and efficiency.)

        In the cloud, “pay as you go” pricing shines a different type of spotlight on efficient use of resources – inefficiency shows up in each month’s bill. You need to match nodes, cloud instances, and job CPU and memory allocations very closely indeed, or incur what might amount to massive overspending. This article gives you some guidelines for running Apache Spark cost-effectively on AWS EC2 instances and is worth a read even if you’re running on-premises, or on a different cloud provider.

        You still have big problems here. In the cloud, with costs both visible and variable, cost allocation is a big issue. It’s hard to know who’s spending what, let alone what the business results that go with each unit of spending are. But tuning workloads against server resources and/or instances is the first step in gaining control of your spending, across all your data estates.

        7. How Do I See What’s Going on in My Cluster?

        “Spark is notoriously difficult to tune and maintain,” according to an article in The New Stack. Clusters need to be “expertly managed” to perform well, or all the good characteristics of Spark can come crashing down in a heap of frustration and high costs. (In people’s time and in business losses, as well as direct, hard dollar costs.)

        Key Spark advantages include accessibility to a wide range of users and the ability to run in memory. But the most popular tool for Spark monitoring and management, Spark UI, doesn’t really help much at the cluster level. You can’t, for instance, easily tell which jobs consume the most resources over time. So it’s hard to know where to focus your optimization efforts. And Spark UI doesn’t support more advanced functionality – such as comparing the current job run to previous runs, issuing warnings, or making recommendations, for example.

        Logs on cloud clusters are lost when a cluster is terminated, so problems that occur in short-running clusters can be that much harder to debug. More generally, managing log files is itself a big data management and data accessibility issue, making debugging and governance harder. This occurs in both on-premises and cloud environments. And, when workloads are moved to the cloud, you no longer have a fixed-cost data estate, nor the “tribal knowledge” accrued from years of running a gradually changing set of workloads on-premises. Instead, you have new technologies and pay-as-you-go billing. So cluster-level management, hard as it is, becomes critical.


        8. Is my data partitioned correctly for my SQL queries? (and other inefficiencies)

        Operators can get quite upset, and rightly so, over “bad” or “rogue” queries that can cost way more, in resources or cost, than they need to. One colleague describes a team he worked on that went through more than $100,000 of cloud costs in a weekend of crash-testing a new application – a discovery made after the fact. (But before the job was put into production, where it would have really run up some bills.)

        SQL is not designed to tell you how much a query is likely to cost, and more elegant-looking SQL queries (ie, fewer statements) may well be more expensive. The same is true of all kinds of code you have running.

        So you have to do some or all of three things:

        • Learn something about SQL, and about coding languages you use, especially how they work at runtime
• Understand how to optimize your code and partition your data for good price/performance (see the sketch after this list)
        • Experiment with your app to understand where the resource use/cost “hot spots” are, and reduce them where possible
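On the partitioning point, writing data partitioned on the column most queries filter by is often the cheapest win, because Spark can then skip whole directories at read time; here is a minimal PySpark sketch with placeholder paths and columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

events = spark.read.parquet("s3://my-bucket/staging/events/")     # placeholder input

# Partition on the column most queries filter by.
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("s3://my-bucket/warehouse/events/"))

# A query that filters on event_date now reads only the matching partitions.
spark.read.parquet("s3://my-bucket/warehouse/events/") \
    .filter("event_date = '2021-08-01'") \
    .count()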

        All this fits in the “optimize” recommendations from 1. and 2. above. We’ll talk more about how to carry out optimization in Part 2 of this blog post series.

        9. When do I take advantage of auto-scaling?

        The ability to auto-scale – to assign resources to a job just while it’s running, or to increase resources smoothly to meet processing peaks – is one of the most enticing features of the cloud. It’s also one of the most dangerous; there is no practical limit to how much you can spend. You need some form of guardrails, and some form of alerting, to remove the risk of truly gigantic bills.

The need for auto-scaling might, for instance, determine whether you move a given workload to the cloud, or leave it running, unchanged, in your on-premises data center. But to help an application benefit from auto-scaling, you have to profile it, then cause resources to be allocated and de-allocated to match the peaks and valleys. And you have some calculations to make, because cloud providers charge you more for on-demand resources – those you grab and let go of as needed – than for reserved or committed resources that you keep running for a long time. On-demand capacity may cost two or three times as much as equivalent committed capacity.

        The first step, as you might have guessed, is to optimize your application, as in the previous sections. Auto-scaling is a price/performance optimization, and a potentially resource-intensive one. You should do other optimizations first.

        Then profile your optimized application. You need to calculate ongoing and peak memory and processor usage, figure out how long you need each, and the resource needs and cost for each state. And then decide whether it’s worth auto-scaling the job, whenever it runs, and how to do that. You may also need to find quiet times on a cluster to run some jobs, so the job’s peaks don’t overwhelm the cluster’s resources.

        To help, Databricks has two types of clusters, and the second type works well with auto-scaling. Most jobs start out in an interactive cluster, which is like an on-premises cluster; multiple people use a set of shared resources. It is, by definition, very difficult to avoid seriously underusing the capacity of an interactive cluster.

        So you are meant to move each of your repeated, resource-intensive, and well-understood jobs off to its own, dedicated, job-specific cluster. A job-specific cluster spins up, runs its job, and spins down. This is a form of auto-scaling already, and you can also scale the cluster’s resources to match job peaks, if appropriate. But note that you want your application profiled and optimized before moving it to a job-specific cluster.
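As a sketch of what that looks like in practice, a Databricks job-specific cluster can be defined with an autoscale range rather than a fixed worker count; the runtime version, instance type, and worker counts below are placeholders rather than recommendations, and the exact payload depends on your Databricks Jobs API version.

# Placeholder job-cluster spec with autoscaling, as it might be passed to the Databricks Jobs API.
new_cluster = {
    "spark_version": "9.1.x-scala2.12",                 # placeholder runtime
    "node_type_id": "i3.xlarge",                        # placeholder instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale between 2 and 8 workers with the job's load
}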

        10. How Do I Find and Fix Problems?

        Just as it’s hard to fix an individual Spark job, there’s no easy way to know where to look for problems across a Spark cluster. And once you do find a problem, there’s very little guidance on how to fix it. Is the problem with the job itself, or the environment it’s running in? For instance, over-allocating memory or CPUs for some Spark jobs can starve others. In the cloud, the noisy neighbors problem can slow down a Spark job run to the extent that it causes business problems on one outing – but leaves the same job to finish in good time on the next run.

        The better you handle the other challenges listed in this blog post, the fewer problems you’ll have, but it’s still very hard to know how to most productively spend Spark operations time. For instance, a slow Spark job on one run may be worth fixing in its own right and may be warning you of crashes on future runs. But it’s very hard just to see what the trend is for a Spark job in performance, let alone to get some idea of what the job is accomplishing vs. its resource use and average time to complete. So Spark troubleshooting ends up being reactive, with all too many furry, blind little heads popping up for operators to play Whack-a-Mole with.

        Impacts of these Challenges

        If you meet the above challenges effectively, you’ll use your resources efficiently and cost-effectively. However, our observation here at Unravel Data is that most Spark clusters are not run efficiently.

        What we tend to see most are the following problems – at a job level, within a cluster, or across all clusters:

        • Under-allocation. It can be tricky to allocate your resources efficiently on your cluster, partition your datasets effectively, and determine the right level of resources for each job. If you under-allocate (either for a job’s driver or the executors), a job is likely to run too slowly or crash. As a result, many developers and operators resort to…
        • Over-allocation. If you assign too many resources to your job, you’re wasting resources (on-premises) or money (cloud). We hear about jobs that need, for example, 2GB of memory but are allocated much more – in one case, 85GB.

        Applications can run slowly, because they’re under-allocated – or because some apps are over-allocated, causing others to run slowly. Data teams then spend much of their time fire-fighting issues that may come and go, depending on the particular combination of jobs running that day. With every level of resource in shortage, new, business-critical apps are held up, so the cash needed to invest against these problems doesn’t show up. IT becomes an organizational headache, rather than a source of business capability.

        Conclusion

        To jump ahead to the end of this series a bit, our customers here at Unravel are easily able to spot and fix over-allocation and inefficiencies. They can then monitor their jobs in production, finding and fixing issues as they arise. Developers even get on board, checking their jobs before moving them to production, then teaming up with Operations to keep them tuned and humming.

One Unravel customer, Mastercard, has been able to reduce usage of their clusters by roughly half, even as data sizes and application density have moved steadily upward during the global pandemic. And everyone gets along better, and has more fun at work, while achieving these previously unimagined results.

        So, whether you choose to use Unravel or not, develop a culture of right-sizing and efficiency in your work with Spark. It will seem to be a hassle at first, but your team will become much stronger, and you’ll enjoy your work life more, as a result.

        You need a sort of X-ray of your Spark jobs, better cluster-level monitoring, environment information, and to correlate all of these sources into recommendations. In Troubleshooting Spark Applications, Part 2: Solutions, we will describe the most widely used tools for Spark troubleshooting – including the Spark Web UI and our own offering, Unravel Data – and how to assemble and correlate the information you need.

        Jeeves Grows Up: How an AI Chatbot Became Part of Unravel Data https://www.unraveldata.com/resources/jeeves-grows-up-how-an-ai-chatbot-became-part-of-unravel-data/ https://www.unraveldata.com/resources/jeeves-grows-up-how-an-ai-chatbot-became-part-of-unravel-data/#respond Mon, 31 May 2021 04:20:30 +0000 https://www.unraveldata.com/?p=6969

        Jeeves is the stereotypical English butler – and an AI chatbot that answers pertinent and important questions about Spark jobs in production. Shivnath Babu, CTO and co-founder of Unravel Data, spoke yesterday at Data + AI Summit, formerly known as Spark Summit, about the evolution of Jeeves, and how the technology has become a key supporting pillar within Unravel Data’s software. 

        Unravel is a leading platform for DataOps, bringing together a raft of seemingly disparate information to make it much easier to view, monitor, and manage pipelines. With Unravel, individual jobs and their pipelines become visible. But also, the interactions between jobs and pipelines become visible too. 

It’s often these interactions, which are ephemeral and very hard to track through traditional monitoring solutions, that cause jobs to slow down or fail. Unravel makes them visible and actionable. On top of this, AI and machine learning help the software make proactive suggestions about improvements, and even head off trouble before it happens.

        Both performance improvements and cost management become far easier with Unravel, even for DataOps personnel who don’t know all of the underlying technologies used by a given pipeline in detail. 

        Jeeves to the Rescue

        An app failure in Spark may be difficult to even discover – let alone to trace, troubleshoot, repair, and retry. If the failure is due to interactions among multiple apps, a whole additional dimension of trouble arises. As data volumes and pipeline criticality rocket upward, no one in a busy IT department has time to dig around for the causes of problems and search for possible solutions. 

        But – Jeeves to the rescue! Jeeves acts as a chatbot for finding, understanding, fixing, and improving Spark jobs, and the configuration settings that define where, when, and how they run. The Jeeves demo in Shivnath’s talk shows how Jeeves comes up with the errant Spark job (by ID number), describes what happened, and recommends the needed adjustments – configuration changes, in this case – to fix the problem going forward. Jeeves can even resubmit the repaired job for you. 


        Wait, One More Thing…

        But there’s more – much more. The technology underlying Jeeves has now been built into Unravel Data, with stellar results. 

        Modern data pipelines are ever more populated. In his talk, Shivnath shows us several views on the modern data landscape. His simplified diagram shows five silos, and 14 different top-level processes, between today’s data sources and a plethora of data consumers, both human and machine. 

        But Shivnath shines a bright light into this Aladdin’s cave of – well, either treasures or disasters, depending on your point of view, and whether everything is working or not. He describes each of the major processes that take place within the architecture, and highlights the plethora of technologies and products that are used to complete each process. 

        He sums it all up by showing the role of data pipelines in carrying out mission-critical tasks, how the data stack for all of this continues to get more complex, and how DataOps as a practice has emerged to try and get a handle on all of it. 

        This is where we move from Jeeves, a sophisticated bot, to Unravel, which incorporates the Jeeves functionality – and much more. Shivnath describes Unravel’s Pipeline Observer, which interacts with a large and growing range of pipeline technologies to monitor, manage, and recommend (through an AI and machine learning-powered engine) how to fix, improve, and optimize pipeline and workload functionality and reliability. 

        In an Unravel demo, Shivnath shows how to improve a pipeline that’s in danger of:

        • Breaking due to data quality problems
        • Missing its performance SLA
        • Cost overruns – check your latest cloud bill for examples of this one 

        If you’re in DataOps, you’ve undoubtedly experienced the pain of pipeline breaks, and that uneasy feeling of SLA misses, all reflected in your messaging apps, email, and performance reviews – not to mention the dreaded cost overruns, which don’t show up until you look at your cloud provider bills.

        Shivnath concludes by offering a chance for you to create a free account; to contact the company for more information; or to reach out to Shivnath personally, especially if your career is headed in the direction of helping solve these and related problems. To get the full benefit of Shivnath’s perspective, dig into the context, and understand what’s happening in depth, please watch the presentation.

Unravel Data Featured in CRN’s 2021 Big Data 100 List https://www.unraveldata.com/resources/unravel-data-crn-2021-big-data-100/ https://www.unraveldata.com/resources/unravel-data-crn-2021-big-data-100/#respond Thu, 06 May 2021 13:30:50 +0000 https://www.unraveldata.com/?p=6854

        In a press release delivered today, Unravel Data announced its appearance on CRN’s Big Data 100 list for 2021. Unravel’s entry appears in the Data Management and Integration category. Also featured in this category are other rising stars such as Confluent, Fivetran, Immuta, and Okera, all of whom spoke at new industry conference DataOps Unleashed, held in March.

        The list recognizes vendors for their “innovation, insight, and industry expertise,” according to Blaine Raddon, CEO of the Channel Company, owners of CRN. Unravel Data CEO Kunal Agarwal cited the company’s mission, “to empower organizations to unleash innovation” by modernizing their approach to DataOps.

        Other categories in the list are Business Analytics, Data Science and Machine Learning, Database Systems, and Systems and Platforms. Taken together, the Big Data 100 list provides a snapshot of rapid change taking place in big data, streaming data, and modern data solutions.

AI/ML without DataOps is just a pipe dream! https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/ https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/#respond Fri, 23 Apr 2021 04:19:24 +0000 https://www.unraveldata.com/?p=6791

        The following blog post appeared in its original form on Towards Data Science. It’s part of a series on DataOps for effective AI/ML. The author is CDO and VP Engineering here at Unravel Data. (GIF by giphy)

        Let’s start with a real-world example from one of my past machine learning (ML) projects: We were building a customer churn model. “We urgently need an additional feature related to sentiment analysis of the customer support calls.” Creating the data pipeline to extract this dataset took about 4 months! Preparing, building, and scaling the Spark MLlib code took about 1.5-2 months! Later we realized that “an additional feature related to the time spent by the customer in accomplishing certain tasks in our app would further improve the model accuracy” — another 5 months gone in the data pipeline! Effectively, it took 2+ years to get the ML model deployed!

        After driving dozens of ML initiatives (as well as advising multiple startups on this topic), I have reached the following conclusion: Given the iterative nature of AI/ML projects, having an agile process of building fast and reliable data pipelines (referred to as DataOps) has been the key differentiator in the ML projects that succeeded. (Unless there was a very exhaustive feature store available, which is typically never the case).

        Behind every successful AI/ML product is a fast and reliable data pipeline developed using well-defined DataOps processes!

        To level-set, what is DataOps? From Wikipedia: “DataOps incorporates the agile methodology to shorten the cycle time of analytics development in alignment with business goals.”

I define DataOps as a combination of process and technology to iteratively deliver reliable data pipelines with agility. Depending on the maturity of your data platform, you might be in one of the following DataOps phases:

        • Ad-hoc: No clear processes for DataOps
        • Developing: Clear processes defined, but accomplished manually by the data team
        • Optimal: Clear processes with self-service automation for data scientists, analysts, and users.

        Similar to software development, DataOps can be visualized as an infinity loop

        The DataOps lifecycle – shown as an infinity loop above – represents the journey in transforming raw data to insights. Before discussing the key processes in each lifecycle stage, the following is a list of top-of-mind battle scars I have encountered in each of the stages:

        • Plan: “We cannot start a new project — we do not have the resources and need additional budget first”
• Create: “The query joins the tables in the data samples. I didn’t realize the actual data had a billion rows!”
• Orchestrate: “Pipeline completes but the output table is empty — the scheduler triggered the ETL before the input table was populated”
• Test & Fix: “Tested in dev using a toy dataset — processing failed in production with OOM (out of memory) errors”
• Continuous Integration: “Poorly written data pipeline got promoted to production — the team is now firefighting”
        • Deploy: “Did not anticipate the scale and resource contention with other pipelines”
        • Operate & Monitor: “Not sure why the pipeline is running slowly today”
        • Optimize & Feedback: “I tuned the query one time — didn’t realize the need to do it continuously to account for data skew, scale, etc.”

        To avoid these battle scars and more, it is critical to mature DataOps from ad hoc, to developing, to self-service.

        This blog series will help you go from ad hoc to well-defined DataOps processes, as well as share ideas on how to make them self-service, so that data scientists and users are not bottlenecked by data engineers.

        DataOps at scale with Unravel

        Create a free account

For each stage of the DataOps lifecycle, follow the links for the key processes to define and the experiences in making them self-service (some of the links below are being populated, so please bookmark this blog post and come back over time):

        Plan Stage

        • How to streamline finding datasets
        • Formulating the scope and success criteria of the AI/ML problem
        • How to select the right data processing technologies (batch, interactive, streaming) based on business needs

        Create Stage

        Orchestrate Stage

        Test & Fix Stage

        • Streamlining sandbox environment for testing
        • Identify and remove data pipeline bottlenecks
        • Verify data pipeline results for correctness, quality, performance, and efficiency

        Continuous Integration & Deploy Stage

        • Smoke test for data pipeline code integration
        • Scheduling window selection for data pipelines
        • Changes rollback

        Operate Stage

        Monitor Stage

        Optimize & Feedback Stage

        • Continuously optimize existing data pipelines
        • Alerting on budgets

        In summary, DataOps is the key to delivering fast and reliable AI/ML! It is a team sport. This blog series aims to demystify the required processes as well as build a common understanding across Data Scientists, Engineers, Operations, etc.

        DataOps as a team sport

        To learn more, check out the recent DataOps Unleashed Conference, as well as innovations in DataOps Observability at Unravel Data. Come back to get notified when the links above are populated.

        The post AI/ML without DataOps is just a pipe dream! appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/ai-ml-without-dataops-is-just-a-pipe-dream/feed/ 0
        DataOps vs DevOps and Their Benefits Towards Scaling Delivery https://www.unraveldata.com/resources/dataops-vs-devops/ https://www.unraveldata.com/resources/dataops-vs-devops/#respond Thu, 15 Apr 2021 15:07:24 +0000 https://www.unraveldata.com/?p=6733 Abstract Infinity Loop Background

        The exponential adoption of IT technologies over the past several decades has had a profound impact on organizations of all sizes. Whether it is a small, medium, or large enterprise, the need to create web applications […]

        The post DataOps vs DevOps and Their Benefits Towards Scaling Delivery appeared first on Unravel.

        ]]>

        The exponential adoption of IT technologies over the past several decades has had a profound impact on organizations of all sizes. Whether it is a small, medium, or large enterprise, the need to create web applications while managing an extensive set of data effectively is high on every CIO’s priority list.

        As a result, there has been an ongoing effort to implement better approaches to software development, data analysis, and data management.

        The efforts are so pervasive across industries that these approaches have been given names of their own. The approach to better manage software development and delivery is known as DevOps. The end-to-end approach to efficiently and effectively deliver data products – from responses to SQL queries, to data pipelines, to machine learning models and AI-powered insights – is known as DataOps.

        DataOps and DevOps are similar in that they both aim to solve the need to scale delivery. The key difference is that DataOps focuses on the flow of data, and the use of data in analytics, rather than on the software development and delivery lifecycle.

        There’s also a difference in impact. Strong DataOps practices are vital to the successful development and delivery of AI-powered applications, including machine learning models. AI and ML are powerful areas of innovation, perhaps the most important in decades. DataOps as a discipline is necessary for the successful development and deployment of AI.

To understand DataOps vs DevOps, it helps to start with an overview of both, discuss their respective goals, and then highlight the key differences between the two.

        DataOps Overview

DataOps is sometimes seen as simply making data flows through an organization, and data transformations, work correctly. Another common misconception is that DataOps is just DevOps applied to data analytics. In reality, DataOps is a holistic approach to solving business problems.

        According to Gartner,

        DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.

        As a concept, DataOps was first introduced by InformationWeek’s Lenny Liebmann in 2014. The term appeared in a blog post on the IBM Big Data & Analytics Hub, titled “3 reasons why DataOps is essential for big data success”.

        A few years later, in 2017, DataOps began to get its own ecosystem, dedicated analyst coverage, increased keyword searches, inclusion in surveys, mention in books, and use in open source projects.

        Google Trends Chart DataOps

        DataOps is Not Just Analytics

DataOps is sometimes seen as a set of best practices around analytics. Analytics can be considered to include most of AI and ML, so improving analytics functionality is not a trivial goal. But DataOps is much more than just DevOps applied to data analytics.

        Data analytics happens at the end of a data pipeline, while DataOps encompasses nearly a dozen steps, including data ingestion and the entire data pipeline before analytics happens. DataOps also includes the delivery of analytics results and their ultimate business impact. And it serves as a framework for the development and delivery of useful capabilities from AI and ML.

        As complex data environments are constantly changing, it is critical for an organization to possess and maintain a solid understanding of their data assets, and to add to their data assets when needed for business success. Understanding the origin of the data, analyzing data dependencies, and keeping documentation up to date are each resource-intensive, yet critical.

        Having a high-performing DataOps team in place can help an organization accelerate the DataOps lifecycle – developing powerful, data-centric apps to deliver accurate insights to both internal and external customers.

        The Complete Guide to DataOps

        Download guide

        DevOps Overview

        Now that we’ve briefly described DataOps, let’s discuss DevOps. According to Atlassian, the DevOps movement started to come together between 2007 and 2008. At the time, software development and IT operations communities started raising concerns about an increase in what they felt was a near-fatal level of dysfunction in the industry.

The primary dysfunction these two groups saw was that in a traditional software development model, those who wrote the code were functionally and organizationally separate from those who deployed and supported the code.

        As such, software developers and IT Operations teams had competing objectives, different leadership, and different KPIs by which each group was measured and evaluated. As a result, teams were siloed and only concerned with what had a direct impact on them.

        The result: poorly executed software releases, unhappy customers, and often unhappy development and IT Operations teams.

        Over time, DevOps evolved to solve the pain caused by these siloed teams and poor lines of communication between software developers and IT Operations.

        What is DevOps

DevOps describes an approach to software development that accelerates the build lifecycle using automation. By focusing on continuous integration and delivery of software, and by automating the integration, testing, and deployment of code, enterprises start to see many benefits. Specifically, this approach of merging DEVelopment and OPerationS reduces time to deployment, speeds time to market, keeps defects or bugs to a minimum, and generally shortens the time required to resolve any defects.

        6 Primary Goals of DevOps

There are six key principles or goals that DevOps aims to deliver.

        DevOps Goal 1: Speed

        In order to quickly adapt to customer needs, changes in the market, or new business goals, the application development process and release capabilities need to be extremely fast. Practices such as continuous delivery and continuous integration help make it possible to maintain that speed throughout the development and operations phases.

        DevOps Goal 2: Reliability

        Continuous delivery and integration not only improve the time to market with new code, they also improve the overall stability and reliability of software. Integrating automated testing and exception handling helps software development teams identify problems immediately, minimizing the chances of errors being introduced and exposed to end users.

        DevOps Goal 3: Rapid Delivery

        DevOps aims to increase the pace and frequency of new software application releases, enabling development teams to improve an application as often as they’d like. Performing frequent, fast-paced releases ensures that the turnaround time for any given bug fix or new feature release is as short as possible.

        DevOps Goal 4: Scalability

DevOps focuses on creating applications and infrastructure platforms that quickly and easily scale to address the constantly changing demands of both the business and end users. A practice that is gaining in popularity and helps scale applications is infrastructure as code: managing and provisioning infrastructure through machine-readable definitions rather than manual processes, so resources and capacity can be added to an application immediately.

        DevOps Goal 5: Security

DevOps encourages strong security practices by automating compliance policies. This simplifies the configuration process and introduces detailed security controls. This programmatic approach ensures that any resources that fall out of compliance are noticed immediately, so the development team can evaluate them and bring them back into compliance quickly.

        DevOps Goal 6: Collaboration

        Just like other agile-based software development methodologies, DevOps strongly encourages collaboration throughout the entire software development life cycle. This leads to software development teams that are up to 25% more productive and 50% faster to market than non-agile teams.

        Differences between DataOps and DevOps

        As outlined above, DevOps is the transformation in the ability and capability of software development teams to continuously deliver and integrate their code.

        DataOps focuses on the transformation of intelligence systems and analytic models by data analysts and data engineers.

        DevOps brings software development teams and IT operations together with the primary goal to reduce the cost and time spent on the development and release cycle.

        DataOps goes one step further, integrating data so that data teams can acquire the data, transform it, model it, and obtain insights that are highly actionable.

DataOps is not limited to making existing data pipelines work effectively, getting reports and the inputs and outputs of Artificial Intelligence and Machine Learning to appear as needed, and so on. DataOps actually includes all parts of the data management lifecycle.

        DataOps, DevOps, and Competitive Advantage

        DevOps, as a term and as a practice, grew rapidly in interest and activity throughout the last decade, but has plateaued recently. A decade ago, or even five years ago, aggressive adoption of DevOps as a practice could give an organization a significant competitive advantage. But DevOps is now “table stakes” for modern software development and delivery.

        It’s now DataOps that’s in a phase of rapid growth. A big part of the reason for this is the need for strong DataOps practices in developing, and delivering value from, AI and ML. For IT practitioners and management, there are new skills to learn, new ways to deliver value, and in a sense, whole new worlds to conquer, all based on the development, growth, and institutionalization of new practices around data.

        At the organizational level, DataOps gives companies the opportunity to innovate, to better serve customers, and to gain competitive advantage by rapid, effective adoption and innovation around DataOps as a practice.

        Many of today’s largest and fastest-growing companies are DataOps innovators. Facebook, Apple, Alphabet, Netflix, and Uber are just the best-known among companies which have grown to a previously unheard-of degree, largely based on their innovative (and, often, controversial) practices around the use of data.

        Adobe is an example of a company that has grown rapidly – increasing its market value by 4x in the last few years – by adding a data-driven business, the marketing-centered Experience Cloud, to their previously existing portfolio.

        Algorithms are hard to protect, competitively, while a company’s data is its own. And, while AI algorithms and machine learning models that depend on this data can be shared, they don’t mean much without the flow of data that powers them. So all this means that innovation based on a company’s data, accelerated by the adoption and implementation of DataOps, is more able to yield lasting and protectable competitive advantage, and contribute to a company’s growth.

        Test-drive Unravel for DataOps

        Create a free account

        Conclusion

        It’s fair to say that “DataOps is the new DevOps” – not because one replaces the other, but because DataOps is the hot new area for innovation and competitive advantage. The main difference is that it’s easier for competitive advantage based on data, and on making the best possible use of that data via DataOps, to be enduring.

        Unravel Data customers are more easily able to work all steps in the DataOps cycle, because they can see and work with their data-driven applications holistically and effectively. If you’re interested in assessing Unravel for your own data-driven applications, you can create a free account or contact us to learn how we can help.

        The post DataOps vs DevOps and Their Benefits Towards Scaling Delivery appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-vs-devops/feed/ 0
The Guide To DataOps, AIOps, and MLOps in 2022 https://www.unraveldata.com/resources/dataops-aiops-and-mlops/ https://www.unraveldata.com/resources/dataops-aiops-and-mlops/#respond Wed, 14 Apr 2021 15:22:11 +0000 https://www.unraveldata.com/?p=6676 Line Graph Chart

        Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those […]

The post The Guide To DataOps, AIOps, and MLOps in 2022 appeared first on Unravel.

        ]]>

        Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those days are a distant memory now, and for good reason.

        Google Trends Chart DataOps AIOps MLOps

The Ops landscape has expanded beyond the generic “IT” to include DevOps, DataOps, AIOps, MLOps, and more. Each of these Ops areas is cross-functional throughout an organization, and each provides a unique benefit. And each of the Ops areas emerges from the same general mechanism: applying agile principles, originally created to guide software development, to the overlap of different flavors of software development, related technologies (data-driven applications, AI, and ML), and operations.

        As with DevOps, the goal of DataOps, AIOps, and MLOps is to accelerate processes and improve the quality of data, analytics insights, and AI models respectively. In practice, we see DataOps as a superset of AIOps and MLOps with the latter two Ops overlapping each other.

        Why is this? DataOps describes the flow of data, and the processing that takes place against that data, through one or more data pipelines. In this context, every data-driven app needs DataOps; those which are primarily driven by machine learning models also need MLOps, and those with additional AI capabilities need AIOps. (Confusingly, ML is sometimes considered to be separate from AI, and sometimes simply an important part of the technologies that are part of AI as a whole.)

        DataOps MLOps AIOps Venn Diagram

        The goal of this article is to help you understand these new terms and provide insight into where they came from, the similarities, differences, and who in an organization uses them. To start, we’ll look at DataOps.

        What is DataOps

        DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. DataOps is practiced by modern data teams, including data engineers, architects, analysts, scientists and operations.

        The data products which power today’s companies range from advanced analytics, data pipelines, and machine learning models to embedded AI solutions. Using a DataOps methodology allows companies to move faster, more surely, and with greater cost-effectiveness in extracting value out of data.

        A company can adopt a DataOps approach independently, without buying any specialty software or services. However, just as the term DevOps became strongly associated with the use of containers, as commercial software from Docker and as open source software from Kubernetes, the term DataOps is increasingly associated with data pipelines and applications.

        DataOps at scale with Unravel

        Create a free account

        The origin of DataOps

DataOps is a successor methodology to DevOps, which is an approach to software development that optimizes for responsive, continuous deployment of software applications and software updates. DataOps applies this approach across the entire lifecycle of data applications, and even helps productize data science.

        As a term, DataOps has been in gradually increasing use for several years now. Gartner began to include it in their Hype Cycle for Data Management in 2018, and the term is now “moving up the Hype Cycle” as it becomes more widespread. The first industry conference devoted to DataOps, DataOps Unleashed, was launched in March 2021.

        While DataOps is sometimes described as a set of best practices for continuous analytics, it is actually more holistic. DataOps includes identifying the data sources needed to solve a business problem, the processing needed to make that data ready for analytics, the analytics step(s) – which may be seen as including AI and ML – and the delivery of results to the people or apps that will use them. DataOps also includes making all this work fast enough to be useful, whether that means processing the weekly report in an hour or so, or approving a credit card transaction in a fraction of a second.

        Who uses DataOps

        DataOps is directly for data operations teams, data engineers, and software developers building data-powered applications and the software-defined infrastructure that supports them. The benefits of DataOps approaches are felt by the teams themselves; the IT team, which can now do more with less; the data team’s internal “customers,” who request and use analytics results; the organization as a whole; and the company’s customers. Ultimately, society benefits, as things simply work better, faster, and less expensively. (Compare booking a plane ticket online to going to a travel agent, or calling airlines yourself, out of the phone book, as a simple example.)

        What is AIOps

AIOps stands for Artificial Intelligence for IT operations. It is a paradigm shift that allows machines to solve IT issues without the need for human assistance or interaction. AIOps uses machine learning and analytics to analyze big data obtained via different tools, which allows issues to be spotted automatically and dealt with in real time. (Confusingly, AIOps is also sometimes used to describe the operationalization of AI projects, but we will stick with the definition used by Gartner and others, as described here.)

        As part of DataOps, AIOps supports continuous integration and deployment for the core tech functions of machine learning and big data. AIOps helps automate operations across hybrid environments. AIOps includes the use of machine learning to detect patterns and reduce noise.

        See Unravel AI and automation in action

        Create a free account

        The Origin of AIOps

        AIOps was originally defined in 2017 by Gartner as a means to describe the growing interest and investment in applying a broad spectrum of AI capabilities to enterprise IT operations management challenges.

        Gartner defines AIOps as platforms that utilize big data, machine learning, and other advanced analytics technologies to directly and indirectly enhance IT operations (such as monitoring, automation and service desk) functions with proactive, personal, and dynamic insight.

        Put another way, AIOps refers to improving the way IT organizations manage data and information in their environments using artificial intelligence.

        Who uses AIOps

        From enterprises with large, complex environments, to cloud-native small and medium enterprises (SMEs), AIOps is being used globally by organizations of all sizes in a variety of industries. AIOps is most often bought in as products or services from specialist companies; few large organizations are using their own in-house AI expertise to solve IT operations problems.

        Companies with extensive IT environments, spanning multiple technology types, are adopting AIOps, especially when they face issues of scale and complexity. AIOps can make a significant contribution when those challenges are layered on top of a business model that is dependent on IT. If the business needs to be agile and to quickly respond to market forces, IT will rely upon AIOps to help IT be just as agile in supporting the goals of the business.

        AIOps is not just for massive enterprises, though. SMEs that need to develop and release software continuously are embracing AIOps as it allows them to continually improve their digital services, while preventing malfunctions and outages.

        The ultimate goal of AIOps is to enhance IT Operations. AIOps delivers intelligent filtering of signals out of the noise in IT systems. AIOps intelligently observes IT operations data in order to identify root causes and recommend solutions quickly. In some cases, these recommendations can even be implemented without human interaction.

        What is MLOps

        MLOps, or Machine Learning Operations, helps simplify the management, logistics, and deployment of machine learning models between operations teams and machine learning researchers.

MLOps is like DataOps – the fusion of a discipline (machine learning in one case, data science in the other) and the operationalization of projects from that discipline. MLOps and DataOps are different from AIOps, which is the use of AI to improve IT operations.

        MLOps takes the DevOps methodology of continuous integration and continuous deployment and applies it to machine learning. As in traditional development, there is code that needs to be written and deployed, as well as bug testing to be done, and changes in user requirements to be accommodated. Specific to the topic of machine learning, models are being trained with data, and new data is introduced to retrain the models again and again.

        The Origin of MLOps

According to Forbes, the origins of MLOps date back to 2015, from a paper entitled “Hidden Technical Debt in Machine Learning Systems.” The paper argued that machine learning offered an incredibly powerful toolkit for building useful complex prediction systems quickly, but that it was dangerous to think of these quick wins as coming for free.

        Who Uses MLOps

        Data scientists tend to focus on the development of models that deliver valuable insights to your organization more quickly. Ops people tend to focus on running those models in production. MLOps unifies the two approaches into a single, flexible practice, focused on the delivery of business value through the use of machine learning models and relevant input data.

        Because MLOps follows a similar pattern to DevOps, MLOps creates a seamless integration between your development cycle and your overall operations process that transforms how your organization handles the use of big data as input to machine learning models. MLOps helps drive insights that you can count on and put into practice.

        Streamlining MLOps is critical to organizations that are developing Machine Learning models, as well as to end users who use applications that rely on these models.

        According to research from Finances Online, machine learning applications and platforms account for 57% (or $42 Billion) in AI funding worldwide. Organizations that are making significant investments want to ensure they are deriving significant value.

As an example of the impact of MLOps, 97% of all mobile users use AI-powered voice assistants that depend on machine learning models, and thus benefit from MLOps. These voice assistants are the result of a subset of ML known as deep learning, which is at the core of platforms such as Apple’s Siri, Amazon Echo, and Google Assistant.

        The goal of MLOps is to bridge the gap between data scientists and operations teams to deliver insights from machine learning models that can be put into use immediately.

        Conclusion

        Here at Unravel Data, we deliver a DataOps platform that uses AI-powered recommendations – AIOps – to help proactively identify and resolve operations problems. This platform complements the adoption of DataOps practices in an organization, with the end results including improved application performance, fewer operational hassles, lower costs, and the ability for IT to take on new initiatives that further benefit organizations.

        Our platform does not explicitly support MLOps, though MLOps always occurs in a DataOps context. That is, machine learning models run on data – usually, on big data – and their outputs can also serve as inputs to additional processes, before business results are achieved.

        As DataOps, AIOps, and MLOps proliferate – as working practices, and in the form of platforms and software tools that support agile XOPs approaches – complex stacks will be simplified and made to run much faster, with fewer problems, and at less cost. And overburdened IT organizations will truly be able to do more with less, leading to new and improved products and services that perhaps can’t all be imagined today.

        If you want to know more about DataOps, check out the recorded sessions from the first DataOps industry conference, DataOps Unleashed. To learn more about the Unravel Data DataOps platform, you can create a free account or contact us.

The post The Guide To DataOps, AIOps, and MLOps in 2022 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-aiops-and-mlops/feed/ 0
        The Spark 3.0 Performance Impact of Different Kinds of Partition Pruning https://www.unraveldata.com/resources/intricacies-in-spark-30-partition-pruning/ https://www.unraveldata.com/resources/intricacies-in-spark-30-partition-pruning/#respond Wed, 14 Apr 2021 06:42:40 +0000 https://www.unraveldata.com/?p=6694

        In this blog post, I’ll set up and run a couple of experiments to demonstrate the effects of different kinds of partition pruning in Spark. Big data has a complex relationship with SQL, which has long […]

        The post The Spark 3.0 Performance Impact of Different Kinds of Partition Pruning appeared first on Unravel.

        ]]>

        In this blog post, I’ll set up and run a couple of experiments to demonstrate the effects of different kinds of partition pruning in Spark.

        Big data has a complex relationship with SQL, which has long served as the standard query language for traditional databases – Oracle, Microsoft SQL Server, IBM DB2, and a vast number of others.

Only relational databases support true SQL natively, and many big data databases fall in the NoSQL – i.e., non-relational – camp. For these databases, there are a number of near-SQL alternatives.

        When dealing with big data, the most crucial part of processing and answering any SQL or near-SQL query is likely to be the I/O cost – that is, moving data around, including making one or more copies of the same data, as the query processor gathers needed data and completes the response to the query.

        A great deal of effort has gone into reducing I/O costs for queries. Some of the techniques used are indexes, columnar data storage, data skipping, etc.

Partition pruning, described below, is one of the data skipping techniques used by most query engines, such as Spark, Impala, and Presto. An advanced form of partition pruning is called dynamic partition pruning. In this blog post, we will work to understand both of these concepts and how they impact query execution while reducing I/O requirements.

        What is Partition Pruning?

        Let’s first understand what partition pruning is, how it works, and the implications of this for performance.

        If a table is partitioned and one of the query predicates is on the table partition column(s), then only partitions that contain data which satisfy the predicate get scanned. Partition pruning eliminates scanning of the partitions which are not required by the query.

        There are two types of partition pruning.

        • Static partition pruning. Partition filters are determined at query analysis/optimize time.
        • Dynamic partition pruning. Partition filters are determined at runtime, looking at the actual values returned by some of the predicates.
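A quick way to see which kind applies to a given query is to look at the physical plan. The example below is a sketch rather than part of the experiment that follows: it assumes a table t1 partitioned on key, and uses EXPLAIN FORMATTED (available since Spark 3.0) to print the scan operator, whose partition filters show what will be pruned. The exact operator names vary with the table format and Spark version; for dynamic pruning, the filter is resolved at runtime, so the plan shows a dynamic pruning expression instead of a literal value.

-- Hypothetical plan inspection: the scan operator for t1 in the output lists
-- the partition filters Spark will use to skip partitions.
explain formatted
select key, val from t1 where key = 40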

        Experiment Setup

        For this experiment we have created three Hive tables, to mimic real-life workloads on a smaller scale. Although the number of tables is small, and their data size almost insignificant, the findings are valuable to clearly understand the impact of partition pruning, and to gain insight into some of its intricacies.

Spark version used: 3.0.1

Important configuration settings for our experiment:

        • spark.sql.cbo.enabled = false (default)
        • spark.sql.cbo.joinReorder.enabled = false (default)
        • spark.sql.adaptive.enabled = false (default)
        • spark.sql.optimizer.dynamicPartitionPruning.enabled = false (default is true)

         

        We have disabled dynamic partition pruning for the first half of the experiment to see the effect of static partition pruning.

Table Name | Columns | Partition column | Row Count | Num Partitions
T1 | key, val | key | 21,000,000 | 1000
T2 | dim_key, val | dim_key | 800 | 400
T3 | key, val | key | 11,000,000 | 1000
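The post does not include the DDL used to build these tables. As a rough sketch only, partitioned Hive tables of this shape might be declared and loaded as follows; the Parquet storage format and the staging table are assumptions, not details taken from the experiment.

-- T1: "key" is the partition column, so each distinct key value gets its own
-- partition directory (1000 partitions in the experiment).
create table t1 (val string) partitioned by (key int) stored as parquet

-- T2: small dimension table partitioned on dim_key.
create table t2 (val string) partitioned by (dim_key int) stored as parquet

-- Dynamic-partition load from a hypothetical staging table; the partition
-- column goes last in the select list.
set hive.exec.dynamic.partition=true
set hive.exec.dynamic.partition.mode=nonstrict
insert overwrite table t1 partition (key) select val, key from t1_staging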

         

        Static Partition Pruning

        We will begin with the benefits of static partition pruning, and how they affect table scans.

        Let’s first try with a simple join query between T1 & T2.

        select t1.key from t1,t2 where t1.key=t2.dim_key

         

        Query Plan

        This query scans all 1000 partitions of T1 and all 400 partitions of T2. So in this case no partitions were pruned.

        Let’s try putting a predicate on T1’s partition column.

        select t1.key from t1 ,t2 where t1.key=t2.dim_key and t1.key = 40

        What happened here? As we added the predicate “t1.key = 40,” and “key” is a partitioning column, that means the query requires data from only one partition, which is reflected in the “number of partitions read” on the scan operator for T1.

        But there is another interesting thing. If you observe the scan operator for T2, it also says only one partition is read. Why?

        If we deduce logically, we need:

        • All rows from T1 where t1.key = 40
        • All rows from T2 where t2.dim_key = t1.key
        • Hence, all rows from T2 where t2.dim_key = 40
        • All rows of T2 where t2.dim_key=40 can be only be in one partition of T2, as it’s a partitioning column

        What if the predicate would have been on T2, rather than T1? The end result would be the same. You can check yourself.

        select t1.key from t1,t2 where t1.key=t2.dim_key and t2.dim_key = 40

         

        What if the predicate satisfies more than one partition? The ultimate effect would be the same. The query processor would eliminate all the partitions which don’t satisfy the predicate.

        select t1.key from t1 ,t2 where t1.key=t2.dim_key and t2.dim_key > 40 and t2.dim_key < 100

         

        As we can see, there are 59 partitions which are scanned from each table. Still a significant saving as compared to scanning 1000 partitions.

        So far, so good. Now let’s try adding another table, T3, to the mix.

        select t1.key from t1, t3, t2
        where t1.key=t3.key and t1.val=t3.val and t1.key=t2.dim_key and t2.dim_key = 40

         

        Something strange happened here. As expected, only one partition was scanned for both T1 and T2. But 1000 partitions were scanned for T3.

        This is a bit odd. If we extend our logic above, only one partition should have been picked up for T3 as well. It was not so.

        Let’s try the same query, arranged a bit differently. Here we have just changed the order of join tables. The rest of the query is the same.

        select  t1.key from t1, t2, t3
        where t1.key=t3.key and t1.val=t3.val and t1.key=t2.dim_key and t2.dim_key = 40

         


        Now we can see the query scanned one partition from each of the tables. So is the join order important for static partition pruning? It should not be, but we see that it is. This looks like it’s a limitation in Spark 3.0.1.

        Dynamic Partition Pruning

        So far, we have looked at queries which have one predicate on a partitioning column(s) of tables – key, dim_key etc. What will happen if the predicate is on a non-partitioned column of T2, like “val”?

        Let’s try such a query.

        select t1.key from t1, t2 where t1.key=t2.dim_key and t2.val > 40

         

As you might have guessed, all the partitions of both tables are scanned. Can we optimize this, given that the size of T2 is very small? Can we eliminate certain partitions from T1, knowing that predicates on T2 may select data from only a subset of its partitions?

        This is where dynamic partition pruning will help us. Let’s re-enable it.

        set spark.sql.optimizer.dynamicPartitionPruning.enabled=true

         

        Then, re-run the above query.


        As we can see, although all 400 partitions were scanned for T2, only 360 partitions were scanned for T1. We won’t go through the details of dynamic partition pruning (DPP). A lot of materials already exist on the web, such as the following:

        https://medium.com/@prabhakaran.electric/spark-3-0-feature-dynamic-partition-pruning-dpp-to-avoid-scanning-irrelevant-data-1a7bbd006a89

Dynamic Partition Pruning in Apache Spark – Bogdan Ghit (Databricks) and Juliusz Sompolski (Databricks).

        You can read about how DPP is implemented in the above blogs. Our focus is its impact on the queries.
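One additional way to check the impact, beyond the “number of partitions read” shown on the scan operators: with dynamic partition pruning enabled, the formatted plan itself usually reveals that a runtime filter was injected. This is a hedged sketch, not part of the original experiment; in Spark 3.x the pruned scan of T1 typically carries a dynamic pruning expression fed by a subquery over T2, though the exact wording depends on the Spark version and table format.

-- Re-run the earlier join through EXPLAIN to see the runtime partition filter
-- that DPP adds to the scan of t1.
explain formatted
select t1.key from t1, t2 where t1.key=t2.dim_key and t2.val > 40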

        Now that we have executed a simple case, let’s fire off a query that’s a bit more complex.

        select t1.key from t1, t2, t3
        where t1.key=t3.key and t3.key=t2.dim_key and t1.key=t2.dim_key and t2.val > 40 and t2.val < 100

         


        A lot of interesting things are going on here. As before, scans of T1 got the benefit of partition pruning. Only 59 partitions of T1 got scanned. But what happened to T3? All 1000 partitions were scanned.

        To investigate the problem, I checked Spark’s code. I found that, at the moment, only a single DPP subquery is supported in Spark.

        Ok, but how does this affect query writers? Now it becomes more important to get our join order right so that a more costly table can get help from DPP. For example, if the above query were written as shown below, with the order of T1 and T3 changed, what would be the impact?

        select t1.key from t3, t2, t1
        where t1.key=t3.key and t3.key=t2.dim_key and t1.key=t2.dim_key and t2.val > 40 and t2.val < 100

        You guessed it! DPP will be applied to T3 rather than T1, which will not be very helpful for us, as T1 is almost twice the size of T3.

        That means in certain cases join order is very important. Isn’t it the job of the optimizer, more specifically the “cost-based optimizer” (CBO), to do this?

        Let’s now switch on the cost-based optimizer for our queries and see what happens.

        set spark.sql.cbo.enabled=true

        set spark.sql.cbo.joinReorder.enabled=true
        ANALYZE TABLE t1 partition (key) compute statistics NOSCAN
        ANALYZE TABLE t3 partition (key) compute statistics NOSCAN
        ANALYZE TABLE t2 partition (dim_key) compute statistics NOSCAN
        ANALYZE TABLE t2 compute statistics for columns dim_key
        ANALYZE TABLE t1 compute statistics for columns key
        ANALYZE TABLE t3 compute statistics for columns key

        Let’s re-run the above query and check if T1 is picked up for DPP or not. As we can see below, there is no change in the plan. We can conclude that CBO is not able to change where DPP should be applied.

        Conclusion

As we saw, even with as few as three simple tables, there are many permutations of running queries, each behaving quite differently from the others. As SQL developers we can try to optimize our queries ourselves, and try to understand the impact of a specific query. But imagine dealing with a million queries a day.

Here at Unravel Data, we have customers who run more than a million SQL queries a day across Spark, Impala, Hive, etc. Each of these engines has its own optimizer. Each optimizer behaves differently.

        Optimization is a hard problem to solve, and during query runtime it’s even harder due to time constraints. Moreover, as datasets grow, collecting statistics, on which optimizers depend, becomes a big overhead.

        But fortunately, not all queries are ad hoc. Most are repetitive and can be tuned for future runs.

        We at Unravel are solving this problem, making it much easier to tune large numbers of query executions quickly.

        In upcoming articles we will discuss the approaches we are taking to tackle these problems.

        The post The Spark 3.0 Performance Impact of Different Kinds of Partition Pruning appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/intricacies-in-spark-30-partition-pruning/feed/ 0
        DataOps Has Unleashed, Part 2 https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/ https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/#respond Wed, 31 Mar 2021 02:52:03 +0000 https://www.unraveldata.com/?p=6558 DataOps Background

        The DataOps Unleashed conference, founded this year by Unravel Data, was full of interesting presentations and discussions. We described the initial keynote and some highlights from the first half of the day in our DataOps Unleashed […]

        The post DataOps Has Unleashed, Part 2 appeared first on Unravel.

        ]]>

        The DataOps Unleashed conference, founded this year by Unravel Data, was full of interesting presentations and discussions. We described the initial keynote and some highlights from the first half of the day in our DataOps Unleashed Part 1 blog post. Here, in Part 2, we bring you highlights through the end of the day: more about what DataOps is, and case studies as to how DataOps is easing and speeding data-related workflows in big, well-known companies.

        You can freely access videos from DataOps Unleashed – most just 30 minutes in length, with a lot of information packed into hot takes on indispensable technologies and tools, key issues, and best practices. We highlight some of the best-received talks here, but we also urge you to check out any and all sessions that are relevant to you and your work.  

        Mastercard Pre-Empts Harmful Workloads

        See the Mastercard session now, on-demand. 

        Chinmay Sagade and Srinivasa Gajula of Mastercard are responsible for helping the payments giant respond effectively to the flood of transactions that has come their way due to the pandemic, with cashless, touchless, and online transactions all skyrocketing. And much of the change is likely to be permanent, as approaches to doing business that were new and emerging before 2020 become mainstream. 

Hadoop is a major player at Mastercard. Their largest cluster is petabytes in size, and they use Impala for SQL workloads, as well as Spark and Hive. But they have in the past been plagued by services being unavailable, applications failing, and bottlenecks caused by suboptimal use of workloads.

        Mastercard has used Unravel to help application owners self-tune their workloads and to create an automated monitoring system to detect toxic workloads and automatically intervene to prevent serious problems. For instance, they proactively detect large broadcast joins in Impala, which tend to consume tremendous resources. They also detect cross-joins in queries. 

        Their work has delivered tremendous business value:

• Vastly improved reliability
        • Better configurations free up resources
        • Reduced problems free up time for troubleshooting recurring issues
        • Better capacity usage and forecasting saves infrastructure costs

        To the team’s surprise, users were hugely supportive of restrictions, because they could see the positive impact on performance and reliability. And the entire estate now works much better, freeing resources for new initiatives. 

Take the hassle out of petabyte-scale DataOps

        Create a free account

        Gartner CDO DataOps Panel Shows How to Maximize Opportunities

        See the Gartner CDO Panel now, on-demand.

        As the VP in charge of big data and advanced analytics for leading industry analysts, Gartner, Sanjeev Mohan has seen it all. So he had some incisive questions for his panel of four CDOs. A few highlights follow. 

        Paloma Gonzalez Martinez is CDO at AlphaCredit, one of the fastest-growing financial technology companies in Latin America. She was asked: How has data architecture evolved? And, if you had a chance to do your big data efforts over again, how would you do things differently? 

        Paloma shared that her company actually is revisiting their whole approach. The data architecture was originally designed around data; AlphaCredit is now re-architecting around customer and business needs. 

        David Lloyd is CDO at Ceridian, a human resources (HR) services provider in the American Midwest. David weighed in on the following: What are the hardest roles to fill on your data team? And, how are these roles changing?

        David said that one of the guiding principles he uses in hiring is to see how a candidate’s eyes light up around data. What opportunities do they see? How do they want to help take advantage of them? 

Kumar Menon is SVP of Data Fabric and Decision Science Technology at Equifax, a leading credit bureau. With new candidates, Kumar looks for the intersection of engineering and insights. How does one build platforms that identify crucial features, then share them quickly and effectively? When does a model need to be optimized, and when does it need to be rebuilt?

        Sarah Gadd is Head of Semantic Technology, Analytics and Machine Intelligence at Credit Suisse. (Credit Suisse recently named Unravel Data a winner of their 2021 Disruptive Technologies award.) Technical problems disrupted Sarah from participating live, but she contributed answers to the questions that were discussed. 

        Sarah looks for storytellers to help organize data and analytics understandably, and is always on the lookout for technical professionals who deeply understand the role of the semantic layer in data models. And in relation to data architecture, the team faces a degree of technical debt, so innovation is incremental rather than sweeping at this point. 

        84.51°/Kroger Solves Contention and the Small Files Problem with DataOps Techniques

        See the 84.51/Kroger session now, on-demand. 

        Jeff Lambert and Suresh Devarakonda are DataOps leaders at the 84.51° analytics business of retailing giant Kroger. Their entire organization is all about deriving value from data and delivering that value to Kroger, their customers, partners, and suppliers. They use Yarn and Impala as key tools in their data stack. 

        They had a significant problem with jobs that created hundreds of small files, which consumed system resources way out of proportion to the file sizes. They have built executive dashboards that have helped stop finger-pointing, and begin solving problems based on shared, trusted information. 

Unravel Data has been a key tool in helping 84.51° to adopt a DataOps approach and get all of this done. They are expanding their cloud presence on Azure, using Databricks and Snowflake. Unravel gives them visibility, management capability, and automatically generated actions and recommendations, making their data pipelines work much better. 84.51° has just completed a proof of concept (PoC) for Unravel on Azure and Databricks, and is heavily using the recently introduced Spark 3.0 support.

        Resource contention was caused by a rogue app that spiked memory usage. Using Unravel, 84.51° quickly found the offending app, killed it, and worked with the owner to prevent the problem in the future. 84.51 now proactively scans for small files and concerning issues using Unravel, heading off problems in advance. Unravel also helps move problems up to a higher level of abstraction, so operations work doesn’t require that operators be expert in all of the technologies they’re responsible for managing. 

        At 84.51°, Unravel has helped the team improve not only their own work, but what they deliver to the company:

        • Solving the small files problem improves performance and reliability
        • Spotting and resolving issues early prevents disruption and missed SLAs
        • Improved resource usage and availability saves money and increases trust 
        • More production from less investment allows innovation to replace disruption 

        Cutting Cloud Costs with Amazon EMR and Unravel Data

        See the AWS session now, on-demand. 

        As our own Sandeep Uttamchandani says, “Once you get into the cloud, ‘pay as you go’ takes on a whole new meaning: as you go, you pay.” But AWS Solutions Architect Angelo Carvalho is here to help. AWS understands that their business will grow healthier if customers are wringing the most value out of their cloud costs, and Angelo uses this talk to help people do so. 

Angelo describes the range of AWS services around big data, including EMR support for Spark, Hive, Presto, HBase, Flink, and more. He emphasized EMR Managed Scaling, which makes scaling automatic, and takes advantage of cloud features to save money compared to on-premises, where you need to have enough servers all the time to support peak workloads that only occur some of the time. (And where you can easily be overwhelmed by unexpected spikes.)

        Angelo was followed by Kunal Agarwal of Unravel Data, who showed how Unravel optimizes EMR. Unravel creates an interconnected model of the data in EMR and applies predictive analytics and machine learning to it. Unravel automatically optimizes some areas, offers alerts for others, and provides dashboards and reports to help you manage both price and performance. 

        Kunal then shows how this actually works in Unravel, demonstrating a few key features, such as automatically generated, proactive recommendations for right-sizing resource use and managing jobs. The lessons from this session apply well beyond EMR, and even beyond the cloud, to anyone who needs to run their jobs with the highest performance and the most efficient use of available resources. 

        Need to get cloud overspending under control?

        Create a free account

        Microsoft Describes How to Modernize your Data Estate

        See the Microsoft session now, on-demand.

        According to Priya Vijayarajendran, VP for Data and AI at Microsoft, a modern, cloud-based strategy underpins success in digital transformation. Microsoft is enabling this shift for customers and undertaking a similar journey themselves. 

        Priya describes data as a strategic asset. Managing data is not a liability or a problem, but a major opportunity. She shows how even a simplified data estate is very complex, requiring constant attention. 

Priya tackled the “what is DataOps” challenge, using DevOps techniques, agile methods, and statistical process control methodologies to intelligently manage data as a strategic asset. She displayed a reference architecture for continuous integration and continuous delivery on the Azure platform.

        Priya ended by offering to interact with the community around developing ever-better answers to the challenges and opportunities that data provides, whether on Microsoft platforms or more broadly. Microsoft is offering multi-cloud products that work on AWS and Google Cloud Platform as well as Azure. She said that governance should not be restrictive, but instead should enable people to do more. 

        A Superset of Advanced Topics for Data Engineers

        See the Apache Superset session now, on-demand.

        Max Beauchemin is the creator of Airflow and currently CEO at Preset, the company that is bringing Apache Superset to the market. Superset is the leading open-source analytics platform and is widely used at major companies. Max is the original creator of Apache Airflow, mentioned in our previous Unleashed blog post, as well as Superset. Preset makes Superset available as a managed service in the cloud. 

Max discussed high-end, high-impact topics around Superset. He gave a demo, then demonstrated SQL Lab, a SQL development environment built in React. He then showed how to build a visualization plugin; create alerts, reports, charts, and dashboards; and use the Superset REST API.

        Templating is a key feature in SQL Lab, allowing users to build a framework that they can easily adapt to a wide variety of SQL queries. Built on Python, Jinja allows you to use macros in your SQL code. Jinja integrates with Superset, Airflow, and other open source technologies. A parameter can be specified as type JSON, so values can be filled in at runtime. 
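To make the templating idea concrete, here is a small, hypothetical example of what a Jinja-templated query might look like in SQL Lab. The sales table and the from_date parameter are illustrative only, not taken from Max's talk; the parameter value would be supplied at runtime, for example through SQL Lab's JSON parameters, and Jinja substitutes it before the query reaches the database.

-- {{ from_date }} is a Jinja placeholder, filled at runtime from a parameters
-- block such as {"from_date": "2021-01-01"}.
select order_date, sum(amount) as total_sales
from sales
where order_date >= '{{ from_date }}'
group by order_date
order by order_date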

        With this talk, Max gave the audience the information they need to plan, and begin to implement, ambitious Superset projects that work across a range of technologies. 

        Soda Delivers a Technologist’s Call to Arms for DataOps

        See the Soda DataOps session now, on-demand.

        What does DataOps really mean to practitioners? Vijay Karan, Head of Data Engineering at Soda, shows how DataOps applies at each stage of moving data across the stack, from ingest to analytics. 

        Soda is a data monitoring platform that supports data integrity, so Vijay is in a good position to understand the importance of DataOps. He discusses core principles of DataOps and how to apply those principles in your own projects. 

        Vijay begins with the best brief description of DataOps, from a practitioner’s point of view, that we’ve heard yet:

        What is DataOps?

        A process framework that helps data teams deliver high quality, reliable data insights with high velocity.

        At just sixteen words, this is admirably concise. In fact, to boil it down to just seven words, “A process framework that helps data teams” is not a bad description.

        Vijay goes on to share half a dozen core DataOps principles, and then delivers a deep dive on each of them. 

        Here at Unravel, part of what we deliver is in his fourth principle:

        Improve Observability

        Monitor quality and performance metrics across data flows

        Just in this one area, if everyone did what Vijay suggests around this – defining metrics, visualizing them, configuring meaningful alerts – the world would be a calmer and more productive place. 

        Conclusion

        This wraps up our overview description of DataOps Unleashed. If you haven’t already done so, please check out Part 1, highlighting the keynote and talks discussing Adobe, Cox Automotive, Airflow, Great Expectations, DataOps, and moving Snowflake to space. 

        However, while this blog post gives you some idea as to what happened, nothing can take the place of “attending” sessions yourself, by viewing the recordings. You can view the videos from DataOps Unleashed here. You can also download The Unravel Guide to DataOps, which was made available for the first time during the conference. 

        Finding Out More

        Read our blog post Why DataOps Is Critical for Your Business.

        The post DataOps Has Unleashed, Part 2 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-has-unleashed-part-2/feed/ 0
        DataOps Has Unleashed, Part 1 https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/ https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/#respond Wed, 24 Mar 2021 04:10:27 +0000 https://www.unraveldata.com/?p=6433 DataOps Background

        DataOps Unleashed launched as a huge success, with scores of speakers, thousands of registrants, and way too many talks for anyone to take in all at once. Luckily, as a virtual event, all sessions are available […]

        The post DataOps Has Unleashed, Part 1 appeared first on Unravel.

        ]]>

        DataOps Unleashed launched as a huge success, with scores of speakers, thousands of registrants, and way too many talks for anyone to take in all at once. Luckily, as a virtual event, all sessions are available for instant viewing, and attendees can keep coming back for more. (You can click here to see some of the videos, or visit Part 2 of this blog post.)

        Kunal Agarwal, CEO of Unravel Data, kicked off the event with a rousing keynote, describing DataOps in depth. Kunal introduced the DataOps Infinity Loop, with ten integrated and overlapping stages. He showed how teams work together, across and around the loop, to solve the problems caused as both data flow and business challenges escalate. 

        Kunal introduced three primary challenges that DataOps addresses, and that everyone assembled needs to solve, in order to make progress:

        • Complexity. A typical modern stack and pipeline has about a dozen components, and the data estate as a whole has many more. All this capability brings power – and complexity. 
        • Crew. Small data teams – the crew – face staggering workloads. Finding qualified, experienced people, and empowering them, may be the biggest challenge. 
        • Coordination. The secret to team productivity is coordination. DataOps, and the DataOps lifecycle, are powerful coordination frameworks. 

        These challenges resonated across the day’s presentations. Adobe, Kroger, Cox Automotive, Mastercard, Astronomer, and Superconductive described the Unravel Data platform as an important ally in tackling complexity. And Vijay Kiran, head of data engineering at Soda, emphasized the role of collaboration in making teams effective. The lack of available talent to expand one’s Crew – and the importance of empowering one’s team members – came up again and again. 

Each presentation had many highlights. A few that stand out from the morning sessions are Adobe, moving their entire business to the cloud; Airflow, a leader in orchestration; Cox Automotive, running a global business with a seven-person data team; and Great Expectations, which focuses on data reliability. 

        How Adobe is Moving a $200B Business to the Cloud

Adobe was up next, with an in-depth cloud migration case study covering the company’s initial steps toward cloud migration. Adobe is one of the original godfathers of today’s digital world, powering much of the creativity seen in media, entertainment, on websites, and everywhere one looks. The company was founded in the early 1980s and gained their original fame with PostScript and the laser printer. But the value of the business did not really take off until the past ten years, when the company moved from boxed software to a subscription business using a SaaS model. 

        Now, Adobe is moving their entire business to the cloud. They describe four lessons they’ve learned in the move:

        • Ignorance is not bliss. In planning their move to the cloud, Adobe realized that they had a large blind spot about what work was running on-premises. This may seem surprising until you check and realize that your company may have this problem too. 
        • Comparison-shop now. You need to compare your on-premises cost and performance per workload to what you can expect in the cloud. The only way to do this is to use a tool that profiles your on-premises workloads and maps each to the appropriate infrastructure and costs on each potential cloud platform. 
        • Optimize first. Moving an unoptimized workload to the cloud is asking for trouble – and big bills. It’s critically important to optimize workloads on-premises to reduce the hassle of cloud migration and the expense of running the workload in the cloud. 
        • Manage effectively. On-premises workloads may run poorly without too much hassle, but running workloads poorly in the cloud adds immediately, and unendingly, to costs. Effective management tools are needed to help keep performance high and costs under budget. 

Kevin Davis, Application Engineering Manager, emphasized that Adobe gained the clarity they needed only through the use of Unravel Data for cloud migration, and for performance management both on-premises and in the cloud. Unravel allows Adobe to profile their on-premises workloads; map each workload to appropriate services in the cloud; compare cost and performance on-premises to what they could expect in the cloud; optimize workloads on-premises before the move; and carefully manage workload cost and performance, after migration, in the cloud. Unravel’s automatic insights increase the productivity of their DataOps crew. 

        Cloud DataOps at scale with Unravel

        Try Unravel for free

        Cox Automotive Scales Worldwide with Databricks

Cox Automotive followed with a stack optimization case study. Cox Automotive is a global business, with wholesale and retail offerings supporting the lifecycle of motor vehicles. Their data services team, a mighty crew of only seven people, supports the UK and Europe, offering reporting, analytics, and data science services to the businesses. 

The data services team is moving from manually deployed Hadoop clusters to a full platform-as-a-service (PaaS) setup using Databricks on Microsoft Azure. As they execute this transition, they are automating everything possible. Databricks allows them to spin up Spark environments quickly, and Unravel helps them automate pipeline health management. 

Data comes from a range of sources – mostly as files, with streaming expected soon. Cox finds Unravel particularly useful for optimization: choosing virtual machine types in Databricks, watching memory usage by jobs, making jobs run faster, and keeping costs down. These are all things the team struggles to see with other tools and can’t readily build themselves. They have an ongoing need for the visibility and observability that Unravel gives them; Unravel helps with optimization and is “really strong for observability in our platform.” 

        Great Expectations for Data Reliability

Great Expectations shared best practices on data reliability. Great Expectations is the leading open-source package for data reliability, which is crucial to DataOps success. An expectation is an assertion about data; when data falls outside these expectations, an exception is raised, making it easier to manage outliers and errors. Great Expectations integrates with a range of DataOps tools, making its developers DataOps insiders. Superconductive provides support for Great Expectations and is a leading contributor to the open source project.

        The stages where Great Expectations works map directly to the DataOps infinity loop. Static data must be prepared, cleaned, stored, and used for model development and testing. Then, when the model is deployed, live data must be cleansed, in real time, and run into and through the model. Results go out to users and apps, and quality concerns feed back to operations and development. 

        Airflow Enhances Observability

        Astronomer conducted a master class on observability. Astronomer was founded to help organizations adopt Apache Airflow. Airflow is open source software for programmatically creating, scheduling, and monitoring complex workflows, including core DataOps tasks such as data pipelines used to feed machine learning models. To construct workflows, users create task flows called directed acyclic graphs (DAGs) in Python. 

The figure shows a neat integration between Airflow and Unravel – specifically, how Unravel can provide end-to-end observability and automatic optimization for Airflow pipelines. In this example, a simple DAG contains a few Hive and Spark stages. Data is passed from Airflow into Unravel via REST APIs, which creates an easy-to-understand interface and allows Unravel to generate automated insights for these pipelines. 

        DataOps Powers the New Data Stack

        Unravel Data co-founder Shivnath Babu described and demystified the new data stack that is the focus of today’s DataOps practice. This stack easily supports new technologies such as advanced analytics and machine learning. However, this stack, while powerful, is complex, and operational challenges can derail its success. 

        Shivnath showed an example of the new data stack, with Databricks providing Spark support, Azure Data Lake for storage, Airflow for orchestration, dbt for data transformation, and Great Expectations for data quality and validation. Slack provides communications and displays alerts, and Unravel Data provides end-to-end observability and automated recommendations for configuration management, troubleshooting, and optimization. 

        In Shivnath’s demo, he showed pipelines in danger of missing performance SLAs, overrunning on costs, or hanging due to data quality problems. Unravel’s Pipeline Observer allows close monitoring, and alerts feed into Slack. 

The goal, in Shivnath’s talk and across all of DataOps, is for companies to move up the data pipeline maturity scale – from problems detected after the fact and fixed weeks later, to problems detected proactively, root-cause-analyzed (RCA’d) automatically, and healed automatically. 

        Simplify modern data pipeline complexity

        Try Unravel for free

        OneWeb Takes the Infinity Loop to the Stars

To finish the first half of the day, OneWeb showed how to take Snowflake beyond the clouds – the ones that you can see in the sky over your head. OneWeb is a global satellite network provider that takes Internet connectivity to space, reaching anywhere on the globe. They are going to low-Earth orbit with Snowflake, using a boost from DataOps.live. 

        OneWeb connects to their satellites with antenna arrays that require lots of room, isolation – and near-perfect connectivity. Since customer connectivity can’t drop, reliability is crucial across their estate, and a DataOps-powered approach is a necessity for keeping OneWeb online. 

        One More Thing…

        All of this is just part of what happened – in the first half of the day! We’ll provide a further update soon, including – wait for it – the state of DataOps, how to create a data-driven culture, customer case studies from 84.51°/Kroger and Mastercard, and much, much more. You can view the videos from DataOps Unleashed here. You can also download The Unravel Guide to DataOps, which was made available for the first time during the conference. 

        Finding Out More

        Read our blog post Why DataOps Is Critical for Your Business.

        The post DataOps Has Unleashed, Part 1 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/dataops-has-unleashed-part-1/feed/ 0
        Why DataOps is Critical For The Success Of Your Business https://www.unraveldata.com/resources/why-dataops-is-critical/ https://www.unraveldata.com/resources/why-dataops-is-critical/#respond Thu, 18 Mar 2021 18:08:14 +0000 https://www.unraveldata.com/?p=6368

        What is DataOps? DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. Data products include data pipelines, data-dependent apps, dashboards, machine learning models, other AI-powered software, and […]

        The post Why DataOps is Critical For The Success Of Your Business appeared first on Unravel.

        ]]>

        What is DataOps?

        DataOps is the use of agile development practices to create, deliver, and optimize data products, quickly and cost-effectively. Data products include data pipelines, data-dependent apps, dashboards, machine learning models, other AI-powered software, and even answers to ad hoc SQL queries. DataOps is practiced by modern data teams, including data engineers, architects, analysts, scientists and operations.

        What are Data Products?

        A data product is any tool or application that processes data and generates results. The primary objective of data products is to manage, organize, and make sense of the large amount of data that an organization generates and collects.

        What is a DataOps Platform?

        DataOps is more of a methodology than a specific, discrete platform. However, a platform that supports multiple aspects of DataOps practices can assist in the adoption and effectiveness of DataOps.

        Every organization today is in the process of harnessing the power of their data using advanced analytics, which is likely running on a modern data stack. On top of the data stack, organizations create “data products.”

These data products range from advanced analytics, data pipelines, and machine learning models to embedded AI solutions. All of these work together to help organizations gain insights, make decisions, and power applications. To extract the maximum value from these data products, companies employ a DataOps methodology that allows them to do so efficiently.

        What does DataOps do for an organization?

        DataOps is a set of agile-based development practices that make it faster, easier, and less costly to develop, deploy, and optimize data-powered applications.

Using an agile approach, an organization identifies a problem to solve, and then breaks it down into smaller pieces. Each piece is then assigned to a team that tackles the work within a defined period of time – usually called a sprint – that includes planning, work, deployment, and review.

        Who benefits from DataOps?

DataOps is for organizations that want not only to succeed, but to outpace the competition. With DataOps, an organization continually strives to find better ways to manage its data, which should enable it to make better, faster decisions from that data.

The practice of DataOps can benefit organizations by fostering cross-functional collaboration between teams of data scientists, data engineers, data analysts, operations, and product owners. Each of these roles needs to be in sync in order to use the data in the most efficient manner, and DataOps strives to accomplish this.

        Research by Forrester indicates that companies that embed analytics and data science into their operating models to bring actionable knowledge into every decision are at least twice as likely to be in a market-leading position than their industry peers.

        10 Steps of the DataOps Lifecycle

DataOps is not limited to making existing data pipelines work effectively, or to getting reports and machine learning inputs and outputs delivered as needed. DataOps covers all parts of the data management lifecycle.

        The DataOps lifecycle shown above takes data teams on a journey from raw data to insights. Where possible, DataOps stages are automated to accelerate time to value. The steps below show the full lifecycle of a data-driven application.

        1. Plan. Define how a business problem can be solved using data analytics. Identify the needed sources of data and the processing and analytics steps that will be required to solve the problem. Then select the right technologies, along with the delivery platform, and specify available budget and performance requirements.
        2. Create. Create the data pipelines and application code that will ingest, transform, and analyze the data. Based on the desired outcome, data applications are written using SQL, Scala, Python, R, or Java, among others.
3. Orchestrate. Connect the stages that need to work together to produce the desired result. Schedule code execution with reference to when the results are needed, when cost-effective processing is most available, and when related jobs (inputs and outputs, or steps in a pipeline) are running.
        4. Test & Fix. Simulate the process of running the code against the data sources in a sandbox environment. Identify and remove any bottlenecks in data pipelines. Verify results for correctness, quality, performance, and efficiency.
        5. Continuous Integration. Verify that the revised code meets established criteria to be promoted into production. Integrate the latest, tested and verified code and data sources incrementally, to speed improvements and reduce risk.
        6. Deploy. Select the best scheduling window for job execution based on SLAs and budget. Verify that the changes are an improvement; if not, roll them back, and revise.
        7. Operate. Code runs against data, solving the business problem, and stakeholder feedback is solicited. Detect and fix deviations in performance to ensure that SLAs are met.
        8. Monitor. Observe the full stack, including data pipelines and code execution, end-to-end. Data operators and engineers use tools to observe the progress of code running against data in a busy environment, solving problems as they arise.
        9. Optimize. Constantly improve the performance, quality, cost, and business outcomes of data applications and pipelines. Team members work together to optimize the application’s resource usage and improve its performance and effectiveness.
        10. Feedback. The team gathers feedback from all stakeholders – the data team itself, app users, and line of business owners. The team compares results to business success criteria and delivers input to the Plan phase.

        There are two overarching characteristics of DataOps that apply to every stage in the DataOps lifecycle: end-to-end observability and real-time collaboration.

        End-to-End Observability

End-to-end observability is key to delivering high-quality data products, on time and under budget. You need to be able to measure KPIs for your data-driven applications, the data sets they process, and the resources they consume. Key metrics include application/pipeline latency, SLA score, error rate, result correctness, cost of run, resource usage, data quality, and data usage.

        You need this visibility horizontally – across every stage and service of the data pipeline – and vertically, to see whether it is the application code, service, container, data set, infrastructure or another layer that is experiencing problems. End-to-end observability provides a single, trusted “source of truth” for data teams and data product users to collaborate around.

        Real-Time Collaboration

        Real-time collaboration is crucial to agile techniques; dividing work into short sprints, for instance, provides a work rhythm across teams. The DataOps lifecycle helps teams identify where in the loop they’re working, and to reach out to other stages as needed to solve problems – both in the moment, and for the long term.

        Real-time collaboration requires open discussion of results as they occur. The observability platform provides a single source of truth that grounds every discussion in shared facts. Only through real-time collaboration can a relatively small team have an outsized impact on the daily and long-term delivery of high-quality data products.

        Conclusion

        Through the use of a DataOps approach to their work, and careful attention to each step in the DataOps lifecycle, data teams can improve their productivity and the quality of the results they deliver to the organization.

        As the ability to deliver predictable and reliable business value from data assets increases, the business as a whole will be able to make more and better use of data in decision-making, product development, and service delivery.

        Advanced technologies, such as artificial intelligence and machine learning, can be implemented faster and with better results, leading to competitive differentiation and, in many cases, industry leadership.

        The post Why DataOps is Critical For The Success Of Your Business appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/why-dataops-is-critical/feed/ 0
        Unravel Data2021 Infographic https://www.unraveldata.com/resources/unravel-data2021-infographic/ https://www.unraveldata.com/resources/unravel-data2021-infographic/#respond Tue, 16 Mar 2021 22:20:59 +0000 https://www.unraveldata.com/?p=6360 abstract image with numbers

        Thank you for your interest in the Unravel Data2021 Infographic. You can download it here.

        The post Unravel Data2021 Infographic appeared first on Unravel.

        ]]>

        Thank you for your interest in the Unravel Data2021 Infographic.

        You can download it here.

        The post Unravel Data2021 Infographic appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-data2021-infographic/feed/ 0
        Minding the Gaps in Your Cloud Migration Strategy https://www.unraveldata.com/resources/minding-the-gaps-in-your-cloud-migration-strategy/ https://www.unraveldata.com/resources/minding-the-gaps-in-your-cloud-migration-strategy/#respond Thu, 28 Jan 2021 13:00:52 +0000 https://www.unraveldata.com/?p=5833

        As your organization begins planning and budgeting for 2021 initiatives, it’s time to take a critical look at your cloud migration strategy. If you’re planning to move your on-premises big data workloads to the cloud this […]

        The post Minding the Gaps in Your Cloud Migration Strategy appeared first on Unravel.

        ]]>

        As your organization begins planning and budgeting for 2021 initiatives, it’s time to take a critical look at your cloud migration strategy. If you’re planning to move your on-premises big data workloads to the cloud this year, you’re undoubtedly faced with a number of questions and challenges:

        • Which workloads are best suited for the cloud?
        • How much will each workload cost to run?
        • How do you manage workloads for optimal performance, while keeping costs down?

        Gartner Cloud Migration Report

        Neglecting careful workload planning and controls prior to cloud migration can lead to unforeseen cost spikes. That’s why we encourage you to read Gartner’s new report that cites serious gaps in how companies move to the cloud: “Mind the Gaps in DBMS Cloud Migration to Avoid Cost and Performance Issues.”

        Gartner’s timely report provides invaluable information for any enterprise with substantial database spending, whether on-premises, in the cloud, or migrating to the cloud. Organizations typically move to the cloud to save money, cutting costs by an average of 21% according to the report. However, Gartner finds that migrations are often more expensive and disruptive than initially planned because organizations neglect three crucial steps:

        • Price/performance comparison. They fail to assess the price and performance of their apps, both on-premises and after moving to the cloud.
        • Apps conversion assessment. They don’t assess the cost of converting apps to run effectively in the cloud, then get surprised by failed jobs and high costs.
• Ops conversion assessment. They don’t assess how operational (DataOps) tasks change across environments, so they fail to maximize their gains from the move.

When organizations do not take these important steps, they typically fail to complete the migration on time, overspend against their established cloud operational budgets, and miss critical optimization opportunities available in the cloud.

        Remove the Risk of Cloud Migration With Unravel Data

Unravel Data can help you fill in the gaps cited in the Gartner report, providing full-stack observability and AI-powered recommendations to drive more reliable performance on Azure, AWS, Google Cloud Platform, or in your own data center. By simplifying, operationalizing, and automating performance improvements, applications become more reliable and costs go down. Your team and your workflows will be more efficient and productive, so you can focus your resources on your larger vision.

        To learn more – including information about our Cloud Migration Acceleration Programs – contact us today. And make sure to download your copy of the Gartner report. Or start by reading our two-page executive summary.

        The post Minding the Gaps in Your Cloud Migration Strategy appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/minding-the-gaps-in-your-cloud-migration-strategy/feed/ 0
        “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/ https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/#respond Fri, 18 Dec 2020 21:42:41 +0000 https://www.unraveldata.com/?p=5694

          Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…” Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. […]

        The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

        ]]>

         

        Also see our blog post with stories from Untold speakers, “My Fitbit sleep score is 88…”

        Unravel Data recently held its first-ever customer conference, Untold 2020. We promoted Untold 2020 as a convocation of #datalovers. And these #datalovers generated some valuable data – including the interesting fact that more than 60% of surveyed customers have SLAs for either “more than 50% of their pipelines” (42%) or “all of their pipelines” (21%). 

        All of this ties together. Unravel makes it much easier for customers to set attainable SLAs for their pipelines, and to meet those SLAs once they’re set. Let’s look at some more data-driven findings from the conference. 

        And, if you’re an Unravel Data customer, reach out to access much more information, including webinar-type recordings of all five customer and industry expert presentations, with polling and results presentations interspersed throughout. If you are not yet a customer, but want to know more, you can create a free account or contact us

        Unravel Data CEO Kunal Agarwal kicking off Untold. 

        Note: “The plural of anecdotes is not data,” the saying goes – so, nostra culpa. The findings discussed herein are polling results from our customers who attended Untold, and they fall short of statistical significance. But they do represent the opinions of some of the most intense #datalovers in leading corporations worldwide – Unravel Data’s customer base. (“The great ones skate to where the puck’s going to be,” after Wayne Gretzky…) 

        More Than 60% of Customer Pipelines Have SLAs

        Using Unravel, more than 60% of the pipelines managed by our customers have SLAs:

        • More than 20% have SLAs for all their pipelines. 
        • More than 40% have SLAs for more than half of their pipelines. 
        • Fewer than 40% have SLAs for fewer than half their pipelines (29%) or fewer than a quarter of them (8%). 

        Pipelines were, understandably, a huge topic of conversation at Untold. Complex pipelines, and the need for better tools to manage them, are very common amongst our customers. And Unravel makes it possible for our customers to set, and meet, SLAs for their pipelines. 

What percentage of your data pipelines have SLAs?
• <25%: 8.3%
• >25-50%: 29.2%
• >50%: 41.7%
• All of them: 20.8%

         

        Bad Applications Cause Cost Overruns

        We asked our attendees the biggest reason for cost overruns:

• Roughly three-quarters replied, “Bad applications taking too many resources.” Finding out which applications are consuming all the resources is a key feature of Unravel software. 
        • Nearly a quarter replied, “Oversized containers.” Now, not everyone is using containers yet, so we are left to wonder just how many container users are unnecessarily supersizing their containers. Most of them?
        • And the remaining answer was, “small files.” One of the strongest presentations at the Untold conference was about the tendency of bad apps to generate a myriad of small files that consume a disproportionate share of resources; you can use Unravel to generate a small files report and track these problems down. 

What is usually the biggest reason for cost overruns?
• Bad applications taking too many resources: 75.0%
• Oversized containers: 20.0%
• Small files: 5.0%

         

        Two-Thirds Know Their Most Expensive App/User

        Amongst Unravel customers, almost two-thirds can identify their most expensive app(s) and user(s). Identifying costly apps and users is one of the strengths of Unravel Data:

        • On-premises, expensive apps and users consume far more than their share of system resources, slowing other jobs and contributing to instability and crashes. 
        • In moving to cloud, knowing who’s costing you in your on-premises estate – and whether the business results are worth the expense – is crucial to informed decision-making. 
        • In the cloud, “pay as you go” means that, as one of our presenters described it, “When you go, you pay.” It’s very easy for 20% of your apps and users to generate 80% of your cloud bill, and it’s very unlikely for that inflated bill to also represent 80% of your IT value. 
Do you know who is the most expensive user/app on your system?
• Yes: 65.0%
• No: 25.0%
• No, but would be great to know: 10.0%

         

        An Ounce of Prevention is Worth a Pound of Cure

        Knowing whether you have rogue users and/or apps on your system is very valuable:

        • A plurality (43%) of Unravel customers have rogue users/apps on their cluster “all the time.” 
        • A minority (19%) see this about once a day. 
        • A near-plurality (38%) only see this behavior once a week or so. 

        We would bet good money that non-Unravel customers see a lot more rogue behavior than our customers do. With Unravel, you can know exactly who and what is “going rogue” – and you can help stakeholders get the same results with far less use of system resources and cost. This cuts down on rogue behavior, freeing up room for existing jobs, and for new applications to run with hitherto unattainable performance. 

How often do you have rogue users/apps on your cluster?
• All the time: 42.9%
• Once a day: 19.0%
• Once a week: 38.1%

         

        Unravel Customers Are Fending Off Bad Apps

        Once you find and improve the bad apps that are currently in production, the logical next step is to prevent bad apps from even reaching production in the first place. Unravel customers are meeting this challenge:

        • More than 90% of attendees find automation helpful in preventing bad quality apps from being promoted into production. 
        • More than 80% have a quality gate when promoting apps from Dev to QA, and from QA to production. 
        • More than half have a well-defined DataOps/SDLC (software development life cycle) process, and nearly a third have a partially-defined process. Only about one-eighth have neither. 
        • About one-quarter have operations people/sysadmins troubleshooting their data pipelines; another quarter put the responsibility onto the developers or data engineers who create the apps. Nearly half make a game-time decision, depending on the type of issue, or use an “all hands on deck” approach with everyone helping. 

        The Rest of the Story

        • More than two-thirds are finding their data costs to be running over budget. 
        • More than half are in the process of migrating to cloud – though only a quarter have gotten to the point of actually moving apps and data, then optimizing and scaling the result. 
        • Half find that automating problem identification, root cause analysis, and resolution saves them 1-5 hours per issue; the other half save from 6-10 hours or more. 
        • Somewhat more than half find their clusters to be on the over-provisioned side. 
        • Nearly half have ten or more technologies in representative data pipelines. 

        Finding Out More

        Unravel Data customers – #datalovers all – look to be well ahead of the industry in managing issues that relate to big data, streaming data, and moving to the cloud. If you’re interested in joining this select group, you can create a free account or contact us. (There may still be some Untold conference swag left for new customers!) 

        The post “More than 60% of Our Pipelines Have SLAs,” Say Unravel Customers at Untold appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/pipelines-have-slas-unravel-customers-untold/feed/ 0
        Spark APM – What is Spark Application Performance Management https://www.unraveldata.com/resources/what-is-spark-apm/ https://www.unraveldata.com/resources/what-is-spark-apm/#respond Thu, 17 Dec 2020 21:01:51 +0000 https://www.unraveldata.com/?p=5675

        What is Spark? Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API […]

        The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

        ]]>

        What is Spark?

        Apache Spark is a fast and general-purpose engine for large-scale data processing. It’s most widely used to replace MapReduce for fast processing of data stored in Hadoop. Spark also provides an easy-to-use API for developers.

        Designed specifically for data science, Spark has evolved to support more use cases, including real-time stream event processing. Spark is also widely used in AI and machine learning applications.

        There are six main reasons to use Spark.

        1. Speed – Spark has an advanced engine that can deliver up to 100 times faster processing than traditional MapReduce jobs on Hadoop.

2. Ease of use – Spark supports Python, Java, R, and Scala natively, and offers dozens of high-level operators that make it easy to build applications (see the short sketch after this list).

        3. General purpose – Spark offers the ability to combine all of its components to make a robust and comprehensive data product.

        4. Platform agnostic – Spark runs in nearly any environment.

        5. Open source – Spark and its component and complementary technologies are free and open source; users can change the code as needed.

        6. Widely used – Spark is the industry standard for many tasks, so expertise and help are widely available.
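
To make the “ease of use” point concrete, here is a minimal Scala sketch (an illustration of ours, not taken from the original post; the file name and column names are made up) that reads a CSV file and summarizes it with a handful of high-level operators:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object QuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-quickstart")
      .master("local[*]") // run locally for the demo
      .getOrCreate()

    // Read a CSV file and summarize it with a few high-level operators.
    val events = spark.read.option("header", "true").csv("events.csv")
    val summary = events
      .filter(col("status") === "ok")
      .groupBy(col("region"))
      .agg(avg(col("latency_ms")).as("avg_latency_ms"))

    summary.show() // an action that triggers the actual computation
    spark.stop()
  }
}

Much the same few lines could be written in Python, Java, or R, which is a large part of Spark’s appeal.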

        A Brief History of Spark

        Today, when it comes to parallel big data analytics, Spark is the dominant framework that developers turn to for their data applications. But what came before Spark?

In the early 2000s, several developers – who would later be based mostly at Yahoo! – started working on an open, distributed computing platform. A few years later, these developers released their work as an open source project called Hadoop.

This is also approximately when Google published MapReduce, the programming model it used to process its huge volumes of data; Hadoop provides an open source Java implementation of MapReduce. While Hadoop grew in popularity for its ability to store and process massive volumes of data, developers at Facebook wanted to give their data science team an easier way to work with their data in Hadoop. As a result, they created Hive, a data warehousing framework built on Hadoop.

Even though Hadoop was gaining wide adoption at this point, there really weren’t any good interfaces for analysts and data scientists to use. So, in 2009, a group of people at the University of California, Berkeley’s AMPLab started a new project to solve this problem. Thus Spark was born – and it was released as open source a year later.

        What is Spark APM?

        Spark enables rapid innovation and high performance in your applications. But as your applications grow in complexity, inefficiencies are bound to be introduced. These inefficiencies add up to significant performance losses and increased processing costs.

For example, a Spark cluster may have idle time between batches because of slow data writes on the server. Batches sit idle because the next batch can’t start until all of the previous batch’s tasks have completed. Your Spark jobs are “bottlenecked on writes.”

        When this happens, you can’t scale your application horizontally – adding more servers to help with processing won’t improve your application performance. Instead, you’d just be increasing the idle time of your clusters.

        Unravel Data for Spark APM

        This is where Unravel Data comes in to save the day. Unravel Data for Spark provides a comprehensive full-stack, intelligent, and automated approach to Spark operations and application performance management across your modern data stack architecture.

        The Unravel platform helps you analyze, optimize, and troubleshoot Spark applications and pipelines in a seamless, intuitive user experience. Operations personnel, who have to manage a wide range of technologies, don’t need to learn Spark in great depth in order to significantly improve the performance and reliability of Spark applications.

        See Unravel for Spark in Action

        Create a free account

        Example of Spark APM with Unravel Data

A Spark application consists of one or more jobs, each of which in turn has one or more stages. A job corresponds to a Spark action – for example, count, take, foreach, etc. Within the Unravel platform, you can view all of the details of your Spark application.
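
As a small, hedged illustration of how actions map to jobs (a sketch of ours, not Unravel code), each action below launches its own Spark job, and the shuffle introduced by distinct() adds an extra stage to its job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jobs-and-stages").master("local[*]").getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 100000)
val evens = numbers.filter(_ % 2 == 0) // a transformation: no job is launched yet

println(evens.count())            // action #1 -> job 1
println(evens.distinct().count()) // action #2 -> job 2, which needs a shuffle stage
evens.take(5).foreach(println)    // action #3 -> job 3

spark.stop()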

        Unravel’s Spark APM lets you:

        • Quickly see which jobs and stages consume the most resources
        • View your app as a resilient distributed dataset (RDD) execution graph
        • Drill into the source code from the stage tile, Spark stream batch tile, or the execution graph to locate the problems

        Unravel Data for Spark APM can then help you:

        • Resolve inefficiencies, bottlenecks, and reasons for failure within apps
        • Optimize resource allocation for Spark drivers and executors
        • Detect and fix poor partitioning
        • Detect and fix inefficient and failed Spark apps
        • Use recommended settings to tune for speed and efficiency

        Improve Your Spark Application Performance

        To learn how Unravel can help improve the performance of your applications, create a free account and take it for a test drive on your Spark applications. Or, contact Unravel.

        The post Spark APM – What is Spark Application Performance Management appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/what-is-spark-apm/feed/ 0
        Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/ https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/#respond Sat, 12 Dec 2020 03:44:11 +0000 https://www.unraveldata.com/?p=5632 Logo for Apache Spark 3.0

        The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often […]

        The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

        ]]>

        The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often treat Catalyst as a black box that just magically works. Moreover, only a handful of resources are available that explain its inner workings in an accessible manner.

When discussing Catalyst, many presentations and articles reference this architecture diagram, which was included in the original blog post from Databricks that introduced Catalyst to a wider audience. That diagram depicts the physical planning phase inadequately, and it is ambiguous about which kinds of optimizations are applied at which point.

The following sections explain Catalyst’s logic, the optimizations it performs, its most crucial constructs, and the plans it generates. In particular, the scope of cost-based optimizations is outlined. These were advertised as “state-of-the-art” when the framework was introduced, but the article below describes significant limitations. And, last but not least, we have redrawn the architecture diagram:

        The Spark Catalyst Pipeline

        This diagram and the description below focus on the second half of the optimization pipeline and do not cover the parsing, analyzing, and caching phases in much detail.

Diagram 1: The Catalyst pipeline 

        The input to Catalyst is a SQL/HiveQL query or a DataFrame/Dataset object which invokes an action to initiate the computation. This starting point is shown in the top left corner of the diagram. The relational query is converted into a high-level plan, against which many transformations are applied. The plan becomes better-optimized and is filled with physical execution information as it passes through the pipeline. The output consists of an execution plan that forms the basis for the RDD graph that is then scheduled for execution.
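
A convenient way to see these plans for yourself is Dataset.explain(extended = true). The sketch below adapts the example query used later in this article (the SparkSession setup and input path are assumptions on our part):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").csv("example.csv")
  .select("letter", "index")
  .filter($"index" < 10)

// Prints the Parsed, Analyzed, and Optimized Logical Plans, followed by the Physical Plan.
df.explain(true)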

        Nodes and Trees

        Six different plan types are represented in the Catalyst pipeline diagram. They are implemented as Scala trees that descend from TreeNode. This abstract top-level class declares a children field of type Seq[BaseType]. A TreeNode can therefore have zero or more child nodes that are also TreeNodes in turn. In addition, multiple higher-order functions, such as transformDown, are defined, which are heavily used by Catalyst’s transformations.

        This functional design allows optimizations to be expressed in an elegant and type-safe fashion; an example is provided below. Many Catalyst constructs inherit from TreeNode or operate on its descendants. Two important abstract classes that were inspired by the logical and physical plans found in databases are LogicalPlan and SparkPlan. Both are of type QueryPlan, which is a direct subclass of TreeNode.

        In the Catalyst pipeline diagram, the first four plans from the top are LogicalPlans, while the bottom two – Spark Plan and Selected Physical Plan – are SparkPlans. The nodes of logical plans are algebraic operators like Join; the nodes of physical plans (i.e. SparkPlans) correspond to low-level operators like ShuffledHashJoinExec or SortMergeJoinExec. The leaf nodes read data from sources such as files on stable storage or in-memory lists. The tree root represents the computation result.

        Transformations

        An example of Catalyst’s trees and transformation patterns is provided below. We have used expressions to avoid verbosity in the code. Catalyst expressions derive new values and are also trees. They can be connected to logical and physical nodes; an example would be the condition of a filter operator.

        The following snippet transforms the arithmetic expression –((11 – 2) * (9 – 5)) into – ((1 + 0) + (1 + 5)):

        
        import org.apache.spark.sql.catalyst.expressions._
         val firstExpr: Expression = UnaryMinus(Multiply(Subtract(Literal(11), Literal(2)), Subtract(Literal(9), Literal(5))))
         val transformed: Expression = firstExpr transformDown {
           case BinaryOperator(l, r) => Add(l, r)
           case IntegerLiteral(i) if i > 5 => Literal(1)
           case IntegerLiteral(i) if i < 5 => Literal(0)
         }
        

        The root node of the Catalyst tree referenced by firstExpr has the type UnaryMinus and points to a single child, Multiply. This node has two children, both of type Subtract

        The first Subtract node has two Literal child nodes that wrap the integer values 11 and 2, respectively, and are leaf nodes. firstExpr is refactored by calling the predefined higher-order function transformDown with custom transformation logic: The curly braces enclose a function with three pattern matches. They convert all binary operators, such as multiplication, into addition; they also map all numbers that are greater than 5 to 1, and those that are smaller than 5 to zero.

        Notably, the rule in the example gets successfully applied to firstExpr without covering all of its nodes: UnaryMinus is not a binary operator (having a single child) and 5 is not accounted for by the last two pattern matches. The parameter type of transformDown is responsible for this positive behavior: It expects a partial function that might only cover subtrees (or not match any node) and returns “a copy of this node where `rule` has been recursively applied to it and all of its children (pre-order)”.

        This example appears simple, but the functional techniques are powerful. At the other end of the complexity scale, for example, is a logical rule that, when it fires, applies a dynamic programming algorithm to refactor a join.

        Catalyst Plans

        The following (intentionally bad) code snippet will be the basis for describing the Catalyst plans in the next sections:

        
        val result = session.read.option("header", "true").csv(outputLoc)
           .select("letter", "index")
           .filter($"index" < 10) 
           .filter(!$"letter".isin("h") && $"index" > 6)
        result.show()
        

         

        The complete program can be found here. Some plans that Catalyst generates when evaluating this snippet are presented in a textual format below; they should be read from the bottom up. A trick needs to be applied when using pythonic DataFrames, as their plans are hidden; this is described in our upcoming follow-up article, which also features a general interpretation guide for these plans.

        Parsing and Analyzing

        The user query is first transformed into a parse tree called Unresolved or Parsed Logical Plan:

        
        'Filter (NOT 'letter IN (h) && ('index > 6))
        +- Filter (cast(index#28 as int) < 10)
           +- Project [letter#26, index#28]
      +- Relation[letter#26,letterUppercase#27,index#28] csv
        

         

This initial plan is then analyzed, which involves tasks such as checking semantic correctness and resolving attribute references, such as the names of columns and tables; the result is the Analyzed Logical Plan.

        In the next step, a cache manager is consulted. If a previous query was persisted, and if it matches segments in the current plan, then the respective segment is replaced with the cached result. Nothing is persisted in our example, so the Analyzed Logical Plan will not be affected by the cache lookup.

        A textual representation of both plans is included below:

        
        Filter (NOT letter#26 IN (h) && (cast(index#28 as int) > 6))
        +- Filter (cast(index#28 as int) < 10)
           +- Project [letter#26, index#28]
      +- Relation[letter#26,letterUppercase#27,index#28] csv
        

         

        Up to this point, the operator order of the logical plan reflects the transformation sequence in the source program, if the former is interpreted from the bottom up and the latter is read, as usual, from the top down. A read, corresponding to Relation, is followed by a select (mapped to Project) and two filters. The next phase can change the topology of the logical plan.

        Logical Optimizations

        Catalyst applies logical optimization rules to the Analyzed Logical Plan with cache data. They are part of rule groups which are internally called batches. There are 25 batches with 109 rules in total in the Optimizer (Spark 2.4.7). Some rules are present in more than one batch, so the number of distinct logical rules boils down to 69. Most batches are executed once, but some can run repeatedly until the plan converges or a predefined maximum number of iterations is reached. Our program CollectBatches can be used to retrieve and print this information, along with a list of all rule batches and individual rules; the output can be found here.

        Major rule categories are operator pushdowns/combinations/replacements and constant foldings. Twelve logical rules will fire in total when Catalyst evaluates our example. Among them are rules from each of the four aforementioned categories:

        • Two applications of PushDownPredicate that move the two filters before the column selection
        • A CombineFilters that fuses the two adjacent filters into one
        • An OptimizeIn that replaces the lookup in a single-element list with an equality comparison
        • A RewritePredicateSubquery which rearranges the elements of the conjunctive filter predicate that CombineFilters created

        These rules reshape the logical plan shown above into the following Optimized Logical Plan:

        
        Project [letter#26, index#28], Statistics(sizeInBytes=162.0 B)
         +- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6)), Statistics(sizeInBytes=230.0 B)
    +- Relation[letter#26,letterUppercase#27,index#28] csv, Statistics(sizeInBytes=230.0 B)
        

         

        Quantitative Optimizations

        Spark’s codebase contains a dedicated Statistics class that can hold estimates of various quantities per logical tree node. They include:

        • Size of the data that flows through a node
        • Number of rows in the table
        • Several column-level statistics:
          • Number of distinct values and nulls
          • Minimum and maximum value
          • Average and maximum length of the values
          • An equi-height histogram of the values

Estimates for these quantities are eventually propagated and adjusted throughout the logical plan tree. A filter or aggregate, for example, can significantly reduce the number of records; the planning of a downstream join might benefit from using this modified size, rather than the original input size of the leaf node.

        Two estimation approaches exist:

        • The simpler, size-only approach is primarily concerned with the first bullet point, the physical size in bytes. In addition, row count values may be set in a few cases.
        • The cost-based approach can compute the column-level dimensions for Aggregate, Filter, Join, and Project nodes, and may improve their size and row count values.

For other node types, the cost-based technique just delegates to the size-only methods. The approach chosen depends on the spark.sql.cbo.enabled property: if you intend to traverse the logical tree with a cost-based estimator, set this property to true. In addition, table and column statistics should be collected for the advanced dimensions before the query executes in Spark. This can be achieved with the ANALYZE command. However, a severe limitation seems to exist for this collection process, according to the latest documentation: “currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run”.
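
As a hedged sketch (assuming a SparkSession named spark; the table and column names are hypothetical, and the limitation quoted above still applies), switching the estimator over and collecting the statistics it needs might look like this:

// Enable the cost-based estimator and cost-based join reordering before running queries.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table-level statistics (size in bytes, row count) ...
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
// ... and column-level statistics (distinct and null counts, min/max values, and so on).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")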

        The estimated statistics can be used by two logical rules, namely CostBasedJoinReorder and ReorderJoin (via StarSchemaDetection), and by the JoinSelection strategy in the subsequent phase, which is described further below.

        The Cost-Based Optimizer

        A fully fledged Cost-Based Optimizer (CBO) seems to be a work in progress, as indicated here: “For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs”. The CBO facilitates the CostBasedJoinReorder rule and may improve the quality of estimates; both can lead to better planning of joins. Concerning column-based statistics, only the two count stats (distinctCount and nullCount) appear to participate in the optimizations; the other advanced stats may improve the estimations of the quantities that are directly used.

        The scope of quantitative optimizations does not seem to have significantly expanded with the introduction of the CBO in Spark 2.2. It is not a separate phase in the Catalyst pipeline, but improves the join logic in several important places. This is reflected in the Catalyst optimizer diagram by the smaller green rectangles, since the stats-based optimizations are outnumbered by the rule-based/heuristic ones.

        The CBO will not affect the Catalyst plans in our example for three reasons:

        • The property spark.sql.cbo.enabled is not modified in the source code and defaults to false.
        • The input consists of raw CSV files for which no table/column statistics were collected.
        • The program operates on a single dataset without performing any joins.

        The textual representation of the optimized logical plan shown above includes three Statistics segments which only hold values for sizeInBytes. This further indicates that size-only estimations were used exclusively; otherwise attributeStats fields with multiple advanced stats would appear.

        Physical Planning

        The optimized logical plan is handed over to the query planner (SparkPlanner), which refactors it into a SparkPlan with the help of strategies — Spark strategies are matched against logical plan nodes and map them to physical counterparts if their constraints are satisfied. After trying to apply all strategies (possibly in several iterations), the SparkPlanner returns a list of valid physical plans with at least one member.

        Most strategy clauses replace one logical operator with one physical counterpart and insert PlanLater placeholders for logical subtrees that do not get planned by the strategy. These placeholders are transformed into physical trees later on by reapplying the strategy list on all subtrees that contain them.

        As of Spark 2.4.7, the query planner applies ten distinct strategies (six more for streams). These strategies can be retrieved with our CollectStrategies program. Their scope includes the following items:

• The push-down of filters to the data source and the pruning of the columns read from it
        • The planning of scans on partitioned/bucketed input files
        • The aggregation strategy (SortAggregate, HashAggregate, or ObjectHashAggregate)
        • The choice of the join algorithm (broadcast hash, shuffle hash, sort merge, broadcast nested loop, or cartesian product)

        The SparkPlan of the running example has the following shape:

        
        Project [letter#26, index#28]
        +- Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
           +- FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>
        

         

        A DataSource strategy was responsible for planning the FileScan leaf node and we see multiple entries in its PushedFilters field. Pushing filter predicates down to the underlying source can avoid scans of the entire dataset if the source knows how to apply them; the pushed IsNotNull and Not(EqualTo) predicates will have a small or nil effect, since the program reads from CSV files and only one cell with the letter h exists.

The original architecture diagram depicts a Cost Model that is supposed to choose the SparkPlan with the lowest execution cost. However, such plan space explorations are still not realized in Spark 3.0.1; the code trivially picks the first available SparkPlan that the query planner returns after applying the strategies.

A simple, “localized” cost model might be implicitly defined in the codebase by the order in which rules and strategies, and their internal logic, are applied. For example, the JoinSelection strategy assigns the highest precedence to the broadcast hash join. This tends to be the best join algorithm in Spark when one of the participating datasets is small enough to fit into memory. If its conditions are met, the logical node is immediately mapped to a BroadcastHashJoinExec physical node, and the plan is returned. Alternative plans with other join nodes are not generated.
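
As a hedged sketch of how an application can influence this choice (the DataFrames below are toy data of our own):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-selection").master("local[*]").getOrCreate()
import spark.implicits._

val facts = Seq((1, 100.0), (2, 250.0), (1, 75.0)).toDF("key", "amount") // the "large" side
val dims  = Seq((1, "north"), (2, "south")).toDF("key", "region")        // the small side

// Explicit hint: ask JoinSelection for a broadcast hash join on the small side.
facts.join(broadcast(dims), Seq("key")).explain() // the plan should show a broadcast hash join

// The automatic choice: relations below this size (in bytes) are broadcast; -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)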

        Physical Preparation

        The Catalyst pipeline concludes by transforming the Spark Plan into the Physical/Executed Plan with the help of preparation rules. These rules can affect the number of physical operators present in the plan. For example, they may insert shuffle operators where necessary (EnsureRequirements), deduplicate existing Exchange operators where appropriate (ReuseExchange), and merge several physical operators where possible (CollapseCodegenStages).

        The final plan for our example has the following format:

        
        *(1) Project [letter#26, index#28]
         +- *(1) Filter ((((isnotnull(index#28) && isnotnull(letter#26)) && (cast(index#28 as int) < 10)) && NOT (letter#26 = h)) && (cast(index#28 as int) > 6))
            +- *(1) FileScan csv [letter#26,index#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/example], PartitionFilters: [], PushedFilters: [IsNotNull(index), IsNotNull(letter), Not(EqualTo(letter,h))], ReadSchema: struct<letter:string,index:string>
        

         

        This textual representation of the Physical Plan is identical to the preceding one for the Spark Plan, with one exception: Each physical operator is now prefixed with a *(1). The asterisk indicates that the aforementioned CollapseCodegenStages preparation rule, which “compiles a subtree of plans that support codegen together into a single Java function,” fired.

        The number after the asterisk is a counter for the CodeGen stage index. This means that Catalyst has compiled the three physical operators in our example together into the body of one Java function. This comprises the first and only WholeStageCodegen stage, which in turn will be executed in the first and only Spark stage. The code for the collapsed stage will be generated by the Janino compiler at run-time.
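
        If you want to look at the Java source that Janino compiles for such a collapsed stage, Spark ships a debug helper for exactly that. A quick sketch, where df stands for the DataFrame of the running example:

        // Inspect the code generated for whole-stage codegen; `df` is the DataFrame built earlier.
        import org.apache.spark.sql.execution.debug._

        // Prints the generated Java source for each WholeStageCodegen subtree.
        df.debugCodegen()

        // In Spark 3.x, df.explain("codegen") produces similar output.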

        Adaptive Query Execution in Spark 3

        One of the major enhancements introduced in Spark 3.0 is Adaptive Query Execution (AQE), a framework that can improve query plans during run-time. Instead of passing the entire physical plan to the scheduler for execution, Catalyst tries to slice it up into subtrees (“query stages”) that can be executed in an incremental fashion.

        The adaptive logic begins at the bottom of the physical plan, with the stage(s) containing the leaf nodes. Upon completion of a query stage, its run-time data statistics are used for reoptimizing and replanning its parent stages. The scope of this new framework is limited: The official SQL guide states that “there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skew join optimization.” A query must not be a streaming query, and its plan must contain at least one Exchange node (induced by a join, for example) or a subquery; otherwise, an adaptive execution will not even be attempted. And even if a query satisfies these requirements, it may still not get improved by AQE.
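
        AQE and its three major features are controlled by a handful of SQL configs. The following sketch shows the Spark 3.0 settings that map to the optimizations named above; note that in Spark 3.0 the master switch is off by default and must be enabled explicitly.

        // Enable AQE and its three major features (Spark 3.0 config names).
        val spark = org.apache.spark.sql.SparkSession.builder
          .appName("AqeConfig")
          .config("spark.sql.adaptive.enabled", "true")                    // master switch, off by default in 3.0
          .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce post-shuffle partitions
          .config("spark.sql.adaptive.skewJoin.enabled", "true")           // skew join optimization
          .config("spark.sql.adaptive.localShuffleReader.enabled", "true") // local shuffle reader after join conversion
          .getOrCreate()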

        A large part of the adaptive logic is implemented in a special physical root node called AdaptiveSparkPlanExec. Its method getFinalPhysicalPlan is used to traverse the plan tree recursively while creating and launching query stages. The SparkPlan that was derived by the preceding phases is divided into smaller segments at Exchange nodes (i.e., shuffle boundaries), and query stages are formed from these subtrees. When these stages complete, fresh run-time stats become available, and can be used for reoptimization purposes.

        A dedicated, slimmed-down logical optimizer with just one rule, DemoteBroadcastHashJoin, is applied to the logical plan associated with the current physical one. This special rule attempts to prevent the planner from converting a sort merge join into a broadcast hash join if it detects a high ratio of empty partitions. In this specific scenario, the first join type will likely be more performant, so a no-broadcast-hash-join hint is inserted.

        The new optimized logical plan is fed into the default query planner (SparkPlanner), which applies its strategies and returns a SparkPlan. Finally, a small number of physical operator rules are applied that take care of the subqueries and missing ShuffleExchangeExecs. This results in a refactored physical plan, which is then compared to the original version. The physical plan with the lowest number of shuffle exchanges is considered cheaper to execute and is therefore chosen.

        When query stages are created from the plan, four adaptive physical rules are applied that include CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader. The logic of AQE’s major new optimizations is implemented in these case classes. Databricks explains these in more detail here. A second list of dedicated physical optimizer rules is applied right after the creation of a query stage; these mostly make internal adjustments.

        The last paragraphs mentioned four adaptive optimizations, implemented in one logical rule (DemoteBroadcastHashJoin) and three physical ones (CoalesceShufflePartitions, OptimizeSkewedJoin, and OptimizeLocalShuffleReader). All four make use of a special statistics class that holds the output size of each map output partition. The statistics mentioned in the Quantitative Optimization section above do get refreshed, but they will not be leveraged by the standard logical rule batches, as AQE uses a custom logical optimizer.

        This article has described Spark’s Catalyst optimizer in detail. Developers who are unhappy with its default behavior can add their own logical optimizations and strategies, or exclude specific logical optimizations. However, devising customized logic for query patterns can become complicated and time-consuming, and other factors, such as the choice of machine types, may also have a significant impact on application performance. Unravel Data can automatically analyze Spark workloads and provide tuning recommendations, so developers are less burdened with studying query plans and identifying optimization opportunities.
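
        For readers who do want to experiment, Spark exposes a few public hooks for the customizations mentioned above. The sketch below is illustrative only: RemoveNothing is a deliberately trivial placeholder rule, and the excluded rule is just an example whose exact name can vary slightly across Spark versions.

        // Public hooks for customizing Catalyst; RemoveNothing is a trivial placeholder rule.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
        import org.apache.spark.sql.catalyst.rules.Rule

        object RemoveNothing extends Rule[LogicalPlan] {
          // A no-op rule, shown only to illustrate the wiring.
          override def apply(plan: LogicalPlan): LogicalPlan = plan
        }

        val spark = SparkSession.builder.appName("CustomCatalyst").getOrCreate()

        // Inject a custom logical optimization (extraStrategies works analogously for physical planning).
        spark.experimental.extraOptimizations ++= Seq(RemoveNothing)

        // Exclude specific built-in optimizer rules by their fully qualified names.
        spark.conf.set("spark.sql.optimizer.excludedRules",
          "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")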

        The post Spark Catalyst Pipeline: A Deep Dive into Spark’s Optimizer appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/feed/ 0
        10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/ https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/#respond Fri, 11 Dec 2020 17:29:05 +0000 https://www.unraveldata.com/?p=5622

        Managing online teams has become the new normal! In an online world, how do you give effective feedback, have a difficult conversation, increase team accountability, communicate to stakeholders effectively, and so on? At Unravel, we are […]

        The post 10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era appeared first on Unravel.

        ]]>

        Managing online teams has become the new normal! In an online world, how do you give effective feedback, have a difficult conversation, increase team accountability, communicate to stakeholders effectively, and so on?

        At Unravel, we are a fast-growing AI startup with a globally distributed engineering team across the US, EMEA, and India. Even before the pandemic this year, the global nature of our team has prepared us for effectively leading outcomes across online engineering teams.

        To help fellow Engineering Leaders (Managers, Tech Leads, ICs), we are making available our 10X Engineering Leadership Video series. Instead of hours of training sessions (where very little is ingested and retained), we developed a new format – 15 min playbooks with concrete actionable levers leaders can apply right away!

        This micro-learning async approach has served us well and allows leaders to pick topics most relevant to their needs. Each playbook has an assignment — the intention is for leaders to discuss and learn from peer leaders. The playbooks help create a shared terminology among leaders especially required in an online setting.

        We discovered there are three categories where engineering leaders (including seasoned ones) often struggle when it comes to online teams — creating clarity, driving execution accountability, and coaching team members to deliver the best work of their lives.

        Playbooks for Creating Clarity

        “Everyone in the team is running fast” — “But in different directions!” It is very easy to get out of sync, especially in online teams. Also, clarity is required for team members to effectively balance tactical fires while ensuring long-term initiatives are delivered.

        • Defining Why: Is your team suffering from the myth, “Build it, and they (customers) will come”? The mantra today is, “If they will come, we will build it.” This playbook covers three plays (that are best taught at Intuit): Follow-me-home, customer need/ideal state statement, Leap-of-Faith-Assumptions statement.
        • Defining What: The field team is expecting an elephant, the product management is thinking of a giraffe, and engineering delivers a horse. How to align everyone on the same “what.” This playbook covers three plays: User story Jiras, Working backward, Feature-kickoff meeting.
        • Defining How: Is the team aligned on the new feature scope, tradeoffs, dependencies? Are they thinking about long pole tasks proactively and front-loading risk? To help align on How, this playbook covers four plays.
        • Clarifying priorities as OKRs: In an online setting, objectives and key results (OKRs) are a great way for leaders to communicate their priorities. Are you using OKRs effectively, or are they treated as just another tracking overhead? This playbook covers top-down and bottom-up plays in defining OKRs effectively.
        • Effective stakeholder updates: “How is the customer issue resolution coming along?” Leaders need to learn to provide an online (Slack) response in 2-3 sentences. Whether it is a 50-second response or a 50-min planning meeting, this playbook covers the corresponding plays: SCQA Pyramid, Amazon 6-pager, Weekly progress updates.

        Driving Execution Accountability

        “Vision without execution… is just hallucination,” or “Vision without action is a daydream.” Inspiring outcomes requires leaders to have the rigor for execution across the team.

        • Your operating checklist: All leaders have the same suitcase-size of 24 hours — what you fill in it defines your leadership effectiveness. This playbook covers plays for creating an operating rhythm of regular checkpoints with the team and helps you get organized to tackle unplanned work that inevitably will show up on your plate.
        • Applying Extreme Ownership: “There are no bad teams, only bad leaders” — as a leader, accepting responsibility for everything that impacts the outcome is a foundational building block, especially in an online setting. This playbook covers extreme ownership behaviors critical to demonstrate in high-performance teams.
        • Getting better with retrospectives: Similar to individuals, teams need to have a growth-mindset — getting better with each sprint and taking on bigger and bigger challenges together. This playbook covers plays on conducting retrospectives effectively with the lens of both what and how tasks were accomplished.
        • Effective delegation: Are leaders scaling themselves effectively in an online setting? Effective delegation is also important for leaders for growing their teams. This playbook covers Andy Grove’s Task-Relevant Maturity (TRM) and applying it effectively!
        • Clarifying roles: In an online setting, be very clear about each individual’s role on a given project — who is the driver, approver, collaborator, and who is kept informed? This playbook covers plays such as DACI/RACI/RAPID, one-way & two-way doors, disagree and commit, and effective escalations.

        Coaching Team

        Ara Parseghian says, "A good coach will make his players see what they can be rather than what they are."

        • Effective online weekly check-ins: For team members, one of the biggest motivators and sources of satisfaction comes from making progress on meaningful goals. The playbook covers how to have an effective 15-min check-in with everyone on the team each week.
        • Giving actionable feedback: Leaders typically are not comfortable sharing feedback online. This playbook covers SBI (Situation-Behavior-Impact) and related plays.
        • Structured interviewing process: As leaders, effective hiring is the foundational building block of effective outcomes. How to hire the right candidates via online interviewing? Hint: It’s relying less on your gut for decision-making. This playbook covers a structured interviewing approach you need to implement.
        • Increasing team trust & effectiveness: Trust within the team members is the foundational building block for effective outcomes. Leaders need to work for developing trust and accountability. This playbook provides actionable levers to apply.
        • Having difficult conversations: An employee is consistently missing timelines or is making others in the team uncomfortable with their actions. While most leaders avoid difficult conversations even in office settings, online makes this even more difficult. This playbook covers how to create shared understanding.
        • Sharpening EQ: An effective leader needs to have IQ+Technical Skills+EQ. While this is a very broad topic, the playbook covers key actionable next steps.
        • Working on your mindset: Leaders need to be aware of their mindsets and biases. The lens with which they view themselves and others is important to understand. This playbook covers plays on mindsets to inculcate and those to watch out for.

        We continue to add more playbooks to the series and are committed to investing and growing our leaders within Unravel, as well as helping the community. Subscribe to the channel to make sure you do not miss out on these playbooks.

        The post 10X Engineering Leadership Series: 21 Playbooks to Lead in the Online Era appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/10x-engineering-leadership-series-21-playbooks-to-lead-in-the-online-era/feed/ 0
        “My Fitbit Score is Now 88” – Sharing Success Stories at the First Unravel Customer Conference https://www.unraveldata.com/resources/my-fitbit-score-is-now-88-sharing-success-stories-at-the-first-unravel-customer-conference/ https://www.unraveldata.com/resources/my-fitbit-score-is-now-88-sharing-success-stories-at-the-first-unravel-customer-conference/#respond Thu, 10 Dec 2020 07:02:32 +0000 https://www.unraveldata.com/?p=5564 Untold Blog

        Also see our blog post with stats from our Untold conference polls, “More than 60% of our Pipelines have SLAs…” Unravel Data recently held our first-ever customer conference, Untold. Untold was a four-hour virtual event exclusively […]

        The post “My Fitbit Score is Now 88” – Sharing Success Stories at the First Unravel Customer Conference appeared first on Unravel.

        ]]>
        Untold Blog

        Also see our blog post with stats from our Untold conference polls, “More than 60% of our Pipelines have SLAs…”

        Unravel Data recently held our first-ever customer conference, Untold. Untold was a four-hour virtual event exclusively for and by Unravel customers, with five talks and an audience of 200 enthusiastic customers. The event featured powerful presentations, lively Q&A sessions, and interactive discussions on a special Slack channel. 

        Unravel customers can view the recorded presentations and communicate with the speakers, as well as their peers who attended the event. If you’re among the hundreds of existing Unravel customers, including three of the top five North American banks, reach out to your customer success representative for access. If you are not yet a customer, you can create a free account or contact us

        The talks included: 

        • Using Unravel for DataOps and SDLC enhancement
        • Keeping complex data pipelines healthy
        • Solving the small files problem (with great decreases in resource usage and large increases in throughput)
        • Managing thousands of nodes in scores of ever-busier modern data  clusters

        One speaker even described a successful cloud migration and the key pain points that they had to overcome.

        In a “triple threat” presentation, with three speakers addressing different use cases for Unravel in the same organization, the middle speaker captured much of the flavor of the day in a single comment: “Especially with cloud costs, I think Unravel could be a game changer for every one of us.” 

        Another speaker summed up “the Unravel effect” in their organization: “We can invest those dollars that we’re saving elsewhere, which makes everybody in our organization super happy.”

        Brief descriptions of each talk, with highlights from each of our speakers, follow. Look for case studies on these uses of Unravel in the weeks ahead. 

        Using Unravel to Improve the Software Development Life Cycle and DataOps

        A major bank uses DataOps and Unravel to support the software development life cycle (SDLC) – to create, validate, and promote apps to production. And Unravel helps operators to orchestrate the apps in production. Unravel significantly shortens development time and improves software quality and performance. 

        Highlights from the presentation include:

          • “We have Unravel integrated with our entire software development life cycle. Developers get all the insights into how well their code will perform before they go to production.”
          • “The Unravel toolset is something like a starting point to log into our system, and do all the work, from starting building software to deployment level to production.”
          • “When less experienced users look at Spark code, they don’t understand what they’re seeing. When they look at Unravel recommendations, they understand quickly, and then they go fix it.”
          • “As we move to the cloud, Unravel’s APIs will be useful. Today you run your code and spend $10. Tomorrow, you’ll check your code in Unravel, you’ll implement all these recommendations, you’ll spend $5 instead of $10.”

        Test-drive Unravel for DataOps

        Create a free account

        Managing Mission-Critical Pipelines at Scale

        A leading software publisher manages thousands of complex, mission-critical pipelines, with tight SLAs – and costs that can range up into the millions of dollars associated with any failures. Unravel gives this organization fine-grained visibility into their pipelines in production, allowing issues to be detected and fixed before they become problems. 

        Key points from the talk:

        • “At my company, this is a simple pipeline. (Speaker shows slide with Rube Goldberg pipeline image – Ed.) One of our complicated pipelines has a 50-page user manual. We partnered with the Unravel team to stitch these pipelines together.” 
        • “Unravel is helping us to actually predict the problems in advance and early in the process, ensuring we can bring these pipelines’ health back to normal, and making sure our SLAs are met.”
        • “Last year, to be frank, we were around 80% SLA accomplishments. But, through the two quarters, we are 98% and above.”

        Easing Operations and Cutting Costs 

        A fast-growing marketing services company uses Unravel to manage operations, reduce costs, and support leadership use cases, all while orchestrating and managing cloud migration – and saves hugely on resources by solving the small files problem. 

        This talk actually involved three separate perspectives. A few highlights follow:

        • “On-prem Hadoop, that’s where the Unraveled story started for us.”
        • “We had around 1,800 cores allocated. After the config changes recommended by Unravel, the cores allocated went down to 120.”
        • “Unravel helped us build an executive dashboard… the response time from Unravel on it was great. The whole entire experience was wonderful for me.”
        • “Proactive alerting – AutoActions – catches runaway queries, rogue jobs, and resource contention issues. It makes our Hadoop clusters more efficient, and our users good citizens.” 
        • “We were over-allocating for jobs in Spark, and I’m not an expert Spark user. Unravel recommendations and AutoActions helped solve the problem of over-allocation, without my having to learn more about Spark.”
        • “Having the data from Unravel makes our conversations with our tenants much more productive, and has helped to build trust among our teams.” 
        • “We can invest those dollars that we’re saving elsewhere, which makes everybody in our organization super happy.”

        A Successful Move to Cloud 

        A leading provider of consumer financial software moved to the cloud successfully –  but they had to use inadequate tools, including tracking the move in Excel spreadsheets and saving data in CSV files. They were able to complete the move in less than a year, “leaving no app behind.” 

        As the speaker described, it was a huge job: 

        • “Think of our move from on-premises to the cloud as changing the engines of the plane while the plane is flying.”
        • “We actually started the exercise with more than 20,000 tables, but careful analysis showed that half of them were unused.”
        • “We have many, many critical pipelines, and there were millions of dollars at stake for these things not to be correct, complete, or within SLA.” 
        • “The cloud is not forgiving when it comes to cost. Linear increases in usage lead to linear increases in costs.” 
        • “The cloud offers pay as you go, which sounds great. But when you go, you pay.” 

        The move made clear the need for Unravel Data as a supportive platform for assessing the on-premises data estate, matching on-premises software tools to their closest equivalents on each of the major public cloud platforms, and tracking the success of the move. 

        Simplify the complexity of data cloud migration

        Create a free account

         

        Unravel Right-Sizes Resource Consumption – as the Pandemic Sends Traffic Soaring

        One of the world’s largest financial services companies sees usage passing critical thresholds – and then the pandemic sends their data processing needs through the roof. They used Unravel to steadily reduce memory and CPU usage, even as data processing volumes grew rapidly. 

        • “Before Unravel, our memory and CPU allocations were way greater than our actual usage. And we spent many hours troubleshooting problems because we couldn’t see what was going on inside our apps.” 
        • “With Unravel, we saw that a lot of vCores were unused. And we were able to drop almost 40,000 tables… that helped us a lot.”
        • “Before Unravel, we were uncomfortably past the 80% line in capacity, and memory was always pegged. With Unravel, we were able to cut usage roughly in half for the same throughput.”
        • “Before Unravel, we couldn’t give users – including the CEO – a real good reason on why they weren’t getting what they wanted.”
        • “Comprehensive job visibility, such as the configuration page in Unravel, has improved resolution times.”
        • “(Unravel) provides us a reasonable rate of growth in our use of resources compared to workloads processed – a rate which I can sell to my management team.”
        • “I’m sleeping a lot better than I was a year ago. My Fitbit sleep score is now 88. It’s been a good journey so far.” (A Fitbit sleep score of 88 is well above most Fitbit users – good, bordering on excellent – Ed.)

        Untold #datalovers swag for first Unravel customer conference

        Finding Out More

        Unravel customers can view the talks, communicate with industry peers who gave and attended the talks, and more. (There may still be some swag available!) If you’re interested, you can create a free account or contact us

        The post “My Fitbit Score is Now 88” – Sharing Success Stories at the First Unravel Customer Conference appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/my-fitbit-score-is-now-88-sharing-success-stories-at-the-first-unravel-customer-conference/feed/ 0
        Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/ https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/#respond Wed, 25 Nov 2020 16:34:01 +0000 https://www.unraveldata.com/?p=5517

        Unravel provides full-stack coverage and a unified, end-to-end view of everything going on in your environment, plus recommendations from our rules-based model and our AI engine. Unravel works on-premises, in the cloud, and for cloud migration.  […]

        The post Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant appeared first on Unravel.

        ]]>

        Unravel provides full-stack coverage and a unified, end-to-end view of everything going on in your environment, plus recommendations from our rules-based model and our AI engine. Unravel works on-premises, in the cloud, and for cloud migration. 

        Unravel provides direct support for platforms such as Cloudera Hadoop (CDH), Hortonworks Data Platform (HDP), Cloudera Data Platform (CDP), and a wide range of cloud solutions, including AWS infrastructure as a service (IaaS), Amazon EMR, Microsoft Azure IaaS, Azure HDInsight, and Databricks on both cloud platforms, as well as GCP IaaS, Dataproc, and BigQuery. We have grown to support scores of well-known customers and to engage in productive partnerships with both AWS and Microsoft Azure. 

        We have an ambitious engineering agenda and a relatively large team, with more than half the company in the engineering org. We want our engineering process to be as forward-looking as the product we deliver. 

        We constantly strive to develop adaptive, end-to-end testing strategies. For testing, Unravel started with a modest customer deployment; we now support scores of large customer deployments with 2,000 nodes and 18 clusters, and we had to conquer the giant challenges posed by this massive increase in scale. 

        Since testing is an integral part of every release cycle, we give top priority to developing a systematic, automated, scalable, and yet customizable approach for driving the entire release cycle. For a new startup, the obvious and quickest approach is the traditional testing model: manually testing and certifying each module or product. 

        Well, this structure sometimes works satisfactorily when the product has only a few features. However, a growing customer base, increasing features, and the need to support multiple platforms give rise to proportionally more and more testing. At this stage, testing becomes a time-consuming and cumbersome process. So if you and your organization are struggling with the traditional, manual testing approach for modern data stack pipelines, and are looking for a better solution, then read on. 

        In this blog, we will walk you through our journey about:

        • How we evolved our robust testing strategies and methodologies.
        • The measures we took and the best practices that we applied to make our test infrastructure the best fit for our increasing scale and growing customer base.

        Take the Unravel tour

        Try Unravel for free

        Evolution of Unravel’s Test Model

        Like any other startup, Unravel had a test infrastructure that followed the traditional testing approach of manual testing, as depicted in the following image:           

        Initially, with just a few customers, Unravel mainly focused on release certification through manual testing. Different platforms and configurations were manually tested, which took roughly 4–6 weeks per release cycle. With increasing scale, this cycle became endless, which made the release train longer and less predictable. 

        This type of testing model has quite a few stumbling blocks and does not work well with scaling data sizes and product features. Common problems with the traditional approach include:

        •  Late discovery of defects, leading to:         
            • Last-minute code changes and bug fixes    
            • Frantic communication and hurried testing  
            • Paving the way for newer regressions
        •  Deteriorating testing quality due to:
            • Manual end-to-end testing of the modern data stack pipeline, which becomes error-prone and tends to miss out on corner cases, concurrency issues, etc.
            • Difficulty in capturing the lag issues in modern data stack pipelines
        • Longer and unpredictable release trains that leads to:
            • Stretched deadlines, since testing time increases proportionally with the number of builds across multiple platforms.
            • Increased cost due to high resource requirements such as more man-hours, heavily equipped test environments, etc.

        Spotting the defects at a later stage becomes a risky affair, since the cost of fixing defects increases exponentially across the software development life cycle (SDLC). 

        While the traditional testing model has its cons, it also has some pros. A couple of key advantages are that:

        • Manual testing can reproduce customer-specific scenarios 
        • It can also catch some good bugs where you least expect them to be

        So we resisted the temptation to move fully to what most organizations now implement, a completely mechanized approach. To cope with the challenges faced in the traditional model, we introduced a new test model, a hybrid approach that has, for our purposes, the best of both worlds. 

        This model is inspired by the following strategy, which adapts to scale on top of a robust testing framework.

        Our Strategy

        Unravel’s hybrid test strategy is the foundation for our new test model.

        New Testing Model 

        Our current test model is depicted in the following image:

        This approach mainly focuses on end-to-end automation testing, which provides the following benefits:

        • Runs automated daily regression suite on every new release build, with end-to-end tests for all the components in the Unravel stack
        • Provides a holistic view of the regression results using a rich reporting dashboard 
        • Works for all kinds of releases (point release, GA release), making the automation framework flexible, robust, and scalable (a rough sketch of a tagged end-to-end regression test follows this list)
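
        Unravel’s automation framework itself is internal, but to make the pattern concrete, here is a minimal, hypothetical sketch assuming a ScalaTest-based harness: end-to-end checks are tagged so the daily regression run can select them, while the same suite remains usable for point and GA releases. All names and the fixture file are placeholders, not Unravel internals.

        // Hypothetical sketch of a tagged end-to-end regression check (ScalaTest).
        // Names such as EndToEndPipelineSpec and the fixture path are placeholders.
        import org.scalatest.Tag
        import org.scalatest.funsuite.AnyFunSuite

        object DailyRegression extends Tag("DailyRegression")

        class EndToEndPipelineSpec extends AnyFunSuite {

          test("small CSV pipeline produces the expected row count", DailyRegression) {
            val spark = org.apache.spark.sql.SparkSession.builder
              .master("local[2]").appName("e2e-regression").getOrCreate()
            try {
              val out = spark.read.option("header", "true").csv("src/test/resources/letters.csv")
                .filter("index > 6 AND index < 10 AND letter != 'h'")
              assert(out.count() == 2)   // the expected value depends on the fixture file
            } finally {
              spark.stop()
            }
          }
        }

        // A daily regression job can then select only the tagged tests, e.g.:
        //   sbt "testOnly *EndToEndPipelineSpec -- -n DailyRegression"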

        A reporting dashboard and an automated regression summary email are key differentiators of the new test model. 

        The new test model provides a lot of key advantages. However, there are some disadvantages too.

        KPI Comparison – Traditional Vs New Model

        The following bar chart is derived from the KPI values for deployment and execution time, captured for both the traditional and the new model.

        The following graph showcases the comparison of deployment, execution, and resource time savings:

        Release Certification Layout

        The new testing model comes with a new Release Certification layout, as shown in the image below. The process involved in the certification of a release cycle is summarized in the Release Cycle Summary table. 

        Release Cycle Summary

        Conclusion

        Today, Unravel has a rich set of unit tests: more than 1,000 tests run for every commit, with a CI/CD pipeline in place. This includes 1,500+ functional sanity test cases that cover our end-to-end data pipelines as well as the integration test cases. Such a testing strategy significantly reduces the impact on integrated functionality by proactively highlighting issues in pre-checks. 

        To cut a long story short, it is indeed a difficult and tricky task to build a flexible, robust, and scalable test infrastructure that caters to varying scales, especially for a cutting-edge product like Unravel and a team that strives for the highest quality in every build. 

        In this post, we have highlighted commonly faced hurdles in testing modern data stack pipelines. We have also showcased the reliable testing strategies we have developed to efficiently test and certify modern data stack ecosystems. Armed with these test approaches, just like us, you can also effectively tame the scaling giant!


        The post Unravel’s Hybrid Test Strategy to Conquer the Scaling Data Giant appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravels-hybrid-test-strategy-to-conquer-the-scaling-giant/feed/ 0
        The Ten Steps To A Successful Cloud Migration Strategy https://www.unraveldata.com/ten-steps-to-cloud-migration/ https://www.unraveldata.com/ten-steps-to-cloud-migration/#respond Tue, 03 Nov 2020 06:53:36 +0000 https://www.unraveldata.com/?p=5296 Map with dragons

        In cloud migration, also known as “move to cloud,” you move existing data processing tasks to a cloud platform, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform, to private clouds, and-or to […]

        The post The Ten Steps To A Successful Cloud Migration Strategy appeared first on Unravel.

        ]]>
        Map with dragons

        In cloud migration, also known as “move to cloud,” you move existing data processing tasks to a cloud platform, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform, to private clouds, and-or to hybrid cloud solutions.

        See our blog post, What is Cloud Migration, for an introduction.

        Cloud migration steps from IBM

        Figure 1: Steps in cloud migration. (Courtesy IBM, via Slideshare.)

        Cloud migration usually includes most or all of the following steps:

        1. Identifying on-premises workloads to move to cloud
        2. Baselining performance and resource use before moving
        3. (Potentially) optimizing the workloads before the move
        4. Matching services/components to equivalents on different cloud platforms
        5. Estimating the cost and performance of the software, post-move
        6. Estimating the cost and schedule for the move
        7. Tracking and ensuring success of the migration
        8. Optimizing the workloads in the cloud
        9. Managing the workloads in the cloud
        10. Rinse and repeat

        Some of the steps are interdependent; for instance, it helps to know the cost of moving various workloads when choosing which ones to move first. Both experiences with cloud migration and supportive tools can help in obtaining the best results.

        The bigger the scale of the workloads you are considering moving to the cloud, the more likely you will be to want outside consulting help, useful tools, or both. With that in mind, it might be worthwhile to move a smaller workload or two first, so you can gain experience and decide how best to proceed for the larger share of the work.

        We briefly describe each of the steps below, to help create a shared understanding for all the participants in a cloud migration project, and for executive decision-makers. We’ve illustrated the steps with screenshots from Unravel Data, which includes support for cloud migration as a major feature. The steps, however, are independent of any specific tool.

        Note. You may decide, at any step, not to move a given workload to the cloud – or not to undertake cloud migration at all. Keep track of your results, as you complete each step, and evaluate whether the move is still a good idea.

        The Relative Advantages of On-Premises and Cloud

        Before starting a cloud migration project, it’s valuable to have a shared understanding of the relative advantages of on-premises installations vs. cloud platforms.

        On-premises workloads tend to be centrally managed by a large and dedicated staff, with many years of expertise in all the various technologies and business challenges involved. Allocations of servers, software licenses, and operations people are budgeted in advance. Many organizations keep some or all of their workloads on-premises due to security concerns about the cloud, but these concerns may not always be well-founded.

        On-premises hosting has high capital costs; long waits for new hardware, software updates, and configuration changes; and difficulties finding skilled people, including for supporting older technologies.

        Cloud workloads, by contrast, are often managed in a more ad hoc fashion by small project teams. Servers and software licenses can be acquired instantly, though it still takes time to find skilled people to run them. (This is a major factor in the rise of DevOps & DataOps, as developers and operations people learn each other’s skills to help get things done.)

        The biggest advantage of working in the cloud is flexibility, including reduced need for in-house staff. The biggest disadvantage is the flip side of that same flexibility: surprises as to costs. Costs can go up sharply and suddenly. It’s all too easy for a single complex query to cost many thousands of dollars to run, or for hundreds of thousands of dollars in costs to appear unexpectedly. Also, the skilled people needed to develop, deploy, maintain, and optimize cloud solutions are in short supply.

        When it comes to building up an organization’s ability to use the cloud, fortune favors the bold. Organizations that develop a reputation as strong cloud shops are more able to attract the talent needed to get benefits from the cloud. Organizations which fall behind have trouble catching up.

        Even some of these bold organizations, however, often keep a substantial on-premises footprint. Security, contract requirements with customers, lower costs for some workloads (especially stable, predictable ones), and high costs for migrating specific workloads to the cloud are among the reasons for keeping some workloads on-premises.

        The Ten Steps To A Successful Cloud Migration

        1. Identifying On-Premises Workloads to Move to Cloud

        Typically, a large organization will run a wide variety of workloads, some on-premises, and some in the cloud. These workloads are likely to include:

        • Third-party software and open source software hosted on company-owned servers; examples include Hadoop, Kafka, and Spark installations
        • Software created in-house hosted on company-owned servers
        • SaaS packages hosted on the SaaS provider’s servers
        • Open source software, third-party software, and in-house software running on public cloud platform, private cloud, or hybrid cloud servers

        The cloud migration motion is to examine software running on company-owned servers and either replace it with a SaaS offering, or move it to a public, private, or hybrid cloud platform.

        2. Baselining Performance and Resource Use

        It’s vital to analyze each on-premises workload that is being considered for move to cloud for performance and resource use. To move the workload successfully, similar resources in the cloud need to be identified and costed. And the new, cloud-based version must meet or exceed the performance of the on-premises original, while running at lower cost. (Or while gaining flexibility and other benefits that are deemed worth any cost increase.)

        Baselining performance and resource use for each workload may need to include:

        • Identifying the CPU and memory usage of the workload
        • Identifying all elements in the software stack that supports the workload, on-premises
        • Identifying dependencies, such as data tables used by a workload
        • Identifying the most closely matching software elements available on each cloud platform that’s under consideration
        • Identifying custom code that will need to be ported to the cloud
        • Specifying work needed to adapt code to different supporting software available in the cloud (often older versions of the software that’s in use on-premises)
        • Specifying work needed to adapt custom code to similar code platforms (for instance, modifying Hive SQL to a SQL version available in the cloud)

        This work is easier if a target cloud platform has already been chosen, but you may need to compare estimates for your workloads on several cloud platforms before choosing.
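
        As one illustration of what baselining can look like in practice, the sketch below pulls per-application resource usage from the YARN ResourceManager REST API and aggregates memory-seconds and vcore-seconds by application name. It assumes a YARN-based on-premises cluster; <rm-host> is a placeholder for your ResourceManager address, and field names follow the standard ResourceManager apps endpoint.

        // Aggregate per-application resource usage from the YARN ResourceManager REST API.
        // <rm-host> is a placeholder; memorySeconds is reported in MB-seconds by YARN.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        val spark = SparkSession.builder.appName("BaselineResourceUse").getOrCreate()
        import spark.implicits._

        val json = scala.io.Source.fromURL(
          "http://<rm-host>:8088/ws/v1/cluster/apps?states=FINISHED").mkString

        val apps = spark.read.json(Seq(json).toDS)
          .select(explode($"apps.app").as("app"))
          .select($"app.name", $"app.queue", $"app.memorySeconds", $"app.vcoreSeconds")

        apps.groupBy($"name")
          .agg(sum($"memorySeconds").as("memory_mb_seconds"),
               sum($"vcoreSeconds").as("vcore_seconds"),
               count("*").as("runs"))
          .orderBy(desc("memory_mb_seconds"))
          .show(20, truncate = false)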

        3. (Potentially) Optimizing the Workloads, Pre-Move

        GIGO – Garbage In, Garbage Out – is one of the older acronyms in computing. If a workload is a mess on-premises, and you move it, largely unchanged, to the cloud, then it will be a mess in the cloud as well.

        If a workload runs poorly on-premises, that may be acceptable, given that the hardware and software licenses needed to run the workload are already paid for; the staff needed for operations are already hired and working. But in the cloud, you pay for resource use directly. Unoptimized workloads can become very expensive indeed.

        Unfortunately, common practice is to not optimize workloads – and then implement a rough, rapid lift and shift, to meet externally imposed deadlines. This may be followed by shock and awe over the bills that are then generated by the workload in the cloud. If you take your time, and optimize before making a move, you’ll avoid trouble later.

        4. Matching Services to Equivalents in the Cloud

        Many organizations will have decided on a specific platform for some or all of their move to cloud efforts, and will not need to go through the selection process again for the cloud platform. In other cases, you’ll have flexibility, and will need to look at more than one cloud platform.

        For each target cloud platform, you’ll need to choose the target cloud services you wish to use. The major cloud services providers offer a plethora of choices, usually closely matched to on-premises offerings. You are also likely to find third-party offerings that offer some combination of on-premises and multi-cloud functionality, such as Databricks and Snowflake on AWS and Azure.

        The cloud platform and cloud services you choose will also dictate the amount of code revision you will have to do. One crucial detail: software versions in the cloud may be behind, or ahead of, the versions you are using on-premises, and this can necessitate considerable rework.

        There are products out there to help automate some of this work, and specialized consultancies that can do it for you. Still, the cost and hassle involved may prevent you from moving some workloads – in the near term, or at all.

        Software available on major cloud platforms is shown in Figure 2, from our blog post, Understanding Cloud Data Services (recommended).

        Table of software available on cloud platforms.

        Figure 2: Software available on major cloud platforms. (Subject to change)

        5. Estimating the Cost and Performance of the Software, Post-Move

        Now you can estimate the cost and performance of the software, once you’ve moved it to your chosen cloud platform and services. You need to estimate the instance sizes you’ll need and CPU costs, memory costs, and storage costs, as well as networking costs and costs for the services you use. Estimating all of this can be very difficult, especially if you haven’t done much cloud migration work in the past.
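
        To show the kind of back-of-the-envelope arithmetic involved, the sketch below converts a monthly baseline (from step 2) into rough instance-hours and a compute cost. Every rate and instance shape here is a hypothetical placeholder, not a quote from any cloud provider.

        // Back-of-the-envelope cost sketch: all rates and sizes are hypothetical placeholders.
        object RoughCloudCostEstimate extends App {
          // Baseline per month, e.g. aggregated from the ResourceManager API in step 2.
          val vcoreSecondsPerMonth    = 9.0e8    // roughly 250,000 vcore-hours
          val memoryMbSecondsPerMonth = 3.6e12   // roughly 1,000,000 GB-hours

          // Hypothetical instance shape and on-demand price.
          val vcoresPerInstance    = 16
          val gbPerInstance        = 64
          val pricePerInstanceHour = 1.00        // placeholder USD

          val vcoreHours = vcoreSecondsPerMonth / 3600
          val gbHours    = memoryMbSecondsPerMonth / 1024 / 3600

          // Whichever dimension dominates drives the instance-hour count.
          val instanceHours = math.max(vcoreHours / vcoresPerInstance, gbHours / gbPerInstance)
          println(f"Estimated instance-hours/month: $instanceHours%.0f")
          println(f"Estimated compute cost/month:   $$${instanceHours * pricePerInstanceHour}%.0f")
        }

        Even a rough calculation like this usually has to be repeated per workload and per candidate platform, which is why tool support (see below) pays off quickly.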

        Unravel Data is a valuable tool here. It provides detailed estimates, which are likely to be far easier to get, and more accurate, than you could come up with on your own. (Through either estimation, experimentation, or both.) You need to install Unravel Data on your on-premises stack, then use it to generate comparisons across all cloud choices.

        An example of a comparison between an on-premises installation and a targeted move to AWS EMR is shown in Figure 3.

        Magnifying glass on Unravel screen

        Figure 3: On-premises to AWS EMR move comparison

        6. Estimating the Cost and Schedule for the Move

        After the above analysis, you need to schedule the move, and estimate the cost of the move itself:

        • People time. How many people-hours will the move require? Which people, and what won’t get done while they’re working on the move?
        • Calendar time. How many weeks or months will it take to complete all the necessary tasks, once dependencies are taken into account?
        • Management time. How much management time and attention, including the time to complete the schedule and cost estimates, is being consumed by the move?
        • Direct costs. What are the costs in person-hours? Are there costs for software acquisition, experimentation on the cloud platform, and lost opportunities while the work is being done?

        Each of these figures is difficult to estimate, and each depends to some degree on the others. Do this as best you can, to make a solid business decision as to whether to proceed. (See the next step.)

        7. Tracking and Ensuring Success of the Migration

        We will spend very little digital “ink” here on the crux of the process: actually moving the workloads to the chosen cloud platform. It’s a lot of work, with many steps and sub-steps of its own.

        However, all of those steps are highly contingent on the specific workloads you’re moving, and on the decisions you’ve made up to this point. (For instance, you may try a quick and dirty “lift and shift,” or you may take the time to optimize workloads first, on-premises, before making the move.)

        So we will simply point out that this step is difficult and time-consuming. It’s also very easy to underestimate the time, cost, and heartburn attendant to this part of the process. As mentioned above, doing a small project first may help reduce problems later.

        8. Optimizing the Workloads in the Cloud

        This step is easy to ignore, but it may be the most important step of all in making the move a success. You don’t have to optimize workloads on-premises before moving them. But you must optimize the workloads in the cloud after the move. Optimization will improve performance, reduce costs, and help you significantly in achieving better results going forward.

        Only after optimization of the workload in the cloud can you calculate an ROI for the project. The ROI you obtain may signal that move to cloud is a vital priority for your organization – or signal that the ancient mapmaker’s warning for mariners, Here Be Dragons, should be considered as you move workloads to the cloud.

        Map with dragons

        Figure 4: Ancient map with sea monsters. (Source: Atlas Obscura.)

        9. Managing the Workloads in the Cloud

        You have to actively manage workloads in the cloud, or you’re likely to experience unexpected shocks as to cost. Actually, you’re going to get shocked; but if you manage actively, you’ll keep those shocks to a few days’ worth of bills. If you manage passively, your shocks will equate to one or several months’ worth of bills, which can lead you to question the entire basis of your move to cloud strategy.

        Actively managing your workloads in the cloud can also help you identify further optimizations – both for operations in the cloud, and for each of the steps above. And those optimizations are crucial to calculating your ROI for this effort, and to successfully making future cloud migration decisions, one workload at a time.

        10. Rinse and Repeat

        Once you’ve successfully made your move to cloud decision pay off for one, or a small group of workloads, you’ll want to repeat it. Or, possibly, you’ve learned that move to cloud is not a good idea for many of your existing and planned workloads.

        Just as one example, if you currently have a lot of your workloads on a legacy database vendor, you may find many barriers to moving workloads to the cloud. You may need to refactor, or even re-imagine, much of your data processing infrastructure before you can consider cloud migration for such workloads.

        That’s the value of going through a process like this for a single, or a small set of workloads first: once you’ve done this, you will know much of what you previously didn’t know. And you will be rested and ready for the next steps in your move to cloud journey.

        Unravel Data as a Force Multiplier for Move to Cloud

        Unravel Data is designed to serve as a force multiplier in each step of a move to cloud journey. This is especially true for big data moves to cloud.

        As the name implies, big data processing on-premises may be the most challenging set of operations when it comes to the potential for generating costs and resource allocation issues that are surprising, even shocking, as you pursue cloud migration.

        Unravel Data has been engineered to help you solve these problems. Unravel Data can help you easily estimate the costs and challenges of cloud migration for on-premises workloads heading to AWS, to Microsoft Azure, and to Google Cloud Platform (GCP).

        Unravel Data is optimized for big data technologies:

        • On-premises (source): Includes Cloudera, Hadoop, Impala, Hive, Spark, and Kafka
        • Cloud (destination): Similar technologies in the cloud, as well as AWS EMR, AWS Redshift, AWS Glue, AWS Athena, Azure HD Insight, Google Dataproc and Google BigQuery, plus Databricks and Snowflake, which run on multiple public clouds.

        We will be adding many on-premises sources, and many cloud destinations, in the months ahead. Both relational databases (which natively support SQL) and NoSQL databases are included in our roadmap.

        Unravel is available in each of the cloud marketplaces for AWS, Azure, and GCP. Among the many resources you can find about cloud migration are a video from AWS for using Unravel to move to Amazon EMR, and a blog post from Microsoft for moving big data workloads to Azure HDInsight with Unravel.

        We hope you have enjoyed, and learned from, reading this blog post. If you want to know more, you can create a free account or contact us.

        The post The Ten Steps To A Successful Cloud Migration Strategy appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/ten-steps-to-cloud-migration/feed/ 0
        A Simple Explanation Of What Cloud Migration Actually Is https://www.unraveldata.com/what-is-cloud-migration/ https://www.unraveldata.com/what-is-cloud-migration/#respond Fri, 30 Oct 2020 18:09:16 +0000 https://www.unraveldata.com/?p=5272

        Cloud migration also called “move to the cloud,” is the process of moving existing data processing tasks to a cloud platform, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Private clouds and […]

        The post A Simple Explanation Of What Cloud Migration Actually Is appeared first on Unravel.

        ]]>

        Cloud migration, also called “move to the cloud,” is the process of moving existing data processing tasks to a cloud platform, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Private clouds and hybrid cloud solutions can also serve as destinations.

        Organizations move to the cloud because it’s a much easier place to start new experiments or projects; gain operational flexibility; cut costs; and take advantage of services available in the cloud, for instance, AI and machine learning.

        Steps in cloud migration include selecting workloads to move, choosing a target cloud platform, choosing what services to use in the cloud, and estimating and assessing costs. See our blog post, Ten Steps to Cloud Migration, for details.

        There are many ways to compare on-premises vs. cloud IT spending, but one such estimate, from Flexera, places on-premises workloads at just over two-thirds of spending, with cloud about one-third. Cloud is expected to gain a further 10% of the total in a 12-month period. So cloud migration is being pursued energetically and quickly, overall.

        Many cloud migration efforts, however, underachieve or fail. In a recent webinar, Chris Santiago of Unravel Data spelled out three reasons for challenges:

        • Poor planning
        • Failure to optimize for the cloud
        • Failure to achieve anticipated ROI

        Unravel Data has support for cloud migration as a major feature, especially for big data workloads, including many versions of Hadoop, Spark, and Kafka, both on-premises and in the cloud. If you want to know more, you can create a free account.

        The post A Simple Explanation Of What Cloud Migration Actually Is appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/what-is-cloud-migration/feed/ 0
        Unravel Data Release 4.6.2 Features New UI and Multi-Cluster Support https://www.unraveldata.com/unravel-data-release-462-features-new-ui-and-multi-cluster-support/ https://www.unraveldata.com/unravel-data-release-462-features-new-ui-and-multi-cluster-support/#respond Wed, 30 Sep 2020 20:47:54 +0000 https://www.unraveldata.com/?p=4922

        THE NEW UNRAVEL 4.6.2 RELEASE INCREASES THE POWER OF UNRAVEL WHILE MAKING THAT POWER SIGNIFICANTLY EASIER TO ACCESS The Unravel 4.6.2.0 release, now generally available, builds on our previous 4.6 release with a new UI/UX, multi-cluster […]

        The post Unravel Data Release 4.6.2 Features New UI and Multi-Cluster Support appeared first on Unravel.

        ]]>

        THE NEW UNRAVEL 4.6.2 RELEASE INCREASES THE POWER OF UNRAVEL WHILE MAKING THAT POWER SIGNIFICANTLY EASIER TO ACCESS

        The Unravel 4.6.2.0 release, now generally available, builds on our previous 4.6 release with a new UI/UX, multi-cluster support, monitoring for ELK (Elasticsearch, Logstash, and Kibana), and a new installer that makes Unravel available in minutes.

        Note: The new UI/UX for Unravel and multi-cluster support have been among the most requested features from Unravel’s delighted customers.

        NEW UI/UX

        Unravel 4.6.2 adds a new graphical user interface as the visible front end of a new user experience. The previous interface was respected for its ability to bring information from disparate systems to a single pane of glass, for its ability to deliver root cause analysis data crisply, and for the way in which it delivered monitoring information directly from sensors alongside AI-driven insights.

        The new UI is faster, cleaner, and combines information that was formerly on several pages into a unified, single-page view. Now, you can also find the appropriate, persona-based experience inside Unravel for data operations people, data analysts, data scientists, pipeline owners, and chief data officers (CDOs). In addition to the UI, Unravel continues to support the Unravel RESTful API for programmatic access to Unravel functionality.

        Unravel New UI Multi-Cluster

        The new UI, showing the new main dashboard and multi-cluster support (arrow).

        MULTI-CLUSTER SUPPORT

        With Unravel 4.6.2, you can now use a single Unravel instance to monitor multiple independent on-premises clusters for Cloudera Distributed Hadoop (CDH) and Hortonworks Data Platform (HDP). Multi-cluster support helps to create a “single source of truth” for all connected CDH and HDP clusters.

        ELK SUPPORT

        Unravel 4.6.2 supports metrics and graphical presentation of KPIs for the Elasticsearch/Logstash/Kibana (ELK) stack. This extends Unravel’s platform support, which continues to include Kafka, Spark, Pig, Cascading, Hadoop, Impala, HBase, and relational databases supporting SQL.

        Unravel ELK Support

        Cluster details for Elasticsearch (also see the Unravel documentation).

        MUCH MORE

        I’ve highlighted just some of the key features in Unravel 4.6.2. Please see the release documentation for 4.6.2.0 for further information. Create a free account to experience the Unravel product.

        The post Unravel Data Release 4.6.2 Features New UI and Multi-Cluster Support appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/unravel-data-release-462-features-new-ui-and-multi-cluster-support/feed/ 0
        Top Takeaways From CDO Sessions: Customers and Thought Leaders https://www.unraveldata.com/top-takeaways-from-cdo-sessions-customers-and-thought-leaders/ https://www.unraveldata.com/top-takeaways-from-cdo-sessions-customers-and-thought-leaders/#respond Thu, 13 Aug 2020 14:00:46 +0000 https://www.unraveldata.com/?p=4797

        We’ve been busy speaking to our customers and thought leaders in the industry and have rounded up the key takeaways from our latest CDO sessions. Here are some of the top takeaways and advice gained from […]

        The post Top Takeaways From CDO Sessions: Customers and Thought Leaders appeared first on Unravel.

        ]]>

        We’ve been busy speaking to our customers and thought leaders in the industry and have rounded up the key takeaways from our latest CDO sessions. Here are some of the top takeaways and advice gained from these sessions with big data leaders, Kumar Menon from Equifax, Anheuser-Busch’s Harinder Singh, Sandeep Uttamchandani from Unravel, and DBS Bank’s Matteo Pelati:

        1. DataOps is the end-to-end management and support of a data platform, data pipelines, data science models, and essentially the full lifecycle of consumption and production of information.

        2. You need to incorporate a multitude of different factors, such as compliance, cost, root cause analysis, tuning, and so on, early on so that DataOps is seamless and you can avoid surprises.

        3. It is important to strike the right balance between governance and time to market. When you have to move fast, governance always slows you down. And governance doesn’t just refer to the required regulation or compliance. It’s also just good data hygiene, maintaining the catalog, and maintaining the glossaries.

        4. Building your company’s platform and services as a product can be extremely beneficial and pay you back after some time. You need time for the investment to return, but once you get to that stage, you’ll get your ROI.

        Since the pandemic has accelerated the movement to the Cloud for many companies, these big data leaders gave plenty of insights on Cloud migration:

        1. Moving to the Cloud is a double-edged sword. While it’s convenient when you need to get to market fast, you also have to be very careful about security in terms of how you manage, configure, and enforce it.

        2. When you think about moving to the Cloud, it’s a highly non-trivial process. On one side, you have your data and thousands of data sets; on the other side, you have active pipelines that run daily, dashboards, and ML models feeding into the product. You have to figure out the best sequence to move these.

        3. When moving to the Cloud, you have to have a different philosophy when you’re building Cloud native applications versus when you’re building on-prem. You must improve the skill sets of your people to think more broadly.

        4. A big challenge when moving to the cloud is accessing data. However, you can use encryption and tokenization of data at a large scale and expand the use throughout the entire data platform.

        They also provided businesses with thoughtful, yet practical, advice on what they should be doing in order to not only stay afloat, but grow, during this COVID-19 pandemic:

        1. Try to understand your internal business partners and customers’ needs. Everybody is in a unique situation right now, not just your company, so focus on your internal customers and what they need from you in terms of data analytics.

        2. Consider changing the delivery model of your product or service and meet the customer where they are instead of expecting customers to come to you.

        3. Make sure that you connect a lot more with your customers and your coworkers to keep the momentum going. This ecosystem, however, is not just your customers, but potentially your customers’ customers as well.

        4. Focus on data literacy and explainable insights within your organization. Not everyone understands data the way you do, but data professionals have a unique opportunity here to really educate and build that literacy within their enterprise for better decision making.

        5. Keep an eye out for how fast regulations are changing. It’s very likely that new data residency requirements, regulations, and privacy laws will emerge as a result of the pandemic, so make sure that the architecture you build today is adaptable and flexible in order to withstand the challenge of the time.

        Matteo Pelati spoke on how DBS Bank has leveraged Unravel:

        1. DBS has leveraged Unravel to analyze jobs, analyze previous runs of a job and block the promotion of a job if it doesn’t satisfy certain criteria.

        2. Unravel has become really useful to understand the impact of users’ queries on the system and to let users understand the impacts of the operation that they’re orchestrating.

        The above takeaways just scratch the surface of the insights that these CDOs have to offer. As skilled and experienced big data leaders, they contribute valuable knowledge to the big data community. To hear more from them, you can watch the webinars or read the transcripts from our Getting Real with Data Analytics and Transforming DataOps in Banking sessions.

        The post Top Takeaways From CDO Sessions: Customers and Thought Leaders appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/top-takeaways-from-cdo-sessions-customers-and-thought-leaders/feed/ 0
        Transforming DataOps in Banking https://www.unraveldata.com/transforming-dataops-in-banking/ https://www.unraveldata.com/transforming-dataops-in-banking/#respond Thu, 16 Jul 2020 13:00:39 +0000 https://www.unraveldata.com/?p=4715 DBS Bank ATMs

        CDO Sessions: A Chat with DBS Bank On July 9th 2020, Unravel CDO Sandeep Uttamchandani joined Matteo Pelati, Executive Director, DBS in a fireside chat to discuss transforming DataOps in Banking. Watch the 45 minute video […]

        The post Transforming DataOps in Banking appeared first on Unravel.

        ]]>
        DBS Bank ATMs

        CDO Sessions: A Chat with DBS Bank

        On July 9th 2020, Unravel CDO Sandeep Uttamchandani joined Matteo Pelati, Executive Director, DBS in a fireside chat to discuss transforming DataOps in Banking. Watch the 45 minute video below or review the transcript.


        Transcript

        Sandeep: Welcome Matteo, really excited to have you, can you briefly introduce yourself, tell us a bit about your background and your role at DBS.

        Matteo: Thank you for the opportunity. I’m leading DBS’s data platform and have been with the bank for the last three years. Over the last three years, we’ve built the entire data platform from the ground up. We started with nothing with a team of about five people and now we are over one hundred. My last 20 years have been in this field with many different companies, mostly startups. DBS is actually my first bank.

        Sandeep: That’s phenomenal and I know DBS is a big company being one of the key banks in Singapore, so it’s really phenomenal that you and your team kicked off the building of your data platform.

        Matteo: As DBS is going through a company-wide digitization initiative, we initially started by outsourcing development, but now we retain much more development in-house with the use of open source technologies. We’re also contributing back to open source, too!

        Sandeep: That’s phenomenal and I have seen some of your other work. Matt, you’re truly a thought leader in this space. So really, I would say spearheading a whole bunch of things!

        Matteo: Thank you very much.

        Sandeep: Top of mind for everyone is COVID-19. This has been something that every enterprise is grappling with and maybe a good starting point. Matteo, can you share your thoughts on the COVID-19 situation impacting banks in Singapore?

        Matteo: Obviously there is an economic impact that is inevitable everywhere in the world. Definitely there is a big impact on the organization because banks don’t traditionally have a remote workforce. All of a sudden we found ourselves having to work remotely as ordered by the government. We had to adapt, and we’ve done well adapting to the transition to home-based working. There were challenges in the beginning, such as remote access to systems and the suddenness of all of this, but we handled and are handling it pretty well.

        Sandeep: That’s definitely good news. Matteo, do you have thoughts on other industries in Singapore and how are they recovering?

        Matteo: In Singapore we didn’t really have the strict lockdown other countries experienced. We did, however, limit social contact, and the government instructed people to stay at home. There are some businesses that have been directly impacted by this, i.e. Singapore Airlines; the airlines have all shut down. I’m from Italy, and COVID-19 has been hugely disruptive to people’s lives there as all businesses were shut down. It did happen here in Singapore, but with a lesser impact. As things start to ease up and restrictions start to loosen, hopefully the situation will get better soon.

        Sandeep: From a DBS standpoint, Matteo, what is top of mind for you from a broader perspective as you plan ahead for the impacts of the pandemic?

        Matteo: Planning ahead, we’re looking at remote working as a long-term arrangement as there are many requests for it. We’re also exploring cloud migration as most of our systems have always been on-premise. As you already know, banks and companies with PII data may find it challenging to move sensitive data to the cloud.

        The current COVID-19 pandemic has accelerated the planning for the cloud. It is a journey that will take time and won’t happen overnight, but having a remote workforce has helped us understand actual use cases. We’re investing in tokenization and encryption of data so that it can be in the Cloud. There are lots of investments in that direction, and they have probably been sped up by the pandemic.

        How DBS Bank Leverages Unravel Data

        Sandeep: In addition to moving to the cloud, what new data project priorities can you shed a light on?

        Matteo: As you know, we are investing a lot in the data group as we’re running the data platform. Building a platform is very much about putting pieces together and making them work as a whole. We decided in the beginning to invest a lot in building a complete solution, and we started doing a lot of development as well. We built this platform from the ground up, adopting open source software and building, to an extent, a full end-to-end self-service portal for the users. This took time, obviously, but the ROI was worth it because our users are now able to ingest data more easily, enabling them to simply build compute jobs.

        Let me give you an example of where we leveraged Unravel. We have compute jobs that are built by our users directly on our cluster, through a web UI too. Once they have done the build, they can take that artifact and easily promote it to the next environment. We can go to User Acceptance Testing (UAT), QA, and production in a fully automated way using the UI of a component that we wrote. This has now become our application lifecycle manager, where we have integrated with Unravel, giving us the ability to automatically validate the quality of jobs.

        We leverage Unravel to analyze jobs, analyze previous runs of a job, and basically block the promotion of a job if it doesn’t satisfy certain criteria. For us, it’s not just bringing in tools and installing them, but building an entire ecosystem of integrated tools. We have integration with Airflow and many other applications that we’re building and fully automating. Having gone through this experience, we’ve learned a lot as we bring the same user experience to model productionization. What we’ve done with Spark data pipelines, etc., we are going to do with models.
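
        For illustration, here is a minimal sketch of the kind of CI/CD promotion gate described above: before an artifact is promoted, a script checks recent runs of the job against efficiency thresholds and fails the pipeline if they are exceeded. The endpoint, response fields, and thresholds are hypothetical placeholders, not Unravel’s or DBS’s actual API.

        import sys
        import requests

        ANALYSIS_API = "https://analysis.example.com/api/v1/apps"   # hypothetical endpoint

        def promotion_allowed(job_name, max_runtime_s=3600, max_vcore_hours=50.0):
            # Fetch metrics for the job's most recent runs from the analysis service.
            resp = requests.get(ANALYSIS_API, params={"job": job_name, "limit": 5}, timeout=30)
            resp.raise_for_status()
            for run in resp.json().get("runs", []):
                # Block promotion if any recent run blew past the runtime or resource budget.
                if run["runtime_seconds"] > max_runtime_s or run["vcore_hours"] > max_vcore_hours:
                    return False
            return True

        if __name__ == "__main__":
            job = sys.argv[1]
            if not promotion_allowed(job):
                print(f"Promotion blocked: recent runs of {job} exceed efficiency thresholds")
                sys.exit(1)
            print(f"{job} passed validation; promoting to the next environment")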

        We want our users to be able to build and iterate on machine learning models, and to productionize them more easily by replicating the same one-click experience we have for ETL and compute jobs. That’s the next step we’re working on right now.

        Sandeep: You talked a bit about the whole self-service portal, and making it easy for the users to accomplish whatever insights they’re trying to extract which is really interesting. When you think about your existing users, how are you satisfying their needs? What are some of the gaps that you’re seeing that you’re working towards?

        Matteo: There are definitely gaps because, as with any product, there are always new feature requests and new issues. There are always new feature requests coming from customers and users. We do try to preempt what they need by analyzing their behavior, such as usage patterns and historic requests.

        For example, we’re heavily investing in streaming now, but historically banks have always been aligned to batch processing, restricted by their legacy systems like mainframes and databases. Fast forward to now: we are starting to have systems that can produce real-time streams, so we changed the platform to support streaming data, which we introduced more than two years ago.

        This changes the whole paradigm, because you don’t just want to build a platform that supports streaming, but one that supports it natively, so that we can have end-to-end streaming applications. Traditionally, all the applications are built using batch processing, SQL, etc. Now the paradigm has shifted, which changes the requirements for machine learning: the deployment of a model becomes independent from the serving means, transport layer, etc.

        While many organizations package deployment into the application and deploy it as a REST API, here we say, “okay, let’s isolate the model itself from the application.” So basically, once the data scientist builds a model, we can deploy that model and build the tools for the discoverability of the models too. This enables me to use my model, deploy my model as a REST API, embed the model into my application, deploy the model as a streaming component, or deploy the model as a UDF inside a Spark job. This is how we facilitate reusability of models, and the journey we’re going through has started to pay back.
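
        As an illustration of the last of those serving options, here is a minimal sketch of wrapping a trained model as a UDF inside a Spark job. It assumes a scikit-learn-style model saved with joblib; the paths, feature names, and model are placeholders rather than DBS’s actual pipeline.

        import joblib
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import udf
        from pyspark.sql.types import DoubleType

        spark = SparkSession.builder.appName("model-as-udf").getOrCreate()

        # Load the trained model on the driver and broadcast it to the executors.
        model = joblib.load("/models/churn_model.pkl")          # hypothetical path
        bc_model = spark.sparkContext.broadcast(model)

        @udf(returnType=DoubleType())
        def score(balance, tenure):
            # Each executor scores rows using its copy of the broadcast model.
            return float(bc_model.value.predict([[balance, tenure]])[0])

        df = spark.read.parquet("/data/customers")              # hypothetical input
        df.withColumn("churn_score", score("balance", "tenure")).show()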

        Three years ago we started with a very simple UI to allow users to clean their SQL so it would ease the migration of existing Teradata & Exadata jobs to our data platform. As the users became more skilled they needed more features on reusability. So the platform evolved with the users at heart and now we are at a very good stage. We get good feedback from what we have built.

        Cloud Migration Strategy

        Sandeep: I’ve heard some of your talks and it’s good stuff. Share some of the detailed challenges that you’re facing when you think about the cloud migration.

        Matteo: We’re at the early stages of Cloud migration; you could say the exploration phase. The biggest challenge is access to data. What we are working on uses encryption and tokenization at a large scale, expanding their use throughout the entire data platform. So data access, whatever the data, will be governed by these technologies.

        We have to handle it holistically, incorporating our own data ingestion framework. To an extent, this is simplified by the work that we have done previously, because every component that reads or writes data to our platform goes through a layer that we built, which is our data access layer, and which handles all these details. So, for example, all the tokenization, access validation, and authorization is handled by the data access layer. As our users are using this data access layer, it gives us an opportunity to implement features across all the users in a very easy way, so that’s basically our abstraction layer.
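
        For illustration, here is a minimal sketch of what such a data access layer could look like: a thin wrapper that every read goes through, so authorization checks and tokenization are enforced in one place. The class, policy structure, and hashing-as-tokenization are illustrative assumptions, not DBS’s implementation.

        from pyspark.sql import SparkSession, DataFrame
        from pyspark.sql.functions import col, sha2

        class DataAccessLayer:
            """Single entry point for reads: authorization and tokenization live in one place."""

            def __init__(self, spark: SparkSession, policies: dict):
                self.spark = spark
                # e.g. {"transactions": {"allowed_users": ["etl_svc"], "tokenized": ["account_id"]}}
                self.policies = policies

            def read(self, user: str, table: str) -> DataFrame:
                policy = self.policies.get(table, {})
                if user not in policy.get("allowed_users", []):
                    raise PermissionError(f"{user} is not authorized to read {table}")
                df = self.spark.table(table)
                # Tokenize sensitive columns before handing data to the caller
                # (a hash stands in for a real tokenization service here).
                for c in policy.get("tokenized", []):
                    df = df.withColumn(c, sha2(col(c).cast("string"), 256))
                return df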

        Security and the hybrid cloud model is a challenge at the moment. How are we going to share the data between on prem and cloud? How are we going to handle the movement of data? Part of the data will be in the cloud, part of the data will be on prem, so it’s not easy to define the rules which determine what is going to be on prem, what is going to be on the cloud. So we are evaluating different technologies to help us move the data across data centers, such as abstraction layers, abstracting the file system using a caching layer. I must say that these are probably the two challenges we’re facing now, and we are at the very beginning of that, actually, so I see many more challenges on our journey.

        Sandeep: Having done cloud migration a few times before, I can totally vouch for the complexity. Can you share the other ways in which Unravel is providing visibility to data operations that are helping you out?

        Matteo: We use Unravel for two different purposes. One, as I mentioned, is the integration with CICD for all the validation of the jobs and the other is more for analyzing and debugging the jobs. We definitely leverage Unravel while building our ingestion framework. I also see a lot of usage from users that are writing jobs and deploying into the platform, so they can leverage Unravel to understand more about the quality of the jobs, if they did something wrong, etc, etc.

        Unravel has become really useful to understand the impact of our users’ queries on the system. It’s very easy to migrate a SQL query that was written for Oracle or Teradata and encounter operations like joining twenty or thirty tables; these operations are very expensive on a distributed system like Spark, and the user might not necessarily know it. Unravel has become extremely useful to let users understand the impacts of the operations that they’re orchestrating. As you know, we have our own CI/CD integration that prevents users from, let’s say, putting expensive jobs into production. So this and Unravel is a very powerful combination, as we empower the user. First we stop the user from messing up the platform, and second we empower the user to debug their own things and analyze their own jobs. Unravel gives users that have traditionally been DBMS users the ability to understand more about their complex jobs.

        Sandeep: Can you share what it was like prior to deploying Unravel? What are some of the key operational challenges you were facing, and what was the starting point for the Unravel deployment?

        Matteo: Through the control checks that we implemented recently, we saw too many poor quality jobs on the platform and that obviously had a huge impact on the platform. So before introducing Unravel, we saw the platform using too many cores and jobs very inefficiently executed.

        We taught users how to use Unravel, which enabled them to spend time understanding their jobs and going back to Unravel to find out what the issues were. People were not using that process previously; as you know, optimization is always a hard task, as people want to deliver fast. So control checks basically started to push users back to Unravel to gain performance insights before putting jobs into production.

        Advice for Big Data Community

        Sandeep: Matteo, what do you see coming in terms of technology evolution in DataOps? Earlier you mentioned the adoption of machine learning and AI; can you share some thoughts on how you’re thinking about building out the platform and potentially extending some of the capabilities you have in that domain?

        Matteo: We have had different debates about how to organize the platform, and we have built a lot of stuff in-house. Now that we are challenged with moving to the cloud, the biggest question is, “Shall we adopt the current stack that we have and leverage it so we can be cloud agnostic, or should we rely on services provided by cloud providers?”

        We don’t have a clear answer yet and we’re still debating. I believe that you can get a lot of benefits on what the cloud can give you natively and you can basically make your platform work.

        Talking about technology, we are investing a lot in Kubernetes, actually, and most of our workload is on Spark; that’s where we’re planning to go. Now our entire platform runs on Spark on YARN, and we are investing a lot in experimenting with Spark on Kubernetes and migrating all apps to Kubernetes. This will simplify the integration with machine learning jobs as well. Running machine learning jobs is much easier on Kubernetes because you have containers, and full integration is what we need.
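
        For reference, a minimal sketch of pointing a Spark session at a Kubernetes cluster instead of YARN follows. The API server URL, container image, and namespace are placeholders, and the cluster would also need a service account with permission to launch executor pods.

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("spark-on-k8s-experiment")
            .master("k8s://https://kubernetes.example.com:6443")   # hypothetical API server
            .config("spark.kubernetes.container.image", "registry.example.com/spark-py:3.5.0")
            .config("spark.kubernetes.namespace", "data-platform")
            .config("spark.executor.instances", "4")
            .getOrCreate()
        )

        # A trivial job to confirm executors are scheduled as pods.
        spark.range(1_000_000).selectExpr("sum(id) AS total").show()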

        We are also exploring technologies like Kubeflow, for example, for productionizing machine learning pipelines. To an extent, it’s like scrapping a lot of stuff that has been built over the last three years and rebuilding it, because we are using different technologies.

        I see a lot of hype, also, around other languages. Traditionally the Hadoop and Spark stack has revolved around the JVM, Java, and Scala, and more recently I started exploring with Python. We’ve also seen a lot of work using other languages like Golang and Rust. So I think there will be a huge change in the entire ecosystem, because of the limitations that the JVM has. People are starting to realize that going back to a much smaller executable, like in Golang or Rust, with a much simpler garbage collector or no garbage collection at all, can simplify things really well.

        Sandeep: I think there’s definitely a revival of the functional programming languages. You made an interesting point about a cloud agnostic platform, and one of the things that Unravel focuses a lot on is supporting technologies across on-prem as well as the cloud. For instance, we support all three major cloud providers as well as their technologies. One of the aspects we’ve added is the migration planner; any thoughts on that, Matteo? Knowing what data to move to the cloud versus what data to keep local? How are you sort of solving that?

        Matteo: We are exploring different technologies and different patterns, and we have some technical limitations and policy limitations. To give you an example, all the data that we encrypt and tokenize, if they are tokenized on-prem and they need to be moved to the cloud, they actually need to be re-encrypted and re-tokenized with different access keys. So that’s one of the things that we are discussing that makes, obviously, the data movement harder.

        One thing that we are exploring is having a virtualized file system layer across not just the file system, but a virtual cluster on-prem and in the cloud. For example, to virtualize our file system, we’re using Alluxio. With Alluxio we are experimenting with having an Alluxio cluster that spans across the two data centers, on-prem and the cloud. We are doing the same with database technologies, as we are heavily using Aerospike and experimenting with the same approach in Aerospike.
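
        To illustrate why such a virtualization layer is attractive, here is a minimal sketch of a Spark job reading through Alluxio, assuming the Alluxio client library is on Spark’s classpath. The host, port, and paths are placeholders; the point is that the same URI works whether the underlying blocks are cached on-prem or in the cloud.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("alluxio-read").getOrCreate()

        # The alluxio:// scheme hides whether the data currently lives on-prem or in cloud storage.
        df = spark.read.parquet("alluxio://alluxio-master.example.com:19998/warehouse/transactions")
        df.groupBy("account_type").count().show()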

        We have to be really careful, because being across data centers, the bandwidth between on-prem and cloud is not unlimited. I’m not sure if this will be our final solution because we have to face some technical challenges, like re-tokenization of data.

        Re-tokenization and re-encryption of data, with this automatic movement, is too expensive, so we are also exploring ingesting the data on-prem and in the cloud, and letting the user decide where the data should be. So, we are experimenting with these two options. We haven’t come to any conclusion yet because it’s in the R&D phase now.

        Sandeep: Thank you so much for sharing. So to wrap up, Matteo, I just wanted to end with: do you have any final thoughts on some of the topics we discussed? Anything that you’d like to add to the discussion?

        Matteo: No, not particularly. To summarize, we run the platform like a product company. We have product managers, and we have our own roadmap that is decided by us and not by the users.

        This has turned out to be very successful in two aspects. One aspect is the integration, because, you know, building a product, we make sure that every piece is fully integrated with the others and we can give the user a unified user experience – from the infrastructure to the UI. The second is that it has helped a lot with the retention of the engineering team, actually, because, you know, building a product creates much more engagement than doing random projects. This has been very impactful.

        I think about all the integrations that we’ve done, the automation that we have done, and there are multiple aspects. For us, building our platform and our services as a product has been extremely beneficial, with payback after some time. You need to give the investment time to return, but once you get to that stage, you’re going to get your ROI.

        Sandeep: That’s a phenomenal point, Matteo, especially treating the platform as a product and really investing and driving it. I couldn’t agree more. I think that’s really very insightful and, from your own experience, you have been clearly driving this very successfully. Thank you so much, Matteo.

        Matteo: Thank you, that was great.

        FINDING OUT MORE

        Download the Unravel Guide to DataOps. Contact us or try out Unravel for free.

        The post Transforming DataOps in Banking appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/transforming-dataops-in-banking/feed/ 0
        The Promise of Data and Why I Joined Unravel https://www.unraveldata.com/the-promise-of-data-and-why-i-joined-unravel/ https://www.unraveldata.com/the-promise-of-data-and-why-i-joined-unravel/#respond Tue, 23 Jun 2020 11:00:42 +0000 https://www.unraveldata.com/?p=4671

        This is my first post since landing at Unravel and I couldn’t be more energized about what’s to come. Our industry is being re-architected in new and exciting ways. Many industry observers would say it’s about […]

        The post The Promise of Data and Why I Joined Unravel appeared first on Unravel.

        ]]>

        This is my first post since landing at Unravel and I couldn’t be more energized about what’s to come. Our industry is being re-architected in new and exciting ways. Many industry observers would say it’s about time! The landscape is littered with unfulfilled promises and unsolved complexity. However, one fact remains: without data, and more critically what that data can tell us, we are flying blind amidst a business climate that has never moved faster and has never faced more uncertainty.

        Many data ‘swamps’ have frankly failed to deliver, not because the data was not of value, but rather it has been too complex to wrangle and actionable insights have been too hard to extract. This has now changed. Our collective industry ability to apply more discipline and rigor to Data Operations and the maturity of Machine Learning and AI developments fuel our unwavering belief in the promise of Data.

        To get the full benefits of modern data applications, you need reliable, optimal performance from those applications. And that’s no easy task when everything is running on what has been a massive, ungainly stack – different technologies, stretching across on-premises and cloud environments. You have not had the visibility or the resources to monitor each element, understand how they’re all working together, find and resolve issues, and optimize for the greatest efficiency and effectiveness.

        It’s time to stop being reactive and start getting proactive. Data is only useful if you can depend on it. I joined Unravel not only because of their strong belief system around the promise of data but also because of their ability to help you build understanding, remediate issues, uncover opportunities, and ensure your applications remain effective and reliable.

        Ultimately, Unravel is here to help you get ahead of the game, avoid nasty surprises, and be ready for whatever new needs and technologies you’ll have in the future. In addition to providing recommendations for today’s needs, we also help you plan for what’s next – modeling how you’re going to grow, for instance, and what you need to do to keep up with that growth. I couldn’t be more excited about our industry at this time and of course the phenomenal team I have the privilege to be joining.

        The full press release can be viewed below.
        ————————————————————————————————————————————–

        Unravel Hires Enterprise Sales Leader as New Vice President of Worldwide Sales.

        The new VP Worldwide Sales will draw on experience working with global enterprises to help Unravel customers accelerate time to value for their business critical modern data cloud workloads

        PALO ALTO, CALIFORNIA – June 23, 2020 Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced that it has hired Roman Orosco as its new Vice President of Worldwide Sales. Orosco will scale out and lead the company’s sales and revenue operations teams as Enterprises across all industries leverage and increasingly depend on data to provide competitive advantage, improve operating efficiencies and re-architect their business to deal with macro-environment changes such as Covid-19.

        Orosco brings over 20 years of demand generation, channel, sales and operational leadership to his new role at Unravel. Orosco most recently served as the Vice President of Americas at BlueCat Networks, where he built a high-performing sales organization that fueled the company’s growth from $20M to $100M. Prior to BlueCat, he led revenue teams and delivered company changing revenue for over 15 years at SAP, NICE, Documentum, i2 Technologies, and other software companies.

        “We’re excited to have an accomplished sales leader like Roman join the Unravel team. Roman has a great combination of domain expertise and functional experience but most importantly he always puts the customer at the forefront of his decision making. At a time like we are in right now, it has never been more important for us to have high empathy with our customer’s business needs and focus on driving near term impact,” said Kunal Agarwal, CEO, Unravel Data. “Roman is the right person, at the right time to lead our sales organization through a period of growth and customer transition to modern data clouds.”

        “My North star has always been for my teams and me to deliver tangible business value for our clients. Enterprises continue to scale their investments in data and adopt best-practice disciplines in Data Operations to ensure their data-driven applications are high performing, reliable, and operating cost-effectively,” said Roman Orosco, VP Worldwide Sales, Unravel Data. “I look forward to working closely with our internal teams and partners to remove operational friction and, where possible, automate key Data Operations workflows such as performance tuning, troubleshooting, and cost optimization for cloud and on-premise workloads.”

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post The Promise of Data and Why I Joined Unravel appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/the-promise-of-data-and-why-i-joined-unravel/feed/ 0
        Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/ https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/#respond Tue, 14 Apr 2020 14:49:50 +0000 https://www.unraveldata.com/?p=4600 Man with surgical mask

        As the global business climate is experiencing rapid change due to the health crisis, the role of data to provide much needed solutions to urgent issues are being highlighted throughout the world. Helping customers manage critical […]

        The post Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? appeared first on Unravel.

        ]]>
        Man with surgical mask

        As the global business climate experiences rapid change due to the health crisis, the role of data in providing much-needed solutions to urgent issues is being highlighted throughout the world. Having helped customers manage critical modern data systems for years, Unravel sees heightened interest in fortifying the reliability of business operations in healthcare, logistics, financial services, and telecommunications.

        DATA TO THE RESCUE

        Complex issues – from closely tracking population spread to accelerating clinical trials for vaccines – take priority. Other innovations in anomaly detection, such as the application of image classification to detect COVID patients through lung scans, provide hope of instantaneous detection. Further, big data has been used to personalize patient treatments even before the crisis, so those most at risk will receive more curated treatment beyond simply age and comorbidity. Other novel uses of AI to track the pandemic have been reported (facial recognition, heat signatures to detect fevers) with great success in flattening the curve of infection.

        Outside of healthcare, the rapid change in the global business climate is putting strain on modern data systems, which are being pushed to unprecedented levels: Logistics engines for demand for essential goods need to recalibrate on a sub-second basis. Financial institutions need to update risk models to incorporate a 24/7 steady stream of rapidly evolving stimulus policies. And in social media, a wave of misinformation and conspiracy theories is giving rise to phishing and malware attacks perpetrated by bad actors. AI systems help identify these offenders to minimize the damage.

        EXCELLENCE IN DATA OPERATIONS IS NO LONGER A LUXURY

        No doubt we are seeing unprecedented and accelerating demand for solutions to complex, business-critical challenges. Modern data systems are becoming even more crucial, with the task of providing reliability more important than ever. As an organization keenly focused on data and operations management, we believe it’s paramount to keep systems performing optimally. From a business perspective now is the time to assess operational readiness. Companies that lean in now will be ready to leverage and accelerate their business and marketplace value more quickly – post market events.

        UNRAVEL PROVIDES COMPLETE MANAGEMENT INTO EVERY ASPECT OF YOUR DATA PIPELINES AND PLATFORM:

        • Is application code executing optimally (or failing)?
        • Are cluster resources being used efficiently?
        • How do I lower my cloud costs, while running more workloads?
        • Which workloads should I scale out to cloud?
        • Which user, app and use case is using most of the cluster?
        • How do I ensure all applications meet their SLAs?
        • How can I proactively prevent performance and reliability issues?

        THESE ISSUES APPLY AS MUCH TO SYSTEMS LOCATED IN THE CLOUD AS THEY DO TO SYSTEMS ON-PREMISES. THIS IS TRUE FOR THE BREADTH OF PUBLIC CLOUD DEPLOYMENT TYPES:

        Cloud-Native: Products like Amazon Redshift, Azure Databricks, AWS Databricks, Snowflake, etc.

        Serverless: Ready-to-use services that require no setup like Amazon Athena, Google BigQuery, Google Cloud DataFlow, etc.

        PaaS (Platform as a Service): Managed Hadoop/Spark Platforms like AWS Elastic MapReduce (EMR), Azure HDInsight, Google Cloud Dataproc, etc.

        IaaS (Infrastructure as a Service): Cloudera, Hortonworks or MapR data platforms deployed on cloud VMs where your modern data applications are running

        For those interested in learning more about specific services offered by the cloud platform providers we recently posted a blog on “Understanding Cloud Data Services.”

        CONSIDER A HEALTHCHECK AS A FIRST STEP

        As you consider your operational posture, we have a team available to run a complimentary data operations and application performance diagnostic. Consider it a Business Ops Check-up. Leveraging our platform, you’ll quickly see how we can help lower the cost of support while bolstering the performance in your modern data applications stack. If you are interested in an Unravel DataOps healthcheck, please contact us at hello@unraveldata.com.

        The post Now Is the Time to Take Stock in Your Dataops Readiness: Are Your Systems Ready? appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/now-is-the-time-to-take-stock-in-your-dataops-readiness-are-your-systems-ready/feed/ 0
        Supermarkets Optimizing Supply Chains with Unravel DataOps https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/ https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/#respond Tue, 07 Apr 2020 13:00:10 +0000 https://www.unraveldata.com/?p=4563 Shopping Cart

        Retailers are using big data to report on consumer demand, inventory availability, and supply chain performance in real time. Big data provides a convenient, easy way for retail organizations to quickly ingest petabytes of data and […]

        The post Supermarkets Optimizing Supply Chains with Unravel DataOps appeared first on Unravel.

        ]]>
        Shopping Cart

        Retailers are using big data to report on consumer demand, inventory availability, and supply chain performance in real time. Big data provides a convenient, easy way for retail organizations to quickly ingest petabytes of data and apply machine learning techniques for efficiently moving consumer goods. A top supermarket retailer has recently used Unravel to monitor its vast trove of customer data to stock the right product for the right customer, at the right time.

        The supermarket retailer needed to bring point-of-sale, online sales, demographic and global economic data together in real-time and give the data team a single tool to analyze and take action on the data. The organization needed all the systems in their data pipelines to be monitored and managed end-to-end to ensure proper system and application performance and reliability. Existing methods were largely manual, error prone and lacked actionable insights.

        Unravel Platform Overview

        Unravel Extensible Data Operations Platform

         

        After failing to find alternative solutions for cluster performance management, the customer chose Unravel to help remove risks in their cloud journey. During implementation, Unravel worked closely with the ITOps team, providing support and iterating in collaboration based on the insights and recommendations Unravel provided. This enabled both companies to triage and troubleshoot issues faster.


        Bringing All The Data Into A Single Interface

        The customer utilized a number of modern open source data projects in its data engineering workflow – Spark, MapReduce, HBase, YARN, and Kafka. These components were needed to ingest and properly process millions of transactions a day. Hive query performance was a particular concern, as numerous downstream business intelligence reports depended on timely completion of these queries. Previously, the devops team spent several days to a week troubleshooting job failure issues, often blaming the operations team for improper cluster configuration settings. The operations team would in turn ask the devops team to re-check SQL query syntax for Cartesian joins and other inefficient code. Unravel was able to shed light on these types of issues, providing usage-based reporting that helped both teams pinpoint inefficiencies quickly.
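
        For illustration, here is a minimal sketch of the kind of inefficiency described above: a join written without a join key silently becomes a Cartesian product in Spark, while an equi-join on a shared key lets the engine use a far cheaper join strategy. Table and column names are placeholders, not the retailer’s schema.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("cartesian-join-demo").getOrCreate()

        orders = spark.table("orders")    # hypothetical Hive tables
        stores = spark.table("stores")

        # Inefficient: no join key, so every order row is paired with every store row.
        cartesian = orders.crossJoin(stores)

        # Efficient: an equi-join on the shared key allows a hash or sort-merge join.
        joined = orders.join(stores, on="store_id", how="inner")

        # Inspect the physical plan to confirm there is no CartesianProduct node.
        joined.explain()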

        Unravel was able to leverage its AI and automated recommendations engine to clean up hundreds of Hive tables, greatly enhancing performance. A feature that the company found particularly useful is the ability to generate custom failure reports using Unravel’s flexible API. In addition to custom reports, Unravel is able to deliver timely notifications through e-mail, ServiceNow, and PagerDuty.

        Happy with the level of control Unravel was able to provide for Hive, the customer deployed Unravel for all other components – Spark, MapReduce, HBase, YARN, and Kafka – and made it a standard tool for DataOps management across the organization. Upon deploying Unravel, the team was presented with an end-to-end dashboard of insights and recommendations across the entire stack (Kafka, Spark, Hive/Tez, HBase) from a single interface, which allowed them to correlate thousands of pages of logs automatically. Previously, the team performed this analysis manually, with unmanaged spreadsheet tracking tools.

        In addition to performance management, the organization was looking for an elegant means to isolate users who were consistently wasteful with the compute resources on the Hadoop clusters. Such reporting is difficult to put together, and requires cluster telemetry to not only be collected across multiple components, but also correlated to a specific job and user. Using Unravel’s chargeback feature, the customer was able to report not only the worst offenders who were over-utilizing resources, but the specific cost ramifications of inefficient Hive and Spark jobs. It’s a feature that enabled the company to recoup any procurement costs in a matter of months.

        Examples of cluster utilization shown in the Unravel UI

         

        Scalable modern data applications on the cloud are critical to the success of retail organizations. Using Unravel’s AI-driven DataOps platform, a top retail organization was able to confidently optimize its supply chain. By providing full visibility of their applications and operations, Unravel helped the retail organization to ensure their modern data apps are effectively architected and operational. This enabled the customer to minimize excess inventory and deliver high demand goods on time (such as water, bread, milk, eggs) and maintain long term growth.

        FINDING OUT MORE

        Download the Unravel Guide to DataOps. Contact us or create a free account.

        The post Supermarkets Optimizing Supply Chains with Unravel DataOps appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/supermarkets-optimizing-supply-chains-with-unravel-dataops/feed/ 0
        The journey to democratize data continues https://www.unraveldata.com/the-journey-to-democratize-data-continues/ https://www.unraveldata.com/the-journey-to-democratize-data-continues/#respond Wed, 01 Apr 2020 13:00:12 +0000 https://www.unraveldata.com/?p=4572 Paved Mountain Road

        Data is the new oil and a critical differentiator in generating retrospective, interactive, and predictive ML insights. There has been an exponential growth in the amount of data in the form of structured, semi-structured, and unstructured […]

        The post The journey to democratize data continues appeared first on Unravel.

        ]]>
        Paved Mountain Road

        Data is the new oil and a critical differentiator in generating retrospective, interactive, and predictive ML insights. There has been an exponential growth in the amount of data in the form of structured, semi-structured, and unstructured data collected within the enterprise. Harnessing this data today is difficult — typically data in the lakes is not consistent, interpretable, accurate, timely, standardized, or sufficient. Sculley et al. from Google highlight that, for implementing ML in production, less than 5% of the effort is spent on the actual ML algorithms. The remaining 95% of the effort is spent on data engineering related to discovering, collecting, and preparing data, as well as building and deploying the models in production.

        As a result of the complexity, enterprises today are data rich, but insights poor. Gartner predicts that 80% of analytics insights will not deliver business outcomes through 2022. Another study highlights that 87% of data projects never make it to production deployment.

        Over the last two years, I have been leading an awesome team in the journey to democratize data at Intuit Quickbooks. The focus has been to radically improve the time it takes to complete the journey map from raw data into insights (defined as time to insight). Our approach has been to systematically break down the journey and automate the corresponding data engineering patterns, making them self-service for citizen data users. We modernized the data fabric to leverage the cloud, and developed several tools and frameworks as a part of the overall self-serve data platform.

        The team has been sharing the automation frameworks both as talks at key conferences and as three open-source projects. Check out the list of talks and open-source projects at the end of the blog. It makes me really proud of how the team has truly changed the trajectory of the data platform. A huge shoutout and thank you to the team — all of you rock!

        In the journey to democratize data platforms, I recently moved to Unravel Data. Today, there is no “one-size-fits-all,” which requires enterprises to adopt polyglot datastores and query engines both on-premises and in the cloud. Configuring and optimizing queries to run seamlessly for performance, SLAs, cost, and root-cause diagnosis is highly non-trivial and requires deep understanding. Data users such as data analysts, data scientists, and data citizens essentially need a turn-key solution to analyze and automatically configure their jobs and applications.

        I am very excited to be joining Unravel Data, driving the technology of its AI-powered data operations platform for performance management, resource and cost optimization, and cloud operations and migration. The mission to democratize data platforms continues …

        The full press release can be viewed below.

        ————————————————————————————————————————————–

        Unravel Hires Data Industry Leader with over 40 Patents as New Chief Data Officer and VP of Engineering

        The new CDO will draw on experience from IBM, VMware and Intuit QuickBooks to help Unravel customers accelerate their modern data workloads

        PALO ALTO, CALIFORNIA – April 1, 2020 – Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced that it has hired Sandeep Uttamchandani as its new Chief Data Officer and VP of Engineering. Uttamchandani will help boost Unravel’s capabilities for optimizing data apps and end-to-end data pipelines, with special focus on driving innovations for cloud and machine learning workloads. He will also lead and expand the company’s world-class data and engineering team.

        Uttamchandani brings over 20 years of critical industry experience in building enterprise software and running petabyte-scale data platforms for analytics and artificial intelligence. He most recently served as Chief Data Architect and Global Head of Data and AI at Intuit QuickBooks, where he led the transformation of the data platform used to power transactional databases, data analytics and ML products. Before that, he held engineering leadership roles for over 16 years at IBM and VMWare. Uttamchandani has spent his career delivering innovations that provide tangible business value for customers.

        “We’re thrilled to have someone with Sandeep’s track record on board the Unravel team. Sandeep has led critical big data, AI and ML efforts at some of the world’s biggest and most successful tech companies. He’s thrived everywhere he’s gone,” said Kunal Agarwal, CEO, Unravel Data. “Sandeep will make an immediate impact and help advance Unravel’s mission to radically simplify the way businesses understand and optimize the performance of their modern data applications and data pipelines, whether they’re on-premises, in the cloud or in a hybrid setting. He’s the perfect fit to lead Unravel’s data and engineering team in 2020 and beyond.”

        In addition to his achievements at Intuit QuickBooks, IBM and VMWare, Uttamchandani has also led outside the office. He has received 42 total patents involving systems management, virtualization platforms, and data and storage systems, and has written 25 conference publications, including an upcoming O’Reilly Media book on self-service data strategies. Uttamchandani earned a Ph.D. in computer science from the University of Illinois at Urbana-Champaign, one of the top computer science programs in the world. He currently serves as co-Chair of Gartner’s CDO Executive Summit.

        “My career has always been focused on developing customer-centric solutions that foster a data-driven culture, and this experience has made me uniquely prepared for this new role at Unravel. I’m excited to help organizations boost their businesses by getting the most out of their modern data workloads,” said Sandeep Uttamchandani, CDO and VP of Engineering, Unravel Data. “In addition to driving product innovations and leading the data and engineering team, I look forward to collaborating directly with customer CDOs to assist them in bypassing any roadblocks they face in democratizing data platforms within the enterprise.”

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post The journey to democratize data continues appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/the-journey-to-democratize-data-continues/feed/ 0
        Unravel Data Now Certified on Cloudera Data Platform https://www.unraveldata.com/unravel-data-now-certified-on-cloudera-data-platform/ https://www.unraveldata.com/unravel-data-now-certified-on-cloudera-data-platform/#respond Wed, 25 Mar 2020 14:00:13 +0000 https://www.unraveldata.com/?p=4556

        Last year, Cloudera released the Cloudera Data Platform, an integrated data platform that can be deployed in any environment, including multiple public clouds, bare metal, private cloud, and hybrid cloud. Customers are increasingly demanding maximum flexibility […]

        The post Unravel Data Now Certified on Cloudera Data Platform appeared first on Unravel.

        ]]>

        Last year, Cloudera released the Cloudera Data Platform, an integrated data platform that can be deployed in any environment, including multiple public clouds, bare metal, private cloud, and hybrid cloud. Customers are increasingly demanding maximum flexibility to meet multi-cloud and hybrid data management demands. Unravel has from the beginning made it a core strategy to support the full modern data stack, on any cloud, hybrid, as well as on-premises.

        Today we are pleased to announce that Unravel is now certified on Cloudera Data Platform (both CDP public cloud as well as CDP Data Center). This marks an important milestone in our continued partnership with Cloudera, bolstered by a growing demand among Cloudera users for our AI-driven performance optimization solution for modern data clouds.

        The certification ensures that Unravel is integrated seamlessly with Cloudera Data Platform, providing customers with an intelligent solution to improve the reliability and performance of their modern data stack applications and operations, and optimize costs through data driven insights. We look forward to continually supporting Cloudera customers, on CDP as well as CDH and HDP.

        The full press release can be viewed below.

        ————————————————————————————————————————————————

        Unravel Data Earns Certification for Cloudera Data Platform

        Unravel Supports Cloudera Data Platform in the public cloud, on-premises and in hybrid environments, continuing Unravel’s mission to simplify and optimize modern data applications wherever they exist

        PALO ALTO, CALIFORNIA – March 25, 2020– Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced that it has been certified on the Cloudera Data Platform. Cloudera Data Platform manages data in any environment, including multiple public clouds, bare metal deployments, private clouds and hybrid clouds. The certification further advances Unravel’s mission to simplify and optimize modern data apps wherever they exist, with this move particularly bolstering Unravel’s support for hybrid cloud and multi-cloud environments.

        “Data apps, especially AI and ML apps, are increasingly being spread across a mix of on-premise and public cloud environments. These highly distributed hybrid cloud deployments provide unique advantages and greater flexibility compared to all-in approaches that put everything either in the cloud or in a private datacenter,” said Kunal Agarwal, CEO, Unravel Data. “Organizations are also deploying more data apps in multiple clouds, which allows them to leverage the specific strengths of each cloud and provides unique app functionality. However, the growing distribution of data apps in hybrid and multi-cloud settings introduces operational complexity and naturally makes it harder to optimize and monitor these apps. Unravel ensures that enterprises have both a clear line of sight into these apps and automated recommendations to troubleshoot and maximize their performance, no matter where the apps are located.”

        In order to earn this certification, Unravel maintained their silver partnership status through the Cloudera Connect partner program, built new integrations for Cloudera Data Platform (for both the public cloud and on-premise version), then documented and tested those integrations. Cloudera worked closely with Unravel during the entire process.

        The certification is the latest milestone in a long relationship between Unravel and Cloudera. Unravel was previously certified on Cloudera Enterprise and the two share many joint customers. This integration will ensure legacy CDH and HDP customers who migrate to Cloudera CDP will continue to enjoy Unravel’s solutions to simplify data operations on AWS, Azure and GCP in addition to on-premises.

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe and Deutsche Bank. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post Unravel Data Now Certified on Cloudera Data Platform appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/unravel-data-now-certified-on-cloudera-data-platform/feed/ 0
        4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/ https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/#respond Wed, 26 Feb 2020 14:00:53 +0000 https://www.unraveldata.com/?p=4334

        This article discusses four bottlenecks in modern data stack applications and introduces a number of tools, some of which are new, for identifying and removing them. These bottlenecks could occur in any framework but a particular […]

        The post 4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler appeared first on Unravel.

        ]]>

        This article discusses four bottlenecks in modern data stack applications and introduces a number of tools, some of which are new, for identifying and removing them. These bottlenecks could occur in any framework but a particular emphasis will be given to Apache Spark and PySpark.

        The applications/riddles discussed below have something in common: They require around 10 minutes of wall clock time when a “local version” of them is run on a commodity notebook. Using more, or more powerful, processors or machines for their execution would not significantly reduce their run time. But there are also important differences: Each riddle contains a different kind of bottleneck that is responsible for the slowness, and each of these bottlenecks will be identified with a different approach. Some of these analytical tools are innovative or not widely used; references to source code are included in the second part of this article. The first section will discuss the riddles in a “black box” manner:

        The Fatso

        The fatso occurs frequently in the modern data stack. A symptom of running a local version of it is noise – the fans of my notebook are very active for almost 10 minutes, the entire application lifetime. Since we can rarely listen to machines in a cluster computing environment, we need a different approach to identify a fatso:

        JVM Profile

        The following code snippet (full version here) combines two information sources, the output of a JVM profiler and normal Spark logs, into a single visualization:

        
        # Assumed imports: ProfileParser and SparkLogParser come from the parsers module of the
        # accompanying repository; Scatter, Figure and plot are presumably Plotly's offline API.
        from typing import List
        from plotly.graph_objs import Scatter, Figure
        from plotly.offline import plot
        from parsers import ProfileParser, SparkLogParser

        profile_file = './data/ProfileFatso/CpuAndMemoryFatso.json.gz'  # Output from JVM profiler
        profile_parser = ProfileParser(profile_file, normalize=True)
        data_points: List[Scatter] = profile_parser.make_graph()
        
        logfile = './data/ProfileFatso/JobFatso.log.gz'  # standard Spark logs
        log_parser = SparkLogParser(logfile)
        stage_interval_markers: Scatter = log_parser.extract_stage_markers()
        data_points.append(stage_interval_markers)
        
        layout = log_parser.extract_job_markers(700)
        fig = Figure(data=data_points, layout=layout)
        plot(fig, filename='fatso.html')
        

         

         

        The interactive graph produced by running this script can be analyzed in its full glory here; a smaller snapshot is displayed below:

        Script Graph

        Spark’s execution model consists of different units of different “granularity levels” and some of these are displayed above: Boundaries of Spark jobs are represented as vertical dashed lines; start and end points of Spark stages are displayed as transparent blue dots on the x-axis, which also show the full stage names/IDs. This scheduling information does not add a lot of insight here since Fatso consists of only one Spark job, which in turn consists of just a single Spark stage (comprised of three tasks), but, as shown below, knowing such time points can be very helpful when analyzing more complex applications.

        For all graphs in this article, the x-axis shows the application run time as UNIX Epoch time (milliseconds passed since 1 January 1970). The y-axis represents different normalized units for different metrics: For graph lines representing memory metrics such as total heap memory used (“heapMemoryTotalUsed”, ocher green line above), it represents gigabytes; for time measurements like MarkSweep GC collection time (“MarkSweepCollTime”, orange line above), data points on the y-axis represent milliseconds. More details can be found in this data structure which can be changed or extended with new metrics from different profilers.

        One available metric, ScavengeCollCount, is absent from the snapshot above but present in the original. It signifies a minor garbage collection event and increases almost linearly up to 20000 during Fatso’s execution. In other words, the application ran for a little over 11 minutes – from epoch 1550420474091 (= 17/02/2019 16:21:14) until epoch 1550421148780 (= 17/02/2019 16:32:28) – and more than 20000 minor Garbage Collection events and almost 70 major GC events (“MarkSweepCollCount”, green line) occurred.
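
        For reference, the epoch milliseconds on the x-axis can be translated into readable timestamps with a few lines of Python (a minimal sketch; the dates quoted above appear to be UTC):

        from datetime import datetime, timezone

        # Convert the epoch milliseconds from the x-axis into readable (UTC) timestamps.
        for epoch_ms in (1550420474091, 1550421148780):
            print(datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc))
        # 2019-02-17 16:21:14.091000+00:00
        # 2019-02-17 16:32:28.780000+00:00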

        When the application was launched, no configuration parameters were manually set so the default Spark settings applied. This means that the maximum memory available to the program was 1GB. Having a closer look at the two heap memory metrics heapMemoryCommitted and heapMemoryTotalUsed reveals that both lines approach this 1GB ceiling near the end of the application.

        The intermediate conclusion that can be drawn from the discussion so far is that the application is very memory hungry and a lot of GC activity is going on, but the exact reason for this is still unclear. A second tool can help now:

        JVM FlameGraph

        The profiler also collected stacktraces which can be folded and transformed into flame graphs with the help of my fold_stacks.py script and this external script:

        Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/ProfileFatso/StacktraceFatso.json.gz > Fatso.folded
        Phils-MacBook-Pro:analytics a$ perl flamegraph.pl Fatso.folded > FatsoFlame.svg

        Opening FatsoFlame.svg in a browser shows the following; the full version, which is also searchable (top right corner), is available here:

        Fatso Flame Graph

        A rule of thumb for the interpretation of flame graphs is: The more spiky the shape, the better. We see many plateaus above with native Spark/Java functions like sun.misc.Unsafe.park sitting on top (first plateau) or low-level functions from packages like io.netty occurring near the top; the latter is a 3rd party library that Spark depends on for network communication / IO. The only functions in the picture that are defined by me are located in the center plateau; searching for the package name profile.sparkjob in the top right corner will prove this claim. On top of these user-defined functions are native Java Array and String functions; a closer look at the definitions of fatFunctionOuter and fatFunctionInner would reveal that they create many String objects in an inefficient way, so we have identified the two Fatso methods that need to be optimized.

        Python/PySpark Profiles

        What about Spark applications written in Python? I created several PySpark profilers that try to provide some of the functionality of Uber’s JVM profiler. Because of the architecture of PySpark, it might be beneficial to generate both Python and JVM profiles in order to get a good grasp of the overall resource usage. This can be accomplished for the Python edition of Fatso by using the following launch command (abbreviated, full command here):

        ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
        --conf spark.python.profile=true \
        --conf spark.driver.extraJavaOptions=-javaagent:/.../=sampleInterval=1000,metricInterval=100,reporter=...outputDir=... \
        ./spark_jobs/job_fatso.py cpumemstack /users/phil/phil_stopwatch/analytics/data/profile_fatso > Fatso_PySpark.log

        The --conf parameter in the third line is responsible for attaching the JVM profiler. The --conf parameter in the second line as well as the two script arguments in the last line are Python-specific and required for PySpark profiling: The cpumemstack argument will choose a PySpark profiler that captures both CPU/memory usage and stack traces. By providing a second script argument in the form of a directory path, it is ensured that the profile records are written into separate output files instead of just being printed to the standard output.

        Similar to its Scala cousin, the PySpark edition of Fatso completes in around 10 minutes on my MacBook and creates several JSON files in the specified output directory. The JVM profile could be visualized independently of the Python profile, but it might be more insightful to create a single combined graph from them. This can be accomplished easily and is shown in the second half of this script. The full combined graph is located here.

        Fatso PySpark Chart

        The clever reader will already have a hunch about the high memory consumption and who is responsible for it: The garbage collection activity of the JVM, again represented by MarkSweepCollCount and ScavengeCollCount, is much lower here compared to the “pure” Spark run described in the previous paragraphs (20000 events above versus less than 20 GC events now). The two inefficient fatso functions are now implemented in Python and therefore not managed by the JVM, leading to far lower JVM memory usage and far fewer GC events. A PySpark flamegraph should confirm our hunch:

        Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/profile_fatso/s_8_stack.json  > FatsoPyspark.folded
        Phils-MacBook-Pro:analytics a$ perl flamegraph.pl  FatsoPyspark.folded  > FatsoPySparkFlame.svg

        Opening FatsoPySparkFlame.svg in a browser displays …

        Fatso PySpark Flame Graph

        And indeed, the two fatso methods sit on top of the stack for almost 90% of all measurements, burning most CPU cycles. It would be easy to create a combined JVM/Python flamegraph by concatenating the respective stacktrace files. This would be of limited use here though since the JVM flamegraph will likely consist entirely of native Java/Spark functions over which a Python coder has no control. One scenario I can think of where this merging of JVM with PySpark stacktraces might be especially useful is when Java code or libraries are registered and called from PySpark/Python code, which is getting easier and easier in newer versions of Spark. In the discussion of Slacker later on, I will present a combined stack trace of Python and Scala code.
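
        If a merged flame graph is desired anyway, one simple way is to concatenate the folded stack files before feeding them to flamegraph.pl; a minimal sketch that reuses the folded files produced by fold_stacks.py above:

        # Concatenate the folded JVM and PySpark stacks into one input file for flamegraph.pl.
        with open('Combined.folded', 'w') as out:
            for folded_file in ('Fatso.folded', 'FatsoPyspark.folded'):
                with open(folded_file) as f:
                    out.write(f.read())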

        The Straggler

        The Straggler is deceiving: It appears as if all resources are fully utilized most of the time, and only closer analysis can reveal that this might be the case for just a small subset of the system or for a limited period of time. The following graph, created from this script, combines two CPU metrics with information about task and stage boundaries extracted from the standard logging output of a typical straggler run; the full-size graph can be investigated here.

        Straggler Chart

         

        The associated application consisted of one Spark job which is represented as vertical dashed lines at the left and right. This single job was comprised of a single stage, shown as transparent blue dots on the x-axis that coincide with the job start and end points. But there were three tasks within that stage, so we can see three horizontal task lines. The naming schema of this execution hierarchy is not arbitrary:

        • The stage name in the graph is 0.0@0 because it refers to a stage with the id 0.0 that belonged to a job with the id 0. The first part of stage or task names is a floating point number; this reflects the apparent naming convention in Spark logs that new attempts of failed tasks or stages are baptized with an incremented fraction part.
        • The task names are 0.0@0.0@0, 1.0@0.0@0, and 2.0@0.0@0 because three tasks were launched that were all members of stage 0.0@0 which in turn belonged to job 0 (see the small sketch below).
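
        Put differently, such names can be decomposed mechanically; a trivial sketch:

        # Split a task name like "2.0@0.0@0" into task attempt, stage attempt and job id.
        task_attempt, stage_attempt, job_id = '2.0@0.0@0'.split('@')
        print(task_attempt, stage_attempt, job_id)  # 2.0 0.0 0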

        The three tasks have the same start time which almost coincides with the application’s invocation but very different end times: Tasks 1.0@0.0@0 and 2.0@0.0@0 finish within the first fifth of the application’s lifetime whereas task 0.0@0.0@0 stays alive for almost the entire application since its start and end points are located at the left and right borders of this graph. The orange and light blue lines visualize two CPU metrics (system cpu load and process cpu load) whose fluctuations correspond with the task activity: We can observe that the CPU load drops right after tasks 1.0@0.0@0 and 2.0@0.0@0 end. It stays at around 20% for 4/5 of the time, when only straggler task 0.0@0.0@0 is running.

        Concurrency Profiles

        When an application consists of more than just one stage with three tasks like Straggler, it might be more illuminating to calculate and represent the total number of tasks that were running at any point during the application’s lifetime. The “concurrency profile” of a modern data stack workload might look more like the following:

        The source code that is the basis for the graph can be found in this script. The big difference to the Straggler toy example before is that in real-life applications, many different log files are compiled (one for each container/Spark executor) and only one “master” log file contains the scheduling and task boundary information needed to build concurrency profiles. The script uses an AppParser class that handles this automatically by creating a list of LogParser objects (one for each container) and then parsing them to determine the master log.
        Just from looking at this concurrency profile, we can attempt a back-of-the-envelope calculation to increase the efficiency of the application: If around 80 physical CPU cores were used (given multiple peaks of ~80 active tasks), we can hypothesize that the application was “overallocated” by at least 20 CPU cores, or 4 to 7 Spark executors, or one to three nodes, as Spark executors are often configured to use 3 to 5 physical CPU cores. Reducing the machines reserved for this application should not increase its execution time, but it will give more resources to other users in a shared cluster setting or save some $$ in a cloud environment.
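
        The core computation behind such a concurrency profile is simple; below is a minimal, self-contained sketch that counts running tasks from (start, end) intervals (the interval values are made up for illustration, the real ones come from the log parsers described above):

        from typing import List, Tuple

        def concurrency_profile(task_intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
            """Return (timestamp, number of running tasks) at every task start/end event."""
            events = []
            for start, end in task_intervals:
                events.append((start, 1))
                events.append((end, -1))
            events.sort()
            running, profile = 0, []
            for timestamp, delta in events:
                running += delta
                profile.append((timestamp, running))
            return profile

        # One straggler plus two short tasks, start/end times in milliseconds:
        print(concurrency_profile([(0, 600_000), (0, 120_000), (0, 130_000)]))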

        A Fratso

        What about the specs of the actual compute nodes used? The memory profile for a similar app created via this code segment is chaotic yet illuminating since more than 50 Spark executors/containers were launched by the application and each one left its mark in the graph in the form of a memory metric line (original located here):

        Big Job Memory Chart

        The peak heap memory used is a little more than 10GB; one executor crosses this 10GB line twice (top right) while most other executors use 8-9 GB or less. Removing the memory usage from the picture and displaying scheduling information like task boundaries instead results in the following graph:

         

        Big Job Tasks Chart

         

        The application launches several small Spark jobs initially, as indicated by the occurrence of multiple dashed lines near the left border. However, more than 90% of the total execution time is consumed by a single big job which has the ID 8. A closer look at the blue dots on the x-axis that represent boundaries of Spark stages reveals that there are two longer stages within job 8. During the first stage, there are four task waves without stragglers – concurrent tasks that together look like solid blue rectangles when visualized this way. The second stage of job 8 does have a straggler task, as there is one horizontal blue task line that is active for much longer than its “neighbour” tasks. Looking back at the memory graph of this application, it is likely that this straggler task is also responsible for the heap memory peak of >10GB that we discovered. We might have identified a “fratso” here (a straggling fatso) and this task/stage should definitely be analyzed in more detail when improving the associated application.

        The script that generated all three previous plots can be found here.

        The Heckler: CoreNLP & spaCy

        Applying NLP or machine learning methods often involves the use of third party libraries which in turn create quite memory-intensive objects. There are several different ways of constructing such heavy classifiers in Spark so that each task can access them; the first version of the Heckler code that is the topic of this section will do that in the worst possible way. I am not aware of a metric currently exposed by Spark that could directly show such inefficiencies; something similar to a measure of network transfer from master to executors would be required for one case below. The identification of this bottleneck must therefore happen indirectly by applying some more sophisticated string matching and collapsing logic to Spark’s standard logs:

        from typing import List, Tuple
        from parsers import SparkLogParser  # log parsing logic from the accompanying repository

        log_file = './data/ProfileHeckler1/JobHeckler1.log.gz'
        log_parser = SparkLogParser(log_file)
        collapsed_ranked_log: List[Tuple[int, List[str]]] = log_parser.get_top_log_chunks()
        for line in collapsed_ranked_log[:5]:  # print 5 most frequently occurring log chunks
            print(line)
        
        

        Executing the script containing this code segment produces the following output:

        
        Phils-MacBook-Pro:analytics a$ python3 extract_heckler.py
         
        ^^ Identified time format for log file: %Y-%m-%d %H:%M:%S
         
        (329, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
        'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
        'StanfordCoreNLP:88 - Adding annotator parse'])
        (257, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
        'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
        'StanfordCoreNLP:88 - Adding annotator parse'])
        (223, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
        'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
        'StanfordCoreNLP:88 - Adding annotator parse'])
        (221, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
        'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
        'StanfordCoreNLP:88 - Adding annotator parse'])
        (197, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit', 
        'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma', 
        'StanfordCoreNLP:88 - Adding annotator parse'])

        These are the 5 most frequent log chunks found in the log file; each one is a pair (int, List[str]). The left integer signifies the total number of times the right list of log segments occurred in the file; each individual member in the list occurred in a separate log line in the file. Hence the return value of the method get_top_log_chunks that created the output above has the type annotation List[Tuple[int, List[str]]]: it extracts a ranked list of contiguous log segments.

        The top record can be interpreted in the following way: The five strings

        
        StanfordCoreNLP:88 - Adding annotator tokenize
        StanfordCoreNLP:88 - Adding annotator ssplit 
        StanfordCoreNLP:88 - Adding annotator pos
        StanfordCoreNLP:88 - Adding annotator lemma
        StanfordCoreNLP:88 - Adding annotator parse
        

        occurred as infixes in this order 329 times in total in the log file. They were likely part of longer log lines, as normalization and collapsing logic was applied by the extraction algorithm; an example occurrence of the first part of the chunk (StanfordCoreNLP:88 - Adding annotator tokenize) would be

        2019-02-16 08:44:30 INFO StanfordCoreNLP:88 - Adding annotator tokenize

        What does this tell us? The associated Spark app seems to have performed some NLP tagging since log4j messages from the Stanford CoreNLP project can be found as part of the Spark logs. Initializing a StanfordCoreNLP object as in the following snippet produces the log output printed below it …

        
          val props = new Properties()
          props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")
        
          val pipeline = new StanfordCoreNLP(props)
          val annotation = new Annotation("This is an example sentence")
        
          pipeline.annotate(annotation)
          val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
          println(parseTree.toString) // prints (ROOT (NP (NN Example) (NN sentence)))
        
        0 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator tokenize
        9 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator ssplit
        13 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator pos
        847 [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger  - Loading POS tagger from [...] done [0.8 sec].
        848 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator lemma
        849 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP  - Adding annotator parse
        1257 [main] INFO edu.stanford.nlp.parser.common.ParserGrammar  - Loading parser from serialized [...] ... done [0.4 sec].

        … which tells us that five annotators (tokenize, ssplit, pos, lemma, parse) are created and wrapped inside a single StanfordCoreNLP object. Concerning the use of CoreNLP with Spark, the number of cores/tasks used in Heckler is three (as it is in all other riddles), which means that we should find at most three occurrences of these annotator messages in the corresponding Spark log file. But we already saw more than 1000 occurrences when only the top 5 log chunks were investigated above. Having a closer look at the Heckler source code resolves this contradiction: the implementation is bad since one classifier object is recreated for every input sentence that will be syntactically annotated – there are 60000 input sentences in total, so a StanfordCoreNLP object will be constructed a staggering 60000 times. Due to the distributed/concurrent nature of Heckler, we don’t always see the annotator messages in the order tokenize – ssplit – pos – lemma – parse because log messages of task (1) might interleave with log messages of tasks (2) and (3) in the actual log file, which is also the reason for the slightly reordered log chunks in the top 5 list.

        Improving this inefficient implementation is not too difficult: Creating the classifier inside a mapPartitions instead of a map function as done here will only create three StanfordCoreNLP objects overall. However, this is not the minimum; I will now set the record for creating the smallest number of tagger objects with the minimum amount of network transfer: Since StanfordCoreNLP is not serializable per se, it needs to be wrapped inside a class that is serializable in order to prevent a java.io.NotSerializableException when broadcasting it later:

        
        class DistributedStanfordCoreNLP extends Serializable {
          val props = new Properties()
          props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")
          lazy val pipeline = new StanfordCoreNLP(props)
        }
        [...]
        val pipelineWrapper = new DistributedStanfordCoreNLP()
        val pipelineBroadcast: Broadcast[DistributedStanfordCoreNLP] = session.sparkContext.broadcast(pipelineWrapper)
        [...]
        val parsedStrings3 = stringsDS.map(string => {
           val annotation = new Annotation(string)
           pipelineBroadcast.value.pipeline.annotate(annotation)
           val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
           parseTree.toString
        })

        The proof lies in the logs:

        
        19/02/23 18:48:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
        19/02/23 18:48:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
        19/02/23 18:48:45 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
        19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator tokenize
        19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator ssplit
        19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator pos
        19/02/23 18:48:46 INFO MaxentTagger: Loading POS tagger from [...] ... done [0.6 sec].
        19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator lemma
        19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator parse
        19/02/23 18:48:47 INFO ParserGrammar: Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
        19/02/23 18:59:07 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1590 bytes result sent to driver
        19/02/23 18:59:07 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1590 bytes result sent to driver
        19/02/23 18:59:07 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1590 bytes result sent to driver
        

        I’m not sure about the multi-threading capabilities of StanfordCoreNLP so it might turn out that the second “per partition” solution is superior performance-wise to the third. In any case, we reduced the number of tagging objects created from 60000 to three or one, not bad.

        spaCy on PySpark

        The PySpark version of Heckler will use spaCy (written in Cython/Python) as NLP library instead of CoreNLP. From the perspective of a JVM aficionado, packaging in Python itself is odd and spaCy doesn’t seem to be very chatty. Therefore I created an initialization function that should print more log messages and address potential issues when running spaCy in a distributed environment, as its model files need to be present on every Spark executor.

        As expected, the “bad” implementation of Heckler recreates one spaCy NLP model per input sentence as proven by this logging excerpt:

        
        [Stage 0:>                                                          3 / 3]
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        [...] 

        Inspired by the Scala edition of Heckler, the “per partition” PySpark solution only initializes three spaCy NLP objects during the application’s lifetime; the complete log file of that run is short:

        
        [Stage 0:>                                                          (0 + 3) / 3]
        ^^ Using spaCy 2.0.18
        ^^ Using spaCy 2.0.18
        ^^ Using spaCy 2.0.18
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        ^^ Created model en_core_web_sm
        1500
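
        For completeness, the “per partition” approach in PySpark typically boils down to loading the model inside mapPartitions so it is created once per partition rather than once per sentence. The following is a minimal, self-contained sketch and not the exact code from the repository; it assumes spaCy and the en_core_web_sm model are installed on every executor:

        from pyspark.sql import SparkSession

        def tag_partition(sentences):
            import spacy
            nlp = spacy.load('en_core_web_sm')  # created once per partition, not once per sentence
            for sentence in sentences:
                yield [token.pos_ for token in nlp(sentence)]

        spark = SparkSession.builder.appName('heckler-per-partition').getOrCreate()
        sentences = spark.sparkContext.parallelize(['This is an example sentence'] * 1500, 3)
        print(sentences.mapPartitions(tag_partition).count())  # prints the number of tagged sentences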

        Finding failure messages

        The functionality introduced in the previous paragraphs can be modified to facilitate the investigation of failed applications: The reason for a crash is often not immediately apparent and requires sifting through log files. Resource-intensive applications will create numerous log files (one per container/Spark executor), so search functionality along with deduplication and pattern matching logic should come in handy here: The function extract_errors from the AppParser class tries to deduplicate potential exceptions and error messages and returns them in reverse chronological order. An exception or error message might occur several times during a run with slight variations (e.g., different timestamps or code line numbers), but the last occurrence is the most important one for debugging purposes since it might be the direct cause of the failure.

        
        from typing import Deque, List, Tuple
        from parsers import AppParser  # application-level parsing from the accompanying repository

        app_path = './data/application_1549675138635_0005'
        app_parser = AppParser(app_path)
        app_errors: Deque[Tuple[str, List[str]]] = app_parser.extract_errors()
        
        for error in app_errors:
            print(error)
        
        ^^ Identified app path with log files
        ^^ Identified time format for log file: %y/%m/%d %H:%M:%S
        ^^ Warning: Not all tasks completed successfully: {(16.0, 9.0, 8), (16.1, 9.0, 8), (164.0, 9.0, 8), ...}
        ^^ Extracting task intervals
        ^^ Extracting stage intervals
        ^^ Extracting job intervals
        
        Error messages found, most recent ones first:
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:35 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted.’, ‘org.apache.spark.SparkException: Job aborted.’, ‘at org.apache.spark.sql.execution.datasources.FileFormatWriteranonfun$write$1.apply(FileFormatWriter.scala:166)’, ‘at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)’, ‘at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)’, ‘at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)’, ‘at org.apache.spark.sql.execution.SparkPlananonfun$executeQuery$1.apply(SparkPlan.scala:138)’, ‘at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)’, ‘at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)’, ‘at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)’, ‘at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)’, ‘at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)’, ‘at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)’, ‘at org.apache.spark.sql.execution.SparkPlananonfun$executeQuery$1.apply(SparkPlan.scala:138)’, ‘at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)’, ‘at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)’, ‘at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)’, ‘at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)’, ‘at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)’, ‘at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)’, ‘at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)’, […] ‘… 48 more’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000124/stderr.gz’, [‘18/02/01 21:49:34 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN YarnAllocator: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 ERROR TaskSetManager: Task 30 in stage 9.0 failed 4 times; aborting job’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 ERROR YarnClusterScheduler: Lost executor 62 on ip-172-18-39-28.ec2.internal: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN TaskSetManager: Lost task 4.3 in stage 9.0 (TID 610, ip-172-18-39-28.ec2.internal, executor 62): ExecutorLostFailure (executor 62 exited caused by one of the running tasks) Reason: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node’])
        
        (‘/users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz’, [‘18/02/01 21:49:34 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0’])
        
        […]
        

        Each error record printed out in this fashion consists of two elements: The first one is the path to the source log in which the second element, the actual error chunk, was found. The error chunk is a single- or multi-line error message to which collapsing logic was applied and is stored in a list of strings. The application that threw the errors shown above seems to have crashed because problems occurred during a write phase, since classes like FileFormatWriter occur in the final stack trace that was produced by executor container_1549_0001_02_000002. It is likely that not enough output partitions were used when materializing output records to a storage layer. The total number of error chunks in all container log files associated with this application was more than 350; the deduplication logic of AppParser.extract_errors boiled this high number down to less than 20.
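
        If too few write partitions were indeed the root cause, one common remedy is to repartition the data right before materializing it; a purely illustrative PySpark sketch (the dataset, path and partition count are made up):

        from pyspark.sql import SparkSession

        # Increase the number of output partitions so that no single write task has to
        # materialize a disproportionate share of the data.
        spark = SparkSession.builder.appName('write-repartition-example').getOrCreate()
        df = spark.range(0, 10_000_000)  # placeholder for the real dataset
        df.repartition(400).write.mode('overwrite').parquet('/tmp/output_parquet')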

        The Slacker

        The Slacker does exactly what the name suggests – not a lot. Let’s collect and investigate the maximum values of its most important metrics:

        
        from typing import Dict
        from parsers import ProfileParser  # parsing logic from the accompanying repository

        combined_file = './data/profile_slacker/CombinedCpuAndMemory.json.gz'  # Output from JVM & PySpark profilers
        
        jvm_parser = ProfileParser(combined_file)
        jvm_parser.manually_set_profiler('JVMProfiler')
        
        pyspark_parser = ProfileParser(combined_file)
        pyspark_parser.manually_set_profiler('pyspark')
        
        jvm_maxima: Dict[str, float] = jvm_parser.get_maxima()
        pyspark_maxima: Dict[str, float] = pyspark_parser.get_maxima()
        
        print('JVM max values:')
        print(jvm_maxima)
        print('\nPySpark max values:')
        print(pyspark_maxima)

        The output is …

        
        JVM max values:
        {'ScavengeCollTime': 0.0013, 'MarkSweepCollTime': 0.00255, 'MarkSweepCollCount': 3.0, 'ScavengeCollCount': 10.0,
        'systemCpuLoad': 0.64, 'processCpuLoad': 0.6189945167759597, 'nonHeapMemoryTotalUsed': 89.079,
        'nonHeapMemoryCommitted': 90.3125, 'heapMemoryTotalUsed': 336.95, 'heapMemoryCommitted': 452.0}
        
        PySpark max values:
        {'pmem_rss': 78.50390625, 'pmem_vms': 4448.35546875, 'cpu_percent': 0.4}

        These are low values given the baseline overhead of running Spark, especially when comparing them to the profiles for Fatso above – for example, only 13 GC events happened and the peak CPU load for the entire run was less than 65%. Visualizing all CPU data points shows that these maxima occurred at the beginning and end of the application, when there is always a lot of initialization and cleanup work going on regardless of the actual code being executed (bigger version here):

        CPU Usage Chart

        So the system is almost idle for the majority of the time. The slacker in this pure form is a rare sight; when processing real-life workloads, slacking most likely occurs in stages that interact with an external system, for example when querying a database for records that should be joined with Datasets/RDDs later on, or when materializing output records to a storage layer like HDFS with too few write partitions. A combined flame graph of JVM and Python stack traces will reveal the slacking part:

        Slacker Combined Flame Graph

        In the first plateau, which is also the longest, two custom Python functions sit at the top. After inspecting their implementation here and there, the low system utilization should not be surprising anymore: The second function from the top, job_slacker.py:slacking, is basically a simple loop that calls a function helper.py:secondsSleep from an external helper package many times. This function has a sample presence of almost 20% (seen in the original) and, since it sits atop the plateau, is executed by the CPU most of the time. As its function name suggests, it causes the program to sleep for one second. So Slacker is essentially a 10-minute long system sleep. In real-world modern data stack applications that have slacking phases, we can expect the top of many plateaus to be occupied by “write” functions like FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala) or by functions related to DB queries.

        Source Code & Links & Goodies

        Quite a lot of things were discussed and code exists that implements every idea presented:

        Riddle source code

        The Spark source code for the riddles is located in this repository:

        The PySpark editions along with all log and profile parsing logic can be found in the main repo:

        Profiling the JVM

        Uber’s recently open sourced JVM profiler isn’t the first of its kind but has a number of features that are very handy for the cases described in this article: It is “non-invasive” so source code doesn’t need to be changed at all in order to collect metrics. Any JVM can be profiled which means that this project is suitable for tracking a Spark master as well as its associated executors. Internally this profiler uses the java.lang.management interface that was introduced with Java 1.5 and accesses several Bean objects.

        When running the riddles mentioned above in local mode, only one JVM is launched and the master subsumes the executors, so only a --conf spark.driver.extraJavaOptions= has to be added to the launch command; a distributed application also requires a second --conf spark.executor.extraJavaOptions=. Full launch commands are included in my project’s repo.

        Phil’s PySpark profilers

        I implemented three custom PySpark profilers here which should provide functionality similar to that of Uber’s JVM profiler for a PySpark user: a CPU/memory profiler, a stack profiler for creating PySpark flame graphs, and a combination of the two. Details about how to integrate them into an application can be found in the project’s readme file.

        Going distributed

        If the JVM profiler is used as described above, three different types of records are generated which, in case the FileOutputReporter flag is used, are written to three separate JSON files: ProcessInfo.json, CpuAndMemory.json, and Stacktrace.json. The ProcessInfo.json file contains meta information and is not used in this article. Similarly, my PySpark profilers will create one or two different types of output records that are stored in at least two JSON files with the pattern s_X_stack.json or s_X_cpumem.json when sparkContext.dump_profiles(dump_path) is called. If sparkContext.show_profiles() is used instead, all profile records are written to the standard output.
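
        As a minimal illustration of the two dump mechanisms (using Spark’s built-in Python profiler here rather than my custom ones):

        from pyspark import SparkConf, SparkContext

        # With spark.python.profile enabled, dump_profiles() writes one profile file per
        # profiled RDD into the given directory; show_profiles() prints to standard output.
        conf = SparkConf().setAppName('profile-demo').set('spark.python.profile', 'true')
        sc = SparkContext(conf=conf)
        sc.parallelize(range(100000)).map(lambda x: x * x).count()
        sc.dump_profiles('/tmp/pyspark_profiles')  # or: sc.show_profiles()
        sc.stop()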

        In a distributed/cloud environment, Uber’s and my FileOutputReporter might not be able to create output files on storage systems like HDFS or S3 so the profiler records might need to be written to the standard output files (stdout.gz) instead. Since the design of the profile and application parsers in parsers.py is compositional, this is not a problem. A demonstration of how to extract both metrics and scheduling info from all standard output files belonging to an application is here.

        When launching a distributed application, Spark executors run on multiple nodes in a cluster and produce several log files, one per executor/container. In a cloud environment like AWS, these log files will be organized in the following structure:

        
        s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_001/
                                                                                                 stderr.gz
                                                                                                 stdout.gz
        s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_002/
                                                                                                 stderr.gz
                                                                                                 stdout.gz
        [...]
        s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_N/
                                                                                                 stderr.gz
                                                                                                 stdout.gz
        
        [...]
        
        s3://aws-logs/elasticmapreduce/clusterid-M/containers/application_K_0001/container_K_L/
                                                                                                 stderr.gz
                                                                                                 stdout.gz
        

        An EMR cluster like clusterid-1 might run several Spark applications consecutively, each one as its own step. Each application launched a number of containers; application_1_0001, for example, launched executors container_1_001, container_1_002, …, container_1_N. Each of these containers created a standard error and a standard out file on S3. In order to analyze a particular application like application_1_0001 above, all of its associated log files like …/application_1_0001/container_1_001/stderr.gz and …/application_1_0001/container_1_001/stdout.gz are needed. The easiest way is to collect all files under the application folder using a command like …

        
        aws s3 cp --recursive s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/ ./application_1_0001/
        

        … and then to create an AppParser object like

        
        from parsers import AppParser
        app_path = './application_1_0001/'  # path to the application directory downloaded from s3 above
        app_parser = AppParser(app_path)
        

        This object creates a number of SparkLogParser objects internally (one for each container) and automatically identifies the “master” log file created by the Spark driver (likely located under application_1_0001/container_1_001/). Several useful functions are now made available by the app_parser object; examples can be found in this script and more detailed explanations are in the readme file.

        The post 4 Modern Data Stack Riddles: The Straggler, the Slacker, the Fatso, and the Heckler appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/4-big-data-riddles-the-straggler-the-slacker-the-fatso-and-the-heckler/feed/ 0
        Data Structure Zoo https://www.unraveldata.com/data-structure-zoo/ https://www.unraveldata.com/data-structure-zoo/#respond Thu, 13 Feb 2020 23:35:12 +0000 https://www.unraveldata.com/?p=4338

        Solving a problem programmatically often involves grouping data items together so they can be conveniently operated on or copied as a single unit – the items are collected in a data structure. Many different data structures […]

        The post Data Structure Zoo appeared first on Unravel.

        ]]>

        Solving a problem programmatically often involves grouping data items together so they can be conveniently operated on or copied as a single unit – the items are collected in a data structure. Many different data structures have been designed over the past decades, some store individual items like phone numbers, others store more complex objects like name/phone number pairs. Each has strengths and weaknesses and is more or less suitable for a specific use case. In this article, I will describe and attempt to classify some of the most popular data structures and their actual implementations on three different abstraction levels starting from a Platonic ideal and ending with actual code that is benchmarked:

        • Theoretical level: Data structures/collection types are described irrespective of any concrete implementation and the asymptotic behavior of their core operations are listed.
        • Implementation level: It will be shown how the container classes of a specific programming language relate to the data structures introduced at the previous level – e.g., despite their name similarity, Java’s Vector is different from Scala’s or Clojure’s Vector implementation. In addition, asymptotic complexities of core operations will be provided per implementing class.
        • Empirical level: Two aspects of the efficiency of data structures will be measured: The memory footprints of the container classes will be determined under different configurations. The runtime performance of operations will be measured which will show to what extent asymptotic advantages manifest themselves in concrete scenarios and what the relative performances of asymptotically equal structures are.

        Theoretical Level

        Before providing actual speed and space measurement results in the third section, execution time and space can be described in an abstract way as a function of the number of items that a data structure might store. This is traditionally done via Big O notation and the following abbreviations are used throughout the tables:

        • C is constant time, O(1)
        • aC is amortized constant time
        • eC is effective constant time
        • Log is logarithmic time, O(log n)
        • L is linear time, O(n)

        The green, yellow or red background colors in the table cells will indicate how “good” the time complexity of a particular data structure/operation combination is relative to the other combinations.

        Table 1

        The first five entries of Table 1 are linear data structures: They have a linear ordering and can only be traversed in one way. By contrast, Trees can be traversed in different ways, they consist of hierarchically linked data items that each have a single parent except for the root item. Trees can also be classified as connected graphs without cycles; a data item (= node or vertex) can be connected to more than two other items in a graph.

        Data structures provide many operations for manipulating their elements. The most important ones are the following four core operations which are included above and studied throughout this article:

        • Access: Read an element located at a certain position
        • Search: Search for a certain element in the whole structure
        • Insertion: Add an element to the structure
        • Deletion: Remove a certain element

        Table 1 includes two probabilistic data structures, Bloom Filter and Skip List.

        Implementation Level – Java & Scala Collections Framework

        The following table classifies almost all members of both the official Java Collection and Scala Collection libraries, in addition to a number of relevant classes like Array or String that are not canonical members. The actual class names are placed in the second column; a name that starts with im. or m. refers to a Scala class, other prefixes refer to Java classes. The fully qualified class names are shortened by using the following abbreviations:

        • u. stands for the package java.util
        • c. stands for the package java.util.concurrent
        • lang. stands for the package java.lang
        • m. stands for the package scala.collection.mutable
        • im. stands for the package scala.collection.immutable

        The actual method names and logic of the four core operations (Access, Search, Insertion, and Deletion) are dependent on a concrete implementation. In the table below, these method names are printed in italics right before the asymptotic times (they will also be used in the core operation benchmarks later). For example: Row number eleven describes the implementation u.ArrayList (second column), which refers to the Java collection class java.util.ArrayList. In order to access an item in a particular position (fourth column, Random Access), the method get can be called on an object of the ArrayList class with an integer argument that indicates the position. A particular element can be searched for with the method indexOf, and an item can be added or deleted via add or remove. Scala’s closest equivalent is the class scala.collection.mutable.ArrayBuffer, which is described two rows below ArrayList: To retrieve the element in the third position from an ArrayBuffer, Scala’s apply method can be used, which allows an object to be used in function notation, so we would write val thirdElement = bufferObject(2). Searching for an item can be done via find, and appending or removing an element from an ArrayBuffer is possible by calling the methods += and -= respectively.

        Table 2

        Subclass and wrapping relationships between two classes are represented via <e) and <w). For example, the class java.util.Stack extends java.util.Vector and the Scala class scala.collection.mutable.StringBuilder wraps the Java class java.lang.StringBuilder in order to provide idiomatic functions and additional operations.

        General features of Java & Scala structures

        Several collection properties are not explicitly represented in the table above since they either apply to almost all elements or a simple rule exists:

        Almost all data structures that store key/value pairs have the characters Map as part of their class name in the second column. The sole exception to this naming convention is java.util.Hashtable which is a retrofitted legacy class born before Java 2 that also stores key/value pairs.

        Almost all Java Collections are mutable: They can be destroyed, elements can be removed from or added to them, and their data values can be modified in-place; mutable structures can therefore lose their original/previous state. By contrast, Scala provides a dedicated immutable package (scala.collection.immutable) whose members, in contrast to scala.collection.mutable and the Java collections, cannot be changed in-place. All members of this immutable package are also persistent: Modifications will produce an updated version via structural sharing and/or path copying while also preserving the original version. Examples of immutable but non-persistent data structures from third party providers are mentioned below.

        Mutability can lead to problems when concurrency comes into play. Most classes in Table 2 that do not have the prefix c. (abbreviating the package java.util.concurrent) are unsynchronized. In fact, one of the design decisions made in the Java Collections Framework was to not synchronize most members of the java.util package since single-threaded or read-only uses of data structures are pervasive. In case synchronization for these classes is required, java.util.Collections provides a cascade of synchronized* methods that accept a given collection and return a synchronized, thread-safe version.

        Due to the nature of immutability, the (always unsynchronized) immutable structures in Table 2 are thread-safe.

        All entries in Table 2 are eager except for scala.collection.immutable.Stream which is a lazy list that only computes elements that are accessed.

        Java supports the eight primitive data types byte, short, int, long, float, double, boolean and char. Things are a bit more complicated with Scala but the same effectively also applies there at the bytecode level. Both languages provide primitive and object arrays but the Java and Scala Collection libraries are object collections which always store object references: When primitives like 3 or 2.3F are inserted, the values get autoboxed so the respective collections will hold a reference to numeric objects (a wrapper class like java.lang.Integer) and not the primitive values themselves:

        int[] javaArrayPrimitive = new int[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        Integer[] javaArrayObject = new Integer[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        // javaArrayPrimitive occupies just 64 Bytes, javaArrayObject 240 Bytes

        List<Integer> javaList1 = new ArrayList<>(11); // initial capacity of 11
        List<Integer> javaList2 = new ArrayList<>(11);
        for (int i : javaArrayPrimitive)
            javaList1.add(i);
        for (int i : javaArrayObject)
            javaList2.add(i);
        // javaList1 is 264 bytes in size now as is javaList2

        Similar results for Scala:

        val scalaArrayPrimitive = Array[Int](1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
        val scalaArrayObject = scalaArrayPrimitive.map(new java.lang.Integer(_))
        // scalaArrayPrimitive occupies just 64 Bytes, scalaArrayObject 240 Bytes
        
        val scalaBuffer1 = scalaArrayPrimitive.toBuffer
        val scalaBuffer2 = scalaArrayObject.toBuffer
        // scalaBuffer1 is 264 bytes in size now as is scalaBuffer2

        Several third party libraries provide primitive collection support on the JVM allowing the 8 primitives mentioned above to be directly stored in data structures. This can have a big impact on the memory footprint – the creators of Apache Spark recommend in their official tuning guide to

        Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.

        We will see below whether FastUtil is really the most suitable alternative.
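As a hedged sketch of what such a primitive collection looks like in practice (assuming the fastutil dependency is on the classpath; the values are arbitrary):

import it.unimi.dsi.fastutil.ints.IntArrayList

val ints = new IntArrayList()
(1 to 11).foreach(i => ints.add(i)) // stored as unboxed ints, no java.lang.Integer wrappers
println(ints.getInt(0))             // primitive read, no unboxing either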

        Empirical Level

Hardly any concrete memory sizes and runtime numbers have been mentioned so far. These two measurements are in fact very different: Estimating memory usage is a deterministic task compared to runtime performance, since the latter might be influenced by several non-deterministic factors, especially when operations run on an adaptive virtual machine that performs online optimizations.

        Memory measurements for JVM objects

        Determining the memory footprint of a complex object is far from trivial since JVM languages don’t provide a direct API for that purpose. Apache Spark has an internal function for this purpose that implements the suggestions of an older JavaWorld article. I ported the code and modified it a bit here so this memory measuring functionality can be conveniently used outside of Spark:

        val objectSize = JvmSizeEstimator.estimate(new Object())
        println(objectSize) // will print 16 since one flat object instance occupies 16 bytes

        Measurements for the most important classes from Table 2 with different element types and element sizes are shown below. The number of elements will be 0, 1, 4, 16, 64, 100, 256, 1024, 4096, 10000, 16192, 65536, 100000, 262144, 1048576, 4194304, 10000000, 33554432 and 50000000 in all configurations. For data structures that store individual elements, the two element types are int and String. For structures operating with key/value pairs, the combinations int/int and float/String will be used. The raw sizes of these element types are 4 bytes in the case of an individual int or float (16 bytes in boxed form) and, since all Strings used here will be 8 characters long, 56 bytes per String object.

        The same package abbreviations as in Table 2 above will be used for the Java/Scala classes under measurement. In addition, some classes from the following 3rd party libraries are also used in their latest edition at the time of writing:

Concerning the environment, JDK 1.8.0_171 on macOS High Sierra 10.13 was used. The JVM flag -XX:+UseCompressedOops can affect object memory sizes; it was enabled here, which is the default in Java 8.

        Measurements of single element structures

Below are the measurement results for the various combinations; every cell contains the object size in bytes for the data structure in the corresponding row filled with the number of elements indicated in the column. Some mutable classes provide the option to specify an initial capacity at construction time, which can sometimes lead to a smaller overall object footprint after the structure is filled up. I included an additional + capacity row in cases where the data structure in the previous row provides such an option and a difference could be measured.

        Java/Scala structures storing integers:

        Java/Scala structures storing strings:

        3rd party structures storing integers:

        3rd party structures storing strings:

        Measurements for key/value structures:

        For some mutable key/value structures like Java’s HashMap, a load factor that determines when to rehash can be specified in addition to an initial capacity. Similar to the logic in the previous tables, a row with + capacity will indicate that the data structure from the previous row was initialized using a capacity.

        Java/Scala structures storing integer/integer pairs:

        Java/Scala structures storing strings/float pairs:

        3rd party structures storing integer/integer pairs:

        3rd party structures storing strings/float pairs:

        The post Data Structure Zoo appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/data-structure-zoo/feed/ 0
        The Guide To Apache Spark Memory Optimization https://www.unraveldata.com/apache-spark-and-memory/ https://www.unraveldata.com/apache-spark-and-memory/#respond Thu, 23 Jan 2020 21:49:31 +0000 https://www.unraveldata.com/?p=4316 man programming multiple monitors

        Memory mysteries I recently read an excellent blog series about Apache Spark but one article caught my attention as its author states: Let’s try to figure out what happens with the application when the source file […]

        The post The Guide To Apache Spark Memory Optimization appeared first on Unravel.

        ]]>

        Memory mysteries

        I recently read an excellent blog series about Apache Spark but one article caught my attention as its author states:

        Let’s try to figure out what happens with the application when the source file is much bigger than the available memory. The memory in the below tests is limited to 900MB […]. Naively we could think that a file bigger than available memory will fail the processing with OOM memory error. And this supposition is true:

        It would be bad if Spark could only process input that is smaller than the available memory – in a distributed environment, it implies that an input of 15 Terabytes in size could only be processed when the number of Spark executors multiplied by the amount of memory given to each executor equals at least 15TB. I can say from experience that this is fortunately not the case so let’s investigate the example from the article above in more detail and see why an OutOfMemory exception occurred.

The input to the failed Spark application used in the article referred to above is a text file (generated_file_1_gb.txt) that is created by a script similar to this. This file is 1GB in size and has ten lines; each line simply consists of a line number (starting with zero) that is repeated 100 million times. The on-disk size of each line is easy to calculate: one byte for the line number multiplied by 100 million, or ~100MB. The program that processes this file launches a local Spark executor with three cores and the memory available to it is limited to 900MB as the JVM arguments -Xms900m -Xmx900m are used. This results in an OOM error after a few seconds, so this little experiment seems to validate the initial hypothesis that “we can’t process datasets bigger than the memory limits”.

I played around with the Python script that created the original input file here and …
- created a second input file that is twice the disk size of the original (generated_file_1_gb.txt) but will be processed successfully by ProcessFile.scala
- switched to the DataFrame API instead of the RDD API, which again crashes the application with an OOM error
- created a third file that is less than a third of the size of generated_file_1_gb.txt but that crashes the original application
- reverted to the original input file but made one small change in the application code which now processes it successfully (using .master(“local[1]”))
The first and last changes directly contradict the original hypothesis and the other changes make the memory mystery even bigger.

        Memory compartments explained

Visualizations will be useful for illuminating this mystery; the following pictures show Spark’s memory compartments when running ProcessFile.scala on my MacBook:

         

According to the system spec, my MacBook has four physical cores that amount to eight vCores. Since the application was initialized with .master(“local[3]”), three out of those eight virtual cores will participate in the processing. As reflected in the picture above, the JVM heap size is limited to 900MB and the default values for both spark.memory.fraction and spark.memory.storageFraction are used. Subtracting the roughly 300MB of reserved memory from the heap and multiplying by spark.memory.fraction (0.6 by default) yields a Usable Memory of (900MB – 300MB) * 0.6 = 360MB. The sizes of the two most important memory compartments from a developer perspective can be calculated with these formulas:

        Execution Memory = (1.0 – spark.memory.storageFraction) * Usable Memory = 0.5 * 360MB = 180MB
        Storage Memory = spark.memory.storageFraction * Usable Memory = 0.5 * 360MB = 180MB

        Execution Memory is used for objects and computations that are typically short-lived like the intermediate buffers of shuffle operation whereas Storage Memory is used for long-lived data that might be reused in downstream computations. However, there is no static boundary but an eviction policy – if there is no cached data, then Execution Memory will claim all the space of Storage Memory and vice versa. If there is stored data and a computation is performed, cached data will be evicted as needed up until the Storage Memory amount which denotes a minimum that will not be spilled to disk. The reverse does not hold true though, execution is never evicted by storage.

        This dynamic memory management strategy has been in use since Spark 1.6, previous releases drew a static boundary between Storage and Execution Memory that had to be specified before run time via the configuration properties spark.shuffle.memoryFraction, spark.storage.memoryFraction, and spark.storage.unrollFraction. These have become obsolete and using them will not have an effect unless the user explicitly requests the old static strategy by setting spark.memory.useLegacyMode to true.

        The example application does not cache any data so Execution Memory will eat up all of Storage Memory but this is still not enough:

         

We can finally see the root cause for the application failure, and the culprit is not the total input size but the individual record size: Each record consists of 100 million numbers (0 to 9) from which a java.lang.String is created. The size of such a String is twice its “native” size (each character consumes 2 bytes) plus some overhead for headers and fields, which amortizes to a Memory Expansion Rate of 2. As already mentioned, the Spark executor processing the text file uses three cores, which results in three tasks trying to load the first three lines of the input into memory at the same time. The 360MB of Execution Memory is shared equally among the active tasks, thus
Execution Memory per Task = (Usable Memory – Storage Memory) / spark.executor.cores = (360MB – 0MB) / 3 = 360MB / 3 = 120MB

        Based on the previous paragraph, the memory size of an input record can be calculated by
        Record Memory Size = Record size (disk) * Memory Expansion Rate
        = 100MB * 2 = 200MB
        … which is significantly above the available Execution Memory per Task hence the observed application failure.
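The arithmetic above can be condensed into a small sketch. The 300MB of reserved memory and the values 0.6 for spark.memory.fraction and 0.5 for spark.memory.storageFraction are Spark 2.x defaults; the remaining numbers mirror the example application.

val heapMb = 900.0          // -Xmx900m
val reservedMb = 300.0      // reserved by Spark
val memoryFraction = 0.6    // spark.memory.fraction
val storageFraction = 0.5   // spark.memory.storageFraction
val cores = 3               // .master("local[3]")
val cachedMb = 0.0          // nothing is cached here

val usableMb = (heapMb - reservedMb) * memoryFraction // 360MB
val storageMb = usableMb * storageFraction            // 180MB
val perTaskMb = (usableMb - cachedMb) / cores         // 120MB
val recordMb = 100.0 * 2                              // record size on disk * Memory Expansion Rate = 200MB
println(s"per task: $perTaskMb MB, per record: $recordMb MB") // 200MB > 120MB => OOM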

        Assigning just one core to the Spark executor will prevent the Out Of Memory exception as shown in the following picture:

         

Now there is only one active task that can use all of the Execution Memory, and each record fits comfortably into the available space since 200MB << 360MB. This defeats the whole point of using Spark of course since there is no parallelism; all records are now processed consecutively.

With the formulas developed above, we can estimate the largest record size which would not crash the original version of the application (which uses .master(“local[3]”)): We have around 120MB per task available, so any record can only consume up to 120MB of memory. Given our special circumstances, this implies that each line in the file should be at most 120/200 = 0.6 times as long as the original lines. I created a slightly modified script that creates such a maximum input; it uses a factor of 0.6 and the resulting file can still be processed without an OOM error. Using a factor of 0.7, though, would create an input that is too big and crash the application again, thus validating the thoughts and formulas developed in this section.

        Going distributed: Spark inside YARN containers

Things become even more complicated in a distributed environment. Suppose we run on AWS/EMR and use a cluster of m4.2xlarge instance types; then every node has eight vCPUs (four physical CPUs) and 32GB of memory according to https://aws.amazon.com/ec2/instance-types/. YARN will be responsible for resource allocations and each Spark executor will run inside a YARN container. Additional memory properties have to be taken into account since YARN needs some resources for itself:

Out of the 32GB of total node memory on an m4.2xlarge instance, 24GB can be used for containers/Spark executors by default (property yarn.nodemanager.resource.memory-mb), and the largest container/executor could use all of this memory (property yarn.scheduler.maximum-allocation-mb); these values are taken from https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html. Each YARN container needs some overhead in addition to the memory reserved for the Spark executor that runs inside it: the default value of the spark.yarn.executor.memoryOverhead property is 384MB or 0.1 * Container Memory, whichever is bigger, so the memory available to the Spark executor would be 0.9 * Container Memory in this scenario.

        How is this Container Memory determined? It is actually not a property that is explicitly set: Let’s say we use two Spark executors and two cores per executor (–executor-cores 2) as reflected in the image above. Then the Container Memory is
        Container Memory = yarn.scheduler.maximum-allocation-mb / Number of Spark executors per node = 24GB / 2 = 12GB

Therefore each Spark executor has 0.9 * 12GB available (equivalent to the JVM heap sizes in the images above) and the various memory compartments inside it can now be calculated based on the formulas introduced in the first part of this article. The virtual core count of two was just chosen for this example; it wouldn’t make much sense in real life since four vCores would be idle under this configuration. The best setup for m4.2xlarge instance types might be to use just one large Spark executor with seven cores, as one core should always be reserved for the operating system and other background processes on the node.

        Spark and Standalone Mode

Things become a bit easier again when Spark is deployed without YARN in Standalone Mode, as is the case with services like Azure Databricks:

        Only one Spark executor will run per node and the cores will be fully used. In this case, the available memory can be calculated for instances like DS4 v2 with the following formulas:
        Container Memory = (Instance Memory * 0.97 – 4800)
        spark.executor.memory = (0.8 * Container Memory)

        Memory and partitions in real life workloads

Determining the “largest” record that might lead to an OOM error is much more complicated than in the previous scenario for a typical workload: The line lengths of all input files used (like generated_file_1_gb.txt) were the same, so there was no “smallest” or “largest” record. Finding the maximum would be much harder, if not practically impossible, when transformations and aggregations occur. One approach is to search the input or intermediate data that was persisted to stable storage for the “largest” record and to create an object of the right type (the schema used during a bottleneck like a shuffle) from it. The memory size of this object can then be directly determined by passing a reference to SizeEstimator.estimate; a version of this function that can be used outside of Spark can be found here.

        Once an application succeeds, it might be useful to determine the average memory expansion rate for performance reasons as this could influence the choice of the number of (shuffle) partitions: One of the clearest indications that more partitions should be used is the presence of “spilled tasks” during a shuffle stage. In that case, the Spark Web UI should show two spilling entries (Shuffle spill (disk) and Shuffle spill (memory)) with positive values when viewing the details of a particular shuffle stage by clicking on its Description entry inside the Stage section. The presence of these two metrics indicates that not enough Execution Memory was available during the computation phase so records had to be evicted to disk, a process that is bad for the application performance. A “good” side effect of this costly spilling is that the memory expansion rate can be easily approximated by dividing the value for Shuffle spill (memory) by Shuffle spill (disk) since both metrics are based on the same records and denote how much space they take up in-memory versus on-disk, therefore:
        Memory Expansion Rate ≈ Shuffle spill (memory) / Shuffle spill (disk)

        This rate can now be used to approximate the total in-memory shuffle size of the stage or, in case a Spark job contains several shuffles, of the biggest shuffle stage. An estimation is necessary since this value is not directly exposed in the web interface but can be inferred from the on-disk size (field Shuffle Read shown in the details view of the stage) multiplied by the Memory Expansion Rate:
        Shuffle size in memory = Shuffle Read * Memory Expansion Rate

        Finally, the number of shuffle partitions should be set to the ratio of the Shuffle size (in memory) and the memory that is available per task, the formula for deriving the last value was already mentioned in the first section (“Execution Memory per Task”). So
        Shuffle Partition Number = Shuffle size in memory / Execution Memory per task
This value can now be used for the configuration property spark.sql.shuffle.partitions whose default value is 200 or, in case the RDD API is used, for spark.default.parallelism or as the second argument to operations that invoke a shuffle like the *byKey functions.

        The intermediate values needed for the last formula might be hard to determine in practice in which case the following alternative calculation can be used; it only uses values that are directly provided by the Web UI: The Shuffle Read Size per task for the largest shuffle stage should be around 150MB so the number of shuffle partitions would be equal to the value of Shuffle Read divided by it:
        Shuffle Partition Number = Shuffle size on disk (= Shuffle Read) / 150
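The partition-sizing heuristics from this section can be condensed into the following sketch; the spill and read values are hypothetical numbers read off the Spark Web UI for a shuffle stage, not measurements from a real job.

val shuffleSpillMemoryMb = 12288.0       // Shuffle spill (memory)
val shuffleSpillDiskMb = 4096.0          // Shuffle spill (disk)
val shuffleReadMb = 8192.0               // Shuffle Read (on-disk shuffle size)
val executionMemoryPerTaskMb = 120.0     // from the formula in the first section

val expansionRate = shuffleSpillMemoryMb / shuffleSpillDiskMb          // ≈ 3
val shuffleSizeInMemoryMb = shuffleReadMb * expansionRate
val shufflePartitions = math.ceil(shuffleSizeInMemoryMb / executionMemoryPerTaskMb).toInt
// e.g. spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions) once a SparkSession is at hand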

        The post The Guide To Apache Spark Memory Optimization appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/apache-spark-and-memory/feed/ 0
        Spark APIs: RDD, DataFrame, DataSet in Scala, Java, Python https://www.unraveldata.com/spark-apis-rdd-dataframe-dataset-in-scala-java-python/ https://www.unraveldata.com/spark-apis-rdd-dataframe-dataset-in-scala-java-python/#respond Wed, 22 Jan 2020 21:41:47 +0000 https://www.unraveldata.com/?p=4267 digital code background

        This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website. Once upon a time there was only one way to use Apache Spark but […]

        The post Spark APIs: RDD, DataFrame, DataSet in Scala, Java, Python appeared first on Unravel.

        ]]>

        This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website.

Once upon a time there was only one way to use Apache Spark, but support for additional programming languages and APIs has been introduced in recent times. A novice can be confused by the different options that have become available since Spark 1.6 and intimidated by the idea of setting up a project to explore these APIs. I’m going to show how to use Spark with three different APIs in three different programming languages by implementing the Hello World of Big Data, Word Count, for each combination.

        Project Setup: IntelliJ, Maven, JVM, Fat JARs

This repo contains everything needed for running Spark or PySpark locally and can be used as a template for more complex projects. Apache Maven is used as a build tool, so dependencies, build configurations etc. are specified in a POM file. After cloning, the project can be opened and executed with Maven from the terminal or imported into a modern IDE like IntelliJ via: File => New => Project From Existing Sources => Open => Import project from external model => Maven

To build the project without an IDE, go to its source directory and execute the command mvn clean package which compiles an “uber” or “fat” JAR at bdrecipes/target/top-modules-1.0-SNAPSHOT.jar. This file is a self-contained, executable unit that contains all dependencies specified in the POM. If you run your application on a cluster, the Spark dependencies are already “there” and shouldn’t be included in the JAR that contains the program code and other non-Spark dependencies. This can be done by adding a provided scope to the two Spark dependencies (spark-sql_2.11 and spark-core_2.11) in the POM.

        Run an application using the command

        java -cp target/top-modules-1.0-SNAPSHOT.jar

        or

        scala -cp target/top-modules-1.0-SNAPSHOT.jar 

        plus the fully qualified name of the class that should be executed. To run the DataSet API example for both Scala and Java, use the following commands:

        scala -cp target/top-modules-1.0-SNAPSHOT.jar spark.apis.wordcount.Scala_DataSet
        
        java -cp target/top-modules-1.0-SNAPSHOT.jar spark.apis.wordcount.Java_DataSet

        PySpark Setup

Python is not a JVM-based language, and the Python scripts that are included in the repo are completely independent from the Maven project and its dependencies. In order to run the Python examples, you need to install pyspark, which I did on macOS via pip3 install pyspark. The scripts can be run from an IDE or from the terminal via python3 python_dataframe.py.

        Implementation
        Each Spark program will implement the same simple Word Count logic:

        1. Read the lines of a text file; Moby Dick will be used here

        2. Extract all words from those lines and normalize them. For simplicity, we will just split the lines on whitespace here and lowercase the words

        3. Count how often each element occurs

        4. Create an output file that contains the element and its occurrence frequency

        The solutions for the various combinations using the most recent version of Spark (2.3) can be found here:

        Scala + RDD
        Scala + DataFrame
        Scala + DataSet
        Python + RDD
        Python + DataFrame
        Java + RDD
        Java + DataFrame
        Java + DataSet

        These source files should contain enough comments so there is no need to describe the code in detail here.
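For readers who just want the gist without cloning the repo, here is a minimal Scala/RDD word count sketch in the spirit of these examples; the input and output paths and the SparkSession setup are illustrative, not taken from the repo.

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountSketch").master("local[*]").getOrCreate()

    spark.sparkContext
      .textFile("mobydick.txt")           // 1. read the lines
      .flatMap(_.split("\\s+"))           // 2. split on whitespace …
      .map(_.toLowerCase)                 //    … and lowercase
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                 // 3. count occurrences
      .saveAsTextFile("wordcount-output") // 4. write the results
    spark.stop()
  }
}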

        The post Spark APIs: RDD, DataFrame, DataSet in Scala, Java, Python appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/spark-apis-rdd-dataframe-dataset-in-scala-java-python/feed/ 0
        Benchmarks, Data Spark and Graal https://www.unraveldata.com/benchmarks-spark-and-graal/ https://www.unraveldata.com/benchmarks-spark-and-graal/#respond Mon, 20 Jan 2020 15:30:56 +0000 https://www.unraveldata.com/?p=4167

        This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website. A very important question is how long something takes and the answer to that […]

        The post Benchmarks, Data Spark and Graal appeared first on Unravel.

        ]]>

        This is a blog by Phil Schwab, Software Engineer at Unravel Data. This blog was first published on Phil’s BigData Recipe website.

        A very important question is how long something takes and the answer to that is fairly straightforward in normal life: Check the current time, then perform the unit of work that should be measured, then check the time again. The end time minus the start time would equal the amount of time that the task took, the elapsed time or wallclock time. The programmatic version of this simple measuring technique could look like

def measureTime[A](computation: => A): Long = {
  val startTime = System.currentTimeMillis()
  computation
  val endTime = System.currentTimeMillis()
  val elapsedTime = endTime - startTime
  elapsedTime
}
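A quick usage sketch (the workload is an arbitrary local computation used purely for illustration):

val elapsed = measureTime { (1 to 10000000).map(_ + 1).sum }
println(s"took $elapsed ms")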

        In the case of Apache Spark, the computation would likely be of type Dataset[_] or RDD[_]. In fact, the two third party benchmarking frameworks for Spark mentioned below are based on a function similar to the one shown above for measuring the execution time of a Spark job.

It is surprisingly hard to accurately predict how long something will take in programming: The result from a single invocation of the naive method above is likely not very reliable since numerous non-deterministic factors can interfere with a measurement, especially when the underlying runtime applies dynamic optimizations, as the Java Virtual Machine does. Even the usage of a dedicated microbenchmarking framework like JMH only solves part of the problem – the user is reminded of that caveat every time JMH completes:

        
        [info] REMEMBER: The numbers below are just data. To gain reusable 
        [info] insights, you need to follow up on why the numbers are the 
        [info] way they are. Use profilers (see -prof, -lprof), design 
        [info] factorial experiments, perform baseline and negative tests 
        [info] that provide experimental control, make sure the 
        [info] benchmarking environment is safe on JVM/OS/HW level, ask 
        [info] for reviews from the domain experts. Do not assume the 
        [info] numbers tell you what you want them to tell.

        From the Apache Spark creators: spark-sql-perf

        If benchmarking a computation on a local machine is already hard, then doing this for a distributed computation/environment should be very hard. spark-sql-perf is the official performance testing framework for Spark 2. The following twelve benchmarks are particularly interesting since they target various features and APIs of Spark; they are organized into three classes:

        DatasetPerformance compares the same workloads expressed via RDD, Dataframe and Dataset API:

range just creates 100 million integers of datatype Long which are wrapped in a case class in the case of RDDs and Datasets
filter applies four consecutive filters to 100 million Longs
map applies an increment operation to 100 million Longs four times
average computes the average of one million Longs using a user-defined function for Datasets, a built-in SQL function for DataFrames and a map/reduce combination for RDDs.

        JoinPerformance is based on three data sets with one million, 100 million and one billion Longs:
        singleKeyJoins: joins each one of the three basic data sets in a left table with each one of the three basic data sets in a right table via four different join types (Inner Join, Right Join, Left Join and Full Outer Join)
        varyDataSize: joins two tables consisting of 100 million integers each with a ‘data’ column containing Strings of 5 different lengths (1, 128, 256, 512 and 1024 characters)
        varyKeyType: joins two tables consisting of 100 million integers and casts it to four different data types (String, Integer, Long and Double)
        varyNumMatches

        AggregationPerformance:

varyNumGroupsAvg and twoGroupsAvg both compute the average of one table column grouped by the other column. They differ in the cardinality and shape of the input tables used.
        The next two aggregation benchmarks use the three data sets that are also used in the DataSetPerformance benchmark described above:
        complexInput: For each of the three integer tables, adds a single column together nine times
        aggregates: aggregates a single column via four different aggregation types (SUM, AVG, COUNT and STDDEV)

Running these benchmarks produces… almost nothing: most of them are broken or will crash in the current state of the official master branch due to various problems (issues with reflection, missing table registrations, wrong UDF pattern, …):

        $ bin/run --benchmark AggregationPerformance
        [...]
        [error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.AggregationPerformance
        [error] at java.lang.Class.newInstance(Class.java:427)
        [error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
        [error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
        [error] at scala.util.Try.getOrElse(Try.scala:79)
        [...]
        $ bin/run --benchmark JoinPerformance
        [...]
        [error] Exception in thread "main" java.lang.InstantiationException: com.databricks.spark.sql.perf.JoinPerformance
        [error] at java.lang.Class.newInstance(Class.java:427)
        [error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:81)
        [error] at com.databricks.spark.sql.perf.RunBenchmark$anonfun$6.apply(RunBenchmark.scala:82)
        [error] at scala.util.Try.getOrElse(Try.scala:79)

I repaired these issues and was able to run all twelve benchmarks successfully; the fixed edition can be downloaded from my personal repo here, and a PR was also submitted. Enough complaints – the first results generated via

        $ bin/run --benchmark DatasetPerformance

        that compare the same workloads expressed in RDD, Dataset and Dataframe APIs are:

        name |minTimeMs| maxTimeMs| avgTimeMs| stdDev
        -------------------------|---------|---------|---------|---------
        DF: average | 36.53 | 119.91 | 56.69 | 32.31
        DF: back-to-back filters | 2080.06 | 2273.10 | 2185.40 | 70.31
        DF: back-to-back maps | 1984.43 | 2142.28 | 2062.64 | 62.38
        DF: range | 1981.36 | 2155.65 | 2056.18 | 70.89
        DS: average | 59.59 | 378.97 | 126.16 | 125.39
        DS: back-to-back filters | 3219.80 | 3482.17 | 3355.01 | 88.43
        DS: back-to-back maps | 2794.68 | 2977.08 | 2890.14 | 59.55
        DS: range | 2000.36 | 3240.98 | 2257.03 | 484.98
        RDD: average | 20.21 | 51.95 | 30.04 | 11.31
        RDD: back-to-back filters| 1704.42 | 1848.01 | 1764.94 | 56.92
        RDD: back-to-back maps | 2552.72 | 2745.86 | 2678.29 | 65.86
        RDD: range | 593.73 | 689.74 | 665.13 | 36.92

        This is rather surprising and counterintuitive given the focus of the architecture changes and performance improvements in Spark 2 – the RDD API performs best (= lowest numbers in the fourth column) for three out of four workloads, Dataframes only outperform the two other APIs in the back-to-back maps benchmark with 2062 ms versus 2890 ms in the case of Datasets and 2678 in the case of RDDs.

        The results for the two other benchmark classes are as follows:

        bin/run --benchmark AggregationPerformance
        name | minTimeMs | maxTimeMs | avgTimeMs | stdDev
        ------------------------------------|-----------|-----------|-----------|--------
        aggregation-complex-input-100milints| 19917.71 |23075.68 | 21604.91 | 1590.06
        aggregation-complex-input-1bilints | 227915.47 |228808.97 | 228270.96 | 473.89
        aggregation-complex-input-1milints | 214.63 |315.07 | 250.08 | 56.35
        avg-ints10 | 213.14 |1818.041 | 763.67 | 913.40
        avg-ints100 | 254.02 |690.13 | 410.96 | 242.38
        avg-ints1000 | 650.53 |1107.93 | 812.89 | 255.94
        avg-ints10000 | 2514.60 |3273.21 | 2773.66 | 432.72
        avg-ints100000 | 18975.83 |20793.63 | 20016.33 | 937.04
        avg-ints1000000 | 233277.99 |240124.78 | 236740.79 | 3424.07
        avg-twoGroups100000 | 218.86 |405.31 | 283.57 | 105.49
        avg-twoGroups1000000 | 194.57 |402.21 | 276.33 | 110.62
        avg-twoGroups10000000 | 228.32 |409.40 | 303.74 | 94.25
        avg-twoGroups100000000 | 627.75 |733.01 | 673.69 | 53.88
        avg-twoGroups1000000000 | 4773.60 |5088.17 | 4910.72 | 161.11
        avg-twoGroups10000000000 | 43343.70 |47985.57 | 45886.03 | 2352.40
        single-aggregate-AVG-100milints | 20386.24 |21613.05 | 20803.14 | 701.50
        single-aggregate-AVG-1bilints | 209870.54 |228745.61 | 217777.11 | 9802.98
        single-aggregate-AVG-1milints | 174.15 |353.62 | 241.54 | 97.73
        single-aggregate-COUNT-100milints | 10832.29 |11932.39 | 11242.52 | 601.00
        single-aggregate-COUNT-1bilints | 94947.80 |103831.10 | 99054.85 | 4479.29
        single-aggregate-COUNT-1milints | 127.51 |243.28 | 166.65 | 66.36
        single-aggregate-STDDEV-100milints | 20829.31 |21207.90 | 20994.51 | 193.84
        single-aggregate-STDDEV-1bilints | 205418.40 |214128.59 | 211163.34 | 4976.13
        single-aggregate-STDDEV-1milints | 181.16 |246.32 | 205.69 | 35.43
        single-aggregate-SUM-100milints | 20191.36 |22045.60 | 21281.71 | 969.26
        single-aggregate-SUM-1bilints | 216648.77 |229335.15 | 221828.33 | 6655.68
        single-aggregate-SUM-1milints | 186.67 |1359.47 | 578.54 | 676.30
        bin/run --benchmark JoinPerformance
        name |minTimeMs |maxTimeMs |avgTimeMs |stdDev
        ------------------------------------------------|----------|----------|----------|--------
        singleKey-FULL OUTER JOIN-100milints-100milints | 44081.59 |46575.33 | 45418.33 |1256.54
        singleKey-FULL OUTER JOIN-100milints-1milints | 36832.28 |38027.94 | 37279.31 |652.39
        singleKey-FULL OUTER JOIN-1milints-100milints | 37293.99 |37661.37 | 37444.06 |192.69
        singleKey-FULL OUTER JOIN-1milints-1milints | 936.41 |2509.54 | 1482.18 |890.29
        singleKey-JOIN-100milints-100milints | 41818.86 |42973.88 | 42269.81 |617.71
        singleKey-JOIN-100milints-1milints | 20331.33 |23541.67 | 21692.16 |1660.02
        singleKey-JOIN-1milints-100milints | 22028.82 |23309.41 | 22573.63 |661.30
        singleKey-JOIN-1milints-1milints | 708.12 |2202.12 | 1212.86 |856.78
        singleKey-LEFT JOIN-100milints-100milints | 43651.79 |46327.19 | 44658.37 |1455.45
        singleKey-LEFT JOIN-100milints-1milints | 22829.34 |24482.67 | 23633.77 |827.56
        singleKey-LEFT JOIN-1milints-100milints | 32674.77 |34286.75 | 33434.05 |810.04
        singleKey-LEFT JOIN-1milints-1milints | 682.51 |773.95 | 715.53 |50.73
        singleKey-RIGHT JOIN-100milints-100milints | 44321.99 |45405.85 | 44965.93 |570.00
        singleKey-RIGHT JOIN-100milints-1milints | 32293.54 |32926.62 | 32554.74 |330.73
        singleKey-RIGHT JOIN-1milints-100milints | 22277.12 |24883.91 | 23551.74 |1304.34
        singleKey-RIGHT JOIN-1milints-1milints | 683.04 |935.88 | 768.62 |144.85

         

        From Phil: Spark & JMH

        The surprising results from the DatasetPerformance benchmark above should make us skeptical – probably the benchmarking code or setup itself is to blame for the odd measurement, not the actual Spark APIs. A popular and quasi-official benchmarking framework for languages targeting the JVM is JMH so why not use it for the twelve Spark benchmarks? I “translated” them into JMH versions here and produced new results, among them the previously odd DatasetPerformance cases:

        Phils-MacBook-Pro: pwd
        /Users/Phil/IdeaProjects/jmh-spark
        Phils-MacBook-Pro: ls
        README.md benchmarks build.sbt project src target
        
        Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs1
        [...]
        Phils-MacBook-Pro: sbt benchmarks/jmh:run Bench_APIs2
        Benchmark (start) Mode Cnt Score Error Units
        Bench_APIs1.rangeDataframe 1 avgt 20 2618.631 ± 59.210 ms/op
        Bench_APIs1.rangeDataset 1 avgt 20 1646.626 ± 33.230 ms/op
        Bench_APIs1.rangeDatasetJ 1 avgt 20 2069.763 ± 76.444 ms/op
        Bench_APIs1.rangeRDD 1 avgt 20 448.281 ± 85.781 ms/op
        Bench_APIs2.averageDataframe 1 avgt 20 24.614 ± 1.201 ms/op
        Bench_APIs2.averageDataset 1 avgt 20 41.799 ± 2.012 ms/op
        Bench_APIs2.averageRDD 1 avgt 20 12.280 ± 1.532 ms/op
        Bench_APIs2.filterDataframe 1 avgt 20 2395.985 ± 36.333 ms/op
        Bench_APIs2.filterDataset 1 avgt 20 2669.160 ± 81.043 ms/op
        Bench_APIs2.filterRDD 1 avgt 20 2776.382 ± 62.065 ms/op
        Bench_APIs2.mapDataframe 1 avgt 20 2020.671 ± 136.371 ms/op
        Bench_APIs2.mapDataset 1 avgt 20 5218.872 ± 177.096 ms/op
        Bench_APIs2.mapRDD 1 avgt 20 2957.573 ± 26.458 ms/op

These results are more in line with expectations: Dataframes perform best in two out of four benchmarks. The Spark-internal functionality used for the other two (average and range) indeed favours RDDs:

        From IBM: spark-bench
        To be published

        From CERN:
        To be published

        Enter GraalVM

Most computer programs nowadays are written in higher-level languages so humans can create them faster and understand them more easily. But since a machine can only “understand” numerical languages, these high-level artifacts cannot be executed directly by a processor, so typically one or more additional steps are required before a program can be run. Some programming languages transform their user’s source code into an intermediate representation which is then compiled to or interpreted as machine code. The languages of interest in this article (roughly) follow this strategy: The programmer only writes high-level source code which is then automatically transformed into a file ending in .class that contains platform-independent bytecode. This bytecode file is further compiled down to machine code by the Java Virtual Machine, which takes care of hardware-specific aspects and, depending on the compiler used, applies optimizations. Finally this machine code is executed in the JVM runtime.

        One of the most ambitious software projects of the past years has probably been the development of a general-purpose virtual machine, Oracle’s Graal project, “one VM to rule them all.” There are several aspects to this technology, two of the highlights include the goal of providing seamless interoperability between (JVM and non-JVM) programming languages while running them efficiently on the same JVM and a new, high performance Java compiler. Twitter already uses the enterprise edition in production and saves around 8% of CPU utilization. The Community edition can be downloaded and used for free, more details below.

        Graal and Scala

Graal works at the bytecode level. In order to run Scala code via Graal, I created a toy example that is inspired by the benchmarks described above: The source code snippet below creates 10 million integers, increments each number by one, removes all odd elements and finally sums up the remaining even numbers. These four operations are repeated 100 times and during each step the execution time and the sum (which stays the same across all 100 iterations) are printed out. Before the program terminates, the total run time will also be printed. The following source code implements this logic – not in the most elegant way but with optimization potential for the final compiler phase where Graal will come into play:

object ProcessNumbers {
  // Helper functions:
  def increment(xs: Array[Int]) = xs.map(_ + 1)
  def filterOdd(xs: Array[Int]) = xs.filter(_ % 2 == 0)
  def sum(xs: Array[Int]) = xs.foldLeft(0L)(_ + _)

  // Main part:
  def main(args: Array[String]): Unit = {
    var totalTime = 0L
    // loop 100 times
    for (iteration <- 1 to 100) {
      val startTime = System.currentTimeMillis()
      val numbers = (0 until 10000000).toArray
      // transform numbers and sum them up
      val incrementedNumbers = increment(numbers)
      val evenNumbers = filterOdd(incrementedNumbers)
      val summed = sum(evenNumbers)
      // calculate times and print out
      val endTime = System.currentTimeMillis()
      val elapsedTime = endTime - startTime
      totalTime += elapsedTime
      println(s"Iteration $iteration took $elapsedTime milliseconds\t\t$summed")
    }
    println("*********************************")
    println(s"Total time: $totalTime ms")
  }
}

The transformation of this code to the intermediate bytecode representation is done as usual, via scalac ProcessNumbers.scala. The resulting bytecode file is not directly interpretable by humans, but those JVM instructions can be made more intelligible by disassembling them with the javap -c -cp command. The original source code above has fewer than 30 lines; the disassembled version has more than 200 lines, but in a simpler structure and with a small instruction set:

        javap -c -cp ./ ProcessNumbers$
        public final class ProcessNumbers$ {
        [...]
        
        public void main(java.lang.String[]);
        Code:
        0: lconst_0
        1: invokestatic #137 // Method scala/runtime/LongRef.create:(J)Lscala/runtime/LongRef;
        4: astore_2
        5: getstatic #142 // Field scala/runtime/RichInt$.MODULE$:Lscala/runtime/RichInt$;
        8: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
        11: iconst_1
        12: invokevirtual #145 // Method scala/Predef$.intWrapper:(I)I
        15: bipush 100
        17: invokevirtual #149 // Method scala/runtime/RichInt$.to$extension0:(II)Lscala/collection/immutable/Range$Inclusive;
        20: aload_2
        21: invokedynamic #160, 0 // InvokeDynamic #3:apply$mcVI$sp:(Lscala/runtime/LongRef;)Lscala/runtime/java8/JFunction1$mcVI$sp;
        26: invokevirtual #164 // Method scala/collection/immutable/Range$Inclusive.foreach$mVc$sp:(Lscala/Function1;)V
        29: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
        32: ldc #166 // String *********************************
        34: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
        37: getstatic #35 // Field scala/Predef$.MODULE$:Lscala/Predef$;
        40: new #172 // class java/lang/StringBuilder
        43: dup
        44: ldc #173 // int 15
        46: invokespecial #175 // Method java/lang/StringBuilder."":(I)V
        49: ldc #177 // String Total time:
        51: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        54: aload_2
        55: getfield #185 // Field scala/runtime/LongRef.elem:J
        58: invokevirtual #188 // Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
        61: ldc #190 // String ms
        63: invokevirtual #181 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        66: invokevirtual #194 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
        69: invokevirtual #170 // Method scala/Predef$.println:(Ljava/lang/Object;)V
        72: return
        [...]
        }

        Now we come to the Graal part: My system JDK is

        Phils-MacBook-Pro $ java -version
        java version "1.8.0_171"
        Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
        Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

        I downloaded the community edition of Graal from here and placed it in a folder along with Scala libraries and the files for the toy benchmarking example mentioned above:

        Phils-MacBook-Pro: ls
        ProcessNumbers$.class ProcessNumbers.class ProcessNumbers.scala graalvm scalalibs
        
        Phils-MacBook-Pro: ./graalvm/Contents/Home/bin/java -version
        openjdk version "1.8.0_172"
        OpenJDK Runtime Environment (build 1.8.0_172-20180626105433.graaluser.jdk8u-src-tar-g-b11)
        GraalVM 1.0.0-rc5 (build 25.71-b01-internal-jvmci-0.46, mixed mode)
        
        Phils-MacBook-Pro: ls scalalibs/
        jline-2.14.6.jar scala-library.jar scala-reflect.jar scala-xml_2.12-1.0.6.jar
        scala-compiler.jar scala-parser-combinators_2.12-1.0.7.jar scala-swing_2.12-2.0.0.jar scalap-2.12.6.jar

        Let’s run this benchmark with the “normal” JDK via java -cp ./lib/scala-library.jar:. ProcessNumbers. Around 31 seconds are needed as can be seen below (only the first and last iterations are shown)

        Phils-MacBook-Pro: java -cp ./lib/scala-library.jar:. ProcessNumbers

        Iteration 1 took 536 milliseconds 25000005000000
        Iteration 2 took 533 milliseconds 25000005000000
        Iteration 3 took 350 milliseconds 25000005000000
        Iteration 4 took 438 milliseconds 25000005000000
        Iteration 5 took 345 milliseconds 25000005000000
        [...]
        Iteration 95 took 293 milliseconds 25000005000000
        Iteration 96 took 302 milliseconds 25000005000000
        Iteration 97 took 333 milliseconds 25000005000000
        Iteration 98 took 282 milliseconds 25000005000000
        Iteration 99 took 308 milliseconds 25000005000000
        Iteration 100 took 305 milliseconds 25000005000000
        *********************************
        Total time: 31387 ms

        And here a run that invokes Graal as JIT compiler:

        Phils-MacBook-Pro:testo a$ ./graalvm/Contents/Home/bin/java -cp ./lib/scala-library.jar:. ProcessNumbers

        Iteration 1 took 1287 milliseconds 25000005000000
        Iteration 2 took 264 milliseconds 25000005000000
        Iteration 3 took 132 milliseconds 25000005000000
        Iteration 4 took 120 milliseconds 25000005000000
        Iteration 5 took 128 milliseconds 25000005000000
        [...]
        Iteration 95 took 111 milliseconds 25000005000000
        Iteration 96 took 124 milliseconds 25000005000000
        Iteration 97 took 122 milliseconds 25000005000000
        Iteration 98 took 123 milliseconds 25000005000000
        Iteration 99 took 120 milliseconds 25000005000000
        Iteration 100 took 149 milliseconds 25000005000000
        *********************************
        Total time: 14207 ms

14 seconds compared to 31 seconds means a 2x speedup with Graal, not bad. The first iteration takes much longer but then a turbo boost seems to kick in – most iterations from 10 to 100 take around 100 to 120 ms in the Graal run compared to 290-310 ms in the vanilla Java run. Graal itself has an option to deactivate the optimization via the -XX:-UseJVMCICompiler flag; trying that results in numbers similar to the first run:

        Phils-MacBook-Pro: /Users/a/graalvm/Contents/Home/bin/java -XX:-UseJVMCICompiler -cp ./lib/scala-library.jar:. ProcessNumbers
        Iteration 1 took 566 milliseconds 25000005000000
        Iteration 2 took 508 milliseconds 25000005000000
        Iteration 3 took 376 milliseconds 25000005000000
        Iteration 4 took 456 milliseconds 25000005000000
        Iteration 5 took 310 milliseconds 25000005000000
        [...]
        Iteration 95 took 301 milliseconds 25000005000000
        Iteration 96 took 301 milliseconds 25000005000000
        Iteration 97 took 285 milliseconds 25000005000000
        Iteration 98 took 302 milliseconds 25000005000000
        Iteration 99 took 296 milliseconds 25000005000000
        Iteration 100 took 296 milliseconds 25000005000000
        *********************************
        Total time: 30878 ms

        Graal and Spark

Why not invoke Graal for Spark jobs? Let’s do this for my benchmarking project introduced above with the -jvm flag:

        Phils-MacBook-Pro:jmh-spark $ sbt
        Loading settings for project jmh-spark-build from plugins.sbt ...
        Loading project definition from /Users/Phil/IdeaProjects/jmh-spark/project
        Loading settings for project jmh-spark from build.sbt ...
        Set current project to jmh-spark (in build file:/Users/Phil/IdeaProjects/jmh-spark/)
        sbt server started at local:///Users/Phil/.sbt/1.0/server/c980c60cda221235db06/sock
        
        sbt:jmh-spark> benchmarks/jmh:run -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java
        Running (fork) spark_benchmarks.MyRunner -jvm /Users/Phil/testo/graalvm/Contents/Home/bin/java

        The post Benchmarks, Data Spark and Graal appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/benchmarks-spark-and-graal/feed/ 0
        Unravel Joins AWS Partner Network Global Startup Program https://www.unraveldata.com/unravel-joins-aws-partner-network-global-startup-program/ https://www.unraveldata.com/unravel-joins-aws-partner-network-global-startup-program/#respond Wed, 18 Dec 2019 12:00:50 +0000 https://www.unraveldata.com/?p=4088

        At Unravel, we have been working with AWS for some time to help our joint customers simplify data migrations to the cloud. Announced at AWS re:Invent, we are proud to be partnering with AWS on the […]

        The post Unravel Joins AWS Partner Network Global Startup Program appeared first on Unravel.

        ]]>

At Unravel, we have been working with AWS for some time to help our joint customers simplify data migrations to the cloud. As announced at AWS re:Invent, we are proud to be partnering with AWS on the newly created AWS Partner Network Global Startup Program. Unravel provides detailed insights and recommendations to help enterprises of all sizes slice and dice application workloads to ensure an effective cloud migration strategy.

Once data workloads are running on AWS, Unravel provides a solution for rapid troubleshooting and performance optimization that lets you get more cluster for your money and enables ops and apps teams to meet their SLA targets with confidence.

In addition, Unravel provides full visibility into your Amazon EMR resource consumption, with cost accounting and chargeback reporting for all your application stakeholders.

        For a more detailed view on how we do what we do under the covers take a look here.

        If you are ready to try out Unravel and see for yourself, we are available on the AWS Marketplace and offer a $2000 incentive to try us out!

        The full Press Release can be viewed below:

        ————————————————————————————————————————————————

Unravel Joins AWS Partner Network Global Startup Program

        PALO ALTO, Calif. – December 18, 2019– Unravel Data, a data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, announced its membership in the Amazon Web Services (AWS) Partner Network (APN) Global Startup Program. The APN Global Startup Program is a unique “white glove” support and go-to-market program for selected startup APN Partners, allowing members to build on their AWS expertise, better serve shared customers, and accelerate their growth. To be selected for the APN Global Startup Program, Unravel had to meet predefined criteria, including a clear, demonstrated product market fit for an innovative enterprise tech product, be backed and recommended by a top-tier venture capital firm, and demonstrate a strategic commitment to building their AWS and cloud expertise.

        The APN Global Startup Program enables qualifying startups to gain product design wins, visibility, exposure, leads, and commercial opportunities made possible with exclusive APN resources and dedicated Startup Partner Development Managers (PDM) with deep AWS knowledge and startup business experience, that guide startups in their growth journey with APN. By becoming an APN Global Startup Partner, Unravel will receive benefits ranging from a tailor-made plan for mapping the startup needs and opportunities to a selection of AWS services and APN programs, promotion support to drive visibility and awareness around the startup offering, to resources for helping startups sell and deploy innovative solutions on behalf of AWS shared customers.

“Unravel is proud to be part of the APN and the newly launched APN Global Startup Program,” said Kunal Agarwal, CEO of Unravel Data. “Our team is dedicated to helping companies simplify their cloud data operations by leveraging the agility, breadth of services, and pace of innovation that AWS provides access to.”

        The Unravel platform is designed to accelerate the adoption of big data workloads in AWS. By supporting Amazon EMR, Unravel allows users to connect to a new or existing Amazon EMR cluster with just one click. Unravel for Amazon EMR can improve the productivity of big data teams with a simple, intelligent, self-service performance management capability. Unravel for Amazon EMR is engineered to:

        • Automatically fix slow, inefficient and failing Spark, Hive, MapReduce, HBase, and Kafka applications
        • Right size AWS cloud expenses by automatically adjusting resource consumption by users and applications
        • Get a detailed view of consumption to understand cluster resource usage by user, department, or project and enable chargeback accounting
• Using AI, machine learning, and other advanced analytics, Unravel can assure service level agreements (SLAs) and optimize compute, I/O, and storage costs. Furthermore, Unravel can reduce operational overhead through advanced automation and predictive maintenance, enabled by unified observability and AIOps capabilities.

Joining the APN Global Startup Program will allow more customers to discover the benefits Unravel has to offer for modern data stack environments to monitor, manage and improve data pipelines built on AWS.

        AWS is enabling scalable, flexible, and cost-effective solutions from startups to global enterprises. The APN is a global program helping partners build a successful AWS-based business by aiding organizations to build, market, and sell their offerings. The APN provides valuable business, technical, and marketing support, empowering startups to achieve exponential growth.

        _

        About Unravel Data – Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement

        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        Contacts

        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post Unravel Joins AWS Partner Network Global Startup Program appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/unravel-joins-aws-partner-network-global-startup-program/feed/ 0
        Top 10 Bank Leverages Unravel’s AIOps To Tame Fraud Detection and Compliance Performance https://www.unraveldata.com/top-10-bank-leverages-unravels-aiops-capabilities-to-tame-fraud-detection-and-compliance-app-performance-failures/ https://www.unraveldata.com/top-10-bank-leverages-unravels-aiops-capabilities-to-tame-fraud-detection-and-compliance-app-performance-failures/#respond Mon, 09 Dec 2019 14:00:57 +0000 https://www.unraveldata.com/?p=3936

        Unsurprisingly, Modern data apps have become crucial across all areas of the financial industry, with apps for fraud detection, claims processing and compliance amongst others playing a business-critical role. Unravel has been deployed by several of […]

        The post Top 10 Bank Leverages Unravel’s AIOps To Tame Fraud Detection and Compliance Performance appeared first on Unravel.

        ]]>

Unsurprisingly, modern data apps have become crucial across all areas of the financial industry, with apps for fraud detection, claims processing and compliance amongst others playing a business-critical role. Unravel has been deployed by several of the world’s largest financial services organizations to ensure these critical apps perform reliably at all times. One recent example is one of America’s ten largest banks, a corporation that encompasses over 3,000 retail branches, 5,000 ATMs and over 70,000 employees. This is what happened.

        This bank has been using big data for a variety of purposes, but its two most important apps are fraud detection and compliance. They deployed Informatica broadly in order to run ETL jobs. This was a massive focus for the bank’s DataOps team, which had many workflows running multiple Hive queries. They also made heavy use of Spark and Kafka Streaming in order to process tons of real-time streaming data for their fraud detection app.

        Unravel Kafka Dashboard (main)

The bank suffered constant headaches before they deployed Unravel. First, their data apps tended to be slow and failed frequently. To figure out why, they had to dig through an avalanche of raw data logs, a process that could take weeks. Once they had identified the problem, they would have to go through a long trial-and-error process to determine how to fix it. This again could take weeks, if they were fortunate enough to find a fix for the issue at all.

There was another monitoring issue at the cluster usage level. They knew they weren’t optimally consuming their compute resources but had no visibility into how to improve utilization. The team only became fully aware of how serious the compute utilization problem was when it caused a critical data app to fail.

After deploying Unravel, the bank was able to quickly alleviate these problems. To begin, the platform’s reporting capabilities changed things dramatically. The team was able to monitor and understand its many different modern data stack technologies (Hive, Spark, Workflows, Kafka) from a single interface rather than relying on siloed views that didn’t allow correlation or yield many useful insights. The bank’s Kafka Streaming deployment had been particularly hard to monitor due to the massive volume of incoming streaming data. In addition, they previously had no way to track whether Informatica and Hive queries for ETL jobs were hitting SLAs. Unravel changed all of that, delivering detailed insights that told the team how every workload was performing.

        The insights were just as valuable at the cluster usage level, with Unravel providing cluster optimization opportunities to further boost performance and reduce wasteful resource consumption. This was the first time the bank really felt they understood what was happening in each and every cluster.

On top of the monitoring and visibility capabilities, Unravel yielded a significant boost in app performance. This is where the platform’s AI and automated recommendations were a huge boon for the customer. After first automatically diagnosing several root cause issues, Unravel delivered cleanup recommendations for almost half a million Hive tables, resulting in tremendous performance improvements. The platform also enabled the team to set notifications for specific failures and gave them the option to run automated fixes in these circumstances.

        See Unravel AI and automation in action

        Try Unravel for free

        Examples of “Stalled” and “Lagging” Consumer Groups (name = “demo”) showing in the Unravel UI
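For readers unfamiliar with the terms in the screenshot above: a consumer group is “lagging” when the gap between a topic’s latest offsets and the group’s committed offsets keeps growing, and “stalled” when its committed offsets stop advancing while producers keep writing. A minimal sketch of that lag calculation, assuming the kafka-python client and a hypothetical broker address (Unravel derives and tracks this automatically):

```python
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "broker.example.com:9092"   # hypothetical broker address
GROUP = "demo"                          # consumer group name shown in the screenshot

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Committed offsets for the group: {TopicPartition: OffsetAndMetadata}
committed = admin.list_consumer_group_offsets(GROUP)

# Latest offsets for the same partitions: {TopicPartition: int}
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    # A group whose lag keeps growing is lagging; one whose committed offset
    # stops advancing while new messages arrive is stalled.
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```

Doing this continuously across thousands of partitions, and correlating the lag with the Spark jobs consuming those topics, is what made the deployment so hard to monitor by hand.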

While the bank isn’t currently deploying data apps in the cloud, they do have plans to migrate soon. One of the hardest parts of any cloud migration is the planning phase. Unravel’s cloud assessment capabilities give the bank detailed insights to streamline this painstaking preliminary phase: the assessment mapped out the bank’s on-premises big data pipelines, identified which apps were the best fit for the cloud, and showed how those apps should be configured, with specific instance type recommendations and forecasts of costs and consumption. This saved the customer from having to hire an expensive consulting firm to evaluate and advise their move to the cloud, accelerated their decision timeline, and, critically, provided data-driven insights instead of relying on guesswork.

Modern data apps are the backbone of any major financial institution. Unravel’s AI-driven DataOps platform allowed this bank to leverage these critical data apps to their full potential for the first time. Unravel has been so transformative that the customer has been able to open their data lake to broader business users, democratizing data apps so they provide value beyond the developer and IT operations staff. In the bank’s own words, Unravel is helping drive a cultural shift by ensuring big data delivers on its true potential, and is future-proofing architectural decisions as hybrid cloud deployments are evaluated.

        The post Top 10 Bank Leverages Unravel’s AIOps To Tame Fraud Detection and Compliance Performance appeared first on Unravel.

        Doing more with Data and evolving to DataOps https://www.unraveldata.com/doing-more-with-data-and-evolving-to-dataops/ https://www.unraveldata.com/doing-more-with-data-and-evolving-to-dataops/#respond Tue, 03 Dec 2019 23:16:45 +0000 https://www.unraveldata.com/?p=3938


        As technology evolves at a rapid pace, the healthcare industry is transforming quickly along with it. Tech breakthroughs like IoT, advanced imaging, genomics mapping, artificial intelligence and machine learning are some of the key items re-shaping the space. The result is better patient care and health outcomes. To facilitate this shift to the next generation of healthcare services – and to deliver on the promise of improved patient care – organizations are adopting modern data technologies to support new use cases.

We are a large company operating healthcare facilities across the US and employing over 20,000 people. Despite our size, we understand that we must be nimble and adapt fast to keep delivering cutting-edge healthcare services. Although we only began leveraging big data about three years ago, we’ve grown fast and built out a significant modern data stack, including Kafka, Impala, and Spark Streaming deployments, among others. Our focus has always been on the applications, developer needs, and, ultimately, business value.

We’ve built a number of innovative data apps on top of our growing data pipelines, providing great new services and insights for our customers and employees alike. During this process, though, we realized that it’s extremely difficult to manually troubleshoot and manage such data apps. We have a very developer-focused culture – the programmers are building the very apps that are ushering in the next generation of healthcare, and we put them front and center. We were concerned when we noticed these developers were sinking huge chunks of their time into diagnosing and fixing failing apps, taking their focus away from creating new apps that drive core business value.

        Impala and Spark Streaming are two modern data stack technologies that our developers increasingly employed to support next generation use cases. These two technologies are commonly used to build apps that leverage streaming data, which is prevalent in our industry. Unfortunately, both Impala and Spark Streaming are difficult to manage. Apps built with these two were experiencing frequent slowdowns and intermittent crashes. Spark Streaming, in particular, was very hard to even monitor.

Our key data apps were not performing the way we expected, and our programmers were wasting tons of time trying to troubleshoot them. When we deployed Unravel Data, it changed things swiftly, providing new insights into aspects of our data apps we previously had no visibility into and drastically improving app performance.

Impala – improving performance by 12-18x!

        Impala is a distributed analytic SQL engine running on the Hadoop platform. Unravel provided critical metrics that helped us to better understand how Impala was being used, including:

        • Impala memory consumption
        • Impala queries
        • Detail for queries using drill-down functionality
        • Recommendations on how to make queries run faster, use data across nodes more efficiently, and more

Unravel analyzed the query pattern against Impala (insert and select operations, and data locality across the Hadoop cluster) and offered a few key insights. For one, Unravel saw that most of the time was spent scanning Hadoop’s file system across nodes and combining the results. After computing stats on the underlying table – a simple operation – we were able to dramatically improve performance by 12-18x.
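For context on what that “simple operation” involves, computing statistics on an Impala table is a single statement. Here is a rough sketch using the impyla client, with a hypothetical coordinator host and table name:

```python
from impala.dbapi import connect

# Hypothetical coordinator host and table name
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# COMPUTE STATS gathers table and column statistics that Impala's planner
# uses to choose better execution plans and avoid unnecessary full scans.
cur.execute("COMPUTE STATS analytics.patient_events")

# Optionally verify that row counts and column stats are now populated
cur.execute("SHOW TABLE STATS analytics.patient_events")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

With statistics in place, Impala can plan queries far more intelligently, which is consistent with the improvement we saw.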

        Unravel provides detailed insights for Impala

         

        Spark Streaming – Reducing memory requirements by 80 percent!

Spark Streaming is a lightning-quick processing and analytics engine that’s perfect for handling enormous quantities of streaming data. As with Impala, Unravel provided insights and recommendations that alleviated the headaches we were having with the technology. The platform told us we hadn’t allocated enough memory for many Spark Streaming jobs, which was ultimately causing all the slowdowns and crashes. Unravel then provided specific recommendations on how to re-configure Spark Streaming, a process that’s typically complicated and replete with costly missteps. In addition, Unravel found that we could save significant CPU resources by increasing task parallelism across the available cores.

        Overall, two critical Spark Streaming jobs saw memory reductions of 74 percent and 80 percent. Unravel’s parallelism recommendations saved us 8.63 hours of CPU per day.
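All of these were configuration changes rather than code changes. As a rough sketch of the kinds of settings involved (the values below are illustrative placeholders, not the actual recommendations Unravel produced for our jobs):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Illustrative values only; the right numbers depend on the workload and cluster.
conf = (
    SparkConf()
    .setAppName("claims-stream")                          # hypothetical job name
    .set("spark.executor.memory", "4g")                   # right-sized executor heap
    .set("spark.executor.cores", "4")                     # tasks run in parallel on each executor
    .set("spark.default.parallelism", "64")               # match task count to available cores
    .set("spark.streaming.backpressure.enabled", "true")  # throttle ingest when falling behind
)

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches
```

Because only the configuration changes, adjustments like these can be applied, measured, and rolled back quickly.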

        Spark Streaming performance recommendation

         

        The Bigger Picture

Unravel is straightforward to implement and immediately delivers results. The platform’s recommendations are all configuration changes and don’t require any changes to code. We were stunned that we could improve app performance so considerably without making a single tweak to the code, yielding an immediate boost to critical business apps. Unravel’s full-stack platform delivers insights and recommendations for our entire modern data stack deployment, eliminating the need to manage any data pipeline with a siloed tool.

Modern data apps are fueling healthcare’s technical transformation. By improving data app performance, we have been able to continue delivering a pioneering healthcare experience, achieving better patient outcomes, new services, and greater business value. Without a platform like Unravel, our developers and IT team would be bogged down troubleshooting these apps rather than creating new ones and revolutionizing our business. Unravel has helped create a deep cultural shift to do more with our data and evolve to a DataOps mindset.

        The post Doing more with Data and evolving to DataOps appeared first on Unravel.

The Role of Data Operations During Cloud Migrations https://www.unraveldata.com/the-role-of-data-operations-during-cloud-migrations/ https://www.unraveldata.com/the-role-of-data-operations-during-cloud-migrations/#respond Wed, 27 Nov 2019 14:03:28 +0000 https://www.unraveldata.com/?p=3902


        Over the last few years, there has been a mad rush within enterprise organizations to move big data workloads to the cloud. On the surface, it would seem that it’s a bit of a keeping up with the Joneses syndrome with organizations moving big data workloads to the cloud simply because they can.

        It turns out, however, that the business rationale for moving big data workloads to the cloud is essentially the same as the broader cloud migration story: rather than expending scarce resources on building, managing, and monitoring infrastructure, use those resources, instead, to create value.

        In the case of big data workloads, that value comes in the form of uncovering insights in the data or building and operationalizing machine learning models, among others.

        To realize this value, however, organizations must move big data workloads to the cloud at scale without adversely impacting performance or incurring unexpected cost overruns. As they do so, they begin to recognize that migrating production big data workloads from on-premises environments to the cloud introduces and exposes a whole new set of challenges that most enterprises are ill-prepared to address.

        The most progressive of these organizations, however, are also finding an unexpected solution in the discipline of data operations.

        Test-drive Unravel for DataOps

        Try Unravel for free

        The Challenges With Migrating Big Data to the Cloud

The transition from an on-premises big data architecture to a cloud-based or hybrid approach can expose an enterprise to several operational, compliance and financial risks borne of the complexity of both the existing infrastructure and the transition process.

In many cases, however, these risks are not discovered until the transition is well underway — and often after an organization has already been exposed to some kind of meaningful risk or negative business impact.

        As enterprises begin the process of migration, they often discover that their big data workloads are exceedingly complex, difficult to understand, and that their teams lack the skills necessary to manage the transition to public cloud infrastructure.

        Once the migrations are underway, the complexity and variability of cloud platforms can make it difficult for teams to effectively manage the movement of these in-production workloads while they navigate the alphabet soup of options (IaaS, PaaS, SaaS, lift and shift, refactoring, cloud-native approaches, and so on) to find the proper balance of agility, elasticity, performance, and cost-effectiveness.

        During this transitional period, organizations often suffer from runaway and unexpected costs, and significant operational challenges that threaten the operational viability of these critical workloads.

        Worse yet, the initial ease of transition (instantiating instances, etc.) belies the underlying complexity, creating a false sense of security that is gloriously smashed.

Even after those initial challenges are overcome, or at least somewhat mitigated, organizations find that these critical and intensive workloads, once migrated to the cloud, present an ongoing management and optimization challenge.

        While there is no question that the value driving the move of these workloads to the cloud remains valid and worthwhile, enterprises often realize it only after much gnashing of teeth and long, painful weekends. The irony, however, is that these challenges are often avoidable when organizations first embrace the discipline of data operations and apply it to the migration of their big data workloads to the cloud.

        Data Operations: The Bridge Over Troubled Waters

The biggest reason organizations unnecessarily suffer this fate is that they buy into the sex appeal (or the executive mandate – every CIO needs to be a cloud innovator, right?) of the cloud sales pitch: just turn it on, move it over, and all is well.

        While there is unquestionable value in moving all forms of operations to the cloud, organizations have repeatedly found that it is rarely as simple as merely spinning up an instance and turning on an environment.

        Instead of jumping first and sorting out the challenges later, organizations must plan for the complexity of a big data workload migration. One of the simplest ways of doing so is through the application of data operations.

        While data operations is a holistic, process-oriented discipline that helps organizations manage data pipelines, organizations can also apply it to significant effect during the migration process. The reason this works is that data operations uses process and performance-oriented data to manage the data pipeline and its associated workloads.

        Because of this data orientation, it also provides deep visibility into those workloads — precisely what organizations require to mitigate the less desirable impacts of migrating to the cloud.

        Using data operations, and tools built to support it, organizations can gather data that enables them to assess and plan their migrations to the cloud, do baselining, instance mapping, capacity forecasting and, ultimately, cost optimization.

        It is this data — or more precisely, the lack of it — that is often the difference between successful big data cloud migrations and the pain and suffering that most organizations now endure.

        Cloud migration made much, much easier

        Try Unravel for free

        The Intellyx Take: Visibility is the Key

        When enterprises undertake a big data cloud migration, they must step through three core stages of the effort: planning, the migration process itself, and continual post-migration optimization.

        Because most organizations lack sufficient data, they often do only limited planning efforts. Likewise, lacking data and with a pressing urgency to execute, they often rush migration efforts and charge full-steam ahead. The result is that they spend most of their time and resources dealing with the post-migration optimization efforts — where it is both most expensive and most negatively impactful to the organization.

        Flipping this process around and minimizing the risk, costs, and negative impact of a migration requires the same key ingredient during each of these three critical stages: visibility.

        Organizations need the ability to capture and harness data that enables migration teams to understand the precise nature of their workloads, map workload requirements to cloud-based instances, forecast capacity demands over time, and dynamically manage all of this as workload requirements change and the pipeline transitions and matures.

        Most importantly, this visibility also enables organizations to plan and manage phased migrations in which they can migrate only slices of applications at a time, based on specific and targeted demands and requirements. This approach enables not only faster migrations, but also simultaneously reduces both costs and risk.

        Of course, this type of data-centric visibility demands tools highly tuned to the specific needs of data operations. Therefore, those organizations that are taking this more progressive and managed approach to big data migrations are turning to purpose-built data operations tools, such as Unravel, to help them manage the process.

        The business case for moving big data operations to the cloud is clear. The pathway to doing so without inflicting significant negative impact on the organization, however, is less so.

        Leading organizations, therefore, have recognized that they can leverage the data operations discipline to smooth this process and give them the visibility they need to realize value from the cloud, without taking on unnecessary cost, risk or pain.

        Copyright © Intellyx LLC. As of the time of writing, Unravel is an Intellyx customer. Intellyx retains final editorial control of this paper.

        The post The Role of Data Operations During Cloud Migrations appeared first on Unravel.

Unravel Launches Performance Management and Cloud Migration Assessment Solution for Google Cloud Dataproc https://www.unraveldata.com/unravel-launches-performance-management-and-cloud-migration-assessment-solution-for-google-cloud-dataproc/ https://www.unraveldata.com/unravel-launches-performance-management-and-cloud-migration-assessment-solution-for-google-cloud-dataproc/#respond Thu, 17 Oct 2019 22:31:34 +0000 https://www.unraveldata.com/?p=3775


We are very excited to announce that Unravel is now available for Google Cloud Dataproc. If you are already on GCP Dataproc, or on a journey to migrate on-premises data workloads to GCP Dataproc, Unravel is available immediately to accelerate and de-risk your Dataproc migration and to ensure performance SLAs and cost targets are achieved.

We introduce our support for Dataproc as Google Cloud enters what is widely referred to as ‘Act 2’ under the watch of new CEO Thomas Kurian. We can expect an acceleration in new product announcements, engagement initiatives with the partner ecosystem, and a restructured, enterprise-focused go-to-market model. We got a feel for this earlier this year at the San Francisco Google Next event, and we can expect a lot more coming out of the highly anticipated Google Cloud Next ’19 event in London this November.

As cloud services adoption continues to accelerate, the complexity will only continue to present enterprise buyers with a bewildering array of choices. As outlined in our Understanding Cloud Data Services blog, wherever you are on your cloud adoption, platform evaluation, or workload migration journey, now is the time to accelerate your strategic thinking and execution planning for cloud-based data services.

        We are already helping customers run their big data workloads on GCP IaaS and with this new addition, we now support cloud native big data workloads running on Dataproc. In addition, Unravel plans to cover other important analytics platforms (including Google BigQuery) as part of our roadmap. This ensures Unravel provides an end-to-end, ‘single pane of glass’ for enterprises to manage their data pipelines on GCP.

        Learn more about Unravel for Google Dataproc here and as always please provide feedback so we can continue to deliver for your platform investments.

        Press Release quoted below.

        ————————————————————————————————————————————————

        Unravel Data Launches Performance Management and Cloud Migration Assessment Solution for Google Cloud Dataproc

PALO ALTO, Calif. – October 22, 2019 — Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today introduced a performance management solution for the Google Cloud Dataproc platform, making data workloads running on top of the platform simpler to use and cheaper to run.

Unravel for Cloud Dataproc, which is available immediately, can improve the productivity of data teams with a simple and intelligent self-service performance management capability, helping DataOps teams:

        • Optimize data pipeline performance and ensure application SLAs are adhered to
        • Monitor and automatically fix slow, inefficient and failing Spark, Hive, HBase and Kafka workloads
        • Maximize cost savings by containing resource-hogging users or applications
        • Get a detailed chargeback view to understand which users or departments are utilizing the system resources

        For enterprises powered by modern data applications that rely on distributed data systems, the Unravel platform accelerates new cloud workload adoption by operationalizing a reliable data infrastructure, and it ensures enforceable SLAs and lower compute and I/O costs, while drastically lowering storage costs. Furthermore, it reduces operational overhead through rapid mean time to identification (MTTI) and mean time to resolution (MTTR), enabled by unified observability and AIOps capabilities.

        “Unravel simplifies the management of data apps wherever they reside – on-premises, in a public cloud, or in a hybrid mix of the two. Extending our platform to Google Cloud Dataproc marks another milestone on our roadmap to radically simplify data operations and accelerate cloud adoption,” said Kunal Agarwal, CEO of Unravel Data. “As enterprises plan and execute their migrations to the cloud, Unravel enables operations and app development teams to improve the performance and reduce the risks commonly associated with these migrations.”

        In addition to DataOps optimization, Unravel provides a cloud migration assessment offering to help organizations move data workloads to Google Cloud faster and with lower cost. Unravel has built a goal-driven and adaptive solution that uniquely provides comprehensive details of the source environment and applications running on it, identifies workloads suitable for the cloud and determines the optimal cloud topology based on business strategy, and then computes the anticipated hourly costs. The assessment also provides actionable recommendations to improve application performance and enables cloud capacity planning and chargeback reporting, as well as other critical insights.

“We’re seeing an increased adoption of GCP services for cloud-native workloads as well as on-premises workloads that are targets for cloud migration. Unravel’s full-stack DataOps platform can simplify and speed up the migration of data-centric workloads to GCP, giving customers peace of mind by minimizing downtime and lowering risk,” said Mike Leone, Senior Analyst, Enterprise Strategy Group. “Unravel adds operational and business value by delivering actionable recommendations for Dataproc customers. Additionally, the platform can troubleshoot and mitigate migration and operational issues to boost savings and performance for Cloud Dataproc workloads.”

        Unravel for Google Cloud Dataproc is available now.

        Create a free account. Sign up and get instant access to the Unravel environment. Take a guided tour of Unravel’s full-stack monitoring, AI-driven recommendations, automated tuning and remediation capabilities.

        Share this: https://ctt.ac/3Q3qc

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        Contacts
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post Unravel Launches Performance Management and Cloud Migration Assessment Solution for Google Cloud Dataproc appeared first on Unravel.

        Unravel and Azure Databricks: Monitor, Troubleshoot and Optimize in the Cloud https://www.unraveldata.com/unravel-and-azure-databricks/ https://www.unraveldata.com/unravel-and-azure-databricks/#respond Wed, 04 Sep 2019 11:00:18 +0000 https://www.unraveldata.com/?p=3658


        Monitor, Troubleshoot and Optimize Apache Spark Applications Using Microsoft Azure Databricks 

We are super excited to announce our support for Azure Databricks! We continue to build out the capabilities of the Unravel Data Operations platform, and specifically our support for the Microsoft Azure data and AI ecosystem. The business and technical imperative to strategically and tactically architect your organization’s journey to the cloud has never been stronger. Businesses are increasingly dependent on data for decision making and, by extension, on the services and platforms – such as Azure HDI and Azure Databricks – that underpin these modern data applications.

The large-scale industry adoption of Spark, and of cloud services from Azure and other platforms, represents the heart of the modern data operations program for the next decade. The combination of Microsoft and Databricks, and the resulting Azure Databricks offering, is a natural response to deliver a deployment platform for AI, machine learning, and streaming data applications.

        Spark has largely eclipsed Hadoop/MapReduce as the development paradigm of choice to develop a new generation of data applications that provide new insights and user experiences. Databricks has added a rich development and operations environment for running Apache Spark applications in the cloud, while Microsoft Azure has rapidly evolved into an enterprise favorite for migrating and running these new data applications in the cloud. 

It is against this backdrop that Unravel announces support for the Azure Databricks platform, bringing our AI-powered data operations solution to Spark applications and data pipelines running on Azure Databricks. While Azure Databricks provides a state-of-the-art platform for developing and running Spark apps and data pipelines, Unravel provides the relentless monitoring, interrogating, modeling, learning, and guided tuning and troubleshooting to create the optimal conditions for Spark to perform and operate at its peak potential.

        Unravel is able to ask and answer questions about Azure Databricks that are essential to provide the levels of intelligence that are required to:

        • Provide a unified view across all of your Azure Databricks instances and workspaces
        • Understand Spark runtime behavior and how it interacts with Azure infrastructure, and adjacent technologies like Apache Kafka
        • Detect and avoid costly human error in configuration, tuning, and root cause analysis 
        • Accurately report cluster usage patterns and be able to adjust resource usage on the fly with Unravel insights
        • Set and guarantee enterprise service levels, based on correlated operational metadata

The Unravel platform is constantly learning, and our training models are continually adapting. The intelligence you glean from Unravel today continues to extend and adapt over time as application and user behaviors themselves change in response to new business demands. These built-in capabilities of the Unravel platform, together with our extensible APIs, enable us to move fast to meet customer demand for an expanding range of data and AI services such as Azure Databricks. More importantly, though, they provide the insights, recommendations, and automation to assure your journey to the cloud is accelerated and your ongoing cloud operations are fully optimized for cost and performance.

        Take the hassle out of managing data pipelines in the cloud

        Try Unravel for free

        Read on to learn more about today’s news from Unravel.

        Unravel Data Introduces AI-powered Data Operations Solution to Monitor, Troubleshoot and Optimize Apache Spark Applications Using Microsoft Azure Databricks

        New Offering Enables Azure Databricks Customers to Quickly Operationalize Spark Data Engineering Workloads with Unprecedented Visibility and Radically Simpler Remediation of Failures and Slowdowns

PALO ALTO, Calif. – Sep. 4, 2019 — Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced Unravel for Azure Databricks, a solution to deliver comprehensive monitoring, troubleshooting, and application performance management for Azure Databricks environments. The new offering leverages AI to enable Azure Databricks customers to significantly improve performance of Spark jobs while providing unprecedented visibility into runtime behavior, resource usage, and cloud costs.

        “Spark, Azure, and Azure Databricks have become foundational technologies in the modern data stack landscape, with more and more Fortune 1000 organizations using them to build their modern data pipelines,” said Kunal Agarwal, CEO, Unravel Data. “Unravel is uniquely positioned to empower Azure Databricks customers to maximize the performance, reliability and return on investment of their Spark workloads.”

Unravel for Azure Databricks helps operationalize Spark apps on the platform: Azure Databricks customers will shorten the cycle of getting Spark applications into production by relying on the visibility, operational intelligence, and data-driven insights and recommendations that only Unravel can provide. Users will enjoy greater productivity by eliminating the time spent on tedious, low-value tasks such as log data collection, root cause analysis, and application tuning.

        “Unravel’s full-stack DataOps platform has already helped Azure customers get the most out of their cloud-based big data deployments. We’re excited to extend that relationship to Azure Databricks,” said Yatharth Gupta, principal group manager, Azure Data at Microsoft. “Unravel adds tremendous value by delivering an AI-powered solution for Azure Databricks customers that are looking to troubleshoot challenging operational issues and optimize cost and performance of their Azure Databricks workloads.”

        Key features of Unravel for Azure Databricks include:

        • Application Performance Management for Azure Databricks – Unravel delivers visibility and understanding of Spark applications, clusters, workflows, and the underlying software stack
        • Automated root cause analysis of Spark apps – Unravel can automatically identify, diagnose, and remediate Spark jobs and the full Spark stack, achieving simpler and faster resolution of issues for Spark applications on Azure Databricks clusters
• Comprehensive reporting, alerting, and dashboards – Azure Databricks users can now enjoy detailed insights, plain-language recommendations, and a host of new dashboards, alerts, and reporting on chargeback accounting, cluster resource usage, Spark runtime behavior, and much more.

        Azure Databricks is a Spark-based analytics platform optimized for Microsoft Azure. Azure Databricks provides one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

An early access release of Unravel for Azure Databricks is available now.

        About Unravel Data

        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement

        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        Contacts

        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

         

        The post Unravel and Azure Databricks: Monitor, Troubleshoot and Optimize in the Cloud appeared first on Unravel.

        End-to-end Monitoring of HBase Databases and Clusters https://www.unraveldata.com/end-to-end-monitoring-of-hbase/ https://www.unraveldata.com/end-to-end-monitoring-of-hbase/#respond Fri, 02 Aug 2019 18:42:19 +0000 https://www.unraveldata.com/?p=3318


        End-to-end Monitoring of HBase databases and clusters

Running real-time data ingestion and multiple concurrent workloads on HBase clusters in production is always challenging. There are multiple factors that affect a cluster’s performance or health, and dealing with them is not easy. Timely, up-to-date, and detailed data is crucial to locating and fixing issues to maintain a cluster’s health and performance.

Most cluster managers provide high-level metrics, which, while helpful, are not enough for understanding cluster and performance issues. Unravel provides detailed data and metrics to help you identify the root causes of cluster and performance issues, specifically hotspotting. This tutorial examines how you can use Unravel’s HBase application performance management (APM) capabilities to debug issues in your HBase cluster and improve its performance.

        Cluster health

Unravel provides per-cluster metrics that give an overview of your HBase clusters in the OPERATIONS > USAGE DETAILS > HBase tab, where all the HBase clusters are listed. Click on a cluster name to bring up the cluster’s detailed information.

        The metrics are color coded so you can quickly ascertain your cluster’s health and what, if any, issues you need to drill into.

• Green = Healthy
• Red = Unhealthy; the metric has triggered an alert and requires investigation

Below we examine the HBase metrics and explain how to use them to monitor your HBase cluster through Unravel.

        Overall cluster activity

HBase Cluster activity

        • Live Region Servers: the number of running region servers.
        • Dead Region Servers: the number of dead region servers.
        • Cluster Requests: the number of read and write requests aggregated across the entire cluster.
• Average Load: the average number of regions per region server across all servers.
        • Rit Count: the number of regions in transition.
        • RitOver Threshold: the number of regions that have been in transition longer than a threshold time (default: 60 seconds).
• RitOldestAge: the age, in milliseconds, of the oldest region in transition.

        Note

Region Server refers to the servers (hosts), while region refers to the specific regions hosted on those servers.
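The KPIs above correspond to metrics the HBase master itself exposes, so they can also be spot-checked outside of any tool. A rough sketch that polls the master’s JMX JSON servlet is shown below; the host, port, and metric key names are assumptions that vary by HBase version and distribution:

```python
import requests

# Hypothetical master host; 16010 is the default master web UI port in recent HBase versions.
JMX_URL = "http://hbase-master.example.com:16010/jmx"

# Metric names are assumptions and may differ across HBase releases.
WANTED = {"numRegionServers", "numDeadRegionServers", "clusterRequests",
          "averageLoad", "ritCount", "ritCountOverThreshold", "ritOldestAge"}

beans = requests.get(JMX_URL, timeout=10).json().get("beans", [])

# Scan every bean rather than hard-coding bean names, since those also differ by version.
metrics = {}
for bean in beans:
    for key, value in bean.items():
        if key in WANTED:
            metrics[key] = value

for name in sorted(metrics):
    print(f"{name}: {metrics[name]}")
```

Unravel collects, color codes, and tracks these values over time, which is what turns the raw numbers into the health signals described below.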

        Dead Region Servers

This metric gives you insight into the health of the region servers. In a healthy cluster this should be zero (0). When the number of dead region servers is one (1) or greater, you know something is wrong in the cluster.

Dead region servers can cause an imbalance in the cluster. When a server has stopped, its regions are redistributed across the running region servers. This consolidation means some region servers handle a larger number of regions and consequently have a correspondingly higher number of read, write, and store operations. This can result in some servers processing a huge number of operations while others sit idle, causing hotspotting.

        Average Load

This metric is the average number of regions on each region server. Like Dead Region Servers, this metric helps you triage imbalances in the cluster and optimize the cluster’s performance.

        Below, for the same cluster, you can see the impact on the Average Load when Dead Region Servers is 0 and 4.

Dead Region Servers=0
In this case, the Average Load is 2k.

HBase dead region server

Dead Region Servers=4
As the number of dead region servers increased, so did the corresponding Average Load, which is now 3.23k, an increase of over 60%.

The next section of the tab contains a region server table which shows the number of regions hosted by each region server. You can see the delta (Average Load – Region Count); a large delta indicates an imbalance in the cluster and reduces the cluster’s performance. You can resolve this issue by either:

• Moving the regions onto other region servers.
• Removing regions from the current region server(s), at which point the master immediately deploys them on another available region server.

Region server

Unravel provides a list of region servers, their metrics, and Unravel’s insight into each server’s health for all region servers across the cluster at a specific point in time.

For each region server, the table lists the REGION SERVER NAME and the server metrics READ REQUEST COUNT, WRITE REQUEST COUNT, STORE FILE SIZE, PERCENT FILES LOCAL, COMPACTION QUEUE LENGTH, REGION COUNT, and HEALTH. These metrics and the health status are helpful in monitoring activity in your HBase cluster.

The HEALTH status is color coded so you can see at a glance when a server is in trouble. Hover over the server’s health status to see a tooltip listing the hotspotting notifications with their associated threshold (average value * 1.2). If any value is above the threshold, the region server is hotspotting and Unravel shows its health as bad.
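The underlying check is easy to reason about: compare each region server’s metric against 1.2 times the average across servers. A minimal sketch of that rule, with made-up request counts:

```python
# Per-region-server read request counts (made-up numbers for illustration)
read_requests = {
    "rs-01": 120_000,
    "rs-02": 115_000,
    "rs-03": 310_000,   # this server is doing far more work than its peers
    "rs-04": 118_000,
}

average = sum(read_requests.values()) / len(read_requests)
threshold = average * 1.2   # the 20% margin described above

for server, count in read_requests.items():
    status = "HOTSPOTTING" if count > threshold else "ok"
    print(f"{server}: {count} (threshold {threshold:.0f}) -> {status}")
```

Unravel applies the same kind of comparison across read requests, write requests, store file size, and the other metrics, per region server and per table.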

Region server metric graphs

Below the table are four (4) graphs: readRequestCount, writeRequestCount, storeFileSize, and percentFilesLocal. These graphs cover the metrics across the entire cluster. The time period the metrics are graphed over is noted above the graphs.

Tables

The last item in the Cluster tab is a list of tables. This list covers all the tables across all the region servers in the entire cluster. Unravel uses these metrics to attempt to detect an imbalance. Averages are calculated within each category and alerts are raised accordingly. Just as with the region servers, you can tell at a glance the health of each table.

The list is searchable and displays the TABLE NAME, TABLE SIZE, REGION COUNT, AVERAGE REGION SIZE, STORE FILE SIZE, READ REQUEST COUNT, WRITE REQUEST COUNT, and finally HEALTH. Hover over the health status to see a tooltip listing the hotspotting notifications. Bad health indicates that a large number of store operations from different sources are being redirected to this table. In this example, the Store File Size is more than 20% above the average, exceeding the threshold.


You can use this list to drill down into the tables and get their details, which can be useful for monitoring your cluster.

        Click on a table to view its details, which include graphs of metrics over time for the region, a list of the regions using the table, and the applications accessing the table. Below is an example of the graphed metrics, regionCount, readRequestCount, and writeRequestCount.

Region

        Once in the TABLE view, click on the REGION tab to see the list of regions accessing the table.

The table list shows the REGION NAME, REGION SERVER NAME, and the region metrics STORE FILE SIZE, READ REQUEST COUNT, WRITE REQUEST COUNT, and HEALTH. These metrics are useful in gauging activity and load in a region. The region health is important in deciding whether the region is functioning properly. If any metric value crosses its threshold, the status is listed as bad. A bad status indicates you should start investigating to locate and fix hotspotting.


        The post End-to-end Monitoring of HBase Databases and Clusters appeared first on Unravel.

        Accelerate and Reduce Costs of Migrating Data Workloads to the Cloud https://www.unraveldata.com/accelerate-and-reduce-costs-of-migrating-data-workloads-to-the-cloud/ https://www.unraveldata.com/accelerate-and-reduce-costs-of-migrating-data-workloads-to-the-cloud/#respond Wed, 31 Jul 2019 19:40:29 +0000 https://www.unraveldata.com/?p=3556


Today, Unravel announced a new cloud migration assessment offer to accelerate the migration of data workloads to Microsoft Azure, Amazon AWS, or Google Cloud Platform. Our latest offer fills a significant gap in the cloud journey, equips enterprises with the tools to deliver on their cloud strategy, and provides the best possible transition with insights and guidance before, during, and after migration. Full details on the assessment and its business value are in our announcement below.

        So, why now?

The rapid increase in data volume and variety has driven organizations to rethink enterprise infrastructures and focus on longer-term data growth, flexibility, and cost savings. Current on-prem solutions are too complicated and inflexible, and are not delivering on expected value. Data is not living up to its promise.

As an alternative, organizations are looking to cloud services like Azure, AWS, and Google Cloud to provide the flexibility to accommodate modern capacity requirements and elasticity. Unfortunately, organizations are often challenged by unexpected costs and a lack of data and insights to ensure a successful migration process. Left unaddressed, these challenges leave organizations struggling with complex projects that don’t fulfill their expectations and frequently result in significant cost overruns.

        The cloud migration assessment offer provides details of the source environment and applications running on it, identifies workloads suitable for the cloud, and computes the anticipated hourly costs. It offers granular metrics, as well as broader insights, that eliminate transition complexity and deliver migration success.

        Customers can be confident that they’re migrating the right data apps, configuring them properly in the cloud, meeting performance service level agreements, and minimizing costs. Unravel can provide an alternative to what is frequently a manual effort fraught with guesswork and errors.

The two approaches can be characterized per the diagram below.

        Still unsure how the migration assessment will provide value to your business? Drop us a line to learn more about the offer – or download a sample cloud migration assessment report here.

        ————-

        Read on to learn more about today’s news from Unravel.

        Unravel Introduces Cloud Migration Assessment Offer to Reduce Costs and Accelerate the Transition of Data Workloads to Azure, AWS or Google Cloud

        New Offer Builds a Granular Dependency Map of On-Premises Data Workloads and Provides Detailed Insights and Recommendations for the Best Transition to Cloud

PALO ALTO, Calif. – July 31, 2019 — Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced a new cloud migration assessment offer to help organizations move data workloads to Azure, AWS or Google Cloud faster and with lower cost. Unravel has built a goal-driven and adaptive solution that uniquely provides comprehensive details of the source environment and applications running on it, identifies workloads suitable for the cloud and determines the optimal cloud topology based on business strategy, and computes the anticipated hourly costs. The offer also provides actionable recommendations to improve application performance and enables cloud capacity planning and chargeback reporting, as well as other critical insights.

“Managing the modern data stack on-premises is complex and requires expert technical talent to troubleshoot most problems. That’s why more enterprises are moving their data workloads to the cloud, but the migration process isn’t easy, as there’s little visibility into costs and configurations,” said Kunal Agarwal, CEO, Unravel Data. “Unravel’s new cloud migration assessment offer delivers actionable insights and visibility so organizations no longer have to fly blind. No matter where an organization is in its cloud adoption and migration journey, now is the time to accelerate strategic thinking and execution, and this offering ensures the fastest, most cost effective and valuable transition for the full journey-to-cloud lifecycle.”

        “Companies have major expectations when they embark on a journey to the cloud. Unfortunately, organizations that migrate manually often don’t fulfill these expectations as the process of transitioning to the cloud becomes more difficult and takes longer than anticipated. And then once there, costs rise higher than forecasted and apps are difficult to optimize,” said Enterprise Strategy Group senior analyst Mike Leone. “This all results from the lack of insight into their existing data apps on-premises and how they should map those apps to the cloud. Unravel’s new offer fills a major gap in the cloud journey, equipping enterprises with the tools to deliver on their cloud goals.”

        The journey to cloud is technically complex and aligning business outcomes with a wide array of cloud offerings can be challenging. Unravel’s cloud migration assessment offer takes the guesswork and error-prone manual processes out of the equation to deliver a variety of critical insights. The assessment enables organizations to:

        • Discover current clusters and detailed usage to make an effective and informed move to the cloud
        • Identify and prioritize specific application workloads that will benefit most from cloud-native capabilities, such as elastic scaling and decoupled storage
• Define the optimal cloud topology that matches specific goals and business strategy, minimizing risks or costs. Users get specific instance type recommendations on the amount of storage needed, with the option to choose between locally attached and object storage
        • Obtain the hourly costs expected to incur when moving to the cloud, allowing users to compare and contrast the costs for different cloud providers and services and for different goals
        • Compare costs for different cloud options (across IaaS and Managed Hadoop/Spark PaaS services). Includes the ability to override default on-demand prices to incorporate volume discounts users may have received
        • Optimize cloud storage tiering choices for hot, warm, and cold data

The Unravel cloud assessment service encompasses four phases. The first phase is a discovery meeting in which the project is scoped, stakeholders are identified, and KPIs are defined. Then, during technical discovery, Unravel works with customers to define use cases, install the product, and begin gathering workload data. Next comes the initial readout, in which enterprises receive a summary of their infrastructure and workloads along with fresh insights and recommendations for cloud migration. Finally comes the completed assessment, including final insights, recommendations, and next steps.

        Unravel is building a rapidly expanding ecosystem of partners to provide a portfolio of data operations and migration services utilizing the Unravel Data Operations Platform and cloud migration assessment offer.

        Enterprises can find a sample cloud migration assessment report here.

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern data stack leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

         

        The post Accelerate and Reduce Costs of Migrating Data Workloads to the Cloud appeared first on Unravel.

Data Pipelines and the Promise of Data https://www.unraveldata.com/data-pipelines-and-the-promise-of-data/ https://www.unraveldata.com/data-pipelines-and-the-promise-of-data/#respond Thu, 27 Jun 2019 15:21:13 +0000 https://www.unraveldata.com/?p=3241


        This is a blog by Keith D. Foote, Writer Researcher for Dataversity. This blog was first published on the DM Radio Dataversity site.

The flow of data can be perilous. Any number of problems can develop during the transport of data from one system to another. Data flows can hit bottlenecks, resulting in latency; data can become corrupted; and datasets may conflict or contain duplicates. The more complex the environment and the more intricate the requirements, the more the potential for these problems increases. Volume also increases the potential for problems. Transporting data between systems often requires several steps, including copying data, moving it to another location, and reformatting and/or joining it to other datasets. With good reason, data teams are focusing on the end-to-end performance and reliability of their data pipelines.

        If massive amounts of data are streaming in from a variety of sources and passing through different data ecosystem software during different stages, it is reasonable to expect periodic problems with the data flow. A well-designed and well-tuned data pipeline ensures that all the steps needed to provide reliable performance are taken care of. The necessary steps should be automated, and most organizations will require at least one or two engineers to maintain the systems, repair failures, and make updates as the needs of the business evolve.

        DataOps and Application Performance Management

        Not long ago, data from different sources would be sent to separate silos, which often provided limited access. During transit, data could not be viewed, interpreted, or analyzed. Data was typically processed in batches on a daily basis, and the concept of real-time processing was unrealistic.

“Luckily, for enterprises today, such a reality has changed,” said Shivnath Babu, co-founder and Chief Technology Officer at Unravel, in a recent interview with DATAVERSITY®.

        “Now data pipelines can process and analyze huge amounts of data in real time. Data pipelines should therefore be designed to minimize the number of manual steps to provide a smooth, automated flow of data from one stage to the next.”

        The first stage in a pipeline defines how, where, and what data is collected. The pipeline should then automate the processes used to extract the data, transform it, validate it, aggregate it and load it for further analysis. A data pipeline provides operational velocity by eliminating errors and correcting bottlenecks. A good data pipeline can handle multiple data streams simultaneously and has become an absolute necessity for many data-driven enterprises.
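
To make those stages concrete, here is a minimal sketch in Python of a pipeline that chains extract, transform, validate, aggregate, and load steps. It is purely illustrative: the function names and stub data are hypothetical, not part of any specific product or framework.

```python
# Minimal sketch of the extract -> transform -> validate -> aggregate -> load
# flow described above. All function names and data are illustrative stubs.
from collections import Counter

def extract():
    # Pull raw records from a source system (stubbed here).
    return [{"user": "a", "bytes": 120}, {"user": "b", "bytes": 300}]

def transform(records):
    # Normalize field names and units.
    return [{"user": r["user"], "kb": r["bytes"] / 1024} for r in records]

def validate(records):
    # Drop records that fail basic checks rather than letting them poison later stages.
    return [r for r in records if r["user"] and r["kb"] >= 0]

def aggregate(records):
    # Roll up per-user totals.
    totals = Counter()
    for r in records:
        totals[r["user"]] += r["kb"]
    return dict(totals)

def load(aggregates):
    # Write results to the analytics store (stdout stands in here).
    print(aggregates)

if __name__ == "__main__":
    load(aggregate(validate(transform(extract()))))
```

In a real pipeline each of these steps would be automated and monitored, which is exactly where the reliability concerns discussed above come in.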

        DataOps teams leverage Application Performance Management (APM) tools to monitor the performance of apps written in specific languages (Java, Python, Ruby, .NET, PHP, node.js). As data moves through the application, three key metrics are collected:

        • Error rate (errors per minute)
        • Load (data flow per minute)
        • Latency (average response time)
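
As a rough illustration of how those three metrics can be derived from raw request events, consider the sketch below; the event fields (ts, duration_ms, is_error) are hypothetical and stand in for whatever an APM agent actually reports.

```python
# Compute error rate, load, and latency for a one-minute window of request
# events. The event fields (ts, duration_ms, is_error) are illustrative.
from datetime import datetime, timedelta

events = [
    {"ts": datetime(2019, 6, 27, 10, 0, 5), "duration_ms": 120, "is_error": False},
    {"ts": datetime(2019, 6, 27, 10, 0, 20), "duration_ms": 480, "is_error": True},
    {"ts": datetime(2019, 6, 27, 10, 0, 41), "duration_ms": 95, "is_error": False},
]

window_start = datetime(2019, 6, 27, 10, 0, 0)
window = [e for e in events
          if window_start <= e["ts"] < window_start + timedelta(minutes=1)]

error_rate = sum(e["is_error"] for e in window)               # errors per minute
load = len(window)                                            # requests (data flow) per minute
latency = sum(e["duration_ms"] for e in window) / len(window) # average response time, ms

print(f"errors/min={error_rate}, load/min={load}, avg latency={latency:.1f} ms")
```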

According to Babu, Unravel provides an AI-powered DataOps/APM solution that is specifically designed for modern data stack technologies such as Spark, Kafka, NoSQL, Impala, and Hadoop. A spectrum of industries—financial services, telecom, healthcare, and technology—uses Unravel to optimize their data pipelines. It is known for improving the reliability of applications and the productivity of data operations teams, and for reducing costs. Babu commented:

        “The modern applications that truly are driving the promise of data, that are delivering or helping a company make sense of the data—they are running on a complex stack, which is critical and crucial to delivering on the promise of data. And that means, now you need software. You need tooling that can ensure these systems that are part of those applications end up being reliable, that you can troubleshoot them, and ensure their performance can be depended upon. You can detect and fix it. Ideally, the problem should never happen in the first place. They should be avoided, via the Machine learning algorithms.”

        How Unravel AI Automates DataOps

        Try Unravel for free

        Data Pipelines

Data pipelines are the digital reflection of the goals of data teams and business leaders, with each pipeline having unique characteristics and serving different purposes depending on those goals. For example, in a marketing-feedback scenario, real-time tools would be more useful than tools designed for moving data to the cloud. Pipelines are not mutually exclusive and can be optimized for both the cloud and real-time, or other combinations. The following list describes the most popular kinds of pipelines available:

• Cloud-native pipelines: Deploying data pipelines in the cloud works well for cloud-based data and applications. Generally, cloud vendors provide many of the tools needed to create these pipelines, saving the client time and money on building out infrastructure.
• Real-time pipelines: Designed to process data as it arrives. Real-time processing requires consuming data from a streaming source.
• Batch pipelines: Batch processing is generally most useful when moving large amounts of data at regularly scheduled intervals. It does not support receiving data in real time.

Ideally, a data pipeline handles any data as though it were streaming data and allows for flexible schemas. Whether the data comes from static sources (e.g., flat-file databases) or from real-time sources (e.g., online retail transactions), a data pipeline can separate each data stream into narrower streams, which are then processed in parallel. Processing in parallel, however, uses significant computing power, as Babu noted:

“Companies get more value from their data by using modern applications. And these modern applications are running on this complex series of systems, be it the cloud or in data centers. In case any problem happens, you want to detect and fix it—and ideally fix the problem before it happens. That solution comes from Unravel. This is how Unravel helps companies deliver on the promise of data.”

The big problem in the industry is having to rely on human experts, who aren’t there 24/7, he commented. If the application is slow at 2 a.m., or the real-time application is not capable of delivering in true real time, the person who has to troubleshoot and fix it is very hard to find. Unravel, he said, has created technology that can collect what it calls full-stack performance information at every level of this complex stack: from the application, from the platform, from infrastructure, and from storage.

        Cloud Migration Containers

There has been a significant rise in the use of data containers. As use of the cloud has gained in popularity, methods for transferring data and their processing instructions have become important, with data containers providing a viable solution. Data containers organize and store “virtual objects”: self-contained entities made up of both data and the instructions needed for controlling that data.

“Containers do, however, come with limitations,” Babu remarked. While containers are very easy to transport, they can only be used with servers that have compatible operating system “kernels,” which limits the kinds of servers that can be used:

        “We’re moving toward microservices. We have created a technology called Sensors, which can basically pop open any container. Whenever a container comes up, in effect, it opens what we call a sensor. It can sense everything that is happening within the container, and then stream that data back to Unravel. Sensor enables us to make these correlations between Spark applications, that are streaming from a Kafka topic. Or writing to a S3 storage, where we are able to draw these connections. And that is in conjunction with the data that we are tapping into, where the APIs go with each system.”

        DataOps

        DataOps is a process-oriented practice that focuses on communicating, integrating, and automating the ingest, preparation, analysis, and lifecycle management of data infrastructure. DataOps manages the technology used to automate data delivery, with attention to the appropriate levels of quality, metadata, and security needed to improve the value of data.

        “We apply AI and machine learning algorithms to solve a very critical problem that affects modern day applications. A lot of the time, companies are creating new business intelligence applications, or streaming applications. Once they create the application, it takes them a long time to make the application reliable and production-ready. So, there are mission-critical technical applications, very important to the overall company’s bottom line.”

But the application itself often ends up running on multiple distributed systems: systems to collect data in a streaming fashion, systems to store very large and valuable types of data. The data architecture that most companies have ends up being a complex collection of interconnected components, and on top of this, “we have mission-critical technical applications running.”

Unravel applies AI and machine learning to logs, metrics, execution plans, and configurations to automatically identify where crucial problems and inefficiencies lie, and to fix many of these problems automatically, so that companies running these applications can be confident that their mission-critical applications remain reliable.

        “So, this is how we help companies deliver on the promise of data,” said Babu in closing. “Data pipelines are certainly making data flows and data volumes easier to deal with, and we remove those blind spots in an organization’s data pipelines. We give them visibility, AI-powered recommendations, and much more reliable performance.”

        The post Data Pipelines and the Promise of Data appeared first on Unravel.

        Meeting SLAs for Data Pipelines on Amazon EMR With Unravel https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/ https://www.unraveldata.com/meeting-slas-for-data-pipelines-on-amazon-emr-with-unravel/#respond Wed, 26 Jun 2019 22:31:02 +0000 https://www.unraveldata.com/?p=3255 Data Pipelines


        Post by George Demarest, Senior Director of Product Marketing, Unravel and Abha Jain, Director Products, Unravel. This blog was first published on the Amazon Startup blog.

A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. Amazon EMR was an easy choice for MTI as the platform to run all their analytics. To start with, getting up and running is simple: there is nothing to install, no configuration required, and you can get to a functional cluster in a few clicks. This enabled MTI to focus most of their efforts on building out analytics that would benefit their business instead of spending time and money acquiring the skills needed to set up and maintain Hadoop deployments themselves. MTI very quickly got to the point where they were running tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing modern data stack clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack, running applications, cluster resources, datasets, users, business units, and projects. Unravel then aggregates and correlates this data into the Unravel data model and applies a variety of analytical techniques to put that data into a useful context. Unravel utilizes EMR bootstrap actions to distribute non-intrusive sensors to each node of a new cluster; these sensors collect the granular application-level data that is used to optimize applications.
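
For readers unfamiliar with the mechanism, EMR bootstrap actions are scripts that run on every node while a cluster is being provisioned, which is how an agent such as a monitoring sensor can be installed cluster-wide. Below is a hedged sketch using boto3; the S3 script path, arguments, and instance sizes are placeholders, not Unravel’s actual installer.

```python
# Sketch: create an EMR cluster with a bootstrap action that installs a
# monitoring sensor on every node. The S3 path and args are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.24.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-monitoring-sensor",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install_sensor.sh",   # placeholder
                "Args": ["--endpoint", "https://metrics.example.com"],  # placeholder
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```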

        See Unravel for EMR in Action

        Try Unravel for free

Unravel’s Amazon AWS/EMR architecture

        MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

        Optimizing Data Operations
        For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

• Reduce time needed for troubleshooting and resolving issues.
• Improve cluster efficiency and performance.
• Improve visibility into cluster workloads.
• Provide usage analysis.

        Reducing Time to Identify and Resolve Issues
        One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection by capturing application and system log data, configuration parameters, and other relevant data.
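
To give a concrete sense of the raw data involved, the YARN ResourceManager exposes per-application runtime metrics over its REST API. A minimal sketch of pulling them is shown below; the ResourceManager host is a placeholder, and the available fields can vary by Hadoop version.

```python
# Sketch: pull per-application runtime metrics from the YARN ResourceManager
# REST API. The ResourceManager host is a placeholder.
import requests

RM = "http://resourcemanager.example.com:8088"  # placeholder host

apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()

for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["user"], app["queue"],
          f"{int(app['elapsedTime']) / 1000:.0f}s elapsed",
          f"{app['memorySeconds']} MB-s", f"{app['vcoreSeconds']} vcore-s")
```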

Once that data is collected, the manual process of evaluating and interpreting that data has just begun. You may spend hours charting log data from your Spark application only to find that some small human error, a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application is bringing your cluster to its knees.

Unravel’s top-level operations dashboard

        Improving Visibility Into Cluster Operations
In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resource usage, I/O, etc. Unravel provides a solution to all of these requirements.

        Cluster Level Reporting
        Cluster level reporting and drill down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel’s cluster infrastructure dashboard

One observation from reports like the above was that memory and CPU usage in the cluster was peaking from time to time, potentially leading to application failures and slowdowns. To resolve this issue, MTI utilized EMR’s automatic scaling feature so that instances were automatically added and removed as needed, ensuring adequate resources at all times. This also ensured that they were not incurring unnecessary costs from underutilized resources.
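
For context, a rule-based EMR automatic scaling policy attaches CloudWatch-driven rules to an instance group. The boto3 sketch below shows the general shape of such a policy; the cluster and instance group IDs, thresholds, and capacities are placeholders, and a real policy would normally pair this scale-out rule with a scale-in rule.

```python
# Sketch: attach a rule-based automatic scaling policy to an EMR core instance
# group so nodes are added when available YARN memory runs low. IDs are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",        # placeholder
    InstanceGroupId="ig-XXXXXXXXXXXX",  # placeholder
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```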

        Application and Workflow Tagging
Unravel provides rich functionality for monitoring applications and users in the cluster, including cluster and application reporting by user, queue, application type, and custom tags such as project and department. These tags are preconfigured so that MTI can instantly filter their view by them. The ability to add custom tags is unique to Unravel and enables customers to tag applications based on custom rules specific to their business requirements (e.g., project, business unit).

Unravel application tagging by department

         

        Usage Analysis and Capacity Planning
MTI wants to be able to maintain service levels over the long term, and thus requires reporting on cluster resource usage and data on future capacity requirements for their program. Unravel provides this type of intelligence through its chargeback/showback reporting.

        Unravel Chargeback Reporting
You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by application type, user, queue, or tags. The window is divided into three sections:

• Donut graphs showing the top results for the Group By selection.
• A chargeback report showing costs, sorted by the Group By choice(s).
• A list of running YARN applications.

Unravel Data’s chargeback reporting
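
As a rough illustration of the showback/chargeback idea (not Unravel’s report itself), per-application resource usage can be rolled up by any of those group-by keys and priced with a simple internal rate. The column names and the rate below are hypothetical.

```python
# Sketch: roll up per-application resource usage into a chargeback view by
# user, queue, or tag. Column names and the $/memory-hour rate are illustrative.
import pandas as pd

apps = pd.DataFrame([
    {"app_type": "spark", "user": "etl_svc",  "queue": "prod",  "tag": "marketing", "memory_hours": 420.0},
    {"app_type": "hive",  "user": "analyst1", "queue": "adhoc", "tag": "finance",   "memory_hours": 95.5},
    {"app_type": "spark", "user": "etl_svc",  "queue": "prod",  "tag": "finance",   "memory_hours": 210.0},
])

RATE_PER_MEMORY_HOUR = 0.02  # illustrative internal rate

chargeback = (apps.groupby("tag")["memory_hours"].sum()
                  .mul(RATE_PER_MEMORY_HOUR)
                  .rename("cost_usd")
                  .sort_values(ascending=False))
print(chargeback)
```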

        Improving Cluster Efficiency and Performance
MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel’s proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

        Get answers to EMR issues, not more charts and graphs

        Try Unravel for free

        Unravel Auto-actions and Alerting
        Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

Unravel Auto-actions setup

The simplicity of the Auto-actions screen belies the deep automation behind autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also issue an HTTP POST, which gives users a number of powerful options.

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

Pre-configured Unravel Auto-action templates
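
To make the policy idea concrete, here is a minimal, generic sketch of the kind of rule such automation encodes: if a job exceeds a duration threshold, alert a Slack channel and kill it. This is not Unravel’s auto-action API; the metric feed, webhook URL, and application ID are placeholders.

```python
# Generic sketch of a policy-based auto-action: if a job exceeds a duration
# threshold, alert a Slack channel and kill it. Not Unravel's API; the metrics
# source, webhook URL, and YARN application id are placeholders.
import subprocess
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
MAX_DURATION_MIN = 60

def check_and_act(job):
    """job: dict with 'app_id', 'user', and 'duration_min' (hypothetical metric feed)."""
    if job["duration_min"] <= MAX_DURATION_MIN:
        return
    requests.post(SLACK_WEBHOOK, json={
        "text": f"Runaway job {job['app_id']} ({job['user']}) has been running "
                f"{job['duration_min']} min; killing it."
    })
    # Kill the offending YARN application (moving it to another queue is an
    # alternative action).
    subprocess.run(["yarn", "application", "-kill", job["app_id"]], check=False)

check_and_act({"app_id": "application_1559000000000_0042", "user": "etl_svc", "duration_min": 95})
```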

        Applications
Turning to MTI’s application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

        Increased Visibility Into Application Runtime and Trends
        MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

• Visualize app performance trends, viewing metrics such as application start time, duration, state, I/O, and memory usage
• View a breakdown of application components (pipeline stages) and their associated performance metrics
• Understand the execution of MapReduce jobs and Spark applications, including the degree of parallelism and resource usage, and obtain insights and recommendations for optimal performance and efficiency

        Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

        Unravel Monitoring, Tuning, and Troubleshooting

Intuitive drill-down from the Spark application list to an individual data pipeline stage

Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top-level list of applications down to jobs, and further down to individual stages of Spark, Hive, MapReduce, or Impala applications.

Unravel Gantt chart view of a Hive query

        Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

        Unravel Self-service Optimization of Spark Applications
MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and managing their data application portfolio. Their aim is for development teams to reduce their dependency on IT while improving collaboration with their peers. Their targets in this area include:

        • Reducing troubleshooting and resolution time by providing self-service tuning
        • Improving application efficiency and performance with minimal IT intervention
• Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

        MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

Unravel self-service capabilities

        Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

Intelligent recommendations and insights, as well as auto-tuning

        Optimizing Application Resource Efficiency
In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operating the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel helps you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

• Underutilization of container resources (CPU or memory)
• Low utilization of memory resources
• Too few partitions with respect to available parallelism
• Mappers/reducers requesting too much memory
• Too many map tasks and/or too many reduce tasks
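
For example, the “too few partitions” insight typically means a stage is not using the parallelism the cluster can offer. A hedged PySpark sketch of checking and correcting this is shown below; the input path is a placeholder, and the target of roughly three tasks per core is a common rule of thumb, not an Unravel recommendation.

```python
# Sketch: detect a DataFrame with too few partitions relative to available
# cores and repartition it. The 3x-cores target is a common rule of thumb.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

df = spark.read.parquet("s3://my-bucket/events/")  # placeholder path

cores = spark.sparkContext.defaultParallelism
current = df.rdd.getNumPartitions()
target = cores * 3

if current < cores:
    print(f"Only {current} partitions for {cores} cores; repartitioning to {target}")
    df = df.repartition(target)
```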

        Solution Highlights
Work on all of these operational goals is ongoing between MTI and Unravel, but to date, they have made significant progress on both operational and application goals. After running Unravel for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also gained insight into the number and causes of inefficiently running applications. Unravel detected 38,190 events after analyzing 30,378 MapReduce jobs, and 44,176 events across 21,799 Spark jobs. It also detected resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive into their applications, MTI found multiple inefficient jobs for which Unravel provided recommendations to repartition the data. They were also able to identify many jobs that wasted CPU and memory resources.

        The post Meeting SLAs for Data Pipelines on Amazon EMR With Unravel appeared first on Unravel.

        Making Data Work With the Unravel Partner Program https://www.unraveldata.com/introducing-unravel-partner-program/ https://www.unraveldata.com/introducing-unravel-partner-program/#respond Tue, 25 Jun 2019 12:00:45 +0000 https://www.unraveldata.com/?p=3249 Unravel Data Announces Partner Program and New Executive Hire to Fuel Demand for Cloud Migration and Data Operations Initiatives

        Unravel Data Announces Partner Program and New Executive Hire to Fuel Demand for Cloud Migration and Data Operations Initiatives

        Modern data apps are increasingly going to the cloud due to their elastic compute demands, skills shortages and the complexity of managing big data on premises. And while more and more organizations are taking their data apps to the cloud to leverage its flexibility, they’re also finding that it is very challenging to assess application needs and how to migrate and optimize their data to ensure performance and cost targets are not compromised. This complexity underscores how critical a platform like Unravel is for businesses in any industry to get the most out of their cloud-based data apps.

The Unravel Partner Program we introduced today helps make the journey and experience in the cloud even easier. The program enables customers of world-leading technology innovators like Microsoft Azure, AWS, Informatica, Attunix, Dell, Clairvoyant, and RCG Global Services to solve their data operations challenges and accelerate their cloud adoption cycles.

        I have the pleasure of leading this new program and working with the Unravel community. It’s thrilling to be at the starting line with our partners and customers to help them future-proof their data workloads in the cloud. Full details on platform, solution partnerships and program benefits are below in our company announcement and here, and I encourage you to drop us a line to learn more about the Unravel Partner Program.

        ————-

        Read on to learn more about today’s news from Unravel.

        Unravel Data Announces Partner Program and New Executive Hire to Meet Demand for Cloud Migration and Data Operations Initiatives

        Unravel Partner Ecosystem Encompasses an Expansive Network of Innovators to Help Businesses Optimize Their Data Systems and Future-Proof Cloud Migration Strategies

PALO ALTO, Calif. – June 25, 2019 – Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced the Unravel Partner Program, which brings together world-leading technology innovators that support Unravel’s mission to radically simplify the way businesses optimize the performance of their modern data applications and the complex pipelines that power those applications. The new program debuts with the appointment of Mark Wolfram as vice president of business development and partnerships, who will direct the program’s strategy and expansion.

Unravel’s Partner Program ecosystem includes a robust list of brands, including Informatica, Attunix, Dell, and RCG Global Services, that help their customers solve data operations challenges and optimize their cloud adoption plans:

          • Technology and Platform Partners:
            Unravel’s partnerships with leading cloud platforms, data management, data operations and analytics partners guarantee the performance, reliability and cost of data systems. Customers benefit from operationalized data application workloads and the ability to address performance and reliability challenges while in production. Opportunities for optimization are pinpointed to ensure that service level agreements are met and resources are optimized. And as data workloads move to the cloud and hybrid deployment models, Unravel can accelerate decision making while eliminating risk.
          • Solution Partners:
            Unravel’s partnerships with integrators and solution providers deliver validated data operations solutions to their customers to guarantee the performance, reliability and cost of their data systems either on premises or in the cloud. Customers benefit from access to highly differentiated Unravel technology to solve a variety of data operations and cloud migration challenges. Unravel’s training and support accelerates time to value, as does its direct and early access to product, engineering and go-to market teams for deal acceleration and collaborative support in solution design.

        “We share with Unravel the common goal of building a culture of innovation and helping customers benefit from all of their data, not just the data for which they have ready-made solutions or tools,” stated Rick Skriletz, Global Managing Principal at RCG Global Services. “Although it’s not a universal truth, we typically find the more data that can be brought to bear on a problem, the better the results. Our complementary solution sets – and tight integration between them – are geared to bring automation to big data workflows, which is the one need we most often hear from customers.”

        “It is a daunting task to migrate and optimize your data to the cloud as part of a digital transformation journey and yet we see a growing demand from our customers to do so from energy to financial services, manufacturing, government and retail,” said Matt O’Donnell, President of Cloud Services at Redapt Attunix. “The impetus of demand and complexity has set us on the path to partner with Unravel to offer our customers the sophisticated tools, support, vision and platform expertise they need to migrate with confidence, and from there monitor, manage and improve their data pipelines to achieve more reliable performance in the applications that power their business. Unravel’s compatibility and collaboration with leading platforms is a significant advantage to our customers who seek this synergy to deliver value to their customers.”

        Unravel’s ecosystem partners benefit from enablement, go-to-market and post-sales support. Exclusive technical and best-practices and sales and product training are made available to the partners. Partners can take advantage of opportunities that can help build their business through joint marketing activities, co-marketing collateral and registered partner lead and deal registration. Through the program, partners have access to tools and APIs for customer account management, workflow integration, global dedicated technical support, and recognition as an official partner in the Unravel community.

        Mark Wolfram, an accomplished technology executive with more than 20 years of experience in sales, business development, and management, leads the Unravel Partner Program. Wolfram brings a strong background in cloud and data analytics honed at Microsoft and Azuqua (acquired by Okta).

        “Unravel Data is a unique company that has given more than just lip service about how to address the immense challenges inherent when enterprises look to move data workloads to the cloud. It is actually taking the significant steps necessary to truly partner with businesses during this complex migration process and to ensure continuous availability of modern data applications,” said Mark Wolfram, vice president of business development and partnerships at Unravel Data. “Unravel has developed the industry’s only AI-powered platform for planning, migrating and managing modern data apps in the cloud. I passionately support Unravel’s approach and look forward to sharing my expertise as together we grow the Unravel Partner Program to serve more enterprises around the world.”

        To learn more about Unravel’s Partner Program, please visit: https://www.unraveldata.com/company/partners/.

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyrights
        The name Unravel Data is a trademark of Unravel Data™. Informatica and Informatica World are trademarks or registered trademarks of Informatica in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

        ###

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post Making Data Work With the Unravel Partner Program appeared first on Unravel.

        Unraveling the Complex Streaming Data Pipelines of Cybersecurity https://www.unraveldata.com/unraveling-the-complex-streaming-data-pipelines-of-cybersecurity/ https://www.unraveldata.com/unraveling-the-complex-streaming-data-pipelines-of-cybersecurity/#respond Wed, 12 Jun 2019 04:55:49 +0000 https://www.unraveldata.com/?p=3050


        Earlier this year, Unravel released the results of a survey that looked at how organizations are using modern data apps and general trends in big data. There were many interesting findings, but I was most struck by what the survey revealed about security. First, respondents indicated that they get the most value from big data when leveraging it for use in security applications. Fraud detection was listed as the single most effective use case for big data, while cybersecurity intelligence was third. This was hardly surprising, as security is at the top of everyone’s minds today and modern security apps like fraud detection rely heavily on AI and machine learning to work properly. However, despite that value, respondents also indicated that security analytics was the modern data application they struggled most to get right. This also didn’t surprise me, as it reflects the complexity of managing the real-time streaming data common in security apps.

Cybersecurity is difficult from a big data point of view, and many organizations are struggling with it. The hardest part is managing all of the streaming data that comes pouring in from the internet, IoT devices, sensors, edge platforms, and other endpoints. Streaming data arrives constantly, piles up quickly, and is complex. To manage this data properly and deliver working security apps, you need a solution that provides trustworthy workload management for cybersecurity analytics and gives you the ability to track, diagnose, and troubleshoot end-to-end across all of your data pipelines.

        Real-time processing is not a new concept, but the ability to run real-time apps reliably and at scale is. The development of open-source technologies such as Kafka, Spark Streaming, Flink and HBase have enabled developers to create real-time apps that scale, further accelerating their proliferation and business value. Cybersecurity is critical for the well-being of enterprises and large organizations, but many don’t have the right data operations platform to do it correctly.

        Apache Metron is an example of a complex security data pipeline

To analyze streaming traffic data, generate statistical features, and train machine learning models that help detect cybersecurity threats on large-scale networks (such as malicious hosts in botnets), big data systems require complex and resource-consuming monitoring methods. Security analysts may apply multiple detection methods simultaneously to the same massive incoming data for pre-processing, selective sampling, and feature generation, adding to the existing complexity and performance challenges. Keep in mind that these applications often span multiple systems (e.g., Spark for computation, YARN for resource allocation and scheduling, HDFS or S3 for data access, Kafka or Flink for streaming) and may contain independent, user-defined programs, making it inefficient to repeat the data pre-processing and feature generation that is common to multiple applications, especially with large-scale traffic data.
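
As a simplified illustration of the feature-generation step described above (not a complete detection pipeline), a streaming job might compute per-host flow counts and distinct destination ports over short windows. The Kafka brokers, topic, and flow schema below are hypothetical.

```python
# Sketch: generate simple per-host features (flow count, distinct destination
# ports) over 1-minute windows from streaming network-flow records. The Kafka
# topic, brokers, and flow schema are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder.appName("flow-features").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("src_host", StringType()),
    StructField("dst_port", IntegerType()),
])

flows = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
         .option("subscribe", "network-flows")                # placeholder topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("f"))
         .select("f.*"))

features = (flows
            .withWatermark("event_time", "2 minutes")
            .groupBy(F.window("event_time", "1 minute"), "src_host")
            .agg(F.count("*").alias("flow_count"),
                 F.approx_count_distinct("dst_port").alias("distinct_ports")))

query = features.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```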

These inefficiencies create bottlenecks in application execution, hog the underlying systems, cause suboptimal resource utilization, increase failures (e.g., due to out-of-memory errors), and, more importantly, may decrease the chances of detecting a threat or a malicious attempt in time.

        Unravel’s full stack platform addresses these challenges and provides a compelling solution for operationalizing security apps. Leveraging artificial intelligence, Unravel has introduced a variety of capabilities for enabling better workload management for cybersecurity analytics, all delivered from a single pane of glass. These include:

        • Automatically identifying applications that share common characteristics and requirements and grouping them based on relevant data colocation (e.g., a combination of port usage entropy, IP region or geolocation, time or flow duration)
        • Recommendations on how to segregate applications with different requirements (e.g., disk i/o heavy preprocessing tasks vs. computational heavy feature selection) submitted by different users (e.g., SOC level 1 vs. level 3 analysts)
        • Recommendations on how to allocate applications with increased sharing opportunities and computational similarities to appropriate execution pools/queues
        • Automatic fixes for failed applications drawing on rich historic data of successful and failed runs of the application
        • Recommendations for alternative configurations to get failed applications quickly to a running state, followed by getting the application to a resource-efficient running state

Security apps running on a highly distributed modern data stack are too complex to monitor and manage manually. And when these apps fail, organizations don’t just suffer inconvenience or some lost revenue; their entire business is put at risk. Unravel ensures these apps are reliable and operate at optimal performance. Ours is the only platform that makes such effective use of AI to scrutinize application execution, identify the causes of potential failures, and generate recommendations for improving performance and resource usage, all automatically.

        The post Unraveling the Complex Streaming Data Pipelines of Cybersecurity appeared first on Unravel.

        Hortonworks, MapR, now Cloudera. What’s your next move? https://www.unraveldata.com/hortonworks-mapr-now-cloudera-whats-your-next-move/ https://www.unraveldata.com/hortonworks-mapr-now-cloudera-whats-your-next-move/#respond Fri, 07 Jun 2019 18:52:40 +0000 https://www.unraveldata.com/?p=3068 Hortonworks, MapR, now Cloudera. What's your next move?


        As discussed in our recent Series C announcement, we are seeing unprecedented and accelerating demand for solutions to complex, business-critical challenges in dealing with data. Modern data systems are becoming impossibly complex. The burgeoning amount of data being processed in organizations today is staggering. Just a year ago, Forbes reported that 90% of the world’s data was created in the previous two years.

        “New Stack” of Software Emerging

        Organizations of all sizes and in all industries are transforming to deal with this change and a ‘New Stack’ of software is emerging to enable the building, operation and monitoring of these modern applications and the systems that support them. According to Morgan Stanley, ‘New Stack’ revenue is set to hit $50 Billion by 2022. In addition, according to their January 2019 CIO Survey, 50% of application workloads are expected to reside in public cloud environments by 2022, up from ~22% today.

With that backdrop, are the recent headlines about Cloudera and MapR surprising or anticipated? Is interest in data waning? Not even close. So why did Hortonworks get swallowed, Cloudera stumble, and MapR start to disappear? The consensus is clear – execution gaps and an expected, but much faster than anticipated, adoption of public cloud services. The other side of the equation is evidenced by Microsoft’s impressive latest earnings announcement, driven in large part by its $40 billion Azure business, which grew 73% last quarter – it is hard to post that kind of growth on a number that big. <disclaimer: Microsoft is an investor in Unravel>

        Rise of Big Data in the Cloud

I’ve written before about the rise of big data in the cloud, but Unravel has been doing a lot more than just talking about this shift. We’ve taken significant steps to support the transition of data workloads to the cloud. Unravel has forged partnerships with Azure and AWS, and our solution for migration and management of data workloads is available on both platforms. We have a particularly deep relationship with Azure and M12 (Microsoft’s investment arm), which participated in both Unravel’s Series B and Series C funding rounds. Unravel caught Azure’s eye precisely because of the need to solve these large-scale operational data issues wherever the workloads are being executed.

        With Unravel you get complete visibility into every aspect of your data pipelines:

        • Is application code executing optimally (or failing)?
        • Are resources being used, abused or constrained?
        • How do I lower my cloud instance costs?
        • Which workloads should I migrate to cloud first and what’s the performance/cost tradeoff?
        • What are specific application and workload costs for Chargeback?
        • Where are all my workloads being executed?
        • How are all my services being utilized?
        • How are users behaving and who are the bad actors?

        These issues apply as much to systems located in the cloud as they do to systems on-premises. This is true for the breadth of public cloud deployment types:

        • IaaS (Infrastructure as a Service): Cloudera, Hortonworks or MapR data platforms deployed on cloud VMs where your modern data applications are running
        • PaaS (Platform as a Service): Managed Hadoop/Spark Platforms like AWS Elastic MapReduce (EMR), Azure HDInsight, Google Cloud Dataproc, etc.
        • Cloud-Native: Products like Amazon Redshift, Azure Databricks, AWS Databricks, etc.
        • Serverless: Ready-to-use services that require no setup like Amazon Athena, Google BigQuery, Google Cloud DataFlow, etc.

        For those interested in learning more about specific services offered by the cloud platform providers we recently posted a blog on “Understanding Cloud Data Services.”

We introduced a portfolio of capabilities that help customers plan, migrate, and manage modern data applications running on AWS, Microsoft Azure, and the Google Cloud Platform, and we have talked frequently about what it takes to “Migrate and scale data pipelines on the AWS Platform” and about “Getting the most from data applications in the cloud.”

        Test-drive Unravel for cloud environments

        Try Unravel for free

        Building Expertise in Data Operations

Whether you are running an on-premises system such as Cloudera, a fully managed PaaS environment such as Azure HDInsight or AWS EMR, or some hybrid combination, you need to build expertise in data operations. DataOps is the discipline that ensures you are future-proofing your architectural, operational, and commercial decisions as you transform your business and migrate data workloads to the cloud, go straight to the cloud for new workloads, maintain applications on-premises, or run any hybrid combination.

        Planning for Cloud-based Data Services

        So, wherever you are on your cloud adoption and workload migration journey, now is the time to start or accelerate your strategic thinking and execution planning for cloud-based data services. Very recent history shows us that we need to be proactive, not reactive, and to expect this pace of change to continue to accelerate.

        However, as migration goes from planning to reality, ensure that you invest in the critical skills, technology and process changes to establish a data operations center of excellence and future proof your critical data applications and systems.

        The post Hortonworks, MapR, now Cloudera. What’s your next move? appeared first on Unravel.

        How to intelligently monitor Kafka/Spark Streaming data pipelines https://www.unraveldata.com/how-to-intelligently-monitor-kafka-spark-streaming-data-pipelines/ https://www.unraveldata.com/how-to-intelligently-monitor-kafka-spark-streaming-data-pipelines/#respond Thu, 06 Jun 2019 18:41:06 +0000 https://www.unraveldata.com/?p=3020


        Identifying issues in distributed applications like a Spark streaming application is not trivial. This blog describes how Unravel helps you connect the dots across streaming applications to identify bottlenecks.

Spark streaming is widely used in real-time data processing, especially with Apache Kafka. A typical scenario involves a Kafka producer application writing to a Kafka topic. The Spark application then subscribes to the topic and consumes records. The records might be further processed downstream using operations like map and foreachRDD, or saved into a datastore.
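
A minimal sketch of that producer-to-Spark pattern is shown below, assuming Spark 2.x with the spark-streaming-kafka-0-8 integration on the classpath (the DStream API that foreachRDD belongs to). The brokers, topic name, and sink logic are placeholders.

```python
# Sketch: a Spark (2.x) streaming job that subscribes to a Kafka topic,
# transforms records with map, and handles each micro-batch with foreachRDD.
# Brokers, topic name, and the save step are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # requires spark-streaming-kafka-0-8

sc = SparkContext(appName="tweet-sentiment-consumer")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["tweetSentiment-1000"],                       # placeholder topic
    kafkaParams={"metadata.broker.list": "broker1:9092"}  # placeholder brokers
)

parsed = stream.map(lambda kv: kv[1].upper())  # stand-in transformation

def save_batch(rdd):
    # Stand-in for writing the micro-batch to a datastore.
    print("records in batch:", rdd.count())

parsed.foreachRDD(save_batch)

ssc.start()
ssc.awaitTermination()
```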

Below are two scenarios illustrating how you can use Unravel’s APMs to inspect, understand, correlate, and finally debug issues around a Spark streaming app consuming a Kafka topic. They demonstrate how Unravel’s APMs can help you perform an end-to-end analysis of the data pipeline and its applications. In turn, this wealth of information helps you effectively and efficiently debug and resolve issues that would otherwise require more time and effort.

        Use Case 1: Producer Failure

Consider a running Spark application that has not processed any data for a significant amount of time. The application’s previous iterations successfully processed data, and you know with certainty that the current iteration should be processing data.

        Detect Spark is processing no data

        When you notice the Spark application has been running without processing any data, you want to check the application’s I/O.

        Checking I/O Consumption

Click on the application to bring up its details and view its I/O KPIs.

        View Spark input details

When you see the Spark application is consuming from Kafka, you need to examine the Kafka side of the process to determine the issue. Click on a batch to find the topic it is consuming from.

        View Kafka topic details

Bring up the Kafka topic details, which graph Bytes In Per Second, Bytes Out Per Second, Messages In Per Second, and Total Fetch Requests Per Second. In this case, the Bytes In Per Second and Messages In Per Second graphs show a steep decline, indicating that no data is being produced to the Kafka topic.

        The Spark application is not consuming data because there is no data in the pipeline to consume.
You should notify the owner of the application that writes to this particular topic. Upon notification, they can drill down to determine and then resolve the underlying issues causing writes to this topic to fail.

The graphs show a steep decline in the BytesInPerSecond and MessagesInPerSecond metrics. Since nothing is flowing into the topic, there is no data for Spark to consume and process down the pipeline. The administrator can then alert the owner of the Kafka producer application to drill down into the underlying issues that might have caused the producer to fail.

        Use Case 2: Slow Spark app, offline Kafka partitions

In this scenario, an application run has processed significantly less data than the expected norm. In the scenario above, the analysis was fairly straightforward; in cases like this one, the underlying root cause can be far more subtle to identify. The Unravel APM can help you quickly and efficiently root-cause such issues.

        Detect inconsistent performance

        When you see a considerable difference in the data processed by consecutive runs of a Spark App, bring up the APMs for each application.

        Detect drop in input records

Examine the trend lines to identify a time window in which there is a drop in input records for the slow app. The image of the slow app shows a drop-off in consumption.

        Drill down to I/O drop off

        Drill down into the slow application’s stream graph to narrow the time period (window) in which the I/O dropped off.

        Drill down to suspect time interval

Unravel displays the batches that were running during the problematic time period. Inspect the time interval to see the batches’ activity. In this example, no input records were processed during the suspect interval. You can dig further into the issue by selecting a batch and examining its input tab. In the case below, you can infer that no new offsets have been read, based on the input source’s description, “Offsets 1343963 to Offset 1343963”.

        View KPIs on the Kafka monitor screen

        You can further debug the problem on the Kafka side by viewing the Kafka cluster monitor. Navigate to OPERATIONS > USAGE DETAILS > KAFKA. At a glance, the cluster’s KPIs convey important and pertinent information.

        Here, the first two metrics show that the cluster is facing issues which need to be resolved as quickly as possible.

# of under-replicated partitions is a broker-level metric: the count of partitions for which the broker is the leader replica and the follower replicas have not yet caught up. In this case there are 131 under-replicated partitions.

# of offline partitions is a broker-level metric provided by the cluster’s controller broker. It is a count of the partitions that currently have no leader; such partitions are not available for producing or consuming. In this case there are two (2) offline partitions.
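
Outside of Unravel, the same two signals can also be derived directly from cluster metadata. Below is a hedged sketch using the confluent-kafka Python client; the bootstrap server is a placeholder.

```python
# Sketch: count under-replicated and offline partitions from cluster metadata
# using the confluent-kafka client. The bootstrap server is a placeholder.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder
metadata = admin.list_topics(timeout=10)

under_replicated, offline = 0, 0
for topic in metadata.topics.values():
    for p in topic.partitions.values():
        if p.leader == -1:                    # no leader: partition is offline
            offline += 1
        elif len(p.isrs) < len(p.replicas):   # ISR smaller than the replica set
            under_replicated += 1

print(f"under-replicated partitions: {under_replicated}, offline partitions: {offline}")
```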

You can select the BROKER tab to see the broker table. Here the table hints that the broker Ignite1.kafka2 is facing an issue; therefore, the broker’s status should be checked.

• Broker Ignite1.kafka2 has two (2) offline partitions. The broker table shows that it is/was the controller for the cluster; we could have inferred this because only the controller reports offline partitions. Examining the table further, we see that Ignite.kafka1 also shows an active controller count of one (1).
• The Kafka KPI # of Controllers lists the active controller count as one (1), which should indicate a healthy cluster. The fact that two (2) brokers each report an active controller count of one (1) indicates the cluster is in an inconsistent state.

        Identify cluster controller

In the broker table, we see that the broker Ignite1.kafka2 has two offline Kafka partitions. We also see that its active controller count is 1, which indicates that this broker is/was the controller for the cluster (this can also be deduced from the fact that offline partitions is a metric exclusive to the controller broker). Notice that the controller count for Ignite.kafka1 is also 1, which indicates that the cluster itself is in an inconsistent state.

This hints that the Ignite1.kafka2 broker may be facing issues, and the first thing to check would be the status of that broker.

        View consumer group status

You can further corroborate the hypothesis by checking the status of the consumer group for the Kafka topic the Spark app is consuming from. In this example, the consumer group’s status indicates that it is currently stalled. The topic the stalled group is consuming is tweetSentiment-1000.

        Topic drill down

To drill down into the consumer group’s topic, click on the TOPIC tab in the Kafka cluster manager and search for the topic. In this case, the topic’s trend lines for the time range in which the Spark application’s consumption dropped off show a sharp decrease in Bytes In Per Second and Bytes Out Per Second. This decrease explains why the Spark app is not processing any records.

        Consumer group details

To view a consumer group’s lag and offset consumption trends, click on the consumer groups listed for the topic. Below, the log-end-offset trend line for the topic’s consumer group, stream-consumer-for-tweetSentimet1000, is a constant (flat) line, showing that no new offsets have been consumed over time. This further supports our hypothesis that something is wrong with the Kafka cluster, and especially with broker Ignite1.kafka2.
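
A flat log-end-offset line and a lag that stops changing can also be confirmed programmatically. Below is a hedged sketch using the kafka-python client; the broker, group id, and topic are placeholders.

```python
# Sketch: compare a consumer group's committed offsets with the topic's
# end offsets to measure lag. Group id, topic, and broker are placeholders.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker1:9092",                    # placeholder
    group_id="stream-consumer-for-tweetSentiment-1000",  # placeholder
    enable_auto_commit=False,
)

partitions = [TopicPartition("tweetSentiment-1000", p)
              for p in consumer.partitions_for_topic("tweetSentiment-1000")]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    print(f"partition {tp.partition}: log-end={end_offsets[tp]} committed={committed} lag={lag}")
```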

        Conclusion

These are but two examples of how Unravel helps you identify, analyze, and debug Spark Streaming applications consuming from Kafka topics. Unravel’s APMs collate, consolidate, and correlate information from various stages in the data pipeline (Spark and Kafka), allowing you to troubleshoot applications without ever having to leave Unravel.

         

        The post How to intelligently monitor Kafka/Spark Streaming data pipelines appeared first on Unravel.

        Case Study: Meeting SLAs for Data Pipelines on Amazon EMR https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/ https://www.unraveldata.com/resources/case-study-meeting-slas-for-data-pipelines-on-amazon-emr/#respond Thu, 30 May 2019 20:44:44 +0000 https://www.unraveldata.com/?p=2988 Welcoming Point72 Ventures and Harmony Partners to the Unravel Family


A household name in global media analytics – let’s call them MTI – is using Unravel to support their data operations (DataOps) on Amazon EMR to establish and protect their internal service level agreements (SLAs) and get the most out of their Spark applications and pipelines. MTI runs tens of thousands of jobs per week, about 70% of which are Spark, with the remaining 30% of workloads running on Hadoop, or more specifically Hive/MapReduce.

Among the most common complaints and concerns about optimizing big data clusters and applications is the amount of time it takes to root-cause issues like application failures or slowdowns, or to figure out what needs to be done to improve performance. Without context, performance and utilization metrics from the underlying data platform and the Spark processing engine can be laborious to collect and correlate, and difficult to interpret.

Unravel employs a frictionless method of collecting relevant data about the full data stack: running applications, cluster resources, datasets, users, business units, and projects. Unravel aggregates and correlates this data into the Unravel data model and then applies a variety of analytical techniques to put that data into a useful context.

        Unravel architecture for Amazon AWS/EMR

        MTI has prioritized their goals for big data based on two main dimensions that are reflected in the Unravel product architecture: Operations and Applications.

        Optimizing data operations

        For MTI’s cluster level SLAs and operational goals for their big data program, they identified the following requirements:

        • Reduce time needed for troubleshooting and resolving issues.
        • Improve cluster efficiency and performance.
        • Improve visibility into cluster workloads.
• Provide usage analysis.

        Reducing time to identify and resolve issues

        One of the most basic requirements for creating meaningful SLAs is to set goals for identifying problems or failures – known as Mean Time to Identification (MTTI) – and the resolution of those problems – known as Mean Time to Resolve (MTTR). MTI executives set a goal of 40% reduction in MTTR.

One of the most basic ways that Unravel helps reduce MTTI/MTTR is by eliminating the time-consuming steps of data collection and correlation. Unravel collects granular cluster- and application-specific runtime information, as well as metrics on infrastructure and resources, using native Hadoop APIs and lightweight sensors that only send data while an application is executing. This alone can save data teams hours – if not days – of data collection, automatically capturing application and system log data, configuration parameters, and other relevant data.

Once that data is collected, the manual process of evaluating and interpreting it has only just begun. You may spend hours charting log data from your Spark application only to find that some small human error – a missed configuration parameter, an incorrectly sized container, or a rogue stage of your Spark application – is bringing your cluster to its knees.

        Unravel top level operations dashboard

        Improving visibility into cluster operations

        In order for MTI to establish and maintain their SLAs, they needed to troubleshoot cluster-level issues as well as issues at the application and user levels. For example, MTI wanted to monitor and analyze the top applications by duration, resources usage, I/O, etc. Unravel provides a solution to all of these requirements.

        Cluster level reporting

Cluster level reporting and drill-down to individual nodes, jobs, queues, and more is a basic feature of Unravel.

Unravel cluster infrastructure dashboard

        Application and workflow tagging

Unravel provides rich functionality for monitoring applications and users in the cluster, with cluster and application reporting by user, queue, application type, and custom tags such as project or department. These tags are preconfigured so that MTI can instantly filter their view by them. The ability to add custom tags is unique to Unravel and enables customers to tag applications based on custom rules specific to their business requirements (e.g., project, business unit, etc.).

        Unravel application tagging by department

        Usage analysis and capacity planning

MTI wants to maintain service levels over the long term, and thus requires reporting on cluster resource usage as well as data on future capacity requirements for their program. Unravel provides this type of intelligence through chargeback/showback reporting.

        Unravel chargeback reporting

You can generate chargeback reports in Unravel for multi-tenant cluster usage costs, grouped by application type, user, queue, and tags (a simplified roll-up of this logic is sketched after the list below). The window is divided into three sections:

        • Donut graphs showing the top results for the Group by selection.
        • Chargeback report showing costs, sorted by the Group By choice(s).
• List of YARN applications running.
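The grouping behind such a report is conceptually simple. The pandas sketch below illustrates the idea against a hypothetical export of per-application metrics; the file name, column names, and cost rates are assumptions for illustration, not Unravel's actual schema or pricing.

    # Illustrative chargeback roll-up over a hypothetical export of per-application metrics.
    # Column names and the $/vcore-hour and $/GB-hour rates are assumptions for this sketch.
    import pandas as pd

    apps = pd.read_csv("yarn_apps.csv")  # hypothetical export: user, queue, tag, vcore_hours, memory_gb_hours

    VCORE_RATE = 0.05   # assumed $ per vcore-hour
    MEM_RATE = 0.01     # assumed $ per GB-hour

    apps["cost"] = apps["vcore_hours"] * VCORE_RATE + apps["memory_gb_hours"] * MEM_RATE

    # Group-by options analogous to the report: user, queue, or custom tag.
    for group_by in ["user", "queue", "tag"]:
        report = (apps.groupby(group_by)["cost"]
                      .sum()
                      .sort_values(ascending=False))
        print(f"\nChargeback by {group_by}:\n{report.head(10)}")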

        Unravel chargeback reporting

        Improving cluster efficiency and performance

MTI wanted to be able to predict and anticipate application slowdowns and failures before they occur by using Unravel's proactive alerting and auto-actions, so that they could, for example, find runaway queries and rogue jobs, detect resource contention, and then take action.

        Unravel Auto-actions and alerting

        Unravel Auto-actions are one of the big points of differentiation over the various monitoring options available to data teams such as Cloudera Manager, Splunk, Ambari, and Dynatrace. Unravel users can determine what action to take depending on policy-based controls that they have defined.

        Unravel Auto-actions set up

The simplicity of the Auto-actions screen belies the deep automation behind autonomous remediation of application slowdowns and failures. At the highest level, Unravel Auto-actions can be quickly set up to alert your team via email, PagerDuty, Slack, or text message. Offending jobs can also be killed or moved to a different queue. Unravel can also issue an HTTP POST, which gives users a number of powerful integration options, as sketched below.
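To illustrate the HTTP POST option, here is a hypothetical receiver that moves an offending application to a lower-priority YARN queue. The payload field ("appId"), port, and queue name are assumptions; the actual payload depends on how the Auto-action is configured in Unravel.

    # Hypothetical receiver for an Auto-action HTTP POST.
    # The "appId" payload field and the target queue are assumptions for illustration.
    import json
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class AutoActionHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            event = json.loads(body or b"{}")
            app_id = event.get("appId")
            if app_id:
                # Move the offending application to a lower-priority queue via the YARN CLI.
                subprocess.run(
                    ["yarn", "application", "-movetoqueue", app_id, "-queue", "low_priority"],
                    check=False,
                )
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), AutoActionHandler).serve_forever()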

Unravel also provides a number of powerful pre-built Auto-action templates that give users a big head start on crafting the precise automation they want for their environment.

        Preconfigured Unravel auto-action templates

        Applications

Turning to MTI's application-level requirements, the company was looking to improve overall visibility into their data application runtime performance and to encourage a self-service approach to tuning and optimizing their Spark applications.

        Increased visibility into application runtime and trends

        MTI data teams, like many, are looking for that elusive “single pane of glass” for troubleshooting slow and failing Spark jobs and applications. They were looking to:

        • Visualize app performance trends, viewing metrics such as applications start time, duration, state, I/O, memory usage, etc.
        • View application component (pipeline stages) breakdown and their associated performance metrics
• Understand the execution of MapReduce jobs and Spark applications, including the degree of parallelism and resource usage, and obtain insights and recommendations for optimal performance and efficiency

        Because typical data pipelines are built on a collection of distributed processing engines (Spark, Hadoop, et al.), getting visibility into the complete data pipeline is a challenge. Each individual processing engine may have monitoring capabilities, but there is a need to have a unified view to monitor and manage all the components together.

        Unravel monitoring, tuning and troubleshooting

        Intuitive drill-down from Spark application list to an individual data pipeline stage

        Unravel was designed with an end-to-end perspective on data pipelines. The basic navigation moves from the top level list of applications to drill down to jobs, and further drill down to individual stages of a Spark, Hive, MapReduce or Impala applications.

        Unravel Gantt chart view of a Hive query

        Unravel provides a number of intuitive navigational and reporting elements in the user interface including a Gantt chart of application components to understand the execution and parallelism of your applications.

        Unravel self-service optimization of Spark applications

MTI has placed an emphasis on creating a self-service approach to monitoring, tuning, and managing their data application portfolio. Their aim is for development teams to reduce their dependency on IT while improving collaboration with their peers. Their targets in this area include:

        • Reducing troubleshooting and resolution time by providing self-service tuning
        • Improving application efficiency and performance with minimal IT intervention
• Providing Spark developers with performance issues that relate directly to the lines of code associated with a given step

        MTI has chosen Unravel as a foundational element of their self-service application and workflow improvements, especially taking advantage of application recommendations and insights for Spark developers.

        Unravel self-service capabilities

        Unravel provides plain language insights as well as specific, actionable recommendations to improve performance and efficiency. In addition to these recommendations and insights, users can take action via the auto-tune function, which is available to run from the events panel.

        Unravel provides intelligent recommendations and insights as well as auto-tuning.

        Optimizing Application Resource Efficiency

In large-scale data operations, the resource efficiency of the entire cluster is directly linked to the efficient use of cluster resources at the application level. As data teams routinely run hundreds or thousands of jobs per day, an overall increase in resource efficiency across all workloads improves the performance, scalability, and cost of operating the cluster.

Unravel provides a rich catalog of insights and recommendations around resource consumption at the application level. To eliminate resource waste, Unravel helps you run your data applications more efficiently by providing AI-driven insights and recommendations such as:

        Unravel Insight: Under-utilization of container resources, CPU or memory

        Unravel Insight: Too few partitions with respect to available parallelism

        Unravel Insight: Mapper/Reducers requesting too much memory

        Unravel Insight: Too many map tasks and/or too many reduce tasks

        Solution Highlights

Work on all of these operational goals is ongoing with MTI and Unravel, but to date they have made significant progress on both operational and application goals. After running for over a month on their production computation cluster, MTI was able to capture metrics for all MapReduce and Spark jobs that were executed.

MTI also gained insight into the number and causes of inefficiently running applications. Unravel detected 38,190 events after analyzing the 30,378 MapReduce jobs they executed, and 44,176 events across their 21,799 Spark jobs. It also detected resource contention that was causing Spark jobs to get stuck in the “Accepted” state rather than running to completion.

During a deep dive on their applications, MTI found multiple inefficient jobs for which Unravel provided recommendations to repartition the data. They were also able to identify many jobs that wasted CPU and memory resources.

        The post Case Study: Meeting SLAs for Data Pipelines on Amazon EMR appeared first on Unravel.

        The Guide To Understanding Cloud Data Services in 2022 https://www.unraveldata.com/understanding-cloud-data-services/ https://www.unraveldata.com/understanding-cloud-data-services/#comments Fri, 24 May 2019 15:05:48 +0000 https://www.unraveldata.com/?p=2894


        In the past five years, a shift in Cloud Vendor offerings has fundamentally changed how companies buy, deploy and run big data systems. Cloud Vendors have absorbed more back-end data storage and transformation technologies into their core offerings and are now highlighting their data pipeline, analysis, and modeling tools. This is great news for companies deploying, migrating, or upgrading big data systems. Companies can now focus on generating value from data and Machine Learning (ML), rather than building teams to support hardware, infrastructure, and application deployment/monitoring.

The following chart shows how more and more of the cloud platform stack is becoming the responsibility of the Cloud Vendors (shown in blue). The new value for companies working with big data is the maturation of Cloud Vendor Function as a Service (FaaS), also known as serverless, and Software as a Service (SaaS) offerings. For FaaS (serverless), the Cloud Vendor manages the applications and users focus on data and functions/features. With SaaS, features and data management become the Cloud Vendor's responsibility. Google Analytics, Workday, and Marketo are examples of SaaS offerings.

As the technology gets easier to deploy and the Cloud Vendor data services mature, it becomes much easier to build data-centric applications and provide data and tools to the enterprise. This is good news: companies looking to migrate from on-premises systems to the cloud are no longer required to directly purchase or manage hardware, storage, networking, virtualization, applications, and databases. In addition, this changes the operational focus for big data systems from infrastructure and application management (DevOps) to pipeline optimization and data governance (DataOps). The following table shows the different roles required to build and run Cloud Vendor-based big data systems.

        This article is aimed at helping big data systems leaders moving from on-premise or native IaaS (compute, storage, and networking) deployments understand the current Cloud Vendor offerings. Those readers new to big data, or Cloud Vendor services, will get a high-level understanding of big data system architecture, components, and offerings. To facilitate discussion we provide an end-to-end taxonomy for big data systems and show how the three leading Cloud Vendors (AWS, Azure and GCP) align to the model:

        • Amazon Web Services (AWS)
        • Microsoft Azure (Azure)
        • Google Cloud Platform (GCP)

        APPLYING A COMMON TAXONOMY

        Understanding Cloud Vendor offerings and big data systems can be very confusing. The same service may have multiple names across Cloud Vendors and, to complicate things even more, each Cloud Vendor has multiple services that provide similar functionality. However, the Cloud Vendors big data offerings align to a common architecture and set of workflows.

Each big data offering is set up to receive high volumes of data to be stored and processed for real-time and batch analytics as well as more complex ML/AI modeling. In order to provide clarity amidst the chaos, we provide a two-level taxonomy. The first level includes five stages that sit between data sources and data consumers: CAPTURE, STORE, TRANSFORM, PUBLISH, and CONSUME. The second level includes multiple service offerings for each stage, providing a consistent language for aligning Cloud Vendor solutions.

The following sections provide details for each stage and the related service offerings.

        CAPTURE

        Persistent and resilient data CAPTURE is the first step in any big data system. Cloud Vendors and the community also describe data CAPTURE as ingest, extract, collect, or more generally as data movement. Data CAPTURE includes ingestion of both batch and streaming data. Streaming event data becomes more valuable by being blended with transactional data from internal business applications like SAP, Siebel, Salesforce, and Marketo. Business application data usually resides within a proprietary data model and needs to be brought into the big data system as changes/transactions occur.

        Cloud Vendors provide many tools for bringing large batches of data into their platforms. This includes database migration/replication, processing of transactional changes, and physical transfer devices when data volumes are too big to send efficiently over the internet. Batch data transfer is common for moving on-premise data sources and bringing in data from internal business applications, both SaaS and on-premise. Batch data can be run once as part of an application migration or in near real-time as transactional updates are made in business systems.

The focus of many big data pipeline implementations is the capture of real-time data streaming in as application clickstreams, product usage events, application logs, and IoT sensor events. Properly capturing streaming data requires configuration on the edge device or application. For example, collecting clickstream data from a mobile or web application requires events to be instrumented and sent back to an endpoint listening for those events, as sketched below. The situation is similar for IoT devices, which may also perform some data processing on the edge device prior to sending it back to an endpoint.
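As a minimal illustration of such instrumentation, the sketch below sends a single event to an ingestion endpoint. The URL and event schema are hypothetical, and production collectors typically batch, retry, and authenticate these calls.

    # Minimal sketch of instrumenting a clickstream event and sending it to an
    # ingestion endpoint. The URL and event schema are hypothetical placeholders.
    import json
    import time
    import uuid
    import requests

    INGEST_URL = "https://collector.example.com/v1/events"  # hypothetical endpoint

    def track(event_name, user_id, properties=None):
        event = {
            "event_id": str(uuid.uuid4()),
            "event_name": event_name,
            "user_id": user_id,
            "timestamp": int(time.time() * 1000),
            "properties": properties or {},
        }
        resp = requests.post(INGEST_URL, data=json.dumps(event),
                             headers={"Content-Type": "application/json"}, timeout=2)
        resp.raise_for_status()

    track("product_viewed", user_id="u-123", properties={"sku": "A-42"})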

        STORE

        For big data systems the STORE stage focuses on the concept of a data lake, a single location where structured, semi-structured, unstructured data and objects are stored together. The data lake is also a place to store the output from extract, transform, load (ETL) and ML pipelines running in the TRANSFORM stage. Vendors focus on scalability and resilience over read/write performance. To increase data access and analytics performance, data should be highly aggregated in the data lake or organized and placed into higher performance data warehouses, massively parallel processing (MPP) databases, or key-value stores as described in the PUBLISH stage. In addition, some data streams have such high event volume, or the data are only relevant at the time of capture, that the data stream may be processed without ever entering the data lake.

        Cloud Vendors have recently put more focus on the concept of the data lake, by adding functionality to their object stores and creating a much tighter integration with TRANSFORM and CONSUME service offerings. For example, Azure created Data Lake Storage on top of the existing Object Store with additional services for end to end analytics pipelines. Also, AWS now provides Data Lake Formation to make it easier to set up a data lake on their core object store S3.

        TRANSFORM

        The heart of any big data implementation is the ability to create data pipelines in order to clean, prepare, and TRANSFORM complex multi-modal data into valuable information. Data TRANSFORM is also described as preparing, massaging, processing, organizing, and analyzing among other things. The TRANSFORM stage is where value is created and, as a result, Cloud Vendors, start-ups, and traditional database and ETL vendors provide many tools. The TRANSFORM stage has three main data pipeline offerings including Batch Processing, Machine Learning, and Stream Processing. In addition, we include the Orchestration offering because complex data pipelines require tools to stage, schedule, and monitor deployments.

Batch TRANSFORM uses traditional extract, TRANSFORM, and load techniques that have been around for decades and are the purview of traditional RDBMS and ETL vendors. However, with the increase in data volumes and velocity, TRANSFORM now commonly comes after extraction and loading into the data lake. This is referred to as extract, load, and transform, or ELT. Batch TRANSFORM uses Apache Spark or Hadoop to distribute compute across multiple nodes to process and aggregate large volumes of data, as in the sketch below.
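As a minimal illustration of the ELT pattern described here, the PySpark sketch below aggregates raw events that have already landed in the data lake and writes a curated table back; the paths and column names are assumptions.

    # Minimal PySpark sketch of the ELT pattern: raw events already landed in the
    # data lake are transformed and aggregated into a curated zone.
    # The paths and column names are assumptions for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-daily-aggregates").getOrCreate()

    raw = spark.read.json("s3://data-lake/raw/clickstream/2022-01-01/")  # hypothetical path

    daily = (raw
             .withColumn("event_date", F.to_date(F.from_unixtime(F.col("timestamp") / 1000)))
             .groupBy("event_date", "event_name")
             .agg(F.countDistinct("user_id").alias("unique_users"),
                  F.count("*").alias("events")))

    # Write the curated aggregate back to the lake, partitioned for downstream consumers.
    daily.write.mode("overwrite").partitionBy("event_date") \
         .parquet("s3://data-lake/curated/daily_event_counts/")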

ML/AI uses many of the same Batch Processing tools and techniques for data preparation and for the development and training of predictive models. Machine Learning also takes advantage of numerous libraries and packages that help optimize data science workflows and provide pre-built algorithms.

Big data systems also provide tools to query continuous data streams in near real-time. Some data has immediate value that would be lost waiting for a batch process to run – for example, predictive models for fraud detection or alerts based on data from an IoT sensor. In addition, streaming data is commonly processed, and portions are loaded into a data lake.

Cloud Vendor offerings for TRANSFORM are evolving quickly, and it can be difficult to understand which tools to use. All three Cloud Vendors have versions of Spark/Hadoop that scale on their IaaS compute nodes. However, all three now provide serverless offerings that make it much simpler to build and deploy data pipelines for batch, ML, and streaming workflows. For example, AWS EMR, GCP Cloud Data Proc, and Azure Databricks provide Spark/Hadoop that scale by adding additional compute resources. However, they also offer the serverless AWS Glue, GCP Data Flow, and Azure Data Factory, which abstract away the need to manage compute nodes and orchestration tools. In addition, they now all provide end-to-end tools to build, train, and deploy machine learning models quickly. This includes data preparation, algorithm development, model training, and deployment tuning and optimization.

        PUBLISH

        Once through the data CAPTURE and TRANSFORM stages it is necessary to PUBLISH the output from batch, ML, or streaming pipelines for users and applications to CONSUME. PUBLISH is also described as deliver or serve, and comes in the form of Data Warehouses, Data Catalogs, or Real-Time Stores.

        Data Warehouse solutions are abundant in the market, and the choice depends on the data scale and complexity as well as performance requirements. Serverless relational databases are a common choice for Business Intelligence applications and for publishing data for other systems to consume. They provide scale and performance and, most of all, SQL-based access to the prepared data. Cloud Vendor examples include AWS Redshift, Google BigQuery, and Azure SQL Data Warehouse. These work great for moderately sized and relatively simple data structures. For higher performance and complex relational data models, massively parallel processing (MPP) databases store large volumes of data in-memory and can be blazing fast, but often at a steep price.

        As the tools for TRANSFORM and CONSUME become easier to use, data analyses, models, and metrics proliferate. It becomes harder to find valuable, governed, and standardized metrics in the mass of derived tables and analyses. A well-managed and up-to-date data catalog is necessary for both technical and non-technical users to manage and explore published tables and metrics. Cloud Vendor Data Catalog offerings are still relatively immature. Many companies build their own or use third party catalogs like Alation or Waterline. More technical users including data engineers and data scientists explore both raw and transformed data directly in the data lake. For these users the data catalog, or metastore, is the key for various compute options to understand where data is and how it is structured.

Many streaming applications require a Real-Time Store to meet millisecond response times. Hundreds of optimized data stores exist in the market. As with Data Warehouse solutions, picking a Real-Time Store depends on the type and complexity of the application and data. Cloud Vendor examples include AWS DynamoDB, Google Bigtable, and Azure Cosmos DB, which provide wide-column or key-value data stores. These are applied as high-performance in-process databases and improve the performance of data processing and analytics workloads, as in the lookup sketch below.
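As an illustration of such a lookup, the sketch below reads a single item from a DynamoDB table with boto3; the table name, key, and region are assumptions.

    # Illustrative low-latency lookup against a key-value Real-Time Store,
    # using DynamoDB as the example. Table and attribute names are assumptions.
    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("user_features")  # hypothetical table published by a pipeline

    def get_user_features(user_id):
        resp = table.get_item(Key={"user_id": user_id})
        return resp.get("Item")  # None if the key is absent

    features = get_user_features("u-123")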

        CONSUME

        The value of any big data system comes together in the hands of technical and non-technical users, and in the hands of customers using data-centric applications and products. Vendors also refer to CONSUME as use, harness, explore, model, infuse, and sandbox. We discuss three CONSUME models: Advanced Analytics, Business Intelligence (BI), and Real-Time APIs.

Aggregated data does not always allow for deeper data exploration and understanding, so advanced analytics users CONSUME both raw and processed data either directly from the data lake or from a Data Warehouse. They rely on similar tools to those in the TRANSFORM stage, including Spark- and Hadoop-based distributed compute. In addition, notebooks are a popular tool that allow data engineers and data scientists to create documents containing live code, equations, visualizations, and text. Notebooks allow users to code in a variety of languages, run packages, and share the results. All three Cloud Vendors offer notebook solutions, most based on the popular open source Jupyter project.

BI tools have been in the market for a couple of decades and are now being optimized to work with larger data sets, new types of compute, and directly in the cloud. Each of the three Cloud Vendors now provides a BI tool optimized to work with its stack: AWS Quicksight, GCP Data Studio, and Microsoft Power BI. However, several more mature BI tools exist in the market that work with data from most vendors. BI tools are optimized to work with published data, and usage improves greatly with an up-to-date data catalog and some standardization of tables and metrics.

Applications, products, and services also CONSUME raw and transformed data through APIs built on the Real-Time Store or on predictive ML models. The same Cloud Vendor ML offerings used to explore and build models also provide Real-Time APIs for alerting, analysis, and personalization. Example use cases include fraud detection, system/sensor alerting, user classification, and product personalization.

        CLOUD VENDOR OFFERINGS

AWS, GCP, and Azure have very complex cloud offerings based on their core networking, storage, and compute. In addition, they provide vertical offerings for many markets, and within the big data systems and ML/AI verticals they each provide multiple offerings. In the following chart we align the Cloud Vendor offerings within the two-level big data system taxonomy defined above.

The following table includes some additional Cloud Vendor offerings as well as open source and selected third-party tools that provide similar functionality.

        THE TIME IS NOW

Most companies that deployed big data systems and data-centric applications in the past 5-10 years did so on-premises (or in a colocation facility) or on top of the Cloud Vendors' core infrastructure services, including storage, networking, and compute. Much has changed in the Cloud Vendor offerings since these early deployments. Cloud Vendors now provide a nearly complete set of serverless big data services. In addition, more and more companies see the value of Cloud Vendor offerings and are trusting their mission-critical data and applications to run on them. So, now is the time to think about migrating big data applications from on-premises systems or upgrading bespoke systems built on Cloud Vendor infrastructure services. In order to make the best use of these offerings, make sure to get a deep understanding of existing systems, develop a clear migration strategy, and establish a data operations center of excellence.

        In order to prepare for migration to Cloud Vendor big data offerings, it is necessary for an organization to get a clear picture of its current big data system. This can be difficult depending on the heterogeneity of existing systems, the types of data-centric products it supports, and the number of teams or people using the system. Fortunately, tools (such as Unravel) exist to monitor, optimize, and plan migrations for big data systems and pipelines. During migration planning it is common to discover inefficient, redundant, and even unnecessary pipelines actively running, chewing up compute, and costing the organization time and money. So, during the development of a migration strategy companies commonly find ways to clean up and optimize their data pipelines and overall data architecture.

It is helpful that all three Cloud Vendors are interested in getting a company's data and applications onto their platforms. To this end, they provide a variety of tools and services to help move data or lift and shift applications and databases onto their platforms. For example, AWS provides a Migration Hub to help plan and execute migrations along with a variety of tools like the AWS Database Migration Service. Azure provides free migration assessments as well as several tools. And GCP provides a variety of migration strategies and tools like Anthos and Velostrata, depending on a company's current and future system requirements.

Please take a look at each Cloud Vendor's migration support site for more detail.

No matter whether a company runs an on-premises system, a fully managed serverless environment, or some hybrid combination, it needs to build expertise and a core competence in data operations. DataOps is a rapidly emerging discipline that companies need to own – it is difficult to outsource. Most data implementations utilize tools from multiple vendors, maintain hybrid cloud/on-premises systems, or rely on more than one Cloud Vendor. So, it becomes difficult to rely on a single company or Cloud Vendor to manage all the DataOps tasks for an organization.

        Typical scope includes:

• Data quality
• Metadata management
• Pipeline optimization
• Cost management and chargeback
• Performance management
• Resource management
• Business stakeholder management
• Data governance
• Data catalogs
• Data security & compliance
• ML/AI model management
• Corporate metrics and reporting

Wherever you are on your cloud adoption and workload migration journey, now is the time to start or accelerate your strategic thinking and execution planning for cloud-based data services. Serverless offerings are maturing quickly and give companies faster time to value, increased standardization, and overall lower people and technology costs. However, as migration goes from planning to reality, ensure you invest in the critical skills, technology, and process changes to establish a data operations center of excellence.

        The post The Guide To Understanding Cloud Data Services in 2022 appeared first on Unravel.

        Using Unravel to tune Spark data skew and partitioning https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/ https://www.unraveldata.com/using-unravel-to-tune-spark-data-skew-and-partitioning/#respond Wed, 22 May 2019 21:20:40 +0000 https://www.unraveldata.com/?p=2859


        This blog provides some best practices for how to use Unravel to tackle issues of data skew and partitioning for Spark applications. In Spark, it’s very important that the partitions of an RDD are aligned with the number of available tasks. Spark assigns one task per partition and each core can process one task at a time.

        By default, the number of partitions is set to the total number of cores on all the nodes hosting executors. Having too few partitions leads to less concurrency, processing skew, and improper resource utilization, whereas having too many leads to low throughput and high task scheduling overhead.
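As a quick illustration of that relationship, the PySpark sketch below compares a DataFrame's partition count with the cluster's default parallelism; the input path is a placeholder.

    # Quick PySpark check of partition count versus available task slots.
    # The dataset path is a placeholder; the comparison itself is the point.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-check").getOrCreate()
    sc = spark.sparkContext

    df = spark.read.parquet("s3://data-lake/events/")   # hypothetical input
    num_partitions = df.rdd.getNumPartitions()
    default_parallelism = sc.defaultParallelism          # roughly the total executor cores on most cluster managers

    print(f"partitions={num_partitions}, default parallelism={default_parallelism}")
    # Far fewer partitions than cores -> idle executors; far more -> scheduling overhead.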

The first step in tuning a Spark application is to identify the stages that represent bottlenecks in the execution. In Unravel, this can be done easily by using the Gantt Chart view of the app's execution. We can see at a glance that there is a longer-running stage:

        Using Unravel to tune Spark data skew and partitioning

Once I've navigated to the stage, I can open the Timeline view, where the duration and I/O of each task within the stage are readily apparent. The histogram charts are very useful for identifying outlier tasks; in this case it is clear that 199 tasks take at most 5 minutes to complete, while one task takes 35-40 minutes!

        Using Unravel to tune Spark data

When we select the first bucket of 199 tasks, another clear representation of the effect of this skew is visible within the Timeline: many executors are sitting idle:

        Using Unravel to tune Spark data skew and partitioning

        When we select the outlier bucket that took over 35 minutes to complete, we can see the duration of the associated executor is almost equal to the duration of the entire app:

        Using Unravel to tune Spark data skew and partitioning

In the Graphs > Containers view, we can also observe the bursting of containers at the time the longer-running executor started. Adding more partitions via repartition() can help distribute the data set among the executors.
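For reference, here is a minimal sketch of that repartition() call, assuming df is the DataFrame feeding the skewed stage; the target partition count is illustrative and should be tuned to the cluster's core count and data volume.

    # Redistribute a skewed DataFrame across more partitions before the heavy stage.
    # The partition count (400) is an illustrative choice, not a recommendation.
    balanced = df.repartition(400)            # round-robin shuffle spreads rows evenly
    print(balanced.rdd.getNumPartitions())    # confirm the new partition count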

        Using Unravel to tune Spark data skew and partitioning

        Unravel can provide recommendations for optimizations in some of the cases where join key(s) or group by key(s) are skewed.

In the Spark SQL example below, two dummy data sources are used, both of which are partitioned.

The join operation between the customer and order tables is on the cust_id column, which is heavily skewed. Looking at the code, it can easily be identified that key “1000” has the most entries in the orders table, so one of the reduce partitions will contain all the “1000” entries. In such cases we can apply some techniques to avoid skewed processing:

• Increase the spark.sql.autoBroadcastJoinThreshold value so that the smaller table, “customer”, gets broadcast. This should only be done when there is sufficient driver memory.
• If executor memory is sufficient, decrease spark.sql.shuffle.partitions to accommodate more data per reduce partition. This lets all the reduce tasks finish in roughly the same time.
• If possible, find the keys that are skewed and process them separately using filters (see the sketch after this list).
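The sketch below illustrates the three options, assuming an active spark session and orders and customers DataFrames as in the example; the broadcast threshold, partition count, and hot-key value are illustrative.

    # Sketch of the three mitigations above for the skewed customer/order join.
    # Assumes `spark`, `orders`, and `customers` already exist; values are illustrative.
    from pyspark.sql import functions as F

    # 1. Raise the broadcast threshold (bytes) so the small "customer" table is broadcast.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

    # 2. Reduce shuffle partitions when shuffle data fits comfortably in executor memory.
    spark.conf.set("spark.sql.shuffle.partitions", 10)

    joined = orders.join(customers, "cust_id")

    # 3. Alternatively, process the known hot key separately and union the results.
    hot = orders.filter(F.col("cust_id") == 1000).join(F.broadcast(customers), "cust_id")
    rest = orders.filter(F.col("cust_id") != 1000).join(customers, "cust_id")
    joined_alt = hot.unionByName(rest)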

Let's try approach #2.

        With Spark’s default spark.sql.shuffle.partitions=200.

Observe the lone task that takes much longer during the shuffle. It means the next stage can't start and the other executors are lying idle.

        Using Unravel to tune Spark data skew and partitioning

Now let's change spark.sql.shuffle.partitions to 10. As the shuffle input/output is well within executor memory, we can safely make this change.

        Using Unravel to tune Spark data skew and partitioning

In real-life deployments, not all skew problems can be solved by configuration changes and repartitioning; some require modifying the underlying data layout. If the data source itself is skewed, then tasks that read from it can't be optimized this way. And at the enterprise level, changing the layout is sometimes not possible because the same data source is used by different tools and pipelines.


        Sam Lachterman is a Manager of Solutions Engineering at Unravel.

Unravel Principal Engineer Rishitesh Mishra also contributed to this blog. Take a look at Rishi's other Unravel blogs:

        Why Your Spark Jobs are Slow and Failing; Part I Memory Management

Why Your Spark Jobs are Slow and Failing; Part II Data Skew and Garbage Collection

        The post Using Unravel to tune Spark data skew and partitioning appeared first on Unravel.

        Software Ate the World and Now the Models are Running It https://www.unraveldata.com/welcome-point72-harmony-partners-c-funding/ https://www.unraveldata.com/welcome-point72-harmony-partners-c-funding/#comments Tue, 14 May 2019 10:00:04 +0000 https://www.unraveldata.com/?p=2804


        Off the back of a breakout year where we grew revenue 500%, today we announced the latest milestone in the Unravel journey – we closed a $35M Series C funding round.

        Along with our data ecosystem partners, we are seeing unprecedented demand for solutions to complex, business-critical challenges in dealing with data.

Consider this. Data Engineers walk into work every day knowing they're fighting an uphill battle. The root of the problem – or at least one problem – is that modern data systems are becoming impossibly complex. The amount of data being processed in organizations today is staggering, with annual data growth often measured in high double-digit percentages. Just a year ago, Forbes reported that 90% of the world's data had been created in the previous two years.

        And with that data growth has come rapid growth in the number of applications for ingesting, correlating and analyzing that data. Each component of a data pipeline is by nature a specialist, and it takes lots of specialists to make data deliver results – and, more importantly, insights. This is a problem that touches virtually every corner of the world of business. And the pressure to perform and make data “work” is unrelenting.

        In our own research from November 2018, Unravel found that three-quarters of businesses expect their modern data stack to drive profitable business applications by the end of 2019 – but only 12% were seeing this value at the time.

        The media is rife with stories prophesying magnificent discoveries to be made when data converges with artificial intelligence-driven models. Some of these discoveries have been made, but many more are still to come. Too often, these discoveries are over the horizon or well beyond the horizon, as data practitioners struggle with data systems that create more hurdles than they knock down.

        Old Technologies Cannot Solve New Problems

        Model-driven insights from data is what every business aspires to. The need for reliable and scalable application performance spawned the development of Application Performance Management (APM) and log management tools, two pioneering technologies in the race to make sense of new, multi-tier web architectures. The problem is that those technologies fell short because they were not designed and built for modern data systems. From the standpoint of the Data Engineer, the metrics and graphs those technologies deliver fall flat, when the team needs actual recommendations and answers to the issues faced multiple times every day.

        “It’s clear that enterprises continue to struggle with dealing with the enormous amount of data that fuels their businesses. Legacy approaches have failed, and they need to modernize their systems or risk being made irrelevant,” said Venky Ganesan, managing director, Menlo Ventures.

        Dealing with Overwhelming Complexity

Although it might be trite, it's worth mentioning that every business is becoming a data business. That's why most businesses consider data management systems such as Spark, Kafka, Hadoop, and NoSQL to be their critical systems of record.

        Data pipelines are so complex that they are outgrowing our ability to manage them. That’s because these systems have so many interdependencies that solutions lie beyond human intuition or deduction. And that’s why Unravel talks a lot about the importance of full-stack visibility for optimizing the performance of data-driven applications. We obsess over the need to explore, correlate, and analyze everything in your modern data stack environment, search for dependencies and issues, understand how data and resources are being used, and discover how to troubleshoot and remediate issues.

        And we believe in the promise of AI. That’s why Unravel integrated a powerful AI engine to deliver recommendations that drive more reliable performance in modern data applications.

        See Unravel AI and automation in action

        Try Unravel for free

        Cloud Complicates Everything

        As businesses migrate their data-focused applications and their data to the cloud, they face the fact that many cloud platforms provide only minimal siloed tools for managing these workloads.

In response, Unravel unveiled its newest version of the Unravel platform, which focuses squarely on the unique requirements of hosting data-focused applications in the cloud. That release took the AI, machine learning, and predictive analytics that are the hallmarks of the platform and enabled users to assess which apps are the best candidates to move to the cloud – based on the customer's own defined criteria.

        The release also gave users the tools to validate the success of their cloud migration and predict capacity based on their specific application workloads. At the time, I noted that many unknowns around cost, visibility and migration had prevented this transition to the cloud from occurring more quickly. But that is no more.

        Continuous Improvement

        Continuous improvement: although the term is dated, the concept is still as timely as ever. And it’s a mantra of many businesses today that are never content, even with their highest achievements.

        Continuous improvement is just the latest growth driver in modern data systems as well, and it’s being built on models. In turn, these models are built on closed-loop data. “When built right, these models create a reinforcing cycle: Their products get better, allowing them (businesses) to collect more data, which allows them to build better models, making their products better, and onward,” said Steven Cohen and Matthew Granade of Point72 Ventures, an investor in Unravel Data.

If anything is keeping CIOs from meeting their OKRs, #1 on that list is likely data system complexity. Well, complexity is here to stay! In our data-driven world, gains come when we deal with the inevitable complexities and move beyond them. At Unravel, we think big data can do better, and we're here to help it along – by radically simplifying the way you do data operations, improving how your models perform, and ensuring big data lives up to your expectations, both today and tomorrow.

        ————-

        Read on to learn more about today’s news from Unravel.

        Unravel Data Grows Revenue 500% Year-Over-Year, Secures $35M in Series C Funding

        Point72 Ventures leads funding round to address performance and complexity challenges of modern data applications and cloud migration initiatives

        PALO ALTO, Calif.,—May 14, 2019—Unravel Data, the only data operations platform providing full-stack visibility and AI-powered recommendations to drive more reliable performance in modern data applications, today announced it has raised $35 million in an oversubscribed Series C funding round. Point72 Ventures, founded by renowned hedge fund investor Steve Cohen, led the round with participation from Harmony Partners, and existing Unravel investors Menlo Ventures, GGV Capital and M12 (Microsoft Ventures).

        “At first glance, application performance management (APM) may seem like a problem that has been addressed by the APM and log management vendors such as AppDynamics, New Relic, Splunk, and DataDog. The reality is that these solutions were not initially built for modern data systems. By being natively built with modern IT in mind, Unravel can cost-effectively deliver the data application awareness and AI-powered recommendations, resolutions, and answers that organizations demand,” said Mike Leone, senior analyst at ESG. “Additionally, as workloads migrate to cloud platforms, the complexity of multiple systems, locality of services and technologies, different operating and pricing models, and constantly changing dependencies throughout the data pipeline add higher levels of risk to migrations, performance optimization challenges, and cost concerns.”

        “Most industry-leading companies are now software businesses, and a majority of those businesses are running on top of mission-critical big data applications,” said David Dubick, Partner, Point72 Ventures. “These big data tailwinds have created a demand for tools to monitor, optimize and secure these systems, and Unravel is uniquely positioned to address this need in the marketplace.”

        “CIOs in our network told us story after story of traditional application monitoring tools failing in a big data context because those tools were designed for the world of the past. And we didn’t just hear this problem from third parties, we were seeing it at Point72 as well,” said Matthew Granade, Chief Market Intelligence Officer at Point72 and Managing Partner of Point72 Ventures. “This new architecture requires a different product, one built from the ground up to focus on the unique challenges posed by big data applications. Unravel is poised to capture this emerging big-data APM market.”

        Company Momentum Highlights

        Today’s funding news follows a year of significant momentum for Unravel as evidenced by a series of milestones:

        • Annual Recurring Revenue (ARR) growth of 500%
        • Cloud Services Product and Partner Momentum
          • Microsoft Azure Cloud Partner Ecosystem — Unravel introduced support for Azure services and uses operational data from Azure HDInsight, Spark, Kafka, Hadoop, Hive, and HBase to automatically troubleshoot on-going issues that reduce confidence and performance on customers’ clusters. Unravel also correlates this full-stack data to help in migration to Azure. Unravel is available on the Microsoft Azure Marketplace and is co-sell ready.
          • AWS Cloud Partner Ecosystem — Unravel introduced its platform across the Amazon ecosystem supporting Amazon AWS, Amazon EMR, Cloudera EDH for AWS, Hortonworks Data Cloud on AWS, and MapR CDP on AWS, providing critical operational intelligence. Unravel is an AWS Advanced Partner and is available in the AWS marketplace
        • Industry Accolades – Gartner named Unravel Data to its list of “Cool Vendors” for 2018 in Performance Analysis; Analytics and Containers. CRN awarded Unravel as a ‘Top 100 Coolest Cloud Computing Company,’ and Unravel made CNBC’s Upstart 100 list.

        “Every business is becoming a data business, and companies are relying on their data applications such as machine learning, IoT, and customer analytics, for better business outcomes using technologies such as Spark, Kafka, and NoSQL,” said Kunal Agarwal, CEO, Unravel Data. “We are making sure that these technologies are easy to operate and are high performing so that businesses can depend on them. We partner with our customers through their data journey and help them successfully run data apps on various systems whether on-premises or in the cloud.”

        Customer Validation

        See https://www.unraveldata.com/customers/

        Unravel Reviews on Gartner Peer Insights:
        “Key Software Product for Today’s Modern Data Applications And Systems” March, 2019
        https://www.gartner.com/reviews/review/view/785382#

        Market Validation

        “Enterprises are turning to technologies like Spark, Kafka, MPP, and NoSQL to embrace a data-centric approach to their business. The challenge is that there are massive skills shortages associated with architecting, managing, and optimizing all these integrated tools supported by numerous vendors across a data pipeline. In fact, on average, organizations work with 37 different vendors across their data pipeline today. Many of the technologies they rely on have their own monitoring and management tools, and this exacerbates the problem, creating operational silos and ultimately preventing end-to-end insight,” said Mike Leone, senior analyst at ESG. “How can organizations effectively utilize a wide range of applications like customer analytics, fraud prevention, and predictive maintenance that rely heavily on next-generation technology like AI and machine learning? By turning to a comprehensive data operations platform. Unravel allows customers to manage and optimize all their data pipelines from one location. By using AI-driven recommendations and automation, a high percentage of manual troubleshooting can simply be eliminated, enabling data operations teams to be proactive in preventing future issues.”

        “As enterprises of every size choose the Azure Cloud platform to build and deliver their modern data, Unravel has proven an important tool to help enterprises operationalize this data and drive tangible value to the business,” said Rashmi Gopinath, partner, M12 Ventures. “Azure and Unravel have worked closely on product development and go-to market execution and are well positioned to meet this market demand.”

        Investor Quotes

        “There’s a tremendous need to enable organizations to maximize the value of their data infrastructure investments,” said Mark Lotke, founder and managing partner, Harmony Partners. “Unravel fills that gap perfectly as the only company that is truly using machine learning and an AI-driven platform to optimize and operationalize data-driven applications and the data systems they depend on at scale. The Unravel team demonstrated incredible growth in 2018 and is poised for an even bigger year in 2019 as demand for data operations solutions accelerates.”

        “It’s clear that enterprises continue to struggle with dealing with the enormous amount of data that fuels their businesses. Legacy approaches have failed and they need to modernize their systems or risk being made irrelevant. Unravel is leading the pack in providing technology innovations that provide this competitive edge and fuel the next generation of cloud and hybrid cloud data services,” commented Venky Ganesan, managing director, Menlo Ventures.

        “Since we led the company’s B round, we have been blown away by the market momentum that Unravel has achieved in a short space of time. There is clearly an unmet need in large enterprises for solving the complexity and operational challenges they face as they transition to being data driven and cloud first,” said Glenn Solomon, Managing Partner, GGV Capital. “Current approaches are failing these data ops teams and Unravel has come to market with technology innovation and go-to-market execution that is solving real world problems, today.”

        About Unravel Data
        Unravel radically simplifies the way businesses understand and optimize the performance of their modern data applications – and the complex pipelines that power those applications. Providing a unified view across the entire stack, Unravel’s data operations platform leverages AI, machine learning, and advanced analytics to offer actionable recommendations and automation for tuning, troubleshooting, and improving performance – both today and tomorrow. By operationalizing how you do data, Unravel’s solutions support modern big data leaders, including Kaiser Permanente, Adobe, Deutsche Bank, Wayfair, and Neustar. The company is headquartered in Palo Alto, California, and is backed by Menlo Ventures, GGV Capital, M12, Point72 Ventures, Harmony Partners, Data Elite Ventures, and Jyoti Bansal. To learn more, visit unraveldata.com.

        Copyright Statement
        The name Unravel Data is a trademark of Unravel Data™. Other trade names used in this document are the properties of their respective owners.

        PR Contact
        Jordan Tewell, 10Fold
        unravel@10fold.com
        1-415-666-6066

        The post Software Ate the World and Now the Models are Running It appeared first on Unravel.

Migrating big data workloads to Azure HDInsight https://www.unraveldata.com/migrating-big-data-workloads-to-azure-hdinsight/ https://www.unraveldata.com/migrating-big-data-workloads-to-azure-hdinsight/#respond Fri, 03 May 2019 03:17:06 +0000 https://www.unraveldata.com/?p=2649

        Transitioning big data workloads to the cloud

        This is a guest blog by Arnab Ganguly, Senior Program Manager for Azure HDInsight at Microsoft. This blog was first published on the Microsoft Azure blog.

Migrating big data workloads to the cloud remains a key priority for our customers, and Azure HDInsight is committed to making that journey simple and cost effective. HDInsight partners with Unravel, whose mission is to reduce the complexity of delivering reliable application performance when migrating data from on-premises or a different cloud platform onto HDInsight. Unravel's Application Performance Management (APM) platform brings a host of services that provide unified visibility and operational intelligence to plan and optimize the migration process onto HDInsight:

        • Identify current big data landscape and platforms for baselining performance and usage.
        • Use advanced AI and predictive analytics to increase performance, throughput and to reduce application, data, and processing costs.
        • Automatically size cluster nodes and tune configurations for the best throughput for big data workloads.
        • Find, tier, and optimize storage choices in HDInsight for hot, warm, and cold data.

        In our previous blog we discussed why the cloud is a great fit for big data and provided a broad view of what the journey to the cloud looks like, phase by phase. In this installment and the following parts we will examine each stage in that life cycle, diving into the planning, migration, operation and optimization phases. This blog post focuses on the planning phase.

        Phase one: Planning

        In the planning stage you must understand your current environment, determine high priority applications to migrate, and set a performance baseline to be able to measure and compare your on-premises clusters versus your Azure HDInsight clusters. This raises the following questions that need to be answered during the planning phase:

        On-premises environment

        • What does my current on-premises cluster look like, and how does it perform?
        • How much disk, compute, and memory am I using today?
        • Who is using it, and what apps are they running?
        • Which of my workloads are best suited for migration to the cloud?
        • Which big data services (Spark, Hadoop, Kafka, etc.) are installed?
        • Which datasets should I migrate?

        Azure HDInsight environment

        • What are my HDInsight resource requirements?
        • How do my on-premises resource requirements map to HDInsight?
        • How much and what type of storage would I need on HDInsight, and how will my storage requirements evolve with time?
        • Would I be able to meet my current SLAs or better them once I’ve migrated to HDInsight?
        • Should I use manual scaling or auto-scaling HDInsight clusters, and with what VM sizes?

        Baselining on-premises performance and resource usage

        To effectively migrate big data pipelines from physical to virtual data centers, one needs to understand the dynamics of on-premises workloads, usage patterns, resource consumption, dependencies and a host of other factors.

        Unravel creates on-premises cluster discovery reports in minutes

        Unravel provides detailed reports of on-premises clusters including total memory, disk, number of hosts, and number of cores used. This cluster discovery report also delivers insights on cluster topology, running services, operating system version and more. Resource usage heatmaps can be used to determine any unique needs for Azure.

Unravel on-premises cluster discovery reporting

        Gain key app usage insights from cluster workload analytics and data insights

        Unravel can highlight application workload seasonality by user, department, application type and more to help calibrate and make the best use of Azure resources. This type of reporting can greatly aid in HDInsight cluster design choices (size, scale, storage, scalability options, etc.) to maximize your ROI on Azure expenses.

        Unravel also provides data insights to enable decision making on the best strategy for storage in the cloud, by looking at specific metrics on usage patterns of tables and partitions in the on-premises cluster.

        Unravel tiered storage detail for tables

It can also identify unused or cold data. Once identified, you can decide on the appropriate layout for that data in the cloud and make the best use of your Azure budget. Based on this information, datasets can be distributed most effectively across HDInsight storage options. For example, the hottest data can be stored on disk or the highly performant object storage of Azure Data Lake Storage Gen 2 (hot), and the least used data on the relatively less performant Azure Blob storage (cold).

        Data migration

        Migrate on-premises data to Azure

There are two main options for migrating data from on-premises to Azure. Learn more about the processes and data migration best practices before choosing one.

        1. Transfer data over network with TLS
          • Over the internet. Transfer data to Azure storage over a regular internet connection.
          • Express Route. Create private connections between Microsoft datacenters and infrastructure on-premises or in a colocation facility.
• Data Box online data transfer. Data Box devices act as network storage gateways to manage data between your site and Azure.
        2. Shipping data offline

        Once you’ve identified which workloads to migrate, the planning gets a little more involved, requiring a proper APM tool to get the rest right. For everything to work properly in the cloud, you need to map out workload dependencies as they currently exist on-premises. This may be challenging when done manually, as these workloads rely on many different complex components. Incorrectly mapping these dependencies is one of the most common causes of big data application breakdowns in the cloud.

        The Unravel platform provides a comprehensive and immediate readout of all the big data stack components involved in a workload. For example it could tell you that a streaming app is using Kafka, HBase, Spark, and Storm, and detail each component’s relationship with one another while also quantifying how much the app relies on each of these technologies. Knowing that the workload relies far more on Spark than Storm allows you to avoid under-provisioning Spark resources in the cloud and overprovisioning Storm.

        Resource management and capacity planning

Organizations face a similar challenge in determining the resources, such as disk, compute, and memory, that a workload will need to run efficiently in the cloud. It is difficult to determine utilization metrics for these resources on on-premises clusters and to see which services are consuming them. Unravel provides reports with precise, quantitative metrics on the resources consumed by each big data workload. If resources have been overprovisioned and thereby wasted, as many organizations unknowingly do, the platform provides recommendations to reconfigure applications to maximize efficiency and optimize spend. These resource settings are then translated to Azure.

        Since cloud adoption is an ongoing and iterative process, customers might want to look ahead and think about how resource needs will evolve throughout the year as business needs change. Unravel leverages predictive analytics based on previous trends to determine resource requirements in the cloud for up to six months out.

For example, workloads such as fraud detection employ several datasets including ATM transaction data, customer account data, charge location data, and government fraud data. Once in Azure, some apps require certain datasets to remain in Azure in order to work properly, while other datasets can remain on-premises without issue. Like app dependency mapping, it’s difficult to determine which datasets an app needs to run properly. Other considerations are security, data governance laws (some sensitive data must remain in private datacenters in certain jurisdictions), and the size of the data. Based on Unravel’s resource management and capacity planning reports, customers can efficiently manage data placement across HDInsight storage options and on-premises to best suit their business requirements.

        Capacity planning and chargeback

        Unravel brings some additional visibility and predictive capabilities that can remove a lot of mystery and guesswork around Azure migrations. Unravel analyzes your big data workloads for both on-premises or for Azure HDInsight, and can provide chargeback reporting by user, department, application type, queue, or other customer defined tags.

Unravel chargeback reporting by user, application, department, and more

        Cluster sizing and instance mapping

As the final part of the planning phase, you will need to decide on the scale, VM sizes, and type of Azure HDInsight clusters to fit the workload type. This depends on the business use case and priority of the given workload. For example, a recommendation engine that needs to meet a stringent SLA at all times might require an auto-scaling HDInsight cluster so that it always has the compute resources it needs, but can also scale down during lean periods to optimize costs. Conversely, for a workload with fixed resource requirements, such as a predictable batch processing app, you might deploy manual-scaling HDInsight clusters and size them optimally with the right VM sizes to keep costs under control.

Since the choice of HDInsight VM instances is key to the success of the migration, Unravel can infer the seasonality of big data workloads and deliver recommendations for optimal server instance sizes in minutes instead of hours or days.

        Unravel instance mapping by workload

        Given the  default virtual machine sizes for HDInsight clusters provided by Microsoft, Unravel provides some additional intelligence to help choose the correct virtual machine sizes for data workloads based on three different migration strategies:

1. Lift and shift – If the on-premises clusters collectively had 200 cores, 20 terabytes of storage, and 500 GB of memory, Unravel will provide a close mapping to the Azure VM environment. This strategy ensures that the overall Azure HDInsight deployment will have the same (or more) resources available as the current on-premises environment, which minimizes any risks associated with under-provisioning HDInsight for the migrating workloads.
2. Cost reduction – This provides a one-to-one mapping of each existing on-premises host to the most suitable Azure virtual machine on HDInsight, such that it matches the actual resource usage. This determines a cost-optimized closest fit per host by matching each VM’s published specifications to the actual usage of the host. If your on-premises hosts are underutilized, this method will always be less expensive than lift and shift.
3. Workload fit – This strategy consumes the application runtime data that Unravel has collected and offers the flexibility of provisioning Azure resources to provide 100 percent SLA compliance. It can also allow a lower target, say 90 percent compliance, as pictured below. The flexibility of the workload fit configuration enables the right price-to-performance trade-off in Azure.

        Unravel allows for flexibility around SLA compliance in capacity planning for your Azure clusters and can compute average hourly cost at each percentile.

        Conclusion

The planning phase is the critical first step in any workload migration to HDInsight. Many organizations lack effective quantitative and qualitative guidance, such as that provided by Unravel APM, during this critical planning process and may face challenges downstream in workload execution and cost optimization. Unravel’s robust APM platform can help navigate this complexity by providing tools for mapping workload dependencies, forecasting resource usage, and guiding decisions on which datasets to move, which in turn makes the migration process more efficient, data driven, and ultimately successful.

        In our upcoming blog, we’ll look closely at migration to HDInsight.

        The post Migrating big data workloads to Azure HDInsight appeared first on Unravel.

Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail https://www.unraveldata.com/common-failures-slowdowns-part-ii/ https://www.unraveldata.com/common-failures-slowdowns-part-ii/#respond Tue, 09 Apr 2019 05:55:13 +0000 https://www.unraveldata.com/?p=2494

Why Your Spark Apps Are Slow or Failing, Part II: Data Skew and Garbage Collection

        The second part of our series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many performance challenges with Spark, the symptoms increase as the scale of data handled by the application increases.

        What is Data Skew?

In an ideal Spark application run, when Spark wants to perform a join, for example, the join keys would be evenly distributed and each partition would get a similarly sized, well-organized slice of data to process. However, real business data is rarely so neat and cooperative. We often end up with less-than-ideal data organization across the Spark cluster, which results in degraded performance due to data skew.

        Data skew is not an issue with Spark per se, rather it is a data problem. The cause of the data skew problem is the uneven distribution of the underlying data. Uneven partitioning is sometimes unavoidable in the overall data layout or the nature of the query.

For joins and aggregations, Spark needs to co-locate records of a single key in a single partition. Records of a given key will always land in a single partition, and records of other keys are distributed across other partitions. If one key has far more records than the rest, its partition becomes very large; this is data skew, and it is problematic for any query engine if no special handling is done.

        Dealing with data skew

        Data skew problems are more apparent in situations where data needs to be shuffled in an operation such as a join or an aggregation. Shuffle is an operation done by Spark to keep related data (data pertaining to a single key) in a single partition. For this, Spark needs to move data around the cluster. Hence shuffle is considered the most costly operation.

Common symptoms of data skew are:

        • Stuck stages & tasks
        • Low utilization of CPU
        • Out of memory errors

There are several tricks we can employ to deal with the data skew problem in Spark.

        Identifying and resolving data skew

        Spark users often observe all tasks finish within a reasonable amount of time, only to have one task take forever. In all likelihood, this is an indication that your dataset is skewed.  This behavior also results in the overall underutilization of the cluster. This is especially a problem when running Spark in the cloud, where over-provisioning of  cluster resources is wasteful and costly.

If skew is at the data source level (e.g. a Hive table is partitioned on a _month key and the table has far more records for a particular _month), this will cause skewed processing in the stage that reads from the table. In such a case, restructuring the table with a different partition key (or keys) helps. However, sometimes that is not feasible, as the table might be used by other data pipelines in the enterprise.

        In such cases, there are several things that we can do to avoid skewed data processing.

        Data Broadcast

If we are doing a join operation on a skewed dataset, one of the tricks is to increase the “spark.sql.autoBroadcastJoinThreshold” value so that smaller tables get broadcast. Before doing this, make sure there is sufficient driver and executor memory, since the broadcast relation is materialized in memory.
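As an illustration (this sketch is not from the original post), raising the threshold and hinting the broadcast might look as follows; the table names, join column, and 100 MB value are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Allow tables up to ~100 MB to be broadcast instead of shuffled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

val fact = spark.table("fact_events")    // large table, skewed on customer_id (hypothetical)
val dim  = spark.table("dim_customers")  // small lookup table (hypothetical)

// The broadcast() hint forces a broadcast join even if table statistics are missing.
val joined = fact.join(broadcast(dim), Seq("customer_id"))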

        Data Preprocess

If there are too many null values in a join or group-by key, they will skew the operation, since all null keys end up in the same partition. Try to preprocess the null values by replacing them with random ids, and handle those rows separately in the application.
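A minimal sketch of such preprocessing, assuming an existing SparkSession named spark and a hypothetical table with a nullable string join key account_id, might look like this:

import org.apache.spark.sql.functions._

// Null keys would otherwise all hash to the same partition during the shuffle.
// Replace them with unique synthetic ids that spread evenly and can never match
// a real key; handle these rows separately in the application logic.
val events = spark.table("events")  // hypothetical table with a nullable account_id column

val withSafeKeys = events.withColumn(
  "account_id",
  when(col("account_id").isNull,
    concat(lit("null_"), monotonically_increasing_id().cast("string")))
    .otherwise(col("account_id"))
)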

        Salting

In a SQL join operation, the join key is changed to redistribute the data in an even manner so that processing a single partition does not take disproportionately long. This technique is called salting. Let’s take an example to check the outcome of salting. In a join or group-by operation, Spark maps a key to a particular partition id by computing a hash code on the key and taking it modulo the number of shuffle partitions.

Let’s assume there are two tables that share a common join key column.

Let’s consider a case where a particular key, say key 1, is heavily skewed, and we want to join both tables and then group the result to get a count per key.
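The original schema and query are not reproduced here; a minimal sketch of the scenario being described, with assumed table and column names and an existing SparkSession named spark, might look like this:

// Two tables sharing a join column "key"; the left table is heavily skewed on key = 1.
val left  = spark.table("left_table")   // large, skewed table (hypothetical)
val right = spark.table("right_table")  // smaller table (hypothetical)

val counts = left
  .join(right, Seq("key"))
  .groupBy("key")
  .count()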

After the shuffle stage induced by the join operation, all the rows having the same key need to be in the same partition: all the rows of key 1 end up in partition 1, and all the rows with key 2 end up in partition 2. It is quite natural that processing partition 1 will take more time, as that partition contains more data. Let’s check Spark’s UI for the shuffle stage run time of the above query.

        As we can see one task took a lot more time than other tasks. With more data it would be even more significant. Also, this might cause application instability in terms of memory usage as one partition would be heavily loaded.

Can we add something to the data so that our dataset becomes more evenly distributed? Most users facing a skew problem use the salting technique. Salting is a technique where we add random values to the join key of one of the tables. In the other table, we need to replicate the rows to match the random keys. The idea is that if the join condition is satisfied by key1 == key1, it should also be satisfied by key1_<salt> == key1_<salt>. The salt value helps the dataset become more evenly distributed.

Here is an example of how to do that in our use case. Note the number 20, used both when generating the random salt and when exploding the dataset; this is the number of distinct divisions we want for our skewed key. This is a very basic example and can be improved to salt only the keys that are actually skewed.
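The original snippet is not reproduced here; a rough sketch of the salting approach described above, reusing the assumed left and right DataFrames from the earlier sketch and 20 salt buckets, might look like this:

import org.apache.spark.sql.functions._

val saltBuckets = 20  // number of divisions for the skewed key

// Append a random salt (0-19) to the join key of the large, skewed table.
val saltedLeft = left.withColumn(
  "salted_key",
  concat(col("key").cast("string"), lit("_"),
    (rand() * saltBuckets).cast("int").cast("string"))
)

// Replicate each row of the smaller table once per salt value so that every
// salted key on the left side still finds its match.
val saltedRight = right
  .withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))
  .withColumn("salted_key",
    concat(col("key").cast("string"), lit("_"), col("salt").cast("string")))
  .drop("key", "salt")

// Join on the salted key, then group by the original key to get correct counts.
val saltedCounts = saltedLeft
  .join(saltedRight, Seq("salted_key"))
  .groupBy("key")
  .count()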

Now let’s check the Spark UI again. As we can see, processing time is spread much more evenly now.

Note that for smaller datasets the performance difference won’t be significant. Shuffle compression also plays a role in the overall runtime: for skewed data, the shuffled data can be compressed heavily due to its repetitive nature, so the overall disk IO and network transfer are also reduced. We need to run our app both without and with salting to finalize the approach that best fits our case.

        Garbage Collection

Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it has a major reliance on Java’s memory management and garbage collection (GC). Therefore, GC can be a major issue that affects many Spark applications.

        Common symptoms of excessive GC in Spark are:

        • Slowness of application
        • Executor heartbeat timeout
        • GC overhead limit exceeded error

Spark’s memory-centric approach and data-intensive workloads make GC a more common issue for Spark than for other Java applications. Thankfully, it’s easy to diagnose whether your Spark application is suffering from a GC problem: the Spark UI marks executors in red if they have spent too much time doing GC.

Whether Spark executors are spending a significant amount of CPU cycles performing garbage collection can be determined by looking at the “Executors” tab in the Spark application UI. Spark will mark an executor in red if it has spent more than 10% of its task time in garbage collection, as you can see in the diagram below.

        The Spark UI indicates excessive GC in red

        Addressing garbage collection issues

        Here are some of the basic things we can do to try to address GC issues.

        Data structures

In RDD-based applications, use data structures with fewer objects. For example, use an array instead of a list.
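For example (an illustrative sketch, not from the original post, assuming an existing SparkSession named spark):

// A Scala List allocates a wrapper object per element, while an Array[Int]
// stores primitives contiguously, creating far less garbage per record.
val rdd = spark.sparkContext.parallelize(1 to 1000000)

val asLists  = rdd.map(i => List(i, i * 2, i * 3))   // more GC pressure
val asArrays = rdd.map(i => Array(i, i * 2, i * 3))  // less GC pressure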

        Specialized data structures

        If you are dealing with primitive data types, consider using specialized data structures like Koloboke or fastutil. These structures optimize memory usage for primitive types.

        Storing data off-heap

        The Spark execution engine and Spark storage can both store data off-heap. You can switch on off-heap storage using

• --conf spark.memory.offHeap.enabled=true
• --conf spark.memory.offHeap.size=Xg (where X is the off-heap size in gigabytes)

Be careful when using off-heap storage, as it does not reduce the on-heap memory size, i.e. it won’t shrink the heap. So, to stay within an overall memory limit, assign a smaller heap size.
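For example, an illustrative way to set this when building the session (the 4g values are assumptions, not recommendations):

import org.apache.spark.sql.SparkSession

// Off-heap memory is allocated in addition to the executor heap, so the heap
// is kept smaller to stay within the same overall per-executor budget.
val spark = SparkSession.builder()
  .appName("offheap-sketch")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")
  .config("spark.executor.memory", "4g")
  .getOrCreate()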

        Built-in vs User Defined Functions (UDFs)

If you are using Spark SQL, try to use the built-in functions as much as possible rather than writing new UDFs. Spark’s built-in functions can work directly on UnsafeRow and don’t need to convert values to wrapper data types. This avoids creating garbage, and it also plays well with code generation.
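For example (an illustrative sketch; the table and column names are assumptions, and an existing SparkSession named spark is assumed):

import org.apache.spark.sql.functions.{col, udf, upper}

val df = spark.table("customers")  // hypothetical table with a string "name" column

// UDF version: every value is deserialized to a JVM String, creating garbage.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val slow = df.withColumn("name_upper", upperUdf(col("name")))

// Built-in version: operates on Spark's internal row format with code generation.
val fast = df.withColumn("name_upper", upper(col("name")))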

        Be stingy about object creation

Remember that we may be working with billions of rows. If we create even a small temporary object of 100 bytes for each row, 1 billion rows will create 1 billion × 100 bytes, roughly 100 GB, of garbage.
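One common way to cut down per-row allocations is to create reusable objects once per partition instead of once per row; a small illustrative sketch (not from the original post, assuming an existing SparkSession named spark):

import java.text.SimpleDateFormat
import java.util.Date

val timestamps = spark.sparkContext.parallelize(1L to 1000000L)  // hypothetical epoch millis

// One formatter object per row: billions of rows means billions of temporary objects.
val perRow = timestamps.map { ts =>
  new SimpleDateFormat("yyyy-MM-dd").format(new Date(ts))
}

// One formatter per partition: the same instance is reused for every row it processes.
val perPartition = timestamps.mapPartitions { rows =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  rows.map(ts => fmt.format(new Date(ts)))
}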

        End of Part II

        So far, we have focused on memory management, data skew, and garbage collection as causes of slowdowns and failures in your Spark applications. For Part III of the series, we will turn our attention to resource management and cluster configuration where issues such as data locality, IO-bound workloads, partitioning, and parallelism can cause some real headaches unless you have good visibility and intelligence about your data runtime.

        If you found this blog useful, you may wish to view Part I of this series Why Your Spark Apps are Slow or Failing: Part I Memory Management. Also see our blog Spark Troubleshooting, Part 1 – Ten Challenges.

        The post Why Data Skew & Garbage Collection Causes Spark Apps To Slow or Fail appeared first on Unravel.

        Unravel Introduces the First AI Powered DataOps Solution for the Cloud https://www.unraveldata.com/data-operations-re-imagined-unravel-introduces-the-first-ai-powered-dataops-solution-for-the-cloud/ https://www.unraveldata.com/data-operations-re-imagined-unravel-introduces-the-first-ai-powered-dataops-solution-for-the-cloud/#respond Tue, 26 Mar 2019 21:14:09 +0000 https://www.unraveldata.com/?p=2499


It’s indisputable: new data-driven applications are moving to, or starting life in, the cloud. The increasing automation and resilience of current cloud infrastructure make it an ideal environment for running modern data pipelines. For many companies and institutions, a cloud-first strategy is becoming a cloud-only strategy.

Native online businesses such as Netflix, as well as mainstream enterprises such as Capital One, have multibillion-dollar valuations and almost no physical data centers. Public cloud providers will account for over 60% of all capital expenditures on cloud infrastructure – disks, CPUs, network switches, and the like. Given this momentum, there is increased pressure on IT teams to prove that they are getting the most out of their cloud and big data investments.

Against this backdrop, Unravel introduces the industry’s first AI-powered cloud platform operations and workload migration solution for data applications, delivering new AI-powered, unified performance optimization for planning, migrating, and managing modern data applications on the AWS, Azure, and Google Cloud platforms.

        Some of the new capabilities that IT teams will gain from this latest release include:

        Unified management of the full modern data stack on all deployment platforms – Unravel Cloud Migration covers AWS, Azure and Google clouds, as well as on-premises, hybrid environments and multi-cloud settings. Customers get AI-powered troubleshooting, auto-tuning and automated remediation of failures and slowdowns with the same user interface.

Full-stack visibility – Unravel uses automation to provide detailed reports and metrics on app usage, performance, cost, and chargebacks in the cloud.

        Recommendations for the best apps to migrate – Unravel baselines on-premises performance of the full modern data stack and uses AI to identify the best app candidates for migration to cloud. Organizations can avoid migrating apps that aren’t ideal for the cloud and having to repatriate them later.

        Mapping on-premises infrastructure to cloud server instances – Unravel helps customers choose cloud instance types for their migration based on three strategies:

        • Lift and shift – A one-to-one mapping from physical to virtual servers ensures that a  cloud deployment will have the same (or more) resources available. This minimizes any risks associated with migrating to the cloud.
        • Cost reduction – Provides the most cost-effective instance recommendations based on detailed dependency understanding for minimizing wasted capacity and over provisioning.
        • Workload fit – Takes into account data collected over time from the on-premises environment, making recommendations for instance types based on the actual workload of applications running in a  data center. These recommendations will be based on the VCore, memory, and storage requirements of a customer’s typical runtime environment.

        Cloud capacity planning and chargeback reporting – Unravel can predict cloud storage requirements up to six months out and can provide a detailed accounting of resource consumption and chargeback by user, department or other criteria.

        Migration validation  – Unravel can provide a before and after assessment of cloud applications by comparing on-premises performance and resource consumption to the same metrics in the cloud, thereby validating the relative success of the migration.

        All indications point to a massive shift in data deployments to the cloud, but there are too many unknowns around cost, visibility and migration that have prevented this transition to the cloud from occurring more quickly.

        We are incredibly proud of this latest release and the value we believe it can deliver as organizations either begin their cloud journey for their modern data applications or look to optimize performance and cost efficiencies for those data workloads already operating in the cloud.

        The post Unravel Introduces the First AI Powered DataOps Solution for the Cloud appeared first on Unravel.

        Planning and Migration For Modern Data Applications in the Cloud https://www.unraveldata.com/getting-the-most-from-data-apps-cloud/ https://www.unraveldata.com/getting-the-most-from-data-apps-cloud/#respond Tue, 26 Mar 2019 12:57:10 +0000 https://www.unraveldata.com/?p=2505


Current trends indicate that the cloud is a boon for big data. Conversations with our customers also clearly indicate this trend of data workloads and pipelines moving to the cloud. More and more organizations across industries are running – or are looking to run – data pipelines, machine learning, and AI in the cloud. And until today, there has not been an easy way to migrate, deploy, and manage data-driven applications in the cloud. Getting the most from modern data applications in the cloud requires data-driven planning and execution.

        Unravel provides full-stack visibility and AI-powered guidance to help customers understand and optimize the performance of their data-driven applications and monitor, manage and optimize their modern data stacks. This applies as much to clusters located in the cloud as it does to modern data clusters on-premise. Specifically, for the cloud, our goal is to cover the entire gamut below:

        • IaaS (Infrastructure as a Service): Cloudera, Hortonworks or MapR data platforms deployed on cloud VMs where your modern data applications are running
        • PaaS (Platform as a Service): Managed Hadoop/Spark Platforms like AWS Elastic MapReduce (EMR), Azure HDInsight, Google Cloud Dataproc etc.
        • Cloud-Native: Products like Amazon Redshift, Azure Databricks, AWS Databricks etc.
        • Serverless: Ready to use, no setup needed services like Amazon Athena, Google BigQuery, Google Cloud DataFlow etc.

        We have also learnt that enterprises tend to use a combination of one or more of the above to solve their modern data stack needs. In addition, it is not uncommon to have more than one cloud provider in use (Multi-Cloud Strategy). Often workloads and data are also distributed between on-premise and cloud clusters (Hybrid Strategy).

        This blog covers Unravel’s current capabilities in the cloud, what is currently in the works, and what is on our longer term roadmap. Most importantly we talk about how you can participate and be part of this wave with us!

        Looking to Migrate your Modern Data Workloads to the Cloud?

Many enterprises today are looking to migrate their modern data workloads from on-premises to the cloud. The goals vary: improving ease of management, increasing elasticity to assure SLAs for bursty workloads, and reducing costs.

Unravel truly understands your modern data cluster and the details of the workloads running on it (and what can be done to improve reliability, efficiency, and time to root cause issues). This information and analysis are key for initiatives like migrating your modern data workloads to the cloud as well. So, Unravel has done precisely this – providing features to help you plan and migrate your modern data stack applications to the cloud based on your specific goal(s) (e.g. cost reduction, increased SLA, agility, etc.). We have built a goal-driven and adaptive solution for migrating modern data stack applications to the cloud.

        Pre-Migration: Planning

        Cluster Discovery

Migrating modern data stack workloads to the cloud requires extensive planning, which in turn begins with developing a good understanding of the current (on-premises) environment: specifically, details around cluster resources and their usage.

        Cluster Details

        What’s my modern data stack cluster like? What are the services deployed? How many nodes does it have? What are the resources allocated and what is their consumption like across the entire cluster?

        Usage Details

What applications are running on it? What is the distribution of application runs across users and departments? How many resources are these applications utilizing? Which of these applications are bursty in terms of resource utilization? Are some of them not running reliably due to lack of resources?

        Let’s see how Unravel can help you discover and piece together such information:

        Unravel’s cluster discovery reporting

         

        As you can see, Unravel provides a single pane of glass to display all of the relevant information e.g., services deployed, the topology of the cluster, cluster level stats (which have been suitably aggregated over the entire cluster’s resources) in terms of CPU, memory and disk.

        The across-cluster heatmaps display the relative utilization of the cluster across time (a 24×7 view for each hour of say a week). If the utilization peaks at specific days and times of the week, you can plan on designing the cloud cluster such that it only scales for those precise times (keeping the costs low for when resources are not needed).

        Unravel’s Cluster Discovery Report is purpose built for migrating data workloads to the cloud. All the relevant information is readily made available and all the number crunching is done to provide the most relevant details in a single pane of glass.

        Identifying Applications Suitable for the Cloud

After developing an understanding of the cluster, the next step in the cloud migration journey is to figure out which modern data stack workloads are most suitable to move to the cloud and would result in maximum benefits (e.g. in terms of increased SLA or reduced costs). So, it would be useful to discover applications of the following kinds, for example:

        Bursty in nature

Applications that take varying amounts of time and/or resources to complete can be good candidates to move to the cloud. These are typically applications that fail frequently or vary widely due to lack of resources, contention, and bottlenecks, and could be better suited to the cloud so that they run more reliably and SLAs can be met. Unravel can help you easily identify applications that are bursty in nature.

        Discover apps that are bursty in nature using Unravel

         

        Unravel enables you to easily identify applications that could run more reliably in the cloud

         

        Applications by tenants specific to the Business

Often, corporations strategically decide to carry out the migration of modern data stack workloads to the cloud in a phased manner. They may decide to migrate applications belonging to specific users and/or queues first, followed by others in a later phase (based on, say, the number of apps, resources used, etc.). So, it becomes important to have a clear view of the distribution of these applications across various categories.

        Unravel provides a clear graphical view into such information. Also, admins can explicitly tag certain application types in Unravel to achieve custom categorization such as grouping applications by department or project.

        Unravel: Distribution of apps in different categories

         

        Mapping your On-Premise Cluster to a Deployment Topology in the Cloud

Unravel provides a mapping of your current on-premises environment to a cloud-based one. Cloud Mapping Reports tell you the precise details of what types of VMs you would need, how many, and what they would cost.

        You can choose different strategies for this mapping based on your goal(s). For each case, the resulting report will provide you the details of the cloud topology to match the goal.

In addition, each of these strategies is aware of multiple cloud providers (AWS EMR, Azure HDInsight, GCP, etc.) and the specs and cost of each VM, and optimizes the mapping of your actual workloads to the most effective VMs in the cloud.

        Lift and Shift Strategy

This report provides a one-to-one mapping of each existing on-premises host to the most suitable instance type in the cloud, in a way that meets or exceeds the host’s hardware specs. This strategy ensures that your cloud deployment will have the same (or more) resources available as your current on-premises environment and minimizes any risks associated with migrating to the cloud.

        Unravel provides you a mapping of your current on-premise environment to a cloud based one. Strategy: Lift and Shift

         

        Cost Reduction Strategy

This report provides a one-to-one mapping of each existing on-premises host to the most suitable Azure VM (or EC2/EMR/GCP instance) such that it matches the actual resource usage. This determines a cost-optimized closest fit per host by matching the VM’s specs to the actual usage of the host. If your on-premises hosts are underutilized, this method will always be less expensive than lift and shift.

        Unravel provides you a mapping of your current on-premise environment to a cloud based one. Strategy: Cost Reduction

         

        Workload Fit Strategy

Unlike the other methods, this is not a one-to-one mapping. Unravel analyzes your workload for a time period and bases its recommendations on that workload, providing multiple recommendations based on the resources needed to meet X% of your workloads. It determines the optimal assignment of VM types to meet the requirements while minimizing cost. This is typically the most cost-effective method.

Unravel provides you a mapping of your current on-premises environment to a cloud-based one. Strategy: “Workload Fit” (meeting requirements for 85% of the workloads)

         

        The workload fit strategy also enables enterprises to pick the right price-to-performance trade-offs in the cloud. For example, in case less tight SLAs are acceptable, costs can be further reduced by choosing a less expensive VM type.

The above is a sampling of the mapping results showing the recommended cloud cluster topology for a given on-premises one. The mapping is done in accordance with your inputs about your specific organizational needs, e.g. the provider you have decided to migrate to, whether compute and storage should be separated, the use of a specific set of instance types, any pricing discounts you get from the cloud providers, and so on. You can even compare the costs for your specific case across different cloud providers and make a decision on the one that best suits your goals.

        Unravel also provides a matching of services you have in your on-premise cluster with the ones made available by the specific cloud provider chosen (and whether it is a strong or weak match). The goal of this mapping is to give you a good sense of what applications may need some rewriting and which ones may be easily portable. For example, if you have been using Impala on your on-premise cluster, which is not available on EMR, we will suggest the closest service to use on migrating to EMR.

There are also several other Unravel features and functionalities to help determine how you should create the modern data stack cluster in the cloud. For example, check out the cluster workload heatmap:

Heatmap of workload profile: Sunday has several hot hours, Saturday is the hottest day, and Monday is hardly used, so the cluster could be scaled down drastically on Mondays

         

As you can see above, Unravel analyzes and showcases seasonality to suggest how you can best make use of the cloud’s auto-scaling capabilities. As you design your cloud cluster topology, you can use this information to set up appropriate processes for scaling clusters up and down to ensure desired SLAs while reducing costs at the same time.

        Decide on the best strategy for the storage in the cloud by looking at granular and specific information about the usage patterns of your tables and partitions in your on-premise cluster. Unravel can tell you which tables in your storage are least and most used and those that are used moderately.

        Unravel can identify unused/cold data. You can decide on the appropriate layout for your data in the cloud accordingly and make the best use of your money

         

The determination of these labels is based on your specific configuration policy, so it can be tailored to the business goals of your organization. Based on this information, you may decide to distribute these tables suitably across various types of storage in the cloud: the most used ones on disk or highly performant object storage (e.g. AWS S3 Standard storage, Azure Data Lake Storage Gen 2 Hot tier), and the least used ones on less performant object storage (e.g. AWS S3 Glacier storage, Azure Data Lake Storage Gen 2 Archive tier). In the example above, approximately half the data is never used and could safely be squared away to cheaper archival storage when moving to the cloud.

Unravel also provides analysis on which partitions could be reclaimed so that some storage space can be saved. Based on this information you can decide on a trade-off: store some partitions in archival storage in the cloud, or get rid of them altogether and reduce costs.

         

        During and Post Migration: Tracking the Migration and its success

Unravel can help you track the success of the migration as you move your modern data stack applications from the on-premises cluster to the cloud cluster. You can use Unravel to compare how a given application was performing on-premises and how it is doing in its new home. If the performance is not up to par, Unravel’s insights and recommendations can help bring your migration back on track.

Compare how an app is doing in the new environment. This app is ~17 times slower in the cloud; Unravel provides automatic fixes to get the app back to meeting its SLA.

         

        End of Part I

        Planning and migrating is, of course, only one step in the journey to realize the full value of big data in the cloud. In Parts II and III of the series, I will cover performance optimization, cloud cost reduction strategies; troubleshooting, debugging, and root cause analysis; and related topics.

        In the meantime, check out the Unravel Cloud Data Operations Guide, which covers some of the topics from this blog series.

        The post Planning and Migration For Modern Data Applications in the Cloud appeared first on Unravel.

Why Memory Management is Causing Your Spark Apps To Be Slow or Fail https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/ https://www.unraveldata.com/common-reasons-spark-applications-slow-fail-part-1/#respond Wed, 20 Mar 2019 00:25:31 +0000 https://www.unraveldata.com/?p=2428

        Common Reasons Your Spark Apps Are Slow or Failing

        Spark applications are easy to write and easy to understand when everything goes according to plan. However, it becomes very difficult when Spark applications start to slow down or fail. (See our blog Spark Troubleshooting, Part 1 – Ten Biggest Challenges.) Sometimes a well-tuned application might fail due to a data change, or a data layout change. Sometimes an application that was running well so far, starts behaving badly due to resource starvation. The list goes on and on.

        It’s not only important to understand a Spark application, but also its underlying runtime components like disk usage, network usage, contention, etc., so that we can make an informed decision when things go bad.

        In this series of articles, I aim to capture some of the most common reasons why a Spark application fails or slows down. The first and most common is memory management.

        If we were to get all Spark developers to vote, out-of-memory (OOM) conditions would surely be the number one problem everyone has faced.  This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:

        • Incorrect usage of Spark
        • High concurrency
        • Inefficient queries
        • Incorrect configuration

        To avoid these problems, we need to have a basic understanding of Spark and our data. There are certain things that can be done that will either prevent OOM or rectify an application which failed due to OOM.  Spark’s default configuration may or may not be sufficient or accurate for your applications. Sometimes even a well-tuned application may fail due to OOM as the underlying data has changed.

        Out of memory issues can be observed for the driver node, executor nodes, and sometimes even for the node manager. Let’s take a look at each case.


        Out of memory at the driver level

        A driver in Spark is the JVM where the application’s main control flow runs. More often than not, the driver fails with an OutOfMemory error due to incorrect usage of Spark. Spark is an engine to distribute workload among worker machines. The driver should only be considered as an orchestrator. In typical deployments, a driver is provisioned less memory than executors. Hence we should be careful what we are doing on the driver.

        Common causes which result in driver OOM are:

        1. rdd.collect()
        2. sparkContext.broadcast
        3. Low driver memory configured as per the application requirements
        4. Misconfiguration of spark.sql.autoBroadcastJoinThreshold. Spark uses this limit to broadcast a relation to all the nodes in case of a join operation. At the very first usage, the whole relation is materialized at the driver node. Sometimes multiple tables are also broadcasted as part of the query execution.

Try to write your application in such a way that you can avoid all explicit result collection at the driver. You can very well delegate this task to one of the executors. E.g., if you want to save the results to a particular file, you can either collect them at the driver or assign an executor to write them for you.
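For example (an illustrative sketch; the table name and output path are assumptions, and an existing SparkSession named spark is assumed):

val results = spark.table("big_results")  // hypothetical large result set

// Risky: the entire result set must fit in driver memory.
// results.collect().foreach(println)

// Safer: each executor writes its own partitions; the driver only coordinates.
results.write.mode("overwrite").parquet("/data/output/big_results")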


If you are using Spark SQL and the driver is OOM due to broadcasting relations, then either increase the driver memory if possible, or reduce the “spark.sql.autoBroadcastJoinThreshold” value so that your join operations use the more memory-friendly sort-merge join.
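For example (illustrative only, assuming an existing SparkSession named spark):

// Lower the broadcast threshold to ~10 MB, or set it to -1 to disable automatic
// broadcasting entirely and fall back to a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)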

        Out of memory at the executor level

        This is a very common issue with Spark applications which may be due to various reasons. Some of the most common reasons are high concurrency, inefficient queries, and incorrect configuration. Let’s look at each in turn.

        High concurrency

        Before understanding why high concurrency might be a cause of OOM, let’s try to understand how Spark executes a query or job and what are the components that contribute to memory consumption.

        Spark jobs or queries are broken down into multiple stages, and each stage is further divided into tasks. The number of tasks depends on various factors like which stage is getting executed, which data source is getting read, etc. If it’s a map stage (Scan phase in SQL), typically the underlying data source partitions are honored.

For example, if a Hive ORC table has 2,000 partitions, then 2,000 tasks get created for the map stage to read the table, assuming partition pruning did not come into play. If it’s a reduce stage (shuffle stage), then Spark will use either the “spark.default.parallelism” setting for RDDs or “spark.sql.shuffle.partitions” for DataSets to determine the number of tasks. How many tasks are executed in parallel on each executor depends on the “spark.executor.cores” property. If this is set to a high value without due consideration of the available memory, executors may fail with OOM. Now let’s see what happens under the hood while a task is executing, and some probable causes of OOM.

Let’s say we are executing a map task or the scanning phase of SQL from an HDFS file or a Parquet/ORC table. For HDFS files, each Spark task will read a 128 MB block of data. So if 10 parallel tasks are running, then the memory requirement is at least 128 MB × 10 = 1.28 GB just for storing the partitioned data. This again ignores any data compression, which might cause data to blow up significantly depending on the compression algorithm.

        Spark reads Parquet in a vectorized format. To put it simply, each task of Spark reads data from the Parquet file batch by batch. As Parquet is columnar, these batches are constructed for each of the columns.  It accumulates a certain amount of column data in memory before executing any operation on that column. This means Spark needs some data structures and bookkeeping to store that much data. Also, encoding techniques like dictionary encoding have some state saved in memory. All of them require memory.

         

Figure: Spark task and memory components while scanning a table

        So with more concurrency, the overhead increases. Also, if there is a broadcast join involved, then the broadcast variables will also take some memory. The above diagram shows a simple case where each executor is executing two tasks in parallel.

        Inefficient queries

While Spark’s Catalyst engine tries to optimize a query as much as possible, it can’t help if the query itself is badly written, e.g. selecting all the columns of a Parquet/ORC table. As seen in the previous section, each column needs some in-memory column batch state. The more columns are selected, the higher the overhead.

        Try to read as few columns as possible. Try to use filters wherever possible, so that less data is fetched to executors. Some of the data sources support partition pruning. If your query can be converted to use partition column(s), then it will reduce data movement to a large extent.
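For example, assuming an existing SparkSession named spark and a hypothetical events table partitioned by event_date:

// Reads every column of every partition into executor memory.
val everything = spark.sql("SELECT * FROM events")

// Reads only two columns from a single partition; partition pruning skips the rest.
val pruned = spark.sql(
  "SELECT user_id, amount FROM events WHERE event_date = '2019-03-01'"
)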

        Incorrect configuration

        Incorrect configuration of memory and caching can also cause failures and slowdowns in Spark applications. Let’s look at some examples.

        Executor & Driver memory

        Each application’s memory requirement is different. Depending on the requirement, each app has to be configured differently. You should ensure correct spark.executor.memory or spark.driver.memory values depending on the workload.  As obvious as it may seem, this is one of the hardest things to get right. We need the help of tools to monitor the actual memory usage of the application. Unravel does this pretty well.

        Memory Overhead

Sometimes it’s not executor memory but rather YARN container memory overhead that causes OOM, or causes the container to get killed by YARN. A typical “YARN kill” message reports that the container was killed for exceeding memory limits.

YARN runs each Spark component, like executors and drivers, inside containers. Overhead memory is the off-heap memory used for JVM overheads, interned strings, and other JVM metadata. In this case, you need to configure spark.yarn.executor.memoryOverhead to a proper value. Typically, about 10% of total executor memory should be allocated for overhead.
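For example, a 10 GB executor might reserve about 1 GB for overhead; the values below are illustrative assumptions, and this particular property takes its value in MB:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-overhead-sketch")
  .config("spark.executor.memory", "10g")
  .config("spark.yarn.executor.memoryOverhead", "1024")  // ~10% of executor memory, in MB
  .getOrCreate()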

        Caching Memory

        If your application uses Spark caching to store some datasets, then it’s worthwhile to consider Spark’s memory manager settings. Spark’s memory manager is written in a very generic fashion to cater to all workloads. Hence, there are several knobs to set it correctly for a particular workload.

        Spark has defined memory requirements as two types: execution and storage. Storage memory is used for caching purposes and execution memory is acquired for temporary structures like hash tables for aggregation, joins etc.

Both execution and storage memory are obtained from a configurable fraction of (total heap memory – 300 MB). That setting is “spark.memory.fraction”, and its default is 60%. Of that, by default, 50% is assigned to storage (configurable by “spark.memory.storageFraction”) and the rest is assigned to execution.

There are situations where each of the above pools of memory, namely execution and storage, may borrow from the other if that pool is free. Also, storage memory can be evicted, up to a limit, if it has borrowed memory from execution. However, without going into those complexities, we can configure our program so that the cached data that fits in storage memory does not cause a problem for execution.

        If we don’t want all our cached data to sit in memory, then we can configure “spark.memory.storageFraction” to a lower value so that extra data would get evicted and execution would not face memory pressure.
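For example (illustrative values only, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-fraction-sketch")
  .config("spark.memory.fraction", "0.6")          // share of (heap - 300 MB) for the unified pool
  .config("spark.memory.storageFraction", "0.3")   // lower the protected storage share from the 0.5 default
  .getOrCreate()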

        Out of memory at Node Manager

Spark applications that do data shuffling as part of group-by or join-like operations incur significant overhead. Normally the data shuffling process is done by the executor process. If the executor is busy or under heavy GC load, then it can’t cater to the shuffle requests. This problem is alleviated to some extent by using an external shuffle service.

The external shuffle service runs on each worker node and handles shuffle requests from executors. Executors can read shuffle files from this service rather than reading from each other. This helps requesting executors read shuffle files even if the producing executors are killed or slow. Also, when dynamic allocation is enabled, it’s mandatory to enable the external shuffle service.

When the Spark external shuffle service is configured with YARN, the NodeManager starts an auxiliary service which acts as an external shuffle service provider. By default, NodeManager memory is around 1 GB. However, applications that do heavy data shuffling might fail due to the NodeManager going out of memory. It’s imperative to properly configure your NodeManager if your applications fall into this category.

        End of Part I – Thanks for the Memory

Spark’s in-memory processing is a key part of its power. Therefore, effective memory management is a critical factor in getting the best performance, scalability, and stability from your Spark applications and data pipelines. However, Spark’s default settings are often insufficient. Depending on the application and environment, certain key configuration parameters must be set correctly to meet your performance goals. Having a basic idea of what they are and how they affect the overall application helps.

        I have provided some insights into what to look for when considering Spark memory management. This is an area that the Unravel platform understands and optimizes very well, with little, if any, human intervention needed. See how Unravel can make Spark perform better and more reliably here.

Even better, schedule a demo to see Unravel in action. The performance speedups we are seeing for Spark apps are pretty significant.


        In Part II of this series Why Your Spark Apps are Slow or Failing: Part II Data Skew and Garbage Collection, I will be discussing how data organization, data skew, and garbage collection impact Spark performance.

        The post Why Memory Management is Causing Your Spark Apps To Be Slow or Fail appeared first on Unravel.

Transitioning big data workloads to the cloud: Best practices from Unravel Data https://www.unraveldata.com/transitioning-big-data-workloads-cloud-best-practices/ https://www.unraveldata.com/transitioning-big-data-workloads-cloud-best-practices/#respond Sat, 16 Mar 2019 04:10:04 +0000 https://www.unraveldata.com/?p=2416

Transitioning Big Data Workloads to the Cloud: Best Practices from Unravel Data

        Migrating on-premises Apache Hadoop® and Spark workloads to the cloud remains a key priority for many organizations. In my last post, I shared “Tips and tricks for migrating on-premises Hadoop infrastructure to Azure HDInsight.” In this series, one of HDInsight’s partners, Unravel Data, will share their learnings, best practices, and guidance based on their insights from helping migrate many on-premises Hadoop and Spark deployments to the cloud.

        Unravel Data is an AI-driven Application Performance Management (APM) solution for managing and optimizing modern data stack workloads. Unravel Data provides a unified, full-stack view of apps, resources, data, and users, enabling users to baseline and manage app performance and reliability, control costs and SLAs proactively, and apply automation to minimize support overhead. Ops and Dev teams use Unravel Data’s unified capability for on-premises workloads and to plan, migrate, and operate workloads on Azure. Unravel Data is available on the HDInsight Application Platform.

        Today’s post, which kicks off the five-part series, comes from Shivnath Babu, CTO and Co-Founder at Unravel Data. This blog series will discuss key considerations in planning for migrations. Upcoming posts will outline the best practices for cloud migration, operation, and optimization phases of the cloud adoption lifecycle for big data.

        Unravel Data’s perspective on migration planning

        The cloud is helping to accelerate big data adoption across the enterprise. But while this provides the potential for much greater scalability, flexibility, optimization, and lower costs for big data, there are certain operational and visibility challenges that exist on-premises that don’t disappear once you’ve migrated workloads away from your data center.

        Time and time again, we have experienced situations where migration is oversimplified and considerations such as application dependencies and system version mapping are not given due attention. This results in cost overruns through over-provisioning or production delays through provisioning gaps.

        Businesses today are powered by modern data applications that rely on a multitude of platforms. These organizations desperately need a unified way to understand, plan, optimize, and automate the performance of their modern data apps and infrastructure. They need a solution that will allow them to quickly and intelligently resolve performance issues for any system through full-stack observability and AI-driven automation. Only then can these organizations keep up as the business landscape continues to evolve, and be certain that big data investments are delivering on their promises.

        Current challenges in big data

        Today, IT uses many disparate technologies and siloed approaches to manage the various aspects of their modern data apps and big data infrastructure.

        Many existing monitoring solutions often do not provide end-to-end support for modern data stack environments, lack full-stack compatibility, or require complex instrumentation. This includes configuration changes to applications and their components, which requires deep subject matter expertise. The murky soup of monitoring solutions that organizations currently rely on doesn’t deliver the application agility that is required by the business.

The result is poor user experience, inefficiencies, and mounting costs as organizations buy more and more tools to solve these problems and then spend additional resources managing and maintaining them.

Additionally, organizations see a high Mean Time to Identify (MTTI) and Mean Time to Resolve (MTTR) issues because it is hard to understand the dependencies and stay focused on root cause analysis. The lack of granularity and end-to-end visibility makes it impossible to remedy all of these problems, and businesses are stuck in limbo.

        It’s not an option to continue doing what was done in the past. Teams need a detailed appreciation of what they are doing today, what gaps they still have, and what steps they can take to improve business outcomes. It’s not uncommon to see 10x or more improvements in root cause analysis and remediation times for customers who are able to gain a deep understanding of the current state of their big data strategy and make a plan for where they need to be.

        Starting your big data journey to the cloud

Without a unified APM platform, the challenges only intensify as enterprises move big data to the cloud. Cloud adoption is not a finite process with a clear start and end date; it’s an ongoing lifecycle with four broad phases: planning, migration, operation, and optimization. Below, we briefly discuss some of the key challenges and questions that arise for organizations, which we will dive into in further detail in subsequent posts.

In the planning phase, key questions may include the following (a rough sketch of one way to answer the resource-usage questions appears after the list):

        • “Which apps are best suited for a move to the cloud?”
• “What are the resource requirements?”
        • “How much disk, compute, and memory am I using today?”
        • “What do I need over the next 3, 6, 9, and 12 months?”
        • “Which datasets should I migrate?”
        • “Should I use permanent, transient, autoscaling, or spot instances?”
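
For example, answering “How much disk, compute, and memory am I using today?” usually starts with the cluster’s own metrics endpoints. The minimal sketch below, written against a hypothetical YARN ResourceManager and WebHDFS endpoint, pulls current memory, vCore, and HDFS usage; Unravel collects and correlates this kind of data automatically, so treat the script purely as an illustration (the host names and the /data path are placeholders).

import requests

RM_HOST = "http://resourcemanager.example.com:8088"   # hypothetical ResourceManager address
NN_HOST = "http://namenode.example.com:9870"          # hypothetical NameNode (WebHDFS) address

# Current compute and memory usage from the YARN cluster metrics API.
metrics = requests.get(f"{RM_HOST}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("Memory used: %d MB of %d MB" % (metrics["allocatedMB"], metrics["totalMB"]))
print("vCores used: %d of %d" % (metrics["allocatedVirtualCores"], metrics["totalVirtualCores"]))

# Current HDFS footprint of a (hypothetical) data directory, via WebHDFS GETCONTENTSUMMARY.
summary = requests.get(f"{NN_HOST}/webhdfs/v1/data?op=GETCONTENTSUMMARY").json()["ContentSummary"]
print("HDFS space consumed: %.2f TB" % (summary["spaceConsumed"] / 1e12))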

During migration, which can be a long-running process as workloads are iteratively moved, there is a need for continuous monitoring of performance and costs. Key questions may include the following (a simple before-and-after comparison sketch follows the list):

        • “Is the migration successful?”
        • “How does the performance compare to on-premises?”
        • “Have I correctly assessed all the critical dependencies and service mapping?”
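
To answer “How does the performance compare to on-premises?”, teams often line up per-job runtimes from both environments. The sketch below, assuming two hypothetical CSV exports of job durations, flags jobs that slowed down noticeably after migration; Unravel surfaces this comparison directly, so the snippet only illustrates the idea.

import pandas as pd

# Hypothetical exports: one row per job with its average duration in seconds.
onprem = pd.read_csv("onprem_job_runtimes.csv")   # columns: job_name, duration_sec
cloud = pd.read_csv("cloud_job_runtimes.csv")     # columns: job_name, duration_sec

# Join the two environments on job name and compute the slowdown ratio.
merged = onprem.merge(cloud, on="job_name", suffixes=("_onprem", "_cloud"))
merged["slowdown"] = merged["duration_sec_cloud"] / merged["duration_sec_onprem"]

# Jobs that run more than 25% slower in the cloud deserve a closer look
# (instance sizing, configuration, data locality, and so on).
regressions = merged[merged["slowdown"] > 1.25].sort_values("slowdown", ascending=False)
print(regressions[["job_name", "duration_sec_onprem", "duration_sec_cloud", "slowdown"]])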

        Once workloads are in production on the cloud, key considerations include:

        • “How do I continue to optimize for cost and for performance to guarantee SLAs?”
        • “How do I ensure Ops teams are as efficient and as automated as possible?”
        • “How do I empower application owners to leverage self-service to solve their own issues easily to improve agility?”

The challenges of managing disparate modern data stack technologies both on-premises and in the cloud can be solved with a comprehensive approach to operational planning. In this blog series, we will dive deeper into each stage of the cloud adoption lifecycle and provide practical advice for every part of the journey. Upcoming posts will outline the best practices for the planning, migration, operation, and optimization phases of this lifecycle.

        About HDInsight application platform

        The HDInsight application platform provides a one-click deployment experience for discovering and installing popular applications from the modern data stack ecosystem. The applications cater to a variety of scenarios such as data ingestion, data preparation, data management, cataloging, lineage, data processing, analytical solutions, business intelligence, visualization, security, governance, data replication, and many more. The applications are installed on edge nodes which are created within the same Azure Virtual Network boundary as the other cluster nodes so you can access these applications in a secure manner.


The post ‘Transitioning big data workloads to the cloud: Best practices from Unravel Data’ by Ashish Thapliyal, Principal Program Manager, Microsoft Azure HDInsight, appeared first on Microsoft.


        The post Transitioning big data workloads to the cloud: Best practices from Unravel Data appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/transitioning-big-data-workloads-cloud-best-practices/feed/ 0
        Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/ https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/#respond Tue, 05 Mar 2019 07:32:57 +0000 https://www.unraveldata.com/?p=2284 Big Data Predictions 2019 by Tasso Argyros

Unravel lucked out with the quality and strategic impact of our advisory board. Collectively, they hold a phenomenal track record of entrepreneurship, leadership, and product innovation, and I am pleased to introduce them to the Unravel […]

        The post Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros appeared first on Unravel.

        ]]>
        Big Data Predictions 2019 by Tasso Argyros

Unravel lucked out with the quality and strategic impact of our advisory board. Collectively, they hold a phenomenal track record of entrepreneurship, leadership, and product innovation, and I am pleased to introduce them to the Unravel community. Looking into the year ahead, we asked two of our advisors for their perspective on what 2019 holds as we dive into the next few quarters. Our first guest, Herb Cunitz, was featured in Part 1 of our Predictions series (read it here) and discussed breakout modern data stack technologies, the role of artificial intelligence (AI) and automation in the modern data stack, and the growing modern data stack skills gap. Now, in Part 2, Tasso Argyros, founder and CEO of ActionIQ, will outline his take on the upcoming trends for 2019.

Tasso is a widely recognized and respected innovator, with awards and accolades from the World Economic Forum, BusinessWeek, and Forbes, and has a storied career spanning more than a decade of working with data-focused organizations and advising companies on how to accelerate growth and become market leaders. He is CEO and founder of ActionIQ, a company giving marketers direct access to their customer data, and previously founded Aster Data, a pioneer in the modern data stack that was ultimately acquired by Teradata. He was also a founder of Data Elite, an early-stage big data seed fund that helped incubate Unravel Data.

In 2018, big data matured and hit a point of inflection, with an increasing number of Fortune 1000 enterprises deploying critical modern data applications that depend on the modern data stack into production. What effect will this have on the product innovation pipeline and adoption for 2019 and beyond? This is an excerpt of my recent conversation with Tasso:

        Unravel: Looking back in the ‘rear-view mirror’ to the past year, what were the most exciting developments and innovations in the modern data stack?

        TA: While some say innovation has slowed down in big data, I’m seeing the opposite and believe it has accelerated. When we started Aster Data in 2005, many thought that database innovation was dead. Between Oracle, IBM, and some specialty players like Teradata, investors and pundits believed that all problems had been solved and that there was nothing else to do. Instead, it was the perfect time to start a database company as the seeds of the Big Data revolution were about to be planted.

Since then, the underlying infrastructure has experienced massive, continual changes at roughly three-to-four-year intervals. For example, in the mid-2000s the primary industry trend was moving from expensive proprietary hardware to more cost-effective commodity hardware, and in the early 2010s the industry spotlighted open source data software. For the past few years, the industry has been focused on the introduction of and transition to cloud solutions, the increasing volume of streaming data, and the debut of Internet-of-Things (IoT) technologies.

As we focus on finding better ways to manage data, introduce new technologies and databases, and explore the ecosystem that sits on top of the big data layer, these will be the underlying trends that continue to drive innovation in 2019 and beyond. Whereas initially big data was more about collection, aggregation, and experimentation, in 2018 it became clear that big data is a crucial, mission-critical aspect of the next generation of applications, and there is much more to learn.

        Unravel: What breakout technology will move to the forefront in 2019?

TA: There has been a definite increase in the number and variety of data-centric applications (versus data infrastructure) being created and in use today. As a result, there is rising interest across the industry in learning how to manage data for specific systems and in different environments, including on-premises, hybrid, and across multiple clouds. In 2019, the industry will start empowering these organizations with tools that help non-experts become self-sufficient at managing their data operations processes across their end-to-end pipelines, irrespective of where code is executing.

        Unravel: Which industries or use cases have delivered the most value and have seen the most significant adoption of a mature modern data ecosystem?

TA: Digital-native organizations were the first companies to jump in at scale, which is not a surprise, as they have historically advanced more rapidly and been ahead of those with some form of legacy to consider. Although heavily regulated, financial services institutions saw the value of an effective modern data strategy early on, as did industries that struggled with the cost and complexity of traditional data warehousing approaches when the 2008-2009 recession hit. In fact, few realize that the big recession was one of the key catalysts that accelerated the adoption of new modern data stack technologies.

Big data started with a heavy analytics focus and then, as it matured, turned operational. Now it’s coming to the point where streaming data is driving innovation, and many different industries and verticals are set to benefit from this next step. For example, one compelling modern data use case is delivering improved customer experiences through real-time customer data gathering, inference, and personalization.

Moreover, the convergence of data science and big data has accelerated adoption, as it activates the use of big data for critical business decision-making through optimized machine learning. By offering the ability to filter and prepare data, extract insights from large data sets, and capture complex patterns and develop models, big data becomes a critical value driver for modern data application areas like fraud and risk detection, and for industries like telecom and healthcare.

Unravel: Is 2019 the year when ‘Big Data’ gives way to just ‘Data’, as the lines and technologies between the two become increasingly hard to separate? A data pipeline is a data pipeline, after all.

TA: In the early days, there was confusion between big data and data warehousing. Data warehousing was the buzzword during the two decades prior, whereas big data became the hot trend more recently. A data warehouse is a central repository of integrated data; it is rigid and organized. A technology category such as big data, by contrast, is a means to store and manage large amounts of data from many different sources, at a lower cost, in order to make better decisions and more accurate predictions. In short, modern data stack technologies have been more efficient at processing today’s high-volume, highly variable data pipelines, which continue to grow at ever-increasing rates.

With that in mind, however, nothing stands still for very long, especially with technology innovation. The delineation between categories, as in any maturing market, continues to evolve, and high degrees of fragmentation, often led by open source committers, are juxtaposed with the evolution of established technologies. SQL is a good example: the landscape of SQL, NoSQL, NewSQL, and serverless solutions like AWS Athena starts to blur the lines between what is ‘big’ and what is just ‘data’. One thing is for sure: we have come a long way in a short space of time, and ‘Big Data’ is much more than on-premises Hadoop.

        Unravel: What role will AI and automation, and capabilities like AIOps, play in the modern data stack in the coming year?

TA: Technologies like Spark, Hive, and Kafka are very complex under the hood, and when an application fails, it requires a specialist with a unique skill set to comb through massive amounts of data and determine the cause and solution. Data operations frameworks need to mature to permit separation of roles rather than relying on a single data engineer to solve all of the problems. Self-service for application owners will relieve part of this bottleneck, but fully operationalizing a growing number of production data pipelines will require a different approach that relies heavily on machine learning and artificial intelligence.

In 2019, as the industry continues to strive for higher efficiency, automation will rise as a solution to the modern data stack skills problem. For example, AI for Operations (AIOps), which combines big data, artificial intelligence, and machine learning, can augment or replace many IT operations processes: accelerating the time it takes to identify performance issues, proactively tuning resources to reduce cost, and automating configuration changes to prevent app failures before they occur.
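
To make “identify performance issues faster” concrete, here is a minimal sketch of the kind of statistical check an AIOps tool might run continuously: it flags any run of a job whose duration strays more than three standard deviations from that job’s recent history. The job names and numbers are invented for illustration; production AIOps systems rely on far richer models and signals than this.

import statistics

# Hypothetical history of recent runtimes (in minutes) per job.
history = {
    "daily_etl":   [42, 45, 41, 44, 43, 46, 42],
    "churn_model": [18, 17, 19, 18, 20, 18, 17],
}

# Latest observed runtime for each job.
latest = {"daily_etl": 44, "churn_model": 61}

for job, runs in history.items():
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)
    # Flag runs more than three standard deviations above the historical mean.
    if latest[job] > mean + 3 * stdev:
        print(f"ALERT: {job} ran {latest[job]} min (baseline {mean:.1f} +/- {stdev:.1f})")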

        Unravel: What major vendor shake-ups do you predict in 2019?

TA: The industry now understands that there is more to a big data ecosystem than just Hadoop. Hadoop was the leading open source framework for many years, but the rising popularity of Spark and Kafka has proven that the stack will continue to evolve rapidly in ways we have not yet thought of. Complexity will be with us for a very long time, and along with it some incredible new innovative companies, a newly emerging incumbent (Cloudera/Hortonworks), and the cloud giants will jockey for customer mindshare.

        The post Modern Data Stack Predictions With Unravel Data Advisory Board Member, Tasso Argyros appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/big-data-2019-predictions-tasso-argyros/feed/ 0
        CIDR 2019 https://www.unraveldata.com/resources/ai-for-operations-aiops/ https://www.unraveldata.com/resources/ai-for-operations-aiops/#respond Wed, 13 Feb 2019 13:15:15 +0000 https://www.unraveldata.com/?p=2128

As enterprises deploy complex data pipelines into full production, AI for Operations (AIOps) is key to ensuring reliability and performance. I recently traveled down to Asilomar, Calif., to speak at the Conference on Innovative Data Systems Research (CIDR), […]

        The post CIDR 2019 appeared first on Unravel.

        ]]>

As enterprises deploy complex data pipelines into full production, AI for Operations (AIOps) is key to ensuring reliability and performance.

I recently traveled down to Asilomar, Calif., to speak at the Conference on Innovative Data Systems Research (CIDR), a gathering where researchers and practicing IT professionals discuss the latest innovative and visionary ideas in data. There was a diverse range of speakers and plenty of fascinating talks this year, leaving me with valuable insights and new points of view to consider. However, despite all of the innovative and cutting-edge material, the main theme of the event was an affirmation of what some of us already know: the challenges of managing distributed data systems (the problems we’ve been addressing at Unravel for years) are very real and are being experienced in academia and the enterprise, in small and large businesses, and in the government and private sectors. Moreover, organizations feel they have no choice but to react to these problems as they occur, rather than preparing in advance and avoiding them altogether.

        My presentation looked at the common headaches in running modern applications that are built on many distributed systems. Apps fail, SLAs are missed, cloud costs spiral out of control, etc. There’s a wide variety of causes for these issues, such as bad joins, inaccurate configuration settings, bugs, misconfigured containers, machine degradation, and many more. It’s tough (and sometimes impossible) to identify and diagnose these problems manually, because monitoring data is highly siloed and often not available at all.

        As a result, enterprises take a slow, reactive approach to addressing these issues. According to a survey from AppDynamics, most enterprise IT teams first discover that an app has failed when users call or email their help desk or a non-IT member alerts the IT department. Some don’t even realize there’s an issue until users post on social media about it! This of course results in a high mean time to resolve problems (an average of seven hours in AppDynamics’ survey). Clearly, this is not a sustainable way to manage apps.

Unravel’s approach starts by collecting all the disparate monitoring data in a unified platform, eliminating silo issues. The platform then applies algorithms to analyze the data and, whenever possible, takes action automatically, providing automatic fixes for any of the problems listed above. The use of AIOps and automation is what really differentiates this approach and provides so much value. Take root cause analysis, for example. Manually determining the root cause of an app failure is time-consuming and often requires domain expertise; it’s a process that can last days. Using AI and our inference engine, Unravel can complete root cause analysis in seconds.

How does this work? We draw on a large set of root cause patterns learned from customers and partners, and this data is constantly updated. We then continuously feed this root-cause data into training and testing models for root-cause prediction and proactive remediation.
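
Unravel’s actual models and training data are proprietary, but the general pattern of learning root causes from labeled failure signatures can be shown with a minimal sketch. The example below, assuming a small hypothetical set of error-log snippets labeled with their known root cause, trains a simple text classifier with scikit-learn and predicts the likely cause of a new failure.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled failure signatures: error-log snippet -> known root cause.
logs = [
    "java.lang.OutOfMemoryError: GC overhead limit exceeded in executor",
    "Container killed by YARN for exceeding memory limits",
    "FileNotFoundException: input path does not exist on HDFS",
    "Permission denied: user does not have access to table",
    "Executor lost: shuffle fetch failed, too many open files",
]
causes = ["executor_memory", "container_memory", "missing_input", "authorization", "shuffle_failure"]

# TF-IDF features over the log text feeding a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(logs, causes)

new_failure = "ExecutorLostFailure: container killed after exceeding physical memory limits"
print(model.predict([new_failure])[0])   # the most similar learned root cause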

During the Q&A portion of the session, an engineering lead from Amazon asked a great question about what Unravel is doing to keep its AIOps techniques up to date as modern data stack systems evolve rapidly. Simply put, the answer is that the platform doesn’t stop learning. We consistently perform careful probes to identify places where we can enhance the training data for learning, then collect more of that data to do so.

        There were a couple of other conference talks that also nicely highlighted the value of AIOps:

        • SageDB: A Learned Database System: Traditional data processing systems have been designed to be general purpose. SageDB presents a new type of data processing system, one which highly specializes to an application through code synthesis and machine learning. (Joint work from Massachusetts Institute of Technology, Google, and Brown University)
        • Learned Cardinalities: Estimating Correlated Joins with Deep Learning: This talk addressed a critical challenge of cardinality estimation, namely correlated data. The presenters have developed a new deep learning method for cardinality estimation. (Joint work from the Technical University of Munich, University of Amsterdam, and Centrum Wiskunde & Informatica)

Organizations are deploying highly distributed data pipelines into full production now. These aren’t science projects; they’re for real, and the bar has been raised even higher for accuracy and performance. And these organizations aren’t just growing data lakes like they were five years ago; they’re now trying to get tangible value from that data by using it to develop a range of next-generation applications. Organizations are facing serious hurdles daily as they take that next step with data, and AIOps is emerging as the clear answer to help them with it.

Big data is no longer just the future; it’s the present, too.

        The post CIDR 2019 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/ai-for-operations-aiops/feed/ 0
        Unravel Named to CRN’s List of 100 Coolest Cloud Computing Companies https://www.unraveldata.com/crn-100-coolest-cloud-computing-companies-2019/ https://www.unraveldata.com/crn-100-coolest-cloud-computing-companies-2019/#respond Wed, 13 Feb 2019 13:13:28 +0000 https://www.unraveldata.com/?p=2125 Unravel named to CRN's list of 100 coolest cloud companies of 2019

        At Unravel we have always thought of the modern data stack and cloud computing as cool, and it always excites us when the industry reaches new milestones or when companies demonstrate game changing innovation with their […]

        The post Unravel Named to CRN’s List of 100 Coolest Cloud Computing Companies appeared first on Unravel.

        ]]>
        Unravel named to CRN's list of 100 coolest cloud companies of 2019

At Unravel we have always thought of the modern data stack and cloud computing as cool, and it always excites us when the industry reaches new milestones or when companies demonstrate game-changing innovation with their cloud projects. CRN shares the same sentiment, and it just released its 100 Coolest Cloud Computing Companies of 2019 list. The list includes some of the most advanced cloud technology providers across categories such as infrastructure, platforms and development, security, storage, and software, and we’re thrilled to share that we have been included in this list, too, in the software category.

In a very short space of time, Unravel has rapidly innovated and expanded our support for all the major data platforms, including Spark, Hadoop, Kafka, Impala, NoSQL, and SQL across cloud and hybrid environments. Unravel is deployed today in IaaS and PaaS models, alongside Azure HDInsight, Amazon EMR, Redshift, and Athena workloads, and is also available on the Azure Marketplace and AWS Marketplace to support modern data apps such as IoT, machine learning, and analytics pipelines.

        The 100 Coolest Cloud Computing Companies are selected by the CRN editorial team for their creativity and innovation in product development, the quality of their services and partner programs, and their demonstrated ability to help customers benefit from the ease of use, flexibility, scalability and budgetary savings that the cloud offers.

Cloud services are a vital part of the business and IT ecosystem, with many enterprises clamoring for solution providers who will help them manage the scale and complexity of their data pipelines in the cloud. Cloud services are accelerating time to value for many data projects and providing enterprises of all sizes and industries with the potential for much greater scalability, flexibility, optimization, cost efficiency, and access to specialist skills for their critical data-driven applications and the data pipelines that support them. However, there are many operational issues that get in the way of a successful cloud project, such as application performance, resource optimization, workload migration, and rapid troubleshooting. Without a unified and full-stack approach that can provide AI-driven insights, recommendations, and automation, the challenges only intensify as enterprises move more data workloads to the cloud.

Frequently we find that many application or operations teams are not well prepared for migrating data workloads to a cloud platform. It is all too commonplace to oversimplify or overlook considerations such as application dependencies, application- or query-level cost assurance, and end-to-end data pipeline visibility. And because migration isn’t a one-time event, there is a need for continuous observability of performance, costs, and infrastructure behavior.

We’re pleased and humbled that CRN has labeled us a “cool” cloud company and recognized us for our innovation, tenacity, and customer-centricity when it comes to solving these data operations challenges.

        You can check out the good company that Unravel is in with the full list of new 100 Coolest Cloud Computing Companies in the February 2019 issue of CRN and online at www.crn.com/cloud100.

        The post Unravel Named to CRN’s List of 100 Coolest Cloud Computing Companies appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/crn-100-coolest-cloud-computing-companies-2019/feed/ 0
        Monitor and optimize data pipelines and workflows end-to-end https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/ https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/#respond Fri, 25 Jan 2019 11:00:57 +0000 https://www.unraveldata.com/?p=1682

        How do you monitor, optimize, and really operationalize the data pipeline architecture and workflows in your modern data stack? There are different types of problems that occur every day when you’re running these workflows at scale. […]

        The post Monitor and optimize data pipelines and workflows end-to-end appeared first on Unravel.

        ]]>

        How do you monitor, optimize, and really operationalize the data pipeline architecture and workflows in your modern data stack? There are different types of problems that occur every day when you’re running these workflows at scale. How do you go about resolving those common issues?

        Understanding Data Pipeline Architecture

First, let’s quickly understand what we mean by a data pipeline architecture or a workflow. A data pipeline and a workflow are interchangeable terms: both comprise several stages that are stitched together to drive a recommendation engine, a report, a dashboard, and so on. These data pipelines or workflows may then be scheduled on top of Oozie, Airflow, BMC, or similar tools, and may run repeatedly, perhaps once a day or several times a day.

For example, if I’m trying to create a recommendation engine, my raw data may live in separate files, tables, and systems. I may bring it together using Kafka, load it into HDFS or S3, and then run cleaning and transformation jobs on top of it: merging the data sets, building a single table, or simply making sure the data is consistent across the different tables. Then I may run a couple of Spark applications on top to build my model or to analyze the data and produce reports and views. This end-to-end process is an example of a data pipeline or workflow; a sketch of such a pipeline expressed as a scheduled workflow appears below.

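As an illustration only, here is a minimal sketch of how the stages above might be stitched together and scheduled as an Airflow DAG (Airflow 1.x-style imports; the scripts and commands are hypothetical placeholders, not Unravel code):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="recommendation_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",   # run once every day
)

ingest = BashOperator(
    task_id="ingest_from_kafka",
    bash_command="python ingest_kafka_to_hdfs.py",    # hypothetical ingestion script
    dag=dag,
)

clean = BashOperator(
    task_id="clean_and_merge",
    bash_command="spark-submit clean_and_merge.py",   # cleaning/transformation Spark job
    dag=dag,
)

train = BashOperator(
    task_id="train_recommendation_model",
    bash_command="spark-submit train_model.py",       # model-building Spark job
    dag=dag,
)

# Stage ordering: ingest -> clean/merge -> train.
ingest >> clean >> train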

A lot of industries are already running data pipelines and workflows in production. In the telecom industry, we see many companies using these pipelines for churn prevention. In the automotive industry, many companies use data pipeline architecture for predictive maintenance with IoT applications: data streams in through Kafka in real time, and Spark Streaming applications predict faults or problems across the entire manufacturing line. A financial services company may use this for credit card fraud detection. Additional examples of data pipeline architecture include e-commerce companies’ recommendation engines and healthcare companies improving clinical decision making.

        Data Pipeline Architecture Common Problems

Data pipelines are mission-critical applications, so getting them to run well and reliably is very important. However, that’s not usually the case. When we try to create data pipelines and workflows on modern data stack systems, four common problems come up again and again.

        1. You may have runtime problems, meaning an application just won’t work.
2. You may have configuration problems, where you’re trying to understand how to set the various configuration values so the application runs as efficiently as possible (see the sketch after this list).
3. You may have problems with scale: the application ran fine with 10 GB of data, but will it still finish in the same amount of time with 100 GB or a petabyte?
4. You may have multi-tenancy issues, where hundreds of users and tens of thousands of applications, jobs, and workflows all run together in the same cluster. As a result, other applications may affect the performance of your application.
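
As a purely illustrative example of the configuration problem, here is a minimal PySpark sketch showing the kind of settings (executor memory and cores, shuffle parallelism, dynamic allocation) whose values have to be tuned for an application to run efficiently; the numbers and paths are placeholders, not recommendations:

from pyspark.sql import SparkSession

# Each of these values is a knob that affects cost and performance; the right
# settings depend on data volume, cluster size, and the shape of the workload.
spark = (
    SparkSession.builder
    .appName("example-etl")
    .config("spark.executor.memory", "4g")            # memory per executor
    .config("spark.executor.cores", "2")              # cores per executor
    .config("spark.sql.shuffle.partitions", "200")    # parallelism of shuffle stages
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/events/")    # hypothetical input path
df.groupBy("user_id").count().write.parquet("s3a://example-bucket/output/")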

In this blog series, we will look into each of these types of problems in more depth and understand how they can be solved.

        COMING SOON

        The post Monitor and optimize data pipelines and workflows end-to-end appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/overview-monitor-and-optimize-data-pipelines-and-workflows-end-to-end/feed/ 0
        Learn how Unravel Complements Apache Ambari for Hortonworks https://www.unraveldata.com/resources/learn-how-unravel-complements-apache-ambari-for-hortonworks-modern-data-applications/ https://www.unraveldata.com/resources/learn-how-unravel-complements-apache-ambari-for-hortonworks-modern-data-applications/#respond Fri, 25 Jan 2019 10:50:02 +0000 https://www.unraveldata.com/?p=1444

        The post Learn how Unravel Complements Apache Ambari for Hortonworks appeared first on Unravel.

        ]]>

        The post Learn how Unravel Complements Apache Ambari for Hortonworks appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/learn-how-unravel-complements-apache-ambari-for-hortonworks-modern-data-applications/feed/ 0
        How Unravel Partners with Hortonworks on HDP https://www.unraveldata.com/resources/how-unravel-partners-with-hortonworks-on-hdp/ https://www.unraveldata.com/resources/how-unravel-partners-with-hortonworks-on-hdp/#respond Fri, 25 Jan 2019 10:43:29 +0000 https://www.unraveldata.com/?p=1667

        The post How Unravel Partners with Hortonworks on HDP appeared first on Unravel.

        ]]>

        The post How Unravel Partners with Hortonworks on HDP appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/how-unravel-partners-with-hortonworks-on-hdp/feed/ 0
        AppDynamics co-founder thinks this Big Data startup may repeat his success https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/ https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/#respond Fri, 25 Jan 2019 10:41:18 +0000 https://www.unraveldata.com/?p=1664

        The chairman of the app management tech company Cisco bought for $3.7 billion earlier this year thinks a Menlo Park startup can have the same kind of success helping customers manage their Big Data projects.

        The post AppDynamics co-founder thinks this Big Data startup may repeat his success appeared first on Unravel.

        ]]>

        The chairman of the app management tech company Cisco bought for $3.7 billion earlier this year thinks a Menlo Park startup can have the same kind of success helping customers manage their Big Data projects.

        The post AppDynamics co-founder thinks this Big Data startup may repeat his success appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/appdynamics-co-founder-thinks-this-big-data-startup-may-repeat-his-success/feed/ 0
        Big Data Applications: Managing Complexity with Success https://www.unraveldata.com/resources/big-data-applications-managing-complexity-with-success/ https://www.unraveldata.com/resources/big-data-applications-managing-complexity-with-success/#respond Fri, 25 Jan 2019 10:33:10 +0000 https://www.unraveldata.com/?p=1651

        In essence, Unravel Data makes processing Big Data easier. The program was designed to resolve the complicated and disconcerting problems that emerge when processing Big Data. These applications can become confusing and difficult to operate. Never-before-seen […]

        The post Big Data Applications: Managing Complexity with Success appeared first on Unravel.

        ]]>

In essence, Unravel Data makes processing big data easier. The product was designed to resolve the complicated and disconcerting problems that emerge when processing big data. These applications can become confusing and difficult to operate. Never-before-seen challenges arise with chronic regularity, leaving research teams constantly struggling with issues such as allocating resources, scheduling, and debugging. These recurring issues slow down the actual processing and can make it difficult to use big data applications effectively, or even to profit from them. Unravel Data streamlines the process by detecting and correcting concealed defects that block advertising and analytics tracking, and online conversions.

        The post Big Data Applications: Managing Complexity with Success appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/big-data-applications-managing-complexity-with-success/feed/ 0
        Unravel Data Advances Application Performance Management for Big Data https://www.unraveldata.com/resources/unravel-data-advances-application-performance-management-for-big-data/ https://www.unraveldata.com/resources/unravel-data-advances-application-performance-management-for-big-data/#respond Fri, 25 Jan 2019 10:32:11 +0000 https://www.unraveldata.com/?p=1649

        Unravel Data, which provides an application performance management (APM) platform designed to simplify DataOps, has unveiled a new set of automated actions for improving modern data stack operations and performance. Based on its work with more […]

        The post Unravel Data Advances Application Performance Management for Big Data appeared first on Unravel.

        ]]>

        Unravel Data, which provides an application performance management (APM) platform designed to simplify DataOps, has unveiled a new set of automated actions for improving modern data stack operations and performance.

        Based on its work with more than 100 enterprise customers and prospects, the 4.0 release helps make DataOps more proactive and productive by automating problem discovery, root-cause analysis, and resolution across the entire modern data stack, while improving ROI and time to value of big data investments.

        The post Unravel Data Advances Application Performance Management for Big Data appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-data-advances-application-performance-management-for-big-data/feed/ 0
        Intelligent Application Performance Management (APM) Platform for the Modern Data Stack https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/ https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/#respond Fri, 25 Jan 2019 10:31:29 +0000 https://www.unraveldata.com/?p=1647

        Unravel Data has unveiled a new set of automated actions to improve modern data stack operations and performance. The solution was designed with input from more than 100 enterprise customers and prospects to uncover their biggest […]

        The post Intelligent Application Performance Management (APM) Platform for the Modern Data Stack appeared first on Unravel.

        ]]>

        Unravel Data has unveiled a new set of automated actions to improve modern data stack operations and performance. The solution was designed with input from more than 100 enterprise customers and prospects to uncover their biggest modern data stack challenges and to make DataOps more proactive and productive by automating problem discovery, root-cause analysis, and resolution across the entire modern data stack, while improving ROI and time to value of Big Data investments.

        The post Intelligent Application Performance Management (APM) Platform for the Modern Data Stack appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/intelligent-application-performance-management-apm-platform-for-big-data/feed/ 0
        Unravel Data upgrades performance monitor for modern data stacks applications https://www.unraveldata.com/resources/unravel-data-upgrades-performance-monitor-for-big-data-applications/ https://www.unraveldata.com/resources/unravel-data-upgrades-performance-monitor-for-big-data-applications/#respond Fri, 25 Jan 2019 10:27:25 +0000 https://www.unraveldata.com/?p=1636

        Unravel Data Systems Inc. is updating its application performance management system for modern data stack environments with improved detection of runaway applications, better service level agreement management and enhanced ability to diagnose and recommend fixes for […]

        The post Unravel Data upgrades performance monitor for modern data stacks applications appeared first on Unravel.

        ]]>

        Unravel Data Systems Inc. is updating its application performance management system for modern data stack environments with improved detection of runaway applications, better service level agreement management and enhanced ability to diagnose and recommend fixes for slowdowns and other performance problems.

        The post Unravel Data upgrades performance monitor for modern data stacks applications appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-data-upgrades-performance-monitor-for-big-data-applications/feed/ 0
        “Above the Trend Line” – Your Industry Rumor Central https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/ https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/#respond Fri, 25 Jan 2019 10:26:07 +0000 https://www.unraveldata.com/?p=1634

        Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry […]

        The post “Above the Trend Line” – Your Industry Rumor Central appeared first on Unravel.

        ]]>

        Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry alignments, rumors and general scuttlebutt floating around the big data, data science and machine learning industries including behind-the-scenes anecdotes and curious buzz. Our intent is to provide our readers a one-stop source of late-breaking news to help keep you abreast of this fast-paced ecosystem. We’re working hard on your behalf with our extensive vendor network to give you all the latest happenings. Heard of something yourself? Tell us! Just e-mail me at: daniel@insidebigdata.com. Be sure to Tweet Above the Trend Line articles using the hashtag: #abovethetrendline.

        The post “Above the Trend Line” – Your Industry Rumor Central appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-2-27-2017/feed/ 0
        Skills Developers Need To Optimize Performance https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/ https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/#respond Fri, 25 Jan 2019 10:23:29 +0000 https://www.unraveldata.com/?p=1627

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Skills Developers Need To Optimize Performance appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “What skills do developers need to have to optimize the performance and ease of monitoring their applications?”

        The post Skills Developers Need To Optimize Performance appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/skills-developers-need-to-optimize-performance/feed/ 0
        Additional Considerations Regarding Performance And Monitoring https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/ https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:22:28 +0000 https://www.unraveldata.com/?p=1625

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Additional Considerations Regarding Performance And Monitoring appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “What have we failed to ask you that you think we need to consider with regards to performance and monitoring?”

        The post Additional Considerations Regarding Performance And Monitoring appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/additional-considerations-regarding-performance-and-monitoring/feed/ 0
        Opportunities to Improve Performance and Monitoring https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/ https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:21:53 +0000 https://www.unraveldata.com/?p=1623

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Opportunities to Improve Performance and Monitoring appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “Where do you think the biggest opportunities are for improvement in performance optimization and monitoring?”

        The post Opportunities to Improve Performance and Monitoring appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/opportunities-to-improve-performance-and-monitoring/feed/ 0
        Common Issues With Performance and Monitoring https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/ https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/#respond Fri, 25 Jan 2019 10:20:48 +0000 https://www.unraveldata.com/?p=1621

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Common Issues With Performance and Monitoring appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “What are the most common issues you see affecting performance optimization and monitoring?”

        The post Common Issues With Performance and Monitoring appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/common-issues-with-performance-and-monitoring/feed/ 0
        Real-World Problems Solved By Performance Monitoring and Optimization https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/ https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/#respond Fri, 25 Jan 2019 10:14:27 +0000 https://www.unraveldata.com/?p=1619

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Real-World Problems Solved By Performance Monitoring and Optimization appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “What real-world problems are you, or your clients, solving with performance optimization and monitoring?”

        We have an ad services client monitoring service quality by monitoring latency and the flow of ads being served. The path through the internet varies. It can be influenced and changed. It’s important to be aware of the path being taken.

        The post Real-World Problems Solved By Performance Monitoring and Optimization appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/real-world-problems-solved-by-performance-monitoring-and-optimization/feed/ 0
        Most Frequently Used Performance Tools https://www.unraveldata.com/resources/most-frequently-used-performance-tools/ https://www.unraveldata.com/resources/most-frequently-used-performance-tools/#respond Fri, 25 Jan 2019 10:10:21 +0000 https://www.unraveldata.com/?p=1617

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients. Here’s what they told us when […]

        The post Most Frequently Used Performance Tools appeared first on Unravel.

        ]]>

        To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.

        Here’s what they told us when we asked, “What technical solutions do you use beyond your own?”

        Standard tools. APM = AppDynamics and New Relic. It’s open source and commercial. For synthetic, it’s Catchpoint and Dynatrace. We don’t see our clients using one overriding set of solutions.

        The post Most Frequently Used Performance Tools appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/most-frequently-used-performance-tools/feed/ 0
        The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/ https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/#respond Fri, 25 Jan 2019 10:04:03 +0000 https://www.unraveldata.com/?p=1609

        The past year was a blockbuster one for those working in the data space. Businesses have wrapped their fates around data analytics in an even tighter embrace as competition intensifies and the drive for greater innovation […]

        The post The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance appeared first on Unravel.

        ]]>

        The past year was a blockbuster one for those working in the data space. Businesses have wrapped their fates around data analytics in an even tighter embrace as competition intensifies and the drive for greater innovation becomes a top priority.

The year ahead promises to get even more interesting, especially for data managers and professionals. Leading experts in the field have witnessed a number of data trends emerge in 2016, and now see new developments coming into view for 2017.

        The post The Data Scene in 2017: More Cloud, Greater Governance, Higher Performance appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/the-data-scene-in-2017-more-cloud-greater-governance-higher-performance/feed/ 0
        Data lakes and brick walls, big data predictions for 2017 https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/ https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/#respond Fri, 25 Jan 2019 09:59:28 +0000 https://www.unraveldata.com/?p=1601

        There’s been a lot of talk about big data in the past year. But many companies are still struggling with implementing big data projects and getting useful results from their information. In this part of our […]

        The post Data lakes and brick walls, big data predictions for 2017 appeared first on Unravel.

        ]]>

        There’s been a lot of talk about big data in the past year. But many companies are still struggling with implementing big data projects and getting useful results from their information.

        In this part of our series on 2017 predictions we look at what the experts think will affect the big data landscape in the coming year.

Steve Wilkes, co-founder and CTO at Striim, believes we’ll see increasing commoditization, with on-premises data lakes giving way to cloud-based big data storage and analytics utilizing vanilla open-source products like Hadoop and Spark.

        The post Data lakes and brick walls, big data predictions for 2017 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/data-lakes-and-brick-walls-big-data-predictions-for-2017/feed/ 0
        Lack of talent and compliance worries among cloud predictions for 2017 https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/ https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/#respond Fri, 25 Jan 2019 09:54:53 +0000 https://www.unraveldata.com/?p=1594

        This is the time of year when industry experts like to come up with predictions for the coming 12 months. Last week we looked at some of their security forecasts, today it’s the turn of the […]

        The post Lack of talent and compliance worries among cloud predictions for 2017 appeared first on Unravel.

        ]]>

        This is the time of year when industry experts like to come up with predictions for the coming 12 months. Last week we looked at some of their security forecasts, today it’s the turn of the cloud to get the crystal ball gazing treatment.

        So, what do experts think are going to be the cloud trends of 2017?

        The post Lack of talent and compliance worries among cloud predictions for 2017 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/lack-of-talent-and-compliance-worries-among-cloud-predictions-for-2017/feed/ 0
        2017 Application Performance Management Predictions – Part 4 https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/ https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/#respond Fri, 25 Jan 2019 09:53:37 +0000 https://www.unraveldata.com/?p=1592

        APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial […]

        The post 2017 Application Performance Management Predictions – Part 4 appeared first on Unravel.

        ]]>

        APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM and related technologies will evolve and impact business in 2017. Part 4 covers cloud, containers and microservices.

        The post 2017 Application Performance Management Predictions – Part 4 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-4/feed/ 0
        2017 Application Performance Management Predictions – Part 2 https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/ https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/#respond Fri, 25 Jan 2019 09:50:40 +0000 https://www.unraveldata.com/?p=1589

        APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial […]

        The post 2017 Application Performance Management Predictions – Part 2 appeared first on Unravel.

        ]]>

        APMdigest’s 2017 Application Performance Management Predictions is a forecast by the top minds in APM today. Industry experts — from analysts and consultants to users and the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM and related technologies will evolve and impact business in 2017. Part 2 covers the expanding scope of Application Performance Management (APM) and Network Performance Management (NPM).

        The post 2017 Application Performance Management Predictions – Part 2 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/2017-application-performance-management-predictions-part-2/feed/ 0
        “Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/ https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/#respond Fri, 25 Jan 2019 09:47:34 +0000 https://www.unraveldata.com/?p=1583

        Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry […]

        The post “Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 appeared first on Unravel.

        ]]>

        Above the Trend Line: machine learning industry rumor central, is a recurring feature of insideBIGDATA. In this column, we present a variety of short time-critical news items such as people movements, funding news, financial results, industry alignments, rumors and general scuttlebutt floating around the big data, data science and machine learning industries including behind-the-scenes anecdotes and curious buzz. Our intent is to provide our readers a one-stop source of late-breaking news to help keep you abreast of this fast-paced ecosystem. We’re working hard on your behalf with our extensive vendor network to give you all the latest happenings. Heard of something yourself? Tell us! Just e-mail me at: daniel@insidebigdata.com. Be sure to Tweet Above the Trend Line articles using the hashtag: #abovethetrendline.

        The post “Above the Trend Line” – Your Industry Rumor Central for 11/7/2016 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/above-the-trend-line-your-industry-rumor-central-for-11-7-2016/feed/ 0
        Organizations Turn Focus to Data Reliability, Security and Governance https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/ https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/#respond Fri, 25 Jan 2019 09:43:58 +0000 https://www.unraveldata.com/?p=1575

        With the rapidly growing investments in data analytics, many organizations complain they lack a combined view of all the information being created. At the recent Strata & Hadoop World Conference in New York, Information Management asked […]

        The post Organizations Turn Focus to Data Reliability, Security and Governance appeared first on Unravel.

        ]]>

        With the rapidly growing investments in data analytics, many organizations complain they lack a combined view of all the information being created. At the recent Strata & Hadoop World Conference in New York, Information Management asked Kunal Agarwal, chief executive officer at Unravel Data, what this means for the industry and his company.

        The post Organizations Turn Focus to Data Reliability, Security and Governance appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/organizations-turn-focus-to-data-reliability-security-and-governance/feed/ 0
        Unravel Named to CNBC’s 2018 Upstart 100 List https://www.unraveldata.com/resources/cnbc-2018-upstart-100/ https://www.unraveldata.com/resources/cnbc-2018-upstart-100/#respond Fri, 25 Jan 2019 08:23:22 +0000 https://www.unraveldata.com/?p=1530

        The CNBC Upstart list debuts today with two big milestones—an expanded ranking from 25 to now 100 companies, who are transcending industry barriers with innovative products or services—and the inclusion of Unravel Data! We’re thrilled to […]

        The post Unravel Named to CNBC’s 2018 Upstart 100 List appeared first on Unravel.

        ]]>

        The CNBC Upstart list debuts today with two big milestones—an expanded ranking from 25 to 100 companies that are transcending industry barriers with innovative products or services—and the inclusion of Unravel Data! We’re thrilled to share that Unravel is one of the featured technology startups that stood out in a highly competitive field and earned high marks in scalability, sales, customer growth and intellectual property.

        As the only application performance management (APM) solution for big data, Unravel impressed judges with our approach to helping enterprises realize the promise of big data and showing how big data management tools can radically simplify the planning and management of business-critical modern data applications. While big data presents enormous complexity and continues to evolve at a rapid pace, we’re invigorated by these challenges—and humbled by the accolades from CNBC.

        On behalf of the Unravel team, we’re very proud of this achievement. And it’s been an exciting run recently—as this comes on the heels of Gartner’s Cool Vendor recognition and our latest platform update, 4.4—but we’re only just getting started. We continue to be inspired to innovate and work even smarter to keep up with our customers’ big data challenges.

        Read more CNBC coverage of Unravel Data and the CNBC Upstart 100 list: CNBC.com/Upstart

        The post Unravel Named to CNBC’s 2018 Upstart 100 List appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/cnbc-2018-upstart-100/feed/ 0
        Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? https://www.unraveldata.com/resources/uk-research-big-data/ https://www.unraveldata.com/resources/uk-research-big-data/#respond Fri, 25 Jan 2019 08:10:37 +0000 https://www.unraveldata.com/?p=1515

        Ahead of Big Data London in November, Unravel Data worked with Sapio Research to find out how business leaders really feel about big data and how they are doing in executing against their deployment plans. We’ve […]

        The post Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? appeared first on Unravel.

        ]]>

        Ahead of Big Data London in November, Unravel Data worked with Sapio Research to find out how business leaders really feel about big data and how they are doing in executing against their deployment plans.

        We’ve heard first hand from many of the UK’s largest financial services, telecommunications, and technology enterprises that they are facing significant challenges in harnessing the power of data to transform their businesses and create more compelling customer experiences. This new research suggests there is plenty of optimism from businesses about putting data to work. Yet addressing operational challenges as they morph and evolve is a pressing concern.

        Although 84 percent of respondents claim their big data projects usually deliver on expectations, only 17 percent currently rate the performance of their modern data stack as ‘optimal’. There are also stark differences between how managers and senior teams see their modern data stacks: 13 percent of VPs, directors and C-suite members report their stack only meets half of its KPIs, but more than double the number of managers (29%) say the same.

        It also seems businesses aren’t yet using modern data stack applications to grow their businesses, instead seeing protection and compliance as the most worthwhile goals. The top four most valuable and effective uses of big data currently, according to business leaders, are:

        • Cybersecurity intelligence (42%);
        • Risk, regulatory, compliance reporting (41%);
        • Predictive analytics for preventative maintenance (35%);
        • Fraud detection and prevention (35%).

        Further findings from the research:

        • According to survey respondents, the two greatest aspirations of big data teams are delivering on the promise of big data and improving big data performance
        • Lack of skills was cited as the top challenge (49%) to big data success in the enterprise, followed closely by the challenge of handling data volume, variety, and velocity (44%)
        • Lack of big data architects (45%) and big data engineers (43%) top the list of the most pressing skills that are lacking
        • Data analysis is the top priority for improvement, cited by 43% of respondents. This is followed by data transformation (39%) and data visualization (37%)
        • Cost reduction is the biggest expected benefit for big data applications, cited by 41% of respondents, followed by faster application release timings (37%)
        • 78% of organizations are already running big data workloads in the cloud, and 82% have a strategy to move existing big data applications into the cloud
        • Only one in five have an ‘all cloud’ strategy, with more than half (54%) using a mix of cloud and on-premise applications
        • 99% of business leaders report that their big data projects deliver on business goals at least ‘some of the time’

        The top-level insight that we can derive from this primary research is that most organizations believe in the promise of big data – but operational challenges are holding enterprises back from realizing the full potential. This is due to a combination of factors, most notably performance and a shortage of experienced talent.

        The challenge now is to ensure the modern data stack performs reliably and efficiently, and that big data teams have the tools and expertise to deliver the next generation of applications, analytics, AI and Machine Learning. This is the Unravel mission: To radically simplify the tuning and optimization of modern data stack applications and infrastructure.

        Unravel provides a single view of performance across platforms – whether on-premises or in the cloud – and delivers automated insights and recommendations through machine learning to ensure SLAs are met and operational teams are as efficient as possible.

        The full research report will be available in December.

        The survey was conducted with 200 IT decision makers involved in big data in organizations with over 1,000 employees. The interviews were conducted in October 2018 via email and online panels.

        The post Is The Promise of Big Data Hampered by Skills Shortages and Poor Performance? appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/uk-research-big-data/feed/ 0
        Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/ https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/#respond Fri, 25 Jan 2019 07:55:02 +0000 https://www.unraveldata.com/?p=1487

        While many enterprises across a wide range of verticals—finance, healthcare, technology—build applications on the modern data stack, some are not fully aware of the application performance management (APM) and operational challenges that often arise. Over the […]

        The post Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 appeared first on Unravel.

        ]]>

        While many enterprises across a wide range of verticals—finance, healthcare, technology—build applications on the modern data stack, some are not fully aware of the application performance management (APM) and operational challenges that often arise. Over the course of a two-part blog series, we’ll address the requirements at both the individual application level, as well as holistic clusters and workloads, and explore what type of architecture can provide automated solutions for these complex environments.

        The Modern Data Stack and Operational Performance Requirements

        Composed of multiple distributed systems, the modern data stack in an enterprise typically goes through the following evolution:

        • Big data ETL: storage systems, such as Azure Blob Store (ABS), house the large volumes of structured, semi-structured, and unstructured data. Distributed processing engines, like MapReduce, handle the extraction, cleaning, and transformation of that data.
        • Big data BI: MPP SQL systems, such as Impala, are added to the stack to power the interactive SQL queries that are common in BI workloads.
        • Big data science: as use of the modern data stack matures, more workloads leverage machine learning (ML) and artificial intelligence (AI).
        • Big data streaming: systems such as Kafka are added to the modern data stack to support applications that ingest and process data in a continuous streaming fashion.

        Evolution of the modern data stack in an enterprise

        Industry analysts estimate that there are more than 10,000 enterprises worldwide running applications in production on a modern data stack, comprised of three or more distributed systems.

        Naturally, performance challenges arise related to failures, speed, SLAs, or behavior. Typical questions that are nontrivial to answer include:

        • What caused an application to fail and how can I fix it?
        • This application seems to have made little progress in the last hour. Where is it stuck?
        • Will this application ever finish, or will it finish in a reasonable time?
        • Will this application meet its SLA?
        • Does this application behave differently than in the past? If so, in what way and why?
        • Is this application causing problems in my cluster?
        • Is the performance of this application being affected by one or more other applications?

        Many operational performance requirements also exist at the “macro” level, beyond the level of individual applications. These include:

        • Configuring resource allocation policies to meet SLAs in multi-tenant clusters
        • Detecting rogue applications that can affect the performance of SLA-bound applications through a variety of low-level resource interactions
        • Configuring the hundreds of configuration settings that distributed systems are notorious for having in order to get the desired performance
        • Tuning data partitioning and storage layout
        • Optimizing dollar costs on the cloud
        • Capacity planning using predictive analysis in order to account for workload growth proactively

        The Architecture of a Performance Management Solution

        Addressing performance challenges requires a sophisticated architecture that includes the following components:

        • Full data stack collection: to answer questions like what caused this application to fail or will this application ever meet its SLA, monitoring data will be needed from every level of the stack. However, collecting such data in a non-intrusive or low-overhead manner from production clusters is a major technical challenge.
        • Event-driven data processing: deployments can generate tens of terabytes of logs and metrics every day, with wide variety and inconsistent arrival. The data processing layer therefore needs to be based on event-driven processing algorithms whose outputs converge to the same final state irrespective of the timeliness and order in which the monitoring data arrives (a toy sketch follows this list).
        • Machine learning-driven insights and policy-driven actions: Enabling all of the monitoring data to be collected and stored in a single place allows for statistical analysis and machine learning to be applied to the data. Such algorithms can generate insights that can then be applied to address the performance requirements.
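        To make that convergence property concrete, here is a toy sketch of our own (not Unravel’s implementation) of an order-independent event fold: because the merge keeps the value with the latest event time, the final state is the same no matter when, or in what order, events arrive.

```python
# Toy sketch (not Unravel's implementation): fold metric events into a state
# keyed by (entity, metric), keeping the value with the latest event time.
# The merge is commutative and associative, so the final state is identical
# regardless of how late or out of order events arrive.
from typing import Dict, Tuple

Event = Tuple[str, str, int, float]            # (entity, metric, event_time, value)
State = Dict[Tuple[str, str], Tuple[int, float]]

def fold_event(state: State, event: Event) -> State:
    entity, metric, ts, value = event
    key = (entity, metric)
    current = state.get(key)
    if current is None or ts > current[0]:     # last-writer-wins on event time
        state[key] = (ts, value)
    return state

events = [
    ("app_42", "peak_heap_mb", 1700000300, 3500.0),
    ("app_42", "peak_heap_mb", 1700000100, 2100.0),  # arrives late with an older timestamp
]
s1, s2 = {}, {}
for e in events:
    fold_event(s1, e)
for e in reversed(events):                     # same events, opposite arrival order
    fold_event(s2, e)
assert s1 == s2                                # both converge to the same final state
```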

        Architecture of a performance management platform for Modern data applications

        There are several technical challenges involved in this process. Examples include:

        • What is the optimal way to correlate events across independent systems? In other words, if an application accesses multiple systems and the events pass through them, how can end-to-end correlation be done?
        • How can we deal with telemetry noise and data quality issues which are common in such complex environments? For example, records from multiple systems may not be completely aligned and may not be standardized.
        • How can we train ML models in closed-loop, complex deployments? How can we split training and test data?

        Now that we’ve tackled the modern data stack and reviewed operational challenges and requirements for a performance management solution to meet those needs, we look forward to discussing the features that a solution should offer in our next post.

        The post Architecting an Automated Performance Management (APM) Solution for Modern Data Applications — Part 1 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-1/feed/ 0
        Architecting APM Solution for Modern Data Applications — Part 2 https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/ https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/#respond Fri, 25 Jan 2019 07:37:26 +0000 https://www.unraveldata.com/?p=1468

        Architecting Automated Performance Management (APM) Solution In the last APM blog, we broke down the modern data stack, reviewed some of the stack’s biggest operational challenges, and explained what’s required for a performance management solution to […]

        The post Architecting APM Solution for Modern Data Applications — Part 2 appeared first on Unravel.

        ]]>

        Architecting an Automated Performance Management (APM) Solution

        In the last APM blog, we broke down the modern data stack, reviewed some of the stack’s biggest operational challenges, and explained what’s required for a performance management solution to meet those needs. In this piece, we’ll dive deeper into the features that solutions should offer in order to support modern data stacks composed of multiple distributed systems. Specifically, we’ll look at two of the main components these solutions should deliver: diagnosing application failures and optimizing clusters.

        Application Failures

        Apps break down for many different reasons, particularly in highly distributed environments. Traditionally, it’s up to the users to identify and fix the cause of the failure and get the app running smoothly again. Applications in distributed systems are comprised of many components, resulting in lots of raw logs appearing when failure occurs. With thousands of messages, including errors and stack traces, these logs are highly complex and can be impossible for most users to comb through. It’s incredibly difficult and time-consuming even for experts to discover the root cause of app failure through all of this noise. To generate insights into a failed application in a multi-engine modern data stack, a performance management solution needs to deliver the following:

        • Automatic identification of the root cause of application failure: This is one of the most important tools to mitigate challenges. The solution should continuously collect logs from a variety of application failures, convert those logs into feature vectors, then learn a predictive model for root cause analysis (RCA) from these feature vectors. Figure X1 shows a complete solution to deal with such problems and automate RCA for application failures.
        • Data collection for training: We all know the chief axiom of analytics: garbage in, garbage out. With this principle in mind, RCA models must be trained on representative input data. For example, if you’re performing RCA on a Spark failure, it’s imperative that the solution has been trained on log data from actual Spark failures across many different deployments.
        • Accurate model prediction: ML techniques for prediction can be classified as supervised learning and unsupervised learning. We use both in our solutions. We attach root-cause labels to the logs collected from an application failure. Labels come from a taxonomy of root causes such as the one illustrated in Figure X2. Once the logs are available, we can construct feature vectors using, for example, the Doc2Vec technique, which uses a three-layer neural network to gauge the context of the document and relate similar content together. After the feature vectors are generated along with a label, we can apply a variety of learning techniques for automatic RCA, both shallow and deep, including random forests, support vector machines, Bayesian classifiers, and neural networks (a minimal sketch follows this list). Example results produced by our solution are shown in Figure X3.
        • Automatic fixes for failed applications: This capability uses an algorithm to automatically find and implement solutions to failing apps. This algorithm is based on data from both successful and failed runs of a given app. Providing automatic fixes is one of the most critical features of a performance management solution as it saves users the time and headache of manually troubleshooting broken apps. Figure X4 shows an automated tuning process of a failed Spark application.
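        To ground the prediction step described above, here is a minimal, hedged sketch (the log snippets, labels, and model choices are ours, not Unravel’s production pipeline) of mapping failure logs to root-cause labels with a bag-of-words featurizer and a standard classifier.

```python
# Minimal sketch of supervised RCA on failure logs (toy data, not a production model):
# featurize raw log text with TF-IDF, then learn a root-cause classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

train_logs = [
    "java.lang.OutOfMemoryError: Java heap space at executor ...",
    "Container killed by YARN for exceeding memory limits ...",
    "FileNotFoundException: hdfs://warehouse/events/2018-10-01 does not exist",
    "AnalysisException: Table or view not found: clickstream_raw",
]
train_labels = ["resource_error", "resource_error", "data_error", "application_error"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # bag-of-words / TF-IDF featurization
    RandomForestClassifier(n_estimators=200),  # one of several possible classifiers
)
model.fit(train_logs, train_labels)

# Predict the root-cause label for a new, unseen failure log.
print(model.predict(["ExecutorLostFailure: container killed, exceeding memory limits"]))
```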

        Figure X1: Approach for automatic root cause analysis

        Figure X2: Taxonomy of failures

        Figure X3: Feature vector generation

        Figure X4: Automated tuning of a failed Spark application

        Cluster Optimization

        In addition to understanding app failures, the other key component of an effective big data performance management solution is cluster optimization. This includes performance management, autoscaling, and cost optimization. When properly delivered, these objectives benefit both IT operations and DevOps teams. However, when dealing with distributed big data, it’s not easy to execute cluster-level workload analysis and optimization. Here’s what we need to do:

        • Operational insights: This refers to the analysis of metrics data to yield actionable insights. This includes insights on app performance issues (i.e., whether a failure is due to bad code or resource contention), insights on cluster tuning based on aggregation of app data (i.e., finding out whether a compute cluster is properly tuned at both the cluster and application level), and insights on cluster utilization, cloud usage, and autoscaling. We also provide users with tools to help them understand how they are using their compute resources, such as comparing cluster activity between two time periods, aggregated cluster workload views, summary reports for cluster usage, chargeback reports, and so on. In addition, we provide cluster-level recommendations to fine-tune cluster-wide parameters and maximize a cluster’s efficiency based on the cluster’s typical workload. To do so, we (a) collect performance data from prior completed applications, (b) analyze the applications with respect to the cluster’s current configuration, (c) generate recommended cluster parameter changes, and (d) predict and quantify the impact that these changes will have on applications that will execute in the future. Figure Y1 shows example recommendations for tuning the size of map containers (top) and reduce containers (bottom) on a production cluster, and in particular the allocated memory in MB.
        • Workload analysis: This takes a closer look at queue usage to provide a holistic picture of how apps run and how they affect one another on a set of clusters. Workload analysis highlights usage trends, sub-optimal queue designs, workloads that run sub-optimally on queues, convoys, slow apps, problem users (e.g., users who frequently run applications that reach max capacity for a long period), and queue usage per application type or user project. Figure Y2 shows example resource utilization charts for two queues over a time range. In one queue, the running workload does not use all the resources allocated, while the workload in the other queue needs more resources. In this figure, the purple line shows resource requests and the black line shows resource allocation. Based on such findings, we generate queue-level insights and recommendations, including queue design/settings modifications (e.g., changing the resource budget for a queue or its max/min limits), workload reassignment to different queues (e.g., moving an application or a workload from one queue to another), queue usage forecasting, and so on. A typical big data deployment involves hundreds of queues, and without automation such a task can be tedious.
        • Forecasting: Forecasting leverages historical operational and application metrics to determine future needs for provisioning, usage, cost, job scheduling, and other attributes. This is important for capacity planning (a minimal sketch follows this list). Figure Y3 shows an example disk capacity forecasting chart; the black line shows actual utilization and the light blue line shows a forecast within an error bound.
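        To make the forecasting idea concrete, here is a deliberately minimal sketch (not Unravel’s forecasting models): fit a linear trend to historical disk usage and project when an assumed capacity would be reached. The usage series and the capacity figure are synthetic.

```python
# Minimal capacity-forecasting sketch (illustrative only): fit a linear trend
# to daily disk usage and estimate when the cluster crosses an assumed capacity.
import numpy as np

days = np.arange(90)                                       # 90 days of history
used_tb = 120 + 0.8 * days + np.random.normal(0, 2, 90)    # synthetic usage data (TB)

slope, intercept = np.polyfit(days, used_tb, deg=1)        # least-squares trend
capacity_tb = 250.0                                        # assumed cluster capacity

days_to_full = (capacity_tb - intercept) / slope - days[-1]
print(f"Growth ~{slope:.2f} TB/day; capacity reached in ~{days_to_full:.0f} days")
```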

        Figure Y1: Example set of cluster-wide recommendations

        Figure Y2: Workload analysis

        Figure Y3: Disk capacity forecasting

        Solving the Biggest Big Data Challenges

        Modern data applications face a number of unique performance management challenges. These challenges exist at the individual application level, as well as at the workload and cluster levels. To solve these problems at every level, a performance management solution needs to offer a range of capabilities to tell users why apps fail and enable them to optimize entire clusters. As organizations look to get more and more out of their modern data stack—including the use of artificial intelligence and machine learning—it’s imperative that they can address these common issues.

        For more technical details, we refer the interested reader to our research paper to be presented at the biennial Conference on Innovative Data Systems Research (CIDR 2019).

        The post Architecting APM Solution for Modern Data Applications — Part 2 appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/architecting-automated-performance-management-apm-solution-modern-data-applications-part-2/feed/ 0
        Why is APM a Must-Have for Big Data Self-Service Analytics? https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/ https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/#respond Fri, 25 Jan 2019 07:22:49 +0000 https://www.unraveldata.com/?p=1465

        Self-Service Big Data Analytics is the key to enabling a data driven organization. Enterprises understand the value of this capability and are investing heavily in technology, tools, processes and people to make this a reality. However, […]

        The post Why is APM a Must-Have for Big Data Self-Service Analytics? appeared first on Unravel.

        ]]>

        Self-Service Big Data Analytics is the key to enabling a data-driven organization. Enterprises understand the value of this capability and are investing heavily in technology, tools, processes, and people to make this a reality. However, as rightly pointed out by the recent Gartner report on big data self-service analytics and BI, several challenges remain. While I agree with many of Gartner’s assessments, I also see some key requirements for enabling a self-serve analytics paradigm:

        1. (Self-Serve) Platform & Infrastructure

        We need modern data management platforms to be agile enough to address self-serve needs. We are headed towards a “Query-as-a-Service” model, where end users expect the right infrastructure and analytic frameworks (batch, real-time, streaming, etc.) to be provisioned appropriately to support their analytic needs, regardless of scale and complexity.

        2. Security/Governance

        Modern data stack platforms are democratizing access to data. We now have the capability to store all of the data in a single scalable platform and make it accessible for a wide range of analysis. This presents both an opportunity and challenge. Enterprises need to ensure that they have the right “controls” and “governance” structures in place, so only the right data is accessible for the right analysis and relevant users.

        3. Self-Service Application Performance Management

        While this is often an afterthought, end users should have an easy way to understand and rationalize the performance of their analytic applications and, in many cases, also be able to optimize them for the desired SLA. In the absence of this “layer”, end users spend a lot of time depending on their IT organization to do these tasks for them, or jump through hoops to get the information (logs, metrics, configuration data) needed to understand their application performance. This slows down or defeats the whole self-service approach!

        At Unravel, we are squarely focused on solving the Self-Service Application Performance Management (APM) challenges for modern analytic applications built on modern data stack platforms. Unravel is built from the ground up to provide a complete 360-degree view of applications and to provide insights and recommendations on what needs to be done to improve performance and address failures associated with these applications – all in a self-serve fashion.

        Unravel deployment is streamlined to integrate easily with a self-serve platform/infrastructure, and it integrates with existing security layers at both the infrastructure level (e.g., Kerberos) and the platform level (e.g., Apache Sentry, Apache Ranger, etc.). Another unique capability in Unravel is role-based access control (RBAC) for users. This feature enables end users to view only their applications or those related to their business organization (mapped to a queue, project, tenant, department, etc.).

        You can learn more about how Unravel can enable your organization to achieve the self-serve analytics paradigm by visiting us at www.unraveldata.com or meeting us at the Strata San Jose 2018 Conference at booth #1329.

        The post Why is APM a Must-Have for Big Data Self-Service Analytics? appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/apm-must-have-for-big-data-self-serve-analytics/feed/ 0
        Fighting for precious resources on a multi-tenant big data cluster https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/ https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/#respond Fri, 25 Jan 2019 07:16:23 +0000 https://www.unraveldata.com/?p=1449

        Over the coming few weeks, we’ll provide some practical technical insights and advice on typical problems we encounter in working with many complex data pipelines. In this blog, we’ll talk about multi-tenant cluster contention issues. Check […]

        The post Fighting for precious resources on a multi-tenant big data cluster appeared first on Unravel.

        ]]>

        Over the coming few weeks, we’ll provide some practical technical insights and advice on typical problems we encounter in working with many complex data pipelines. In this blog, we’ll talk about multi-tenant cluster contention issues. Check back in for future topics.

        Modern data clusters are becoming increasingly commonplace and essential to your business. However, a wide variety of workloads typically run on a single cluster, making it a nightmare to manage and operate, similar to managing traffic in a busy city. We feel the pain of the operations folks out there who have to manage Spark, Hive, Impala, and Kafka apps running on the same cluster where they have to worry about each app’s resource requirements, the time distribution of the cluster workloads, the priority levels of each app or user, and then make sure everything runs like a predictable well-oiled machine.

        We at Unravel have a unique perspective on this kind of thorny problem, since we spend countless hours, day in and day out, studying the behavior of giant production clusters to discover insights into how to improve performance, predictability, and stability, whether it is a thousand-node Hadoop cluster running batch jobs or a five-hundred-node Spark cluster running AI, ML, or other advanced real-time analytics. Or, more likely, 1,000 nodes of Hadoop connected via a 50-node Kafka cluster to a 500-node Spark cluster for processing. What could go wrong?

        As you are most likely painfully well aware, plenty! Here are some of the most common problems that we see in the field:

        • Oversubscribed clusters – too many apps or jobs to run, just not enough resources
        • Bad container sizing – too big or too small
        • Poor queue management – sizes of queues are inappropriate
        • Resource hogging users or apps – bad apples in the cluster

        So how do you go about solving each of these issues?

        Measure and Analyze

        To understand which of the above issues is plaguing your cluster, you must first understand what’s happening under the hood. Modern data clusters have several precious resources that the operations team must constantly watch. These include memory, CPU, and the NameNode. When monitoring these resources, make sure to measure both the total available and the amount consumed at any given time.
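        As one illustration, on Hadoop YARN clusters the ResourceManager exposes a cluster metrics REST endpoint that reports allocated versus total memory and vcores. The sketch below assumes such a cluster; the host name is a placeholder.

```python
# Sketch: poll YARN ResourceManager cluster metrics to compare allocated vs.
# total memory and vcores (the ResourceManager host below is a placeholder).
import requests

RM = "http://resourcemanager.example.com:8088"   # placeholder host

metrics = requests.get(f"{RM}/ws/v1/cluster/metrics", timeout=10).json()["clusterMetrics"]

mem_used_pct = 100.0 * metrics["allocatedMB"] / metrics["totalMB"]
cpu_used_pct = 100.0 * metrics["allocatedVirtualCores"] / metrics["totalVirtualCores"]
print(f"Memory in use: {mem_used_pct:.1f}%  |  vcores in use: {cpu_used_pct:.1f}%")
```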

        Next, break down these resource charts by user, app, department, and project to truly understand who contributes how much to the total usage. This kind of analytical exploration can quickly reveal:

        • Whether any one tenant (user, app, department, project) is responsible for the majority of cluster usage, which may then require further investigation to determine if that tenant is using or abusing resources
        • Which resources are under constant threat of being oversubscribed
        • Whether you need to expand your big data cluster or tune apps and the system to get more juice

        Make apps better multi-tenant citizens

        Configuration settings at the cluster and app level dictate how many system resources each app gets. For example, if we have a setting of 8GB containers at the master level, then each app will get 8GB containers whether they need it or not. Now imagine if most of your apps only needed 4GB containers. Well, your system would show it’s at max capacity when it could be running twice as many apps.
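        A back-of-the-envelope sketch of that sizing logic (illustrative only, not Unravel’s recommendation engine): derive a container size from observed per-task memory peaks instead of a blanket default.

```python
# Illustrative sketch (not Unravel's algorithm): right-size container memory from
# observed per-task peak usage, rounded up to the scheduler's allocation increment.
import math

def recommend_container_mb(observed_peaks_mb, headroom=1.2, yarn_increment_mb=512):
    """Pick the 95th-percentile peak, add headroom, round up to the increment."""
    peaks = sorted(observed_peaks_mb)
    p95 = peaks[int(0.95 * (len(peaks) - 1))]
    sized = p95 * headroom
    return int(math.ceil(sized / yarn_increment_mb) * yarn_increment_mb)

# Example: tasks that peak around 3.5 GB get ~4.5 GB containers instead of an 8 GB default.
print(recommend_container_mb([3400, 3550, 3620, 3300, 3580]))
```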

        In addition to inefficient memory sizing, big data apps can be bad multi-tenant citizens due to other poor configuration settings (CPU, number of containers, heap size, etc.), inefficient code, and bad data layout.

        Therefore it’s crucial to measure and understand each of these resource-hogging factors for every app on the system and ensure they are using and not abusing resources.

        Define queues and priority levels

        Your big data cluster must have a resource management tool built in, for example, YARN or Kubernetes. These tools allow you to divide your cluster into queues. This feature can work well to separate production workloads from experiments, Spark from HBase, or high-priority users from low-priority ones. The trick, though, is to get the levels of these queues right.

        This is where “measure and analyze” techniques help again. You should analyze the usage of system resources by users, departments, or any other tenant you see fit, to determine the min, max, and average that they usually demand. This will give you some common-sense levels for your queues (a minimal sizing sketch follows).
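        As a sketch of how measured demand might translate into queue levels (the tenant names and numbers are made up), one option is to normalize each tenant’s typical demand into a capacity share:

```python
# Sketch: turn measured per-tenant demand (min/avg/max memory, in GB) into
# proportional queue capacities. Tenant names and numbers are illustrative.
measured = {
    "etl_prod": {"min": 200, "avg": 450, "max": 700},
    "bi_adhoc": {"min": 50,  "avg": 120, "max": 300},
    "data_sci": {"min": 20,  "avg": 80,  "max": 250},
}

total_avg = sum(t["avg"] for t in measured.values())
for name, t in measured.items():
    capacity_pct = round(100 * t["avg"] / total_avg)          # steady-state share
    max_pct = min(100, round(100 * t["max"] / total_avg))      # allow bursting toward peak
    print(f"queue={name:10s} capacity~{capacity_pct}%  max-capacity~{max_pct}%")
```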

        However, queue levels may need to be adjusted dynamically for best results. For example, a mission-critical app may need more resources if it processes 5x more data one day compared to the other. Therefore having a sense of seasonality is also essential when allocating these levels. A heatmap of cluster usage (as shown below) will enable us to get more precise about these allocations.

        Proactively find and fix rogue users or apps

        Even after you follow the steps above, your cluster will experience rogue usage from time to time. Rogue usage is defined as bad behavior on the cluster by an application or user, such as hogging resources from a mission-critical app, taking more CPU or memory than needed for timely execution, having a very long idle shell, etc. In a multi-tenant environment, this type of behavior affects all users and ultimately reduces the reliability of the overall platform.

        Therefore setting boundaries for acceptable behavior is very important to keep your big data cluster humming. A few examples of these are:

        • Time limits for application execution
        • CPU, memory, and container limits for each application or user

        Setting the thresholds for these boundaries should be done after analyzing your cluster patterns over a month to determine the average or accepted values. These values may also differ for different days of the week. Also, think about what happens when these boundaries are breached. Should the user and admin get an alert? Should these rogue applications be killed or moved to a lower-priority queue? A minimal threshold-check sketch follows.
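        Here is a minimal sketch of that kind of boundary check; the thresholds and application records are hypothetical.

```python
# Sketch: flag applications that breach hypothetical runtime/memory/idle boundaries.
from dataclasses import dataclass

@dataclass
class AppUsage:
    app_id: str
    user: str
    runtime_hours: float
    memory_gb: float
    idle_hours: float

LIMITS = {"runtime_hours": 6, "memory_gb": 512, "idle_hours": 2}  # assumed policy values

def find_rogues(apps):
    for app in apps:
        breaches = [k for k, limit in LIMITS.items() if getattr(app, k) > limit]
        if breaches:
            yield app, breaches  # e.g., alert the user/admin or move to a low-priority queue

apps = [AppUsage("app_17", "alice", 9.5, 640, 0.1), AppUsage("app_18", "bob", 1.2, 64, 0.0)]
for app, breaches in find_rogues(apps):
    print(f"{app.app_id} ({app.user}) breached: {', '.join(breaches)}")
```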

        Let us know if this was helpful, or which topics you would like us to address in future postings. Then, check back in for the next issue, which will be on the subject of rogue applications.

        The post Fighting for precious resources on a multi-tenant big data cluster appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/fighting-for-resources-on-multi-tenant-cluster/feed/ 0
        How to resolve performance issues of big data applications https://www.unraveldata.com/resources/resolve-performance-issues-quickly/ https://www.unraveldata.com/resources/resolve-performance-issues-quickly/#respond Fri, 25 Jan 2019 07:01:44 +0000 https://www.unraveldata.com/?p=1438

        I didn’t grow up celebrating Christmas, but every time I watched Chevy Chase, as Clark Griswold in National Lampoon’s Christmas Vacation, trying to unravel his Christmas lights I could feel his pain. For those who had […]

        The post How to resolve performance issues of big data applications appeared first on Unravel.

        ]]>

        I didn’t grow up celebrating Christmas, but every time I watched Chevy Chase, as Clark Griswold in National Lampoon’s Christmas Vacation, trying to unravel his Christmas lights I could feel his pain. For those who had to put up lights, you might remember how time-consuming it was to unwrap those lights. But that wasn’t the biggest problem. No, the biggest issue was troubleshooting why one or more lights were out. It could be an issue with the socket, the wiring, or just one light causing a section of good lights to be out. Figuring out the root cause of the problem was a trial-and-error process that was very frustrating.

        Today’s approach to diagnosing and resolving performance issues in big data systems is just like dealing with those pesky Christmas lights. Current performance monitoring and management tools don’t pinpoint the root cause of a problem or show how it affects other systems or components running across a big data platform. As a result, troubleshooting and resolving performance issues, like rogue users and jobs impacting cluster performance, missed SLAs, stuck jobs, failed queries, or not understanding cluster usage and application performance, is very time-consuming and cannot scale to support big data applications in a production deployment.

        There’s a better way to resolve Big Data performance issues than spending hours sifting through monitoring graphs and logs

        Managing big data operations in a multi-tenant cluster is complex and it’s hard to diagnose some of the problems listed above. It’s also hard to track who is doing what, understand cluster usage and application performance, justify resource demands, and forecast capacity needs.

        Gaining full visibility across the big data stack is difficult because there is no single pane that gives operations and users insight into what’s going on. Even with monitoring tools like Cloudera Manager, Ambari, and MapR Control System, people have to use logs, monitoring graphs, configuration files, and so on to try to resolve application performance issues.

        The Unravel platform gives you this important insight across multiple systems.

        The post How to resolve performance issues of big data applications appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/resolve-performance-issues-quickly/feed/ 0
        Reduce Apache Spark Troubleshooting Time from Days to Seconds https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/ https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/#respond Fri, 25 Jan 2019 06:53:27 +0000 https://www.unraveldata.com/?p=1427

        Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, […]

        The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

        ]]>

        Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new modern data stack applications are being built with Spark in fields like healthcare, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.

        Similar to applications in other distributed systems that have a large number of independent and interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages including errors and stacktraces. Hunting for the root cause of an application failure from these messy, raw, and distributed logs is hard for Spark experts, and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing Spark users with insights like the one shown in Figure 1.

        Figure 1: Insights from automatic root cause analysis improve Spark user productivity

        Spark platform providers like the Amazon, Azure, Databricks, and Google clouds, as well as Application Performance Management (APM) solution providers like Unravel, have access to a large and growing dataset of logs from millions of Spark application failures. This dataset is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails [1]. Such actionable root-cause identification improves the productivity of Spark users significantly.

        Clues in the logs

        A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application, as well as how the application interacts with the rest of the Spark platform. These logs form the key dataset that Spark users scan for clues to understand why an application failed.

        However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals make things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.

        Figure 2 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:

        • Continuously collecting logs from a variety of Spark application failures
        • Converting logs into feature vectors
        • Learning a predictive model for RCA from these feature vectors

        Of course, as with any intelligent solution that uses AI and ML techniques, the devil is in the details!

        Figure 2: Root cause analysis of Spark application failures

        Data collection for training: As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.

        Structured versus unstructured data: Logs are mostly unstructured data. To keep the accuracy of model predictions high in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details for Scala, Hadoop, the OS, and so on.

        Labels: ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by:

        • Configuration errors
        • Deployment errors
        • Resource errors
        • Data errors
        • Application errors
        • Unknown factors

        The leaves of this taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand (like the message in Figure 1), and (b) recommended fixes for this root cause. We will cover the root-cause taxonomy in a future blog.

        The labels are associated with the logs in one of two ways. First, the root cause is already known when the logs are generated, as a result of injecting a specific root cause we have designed to produce an application failure in our lab framework. The second way in which a label is given to the logs for an application failure is when a Spark domain expert manually diagnoses the root cause of the failure.

        Input Features: Once the logs are available, there are various ways in which the feature vector can be extracted from these logs. One way is to transform the logs into a bit vector (e.g., 1001100001). Each bit in this vector represents whether a specific message template is present in the respective logs. A prerequisite to this approach is to extract all possible message templates from the logs. A more traditional approach for feature vectors from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is mostly similar to the bit vector approach except for a couple of differences: (i) each bit in the vector now corresponds to a word instead of a message template, and (ii) instead of 0’s and 1’s, it is more common to use numeric values generated using techniques like TF-IDF.

        More recent advances in ML have popularized vector embeddings. In particular, we use the doc2vec technique [2]. At a high level, these vector embeddings map words (or paragraphs and entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The doc2vec technique uses a 3-layer neural network to gauge the context of the document and relate similar content together.

        Once the feature vectors are generated along with the label, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow as well as deep learning techniques including random forests, support vector machines, Bayesian classifiers, and neural networks.
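        A minimal sketch of this featurization-plus-classification flow, using gensim’s Doc2Vec API and scikit-learn on a toy corpus (the logs and labels are stand-ins, not the actual training data):

```python
# Sketch: embed failure logs with gensim's Doc2Vec, then train a classifier on
# the resulting feature vectors. The corpus below is a tiny toy stand-in.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

raw_logs = [
    ("java lang OutOfMemoryError java heap space", "resource_error"),
    ("container killed by yarn for exceeding memory limits", "resource_error"),
    ("table or view not found clickstream_raw", "application_error"),
    ("file not found hdfs warehouse events 2018 10 01", "data_error"),
]
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, (text, _) in enumerate(raw_logs)]

embedder = Doc2Vec(vector_size=64, window=5, min_count=1, epochs=50)
embedder.build_vocab(corpus)
embedder.train(corpus, total_examples=embedder.corpus_count, epochs=embedder.epochs)

X = [embedder.infer_vector(text.split()) for text, _ in raw_logs]   # feature vectors
y = [label for _, label in raw_logs]                                 # root-cause labels
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

# Predict a root-cause label for a newly failed application's log.
print(clf.predict([embedder.infer_vector("executor lost exceeding memory limits".split())]))
```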

        Conclusion

        The overall results produced by our solution are very promising and will be presented at Strata 2017 in New York. We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the root cause predicted by the model in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The bottleneck currently is in generating labels. We are working on active learning techniques [3] that nicely prioritize the human efforts required in generating labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information to build an accurate model. The expert labels these instances and then the predictive model is rebuilt.

        Manual failure diagnosis in Spark is not only time-consuming but highly challenging due to correlated failures that can occur simultaneously. Our unique RCA solution enables the diagnosis process to function effectively even in the presence of multiple concurrent failures, as well as noisy data prevalent in production environments. Through automated failure diagnosis, we remove the burden of manually troubleshooting failed applications from the hands of Spark users and developers, enabling them to focus entirely on solving business problems with Spark.

        To learn more about how to run Spark in production reliably, contact us.

        [1] S. Duan, S. Babu, and K. Munagala, “Fa: A System for Automating Failure Diagnosis”, International Conference on Data Engineering, 2009

        [2] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents”, International Conference on Machine Learning, 2014

        [3] S. Duan and S. Babu, “Guided Problem Diagnosis through Active Learning”, International Conference on Autonomic Computing, 2008

        The post Reduce Apache Spark Troubleshooting Time from Days to Seconds appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/automated-root-cause-analysis-spark-application-failures-reducing-troubleshooting-time-days-seconds/feed/ 0
        Organizations Need APM to fully Harness Big Data https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/ https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/#respond Fri, 25 Jan 2019 06:50:07 +0000 https://www.unraveldata.com/?p=1423

        Gartner recently released the report Monitor Data and Analytics Platforms to Drive Digital Business. In the introduction, the authors explain, “Unfortunately, most application performance monitoring (APM) and infrastructure monitoring technologies are not equipped to deal with the […]

        The post Organizations Need APM to fully Harness Big Data appeared first on Unravel.

        ]]>

        Gartner recently released the report Monitor Data and Analytics Platforms to Drive Digital Business. In the introduction, the authors explain, “Unfortunately, most application performance monitoring (APM) and infrastructure monitoring technologies are not equipped to deal with the challenges of monitoring business data and analytics platforms. And most IT operations professionals lack the skills required to understand and manage the performance problems and incidents that are likely to arise as these platforms are put through their paces by an increasingly data-hungry business community.”

        Moving and managing big data applications in production will be a nightmare because many problems can occur that can cripple an operations team. The growth and diversity of applications in a multi-tenant cluster are making it hard to diagnose problems, like rogue applications, missed SLAs, cascading cluster failures, application slowdowns, stuck jobs, and failed queries. Also, it’s becoming hard to track who is doing what, understand cluster usage and application performance, and optimize and plan in order to justify resource demands and forecast capacity needs.

        For operations teams, addressing these challenges is a must because there are too many downstream customers that depend on things running smoothly. But trying to have full visibility across the modern data stack is really hard to do because there is no single pane that gives operations and users insight into what’s going on or a way to collaborate – instead people have to look at code, logs, monitoring graphs, configuration files, and so on in an attempt to figure it out.

        Operations professionals need more. The Gartner report recommends that leaders tasked with optimizing and managing their data and analytics platforms should:

        • Monitor business data and analytics platform performance holistically by combining data from infrastructure and application performance tools and tasks. This will ensure good traceability, improve performance, and support the business uptake of analytics.

        Unravel provides the only tool that analyzes the entire modern data stack and enables operations teams to understand, improve, and control application performance in production. What makes Unravel so unique is that the solution is designed to automatically detect, diagnose, and resolve issues to improve productivity, guarantee reliability, and reduce costs for running modern data stack applications in any environment: on-premises, in the cloud, or hybrid.

        Learn more about why Big Data APM is mission critical for modern data stack applications in production.

        Citations:
        Gartner, Monitor Data and Analytics Platforms to Drive Digital Business, 05 April 2017

        The post Organizations Need APM to fully Harness Big Data appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/organizations-need-apm-fully-harness-big-data/feed/ 0
        Unravel Now Certified on MapR Converged Data Platform https://www.unraveldata.com/resources/unravel-data-partners-with-mapr-and-is-now-certified-on-mapr-converged-data-platform/ https://www.unraveldata.com/resources/unravel-data-partners-with-mapr-and-is-now-certified-on-mapr-converged-data-platform/#respond Fri, 25 Jan 2019 06:42:09 +0000 https://www.unraveldata.com/?p=1411

        We are pleased to announce that Unravel is now MapR certified. This marks an important milestone in our continued partnership with MapR, underpinned by a growing demand among MapR users for our modern data stack Application Performance Management […]

        The post Unravel Now Certified on MapR Converged Data Platform appeared first on Unravel.

        ]]>

        We are pleased to announce that Unravel is now MapR certified. This marks an important milestone in our continued partnership with MapR, underpinned by a growing demand among MapR users for our modern data stack Application Performance Management (APM) intelligence product.

        The certification ensures that Unravel is integrated seamlessly with MapR Converged Data Platform, providing customers with an intelligent product to improve the reliability and performance of their modern data stack applications and optimize overall cluster resource usage.

        Deploying Unravel on a MapR cluster is a simple two-step process:

        • Step 1: Deploy Unravel Server on a client node machine, which connects to relevant services like YARN, Hive, Oozie, Spark, etc., and enables access to the Unravel Web UI.
        • Step 2: Deploy Unravel Sensors (for Hive and Spark). Please note that the sensors are simply jar files (that provide additional metadata on Hive and Spark applications) and do not run as root on the deployed machines.

        Unravel works smoothly on MapR environments with MapR Tickets/Certificates, encryption, and other security configurations enabled. In our experience, getting Unravel up and running on MapR environments takes less than an hour, even for large-scale clusters with more than 100 nodes.

        Learn more about why APM for the modern data stack is mission critical for big data in production.

        The post Unravel Now Certified on MapR Converged Data Platform appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/unravel-data-partners-with-mapr-and-is-now-certified-on-mapr-converged-data-platform/feed/ 0
        How to Gain 5x Performance for Hadoop and Apache Spark Jobs https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/ https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/#respond Fri, 25 Jan 2019 06:39:00 +0000 https://www.unraveldata.com/?p=1408

        Improve performance of Hadoop jobs Unravel Data can help improve the performance of your Spark and Hadoop jobs by up to 5x. The Unravel platform can help drive this performance improvement in several ways, including: Save […]

        The post How to Gain 5x Performance for Hadoop and Apache Spark Jobs appeared first on Unravel.

        ]]>

        Improve performance of Hadoop jobs

        Unravel Data can help improve the performance of your Spark and Hadoop jobs by up to 5x. The Unravel platform can help drive this performance improvement in several ways, including:

        • Saving resources by reducing the number of tasks for queries
        • Drastically speeding up queries
        • Removing the reliance on default configurations
        • Recommending optimized configurations that minimize the number of tasks
        • Automatically detecting problems and suggesting optimal values

        Let’s look at a real-world example from an Unravel customer.

        Too many tasks, too many problems

        We recently heard from a customer who was experiencing performance issues. By using our automatic inefficiency diagnoses and implementing the suggested fixes, he was able to speed up his query from 41 minutes to approximately 8 minutes.

        The Unravel platform identified that a number of this customer’s queries were using far more tasks than needed. This problem is not unique to his application.

        When running big data applications, we often find that using too many tasks for a query leads to the following common pitfalls:

        • Increased per-task overhead
        • Increased wait time for resource allocation
        • Increased duration of the query
        • Resource contention at the cluster level, which affects concurrent queries

        The result after fine-tuning the Hadoop jobs with Unravel

        In this case, the customer was able to fine-tune their queries based on Unravel’s suggestions and see a 5x speed improvement by using only one-tenth of the original number of tasks.

        Let’s look at this example in more detail to review the problems facing this customer and how Unravel helped resolve them. Our examples are Hive with MapReduce applications running on a YARN cluster, as this is what the customer was using, but the problem applies to other types of YARN applications, such as Spark and Hive with Tez, which Unravel also supports.

        Customer Experience

        The Hive query in this customer example took 40 minutes on average to complete, and could take longer (more than 1 hour) or less (about 29 minutes) depending on how busy the cluster was. Note that he did not change any YARN, Hive, or MapReduce configurations for these runs, so the cluster’s default settings were used.

        After installing Unravel, he saw that Unravel automatically detected that this query was using too many map and reduce tasks (Figures 1 and 2). Moreover, it provided actionable suggestions to address the inefficiency: it specified new values for the configurations mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize, and hive.exec.reducers.bytes.per.reducer that would result in fewer map and reduce tasks in a future run.
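
        If you want to apply this kind of recommendation by hand, these are ordinary Hive session settings. The sketch below uses placeholder values of our own for illustration, not the values Unravel computed for this customer:

        -- Illustrative placeholders only: larger splits mean fewer map tasks,
        -- and more bytes per reducer means fewer reduce tasks.
        SET mapreduce.input.fileinputformat.split.minsize=268435456;   -- 256 MB
        SET mapreduce.input.fileinputformat.split.maxsize=1073741824;  -- 1 GB
        SET hive.exec.reducers.bytes.per.reducer=1073741824;           -- 1 GB
        -- Re-run the query in the same session to pick up the new settings.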

        Significant improvements were seen immediately after using the new configurations. As shown in Q1 in Table 1, the number of tasks used by the query dropped from 4368 to 407, while query duration went down from 41 minutes 18 seconds to 8 minutes 2 seconds. Three other queries in this workload were also flagged by Unravel for using too many tasks. The user proceeded to use the settings recommended for these queries and got significant improvements in both resource efficiency and speedup.

        Q2 to Q4 in Table 1 show the results. We could see that for all of the queries, the new runs used approximately one-tenth of the original number of tasks, but achieved a query speedup of 2 to 5X. It is also worth noting that the significant reduction in tasks by these queries freed up resources for other jobs to run on the cluster, thereby improving the overall throughput of the cluster.

        Figure 1. Query used too many mappers. Unravel shows that this query has one job using too many map tasks, and provides a recommendation for the configurations mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to use fewer map tasks in the next run.

        Figure 2. Query used too many reducers. Unravel shows that this query has one job using too many reduce tasks, and provides a recommendation for the configuration hive.exec.reducers.bytes.per.reducer to use fewer reduce tasks in the next run.

        Table 1. The duration and tasks used before and after using Unravel’s suggested configurations.

        Detecting and Mitigating Too Many Tasks in Spark and Hadoop jobs

        Unravel analyzes every application running on the cluster for various types of inefficiencies. To determine whether a query is using too many tasks, Unravel checks your Spark and Hadoop jobs to see if a large number of tasks each process too little data.

        As shown in Figures 1 and 2, Unravel explained that a majority of the tasks ran for too little time (average map time is 44 seconds and average reduce computation time is 1 second). Figure 3 provides another view of the symptom. We see that all 3265 map tasks finished in less than 2 minutes, and most of them processed between 200 and 250 MB of data. On the reduce side, more than 1000 reduce tasks finished in less than 1.5 minutes, and most of them processed less than 100 KB of data each.

        Upon detecting the inefficiency, Unravel automatically computes new configuration values that users can use to get fewer tasks for a future run. Contrasting Figure 3, Figure 4 shows the same histograms for the run with new configurations. This time there were only 395 map tasks, an 8X reduction, and most of them took longer than 2 minutes (though all of them still finished within 3.5 minutes) and processed between 2 and 2.5 GB of data. On the reduce side, only 8 reduce tasks were used, and each processed between 500 and 800 KB. Most importantly, the job duration went from 39 minutes 37 seconds to 7 minutes 11 seconds.

        Figure 3. Histograms showing the distribution of map and reduce task duration, input, and output size. More than 3000 map tasks and more than 1000 reduce tasks are allocated for this job, which took almost 40 minutes, but each task runs for very little time.

        Figure 4. Drastic reduction in job duration with fewer tasks processing more data per task. Fewer than 400 map tasks and only 8 reduce tasks were allocated to do the same amount of work in the new run. The job finished in just over 7 minutes.

        How Using Fewer Tasks Improves Query Speedup for Spark and Hadoop jobs

        Several reasons explain why using fewer tasks resulted in a shorter duration, not a longer one as many might expect. First, in the original run, a large number of tasks were allocated but each task processed very little data. Allocating and starting a task incurs an overhead, and this overhead is about the same whether a task ends up processing 100 KB or 1 GB of data. Therefore, using 1,000 tasks to process 100 GB, with each task processing 0.1 GB, incurs a much larger total overhead than doing the same with 100 tasks, each processing 1 GB.

        Second, there is a limit to how many containers an application can get at a time. Not only does the cluster have limited resources, but also it is typically multi-tenant, so concurrent applications have to share (fight for) containers – recall that the query with the default configuration could vary in duration from 29 minutes to more than an hour depending on how busy the cluster was.

        Suppose a query needs 200 containers but only 100 are available at the time. The query will get those 100 containers and then wait until resources become available for the other 100. Often, this means waiting until its first 100 containers are finished, freed, and allocated again for a second “wave” of 100 containers. Therefore, requesting fewer tasks makes it faster to acquire all the containers a query needs, as the sketch below illustrates.
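
        As a back-of-the-envelope sketch of this “wave” effect, using the Q1 task counts from Table 1 and the hypothetical 100 available containers from above:

        // Illustrative sketch: scheduling "waves" needed when only a fixed number
        // of containers is available at a time.
        val containersAvailable = 100

        def waves(numTasks: Int): Int =
          math.ceil(numTasks.toDouble / containersAvailable).toInt

        println(waves(4368))  // ~44 waves for the original run's task count
        println(waves(407))   // ~5 waves after tuning, so far less waiting for containers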

        Third, the default configurations often result in too many tasks for many applications. Unfortunately, most users submit applications with the default configurations because they may not know that resources can be tuned, or which parameters to tune. Savvier users who know which parameters to tune may try to change them, often arbitrarily and through trial and error, but this process takes a long time because they do not know what values to set.

        This is where Unravel shines: in addition to detecting an inefficiency, it can tell users which configurations to change and what values to set to address it. In fact, besides the post-query analysis discussed in this post, Unravel can detect issues while a query is still running, and users can configure rules to automatically take predefined actions when a problem is detected. We will discuss that further in a future post.

        The post How to Gain 5x Performance for Hadoop and Apache Spark Jobs appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/how-to-easily-gain-5x-speed-up-for-hadoop-and-spark-jobs/feed/ 0
        Common Modern Data Stack Application and Operation Challenges https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/ https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/#respond Fri, 25 Jan 2019 06:36:44 +0000 https://www.unraveldata.com/?p=1405

        Over the last one year, we at Unravel Data have spoken to several enterprise customers (with cluster sizes ranging from 10’s to 1000’s of nodes) about their modern data stack application challenges. Most of them leverage […]

        The post Common Modern Data Stack Application and Operation Challenges appeared first on Unravel.

        ]]>

        Over the last year, we at Unravel Data have spoken with several enterprise customers (with cluster sizes ranging from tens to thousands of nodes) about their modern data stack application challenges. Most of them leverage Hadoop in a typical multi-tenant fashion, with different applications (ETL, business intelligence, machine learning apps, etc.) accessing a common data substrate.

        The conversations have typically involved both the operations teams, who manage and maintain these platforms, and the developers/end users, who build applications on the modern data stack (Hive, Spark, MR, Oozie, etc.).

        Broadly speaking, the challenges can be grouped into an application perspective and a cluster perspective. We have illustrated these challenges with actual examples discussed with these customers.

        Challenges faced by Developers/Data Engineers

        Some of the most common Application challenges we have seen or heard of that are top of mind include:

        • An ad-hoc application (submitted via Hive, Spark, or MR) is stuck (not making forward progress) or fails after a while
          Example 1: A Spark job hangs at the last task and eventually fails
          Example 2: A Spark job fails with an executor OOM at a particular stage
        • An application suddenly performs poorly
          Example 1: A Hive query that used to take ~6 hrs is now taking > 10 hrs
        • No good understanding of which “knobs” (configuration parameters) to change to improve application performance and resource usage
        • Need for a self-serve platform to understand end-to-end how their specific application(s) are behaving

        Today, engineers end up going to five different sources (e.g., the CM/Ambari UI, Job History UI, application logs, Spark Web UI, and AM/RM UIs and logs) to get an end-to-end understanding of application behavior and performance challenges. Further, these sources may be insufficient to truly understand the bottlenecks associated with an application (e.g., detailed container execution profiles, or visibility into the transformations that execute within a Spark stage). See an example here on how to debug a Spark application. It’s cumbersome!

        To add to the above challenges, many developers do not have access to these sources and have to go through their Ops team, which adds significant delays.

        Challenges faced by Data Operations Team(s)

        Some of the most common challenges we have seen or heard of that are top of mind for operations teams include:

        • Lack of visibility into cluster usage from an application perspective
          Example 1: Which application(s) cause my cluster usage (CPU, memory) to spike?
          Example 2: Are queues being used optimally?
          Example 3: Are various data sets actually being used?
          Example 4: Need for a comprehensive chargeback/showback view for planning and budgeting purposes
        • Poor visibility and understanding when data pipelines miss SLAs, and where to start triaging these issues
          Example 1: An Oozie-orchestrated data pipeline that needs to complete by 4 AM every morning is now consistently delayed and finishes only by 6 AM
        • Inability to control or manage runaway jobs that take far more resources than needed, affecting the overall cluster and starving other applications
          Example 1: Need for an automated way to track and manage these runaway jobs
        • No good understanding of which “knobs” (configuration parameters) to change at the cluster level to improve overall application performance and resource usage
        • Difficulty quickly identifying inefficient applications that can be reviewed and acted upon
          Example 1: Most Ops team members do not have deep Hadoop application expertise, so a way to quickly triage and understand root causes of application performance degradation is very helpful

        The above application and operations challenges are real blockers that prevent enterprise customers from making their Hadoop applications production-ready. This in turn slows down the ROI enterprises see from their Big Data investments.

        Unravel’s vision is to address these broad challenges and more for modern data stack applications and operations: a full-stack performance intelligence and self-serve platform that improves your modern data stack operations, makes your applications reliable, and improves overall cluster utilization. In subsequent blog posts we will detail how we go about doing this and the (data) science behind these solutions.

        The post Common Modern Data Stack Application and Operation Challenges appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/common-big-data-application-and-operation-challenges/feed/ 0
        To Cache or Not to Cache RDDs in Spark https://www.unraveldata.com/resources/to-cache-or-not-to-cache/ https://www.unraveldata.com/resources/to-cache-or-not-to-cache/#respond Fri, 25 Jan 2019 06:31:25 +0000 https://www.unraveldata.com/?p=1395 Cache

        This post is the first part of a series of posts on caching, and it covers basic concepts for caching data in Spark applications. Following posts will cover more how-to’s for caching, such as caching DataFrames, […]

        The post To Cache or Not to Cache RDDs in Spark appeared first on Unravel.

        ]]>

        This post is the first part of a series of posts on caching, and it covers basic concepts for caching data in Spark applications. Following posts will cover more how-to’s for caching, such as caching DataFrames, more information on the internals of Spark’s caching implementation, as well as automatic recommendations for what to cache based on our work with many production Spark applications. For a more general overview of causes of Spark performance issues, as well as an orientation to our learning to date, refer to our page on Spark Performance Management.

        Caching RDDs in Spark

        Caching is one mechanism to speed up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() stores the RDD in memory, whereas persist(level) can cache it in memory, on disk, or in off-heap memory according to the caching strategy specified by level. persist() without an argument is equivalent to cache(). We discuss caching strategies later in this post. Freeing up space in storage memory is done with unpersist().
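
        A minimal sketch of this API, assuming an existing SparkContext sc and a hypothetical input path:

        import org.apache.spark.storage.StorageLevel

        val lines = sc.textFile("hdfs:///data/events")   // hypothetical input

        lines.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
        // lines.persist(StorageLevel.MEMORY_AND_DISK)   // alternative strategy

        lines.count()     // first action: evaluates the RDD and populates the cache
        lines.count()     // subsequent actions read the cached blocks

        lines.unpersist() // free the blocks from storage memory when no longer needed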

        When to use caching

        As suggested in this post, it is recommended to use caching in the following situations:

        • RDD re-use in iterative machine learning applications
        • RDD re-use in standalone Spark applications
        • When RDD computation is expensive, caching can help reduce the cost of recovery in case an executor fails
        1.  val sc = new SparkContext(sparkConf)
        2.  val ranGen = new Random(2016)
        3.  // assumes the usual Spark and scala.util.Random imports; numMappers, numPartitions, numKVPairs, valSize, numReducers defined elsewhere
        4.  val pairs1 = sc.parallelize(0 until numMappers, numPartitions).flatMap { p =>
        5.
        6.  // map output sizes linearly increase from the 1st to the last
        7.  val numKVPairsLocal = (1.0 * (p + 1) / numMappers * numKVPairs).toInt
        8.
        9.  val arr1 = new Array[(Int, Array[Byte])](numKVPairsLocal)
        10. for (i <- 0 until numKVPairsLocal) {
        11.  val byteArr = new Array[Byte](valSize)
        12.  ranGen.nextBytes(byteArr)
        13.
        14.  // generate keys with a zipfian-like skew
        15.  var key: Int = 0
        16.  if (ranGen.nextInt(100) > 50)
        17.   key = 0
        18.  else
        19.   key = ranGen.nextInt(numKVPairsLocal)
        20.
        21.  arr1(i) = (key, byteArr)
        22. }
        23. arr1
        24. }
        25. .cache()
        26.
        27. pairs1.count()
        28.
        29. println("Number of pairs: " + pairs1.count)
        30. println("Count result " + pairs1.groupByKey(numReducers).count())
        31.
        32. sc.stop()
        

        Caching example

        Let us consider the example application shown in the listing above (Figure 1). Let us also assume that we have enough memory to cache any RDD. The example application generates an RDD on line 4, “pairs1”, which consists of a set of key-value pairs. The RDD is split into several partitions (i.e., numPartitions). RDDs are lazily evaluated in Spark; thus, the “pairs1” RDD is not evaluated until an action is called. Neither cache() nor persist() is an action. Several actions are called on this RDD, as seen on lines 27, 29, and 30. The data is cached during the first action call, on line 27. All the subsequent actions (lines 29 and 30) will find the RDD blocks in memory. Without the cache() statement on line 25, each action on “pairs1” would re-evaluate the RDD by processing the intermediate steps to generate the data. For this example, caching clearly speeds up execution by avoiding RDD re-evaluation in the last two actions.

        Block eviction

        Now let us also consider the situation where some of the partitions are so large that they quickly fill up the storage memory used for caching. The partitions generated in the example above are skewed, i.e., more key-value pairs are allocated to the partitions with higher IDs. Highly skewed partitions are the first candidates that will not fit into the cache storage.

        When the storage memory becomes full, an eviction policy (Least Recently Used) is used to make room for new blocks. This situation is not ideal, as cached partitions may be evicted before actually being re-used. Depending on the caching strategy adopted, evicted blocks may be written to disk. A better practice is to cache only RDDs that are expensive to re-evaluate and modest in size, such that they fit in memory entirely. Making such decisions before application execution may be challenging, as it is unclear which RDDs will fit in the cache and which caching strategy is better to use (i.e., where to cache: in memory, on disk, off-heap, or a combination of the above) in order to achieve the best performance. Generally, a caching strategy that caches blocks in memory and on disk is preferred: cached blocks evicted from memory are then written to disk, and reading the data back from disk is relatively fast compared with re-evaluating the RDD [1].

        Under the covers

        Internally, caching is performed at the block level: each RDD consists of multiple blocks, and each block is cached independently of the others. Caching is performed on the node that generated that particular RDD block. Each executor in Spark has an associated BlockManager that is used to cache RDD blocks. The memory available to the BlockManager is given by the storage memory fraction (spark.memory.storageFraction), which is the share of the unified memory pool (itself specified by spark.memory.fraction) reserved for storage. A summary of memory management in Spark can be found here. The BlockManager manages cached partitions as well as intermediate shuffle outputs. The storage, the place where blocks are actually stored, can be specified through the StorageLevel (e.g., persist(level: StorageLevel)). Once the storage level of an RDD has been set, it cannot be changed. An RDD block can be cached in memory, on disk, or off-heap, as specified by level.
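
        For reference, both fractions are ordinary Spark configuration settings. The sketch below simply sets them to their documented defaults, for illustration only:

        import org.apache.spark.{SparkConf, SparkContext}

        // spark.memory.fraction sizes the unified memory pool; spark.memory.storageFraction
        // is the share of that pool reserved for storage (the BlockManager's cached blocks).
        val conf = new SparkConf()
          .setAppName("caching-example")
          .set("spark.memory.fraction", "0.6")          // default value
          .set("spark.memory.storageFraction", "0.5")   // default value

        val sc = new SparkContext(conf)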

        Caching strategies (StorageLevel): RDD blocks can be cached in multiple stores (memory, disk, off-heap), in serialized or non-serialized format; a short usage sketch follows the list below.

        • MEMORY_ONLY: Data is cached in memory only in non-serialized format.
        • MEMORY_AND_DISK: Data is cached in memory. If enough memory is not available, evicted blocks from memory are serialized to disk. This mode of operation is recommended when re-evaluation is expensive and memory resources are scarce.
        • DISK_ONLY: Data is cached on disk only in serialized format.
        • OFF_HEAP: Blocks are cached off-heap, e.g., on Alluxio [2].
        • The caching strategies above can also use serialization to store the data in serialized format. Serialization increases the processing cost but reduces the memory footprint of large datasets. These variants append a “_SER” suffix to the above schemes, e.g., MEMORY_ONLY_SER and MEMORY_AND_DISK_SER. DISK_ONLY and OFF_HEAP always write data in serialized format.
        • Data can also be replicated to another node by appending a “_2” suffix to the StorageLevel, e.g., MEMORY_ONLY_2 and MEMORY_AND_DISK_SER_2. Replication is useful for speeding up recovery in the case one node of the cluster (or an executor) fails.
        • A full description of caching strategies can be found here.
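
        A short usage sketch of these levels, reusing the “pairs1” RDD from the example above:

        import org.apache.spark.storage.StorageLevel

        // pairs1 was cached with MEMORY_ONLY above; a storage level cannot be changed
        // once assigned, so clear it before choosing a different one.
        pairs1.unpersist()

        pairs1.persist(StorageLevel.MEMORY_AND_DISK)         // evicted blocks are written to disk
        // pairs1.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized: smaller footprint, more CPU
        // pairs1.persist(StorageLevel.MEMORY_ONLY_2)        // replicate cached blocks to a second node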

        Summary

        • Caching is very useful for applications that re-use an RDD multiple times. Iterative machine learning applications include such RDDs that are re-used in each iteration.
        • Caching all of the generated RDDs is not a good strategy as useful cached blocks may be evicted from the cache well before being re-used. For such cases, additional computation time is required to re-evaluate the RDD blocks evicted from the cache.
        • Given a large list of RDDs that are being used multiple times, deciding which ones to cache may be challenging. When memory is scarce, it is recommended to use MEMORY_AND_DISK caching strategy such that evicted blocks from cache are saved to disk. Reading the blocks from disk is generally faster than re-evaluation. If extra processing cost can be afforded, MEMORY_AND_DISK_SER can further reduce the memory footprint of the cached RDDs.
        • If certain RDDs have a very large evaluation cost, it is recommended to replicate them to another node. This will significantly boost performance in the case of a node failure, since re-evaluation can be skipped.

        Further Reading

        Those interested in Spark might find these two technical blogs useful in understanding performance issues for the Spark Platform:

        Why Your Spark Applications are Slow or Failing: Part I Memory Management

        Why Your Spark Apps Are Slow or Failing Part II Data Skew and Garbage Collection

        References:

        [1] https://community.databricks.com/t5/data-engineering/should-i-always-cache-my-rdd-s-and-dataframes/td-p/30763
        [2] “Learning Spark: Lightning-Fast Big Data Analysis”. Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. O’Reilly Media, 2015.


        The post To Cache or Not to Cache RDDs in Spark appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/to-cache-or-not-to-cache/feed/ 0
        Why We Created Unravel Data for Big Data APM https://www.unraveldata.com/resources/why-we-created-unravel/ https://www.unraveldata.com/resources/why-we-created-unravel/#respond Fri, 25 Jan 2019 06:22:34 +0000 https://www.unraveldata.com/?p=1390

        Big Data is no longer a side project. Hadoop, Spark, Kafka, and NoSQL systems are slowly becoming part of the core IT fabric in most organizations. From product usage reports to recommendation engines, organizations are now […]

        The post Why We Created Unravel Data for Big Data APM appeared first on Unravel.

        ]]>


        Big Data is no longer a side project. Hadoop, Spark, Kafka, and NoSQL systems are slowly becoming part of the core IT fabric in most organizations. From product usage reports to recommendation engines, organizations are now running various types of Big Data applications in production to provide their customers greater value than ever before.

        But Big Data applications are not easy to run, and ongoing operations management presents organizations with never-before-seen challenges. Application developers often complain about applications missing their delivery commitments (SLAs) or failing outright. Meanwhile, Big Data operations teams struggle with everyday tasks such as debugging, scheduling, and allocating resources. These ongoing management and performance challenges make it difficult for organizations to rely on their Big Data investment – let alone profit from it.

        Shivnath Babu and I saw that managing the chaos and complexity of Big Data systems was taking up the majority of the time spent by Big Data professionals — versus working to deliver results to the business from this Big Data stack. We also saw that these problems weren’t unique to one organization, but were common in companies employing Big Data technology. Complicating matters, the Big Data ecosystem is expanding at such a rapid pace that practitioners are unable to keep up. Lack of expertise is often cited as one of the primary reasons for Big Data projects failing or slowing down.

        We thought there had to be a better way to cope with this complexity so that enterprises could focus their attention on delivering value quickly using their Big Data stack, and set out on a mission to radically simplify Big Data operations.

        Companies have engineers responsible for Big Data operations. Their job is to keep clusters healthy and Big Data applications performing reliably and efficiently. Today, they use a fragmented set of tools, command-line interfaces, and home-developed scripts to keep an eye on their Big Data stack. However, these fragmented tools fail to show the complete picture, making it very hard to root-cause and solve problems. For example, to troubleshoot the slowdown in the performance of a critical Big Data application, engineers have to look at the entire stack since the root cause could be in the application code, configuration settings, data layout, or resource allocation.

        Therefore, we had to create a management solution that wasn’t looking only at one part of the stack, but one that was holistic. We also had to do more than simply provide charts and dashboards, which most users cannot make sense of; we had to provide ‘performance intelligence’ which would simplify the process of ongoing management and make data teams more productive.

        Creating software like Unravel requires a mix of both industry experience as well as deep scientific knowledge. Therefore, we have assembled a team of innovators who previously worked at companies such as Cloudera, Oracle, IBM, Netflix, and scientists from Duke University, IIT, MIT, and Stanford. Together the Unravel team brings the needed experience in distributed computing and enterprise software that is crucial to solving this major problem that our industry faces today.

        I am excited to announce that Unravel is already being used in production by several leading web and Fortune 100 companies today. We couldn’t be happier to see that Unravel is helping organizations rely on Big Data by ensuring that applications are fast and error-free and that the underlying cluster is being utilized to its full potential.

        See how Unravel is mission critical to run big data in production here!

        The post Why We Created Unravel Data for Big Data APM appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/why-we-created-unravel/feed/ 0
        Auto-Tuning Tez Applications https://www.unraveldata.com/resources/auto-tuning-tez-applications/ https://www.unraveldata.com/resources/auto-tuning-tez-applications/#respond Fri, 25 Jan 2019 06:20:46 +0000 https://www.unraveldata.com/?p=1387

        In a previous blog and webinar, my colleague Eric Chu discussed Unravel’s powerful recommendation capabilities in the Hive on MapReduce setting. To recap briefly, for these Hive on MR applications, Unravel surfaced inefficiencies and associated performance problems […]

        The post Auto-Tuning Tez Applications appeared first on Unravel.

        ]]>

        In a previous blog and webinar, my colleague Eric Chu discussed Unravel’s powerful recommendation capabilities in the Hive on MapReduce setting. To recap briefly, for these Hive on MR applications, Unravel surfaced inefficiencies and performance problems associated with the excessive number of mapper and reducer tasks being spawned to physically execute the query.

        In this article, we’ll focus on another execution engine for Hive queries, Apache Tez. Tez provides a DAG-based model for physical execution of Hive queries and ships with Hortonworks’s HDP distribution as well as Amazon EMR.

        We’ll see that reducing the number of tasks is an important optimization in this setting as well; however, Tez uses a different mechanism for input splits, which means different parameters must be tuned. Of course, Unravel simplifies this nuance by directly recommending efficient settings for the Tez-specific parameters in play, using our ML/AI-informed infrastructure monitoring tools.

        Before diving into some hands-on tuning from a data engineer’s point-of-view, I want to note that there are many factors that can lead to variability in execution performance and resource consumption of Hive on Tez queries.

        Once Unravel’s intelligent APM functionality provides deep insights into the nature of multi-tenant workloads running in modern data stack environments, operators can rapidly identify the causes of this variability, which could include:

        • Data skew
        • Resource contention
        • Changes in volume of processed data

        Towards the end of this blog, we will highlight some of the operator-centric features in Unravel, providing full fidelity visibility that can be leveraged to yield such insights.

        We’ll use the following TPC-DS query on an HDP 2.6.3 cluster.

        When we execute a query of this form, with aggregation, sort, and limit semantics, we can see that Tez uses the Map-Reduce-Reduce pattern, which is a nice improvement over the MR execution engine, as this pattern isn’t possible in pure MapReduce.

        As discussed above and in Eric’s aforementioned blog, distributed processing involves a balancing act between the degree of parallelism and the amount of work completed by each parallel process. Tez attempts to achieve this balance via the grouping of splits. Task parallelism in Tez is determined by the grouping algorithm discussed here.

        The key thing to note is that the number of splits per task must be aligned with the configuration settings for tez.grouping.min-size and tez.grouping.max-size.

        Unravel first recommends tuning these grouped splits so that the cluster allocates a smaller number of Tez map tasks, gaining better throughput with each task processing a larger amount of data than in the pre-tuned execution.
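
        For context, the grouping bounds are regular session settings that can be applied per query. The values below are placeholders of our own, not Unravel’s recommendation for this workload:

        -- Illustrative placeholders: raising the grouping bounds makes each grouped split
        -- (and thus each Tez map task) cover more input data, so fewer map tasks are launched.
        SET tez.grouping.min-size=268435456;   -- 256 MB
        SET tez.grouping.max-size=1073741824;  -- 1 GB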

        Returning to the Map-Reduce-Reduce DAG that the optimizer identified for this query, we see that the first reducer stage is taking a long time to process.

        At this point, without Unravel, a data engineer would usually dig into the physical execution plan, in order to identify how much data each reducer task is processing. Next, the engineer would need to understand how to properly configure the number of reducers to yield better efficiency and performance.

        This is non-trivial, given the number of parameters in play: hive.tez.auto.reducer.parallelism, hive.tez.min.partition.factor, hive.tez.max.partition.factor, hive.exec.reducers.max, hive.exec.reducers.bytes.per.reducer, and more (take a look at the number of Tez configuration parameters available, a large number of which can affect performance).

        In fact, with auto reducer parallelism turned on (the default in HDP), Tez samples the source vertices’ output sizes and adjusts the number of reducer tasks dynamically. When there is a large volume of input data, but the reducer stage output is small relative to this input, the default heuristics can lead to too few reducers.

        This is a good example of the subtleties that can arise with performance tuning: in the Hive on MR case, we had a query that was using far too many reducer tasks, leading to inefficiencies, while in this case we have a query using too few reducer tasks. Furthermore, even more reducers do not necessarily mean better performance here either!

        In fact, in further testing, we found that tuning hive.exec.reducers.bytes.per.reducer to a somewhat higher value, resulting in fewer reducers, can squeeze out slightly better performance in this case.
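
        For readers who want to experiment by hand, these are ordinary Hive session settings; the sketch below uses placeholder values of our own, not the values from our testing:

        -- Illustrative placeholders: with auto reducer parallelism left on (the HDP default),
        -- raising bytes-per-reducer nudges Tez toward fewer, larger reducer tasks.
        SET hive.tez.auto.reducer.parallelism=true;
        SET hive.exec.reducers.bytes.per.reducer=536870912;  -- 512 MB
        -- hive.tez.min.partition.factor and hive.tez.max.partition.factor bound how far
        -- Tez may shrink or grow the reducer count from its initial estimate.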

        As we discussed, it’s a complex balancing act between the number of tasks processing data in parallel and the amount of data processed by each. Savvy data engineers spend an inordinate amount of time running experiments to hand-tune these parameters before deploying their applications to production.

        How Unravel Helps Auto-Tuning Tez Applications

        Fortunately, with Unravel, we have a lot less work to do as data engineers. Unravel intelligently navigates this complex response surface using machine learning algorithms, to identify and surface optimal configurations automatically. It’s up to you how to optimize your output, using the full-stack infrastructure monitoring tools provided.

        Let’s move on to the operator-centric point-of-view in Unravel to briefly discuss the insights we can gain when we determine there is considerable variability in the efficiency and performance of our Tez applications.

        The first item we identified is data skew; how can we use Unravel to identify this issue? First, we can quickly see the total run time broken down by Tez task in the vertex timeline:


        Secondly, we have a sortable cross-table we can leverage to easily see which tasks took the longest.

        This can be the result of processing skew or data skew, and Unravel provides additional histograms that offer insight into it. These histograms bucket the individual tasks by duration and by processed I/O.

        Non-uniformity in the distribution of these tasks yields insight into the nature of the skew, which cluster operators and developers can leverage to take actionable tuning and troubleshooting steps.
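
        To make the idea of non-uniformity concrete, one simple check is to compare the slowest task against the median task duration. This is a sketch of our own, not the metric Unravel computes, and the durations are made up:

        // Illustrative skew check over per-task durations (seconds).
        val taskDurations = Seq(40.0, 42.0, 44.0, 45.0, 300.0)

        val sorted = taskDurations.sorted
        val median = sorted(sorted.length / 2)
        val skewRatio = sorted.last / median

        // A ratio far above 1 means a few straggler tasks dominate the vertex runtime,
        // which usually points at data or processing skew.
        println(f"max/median task duration ratio: $skewRatio%.1f")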

        Fantastic. How about resource contention? First, we can take a look at an app that executed quickly because YARN allocated all of its requested containers right away:

        Our second figure shows what can happen when other tenants in the cluster are requesting containers for their applications: the slower execution is due to waiting for containers from the ResourceManager:

        Finally, regarding changes in input data volume, we can use Unravel’s powerful Workflow functionality to determine when the volume changes considerably, leading to variance in application behavior. In a future blog, we will discuss techniques to address this concern.

        Create a free account.

        The post Auto-Tuning Tez Applications appeared first on Unravel.

        ]]>
        https://www.unraveldata.com/resources/auto-tuning-tez-applications/feed/ 0