Webinars Archives - Unravel

Data Actionability™: Boost Productivity with Unravel’s New DataOps AI Agent

Data Actionability™: Speed Up Analytics with Unravel’s New Data Engineering AI Agent

Data Actionability™: Cost Governance with Unravel’s New FinOps AI Agent

Data Actionability™: Empower Your Team with Unravel’s New AI Agents

Webinar Recap: Succeeding with DataOps


The term DataOps (like its older cousin, DevOps) means different things to different people. But one thing that everyone agrees on is its objective: deliver data-driven insights to the business quickly and reliably.

As the business now expects—or, more accurately, demands—more insights from data, on shorter timelines, with more data being generated across increasingly complex data estates, it becomes ever more difficult to keep pace with the new levels of speed and scale.

Eliminating as much manual effort as possible is key. In a recent Database Trends and Applications (DBTA)-hosted roundtable webinar, Succeeding with DataOps: Implementing, Managing, and Scaling, three engineering leaders with practical experience in enterprise-level DataOps present solutions that simplify and accelerate:

  • Security. Satori Chief Scientist Ben Herzberg discusses how you can meet data security, privacy, and compliance requirements faster with a single universal approach.
  • Analytics. Incorta VP of Product Management Ashwin Warrier shows how you can make more data available to the business faster and more frequently.
  • Observability. Unravel VP of Solutions Engineering Chris Santiago explains how AI-enabled full-stack observability accelerates data application/pipeline optimization and troubleshooting, automates cost governance, and delivers the cloud migration intelligence you need before, during, and after your move to the cloud.

Universal data access and security

Businesses today are sitting on a treasure trove of data for analytics. But within that gold mine is a lot of personally identifiable information (PII) that must be safeguarded. That brings up an inherent conflict: making the data available to the company’s data teams in a timely manner without exposing the company to risk. The accelerated volume, velocity, and variety of data operations only makes the challenge greater. 

Satori’s Ben Herzberg shared that one CISO recently told him he had visibility over his network and his endpoints, but data was like a black box—it was hard for the Security team to understand what was going on. Security policies usually had to be implemented by other teams (data engineering or DataOps, database administrators).

The 5 capabilities of Satori’s DataSecOps

So what Satori does is provide a universal data access and security platform where organizations can manage all their security policies across their entire data environment from a single place. Ben outlined the five capabilities that Satori enables to realize DataSecOps:

  1. Continuous data discovery and classification
    Data keeps on changing, with more people touching it every day. Auditing or mapping your sensitive data once a year, even once a quarter or once a month, may not be sufficient.
  2. Dynamic data masking
    Different data consumers need different levels of data anonymization. Usually to dynamically mask data, you have to rely on the different data platforms themselves.
  3. Granular data access policies
    A key way to safeguard data is via Attribute-based Access Control (ABAC) policies. For example, you may want data analysts to be able to access data stores using Tableau or Looker but not with any other technologies (see the sketch after this list).
  4. Access control and workflows
    Satori automates and simplifies the way access is granted or revoked. A common integration is with Slack, so the Security team can approve or deny access requests quickly.
  5. Monitoring, audit & compliance
    Satori sits between the data consumers and the data that’s being accessed, so monitoring/auditing for compliance is transparent without ever touching the data itself.
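To make the ABAC idea in item 3 concrete, here is a minimal Python sketch of how such a policy check might be expressed. The attribute names, roles, and approved-tool list are illustrative assumptions, not Satori’s actual policy model.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user_role: str   # e.g. "data_analyst"
    tool: str        # e.g. "tableau", "looker", "psql"
    data_store: str  # e.g. "warehouse.customers"

# Hypothetical policy: analysts may reach data stores only through approved BI tools.
APPROVED_ANALYST_TOOLS = {"tableau", "looker"}

def is_allowed(req: AccessRequest) -> bool:
    """Return True if the request satisfies the attribute-based policy."""
    if req.user_role == "data_analyst":
        return req.tool in APPROVED_ANALYST_TOOLS
    # Other roles would carry their own attribute checks; deny by default here.
    return False

print(is_allowed(AccessRequest("data_analyst", "tableau", "warehouse.customers")))  # True
print(is_allowed(AccessRequest("data_analyst", "psql", "warehouse.customers")))     # False
```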

Keeping data safe is always a top-of-mind concern, but no organization can afford to delay projects for months in order to apply security.

Scaling analytics with DataOps

When you think of the complexity of the modern data stack, you’re looking at a proliferation of cloud-based applications alongside many legacy on-prem applications (SAP, Oracle) that continue to be the backbone of a large enterprise. You have all these sources rich with operational data that business users need to build analytics on top of. So the first thing you see in a modern data stack is data being moved to a data lake or data warehouse to make it readily accessible, with refinements made along the way.

But to Incorta’s Ashwin Warrier, this presents two big challenges when it comes to enterprise analytics and operational reporting. First, getting through this entire process takes weeks, if not months. Second, once you finally get the data into the hands of business users, it’s aggregated. Much of your low-fidelity data is gone because of summarizations and aggregations. Building a bridge between the analytics plane and the operations plane requires some work.

Incorta’s Direct Data Platform approach

Where Incorta is different is that it does enrichments but doesn’t try to create transformations to take your source third normal form data and convert it into star schemas. It takes an exact copy with a minimal amount of change. Having the same fidelity means that now your business users can get access to their data at scale in just a matter of hours or days instead of weeks or months.

Ashwin related the example of a Fortune 50 customer with about 2,000 users developing over 5,000 unique reports. It had daily refreshes, with an average query time of over 2 minutes, and took over 8 days to deliver insights. With Incorta, it reduced average query time to just a few seconds, now refreshes data 96 times every day, and can deliver insights in less than 1 day.


Taming the beast of modern data stacks

Unravel’s Chris Santiago opened with an overview of how the compounding complexity of modern data stacks causes problems for different groups of data teams. Data engineers’ apps aren’t hitting their SLAs; how can they make them more reliable? Operations teams are getting swamped with tickets; how can they troubleshoot faster? Data architects are falling behind schedule and going over budget; how can they speed migrations without more risk? Business leaders are seeing skyrocketing cloud bills; how can they better govern costs?

Then he went into the challenges of complexity. A single modern data pipeline is complex enough on its own, with dozens or even hundreds of interdependent jobs on different technologies (Kafka, Airflow, Spark, Hive, all the barn animals in Hadoop). But most enterprises run large volumes of these complex pipelines continuously—now we’re talking about the number of jobs getting into the thousands. Plus you’ve got multiple instances and data stored in different geographical locations around the globe. Already it’s a daunting task to figure out where and why something went wrong. And now there’s multi-cloud.

Managing this environment with point tools is going to be a challenge: they are very specific to the service they’re running, and the crucial information you need is spread across dozens of different systems and tools. Chris pointed out four key reasons why managing modern data stacks with point tools falls flat:

4 ways point tools fall flat for managing the modern data stack

  1. You get only a partial picture.
    The Spark UI, for instance, has granular details about individual jobs but not at the cluster level. Public cloud vendors’ tools have some of this information, but nothing at the job level.
  2. You don’t get to the root cause.
    You’ll get a lot of graphs and charts, but what you really need to know is what went wrong, why—and how to fix it.
  3. You’re blind to where you’re overspending.
    You need to know exactly which jobs are using how much resources, whether that’s appropriate for the job at hand, and how you can optimize for performance and cost.
  4. You can’t understand your cloud migration needs.
    Things are always changing, and changing fast. You need to always be thinking about the next move.

That’s where Unravel comes in.


Unravel is purpose-built to collect granular data app/pipeline-specific details from every system in your data estate, correlate it all in a “workload-aware” context automatically, analyze everything for you, and provide actionable insights and precise AI recommendations on what to do next.

4 capabilities of Unravel

  1. Get single-pane-of-glass visibility across all environments.
    You’ll have complete visibility to understand what’s happening at the job level up to the cluster level. 
  2. Optimize automatically.
    Unravel’s AI engine is like having a Spark or Databricks or Amazon EMR expert telling you exactly what you need to do to optimize performance to meet SLAs or change instance configurations to control cloud costs.
  3. Get fine-grained insight into cloud costs.
    See at a granular level exactly where the money is going, set some budgets, track spend month over month—by project, team, even individual job or user—and have AI uncover opportunities to save.
  4. Migrate on time, on budget.
    Move to the cloud with confidence, knowing that you have complete and accurate insight into how long migration will take, the level of effort involved, and what it’ll cost once you migrate.

As businesses become ever more data-driven, build out more modern data stack workloads, and adopt newer cloud technologies, it will become ever more important to be able to see everything in context, let AI take much of the heavy lifting and manual investigation off the shoulders of data teams already stretched too thin, and manage, troubleshoot, and optimize data applications/pipelines at scale.

Check out the full webinar here.

Webinar Recap: Functional strategies for migrating from Hadoop to AWS


In a recent webinar, Functional (& Funny) Strategies for Modern Data Architecture, we combined comedy and practical strategies for migrating from Hadoop to AWS. 

Unravel Co-Founder and CTO Shivnath Babu moderated a discussion with AWS Principal Architect, Global Specialty Practice, Dipankar Ghosal and WANdisco CTO Paul Scott-Murphy. Here are some of the key takeaways from the event.

Hadoop migration challenges

Business computing workloads are moving to the cloud en masse to achieve greater business agility, access to modern technology, and reduce operational costs. But identifying what you have running, understanding how it all works together, and mapping it to a cloud topology is extremely difficult when you have hundreds, if not thousands, of data pipelines and are dealing with tens of thousands of data sets.

Hadoop-to-AWS migration poll

We asked attendees what their top challenges have been during cloud migration and how they would describe the current state of their cloud journey. Not surprisingly, the complexity of their environment was the #1 challenge (71%), followed by the “talent gap” (finding people with the right skills).

The unfortunate truth is that most cloud migrations run over time and over budget. However, when done right, moving to the cloud can realize spectacular results.

How AWS approaches migration

Dipankar talked about translating a data strategy into a data architecture, emphasizing that the data architecture must be forward-looking: it must be able to scale in terms of both size and complexity, with the flexibility to accommodate polyglot access. That’s why AWS does not recommend a one-size-fits-all approach, as it eventually leads to compromise. With that in mind, he talked about different strategies for migration.

Hadoop to Amazon EMR Migration Reference Architecture

He recommends a data-first strategy for complex environments where it’s a challenge to find the right owners to define why the system is in place. Plus, he said, “At the same time, it gives the business the data availability on the cloud, so that they can start consuming the data right away.”

The other approach is a workload-first strategy, which is favored when migrating a relatively specialized part of the business that needs to be refactored (e.g., Pig to Spark).

He wrapped up by describing a process with different “swim lanes where every persona has skin in the game for a migration to be successful.”

Why a data-first strategy?

Paul followed up with a deeper dive into a data-first strategy. Specifically, he pointed out that in a cloud migration, “people are unfamiliar with what it takes to move their data at scale to the cloud. They’re typically doing this for the first time, it’s a novel experience for them. So traditional approaches to copying data, moving data, or planning a migration between environments may not be applicable.” The traditional lift-and-shift application-first approach is not well suited to the type of architecture in a big data migration to the cloud.

a data-first approach to cloud migration

Paul said that the WANdisco data-first strategy looks at things from three perspectives:

  • Performance: Obviously moving data to the cloud faster is important so you can start taking advantage of a platform like AWS sooner. You need technology that supports the migration of large-scale data and allows you to continue to use it while migration is under way. There cannot be any downtime or business interruption. 
  • Predictability: You need to be able to determine when data migration is complete and plan for workload migration around it.
  • Automation: Make the data migration as straightforward and simple as possible, to give you faster time to insight, to give you the flexibility required to migrate your workloads efficiently, and to optimize workloads effectively.

How Unravel helps before, during, and after migration

Shivnath went through the pitfalls encountered at each stage of a typical migration to AWS (assess; plan; test/fix/verify; optimize, manage, scale). He pointed out that it all starts with careful and accurate planning, then continuous optimization to make sure things don’t go off the rails as more and more workloads migrate over.

Unravel helps at each stage of cloud migration

And to plan properly, you need to assess what is a very complex environment. All too often, this is a highly manual, expensive, and error-filled exercise. Unravel’s full-stack observability collects, correlates, and contextualizes everything that’s running in your Hadoop environment, including identifying all the dependencies for each application and pipeline.

Then once you have this complete application catalog, with baselines to compare against after workloads move to the cloud, Unravel generates a wave plan for migration. Having such accurate and complete data-based information is crucial to formulating your plan. Usually when migrations go off schedule and over budget, it’s because the original plan itself was inaccurate.
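As a rough illustration of how a wave plan can be derived from an application catalog, the sketch below groups applications into migration waves so that nothing moves before its upstream dependencies. The catalog contents are invented, and real planning (Unravel’s included) weighs many more factors (complexity, criticality, data volume) than this toy topological grouping.

```python
# Hypothetical application catalog: app -> set of upstream apps it depends on.
catalog = {
    "ingest_clickstream": set(),
    "dim_loader": set(),
    "sessionize": {"ingest_clickstream"},
    "revenue_report": {"sessionize", "dim_loader"},
}

def wave_plan(deps: dict) -> list:
    """Group apps into migration waves; each wave depends only on earlier waves."""
    remaining = {app: set(d) for app, d in deps.items()}
    migrated, waves = set(), []
    while remaining:
        # Apps whose dependencies have all migrated already can go in this wave.
        ready = sorted(app for app, d in remaining.items() if d <= migrated)
        if not ready:
            raise ValueError("circular dependency in catalog")
        waves.append(ready)
        migrated.update(ready)
        for app in ready:
            del remaining[app]
    return waves

for i, wave in enumerate(wave_plan(catalog), start=1):
    print(f"Wave {i}: {', '.join(wave)}")
```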

Then after workloads migrate, Unravel provides deep insights into the correctness, performance, and cost. Performance inefficiencies and over-provisioned resources are identified automatically, with AI-driven recommendations on exactly what to fix and how.

As more workloads migrate, Unravel empowers you to apply policy-based governance and automated alerts and remediation so you can manage, troubleshoot, and optimize at scale.

Case study: GoDaddy

The Unravel-WANdisco-Amazon partnership has proven success in migrating a Hadoop environment to EMR. GoDaddy had to move petabytes of actively changing “live” data while the business depended on the continued operation of applications in the cluster and access to its data. They had to move an 800-node Hadoop cluster with 2.5PB of customer data that was growing by more than 4TB every day. The initial (pre-Unravel) manual assessment took several weeks and proved incomplete: only 300 scripts were discovered, whereas Unravel identified over 800.

GoDaddy estimated that its lift-and-shift migration would cost $7 million to operate, but Unravel’s AI optimization capabilities identified savings that brought the cloud costs down to $2.9 million. Using WANdisco’s data-first strategy, GoDaddy was able to complete its migration process on time and under budget while maintaining normal business operations at all times.

Q&A

The webinar wrapped up with a lively Q&A session where attendees asked questions such as:

  • We’re moving from on-premises Oracle to AWS. What would be the best strategy?
  • What kind of help can AWS provide in making data migration decisions?
  • What is your DataOps strategy for cloud migration?
  • How do you handle governance in the cloud vs. on-premises?

To hear our experts’ specific individual responses to these questions as well as the full presentation, click here to get the webinar on demand (no form to fill out!).

Webinar Recap: Optimizing and Migrating Hadoop to Azure Databricks


The benefits of moving your on-prem Spark Hadoop environment to Databricks are undeniable. A recent Forrester Total Economic Impact (TEI) study reveals that deploying Databricks can pay for itself in less than six months, with a 417% ROI from cost savings and increased revenue & productivity over three years. But without the right methodology and tools, such modernization/migration can be a daunting task.

Capgemini’s VP of Analytics Pratim Das recently moderated a webinar with Unravel’s VP of Solutions Engineering Chris Santiago, Databricks’ Migrations Lead (EMEA) Amine Benhamza, and Microsoft’s Analytics Global Black Belt (EMEA) Imre Ruskal to discuss how to reduce the risk of unexpected complexities, avoid roadblocks, and prevent cost overruns.

The session Optimizing and Migrating Hadoop to Azure Databricks is available on demand, and this post briefly recaps that presentation.

Pratim from Capgemini opened by reviewing the four phases of a cloud migration—assess; plan; test, fix, verify; optimize, manage, scale—and polling the attendees about where they were on their journey and the top challenges they have encountered. 

Migrating Hadoop to Databricks poll question

How Unravel helps migrate to Databricks from Hadoop

Chris ran through the stages an enterprise goes through when doing a cloud migration from Hadoop to Databricks (really, any cloud platform), with the different challenges associated with each phase. 

4 stages of cloud migration

Specifically, profiling exactly what you have running on Hadoop can be a highly manual, time-consuming exercise that can take 4-6 months, requires domain experts, and can cost over $500K—and even then the result is still usually inaccurate and incomplete by 30%.

This leads to problematic planning. Because you don’t have complete data and have missed crucial dependencies, you wind up with inaccurate “guesstimates” that delay migrations by 9-18 months and underestimate TCO by 3-5X.

Then once you’ve actually started deploying workloads in the cloud, too often users are frustrated that workloads are running slower than they did on-prem. Manually tuning each job takes about 20 hours in order to meet SLAs, increasing migration expenses by a few million dollars.

Finally, migration is never a one-and-done deal. Managing and optimizing the workloads is a constant exercise, but fragmented tooling leads to cumbersome manual management and lack of governance results in ballooning cloud costs.

Chris Santiago showed over a dozen screenshots illustrating Unravel’s capabilities to assess and plan a Databricks migration.

Chris illustrated how Unravel’s data-driven approach to migrating to Azure Databricks helps alleviate and solve these challenges. Specifically, Unravel answers questions you need to ask to get a complete picture of your Hadoop inventory:

  • What jobs are running in your environment—by application, by user, by queue? 
  • How are your resources actually being utilized over a lifespan of a particular environment?
  • What’s the velocity—the number of jobs that are submitted in a particular environment—how much Spark vs. Hive, etc.?
  • What pipelines are running (think Airflow, Oozie)?
  • Which data tables are actually being used, and how often? 

Then once you have a full understanding of what you’re running in the Hadoop environment, you can start forecasting what this would look like in Databricks. Unravel gathers all the information about what resources are actually being used, how many, and when for each job. This allows you to “slice” the cluster to start scoping out what this would look like from an architectural perspective. Unravel takes in all those resource constraints and provides AI-driven recommendations on the appropriate architecture: when and where to use auto-scaling, where spot instances could be leveraged, etc.
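To give a rough sense of the resource-based forecasting described above, the sketch below aggregates hypothetical per-job usage samples into peak concurrent demand and divides by an assumed instance size. The job data, instance capacity, and sizing rule are simplifications for illustration, not Unravel’s actual algorithm.

```python
from collections import defaultdict
import math

# Hypothetical profiler output: per-job samples of (hour_of_day, vCores in use).
job_samples = {
    "etl_orders":   [(0, 40), (1, 40), (2, 16)],
    "spark_scores": [(1, 64), (2, 64)],
    "hive_report":  [(2, 8), (3, 8)],
}

VCORES_PER_INSTANCE = 16  # assumed capacity of the target instance type

# Sum concurrent demand per hour across jobs, then size the cluster for the peak hour.
demand_by_hour = defaultdict(int)
for samples in job_samples.values():
    for hour, vcores in samples:
        demand_by_hour[hour] += vcores

peak = max(demand_by_hour.values())
instances = math.ceil(peak / VCORES_PER_INSTANCE)
print(f"Peak demand {peak} vCores -> about {instances} instances at peak; "
      f"quieter hours are candidates for auto-scaling or spot capacity.")
```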


Then when planning, Unravel gives you a full application catalog, both at a summary and drill-down level, of what’s running either as repeated jobs or ad hoc. You also get complexity analysis and data dependency reports so you know what you need to migrate and when in your wave plan. This automated report takes into account the complexity of your jobs, the data level and app level dependencies, and ultimately spits out a sprint plan that gives you the level of effort required. 

Unravel’s AI recommendations in action

But Unravel also helps with monitoring and optimizing your Databricks environment post-deployment to make sure that (a) everyone is using Databricks most effectively and (b) you’re getting the most out of your investment. With Unravel, you get full-stack observability metrics to understand exactly what’s going on with your jobs. But Unravel goes “beyond observability” to not just tell you what’s going on and why, but also tell you what to do about it.

By collecting and contextualizing data from a bunch of different sources—logs, Spark UI, Databricks console, APIs—Unravel’s AI engine automatically identifies where jobs could be tuned to run for higher performance or lower cost, with pinpoint recommendations on how to fix things for greater efficiency. This allows you to tune thousands of jobs on the fly, control costs proactively, and track actual vs. budgeted spend in real time.
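As a flavor of the kind of rule such an engine might apply, here is a minimal, hypothetical check that flags jobs from a couple of metrics. The metric names and thresholds are assumptions for illustration; the real AI engine works from far richer correlated telemetry.

```python
# Hypothetical per-job metrics, as might be stitched together from logs, the Spark UI, and APIs.
jobs = [
    {"name": "daily_sessionize", "executors": 40, "avg_cpu_util": 0.22, "spill_gb": 0},
    {"name": "feature_join",     "executors": 12, "avg_cpu_util": 0.85, "spill_gb": 120},
]

def recommendations(job):
    """Illustrative rules only; real tuning draws on many more signals."""
    recs = []
    if job["avg_cpu_util"] < 0.30:
        recs.append("executors mostly idle: reduce executor count or enable auto-scaling")
    if job["spill_gb"] > 0:
        recs.append("disk spill detected: increase executor memory or repartition the data")
    return recs

for job in jobs:
    for rec in recommendations(job):
        print(f"{job['name']}: {rec}")
```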

Why Databricks?

Amine then presented a concise summary of why he’s seen customers migrate to Databricks from Hadoop, recounting the high costs associated with Hadoop on-prem, the administrative complexity of managing the “zoo of technologies,” the need to decouple compute and storage to reduce waste of unused resources, the need to develop modern AI/ML use cases, not to mention the Cloudera end-of-life issue. He went on to illustrate the advantages and benefits of the Databricks data lakehouse platform, Delta Lake, and how by bringing together the best of Databricks and Azure into a single unified solution, you get a fully modern analytics and AI architecture.

The Databricks lakehouse

He then went on to show how the kind of data-driven approach that Capgemini and Unravel take might look for different technologies migrating from Hadoop to Databricks.

Hadoop to Databricks complexity, ranked

Hadoop migration beyond Databricks

The Hadoop ecosystem has become extremely complicated and fragmented over time. If you look at all the components that might be in your Hortonworks or Cloudera legacy distribution today and try to map them to the Azure modern analytics reference architecture, things get pretty complex.

complex Hadoop environment

Some things are relatively straightforward to migrate over to Databricks—Spark, HDFS, Hive—others, not so much. This is where his team at Azure Data Services can help out. He went through the considerations and mapping for a range of different components, including:

  • Oozie
  • Kudu
  • NiFi
  • Flume
  • Kafka
  • Storm
  • Flink
  • Solr
  • Pig
  • HBase
  • MapReduce
  • and more

He showed how these various services were used to make sure customers are covered, to fill in the gaps and complement Databricks for an end-to-end solution.

Mapping Hadoop to Azure

Check out the full webinar Optimizing and Migrating Hadoop to Azure Databricks on demand.
No form to fill out!

Driving Data Governance and Data Products at ING Bank France


Data+AI Battlescars Takeaways: Driving Data Governance and Data Products at ING Bank France

In this episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Samir Boualla, CDO at ING Bank France, one of the largest banks in the world. They cover his battlescars in Driving Data Governance Across Business Teams and Building Data Products.

At ING Bank France, Samir is the Chief Data Officer. He’s responsible for several teams that govern, develop, and manage data infrastructure and data assets to deliver value to the business. With more than 20 years of experience on various data topics, Samir shares interesting battle-tested techniques in this podcast, including using a process catalog, having a “data minimum standard,” and having a change management mindset. Here are the key takeaways from their chat.

Data Governance

Preparing for Data Governance

  • The first step to preparing for data governance is defining data governance frameworks, guidelines, and principles as part of your organization’s data strategy and discussing them with various stakeholders.
  • Next, you need to get approval and validation from your directors. It is important to have support and commitment from senior leaders since it will be impacting the whole organization.
  • After that, you can identify the appropriate people that should be assigned to roles such as data steward, data owner, or process lead.
  • It is important to implement data governance in parallel with building a data architecture. You can identify and close gaps quickly so you don’t develop something which later on is not compliant with a regulation or with other frameworks.

Having a Data and a Process Catalog

  • In order to monitor compliance proactively, you must know your data. This is where a data catalog plays an important role.
  • A data catalog allows you to have uniform definitions in place, have a data owner for each of the specific data categories, and have people who have more in-depth knowledge and the capability to manage data acting as data stewards.
  • Having a process catalog in addition to a data catalog allows you to seamlessly track the data life cycle.
  • In a process catalog, it’s documented, for each process, what sources are used, who is doing what, and which data is being used in those processes.
  • The process catalog is linked with your data catalog and used to manage your life cycle and apply additional data capabilities, like retention and deletion, on the appropriate systems or appropriate process steps.
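A minimal sketch of how linked data and process catalog entries might be modeled, so that a retention or deletion rule on a data category can be traced to the processes that touch it. The field names and lookup are illustrative assumptions, not ING’s actual catalog schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataCatalogEntry:
    dataset: str
    category: str        # e.g. "customer_pii"
    owner: str           # accountable data owner
    steward: str         # day-to-day data steward
    retention_days: int

@dataclass
class ProcessCatalogEntry:
    process: str
    datasets_used: list = field(default_factory=list)
    responsible: str = ""

data_catalog = {
    "crm.customers": DataCatalogEntry("crm.customers", "customer_pii", "head_of_sales", "jane.d", 365),
}
process_catalog = [
    ProcessCatalogEntry("monthly_churn_report", ["crm.customers"], "analytics_team"),
]

def processes_touching(category: str) -> list:
    """Trace which processes are affected by a rule applied to a data category."""
    affected = {e.dataset for e in data_catalog.values() if e.category == category}
    return [p.process for p in process_catalog if affected & set(p.datasets_used)]

print(processes_touching("customer_pii"))  # ['monthly_churn_report']
```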

Data Minimum Standards

  • Key regulatory frameworks can give you insight and input on the controls and checks that should be part of your data minimum standards for governance.
  • Having these data minimum standards, informed by frameworks such as GDPR and BCBS 239, allows you to proactively monitor for compliance and apply governance.
  • These data minimum standards are part of your compliance framework, so you are able to apply them to all kinds of different processes or departments.
  • ING Bank is constantly assessing if the controls work, testing them for effectiveness, and auditing them every once in a while.

Data Products

Building External-Facing Data Products

  • In preparation for building external-facing data products, you need to have a set of standardized APIs as a product, which you can deliver to third parties for external consumers.
  • At the same time, you should also be using another product as your data catalog to make sure that the data that is being defined and flowing through those APIs are made unique.
  • The biggest challenge ING faces when building external-facing data products is making sure they are acting more or less on the edge of technology and architecture, while also ensuring that they are working towards their goal of becoming a data-driven bank.
  • They encounter several challenges in making sure that their platforms are compatible and can exchange data in the right formats and in the right structure, but also in a way that the infrastructure remains scalable.

Challenges of Building Data Products

  • Even when a product depends on data quality and Samir hopes to identify any data quality issues upfront, somewhere down the line the consumer may still identify issues or have questions regarding the quality of the data. This is where another data product, called remediation, can come into play.
  • Remediation is when a consumer can address data quality issues directly to the appropriate data stewards in the organization. Using other complementary products, a consumer can look into certain data or to a certain report to identify which data point came from where and who’s the data steward or data owner of that specific data. They can then address it automatically in a workflow environment, and request remediation.
  • When building a data product, you may run into manual, legacy processes that have not yet been redesigned.

Change Management

  • Having a change management mindset means that you are willing to implement something new or change something from the legacy based on new data products.
  • A standard data model and data catalog are essential when it comes to change management.
  • By having a single data model and a data catalog, you have a decoupling layer that helps you identify exactly which data point reflects the truth.
  • ING has a broad framework, which allows them to work in a similar, agile way across the organization, and across the globe, whenever something needs to be adjusted.
  • Their data products also allow them to minimize the amount of effort that needs to be put into a change to make it available.

While we’ve highlighted the key talking points here, Sandeep and Samir talked about so much more in the full podcast. Be sure to check out the full episode and the rest of the Data+AI Battlescars podcast series!

Simplifying Data Management at LinkedIn – What is Data Quality


In the second of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In part one, they covered LinkedIn’s challenges related to metadata management and data access APIs. This second part dives deep into data quality.

Kapil has 20+ years of experience in data and infrastructure, at large companies such as Oracle and at multiple startups. At LinkedIn, Kapil has been leading the next generation of big data infrastructure, platforms, tools, and applications to empower data scientists, AI engineers, and app developers to extract value from data. Kapil’s team has been at the forefront of innovation, driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the second part of their insightful chat.

What Does “Data Quality” Mean?

  • Data quality spans many dimensions and can be really broad in its coverage. It answers the questions: Is your data accurate? Do you have data integrity, validity, security, timeliness, and completeness? Is the data consistent? Is it interpretable?
  • LinkedIn measures tons of different metrics, aiming to understand the health of the business, products, and systems.
  • When KPIs differ from the norm, it becomes a question of data quality.
  • When determining the root cause of poor data quality, it can differ for each dimension of quality.
  • For example, when there is metric inconsistency, you must ask yourself if you have an accurate source of truth for your metrics.
  • Timeliness and completeness problems often happen as a result of infrastructure issues. In complex ecosystems you have a lot of moving parts, so a lot can go wrong that impacts timeliness and completeness.
  • Data quality problems often don’t actually manifest themselves as data quality problems in obvious ways. It takes time to monitor, detect, and effectively analyze the root cause of these issues, and remediate the issues.
  • When assessing and improving data quality, it often helps to categorize it in three buckets. There is (1) the monitoring and observability aspect, (2) anomaly detection and root cause analysis for anomalies, and (3) preventing data quality issues in the first place. The last bucket is the best-case scenario.
  • In any complex ecosystem, when something goes wrong for any reason in a pipeline or in a single stage of a data set or a data flow, it can have real consequences on the entire downstream chain. So your goal is to detect problems as close to the source as possible (see the sketch after this list).
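As a small illustration of detecting problems close to the source (the first of the three buckets above), the sketch below runs completeness and timeliness checks on a batch before it flows downstream. The field names and thresholds are assumptions; LinkedIn’s actual tooling, described next, is far more sophisticated.

```python
from datetime import datetime, timedelta, timezone

def check_batch(records, max_age_hours=2, min_completeness=0.99):
    """Return a list of data quality issues found in a batch of event records."""
    issues = []
    now = datetime.now(timezone.utc)

    # Completeness: fraction of records carrying a non-null member_id.
    complete = sum(1 for r in records if r.get("member_id") is not None) / max(len(records), 1)
    if complete < min_completeness:
        issues.append(f"completeness {complete:.1%} below threshold {min_completeness:.0%}")

    # Timeliness: the newest event must be recent enough.
    newest = max((r["event_time"] for r in records if "event_time" in r), default=None)
    if newest is None or now - newest > timedelta(hours=max_age_hours):
        issues.append(f"stale data (newest event: {newest})")

    return issues

batch = [{"member_id": 1, "event_time": datetime.now(timezone.utc) - timedelta(hours=5)}]
print(check_batch(batch))  # flags staleness
```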

How Does LinkedIn Maintain Data Quality?

Unified Metrics Platform

  • It is important to have an evolving schema of your datasets and signals for when something goes awry, to act as markers of data quality.
  • At LinkedIn they built a platform for metric definition and the entire life cycle management of those metrics, called the Unified Metrics Platform. The platform processes all of LinkedIn’s critical metrics – to the point that if it’s not produced by the platform, it wouldn’t be considered a trusted metric. The Unified Metrics Platform defines their source of truth.
  • The company turned to machine learning techniques to improve the detection of anomalies and alerting based on user feedback.

Data Sentinel

  • You can have situations where the overall metric that you’re monitoring may not have a significant deviation, but when you look into the sub spaces within that metric, you find significant deviations. To solve this problem, LinkedIn leveraged algorithms to automatically build structures based on the nature of the data itself and build a multi-dimensional data cube.
  • When you’re unable to pinpoint the root cause of a deviation, it becomes a matter of identifying the space of the key drivers which might impact the particular metric. You narrow that space, present it to users for their feedback, and then continuously refine the system.
  • To detect issues based on the known properties of the data, LinkedIn built a system called the Data Sentinel. This system can take the developer’s knowledge about the dataset and capture it as declarative rules; Data Sentinel then takes on the responsibility of generating the code to perform the data validations (see the sketch after this list).
  • LinkedIn is considering making Data Sentinel open source in the future.
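A minimal sketch of the declarative style described above: the dataset owner states expectations, and generic code turns them into validations. The rule format here is a hypothetical stand-in, not Data Sentinel’s real specification language.

```python
# Hypothetical declarative rules a dataset owner might write (not Data Sentinel's real syntax).
rules = {
    "member_id": {"not_null": True, "unique": True},
    "country":   {"allowed": {"US", "DE", "IN"}},
}

def validate(rows, rules):
    """Generic engine that turns the declarative rules into checks over the rows."""
    errors = []
    for col, spec in rules.items():
        values = [r.get(col) for r in rows]
        if spec.get("not_null") and any(v is None for v in values):
            errors.append(f"{col}: null values found")
        if spec.get("unique") and len(set(values)) != len(values):
            errors.append(f"{col}: duplicate values found")
        if "allowed" in spec and any(v not in spec["allowed"] for v in values if v is not None):
            errors.append(f"{col}: value outside allowed set")
    return errors

rows = [{"member_id": 1, "country": "US"}, {"member_id": 1, "country": "FR"}]
print(validate(rows, rules))  # ['member_id: duplicate values found', 'country: value outside allowed set']
```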

Building a Data Quality Team

  • At LinkedIn, they make sure that team members take the time to treat event schemas as code. They have to be reviewed and checked on. The same goes for metrics. This requires collaboration between different teams. They are constantly talking to each other and coming together to improve not just tools, but also processes.
  • What is accepted as state-of-the-art today is almost guaranteed not to be state-of-the-art tomorrow. So when hiring for a data quality or data management team, it is important to look for people who are naturally curious and have a growth mindset.

If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Be sure to check out Part 1 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!

 

Simplifying Data Management at LinkedIn – Metadata Management and APIs


In the first of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In this first part, they cover LinkedIn’s challenges related to Metadata Management and Data Access APIs. Part 2 will dive deep into data quality.

Kapil has 20+ years of experience in data and infrastructure, at large companies such as Oracle and at multiple startups. At LinkedIn, Kapil has been leading the next generation of big data infrastructure, platforms, tools, and applications to empower data scientists, AI engineers, and app developers to extract value from data. Kapil’s team has been at the forefront of innovation, driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the first part of their insightful chat.

Metadata Management

The Problem: So Much Data

  • LinkedIn manages over a billion data points and over 50 billion relationships.
  • As the number of datasets skyrocketed at LinkedIn, the company found that internal users were spending an inordinate amount of time searching for and trying to understand hundreds of thousands, if not millions, of datasets.
  • LinkedIn could no longer rely on manually generated information about the datasets. The company needed a central metadata repository and a metadata management strategy.

Solution 1: WhereHows

  • The company initiated an in-depth analysis of its Hadoop data lake and asked questions such as: What are the datasets? Where do they come from? Who produced the dataset? Who owns it? What are the associated SLAs? What other sets does a particular dataset depend on?
  • Similar questions were asked about the jobs: What are the inputs and outputs? Who owns the jobs?
  • The first step was the development of an open source system called WhereHows, a central metadata repository to capture metadata across the diverse datasets, with a search engine.

Solution 2: Pegasus

  • LinkedIn discovered that it was not enough just to focus on the datasets and the jobs. The human element had to be accounted for as well. A broader view was necessary, accommodating both static and dynamic metadata.
  • In order to expand the capabilities of the metadata model, the company realized it needed to take a “push” approach rather than a metadata “scraping” approach.
  • The company then built a library called Pegasus to create a push-based model that improved both efficiency and latency.
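To illustrate the push model versus scraping, here is a tiny hypothetical sketch in which the producer emits a metadata change event at write time instead of waiting for a crawler. The event shape and the publish helper are assumptions, not the actual Pegasus or DataHub APIs.

```python
import json
from datetime import datetime, timezone

def publish(topic, event):
    """Stand-in for an event bus producer (e.g. Kafka); here it just prints the event."""
    print(f"{topic}: {json.dumps(event, default=str)}")

def write_dataset(path, schema, owner):
    # ... write the data itself (omitted) ...
    # Push metadata at the moment of change, instead of a crawler scraping it later.
    publish("metadata-change-events", {
        "dataset": path,
        "schema": schema,
        "owner": owner,
        "changed_at": datetime.now(timezone.utc),
    })

write_dataset("/data/tracking/page_views", {"member_id": "long", "url": "string"}, "tracking-team")
```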

The Final Solution: DataHub

  • Kapil’s team then found that you need the ability to query metadata through APIs. You must be able to query the metadata online, so other services can integrate.
  • The team went back to the drawing board and re-architected the system from the ground up based on these learnings.
  • The result was DataHub, the latest version of the company’s open source metadata management system, released last year.
  • A major benefit of this metadata approach is the ability to drive a lot of other functions in the organization that depend on access to metadata.

Data Access APIs

  • LinkedIn recently completely rebuilt its mobile experience, user tracking, and associated data models, essentially needing to “change engines in mid-flight.”
  • The company needed a data API to meet this challenge. One did not exist, however, so they created a data access API named “Dali API” to provide an access layer to offline data.
  • The company used Hive to build the infrastructure and get to market quickly. But in hindsight, using Hive added some dependencies.
  • So LinkedIn built a new, more flexible library called Coral and made it available as open source software. This removed the Hive dependency, and the company will benefit from improvements made to Coral by the community.

If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Keep an eye out for takeaways from the second part of this chat, and be sure to check out Part 2 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!

Recruiting and Building the Data Science Team at Etsy


Data+AI Battlescars Takeaways: Recruiting and Building the Data Science Team at Etsy

In this episode of Data+AI Battlescars (formerly CDO Battlescars), Sandeep Uttamchandani talks to Chu-Cheng, CDO at Etsy. This episode focuses on Chu-Cheng’s battlescars related to recruiting and building a data science team.

Chu-Cheng leads the global data organization at Etsy. He’s responsible for data science, AI innovation, machine learning, and data infrastructure. Prior to Etsy, Chu-Cheng held various data roles at companies including Amazon, Intuit, Rakuten, and eBay. Here are the key talking points from their chat.

Building a Data Science Team: The Early Stages

  • At the early stages of building a data science team, it may be more useful to hire people who are generalists rather than, for example, specialized data scientists or machine learning engineers.
  • In any successful team, you need a mix of different experience levels and skill sets.
  • When building a data science team, Chu-Cheng actually looks for people who probably wouldn’t pass a traditional data science interview, but can still get the job done.
  • For first hires, Chu-Cheng generally targets people who have at least a few years of work experience and have a lot of patience and willingness to learn.
  • A good candidate is someone who can explain a difficult concept in a way that anyone can understand. You want to find someone who knows how to tell people what they are thinking.
  • For example, you can look at their LinkedIn profile and how they write their own self-description, and get an idea even before the interview.
  • In the past, Chu-Cheng often unconsciously looked for someone with whom he had a similar background.
  • To counteract this bias, he learned to create a checklist of criteria prior to interviewing a candidate. He uses that checklist to evaluate the candidate’s qualifications, rather than just picking someone who has a similar background to him.

Building a Data Science Team: The Later Stages

  • Eventually when you have a bigger team, you must start moving from hiring generalists to hiring specialists. If you only hire generalists, you’ll eventually run into a wall, because you have a bunch of people who are fungible.
  • In interviews, Chu-Cheng tries to balance technical and soft skills, even when hiring scientists and engineers.
  • If you are interviewing someone for a manager position, prioritize their mentoring, coaching, conflict resolution, and communication skills. The manager’s success is defined by the team’s success.
  • As a manager, when coaching someone, instead of trying to give out or prescribe an answer, focus on how you can ask the right questions so that the person can come up with a solution on their own. Switch from telling to asking. Allow people to make mistakes so you can coach them to grow and learn a lesson from it.

Innovation at Etsy

  • Chu-Cheng tries to teach his team how to write papers and patents. At Etsy, they encourage this innovation by sending a recognition gift or an innovation award.
  • Papers and patents are not the only types of innovation, however. Innovation also involves the process of making something easier. Not everything can be patented or become a paper, but you can, for instance, write a blog sharing your learning. Innovation is a mindset.
  • When considering a new technology, it is important to get a sense of the circumstances under which you should not use the technology, as well as when to use it. You must know when to use what.

If you’re interested in any of the topics discussed here, Sandeep and Chu-Cheng talked about even more in the full podcast. Be sure to check out the full episode and the rest of the Data+AI Battlescars podcast series!

Developing Data Literacy and Standardized Business Metrics at Tailored Brands


In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Meenal Iyer, Sr. Director of Enterprise Analytics and Data at Tailored Brands. They discuss battlescars in two areas, data and metrics: Growing Data Literacy and Developing a Data-Driven Culture, and Standardization of Business Metrics.

Meenal brings 20+ years of data analytics experience across retail, travel, and financial services. Meenal has been transforming enterprises to become data-driven and shares interesting, domain-agnostic lessons from her experience. Here are the key talking points from their chat.

Growing Data Literacy and Developing a Data-Driven Culture

What Does it Mean to be Data-Driven?
At a very fundamental level, being data-driven means that actions are taken based on insights derived from data.

In order for an enterprise to be truly classified as data-driven, there are a few qualifications that need to be met:

  • Leadership must have a data-driven mentality.
  • The organization has to be data-literate, meaning that everyone in the organization knows that there is an initiative that they are pulling the data for.
  • You need to have a foundational framework of your data.

How to Create a Data-Driven Culture and Increase Data Literacy
The biggest challenge Meenal faced in shifting the culture to a data-driven one is the fact that people often have the mindset that, ”if something is already working, why do we need to fix it?”

  • To change that mindset, it is important to collaborate and communicate with all parties the reasons for making changes in an application.
  • Allow everyone, including leadership, middle management, and end-users, to lay out pros and cons, and address it all, while keeping emotions off the table. Be transparent about both the capabilities and limitations of each application.
  • To increase data literacy in an organization, the education must be top-down. Leadership must communicate why the organization needs to be data-literate, making the end goal clear.

Standardization of Business Metrics

Currently, Meenal is working on building the data foundation needed to build the data science platform at Tailored Brands. One of the biggest challenges she is facing is maintaining consistency in business metrics.

  • It may be challenging to come to a single definition for an enterprise KPI or metric and then identify the data steward for that metric. Sometimes you have to take the lead and choose who will be the data steward.
  • You need to make sure that the source of data that is informing the metric is the right dataset. This dataset should come out of a central organization rather than multiple organizations. This works well because if it is wrong, it is wrong in just that one place.
  • Once a metric is defined, it is built into reporting applications.

If you’re interested in any of the topics discussed here, Sandeep and Meenal talked about even more in the full podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

Creating a Data Strategy & Self-Service Data Platform in FinTech

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Keyur Desai, CDO of TD Ameritrade. They discuss battlescars in two areas: Building a Data Strategy and Pervasive Self-Service Analytics Platforms. Keyur is […]

The post Creating a Data Strategy & Self-Service Data Platform in FinTech appeared first on Unravel.

]]>

In this episode of CDO Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Keyur Desai, CDO of TD Ameritrade. They discuss battlescars in two areas: Building a Data Strategy and Pervasive Self-Service Analytics Platforms.

Keyur is a data executive with over 30 years of experience managing and monetizing data and analytics. He has created data-driven organizations and driven enterprise-wide data strategies in areas including data literacy, modern data governance, machine learning and data science, and pervasive self-service analytics, among others.

He has experience across multiple industries including insurance, technology, healthcare, and retail. Keyur shares some really valuable lessons based on his extensive experience. Here are the key talking points from their chat.

Building a Data Strategy

The Problem: Disconnect Between Business Goals & Technical Infrastructure

  • A data analytics strategy is never singularly built by a data analytics organization. It is absolutely co-created between the business and the data analytics organization.
  • To the business, a data and analytics strategy is a set of data initiatives and analytics initiatives that will be brought together to help them achieve their business outcomes.
  • However, building your data analytics initiatives around desired business outcomes doesn’t guarantee that your technical infrastructure will be built as neatly.

The Solution 1: A Data Economist

To solve this problem, Keyur felt that there was a need for a new role, called a data economist.

  • The data economist’s job is to create a mathematical model of the outcome the business is trying to achieve and backtrack it into attributes of the data, the analytics system, and the technical system.
  • Through these mathematical models, you can estimate not only whether you’re going to meet the business objectives or outcomes, but also model the impact on the company, measured in earnings per share.
  • You can use this model to propose a fact-based plan and engage cross-functional executives in a conversation about why you’re proposing it that way. A toy version of such a model is sketched below.
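
To make the backtracking idea concrete, here is a minimal sketch of what such a model might look like. The attribute names, weights, and dollar figures are illustrative assumptions for this post, not anything Keyur described.

```python
# Toy "data economist" model: backtrack a desired business outcome into
# measurable attributes of the data and analytics systems, then estimate
# the earnings-per-share (EPS) impact of improving them.
# All names and numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class InitiativeAttributes:
    data_coverage: float   # fraction of relevant records captured (0-1)
    data_freshness: float  # fraction of data arriving within SLA (0-1)
    model_accuracy: float  # accuracy of the downstream analytics model (0-1)

def projected_eps_impact(attrs, addressable_revenue, margin, shares_outstanding):
    """Estimate the EPS impact of a data initiative under a simple linear model."""
    # The initiative can only act on data it actually sees, on time,
    # and the acted-on revenue converts at the model's accuracy.
    realized_uplift = (attrs.data_coverage
                       * attrs.data_freshness
                       * attrs.model_accuracy
                       * addressable_revenue)
    incremental_earnings = realized_uplift * margin
    return incremental_earnings / shares_outstanding

# Compare the current state against a proposed improvement.
current = InitiativeAttributes(data_coverage=0.70, data_freshness=0.80, model_accuracy=0.75)
proposed = InitiativeAttributes(data_coverage=0.95, data_freshness=0.95, model_accuracy=0.85)

for label, attrs in [("current", current), ("proposed", proposed)]:
    eps = projected_eps_impact(attrs, addressable_revenue=50_000_000,
                               margin=0.20, shares_outstanding=500_000_000)
    print(f"{label}: projected EPS impact of about ${eps:.4f} per share")
```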

The Solution 2: Data Literacy

Keyur also stressed the importance of establishing data literacy on the business side.

  • A lot of companies will try to create a data literacy program that is overarching across the entire company. However, Keyur has found a lot of success in segmenting the business user base, as well as the technical user base, based on the types of capabilities each segment will need when it comes to data and analytics.
  • A fluency program needs to enable the right segment to translate what they’re seeing from the tool at hand to an insight, and then translate that insight into some kind of implication.
  • Literacy is not just about understanding the data; it’s also about practices to keep it safe and private, as well as being able to effectively tell a story with the data.
  • Establishing data literacy across the business allows them to determine what types of outcomes are even possible with data and to begin to figure out which outcomes they want to chase.

Pervasive Self-Service Analytics Platforms

The Role of the Self-Service Analytics Team

  • The self-service analytics team’s role has shifted from creating reporting assets to enabling data fluency across the organization and watching what data sets are being used by whom to solve what types of problems.
  • There is a balancing act between wanting to provide self-service to everybody while, at the same time, making sure that everyone is doing it in a secure way that does not open up a risk for the corporation.
  • It comes down to making sure that you’ve got a corporate-wide access framework covering every subsystem that can store, move, or share data of any sort, as sketched below.
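
As a purely illustrative sketch of such a unified access check (the roles, datasets, and policy rules are assumptions for this post, not Keyur’s actual framework), every subsystem would consult a single policy before storing, moving, or sharing data:

```python
# Minimal sketch of a corporate-wide access check consulted by every
# subsystem (BI tool, notebook, data prep app) before it stores, moves,
# or shares data. Roles, datasets, and rules are illustrative assumptions.

POLICIES = {
    # dataset -> roles allowed to read it, and whether it may be shared onward
    "customer_pii":  {"read_roles": {"privacy_officer"},            "shareable": False},
    "sales_metrics": {"read_roles": {"analyst", "finance", "exec"}, "shareable": True},
}

def is_allowed(role: str, dataset: str, action: str) -> bool:
    """Return True if `role` may perform `action` ('read' or 'share') on `dataset`."""
    policy = POLICIES.get(dataset)
    if policy is None:
        return False  # default-deny datasets that are not registered
    if action == "read":
        return role in policy["read_roles"]
    if action == "share":
        return role in policy["read_roles"] and policy["shareable"]
    return False

print(is_allowed("analyst", "sales_metrics", "share"))  # True
print(is_allowed("analyst", "customer_pii", "read"))    # False
```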

Data Prep Environments
Up to now, the missing link to developing an integrated, end-to-end self-service model was a self-service data preparation environment that is as easy to use as Excel. We’re now getting there!

  • Data prep environments now allow non-technical business users to get past some of the big bottlenecks they had in the past, like lacking the technical skill to clean up the data.
  • The AI running in the background of data prep environments, combined with what the users around you are doing, is now smart enough to propose some of the cleaning actions you should perform (a toy version of this idea is sketched after this list).
  • Through the data prep environment, you can get dashboards on what data is being used for what purposes, or what kinds of metrics are being created, across the organization.
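
As a rough illustration of what “proposed cleaning actions” can mean in practice, here is a minimal sketch using pandas; the column names, thresholds, and sample data are assumptions made up for this example, not any vendor’s actual product logic.

```python
# Toy sketch of assisted data prep: scan a dataframe and propose cleaning
# actions that a non-technical user could accept with one click.
# Column names, thresholds, and sample data are illustrative assumptions.

import pandas as pd

def propose_cleaning_actions(df: pd.DataFrame) -> list:
    """Return human-readable cleaning suggestions for a dataframe."""
    suggestions = []
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > 0.05:
            suggestions.append(f"'{col}': {null_rate:.0%} missing values - fill or drop?")
        if df[col].dtype == object:
            non_null = df[col].dropna().astype(str)
            # Numeric values stored as text are a common prep bottleneck.
            coerced = pd.to_numeric(non_null, errors="coerce")
            if len(non_null) > 0 and coerced.notna().mean() > 0.9:
                suggestions.append(f"'{col}': looks numeric - convert from text?")
            # Inconsistent casing or whitespace in categorical columns.
            normalized = non_null.str.strip().str.lower()
            if normalized.nunique() < non_null.nunique():
                suggestions.append(f"'{col}': inconsistent casing/whitespace - normalize?")
    return suggestions

orders = pd.DataFrame({
    "amount": ["10.5", "20", None, "7.25"],
    "region": ["West", "west ", "East", "EAST"],
})
for suggestion in propose_cleaning_actions(orders):
    print(suggestion)
```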

The Role of a Business Leader in the Self-Service World
This new self-service world revolutionizes, and even accelerates, how we go about sharing data.

  • With the unified access framework, you can now ensure that people are not only sharing data but also seeing only what they should. More importantly, business intelligence leaders can now watch what people are doing and ensure that everything is going smoothly.
  • All leaders must be aligned around the concept of respect. Respect allows for a team to get to a level of interaction where they can easily bounce ideas off of each other. This breeds innovation.
  • A leader needs to ensure that they trust the people they hire on their team and their colleagues enough to be able to give them enough autonomy to get the job done.
  • In addition to laying out the end goal, leaders must also lay out the intermediate milestones required to reach that goal.
  • Lastly, a leader should be able to see how previous leaders have failed and have a clear sense of purpose around how the new approaches will actually create business value.

If you’re interested in any of the topics discussed here, Sandeep and Keyur talked about even more in the full podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

The post Creating a Data Strategy & Self-Service Data Platform in FinTech appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/creating-a-data-strategy-self-service-data-platform-in-fintech/feed/ 0
CDO Battlescars Podcast Series https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/ https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/#respond Wed, 27 Jan 2021 13:00:28 +0000 https://www.unraveldata.com/?p=5852

Thank you for your interest in the CDO Battlescars Podcast Series. You can access the series here.

The post CDO Battlescars Podcast Series appeared first on Unravel.

]]>

Thank you for your interest in the CDO Battlescars Podcast Series.

You can access the series here.

The post CDO Battlescars Podcast Series appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cdo-battlescars-podcast-series/feed/ 0
Standardizing Business Metrics & Democratizing Experimentation at Intuit https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/ https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/#respond Wed, 27 Jan 2021 13:00:24 +0000 https://www.unraveldata.com/?p=5841

CDO Battlescars Takeaways: Standardizing Business Metrics & Democratizing Experimentation at Intuit CDO Battlescars is a podcast series hosted by Sandeep Uttamchandani, Unravel Data’s CDO. He talks to data leaders across data engineering, analytics, and data science […]

The post Standardizing Business Metrics & Democratizing Experimentation at Intuit appeared first on Unravel.

]]>

CDO Battlescars Takeaways: Standardizing Business Metrics & Democratizing Experimentation at Intuit

CDO Battlescars is a podcast series hosted by Sandeep Uttamchandani, Unravel Data’s CDO. He talks to data leaders across data engineering, analytics, and data science about the challenges they encountered in their journey of transforming raw data into insights. The motivation of this podcast series is to give back to the data community the hard-learned lessons that Sandeep and his peer data leaders have learned over the years. (Thus, the reference to battlescars.)

In this episode, Sandeep talked to Anil Madan about battlescars in two areas: Standardization of Business Metrics and Democratization of Experimentation at Intuit.

At Intuit, Anil was the VP of Data and Analytics for Intuit’s Small Business and Self Employed group. He has over 25 years of experience in the data space across Intuit, PayPal, and eBay, and is now the VP of Data Platforms at Walmart. Anil is a pioneer in building data infrastructure and creating value from data across products, experimentation, digital marketing, payments, Fintech, and many other areas. Here are the key talking points from this insightful chat!

Standardization of Business Metrics

The Problem: Slow and Inconsistent Processing

  • Anil led data and analytics for Intuit’s QuickBooks software. The goal of QuickBooks is to create a platform where they can power all the distinct needs of small businesses seamlessly.
  • As they looked at their customer base, the key metric Anil’s team measured was signups and subscriptions. This metric needed to be consistent in order to have standards that could be relied on to drive business.
  • This metric used to be inconsistent, and the time-to-value ranged from 24 to 48 hours because their foundational pipelines had several hubs.

The Solution: Simplify, Measure, Improve, and Move to Cloud
To solve this problem, Anil shared the following insights:

  • Determine the true metric definition and establish the right counting rules and documentation.
  • Define the right SLA for each data source.
  • Invest deeply in programmatic tools, building a lineage tool called Superglue that traverses from the source all the way into reporting.
  • Create micro-Spark pipelines, moving away from traditional MapReduce and monolithic ways of processing (a minimal sketch follows this list).
  • Migrate to the cloud to auto-scale compute capacity.
  • Establish the right governance so that schema changes in any source are detected through their programs.
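
To ground the micro-pipeline idea, here is a minimal PySpark sketch of one small, single-purpose job. The table paths, schema, and counting rule are hypothetical stand-ins for this example, not Intuit’s actual pipeline.

```python
# Minimal sketch of a "micro" Spark pipeline: one small, focused job that
# computes daily signup counts, instead of a monolithic end-to-end batch.
# Paths, schema, and the counting rule are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_signups_micro_pipeline").getOrCreate()

# Read only what this micro-pipeline needs from the source (hypothetical path).
events = spark.read.parquet("s3://example-bucket/events/")

daily_signups = (
    events
    .filter(F.col("event_type") == "signup")            # the agreed counting rule
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date")
    .agg(F.countDistinct("user_id").alias("signups"))
)

# Write one small, well-defined output that downstream reporting consumes.
(daily_signups
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/metrics/daily_signups/"))
```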

The analytics teams, business leaders, and finance teams all review the business definitions. As new product offerings or new monetization and consumption patterns launch, they check the metrics against these new patterns and ensure that the business definitions still hold true.

Democratization of Experimentation

Moving on to democratizing experimentation: Anil was involved in significantly transforming Intuit’s experimentation trajectory.

Anil breaks this transformation into people, processes, and technology:

  • From a people perspective: understanding where the challenges and handoffs were, determining whether the right skills were in place, and hiring data analysts who specialize in experimentation.
  • On the processing side: ensuring there was a single place to measure what is really happening at every step.
  • On the technology side: looking at several examples from across the industry to decide which platform capabilities to invest in. They invested in things like metrics generation to establish overall evaluation criteria for each experiment. They also looked at techniques to move faster, like sequential testing, interleaving, and even quasi-experimentation (a toy sequential test is sketched after this list).
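
To show what “moving faster” with sequential testing can look like, here is a minimal sketch of a sequential probability ratio test (SPRT) for a conversion-rate experiment. The baseline and target rates, error levels, and simulated data are assumptions invented for this illustration, not details from the episode.

```python
# Toy sequential probability ratio test (SPRT): peek after every observation
# and stop as soon as the evidence is strong enough, rather than waiting for
# a fixed sample size. Rates, error levels, and data are illustrative assumptions.

import math
import random

def sprt(observations, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    """Return a decision string and the number of samples used."""
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1 (lift)
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0 (no lift)
    llr = 0.0                              # running log-likelihood ratio
    for n, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1 (treatment lifts conversion)", n
        if llr <= lower:
            return "accept H0 (no meaningful lift)", n
    return "inconclusive", len(observations)

random.seed(42)
simulated = [random.random() < 0.12 for _ in range(20000)]  # true rate is 12%
decision, n_used = sprt(simulated)
print(f"{decision} after {n_used:,} observations")
```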

Focusing on key metrics and ensuring that there are no mistakes in setting up and running the experiment became extremely important. They invested in a lot of programmatic ways to do that.

Anil’s team looked into where time was spent in the overall experimentation process and then focused on addressing those areas through a combination of building tools in-house, leveraging open-source software, and buying software where needed.

Focusing on, and investing in, the foundational pieces of how your pipelines work and how you analyze things is key.

Though we highlighted the key talking points, Sandeep and Anil talked about so much more in the 29-minute podcast. Be sure to check out the full episode and the rest of the CDO Battlescars podcast series!

Listen to Podcast on Spotify

The post Standardizing Business Metrics & Democratizing Experimentation at Intuit appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/standardizing-business-metrics-democratizing-experimentation-at-intuit/feed/ 0
Webinar Achieving Top Efficiency in Cloud Big Data Operations https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/ https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/#respond Thu, 23 Apr 2020 19:22:02 +0000 https://www.unraveldata.com/?p=5042

The post Webinar Achieving Top Efficiency in Cloud Big Data Operations appeared first on Unravel.

]]>

The post Webinar Achieving Top Efficiency in Cloud Big Data Operations appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/webinar-achieving-top-efficiency-in-cloud-big-data-operations/feed/ 0