Videos Archives - Unravel

How Shopify Fixed a $1 Million Single Query https://www.unraveldata.com/resources/how-shopify-fixed-a-1-million-single-query/ Fri, 26 Jan 2024 20:04:34 +0000

A while back, Shopify posted a story about how they avoided a $1 million query in BigQuery. They detail the data engineering that reduced the cost of the query to $1,300 and share tips for lowering costs in BigQuery.

Kunal Agarwal, CEO and Co-founder of Unravel Data, walks through Shopify’s approach to clustering tables as a way to bring the price of a highly critical query down to a more reasonable monthly cost. But clustering is not the only data engineering technique available to run far more efficiently—for cost, performance, reliability—and Kunal brings up 6-7 others.
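For readers who want a concrete picture of the clustering technique discussed here, the sketch below shows how a clustered (and partitioned) table can be defined with the google-cloud-bigquery Python client. The project, dataset, schema, and column names are hypothetical, not Shopify's actual setup, and a recent client-library version is assumed.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

# Hypothetical table and columns -- not Shopify's real schema.
table_id = "my-project.analytics.orders_clustered"

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("shop_id", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField("total_price", "NUMERIC"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by date and cluster on the column most queries filter by,
# so each query scans (and is billed for) far less data.
table.time_partitioning = bigquery.TimePartitioning(field="created_at")
table.clustering_fields = ["shop_id"]

table = client.create_table(table)
print(f"Created {table.full_table_id}, clustered on {table.clustering_fields}")
```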

Few organizations have the data engineering muscle of Shopify. All the techniques to keep costs low and performance high can entail a painstaking, toilsome slog through thousands of telemetry data points, logs, error messages, etc.

Unravel’s customers understand that they cannot possibly have people go through hundreds of thousands or more lines of code to find the areas to optimize; rather, this is all done better and faster with automation and AI, whether your data runs on BigQuery, Databricks, or Snowflake.

Kunal shows what that looks like in a 1-minute drive-by.

Get Unravel Standard data observability + FinOps for free. Forever.
Get started here

The post How Shopify Fixed a $1 Million Single Query appeared first on Unravel.

Equifax Optimizes GCP Costs at Scale https://www.unraveldata.com/resources/equifax-optimizes-gcp-at-scale/ Mon, 12 Jun 2023 16:49:28 +0000

The post Equifax Optimizes GCP Costs at Scale appeared first on Unravel.
Managing FinOps at Equifax https://www.unraveldata.com/resources/managing-finops-at-equifax/ Mon, 12 Jun 2023 16:49:11 +0000

The post Managing FinOps at Equifax appeared first on Unravel.
DBS Empowers Self-Service Engineering with Unravel https://www.unraveldata.com/resources/dbs-empowers-self-service-engineering-with-unravel/ Thu, 25 May 2023 22:30:28 +0000

The post DBS Empowers Self-Service Engineering with Unravel appeared first on Unravel.
DBS Discusses Data+FinOps for Banking https://www.unraveldata.com/resources/dbs-bank-discusses-data-finops-for-banking/ Thu, 25 May 2023 22:30:14 +0000

The post DBS Discusses Data+FinOps for Banking appeared first on Unravel.
Enabling Strong Engineering Practices at Maersk https://www.unraveldata.com/resources/enabling-strong-engineering-practices-at-maersk/ Thu, 09 Feb 2023 17:32:45 +0000

As DataOps moves along the maturity curve, many organizations are deciphering how to best balance the success of running critical jobs with optimized time and cost governance.

Watch the fireside chat from Data Teams Summit where Mark Sear, Head of Data Platform Optimization for Maersk, shares how his team is driving towards enabling strong engineering practices, design tenets, and culture at one of the largest shipping and logistics companies in the world. Transcript below.

Transcript

Kunal Agarwal:

Very excited to have a fireside chat here with Mark Sear. Mark, you’re the director of data integration, AI, machine learning, and analytics at Maersk. And Maersk is one of the largest shipping line and logistics companies in the world. Based out of Copenhagen, but with subsidiaries and offices across 130 countries with about 83,000 employees worldwide. We know that we always think about logistics and shipping as something just working harmoniously, transparently in the background, but in the recent past, given all of the supply chain pressures that have happened with the pandemic and beyond, and even that ship getting stuck in the Suez Canal, I think a lot more people are paying attention to this industry as well. So I was super excited to have you here, Mark, to hear more about yourself, you as the leader of data teams, and about what Maersk is doing with data analytics. Thank you so much for joining us.

Mark Sear:

It’s an absolute pleasure. You’ve just illustrated the perils of Wikipedia. Maersk is not just one of the largest shipping companies in the world, but we’re also actually one of the largest logistics companies in the world. We have our own airline. We’ve got hundreds of warehouses globally. We’re expanding massively, so we are there, and of course we are a leader in decarbonization. We’ve got a pledge to be carbon-neutral way before just about anybody else. So it’s a fantastic company to work at. Often I say to my kids, we don’t just deliver stuff, we are doing something to help the planet. It’s a bigger mission than just delivering things, so it’s a pleasure to be here.

Kunal Agarwal:

That’s great. Mark, before we get into Maersk, we’d love to learn about you. So you have an amazing background and accumulation of all of these different experiences. Would you help the audience to understand some of your interests and how you got to be in the role that you currently are at? And what does your role comprise inside of Maersk?

Mark Sear:

Wow. It’s a long story. I’m an old guy, so I’m just a couple of years over 60 now, to which you could say you don’t look it, but don’t worry about it.

Kunal Agarwal:

You don’t look it at all, only 40.

Mark Sear:

I’m from a generation where not many of us went to university, so let me start there. So I left school at 18, did a bit of time in the basic military before going to what you would call, I suppose fundamentally, a crypto analyst school. They would detect how smart you were, whether you had a particular thing for patterns, and they sent me there. Did that, and then since then I’ve worked in banking, in trading in particular. I ran a big trading group for a major bank, which was great fun, so we were using data all the time to look for both, not just arbitrage, but other things. Fundamentally, my life has been about data.

Kunal Agarwal:

Right.

Mark Sear:

Even as a kid, my dad had a very small business and because he didn’t know anything about computers, I would do the computing for him and work out the miles per gallon that his trucks were getting and what the trade-in was.

Kunal Agarwal:

Sure.

Mark Sear:

And things like that. So data’s been part of my life and I love everything about data and what it can do for people, companies, everything. Yeah, that’s it. Data.

Kunal Agarwal:

That’s great, Mark. Obviously this is a conference about data teams, so it’s great to hear from a data guy who’s been doing it for a really long time. So, Mark, to begin, Maersk, as you said, is one of the largest shipping and logistics companies in the world. How has data transformed your company?

Mark Sear:

One thing, this is a great question. How has it transformed and how will it transform?

Kunal Agarwal:

Yes.

Mark Sear:

I think that for the first time in the last couple of years, and I’ve been very lucky, I’ve only been with the company three years, but shortly after I joined, we had a new tech leader, a gentleman called Navneet Kapoor. The guy is a visionary. If you imagine, shipping was seen for many years as a bit of a backwater really. You move containers from one country to another country on ships, that was it. Navneet has changed the game for us all and made people realize that data is pervasive in logistics. It’s literally everywhere. If you think about our biggest ship class, for example, it’s called an E-Class. That can take over 18,000 shipping containers on one journey from China to Europe, 18,000.

Kunal Agarwal:

Oh wow.

Mark Sear:

Think about that. So that’s absolutely huge. Now, to put that into context, in one journey, one of those ships will move more goods than was moved in the entire 19th century between continents, one journey. And we’ve got six of them and they’re going backwards and forwards all the time. So the data has risen exponentially and what you can do with it, we are now just starting to get to grips with it, that’s what’s so exciting. Consider, we have companies that do want to know how much carbon is being produced as part of their products. We have things like that. We just have an incredibly diverse set of products.

To give you an example, I worked on a project about 18 months ago where we worked out, working in tandem with a couple of nature organizations, that if a ship hits a whale at 12 knots and above, that whale will largely die. If you hit it below 12 knots, it will live. It’s a bit like hitting an adult at 30 miles an hour versus 20. The company puts some money in so we could use the data for where the whales were to slow the ships down. So this is an example of where this company doesn’t just think about what can we do to make money. This is a company that thinks about how can we use data to better the company, make us more profitable, and at the same time, put back into the planet that gave us the ability to have this business.

Kunal Agarwal:

Let’s not forget that we’re human, most importantly.

Mark Sear:

Yeah, it’s super exciting, right? You can make all the money in the world. If you trash the planet, there’s not a lot left to enjoy as part of it. And I love that about this company.

Kunal Agarwal:

Absolutely. And I’m guessing with the pandemic and post-pandemic, and all of the other data sets that you guys are gathering anyways from sensors or from the shipping lines or from all the efficiencies, with all the proliferation of all this data inside your organization, what challenges has your team faced or does the Maersk data team face?

Mark Sear:

Well, my team is in the enterprise architecture team. We therefore deal with all the other teams that are dealing with data, and I think we’ve got the same challenges as everybody. Have we got the data quality right? Do we know where that data comes from? Are we processing it efficiently? Do we have the right ideas to work on the right insights to get value out of that data? I think they’re common industry things, and as with everything, it’s a learning process. So one man’s high-quality data is another woman’s low-quality data.

And depending on who you are and what you want to do with that data, people have to understand how that quality affects other people downstream. And of course, because you’re quite right, we did have a pandemic, and in the pandemic shipping rates went a little bit nuts and they’re normalizing now. But, of course, if you think about introducing predictive algorithms where the price is going vertically and the algorithm may not know that there’s a pandemic on, it just sees price. So I think what we find is challenging, same as everybody else, is how do you put that human edge around data? Very challenging. How do you build really high-performing teams? How do you get teams to truly work together and develop that esprit de corps? So there are a lot of human problems that go alongside the data problems.

Kunal Agarwal:

Yeah. Mark, give us a sense of your size. In terms of teams, applications, whatever would help us understand what you guys were, where you guys are, and where you guys headed.

Mark Sear:

Three years ago when I joined there were 1,900 people in tech; we’ve now got nearly 6,000. We had a huge amount of outsourcing; now we’re insourcing, we’re moving to an open-source-first, event-based company. We’ve been very acquisitive. We’ve bought some logistics companies, so we’ve gone on the end-to-end journey now as the logistics integrator of choice globally. We’ve got our own airline. So you have to think about a lot of things that play together.

My team is a relatively tiny team. We’ve got about 12, but we liaise with, for example, our global data and analytics team that has got 600 people in it. We’re then organized into platforms, which are vertically problem-solving but fully horizontally integrated, passing events between them. And each one of those has their own data team in it as well. So overall, I would guess we’ve got 3,000 people working directly with data in IT and then of course many thousands more.

Kunal Agarwal:

Wow.

Mark Sear:

Out in the organization. So it’s a big organization. Super exciting. I should say, now I’m going to get a quick commercial in. If you are watching this and you are a top data talent, please do hit me up with your resume.

Kunal Agarwal:

There’s a couple of thousand people watching this live, so you’ll definitely...

Mark Sear:

Hey, there you go, man. So listen, as long as they’re quality, I don’t care.

Kunal Agarwal:

And Mark’s a great boss as well. So when you think about the maturity curve of data operations, where do you think Maersk is at and what stands in your way to be fully matured?

Mark Sear:

Okay, so let’s analyze that. I think the biggest problem in any maturity curve is not defining the curve. It’s not producing a pyramid to say we are here and a dial to say, well, you rank as a one, you want to rank as a five.

Kunal Agarwal:

Sure.

Mark Sear:

The biggest problem to me is the people that actually formulate that curve. Now everyone’s got staff turnover and everyone or the majority of people know that they’re part of a team. But the question is how do you get that team to work with other teams and how do you disseminate that knowledge and get that group think of what is best practice for DataOps? What is best practice for dealing with these problems?

Kunal Agarwal:

It’s almost a spectrum on the talent side, isn’t it?

Mark Sear:

It’s a spectrum on the talent side, there’s a high turnover because certainly in the last 12 to 18 months, salaries have been going crazy, so you’ve had crazy turnover rates in some areas, not so much in other areas. So the human side of this is one part of the problem, and it’s not just the human side to how do you keep them engaged, it’s how do you share that knowledge and how do you get that exponential learning organization going?

And perhaps when we get into how we’ve arrived at tools like Unravel, I’ll explain to you what my theory is on that, but it’s almost a swarm learning that you need here, an ants style learning of how to solve problems. And that’s the hardest thing, is getting everybody in that boat swimming in the same direction before you can apply best practices because everybody says this is best practice. Sure, but if it was as simple as looking at a Gartner or whoever thing and saying, “Oh, there are the five lines we need to do,” everybody would do it. There’d be no need for anybody to innovate because we could do it; human beings aren’t very good at following rules, right?

Kunal Agarwal:

Yeah. So what kind of shifts and changes did you have to make in your big data operations and tools that you had to put into place for getting that maturity to where you expected it to be?

Mark Sear:

I think the first thing we’ve got to do, we’ve got to get people thinking in a slightly shorter timeframe. So everybody talks about Agile, Agile, Agile.

Kunal Agarwal:

Right.

Mark Sear:

Agile means different things to different people. We had some people who thought that Agile was, “Well, you’re going to get a fresh data set at the end of the day, so what the heck are you complaining about? When I started 15 years ago, you got it weekly.” That’s not agile. Equally, you’ve got people who say, I need real-time data. Well, do you really need real-time data if you’re actually dealing with an expense account? You probably don’t.

Kunal Agarwal:

Right.

Mark Sear:

Okay, so the first thing we’ve got to do is level set expectations of our users and then we’ve got to dovetail what we can deliver into those. You’ve got to be business focused, you’ve got to bring value. And that’s a journey. It’s a journey for the business users.

Kunal Agarwal:

Sure.

Mark Sear:

It’s a journey for our users. It’s about learning. So that’s what we’re doing. It’s taking time. Yeah, it’s taking time, but it’s like a snowball. It is rolling and it is getting bigger and it’s getting better, getting faster.

Kunal Agarwal:

And then when you think about the tools, Mark, are there any that you have to put into place to accelerate this?

Mark Sear:

I mean, we’ve probably got one of everything to start and now we’re shrinking. If I take . . . am I allowed to talk about Unravel?

Kunal Agarwal:

Sure.

Mark Sear:

So I’ll talk about–

Kunal Agarwal:

As much as you would.

Mark Sear:

–Unravel for a few seconds. So if you think about what we’ve got, let’s say we’ve got 3,000 people, primarily relatively young, inexperienced people churning out Spark code, let’s say Spark Databricks code, and they all sit writing it. And of course if you are in a normal environment, you can ask the person next to you, how would you do this? You ask the person over there, how would you do this? We’ve had 3,000 engineers working from home for two years, even now, they don’t want to come into the office per se, because it’s inconvenient, number one, because you might be journeying an hour in and an hour home, and also it’s not actually, truly, as productive. So the question is how do you harvest that group knowledge and how do people learn?

So for us, we put Unravel in to look at and analyze every single line of code we write and come up with those micro suggestions and indeed macro suggestions that you would miss. And believe me, we’ve been through everything like code walkthroughs, code dives, all those things. They’re all standard practice. If you’ve got 2,000 people and they write, let’s say, 10 lines of code a day each, 20,000 lines of code, you are never going to walk through all of that code. You are never going to be able to level set expectations. And this is key to me, be able to go back to an individual data engineer and say, “Hey, dude, listen, about these couple of lines of code. Did you realize if you did it like this, you could be 10 times as efficient?” And it’s about giving that feedback in a way that allows them to learn themselves.

And that’s what I love about Unravel: you get the feedback, but nobody says, “Come into my office, let’s have a chat about these lines of code.” You go into your private workspace, it gives you the suggestions, you deal with the suggestions, you learn, you move on, you don’t make the mistakes again. And they may not even be mistakes. They might just be things you didn’t know about.
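To make the kind of micro suggestion Mark describes concrete, here is a hypothetical illustration in PySpark. It is not Maersk code and not an actual Unravel recommendation, just a common example of a small change with an outsized effect: broadcasting a small table to avoid a shuffle-heavy join.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical paths: a large fact table and a small dimension table.
shipments = spark.read.parquet("/data/shipments")
ports = spark.read.parquet("/data/ports")

# Before: joining a huge table to a tiny one can still shuffle the huge
# side across the whole cluster.
slow = shipments.join(ports, "port_code")

# After: broadcasting the small table ships one copy to every executor
# and avoids the shuffle entirely -- often an order-of-magnitude saving.
fast = shipments.join(broadcast(ports), "port_code")
```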

Kunal Agarwal:

Right.

Mark Sear:

And so because Unravel takes data from lots of other organizations as well, as I see it, we’re in effect harvesting the benefits of hundreds of thousands of coders globally, of data engineers globally. And we are gaining the insights that we couldn’t possibly gain by being even the best self-analysis on the planet, you couldn’t do it without that. And that to me is the advantage of it. It’s like that swarm mentality. If anybody watching this has ever had a look at swarm AI, you can use it to predict events. It’s like if you take a soccer game, and I’ve worked in gambling, if you take a soccer match and you take a hundred people, I’ll call it soccer, even though the real name for it is football, you Americans.

Kunal Agarwal:

It’s football, I agree too.

Mark Sear:

It’s football, so we’re going to call it football, association football to give it its full name. If you ask a hundred football fans to predict a score, you’ll get a curve, and you’ll generally, from that predictor, get a good result. Way more accurate than asking 10 so-called experts. Same with code. And that’s what we’re finding with Unravel is that sometimes it’s the little nuances that just pop up that are giving us more benefits.

Kunal Agarwal:

Right.

Mark Sear:

So it’s pivotal to how we are going to get benefits out over the longer term of what we’re doing.

Kunal Agarwal:

That’s great. And we always see a spectrum of skills inside an organization. So our mission is trying to level the playing field so anybody, even a business user, can log in without knowing the internals of all of these complex data technologies. So it’s great to hear the way Maersk is actually using it. We spoke a little bit about making these changes. We’d love to double click on some of these hurdles, right? Because you said it was a journey to get people to this mature or fast-moving data operations, if you may, or more agile data operations, if you may. If we can double click for a second, what has been the biggest hurdle? Is it the mindset? Is it managing the feedback loop? Is it changing the practices? Is it getting new types of people on board? What has been the biggest hurdle?

Mark Sear:

Tick all of the above.

Kunal Agarwal:

Okay.

Mark Sear:

But I think–

Kunal Agarwal:

Pick for option E.

Mark Sear:

Yeah, so let me give you an example. There are several conversations I’ve had with people that have said to me, “I’ve been doing this 25 years. There’s nothing, I’ve been doing it 25 years.” That presupposes that 25 years of knowledge and experience is better than 10 minutes with a tool that’s got 100,000 years of learning.

Kunal Agarwal:

Right.

Mark Sear:

Over a 12-month period. So I classify that as the ego problem. Sometimes people need their ego brushing, sometimes they need their ego crushing. It’s about looking the person in the eye, working out what’s the best strategy of dealing with them and saying to them, “Look, get on board.” This isn’t about saying you are garbage or anything else. This is about saying to you, learn and keep mentoring other people as you learn.

Kunal Agarwal:

Yeah.

Mark Sear:

I remember another person said to me, “Oh my god, I’ve seen what this tool can do. Is AI going to take my job?” And I said to them, no, AI isn’t going to take your job, but if you’re not careful, somebody, a human being that is using AI will take it, and that doesn’t apply to me. That applies just in general to the world. You cannot be a Luddite, you cannot fight progress. And as we’ve seen with Chat GPT and things like that recently, the power of the mass of having hundreds and thousands and millions of nodes analyzing stuff is precisely what will bring that. For example, my son who’s 23, smart kid, well, so he tells me. Smart kid, good uni, good university, blah blah blah. He said to me, “Oh Tesla, they make amazing cars.” And I said to him, Tesla isn’t even a car company. Tesla is a data company that happens to build a fairly average electric car.

Kunal Agarwal:

Absolutely.

Mark Sear:

That’s it. It’s all about data. And I keep saying to my data engineers, to be the best version of you at work and even outside work, keep picking up data about everything, about your life, about your girlfriend, the way she feels. About your boyfriend, the way he feels. About your wife, your mother. Everything is data. And that’s the mindset. And the biggest thing for me, the biggest issue has been getting everybody to think and recognize how vital data is in their life, and to be open to change. And as we all know, throughout the whole cycle of humanity, a lack of openness to change is what’s held humanity back. I seek to break that as well.

Kunal Agarwal:

I love that Mark. Switching gears, we spoke a little bit about developer productivity. We spoke about agility and data operations. Maersk obviously runs, like you were explaining, a lot of their data operations on the cloud. And as we see a lot of organizations when they start to get bigger and bigger and bigger in use cases on the cloud, cost becomes a front and center or a first-class citizen conversation to have. Shed some light on that for us. What is that maturity inside of Maersk, or how do you think about managing costs and budgets and forecast on the cloud, and what’s the consequence of not doing that correctly?

Mark Sear:

Well, there are some things that I can’t discuss because they’re obviously internal, but I think, let’s say I speak to a lot of people in a lot of companies, and there seem to be some themes that run everywhere, which is there’s a rush towards certain technologies, and people, they test it out on something tiny and say, “Hey, isn’t this amazing? Look how productive I am.” Then they get into production and somebody else says, “That’s really amazing. You were very productive. But have you seen what comes out the other end? It’s a cost, a bazillion dollars an hour to run it.” Then you’ve got this, I think they called it the Steve Jobs reality distortion field, where both sets of people go into this weird thing of, “Well, I’m producing value because I’m generating code and isn’t it amazing?” And the other side is saying, “Yeah, but physically the company’s going to spend all its money on the cloud. We won’t be able to do any other business.”

Kunal Agarwal:

Yeah.

Mark Sear:

So we are now getting to a stage where we have some really nice cost control mechanisms coming in. For me, it’s all in the audit. And crucial to this is do it upfront. Do it in your dev environment. Don’t go into production, get a giant bill and then say, how do I cut that bill? Which is again, where we’ve put Unravel now, right in the front of our development environment. So nothing even goes into production unless we know it’s going to work at the right cost price. Because otherwise, you’ve just invented the world’s best mechanism for closing the stable door after the cost horse has bolted, right?

Kunal Agarwal:

Right.

Mark Sear:

And then that’s always a pain because post-giant-bill examinations are really painful; it’s a bit like medicine. I don’t know if you know, but in China, you only pay a doctor when you are well. As soon as you are sick, you stop paying bills and they have to take care of you. So that to me is how we need to look at cost.

Kunal Agarwal:

I love that. Love that analogy.

Mark Sear:

Do it upfront. Keep people well, don’t ever end up with a cost problem. So that’s again, part of the mindset. Get your data early, deal with it quickly. And that’s the level of maturity we are getting to now. It’s taking time to get there. We’re not the only people, I know it’s everywhere. But I would say to anybody, I was going to say lucky enough to be watching this, but that’s a little bit cocky, isn’t it? Anybody watching this? Whatever you do, get in there early, get your best practice in as early as possible. Go live with fully costed jobs. Don’t go live, work out what the job cost is and then go, how the hell do I cut it?

Kunal Agarwal:

Yeah.

Mark Sear:

Go live with fully costed jobs and work out well, if it costs this much in dev test, what’s it going to cost in prod? Then check it as soon as it goes live and say, yeah, okay, the delta’s right, game on. That’s it.
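A minimal sketch of the “fully costed jobs” gate Mark describes might look like the snippet below, which fails a promotion step when a job’s projected production cost exceeds its budget. The projection logic, numbers, and job names are entirely hypothetical; in practice the per-run dev/test cost would come from your observability tooling.

```python
# Hypothetical shift-left cost gate for a CI/CD pipeline.

def projected_monthly_cost(dev_cost_usd: float, scale_factor: float,
                           runs_per_month: int) -> float:
    """Naive projection: assume cost grows roughly linearly with data volume."""
    return dev_cost_usd * scale_factor * runs_per_month

def cost_gate(job_name: str, dev_cost_usd: float, scale_factor: float,
              runs_per_month: int, monthly_budget_usd: float) -> None:
    projected = projected_monthly_cost(dev_cost_usd, scale_factor, runs_per_month)
    if projected > monthly_budget_usd:
        raise SystemExit(
            f"{job_name}: projected ${projected:,.0f}/month exceeds the "
            f"${monthly_budget_usd:,.0f} budget -- optimize before go-live."
        )
    print(f"{job_name}: projected ${projected:,.0f}/month is within budget.")

# Made-up numbers: $4.20 per dev run, 50x more data in prod, 30 runs a month.
cost_gate("daily_shipment_rollup", dev_cost_usd=4.20, scale_factor=50,
          runs_per_month=30, monthly_budget_usd=5000)
```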

Kunal Agarwal:

So measure twice, cut once, and then you’re almost shifting left. So you’re leaving it for the data engineers to go and figure this out. So there’s a practice that’s emerging called FinOps, which is really a lot of these different groups of teams getting together to solve exactly this problem of understanding what the cost is, optimizing what the cost is, and then governing what the cost is as well. So who within your team does what? I’m sure the audience would love to hear that a little bit.

Mark Sear:

Pretty much everybody will do everything, every individual data engineer, man, woman, child, whatever, will be, but we’re not using child labor incidentally, that was...

Kunal Agarwal:

Yeah, let’s clarify that one for the audience.

Mark Sear:

That’s a joke. Edit that out. Every person will take it on themselves to do that because ultimately, I have a wider belief that every human being wants to do the right thing, given everything else being equal, they want to do the right thing. So I will say to the people that I speak to as data engineers, as new data engineers, I will say to them, we will empower you to create the best systems in the world. Only you can empower yourself to make them the most efficient systems in the world.

Kunal Agarwal:

Interesting.

Mark Sear:

And by giving it to them and saying, “This is a matter of personal pride, guys,” at the end of the day, am I going to look at every line of your code and say, “You wouldn’t have got away with that in my day”? Of course not. When I started in IT, this is how depressingly sad it is, we had 16K of main memory on the main computer for a bank in an IBM mainframe, and you had to write out a form if you wanted 1K of disk. So I was in a similar program in those days. Now I’ve got a phone with God knows how much RAM on it.

Kunal Agarwal:

Right, and anybody can spin up a cloud environment.

Mark Sear:

Absolutely. I can push a button, spin up whatever I want.

Kunal Agarwal:

Right.

Mark Sear:

But I think the way to deal with this problem is to, again, push it left. Don’t have somebody charging in from finance waving a giant bill saying, “Guys, you are costing a fortune.” Say to people, let’s just keep that finance dude or lady out of the picture. Take it on yourself. Show a bit of pride, develop this esprit de corps, and let’s do it together.

Kunal Agarwal:

Love it. Mark, last question. This is a fun one and I know you’re definitely going to have some fun answer over here. So what are your predictions for this data industry for this year and beyond? What are we going to see?

Mark Sear:

Wow, what do I think? Basically–

Kunal Agarwal:

Since you’ve got such a pulse on the overall industry and market.

Mark Sear:

So to me, the data industry, obviously it’ll continue to grow. But I’ll say this: over a couple of years you see that in technology, on many levels, we’re actually a fashion industry. If the fashion is to outsource, everybody outsources. If the fashion is to insource, everybody does. Women’s skirts go up, fashion changes, they come down. Guys wear flared trousers, guys wear narrow trousers, and nobody wants to be out of fashion. What I think’s going to happen is data is going to continue to scale, and quantum computing will take off within a few years. What’s going to happen is your CEO is going to say, “Why have I got my data in the cloud and in really expensive data centers when someone has just said that I can put the whole of our organization on this and keep it in the top drawer of my desk?”

And you will have petabyte, zettabyte scale in something that can fit in a shoebox. And at that point it’ll change everything. I will probably either be dead, or at least hopefully retired and doing something by then. But I think it is for those people that are new to this industry, this is an industry that’s going to go forever. I personally hope I get to have an implant in my head at some point from Elon. I will be going for, I’m only going to go for version two. I’m not going for version one and hopefully–

Kunal Agarwal:

Yeah, you never want to go for V1.

Mark Sear:

Exactly, absolutely right. But, guys, ladies, everybody watching this, you are in the most exciting part, not just of technology, of humanity itself. I really believe that, of humanity itself, you can make a difference that very few people on the planet get to make.

Kunal Agarwal:

And on that note, I think the big theme that we have going in this series is that we strongly feel data teams are running the world and will continue to run the world. Mark, thank you so much for sharing these exciting insights, and it’s always fun having you. Thank you for making the time.

Mark Sear:

Complete pleasure.

The post Enabling Strong Engineering Practices at Maersk appeared first on Unravel.

Maximize Business Results with FinOps https://www.unraveldata.com/resources/maximize-business-results-with-finops/ Thu, 09 Feb 2023 17:32:33 +0000

As organizations run more data applications and pipelines in the cloud, they look for ways to avoid the hidden costs of cloud adoption and migration. Teams seek to maximize business results through cost visibility, forecast accuracy, and financial predictability.

Watch the breakout session video from Data Teams Summit and see how organizations apply agile and lean principles using the FinOps framework to boost efficiency, productivity, and innovation. Transcript available below.

Transcript

Clinton Ford:

Hi, and welcome to this session, Maximize Business Results with FinOps. I’m Clinton Ford, director of Product Marketing at Unravel, and I’m joined today by Thiago Gil, an ambassador from the FinOps Foundation and a KubeCon + CloudNativeCon 2021/2022 Kubernetes AI Day Program Committee member. Great to have you with us today, Thiago. Thank you.

Now, if you have any questions during our session, please feel free to put those in the Q&A box, or visit the Unravel booth after this session. We’re happy to answer your questions.
So today, we’ll talk about some of the challenges that you face with cloud adoption and how FinOps empowers you and your team to harness those investments, maximize business results, and we’ll share some success stories from companies who are applying these principles. Then Thiago is going to share the state of production machine learning.

So among the challenges that you face, the first is visibility. Simply understanding and deciphering the cloud bill can be an enormous hurdle, and forecasting spend can be really difficult to do accurately.

The second is how to optimize your costs once you get visibility. There are complex dependencies, as you know, within your data pipeline. So making adjustments to resources can have downstream effects, and you don’t want to interrupt the flow of those pipelines.

Finally, governance. So governing that cloud spending is hard. With over 200 services and over 600 instance types on AWS alone, it’s difficult to define what good looks like. The result is that on average, organizations report their public cloud spend is over budget by 13%, and they expect cloud spending to increase by 29% this year.

Observability is key here, because it unlocks several important benefits. First, visibility: just getting full visibility to understand where spending is going and how teams are tracking towards their budgets.

Then granularity: seeing the spending details by data team, by pipeline, by data application, product, or division. And forecasting: seeing those trends and being able to project out accurately to help predict future spending and profitability.

So data management represents approximately 40% of the typical cloud bill, and data management services are the fastest growing category of cloud service spending. It’s also driving a lot of the incredible revenue growth that we’ve seen, and innovation in products.

When you combine the best of DataOps and the best of FinOps, you get DataFinOps, and DataFinOps empowers data engineers and business teams to make better choices about your cloud usage. It helps you get the most from your modern data stack investments.

A FinOps approach, though, isn’t just about slashing costs. Although you’ll almost invariably wind up saving money, it’s about empowering data engineers and business teams to make better choices about their cloud usage, and derive the most value from their modern data stack investments.

Managing costs consists of three iterative phases. The first is getting visibility into where the money is going, measuring what’s happening in your cloud environment, understanding what’s going on in a workload aware context. Once you have that observability, next you can optimize. You begin to see patterns emerge where you can eliminate waste, remove inefficiencies and actually make things better, and then you can go from reactive problem solving to proactive problem preventing, sustaining iterative improvements in automating guardrails, enabling self-service optimization.

So each phase builds upon the previous one to create a virtuous cycle of continuous improvement and empowerment for individual team members, regardless of their expertise, to make better decisions about their cloud usage while still hitting their SLAs and driving results. In essence, this shifts the budget left, pulling accountability for managing costs forward.

So now let me share a few examples of FinOps success. A global company in the healthcare industry discovered they were spending twice their target spending for Amazon EMR. They could manually reduce their infrastructure spending without using observability tools, but each time they did, they saw job failures happen as a result. They wanted to understand the reason why cost was so far above their expected range.

Using observability tools, they were able to identify the root cause for the high costs and reduce them without failures.

Using a FinOps approach, they were able to improve EMR efficiency by 50% to achieve their total cost of ownership goals. Using the FinOps framework, their data analytics environment became much easier to manage. They used the best practices from optimizing their own cloud infrastructure to help onboard other teams, and so they were able to improve the time to value across the entire company.

A global web analytics company used a FinOps approach to get the level of fidelity that they needed to reduce their infrastructure costs by 30% in just six months. They started by tagging the AWS infrastructure that powered their products, such as EC2 instances, EBS volumes, RDS instances and network traffic.

The next step was to look by product and understand where they could get the biggest wins. As they looked across roughly 50 different internal projects, they were able to save more than five million per year, and after running some initial analysis, they realized that more than 6,000 data sources were not connected to any destinations at all, or were sent to destinations with expired credentials.

They were wasting $20,000 per month on unused data infrastructure. The team provided daily reporting on their top cost drivers visualized in a dashboard, and then using this information, they boosted margins by 20% in just 90 days.
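As a rough sketch of the tag-based cost reporting described above (not the company’s actual tooling), the boto3 Cost Explorer API can group daily spend by a cost-allocation tag; the `product` tag key and the date range here are hypothetical, and the tag must already be activated for cost allocation in the billing console.

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

# Daily unblended cost for one month, grouped by a hypothetical "product"
# cost-allocation tag applied to EC2 instances, EBS volumes, RDS, etc.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "product"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        tag_value = group["Keys"][0]  # e.g. "product$analytics-pipeline"
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(day["TimePeriod"]["Start"], tag_value, f"${amount:,.2f}")
```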

All right, with that, let’s hand it over to Thiago to give us an update on the state of production machine learning. Thiago, over to you.

Thiago Gil:

Thank you, Clinton. Let’s talk about the state of production ML. This includes understanding the challenges and the best practices for deploying, scaling, and managing ML models in production environments, and how FinOps principles and Kubernetes can help organizations optimize and manage the costs associated with their ML workloads, and improve the efficiency, scalability, and cost-effectiveness of their models while aligning them with business objectives.

ML is moving to Kubernetes because it provides a flexible and scalable platform for deploying and managing machine learning models… Kubernetes [inaudible 00:07:29] getting resources such as CPU and memory to match the demands of our workloads. Additionally, Kubernetes provides features such as automatic scaling, self-healing, and service discovery, which are useful in managing and deploying ML models in a production environment.

The FinOps framework, which includes principles such as team collaboration, ownership of cloud usage, a centralized team for financial operations, real-time reporting, decisions driven by business value, and taking advantage of the variable cost model of the cloud, can relate to Kubernetes in several ways.

Kubernetes can also be used to allocate costs to specific teams or projects and to track and optimize the performance and cost of workloads in real time. By having a centralized team for financial operations and collaboration among teams, organizations can make better decisions driven by business value, and take advantage of the variable cost model of the cloud by only paying for the resources they use.

FinOps principles such as optimization, automation, cost allocation, and monitoring and metrics can be applied to ML workloads running on Kubernetes to improve their efficiency, scalability and cost effectiveness.

Kubernetes, by its nature, allows for cloud-agnostic workloads. This means that workloads deployed on Kubernetes can run on any cloud provider or on premises. This allows for more flexibility in terms of where ML workloads are deployed and can help to avoid vendor lock-in.

FinOps can help DataOps teams identify and eliminate unnecessary expenses, such as redundant or underutilized resources. This can include optimizing cloud infrastructure costs, negotiating better pricing for services and licenses, and identifying opportunities to recycle or repurpose existing resources.

FinOps can help DataOps teams develop financial plans that align with business goals and priorities, such as investing in new technologies or expanding data capabilities. By setting clear financial objectives and budgets, DataOps teams can make more informed decisions about how to allocate resources and minimize costs.

FinOps can help data teams automate financial processes, such as invoice and payment tracking, to reduce the time and effort needed to manage these tasks. This can free up DataOps team members to focus on more strategic tasks, such as data modeling and analysis. FinOps also helps DataOps teams track financial performance and identify areas for improvement. This can include monitoring key financial metrics, such as cost per data unit or return on investment, to identify opportunities to reduce costs and improve efficiency.

A FinOps team, sometimes known as a Cloud Cost Center of Excellence, is a centralized team within an organization that is responsible for managing and optimizing the financial aspects of the organization’s cloud infrastructure. This team typically has a broad remit that includes monitoring and analyzing cloud usage and cost, developing and implementing policies and best practices, collaborating with teams across the organization, providing guidance and support, providing real-time reporting, and continuously monitoring and adapting to changes in cloud pricing and services. The goal of this team is to provide a centralized point of control and expertise for all cloud-related financial matters, ensuring that the organization’s cloud usage is optimized, cost-effective, and aligned with the overall business objectives.

A product mindset focuses on delivering value to the end user and the business, which helps data teams better align their efforts with the organization’s goals and priorities.

Changing the mindset from projects to products can help improve collaboration. When FinOps teams adopt a product mindset, it helps to have better collaboration between the teams responsible for creating and maintaining the products. And cost transparency allows an organization to clearly understand and track the costs associated with its operations, including its cloud infrastructure, by providing visibility into cloud usage, costs, and performance metrics, forecasting future costs, making data-driven decisions, allocating costs, improving collaboration, and communicating and aligning cloud usage with overall business objectives.

When moving workloads to the cloud, organizations may discover hidden costs related to Kubernetes, such as the cost of managing and scaling the cluster, the cost of running the control plane itself, and the cost of networking and storage. These hidden costs can arise from not fully understanding the pricing model of cloud providers, not properly monitoring or managing usage of cloud resources, or not properly planning for data transfer or storage costs.

Applications that require different amounts of computational power can be placed on that spectrum. Some applications like training large AI models require a lot of processing power to keep GPUs fully utilized during training processes by batch processing hundreds of data samples in parallel. However, other applications may only require a small amount of processing power, leading to underutilization of the computational power of GPUs.

When it comes to GPU resources, Kubernetes does not have native support for GPU allocation, and must rely on third-party solutions, such as a Kubernetes device plugin, to provide this functionality. These solutions add an extra layer of complexity to resource allocation and scaling, as they require additional configuration and management.
Additionally, GPUs are not as easily shareable as CPU resources, and have more complex life cycles. They have to be allocated and deallocated to specific pods and have to be managed by Kubernetes itself. This can lead to situations where the GPU resources are not being fully utilized, or where multiple pods are trying to access the same GPU resources at the same time, resulting in contention and performance issues.
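To make the GPU-allocation point concrete, the sketch below uses the official Kubernetes Python client to request a GPU through the NVIDIA device plugin’s extended resource name. Unlike CPU millicores, GPUs can only be requested in whole units in the limits and are not shared between pods, so an idle pod holding `nvidia.com/gpu: 1` still reserves the entire card. The image name and namespace are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

container = client.V1Container(
    name="trainer",
    image="example.com/ml/trainer:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        # GPUs are an extended resource: whole units only, no oversubscription.
        limits={"nvidia.com/gpu": "1", "cpu": "4", "memory": "16Gi"},
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job", namespace="ml"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)
```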

So why do we need real-time observability? Sometimes data teams do not realize that GPU memory, CPU limits, and requests are not treated the same way until it’s too late.
The Prius effect refers to the changed driving behavior observed in some drivers of the Toyota Prius hybrid car, who altered their driving style to reduce fuel consumption after receiving real-time feedback on their gasoline consumption.

Observability by design on ML workloads, which includes collecting and monitoring key metrics, logging, tracing, setting up alerts, and running automated experiments, allows teams to gain insights into the performance, behavior, and impact of their ML models, make data-driven decisions to improve their performance and reliability, and align with FinOps principles such as cost optimization, forecasting, budgeting, cost allocation, and decision making based on cost-benefit analysis, all of which can help organizations optimize and manage the costs associated with their ML workloads.

By providing real-time visibility into the performance and resource usage of AI and ML workloads, organizations can proactively identify and address [inaudible 00:18:05] and make more informed decisions about how to optimize the cost of running these workloads in the cloud, understand how GPU resources are being consumed by different workloads and make informed decisions about scaling and allocating resources to optimize costs, and even find and troubleshoot GPU scheduling issues, such as GPU starvation or GPU oversubscription, that can cause workloads to consume more resources than necessary, and correct them.

Clinton Ford:

Fantastic. Thank you so much, Thiago. It’s been great having you here today, and we look forward to answering all of your questions. Feel free to enter them in the chat below, or head over to the Unravel booth. We’d be happy to visit with you. Thanks.

The post Maximize Business Results with FinOps appeared first on Unravel.

Why is Cost Governance So Hard for Data Teams? https://www.unraveldata.com/resources/why-is-cost-governance-so-hard-for-data-teams/ Fri, 03 Feb 2023 17:46:18 +0000

Chris Santiago, VP of Solutions Engineering at Unravel Data, shares his perspective on why cost governance is so hard for data teams in this 3-minute video. Transcript below.

Transcript

Cost governance – a burning issue for companies that are running mission-critical workloads in production, using large amounts of data, and are at the point where they’re starting to see rising costs but don’t have a handle on them or know what to do next in order to curtail those costs.

It all starts from the cloud vendors themselves, because that’s really the only way we’re going to get true costs. And so the way that cloud companies do this is they have loads of customers. And so it’s going to take them time to batch up all the information about what resources are consumed by each of their customers. And then when it’s time for billing, they go ahead and they send out a report that’s highly aggregated in nature, to really just show this is how much you were spending. And it comes in as a batch process, right? What you really need is fine-grained visibility, or fine granularity, into where costs are going. But the challenge is, because we’re getting this highly aggregated data from the cloud vendors themselves, it becomes difficult.

Now, what do customers do? What they’ll end up doing is they’ll try to do this themselves by capturing whatever metrics that they can. I see a lot of customers do this through manual spreadsheets, trying to track things by job, in a very manual fashion. In fact, one customer that I was working with actually had two full-time, very talented developers doing this just so that way they can get a semblance of where cost is going at a granularity that they wanted. Now, their issue is that this spreadsheet was prone to errors. Things would always change on the cloud vendor side, which would make their reports not accurate. And at the end of the day, they were not getting buy-in from the business.

And so that’s why you can’t use DevOps tools to solve this problem. You need a purpose-built platform built for DataOps teams, and that’s where Unravel comes in. So we’ll be able to take in all this information, provide you the ability to understand costs at whatever granularity you want. And best of all, we’re capturing this as soon as this information comes in. So now you can do things such as accurate forecasting of how resources are being spent. Setting budgets so that we can really understand who’s spending what. Being able to provide chargeback visibility so we can see who the top spenders are and we can actually have that conversation. And most importantly of all, being able to optimize at the job level. So that way we really, truly can reduce costs wherever we would want them to be.
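As a toy illustration of the fine-grained, job-level view Chris describes (this is not Unravel’s implementation, and the job records, teams, and budgets below are invented), the core idea is simply rolling per-job cost up to teams and comparing it against budgets:

```python
from collections import defaultdict

# Hypothetical per-job cost records, at the granularity you would want
# instead of a highly aggregated monthly bill.
job_costs = [
    {"team": "marketing-analytics", "job": "daily_attribution", "cost_usd": 310.00},
    {"team": "marketing-analytics", "job": "weekly_cohorts", "cost_usd": 95.50},
    {"team": "risk", "job": "fraud_scoring", "cost_usd": 742.25},
]

budgets_usd = {"marketing-analytics": 350.00, "risk": 1000.00}  # made-up budgets

spend_by_team = defaultdict(float)
for record in job_costs:
    spend_by_team[record["team"]] += record["cost_usd"]

for team, spend in sorted(spend_by_team.items(), key=lambda kv: -kv[1]):
    budget = budgets_usd.get(team, 0.0)
    status = "OVER BUDGET" if spend > budget else "ok"
    print(f"{team:<22} ${spend:>8,.2f} of ${budget:,.2f}  [{status}]")
```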

Full-stack visibility is just that easy when you utilize Unravel. Contact us for a full demo, or try it yourself.

The post Why is Cost Governance So Hard for Data Teams? appeared first on Unravel.

DataOps Observability Designed for Data Teams https://www.unraveldata.com/resources/dataops-observability-designed-for-data-teams/ Mon, 19 Sep 2022 15:10:09 +0000

The post DataOps Observability Designed for Data Teams appeared first on Unravel.
Managing Costs for Spark on Amazon EMR https://www.unraveldata.com/resources/managing-costs-for-spark-on-amazon-emr/ Tue, 28 Sep 2021 20:43:11 +0000

The post Managing Costs for Spark on Amazon EMR appeared first on Unravel.
Managing Costs for Spark on Databricks https://www.unraveldata.com/resources/managing-costs-for-spark-on-databricks/ Fri, 17 Sep 2021 20:51:08 +0000

The post Managing Costs for Spark on Databricks appeared first on Unravel.
Managing Cost & Resources Usage for Spark https://www.unraveldata.com/resources/managing-cost-resources-usage-for-spark/ Wed, 08 Sep 2021 20:55:29 +0000

The post Managing Cost & Resources Usage for Spark appeared first on Unravel.
Troubleshooting EMR https://www.unraveldata.com/resources/troubleshooting-emr/ Tue, 17 Aug 2021 20:57:06 +0000

The post Troubleshooting EMR appeared first on Unravel.
Troubleshooting Databricks – Challenges, Fixes, and Solutions https://www.unraveldata.com/resources/troubleshooting-databricks/ Wed, 11 Aug 2021 21:00:22 +0000

The post Troubleshooting Databricks – Challenges, Fixes, and Solutions appeared first on Unravel.
Accelerate Amazon EMR for Spark & More https://www.unraveldata.com/resources/accelerate-amazon-emr-for-spark-more/ Mon, 21 Jun 2021 21:54:59 +0000

The post Accelerate Amazon EMR for Spark & More appeared first on Unravel.
Strategies To Accelerate Performance for Databricks https://www.unraveldata.com/resources/accelerate-performance-for-databricks/ Fri, 18 Jun 2021 21:57:29 +0000

The post Strategies To Accelerate Performance for Databricks appeared first on Unravel.
Effective Cost and Performance Management for Amazon EMR https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/ https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/#respond Wed, 28 Apr 2021 22:06:57 +0000 https://www.unraveldata.com/?p=8132

The post Effective Cost and Performance Management for Amazon EMR appeared first on Unravel.

]]>

The post Effective Cost and Performance Management for Amazon EMR appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/effective-cost-and-performance-management-for-amazon-emr/feed/ 0
Reasons Why Big Data Cloud Migrations Fail https://www.unraveldata.com/resources/reasons-why-big-data-cloud-migrations-fail-and-ways-to-succeed/ https://www.unraveldata.com/resources/reasons-why-big-data-cloud-migrations-fail-and-ways-to-succeed/#respond Fri, 09 Apr 2021 22:01:56 +0000 https://www.unraveldata.com/?p=8127

The post Reasons Why Big Data Cloud Migrations Fail appeared first on Unravel.

]]>

The post Reasons Why Big Data Cloud Migrations Fail appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/reasons-why-big-data-cloud-migrations-fail-and-ways-to-succeed/feed/ 0
How To Get the Best Performance & Reliability Out of Kafka & Spark Applications https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/ https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/#respond Thu, 11 Mar 2021 22:10:06 +0000 https://www.unraveldata.com/?p=8136

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

]]>

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/getting-the-best-performance-reliability-out-of-kafka-spark-applications/feed/ 0
Going Beyond Observability for Spark Applications & Databricks Environments https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/ https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/#respond Tue, 23 Feb 2021 22:10:57 +0000 https://www.unraveldata.com/?p=8140

The post Going Beyond Observability for Spark Applications & Databricks Environments appeared first on Unravel.

]]>

The post Going Beyond Observability for Spark Applications & Databricks Environments appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/going-beyond-observability-for-spark-applications-databricks-environments/feed/ 0
Achieving Top Efficiency in Cloud Data Operations https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/ https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/#respond Fri, 05 Feb 2021 22:34:53 +0000 https://www.unraveldata.com/?p=8156

The post Achieving Top Efficiency in Cloud Data Operations appeared first on Unravel.

]]>

The post Achieving Top Efficiency in Cloud Data Operations appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/achieving-top-efficiency-in-cloud-big-data-operations/feed/ 0
Why You Need DataOps in Your Organization https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/ https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/#respond Fri, 05 Feb 2021 22:27:33 +0000 https://www.unraveldata.com/?p=8154

The post Why You Need DataOps in Your Organization appeared first on Unravel.

]]>

The post Why You Need DataOps in Your Organization appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-you-need-dataops-in-your-organization/feed/ 0
Moving Big Data and Streaming Data Workloads to Google Cloud Platform https://www.unraveldata.com/resources/moving-big-data-and-streaming-data-workloads-to-google-cloud-platform/ https://www.unraveldata.com/resources/moving-big-data-and-streaming-data-workloads-to-google-cloud-platform/#respond Fri, 05 Feb 2021 22:24:18 +0000 https://www.unraveldata.com/?p=8152

The post Moving Big Data and Streaming Data Workloads to Google Cloud Platform appeared first on Unravel.

]]>

The post Moving Big Data and Streaming Data Workloads to Google Cloud Platform appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/moving-big-data-and-streaming-data-workloads-to-google-cloud-platform/feed/ 0
Discover Your Datasets The Self-Service Data Roadmap https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/ https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/#respond Fri, 05 Feb 2021 22:20:12 +0000 https://www.unraveldata.com/?p=8149

The post Discover Your Datasets The Self-Service Data Roadmap appeared first on Unravel.

]]>

The post Discover Your Datasets The Self-Service Data Roadmap appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/discover-your-datasets-the-self-service-data-roadmap-session-1-of-4/feed/ 0
How DBS Bank Leverages Unravel Data https://www.unraveldata.com/resources/how-dbs-bank-leverages-unravel-data/ https://www.unraveldata.com/resources/how-dbs-bank-leverages-unravel-data/#respond Wed, 13 Jan 2021 21:27:44 +0000 https://www.unraveldata.com/?p=5761

The post How DBS Bank Leverages Unravel Data appeared first on Unravel.

]]>

The post How DBS Bank Leverages Unravel Data appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/how-dbs-bank-leverages-unravel-data/feed/ 0
Cost-Effective, High-Performance Move to Cloud https://www.unraveldata.com/resources/cost-effective-high-performance-move-to-cloud/ https://www.unraveldata.com/resources/cost-effective-high-performance-move-to-cloud/#respond Thu, 05 Nov 2020 21:38:49 +0000 https://www.unraveldata.com/?p=5484

The post Cost-Effective, High-Performance Move to Cloud appeared first on Unravel.

]]>

The post Cost-Effective, High-Performance Move to Cloud appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cost-effective-high-performance-move-to-cloud/feed/ 0
Why Enhanced Visibility Matters for Your Databricks Environment https://www.unraveldata.com/resources/why-enhanced-visibility-matters-for-your-databricks-environment/ https://www.unraveldata.com/resources/why-enhanced-visibility-matters-for-your-databricks-environment/#respond Thu, 22 Oct 2020 21:54:23 +0000 https://www.unraveldata.com/?p=5244

The post Why Enhanced Visibility Matters for Your Databricks Environment appeared first on Unravel.

]]>

The post Why Enhanced Visibility Matters for Your Databricks Environment appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/why-enhanced-visibility-matters-for-your-databricks-environment/feed/ 0
Reasons Why Your Big Data Cloud Migration Fails and Ways to Overcome https://www.unraveldata.com/resources/reasons-why-your-big-data-cloud-migration-fails-and-ways-to-overcome/ https://www.unraveldata.com/resources/reasons-why-your-big-data-cloud-migration-fails-and-ways-to-overcome/#respond Wed, 21 Oct 2020 21:47:48 +0000 https://www.unraveldata.com/?p=5241

The post Reasons Why Your Big Data Cloud Migration Fails and Ways to Overcome appeared first on Unravel.

]]>

The post Reasons Why Your Big Data Cloud Migration Fails and Ways to Overcome appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/reasons-why-your-big-data-cloud-migration-fails-and-ways-to-overcome/feed/ 0
Getting Real With Data Analytics https://www.unraveldata.com/getting-real-with-data-analytics/ https://www.unraveldata.com/getting-real-with-data-analytics/#respond Thu, 30 Jul 2020 14:00:59 +0000 https://www.unraveldata.com/?p=4777

CDO Sessions: Getting Real with Data Analytics Acting on Market Risks & Opportunities On July 9th, big data leaders, Harinder Singh from Anheuser-Busch, Kumar Menon at Equifax, and Unravel’s Sandeep Uttamchandani, joined together for our CDO […]

The post Getting Real With Data Analytics appeared first on Unravel.

]]>

CDO Sessions: Getting Real with Data Analytics Acting on Market Risks & Opportunities

On July 9th, big data leaders, Harinder Singh from Anheuser-Busch, Kumar Menon at Equifax, and Unravel’s Sandeep Uttamchandani, joined together for our CDO Session, hosted by our co-founder and CEO, Kunal Agarwal, to discuss how their companies have adapted and evolved during these challenging times.


Transcript

Kunal: Hi guys, welcome to this chat today with big data leaders. Thank you everybody for joining today’s call. We’re all going through very uncertain, unprecedented times, to say the least, with the pandemic, combined with all the geo-social unrest. Business models and analytics that have developed over the last several years have perhaps been thrown out of the window, or need to be reworked for what we call “the new normal”. We’ve seen companies take a defensive position for the first couple of weeks when this pandemic hit and now we’re looking at companies taking an offensive position. We have an excellent panel here to discuss how they’re using data within their companies to be able to uncover both risks and opportunities. Can we start with your background and your current role?

Harinder: Hey guys, it’s Harinder Singh. I lead data strategy and architecture at Anheuser-Busch InBev, also known as AB InBev. AB InBev is a $55 billion revenue company with 500 plus brands, including Corona, Budweiser, and Stella. We operate in hundreds of countries. Personally, I’ve been in this industry for about 20 years and prior to being at AB InBev, I was at Walmart, Stanford, and a few other companies. I’m very excited to be here as part of the panel.

Kumar: Hey guys, it’s Kumar Menon. I lead Data Fabric at Equifax. I’ve been at Equifax for a couple of years and we are a very critical part of the credit life cycle within the economy in almost all the regions that we operate in. So it puts us in an interesting position during situations like these. I’ve been in the industry for 25 years, doing data work primarily in two major, highly regulated industries, life sciences and pharmaceuticals, and financial services. That’s the experience that I’ve been able to bring to Equifax to be able to really rethink how we use data and analytics to deliver new value to our customers.

Sandeep: Hey everyone, it’s a pleasure to be here. I’m Sandeep Uttamchandani, the VP of Engineering and the Chief Data Officer at Unravel Data. My career has basically been a blend of building data products, as well as running data engineering, at large scale, most recently at Intuit. My passion is basically, how do you democratize data? How do you make data self-serve and really build a data-driven culture for enterprises? At Intuit, I was responsible for data for Intuit QuickBooks, a $3 billion franchise, and continuing that passion at Unravel, we’re really building out what we refer to as the self-serve platform, looking at telemetry data combined across resources, clusters, jobs, and applications, and really making sense of it as a true data product. I’m really excited to be talking more about what changes we’ve gone through and how we bounce back.

Kunal: Thank you, Sandeep. To gather some context, why don’t we go around the room and understand what your companies are doing at this particular time? What’s top of mind for your businesses?

Harinder: What’s on top of our mind today for our business is the following: Number one, taking care of our people. We operate in 126 countries and we don’t just ship our products, we actually brew locally, sell locally, meaning everything from the grain, the farmer, the brewery, is local in that country or city. For us, taking care of our people is our number one priority and there are different ways to do that. For example, in our corporate offices, it can be as simple as sending office chairs to people’s houses.

Our second priority is taking care of our customers. When I say customers, I’m not talking just about consumers, I’m talking actually about bars and restaurants, our B2B customers. They have been significantly impacted because people are not going into bars and restaurants. We have done that by using credit, doing advanced purchases, coming up with alliances with other companies, and creating gift cards that people can buy to use later. And finally, another thing that’s on top of mind is taking care of finances. We’re a big company and we don’t know how long this will last, so we want to make sure that we’re in it for the long run. We’re very fortunate that our leadership in the C level has very strong finance backgrounds and even in technology, people actually come from finance. So that’s definitely something that we are doing as a business for sure.

Kunal: That’s fascinating, Harinder. That was the sequence of things that we expect from a big company. Hopefully, once live sports start again, the sales will start picking back up. Kumar, I’d love to hear from you as well.

Kumar: Before our current situation, back in late 2017, Equifax had a significant, unfortunate breach, post which, we as a company made a serious commitment to ensure that we look at everything that we do as a business in terms of our product strategy, our platform strategy, our security posture, and transform the company to really helping our customers, in turn helping the consumer. We were in the middle of a massive transformation and we’re still currently going through that. We were already moving at a quick pace to rebuild all of our data and application platforms and refactoring process portfolio, our engagement with our customers, etc. The situation, I would say, has helped us even more.

When you look at the pandemic scenario, the credit economy has taken on a life of its own, in terms of how banks and other financial institutions are looking at data and at the impact of the pandemic on their portfolios and customers. As a company, we’ve been really looking at the more macroeconomic impact of the situation on the consumers in the regions that we operate. Traditional data indicators or signals don’t really hold as much value as they would in normal times, in these unique times, and we’re constantly looking at new indicators that would help not only our customers, but eventually the consumers, go through the situation in a much better way.

We are actually helping banks and other financial institutions reassess, and while we’ve seen businesses uptake in certain areas, in other areas of course, we see lower volumes. But overall, as a company we’ve actually done pretty well. And in executing the transformation, we’ve actually had to make some interesting choices. We’ve had to accelerate things and we’re taking a really offensive strategy, where we quickly deploy the new capabilities that we’re building. This will help businesses in regions with slower uptake execute better and serve their customers and eventually the consumers better.

Kunal: We would love to hear some of those changes that you’re making in our next round of questions. Sandeep?

Sandeep: I’ll answer this question by drawing from my own experience as well as my discussions talking to a broader set of CDO’s on this topic. There are three high level aspects that are loud and clear. First, I think data is more important than ever. During these uncertain times, it is important to use data to make decisions, which is clear across pretty much every vertical that I’m tracking.

The other aspect here is the need for agility and speed. If you look at the traditional analytical models and ML models built over years and years of data, the question is, do they represent the new normal? How do you now make sure that what you’re representing in the form of predictive algorithms and forecasting is even valid? Are those assumptions right? This is leading to the need to build a whole set of analytics, ML models, and data pipelines really quickly and I’ve seen people finding and looking for signals and data which they otherwise wouldn’t have.

The last piece of the puzzle, as a result of this, is more empowerment of the end users. End users here would be the beta users, the data analysts, the data scientists. In fact, everyone within the enterprise needs data at this time, whether I’m a marketer trying to decide a campaign, or if I’m in sales trying to decide pricing. Data allows you to determine how you react to the competition, how you react to changing demand, and how you react to changing means.

Big Data Projects

Kunal: So let’s talk some data. What data projects have you started are accelerated during these times? And how do you continue to leverage your data at your different companies while thinking about cutting costs and extending your resources?

Harinder: As times are changing, everything we do as a company has to change and align with that. To start off, we were already investing quite heavily in digital transformation to take us to the next level. The journey to taking a business view end to end, including everything from farmers, to customers, to breweries and corporations, has already begun. And due to COVID, we have really expedited the process.

Second, we had some really big projects to streamline our ERP systems. AB InBev grew heavily through M&A, we acquired many companies and partnered with many companies small and big. Each company was a successful business in its own right, meaning they had their own data and technology.

Definitely, the journey to Cloud is big as well. Some aspects of our organization were already multi-Cloud, but if you look at the positive side of this crisis, on the Cloud journey as well, it really pushed us hard to go there faster. The same is true for remote work. Something that would have taken three to five years to execute happened overnight.

So the question then becomes, well, how do we manage the costs? Because all of these things that I’m talking about expediting require a budget to go with it. One thing we’ve done is reprioritize some of our initiatives. While these things that I talked about earlier have gone from, let’s say, priority three to priority one, we have some initiatives that we were working on that have been pushed to the backburner.

Let me give you some examples of managing or cutting costs. I run the data platform and the approach there was to scale, scale, scale, because the business is growing and we’re bringing all these different countries and zones online into Cloud. We still want to grow, but we’ve gone from looking at scale to focusing on how we can optimize and have more of a real-time inventory of what’s needed, rather than having it three months ahead. The fact that you’re in the Cloud enables you to do that. It’s a positive thing on both sides because it helps expedite the journey to the Cloud, while moving to the Cloud helps you keep your inventory lean. Then, in terms of just doing some basic sanity checks, are there systems that have been around, but just not significantly used? Or is there software that we need less of? Or if there are things, in terms of technology, hardware, software or applications, that we need more of because of COVID, can we negotiate better because of scale again?

Kunal: Alright Harinder, so you’ve been just scrutinizing everything, scrutinizing size, scrutinizing projects, and making sure that you’re scaling in an optimized fashion, and not scaling out fast for unforeseen loads and warranties, if I summarized that correctly.

Harinder: You’re absolutely right. I think we were in a unique position to do that because our company follows a zero-based budget model, which essentially means that at the start of each year, we don’t just build upon from where we were last year. We start from scratch every single year, so that’s already in our culture, our DNA. And once or twice a year, we just had to take the playbooks out and do it again. That’s actually quite easy for us as a company to do versus, I can imagine, big companies that may have a tough time doing that.

Kunal: One last question before we move on to Kumar. What about some of the challenges that Cloud presented to you?

Harinder: Anybody going into the Cloud has to keep in mind two things. One is that it’s a double-edged sword. It gives you convenience and fast time to market, but you also have to be very careful about security. All of these Cloud vendors, Google, Amazon, or Azure, spend more on security than most companies can. So, the Cloud security out of the box is much better compared to an on-prem system. But you also have to be careful about how you manage it, configure it, enforce it, so on and so forth.

The second part to me is the cost. If you do a true comparison and don’t manage your cost properly, then Cloud costs can be much higher. If used properly and managed properly, Cloud costs are much better for business. A lot of people and companies that I talked to say that they are going to move to Cloud to save costs, but while moving to the Cloud is part of that, that’s step one. You must also make sure you manage the cost and watch out for it, especially in the very beginning, and prioritize the cost equally. Those two things, when done in combination, really kind of take care of the bottleneck issue with moving to Cloud.

Kunal: Yeah, Cloud definitely needs guardrails. Harinder, thank you so much for that.

Sandeep: I just want to quickly add to Harinder’s points. Just from our own experience, when we entered the Cloud in the past, we had to repurpose, using one instance for 10 hours versus 10 instances for one hour. I completely resonate with that point, Harinder. You also mentioned multi-Cloud and I would love to learn more.

Kunal: How about you, Kumar?

Kumar: For us, since we were already executing this blazing transformation, we didn’t really have to start anything specifically new. We went through some reprioritization of our roadmaps and were already executing at a serious pace, looking to complete this global transformation in a two year timeframe. So what we really focused on from an acceleration or a reprioritization perspective was deploying the capabilities as quickly as possible into the global footprint. Once the pandemic hit, we had to think about the impact on our portfolio. Most of our customers are big financial institutions and we quickly realized that traditional data points are no longer as predictive for understanding the current scenario, as Sandeep mentioned before. So we had to really reevaluate and look at how we can bring our data together in a much faster way, in a much more frequent manner, that can help our customers understand portfolios better. And obviously, how does this impact our traditional predictive models that we deploy for credit decisioning, fraud, and other areas where we saw some significant uptake in certain ways? All this required the capability to be deployed much faster.

Our transformation was based on a Cloud-first strategy, so we are 100% on the Cloud. That helped us accelerate pushing these capabilities out into the global regions at a much faster pace and we completed the global deployment of several of our platforms over the last couple of months or so.

From a data projects perspective, our goal throughout this transformation has been to enable faster ingestion of new data into our ecosystem, bringing this data together and driving better insights for customers. So we’re constantly looking for new data sources that we can acquire that can add value to the more traditional and the very high fidelity data sources we already have. When you look at our footprint in a particular region, we actually have some of the most important data about a consumer within the region that we operate in. In a traditional environment, that data is very unique and very powerful, but when you look at a scenario like the pandemic situation that we’re in, we have to bring in data and figure out how the current situation impacts customers, therefore understanding consumers better.

Also, anything that we produce has to be explainable. While we absolutely have a lot of desire to and currently do very advanced techniques around analytics, using ML and AI for several things, for some of our regulated businesses, everything has to be explainable. So we’ve accelerated some of our work in the explainable AI space and we think that’s going to be an interesting play in the market as more and more regulations start governing how we use data and how we help our customers or the consumers eventually own the data. We, in fact, own a patent in the industry that allows for credit decisioning using explainable AI capabilities.

Kunal: We’d love to hear about some of the signals that weren’t considered earlier that are now considered. Would you be able to share some of those, Kumar?

Kumar: Absolutely. So, we have some new data sets that not many of the credit rating agencies or other financial institution data providers have today. For example, the standard credit exchange is all banking loan data that we get at the consumer level that every credit rating agency has. But we also have a very highly valuable data asset called the work number, which is information about people’s employment and income. We also have a utilities exchange, where we get utility payment information of consumers. I can talk about some insights that you don’t even have to be a genius to think about, opportunities that you can literally unravel through combining this data.

If you were to just look at a traditional credit score that is based on credit data, as an example, I could say, “Kumar Menon worked in the restaurant business and has a credit score of 800”. In a traditional way of looking at credit scoring, I would be still a fairly worthy customer to lend money to. Looking at the industry that I’m working in, maybe there is a certain element of risk that is now introduced into the equation, because I’m in the restaurant business, which is obviously not doing well. So what does that mean when I look at Kumar Menon as a consumer? There are things that you can do to understand the portfolio and plan better. I’m not saying that all data points are valid, but understanding the portfolio helps financial institutions prepare better, help consumers, work with consumers to better understand forbearance scenarios, and help them work out scenarios where you don’t have to go into default. I mean, the goal of the financial institution was to bring more people into the credit industry, which is what we are trying to enable more.
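A toy sketch, in Python, of the kind of signal combination described here, combining a traditional credit score with an employment-industry indicator; the field names, industries, and thresholds are hypothetical and are not Equifax’s actual scoring logic:

# Combine a traditional credit score with an employment/industry signal.
# All names and thresholds below are hypothetical illustrations.
HIGH_IMPACT_INDUSTRIES = {"restaurants", "travel", "hospitality"}

def portfolio_flag(consumer):
    score = consumer["credit_score"]            # from the credit exchange
    industry = consumer["employment_industry"]  # from an employment/income dataset
    if score >= 750 and industry in HIGH_IMPACT_INDUSTRIES:
        return "review"    # strong history, but income may be disrupted right now
    return "standard" if score >= 650 else "elevated"

print(portfolio_flag({"credit_score": 800, "employment_industry": "restaurants"}))  # -> review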

By providing financial institutions with more data, we’re helping them become more aware of potential risks or scenarios that may not be visible in a traditional paradigm.

Kunal: That’s very interesting. Thanks so much for sharing, Kumar. Moving on to you, Sandeep.

Sandeep: Piggybacking on what Harinder and Kumar touched on, one of the key projects has been accelerated movement to the Cloud. When you think about moving to the Cloud, it’s a highly non-trivial process. On one side, you have your data and thousands of data sets, then on the other side, you have these active pipelines that run daily, the dashboards, and the ML models feeding into the product. So the question really becomes, what is the sequence to move these? Some pipelines are fairly isolated data sets with, I would say, trivial query constructs being used, but on the other side, you’ll be using some query constructs which are deeply embedded in the on-prem system, highly non-trivial to move, requiring rethinking of the schema, rethinking the partitioning logic, the whole nine yards. This is a huge undertaking. How do you sequence? How do you minimize risk? These are live systems and the analogy here is, how do you change the engine of the plane while the plane is running? We don’t have the luxury to say, “okay, these pipelines, these dashboards or models won’t refresh for a week”.

The other aspect is the meaning of data. Traditionally, I think as data professionals, the number one question is, where is the data and how do I get to the data? With data sets, which attribute is authentic and which attribute has been refreshed? During the pandemic, the question is now slightly changing into not just “where is my data?”, but “is this the right data to use?” Are these the right signals I should be relying on? This new normal requires a combination of both the understanding of the business and the expertise there, combined with the critical aspects of data and the data platform. So there’s clearly increasing synergy in several places as people think about, “okay, how do I rework my models?” It’s a combination of building this out as well as using the right data.

The last piece is how you shorten the whole process of getting a pipeline or an insight developed. We are writing apps to do things which we haven’t done before, no matter which vertical you’re in, and the moment you have these apps coming out at a fast pace, in production, there are a lot of discoveries. In terms of misusing the data, scanning or joining across a billion-row table, all these aspects can inundate any system. Comparing it to a drainage system, one bad query is like a clog that stops everything, affecting everything below. I think that’s the other aspect, increasing awareness of how we fortify the CICD and the process of improving that.

Kumar: That’s a very interesting point you bring up because when we look at our data ecosystem, all the way from ingestion of raw data to what we call, purposing the data for a specific set of products, we must ensure that the models and other insights that execute on that data all stay in sync. So how do we monitor that entire ecosystem? How do we ingest data faster, deploy new models, monitor it, and understand if it’s performing in the right way? We looked at that ecosystem and we want it to be almost like a push button scenario where analysts can develop models when they’re looking at data schemas that are exactly similar to what is running in production, so that there is no rewiring of the data required. And the deployment of the models is seamless, so you’re not rewriting the models.

In many of the on-prem systems, you actually end up rewriting the model in a production system because of performance challenges, etc. So, do you really want to extend the CICD pipeline concept to the analytic space, where there is an automated mechanism for data scientists to be able to build and deploy in a way that a traditional data engineer would deploy some pipeline code? And how do we make sure that that synergy is available for us to deploy seamlessly? It’s something that we’ve actually looked at very consciously and are building it into our stack. It’s a very relevant industry problem that I think many companies are trying to solve.

Big Data Challenges and Bottlenecks

Kunal: To summarize what Kumar and Sandeep were saying, we’re growing data projects and somebody, at the end of the day, needs to make sure it runs properly. We’re looking at operations and Kumar made a comment, comparing it with the more mature DevOps lifecycle, which we are thinking about as a DataOps lifecycle. What are some of these challenges and bottlenecks that are slowing you guys down, Harinder?

Harinder: I would like to start off by giving our definition of DataOps. We define DataOps as end-to-end management and support of a data platform, data pipelines, data science models, and essentially the full lifecycle of consumption and production of information. Now, when we look at this sort of life cycle, there are the basics of people, process, technology, and data.

Starting with our people, we started building this team about three years ago, so there’s a lot of experienced talent with a blend of new and upcoming individuals. We were still in the growth phase of the team, but I think that the current situation has slowed down that process a bit.

The technology was always there but it’s more so about the adoption of it because when you have to strike the balance between growth in data projects and more need for data, usually you will have people in technology scale with it. In our case, like I said, the people team is not able to grow as fast because of the situation, so we are looking for automation. How can we utilize our team better? CICD was there in some parts of the platform while in some, it wasn’t. So we are finding those opportunities, automating the platform, and applying full CICD.

When we talk about Cloud, there are scenarios where you can move to the Cloud in different ways. You can move as an infrastructure, as a platform, as a full SAS, so we always wanted to be sort of platform agnostic and multi-Cloud. There are some things we have done, mostly on the infrastructure side, but now we are taking the middle ground a bit, moving away from infrastructure to more of a platform as a service model so that, again, going back to people, we can save some time to market by moving to that model.

On the process side, it’s about striking the right balance between governance and time to market. When you have to move fast, governance always slows down. The industry is very regulated and that means you still have to maintain a minimum requirement on compliance. Depending on which country we are in, while in the US, not so much, in other countries where we operate, there’s always the GDPR. So those requirements have to be met while we move fast to meet the demands of our internal customers for data analytics and insights. When we talk about this whole process end to end, I think it’s about how we continue to scale and meet the needs of our business, while also doing our best to strike the balance just because of the space we are in. And when I talk about regulation, I’m not just talking about the required regulation or compliance, it’s also just good data hygiene, maintaining the catalog, maintaining the glossaries. Right now it’s just that complication of sometimes speed taking over and other times governance taking over, so we’re trying to find the right balance there.

Kunal: As is every organization, Harinder, so you’re not alone there for sure. Kumar, anything to add there?

Kumar: I think he covered it pretty well. For us, when moving to the Cloud, you really have to have a different philosophy when you’re building Cloud-native applications versus what you’re building on-prem. It’s really about how do you improve the skill sets of your people to think more broadly. Now you take a developer who has been developing on Java, on-prem, and she or he now has to have a little bit of an understanding of infrastructure, a little bit about the network, a little bit about how Cloud security works so we can actually have security in the stack, versus just an overlay on the application. A lot of on-prem applications are built that way, relying on perimeter security by the network. How do you actually engineer the right IAM policies into every layer of the services you’re building? How do you make sure that the encryption and decryption capabilities that you enable for the application are enterprise-wide policies?

I’ve come back often to the ability to deploy into the Cloud. How do you ensure that your deployment is compliant? How do you make sure that everything is code in the Cloud, infrastructure is code, security is code, your application is code? How do you check in your CICD pipeline that you have all your controls in place so that your build fails and you don’t actually deploy if you’re violating policy? So we actually started to implement policy as code within our CICD pipeline to ensure that no bad behavior really manifests itself in production.
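A minimal sketch of that kind of policy-as-code gate in a CI/CD pipeline, written in Python; the plan.json file name and resource shape are hypothetical, and real deployments typically lean on dedicated policy engines, but the mechanism is the same: a failed check exits non-zero and blocks the deploy.

import json
import sys

# Fail the build if any planned storage bucket is unencrypted.
with open("plan.json") as f:
    resources = json.load(f).get("resources", [])

violations = [
    r.get("name", "<unnamed>")
    for r in resources
    if r.get("type") == "storage_bucket"
    and not r.get("config", {}).get("encryption_enabled", False)
]

if violations:
    print(f"policy violation: unencrypted buckets: {violations}")
    sys.exit(1)  # non-zero exit fails the CI stage, so the deploy never runs
print("all storage buckets pass the encryption policy")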

We’ve also been ruthlessly looking at security because of the situation we were in before, as well as the fact that we hold some very valuable high fidelity data. How do you ensure that what our security policy is on the data is also on the technology stack that operates on the data? So those have been some very interesting learnings and I wouldn’t say this has slowed us down, but these things are mandatory and we must learn and be able to master them as a company.

Regulations are ever-changing. We’ve encountered new regulations as we’re building this. New privacy laws are coming into existence, like the CCPA in California, and I think there’ll be other states pursuing similar privacy laws. Obviously, that will impact you globally when you extrapolate that to GDPR and other regional laws. So when you’re deploying to the Cloud, how do you make sure you’re adhering to the data residency requirements within those regions, as well as the privacy laws? How you build an architecture that can adapt and be flexible to that change is really the big challenge.

Kunal: Thank you for sharing all of that. Sandeep, any thoughts there?

Sandeep: I define DataOps as the point in a project where the honeymoon phase, the phase of making a prototype and building out the models and analytics, ends and reality sets in.

On a single weekend, I’ve seen a bad query accumulate more than a hundred thousand dollars in cost. That’s an example where if you don’t really have the right guardrails, just one weekend with high end GPUs in play, trying to do ML training for a model that honestly we did not even require, you get a bill of $100,000.

I think the other thing is just the sheer root cause analysis and debugging. There are so many technologies out there and on one side, there is the philosophy of using the right tool for the job, which is actually the right thing, there is no silver bullet. But then, if you look at the other side of it, the ugly side of it, you need the skill sets required to understand how Presto works, versus how Hive works, versus how Spark works, and tune it to really figure out where issues are happening. It’s much more difficult: how do you package that? Figuring it out is one of those issues which has always been there, but is now becoming even more critical.

The last thing to wrap up, and I think Kumar touched on this, is a very different way to think about some of the newer technologies. If you think of these serverless technologies like Google BigQuery or AWS Athena, they have different pricing models. Here, you’re actually being charged by the amount of data scanned and imagine a query that is basically doing a massive scan of data, incurring significant costs. So you need to incorporate all of these aspects, be it compliance, cost, root cause analysis, tuning, and so on, early on so that DataOps is seamless and you can avoid surprises.
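One way to build that guardrail, assuming the google-cloud-bigquery Python client and a hypothetical table: a dry run reports how many bytes a query would scan before anything is billed, so oversized scans can be rejected up front.

from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials and project

sql = "SELECT user_id, event_ts FROM analytics.events WHERE event_date = '2020-07-01'"  # hypothetical table

# dry_run=True estimates the scan without executing or billing the query.
job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
tb = job.total_bytes_processed / 1e12
print(f"~{tb:.3f} TB scanned, roughly ${tb * 5:.2f} at a $5/TB on-demand rate")
if tb > 1.0:  # hypothetical per-query budget
    raise RuntimeError("query scans too much data; add partition filters before running it")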

How Big Data Professionals Can Adjust to Current Times

Kunal: Thank you. We’ll have one, one minute rapid fire round question for everybody as a parting thought. There’s several hundred people listening in right now, so what should all of us data professionals plan for as we’re thinking through this prolonged period of uncertain times? What is that one thing that they should be doing right now that they have not in the past?

Harinder: I actually have not one, but five, but they’re all very quick. First of all, empathy. We are in completely different times, so make sure you have empathy towards your team and towards your business partners.

Number two, move fast. It’s not the time to think too hard or plan, you just have to move fast and adapt.

Number three, manage your costs.

Number four, focus on your business partners internally and try to understand what their needs are, because it’s not just you, everybody is in a unique situation. So focus on your internal customers, what do they need from you, in terms of data analytics?

And finally, focus on your external customers, try to understand their needs. One of the most important things would be maybe changing the delivery model of your product or service and meeting where the customer is instead of expecting customers to come to you.

Kumar: I totally agree with focusing on internal customers. Obviously focus on the ecosystem you’re operating in so it’s your customers as well as potentially your customers’ customers. Definitely make sure that you connect a lot more with your customers and your coworkers to keep the momentum going.

I think, in several scenarios there are new opportunities that are being unearthed in the market space, so really watch out for where those opportunities lie, especially when you’re in the data space. There are new signals coming up, new ways of looking at data that can provide you better insights. So how do you constantly look at that?

Finally, I would say to keep an eye out for how fast regulations are changing. I’m sure new regulations will be in play with this new normal, so just make sure that what you build today can withstand the challenge of the time.

Kunal: Thank you, Sandeep?

Sandeep: One piece of advice for professionals would be to also focus on data literacy and explainable insights within your organization. Not everyone understands data the way you do and when you think about insights, it’s basically three parts: what’s going on, why it is happening, and how to get out of it. Not everyone will have the skills and expertise to do all three. On the “what” part, what’s going on in the business, how to think about it, how to slice and dice, data professionals have a unique opportunity here to really educate and build that literacy within their enterprise for better decision making. And everything that Harinder and Kumar mentioned is spot on.

Kunal: Thank you. Guys again, this was a fantastic one hour. We had a ton of viewers here today. I hope we all took away something from these data professionals, I certainly learned a lot. Harinder, Kumar, Sandeep, thank you so much for taking time out during such crazy times and sharing your experiences, all the practical advice, and strategies with the entire data community.

The post Getting Real With Data Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/getting-real-with-data-analytics/feed/ 0
CDO Sessions: Getting Real with Data Analytics https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/ https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/#respond Tue, 28 Jul 2020 19:06:58 +0000 https://www.unraveldata.com/?p=5033

The post CDO Sessions: Getting Real with Data Analytics appeared first on Unravel.

]]>

The post CDO Sessions: Getting Real with Data Analytics appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cdo-sessions-getting-real-with-data-analytics/feed/ 0
Optimizing big data costs with Amazon EMR & Unravel https://www.unraveldata.com/resources/amazon-emr-insider-series-optimizing-big-data-costs-with-amazon-emr-unravel/ https://www.unraveldata.com/resources/amazon-emr-insider-series-optimizing-big-data-costs-with-amazon-emr-unravel/#respond Sat, 25 Jul 2020 19:16:35 +0000 https://www.unraveldata.com/?p=5035

The post Optimizing big data costs with Amazon EMR & Unravel appeared first on Unravel.

]]>

The post Optimizing big data costs with Amazon EMR & Unravel appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/amazon-emr-insider-series-optimizing-big-data-costs-with-amazon-emr-unravel/feed/ 0
EMR Cost Optimization https://www.unraveldata.com/resources/emr-cost-optimization/ https://www.unraveldata.com/resources/emr-cost-optimization/#respond Wed, 22 Jul 2020 19:17:56 +0000 https://www.unraveldata.com/?p=5038

The post EMR Cost Optimization appeared first on Unravel.

]]>

The post EMR Cost Optimization appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/emr-cost-optimization/feed/ 0
CDO Sessions: Transforming DataOps in Banking https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/ https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/#respond Wed, 15 Jul 2020 19:19:35 +0000 https://www.unraveldata.com/?p=5040

The post CDO Sessions: Transforming DataOps in Banking appeared first on Unravel.

]]>

The post CDO Sessions: Transforming DataOps in Banking appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/cdo-sessions-transforming-dataops-in-banking/feed/ 0
Understanding DataOps Impact on Application Quality https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/ https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/#respond Tue, 16 Jul 2019 19:46:57 +0000 https://www.unraveldata.com/?p=5064

The post Understanding DataOps Impact on Application Quality appeared first on Unravel.

]]>

The post Understanding DataOps Impact on Application Quality appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/understanding-dataops-and-its-impact-on-application-quality-with-devops-com-and-unravel/feed/ 0
Using Machine Learning to understand Kafka runtime behavior https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/ https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/#respond Wed, 29 May 2019 19:16:43 +0000 https://www.unraveldata.com/?p=2973

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit […]

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

]]>

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit in London. You can review the session slides or read the transcript below. 


Transcript

Nate Snapp:

All right. Thanks. I’ll just give a quick introduction. My name is Nate Snapp. And I do big data infrastructure and engineering at companies such as Adobe, Palo Alto Networks, and Omniture. And I have had quite a bit of experience in streaming, even outside of the Kafka realm, about 12 years. I’ve worked on some of the big-scale efforts that we’ve done for web analytics at Adobe and Omniture, working in the space of web analytics for a lot of the big companies out there, about 9 out of the 10 Fortune 500.

I’ve dealt with major release events, from new iPads to all these things that have to do with big increases in data and streaming that in a timely fashion. Done systems that have over 10,000 servers in that [00:01:00] proprietary stack. But the exciting thing is, the last couple of years, moving to Kafka, and being able to apply some of those same principles in Kafka. And I’ll explain some of those today. And then I like to blog as well, natesnapp.com.

Shivnath Babu:

Hello, everyone. I’m Shivnath Babu. I’m cofounder and CTO at Unravel. What my day job looks like is building a platform like what is shown on the slide where we collect monitoring information from the big data systems, receiver systems like Kafka, like Spark, and correlate the data, analyze it. And some of the techniques that we will talk about today are inspired by that work to help operations teams as well as application developers, better manage, easily troubleshoot, and to do better capacity planning and operations for their receiver systems.

Nate:

All right. So to start with some of the Kafka work that I have been doing, we have [00:02:00] typically about…we have several clusters of Kafka that we run that have about 6 to 12 brokers, up to about 29 brokers on one of them. But we run Kafka 5.2.1, and we’re upgrading the rest of the systems to that version. And we have about 1,700 topics across all these clusters. And we have pretty varying rates of ingestion, but topping out at about 20,000 messages a second…we’ve actually gotten higher than that. But the reason I explain this is that as we get into what kind of systems and events we see, the complexity involved often has some interesting play with how you respond to these operational events.

So one of the aspects is that we have a lot of self-service that’s involved in this. We have users that set up their own topics, they setup their own pipelines. And we allow them to do [00:03:00] that to help make them most proficient and able to get them up to speed easily. And so because of that, we have a varying environment. And we have quite a big skew for how they bring it in. They do it with other…we have several other Kafka clusters that feed in. We have people that use the Java API, we have others that are using the REST API. Does anybody out here use the REST API for ingestion to Kafka? Not very many. That’s been a challenge for us.

But we also, for the egress, have a lot of custom endpoints, and a big one is to use HDFS and Hive. So as I get into this, I’ll be explaining some of the things that we see that are really dynamically challenging and why it’s not as simple as if-else statements and how you triage and respond to these events. And then Shiv will talk more about how you can get to a higher level of using the actual ML to solve some of these challenging [00:04:00] issues.

So I’d like to start with a simple example. And in fact, this is what we often use. When we first get somebody new to the system, or we’re interviewing somebody to work in this kind of environment, as we start with a very simple example, and take it from a high level, I like to do something like this on the whiteboard and say, “Okay. You’re working in an environment that has data that’s constantly coming in that’s going through multiple stages of processing, and then eventually getting into a location where will be reported on, where people are gonna dive into it.”

And when I do this, it’s to show that there are many choke points that can occur when this setup is made. And so you look at this and you say, “That’s great,” you have the data flowing through here. But when you hit different challenges along the way, it can back things up in interesting ways. And often, we talk about that as cascading failures of latency, and in latency-sensitive systems, you’re counting [00:05:00] latency as things back up. And so what I’ll do is explain to them, what if we were to take, for instance, you’ve got these three vats, “Okay, let’s take this last vat or this last bin,” and if that’s our third stage, let’s go ahead and see if there’s a choke point in the pipe there that’s not processing. What happens then, and how do you respond to that?

And often, a new person will say, “Well, okay, I’m going to go and dive in,” then they’re gonna say, “Okay. What’s the problem with that particular choke point? And what can I do to alleviate that?” And I’ll say, “Okay. But what are you doing to make sure the rest of the system is processing okay? Is that choke point really just for one source of data? How do you make sure that it doesn’t start back-filling and cause other issues in the entire environment?” And so when I do that, they’ll often come back with, “Oh, okay, I’ll make sure that I’ve got enough capacity at the stage before the things don’t back up.”

And this actually has very practical implications. You take the simple model, and it applies to a variety of things [00:06:00] that happen in the real world. So for every connection you have to a broker, and for every source that’s writing at the end, you actually have to account for two ports. It’s basic things, but as you scale up, it matters. So I have a port with the data being written in. I have a port that I’m managing for writing out. And as I have 5,000, 10,000 connections, I’m actually two X on the number of ports that I’m managing.

And what we found recently was we’re hitting the ephemeral port, what was it, the max ephemeral port range that Linux allows, and all of a sudden, “Okay. It doesn’t matter if you felt like you had capacity, maybe in the message storage; we actually hit boundaries in different areas.” And so I think it’s important to step back a little at times, so that you’re thinking in terms [00:07:00] of what other resources we are not accounting for, that can back up. So we find that there are very important considerations on understanding the business logic behind it as well.
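A quick way to sanity-check that headroom on a Linux host, sketched in Python; the expected connection count is an assumption you would replace with your own client numbers.

# Each outbound connection consumes a local (ephemeral) port on the host that opens it,
# so a stage that both reads in and writes out needs roughly two ports per flow, as described above.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

available = high - low + 1
expected = 2 * 10_000  # assumption: ~10k producer/consumer connections, in and out
print(f"ephemeral ports available: {available}, expected in use: {expected}")
if expected > 0.8 * available:
    print("close to the limit: widen net.ipv4.ip_local_port_range or spread clients across hosts")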

Data transparency, data governance actually can be important knowing where that data is coming from. And as I’ll talk about later, the ordering effects that you can have, it helps with being able to manage that at the application layer a lot better. So as I go through this, I want to highlight that it’s not so much a matter of just having a static view of the problem, but really understanding streams for the dynamic nature that they have, and that you may have planned for a certain amount of capacity. And then when you have a spike and that goes up, knowing how the system will respond at that level and what actions to take, having the ability to view that becomes really important.

So I’ll go through a couple of [00:08:00] data challenges here, practical challenges that I’ve seen. And as I go through that, then when we get into the ML part, the fun part, you’ll see what kind of algorithms we can use to better understand the varying signals that we have. So the first challenge is variance in flow. And we actually had this with a customer of ours. And they would come to us saying, “Hey, everything looked good as far as the number of messages that they’ve received.” But when they went to look at a certain…basically another topic for that, they were saying, “Well, some of the information about these visits actually looks wrong.”

And so they actually showed me a chart that look something like this. You could look at this like a five business day window here. And you look at that and you say, “Yeah, I mean, I could see. There’s a drop after Monday, the first day drops a little low.” That may be okay. Again, this is where you have to know what does a business trying to do with this data. And as an operator like myself, I can’t always make those as testaments really well. [00:09:00] Going to the third peak there, you see that go down and have a dip, they actually point out something like that, although I guess it was much smaller one and they’d circle this. And they said, “This is a huge problem for us.”

And I come to find out it had to do with sales of product and being able to target the right experience for that user. And so looking at that, and knowing what those anomalies are, and what they pertain to, having that visibility, again, going to the data transparency, you can help you to make the right decisions and say, “Okay.” And in this case with Kafka, what we do find is that at times, things like that can happen because we have, in this case, a bunch of different partitions.

One of those partitions backed up. And now, most of the data is processing, but some isn’t, and it happens to be bogged down. So what decisions can we make to be able to account for that and be able to say that, “Okay. This partition backed up, why is that partition backed up?” And the kind of ML algorithms you choose are the ones that help you [00:10:00] answer those five whys. If you talk about, “Okay. It backed up. It backed up because,” and you keep going backwards until you get to the root cause.
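One way to see that per-partition backup directly, sketched with the kafka-python client against hypothetical broker, group, and topic names:

from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "broker:9092"   # assumption: your bootstrap servers
GROUP = "visit-enrichment"  # hypothetical consumer group
TOPIC = "visits"            # hypothetical topic

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
committed = admin.list_consumer_group_offsets(GROUP)  # {TopicPartition: OffsetAndMetadata}

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
partitions = [tp for tp in committed if tp.topic == TOPIC]
latest = consumer.end_offsets(partitions)

# A single partition with runaway lag shows up immediately here,
# even when the topic's total throughput looks healthy.
for tp in sorted(partitions, key=lambda p: p.partition):
    lag = latest[tp] - committed[tp].offset
    print(f"partition {tp.partition}: lag={lag}")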

Challenge number two. We’ve experienced negative effects from really slow data. Slow data can happen for a variety of reasons. Two of them that we principally have found are users that have data that’s staging and temporary, that they’re gonna be using to test another pipeline and then rolling into production. But there’s actually some slow data in production, and it’s more or less how that stream is supposed to behave.

So think in terms again of what is the business trying to run on the stream? In our case, we would have some data that we wouldn’t want to have frequently, like cancellations of users for our product. We hope that that stays really low. We hope that you have very few bumps in the road with that. But what we find with Kafka is that you have to have a very good [00:11:00] understanding of your offset expiration and your retention periods. We find that if you have those offsets expire too soon, then you have to go into a guesswork of, “Am I going to be aggressive and try and look back really far and reread that data, in which case I may double count? Or am I gonna be really conservative, in which case I may miss critical events?” Especially if you’re looking at cancellations, that’s something that you need to understand about your users.
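That aggressive-versus-conservative choice shows up directly in consumer configuration; a small sketch with the kafka-python client and hypothetical names:

from kafka import KafkaConsumer

# If committed offsets have expired (offsets.retention.minutes on the broker),
# the consumer falls back to auto_offset_reset:
#   "earliest" -> aggressive: reread whatever the topic still retains, risking double counts
#   "latest"   -> conservative: jump to the head, risking missed cancellation events
consumer = KafkaConsumer(
    "cancellations",                    # hypothetical topic
    bootstrap_servers="broker:9092",    # assumption: your bootstrap servers
    group_id="cancellation-processor",  # hypothetical group
    enable_auto_commit=False,
    auto_offset_reset="earliest",
)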

And so we found that to be a challenge. Keeping with that, the third idea is event sourcing. This is something I've heard argued either way: should Kafka be used for event sourcing, and what does that look like? We have different events that we want to represent in the system. Do I create a topic per event and end up with a whole bunch of topics, or do I have a single topic, because we get stronger guarantees within the partitions of that topic?

And it's arguable; it depends on what you're trying to accomplish which is the right method. But what we've seen [00:12:00] with users, because we have a self-service environment, is that we want to give transparency back to them on what's going on when you set it up a certain way. So if they say, "Okay, I want to have," for instance for our products, "a purchase topic representing the times things were purchased, and then a cancellation topic representing the times they decided to cancel the product," what we can see are some ordering issues. I represented those matches with the colors up there. So a brown purchase followed by a cancellation for that same user; you can see the light purple there too, but you can see one where the cancellation clearly comes before the purchase.
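A small sketch of the two options being contrasted here, using the Python confluent-kafka producer; the topic names and payloads are hypothetical:

```python
# Topic-per-event-type versus a single keyed event-sourcing topic.
from confluent_kafka import Producer
import json

producer = Producer({"bootstrap.servers": "broker:9092"})

def emit(topic, user_id, event):
    # Keying by user_id sends all of a user's events to the same partition,
    # and Kafka preserves order within a partition.
    producer.produce(topic, key=user_id, value=json.dumps(event).encode())

user = "user-123"

# Option A: one topic per event type. Each topic is ordered internally, but a
# consumer joining "purchases" and "cancellations" can observe the cancellation
# before the purchase it refers to.
emit("purchases", user, {"type": "purchase", "sku": "sku-1"})
emit("cancellations", user, {"type": "cancellation", "sku": "sku-1"})

# Option B: a single event-sourcing topic keyed by user. The purchase and the
# cancellation land in the same partition, so consumers see them in order.
emit("account-events", user, {"type": "purchase", "sku": "sku-1"})
emit("account-events", user, {"type": "cancellation", "sku": "sku-1"})

producer.flush()
```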

And so that is confusing when you do the processing: "Okay, now they're out of order." So having the ability to expose that back to the user and say, "Here's the data that you're looking at and why the data is confusing; what can you do to respond to that?" and actually pushing that back to the users is critical. So I'll just cover a couple of surprises that [00:13:00] we've had along the way, and then I'll turn the time over to Shiv. But does anybody here use Schema Registry? Any Schema Registry users? Okay, we've got a few. Anybody ever have users change the schema unexpectedly and still cause issues, even though you've used Schema Registry? Is it just me? A few, okay.

We found that users have this idea that they want to have a solid schema, except when they want to change it. And so we've coined this term: flexible-rigid schema. They want a flexible-rigid schema. They want the best of both worlds. But what they have to understand is, "Okay, you introduced a new Boolean value, but your events don't have that." I can't pick one. I can't default to one. I'll pick the wrong one, I guarantee you; it's a 50/50 chance. And so we have this idea of: can we expose back to them what they're seeing, and when those changes occur? They don't always have control over changing the events at the source, and so they may not even be in control of their own destiny of having a [00:14:00] schema changed throughout the system.
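To make the Boolean example concrete, here is a hedged sketch of the same change expressed as Avro schema versions (the record and field names are hypothetical). Adding a field with no default is the kind of change a Schema Registry subject in the default BACKWARD compatibility mode would typically reject:

```python
# Two ways to add a Boolean field to an Avro record schema.
import json

v1 = {
    "type": "record", "name": "Purchase",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "sku", "type": "string"},
    ],
}

# Adding a Boolean with NO default: new readers cannot decode old events, and
# there is no safe value for the platform to invent (it is a 50/50 guess).
v2_breaking = dict(v1, fields=v1["fields"] + [{"name": "gift", "type": "boolean"}])

# Adding the same field WITH an explicit default keeps old events readable and
# makes the producer's intent visible instead of leaving operators to guess.
v2_compatible = dict(v1, fields=v1["fields"] + [
    {"name": "gift", "type": "boolean", "default": False},
])

print(json.dumps(v2_compatible, indent=2))
```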

Timeouts, leader affinity: I'm going to skip over some of these or not spend a lot of time on them. But timeouts is a big one. As we write to HDFS, we see that backups caused by the Hive metastore when we're writing with the HDFS sink can push everything into a rebalancing state, which is really expensive; what was a single broker having an issue now becomes a cascading issue. Again, there are others, like leader affinity and poor assignments; there's randomness in where those get assigned. We'd like to have context about the system involved. We want to be able to understand the state of the system when those choices are being made, and whether we can affect that. Shiv will cover how that kind of visibility is helpful with some of the algorithms that can be used. And basically, all of those things just lead to: why do we need better visibility into our data problems with Kafka? [00:15:00] Thanks.

Shivnath:

Thanks, Nate. So what Nate just talked about, and I'm sure the reason most of you are here at this conference, is that streaming applications and Kafka-based architectures are becoming very, very popular, very common, and are driving mission-critical applications across manufacturing, across ecommerce, many, many industries. For this talk, I'm going to approximate a streaming architecture as something like this: there's a stream store, something like Kafka or Pulsar, and maybe state is kept in a NoSQL system like HBase or Cassandra. And then there's the computational element. This could be in Kafka itself with Kafka Streams, but it could be Spark Streaming or Flink as well.

When things are great, everything is fine. But when things start to break, maybe, as Nate mentioned, you're not getting your results on time; things are congested, [00:16:00] backlogged; all of these things can happen. And unfortunately, in any architecture like this, there are several systems involved, so problems could happen in many different places. It could be an application problem, maybe the structure, the schema, how things were set up (like Nate's example of a join across streams); there might be problems there. But there might also be problems at the Kafka level, maybe the partitioning, maybe brokers, maybe contention; or things at the Spark level, resource-level problems or configuration problems. All of these become very hard to troubleshoot.

And I'm sure many of you have run into problems like this. How many of you here can relate to the problems I'm showing on the slide? Quite a few hands. As you're running things in production, these challenges happen. And unfortunately, given that these are all individual systems, often there is no single correlated view that connects the streaming [00:17:00] computational side with the storage side, or maybe with the NoSQL side. And that poses a lot of issues. One of the crucial ones is that when a problem happens, it takes a lot of time to resolve.

And wouldn't it be great if there were some tool out there, some solution, that could give you visibility, as well as do things along the following lines and empower the DevOps teams? First and foremost, there are metrics all over the place: metrics, logs, you name it, from all these different levels, especially the platform, the application, and all the interactions that happen in between. Metrics and logs like these can be brought into one single platform, giving at least a good, nicely correlated view. So you can go from the app to the platform, or vice versa, depending on the problem.

And what we will try to cover today is, with all these advances happening in machine learning, how applying machine learning to some of this data can help you find problems quicker, [00:18:00] as well as, in some cases, using artificial intelligence and the ability to take automated actions to either prevent problems from happening in the first place or, if those problems do happen, to be able to quickly recover and fix them.

And what we will really try to cover in this talk is basically what I'm showing on the slide. There are tons and tons of interesting machine learning algorithms. At the same time, you have all of these different problems that happen with Kafka streaming architectures. How do you connect both worlds? How can you, based on the goal you're actually trying to solve, bring the right algorithm to bear on the problem? And the way we'll structure that is: again, DevOps teams have lots of different goals. Broadly, there's the app side of the fence, and there's the operations side of the fence.

As a person who owns a Kafka streaming application, you might have goals related to latency: I need this kind of latency, or this much throughput. Or it might be a combination of those along with, "Hey, I can only [00:19:00] tolerate this much data loss." And the talks that have happened in this room have covered different aspects of this and how replicas and parameters help you meet all of these goals.

On the operations side, maybe your goals are around ensuring the cluster is reliable, so the loss of a particular rack doesn't cause data loss and things like that; or maybe, on the cloud, they're about ensuring you're getting the right price-performance and all of those things. And on the other side are all of these interesting advances that have happened in ML. There are algorithms for detecting outliers and anomalies, or for doing correlation. So the real focus is going to be: let's take some of these goals, let's work with the different algorithms, and see how you can get these things to meet. And along the way, we'll try to describe some of our own experiences: what worked, what didn't work, as well as other things that are worth exploring.

[00:20:00] So let's start with the very first one: outlier detection algorithms. And why would you care? There are two very, very simple problems. If you remember from earlier in the talk, Nate talked about one of the critical challenges they had, where the problem was exactly this: hey, there could be some imbalance among my brokers, or there could be some imbalance within a topic among the partitions, some partition really getting backed up. Very, very common problems. How can you, instead of manually looking at graphs and trying to stare and figure things out, very quickly apply some automation to it?

Here's a quick screenshot that leads us through the problem. If you look at the graph I've highlighted on the bottom, this is a small cluster with three brokers; call them one, two, and three. And one of the brokers has much lower throughput. It could be the other way, one of them having really high throughput. Can we detect this automatically, instead of having to figure it out after the fact? [00:21:00] And this is a problem where there are tons of algorithms for outlier detection, from statistics and now, more recently, from machine learning.

And the algorithms differ based on whether they deal with one parameter at a time or multiple parameters at a time (univariate versus multivariate), and whether they can take temporal behavior into account or look more at a snapshot in time. A very, very simple technique, which actually works surprisingly well, is the z-score, where the main idea is: let's say you have different brokers, or you have a hundred partitions in a topic. You can take any metric, say the bytes-in metric, and fit a Gaussian distribution to the data.

And anything that is a few standard deviations away is an outlier. A very, very simple technique. The problem with this technique is that it does assume a Gaussian distribution, which sometimes may not be the [00:22:00] case. In the case of brokers and partitions, that is usually a safe assumption to make. But if that technique doesn't work, there are other techniques. One of the more popular ones that we have had some success with, especially when you're looking at multiple time series at the same time, is DBSCAN. It's basically a density-based clustering technique. I'm not going into all the details, but the key idea is that it uses some notion of distance to group points into clusters, and anything that doesn't fall into a cluster is an outlier.
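A minimal sketch of that z-score idea over per-partition bytes-in (the numbers are made up):

```python
# Flag partitions whose throughput is a few standard deviations from the rest.
import numpy as np

# Hypothetical MB/min per partition; partition 4 is the backed-up one.
bytes_in = np.array([52.1, 49.8, 50.6, 51.2, 12.3, 50.0, 48.9])

z = (bytes_in - bytes_in.mean()) / bytes_in.std()
outliers = np.where(np.abs(z) > 2.0)[0]   # "a few standard deviations away"
print(outliers)                           # -> [4]
```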

Then there are tons of other very interesting techniques, such as using binary trees to find outliers, called isolation forests. And in the deep learning world, there is a lot of interesting work happening with autoencoders, which try to learn a representation of the data. Again, once you've learned the representation from all the training data available, things that don't fit the representation are outliers.
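And a hedged sketch of two of the other detectors mentioned here, DBSCAN and isolation forests, using scikit-learn on the same kind of made-up per-partition metrics:

```python
# Two features per partition (bytes-in, consumer lag); values are synthetic.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

X = np.array([
    [50.2, 1.1], [49.7, 0.9], [51.0, 1.3], [50.5, 1.0],
    [12.3, 40.2],                      # the backed-up partition
    [49.9, 1.2], [48.8, 0.8],
])

# DBSCAN: density-based clustering; points that land in no cluster get label -1.
labels = DBSCAN(eps=5.0, min_samples=3).fit_predict(X)
print(np.where(labels == -1)[0])       # -> [4]

# Isolation forest: tree-based; -1 marks points that are easy to isolate.
scores = IsolationForest(contamination=0.15, random_state=0).fit_predict(X)
print(np.where(scores == -1)[0])       # likely -> [4]
```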

So this is the template I'm going to [00:23:00] follow in the rest of the talk. Basically, with the template, I pick a technique. Next, I'm going to look at forecasting, give you some example use cases where having a good forecasting technique can help you in the Kafka DevOps world, and then tell you about some of those techniques. So for forecasting, there are two places where having a good technique makes a lot of difference. One is avoiding reactive firefighting. When you have something like a latency SLA, if you can get a sense that things are backing up and there's a chance that, within a certain amount of time, the SLA will get missed, then you can take action quickly.

It could even be bringing on more brokers and things like that, or doing some partition reassignment and whatnot. But getting a heads-up like this is often very, very useful. The other is longer-term forecasting for things like capacity planning. So I'm going to use a real-life example here, one that one of our telecom customers actually worked through, [00:24:00] a sentiment analysis, sentiment extraction use case based on tweets. The overall architecture consisted of tweets coming in real time into Kafka, with the computation happening in Spark Streaming and some state being stored in a database.

So here's a quick screenshot of how things actually play out. In sentiment analysis, especially for customer service and customer support use cases, there's some kind of SLA. In this case, their SLA was around three minutes. What that means is that from the time you learn about a particular incident, which you can think of as these tweets coming into Kafka, the data has to be processed within a three-minute interval. What you're seeing on screen here: all the green bars represent the rate at which data is coming in, and the black line in between is the time series indicating the actual delay, [00:25:00] the end-to-end delay, the processing delay between data arrival and data being processed. And there is an SLA of three minutes.

So if you can see the line trending up, and there is a good forecasting technique that can be applied to that line, you can actually forecast and say that within a certain interval of time, maybe a few hours, maybe even less than that, at the rate this latency is trending up, my SLA will get missed. It's a great use case for having a good forecasting technique. Again, forecasting is an area that has been very well studied, and there are tons of interesting techniques. On the more statistical side of time-series forecasting, there are techniques like ARIMA, which stands for autoregressive integrated moving average, and lots of variants around it, which use the trend in the data, differences between data elements, and patterns to forecast, with smoothing and taking historic data into account, and all of that good stuff.
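To make the SLA example concrete, here is a hedged sketch in Python with statsmodels; the delay series is synthetic, and the model order and trend choice are just a starting point, not a recommendation:

```python
# Forecast end-to-end delay and check when the SLA would be breached.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic processing delay (seconds), one point per minute, drifting upward.
delay = pd.Series(60 + 0.5 * np.arange(240) + rng.normal(0, 5, 240))

# ARMA errors around a deterministic trend ("ct") so the drift carries forward.
fit = ARIMA(delay, order=(1, 0, 1), trend="ct").fit()
forecast = fit.forecast(steps=120)            # look two hours ahead

SLA = 180.0                                   # the three-minute SLA from the example
breach = forecast[forecast > SLA]
if len(breach):
    print(f"SLA likely breached ~{breach.index[0] - len(delay)} minutes from now")
else:
    print("No breach expected in the next two hours")
```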

On the other [00:26:00] extreme, as I've said, there have been a lot of recent advances in using neural networks, because time-series data is one thing that is very, very easy to get. So there's the long short-term memory network, the LSTM, and recurrent neural networks, which have been pretty good at this. But we have actually had a lot of, I would say, success with a technique that Facebook originally released as open source, called the Prophet algorithm, which is not very different from ARIMA and the older family of forecasting techniques. It differs in some subtle ways.

The main idea here is what is called a generalized additive model. I've put in a very simple example here. The idea is to model the time series, whichever time series you are picking, as a combination of components: the trend in the time-series data, so extract out the trend; the seasonality, maybe yearly seasonality, maybe monthly, weekly, [00:27:00] maybe even daily seasonality; and, and this is a key thing, what I've called shocks. If you're thinking about forecasting in an ecommerce setting, Black Friday or the Christmas days are times when the time series will have very different behavior.

So in Kafka, or in the operational context, if you are rebooting, installing a new patch, or upgrading, these often end up shifting the patterns of the time series, and they have to be explicitly modeled. Otherwise the forecasting can go wrong, and then the rest is error. The reason Prophet has actually worked really well for us, apart from the ones I mentioned (it fits quickly and all of that good stuff), is that it is very, I would say, customizable: your domain knowledge about the problem can be incorporated, instead of it being something that just gives you a result where all that remains is parameter tuning.
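As a rough sketch of what that additive model looks like in practice, here is a minimal Prophet example; the lag series is synthetic, and the upgrade dates passed in as "holidays" are hypothetical operational shocks:

```python
# Forecast an hourly lag metric with daily/weekly seasonality plus explicit shocks.
import numpy as np
import pandas as pd
from prophet import Prophet

rng = np.random.default_rng(1)
ds = pd.date_range("2019-01-01", periods=24 * 28, freq="H")
# Synthetic lag with a daily cycle and noise.
y = 100 + 30 * np.sin(2 * np.pi * ds.hour / 24) + rng.normal(0, 5, len(ds))
history = pd.DataFrame({"ds": ds, "y": y})

# Operational shocks (upgrades, restarts) modeled the way Prophet models holidays,
# so they do not distort the trend and seasonality estimates.
upgrades = pd.DataFrame({
    "holiday": "broker_upgrade",
    "ds": pd.to_datetime(["2019-01-10", "2019-01-21"]),
})

m = Prophet(holidays=upgrades, daily_seasonality=True, weekly_seasonality=True)
m.fit(history)

future = m.make_future_dataframe(periods=48, freq="H")   # two days ahead
forecast = m.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```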

So Prophet is something I'll definitely ask all of you [00:28:00] to take a look at. The defaults work relatively well, but forecasting is something where we have seen that you can't just trust the machine alone. It needs some guidance, some data science, some domain knowledge to be put alongside the machine learning to actually get good forecasts. So that's forecasting. We saw how outlier detection applies, we saw forecasting. Now let's get into something even more interesting: anomalies, detecting anomalies.

So first, what is an anomaly? You can think of an anomaly as an unexpected change: something that, if you were expecting it, is not an anomaly; something unexpected that needs your attention. That's what I'm going to characterize as an anomaly. Where can it help? Smart alerts, alerts where you don't have to configure thresholds and all of those things, or worry about your workload changing or new upgrades happening. Wouldn't it be great if these anomalies could [00:29:00] be auto-detected? But that's also very challenging, by no means trivial, because if you're not careful, your smart alerts will turn out to be really dumb. You might get a lot of false alerts, and that way you lose confidence, or it might miss real problems that are happening.

So anomaly detection is something that is pretty hard to get right in practice. And here's one simple but very illustrative example. With Kafka, you always see lags. What I'm plotting here is increasing lag. Is that really expected? Maybe there were bursts in data arrival, and maybe these lags build up at some point every day; maybe it's okay. When is it an anomaly that I really need to take a look at? That's when these anomaly detection techniques become very important.

Many [00:30:00] different schools of thought exist on how to build a good anomaly detection technique, including the ones I talked about earlier for outlier detection. One approach that has worked well for us in the past is to model anomalies as: I'm forecasting something based on what I know, and if what I see differs from the forecast, then that's an anomaly. So you can pick your favorite forecasting technique, or the one that actually works, ARIMA, or Prophet, or whatnot, use that technique to do the forecasting, and then deviations become interesting and anomalous.

What I'm showing here is a simple example of that technique with Prophet. A more common one that we have seen actually work relatively well is this thing called STL. It stands for Seasonal and Trend decomposition using LOESS, a smoothing function. You take the time series and extract out and separate the trend from it first; that leaves [00:31:00] the time series without the trend, and then you extract out all the seasonalities. And once you have done that, whatever remainder, or residual as it's called, is left, even if you just put a threshold on that, is reasonably good. I wouldn't say perfect, but reasonably good at extracting out these anomalies.
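A minimal sketch of that decompose-and-threshold idea, assuming statsmodels and a synthetic hourly lag series with one injected spike; the three-sigma threshold on the residual is just a starting point:

```python
# STL-based anomaly detection: remove trend and daily seasonality, threshold the residual.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(2)
ds = pd.date_range("2019-01-01", periods=24 * 14, freq="H")
y = 100 + 30 * np.sin(2 * np.pi * ds.hour / 24) + rng.normal(0, 5, len(ds))
y[200] += 120                        # inject one unexpected spike
series = pd.Series(y, index=ds)

result = STL(series, period=24, robust=True).fit()
resid = result.resid                 # what's left after trend + seasonality

anomalies = series[np.abs(resid) > 3 * resid.std()]
print(anomalies)                     # should surface the injected spike
```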

Next one: correlation analysis, getting even more complex. So basically, once you have detected an anomaly, or you have a problem that you want to root cause: why did it happen? What do you need to do to fix it? Here's a great example. I saw my anomaly, something shot up, maybe it's the lag that is building up, and it definitely looks anomalous. And now, okay, maybe we're at a point where the baseline is much higher. But can I root cause it? Can I pinpoint what is causing it? Is it just the fact that there was a burst of data, or something else, maybe resource allocation issues, maybe some hot-spotting in the [00:32:00] brokers?

And here you can start to apply time-series correlation: which lower-level time series correlates best with the higher-level time series where your application latency increased? The challenge is that there are hundreds, if not thousands, of time series if you look at Kafka; every broker has so many kinds of time series it can give you, at every level, and all of these quickly add up. So it's a pretty hard problem. If you just throw time-series correlation techniques at it, even time series that merely have some trend in them look correlated. So you have to be very careful.

The things to keep in mind are things like picking a good similarity function across time series. For example, you could use something like Euclidean distance, which is a straightforward, well-defined, well-understood distance function between points or between time series. We have had a lot of success with something called dynamic time warping, which is very good at dealing with time series that might be slightly out of [00:33:00] sync. If you remember all the things that Nate mentioned, the Kafka world and the streaming world are an asynchronous, event-driven world; you just don't see all the time series nicely synchronized. So time warping is a good technique to extract distances in such a context.
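Since dynamic time warping is easy to state as a dynamic program, here is a small self-contained numpy sketch (synthetic series, not real telemetry) that compares a lag-shifted signal against an unrelated trend:

```python
# Classic DTW distance via dynamic programming, applied to out-of-sync signals.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return float(cost[n, m])

t = np.linspace(0, 6 * np.pi, 200)
app_latency = np.sin(t)               # higher-level signal
broker_lag = np.sin(t - 0.6)          # same shape, arrives a little later
unrelated = np.linspace(0, 1, 200)    # just a trend

# DTW should report the shifted broker signal as much closer to the app latency
# than the unrelated trending series.
print(dtw_distance(app_latency, broker_lag), dtw_distance(app_latency, unrelated))
```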

But the other part is, just like you saw with Prophet, instead of just throwing some machine learning technique at it and praying that it works, you have to really try to understand the problem. The way we have tried to break it down into something usable is, for a lot of these time series, you can split the space into time series related to application performance and time series related to resources and contention, and then apply correlation within those buckets. So try to scope the problem.

Last technique: model learning. And this turns out to be the hardest. But if you have a good modeling technique, a good model that can answer what-if questions, then for things like what we saw in the previous talk with [00:34:00] all of those reassignments and whatnot, you can actually find out what a good reassignment is and what impact it will have. Or, as Nate mentioned, where consumer timeouts and rebalancing storms can kick in: what's a good timeout?

So in a lot of these places you have to pick some threshold or some strategy, and having a model that can quickly do what-ifs, evaluate which option is better, or even rank them, can be very, very useful and powerful. And this is the thing that is needed for enabling automated action. Here's a quick example. There's a lag. We can figure out where the problem is actually happening. But then wouldn't it be great if something suggested how to fix it? Increase the number of partitions from X to Y, and that will fix the problem.

Modeling, at the end of the day, is a function that you're fitting to the data. I don't have time to go into this in great detail, but you have to carefully pick the right [00:35:00] input features. And what is very, very important is to ensure that you have the right training data. For some problems, just collecting data from the production cluster is good training data. Sometimes it is not, because you have only seen certain regions of the space. So with that, I'm actually going to wrap up.
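As an illustration only, a what-if model of this kind can be sketched as a plain regression. Everything below (the features, the numbers, the synthetic "ground truth") is made up to show the shape of the approach, not how any particular product models it:

```python
# Fit workload/config features to observed lag, then score candidate partition counts.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 500
partitions = rng.integers(4, 65, n)
msgs_per_sec = rng.uniform(1e3, 5e4, n)
consumers = rng.integers(2, 32, n)
# Synthetic ground truth: lag grows with load, shrinks with parallelism.
lag = msgs_per_sec / (partitions * 50 + consumers * 200) + rng.normal(0, 0.5, n)

X = np.column_stack([partitions, msgs_per_sec, consumers])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, lag)

# What-if: at today's load, what does lag look like with more partitions?
# Predictions are only trustworthy where the training data covered the space.
load, workers = 3e4, 8
for p in (8, 16, 32, 64):
    pred = model.predict([[p, load, workers]])[0]
    print(f"partitions={p:>2}  predicted lag ~ {pred:.1f}s")
```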

So what we tried to do in this talk is really give you a sense, as much as we can in a 40-minute talk, of all of these interesting Kafka DevOps challenges, meaning streaming application challenges, and how to map them to something where you can use machine learning or some elements of AI to make your life easier, or at least guide you so that you're not wasting a lot of time looking for a needle in a haystack. And with that, I'm going to wrap up. There is this area called AIOps, which is very interesting, trying to bring AI and ML together with data to solve these DevOps challenges. We have a booth here; please drop by to see some of the techniques we have. And yes, if you're interested in working on these interesting challenges, streaming, building [00:36:00] these applications, or applying techniques like this, we are hiring. Thank you.

Okay. Thanks, guys. So we actually have about three or four minutes for questions, a mini Q&A.

Attendee 1:

I know nothing about machine learning, but I can see it really helping me with debugging problems. As a noob, how can I get started?

Shivnath:

I can take that question. So the question gets at the whole point of this talk, which is: on one hand, if you go to a Spark Summit, you'll see a lot of machine learning algorithms and this and that, right? If you come to a Kafka Summit, then it's all about Kafka and DevOps challenges. How do these worlds meet? That's exactly what we tried to cover in the talk. If you were looking at a lot of the techniques, they're not fancy machine learning techniques. Our experience has been that once you understand the use case, there are fairly good techniques, even from statistics, that can solve the problem reasonably well.

And once you have that firsthand experience, then look for better and better techniques. So hopefully this talk gives you a sense of the techniques to just get started with; as you get a better understanding of the problem and the [00:38:00] data, you can improve and apply more of those deep learning techniques and whatnot. But most of the time, you don't need that for these problems. Nate?

Nate:

No. Absolutely same thing.

Attendee 2:

Hi, Shiv, nice talk. Thank you. Quick question: for those machine learning algorithms, can they be applied cross-domain? Or if you are moving from DevOps for Kafka to Spark Streaming, for example, do you have to hand-pick those algorithms and tune the parameters again?

Shivnath:

So the question is: how much do these algorithms transfer? Does something we've found for Kafka apply to Spark Streaming? Would it apply to a high-performance system like Impala, or Redshift for that matter? Again, the real hard truth is that no one size fits all. I mentioned outlier detection; that might be a technique that can be applied to any [00:39:00] load-imbalance problem. But once you start getting into anomalies or correlation, some amount of expert knowledge about the system has to be combined with the machine learning technique to really get good results.

So again, it's not as if it has to be all expert rules, but some combination. If you pick the financial space, there are data scientists who understand the domain and know how to work with data. Just like that, even in the DevOps world, the biggest success will come when somebody understands the system in question, has the knack of working with these algorithms reasonably well, and can combine both of those. So, something like a performance data scientist.

 

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

]]>
https://www.unraveldata.com/using-machine-learning-to-understand-kafka-runtime-behavior/feed/ 0
Maximize Big Data Application Performance and ROI https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/ https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/#respond Tue, 29 Jan 2019 05:55:33 +0000 https://www.unraveldata.com/?p=1701

The post Maximize Big Data Application Performance and ROI appeared first on Unravel.

]]>

The post Maximize Big Data Application Performance and ROI appeared first on Unravel.

]]>
https://www.unraveldata.com/resources/maximize-big-data-application-performance-and-roi/feed/ 0