Kafka Resources | Unravel Data

How to intelligently monitor Kafka/Spark Streaming data pipelines

George Demarest — Thu, 06 Jun 2019 18:41:06 +0000

Identifying issues in distributed applications like a Spark streaming application is not trivial. This blog describes how Unravel helps you connect the dots across streaming applications to identify bottlenecks.

Spark streaming is widely used in real-time data processing, especially with Apache Kafka. A typical scenario involves a Kafka producer application writing to a Kafka topic. The Spark application then subscribes to the topic and consumes records. The records might be further processed downstream using operations like map and foreachRDD ops or saved into a datastore.

Below are two scenarios illustrating how you can use Unravel’s APMs to inspect, understand, correlate, and finally debug issues around a Spark streaming app consuming a Kafka topic. They demonstrate how Unravel’s APMs can help you perform an end-to-end analysis of the data pipelines and its applications. In turn, this wealth of information helps you effectively and efficiently debug and resolve issues that otherwise would require more time and effort.

Use Case 1: Producer Failure

Consider a running Spark application that has not processed any data for a significant amount of time. The application’s previous iterations successfully processed data and you know for a certainty that current iteration should be processing data.

Detect Spark is processing no data

When you notice the Spark application has been running without processing any data, you want to check the application’s I/O.

Checking I/O Consumption

Click on it to bring up the application and see its I/O KPI

View Spark input details

When you see the Spark application is consuming from Kafka, you need to examine Kafka side of the process to determine the issue. Click on a batch to find the topic it is consuming.

View Kafka topic details

Bring up the Kafka Topic details which graphs the Bytes In Per Second, Bytes Out Per Second, Messages In Per Second, and Total Fetch Requests Per Second. In this case, you see Bytes In Per Second and Messages In Per Second graphs show a steep decline; this indicates the Kafka Topic is no longer producing any data.

The Spark application is not consuming data because there is no data in the pipeline to consume.
You should notify the owner of the application that writes to this particular topic. Upon notification, they can drill down to determine and then resolve the underlying issues causing that writes to this topic to fail.

The graphs show a steep decline in the BytesInPerSecond and MessagesInPerSecond metric. Since there is nothing flowing into the topic, there is no data to be consumed and processed by Spark down the pipeline. The administrator can then alert the owner of the Kafka Producer Application to further drill down into what are the underlying issues that might have caused the Kafka Producer to fail.

Use Case 2: Slow Spark app, offline Kafka partitions

In this scenario, an application’s run has processed significantly less data when compared to what is the expected norm. In the above scenario, the analysis was fairly straightforward. In these types of cases, the underlying root cause might be far more subtle to identify. The Unravel APM can help you to quickly and efficiently root cause such issues.

Detect inconsistent performance

When you see a considerable difference in the data processed by consecutive runs of a Spark App, bring up the APMs for each application.

Detect drop in input records

Examine the trend lines to identify a time window when there is a drop in input records for the slow app. The image, the slow app, shows a drop off in consumption.

Drill down to I/O drop off

Drill down into the slow application’s stream graph to narrow the time period (window) in which the I/O dropped off.

Drill down to suspect time interval

Unravel displays the batches that were running during the problematic time period. Inspect the time interval to see the batches’ activity. In this example no input records were processed during the suspect interval. You can dig further into the issue by selecting a batch and examining it’s input tab. In the below case, you can infer that no new offsets have been read based upon the input source’s description,“Offsets 1343963 to Offset 1343963”.

View KPIs on the Kafka monitor screen

You can further debug the problem on the Kafka side by viewing the Kafka cluster monitor. Navigate to OPERATIONS > USAGE DETAILS > KAFKA. At a glance, the cluster’s KPIs convey important and pertinent information.

Here, the first two metrics show that the cluster is facing issues which need to be resolved as quickly as possible.

# of under replicated partitions is a broker level metric of the count of partitions for which the broker is the leader replica and the follower replicas that have yet not caught up. In this case there are 131 under replicated partitions.

# of offline partitions is a broker level metric provided by the cluster’s controlling broker. It is a count of the partitions that currently have no leader. Such partitions are not available for producing/consumption. In this case there are two (2) offline partitions.

You can select the BROKER tab to see broker table. Here the table hints that the broker, Ignite1.kafka2 is facing some issue; therefore, the broker’s status should be checked.

Broker Ignite1.kafka2 has two (2) offline partitions. The broker table shows that it is/was the controller for the cluster. We could have inferred this because only the controller can have offline partitions. Examining the table further, we see that Ignite.kafka1 also has an active controller of one (1).
The Kafka KPI # of Controller lists the Active Controller as one (1) which should indicate a healthy cluster. The fact that there are two (2) brokers listed as being one (1) active controller indicates the cluster is in an inconsistent state.

Identify cluster controller

In the broker tab table, we see that for the broker Ignite1.kafka2 there are two offline Kafka partitions. Also we see that the Active controller count is 1, which indicates that this broker was/is the controller for the cluster (This information can also be deduced from the fact that offline partition is a metric exclusive only to the controller broker). Also notice that the controller count for Ignite.kafka1 is also 1, which would indicate that the cluster itself is in an inconsistent state.

This gives us a hint that it could be the case that the Ignite1.kafka2 broker is facing some issues, and the first thing to check would be the status of the broker.

View consumer group status

You can further corroborate the hypothesis by checking the status of the consumer group for the Kafka topic the Spark App is consuming from. In this example, consumer groups’ status indicates it’s currently stalled. The topic the stalled group is consuming is tweetSentiment-1000.

Topic drill down

To drill down into the consumer groups topic, click on the TOPIC tab in the Kafka Cluster manager and search for the topic. In this case, the topic’s trend lines for the time range the Spark applications consumption dropped off show a sharp decrease in the Bytes In Per Second and Bytes Out Per Second. This decrease explains why the Spark app is not processing any records.

Consumer group details

To view the consumer groups lag and offset consumption trends, click on the consumer groups listed for the topic. Below, the topic’s consumer groups, stream-consumer-for-tweetSentimet1000, log end offset trend line is a constant (flat) line that shows no new offsets have been consumed with the passage of time. This further supports our hypothesis that something is wrong with the Kafka cluster and especially broker Ignite.kafka2.

Conclusion

These are but two examples of how Unravel helps you to identify, analyze, and debug Spark Streaming applications consuming from Kafka topics. Unravel’s APMs collates, consolidates, and correlates information from various stages in the data pipeline (Spark and Kafka), thereby allowing you to troubleshoot applications without ever having to leave Unravel.

The post How to intelligently monitor Kafka/Spark Streaming data pipelines appeared first on Unravel.

Kafka best practices: Monitoring and optimizing the performance of Kafka applications

George Demarest — Mon, 18 Jul 2022 21:04:02 +0000

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Administrators, developers, and data engineers who use Kafka clusters struggle to understand what is happening in their Kafka implementations. There are out of the box solutions like Ambari and Cloudera manager which provide some high level monitoring however most customers find these tools to be insufficient for troubleshooting purposes. These tools also fail to provide right visibility down to the applications acting as consumers that are processing Kafka data streams.

For Unravel Data users Kafka monitoring and insights come out of the box with your installation. This blog is geared towards showing you best practices for using Unravel as a Kafka monitoring tool to detect and resolve issues faster. We assume you have a baseline knowledge of Kafka concepts and architecture.

Kafka monitoring tools

There are several tools and ways to monitor Kafka applications, such as open source tools, proprietary tools, and purpose-built tools. There are advantages associated with each option. Based on the availability of skilled resources and organizational priority, you can choose the best option.

Importance of using Kafka Monitoring tools

An enterprise-grade Kafka monitoring tool, like Unravel, enables your critical resources to avoid wasting hours manually configuring monitoring for Apache Kafka. They can get more time delivering value to end users. Locating an issue can be a challenging task without having a monitoring solution in place. Your data engineering team will have to test different use cases and educated guesses before they can get to the root of issues. Whereas proprietary Kafka monitoring tools like Unravel, provides AI-enabled insights, recommendation and automate remediation.

Monitoring Cluster Health

Unravel provides color coded KPIs per cluster to give you a quick overall view on the health of your Kafka cluster. The colors represent the following:

Green = Healthy
Red = Unhealthy and is something you will want to investigate
Blue = Metrics on cluster activity

Let’s walk through a few scenarios and how we can use Unravel’s KPIs to help us through troubleshooting issues.

Under-replicated partitions

Under-replicated partitions tell us that replication is not going as fast as configured, which adds latency as consumers don’t get the data they need until messages are replicated. It also suggests that we are more vulnerable to losing data if we have a master failure. Any under replicated partitions at all constitute a bad thing and is something we’ll want to root cause to avoid any data loss. Generally speaking, under-replicated partitions usually point to an issue on a specific broker.

To investigate scenarios if under replicated partitions are showing up in your cluster we’ll need to understand how often this is occurring. For that we can use the following graphs to get a better understanding of this:

# of under replicated partitions

Another useful metric to monitor is “Log flush latency, 99th percentile“ graph which provides us the time it takes for the brokers to flush logs to disk:

Log Flush Latency, 99th Percentile

Log flush latency is important, because the longer it takes to flush log to disk, the more the pipeline backs up, the worse our latency and throughput. When this number goes up, even 10ms going to 20ms, end-to-end latency balloons which can also lead to under replicated partitions.

If we notice latency fluctuates greatly then we’ll want to identify which broker(s) are contributing to this. In order to do this we’ll go back to the “broker” tab and click on each broker to refresh the graph for the broker selected:

Once we identify the broker(s) with wildly fluctuating latency we will want to get further insight by investigating the logs for that broker. If the OS is having a hard time keeping up with log flush then we may want to consider adding more brokers to our Kafka architecture.

Offline partitions

Generally speaking you want to avoid offline partitions in your cluster. If you see this metric > 0 then there are broker level issues which need to be addressed.

Unravel provides the “# Offline Partitions” graph to understand when offline partitions occur:

This metric provides the total number of topic partitions in the cluster that are offline. This can happen if the brokers with replicas are down, or if unclean leader election is disabled and the replicas are not in sync and thus none can be elected leader (may be desirable to ensure no messages are lost).

Controller Health

This KPI displays the number of brokers in the cluster reporting as the active controller in the last interval. Controller assignment can change over time as shown in the “# Active Controller Trend” graph:

“# Controller” = 0
- There is no active controller. You want to avoid this!
“# Controller” = 1
- There is one active controller. This is the state that you want to see.
“# Controller” > 1
- Can be good or bad. During steady state there should be only one active controller per cluster. If this is greater than 1 for only one minute, then it probably means the active controller switched from one broker to another. If this persists for more than one minute, troubleshoot the cluster for “split brain”.

For Active Controller <> 1 we’ll want to investigate logs on the broker level for further insight.

Cluster Activity

The last three KPI’s show cluster activity within the last 24 hours. They will always be colored blue because these metrics can neither be good or bad. These KPIs are useful in gauging activity in your Kafka cluster for the last 24 hours. You can also view these metrics via their respective graphs below:

These metrics can be useful in understanding cluster capacity such as:

Add additional brokers to keep up with data velocity
Evaluate performance of topic architecture on your brokers
Evaluate performance of partition architecture for a topic

The next section provides best practices in using Unravel to evaluate performance of your topics / brokers.

Topic Partition Strategy

A common challenge for Kafka admins is providing an architecture for the topics / partitions in the cluster which can support the data velocity coming from producers. Most times this is a shot in the dark because OOB solutions do not provide any actionable insight on activity for your topics / partitions. Let’s investigate how Unravel can help shed some light into this.

Producer architecture

On the producer side it can be a challenge in deciding on how to partition topics on the Kafka level. Producers can choose to send messages via key or use a round-robin strategy when no key has been defined for a message. Choosing the correct # of partitions for your data velocity are important in ensuring we have a real time architecture that is performant.

Let’s use Unravel to get a better understanding on how the current architecture is performing. We can then use this insight in guiding us in choosing the correct # of partitions.

To investigate – click on the topic tab and scroll to the bottom to see the list of topics in your Kafka cluster:

In this table we can quickly identify topics which have heavy traffic where we may want to understand how our partitions are performing. This is convenient in identifying topics which have heavy traffic.

Let’s click on a topic to get further insight:

Consumer Group Architecture

We can also make changes on the consumer group side to scale according to your topic / partition architecture. Unravel provides a convenient view on the consumer group for each topic to quickly get a health status of the consumer groups in your Kafka cluster. If a consumer group is a Spark Streaming application then Unravel can also provide insights to that application thereby providing an end-to-end monitoring solution for your streaming applications.

See Kafka Insights for a use case on monitoring consumer groups, and lagging/stalled partitions or consumer groups.

Next steps:

Unravel is a purpose-built observability platform that helps you stop firefighting issues, control costs, and run faster data pipelines. What does that mean for you?

Never worry about monitoring and observing your entire data estate.
Reduce unnecessary cloud costs and the DataOps team’s application tuning burden.
Avoid downtime, missed SLAs, and business risk with end-to-end observability and cost governance over your modern data stack.

How can you get started?

Create your free account today.

Book a demo to see how Unravel simplifies modern data stack management.

References:

https://www.signalfx.com/blog/how-we-monitor-and-run-kafka-at-scale/

https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster

https://docs.confluent.io/current/control-center/systemhealth.html

The post Kafka best practices: Monitoring and optimizing the performance of Kafka applications appeared first on Unravel.

How To Get the Best Performance & Reliability Out of Kafka & Spark Applications

Unravel Data — Thu, 11 Mar 2021 22:10:06 +0000

The post How To Get the Best Performance & Reliability Out of Kafka & Spark Applications appeared first on Unravel.

Using Machine Learning to understand Kafka runtime behavior

George Demarest — Wed, 29 May 2019 19:16:43 +0000

On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit in London. You can review the session slides or read the transcript below.

Transcript

Nate Snapp:

All right. Thanks. I’ll just give a quick introduction. My name is Nate Snapp. And I do big data infrastructure and engineering at companies such as Adobe, Palo Alto Networks, and Omniture. And I have had quite a bit of experience in streaming even outside of the Kafka realm about 12 years. I’ve worked at some of the big scale efforts that we’ve done for web analytics at Adobe and Omniture, working in the space of web analytics for a lot of the big companies out there, about 9 out of the 10 Fortune 500.

I’ve have dealt with major release events, from new iPads to all these things that have to do with big increases in data and streaming that in a timely fashion. Done systems that have over 10,000 servers in that [00:01:00] proprietary stack. But the exciting thing is, last couple years, moving to Kafka, and being able to apply some of those same principles in Kafka. And I’ll explain some of those today. And then I like to blog as well, natesnapp.com.

Shivnath Babu:

Hello, everyone. I’m Shivnath Babu. I’m cofounder and CTO at Unravel. What my day job looks like is building a platform like what is shown on the slide where we collect monitoring information from the big data systems, receiver systems like Kafka, like Spark, and correlate the data, analyze it. And some of the techniques that we will talk about today are inspired by that work to help operations teams as well as application developers, better manage, easily troubleshoot, and to do better capacity planning and operations for their receiver systems.

Nate:

All right. So to start with some of the Kafka work that I have been doing, we have [00:02:00] typically about…we have several clusters of Kafka that we run that have about 6 to 12 brokers, up to about 29 brokers on one of them. But we run the Kafka 5.2.1, and we’re upgrading the rest of the systems to that version. And we have about 1,700 topics across all these clusters. And we have pretty varying rates of ingestion, but topping out about 20,000 messages a second…we actually gotten higher than that. But what I explained this for is as we get into what kind of systems and events we see, the complexity involved often has some interesting play with how do you respond to these operational events.

So one of the aspects is that we have a lot of self-service that’s involved in this. We have users that set up their own topics, they setup their own pipelines. And we allow them to do [00:03:00] that to help make them most proficient and able to get them up to speed easily. And so because of that, we have a varying environment. And we have quite a big skew for how they bring it in. They do it with other…we have several other Kafka clusters that feed in. We have people that use the Java API, we have others that are using the REST API. Does anybody out here use the REST API for ingestion to Kafka? Not very many. That’s been a challenge for us.

But we also been for the Egress have a lot of custom endpoints, and a big one is to use HDFS and Hive. So as I get into this, I’ll be explaining some of the things that we see that are really dynamically challenging and why it’s not as simple as EFL Statements and how you triage and respond to these events. And then Shiv will talk more about how you can get to a higher level of using the actual ML to solve some of these challenging [00:04:00] issues.

So I’d like to start with a simple example. And in fact, this is what we often use. When we first get somebody new to the system, or we’re interviewing somebody to work in this kind of environment, as we start with a very simple example, and take it from a high level, I like to do something like this on the whiteboard and say, “Okay. You’re working in an environment that has data that’s constantly coming in that’s going through multiple stages of processing, and then eventually getting into a location where will be reported on, where people are gonna dive into it.”

And when I do this, it’s to show that there’s many choke points that can occur when this setup is made. And so you look at this and you say, “That’s great,” you have the data flowing through here. But when you hit different challenges along the way, it can back things up in interesting ways. And often, we talk about that as cascading failures of latency and latency-sensitive systems, you’re counting [00:05:00] latency is things back up. And so what I’ll do is explain to them, what if we were to take, for instance, you’ve got these three vats, “Okay, let’s take this last vat or thus last bin,” and if that’s our third stage, let’s go ahead and see if there’s a choke point in the pipe there that’s not processing. What happens then, and how do you respond to that?

And often, a new person will say, “Well, okay, I’m going to go and dive in,” then they’re gonna say, “Okay. What’s the problem with that particular choke point? And what can I do to alleviate that?” And I’ll say, “Okay. But what are you doing to make sure the rest of the system is processing okay? Is that choke point really just for one source of data? How do you make sure that it doesn’t start back-filling and cause other issues in the entire environment?” And so when I do that, they’ll often come back with, “Oh, okay, I’ll make sure that I’ve got enough capacity at the stage before the things don’t back up.”

And this is actually has very practical implications. You take the simple model, and it applies to a variety of things [00:06:00] that happen in the real world. So for every connection you have to a broker, and for every source that’s writing at the end, you actually have to account for two ports. It’s basic things, but as you scale up, it matters. So I have a port with the data being written in. I have a port that I’m managing for writing out. And as I have 5,000, 10,000 connections, I’m actually two X on the number of ports that I’m managing.

And what we found recently was we’re hitting the ephemeral port, what was it, the max ephemeral port range that Linux allows, and all sudden, “Okay. It doesn’t matter if you felt like you had capacity and maybe the message storage state, but that we actually hit boundaries at different areas. And so I think it’s important to say these have to stay a little different at time, so that you’re thinking in terms [00:07:00] of what other resources are we not accounting for, that can back up? So we find that there’s very important considerations on understanding the business logic behind it as well.

Data transparency, data governance actually can be important knowing where that data is coming from. And as I’ll talk about later, the ordering effects that you can have, it helps with being able to manage that at the application layer a lot better. So as I go through this, I want to highlight that it’s not so much a matter of just having a static view of the problem, but really understanding streams for the dynamic nature that they have, and that you may have planned for a certain amount of capacity. And then when you have a spike and that goes up, knowing how the system will respond at that level and what actions to take, having the ability to view that becomes really important.

So I’ll go through a couple of [00:08:00] data challenges here, practical challenges that I’ve seen. And as I go through that, then when we get into the ML part, the fun part, you’ll see what kind of algorithms we can use to better understand the varying signals that we have. So first challenge is variance in flow. And we actually had this with a customer of ours. And they would come to us saying, “Hey, everything look good as far as the number of messages that they’ve received.” But when they went to look at a certain…basically another topic for that, they were saying, “Well, some of the information about these visits actually looks wrong.”

And so they actually showed me a chart that look something like this. You could look at this like a five business day window here. And you look at that and you say, “Yeah, I mean, I could see. There’s a drop after Monday, the first day drops a little low.” That may be okay. Again, this is where you have to know what does a business trying to do with this data. And as an operator like myself, I can’t always make those as testaments really well. [00:09:00] Going to the third peak there, you see that go down and have a dip, they actually point out something like that, although I guess it was much smaller one and they’d circle this. And they said, “This is a huge problem for us.”

And I come to find out it had to do with sales of product and being able to target the right experience for that user. And so looking at that, and knowing what those anomalies are, and what they pertain to, having that visibility, again, going to the data transparency, you can help you to make the right decisions and say, “Okay.” And in this case with Kafka, what we do find is that at times, things like that can happen because we have, in this case, a bunch of different partitions.

One of those partitions backed up. And now, most of the data is processing, but some isn’t, and it happens to be bogged down. So what decisions can we make to be able to account for that and be able to say that, “Okay. This partition backed up, why is that partition backed up?” And the kind of ML algorithms you choose are the ones that help you [00:10:00] answer those five whys. If you talk about, “Okay. It backed up. It backed up because,” and you keep going backwards until you get to the root cause.

Challenge number two. We’ve experienced negative effects from really slow data. Slow data can happen for a variety of reasons. Two of them that we principally have found is users that have data that staging and temporary that they’re gonna be testing another pipeline, and then rolling that into production. But there’s actually some in production that are slow data, and it’s more or less how that stream is supposed to behave.

So think in terms again of what is the business trying to run on the stream? In our case, we would have some data that we wouldn’t want to have frequently like cancellations of users for our product. We hope that that stay really low. We hope that you have very few bumps in the road with that. But what we find with Kafka is that you have to have a very good [00:11:00] understanding of your offset expiration and your retention periods. We find that if you have those offsets expire too soon, then you have to go into a guesswork of, “Am I going to be aggressive,” and try and look back really far and reread that data, in which case, you may double count? Or am I gonna be really conservative, which case you may miss critical events, especially if you’re looking at cancellations, something that you need to understand about your users.

And so we found that to be a challenge. Keeping with that and going into this third idea is event sourcing. And this is something that I’ve heard argued either way should Kafka be used for this event sourcing and what that is. We have different events that we want to represent in the system. Do I create a topic for event and then have a whole bunch of topics, or do I have a single topic because we provide more guarantees on the partitions in that topic?

And it’s arguable, depends on what you’re trying to accomplish, which is the right method. But what we’ve seen [00:12:00] with users is because we have a self-service environment, we wanna give the transparency back to them on here’s what’s going on when you set it up in a certain way. And so if they say, “Okay, I wanna have,” for instance for our products, “a purchase topic representing the times that things were purchased.” And then a cancellation topic to represent the times that they decided not to use certain…they decided to cancel the product, what we can see is some ordering issues here. And I represented those matching with the colors up there. So a brown purchase followed by cancellation on that same user. You could see the, you know, the light purple there, but you can see one where the cancellation clearly comes before the purchase.

And so that is confusing when you do the processing, “Okay. Now, they’re out of order.” So having the ability to expose that back to the user and say, “Here’s the data that you’re looking at and why the data is confusing. What can you do to respond to that,” and actually pushing that back to the users is critical. So I’ll just cover a couple of surprises that [00:13:00] we’ve had along the way. And then I’ll turn the time over to Shiv. But does anybody here use Schema Registry? Any Schema Registry users? Okay, we’ve got a few. Anybody ever have users change the schema unexpectedly and still cause issues, even though you’ve used Schema Registry? Is it just me? Had a few, okay.

We found that users have this idea that they wanna have a solid schema, except when they wanna change it. And so we’ve coined this term flexible-rigid schema. They want a flexible-rigid schema. They want the best in both worlds. But what they have to understand is, “Okay, you introduce a new Boolean value, but your events don’t have that.” I can’t pick one. I can’t default to one. Limit the wrong decision, I guarantee you. It’s a 50/50 chance. And so we have this idea of can we expose back to them what they’re seeing, and when those changes occur. And they don’t always have control over changing the events at the source. And so they may not even be control of their destiny of having a [00:14:00] schema throughout the system changed.

Timeouts, leader affinity, I’m gonna skip over some of these or not spend a lot of time on it. But timeouts is a big one. As we write to HDFS, we see that the backups that occur from the Hive meta store when we’re writing with the HDFS sync can cause everything to go to a rebalanced state, which is really expensive, and now becomes a cascading issue, which was a single broker having an issue. So, again, others with leader affinity, poor assignments. There’s a randomizes where those get assigned to. We’d like to have concepts of the system involved. We wanna be able to understand the state of the system. Windows choices are being made. And if we can affect that, and Shiv will cover how that kind of visibility is helpful with some of the algorithms that can be used. And then basically, those things, all just lead to, why do we need to have better visibility into our data problems with Kafka? [00:15:00] Thanks.

Shivnath:

Thanks, Nate. So what Nate just actually talked about, and I’m sure most of you at this conference, the reason you’re here is that the streaming applications, Kafka-based architectures are becoming very, very popular, very common, and driving mission critical applications across the manufacturing, across ecommerce, many, many industries. So for this talk, I’m just going to approximate streaming architecture as something like this. There’s a stream store, something like Kafka that is being used, or poser, and maybe status actually kept in a NoSQL system, like HBase or Cassandra. And then there’s the computational element. This could be in Kafka itself with KStream, but it could be a Spark streaming, Flink as well.

When things are great, everything is fine. But when things start to break, maybe as Nate mentioned, we’re not getting your results on time. Things are actually congested, [00:16:00] backlog, all of these things can happen. And unfortunately, in any architecture like this, what happens is they’re all receiver systems. So problems could happen in many different places such as, it could be an application problem, maybe the structure, the schema, how things where…like Nate gave an example of a joint across streams, which…There might be problems there, but that might also be problems that the Kafka level, may be the partitioning, may be brokers, may be contention, or things that as a Spark level, no resource level problems, or configuration problems, all of these become very hard to troubleshoot.

And I’m sure many of you have run into problems like this. How many of you here can relate to problems what I’m showing on the slide here? Quite a few hands. As you’re running things in production, these challenges happen. And unfortunately, what happens is, given that these are all individual systems, often there is no single correlated view event that connects the streaming [00:17:00] computational side with the story side, or maybe with the NoSQL sides. And that poses a lot of issues. One of the crucial things is that when a problem happens, it takes a lot of time to resolve.

And wouldn’t it be great if there’s some tool out there, some solution that can give you visibility, as well as to do things along the following lines and empowered like the DevOps teams? First and foremost, there are metrics all over the place, that metrics, logs, you name it. And from all these different levels, especially the platform, the application, and all the interactions that happen in between. Metrics along these can be brought into one single platform, and at least have a good, nice correlated view. So again, you can go from the app, to the platform, or vice-versa, depending on the problem.

And what we will try to cover today is with all these advances that are happening in machine learning, how in applying machine learning to some of this data can help you find out problems quicker, [00:18:00] as well as, in some cases, using artificial intelligence, using ability to take automated actions either prevent the problems from happening the first place, or at least if those problems happen gonna to be able to quickly recover and fix it.

And what we will really try to cover in the stack is basically what I’m showing the slide. There’s tons and tons of interesting machine learning algorithms. And the same time, you have all of these different problems that happened with Kafka streaming architectures. How do you connect both worlds? How can you bring based on the goal that you’re actually trying to solve the right algorithm to bear on the problem? And the way we’ll structure that is, again, DevOps teams have lots of different goals. Broadly, there’s the app side of the fence, and there’s the operations side of the fence.

As a person who owns Kafka streaming application, you might have goals related to latency. I need this kind of latency, or this much amount of the throughput, or maybe might be a combination of this along with, “Hey, I can only [00:19:00] tolerate this much amount of data loss,” and all those talks that have happened very much in this room, when the two different aspects of this and how replicas and parameters help you get all of these goals.

On the operation side, maybe your goals are around ensuring that the cluster reliable like a particular loss of a rack doesn’t really cause data loss and things like that, or maybe they are related on the cloud ensuring that you’re getting the right bite performance and all of those things. And on the other side are all of these interesting advances that happened in ML. There are algorithms for detecting outliers, anomalies, or actually doing correlation. So the real focus is going to be like, let’s take some of these goals. Let’s work with the different algorithms, and then seeing how you can get these things to meet. And along the way, we’ll try to describe some our own experiences. What would worked kind of what didn’t work, as well the other things that are worth exploring?

[00:20:00] So let’s start with the very first one, the outlier detection algorithms. And why would you care? There are two very, very simple problems. And if you remember from the earlier in the talk, Nate talked about one of the critical challenges they had, where the problem was exactly this, which is, hey, there could be some imbalance among my brokers. There could be some imbalance and a topic among the partitions, some partition really getting backed up. Very, very common problems that happen. How can you very quickly instead of manually looking at graphs, and trying to stare and figure things out? Apply some automation to it.

Here’s a quick screenshot that lead us to through the problem. If you look at the graph that have highlighted on the bottom, this is a small cluster with three brokers, capital one, two, and three. And one of the brokers is actually having much lower throughput. It could be the other way. One of the having a really high throughput. Can we detect this automatically? This is about having to figure it after the fact. [00:21:00] And there’s a problem where there are tons of algorithms for outlier detection from statistics and now more recently in machine learning.

And there are algorithms and differ based on, do they deal with one parameter at a time, do they deal with multiple parameters of the time, univariate, multivariate, or algorithms that can actually take temporal things into account, algorithms that are more looking at a snapshot in time. The very, very simple technique, which actually works surprisingly well as this score where the main idea is to take…let’s say, you have different brokers, or you have hundred partitions in a topic, you can take any metric. It might be in the bites and metric, and vector Gaussian and distribution to the data.

And anything that is actually a few standard deviations away as an outlier. Very, very simple technique. The problem in this technique is, it does assume that is the distribution, but sometimes may not be the [00:22:00] case. In the case of in a brokers and partitions that is usually a safe assumption to make. But if that technique doesn’t work, there are other techniques. One of the more popular ones that we have had some success with, especially when you’re looking at multiple time series of the same time is the DBSCAN. It’s basically a density-based clustering technique. I’m not going to the all the details, but the key ideas, it basically uses some notion of distance to group points into clusters, and anything that doesn’t fall into clusters and outlier.

Then there are tons of other very interesting techniques using like in a binary trees to find outliers called the isolation forests. And in the deep learning world, there is a lot of interesting work happening with auto encoders, which tried to learn representation of the data. And again, once you’ve learned the representation from all the training data that is available, things don’t fit the representation are outliers.

So this is the template I’m going to [00:23:00] follow in the rest of the talk. Basically, the template, I pick a technique. Next, I’m gonna look at forecasting and give you some example use cases that having a good forecasting technique can help you in that Kafka DevOps world, and then tell you about some of these techniques. So for forecasting, two places where it makes those…having a good technique makes a lot of difference. One is avoiding this reactive firefighting. When you have something like a latency SLA, if you could get a sense, things are backing up, and there’s a chance in a certain amount of time, the SLAs will get missed, then you can actually take action quickly.

It could even be bringing on normal brokers and things like that, or basically doing some partition reassignment and whatnot. But getting heads up this often very, very useful. The other is more long-term forecasting for things like capacity planning. So I’m gonna use actually a real life example here, an example that one of our telecom customers actually worked with [00:24:00] where it was a sentiment analysis, sentiment extraction used case based on tweets. So the overall architecture consisted of tweets coming in real-time like in those SLA Kafka, and then the computation was happening in Spark streaming with some state actually being stored in a database.

So here’s a quick screenshot of how things actually play out there. In sentiment analysis, especially for customer service, customer support really the used cases, there’s some kind of SLA. In this case, their SLA was around three minutes. What that means is, by the time you learn about the particular incident, if you will, which you can think of these tweets coming in Kafka. Within a three-minute interval, data has to be processed. What are you seeing on screen here, all those green bars represent the rate in which data is coming, and the black line in between as the time series indicating the actual delay, [00:25:00] the end-t-end delay, the processing delay between data arrival and data being processed. And there is a SLA of three-minutes.

So if you can see the line, it trending up, and applying…there was a good forecasting technique that could be applied on this line, you can actually forecast and stay within a certain interval of time, maybe it’s a few hours, maybe even less than that. The rate at which this latency is trending up, my SLA can get missed. And it’s a great use case for having a good forecasting technique. So again, forecasting, another area that has been very well-studied. And there are tons of interesting techniques, from the time-series forecasting of the more statistical bend, there are techniques like ARIMA, which stands for autoregressive integrated moving average. And there’s lots of variance around that, which uses the trend and data differences between data elements and patterns you forecast with smoothing and taking in historic data into account, and all of that good stuff.

On the other [00:26:00] extreme, there has been a lot of like I’ve said, advances recently in using neural networks, because time-series data is one thing that is very, very easy to get, too. So there’s this long short-term memory, the LSCM, and recurrent neural networks, which have been pretty good at this, that we have actually had a lot of I would say success is with a technique that was originally it was something that Facebook released this open source called the Prophet Algorithm, which is not very different from the ARIMA and the older family of forecasting techniques. It defaults in some subtle ways.

The main idea here was what is called a generative additive model. I put in a very simple example here. So the idea is to model this time-series, whichever time-series you are picking as a combination of the trend in the time-series data and extract all the trend. The seasonality, maybe there is a yearly seasonality, maybe there’s monthly seasonality, weekly [00:27:00] seasonality, maybe even daily seasonality. This is a key thing. I’ve used the term shocky [SP]. So if you’re thinking about forecasting and ecommerce setting, like Black Friday or the Christmas days, these are actually times when the time-series will have a very different behavior.

So in Kafka or in the operational context, if you are rebooting, if you are installing a new patch or upgrading, this often end up shifting the patterns of the time-series, and they have to be explicitly modeled. Otherwise, the forecasting can go wrong, and then the rest is error. The recent Prophet has actually worked really well for us apart from the ones I mentioned. It fits quickly and all of that good stuff, but it is very, I would say, customizable, where your domain knowledge about the problem can be incorporated, instead of if it’s something that gives you a result, and then all that remains is parameter tuning.

So Prophet is something I’ll definitely ask all of you [00:28:00] to take a look. And defaults work relatively well, but forecasting is something where we have seen that. You can just trust the machine alone. It needs some guidance, it needs some data science, it needs some domain knowledge to be put along with the machine learning to actually get good forecast. So that’s the forecasting. So we saw outlier detection how that applies, we saw forecasting. Now let’s get into something even more interesting. Anomalies, detecting anomalies.

So a place where anomaly detection, and possible was an anomaly, you can think of an anomaly as unexpected change, something that if you were to expect it, then it’s not an anomaly, something unexpected that needs your attention. That’s what I’m gonna characterize as an anomaly. Where can it help? Actually smart alerts, alerts where you don’t have to configure threshold and all of that things, and worry about your workload changing or new upgrades happening, and all of that stuff, wouldn’t it be great if these anomalies can [00:29:00] be auto-detect it. But that’s also very challenging. By no means, it’s a trivial because if you’re not careful, then your smart alerts will turn out to be really dump. And you might get a lot of false alerts. And that way, you lose confidence, or it might miss a real problems that are happening.

So, I don’t know, detection is something which is pretty hard to get right in practice. And here’s one simple, but one very illustrative example. With Kafka you always see no lags. So here, what I’m plotting here is increasing lag. Is that really an expected one? Maybe there could be both in data arrival, and maybe these lags might have built up, like at some point of time every day, maybe it’s okay. When it is an anomaly that I really need to take a look at. So that’s when these anomaly detection techniques become very important.

Many [00:30:00] different schools have thought excess on how to build a good anomaly detection technique, including the ones I talked about earlier with outlier detection. One approach that has worked well in the past for us is, when you can have really modern anomalies as I’m forecasting something based on what I know. And if what I see the first one to forecasts, then that’s an anomaly. So you can pick your favorite forecasting technique, or the one that actually worked, ARIMA, or Prophet, or whatnot, use that technique to do the forecasting, and then deviations become interesting and anomalous.

Whatever sitting here is a simple example of the technique and for Prophet, or more common one that we have seen actually does work relatively well, this thing called STL. It stands for Seasonal Trend Decomposition using a smoothing function called LOWESS. So you have the time series, extract out and separate out the trend from it first. So without the trend, so that leaves [00:31:00] the time series without the trend and then extract out all the seasonalities. And once you have done that, whatever reminder or residual has called, even if you put some, like a threshold on that, it’s reasonably good. I wouldn’t say it’s perfect but reasonably good at extracting out these anomalies.

Next one, correlation analysis, getting even more complex. So once basically you have detect an anomaly. Or you have a problem that you want the root cause, why did it happen? What do you need to do to fix it. Here’s a great example. I saw my anomaly something shot up, maybe there’s the lag, that is actually building up, it’s definitely looks anomalous. And now what do you really want us…okay, maybe we address where the baseline is much higher. But can I root cause it? Can I pinpoint what is causing that? Is it just the fact that there was a burst of data or something else, maybe resource allocation issues, maybe some hot spotting and the [00:32:00] brokers?

And here, you can start to apply time-series correlation, which time series your lower level correlates best with the higher level time-series where your application latency increased? Challenge here is, there are hundreds, if not thousands of times-series if you look at Kafka, with every broker has so many kind of time-series it can give you from every level. And it quickly all of these adds up. So here, it’s a pretty hard problem. So if you just throw a time-series correlation techniques, even time-series which just have some trend in there they look correlated. So you have to be very careful.

The things to keep in mind are things like pick a good similarity function across time-series. For example using something like Euclidean, which is a straight up well-defined, well-understood distance function between points or between time series. We have had a lot of success with something called dynamic time warping, which is very good to deal with time-series, which might be slightly out of [00:33:00] sync. If you remember all the things that need mentioned, you just gone that Kafka world and streaming world in asynchronous, even world, you just see him all the time-series and nicely synchronized. So time warping is a good technique to extract distances in such a context.

But by far, the other part is just like you saw with Prophet. You have to really instead of just throwing some machine learning technique and praying that it works, you have to really try to understand the problem. And the way in which we have tried to break it down into something usable is, for a lot of these time-series, you can split the space into time-series that are related to the application performance, time-series are related to resource and contention, and then apply correlation within these bucket. So try to scope the problem.

Last technique, model learning. And this turns out to be the hardest. But if you have a good modeling technique, good model that can answer what-if questions, then things like what we saw in the previous talk with [00:34:00] all of those reassignment and whatnot. You can actually find out what’s the good reassignment and what impact will that have. Or, as Nate mentioned, this rebalanced-type, consumer times, timeouts, and rebalancing storms can actually kick in, “What’s the good timeout?”

So a lot of these places where you have to pick some threshold or some strategy, but having a model that can quickly do what-if avoid is better, or can even rank them can be very, very useful and powerful. And this is the thing that is needed for enabling action automatically. Here’s a quick example. There’s a lag. We can figure out where the problem actually is happening. But then wouldn’t be great if something is suggesting how to fix that? Increase the number of partitions from X to Y,. And that will fix the problem.

The modeling at the end of the day is function that you’re fitting to the data. And in modeling, I don’t have time to actually go into this in great detail because of time. You carefully pick the right [00:35:00] input features. And what is very, very important is to ensure that you have the right training data. For some problems, just collecting data from the production cluster is good trading data. Sometimes, it is not because you have only seen certain regions of the space. So with that, I’m actually going to wrap up.

So what we try to do in the talk is to really give you a sense, as much as we can do in a 40-minute talk of having all of these interesting Kafka DevOps challenges, so meaning application challenges, and how to map that to something where you can use machine learning or some elements of AI to make your life easier, at least guide you so that you’re not wasting a lot of time trying to look for a needle in a haystack. And with that, I’m going to wrap up. There is this area called AIOps, which is very interesting and trying to bring AI and ML with data to solve these DevOps challenges. We have a booth here. Please, drop by to see some of the techniques we have. And yes, if you’re interested in working on this interesting challenges, streaming, building [00:36:00] these applications, or applying techniques like this, we are hiring. Thank you.

Okay. Thanks, guys. So we actually have about three or four minutes for any question, mini Q&A.

Attendee 1:

I know nothing about machine learning. But I can see it helping me really help with debugging problems. As a noob, how can I get started?

Shivnath:

I can take that question. So the question was, so the whole point of this talk, and gets in that question, which is, again, on one hand, if you go to a Spark Summit, or Spark, you’ll see a lot of machine learning algorithms, and this and that, right? If we come to a Kafka Summit, then it’s all about Kafka and DevOps challenges. How do these worlds meet? That’s exactly the thing that we tried to cover in the talk. If you were listening looking at a lot of the techniques, they’re not fancy machine learning techniques. So our experience has been that once you understand the use case, then there are fairly good techniques from even statistics that can solve the problem reasonably well.

And once you have first startup experience, then look for better and better techniques. So hopefully, this talk gives you a sense of the techniques just get started with, and you get a better understanding of the problem and the [00:38:00] data, and you can actually improve and apply more of those deep learning techniques and whatnot. But most of the time, you don’t need that for these problems. Nate?

Nate:

No. Absolutely same thing.

Attendee 2:

Hi, Shiv, nice talk. Thank you. Quick question is for those machine learning algorithms, can they be applied cross domain? Or if you are moving from DevOps of Kafka to Spark streaming, for example, do you have to hand-picking those algorithms and tuning the parameters again?

Shivnath:

So the question is, how much of these algorithms just apply to something we’ve found for Kafka? Does it apply for Spark streaming? Would that apply for high performance Impala, or Redshift for that matter? So again, the real hard truth is that no one size fits all. So you can’t just have…I mentioned outlier detection, that might be a technique that can be applied to any [00:39:00] load imbalance problem. But then that’s start getting into anomalies or correlation, Some amount of expert knowledge about the system has to be combined with the machine learning technique to really get good results.

So again, it’s not as if it has to be all export rules, but some combination. So if you pick a financial space, a data scientist who exists, understands the domain, and knows how to work with data. Just like that even in the DevOps world, the biggest success will come if somebody understand receiver system, and has the knack of working with these algorithms reasonably well, and they can combine both of that. So something like a performance data scientist.

The post Using Machine Learning to understand Kafka runtime behavior appeared first on Unravel.

Unravel Data Solution Guide Kafka

Unravel Data — Tue, 20 Nov 2018 00:09:58 +0000

Thank you for your interest in the Unravel Data Solution Guide Kafka.

You can download it here.

The post Unravel Data Solution Guide Kafka appeared first on Unravel.