PurePerformance - Open Observability: The limits of the 3 pillars with Dotan Horovits
Episode Date: January 31, 2022

"Whether open source or commercial – just focusing on logs, traces and metrics is limiting our conversation and missing the point what observability really is!", says Dotan Horovits, Tech Evangelist at Logz.io, in his opening statement in this podcast. Listen in and learn more about why observability is not about collecting data. Observability is rather a data analytics problem, as it needs to give humans answers to DevOps, SRE and business questions. To learn more beyond what was discussed in this podcast, listen in to OpenObservability Talks, stay up to date on OpenTelemetry, or follow Dotan at @horovits.

Show Links
Dotan Horovits on LinkedIn: https://www.linkedin.com/in/horovits/
Open Observability Talks: https://openobservability.io/
OpenTelemetry Project: https://opentelemetry.io/
Dotan Horovits on Twitter: https://twitter.com/horovits
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance.
Unfortunately, the second time I think in a row, I don't have the fun introduction from Brian.
Actually, I think he probably will put a fun introduction to the whole thing.
But unfortunately, he couldn't make it today. So I have the honor again to be here just by myself.
Well, and obviously with our great guest today, a newcomer on the podcast. I always try not to mess up the pronunciation of the name: Dotan Horovits. I hope I said it correctly. Dotan, welcome to the show. Did I get your name correctly?
Yeah, it's Dotan Horovits, and yeah, glad to be here on the show. Thank you for inviting me, and I hope it will be entertaining enough to make up for Brian's absence.
Yeah, he's always pretty good with jokes and he always brings in a special mix to the show.
Yeah, let's see if you can live up to his standards.
Dotan, can you give the audience a little background about yourself, what you do, but not only what you do right now, but especially where you came from? I think that's also very interesting.
Yeah. So these days I'm a principal developer advocate at a company called Logz.io, which provides a cloud native observability platform. I come from an engineering background: many years as an engineer, then a systems architect, a solutions architect, consulting with customers about architecture, design, implementation, tuning, and so on.
I even had an episode as a product manager for developer platforms and cloud orchestration
and things like that.
And these days as a developer advocate, startups, enterprises,
so all the range of goodies there.
If I look at your LinkedIn profile, and by the way,
we will link to your LinkedIn in the podcast proceedings.
So if you're listening in, if you want to follow up with him,
just follow the links. I see you worked at GigaSpaces for a while.
Yeah, I was the solutions architect there. For those in the audience who don't know, GigaSpaces provides an in-memory data grid, sort of a distributed in-memory database, so to speak. And yeah, I was there and I consulted customers about how to build and architect distributed applications on top of that, and cool things like that.
Yeah, because maybe we have actually worked in the past together because I remember in
the early days of Dynatrace.
So I've been with Dynatrace since 2008.
I remember when I lived in Boston, I had quite some exchange with your colleagues back then in GigaSpaces,
instrumenting GigaSpaces with our AppMon product
of kind of previous generation distributed tracing.
And we were kind of, you know, instrumenting apps,
but then also instrumenting GigaSpaces itself
and kind of following traces end to end.
And this is why I looked at your profile and said,
hey, I need to bring this up at least. A small world. I think back then it wasn't even called distributed tracing as a
discipline. So it's a pre-incarnation of the current distributed tracing. But yeah, definitely
a challenge we've been encountering wherever it was in the distributed applications realm for
quite some time now.
Yeah. And in your current work, I mean, you work for Logz.io, you're a principal developer advocate, but you also have your own podcast, videocast.
Can you tell us a little bit about that?
So beyond my role, I'm also very passionate about open source and communities. I'm involved with the CNCF, the Cloud Native Computing Foundation. I'm a co-organizer of the local CNCF chapter here in Israel. And as part of reaching out to the community,
I also have my podcast called Open Observability Talks. So I invite all your listeners, if they
are interested in this topic, open source, DevOps, observability, maybe they'll find it interesting as well.
It's a monthly cadence.
And I get guests that are maintainers,
committers, end users,
all the perspectives around these topics.
Yeah, it's perfect.
I'll have you on one of the episodes soon.
So stay tuned.
Yeah, well, thank you.
I mean, you reached out initially to me, right,
in regards to getting on the show.
And then we said, let's do a show on both sides.
So that's great.
So that also kind of brings me now to the topic of today,
because we want to talk about the role of open source
for better observability.
I think observability is a hot topic.
There's many things going on.
And in preparation for today's recording, I sent you a link to a recent Twitter space.
I'm not sure if people are familiar with Twitter spaces, but it's a way on Twitter where people
can join in into a discussion.
That's typically a couple of speakers that discuss about it, but then people can bring
in their questions, but they can also join the conversation.
And it was a Twitter space on open telemetry.
And I thought what was really interesting
was the quote that initially triggered that Twitter space.
It was from Adolf Reitbauer.
And he said, he had a tweet that says,
I again had a call where I had to explain that just sending
some open telemetry data does not give you observability.
To be clear, I'm a big fan of open telemetry, but observability is, however, a much wider concept, because we need
to not only collect the data, we need to figure out how to collect data in context. What do we
do with the data? And I also agree with him on this thread. It's just one aspect of it.
But I would really love to hear it from you, especially as you're so engaged in the open
source space.
You've been helping the community to make sure that observability is thriving through
open source.
So what I would like to know is a little bit of like, where are we right now?
And where do you think we will go?
And what can we do in our podcast today to help people better understand what observability is,
what OpenTelemetry is today and where it goes
so we make sure that we really have something
that is truly delivering value to organizations.
So first of all, yeah, thanks for highlighting this discussion.
I think it brings a lot of very good points.
Let's start even before open source, with observability itself. I think the very basic point is that many people out there limit the discussion around observability to what is known as the three pillars of observability, namely metrics, logs, and traces. And while I think these signals are important, and maybe the formulation of the three pillars of observability as a term helped kickstart the conversation, I think it has now come to the point where it in fact limits the conversation. Because ultimately, and again, I find myself hearing the very same questions over and over again, and you see companies collecting logs, metrics, traces.
They're sure, they're confident that they have observability because they have all the signals.
And no, they don't have observability.
And it's disappointing because it's just a wrong setting of expectations.
So I would say maybe, you know, maybe even the definition that we use for observability,
the one that we took from control theory is the one to blame because it talks about
how does it go, a way to
track the state of our system
based on the signals it produces, right?
And it puts a lot of emphasis on the
signals.
But then the other piece of this definition, the inference piece, somehow gets, I guess, lost, or at least gets less focus.
And I think this is the critical part, actually.
Actually, someone said that definition, I don't remember who, but I like it very much. I use another definition of observability, which is simply the capability to allow a human to ask and answer questions about your system.
And the reason I like this definition much better is that it makes it very, very clear that observability is ultimately a data analytics problem. The more questions you can ask and answer about your
system, the better, more observability. So I use that definition because it makes it much clearer.
And it's not just semantics of, okay, Dotan, you're using one definition rather than the other.
I think it's fundamental to the way that we implement. It's in a way changing the mindset to thinking,
let's say, about more like BI analysts rather than, I don't know, reactive monitoring, sysadmin,
classical type of thing. So I definitely resonate with that.
So maybe I can try to put it into very simple terms. You're basically saying observability is not about how we are capturing data; the most important thing is what this data can help us do as a next step. If I'm bringing an example from our regular, let's say, physical world: if I'm in a car and my car tells me I'm driving 180 kilometers per hour in the city, and I don't know what to do with this information, that maybe I should slow down because otherwise I'll get a ticket or cause an accident, then the data doesn't do anything good for me, right? If I'm running out of gas soon and I still need to go 100 kilometers but only have gas for 10 kilometers, and I don't make a good decision, then obviously, what is this all good for?
Yeah.
Again, I don't want to make it sound as if the data is not important.
Just that we need to remember that the signals, metrics, logs, traces, first of all, these are not the only signals. We as humans like the number three, so, three pillars.
But, you know, my last episode on the show was about continuous profiling as an emerging signal.
And people are talking about events and formalizing them and others.
So first of all, it's not just three.
And secondly, this is the raw data.
We need the data.
We need the data structured.
We need the data in many ways, enriched.
We need the data.
But we need the data for a reason.
The data is only a means to an end.
Ultimately, we want to be able to understand our system. And the more we go into the current
architectures that are like cloud native, Kubernetes, microservices, and so on, and it
becomes much more dynamic and much more high cardinality and so on and so forth, the set of permutations that you need to address
is such that you can't foresee the questions.
So we can't pre-aggregate, we can't do pre-calculations,
we can't put assumptions.
We need to support the ad hoc questions
much more extensively than we used to in the past.
And that drives, I think this is the driver actually for observability in general.
And this is why I put so much emphasis on the data analytics type of things,
because we can't anticipate.
So we can't do all these preparations in advance.
Yeah.
But on the other side, I completely agree with you on this.
But on the other side, I think a good observability platform
or whatever you want to call it then
is also anticipating certain things
because as a human being,
I may not know all the questions I need to ask
because I know what I know, what I think I need to ask.
But I think as an observability platform,
it should be smart enough to make me aware of certain things
that I may not even ask because I don't know about it.
I think it goes a little beyond that, but I assume this is also what you mean.
No, definitely. I agree. I think vendors such as Dynatrace and Logz.io and many others bring a lot of experience from seeing so many customers. To create that, the buzzwords are around AI and machine learning and stuff like that. But ultimately it's about models that capture the aggregated knowledge of what the typical signals are and the correlation between signals, not just the telemetry signals themselves, but what abnormal behavior, let's say, we should be paying attention to.
But then again, if we do that on the collection side
or even the initial side,
and we don't even send part of this data,
then we won't be able to ask these questions ultimately,
or at least we won't be able to answer these questions
if I actually don't have the raw data.
If I use sampling that is not intelligent
and I just don't have the traces
or if my metric aggregation doesn't provide that,
Now suddenly I want to ask about the error rate across all my servers, and it was pre-aggregated by, I don't know, clusters or something else. And I can't do the P99 across everything; you know, you can't do a P99 of P99s, things like that. Then I've lost the data.
Yeah, no, I completely agree with you. Now,
then let me ask another question. Do you think then there's a misconception
still out there that we hopefully can address
if it's out there that just by, let's say,
looking into open telemetry
or in just looking into Prometheus
will solve all of your problems?
Because I think that comes back to the question
that Alois has raised, right?
It's just sending open telemetry data
is not really observability.
And you'll get the questions as well.
So what can we do?
What can we tell people what else
it means to really build a good observability platform?
It's not just collecting the raw data.
I think you mentioned if you collect the raw data,
you have to be very smart with, because it's a lot of data,
how you aggregate and what you aggregate, if you aggregate.
Because I think certain things need to be aggregated,
because otherwise it's just a sheer volume,
but you have to be very smart on what you aggregate.
And I guess you need to store the data somewhere.
It needs to be analyzed somewhere.
That means you need storage, you need resource,
you need compute.
It's not just for free. What else can we do?
What else do we, what other misconceptions are out there maybe?
So first of all, again, the conversations, if they don't go to signals, they go to tooling: okay, Prometheus this, or OpenTelemetry that. I want to put that aside; maybe we should spend some time later on OpenTelemetry as a platform, I'm quite involved there, so I'd be glad to share. But again, before a specific tool, we need to understand what exactly is the problem that we're trying to solve. And as you said, what are the best practices in doing that? And as I said, the mindset of: once you understand it's a data analytics problem, it
impacts the whole pipeline. It starts from collecting the data from different signals on different sources and being able to aggregate.
And by the way, across the different signals, and it's not just open telemetry, you see all
the industry heading in this direction. If we talk about the open source sphere, then you see Fluentd, which used to specialize in log collection, now expanding into collecting metrics. And you have Telegraf, which started from metrics, now expanding to logs and events. And Elastic, which had like a gazillion Beats, Filebeat, Metricbeat, Packetbeat, whatever, now they have one aggregated collector agent. So first of all, the aggregation of the different signals
is one pain that needs to be addressed,
especially, it's not just about one way of collecting,
it's also about one standardized way of representing the data.
And this goes to the way that you structure the data,
especially logs.
It's a nightmare seeing these plain-text, freeform logs, as if we as humans are going to sit down and read a gazillion lines of logs to understand them.
If you understand it's a data analytics problem,
and you understand that the machine is actually going to ingest that and parse that
and be able to derive data,
you immediately understand we need structured logs. We need to export them not as plain text,
but as, I don't know, JSON maybe. We need to maybe enrich the data with certain things such as,
I don't know, the trace ID to enable log-trace correlation. Maybe we need to use a consistent data model, because if any other piece of my stack
will call the service differently,
one will call it service,
one will call it service with a capital,
one will call it service name,
one will call it service underscore name,
how would I know it's actually the same entity?
So the data modeling is another important thing.
The data enrichment, as I said.
So all of these, and that's just the, let's say, the ingestion part
and the very beginning of the pipeline.
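To make that concrete, here is a minimal sketch, assuming the OpenTelemetry Java API and a hypothetical checkout service, of the kind of structured, trace-enriched JSON log line a machine-first pipeline wants instead of free-form text:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import java.time.Instant;

public final class StructuredLog {
    // Hypothetical helper: emits one JSON log line enriched with the active trace context,
    // so a backend can correlate this log with the corresponding distributed trace.
    static String logLine(String level, String message) {
        SpanContext ctx = Span.current().getSpanContext();
        return String.format(
            "{\"timestamp\":\"%s\",\"level\":\"%s\",\"service.name\":\"checkout\","
                + "\"trace_id\":\"%s\",\"span_id\":\"%s\",\"message\":\"%s\"}",
            Instant.now(), level, ctx.getTraceId(), ctx.getSpanId(), message);
    }

    public static void main(String[] args) {
        // Outside an active span this prints all-zero IDs, which is expected.
        System.out.println(logLine("INFO", "order accepted"));
    }
}
```

In practice a logging library and an OpenTelemetry-aware appender would produce this for you; the point is simply that the primary reader of the line is a machine, not a human.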
And, of course, again, going back to tooling, the tools address that.
So if you look at OpenTelemetry,
OpenTelemetry puts an effort into standardizing the payload, both in the transport, you know, using a protobuf format, also working, by the way, on JSON, and in formalizing the data model for traces, logs, metrics, and so on. So there's work in these projects that will help converge the industry.
But let's first understand the problem. The problem is that when we don't have that,
it's very, very difficult to understand that we are actually even talking about the same entity.
So that's one, I guess, piece of the puzzle.
Then if you go further down the line,
you talk about also querying and visualization and alerting.
Because again, you use one language,
I don't know, PromQL to query your time series data.
And you use maybe Lucene to query your log data.
But what if I want to ask a question across?
So all of these are, I guess, the challenges
that we as an industry face.
They don't necessarily have a boxed answer to everyone,
I have to say upfront.
We're still learning that as an industry, but just to say a few.
But then let me ask you something. If I hear you correctly, you said the important thing is that we don't have individual data silos, because each individual data silo, first of all, won't be able to decide how to best aggregate, because it doesn't have the holistic view. Also, each individual data silo is not able to answer questions across the different pillars, right? And there's more than three pillars, as we understand now.
So isn't that then, I mean, I guess this is exactly why
all the vendors, it seems, are moving and expanding
into all regions, right?
Somebody starts with logs and goes into metrics and traces.
Somebody starts with metrics and now goes into other areas.
So if this is kind of the ultimate direction
that we need to cover everything,
what does this again mean for open source? Does this mean, and especially the do-it-yourself,
I think I see a lot of organizations that just pick some open source logging framework here,
some open source tracing here, some open source metrics there. Does this then mean that these organizations that do it themselves and use these individual pillars still need to solve the overall problem of really combining all that data and then putting an analytics engine on top of it that understands everything?
And then if everybody's doing this, aren't we then again duplicating a lot of work?
Because every organization then has to set up a team that fully understands the problem that they need to solve.
It sounds very strange to me.
You're perfectly right. Actually, my own company, Logz.io, we were debating exactly the same thing, and this is why we said, on the one hand, we identified these very, very popular open source projects out there that people love to use. But then again, the challenge is that each one is a distinct silo, and how can we as a vendor help people use best-of-breed open source, but still with interaction between them. So we offer a suite that combines, let's say, the ELK stack alongside Jaeger for tracing, alongside Prometheus for metrics, and then overlay it with features to correlate.
But putting Logz.io aside, generally this is a challenge for the entire industry.
I think there are very important moves in the open source sphere in that direction.
So let's divide the observability pipeline, let's say, into its different parts. If we look at the ingestion part, or even the very basic instrumentation, as I said, you see open source projects that before used to specialize in specific signals, like Telegraf, like Fluentd that I mentioned before, that are expanding. So you see that these open source projects and the communities behind them realize that they need to cover more, otherwise they become less relevant, or it becomes difficult for the users to use them disconnected from the rest of the stack.
Then there's open telemetry.
And maybe let's talk a bit about open telemetry, because I think this is...
It doesn't address the full pipeline.
It addresses only the telemetry generation
and the telemetry collection side of things.
But still, it looks at it as an aggregate
or as a holistic platform.
So one specification for the APIs and SDKs
for generating logs, metrics, traces,
one standardized collector for collecting the signals
and exporting to whichever backend you'd like.
And by having a unified platform
and also a protocol for transmitting it,
OTLP, OpenTelemetry Protocol,
that again is one way to represent the data model
as we said before,
and one way to transmit it standardized.
It could be the transmission between the SDK
and the client library and the collector.
It could be between the collector and the backend.
It's just an agnostic, general-purpose telemetry transport protocol.
So by having that under one project with one holistic view
of all the signals together, that is a very,
very important step in, as you said, breaking out of these silos, at least on the side, as we said,
of the generation and the collection. So imagine again, we're not there yet, but imagine that,
you know, you have a Java backend and, I don't know, a Node.js frontend application.
And with open telemetry, unlike what we used to have in the past,
you'll have one API and one SDK for Java and one for Node.js,
but they are under the same specification.
And that's it.
You don't need the many, many libraries that we used to have
in order to instrument different pieces of the puzzle.
So that's the vision, at least.
We're not there yet, but the realization is there,
not just with the vendors, but also with the open source community.
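To illustrate the one-SDK, one-protocol idea in code, here is a minimal sketch assuming the OpenTelemetry Java SDK with the OTLP exporter; the endpoint is just a placeholder for whichever collector or OTLP-capable backend you choose:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class OtlpSetup {
    static OpenTelemetry init() {
        // Export spans over OTLP/gRPC; the same protocol works toward a collector
        // or directly toward any backend that understands OTLP.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317") // placeholder collector endpoint
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .build();
    }
}
```

Swapping the backend then becomes a configuration change rather than re-instrumenting with a different vendor library.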
I mean, as you mentioned, right, your organization is obviously betting on OpenTelemetry, as many others do.
My organization, Dynatrace, we do the same thing, right?
We're obviously understanding that this is a major step forward
in making it easier, especially covering other technologies under one umbrella. Because we've been, I mean, I've been with Dynatrace
for 14 years and we've been doing automated instrumentation for even longer because the
company was founded in 2005. And this is now another question I want to get to. And I think
you have an update for me because I'm not as deep into OpenTelemetry as you
are.
So please give me an update here.
I always thought, at least, and I think this is still true for many of the technologies
and SDKs already available, but I always thought that OpenTelemetry means developers need to
manually instrument their code.
That's what I thought.
And manual instrumentation means a lot of additional work.
Now I know there is some auto-instrumentation already going on.
Can you just give me, as a novice,
can you give an update on what the status of automated
instrumentation is, and also how it works
and how I can use automated instrumentation?
Just would be interested in for which technologies it's
currently supported, if you know.
Yeah, so maybe just for the audience that is not familiar,
the idea is that, again, OpenTelemetry,
we think about this one project.
It's actually a mega project.
It's the second most active project now in the CNCF
after Kubernetes.
And so it's essentially many, many projects under it.
So it's very important to say
because people often treat it as one aggregate
and different projects are in different states of maturity and have different focus areas.
So it's very important to say that upfront.
Now, on the telemetry generation side, or the instrumentation side, there is, as I said before, the specification for the API, the SDK, and the data model, which is cross-language; those are the cross-language requirements. And then each group per programming language develops its own reference implementation, if you'd like, of that API and SDK in that specific language.
And the maintainers and contributors there look at what facilities each language has to offer: if it's a bytecode-based language, if it's just-in-time compilation, if it's, you know, Go, which is very, very explicit and doesn't allow any hooks; each language with its own tricks and shticks, as we say. And then they find the best way to do that.
The range is from the manual instrumentation that you mentioned, which is just like we used to do with logs: the developer needs to explicitly open a span at the start of a section and end the span at the end of it, and say, I want this to be a defined span. It's very, very explicit, but it dirties the code, so to speak, mixing instrumentation with the business logic, and it requires all the developers to know the stuff around the instrumentation and so on.
As you said, this is the most advanced usage. The other end of the spectrum is that each programming language group works on auto-instrumentation agents. And the agent, as I said, could be based on bytecode injection, or it could be other ways of hooking in, so that it's codeless. And there's everything in between.
So there are all sorts of language-specific integrations with popular web frameworks, storage clients, RPC libraries, and so on, that make it possible to automatically capture relevant traces and metrics and handle the context propagation for these libraries.
So, for example, if we work with Java on Spring, you have the integration with Spring. If we use Node.js with Hapi and Express, then we have the integrations that we use there.
So you have the full range from the fully manual
to the fully automatic.
And my recommendation when I guide my customers and users and the community members is to leverage auto-instrumentation as much as they can to get a baseline, and in many languages and SDKs it's pretty advanced, which is nice. But then oftentimes you'll find yourself still needing to augment with manual instrumentation, because, I don't know, you have some sort of very sophisticated calculation or algorithm that you want to measure specifically. It's not the full function. It's just a piece of code that only you, knowing your source code, know is something you want a specific measurement on, or things like that.
So that's how I view it, or I think this is how we view it as OpenTelemetry. And this is why we definitely put a lot of emphasis on automatic instrumentation. This is a known problem: it is a barrier to entry for many if they need to resort to only manual instrumentation.
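As a rough illustration of augmenting auto-instrumentation with a manual span, here is a minimal sketch using the OpenTelemetry Java API; the tracer name and the wrapped calculation are hypothetical:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class PricingService {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.pricing"); // hypothetical instrumentation name

    double quote(long itemCount) {
        // Manually wrap only the specific calculation the auto-agent cannot know about.
        Span span = tracer.spanBuilder("calculate-quote").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("items.count", itemCount);
            return expensiveCalculation(itemCount);
        } finally {
            span.end(); // always end the span, even if the calculation throws
        }
    }

    private double expensiveCalculation(long itemCount) {
        return itemCount * 9.99; // stand-in for the real algorithm
    }
}
```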
I agree.
Just out of curiosity, I guess, for a language like Java that you know well: do you have a set of rules, like a rule base of what you're instrumenting, or is this just hard-coded in the agent? Or is there any way that I, as a user, can kind of define the rule set that you then instrument?
How does this work in terms of auto-instrumentation?
You have configuration of the agent, if that's your question.
I don't have it off the top of my head to say exactly
which features you can configure and which not.
But remember that one of the advantages of having it as open source is that you can actually take the project, and if you need some very advanced tweaking of the agent, you can actually fork it off the main one, and hopefully contribute it upstream so that the rest of the community will benefit from that.
So very advanced users might go into the agent's bytecode.
I am sure that people like Dynatrace probably have a lot to contribute.
And that's a very important thing to mention.
The architecture is very modular.
So what I talked about is the extreme case of having to open up the source code.
The pluggability of the architecture, by the way, both on the SDK side and on the collector side, is such that you can actually inject, so to speak, your own pieces.
For example, you can put your own exporter from the SDK and during the export phase, you can do
some sort of logic that you apply that can do additional, I don't know, filtering or sampling or something
like that on the SDK side. And then on the collector side, maybe again for the audience that doesn't know: you have the SDK, let's say, in your application, and then you have a collector that collects from the SDKs and also from the infrastructure. So, you know, you have Kafka, you have Redis, you have Mongo or MySQL, whatever, and it's collecting all of that telemetry. And there you also have a data pipeline. So you have receivers, supporting many receivers actually, in many protocols.
And then you can plug in processors that can manipulate the data.
They can do filtering, batching, sampling, whatever.
And there again, you can use the pluggable APIs to put in your own,
define your own processors and plug them
into the process.
So I would advise first to go with the built-in hooks within the SDK or the collector, depending on where the right point in the process is to inject. And only very, very advanced users, such as maybe yourself, will maybe resort to even further tuning the actual underlying framework.
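As one concrete example of that pluggability, here is a minimal sketch, assuming the OpenTelemetry Java SDK, of a custom exporter that filters spans before handing them to the real exporter; the health-check rule is only illustrative:

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;

import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical wrapper that drops health-check spans before delegating to a real exporter.
final class FilteringSpanExporter implements SpanExporter {
    private final SpanExporter delegate;

    FilteringSpanExporter(SpanExporter delegate) {
        this.delegate = delegate;
    }

    @Override
    public CompletableResultCode export(Collection<SpanData> spans) {
        Collection<SpanData> kept = spans.stream()
            .filter(span -> !span.getName().startsWith("GET /health"))
            .collect(Collectors.toList());
        return delegate.export(kept);
    }

    @Override
    public CompletableResultCode flush() {
        return delegate.flush();
    }

    @Override
    public CompletableResultCode shutdown() {
        return delegate.shutdown();
    }
}
```

You would register it in place of the plain exporter, for example BatchSpanProcessor.builder(new FilteringSpanExporter(otlpExporter)).build(); the collector offers the analogous extension point with custom processors.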
Cool.
I got to ask a couple of other questions because I've been doing distributed tracing for the
last 15 years since I'm with the company, or 14 years.
And I know we had always challenges.
First of all, there was always questions from our users.
In auto instrumentation, what do you really do?
What's the overhead going to be?
How do I know you're not capturing data
that you're not supposed to capture?
And obviously, every vendor in that space, whether it's us, whether it's, I don't know, back in the day, Wily, AppDynamics, New Relic, Datadog.
And I think we all had the same challenges
where we got asked the same questions.
We always had to defend and prove
that we are collecting the right data and not the wrong data
and that we don't have a lot of overhead.
But at least for our customers, they had to come to one entity.
Now it seems with OpenTelemetry, I
have for every single technology,
I have different teams responsible for it
with different status of maturity.
And I guess if I would now,
if I'm an enterprise
and I have five major technologies
and I'm betting on open telemetry,
that means I need to go
to five different stakeholders
and basically ask the same questions.
Overhead, what do you instrument?
How does it work?
Or is there also something
where I can go to one
entity and get these
questions addressed? Because I think, especially as an
enterprise, I would probably
like to have one entity to go
to and not play with five
different parties.
Yeah, that's a challenge in the industry. I think there are some advantages of having an open source mentality or an open source project, because for many of the questions that were very prominent with black-box, closed-source things, you can just say it's all out there in plain sight. You can see it. So you can see that we're not stealing any information, you know, sending it to the backend to gather some information about you.
So that's on this side of things. On the other hand, you're right that, you know,
a monolithic or one-stop solution like all the vendors that you mentioned used to provide had the convenience, the amenity of having, okay, I have one agent to rule them all, I know that they're fine-tuning it and they're doing the thing, and I have my support, and that's it. That's a trade-off. So when you go with do-it-yourself and when you go with a platform, it means that you prefer to put aside some of the convenience for the flexibility of defining your own.
You can plug in, again, what I said to you I also say to customers, you can plug in your own processing logic. You can enhance it, you can tune it to your specific organization's workload types, data modeling, schemas, and so on and so forth.
So you have this flexibility, obviously.
I want to make sure that people understand it's not either OpenTelemetry do-it-yourself or going with a closed-source vendor monolith. There is something in between, which we as the OpenTelemetry project encourage, which is vendors participating, not just in the sense of contributing upstream, but also, and this is the pluggability that I talked about before, where vendors can create differentiation and value-add. So vendors are more than welcome to take OpenTelemetry, as you said you at Dynatrace do, as we at Logz.io do, and others, and actually wrap it, add logic on top of that, or maybe add managed services or services of other types, to provide enterprises that prefer the simplicity over the flexibility with one vendor or one commercial entity that will assume ownership, help them tune, help them configure, escort them, professional services, support, and everything around it.
So OpenTelemetry does not exclude the vendors. From my perspective, it actually enables them, because some of the questions that you used to face, you don't need to face now. You say, I'm based on OpenTelemetry.
I'm just giving you the amenities on top of that.
Yeah, and I think, you know, I took some notes earlier
when you said, I think it's the chance
for all the vendors to say,
we are not only doing data collection, which is where
OpenTelemetry comes in, but we're really providing what is really a true observability platform.
We are giving you the answers to the questions that you have, and the answers come from the
data that we collect.
And now we have a new way of collecting the data in a better, hopefully better way than
in the past.
Exactly.
Yeah.
Really great.
Dotan, is there anything else that you want to make sure our listeners take away from this discussion?
I think we talked a lot. First of all, I really like the way you said how you see observability, that there's more than three pillars, and that observability is really about being able to ask pressing questions and get answers to them.
I think that's a really great definition.
Talked a lot about open telemetry.
What else do we miss in this conversation?
So maybe just to give a very brief overview of where OpenTelemetry currently stands, because let's say that we convinced people that it's interesting for them to look into it. But as we said, because it's not one monolithic project, but rather many sub-projects, it's important to say the vision is what I said, but where it currently stands is that the tracing signal is what in the CNCF is called stable.
And stable is the equivalent of GA, generally available.
So if you're looking for, if you're now starting a project and you're looking for distributed tracing,
I would highly recommend looking at open telemetry as a mature production ready way of doing that.
Metrics is very soon to have this GA. We were hoping to have it by the end of this year. It will probably spill over to the beginning of 2022. But maybe by the time this podcast goes live, it will already be announced.
So it's really there. The API is already stable, the SDK is in feature freeze and soon to be stable, and the collector is nearly there. So it's really, really there.
And one important thing is a great collaboration
with the Prometheus working group
to get the collector to support Prometheus.
That's one of the advantages of the open source community, having both under the CNCF. So again, for metrics, I would also highly recommend looking into that if you're now starting a project.
And logs is unfortunately still behind.
Still behind.
And I guess the focus there is less about formalizing a new API and SDKs, because we do realize that most, if not all, customers already have some logging frameworks there. So the first focus is to get the integration with existing logging systems, ingesting from existing log appenders, and only later on formalizing a new API and things like that. There was the Stanza project that was contributed to the OpenTelemetry Collector, which brought lots of log processing in many data formats and pushed the collector forward, if you're familiar with Stanza.
So this is where logging stands.
So less ready for production, but also looking promising.
Just to give a very, very high-level and brief overview of where that stands.
Yeah, perfect. I've got to ask two more questions. The first one is,
you know, it feels like we've been shifting a lot of responsibilities to developers, because now we're asking them to know more about what they need to instrument, what data to collect, and what questions to ask.
Just a very brief answer from you.
Do you think we are pushing too much on developers?
Because we've also, the whole shifting left,
meaning testing earlier, doing more things earlier,
asking especially developers to do more with less time.
Do you think we're asking too much?
Or do you think it's just a natural kind of wave
where this is just the, with every kind of evolution
of a new, let's say, paradigm, in the beginning,
you have a lot of work that needs to be done,
especially by the technical folks like the developers.
But then as we mature, it will get easier
because we kind of then standardize it, and it kind of becomes a commodity.
I think it puts a lot more responsibility and awareness on developers.
I definitely agree. Developers now need to be fully aware of how the application behaves in production as well, and of non-functional requirements around performance and things like that.
On the other hand, I think at least the developers that I work with in my current company and customer companies and users and others,
I think they like it in the sense that before they felt disconnected
from how it's being used.
And I think the developer, ultimately, as the father and the mother, the parent of this piece of functionality, likes seeing how it's being put to use and how it can be implemented better for better use.
I think that many of the time-consuming tasks
will become much, much easier
the more we get the dev tools that are easier, better
user experience, maybe AI behind the scenes that will help provide more effective insights
and less having to dig into raw data and find it yourself.
So it will become much easier.
That's part of our job as the observability vendors: to make their lives easier. As you said, with auto-instrumentation and in every piece of the journey, we should try to make it easier for them, to enable their focus on the main core business and writing the business logic.
Now another, and my last question,
and I think, I don't know where I saw it.
It was either a tweet from you or it was a LinkedIn posting.
But earlier you mentioned that observability is not only three pillars, but many other pillars as well. Yet you were posting or highlighting a presentation that somebody gave where they said the only thing that they need is distributed traces. It was a presentation, I think, from Tel Aviv or from Israel, from somebody. And I thought that was kind of interesting, because here we are and we talk
about metrics, logs, traces, all types of events, end user, everything. And then I see your post
and say, hey, the only thing we need is to do the traces. How does that work together?
Well, yeah, I think my tweet was misunderstood. I said that it was interesting for me to hear. It was a talk by a very young startup called Velocity, based in Tel Aviv. I saw them at the conference, and their engineer, or head of engineering, who was talking there said that they specifically decided to go all in with distributed tracing and not do logging. Or let's say the logging is part of the
span payload. I found it very interesting just because actually it's a very rare decision to
make these days. Not many organizations can do that. And yeah, it started some discussion there.
Some people were asking me,
so how can that be?
It doesn't fit all organizations.
And I agree.
I don't think that every organization
can work that way.
Just to give a very simple example that came up in that discussion: if you use sampling, which is a very, very common practice in tracing, and you only send 1% or 0.1% of your span data to the backend, which is enough for performance and latency types of use cases, which is the classic use case for tracing; but then again, if you base your logs on the spans and you drop these spans because of the sampling policy, if you now have an error, you don't have these logs. And by the way, I was asking Velocity that same question in their
case, and they said they're using 100% sampling in their case.
They have the luxury of not dropping anything and sending everything to the backend.
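For context, the 1% scenario described here corresponds to a head-based, ratio sampler; a minimal sketch with the OpenTelemetry Java SDK, where the ratio is the only meaningful knob:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class SamplingConfig {
    static SdkTracerProvider headSampledProvider() {
        // Keep roughly 1 in 100 traces; spans (and any log-like data carried on them)
        // for the other 99% never reach the backend.
        return SdkTracerProvider.builder()
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.01)))
            .build();
    }
}
```

Velocity's 100% sampling would be the same configuration with a ratio of 1.0, which is why they can afford to carry their logging in span payloads.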
So again, not every organization works under the same workload situations.
So I don't think that we're there as an industry.
However, and that's important to say,
I do hear more and more people talking about shifting some of the information that they used to deliver via plain logs into the payload of the traces. So it is something that people are now debating. It's no longer a clear-cut, okay, I just throw it in the log line. Let's think for a moment whether this piece of information is actually better served in a span, where it's under the context of that specific request, and so on and so forth.
And even if not, my best practice, again, or my recommendation, would be to enrich the logs with a trace ID so that I can anyway make the correlation from the log to the trace, to do this observability that we talked about before. It was an interesting talk. I hope that it comes out as a recording, and I recommend everyone to check it out.
I'll try to put the link to the tweet in the proceedings. Hey, Dotan, thank you so much for this talk. For me, it's always interesting to learn from my guests
because my guests are always experts in a particular topic,
and you're clearly much more versed in observability
when it comes to open source,
when it comes especially to open telemetry.
So thank you so much.
Also, thank you for allowing me to ask
maybe the one or the other stupid question
or coming up with a strange idea. But I think I'm just trying to learn, right? And people that learn sometimes ask questions where others say, ah, why does he ask this question? But, you know, I just want to know. So thank you so much. What's the best way to get a hold of you in case people want to follow up? I know LinkedIn.
What's your Twitter handle?
So my Twitter handle is Horovits, H-O-R-O-V-I-T-S.
And in fact, it's my name everywhere.
So, you know, Medium and WordPress
and LinkedIn and GitHub and everywhere.
Probably if you just search for Horovits, H-O-R-O-V-I-T-S, you'll probably find me.
And yeah, just reach out to me.
Any feedback on this chat, on the links that I'll post, that we'll post here on the episode,
anything else.
I'm more than curious to hear feedback and engaging conversation following this.
So that will be bidirectional.
And by the way, your questions were excellent, not stupid at all.
And these are the right questions that we should be asking ourselves as a community.
And also, listeners, make sure that you are watching his Open Observability Talks. Really great guests, great conversations. Check out the work that he's doing for Logz.io.
We're all in the same space and it's an exciting space. And it's a space that is evolving and in
the end making lives of our users easier, hopefully,
because then they can get answers to the questions that they have,
spanning everything from logs to traces to events to end-user data,
whatever it is, right?
I think I like that.
Yeah, amazing.
Thank you very much for inviting me and for the interesting chat
and looking forward to having you in the open observability talks as well
to carry on the chats about some of your open source work.
And Brian, very sorry that you couldn't be with us.
It would have been great to have you as well, but pretty sure we'll have him back and then
Brian will be back as well.
Okay.
Thank you.
Bye bye.
Bye bye.