PurePerformance - Unlocking the Power of OpenTelemetry: Insights from an OTel Expert at NWM

Episode Date: May 8, 2023

36 million generated OpenTelemetry spans per hour for GraphQL based queries – that's just one of the stats we discussed with Justin Scherer, Sr Developer and Consultant, who is leading OTel adoption and Shift-Left observability efforts at NWM. For Justin, OpenTelemetry helps commoditize data gathering in modern cloud native environments so that the backend observability platform of choice can focus on answering higher level business impacting questions. If you are about to roll out OpenTelemetry in your organization then take the advice from Justin such as: Bringing Business Leaders early into the discussion! Engage with the OpenTelemetry community! Understand what your Observability Platform already gives you and focus on the gaps! To learn more about OpenTelemetry check out some of the links we discussed during the podcast:

OpenTelemetry Website: https://opentelemetry.io/
IsItObservable: https://isitobservable.io/open-telemetry
Podcast: https://www.spreaker.com/user/pureperformance/adopting-open-observability-across-your-
LinkedIn Profile: https://www.linkedin.com/in/justin-scherer-198126160/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance. As you probably noticed, this is not the voice of Brian Wilson. It's the voice of Andy Grabner. Brian is not here today. He is, well, stuck. I don't think stuck is the right word, but he is actually in Florida enjoying Dynatrace sales kickoff, getting to learn everything that's new in our world so that he can sell better Dynatrace.
Starting point is 00:00:46 But we have a great episode today because we have a great guest today that covers an important topic for all of you. And the topic today is actually shift-left observability. We will talk a lot about OpenTelemetry. I just came back from KubeCon Amsterdam, which was fantastic. Learned a lot about what the latest trends are. Really seeing a big, big boost of OpenTelemetry,
Starting point is 00:01:14 even though it seems like booming for quite a while now. But I invited Justin Scher to the podcast today. Justin, thank you so much for being on the podcast. I know you've been adopting OpenTelemetry in your organization and you're a big advocate of shifting left observability data. Now, I don't want to just talk on my end. So first of all, Justin, could you quickly introduce yourself, who you are and what you do in your current role? And then I want to dive into the topic because I want to learn from you some of the adoption reasons, adoption challenges,
Starting point is 00:01:49 and best practices we can learn from you. Yeah. Hi. Yeah, I'm Justin. I work for Northwestern Mutual. For those that don't know, it's a financial, basically a tool, a financial suite with our financial advisors to help do planning for our customers and figuring out the right financial instruments for them to invest in, which means that there is a lot of data and a lot of various functionality in our system to help our financial advisors, best way of saying it, advise our customers.
Starting point is 00:02:33 I'm a developer on what's known as our illustration system. So when you get that nice fancy printout of all of the various numbers and various things that make up a policy. That's what we work on. And specifically, I'm a dev on our backend system, but I also kind of help on the front end and I'm on performance, so a lot of various aspects of our system to try and help not only deliver features for the business, but also make sure that we're up and running and trying to keep our 99.69% uptime. If you explain this, it almost sounds like you have a lot of heads to cover or heads to wear.
Starting point is 00:03:28 If you are also responsible or helping at least with keeping the systems up and running, isn't that also like an SRE function that you then have or do you support SREs? How does it work? Yeah, so I'm not like officially on our SRE team. We have kind of our enterprise SRE, but in terms of our illustration system, I'm very much I have this kind of off on the side group, which is I'm always looking at our Kubernetes systems, looking at how we hook into our cloud provider and making sure that that system is not only right size, but also that if there is downtime in some capacity, we're figuring out right away. So yeah, kind of really what would be considered SRE.
Starting point is 00:04:13 Hey, and Justin, so one of the reasons why I wanted to get on the podcast, because you have been walking down the path that many in the industry are currently walking down, meaning Kubernetes seems to be becoming the standard, obviously, core platform for the platforms that we are building. It's a complex system. It gives us a lot of opportunity, but it's a complex system and complex systems even need more observability.
Starting point is 00:04:38 Now, OpenTelemetry is the most successful project, I think, these days in the CNCF ecosystem. So for those people that don't know OpenTelemetry, it's an open standard that actually defines how observability platforms can consume data or can observe metrics, logs, and traces. And so, Justin, what I would like to hear from you, why OpenTelemetry for you on Kubernetes? What problems does it solve for you?
Starting point is 00:05:10 Because, you know, obviously, I represent one of the vendors and there's many other vendors out there. We've been doing observability for many, many years. Yet, there's people like you and people that I spoke to last week at KubeCon, and they all want to dig into OpenTelemetry. Why is this? What problem does OpenTelemetry solve for you? So OpenTelemetry, it solves what I would say kind of this kind of for us, it's really a two or three prong approach. And one, as when you're an enterprise or even you're just a developer, you don't necessarily want to be tied to a vendor, right?
Starting point is 00:05:50 One of the things that you always want to try and maintain is this kind of like this agnostic approach, if you can, to at least pulling data. And OpenTelemetry gives that by being an open standard and almost every vendor out there in terms of observability, they support open telemetry data. And so if we can remain agnostic and if, let's say, an observability platform is just not meeting the needs of that enterprise or that company, it makes it just a little bit easier to try and shift. And I think that's one thing I kind of mentioned to people I've talked to about OpenTelemetry is that what used to be, let's say, even just a decade ago, was the highlight for observability platforms was this idea of, oh, we can now ingest traces, or we can ingest metrics or stuff like that. now that should be seen as the mundane and that should be seen as this base level.
Starting point is 00:06:47 And that's what OpenTelemetry kind of gives you, this base level to where now we're trying to elevate observability platforms to say, hey, you don't need to worry about X problem anymore. I now need you to worry about giving me context to this data. Or I want you to somehow figure out how does this trace or this piece of data relate to this piece of data, or how can I query on it? Things like that. And I think that's the first prong of open telemetry is really starting to, let's get rid of people worrying about the mundane and let's get us now starting to work on advanced problems. I will say the second prong, the major prong for us at least,
Starting point is 00:07:28 was no matter what observability platform is out there, they're not going to be able to keep up with technologies. I mean, just to give the example, we do use GraphQL. And for those that don't know, GraphQL is a different way of querying data. Some people have called it the new REST. Really, what it does is allow consumers to query only the pieces of data they want. But because of this kind of new approach, you have this single URL, a single HTTP verb, with every single piece of data that you can imagine stuck in this post.
Starting point is 00:08:10 And because of that, we've been so used to REST for so long that most observability platforms don't understand that piece of it. Even when you get into errors, errors is handled. It's a 200, an HTTP status code of 200, and the error is inside the body. And so open telemetry allows this kind of neutral approach to say, hey, we get it. There's brand new technology coming out every single week. Let us handle these new tech, be it TRPC, be it graphql be it whatever springs up and we can fill in those blanks and that's really what we've been able to do is open telemetry has allowed us to fill in all the blanks that an observability platform i just i can't imagine one being able
Starting point is 00:09:01 to support absolutely every piece of technology that comes out. Yeah, and I think you're obviously here representing one of the vendors, I completely agree with you. While we have built agents and still building agents to cover a big technology breadth, we are really happy about OpenTelemetry because it also makes our life easier because it actually pushes a lot of the reverse engineering we used to do over the years to the vendors of frameworks, of runtimes, or the developers of custom code. They know best what it is that we want. Especially the GraphQL, this is really fascinating.
Starting point is 00:09:36 I didn't know that GraphQL works like this, that basically everything that comes back is an HTTP 200 call, as long as the call is obviously successful. But then in the body, you have the information about did this query actually return any data? Was there, I guess, I don't know, a mistake in the query language that you used or in the query parameters?
Starting point is 00:09:54 And with that, obviously, it makes sense that you are then using OpenTelemetry to get additional context out of that individual transaction. Exactly. Hey, Justin, one more word on GraphQL context out of that individual transaction. Exactly. Yeah. Hey, Justin, one more word on GraphQL, because I'm looking here at a presentation that you did at Dynatrace perform. And if I'm just quoting you, you said one trace can be over 1000 spans big.
Starting point is 00:10:20 So that's a lot of depth of information. Also, you are, it seems one of your GraphQL entry points gets upwards of 60,000 requests per hour. So you also have quite some load on the system. Traces move between 10 different microservices. So this all really shows us that we are really truly living in a distributed world where distributed tracing that OpenTelemetry provides is important. Do you have any other things, especially for folks listening in and using GraphQL that
Starting point is 00:10:53 were kind of surprising for you or it was important for you to kind of pass on? Yeah, so one of the big things that you can do with GraphQL is this idea of creating, there's kind of these two competing viewpoints of how to potentially create what's called a supergraph, which is essentially just delegating to separate GraphQL services. And for your consumer to just see this one entry point, most people can think of this like an API gateway. There's kind of two competing views, and one's called stitching, and another's called, it's a gateway, Apollo gateway. And really,
Starting point is 00:11:36 this is another area that can be a major problem, because what to a consumer looks like an X query could actually be five queries under the hood, and it's all separating out. And so when you get potentially issues that crop up with the consumer and they give feedback and say, hey, X query isn't necessarily working. And if you don't own that piece of the code, you're going to be like, I don't get what you're saying to me right now. So with something like open telemetry or giving you this observability piece, you can now see, oh, I can look at that query, see it's coming to my service,
Starting point is 00:12:15 and I'm potentially the piece that could be broken, and I can now see that. So when you're getting support calls or service calls, it's just a lot easier when you have that piece added in, because otherwise you're going to get the kind of deer in the headlights look of, cool, that's a problem. You know what, this really reminds me, and I'm pretty sure if Brian would be on the call now, he would jump in and say, hey, the problem that it's just explaining, like one request is coming in on the front end and then it's splitting up into multiple requests to the backend. It reminds me a lot of the N plus one pivot problem we've been talking about for so many years
Starting point is 00:12:52 and as a pattern where typically from a backend component to database, making a lot of round trips to the backend database to fetch more data. It seems the same is happening here with GraphQL. And it can turn in, M plus one is actually, it's a major issue in GraphQL. And what can be, it can be exponentially worse across the board because M plus one is always usually a very, it's a very microscopic because you kind of look at the microservice or the gateway and see that
Starting point is 00:13:24 it's doing M plus one. But the problem is if in your stack, you're using a technology that can be an M plus one problem, it just keeps compounding it down. And exactly seeing those issues, uh, without your observability, you probably won't notice that there's an M plus one problem until it's too late. Yeah. And I think this really reminds me, you know, back in the days I started with Dimensions 15 years ago. And I remember in the very early days of my career in observability, we looked at the Hibernate framework,
Starting point is 00:13:58 but Hibernate, I'm not sure if that rings a bell for you or for some other folks that are listening, right? But it's a very popular framework for data access, basically an OR mapper. And it was eye-opening for many developers to see
Starting point is 00:14:12 a distributed trace showing that accessing an object was all of a sudden executing hundreds of thousands of database statements because every single referenced object, like a list or so, was patched individually. It seems now with GraphQL,
Starting point is 00:14:28 it's a different technology, but it's the same problem because you're making it very easy for the consumer to do something, and then you're using GraphQL, and then you don't know what GraphQL is really doing. In the end, it gets you your data, but I think you really need to understand and look at traces to
Starting point is 00:14:43 figure out, is the transaction that I'm triggering efficient or not efficient? And I think this comes back to your point of shifting left. I think enabling developers to see what is actually happening when they're executing these queries is very important. Yeah. So funny story. You mentioned Hibernate. At a previous company, I actually had to go through some of our Hibernate pieces because they were doing exactly what you said. And to give this kind of like, here was the different approach of we didn't really have an observability platform there.
Starting point is 00:15:19 So that was actually digging into code and looking, oh, look, we just hit 150 query get requests or selects because Hibernate decided this was the best approach to do it. And that's digging into Hibernate code, which is for anyone that has ever dug into Hibernate code is not very pretty to look at. But shifting then, like you said, shifting left, when you're now here, when I'm at my current place and you have GraphQL and I have the observability piece and I don't need to dig into GraphQL anymore. I can just look at the traces or the metrics or whatever, and I can see, oh, there is an M plus one problem because the traces show I'm hitting either a hundred distributed traces or I hit this microservice 15 times and the trace just shows it bouncing back and forth. So I think that's a major difference between the two approaches of, I don't necessarily want to dig into a library's code, aka the Hibernate, me looking at Hibernate, versus I just want to look at traces
Starting point is 00:16:25 because I may be able to figure this out right away and then look at just my code to see could I have done something differently. And Justin, also from my understanding and for the listeners, so GraphQL, I assume there is like a lot of libraries that developers use, standard libraries, like client libraries.
Starting point is 00:16:43 Are these already instrumented with OpenTelemetry, most of them? Or do you have to go in as a developer when you use GraphQL to then add your traces? So I will say it's highly dependent on the language. But in terms of JavaScript, the telemetry or the library that wraps the base GraphQL library has already been done. So be it you use, let's say, Apollo Client, which is a library, a wrapper around the main base library,
Starting point is 00:17:16 or you use what's called GraphQL Yoga or something like that, it's already instrumented because all of them still use this base implementation under the hood. So you have something like that, it's already instrumented because all of them still use this base implementation under the hood. So you have something like that, it's amazing because when the JavaScript land, you can basically use any client you want and adding this tiny wrapper instantly starts instrumenting for you. When you get into other things, something like C Sharp, it's really highly dependent on the server. So there's kind of two big implementations in that realm where you have GraphQL.NET and HotChocolate, and both of them have different implementations.
Starting point is 00:17:54 And then even in Java, I know there's a couple implementations that are out there. So it is highly language dependent, but I will say at least the languages that I've looked into, Rush, Java, C Sharp, JavaScript, there are implementations. And OpenTelemetry has basically taken over those GraphQL fields. Cool. Let me take a step back into something you said in the very beginning. You said what OpenTelemetry has done, done open telemetry has kind of um commoditized i would say i helped to commoditize how we get the data
Starting point is 00:18:31 right so there should not be i think it allows us to elevate the discussion of saying and i need metrics i need this metric and this trace and this piece of information to well let's assume we have this data because it comes in and that's just what you assume, to now changing the discussion towards, hey, I want to actually give answers to particular questions like, you know, why am I hitting my database so heavily? Or why am I hitting my backend services so heavily?
Starting point is 00:19:01 Why do they cost so much after the recent update? Are these, you know, the GraphQL example we just talked about, again, one of these examples, what other examples can you give me on kind of what are higher level questions that we can now ask our observability platform? What are the typical things you see in your organization, whether it's on the dev side, on the SRE side, the DevOps side, the business side, what other questions do you see in your organization, whether it's on the dev side, on the SRE side, the DevOps side, the business side? What other questions do you see? So I'm seeing a lot more.
Starting point is 00:19:32 Used to, I would say, a lot of our questions just went around, oh, this thing spiked in CPU usage. Why did it do that? But we can start asking higher level questions. A lot of stuff will now be related to, well, the client made this request, which then spun off, let's say, 10 separate requests in the backend. I can now see, ask, okay, if client does X, Y, and Z, how does that affect my microservices, which now potentially are getting higher usage, are now using up more memory? I can now start seeing the links between all of those.
Starting point is 00:20:13 And I can start asking the questions, okay, if client does this, what is the cost associated with that? And that's, I think, where we're starting to elevate questions. The questions are no longer singularly focused on, oh, microservice started using more CPU, so now I need to up CPU on Kubernetes. I can now ask, well, we added X functionality for the client. How did that actually affect our entire work stream? And I think that, from especially a business standpoint, just helps us out so much. It's no longer a black box anymore
Starting point is 00:20:53 into this kind of microservice and dev world. Business can actually start asking those questions and dev can start answering them with Eaves. And basically, this already kind of translates, I mean, great stuff. So instead of asking, why do we have a CPU spec? The question is, do we actually make a profit with the new
Starting point is 00:21:13 features? Or are the features that we just built actually cost efficient? Or are they hindering us? So basically we're changing the conversation to more like a business-driven discussion. It's like, hey, is everything in place so that whatever we provide as an organization runs within our business constraints? We can actually
Starting point is 00:21:37 afford the hardware, we can afford our, in this case, cloud costs, and we actually run efficiently. Then carbon footprint comes also in. I think this was a big topic also at the recent conferences, also at PERFORM. Are we looking at our carbon footprint? So that's phase two. It's really open, to rephrase, open telemetry. And I think you used the word mundane. I would like to use the word it provides, it makes, what did I say earlier? The standard, what did I use the word? I'm blanking on this now. What did I say earlier? I said it commoditizes, right? kind of observability, which then now allows us to really ask higher level questions that
Starting point is 00:22:30 are especially interesting for the ones that put the business to understand how the system is running, but then gives enough context to the dev teams to understand where systems are not running smoothly. And that's exactly. Yeah. So shift laptop solubility, that is a topic that in our preparation for this call, it was something that you were very,
Starting point is 00:22:55 you know, you were very happy to talk about. And I think we already covered a little bit of this, like, you know, giving developers insight into a trace so that they can see what's actually happening. What else is the benefit for engineers, development teams to get access to this data? Is there any, besides just knowing what the system is doing during development,
Starting point is 00:23:15 what else is the benefit for development? One of the best things that I think has really come of it is I think every developer that's really worked has gotten those midnight calls where production is acting up. No one likes to be on call, but companies have to do it because
Starting point is 00:23:36 we didn't necessarily test or we didn't necessarily run performance tests on this or we did run performance tests, but it wasn't at the scale that our clients are currently calling, things like that. And a big part of shift left
Starting point is 00:23:54 is really trying to minimize impact on our devs and on people as a whole. I mean, first off, our customers don't want our system hitching in the first place, but they also don't want it to be down. They want the data and they want to use the platform on their time. So we're already doing that. We're giving business value back to our consumers. But I think, at least from a dev's point of view, and even from this kind of performance engineer, however you want to call it, we're not getting called anymore at midnight if we're moving all of that type of testing performance
Starting point is 00:24:31 and outlook to shifting it left and getting it earlier in our dev cycle. We're not getting the calls because we tested it all the way that our performance matches what our consumer activity is like. We've tested and we showcased through our testing when we moved it left because we showcased that in a blue-green deploy worked in Inting QA so we can now shift it to production. this kind of shift left mentality, we're actually helping the developers not have these, not in the typical nine to five calls.
Starting point is 00:25:10 They can now feel comfortable, let's say taking a vacation or stepping away from the code. I don't need to be constantly in work mode because we did this shift left mentality. And I think that's really the, we're going to always as a business want to say that, well, what was the business value? And we can tie it to dollars or consumers
Starting point is 00:25:29 or all of that. But I think from a dev point of view, it's really, I get to now have a life outside of work. And I think that's something that devs should always be thinking of is that I'm able to now enjoy my weekends or I'm able to enjoy a nice leisurely Friday and not be worried necessarily that I'm getting called at weird hours. Dustin, I need to add this to the description because I think it's just the first time where I heard shift left explained in that way because typically when we often say that often when I when we talk about shift left people say oh it means you're putting more stress in the developer now they need to do more but actually you are turning this around and say shift left is actually in the end
Starting point is 00:26:17 minimizing the impact on our devs they can focus on their work nine to five whatever they work when uh their work day because we give them all the insights that they need so that up front they can be sure that the system is not going to crash in the middle of the night. Because they see that GraphQL is just like the new
Starting point is 00:26:37 Hibernate and we need to make sure that we don't have these M plus 1 query problems because they will kill us in production. Exactly. I think, I know like we went through a transformative period with shift left and I know I felt the pains just as probably every other dev felt. I mean, not every dev wants to focus on certain aspects
Starting point is 00:26:59 and I'll say like, I know I'll use kind of negative verbs here, but testing can be boring. But from that standpoint, while maybe boring, it's saving you from the potential to not be able to enjoy time when you're not at work. And I think that's the piece that really from the dev standpoint is, yes, maybe we are adding work or we're shifting at least skill set. I think that's the best way of putting it is you're shifting a skill set. You're not just focused on dev, you're thinking of it as a whole.
Starting point is 00:27:34 But when you start shifting that skill set, you're also, like we kind of point out, you're minimizing impact on the dev. And I think that's really the big piece of this. I just need to take a couple of notes. It's great. I will quote you on some of these in my future presentations, I think. If you got another question. So you and your organization, OpenTelemetry,
Starting point is 00:28:00 coming back to that topic quickly. Last week, I talked with a lot of folks at KubeCon in Amsterdam. And a lot of them are saying, yeah, of course, OpenTelemetry is the observability layer of choice, clearly on Kubernetes, but it's the number one thing. But still other people said, well, we don't really know what this really means and how we actually roll it out in our organization. So there was a lot of discussion around enablement of development teams. So the question was actually, if we go with OpenTelemetry,
Starting point is 00:28:35 first of all, what do we need to do and what is already there? What is already instrumented with OpenTelemetry? How much then do we additionally need to instrument on our end? And also the question came up, well, can we do anything wrong? Can we over-instrument? What are best practices? So kind of throwing the question to you, when you were kind of starting on your OpenTelemetry journey
Starting point is 00:29:00 and you rolled it out and enabled your development teams, any lessons learned, any things that development teams any any things any lessons learned any things that went well any things you did that you would do again any things that you that didn't work well so i will say uh what one of the things that really stood out to me was um at least getting the instrumentation in our JavaScript systems, Node.js, it was literally, I added a simple file. I think I called it trace.ts or something like that. I added some very basic things that were on the OpenTelemetry page,
Starting point is 00:29:38 the documentation for it, and I got 80% of the way there. I know it's going to sound like, oh, it's because he's on this observability platform or because he loves open telemetry. But really, it was crazy how easy it was to get set up and getting to that 80% mark. I think that was the craziest part for me is I'm used to using a piece of technology and getting maybe 50% of the way there. And then you got to start adding your own custom code in and really tailoring it to your needs. And OpenTelemetry kind of just gave a lot of stuff to me. I will say kind of some lessons learned about it though
Starting point is 00:30:22 is I think not from the tech point of view, but from the business point of view is really bringing in some of the business leaders earlier in the process, because it was very much, I was kind of experimenting with open telemetry and just bringing it in to see if we can see anything, seeing the immense value you got, but then going to business leaders and showing it, and then them being like, well, why did we add this in? Isn't this what the X platform is meant for? I think it's kind of bringing business leaders in earlier on that process. But number two, now from the tech standpoint, was I think engaging with the OpenTelemetry community earlier. There were aspects of
Starting point is 00:31:08 kind of pitfalls that you can run into. So one of them is, and this is very specific to JavaScript, but if you're on the ECMAScript module system in Node.js already, you won't be able to really get open telemetry, at least the auto-instrumentation in right now. And that's due to some packaging issues and the way the module system works. So it was interesting to have one of our microservices already shifted to that. And so we had to do some custom work to get it actually put in. And I think it was also
Starting point is 00:31:48 kind of lesson learned is understanding what your observability platform may already give you. So one thing that we ended up finding out is our agent that we have on our system was automatically picking up OpenTelemetry data for us. But I know other platforms may not have that. And so understanding where these kind of different pieces of open telemetry come into play. There's things like exporters, there's things like converters and all of that. And really understanding what your platform may give you and what you may need um i think was a major piece for us um that was kind of that was kind of a major lesson learned yeah i think that's that was actually a question that i wanted to ask you because um open telemetry is just one piece of the puzzle right uh instrumenting the code and basically having
Starting point is 00:32:43 the ability to send this data to the observability platform. Or if you go all in with OpenTelemetry, you have your app instrumented OpenTelemetry, then you have an OpenTelemetry collector that needs to collect the data and that then needs to send it to somewhere where it's actually stored, persisted, analyzed. And in your case, your platform has already done a lot of the work for you which is which seems great right um yeah exactly yeah is there because i had a lot of discussions again coming back to kubecon a lot of folks were saying well we are going all the way in open source you know with no no commercial vendor no nothing. And I think some folks don't necessarily know maybe
Starting point is 00:33:29 what this really means. I think OpenTelemetry is great, but there's more to OpenTelemetry than just instrumenting your app because this is just giving you the basic kind of opportunity to actually patch data, but you still need to collect this data, send it somewhere in a secure way, analyze it, make it available again, and this
Starting point is 00:33:49 is where then the real I think this is the real value then on top. Exactly. How can we make use of this data? Yeah, that's a very good point. I think there is some, especially for someone that maybe is not currently in the observability space, they haven't entered it at all. So people that they're just using, let's say Prometheus right now, Prometheus and Jaeger, two of can just use open telemetry i can use my exporter and then collector and then
Starting point is 00:34:28 ship it off to jaeger or prometheus and all figuring that out and i would say that's great i mean if that gets you initially in the space and gets you initially seeing value of what something is providing that's excellent because that's then going to allow your business leaders to look at it and be like, cool, we have this data right now. But that's the day one operation, right? You got your data and you're shipping it somewhere. But now your day two is going to be your business leader coming in and saying, well, now I want to understand these five KPIs.
Starting point is 00:35:05 How does this data give me those five KPIs? And maybe your first day too is, okay, I'm going to write this crazy query for Prometheus that's going to start tying all this data together. Or I'm going to write a Jaeger system to understand how does all these traces now start working with each other? And how does it hook into my ElastiCache logs that are sitting over here? And you're going to start noticing that what was initially easy and getting all
Starting point is 00:35:32 that shipped into just Jaeger, Prometheus, ElastiCache is now starting to turn into this monumental task again. And you're going to be like, well, now I need to hire 10 DevOps people to really start looking at this data or something like that. And that's the piece that I think where if you start getting those 10 KPIs that really tying all that data together is confusing, that's the piece that you're going to start seeing where commercial vendors or an observability platform is going to start giving you that. Do you want to spend the, let's say, half a million dollars over the year and training of developers to just go 100% open source? Some companies may see the value proposition in that.
Starting point is 00:36:19 But for a lot of enterprises, they're going to see it as, well, that doesn't make any sense. We should be letting people that are experts in that field do those day two, day three operations. And I think that's where open telemetry, kind of shifting back to what we said at the beginning, open telemetry gives us day one and potentially day two. But really, all of those advanced things that you want, I don't think OpenTelemetry should really go into those because that's really starting to get into areas where you need to tailor data. And OpenTelemetry is not going to be able to create standards around tailoring data. That's really up to vendors or companies to figure out
Starting point is 00:37:06 what they need. Open telemetry to me is, as you said, commoditizing data. That's what open telemetry should be doing, commoditizing metrics, traces, logs, profilers,
Starting point is 00:37:18 things like that. And then other solutions should provide the analytics or the wrapping of all of that and making it nice and neat packages. I like the way you explained it. And I think there's another analogy or kind of a similar story with Open Feature. Open Feature is another open source project in the CCF space. And it's the same thing. So your open feature
Starting point is 00:37:45 is standardizing the way developers can implement feature flags in their code. So it's again independent of your vendor, so you can really get started easily without a vendor login. And then there's also flag D as an open source backend implementation to get you started. And I think that's also what we heard last week at KubeCon because we are active in open feature because we kicked it off initially last year also with eBay and some of the other feature-plaguing vendors. And people that came to us last week to the booth, they said, hey, you know what?
Starting point is 00:38:18 It's really cool. Open feature is a standard. FlagD is an open source kind of like your day one. We can test it out. We can get started. But then eventually, we obviously need to go to a commercial version of the backend system because we need the scale, the analytics, the enterprise features, like who can change the feature blacks, the analytics on top.
Starting point is 00:38:39 This is stuff where we go beyond day one, where we then go day two and then just operationalize everything. And, and it's just the same with what I see, what you just tell me with open telemetry. That's a great way to get started. And definitely it doesn't lock you into anything. You can walk a long way,
Starting point is 00:38:58 but eventually you should focus again on your core, on your core business value and that is not building and maintaining a complex backend data storage analytics software solution because this is where commercial vendors come in. Yeah, exactly. To me, if it's not your business, I mean, we've seen companies
Starting point is 00:39:26 where they've built other solutions for things and then that's how it's spun into. The best example I can right now think of is like Slack. Slack, they built that messaging tool as their internal tool and they were building a game, if I'm correct, and then Slack is what took off. So those stories exist out there. But in most
Starting point is 00:39:45 cases, you're trying to focus on your specific product niche or whatever you're going into. And you're going to start seeing the exponential increase of trying to get value out of your analytic solution or your open feature system or whatever it is. And you're going to start seeing, okay, my homegrown solution just does not compete here. And that time value proposition just completely falls off eventually. And I think really what some of the best devs or the best businesses are the ones
Starting point is 00:40:18 that are not going to be reactionary to this fall off. They're going to start seeing this kind of slow decrease of value of them building their homegrown and start noticing that, okay, now we need to switch. We need to use X system now. And I think that's really where you see it is. And I guess this kind of goes back to that shift left mentality, even in the business realm, where you're not being reactionary anymore.
Starting point is 00:40:44 You're being preventative you're you're seeing it up front and making your decisions way sooner than when it's already fallen off and you've completely missed your kpi or something like that justin from uh from this conversation is there anything missing? Or is there, like, if you think about it, we have people listening to this. They might be already familiar with OpenTelemetry. They might be new to OpenTelemetry. I think we covered a lot about what does OpenTelemetry result.
Starting point is 00:41:17 I think we understand this, right? It's really the commoditization of how we collect data. That's great. I think we also talked about shifting left. But it's really about minimizing the impact on developers. That's great. I think we also talked about shifting left, but it's really about minimizing the impact on developers. Really great stuff. And also like the shifting left is actually shifting the skill set. So in the end, minimize the impact on dev. You also gave great overview of your rollout experience with OpenTelemetry, 80% just with the default
Starting point is 00:41:41 instrumentation you get with some of these OpenTelemetry frameworks and libraries that are out there, bringing business leaders early in the process, engaging with OpenTelemetry community earlier, and also understanding what your observability platform already provides. Anything else that if somebody that listens to this wants to now get started in rolling out observability in the organization, shifting it left that we need to discuss? I think it's understanding, especially an understanding. I don't want everyone coming away from this and seeing OpenTelemetry as a silver bullet.
Starting point is 00:42:13 It is still evolving. I mean, logs just got feature frozen just a few months ago. So just because it was feature frozen, that now means all the implementations need to go in. And so understanding that, and I kind of gave this at Perform also, this call to action of if you are interested in this, and if it even gives you, let's say that 50% for your day one, talk and bring up suggestions in the OpenTelemetry group because I can be completely honest, they are open and they want help
Starting point is 00:42:48 and they want to understand what are your pitfalls. It's evolving and it's evolving at such a rapid pace that you can definitely tell that it's getting a little bit uncomfortable for them because they're getting so many more users and they're happy, but they're also like, well, we still need X, Y, and Z feature added in. I bring up logs because I was so happy getting traces and metrics.
Starting point is 00:43:10 And then it was like, well, where's the log feature for them? And it hasn't been built yet. And so it's understanding that if you even get some value, try and talk with your organization. Or as a single loan developer, try and work with OpenTelemetry in some capacity because the more we give back in that way on something that you maybe will take for granted or that business takes for granted or whoever, you may not see the value initially, it's going to provide value eventually. And I think those are the pieces that we need to see more, I would say, devs even getting in the space. Because the more developers, the ones,
Starting point is 00:43:52 the boots on the ground people that are working directly on code and working and looking at traces or whatever it is, the more that our commoditizing of data is just going to keep increasing. And that's really what I want to see. It's seeing more of these devs getting involved to help commoditize this data even more. Yeah. So shout out to everyone out there listening. OpenTelemetry, one of the CNCF projects that is definitely not only worth looking into because it benefits you, but I think can also contribute back. It's ever evolving, as you said. There's still a lot of work to be done.
Starting point is 00:44:32 But yeah, it's amazing how far the project already came. And what I really like being in the observability space, looking at it and seeing how it actually brings together, you know, normally companies that are normally rivals on the market right like you know we if you look at all we as dynatrace and also data dog in the relic and honeycomb and we're all contributing to this and um because in the end it benefits us obviously right because we can uh we don't we're no longer eternally depending on our agents to build agent technologies. And we can also contribute back to actually get the data that we need in order to provide higher level value with our observability platforms. Yeah, I mean, and it makes sense.
Starting point is 00:45:19 Yeah, I guess that's the piece that I know I've had some discussion with devs and they'd be like, well, what's the reason why X observability platform is buying into this? And I kind of explained it in this way of, well, okay, you're a dev and you work for a company. You don't want to keep writing the same form so many times. I don't think our observability platform developers and business leaders want to keep rewriting traces over and over again. They want to work on the cool stuff the same way you want to work on the cool stuff. Yeah. Hey, last question for you.
Starting point is 00:45:55 Are there any resources that you have used when you got started with OpenTelemetry? Any particular people to follow? Any, I don't know, anything where you say, hey, this was really good that I have this resource available? So I will say OpenTelemetry.io, their website, their docs are great. They're definitely still holes, but they probably provided some of the best resources. Other than that, I would say get on Slack, get on the CNCF Slack channel and start going into the OpenTelemetry groups. So there's OpenTelemetry, just the base one, but then there's Otel FOSS, Otel JS, all the different specifics, and just start asking questions in there. The founder of Open Telemetry, absolutely excellent.
Starting point is 00:46:56 I got a chance when I was doing some, they were doing user research, got a chance to talk with them. But any of them, go in those channels, because you're going to also run into developers from different pieces. So like Othell Fahs, you have AWS developers in there, you have Azure developers in there. And so I would say really diving into those two will really just elevate your experience. And then I can add two additional things. We just published a podcast with the author of Practical Open Telemetry.
Starting point is 00:47:33 The podcast episode is called Adopting Open Observability Across Your Organization. And our guest was Daniel Gomez Blanco. So he just published a book on open telemetry. And the other thing I want to highlight, Henry Brexit, who is also working with me, he has the Is It Observable channel. So isitobservable.io, and he's been covering open telemetry quite a bit.
Starting point is 00:47:58 And he also put out YouTube tutorials and GitLab tutorials to just get started with some of this. Cool. All right. Justin, thank you. Thank you so much. Actually, I believe you will probably meet Henrik in a couple of weeks because there's going to be Glucon.
Starting point is 00:48:16 I'm not sure if you are at Glucon. Oh, I'm not. I sweat. I wish. Okay. I just got back from a vacation, so I can't leave off of work again. Okay. Because I think you're in Colorado, correct?
Starting point is 00:48:29 Correct. Correct. Yeah, because GlueCon is in Denver in a couple of weeks. Oh. And Henrik is going to be there. We have some other folks from Dynatrace going to be there. So that's going to be a good opportunity. So whoever is listening, if you're in the Denver area or if you go to GlueCon, you may want to ask Henrik about some OpenTelemetry advice because he knows his stuff.
Starting point is 00:48:56 Yeah. All right. With this, I say sorry, Brian Wilson, that you couldn't be my co-host today hopefully I did a good job good enough job to do this interview myself but thanks anyway because Brian is the person that makes sure that
Starting point is 00:49:15 all of this gets post-processed and packaged up and then shipped to the internet so that people can actually listen to it so thank you so much and thank you Justin thank you so much. And thank you, Justin. Thank you. Bye-bye.
Starting point is 00:49:29 Bye.
