PurePerformance - A year in - Establishing an SRE Role at CFA with Abigail Wilson

Episode Date: January 6, 2020

Do you have a clear definition of what Reliability means for your organization? Abigail Wilson, Reliability Architect at CFA Institute, sees this as a key requirement before you start transforming your organization to embrace site reliability, DevOps or Cloud Native. In the podcast we hear how Abigail went on her journey where she has proven that you don’t need a background in IT in order to become an advocate and change agent for reliability engineering. In her role she has bridged the gap between business and IT, she has helped bring stable environments to developers and testers, and with those and many other steps has increased overall productivity, quality and stability of their business critical applications. https://www.linkedin.com/in/abigailswilson/ https://theabigailwilson.com/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always we have with us, I don't know why I'm saying we because I'm not royal, but I have with me my co-host Andy Grabner. Andy, how are you doing today? Not too bad actually. I've never said we have with us before. That was an odd little... I don't know. A little bit there.
Starting point is 00:00:48 Yeah. Well, I'm not sure what's happening on your end. Yeah, I thought so. You said this earlier before we got started. Why is that? Yeah, well, I went to a neighbor of mine. He likes movies too, and he's got one of those movie passes. So last night for a 9:15 showing, we went to go see Knives Out. It's pretty good. Pretty good movie. Good cast. You know, good whodunit type of movie.
Starting point is 00:01:10 You know, murder mystery. But, you know, I was a film major in college, right? And movies back in like the late 90s, early 2000s were typically too short. They didn't develop plot. Now everyone's going too long. So this was like a whodunit murder mystery, and it was over two hours. I'm like, come on.
Starting point is 00:01:26 So anyway, got home late, and then my daughter had a seizure overnight. So I'm just kind of slammed on sleep today. But we will get through, and I'm sure the fun antics will wake me up today. How have you been, Andy, now that I did my sob story? Actually pretty good. So at the time of the recording, right, which is early December, we just made it through our Christmas party here in Austria.
Starting point is 00:01:53 So we had our Dynatrace Christmas party last weekend, which was phenomenal. Wanted to go home early, ended up being at home at 5 a.m. So that's always a good sign. And it also helped that I was on the West Coast the days prior, so my body was on a different time zone, so that made it easy. But Gabby also stayed out that long, so she was there. Dancing involved? There was some dancing involved, of course. We did some salsa and some just dancing to the DJ. Anyway, the sal... Wait, do they do the salsa specifically because they know you're there, though? It's like, oh, Andy's coming, let's play
Starting point is 00:02:28 some salsa. Well, we actually have a Dynatrace band, and they're called the Pure Sound Agents, and I think it's the fifth year that they performed. And Eva Vintage, she is kind of the lead singer, and she told me just before it got started, hey, you got to be there because the second song is going to be a salsa. Well, it was a cha-cha, but anyway, we danced to salsa. Just like I was, I know we're going to go to the episode in a second, but the intro music, I was like, all right, I got to try to find like some kind of a beat for Andy.
Starting point is 00:03:00 And I was looking at my Casio keyboard and I was trying the salsa beat, but it really wasn't good. So that was actually a sped up tango with delay on it. But it's, you know, at least somewhere closer than a rock beat. So there we go. Anyway, enough tomfoolery. Exactly. So, I mean, one Wilson is tired today.
Starting point is 00:03:16 That is you. Well, good news is for the audiences out there that are listening to this, we have a second Wilson that hopefully is not that tired. And that second Wilson, her first name is Abigail. And as always, when I introduce a guest, well, typically I ask, are you there? I know you are there, Abigail. So that's why I directly go into Abigail. I don't know why you always ask, are you there?
Starting point is 00:03:39 I know, I don't know why. But Abigail, I'm pretty sure she's there. And always I just ask, Abigail, please introduce yourself to the audience. I am here, yes. Hi, Andy and Brian, it's good to be here. My name is Abigail Wilson, as Andy said, and I'm the Reliability Architect at CFA Institute. They just created the service reliability function about a year ago, so we're still in the early stages of figuring out what that means for our organization.
Starting point is 00:04:08 It's been quite fun. Very cool. So reliability, so SREs, right? We had this topic coming up over the last couple of episodes. I think Brian, I know you're tired, but I need to ask you at least one more question. I think it came up with Gene Kim when we talked about the Unicorn Project.
Starting point is 00:04:27 Yeah, he was talking about that. It's a hot topic again. Yeah, it's a hot topic. We also had Adrian Hornsby. He was here. We talked about chaos engineering and site reliability engineering also came up. So, Abigail, I know you said a CFA Institute. And I want to just get a little background on your current employer,
Starting point is 00:04:47 because I think this is actually important because I think it seems like a big part of our industry is moving into that direction and not only, let's say, the classical software companies. So that's why I want to get a little bit more background on your current employer. And then I want to dive into why did CFA decide to move into that direction? And what was the driving factor? And then kind of, you said you started a year ago, at least a year ago was when this position was started.
Starting point is 00:05:16 What were you facing back then a year ago? And kind of, how has your role evolved? What have you done? And I'm pretty sure there's even more questions coming up. But maybe starting, who is CFA and why did they decide to move into that field? So CFA stands for Chartered Financial Analysts, and we are a nonprofit financial ethics credentialing company. So for our customers, we're primarily an educational body.
Starting point is 00:05:46 We create a series of exams that you can take if you're working in the financial industry that show not only that you're competent in those topics, but also that you take those themes and you work within them with an ethical perspective. So that's our primary revenue. But again, we're nonprofit. So we also work in a lot of advocacy, looking towards more ethical standards in the financial industry. And we connect a lot with both our candidates who are going through the credentialing process, and our members who have already completed that process and are part of CFA Institute. And we have international societies across the globe
Starting point is 00:06:25 that are technically independent bodies, but we support them in a lot of ways, and we also host conferences for our members and other interested parties. So we're a little bit different from a lot of folks in the industry in a lot of ways. We are a nonprofit. We try to operate like a profit-based business in the way that we organize our work. But we do have really most of what we're delivering is technology and that really we are a technology company. large restructuring, not only of the IT department, but also of a lot of the other groups in the organization to kind of reflect that and help break down silos and just get us in better shape for faster delivery of technological solutions. And then also supporting our customers, getting feedback from them on those tech solutions and being able to iterate quickly. It's interesting, right? And I mean, you said you're in an educational
Starting point is 00:07:52 business, and you obviously saw that you're actually becoming a tech company. And I think this is true for so many different industries. I remember we heard that way back when we had Adam Auerbach from Capital One on. They were a bank, but now we're a tech company delivering banking software. I think he said we are a tech company with a banking license or something like that. I mean, that was interesting.
Starting point is 00:08:22 Yeah, I think the content of our business here is educational for the most part. But the way that we interact with our customers and our partners is all through our technological ecosystem, right? So if that's not strong, it affects everything else. And so you said major... So a year ago, major pivotal moment in the company's history: decided or realized you're a tech company, so you needed to become more efficient. There was also a big restructuring. So were you already in the company prior to that,
Starting point is 00:09:01 or were you hired as part of the reorganization, or how did that work? I started with the organization in 2016 as a software developer. And then when they reorganized in the fall of 2018, I was accepted in the position of the service reliability engineer, which has now evolved a little bit further. Yeah. So fill us in a little bit. So that means you started as a software engineer and this position came up.
Starting point is 00:09:27 Did you, I mean, how has it shaped? And also what's maybe, what problems did you try to solve back then? And how has that changed over the years or over the year? Sure, that's a good question. So I think my personal history is a little bit relevant here. I did not work in technology until really important. And I actually learned a word this week, which really captures this. I was so excited to hear this word. It was proposed by Nora Bateson, I believe,
Starting point is 00:10:12 and it's symmathesy. And it's this idea that it's kind of taking systems to the next evolution, where a system is sort of a closed entity with static components that relate in certain ways. But that is kind of an outdated concept because, I mean, at least here, the software infrastructure that we're working with is not static. It changes all the time. And whether you're talking about software or an ecological system, everything is always changing. And so when you have the system where each piece
Starting point is 00:10:46 is learning and improving, as well as affecting the relationships with others, you have a symmathesy. And I think my ability to see that kind of nuanced changing system is what made me a good candidate for this role, because we're creating a new function. So I'm creating all new systems of processes, new standards, new software that have to fulfill these needs for the business. And so when I was a developer, I like to find interesting problems and solve them. And one of the problems that we had when I was developing was that we didn't have a good way to visualize all of our digital components. We were in the middle of our digital core transformation where we were moving all of our infrastructure from a monolithic on-prem infrastructure to a microservice structure in the cloud.
Starting point is 00:11:46 And when we did that, suddenly we had no idea where anything was or what was talking to anyone else. And so I developed a solution to fix that, that was automated, that integrated with some of our existing systems, like what we use for deployments to just generate a real-time picture of our whole system. And so I think the ability to identify that need and then fix it using pieces of the existing system is kind of what made me stand out at first. And I've tried to take that perspective into this role with reliability. So one of the first things that we focused on as we brought this reliability function into existence was simply we were very reactive. All we wanted was stability.
Starting point is 00:12:32 If you know anyone who's taken one of our exams before, they'll tell you that getting your results on the day we deliver results is purely a factor of whether or not you get there before the site crashed, which is not a good place to be. And so when I came into the role, my first goal was to bring stability to our sites at all times, especially on results delivery day, because that is the most important thing for our customers. So once we sort of had a strategy for scaling all of our components and how to properly set up our infrastructure for the strange traffic patterns that we see, after that, I started
Starting point is 00:13:14 focusing more on standards. And so then I stepped into the performance testing realm. And this is also where a lot of relationship work was done with the business so that they would understand not only that we were looking at and concerned with performance and reliability, but also to make sure they understood the depth of that question. So I started basically codifying our performance testing process in relationship with the business so that they understood what an SLA was, what a service level agreement was, that they knew how we set those, that they could then agree to those and understand that they are the drivers for those agreements. They're agreeing that we're not dictating anything for them.
Starting point is 00:14:01 And essentially just finding a way to help the business maintain control. It doesn't have the sound I want, but to empower them still to have control over their products while we were still able to build up the systems in the way that we needed to. And so now we're kind of moving more into a strategy phase that's more proactive. And this feels like a really good place for me, because for the past year, it's basically been identifying which aspect of reliability was most on fire and trying to put that out.
Starting point is 00:14:38 But now we're actually at a place where we can look forward and see where we want to be and see what needs to be built to get us there. This is fascinating. Actually, I'm just taking some notes here because, well, maybe I want to let some folks know here that besides doing this podcast, you will also be with me on stage at Perform in Vegas, our conference coming up early February.
Starting point is 00:15:04 And I know the two of us, we talked over the last couple of weeks and months several times. But every time when we talk, there's so many new things that I learn about you. Now, first of all, about yourself, then the last year at the company, what you've been dealing with. And I really like the way you just kind of explained the last year at the company, what you've been dealing with. And I really like the way you just explained the last year and how it started from bringing stability to a system, putting out the fires, making it stable.
Starting point is 00:15:35 And then now, as I understand it, moving over towards becoming more proactive, but also giving tools in the hands of business so that they can work with the stable systems that you've built, enforcing SLAs and defining SLAs and making sure that the value that they deliver is basically according to their standards, but always giving them control. So I think that's pretty cool. I have a lot of questions now. I want to ask you a bunch of questions. However, before I focus on the questions that just came up in my head, I first want to go back. You mentioned that in 2016 you started in tech. What is your background? Because I think, and Brian,
Starting point is 00:16:17 to come back to you, this is not the first person we interview that had a complete career shift or, like, came from other directions. And I came in from another direction too. I was actually curious about this as well. And while you're answering that, since it's a common topic and many of us suffer from it who make this transition, did you, while you're answering that question,
Starting point is 00:16:37 did you suffer from any sort of imposter syndrome and how did you overcome that? I'll add that to Andy's question. The answer to that is definitely yes. I'll get into that for sure. I think it's a very important question in the industry these days. Yeah. So my background would not point towards technology at all. In fact, before 2015, I had never had a job that was either inside or sitting down. So even the idea of being in front of a computer for much of my life has been giving me the jitters. I just
Starting point is 00:17:12 don't like that idea. I've gotten used to it though. My university degree is in fine art. I am also a printmaker. That's still something that I enjoy doing. You can find my art out there if you'd like to. We'll put it up there. Why not? Yeah. But I graduated with a fine arts degree in 2009, which was a really difficult time to come into the humanities, as I'm sure many other people listening are aware. So I tried to make that work for a little while and it just wasn't happening. So I moved back into some other interests. I was in outdoor education for a while doing team building. And also I ran a boys' summer camp for a little while, their hiking program, which was really fun. And I kind of just adventured for a few years. I'd take young boys out hiking
Starting point is 00:18:05 and then on my days off, I'd go hiking on bigger mountains and it was great. But eventually I wanted a little more stability instead of sleeping in a cabin all the time. And I worked for a period as a cabinet maker. I was in a customs department there. Oh, wow.
Starting point is 00:18:22 And was building all kinds of really fun stuff. I mean, to be honest, building cabinets is not that different from building software though. You have a plan you're working from and you have to figure out how to get there in the most efficient manner without any waste. You have to just make all the little pieces and then bring it together. With software there, you get a little bit more testing. With cabinets, it's kind of one and done. And then in between all of that, I also had a stint. I decided not to go back and get my MFA in fine arts. And instead I started my own bakery. And so for just over a year, I ran a bakery in a little town and it was very successful, but also very stressful. I decided while I liked designing a business, I didn't like being the owner of that
Starting point is 00:19:12 business. So I sold it and moved on to other things. And eventually I ended up being in an office for this cabinet maker and working with installing an ERP system and changing some of their inventory systems and whatnot. And I realized that I actually didn't mind sitting in front of a computer. And there were a lot of parts of it that I really enjoyed. I've always been very interested in languages. And my older brother is a musician primarily, but he had just completed a boot camp to get into software development and loved it. And we had a lot of talks about how the creative mind, whether you're doing music or art or software programming, is actually very similar,
Starting point is 00:20:00 that a lot of the processes are the same. And seeing the way he liked it just kind of made me wonder, like, you know, I wonder if I could do that. I'd built a lot of my own websites, first as an artist and then for my bakery. So I'd already taught myself HTML and CSS the hard way using Dreamweaver. It feels so long ago. And so I went and did a three-month bootcamp intensive in .NET technologies.
Starting point is 00:20:26 And from there, I was hired on at CFA Institute as a developer. And when I started here, I was very aware of how little I actually knew, especially once I was out of the classroom and into the workspace. It's very different. And while I had spent a lot of my time during the class expanding on my knowledge and trying to reach beyond what they were teaching us, there's still nothing that can match real world experience. And I decided that the only way that I was going to have confidence in my own abilities was just to completely own how much of a beginner
Starting point is 00:21:03 I was. It's kind of lucky that I came into CFA Institute the way that I did because everybody knew that I was coming straight out of a bootcamp. And I feel like it gave me permission to just know nothing, essentially. So I definitely would do my due diligence about Googling my questions and trying to do my own investigation.
Starting point is 00:21:22 But I also wasn't shy about asking questions of my fellow developers or of our systems team or of anyone else, just to get a bigger picture about what I was working in. And so, yeah, I definitely had imposter syndrome and my reaction was just to lean into it as hard as I could. And I think in general, it not only helped me because I was able to
Starting point is 00:21:45 actually learn all the things I felt like I didn't know. But it also made others around me a lot more comfortable asking questions. And it helped me build a lot of really strong relationships within our IT department that are still very strong today and led to a lot of good partnerships on fun projects. just terrible people. But a lot of times in any organization you can find the people who you can ask questions, and they're going to be happy to help you out. And we see this all the time, Andy. We talk about this all the time, of everyone sharing knowledge just in general: hey, I developed this thing, I'm going to put it up on GitHub, it's free to use, modify it, make it better. But that community that exists whereby you can go to somebody, say, hey, I don't know, explain this to me, right? And
Starting point is 00:22:44 finding those people in your organization who aren't going to look down on you for that, but instead be like, oh, I'd love to explain that to you and help you learn. And maybe as I'm teaching you, I learned something in the process. That's a really, really important part. And for anybody out there who's in more of that beginner phase or learning phase or just not feeling completely comfortable, I just can't stress enough how important it is to find those people in your organization, because there are probably a lot of them, and just reach out and talk. Anyhow, that's all. I just wanted to get that in because I think it's really key. Yeah, I would agree with that. I
Starting point is 00:23:19 mean, we are all creatives, I think. I mean, I'm sure some people are in the industry because of its stability and exciting growth. But I mean, at least here, most of the people that I work with, they enjoy making something that they're proud of. And part of that is also sharing that excitement and that pride with other people. And when we're able to collaborate on projects together, it just keeps those good feelings going. And I think in any collaborative environment, especially when innovation is involved, it's really important for people to feel safe when they're going out on a limb. And it's a hard thing to do for all of us. But if one person is able to kind of take a step and show vulnerability and ask that question, then it makes it easier for others to sort of do the same thing. It helps lead to that culture of
Starting point is 00:24:14 trust and excitement. All right, Andy, you got a million questions. I got a million questions. Well, first of all, I would... Thank you for that. Excuse me, I have a few more questions, if you don't mind. Yeah, it's amazing what you, you know, knowing your history. And I think maybe, you know, as you said, it could be a big benefit if you basically come into a completely new field and with your curiosity that you have, but also knowing that you don't know everything. approach people and basically asking them for their advice and for their knowledge and asking them for help, I think that that is something that maybe some of us no longer have that have been in the industry for too long. And we think, well, we know everything better anyway. And so why would I ask somebody?
Starting point is 00:25:00 So I think that's obviously, in this case, a big advantage. Now, I have a couple of questions. So coming back to what you explained: when the role was created, you mentioned in the beginning, you had to bring stability to the system. You had to put out a lot of fires. Now, I also know, we talked leading up to Perform and the stuff we are doing there, that you mentioned that a big thing for you was to actually ensure that developers have stable environments, stable test environments. You mentioned earlier that you have done a lot of work around
Starting point is 00:25:37 codifying performance or, you know, changing the way performance is perceived. So can you tell me a little bit about the actual things that you have done in the first couple of months after you started on stabilizing the system? Because I really believe, I mean, having a stable system, and with system I guess I mean different environments, having it stable and getting the trust of the individual people that are using it is, in the end, obviously improving performance and the outcome of the people that work with these systems. So I would like to know from you, what were the biggest problems with the stability on these systems? Which systems are we actually talking about? Is it production?
Starting point is 00:26:21 Is it pre-prod? And what were some of the measures that you have taken? Because I'm pretty sure a lot of people can learn from that on, hey, ah, this is, yeah, we have the same problem. Ah, and that's the way she tackled it. So that would be interesting for me as a first question. So the first thing I focused on was production because we were having very high visibility problems in production, namely the site crashing on days when most candidates or members were on the site. And that is just not acceptable. So initially, I was in charge of things like negotiating contracts with some of our vendors so that
Starting point is 00:26:59 we could change our scaling strategy. So for example, where we host our website, we used to be in a very restrictive scaling capacity where it took a lot of manual steps and a notice of about three weeks in order to expand our hardware there. And our typical traffic to our website is pretty low. So in general, we don't need a lot of hardware. But on one day, when we deliver our results, it goes from maybe 10,000 visits in a day to well over 100,000, just in a three-hour window. So our need to scale for a short period was pretty dramatic. So we had to negotiate that contract to change the way we could scale and get some more hardware wired up. And I worked in partnership with several other aspects of the business for that. And we have an interesting mix of on-prem legacy systems
Starting point is 00:28:05 with our newer microservice structure in the cloud. And that is where most of our challenges are still coming from. So where a lot of our data is stored is on a very, very old server that we're currently still trying to get off of. It's been a four-year process doing that. It's sort of why we went through this whole digital core transformation. And so managing that hardware has been the
Starting point is 00:28:35 greatest challenge. And that is not owned, luckily, by me. Our prod support team owns that because there's a lot of knowledge on that team. So it's a bit of an interesting responsibility share between me as the reliability function and other teams. But so that strange mix of architectures has been a big challenge that we're still dealing with. But as we've done that, another big aspect for stability was simply to get visibility into all of those components. So when I first came on, we were currently using Dynatrace, but it had not been used to its full potential.
Starting point is 00:29:17 I would argue we're still not using it to its full potential, but we're a lot closer. And so we sort of had to teach ourselves how to use Dynatrace because the former owners had already left. And so I then focused on getting monitoring on all of our primary components, making sure everyone who needed to see those metrics could see them, making sure all of our alerting worked so that once the site went down, hopefully we would know even before the crash occurred and start taking action to ameliorate that. But a lot of it, too, was just about our relationship with the business. They did not trust us to take care of problems the information we needed, we had to convince the business
Starting point is 00:30:05 that we could see everything we needed, that we were on top of things and that we knew what we were doing. So the early days were as much of a publicity campaign, so to speak, as they were actually getting into the weeds. And I think that that was actually really fun for me. A lot of people in technology would hate that kind of role, but I really enjoyed it because it made the relationship between everything that we were setting up between monitoring and metrics and just reliability in general, it made the relationship between that and our end users or like what the business cares about basically, it made it very apparent.
Starting point is 00:30:49 And I think that's one reason why performance then became such a buzzword in our organization is because that was how I spoke to them about reliability. So did you... I got a question. I mean, this is actually a quote that I also kind of remembered from our initial talks, when you said the business didn't trust you and that you had to convince the business that you're trustworthy and that you're giving them, you know, the data that they need, and actually you work on something that matters to them, which is reliability, which is performance, which obviously in the end helps your end users. I have a question. So you mentioned obviously that you use Dynatrace. You took it over.
Starting point is 00:31:34 What did the business use before that? I mean, before you gave them those metrics, did they have their own insights? Or how did they, you know, hear about or measure the current problems or the quality of the system? I honestly don't know the best way to answer that question. There weren't a whole lot of established systems for measuring what our products were doing once they were out in the world. We were using things like Adobe Analytics, and we were keeping track of things like how many orders were completed. And then, of course, our prod support teams had things like SCOM alerts set up. But there wasn't really a good way to see how each component was performing, if it was experiencing errors, what kind of
Starting point is 00:32:27 throughput we were having. And so just being able to measure all of that was a huge win for us. It really changed a lot. So basically what you did, and I think this is a theme that I keep hearing, it's you were really, I mean, the business may have their own metrics, but it's coming from one system, and they lack, as you said, the context of what's actually happening on the technical side, in the infrastructure, in the applications itself. And you basically approach them and say, hey, look at this, I can give you data, but we actually can then correlate it
Starting point is 00:33:01 with what's really happening in the system. So in case there is a problem, we know what the root cause is, and then we can obviously work together in fixing it. And that's obviously, that was your mission, to make these systems more stable, to focus on performance. But then with these improvements also show the business that, hey, see, with all of our work, we actually improve your business metrics, and we actually know cause and effect, right? You know that, let's say, a drop in orders is because of bad performance or
Starting point is 00:33:31 because of an outage that you had. Am I getting this right? Yes, that's correct. When I first came on, they wanted to know things like, do you actually know why the site crashed? And we had to show them that we knew why the site crashed. But now we've evolved to the point where we had just a couple of calls come into our global contact center about being unable to access this particular aspect of our membership application. And it came down to me. I was able to look in Dynatrace to see what was going on. I found the component that was at odds, looked at the errors. I could even pull the individual IDs, and they were able to verify that it was a bug attached to a very particular bit of code that was sort of a fringe case, got missed in regression.
Starting point is 00:34:19 And they were able to turn that around in less than three hours. So it's really changed a lot in the scope of what we're able to link as far as cause and effect, and as well as the expectations and gratitude, I think, for IT. Hey, no, Andy, this runs into a theme we've been seeing as well. We saw it with Nestor and Citrix where there's this idea of trusting your tools, trusting the data that is coming out of it. Now that tools are getting more complex with some AI or some other components,
Starting point is 00:34:57 oftentimes when people are saying, let's just trust this tool to do this, there's pushback, and it takes some time for people to do that. I think it was also covered a little bit in The Unicorn Project from Gene Kim, where the team was going through and setting up these bits of automation or streamlining, and there's always going to be that growing pain of: we can do that, but how do we know we're not setting ourselves up for some complete blindsided failure because we're just trusting in the automation
Starting point is 00:35:34 and in the tools too much? But I think that's just a natural human reaction, right? And your story, Abigail, I think highlights that there's always going to be that initial hurdle until the case gets proven out a few times, so that people can grasp onto it and trust it. Yeah, and I think that reluctance to fully adopt a tool, you also see that in larger spheres. For example, we're still struggling to become a DevOps shop and to fully implement CI/CD. And it's not that people don't see what it has to offer. In fact, our IT leadership has given approval for this for about four years in a row. But because of the effort and the risk that goes into moving into that new phase, it's just really hard to get everybody on board.
Starting point is 00:36:26 And so I think what you see with the adoption of something small, like a tool, you also see in these larger areas. And sometimes adopting a tool can help move into new areas. Like for us with Dynatrace, I mean, it was really difficult for us to get the contract that we wanted in order to measure some of the things that we're now measuring. But as soon as we were able to provide those numbers, everybody was on board and they were willing to throw more money at us. And now that we have those numbers and we can provide stats on things like our lower environment availability, and they see the effect it can have on our deployment cycle, now is when they're starting to say, well, you know, maybe we really should think about this CI/CD stuff. Can we deploy faster? Can we deploy smaller? And so I think there's definitely a spectrum of resistance and excitement about moving into new eras. It's always safer to do what you know, but not
Starting point is 00:37:25 nearly as powerful. Or as fun. Yeah, or as fun. And actually, I did want to touch on the lower environments just a little bit because that has been a really big change for us as well. The first year, I mostly was focusing on production because we were having so many issues there. But in the last couple of months, I've started moving my focus into some of the lower environments in cooperation with some of my partners. And one thing we've started instituting is something we call white glove service. So for our high priority products, like our registration process, for example, we're in the middle of a big
Starting point is 00:38:01 development effort for that. And so we have white glove service for this, which means that we have special synthetics set up that track elements of registration in all of our lower environments. So it's looking at login, at create account, and then at certain integration points with vendors or other components within our system so that our developers and our QA can see exactly when something is failing and where the source of that is. And it's been really big for our team because we are responsible for the availability of our environments, which is kind of a funny word to use like in a second. But we used to get so many complaints about the environments are down, the environments are down, we can't test.
Starting point is 00:38:49 And we set up these synthetics. They're showing at least 98% availability all of the time and no one's complaining anymore. So it's really changed how work gets done, I think. And that affects the way that we're able to communicate that to the business. Again, we can show them that they are a focus for us and we can give them numbers that prove that we are focusing on delivering what they want. But it's also been incredible for our developers now because instead of being blocked for an hour and a half because they don't know why,
Starting point is 00:39:22 they can go in and see exactly which processes are having trouble, and they can themselves investigate, since they know the most about implementation. Usually this is a much faster resolution process. And they've actually requested that we add quite a few more synthetics because it's been such an advantage in their development process. And another aspect that those have been really great with is that with our login vendor, it used to take hours and hours to get resolution because we wouldn't know that login was down in our test environment
Starting point is 00:39:57 until a QA member found it and filed a bug, and then we would send a ticket to our vendor. And now, as soon as that synthetic fails, it sends an email straight to our vendor, and so we're often resolving those problems before a QA member can even discover them. So our mean time to resolve has shortened quite a bit there. That's interesting. Do you have a number for how many dependencies you have on third parties? I mean, it doesn't have to be a clear number, but do you have a lot of dependencies on third parties where a lot of problems happen, and now you can more proactively engage them and actually include them in the process?
Starting point is 00:40:39 Or is this like a one-off? I would assume it's more than just a login. Yeah, it is more than just the login, and really it depends on the product as far as how many dependencies we have. We do try to keep a lot of things in-house, but there are certain things that are more secure or better to have someone else do for us. So typically we do have at least one other vendor with all of our products. And some of them are harder to include in this kind of monitoring. But also they're less involved in our lower environments. And so it's not as much of an issue.
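To make the synthetic-check idea from this part of the conversation concrete, here is a minimal Python sketch. It assumes each synthetic run is recorded as a pass or fail for a named step (such as login or create account) with an owner to contact, computes an availability percentage like the 98% figure mentioned above, and lists who should be notified when a step's latest run failed. The names `CheckResult`, `availability`, and `owners_to_notify` are hypothetical, and nothing here represents the Dynatrace API or the actual setup discussed in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One run of a synthetic check (hypothetical record layout)."""
    step: str     # e.g. "login" or "create_account"
    owner: str    # who should hear about a failure of this step
    passed: bool


def availability(results):
    """Return the percentage of runs that passed, or None if nothing ran."""
    if not results:
        return None
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def owners_to_notify(results):
    """Return the owners of every step whose most recent run failed.

    Assumes `results` is in chronological order, so a later entry
    for the same step overwrites an earlier one.
    """
    latest = {}
    for r in results:
        latest[r.step] = r
    return sorted({r.owner for r in latest.values() if not r.passed})


if __name__ == "__main__":
    runs = [
        CheckResult("login", "login-vendor", True),
        CheckResult("create_account", "dev-team", True),
        CheckResult("login", "login-vendor", False),
        CheckResult("create_account", "dev-team", True),
    ]
    print(availability(runs))
    print(owners_to_notify(runs))
```

Keeping the "who to notify" decision separate from the delivery mechanism (email, ticket, chat) is a deliberate choice in this sketch, since the routing rule is the part that maps directly to what is described in the conversation.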
Starting point is 00:41:19 Well, it's cool that you're using synthetic tests to basically keep an eye on all the systems and then alert in case something is wrong. Now, if your developers ask for more synthetic tests, maybe that's also your chance to tell them: well, developers, maybe it's also time for you to do your own testing and automate it in CI/CD, with your move towards, hey, CI/CD is really the way forward. And if the developers already see the benefit of having these tests executed on a continuous basis, then hopefully they also see that it's now on them to bring it to the next level. Because obviously CI/CD only really works if you are doing test-driven development and then get
Starting point is 00:42:06 these tests executed and then get the monitoring in. But it's really cool, and I know we talked about this a couple of weeks ago when you explained it to me, that you're just using something as simple as synthetic tests on a regular basis, acting as an early warning system.
Starting point is 00:42:28 Once a change breaks a system that a lot of people depend on, this alone has obviously improved the situation a lot, because you are not realizing the problem hours or days later, when it's hard to figure out what the real root cause is. You see it immediately, and therefore the impact of that system being down and not available is shortened, and therefore it improves the overall availability of that system, and therefore more people can work with it. I mean, I know it's not simple simple, but I think it's individual steps that people can take, right? And this is what I like about these podcasts. So we learn about what you did in order to bring stability to systems that have not been stable and were therefore impacting productivity. Something as simple as adding synthetic tests with an early warning system, with automatic notifications to
Starting point is 00:43:25 your internal folks but also your third parties has resulted in increased stability. And that's what it is, right? It's about reliability, stability, and this is phenomenal. So thanks for sharing that story. It's pretty cool. Yeah, happy to share. I think some people look at SRE and they think that in a lot of ways it's synonymous with prod support, but I actually look at it in a much, much broader view. I think that reliability really does start with our developers. It starts at the very beginning, you know, with the architecture even. And so I've been trying to get that concept to take root here. And luckily, a lot of people are on board, too. And you talk about how the developers can maybe own some of this transition to CI/CD. That's the truth, but here they want to.
Starting point is 00:44:16 In general, all of the people who are on the front edge of this, we really want to make this move. But there's always that tension between IT priorities and business priorities. And we still have to deliver. So there's still a challenge there, but we're getting closer and closer. Very cool. It's interesting. I always find, in my head at least, SRE teams are the glue that bring the development teams, the testing teams, the operation teams,
Starting point is 00:44:47 the product teams, the business teams together with a common goal of performance and reliability and help weave best practices through all those layers in order to get that done. I know that sounds kind of grand, but that's how I was envisioning it, because you mentioned it being more like product side. But I think that's at least the way I've seen SRE. Yeah, and that's how it's played out here to a large extent. I am considered to be a part of our systems team, which handles our deployment and releases and then keeps our environments up. But to be honest, I spend a lot of my time interacting with developers, with the head of QA, with the head of prod support, to try and get everyone aligned.
Starting point is 00:45:32 Like I mentioned earlier, we're still trying to move into that DevOps mindset. We still are essentially all in our own units. And I think the SRE role here is pretty unique in that way in that it is a bridge from one to the next. And so I really enjoy helping the conversations to become a little bit more united so that if QA is working on some endeavor to improve their processes, that I can see how that matches something that development is also working on and I can try and bring those teams together so that we don't have redundant efforts and that also we can help each other out.
Starting point is 00:46:11 And so that's been a pretty exciting part of the role for me for sure is that you kind of get to be in all the different elements within IT. So it may be a little bit grand, but yeah, the glue is less grand. I think I also like that word. How big is your team? Is it just you or do you have people
Starting point is 00:46:36 working with you? I am the only person technically in the SRE role. I do work with a performance engineer who scripts all of our performance tests and every once in a while will help with some of my workload. But no, right now it's just me. So that's why the focus shifts from one to the next: because I can't really divide and conquer too much. How do you automate, or what do you automate,
Starting point is 00:47:04 or what have you automated from your work? Because you mentioned, I mean, on the one side, you have to do all of these coordination tasks where you talk with people and bring the right people together. Also, in the beginning, you have to advocate a lot for the benefit that your team is bringing. But then you mentioned, you know, tests have to be set up. The alerts have to be set up. You're managing the monitoring. So have you over the last months started automating some of these tasks and kind of pushing it to the people?
Starting point is 00:47:41 Or let's say not pushing, but offering this to the individual stakeholders like business and development as a self-service? We are starting to push this as a self-service. I've just finished setting up all of our tags and management zones in Dynatrace so that everybody, literally everybody who wants it, can get into Dynatrace and look at how their products or their components are working. I've been working a lot with production support to refine their alerting and notification, and that just helps their response. To be honest, I haven't had a lot
Starting point is 00:48:13 of time to get into automation. I've got a very long list of things that I would like to automate, but it's just a matter of getting that to be a priority. So one thing that we have done, like with the synthetics that I mentioned: I automated toggling those on and off at the boundary of the workday, because we only really care about them while people are here working. So that was just a little simple script against the API. There are a lot of other things we'd like to do. I mentioned that our peak events are sort of non-standard. And for a lot of those peak events, the traffic is such that auto-scaling rules are too slow to keep up with the increase in demand.
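The workday toggle mentioned above could look roughly like the following Python sketch. It only covers the scheduling decision: whether the synthetics should be on at a given moment, and whether a change is needed. The 08:00 to 18:00 window, the weekday-only rule, and the function names are all assumptions made for illustration; the actual call that switches a monitor on or off is left out on purpose, since the script and API details are not given in the conversation.

```python
from datetime import datetime, time

# Assumed working hours; the real boundaries are not stated in the episode.
WORKDAY_START = time(8, 0)
WORKDAY_END = time(18, 0)


def monitors_should_run(now: datetime) -> bool:
    """True during assumed working hours, Monday through Friday."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return WORKDAY_START <= now.time() < WORKDAY_END


def plan_toggle(currently_enabled: bool, now: datetime):
    """Return "enable", "disable", or None when no change is needed."""
    desired = monitors_should_run(now)
    if desired == currently_enabled:
        return None
    return "enable" if desired else "disable"


if __name__ == "__main__":
    # A scheduler (cron, for example) would call this at regular intervals
    # and pass the result to whatever actually flips the monitors.
    print(plan_toggle(currently_enabled=True, now=datetime.now()))
```

Returning `None` when the state already matches keeps the script safe to run on any schedule, because repeated runs do not send redundant enable or disable requests.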
Starting point is 00:48:53 So what I like to do is find a way that we can set dates while we're doing our performance testing, which is pretty early in the process. As part of that setup, we figure out when all the business events are and what they need in order to be fully healthy in production, and then automate the manual scaling for all of those products so that as the day arrives, the system will just know what kind of pattern to expect, what kind of increases to make in our resources, and then also decrease those resources after the event. But yeah, right now I'm working a lot on just feeding metrics back into the teams and helping them to understand
Starting point is 00:49:34 the way their changes are affecting our different APIs and products. So looking at how to create an error budget, we just got deployment events integrated with Dynatrace from our deployment engine, which is really great. I'm excited about that. That'll give the developers the ability drift over time, for example, that the appropriate people would be alerted or if we see a sudden increase to do some auto rollbacks,
Starting point is 00:50:15 that sort of thing. But we just got to get to it. Very cool. Well, it's good that there's still work to be done, right? Because otherwise you'd need to open up a new shop or do something completely different if your work is done there. So true. Don't tell my boss. No, very cool. So, yeah, there's probably so much more we could dive into, but I think this is a great overview of kind of,
Starting point is 00:50:45 well, first of all, your background, which I wasn't aware of, like your complete history. Even though we talked a couple of times, I was never aware that you were just basically entering the technical field, and of what you've accomplished. Now, I have one last question that I want to ask you. Is there something that you would have done differently of the things you've done in the last year, especially considering there might be people
Starting point is 00:51:15 listening in and say, hey, you know, site reliability engineering, it's something our company is looking forward to. We don't know yet what it is, but is there something maybe that you want to give us an advice? Or as I said, maybe something that you learned
Starting point is 00:51:30 that you would do differently so that others can kind of avoid a mistake, whether it is, you know, not a clear definition of the role or I don't know, giving you the right powers in the company to actually have an impact. Anything that you can give us an advice,
Starting point is 00:51:44 it would be great. That's a tough question. Well, you're not getting off easy here. Yeah, there's definitely been a lot of trial and error in our process. I've tried to introduce a few things that really just didn't work. And so we had to pivot really quickly.
Starting point is 00:52:04 I think, to be honest, a lot of that is just unavoidable. If you're a new function, I think you really just have to see what's going to work. Every company is different, so some things that will work great at one company won't work somewhere else. I think the biggest improvement that could have been made to this whole process is to have had a clearer idea of what reliability meant to the organization first. Because especially for the first six or eight months, there was a lot of almost political back and forth about what was or was not in my realm. Like what was my responsibility or not? Did I really belong on this team
Starting point is 00:52:46 or should I be on another team? And I, of course, being the service reliability function. And so I think if your company can identify why they really want to create that service reliability function and what your initial goals are for it in fairly specific terms, I think that'd be really helpful.
Starting point is 00:53:08 And a lot of the reading I've been doing about service reliability, it does seem like there's a different manifestation almost everywhere you look. And so if you can try to define that earlier, it's definitely helpful. It's just, it's been a little bit of a painful growing process to find the right fit with our existing teams because they've all sort of had to change their own domains as well in order to accommodate this new function. But I think flexibility is also really important, especially as a new function, because you just don't know. Right. And if your company is adding a reliability function, it's likely that a lot of other things are changing as well or that the focus in your organization is shifting.
Starting point is 00:53:56 And so I think the ability to get fast feedback, to fail fast and just, you know, to drop an effort that's not working and try something different, I think can be a big strength. I would like to have automated more stuff earlier. There's still a lot of manual work I'm doing because it hasn't been automated yet. But again, you know, you just have to figure out what your priorities are. Absolutely. Awesome. Andy, we're at that time, I know, and I know you have to run. So should we skip the Summarator today? I just want to have a couple of final words. I'll keep it short, but Abigail, thank you so much for showing us that you don't have to have worked in IT for 10, 20, 30 years to actually tackle a big problem, which is helping companies transform their IT towards DevOps, to more autonomy,
Starting point is 00:54:59 to breaking barriers. I think it's great to see from you as an example that it can be an outsider to the industry coming in with a fresh perspective and also with the drive to learn and change something, and it works, right? I think you obviously have to have drive; you have to become an advocate for what you want to achieve. Also, thanks for letting us know about the things you would have done differently, what you want to make sure that everybody else out there understands. I think what you said in the end
Starting point is 00:55:29 about having a clear definition of what reliability really means and what the responsibilities are for the people that drive that change, I think this is going to be great advice for everyone that wants to start an SRE role or a reliability team in an organization. I'm looking forward so much to the first week of February, having you on stage in Vegas, where you host me, or I host you, where you'll be my guest on stage in the session DevOps in
Starting point is 00:56:06 Action. And actually it's going to be you as well as Nestor from Citrix, who we mentioned earlier, and we'll talk about what it takes to change an organization, actually DevOps in action, and I'm sure we will cover a couple of the things you've implemented over the years so that people can learn from it and hopefully, you know, get the spark and become a change agent in their organization as well. So I'm really happy that you're doing this with us and helping us change the world. That also sounds pretty grand. Well, thanks for having me. It's been quite a pleasure and I know that Vegas will be even more so.
Starting point is 00:56:47 And one last thing I wanted to ask, I really liked that perspective you added in the beginning there, Abigail, about coming at this in a creative direction. With your background in visual arts, I also came from a music background or, and also I was trying to do film, but when you, if you look at any of those more artistically sided things, a lot of people think, oh, it's just creative and this and that, but you, you, you can't really execute on those until you understand the technical means to do that, right? If you're doing prints or if you're, you know, you need to learn that process. If you're making music, you need to learn how to play the instrument and how to play with other people. And then the creativity begins.
Starting point is 00:57:29 I think the same thing applies to the world of IT, whether you're doing development, site reliability, performance, any of these things, there's the technical skill that you need to build and develop to execute. But if you're just executing that technical skill, you're not, you know, you're just going to, let's say write functions, right.
Starting point is 00:57:51 But you're not going to create anything grander. You're not necessarily going to have a breakthrough. But if you approach it with a creative mind, you can do a lot more than if you just say, I have to insert code here to do this function, right? So I really appreciate that idea of putting creativity into this work, because I think it can be done. And that also just makes it a heck of
Starting point is 00:58:16 a lot more fun. So again, thanks for that perspective, and thanks for joining us today. To you and to anybody listening too: if you're going to Perform, make sure you swing by the Pure Performance podcast booth in the main display hall. Come say hi to me, and the PerfBytes team will be doing all our podcasting fun there. Thanks again, Abigail, for being on. Thank you. Thank you.
Starting point is 00:58:37 Bye-bye.
