PurePerformance - A year in - Establishing an SRE Role at CFA with Abigail Wilson
Episode Date: January 6, 2020
Do you have a clear definition of what Reliability means for your organization? Abigail Wilson, Reliability Architect at CFA Institute, sees this as a key requirement before you start transforming your organization to embrace site reliability, DevOps or Cloud Native.
In the podcast we hear how Abigail went on her journey where she has proven that you don’t need a background in IT in order to become an advocate and change agent for reliability engineering. In her role she has bridged the gap between business and IT, she has helped bring stable environments to developers and testers, and with those and many other steps has increased overall productivity, quality and stability of their business-critical applications.
https://www.linkedin.com/in/abigailswilson/
https://theabigailwilson.com/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always we have with us, I don't know why I'm saying we because I'm not royal, but I have with me my co-host Andy Grabner. Andy, how are you doing today?
Not too bad actually.
I've never said we have with us before. That was an odd little...
I don't know.
A little bit there.
Yeah. Well, I'm not sure what's happening on your end.
Yeah, I thought so. You said this earlier before we got started. Why is that?
Yeah, well, I went to a neighbor of mine. He likes movies too, and he's got one of those movie passes.
So last night for a 9:15 showing, we went to go see Knives Out.
It's pretty good.
Pretty good movie.
Good cast.
You know, good whodunit type of movie.
You know, murder mystery.
But, you know, I was a film major in college, right?
And movies back in like the late 90s, early 2000s were typically too short.
They didn't develop plot.
Now everyone's going too long.
So this was like a whodunit murder mystery,
and it was over two hours.
I'm like, come on.
So anyway, got home late,
and then my daughter had a seizure overnight.
So I'm just kind of slammed on sleep today.
But we will get through,
and I'm sure the fun antics will wake me up today.
How have you been, Andy, now that I did my sob story?
Actually pretty good.
So at the time of the recording, right, which is early December, we just made it through our Christmas party here in Austria.
So we had our Dynatrace Christmas party last weekend, which was phenomenal.
Wanted to go home early, ended up being at home at 5 a.m.
So that's always a good sign.
And it also helped that I was on the West Coast the days prior,
so my body was on a different time zone. So that made it easy. But Gabby also stayed
out that long, so she was there.
Dancing involved?
There was some dancing involved, of course. We did
some salsa and some just dancing to the DJ. Anyway, the salsa...
Wait, do they do the salsa specifically because they know you're there, though? It's like, oh, Andy's coming, let's play
some salsa.
Well, we actually have a Dynatrace band, and they're called the
Pure Sound Agents, and I think it's the fifth year that they performed. And Eva Vintage, she is kind of
the lead singer, and she told me just before it got started,
hey, you got to be there because the second song is going to be a salsa.
Well, it was a cha-cha, but anyway, we danced to salsa.
Just like I was, I know we're going to go to the episode in a second,
but the intro music, I was like, all right,
I got to try to find like some kind of a beat for Andy.
And I was looking at my Casio keyboard and I was trying the salsa beat,
but it really wasn't good.
So that was actually a sped up tango with delay on it.
But it's, you know, at least somewhat closer than a rock beat.
So there we go.
Anyway, enough tomfoolery.
Exactly.
So, I mean, one Wilson is tired today.
That is you.
Well, good news is for the audiences out there that are listening to this,
we have a second Wilson that hopefully is not that tired.
And that second Wilson, her first name is Abigail.
And as always, when I introduce a guest, well, typically I ask, are you there?
I know you are there, Abigail.
So that's why I directly go into Abigail.
I don't know why you always ask, are you there?
I know, I don't know why.
But Abigail, I'm pretty sure she's there.
And always I just ask, Abigail, please introduce yourself to the audience.
I am here, yes.
Hi, Andy and Brian, it's good to be here.
My name is Abigail Wilson, as Andy said, and I'm the Reliability Architect at CFA Institute.
They just created the service reliability function about a year ago,
so we're still in the early stages of figuring out what that means for our organization.
It's been quite fun.
Very cool.
So reliability, so SREs, right?
We had this topic coming up over the last couple of episodes.
I think Brian, I know you're tired,
but I need to ask you at least one more question.
I think it came up with Gene Kim
when we talked about the Unicorn Project.
Yeah, he was talking about that.
It's a hot topic again.
Yeah, it's a hot topic.
We also had Adrian Hornsby.
He was here.
We talked about chaos engineering and site reliability engineering also came up.
So, Abigail, I know you said CFA Institute.
And I want to just get a little background on your current employer,
because I think this is actually important because I think it seems like a big part of
our industry is moving into that direction and not only, let's say, the classical software
companies.
So that's why I want to get a little bit more background on your current employer.
And then I want to dive into why did CFA decide to move into that direction?
And what was the driving factor?
And then kind of, you said you started a year ago, at least a year ago was when this position
was started.
What were you facing back then a year ago?
And kind of, how has your role evolved?
What have you done?
And I'm pretty sure there's even more questions coming up.
But maybe starting, who is CFA and why did they decide to move into that field?
So CFA stands for Chartered Financial Analyst,
and we are a nonprofit financial ethics credentialing company.
So for our customers, we're primarily an educational body.
We create a series of exams that you can take if you're working in the financial industry that show
not only that you're competent in those topics, but also that you take those themes and you work
within them with an ethical perspective. So that's our primary revenue. But again,
we're nonprofit. So we also work in a lot
of advocacy, looking towards more ethical standards in the financial industry. And we connect a lot
with both our candidates who are going through the credentialing process, and our members who
have already completed that process and are part of CFA Institute. And we have international
societies across the globe
that are technically independent bodies, but we support them in a lot of ways, and
we also host conferences for our members and other interested parties. So we're a
little bit different from a lot of folks in the industry in a lot of ways. We are
a nonprofit. We try to operate like a profit-based business in the way that we organize our work. But really, most of what we're delivering is technology, and really we are a technology company.
large restructuring, not only of the IT department, but also of a lot of the other groups in the
organization to kind of reflect that and help break down silos and just get us in better
shape for faster delivery of technological solutions.
And then also supporting our customers, getting feedback from them on those tech solutions
and being able to iterate quickly.
It's interesting, right? I mean, you said you're in an educational
business, and you obviously saw that you're actually becoming a tech company. And I think this
is true for so many different industries. I remember we heard that way back
when we had Adam Auerbach from Capital One on.
They were a bank, but now we're a tech company
delivering banking software.
I think he said we are a tech company
with a banking license or something like that.
I mean, that was interesting.
Yeah, I think the content of our business here
is educational for the most part. But the way that we interact with our customers and our partners is
all through our technological ecosystem, right? So if that's not strong, it affects everything else.
And so you said a year ago was a major pivotal moment in the company's history:
you decided or realized you're a tech company,
so you needed to become more efficient.
There was also a big restructuring.
So were you already in the company prior to that,
or were you hired as part of the reorganization,
or how did that work?
I started with the organization in 2016 as a software developer.
And then when they reorganized in the fall of 2018,
I was accepted in the position of the service reliability engineer,
which has now evolved a little bit further.
Yeah.
So fill us in a little bit. So that means you started as a software engineer and this position came up.
Did you, I mean, how has it shaped?
And also what's maybe, what problems did you try to solve back then?
And how has that changed over the years or over the year?
Sure, that's a good question.
So I think my personal history is a little bit relevant here. I did not work in technology until really important. And I actually learned a word this week,
which really captures this.
I was so excited to hear this word.
It was proposed by Nora Bateson, I believe,
and it's symmathesy.
And it's this idea that it's kind of taking systems
to the next evolution,
where a system is sort of a closed entity
with static components that relate in certain ways. But that is kind of an outdated
concept because, I mean, at least here, the software infrastructure that we're working with
is not static. It changes all the time. And whether you're talking about software or an
ecological system, everything is always changing. And so when you have the system where each piece
is learning and improving, as well as affecting the relationships with others, you have a
symmathesy. And I think my ability to see that kind of nuanced changing system is what made me
a good candidate for this role, because we're creating a new function. So I'm creating all new systems of processes, new standards, new software that have to fulfill
these needs for the business. And so when I was a developer,
I liked to find interesting problems and solve them. And one of the problems that we had when I was developing was that we
didn't have a good way to visualize all of our digital components. We were in the middle of our
digital core transformation where we were moving all of our infrastructure from a monolithic
on-prem infrastructure to a microservice structure in the cloud.
And when we did that, suddenly we had no idea where anything was or what was talking to anyone
else. And so I developed a solution to fix that, that was automated, that integrated with some of
our existing systems, like what we use for deployments to just generate a real-time picture of our whole system.
And so I think the ability to identify that need and then fix it using pieces of the existing system
is kind of what made me stand out at first.
And I've tried to take that perspective into this role with reliability.
So one of the first things that we focused on as we brought this reliability
function into existence was simply we were very reactive. All we wanted was stability.
If you know anyone who's taken one of our exams before, they'll tell you that getting your results
on the day we deliver results is purely a factor of whether or not you get there before the site crashed, which is not a good place to be.
And so when I came into the role,
my first goal was to bring stability to our sites at all times,
especially on results delivery day,
because that is the most important thing for our customers.
So once we sort of had a strategy for scaling all of our components and how to
properly set up our infrastructure for the strange traffic patterns that we see, after that, I started
focusing more on standards. And so then I stepped into the performance testing realm. And this is
also where a lot of relationship work was done with the business so that they would understand not only that we were looking at and concerned with performance and reliability,
but also to make sure they understood the depth of that question.
So I started basically codifying our performance testing process in relationship with the business
so that they understood what an SLA was, what a service level agreement was, that they knew
how we set those, that they could then agree to those and understand that they are the
drivers for those agreements.
They're agreeing that we're not dictating anything for them.
And essentially just finding a way to help the business maintain control.
It doesn't have the sound I want, but to empower them still to have control over their products while we were still able to build up the systems in the way that we needed to.
And so now we're kind of moving more into a strategy phase
that's more proactive.
And this feels like a really good place for me,
because for the past year, it's basically
been identifying which aspect of reliability was most on fire
and trying to put that out.
But now we're actually at a place
where we can look forward and see where we want to be
and see what needs to be built to get us there.
This is fascinating.
Actually, I'm just taking some notes here because, well,
maybe I want to let some folks know here that besides doing this podcast,
you will also be with me on stage at Perform in Vegas,
our conference coming up early February.
And I know the two of us, we talked over the last couple of weeks
and months several times.
But every time when we talk, there's so many new things
that I learn about you.
Now, first of all, about yourself, then the last year at the company,
what you've been dealing with.
And I really like the way you just explained the last year and how it started from bringing
stability to a system, putting out the fires, making it stable.
And then now, as I understand it, moving over towards becoming more proactive, but
also giving tools in the hands of business so that they can work
with the stable systems that you've built, enforcing SLAs and defining SLAs and making
sure that the value that they deliver is basically according to their standards,
but always giving them control. So I think that's pretty cool. I have a lot of questions now.
I want to ask you a bunch of questions.
But before I focus on the questions that just came up in my head, I first want to go back.
You mentioned that in 2016 you started in tech. What is your background? Because I think, and Brian,
to come back to you, this is not the first person we interview that had a complete career shift
or came from other directions.
And I came in from another direction too.
I was actually curious about this as well.
And while you're answering that,
since it's a common topic
and many of us suffer from it
who make this transition,
did you, while you're answering that question,
did you suffer from any sort of imposter syndrome
and how did you overcome that?
I'll add that to Andy's question.
The answer to that is definitely yes.
I'll get into that for sure. I think it's a very important question in the industry these days.
Yeah. So my background would not point towards technology at all. In fact, before 2015,
I had never had a job that was either inside or sitting down. So even the idea of
being in front of a computer for much of my life has been giving me the jitters. I just
don't like that idea. I've gotten used to it though. My university degree is in fine
art. I am also a printmaker. That's still something that I enjoy doing. You can find my art out there if
you'd like to. We'll put it up there. Why not? Yeah. But I graduated with a fine arts degree
in 2009, which was a really difficult time to come into the humanities, as I'm sure many other
people listening are aware. So I tried to make that work for a little while and it just wasn't happening. So I moved back into some other interests. I was in outdoor education for a while
doing team building. And also I ran a boy's summer camp for a little while, their hiking program,
which was really fun. And I kind of just adventured for a few years. I'd take young boys out hiking
and then on my days off,
I'd go hiking on bigger mountains
and it was great.
But eventually I wanted a little more stability
instead of sleeping in a cabin all the time.
And I worked for a period as a cabinet maker.
I was in a custom department there.
Oh, wow.
And was building all kinds of really fun stuff. I mean, to be honest,
building cabinets is not that different from building software though. You have a plan you're
working from and you have to figure out how to get there in the most efficient manner without any
waste. You have to just make all the little pieces and then bring it together. With software there,
you get a little bit more testing. With cabinets, it's kind of one and done. And then in between all of that, I also had a stint.
I decided not to go back and get my MFA in fine arts. And instead I started my own bakery. And
so for just over a year, I ran a bakery in a little town and it was very successful, but also very
stressful. I decided while I liked designing a business, I didn't like being the owner of that
business. So I sold it and moved on to other things. And eventually I ended up being in an
office for this cabinet maker and working with installing an ERP system and
changing some of their inventory systems and whatnot. And I realized that I actually didn't
mind sitting in front of a computer. And there were a lot of parts of it that I really enjoyed.
I've always been very interested in languages. And my older brother is a musician primarily,
but he had just completed a boot camp
to get into software development and loved it. And we had a lot of talks about how the creative mind,
whether you're doing music or art or software programming, is actually very similar,
that a lot of the processes are the same. And seeing the way he liked it just kind of made me wonder,
like, you know, I wonder if I could do that.
I'd build a lot of my own websites first as an artist
and then for my bakery.
So I'd already taught myself HTML and CSS the hard way
using Dreamweaver, it feels so long ago.
And so I went and did a three-month bootcamp
intensive in .NET technologies.
And from there, I was hired on at CFA Institute as a developer.
And when I started here, I was very aware of how little I actually knew, especially
once I was out of the classroom and into the workspace.
It's very different.
And while I had spent a lot of my time during the class
expanding on my knowledge and trying to reach beyond what they were teaching us,
there's still nothing that can match real world experience. And I decided that the only way that
I was going to have confidence in my own abilities was just to completely own how much of a beginner
I was. It's kind of lucky that I came into CFA Institute
the way that I did because everybody knew
that I was coming straight out of a bootcamp.
And I feel like it gave me permission
to just know nothing, essentially.
So I definitely would do my due diligence
about Googling my questions
and trying to do my own investigation.
But I also wasn't shy about asking questions
of my fellow developers or of our systems team
or of anyone else,
just to get a bigger picture about what I was working in.
And so, yeah, I definitely had imposter syndrome
and my reaction was just to lean into it as hard as I could.
And I think in general, it not only helped me because I was able to
actually learn all the things I felt like I didn't know. But it also made others around me a lot more
comfortable asking questions. And it helped me build a lot of really strong relationships within
our IT department that are still very strong today and led to a lot of good partnerships
on fun projects.
just terrible people. But a lot of times in any organization, you can find the people who you can
ask questions, and they're going to be happy to help you out. And we see this all the time, Andy. We talk
about this all the time, of everyone sharing knowledge just in general: hey, I developed this
thing, I'm going to put it up on GitHub, it's free to use, modify it, make it better. But that community
that exists whereby you can go to somebody and say, hey, I don't know, explain this to me, right? And
finding those people in your organization who aren't going to look down on you for that,
but instead be like, oh, I'd love to explain that to you and help you learn.
And maybe as I'm teaching you, I learned something in the process.
That's a really, really important part.
And for anybody out there who's in more of that beginner phase or learning phase
or just not feeling completely comfortable,
I just can't stress enough how important it is to find those people in your organization, because there are probably a lot of them. And just reach out and talk. Anyhow, that's all.
Just wanted to get that in because I think it's really key.
Yeah, I would agree with that. I
mean, we are all creatives, I think. I mean, I'm sure some people are in the industry because of its stability
and exciting growth. But I mean, at least here, most of the people that I work with,
they enjoy making something that they're proud of. And part of that is also sharing that excitement
and that pride with other people. And when we're able to collaborate on projects together, it just
keeps those good feelings going. And I think in any collaborative environment, especially when
innovation is involved, it's really important for people to feel safe when they're going out on a
limb. And it's a hard thing to do for all of us. But if one person is able to kind of take a step and show vulnerability and ask that question,
then it makes it easier for others to sort of do the same thing. It helps lead to that culture of
trust and excitement. All right, Andy, you got a million questions. I got a million questions.
Well, first of all, I would... Thank you for that. Excuse me, I have a few more questions,
if you don't mind. Yeah, it's amazing what you, you know, knowing your history.
And I think maybe, you know, as you said, it could be a big benefit if you basically come into a completely new field and with your curiosity that you have, but also knowing that you don't know everything. approach people and basically asking them for their advice and for their knowledge and
asking them for help, I think that that is something that maybe some of us no longer
have that have been in the industry for too long.
And we think, well, we know everything better anyway.
And so why would I ask somebody?
So I think that's obviously, in this case, a big advantage.
Now, I have a couple of questions.
So coming back to what you explained when you, the role was created, you mentioned in
the beginning, you had to bring stability to the system.
You had to put out a lot of fires.
Now, I also know we talked leading up to Perform and the stuff we are doing there, and you mentioned that
a big thing for you was to actually ensure that developers have stable environments,
stable test environments. You mentioned earlier that you have done a lot of work around
codifying performance, or, you know, changing the way performance is perceived.
So can you tell me a little bit about the
actual things that you did in the first couple of months after you started on
stabilizing systems? Because I really believe, I mean, having a stable system, and with
system I guess I mean different environments, having it stable and getting the trust of the individual people that are using it is, in the end, obviously improving performance and the outcome of the people that work with these systems.
So I would like to know from you, what were the biggest problems with the stability on these systems?
Which systems are we actually talking about?
Is it production?
Is it pre-prod?
And what were some of the measures that you have taken? Because I'm pretty sure a lot of people can learn from that on, hey, ah,
this is, yeah, we have the same problem. Ah, and that's the way she tackled it.
So that would be interesting for me as a first question.
So the first thing I focused on was production because we were having very high visibility
problems in production, namely the site crashing on days when most
candidates or members were on the site. And that is just not acceptable. So initially,
I was in charge of things like negotiating contracts with some of our vendors so that
we could change our scaling strategy. So for example, where we host our website, we used to be in a very
restrictive scaling capacity where it took a lot of manual steps and a notice of about three weeks
in order to expand our hardware there. And our typical traffic to our website is pretty low. So in general, we don't need a lot of hardware.
But on one day, when we deliver our results, it goes from maybe 10,000 visits in a day
to well over 100,000, just in a three-hour window. So our need to scale for a short period was pretty dramatic. So we had to
negotiate that contract to change the way we could scale and get some more hardware wired up. And I
worked in partnership with several other aspects of the business for that. And we have an interesting
mix of on-prem legacy systems
with our newer microservice structure in the cloud.
And that is where most of our challenges
are still coming from.
So where a lot of our data is stored
is on a very, very old server that we're currently still
trying to get off of.
It's been a four-year process doing that. It's sort of why we went
through this whole digital core transformation. And so managing that hardware has been the
greatest challenge. And that is not owned, luckily, by me. Our prod support team owns that
because there's a lot of knowledge on that team.
So it's a bit of an interesting responsibility share between me as the reliability function
and other teams.
But so that strange mix of architectures has been a big challenge that we're still dealing
with.
But as we've done that, another big aspect for stability was simply to get visibility into all of those components.
So when I first came on, we were already using Dynatrace, but it had not been used to its full potential.
I would argue we're still not using it to its full potential, but we're a lot closer.
And so we sort of had to teach ourselves how to use Dynatrace because the
former owners had already left. And so I then focused on getting monitoring on all of our
primary components, making sure everyone who needed to see those metrics could see them,
making sure all of our alerting worked so that once the site went down, hopefully we would know
even before the crash occurred and start taking action to ameliorate that.
But a lot of it, too, was just about our relationship with the business.
They did not trust us to take care of problems the information we needed, we had to convince the business
that we could see everything we needed,
that we were on top of things
and that we knew what we were doing.
So the early days were as much of a publicity campaign,
so to speak, as they were actually getting into the weeds.
And I think that that was actually really fun for me.
A lot of people in technology would hate that kind of role, but I really enjoyed it because it made the relationship between everything that we were setting up between monitoring and metrics and just reliability in general, it made the relationship between that and our end users or like what the business
cares about basically, it made it very apparent.
And I think that's one reason why performance then became such a buzzword in our organization
is because that was how I spoke to them about reliability.
So, I've got a question. I mean, this is actually a quote that I also
kind of remembered from our initial talks, when you said the business didn't trust you and
that you had to convince the business that you're trustworthy and
that you're giving them, you know, the data that they need, and that you actually work on something
that matters to them, which is reliability, which is performance, which obviously in the end helps
your end users. I have a question. So you mentioned that you use Dynatrace, you took it over.
What did the business use before that? I mean, before you gave them those metrics, did they have
their own insights? Or how did they, you know, hear about or measure the current
problems or the quality of the system?
I honestly don't know the best way to answer that question.
there weren't a whole lot of established systems for measuring what our products were doing once they were out in the world.
We were using things like Adobe Analytics, and we were keeping track of things like how many orders were completed.
And then, of course, our prod support teams had things like SCOM alerts set up.
But there wasn't really a good way to see how each component was performing, if it was experiencing errors, what kind of
throughput we were having.
And so just being able to measure all of that was a huge win for us.
It really changed a lot.
So basically what you did, and I think this is a theme that I keep hearing, is you were
really, I mean, the business may have their own metrics, but it's coming from one system,
and they lack, as you said, the context of what's actually happening
on the technical side, in the infrastructure and in the applications themselves. And you basically
approached them and said, hey, look at this, I can give you data, but we can actually then correlate it
with what's really happening in the system. So in case there is a problem, we know what the root cause is,
and then we can obviously work together in fixing it.
And that's obviously, that was your mission,
to make these systems more stable, to focus on performance.
But then with these improvements also show the business that,
hey, see, with all of our work, we actually improve your business metrics
and we actually know cause
and effect, right? You know that, let's say, a drop in orders is because of bad performance or
because of an outage that you had. Am I getting this right?
Yes, that's correct. And
when I first came on, they wanted to know things like, do you actually know why the site crashed?
And we had to show them that we
knew why the site crashed. But now we've evolved to the point where we had just a couple of calls
come into our global contact center about being unable to access this particular aspect of our
membership application. And it came down to me. I was able to look in Dynatrace to see what was
going on. I found the component that was at odds, looked at the errors.
I could even pull the individual IDs, and they were able to verify that it was a bug attached to a very particular bit of code that was sort of a fringe case, got missed in regression.
And they were able to turn that around in less than three hours. So it's really changed a lot in the scope of what we're able to link as far as cause and effect,
as well as the expectations and gratitude, I think, for IT.
Hey, no, Andy, this runs into a theme we've been seeing as well.
We saw it with Nestor and Citrix
where there's this idea of trusting your tools,
trusting the data that is coming out of it.
Now that tools are getting more complex
with some AI or some other components,
oftentimes when people are saying,
let's just trust this tool to do this,
there's pushback and it takes some time
for people to do that.
I think it was also covered a little bit in The Unicorn Project by Gene Kim, where the team was going through and setting up these bits of automation or streamlining,
and there's always going to be that growing pain of: we can do that, but how do we know we're not setting ourselves up
for some complete blindsided failure because we're trusting in the automation
and in the tools too much? But I think that's just a natural human reaction, right? And your
story, Abigail, I think highlights that there's always going to be that initial hurdle until the case gets proven out a few times, so that people can grasp onto it and trust it.
Yeah, and I think that reluctance to fully adopt a tool, you also see that in larger spheres.
For example, we're still struggling to become a DevOps shop and to fully implement CI/CD.
And it's not that people don't see what it has to offer.
In fact, our IT leadership has given approval for this for about four years in a row.
But because of the effort and the risk that goes into moving into that new phase, it's
just really hard to get everybody on board.
And so I think what you see with the adoption of something small, like a tool, you also see
in these larger areas. And sometimes adopting a tool can help move into new areas. Like for us
with Dynatrace, I mean, it was really difficult for us to get the contract that we wanted in order to measure some of the things that we're now measuring.
But as soon as we were able to provide those numbers, everybody was on board and they were willing to throw more money at us.
And now that we have those numbers and can provide stats on things like our lower environment availability, and they see the effect it can have on our deployment cycle, now is when they're starting to say, well, you know,
maybe we really should think about this CI/CD stuff. Can we deploy faster? Can we deploy smaller?
And so I think there's definitely a spectrum of resistance and excitement about moving into new
eras. It's always safer to do what you know, but not
nearly as powerful.
Or as fun.
Yeah, or as fun. And actually, I did want to touch on the lower environments just a
little bit because that has been a really big change for us as well. The first year,
I mostly was focusing on production because we were having so many issues there. But in
the last couple of months, I've started moving my focus into some of the lower environments in cooperation with some of
my partners. And one thing we've started instituting is something we call white glove service. So for
our high priority products, like our registration process, for example, we're in the middle of a big
development effort for that. And so we have white glove service for this, which means that we have special synthetics
set up that track elements of registration in all of our lower environments.
So it's looking at login, at create account, and then at certain integration points with
vendors or other components within our system
so that our developers and our QA can see exactly when something is failing and where the source of
that is. And it's been really big for our team because we are responsible for the availability
of our environments, which is kind of a funny word to use like in a second. But we used to get so many complaints about the environments are down,
the environments are down, we can't test.
And we set up these synthetics.
They're showing at least 98% availability all of the time
and no one's complaining anymore.
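An availability figure like the 98% mentioned here can be derived directly from synthetic check results. The following is only a minimal sketch of that arithmetic, not the setup used at CFA Institute: the `CheckResult` fields, the environment names, and the idea of counting every run equally are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One run of a synthetic check (all field names are hypothetical)."""
    environment: str  # e.g. "qa" or "staging"
    step: str         # e.g. "login" or "create_account"
    passed: bool


def availability_percent(results):
    """Share of passing runs, as a percentage of all runs."""
    results = list(results)
    if not results:
        raise ValueError("no check results to evaluate")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def availability_by_environment(results):
    """Group runs by environment and compute availability for each group."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.environment].append(r)
    return {env: availability_percent(runs) for env, runs in grouped.items()}
```

Reporting the number per environment, rather than as one global figure, keeps a single unstable environment from hiding behind the healthy ones.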
So it's really changed how work gets done, I think.
And that affects the way that we're able to communicate that to the business.
Again, we can show them that they are a focus for us and we can give them numbers that prove
that we are focusing on delivering what they want. But it's also been incredible for our developers
now because instead of being blocked for an hour and a half because they don't know why,
they can go in and see exactly which processes are having trouble,
and they can themselves investigate,
since they know the most about implementation.
Usually this is a much faster resolution process.
And they've actually requested that we add quite a few more synthetics
because it's been such an advantage in their development process.
And another aspect that those have been really great with is that with our login vendor,
it used to take hours and hours to get resolution because we wouldn't know that login was down in our test environment
until a QA member found it and filed a bug, and then we would send a ticket to our vendor.
And now, as soon as that synthetic fails, it sends an email straight to our vendor, and so we're often resolving those problems before a QA member can
even discover it. So our mean time to resolve has shortened quite a bit there.
That's interesting. Do you have a number of how many dependencies you have
on third parties? I mean, it doesn't have to be a clear number,
but do you have a lot of dependencies on third parties where a lot of problems happen,
and now you can more proactively engage them and actually include them in the process?
Or is this a one-off?
I would assume it's more than just the login.
Yeah, it is more than just the login, and really it depends on the product as far as how many dependencies we have. We do try to keep
a lot of things in-house but there are certain things that are more secure or better to have
someone else do for us. So typically we do have at least one other vendor with all of our products.
And some of them are harder to include in this kind of monitoring.
But also they're less involved in our lower environments.
And so it's not as much of an issue.
Well, it's cool that you are, I mean, I know you're using,
you said using synthetic tests to, you know, to basically keep an eye on all the systems and then alert in case something is wrong.
Now, if your developers ask for more synthetic tests, maybe that's also your chance to tell them, well, developers, maybe it's also time for you to do your own testing and automate it in CI/CD, I mean, with your move towards,
hey, CI/CD is really the way to move forward.
And if the developers already see the benefit of having these tests
executed on a continuous basis, then hopefully they also see that,
well, now it also takes them to bring it to the next level.
Because obviously CI/CD only really works if you are doing test-driven development and then getting
these tests executed and then getting the monitoring in.
But it's really cool, and I know we talked about this a couple of weeks ago when you
explained it to me, that you're just using something as simple as synthetic tests on a regular basis,
acting as an early warning system.
Once a change is breaking a system that a lot of people are depending on, this alone has obviously improved the
situation a lot, because you are not realizing the problem hours or days later, when
it's hard to figure out what the real root cause is. You see it immediately, and therefore the impact of that system being down and not available is shortened,
and therefore it improves the overall availability of that system, and therefore more people can
work with it. I mean, it's simple things. I know it's not really
simple, but I think it's individual steps that people can take, right?
And this is what I like about these podcasts.
So we learn about what you did in order to bring stability to systems that have not been stable and were therefore impacting productivity. Something as simple as adding synthetic tests as an early warning system, with automatic notifications to
your internal folks but also your third parties, has resulted in increased stability. And that's
what it is about, right? Reliability, stability. And this is phenomenal, so thanks for
sharing that story. It's pretty cool.
Yeah, happy to share. I think some people look at SRE and they think that in a lot of ways it's
synonymous with prod support, but I actually look at it in a much, much broader view. I think that
reliability really does start with our developers. It starts at the very beginning, you know, with
the architecture even. And so I've been trying to get that concept to take root here. And luckily, a lot of people are on board, too.
And you talk about how the developers can maybe own some of this transition to CICD.
That's the truth, but here they want to.
In general, all of the people who are on the front edge of this, we really want to make this move.
But there's always that tension between IT priorities and business priorities.
And we still have to deliver.
So there's still a challenge there, but we're getting closer and closer.
Very cool.
It's interesting.
I always find, in my head at least, SRE teams are the glue
that bring the development teams, the testing teams, the operation teams,
the product teams, the business teams together with a common goal of performance and reliability
and help weave best practices through all those layers in order to get that done.
I know that sounds kind of grand, but that's how I was envisioning it, because you mentioned it being seen as more like prod support.
But I think that's at least the way I've seen SRE.
Yeah, and that's how it's played out here to a large extent.
I am considered to be a part of our systems team, which handles our deployment and releases and then keeps our environments up. But to be honest, I spend a lot of my time interacting with developers, with the head of QA,
with the head of prod support,
to try and get everyone aligned.
Like I mentioned earlier,
we're still trying to move into that DevOps mindset.
We still are essentially all in our own units.
And I think the SRE role here is pretty unique in that way
in that it is a bridge from one to the next.
And so I really enjoy helping the conversations to become a little bit more united so that if QA is working on some endeavor to improve their processes, that I can see how that matches something that development is also working on and I can try and bring
those teams together so that we don't have redundant efforts and that also we can help
each other out.
And so that's been a pretty exciting part of the role for me for sure is that you kind
of get to be in all the different elements within IT.
So it may be a little bit grand, but yeah,
the glue is less grand.
I think I also like that word.
How big is your team?
Is it just you
or do you have people
working with you?
I am the only person
technically in the SRE role.
I do work with a performance engineer
who scripts all of our
performance tests and every once in a while will help with some of my workload. But no, right
now it's just me. So that's why the focus shifts from one thing to the next, because
I can't really divide and conquer too much.
How do you automate, or what do you automate,
or what have you automated from your work?
Because you mentioned, I mean, on the one side, you have to do all of these
coordinational tasks where you talk with people, bring the right people together.
Also, in the beginning, you have to advocate a lot for the benefit that your team is bringing.
But then you mentioned, you know, tests have to be set up.
Alerts have to be set up.
You're managing the monitoring.
So have you over the last months started automating some of these tasks and kind of pushing it to the people?
Or let's say not pushing, but offering this to the individual stakeholders like business and development as a self-service?
We are starting to push this as a self-service.
I've just finished setting up all of our tags and management zones in Dynatrace
so that everybody, literally everybody who wants it,
can get into Dynatrace and look at how their products
or their components are working.
I've been working a lot with product support,
or production support, excuse me, to refine their alerting and notification, which just helps their response. To be honest, I haven't had a lot
of time to get into automation. I've got a very long list of things that I would like to automate,
but it's just a matter of getting that to be a priority. So one thing that we have done, like with the synthetics that I mentioned,
I automated toggling those on and off at the boundary of the workday
because we only really care about them while people are here working.
So that was just a little simple script against the API.
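A toggle like the one described could look roughly like the sketch below. It is not the actual script: the working hours, the Monday-to-Friday assumption, and the `set_enabled` callable are all hypothetical. In particular, `set_enabled` stands in for whatever API call the monitoring tool provides, which is deliberately left out here.

```python
from datetime import datetime, time


def monitors_should_run(now, start=time(8, 0), end=time(18, 0),
                        workdays=(0, 1, 2, 3, 4)):
    """True while `now` falls inside the working window on a working day.

    The window and the Monday-to-Friday default are assumptions.
    """
    return now.weekday() in workdays and start <= now.time() < end


def sync_monitors(now, monitor_ids, set_enabled):
    """Ask the monitoring tool to match the desired state for every monitor.

    `set_enabled(monitor_id, enabled)` is a hypothetical callable supplied by
    the caller; it would wrap the real API request.
    """
    enabled = monitors_should_run(now)
    for monitor_id in monitor_ids:
        set_enabled(monitor_id, enabled)
    return enabled
```

Run from a scheduler every few minutes, `sync_monitors(datetime.now(), ids, set_enabled)` keeps the monitors in the desired state even if a single run is missed, because it sets the state outright instead of flipping it.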
There are a lot of other things we'd like to do.
I mentioned that our peak events are sort of non-standard.
And a lot of those peak events, the traffic is such that auto-scaling rules are too slow to keep up with the increase in demand.
So what I'd like to do is find a way that we can set dates while we're doing our performance testing, which is pretty early in the process.
As part of that setup, we figure out when all the business events are and what they need in order to be fully healthy in production,
and then automate the manual scaling for all of those products so that as the day arrives,
the system will just know what kind of pattern to expect, what kind of increases to make in
our resources, and then also decrease those resources after the event.
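The scheduled scaling idea can be sketched as a small lookup: given a calendar of business events and the capacity each one needed during performance testing, decide how many instances should be running at a given moment. Everything below is an assumption for illustration, including the `PeakEvent` fields, the one-hour lead time, and the notion of capacity as an instance count. Applying the result to real infrastructure is out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PeakEvent:
    """A known business event (hypothetical structure)."""
    name: str
    start: datetime
    end: datetime
    instances: int  # capacity found to be healthy during performance testing


def desired_instances(now, events, baseline, lead=timedelta(hours=1)):
    """Capacity to run at `now`: the largest need among active events, else baseline.

    An event counts as active from `lead` before its start until its end,
    so resources are already in place when the traffic arrives.
    """
    active = [e.instances for e in events if e.start - lead <= now <= e.end]
    return max(active + [baseline])
```

Because the function falls back to `baseline` as soon as no event is active, scaling down after the event needs no separate step.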
But yeah, right now I'm working a lot
on just feeding metrics back into the teams
and helping them to understand
the way their changes are affecting
our different APIs and products.
So looking at how to create an error budget,
we just got deployment events integrated with Dynatrace from our deployment engine, which is really great.
I'm excited about that.
That'll give the developers the ability drift over time, for example,
that the appropriate people would be alerted
or if we see a sudden increase to do some auto rollbacks,
that sort of thing.
But we just got to get to it.
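For the error budget mentioned above, the underlying arithmetic is short. This is a generic sketch of the common request-based formulation, not the approach being built at CFA Institute, and the SLO target and request counts in the usage note are made-up numbers.

```python
def error_budget(slo_percent, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, fraction_of_budget_used).

    With an SLO of 99%, 1% of requests in the window may fail; that 1% is
    the budget. A negative `remaining_failures` means the budget is overspent.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = total_requests * (100.0 - slo_percent) / 100.0
    remaining = allowed - failed_requests
    used = failed_requests / allowed
    return allowed, remaining, used
```

For example, `error_budget(99.0, 10_000, 25)` describes a window with 10,000 requests and 25 failures against a 99% target, which leaves most of the budget unspent. A team could alert the appropriate people, or pause risky deployments, once the used fraction crosses a threshold it has agreed on.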
Very cool.
Well, it's good that there's still work to be done, right?
Because otherwise you need to open up a new shop
or do something completely different if your work is done there.
So true. Don't tell my boss.
No. Very cool. So, yeah, there's probably so much more
we could dive into, but I think this is a great overview of kind of,
well, first of all, your background, which I wasn't aware of,
like your complete history, even though I think we talked
a couple of times, but I was never aware that you were just
basically entering the technical field, and what you've accomplished.
Now, I have one last question that I want to ask you.
Is there something that you would have done differently
of the things you've done in the last year,
especially considering there might be people
listening in and say, hey, you know,
site reliability engineering,
it's something our company is looking forward to.
We don't know yet what it is,
but is there something maybe
that you want to give us as advice?
Or as I said,
maybe something that you learned
that you would do differently
so that others can kind of avoid a mistake,
whether it is, you know,
not a clear definition of the role
or I don't know,
giving you the right powers in the company
to actually have an impact.
Anything that you can give us as advice
would be great.
That's a tough question.
Well, you're not getting off easy here.
Yeah, there's definitely been a lot of trial and error
in our process.
I've tried to introduce a few things
that really just didn't work.
And so we had to pivot really quickly.
I think, to be honest, a lot of that is just unavoidable.
If you're a new function, I think you really just have to see what's going to work.
Every company is different, so some things that will work great at one company won't
work somewhere else.
I think the biggest improvement that could have been made to this whole process is to have had a clearer idea of what reliability meant to the organization first.
Because especially for the first six or eight months, there was a lot of almost political back and forth about what was or was not in my realm.
Like what was my responsibility or not?
Did I really belong on this team
or should I be on another team?
And I, of course, being the service reliability function.
And so I think if your company can identify
why they really want to create
that service reliability function
and what your initial goals are for it
in fairly specific terms,
I think that'd be really helpful.
And a lot of the reading I've been doing about service reliability, it does seem like there's
a different manifestation almost everywhere you look.
And so if you can try to define that earlier, it's definitely helpful.
It's just, it's been a little bit of a painful growing
process to find the right fit with our existing teams because they've all sort of had to change
their own domains as well in order to accommodate this new function.
But I think flexibility is also really important, especially as a new function, because you just don't know.
Right. And if your company is adding a reliability function, it's likely that a lot of other things are changing as well or that the focus in your organization is shifting.
And so I think the ability to get fast feedback, to fail fast, and just to drop an effort that's not working and try something different
can be a big strength. I would like to have automated more stuff earlier. There's still a
lot of manual work I'm doing because it hasn't been automated yet. But again, you just
have to figure out what your priorities are.
Absolutely. Awesome. Andy, we're at that time, I
know, and I know you have to run. So should we skip this
Sumtron today?
I want to just have a couple of final words, because I think this is just what I
want to do. I'll keep it short. But Abigail, thank you so much for showing us that you don't have to have worked in IT for 10, 20, 30 years to actually tackle a big problem, which is
helping companies transform their IT into a more, well, towards DevOps, to more autonomy,
to breaking barriers.
I think it's great to see from you as an example that it can be an outsider
to the industry coming in with a fresh perspective and also with the drive to learn and change
something, and it works, right? I think you obviously have to have a drive, and you have
to become an advocate for what you want to achieve. Also, thanks for letting us know about the things you would have done differently,
what you want to make sure that everybody else out there understands.
I think what you said in the end
about having a clear definition
of what reliability really means,
what the responsibilities are
for the people that drive that change.
And I think this is going to be great advice
for everyone that wants to go
and kind of start an SRE role or a reliability team in an organization.
I'm looking forward so much to the first week of February, having you on stage in Vegas, where you'll be my guest in the session DevOps in
Action. It's going to be you as well as Nestor from Citrix, who we mentioned
earlier, and we'll talk about what it takes to change an organization. That's
DevOps in Action, and I'm sure we will cover a couple of these things you've implemented
over the years so that people can learn from it and hopefully see the spark, get the spark, and become a change agent in their organization as well.
So I'm really happy that you're doing this with us and helping us change the world.
That also sounds pretty grand.
Well, thanks for having me.
It's been quite a pleasure and I know that Vegas will be even more so.
And one last thing I wanted to ask, I really liked that perspective you added in the beginning there, Abigail,
about coming at this in a creative direction.
With your background in visual arts (I also came from a music background, and I was also trying to do
film), if you look at any of those more artistic fields, a lot of people
think, oh, it's just creative and this and that, but you can't really execute on those
until you understand the technical means to do that, right? If you're doing prints or if you're,
you know, you need to learn that process. If you're making music, you need to learn how to
play the instrument and how to play with other people. And then the creativity begins.
I think the same thing applies to the world of IT, whether you're doing development,
site reliability, performance, any of these things, there's the technical skill that you
need to build and develop to execute. But if you're just executing that technical skill,
you're just going to, let's say, write functions, right?
But you're not going to create anything grander.
You're not going to have a breakthrough. Well, not necessarily a breakthrough,
but, you know, if you approach it from a creative mind,
you can do a lot more than if you just say, I have to
insert code here to do this function, right? So I really appreciate that idea of putting
creativity into this work, because I think it can be done, and that also just makes it a heck of
a lot more fun. So again, thanks for that perspective, and thanks for joining us today. To you and to
anybody listening too: if you're going to Perform, make sure you swing by the Pure Performance podcast booth
in the main display hall.
Come say hi to me,
and the PerfBytes team will be doing all our podcasting fun there.
Thanks again, Abigail, for being on.
Thank you.
Thank you.
Bye-bye.