Screaming in the Cloud - Leaving Chemistry and Becoming a Data Nerd with Yulan Lin

Starting point is 00:00:00 Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is brought to you by DigitalOcean, the cloud provider that makes it easy for startups to deploy and scale modern web applications with, and this is important to me, no billing surprises. With simple, predictable pricing that's flat across 12 global data center regions and a UX developers around the world love,

Starting point is 00:00:48 you can control your cloud infrastructure costs and have more time for your team to focus on growing your business. See what businesses are building on DigitalOcean and get started for free at do.co slash screaming. That's do.co slash screaming. And my thanks to DigitalOcean for their continuing support of this ridiculous podcast. This episode has been sponsored by ChaosSearch. If you have a log analytics problem, consider ChaosSearch. They do sensible things like separating out the compute from the storage

Starting point is 00:01:26 in your log analysis environment. You store the data in S3 in your account. You know where it lives, you know what it costs, and then they compress it heavily while indexing it, and then they query that data using a separately scalable fleet of containers. Therefore the amount of data you're storing no longer is bounded to how much compute you throw at it as well. It's broken that relationship, leading to over 80% cost savings in most environments and being a sensible scaling strategy while still being able to access it through the APIs you've come to know and tolerate.

Starting point is 00:02:01 To learn more, visit chaossearch.io. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Yulan Lin, a developer advocate at a small company called Google. Yulan, welcome to the show. Thanks, Corey. Thanks for having me. Of course. So you self-describe yourself as a data nerd with experience in everything from bioinformatics to NLP. I have to look up what some of those words even mean. So backing up a sec, what do you do and how did you get there? Yeah, so I'm a developer advocate for a product called Data Studio at Google,

Starting point is 00:02:38 which is a business intelligence and dashboarding product that we have. And how did I get here? Well, I studied chemistry. And I actually thought I was going to go down the research route. And I had a bioinformatics research project, which is basically like computational genomics kind of stuff. And I was looking at RNA sequencing and macular degeneration. But the interesting part of that was I had a data set that crashed Excel, and I was like, I don't know what to do. And so what ended up happening was in the process of learning to analyze that data, one, I learned all sorts of statistical techniques that were completely new to me. But I also learned how Python scripting worked, R scripting worked,

Starting point is 00:03:21 learned a little bit of SQL along the way. And I realized that I was much better at picking up those data analysis skills that were transferable than I was at keeping cells alive and kind of went from there. Yeah. And you found your way to Google of all places, which if you're working with data, it seems like a decent place to go. I'm told they have a bit of that there, but your path sounds like it diverged from mine almost immediately. When I wound up early on in my career with data sets that had problems in Excel, I threw the computer aside in a huff. And instead of doing data analytics, I just figured I would talk qualitatively instead and tell interesting stories and indulge my ongoing love affair with the sound of my own voice. It didn't occur to me that there might be better ways to solve these problems. The path not taken, as it were. I actually think there are

Starting point is 00:04:08 really incredible stories to be told with and about data sets, which is, I think, what compelled me about data analytics in the first place. So in research, when I was kind of looking either at like chemistry education or at bioinformatics stuff, what I always loved was that I could ask a question, get some data about it, and tell a compelling story that I think, I would argue, mattered to the world. And I think the same is true for a lot of data sets, because after Google I ended up at a non-profit doing event management and ops work. And so I was doing a lot of like the statistics around a large, about 15,000 person event. And so in the process, I learned a lot about what

Starting point is 00:04:51 different stakeholders wanted to get out of the stories inside our data sets about who was attending and how registration was going. Out of all of our talks, which irritated people the most? I mean, things like that. I mean, yes, but also really interesting things like you find users who take different paths through the registration system. And so you end up with really interesting, like technical issues, like the tags associated with someone's registration don't make any logical sense, because they found like a bug that allowed them to take multiple routes through it, all these like data management and data quality issues as well. And I like loved going in and figuring out what didn't look right, why it didn't look right.

Starting point is 00:05:33 Was there something we could do about it that would stop that problem? That sounds like it's first, incredibly difficult. And secondly, it sounds like it's the sort of thing that no one knows exists. No one pays attention to that sort of thing at all. It's a conference. It magically happens. It's sprung fully formed. And I'm sure they started setting this up yesterday evening or something.

Starting point is 00:05:57 People don't realize all of the heavy lifting and tremendous amount of work that goes into putting something like that on. I've talked about that previously with other folks on this show, but it never occurred to me to figure out the other end of that of, all right, with all the data that's thrown off and stuff like that, how do you analyze that? How do you turn that into something that is usable by humans? Yes. And I think the same question of given the data that you have, how do you turn it into something usable by humans is a question that applies across a lot of organizations in really small ways. So everyone's talking about big data, but I think a lot of quick wins are to be found in the spreadsheets that are on people's local devices or just one analyst or person is maintaining month after month and converting into some kind of presentation or doc or report, because those are often human curated, often a little bit messy. But if they were like regularized, if they were shared with the right people, you could link them to different

Starting point is 00:06:59 data sets. Like, all of a sudden you have like a wealth of information at your disposal, and context as well. And the ability to, like, present things to different stakeholders and tell the right story. And that's really cool to me. One of the things that I always found somewhat challenging was the idea of when do you have a big data problem? Like the rule of thumb was if it fits on a thumb drive, it's not big data. It's little data, medium data at absolute most if you squint hard enough. And then the other argument became, oh, if it fits in RAM, it's not big data. And then I started seeing instances in cloud and whatnot with many terabytes of RAM in there.

Starting point is 00:07:35 So is there a, I guess, a clear line differentiating what separates big data from medium data? Or is it more of a ish type of soft boundary? I think in general, it's an ish boundary. But I think the framework I use is less, why do you care about how big the data is? Do you care about it for reasons of data engineering, and you want to know what the best kind of technical ways to manage and process your data analysis pipelines are, or are you interested in what statistical techniques are valid on the data? Because the definitions of quote-unquote big data differ across those things. Because it occurred to me to think of it in terms of domain specific. I mean, on some level,

Starting point is 00:08:21 log data could be enormous if you log everything forever from just a simple web service, but it also winds up being awfully repetitive. Oh, wow, 98% of our data in the logs is the load balancer checking to see if the thing is still okay. Maybe there's a transformation that makes this a little bit more usable as you start filtering that through. And again, I am not a data person at all. It turns out that stateless stuff is way more aligned with how I tend to operate, because if I break that, I can push a button, build a new one, and no one notices or cares. When you lose the data, very often you don't really have the company anymore after that. Yeah. I think the other thing, too, with longitudinal data or data over time is that definitions

Starting point is 00:09:03 can change, too. And so within the same organization, even if it's been collecting a particular piece of data forever, like the original reason might have been to answer question X. And then at some point, like they realized question Y might also be kind of relevant to this data set. So I'm going to add a couple other fields to capture those things as well. And tracking that metadata and the evolution of the whys behind why a database exists or why a field exists in a table, I think really can inform the questions that are valid to ask about the data set. I think one of the challenges with data, at least one that I

Starting point is 00:09:38 experienced myself, is that I don't know what questions to ask that data can effectively answer. I mean, so from that perspective, it's always challenging to figure out what questions does data visualization solve for me? I think jumping to shiny visualizations before understanding the data set and the domain is actually like going too quickly. At my last job, I sometimes described it

Starting point is 00:10:03 as I was playing data therapist because I talked to different people about what data sets they had, and what questions they wanted to answer, and whether or not those data sets could effectively answer those questions. We also talked about, like, what are the best ways to answer those questions? Is it some kind of analysis? Is it some kind of visual dashboard? And so that's something I think that has to be done in partnership with a domain expert and also just time spent in the data, right? What's the distribution of things? What do null values look like? What are things I should

Starting point is 00:10:39 know about what different codes mean? All of these questions really should be thought through in partnership with a domain expert who then also knows, has a better idea, what are the things that they want to track or that would impact their day-to-day work? This episode is sponsored in part by DataStax. The NoSQL event of the year is DataStax Accelerate in San Diego this May from the 11th through the 13th. I've given a talk previously called The Myth of Multicloud, and it's time for me to revisit that with a sequel, which is funny given that it's a NoSQL conference, but there you have it. To learn more, visit datastax.com. That's D-A-T-A-S-T-A-X dot com. And I hope to see you in San Diego this May. Well, let's back up a second here just to clarify something that I may not be entirely clear on.

Starting point is 00:11:31 One of Google's core competencies is taking words and putting them after the word Google as a product. In this case, they've done that with Google Data Studio. What is Google Data Studio? Yeah, so Data Studio is an in-browser data visualization BI kind of dashboarding product that connects to all sorts of data sources. So the way we describe it is if it has an internet accessible API, you can probably get the data into Data Studio. So it allows people to integrate data from different data sources into the same place so that it's easy to have an at-a-glance look or analysis of kind of whatever metrics you

Starting point is 00:12:14 care about. And it's also really easy to make sure that it's shared with the right stakeholders. So it winds up visualizing data for human consumption, not machine consumption? Yes, it's for human consumption. And it's also structured in such a way that it's relatively easy to get started with it because the product itself, it's a click and drag kind of product. It's a GUI based thing, even though I work on the developer features, which is kind of this separate box. So I guess my question for you then becomes as a developer advocate

Starting point is 00:12:45 for something like this, what does developer advocacy around data visualization look like? Who are the people you're talking to and what challenges do they have? Yeah, I think to answer that question, it might be useful to talk a little bit more about my job. So my job is to actually support this API called Community Visualizations, which allows people to build their own custom visualizations or different kinds of solutions and storytelling around their data sets and how to build them with Data Studio. And so it's things like, is there a chart that somebody made in an academic paper that actually would be really great for your use case but is super specific and you have to have everything configured a certain way. And when does that chart work?

Starting point is 00:13:48 When does it not? Is it for a particular dashboard or infographic or is it something that's generalizable? I think these are all questions that I'm hoping that my work helps people to answer a little bit. It's always difficult, I guess, from my perspective to figure out how to structure any sort of visualization of reasonable data. It's easy once you have a dashboard or something that shows the relationship you're looking at. Oh yeah, that's incredibly

Starting point is 00:14:16 valuable and helpful for whatever reason. I don't know if it's just who I am or this is something a lot of people struggle with, but I personally have trouble figuring out even how to begin structuring what I might represent data as in a visual context. Is that common? Am I just crap at this thing and I should accept that? What is the, I guess, what are you seeing in the world as far as people's level of comfort with this sort of thing? Yeah, that's a great question. I think that it's actually a really hard problem and it's deceptively hard. And the reason is because I think the right visualization or the right structure of a dashboard depends so heavily on what you want that dashboard to do. Because there's a difference between some of the key metrics you want to have on a TV display in your

Starting point is 00:15:06 lobby or in an open office than something that you want an analyst to be able to interact with and kind of find trends or interesting things in. And it's different than like another dashboard that summarizes particular metrics for an executive. And so everyone cares about different things. So I think my first question is always, what metric do you care about? Who is looking at it? And is it meant to be kind of a display kind of thing? So a dashboard in a lounge or an infographic kind of thing?

Starting point is 00:15:43 Or is it meant to be something you can interact with and a means of exploration and analysis? Because that tends to help me start deciding, right, how complicated things should be. Should they be scorecards? Should they be pie charts or bar charts? Do I want to bring in something really complex because it actually represents something like the number of people transferring in and out of certain regions well, or the genome data well? Should I be bringing in domain-specific things like that? That's an area where it seems to be extraordinarily challenging to, I think, articulate to folks who aren't steeped in areas of this. I mean, it becomes the popular

Starting point is 00:16:23 question that I think a lot of us who work in anything that even remotely touches technology has to answer when we deal with folks who are not in that space, usually at holidays with family, explaining what you do for a living to people who have no touch points for it. Do you have a go-to that you wind up using for that? Yeah, I talk about the New York Times data visualization team, partly because it was their work that inspired me to care about data visualization and see how powerful it was in the first place. And because that tends to be a good point of reference. So even if people aren't familiar with that team, if I pull up a map that they've created or pull up some charts that they've created to show, to go with some of their stories, people immediately understand like, oh, seeing this in

Starting point is 00:17:11 a chart instead of a table actually makes it click in a different way or I ask different questions. And that starts the conversation around data visualization. Let's go down a path that I love to explore that most people often don't. Generally because it's a terrible way to teach people things, but I find it entertaining. What are some of the most egregious misuses of data visualization that you've seen, or I guess bad data visualization. And this is an audio podcast. So showing people crappy charts is not going to be as compelling when you're just describing a crappy chart. But have you seen anything that is horrifying? It's hard to say things are actually horrifying, but I think there are some cases where there's just lines everywhere. It's incredibly complicated, and there's no explanation or walkthrough of what the different icons mean and

Starting point is 00:18:08 like why lines are moving in certain directions, and whether or not things were stylized, or whether like every angle and motion of the line or color variance means something and what that maps to. Because I think at some point of complexity, my brain personally just kind of shuts down. The other thing I found, and I'm guilty of this too, is just making decisions that look kind of pretty but have no meaning. So arbitrary color changes because it matches a particular palette, even though the colors have absolutely no meaning, that ends up being very confusing. So those are some of my own pet peeves.

Starting point is 00:18:56 That, oh, that and I also really dislike low contrast color palettes because that's just for accessibility reasons, but also just readability reasons. It's like, cool, you use this very uniform palette that looks great with your branding. And I cannot tell the difference between your different categories. One of the things I've always found is for whatever reason, and I see this periodically in various state of the cloud style reports where they'll have a whole bunch of different providers or services or offerings that they'll wind up trying to visualize. And this isn't even a data visualization issue as such, but it's always, we're going to represent each one of these different things,

Starting point is 00:19:38 five or 10 of them in different shades of blue. Maybe there's another color or million that you could use that would show a little bit more contrast. At some point, I look at that and wonder if I've suddenly gone colorblind. Nope. It's just graphic design is hard for everyone. Yes. Yeah. And I think there's also the sense of making something clear and easily readable might be at odds with some kind of sleek visual identity that certain infographics or reports want to attempt to follow. So it's this like, do you pick readability? Do you pick your brand palette? What if they're at odds? And that's always a question. I dare not tread down that path. I found that it is best not to walk down into the den of corporate communications and branding and, oh, no, no, no, no, you wound up

Starting point is 00:20:32 not quite centering that or the font isn't quite right. Throw it away, start over, and if you do it again, you're being censured. I may deal with big companies too much at this point in that context. So changing gears slightly, you are a developer advocate. What exactly does that look like in your particular scenario? Very often, I find that developer advocates spend the bulk of their time arguing with other developer advocates about what developer advocacy is. Yeah, that's a great question. I will say that I think a definition most of my colleagues and peers can agree on is that we want technical practitioners to be successful with our products.

Starting point is 00:21:11 And that ultimately, if that happens, then I feel like I have succeeded. In my particular case, I think my goal is to build an ecosystem around this API. I want people to know what's possible with it and I want to help people solve problems with it. And so to understand, you know, why they should care and also have a clear path to success once they figure out, oh, I want to build something. So it involves for me everything from talking to developers and understanding their use cases so that I can make sure their concerns are addressed as I write the documentation or make videos or give talks. And that tends to be the bulk of my work is just thinking through how do I make somebody successful who thinks or wants to build something using this API?

Starting point is 00:22:06 Do you find that the bulk of your developer advocacy work, it looks like blog posts, like one-on-one conversations with customers or developers in the community? Are you giving conference talks? Are you writing API examples and documentation style stuff? Or other things entirely. There's so many different expressions of the whole DevRel world that I learn something new every time I talk to someone who does this full time. Yeah. So the answer to your question is yes, I do most of those things, if not all of them. It is a lot less speaking than I thought it would be. So my time, at least right now, is split between a couple of things. One is content creation. So making sure the documentation is there,

Starting point is 00:22:46 making sure there are examples, some blog posts, some social things. I also spend some time talking to developers and companies who are developing against this API. And the other thing I do is I am writing API examples and developer tooling that makes the developer experience easier. And as I'm writing these examples, I'm also collecting my feedback and other people's feedback about the API and bringing it to our internal teams and saying, here are things I think would help the future developer experience.

Starting point is 00:23:19 These are ways I think we could make it easier. And then trying to address it either from my end or talking to our internal teams to see like, can we solve this problem for future developers? I remember back when we first met, you had given a talk at a conference and we wound up catching up at the event afterwards and got to talking. A few speakers started gathering together. I think you were even asking, how can you start doing more conference talks as part of a career path? And I think the default response from everyone who's done that was, no, don't do that. It's awful. It's drudgery and misery and horrible. And as I recall, you hadn't gone through to that side yet and thought that it was going to be fun and amazing and worth doing. Where do you stand on public speaking now, now that you have found the job

Starting point is 00:24:04 where that is part and parcel of what you do? There's a couple of things. One is I still absolutely love public speaking. And I do wish I did it a little bit more because there is something, especially in kind of small to medium sized audiences about being in front of people and sharing things, but also just reading and reacting to the energy of the room and helping people understand something new or hopefully learn about something they hadn't heard about before or thought about before. At the same time, I think there is a sense of

Starting point is 00:24:38 maybe not the talks themselves, but travel for conferences has become kind of less shiny to me, even though I actually do it less as a developer advocate than I did before. And part of it might just be because it's part of my job, it seems less shiny to do it for fun. And I think part of it too is, I think it's different to nerd out as Yulan being a data nerd because data is super cool versus when I'm representing a company or a product because they're just different considerations. And that's not an aspect of it I had ever thought about. how people's evolution as they walk down the path of doing whatever it is that they're involved in tends to modify itself and, I guess, express itself in different forms. It's strange.

Starting point is 00:25:31 When you think of someone who has a background in data engineering, the path that you talked about going down, the idea of, oh, and then pivoting to becoming a speaker and someone who helps other people understand these things, it's always interesting seeing the different routes people take to get there. You see people who look an awful lot alike on stage sometimes, but the paths they took to get there are incredibly varied. Yeah, and it's also, I think, that the same skills and things that people enjoy

Starting point is 00:26:00 can be expressed in so many different ways throughout a career or throughout a job. And so when I was not a developer advocate, I still loved helping to organize and speak at meetups because I just loved seeing people learn more and seeing knowledge sharing within the community. And also I was really excited to just show people things I thought were cool. I'm still very excited to show people things I think are cool. There's a part of it too, where when it's my actual job, I have to think not only about like, how do I tell people about something I think is cool? It's this question of how am I good at telling people about that thing? And kind of content creation and technical communication is its own set of skills in

Starting point is 00:26:45 addition to kind of having the technical knowledge of whatever it is I'm trying to communicate. Do you have any advice for people who are looking to get started with data visualization to where they can go to learn more? How can people dip a toe in this water if it's something that they're unfamiliar with and want to learn more? Oh, so many places. I would just start looking for the places that are building charts that you respect and trying to figure out what you like about them. So there's several news outlets that are really good about that. And there's also some independent kind of data visualization experts, designers that I really respect, people like Shirley Wu and Nadieh Bremer. I hope I'm pronouncing her name right. And so that's one piece to learn about the design aspect. And then I'd say get started with either Python or JavaScript and just get into a data set and figure out

Starting point is 00:27:48 how do I put something on a page? How do I make a chart? How do I explore this data? Don't worry about the bells and whistles. That takes time and that will come. But just try to figure out what does that conversion from data to kind of pixels on a page look like? And, oh, also draw data visualizations on graph paper because it's a fantastic way to get an intuition for what you're trying to map out and why. That's a great starting point that I think people will appreciate. If people want to learn more about what you're doing and various things that you have to say about a variety of topics, where can they find you? They can find me usually on Twitter.

Starting point is 00:28:33 I'm at Y3L2N. And I talk about all sorts of things from women in tech rants to data visualization to posting videos I make for work. So that's a great place to look. And we'll definitely throw a link to that in the show notes. Yulan, thank you so much for taking the time to speak with me today. I appreciate it.

Starting point is 00:28:54 Yeah, thanks for having me. This has been great. Yulan Lin, developer advocate at Google, specifically on Google Data Studio. I'm Corey Quinn. This is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star rating in Apple Podcasts. If you've hated this podcast,

Starting point is 00:29:12 please leave a five-star rating in Apple Podcasts and tell me what my problem is. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at ScreamingInTheCloud.com or wherever fine snark is sold. This has been a HumblePod production. Stay humble.
