Effectively Wild: A FanGraphs Baseball Podcast - Effectively Wild Episode 469: Preserving Sabermetrics for Posterity

Episode Date: June 12, 2014

Ben and Sam talk to Matt Dennewitz about his efforts to archive sabermetric research....

Transcript
Starting point is 00:00:00 Sooner or later, just like the word for stay. Sooner or later, we'll just throw the past away. Sooner or later, we'll just throw the past away. History teaches nothing. Good morning and welcome to episode 469 of Effectively Wild, the daily podcast from Baseball Prospectus, presented by the BaseballReference.com Play Index. I am Ben Lindbergh, joined as always by Sam Miller. Today we have a guest, a man who is making a heroic effort to preserve sabermetric history for posterity. He is Matt Dennewitz. He has created the site Saber Archive, saberarchive.com,
Starting point is 00:00:54 and he's here to tell us a little bit about why he did it and how it works. So hi, Matt, and tell us a little bit about what you do and how you got the idea to start this site. Hey, so I'm Matt Dennewitz. Right now I'm the vice president of technology at a music magazine called Pitchfork. Baseball has long been something I've loved and it was only in the recent couple of years, maybe the last four or so, that I kind of got turned on to sabermetrics. And the more I started reading about it, the more I realized it's really fractured all over the internet. There's obviously a long and rich history. Significant parts of it have gone missing. Because part of my
Starting point is 00:01:30 job and what I really like about my job is organizing content and building structures to not only maintain it but to preserve it and make it searchable and easily accessible. I thought I could take some of that to the baseball world. I kind of saw an opening and went for it. And how does the site work exactly? How do you compile all these things? In what sense is it user-generated? So the site itself was built to be entirely user-driven. Anyone from the community can add content.
Starting point is 00:02:06 It's easy to flag content that maybe needs to be cleaned up or should be removed. Maybe it's not meritorious of being archived. We really leave all of that up to the community to decide. Adding content is often as simple as inserting a link and then filling out just the title and author and then classifying it with a couple of topics. We also offer the ability to upload a file. We can read PDFs and make, you know, actually read the text in the PDF and make all of that searchable. We can do the same for slideshows and e-books and Word documents or Office-style application documents. And so people don't need to register or anything, or do they?
Starting point is 00:02:47 Do they just show up and they can submit stuff? Using the site does not require an account. Adding content does. It's just to make sure that we're not getting spam or trolls or bots, you know? Right. And so when did it start, and how much activity have you seen? How many articles are in the Saber Archive? The site's been live for about three months now,
Starting point is 00:03:08 and I'd say we're fast coming up on about 250 articles. I would say by the end of this month, we'll probably have closer to 300, maybe 400, depending on just some more of the ingestion and how much time some of our community members have to add. But I think we're on the cusp of seeing kind of a bigger boom in terms of backfilling the archive. The Effectively Wild bump. Yeah.
Starting point is 00:03:34 Here's hoping. Now, one thing that's really helping us out here is that Baseball Prospectus has generously offered their archives to us in a very convenient format. Don't say. No. Surprise. So what we're going to be working on here very shortly is importing everything and then narrowing it down to just like the most relevant and broad scoping coverage that Baseball Prospectus has covered over the years
Starting point is 00:04:05 to help fill out the archive. And what's the process for that? Because, you know, I mean, we have, I don't know how many articles are in the BP archives, but something like, I mean, over 20,000 drafts at least. And in recent years, we've been tagging them, but in the past, we didn't. So do you have some sort of automated way to crawl them and figure out what the articles are about? Yeah, under the hood, I've got a couple of tools that will help crawl and classify things. So kind of a two-step process.
Starting point is 00:04:40 The first thing would be just classifying it. And then the second step would be going over those classifications and making sure that it makes sense. Not everything with the word FIP in it is going to have relevant research about FIP. So just a quick gut check. Everything that's tagged should be very easily digestible. Everything that's not is just going to take a little more manual labor. Is there, like, a kind of an origin story, like a moment where you were trying to find something and you thought, this darn internet just can't
Starting point is 00:05:10 help me, or what? Yeah, I was looking for park factors, and I had read a few sources that had referenced, I think it's Patriot's park factor piece, which lives on Tripod. When I finally found the link after like three days of Googling and asking around in a couple of forums, I got to it and I was thinking, like, my God, this is on Tripod. And at that time, I think GeoCities was either in the process of shutting down again or being revived kind of ironically. So that's when the light bulb went on, not of like, I should build an archive, but like, my God, there's got to be a better way to do this.
Starting point is 00:05:50 Around that time, FanGraphs Library was also kind of coming into its own, and I know Baseball Prospectus had some great resources. Baseball Reference was also, that's always been a pretty well-organized resource. But yeah, that would probably be the light bulb moment. Yeah, so I assume I'm not alone in this, but I assume anybody who's ever written about baseball has tried to sort of create his own library on his computer via some sort of complex system of bookmarks or some document or something, and it goes nowhere because of a couple of issues. One is that it turns out to be a lot harder to index things than you're expecting. Things cross categories a lot of times, and
Starting point is 00:06:36 you're constantly trying to decide, well, is this similar enough to this one to lump them together in the same category? The other is that the format of useful sabermetric data, I guess, comes in, it's inconsistent. We mostly think of articles, of blog posts. As you mentioned, sometimes it's in a book. But a lot of times what it'll be is like, it'll be comment 48 on a Tom Tango post. And it's hard to figure out, well, how do I, where do I index that and what do I do about, what do I do about comment 47 now? And so I wonder, like, have you sort of been bumping up against these problems and discovered any, like, great, great, like, index hacks?
Starting point is 00:07:20 That's a great question. The closest I would say, or the easiest thing to point to, would be that documents that are like part one of six tend to get a little fractured. Something that I'd love to work on in the near future would be a way to link documents that are intrinsically linked either by being the same, maybe documents split up over a few different days, linked back together, or also like a research document that cites other documents that are in the archive. So that's definitely a shortcoming right now, and you're totally right about that. And you mentioned the Patriot piece. Has there been anything else that you've sort of unearthed from the bowels of the Internet? Anything particularly old or obscure that someone has submitted that you've noticed? No, I would say one thing that comes to mind, though, would be the absolute volume, just enormous volume of PITCHf/x research
Starting point is 00:08:15 that's out there relative to just the kind of general questions I still see around the web of how PITCHf/x works. It seems like tutorials from 101 to 501 and up have been written several times over and just nothing stuck, I guess. So looking at this site right now, the latest addition to the archive is an article by Alan Nathan that I put up at BP about 15 minutes ago or so. And then the second most recent addition is an article that I wrote earlier today. Does that mean that those have been submitted by someone already? Yeah, absolutely. Wow.
Starting point is 00:08:56 The internet is good on top of things. Wait, wait. They were submitted by somebody? This was not you in preparation of this? No, I didn't do it. Wow. So, wait, hang on though. For instance, Ben's article is about moving the mound and whether moving the mound back could fix the strikeout scourge, so-called. And that's one of the categories for this is mound configuration. It is indexed under mound configuration.
Starting point is 00:09:27 And there are other articles about mound configuration. Like, what made you, like, at what point do you, because I would imagine that, like, maybe the first time you had a mound configuration article, well, that's not a category yet. So you wouldn't think to necessarily invent a category. So what happens when you have a category that comes up every so often but not that often? So what happens when you – it must be hard when you're reading these things and you're sort of building categories as you go to figure out what is likely to show up. Because if, for instance, you hadn't previously established the mound configuration category, Ben's article probably wouldn't have gotten a mound configuration tag on it, right? Right.
Starting point is 00:10:09 You have to know it's a category in advance. Yeah. When the site launched, we had a taxonomy of about 250 terms all in a pretty nice hierarchy under four topics, fielding, offense, pitching, and general sabermetrics. That's grown. I mean, we probably have about 310 or so now. It's definitely kind of as new content is added, things we didn't think of pop up or maybe we need to split one topic into two.
Starting point is 00:10:36 I mean, mound configuration could easily be mound height, mound distance, things like that. We could easily add pitch distance, something like that. As it comes up, the community has been great about suggesting things. I'll add it. It's okay if there's one tag that has maybe one document under it. Ideally, it'll grow out. Maybe it won't.
Starting point is 00:10:56 And do you happen to know what the hottest categories are or the most heavily populated categories? Pitcher injuries and PITCHf/x. Uh-huh. Well, that makes sense. I actually, I literally looked at the pitcher injuries category today for unrelated to this. I had to look. I had to see it. It was useful to me in research. So it is actually useful. So the larger question, before this site came about, as you mentioned, everything was in different places. It was this very homebrew sort of arrangement and still sort of is where someone who's been part of the sabermetric community for a long time will remember some article that someone wrote a decade ago and will bring it up and then everyone will rediscover it. And it ends up being the case that there's a lot of work that's redone.
Starting point is 00:11:51 You know, I've written things and then found out that someone wrote about the exact same topic 10 years ago. And I've written things that someone else has subsequently written about again. So is that a bug or a feature, do you think? I think it's a feature. I think one reason that we try and expose the original publication date is to show that, A, research might be old, and, B, there really is a chronology of this,
Starting point is 00:12:16 even an evolution in every topic. Pitcher injuries totally weren't understood 30 years ago or understood differently 10 years ago. So the research from maybe 2000 might show something different than what the research from 2010 might show. And we're keen on exposing that history. It's interesting because you can make the case that when we're all duplicating each other's work, which happens a lot, I think. I think we've all had that sort of awful feeling of realizing we just published a story that had been done somewhere else like six days earlier, and we duplicated it. On the one hand, well, that's a waste of resources. On the other
Starting point is 00:12:53 though, we get different results. There's different ways of approaching the questions. You could almost imagine if there was going to be... I mean, the upside to this is obviously that we can build on each other's work, be more efficient, and be well-informed. If there was a downside, it might be that there would be less duplication, because we would easily go find out that the article that we were about to duplicate has already been done, and then we might not accidentally find something else. I don't know. Maybe it's not a problem.
Starting point is 00:13:21 Yeah. I'd agree. Yeah, I mean, it's sort of a peer review process in a way, even if it's unwitting, duplicating the research. And maybe someone who did it originally had an error in the code somewhere that was never caught, and everyone just accepted the results, and that's happened. And so there's some value to redoing it.
Starting point is 00:13:44 And plus, I guess it's a way to learn if you're someone who's just picking up database skills or learning how to query, then even if you're going over the same ground, maybe that's a foundation that can help you in future research. So what I'm saying is you're killing innovation here with Saber Archive.
Starting point is 00:14:03 Oh no, oh no. None of us will question any of this research that's in here, and we'll never check it again because now it's in the archive. What have you done? You can flag it as inaccurate, or maybe we should add a flag for is too competitive. So are you looking for any help in addition to user submissions? Do you need volunteers to process those submissions or to go through them in any way? Yeah, I mean,
Starting point is 00:14:37 really where we could use the help is filling up the archive. But at a certain point, we're going to need people to help go over certain documents and make sure that they're properly classified. Right now we're really relying on the accuracy of the submission on the first time, so having somebody reviewing each submission might be really nice. It would be very helpful. Additionally, we've got a couple new features in the hopper right now that could definitely use some other eyeballs on them, one of them being creating bundles of documents so that if you want to create a bundle of like PITCHf/x expert research versus PITCHf/x
Starting point is 00:15:12 maybe introductions, that could be good. Or if you want to create a document on maybe start to bottom or start to finish building your own WAR, you could do that. And the other feature is implementing OCR so that people who have older documents, and this is where review would really come in handy. People who have older documents, maybe newspaper articles from the 50s or something, or even research papers, could upload them. We would scan them, use OCR to decipher the text, and then present them in a format that
Starting point is 00:15:42 could be reviewed and then published if accurate. That would be amazing, just so you know. That was good to hear. So when you, I mean, if you take this to sort of its conclusion, I mean, there have probably been, oh, I don't know, 2,000 articles written that might be categorized as FIP, for instance, and it's not that useful to have a list of 2,000 articles. They might all be good enough to qualify for your standards for posting them.
Starting point is 00:16:12 There might be nothing wrong with them. They might all contribute a little something to the discussion. They might all merit archiving, but 2,000 articles is not that helpful. So, um, so do you envision basically these 2000 articles somehow being, um, you know, ranked in a way that, um, that, you know, the, the most crucial ones, you know, through, through sort of user rate ranking somehow, maybe, or, or however you would want to do it, that the most crucial ones would be a little bit easier to find in the, in the haystack? And then if so, what would you say would be the standard by which we would judge these things? Is it what comes first? Is it what is most linked? Is it what is written by the most prominent writer? Is it what has the most colors in
Starting point is 00:17:00 the chart? Sure. I mean, I would assume if you're looking at FIP, one of the first things you'd want to see, depending on your approach, would be some of the initial DIPS work by Voros. But if you're a FIP expert looking for some pretty granular data, that's not what you're looking for. Right now, what we have for voting is a way to indicate the best of content. It's not a voting system like you'd see on Reddit. It's either best of or it's not. That's a good way right now, while the archive is small, to highlight the most critical pieces. As for the problem of 2,000 articles about FIP, what I would love to see, and this would again call on the community to help out,
Starting point is 00:17:44 would be anything with a FIP tag also having five other tags that really describe more about the article. If it's purely about FIP, then it shouldn't have any other tags, and we can start to look at what other more granular tags it has. And the more granular it gets, we can maybe take points away from its ranking in FIP overall. So if something's tagged FIP but is also tagged pitch types and pitch velocity, maybe it's not so much about FIP unless it's a correlation between the two. And there's a best of area on the site with little trophy icons. Is that something that you are doing now or is that something that the community has produced somehow? It is community produced. And that's by visits or by votes?
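[Editor's sketch, not part of the conversation: the tag-based ranking idea Matt floats above could look something like the following. This is only an illustration under assumptions. The `Document` class, the function names, and the 0.2 penalty per extra tag are all hypothetical, and nothing here reflects how Saber Archive is actually implemented.]

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for an archived article and its topic tags."""
    title: str
    tags: set[str] = field(default_factory=set)


def tag_relevance(doc: Document, tag: str, penalty: float = 0.2) -> float:
    """Score how central `tag` is to `doc`.

    A document tagged only with `tag` scores 1.0. Every additional tag
    subtracts `penalty` (an arbitrary, assumed value), floored at 0.0.
    A document that does not carry `tag` at all scores 0.0.
    """
    if tag not in doc.tags:
        return 0.0
    other_tags = len(doc.tags) - 1
    return max(0.0, 1.0 - penalty * other_tags)


def rank_for_tag(docs: list[Document], tag: str) -> list[Document]:
    """Return only the documents carrying `tag`, most focused first."""
    tagged = [d for d in docs if tag in d.tags]
    return sorted(tagged, key=lambda d: tag_relevance(d, tag), reverse=True)


if __name__ == "__main__":
    docs = [
        Document("Velocity and FIP", {"FIP", "pitch types", "pitch velocity"}),
        Document("What is FIP?", {"FIP"}),
        Document("Park factors primer", {"park factors"}),
    ]
    for d in rank_for_tag(docs, "FIP"):
        print(d.title, round(tag_relevance(d, "FIP"), 2))
```

[A flat per-tag penalty is the simplest reading of "take points away"; it would not capture Matt's caveat about articles that are genuinely about the correlation between two topics, which would need human review or a smarter signal.]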
Starting point is 00:18:32 If you're looking at an article, so I'm going to bring up the Analyzing Path Part 2. So in the Document Tools module on the right, the first item is a button that says vote best of. If I click that, that's going to put in a vote that says I think this is one of the best pieces of research in Saber Archive. If right now three people do that, it's going to filter on up. Three is an arbitrary number. As the archive grows or as the user base grows, we'll obviously raise that. And there's a pretty powerful search tool, it looks like. You can search by author. You can search by whether it's best of or not.
Starting point is 00:19:11 You can search by year. So if you're looking for something from a particular person or particular time frame, it will be, I guess, easy to do that, even as the archive grows. And you are starting a newsletter, I think, right? How will that work? Correct. Yeah. So every week, probably Thursday or Friday, we're just going to send a newsletter out that has links to articles added this week, as well as maybe one or two just called out, maybe most prominent articles. So right now, everyone's talking about Cespedes' throw, so we'll probably call that out, especially this physics piece about it, because that's just fantastic. That
Starting point is 00:19:49 was a really great piece. So that might be highlighted. We'll have some other stuff from around the archive, maybe a couple of random articles. Should be a good way, though, for everyone to stay updated, especially as the archive, the volume of submissions starts to grow, the frequency starts to grow as well. Maybe it's harder than looking at the homepage, something to sum it up nicely. All right, so I want to ask you one thing that's not about baseball. You're the vice president of technology at Pitchfork.
Starting point is 00:20:17 Ben Lindbergh, who's the other person you were talking to before, he still listens to Astro Lounge, the Smash Mouth album from 1999, I want to say. And I just wanted to get your sort of assessment of how Smash Mouth has aged. Is it actually, is it more embarrassing to listen to Smash Mouth now or is it actually less embarrassing? So right now it's very en vogue to reunite the band and play your most popular album at the big music festivals. I don't think Smash Mouth
Starting point is 00:20:49 is going to receive too many invitations for that. Smash Mouth is touring with Third Eye Blind. Oh man, so the band never broke up? Not really, no. There is constant demand. Well, apparently the genius, most people think of Smash Mouth and they think of Steve Harwell, the singer, the Guy Fieri guy, lookalike guy.
Starting point is 00:21:10 But in fact, the genius apparently of the band was not Steve Harwell. It was, I think his name is like something Camp or something like that. Greg Camp. And he wrote all the songs, the genius. I'm doing air quotes. I've been practicing air quotes for the last two hours in anticipation of saying the genius of Smash Mouth, so I am doing Smash Mouth genius air quotes. But I think he left, and I think he recently reunited with the band, if I understand correctly. Has that phrase ever been said before?
Starting point is 00:21:45 The genius of Smash Mouth? The genius behind Smash Mouth? Let's check. Hang on, hang on. The genius behind Smash Mouth. It's going to be smashmoutharchive.com, right? No results with it in quotation marks. We could try other iterations of those words,
Starting point is 00:22:08 but it doesn't appear that it's ever been said. Sadly, Astro Lounge predated Pitchfork, I guess, so there's no review. If you need a retroactive review, let me know. Oh, if that was out in '99, we were definitely around. Maybe I'm not looking in the right place. We might not have reviewed it. Sorry. Do you have a favorite Smash Mouth album? We've gone over this on the show before, but what's your favorite? Do you have their Christmas covers album by chance? I don't. I didn't even know they did one. Why would you?
Starting point is 00:22:39 All right. Well, thank you for starting the Saber Archive, for building it. It's a very, very smooth, snappy-looking site, and we encourage everyone to contribute to it so that we can keep track of all the good research out there and not redo it if we don't have to, if we can't improve upon it. And you can follow Matt on Twitter, at Matt Dennewitz. That's two Ns.
Starting point is 00:23:06 You can also follow the Saber Archive on Twitter at Saber Archive. And again, the website is saberarchive.com. That's with the E-R, S-A-B-E-R, archive.com. So thank you, Matt. Thank you, guys. All right. Please support our sponsor, Baseball Reference Play Index. Go to baseballreference.com
Starting point is 00:23:25 Subscribe to the Play Index using the coupon code BP to get the discounted price of $30 on a one-year subscription. We will be back with another show tomorrow. Sorry, I'm an empty vessel when it comes to Smash Mouth. Ben and I have been knee-deep in Smash Mouth
Starting point is 00:23:42 meme-ology lately. Smash Mouth has really tipped into meme territory, I would say. There's this ironic Smash Mouth revival. Lots of Smash Mouth mashups, Smash Mouth interview advice, and Smash Mouth BuzzFeed
Starting point is 00:23:56 remembrances. We've had lots of Smash Mouth to discuss. Kind of looks like the guy from Insane Clown Posse, too. They both have the same hair and thick faces. Yeah, that's a good point. Geniuses look alike.
