The Data Stack Show - 18: Data Science in Health Insurance with Jason Haupt of Bind
Episode Date: December 31, 2020

This week on The Data Stack Show, Kostas and Eric are joined by Jason Haupt, data science lead at Bind, a no-deductible health insurance company determined to give immediate answers and clear costs before point of care. Jason's unique background, a Ph.D. in particle physics and work at the Large Hadron Collider at CERN, has informed the way he approaches data at Bind.

Highlights from this week's episode include:

Jason's background in particle physics and his path to Bind (2:53)
A cloud-only approach to data and utilizing AWS (9:01)
Focusing on activities that help its members (12:08)
Dealing with 12,000 columns of data from an insurance claim form (17:13)
Rethinking the relationship between marketing and product teams (25:28)
Examining the data pipeline (29:30)
Privacy and security concerns with medical information (35:45)
How experience with the LHC impacted the way he thinks about data (40:06)
Transition from academic work to industry (46:20)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome back to the Data Stack Show.
We hope your holiday season has been wonderful.
We have a very interesting guest today, Jason from a company called Bind.
Jason has a very interesting background, actually coming from the world of physics and specifically
academia.
And hopefully we get to talk to him about some of his work there. And Bind is a fascinating company.
They are doing a lot of interesting work in the healthcare space and bringing price transparency to health insurance, which is fascinating. So I'm extremely excited to meet Jason and learn about his sort of
data science practice at Bind. Kostas, he's such an interesting guy. What do you want to ask him about?
I think the most important aspect of our conversation today is that we are going to talk with a data scientist, and actually a pretty hardcore one, which is great because on our show so far we have mainly had people from data engineering, covering the typical data stack around BI and the standard analytics that a company implements as its first step into becoming data-driven. So today we are going to chat with someone who has a very strong background in data science.
So I'm pretty sure we will have the opportunity to discuss some more advanced, let's say, analytics use cases.
So this is super interesting for me. Another thing that's going to be, I think, a big part of our conversation is around data privacy and how you work with sensitive data in general.
I think Bind is a very good example of a company that has to work with very sensitive data.
And it would be super interesting to see from the data scientist's perspective what privacy means and how at the end you can deliver value
without compromising, let's say, the privacy and the security of the people that are trusting you
with their data. So yeah, I think it's going to be very interesting. Hopefully we will also learn
something around physics. We'll see. But yeah, let's do it. Great. Let's dive in.
We have a really exciting guest, Jason from Bind. Bind is doing some really
interesting things in the healthcare benefit space. So we'll hear about that. But first of all,
welcome to the show, Jason. Thanks for joining us. Yeah, thanks a lot, Eric. I'm very interested in
having a conversation with you guys today. We are too. Well, let's start out. Could you just give us a brief background on yourself
and then just a high-level overview of what Bind as a company is doing in the healthcare space?
Yeah, really good. So I got my PhD in particle physics and worked over at CERN for a long time. I used to always say a petabyte is a small data set, because it was easy to run 20,000 jobs overnight and process the data.
I left that to go into industry and ended up in healthcare; it just kind of happened, locally in the Minneapolis region.
I worked for a provider, a large local provider organization for a while.
So that means hospitals and clinics for a few years.
Built a team until that team got acquired by Health Catalyst, a startup that IPO'd last year out in Salt Lake City. And when that acquisition occurred, I moved to the insurance
world for UnitedHealthcare, did that for a few years, led a team of several hundred, working with a few petabytes of data internally at UnitedHealthcare, building a lot of assets on their benefit services. And all of a sudden, I got a call
one day from a startup in Minneapolis saying they had a new way of doing things. So I listened to the pitch, and yeah, they were right. I felt I wanted to be part of the solution in what Bind is, and actually maybe change some of the fundamental structures that I felt were not right about how health insurance operated.
So, Eric, you want me to get in a little bit and tell you about what Bind is?
That'd be great. I mean, you know, healthcare is such an interesting space, and it seems like Bind is doing some really
So, yeah, an overview of how y'all are trying to change things would be great.
Yeah. One way is to compare to what other people are doing, and the other is to compare to expectations, right? One of my favorite ways of describing what Bind does is taking the consumer approach. And I'll give you a couple of examples. Think about the way healthcare works versus the way you expect consumer interactions to work in your day-to-day life. For instance, let's say you take
your credit card, you decide to stop and get gas, you swipe it, and you drive away. You don't sit there and pray or hope, thinking, I hope that when it appears on my credit card bill in three to four weeks, it was only $50, right? There's no disconnect between
the price you pay and when the transaction occurs.
Similarly, you're not going to fly from where you guys are located to Vegas for the weekend and come back hoping that Delta, United, or whomever you fly with only charges you $500 for the ticket a month later, and not $2,000, right? Those are the type of price swings that we see in healthcare. Could be a
couple hundred, could be a couple thousand. So the fundamental
problem we have here is that the consumer marketplace doesn't exist. So what does my team do? My team has data on tens of millions of Americans in one data set, almost 200 million Americans in
another data set about their experiences, their claims and other experiences with healthcare.
We look for those patterns of how people experience healthcare, both cost efficiency and quality. And it's as simple as this,
we rank everybody, every provider in the space. And then what do we do? Because we're the insurance
company, I put a different price tag on everybody. And what we do then is we expose that price tag
to the members. And guess what? What they see is what they pay.
They can look it up in their app.
If they're not app savvy or website savvy, they can call us, right?
And like, you want an MRI right there?
A hundred bucks.
You want an MRI down there?
$2,000.
And what this does in the end,
it's open access.
So we don't restrict the access, right?
We have a broad nationwide network.
So we're not one of those companies that are out there like, oh, we'll just find the cheapest person. That's the only place
you can go. We think that's horrible. We just price everybody and consumers can make a decision
with their wallets, right? And we incentivize them appropriately. Let's say for back surgery,
this one might be $500 because they only charge $20,000 to the employer. Or this one might be $5,000 because they charge $200,000 to the employer. The employer saves $150,000. You as an individual just saved yourself $4,500. And guess
what? It works. Simple as that. We found these categories where, if you introduce price variations, in some of our products like Bind On Demand there's an activation component where you can activate additional insurance coverage on demand, or Bind Basic, where there are no activations required. Simple as that,
we keep finding these categories where, if you just show people the prices, enough people are going to make decisions and save tens of percent, sometimes 20 percent or more, of the overall health insurance costs for some of our employer groups.
This actually works. And we've been scaling. I can tell you about some of our clients, but just to give you an idea: we are a multiple-X growth company, and 1/1, January 1, is essentially what that growth usually looks like for us.
Are you going to start using the product?
Am I going to start using it? Well, we're not on the individual market yet. So your employer currently has to offer it; we operate technically as a TPA in most of our business.
We're fully insured in, let's say, the state of Florida.
And we will eventually have an individual product that you can get on the marketplaces in various states.
But right now, your employer has to have Bind as an insurance option for you to be able to select it.
For some, Bind is the only option.
For others, Bind is an option amongst a handful of others.
Right, right.
Yeah, actually, my comment was more about your pitching,
which I think you might...
Yeah, that was my sales pitch.
Yeah, you did an amazing job pitching the product
and the business.
So yeah, I think both Eric and I are sold on it already.
Cool.
So Jason, do you want to get into a bit more of the technical detail on how this works? You mentioned the size of the data sets that you're working with, and I think it's pretty clear that a big part of the product itself is based on the analysis you are doing on data. So from a technical perspective, what does this look like? What kind of technologies are you using? And then we can also discuss a little bit more about methodologies and what kind of analysis you are doing on the data.
Yeah. And I mean,
one of the things, compared to working previously at a Fortune 10 company, a very large national company with a lot of data assets, is that you can see what slows them down, right? So Bind has taken a cloud-only approach to how we deal with our data, which allows us to take AWS services as we need them, scale them,
and use them in a way
that just was so hard to do
when you have these merger
and acquisition on-prem solutions
that are just really slow to catch up.
I will say one thing I'll note about the big cos is that they're getting better with their modernization strategies. They're getting to a point where they're more and more cloud-based, and more and more able to scale some of their low-level functions, as well as some of their medium- and high-level functions. That's great for them, but they're on a multi-year journey to have basically modern services that are cloud-based and can actually scale, rather than, oh, that'll take two to three years just to optimize something 10%. But we get to start off with: hey, it's cloud-based. I can click a button in AWS and double or triple my database in minutes, right? Depending on how the
Redshift shards work, or I can go with more of an online solution, right? Most Bind apps take a Java-based backend microservice approach, sitting in a very heavily secured AWS account. My team is more of a Python-based data science team, and that tooling has only really come alive in the last year, year and a half. My team's already over three years old. So a lot of what we had was custom built, and we've gone through one level of modernization there as well, where we're using SageMaker, Step Functions, Lambdas, a bunch of those AWS technologies that are allowing us to build model inference pipelines or
data transformation pipelines. And a lot of our data transformation right now is done in PySpark,
just to give you an example of the type of things that we're doing.
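To make that concrete, here is a minimal sketch of the kind of PySpark transformation job described above. The S3 paths and column names are hypothetical illustrations, not Bind's actual schema:

```python
# A minimal sketch of a PySpark claim-transformation job; paths and column
# names are hypothetical illustrations, not Bind's actual schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-transform").getOrCreate()

# Read claim records that an upstream service has already unpacked to JSON.
claims = spark.read.json("s3://example-bucket/claims/raw/")

# Keep only the fields the models care about and derive a simple feature.
features = (
    claims
    .select("claim_id", "member_id", "billed_amount", "service_date")
    .withColumn("high_cost_flag", (F.col("billed_amount") > 10000).cast("int"))
)

# Land an analytics-ready table for the model inference pipelines downstream.
features.write.mode("overwrite").parquet("s3://example-bucket/claims/features/")
```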
Oh, that's great. So as a data team, let's abstract your work a little bit and talk in terms of inputs and outputs. What's the input that your team gets? My assumption is that it's the raw data you work with, and I'd love to hear a little bit more about what kind of data you're working with and its sources. And what's the output? Is it a model? How does this work, how do you update it, and how do you actually turn the result of your work into a product at the end, something that the end customer can use? I think that's super interesting to hear about, because my experience so far, to be honest, has been more with people who are doing ad hoc analysis or working in the BI space. The kind of work that you are doing, and how you can turn it into a product, is something I find very fascinating, and still something that the industry is trying to figure out, right? There are still tools being built for those needs.
Yeah, that's a really good question.
And this was one of my bigger concerns when I had the large team at the large insurance company. I was worried about taking the flavor of the day in terms of the tool and getting vendor lock, and then a year or a year and a half later a different tool would be better, just a better product offering at that time. I ran into that a few times, right? Having the larger team and running into that problem, it was kind of annoying. So sometimes I just hired a bunch of Java developers and developed a tool that met my needs, rather than trying to find the one vendor I was willing to accept a little bit of lock-in with, trusting that they were going to move in the same direction as me.
So sometimes that's been very successful.
Sometimes it's not.
So what do we actually do here at Bind?
My inputs: medical claims. So we have the plan that we're operating, the hundred-plus thousand members, and then that more than doubles come 1/1, January 1, when we're talking about hundreds of thousands of members on our plan in just a little over a week's time. What we do is we have their medical claims coming in. Then we have their other touchpoints: their Rx claims and other sorts of interactions, right? Eligibility checks, where one of the providers is like, hmm, does this person have
insurance? Person walks into a doctor's office, they'll send in some sort of query that says,
hey, does this person have it? So I think about inputs as signals. Each of these things is a
signal. A claim is a signal. Someone getting a prescription is a signal. A doctor checking things is a signal from an operations perspective. Plus a good percentage of our members log in with
our app, right? They sign up, they log in, they begin to search. All of those become more signals
that I can use about member behavior that I can link into outcomes. Plus then I can go to the
market and buy some of these historical data sets, or partner with other organizations, to get tens of millions, in some cases almost hundreds of millions, of other historical records. Those are other signals that I can use.
My team takes that historical data.
We look for these patterns, and then we implement product based on that pattern. So ranking all providers based on a myriad of algorithmic things, that's something my team does. That gets loaded into the
product and what you see as a price tag for every provider, what you see as a price tag for every
service, right? That's something we deliver. And also in the other sense, we take that historical
data to build models. We can take these patterns and predict what's likely to happen next.
And then we put this into our MarTech stack.
If you're not familiar, that's the marketing technology stack that allows us to fuel our
member engagements or our internal marketing strategy.
So based on something, the next time you log into the app it might say, hey, it looks like you're heading down a surgery path. Are you interested in a free second opinion service? Right.
That's just an example of one of our internal marketing campaigns that are fueled by analytics and services. And I can tell you, I've had research jobs in the past, but I heavily focus on things that are driving value to my members.
In fact, if my team's like, oh, we want to improve this algorithm, I'm going to say, well, let's look at the roadmap.
How is this actually going to help our members?
What's the likelihood that it's going to help our members?
And we try to focus our activities on things that are going to help them.
That's amazing.
And I think Eric will have a couple of questions to ask about the marketing tools and how you work with them. But before we
go there, my last question, data related for me, and then we can return to that, is about,
you said something very, very interesting. You talked about signals, like all these
data points that are coming, they're like actually signals that you combine together.
And the end result is a price point, a price value. What I find extremely fascinating, and it might just be me, is how you start from something that has so many dimensions, all these signals we are talking about. Because I think a bit of a problem with the term signal is that people might tend to think it's something very one-dimensional, but usually these data points are quite complex.
And through all the models that you build, you manage to collapse all this into a numerical value that someone can use. For me, this whole process, this kind of magic that data science and all these algorithms do, is amazing.
But can you share a little bit more about the structure of the signals that you are
working with, how they look like?
You talked about claims, right?
I think most people think of a claim as a document that they have to fill out, right? So what does a claim look like from the perspective of a data scientist? What's the complexity, and what kind of preparation do you have to do on this data in order to turn it into signals that you can then apply all these algorithms to and turn into value at the end?
Yeah, that's a really good question. I'll give you an example of the claim, and I'll give you an example as well based on our member-facing search experience. But let's unpack the claim example first, right?
The claim comes in what's called an X12 EDI format, an electronic data interchange format that's been around for quite some time. This format is very compact and can be unpacked; a couple of times in the past, I've had people write Java parsers to unpack it. When you unpack this into a data warehouse for a professional or an institutional claim, you usually end up, and I'm not joking here, with between 10,000 and 11,000 columns, right? It's a very sparse thing. Not all of those columns are populated; some things are given, but some things you don't know. So the structure of a medical claim, which you think of as this paper form, gets transferred into roughly 12,000 columns of sparsity.
Wow. That's wild for a single claim.
There are so many loops allowed, right? You can have 25 diagnostic codes for every procedure code, and every procedure code can have a different pointer assigned to a description as to why. So if you're in a JSON or XML mindset, that creates these nested loops, which is why it's so compactified.
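A toy illustration of how those nested loops explode into a wide, sparse row when fully denormalized; the field names and loop limits here are simplified stand-ins, not the real X12 837 layout:

```python
# Toy illustration of nested claim loops exploding into a sparse wide row.
# Field names and loop limits are simplified stand-ins, not the X12 837 layout.
claim = {
    "claim_id": "C123",
    "diagnoses": ["E11.9", "I10"],
    "service_lines": [
        {"procedure": "99213", "diagnosis_pointers": [1, 2], "charge": 150.0},
        {"procedure": "80053", "diagnosis_pointers": [1], "charge": 45.0},
    ],
}

def flatten(claim, max_lines=50, max_diags=25):
    """Denormalize the nested loops into one wide, mostly-empty row."""
    row = {"claim_id": claim["claim_id"]}
    diags = claim["diagnoses"]
    for d in range(max_diags):
        row[f"diag_{d + 1}"] = diags[d] if d < len(diags) else None
    for i in range(max_lines):
        lines = claim["service_lines"]
        line = lines[i] if i < len(lines) else {}
        row[f"line{i + 1}_procedure"] = line.get("procedure")
        row[f"line{i + 1}_charge"] = line.get("charge")
        ptrs = line.get("diagnosis_pointers", [])
        for d in range(max_diags):
            row[f"line{i + 1}_diag_ptr_{d + 1}"] = ptrs[d] if d < len(ptrs) else None
    return row

row = flatten(claim)
print(len(row))                                   # well over a thousand columns
print(sum(v is not None for v in row.values()))   # only a handful populated
```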
But if you want to unpack it and denormalize it as much as you can, which I've done at the big co, it becomes big. What we found is that you can create levels of variables from it: people spend time reducing that very unpacked version down to several hundred key variables, and then to several dozen key variables. So I've gone through that activity. It was kind of interesting: when I left for the startup, I had just developed a big product, together with a couple of teams adjacent to me and my own team, a real-time, micro-batch online claims processing system that, as a claim came in the door, issued fraud predictions within minutes. I think it ran every 10 minutes. It was a great architecture. Then I read Uber's Michelangelo architecture. If you haven't read it yet, they've published a couple of articles, in 2018 and again in 2019. And I thought, ah, online-offline, that's very much like the architecture we had built: taking things into a database, unpacking them, creating an online version by taking those top 100 or 200 features, putting them into a feature store, and then building all of your models on that feature store. So yeah. So when I say this is
a signal, it is not one dimensional. It could be 10K to 12K dimensional, but when I'm actually
running my models, I've already limited it down to those couple hundred features or so that are key, especially for things that are run online.
Offline, I can keep a few more, but to be honest, it's so sparse that going beyond a few hundred isn't worth it. So that's an example. And what's interesting, my team even did that here at Bind.
We unpacked that format. We picked out the top 20 or 40 features or variables that mattered, built our models
specifically on those features, and therefore deployed our models specifically on those
features to get, you know, depending on what we were trying to predict, varying degrees
of success, some of which now are impacting our members positively.
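A hedged sketch of that reduction step, scoring every unpacked column and keeping only the handful that carry signal; the file names, the label column, and the use of random-forest importances are illustrative assumptions, not Bind's actual method:

```python
# A sketch of reducing a very wide, sparse claim table to the "top 20 or 40"
# features. File names, the label column, and the use of random-forest
# importances are illustrative assumptions, not Bind's actual method.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

wide = pd.read_parquet("claims_unpacked.parquet")   # hypothetical wide frame
X = wide.drop(columns=["claim_id", "label"]).fillna(0)
y = wide["label"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
top_features = importance.nlargest(40).index.tolist()

# Persist the reduced feature set; models train and serve on this table only.
wide[["claim_id", *top_features]].to_parquet("claims_features.parquet")
```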
And if you're interested, I can tell you about the other space, which is search, right? We have a type-ahead as you're searching. Every time you click and add another
variable, we have metadata about that search. So you can think about it like this: you type in "diabetes," and by the time you get to the S, I have a row in my database for every letter you've typed. And I know what search results existed. I know what a search attempt looks like. I know if you went back and went forward, and what your final search was. So even though some people would say, oh, it's a
signal, what do they search for? Well, I've got metadata stored in my logs for every keystroke you made, which allows me to make sure my search is working effectively, right? People are finding things quickly. They're not misspelling things; it's suggesting "diabetes" for them quickly, right? Those are the type of experiences that we enable
by just looking at all the data.
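To picture the row-per-keystroke logging he describes, here is a minimal sketch; the event fields and the JSONL sink are assumptions, not Bind's actual logging schema:

```python
# A minimal sketch of per-keystroke search logging; the event fields and the
# JSONL sink are assumptions, not Bind's actual logging schema.
import json
import time
import uuid

def log_keystroke(session_id, query_so_far, results, sink):
    """Write one event row for every character typed into the type-ahead."""
    event = {
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,
        "ts_ms": int(time.time() * 1000),
        "query": query_so_far,          # "d", "di", "dia", ... "diabetes"
        "result_count": len(results),
        "top_results": results[:5],
    }
    sink.write(json.dumps(event) + "\n")

# Usage: every keypress in the search box fires one event.
with open("search_events.jsonl", "a") as sink:
    for prefix in ("d", "di", "dia", "diab"):
        log_keystroke("sess-42", prefix, ["diabetes", "diabetic retinopathy"], sink)
```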
Oh, that's amazing. One last technical question before I let Eric ask his questions. Sorry, I'm really getting excited about this stuff.
So you talked about unpacking the format, and it's very sparse, as you said. And you also mentioned Redshift. Can you give us a little more of a technical description of how this unpacking happens? From the JSON or XML document, whatever it is, do you end up with 10,000 columns before you start creating the features? The reason I'm asking is that I know about the limitations of Redshift; for example, you can't have a table with more than 1,600 columns. So I'm very interested to see how you manage this dimensional explosion given the limitations of a data storage system like Redshift.
Yeah. So in my previous role, we unpacked it all. We had everything. We were using HBase at the big co, because of that ability to just hold the entire object, and then we would create HBase tables that were reduced feature sets, right? So that worked fine. But now it doesn't make sense for us to unpack the entire thing, because we already know not every field is valuable, or we can do that at a future state. So we define a schema, let's say on top of a JSON format, and we unpack that schema that we've defined, right? So it's only those variables that we've determined to unpack out of it. And if you want to think about
this from a technical perspective, we are definitely an orchestration organization,
right? Kafka was central to the way we set things up. So we have these engines that go in,
the schema gets unpacked once, gets put into a Kafka topic that anybody that needs to use that
then can use it, right? So there's something
that listens to that topic and then instantiates that unpacking into an analytics-ready database
that I just talked about, right? Other consumers subscribe to that same outcome to actually begin to process the claim, to adjudicate it and determine what the actual price should be, how much the provider is owed, how much the member may or may not owe, et cetera, and how much the employer needs to pay. So we have many microservices that allow these transfers and these processes to occur.
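A sketch of that unpack-once, consume-many pattern using kafka-python; the topic name, broker address, and payload are hypothetical:

```python
# Sketch of the "unpack once, publish to a topic, many consumers" pattern,
# using kafka-python. Topic name, broker, and payload are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The unpacking engine applies the agreed schema once and publishes the result.
unpacked_claim = {"claim_id": "C123", "member_id": "M9", "billed_amount": 150.0}
producer.send("claims.unpacked", unpacked_claim)
producer.flush()

# Any downstream service (analytics loader, adjudication, ...) just listens.
consumer = KafkaConsumer(
    "claims.unpacked",
    bootstrap_servers="broker:9092",
    group_id="analytics-loader",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    claim = message.value  # e.g., load into the analytics-ready database here
```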
That's great. All right, Eric, he's all yours. I know that you have many questions to ask,
but you know me, like I get too excited sometimes.
So go ahead.
I mean, it really is fascinating.
I just love all the unique things that we learn on the show,
like a medical claim producing 11,000 columns.
It's wild.
Jason, I'm interested in the sort of customer experience
aspect of what you discussed as far as the outputs of the data. And I have two questions there. One
is about the interaction between the data science team and the marketing team. And the second is
about just the technical piping that sort of connects your work with the MarTech stack, as you said.
But let's start with the relationship between the data science team and the marketing team or
other people driving customer experiences. And specifically, I'm thinking about even the example
that you gave around providing a customer who opens the app with a recommendation on a free second
opinion. Where does an effort like that originate? Is that coming from marketing
or coming from someone in product? And then depending on where it originates, you know,
how do you work with those teams to sort of produce the output that they need from your work?
Yeah. So from where in an organization is this owned? Let's say that that's been something that
has changed because we're still trying to find the optimal structure. So when this first came out,
there was a product owner of, let's say, if you think about this as a store, you had inventory, you have these SKUs, you have these things that people can
purchase, right?
Things that have price tags.
So providers doing this thing somewhere is a SKU.
So you'd inventory.
If you also thought about it from a retail store concept, then you have merchandising,
right?
How do you arrange the things in the store such that people can see things, right?
You put things at eye level and around the end caps, things you want to highlight to people.
So we had in our product division, we still have an inventory function.
We had a merchandising function and the person who owned merchandising was in charge of
basically figuring out how things get stocked, right?
Where they were from a visual perspective, think about in the app,
you know, how do we highlight things? We've since changed that function. It served us very well, but now we have a member experience function within the business, within our operational business. They are more in charge of that, call it the arrangement of how things sit within the store, right? I want
to stick with that retail construct. Our marketing team's role is making sure that the technologies are there and that our brand makes sense. If you think about it, they own a lot of aspects of it, right? How to develop the front end of that: for instance, the videos that people are going to see, the images that end up on the machines of the potential people that are going to select us, right?
right? So our marketing team is usually focused on selling Bind out of the front door and then
selling Bind to the employees within these organizations, or at least giving them information so they can make the choice for Bind. We love to be in choice environments. In many situations, we don't want to
be the only option. We want people to choose us over their high deductible health plan. We want
to say, you know, one thing I didn't tell you, there's no deductible with Bind, right? There's
no co-insurance. You don't have to hit some number before Bind kicks
in. If this is a hundred dollar MRI, that's all you're going to pay. The hundred dollar MRI.
I've got that on the site, which is awesome.
Yeah, you have Kostas and me excited. We're going to go back to our employers and ask them to check it out.
So we just want to give people that information. For many people: look at the information,
go to our website, type in the things that you care about.
Is it diabetes?
Is it this drug?
Is this better for you?
Right?
So our marketing team focuses heavily on the upfront experience of helping to sell, or at least present, Bind to those HR managers. And then, at the employee level, making sure that during these annual enrollment events, when people are given the option to select Bind or not, they have the information they may need to make a good decision on their own behalf, right? So it's a
great relationship. We've hired some brilliant people that I really enjoy working with. So
I'm really happy with the way we're structured.
Very cool. And jumping over to the technical side, could you explain, and I realize, you know,
this may not, you know, be under your purview from a technical standpoint, or maybe it is, but you talked about sort of, you know, pulling data in and then processing it. How does it go from
the infrastructure that you and your team leverage through to the end user experience,
right? So let's say they open the app and they get a notification. What are the pipelines that
actually drive that experience and how does the data get from you, you know, sort of to the places
where it's going to be activated for the customer?
Yeah, that's a really good question.
And I would say the best way for me to answer that is to go back to the architecture: being all in the AWS space allows us to have some of these integrations be far more streamlined than at some of the on-prem companies, right, that maybe haven't thought about this interactivity or these connections in their original design or use cases.
So, back to the orchestration engine: I can just publish my model output to Kafka on a model topic, right? And then my MarTech stack can listen to that, right? As long as I have some sort of data contract with the marketing team or with the product owner of that stack, they know what that thing means and what the structure of the thing I published is. And sometimes, when
I'm early in and not ready for full production, I might publish it to a database and they'll query
that database, right? That fuels into, let's say, a Segment or whatever tools you guys are familiar with, which is now the MarTech stack that understands how multi-mode interaction occurs, be it phoning people, emailing folks, fax, or in-app notifications. So basically, if you're
talking about just that transaction, that stack can just listen to Kafka and fuel its data stores.
That stack can just query a database and fuel its data stores through configuration. And then
that team that manages the marketing and merchandising function can then configure those campaigns, right, within those tools, based on the data and information that was loaded in. And if you want me to get more technical, I can, but that's the way I like to describe it.
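The data contract he mentions could be as simple as an agreed event shape that the model side validates before publishing; this sketch is illustrative, with hypothetical field, topic, and model names:

```python
# Illustrative sketch of the data contract between the data science team and
# the MarTech stack: agree on an event shape, validate it, then publish.
# Field names, the topic, and the model name are hypothetical.
import json
from kafka import KafkaProducer

CONTRACT = {"member_id": str, "model": str, "score": float, "scored_at": str}

def validate(event: dict) -> dict:
    """Reject any event that violates the agreed contract."""
    for field, expected_type in CONTRACT.items():
        if not isinstance(event.get(field), expected_type):
            raise ValueError(f"contract violation on field {field!r}")
    return event

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = validate({
    "member_id": "M9",
    "model": "second_opinion_campaign_v1",   # hypothetical model name
    "score": 0.87,
    "scored_at": "2020-12-01T12:00:00Z",
})
producer.send("models.member_scores", event)  # the MarTech stack subscribes here
producer.flush()
```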
No, I mean, that is great. We have the benefit of seeing a lot of these different setups, and the way that you have approached it is very modern and very streamlined. One thing I'm interested in is the development of the MarTech stack. I mean,
it makes absolute sense that you would have a pipeline that the marketing stack can listen to
and then sort of just receive the information they need and then, you know,
route it and do the things they need to do with it.
Were you involved in sort of the architecture of the marketing tech stack as well?
And was that system, from a claim coming in, to going through your pipelines and data science, to publishing it in a way that the marketing tech stack can listen to, something you were involved in? Or had they architected their system separately and you built your Kafka pipeline to suit?
Yeah. So in this instance,
I was aware, but not involved in the choice of technology for the MarTech stack. I was aware
they were doing it. I understood which vendors they were considering, but I was not a key stakeholder in that process. It came down to the data contract, if you want to think about it from that concept: how am I going to get you data? Great. We're a Kafka organization. I can read Kafka. Just put it there, right? From an orchestration standpoint.
So we had this going-in position such that we already had a method of communication, and they could go off with whatever use cases they wanted this to work for, right? To manage marketing campaigns, you really want the ability to manage the app notifications and the email notifications with modern tech, right? You're just not going to build your own Java application for that.
It exists in the marketplace.
So we just had to make sure,
hey, here's information on how to load it in there
with kind of advanced analytic techniques.
So it came down to that data contract.
I feel we did well with that.
Yeah, and it's interesting. We actually wrote a post recently about the history of data
engineering. And one of the points we brought up was that IT and marketing, there's been a schism
between the two groups within a lot of organizations because IT was seen as sort of a limiter,
right?
Like, oh, we don't want to go to IT because it's going to take longer and they're not
going to give us what we need.
And, you know, they're going to say no.
And so it's just really exciting for me, especially coming from the marketing side, to hear about
a partnership that's actually, you know, seems to really be driving better value and better
experiences for the customer.
And I think that's where things are going to go in the future, you know, as companies
really figure out that that creates a competitive advantage.
So really exciting to hear about that, hear about that structure of Bind.
To hit on that a little bit more: when I think about where this technology is going, we've still got a lot of opportunity to enable it even more, right? That's the clincher there. When I think about organizations that are stumbling over themselves to get things in there, I don't think that's our biggest problem,
to be honest. Our biggest problem is making sure that consumers can understand our information
in a way that's valuable to them. It's usually not a technology problem.
That's not our biggest thing.
Understanding the user experience and optimizing to that is, I think,
where you become a consumer-oriented organization.
Like I said, as long as you are upfront
with the technology,
then we can actually focus on what really matters,
creating a consumer experience
that actually works for people.
Sure. The technology gets out of the way and you can focus on the user,
which is the whole point. One question, speaking of IT and the issues that marketing has with it: we can't talk about healthcare data without talking about security and privacy. And insurance and healthcare
are extremely regulated in terms of data privacy and security. So how does that impact your work
as a data scientist? I mean, you're obviously sort of dealing directly with the sensitive data.
I would just love to know the types of things that you deal with on the data science team
related to security and privacy.
Yeah. And the interesting thing about Bind is that it's, I think, the most secure PHI organization I've ever been part of, just to throw that out there compared to the bigger companies I've worked for. When I say that, I mean in the day-to-day operation, right? We take a very strong dev-versus-prod mindset, right? Most people at
Bind have no access to prod. In fact, very few developers do, right? So they must develop their
code on test data, dummy data, implement it, test it, put it into the pipeline. Even the data
scientists need to do this, right? To a point that when we want to deploy code into production,
that's when we get to see it.
And certain variables are covered. So only a few people have access to the PHI itself, right?
It makes it harder to develop when you need to live in the dev/stage/prod or whatever paradigm
and you need to develop that way. Doing it with data science is a little weird,
but we've figured out ways to make it better. But it takes longer to live in that paradigm.
So when I say it's more secure, I just meant it was easier to get full data access at some of the other companies.
But it was really hard for them to get the data off the computers.
Let me put it that way, right?
They'd say, yep, this person has access to 100 million Americans' data, but it's impossible for the data to leave, right? They've
got the machine so locked down. The possibility for breach is very, very, very small, but it was
much easier to be like, yep, this person gets full access because it's part of their job. They need
it. Right. Sure. And does that impact the way that you train models on the data science side?
Like, do you, you know, thinking through test data and your development
flow, how does that look on the team? Yeah, for the most part, things that are PII or PHI,
most of those identifiers are unimportant from a modeling perspective, right? I don't need to
know someone's name. I don't even necessarily need their address or stuff like that, right? Although, because we do live in the age of checking for equity, which we do, I sometimes might take their zip code and link it in with socioeconomic data I might have about the region they live in, to make sure that we have an equitable product, right? And equitable outcomes in terms of how people
experience Bind, right? So those are things that are important to me. But age is an important
variable. But most of the PII can get blinded from a modeling perspective, right? Which is
really nice. The only time I sometimes need to put it back in is if I'm providing output to an operations team that's now going to go do something. So if we have a model that's predicting people that are going
to be high cost in some sort of condition category, that data needs to be plugged into
a clinical ops team that might call them or might try to help that person make good decisions on
their strategy, right? This isn't all just app-based. We operate a product and we sometimes will just call the folks and make sure they have all the information they need
about their benefits to make good decisions. So that happens. So we might have a couple of folks
with kind of like that front end, but we are heavily regulated, heavily locked down. And
we have a very good DevSecOps, a development ops security team
that really makes sure our data is protected.
Very cool.
Yeah, that is very interesting on the modeling side
in terms of the data you actually need to accomplish
what you need to accomplish.
Yeah.
I mean, there's age, but you can group age. You can use age as of January 1, because 66.5 versus 66, trust me, is not a sensitive difference for almost every model I've ever seen in the healthcare space. So just to give you an example, we find that we're able to strip out the PHI from a modeling perspective pretty readily.
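A small sketch of that de-identification idea, dropping direct identifiers and coarsening age while keeping zip for the socioeconomic join; the column names are hypothetical:

```python
# Sketch of de-identifying member data for modeling: drop direct identifiers,
# coarsen age to whole years as of a fixed date, keep zip only for joining
# socioeconomic context. Column names are hypothetical.
import pandas as pd

members = pd.DataFrame({
    "name": ["A. Smith"],
    "address": ["1 Main St"],
    "zip": ["55401"],
    "birth_date": pd.to_datetime(["1954-06-15"]),
    "claim_count": [7],
})

def deidentify(df: pd.DataFrame, as_of: str = "2020-01-01") -> pd.DataFrame:
    out = df.drop(columns=["name", "address"])                 # direct PII gone
    out["age"] = (pd.Timestamp(as_of) - out["birth_date"]).dt.days // 365
    return out.drop(columns=["birth_date"])                    # exact DOB gone

print(deidentify(members))  # zip stays for the equity / socioeconomic join
```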
So Jason, I have a question that's still related to the dimensionality of the data, but I really have to ask you this, because you are probably one of the few people who can answer it. What is more complex in terms of dimensionality: a medical claim or a measurement at the LHC?
From a dimensionality standpoint, I would say measurements at the LHC are probably far more complex. I worked on something called the electromagnetic calorimeter when I was there. It had, if I remember, 60 to 70 thousand, I think it was 64,000, crystals. So just one part, one subdetector, had 64,000 crystals, and the energy measurements were sampled every 25 nanoseconds. So you had to reconstruct the energy profile. Just that, let's say 10 or 15 batches of 60,000, is already telling you you're dealing with more than a million measurements. And then for every crystal,
you reconstruct the energy profile.
Then you run a bunch of higher-level reconstruction.
Was it an electron?
Was it a proton?
Where was it going?
All these higher level things.
And that's just one thing. There's also the hadronic calorimeter, and there's a tracking detector. So in terms of data elements for every single interaction, you're talking about billions. These devices are pretty much the most highly instrumented spaces that exist.
And I only gave you the surface of how instrumented that was. I think when it was
designed in '93, it was designed with more fiber optics than existed or had been laid in the world at the time. I mean, the world laid a lot more fiber during its actual development, but that just gives you an idea of how instrumented it was.
Yeah, I think every time an SRE complains about all the different metrics they have to measure every day, they should have a conversation with you, so they can feel better about the data points they have to keep track of every day.
That's great. Actually, I asked that question because you have a very interesting background, coming to data science from doing your PhD in physics at the LHC. We're always talking about big data and the scale of the data problems the industry is facing every day. You are a person who has probably been involved in one of the most complex data projects humanity has undertaken so far. So can you share a little bit about that? How did it feel coming from the CERN experiments into industry, and what differences did you see? And, I think quite importantly, what lessons did you learn there that you still apply today and find very useful?
Yeah.
So the most interesting thing for me about the scalability question is that most of the time, when I find somebody, be it in industry or not, saying, oh boy, this is just impossible to do, this doesn't scale, I look at it and I'm like, this isn't hard at all. Sometimes they're talking about going from one gig to 10 gigs, and it's because they've chosen a tool that does everything in RAM. And there's a solution for that. None of these are showstoppers. Similar to before: I had a data set that was 900 terabytes, and one that was 1.3 petabytes. I'm talking about 2010. This wasn't that hard. We wrote C++ programs over many years that would put this on the worldwide grid, and the grid might run 10,000 jobs in Torino and another 10,000 jobs in Chicago and kick them back, formatted and unpacked. In eight hours, I'd have 40,000 jobs done. If I made a mistake, I'd run 40,000 more jobs.
So it's kind of funny: almost every time someone has shown me a scale problem in industry, there's already a solution that humans have figured out for other purposes. It's been really kind of funny that when I look at it, they don't think outside of the box. I'm in R, this machine's only got 32 gigs of RAM, I need more than 32 gigs, I just can't do it. And I'm like, well, we could just put this on a bigger machine, at least to solve it for today, or we could use something that doesn't require in-RAM analytics. So I haven't yet come across a problem that I hadn't already seen a solution for. I actually thought MapReduce was backwards when I left, because the way we had done things at CERN is you unpack the data, you do all your analytics at once, and then you repack it. And it was C++, so you could add templated classes. So when MapReduce 2 came out, I was much happier. When Spark came out, I was a lot happier. But I still thought they hadn't matched what the physics community had already done on these larger data sets, though they've definitely made it better since then. And just to kind of hit one more thing on
the research side, I find people that have gone through this rigorous level of research, who have in effect been data scientists for the large research projects, do very well, right? Starting in January, I'll have my third PhD physicist on the team, but I have plenty of other folks, a master's in bio, a master's in behavioral health, who add a lot of statistical rigor to the types of things we do. But in other cases, I've had people from the research world fail, because they just keep going down rabbit holes. They'll spend two weeks on hyperparameter tuning. Knowing when the business is going to get value has been a very tough thing for some folks to learn; basically, perfection is the enemy of good enough.
I love that phrase. And it's amazing to hear that from a person coming from academia, because I also worked in academia for quite a long time, so I can relate to what you're describing. It's great that you understand this distinction. So, going from academia to industry, on a personal and a professional level, how did you choose to do that? And what differences do you see there?
I know there are many people who are going after PhDs and might be thinking about this, so I think it would be great to hear from someone who has done it. Are there things that you regret, or things that turned out much better than expected? What's the overall experience you can share with us?
Yeah, so I would say it's sometimes very hard to leave academia and go into industry.
I know there are a lot of programs out there; there are a few fellowship programs now, right, that try to take people with MDs and PhDs and give them data science or data engineering skills. Those can produce folks who now have some understanding of business value, which is one good thing that can come out of those programs. Some people go get a master's in business analytics from the business schools, and those also produce people who understand business value without needing to be taught it from the get-go. I sort of got lucky with the role I had when I left for industry. It was super busy. I got my PhD, and the thought of doing a postdoc in that field, where the mean postdoc period is over 10 years, moving my family every two years to various institutions around the world, was not enticing. I wanted to dig in, develop my family and my career, and be compensated okay for that. And so I just kind of fell into healthcare.
An opportunity happened, got some experience, and then I dug in, right? I showed up to that
first job with a tie every day, right? I did that in that organization, which mattered.
So when they needed a manager, as people were adding value and one manager moved up to director,
it was easy for them to select me. I'd already been providing that value to the organization.
So for me, it was that focus on the business value that allowed me to get my feet in the door.
It allowed me to continue to move to do the things that I wanted to do.
So I just kept asking not what I find interesting, but what I find interesting that matters, right? One of the first things I did there was build a model that predicted whether people were going to come back to the hospital. Readmission models are still very popular; they were popular in 2011 when I built one.
And then I could meet with the providers discharging people from the hospital. We put the model on a dashboard that refreshed every hour, and they could see these colors, based on my models, saying, oh, this person's got a 20% chance of coming back within 30 days. I worked with them to develop interventions, and they worked to mitigate that. That was cool.
That added value and made me feel good: saving people's lives by providing information to doctors in hospitals, on screens that social workers could pay attention to.
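A toy version of that 2011-era readmission model: logistic regression over a few encounter features, surfacing a 30-day risk score for a discharge dashboard. The features and data here are invented for illustration, not the model he actually built:

```python
# Toy version of a 30-day readmission risk model; features and data are
# invented for illustration, not the model he actually built.
import pandas as pd
from sklearn.linear_model import LogisticRegression

encounters = pd.DataFrame({
    "length_of_stay":   [2, 9, 4, 1, 12, 3],
    "prior_admissions": [0, 3, 1, 0, 4, 1],
    "age":              [54, 78, 66, 41, 82, 59],
    "readmitted_30d":   [0, 1, 0, 0, 1, 0],
})

X = encounters.drop(columns=["readmitted_30d"])
y = encounters["readmitted_30d"]
model = LogisticRegression().fit(X, y)

# At discharge, a dashboard could color-code this probability for care teams.
new_patient = pd.DataFrame(
    {"length_of_stay": [7], "prior_admissions": [2], "age": [70]}
)
print(f"30-day readmission risk: {model.predict_proba(new_patient)[0, 1]:.0%}")
```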
Yeah, makes sense.
I mean, from my experience and my perspective, people usually go after an academic career because there is a passion behind it, right? Going and doing a PhD in particle physics, you need to be passionate about something to do that. The same holds not only in physics but in other disciplines too.
So I think what is quite important, and it's the industry's responsibility to figure out how to do it, is that when you try to attract these people out of academia and into industry, beyond the monetary benefits, you also have to see how these people can get passionate about the problems they are going to be solving.
And that's what I get from what you said about the first problem you solved in healthcare and how it drove your passion for working with data. We also have to remember, and okay, we have a pretty technical audience out there, but most people don't realize that doing physics today is mainly a data science problem. I mean, we saw the first image of a black hole. That happened mainly because people were crunching a lot of data; a big part of that work was finding the right algorithms and the right processes to take the raw signal and turn it into something we can consume and understand as humans. And that's exactly what the LHC is doing.
The finance sector is doing a pretty good job attracting these people. But there's still a lot of talent out there, and if the industry figures out the right ways to do it, there's going to be a lot of value driven from there, without wanting to steal everyone away from academia, of course.
But yeah, that's super, super interesting.
I think we are at the end of our recording, Jason.
I really, really enjoyed it.
I think we can keep chatting for hours.
We have many topics that we didn't even touch.
And I'm really looking forward to having another chat with you in the future.
Yeah, I really appreciate it, you guys. For me, it was really fun to talk about these things. So Eric and Kostas, it was really cool.
Yeah, we really appreciate having you on the show. It's a treat for us any time we get to talk about Kafka and particle physics in the same conversation; not many people have that privilege. So thank you for joining us. We're really excited about the work that you're doing at Bind.
And we'll reach out again in maybe six months or so to see how things are going.
I'd be looking forward to it. Thanks a lot, guys.
Well, that was a fascinating conversation. I think one of the most fascinating things I learned was that when you unpack something like a medical claim, doing something valuable with data that comes in a certain format or a certain size just creates all sorts of interesting
complications. And of course, hearing about the scale of data that Jason's worked with was
fascinating to me. But Kostas, what stuck out to you and what did you learn today?
That's a great point, Eric. I think people have been spoiled by interacting with digital products, and we don't really understand the complexity behind the technology itself. We are also, let's say, a little bit oblivious to how powerful a processing machine the human brain is, right? We consider something like a medical claim to be something we can process quickly, but actually working with it, and representing it in a way that a machine can work with, can become extremely complicated. So it was a great discussion to have, to communicate and help people understand the complexity of the tasks that a data scientist or a data analyst or a data engineer has to go through in order to ensure that value is extracted at the end and delivered to all of us.
So that was great. I really enjoyed discussing the complexity of the data. That's, I think, also a benefit of talking with someone who's a data scientist, because a big part of a data scientist's work is to navigate this complexity and find ways to compress it. And of course, for me, it's always a great pleasure to chat with people who have come from the academic environment into industry, because these people are usually very, very passionate about the things they do. And I think this is something we also experienced today with Jason.
As a person who's also passionate about data, it's always a great pleasure to discuss with someone who shares this passion. And I'm extremely happy that I also managed to learn a few more things about projects like those at CERN, and how humanity is actually pushing forward the state of the art when it comes to data and our understanding of the world in general. So I hope we will have the opportunity to chat with him again in the future. I think we have many more things to discuss.
Yeah, I think it was great. One other thing that was very interesting to me was how seamless it
seems like the relationship is between data science and marketing. And that's pretty unique, you know, even from a technical standpoint. And so, you know,
hats off to Jason and the entire team at Bind for building something pretty special there,
it seems like. And we'll look forward to catching them again on another episode of The Data Stack Show. Thanks for joining us, and we'll catch you next time.