The Data Stack Show - 02: The Importance of Data During a Global Pandemic with Utkarsh Gupta of 1mg
Episode Date: August 19, 2020

In this episode, Kostas Pardalis sits down with Utkarsh Gupta, senior engineer of data science at 1mg, India's largest online healthcare platform. Together they discussed 1mg's data infrastructure, its response to the global pandemic, and how data drives their product and their business. Highlights from the show included discussions about:

Utkarsh and 1mg's background (1:32)
1mg being based on a bedrock of data (4:25)
Business analytics (5:33)
Effects of the COVID-19 pandemic on business (11:40)
Description of 1mg's data stack (16:53)
Biggest challenges faced and managing collaboration (27:08)
Opinions on open source technology (40:31)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
This week we had the pleasure of talking with Utkarsh from 1MG.
1MG is India's largest online healthcare platform,
and with such a huge market in India and a global pandemic going on,
you can imagine that we had so many things to learn from him
with all the data they're collecting from users across all their different products.
Exactly. In this episode of the Data Stack Show, we discussed with him about the data infrastructure
they have built in 1MG that also helped them respond to this pandemic
and how they use data to drive their product and business.
We will hear from him on how they use the data to create highly personalized recommendation systems,
work with very noisy medical data to create a consolidated medical view,
and of course, how the data is used to drive business decisions.
Hi, Utkarsh. It's very nice to have you today.
Hi, Kostas. How are you doing?
I'm doing well. I hope you are also
doing well. So let's start this episode of our podcast. Can you do a quick introduction about yourself, your background, and how you ended up working with data, and also share with us some information about the company you work at right now, which is 1MG.
Sure.
So I'm a 2012 graduate from IIT Bombay in India.
And after graduating from college, I worked in multiple domains like big data consulting, algorithmic trading, and mobile advertising, before joining 1MG, my present company, in 2017 to lead their data science team.
Even before joining 1MG, I was very intrigued by a lot of the things you can do with data, all of the predictive and statistical analysis that you can do with it. And that's how I sort of moved from being an electrical engineer to pursuing data science. And since 2017 at 1MG, I've been involved in almost every part of the business, in sort of touching that part of the business with data science.
1MG started off in 2012 as HealthKart Plus, which was an initiative to provide information
about medicine, like side effects, substitutes, and compare prices of different products.
If you think about the healthcare space in India, there's a huge issue around price opacity, which leads to a cost burden on the consumers.
Especially for generic medicines, the price variation is huge, at times up to 80% between brands, and the consumers are just not aware.
So after 1MG started, it got some early traction, and encouraged by this, the founders pursued 1MG as a fully dedicated opportunity.
Soon after that, 1MG became India's largest health app
and kept on adding features,
learning from what the customers said
and rebranded itself as 1MG in 2015.
1MG is an integrated health app
and offers online pharmacy, diagnostics, and consultations at scale, all in one place, packaged in a one-stop solution. In addition, the app also has a ton of digital health tools like
medicine reminders, digital health records and much more to make healthcare management
very easy for the patients. The goal essentially is to
provide a 360 degree healthcare service in very few clicks.
That's great. So based on what you are saying, I'm assuming that data is quite
important in 1MG. Can you expand a little bit on this? Can you share
a little more information around the role of data in your organization
and how important it is in driving the business and also supporting your product?
Absolutely. Data is an extremely important commodity here at 1MG.
A lot of what we've built at 1MG has been learned from what the customers have said.
We use data for decision making on all the fronts of our business, from what we procure and stock in our warehouses to what features to build in mobile applications to deciding price of different services offered by the company. So in a nutshell, we're essentially based on a bedrock of data.
And that's how all of the important considerations and decisions are taken in the business.
That's great. Can you mention one or two use cases of data inside 1MG that you find quite important, interesting, and fascinating?
Okay, so business analytics is obviously at the core when it comes to running a B2C company.
It helps in understanding what initiatives have worked, which ones need some rethinking.
Even in the product team, we've transitioned to an experiment-driven approach
where all important features are A/B tested before releasing them into production.
From deciding the placement of elements on the homepage or product pages to pricing decision of subscription plans,
all of that is completely data-driven. Me being part of the data science team,
one of the biggest areas of AI ML for 1MG
is creation of a single unified health repository,
which gives a complete picture of an individual's health.
Using patients' longitudinal health data, we've developed disease progression models for chronic conditions that make personalized education and interventions possible. 1MG also brings artificial intelligence
based differential diagnosis to the forefront of patient doctor conversations. We recently
published an article around probabilistic model for differential diagnosis in one of the top medical journals.
Apart from this, at 1MG, we also do a lot of cutting edge work in commerce as well.
So we have an in-house order delivery time prediction engine, developed using deep learning algorithms, which is responsible for high-fidelity, accurate ETAs
as part of the order fulfillment pipeline.
This has helped us in reducing order cancellations
and improving customer ratings.
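As a toy illustration of the idea behind such an ETA engine (not 1MG's actual deep learning model; the routes and hours below are invented), a baseline predictor might simply average historical delivery times per route, falling back to a global mean for unseen routes:

```python
from collections import defaultdict
from statistics import mean

class EtaBaseline:
    """Toy ETA predictor: average historical delivery hours per
    (warehouse, city) route, falling back to the global mean for
    routes never seen before."""

    def __init__(self):
        self.by_route = defaultdict(list)
        self.all_hours = []

    def record(self, warehouse, city, hours):
        self.by_route[(warehouse, city)].append(hours)
        self.all_hours.append(hours)

    def predict(self, warehouse, city):
        hist = self.by_route.get((warehouse, city))
        return mean(hist) if hist else mean(self.all_hours)

eta = EtaBaseline()
eta.record("gurgaon", "delhi", 24)
eta.record("gurgaon", "delhi", 30)
eta.record("mumbai", "pune", 48)
print(eta.predict("gurgaon", "delhi"))   # 27.0, from route history
print(eta.predict("gurgaon", "jaipur"))  # 34.0, global-mean fallback
```

A real engine would learn from many more features (inventory, courier load, distance), but the train-on-history structure is the same.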
Personalization is also embedded in almost all the parts
of the 1MG's platform.
And to enable better recommendations
and to serve personalized offerings
to the visitors on the platform,
at 1MG, we use state-of-the-art techniques
like collaborative filtering and transformer models
to build our recommendation engine.
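A minimal sketch of the collaborative filtering idea mentioned here, using item-item cosine similarity over co-purchase sets (the users and product names are hypothetical, and 1MG's real engine is far more sophisticated):

```python
import math
from collections import defaultdict

# Hypothetical purchase histories: user -> set of items bought.
purchases = {
    "u1": {"vitamin_c", "thermometer"},
    "u2": {"vitamin_c", "thermometer", "mask"},
    "u3": {"mask", "sanitizer"},
    "u4": {"vitamin_c", "sanitizer"},
}

# Invert to item -> set of users who bought it.
buyers = defaultdict(set)
for user, items in purchases.items():
    for item in items:
        buyers[item].add(user)

def cosine(a, b):
    # Cosine similarity between two items' binary buyer vectors.
    inter = len(buyers[a] & buyers[b])
    return inter / math.sqrt(len(buyers[a]) * len(buyers[b]))

def recommend(user, k=2):
    # Score each unowned item by summed similarity to owned items.
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for other in buyers:
            if other not in owned:
                scores[other] += cosine(item, other)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))  # ['mask', 'sanitizer']
```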
And all of this is definitely possible
because of our data collection and data processing technologies
that we use at 1MG.
So it's definitely something that is really, really important at 1MG.
Yeah, it's very interesting what you are saying.
From what I understand, 1MG is a company that pretty much incorporates almost everything that has to do with processing data: from what we understand as the typical business intelligence use cases, like helping the business understand what happened in the past and also informing their decisions for the future, to product analytics. From what I understood, you are driving the whole product development using data.
But you are also actually creating, let's say,
what I usually call data products.
Like you use the data from your customers
to create products on top of that,
like this unified patient.
Health repository, yes. Yeah, which is quite interesting, and I'm pretty sure that you are using the whole spectrum of different methodologies that are out there to analyze and work with the data. That's
quite fascinating. So from what I understand, and also being a company that works in healthcare, and at the same time in a huge country like India, I assume that the volumes of data that you have to work with on a daily basis are quite big. So can you give us a sense of the volume of users
or data that you are dealing with every day
and also their complexity?
Because from what I understand,
you are also dealing with quite complex data.
Yes.
So every day,
1MG is visited by a few million users that are here to fulfill a variety of their healthcare needs,
ranging from information about medicines prescribed to them to ordering healthcare products and services.
As a company, we collect data from multiple sources: there are clickstream event sources like BigQuery, there is Rudder, which helps us collect a lot of clickstream data, we have API logs, and we have
transactional data residing in our databases. Other than that we have CRM
systems, third-party point-of-sale systems, marketing platforms. So it's not just the volume of data,
but also the variety of data that we collect at 1MG.
Not everything is structured and in tabular format.
We collect a lot of other variety of data like prescription images,
lab reports, which are essentially PDFs. We have emails, audio messages
of conversations between the customer care agent and the user. All of these form a good
percentage of our data lake as well. Every single day, through the ETL pipeline, the data pipeline that we maintain, we extract and transform a few TB of data into our data lake, and finally make it available for downstream use cases by loading the processed data into multiple databases.
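The transform step described here, converting raw nested JSON into flat, tabular rows ready to load into a warehouse, can be sketched as follows (the event fields are hypothetical, not 1MG's actual schema):

```python
import json

def flatten(record, prefix=""):
    """Recursively flatten a nested JSON object into a single flat
    dict, joining nested keys with underscores."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        else:
            row[name] = value
    return row

raw = json.loads(
    '{"event": "product_viewed", "user": {"id": 42, "city": "delhi"}, "sku": "VITC500"}'
)
print(flatten(raw))
# {'event': 'product_viewed', 'user_id': 42, 'user_city': 'delhi', 'sku': 'VITC500'}
```

In 1mg's stack this normalization happens in SQL on Athena rather than in Python, but the shape of the transformation is the same: nested source records in, warehouse-friendly columns out.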
Sounds good. So, in the spirit of the period we are living through, I assume the COVID-19 pandemic has affected your business, and consequently also the data that you have to work with and your infrastructure?
So yes, these are definitely testing times for every company, and especially the startup ecosystem, where life can be very volatile.
However, being a part of essential services provider in India, 1MG has been online and running throughout the lockdown process,
providing critical healthcare products and services to our users.
That being said, from the outside it doesn't seem that there's a lot of effect on the business, but internally, the data that we actually capture and use for a lot of our downstream use cases has definitely changed in a lot of its dynamics. So to give you an
example, during the lockdown, when the lockdown happened, it was so abrupt that every company's
operations were affected. So our inventory and stockpiles were affected. The time duration that we were delivering orders
to customers was affected.
And as I mentioned earlier,
we have an in-house built order delivery time prediction engine
where we actually predict how much time an order
will take to get delivered to a person's doorstep.
That whole engine needed rethinking
because since we were learning on past data
and the data dynamics had changed
because of the COVID-19 pandemic,
we had to reevaluate how we were actually training
that model and utilizing those predictions.
So it's definitely sort of kept us excited and interested. But yes, these are definitely testing times for every company.
Yeah, it's very interesting. So how long did it take you to iterate on this? You mentioned that you had to create a new model
and rethink the whole process around some sides of the product.
It would be very interesting to hear about your reaction as a company to that.
So we at 1MG have always been very proactive in taking these decisions.
And given our data-driven approach, as in we've allowed the data to speak for itself.
So as soon as we started seeing discrepancies in our predictions and the actual delivery times on the ground,
that was sort of within a week of the lockdown starting. We started rethinking how to re-engineer our features, what to change in the model, and how to retrain the model. So we did a very interesting study at that point, where we trained our order delivery time prediction engine on datasets from three different times. One was where we were seeing very normal distributions in our data. The other was where we had disrupted data due to events that were very localized, like cyclones happening in a particular part of the country, or festivals that used to sort of disturb the normal flow of operations.
So we trained on the normal period and tested on the disrupted period, then trained on the disrupted period and tested on the normal period, and we did a study to understand whether the change we were proposing in the model would work out or not. It definitely did show a lot of improvement, and therefore, within two weeks of the lockdown starting, we were ready to deploy a new, revamped model for all the time predictions.
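The cross-period study described above can be sketched with a deliberately trivial model (the delivery times below are invented): train a predictor on one period's data and measure its error on another, building the full train/test matrix.

```python
from statistics import mean

# Invented delivery-hour samples for two periods.
normal = [24, 26, 25, 27, 24]       # delivery hours in a normal period
disrupted = [48, 55, 60, 52, 58]    # delivery hours during the lockdown

def train(hours):
    # "Model" = predict the training-period mean for every order.
    avg = mean(hours)
    return lambda: avg

def mae(model, hours):
    # Mean absolute error of the model on a test period.
    return mean(abs(model() - h) for h in hours)

# Train on each period, test on each period.
for train_name, train_set in [("normal", normal), ("disrupted", disrupted)]:
    for test_name, test_set in [("normal", normal), ("disrupted", disrupted)]:
        err = mae(train(train_set), test_set)
        print(f"train={train_name} test={test_name} MAE={err:.1f}")
```

The off-diagonal cells (train on one period, test on the other) show exactly the drift that forced the retraining: a model fit to normal times performs badly on disrupted data, and vice versa.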
That's very interesting.
And I think it's a very good example of how important it is for a company to actually be data-driven
and how it can help an organization to first identify really
fast what is happening out there and second to react to that really fast.
That's great. So moving forward and starting to discuss a little bit more
about the technical side of things, can you give us a brief description of the current data stack infrastructure that you
have in 1MG and what kind of technologies and architecture you are using?
Sure.
So we're completely cloud-based at 1MG.
Our services and databases are hosted on AWS, and we use many out-of-the-box services from the AWS stable, like RDS, Athena, and EKS.
Most of our transactional databases are in RDS or MongoDB, but our data lake is primarily based on S3, the object storage service provided by AWS.
Now, the data that is stored in S3,
we access it through Athena,
which is a managed Presto database provided by AWS again.
Athena essentially stores the metadata and schema information for the data in our data lake.
All the major transformations that we do
of converting data from JSON to readable tabular format
or for joining different data that is stored in the data lake
to finally come up with a normalized or processed database
happens through Athena only.
Once the data has been transformed and processed, it's sent to different databases
for the downstream use cases,
like it's sent to Redshift for all business
and product analytics use cases.
It's sent to Cassandra for user personalization and all of that.
So that's our core data processing data stack.
However, we also use Kafka for real-time data processing and feature creation for our recommendation use cases. Very recently, we moved our event collection infrastructure in-house by getting RudderStack on board, and this has tied up very well with our data lake that's in S3 and our real-time data processing that happens in Kafka. And that's because RudderStack supports both S3 and Kafka as destinations out of the box.
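The fan-out pattern described here, one incoming event stream routed to multiple destinations, can be sketched with in-memory stand-ins for the S3 and Kafka sinks (the real system would use actual clients, of course):

```python
class EventRouter:
    """Minimal fan-out router: every published event is delivered
    to all registered destination sinks."""

    def __init__(self):
        self.sinks = []

    def add_sink(self, sink):
        self.sinks.append(sink)

    def publish(self, event):
        for sink in self.sinks:
            sink(event)

lake, stream = [], []
router = EventRouter()
router.add_sink(lake.append)    # stand-in for an S3 batch writer
router.add_sink(stream.append)  # stand-in for a Kafka producer
router.publish({"event": "add_to_cart", "sku": "VITC500"})
print(len(lake), len(stream))  # 1 1: same event landed in both sinks
```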
That's interesting.
So you mentioned at the beginning of our discussion the importance of collecting all the different data that you have. And this is, from what we can all understand, the foundation of all the work that can be done on your data. Can you elaborate a little bit more on this ETL pipeline that you have, in the broader sense of the word ETL, and the types of data that you are collecting and working with?
So as I mentioned earlier, we collect data from a lot of different sources.
So there are transactional databases, there are clickstream data sources, we collect API
logs, the CRM systems. So the first step in our ETL pipeline
is to extract the data from all of these different sources
and push them into a data lake, which is S3.
We do this through multiple ways.
So we use data migration service provided by AWS to move data from Postgres or MySQL
or MongoDB databases directly into S3.
We have written custom scripts to pull data from clickstream sources like BigQuery on
a daily basis and dump them into S3.
We've written our own data pipeline to push data from different microservices
and the API logs again to our data lake in S3.
So once this first step of extracting the data is done and everything resides in S3, the next step is essentially to convert all of this data that's been extracted from different sources into a usable format, to sort of normalize all of this data that's been collected, and to create single tables that have all the data joined from multiple source tables. All of that happens through SQL queries that are run on the Presto database that I mentioned, Athena, which is managed by AWS again. So after a bunch of SQL transformations happen on the data, we finally get far fewer destination tables, which are normalized and sort of shaped according to the different downstream use cases that we want to have.
This data, again, is dumped into S3
so that it stays part of the data lake.
But from there, it's also pushed off
to other downstream databases like Redshift and Cassandra
for the different use cases to happen.
So this is more or less the outline of our ETL data pipeline. Sounds good. So
from what you said you have many different types of data that you collect and I guess that as a
B2C company, clickstream data is also important. Can you tell us a few things about the importance of clickstream data inside 1MG, the volume of clickstream data that you have to work with, the sources, and how you actually perceive this clickstream or event data? What is its role? Is it some kind of way to capture behaviors, or something else?
I think it would be very interesting to hear your opinion on that.
So clickstream or event data is an extremely critical source of data at 1MG.
It helps us hugely in a lot of our use cases that we have in our product team or the data science team where the team is looking to answer multiple questions.
To give you some examples, so the product team wants to understand and learn from customer behavior and take appropriate action. All the product experiments are dependent on accurate
and reliable collection of user events.
Whether it's the design of new components for the homepage or the product pages, it's done through an A/B test, and a change goes out only when the A/B test actually shows downstream metrics improving from the efforts that have been taken. None of this would have been possible without fine-grained user behavior events.
We utilize them to train our models and we utilize them for a lot of our predictions and inferences as well.
Again, to give you some examples here, so we sort of power multiple widgets,
product recommendation widgets across the 1MG platform. So when a person visits a product page,
we not only look at the metadata of the product page to decide what to fill in the widget, but we also look at the past products that the user has viewed, added to cart, or even purchased, to decide what the best recommendations for the user are, so that it improves downstream metrics like clicks, conversions, and the overall GMV that we get out of recommendations.
So event data is definitely very, very critical.
At 1MG, event data, and not just at 1MG, everywhere,
event data is collected from the user's device,
whether it's a mobile or desktop browser or an Android or iOS application.
At 1MG, we've sort of built an in-house data collection infrastructure powered by Rudder.
So the event stream is sent to Rudder from where it gets redirected to the multiple destinations from where we consume
that data, like S3, Google Analytics, Kafka.
So yeah, we've sort of improved our event collection infrastructure a lot.
And earlier, though we were collecting events through Google Analytics 360 into BigQuery,
we were only processing them once a day or a few times a day as batch processing only for analytics use cases.
But once we got access to real-time event streams, we've been sort of pursuing a lot of real-time recommendation use cases as well
where we were learning on
what the user has clicked,
what's happened in the session
till now to define
what the recommendations for the user should be.
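A much-simplified sketch of such session-based ranking: score candidate products by how well they match the categories the user has clicked so far in the session (the catalog and events below are hypothetical, and the real system learns far richer signals):

```python
from collections import Counter

# Hypothetical catalog: SKU -> category.
catalog = {
    "VITC500": "immunity",
    "ZINC50": "immunity",
    "BPM100": "devices",
    "THERMO1": "devices",
}

def rank(session_clicks, candidates):
    """Order candidates by how often their category was clicked
    in the current session (most-clicked categories first)."""
    cat_counts = Counter(catalog[sku] for sku in session_clicks)
    return sorted(candidates,
                  key=lambda sku: cat_counts[catalog[sku]],
                  reverse=True)

session = ["VITC500", "ZINC50", "BPM100"]  # clicks so far in this session
print(rank(session, ["THERMO1", "ZINC50"]))  # ['ZINC50', 'THERMO1']
```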
So, Utkarsh,
so far we discussed a little around the
technology and the data sources
and the structure of the data.
I think it will be interesting also to talk a little bit about the people that both work with the data, of course,
but also maintain this infrastructure.
So starting with the team that is maintaining the data stack that you currently have in 1MG,
can you share a little bit of information about the people who are working there and their roles? And, yeah, I mean, how important is it? And what are the difficulties, let's say, that they are facing in their day-to-day work?
Sure. So we have a very lean team that comprises a couple of data engineers who work very closely with the data lake and are responsible for running and maintaining the ETL pipeline.
Other folks that work with the data usually work
after the data has sort of been processed
and it reaches the Redshift or the Cassandra database.
So folks on the business and product analytics team
or the data science team benefit hugely
from the work done by the data engineering team
and are able to deliver their targets
more productively and efficiently
because of our ETL pipeline.
As I sort of mentioned earlier, when I was talking about the use cases of data, one of the things that we do on the data science team is to create this single unified health user profile.
In healthcare especially, the data is very unstructured. It's not standardized. It's not
computer readable. And this presents a very unique challenge when we think of building a unified health repository.
So one of the biggest areas of work
for the data engineering and AIML at 1MG
is thus to sort of create this complete picture
of individuals' health and their behavior
by both constructing a single data lake or an ETL pipeline
as well as deploying a number of AI models that convert this incoming unstructured data
into computer readable and standard data sets.
So this is sort of, what would I say, maybe the Holy Grail: if we were able to have all of our data in a machine-readable format, it could be used for a lot of downstream use cases.
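As a toy illustration of what converting unstructured text into a structured record means (real clinical NLP is far harder; the pattern and the sample prescription line below are invented):

```python
import re

# Hypothetical pattern for lines like "Metformin 500mg 2x/day".
RX = re.compile(r"(?P<drug>[A-Za-z]+)\s+(?P<strength>\d+\s?mg)\s+(?P<freq>\d+x/day)")

def parse_line(line):
    """Extract a structured {drug, strength, freq} record from a
    free-text prescription line, or None if nothing matches."""
    m = RX.search(line)
    return m.groupdict() if m else None

print(parse_line("Tab Metformin 500mg 2x/day after meals"))
# {'drug': 'Metformin', 'strength': '500mg', 'freq': '2x/day'}
```

In practice a regex like this only covers the easiest cases; scanned prescription images and lab-report PDFs need the AI models described above.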
Yeah.
Sounds good.
So apart from the technical teams that you have inside 1MG, which of course are working on maintaining the infrastructure and also on ML tasks and analytics tasks, do you have people from other functions who interact with the data that your team and the rest of the technical teams are delivering? What I'm talking about is mainly teams from marketing or sales. And how do you manage the interaction with these teams and the data that they need? And how important is this for the organization?
I would say data is sort of omnipresent.
Everybody, even the people who talk to customers on the customer service team, or the doctors who are talking to patients through the online consultation medium, all of them are utilizing some form of data or another to decide what their next steps should be, where they are going wrong, or how they should change their approach.
So the marketing team uses data very, very centrally to decide who is the right person for the campaign that they want to run.
They decide on an almost daily basis who the audience is for the email that they're about to send, or the push notification that they're about to send. So all of these are sent in a very targeted manner.
Because if you think about healthcare,
not everything is relevant for everybody.
As in, if you have designed a campaign for a diabetic user,
a person who is not diabetic might really not be interested
in actually looking at the details of this campaign.
So targeting is definitely very, very central
for the marketing team.
And therefore they utilize a lot of their signals
and data points to define good cohorts to target to.
The sales team and even the supply chain, the team that manages the supply
chain has a lot of data at their disposal about what inventory they have in stock, what are the
fill rates of orders that are coming in daily. So I would say a lot of teams apart from the core business and product analytics team and
data science team are utilizing data. They're, however, not independently querying databases and sort of getting their cuts and their metrics analyzed; they take help from these core teams, the business and product analytics team and the data science team, to enable their sort of view of the data. So they sort of work very collaboratively with the analytics team to figure out: What is the data that they're interested in? What are the cuts that will be useful for them?
Whether this is a one-time thing, or whether it would be great if this could be sort of automated and run in a scheduled manner, so they could be sent emails or notifications whenever this data throws up an anomaly.
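Such a scheduled anomaly check can be sketched as a simple z-score rule (the threshold and the order counts below are illustrative, not 1MG's actual alerting logic):

```python
from statistics import mean, stdev

def is_anomaly(history, today, threshold=3.0):
    """Flag today's metric value as anomalous when it sits more
    than `threshold` standard deviations from its history."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > threshold * sigma

orders_per_day = [1000, 1040, 980, 1010, 990, 1020]
print(is_anomaly(orders_per_day, 1015))  # False: a normal day
print(is_anomaly(orders_per_day, 1400))  # True: worth an email alert
```

A scheduler would run this over each team's metric daily and send the email or notification only when the check fires.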
So, yeah, it's sort of omnipresent.
Everybody's using it.
The analytics team and the data science team are at the core of it; they understand and maintain the data, and help everybody in enabling their use cases in a data-driven manner.
That's great. So by having data omnipresent in the organization, I guess there are also challenges
around working with it, either on an organizational or a technological level. So from your experience so far, what kind of challenges do you think you currently have, and how do you think these challenges will be addressed in the future, both from a technology perspective, like if there's a technology that's missing, or something that you haven't implemented yet and are thinking of implementing as part of your stack, but also on an organizational level, because that's also important.
So yeah, I'd love to hear your opinion on that.
So at 1MG, we have a very high variety of data.
We have a lot of different data sources
and also a very high volume of data sources
and data coming from a bunch of these data sources
like clickstream, transactional databases.
So there are, I would say, three main problems that I see that we're trying to solve.
One is what I've already mentioned, that in healthcare, data is unstructured and not standardized. And that's sort of one of the underlying fundamental problems that we've
been trying to solve over the past few years. But other than that, there is the problem of
data discovery, when there is so much data coming in and flowing in from different sources.
Discovering the data that you actually want to fulfill your use case can sometimes be very, very difficult.
Also, with democratizing data, when everybody is using data, there is a problem of collaboration. So there might be a team somewhere that is actually
generating as well as utilizing their data very efficiently. But because there is so much volume
and variety of data that we're actually processing, transferring that knowledge from this team to
other teams so that other teams can collaborate
and not only enrich the data for that team, but also utilize the data that this other team is
generating can be a huge problem. How we're actually looking at solving this problem, or how we're actually focused on solving it, is by building a single unified data processing pipeline, as well as a single unified user and data repository. So earlier, as the business grew, we did have a lot of our core data residing in one place, but there were a lot of fringe elements that were popping up: the CRM systems did not talk to our core database, marketing systems did not talk to our core database. The clickstream systems, when they sort of evolved, were not tied up very closely with our transactional databases. And so we ended up building different ETL pipelines for extracting, transforming, and loading these data sources in a processed manner for consumption by these teams.
Very soon, we realized that this was sort of aggravating the problems of data discovery
and collaboration.
And since the past one year, we've been very, very focused on not creating multiple pipelines,
but sort of incorporating all data processing into a single pipeline,
so that everybody has visibility on what data is entering the pipeline,
what data is at what stage of processing and what are the final outputs.
The data schema can be centralized.
It can be distributed to everybody.
This has sort of alleviated some of our pains of discovering new data, or collaborating and enriching a bunch of our data sources. Yes, so our current stack is definitely focused
on removing these pain points for us.
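The "single pipeline with visibility" idea can be sketched as a central catalog where every dataset entering the pipeline registers its schema and processing stage, so any team can discover what exists and where it is (the dataset names and schemas below are hypothetical):

```python
class PipelineCatalog:
    """Toy central registry for datasets flowing through one
    unified pipeline: each entry records a schema and a stage."""

    def __init__(self):
        self.datasets = {}

    def register(self, name, schema, stage="raw"):
        self.datasets[name] = {"schema": schema, "stage": stage}

    def promote(self, name, stage):
        # Move a dataset to a later processing stage.
        self.datasets[name]["stage"] = stage

    def discover(self, stage):
        # Let any team list the datasets available at a given stage.
        return [n for n, d in self.datasets.items() if d["stage"] == stage]

catalog = PipelineCatalog()
catalog.register("clickstream_events", {"user_id": "int", "event": "str"})
catalog.register("orders", {"order_id": "int", "eta_hours": "float"})
catalog.promote("orders", "processed")
print(catalog.discover("processed"))  # ['orders']
```

The point is less the data structure than the contract: one registry means one place to look for schemas and stages, which is exactly the discovery and collaboration pain described above.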
That's great.
I think what you mentioned, especially points two and three, let's say the pains that you're identifying, which are about data discovery and also collaboration, I think it's a problem that affects almost every company out there that is growing and is becoming, let's say, or trying to become, more data-driven. So it's very interesting to see how this is going to progress and what kind of solutions the market will come up with for these kinds of problems.
Because it looks like we have figured out so far
how to access the data, how to collect the data,
how to create very scalable pipelines for ETL
and have all the different tools there
in place to query the data.
But from what I understand, and I think that this is also what I get from what you said,
there's a lot of work to be done on how we can make sense out of this data and how we
can democratize it, as you very well said,
and make it available to everyone inside the organization,
which is very, very interesting.
And it's great to hear that there are companies that are at the forefront of this and are trying to solve these kinds of challenges.
So moving forward and reaching the end of our discussion,
I'd like to hear a little bit about
your opinion and also 1MG's opinion
through you about open source.
How important do you think the open source movement around building software has been for enabling these data stacks that we are talking about today?
And finally, tell me, tell us,
share with us like a couple of tools,
open source tools preferably that you really love working with as part of your
everyday job.
Whenever there is a new promising technology out there, we're not sort of shying away from it; we're sort of inviting it with open arms.
So we use a lot of open source technologies, or we've sort of used a lot of open source
technologies over the past few years at 1MG. Especially if I talk about data science,
we've only been using a lot of open source libraries
like Keras, TensorFlow, PyTorch.
Even the recent tree-based models like XGBoost or LightGBM have been used in our AI ML stack very, very actively.
We use a lot of open source database technologies like Cassandra that I spoke about, Kafka. Even Rudder is a great tool to have in our arsenal
and it's open source as well.
So definitely, I would say that open source
is in 1MG's DNA for sure.
We have not been shying away from it
and we've been sort of working with it very, very actively.
Some of the open source tools that I personally like to work with are definitely Kafka and Cassandra, because these have very, very nice use cases at 1MG, and we're sort of picking them up and sort of bulking them up. So a lot of my daily work happens in and around these tools.
To connect both Kafka and Cassandra, we've been using Spark. So Spark is again a tool that really binds all of these distributed technologies
very very beautifully. Also working with technologies would be very mundane or very boring if you did not have good and rich data to work with.
So I would also give a shout out to a bunch of our data collection infrastructure,
which has made it possible to enjoy the work with a lot of open source technologies and especially Rudder because Rudder connects very, very seamlessly
in our data stack and is able to seamlessly transfer
the events collected from the user's device in real time,
both to our data lake as well as to Kafka topics.
And because of this variety of data that we have at our disposal,
working with technologies like Spark and Cassandra and Kafka have become
very, very interesting nowadays.
That's great.
That's all for today.
And thank you so much for your time.
It was really enjoyable to discuss and learn more about
1MG and I'm pretty
sure that we'll have the opportunity again in the
future to discuss again
and learn more
about all the amazing
and fascinating things that you're building
around data in your organization.
Thanks, Kostas. It was wonderful
talking to you.
It was great having you on this episode
of the Data Stack Show.
I loved learning about all the challenges
that they have working with medical data specifically,
but it's amazing the volume of data
that they have collected in the past couple months
responding to the pandemic.
And it's been fun to be a part of that journey with them.
We'll check back in with them in the next couple of months to see the different ways
that they're using the data in different parts of the teams as they direct the data stream
to other products and other use cases inside their company.
We'll catch you next time.