The Data Stack Show - 02: The Importance of Data During a Global Pandemic with Utkarsh Gupta of 1mg
Episode Date: August 19, 2020

In this episode, Kostas Pardalis sits down with Utkarsh Gupta, senior engineer of data science at 1mg, India's largest online healthcare platform. Together they discussed 1mg's data infrastructure, its response to the global pandemic, and how data drives their product and their business. Highlights from the show included discussions about:

Utkarsh and 1mg's background (1:32)
1mg being based on a bedrock of data (4:25)
Business analytics (5:33)
Effects of the COVID-19 pandemic on business (11:40)
Description of 1mg's data stack (16:53)
Biggest challenges faced and managing collaboration (27:08)
Opinions on open source technology (40:31)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
This week we had the pleasure of talking with Utkarsh from 1MG.
1MG is India's largest online healthcare platform,
and with such a huge market in India and a global pandemic going on,
you can imagine that we had so many things to learn from him
with all the data they're collecting from users across all their different products.
Exactly. In this episode of the Data Stack Show, we discussed with him about the data infrastructure
they have built in 1MG that also helped them respond to this pandemic
and how they use data to drive their product and business.
We will hear from him on how they use the data to create highly personalized recommendation systems,
work with very noisy medical data to create a consolidated medical view,
and of course, how the data is used to drive business decisions.
Hi, Utkarsh. It's very nice to have you today.
Hi, Kostas. How are you doing?
I'm doing well. I hope you are also
doing well. So let's start this episode of our podcast. Can you do a quick introduction about yourself, your background, and how you ended up working with data, and also share with us some information about the company you work at right now, which is 1MG.
Sure.
So I'm a 2012 graduate from IIT Bombay in India.
And after graduating from college, I worked in multiple domains like big data consulting, algorithmic trading, and mobile advertising, before joining 1MG, my present company, in 2017 to lead their data science team.
Even before joining 1MG, I was very intrigued by a lot of the things you can do with data, all of the predictive and statistical analysis that you can do with it. And that's how I sort of moved from being an electrical engineer to pursuing data science. And since 2017 at 1MG, I've been involved in almost every part of the business, in sort of touching that part of the business with data science.
1MG started off in 2012 as HealthKart Plus, which was an initiative to provide information
about medicine, like side effects, substitutes, and compare prices of different products.
If you think about the healthcare space in India, there's a huge issue around price opacity, which leads to a cost burden on the consumers.
Especially for generic medicines, the price variation is huge, at times up to 80% between brands, and the consumers are just not aware.
So after 1MG started, it got some early traction, and encouraged by this, the founders pursued 1MG as a fully dedicated opportunity.
Soon after that, 1MG became India's largest health app
and kept on adding features,
learning from what the customers said
and rebranded itself as 1MG in 2015.
1MG is an integrated health app
and offers online pharmacy, diagnostics, and consultations at scale, all in one place, packaged in a one-stop solution. In addition, the app also has a ton of digital health tools like
medicine reminders, digital health records and much more to make healthcare management
very easy for the patients. The goal essentially is to
provide a 360 degree healthcare service in very few clicks.
That's great. So based on what you are saying, I'm assuming that data is quite
important in 1MG. Can you expand a little bit on this? Can you share
a little more information around the role of data in your organization
and how important it is in driving the business and also supporting your product?
Absolutely. Data is an extremely important commodity here at 1MG.
A lot of what we've built at 1MG has been learned from what the customers have said.
We use data for decision making on all the fronts of our business, from what we procure and stock in our warehouses to what features to build in mobile applications to deciding price of different services offered by the company. So in a nutshell, we're essentially based on a bedrock of data.
And that's how all of the important considerations and decisions are taken in the business.
That's great. Can you mention one or two use cases of data inside 1MG that you find quite important, interesting, and fascinating?
Okay, so business analytics is obviously at the core when it comes to running a B2C company.
It helps in understanding what initiatives have worked, which ones need some rethinking.
Even in the product team, we've transitioned to an experiment-driven approach
where all important features are A/B tested before releasing them into production.
From deciding the placement of elements on the homepage or product pages to pricing decision of subscription plans,
all of that is completely data-driven. Me being part of the data science team,
one of the biggest areas of AI ML for 1MG
is creation of a single unified health repository,
which gives a complete picture of an individual's health.
Using patients' longitudinal health data, we've developed disease progression models for chronic conditions that make personalized education and interventions possible. 1MG also brings artificial intelligence
based differential diagnosis to the forefront of patient doctor conversations. We recently
published an article around probabilistic model for differential diagnosis in one of the top medical journals.
Apart from this, at 1MG, we also do a lot of cutting edge work in commerce as well.
So we have an in-house order delivery time prediction engine, developed using deep learning algorithms, which is responsible for high-fidelity, accurate ETAs
as part of the order fulfillment pipeline.
This has helped us in reducing order cancellations
and improving customer ratings.
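As a toy illustration of the idea behind such an ETA engine (not 1MG's actual deep learning model; the routes and hours below are invented), a baseline predictor might simply average historical delivery times per route, falling back to a global mean for unseen routes:

```python
from collections import defaultdict
from statistics import mean

class EtaBaseline:
    """Toy ETA predictor: average historical delivery hours per
    (warehouse, city) route, falling back to the global mean for
    routes never seen before."""

    def __init__(self):
        self.by_route = defaultdict(list)
        self.all_hours = []

    def record(self, warehouse, city, hours):
        self.by_route[(warehouse, city)].append(hours)
        self.all_hours.append(hours)

    def predict(self, warehouse, city):
        hist = self.by_route.get((warehouse, city))
        return mean(hist) if hist else mean(self.all_hours)

eta = EtaBaseline()
eta.record("gurgaon", "delhi", 24)
eta.record("gurgaon", "delhi", 30)
eta.record("mumbai", "pune", 48)
print(eta.predict("gurgaon", "delhi"))   # 27.0, from route history
print(eta.predict("gurgaon", "jaipur"))  # 34.0, global-mean fallback
```

A real engine would learn from many more features (inventory, courier load, distance), but the train-on-history structure is the same.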
Personalization is also embedded in almost all the parts
of the 1MG's platform.
And to enable better recommendations
and to serve personalized offerings
to the visitors on the platform,
at 1MG, we use state-of-the-art techniques
like collaborative filtering and transformer models
to build our recommendation engine.
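A minimal sketch of the collaborative filtering idea mentioned here, using item-item cosine similarity over co-purchase sets (the users and product names are hypothetical, and 1MG's real engine is far more sophisticated):

```python
import math
from collections import defaultdict

# Hypothetical purchase histories: user -> set of items bought.
purchases = {
    "u1": {"vitamin_c", "thermometer"},
    "u2": {"vitamin_c", "thermometer", "mask"},
    "u3": {"mask", "sanitizer"},
    "u4": {"vitamin_c", "sanitizer"},
}

# Invert to item -> set of users who bought it.
buyers = defaultdict(set)
for user, items in purchases.items():
    for item in items:
        buyers[item].add(user)

def cosine(a, b):
    # Cosine similarity between two items' binary buyer vectors.
    inter = len(buyers[a] & buyers[b])
    return inter / math.sqrt(len(buyers[a]) * len(buyers[b]))

def recommend(user, k=2):
    # Score each unowned item by summed similarity to owned items.
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for other in buyers:
            if other not in owned:
                scores[other] += cosine(item, other)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))  # ['mask', 'sanitizer']
```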
And all of this is definitely possible
because of our data collection and data processing technologies
that we use at 1MG.
So it's definitely something that is really, really important at 1MG.
Yeah, it's very interesting what you are saying.
From what I understand, 1MG is a company that pretty much incorporates almost everything that has to do with processing data: from what we understand as the typical business intelligence use cases, like helping the business understand what happened in the past and also informing their decisions for the future, to product analytics. From what I understood, you are driving the whole product development using data.
But you are also actually creating, let's say,
what I usually call data products.
Like you use the data from your customers
to create products on top of that,
like this unified patient.
Health repository, yes. Yeah, which is quite interesting, and I'm pretty sure that you are using the whole spectrum of different methodologies that are out there to analyze and work with the data. That's
quite fascinating. So from what I understand, and also being a company that works in healthcare, and at the same time in a huge country like India, I assume that the volumes of data that you have to work with on a daily basis are quite big. So can you give us a sense of the volume of users
or data that you are dealing with every day
and also their complexity?
Because from what I understand,
you are also dealing with quite complex data.
Yes.
So every day,
1MG is visited by a few million users that are here to fulfill a variety of their healthcare needs,
ranging from information about medicines prescribed to them to ordering healthcare products and services.
As a company, we collect data from multiple sources: there are clickstream event sources like BigQuery, there is Rudder, which helps us collect a lot of clickstream data, we have API logs, and we have
transactional data residing in our databases. Other than that we have CRM
systems, third-party point-of-sale systems, marketing platforms. So it's not just the volume of data,
but also the variety of data that we collect at 1MG.
Not everything is structured and in tabular format.
We collect a lot of other variety of data like prescription images,
lab reports, which are essentially PDFs. We have emails, audio messages
of conversations between the customer care agent and the user. All of these form a good
percentage of our data lake as well. Every single day, through the ETL pipeline, the data pipeline that we maintain, we extract and transform a few TB of data into our data lake, and finally make it available for downstream use cases by loading the processed data into multiple databases.
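The transform step described here, converting raw nested JSON into flat, tabular rows ready to load into a warehouse, can be sketched as follows (the event fields are hypothetical, not 1MG's actual schema):

```python
import json

def flatten(record, prefix=""):
    """Recursively flatten a nested JSON object into a single flat
    dict, joining nested keys with underscores."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        else:
            row[name] = value
    return row

raw = json.loads(
    '{"event": "product_viewed", "user": {"id": 42, "city": "delhi"}, "sku": "VITC500"}'
)
print(flatten(raw))
# {'event': 'product_viewed', 'user_id': 42, 'user_city': 'delhi', 'sku': 'VITC500'}
```

In 1mg's stack this normalization happens in SQL on Athena rather than in Python, but the shape of the transformation is the same: nested source records in, warehouse-friendly columns out.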
Sounds good. So, in the spirit of the period we are living through, I assume the COVID-19 pandemic has affected your business, and consequently also the data that you have to work with and your infrastructure?
So yes, these are definitely testing times for every company, and especially the startup ecosystem, where life can be very volatile.
However, being a part of essential services provider in India, 1MG has been online and running throughout the lockdown process,
providing critical healthcare products and services to our users.
That being said, from the outside it doesn't seem that there's a lot of effect on the business, but internally, the data that we actually capture and use for a lot of our downstream use cases has definitely changed in a lot of its dynamics. So to give you an
example, during the lockdown, when the lockdown happened, it was so abrupt that every company's
operations were affected. So our inventory and stockpiles were affected. The time duration that we were delivering orders
to customers was affected.
And as I mentioned earlier,
we have an in-house built order delivery time prediction engine
where we actually predict how much time an order
will take to get delivered to a person's doorstep.
That whole engine needed rethinking
because since we were learning on past data
and the data dynamics had changed
because of the COVID-19 pandemic,
we had to reevaluate how we were actually training
that model and utilizing those predictions.
So it's definitely sort of kept us excited and interested. But yes, these are definitely testing times for every company.
Yeah, it's very interesting. So how long did it take you to iterate on this? You mentioned that you had to create a new model
and rethink the whole process around some sides of the product.
It would be very interesting to hear about your reaction as a company to that.
So we at 1MG have always been very proactive in taking these decisions.
And given our data-driven approach, as in we've allowed the data to speak for itself.
So as soon as we started seeing discrepancies in our predictions and the actual delivery times on the ground,
that was sort of within a week of the lockdown starting. We started rethinking how to re-engineer our features, what to change in the model, and how to retrain the model. So we did a very interesting study at that point, where we trained our order delivery time prediction engine on datasets from three different times. One was where we were seeing very normal distributions in our data. The other was where we had disrupted data due to events that were very localized, like cyclones happening in a particular part of the country, or festivals that used to sort of disturb the normal flow of operations.
So we trained on the normal period and tested on the disrupted period, then trained on the disrupted period and tested on the normal period, and we did a study to understand whether the change we were proposing in the model would work out or not. It definitely did show a lot of improvement, and therefore, within two weeks of the lockdown starting, we were ready to deploy a new, revamped model for all the time predictions.
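The cross-period study described above can be sketched with a deliberately trivial model (the delivery times below are invented): train a predictor on one period's data and measure its error on another, building the full train/test matrix.

```python
from statistics import mean

# Invented delivery-hour samples for two periods.
normal = [24, 26, 25, 27, 24]       # delivery hours in a normal period
disrupted = [48, 55, 60, 52, 58]    # delivery hours during the lockdown

def train(hours):
    # "Model" = predict the training-period mean for every order.
    avg = mean(hours)
    return lambda: avg

def mae(model, hours):
    # Mean absolute error of the model on a test period.
    return mean(abs(model() - h) for h in hours)

# Train on each period, test on each period.
for train_name, train_set in [("normal", normal), ("disrupted", disrupted)]:
    for test_name, test_set in [("normal", normal), ("disrupted", disrupted)]:
        err = mae(train(train_set), test_set)
        print(f"train={train_name} test={test_name} MAE={err:.1f}")
```

The off-diagonal cells (train on one period, test on the other) show exactly the drift that forced the retraining: a model fit to normal times performs badly on disrupted data, and vice versa.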
That's very interesting.
And I think it's a very good example of how important it is for a company to actually be data-driven
and how it can help an organization to first identify really
fast what is happening out there and second to react to that really fast.
That's great. So moving forward and starting to discuss a little bit more
about the technical side of things, can you give us a brief description of the current data stack infrastructure that you
have in 1MG and what kind of technologies and architecture you are using?
Sure.
So we're completely cloud-based at 1MG.
Our services and databases are hosted on AWS, and we use many out-of-the-box services from the AWS stable, like RDS, Athena, and EKS.
Most of our transactional databases are in RDS or MongoDB, but our data lake is primarily based on S3, the object storage service provided by AWS.
Now, the data that is stored in S3,
we access it through Athena,
which is a managed Presto database provided by AWS again.
Athena essentially stores the metadata and schema information for the data in our data lake.
All the major transformations that we do
of converting data from JSON to readable tabular format
or for joining different data that is stored in the data lake
to finally come up with a normalized or processed database
happens through Athena only.
Once the data has been transformed and processed, it's sent to different databases
for the downstream use cases,
like it's sent to Redshift for all business
and product analytics use cases.
It's sent to Cassandra for user personalization and all of that.
So that's our core data processing data stack.
However, we also use Kafka for real-time data processing and feature creation for our recommendation use cases. Very recently, we moved our event collection infrastructure in-house by getting RudderStack on board, and this has tied up very well with our data lake that's in S3 and our real-time data processing that happens in Kafka. And that's because RudderStack supports both S3 and Kafka as destinations out of the box.
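The fan-out pattern described here, one incoming event stream routed to multiple destinations, can be sketched with in-memory stand-ins for the S3 and Kafka sinks (the real system would use actual clients, of course):

```python
class EventRouter:
    """Minimal fan-out router: every published event is delivered
    to all registered destination sinks."""

    def __init__(self):
        self.sinks = []

    def add_sink(self, sink):
        self.sinks.append(sink)

    def publish(self, event):
        for sink in self.sinks:
            sink(event)

lake, stream = [], []
router = EventRouter()
router.add_sink(lake.append)    # stand-in for an S3 batch writer
router.add_sink(stream.append)  # stand-in for a Kafka producer
router.publish({"event": "add_to_cart", "sku": "VITC500"})
print(len(lake), len(stream))  # 1 1: same event landed in both sinks
```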
That's interesting.
So you mentioned at the beginning of our discussion the importance of collecting all the different data that you have. And this is, from what we can all understand, the foundation of all the work that can be done on your data. Can you elaborate a little bit more on this ETL pipeline that you have, in the broader sense of the word ETL, and the types of data that you are collecting and working with?
So as I mentioned earlier, we collect data from a lot of different sources.
So there are transactional databases, there are clickstream data sources, we collect API
logs, the CRM systems. So the first step in our ETL pipeline
is to extract the data from all of these different sources
and push them into a data lake, which is S3.
We do this through multiple ways.
So we use data migration service provided by AWS to move data from Postgres or MySQL
or MongoDB databases directly into S3.
We have written custom scripts to pull data from clickstream sources like BigQuery on
a daily basis and dump them into S3.
We've written our own data pipeline to push data from different microservices
and the API logs again to our data lake in S3.
So once this first step of extracting the data is done and everything resides in S3, the next step is essentially to convert all of this data that's been extracted from different sources into a usable format, to sort of normalize all of this data that's been collected, and to create single tables that have all the data joined from multiple source tables. All of that happens through SQL queries that are run on the Presto database that I mentioned, Athena, which is managed by AWS again. So after a bunch of SQL transformations happen on the data, we finally get far fewer destination tables, which are normalized and sort of shaped according to the different downstream use cases that we want to have.
This data, again, is dumped into S3
so that it stays part of the data lake.
But from there, it's also pushed off
to other downstream databases like Redshift and Cassandra
for the different use cases to happen.
So this is more or less the outline of our ETL data pipeline. Sounds good. So
from what you said you have many different types of data that you collect and I guess that as a
B2C company, clickstream data is also important. Can you tell us a few things about the importance of clickstream data inside 1MG, the volume of clickstream data that you have to work with, the sources, and how you actually perceive this clickstream or event data? What is its role? Is it some kind of way to capture behaviors, or something else?
I think it would be very interesting to hear your opinion on that.
So clickstream or event data is an extremely critical source of data at 1MG.
It helps us hugely in a lot of our use cases that we have in our product team or the data science team where the team is looking to answer multiple questions.
To give you some examples, so the product team wants to understand and learn from customer behavior and take appropriate action. All the product experiments are dependent on accurate
and reliable collection of user events.
Whether it's the design of new components for the homepage or the product pages, it's done through an A/B test, and a change goes out only when the A/B test actually shows downstream metrics improving from the efforts that have been taken. None of this would have been possible without fine-grained user behavior events.
We utilize them to train our models and we utilize them for a lot of our predictions and inferences as well.
Again, to give you some examples here, so we sort of power multiple widgets,
product recommendation widgets across the 1MG platform. So when a person visits a product page,
we not only look at the metadata of the product page to decide what to fill in the widget, but we also look at the past products that the user has viewed, added to cart, or even purchased, to decide what the best recommendations for the user are, so that it improves downstream metrics like clicks, conversions, and the overall GMV that we get out of recommendations.
So event data is definitely very, very critical.
At 1MG, event data, and not just at 1MG, everywhere,
event data is collected from the user's device,
whether it's a mobile or desktop browser or an Android or iOS application.
At 1MG, we've sort of built an in-house data collection infrastructure powered by Rudder.
So the event stream is sent to Rudder from where it gets redirected to the multiple destinations from where we consume
that data, like S3, Google Analytics, Kafka.
So yeah, we've sort of improved our event collection infrastructure a lot.
And earlier, though we were collecting events through Google Analytics 360 into BigQuery,
we were only processing them once a day or a few times a day as batch processing only for analytics use cases.
But once we got access to real-time event streams, we've been sort of pursuing a lot of real-time recommendation use cases as well
where we were learning on
what the user has clicked,
what's happened in the session
till now to define
what the recommendations for the user should be.
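A much-simplified sketch of such session-based ranking: score candidate products by how well they match the categories the user has clicked so far in the session (the catalog and events below are hypothetical, and the real system learns far richer signals):

```python
from collections import Counter

# Hypothetical catalog: SKU -> category.
catalog = {
    "VITC500": "immunity",
    "ZINC50": "immunity",
    "BPM100": "devices",
    "THERMO1": "devices",
}

def rank(session_clicks, candidates):
    """Order candidates by how often their category was clicked
    in the current session (most-clicked categories first)."""
    cat_counts = Counter(catalog[sku] for sku in session_clicks)
    return sorted(candidates,
                  key=lambda sku: cat_counts[catalog[sku]],
                  reverse=True)

session = ["VITC500", "ZINC50", "BPM100"]  # clicks so far in this session
print(rank(session, ["THERMO1", "ZINC50"]))  # ['ZINC50', 'THERMO1']
```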
So, Utkarsh,
so far we discussed a little around the
technology and the data sources
and the structure of the data.
I think it will be interesting also to talk a little bit about the people that both work with the data, of course,
but also maintain this infrastructure.
So starting with the team that is maintaining the data stack that you currently have in 1MG,
can you share a little bit of information about the people who are working there and their roles? And, yeah, I mean, how important is it? And what are the difficulties, let's say, that they are facing in their day-to-day work?
Sure. So we have a very lean team that comprises a couple of data engineers who work very closely with the data lake and are responsible for running and maintaining the ETL pipeline.
Other folks that work with the data usually work
after the data has sort of been processed
and it reaches the Redshift or the Cassandra database.
So folks on the business and product analytics team
or the data science team benefit hugely
from the work done by the data engineering team
and are able to deliver their targets
more productively and efficiently
because of our ETL pipeline.
As I sort of mentioned earlier, when I was talking about the use cases of data, one of the things that we do on the data science team is to create this single unified health user profile.
In healthcare especially, the data is very unstructured. It's not standardized. It's not
computer readable. And this presents a very unique challenge when we think of building a unified health repository.
So one of the biggest areas of work
for the data engineering and AIML at 1MG
is thus to sort of create this complete picture
of individuals' health and their behavior
by both constructing a single data lake or an ETL pipeline
as well as deploying a number of AI models that convert this incoming unstructured data
into computer readable and standard data sets.
So this is sort of, what would I say, maybe the Holy Grail: if we were able to have all of our data in a machine-readable format, it could be used for a lot of downstream use cases.
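As a toy illustration of what converting unstructured text into a structured record means (real clinical NLP is far harder; the pattern and the sample prescription line below are invented):

```python
import re

# Hypothetical pattern for lines like "Metformin 500mg 2x/day".
RX = re.compile(r"(?P<drug>[A-Za-z]+)\s+(?P<strength>\d+\s?mg)\s+(?P<freq>\d+x/day)")

def parse_line(line):
    """Extract a structured {drug, strength, freq} record from a
    free-text prescription line, or None if nothing matches."""
    m = RX.search(line)
    return m.groupdict() if m else None

print(parse_line("Tab Metformin 500mg 2x/day after meals"))
# {'drug': 'Metformin', 'strength': '500mg', 'freq': '2x/day'}
```

In practice a regex like this only covers the easiest cases; scanned prescription images and lab-report PDFs need the AI models described above.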
Yeah.
Sounds good.
So apart from the technical teams that you have inside 1MG, which of course are working on maintaining the infrastructure and also on ML tasks and analytics tasks, do you have people from other functions who interact with the data that your team and the rest of the technical teams are delivering? What I'm talking about is mainly teams from marketing or sales. And how do you manage the interaction with these teams and the data that they need? And how important is this for the organization?
I would say data is sort of omnipresent.
Everybody, even the people who talk to customers on the customer service team, or the doctors who are talking to patients through the online consultation medium, all of them are utilizing some form of data or another to decide what their next steps should be, where they are going wrong, or how they should change their approach.
So the marketing team uses data very, very centrally to decide who is the right person for the campaign that they want to run.
They decide on an almost daily basis who the audience is for the email that they're about to send, or the push notification that they're about to send. So all of these are sent in a very targeted manner.
Because if you think about healthcare,
not everything is relevant for everybody.
As in, if you have designed a campaign for a diabetic user,
a person who is not diabetic might really not be interested
in actually looking at the details of this campaign.
So targeting is definitely very, very central
for the marketing team.
And therefore they utilize a lot of their signals
and data points to define good cohorts to target to.
The sales team and even the supply chain, the team that manages the supply
chain has a lot of data at their disposal about what inventory they have in stock, what are the
fill rates of orders that are coming in daily. So I would say a lot of teams apart from the core business and product analytics team and
data science team are utilizing data. They're, however, not independently querying databases and sort of getting their cuts and their metrics analyzed; they take help from these core teams, the business and product analytics team and the data science team, to enable their sort of view of the data. So they sort of work very collaboratively with the analytics team to figure out: What is the data that they're interested in? What are the cuts that will be useful for them?
Whether this is a one-time thing, or whether it would be great if this could be sort of automated and run in a scheduled manner, so they could be sent emails or notifications whenever this data throws up an anomaly.
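Such a scheduled anomaly check can be sketched as a simple z-score rule (the threshold and the order counts below are illustrative, not 1MG's actual alerting logic):

```python
from statistics import mean, stdev

def is_anomaly(history, today, threshold=3.0):
    """Flag today's metric value as anomalous when it sits more
    than `threshold` standard deviations from its history."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > threshold * sigma

orders_per_day = [1000, 1040, 980, 1010, 990, 1020]
print(is_anomaly(orders_per_day, 1015))  # False: a normal day
print(is_anomaly(orders_per_day, 1400))  # True: worth an email alert
```

A scheduler would run this over each team's metric daily and send the email or notification only when the check fires.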
So, yeah, it's sort of omnipresent.
Everybody's using it.
The analytics team and the data science team are at the core of it; they understand and maintain the data, and help everybody in enabling their use cases in a data-driven manner.
That's great. So by having data omnipresent in the organization, I guess there are also challenges
around working with it, either on an organizational or a technological level. So from your experience so far, what kind of challenges do you think you currently have, and how do you think these challenges will be addressed in the future, both from a technology perspective, like if there's a technology that's missing, or something that you haven't implemented yet and are thinking of implementing as part of your stack, but also on an organizational level, because that's also important.
So yeah, I'd love to hear your opinion on that.
So at 1MG, we have a very high variety of data.
We have a lot of different data sources
and also a very high volume of data sources
and data coming from a bunch of these data sources
like clickstream, transactional databases.
So there are, I would say, three main problems that I see that we're trying to solve.
One is what I've already mentioned, that in healthcare, data is unstructured and not standardized. And that's sort of one of the underlying fundamental problems that we've
been trying to solve over the past few years. But other than that, there is the problem of
data discovery, when there is so much data coming in and flowing in from different sources.
Discovering the data that you actually want to fulfill your use case can sometimes be very, very difficult.
Also, with democratizing data, when everybody is using data, there is a problem of collaboration. So there might be a team somewhere that is actually
generating as well as utilizing their data very efficiently. But because there is so much volume
and variety of data that we're actually processing, transferring that knowledge from this team to
other teams so that other teams can collaborate
and not only enrich the data for that team, but also utilize the data that this other team is
generating can be a huge problem. How we're actually looking at solving this problem, or how we're actually focused on solving it, is by building a single unified data processing pipeline, as well as a single unified user and data repository. So earlier, as the business grew, we did have a lot of our core data residing in one place, but there were a lot of fringe elements that were popping up: the CRM systems did not talk to our core database, marketing systems did not talk to our core database. The clickstream systems, when they sort of evolved, were not tied up very closely with our transactional databases. And so we ended up building different ETL pipelines for extracting, transforming, and loading these data sources in a processed manner for consumption by these teams.
Very soon, we realized that this was sort of aggravating the problems of data discovery
and collaboration.
And since the past one year, we've been very, very focused on not creating multiple pipelines,
but sort of incorporating all data processing into a single pipeline,
so that everybody has visibility on what data is entering the pipeline,
what data is at what stage of processing and what are the final outputs.
The data schema can be centralized.
It can be distributed to everybody.
This has sort of alleviated some of our pains of discovering new data, or collaborating and enriching a bunch of our data sources. Yes, so our current stack is definitely focused
on removing these pain points for us.
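The "single pipeline with visibility" idea can be sketched as a central catalog where every dataset entering the pipeline registers its schema and processing stage, so any team can discover what exists and where it is (the dataset names and schemas below are hypothetical):

```python
class PipelineCatalog:
    """Toy central registry for datasets flowing through one
    unified pipeline: each entry records a schema and a stage."""

    def __init__(self):
        self.datasets = {}

    def register(self, name, schema, stage="raw"):
        self.datasets[name] = {"schema": schema, "stage": stage}

    def promote(self, name, stage):
        # Move a dataset to a later processing stage.
        self.datasets[name]["stage"] = stage

    def discover(self, stage):
        # Let any team list the datasets available at a given stage.
        return [n for n, d in self.datasets.items() if d["stage"] == stage]

catalog = PipelineCatalog()
catalog.register("clickstream_events", {"user_id": "int", "event": "str"})
catalog.register("orders", {"order_id": "int", "eta_hours": "float"})
catalog.promote("orders", "processed")
print(catalog.discover("processed"))  # ['orders']
```

The point is less the data structure than the contract: one registry means one place to look for schemas and stages, which is exactly the discovery and collaboration pain described above.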
That's great.
I think what you mentioned, especially points two and three, let's say the pains that you're identifying, which are about data discovery and also collaboration, I think it's a problem that affects almost every company out there that is growing and is becoming, let's say, or trying to become, more data-driven. So it's very interesting to see how this is going to progress and what kind of solutions the market will come up with for these kinds of problems.
Because it looks like we have figured out so far
how to access the data, how to collect the data,
how to create very scalable pipelines for ETL
and have all the different tools there
in place to query the data.
But from what I understand, and I think that this is also what I get from what you said,
there's a lot of work to be done on how we can make sense out of this data and how we
can democratize it, as you very well said,
and make it available to everyone inside the organization,
which is very, very interesting.
And it's great to hear that there are companies that are at the forefront of this and are trying to solve these kinds of challenges.
So moving forward and reaching the end of our discussion,
I'd like to hear a little bit about
your opinion and also 1MG's opinion
through you about open source.
How important do you think the open source movement around building software has been for enabling these data stacks that we are talking about today?
And finally, tell me, tell us,
share with us like a couple of tools,
open source tools preferably that you really love working with as part of your
everyday job.
Whenever there is a new promising technology out there, we're not sort of shying away from it; we're sort of inviting it with open arms.
So we use a lot of open source technologies, or we've sort of used a lot of open source
technologies over the past few years at 1MG. Especially if I talk about data science,
we've only been using a lot of open source libraries
like Keras, TensorFlow, PyTorch.
Even the recent tree-based models like XGBoost or LightGBM have been used in our AI ML stack very, very actively.
We use a lot of open source database technologies like Cassandra that I spoke about, Kafka. Even Rudder is a great tool to have in our arsenal
and it's open source as well.
So definitely, I would say that open source
is in 1MG's DNA for sure.
We have not been shying away from it
and we've been sort of working with it very, very actively.
Some of the open source tools that I personally like to work with are definitely Kafka and Cassandra, because these have very, very nice use cases at 1MG, and we're sort of picking them up and sort of bulking them up. So a lot of my daily work happens in and around these tools.
To connect both Kafka and Cassandra, we've been using Spark. So Spark is again a tool that really binds all of these distributed technologies
very very beautifully. Also working with technologies would be very mundane or very boring if you did not have good and rich data to work with.
So I would also give a shout out to a bunch of our data collection infrastructure,
which has made it possible to enjoy the work with a lot of open source technologies and especially Rudder because Rudder connects very, very seamlessly
in our data stack and is able to seamlessly transfer
the events collected from the user's device in real time,
both to our data lake as well as to Kafka topics.
And because of this variety of data that we have at our disposal,
working with technologies like Spark and Cassandra and Kafka have become
very, very interesting nowadays.
That's great.
That's all for today.
And thank you so much for your time.
It was really enjoyable to discuss and learn more about
1MG and I'm pretty
sure that we'll have the opportunity again in the
future to discuss again
and learn more
about all the amazing
and fascinating things that you're building
around data in your organization.
Thanks, Kostas. It was wonderful
talking to you.
It was great having you on this episode
of the Data Stack Show.
I loved learning about all the challenges
that they have working with medical data specifically,
but it's amazing the volume of data
that they have collected in the past couple months
responding to the pandemic.
And it's been fun to be a part of that journey with them.
We'll check back in with them in the next couple of months to see the different ways
that they're using the data in different parts of the teams as they direct the data stream
to other products and other use cases inside their company.
We'll catch you next time.