Epicenter - Learn about Crypto, Blockchain, Ethereum, Bitcoin and Distributed Technologies - Allen Day: Google’s Mission to Provide Open Datasets for Public Blockchains

Episode Date: September 26, 2018

Public blockchains produce enormous amounts of data. In theory, anyone can access the raw contents of transactions and blocks. In practice, however, querying blockchains can prove to be a daunting task. The difficulty lies in the fact that blockchains are particular types of distributed databases and thus carry several limitations. Most, if not all, blockchains lack the basic SQL querying capabilities supported by nearly every off-the-shelf database system. Take Bitcoin as an example. Its API lacks even the most basic calls which would allow a user to query any address and receive its balance. In order to achieve this, block explorers and the like have developed sophisticated middleware infrastructure that parses the blockchain, normalizes the data, and stores it in a database, where it can be queried. In the best of cases, companies offer API calls for only a limited set of operations. Google hopes to change this by freeing blockchain datasets. We're joined by Allen Day, Science Advocate at Google's Singapore office. Earlier this year, he and his team released the Bitcoin blockchain as a public dataset in BigQuery, Google's big data IaaS offering. In August, they added Ethereum to their list of freely available public datasets, which includes US census data, cannabis genomes, and the entirety of Reddit and GitHub. Anyone wishing to query the data can do so in SQL on the BigQuery website or via an API. For instance, a relatively simple query would return the daily mean transaction fees since the Genesis Block in just a few seconds. Coupled with Google's AI and machine learning infrastructure and other open datasets, one can only imagine the potentially groundbreaking insights we could gain from this data.
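To give a taste of the daily-mean-fee aggregation mentioned above: the real query runs in BigQuery SQL against the public dataset, which we can't reproduce locally, so the sketch below uses Python's sqlite3 as a stand-in, with an invented, simplified schema (a transactions table with a date and a fee column) rather than the dataset's actual field names.

```python
import sqlite3

# Toy stand-in for the BigQuery public dataset: a transactions table with a
# date and a fee column (schema is illustrative, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (block_date TEXT, fee_satoshis INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("2018-09-01", 1000),
        ("2018-09-01", 3000),
        ("2018-09-02", 500),
        ("2018-09-02", 1500),
        ("2018-09-02", 1000),
    ],
)

# The same shape of query described in the episode: group by day, average fees.
rows = conn.execute(
    "SELECT block_date, AVG(fee_satoshis) AS mean_fee "
    "FROM transactions GROUP BY block_date ORDER BY block_date"
).fetchall()

for day, mean_fee in rows:
    print(day, mean_fee)
```

Against the real dataset, the same GROUP BY shape would run over the entire transaction history in seconds.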
Topics covered in this episode:
- Allen's background as a geneticist
- The similarities between blockchains and evolutionary processes in lifeforms
- Google's cloud platform and its various components
- BigQuery and its publicly available datasets
- The Bitcoin and Ethereum datasets in BigQuery
- Why this data is useful to the public and for what it may be used
- The particular challenges in implementing Ethereum as opposed to Bitcoin
- Insights we may gain by crossing blockchain datasets with other data
- How machine learning and AI could help us better understand specific transaction patterns

Episode links:
- Bitcoin in BigQuery: blockchain analytics on public data
- Bitcoin Blockchain Public Dataset
- Ethereum in BigQuery: a Public Dataset for smart contract analytics
- Ethereum in BigQuery: how we built this dataset
- Ethereum Blockchain Public Dataset
- Change Agent by Daniel Suarez
- Real-time Ethereum Notifications for Everyone for Free
- ethjs-abi library, compiled for use in Google BigQuery
- Kaggle: Your Home for Data Science
- The Strange Inevitability of Evolution - Issue 20: Creativity - Nautilus
- Reddit
- Google Cloud

Thank you to our sponsors for their support: The open, decentralized trading protocol for ERC20 tokens using the Dutch auction mechanism. More at epicenter.tv/dutchx. Deploy enterprise-ready consortium blockchain networks that scale in just a few clicks. More at aka.ms/epicenter.

This episode is hosted by Sébastien Couture. Show notes and listening options: epicenter.tv/254

Transcript
Starting point is 00:00:00 This is Epicenter, episode 254, with guest Allen Day. Hey, we're co-organizing a live event at San Francisco Blockchain Week. It's called SF Blockchain Epicenter and it'll be October 8th and 9th at the Hilton Union Square. Come see members of the Epicenter team and a lot of familiar faces from the show. There are reduced rates for developers, and you can learn more at sfblockchainweek.io. This episode of Epicenter is brought to you by DutchX, the fair and secure decentralized exchange platform by Gnosis. To learn how you can build apps which leverage DutchX's liquidity pool,
Starting point is 00:01:09 visit epicenter.tv/dutchx. And by Microsoft Azure. Configure and deploy a consortium network in just a few clicks with pre-built configurations and enterprise-grade infrastructure. Spend less time on blockchain scaffolding and more time building your application. Learn more at aka.ms/epicenter. Hi, welcome to Epicenter, the show which talks about the technologies, projects, and startups driving decentralization and the global
Starting point is 00:01:39 blockchain revolution. My name is Sebastankujiu, and today I'm very pleased to have with me, Alan Day, who is science advocate at Google, at the Singapore office. We met in Singapore a few months ago when I was traveling through Asia, and at the time, he told me about this really interesting initiative, which I hadn't heard about, but had been up for a few months, which was that Google had actually added the Bitcoin, entire Bitcoin transactional data set to their cloud infrastructure and was on the cusp of releasing also an Ethereum dataset. And so, you know, I vowed to have them on at some points so that we could discuss this. And in August, Google did in fact release their Ethereum dataset on BigQuery.
Starting point is 00:02:32 So I'm here with Alan today to talk about all this and other things. So Alan, thank you so much for having on for coming home. Yeah, sure. It's really my pleasure. Happy to be here. So before we get started, let's talk a bit about your background. Your PhD is in human genetics. Talk about your journey.
Starting point is 00:02:53 Where did you come from? And how did you end up at Google working as a science advocate putting blockchains in big data sets? Yeah, I've been working with computers since I was a little kid. And when I went into my, as I was moving through school and eventually ended up in a doctorate program, I was combining computing and biology all the way through that. And so that led me into an interdisciplinary field called bioinformatics. And that involves working with distributed systems for doing scientific computing as well as. you know large data sets and computer science and statistics and so I I was becoming something that's now called the data scientist before the title
Starting point is 00:03:42 really existed a lot of these people come from physical sciences and so you know once I once I acquired that skill set it was quite easy to apply it to to other disciplines and so I could see that there was something interesting happening with these blockchain data sets and so I decided to start looking at those and applying some of the same techniques and methods that I'd learned for analyzing biological networks to analyze these new types of financial networks. So one thing that I think is sort of interesting, and I think I think Mayer, my co-host, has often talked about this, is this idea that blockchains and biological systems are so much similar,
Starting point is 00:04:23 or they have common characteristics in that a blockchain can mutate and can fork, have a sort of new evolution within its life form. Can you maybe give us your take on what your thoughts are on this? Do you find that there are similarities between the way blockchains evolve to the way biology has evolved? Yeah, certainly. The most direct parallel is the forking that happens between projects where one project may decide to change the operational rules for how the consensus works, for example, or the block time or block size or something.
Starting point is 00:05:09 And that's very similar to if you were to have a mutation that caused two populations of individuals from the same species to become different species. So a speciation event is the equivalent of a fork. Also, if you look at the smart contract platforms and having smart contracts start on chain, and these things have some function that's made available to any blocks that are added after the smart contract was added. There's some additional effects that are possible as the as the blockchain evolves. That's also related to you know adding new functions into a genome for example. Yeah, there are certainly some parallels. Do you know of anyone who's doing any research on this and that is sort of exploring this at a much deeper level?
Starting point is 00:05:58 No, but there was there's something interesting that I encountered from a friend of mine his name is Daniel Suarez. He's a sci-fi author and he recently published a book called Change Agent which is about a bioinformatician based in Singapore. So I thought that was kind of interesting
Starting point is 00:06:21 since there's some parallels with my life there. But in this book he talks about blockchains at some point And one of these chains is called a bio coin, which the proof of work, we'll just call it that, is basically blocks are added as a result of some mutation happening in a bioreactor. And so there's some interesting concept or idea here that if you could define a fitness function that you wanted to have a population of organisms move toward through directed evolution, that's some form of work. because you're exploring the combinatorial space of the genome or the proteome or whatever aspect of these living systems that are evolving in parallel, right? You're trying to move towards some target, and that's the work that you're establishing. If you have a way to measure that, you could actually link evolution to adding records onto a chain.
Starting point is 00:07:17 So this is maybe a way that we could do some interesting work, as part of securing the chain, but it requires a much lower cost of genome sequencing and genome editing than we have today. But certainly in the future, if you look at the rate at which these things are dropping in costs, it's conceivable. Some kind of technology like this could exist. That's really fascinating. I think we can probably spend the whole episode just exploring this topic. But specifically, I want to have you on to discuss this. this initiative that Google has had of bringing the Bitcoin and Ethereum blockchain onto Google Cloud. But first, you know, tell us about your role at Google.
Starting point is 00:08:01 What is it, what's typical day like as a science advocate at Google? Sure. So we can start with by unpacking my title a little bit. So this is one that I just gave to myself because my official title is a developer advocate. And I'm specifically interacting with communities who are involved in mostly physical sciences. And part of that is doing communication. And so this is more a title that resonates with them, so I usually just use that. My day-to-day is, as I mentioned, communications.
Starting point is 00:08:33 So doing interviews like this or blogging or public speaking, this is maybe, I don't know, 30, 40% of my time. About half of my time I spend doing software development. So I'm actually an engineer at Google, but I happen to be externally facing, showing people outside. side, what cloud can be used for to develop interesting applications, and then collecting information from outside, seeing what the market is doing, what kind of cool stuff people are building, and in particular where they're encountering friction or where cloud doesn't have some specific offering yet, and then bringing that back into Google to help product teams to help us make better stuff for the people that we're trying to serve.
Starting point is 00:09:15 And then the remainder of my time, like everyone else, administrative kind of things, quite a lot of travel and email and etc. And so this is advocation work that you do, is it mostly centered around cloud platform or do you also touch other Google products? It's all cloud. Yeah, and I'm specifically building things that are more like end-to-end realistic use cases,
Starting point is 00:09:36 and I'm working quite a lot with these public data sets, as we'll talk about later, I'm sure. Some of my colleagues are doing more like feature advocating about, you know, incremental updates on products, but I tend to build large integrated projects that are touching many possible cloud components. Interesting.
Starting point is 00:09:55 So, as you said, you build these projects that are sort of more realistic and that are sort of like experiments that could potentially turn into products. Have any of the things that you've worked on turned into, morphed into Google products or anything that has been commercialized?
Starting point is 00:10:16 I give a lot of feedback to, So as a geneticist, I work quite a lot with the genomics and healthcare product team. And yeah, definitely some of the stuff that I encounter, either fiction that I'm encountering as an individual developing with the tools, I'm basically like customer zero. I give them that feedback and then stuff that's really bothering customers. I give that to them too. And then that results in updates to products. Yeah, for sure.
Starting point is 00:10:41 Sounds like a really fascinating role where you can live your passion for technology and science while experimenting and having sort of a lot of flexibility to propose new types of experiments internally. Yeah, it's like people who love to play with new technology, they find themselves quite at home in this kind of role. And I basically just to play with, get to play with like Lego bricks all day. It's fantastic. That's cool. So let's let's talk about Google Cloud Platform at a high level. So I think most of our listeners will probably be familiar with Google Cloud Platform. But give us a high-level overview of that product and the types of components that exist within it.
Starting point is 00:11:22 Okay, sure. Yeah, it's a public cloud. So we have a bunch of data centers and network connecting the data centers. It's comparable to other public clouds in that regard. Google's been operating its own data centers for 20 years now. We just passed our 20th birthday. And so we know quite a bit about how to operate these things. The first cloud product was something called App Engine, which is still around, was a bit ahead of its time.
Starting point is 00:11:49 Today, our product areas, you could break it down roughly into three areas. One of them is related to virtualization and infrastructure. So this would be Kubernetes or other types of virtual machine services and microservices infrastructure, networking, firewalls, etc. Another area product is related to applications development. So this would be for App Engine fits into there, for example, or other components for building, let's just say, web services and integrating all of your stuff together to make something usable. And then the final area is data analytics, and this is the area where I'm advocating, which is primarily big data technologies. So BigQuery is one of these, big table, spanner, or a bunch of our databases. is most, quite a lot of its database related and storage.
Starting point is 00:12:42 And then on storage, and then on the compute side, we have a whole bunch of AI technologies. And so, you know, you can't really compute if you don't have data. And data isn't that really useful if you don't have compute. So we have these two things that can move data back and forth between them to build new data from old data. And the more interesting types of services we have on the compute side are our AI-related. Give us a sense of how big this cloud computer is. I mean, I don't know what kind of metric we want to use,
Starting point is 00:13:09 if it's whether it's a number of data centers, number of computers, or, you know, had a bias of information processed. Can you give us a sense of how massive Google Cloud is? Yeah, I can't give you, like, specific stats on the number of data centers, but I know that we have, we're represented in all the,
Starting point is 00:13:28 all the major geographies around the world and have our own dedicated connection between the data centers. So the connectivity is quite good. running through our dark fiber. It doesn't ever pass over the public internet. And then a lot of our services are, because again, it comes from this heritage of Google before Google Cloud, we have built some services like Spanner, for example. So this is a globally consistent distributed database that relies on atomic clocks to make sure transactions allow these different data centers to be synchronized, and that's now available
Starting point is 00:14:05 will be a public cloud as well. So there's a whole bunch of goodies from Google that are inside of this public cloud. There's a lot of big customers on here. Snapchat runs on Google Cloud, for example. If you remember Pokemon Go, people still play this. That's also on Google Cloud. Dropbox runs on Google Cloud.
Starting point is 00:14:26 Yeah, there's a lot of customers. Zillow runs in Google Cloud. They do some interesting AI stuff related to forecasting and image analysis of properties. geospatial analysis. Yeah, it's quite big and we've got some major customers. Now, you mentioned that Google Cloud was sort of a suite of services. We've got the virtualization aspect. We've got the storage aspect and then also the data processing and machine learning and AI. As a user of Google Cloud, I presume that you can sort of be all these products integrate together, correct?
Starting point is 00:15:03 Yeah, yeah. There's some places. where some components don't interact as seamlessly as you would like or to move data between them. But in general, there's some way for them to interoperate. So what are the most interesting things, the most cutting-edge things that you've seen that you can talk about that people are doing in areas like data processing or research or the types of things that people are doing with your AI modules? Oh, I would recommend looking at another YouTube. channel called two minute papers. This is a, they're usually a little bit more than two minutes,
Starting point is 00:15:42 but they cover the latest advances in deep learning. And quite a lot of that is happening with a application, sorry, an SDK called TensorFlow. And TensorFlow was developed by Google. It was open source. This is the largest, largest, most popular library for doing deep learning, which is the current most popular area of machine learning. And that's all compatible with Google Cloud. It all runs the Google Cloud. We've got specific services that make it run really, really well. Cloud Machine Learning Engine, for example.
Starting point is 00:16:17 Yeah, computer vision is one of the most interesting areas. You can see how computers are able to now drive cars, for example. So they're doing real-time analysis of images and looking at all the sensor data coming in and using that into the model of how the car is operating to make sure it can operate safely. So moving now more towards the BigQuery component, can you spend a bit of times describing BigQuery
Starting point is 00:16:42 and the different components there as it relates to what you're now doing with Bitcoin and Ethereum? Sure. So BigQuery is also a distributed system, similar to Spanner, as I mentioned earlier, but it's not distributed across multiple data centers like I was describing. It's living more locally than that,
Starting point is 00:17:03 but it still has a whole bunch of nodes that store parts of a data set. And so when you do a query, you're actually running a job in parallel across a large number of machines to produce a result. And so we take the approach of basically having data center as computer and don't try to implement anything very fancy like indexes on the tables,
Starting point is 00:17:29 We just do linear scan across everything because we have enough hard drives that it's economical enough to do that given we can distribute the workload well within the data center. And we're using AI to do that. And because we're not making any assumptions about the structure of the dataset, it's quite workable for many different data sets of many different shapes and sizes and scales extremely well. You know, the Dutch have given us so much. orange carrots, Bluetooth, artificial hearts, even donuts were invented by Dutch people. But they also gave us Dutch auctions, which as it turns out are great for decentralized exchanges. Dutch X is a decentralized trading protocol for ERC20 tokens, and it's invented, designed, and built by Gnosis. Current order-based exchanges, whether centralized or decentralized, have a couple of issues.
Starting point is 00:18:20 Miners and exchanges can frontrun a trade when they step in front of a large order to gain an economic advantage, not to mention issues with securing funds, high listing fees, lack of liquidity, and pricing efficiencies. The Dutchex exchange platform uses a Dutch auction mechanism to determine the fair value for a token. And participants in a trade are encouraged to reveal their true willingness to pay, which eliminates front running. As a permissionless on-chain protocol, it's useful for bots and other smart contracts needing to exchange tokens. And Dutchex also acts as an Oracle for DAPS requiring a price fee. So to learn more, check out the documentation at epicenter.tvs, slide. DutchX. Smart contracts are live on the Ethereum Mainnet so you can start building today.
Starting point is 00:18:59 We'd like to thank Gnosis and DutchX for their supportive epicenter. So people are using BigQuery with their own datasets. So presumably all types of companies from companies processing consumer data to, you know, user behavior data, whatever types of data that the one you can think of wanting to process, you could presumably use BigQuery to hold that data and query the data to get some sort of analysis. But there are also public data sets on BigQuery. And this is specifically in the context of what we're talking today quite interesting, because there are, I think, quite a few public data sets on there.
Starting point is 00:19:48 Can you describe some of the other datasets that people are using on BigQuery? Yeah, there are. As you mentioned, there's quite a lot of private data sets. And then those can be joined against public data sets for, we could call it augmentation, for example, where you might have some private information that you want to enhance or enrich by joining against public data. The majority of the public data sets, though, are not dynamic, so they're not regularly updated. It's typically some kind of toy data set like a, there's one about the New York Tax
Starting point is 00:20:22 And so there's a, I forget how many days or months it is of data, but it's a snapshot or a sampling of taxi rides, what time the pickup happened, what time the drop off happened, and what was point A to point B. There's other ones that look at types of trees and shade coverage for doing solar radiation analysis on city streets. There's various image data sets. There's one from, I forget which museum it's from, but a bunch of pieces of art are cataloged. Another data set I present, I produced was a genomics data set of a thousand different cannabis genomes where in order to accelerate innovation in agriculture, there's a whole bunch of stuff happening right now with this new plant that is undergoing regulatory changes. And so if we begin to look at the genetic structure of these plants, we might be able to improve the varieties more quickly. So there's a whole hodgepodge of all kinds of different stuff, weather data, satellite imagery data. All of the Reddit comments are also in BigQuery. So if you want it to query
Starting point is 00:21:27 any of the subreddits and threaded forms, you can look at all of that. All of GitHub is in BigQuery, not just the source code, but also the comments and the merge requests and everything. So if you wanted to do some code analysis, that's a pretty popular one because, you know, developers are interested in development. So it gets quite a lot of, quite a lot of use. I like this idea of combining private data sets and public data sets. And some of the ways that one might use that, so for example, tell me if this makes sense, like if you're a company like a ride-sharing app and you want to gain some insights to the ways people are using your app and specifically with regards to your competition, which are taxis, you know, you could use that New York taxi.
Starting point is 00:22:16 ride data set and cross it with your own data set of like how your users are using your app, how many times a day or a week they're booking rides and then maybe extract some insight from that so that you can perhaps put more cars in a certain area to better compete with the New York taxis, for example. What are the types of examples can you point to is to how people are using public data sets with their own private data sets. Sure. Yeah, conceivably that's possible. Although bear in mind that this taxi data set,
Starting point is 00:22:54 we can keep working with this example, is quite small and limited in what it has. The yellow cab company is not putting all of their data into the public data. It's just a little bit as a toy. But that raises an interesting possibility. What if all of the data were available? How much would you have to pay
Starting point is 00:23:12 to incentivize the company to put all their public data out there. Or at what level of resolution would you be willing to pay for lower resolution data? And would they be willing to sell that as opposed to the highest resolution data? And so there's actually an interesting case study we can provide it as a supplement possibly in the comments or something. That's Thompson Reuters did something like this, where they actually host their headline data along with some other attributes. I don't know if it's the full article or what I've not looked at it. It's a private data set. And what they're doing is they're using Google data exchange to make this available using Google's access control.
Starting point is 00:23:53 So Google basically allows them to manage the access control and is managing BigQuery tables to store the data, such that Thompson Reuters has to only takes the responsibility to put the data in, and then they're selling subscription access to get access to these tables. So you could do this. You could also put data into queues for real-time streaming analysis. So that's an example of where we can now generalize out to not just two data sets, but actually having the notion of a marketplace. And maybe there's some opportunity for transportation or logistics companies
Starting point is 00:24:30 to bring it back to Yellow Cab, where they could be willing to operate by exchanging some or all of their data and pricing it accordingly depending on you know, how much of you need access to, how much latency you're willing to accept, etc. So turning all those knobs, but it all is possible in a marketplace design. You could think about ad tech is doing a very similar thing, right? Like advertising and ad exchanges. It's quite a similar idea. Interesting. Well, maybe we can come back to some other examples of it later on in the show. So let's talk about this Bitcoin blockchain dataset on Big Cloud. So this came out in February of this year.
Starting point is 00:25:14 So what exactly does it include? So what is the Bitcoin dataset on Google Cloud on BigQuery? And what was the goal in making this dataset publicly available? I wanted to be able to explore the data. This is sort of in my own selfish interest to be able to make some blog posts or just look at the data. Because I know other developers who are wanting to do this too. You certainly see a lot of interest in cryptocurrency and Bitcoin and any of these keywords. They're all growing over time, right?
Starting point is 00:25:44 It's like, okay, there's developers here. I can go and become more of these developers and draw some attention to Google Cloud. And I know we have good AI tools. So blockchain plus AI should be like super exciting, right? So I tried to do some of these queries, and it turned it out to be really difficult to talk to a Bitcoin node directly. And usually the kind of query I'd want to do would be some kind of historical analysis. and that's not possible just going block by block very easily. You have to query one by one by one,
Starting point is 00:26:13 whereas normally in a SQL type of scenario, you do like a group by to aggregate. It's a particular type of operation for this kind of programming. And so I realized that I could extract these data out of the Bitcoin data set and put them into BigQuery and do these analysis I wanted to do. And so that's what's in there. It's nothing more than the Bitcoin blockchain data itself. So we download all of the blocks.
Starting point is 00:26:38 It's about 200 gigabytes worth of data and then parse each of the blocks and load it into BigQuery. So every time a new block comes out, we update the table and put it in there. It's just the transaction data. I don't know how much your listeners would be familiar with what's in Bitcoin, but it's really just some addresses are sending some number of Satoshis from address A to address B. Right. So I'm looking at it right now. So we'll link to the show in the show notes to the to the BigQuery Bitcoin data set. So it actually has two tables. So it has a blocks table and a transactions table. Correct? Yep. That's right. And the transaction, those are actually identical tables.
Starting point is 00:27:19 The reason I did that is I denormalized it because the way the BigQuery pricing model works is that you're paying for unit of I.O. And if I unnest the blocks, I allow access. to the transactions at a lower cost. So it saves the user's money to do it that way. Okay, that makes sense. Right, so rather than querying simply the, rather than having to query the block and then find the transactions within the block,
Starting point is 00:27:46 then you can simply query transactions. Yes, correct. That's right. Okay, interesting. And so you update this every time a block is confirmed? Every block. We're staying intentionally six blocks behind height. because that allows us to avoid having to deal with chain reorgs.
Starting point is 00:28:08 We don't want to have to delete data. There's some complexity, right? Because if you add a block and that ends up not being the real chain and just like some kind of dead branch on the chain, you don't want to have to then delete that and manage that. You'd rather, if you're going to build the simplest possible system, you don't want to take that into consideration. And so by staying several blocks behind the tip of the chain,
Starting point is 00:28:30 you can avoid that problem. but the trade-off is then the data are slightly stale. Okay, so you're not storing orphaned chains or orphaned block. Correct. Yeah, none of the branches on the blockchain that don't end up becoming part of the main trunk are not stored in the table. Okay, interesting. Is there a particular reason why you chose not to also store that data? It seems like there could potentially be some interesting analysis that could be made on this, on orphaned blocks? Well, any data that's on an orphan block is not part of the consensus, right? There was some minority thought there might be a block there that everyone had agreed to, but due to race conditions
Starting point is 00:29:13 or randomness or whatever, it just ended up not being the case. Looking at what transactions end up on these dead branches, yeah, it could be interesting. Maybe there's some censorship happening on the blockchain where somebody's blocking, you know, entity A is blocking entities B's transactions from being placed on chain by denial of service attacking them? I suppose so. Yeah, it could be interesting. I have not gone down that direction. That's actually a really cool idea, though. Yeah, I was thinking sort of along those lines, right? I mean, if at some point that were to be the case, perhaps we could detect those types of anomalies on this data set. In particular, I think it would be interesting if there was some geospatial data or IP address data, which is not
Starting point is 00:29:59 stored on the chain, right? But you can see that from the mempool if you're operating a node. And is there some, like, relationship between IP addresses, and can you basically see an adversarial relationship between peers where they try to block one another? Interesting. And there's probably some interesting patterns in there if that does happen. So you're also not storing mempool transactions. At no point can I, like, query this data set in BigQuery and say, okay, what are the transactions waiting to be confirmed? It's only confirmed blocks, six blocks behind the tip. Correct.
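The "stay six blocks behind the tip" rule just described can be sketched in a few lines of Python. This is a hedged illustration, not Google's actual loader code; the class and method names are assumptions. The point is that a block is only handed to the load step once it has six confirmations, so a reorg shallower than that depth never forces a delete.

```python
CONFIRMATIONS = 6  # depth mentioned in the episode

class Follower:
    """Tracks a position N blocks behind the chain tip."""

    def __init__(self, depth=CONFIRMATIONS):
        self.depth = depth
        self.loaded_height = -1  # highest block already loaded

    def on_new_tip(self, tip_height):
        """Return the block heights that are now safe to load."""
        safe = tip_height - self.depth
        ready = list(range(self.loaded_height + 1, safe + 1))
        if ready:
            self.loaded_height = ready[-1]
        return ready

f = Follower()
print(f.on_new_tip(10))  # heights 0..4 now have 6 confirmations
print(f.on_new_tip(11))  # only height 5 is newly safe
```

Because the follower never loads anything within six blocks of the tip, a reorg of depth less than six changes only blocks the loader has not yet touched.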
Starting point is 00:30:36 No mempool in BigQuery dataset. That's correct. Okay. Cool. So can you talk a bit about the technical infrastructure that you've built in order to query the blockchain and pull this data into your data set? Sure. Do you want to talk about just Bitcoin or do you want to get into Ethereum?
Starting point is 00:30:55 How do you want to... Let's talk about Bitcoin for the moment. We can talk about Ethereum a bit later. Okay, yeah, sure. So for the Bitcoin infrastructure, what we did is we built a custom Bitcoin client with a library called Bitcoin J. So this is the Java version that implements the Bitcoin peer-to-peer protocol. And it's a peer on the network like any other peer,
Starting point is 00:31:17 and it accepts new blocks coming in. And if a peer asks for a block, it will send the blocks out. But we're not doing any mining. We're just acting as sort of a file-sharing node, like a BitTorrent node, basically storing the blockchain. And then we know when new blocks are coming in because we're accepting these new files, right, block files. And we're looking at what the height of the chain is, and every time a block comes in, we increment this follower position that's X blocks behind, and then kick off a job using something called Cloud Functions that will grab that block from Cloud Storage. So there's the node that's running on a Compute Engine instance. It's a virtual machine, so it's just like a computer. No special mining hardware because we're not mining.
Starting point is 00:32:05 And then it writes the block file to storage. And then there's a function that watches that storage area for a new file to have come in, and it processes that file and sticks it into BigQuery. That's it. That's all it does. So you essentially have a node that's listening to the network that pulls in transactions and then stores them into BigQuery, where they can then be queried. And so as a user of BigQuery, how do I query it?
Starting point is 00:32:35 How do I query the blockchain? What language am I using? Are there APIs or SDKs that I can plug into my software to query the Bitcoin blockchain? Yeah. So we're using a language called SQL, or "sequel." This is the industry-standard language for interacting with databases. So Oracle Database runs SQL, and so do MySQL, Postgres, Microsoft Access, and all of that.
Starting point is 00:33:01 It's an industry standard thing. Teradata, all of them, they support some core SQL words or functions, you could say, like operators. And then typically there's some vendor-specific extensions. So BigQuery has some vendor-specific extensions, too, related to AI, geospatial, various other things. But for working with the blockchain data, you don't really need to have any vendor-specific extensions; you could conceivably take the loading system and push it into MySQL and it would all work in the same way. Okay.
Starting point is 00:33:34 So you could construct a query as simple as, you know, select all transactions from this day to this day where the amount transacted was, like, one Bitcoin, for example, and it would just return all those transactions in the result. Okay. You could select the mean price of a transaction per day across all days, or you could look at the quartiles or max or variance or whatever attribute you're looking for. If we continue on with this per-day example, you could partition by day, you could partition by block. I mean, there are many different ways you could slice it
Starting point is 00:34:12 and maybe you could correlate it with some other public data set, like weather, and try to see if there's any, like, patterns that are affecting people's transaction volumes or something like that. For sure. So while we were preparing for this episode, you mentioned this platform called Kaggle, which I had sort of heard of before, but wasn't super familiar with. So the way you described it to me is sort of this GitHub for data analysis. So it's a platform where data scientists can share their data analyses, I presume sort of like queries and code, and they can fork them. And it's sort of this open platform where this open community
Starting point is 00:34:57 of data scientists. Talk about some of the things that people are doing on Kaggle, things that you may have noticed with regards to these data sets. What type of analyses are people doing on the Bitcoin blockchain? Yeah, so Kaggle is the largest community of data scientists online. So there's quite a lot of machine learning happening there. They analyze in these notebook environments. So there's a computer sitting behind an interactive form in a web browser,
Starting point is 00:35:29 and they can run code against data that's sitting on a remote machine that's connected via the web browser. And it can also connect to BigQuery. So they can pull data into this analysis environment and process it with code inside of this notebook environment. They're typically programming in Python. Specifically in regards to the Bitcoin dataset, users have mostly been interested in looking at these features that we were just talking about,
Starting point is 00:35:59 like prices of transactions denominated in satoshis, or what were the largest transactions per day, and then correlating these to other datasets. So they like to link against other private data. So for example, as part of the Bitcoin data set, as you mentioned, we just have the blocks and the transactions. It's really just the chain data, right? But since this is all financial stuff, frequently people want to link against financial data. And so they'll bring in some other tables. They may host it in their own BigQuery tables or they may be uploading a CSV as part of their analysis. And then they can
Starting point is 00:36:33 start doing, you know, pricing-type analyses over time. Interesting. So then what people are analyzing is the Bitcoin transaction data and then crossing that data with other data, could be, you know, financial data. For example, if one wanted to see if there's any correlation between, like, the NASDAQ or Dow Jones indexes and the price of Bitcoin, for example. Well, I suppose the price of Bitcoin wouldn't work, because it's not in your data set. But one could make that type of analysis using the BigQuery language, perhaps combined with some machine learning stuff. Yep.
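The per-day aggregation pattern discussed above is plain SQL. The real tables live in BigQuery; as a hedged, runnable sketch, Python's built-in sqlite3 stands in here, and the table name and columns are illustrative, not the actual public dataset schema.

```python
import sqlite3

# Toy stand-in for a blockchain transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (block_day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2018-09-01", 1.0), ("2018-09-01", 3.0), ("2018-09-02", 2.0)],
)

# Partition by day and take the mean, as in the example in the episode.
rows = conn.execute(
    """
    SELECT block_day, AVG(value) AS mean_value, COUNT(*) AS n_tx
    FROM transactions
    GROUP BY block_day
    ORDER BY block_day
    """
).fetchall()
print(rows)  # [('2018-09-01', 2.0, 2), ('2018-09-02', 2.0, 1)]
```

The same `GROUP BY` shape, pointed at the public BigQuery tables, gives the daily means, maxes, and quartiles described in the conversation.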
Starting point is 00:37:17 Yeah. Auxiliary tables that are available elsewhere. I think there's, you know, the more data that you can link together, that's structured and documented and linkable, the more value that comes out. And it's not just one plus one equals two. It's more like, well, how would I say? It's not three plus three equals six. It's three times three equals nine. Right. The utility of the data is more of a product of the pieces as opposed to the sum. If you've listened to previous episodes with Marley Gray and Matt Kerner, you know that Microsoft is committed to providing enterprise-grade tools and infrastructure for blockchain developers. Well, the Azure Blockchain Workbench is perfect for organizations building consortium networks. Take the Ethereum proof-of-authority
Starting point is 00:38:02 template, for example. It's ideal for permissioned networks where consensus participants are known and reputable. Ethereum on Azure has on-chain network governance that leverages Parity's extensible proof-of-authority client. Each consortium member has the power to govern the network or delegate their consensus participation to a trusted operator. And Parity's WebAssembly support allows developers to write smart contracts in familiar languages like C, C++, and Rust. Azure Blockchain Workbench was created on the same principles that drive all production services in Azure, so you know you're relying on secure, redundant infrastructure that can scale. And with built-in services like authenticated APIs, off-chain databases, and secure key
Starting point is 00:38:39 management services, you can scaffold your infrastructure in just a few hours. To learn more about Azure Blockchain Workbench and how Microsoft is advancing blockchain usability in the enterprise, check out aka.ms/epicenter and start building today. We'd like to thank Microsoft Azure for their support of Epicenter. So recently there was a blog post that was published sometime in August that announced that Google's also releasing an Ethereum dataset that's now available on BigQuery. So what has the reception been like? It's been initially very positive. This is only a few weeks ago now, not even a month yet.
Starting point is 00:39:20 But the numbers are looking good in terms of utilization and number of inbound inquiries. I get a lot of basically direct pings from developers because my name is out there. I'm on Twitter, et cetera. And some fraction of them want to talk to me about things they're interested in doing. And relative to the Bitcoin dataset, the amount of developer activity has been very high. So I expect that the utilization of the Ethereum public data set will be even larger than the Bitcoin public data set. Yeah, the Bitcoin one, it's been regularly, heavily scanned ever since it was released. So it's a very popular public data set.
Starting point is 00:39:59 And presumably people are acquiring this public data to link against their private data, right, to do, I don't know what they're trying to do, something, analyze it for some purpose. The Ethereum one, because it's such a large developer community, though, I think there's going to be more variety of applications, and maybe even more volume of applications on this one, just because it's a lot more complex. So what are the unique challenges that you face in implementing the Ethereum dataset as opposed to Bitcoin? Because I mean, the Bitcoin dataset, it seems quite... I mean, I don't want to say it's a simple feat,
Starting point is 00:40:37 but it seems quite simple, right? You've got a node, it pulls transactions. There's some data processing in these transactions to normalize the data, and then it's put into essentially a SQL database. And with Ethereum, there's a bit more to it. There's the transactions, but there's also smart contract transactions and token transactions. And there's a whole bunch of other things that go into that, that, you know, add some complexities.
Starting point is 00:41:02 Can you talk about those and how you've overcome those challenges? Yeah. So as you said, Bitcoin Data Set is really just, you know, transfers and then the cost of the transfer that the requester was willing to pay for that transaction. Right. And there's some strange like obfuscation where there's change addresses and intentional obfuscation of who's really paying whom as part of like a pseudo private design. Ethereum doesn't have that.
Starting point is 00:41:30 So this analysis of where money is flowing to is not a problem in Ethereum. But there's this other difficulty where there can be data that go along with a transaction, and that could be a smart contract or other types of data. And that's really the core complexity: you've got this Ethereum virtual machine that takes inputs that go into some compiled code that lives at an address on the blockchain that can do things with that input,
Starting point is 00:42:02 those input bytes that are going along with the transaction that are coming in. And what those smart contracts do with those input bytes is arbitrarily complex. It's a Turing machine. So dealing with representing that complexity is very difficult, which is why it took a lot longer to get the Ethereum dataset released than the Bitcoin dataset. So token transfers, that's a great example.
Starting point is 00:42:28 The infrastructure for the Ethereum dataset is pretty similar. So we're operating an Ethereum node, and it's writing out some files into Cloud Storage, but it deviates from the Bitcoin design there, where the loading is happening not just as a direct insertion into BigQuery, but through another cloud component called Cloud Composer. So this is based on an open source project called Apache Airflow, with which you can define an ETL pipeline. So ETL is a term used in data warehousing.
Starting point is 00:43:05 Basically everything we're talking about today is data warehousing, and then analysis on the stored data. So ETL is for extract, transform, and load. And basically you're extracting data from the Ethereum node. You're transforming it to some form that will be useful for users. So it could be reading the transactions and parsing them so that you can see if it's an ERC20 transfer or not, or an ERC721 transfer, or whatever other kind of smart contract function call. And then the load part is putting it into the tables.
Starting point is 00:43:34 And so there's a whole bunch of additional ETL processing that we're doing as part of loading the Ethereum data into BigQuery, because we don't just want to load the transactions, where we give only the bytes that were the input and then they go to some smart contract, but we don't tell you what that meant. At base value, there are really no interesting analyses that are enabled by only giving the input.
Starting point is 00:44:16 You actually have to look at what the smart contracts are doing. And so we're looking at these things called traces and logs that are emitted by the smart contracts as part of their operation. So they have some events that are coming out that describe what the smart contract is doing. And so we're putting all of that into some tables, too, so that you can look at those and aggregate on the effects of the smart contract on the network. Can you go into a bit more detail about how you perform these analyses
Starting point is 00:44:45 on these traces and logs? Yeah, sure. So this is getting pretty far down into the weeds of how Ethereum works, but smart contracts, they have these functions that are defined typically in Solidity, and a given function will have some defined inputs. An input could be, like, an address, or it could be an amount, or it could be other random binary data. And then they do something with this data, and the result of what they're doing is typically emitted as events. And so what we're doing is we're taking those events and putting them into a table so you can observe it.
Starting point is 00:45:21 An example would be... and the input is actually very difficult to understand, because it's a binary string. It's a bunch of bytes, and you need to segment the bytes according to the function specification. So there's, like, all this array manipulation and very low-level stuff that's not really relevant to the business purpose of making a query or analysis, but that you have to do in order to be able to do the query or the analysis. And so really what we're doing is we're factoring out all of this menial labor that a developer would have to do, doing it once and doing it correctly. And then nobody has to work on that problem again,
Starting point is 00:46:01 if they're willing to run their operations on BigQuery. So the Ethereum blockchain contains, the data is contained there. So within the input bytes of the smart contract, you know, we have the information there, for example, perhaps some address data or the function call hash. And what you're doing is you're extracting that data from essentially just sort of a blob of bytes and making it available in this query format so that now you might be able to say, okay, what are all the contracts
Starting point is 00:46:35 that are calling this specific function? Correct. Or what are the transactions that are calling to this specific function in this contract and making it easily available, whereas if you wanted to do that by yourself, you would have to build it from scratch?
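The byte-segmentation step described above can be sketched for the familiar ERC-20 `transfer(address,uint256)` call. The selector `0xa9059cbb` is the well-known first four bytes of the keccak256 hash of that signature; everything else here, including the function name and the sample values, is an illustrative assumption, not the dataset's actual decoding code.

```python
TRANSFER_SELECTOR = bytes.fromhex("a9059cbb")  # transfer(address,uint256)

def decode_transfer(calldata: bytes):
    """Split raw calldata per the ABI layout: a 4-byte function
    selector followed by two 32-byte big-endian words."""
    if calldata[:4] != TRANSFER_SELECTOR:
        return None  # some other function
    to_word, amount_word = calldata[4:36], calldata[36:68]
    to_address = "0x" + to_word[-20:].hex()  # address is the low 20 bytes
    amount = int.from_bytes(amount_word, "big")
    return to_address, amount

# Made-up example input: transfer 1000 token units to an address.
calldata = (
    TRANSFER_SELECTOR
    + bytes.fromhex("de0b295669a9fd93d5f28d9ec85e40f4cb697bae").rjust(32, b"\x00")
    + (1000).to_bytes(32, "big")
)
print(decode_transfer(calldata))
```

Doing this slicing once, for every known function signature, is the "menial labor" the dataset factors out for its users.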
Starting point is 00:46:52 Yeah, we're just reformatting the data, right? It's in this form that is just difficult to access for doing certain types of analyses. It's designed for usability by the Ethereum blockchain software, right? The Ethereum peer-to-peer network is concerned with consensus and concerned with the efficiency of transactions and concerned with the operations of the Ethereum virtual machine in the blockchain, right? But it's not concerned with, hey,
Starting point is 00:47:23 what if somebody wants to do some historical analysis on all the data in here? That's irrelevant from the point of view of a database. If somebody was to design a database that is for transactions, then you would design it in a particular way. If you had a database you wanted for analysis, you would design that in an almost like, you would call it the opposite way. And if you get into data warehousing theory,
Starting point is 00:47:46 these are the two extremes of different types of database design. One's called an OLTP. It's an online transaction processing database. The canonical example is usually like a hotel or an airline reservation system. It's very concerned with transactional integrity and very concerned with throughput, number of transactions per unit time. But it doesn't really concern itself with analyzing like price trends of the hotel rooms or the flights, right? But you can take that transactional data and reformat it into an online analytics processing system, which is an OLAP database,
Starting point is 00:48:23 which denormalizes it, doesn't care about transactions, but tries to structure the data in such a way that it's easy to send any arbitrary query against it and get reasonable performance for the query when you want to get the data back to a business application or an analyst or something like that. So at a higher level, that's what we're doing with these blockchain data. We're taking OLTP data.
Starting point is 00:48:44 and we reformat it, and we make it available as OLAP. Okay. So it occurs to me that an OLTP system is like Bitcoin, but with mutability? Bitcoin does not have mutability. Yeah. These OLTPs can be, yeah. So if you add an immutability constraint to an OLTP, you get a blockchain. Sure. You could say it like that. Okay. What type of analyses have you been doing on the Ethereum data sets? In the blog post,
Starting point is 00:49:24 there's a couple of examples there. Can you talk about those? Yeah, one of the examples, I mean, some of them are time series examples where we're looking at, you know, number of transactions per day. And that's a very obvious kind of thing to do. What interests me more, though, is the characteristics of actors on the network or, I guess, interactors, you could say. Because wallets aren't doing anything on their own on Bitcoin or Ethereum or any other blockchain. There's always some interacting partner, right? And that interaction between those two partners tells you something; it's some measurable observation. And if you look at multiple of these observations in aggregate for one address or groups of addresses over time,
Starting point is 00:50:13 you can start to quantify what they're doing and assign attributes to the addresses. So as a concrete example, you could identify exchanges, because they're typically going to have large volumes flowing in and out of them, and they'll typically have many, many interacting partners of other wallets sending money or tokens in or out. Now, there could be other addresses in the network that aren't exchanges or aren't known exchanges that have similar behavior, but we could use a duck typing approach to characterize those. I mean, if it looks like a duck and it quacks like a duck and it smells like a duck, it's probably a duck. So you could start to
Starting point is 00:51:04 label exchanges that are unknown based solely on their attributes by looking at their behavior over time. Mining pools would be another example. You can start to pick those out and you can imagine, you know, what kind of characteristics a pool would have or a miner who is time sharing inside of a pool, they'll have a particular type of characteristic, right? They're going to receive deposits periodically and only after the mining pool mines a block, right? So that sort of thing is very interesting to me because it relates very much to the type of work I was doing in my dissertation as a graduate student. I was looking at genetic networks, and particularly the human genome network, which is composed of genes that are interacting with one
Starting point is 00:51:51 another to operate a cell, which is basically a highly parallel distributed system of these molecules that are interacting with one another to process information. And there's a whole bunch of analytical techniques that can be used that come from biostatistics for analyzing these biological networks. You can analyze financial networks with the same techniques. And it turns out some of these guys doing anti-money laundering applications or other types of, like, fraud analytics or, you know, forensic accounting, they're also very interested in this kind of stuff, but they don't necessarily have the
Starting point is 00:52:27 level of sophistication the biologists do, because the National Science Foundation and the National Institutes of Health have been throwing a lot of dollars at, you know, curing cancer for a long time, which is why a lot of these methods were developed, because cancer is a disease of the genetic program. I feel like I'm kind of getting off on some tangents here and rambling, but there's some very direct connection between the math and the methods of what I was doing as a graduate student and this stuff that's happening with blockchain right now. And not the blocks or the consensus themselves,
Starting point is 00:53:02 but the interaction between the entities on the network. That's what I'm interested in. Well, feel free to elaborate. I think that's a really fascinating topic. Yeah. Well, I just gave you two examples, one of them being identifying exchanges, the other one being identifying pools or time-share miners. I don't know, what other sorts of interesting patterns would you want to look for?
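The duck-typing idea from a moment ago can be sketched with toy data: aggregate each address's volume and distinct counterparties, then label it "exchange-like" when both are high. The thresholds, addresses, and feature choices below are made-up illustrations; the real features would be aggregated over the BigQuery tables.

```python
from collections import defaultdict

# Toy transfer list: (from_addr, to_addr, value).
transfers = [("a", "ex", 50), ("b", "ex", 70), ("c", "ex", 40),
             ("ex", "d", 90), ("a", "b", 5)]

# Aggregate per-address volume and counterparty sets.
stats = defaultdict(lambda: {"volume": 0, "partners": set()})
for src, dst, value in transfers:
    for addr, partner in ((src, dst), (dst, src)):
        stats[addr]["volume"] += value
        stats[addr]["partners"].add(partner)

def looks_like_exchange(addr, min_volume=200, min_partners=4):
    """Duck-typing rule: high volume AND many distinct partners."""
    s = stats[addr]
    return s["volume"] >= min_volume and len(s["partners"]) >= min_partners

print([a for a in stats if looks_like_exchange(a)])  # ['ex']
```

A mining-pool detector would swap in different features, e.g. periodic deposits that arrive only after a block reward, per the pattern described above.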
Starting point is 00:53:24 Well, there was one example here in the blog post that I thought was kind of interesting, and that's analyzing the functionality of smart contracts. Can you dive into that one? Yeah, sure. That's pretty cool. Because, so as you mentioned earlier, these contracts have these functions, right? And each function has its own signature. If you consider a smart contract as having some set of functions available to it, where it's between zero functions to, whatever, all possible functions in the hash space, 4 billion functions. And every smart contract has some subset of that 0 to 4 billion. You can define a distance metric or a similarity metric between any two contracts that tells you how similar they are in the functions that they implement.
Starting point is 00:54:15 So it's reasonable to say that if two smart contracts have the same functions available, probably what they do is, if not the same, quite similar. So all ERC20s implement four methods. And so you could find all ERC20 contracts by checking to see if the contract implements those four methods. That's actually how we do this in the database. There's a table that documents specifically ERC20 transfers, and there's a table that lists all smart contracts. And there's a Boolean column on that smart contract table which says, is ERC20, or is ERC721? Because these are the two, you know, dominant smart contract types. And so we just kind of add, as a convenience, that pre-analysis to save developers some time, so they can restrict analysis to those smart contracts.
Starting point is 00:55:09 But you could do it for any arbitrary, you know, set of functions you're interested in; all the data are there. Right. So for any standardized smart contract, such as the token contract or the ERC721 contract, you can essentially just look at the transaction, look at the function hashes in the contract, and determine whether or not that's in fact this type of transaction. That's right.
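One plausible form of the contract-similarity metric being described is a Jaccard similarity over each contract's set of 4-byte function selectors: shared functions divided by all functions. This is a hedged sketch, not the metric from the blog post; apart from `a9059cbb` (the well-known ERC-20 transfer selector), the selector strings are placeholders.

```python
def similarity(funcs_a: set, funcs_b: set) -> float:
    """Jaccard similarity between two contracts' function-selector sets."""
    if not funcs_a and not funcs_b:
        return 1.0
    return len(funcs_a & funcs_b) / len(funcs_a | funcs_b)

# Hypothetical selector sets: an original contract, an upgraded version
# that adds one function, and an unrelated contract.
original = {"a9059cbb", "70a08231", "dd62ed3e", "095ea7b3"}
upgraded = original | {"40c10f19"}
unrelated = {"12345678", "9abcdef0"}

print(similarity(original, upgraded))   # 4 shared / 5 total = 0.8
print(similarity(original, unrelated))  # 0.0
```

Upgrades score high because they keep the old selectors and add a few; clones score near 1.0; unrelated contracts score near 0, which matches the CryptoKitties-clone behavior described next.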
Starting point is 00:55:41 And then you're making that data available so that you don't have to extract it yourself. It's already made available for you in the BigQuery data sets. Yep. And I defined a function for the similarity metric we were just describing. There's an analysis like this that's done as one of the examples in the blog post, where I show the original CryptoKitties contract. And you can see all of the subsequent iterations where they basically upgrade the contract. And they're all similar to one another because they're adding functions over time. And then you can also see clones,
Starting point is 00:56:12 because the CryptoKitties contract is open source, you can see somebody made crypto puppies and crypto clowns and other variants of this thing. It's basically a clone of the game using the same code. And it shows up as a very similar contract. So if you have some game that you like to play, let's say you like to play any of these Connect 3 jewel games, Candy Crush or something,
Starting point is 00:56:35 you could find all other Candy Crush-like games on the blockchain because they would have similar functionality. I think I saw this in one of your talks that you gave in Singapore recently. I tried to find a link to add to the show notes, but there was this analysis of the frequency with which specific contracts were called, and the token contract, I guess, overshadowed everything else. But then there was one contract which more recently had gained quite a bit of volume. And that was the CryptoKitties contract.
Starting point is 00:57:09 Yeah. It had a very brief spike, and I think probably most of your listeners will remember when the Ethereum blockchain, you couldn't add transactions to it. It was clogged up, and a bunch of ICOs had to delay because there was a CryptoKitties craze, right? But the ERC20 transfer function is the most common one, or has been. I haven't looked at the data recently, but I would imagine it's still quite common. Yeah, of course, I'm sure it still is. One area that I wanted to dive into is, so we mentioned earlier that the Google Cloud services are integrated. So we can essentially connect different services. So what are the types of analyses that one could make?
Starting point is 00:57:53 I figure there's actually quite powerful analysis that you can make using the AI component of Google Cloud with the Ethereum transaction data. Have you thought about the types of things that one could infer from this data by plugging in a machine learning algorithm and doing some deep learning on this transaction data? Yeah, so I am actively thinking about that. And so for the data that are in BigQuery, there are a couple of first-order things that you can do just with the data as they are today.
Starting point is 00:58:27 You could look at the input bytes going into transactions and begin to reason about what the functionality of a contract might be. Or a function signature that you don't know what it does, but you see what its inputs are, you could identify round numbers, or you could identify addresses. And this would give you some hint as to what that function is likely to be doing
Starting point is 00:58:55 without getting into analysis of the stack trace of the virtual machine. You can also analyze the smart contracts themselves, because these are all some bytecode, and you can treat each of those bytes as features and train some kind of analyzer to classify contracts. It would probably be something quite similar to what we were just talking about of finding similar contracts,
Starting point is 00:59:19 but would have some additional ability to detect other things that are difficult to do with such a simple method. But the more interesting methods, you can't really do them. So let's come back to, you know, I was talking about networks and network analysis. You can't really do network analysis on the BigQuery dataset directly because network analyses require traversals through the network. So this basically would mean scanning, like looking at a table that is set up for scanning. We talked about BigQuery being a scanning system earlier.
Starting point is 00:59:57 You would have to basically have random access and do these recursive queries, which it's not really very well suited to. It's well suited for linear scans. And so in order to do these analyses that involve traversing the network, you need to move them into another type of database called a graph database. And so what's beginning to happen now, this is me and some other data scientists in the open source space, we're exploring this. And there's another link we can put in the supplement to some work by one of my collaborators. We're loading data into graph databases, analyzing it, reducing the graph down to something like a single measurement per address or per transaction, allowing us to assign a value between 0 and 1 of the probability of this thing being an exchange, for example.
Starting point is 01:00:48 And then taking those attribute data from the graph database, putting them back into BigQuery, so then it becomes, we would call it a vector of features. This is getting kind of very specific into machine learning now. But these machine learning models typically want to operate on something called a vector space model. They want every observation to be a row. But what I just told you is, like, in order to work with the network data, it's not row-like in nature. It's graph-like in nature. But you can reduce the graph down to rows by going out to a graph database.
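A toy version of the reduction just described: build the transfer graph, collapse it to one row of numbers per address (in-degree, out-degree, volume), and those rows become the feature vectors handed back to the warehouse. Plain dictionaries stand in for the graph database here; the edge list and feature choices are illustrative assumptions.

```python
from collections import defaultdict

# Edge list of transfers: (from_addr, to_addr, value).
edges = [("a", "x", 10), ("b", "x", 20), ("x", "c", 25)]

in_deg = defaultdict(int)
out_deg = defaultdict(int)
volume = defaultdict(float)
for src, dst, value in edges:
    out_deg[src] += 1
    in_deg[dst] += 1
    volume[src] += value
    volume[dst] += value

# One feature row per address -- the "vector of features" that would be
# written back into BigQuery for downstream ML models.
addresses = sorted(set(in_deg) | set(out_deg))
rows = [(a, in_deg[a], out_deg[a], volume[a]) for a in addresses]
print(rows)
```

The graph-like structure (edges) goes in; row-like structure (one vector per address) comes out, which is exactly the shape a vector-space ML model expects.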
Starting point is 01:01:20 So iterating between the data warehouse and a graph database to create more elements in the data warehouse, basically doing enrichment through analysis of the graph, enables these AI algorithms to begin analyzing the graph. So that's the direction that we are moving right now. And fortunately, Google Cloud has good technologies for building graph databases as well. So we're secure there, but it requires a lot of work. You mentioned earlier that the entirety of Reddit was also available as a public data set in Google Cloud. Yeah.
Starting point is 01:02:02 It occurs to me that, you know, we might be able to do some sort of analysis as well as to the success of an ICO, for example. So take all the ICOs, you know, since two or three years ago, look at the amount of money raised, and then perhaps correlate that with some natural language processing data that you would have extracted from, like, Reddit communities, and maybe some other data in there, and come up with some kind of predictive model to determine what kind of characteristics of communities
Starting point is 01:02:35 that you might look for in order to determine whether an ICO will be, like, you know, successful or not, or this type of thing. Yeah, totally. So Reddit actually was presenting at the Google Cloud Next conference over the summer, talking about what they're doing to analyze some of the activity. They're trying to basically help people find more content
Starting point is 01:02:59 on Reddit so they can continue, you know, living out there happy Reddit life. But yeah, you can use this type of technology called natural language processing. It's a type of AI that tries to understand natural language, like the Google Assistant. You might see this on your phone or Siri or Google Duplex or Google Home. You've probably seen these like conversational agents, right, bots. So you can take some text or speech to text and then take the text. And so you can, send it into an AI and extract things like, hey, what were the key, what was the key object that was being talked about here? Or what was the sentiment being expressed about that object?
Starting point is 01:03:41 And quantifying it, basically reducing the human-readable text down to some numbers, and then you could cross-reference those numbers. So back to your example, like, okay, was it more important to the success of a project that there was a lot of positive sentiment? Or is it more important that there was just a lot of buzz in general, even though a whole bunch of it was negative? I mean, I don't know what the answer is, but you could begin to explore that kind of line of reasoning
Starting point is 01:04:08 by linking the Reddit dataset to the Bitcoin or Ethereum dataset. Could make for a really interesting PhD thesis. I would love to support any data scientist who wants to work on this. Please, if anybody listening wants to work on that, shoot me an email, or join us on Kaggle, or, you know, let's get the data analysis going. I can see something like analyzing memes, you know, stuff like Dogecoin memes. Like, how many times are people sharing a specific meme?
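The quantification step described above, reducing human-readable posts to numbers you can cross-reference, can be sketched with a toy word-list scorer. The lexicons, project names, and posts are all invented for illustration; a real pipeline would call a natural language API instead:

```python
# Toy sentiment lexicons, invented for illustration only.
POSITIVE = {"great", "moon", "bullish", "love"}
NEGATIVE = {"scam", "dump", "bearish", "rekt"}

def sentiment(text):
    """Score one post in [-1, 1] from counts of lexicon hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def quantify(posts_by_project):
    """Reduce each project's posts to two numbers: buzz and mean sentiment."""
    features = {}
    for project, posts in posts_by_project.items():
        scores = [sentiment(p) for p in posts]
        features[project] = {
            "buzz": len(posts),
            "mean_sentiment": sum(scores) / len(scores),
        }
    return features

# Made-up posts standing in for scraped Reddit comments.
posts = {
    "coinA": ["great project to the moon", "bullish love it", "total scam"],
    "coinB": ["dump incoming bearish", "rekt again"],
}
feats = quantify(posts)
```

Per-project pairs like (buzz, mean sentiment) are exactly the kind of numbers you could then join against amounts raised in the blockchain datasets to explore the sentiment-versus-buzz question.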
Starting point is 01:04:42 Does that have an impact on communities, on the success of an ICO, or something of that nature? I mean, we even have this Vision API, which is a sibling of the language API, that can actually look at images and analyze those and tell you what's in the image and, you know, what's the general sentiment about that. Yeah, for sure. Cool. So looking into the future, are there any plans to release other tokens, blockchains, or cryptocurrencies on the open platform? I'm looking at a bunch of other stuff right now. Yeah, more of these public blockchain datasets. It's all just sitting there in these public data sets, and the current ones are getting a bunch of good traction and analysis. Yeah, that would be interesting to do. I think the Ethereum data set has quite a lot
Starting point is 01:05:35 of mileage to be gotten out of it, though. It's just so deep and interesting. I've not done any plot of developer activity versus, I don't know, market cap or anything like that, but I would imagine Ethereum would be a real outlier. That community has done a good job of making ecosystem components available and being very supportive of their developers. And so it's just very vibrant compared to a lot of the other projects. Even though Bitcoin is, you know, bigger by market cap, it didn't really attract as much. I didn't get as many inbound inquiries about that data set as I did for the Ethereum
Starting point is 01:06:14 dataset. Well, Allen, this is all very fascinating. I want to thank you for coming on the show and talking about this. I think there's a lot of interesting things to be built on top of these data sets. And we kind of mentioned this before the show, but in some ways, this data is meant to be public, right? And a lot of companies out there have sort of made their business model upon these data sets. I mean, like block explorers, companies that do blockchain analysis. Their entire business models are built on what is essentially a public data set, but the blockchain itself doesn't really have the underlying infrastructure that allows
Starting point is 01:06:54 just anyone to make these queries, which are actually quite simple, on top of them. We've had to build all this other infrastructure on top. So the fact that Google is making this available to the public is really great. And I'm looking forward to seeing what kind of applications or what kind of research comes out of this in the future. Yeah, that's what we're here to do. We're here to help developers make cool applications. Great. So we will add,
Starting point is 01:07:17 well, I've got a ton of links to add to the show notes. So everything we've talked about, the blog posts, the links to the actual data sets, and some of the other articles that we mentioned during the show will be in the show notes. So thank you to our listeners for tuning in. You can get new episodes of Epicenter every week by subscribing to us on YouTube,
Starting point is 01:07:35 SoundCloud, or your favorite podcast app, whether that be on iTunes or Android. We will soon be on Spotify; that is going to be finalized in the next couple of weeks. I know that people have been asking for that, and it took a while, but we're finally going to be on Spotify,
Starting point is 01:07:49 so we'll be tweeting that out as soon as that becomes the case. That's where I'll be listening. That's great. Great. Awesome. We're also on Google Play. For those who don't know that, we're also on Google Play music.
Starting point is 01:08:03 And if you want to support the show, we'd love to hear from you via iTunes reviews. They help people find the show, and we're always happy to see your reviews there. So thanks so much. We look forward to being back next week.
