Storage Developer Conference - #161: Analysis of Distributed Storage on Blockchain
Episode Date: February 2, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 161.
Hello, everyone. My name is Tejas Chopra. I am a Senior Software Engineer at Netflix, and today I'll be talking about an analysis of distributed storage on the blockchain.
The agenda for the talk: I'll briefly introduce myself. We'll talk about what centralized and decentralized storage mean.
We'll get into the blockchain, how it has off-chain and on-chain solutions for storage. We'll discuss some storage alternatives out in the wild today,
such as Storj, Sia, IPFS, and ILCoin,
and finally some takeaways and learnings.
I am a Senior Software Engineer at Netflix.
I work with the data storage platform team.
My job is to create services that can help Netflix studios and streaming platforms manage
exabytes of data and billions of assets.
I'm also a keynote speaker.
I talk on distributed systems, cloud, and blockchain.
Before Netflix, I was working as a senior software engineer at Box, where, again, I
was architecting storage solutions to support millions of customers and petabytes of data.
I have experience working in companies such as Datrium, Samsung, Cadence, and Tensilica.
Today, we are seeing a lot of enterprises move to cloud storage.
Cloud can allow scalability and ease of manageability.
But at the end of the day, cloud is an example of centralized storage: a single entity, an enterprise such as Amazon, Google, or Microsoft, contains and controls all of the data.
The data is not secure because a lot of these enterprises can actually look into the data that you've stored.
And then you need to trust the enterprise with your data.
The other important fact to note is that one of these cloud providers could actually go down with your data, so you have to maintain multiple availability zones and replication
factors to ensure that the data is not lost in that case. And a lot of that is actually left
to the client to deal with. There have been instances at Netflix, where Amazon is our primary cloud provider, when Amazon did go down.
That led to a lot of redundancy and checks on Netflix's end to ensure that such a situation doesn't occur in the future.
The other end of the spectrum is something that we will discuss today, which is decentralized storage.
Decentralized storage is actually when there is no central authority dealing with storage.
It is secure, encrypted and scalable by its nature.
And it is also very cost effective because it also has variants where there is a marketplace for both folks that want storage and folks that want to provide storage.
Whenever there is a marketplace, it will drive down the costs. And in some cases,
it has been observed that the costs are 10 times lower than the costs charged by
these cloud providers. A natural fit for decentralized storage is the blockchain.
The blockchain ledger can record shard hashes, data locations, leasing costs, or other transaction-specific information.
And the blockchain can also match users looking for storage with hosts that are offering it.
So what exactly is the blockchain?
A blockchain is nothing but a distributed ledger.
It is trustless, scalable, performant, and secure.
A lot of industries are being disrupted by blockchain.
At its very core, it contains a chain of blocks, where each block contains a bunch of transactions and a hash of the previous block.
So whenever a new transaction gets entered, let's say one person is sending money to another, which is the basis of how Bitcoin was formed.
The transaction is transmitted to a network of computers scattered around the world.
These clients, or nodes in the network, then solve some equations to confirm the validity of these transactions.
Once they have been confirmed to be legitimate, the transactions get recorded and they are clustered together to form something called a block.
And the blocks are actually chained together, creating a long history of all the transactions that have ever occurred.
And this chain is permanent in nature. So you cannot mutate any of these transactions without actually changing the hash of that block and the hash of all the subsequent blocks from that point
on. That is why it is very difficult to break the system and that is why it is scalable,
performant and very secure. And after that, finally, the transaction is marked as complete.
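To make the chaining concrete, here is a minimal Python sketch (not any real blockchain's code; all names are illustrative) showing how each block commits to the previous block's hash, so mutating an old transaction breaks every later link.

```python
# Minimal sketch (not any real blockchain's code): each block commits to the
# previous block's hash, so editing an old transaction invalidates the chain.
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding of the block's contents.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions: list, prev_hash: str) -> dict:
    return {"transactions": transactions, "prev_hash": prev_hash}

genesis = make_block(["alice -> bob: 5"], prev_hash="0" * 64)
block2  = make_block(["bob -> carol: 2"], prev_hash=block_hash(genesis))
block3  = make_block(["carol -> dave: 1"], prev_hash=block_hash(block2))

# Tampering with the genesis block changes its hash, which no longer matches
# the prev_hash recorded in block2, and so on down the chain.
genesis["transactions"][0] = "alice -> bob: 500"
assert block2["prev_hash"] != block_hash(genesis)
```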
As you can see, blockchain is ideal for storing transactional data.
There are three pillars on which a blockchain design is based. They are security, scalability and decentralization.
There have been scalability issues with Bitcoin in the past. Bitcoin has two variables that you can change.
One of them is the size of the block and a block contains multiple transactions inside it.
The original Bitcoin had a block size of only a megabyte. So you could only fit some number of transactions in that.
And the other is how frequently a block gets generated.
This is determined by the difficulty level, and a block arrives roughly every 10 minutes.
So now, as you can see, because you're limited by these two variables,
the number of transactions per second provided by Bitcoin is around four, compared to the Visas and Mastercards of the world that do 60,000 transactions per second.
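As a rough back-of-the-envelope check of that figure, here is a small Python calculation; the roughly 500-byte average transaction size is an assumption for illustration, not a protocol constant.

```python
# Back-of-the-envelope throughput estimate; the ~500-byte average transaction
# size is an assumption for illustration, not a protocol constant.
block_size_bytes   = 1_000_000   # ~1 MB original Bitcoin block size
avg_tx_size_bytes  = 500         # assumed average transaction size
block_interval_sec = 600         # one block roughly every 10 minutes

tx_per_block = block_size_bytes // avg_tx_size_bytes   # ~2,000 transactions
tps          = tx_per_block / block_interval_sec       # ~3.3 tx/s

print(f"{tx_per_block} transactions per block, ~{tps:.1f} tx/s")
# Scaling the block size to 32 MB (as some Bitcoin variants did) multiplies
# this figure by 32 while the 10-minute interval stays fixed.
```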
So in order for Bitcoin to get mainstream adoption, we would have to either increase the block size,
which is what some variants of Bitcoin, such as Bitcoin Cash and Bitcoin SV, have done.
They've increased the block size from one megabyte to 32 megabytes.
So more transactions can fit into a block.
And that has given a 32x improvement in the number of transactions per second.
But this only solves the problem so much.
You cannot have an infinite block size.
So there are alternatives that have been proposed in the community to deal with large amounts of data and scalability for the blockchain. And we'll get into
that now. The two types of solutions generally and the two ways of thinking about it are as follows.
Because the data itself can be very large, if you try to fit that data on the blockchain,
it can exhaust the blockchain. In fact, there was a study that said that even a five-megabyte upload
could bring down the entire network. So what is done instead is that the metadata of this data,
which is generally 1% of the size of the data, is stored on the blockchain. And the data itself is
stored outside the blockchain in a separate cluster of machines. So that is what most
solutions are doing today.
And these are off-chain solutions because the data itself lives off the chain and the metadata
lives on the chain. Compared to a traditional application where the application interacts
with data that resides on a database, in this hybrid situation of a blockchain, you have a
distributed app that deals with data that is off-chain,
but also deals with a contract that has transactions, state and smart contract
information on the blockchain. This is the on-chain metadata.
If we were to generate any solution, the general design principle would be as follows.
You would have some data that you want
to put on the blockchain. You would first split the data into multiple shards. A shard is nothing
but a small part of the data. You would then encrypt the shards and keep the encryption key
at the client side. So the client is the only authorized person that can decrypt all of this
data and get the original file back. These encrypted shards will then be uploaded to a cluster of machines in the background.
This cluster of machines is nothing but the nodes in our blockchain system.
And also, once they have been uploaded, the location of where the shards are located
would be updated on the blockchain ledger. The client is the only authorized entity
that should know two things.
One is the file itself, and the second is the location of the shards.
So the client maintains the encryption keys to decrypt both these pieces of information.
This is how a generic system would be designed.
And this is the basis of the design of Storj, Siacoin, and IPFS, all of which we will look at.
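A minimal Python sketch of that generic pipeline might look like the following; the shard size, node names, and round-robin placement are all illustrative assumptions, and encryption is left out here because it is covered separately in the Storj section below.

```python
# Hypothetical sketch of the generic design described above: shard a file,
# upload each shard to some node, and record shard hashes plus locations.
# Encryption is omitted here (a separate client-side encryption sketch follows).
import hashlib

SHARD_SIZE = 4 * 1024 * 1024  # 4 MiB shards; an arbitrary illustrative size

def shard(data: bytes, size: int = SHARD_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def upload(shards: list[bytes], nodes: list[str]) -> list[dict]:
    """Pretend-upload each shard to a node; return ledger entries."""
    entries = []
    for i, piece in enumerate(shards):
        node = nodes[i % len(nodes)]            # naive round-robin placement
        entries.append({
            "shard_hash": hashlib.sha256(piece).hexdigest(),
            "location": node,                    # which node holds this shard
        })
    return entries

ledger = upload(shard(b"example file contents" * 1_000_000),
                nodes=["node-a", "node-b", "node-c"])
# In a real system only the hashes/locations (the metadata) would go on-chain;
# the shard bytes themselves live off-chain on the storage nodes.
```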
Storj was one of the first projects
to tackle storage on the blockchain.
And there are four important stages in Storj:
client-side encryption, data sharding,
distribution across the network, and periodic verification.
Client-side encryption means that before you upload your data onto the network,
it will have to be encrypted on the device.
And there are different types of encryption algorithms that are used.
In Storj, they use the AES-256-CTR algorithm, which is pretty secure.
One of the main benefits for users of client-side encryption
is that all the information that is required to decrypt the file is kept away from the nodes
that actually store the data. And the decryption key is controlled on the client machine.
Of course, there are cases where a client does not have the decryption key but wants to decrypt the data.
So if someone moves to a different machine and wants to decrypt the data there, that is enabled by Storj Bridge.
It's a service that just maintains your keys for you.
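As an illustration, here is a minimal client-side AES-256-CTR sketch using the Python cryptography package; it shows the idea of keeping the key on the client, but it is not Storj's actual code path.

```python
# Minimal client-side AES-256-CTR sketch using the Python "cryptography"
# package; this illustrates the idea, it is not Storj's actual implementation.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_shard(plaintext: bytes, key: bytes) -> tuple[bytes, bytes]:
    nonce = os.urandom(16)                      # unique per shard
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce, enc.update(plaintext) + enc.finalize()

def decrypt_shard(nonce: bytes, ciphertext: bytes, key: bytes) -> bytes:
    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    return dec.update(ciphertext) + dec.finalize()

key = os.urandom(32)                            # 256-bit key, kept client-side
nonce, blob = encrypt_shard(b"shard contents", key)
assert decrypt_shard(nonce, blob, key) == b"shard contents"
# Only the client holds "key"; storage nodes only ever see "blob".
```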
The second important part is sharding.
Sharding is nothing but splitting the file into smaller chunks for it to be uploaded.
And there are many reasons why you would want to shard your data. For one, you do not want your
entire file to live on only one particular node. So you only want to put pieces of your file on
each node. So in the worst case, if a node goes down or if it is unavailable, and let's say you don't have any redundancy, you will only lose a part of the data and not the whole data.
But redundancy comes built in. There are replication factors that Storj maintains for you.
So even if a node goes down, you will still be able to access its piece, because it may be present on some other node.
The other benefit is that there is no node that holds all of your files.
So that means even if someone is able to decrypt a shard
by whatever means,
they will only be able to get a small chunk of your file
and not the entire file.
The other important part of privacy
is knowing which nodes contain your file.
And the Storj network actually uses a Kademlia distributed hash table,
which hides the information
of exactly which nodes the shards are located on
from anyone else but the user.
So the user is the only person that knows
which machines have the chunks of the file
that were uploaded.
And also, once it gets all these chunks back,
the user is the only person that can decrypt them
and get the entire original file.
So there are two levels of encryption
that are actually done in Storj.
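Here is a toy Python illustration of the Kademlia idea, where a shard's key is compared to node IDs by XOR distance and the shard lives on the closest nodes; the node names and helper functions are hypothetical, not Storj's implementation.

```python
# Toy illustration of Kademlia-style routing: a shard's key is compared to
# node IDs by XOR distance, and the shard lives on the closest nodes.
# Hypothetical sketch, not Storj's implementation.
import hashlib

def node_id(name: str) -> int:
    return int(hashlib.sha256(name.encode()).hexdigest(), 16)

def closest_nodes(shard_key: int, nodes: dict[str, int], k: int = 2) -> list[str]:
    # XOR distance: a smaller value means "closer" in the Kademlia keyspace.
    return sorted(nodes, key=lambda n: nodes[n] ^ shard_key)[:k]

nodes = {name: node_id(name) for name in ["node-a", "node-b", "node-c", "node-d"]}
shard_key = node_id("my-file-shard-0")
print(closest_nodes(shard_key, nodes))   # the nodes that would hold this shard
```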
The other important aspect is proving
that the farmers, which are the entities that host
or store this data, actually have the file.
This is the proof of retrievability. They need to prove that they actually have the shards.
And the way it is done is by using something called a Merkle proof.
A Merkle proof is nothing but a small piece of information that is sent from the user in a heartbeat message to these farmers, and they respond with some information.
That response is then checked, and the user can confirm whether the farmer's claim
that it has the file is actually true or not. And for the farmers, once they have claimed that they
have the file and they actually do have the file, they earn Storj coins. And the Storj coin is nothing but the cryptocurrency of this particular network,
and it is traded between the different users. So this is a very brief introduction to Storj.
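To see why such a proof stays small no matter how large the file is, here is a minimal Python sketch of verifying a Merkle audit path against a known root; it illustrates the general technique, not Storj's or Sia's exact wire format.

```python
# Sketch of a Merkle audit proof: given a leaf and a short list of sibling
# hashes, anyone holding only the root can check the leaf belongs to the tree.
# This illustrates the idea behind retrievability proofs, not an exact format.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(leaf: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
    node = h(leaf)
    for sibling, side in proof:          # side: is the sibling left or right?
        node = h(sibling + node) if side == "left" else h(node + sibling)
    return node == root

# Build a tiny 4-leaf tree by hand to demonstrate.
leaves = [b"shard-0", b"shard-1", b"shard-2", b"shard-3"]
l0, l1, l2, l3 = (h(x) for x in leaves)
n01, n23 = h(l0 + l1), h(l2 + l3)
root = h(n01 + n23)

# Proof that "shard-2" is in the tree: its sibling l3, then the left subtree n01.
proof = [(l3, "right"), (n01, "left")]
assert verify(b"shard-2", proof, root)
```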
The next important project was Siacoin. Sia is similar to Storj in the sense that here, too,
files are actually divided prior to uploading them. In fact, Sia divides
every file into 30 segments, and it claims that even if 20 segments are not available, you can
recreate the file using 10 segments. So it has redundancy, erasure coding, built in. Each segment
gets encrypted with the Twofish algorithm, which is an open-source, secure,
high-performance encryption algorithm.
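As a toy illustration of that 10-of-30 claim, the following Python snippet just models the recovery threshold and checks that losing 20 random segments still leaves enough to rebuild; the actual Reed-Solomon erasure coding math is omitted.

```python
# Toy availability check for Sia's claimed 10-of-30 erasure coding: the file
# is recoverable as long as any 10 of the 30 segments survive. The actual
# Reed-Solomon encoding is omitted; this only models the threshold.
import random

TOTAL_SEGMENTS = 30     # segments produced per file
THRESHOLD      = 10     # any 10 segments suffice to rebuild the file

def recoverable(segments_online: set[int]) -> bool:
    return len(segments_online) >= THRESHOLD

all_segments = set(range(TOTAL_SEGMENTS))
offline = set(random.sample(sorted(all_segments), 20))   # 20 hosts disappear
print(recoverable(all_segments - offline))               # True: 10 remain
```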
Files are actually sent to hosts using smart contracts,
and renters and hosts pay each other with Siacoin.
Contracts renew automatically over time.
And even in this case,
we actually have hosts that periodically give proofs
that they actually store the content that the user has uploaded.
In this case, the unique part is that the flow of micropayments between the renters and the hosts occurs using a technology called payment channels. This is very similar to Bitcoin's Lightning Network, which means that payments between the renters and the hosts occur off the chain, which greatly
increases the network efficiency and scalability. Also, in this case, hosts pay collateral
into every storage contract. So they have a very strong disincentive to go offline.
Eventually, they also prove that they have the file using
Merkle trees. Merkle trees, like I explained, make it possible to prove that a small segment of data
is part of a larger file. The advantage of these proofs is that they are very small, no matter how
large the file is. And this is important because these proofs get eventually stored on the blockchain.
Since we've discussed both storage and SIA,
they are actually very similar in their nature.
The next one is IPFS,
which actually aims to be a distributed system
for storing and accessing any kind of data,
which is files, websites, applications.
And it tries to compete with HTTP.
So instead of referring to data, which is all of these articles, videos, and photos, by location,
IPFS has something called a content-addressable system,
which means that it refers to the data by the content of the data.
The idea is if you want to access a particular page from your browser,
IPFS will ask the entire network,
does anyone have the data that corresponds to this hash? And any node on the IPFS that contains
the corresponding hash will return the data. And the best part is it can actually allow you to
access it from anywhere and potentially even offline. It uses content addressing the same
way that HTTP uses URLs. This means that instead of
creating identifiers that address the artifacts by location, we can address them by some representation
of the content itself. And this separates the what from the where so that the data and the
files can be located and served anywhere by anyone. HTTP has a helpful property in that the location is in the identifier.
This makes it easy to find the computers that host the file and talk to them. And this works
generally well, but not in the offline case or in large distributed scenarios where you want to
minimize the load across the network. It also means that if a particular server is down, the content
is actually unavailable. In IPFS, you separate this step into
two parts. The first part is identify the file with the content addressing via its hash. And the
second is ask who has it. And you can connect to the corresponding nodes and actually download it.
The result is a peer-to-peer network that enables very fast routing, which is not tied to any
physical location, but is widely and immediately available.
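A minimal Python sketch of content addressing, assuming hypothetical peer names, might look like this; it mimics the idea behind IPFS content identifiers rather than the real multihash/CID encoding.

```python
# Minimal content-addressing sketch: the identifier is a hash of the bytes,
# so it does not matter which peer serves them. This mimics the idea behind
# IPFS CIDs, not the real multihash/CID encoding.
import hashlib

def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Several peers, each holding some blobs keyed by their content ID.
peers = {
    "peer-1": {},
    "peer-2": {},
}
page = b"<html>hello decentralized web</html>"
cid = content_id(page)
peers["peer-2"][cid] = page                     # any peer may pin the content

def fetch(cid: str) -> bytes | None:
    for name, store in peers.items():           # "ask the network: who has it?"
        if cid in store:
            data = store[cid]
            assert content_id(data) == cid      # self-verifying by construction
            return data
    return None

assert fetch(cid) == page
```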
IPFS is essentially a peer-to-peer system that manages IPFS objects. An IPFS object actually has two parts. One of them is data, which is a blob of binary data
that is less than 256 kilobytes, and the other is an array of links. Each link actually has three things inside
it: the name of
the link, the hash of the linked IPFS object, and the size of the linked IPFS object. Using this
IPFS object structure, it actually becomes an interesting exercise to model the blockchain on IPFS.
And this is done by storing the hashes on the chain. A block generally contains three things.
It has a list of all the transaction objects.
These are the transactions that occur in that block.
It contains a pointer to the previous block.
And the third is a hash of a state tree or a database.
This is present in more advanced blockchains
like the Ethereum blockchain,
which have an associated state database,
which has a Merkle-Patricia tree structure. And this can be emulated using IPFS
objects. Because IPFS is content addressable, if there are two such state databases that have
exactly the same content, you will not have two copies of it stored, but rather you would actually
have a single copy of it stored and multiple pointers pointing to that.
This is the essence of deduplication in file systems as well.
So between any two blocks, if 90 percent of the state has not changed,
then only the delta that has changed is new content present on IPFS.
And that reduces the entire footprint of data on the chain itself. And this is the gain that you get by putting the state management on IPFS.
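Here is a toy Python model of that deduplication effect: a content-addressed store keeps each identical chunk exactly once, so two state snapshots that differ in only one entry share almost all of their blocks. This is purely illustrative and much simpler than a real Merkle-Patricia state trie.

```python
# Toy content-addressed store: identical chunks are stored exactly once, so
# two state snapshots that differ by one entry only add the changed chunk.
import hashlib

store: dict[str, bytes] = {}                     # hash -> chunk, the "IPFS" store

def put(chunk: bytes) -> str:
    cid = hashlib.sha256(chunk).hexdigest()
    store[cid] = chunk                           # storing an identical chunk again is a no-op
    return cid

state_v1 = [f"account-{i}: balance {i}".encode() for i in range(10)]
v1_links = [put(c) for c in state_v1]

state_v2 = list(state_v1)
state_v2[3] = b"account-3: balance 999"          # only one account changed
v2_links = [put(c) for c in state_v2]

print(len(store))                                # 11 chunks stored, not 20
print(len(set(v1_links) & set(v2_links)))        # 9 chunks shared between snapshots
```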
Finally, I would like to discuss a unique project, which is ILCoin.
So far, we've discussed cases where we only try to store the metadata on the chain and the data is stored off-chain.
ILCoin actually turns this around.
Its goal is to create a viable on-chain data storage solution.
And the way it does this is by increasing the block size,
which was one of the variables, from 1 MB to as much as 5 GB.
As you can imagine, this will greatly increase the scalability of our system.
And the way it does this is by using something called the RIFT protocol.
So the files, when they are uploaded, they are encrypted.
And then finally, parts of the files get uploaded into what ILCoin calls mini-blocks.
So the RIFT protocol actually features two layers of blocks. There are standard
blocks that are present in any blockchain, and then there are mini-blocks. This additional layer
of indirection is what increases the scalability. Unlike standard blocks, mini-blocks are not mined,
but they're generated by the system, giving the network unlimited potential for scalability.
These mini-blocks are interconnected by references and are connected to their parent blocks as well, ensuring that the data stays intact.
In addition to its unique structure, RIFT also introduces simultaneous asynchronization,
a new mechanism that carries out a parallel sync of individual blocks and prevents network
congestion. So in essence, RIFT contains these blockchain-level blocks.
The blocks contain mini-blocks,
and the mini-blocks contain the transactions.
Therefore, the blockchain only contains references
to these mini-blocks.
And you can potentially have 5 GB worth of transactions
actually stored in a block on the blockchain.
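The details of RIFT are not spelled out here, so the following is only a hypothetical Python model of the two-layer structure described: the on-chain block stores references (hashes) to mini-blocks, and the mini-blocks carry the transaction data. It is not ILCoin's actual implementation.

```python
# Hypothetical model of the two-layer structure described for RIFT: the mined
# block stores only references (hashes) to mini-blocks, and the mini-blocks
# carry the transactions/data. Not ILCoin's actual implementation.
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Mini-blocks are generated by the system (not mined) and hold the bulk data.
mini_blocks = {digest(chunk): chunk
               for chunk in (b"tx-batch-1", b"tx-batch-2", b"tx-batch-3")}

# The standard block that goes on-chain only keeps the mini-block references,
# so the chain itself stays small while pointing at a large payload.
block = {
    "prev_hash": "00" * 32,
    "mini_block_refs": sorted(mini_blocks.keys()),
}

def resolve(block: dict) -> list[bytes]:
    return [mini_blocks[ref] for ref in block["mini_block_refs"]]

assert all(digest(data) in block["mini_block_refs"] for data in resolve(block))
```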
This is how ILCoin aims to solve the scalability problem
and store data on-chain.
They haven't yet seen massive adoption,
but it is a unique technology
to learn about in decentralized storage.
Finally, coming to some of the results and takeaways.
Using blockchain,
cloud storage can become truly decentralized in nature.
And blockchain alternatives can reduce the price of storing the data on the cloud. In fact, if you compare the
pricing of Amazon and Microsoft and Google to SiaCoin and Storj, the claim is that they would
reduce the prices by 10x in terms of storage. Also, cloud has interesting costs of egress.
So you actually pay to put the data,
you pay to store the data,
and you pay a lot more to actually get the data out from cloud.
All of these costs get eliminated
when you use something like a blockchain alternative.
It opens up a marketplace for providers of hard drive space and consumers.
And whenever there is a marketplace,
there will be a drive for adoption,
and therefore it will result in a reduction in prices.
It is a very nascent technology. It still needs more users to make a significant dent in the market.
The other thing is that the performance numbers, the throughput and the latency, are not very stable.
They're very variable, just because there are fewer participants; they can be at different locations around the world, and your throughput will greatly depend on where you're trying to get the file
or the data from. Newer alternatives such as ILCoin promise on-chain solutions that can truly
offer the benefits of secure decentralized storage. There's another project that I've been
looking at, which is the YotaCoin. And what they claim is that generally in file systems, you would encrypt the data after you deduplicated the data.
Because if you try to deduplicate data after you've done encryption, there is very little chance that the data will get deduplicated.
But they've invented a unique technology where you still get dedupe benefits after encryption.
So they can employ techniques which can reduce the footprint of the data on the blockchain itself.
It's still at a very nascent stage. But because the footprint of the cloud is growing and a lot of enterprises are evaluating hybrid infrastructure to store their data, I believe that decentralized solutions that live on the blockchain provide a great alternative for enterprises.
That's my talk. Thank you so much.
Please feel free to reach out to me on LinkedIn, Twitter,
or email address to discuss more.
Also, we at Netflix are hiring.
So we're looking for folks that are interested in problems of distributed systems, storage, file systems.
And we would love to talk to you if you're based in the United States of America.
Thank you.
Thanks for listening. Be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.