The Good Tech Companies - Scaling Ethereum: Data Bloat, Data Availability, and the Cloudless Solution
Episode Date: June 12, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/scaling-ethereum-data-bloat-data-availability-and-the-cloudless-solution. Determining how to persist Ethereum's excess data will allow it to scale indefinitely into the future, and Codex has arrived to help. Check more stories related to data-science at: https://hackernoon.com/c/data-science. You can also check exclusive content about #data-storage, #decentralized-storage, #peer-to-peer, #web3-storage, #ethereum, #ethereum-scaling, #good-company, #data-bloat, and more. This story was written by: @logos. Learn more about this writer by checking @logos's about page, and for more stories, please visit hackernoon.com. Codex is a cloudless, trustless, p2p storage protocol seeking to offer strong data persistence and durability guarantees for the Ethereum ecosystem and beyond. Due to the rapid development and implementation of new protocols, the Ethereum blockchain has become bloated with data. This data bloat can also be described as "network congestion," where transaction data clogs the network and undermines scalability. Codex offers a solution to the DA problem, but with data persistence.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Scaling Ethereum, Data Bloat, Data Availability, and the Cloudless Solution.
By logos. Codex is a cloudless, trustless, P2P storage protocol seeking to offer strong
data persistence and durability guarantees for the Ethereum ecosystem and beyond.
Currently, EIP-4844 only offers a partial solution to the problem of data bloat.
Fees remain high, and the ecosystem has few long-term data storage options.
Determining how to persist Ethereum's excess data will allow it to scale indefinitely into
the future, and Codex has arrived on the scene to help alleviate those concerns.
Let's explore the problem. Have you ever swapped ETH for another
token on Uniswap? I connected via MetaMask and tried to trade 0.01 ETH, roughly $35, for SNT.
The gas fee cost as much as the trade itself. That is too high a fee to trade cryptocurrency.
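To make the fee math concrete, here is a minimal sketch of the arithmetic; the gas usage, gas price, and ETH price below are illustrative assumptions, not the exact figures from my swap.

```python
# Back-of-the-envelope gas fee math for a token swap.
# All numbers are illustrative assumptions: a Uniswap-style swap often
# consumes on the order of 150,000 gas, and gas and ETH prices vary constantly.
GAS_USED = 150_000          # gas consumed by the swap (assumed)
GAS_PRICE_GWEI = 60         # network gas price in gwei (assumed)
ETH_PRICE_USD = 3_500       # ETH/USD price (assumed)

fee_eth = GAS_USED * GAS_PRICE_GWEI * 1e-9   # 1 gwei = 1e-9 ETH
fee_usd = fee_eth * ETH_PRICE_USD

print(f"fee: {fee_eth:.4f} ETH (~${fee_usd:.2f})")
# With these assumptions the fee alone is ~0.009 ETH (~$31.50),
# roughly the size of the swap itself.
```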
Most people do not want to pay this much. Let's get to the heart of why these transactions are so expensive. Web3 and decentralized finance have grown massively in recent years. Due to the rapid
development and implementation of new protocols, the Ethereum blockchain has become bloated
with data. The result? Prohibitively expensive gas fees and poor user experience. This data bloat
can also be defined as network congestion,
where transaction data clogs the network and undermines scalability.
This article examines why the blockchain has become bloated, why transaction throughput has
suffered, and various approaches to solving the problem. I will specifically focus on data
availability in the context of Ethereum and rollups. I will explore how Codex offers a solution
to the DA problem, but with data persistence and durability guarantees that most other solutions
lack. Bear with me: I will use jargon and technical language, but I will do my best to explore this
vital, underappreciated topic in clear language. More people in the ecosystem must begin grappling
with how robust data availability sampling (DAS) is for scaling blockchains.
Before continuing, the reader should have read about consensus mechanisms, proof of
stake, and how the technology functions from a high level.
Let us start by unpacking the blockchain trilemma.
Problematic trilemma. All decentralized technologies that want to grow suffer from similar constraints.
They want to scale to allow more and more users to adopt the tech, from thousands to millions of
users. However, scaling different technologies comes with different engineering challenges.
In the case of Ethereum, the blocks on the chain contain transactional, state, and smart contract
data. The more people use the network, the more data is added to each block. The problem
is that when the blocks start filling up, a fee market emerges, where those who pay higher gas
fees are more likely to get their transaction included in the next block. A simple solution
would be to expand the block size and allow more transactional data. However, there is a problem
with this approach, which is part of the blockchain trilemma.
The trilemma states that blockchains have three primary features they want to maintain and enhance: scalability, decentralization, and security. It suggests that improving any two comes at the expense of the third. In the case of Ethereum, upgrading the block capacity also
increases hardware requirements for running a fully validating node on the network.
When the network raises hardware requirements in such a way, it becomes more difficult for ordinary people to run a full node, which negatively impacts the network by decreasing
overall decentralization and censorship resistance. On the surface, the problem seems insurmountable.
Luckily, developers and engineers are rethinking how blockchains can scale.
They are envisioning blockchains and their ecosystems as modular rather than monolithic.
Modular versus monolithic. It is vital to restate that running a full node on the network is imperative to its success. But what exactly is a full node, or fully validating node?
A full node is a network participant that downloads all blockchain data and executes all
transactions created on the network. Full nodes require more computing power and disk space
because they download the complete transactional dataset. An article by Yuan Han Li titled
"WTF is Data Availability?" explains: since full nodes check every transaction to verify they
follow the rules of the blockchain, blockchains cannot process more transactions per second
without increasing the hardware requirements of running a full node (better hardware = more
powerful full nodes = full nodes can check more transactions = bigger blocks containing more
transactions are allowed). The problem with maintaining decentralization is that you want ordinary network participants to run full nodes. However, these nodes require tremendous computing power that
is too expensive for most users to purchase and maintain. And if that occurs, it dramatically
limits the number of nodes on the network, harming overall decentralization. The main problem is that
the miners and validators could withhold data from the network, preventing others from accessing all the data. This is the crux of the problem in the context
of monolithic blockchains. Although this is a bit of an overused buzzword in the ecosystem,
the idea of monolithic in blockchain means that the base layer, or the Ethereum blockchain,
has to act as the settlement layer, the consensus layer, and the data availability layer,
which bloats the system with data, slowing down transactional throughput and raising fees.
The solution to this problem of having a monolithic blockchain is to modularize
its functionality and offload the data availability function to other network participants.
In this scenario, the base layer of the blockchain would then just function as the
settlement and consensus layer. All data availability requirements would be offloaded to other actors in
the network. Now that we understand the wisdom of modularization, what exactly is data
availability, and why is it crucial to the network? The DA problem and rollups. Data availability is
what a blockchain requires to function as an immutable arbiter of truth. Without the availability of transactional data, no one would know if the
blockchain contains fraudulent or invalid transactions. In other words, no one could
prove whether the validators and miners behaved maliciously or not. An
article by Emmanuel Awosika described it: data availability
is the guarantee that the data behind a newly proposed block, which is necessary to verify
the block's correctness, is available to other participants on the blockchain network.
An important aside. Note that there is a difference between data availability and
data storage. Many people in the space confuse the two. Data availability asks whether the data is
available and anyone can access it, and data storage means holding data in a location over
the long term. In this sense, data storage implies the idea of data persistence. Nick White, Celestia's
COO, provided a powerful analogy. If you have canned food, it represents data storage. The food is in
the can and stored for the long term and can be accessed and taken out of storage at any time.
In this sense, there is an element of data persistence with regard to data storage.
Conversely, data availability is more like a buffet. The food is prepared and spread out on
a buffet table. It is available for everyone to sample.
Data availability is similar. Data is made available to the network primarily so network participants can verify the data is accurate and does not contain malicious transactions.
And this raises the question: what is the data availability problem?
The data availability problem is the central problem technologists are trying to solve to
scale Ethereum.
The problem is that when a full node broadcasts transactional data around the ecosystem,
smaller nodes called light nodes do not typically have the hardware requirements to download and execute all of the transactions. A Ledger.com article explained how light nodes work:
light nodes do not download or validate transactions and only
contain the block header. In other words, light nodes assume that transactions in a block are
valid without the verification that full nodes provide, which makes light nodes less secure.
This issue is referred to as the data availability problem. In this case, those nodes just need to
know if the data is available and if it represents the current state of the blockchain. The state is simply all the blockchain data stored on the chain:
address balances and smart contract values. On the Ethereum blockchain, in its current form,
light clients have to rely on so-called data availability committees (DACs) to provide on-chain
attestations that the data is indeed available. In the context of an Ethereum scaling solution called a rollup, this data has to be made
available so that network participants can determine if that data conforms to network rules.
In other words, they need to ensure the data is accurate and that validators do not try to
dupe the light clients. Optimistic and ZK Rollups
To understand the DA problem further, it is crucial to comprehend rollups.
Rollups are Layer 2 blockchains that have nodes called sequencers.
These sequencers assist in batching, compressing, and ordering transactions.
Benjamin Simon described the relationship between rollups and Ethereum.
A rollup is essentially a separate blockchain, but with a couple of modifications.
Like Ethereum, a rollup protocol has a virtual machine that executes smart contract code.
The rollup's virtual machine operates independently from Ethereum's own virtual machine,
the EVM, but it is managed by an Ethereum smart contract.
This connection allows rollups and Ethereum to communicate.
A rollup executes
transactions and processes data, and Ethereum receives and stores the results. Put simply,
rollups are off-chain scaling solutions. However, rollups do not sacrifice security like many
off-chain scaling solutions normally would. In the case of rollups, only data processing
and computation occur off-chain, via sequencers.
The transactions are ultimately stored on the layer 1 blockchain, preserving security.
This on-chain data was previously posted as calldata. In a way, rollups are the community's way
of having their cake and eating it too. They get to maintain network security while scaling
usability. It is an ingenious solution.
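As a rough mental model of that flow, here is a minimal sketch of a sequencer batching, compressing, and posting transactions as calldata; the function names, transaction fields, and compression choice are assumptions for illustration, not any production rollup's code.

```python
# A toy sketch of the rollup flow described above: the sequencer executes
# transactions off-chain, compresses the batch, and posts it to layer 1 as
# calldata. Names and structures here are illustrative assumptions only.
import json
import zlib

def batch_and_compress(transactions: list[dict]) -> bytes:
    """Order the pending transactions, then compress the batch."""
    ordered = sorted(transactions, key=lambda tx: tx["nonce"])
    raw = json.dumps(ordered).encode("utf-8")
    return zlib.compress(raw)

def post_to_l1(calldata: bytes) -> dict:
    """Stand-in for submitting the compressed batch to the L1 chain.
    A real rollup would send this inside an Ethereum transaction."""
    return {"calldata_size_bytes": len(calldata)}

txs = [
    {"nonce": 2, "from": "0xabc", "to": "0xdef", "value": 10},
    {"nonce": 1, "from": "0xabc", "to": "0x123", "value": 5},
]
receipt = post_to_l1(batch_and_compress(txs))
print(receipt)  # the L1 only stores the compressed data; execution happened off-chain
```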
There are two popular types of rollups: optimistic rollups and ZK rollups.
Optimistic rollups are the more widely discussed and deployed type of rollup.
As their name suggests, optimistic rollups assume that there is at least 1 of N good actors
in the ecosystem. What does that mean? Optimistic rollups assume all transactions posted
to the network are valid. To compensate for this optimism, rollups provide a 7-day window for the
network to submit a fraud proof, showing the transactions submitted by the rollup are invalid.
One key thing to know about optimistic rollups is that they are mostly EVM compatible,
so developers can efficiently work with them. In this way, they can be seen
as Ethereum's more popular scaling solution. Two examples of optimistic rollups are Optimism and
Arbitrum. ZK rollups use zero-knowledge cryptography to prove that the transactions
they compress and batch are correct and accurate. Instead of assuming that all the transactions are
accurate, like optimistic rollups, ZK rollups generate a validity proof to demonstrate the transactions are valid immediately,
eliminating any waiting period. However, it is known that ZK rollups can be more difficult for
developers to work with, as not all of them are EVM compatible. ZK rollups are also computationally
intensive because generating the proofs consumes many resources.
Nonetheless, more and more EVM-compatible rollups are starting to hit the market.
Scroll's zkEVM rollup is just one example.
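To make the difference concrete, here is a minimal sketch of the two finality models; only the 7-day challenge window comes from the discussion above, while the function names and timing logic are illustrative assumptions.

```python
# Toy illustration of the two finality models described above.
# Optimistic rollups: a batch is accepted unless a fraud proof arrives
# within the challenge window. ZK rollups: a batch is accepted as soon as
# its validity proof checks out. Purely illustrative.
from datetime import datetime, timedelta

CHALLENGE_WINDOW = timedelta(days=7)  # the 7-day fraud-proof window

def optimistic_finalized(posted_at: datetime, now: datetime,
                         fraud_proof_submitted: bool) -> bool:
    if fraud_proof_submitted:
        return False                      # batch is rolled back
    return now - posted_at >= CHALLENGE_WINDOW

def zk_finalized(validity_proof_ok: bool) -> bool:
    return validity_proof_ok              # no waiting period

posted = datetime(2024, 6, 1)
print(optimistic_finalized(posted, datetime(2024, 6, 5), False))  # False: still in window
print(optimistic_finalized(posted, datetime(2024, 6, 9), False))  # True: window elapsed
print(zk_finalized(True))                                         # True immediately
```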
Solution: Data availability sampling in Codex. I mentioned earlier that rollups need somewhere
to dump their data. Most rollups have been pushing data to the Ethereum main chain,
which leads to the crux of the problem: data bloat. When bloat occurs,
transactional throughput suffers and fees for transactions and smart contract execution increase. Recall that part of the solution is not to rely on fully validating nodes for network
security. If we just rely on these nodes, most users would be unable to run
full nodes due to prohibitively expensive hardware requirements. Note that raising the block size is
a potential solution, albeit dubious, as this path negatively impacts decentralization. Nonetheless,
that particular argument has become moot, because rollups act as layer 2 scaling solutions
that maintain the security of the main chain.
That said, what is the answer to not having everyone run full nodes?
The solution is to empower light nodes, as well as full nodes, to verify data without
downloading and executing all transactions. This is the heart of the problem and where
the magic of scaling the Ethereum network, among other blockchains, can be found. Data availability, erasure coding, and Codex. The first step is to have a data
availability layer with a robust network of light clients to determine if the data is available.
But how can light clients, who typically only check header data and rely on full nodes for
their information, ensure their data is valid and complete? The answer can be found
within a mathematical trick called data availability sampling (DAS). DAS is a method of sampling small pieces
of a larger chunk of data and using them to probabilistically determine that the rest of the data
exists, and to reconstruct it if needed. Many organizations, including the Celestia blockchain and DA layer,
are leveraging DAS through erasure encoding and
polynomial commitments. Reed-Solomon codes are the popular choice among many projects.
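Here is a small probability sketch of why a handful of random samples gives such strong assurance; the sample counts and the roughly 25 percent withholding threshold of a 2D erasure-coded block are assumptions used for illustration.

```python
# Why sampling works: if a block producer withholds a fraction `f` of the
# erasure-coded chunks, the chance that `s` independent random samples all
# land on available chunks is (1 - f) ** s. Numbers are illustrative.
def miss_probability(withheld_fraction: float, samples: int) -> float:
    return (1 - withheld_fraction) ** samples

# Assumption: with a 2D erasure-coded block, withholding roughly a quarter of
# the extended data is already enough to make the block unrecoverable, so that
# is about the fraction an attacker must hide -- and the fraction samples hit.
for s in (5, 15, 30):
    print(f"{s} samples -> chance of being fooled: {miss_probability(0.25, s):.4%}")
# 5 samples  -> ~23.7%
# 15 samples -> ~1.3%
# 30 samples -> ~0.02%
```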
These Reed-Solomon polynomials look like this: y = a_0 + a_1 x + a_2 x^2 + ... + a_k x^k.
These functions are used to determine missing data and fully restore it. This works
by encoding k pieces of original data into n total pieces, where the extra n - k pieces are parity data. If some of the
original data goes missing, the node leverages a mathematical technique called Lagrange
interpolation to restore it. The mathematics involved seem arcane to most people,
but the idea is straightforward.
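Here is a worked miniature of that recovery step, using exact fractions and tiny numbers; real systems work over finite fields with far larger parameters, so treat this as a sketch of the idea rather than a production encoder.

```python
# Miniature Reed-Solomon-style recovery using Lagrange interpolation.
# Real systems operate over finite fields; plain fractions keep this toy exact.
from fractions import Fraction

def lagrange_eval(points: list[tuple[int, Fraction]], x: int) -> Fraction:
    """Evaluate at `x` the unique polynomial passing through `points`."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# Original data: k = 3 values, viewed as evaluations of a degree-2 polynomial.
data = [Fraction(v) for v in (7, 2, 5)]            # y-values at x = 0, 1, 2
points = list(zip(range(3), data))

# Extend with parity by evaluating the same polynomial at extra x's (n > k).
parity = [(x, lagrange_eval(points, x)) for x in (3, 4)]

# Lose one original value; any 3 of the 5 shares still determine the polynomial.
surviving = [points[0], points[2], parity[0]]      # x = 0, 2, 3
recovered = lagrange_eval(surviving, 1)            # rebuild the missing y at x = 1
print(recovered)                                   # prints 2, the lost original value
```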
There are a few clear examples of erasure coding in action.
The method has been used to repair scratched CDs: erasure encoding in CDs can reconstruct the bits of music lost to surface damage. Satellites also leverage erasure codes when data
goes missing in the vastness of space. The satellite or the CD player can reconstruct
missing data, adding redundant protection to both systems. The specific scheme that Codex,
as well as Celestia, uses is called the 2D erasure coding scheme. It should be noted that
2D erasure coding, although popular in the crypto ecosystem, is not a new technology.
However, how it is used to solve the DA problem is quite
interesting. Dr. Bautista explained how the Codex
team uses erasure coding: similarly to Codex, erasure coding the original data into a
more redundant and robust data structure is fundamental for the rest of the protocol to work;
without it, there is no magic.
In Codex, this happens inside the Codex client of the node that wants to upload the data,
while in Ethereum this happens inside the Ethereum validator of the consensus beacon client of the node that is building/proposing the block.
There is more to the story regarding the journey of the data in Codex,
but it is beyond the scope of the article. Read Dr. Bautista's piece to
understand data dispersal, sampling, and the lazy repair mechanisms that Codex leverages.
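For intuition about the 2D layout itself, here is a minimal sketch that arranges data chunks into a grid and extends each row and column with parity; XOR stands in for the Reed-Solomon encoding a real scheme uses, so this shows the shape of the construction, not Codex's implementation.

```python
# A simplified picture of the 2D layout: arrange the original data into a
# k x k grid, then extend every row and every column with parity. XOR is a
# stand-in for real Reed-Solomon encoding, so this is only a shape sketch.
K = 2
grid = [[11, 22],
        [33, 44]]                     # k x k original data chunks

def xor_parity(chunks):
    p = 0
    for c in chunks:
        p ^= c
    return p

# Extend each row with a parity chunk, then add a parity row over each column.
extended = [row + [xor_parity(row)] for row in grid]
extended.append([xor_parity([extended[r][c] for r in range(K)])
                 for c in range(K + 1)])

for row in extended:
    print(row)
# Any single missing chunk can now be rebuilt from its row or its column;
# with real Reed-Solomon codes, entire missing rows or columns can be repaired.
```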
Codex intends to have simultaneous data storage and retrieval functionality and data availability
sampling through proof compression. This would allow it to handle ephemeral data, or data that
is not needed over the long term, while also providing the data persistence and durability guarantees that other projects may be missing.
Conclusion: Cracking the problem
The debate on how to scale blockchains is ending. In the Bitcoin ecosystem, arguments have been
raging on how to scale a blockchain, from increasing the block size limit to leveraging
Layer 2 solutions. The reality is that a mixture of the two is the most
reasonable solution. For instance, Codex can act as the cloudless data availability layer for
Ethereum, as well as for other blockchains, allowing the block size to grow because the
network would contain many nodes to conduct DA checks. The good news is
that this will increase the network's throughput while maintaining the security of the base layer. And what results from that? Yep, you got it. Cheaper fees and
faster transactions. As users of blockchains, that is really what we care most about.
And one day, perhaps soon, I can do my token swap for pennies on the dollar instead of for $35.
Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
