The Good Tech Companies - Scaling Ethereum: Data Bloat, Data Availability, and the Cloudless Solution
Episode Date: June 12, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/scaling-ethereum-data-bloat-data-availability-and-the-cloudless-solution. Determining how to persist Ethereum's excess data will allow it to scale indefinitely into the future, and Codex has arrived to help. Check more stories related to data-science at: https://hackernoon.com/c/data-science. You can also check exclusive content about #data-storage, #decentralized-storage, #peer-to-peer, #web3-storage, #ethereum, #ethereum-scaling, #good-company, #data-bloat, and more. This story was written by: @logos. Learn more about this writer by checking @logos's about page, and for more stories, please visit hackernoon.com. Codex is a cloudless, trustless, p2p storage protocol seeking to offer strong data persistence and durability guarantees for the Ethereum ecosystem and beyond. Due to the rapid development and implementation of new protocols, the Ethereum blockchain has become bloated with data. This data bloat can also be described as "network congestion," where transaction data clogs the network and undermines scalability. Codex offers a solution to the DA problem, but with data persistence.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Scaling Ethereum, Data Bloat, Data Availability, and the Cloudless Solution.
By logos. Codex is a cloudless, trustless, P2P storage protocol seeking to offer strong
data persistence and durability guarantees for the Ethereum ecosystem and beyond.
Currently, EIP-4844 only offers a partial solution to the problem of data bloat.
Fees remain high, and the ecosystem has few long-term data storage options.
Determining how to persist Ethereum's excess data will allow it to scale indefinitely into
the future, and Codex has arrived on the scene to help alleviate those concerns.
Let's explore the problem. Have you ever swapped ETH for another
token on Uniswap? I connected via MetaMask and tried to trade 0.01 ETH, roughly $35, for SNT.
The gas fee cost as much as the trade itself. That is too high a fee to trade cryptocurrency.
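To make the fee math concrete, here is a minimal sketch of the arithmetic; the gas usage, gas price, and ETH price below are illustrative assumptions, not the exact figures from my swap.

```python
# Back-of-the-envelope gas fee math for a token swap.
# All numbers are illustrative assumptions: a Uniswap-style swap often
# consumes on the order of 150,000 gas, and gas and ETH prices vary constantly.
GAS_USED = 150_000          # gas consumed by the swap (assumed)
GAS_PRICE_GWEI = 60         # network gas price in gwei (assumed)
ETH_PRICE_USD = 3_500       # ETH/USD price (assumed)

fee_eth = GAS_USED * GAS_PRICE_GWEI * 1e-9   # 1 gwei = 1e-9 ETH
fee_usd = fee_eth * ETH_PRICE_USD

print(f"fee: {fee_eth:.4f} ETH (~${fee_usd:.2f})")
# With these assumptions the fee alone is ~0.009 ETH (~$31.50),
# roughly the size of the swap itself.
```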
Most people do not want to pay this much. Let's get to the heart of why these transactions are so expensive. Web3 and decentralized finance have grown massively in recent years. Due to the rapid
development and implementation of new protocols, the Ethereum blockchain has become bloated
with data. The result? Prohibitively expensive gas fees and poor user experience. This data bloat
can also be defined as network congestion,
where transaction data clogs the network and undermines scalability.
This article examines why the blockchain has become bloated, why transaction throughput has
suffered, and various approaches to solving the problem. I will specifically focus on data
availability in the context of Ethereum and rollups. I will explore how Codex offers a solution
to the DA problem, but with data persistence and durability guarantees that most other solutions
lack. Bear with me: I will use jargon and technical language, but I will do my best to explore this
vital, underappreciated topic in clear language. More people in the ecosystem must begin grappling
with how robust data availability sampling (DAS) is for scaling blockchains.
Before continuing, the reader should have read about consensus mechanisms, proof of
stake, and how the technology functions from a high level.
Let us start by unpacking the blockchain trilemma.
Problematic trilemma. All decentralized technologies that want to grow suffer from similar constraints.
They want to scale to allow more and more users to adopt the tech, from thousands to millions of
users. However, scaling different technologies comes with different engineering challenges.
In the case of Ethereum, the blocks on the chain contain transactional, state, and smart contract
data. The more people use the network, the more data is added to each block. The problem
is that when the blocks start filling up, a fee market emerges, where those who pay higher gas
fees are more likely to get their transaction included in the next block. A simple solution
would be to expand the block size and allow more transactional data. However, there is a problem
with this approach, which is part of the blockchain trilemma.
The trilemma states that blockchains have three primary features they want to maintain and enhance: scalability, decentralization, and security. It suggests that improving any two comes at the expense of the third. In the case of Ethereum, upgrading the block capacity also
increases hardware requirements for running a fully validating node on the network.
When the network raises hardware requirements in such a way, it becomes more difficult for ordinary people to run a full node, which negatively impacts the network by decreasing
overall decentralization and censorship resistance. On the surface, the problem seems insurmountable.
Luckily, developers and engineers are rethinking how blockchains can scale.
They are envisioning blockchains and their ecosystems as modular rather than monolithic.
Modular versus monolithic. It is vital to restate that running a full node on the network is imperative to its success. But what exactly is a full node, or fully validating node?
A full node is a network participant that downloads all blockchain data and executes all
transactions created on the network. Full nodes require more computing power and disk space
because they download the complete transactional dataset. An article by Yuan Han Li titled
"WTF is Data Availability?" explains: since full nodes check every transaction to verify they
follow the rules of the blockchain, blockchains cannot process more transactions per second
without increasing the hardware requirements of running a full node (better hardware = more
powerful full nodes = full nodes can check more transactions = bigger blocks containing more
transactions are allowed). The problem with maintaining decentralization is that you want ordinary network participants to run full nodes. However, these nodes require tremendous computing power that
is too expensive for most users to purchase and maintain. And if that occurs, it dramatically
limits the number of nodes on the network, harming overall decentralization. The main problem is that
the miners and validators could withhold data from the network, preventing others from accessing all the data. This is the crux of the problem in the context
of monolithic blockchains. Although this is a bit of an overused buzzword in the ecosystem,
the idea of monolithic in blockchain means that the base layer, or the Ethereum blockchain,
has to act as the settlement layer, the consensus layer, and the data availability layer,
which bloats the system with data, slowing down transactional throughput and raising fees.
The solution to this problem of having a monolithic blockchain is to modularize
its functionality and offload the data availability function to other network participants.
In this scenario, the base layer of the blockchain would then just function as the
settlement and consensus layer. All data availability requirements would be offloaded to other actors in
the network. Now that we understand the wisdom of modularization, what exactly is data
availability, and why is it crucial to the network? The DA problem and rollups. Data availability is
what a blockchain requires to function as an immutable arbiter of truth. Without the availability of transactional data, no one would know if the
blockchain contains fraudulent or invalid transactions. In other words, no one could
prove whether the validators and miners behaved maliciously or not. An
article by Emmanuel Awosika described it: data availability
is the guarantee that the data behind a newly proposed block, which is necessary to verify
the block's correctness, is available to other participants on the blockchain network.
An important aside. Note that there is a difference between data availability and
data storage. Many people in the space confuse the two. Data availability asks whether the data is
available and anyone can access it, and data storage means holding data in a location over
the long term. In this sense, data storage implies the idea of data persistence. Nick White, Celestia's
COO, provided a powerful analogy. If you have canned food, it represents data storage. The food is in
the can and stored for the long term and can be accessed and taken out of storage at any time.
In this sense, there is an element of data persistence with regard to data storage.
Conversely, data availability is more like a buffet. The food is prepared and spread out on
a buffet table. It is available for everyone to sample.
Data availability is similar. Data is made available to the network primarily so network participants can verify the data is accurate and does not contain malicious transactions.
And this raises the question: what is the data availability problem?
The data availability problem is the central problem technologists are trying to solve to
scale Ethereum.
The problem is that when a full node broadcasts transactional data around the ecosystem,
smaller nodes called light nodes do not typically have the hardware requirements to download and execute all of the transactions. A Ledger.com article explained how light nodes work:
light nodes do not download or validate transactions and only
contain the block header. In other words, light nodes assume that transactions in a block are
valid without the verification that full nodes provide, which makes light nodes less secure.
This issue is referred to as the data availability problem. In this case, those nodes just need to
know if the data is available and if it represents the current state of the blockchain. The state is simply all the blockchain data stored on the chain:
address balances and smart contract values. On the Ethereum blockchain, in its current form,
light clients have to rely on so-called data availability committees (DACs) to provide on-chain
attestations that the data is indeed available. In the context of an Ethereum scaling solution called a rollup, this data has to be made
available so that network participants can determine if that data conforms to network rules.
In other words, they need to ensure the data is accurate and that validators do not try to
dupe the light clients. Optimistic and ZK Rollups
To understand the DA problem further, it is crucial to comprehend rollups.
Rollups are Layer 2 blockchains that have nodes called sequencers.
These sequencers assist in batching, compressing, and ordering transactions.
Benjamin Simon described the relationship between rollups and Ethereum.
A rollup is essentially a separate blockchain, but with a couple of modifications.
Like Ethereum, a rollup protocol has a virtual machine that executes smart contract code.
The rollup's virtual machine operates independently from Ethereum's own virtual machine,
the EVM, but it is managed by an Ethereum smart contract.
This connection allows rollups and Ethereum to communicate.
A rollup executes
transactions and processes data, and Ethereum receives and stores the results. Put simply,
rollups are off-chain scaling solutions. However, rollups do not sacrifice security like many
off-chain scaling solutions normally would. In the case of rollups, only data processing
and computation occur off-chain, via sequencers.
The transactions are ultimately stored on the layer 1 blockchain, preserving security.
This on-chain data was previously posted as calldata. In a way, rollups are the community's way
of having their cake and eating it too. They get to maintain network security while scaling
usability. It is an ingenious solution.
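As a rough mental model of that flow, here is a minimal sketch of a sequencer batching, compressing, and posting transactions as calldata; the function names, transaction fields, and compression choice are assumptions for illustration, not any production rollup's code.

```python
# A toy sketch of the rollup flow described above: the sequencer executes
# transactions off-chain, compresses the batch, and posts it to layer 1 as
# calldata. Names and structures here are illustrative assumptions only.
import json
import zlib

def batch_and_compress(transactions: list[dict]) -> bytes:
    """Order the pending transactions, then compress the batch."""
    ordered = sorted(transactions, key=lambda tx: tx["nonce"])
    raw = json.dumps(ordered).encode("utf-8")
    return zlib.compress(raw)

def post_to_l1(calldata: bytes) -> dict:
    """Stand-in for submitting the compressed batch to the L1 chain.
    A real rollup would send this inside an Ethereum transaction."""
    return {"calldata_size_bytes": len(calldata)}

txs = [
    {"nonce": 2, "from": "0xabc", "to": "0xdef", "value": 10},
    {"nonce": 1, "from": "0xabc", "to": "0x123", "value": 5},
]
receipt = post_to_l1(batch_and_compress(txs))
print(receipt)  # the L1 only stores the compressed data; execution happened off-chain
```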
There are two popular types of rollups: optimistic rollups and ZK rollups.
Optimistic rollups are the more widely discussed and deployed type of rollup.
As their name suggests, optimistic rollups assume that there is at least 1 of N good actors
in the ecosystem. What does that mean? Optimistic rollups assume all transactions posted
to the network are valid. To compensate for this optimism, rollups provide a 7-day window for the
network to submit a fraud proof, showing the transactions submitted by the rollup are invalid.
One key thing to know about optimistic rollups is that they are mostly EVM compatible,
so developers can efficiently work with them. In this way, they can be seen
as Ethereum's more popular scaling solution. Two examples of optimistic rollups are Optimism and
Arbitrum. ZK rollups use zero-knowledge cryptography to prove that the transactions
they compress and batch are correct and accurate. Instead of assuming that all the transactions are
accurate, like optimistic rollups, ZK rollups generate a validity proof to demonstrate the transactions are valid immediately,
eliminating any waiting period. However, it is known that ZK rollups can be more difficult for
developers to work with, as not all of them are EVM compatible. ZK rollups are also computationally
intensive because generating the proofs consumes many resources.
Nonetheless, more and more EVM-compatible rollups are starting to hit the market.
Scroll's zkEVM rollup is just one example.
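To make the difference concrete, here is a minimal sketch of the two finality models; only the 7-day challenge window comes from the discussion above, while the function names and timing logic are illustrative assumptions.

```python
# Toy illustration of the two finality models described above.
# Optimistic rollups: a batch is accepted unless a fraud proof arrives
# within the challenge window. ZK rollups: a batch is accepted as soon as
# its validity proof checks out. Purely illustrative.
from datetime import datetime, timedelta

CHALLENGE_WINDOW = timedelta(days=7)  # the 7-day fraud-proof window

def optimistic_finalized(posted_at: datetime, now: datetime,
                         fraud_proof_submitted: bool) -> bool:
    if fraud_proof_submitted:
        return False                      # batch is rolled back
    return now - posted_at >= CHALLENGE_WINDOW

def zk_finalized(validity_proof_ok: bool) -> bool:
    return validity_proof_ok              # no waiting period

posted = datetime(2024, 6, 1)
print(optimistic_finalized(posted, datetime(2024, 6, 5), False))  # False: still in window
print(optimistic_finalized(posted, datetime(2024, 6, 9), False))  # True: window elapsed
print(zk_finalized(True))                                         # True immediately
```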
Solution: Data availability sampling in Codex. I mentioned earlier that rollups need somewhere
to dump their data. Most rollups have been pushing data to the Ethereum main chain,
which leads to the crux of the problem: data bloat. When bloat occurs,
transactional throughput suffers and fees for transactions and smart contract execution increase. Recall that part of the solution is not to rely on fully validating nodes for network
security. If we just rely on these nodes, most users would be unable to run
full nodes due to prohibitively expensive hardware requirements. Note that raising the block size is
a potential solution, albeit dubious, as this path negatively impacts decentralization. Nonetheless,
that particular argument has become moot, because rollups act as layer 2 scaling solutions
that maintain the security of the main chain.
That said, what is the answer to not having everyone run full nodes?
The solution is to empower light nodes, as well as full nodes, to verify data without
downloading and executing all transactions. This is the heart of the problem and where
the magic of scaling the Ethereum network, among other blockchains, can be found. Data availability, erasure coding, and Codex. The first step is to have a data
availability layer with a robust network of light clients to determine if the data is available.
But how can light clients, who typically only check header data and rely on full nodes for
their information, ensure their data is valid and complete? The answer can be found
within a mathematical trick called data availability sampling (DAS). DAS is a method of sampling small pieces
of a larger chunk of data and using them to probabilistically determine that the rest of the data
exists, and to reconstruct it if needed. Many organizations, including the Celestia blockchain and DA layer,
are leveraging DAS through erasure encoding and
polynomial commitments. Reed-Solomon codes are the popular choice among many projects.
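Here is a small probability sketch of why a handful of random samples gives such strong assurance; the sample counts and the roughly 25 percent withholding threshold of a 2D erasure-coded block are assumptions used for illustration.

```python
# Why sampling works: if a block producer withholds a fraction `f` of the
# erasure-coded chunks, the chance that `s` independent random samples all
# land on available chunks is (1 - f) ** s. Numbers are illustrative.
def miss_probability(withheld_fraction: float, samples: int) -> float:
    return (1 - withheld_fraction) ** samples

# Assumption: with a 2D erasure-coded block, withholding roughly a quarter of
# the extended data is already enough to make the block unrecoverable, so that
# is about the fraction an attacker must hide -- and the fraction samples hit.
for s in (5, 15, 30):
    print(f"{s} samples -> chance of being fooled: {miss_probability(0.25, s):.4%}")
# 5 samples  -> ~23.7%
# 15 samples -> ~1.3%
# 30 samples -> ~0.02%
```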
These Reed-Solomon polynomials look like this: y = a_0 + a_1 x + a_2 x^2 + ... + a_k x^k.
These functions are used to determine missing data and fully restore it. This works
by encoding k pieces of original data into n total pieces, where the extra n - k pieces are parity data. If some of the
original data goes missing, the node leverages a mathematical technique called Lagrange
interpolation to restore it. The mathematics involved seem arcane to most people,
but the idea is straightforward.
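Here is a worked miniature of that recovery step, using exact fractions and tiny numbers; real systems work over finite fields with far larger parameters, so treat this as a sketch of the idea rather than a production encoder.

```python
# Miniature Reed-Solomon-style recovery using Lagrange interpolation.
# Real systems operate over finite fields; plain fractions keep this toy exact.
from fractions import Fraction

def lagrange_eval(points: list[tuple[int, Fraction]], x: int) -> Fraction:
    """Evaluate at `x` the unique polynomial passing through `points`."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# Original data: k = 3 values, viewed as evaluations of a degree-2 polynomial.
data = [Fraction(v) for v in (7, 2, 5)]            # y-values at x = 0, 1, 2
points = list(zip(range(3), data))

# Extend with parity by evaluating the same polynomial at extra x's (n > k).
parity = [(x, lagrange_eval(points, x)) for x in (3, 4)]

# Lose one original value; any 3 of the 5 shares still determine the polynomial.
surviving = [points[0], points[2], parity[0]]      # x = 0, 2, 3
recovered = lagrange_eval(surviving, 1)            # rebuild the missing y at x = 1
print(recovered)                                   # prints 2, the lost original value
```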
There are a few clear examples of erasure coding in action.
The method has been used to repair scratched CDs: erasure encoding in CDs can reconstruct the bits of music lost to surface damage. Satellites also leverage erasure codes when data
goes missing in the vastness of space. The satellite or the CD player can reconstruct
missing data, adding redundant protection to both systems. The specific scheme that Codex,
as well as Celestia, uses is called the 2D erasure coding scheme. It should be noted that
2D erasure coding, although popular in the crypto ecosystem, is not a new technology.
However, how it is used to solve the DA problem is quite
interesting. Dr. Bautista explained how the Codex
team uses erasure coding: similarly to Codex, erasure coding the original data into a
more redundant and robust data structure is fundamental for the rest of the protocol to work;
without it, there is no magic.
In Codex, this happens inside the Codex client of the node that wants to upload the data,
while in Ethereum this happens inside the Ethereum validator of the consensus beacon client of the node that is building/proposing the block.
There is more to the story regarding the journey of the data in Codex,
but it is beyond the scope of the article. Read Dr. Bautista's piece to
understand data dispersal, sampling, and the lazy repair mechanisms that Codex leverages.
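For intuition about the 2D layout itself, here is a minimal sketch that arranges data chunks into a grid and extends each row and column with parity; XOR stands in for the Reed-Solomon encoding a real scheme uses, so this shows the shape of the construction, not Codex's implementation.

```python
# A simplified picture of the 2D layout: arrange the original data into a
# k x k grid, then extend every row and every column with parity. XOR is a
# stand-in for real Reed-Solomon encoding, so this is only a shape sketch.
K = 2
grid = [[11, 22],
        [33, 44]]                     # k x k original data chunks

def xor_parity(chunks):
    p = 0
    for c in chunks:
        p ^= c
    return p

# Extend each row with a parity chunk, then add a parity row over each column.
extended = [row + [xor_parity(row)] for row in grid]
extended.append([xor_parity([extended[r][c] for r in range(K)])
                 for c in range(K + 1)])

for row in extended:
    print(row)
# Any single missing chunk can now be rebuilt from its row or its column;
# with real Reed-Solomon codes, entire missing rows or columns can be repaired.
```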
Codex intends to have simultaneous data storage and retrieval functionality and data availability
sampling through proof compression. This would allow it to handle ephemeral data, or data that
is not needed over the long term, while also providing the data persistence and durability guarantees that other projects may be missing.
Conclusion: Cracking the problem
The debate on how to scale blockchains is ending. In the Bitcoin ecosystem, arguments have been
raging on how to scale a blockchain, from increasing the block size limit to leveraging
Layer 2 solutions. The reality is that a mixture of the two is the most
reasonable solution. For instance, Codex can act as the cloudless data availability layer for
Ethereum, as well as for other blockchains, allowing the block size to grow because the
network would contain many nodes to conduct DA checks. The good news is
that this will increase the network's throughput while maintaining the security of the base layer. And what results from that? Yep, you got it. Cheaper fees and
faster transactions. As users of blockchains, that is really what we care most about.
And one day, perhaps soon, I can do my token swap for pennies on the dollar instead of for $35.
Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
