The Good Tech Companies - Unpacking The Starknet Bolt Upgrade
Episode Date: December 26, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/unpacking-the-starknet-bolt-upgrade. Learn how the Starknet Bolt upgrade improves performance on one of Ethereum's most popular Layer 2 (L2) chains in this deep dive by 2077 Research. Check more stories related to web3 at: https://hackernoon.com/c/web3. You can also check exclusive content about #ethereum-layer-2-scaling, #starknet, #ethereum-rollups, #blockchain-scalability, #blockchain, #ethereum-network, #2077-research, #good-company, and more. This story was written by: @2077research. Learn more about this writer by checking @2077research's about page, and for more stories, please visit hackernoon.com. The Starknet Bolt upgrade introduced two major features: parallel execution and block packing. This article discusses how these newly introduced features improve Starknet's performance and the transaction experience for Layer 2 (L2) users.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Unpacking the Starknet Bolt upgrade, by 2077 Research.
Starknet's latest upgrade, v0.13.2, named Bolt brings two major changes,
parallel execution and block packing. Although independent of one another,
both features support the push towards the goal of fast,
cheap block space cryptographically secured by Ethereum. Parallel execution allows non-contentious transactions, i.e., transactions
that don't touch the same state, to execute concurrently. By implementing parallel execution,
L2s like StarkNet can reduce execution time without increasing resource usage.
This means lower transaction fees for users and significantly
improved transaction confirmation times. Block packing optimizes StarkNet's usage of blob space on the Ethereum L1. With block packing, sequencers can generate one proof to verify multiple StarkNet L2 blocks simultaneously. This decouples usage of blob space from the
frequency of L2 block production and amortizes
proof verification costs. Both reduce operating costs for the StarkNet sequencer, which means
users pay less per transaction. Like we said, Bolt makes StarkNet cheaper and faster. This report
will provide a detailed analysis of the Bolt upgrade focusing on parallel execution and
block packing, and explore the implications for StarkNet's performance. A quick refresher on rollups. Rollups are Layer 2 (L2) scaling solutions that aim to scale the underlying Layer 1 (L1) blockchain by moving computation off-chain.
By moving execution off-chain, rollups can optimize for scalability, cheap and fast
transactions, while the L1 provides security for L2 transactions. Rollups are often said to inherit security from the L1.
What this essentially means is that they inherit the consensus and data availability guarantees
provided by the L1. In addition to this, the L1 also provides a security guarantee in the form
of secure bridging between it and the rollup.
When sequencers publish L2 blocks to the L1, the L1 provides guarantees of the availability as well as the ordering of this information. From here, L2 nodes can trustlessly compute the canonical L2
chain with this information along with the rollup's rules around chain derivation and
state transition described by the node implementation. In order to facilitate
secure bridging between the L1 and the L2, the L1 requires proof that the L2 chain it is currently
following is correct and does not include illegal state changes, e.g., double spend. This need for
rollups to prove the validity of state changes ensures the L1 does not authorize withdrawals
from the rollup based on illegal state. Rollups differ based on how they prove validity of state changes to the L1.
Validity rollups rely on validity proofs to objectively verify correctness of execution. Proposers are allowed to propose a new rollup state if they submit a validity proof to the rollup contract. Optimistic rollups rely on the absence of a fraud proof to subjectively verify correctness of execution. Proposers submit state updates without proofs, but the rollup contract can roll back any state update whose validity is successfully challenged by a fraud proof. Rollups also provide the base layer with enough data for interested parties to reconstruct the
L2 state. While optimistic rollups must publish full transaction
data to enable challengers to compute fraud proofs, validity rollups have no such requirements.
Validity proofs guarantee correct execution, but posting full transaction data on the L1 is still useful from a trust-minimization perspective: it enables trustless reconstruction of state and permissionless withdrawals. StarkNet is a validity rollup that uses Scalable Transparent ARguments of Knowledge (STARKs) to prove the validity of
state changes. The latest upgrade to StarkNet codenamed Bolt adds parallel execution and block
packing. In subsequent sections, we'll explain how the two features work and what improvements
they bring for StarkNet users. What did the StarkNet Bolt upgrade change?
At a high level, the Bolt upgrade changed StarkNet's execution,
proving, and data availability mechanisms. Optimizing execution. Before the Bolt upgrade, StarkNet transactions were executed sequentially, one after the other, by the sequencer.
Sequential execution is simple but also very inefficient. It's inefficient because it does
not take advantage of the multiple independent processing units that modern-day computers offer
and the parallelizability of a set of transactions. Parallelizability is a measure of how independent
the transactions in a given set are. For example, consider the set of three transactions below.
Transaction 1. Alice wants to send Bob 1 STRK. Transaction 2. Caitlin wants to send Danny 100
ETH. Transaction 3. Caitlin wants to send Ella 100 ETH. Transaction 1 is completely independent
of transactions 2 and 3 because it is accessing a different part of the state, Alice's balance,
and can be executed concurrently. However, transactions 2 and 3 are conflicting
because they want to access the same state, Caitlin's ETH balance. These transactions
cannot be executed concurrently or we'll end up with conflicting results. To illustrate,
say Caitlin's balance is 300 ETH and transactions 2 and 3 start at the same time.
Both transactions will read the initial balance of Caitlin's account
as 300 ETH, deduct the transfer amount from it, 300 minus 100 equals 200, and write 200 ETH to
the location in memory where Caitlin's account balance is stored. Transactions 2 and 3 will
then read Danny and Ella's balances respectively and add 100 ETH to both users' balances.
Caitlin's balance would reduce by only 100 ETH, but 200 ETH in total would be credited, effectively printing 100 ETH out of thin air.
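To make the lost update concrete, below is a minimal Python sketch, illustrative only and not StarkNet code, that first applies both transfers against the same stale read of Caitlin's balance and then applies them sequentially:

    # Hypothetical illustration of the lost-update conflict described above.
    balances = {"caitlin": 300, "danny": 0, "ella": 0}

    def transfer_with_stale_read(balances, sender, receiver, amount, stale_sender_balance):
        # Writes a sender balance computed from a stale read, as happens when two
        # conflicting transactions execute concurrently without coordination.
        balances[sender] = stale_sender_balance - amount
        balances[receiver] += amount

    # Both transactions read Caitlin's balance (300 ETH) before either write lands.
    stale_read = balances["caitlin"]
    transfer_with_stale_read(balances, "caitlin", "danny", 100, stale_read)  # transaction 2
    transfer_with_stale_read(balances, "caitlin", "ella", 100, stale_read)   # transaction 3
    print(balances)  # {'caitlin': 200, 'danny': 100, 'ella': 100} -> 100 ETH printed out of thin air

    # Sequential (or properly serialized) execution gives the correct result.
    balances = {"caitlin": 300, "danny": 0, "ella": 0}
    for receiver in ("danny", "ella"):
        balances["caitlin"] -= 100
        balances[receiver] += 100
    print(balances)  # {'caitlin': 100, 'danny': 100, 'ella': 100}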
Avoiding these types of conflicts, and the complexity of the mechanisms needed to mitigate them, is why Ethereum chose sequential execution. However, while sequential execution reduces complexity and improves security, it results in inefficient use of hardware. Worse still, the trend of hardware design suggests sequential execution
will become increasingly inefficient in the coming years. Figure 4 shows the trend of hardware design
in the last 50 years. The relevant takeaway? Single-thread performance, purple circles,
has been plateauing since the mid-2000s, while the number of logical cores has kept increasing over the same period. We can draw two conclusions from this data.
Hardware designers are scaling computer chips by adding more independent processing units rather
than improving the performance of a single unit. Any system that continues to rely on increased
performance of a single processing unit will experience a stall in performance gains even on newer hardware. In recent years, sophisticated algorithms for managing
transaction conflicts and ensuring correctness of parallel execution have appeared. Block-STM, based on a paper by Gelashvili et al., is one such algorithm, and it forms the core of StarkNet's new parallel execution engine. We analyze the Block-STM algorithm in later sections.
Optimizing proving and data availability. StarkNet's SHARP, short for Shared Prover,
has always taken advantage of recursive proofs to reduce verification costs as much as possible.
A recursive proof is essentially a proof-of-proof,
where one proof verifies that one or more proofs are correct. Below is a sketch of how SHARP generates a recursive proof. The SHARP system takes a set of programs to be executed, a job, as input and generates a proof of execution for the job. These programs are L2 blocks, and the proof attests to the correctness of their transactions. The proof is sent to another program that verifies the proof, and the proof verification program is itself converted into a job. SHARP takes the new job as input and generates another proof. This proof confirms the validity of the previous proof. The process (proof → job → proof) restarts and continues until a target is reached, at which point the final proof, which is now a highly compressed version of the original proof, is posted to the L1.
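As a rough mental model of that proof-to-job-to-proof loop, here is a short sketch; the prove and verification_program_as_job functions are placeholders for this illustration, not SHARP's actual interface:

    # Hypothetical sketch of recursive proof folding.
    def prove(job):
        return {"proves": job}          # stand-in for STARK proof generation

    def verification_program_as_job(proof):
        return {"verify": proof}        # wrap "verify this proof" as the next job

    def recursive_fold(blocks, rounds):
        job = {"blocks": blocks}        # initial job: a batch of L2 blocks
        proof = prove(job)
        for _ in range(rounds):         # proof -> job -> proof, until the target is reached
            job = verification_program_as_job(proof)
            proof = prove(job)
        return proof                    # the compressed final proof that gets posted to the L1

    final_proof = recursive_fold(blocks=["block_1", "block_2", "block_3"], rounds=2)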
This design greatly amortizes costs for two major reasons: the jobs cover whole blocks rather than individual programs or transactions, and proving costs grow sub-linearly. This means the larger the number of programs (transactions) within a job, the greater the savings. By creating proofs per block instead of per transaction, transaction fees can be much
cheaper. Recursion greatly compresses the proofs resulting in a proof of proof that is significantly
cheaper to verify on the L1 than the initial proof or any of the intermediate proofs.
While the proving system was good, there were missed chances to further save
costs. For example, each job was a single StarkNet block, and each of these blocks was designed to
take up one blob on the L1. This resulted in certain inefficiencies as described below.
Prior to Bolt, StarkNet did not, and still does not, have regular block times. Instead, an L2 block was published to the L1 under one of two conditions: one, a blob's worth of data had been executed, or two, six minutes had elapsed since the previous block. Unfortunately, due to low demand, most blocks were sent to the L1 because of the six-minute limit, not the DA limit. This meant that blobs were often severely underutilized, resulting in unnecessarily high gas costs. StarkNet blocks also had a fixed cost, which could theoretically be reduced by
applying the same recursion techniques discussed above to produce proofs of multiple blocks.
Block packing tackles these problems by using a binary tree of recursive proofs.
We discuss block packing in a later section of the article.
Bolt Part 1.
Parallel Execution. As discussed earlier, sequential execution is inefficient and will
only be more inefficient as time goes on, and naive parallel execution produces invalid results.
Production parallel execution engines take care to prevent inconsistent results, however.
There are two approaches to tackling parallel execution,
pessimistic concurrency control, PCC, and optimistic concurrency control, OCC.
PCC and OCC are two approaches to designing transaction processing units (TPUs). Below is a definition of a transaction processing unit from Block-STM vs. SVM: A Comparison of Parallel Execution Engines: the TPU is usually coupled with, but distinct from, the virtual machine (VM). Blockchain VMs like the EVM, SVM, and MoveVM are high-level language VMs. The TPU, which is usually the subject of interest, subsumes the VM. It is tasked with the management of the entire transaction execution pipeline, including creating and managing instances of the VM.
Pessimistic concurrency control is designed based on the assumption that many of the transactions
within the set of transactions to be executed will conflict, i.e., they will touch the same state.
PCC prevents these conflicts. To prevent conflicts, PCC requires that
a transaction declare up front what portions of state it will access during read-write operations.
The transaction processing unit can use this information to schedule transactions in a way
that ensures that conflicting transactions are executed sequentially instead of concurrently. Some TPUs also use locks to enforce this behavior. A lock, a.k.a. a mutex, is a mechanism used to prevent concurrent access to a memory location.
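Here is a minimal sketch of the PCC idea, assuming each transaction declares its read and write sets up front; the greedy batching strategy is our illustration, not any production TPU's scheduler:

    # Hypothetical PCC-style scheduler: transactions declare the state keys they touch,
    # and conflicting transactions end up in different (sequentially executed) batches.
    def schedule(transactions):
        batches = []
        for tx in transactions:
            placed = False
            for batch in batches:
                batch_reads = set().union(*(t["reads"] for t in batch))
                batch_writes = set().union(*(t["writes"] for t in batch))
                conflict = (tx["writes"] & (batch_reads | batch_writes)) or (tx["reads"] & batch_writes)
                if not conflict:
                    batch.append(tx)   # no conflict: safe to run concurrently with this batch
                    placed = True
                    break
            if not placed:
                batches.append([tx])   # conflicts with every existing batch: start a new one
        return batches

    txs = [
        {"id": 1, "reads": {"alice"},   "writes": {"alice", "bob"}},
        {"id": 2, "reads": {"caitlin"}, "writes": {"caitlin", "danny"}},
        {"id": 3, "reads": {"caitlin"}, "writes": {"caitlin", "ella"}},
    ]
    print([[t["id"] for t in batch] for batch in schedule(txs)])  # [[1, 2], [3]]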
That said, PCC-based execution incurs certain trade-offs. First, the requirement to provide
access lists, which identify the memory slots a transaction touches, degrades the developer experience and
reduces the range of possible applications. Second, scheduling transactions can incur
unnecessary overhead especially when there are no conflicts. Optimistic concurrency control is
designed with the assumption that many of the transactions within the given set will not
conflict, i.e., they will not write to the same state. As such, OCC TPUs execute the set
of transactions with all the available resources and only try to detect conflicts. If a conflict
is detected, the conflicting transactions are re-executed and re-verified until the entire set
passes and can be committed. OCC TPUs don't incur overhead from scheduling, so they tend to perform
better when there are few conflicts. OCC-based transaction processing units also have better developer experience
and a wider range of use cases because state dependencies don't need to be known up front.
However, when the set of transactions is highly contentious, OCC performs worse than PCC.
We cover TPU designs in far more detail and compare the OCC and PCC approaches
in our article on parallel execution. StarkNet's new TPU uses the OCC approach.
More specifically, it is an implementation of the BlockSTM algorithm. BlockSTM executes
transactions optimistically with all the available resources assuming none of them will conflict and
verifies after execution that no conflicting transactions executed concurrently. Before we
dive into StarkNet's new architecture, it's important to go over some key definitions.
1. State. State is the condition of an object at an instant in time. In the blockchain context, it usually refers to the value of a portion of memory, e.g., the balance of an address is part of its state.
2. Serializability
Parallel execution is said to have preserved the property of serializability if executing
a set of transactions serially and concurrently produces the same results.
3. Conflict
Two transactions are said to conflict if and only if they both want to access the same portion of state and at least one of them wants to write to it. If both transactions are only reading from a portion of state, then there's no conflict, but if at least one of them is writing to that portion of state, then the transactions cannot be executed concurrently
without breaking serializability. We covered a sample case above in the Caitlin,
Danny and Ella example. 4. Dependency. A transaction tx2 is said to be dependent on, or a dependency of, a transaction tx1 if and only if both transactions write to the same memory location and tx2 comes after tx1 in the serial ordering. If tx1 came after tx2, tx1 would be dependent on tx2.
5. Multi-version data structure. A multi-version data structure is one where a new version of the data structure is created for each
write to the data structure. Instead of changing the value in place, a new write-specific version
of the location to be changed is created. The value prop of multi-version data structures is
that they allow highly concurrent reads from and writes to the same region in memory.
We'll explore how this is
useful later. With the definitions out of the way, we can move on to cover how Block-STM works. How Block-STM works. The input to Block-STM is a queue, an ordered list, of transactions. This
list is often called a block. This list can be ordered in any manner. The only requirement is
that there is a clearly defined order. So, given a set of transactions T containing tx1, tx2, ..., txn, the transactions are sorted such that tx1 > tx2 > ... > txn (read as: tx1 is of higher priority than tx2, tx2 is of higher priority than tx3, and so on). At the beginning of the execution process, two sets are created: an execution set E and a validation set V. E tracks transactions that are yet to be executed, while V tracks transactions that have been executed but are yet to be validated. Each transaction is also associated with an incarnation number n to track how many times it has been executed and re-executed. The initial state of the sets is that E contains all the transactions and V is empty, i.e., E = {tx1, ..., txn} and V = {}. And,
with these ordered sets of transactions, each thread used for execution cycles through a
three-stage loop: 1. Check done. 2. Find next task. 3. Perform task. Check done: during this step, the thread checks both V and E. If both are empty and no transaction is being executed, then the current batch of transactions has been executed completely and the results can be committed to storage. Find next task: if either V or E contains transactions, Block-STM selects the transaction with the lowest index (not incarnation number) from both sets of transactions, i.e., if E contains tx7 and V contains tx3, the validation task for tx3 would be selected as the next task. Perform task: once the next task has been identified and selected, it is performed.
At the end of this step, the algorithm loops back to check done.
This process continues until both sets of transactions are empty.
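A simplified, single-threaded sketch of that three-stage loop is shown below; execute and validate are hypothetical helpers, and the revalidation cascade described later is omitted:

    # Hypothetical sketch of the check-done / find-next-task / perform-task loop.
    # Lower index = higher priority; execute() and validate() are placeholders.
    def run(transactions, execute, validate):
        E = list(range(len(transactions)))   # indices awaiting execution
        V = []                               # indices executed but awaiting validation
        while E or V:                        # check done: stop when both sets are empty
            # Find next task: pick the lowest index across both sets,
            # preferring the validation task when its index is lower.
            if V and (not E or min(V) < min(E)):
                i = min(V)
                V.remove(i)
                if not validate(transactions[i]):
                    E.append(i)              # failed validation: reschedule for execution
            else:
                i = min(E)
                E.remove(i)
                execute(transactions[i])
                V.append(i)                  # executed: now awaiting validation
        # At this point every transaction has been executed and validated and can be committed.

    run(["tx1", "tx2"], execute=lambda tx: None, validate=lambda tx: True)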
Let's look at what happens during execution and validation.
During execution of a transaction, the blockSTM algorithm populates two sets,
per transaction, a read set and a write set. The read set contains all the memory locations that a transaction read from during its execution while the write set contains all the memory
locations it wrote to. During execution, the transactions apply their writes to the multi-version
data structure, but reading is a little nuanced. In Block-STM, when a transaction wants to read from the data structure, it checks for the value written by the lowest-priority transaction that still has higher priority than itself. For example, if tx2, tx3, and tx7 have all written to a memory location and tx5 wants to read from this location, it reads the version of the data structure corresponding to tx3. This is done to enforce serializability: since tx5 should be executed after tx3 and before tx7, it should use the values written by tx3, not tx7. If tx3 had not yet written when tx5 performed its read, tx5 would have to be re-executed, because it should have read the values written by tx3, not tx2 or any other higher-priority transaction. If a single-version data structure were used, the value written by tx3 would be unavailable and a conflict is sure to occur.
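The read rule is easier to see with a small sketch of a multi-version store keyed by transaction index; this is a simplification of Block-STM's actual data structure:

    # Hypothetical multi-version store: each location maps writer tx index -> written value.
    class MultiVersionStore:
        def __init__(self):
            self.versions = {}                       # location -> {tx_index: value}

        def write(self, location, tx_index, value):
            self.versions.setdefault(location, {})[tx_index] = value

        def read(self, location, reader_index):
            # Return the value written by the closest preceding writer, i.e. the
            # highest tx index that is still lower than the reader's index.
            writers = [i for i in self.versions.get(location, {}) if i < reader_index]
            if not writers:
                return None                          # in practice, fall back to pre-block state
            return self.versions[location][max(writers)]

    store = MultiVersionStore()
    store.write("balance_B", 2, 500)   # tx2 writes
    store.write("balance_B", 3, 450)   # tx3 writes
    store.write("balance_B", 7, 300)   # tx7 writes
    print(store.read("balance_B", 5))  # tx5 reads -> 450, the version written by tx3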
For a validation task, the transaction's read set is compared to the current values at the memory locations it read from during execution. For example, if tx5 read account B during execution, then during validation the memory location for account B is read again, keeping the definition of read we established earlier in mind. If the two values are the same, then it means that no higher-priority transaction, say tx3, has written to that location during the execution of tx5. This results in tx5 being marked as validated, but not safe to commit.
The transaction is not considered safe to commit because a higher-priority transaction could, for any number of reasons, be executed or re-executed after the transaction had been validated. In our running example, if tx3 touches account B and only touches it after tx5 passes validation, then tx5 needs to be re-executed. To provision for this, anytime a transaction completes execution, all lower-priority transactions that had passed validation are revalidated to ensure they do not conflict with that transaction. For example, if tx6 and tx8 have been validated and tx5 completes execution, tx6 and tx8 must be revalidated to ensure that they do not conflict with, and so are not dependencies of, tx5. If a transaction fails validation, i.e.,
a higher priority transaction that writes to the same state was executed concurrently with it,
then the writes that the transaction made are dirty, because the values are wrong.
But instead of deleting these values from the database completely, they are flagged as an ESTIMATE. The ESTIMATE flag tells any transaction reading that memory location that the values there are not correct, and those transactions pause their execution. This is done in place of deletion
because re-execution of the transaction that failed validation would likely result in writing
to the same memory locations as the previous execution. By marking the memory location as
an estimate instead of deleting it,
dependencies of the transaction that failed validation can be caught even before re-execution,
preventing unnecessary re-executions. This heuristic greatly reduces wasted work.
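A minimal sketch of validation and the ESTIMATE flag, with hypothetical helper names and a plain dictionary standing in for the multi-version store, might look like this:

    # Hypothetical sketch of validation and the ESTIMATE flag.
    ESTIMATE = object()   # sentinel marking a failed transaction's dirty writes

    def validate(read_set, read_value):
        # read_set: {location: value observed during execution};
        # read_value(location): what the transaction would read right now.
        return all(read_value(loc) == observed for loc, observed in read_set.items())

    def on_validation_failure(write_set, store):
        # Mark the failed transaction's writes as ESTIMATE instead of deleting them, so
        # transactions reading these locations pause as likely dependencies rather than
        # executing against values that are about to change.
        for location in write_set:
            store[location] = ESTIMATE

    store = {"balance_B": 450}
    read_set = {"balance_B": 500}                      # the transaction read a now-stale value
    if not validate(read_set, lambda loc: store[loc]):
        on_validation_failure({"balance_D"}, store)    # its own writes become ESTIMATEs
    print(store["balance_D"] is ESTIMATE)              # True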
Putting it all together, a complete overview of how Block-STM approaches parallelization can be summarized as follows. Transactions start out as an ordered list
with a clearly defined serial order. These transactions are executed on all the available
resources in priority order. During execution, the read and write sets of the transactions are
tracked and reads and writes are performed from to a multi-version data structure. When a transaction
finishes execution, it is validated to ensure that it did not execute
concurrently with any conflicting transactions. If a transaction passes validation, then it is
removed from E and V, and when a transaction completes execution, all transactions lower in priority
than it that had passed validation are rescheduled for validation. If a transaction fails validation,
then it is rescheduled for execution. When the entire set of transactions
has been executed and validated, then the entire block is safe to commit and the results can be
written to permanent storage. An example is shown below. That is an overview of how block STM works.
More details can be found in our report here and the original block STM paper here.
How does Block-STM improve StarkNet performance? To quantify the significance of
adding block STM, we performed a few benchmarks to evaluate the performance improvements it
provides over sequential execution and the results are shown below. The results show that as the
number of threads, analogous to independent processing units, used increases, performance
also increases. The improvements are more pronounced with larger blocks, giving as much as a 9x performance improvement over sequential execution with just 16 threads.
Our tests show that Block-STM performance does degrade significantly under highly contentious load, but the industry-standard practice is to fall back to sequential execution during such periods. We recommend the same mechanism for StarkNet to preserve throughput under highly contentious workloads.
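One possible form of such a fallback, our illustration rather than anything Bolt itself specifies, is to track the share of transactions that fail validation and switch modes past a threshold:

    # Hypothetical heuristic: fall back to sequential execution under heavy contention.
    def choose_execution_mode(aborts, executions, abort_rate_threshold=0.5):
        # aborts: transactions that failed validation and were re-executed in the last block.
        if executions == 0:
            return "parallel"
        abort_rate = aborts / executions
        return "sequential" if abort_rate > abort_rate_threshold else "parallel"

    print(choose_execution_mode(aborts=12, executions=100))  # parallel
    print(choose_execution_mode(aborts=80, executions=100))  # sequential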
But, by and large, the addition of Block-STM will significantly improve and future-proof StarkNet. The second major change bundled in the v0.13.2 upgrade is block packing, and we will cover that next. Bolt Part 2. Block packing. As discussed earlier, prior to Bolt, each StarkNet block was its own job, resulting in a fixed
per block cost for every block. In addition, the system was designed such that each block required
its own blob regardless of how much data was actually consumed by the block. In a world where
there was always high demand, this wouldn't be a problem, but StarkNet currently offers more block space than
there is demand for and so there are a lot of wasted resources that could result in hundreds
of ETH lost over the course of a month. To further put into context the severity of the situation
before Bolt, these are the costs associated with posting a block to the L1:
1. 23k gas per fact registration (part of verify proof and register).
2. 56k gas per SHARP memory page registration.
3. 136k gas per state update, made up of 50k gas for the KZG precompile to sample a blob and 86k gas for running the state update function.
This totals 215k gas per block, and this cost is flat, i.e., it's the same regardless of how much data each block contains, and relates to the number of blocks by: total cost = number of blocks × 215k gas.
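As a quick sanity check on those numbers, here is a small worked example; the six-minute cadence comes from the publishing rule described above, and the rest is simple arithmetic:

    # Worked example of the flat pre-Bolt L1 verification cost.
    GAS_PER_BLOCK = 23_000 + 56_000 + 136_000   # fact registration + memory page + state update = 215k
    blocks_per_day = 24 * 60 // 6               # assuming one block every six minutes
    gas_per_day = blocks_per_day * GAS_PER_BLOCK
    print(blocks_per_day, gas_per_day)          # 240 blocks -> 51,600,000 gas per day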
The ideal solution to this problem would be for costs to be related to the amount of data posted instead of the number of blocks. And that's exactly what block packing achieves via SNAR trees.
SNAR trees: StarkNet Applicative Recursive trees. SNAR trees are a new type of binary tree introduced in Bolt to address the problems highlighted above. A SNAR tree has the following structure: each leaf is a StarkNet block, and the nodes in all other levels are recursive proofs of their children. The root node, which is a recursive proof of all the other proofs, is the final job that is sent to the Shared Prover (SHARP). What does the SNAR tree do? The primary benefit of the SNAR tree is that rather than posting one block per proof, many StarkNet blocks can be amortized into the same L1 update. SNAR tree roots are now posted to the L1 only when one of two configurable limits is hit: either the DA limit, six blobs' worth of data, or after a certain number of leaves are added to the tree, where a leaf is a block.
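A rough sketch of the SNAR-tree idea follows; prove_block and merge_proofs are placeholders for the real recursive provers, and the per-block blob usage is an assumption for illustration:

    # Hypothetical sketch of SNAR-style block packing: leaves are block proofs, internal
    # nodes recursively prove their children, and the root is posted to the L1 only when
    # the leaf limit or the DA (blob) limit is reached.
    def prove_block(block):
        return {"leaf": block}

    def merge_proofs(left, right):
        return {"node": (left, right)}               # recursive proof of both children

    def build_root(leaves):
        layer = leaves
        while len(layer) > 1:                        # fold pairs until one root remains
            layer = [merge_proofs(layer[i], layer[i + 1]) if i + 1 < len(layer) else layer[i]
                     for i in range(0, len(layer), 2)]
        return layer[0]

    def pack_blocks(blocks, max_leaves=10, max_blobs=6, blobs_per_block=0.2):
        leaves, blobs_used, roots = [], 0.0, []
        for block in blocks:
            leaves.append(prove_block(block))
            blobs_used += blobs_per_block
            if len(leaves) >= max_leaves or blobs_used >= max_blobs:
                roots.append(build_root(leaves))     # this root is the job sent to SHARP
                leaves, blobs_used = [], 0.0
        if leaves:
            roots.append(build_root(leaves))
        return roots

    print(len(pack_blocks([f"block_{i}" for i in range(25)])))  # 3 L1 updates instead of 25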
This design decouples the cost of transactions from the number of blocks. Now, there is still some fixed cost for each block that arises from running the StarkNet OS (SNOS) in every block, but by and large the costs are decoupled. These are the numbers now:
1. Fact registration per job: 23k gas, same as before.
2. SHARP memory page per job: 36k gas, reduced from 56k thanks to a SHARP optimization.
3. State update per job: 86k gas, same as before, plus 50k gas per blob used. When one blob was used per block, this was 50k × 1, resulting in the 136k gas number quoted above.
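To see what this does to the numbers, here is a small before-and-after comparison for a batch of blocks packed into one job; the blob count per job is an assumption for illustration, and the remaining per-block SNOS overhead is ignored:

    # Worked comparison of L1 verification gas, before and after Bolt.
    def gas_before_bolt(num_blocks):
        return num_blocks * (23_000 + 56_000 + 136_000)   # flat 215k gas per block

    def gas_after_bolt(blobs_used):
        # One job for the whole pack, regardless of how many blocks it contains:
        # fact registration + memory page + state update, plus 50k gas per blob sampled.
        return 23_000 + 36_000 + 86_000 + 50_000 * blobs_used

    blocks = 100
    print(gas_before_bolt(blocks))        # 21,500,000 gas
    print(gas_after_bolt(blobs_used=6))   # 445,000 gas for the same 100 blocks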
The plot in Figure 11 below shows how gas costs vary with block number in the previous design and now, under Bolt. It's fairly obvious that block packing greatly reduces the cost of verification on the L1, which will undoubtedly result in lower gas prices for StarkNet users. Conclusion
The effect of the changes made in Bolt, optimistic parallel execution via Block-STM and block packing through the proprietary SNAR trees, can be summarized as: faster and cheaper.
Parallel execution reduces execution time and by extension congestion which will reduce
gas fees during periods of high traffic, while SNAR trees tackle the associated DA and proving
costs.
Interestingly, this upgrade makes StarkNet the first L2 with
parallel execution and sets it up to be a major contender in the L2 space. It's important to note
that it's unlikely that the effect of these changes will be immediately evident, especially
that of parallel execution, but they are crucial to future-proofing StarkNet and the entire Ethereum ecosystem as a whole. Author's note: A version of this article
was previously published here. Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.