The Good Tech Companies - Unpacking The Starknet Bolt Upgrade
Episode Date: December 26, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/unpacking-the-starknet-bolt-upgrade. Learn how the Starknet Bolt upgrade improves performance on one of Ethereum's most popular Layer 2 (L2) chains in this deep dive by 2077 Research. Check more stories related to web3 at: https://hackernoon.com/c/web3. You can also check exclusive content about #ethereum-layer-2-scaling, #starknet, #ethereum-rollups, #blockchain-scalability, #blockchain, #ethereum-network, #2077-research, #good-company, and more. This story was written by: @2077research. Learn more about this writer by checking @2077research's about page, and for more stories, please visit hackernoon.com. The Starknet Bolt upgrade introduced two major features: parallel execution and block packing. This article discusses how these newly introduced features improve Starknet's performance and the transaction experience for Layer 2 (L2) users.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Unpacking the Starknet Bolt upgrade, by 2077 Research.
Starknet's latest upgrade, v0.13.2, named Bolt brings two major changes,
parallel execution and block packing. Although independent of one another,
both features support the push towards the goal of fast,
cheap block space cryptographically secured by Ethereum. Parallel execution allows non-contentious transactions, i.e., transactions
that don't touch the same state, to execute concurrently. By implementing parallel execution,
L2s like StarkNet can reduce execution time without increasing resource usage.
This means lower transaction fees for users and significantly
improved transaction confirmation times. Block packing optimizes StarkNet's usage of blob space on the Ethereum L1. With block packing, sequencers can generate one proof to verify multiple StarkNet L2 blocks simultaneously. This decouples usage of blob space from the
frequency of L2 block production and amortizes
proof verification costs. Both reduce operating costs for the StarkNet sequencer, which means
users pay less per transaction. Like we said, Bolt makes StarkNet cheaper and faster. This report
will provide a detailed analysis of the Bolt upgrade focusing on parallel execution and
block packing, and explore the implications for StarkNet's performance. A quick refresher on rollups. Rollups are Layer 2 (L2) scaling solutions that aim to scale the underlying Layer 1 (L1) blockchain by moving computation off-chain.
By moving execution off-chain, rollups can optimize for scalability, cheap and fast
transactions, while the L1 provides security for L2 transactions. Rollups are often said to inherit security from the L1.
What this essentially means is that they inherit the consensus and data availability guarantees
provided by the L1. In addition to this, the L1 also provides a security guarantee in the form
of secure bridging between it and the rollup.
When sequencers publish L2 blocks to the L1, the L1 provides guarantees of the availability as well as the ordering of this information. From here, L2 nodes can trustlessly compute the canonical L2
chain with this information along with the rollup's rules around chain derivation and
state transition described by the node implementation. In order to facilitate
secure bridging between the L1 and the L2, the L1 requires proof that the L2 chain it is currently
following is correct and does not include illegal state changes, e.g., double spend. This need for
rollups to prove the validity of state changes ensures the L1 does not authorize withdrawals
from the rollup based on illegal state. Rollups differ based on how they prove validity of state changes to the L1.
Validity rollups rely on validity proofs to objectively verify correctness of execution. Proposers are allowed to propose a new rollup state if they submit a validity proof to the rollup contract. Optimistic rollups rely on the absence of a fraud proof to subjectively verify correctness of execution. Proposers submit state updates without proofs, but the rollup contract can roll back any state update whose validity is successfully challenged by a fraud proof. Rollups also provide the base layer with enough data for interested parties to reconstruct the
L2 state. While optimistic rollups must publish full transaction
data to enable challengers to compute fraud proofs, validity rollups have no such requirements.
Validity proofs guarantee correct execution, but posting full transaction data on the L1 is still useful from a trust-minimization perspective: it enables trustless reconstruction of state and permissionless withdrawals. StarkNet is a validity rollup that uses Scalable Transparent ARguments of Knowledge (STARKs) to prove the validity of
state changes. The latest upgrade to StarkNet codenamed Bolt adds parallel execution and block
packing. In subsequent sections, we'll explain how the two features work and what improvements
they bring for StarkNet users. What did the StarkNet Bolt upgrade change?
At a high level, the Bolt upgrade changed StarkNet's execution,
proving, and data availability mechanisms. Optimizing execution. Before the Bolt upgrade, StarkNet transactions were executed sequentially, one after the other, by the sequencer.
Sequential execution is simple but also very inefficient. It's inefficient because it does
not take advantage of the multiple independent processing units that modern-day computers offer
and the parallelizability of a set of transactions. Parallelizability is a measure of how independent
the transactions in a given set are. For example, consider the set of three transactions below.
Transaction 1. Alice wants to send Bob 1 STRK. Transaction 2. Caitlin wants to send Danny 100
ETH. Transaction 3. Caitlin wants to send Ella 100 ETH. Transaction 1 is completely independent
of transactions 2 and 3 because it is accessing a different part of the state, Alice's balance,
and can be executed concurrently. However, transactions 2 and 3 are conflicting
because they want to access the same state, Caitlin's ETH balance. These transactions
cannot be executed concurrently or we'll end up with conflicting results. To illustrate,
say Caitlin's balance is 300 ETH and transactions 2 and 3 start at the same time.
Both transactions will read the initial balance of Caitlin's account
as 300 ETH, deduct the transfer amount from it, 300 minus 100 equals 200, and write 200 ETH to
the location in memory where Caitlin's account balance is stored. Transactions 2 and 3 will
then read Danny and Ella's balances respectively and add 100 ETH to both users' balances.
Caitlin's balance would reduce by only 100 ETH, but 200 ETH in total would be credited, effectively printing 100 ETH out of thin air.
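To make the lost update concrete, below is a minimal Python sketch, illustrative only and not StarkNet code, that first applies both transfers against the same stale read of Caitlin's balance and then applies them sequentially:

    # Hypothetical illustration of the lost-update conflict described above.
    balances = {"caitlin": 300, "danny": 0, "ella": 0}

    def transfer_with_stale_read(balances, sender, receiver, amount, stale_sender_balance):
        # Writes a sender balance computed from a stale read, as happens when two
        # conflicting transactions execute concurrently without coordination.
        balances[sender] = stale_sender_balance - amount
        balances[receiver] += amount

    # Both transactions read Caitlin's balance (300 ETH) before either write lands.
    stale_read = balances["caitlin"]
    transfer_with_stale_read(balances, "caitlin", "danny", 100, stale_read)  # transaction 2
    transfer_with_stale_read(balances, "caitlin", "ella", 100, stale_read)   # transaction 3
    print(balances)  # {'caitlin': 200, 'danny': 100, 'ella': 100} -> 100 ETH printed out of thin air

    # Sequential (or properly serialized) execution gives the correct result.
    balances = {"caitlin": 300, "danny": 0, "ella": 0}
    for receiver in ("danny", "ella"):
        balances["caitlin"] -= 100
        balances[receiver] += 100
    print(balances)  # {'caitlin': 100, 'danny': 100, 'ella': 100}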
Avoiding these types of conflicts, and the complexity of the mechanisms needed to mitigate them, is why Ethereum chose sequential execution. However, while sequential execution reduces complexity and improves security, it results in inefficient use of hardware. Worse still, the trend of hardware design suggests sequential execution
will become increasingly inefficient in the coming years. Figure 4 shows the trend of hardware design
in the last 50 years. The relevant takeaway? Single-thread performance, purple circles,
has been plateauing since the mid-2000s, while the number of logical cores has kept increasing over the same period. We can draw two conclusions from this data.
Hardware designers are scaling computer chips by adding more independent processing units rather
than improving the performance of a single unit. Any system that continues to rely on increased
performance of a single processing unit will experience a stall in performance gains even on newer hardware. In recent years, sophisticated algorithms for managing
transaction conflicts and ensuring correctness of parallel execution have appeared. Block-STM, based on a paper by Gelashvili et al., is one such algorithm, and it forms the core of StarkNet's new parallel execution engine. We analyze the Block-STM algorithm in later sections.
Optimizing proving and data availability. StarkNet's SHARP, short for Shared Prover,
has always taken advantage of recursive proofs to reduce verification costs as much as possible.
A recursive proof is essentially a proof-of-proof,
where one proof verifies that one or more proofs are correct. Below is a sketch of how SHARP generates a recursive proof. The SHARP system takes a set of programs to be executed, a job, as input and generates a proof of execution for the job. These programs are L2 blocks, and the proof attests to the correctness of their transactions. The proof is sent to another program that verifies the proof, and the proof verification program is itself converted into a job. SHARP takes the new job as input and generates another proof. This proof confirms the validity of the previous proof. The process (proof → job → proof) restarts and continues until a target is reached, at which point the final proof, which is now a highly compressed version of the original proof, is posted to the L1.
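As a rough mental model of that proof-to-job-to-proof loop, here is a short sketch; the prove and verification_program_as_job functions are placeholders for this illustration, not SHARP's actual interface:

    # Hypothetical sketch of recursive proof folding.
    def prove(job):
        return {"proves": job}          # stand-in for STARK proof generation

    def verification_program_as_job(proof):
        return {"verify": proof}        # wrap "verify this proof" as the next job

    def recursive_fold(blocks, rounds):
        job = {"blocks": blocks}        # initial job: a batch of L2 blocks
        proof = prove(job)
        for _ in range(rounds):         # proof -> job -> proof, until the target is reached
            job = verification_program_as_job(proof)
            proof = prove(job)
        return proof                    # the compressed final proof that gets posted to the L1

    final_proof = recursive_fold(blocks=["block_1", "block_2", "block_3"], rounds=2)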
This design greatly amortizes costs for two major reasons: the jobs cover whole blocks rather than individual programs or transactions, and proving costs grow sub-linearly. This means the larger the number of programs (transactions) within a job, the greater the savings. By creating proofs per block instead of per transaction, transaction fees can be much
cheaper. Recursion greatly compresses the proofs resulting in a proof of proof that is significantly
cheaper to verify on the L1 than the initial proof or any of the intermediate proofs.
While the proving system was good, there were missed chances to further save
costs. For example, each job was a single StarkNet block, and each of these blocks was designed to
take up one blob on the L1. This resulted in certain inefficiencies as described below.
Prior to Bolt, StarkNet did not, and still does not, have regular block times. Instead, an L2 block was published to the L1 under one of two conditions: one, a blob's worth of data had been executed, or two, six minutes had elapsed since the previous block. Unfortunately, due to low demand, most blocks were sent to the L1 because of the six-minute limit, not the DA limit. This meant that blobs were often severely underutilized, resulting in unnecessarily high gas costs. StarkNet blocks also had a fixed cost, which could theoretically be reduced by
applying the same recursion techniques discussed above to produce proofs of multiple blocks.
Block packing tackles these problems by using a binary tree of recursive proofs.
We discuss block packing in a later section of the article.
Bolt Part 1.
Parallel Execution. As discussed earlier, sequential execution is inefficient and will
only be more inefficient as time goes on, and naive parallel execution produces invalid results.
Production parallel execution engines take care to prevent inconsistent results, however.
There are two approaches to tackling parallel execution,
pessimistic concurrency control, PCC, and optimistic concurrency control, OCC.
PCC and OCC are two approaches to designing transaction processing units (TPUs). Below is a definition of a transaction processing unit from Block-STM vs. SVM: A Comparison of Parallel Execution Engines: the TPU is usually coupled with, but distinct from, the virtual machine (VM). Blockchain VMs like the EVM, SVM, and MoveVM are high-level language VMs. The TPU, which is usually the subject of interest, subsumes the VM. It is tasked with the management of the entire transaction execution pipeline, including creating and managing instances of the VM.
Pessimistic concurrency control is designed based on the assumption that many of the transactions
within the set of transactions to be executed will conflict, i.e., they will touch the same state.
PCC prevents these conflicts. To prevent conflicts, PCC requires that
a transaction declare up front what portions of state it will access during read-write operations.
The transaction processing unit can use this information to schedule transactions in a way
that ensures that conflicting transactions are executed sequentially instead of concurrently. Some TPUs also use locks to enforce this behavior. A lock, a.k.a. a mutex, is a mechanism used to prevent concurrent access to a memory location.
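Here is a minimal sketch of the PCC idea, assuming each transaction declares its read and write sets up front; the greedy batching strategy is our illustration, not any production TPU's scheduler:

    # Hypothetical PCC-style scheduler: transactions declare the state keys they touch,
    # and conflicting transactions end up in different (sequentially executed) batches.
    def schedule(transactions):
        batches = []
        for tx in transactions:
            placed = False
            for batch in batches:
                batch_reads = set().union(*(t["reads"] for t in batch))
                batch_writes = set().union(*(t["writes"] for t in batch))
                conflict = (tx["writes"] & (batch_reads | batch_writes)) or (tx["reads"] & batch_writes)
                if not conflict:
                    batch.append(tx)   # no conflict: safe to run concurrently with this batch
                    placed = True
                    break
            if not placed:
                batches.append([tx])   # conflicts with every existing batch: start a new one
        return batches

    txs = [
        {"id": 1, "reads": {"alice"},   "writes": {"alice", "bob"}},
        {"id": 2, "reads": {"caitlin"}, "writes": {"caitlin", "danny"}},
        {"id": 3, "reads": {"caitlin"}, "writes": {"caitlin", "ella"}},
    ]
    print([[t["id"] for t in batch] for batch in schedule(txs)])  # [[1, 2], [3]]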
That said, PCC-based execution incurs certain trade-offs. First, the requirement to provide
access lists, which identify the memory slots a transaction touches, degrades the developer experience and
reduces the range of possible applications. Second, scheduling transactions can incur
unnecessary overhead especially when there are no conflicts. Optimistic concurrency control is
designed with the assumption that many of the transactions within the given set will not
conflict, i.e., they will not write to the same state. As such, OCC TPUs execute the set
of transactions with all the available resources and only try to detect conflicts. If a conflict
is detected, the conflicting transactions are re-executed and re-verified until the entire set
passes and can be committed. OCC TPUs don't incur overhead from scheduling, so they tend to perform
better when there are few conflicts. OCC-based transaction processing units also have better developer experience
and a wider range of use cases because state dependencies don't need to be known up front.
However, when the set of transactions is highly contentious, OCC performs worse than PCC.
We cover TPU designs in far more detail and compare the OCC and PCC approaches
in our article on parallel execution. StarkNet's new TPU uses the OCC approach.
More specifically, it is an implementation of the BlockSTM algorithm. BlockSTM executes
transactions optimistically with all the available resources assuming none of them will conflict and
verifies after execution that no conflicting transactions executed concurrently. Before we
dive into StarkNet's new architecture, it's important to go over some key definitions.
1. State. State is the condition of an object at an instant in time. In the blockchain context, it usually refers to the value of a portion of memory, e.g., the balance of an address is part of its state.
2. Serializability
Parallel execution is said to have preserved the property of serializability if executing
a set of transactions serially and concurrently produces the same results.
3. Conflict
Two transactions are said to conflict if and only if they both want to access the same portion of state and at least one of them wants to write to it. If both transactions are only reading from a portion of state, then there's no conflict, but if at least one of them is writing to that portion of state, then the transactions cannot be executed concurrently
without breaking serializability. We covered a sample case above in the Caitlin,
Danny and Ella example. 4. Dependency. A transaction tx2 is said to be dependent on, or a dependency of, a transaction tx1 if and only if both transactions write to the same memory location and tx2 comes after tx1 in the serial ordering. If tx1 came after tx2, tx1 would be dependent on tx2.
5. Multi-version data structure. A multi-version data structure is one where a new version of the data structure is created for each
write to the data structure. Instead of changing the value in place, a new write-specific version
of the location to be changed is created. The value prop of multi-version data structures is
that they allow highly concurrent reads from and writes to the same region in memory.
We'll explore how this is
useful later. With the definitions out of the way, we can move on to cover how Block-STM works. How Block-STM works. The input to Block-STM is a queue, an ordered list, of transactions. This
list is often called a block. This list can be ordered in any manner. The only requirement is
that there is a clearly defined order. So, given a set of transactions T containing tx1, tx2, ..., txn, the transactions are sorted such that tx1 > tx2 > ... > txn (read as: tx1 is of higher priority than tx2, tx2 is of higher priority than tx3, and so on). At the beginning of the execution process, two sets are created: an execution set E and a validation set V. E tracks transactions that are yet to be executed, while V tracks transactions that have been executed but are yet to be validated. Each transaction is also associated with an incarnation number n to track how many times it has been executed and re-executed. The initial state of the sets is that E contains all the transactions and V is empty, i.e., E = {tx1, ..., txn} and V = {}. And,
with these ordered sets of transactions, each thread used for execution cycles through a
three-stage loop: 1. Check done. 2. Find next task. 3. Perform task. Check done: during this step, the thread checks both V and E. If both are empty and no transaction is being executed, then the current batch of transactions has been executed completely and the results can be committed to storage. Find next task: if either V or E contains transactions, Block-STM selects the transaction with the lowest index (not incarnation number) from both sets of transactions, i.e., if E contains tx7 and V contains tx3, the validation task for tx3 would be selected as the next task. Perform task: once the next task has been identified and selected, it is performed.
At the end of this step, the algorithm loops back to check done.
This process continues until both sets of transactions are empty.
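A simplified, single-threaded sketch of that three-stage loop is shown below; execute and validate are hypothetical helpers, and the revalidation cascade described later is omitted:

    # Hypothetical sketch of the check-done / find-next-task / perform-task loop.
    # Lower index = higher priority; execute() and validate() are placeholders.
    def run(transactions, execute, validate):
        E = list(range(len(transactions)))   # indices awaiting execution
        V = []                               # indices executed but awaiting validation
        while E or V:                        # check done: stop when both sets are empty
            # Find next task: pick the lowest index across both sets,
            # preferring the validation task when its index is lower.
            if V and (not E or min(V) < min(E)):
                i = min(V)
                V.remove(i)
                if not validate(transactions[i]):
                    E.append(i)              # failed validation: reschedule for execution
            else:
                i = min(E)
                E.remove(i)
                execute(transactions[i])
                V.append(i)                  # executed: now awaiting validation
        # At this point every transaction has been executed and validated and can be committed.

    run(["tx1", "tx2"], execute=lambda tx: None, validate=lambda tx: True)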
Let's look at what happens during execution and validation.
During execution of a transaction, the blockSTM algorithm populates two sets,
per transaction, a read set and a write set. The read set contains all the memory locations that a transaction read from during its execution while the write set contains all the memory
locations it wrote to. During execution, the transactions apply their writes to the multi-version
data structure, but reading is a little nuanced. In Block-STM, when a transaction wants to read from the data structure, it checks for the value written by the lowest-priority transaction that still has higher priority than itself. For example, if tx2, tx3, and tx7 have all written to a memory location and tx5 wants to read from this location, it reads the version of the data structure corresponding to tx3. This is done to enforce serializability: since tx5 should be executed after tx3 and before tx7, it should use the values written by tx3, not tx7. If tx3 had not yet written when tx5 performed its read, tx5 would have to be re-executed, because it should have read the values written by tx3, not tx2 or any other higher-priority transaction. If a single-version data structure were used, the value written by tx3 would be unavailable and a conflict is sure to occur.
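The read rule is easier to see with a small sketch of a multi-version store keyed by transaction index; this is a simplification of Block-STM's actual data structure:

    # Hypothetical multi-version store: each location maps writer tx index -> written value.
    class MultiVersionStore:
        def __init__(self):
            self.versions = {}                       # location -> {tx_index: value}

        def write(self, location, tx_index, value):
            self.versions.setdefault(location, {})[tx_index] = value

        def read(self, location, reader_index):
            # Return the value written by the closest preceding writer, i.e. the
            # highest tx index that is still lower than the reader's index.
            writers = [i for i in self.versions.get(location, {}) if i < reader_index]
            if not writers:
                return None                          # in practice, fall back to pre-block state
            return self.versions[location][max(writers)]

    store = MultiVersionStore()
    store.write("balance_B", 2, 500)   # tx2 writes
    store.write("balance_B", 3, 450)   # tx3 writes
    store.write("balance_B", 7, 300)   # tx7 writes
    print(store.read("balance_B", 5))  # tx5 reads -> 450, the version written by tx3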
For a validation task, the transaction's read set is compared to the current values at the memory locations it read from during execution. For example, if tx5 read account B during execution, then during validation the memory location for account B is read again, keeping the definition of read we established earlier in mind. If the two values are the same, then it means that no higher-priority transaction, say tx3, has written to that location during the execution of tx5. This results in tx5 being marked as validated, but not safe to commit.
The transaction is not considered safe to commit because a higher-priority transaction could, for any number of reasons, be executed or re-executed after the transaction had been validated. In our running example, if tx3 touches account B and only touches it after tx5 passes validation, then tx5 needs to be re-executed. To provision for this, anytime a transaction completes execution, all lower-priority transactions that had passed validation are revalidated to ensure they do not conflict with that transaction. For example, if tx6 and tx8 have been validated and tx5 completes execution, tx6 and tx8 must be revalidated to ensure that they do not conflict with, and so are not dependencies of, tx5. If a transaction fails validation, i.e.,
a higher priority transaction that writes to the same state was executed concurrently with it,
then the writes that the transaction made are dirty, because the values are wrong.
But instead of deleting these values from the database completely, they are flagged as an ESTIMATE. The ESTIMATE flag tells any transaction reading that memory location that the values there are not correct, and those transactions pause their execution. This is done in place of deletion
because re-execution of the transaction that failed validation would likely result in writing
to the same memory locations as the previous execution. By marking the memory location as
an estimate instead of deleting it,
dependencies of the transaction that failed validation can be caught even before re-execution,
preventing unnecessary re-executions. This heuristic greatly reduces wasted work.
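A minimal sketch of validation and the ESTIMATE flag, with hypothetical helper names and a plain dictionary standing in for the multi-version store, might look like this:

    # Hypothetical sketch of validation and the ESTIMATE flag.
    ESTIMATE = object()   # sentinel marking a failed transaction's dirty writes

    def validate(read_set, read_value):
        # read_set: {location: value observed during execution};
        # read_value(location): what the transaction would read right now.
        return all(read_value(loc) == observed for loc, observed in read_set.items())

    def on_validation_failure(write_set, store):
        # Mark the failed transaction's writes as ESTIMATE instead of deleting them, so
        # transactions reading these locations pause as likely dependencies rather than
        # executing against values that are about to change.
        for location in write_set:
            store[location] = ESTIMATE

    store = {"balance_B": 450}
    read_set = {"balance_B": 500}                      # the transaction read a now-stale value
    if not validate(read_set, lambda loc: store[loc]):
        on_validation_failure({"balance_D"}, store)    # its own writes become ESTIMATEs
    print(store["balance_D"] is ESTIMATE)              # True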
Putting it all together, a complete overview of how Block-STM approaches parallelization can be summarized as follows. Transactions start out as an ordered list
with a clearly defined serial order. These transactions are executed on all the available
resources in priority order. During execution, the read and write sets of the transactions are
tracked and reads and writes are performed from to a multi-version data structure. When a transaction
finishes execution, it is validated to ensure that it did not execute
concurrently with any conflicting transactions. If a transaction passes validation, then it is
removed from E and V, and when a transaction completes execution, all transactions lower in priority
than it that had passed validation are rescheduled for validation. If a transaction fails validation,
then it is rescheduled for execution. When the entire set of transactions
has been executed and validated, then the entire block is safe to commit and the results can be
written to permanent storage. An example is shown below. That is an overview of how block STM works.
More details can be found in our report here and the original block STM paper here.
How does Block-STM improve StarkNet performance? To quantify the significance of
adding block STM, we performed a few benchmarks to evaluate the performance improvements it
provides over sequential execution and the results are shown below. The results show that as the
number of threads, analogous to independent processing units, used increases, performance
also increases. The improvements are more pronounced with larger blocks, giving as much as a 9x performance improvement over sequential execution with just 16 threads.
Our tests show that Block-STM performance does degrade significantly under highly contentious load, but the industry-standard practice is to fall back to sequential execution during such periods. We recommend the same mechanism for StarkNet to preserve throughput under highly contentious workloads.
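One possible form of such a fallback, our illustration rather than anything Bolt itself specifies, is to track the share of transactions that fail validation and switch modes past a threshold:

    # Hypothetical heuristic: fall back to sequential execution under heavy contention.
    def choose_execution_mode(aborts, executions, abort_rate_threshold=0.5):
        # aborts: transactions that failed validation and were re-executed in the last block.
        if executions == 0:
            return "parallel"
        abort_rate = aborts / executions
        return "sequential" if abort_rate > abort_rate_threshold else "parallel"

    print(choose_execution_mode(aborts=12, executions=100))  # parallel
    print(choose_execution_mode(aborts=80, executions=100))  # sequential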
But, by and large, the addition of Block-STM will significantly improve and future-proof StarkNet. The second major change bundled in the v0.13.2 upgrade is block packing, and we will cover that next. Bolt Part 2. Block packing. As discussed earlier, prior to Bolt, each StarkNet block was its own job, resulting in a fixed
per block cost for every block. In addition, the system was designed such that each block required
its own blob regardless of how much data was actually consumed by the block. In a world where
there was always high demand, this wouldn't be a problem, but StarkNet currently offers more block space than
there is demand for and so there are a lot of wasted resources that could result in hundreds
of ETH lost over the course of a month. To further put into context the severity of the situation
before Bolt, these are the costs associated with posting a block to the L1:
1. 23k gas per fact registration (part of verify proof and register).
2. 56k gas per SHARP memory page registration.
3. 136k gas per state update, made up of 50k gas for the KZG precompile to sample a blob and 86k gas for running the state update function.
This totals 215k gas per block, and this cost is flat, i.e., it's the same regardless of how much data each block contains, and relates to the number of blocks by: total cost = number of blocks × 215k gas.
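As a quick sanity check on those numbers, here is a small worked example; the six-minute cadence comes from the publishing rule described above, and the rest is simple arithmetic:

    # Worked example of the flat pre-Bolt L1 verification cost.
    GAS_PER_BLOCK = 23_000 + 56_000 + 136_000   # fact registration + memory page + state update = 215k
    blocks_per_day = 24 * 60 // 6               # assuming one block every six minutes
    gas_per_day = blocks_per_day * GAS_PER_BLOCK
    print(blocks_per_day, gas_per_day)          # 240 blocks -> 51,600,000 gas per day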
The ideal solution to this problem would be for costs to be related to the amount of data posted instead of the number of blocks. And that's exactly what block packing achieves via SNAR trees.
SNAR trees: StarkNet Applicative Recursive trees. SNAR trees are a new type of binary tree introduced in Bolt to address the problems highlighted above. A SNAR tree has the following structure: each leaf is a StarkNet block, and the nodes in all other levels are recursive proofs of their children. The root node, which is a recursive proof of all the other proofs, is the final job that is sent to the Shared Prover (SHARP). What does the SNAR tree do? The primary benefit of the SNAR tree is that rather than posting one block per proof, many StarkNet blocks can be amortized into the same L1 update. SNAR tree roots are now posted to the L1 only when one of two configurable limits is hit: either the DA limit, six blobs' worth of data, or after a certain number of leaves are added to the tree, where a leaf is a block.
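A rough sketch of the SNAR-tree idea follows; prove_block and merge_proofs are placeholders for the real recursive provers, and the per-block blob usage is an assumption for illustration:

    # Hypothetical sketch of SNAR-style block packing: leaves are block proofs, internal
    # nodes recursively prove their children, and the root is posted to the L1 only when
    # the leaf limit or the DA (blob) limit is reached.
    def prove_block(block):
        return {"leaf": block}

    def merge_proofs(left, right):
        return {"node": (left, right)}               # recursive proof of both children

    def build_root(leaves):
        layer = leaves
        while len(layer) > 1:                        # fold pairs until one root remains
            layer = [merge_proofs(layer[i], layer[i + 1]) if i + 1 < len(layer) else layer[i]
                     for i in range(0, len(layer), 2)]
        return layer[0]

    def pack_blocks(blocks, max_leaves=10, max_blobs=6, blobs_per_block=0.2):
        leaves, blobs_used, roots = [], 0.0, []
        for block in blocks:
            leaves.append(prove_block(block))
            blobs_used += blobs_per_block
            if len(leaves) >= max_leaves or blobs_used >= max_blobs:
                roots.append(build_root(leaves))     # this root is the job sent to SHARP
                leaves, blobs_used = [], 0.0
        if leaves:
            roots.append(build_root(leaves))
        return roots

    print(len(pack_blocks([f"block_{i}" for i in range(25)])))  # 3 L1 updates instead of 25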
This design decouples the cost of transactions from the number of blocks. Now, there is still some fixed cost for each block that arises from running the StarkNet OS (SNOS) in every block, but by and large the costs are decoupled. These are the numbers now:
1. Fact registration per job: 23k gas, same as before.
2. SHARP memory page per job: 36k gas, reduced from 56k thanks to a SHARP optimization.
3. State update per job: 86k gas, same as before, plus 50k gas per blob used. When one blob was used per block, this was 50k × 1, resulting in the 136k gas number quoted above.
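To see what this does to the numbers, here is a small before-and-after comparison for a batch of blocks packed into one job; the blob count per job is an assumption for illustration, and the remaining per-block SNOS overhead is ignored:

    # Worked comparison of L1 verification gas, before and after Bolt.
    def gas_before_bolt(num_blocks):
        return num_blocks * (23_000 + 56_000 + 136_000)   # flat 215k gas per block

    def gas_after_bolt(blobs_used):
        # One job for the whole pack, regardless of how many blocks it contains:
        # fact registration + memory page + state update, plus 50k gas per blob sampled.
        return 23_000 + 36_000 + 86_000 + 50_000 * blobs_used

    blocks = 100
    print(gas_before_bolt(blocks))        # 21,500,000 gas
    print(gas_after_bolt(blobs_used=6))   # 445,000 gas for the same 100 blocks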
The plot in Figure 11 below shows how gas costs vary with block number in the previous design and now, under Bolt. It's fairly obvious that block packing greatly reduces the cost of verification on the L1, which will undoubtedly result in lower gas prices for StarkNet users. Conclusion
The effect of the changes made in Bolt, optimistic parallel execution via Block-STM and block packing through the proprietary SNAR trees, can be summarized as: faster and cheaper.
Parallel execution reduces execution time and by extension congestion which will reduce
gas fees during periods of high traffic, while SNAR trees tackle the associated DA and proving
costs.
Interestingly, this upgrade makes StarkNet the first L2 with
parallel execution and sets it up to be a major contender in the L2 space. It's important to note
that it's unlikely that the effect of these changes will be immediately evident, especially
that of parallel execution, but they are crucial to future-proofing StarkNet and the entire Ethereum ecosystem as a whole. Author's note: A version of this article
was previously published here. Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.