The Good Tech Companies - Block-STM vs. Sealevel: A Comparison of Parallel Execution Engines

Episode Date: December 13, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/block-stm-vs-sealevel-a-comparison-of-parallel-execution-engines. Block-STM and Sealevel are... two approaches to parallelizing blockchain execution. This paper will examine and break down how both TPUs approach parallelization. Check more stories related to web3 at: https://hackernoon.com/c/web3. You can also check exclusive content about #ethereum, #blockchain, #parallel-executions, #aptos, #solana, #distributed-ledger-technology, #good-company, #hackernoon-top-story, and more. This story was written by: @2077research. Learn more about this writer by checking @2077research's about page, and for more stories, please visit hackernoon.com. Block-STM and Sealevel are antagonistic approaches to parallelizing blockchain execution. This paper will examine and break down how both TPUs approach parallelization at a relatively low level and evaluate their performance and scalability. It will also provide an unbiased evaluation of the strengths and weaknesses of each TPU.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Block-STM vs. Sealevel: A Comparison of Parallel Execution Engines, by 2077 Research. Aptos' Block-STM and Solana's Sealevel are antagonistic approaches to parallelizing blockchain execution. Block-STM uses an optimistic concurrency control (OCC) approach, i.e., the transaction processing unit (TPU) optimistically assumes that no transactions executed concurrently will conflict and relies on in-built checks and logic to identify and resolve conflicts. Sealevel, on the other hand, pessimistically assumes that transactions will conflict and relies on lock-based synchronization to prevent conflicts. The pessimistic concurrency control (PCC) approach has historically been more performant
Starting point is 00:00:51 in distributed database systems, and for intuitive reasons too: a TPU that schedules transactions in a manner that prevents conflicts should perform better than a TPU that has to resolve conflicts after the fact. However, thanks to clever design and engineering, Block-STM performs surprisingly well, with the added benefits of allowing arbitrary logic and, by extension, a greater range of use cases and a superior dev-ex. This paper will examine and break down how both TPUs approach parallelization at a relatively low level and evaluate their performance and scalability. It will also provide an unbiased evaluation of the strengths and weaknesses of both TPUs. The article is written with the assumption that the reader is familiar with
Starting point is 00:01:34 blockchain concepts like transactions, blocks, and consensus. Familiarity with distributed databases and computer architecture will help with grokking some of the concepts, but the glossary explains unfamiliar terms and the main body and appendix contain primers on unfamiliar subjects. Common terms. A few terms will come up often, so let's quickly break down what they mean in the context of this paper. Concurrency and parallelism. Concurrency refers to multiple processes using the same resources. Parallelism is multiple processes running completely independently of one another. Programs can have either, both, or neither of these properties. Transactions. A transaction is an atomic set of instructions that performs a logical operation.
Starting point is 00:02:19 The instructions referenced here are analogous to low-level computer instructions, but they do far more than a computer instruction. Conflict. Two or more transactions are said to conflict when they access or modify the same portion of state. Specifically, a conflict occurs when at least one transaction tries to write to the contended portion of state. If all the transactions are reading, then they don't conflict. State. State describes the condition of a thing at an instant in time. In the context of blockchains, state is the set of accounts and their associated data (balances and code). When memory access or modification is mentioned, memory refers to state. Dependencies. Transaction B is said to be a dependency of transaction A if and only if
Starting point is 00:03:02 transaction B conflicts with A and transaction B is of lower priority than A. If B were of higher priority, A would be a dependency of B. Lock. A lock is a mechanism used to prevent concurrent access to a memory location. When a transaction or process wants to access a memory location in lock-based systems, it attempts to grab the lock for the associated location. If the location is already locked, the lock grab fails and the transaction must wait. Locking granularity refers to how fine memory locks are. Imagine a transaction needs to alter an element in a table. A coarse lock could lock the entire table. A more granular lock could lock the row or column of interest. And a very fine lock could lock only
Starting point is 00:03:45 the cell of interest. The more granular a lock is, the more concurrent memory access can be. But highly granular locks are more difficult to manage because they require a larger key-value database to manage portions of memory. Serializability of transactions. A set of transactions executed concurrently is said to be serializable if there exists a sequential execution of the same set of transactions that produces the same result. With all of that out of the way, we can get started. Introduction. A blockchain network is a decentralized, Byzantine fault-tolerant distributed database. The transaction processing unit (TPU)
Starting point is 00:04:26 is the component responsible for computing state transitions. It takes transaction data as input and outputs an ordered list of transactions and a succinct representation of the execution results, usually the block hash. The TPU is usually coupled with, but distinct from, the virtual machine (VM). Blockchain VMs like the EVM, SVM, and MoveVM are high-level language VMs. That means they convert bytecode (compiled intermediate representations of the high-level languages Solidity, Rust, and Move) to machine-executable code. A blockchain VM is fundamentally the same as the more familiar emulation VMs. It's a sandboxed environment that allows a non-native instruction set (blockchain bytecode) to be executed on actual hardware (x86, ARM).
Starting point is 00:05:14 The TPU, which is usually the subject of interest, subsumes the VM. It is tasked with the management of the entire transaction execution pipeline, including creating and managing instances of the VM. So, as mentioned earlier, these terms are related but distinct. The TPU, specifically, is the focus of this paper. There are two types of TPUs: sequential and parallel. A sequential TPU is the easiest to design and implement. Sequential TPUs process transactions first in, first out,
Starting point is 00:05:45 FIFO. This approach is very simple and incurs no scheduling overhead, but sequential TPUs don't take full advantage of the state and trend of computer hardware design. Computer hardware has been approaching the limits of single-processor performance since the mid-2000s, and there has been an industry-wide shift towards scaling by increasing the number of cores. If there are no breakthroughs in computing, computers will continue to improve performance by adding more cores as opposed to frequency scaling. And in such a world, sequential TPUs will fail to maximize even consumer hardware, and execution will quickly become a bottleneck. Figure 1 shows the trend of single-thread performance (blue dots)
Starting point is 00:06:25 and frequency (green squares), both of which have slowed down and even seem to trend slightly downward. The yellow diamonds show the number of logical cores, and this number has been steadily growing since the mid-2000s. Parallel TPUs are designed to make the most of this trend: they execute as many non-conflicting transactions concurrently as possible. An ideal parallel TPU will execute as many transactions that are not dependent on any higher-priority transactions as is possible. As an example, consider the priority-ordered set of transactions with the dependency graph shown below. The arrows indicate dependencies, e.g., that one transaction is a dependency of another.
Starting point is 00:07:07 Assuming each transaction executes within a unit of time, an ideal four-thread parallel TPU would execute the independent transactions in parallel, execute their dependents right after, and execute the remaining transactions last. The challenge of designing and implementing a parallel TPU is designing a concurrency control system that ensures only non-conflicting transactions are executed
Starting point is 00:07:29 simultaneously while maintaining priority with minimal overhead. Let's look at how this is accomplished in practice. How parallel TPUs are implemented. Concurrency control and why it's necessary. It's easy to say "just execute the transactions in parallel, bro" without truly understanding why this is such a difficult problem to solve, so I'll give a simple example to elaborate why concurrency control is necessary when attempting to parallelize access to shared resources. Consider the example of a banking database where account A has $50. Transaction number 1 wants to send $50 from account A to account B, and transaction number 2 wants to send $50 from account A to account C. Assuming both transactions are allowed to execute in parallel, both transactions will initially read the balance of account A as $50. Both
Starting point is 00:08:19 transactions will then write $0 to the memory location where the account balance of account A is stored. Note that it doesn't matter if one writes before the other. They both read the balance as $50 and will update it to be (original balance - transfer amount), so both transactions will write $0 to the account balance. Both transactions will read account B and account C's account balances and write an additional $50 to the memory locations where the balances are stored, printing $50 out of thin air. This is a simple example but it suffices to show that when there's concurrent access/modification of shared resources by uncoordinated transactions, the execution results are non-deterministic, subject to race conditions, and unserializable. A few other potential problems that
Starting point is 00:09:06 arise due to a lack of concurrency control are: The lost update problem. When a batch of transactions is executed in parallel, there's a possibility that a lower precedence transaction, say Txk, overwrites a memory location that a higher-than-k precedence transaction, say Txj, needs to read. Txj should have read the values before Txk wrote to the location; without concurrency control, there is no way to enforce this behavior. The dirty read problem. Sometimes transactions are aborted, so all the data they've written will be rolled back; however, another transaction might have read these dirty values before they're rolled back, compromising the integrity of the database.
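To make the banking example above concrete, here is a minimal, illustrative Rust sketch of the same race (accounts, amounts, and structure are made up; this is not code from either TPU). Because the balance check and the balance update happen under separate lock acquisitions, both transfers can observe the same $50 and both can succeed:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // Toy state: account balances.
    let state = Arc::new(RwLock::new(HashMap::from([
        ("A".to_string(), 50i64),
        ("B".to_string(), 0i64),
        ("C".to_string(), 0i64),
    ])));

    // Tx 1: A -> B for $50. Tx 2: A -> C for $50.
    let transfers = vec![("A", "B", 50i64), ("A", "C", 50i64)];
    let mut handles = Vec::new();

    for (from, to, amount) in transfers {
        let state = Arc::clone(&state);
        handles.push(thread::spawn(move || {
            // Read step: both threads may observe A = 50 here...
            let from_balance = *state.read().unwrap().get(from).unwrap();
            if from_balance >= amount {
                // ...and because the check and the update are not atomic, both
                // debits can succeed, writing A = 0 twice and crediting $100 total.
                let mut accounts = state.write().unwrap();
                *accounts.get_mut(from).unwrap() = from_balance - amount;
                *accounts.get_mut(to).unwrap() += amount;
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    // May print A = 0, B = 50, C = 50: $50 created out of thin air.
    println!("{:?}", *state.read().unwrap());
}
```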
Starting point is 00:09:50 There are more potential problems that can arise from simply parallelizing transactions, but they won't be discussed for the sake of brevity. The important takeaway is that attempting to parallelize execution without additional safety measures compromises the integrity of the execution results. The solution to this problem in distributed database management systems (DDBMSs) is referred to as concurrency control (CC). Concurrency control and types of concurrency control. Concurrency control is the process of ensuring that simultaneously executing operations do not conflict. In DBMSs and, by extension, blockchains, there are two major paradigms of concurrency control: optimistic concurrency control (OCC) and pessimistic concurrency control (PCC). Pessimistic concurrency control (PCC). In PCC, the execution of a transaction is blocked if it needs to access a resource that is already in use by another, usually higher priority, transaction. In PCC systems, locks are usually the
Starting point is 00:10:47 method of choice to enforce blocking. Most PCC systems also require that transactions declare upfront what portions of memory they will read from and/or write to, since acquiring locks on the fly will still lead to unserializable execution results. Optimistic concurrency control (OCC). In OCC, transactions are attempted on as many processing resources as are available. But instead of writing directly to or reading directly from persistent memory, the transactions usually write to a log. After attempted execution, the transaction is validated to make sure that its execution did not violate any of the database integrity rules. If validation fails, the effects
Starting point is 00:11:25 of the transaction are rolled back, and the transaction is rescheduled for execution. Otherwise, the transaction commits, i.e., writes to persistent memory. The other two notable developments are the Calvin and BOHM protocols that address concurrent transactional modifications of distributed databases. We'll go over a high-level overview of each of them to provide context for Block-STM. Transactional memory (TM). I find it easier to think of transactional memory as transactional memory access. The underlying principle of TM is to allow concurrent programs to modify (read and write) shared memory in a way that is analogous to how database transactions modify a database without
Starting point is 00:12:10 using locks. More simply put, TM aims to allow concurrent programs to modify shared memory atomically and produce serializable results without locks. The reasoning behind the development of TM was that lock-based synchronization techniques incur overhead from managing locks and must be carefully designed to be resistant to priority inversions, convoying, deadlocks, livelocks, etc. TM would not need to worry about these things and would, in theory, outperform lock-based systems. The paper proposed a new multiprocessor architecture, not an instruction set architecture, and the addition of a few instructions that would allow for transactional memory access. The additional instructions allowed programmers to define new read-modify-write operations that performed atomic updates to one or multiple memory locations without needing to lock those
Starting point is 00:12:58 memory locations. A few (fewer than 10) implementations of TM were developed, but TM never had widespread use. Software(-only) transactional memory (STM). STM is a suite of software-only implementations of the principles of transactional memory. The goal is the same as TM: to allow concurrent processes to access shared memory without the use of locks. The way STMs usually implement this functionality is optimistic. Threads execute with no regard for what other threads are doing, but instead of committing their writes directly to memory, the threads record every read and write in a log (an abstract data structure).
Starting point is 00:13:36 At the end of a thread's execution, the results are validated, i.e., the values that the process read during its execution are compared against the current values at the same memory locations. If the values are different, then the thread rolls back all its writes, because a difference in the read sets implies that one or more concurrent processes modified the memory areas accessed by the process being validated. Hence, the execution results are unserializable. STM implementations will re-execute and revalidate transactions that fail validation until they pass. STM maximizes concurrency as threads never have to wait to access resources they need, but there's a great deal of wasted work in high-contention use cases. Years of research showed that STM implementations perform worse than fine-grained, lock-based synchronization methods on a small number of processors due to the overhead from managing the log.
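As a rough illustration of the read/write log and commit-time validation just described, here is a minimal Rust sketch. The types and method names are my own assumptions for illustration, not any real STM library:

```rust
use std::collections::HashMap;

type Addr = u64;
type Value = u64;

// Shared memory the threads ultimately want to modify.
struct Memory {
    cells: HashMap<Addr, Value>,
}

// Per-thread log: reads and writes are recorded instead of applied in place.
struct TxLog {
    read_set: Vec<(Addr, Value)>,  // every (location, value) observed during execution
    write_set: Vec<(Addr, Value)>, // writes buffered until commit
}

impl TxLog {
    fn read(&mut self, mem: &Memory, addr: Addr) -> Value {
        let value = *mem.cells.get(&addr).unwrap_or(&0);
        self.read_set.push((addr, value));
        value
    }

    fn write(&mut self, addr: Addr, value: Value) {
        self.write_set.push((addr, value));
    }

    // Validation: the execution is serializable only if everything this thread
    // read still holds the same value; otherwise roll back and re-execute.
    fn validate(&self, mem: &Memory) -> bool {
        self.read_set
            .iter()
            .all(|(addr, value)| mem.cells.get(addr).unwrap_or(&0) == value)
    }

    // Commit: apply the buffered writes to shared memory.
    fn commit(self, mem: &mut Memory) {
        for (addr, value) in self.write_set {
            mem.cells.insert(addr, value);
        }
    }
}
```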
Starting point is 00:14:22 For these reasons, STM implementations don't find much use in practice, and when they do, it's for hyper-specific use cases that the implementations are optimized for. Calvin. The next major development that preceded Block-STM came in the 2012 Calvin paper, where the authors proved that, contrary to popular belief, enforcing a preset order of transactions improved the execution throughput of distributed databases. The general sentiment before Calvin was that enforcing a preset order of execution would reduce concurrency, but Calvin firmly established that as false. Calvin, like Sealevel,
Starting point is 00:15:03 requires that transactions declare up front all the memory locations that they will access during execution. The rest of the workflow is fairly straightforward: Calvin nodes (computers that manage partitions of the distributed database) first decide on the ordering (priority) of a set of transactions. After coming to consensus on the ordering, Calvin's scheduling follows a simple rule: if two or more transactions request access to the same memory locations, the higher-priority transaction must be allowed to access the memory location first. A dedicated thread is responsible for lock management, and this thread iterates through the serial transaction order and ensures that requests for locks are made in order.
Starting point is 00:15:42 When two transactions conflict, they are scheduled sequentially in order of priority, but non-conflicting transactions can be scheduled concurrently. There is a lot more nuance to Calvin's design, but the key takeaway is that Calvin established the idea that enforcing priority improved throughput. BOHM. The next critical development came in 2014: BOHM. Like Calvin, BOHM is designed for distributed databases, but the key insight is also easily extensible to blockchains. BOHM uses a multi-version concurrency control (MVCC) mechanism, which is an auxiliary form of concurrency control that uses a multi-versioned log to manage reads from and writes to shared memory. In a nutshell, in MVCC databases, transactions do not directly modify the database.
Starting point is 00:16:27 Instead, a log holds multiple versions of the database for every transaction, i.e., each memory location is associated with all the values of the transactions that have written to it rather than only the most recent write. You can imagine the multi-version data structure as a two-dimensional table. Each slice of this two-dimensional table is a table that contains the versions of state for a particular memory location. An illustration that should help get the idea across is shown in figure 3. Recording in MVCC is similar to but different from that of TM in that in MVCC, for every write to a memory location, a new version of the location specific to that
Starting point is 00:17:05 transaction is created or updated, rather than overwriting the current value. MVCC databases allow for more concurrency than single-version databases, as transactions can read the values written by any transaction at a memory location, regardless of how many transactions have written to that location. The trade-off is an increase in space (memory) complexity. BOHM showed that combining multi-version concurrency control with fixed transaction ordering significantly improved the execution throughput of databases while maintaining full serializability. The way BOHM works is briefly explained below. BOHM is a two-layered protocol: there is a concurrency layer and an execution layer. As a stream of transactions
Starting point is 00:17:45 is fed to the concurrency layer, a single thread orders the transactions by timestamp. The concurrency threads are individually responsible for a logical partition of the database. BOHM requires that transactions declare their read and write sets upfront, and using this information, each concurrency thread cycles through the ordered set and checks if the transaction writes to its partition. For all transactions that do, the concurrency thread creates an uninitialized placeholder in the log for that transaction. Because individual threads are responsible for partitions of the database, BOHM increases concurrency with intra-transaction parallelism when creating the placeholders. After the concurrency control threads complete their tasks, another set of threads (execution threads) execute the transactions and fill in the placeholders. There's a priority order, so when a transaction needs to read from a memory location,
Starting point is 00:18:36 it checks for values written by the lowest-priority transaction that is higher than it. For example, if several higher-priority transactions write to a memory location that a transaction needs to read from, it will read the value written by the closest one above it in the priority order. In the event that the placeholder associated with the correct version to read is still uninitialized, i.e., the transaction that should write it has not yet done so, the execution of the reading transaction is blocked until the transaction that should write to that location completes its write. This "reads never block writes" design allows BOHM to be extremely efficient, especially with large workloads, as the greatest cost, constructing the multi-version data structure, is amortized as the workload grows. Again, some nuance has been
Starting point is 00:19:17 left out, but the key takeaway from BOHM's design is that multi-version data structures allow for increased concurrency at the cost of memory. In addition to the four protocols discussed above, a 2017 paper also did some work on STM for blockchains. In the paper, the authors propose the classic STM design with on-the-fly lock acquisition to attempt to prevent concurrent memory access and post-execution validation to identify conflicts. The design is uncomplicated. Transactions are optimistically attempted, grabbing locks as they require them, and validated after. If a conflict is discovered, it is resolved by rolling back the transaction and re-executing it. The design allowed the leader a great deal of freedom in deciding transaction order, but the
Starting point is 00:20:02 results were non-deterministic, so other nodes would likely not arrive at the same execution results unless the leader shared the exact execution path. But the design never saw any adoption, largely because the performance results presented in the paper showed that the protocol was only slightly better than sequential execution and sometimes worse. Block-STM applies the insights from Calvin and BOHM to succeed where the unnamed 2017 protocol failed. With the appropriate background set, we can move on to Block-STM. Block-STM. Traditional distributed databases view the insights of Calvin and BOHM as constraints, since enforcing a priority ordering requires some form of consensus between nodes, and committing each transaction individually,
Starting point is 00:20:45 as opposed to block level, is the norm in transactional databases. But both of these properties are inherent in blockchains. Blockchain nodes must agree on the ordering of transactions, even if the leader is free to propose as it wishes, and commits usually occur at the block level, or at least in batches a la Solana. In essence, what traditional DBs consider constraints are built into Block-STM's spec, and it leverages them to improve its performance. Below is an overview of how Block-STM works. Similar to Calvin, transactions are ordered by priority. After ordering, the transactions are scheduled for execution. Like in STM, transactions are attempted with all available
Starting point is 00:21:25 resources with no regard for conflicts. And like in BOHM, the execution threads don't read from or write to memory directly. Instead, they read from or write to a multi-version data structure that we'll refer to as the data structure going forward. Continuing in BOHM's footsteps, when a transaction reads from the data structure, it reads the values written by the most recent version of the lowest-priority transaction that is higher than itself. For example, if a transaction wants to read from a memory location that several higher-priority transactions have written to, it reads the value written by the closest one above it in the order. Keep this definition of read in mind
Starting point is 00:22:05 as it is central to the working principle of Block-STM. During the execution of a transaction, its read set is tracked. The read set contains the memory locations and associated values that a transaction read during execution. After a transaction completes execution, it is verified by comparing its read set with the current values at the memory locations it read from, keeping in mind the definition of read established earlier. If there are any discrepancies between the read set and the current values at the memory locations, it implies that during the transaction's execution, one or more higher-precedence transactions modified one or multiple memory locations that it read. Based off
Starting point is 00:22:46 the preset serialization order, the transaction should have read the values written by those higher-precedence transactions, so all the values it wrote are considered dirty. But instead of deleting those values from the data structure, they are marked as estimates, and the transaction is scheduled for re-execution. The values are not deleted because it's likely that re-executing the transaction will write to the same locations, and any lower-priority transactions that read values marked as estimates are delayed until the transaction is re-executed and revalidated. Because of this heuristic, Block-STM can avoid a cascade of aborts and re-executions that would occur if the data structure were wiped clean of dirty values. If there are no discrepancies, i.e., no transaction of higher priority than the one currently being validated wrote to a memory location in its read set, then the transaction is marked valid but not safe to
Starting point is 00:23:31 commit. The transaction is not safe to commit yet because there's a chance that a transaction of higher priority will fail validation. In such an eventuality, all validated transactions of lower priority then need to be revalidated to ensure that they haven't read from a location written to by it. Because of this, transactions are not safe to commit until all transactions that come before them in the preset serialization order have been executed and validated. When all the transactions in the block have been executed and validated, execution is complete. That's a basic, and likely confusing, overview of how Block-STM works. We'll go over the process in detail next. Technical details of Block-STM
Starting point is 00:24:12 Before we look at the technical details of Block-STM, a few details need to be concretized. First, the input to Block-STM is an ordered set called a block that contains n transactions in the preset serialization order. The goal of Block-STM is to take this block of transactions and execute them with the most concurrency possible, without breaking the serialization. As I mentioned earlier, each transaction is first executed and then validated. If an executed transaction fails validation, the transaction is scheduled for re-execution. To track the number of times a transaction has been executed, each transaction is associated with an incarnation number in addition to its index number. You can think of a transaction as being of the form (index, incarnation), where the first element is the index number and the second is the
Starting point is 00:24:55 incarnation number. So the block initially contains every transaction at incarnation zero. The combination of a transaction's index and its incarnation makes up the version of the transaction. Lastly, to support concurrent reads and writes, Block-STM maintains an in-memory multi-version data structure, similar to the one discussed in BOHM, that stores the latest writes per transaction and the transaction version for every memory location. Here's a sketch of the early implementation. The MVHashMap maps each memory location to an internal B-tree map that maps the indexes of transactions that have written to that memory location to the corresponding values. The DashMap is responsible for concurrency: it allows thread-safe modification of the B-tree map.
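Since the original snippet doesn't survive transcription, here is an illustrative reconstruction of the idea in Rust. The type and field names are assumptions, not the exact Aptos-core definitions; only the DashMap-of-BTreeMaps shape and the read rule come from the description above:

```rust
use dashmap::DashMap;
use std::collections::BTreeMap;

type TxnIndex = usize;
type Incarnation = usize;
type Key = Vec<u8>;
type Value = Vec<u8>;

enum EntryCell {
    // A write, tagged with the incarnation that produced it.
    Write(Incarnation, Value),
    // Marker left behind when a transaction fails validation: readers must wait.
    Estimate,
}

struct MVHashMap {
    // DashMap gives thread-safe access; each BTreeMap orders writes by transaction index.
    data: DashMap<Key, BTreeMap<TxnIndex, EntryCell>>,
}

enum ReadResult {
    Value(Value),
    // The closest lower-index write is an estimate: the reader has a dependency.
    Dependency(TxnIndex),
    // No transaction below `reader` wrote here: fall back to pre-block storage.
    NotFound,
}

impl MVHashMap {
    fn write(&self, key: Key, idx: TxnIndex, incarnation: Incarnation, value: Value) {
        self.data
            .entry(key)
            .or_default()
            .insert(idx, EntryCell::Write(incarnation, value));
    }

    // The Block-STM read rule: return the write of the highest-index transaction
    // that is still lower than the reader's own index.
    fn read(&self, key: &Key, reader: TxnIndex) -> ReadResult {
        match self.data.get(key) {
            Some(versions) => match versions.range(..reader).next_back() {
                Some((_idx, EntryCell::Write(_inc, value))) => ReadResult::Value(value.clone()),
                Some((idx, EntryCell::Estimate)) => ReadResult::Dependency(*idx),
                None => ReadResult::NotFound,
            },
            None => ReadResult::NotFound,
        }
    }
}
```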
Starting point is 00:25:38 Full details on the data structure implementation can be found here. Next, we'll look at the actual thread logic. Thread logic. The activities of the worker threads (execution and validation threads) are coordinated by a collaborative scheduling thread that tracks and modifies two sets that we'll call E and V. E contains all transactions that are yet to be executed, and V tracks transactions that are yet to be validated. The implementation of the collaborative scheduler tracks these tasks with atomic counters. More details can be found in Appendix A3. Each of the worker threads cycles through the three-stage loop outlined below. If V and E are empty and no other threads are performing a task, then the execution of the block is complete. Else, if there are tasks in E and V,
Starting point is 00:26:22 an available worker thread will select the task with the smallest index between V and E, i.e., if the lowest-index task in V has a smaller index than the lowest-index task in E, the worker thread will create and perform a validation task for it. The atomic counter mentioned above ensures that both sets will not contain the same transaction. As to why the transaction with the smallest index is chosen, it's because validating or executing the higher-priority tasks as soon as possible helps to identify conflicts early. This is one of the ways a preset order improves performance. If the next task is an execution task, then execute the next incarnation of the transaction. If, during the execution, an estimate is read, then abort execution, mark all the data written by the transaction as estimates, and add it to E
Starting point is 00:27:06 with an increased incarnation number. If no estimates are read, check if the transaction writes to any locations that its previous incarnation did not write to. If there is a write to a new location, add validation tasks for all transactions lower in priority than it that have been executed or validated. This step is necessary because if the execution writes to a new memory location that previous incarnations did not, there's a chance that transactions of lower precedence, i.e., transactions with greater indexes, already read from these memory locations, so they need to be checked for validity. Transactions of higher precedence don't need to be validated because, by the definition of a Block-STM read, they never read values written by lower-priority transactions. Create a validation task for the transaction. If it does not write to any new locations,
Starting point is 00:27:52 then create a validation task for it alone. Loop back to the start. If the next task is a validation task, then validate the transaction. If validation succeeds, the thread returns to check whether execution is done. Else, if validation fails, mark all the values written by the transaction as estimates. Create validation tasks for all transactions lower in priority than it that are not currently in E, and add those validation tasks to V. Transactions higher in priority would never read a value written by this transaction, because of the definition of a Block-STM read, so they don't need revalidation. Create an execution task for the transaction with an incremented incarnation number and add the task to E. Loop back to the start. This loop continues until there are no more tasks in V and E, at which point check done returns done, and the most recent version of the data structure is safe to commit to persistent storage. After commitment, a garbage collector frees up the data structure and the transaction processing unit awaits the next block.
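A heavily simplified, single-threaded sketch of this loop is shown below. Real Block-STM drives the loop from many threads using atomic counters, and the execute/validate bodies here are stand-ins; all names are illustrative:

```rust
use std::collections::BTreeSet;

enum Task {
    Execute(usize),  // transaction index
    Validate(usize),
}

// Pick the smallest index across both sets: finishing high-priority work first
// surfaces conflicts as early as possible.
fn next_task(e: &mut BTreeSet<usize>, v: &mut BTreeSet<usize>) -> Option<Task> {
    match (e.first().copied(), v.first().copied()) {
        (Some(ei), Some(vi)) if vi < ei => { v.remove(&vi); Some(Task::Validate(vi)) }
        (Some(ei), _) => { e.remove(&ei); Some(Task::Execute(ei)) }
        (None, Some(vi)) => { v.remove(&vi); Some(Task::Validate(vi)) }
        (None, None) => None, // both sets empty: the block is done
    }
}

fn main() {
    // E starts with every transaction in the block; V starts empty.
    let mut e: BTreeSet<usize> = (0..4).collect();
    let mut v: BTreeSet<usize> = BTreeSet::new();

    while let Some(task) = next_task(&mut e, &mut v) {
        match task {
            Task::Execute(i) => {
                // Execute the incarnation against the multi-version structure,
                // then queue the transaction for validation.
                v.insert(i);
            }
            Task::Validate(i) => {
                // Compare the read set against current versions; on failure we
                // would mark the writes as estimates, push i back into E, and
                // push every index greater than i into V. Here we just succeed.
                println!("tx {i} validated");
            }
        }
    }
}
```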
Starting point is 00:28:35 Below is an example of how this process would play out in practice. The illustration is modeled on a four-worker-thread machine. The sets E and V are the execution and validation sets. Nodes (circles) represent transactions, and the color of a node helps to identify other transactions that it conflicts with. The upper level of the main table shows execution threads and the lower levels show the validation threads. The sets E and V in every column represent the execution and validation sets after the completion of the iteration. As seen in the illustration, E initially contains all the
Starting point is 00:29:21 transactions and V is empty. During the first iteration, the first four transactions are executed optimistically and E and V are updated accordingly. During the next iteration, all four transactions that were executed are validated. Transaction 4 fails validation due to a conflict with transaction 1 and must be re-executed. During the third iteration, transactions 4, 5, 6, and 7 are executed. Transaction 5 conflicts with 4 so it will read values marked estimate and its execution is paused until transaction 4 completes execution. During the fourth iteration, transactions 4, 6, and 7 are validated while the execution of 5 is completed. During the fifth iteration, transactions 5, 6, and 7 are validated.
Starting point is 00:30:07 6 and 7 are revalidated because of the re-execution of 5, for reasons explained above. Transaction 8 is also executed. During the next iteration, transaction 8 is validated while 9 and 10 are executed. And during the final iteration, transactions 9 and 10 are validated and the block is marked safe to commit. While I've described a simplistic workflow, the same process can be applied to any batch of transactions with any dependency structure. At its core, Block-STM is simple, and that simplicity, in conjunction with specific properties of blockchains (fixed ordering of transactions, block-level commits, and the safety of blockchain VMs), allows Block-STM to achieve relatively high throughput without enforcing the declaration of
Starting point is 00:30:51 read-write dependencies upfront. The Block-STM paper contains formal proofs of liveness and safety, and an even more detailed breakdown of Block-STM (the full algorithm, with explanatory comments) can be found in Appendix A3. Lastly, an implementation of the TPU in Rust can be found here. Next, we'll look at how Block-STM performs in practice. Block-STM performance. To evaluate Block-STM's performance, we forked Aptos Core and slightly modified the already existing benchmarks. The machine we used for testing is Latitude's M4 Metal Large, which has 384 GB of RAM and an AMD EPYC 9254 CPU with 24 physical cores. The tests evaluate the execution process from the block being fetched to the completion of execution and validation of all transactions. The transactions used during this evaluation are simple P2P transfers: a simple send between two accounts. The performance metric evaluated was throughput, and the independent variable used for
Starting point is 00:31:57 presenting the data is thread count. The parameters used for evaluating performance are block size and number of accounts, a proxy for contention. Highly contentious workloads, like two accounts, proxy the type of traffic you'd expect during an NFT mint or when a particular market is hot. Two is the most extreme possibility, as it means all the transactions in the block are between two accounts. Non-contentious workloads like 10,000 accounts proxy the type of traffic you'd expect if the blockchain was used for P2P transactions, i.e., simple sends between many individuals. The other loads are proxies for everything in between. Figures 5 to 8 show the performance of Block-STM relative to the number of threads with different levels of contention and block sizes.
Starting point is 00:32:42 From the data, it is evident that, as expected, Block-STM performs better in low-contention scenarios. However, throughput is only slightly better than sequential execution in high-contention scenarios (fewer than 10 accounts). And in completely sequential situations (2 accounts), Block-STM's performance is significantly worse than sequential execution, which is also to be expected. Figure 6 provides more context for performance at 1k block size, and the data suggests that throughput peaks at 12 threads with non-contentious workloads. In high-contention scenarios, performance trends slightly downward as the number of cores increases, plateauing around 8 physical cores. In very high contention scenarios,
Starting point is 00:33:30 performance peaks at four cores, the lower limit of our tests, suggesting that performance would be better with even fewer cores. This behavior is understandable since more threads in a highly contentious environment increase the chance of the account being concurrently written to and the chance of a validation failure. Figures 7 and 8 provide more insight on how performance scales relative to block sizes, and the data is quite similar to the 1k block size data, although with a larger block size, performance grows more linearly. For the sake of brevity, we left out additional data, but the data collected implies that Block-STM performs better with larger blocks. The improvement in performance with larger blocks is likely due to the amortization of the initial failures. As the runs get longer, more and more transactions are executed
Starting point is 00:34:10 earlier than their predecessors, amortizing the initial costs. We also found that there are no significant improvements beyond a 20k block size and that reduction in contention beyond a certain point does not improve performance. A good indicator of this is that there are infinitesimal performance differences between 1,000 accounts and 10,000 accounts. In essence, if thread count >> contention, then performance peaks. It's important to note that the performance figures for thread counts 24 and 32 are not reliable measures of Block-STM's performance at those thread counts. As mentioned earlier, the machine used for the tests has only 24 physical cores. To expose 48 logical cores, the CPU utilizes hyper-threading, which could be responsible for this behavior. The Block-STM paper contains the
Starting point is 00:34:58 results of more tests, including one where Block-STM is compared to, and slightly outperforms, an implementation of BOHM. This is surprising considering BOHM has complete read and write sets of all transactions beforehand, but the outperformance is likely due to the overhead of building the multi-version data structure in BOHM. Overall, the data gives a reliable indication of Block-STM's limits, and it's not a stretch to say that in low-contention situations, Block-STM massively outperforms any sequentially executed runtime and also scales almost linearly, making it suitable for large-scale applications like rollups. That concludes the analysis of Block-STM; next we'll consider
Starting point is 00:35:36 the leading PCC execution runtime, Sealevel, a marketing term for Solana's parallel execution TPU. Sealevel. Sealevel is completely antagonistic in design to Block-STM. Sealevel uses lock-based synchronization for concurrency control, but most notably, it requires that all transactions declare up front what portion of state they'll be reading from and/or writing to during execution. The Solana execution pipeline is further optimized by the use of single instruction, multiple data (SIMD) instructions. SIMD instructions are a parallel, not concurrent, processing technique that performs a single operation over multiple data points.
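Outside the blockchain context, here is what "one operation over multiple data points" looks like in practice. This is a generic x86 SSE2 illustration, not Solana or Agave code:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_add_epi32, _mm_loadu_si128, _mm_storeu_si128};

// Scalar path: four independent additions, one element at a time.
fn add_scalar(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

// SIMD path: a single instruction adds all four lanes at once (SSE2 is baseline on x86_64).
#[cfg(target_arch = "x86_64")]
fn add_simd(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let mut out = [0i32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_add_epi32(va, vb));
        out
    }
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [10, 20, 30, 40]);
    println!("scalar: {:?}", add_scalar(a, b));
    #[cfg(target_arch = "x86_64")]
    println!("simd:   {:?}", add_simd(a, b));
}
```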
Starting point is 00:36:17 Discussions about runtime optimizations are beyond the scope of this paper, but the idea is that transaction instructions can be sorted based on what programs they call, and all instructions that call the same instruction within a program can be batched and executed in parallel with SIMD instructions, again, not concurrently. A quote from Solana founder Anatoly Yakovenko might provide some more insight: "So, if the incoming transactions that are loaded by Sealevel all call the same program instructions, Solana can execute all the transactions concurrently in parallel over all the available CUDA cores." In short, Sealevel is designed for maximum speed. The following section will cover how Sealevel works and its performance. But first, a few primers,
Starting point is 00:37:01 starting with a brief description of the lifecycle and structure of a Solana transaction. The structure and lifecycle of a Solana transaction. Transaction structure. As mentioned earlier, Solana transactions must declare up front what portions of state they will access. They do this by listing the accounts that they will use during execution and specifying if they will read from or write to these accounts. In a basic Solana transaction, the accounts list within the instructions list of the transaction is the main focus here. This list contains all the accounts an instruction will use during its execution. The list is populated based on the struct. The struct has a list of accounts with three fields: the public key, an ed25519 key that identifies the account or the program being invoked; a bool that determines
Starting point is 00:37:53 if the account will sign the message; and a bool that marks if the account will be written during execution. A sample of such a struct, named keys, is shown below. The key takeaway here is that from the struct, the TPU can identify what accounts will be written by a transaction and use that information to schedule non-conflicting transactions.
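The description matches Solana's AccountMeta type. Below is a hedged reconstruction of what such a keys list might look like; the specific accounts and permissions are made up for illustration:

```rust
use solana_sdk::instruction::AccountMeta;
use solana_sdk::pubkey::Pubkey;

fn main() {
    // Illustrative accounts only.
    let payer = Pubkey::new_unique();
    let recipient = Pubkey::new_unique();

    // The fee payer signs and its balance changes, so it is signer + writable;
    // the recipient does not sign but is credited, so it is writable only.
    let keys = vec![
        AccountMeta { pubkey: payer, is_signer: true, is_writable: true },
        AccountMeta { pubkey: recipient, is_signer: false, is_writable: true },
    ];

    for meta in &keys {
        println!("{} signer={} writable={}", meta.pubkey, meta.is_signer, meta.is_writable);
    }
}
```

From a list like this, the scheduler can tell exactly which accounts a transaction will lock for writing before it ever executes.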
Starting point is 00:38:30 Lifecycle of a transaction. The lifecycle of a Solana transaction is largely the same as that of most other blockchains, with some differences, the primary one being that Solana does not have a mempool. Let's briefly go over how transactions go from creation to commit. After a transaction is created, it is serialized, signed, and forwarded to an RPC node. When an RPC node receives the packets, they go through SigVerify, which checks that the signature matches the message. After SigVerify, the packets enter the banking stage, which is where they are either processed or forwarded. On Solana, nodes are aware of the leader schedule, a pre-generated roster that dictates which validators will be leaders for which slots. A leader is assigned a set of four consecutive slots. Based on this information, nodes that are not the leader forward transactions directly to the current leader and the next two (in the Agave implementation) leaders. Note that since this is not a consensus-breaking change, some nodes forward
Starting point is 00:39:11 transactions to more or fewer leaders. When the leader receives the packets from other nodes, it runs them through SigVerify and sends the packets to the banking stage. In the banking stage, transactions are deserialized, scheduled, and executed. The banking stage also communicates with the proof-of-history component to timestamp batches of transactions. More on this later. By design, Solana blocks are continuously built and forwarded in small portions called shreds. When other nodes receive the shreds, they replay them, and when all the shreds that make up a block have been received, validators compare the execution results to the one proposed by the leader.
Starting point is 00:39:50 Then validators sign a message that contains the hash of the block they're voting for and send out their votes. Votes can be disseminated as transactions or through gossip, but Solana favors treating votes as transactions in a bid to speed up consensus. When a quorum of validators is reached on a fork, the block at the tip of that fork is confirmed. When a block has 31+ confirmed blocks built on it, the block is rooted and practically impossible to reorg. The major difference between the lifecycle of a Solana transaction and that of an Aptos
Starting point is 00:40:20 transaction is that in Aptos, transactions are not executed before nodes receive the ordering of the block. Solana and Monad are moving towards an advanced form of this type of execution called asynchronous execution, with the final destination being stateless validator nodes, but that discussion is outside the scope of this paper. Before we discuss the meat of transaction execution, it's helpful to do a refresher on proof-of-history concepts, since PoH is heavily intertwined with the banking stage process. Proof-of-history and the banking stage. Solana's proof-of-history is a decentralized clock. PoH's working principle is based on recursive SHA-256 hashes; it uses the hashing cycle to proxy the passage of time. The basic idea is that a SHA-256 hash (1) is pre-image resistant, i.e., the only way to compute a hash h of a message m is to apply the hash function
Starting point is 00:41:15 h to m, and (2) takes exactly the same amount of time on any high-performance computer. Because of these two properties, nodes recursively performing SHA-256 hashes can agree that an amount of time has passed based on the number of hashes computed. Additionally, verification is highly parallelizable because each hash can be checked against the next one without relying on the results of the previous hashes. Because of these properties, PoH allows Solana nodes to agree on the passage of time without the need for synchronization or communication. The PoH hash calculation is sketched below: an arbitrary piece of information (in this context, transaction hashes) is appended to the previous hash to assert that the event represented by that data occurred before the hash was computed, essentially timestamping the event.
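The original snippet doesn't survive transcription, so here is a minimal, illustrative sketch of the recursive hash chain just described; it is not the Agave implementation, and the mixin value is made up:

```rust
use sha2::{Digest, Sha256};

// One step of the PoH chain: hash the previous hash, optionally mixing in data
// (e.g., a hash of a transaction batch) to timestamp it.
fn next_hash(prev_hash: &[u8; 32], mixin: Option<&[u8]>) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(prev_hash);
    if let Some(data) = mixin {
        // Mixing data in proves the data existed before this hash was produced.
        hasher.update(data);
    }
    let digest = hasher.finalize();
    let mut out = [0u8; 32];
    out.copy_from_slice(&digest);
    out
}

fn main() {
    let mut hash = [0u8; 32];
    // Plain ticks: keep hashing with no mixin to mark the passage of time.
    for _ in 0..3 {
        hash = next_hash(&hash, None);
    }
    // A transaction entry: mix the batch's hash in, timestamping the batch.
    hash = next_hash(&hash, Some(b"hash of a transaction batch"));
    println!("{:02x?}", hash);
}
```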
Starting point is 00:42:00 The other PoH-specific concepts relevant to understanding the banking stage are: 1. Ticks. A tick is a measure of time defined by x (currently 12,500) hashes. The tick hash is the 12,500th hash in the chain. 2. Entries. An entry is a timestamped batch of transactions. Entries are called entries because they're how Solana transactions are committed to the ledger. An entry is composed of three components: the number of hashes performed since the previous entry, the hash that results from hashing the previous entry's hash that many times, and the transactions themselves. There are two types of entries, tick entries and transaction entries. A tick entry contains no transactions
Starting point is 00:42:45 and is made at every tick. Transaction entries contain a batch of non-conflicting transactions. Entry constraints. There are quite a number of rules that determine if a block is valid, even if all the transactions it contains are valid. You can find some of those rules here, but the rule relevant to this report is the entry constraint. This rule dictates that all the transactions within an entry must be non-conflicting, and if an entry contains conflicting transactions, the entire block is invalid and rejected by validators. By enforcing that all the transactions in an entry be non-conflicting, validators can replay all the transactions within an entry in parallel without the overhead of scheduling.
Starting point is 00:43:31 However, SIMD-0083 proposes the removal of this constraint, as it constrains block production and prevents asynchronous execution on Solana. Andrew Fitzgerald discusses this constraint and a few others in his post on what he thinks are the next steps Solana needs to take on its journey to asynchronous execution. To be clear, this constraint does not completely dictate how transactions are scheduled, because executed transactions don't have to be included within the next entry, but it is an important consideration for current scheduler designs. With all that out of the way, we can discuss the meat of transaction execution, Solana's banking stage. The banking stage. The banking stage module houses Solana's TPU and a lot of other logic for processing transactions. The focus of this report is on the scheduler, but
Starting point is 00:44:11 a brief overview of the banking stage will be discussed to provide some necessary context. Overview of the banking stage. The banking stage sits between the SigVerify module and the broadcast stage, with the PoH module running in parallel to it. As discussed earlier, SigVerify is where transaction signatures are verified. The broadcast stage is where processed transaction data is disseminated, via Turbine, to other nodes on the network, and the tpu_forwarding module is responsible for disseminating sanitized transaction packets to leader nodes. In the banking stage, transaction packets from SigVerify are buffered and received by the appropriate channels. Voting transactions are received at the end of two channels, the tpu_vote_receiver and the
Starting point is 00:44:56 gossip_vote_receiver, while non-votes are received by the non_vote_receiver. After buffering, the packets are forwarded or consumed, depending on the leader schedule. If the node is not the leader or due to be leader shortly, the sanitized packets are forwarded to the appropriate nodes. When the node is leader, it consumes the transaction packets, i.e., the packets are deserialized, scheduled, and executed. Scheduling is the main focus of this paper and it will be expanded on later. The execution stage is relatively straightforward. After a batch of transactions is scheduled, the TPU will: Run checks. The worker thread checks that the transaction
Starting point is 00:45:36 hasn't expired, by checking that the block hash it references is not too old and that it hasn't been included in a previous block. Grab locks. The worker thread attempts to acquire the appropriate read and write locks for all the transactions in the batch. If a lock grab fails, retry the transaction later. Load accounts and verify that signers can pay fees. The thread checks that the programs being loaded are valid, loads the accounts necessary for execution, and checks that the signer can pay the fees specified in the transaction. Execute. The worker threads create VM instances and execute the transactions. Record. The execution results are sent to the PoH thread, which generates an entry; entries are also sent to the broadcast stage during this step. Commit. If recording succeeds, the results are committed, updating the state. Unlock. Remove locks from accounts.
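For intuition on the lock-grab step, here is a toy model of per-account read/write locks; the real logic lives in Agave's accounts code, and these names and structures are illustrative only:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct AccountLocks {
    write_locked: HashSet<String>,
    read_locked: HashMap<String, usize>, // account -> number of readers
}

impl AccountLocks {
    // Try to take every lock a transaction declared. On any conflict, take nothing
    // and let the caller retry the transaction in a later batch.
    fn try_lock(&mut self, writes: &[String], reads: &[String]) -> bool {
        let write_conflict = writes
            .iter()
            .any(|a| self.write_locked.contains(a) || self.read_locked.contains_key(a));
        let read_conflict = reads.iter().any(|a| self.write_locked.contains(a));
        if write_conflict || read_conflict {
            return false; // the lock grab failed: retry later
        }
        for a in writes {
            self.write_locked.insert(a.clone());
        }
        for a in reads {
            *self.read_locked.entry(a.clone()).or_insert(0) += 1;
        }
        true
    }

    // Release the locks after execution and commit.
    fn unlock(&mut self, writes: &[String], reads: &[String]) {
        for a in writes {
            self.write_locked.remove(a);
        }
        for a in reads {
            let now_zero = match self.read_locked.get_mut(a) {
                Some(n) => { *n -= 1; *n == 0 }
                None => false,
            };
            if now_zero {
                self.read_locked.remove(a);
            }
        }
    }
}
```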
Starting point is 00:46:22 That's a complete overview of the banking stage. Next, we'll dive deep into how scheduling works in the Solana TPU. This portion has had many significant changes, but for the sake of brevity, we'll discuss only the two most recent implementations, the thread-local multi-iterator and the central scheduler. Thread-local multi-iterator. The thread-local multi-iterator (TLMI) implementation consumes transaction packets sent into the channel mentioned earlier. Each of the non-vote threads pulls transactions from the shared channel, sorts them based on priority, and stores the ordered transactions in a local buffer.
Starting point is 00:47:04 The local buffers in the TLMI are double-ended priority queues, to allow for adding new high-priority transactions while removing the low-priority ones with minimal time complexity. Each thread then marches a multi-iterator through its buffer to create a batch of 128 non-conflicting transactions. A multi-iterator is a programming pattern that runs multiple iterator objects through an array. The iterators select items from the array based on a decision function. The multi-iterator concept can be abstract, so instead of explaining it, here is an example that shows how Solana's TLMI works. Imagine a set of 10 transactions with the dependency graph below. Assuming the TLMI created batches of 4, with this set of transactions, the TLMI would select TX1 and then the next transactions that don't conflict with it for the first batch,
Starting point is 00:47:55 while the transactions that conflict with already selected ones would be skipped. The next iterator would pick up some of the skipped transactions, and the third iterator the remainder. Each iterator would follow this process to create a batch of non-conflicting transactions, and because of the use of locks during execution, transactions are guaranteed to execute safely regardless of what other threads are doing. However, this design led to a lot of problems. The problems with the TLMI implementation. The major problem with the TLMI approach is that each thread is isolated from the others. Because each of the worker threads independently pulled transactions from the shared channel, the distribution of transactions between threads was roughly the same. So even if the
Starting point is 00:48:35 threads create a batch of transactions with no intra-thread conflicts, there could still, and likely would, be inter-thread conflicts, with the problem becoming worse as contention increases. In addition, because of the TLMI's design, there is a high tendency for priority inversion: since two conflicting transactions cannot be in the same batch, a high-priority transaction that happens to conflict with an even higher-priority transaction will not be scheduled until the next batch at the earliest, while lower-priority transactions will be. These problems could be approached by reducing the batch size from 128, but that would create bottlenecks elsewhere, like increased context switching by threads. Instead, the scheduling algorithm was redesigned completely, leading to the central scheduler implementation.
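Before moving on, here is a toy sketch of the greedy batch-building idea behind the multi-iterator described above: walk the priority-ordered buffer and take any transaction whose declared accounts don't conflict with what's already in the batch. The types are illustrative, not Agave code, and the real batch size is 128:

```rust
use std::collections::HashSet;

#[derive(Clone)]
struct Tx {
    id: u64,
    writes: Vec<&'static str>, // accounts declared as writable
    reads: Vec<&'static str>,  // accounts declared as read-only
}

fn build_batch(buffer: &[Tx], max_batch: usize) -> Vec<Tx> {
    let mut write_locked: HashSet<&str> = HashSet::new();
    let mut read_locked: HashSet<&str> = HashSet::new();
    let mut batch = Vec::new();

    for tx in buffer {
        // Conflict if tx writes something already used, or reads something already written.
        let conflicts = tx
            .writes
            .iter()
            .any(|a| write_locked.contains(a) || read_locked.contains(a))
            || tx.reads.iter().any(|a| write_locked.contains(a));
        if conflicts {
            continue; // skipped; a later iterator pass picks it up for another batch
        }
        write_locked.extend(tx.writes.iter().copied());
        read_locked.extend(tx.reads.iter().copied());
        batch.push(tx.clone());
        if batch.len() == max_batch {
            break;
        }
    }
    batch
}
```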
Starting point is 00:49:19 The central scheduler. The Agave 1.18 client addresses the central problem of Solana's banking stage by introducing a central scheduler, very similar in spirit to Block-STM's collaborative scheduler. The central scheduler uses a new thread, in addition to the previous six, that performs the task of global scheduling. The way it works is relatively straightforward. Just like before, voting transactions are handled by voting threads, but now all non-vote transactions go to a dynamically updated buffer. Since this buffer is fed directly from SigVerify, it has a global view of priority, as opposed to the local buffers in the TLMI design. The scheduler checks that the transaction can pay the transaction fees and deducts them from the fee-paying account. Transactions that pass the checks are inserted into the prio-graph.
Starting point is 00:50:08 The prio-graph is created by checking for dependencies between a transaction and the next-highest-priority transaction per account, i.e., the graph builder checks for conflicts on every single account that the transaction touches. This is implemented by tracking in a hash map what transactions last touched what accounts. This allows the graph builder to quickly identify dependencies. Once the prio-graph has been completely built, transactions go through a pre_lock_filter. The pre_lock_filter is currently unimplemented, so it does nothing. But the ideal logic flow is to allow transactions that pass the pre_lock_filter to be scheduled one by one on the appropriate threads. A small optimization
Starting point is 00:50:49 is that if a transaction does not conflict with any thread or any other transaction that could necessitate executing a chain in sequence, the transaction is assigned to the thread with the least amount of work in terms of CUs. Transactions are scheduled until the prio-graph is empty or every thread either reaches the max compute units assigned to it (currently 12 million CUs) or has a batch of 64 transactions assigned to it. If any of these conditions is met, the worker threads begin executing the batch. When a thread finishes, it communicates with the central scheduler thread and more work is scheduled for it. This process is repeated until the node is no longer leader, at which point the banking stage is said to be over.
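The "track the last transaction to touch each account" idea behind the prio-graph can be sketched in a few lines; this is an illustrative simplification that ignores the read/write distinction and is not the Agave prio-graph implementation:

```rust
use std::collections::HashMap;

type TxId = u64;
type Account = &'static str;

#[derive(Default)]
struct GraphBuilder {
    last_toucher: HashMap<Account, TxId>, // account -> most recent transaction to touch it
    edges: HashMap<TxId, Vec<TxId>>,      // tx -> higher-priority transactions it must wait for
}

impl GraphBuilder {
    // Transactions are inserted in priority order, highest priority first.
    fn insert(&mut self, tx: TxId, accounts: &[Account]) {
        for account in accounts {
            if let Some(&prev) = self.last_toucher.get(account) {
                // tx conflicts with the most recent higher-priority user of this account.
                self.edges.entry(tx).or_default().push(prev);
            }
            self.last_toucher.insert(account, tx);
        }
    }
}
```

A transaction with no edges can be handed to any worker thread immediately; one with edges has to wait for the transactions it points to.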
Starting point is 00:51:28 I have left out the details of committing and forwarding as they're not relevant to this discourse, but the core idea has already been discussed. That is a complete overview of how the central scheduler approaches parallelization. It is, in some aspects, a significant improvement over previous Sealevel iterations, as it prevents intra- and inter-thread conflicts. In the CS, ceteris paribus, there should never be a failed lock grab, which was the main cause of nonlinear performance degradation with the other iterations. In addition, since there are no more inter-thread conflicts, the central scheduler allows the use of as many non-vote threads as are available for execution. The biggest trade-off of the central scheduler is the overhead incurred
Starting point is 00:52:10 during scheduling. Compared to Block-STM or even the TLMI, it spends a decent amount of time scheduling transactions and has less to spend on execution. Another trade-off is that unlike the TLMI, which had all the available threads deserializing and parsing transactions, only one thread, the CS, is responsible for these tasks now. That wraps up the discussion of the design of the CS and, by extension, Solana's TPU. Next, we'll examine their performance. Sealevel performance. There are quite a number of Solana benchmarks in the wild, but for the purpose of this report,
Starting point is 00:52:50 we modified the Agave client codebase and ran the banking bench tests. Links to the repo as used and instructions to reproduce the results can be found in the appendix. The tests were run on the exact same machine used to evaluate BlockSTM. Latitudes M4 Metal Large, which has 384GB RAM and an AMD E-PYC 9254 CPU with 24 physical and 48 logical cores. The tests are a the exact same as those used to evaluate BlockSTM as well. The results are shown below. TLMI The data for the TLMI runs are shown in figures 18 to 21 below. The TLMI has a maximum throughput of 137 kTPS observed on 8 threads at a block size of 10k and account contention of 10k. Essentially embarrassingly parallel. The TLMI also performs relatively well in very contentious scenarios, processing over 15 kTPSFOR completely sequential workloads. NNTHE performance trend of the TLMI is relatively less steep,
Starting point is 00:53:50 it reaches about half of its maximum throughput with an account contention of 10, i.e. The performance in contentious and non-contentious situations is similar. This implies that the TLMI will not experience significant performance degradation in contentious situations. The data contains a few surprising facts, one of which is that with a block size of 1k, figures 19 and 20, performance peaks with 4 threads regardless of contention. With a block size of 10k, figures 21 and 22, performance consistently peaked at 8 threads and even slightly degraded as core count increased. This behavior is understandable for thread counts 24 and 32, as the machine used during testing only
Starting point is 00:54:31 has 24 physical cores and there are some other processes running in the background, e.g. Poe, but the behavior is unexpected for thread counts 12 and 16 and seems uncorrelated to contention. There's not enough data to assert a trend but it would seem the TLMI's peak throughput is correlated to block size. This suggests that there's still room for growth in regards to optimizing the performance of the TLMI. But overall, the TLMI is a highly performant TPU that performs well in both contentious and non-contentious environments. Central scheduler the decision was made to exclude the performance of the central scheduler because it was found to
Starting point is 00:55:09 be inconsistent. Our tests saw it peak around 107 KTP SWIFT 10K account contention and as high as 70 KTPS on a completely sequential workload. In another implementation of the tests, we get 30 KTPS on completely sequential workloads and approximately 90 kTPS with an embarrassingly parallel workload. These results are consistent. Within, the tests and rerunning the tests will produce the same results. But the inconsistency across tests suggests that there are still bugs in the implementation that need attention. Because of that, we decided to leave the evaluation and presentation of the results to a future study. A few conclusions that we can draw regardless are that relative
Starting point is 00:55:50 to the TLMI, the performance will be better in highly contentious scenarios and, slightly, we're saying low-contention scenarios. With C-level thoroughly discussed, we can move on to the highlight of the report. Block STM vs C-Level, Fee Markets. Appropriately pricing block space is a difficult challenge to solve. To put it more eloquently, greater than, one of the most challenging issues in blockchain protocol design is how to greater than limit and price the submission of transactions that get included into the greater than chain. Vitalik Buterin a proper discussion on fee markets would fill up its own paper. But on the subject of comparing BlockSTM and the C-Level it is relevant to mention that due to the
Starting point is 00:56:30 designs of both TPUs, the fee market structures are completely different. Specifically, C-Level has pseudo-local fee markets, while BlockSTM's fee markets are global. In local fee markets, contention for a portion of state does not affect other portions of state, while in global fee markets, the cost of block space depends on general demand for block space, irrespective of what portion of state a transaction is accessing. Local fee markets are clearly more ideal from every point of view. For users, the UX is superior because users are not forced to compete with transactions accessing portions of state they are not accessing. For validators in the network, ceteris paribus, local fee markets make for more efficient use of block space as there will be fewer failed transactions.
Starting point is 00:57:15 In short, local fee markets are superior but they are very hard to implement. I'll explain next the fee market structure of each TPU. Block STM fee markets as mentioned above, Block STM's fee markets are global. Ironically, this is a consequence of the design choice that is responsible for Block STM's performance boost, the predefined ordering of transactions. As discussed when analyzing Block STM's design, when Block STM wants to execute a block of transactions, it pulls an already ordered set of transactions from its mempool or quorum store in the new design and executes the block. Because
Starting point is 00:57:51 the block had already been constructed prior to execution and inclusion in the block was based on gas fees, securing inclusion in the block is solely dependent on gas fees. For example, in the event of a highly contested NFT mint where the highest priority transactions are competing for the same account, most, if not all, the transactions packed into the block will compete for the same state. This will significantly degrade the performance of block STM, since it performs poorly in high contention situations. Prevent the inclusion of non-conflicting transactions and unnecessarily drive up the price of block space, as seen on networks with single-threaded TPUs like Ethereum and other EVM-compatible chains. We're more focused on the latter point in this section and what it implies,
Starting point is 00:58:35 that all users on block STM networks will be forced to compete with each other for block space. Considering historical data on networks like Ethereum, BlockSpace could become unnecessarily expensive during times of high activity. And during sustained activity, fees will be higher for all users. This is a problem, as it suggests that BlockSpace on BSTM networks must be cheaper and more abundant than in a C-level network for users to pay similar amounts in fees. C-level fee markets All iterations of C-level have always had some form of fee market locality but the fee markets were barely functional for the reasons
Starting point is 00:59:10 discussed above, local view on priority. Post-central scheduler, C-level's fee markets are functional and pseudo-local. Transactions bidding for inclusion and priority only have Tobit against other transactions contesting the accounts they want to access. For example, if transactions through are all bidding for the same portion of state, doesn't have to bid against all the previous transactions. In the current scheduler design, it will be scheduled second, while transactions through are all queued behind. This is a very simple example but it holds true for any batch of transactions with any type of transaction dependency. It's important to understand that there is and will always be some global pricing of blockbase because of block size limits, currently 48 million compute units, cus.
Starting point is 00:59:55 Transactions that don't conflict with other transactions in the block may never make it into the block simply because their absolute priority is not high enough. But once a transaction can pay some minimum inclusion fee, fee markets are local. Solana further improves its fee market locality by constraining the maximum number of CUS that can be taken up by a single account in a block. The current implementation is 12 million CUS. This ensures that transactions competing for highly contested accounts cannot prevent the inclusion of other transactions, even if the former set of transactions, for some reason, executes much faster. The data collected during this study suggests that there might be benefits to lowering the limits relative to block space but that's a discussion for another day.
Starting point is 01:00:39 To summarize, C-level implicitly enables local fee markets and locality is reinforced by the account limits. Next we'll move on to actual throughput from both TPUs. Block STM vs C-level. Performance. With all the necessary background discussed, the time has come to answer the question of which TPU performs better. Figures 24 and 25 show the performance of both TPUs for different at block sizes 1K and 10K. The results show that C-Level is significantly more performant across the board. For the sake of reading convenience, only the data for 2, 10, 50, and 10,000 accounts were shown, as they suffice to
Starting point is 01:01:17 represent most of the landscape. Figure 24 shows performance at a block size of 1K and C-Level completely outperforms block STM. With a completely sequential workload, there's a roughly 7x difference in performance and with a highly parallelizable workload, 10 CACs, there's a 2.4x difference in performance. With a block size of 10k, C-level has a 45%, 53, and 118% performance boost over BlockSTM, with account contention 100, 1K, and 10K respectively. These results are a better than the 1K tests and are attributable to the reasons discussed earlier. The results are surprising, as years of research on concurrency control have concluded that OCC is highly suited for low contention workloads and PCC for high contention
Starting point is 01:02:05 workloads, too. In isolation, BlockSTM conforms to this standard, with performance improving significantly as contention is reduced. However, regardless of contention, when compared to sea level, BlockSTM falls behind significantly. We investigated this and spoke to members of Aptos Labs and found some potential explanations we provide some context and our findings below during testing we had benchmark block stm with a much earlier iteration of the move vm and the observed throughput is more than double the results presented we got throughput as high as 193 ktps with block size. 10k. Account contention. 10k. And overall, BlockSTM was much more competitive in contentious scenario sand outperformed sea level in non-contentious situations.
Starting point is 01:02:54 While the performance of BlockSTM from those tests is legitimate, those numbers cannot be used for a fair comparison as the execution VM and parts of the underlying program used to obtain them were not production ready. After discussing this with the Aptos Labs team, we understood that this degradation occurred because of the additional checks added to the move VM. This suggests that BlockSTM's true potential might be handicapped by the complexity of the move VM. We say MITA's benchmarking BlockSTM with DM move produces similar results as the ones presented, suggesting that the presented results might be the true limit of BlockSTM's ability with a production-ready VM. Another factor that can explain the performance gap is that C-level
Starting point is 01:03:35 executes transactions in batches as opposed to BlockSTM's ad hoc execution. The increased context switching from ad hoc execution likely adds some overhead. As to the question of what TPU will perform better in real-world situations, it is true that it's impossible to describe real-world contention with a single number. But BlockSTM is too far behind, regardless of contention. However, just to be thorough, we'll briefly attempt to estimate real-world contention by looking at historical blockchain data and how contentious it is. Visualization of blockchain contention Contention can be visualized using the same preo graphs that the central scheduler uses to identify transaction dependencies. The preo graphs shown when describing the
Starting point is 01:04:18 central scheduler were fairly simple. Real priority graphs, shown in figures 26 to 33 are much, much more complex. The following preo graphs are preo graphs for both Ethereum and Solana blocks, as these are the only two networks with enough data to accurately model contention. Figure 26 shows the preo graph for Solana slot 229,666,043. The graph has 89 components, distinct subgraphs of transactions. Most of these components have simple structures, with many having zero dependent transactions or simple sequential dependencies. However, a large number of transactions are part of very complex trees. Zoom in to see dependencies, keeping in mind that the priority of a transaction is indicated by the color of the node. Also remember that the further out a transaction is from the center of a component,
Starting point is 01:05:10 the fewer dependencies it has. Nodes in the center of a component depend on all the transactions outward from them, in all directions. The data above show that blockchain transactions are very contentious. A fairly large set of accounts have very low contention but the vast majority of transactions are in straight, long branches of very large trees. The straight, long branches suggest that the transactions contend over the same accounts and the size of the trees relative to the smaller clusters suggests that most transactions are contentious. But most interestingly, the number of branches suggests that many accounts are contentious, not just a small set. More pre-o-graphs tell the same story. In summary,
Starting point is 01:05:51 state access on blockchains is very contentious. The more nuanced conclusions that can be drawn from looking at the data are 1. Most high-priority transactions are for highly contested portions of state. 2. State access is contentious but also highly parallelizable. There are many hotspots of varying temperature as opposed to one large hostpot. 3. A reasonably large set of transactions do not contend with any others. This set is small relative to the set of transactions that do as small as 10% but not infinitesimal. Backslash dot. So what does this mean for BlockSTM and C-Level? Well, it strengthens the argument that C-Level will perform better in
Starting point is 01:06:32 real-world scenarios. The extent of better is a question that can only be answered by more involved simulations but the current data strongly indicates that C-Level will perform significantly better in real-world use. All said, it's important to note that because of BlockSTM's primitive dependency identification and other optimizations 2, 3, BlockSTM will perform better than traditional OCC TPUs. But its performance will likely never match that of C-level. The case for OCC and BlockSTM. As we approach the end of this paper, it's easy to walk away with the impression that OCC, and, by extension, BlockSTM, is pointless, years of research and test data suggest that it's practically guaranteed to be slower than PCC in an environment like blockchains,
Starting point is 01:07:17 where state access is fairly contentious. In addition, BlockSTM seems like a step backwards in the context of transaction fee markets, TFMs, as it lacks one of the major benefits of concurrent execution, local fee markets. However, BlockSTM has made many improvements to the traditional OCC design and will perform much better than previous OCC TPUs. In addition, there are three more benefits of BlockSTM's design that are worth noting. Wider range of supported applications The primary advantage of an OCC TPU is that it allows for arbitrary transaction logic. Since PCC TPUs require that transactions declare up
Starting point is 01:07:57 front the portions of memory they will access, it is impossible to write transactional logic where a transaction decides to read or write certain memory locations based on information discovered during execution. A good example is on chain order book design. Order books on PCC TPUs usually cannot offer atomic settlement or permissionless market making and limit orders since transactions must specify up front what accounts will be accessed during the transaction. To work around this, most order books on Solana required the aid of an additional entity, the cranker, to finalize limit orders. Phoenix on Solana has managed to overcome this by holding all the balances in one account but this approach faces its own struggles. Portability The final argument for BlockSTM is that it can integrate with any blockchain without breaking compatibility with existing transactions. There have been attempts to have transactions optionally specify up front
Starting point is 01:08:50 what portion of state they'll be accessing on Ethereum in a bid to increase the efficiency of the EVM but they never saw the light of day. Because of BlockSTM's nature, it can be, permissionlessly, integrated into any blockchain without breaking consensus. Even Solana clients can implement BlockSTM with modifications to ensure that the blocks meet all the constraints without a formal proposal. It is specifically for this reason that Monad and all other parallel EVMs use BlockSTM to achieve functional parallelization of the EVM. For these reasons, arbitrary transaction logic, improved developer experience,
Starting point is 01:09:29 and portability, OCC execution engines are at least worth exploring. Conclusion Frequency scaling, work stealing, terminology, Solana, what do, state, and, state change, mean in blockchain? Threads and locks, PDF Deferred execution, monad, paper references, Maurice Herlihy and J. Elliot B. Moss. Transactional Memory. Architectural Support for Lock-Free Data Structures. Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA, 93. Volume 21, Issue 2, May 1993. Daniel A. Manaske, Tatu Nakanishi.
Starting point is 01:10:07 Optimistic vs. Pessimistic Concurrency Control Mechanisms in Database Management Systems, Information Systems, Volume 7, Issue 1, 1982, pages 13-27. Glossary CPU Architecture This section will provide concrete definitions for computer hardware terms that showed up during the report. A CPU core is essentially a CPU in and of itself. It has all the necessary components to qualify as a full CPU and is capable of executing tasks by itself. Most modern CPU chips have multiple cores that share memory and I.O. hardware. A thread is an abstraction that roughly refers to a series of instructions executed by a CPU core. Multithreading is executing a program, not a transaction, across multiple threads. Hyperthreading allows CPU cores to execute two threads pseudo-simultaneously.
Starting point is 01:10:57 The core shares its resources between two threads in a way that improves parallelism but it's not the 2x performance that's often implied. Parallel vs. Concurrent a way that improves parallelism but it's not the 2x performance that's often implied parallel versus concurrent a parallel program uses multiple cpu cores each core performing a task independently on the other hand concurrency enables a program to deal with multiple tasks even on a single cpu core the core switches between tasks i threads, without necessarily completing each one. A program can have both, neither of or a combination of parallelism and concurrency characteristics. Concurrency focuses on managing multiple tasks efficiently with one resource. Parallelism utilizes multiple resources to execute tasks simultaneously, making processes faster.
Starting point is 01:11:43 Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant. For example, multitasking on a single-core machine. Parallelism is when tasks literally run at the same time, e.g. on a multi-core processor. It should be clear now why transaction processing in blockchains is more analogous to concurrent processing than parallel processing. Other terms instruction. An instruction is the most granular operation that a processor can perform. They can vary in complexity based on the computer's instruction set architecture, but the idea is that an instruction
Starting point is 01:12:22 is guaranteed to be atomic by the hardware design. Bytecode. Bytecode is an assembly code like intermediate representation. In this context, it refers to a VM's instruction set. Assembly language. Assembly language is a low-level language with constructs that map one-to-one with the underlying CPU's instruction set. Locking granularity. Refers to how precise locks are. Imagine a transaction needs to alter an element in a table. A coarse lock could lock the entire table. A more granular lock could lock the row, column of interest. And a very fine lock could lock only the cell of interest. The more granular a lock is, the more concurrent memory access can be. But highly granular locks are more difficult to manage because they require a larger key-value database to manage portions of memory.
Starting point is 01:13:09 Priority inversion is when a lower-priority process red transaction prevents the execution of a higher-priority process. Convoying. A lock convoy occurs when threads of equal priority compete for access to a shared resource. Each time a thread relinquishes access to the resource and pauses or stops the process, there is some overhead incurred. Deadlocks. A phenomenon where two or more processes cannot advance because each one requires resources held by the other. Slots. A slot is Solana's term for block time, currently around 400 milliseconds.
Starting point is 01:13:48 Nodes are allotted four consecutive slots every time they are leader. Epic. An epic in Solana is a distinct block of 432,000 consecutive slots. Not any 432,000 slots, but rather distinct blocks like days of the week. Serialization. Is the process of converting data, data structures and objects to a string of bits that can be stored, transmitted, and reconstructed according to the format. Deserialization is the process of reconstructing serialized data from the string of bits to the original data structures or objects. Buffer. A buffer is a temporary storage element. It's usually used when data cannot or should not be processed at
Starting point is 01:14:25 the same rate at which it is being fed to the next block of a process. Packets. A packet is a fixed size unit of data carried by a packet switch network. Packets are relevant to Solana because each transaction is constrained to the maximum size of an IPv6 packet. Double-ended queue, also called DQ, is an abstract data structure that supports the addition of data to one end and removal from the other. Channel. High-performance I.O. gadget. B-tree. Self-balancing data structure. Hashmap. Is an efficient associative array data structure. Race conditions. Occur when uncontrollable processes can lead to incorrect or inconsistent results. In the context of this paper, race conditions occur in OCC systems.
Starting point is 01:15:13 Appendix A1. Distributed databases Distributed databases are a lot like blockchains, in fact, it's not a stretch to say that they're fundamentally the same thing. A distributed database is a collection of multiple, interconnected databases spread across different physical locations connected via a network. These databases appear as a single database to the user but function on multiple servers. The topic of distributed databases is expansive so I'll, briefly, cover only the areas relevant to this report, data distribution methods and database transactions. Data distribution methods horizontal partitioning. Each site stores a different subset of the rows of a table. For example, a customer table might be divided so that the first x entries are stored on one database and the others are stored elsewhere. Vertical partitioning. Each site stores different
Starting point is 01:16:00 columns of a table. For example, one site might store the customer names and addresses, while another site stores their purchase histories. Replication. Copies of the entire database, or subsets, are stored at multiple sites. This improves data availability and reliability but may introduce consistency challenges. Transactions. Transactions in DBMSs are defined as a single unit of operations that perform a logical action, usually accessing and or modifying the database. For most databases, especially those keeping track of financial records, transactions are required to have ACID properties i.e. The transactions are atomic, all or nothing, if one operation fails, the entire transaction reverts.
Starting point is 01:16:45 Consistent Transactions modify the database in predictable and repeatable ways. Isolated. When multiple transactions are executed concurrently, they execute without affecting each other, as if they were executed sequentially, and durable. Changes made by completed transactions are permanent even in the event of system failure. If it's not already obvious, blockchains are just replicated databases with adversarial operators. So it's not surprising that distributed database research forms the foundation for most blockchain designs today. A2. Narwhal and QuorumStore TLDR The value proposition of Narwhal, and by extension QuorumStore, is that, leader-based consensus, I.E. A consensus system where the leader for a slot is responsible for
Starting point is 01:17:32 transmitting most of the information during that slot, a la Solana, is bottlenecked by the performance of the leader as opposed to the entire validator set. Narwhal, and QuorumStore, resolve this bottleneck by decoupling data dissemination and consensus down to the hardware level. The implementation of this idea spreads the work equally across all validators, as opposed to having the leader alone bear the brunt. There's perhaps no better proof of Narwhal's value than Solana itself. The Solana documentation currently recommends that validators use dedicated 1-gigabit S-lines as the minimum bandwidth setup and ideally a 10-gigabits line. Average-sized validators
Starting point is 01:18:11 report around 1.5 gigabits per second peak traffic. During leader slots, Aptos nodrunners report using 25 megabits per second by comparison. Of course, both chains don't process nearly the same amount of traffic and there's the question of the stake weight of each validator but the big idea is that QuorumStore is more efficient from a leader's point of view than leader-based consensus. Let's run through how it works. In QuorumStore, validators continuously steam batches of ordered transactions, similar to the blocks discussed above, to one another. The batches contain raw transaction data and metadata batch identifiers. Other validators receive these batches, sign the
Starting point is 01:18:51 batches, and transmit the signatures to other validators just like they would a normal executed block in leader-based consensus. When a batch reaches the quorum of stake-weighted signatures, it essentially receives approof of availability as all the validators that sign a batch promise to keep and provide the batch on request until it expires. Because of QuorumStore, the leader doesn't have to propose transaction batches anymore. Instead, the leader simply selects a batch for which it has a proof of availability, sends out the batch metadata, executes the block, and sends out the execution results. Other validators can then map the metadata to the batch and replay the batch for verification. If they don't have the batch, they can request it from other nodes and be sure they'll receive it
Starting point is 01:19:35 since two F plus one nodes promise to store and provide the data. This reduces the messaging overhead of the leader. And because Quorum store is run on a separate machine, it's horizontally scalable by adding more boxes. There is a lot of nuance that my TLDR has left out. You can find more details in the Aptos blog post and Narwhal paper. A3. Block STM Algorithm. This section contains the block STM algorithm from the paper alongside explanatory comments to aid understanding. Asterisk 1 Executing transactions in batches reduces messaging and context switching but comes at the cost of requiring extra scheduling. Asterisk 2 The topic of OCC versus PCC is considered moot in academia. The general sentiment is that OCC is suited for low contention applications and PCC for high-contention applications.
Starting point is 01:20:25 As such there is very little work ongoing in this regard. One of the most recent papers to discuss the subject was written in 1982. The following graph is adapted from the paper and it helps to reaffirm the established sentiment. The 2016 study, Revisiting Optimistic and Pessimistic Concurrency Control by Goetzhe's grief of Hewlett-Packard labs comes to the same conclusion as Manassi and Nakanishi. To quote the author, greater than, we have concluded that optimistic concurrency control permits more greater than concurrency than pessimistic concurrency control only if it fails to detect greater than some actual conflicts or if a particular implementation of locking detects greater than false conflicts.
Starting point is 01:21:05 There are numerous studies that come to the same conclusion but we've left them out for brevity. Asterisk 3 A perfectly fair evaluation of block STM and C-level would require isolation, i.e. removing all other processes like Solana's PoE and Ledger commits, using the same VM for both TPUs and a host of laborious engineering tasks that are simply not worth the effort, especially when preliminary testing suggests that the TPUs follow the established trends. Asterisk 4 and optimization in the implementation of BlockSTM allows aborted transactions to be restarted from the point of conflict. Instead of restarting execution from scratch, the Move VM supports validating the red set of
Starting point is 01:21:45 the transaction's previous incarnation and if valid, continuing execution from the point of conflict. Asterisk 5 A second optimization is when the dependency is resolved before the execution task is created i.e. when line 14 of the algorithm returns false. In the implementation, the VM continues execution from where it paused rather than restarting the execution. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
