The Good Tech Companies - Block-STM vs. Sealevel: A Comparison of Parallel Execution Engines
Episode Date: December 13, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/block-stm-vs-sealevel-a-comparison-of-parallel-execution-engines. Block-STM and Sealevel are... two approaches to parallelizing blockchain execution. This paper will examine and break down how both TPUs approach parallelization. Check more stories related to web3 at: https://hackernoon.com/c/web3. You can also check exclusive content about #ethereum, #blockchain, #parallel-executions, #aptos, #solana, #distributed-ledger-technology, #good-company, #hackernoon-top-story, and more. This story was written by: @2077research. Learn more about this writer by checking @2077research's about page, and for more stories, please visit hackernoon.com. Block-STM and Sealevel are antagonistic approaches to parallelizing blockchain execution. This paper will examine and break down how both TPUs approach parallelization at a relatively low level and evaluate their performance and scalability. It will also provide an unbiased evaluation of the strengths and weaknesses of each TPU.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Block-STM vs. Sealevel: A Comparison of Parallel Execution Engines, by 2077 Research.
Aptos' Block-STM and Solana's Sealevel are antagonistic approaches to parallelizing blockchain execution. Block-STM uses an optimistic concurrency control (OCC) approach, i.e., the transaction processing unit (TPU) optimistically assumes that no transactions executed concurrently will conflict and relies on in-built checks and logic to identify and resolve conflicts. Sealevel, on the other hand, pessimistically assumes that transactions will conflict and relies on lock-based synchronization to prevent conflicts. The pessimistic concurrency control (PCC) approach has historically been more performant in distributed database systems, and for intuitive reasons too: a TPU that schedules transactions in a manner that prevents conflicts should perform better than a TPU that has to resolve conflicts
after the fact. However, thanks to clever design and
engineering, BlockSTM performs surprisingly well, with the added benefits of allowing arbitrary
logic and, by extension, a greater range of use cases and a superior dev-ex. This paper will
examine and break down how both TPUs approach parallelization at a relatively low level and
evaluate their performance and scalability. It will also provide an unbiased evaluation of the strengths and weaknesses of
both TPUs. The article is written with the assumption that the reader is familiar with
blockchain concepts like transactions, blocks, and consensus. Familiarity with distributed
databases and computer architecture will help with grokking some of the concepts but the glossary explains unfamiliar terms and the main body and appendix contain primers on unfamiliar
subjects. Common terms. A few terms will come up often, so let's quickly break down what they mean
in the context of this paper. Concurrency and parallelism. Concurrency refers to multiple
processes using the same resources. Parallelism is multiple processes running completely independently of one another.
Programs can have either, neither, or both of these properties.
Transactions
A transaction is an atomic set of instructions that performs a logical operation.
The instructions referenced here are analogous to low-level computer instructions but they do
far more than a computer instruction. Conflict. Two or more transactions are said to conflict when they access or modify the same portion of state. Specifically, a conflict occurs when at least one transaction tries to write to the contended portion of state. If all the transactions are only reading, then they don't conflict. State. State describes the condition of a thing at an instant in time. In the context of blockchains, state is the set of accounts and their associated data: balances and code. When memory access/modification is mentioned, memory refers to state.
Dependencies. Transaction B is said to be a dependency of transaction A if and only if transaction B conflicts with A and transaction B is of lower priority than A. If B were of higher priority, A would instead be a dependency of B. Lock.
A lock is a mechanism used to prevent concurrent access to a memory location. When a transaction or process wants to access a memory location in lock-based systems,
it attempts to grab the lock for the associated location.
If the location is already locked, the lock grab fails and the transaction must wait.
Locking granularity refers to how fine-grained memory locks are. Imagine a transaction needs to alter an element in a table. A coarse lock could lock the entire table, a more granular lock could lock the row or column of interest, and a very fine lock could lock only the cell of interest. The more granular a lock is, the more concurrent memory access can be. But highly granular locks are more difficult to manage because they require a larger key-value database to manage portions of memory. Serializability of transactions. A set of
transactions executed concurrently is said to be serializable if
there exists a sequential execution of the same set of transactions that produces the same result.
With all of that out of the way, we can get started.
Introduction. A blockchain network is a decentralized, Byzantine fault-tolerant distributed database. The transaction processing unit (TPU) is the component responsible for computing state transitions. It takes transaction data as input and outputs an ordered list of transactions and a succinct representation of the execution results,
usually the block hash. The TPU is usually coupled with, but distinct from, the virtual machine (VM). Blockchain VMs like the EVM, SVM, and MoveVM are high-level-language VMs. That means they convert bytecode (compiled intermediate representations of the high-level languages Solidity, Rust, and Move) to machine-executable code. A blockchain VM is fundamentally the same as the more familiar emulation VMs: it's a sandboxed environment that allows a non-native instruction set (blockchain bytecode) to be executed on actual hardware (x86, ARM).
The TPU, which is usually the subject of interest, subsumes the VM.
It is tasked with the management of the entire transaction execution pipeline,
including creating and managing instances of the VM.
So, as mentioned earlier, these terms are related but distinct.
The TPU, specifically, is the focus of this paper.
There are two types of TPUs, sequential and parallel.
A sequential TPU is the easiest to design and implement.
Sequential TPUs process transactions first in first out,
FIFO. This approach is very simple and incurs no scheduling overhead, but sequential TPUs don't
take full advantage of the state and trend of computer hardware design. Computer hardware has
been approaching the limits of single processor performance since the mid-2000s, and there has
been an industry-wide shift towards scaling by increasing the number of cores.
If there are no breakthroughs in computing, computers will continue to improve performance
by adding more cores as opposed to frequency scaling. And in such a world, sequential TPUs
will fail to maximize even consumer hardware, and execution will quickly become a bottleneck.
Figure 1 shows the trend of single-thread performance (blue dots) and frequency (green squares), both of which have slowed down and even seem to be trending slightly downward. The yellow diamonds show the number of logical cores, and this number has been growing steadily since the mid-2000s. Parallel TPUs are designed to make the most of this trend. Parallel
TPUs are designed to execute as many
non-conflicting transactions concurrently as is possible. An ideal parallel TPU will execute as
many transactions that are not dependent on any higher-priority transactions as is possible. As an example, consider a priority-ordered set of transactions with the dependency graph shown below, where the arrows indicate dependencies between transactions. Assuming each transaction executes within a unit of time, an ideal four-thread parallel TPU would execute all transactions with no unexecuted dependencies in parallel, then execute the transactions that depend only on those right after, and so on until the final transactions are executed.
The challenge of designing and implementing a parallel TPU
is designing
a concurrency control system that ensures only non-conflicting transactions are executed
simultaneously while maintaining priority with minimal overhead. Let's look at how this is
accomplished in practice. How parallel TPUs are implemented: concurrency control and why it's necessary. It's easy to say "just execute the transactions in parallel, bro" without truly understanding why this is such a difficult problem to solve, so I'll give a simple example to explain why concurrency control is necessary when attempting to
parallelize access to shared resources. Consider the example of a banking database where
Account A has $50. Transaction 1 wants to send $50 from account A to account B, and transaction 2 wants to send $50 from account A to account C. Assuming both transactions are allowed to execute in parallel, both will initially read the balance of account A as $50. Both transactions will then write $0 to the memory location where account A's balance is stored. Note that it doesn't matter if one writes before the other: they both read the balance as $50 and will update it to be (original balance - transfer amount), so both will write $0 to account A's balance. Both transactions will then read account B's and account C's balances and write an additional $50 to the memory locations where those balances are stored, printing $50 out of thin air. This is a simple example
but it suffices to show that when there is concurrent access to or modification of shared resources by uncoordinated transactions, the execution results are non-deterministic, subject to race conditions, and unserializable. A few other potential problems that
arise due to a lack of concurrency control are: The lost update problem. When a batch of transactions is executed in parallel, there's a possibility that a lower-precedence transaction, say Txk, overwrites a memory location that a higher-than-k-precedence transaction, say Txj, needs to read. Txj should have read the values before Txk wrote to the location; without concurrency control, there is no way to enforce this behavior. The dirty read
problem. Sometimes transactions are aborted, so all the data they've written will be rolled back,
however, another transaction might have read these dirty values before they're rolled back,
compromising the integrity of the database.
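To make this failure mode concrete, below is a minimal Rust sketch, illustrative only and not code from either TPU, that replays the banking example above as a lost-update interleaving: both transfers read account A's balance before either one commits its write, so the second commit silently discards the first deduction and $50 is created out of thin air.

```rust
use std::collections::HashMap;

fn main() {
    let mut state: HashMap<&str, u64> = HashMap::from([("A", 50), ("B", 0), ("C", 0)]);

    // Both "transactions" read account A before either one commits a write.
    let tx1_read_a = state["A"]; // 50
    let tx2_read_a = state["A"]; // 50 -- stale once Tx1 commits

    // Tx1 commits: A -> A - 50, B -> B + 50.
    state.insert("A", tx1_read_a - 50);
    *state.get_mut("B").unwrap() += 50;

    // Tx2 commits using its stale read: A -> 50 - 50, C -> C + 50.
    state.insert("A", tx2_read_a - 50);
    *state.get_mut("C").unwrap() += 50;

    // Final state: A = 0, B = 50, C = 50 -- $100 paid out from a $50 account.
    println!("{state:?}");
}
```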
There are more potential problems that can arise from simply parallelizing transactions but they won't be discussed for the sake of brevity. The important takeaway is that attempting to
parallelize execution without additional safety measures compromises the integrity of the execution
results. The solution to this problem in distributed database management systems, DDBMSs, is referred to as concurrency control, CC.
Concurrency control and types of concurrency control
Concurrency control is the process of ensuring that simultaneously executing operations do not
conflict. In DBMSs and, by extension, blockchains, there are two major paradigms of concurrency control: optimistic concurrency control (OCC) and pessimistic concurrency control (PCC). Pessimistic concurrency control (PCC). In PCC, the execution of a transaction is blocked if it needs to access a resource that is already in use by another, usually higher-priority, transaction. In PCC systems, locks are usually the method of choice to enforce blocking. Most PCC systems also require that transactions declare upfront what portions of memory they will read from and/or write to, since acquiring locks on the fly will still lead to unserializable execution results. Optimistic concurrency control (OCC). In OCC, transactions are attempted on as many
processing resources as are available. But instead of writing directly to or reading directly from
persistent memory, the transactions usually write to a log. After attempted execution,
the transaction is validated to make sure that its execution did not violate any of the database
integrity rules. If validation fails, the effects
of the transaction are rolled back, and the transaction is rescheduled for execution.
Otherwise, the transaction commits, i.e., writes to persistent memory. The other two notable developments are the Calvin and BOHM protocols, which address concurrent transactional modifications of distributed databases. We'll go over a high-level overview of each of them to provide
context for BlockSTM. Transactional Memory, TM. I find it easier to think of transactional memory
as transactional memory access. The underlying principle of TM is to allow concurrent programs
to modify, read and write, shared memory in a way that is analogous to how database transactions modify a database without
using locks. More simply put, TM aims to allow concurrent programs to modify shared memory
atomically and produce serializable results without locks. The reasoning behind the development of TM
was that lock-based synchronization techniques incur overhead from managing locks and must be carefully designed to be resistant to priority
inversions, convoying, deadlocks, live locks, etc. TM would not need to worry about these things and
would, in theory, outperform lock-based systems. The paper proposed a new multiprocessor architecture,
not an instruction set architecture, and the addition of a few instructions that would allow for transactional memory access.
The additional instructions allowed programmers to define new read-modify-write operations that
performed atomic updates to one or multiple memory locations without needing to lock those
memory locations. A few, less than 10, implementations of TM were developed but TM never had widespread use.
Software-only transactional memory (STM). STM is a suite of software-only implementations of the principles of transactional memory. The goal is the same as TM: to allow concurrent processes to access shared memory without the use of locks. The way STMs usually implement this functionality is optimistic.
Threads execute with no regard for what other threads are doing
but instead of committing their writes directly to memory,
the threads record every read and write in a log (an abstract data structure).
At the end of a thread's execution, the results are validated, i.e.
the values that the process read during its execution are compared against the current values at the same memory locations.
If the values are different, then the thread rolls back all its writes because a difference in the read sets implies that one or more concurrent processes modified the memory areas accessed by the process being validated.
Hence, the execution results are unserializable.
STM implementations will re-execute and revalidate transactions that
fail validation until they pass. STM maximizes concurrency as threads never have to wait to
access resources they need, but there's a great deal of wasted work in high-contention use cases.
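The validation step STMs rely on can be sketched in a few lines. The following hedged Rust example (names and structure are illustrative, not from any particular STM library) compares a thread's recorded read set against the current contents of memory; any mismatch means a concurrent writer interfered and the thread must roll back and retry.

```rust
use std::collections::HashMap;

// Return true only if every (location, value) pair the thread read during its
// execution still matches the current contents of shared memory.
fn validate(read_set: &[(&str, u64)], memory: &HashMap<&str, u64>) -> bool {
    read_set
        .iter()
        .all(|(loc, observed)| memory.get(loc) == Some(observed))
}

fn main() {
    let memory = HashMap::from([("x", 7u64), ("y", 3)]);
    // This thread read x = 7 and y = 2 during execution; y has since changed,
    // so validation fails and the thread would roll back and re-execute.
    let read_set = [("x", 7u64), ("y", 2)];
    println!("valid = {}", validate(&read_set, &memory));
}
```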
Years of research showed that STM implementations perform worse than fine-grained,
lock-based synchronization methods on a small
number of processors due to the overhead from managing the log. For these reasons,
STM implementations don't find much use in practice and when they do, it's for hyper-specific
use cases that the implementations are optimized for. Calvin. The next major development that
preceded BlockSTM came in the 2012 Calvin paper, where the authors proved that, contrary to popular
belief, enforcing a preset order of transactions improved the execution throughput of distributed
databases. The general sentiment before Calvin was that enforcing a preset order of execution
would reduce concurrency, but Calvin firmly established that as false. Calvin, like Sealevel,
requires that transactions declare up front all
the memory locations that they will access during execution. The rest of the workflow is fairly straightforward. Calvin nodes (computers that manage partitions of the distributed database) first decide on the ordering, i.e., priority, of a set of transactions. After coming to consensus on the ordering, Calvin's scheduling follows a simple rule: if two or more transactions request access to the same memory locations, the higher-priority transaction must be allowed to access the memory location first. A dedicated thread is responsible for lock management; this thread iterates through the serial transaction order and ensures that requests for locks are made in order. When two transactions conflict, they are scheduled sequentially in order of priority, but non-conflicting transactions can be scheduled concurrently. There is a lot more nuance to Calvin's design, but the key takeaway is that Calvin established the idea that enforcing priority improved throughput. BOHM. The next critical development came in 2014. BOHM, like Calvin, is designed for distributed databases, but the key insight is also easily extensible to blockchains. BOHM uses a multi-version concurrency control (MVCC) mechanism,
which is an auxiliary form of concurrency control that uses a multi-versioned log to manage reads
from and writes to shared memory. In a nutshell, in MVCC databases, transactions do not directly modify the database.
Instead, a log holds multiple versions of the database for every transaction, i.e. each memory
location is associated with all the values of the transactions that have written to it rather than
the most recent write. You can imagine the multi-version data structure as a two-dimensional table indexed by memory location and transaction: each slice of this table contains the versions of state for a particular memory location. An illustration that should help get the idea across is shown in Figure 3.
Recording in MVCC is similar to but different from that of TM in that in MVCC, for every write
to a memory location, a new version of the location specific to that
transaction is created or updated, rather than overwriting the current value. MVCC databases
allow for more concurrency than single-version databases, as transactions can read the values
written by any transaction at a memory location regardless of how many transactions have written
to that location. The trade-off is an increase in space (memory) complexity.
BOHM showed that combining multi-version concurrency control with fixed transaction ordering significantly improved the execution throughput of databases while maintaining full serializability. The way BOHM works is briefly explained below. BOHM is a two-layered protocol:
There is a concurrency layer and an execution layer. As a stream of transactions
is fed to the concurrency layer, a single thread orders the transactions by timestamp.
The concurrency threads are individually responsible for a logical partition of the database. BOHM requires that transactions declare their read and write sets upfront, and using this information,
each concurrency thread cycles through the ordered set and checks if the transaction writes to its partition.
For all transactions that do, the concurrency thread creates an uninitialized placeholder in the log for that transaction.
Because individual threads are responsible for partitions of the database, BOHM increases concurrency with intra-transaction parallelism when creating the placeholders. After the concurrency control threads complete their
tasks, another set of threads, execution threads, execute the transactions and fill in the
placeholders. There's a priority order, so when a transaction needs to read from a memory location,
it checks for values written by the lowest-priority transaction that is higher in priority than it. For example, if transactions 1 and 3 write to a memory location that transaction 5 needs to read from, transaction 5 will read the value written by transaction 3. In the event that the placeholder associated with the correct version to read is still uninitialized, i.e., the corresponding transaction has not yet written to that location, execution of the reading transaction is blocked until the transaction that should write to that location completes its write. This "reads never block writes" design allows BOHM to be extremely efficient, especially with large workloads, as the greatest cost, constructing the multi-version data structure, is amortized as the workload grows. Again, some nuance has been
left out but the key takeaway from BOHM's design is that multi-version data structures allow for
increased concurrency at the cost of memory.
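As a rough illustration of BOHM's approach, here is a hedged Rust sketch of a multi-version log with per-transaction placeholders and the "read from the closest higher-priority writer" rule described above. It is a simplification for intuition, not the paper's implementation: versions are keyed by writer ID per location, reads block (here, return an error) when the required placeholder is still uninitialized, and writes never block reads.

```rust
use std::collections::{BTreeMap, HashMap};

type TxId = u64;

#[derive(Default)]
struct MultiVersionLog {
    // location -> (writer tx id -> Some(value) once written, None while pending)
    versions: HashMap<String, BTreeMap<TxId, Option<i64>>>,
}

impl MultiVersionLog {
    // Concurrency layer: reserve an uninitialized placeholder for a declared write.
    fn create_placeholder(&mut self, loc: &str, tx: TxId) {
        self.versions.entry(loc.to_string()).or_default().insert(tx, None);
    }

    // Execution layer: fill in the placeholder with the written value.
    fn fill(&mut self, loc: &str, tx: TxId, value: i64) {
        self.versions.get_mut(loc).unwrap().insert(tx, Some(value));
    }

    // Read rule: take the value from the largest writer id strictly below `reader`.
    // Ok(None) means no earlier writer exists; Err(tx) means the required
    // placeholder is still uninitialized, so the reader must block on tx.
    fn read(&self, loc: &str, reader: TxId) -> Result<Option<(TxId, i64)>, TxId> {
        let Some(slice) = self.versions.get(loc) else { return Ok(None) };
        match slice.range(..reader).next_back() {
            None => Ok(None),
            Some((tx, Some(v))) => Ok(Some((*tx, *v))),
            Some((tx, None)) => Err(*tx), // blocked: writer hasn't executed yet
        }
    }
}

fn main() {
    let mut log = MultiVersionLog::default();
    log.create_placeholder("x", 1);
    log.create_placeholder("x", 3);

    // Transaction 5 tries to read x and must wait for transaction 3's write.
    assert_eq!(log.read("x", 5), Err(3));
    log.fill("x", 3, 42);
    assert_eq!(log.read("x", 5), Ok(Some((3, 42))));
    println!("ok");
}
```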
In addition to the four protocols discussed above, a 2017 paper also did some work on STM for blockchains. In the paper, the authors propose the classic STM design with on-the-fly lock
acquisition to attempt to prevent concurrent memory access and post-execution validation to
identify conflicts. The design is uncomplicated. Transactions are
optimistically attempted, attempting to grab locks as they require them, and validated after.
If a conflict is discovered, it is resolved by rolling back the transaction and re-executing it.
The design allowed the leader a great deal of freedom in deciding transaction order but the
results were non-deterministic so other nodes would likely not arrive at the same execution results unless the leader shared the exact
execution path. But the design never saw any adoption, largely because the performance results
presented in the paper showed that the protocol was only slightly better than sequential execution
and sometimes worse. BlockSTM applies the insights from Calvin and Bohm to succeed where the unnamed 2017 protocol
failed. With the appropriate background set, we can move on to Block STM.
Block-STM. Traditional distributed databases view the insights of Calvin and BOHM as constraints,
since enforcing a priority ordering requires some form of consensus between nodes and
committing each transaction individually,
as opposed to block level, is the norm in transactional databases. But both of these
properties are inherent in blockchains. Blockchain nodes must agree on the ordering of transactions,
even if the leader is free to propose as it wishes, and commits usually occur at the block level,
or at least in batches a la Solana. In essence, what traditional DBs
consider constraints are built into BlockSTM's spec and it leverages them to improve its performance.
Below is an overview of how Block-STM works. Similar to Calvin, transactions are ordered by priority. After ordering, the transactions are scheduled for execution. Like in STM, transactions are attempted with all available resources with no regard for conflicts. And like in BOHM, the execution threads don't read from or write to memory directly. Instead, they read from or write to a multi-version data structure that we'll refer to as the data structure going forward. Continuing in BOHM's footsteps, when a transaction reads from the data structure, it reads the values written by the most recent version of the lowest-priority transaction that is higher in priority than itself. For example, if a transaction with an index greater than 3 wants to read from a memory location that has been written to by transactions 1 and 3, it reads the value written by transaction 3. Keep this definition of read in mind
as it is central to the working principle of block STM. During the execution of a transaction,
its read set is tracked. The read set contains the memory locations and associated values that
a transaction read during execution. After a transaction completes execution, it is verified
by comparing its read set with the current values
at the memory locations it read from, keeping in mind the definition of read established earlier.
If there are any discrepancies between the read set and the current values at the memory locations,
it implies that during the execution of the transaction being validated, say transaction j, one or more higher-than-j-precedence transactions, say transaction k, modified one or multiple memory locations that transaction j read. Based on the preset serialization order, transaction j should have read the values written by transaction k, so all the values written by transaction j are considered dirty. But instead of deleting those values from the data structure, they are marked as ESTIMATE and transaction j is scheduled for re-execution. The values are not deleted because it's likely that re-executing transaction j will write to the same locations, and any lower-than-j-priority transactions that read values marked as ESTIMATE are delayed until transaction j is re-executed and revalidated. Because of this heuristic, Block-STM can avoid a cascade of aborts
and re-executions that would occur if the data structure were wiped clean of dirty values.
If there are no discrepancies, i.e., no transaction of higher priority than the one currently being validated wrote to a memory location in its read set, then the transaction is marked valid, but not safe to commit. The transaction is not safe to commit yet because there's a chance that a transaction of higher priority, say transaction k, will fail validation. In such an eventuality, all validated lower-than-k-priority transactions need to be revalidated to ensure that they haven't read from a location written to by transaction k. Because of this, transactions are not safe to commit until all transactions that come before them in the preset serialization order have been executed and validated. When all the transactions in the block have been executed and validated,
execution is complete. That's a basic, and likely confusing, overview of how BlockSTM
works. We'll go over the process in detail next. Technical details of BlockSTM
Before we look at the technical details of BlockSTM, a few details need to be concretized.
First, the input to Block-STM is an ordered set, called a block, that contains n transactions in the preset serialization order.
The goal of Block-STM is to take this block of transactions and execute them with the most concurrency possible, without breaking the serialization. As mentioned earlier, each transaction is first executed and then validated. If an executed transaction fails validation, the transaction is scheduled for re-execution. To track the number of times a transaction has been executed, each transaction is associated with an incarnation number in addition to its index number. You can think of a transaction as being of the form (i, j), where i is the index number and j is the incarnation number, so the block is initially (1, 0), (2, 0), ..., (n, 0). The combination of a transaction's index and its incarnation makes up the version of the transaction. Lastly, to support concurrent reads and writes, Block-STM maintains an in-memory multi-version data structure, similar to the one discussed in BOHM, that stores the latest writes per transaction, along with the writing transaction's version, for every memory location. In the early implementation, the MVHashMap maps each memory location to an internal B-tree map that maps the indexes of transactions that have written to that memory location to the corresponding values.
The DashMap is responsible for concurrency: it allows thread-safe modification of the B-tree map.
Full details on the data structure implementation can be found here.
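Below is a hedged reconstruction of the shape described above, not the exact Aptos code: a DashMap from each memory location to a B-tree map keyed by writer transaction index, whose cells hold either a written value (tagged with the writer's incarnation) or an ESTIMATE marker. The read rule returns the value from the closest lower-index writer. It assumes the dashmap crate; all names are illustrative.

```rust
use dashmap::DashMap;
use std::collections::BTreeMap;

type TxnIndex = usize;
type Incarnation = usize;

enum EntryCell<V> {
    Write(Incarnation, V), // value written by (txn index, incarnation)
    Estimate,              // invalidated write awaiting re-execution
}

struct MVHashMap<K: Eq + std::hash::Hash, V> {
    data: DashMap<K, BTreeMap<TxnIndex, EntryCell<V>>>,
}

impl<K: Eq + std::hash::Hash, V: Clone> MVHashMap<K, V> {
    fn new() -> Self {
        Self { data: DashMap::new() }
    }

    fn write(&self, key: K, txn: TxnIndex, incarnation: Incarnation, value: V) {
        self.data
            .entry(key)
            .or_default()
            .insert(txn, EntryCell::Write(incarnation, value));
    }

    fn mark_estimate(&self, key: &K, txn: TxnIndex) {
        if let Some(mut map) = self.data.get_mut(key) {
            map.insert(txn, EntryCell::Estimate);
        }
    }

    // Read the value written by the closest lower-index (higher-priority) writer.
    fn read(&self, key: &K, reader: TxnIndex) -> Option<(TxnIndex, V)> {
        let entry = self.data.get(key)?;
        for (txn, cell) in entry.range(..reader).rev() {
            match cell {
                EntryCell::Write(_, v) => return Some((*txn, v.clone())),
                EntryCell::Estimate => return None, // simplified: a real reader would wait here
            }
        }
        None
    }
}

fn main() {
    let mv: MVHashMap<&str, u64> = MVHashMap::new();
    mv.write("balance_A", 1, 0, 100);
    mv.write("balance_A", 3, 0, 70);
    // Transaction 5 reads the version written by transaction 3.
    assert_eq!(mv.read(&"balance_A", 5), Some((3, 70)));

    // If transaction 3 later fails validation, its write becomes an ESTIMATE.
    mv.mark_estimate(&"balance_A", 3);
    assert_eq!(mv.read(&"balance_A", 5), None);
    println!("ok");
}
```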
Next, we'll look at the actual thread logic. Thread logic,
the activities of the worker threads, execution and validation threads, are coordinated by a
collaborative scheduling thread that tracks and modifies two sets that we'll call E and V.
E contains all transactions that are yet to be executed and V tracks transactions that are yet
to be validated. The implementation of the collaborative scheduler tracks these tasks with atomic counters. More details can be found in Appendix A3. Each of the worker threads cycles
through the three-stage loop outlined below. If V and E are empty and no other threads are performing a task, then the execution of the block is complete. Else, if there are tasks in E and V, an available worker thread will select the task with the smallest transaction index between V and E, e.g., if V contains transaction 3 and E contains transaction 4, the worker thread will create and perform a validation task for transaction 3. The atomic counters mentioned above ensure that both sets will not
contain the same transaction. As to why the transaction with the smallest index is chosen,
it's because validating or executing the higher priority tasks as soon as possible helps to identify conflicts early.
This is one of the ways a preset order improves performance.
If the next task is an execution task for some transaction, say transaction n: execute the next incarnation of transaction n. If, during the execution, an ESTIMATE is read, abort the execution, mark all the data written by transaction n as ESTIMATE, and add transaction n back to E with an increased incarnation number. If no estimates are read, check whether transaction n writes to any locations that its previous incarnation did not write to. If there is a write to a new location, add validation tasks for all transactions lower in priority than n that have already been executed or validated. This step is necessary because if the execution of transaction n writes to a new memory location that previous incarnations of n did not, there's a chance that transactions of lower precedence, i.e., transactions with indexes greater than n, have already read from these memory locations, so they need to be checked for validity. Transactions of higher precedence don't need to be validated because they should have read that location before transaction n wrote to it. Then create a validation task for transaction n itself. If transaction n does not write to any new locations, create a validation task for transaction n alone. Loop back to the start. If the next task is a validation task for some transaction, say transaction j: validate transaction j. If validation succeeds, the thread returns to check_done. Else, if validation fails, mark all the values written by transaction j as ESTIMATE, create validation tasks for all transactions lower in priority than j that are not currently in E, and add the validation tasks to V (transactions of higher priority than j never read values written by j, by the definition of a Block-STM read, so they don't need revalidation). Then create an execution task for transaction j with an incremented incarnation number and add the task to E. Loop back to the start.
This loop continues until there are no more tasks in V and E, at which point check_done returns done and the most recent version of the data structure is safe to commit to persistent storage.
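A simplified, single-threaded Rust sketch of this loop is shown below. It models only the task-selection rule (smallest transaction index across V and E, with both sets drained until empty); a real Block-STM worker runs this loop concurrently with others, uses atomic counters rather than sets, executes transactions against the multi-version data structure, and re-enqueues execution and validation tasks when validation fails. All names are illustrative.

```rust
use std::collections::BTreeSet;

enum Task {
    Execute(usize),
    Validate(usize),
}

// Pick the task with the smallest transaction index across both sets, so
// conflicts on high-priority transactions surface as early as possible.
fn next_task(e: &mut BTreeSet<usize>, v: &mut BTreeSet<usize>) -> Option<Task> {
    match (e.first().copied(), v.first().copied()) {
        (None, None) => None,
        (Some(i), None) => { e.remove(&i); Some(Task::Execute(i)) }
        (None, Some(j)) => { v.remove(&j); Some(Task::Validate(j)) }
        (Some(i), Some(j)) if j < i => { v.remove(&j); Some(Task::Validate(j)) }
        (Some(i), Some(_)) => { e.remove(&i); Some(Task::Execute(i)) }
    }
}

fn main() {
    let n = 4;
    let mut e: BTreeSet<usize> = (1..=n).collect(); // all transactions await execution
    let mut v: BTreeSet<usize> = BTreeSet::new();   // none await validation yet

    while let Some(task) = next_task(&mut e, &mut v) {
        match task {
            Task::Execute(i) => {
                // A real worker would run the VM here, aborting if it reads an
                // ESTIMATE; on success the transaction moves on to validation.
                println!("execute tx{i}");
                v.insert(i);
            }
            Task::Validate(i) => {
                // A real worker would compare tx i's read set against the data
                // structure; on failure it would mark tx i's writes as ESTIMATE,
                // re-add tx i to E with a higher incarnation, and re-enqueue
                // lower-priority validations. Here every validation succeeds.
                println!("validate tx{i}");
            }
        }
    }
    // Both sets are empty and no thread holds a task: the block is complete.
    println!("check_done: block complete");
}
```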
After commitment, a garbage collector frees up the data structure and the transaction processing
unit awaits the next block. Below is an example of how this process would play out in practice.
The illustration is modeled on a machine with four worker threads. The sets E and V are the execution and validation sets. Nodes (circles) represent transactions, and the color of a node helps to identify the other transactions it conflicts with. The upper level of the main table shows the execution threads and the lower levels show
the validation threads. The sets E and V in every column represent the execution and validation sets
after the completion of the iteration. As seen in the illustration, E initially contains all the
transactions and V is empty. During the first iteration, the first
four transactions are executed optimistically and E and V are updated accordingly. During the next
iteration, all four transactions that were executed are validated. Transaction 4 fails validation due
to a conflict with transaction 1 and must be re-executed. During the third iteration, transactions 4, 5, 6, and 7 are executed.
Transaction 5 conflicts with 4 so it will read values marked estimate and its execution is
paused until transaction 4 completes execution. During the fourth iteration, transactions 4,
6, and 7 are validated while the execution of 5 is completed. During the fifth iteration,
transactions 5, 6, and 7 are validated.
6 and 7 are revalidated because of the re-execution of 5 for reasons explained above.
Transaction 8 is also executed. During the next iteration, transaction 8 is validated while 9 and
10 are executed. And during the final iteration, transactions 9 and 10 are validated and the block is marked safe to
commit. While I've described a simplistic workflow, the same process can be applied to any batch of
transactions with any dependency structure. At its core block STM is simple and that simplicity
in conjunction with specific properties of blockchains, fixed ordering of transactions,
block level commits and the safety of blockchain
VMs, allows BlockSTM to achieve relatively high throughput without enforcing the declaration of
read-write dependencies upfront. The BlockSTM paper contains formal proofs of liveness and safety.
And an even more detailed breakdown of BlockSTM, the full algorithm, with explanatory comments, can be found in Appendix A3. Lastly,
an implementation of the TPU in Rust can be found here. Next, we'll look at how BlockSTM
performs in practice. BlockSTM performance. To evaluate BlockSTM's performance, we forked
Aptos Core and slightly modified the already existing benchmarks. The machine we used for testing is a Latitude M4 Metal Large, which has 384 GB of RAM and an AMD EPYC 9254 CPU with 24 physical cores. The tests evaluate the execution process from the block being fetched to the completion of execution and validation of all transactions.
The transactions used during this evaluation are simple P2P transfers: a simple send between two accounts. The performance metric evaluated was throughput, and the independent variable used for presenting the data is thread count. The parameters used for evaluating performance are block size and number of accounts (a proxy for contention). Highly contentious workloads, like two accounts, proxy the
type of traffic you'd expect during an NFT mint or when a particular market is hot.
Two is the most extreme possibility, as it means all the transactions in the block are between two
accounts. Non-contentious workloads like 10,000 accounts proxy the type of traffic you'd expect if the
blockchain was used for P2P transactions i.e. simple sends between many individuals.
The other loads are proxies for everything in between. Figures 5 to 8 show the performance
relative to the number of threads of BlockSTM with different levels of contention and block sizes.
From the data, it is evident that, as expected, block STM performs
better in low contention scenarios. However, throughput is only slightly better than sequential
execution in high contention scenarios, less than 10 accounts. And in completely sequential
situations, 2 accounts, block STM's performance is significantly worse than sequential execution, which is also to be
expected. Figure 6 provides more context for performance at 1k block size and the data
suggests that throughput peaks at 12 threads with non-contentious workloads. In high contention
scenarios, performance trends slightly downward as the number of cores increases, plateauing around
8 physical cores. In very high contention scenarios,
performance peaks at four cores, the lower limits of our tests, suggesting that performance would be better with even fewer cores. This behavior is understandable since more threads in a highly
contentious environment increase the chance of the account being concurrently written to and the chance of a validation failure. Figures 7 and 8 provide more insight on how performance scales
relative to block sizes and the data is quite similar to the 1k block size data. Although with
a larger block size, performance grows more linearly. For the sake of brevity, we left out
additional data but the data collected implies that block STM performs better with larger blocks.
The improvement in performance with larger blocks is likely due to
the amortization of the initial failures. As the runs get longer, more and more transactions are executed successfully, amortizing the initial costs. We also found that there are no
significant improvements beyond a 20k block size and that reduction in contention beyond a certain
point does not improve performance. A good indicator of this is that there are infinitesimal performance differences between 1,000 accounts and 10,000
accounts. In essence, if thread count >> contention, then performance peaks.
It's important to note that the performance for thread counts 24 and 32 are not reliable
measures of Block-STM's performance at those thread counts. As mentioned earlier, the machine
used for the tests has only 24 physical cores. To expose 48 logical cores, the CPU utilizes
hyper-threading which could be responsible for its performance. The BlockSTM paper contains the
results of more tests, including one where Block-STM is compared to and slightly outperforms an implementation of BOHM. This is surprising considering BOHM has the complete read and write sets of all transactions beforehand, but the outperformance is likely due to the overhead of building the multi-version data structure in BOHM.
Overall, the data gives a reliable indication of BlockSTM's limits and it's not a stretch to
say that in low contention situations, BlockSTM massively outperforms any
sequentially executed runtime and also scales almost linearly, making it suitable for large
scale applications like rollups. That concludes the analysis of BlockSTM, next we'll consider
the leading PCC execution runtime: Sealevel, a marketing term for Solana's parallel execution TPU.
Sealevel.
Sealevel is completely antagonistic in design to Block-STM. Sealevel uses lock-based synchronization for concurrency control, but most notably, it requires that all transactions declare up front what portions of state they'll be reading from and/or writing to during execution.
The Solana execution pipeline is further optimized by the use of single
instruction multiple data, SIMD, instructions. SIMD instructions are a parallel, not concurrent,
processing technique that performs a single operation over multiple data points.
Discussions about runtime optimizations are beyond the scope of this paper but the idea
is that transaction instructions can be sorted based on what programs they call and all instructions that call the same instruction within a program
can be batched and executed in parallel with SIMD instructions, again, not concurrently.
A quote from Solana founder Anatoly Yakovenko might provide some more insight: "So, if the incoming transactions that are loaded by Sealevel all call the same program instructions, Solana can execute all the transactions concurrently in parallel over all the available CUDA cores." In short, Sealevel is designed for maximum speed. The following section will cover how Sealevel works and its performance. But first, a few primers, starting with a brief description of the lifecycle and structure of a Solana transaction. The structure and lifecycle of a Solana transaction. Transaction structure. As
mentioned earlier, Solana transactions must declare up front what portions of state they
will access. They do this by listing the accounts that they will use during execution and specifying if
they will read from or write to these accounts. The accounts list within the instructions list of the transaction is the main focus here. This list contains all the accounts an instruction will use during its execution. The list is populated based on a struct with three fields per account: the public key, an ed25519 key that identifies the account or program being invoked; a bool that determines if the account will sign the message; and a bool that marks if the account will be written during execution. The key takeaway here is that, from this struct, the TPU can identify which accounts will be written by a transaction and use that information to schedule only non-conflicting transactions concurrently; a sketch of such a struct is shown below.
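Below is a hedged sketch of that per-account struct, mirroring the shape of Solana's AccountMeta (a public key plus is_signer and is_writable flags). It is illustrative rather than a verbatim snippet from the SDK.

```rust
struct Pubkey([u8; 32]);

struct AccountMeta {
    pubkey: Pubkey,   // ed25519 public key identifying the account
    is_signer: bool,  // will this account sign the transaction message?
    is_writable: bool, // will this account be written during execution?
}

fn main() {
    // From a list like this, the TPU can tell which accounts a transaction
    // may write, and schedule only non-conflicting transactions together.
    let keys = vec![
        AccountMeta { pubkey: Pubkey([0; 32]), is_signer: true,  is_writable: true },
        AccountMeta { pubkey: Pubkey([1; 32]), is_signer: false, is_writable: true },
        AccountMeta { pubkey: Pubkey([2; 32]), is_signer: false, is_writable: false },
    ];
    let writable = keys.iter().filter(|k| k.is_writable).count();
    println!("{writable} writable accounts declared up front");
}
```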
Lifecycle of a transaction. The lifecycle of a Solana transaction is largely the same as that of most other blockchains, with some differences, the primary one being that Solana does not have a mempool.
Let's briefly go over how transactions go from
creation to commit. After a transaction is created, the transaction is serialized,
signed and forwarded to an RPC node. When an RPC node receives the packets,
they go through SIGVERIFY, which checks that the signature matches the message.
After SIGVERIFY, the packets enter the banking stage, which is where they are either processed
or forwarded. On Solana, nodes are aware of the leader schedule, a pre-generated roster that
dictates what validators will be leaders for what slots. A leader is assigned a set of four
consecutive slots. Based on this information, nodes that are not the leader forward transactions directly to the current leader and the next x leaders (x is 2 in the Agave implementation). Note that since this is not a consensus-breaking change, some nodes forward transactions to more or fewer leaders. When the leader receives the packets from other nodes,
it runs them through sig verify and sends the packets to the banking stage.
In the banking stage, transactions are deserialized, scheduled,
and executed. The banking stage also communicates with the proof-of-history component to timestamp
batches of transactions. More on this later. By design, Solana blocks are continuously built
and forwarded in small portions called shreds. When other nodes receive the shreds, they replay
them, and when all the shreds that make up a block have been received, validators compare the execution results to the one proposed
by the leader.
Then validators sign a message that contains the hash of the block they're voting for
and send out their votes.
Votes can be disseminated as transactions or through gossip but Solana favors treating
votes as transactions in a bid to speed up consensus.
When a quorum of validators is reached on a fork, the block at the tip of that fork is confirmed. When a block has 31+ confirmed blocks built on it, the block is rooted and practically impossible to reorg.
The major difference between the lifecycle of a Solana transaction and that of an Aptos
transaction is that in Aptos, transactions are not executed before nodes receive the ordering
of the block. Solana and Monad are moving towards an advanced form of this type of execution called
asynchronous execution, with the final destination being stateless validator nodes, but that
discussion is outside the scope of this paper. Before we discuss the meat of transaction execution,
it's helpful to do a refresher on proof-of-history concepts, since PoH is heavily intertwined with the banking stage process. Proof of history and the banking stage. Solana's proof of history (PoH) is a decentralized clock. PoH's working principle
is based on recursive SHA-256 hashing. It uses this cycle of hashing to proxy the passage of time. The basic idea is that a SHA-256 hash (1) is pre-image resistant, i.e., the only way to compute a hash h of a message m is to apply the hash function H to m, and (2) takes roughly the same amount of time to compute on any high-performance computer. Because of these two properties, nodes recursively performing SHA-256 hashes can agree that an amount of time has passed based on the number of hashes computed. Additionally, verification is highly parallelizable because each hash can be checked against the next one without relying on the results of the previous hashes. Because of these properties, PoH allows Solana nodes to agree on the passage of time without the need for synchronization or communication. The PoH hash calculation appends an arbitrary piece of information, in this context transaction hashes, to the previous hash to assert that the event represented by that information occurred before the hash was computed, essentially timestamping the event; a sketch of this calculation is shown below.
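Here is a minimal Rust sketch of that calculation, assuming the sha2 crate; it is illustrative, not the Agave implementation. Each step hashes the previous hash, and an event (here, a stand-in for a batch's transaction hash) is mixed in at one point, so every later hash proves the event came before it.

```rust
use sha2::{Digest, Sha256};

fn main() {
    let mut hash: [u8; 32] = Sha256::digest(b"genesis").into();
    let mut num_hashes: u64 = 0;

    for i in 0..25_000u64 {
        let mut hasher = Sha256::new();
        hasher.update(hash);
        // Mix an event into the chain at some point; everything hashed after
        // it provably came later, which timestamps the event.
        if i == 12_500 {
            hasher.update(b"hash of a transaction batch");
        }
        hash = hasher.finalize().into();
        num_hashes += 1;
    }

    println!("performed {num_hashes} hashes, tip = {:02x?}", hash);
}
```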
The other PoH-specific concepts relevant to understanding the banking stage are: 1. Ticks. A tick is a measure of time defined by x hashes, where x is currently 12,500; the tick hash is the 12,500th hash in the chain. 2. Entries. An entry is a timestamped batch of transactions. Entries are called entries because they're how Solana transactions are committed to the ledger. An entry is composed of three components: the number of hashes performed since the previous entry; the hash, which is the result of hashing the previous entry's hash that many times; and the batch of transactions themselves. There are two types of entries: tick entries and transaction entries.
A tick entry contains no transactions
and is made at every tick. Transaction entries contain a batch of non-conflicting transactions.
Entry constraints. There are quite a number of rules that determine if a block is valid,
even if all the transactions it contains are valid. You can find some of those rules here,
but the rule relevant to this report is the entry constraint. This rule dictates that
all the transactions within an entry must be non-conflicting. And if an entry contains
conflicting transactions, the entire block is invalid and rejected by validators. By enforcing
that all the transactions in an entry be non-conflicting, validators can replay all the
transactions within an entry in parallel without the overhead of scheduling.
However, SIMD 0083 proposes the removal of this constraint, as it constrains block production and prevents asynchronous execution on Solana. Andrew Fitzgerald discusses this constraint and
a few others in this post on what he thinks are the next steps Solana needs to take on its journey
to asynchronous execution. To be clear, this constraint does not
completely dictate how transactions are scheduled because executed transactions don't have to be
included within the next entry, but it is an important consideration for current scheduler
designs. With all that out of the way, we can discuss the meat of transaction execution,
Solana's banking stage. The banking stage. The banking stage module houses Solana's TPU and a lot of other logic for processing transactions. The focus of this report is on the scheduler, but
a brief overview of the Banking Stage will be discussed to provide some necessary context.
Overview of the banking stage. The banking stage sits between the SigVerify module and the broadcast stage, with the PoH module running in parallel to it. As discussed earlier, SigVerify is where transaction signatures are verified. The broadcast stage is where processed transaction data is disseminated, via Turbine, to other nodes on the network, and the tpu_forwarding module is responsible for disseminating sanitized transaction packets to leader nodes. In the banking stage, transaction packets from SigVerify are buffered and received on the appropriate channels. Voting transactions are received at the end of two channels, the tpu_vote_receiver and the gossip_vote_receiver, while non-votes are received by the non_vote_receiver. After buffering, the packets are forwarded or consumed,
depending on the leader schedule. If the node is not the leader or due to be leader shortly,
the sanitized packets are forwarded to the appropriate nodes. When the node is leader,
it consumes the transaction packets, i.e. the packets are deserialized, scheduled, and executed.
Scheduling is the main focus of this paper and it will be
expanded on later. The execution stage is relatively straightforward. After a batch of transactions is scheduled, the TPU will: Run checks. The worker thread checks that the transaction hasn't expired, i.e., that the block hash it references is not too old, and that it hasn't been included in a previous block. Grab locks. The worker thread attempts to acquire the appropriate read and write locks for all the transactions in the batch; if a lock grab fails, the transaction is retried later (see the sketch after this list). Load accounts and verify that signers can pay fees. The thread checks that the programs being loaded are valid, loads the accounts necessary for execution, and checks that the signers can pay the fees specified in the transactions. Execute. The worker threads create VM instances and execute the transactions. Record. The execution results are sent to the PoH thread, which generates an entry; entries are also sent to the broadcast stage during this step. Commit. If recording succeeds, the results are committed, updating the state. Unlock. The locks are removed from the accounts.
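The lock-grab step referenced above can be sketched as follows. This is a hedged simplification, not the Agave account-locks implementation: a write lock excludes all other access, read locks can be shared, and if any required lock is unavailable the whole transaction is retried later.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct AccountLocks {
    write_locked: HashSet<String>,       // accounts locked exclusively
    read_locked: HashMap<String, usize>, // account -> number of readers
}

impl AccountLocks {
    // Try to take every lock a transaction needs; on any conflict, take none.
    fn try_lock(&mut self, writes: &[&str], reads: &[&str]) -> bool {
        let conflict = writes
            .iter()
            .any(|a| self.write_locked.contains(*a) || self.read_locked.contains_key(*a))
            || reads.iter().any(|a| self.write_locked.contains(*a));
        if conflict {
            return false; // lock grab failed: the transaction is retried later
        }
        for a in writes {
            self.write_locked.insert((*a).to_string());
        }
        for a in reads {
            *self.read_locked.entry((*a).to_string()).or_insert(0) += 1;
        }
        true
    }
}

fn main() {
    let mut locks = AccountLocks::default();
    assert!(locks.try_lock(&["A"], &["B"]));  // tx1: write-locks A, read-locks B
    assert!(!locks.try_lock(&["B"], &[]));    // tx2: fails, B is read-locked
    assert!(locks.try_lock(&["C"], &["B"]));  // tx3: succeeds, read locks share
    println!("lock grabs behave as expected");
}
```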
That's a complete overview of the banking stage. Next, we'll dive deep into how scheduling works in the Solana TPU. This portion has had many
significant changes, but for the sake of brevity, we'll discuss only the two most recent implementations, the Thread Local Multi-Iterator and the Central Scheduler.
The thread-local multi-iterator (TLMI). In the thread-local multi-iterator implementation, the worker threads consume transaction packets sent on the channel mentioned earlier.
Each of the non-vote threads pulls transactions from the shared channel,
sorts them based on priority and stores the ordered transactions in a local buffer.
The local buffers in the TLMI are double-ended priority queues to allow for adding new high
priority transactions while removing the low priority ones with minimal time complexity.
Each thread then marches a multi-iterator through its buffer to create a batch of
128 non-conflicting transactions. A multi-iterator is a programming pattern that runs multiple
iterator objects through an array. The iterators select items from the array based on a decision
function. The multi-iterator concept can be abstract so instead of explaining it, here is
an example that shows how Solana's TLMI works. Imagine a set of 10 transactions with the dependency graph shown below. Assuming the TLMI creates batches of 4, with this set of transactions the TLMI would select Tx1 and then, one by one, the next transactions that don't conflict with anything already selected for the first batch; transactions that conflict with already-selected transactions would be skipped. The next pass of the multi-iterator would select the highest-priority mutually non-conflicting transactions among those remaining, and a third pass would select the rest.
Each iterator would follow this process
to create a batch of non-conflicting transactions and because of the use of locks during execution,
transactions are guaranteed to execute safely regardless of what other threads are doing.
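A simplified sketch of the batch-building decision is shown below: walk the priority-ordered buffer and greedily take transactions whose declared writable accounts don't clash with anything already selected, skipping the rest for a later batch. It is illustrative only; real account locking also distinguishes read locks and other details.

```rust
use std::collections::HashSet;

struct Tx {
    id: u32,
    writes: Vec<&'static str>, // accounts declared writable up front
}

fn build_batch(buffer: &[Tx], max_len: usize) -> Vec<u32> {
    let mut locked: HashSet<&str> = HashSet::new();
    let mut batch = Vec::new();
    for tx in buffer {
        if batch.len() == max_len {
            break;
        }
        // Conflict = shares a writable account with an already-selected tx.
        if tx.writes.iter().any(|a| locked.contains(a)) {
            continue; // skipped; eligible again in a later batch
        }
        locked.extend(tx.writes.iter().copied());
        batch.push(tx.id);
    }
    batch
}

fn main() {
    let buffer = vec![
        Tx { id: 1, writes: vec!["A"] },
        Tx { id: 2, writes: vec!["A", "B"] }, // conflicts with tx 1
        Tx { id: 3, writes: vec!["C"] },
        Tx { id: 4, writes: vec!["D"] },
    ];
    println!("{:?}", build_batch(&buffer, 128)); // prints [1, 3, 4]
}
```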
However, this design led to a lot of problems. The problems with the TLMI implementation
The major problem with the TLMI approach is that each thread is isolated from the others. Because each of the worker threads independently pulled transactions from the shared
channel, the distribution of transactions between threads was roughly the same. So even if the
threads create a batch of transactions with no intra-thread conflicts, there could still,
and likely would, be inter-thread conflicts, with the problem becoming worse as contention increases.
In addition, because of the TLMI's design, there is a high tendency for priority inversion,
since two conflicting transactions cannot be in the same batch, high-priority transactions that happen to conflict with an even higher-priority transaction will not be scheduled until the next batch at the earliest, while lower-priority transactions will be. These problems could be addressed by reducing the batch size from 128, but that would create
bottlenecks elsewhere, like increased context switching by threads. Instead, the scheduling
algorithm was redesigned completely, leading to the central scheduler implementation.
The central scheduler. The Agave 1.18 client addresses the central problem of Solana's banking stage by introducing a central scheduler, very similar in spirit to Block-STM's collaborative scheduler. The central scheduler uses a new thread, in addition to the previous six, that performs the task of global scheduling. The way it works is relatively straightforward. Just like before, voting transactions are handled by the voting threads, but now all non-vote transactions go to a dynamically updated buffer. Since this buffer is fed directly from SigVerify, it has a global view of priority, as opposed to the local buffers in the TLMI design. The scheduler checks that each transaction can pay the transaction fees and deducts them from the fee-paying account. Transactions that pass the checks are inserted into the prio-graph. The prio-graph is created by checking for dependencies between a transaction and the next-highest-priority transaction per account, i.e., the graph builder checks for conflicts on every single account that the transaction touches. This is implemented by tracking, in a hash map, which transactions last touched which accounts, which allows the graph builder to quickly identify dependencies (a sketch of this bookkeeping appears at the end of this section). Once the prio-graph has been completely built, transactions go through the pre_lock_filter. The pre_lock_filter is currently unimplemented, so it does nothing for now, but the intended logic flow is to allow transactions that pass the pre_lock_filter to be scheduled one by one on the appropriate threads. A small optimization
is that if a transaction does not conflict with any thread or any other transactions that could
necessitate executing a chain in sequence, the transaction is assigned to the thread with the
least amount of work in terms of CUs. Transactions are scheduled until the prio-graph is empty or every thread either reaches the max compute units assigned to it (currently 12 million CUs) or has a batch of 64 transactions assigned to it. If any of these conditions is met, each worker thread begins executing its batch; when it finishes, it communicates with the central scheduler thread and more work is scheduled for it. This process is repeated until the node is no longer leader, at which point the banking stage
is said to be over. I have left out the details of committing and forwarding as they're not relevant
to this discourse but the core idea has already been discussed. That is a complete overview of
how the central scheduler approaches parallelization. It is, in some aspects, a significant improvement over previous Sealevel iterations, as it prevents intra- and inter-thread conflicts. In the CS, ceteris paribus, there should
never be a failed lock grab, which was the main cause of nonlinear performance degradation with
the other iterations. In addition, since there are no more inter-thread conflicts, the central
scheduler allows the use of as many non-vote threads as are available for execution. The biggest trade-off of the central scheduler is the overhead incurred
during scheduling. Compared to block STM or even the TLMI, it spends a decent amount of time
scheduling transactions and has less to spend on execution. Another trade-off is that unlike the
TLMI, which had all the available threads deserializing and parsing transactions, only one thread, the CS, is responsible for these tasks now.
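To make the prio-graph bookkeeping described earlier concrete, here is a rough Rust sketch, with illustrative names rather than the Agave prio-graph API: a hash map records which transaction last touched each account, and each new, lower-priority transaction gains a dependency edge from those prior transactions.

```rust
use std::collections::HashMap;

fn main() {
    // Priority-ordered transactions and the accounts they declare (writes).
    let txs: Vec<(u32, Vec<&str>)> = vec![
        (1, vec!["A", "B"]),
        (2, vec!["B", "C"]),
        (3, vec!["D"]),
        (4, vec!["A", "D"]),
    ];

    let mut last_writer: HashMap<&str, u32> = HashMap::new();
    let mut edges: Vec<(u32, u32)> = Vec::new(); // (must run before, dependent)

    for (id, accounts) in &txs {
        for account in accounts {
            if let Some(&prev) = last_writer.get(account) {
                edges.push((prev, *id)); // *id depends on prev for this account
            }
            last_writer.insert(*account, *id);
        }
    }

    // Edges: (1,2) via B, (1,4) via A, (3,4) via D. Tx 3 has no dependencies,
    // so it can be handed to any worker thread immediately.
    println!("{edges:?}");
}
```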
That wraps up the discussion of the design of the CS and, by extension, Solana's TPUs.
Next, we'll examine their performance.
Sealevel performance.
There are quite a number of Solana benchmarks in the wild, but for the purpose of this report, we modified the Agave client codebase and ran the banking bench tests. Links to the repo as used and instructions to reproduce the results can be found in the appendix. The tests were run on the exact same machine used to evaluate Block-STM: a Latitude M4 Metal Large, which has 384 GB of RAM and an AMD EPYC 9254 CPU with 24 physical and 48 logical cores. The test workloads are the exact same as those used to evaluate Block-STM as well.
The results are shown below. TLMI. The data for the TLMI runs are shown in Figures 18 to 21 below. The TLMI has a maximum throughput of 137 kTPS, observed on 8 threads at a block size of 10k and account contention of 10k, i.e., essentially embarrassingly parallel. The TLMI also performs relatively well in very contentious scenarios, processing over 15 kTPS for completely sequential workloads. The performance trend of the TLMI is relatively less steep: it reaches about half of its maximum throughput with an account contention of 10, i.e., the performance in contentious and non-contentious situations is similar.
This implies that the TLMI will not experience significant performance degradation in contentious
situations. The data contains a few surprising facts, one of which is that with a block size
of 1k, figures 19 and 20, performance peaks with 4 threads regardless of contention.
With a block size of 10k, figures 21 and 22, performance consistently peaked at 8 threads
and even slightly degraded as core count increased.
This behavior is understandable for thread counts 24 and 32, as the machine used during testing only
has 24 physical cores and there are some other processes running in the background, e.g. PoH,
but the behavior is unexpected for thread counts 12 and 16 and seems uncorrelated to contention.
There's not enough data to assert
a trend but it would seem the TLMI's peak throughput is correlated to block size.
This suggests that there's still room for growth in regards to optimizing the performance of the
TLMI. But overall, the TLMI is a highly performant TPU that performs well in both contentious and
non-contentious environments. Central scheduler. The decision was made to exclude the performance of the central scheduler because it was found to
be inconsistent. Our tests saw it peak around 107 kTPS with 10k account contention and as high as
70 kTPS on a completely sequential workload. In another implementation of the tests, we got 30 kTPS on completely sequential workloads and approximately 90 kTPS with an embarrassingly parallel workload. These results are consistent within the tests, and rerunning the tests will produce the same results.
But the inconsistency across tests suggests that there are still bugs in the implementation that
need attention. Because of that, we decided to leave the evaluation and presentation of the
results to a future study. A few conclusions that we can draw regardless are that relative
to the TLMI, the performance will be better in highly contentious scenarios and slightly worse in low-contention scenarios. With C-level thoroughly discussed, we can move on
to the highlight of the report. Block STM vs C-Level, Fee Markets.
Appropriately pricing block space is a difficult challenge to solve. To put it more eloquently: "One of the most challenging issues in blockchain protocol design is how to limit and price the submission of transactions that get included into the chain." - Vitalik Buterin. A proper discussion on fee markets would fill up its own paper.
But on the subject of comparing BlockSTM and C-Level, it is relevant to mention that due to the
designs of both TPUs, the fee market structures are completely different. Specifically, C-Level
has pseudo-local fee markets, while BlockSTM's fee markets are global. In local fee markets,
contention for a portion of state does not
affect other portions of state, while in global fee markets, the cost of block space depends on
general demand for block space, irrespective of what portion of state a transaction is accessing.
Local fee markets are clearly preferable from every point of view. For users, the UX is superior
because users are not forced to compete with transactions accessing portions of state they are not accessing. For validators in the network, ceteris paribus, local fee markets
make for more efficient use of block space as there will be fewer failed transactions.
In short, local fee markets are superior but they are very hard to implement.
I'll explain next the fee market structure of each TPU.
Block STM fee markets. As mentioned above, Block STM's fee markets are global.
Ironically, this is a consequence of the design choice that is responsible for Block STM's
performance boost, the predefined ordering of transactions. As discussed when analyzing
Block STM's design, when Block STM wants to execute a block of transactions, it pulls an
already ordered set of
transactions from its mempool or quorum store in the new design and executes the block. Because
the block had already been constructed prior to execution and inclusion in the block was based on
gas fees, securing inclusion in the block is solely dependent on gas fees. For example, in the event of
a highly contested NFT mint where the highest priority
transactions are competing for the same account, most, if not all, the transactions packed into
the block will compete for the same state. This will significantly degrade the performance of BlockSTM, since it performs poorly in high-contention situations. It will also prevent the inclusion of non-conflicting transactions and unnecessarily drive up the price of block space, as seen on networks with single-threaded TPUs like Ethereum and other
EVM-compatible chains. We're more focused on the latter point in this section and what it implies,
that all users on block STM networks will be forced to compete with each other for block space.
Considering historical data on networks like Ethereum, block space could become unnecessarily expensive during times of high activity.
And during sustained activity, fees will be higher for all users.
This is a problem, as it suggests that block space on BlockSTM networks must be cheaper and more abundant than in a C-level network for users to pay similar amounts in fees.
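As a toy illustration of this global dynamic, consider a block builder that selects transactions purely by fee bid. This is a hedged sketch, not Aptos code; the Tx fields, block size, and account names are made up, but it shows how a hot account can price out a transaction that conflicts with nothing.

```rust
// Toy global fee market: inclusion depends only on the fee bid,
// not on which account a transaction touches. Illustrative only.
#[derive(Clone)]
struct Tx {
    fee: u64,
    account: &'static str,
}

fn build_block(mut mempool: Vec<Tx>, block_size: usize) -> Vec<Tx> {
    // Sort by fee alone; the touched account is never considered.
    mempool.sort_by(|a, b| b.fee.cmp(&a.fee));
    mempool.truncate(block_size);
    mempool
}

fn main() {
    // A hot mint attracts many high-fee transactions on one account...
    let mut mempool: Vec<Tx> = (0..8).map(|_| Tx { fee: 100, account: "hot_mint" }).collect();
    // ...while an unrelated transfer bids a modest fee.
    mempool.push(Tx { fee: 10, account: "alice" });

    let block = build_block(mempool, 8);
    let excluded = block.iter().all(|tx| tx.account != "alice");
    // The unrelated transfer is priced out even though it conflicts with nothing.
    println!("alice excluded: {excluded}");
}
```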
C-level fee markets
All iterations of C-level have always had
some form of fee market locality but the fee markets were barely functional for the reasons
discussed above, i.e., the local view on priority. Post-central scheduler, C-level's fee markets are functional and pseudo-local. Transactions bidding for inclusion and priority only have to bid against other transactions contesting the accounts they want to access. For example, if a set of transactions are all bidding for the same portion of state, a lower-priority transaction touching a different portion of state doesn't have to bid against all of them. In the current scheduler design, it will be scheduled second, right behind the highest-priority conflicting transaction, while the remaining conflicting transactions are all queued behind one another.
This is a very simple example but it holds true for any batch of transactions with any type of
transaction dependency. It's important to understand that there is and will always be some global pricing
of block space because of block size limits, currently 48 million compute units (CUs).
Transactions that don't conflict with other transactions in the block may never make it
into the block simply because their absolute priority is not high enough. But once a transaction can pay
some minimum inclusion fee, fee markets are local. Solana further improves its fee market locality
by constraining the maximum number of CUs that can be taken up by a single account in a block. The current limit is 12 million CUs. This ensures that transactions competing for
highly contested accounts cannot prevent the inclusion of other transactions, even if the former set of transactions, for some reason,
executes much faster. The data collected during this study suggests that there might be benefits
to lowering the limits relative to block space but that's a discussion for another day.
To summarize, C-level implicitly enables local fee markets and locality is reinforced by the account limits.
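The interplay of the block-wide and per-account limits can be sketched as follows. This is a simplified illustration rather than the Agave scheduler; only the 48 million and 12 million CU figures come from the text above, and everything else (the Tx type, fees, account names) is assumed for the example.

```rust
use std::collections::HashMap;

// Illustrative transaction: fee bid, cost in CUs, and the single account it writes.
struct Tx {
    fee: u64,
    cus: u64,
    account: &'static str,
}

const BLOCK_CU_LIMIT: u64 = 48_000_000;   // block-wide limit mentioned above
const ACCOUNT_CU_LIMIT: u64 = 12_000_000; // per-account limit mentioned above

/// Pack a block by descending fee, but never let a single account
/// consume more than its CU allowance.
fn pack_block(mut candidates: Vec<Tx>) -> Vec<Tx> {
    candidates.sort_by(|a, b| b.fee.cmp(&a.fee));
    let mut block = Vec::new();
    let mut block_cus = 0u64;
    let mut per_account: HashMap<&str, u64> = HashMap::new();

    for tx in candidates {
        if block_cus + tx.cus > BLOCK_CU_LIMIT {
            continue; // no room left in the block
        }
        let used = per_account.entry(tx.account).or_insert(0);
        if *used + tx.cus > ACCOUNT_CU_LIMIT {
            continue; // this account has hit its cap; others can still get in
        }
        *used += tx.cus;
        block_cus += tx.cus;
        block.push(tx);
    }
    block
}

fn main() {
    // 100 high-fee transactions hammer one account at 1.4M CUs each,
    // plus one low-fee transaction on an unrelated account.
    let mut candidates: Vec<Tx> =
        (0..100).map(|_| Tx { fee: 100, cus: 1_400_000, account: "hot_mint" }).collect();
    candidates.push(Tx { fee: 1, cus: 1_400_000, account: "alice" });

    let block = pack_block(candidates);
    let hot = block.iter().filter(|tx| tx.account == "hot_mint").count();
    let alice_in = block.iter().any(|tx| tx.account == "alice");
    // The hot account is capped at 12M CUs (8 transactions here),
    // so the unrelated transaction still makes it into the block.
    println!("hot_mint txs: {hot}, alice included: {alice_in}");
}
```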
Next we'll move on to actual throughput from both TPUs.
Block STM vs C-level.
Performance.
With all the necessary background discussed,
the time has come to answer the question of which TPU performs better.
Figures 24 and 25 show the performance of both TPUs at block sizes of 1k and 10k. The results show that C-Level is significantly more performant across the board. For the sake of reading
convenience, only the data for 2, 10, 50, and 10,000 accounts were shown, as they suffice to
represent most of the landscape. Figure 24 shows performance at a block size of 1K and C-Level
completely outperforms block STM.
With a completely sequential workload, there's a roughly 7x difference in performance and with a
highly parallelizable workload, 10k accounts, there's a 2.4x difference in performance. With a block
size of 10k, C-level has a 45%, 53%, and 118% performance boost over BlockSTM, with account contention of 100, 1k, and 10k respectively. These results are better than the 1k tests and are attributable to the
reasons discussed earlier. The results are surprising, as years of research on concurrency
control have concluded that OCC is highly suited for low contention workloads and PCC for high contention
workloads. In isolation, BlockSTM conforms to this standard, with performance improving
significantly as contention is reduced. However, regardless of contention, when compared to C-level, BlockSTM falls behind significantly. We investigated this and spoke to members of Aptos Labs and found some potential explanations; we provide some context and our findings below. During testing, we had benchmarked BlockSTM with a much earlier iteration of the Move VM, and the observed throughput was more than double the results presented here: we got throughput as high as 193 kTPS with a block size of 10k and account contention of 10k. Overall, BlockSTM was much more competitive in contentious scenarios and outperformed C-level in non-contentious situations.
While the performance of BlockSTM from those tests is legitimate, those numbers cannot be
used for a fair comparison as the execution VM and parts of the underlying program used to obtain
them were not production ready. After discussing this with the Aptos Labs team, we understood that this
degradation occurred because of the additional checks added to the Move VM. This suggests that BlockSTM's true potential might be handicapped by the complexity of the Move VM. That said, benchmarking BlockSTM with the current Move VM produces results similar to the ones presented, suggesting that the presented results might be the true limit of BlockSTM's ability with a
production-ready VM. Another factor that can explain the performance gap is that C-level
executes transactions in batches as opposed to BlockSTM's ad hoc execution. The increased
context switching from ad hoc execution likely adds some overhead.
As to the question of what TPU will perform better in real-world situations,
it is true that it's impossible to describe real-world contention with a single number.
But BlockSTM is too far behind, regardless of contention.
However, just to be thorough, we'll briefly attempt to estimate real-world contention by looking at historical blockchain data and how contentious it is. Visualization of blockchain contention
Contention can be visualized using the same prio-graphs that the central scheduler uses to identify transaction dependencies. The prio-graphs shown when describing the central scheduler were fairly simple. Real prio-graphs, shown in figures 26 to 33, are much, much more complex. The following prio-graphs are for both Ethereum and Solana blocks, as these are the only two networks with enough data to accurately model contention. Figure 26 shows the prio-graph for
Solana slot 229,666,043. The graph has 89 components, distinct
subgraphs of transactions. Most of these components have simple structures, with many having zero
dependent transactions or simple sequential dependencies. However, a large number of
transactions are part of very complex trees. Zoom in to see dependencies, keeping in mind that the
priority of a transaction is indicated by the color of the node. Also remember that the further out a transaction is from the center of a component,
the fewer dependencies it has. Nodes in the center of a component depend on all the transactions
outward from them, in all directions. The data above show that blockchain transactions are very
contentious. A fairly large set of accounts have very low contention but the
vast majority of transactions are in straight, long branches of very large trees. The straight,
long branches suggest that the transactions contend over the same accounts and the size
of the trees relative to the smaller clusters suggests that most transactions are contentious.
But most interestingly, the number of branches suggests that many accounts are
contentious, not just a small set. More prio-graphs tell the same story. In summary,
state access on blockchains is very contentious. The more nuanced conclusions that can be drawn
from looking at the data are 1. Most high-priority transactions are for highly contested portions of
state. 2. State access is contentious but also highly
parallelizable. There are many hotspots of varying temperature as opposed to one large hotspot.
3. A reasonably large set of transactions do not contend with any others. This set is small relative to the set of transactions that do, as small as 10%, but it is not infinitesimal.
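For readers who want to reproduce this kind of analysis, the sketch below groups a block's transactions into connected components by shared account access, which is roughly how component counts like the 89 above are obtained. It is illustrative only; real prio-graphs also encode priority and edge direction, and the account lists here are invented.

```rust
use std::collections::HashMap;

// Union-find over transaction indices: two transactions end up in the same
// component if they (transitively) touch a common account.
fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    let p = parent[x];
    if p != x {
        let root = find(parent, p);
        parent[x] = root; // path compression
        root
    } else {
        x
    }
}

fn union(parent: &mut Vec<usize>, a: usize, b: usize) {
    let ra = find(parent, a);
    let rb = find(parent, b);
    if ra != rb {
        parent[ra] = rb;
    }
}

/// Count the connected components of the contention graph of a block,
/// where each transaction is described by the accounts it accesses.
fn count_components(txs: &[Vec<&str>]) -> usize {
    let mut parent: Vec<usize> = (0..txs.len()).collect();
    let mut first_user: HashMap<&str, usize> = HashMap::new();

    for (i, accounts) in txs.iter().enumerate() {
        for &acct in accounts {
            match first_user.get(acct) {
                Some(&j) => union(&mut parent, i, j),
                None => {
                    first_user.insert(acct, i);
                }
            }
        }
    }
    (0..txs.len()).filter(|&i| find(&mut parent, i) == i).count()
}

fn main() {
    // Three transactions contend over account "amm"; one is independent.
    let txs = vec![
        vec!["amm", "alice"],
        vec!["amm", "bob"],
        vec!["amm", "carol"],
        vec!["dave", "erin"],
    ];
    // Two components: the "amm" cluster and the standalone transfer.
    println!("components: {}", count_components(&txs));
}
```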
So what does this mean for
BlockSTM and C-Level? Well, it strengthens the argument that C-Level will perform better in
real-world scenarios. The extent of better is a question that can only be answered by more
involved simulations but the current data strongly indicates that C-Level will perform
significantly better in real-world use. All said, it's important to
note that, because of BlockSTM's primitive dependency identification and other optimizations (*2, *3), BlockSTM will perform better than traditional OCC TPUs. But its performance will likely never
match that of C-level. The case for OCC and BlockSTM. As we approach the end of this paper, it's easy to walk away with the impression
that OCC, and, by extension, BlockSTM, is pointless; years of research and test data
suggest that it's practically guaranteed to be slower than PCC in an environment like blockchains,
where state access is fairly contentious. In addition, BlockSTM seems like a step backwards
in the context of transaction fee markets,
TFMs, as it lacks one of the major benefits of concurrent execution, local fee markets.
However, BlockSTM has made many improvements to the traditional OCC design and will perform
much better than previous OCC TPUs. In addition, there are three more benefits of BlockSTM's
design that are worth noting.
Wider range of supported applications. The primary advantage of an OCC TPU is that
it allows for arbitrary transaction logic. Since PCC TPUs require that transactions declare up
front the portions of memory they will access, it is impossible to write transactional logic
where a transaction decides to read or write certain memory locations based on information discovered during execution. A good example is on-chain order book design.
Order books on PCC TPUs usually cannot offer atomic settlement or permissionless market making
and limit orders since transactions must specify up front what accounts will be accessed during
the transaction. To work around this, most order books on Solana required the aid of an additional entity, the cranker, to finalize limit orders.
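The constraint can be made concrete with a small sketch. This is not Solana's actual runtime; the Tx and Runtime types are hypothetical, but they capture the rule that matters: a PCC-style runtime rejects access to any account the transaction did not declare up front, so a taker cannot atomically settle against maker accounts discovered during matching.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative PCC-style transaction: it must list every account it will
// touch before execution so the scheduler can take the right locks.
struct Tx {
    declared_accounts: Vec<&'static str>,
}

struct Runtime {
    balances: HashMap<&'static str, u64>,
}

impl Runtime {
    /// Any access to an undeclared account is rejected, which is exactly
    /// why a taker order can't atomically settle against maker accounts
    /// it didn't know about when the transaction was built.
    fn write(&mut self, tx: &Tx, account: &'static str, value: u64) -> Result<(), String> {
        let declared: HashSet<_> = tx.declared_accounts.iter().copied().collect();
        if !declared.contains(account) {
            return Err(format!("account {account} was not declared up front"));
        }
        self.balances.insert(account, value);
        Ok(())
    }
}

fn main() {
    let mut runtime = Runtime { balances: HashMap::new() };
    // The taker only knew about itself and the order book when signing.
    let tx = Tx { declared_accounts: vec!["taker", "orderbook"] };

    // Settling directly against a maker discovered during matching fails...
    let direct = runtime.write(&tx, "maker_42", 100);
    println!("settle against undeclared maker: {direct:?}");
    // ...so protocols fall back to a cranker or a single shared account.
    let ok = runtime.write(&tx, "orderbook", 100);
    println!("write to declared account: {ok:?}");
}
```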
Phoenix on Solana has managed to overcome this by holding all the balances in one account but
this approach faces its own struggles. Portability. The final argument for BlockSTM is that it can
integrate with any blockchain without breaking compatibility with existing transactions. There have been attempts to have transactions optionally specify up front
what portion of state they'll be accessing on Ethereum in a bid to increase the efficiency
of the EVM but they never saw the light of day. Because of BlockSTM's nature, it can be permissionlessly integrated into any blockchain without breaking consensus.
Even Solana clients can implement
BlockSTM with modifications to ensure that the blocks meet all the constraints without a formal
proposal. It is specifically for this reason that Monad and all other parallel EVMs use BlockSTM
to achieve functional parallelization of the EVM. For these reasons, arbitrary transaction logic,
improved developer experience,
and portability, OCC execution engines are at least worth exploring.
Conclusion. Further reading: frequency scaling; work stealing; terminology (Solana); what do "state" and "state change" mean in blockchain?; threads and locks (PDF); deferred execution (Monad). Paper references:
Maurice Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93), Volume 21, Issue 2, May 1993. Daniel A. Menascé and Tatuo Nakanishi.
Optimistic vs. Pessimistic Concurrency Control Mechanisms in Database Management Systems, Information Systems, Volume 7, Issue 1, 1982,
pages 13-27. Glossary. CPU architecture. This section will provide concrete definitions for
computer hardware terms that showed up during the report. A CPU core is
essentially a CPU in and of itself. It has all the necessary components to qualify as a full CPU and
is capable of executing tasks by itself. Most modern CPU chips have multiple cores that share
memory and I/O hardware. A thread is an abstraction that roughly refers to a series of instructions executed by a CPU core.
Multithreading is executing a program, not a transaction, across multiple threads.
Hyperthreading allows CPU cores to execute two threads pseudo-simultaneously.
The core shares its resources between two threads in a way that improves parallelism
but it's not the 2x performance that's often implied.
Parallel vs. concurrent. A parallel program uses multiple CPU cores, each core performing a task independently. On the other hand, concurrency enables a program to deal with multiple tasks even on a single CPU core. The core switches between tasks, i.e. threads, without necessarily completing each one. A program can have both, neither, or a combination of parallelism and concurrency characteristics.
Concurrency focuses on managing multiple tasks efficiently with one resource. Parallelism
utilizes multiple resources to execute tasks simultaneously, making processes faster.
Concurrency is when two or more tasks can start,
run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both
be running at the same instant. For example, multitasking on a single-core machine.
Parallelism is when tasks literally run at the same time, e.g. on a multi-core processor.
It should be clear now why transaction processing in blockchains is
more analogous to concurrent processing than parallel processing. Other terms. Instruction.
An instruction is the most granular operation that a processor can perform. They can vary in
complexity based on the computer's instruction set architecture, but the idea is that an instruction
is guaranteed to be atomic by the hardware design.
Bytecode. Bytecode is an assembly-code-like intermediate representation. In this context,
it refers to a VM's instruction set. Assembly language. Assembly language is a low-level language with constructs that map one-to-one with the underlying CPU's instruction set.
Locking granularity. Refers to how precise locks are. Imagine a transaction needs to
alter an element in a table. A coarse lock could lock the entire table. A more granular lock could
lock the row or column of interest, and a very fine lock could lock only the cell of interest.
The more granular a lock is, the more concurrent memory access can be. But highly granular locks
are more difficult to manage because they require a larger key-value database to manage portions of memory.
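A rough sketch of the trade-off, with invented table types: the coarse version guards the whole table with one lock, while the fine-grained version keys a map of locks by (row, column), letting unrelated cells be updated concurrently at the cost of tracking many more lock entries.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Coarse: one lock guards the entire table, so any two writers serialize.
struct CoarseTable {
    cells: Mutex<HashMap<(usize, usize), i64>>,
}

// Fine: one lock per cell, so writers to different cells don't block each
// other, but the lock map itself must be kept somewhere and grows with the data.
struct FineTable {
    cells: HashMap<(usize, usize), Mutex<i64>>,
}

fn main() {
    let coarse = CoarseTable { cells: Mutex::new(HashMap::new()) };
    coarse.cells.lock().unwrap().insert((0, 0), 1); // whole table locked for one cell

    let mut fine = FineTable { cells: HashMap::new() };
    fine.cells.insert((0, 0), Mutex::new(1));
    fine.cells.insert((5, 3), Mutex::new(7));
    // Two threads could hold these two locks at the same time.
    *fine.cells[&(0, 0)].lock().unwrap() += 10;
    *fine.cells[&(5, 3)].lock().unwrap() += 10;
    println!("fine (0,0) = {}", *fine.cells[&(0, 0)].lock().unwrap());
}
```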
Priority inversion is when a lower-priority process (read: transaction) prevents the execution of a higher-priority process.
Convoying. A lock convoy occurs when threads of equal priority compete for access to a shared
resource. Each time a thread
relinquishes access to the resource and pauses or stops the process, there is some overhead incurred.
Deadlocks. A phenomenon where two or more processes cannot advance because each one
requires resources held by the other. Slots. A slot is Solana's term for block time,
currently around 400 milliseconds.
Nodes are allotted four consecutive slots every time they are leader.
Epoch. An epoch in Solana is a distinct block of 432,000 consecutive slots. Not any 432,000 slots, but rather distinct blocks, like days of the week.
Serialization is the process of converting data (data structures and objects) into a string of bits that can be stored, transmitted, and reconstructed according to the format.
Deserialization is the process of reconstructing serialized data from the string of bits to the
original data structures or objects. Buffer. A buffer is a temporary storage element.
It's usually used when data cannot or should not be processed at
the same rate at which it is being fed to the next block of a process. Packets. A packet is a fixed-size unit of data carried by a packet-switched network. Packets are relevant to Solana because
each transaction is constrained to the maximum size of an IPv6 packet. Double-ended queue,
also called a deque, is an abstract data structure that supports the addition of data to one end and removal from the other. Channel. A high-performance I/O gadget. B-tree. A self-balancing tree data structure. Hashmap. An efficient associative array data structure.
Race conditions. Occur when the uncontrolled timing or ordering of processes can lead to incorrect or inconsistent results.
In the context of this paper, race conditions occur in OCC systems.
Appendix A1. Distributed databases Distributed databases are a lot like blockchains, in fact, it's not a stretch to say that they're
fundamentally the same thing. A distributed database is a collection of multiple,
interconnected databases spread across different physical locations connected via a network. These databases appear as a single
database to the user but function on multiple servers. The topic of distributed databases is
expansive so I'll briefly cover only the areas relevant to this report: data distribution methods and database transactions. Data distribution methods. Horizontal partitioning. Each site stores a different subset of the rows of a table.
For example, a customer table might be divided so that the first x entries are stored on one
database and the others are stored elsewhere. Vertical partitioning. Each site stores different
columns of a table. For example, one site might store the customer names and
addresses, while another site stores their purchase histories. Replication. Copies of the entire
database, or subsets, are stored at multiple sites. This improves data availability and reliability
but may introduce consistency challenges. Transactions. Transactions in DBMSs are defined
as a single unit of operations that perform a
logical action, usually accessing and/or modifying the database. For most databases, especially those keeping track of financial records, transactions are required to have ACID properties, i.e., the transactions are atomic (all or nothing; if one operation fails, the entire transaction reverts), consistent (transactions modify the database in predictable and repeatable ways), isolated (when multiple transactions are executed concurrently, they execute without affecting each other, as if they were executed sequentially), and durable (changes made by completed transactions are permanent even in the event of system failure).
If it's not already obvious, blockchains are just replicated databases with adversarial operators.
So it's not surprising that distributed database research forms the foundation for most blockchain designs today. A2. Narwhal and QuorumStore TLDR
The value proposition of Narwhal, and by extension QuorumStore, is that,
leader-based consensus, I.E. A consensus system where the leader for a slot is responsible for
transmitting most of the information during that slot, a la Solana, is bottlenecked by the
performance of the leader as opposed to the entire validator set. Narwhal, and QuorumStore,
resolve this bottleneck by decoupling data dissemination
and consensus down to the hardware level. The implementation of this idea spreads the work
equally across all validators, as opposed to having the leader alone bear the brunt.
There's perhaps no better proof of Narwhal's value than Solana itself. The Solana documentation
currently recommends that validators use dedicated 1 gigabit per second lines as the minimum bandwidth setup and ideally a 10 gigabit per second line. Average-sized validators report around 1.5 gigabits per second peak traffic. During leader slots, Aptos node runners report using 25 megabits per second by comparison. Of course, both chains don't process nearly the
same amount of traffic and
there's the question of the stake weight of each validator but the big idea is that QuorumStore
is more efficient from a leader's point of view than leader-based consensus. Let's run through
how it works. In QuorumStore, validators continuously stream batches of ordered transactions,
similar to the blocks discussed above, to one another. The batches contain raw
transaction data and metadata (batch identifiers). Other validators receive these batches, sign the
batches, and transmit the signatures to other validators just like they would a normal executed
block in leader-based consensus. When a batch reaches the quorum of stake-weighted signatures,
it essentially receives a proof of availability, as all the validators that sign a batch promise to keep and provide the batch on
request until it expires. Because of QuorumStore, the leader doesn't have to propose transaction
batches anymore. Instead, the leader simply selects a batch for which it has a proof of
availability, sends out the batch metadata, executes the block, and sends out the execution results.
Other validators can then map the metadata to the batch and replay the batch for verification.
If they don't have the batch, they can request it from other nodes and be sure they'll receive it
since 2f+1 nodes promise to store and provide the data. This reduces the messaging
overhead of the leader. And because Quorum store is run on a separate machine,
it's horizontally scalable by adding more boxes. There is a lot of nuance that my TLDR has left out. You can find more details in the Aptos blog post and Narwhal paper.
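The core availability check can be sketched as follows. This is not Aptos code; the struct names and the greater-than-two-thirds stake threshold are illustrative simplifications of the 2f+1 quorum described above.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stake table and batch signatures.
struct Validators {
    stake: HashMap<&'static str, u64>,
}

impl Validators {
    fn total_stake(&self) -> u64 {
        self.stake.values().sum()
    }

    /// A batch earns a proof of availability once validators holding more than
    /// two thirds of the stake have signed it, i.e. roughly the 2f+1 quorum
    /// that promises to store and serve the batch until it expires.
    fn has_proof_of_availability(&self, signers: &HashSet<&'static str>) -> bool {
        let signed: u64 = signers
            .iter()
            .filter_map(|v| self.stake.get(v))
            .sum();
        3 * signed > 2 * self.total_stake()
    }
}

fn main() {
    let validators = Validators {
        stake: HashMap::from([("a", 40), ("b", 30), ("c", 20), ("d", 10)]),
    };

    let signers: HashSet<_> = ["a", "b"].into_iter().collect(); // 70 of 100 stake
    println!("PoA reached: {}", validators.has_proof_of_availability(&signers));

    // With this proof, the leader only sends the batch metadata; any validator
    // missing the data can fetch it from the signers.
}
```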
A3. Block STM Algorithm. This section contains the block STM algorithm from the paper alongside
explanatory comments to aid understanding.
Asterisk 1. Executing transactions in batches reduces messaging and context switching but comes at the cost of requiring extra scheduling. Asterisk 2. The topic of OCC versus PCC is
considered moot in academia. The general sentiment is that OCC is suited for low
contention applications and PCC for high-contention applications.
As such there is very little work ongoing in this regard. One of the most recent papers to
discuss the subject was written in 1982. The following graph is adapted from the paper and
it helps to reaffirm the established sentiment. The 2016 study, Revisiting Optimistic and
Pessimistic Concurrency Control, by Goetz Graefe of Hewlett-Packard Labs, comes to the same conclusion as Menascé and Nakanishi. To quote the author:
"We have concluded that optimistic concurrency control permits more concurrency than pessimistic concurrency control only if it fails to detect some actual conflicts or if a particular implementation of locking detects false conflicts."
There are numerous studies that come to the same conclusion but we've left them out for brevity.
Asterisk 3. A perfectly fair evaluation of BlockSTM and C-level would require isolation, i.e. removing all other processes like Solana's PoH and ledger commits, using the same VM for both
TPUs and a host of laborious engineering tasks that are simply not
worth the effort, especially when preliminary testing suggests that the TPUs follow the
established trends. Asterisk 4. An optimization in the implementation of BlockSTM allows aborted
transactions to be restarted from the point of conflict. Instead of restarting execution from
scratch, the Move VM supports validating the read set of
the transaction's previous incarnation and if valid, continuing execution from the point of
conflict. Asterisk 5. A second optimization applies when the dependency is resolved before the execution
task is created i.e. when line 14 of the algorithm returns false. In the implementation, the VM
continues execution from where it paused rather than
restarting the execution. Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.