NetSPI Distributed Ledger Team
WP_Query Object ( [query] => Array ( [post_type] => Array ( [0] => post [1] => webinars ) [posts_per_page] => -1 [post_status] => publish [meta_query] => Array ( [relation] => OR [0] => Array ( [key] => new_authors [value] => "149" [compare] => LIKE ) [1] => Array ( [key] => new_presenters [value] => "149" [compare] => LIKE ) ) ) [query_vars] => Array ( [post_type] => Array ( [0] => post [1] => webinars ) [posts_per_page] => -1 [post_status] => publish [meta_query] => Array ( [relation] => OR [0] => Array ( [key] => new_authors [value] => "149" [compare] => LIKE ) [1] => Array ( [key] => new_presenters [value] => "149" [compare] => LIKE ) ) [error] => [m] => [p] => 0 [post_parent] => [subpost] => [subpost_id] => [attachment] => [attachment_id] => 0 [name] => [pagename] => [page_id] => 0 [second] => [minute] => [hour] => [day] => 0 [monthnum] => 0 [year] => 0 [w] => 0 [category_name] => [tag] => [cat] => [tag_id] => [author] => [author_name] => [feed] => [tb] => [paged] => 0 [meta_key] => [meta_value] => [preview] => [s] => [sentence] => [title] => [fields] => [menu_order] => [embed] => [category__in] => Array ( ) [category__not_in] => Array ( ) [category__and] => Array ( ) [post__in] => Array ( ) [post__not_in] => Array ( ) [post_name__in] => Array ( ) [tag__in] => Array ( ) [tag__not_in] => Array ( ) [tag__and] => Array ( ) [tag_slug__in] => Array ( ) [tag_slug__and] => Array ( ) [post_parent__in] => Array ( ) [post_parent__not_in] => Array ( ) [author__in] => Array ( ) [author__not_in] => Array ( ) [search_columns] => Array ( ) [ignore_sticky_posts] => [suppress_filters] => [cache_results] => 1 [update_post_term_cache] => 1 [update_menu_item_cache] => [lazy_load_term_meta] => 1 [update_post_meta_cache] => 1 [nopaging] => 1 [comments_per_page] => 50 [no_found_rows] => [order] => DESC ) [tax_query] => WP_Tax_Query Object ( [queries] => Array ( ) [relation] => AND [table_aliases:protected] => Array ( ) [queried_terms] => Array ( ) [primary_table] => wp_posts [primary_id_column] => ID ) [meta_query] => WP_Meta_Query Object ( [queries] => Array ( [0] => Array ( [key] => new_authors [value] => "149" [compare] => LIKE ) [1] => Array ( [key] => new_presenters [value] => "149" [compare] => LIKE ) [relation] => OR ) [relation] => OR [meta_table] => wp_postmeta [meta_id_column] => post_id [primary_table] => wp_posts [primary_id_column] => ID [table_aliases:protected] => Array ( [0] => wp_postmeta ) [clauses:protected] => Array ( [wp_postmeta] => Array ( [key] => new_authors [value] => "149" [compare] => LIKE [compare_key] => = [alias] => wp_postmeta [cast] => CHAR ) [wp_postmeta-1] => Array ( [key] => new_presenters [value] => "149" [compare] => LIKE [compare_key] => = [alias] => wp_postmeta [cast] => CHAR ) ) [has_or_relation:protected] => 1 ) [date_query] => [request] => SELECT wp_posts.ID FROM wp_posts INNER JOIN wp_postmeta ON ( wp_posts.ID = wp_postmeta.post_id ) WHERE 1=1 AND ( ( wp_postmeta.meta_key = 'new_authors' AND wp_postmeta.meta_value LIKE '{82e5f045d0157de3da29b2b5d7faad67737b5d714ec30d1048848eccdd174d8f}\"149\"{82e5f045d0157de3da29b2b5d7faad67737b5d714ec30d1048848eccdd174d8f}' ) OR ( wp_postmeta.meta_key = 'new_presenters' AND wp_postmeta.meta_value LIKE '{82e5f045d0157de3da29b2b5d7faad67737b5d714ec30d1048848eccdd174d8f}\"149\"{82e5f045d0157de3da29b2b5d7faad67737b5d714ec30d1048848eccdd174d8f}' ) ) AND wp_posts.post_type IN ('post', 'webinars') AND ((wp_posts.post_status = 'publish')) GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC [posts] => Array ( [0] => WP_Post Object ( [ID] => 30210 [post_author] => 149 [post_date] => 2023-05-25 14:09:32 [post_date_gmt] => 2023-05-25 19:09:32 [post_content] =>This is the second part of our series on Ethereum Virtual Machine (EVM) internals. Part 1, which can be found here, introduced the EVM call context and its architecture, followed by a deep dive into the non-persistent Memory section, function selection and visibility, and how contract control flow can be bypassed at the bytecode level.
This installment will focus on the EVM’s persistent storage region and explore how various Solidity types and data structures are represented and referenced in contract Storage.
EVM Storage and Contract States
In part 1, we dove into the memory section of the call context, used to hold temporary, mutable data necessary for inter-function logic. As a decentralized state machine however, the EVM needs a dedicated mechanism to keep track of persistent data that represents a given contract’s overall state between function calls. This is the job of the EVM’s Storage region.
EVM implementations treat the storage region as a dictionary of 2256 32-byte keys and values. These keys are considered to be logically initialized to zero by default. However, like all data structures that constitute the EVM, it is up to specific implementations to define exactly how the storage region is implemented. Most implementations, including the widely used Go-Ethereum (
geth
) client stays true to the overall idea of the Storage region by implementing storage as hash maps or key-value pairs1, as does the C++-based EMVC implementation of the EVM, which uses the C++ standard librarystd::map
object2.The relevant opcodes for operating on storage are the
SSTORE
andSLOAD
opcodes, which allow writing and reading 32 bytes to and from storage respectively. It should be noted that EVM operations that work on the Storage region incur some of the highest gas fees relative to other opcodes3, therefore a sufficient understanding of this region is especially important for writing gas-efficient contracts.Storage Slots
EVM implementations use the concept of storage slots, which are discrete sections within the storage space in which state variables are stored.
Within the wider Ethereum network, storage slots are also sometimes referenced using the
keccak256
hash of the hexadecimal slot number, left padded to 32 bytes.keccak256(0x0000…0000) = 0x290decd9548b62a8d60345a988386fc84ba6bc95484008f6362f93160ef3e563 keccak256(0x0000…0001) = 0xb10e2d527612073b26eecdfd717e6a320cf44b4afac2b0732d9fcbe2b7fa0cf6 keccak256(0x0000…0002) = 0x405787fa12a823e0f2b7631cc41b3ba8828b3321ca811111fa75cd3aa3bb5ace
Below is an example of a simple storage layout as shown in Remix of three 32-byte
uint256
values, each in their own storage slot:It is important to note that within the EVM execution context, slot keys are usually referenced via their literal hexadecimal values, most evident when storage slots are referenced by the commonly used
SSTORE4
andSSLOAD5
EVM opcodes. However, hashed slot indexes shown by Remix and similar tools play an important role in the wider Ethereum ecosystem.Merkel Patricia Tries
The reason why slots are sometimes also displayed via a hash of the slot numbers themselves stems from a fundamental data structure used by the wider Ethereum network for storing account data and contract storage, called Merkle Patricia Tries (MPTs)6. In an MPT, each key-value pair is hashed using the
keccak256
hash function, and the resulting hash is used as the key for the next level of the trie.This creates a hierarchical structure where each level of the trie corresponds to a different prefix of the key, and each leaf node in the trie corresponds to a complete key-value pair. MPTs are used in various contexts throughout the Ethereum ecosystem, including for storing account data, contract storage, and transaction receipts.
At a high level, an MPT is a modified version of the standard trie data structure. A trie7 is a generic tree-based data structure used for storing and retrieving key-value pairs. It allows for efficient prefix-based searches by organizing keys based on their common prefixes.
MPTs incorporate elements of Merkle and Patricia tries, which are derivative trie structures. Merkle tries, also known as hash trees, use hashed representations of keys and values to save on search and storage overhead8. Patricia tries are optimized for memory usage by compressing and merging multiple nodes within a single child node, further reducing storage requirements9.
The Bitcoin network uses Merkle tries to archive and verify the integrity of transactions that have been confirmed with enough blocks10, whereas the Ethereum network requires MPTs to facilitate smart contract and account state storage. Outside of blockchain systems, Patricia tries and the more general Radix tries are commonly used in applications that require fast prefix-based searches, such as IP routing tables and dictionary implementations.
Storage slots are referenced by their
keccak256
hashes in the MPT when contract balances and other state information is read and written, largely because MPTs need a standardized method of referencing variable-length key values. However, the EVM references storage slots via their explicit storage key values. These are not always hexadecimal values; for more complicated data structures, additionalkeccak256
hashing must be done to derive slot keys, as will be demonstrated later.MPTs and the data structures used to maintain the wider Ethereum network are out of scope for this article. The rest of this article will focus on EVM storage. The EVM has a peculiar way of storing state variables; necessitated in part by the EVM’s implementation as a register-less stack machine, and as a means of minimizing the gas costs of storing smart contract states and data.
For the remainder of this article however, we will omit slot indexes when displaying key-value sets in storage for clarity.
Static Variable Storage
Simple Variable Storage & Packing
Consider the following Solidity contract which demonstrates the concepts of simple variable storage and static variable packing. The
modify()
function assigns values to the state variables, which have been implemented as unsigned integers of varying sizes. The integers are represented in hex format for clarity. You may follow along with this contract via Remix here.pragma solidity >=0.7.0 <0.9.0; contract VarPacking { uint256 slot_0; uint128 slot_1; uint64 still_slot_1; uint64 slot_1_again; uint128 slot_2; function modify() public { slot_1 = 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA; // slot 1 slot_0 = 0xBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB; // slot 0 still_slot_1 = 0xCCCCCCCCCCCCCCCC; // still in slot 1 slot_1_again = 0xDDDDDDDDDDDDDDDD; // still in slot 1 slot_2 = 0xEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE; // slot 2 } }The order in which state variables are assigned values does not affect which slot they will occupy. This is because the storage slots are determined depending on these factors:
- The order in which state variables are declared.
- The type of the state variable.
- The size of the state variable.
- Whether the variable is fixed-size (simple unsigned integers, fixed-size arrays), or dynamically sized (mappings and unbounded arrays).
Until variables are assigned values, no data will be written to the storage slot, and no storage fees are incurred until storage is interacted with. Pay attention to the state variable names, which have been named according to how they will occupy storage slots.
We will now go through the execution of the modify()
function and show how each storage slot is populated.
Note that Solidity compiler optimizations have been turned off for all examples shown in this post, as compiler optimizations can affect the order in which slots are populated. These optimizations do not affect the values stored in the slots, however. As of writing, the main factor that affects which slot a value is stored in is still the order in which the variable was declared.
Integers
unit
and uint256
types are 32 bytes large. When state variables of these types are assigned values, the next available storage slot is used in its entirety. This can be confirmed by deploying the VarPacking
contract, placing a breakpoint on line 12, and debugging the call to modify()
. After the first SSTORE operation, storage will resemble the following:
"key": "0x0000 . . . 0001",
"value": "0x00000000000000000000000000000000aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
The 128-bit value 0xAA..AA
has been written to the latter half of slot 1. Continuing past the next SSTORE
operation will result in the 256-bit value 0xBB…BB
being written to slot 0, as the slot_1
state variable was assigned a value before slot_0
.
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0x00000000000000000000000000000000aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
Stepping through until the still_slot_1
value is assigned to storage (just before line 14) results in slightly different storage behavior however:
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0x0000000000000000ccccccccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
Variable packing:
Unlike slot_1
, the value for still_slot_1
was stored after the 8th byte of storage slot 1. This is because the consecutively declared slot_1
and still_slot_1
are of types uint128
and uint64
, and are 16 and 8 bytes large respectively. If we continue execution until the end of the modify()
function, we will notice that the values for slot_1
, still_slot_1
, slot_1_again
, and slot_2
are also not all stored in their own storage slots.
To save on storage fees, the EVM has attempted to “pack” these values together such that more than one value can occupy a single storage slot. This is called “variable packing”, and is subject to the following rules11:
- The first item in a storage slot is stored lower-order aligned.
- Value types use only as many bytes as are necessary to store them.
- If a value type does not fit the remaining part of a storage slot, it is stored in the next storage slot.
- Structs and array data always start a new slot and their items are packed tightly according to these rules.
- Items following struct or array data always start a new storage slot.
The storage layout will present the following values after the call has completed:
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0xddddddddddddddddccccccccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
"key": "0x0000 . . . 0002", // slot 0x00…02
"value": "0x00000000000000000000000000000000eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
Take a moment to reflect on how the previously mentioned variable packing rules are at play. Stepping through the bytecode executed between lines 14 and 15 will show these rules being applied, shifting and rearranging the values as necessary until the SSTORE
opcode is used to preserve the values in the correct slot. Other static types are stored differently; however the same overall goal of maximizing slot usage is maintained.
Compiler Optimizations:
Because opcodes like SSTORE
and SLOAD
are gas intensive, with Solc compiler optimizations turned on, optimized bytecode will be generated which attempts to limit the number of storage operations. This can be observed using the previous VarPacking contract as an example, using this Remix link. Here, the “Enable optimization” flag has been selected under Advanced Configurations in the Solidity Compiler tab. Optimizations are enabled by default in Remix and most other development frameworks.
With optimizations, even though the slot_0
variable was assigned after slot_1
, setting a breakpoint on line 14 shows that storage slot 0 was written to first. This is then followed by a single SSTORE
operation to store the values of slot_1
, still_slot_1
and slot_1_again
to slot 1 within a single operation.
This is because the contract’s bytecode was optimized to pre-arrange the values to be stored in a particular slot before any storage operations are performed. Thus, the previously described variable packing rules are still in effect, however they are instead performed in the Memory region and stack instead of successive storage read/writes.
For simplicity, compiler optimization will be disabled for the remaining contract links in this article.
Fixed-Size Arrays:
Elements in fixed-size arrays are stored as they otherwise would be if the elements were declared sequentially in singular storage variables. After executing modify()
and debugging the transaction for the following contract, each element of the array will occupy its own storage slots. This is only because each element is 32 bytes large. As with singular state variables, array state variable elements can share the same slot, subject to the same packing rules discussed before.
pragma solidity >=0.7.0 <0.9.0; contract FixedArray { uint256[3] arr; function modify() public { // slots 0 and 1 assigned values of 1, 2. // slot 2 will still contain zeros. arr = [1,2]; // slot 2 assigned value of 3 arr[2] = 3; } }
When assigning values for up to n-1 elements in an array, the Solidity compiler will explicitly store a value of 0 for the unassigned values. This can be seen by placing a breakpoint at the first assignment in modify()
and stepping through execution until a value of 0 is pushed to the stack, to eventually be stored at slot 2.
After executing the SSTORE
instruction, slot 2 contains a value of 0.
This is later overwritten by the value of 3 after the final element has been assigned. These variable packing rules suffice for simple unsigned integer types. However, the EVM must perform additional storage packing logic to store more complex data types.
String and byte types:
Strings and byte variations are stored according to their own packing heuristics. Each character in a Solidity string takes up 1 byte. The following contract demonstrates string storage:
pragma solidity >=0.7.0 <0.9.0; contract StringStorage { //same applies to byte types string short_string; string long_string; function modify() public { short_string = "ABCD"; long_string = "ABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCD"; //84 characters long } }
Strings and byte values that are less than or equal to 31 bytes in length are indexed in storage in the following way, where n
represents the next available storage slot for 0 ≤ n
≤ 2256-1 slots:
String/Byte Storage ( of size 1 ≤ 31 bytes) | |
---|---|
Slot Key | Slot Value |
n | String data up to 31 bytes in higher-order bytes, and length of string * 2 in lowest-order byte |
Placing a breakpoint at line 11, after the assignment of short_string
, and debugging the call to modify()
will show the string ABCD (0x41424344)
stored in slot 0, aligned to the higher order of the storage slot. Double the length of the string, 0x08
, is stored at the lower order of the slot (shown in red):
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0x414243440000 . . . 0008",
. . .
A different method is used to index and store strings and byte values over 32 bytes in length, however it still retains the concept of storing string/byte data first, followed by the length of the data. Instead of referencing slot keys via their literal values, the keccak256
hash of the previous slot key number is used as the current slot key. This hash is then incremented for every remaining 32 byte chunk of string data. This is summarized in the table below, for n
storage slots, storing i
32 byte chunks of string data, for 0 ≤ n
, i
≤ 2256-1:
String/Byte Storage ( of size 1 ≤ 32 bytes) | |
---|---|
Slot Key | Slot Value |
. . . | . . . |
keccak256(n)+(i-1 ) | i-1th 32 byte chunk of string data. |
keccak256(n)+i | ith 32 byte chunk of string data, left-aligned with 0 if i ≤ 32 bytes (e.g., 0x41424344 . . . 000000) |
n+1 | (Length of string/bytes * 2) + 1 in lower-order bytes |
After long_string
has been assigned its value, we see several more storage keys being used:
. . .
"key": "0xb10e . . . 0cf6", // slot keccak256(0x00…00)
"value": "x4142434441424344414243444142434441424344414243444142434441424344",
"key": "0xb10e . . . 0cf7", // slot keccak256(0x00…00) + 0x00…01
"value": "0x4142434441424344414243444142434441424344414243444142434441424344",
"key": "0xb10e . . . 0cf8", // slot keccak256(0x00…00) + 0x00…02
"value": "0x4142434441424344414243444142434441424344000000000000000000000000",
"key": "0x0000. . . 0001", // slot 0x00…01
"value": "0x0000 . . . 00a9"
In this case, the keccak256
hash of the storage identifier for slot 1 (0xb10e…0cf6
) is used as the first slot key for the first 32-byte chunk of the string. This slot key is incremented by one for all remaining 32-byte chunks, with the last chunk being padded with zeros if it is less than 32 bytes in length.
Finally, the length of the string is stored at the storage slot corresponding to when the long string was declared. We can see that this mirrors the way shorter string storage variables are stored, with the data taking up higher order bytes, and the length of the string in the lower order bytes.
Enums
The storage of the enum
type is not currently covered in the Solidity documentation12. Testing shows that enum
elements are stored as lower-aligned single-byte values within a single slot for enums
up to 32 elements in size.
This is evident with the following contract, with enum E
containing 35 elements. Note that all elements of enum E
and their assignments in the modify()
function have been omitted for brevity in the example below:
pragma solidity >=0.7.0 <0.9.0; contract EnumStorage { enum E {t1, . . . ,t35} E e1; . . . E e35; function modify() public { e1 = E.t1; . . . e35 = E.t35; } }
For enums
of more than 32 elements, the next available slot n
is used to store the ith
group of up to 32 elements, for 0 ≤ n
,i
≤ 2256-1:
Enums | |
---|---|
Slot Key | Slot Value |
. . . | . . . |
n+(i-1) | i-1th set of 32 elements, each represented as a lower-ordered single byte. |
n+i | ith set of up to 32 elements, each represented as a lower-ordered single byte, padded with zeros if ≤ 31 elements. |
Storage will resemble the following after executing modify()
in this contract.
"key": "0x00 . . . 00", // slot 0x00…00
"value": "0x1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100",
"key": "0x00 . . . 01", // slot 0x00…01
"value": "0x0000000000000000000000000000000000000000000000000000000000222120"
Structs
In Solidity, structs
can include elements of different data types. Like fixed-size arrays, the elements of struct state variables are allocated slots sequentially based on declaration order and type. Contract developers can save considerably on gas fees by declaring the elements of state variable structs to maximize the amount of data stored per slot13. This is demonstrated in the following contract:
pragma solidity >=0.7.0 <0.9.0; contract StructStorage { struct S1 { uint128 a; uint256 b; uint128 c; } struct S2 { uint128 d; uint128 e; uint256 f; } S1 expensive_struct; S2 cheaper_struct; function modify() public { expensive_struct.a = 1; expensive_struct.b = 1; expensive_struct.c = 1; } function modify2() public { cheaper_struct.d = 1; cheaper_struct.e = 1; cheaper_struct.f = 1; } }
EIP-2929 and Access Sets
According to Remix, the cost of executing modify()
was reported as 66588 gas, whereas modify2()
incurred only 44760 gas. This is because assigning values to all expensive_struct
’s members requires 3 storage slots, while cheaper_struct
only requires two.
Struct | Execution Cost (gas) | Storage Changes After Execution |
expensive_struct | 66588 |
"key": "0x00…00", |
cheaper_struct | 44760 |
"key": "0x00…03", |
This is because the first two elements of struct S2
, which are of type uint128
and thus 16 bytes each, can be packed into the same slot, whereas the second element of struct S1
is a full 32 bytes, and therefore requires that each element of this type is stored in its own slot.
Writing non-zero values to “untouched” storage slots is more expensive than writing to a slot that has already been written to. This can be confirmed by deploying the StructStorage
contract and executing the modify2()
function. Debug the transaction, and step through until the first call to SSTORE
. Observe that 20000 gas is required to write to slot 0 in the Step Details window.
Stepping through to the next SSTORE
operation, which operates on the same storage slot, will instead require only 100 gas.
This is because the Berlin hard fork of the Ethereum network in April 2021 brought various changes to the network. One such change, EIP-292914, increased the gas cost of accessing “cold” (not already accessed) storage slots and account addresses relative to “warm” (previously accessed) ones. To differentiate between cold slots/addresses and ones that have been “warmed up” by means of prior access, the concept of Access Sets15 were introduced.
The motivation behind this was that storage-accessing opcodes were considered underpriced in terms of gas execution fees. The 2016 Shanghai DDOS attack against the Ethereum network exploited this, as the low gas fee of the EXTCODESIZE
opcode was abused16 to force nodes and miners to read state information from disk, thus congesting the network. The specific calculations for warm or cold reads and writes are best covered in the most recent version of the Ethereum Yellow Paper17.
Appendix G lays out the gas costs for common operations, including accessing warm and cold slots/addresses via Gwarmaccess and Gsset respectively. Note that writing zero values to any storage is still treated as if accessing a warm storage slot, as all storage slots are considered to be logically zero-initialized by default18.
Dynamic Variable Storage
The state variables demonstrated so far were all fixed-size types. With fixed-size state variables, the EVM can know ahead of time where and how to best store them according to packing rules. This is not possible for dynamic sized types such as mappings and dynamic-sized arrays. Storage of dynamic-sized array state variables are demonstrated in the following contract.
Dynamic arrays
pragma solidity >=0.7.0 <0.9.0; contract DynamicArray { uint256[] ints; uint256[][] int_ints; function modify() public { ints.push(0xaa . . . aa); ints.push(0xbb . . . bb); int_ints.push(ints); int_ints.push(ints); int_ints.push(ints); } }
Dynamic arrays are stored using a similar hash-of-hashes trick as with string and byte types. They are primarily referenced by a “base” slot identifier that is continually updated with the number of elements currently assigned to a dynamic array. The slot corresponding to the keccak256
hash of this base slot identifier then holds the first element of the array. For all subsequent elements, the inner hash is incremented by one, and a hash of that value is taken.
This process is summarized below, storing the number of elements in a one-dimensional dynamic array at the n
th storage slot, with the i
th element in the array being stored at the keccak256(n)
th storage slot, for 0 ≤ n
,i
≤ 2256-1:
One-Dimensional Dynamic Array Storage | |
---|---|
Slot Key | Slot Value |
n | The number of elements in the array so far |
. . . | . . . |
keccak256(n)+(i-1) | The i-1th element in the array |
keccak256(n)+i | The ith element in the array |
In this simple example, storage will resemble the following once modify()
has been executed:
"key": "0x0000. . . 0000", // slot 0x00…00
"value": "0x0000. . . 0002",
"key": "0x290d . . . e563", // slot keccak256(0x00…00)
"value": "0xaaaa . . . aaaa",
"key": "0x290d . . . e564", // slot keccak256(0x00…00) + 0x00…01
"value": "0xbbbb . . . bbbb",
. . .
Multidimensional dynamic arrays are stored via the same rule applied recursively. For a two-dimensional dynamic array, the nth storage slot contains the number of sub-arrays in the two-dimensional array stored so far. The index of the i
th sub-array is stored at the keccak256(n)+i
th storage slot, and the j
th element of the i
th sub-array is stored at the keccak256(keccak256(n)+i)+j
th storage slot, for 0 ≤ n
,i
,j
≤ 2256-1:
Two-Dimensional Dynamic Array Storage | |
---|---|
Slot Key | Slot Value |
n | Number of elements in the outermost array so far |
. . . | . . . |
keccak256(n)+(i-1) | i-1th sub-array |
. . . | . . . |
keccak256(keccak256(n)+(i-1))+(j-1) | j-1th element in the i-1th sub-array |
keccak256(keccak256(n)+(i-1))+j | jth element in the i-1th sub-array |
keccak256(n)+i |
ith sub-array |
. . . | . . . |
keccak256(keccak256(n)+i)+(j-1) | j-1th element in the ith sub-array |
keccak256(keccak256(n)+i)+j |
jth element in the ith sub-array |
This is repeated for the second and third arrays within int_ints
, demonstrated in the storage representation below:
. . .
"key": "0x00 . . . 01", // slot 0x00…01
"value": "0x00 . . . 03"
"key": "0xb10e . . . 0cf6", // slot keccak256(0x00…01)
"value": "0x00 . . . 02"
"key": "0xb5d9 . . . 7d22", // slot keccak256(keccak256(0x00…01))
"value": "0xaa . . . aa"
"key": "0xb5d9 . . . 7d23", // slot keccak256(keccak256(0x00…01)) + 0x00…01
"value": "0xbb . . . bb"
"key": "0xb1e2 . . . 0cf7", // slot keccak256(0x00…01) + 0x00…01
"value": "0x00 . . . 02",
"key": "0xea78 . . . bd31", // slot keccak256(keccak256(0x00…01) + 0x00…01)
"value": "0xaa . . . aa",
"key": "0xea78 . . . bd32", // slot keccak256(keccak256(0x00…01) + 0x00…01) + 0x00…01
"value": "0xbb . . . bb",
"key": "0xb1e2 . . . 0cf8", // slot keccak256(0x00…01) + 0x00…02
"value": "0x00 . . . 02",
"key": "0xb327 . . . e02b", // slot keccak256(keccak256(0x00…01) + 0x00…02)
"value": "0xaa . . . aa",
"key": "0xb327 . . . e02c", // slot keccak256(keccak256(0x00…01) + 0x00…02) + 0x00…01
"value": "0xbb . . . bb"
Aside - Deterministic Pitfalls – Storing Private Data
The Storage region provides a deterministic way of storing state data in the EVM. For instance, the exact storage slot of an m-dimensional array can be predicted before a contract implementing the array is even deployed. For an m-dimensional array, storage logic can be generalized by applying the storage heuristic of a two-dimensional array recursively. This is shown via the following expression for the j
th element in the i
th index of the m-dimensional array, declared starting at slot n
, where 0 ≤ n
,m
,i
,j
≤ 2256-1:
keccak256(keccak256(...(keccak256(keccak256(n)+i[1])+i[2])...+i[m-1])+i[m])+j
For a sufficiently-populated, 4-dimensional dynamic array arr[][][][]
, the base of which is stored at slot 3, the slot key for the element at arr[1][0][8][1]
can therefore be calculated as follows, and confirmed by viewing the fourth-to-last slot key written after executing the modify()
function of this contract:
keccak256(keccak256(keccak256((keccak256(0x00…03)+ 0x00…01))+ 0x00…00)+ 0x00…08)+ 0x00…01 = 0xb8928d09db2f3fc6a2c8bd4dafbdf7cd5aa6c337f2c2fad8d85a5e908c8ddf49
Any data in a contract’s Storage region is publicly available using block explorers or web3 clients. For example, the contents of storage slot 0 of the live Wrapped Ethereum (WETH) contract19 can be quickly queried using the cast
20 web3 CLI client from the Foundry development suite, given a free-tier Infura21 API access or another RPC endpoint:
$ cast storage 0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2 0 --rpc-url 'https://mainnet.infura.io/v3/<API-KEY>'
0x577261707065642045746865720000000000000000000000000000000000001a
While this data is simply the string Wrapped Ether
, subject to the previously described string storage logic, this is why storing sensitive data within a contract’s Storage region constitutes a security risk as contract storage is publicly available to all.
Mappings
Mappings in Solidity are similar to hash tables or dictionaries in other languages, the storage of which is demonstrated by this contract:
pragma solidity >=0.7.0 <0.9.0; contract Mappings { struct S { uint256 a; uint256 b; } mapping(uint256 => uint256 ) simple_map; mapping(uint256 => S) struct_map; mapping(uint256 => mapping(uint256 => S)) nested_map; function modify() public { simple_map[0] = 0x11 . . . 11; simple_map[1] = 0x22 . . . 22; struct_map[0] = S(0xaa . . . aa, 0xbb . . . bb); struct_map[1] = S(0xcc . . . cc, 0xdd . . . dd); nested_map[0][1] = S(0xee . . . ee, 0xff . . . ff); } }
Mapping values are also initialized to zero by default, meaning that mappings are assigned values during declaration. They may only be stored in storage, and the keccak256
hash of keys are stored in place of the literal keys. This is why it is not possible to directly query the length of a mapping in Solidity 22.
Mapping slot keys are derived by concatenating the storage slot key of the mapping itself (n)
and the index of the key where the element is to be stored (i)
, for all 0 ≤ n
,i
≤ 2256-1:
Mapping Storage | |
---|---|
Slot Key | Slot Value |
keccak256(i-1 . n) | ith element of mapping at storage slot n |
This can be confirmed by debugging the call to modify()
and observing the storage after simple_map[0]
is assigned a value of 0x11..11
. This value will be stored at storage slot keccak256(0x00..00 . 00..00) = 0xad32 . . . 5fb5
.
"key": "0xad32 . . . 5fb5", // slot keccak256(0x00…00 . 00…00)
"value": "0x1111 . . . 1111", // value of simple_map[0]
"key": "0xada5 . . . 9e7d", // slot keccak256(0x00…01 . 00…00)
"value": "0x2222 . . . 2222", // value of simple_map[1]
. . .
Mapping values like structs or fixed-length arrays combines the storage heuristic for structs and mappings. State variables of this particular type are stored by incrementing this slot key for each element in the value, and storing element values at these incremented slot keys. This can be observed by stepping through line 17, where the struct S(0xa…,0xb…)
is assigned as the first value in struct_map
:
. . .
"key": "0xa6ee . . . cb49", // slot keccak256(0x00…00 . 00…01)
"value": "0xaaaa . . . aaaa", // value of struct_map[0].a
"key": "0xa6ee . . . cb4a", // slot keccak256(0x00…00 . 00…01) + 0x00…01
"value": "0xbbbb . . . bbbb", // value of struct_map[0].b
. . .
Like dynamic arrays, nested mappings are stored recursively. However, since mapping keys are never stored, a somewhat simpler mapping heuristic is used. This involves taking the hash of the outermost mapping, and then taking a hash of this value concatenated with the index of the innermost mapping.
This is summarized below, where:
i
is the index of within the outermost mapping.j
is the index in the innermost mapping.n
is the next available storage slot.- Since
nested_map
storesstructs
, the value ofm
is added to the hash as an index for the element within the struct.
Nested Struct Mapping Storage | |
---|---|
Contents of Slot Key | Value |
keccak256(j . keccak256(i . n)) + k | Value at struct element k at innermost mapping i, at outermost mapping j, stored at slot n |
nested_map
is takes up slot 2, and stores an instance of struct S(0xe…,0xf…)
at index 1 of its innermost mapping. Therefore, the first element in the struct will be stored at key keccak256(0x…01 . keccak256(0x0…0 .0x0…2)) = 0x79c0…d89d
, and the second will be stored at this value incremented by one.
. . .
"key": "0x79c0 . . . e89d", // slot keccak256(0x0…01 . keccak256(0x00…00 .0x00…02))
"value": "0xeeee . . . eeee", // value of nested_map[0][1].a
"key": "0x79c0 . . . e89e", // slot keccak256(0x0…01 . keccak256(0x00…00 .0x00…02)) + 0x00…01
"value": "0xffff . . . ffff" // value of nested_map[0][1].b
Data Location Keywords and Reference Types
Solidity supports three data location keywords: storage
, memory
, and calldata
, which can be used to explicitly provide the data location within the EVM execution context, where certain variables are stored within functions. In all examples thus far, variables have been declared outside of functions and are thus state variables stored in the Storage region.
Whether or not variables require being prefixed with data location keywords depends on whether the variable’s type is considered a reference or value type23. Value types are types that do not store any data other than their own, while reference types store the data of other variables.
Structs, arrays, and mappings are therefore reference types, and almost all others are considered value types. Strings are also considered reference types in most situations. Declaring state variables is the only place where explicitly specifying data location can be omitted. Reference types declared inside functions must be prefixed with a data location keyword. It is currently not possible to declare data locations for non-state variable that are of value types within functions.
The storage
keyword essentially produces a pointer to an already-declared state variable, referred to as a storage pointer. It is not possible to declare new instances of state variables within a function; the storage keyword cannot be used unless a state variable has already been declared. When updating the storage pointer, the original state variable value will also be updated. Therefore, the storage keyword state variable allows for variables to be passed by reference if used in function arguments.
The memory
keyword can be used to specify that reference types declared within functions are to be stored in memory. When used in a function argument, it results in an in-memory copy of the passed argument being created, which will be discarded at the end of the function call if not returned or persisted elsewhere.
One use case for the memory keyword is to save on gas by limiting the amount of storage read/write operations. For example, by declaring a reference type as a state variable, initializing it, and then creating an in-memory copy of this state reference type. The in-memory copy can be manipulated with little to no SSTORE
/SSLOAD
operations. After the variable has been processed, it can be assigned back to the original state variable. This is particularly economical with multiple iterations of the same processing. Like “warm” storage, it is cheaper to write to already-allocated memory.
The calldata
keyword specifies that function arguments are to be stored in a temporary yet immutable region. It can only be specified for function arguments. It is only available for external, read-only function call parameters, however. This is primarily used for data that is sent with the call to the contract and doesn't need to be changed during the execution of the function. Using calldata
is cheaper than both memory
and storage
. Check the official Solidity documentation for more information regarding data locations and how best to use them to write gas-efficient contracts.
This concludes part 2 of this series. In the third and final part of this series, we will take a detailed look at proxy contracts, inter-contract message calls, and their effects on storage. We will also look at how storage mismanagement and insecure message calls have resulted in major DeFi breaches, and how to limit the risk of such issues.
References:
- https://docs.soliditylang.org/en/
- https://ethereum.org/en/developers/
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://www.evm.codes/?fork=shanghai
Citations
- https://github.com/ethereum/go-ethereum/blob/41f89ca94423fad523a4427c9af2c327fe344fe2/core/state/state_object.go#L40
- https://github.com/ethereum/evmc/blob/58e16bf6a1be9a5f445b2dd443f018e55218e777/examples/example_host.cpp#L26https://github.com/ethereum/evmc/blob/58e16bf6a1be9a5f445b2dd443f018e55218e777/examples/example_host.cpp#L26
- https://www.evm.codes/#54
- https://www.evm.codes/#55
- https://www.evm.codes/#54
- https://ethereum.org/en/developers/docs/data-structures-and-encoding/patricia-merkle-trie/
- https://bioinformatics.cvr.ac.uk/trie-data-structure/
- https://www.investopedia.com/terms/m/merkle-tree.asp
- https://www.allisons.org/ll/AlgDS/Tree/PATRICIA/
- https://www.coinbase.com/bitcoin.pdf
- https://docs.soliditylang.org/en/latest/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://docs.soliditylang.org/en/v0.8.20/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://docs.soliditylang.org/en/v0.8.20/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://eips.ethereum.org/EIPS/eip-2929
- https://www.evm.codes/about#accesssets
- https://blog.ethereum.org/2016/09/22/ethereum-network-currently-undergoing-dos-attack
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://ethereum.org/en/developers/tutorials/yellow-paper-evm/#91-basics
- https://etherscan.io/address/0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2#code
- https://book.getfoundry.sh/cast/
- https://www.infura.io/pricing
- https://docs.soliditylang.org/en/latest/types.html#mapping-types
- https://docs.soliditylang.org/en/v0.8.20/types.html#reference-types
The Ethereum Virtual Machine (EVM) functions as the sandboxed compatibility layer used by the thousands of nodes that constitute the Ethereum distributed state machine, ensuring that smart contracts are executed deterministically in a platform-independent environment.
The EVM operates on the execution layer of the Ethereum protocol, and all operations performed within the virtual machine are recorded on the blockchain and can be verified by any node in the network. This allows for full transparency and immutability of the data processed by the EVM, ensuring the integrity of the network. In this post, we will look at the inner workings of the EVM in detail.
In this three-part series, we will first provide an overview of volatile memory management, function selection, and potential vulnerabilities that may arise from insecure bytecode execution within the EVM. In Part 2, we will dive deeper into persistent EVM storage, exploring how data is stored and retrieved by smart contracts. Finally, Part 3 will focus on message calls in Solidity and the EVM, and vulnerabilities that stem from insecure message calls and storage mismanagement.
Ethereum Network & Smart Contract Recap
Upon receiving incoming transactions, Ethereum network validators first validate and verify them to ensure they are valid and properly signed. Validators then execute the smart contract code contained in transactions to verify the correctness and consistency of the results. If the output is valid, validators propagate the transaction output to additional validators to reach consensus.
Pending transactions are stored in lists called mempools, where they sit until they are added to a block. Once consensus is reached, valid transactions from the mempools are added to the next block to be added to the blockchain. Once the block is added to the blockchain, the transactions it contains are considered final and other nodes in the network can rely on the state of the blockchain to perform further transactions or execute smart contracts1.
Smart contract execution occurs in the EVM implementations running on Ethereum network validators. The EVM is not an actual virtual machine that you would launch via VMware or similar. It is more akin to the Java Virtual Machine in that it primarily functions to translate high-level smart contract code into bytecode for the purposes of portable execution. Common EVM implementations include the Golang-based geth2 client and the Python-based py-evm client3. EVM smart contracts are usually written in high-level domain-specific languages such as Solidity and Vyper4.
Below is a Solidity smart contract representing a short CTF challenge that we will follow through its execution in the EVM. It has at least two significant vulnerabilities, so please do not use this contract for anything other than educational purposes. The goal of this challenge is to take ownership of the contract via exploiting the two issues:
DodgyProxy.sol
pragma solidity >=0.7.0 <0.9.0; contract DodgyProxy { address public owner; constructor() { owner = msg.sender; } modifier onlyOwner { require(msg.sender == owner, "not owner!"); _; } function deleg() private onlyOwner { address(msg.sender).delegatecall(""); } struct Pointer { function () internal fwd; } function hitMe(uint offset) public { Pointer memory p; p.fwd = deleg; assembly { mstore(p, add(mload(p), offset)) } p.fwd(); } function inc(uint _num) public pure returns (uint) { return _num++; } }
These issues have been introduced because they are a great way of getting to grips with EVM memory quirks. The rest of this article will center around this contract as it is executed through the EVM. While one of the issues is well-documented, the original inspiration for the second issue can be traced back to a CTF challenge deployed to the Ropsten testnet in June 20185 by Reddit user u/wadeAlexC6.
Bytecode
At its core, the EVM is a stack-based virtual machine that operates on bytecode derived from higher level languages used to write EVM-compatible smart contracts. Below is the bytecode for the example contract, compiled with version 0.8.17 of the Solc Solidity compiler with default optimizations7:
608060405234801561001057600080fd5b50600080546001600160a01b031916331790556101e4806100326000396000f3fe608060405234801561001057600080fd5b50600436106100415760003560e01c806373c768d714610046578063812600df1461005b5780638da5cb5b14610081575b600080fd5b61005961005436600461016e565b6100ac565b005b61006e61006936600461016e565b6100cf565b6040519081526020015b60405180910390f35b600054610094906001600160a01b031681565b6040516001600160a01b039091168152602001610078565b60408051602081019091526100e282018082526100cb9063ffffffff16565b5050565b6000816100db81610187565b5092915050565b6000546001600160a01b0316331461012d5760405162461bcd60e51b815260206004820152600a6024820152696e6f74206f776e65722160b01b604482015260640160405180910390fd5b60405133906000818181855af49150503d8060008114610169576040519150601f19603f3d011682016040523d82523d6000602084013e505050565b505050565b60006020828403121561018057600080fd5b5035919050565b6000600182016101a757634e487b7160e01b600052601160045260246000fd5b506001019056fea26469706673582212202e788aebe90e09dbd8b71fced8e6c375f7d2d2989dc683afe753f805e0aa823864736f6c63430008110033
Let’s see how this bytecode ends up looking as it starts moving through EVM’s architecture, by first looking at some key EVM data structures.
EVM Architecture
Smart contract execution occurs inside the EVM instances of Ethereum network validators. When a smart contract starts execution, the EVM creates an execution context that includes various data structures and state variables that are described below. After execution has finished, the execution context is discarded, ready for the next contract. Here is a high-level overview of an EVM execution context:
Stack
The EVM’s stack serves as Last-In-First-Out (LIFO) data structure for storing temporary values during the execution of a smart contract. As of writing, the stack operates with a maximum of 1024 32-byte elements8. These elements may include control flow information, storage addresses, and the results and parameters for smart contract instructions.
The main instructions that operate on the stack are the PUSH
opcode variants, the POP
opcode, as well as the DUP
and SWAP
opcode variants. These allow elements to be added, removed, duplicated, and swapped on the stack respectively.
Code
The code region stores a contract’s bytecode in its entirety. This region is read-only.
Storage
The Storage region is a persistent key-value store. Keys and values are both 32-byte slots where permanent but mutable data that forms part of the contract's state is stored. Data in Storage is persistent in that it is retained between calls. This includes state variables, structs and local variables of structs. Possible uses for the Storage area include storing and providing access to public data such as token balances, and giving libraries access to storage variables. However, contracts cannot arbitrarily access each other’s Storage locations. The relevant opcodes for operating on Storage are the SSTORE
and SLOAD
opcodes, which allow writing and reading 32 bytes to and from Storage respectively.
Memory
The Memory region is a volatile, dynamically sized byte array used for storing intermediate data during the execution of a contract’s functions. It is akin to allocated virtual memory in classic execution contexts. More specifically, the Memory section holds temporary yet mutable data necessary for the execution of logic within a Solidity function. At a low-level, the MLOAD
and MSTORE
opcode variants are responsible for reading and writing to Memory respectively. Like Storage, data in the Memory section can be stored and read in 32-byte chunks. However, the MSTORE8
opcode can be used to write the least significant byte of a 32-byte word9.
Calldata
The Calldata region is like the Memory region in that it is also a volatile data store, however it instead stores immutable data. It is intended to hold data sent as part of a smart contract transaction10. Therefore, data stored here can include function arguments and the constructor code when creating new contracts within existing contracts. The CALLDATALOAD
, CALLDATASIZE
and CALLDATACOPY
opcodes can be used to read Calldata at various offsets. Calldata is formatted in a specific way so that specific function arguments can be isolated from the calldata. We will go into this format in more detail later.
As data stored here is immutable, function arguments that are of simple data types such as unsigned integers are automatically copied over to Memory within a function, so that they can be modified. This does not apply to strings, arrays and maps, which need to explicitly be marked with memory
or storage
in function arguments, depending on whether they are to be modified during a function’s execution11.
Program Counter
The Program Counter (PC) is similar to the RIP register in x86 assembly, in that it points to the next instruction to be executed by the EVM. The PC will usually increase by one byte after an instruction has been executed. Exceptions to this include the JUMP opcode variants, which relocate the PC to positions specified by data at the top of the stack.
Global Variables
The EVM also keeps track of special variables in the global namespace. These are used to provide information about the blockchain and current contract context. There are quite a few global variables12, but you might recognize some of the following global variables from our smart contract example:
msg.sender
– the address of the sender of the current call.msg.value
– the value in Wei sent with the current call.msg.data
– the current calldata.tx.origin
– the original external account that started a transaction chain.
Return Data
The Return Data section stores the return value of a smart contract call. It is read and written to by the RETURNDATASIZE
/RETURNDADACOPY
and RETURN
/REVERT
opcodes respectively.
Gas & Beating the Solidity Compiler
Every opcode in the EVM has an opportunity cost associated with its execution. This is measured in “gas”. It is important to note that gas is not the same as Ether, in that it cannot be directly bought and sold as a native cryptocurrency. However, it is paid in Gwei (1 Gwei = 10-9 Ether). It is simply a unit of measurement for the work the EVM must do to execute a particular instruction.
Gas also exists to incentivize efficient smart contract code. To save on gas, Solidity developers sometimes write EVM assembly themselves mid-contract, by dropping into an intermediate language called Yul13. Yul is similar to literal EVM assembly; however, it allows for additional control flow (loops, conditional statements, etc.) so developers do not have to PUSH and POP their way up and down the stack. This is quite common, because Solidity is not yet a very well-established language and therefore Solidity compiler implementations are not as efficient as they could be.
The EVM execution context takes into account gas limits set for the current transaction and the gas costs for executing each opcode. Gas and gas management are complex topics in EVM development, as smart contract end users are the ones who bear the brunt of gas costs during execution. Cheaper gas costs tend to incentivize users to choose one contract over another.
For further information on gas, Section 5 of the Ethereum Yellowpaper14 is highly recommended.
In any case, while it is often necessary to drop into assembly/Yul for the sake of gas efficiency, it comes as no surprise that doing so the wrong way can have some interesting security implications. Specifically, writing contract logic in Yul/assembly can bypass some significant access control mechanisms only implemented in the higher-level Solidity, which we will demonstrate toward the end of this article.
Application Binary Interfaces
To map high-level Solidity to bytecode, the Solidity compiler generates an intermediate data structure known as an Application Binary Interface (ABI) from the contract. ABIs serve a similar role as APIs do for exposing methods and structures necessary to interact with back-end application services. Below is the ABI for our example contract:
[ { "inputs": [ { "internalType": "uint256", "name": "offset", "type": "uint256" } ], "name": "hitMe", "outputs": [], "stateMutability": "nonpayable", "type": "function" }, { "inputs": [], "stateMutability": "nonpayable", "type": "constructor" }, { "inputs": [ { "internalType": "uint256", "name": "_num", "type": "uint256" } ], "name": "inc", "outputs": [ { "internalType": "uint256", "name": "", "type": "uint256" } ], "stateMutability": "pure", "type": "function" }, { "inputs": [], "name": "owner", "outputs": [ { "internalType": "address", "name": "", "type": "address" } ], "stateMutability": "view", "type": "function" } ]
ABIs are comprised of the following elements:
name
: defines function names.type
: defines the type of function. This is necessary to differentiate between regular functions, constructors, and specialized function types such asreceive
andfallback
.inputs
: an array of objects that themselves define argument names and types.- Note that both types and
internalTypes
are defined. This is because there are subtle differences in the way that certain data types are referenced in Solidity versus ABIs. For example, an input of type struct in Solidity would have aninternalType
as tuple15.
- Note that both types and
outputs
: similar to inputs, the Output array includes objects that denote function return value names and types.stateMutability
: denotes any function mutability attributes such aspure
andview
. These are needed to ascertain whether the function is intended to modify on-chain data, or is simply a getter function that returns existing values.payable
andnonpayable
modifiers are also denoted here.
Data derived from an ABI is necessary to encode function calls in a low-level way that can be parsed by the EVM, which is referenced in the Calldata region in a contract’s execution context.
EVM Function Selection & Calldata
Let’s say we have compiled and deployed our bytecode to a testnet, and now we want to call a function exposed by the contract’s ABI. To do this, the EVM formats function arguments into calldata. Calldata is a standard way of representing function calls, and it is referenced in a transaction via the msg.data
global variable.
Calldata is comprised of the following:
- A function selector
- Optional function arguments
Contracts expose and identify public functions by means of function selectors. Function selectors are (mostly) unique 4-byte identifiers that allow the EVM to locate and call function logic as it is represented in bytecode.
By “public functions,” we mean functions that are denoted with the public function visibility keyword in Solidity. To recap, below are the available visibility keywords for functions and state variables:
public
: function is visible to contract itself, derived contracts, external contracts, and external addresses.private
: function is only visible to contract itself.internal
: function is visible to contract itself and derived contracts.external
: function is visible to external contracts and external addresses. Not available for state variables.
Function arguments are similarly encoded along with the function selector so complete call data can be encoded. Most data types are encoded in discrete 32-byte chunks. Note that 32 bytes is the minimum; simple types like uint
and bool
arguments will result in 32 bytes each, whereas string
, byte
and array types are encoded according to their length and whether they are fixed or dynamically sized.
Consider the following function definition, called like foo(16,7)
:
function foo(uint64 a, uint32 b) public view returns (bool) {}
To derive calldata, the EVM does the following:
- Take the first 4 bytes of the Keccak256 hash of ASCII representation of a function, ignoring the argument variable names. This representation of a function is called a function signature.
- e.g.
keccak256(“foo(uint64,uint32)”)
→0xb9821716
- For the call data, function arguments are added to the hash of the function signature, after being padded by 32 bytes. E.g.
uint64 16
→0x00…..10
uint32 7
→0x00…..07
- Altogether, the foo function selector and calldata would be:
0xb9821716000000000000000000000000000000000000
0000000000000000000000000010000000000000000000
0000000000000000000000000000000000000000000007
We can confirm this with the following short contract, which uses the abi.encodeWithSignature
method to produce the function selector given the foo()
function signature and its arguments. It will then emit the result (enc
) in an event called Encoded
. Events are a means to include additional details in transaction logs, and are useful for granularly committing specific occurrences on-chain:
pragma solidity >=0.7.0 <0.9.0; contract AbiEncodeTest { event Encoded(bytes); function GetCallData() public { bytes memory enc = abi.encodeWithSignature("foo(uint64, uint32)", 16, 7); emit Encoded(enc); } }
We will compile and deploy this contract, and then call the GetCallData
method with the Brownie development environment16:
Events In This Transaction -------------------------- └── AbiEncodeTest (0x9E4c14403d7d9A8A782044E86a93CAE09D7B2ac9) └── Encoded └── : 0xb982171600000000000000000000000000000000000000 00000000000000000000000010000000000000000000000000000000000000 0000000000000000000000000007
It should also be noted that public state variables are also given their own selectors and are treated as getters by the compiler17. For instance, a public state variable named owner will have a selector exposed as keccak256(“owner()”) = 0x8da5cb5b
. It can be accessed from other contracts that import DodgyProxy as DodgyProxy.owner()
.
If the bytecode of the AbiEncodeTest and DodgyProxy contracts were compared, a common section of bytecode will be present, shown below in green. The 4-byte function selectors for the GetCallData()
and hitMe()
functions immediately follow this common section of bytecode:
AbiEncodeTest: …60003560e01c806301cc20f114602d57…
DodgyProxy: …60003560e01c806373c768d71461004657…
This bytecode represents the function selection logic that the EVM uses to identify public functions. The following opcodes are responsible for function selection in DodgyProxy:
026→ 60→ PUSH1 0x00
028→ 35→ CALLDATALOAD
029→ 60→ PUSH1 0xe0
031→ 1C→ SHR
032→ 80→ DUP1
033→ 63→ PUSH4 0x73c768d7 // hitMe(uint256) public function selector
038→ 14→ EQ
039→ 61→ PUSH2 0x0046
042→ 57→ *JUMPI
043→ 80→ DUP1
044→ 63→ PUSH4 0x812600df // inc(uint256) public function selector
049→ 14→ EQ
050→ 61→ PUSH2 0x005b
053→ 57→ *JUMPI
054→ 80→ DUP1
055→ 63→ PUSH4 0x8da5cb5b // owner getter function selector
060→ 14→ EQ
061→ 61→ PUSH2 0x0081
064→ 57→ *JUMPI
Let’s follow execution of this bytecode snippet, given calldata for the following function call. Below is the call we will make to inc()
with an argument of uint256
of 1:
inc(1);
Running this function call through the GetCallData
function in the AbiEncodeTest contract will result in the following calldata. This can also be derived manually following the same method we went through earlier:
812600df0000000000000000000000000000000000000000000000000000000000000001
If you want to follow along, smlXL18 has a brilliant EVM playground, where bytecode execution can be modeled for learning and debugging purposes. You may follow along with the function selection logic bytecode here, given calldata for a call to inc(uint256)
with an argument of 1.
First, the PUSH1
opcode pushes a value of 0x00
to the stack. This value functions as an offset for the next opcode, CALLDATALOAD
:
CALLDATALOAD
loads the first 32 bytes of the msg.data
global variable to the stack. Here we can see that the function selector for inc(uint256)
is included in these first 32 bytes:
However, the EVM now needs to parse out the 4-byte function selector from the rest of the calldata chunk. The EVM currently carries this out by bitshifting the calldata chunk to the right by some amount to be left with only the 4-byte function selector. Since the calldata chunk is 256 bits (32 bytes), it needs to be shifted right by 224 bits (0xE0
in hex).
This is done by first pushing 0xE0
to the stack with PUSH1
, and shifting right by SHR
. After the shift, the offset will be popped, leaving the clean function selector as the sole element in the stack:
Selecting a Function
To understand the significance of the next few bytecodes, it is helpful to know that the EVM compares function selectors as a kind of switch statement, where function selectors are piled on top of each other on the stack before being compared to the function selector below it. In doing so, the EVM is attempting to discover if incoming calldata is targeting a valid function by parsing them sequentially looking for a match.
This is why the DUP1
opcode is used to duplicate the function selector on the stack before the next function selector is pushed to the stack with the PUSH4 opcode. Note that this function selector is not derived from any calldata; it is the first function selector to be placed on the stack by the EVM itself. Here, the function selector for the hitMe(uint256)
(0x73c768d7
) function will eventually be pushed:
The EQ opcode is then used to compare the last two items on the stack. If a match is found, the last two items are popped off the stack and a value of 1 replaces them as a positive comparison result. If not, then a value of 0 replaces them instead.
Here, a value of 0 will be pushed to the stack as 0x812600df
and 0x73c768d7
do not match. Note that the original function selector derived from our calldata still remains below the comparison result, ready to eventually be compared to the next function selector:
Now that a result has been found, a decision needs to be made as to whether to execute the function represented by the function selector. In this case, function execution will not occur yet because the calldata’s function selector does not match the first one encountered by the EVM.
However, the EVM presently still needs to make another comparison to decide whether to jump to the function selector’s corresponding function logic before it can start comparing the next function selector. This comparison is done by first pushing an offset to the start of function logic to the stack with the PUSH2
opcode.
The JUMPI
opcode is then used. Note that JUMPI
represents a conditional jump, whereas the JUMP
instruction only requires an offset to jump to and is used when an unconditional jump is necessary. The JUMPI
instruction first pops the offset and result of the EQ
comparison off the stack. If the comparison result is 1, then the program counter will be changed to the offset, which denotes the start of the function logic to be executed. Here, the comparison result is 0, so execution will continue without changing the program counter.
The previous bytecode is then more or less repeated from the DUP1
opcode: duplicating the calldata-derived function selector, pushing the next function selector to be compared on the stack, comparing the two function selectors, placing an offset to the next section of function logic, and then deciding whether to jump to said function logic depending on the outcome of the comparison.
This is shown below. This time, the inc(uint256)
function selector is compared, so execution will continue at offset 0x5B
(91) in the bytecode.
Jumpdests
If we follow execution past the function selection logic, the first opcode we land on will be the JUMPDEST
opcode, found at position 91 in the bytecode.
091
→
5B→
JUMPDEST
092→
61→
PUSH2 0x006e
095→
61→
PUSH2 0x0069
098→
36→
CALLDATASIZE
099→
60→
PUSH1 0x04
101→
61→
PUSH2 0x0178
104→
56→
JUMP
105→
5B→
JUMPDEST
Think of JUMP
and JUMPDEST
opcodes are opposite ends of a wormhole from one area in the bytecode to another. Every JUMP
opcode must have a corresponding “landing” JUMPDEST
for the jump to be valid. JUMPDESTs
remove the need to dynamically assess starting points for function logic after a jump has been taken.
It should also be noted that JUMPDESTs
do not only denote the start of function logic. In fact, there seems to be little bearing over the placement of function logic in low-level bytecode versus high level Solidity.
Aside - Function Signature Clashing
Note: OpenZeppelin has a great post on this rare but interesting EVM quirk.
Earlier, we referred to function selectors as “mostly” unique because it is not too uncommon to have two or more distinct functions with the same first four bytes of the keccak256 hash of the function name. The Ethereum Signature Database is a great example of this. The selector for the owner
state variable getter function is actually the same as the function selector for ideal_warn_timed(uint256,uint128)
, 0x8da5cb5b
19.
The Solidity compiler is sophisticated enough to notice function signature clashes, so long as the relevant functions are in a single contract. In theory however, function signature clashes are possible between distinct contracts, such as between an implementation contract and a well-known pattern known as a proxy contract20. Potential security risks associated with proxy contract usage will be covered in a subsequent post.
Private/Internal Functions
Only public
and external
functions will have function signatures created for them. Private and internal functions do not receive function selectors. For example, a function selector for deleg()
is not present in the bytecode. In fact, attempting to execute a private
or internal
function via Web3.py or similar will not result in any sort of access error. Rather explicitly, an exception will be raised as the function selector does not exist21:
import web3 from solcx import compile_source # make sure you have a network provider, Ganache is good for this w3 = web3.Web3(web3.HTTPProvider('https://127.0.0.1:8545')) compiled_sol = compile_source( ''' pragma solidity >=0.7.0 <0.9.0; contract DodgyProxy { address public owner; constructor() { owner = msg.sender; } modifier onlyOwner { require(msg.sender == owner, "not owner!"); _; } function deleg() private onlyOwner { address(msg.sender).delegatecall(""); } struct Pointer { function () internal fwd; } function hitMe(uint offset) public { Pointer memory p; p.fwd = deleg; assembly { mstore(p, add(mload(p), offset)) } p.fwd(); } function inc(uint _num) public pure returns (uint) { return _num++; } }''', output_values=['abi', 'bin']) contract_id, contract_interface = compiled_sol.popitem() w3.eth.default_account = w3.eth.accounts[0] Proxy = w3.eth.contract(abi=contract_interface['abi'], bytecode=contract_interface['bin']) # will succeed, as hitMe is public pub = Proxy.get_function_by_name("hitMe") print(pub) try: # will fail, as cannot find deleg priv = Proxy.get_function_by_name("deleg") except ValueError as v: print("%s: deleg" %v)
Running the above script will result in:
❯ python3 web3_check.py
<Function hitMe(uint256)>
Could not find any function with matching name: deleg
However, a lack of function selectors for private
/internal
functions does not mean that private function logic is inaccessible internally, as private function logic still needs to be executable by the contract.
At present, the following contract is valid Solidity, if somewhat redundant. A public function can make calls to private functions within the same contract and handle any results without issue:
pragma solidity >=0.7.0 <0.9.0; contract Pubpriv { function priv() private returns (uint) { return 1; } function pub() public returns (uint) { return priv(); } }
Going back to our DodgyProxy, a similar pattern might be clear now, starting from hitMe()
:
function deleg() private onlyOwner { address(msg.sender).delegatecall(""); } struct Pointer { function () internal fwd; } function hitMe(uint offset) public { Pointer memory p; p.fwd = deleg; assembly { mstore(p, add(mload(p), offset)) } p.fwd(); }
When executing the hitMe()
function as any account other than the contract owner, it is not possible to directly call the private deleg()
function from hitMe()
due to the custom onlyOwner
modifier, which prevents the private function logic from being executed if the owner state variable is not set to the caller’s address (msg.sender
).
However, a not-so subtle memory corruption vulnerability of sorts has been introduced in the assembly
block. If the hitMe()
function is given the right input, it is possible to end up in the middle of the deleg()
function and resume execution, despite the deleg()
function being inaccessible due to the onlyOwner
modifier, and restricted from being called externally due to the private
function visibility.
This is because function visibility, and indeed modifiers in general, only function as definitive access control in high-level Solidity. Meaning that if execution is somehow redirected to the middle of a private function, there is no kind of DEP/NX-like mechanism for preventing execution by unauthorized contexts.
The conditions necessary for this kind of flaw to be exploitable are unlikely to present themselves in the most cases. Such occurrences are not outside the realm of possibility however, especially considering how often Yul is used to write low-level EVM opcodes directly into contracts to encode operations in such a way as to save on gas fees versus the compiler’s own bytecode.
With that in mind, the general idea behind this flaw is that the Pointer
struct essentially functions as a “base” of sorts in memory, from which an offset can be added to. By specifying specific offsets, the struct’s fwd()
function can be used as a jump pad to various JUMPDESTs
, some of which would belong to function logic that would otherwise be prevented from being executed via conventional control flow in high-level Solidity. To understand more about how memory is referenced during bytecode execution, the Memory section of the EVM call context needs to be revisited in more detail.
More About Memory
As a dynamically sized byte array, the Memory area of the EVM call context can be read and written to in discrete 32-byte chunks. There is also the concept of “touched” and “untouched” memory, where newly written chunks of memory accrue increasing gas and storage costs22. Before new memory can be utilized, the EVM needs to retain a section of utility memory that is used to keep track of subsequent memory writes.
This utility memory is comprised of four main sections, starting from the beginning of the Memory array23:
- bytes
0x00-0x3f
: scratch space mainly used to store intermediate outputs from hashing operations such askeccak256()
. Note that this section is allocated as two 32-byte slots. - bytes
0x40-0x5f
: space reserved for the 32-byte “free memory pointer.” This is used as a pointer to unallocated memory for use during contract execution. - bytes
0x60-0x7f
: the 32-byte “zero slot,” which is used as a volatile storage space for dynamic memory arrays which should not be written to.
Initially, the free memory pointer points to position 0x80
in memory, after which point additional memory can be assigned without overwriting memory that has already been allocated to other data. As such, the free memory pointer also functions as the currently allocated memory size.
Exploiting the first vulnerability in the smart contract is contingent on manipulating the value of the free memory pointer to redirect control flow and eventually end up in the middle of the protected deleg()
function. It is useful to follow along with the execution of the DodgyProxy contract in the Remix IDE to understand the next sections.
First, compile the contract with an optimization value of 200 and a compiler version of 0.8.17. Changing optimization settings and compiler versions may increase or decrease the offsets slightly, but the general bytecode logic should remain consistent regardless of optimization. Optimization was chosen as the standalone Solc compiler enabled optimization by default24, whereas Remix does not enable optimization by default.
Deploy the contract using any account. Then, attempt to call the hitMe()
function with an argument of 0, using the same account that was used to deploy the contract (0x5B38Da6a701c568545dCfcB03FcB875f56beddC4
). This will result in a successful function call:
Then attempt to do the same as any account other than the contract owner. Here, account 0xAb8483F64d9C6d1EcF9b849Ae677dD3315835cb2
was used, representing the address of a malicious actor. The transaction will revert:
From the attacker’s account, make call to the hitMe()
function with an argument of 1 uint
and debug the failed transaction:
Execution starts with the values 0x80
and 0x40
being pushed to the stack:
The free memory pointer is then initialized by the MSTORE
operation, which stores the value 0x80
at memory location 0x40
.
Then, place a breakpoint at the initialization of the Pointer
struct on line 22 and continue execution:
Execution until this breakpoint will involve many operations, including the function selection logic described earlier. Once the breakpoint is reached, the stack is seen to contain some familiar values, namely the function selector for hitMe(uint256)
, and an argument of 1:
Place a second breakpoint at the start of the assembly
block on line 24 and continue execution. A value of 0xE2
is eventually pushed to the stack:
The third value on the stack, 0x01
, is then duplicated to the top of the stack by the DUP3
opcode.
The ADD
opcode then adds 0x01
to 0xE2
and stores 0xE3
on the stack. Note that the bytecode generated from the compiler will differ slightly from the exact assembly due to compiler optimizations and translations from Yul:
The last two arguments on the stack are then duplicated by the DUP1
and DUP3
opcodes, in preparation for the MSTORE
opcode to store value 0xE3
at memory location 0x80
:
This follows the logic of the assembly
block on line 24. Therefore, the value that was just stored at the free memory pointer is most likely the location in memory of the Pointer
struct first instantiated on line 26.
Repeating these debugging steps by sending calldata of hitMe(2)
will confirm this, as the free memory pointer will now store a value of 0xE4
, one more than our previous value owing to the incremented offset argument to hitMe()
:
Continuing execution with arbitrary arguments to the hitMe()
function will result in a failed transaction, as a JUMP will eventually be made to an invalid location in the bytecode. However, with granular control over the value at the free memory pointer, an offset can be calculated to bypass access control modifiers protecting the deleg()
function.
To do this, the contract’s bytecode can be examined for opcodes that correspond to sections of high-level Solidity that we would want to end up in. This is simple in this case, as the correct section of bytecode will have the only DELEGATECALL
opcode, since it is only called once in this contract:
301
5B→
JUMPDEST →
302
60→
PUSH1 0x40 →
304
51→
MLOAD →
305
33→
CALLER →
306
90→
SWAP1 →
307
60→
PUSH1 0x00 →
309
81→
DUP2 →
310
81→
DUP2 →
311
81→
DUP2 →
312
85→
DUP6 →
313
5A→
GAS →
314
F4→
DELEGATECALL →
However, recall from earlier that for control flow to be altered by means of a jump, a corresponding JUMPDEST
opcode must be present at the target location, just as with normal function selection. This is why the challenge was set up as a struct with an internal function.
Without a means to set a function call, it would be much more difficult — if not practically impossible — to jump to another section of bytecode to break normal control flow. For this reason, this technique is also only likely to work if there are no opcodes between the JUMPDEST
we land on and the target opcodes that interfere with execution.
In this case, there does not appear to be any opcodes that would prevent us from eventually executing the DELEGATECALL
operation. From here, the offset to take can easily be calculated by subtracting the location in memory of the Pointer struct from the offset in the contract bytecode of the nearest preceding JUMPDEST
from the DELEGATECALL
opcode.
In our bytecode, the target JUMPDEST
is found at offset 0x012D
(301) in our bytecode. The location of the Pointer
struct is at 0xE2
(226). This gives a difference of 0x4B
(75). This difference is added to the Pointer
struct base in subsequent bytecode, before a jump is made to the JUMPDEST
at 0x12D
:
We can confirm that the delegatecall
is then reached by stepping through by placing a breakpoint at the line with the delegatecall
and calling the hitMe()
function with an argument of 75. Execution will proceed to the delegatecall
:
At this point, we have successfully bypassed the onlyOwner
modifier. To complete the challenge however, we need to take ownership of the contract by abusing the delegatecall
to overwrite the owner
state variable. If you have a bit of experience with Solidity, this should be relatively simple.
That’s all for this article. In Part 2 of this installment, we will go into more detail on storage layout in the EVM. Part 3 will contain a detailed look at low-level contract calls, proxy contracts, and a few more examples of how DELEGATECALL
can be abused to subvert control flow and take ownership of contracts.
References
- https://ethereum.org/en/developers/docs/evm/
- https://noxx.substack.com/p/evm-deep-dives-the-path-to-shadowy
- https://medium.com/authio/solidity-ctf-part-2-safe-execution-ad6ded20e042
- https://www.evm.codes/
- https://cypherpunks-core.github.io/ethereumbook/13evm.html
Citations
- https://ethereum.org/en/developers/docs/transactions/
- https://github.com/ethereum/go-ethereum
- https://github.com/ethereum/py-evm
- https://docs.vyperlang.org/en/stable/
- https://web.archive.org/web/20220802005328/https://ropsten.etherscan.io/address/
0x727c1c8d4b190d208f3701f106f7301cb1a32f27#contracts - https://us.reddit.com/r/ethdev/comments/8td9xn/challenge_empty_the_contract_of_funds/
- https://docs.soliditylang.org/en/v0.8.17/using-the-compiler.html#optimizer-options
- https://docs.soliditylang.org/en/v0.8.17/introduction-to-smart-contracts.html#storage-memory-and-the-stack
- https://www.evm.codes/#53
- https://docs.soliditylang.org/en/v0.8.17/internals/layout_in_calldata.html#layout-of-call-data
- https://docs.soliditylang.org/en/v0.8.17/types.html#value-types
- https://docs.soliditylang.org/en/v0.8.17/cheatsheet.html#global-variables
- https://docs.soliditylang.org/en/latest/yul.html
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://docs.soliditylang.org/en/latest/abi-spec.html#mapping-solidity-to-abi-types
- https://github.com/eth-brownie/brownie
- https://docs.soliditylang.org/en/latest/contracts.html#state-variable-visibility
- https://blog.smlxl.io/
- https://www.4byte.directory/signatures/?bytes4_signature=0x8da5cb5b
- https://eips.ethereum.org/EIPS/eip-1167
- You will need a local Ethereum test net provider, along with the
web3.py
andsolcx
packages. - https://eips.ethereum.org/EIPS/eip-3336#changes-to-memory-expansion-gas-cost
- https://docs.soliditylang.org/en/v0.8.17/internals/layout_in_memory.html
- https://docs.soliditylang.org/en/latest/internals/optimizer.html
This is the second part of our series on Ethereum Virtual Machine (EVM) internals. Part 1, which can be found here, introduced the EVM call context and its architecture, followed by a deep dive into the non-persistent Memory section, function selection and visibility, and how contract control flow can be bypassed at the bytecode level.
This installment will focus on the EVM’s persistent storage region and explore how various Solidity types and data structures are represented and referenced in contract Storage.
EVM Storage and Contract States
In part 1, we dove into the memory section of the call context, used to hold temporary, mutable data necessary for inter-function logic. As a decentralized state machine however, the EVM needs a dedicated mechanism to keep track of persistent data that represents a given contract’s overall state between function calls. This is the job of the EVM’s Storage region.
EVM implementations treat the storage region as a dictionary of 2256 32-byte keys and values. These keys are considered to be logically initialized to zero by default. However, like all data structures that constitute the EVM, it is up to specific implementations to define exactly how the storage region is implemented. Most implementations, including the widely used Go-Ethereum (geth
) client stays true to the overall idea of the Storage region by implementing storage as hash maps or key-value pairs1, as does the C++-based EMVC implementation of the EVM, which uses the C++ standard library std::map
object2.
The relevant opcodes for operating on storage are the SSTORE
and SLOAD
opcodes, which allow writing and reading 32 bytes to and from storage respectively. It should be noted that EVM operations that work on the Storage region incur some of the highest gas fees relative to other opcodes3, therefore a sufficient understanding of this region is especially important for writing gas-efficient contracts.
Storage Slots
EVM implementations use the concept of storage slots, which are discrete sections within the storage space in which state variables are stored.
Within the wider Ethereum network, storage slots are also sometimes referenced using the keccak256
hash of the hexadecimal slot number, left padded to 32 bytes.
keccak256(0x0000…0000) = 0x290decd9548b62a8d60345a988386fc84ba6bc95484008f6362f93160ef3e563
keccak256(0x0000…0001) = 0xb10e2d527612073b26eecdfd717e6a320cf44b4afac2b0732d9fcbe2b7fa0cf6
keccak256(0x0000…0002) = 0x405787fa12a823e0f2b7631cc41b3ba8828b3321ca811111fa75cd3aa3bb5ace
Below is an example of a simple storage layout as shown in Remix of three 32-byte uint256
values, each in their own storage slot:
It is important to note that within the EVM execution context, slot keys are usually referenced via their literal hexadecimal values, most evident when storage slots are referenced by the commonly used SSTORE4
and SSLOAD5
EVM opcodes. However, hashed slot indexes shown by Remix and similar tools play an important role in the wider Ethereum ecosystem.
Merkel Patricia Tries
The reason why slots are sometimes also displayed via a hash of the slot numbers themselves stems from a fundamental data structure used by the wider Ethereum network for storing account data and contract storage, called Merkle Patricia Tries (MPTs)6. In an MPT, each key-value pair is hashed using the keccak256
hash function, and the resulting hash is used as the key for the next level of the trie.
This creates a hierarchical structure where each level of the trie corresponds to a different prefix of the key, and each leaf node in the trie corresponds to a complete key-value pair. MPTs are used in various contexts throughout the Ethereum ecosystem, including for storing account data, contract storage, and transaction receipts.
At a high level, an MPT is a modified version of the standard trie data structure. A trie7 is a generic tree-based data structure used for storing and retrieving key-value pairs. It allows for efficient prefix-based searches by organizing keys based on their common prefixes.
MPTs incorporate elements of Merkle and Patricia tries, which are derivative trie structures. Merkle tries, also known as hash trees, use hashed representations of keys and values to save on search and storage overhead8. Patricia tries are optimized for memory usage by compressing and merging multiple nodes within a single child node, further reducing storage requirements9.
The Bitcoin network uses Merkle tries to archive and verify the integrity of transactions that have been confirmed with enough blocks10, whereas the Ethereum network requires MPTs to facilitate smart contract and account state storage. Outside of blockchain systems, Patricia tries and the more general Radix tries are commonly used in applications that require fast prefix-based searches, such as IP routing tables and dictionary implementations.
Storage slots are referenced by their keccak256
hashes in the MPT when contract balances and other state information is read and written, largely because MPTs need a standardized method of referencing variable-length key values. However, the EVM references storage slots via their explicit storage key values. These are not always hexadecimal values; for more complicated data structures, additional keccak256
hashing must be done to derive slot keys, as will be demonstrated later.
MPTs and the data structures used to maintain the wider Ethereum network are out of scope for this article. The rest of this article will focus on EVM storage. The EVM has a peculiar way of storing state variables; necessitated in part by the EVM’s implementation as a register-less stack machine, and as a means of minimizing the gas costs of storing smart contract states and data.
For the remainder of this article however, we will omit slot indexes when displaying key-value sets in storage for clarity.
Static Variable Storage
Simple Variable Storage & Packing
Consider the following Solidity contract which demonstrates the concepts of simple variable storage and static variable packing. The modify()
function assigns values to the state variables, which have been implemented as unsigned integers of varying sizes. The integers are represented in hex format for clarity. You may follow along with this contract via Remix here.
pragma solidity >=0.7.0 <0.9.0; contract VarPacking { uint256 slot_0; uint128 slot_1; uint64 still_slot_1; uint64 slot_1_again; uint128 slot_2; function modify() public { slot_1 = 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA; // slot 1 slot_0 = 0xBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB; // slot 0 still_slot_1 = 0xCCCCCCCCCCCCCCCC; // still in slot 1 slot_1_again = 0xDDDDDDDDDDDDDDDD; // still in slot 1 slot_2 = 0xEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE; // slot 2 } }
The order in which state variables are assigned values does not affect which slot they will occupy. This is because the storage slots are determined depending on these factors:
- The order in which state variables are declared.
- The type of the state variable.
- The size of the state variable.
- Whether the variable is fixed-size (simple unsigned integers, fixed-size arrays), or dynamically sized (mappings and unbounded arrays).
Until variables are assigned values, no data will be written to the storage slot, and no storage fees are incurred until storage is interacted with. Pay attention to the state variable names, which have been named according to how they will occupy storage slots.
We will now go through the execution of the modify()
function and show how each storage slot is populated.
Note that Solidity compiler optimizations have been turned off for all examples shown in this post, as compiler optimizations can affect the order in which slots are populated. These optimizations do not affect the values stored in the slots, however. As of writing, the main factor that affects which slot a value is stored in is still the order in which the variable was declared.
Integers
unit
and uint256
types are 32 bytes large. When state variables of these types are assigned values, the next available storage slot is used in its entirety. This can be confirmed by deploying the VarPacking
contract, placing a breakpoint on line 12, and debugging the call to modify()
. After the first SSTORE operation, storage will resemble the following:
"key": "0x0000 . . . 0001",
"value": "0x00000000000000000000000000000000aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
The 128-bit value 0xAA..AA
has been written to the latter half of slot 1. Continuing past the next SSTORE
operation will result in the 256-bit value 0xBB…BB
being written to slot 0, as the slot_1
state variable was assigned a value before slot_0
.
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0x00000000000000000000000000000000aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
Stepping through until the still_slot_1
value is assigned to storage (just before line 14) results in slightly different storage behavior however:
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0x0000000000000000ccccccccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
Variable packing:
Unlike slot_1
, the value for still_slot_1
was stored after the 8th byte of storage slot 1. This is because the consecutively declared slot_1
and still_slot_1
are of types uint128
and uint64
, and are 16 and 8 bytes large respectively. If we continue execution until the end of the modify()
function, we will notice that the values for slot_1
, still_slot_1
, slot_1_again
, and slot_2
are also not all stored in their own storage slots.
To save on storage fees, the EVM has attempted to “pack” these values together such that more than one value can occupy a single storage slot. This is called “variable packing”, and is subject to the following rules11:
- The first item in a storage slot is stored lower-order aligned.
- Value types use only as many bytes as are necessary to store them.
- If a value type does not fit the remaining part of a storage slot, it is stored in the next storage slot.
- Structs and array data always start a new slot and their items are packed tightly according to these rules.
- Items following struct or array data always start a new storage slot.
The storage layout will present the following values after the call has completed:
"key": "0x0000 . . . 0001", // slot 0x00…01
"value": "0xddddddddddddddddccccccccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
"key": "0x0000 . . . 0002", // slot 0x00…02
"value": "0x00000000000000000000000000000000eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
Take a moment to reflect on how the previously mentioned variable packing rules are at play. Stepping through the bytecode executed between lines 14 and 15 will show these rules being applied, shifting and rearranging the values as necessary until the SSTORE
opcode is used to preserve the values in the correct slot. Other static types are stored differently; however the same overall goal of maximizing slot usage is maintained.
Compiler Optimizations:
Because opcodes like SSTORE
and SLOAD
are gas intensive, with Solc compiler optimizations turned on, optimized bytecode will be generated which attempts to limit the number of storage operations. This can be observed using the previous VarPacking contract as an example, using this Remix link. Here, the “Enable optimization” flag has been selected under Advanced Configurations in the Solidity Compiler tab. Optimizations are enabled by default in Remix and most other development frameworks.
With optimizations, even though the slot_0
variable was assigned after slot_1
, setting a breakpoint on line 14 shows that storage slot 0 was written to first. This is then followed by a single SSTORE
operation to store the values of slot_1
, still_slot_1
and slot_1_again
to slot 1 within a single operation.
This is because the contract’s bytecode was optimized to pre-arrange the values to be stored in a particular slot before any storage operations are performed. Thus, the previously described variable packing rules are still in effect, however they are instead performed in the Memory region and stack instead of successive storage read/writes.
For simplicity, compiler optimization will be disabled for the remaining contract links in this article.
Fixed-Size Arrays:
Elements in fixed-size arrays are stored as they otherwise would be if the elements were declared sequentially in singular storage variables. After executing modify()
and debugging the transaction for the following contract, each element of the array will occupy its own storage slots. This is only because each element is 32 bytes large. As with singular state variables, array state variable elements can share the same slot, subject to the same packing rules discussed before.
pragma solidity >=0.7.0 <0.9.0; contract FixedArray { uint256[3] arr; function modify() public { // slots 0 and 1 assigned values of 1, 2. // slot 2 will still contain zeros. arr = [1,2]; // slot 2 assigned value of 3 arr[2] = 3; } }
When assigning values for up to n-1 elements in an array, the Solidity compiler will explicitly store a value of 0 for the unassigned values. This can be seen by placing a breakpoint at the first assignment in modify()
and stepping through execution until a value of 0 is pushed to the stack, to eventually be stored at slot 2.
After executing the SSTORE
instruction, slot 2 contains a value of 0.
This is later overwritten by the value of 3 after the final element has been assigned. These variable packing rules suffice for simple unsigned integer types. However, the EVM must perform additional storage packing logic to store more complex data types.
String and byte types:
Strings and byte variations are stored according to their own packing heuristics. Each character in a Solidity string takes up 1 byte. The following contract demonstrates string storage:
pragma solidity >=0.7.0 <0.9.0; contract StringStorage { //same applies to byte types string short_string; string long_string; function modify() public { short_string = "ABCD"; long_string = "ABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCDABCD"; //84 characters long } }
Strings and byte values that are less than or equal to 31 bytes in length are indexed in storage in the following way, where n
represents the next available storage slot for 0 ≤ n
≤ 2256-1 slots:
String/Byte Storage ( of size 1 ≤ 31 bytes) | |
---|---|
Slot Key | Slot Value |
n | String data up to 31 bytes in higher-order bytes, and length of string * 2 in lowest-order byte |
Placing a breakpoint at line 11, after the assignment of short_string
, and debugging the call to modify()
will show the string ABCD (0x41424344)
stored in slot 0, aligned to the higher order of the storage slot. Double the length of the string, 0x08
, is stored at the lower order of the slot (shown in red):
"key": "0x0000 . . . 0000", // slot 0x00…00
"value": "0x414243440000 . . . 0008",
. . .
A different method is used to index and store strings and byte values over 32 bytes in length, however it still retains the concept of storing string/byte data first, followed by the length of the data. Instead of referencing slot keys via their literal values, the keccak256
hash of the previous slot key number is used as the current slot key. This hash is then incremented for every remaining 32 byte chunk of string data. This is summarized in the table below, for n
storage slots, storing i
32 byte chunks of string data, for 0 ≤ n
, i
≤ 2256-1:
String/Byte Storage ( of size 1 ≤ 32 bytes) | |
---|---|
Slot Key | Slot Value |
. . . | . . . |
keccak256(n)+(i-1 ) | i-1th 32 byte chunk of string data. |
keccak256(n)+i | ith 32 byte chunk of string data, left-aligned with 0 if i ≤ 32 bytes (e.g., 0x41424344 . . . 000000) |
n+1 | (Length of string/bytes * 2) + 1 in lower-order bytes |
After long_string
has been assigned its value, we see several more storage keys being used:
. . .
"key": "0xb10e . . . 0cf6", // slot keccak256(0x00…00)
"value": "x4142434441424344414243444142434441424344414243444142434441424344",
"key": "0xb10e . . . 0cf7", // slot keccak256(0x00…00) + 0x00…01
"value": "0x4142434441424344414243444142434441424344414243444142434441424344",
"key": "0xb10e . . . 0cf8", // slot keccak256(0x00…00) + 0x00…02
"value": "0x4142434441424344414243444142434441424344000000000000000000000000",
"key": "0x0000. . . 0001", // slot 0x00…01
"value": "0x0000 . . . 00a9"
In this case, the keccak256
hash of the storage identifier for slot 1 (0xb10e…0cf6
) is used as the first slot key for the first 32-byte chunk of the string. This slot key is incremented by one for all remaining 32-byte chunks, with the last chunk being padded with zeros if it is less than 32 bytes in length.
Finally, the length of the string is stored at the storage slot corresponding to when the long string was declared. We can see that this mirrors the way shorter string storage variables are stored, with the data taking up higher order bytes, and the length of the string in the lower order bytes.
Enums
The storage of the enum
type is not currently covered in the Solidity documentation12. Testing shows that enum
elements are stored as lower-aligned single-byte values within a single slot for enums
up to 32 elements in size.
This is evident with the following contract, with enum E
containing 35 elements. Note that all elements of enum E
and their assignments in the modify()
function have been omitted for brevity in the example below:
pragma solidity >=0.7.0 <0.9.0; contract EnumStorage { enum E {t1, . . . ,t35} E e1; . . . E e35; function modify() public { e1 = E.t1; . . . e35 = E.t35; } }
For enums
of more than 32 elements, the next available slot n
is used to store the ith
group of up to 32 elements, for 0 ≤ n
,i
≤ 2256-1:
Enums | |
---|---|
Slot Key | Slot Value |
. . . | . . . |
n+(i-1) | i-1th set of 32 elements, each represented as a lower-ordered single byte. |
n+i | ith set of up to 32 elements, each represented as a lower-ordered single byte, padded with zeros if ≤ 31 elements. |
Storage will resemble the following after executing modify()
in this contract.
"key": "0x00 . . . 00", // slot 0x00…00
"value": "0x1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100",
"key": "0x00 . . . 01", // slot 0x00…01
"value": "0x0000000000000000000000000000000000000000000000000000000000222120"
Structs
In Solidity, structs
can include elements of different data types. Like fixed-size arrays, the elements of struct state variables are allocated slots sequentially based on declaration order and type. Contract developers can save considerably on gas fees by declaring the elements of state variable structs to maximize the amount of data stored per slot13. This is demonstrated in the following contract:
pragma solidity >=0.7.0 <0.9.0; contract StructStorage { struct S1 { uint128 a; uint256 b; uint128 c; } struct S2 { uint128 d; uint128 e; uint256 f; } S1 expensive_struct; S2 cheaper_struct; function modify() public { expensive_struct.a = 1; expensive_struct.b = 1; expensive_struct.c = 1; } function modify2() public { cheaper_struct.d = 1; cheaper_struct.e = 1; cheaper_struct.f = 1; } }
EIP-2929 and Access Sets
According to Remix, the cost of executing modify()
was reported as 66588 gas, whereas modify2()
incurred only 44760 gas. This is because assigning values to all expensive_struct
’s members requires 3 storage slots, while cheaper_struct
only requires two.
Struct | Execution Cost (gas) | Storage Changes After Execution |
expensive_struct | 66588 |
"key": "0x00…00", |
cheaper_struct | 44760 |
"key": "0x00…03", |
This is because the first two elements of struct S2
, which are of type uint128
and thus 16 bytes each, can be packed into the same slot, whereas the second element of struct S1
is a full 32 bytes, and therefore requires that each element of this type is stored in its own slot.
Writing non-zero values to “untouched” storage slots is more expensive than writing to a slot that has already been written to. This can be confirmed by deploying the StructStorage
contract and executing the modify2()
function. Debug the transaction, and step through until the first call to SSTORE
. Observe that 20000 gas is required to write to slot 0 in the Step Details window.
Stepping through to the next SSTORE
operation, which operates on the same storage slot, will instead require only 100 gas.
This is because the Berlin hard fork of the Ethereum network in April 2021 brought various changes to the network. One such change, EIP-292914, increased the gas cost of accessing “cold” (not already accessed) storage slots and account addresses relative to “warm” (previously accessed) ones. To differentiate between cold slots/addresses and ones that have been “warmed up” by means of prior access, the concept of Access Sets15 were introduced.
The motivation behind this was that storage-accessing opcodes were considered underpriced in terms of gas execution fees. The 2016 Shanghai DDOS attack against the Ethereum network exploited this, as the low gas fee of the EXTCODESIZE
opcode was abused16 to force nodes and miners to read state information from disk, thus congesting the network. The specific calculations for warm or cold reads and writes are best covered in the most recent version of the Ethereum Yellow Paper17.
Appendix G lays out the gas costs for common operations, including accessing warm and cold slots/addresses via Gwarmaccess and Gsset respectively. Note that writing zero values to any storage is still treated as if accessing a warm storage slot, as all storage slots are considered to be logically zero-initialized by default18.
Dynamic Variable Storage
The state variables demonstrated so far were all fixed-size types. With fixed-size state variables, the EVM can know ahead of time where and how to best store them according to packing rules. This is not possible for dynamic sized types such as mappings and dynamic-sized arrays. Storage of dynamic-sized array state variables are demonstrated in the following contract.
Dynamic arrays
pragma solidity >=0.7.0 <0.9.0; contract DynamicArray { uint256[] ints; uint256[][] int_ints; function modify() public { ints.push(0xaa . . . aa); ints.push(0xbb . . . bb); int_ints.push(ints); int_ints.push(ints); int_ints.push(ints); } }
Dynamic arrays are stored using a similar hash-of-hashes trick as with string and byte types. They are primarily referenced by a “base” slot identifier that is continually updated with the number of elements currently assigned to a dynamic array. The slot corresponding to the keccak256
hash of this base slot identifier then holds the first element of the array. For all subsequent elements, the inner hash is incremented by one, and a hash of that value is taken.
This process is summarized below, storing the number of elements in a one-dimensional dynamic array at the n
th storage slot, with the i
th element in the array being stored at the keccak256(n)
th storage slot, for 0 ≤ n
,i
≤ 2256-1:
One-Dimensional Dynamic Array Storage | |
---|---|
Slot Key | Slot Value |
n | The number of elements in the array so far |
. . . | . . . |
keccak256(n)+(i-1) | The i-1th element in the array |
keccak256(n)+i | The ith element in the array |
In this simple example, storage will resemble the following once modify()
has been executed:
"key": "0x0000. . . 0000", // slot 0x00…00
"value": "0x0000. . . 0002",
"key": "0x290d . . . e563", // slot keccak256(0x00…00)
"value": "0xaaaa . . . aaaa",
"key": "0x290d . . . e564", // slot keccak256(0x00…00) + 0x00…01
"value": "0xbbbb . . . bbbb",
. . .
Multidimensional dynamic arrays are stored via the same rule applied recursively. For a two-dimensional dynamic array, the nth storage slot contains the number of sub-arrays in the two-dimensional array stored so far. The index of the i
th sub-array is stored at the keccak256(n)+i
th storage slot, and the j
th element of the i
th sub-array is stored at the keccak256(keccak256(n)+i)+j
th storage slot, for 0 ≤ n
,i
,j
≤ 2256-1:
Two-Dimensional Dynamic Array Storage | |
---|---|
Slot Key | Slot Value |
n | Number of elements in the outermost array so far |
. . . | . . . |
keccak256(n)+(i-1) | i-1th sub-array |
. . . | . . . |
keccak256(keccak256(n)+(i-1))+(j-1) | j-1th element in the i-1th sub-array |
keccak256(keccak256(n)+(i-1))+j | jth element in the i-1th sub-array |
keccak256(n)+i |
ith sub-array |
. . . | . . . |
keccak256(keccak256(n)+i)+(j-1) | j-1th element in the ith sub-array |
keccak256(keccak256(n)+i)+j |
jth element in the ith sub-array |
This is repeated for the second and third arrays within int_ints
, demonstrated in the storage representation below:
. . .
"key": "0x00 . . . 01", // slot 0x00…01
"value": "0x00 . . . 03"
"key": "0xb10e . . . 0cf6", // slot keccak256(0x00…01)
"value": "0x00 . . . 02"
"key": "0xb5d9 . . . 7d22", // slot keccak256(keccak256(0x00…01))
"value": "0xaa . . . aa"
"key": "0xb5d9 . . . 7d23", // slot keccak256(keccak256(0x00…01)) + 0x00…01
"value": "0xbb . . . bb"
"key": "0xb1e2 . . . 0cf7", // slot keccak256(0x00…01) + 0x00…01
"value": "0x00 . . . 02",
"key": "0xea78 . . . bd31", // slot keccak256(keccak256(0x00…01) + 0x00…01)
"value": "0xaa . . . aa",
"key": "0xea78 . . . bd32", // slot keccak256(keccak256(0x00…01) + 0x00…01) + 0x00…01
"value": "0xbb . . . bb",
"key": "0xb1e2 . . . 0cf8", // slot keccak256(0x00…01) + 0x00…02
"value": "0x00 . . . 02",
"key": "0xb327 . . . e02b", // slot keccak256(keccak256(0x00…01) + 0x00…02)
"value": "0xaa . . . aa",
"key": "0xb327 . . . e02c", // slot keccak256(keccak256(0x00…01) + 0x00…02) + 0x00…01
"value": "0xbb . . . bb"
Aside - Deterministic Pitfalls – Storing Private Data
The Storage region provides a deterministic way of storing state data in the EVM. For instance, the exact storage slot of an m-dimensional array can be predicted before a contract implementing the array is even deployed. For an m-dimensional array, storage logic can be generalized by applying the storage heuristic of a two-dimensional array recursively. This is shown via the following expression for the j
th element in the i
th index of the m-dimensional array, declared starting at slot n
, where 0 ≤ n
,m
,i
,j
≤ 2256-1:
keccak256(keccak256(...(keccak256(keccak256(n)+i[1])+i[2])...+i[m-1])+i[m])+j
For a sufficiently-populated, 4-dimensional dynamic array arr[][][][]
, the base of which is stored at slot 3, the slot key for the element at arr[1][0][8][1]
can therefore be calculated as follows, and confirmed by viewing the fourth-to-last slot key written after executing the modify()
function of this contract:
keccak256(keccak256(keccak256((keccak256(0x00…03)+ 0x00…01))+ 0x00…00)+ 0x00…08)+ 0x00…01 = 0xb8928d09db2f3fc6a2c8bd4dafbdf7cd5aa6c337f2c2fad8d85a5e908c8ddf49
Any data in a contract’s Storage region is publicly available using block explorers or web3 clients. For example, the contents of storage slot 0 of the live Wrapped Ethereum (WETH) contract19 can be quickly queried using the cast
20 web3 CLI client from the Foundry development suite, given a free-tier Infura21 API access or another RPC endpoint:
$ cast storage 0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2 0 --rpc-url 'https://mainnet.infura.io/v3/<API-KEY>'
0x577261707065642045746865720000000000000000000000000000000000001a
While this data is simply the string Wrapped Ether
, subject to the previously described string storage logic, this is why storing sensitive data within a contract’s Storage region constitutes a security risk as contract storage is publicly available to all.
Mappings
Mappings in Solidity are similar to hash tables or dictionaries in other languages, the storage of which is demonstrated by this contract:
pragma solidity >=0.7.0 <0.9.0; contract Mappings { struct S { uint256 a; uint256 b; } mapping(uint256 => uint256 ) simple_map; mapping(uint256 => S) struct_map; mapping(uint256 => mapping(uint256 => S)) nested_map; function modify() public { simple_map[0] = 0x11 . . . 11; simple_map[1] = 0x22 . . . 22; struct_map[0] = S(0xaa . . . aa, 0xbb . . . bb); struct_map[1] = S(0xcc . . . cc, 0xdd . . . dd); nested_map[0][1] = S(0xee . . . ee, 0xff . . . ff); } }
Mapping values are also initialized to zero by default, meaning that mappings are assigned values during declaration. They may only be stored in storage, and the keccak256
hash of keys are stored in place of the literal keys. This is why it is not possible to directly query the length of a mapping in Solidity 22.
Mapping slot keys are derived by concatenating the storage slot key of the mapping itself (n)
and the index of the key where the element is to be stored (i)
, for all 0 ≤ n
,i
≤ 2256-1:
Mapping Storage | |
---|---|
Slot Key | Slot Value |
keccak256(i-1 . n) | ith element of mapping at storage slot n |
This can be confirmed by debugging the call to modify()
and observing the storage after simple_map[0]
is assigned a value of 0x11..11
. This value will be stored at storage slot keccak256(0x00..00 . 00..00) = 0xad32 . . . 5fb5
.
"key": "0xad32 . . . 5fb5", // slot keccak256(0x00…00 . 00…00)
"value": "0x1111 . . . 1111", // value of simple_map[0]
"key": "0xada5 . . . 9e7d", // slot keccak256(0x00…01 . 00…00)
"value": "0x2222 . . . 2222", // value of simple_map[1]
. . .
Mapping values like structs or fixed-length arrays combines the storage heuristic for structs and mappings. State variables of this particular type are stored by incrementing this slot key for each element in the value, and storing element values at these incremented slot keys. This can be observed by stepping through line 17, where the struct S(0xa…,0xb…)
is assigned as the first value in struct_map
:
. . .
"key": "0xa6ee . . . cb49", // slot keccak256(0x00…00 . 00…01)
"value": "0xaaaa . . . aaaa", // value of struct_map[0].a
"key": "0xa6ee . . . cb4a", // slot keccak256(0x00…00 . 00…01) + 0x00…01
"value": "0xbbbb . . . bbbb", // value of struct_map[0].b
. . .
Like dynamic arrays, nested mappings are stored recursively. However, since mapping keys are never stored, a somewhat simpler mapping heuristic is used. This involves taking the hash of the outermost mapping, and then taking a hash of this value concatenated with the index of the innermost mapping.
This is summarized below, where:
i
is the index of within the outermost mapping.j
is the index in the innermost mapping.n
is the next available storage slot.- Since
nested_map
storesstructs
, the value ofm
is added to the hash as an index for the element within the struct.
Nested Struct Mapping Storage | |
---|---|
Contents of Slot Key | Value |
keccak256(j . keccak256(i . n)) + k | Value at struct element k at innermost mapping i, at outermost mapping j, stored at slot n |
nested_map
is takes up slot 2, and stores an instance of struct S(0xe…,0xf…)
at index 1 of its innermost mapping. Therefore, the first element in the struct will be stored at key keccak256(0x…01 . keccak256(0x0…0 .0x0…2)) = 0x79c0…d89d
, and the second will be stored at this value incremented by one.
. . .
"key": "0x79c0 . . . e89d", // slot keccak256(0x0…01 . keccak256(0x00…00 .0x00…02))
"value": "0xeeee . . . eeee", // value of nested_map[0][1].a
"key": "0x79c0 . . . e89e", // slot keccak256(0x0…01 . keccak256(0x00…00 .0x00…02)) + 0x00…01
"value": "0xffff . . . ffff" // value of nested_map[0][1].b
Data Location Keywords and Reference Types
Solidity supports three data location keywords: storage
, memory
, and calldata
, which can be used to explicitly provide the data location within the EVM execution context, where certain variables are stored within functions. In all examples thus far, variables have been declared outside of functions and are thus state variables stored in the Storage region.
Whether or not variables require being prefixed with data location keywords depends on whether the variable’s type is considered a reference or value type23. Value types are types that do not store any data other than their own, while reference types store the data of other variables.
Structs, arrays, and mappings are therefore reference types, and almost all others are considered value types. Strings are also considered reference types in most situations. Declaring state variables is the only place where explicitly specifying data location can be omitted. Reference types declared inside functions must be prefixed with a data location keyword. It is currently not possible to declare data locations for non-state variable that are of value types within functions.
The storage
keyword essentially produces a pointer to an already-declared state variable, referred to as a storage pointer. It is not possible to declare new instances of state variables within a function; the storage keyword cannot be used unless a state variable has already been declared. When updating the storage pointer, the original state variable value will also be updated. Therefore, the storage keyword state variable allows for variables to be passed by reference if used in function arguments.
The memory
keyword can be used to specify that reference types declared within functions are to be stored in memory. When used in a function argument, it results in an in-memory copy of the passed argument being created, which will be discarded at the end of the function call if not returned or persisted elsewhere.
One use case for the memory keyword is to save on gas by limiting the amount of storage read/write operations. For example, by declaring a reference type as a state variable, initializing it, and then creating an in-memory copy of this state reference type. The in-memory copy can be manipulated with little to no SSTORE
/SSLOAD
operations. After the variable has been processed, it can be assigned back to the original state variable. This is particularly economical with multiple iterations of the same processing. Like “warm” storage, it is cheaper to write to already-allocated memory.
The calldata
keyword specifies that function arguments are to be stored in a temporary yet immutable region. It can only be specified for function arguments. It is only available for external, read-only function call parameters, however. This is primarily used for data that is sent with the call to the contract and doesn't need to be changed during the execution of the function. Using calldata
is cheaper than both memory
and storage
. Check the official Solidity documentation for more information regarding data locations and how best to use them to write gas-efficient contracts.
This concludes part 2 of this series. In the third and final part of this series, we will take a detailed look at proxy contracts, inter-contract message calls, and their effects on storage. We will also look at how storage mismanagement and insecure message calls have resulted in major DeFi breaches, and how to limit the risk of such issues.
References:
- https://docs.soliditylang.org/en/
- https://ethereum.org/en/developers/
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://www.evm.codes/?fork=shanghai
Citations
- https://github.com/ethereum/go-ethereum/blob/41f89ca94423fad523a4427c9af2c327fe344fe2/core/state/state_object.go#L40
- https://github.com/ethereum/evmc/blob/58e16bf6a1be9a5f445b2dd443f018e55218e777/examples/example_host.cpp#L26https://github.com/ethereum/evmc/blob/58e16bf6a1be9a5f445b2dd443f018e55218e777/examples/example_host.cpp#L26
- https://www.evm.codes/#54
- https://www.evm.codes/#55
- https://www.evm.codes/#54
- https://ethereum.org/en/developers/docs/data-structures-and-encoding/patricia-merkle-trie/
- https://bioinformatics.cvr.ac.uk/trie-data-structure/
- https://www.investopedia.com/terms/m/merkle-tree.asp
- https://www.allisons.org/ll/AlgDS/Tree/PATRICIA/
- https://www.coinbase.com/bitcoin.pdf
- https://docs.soliditylang.org/en/latest/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://docs.soliditylang.org/en/v0.8.20/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://docs.soliditylang.org/en/v0.8.20/internals/layout_in_storage.html#layout-of-state-variables-in-storage
- https://eips.ethereum.org/EIPS/eip-2929
- https://www.evm.codes/about#accesssets
- https://blog.ethereum.org/2016/09/22/ethereum-network-currently-undergoing-dos-attack
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://ethereum.org/en/developers/tutorials/yellow-paper-evm/#91-basics
- https://etherscan.io/address/0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2#code
- https://book.getfoundry.sh/cast/
- https://www.infura.io/pricing
- https://docs.soliditylang.org/en/latest/types.html#mapping-types
- https://docs.soliditylang.org/en/v0.8.20/types.html#reference-types