2.5.2. Memory Footprints of Data Types¶

Low level languages such as C or Rust use types to describe the memory requirements of a symbol. In a performance oriented mindset this is a killer feature and an essential tool to writing non-pessimized, and data-oriented code.

Fortunately all is not lost. We can do similar reasoning in Haskell. This chapter describes how to statically reason about the memory your data types will require. After reading the chapter one should be able to read a data type declaration and perform back of the napkin math to reason about the worst case (no optimizations or sharing) memory footprint.

2.5.2.1. What is the Point¶

Modern CPU’s are fast and memory is abundant so what is the point. Certainly these things are true, but for performance-oriented Haskell we care about the caching behavior of our code. Every piece of code that one writes will go through a CPU’s L1 caches and the behavior of those caches will have an immediate and drastic impact on the performance of your code. So much so that not missing the data cache is one of the golden rules of performance-oriented Haskell. The best way to have a cache hit rather than a miss is to architect the program to require smaller data that can more easily fit neatly into a cache line. Therefore, the first step is to writing cache friendly code is understanding the memory footprint of your data types so that you understand what you are asking of the CPU.

2.5.2.2. Atomic Types¶

Atomic types are defined and implemented in GHC as builtins. Most of these types are a single machine word or two:

Data Type	Size (Words)	Notes
()	0	will be shared
Bool	0	will be shared
Char	2	will be shared
Int	2	Could be shared, see the on sharing section
Int8	2	Due to alignment
Int16	2	Due to alignment
Int32	2
Int64	2
Int64 (32-bit)	3
Word	2
Word8	2	Due to alignment
Word16	2	Due to alignment
Word32	2
Word64	2
Word64 (32-bit)	3
Double	2
Double (32-bit)	3
Integer	2
Integer (32-bit)	3

2.5.2.3. Boxed Data Types¶

Boxed data types are ubiquitous in Haskell programs. The strategy for any boxed data type is to count the machine words required to represent each data constructor. To count the machine words we count the number of fields, add one for the constructor header (which is the constructor’s info table), and then sum that amount to the amount of memory required for the data types each field points to. This works because boxed types represent fields with pointers, which are a single machine word. For example, consider this monomorphic list-like data type:

data MyIntList = MyCons Int MyIntList
               | Nil

This type has two constructors: Nil which has no fields, and MyCons which has two fields, an Int and the rest of the list. Therefore a single MyCons will need:

one word for the MyCons constructor header.
one word for a pointer to an Int plus the machine words needed to represent an Int.
one word for a pointer to a MyIntList plus the machine words required for the rest of the MyIntList.

The only unknown is the memory footprint of an Int. Fortunately, Int’s are boxed, so we can use the same strategy. Here is its definition:

-- in GHC.Types in the ghc-prim library
-- ...
-- | A fixed-precision integer type with at least the range @[-2^29 .. 2^29-1]@.
-- The exact range for a given implementation can be determined by using
-- 'Prelude.minBound' and 'Prelude.maxBound' from the 'Prelude.Bounded' class.
data Int = I# Int#

The Int# is the payload, it is an atomic, unboxed type. Thus an Int needs two words: one for the constructor header of I#, and one for the payload Int#. This means a MyCons will require, in the worst case, six machine words: 1 for MyCons, 2 for the pointers, 2 for the Int and 1 for Nil because Nil is only a constructor.

As an example, consider a singleton list MyCons 7 Nil. This what the singleton will look like in memory:

Figure made with TikZ

Each box is a machine word and each arrow is a pointer to some location in the heap.

Earlier I was careful to say in the worst case because our analysis does not consider sharing or strictness. In general, we cannot assess how much sharing will happen at runtime without the code in question. However, we can still make some safe assumptions. For example, in GHC there is only one empty list which is repeatedly shared. If we assume that Nil has the same behavior then MyCons 7 Nil will only require five words instead of six. For the memory diagrams in this book, we’ll represent sharing as a dashed outline when needed:

Figure made with TikZ

2.5.2.3.1. A More Complicated Example¶

Our last example was simple, what about a type such as a Data.HashMap:

-- from Data.HashMap.Internal
data HashMap k v
    = Empty
    | BitmapIndexed !Bitmap !(A.Array (HashMap k v))
    | Leaf !Hash !(Leaf k v)
    | Full !(A.Array (HashMap k v))
    | Collision !Hash !(A.Array (Leaf k v))

This type has type variables k and a, bang patterns and uses other types such as: Bitmap, Hash, Leaf and Array. For types like this the strategy remains the same. We assess the memory footprint by counting machine words for each constructor and each type in use. Type variables are represented and counted as pointers. The only difference is that the memory footprint can change depending an what the type variable reifies to. For example, a value of HashMap Bool v will have a smaller footprint than a value of HashMap MyIntList v because Bool will have a smaller footprint than MyIntList. Furthermore, True and False are statically allocated data constructors that are always shared, so the only memory costs incurred are pointers.

Note that our last example showed this without type variables via Nil. Nil was the value that the MyIntList pointer in MyCons pointed to. Imagine if the example had been MyCons 7 (MyCons 5 Nil), then this value would have a larger footprint (seven words assuming a shared Nil ) because the list tail pointer would have pointed to a heavier value than Nil.

First let’s assess the memory footprint for the type that are used in each constructor:

-- all of these are from Data.HashMap.Internal
type Bitmap = Word
...
type Hash   = Word
...
data Leaf k v = L !k v

We see that Bitmap and Hash are a single machine word and Leaf is a partially strict pair with two fields. Leaf will need a total of three words plus the size of k and v. For now, we will ignore strictness and focus only on the worst case, we’ll return to strictness later in the chapter.

This only leaves Array. Here is its definition:

-- Data.HashMap.Internal.Array
data Array a = Array {unArray :: !(SmallArray# a)}

An Array is a wrapper around an unboxed SmallArray#. SmallArray# is one of GHC’s primitive types that are exposed through the GHC.Exts module in base, but are defined in the ghc-prim boot library for GHC:

// in rts/storage/Closures.h
// A small array of head objects, ie SmallArray# and MutableSmallArray#
//
// Closure types: SMALL_MUT_ARR_PTRS_CLEAN, SMALL_MUT_ARR_PTRS_DIRTY,
// SMALL_MUT_ARR_PTRS_FROZEN_DIRTY, SMALL_MUT_ARR_PTRS_FROZEN_CLEAN,
typedef struct {
    StgHeader   header;
    StgWord     ptrs;
    StgClosure *payload[] MUT_FIELD;
} StgSmallMutArrPtrs;

Note

You’ll notice that I do not discuss MUT_FIELD, this macro is used by GHC’s RTS to mark fields which can be modified by the mutator during garbage collection, but it expands to nothing. For more see this note.

A smallArray# is a C struct with a StgHeader, a StgWord and an array of pointers that point to the closures which are the contents of the array. An StgHeader is defined as:

typedef struct {
 // If TABLES_NEXT_TO_CODE is defined, then `info` is offset by
 // `\texttt{sizeof}(StgInfoTable)` and so points to the `code` field of the
 // StgInfoTable! You may want to use `get_itbl` to get the pointer to the
 // start of the info table. See
 // https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/rts/storage/heap-objects#tables_next_to_code.
 const StgInfoTable* info;

#if defined(PROFILING)
    StgProfHeader         prof;
#endif
} StgHeader;

We’ll assume we are not profiling and are compiling with tables-next-to-code. With these assumptions the StgHeader is just a pointer to a StgInfoTable and thus just a single machine word. Similarly, an StgWord is a machine word:

// in ghc/rts/includes/stg/Types.h
/*
 * Stg{Int,Word} are defined such that they have the exact same size as a
 * void pointer.
 */
#if SIZEOF_VOID_P == 8
typedef int64_t            StgInt;
typedef uint64_t           StgWord;
...
#elif SIZEOF_VOID_P == 4
typedef int32_t            StgInt;
typedef uint32_t           StgWord;
...
#endif

This means that a singleton array will be one word for the header, one word for the ptrs field, n pointers for n elements plus the size of those elements. We can summarize this all into a equation that calculates the size of n element array: \(2 + n + n(\texttt{sizeof}(element))\).

Let’s walk through how we derived that equation. We know that there will always be a constant overhead of two words due to the header and ptrs field, and that each element will require a pointer. This gives us 2 + n; then all that is left is the size of the elements themselves and which we know we have n of. Let’s use a datatype as an example to check our assumptions. Assume we have a singleton array consisting of an Int. This singleton array will need one word for the header, one for the ptrs field; this is the constant 2. Next it will need one pointer to the Int heap object because \(n = 1\), and finally two words for the Int itself. Thus, yielding a total of five words. Here is the calculation:

\begin{align*} \texttt{sizeof}(\text{SmallArray}, 1, \text{Int}) =&\; 2 + 1 + 1(\texttt{sizeof}(\text{Int})) \\ =&\; 3 + 1(\texttt{sizeof}(\text{Int})) \\ =&\; 3 + 1(2) \\ =&\; 3 + 2 \\ =&\; 5 \\ \end{align*}

Now consider an array with two Int’s. This array will require three more words: one more pointer, and two more for the Int payload, for a total of eight:

\begin{align*} \texttt{sizeof}(\text{SmallArray}, 2, \text{Int}) =&\; 2 + 2 + 2(\texttt{sizeof}(\text{Int})) \\ =&\; 4 + 2(\texttt{sizeof}(\text{Int})) \\ =&\; 4 + 2(2) \\ =&\; 4 + 4 \\ =&\; 8 \\ \end{align*}

Let’s take stock of what we know:

Bitmap \(\rightarrow\) 1 word
Hash \(\rightarrow\) 1 word
Leaf \(\rightarrow\) \(3 + \texttt{sizeof}(k) + \texttt{sizeof}(v)\) words
Array \(\rightarrow\) \(2 + n + n(\texttt{sizeof}(v))\) words (v becomes the array elements by definition)

Now we can assess HashMap. HashMap has five constructors. We’ll proceed in order beginning with Empty. Empty will take a single word just like Nil. BitmapIndexed is next. It is defined as:

BitmapIndexed !(Bitmap) !(A.Array (HashMap k v))

So we have one word for the constructor, one for the Bitmap and then the array. The best we can do is express the size in terms of k and v since we do not know what these types will reify to. This gives us:

\begin{align*} \texttt{sizeof}(\text{BitmapIndexed}, n, k, v) =&\; 1 + 1 +\texttt{sizeof}(\text{Array}(\text{HashMap}\;k\;v)) \\ =&\; 2 +\texttt{sizeof}(\text{Array}(\text{HashMap}\;k\;v)) \\ =&\; 2 +(2 + n + n(\texttt{sizeof}(\text{HashMap}\;k\;v)) \\ =&\; 4 + n + n(\texttt{sizeof}(\text{HashMap}\;k\;v) \end{align*}

Following BitmapIndexed is Leaf. Leaf is defined as:

Leaf !Hash !(Leaf k v)

We have one word for the Leaf constructor, two words for Hash: one for a pointer and one for the payload we deduced above. Hash is not unpacked and thus is represented as a pointer to the Hash payload. This leaves only the Leaf type which we analyzed earlier. For definitions like this, it is helpful to inline the constituent data types, like so:

Leaf !Hash !(L !k v)

This flattens the data structure and keeps everything in one place. The only caveat is that one must remember there is a pointer that points to L from Leaf. I’ve indicated this with the open parenthesis: ( . Recall that L defines a strict pair which requires three words, one for L and two pointers to each element. Now we can calculate the memory footprint in terms of k and v just as before:

\begin{align*} \texttt{sizeof}(\text{Leaf}, k, v) =&\; 1 + 2 +(1 + \texttt{sizeof}(\text{Leaf k v})) \\ =&\; 4 + \texttt{sizeof}(\text{Leaf k v})) \\ =&\; 4 + (3 + \texttt{sizeof}(k) + \texttt{sizeof}(v)) \\ =&\; 7 + \texttt{sizeof}(k) + \texttt{sizeof}(v)) \end{align*}

All we have left is Full and Collision. Full is a special case. The unordered-containers HashMap is a Hashed Mapped Array Trie [6] where a full array contains a maximum of 32 elements. Thus a Full will need one word for the Full header, and then \(\texttt{sizeof}(\text{Array}(\text{HashMap}\;k\;v))\). Fortunately we already calculated that for BitmapmIndexed. Thus a Full will be:

\begin{align*} \texttt{sizeof}(\text{Full}, 32, k, v) =&\; 1 +\texttt{sizeof}(\text{Array}(\text{HashMap}\;k\;v)) \\ =&\; 1 + (2 + n + n(\texttt{sizeof}(\text{HashMap}\;k\;v)) \\ =&\; 1 + (2 + 32 + 32(\texttt{sizeof}(\text{HashMap}\;k\;v)) \\ =&\; 35 + 32(\texttt{sizeof}(\text{HashMap}\;k\;v) \end{align*}

All we have left is Collision. Fortunately we’ve already calculated the footprint of Collision because Collision is defined as:

Collision !Hash !(A.Array (Leaf k v))

which is isomorphic to BitmapIndexed because Hash and Bitmap have the same footprint and we know how to calculate A.Array (Leaf k v). Thus:

\begin{align*} \texttt{sizeof}(\text{Collision}, n, k, v) =&\; 1 + 1 +\texttt{sizeof}(\text{Array}(\text{HashMap}\;k\;v)) \\ =&\; 4 + n + n(\texttt{sizeof}(\text{HashMap}\;k\;v) \end{align*}

As an aside, this result reveals potential performance issues on reads for a full hash map. 35 words is 280 bytes, a cache line is typically 64 bytes (or 8 words) so just with the constants, a Full will consume 4 cache lines _unevenly_; \(64 * 4 = 256\) always leaving 24 bytes leftover on the fifth cache line. Furthermore let’s consider what the memory footprint of a HashMap Int Int that is a single Full will be. This will require:

\[\begin{split}\begin{align*} \texttt{sizeof}(\texttt{Full}, 32, \texttt{Int}, \texttt{Int}) =&\; 35 + 32(\texttt{sizeof}(\text{HashMap}\;\texttt{Int}\;\texttt{Int}) \\ =&\; 35 + 32(\texttt{sizeof}(\text{Leaf}\;\texttt{Int}\;\texttt{Int}) \\ =&\; 35 + 32(7 + \texttt{sizeof}(\texttt{Int}) + \texttt{sizeof}(\texttt{Int})) \\ =&\; 35 + 32(7 + 2 + 2) \\ =&\; 35 + 32(11) \\ =&\; 35 + 352 \\ =&\; 387 \end{align*}\end{split}\]

A total of 387 words which is 3, 096 bytes or roughly 3KiB, just for 32 elements! Unfortunately, this also will not be a cache friendly data structure. To store 387 words we need \(\frac{387}{8} = 48\) cache lines. But 387 is not a multiple of 2, so there will be \(387 \bmod{} 8 = 3\) words of leftover space on the final cache line [1]. A more cache friendly data structure would ensure that a Full always fit evenly into a set of cache lines and would thereby avoid fragmenting the cache. One caveat is that this wasted space will change depending on the sizes of the key and value.

2.5.2.4. Other Ways¶

We’ve learned how to statically analyze a data type. But often times there are easier ways than reasoning. For example, you can always weigh your data types. Although this will consider sharing.

2.5.2.4.1. Read the Assembly Output¶

Besides weigh, we can practice don’t think look by inspecting the .data sections of GHC’s generated assembly. Here is an example for some primitive types. The idea is that if we create a top level binding that will be statically allocated at compile time. This static allocation will then appear in the .data section of the assembly code. Here is a quick refresher on assembly sizes:

Directive	Size (Bytes)
.byte	1
.word	2
.long	4
.quad	8

And here is our example program.

{-# OPTIONS_GHC -O2 -ddump-asm #-}

{-# NOINLINE a_unit #-}
a_unit :: ()
a_unit =  ()

{-# NOINLINE x #-}
a_bool :: Bool
a_bool = True

{-# NOINLINE a_int8 #-}
a_int8 :: Int8
a_int8 = 1

{-# NOINLINE a_int16 #-}
a_int16 :: Int16
a_int16 = 1823

{-# NOINLINE a_int #-}
a_int :: Int
a_int = 123

{-# NOINLINE a_int64 #-}
a_int64 :: Int64
a_int64 = 1234

{-# NOINLINE a_word8 #-}
a_word8 :: Word8
a_word8 = 1

{-# NOINLINE a_word16 #-}
a_word16 :: Word16
a_word16 = 12

{-# NOINLINE a_word #-}
a_word :: Word
a_word = 123

{-# NOINLINE a_word64 #-}
a_word64 :: Word64
a_word64 = 1234

We’ll handle these in chunks beginning with a unit. Recall that in the Weigh Chapter a () was measured to be 0 allocations because only one () exists in GHC and is shared for all references. But here we can see the truth:

.section .data
.align 8
.align 1
.globl Main.a_unit_closure
.type Main.a_unit_closure, @object
Main.a_unit_closure:
   .quad     ()_con_info

Our unit value is a reference to the shared info-table for the one true (), which requires a .quad. .quad is a quad word meaning it is 8 bytes (quad because its four .word’s), or 64-bits (this was run on a 64-bit machine). What about Bool:

.section .data
.align 8
.align 1
.globl Main.a_bool_closure
.type Main.a_bool_closure, @object
Main.a_bool_closure:
     .quad   GHC.Types.True_con_info

The same is true for a_bool; the True is a pointer to the one True value called GHC.Types.True_con_info, and thus takes a single word. Let’s check the Int-like fixnums:

.section .data
.align 8
.align 1
.globl Main.a_int8_closure
.type Main.a_int8_closure, @object
Main.a_int8_closure:
     .quad   GHC.Int.I8#_con_info
     .byte   8
     .long   0
     .word   0
     .byte   0

...

.section .data
.align 8
.align 1
.globl Main.a_int16_closure
.type Main.a_int16_closure, @object
Main.a_int16_closure:
     .quad   GHC.Int.I16#_con_info
     .word   16
     .long   0
     .word   0

...

.section .data
.align 8
.align 1
.globl Main.a_int_closure
.type Main.a_int_closure, @object
Main.a_int_closure:
     .quad   GHC.Types.I#_con_info
     .quad   32

...

.section .data
.align 8
.align 1
.globl Main.a_int64_closure
.type Main.a_int64_closure, @object
Main.a_int64_closure:
     .quad   GHC.Int.I64#_con_info
     .quad   64

We see that our a_int8 :: Int8 requires one word for the data constructor header: .quad GHC.Int.I8#_con_info, and then requires one byte for the payload: .byte 8, but then requires another 7 bytes: .long 0, .word 0, .byte 0 all initialized to 0. GHC is smartly padding this value to fit evenly into the data cache. Without this padding, a_int would require 65 bytes (the info table pointer + 1 byte for the payload) which is guaranteed to never fit evenly into the data cache. This padding is good, we should be happy that GHC does it for us. However, the padding is still empty space which is wasteful. If this were a real program a good optimization would be to increase memory efficiency by utilizing this extra space. See the Data-Oriented Design Case Study for an example.

We see similar padding for a_int16, while a_int and a_int64 take two words as we expect. The sized Word types are identically sized to the Int types:

.section .data
.align 8
.align 1
.globl Main.a_word8_closure
.type Main.a_word8_closure, @object
Main.a_word8_closure:
     .quad   GHC.Word.W8#_con_info
     .byte   8
     .long   0
     .word   0
     .byte   0

...

.section .data
.align 8
.align 1
.globl Main.a_word16_closure
.type Main.a_word16_closure, @object
Main.a_word16_closure:
     .quad   GHC.Word.W16#_con_info
     .word   16
     .long   0
     .word   0

...

.section .data
.align 8
.align 1
.globl Main.a_word_closure
.type Main.a_word_closure, @object
Main.a_word_closure:
     .quad   GHC.Types.W#_con_info
     .quad   32

...

.section .data
.align 8
.align 1
.globl Main.a_word64_closure
.type Main.a_word64_closure, @object
Main.a_word64_closure:
     .quad   GHC.Word.W64#_con_info
     .quad   64

Unfortunately, this will only work for boxed data types as we cannot (yet have a top level unlifted data type (which an unboxed type is). Another curiosity are the representation of top level compound types. For example, with this:

{-# NOINLINE a_pair#-}
a_pair :: (Int,Int)
a_pair =  (1,2)

{-# NOINLINE a_list#-}
a_list :: [Int16]
a_list =  [1,2,3,4]

GHC generates:

.section .data
.align 8
.align 1
.globl Main.a_pair_closure
.type Main.a_pair_closure, @object
Main.a_pair_closure:
        .quad        (,)_con_info
        .quad        stg_INTLIKE_closure+273
        .quad        stg_INTLIKE_closure+289
        .quad        3

Help Wanted

These + disturb the asm lexer in sphinx. I’ve tried to escape them to no avail so this block lacks syntax highlighting. If you know how to resolve this please contribute!

Which is just what we expected: one word (.quad (,)_con_info) for the data constructor header, one for fst (.quad stg_INTLIKE_closure+273 ), and one snd (.quad stg_INTLIKE_closure+289). However, GHC has added another word that is mysteriously set to 3: .quad 3. This extra word is an optimization that GHC applies which tags the symbol Main.a_pair_closure as a static constructor that contains no CAF references. This tag (the 3) instructs the garbage collector to ignore this symbol during garbage collection. If Main.a_pair_closure was found to possibly have a CAF then the tag would have been 0 but the extra word would still exist [2] . So does this mean that our analysis is incorrect? No, this data is only checked and loaded during a garbage collection event, it is a by product of our abuse of NOINLINE to create a static top-level closure.

2.5.2.4.2. About Strictness¶

Now we can finally discuss strictness. Consider this program:

module Example where

import Data.Word

data Foo = Foo Word16 Word16 Word16 Word16

{-# NOINLINE a_foo #-}
a_foo :: Foo
a_foo = Foo 123 234 345 456

main :: IO ()
main = return ()

Foo has one constructor and four fields; each of which is a Word16. So we should expect that a Foo will be nine words: one for the header, and then each Word16 is a pointer to the Word16# payload which will include padding to align to 64 bits. Let’s check the assembly:

Example_a_foo_closure:
    .quad   Example_Foo_con_info
    .quad   Example_a_foo4_closure+1  ;; these closures are the "boxes"
    .quad   Example_a_foo3_closure+1  ;; in boxed data types!
    .quad   Example_a_foo2_closure+1  ;; this +1 is a pointer tag
    .quad   Example_a_foo1_closure+1  ;; meaning the value is evaluated.
    .quad   3

Ignoring the .quad 3 tag, we find a word for the constructor’s info table: .quad Example_Foo_con_info, and a word for each field which is the aforementioned pointers. Here are what those closures look like:

Example_a_foo4_closure:
     .quad   base_GHCziWord_W16zh_con_info
     .word   123
     .long   0
     .word   0
Example_a_foo3_closure:
        .quad   base_GHCziWord_W16zh_con_info
        .word   234
        .long   0
        .word   0
Example_a_foo2_closure:
        .quad   base_GHCziWord_W16zh_con_info
        .word   345
        .long   0
        .word   0
Example_a_foo1_closure:
        .quad   base_GHCziWord_W16zh_con_info
        .word   456
        .long   0
        .word   0

Each closure is two words: one for the Word16 info table, .quad base_GHCziWord_W16zh_con_info; then two bytes for the payloads, .word 123, .word 234 etc.; and then six bytes of padding, .long 0 and .word 0. This matches our expectations. Now let’s add some strictness to the first two fields:

module Example where

import Data.Word

-- now the first two fields are strict
data Foo = Foo !Word16 !Word16 Word16 Word16

{-# NOINLINE a_foo #-}
a_foo :: Foo
a_foo = Foo 123 234 345 456

main :: IO ()
main = return ()

and here is the assembly:

Example_a_foo_closure:
        .quad   Example_Foo_con_info
        .quad   Example_a_foo2_closure+1
        .quad   Example_a_foo1_closure+1
        .word   123
        .word   234
        .long   0
        .quad   3

Our a_foo object has reduced in size! The payloads for first two fields are now inlined into the closure. Now a_foo requires one word for the header, four for the third and fourth fields, and then just one extra word for padding and the first and second fields yielding a total footprint of 6 words. Let’s make Foo strict in every field, taking this process to its logical conclusion:

{-# LANGUAGE StrictData #-}
module Example where

import Data.Word

-- Now we use Strict data instead of bang patters
data Foo = Foo Word16 Word16 Word16 Word16

-- also notice that NOINLINE is not necessary
a_foo :: Foo
a_foo = Foo 123 234 345 456

main :: IO ()
main = return ()

which generates the following assembly:

Example_a_foo_closure:
     .quad   Example_Foo_con_info
     .word   123
     .word   234
     .word   345
     .word   456

Exactly what we expected. a_foo is now only two words. Good job GHC! What is happening here is that GHC’s Demand Analysis has concluded that it is safe to unbox the Word16 payloads for a_foo because they are marked as strict. The key point is that strictness can matter when assessing memory footprints. For the worst case, you should assume demand analysis cannot help you and analyze your data structures as we did above. Assuming GHC’s demand analysis works in your favor gives you the best case. Be sure to don’t think look by checking the Cmm or the assembly. Here is what the Cmm in this case will looks like:

[section ""data" . a_foo_closure" {
     a_foo_closure:
         const Foo_con_info;
         const 123 :: W16;
         const 234 :: W16;
         const 345 :: W16;
         const 456 :: W16;
 }]

Which also shows the more compact foo closure.

2.5.2.4.4. Use GHC-vis¶

Another method is to use ghc-vis . GHC-vis is a plugin for GHCi that outputs graphiz graphs which show the memory representation of a data type. Its documentation is very readable and it is still actively maintained so we encourage interested parties to contribute or give it a try. One caveat though: the documentation is representative of GHC’s internal types pre-GHC-9.0.

2.5.3. Summary¶

We’ve come a long way. We have added two tools to our optimization toolbox. First, a method to observe the memory footprint of a data type by inspecting the GHC generated assembly. Second, a method to analyze the worst case size of a data type by reading its definition and assessing its footprint. Along the way we have also observed the effect of strictness, what sharing looks like at a low level and observed GHC character and small Int cache pool.

Assessing the memory footprint of your data types should be one of the first techniques you employ. It is non-invasive, doesn’t require a full rebuild, and will help you understand what the CPU must do in order to compute your program. More importantly it is requisite to understanding the cache behavior of your program. This can be especially powerful if you already know where the hot loop in your implementation or architecture is. By reducing the memory footprint of that data you’ll be all but guaranteed to speed up your system.

Note

If this chapter has helped you implement a speedup in your system or if you want to add a footprint derivation just as I did for HashMap above then please contribute!

2.5.2. Memory Footprints of Data Types¶

2.5.2.1. What is the Point¶

2.5.2.2. Atomic Types¶

2.5.2.3. Boxed Data Types¶

2.5.2.3.1. A More Complicated Example¶

2.5.2.4. Other Ways¶

2.5.2.4.1. Read the Assembly Output¶

2.5.2.4.2. About Strictness¶

2.5.2.4.3. About Sharing¶

2.5.2.4.4. Use GHC-vis¶

2.5.3. Summary¶

2.5.4. References¶