4.3. Klister: A First Pass of Performance Engineering¶
This chapter is a case study on a first pass of performance engineering for the klister programming language interpreter. The case study should be exemplary of any system which is short-lived, has distinct phases of input and output, and maintains state. To diagnose the performance issues, the case study uses heap profiling with the eventlog, info table profiling, and ticky-ticky profiling. The optimizations are trivial: choosing better data structures and closing memory leaks; any intermediate-level Haskeller should be able to follow along. After reading this chapter, a reader should be able to spot inefficient data structure use by reading code, and use GHC’s profiling tools to diagnose memory performance issues and identify hot code. To read through the code, this commit is the result of the case study, and this commit was the state of the klister code base that the tutorial starts with.
The rest of the chapter is structured as follows. We introduce the klister interpreter and perform a performance-minded code review. We then use the klister testsuite to identify a poorly performing program. Using this program we begin performance engineering: identifying a problem, implementing a change, testing that the change was effective, and then repeating the process, having either made progress or moved on to another part of the interpreter.
4.3.1. Introduction to the System¶
The system we are analyzing is the interpreter for klister. Klister is an interpreted scheme-like programming language. The exact kind of language is not particularly relevant, but one can expect it to have typical scheme features such as s-expression syntax, hygienic macros and lexical scoping. All we need to know is that the klister interpreter inputs a program, parses the program, maintains mutable stores of variables and outputs a result.
To begin performance engineering we’ll review the subsystems that compose the interpreter. In klister, there are four major subsystems:
Parser: The subsystem that lexes and parses klister source. Klister employs megaparsec for its parser.
Expander: klister is a scheme-like language with hygienic macros. This subsystem performs the macro expansion and elaboration typical of these kinds of programming languages.
Type Checker: klister is a pure functional language with Hindley-Milner type checking and higher-order types, including type-safe macros. This subsystem type checks klister code.
Evaluator: The evaluator inputs an abstract syntax tree which represents a klister program, executes the program in the Haskell runtime system and outputs the result.
4.3.2. Characterizing the Problem¶
Our goal is to speed up the klister interpreter, but this goal is too vague to be actionable. We first need to be able to make statements that are precise enough to act on. That is, we need a reproducible test to observe where time is spent, and where and how much memory is allocated and consumed. Then we can correlate these costs to particular subsystems in the interpreter. For example, we wish to be able to say “The parser finishes a 100 line program in 100ms and runs in 100Kb of constant space, but the expander allocates 2Gb and finishes its computation in 2 minutes!”. Once we can make precise statements, we can begin forming hypotheses to speed up the system.
We’ll use klister’s testsuite to find a reproducible test. The testsuite provides a good sample of programs and will allow us to find a degenerate program. With a degenerate program, we’ll record and correlate runtime costs to each subsystem and find the offending piece of source code.
Note
A quick note on these degenerate programs and a mental model that you might
find useful. In the abstract, we can think about the system space of the
system. The system space is the space of all possible semantically equivalent
systems, for some input and for some available resources. For example, for
klister, given the input program (+ 2 2)
we have an infinite space of
possible klister interpreters, some with many machine resources and some with
few. Similarly, given the same machine to run the klister interpreter, and
the same input program, we still have an infinite space of klister
interpreters, some of which will be very fast (in wall time) and some will be
very slow, depending on their implementation.
Now we can imagine that there are boundaries [1] in this system space that delineate acceptable performance from unacceptable performance, however one defines performance and what is acceptable. These degenerate programs are inputs that point to such a boundary, and thus they are useful allies in our performance work. As we improve the interpreter these boundaries will change, the space of acceptability will grow and we’ll have to find even more degenerate programs to observe the boundaries again.
4.3.3. Performance Minded Code Review¶
Before running the testsuite, we’ll begin with a code review to better understand the interpreter and to read through the code with a performance engineering mindset. Here is a list of things that are noticeable from reading through the source code.
Lazy Data.Map in the interpreter state.
Lazy modifyIORef in the interpreter state.
Use of lazy tuples throughout the code base.
Lists used as general purpose containers.
Lazy State monad, especially in the Expander.
Lazy WriterT monad transformer in the Expander.
Use of monad transformers and ExceptT in the interpreter state.
Lack of specialization, especially in the IORef utilities module and in lens-driven code.
Some modules do not use explicit exports.
No use of -funbox-strict-fields for bang patterns, or of UNPACK pragmas.
Using foldr on a list (see Expander.getExports) even though the code does not exploit laziness.
Using Integer instead of Int.
We won’t be going through each item in detail, but will instead highlight a few consequential ones. Note that not all of these need to be fixed or are categorically a mistake. It depends on what the system is trying to do and how the implementation is trying to realize the system. For example, there might be a perfectly good reason to use an Integer over an Int, such as expecting to handle data past the Int range.
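To make the strictness items above concrete, here is a minimal sketch (the types are illustrative, not from the klister code base) of what a strict, unpacked field buys over a lazy one:
-- the field is a pointer to a possibly-thunked, arbitrary-precision Integer
data LazyCounter = LazyCounter Integer

-- the field is stored directly in the constructor as an unboxed Int#
data StrictCounter = StrictCounter {-# UNPACK #-} !Int
Compiling with -funbox-strict-fields has the same effect as the UNPACK pragma for every strict field, which is why the two appear together in the list above.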
4.3.3.1. Data.Map With Expensive Keys¶
The ExpanderState
is a key type in the klister interpreter. It is stored in
an IORef
and maintains state for the expander. From a performance
perspective the ExpanderState
is suspect because it employs data structures
that are lazy and yet the interpreter will eventually use everything in the
state. Here is the definition of the ExpanderState
:
data ExpanderState = ExpanderState
{ _expanderWorld :: !(World Value)
, _expanderNextScopeNum :: !Int
, _expanderGlobalBindingTable :: !BindingTable
, _expanderExpansionEnv :: !ExpansionEnv
, _expanderTasks :: [(TaskID, ExpanderLocal, ExpanderTask)]
, _expanderOriginLocations :: !(Map.Map SplitCorePtr SrcLoc)
, _expanderCompletedCore :: !(Map.Map SplitCorePtr (CoreF TypePatternPtr PatternPtr SplitCorePtr))
, _expanderCompletedPatterns :: !(Map.Map PatternPtr (ConstructorPatternF PatternPtr))
, _expanderCompletedTypePatterns :: !(Map.Map TypePatternPtr TypePattern)
, _expanderPatternBinders :: !(Map.Map PatternPtr (Either [PatternPtr] (Scope, Ident, Var, SchemePtr)))
, _expanderTypePatternBinders :: !(Map.Map TypePatternPtr [(Scope, Ident, Var, SchemePtr)])
, _expanderCompletedTypes :: !(Map.Map SplitTypePtr (TyF SplitTypePtr))
, _expanderCompletedDeclTrees :: !(Map.Map DeclTreePtr (DeclTreeF DeclPtr DeclTreePtr))
, _expanderCompletedDecls :: !(Map.Map DeclPtr (Decl SplitTypePtr SchemePtr DeclTreePtr SplitCorePtr))
, _expanderModuleTop :: !(Maybe DeclTreePtr)
, _expanderModuleImports :: !Imports
, _expanderModuleExports :: !Exports
, _expanderPhaseRoots :: !(Map Phase Scope)
, _expanderModuleRoots :: !(Map ModuleName Scope)
, _expanderKernelBindings :: !BindingTable
, _expanderKernelExports :: !Exports
, _expanderKernelDatatypes :: !(Map Datatype DatatypeInfo)
, _expanderKernelConstructors :: !(Map Constructor (ConstructorInfo Ty))
, _expanderKernelValues :: !(Env Var (SchemePtr, Value))
, _expanderDeclOutputScopes :: !(Map DeclOutputScopesPtr ScopeSet)
, _expanderCurrentEnvs :: !(Map Phase (Env Var Value))
, _expanderCurrentTransformerEnvs :: !(Map Phase (Env MacroVar Value))
, _expanderCurrentDatatypes :: !(Map Phase (Map Datatype DatatypeInfo))
, _expanderCurrentConstructors :: !(Map Phase (Map Constructor (ConstructorInfo Ty)))
, _expanderCurrentBindingTable :: !BindingTable
, _expanderExpressionTypes :: !(Map SplitCorePtr Ty)
, _expanderCompletedSchemes :: !(Map SchemePtr (Scheme Ty))
, _expanderTypeStore :: !(TypeStore Ty)
, _expanderKindStore :: !KindStore
, _expanderDefTypes :: !(TypeContext Var SchemePtr) -- ^ Module-level definitions
}
These Maps are suspicious because every type suffixed with Ptr is a newtype over an Integer. For example, SplitCorePtr is a newtype over a Unique, which is itself a newtype over Integer:
newtype SplitCorePtr = SplitCorePtr Unique
deriving (Eq, Ord)
newtype Unique = Unique Integer
deriving (Data, Eq, Ord)
This means that klister
has a lot of Maps
effectively indexed on
Integer
. Clearly an IntMap
would be a better choice of data structure.
An IntMap
has several advantages over Data.Map
. Data.Map
is an
implementation of size balanced binary trees (see the hackage
docs). These trees require re-balancing in order to maintain the balanced
invariant. This balancing slows down writes (that is, adds work to
Data.Map.insert
) and consequently merges, because a merge of any two trees
may invoke a re-balancing. In contrast, IntMap
is a big-endian PATRICIA Trie
(see Okasaki [1]) which never requires balancing. Compared to Data.Map
,
an IntMap
provides faster writes at the cost of slightly slower reads.
Additionally, IntMap
uses less total memory than Data.Map
[2].
The klister interpreter also uses Maps indexed over String-like types. For example, _expanderKernelDatatypes is a Map Datatype DatatypeInfo, where Datatype is a record:
data Datatype
= Datatype
{ _datatypeModule :: !ModuleName -- ^ The module that defines the datatype
, _datatypeName :: !DatatypeName -- ^ The unique name for the datatype at this module and phase
}
deriving (Data, Eq, Ord, Show)
where ModuleName and DatatypeName wrap a FilePath (a String) and a Text, respectively:
data ModuleName = ModuleName FilePath | KernelName KernelName
deriving (Data, Eq, Ord, Show)
newtype DatatypeName = DatatypeName { _datatypeNameText :: Text }
deriving newtype (Eq, IsString, Ord, Show, Hashable)
deriving stock Data
Indexing a Data.Map with String-like keys is a performance anti-pattern. In the worst case, the Ord and Eq instances on String will need to check the entire String. Even in the average case, extra comparisons will be required in order to rebalance the tree, which will slow down writes and merges.
A better data structure for maps indexed over String-like types is Data.HashMap from the unordered-containers library. These maps are hash array mapped tries, or HAMTs for short (see Bagwell [6]). HAMTs index on a hash of the key. These data structures are efficient for any key type where equality could be expensive, such as String, Text or inductive data types that derive Eq.
4.3.3.2. Lazy Tuples¶
Consider this field of the ExpanderState
:
...
, _expanderTypePatternBinders :: !(Map.Map TypePatternPtr [(Scope, Ident, Var, SchemePtr)])
...
The elements of this map are lists of 4-tuples. Such a list carries an excessive amount of indirection to its values: not only will the interpreter chase pointers for the elements of the list, it will also chase pointers for each element of every tuple. This will be slow if these lists ever become large (over ~30 elements) and if the list is the subject of folds. For this specific datatype, there is one fold in the klister interpreter:
else do
varInfo <- view (expanderTypePatternBinders . at patPtr) <$> getState
case varInfo of
Nothing -> throwError $ InternalError "Type pattern info not added"
Just vars -> do
p <- currentPhase
let rhs' = foldr (addScope p) stx
[ sc'
| (sc', _, _, _) <- vars
]
withLocalVarTypes
[ (var, varStx, t)
| (_sc, varStx, var, t) <- vars
] $
expandOneExpression ty dest rhs'
The code projects expanderTypePatternBinders and looks up the list that patPtr points to. It then iterates over that same list twice: first, to project sc' from the first position and pass it to addScope; second, to project the second, third and fourth positions into a list of 3-tuples and pass that to withLocalVarTypes. This code can be improved with loop fusion to iterate over the list once, using foldl' instead of foldr, and by defining a datatype which unpacks every field instead of using (,,,), thereby avoiding the extra allocation for these tuples. Note that defining such a datatype requires that \(\bot\) is never an element of the 4-tuple. In this case that is true, but it is not true in general; again, in order to optimize you must understand the system.
Generally, types such as (,,,) are a path of least resistance when writing new code. They are easy to reach for, easy to write and don’t require more domain modeling. However, tuples, and especially tuples with more than two fields, are a consistent source of memory leaks. So for performance one is almost always better off defining a datatype instead of using a tuple, as sketched below.
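For instance, a strict replacement for the (Scope, Ident, Var, SchemePtr) 4-tuple could look like the following sketch; the type name and field names are illustrative, not taken from the klister code base:
-- with -funbox-strict-fields (or UNPACK pragmas) each field is stored
-- directly in the constructor instead of behind a pointer to a thunk
data PatternBinder = PatternBinder
  { binderScope  :: !Scope
  , binderIdent  :: !Ident
  , binderVar    :: !Var
  , binderScheme :: !SchemePtr
  }
Call sites then change only from matching on (sc, ident, var, scheme) to matching on PatternBinder sc ident var scheme.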
Note
Of course, you may not want to add yet another datatype to the implementation. One may want the datatypes in the implementation to map cleanly to domain objects. This is a classic tradeoff between performance, readability and maintainability.
4.3.3.3. Running the testsuite¶
Klister does not have a benchmark suite, but does have a testsuite (with 124 tests) written in tasty which outputs the wall time of each test. So let’s run the testsuite to find a degenerate test:
$ cabal test --test-show-details=streaming --ghc-options='-O2'
Test suite klister-tests: RUNNING...
All tests
Expander tests
...
Module tests
Expected to succeed
...
examples/lang.kl: OK (0.04s)
examples/import.kl: OK (0.02s)
examples/macro-body-shift.kl: OK (0.03s)
examples/test-quasiquote.kl: OK (0.05s)
examples/quasiquote-syntax-test.kl: OK (0.04s)
examples/hygiene.kl: OK (0.84s)
examples/defun-test.kl: OK (0.01s)
examples/fun-exports-test.kl: OK (0.04s)
Golden tests
test-quasiquote: OK (0.03s)
io: OK (0.03s)
defun-test: OK (0.04s)
contract: OK (0.11s)
int-ops: OK (0.03s)
implicit-conversion: OK (7.02s)
...
implicit-conversion-test: OK (9.89s)
higher-kinded-patterns: OK (1.80s)
custom-literals-test: OK (0.46s)
double-define: OK (0.34s)
custom-module-test: OK (0.55s)
which-problem: OK (0.82s)
incorrect-context: OK (0.03s)
bound-vs-free: OK (0.31s)
meta-macro: OK (0.11s)
integer-syntax: OK (0.04s)
import: OK (0.04s)
All 124 tests passed (38.15s)
Test suite klister-tests: PASS
The entire testsuite finishes in 38 seconds. Notice that both implicit-conversion and implicit-conversion-test are extreme outliers, passing in 7 and 9 seconds, whereas every other test passes in well under a second (except higher-kinded-patterns). We have found our degenerate program.
4.3.4. Restate the Problem¶
For the rest of the case study we’ll focus on speeding up
implicit-conversion-test
under the assumption that our changes will also
speed up the other tests and consequently the entire interpreter. It is
certainly possible that optimizations in one subsystem of your code base will
not translate to better overall performance. We make this assumption because
this is the first pass of performance engineering for this code. So the
probability that our changes do impact overall performance is high.
We have identified a list of likely issues, but before we begin to change things we need to make sure that the problems we’ve identified are indeed problems. In other words, we should only make changes once we have evidence that the code we’re changing is problematic. If the maps are a problematic factor then we should expect a lot of allocations to come from Data.Map.insert, Eq and Ord instance methods, and the functions Data.Map.Internal.balanceR and Data.Map.Internal.balanceL. This is a good opportunity to not think, and instead look, with a ticky report.
First we’ll generate a ticky report for the entire testsuite:
$ cabal test --test-show-details=streaming --test-options='+RTS -rticky -RTS' --ghc-options='-rtsopts -ticky -ticky-allocd -ticky-dyn-thunk'
Build profile: -w ghc-9.2.4 -O1
...
and check the results sorted by allocations. As a reminder, the columns of a ticky report are: the number of entries; the number of bytes allocated by the closure’s code; the number of bytes allocated of the closure itself (tracked by -ticky-allocd); the number of arguments together with a shorthand classification of their kinds; and finally the name of the STG closure the row describes:
$ cat ticky | tail -n +20 | sort -k2 -nr | less
53739709 4299176720 0 3 +.> ScopeSet.$wallScopeSets'{v rNAX} (fun)
60292448 3858716672 2149588360 3 +.. sat_sOYl{v} (ScopeSet) (fun) in rNAX
81547057 1368797696 0 4 SISM ScopeSet.$w$sgo4{v rNAW} (fun)
57730804 1305110352 0 4 SISM ScopeSet.$w$sgo1{v rNAV} (fun)
61143424 841913088 0 2 SM ScopeSet.isSubsetOf_go15{v rOUK} (fun)
7819243 815587232 0 6 >pii.M Binding.$w$sgo3{v r1syq} (fun)
17961626 421056776 0 3 >MM Binding.$fMonoidBindingTable_$sunionWith{v r1syc} (fun)
867831 366262720 0 10 piiSiSL>>> Parser.Common.$wg{v rk} (fun)
886661 333384536 0 6 SS>>>> located{v r1b6H} (Parser) (fun)
4552387 298031744 0 3 ISM Expander.$w$sgo4{v r5BKT} (fun)
4843152 270145008 612288 1 M go15{v s1szA} (Binding) (fun) in r1syd
2699373 259139808 0 4 >SSM Syntax.$w$cmapScopes{v rTEZ} (fun)
18445979 240603872 0 4 piiM Binding.$w$sgo1{v r1syi} (fun)
1351616 237884416 612288 1 T f{v s1szf} (Binding) (fun) in r1syd
1862523 211065056 0 3 S>M ScopeSet.$satKeyIdentity_$sgo15{v rOUv} (fun)
3383994 186416288 43447360 2 LM go15{v sP96} (ScopeSet) (fun) in rOUk
101588 145802400 0 4 MSSM $wexpandOneForm{v r5IwM} (Expander) (fun)
2607448 125157504 0 2 >S Syntax.$fHasScopesSyntax_$cmapScopes{v rTEY} (fun)
...
There are several interesting aspects to this ticky profile snippet. First, the most allocating code is ScopeSet.allScopeSets', whose argument kinds are a dictionary (+), a value of some other type (.) and a function (>). Seeing a dictionary argument (+) in a ticky report can often mean that a function did not specialize, so from this ticky we suspect that allScopeSets' has not specialized. The second most allocating closure is a SAT’d function sat_sOYl (see The Static Argument Transformation). From its description, {v} (ScopeSet) (fun) in rNAX, we can see that it is a non-exported name ({v}) just like allScopeSets', in the ScopeSet module; it is a function (fun) and is a local function in the rNAX closure. The rNAX closure is the STG name of the closure for allScopeSets', as shown in the description for allScopeSets'. So the two most allocating function calls in the interpreter, when running the testsuite, are due to allScopeSets. Clearly, allScopeSets is a good target for performance engineering.
We also see that the 5th and 6th most allocating functions are ScopeSet.isSubsetOf and Binding.$fMonoidBindingTable_$sunionWith. That suggests a peculiar usage pattern: isSubsetOf only returns a Bool and so should not be an allocating call. unionWith is expected to allocate, but the fact that it occurs in BindingTable’s Monoid instance means that this Monoid instance is heavily allocating. Let’s check these functions in the source code:
data ScopeSet = ScopeSet
{ _universalScopes :: Set Scope
, _phaseScopes :: Map Phase (Set Scope)
}
data Scope = Scope { scopeNum :: Int, scopePurpose :: Text }
deriving (Data, Eq, Ord, Show)
newtype Phase = Phase { phaseNum :: Natural }
deriving (Data, Eq, Ord, Show)
isSubsetOf :: Phase -> ScopeSet -> ScopeSet -> Bool
isSubsetOf p scs1 scs2 =
Set.isSubsetOf (scopes p scs1) (scopes p scs2)
scopes :: Phase -> ScopeSet -> Set Scope
scopes p scs = view universalScopes scs `Set.union`
view (phaseScopes . at p . non Set.empty) scs
ScopeSet is a record of a Set Scope and a Map indexed by Phase that holds a Set Scope per phase. Furthermore, both Scope and Phase are Integer-like, so we have an implementation that could use IntMap and IntSet instead of Data.Map and Data.Set.
We know that isSubsetOf does a lot of allocation; now we can see where that allocation is happening. isSubsetOf checks that scs1 is a subset of scs2 by calling Set.isSubsetOf on the results of the scopes function. scopes allocates a new Set Scope from the ScopeSet via Set.union, using the result of a lookup on the phaseScopes Map; so two Sets are merged just to check the subset.
There are several ways to improve the memory performance of this function. First, we can employ better data structures. We know that this code is performing a lot of merges, so we should expect an improvement in both time and memory performance by using an IntMap and IntSet, because these data structures provide more efficient merges than Data.Set and Data.Map. Second, we can use a better algorithm. From the ticky, isSubsetOf was called 61143424 times. As written, this code will perform its lookups and unions each time, even for a duplicate call, so it seems to be a good candidate for memoization or caching the calls to isSubsetOf. We could also avoid building the intermediate Set in the scopes function by checking for a subset without merging universalScopes and phaseScopes.
The second interesting function was unionWith in BindingTable’s Monoid instance. Here is the source code:
newtype BindingTable = BindingTable { _bindings :: Map Text [(ScopeSet, Binding, BindingInfo SrcLoc)] }
deriving (Data, Show)
instance Semigroup BindingTable where
b1 <> b2 = BindingTable $ Map.unionWith (<>) (view bindings b1) (view bindings b2)
instance Monoid BindingTable where
mempty = BindingTable Map.empty
A BindingTable
is a Map
keyed on Text
that holds a list of triples.
The Semigroup
instance is the origin of the unionWith
in the ticky
profile because Monoid.mappend
defaults to the Semigroup definition. This
type is likely too lazy. Data.Map
keyed on Text
relies on the Ord
and Eq
instances of Text
for most of its operations. In the worst case
this means the runtime system has to compare the entire Text
key, which
could be slow when the Text
is large. Another problem is the use of a list.
A list is only an appropriate data structure if it is used like a stack or as a store that is eventually traversed and consumed. Once you find yourself performing lookups or merges on a list, it is time to use a different data structure. The last problem is the 3-tuple, which we have already covered above.
To improve the performance of the BindingTable
we’ll use a HashMap
. This
should yield better merge performance, and faster writes and reads. However,
this may not fix the root cause of the allocations. So we’ll rerun the ticky
report after making the changes to test that we have indeed addressed the
problem.
4.3.5. Optimization 1: Better Data Structures: Maps¶
We’ve removed all uses of Data.Map
and replaced them with either a
HashMap
or an IntMap
. After the changes ExpanderState
now looks
like:
data ExpanderState = ExpanderState
{ _expanderWorld :: !(World Value)
, _expanderNextScopeNum :: !Int
, _expanderGlobalBindingTable :: !BindingTable
, _expanderExpansionEnv :: !ExpansionEnv
, _expanderTasks :: [(TaskID, ExpanderLocal, ExpanderTask)]
, _expanderOriginLocations :: !(Store SplitCorePtr SrcLoc)
, _expanderCompletedCore :: !(Store SplitCorePtr (CoreF TypePatternPtr PatternPtr SplitCorePtr))
, _expanderCompletedPatterns :: !(Store PatternPtr (ConstructorPatternF PatternPtr))
, _expanderCompletedTypePatterns :: !(Store TypePatternPtr TypePattern)
, _expanderPatternBinders :: !(Store PatternPtr (Either [PatternPtr] (Scope, Ident, Var, SchemePtr)))
, _expanderTypePatternBinders :: !(Store TypePatternPtr [(Scope, Ident, Var, SchemePtr)])
, _expanderCompletedTypes :: !(Store SplitTypePtr (TyF SplitTypePtr))
, _expanderCompletedDeclTrees :: !(Store DeclTreePtr (DeclTreeF DeclPtr DeclTreePtr))
, _expanderCompletedDecls :: !(Store DeclPtr (Decl SplitTypePtr SchemePtr DeclTreePtr SplitCorePtr))
, _expanderModuleTop :: !(Maybe DeclTreePtr)
, _expanderModuleImports :: !Imports
, _expanderModuleExports :: !Exports
, _expanderPhaseRoots :: !(Store Phase Scope)
, _expanderModuleRoots :: !(HashMap ModuleName Scope)
, _expanderKernelBindings :: !BindingTable
, _expanderKernelExports :: !Exports
, _expanderKernelDatatypes :: !(HashMap Datatype DatatypeInfo)
, _expanderKernelConstructors :: !(HashMap Constructor (ConstructorInfo Ty))
, _expanderKernelValues :: !(Env Var (SchemePtr, Value))
, _expanderDeclOutputScopes :: !(Store DeclOutputScopesPtr ScopeSet)
, _expanderCurrentEnvs :: !(Store Phase (Env Var Value))
, _expanderCurrentTransformerEnvs :: !(Store Phase (Env MacroVar Value))
, _expanderCurrentDatatypes :: !(Store Phase (HashMap Datatype DatatypeInfo))
, _expanderCurrentConstructors :: !(Store Phase (HashMap Constructor (ConstructorInfo Ty)))
, _expanderCurrentBindingTable :: !BindingTable
, _expanderExpressionTypes :: !(Store SplitCorePtr Ty)
, _expanderCompletedSchemes :: !(Store SchemePtr (Scheme Ty))
, _expanderTypeStore :: !(TypeStore Ty)
, _expanderKindStore :: !KindStore
, _expanderDefTypes :: !(TypeContext Var SchemePtr) -- ^ Module-level definitions
}
where a Store k v is a newtype over an IntMap with some type-level handling for keys:
-- new module Util.Store
newtype Store p v = Store { unStore :: IntMap v}
deriving newtype (Eq, Ord, Show, Semigroup, Monoid, Functor, Foldable)
deriving stock Data
type role Store representational _
-- | Tiny module to wrap operations for IntMaps
module Util.Key
(HasKey(..)
) where
class HasKey a where
getKey :: a -> Int
fromKey :: Int -> a
The HasKey
class is crucial for preserving type safety so that we do not
accidentally merge a Store k v
and a Store p v
when p
is not equal
to k
. It also allows us to provide functions such as mapKeys
with a
polymorphic type on the Key
, as opposed to the monomorphic type in
Data.IntMap
:
mapKeys :: HasKey p => (p -> p) -> Store p v -> Store p v
mapKeys f s = Store $! IM.mapKeys (getKey . f . fromKey) (unStore s)
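The rest of the Store API can follow the same pattern. Here is a sketch, assumed rather than copied from the full Util.Store module, of insert and lookup plus a HasKey instance for a hypothetical pointer type:
-- assuming the strict IntMap variant; used qualified, like Data.IntMap itself
import qualified Data.IntMap.Strict as IM

insert :: HasKey p => p -> v -> Store p v -> Store p v
insert k v (Store m) = Store $! IM.insert (getKey k) v m

lookup :: HasKey p => p -> Store p v -> Maybe v
lookup k (Store m) = IM.lookup (getKey k) m

-- a hypothetical key type for illustration; the real Ptr types wrap a Unique
newtype ExamplePtr = ExamplePtr Int

instance HasKey ExamplePtr where
  getKey (ExamplePtr n) = n
  fromKey = ExamplePtr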
Note
I’ve purposefully taken the opportunity to create the abstraction boundary
Store p v
, which is unnecessary from a performance perspective. But by
having an abstraction boundary the system becomes easier to change. For
example, we can now alter the implementation of Store p v
without a lot
of code churn occurring in the interpreter. The key point is that if a
subsystem is easy to change, then it is also easy to experiment and engineer,
because the cost of experimentation is reduced. So while performance and
modularity are often in tension, in this case, modularity enables easier and
cheaper (in terms of engineering time) performance engineering.
Now let’s check the ticky:
$ cat ticky | tail -n +20 | sort -k2 -nr | less
53996388 4319711040 0 3 +.> ScopeSet.$wallScopeSets'{v rP2F} (fun)
60490404 3871385856 2159855520 3 +.. sat_sQ5D{v} (ScopeSet) (fun) in rP2F
20257037 1487236040 0 3 iMM Binding.$wgo{v r1ric} (fun)
81547057 1368797696 0 4 SISM ScopeSet.$w$sgo4{v rP2E} (fun)
57730804 1305110352 0 4 SISM ScopeSet.$w$sgo1{v rP2D} (fun)
867831 366262720 0 10 piiSiSL>>> Parser.Common.$wg{v r3zJ} (fun)
886661 333384536 0 6 SS>>>> located{v r1art} (Parser) (fun)
10521949 330656896 0 3 Lii ModuleName.$wgo1{v roEi} (fun)
4552387 298031744 0 3 ISM Expander.$w$sgo4{v r5On7} (fun)
2699373 259139808 0 4 >SSM Syntax.$w$cmapScopes{v rUeh} (fun)
1351616 237884416 612288 1 T f{v s1sRr} (Binding) (fun) in r1rif
3159635 193376496 1071504 1 M go{v s1sS8} (Binding) (fun) in r1rif
2348710 169685264 1156288 1 M go2{v s16Wz} (Env) (fun) in r16zL
4590545 146897440 183644160 0 f2{v s1t5Z} (Binding) (thk) in r1ric
101588 145802400 0 4 MSSM $wexpandOneForm{v r5VBM} (Expander) (fun)
2607448 125157504 0 2 >S Syntax.$fHasScopesSyntax_$cmapScopes{v rUeg} (fun)
1357729 119480152 486976 1 S sat_s5YKN{v} (Expander) (fun) in s5YKB
144974 118076280 0 10 piiiSL>>>> $wm2{v r1arF} (Parser) (fun)
Notice that the entries to unionWith
and isSubsetOf
have disappeared.
isSubsetOf
is still in the ticky report but is now shown as non-allocating:
...
38279681 0 0 2 MM ScopeSet.$sisSubsetOf_$sisSubsetOfX{v rP2u} (fun)
...
This is strange; we should expect it to still allocate, although a lot less. We are likely observing some new optimization GHC is applying. In any case, it’s a good sign, and we’ve demonstrated progress with the ticky report. Now let’s verify that these changes propagate to the implicit-conversion test.
Test suite klister-tests: RUNNING...
All tests
Expander tests
...
Module tests
Expected to succeed
...
examples/lang.kl: OK (0.04s)
examples/import.kl: OK (0.03s)
examples/macro-body-shift.kl: OK (0.04s)
examples/test-quasiquote.kl: OK (0.04s)
examples/quasiquote-syntax-test.kl: OK (0.03s)
examples/hygiene.kl: OK (0.66s)
examples/defun-test.kl: OK (0.03s)
examples/fun-exports-test.kl: OK (0.04s)
Golden tests
test-quasiquote: OK (0.04s)
io: OK (0.03s)
defun-test: OK (0.03s)
contract: OK (0.08s)
int-ops: OK (0.05s)
implicit-conversion: OK (10.42s)
...
implicit-conversion-test: OK (13.55s)
higher-kinded-patterns: OK (0.77s)
custom-literals-test: OK (0.38s)
double-define: OK (0.28s)
custom-module-test: OK (0.33s)
which-problem: OK (0.53s)
incorrect-context: OK (0.03s)
bound-vs-free: OK (0.25s)
meta-macro: OK (0.10s)
integer-syntax: OK (0.04s)
import: OK (0.03s)
Performance has degraded even though the ticky report showed an improvement! That we do not observe an improvement in wall time after fixing the 5th and 6th most allocating function calls is contrary to what we should expect, even if the total allocations of these functions are an order of magnitude less than those of allScopeSets. The Data.Map performance costs must have been eclipsed by some other issue. Let’s generate a heap profile to see what’s going on in the heap.
4.3.6. Optimization 2: A Memory Leak Casts a Long Shadow¶
To continue engineering the interpreter we need to be able to observe the
effects of any change we make. Unfortunately, the memory leak we discovered in
the last section was so large it eclipsed the effect of our changes. Clearly we
need to fix this leak in order to continue to optimize. To tackle this memory
leak, we’ll use eventlog and eventlog2html to observe the heap only on
implicit-conversion-test
. To begin we’ll inspect the types that were
allocated on the heap by passing -hy
:
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hy -l-agu -p -RTS' --ghc-options='-eventlog -rtsopts -O2'
which produces:
We see that the heap is growing to over 2.8Gb of lists for just one test!
Crucially the shape of this profile is not indicative of a memory leak. A
typical memory leak should look like a pyramid because the program builds up
thunks and then forces them all in relatively short time. What we observe in
this profile is allocations of lists that never decrease. Now that we know the
type to look for, we can correlate this type to a subsystem in the interpreter.
To do so we’ll run another heap profile, but break down the heap by module (by
using -hm
instead of -hy
):
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hm -l-agu -p -RTS' --ghc-options='-eventlog -rtsopts -O2'
We see that these lists are coming from Expander.Monad. This is suspicious: we have data being consistently allocated in what is essentially the state type of a subsystem, which certainly sounds like a memory leak. Before we can conclude that the expander has a memory leak we need to know why this data is retained at all. This is a good scenario for biographical and retainer profiling, because we want to know: (1) the state of these objects on the heap and (2) why they are not being collected, that is, why GHC’s runtime system is keeping them alive. For (1) we’ll do a biographical profile and for (2) a retainer profile.
Here’s the biographical profile:
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hb -l-agu -p -RTS' --ghc-options='-eventlog -rtsopts -O2'
Void! The lists are in a void state, meaning these objects are allocated but never used (see the user guide). Now we can restate the problem: there is a memory leak in the Expander. When implicit-conversion-test is run, the expander allocates a total of 121.8 Gb (the eventlog shows 116171.68 MiB in the detailed tab).
Now to answer why this data is being retained. Here is the retainer profile.
$ cabal test --enable-profiling --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hr -l-agu -p -RTS' --ghc-options='-eventlog -rtsopts -O2'
$ hp2ps -c klister-tests.hp && ps2pdf klister-tests.ps
Note
Eventlog threw an exception for this retainer profile, so I’ve resorted to the classic tools hp2ps and ps2pdf to render the profile.
The retainer profile clearly shows that currentEnv
is keeping this data
alive and has the distinguishing profile of a memory leak. Let’s look at that
function:
-- in Expander.Monad
currentEnv :: Expand VEnv
currentEnv = do
phase <- currentPhase
globalEnv <- fromMaybe mempty . view (expanderWorld . worldEnvironments . at phase) <$> getState
localEnv <- fromMaybe mempty . view (expanderCurrentEnvs . at phase) <$> getState
return $ globalEnv <> localEnv
This code is reading from the Expander
state twice to retrieve globalEnv
and localEnv
, and then returning the union of these two environments. Notice
that if the results of currentEnv
are not immediately forced, then the
projections and the return will be lazy. Thus, this code will allocate a thunk
for phase
, globalEnv
, localEnv
and the merge of both globalEnv
and localEnv
. In general, unless the result of a monadic action needs to
be consumed lazily there is little reason to not make it strict in the return.
In this case, there are two call sites to currentEnv
both in the expander:
-- Expander.hs
runTask :: (TaskID, ExpanderLocal, ExpanderTask) -> Expand ()
runTask (tid, localData, task) = withLocal localData $ do
...
AwaitingDefn x n b defn t dest stx ->
Env.lookupVal x <$> currentEnv >>=
\case
evalDecl :: CompleteDecl -> WriterT [EvalResult] Expand ()
evalDecl (CompleteDecl d) =
case d of
...
Example loc sch expr -> do
env <- lift currentEnv
value <- lift $ expandEval (eval expr)
tell $ [ExampleResult loc env expr sch value]
The first call site in runTask
is immediately scrutinized by the \case
and Env.lookupVal
. However, the second is not: ExampleResult
is not
strict in any fields:
data EvalResult
= ExampleResult SrcLoc VEnv Core (Scheme Ty) Value
| IOResult (IO ())
and the whole computation is performed in a lazy WriterT
monad transformer.
Before changing the code, let’s first inspect the type of the Expand
monad.
Here is the definition for Expand
:
newtype Expand a = Expand
{ runExpand :: ReaderT ExpanderContext (ExceptT ExpansionErr IO) a
}
deriving (Functor, Applicative, Monad, MonadError ExpansionErr, MonadIO, MonadReader ExpanderContext)
data ExpanderContext = ExpanderContext
{ _expanderLocal :: !ExpanderLocal
, _expanderState :: IORef ExpanderState
}
Where ExpanderState was shown above. So we have a classic ReaderT-over-IO pattern, meaning that the laziness of any state update depends on the strictness of the functions operating on ExpanderContext. Next let’s check the types of globalEnv and localEnv:
-- in Expander.Monad.hs
type VEnv = Env Var Value
-- in Env.hs
newtype Env v a = Env (IntMap (Ident, a))
deriving newtype (Eq, Monoid, Semigroup, Show)
deriving stock Functor
-- in World.hs
data World a = World
{ _worldEnvironments :: !(Store Phase (Env Var a))
, _worldTypeContexts :: !(TypeContext Var SchemePtr)
, _worldTransformerEnvironments :: !(Store Phase (Env MacroVar a))
, _worldModules :: !(HashMap ModuleName CompleteModule)
, _worldVisited :: !(HashMap ModuleName (Set Phase))
, _worldExports :: !(HashMap ModuleName Exports)
, _worldEvaluated :: !(HashMap ModuleName [EvalResult])
, _worldDatatypes :: !(Store Phase (HashMap Datatype DatatypeInfo))
, _worldConstructors :: !(Store Phase (HashMap Constructor (ConstructorInfo Ty)))
, _worldLocation :: FilePath
}
currentEnv returns an Expand VEnv, where VEnv is an Env Var Value and an Env is an IntMap. Thus globalEnv and localEnv are both a strict IntMap that stores a tuple of (Ident, Value). Here is the type of Value:
-- in Value.hs
data Value
= ValueClosure Closure
| ValueSyntax Syntax
| ValueMacroAction MacroAction
| ValueIOAction (IO Value)
| ValueOutputPort Handle
| ValueInteger Integer
| ValueCtor Constructor [Value]
| ValueType Ty
| ValueString Text
Notice that ValueCtor holds a lazy list of Values. Should implicit-conversion-test create many ValueCtors, the expander state will leak memory in Value. This leak will occur even though the IntMap is strict: the strict IntMap will only evaluate elements to WHNF, which here is the (,) constructor, so because Value is stored in a lazy tuple the fields of the tuple will stay lazy unless they are carefully forced before insertion.
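To illustrate, forcing the tuple’s components before insertion is what would make the strict map actually strict in them; a hypothetical helper (not klister code) might look like:
import qualified Data.IntMap.Strict as IM

-- IM.insert only forces the value to WHNF, which for a tuple is just the (,)
-- constructor, so we seq the components first to ensure no thunks hide behind it
insertForced :: Int -> Ident -> Value -> IM.IntMap (Ident, Value) -> IM.IntMap (Ident, Value)
insertForced k ident val m = ident `seq` val `seq` IM.insert k (ident, val) m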
Let’s test this and make Value
strict and then generate another
biographical profile to observe the change:
-- in Value.hs
data Value
= ValueClosure !Closure
| ValueSyntax !Syntax
| ValueMacroAction !MacroAction
| ValueIOAction !(IO Value)
| ValueOutputPort !Handle
| ValueInteger !Integer
| ValueCtor !Constructor ![Value]
| ValueType !Ty
| ValueString !Text
Unfortunately, the change made no difference. We’ll revert the change and try making the monadic action strict in its return:
-- in Expander.Monad
currentEnv :: Expand VEnv
currentEnv = do
phase <- currentPhase
globalEnv <- fromMaybe mempty . view (expanderWorld . worldEnvironments . at phase) <$> getState
localEnv <- fromMaybe mempty . view (expanderCurrentEnvs . at phase) <$> getState
return $! globalEnv <> localEnv
Notice the strict application: $!
. This change results in this profile:
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hb -l-agu -p -RTS' --ghc-options='-eventlog -rtsopts -O2'
A significant improvement! Instead of 121.8 Gb, the profile shows a total void allocation of 4.62 Gb (4404.22 MiB in the detailed tab), roughly a 26x reduction.
4.3.7. Optimization 3: Choosing Better Data Structures: Lists¶
We closed a major memory leak, and have made the implementation of the expander
more efficient, but there is yet more low hanging fruit. Consider this
definition of the BindingTable
:
newtype BindingTable = BindingTable { _bindings :: HashMap Text [(ScopeSet, Binding, BindingInfo SrcLoc)] }
deriving (Data, Show)
The BindingTable
is a HashMap
keyed on Text
that holds a list of
3-tuples. Let’s observe how this list is used in the system:
visibleBindings :: Expand BindingTable
visibleBindings = do
globals <- view expanderGlobalBindingTable <$> getState
locals <- view expanderCurrentBindingTable <$> getState
return (globals <> locals)
This visibleBindings
function projects two BindingTable
’s from the
ExpanderState
and then merges them. This function is then called by
resolve
:
resolve :: Ident -> Expand Binding
resolve stx@(Stx scs srcLoc x) = do
p <- currentPhase
bs <- allMatchingBindings x scs
case bs of
[] ->
throwError (Unknown (Stx scs srcLoc x))
candidates ->
let best = maximumOn (ScopeSet.size p . fst) candidates
in checkUnambiguous (fst best) (map fst candidates) stx *>
return (snd best)
In resolve the BindingTable is checked for emptiness; if it is not empty then the table is traversed for the binding with the largest scope set. Then the maximal element and the rest of the scope sets are passed to checkUnambiguous:
checkUnambiguous :: ScopeSet -> [ScopeSet] -> Ident -> Expand ()
checkUnambiguous best candidates blame =
do p <- currentPhase
let bestSize = ScopeSet.size p best
let candidateSizes = map (ScopeSet.size p) (nub candidates)
if length (filter (== bestSize) candidateSizes) > 1
then throwError (Ambiguous p blame candidates)
else return ()
In checkUnambiguous
the candidates
list is pruned of any duplicates,
filtered, and its length is calculated to determine an error state.
Finally, in visit the BindingTable (bs) is projected from the module being visited and then merged with the interpreter’s binding table:
visit :: ModuleName -> Expand Exports
visit modName = do
...
let bs = getModuleBindings m''
modifyState $ over expanderGlobalBindingTable $ (<> bs)
return (shift i es)
where getModuleBindings (Expanded _ bs) = bs
getModuleBindings (KernelModule _) = mempty
So from this code, the BindingTable data structure undergoes merges, traversals such as fmap and filter, and length calculations. We can check how often these functions are used in the prof report we generated with the last heap profile (using the -p RTS flag): visibleBindings and resolve are entered more than 100,000 times, while checkUnambiguous is entered fewer than 20,000 times. Thus merging and finding the maximum matter more than duplicate element removal and calculating length. Based on this usage, we’ll replace the list with a Data.Sequence. A Data.Set would also work and would provide uniqueness by default, but Data.Set has slower merge behavior than a Data.Sequence for many elements (\(\mathcal{O}(m \log{(\frac{n}{m} + 1)})\) with \(m \le n\), compared to \(\mathcal{O}(\log{(\min(n_{1},n_{2}))})\) for Data.Sequence).
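Concretely, the change to the BindingTable could look like the following sketch; this is an assumption about the shape of the change, not necessarily the exact commit:
import Data.Sequence (Seq)
import qualified Data.HashMap.Strict as HM

newtype BindingTable = BindingTable
  { _bindings :: HashMap Text (Seq (ScopeSet, Binding, BindingInfo SrcLoc)) }
  deriving (Data, Show)

instance Semigroup BindingTable where
  -- (<>) on Seq is (><), so merging two tables no longer copies whole lists
  b1 <> b2 = BindingTable $ HM.unionWith (<>) (view bindings b1) (view bindings b2)

instance Monoid BindingTable where
  mempty = BindingTable HM.empty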
BindingTable
is not the only suspicious list. Consider this list in the
definition of Module
:
data Module f a = Module
{ _moduleName :: ModuleName
, _moduleImports :: !Imports
, _moduleBody :: f a
, _moduleExports :: !Exports
}
deriving (Data, Functor, Show)
data CompleteModule
= Expanded !(Module [] CompleteDecl) !BindingTable
| KernelModule !Phase
deriving (Data, Show)
This list is passed to Module as a type argument, but it is used in this code in the expander monad:
getDeclGroup :: DeclTreePtr -> Expand [CompleteDecl]
getDeclGroup ptr =
(view (expanderCompletedDeclTrees . at ptr) <$> getState) >>=
\case
Nothing -> throwError $ InternalError "Incomplete module after expansion"
Just DeclTreeLeaf -> pure []
Just (DeclTreeAtom decl) ->
(:[]) <$> getDecl decl
Just (DeclTreeBranch l r) ->
(++) <$> getDeclGroup l <*> getDeclGroup r
Notice that the DeclTreeBranch l r case will be expensive because it concatenates the l and r groups with (++), and repeated appends over nested branches degrade badly. Just this line is enough to justify changing the data structure to a sequence, again because a sequence has fast merging and cons behavior [3].
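A sequence-returning variant could look like this sketch, which mirrors the original function; the Seq-based signature is an assumption, and callers would switch accordingly:
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

getDeclGroupSeq :: DeclTreePtr -> Expand (Seq CompleteDecl)
getDeclGroupSeq ptr =
  (view (expanderCompletedDeclTrees . at ptr) <$> getState) >>=
  \case
    Nothing -> throwError $ InternalError "Incomplete module after expansion"
    Just DeclTreeLeaf -> pure mempty
    Just (DeclTreeAtom decl) ->
      Seq.singleton <$> getDecl decl
    Just (DeclTreeBranch l r) ->
      -- (<>) on Seq is (><): O(log(min(n1, n2))) rather than copying the
      -- left-hand group the way (++) does
      (<>) <$> getDeclGroupSeq l <*> getDeclGroupSeq r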
4.3.8. Optimization 4: Fixing allScopeSets¶
We’ve come quite far, but there is still a lot of void in the heap profile. This is a good scenario for info-table profiling. Info-table profiling relates source code to closures, so we can see the exact line of source code the void is originating from.
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hi -i0.05 -l -RTS' --ghc-options='-eventlog -rtsopts -O2 -finfo-table-map -fdistinct-constructor-tables'
and the profile is rendered in eventlog:
Notice that the legend displays the Info Table Address instead of the
closure type, module, or biography. From the profile we find that 0x7c41d0
and 0xc0c330
are responsible for the void
allocation. The detailed tab
maps these addresses directly to source code. In the detailed tab, we see that
0x7c41d0
has the description sat_sN17_info
, the closure type THUNK
,
the type f a
, and is in the module ScopeSet
at line 146. That line is
exactly the local combine
function in allScopeSets
. Recall that we also
observed allScopeSets
doing the most allocation, in addition to having a high number of entries, in the ticky profile above. It therefore seems crucial to optimize this function. Here is the source code:
allScopeSets :: Data d => Traversal' d ScopeSet
allScopeSets = allScopeSets'
where
allScopeSets' :: forall f d. (Applicative f, Data d)
=> (ScopeSet -> f ScopeSet)
-> d -> f d
allScopeSets' f = gmapA go
where
go :: forall a. Data a => a -> f a
go a = case eqT @a @ScopeSet of
Just Refl -> f a
Nothing -> allScopeSets' f a
-- A variant of Data.Data.gmapM which uses Applicative instead of Monad
gmapA :: forall f d. (Applicative f, Data d)
=> (forall x. Data x => x -> f x)
-> d -> f d
gmapA g = gfoldl combine pure
where
combine :: Data a => f (a -> b) -> a -> f b
combine ff a = (<*>) ff (g a)
This code is exceedingly polymorphic and is effectively asking GHC to generate
traversals over many different data types. From the info-table profiling we know
that combine
is building up thunks of function applications. In the body of
combine
there are two function application sites: g a
and (<*>) ff (g
a)
. To determine which is responsible we’ll force each and rerun the
info-table profiling, beginning with <*> ff (g a)
:
combine :: Data a => f (a -> b) -> a -> f b
combine ff a = let !res = (<*>) ff (g a) in res
Unfortunately, this produces the same profile. Let’s next force g a
:
combine :: Data a => f (a -> b) -> a -> f b
combine ff a = let !ga = (g a) in (<*>) ff ga
and rerun the test:
$ cabal test --test-show-details=streaming --enable-profiling --test-options='--pattern "implicit-conversion-test" +RTS -hi -i0.05 -l -rticky -RTS' --ghc-options='-eventlog -rtsopts -ticky -O2 -finfo-table-map -fdistinct-constructor-tables'
...
Running 1 test suites...
Test suite klister-tests: RUNNING...
All tests
Golden tests
implicit-conversion-test: OK (1919.27s)
All 1 tests passed (1919.29s)
Test suite klister-tests: PASS
The test finished in 1919.27 seconds; clearly forcing this thunk produced a regression. The heap profile similarly shows much more memory allocation:
Now this test allocates around 2.6Gb; unfortunately the info-table profile does not display the information for the most allocating info table address 0x1b93550. We’ll do one last test and make both the entire body and g a strict:
combine :: Data a => f (a -> b) -> a -> f b
combine ff a = let !ga = (g a)
!res = (<*>) ff ga
in res
Running 1 test suites...
Test suite klister-tests: RUNNING...
All tests
Golden tests
implicit-conversion-test: OK (1906.82s)
All 1 tests passed (1906.84s)
Test suite klister-tests: PASS
Not much better. We saw that combine
was building up thunks with info-table
profiling, but by forcing those thunks the heap grew from 100Mb to 2.6Gb. So
even though info-table profiling pointed to allScopeSets
, our problem lies
elsewhere. If allScopeSets
was the culprit then we would expect an
improvement, which was not observed. But we have learned that allScopeSets
is on the critical path of the memory leak. We’ll leave these changes in place
to determine what type of closures are newly leaking:
Note
This is a common phenomenon when plugging memory leaks. By adding strictness to the critical path the leak is on, it is often the case that the laziness moves to another part of the system until the root cause is discovered. So by leaving these changes in place we are traversing the critical path that leads to the leak.
$ cabal test --enable-profiling --test-show-details=streaming --test-options='--pattern "implicit-conversion-test" +RTS -hy -l-agu -p -L50 -RTS' --ghc-options='-eventlog -rtsopts -O2'
which yields:
Almost all the memory allocated on the heap is a list [] of some type. Let’s inspect the code path for allScopeSets. Normally one would do a retainer analysis to find the right code path, but in this case allScopeSets is only called in one function, visit:
visit :: ModuleName -> Expand Exports
visit modName = do
(m, es) <-
do world <- view expanderWorld <$> getState
case view (worldModules . at modName) world of
Just m -> do
let es = fromMaybe noExports $ view (worldExports . at modName) world
return (m, es)
Nothing ->
inPhase runtime $
loadModuleFile modName
p <- currentPhase
let i = phaseNum p
visitedp <- Set.member p .
view (expanderWorld . worldVisited . at modName . non Set.empty) <$>
getState
unless visitedp $ do
let m' = shift i m -- Shift the syntax literals in the module source code
sc <- freshScope $ T.pack $ "For module-phase " ++ shortShow modName ++ "-" ++ shortShow p
let m'' = over ScopeSet.allScopeSets (ScopeSet.insertUniversally sc) m'
evalResults <- inPhase p $ evalMod m''
modifyState $
set (expanderWorld . worldEvaluated . at modName)
(Just evalResults)
let bs = getModuleBindings m''
modifyState $ over expanderGlobalBindingTable $ (<> bs)
return (shift i es)
where getModuleBindings (Expanded _ bs) = bs
getModuleBindings (KernelModule _) = mempty
We see that allScopeSets
is only used to construct m''
, by traversing
the data in m'
, and applying the function ScopeSet.insertUniversally
with the argument sc
. Here is insertUniversally
:
insertUniversally :: Scope -> ScopeSet -> ScopeSet
insertUniversally sc = set (universalScopes . at sc) (Just ())
and recall that a Scope
is:
data Scope = Scope { scopeNum :: Int, scopePurpose :: Text }
deriving (Data, Eq, Ord, Show)
which is created in freshScope
:
freshScope :: Text -> Expand Scope
freshScope why = do
n <- view expanderNextScopeNum <$> getState
modifyState $ over expanderNextScopeNum $ (+ 1)
return (Scope n why)
The Scope type is overly lazy and inefficient. The Int is incremented lazily in freshScope, and the Text is constructed lazily even though it is only used for debugging in an error state [4]. Furthermore, notice that the Eq and Ord instances are derived, which means the Text field is unnecessarily scrutinized to insert into and balance the Set, even though the ordering information on the Text field is not useful to the interpreter. This is especially problematic given the high entry counts we observed for isSubsetOf and allScopeSets.
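To see why the Text field matters, recall what the derived Ord instance does; the following is an illustration of the instance GHC derives for Scope, not extra code in the interpreter:
-- fields are compared left to right, so whenever two Scopes share a scopeNum
-- the comparison falls through to the Text, which then takes part in every
-- Set insert, lookup and rebalance
instance Ord Scope where
  compare (Scope n1 t1) (Scope n2 t2) =
    compare n1 n2 <> compare t1 t2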
To fix the Scope
type we’ll make the Int
field strict, remove the
Text
field entirely and inline freshScope
. The Int
field should be
strict to ensure the increments do not accumulate thunks and to make sure GHC
will unpack the Int
. We could choose to make the Text
field strict, but
this would be a mistake because the Text
field is only used in an error
state, so making it strict would mean allocating memory in a hot loop when it is
unnecessary to do so. Furthermore, removing the Text
field unlocks better
data structures; with its removal we can use a newtype
instead of a data
type, thereby reducing allocations, and use an IntSet
instead of a Set
.
An IntSet
is similar to an IntMap
(it is also implemented as a
big-endian PATRICIA trie) but the implementation in containers
is tuned for
dense sets (where the elements are close in value), which is the kind of
IntSet
the klister interpreter generates.
The Text
field is allocated in this line:
...
sc <- freshScope $ T.pack $ "For module-phase " ++ shortShow modName ++ "-" ++ shortShow p
...
We’ll test that this is the source of the []
(due to the String
) by
forcing the Text
object, and then rerun a closure type heap profile to
observe any changes:
...
sc <- freshScope $! T.pack $ "For module-phase " ++ shortShow modName ++ "-" ++ shortShow p
...
Notice the strict application $! after freshScope. Here is the heap profile:
Much better: the interpreter is still leaking [] but the heap stays under 50Mb. We have found the memory leak. Now we’ll go further and remove the Text field from Scope and change ScopeSet to use IntSet. To preserve the ability to debug, we’ll use a CPP pragma that includes the Text field in a debug build and elides it in a normal build. Here are the final definitions:
-- Util.Set is an IntSet with more type information
newtype Set key = Set { unSet :: IntSet }
deriving newtype (Eq, Ord, Show, Semigroup, Monoid)
deriving stock Data
type role Set representational
-- In Scope.hs
#ifndef KDEBUG
newtype Scope = Scope { scopeNum :: Int}
deriving newtype (Eq, Ord, Show, HasKey)
deriving stock Data
#else
-- For a debug build Scope keeps a blob of text for debugging the expander
-- output. This will have an impact on the performance of the interpreter so it
-- won't be useful for performance issues
data Scope = Scope { scopeNum :: Int, scopePurpose :: Text }
deriving (Data, Eq, Ord, Show)
#endif
#ifndef KDEBUG
freshScope :: Text -> Expand Scope
{-# INLINE freshScope #-}
freshScope _why = do
n <- view expanderNextScopeNum <$> getState
modifyState $ over expanderNextScopeNum (+ 1)
return (Scope n)
#else
freshScope :: Text -> Expand Scope
{-# INLINE freshScope #-}
freshScope why = do
n <- view expanderNextScopeNum <$> getState
modifyState $ over expanderNextScopeNum (+ 1)
return (Scope n why)
#endif
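For completeness, here is a sketch of how the Util.Set wrapper can expose IntSet operations through HasKey; this mirrors Util.Store and is an assumption rather than a copy of the klister module:
import qualified Data.IntSet as IS

member :: HasKey a => a -> Set a -> Bool
member x (Set s) = IS.member (getKey x) s

insert :: HasKey a => a -> Set a -> Set a
insert x (Set s) = Set $! IS.insert (getKey x) s

size :: Set a -> Int
size = IS.size . unSet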
Notice that the API to freshScope
does not change. In total, there are 11
more call sites to freshScope
throughout the interpreter that are leaking.
So if we had changed the API then the CPP would proliferate even more. To avoid
the extra CPP we inline freshScope
; this will ensure that in a non-debug
build GHC will determine that the Text
argument is dead and remove it,
thereby clearing the memory leaks at each call site.
Now we’ll revert the strictness changes in allScopeSets
and check the
runtime of implicit-conversion-test
to observe the effect of our
optimizations:
$ cabal test --test-show-details=streaming --test-options='--pattern "implicit-conversion-test"' --ghc-options='-O2'
...
Test suite klister-tests: RUNNING...
All tests
Golden tests
implicit-conversion-test: OK (1.10s)
All 1 tests passed (1.11s)
Test suite klister-tests: PASS
Much better: implicit-conversion-test is about nine times faster. Let’s rerun the entire test suite to check the progress we’ve made. Recall that the testsuite originally took 38 seconds to finish.
$ cabal test --test-show-details=streaming --ghc-options='-O2'
...
Test suite klister-tests: RUNNING...
All tests
Module tests
Expected to succeed
...
examples/lang.kl: OK (0.03s)
examples/import.kl: OK (0.03s)
examples/macro-body-shift.kl: OK (0.03s)
examples/test-quasiquote.kl: OK (0.03s)
examples/quasiquote-syntax-test.kl: OK (0.03s)
examples/hygiene.kl: OK (0.27s)
examples/defun-test.kl: OK (0.02s)
examples/fun-exports-test.kl: OK (0.03s)
Golden tests
test-quasiquote: OK (0.03s)
io: OK (0.03s)
defun-test: OK (0.02s)
contract: OK (0.08s)
int-ops: OK (0.03s)
implicit-conversion: OK (0.99s)
implicit-conversion-test: OK (1.11s)
higher-kinded-patterns: OK (0.39s)
custom-literals-test: OK (0.20s)
double-define: OK (0.18s)
custom-module-test: OK (0.22s)
which-problem: OK (0.24s)
incorrect-context: OK (0.02s)
bound-vs-free: OK (0.13s)
meta-macro: OK (0.08s)
integer-syntax: OK (0.03s)
import: OK (0.03s)
All 124 tests passed (9.89s)
Test suite klister-tests: PASS
The entire test suite finished in just under 10 seconds, or about as fast as
implicit-conversion-test
finished originally. implicit-conversion-test
is also not the only test that benefited. Almost every test is faster; for
example, which-problem
finishes in 240 ms compared to 820 ms,
double-define
finishes in 180 ms compared to 340 ms. Our work here is done
for now.
4.3.9. Summing up¶
Our goal was to speed up the klister interpreter. To do so we found a poor performing test and focused on the macro expander subsystem of the interpreter. Then we employed a number of methods to find performance problems. We used:
Ticky Profiling: To determine hot loops and highly allocating functions.
Biographical/Retainer Profiling: To determine the source of memory leaks in the expander state. This technique led our analysis to the lazy return in currentEnv.
Info-Table Profiling: To determine the line of source code that closures on the heap came from. This technique led our analysis to the visit function and ultimately to the leaky Text field in Scope.
Key to our analysis was a careful debugging process. We:
Found a reproducible test case.
Didn’t guess, instead we looked: We used GHC’s instrumentation to develop hypotheses about the klister interpreter’s performance instead of guessing and performing shotgun debugging.
Only changed one thing at a time: We tried to only change a single function, or a single type at a time. Then, if we did not observe any difference with the instrumentation we reverted the change.
Could we have gone farther? Of course. The interpreter is still leaking [] even though the heap stays under 50Mb. In the end, the optimizations were trivial; we only chose better data structures and closed memory leaks. We could still optimize or memoize isSubsetOf, unroll the Expander monad transformer, fix the expander’s use of the WriterT monad transformer (which is well known to leak memory [5]), do some manual loop fusion, or apply oneShot to the expander monad. But for a first pass our results are good enough.