Diffstat (limited to 'Objects/dictnotes.txt')
-rw-r--r--   Objects/dictnotes.txt   237
1 files changed, 58 insertions, 179 deletions
diff --git a/Objects/dictnotes.txt b/Objects/dictnotes.txt
index d3aa77401c..f89720c9f6 100644
--- a/Objects/dictnotes.txt
+++ b/Objects/dictnotes.txt
@@ -1,7 +1,6 @@
-NOTES ON OPTIMIZING DICTIONARIES
+NOTES ON DICTIONARIES
 ================================
-
 
 Principal Use Cases for Dictionaries
 ------------------------------------
@@ -21,7 +20,7 @@ Instance attribute lookup and Global variables
 
 Builtins
     Frequent reads.  Almost never written.
-    Size 126 interned strings (as of Py2.3b1).
+    About 150 interned strings (as of Py3.3).
     A few keys are accessed much more frequently than others.
 
 Uniquification
@@ -59,55 +58,20 @@ Dynamic Mappings
     Characterized by deletions interspersed with adds and replacements.
     Performance benefits greatly from the re-use of dummy entries.
 
+Data Layout
+-----------
 
-Data Layout (assuming a 32-bit box with 64 bytes per cache line)
-----------------------------------------------------------------
-
-Smalldicts (8 entries) are attached to the dictobject structure
-and the whole group nearly fills two consecutive cache lines.
-
-Larger dicts use the first half of the dictobject structure (one cache
-line) and a separate, continuous block of entries (at 12 bytes each
-for a total of 5.333 entries per cache line).
+Dictionaries are composed of 3 components:
+The dictobject struct itself
+A dict-keys object (keys & hashes)
+A values array
 
 Tunable Dictionary Parameters
 -----------------------------
-* PyDict_MINSIZE.  Currently set to 8.
-    Must be a power of two.  New dicts have to zero-out every cell.
-    Each additional 8 consumes 1.5 cache lines.  Increasing improves
-    the sparseness of small dictionaries but costs time to read in
-    the additional cache lines if they are not already in cache.
-    That case is common when keyword arguments are passed.
-
-* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
-    Increasing this ratio makes dictionaries more dense resulting
-    in more collisions.  Decreasing it improves sparseness at the
-    expense of spreading entries over more cache lines and at the
-    cost of total memory consumed.
-
-    The load test occurs in highly time sensitive code.  Efforts
-    to make the test more complex (for example, varying the load
-    for different sizes) have degraded performance.
-
-* Growth rate upon hitting maximum load.  Currently set to *2.
-    Raising this to *4 results in half the number of resizes,
-    less effort to resize, better sparseness for some (but not
-    all dict sizes), and potentially doubles memory consumption
-    depending on the size of the dictionary.  Setting to *4
-    eliminates every other resize step.
-
-* Maximum sparseness (minimum dictionary load).  What percentage
-    of entries can be unused before the dictionary shrinks to
-    free up memory and speed up iteration?  (The current CPython
-    code does not represent this parameter directly.)
-
-* Shrinkage rate upon exceeding maximum sparseness.  The current
-    CPython code never even checks sparseness when deleting a
-    key.  When a new key is added, it resizes based on the number
-    of active keys, so that the addition may trigger shrinkage
-    rather than growth.
+See comments for PyDict_MINSIZE_SPLIT, PyDict_MINSIZE_COMBINED,
+USABLE_FRACTION and GROWTH_RATE in dictobject.c
 
 Tune-ups should be measured across a broad range of applications and
 use cases.  A change to any parameter will help in some situations and
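The three components listed under the new "Data Layout" heading correspond
roughly to the C structures sketched below.  This is a simplified sketch in
the spirit of CPython 3.3's Objects/dictobject.c; the sketch uses its own type
names (DictSketch etc.) so as not to clash with the real headers, and it omits
details such as the keys object's pointer to its lookup function.

    #include <Python.h>   /* PyObject, PyObject_HEAD, Py_ssize_t, Py_hash_t */

    typedef struct {                       /* cf. PyDictKeyEntry */
        Py_hash_t me_hash;                 /* cached hash of me_key */
        PyObject *me_key;
        PyObject *me_value;                /* used by combined tables only */
    } DictKeyEntrySketch;

    typedef struct {                       /* cf. PyDictKeysObject */
        Py_ssize_t dk_refcnt;              /* number of dicts sharing these keys */
        Py_ssize_t dk_size;                /* slot count, a power of two */
        Py_ssize_t dk_usable;              /* insertions left before a resize */
        DictKeyEntrySketch dk_entries[1];  /* variable-sized in the real code */
    } DictKeysSketch;

    typedef struct {                       /* cf. PyDictObject */
        PyObject_HEAD
        Py_ssize_t ma_used;                /* number of items in the dict */
        DictKeysSketch *ma_keys;           /* keys & hashes, possibly shared */
        PyObject **ma_values;              /* NULL for a combined table; otherwise
                                              the per-dict values array */
    } DictSketch;

A combined table keeps each value in its entry's me_value slot and leaves
ma_values NULL; a split table leaves every me_value NULL and stores values in
the separate ma_values array, indexed by the slot of the corresponding key.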
@@ -126,8 +90,8 @@ __iter__(), iterkeys(), iteritems(), itervalues(), and update().
 Also, every dictionary iterates at least twice, once for the memset()
 when it is created and once by dealloc().
 
-Dictionary operations involving only a single key can be O(1) unless 
-resizing is possible.  By checking for a resize only when the 
+Dictionary operations involving only a single key can be O(1) unless
+resizing is possible.  By checking for a resize only when the
 dictionary can grow (and may *require* resizing), other operations
 remain O(1), and the odds of resize thrashing or memory fragmentation
 are reduced. In particular, an algorithm that empties a dictionary
@@ -135,136 +99,51 @@ by repeatedly invoking .pop will see no resizing, which might
 not be necessary at all because the dictionary is eventually
 discarded entirely.
 
+The key differences between this implementation and earlier versions are:
+    1. The table can be split into two parts, the keys and the values.
+
+    2. There is an additional key-value combination: (key, NULL).
+       Unlike (<dummy>, NULL), which represents a deleted value, (key, NULL)
+       represents a yet-to-be-inserted value.  This combination can only occur
+       when the table is split.
+
+    3. No small table embedded in the dict,
+       as this would make sharing of key-tables impossible.
+
+
+These changes have the following consequences.
+   1. General dictionaries are slightly larger.
+
+   2. All object dictionaries of a single class can share a single key-table,
+      saving about 60% of the memory for such cases.
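As a rough illustration of points 1 and 2 above, the fragment below shows how
a lookup might proceed when the table is split, reusing the sketch types from
the earlier note.  It is illustrative only: the real split-table lookup
(lookdict_split() in Objects/dictobject.c) also compares hashes, falls back to
rich comparison, handles dummy entries, and uses the perturb-based probe
sequence rather than simple linear probing.

    static PyObject *
    split_lookup_sketch(DictSketch *mp, PyObject *key, Py_hash_t hash)
    {
        DictKeysSketch *keys = mp->ma_keys;          /* possibly shared */
        size_t mask = (size_t)keys->dk_size - 1;
        size_t i = (size_t)hash & mask;

        while (keys->dk_entries[i].me_key != NULL) {
            if (keys->dk_entries[i].me_key == key) {   /* identity only */
                /* Values live in the per-dict ma_values array, not in the
                   shared keys object.  A NULL here is the (key, NULL) case:
                   the key exists in the shared table, but this particular
                   dict has not stored a value for it yet. */
                return mp->ma_values[i];
            }
            i = (i + 1) & mask;     /* simplified linear probe */
        }
        return NULL;                /* key not present in the key-table */
    }

Because every instance dict of a class can point its ma_keys at the same
shared keys object and allocate only its own ma_values array, the
per-instance overhead shrinks to roughly one pointer per attribute, which is
the source of the saving mentioned above.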
 
 Results of Cache Locality Experiments
--------------------------------------
-
-When an entry is retrieved from memory, 4.333 adjacent entries are also
-retrieved into a cache line.  Since accessing items in cache is *much*
-cheaper than a cache miss, an enticing idea is to probe the adjacent
-entries as a first step in collision resolution.  Unfortunately, the
-introduction of any regularity into collision searches results in more
-collisions than the current random chaining approach.
-
-Exploiting cache locality at the expense of additional collisions fails
-to payoff when the entries are already loaded in cache (the expense
-is paid with no compensating benefit).  This occurs in small dictionaries
-where the whole dictionary fits into a pair of cache lines.  It also
-occurs frequently in large dictionaries which have a common access pattern
-where some keys are accessed much more frequently than others.  The
-more popular entries *and* their collision chains tend to remain in cache.
-
-To exploit cache locality, change the collision resolution section
-in lookdict() and lookdict_string().  Set i^=1 at the top of the
-loop and move the  i = (i << 2) + i + perturb + 1 to an unrolled
-version of the loop.
-
-This optimization strategy can be leveraged in several ways:
-
-* If the dictionary is kept sparse (through the tunable parameters),
-then the occurrence of additional collisions is lessened.
-
-* If lookdict() and lookdict_string() are specialized for small dicts
-and for largedicts, then the versions for large_dicts can be given
-an alternate search strategy without increasing collisions in small dicts
-which already have the maximum benefit of cache locality.
-
-* If the use case for a dictionary is known to have a random key
-access pattern (as opposed to a more common pattern with a Zipf's law
-distribution), then there will be more benefit for large dictionaries
-because any given key is no more likely than another to already be
-in cache.
-
-* In use cases with paired accesses to the same key, the second access
-is always in cache and gets no benefit from efforts to further improve
-cache locality.
-
-Optimizing the Search of Small Dictionaries
--------------------------------------------
-
-If lookdict() and lookdict_string() are specialized for smaller dictionaries,
-then a custom search approach can be implemented that exploits the small
-search space and cache locality.
-
-* The simplest example is a linear search of contiguous entries.  This is
-  simple to implement, guaranteed to terminate rapidly, never searches
-  the same entry twice, and precludes the need to check for dummy entries.
-
-* A more advanced example is a self-organizing search so that the most
-  frequently accessed entries get probed first.  The organization
-  adapts if the access pattern changes over time.  Treaps are ideally
-  suited for self-organization with the most common entries at the
-  top of the heap and a rapid binary search pattern.  Most probes and
-  results are all located at the top of the tree allowing them all to
-  be located in one or two cache lines.
-
-* Also, small dictionaries may be made more dense, perhaps filling all
-  eight cells to take the maximum advantage of two cache lines.
-
-
-Strategy Pattern
-----------------
-
-Consider allowing the user to set the tunable parameters or to select a
-particular search method.  Since some dictionary use cases have known
-sizes and access patterns, the user may be able to provide useful hints.
-
-1) For example, if membership testing or lookups dominate runtime and memory
-   is not at a premium, the user may benefit from setting the maximum load
-   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
-   curtail the number of collisions but will increase iteration time.
-   The builtin namespace is a prime example of a dictionary that can
-   benefit from being highly sparse.
-
-2) Dictionary creation time can be shortened in cases where the ultimate
-   size of the dictionary is known in advance.  The dictionary can be
-   pre-sized so that no resize operations are required during creation.
-   Not only does this save resizes, but the key insertion will go
-   more quickly because the first half of the keys will be inserted into
-   a more sparse environment than before.  The preconditions for this
-   strategy arise whenever a dictionary is created from a key or item
-   sequence and the number of *unique* keys is known.
-
-3) If the key space is large and the access pattern is known to be random,
-   then search strategies exploiting cache locality can be fruitful.
-   The preconditions for this strategy arise in simulations and
-   numerical analysis.
-
-4) If the keys are fixed and the access pattern strongly favors some of
-   the keys, then the entries can be stored contiguously and accessed
-   with a linear search or treap.  This exploits knowledge of the data,
-   cache locality, and a simplified search routine.  It also eliminates
-   the need to test for dummy entries on each probe.  The preconditions
-   for this strategy arise in symbol tables and in the builtin dictionary.
-
-
-Readonly Dictionaries
----------------------
-Some dictionary use cases pass through a build stage and then move to a
-more heavily exercised lookup stage with no further changes to the
-dictionary.
-
-An idea that emerged on python-dev is to be able to convert a dictionary
-to a read-only state.  This can help prevent programming errors and also
-provide knowledge that can be exploited for lookup optimization.
-
-The dictionary can be immediately rebuilt (eliminating dummy entries),
-resized (to an appropriate level of sparseness), and the keys can be
-jostled (to minimize collisions).  The lookdict() routine can then
-eliminate the test for dummy entries (saving about 1/4 of the time
-spent in the collision resolution loop).
-
-An additional possibility is to insert links into the empty spaces
-so that dictionary iteration can proceed in len(d) steps instead of
-(mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be
-kept just for iteration.
-
-
-Caching Lookups
----------------
-The idea is to exploit key access patterns by anticipating future lookups
-based on previous lookups.
-
-The simplest incarnation is to save the most recently accessed entry.
-This gives optimal performance for use cases where every get is followed
-by a set or del to the same key.
+--------------------------------------
+
+Experiments on an earlier design of dictionary, in which all tables were
+combined, showed the following:
+
+  When an entry is retrieved from memory, several adjacent entries are also
+  retrieved into a cache line.  Since accessing items in cache is *much*
+  cheaper than a cache miss, an enticing idea is to probe the adjacent
+  entries as a first step in collision resolution.  Unfortunately, the
+  introduction of any regularity into collision searches results in more
+  collisions than the current random chaining approach.
+
+  Exploiting cache locality at the expense of additional collisions fails
+  to pay off when the entries are already loaded in cache (the expense
+  is paid with no compensating benefit).  This occurs in small dictionaries
+  where the whole dictionary fits into a pair of cache lines.  It also
+  occurs frequently in large dictionaries which have a common access pattern
+  where some keys are accessed much more frequently than others.  The
+  more popular entries *and* their collision chains tend to remain in cache.
+
+  To exploit cache locality, change the collision resolution section
+  in lookdict() and lookdict_string().  Set i^=1 at the top of the
+  loop and move the  i = (i << 2) + i + perturb + 1 to an unrolled
+  version of the loop.
+
+For split tables, the above will apply to the keys, but the value will
+always be in a different cache line from the key.
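For reference, the recurrence quoted above, i = (i << 2) + i + perturb + 1, is
the probe step used by the lookup routines in Objects/dictobject.c.  Below is
a minimal sketch of one probing step; PERTURB_SHIFT is 5 in the real code, the
helper name is invented for this note, and the exact point at which perturb is
shifted differs slightly between CPython versions.

    #include <stddef.h>     /* size_t */

    #define PERTURB_SHIFT 5

    /* One step of the open-addressing walk: i <- i*5 + perturb + 1, reduced
       modulo the table size.  Shifting perturb down feeds the high bits of
       the hash into later probes, which breaks up collision patterns. */
    static size_t
    next_probe_sketch(size_t i, size_t *perturb, size_t mask)
    {
        i = (i << 2) + i + *perturb + 1;
        *perturb >>= PERTURB_SHIFT;
        return i & mask;
    }

The i^=1 pairing tried in the experiments is not part of this recurrence; as
noted above, introducing that regularity tended to increase collisions.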
