====================================================
NEP 26 — Summary of Missing Data NEPs and discussion
====================================================

:Author: Mark Wiebe <mwwiebe@gmail.com>, Nathaniel J. Smith <njs@pobox.com>
:Status: Deferred
:Type: Standards Track
:Created: 2012-04-22

*Context: this NEP was written as a summary of the large number of
discussions and proposals (NEP 12, NEP 24, NEP 25) regarding missing data
functionality.*

The debate about how NumPy should handle missing data, a subject with
many preexisting approaches, requirements, and conventions, has been long and
contentious. There has been more than one proposal for how to implement
missing data support in NumPy, and there is a testable implementation
merged into NumPy's current master. The vast number of emails and differing
points of view have made it difficult for interested parties to understand
the issues and be comfortable with the direction NumPy is going.

Here is our (Mark and Nathaniel's) attempt to summarize the
problem, proposals, and points of agreement/disagreement in a single
place, to help the community move towards consensus.

The NumPy developers' problem
=============================

For this discussion, "missing data" means array elements
which can be indexed (e.g. A[3] in an array A with shape (5,)),
but have, in some sense, no value.

It does not refer to compressed or sparse storage techniques where
the value for A[3] is not actually stored in memory, but still has a
well-defined value like 0.

This is still vague, and to create an actual implementation,
it is necessary to answer such questions as:

* What values are computed when doing element-wise ufuncs.
* What values are computed when doing reductions.
* Whether the storage for an element gets overwritten when marking
  that value missing.
* Whether computations resulting in NaN are automatically treated in
  the same way as a missing value.
* Whether one interacts with missing values using a placeholder object
  (e.g. called "NA" or "masked"), or through a separate boolean array.
* Whether there is such a thing as an array object that cannot hold
  missing array elements.
* How the (C and Python) API is expressed, in terms of dtypes,
  masks, and other constructs.
* If we decide to answer some of these questions in multiple ways,
  then that creates the question of whether that requires multiple
  systems, and if so how they should interact.

There's clearly a very large space of missing-data APIs that *could*
be implemented. There is likely at least one user, somewhere, who
would find any possible implementation to be just the thing they
need to solve some problem. On the other hand, much of NumPy's power
and clarity comes from having a small number of orthogonal concepts,
such as strided arrays, flexible indexing, broadcasting, and ufuncs,
and we'd like to preserve that simplicity.

There has been dissatisfaction among several major groups of NumPy users
about the status quo of missing data support. In particular,
neither the numpy.ma component nor use of floating-point NaNs as a
missing data signal fully satisfies the performance requirements and
ease of use for these users.
The example of R, where missing data is treated via an NA placeholder
and is deeply integrated into all computation, is the one many of these
users point to when indicating what functionality they would like.
A deep integration of missing data along the lines of R must be
considered carefully; it must be clear that it is not being done in a
way which sacrifices existing performance or functionality.

Our problem is: how can we choose some incremental additions to
NumPy that will make a large class of users happy, be
reasonably elegant, complement the existing design, and that we're
comfortable we won't regret being stuck with in the long term?

Prior art
=========

So a major (maybe *the* major) problem is figuring out how ambitious
the project to add missing data support to NumPy should be, and which
kinds of problems are in scope. Let's start with the
best understood situation where "missing data" comes into play:

"Statistical missing data"
--------------------------

In statistics, social science, etc., "missing data" is a term of art
referring to a specific (but extremely common and important)
situation: we have tried to gather some measurements according to some
scheme, but some of these measurements are missing. For example, if we
have a table listing the height, age, and income of a number of
individuals, but one person did not provide their income, then we need
some way to represent this::

    Person | Height | Age | Income
    ------------------------------
       1   |   63   |  25 |  15000
       2   |   58   |  32 | <missing>
       3   |   71   |  45 |  30000

The traditional way is to record that income as, say, "-99", and
document this in the README along with the data set. Then, you have to
remember to check for and handle such incomes specially; if you
forget, you'll get superficially reasonable but completely incorrect
results, like calculating the average income on this data set as
14967. If you're in one of these fields, then such missing-ness is
routine and inescapable, and if you use the "-99" approach then it's a
pitfall you have to remember to check for explicitly on literally
*every* calculation you ever do. This is, obviously, an unpleasant way
to live.
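
To see the pitfall concretely, here is a minimal illustration in
today's NumPy (the numbers are the hypothetical incomes from the table
above)::

    import numpy as np

    income = np.array([15000, -99, 30000])  # -99 is the documented sentinel

    income.mean()                 # 14967.0 -- plausible-looking, but wrong
    income[income != -99].mean()  # 22500.0 -- right, but only if you
                                  # remember the filter every single time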

Let's call this situation the "statistical missing data" situation,
just to have a convenient handle for it. (As mentioned, practitioners
just call this "missing data", and what to do about it is literally an
entire sub-field of statistics; if you google "missing data" then
every reference is on how to handle it.) NumPy isn't going to do
automatic imputation or anything like that, but it could help a great
deal by providing some standard way to at least represent data which
is missing in this sense.

The main prior art for how this could be done comes from the S/S+/R
family of languages. Their strategy is, for each type they support,
to define a special value called "NA". (For ints this is INT_MIN,
for floats it's a special NaN value that's distinguishable from
other NaNs, ...) Then, they arrange that in computations, this
value has special semantics that we will call "NA semantics".

NA Semantics
------------

The idea of NA semantics is that any computations involving NA
values should be consistent with what would have happened if we
had known the correct value.

For example, let's say we want to compute the mean income; how might
we do this? One way would be to just ignore the missing entry, and
compute the mean of the remaining entries. This gives us (15000 +
30000)/2, or 22500.

Is this result consistent with discovering the income of person 2?
Let's say we find out that person 2's income is 50000. This means
the correct answer is (15000 + 50000 + 30000)/3, or 31666.67, which
is clearly different; ignoring the missing entry was not consistent.
Therefore, the mean income is NA, i.e. a specific number whose value
we are unable to compute.

This motivates the following rules, which are how R implements NA (a
toy Python sketch follows the list):

Assignment:
  NA values are understood to represent specific
  unknown values, and thus should have value-like semantics with
  respect to assignment and other basic data manipulation
  operations. Code which does not actually look at the values involved
  should work the same regardless of whether some of them are
  missing. For example, one might write::

    income[:] = income[np.argsort(height)]

  to perform an in-place sort of the ``income`` array, and know that
  the shortest person's income would end up being first. It turns out
  that the shortest person's income is not known, so the array should
  end up being ``[NA, 15000, 30000]``, but there's nothing
  special about NAness here.

Propagation:
  In the example above, we concluded that an operation like ``mean``
  should produce NA when one of its data values is NA.
  If you ask me, "what is 3 plus x?", then my only possible answer is
  "I don't know what x is, so I don't know what 3 + x is either". NA
  means "I don't know", so 3 + NA is NA.

  This is important for safety when analyzing data: missing data often
  requires special handling for correctness -- the fact that you are
  missing information might mean that something you wanted to compute
  cannot actually be computed, and there are whole books written on
  how to compensate in various situations. Plus, it's easy to not
  realize that you have missing data, and write code that assumes you
  have all the data. Such code should not silently produce the wrong
  answer.

  There is an important exception to characterizing this as propagation,
  in the case of boolean values. Consider the calculation::

    v = np.any([False, False, NA, True])

  If we strictly propagate, ``v`` will become NA. However, no
  matter whether we place True or False into the third array position,
  ``v`` will then get the value True. The answer to the question
  "Is the result True consistent with later discovering the value
  that was missing?" is yes, so it is reasonable to not propagate here,
  and instead return the value True. This is what R does::

    > any(c(F, F, NA, T))
    [1] TRUE
    > any(c(F, F, NA, F))
    [1] NA

Other:
  NaN and NA are conceptually distinct. 0.0/0.0 is not a mysterious,
  unknown value -- it's defined to be NaN by IEEE floating point: Not
  a Number. NAs are numbers (or strings, or whatever), just unknown
  ones. Another small but important difference is that in Python, ``if
  NaN: ...`` treats NaN as True (NaN is "truthy"); but ``if NA: ...``
  would be an error.

  In R, all reduction operations implement an alternative semantics,
  activated by passing a special argument (``na.rm=TRUE``).
  ``sum(a)`` means "give me the sum of all the
  values" (which is NA if some of the values are NA);
  ``sum(a, na.rm=TRUE)`` means "give me the sum of all the non-NA
  values".
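
These rules can be made concrete with a toy sketch in Python. To be
clear, the ``_NAType`` class and ``any_na`` helper below are invented
for illustration; they are not part of NumPy or of any proposal::

    class _NAType:
        """Toy placeholder with R-style NA semantics (illustration only)."""
        def __repr__(self):
            return "NA"
        def __add__(self, other):
            return self               # 3 + NA is NA: the sum is unknown
        __radd__ = __add__
        def __bool__(self):
            raise TypeError("NA is neither True nor False")

    NA = _NAType()

    def any_na(values):
        """``any`` with NA semantics: a definite True beats an NA."""
        saw_na = False
        for v in values:
            if v is NA:
                saw_na = True
            elif v:
                return True           # True no matter what the NA hides
        return NA if saw_na else False

    3 + NA                              # NA -- propagation
    any_na([False, False, NA, True])    # True -- the boolean exception
    any_na([False, False, NA, False])   # NA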

Other prior art
---------------

Once we move beyond the "statistical missing data" case, the correct
behavior for missing data becomes less clearly defined. There are many
cases where specific elements are singled out to be treated specially
or excluded from computations, and these could often be conceptualized
as involving 'missing data' in some sense.

In image processing, it's common to use a single image together with
one or more boolean masks to e.g. composite subsets of an image. As
Joe Harrington pointed out on the list, in the context of processing
astronomical images, it's also common to generalize to a
floating-point valued mask, or alpha channel, to indicate degrees of
"missingness". We think this is out of scope for the present design,
but it is an important use case, and ideally NumPy should support
natural ways of manipulating such data.

After R, numpy.ma is probably the most mature source of
experience on missing-data-related APIs. Its design is quite different
from R; it uses different semantics -- reductions skip masked values
by default and NaNs convert to masked -- and it uses a different
storage strategy, via a separate mask. While it seems to be generally
considered sub-optimal for general use, it's hard to pin down whether
this is because the API is immature but basically good, or the API
is fundamentally broken, or the API is great but the code should be
faster, or what. We looked at some of its users to try and get a
better idea.

Matplotlib is perhaps the best known package to rely on numpy.ma. It
seems to use it in two ways. One is as a way for users to indicate
what data is missing when passing it to be graphed. (Other ways are
also supported, e.g., passing in NaN values gives the same result.) In
this regard, matplotlib treats np.ma.masked and NaN values in the same way
that R's plotting routines handle NA and NaN values. For these purposes,
matplotlib doesn't really care what semantics or storage strategy is
used for missing data.

Internally, matplotlib uses numpy.ma arrays to store and pass around
separately computed boolean masks containing 'validity' information
for each input array, in a cheap and non-destructive fashion. Mark's
impression from some shallow code review is that it mostly works
directly with the data and mask attributes of the masked arrays,
without extensively using the particular computational semantics of
numpy.ma. So, for this usage matplotlib does rely on the non-destructive
mask-based storage, but this doesn't say much about what semantics
are needed.

Paul Hobson `posted some code`__ on the list that uses numpy.ma for
storing arrays of contaminant concentration measurements. Here the
mask indicates whether the corresponding number represents an actual
measurement, or just the estimated detection limit for a concentration
which was too small to detect. Nathaniel's impression from reading
through this code is that it also mostly uses the .data and .mask
attributes in preference to performing operations on the MaskedArray
directly.

__ https://mail.scipy.org/pipermail/numpy-discussion/2012-April/061743.html

So, these examples make it clear that there is demand for a convenient
way to keep a data array and a mask array (or even a floating point
array) bundled up together and "aligned". But they don't tell us much
about what semantics the resulting object should have with respect to
ufuncs and friends.
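
For reference, the usage pattern described in the last few paragraphs
looks roughly like this with today's numpy.ma (the concentration
numbers are invented for illustration)::

    import numpy as np

    conc = np.ma.masked_array(
        data=[0.5, 0.1, 2.3],       # measurement, or detection limit
        mask=[False, True, False],  # True: below the detection limit
    )

    conc.mean()   # 1.4 -- numpy.ma reductions skip masked values
    conc.data     # array([0.5, 0.1, 2.3]) -- values survive behind the mask
    conc.mask     # array([False,  True, False])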

Semantics, storage, API, oh my!
===============================

We think it's useful to draw a clear line between use cases,
semantics, and storage. Use cases are situations that users encounter,
regardless of what NumPy does; they're the focus of the previous
section. When we say *semantics*, we mean the result of different
operations as viewed from the Python level, without regard to the
underlying implementation.

*NA semantics* are the ones described above and used by R::

    1 + NA = NA
    sum([1, 2, NA]) = NA
    NA | False = NA
    NA | True = True

With ``na.rm=TRUE`` or ``skipNA=True``, this switches to::

    1 + NA = illegal  # in R, only reductions take the na.rm argument
    sum([1, 2, NA], skipNA=True) = 3

There's also been discussion of what we'll call *ignore
semantics*. These are somewhat underdefined::

    sum([1, 2, IGNORED]) = 3
    # Several options here:
    1 + IGNORED = 1
    # or
    1 + IGNORED = <leaves output array untouched>
    # or
    1 + IGNORED = IGNORED

The numpy.ma semantics are::

    sum([1, 2, masked]) = 3
    1 + masked = masked

If either NA or ignore semantics are implemented with masks, then there
is a choice of what should be done to the value in the storage
for an array element which gets assigned a missing value. Three
possibilities are:

* Leave that memory untouched (the choice made in the NEP).
* Do the calculation with the values independently of the mask
  (perhaps the most useful option for Paul Hobson's use-case above).
* Copy whatever value is stored behind the input missing value into
  the output (this is what numpy.ma does; even that is ambiguous in
  the case of ``masked + masked`` -- here numpy.ma copies the
  value stored behind the leftmost masked value).

When we talk about *storage*, we mean the debate about whether missing
values should be represented by designating a particular value of the
underlying data-type (the *bitpattern dtype* option, as used in R), or
by using a separate *mask* stored alongside the data itself.

For mask-based storage, there is also an important question about what
the API looks like for accessing the mask, modifying the mask, and
"peeking behind" the mask.

Designs that have been proposed
===============================

One option is to just copy R, by implementing a mechanism whereby
dtypes can arrange for certain bitpatterns to be given NA semantics.

One option is to copy numpy.ma closely, but with a more optimized
implementation. (Or to simply optimize the existing implementation.)

One option is that described in the NEP_, for which an implementation
of mask-based missing data exists. This system is roughly:

.. _NEP: https://github.com/numpy/numpy/blob/master/doc/neps/nep-0012-missing-data.rst

* There is both bitpattern and mask-based missing data, and both
  have identical interoperable NA semantics.
* Masks are modified by assigning np.NA or values to array elements.
  The way to peek behind the mask or to unmask values is to keep a
  view of the array that shares the data pointer but not the mask pointer.
* Mark would like to add a way to access and manipulate the mask more
  directly, to be used in addition to this view-based API.
* If an array has both a bitpattern dtype and a mask, then assigning
  np.NA writes to the mask, rather than to the array itself. Writing
  a bitpattern NA to an array which supports both requires accessing
  the data by "peeking under the mask". A sketch of the view-based API
  follows this list.
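
As a sketch of that interaction, using the ``maskna``, ``ownmaskna``,
and ``np.NA`` spellings from NEP 12 (this API only ever existed in the
experimental maskna branch, never in a NumPy release, so treat it as
pseudocode rather than a working session)::

    >>> a = np.array([1.0, 2.0, 3.0], maskna=True)
    >>> peek = a.view(ownmaskna=True)  # shares the data, not the mask
    >>> a[0] = np.NA                   # masks the element in a...
    >>> a[0]
    NA
    >>> peek[0]                        # ...but the stored value survives
    1.0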

Another option is that described in the alterNEP_, which is to implement
bitpattern dtypes with NA semantics for the "statistical missing data"
use case, and to also implement a totally independent API for masked
arrays with ignore semantics and all mask manipulation done explicitly
through a .mask attribute.

.. _alterNEP: https://gist.github.com/njsmith/1056379

Another option would be to define a minimalist aligned array container
that holds multiple arrays and that can be used to pass them around
together. It would support indexing (to help with the common problem
of wanting to subset several arrays together without their becoming
unaligned), but all arithmetic etc. would be done by accessing the
underlying arrays directly via attributes. The "prior art" discussion
above suggests that something like this holding a .data and a .mask
array might actually solve a number of people's problems without
requiring any major architectural changes to NumPy. This is similar to
a structured array, but with each field in a separately stored array
instead of packed together. A sketch of such a container is shown below.
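
This is what such a container might look like (the class name and its
behavior here are invented for illustration; nothing like this is part
of NumPy)::

    import numpy as np

    class AlignedArrays:
        """Bundle of equal-shape arrays that are always indexed together."""

        def __init__(self, **arrays):
            if len({a.shape for a in arrays.values()}) != 1:
                raise ValueError("component arrays must have the same shape")
            self._names = tuple(arrays)
            self.__dict__.update(arrays)

        def __getitem__(self, index):
            # Apply one index to every component, keeping them aligned.
            return AlignedArrays(
                **{name: getattr(self, name)[index] for name in self._names})

    bundle = AlignedArrays(data=np.array([0.5, 0.1, 2.3]),
                           mask=np.array([True, False, True]))
    valid = bundle[bundle.mask]   # .data and .mask are subset together
    valid.data.sum()              # arithmetic goes through the attributes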

Several people have suggested that there should be a single system
that has multiple missing values that each have different semantics,
e.g., a MISSING value that has NA semantics, and a separate IGNORED
value that has ignore semantics.

None of these options are necessarily exclusive.

The debate
==========

We are both dubious of using ignore semantics as a default missing
data behavior. **Nathaniel** likes NA semantics because he is most
interested in the "statistical missing data" use case, and NA semantics
are exactly right for that. **Mark** isn't as interested in that use
case in particular, but he likes the NA computational abstraction
because it is unambiguous and well-defined in all cases, and has a
lot of existing experience to draw from.

What **Nathaniel** thinks, overall:

* The "statistical missing data" use case is clear and compelling; the
  other use cases certainly deserve our attention, but it's hard to say
  what they *are* exactly yet, or even if the best way to support them
  is by extending the ndarray object.
* The "statistical missing data" use case is best served by an R-style
  system that uses bitpattern storage to implement NA semantics. The
  main advantage of bitpattern storage for this use case is that it
  avoids the extra memory and speed overhead of storing and checking a
  mask (especially for the common case of floating point data, where
  some tricks with NaNs allow us to effectively hardware-accelerate
  most NA operations). These concerns alone appear to make a
  mask-based implementation unacceptable to many NA users,
  particularly in areas like neuroscience (where memory is tight) or
  financial modeling (where milliseconds are critical). In addition,
  the bitpattern approach is less confusing conceptually (e.g.,
  assignment really is just assignment, no magic going on behind the
  curtain), and it's possible to have in-memory compatibility with R
  for inter-language calls via rpy2. The main disadvantage of the
  bitpattern approach is the need to give up a value to represent NA,
  but this is not an issue for the most important data types (float,
  bool, strings, enums, objects); really, only integers are
  affected. And even for integers, giving up a value doesn't really
  matter for statistical problems. (Occupy Wall Street
  notwithstanding, no-one's income is 2**63 - 1. And if it were, we'd
  be switching to floats anyway to avoid overflow.)
* Adding new dtypes requires some cooperation with the ufunc and
  casting machinery, but doesn't require any architectural changes or
  violations of NumPy's current orthogonality.
* His impression from the mailing list discussion, esp. the `"what can
  we agree on?" thread`__, is that many numpy.ma users specifically
  like the combination of masked storage, the mask being easily
  accessible through the API, and ignore semantics. He could be
  wrong, of course. But he cannot remember seeing anybody besides Mark
  advocate for the specific combination of masked storage and NA
  semantics, which makes him nervous.

  __ http://thread.gmane.org/gmane.comp.python.numeric.general/46704
* Also, he personally is not very happy with the idea of having two
  storage implementations that are almost-but-not-quite identical at
  the Python level. While there likely are people who would like to
  temporarily pretend that certain data is "statistically missing
  data" without making a copy of their array, it's not at all clear
  that they outnumber the people who would like to use bitpatterns and
  masks simultaneously for distinct purposes. And honestly he'd like
  to be able to just ignore masks if he wants and stick to
  bitpatterns, which isn't possible if they're coupled together
  tightly in the API. So he would say the jury is still very much out
  on whether this aspect of the NEP design is an advantage or a
  disadvantage. (Certainly he's never heard of any R users complaining
  that they really wish they had the option of making a different
  trade-off here.)
* R's NA support is a `headline feature`__, and its target audience
  considers it a compelling advantage over other platforms like Matlab
  or Python. Working with statistical missing data is very painful
  without platform support.

  __ http://www.sr.bham.ac.uk/~ajrs/R/why_R.html
* By comparison, we clearly have much more uncertainty about the use
  cases that require a mask-based implementation, and it doesn't seem
  like people will suffer too badly if they are forced for now to
  settle for using NumPy's excellent mask-based indexing, the new
  where= support, and even numpy.ma.
* Therefore, bitpatterns with NA semantics seem to meet the criteria
  of making a large class of users happy, in an elegant way, that fits
  into the original design, and where we can have reasonable certainty
  that we understand the problem and use cases well enough that we'll
  be happy with them in the long run. But no mask-based storage
  proposal does, yet.

What **Mark** thinks, overall:

* The idea of using NA semantics by default for missing data, inspired
  by the "statistical missing data" problem, is better than all the
  other default behaviors which were considered. This applies equally
  to the bitpattern and the masked approach.

* For NA-style functionality to be properly supported by all NumPy
  features and eventually all third-party libraries, it needs to be
  in the core. How to correctly and efficiently handle missing data
  differs by algorithm, and if thinking about it is required to fully
  support NumPy, NA support will be broader and higher quality.

* At the same time, providing two different missing data interfaces,
  one for masks and one for bitpatterns, requires NumPy developers
  and third-party NumPy plugin developers to separately consider the
  question of what to do in either case, and to do two additional
  implementations of their code.
  This complicates their job, and could lead to inconsistent support
  for missing data.

* Providing the ability to work with both masks and bitpatterns through
  the same C and Python programming interface makes missing data support
  cleanly orthogonal with all other NumPy features.

* There are many trade-offs of memory usage, performance, correctness,
  and flexibility between masks and bitpatterns. Providing support for
  both approaches allows users of NumPy to choose the approach which is
  most compatible with their way of thinking, or which has the
  characteristics that best match their use-case. Providing them
  through the same interface further allows them to try both with
  minimal effort, and choose the one which performs better or uses
  less memory for their programs.

* Memory usage

  * With bitpatterns, less memory is used for storing a single array
    containing some NAs.

  * With masks, less memory is used for storing multiple arrays that
    are identical except for the location of their NAs. (In this case a
    single data array can be re-used with multiple mask arrays, whereas
    bitpattern NAs would require copying the whole data array; a sketch
    of this sharing pattern is shown after this list.)

* Performance

  * With bitpatterns, the floating point type can use native hardware
    operations, with nearly correct behavior. For fully correct
    floating point behavior, and with other types, code must be written
    which specially tests for equality with the missing-data
    bitpattern.

  * With masks, there is always the overhead of accessing mask memory
    and testing its truth value. The implementation that currently
    exists has had no performance tuning, so it is only good for
    judging a minimum performance level. Optimal mask-based code is in
    general going to be slower than optimal bitpattern-based code.

* Correctness

  * Bitpattern integer types must sacrifice a valid value to represent
    NA. For larger integer types, there are arguments that this is ok,
    but for 8-bit types there is no reasonable choice. In the floating
    point case, if the performance of native floating point operations
    is chosen, there is a small inconsistency in that NaN+NA and NA+NaN
    are different.

  * With masks, the behavior is correct in all cases.

* Generality

  * The bitpattern approach can work in a fully general way only when
    there is a specific value which can be given up from the
    data type. For IEEE floating point, a NaN is an obvious choice,
    and for booleans represented as a byte, there are plenty of choices.
    For integers, a valid value must be sacrificed to use this approach.
    Third-party dtypes which plug into NumPy will also have to
    make a bitpattern choice to support this system, something which
    may not always be possible.

  * The mask approach works universally with all data types.
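
To make the mask-side memory argument concrete, here is a small sketch
using today's numpy.ma rather than the NEP implementation (the data and
masks are invented for illustration)::

    import numpy as np

    data = np.arange(1000000, dtype=np.float64)  # one large data buffer

    # Two differently-masked arrays re-using the same buffer; a bitpattern
    # representation would need a full copy of the data per NA pattern.
    run1 = np.ma.masked_array(data, mask=(data % 2 == 0), copy=False)
    run2 = np.ma.masked_array(data, mask=(data < 10), copy=False)

    np.shares_memory(run1.data, data)  # True: the values were not copied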

Recommendations for moving forward
==================================

**Nathaniel** thinks we should:

* Go ahead and implement bitpattern NAs.
* *Not* implement masked arrays in the core -- or at least, not
  yet. Instead, we should focus on figuring out how to implement them
  out-of-core, so that people can try out different approaches without
  us committing to any one approach. And so new prototypes can be
  released more quickly than the NumPy release cycle. And anyway,
  we're going to have to figure out how to experiment with such
  changes out-of-core if NumPy is to continue to evolve without
  forking -- might as well do it now. The existing code can live in
  master, disabled, or it can live in a branch -- it'll still be there
  once we know what we're doing.

**Mark** thinks we should:

* Keep the existing code as it is, and add a global run-time
  experimental flag which disables NA support by default.

A more detailed rationale for this recommendation is:

* A solid preliminary NA-mask implementation is currently in NumPy
  master. This implementation has been extensively tested
  against scipy and other third-party packages, and has been in master
  in a stable state for a significant amount of time.
* This implementation integrates deeply with the core, providing an
  interface which is usable in the same way R's NA support is. It
  provides a compelling, user-friendly answer to R's NA support.
* The missing data NEP provides a plan for adding bitpattern-based
  dtype support of NAs, which will operate through the same interface
  but allow for the same performance/correctness trade-offs that R has
  made.
* Making it very easy for users to try out this implementation, which
  has reasonable feature coverage and performance characteristics, is
  the best way to get more concrete feedback about how NumPy's missing
  data support should look.

Because of its preliminary state, the existing implementation is marked
as experimental in the NumPy documentation. It would be good for this
to remain marked as experimental until it is more fleshed out, for
example supporting struct and array dtypes and a fuller set of
NumPy operations.

I think the code should stay as it is, except to add a run-time global
NumPy flag, perhaps numpy.experimental.maskna, which defaults to
False and can be toggled to True. In its default state, any NA feature
usage would raise an "ExperimentalError" exception, a measure which
would prevent it from being accidentally used and communicate its
experimental status very clearly.

The `ABI issues`__ seem very tricky to deal with effectively in the 1.x
series of releases, but I believe that with proper implementation-hiding
in a 2.0 release, evolving the software to support various other
ABI ideas that have been discussed is feasible. This is the approach
I like best.

__ http://thread.gmane.org/gmane.comp.python.numeric.general/49485

**Nathaniel** notes in response that he doesn't really have any
objection to shipping experimental APIs in the main numpy distribution
*if* we're careful to make sure that they don't "leak out" in a way
that leaves us stuck with them. And in principle some sort of "this
violates your warranty" global flag could be a way to do that. (In
fact, this might also be a useful strategy for the kinds of changes
that he favors, of adding minimal hooks to enable us to build
prototypes more easily -- we could have some "rapid prototyping only"
hooks that let prototype hacks get deeper access to NumPy's internals
than we would otherwise be ready to support.)

But, he wants to point out two things. First, it seems like we still
have fundamental questions to answer about the NEP design, like
whether masks should have NA semantics or ignore semantics, and there
are already plans to majorly change how NEP masks are exposed and
accessed. So he isn't sure what we'll learn by asking for feedback on
the NEP code in its current state.

And second, given the concerns about their causing (minor) ABI issues,
it's not clear that we could really prevent them from leaking out.
(He looks forward to 2.0 too, but we're not there yet.) So maybe it
would be better if they weren't present in the C API at all, and the
hoops required for testers were instead something like, 'we have
included a hacky pure-Python prototype accessible by typing "import
numpy.experimental.donttrythisathome.NEP" and would welcome feedback'?

If so, then he should mention that he did implement a horribly klugy,
pure-Python implementation of the NEP API that works with NumPy
1.6.1. This was mostly an experiment to see how feasible such
prototyping was, and to test out a possible ufunc override mechanism;
if there's interest, the module is available here:
https://github.com/njsmith/numpyNEP

It passes the maskna test-suite, with some minor issues described
in a big comment at the top.

**Mark** responds:

I agree that it's important to be careful when adding new
features to NumPy, but I also believe it is essential that the project
have forward development momentum. A project like NumPy requires
developers to write code for advancement to occur, and obstacles
that impede the writing of code discourage existing developers
from contributing more, and potentially scare away developers
who are thinking about joining in.

All software projects, both open source and closed source, must
strike a balance between short-term practicality and long-term
planning. In the case of the missing data development, there was a
short-term resource commitment to tackle this problem, which is quite
immense in scope. If there isn't a high likelihood of getting a
contribution into NumPy that concretely advances towards a solution, I
expect that individuals and companies interested in doing such work
will have a much harder time justifying a commitment of their
resources. For a project which is core to so many other libraries,
relying only on the good will of selfless volunteers would mean that
NumPy could more easily be overtaken by another project.

In the case of the existing NA contribution at issue, how we resolve
this disagreement represents a decision about how NumPy's
developers, contributors, and users should interact. If we create
a document describing a dispute resolution process, how do we
design it so that it doesn't introduce a large burden and excessive
uncertainty on developers that could prevent them from productively
contributing code?

If we go this route of writing up a decision process which includes
such a dispute resolution mechanism, I think the meat of it should
be a roadmap that potential contributors and developers can follow
to gain influence over NumPy. NumPy development needs broad support
beyond code contributions, and tying influence in the project to
contributions seems to me like it would be a good way to encourage
people to take on tasks like bug triaging/management, continuous
integration/build server administration, and the myriad other
tasks that help satisfy the project's needs. No specific meritocratic,
democratic, consensus-striving system will satisfy everyone, but the
vigour of the discussions around governance and process indicates that
something at least a little bit more formal than the current status
quo is necessary.

In conclusion, I would like the NumPy project to prioritize movement
towards a more flexible and modular ABI/API, balanced with strong
backwards-compatibility constraints and feature additions that
individuals, universities, and companies want to contribute.

I do not believe that keeping the NA code in 1.7 as it is, with the
small additional measure of requiring it to be enabled by an
experimental flag, poses a risk of long-term ABI troubles. The greater
risk I see is a continuing lack of developers contributing to the
project, and I believe that backing out this code because of these
worries would create a risk of reducing developer contributions.


References and Footnotes
------------------------

NEP 12 describes Mark's NA-semantics/mask-implementation/view-based
mask-handling API.

NEP 24 (the "alterNEP") was Nathaniel's initial attempt at separating
MISSING and IGNORED handling into bitpatterns versus masks, though
there's a bunch he would change about the proposal at this point.

NEP 25 ("miniNEP 2") was a later attempt by Nathaniel to sketch out an
implementation strategy for NA dtypes.

A further discussion overview page can be found at:
https://github.com/njsmith/numpy/wiki/NA-discussion-status


Copyright
---------

This document has been placed in the public domain.