-rw-r--r-- | doc/source/user/basics.io.genfromtxt.rst | 391 |
1 files changed, 239 insertions, 152 deletions
diff --git a/doc/source/user/basics.io.genfromtxt.rst b/doc/source/user/basics.io.genfromtxt.rst
index 82c7a661a..edf48bc15 100644
--- a/doc/source/user/basics.io.genfromtxt.rst
+++ b/doc/source/user/basics.io.genfromtxt.rst
@@ -7,32 +7,37 @@ Importing data with :func:`~numpy.genfromtxt`

 Numpy provides several functions to create arrays from tabular data.
 We focus here on the :func:`~numpy.genfromtxt` function.

-In a nutshell, :func:`~numpy.genfromtxt` runs two main loops.
-The first loop converts each line of the file in a sequence of strings.
-The second loop converts each string to the appropriate data type.
-This mechanism is slower than a single loop, but gives more flexibility.
-In particular, :func:`~numpy.genfromtxt` is able to take missing data into account, when other faster and simpler functions like :func:`~numpy.loadtxt` cannot
-
+In a nutshell, :func:`~numpy.genfromtxt` runs two main loops. The first
+loop converts each line of the file in a sequence of strings. The second
+loop converts each string to the appropriate data type. This mechanism is
+slower than a single loop, but gives more flexibility. In particular,
+:func:`~numpy.genfromtxt` is able to take missing data into account, when
+other faster and simpler functions like :func:`~numpy.loadtxt` cannot.

 .. note::
-   When giving examples, we will use the following conventions
-
-   >>> import numpy as np
-   >>> from StringIO import StringIO
+
+   When giving examples, we will use the following conventions::
+
+      >>> import numpy as np
+      >>> from StringIO import StringIO

 Defining the input
 ==================

-The only mandatory argument of :func:`~numpy.genfromtxt` is the source of the data.
-It can be a string corresponding to the name of a local or remote file, or a file-like object with a :meth:`read` method (such as an actual file or a :class:`StringIO.StringIO` object).
-If the argument is the URL of a remote file, this latter is automatically downloaded in the current directory.
+The only mandatory argument of :func:`~numpy.genfromtxt` is the source of
+the data. It can be a string corresponding to the name of a local or
+remote file, or a file-like object with a :meth:`read` method (such as an
+actual file or a :class:`StringIO.StringIO` object). If the argument is
+the URL of a remote file, this latter is automatically downloaded in the
+current directory.

-The input file can be a text file or an archive.
-Currently, the function recognizes :class:`gzip` and :class:`bz2` (`bzip2`) archives.
-The type of the archive is determined by examining the extension of the file:
-if the filename ends with ``'.gz'``, a :class:`gzip` archive is expected; if it ends with ``'bz2'``, a :class:`bzip2` archive is assumed.
+The input file can be a text file or an archive. Currently, the function
+recognizes :class:`gzip` and :class:`bz2` (`bzip2`) archives. The type of
+the archive is determined by examining the extension of the file: if the
+filename ends with ``'.gz'``, a :class:`gzip` archive is expected; if it
+ends with ``'bz2'``, a :class:`bzip2` archive is assumed.

@@ -42,24 +47,30 @@ Splitting the lines into columns
 The :keyword:`delimiter` argument
 ---------------------------------

-Once the file is defined and open for reading, :func:`~numpy.genfromtxt` splits each non-empty line into a sequence of strings.
-Empty or commented lines are just skipped.
-The :keyword:`delimiter` keyword is used to define how the splitting should take place.
+Once the file is defined and open for reading, :func:`~numpy.genfromtxt`
+splits each non-empty line into a sequence of strings. Empty or commented
+lines are just skipped. The :keyword:`delimiter` keyword is used to define
+how the splitting should take place.

-Quite often, a single character marks the separation between columns.
-For example, comma-separated files (CSV) use a comma (``,``) or a semicolon (``;``) as delimiter.
+Quite often, a single character marks the separation between columns. For
+example, comma-separated files (CSV) use a comma (``,``) or a semicolon
+(``;``) as delimiter::

     >>> data = "1, 2, 3\n4, 5, 6"
     >>> np.genfromtxt(StringIO(data), delimiter=",")
     array([[ 1., 2., 3.],
            [ 4., 5., 6.]])

-Another common separator is ``"\t"``, the tabulation character.
-However, we are not limited to a single character, any string will do.
-By default, :func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
+Another common separator is ``"\t"``, the tabulation character. However,
+we are not limited to a single character, any string will do. By default,
+:func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line
+is split along white spaces (including tabs) and that consecutive white
+spaces are considered as a single white space.

-Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters.
-In that case, we need to set :keyword:`delimiter` to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes).
+Alternatively, we may be dealing with a fixed-width file, where columns are
+defined as a given number of characters. In that case, we need to set
+:keyword:`delimiter` to a single integer (if all the columns have the same
+size) or to a sequence of integers (if columns can have different sizes)::

     >>> data = "  1  2  3\n  4  5 67\n890123  4"
     >>> np.genfromtxt(StringIO(data), delimiter=3)
@@ -76,29 +87,32 @@ In that case, we need to set :keyword:`delimiter` to a single integer (if all th
 The :keyword:`autostrip` argument
 ---------------------------------

-By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading nor trailing white spaces.
-This behavior can be overwritten by setting the optional argument :keyword:`autostrip` to a value of ``True``.
+By default, when a line is decomposed into a series of strings, the
+individual entries are not stripped of leading nor trailing white spaces.
+This behavior can be overwritten by setting the optional argument
+:keyword:`autostrip` to a value of ``True``::

     >>> data = "1, abc , 2\n 3, xxx, 4"
     >>> # Without autostrip
     >>> np.genfromtxt(StringIO(data), dtype="|S5")
     array([['1', ' abc ', ' 2'],
-           ['3', ' xxx', ' 4']],
+           ['3', ' xxx', ' 4']],
           dtype='|S5')
     >>> # With autostrip
     >>> np.genfromtxt(StringIO(data), dtype="|S5", autostrip=True)
     array([['1', 'abc', '2'],
-           ['3', 'xxx', '4']],
+           ['3', 'xxx', '4']],
           dtype='|S5')
-
+

 The :keyword:`comments` argument
 --------------------------------

-The optional argument :keyword:`comments` is used to define a character string that marks the beginning of a comment.
-By default, :func:`~numpy.genfromtxt` assumes ``comments='#'``.
-The comment marker may occur anywhere on the line.
-Any character present after the comment marker(s) is simply ignored.
+The optional argument :keyword:`comments` is used to define a character
+string that marks the beginning of a comment. By default,
+:func:`~numpy.genfromtxt` assumes ``comments='#'``. The comment marker may
+occur anywhere on the line. Any character present after the comment
+marker(s) is simply ignored::

     >>> data = """#
     ... # Skip me !
@@ -118,7 +132,9 @@ Any character present after the comment marker(s) is simply ignored.
      [ 9. 0.]]

 .. note::
-   There is one notable exception to this behavior: if the optional argument ``names=True``, the first commented line will be examined for names.
+
+   There is one notable exception to this behavior: if the optional argument
+   ``names=True``, the first commented line will be examined for names.
@@ -128,45 +144,54 @@ Skipping lines and choosing columns
 The :keyword:`skip_header` and :keyword:`skip_footer` arguments
 ---------------------------------------------------------------

-The presence of a header in the file can hinder data processing.
-In that case, we need to use the :keyword:`skip_header` optional argument.
-The values of this argument must be an integer which corresponds to the number of lines to skip at the beginning of the file, before any other action is performed.
-Similarly, we can skip the last ``n`` lines of the file by using the :keyword:`skip_footer` attribute and giving it a value of ``n``.
+The presence of a header in the file can hinder data processing. In that
+case, we need to use the :keyword:`skip_header` optional argument. The
+values of this argument must be an integer which corresponds to the number
+of lines to skip at the beginning of the file, before any other action is
+performed. Similarly, we can skip the last ``n`` lines of the file by
+using the :keyword:`skip_footer` attribute and giving it a value of ``n``::

     >>> data = "\n".join(str(i) for i in range(10))
     >>> np.genfromtxt(StringIO(data),)
     array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
-    >>> np.genfromtxt(StringIO(data),
+    >>> np.genfromtxt(StringIO(data),
     ...               skip_header=3, skip_footer=5)
     array([ 3., 4.])

-By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines are skipped.
+By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines
+are skipped.

 The :keyword:`usecols` argument
 -------------------------------

-In some cases, we are not interested in all the columns of the data but only a few of them.
-We can select which columns to import with the :keyword:`usecols` argument.
-This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import.
-Remember that by convention, the first column has an index of 0.
-Negative integers behave the same as regular Python negative indexes.
+In some cases, we are not interested in all the columns of the data but
+only a few of them. We can select which columns to import with the
+:keyword:`usecols` argument. This argument accepts a single integer or a
+sequence of integers corresponding to the indices of the columns to import.
+Remember that by convention, the first column has an index of 0. Negative
+integers behave the same as regular Python negative indexes.
+
+For example, if we want to import only the first and the last columns, we
+can use ``usecols=(0, -1)``::

-For example, if we want to import only the first and the last columns, we can use ``usecols=(0, -1)``:

     >>> data = "1 2 3\n4 5 6"
     >>> np.genfromtxt(StringIO(data), usecols=(0, -1))
     array([[ 1., 3.],
            [ 4., 6.]])

-If the columns have names, we can also select which columns to import by giving their name to the :keyword:`usecols` argument, either as a sequence of strings or a comma-separated string.
+If the columns have names, we can also select which columns to import by
+giving their name to the :keyword:`usecols` argument, either as a sequence
+of strings or a comma-separated string::
+
     >>> data = "1 2 3\n4 5 6"
     >>> np.genfromtxt(StringIO(data),
     ...               names="a, b, c", usecols=("a", "c"))
-    array([(1.0, 3.0), (4.0, 6.0)],
+    array([(1.0, 3.0), (4.0, 6.0)],
           dtype=[('a', '<f8'), ('c', '<f8')])
     >>> np.genfromtxt(StringIO(data),
     ...               names="a, b, c", usecols=("a, c"))
-    array([(1.0, 3.0), (4.0, 6.0)],
+    array([(1.0, 3.0), (4.0, 6.0)],
           dtype=[('a', '<f8'), ('c', '<f8')])

@@ -175,32 +200,40 @@ If the columns have names, we can also select which columns to import by giving
 Choosing the data type
 ======================

-The main way to control how the sequences of strings we have read from the file are converted to other types is to set the :keyword:`dtype` argument.
+The main way to control how the sequences of strings we have read from the
+file are converted to other types is to set the :keyword:`dtype` argument.
 Acceptable values for this argument are:

 * a single type, such as ``dtype=float``.
-  The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the :keyword:`names` argument (see below).
-  Note that ``dtype=float`` is the default for :func:`~numpy.genfromtxt`.
+  The output will be 2D with the given dtype, unless a name has been
+  associated with each column with the use of the :keyword:`names` argument
+  (see below). Note that ``dtype=float`` is the default for
+  :func:`~numpy.genfromtxt`.
 * a sequence of types, such as ``dtype=(int, float, float)``.
 * a comma-separated string, such as ``dtype="i4,f8,|S3"``.
 * a dictionary with two keys ``'names'`` and ``'formats'``.
-* a sequence of tuples ``(name, type)``, such as ``dtype=[('A', int), ('B', float)]``.
+* a sequence of tuples ``(name, type)``, such as
+  ``dtype=[('A', int), ('B', float)]``.
 * an existing :class:`numpy.dtype` object.
 * the special value ``None``.
-  In that case, the type of the columns will be determined from the data itself (see below).
-
-In all the cases but the first one, the output will be a 1D array with a structured dtype.
-This dtype has as many fields as items in the sequence.
+  In that case, the type of the columns will be determined from the data
+  itself (see below).
+
+In all the cases but the first one, the output will be a 1D array with a
+structured dtype. This dtype has as many fields as items in the sequence.
 The field names are defined with the :keyword:`names` keyword.

-When ``dtype=None``, the type of each column is determined iteratively from its data.
-We start by checking whether a string can be converted to a boolean (that is, if the string matches ``true`` or ``false`` in lower cases);
-then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string.
-This behavior may be changed by modifying the default mapper of the :class:`~numpy.lib._iotools.StringConverter` class.
+When ``dtype=None``, the type of each column is determined iteratively from
+its data. We start by checking whether a string can be converted to a
+boolean (that is, if the string matches ``true`` or ``false`` in lower
+cases); then whether it can be converted to an integer, then to a float,
+then to a complex and eventually to a string. This behavior may be changed
+by modifying the default mapper of the
+:class:`~numpy.lib._iotools.StringConverter` class.

-The option ``dtype=None`` is provided for convenience.
-However, it is significantly slower than setting the dtype explicitly.
+The option ``dtype=None`` is provided for convenience. However, it is
+significantly slower than setting the dtype explicitly.

@@ -210,83 +243,108 @@ Setting the names
 The :keyword:`names` argument
 -----------------------------

-A natural approach when dealing with tabular data is to allocate a name to each column.
-A first possibility is to use an explicit structured dtype, as mentioned previously.
+A natural approach when dealing with tabular data is to allocate a name to
+each column. A first possibility is to use an explicit structured dtype,
+as mentioned previously::

     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
-    array([(1, 2, 3), (4, 5, 6)],
+    array([(1, 2, 3), (4, 5, 6)],
           dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
-
-Another simpler possibility is to use the :keyword:`names` keyword with a sequence of strings or a comma-separated string.
+
+Another simpler possibility is to use the :keyword:`names` keyword with a
+sequence of strings or a comma-separated string::
+
     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, names="A, B, C")
-    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
+    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
           dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

 In the example above, we used the fact that by default, ``dtype=float``.
-By giving a sequence of names, we are forcing the output to a structured dtype.
+By giving a sequence of names, we are forcing the output to a structured
+dtype.

-We may sometimes need to define the column names from the data itself.
-In that case, we must use the :keyword:`names` keyword with a value of ``True``.
-The names will then be read from the first line (after the ``skip_header`` ones), even if the line is commented out.
+We may sometimes need to define the column names from the data itself. In
+that case, we must use the :keyword:`names` keyword with a value of
+``True``. The names will then be read from the first line (after the
+``skip_header`` ones), even if the line is commented out::

     >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, skip_header=1, names=True)
-    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
+    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
           dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

-The default value of :keyword:`names` is ``None``.
-If we give any other value to the keyword, the new names will overwrite the field names we may have defined with the dtype.
+The default value of :keyword:`names` is ``None``. If we give any other
+value to the keyword, the new names will overwrite the field names we may
+have defined with the dtype::

     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> ndtype=[('a',int), ('b', float), ('c', int)]
     >>> names = ["A", "B", "C"]
     >>> np.genfromtxt(data, names=names, dtype=ndtype)
-    array([(1, 2.0, 3), (4, 5.0, 6)],
+    array([(1, 2.0, 3), (4, 5.0, 6)],
           dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])

 The :keyword:`defaultfmt` argument
 ----------------------------------

-If ``names=None`` but a structured dtype is expected, names are defined with the standard NumPy default of ``"f%i"``, yielding names like ``f0``, ``f1`` and so forth.
+If ``names=None`` but a structured dtype is expected, names are defined
+with the standard NumPy default of ``"f%i"``, yielding names like ``f0``,
+``f1`` and so forth::
+
     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, dtype=(int, float, int))
-    array([(1, 2.0, 3), (4, 5.0, 6)],
+    array([(1, 2.0, 3), (4, 5.0, 6)],
           dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

-In the same way, if we don't give enough names to match the length of the dtype, the missing names will be defined with this default template.
+In the same way, if we don't give enough names to match the length of the
+dtype, the missing names will be defined with this default template::
+
     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, dtype=(int, float, int), names="a")
-    array([(1, 2.0, 3), (4, 5.0, 6)],
+    array([(1, 2.0, 3), (4, 5.0, 6)],
           dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

-We can overwrite this default with the :keyword:`defaultfmt` argument, that takes any format string:
+We can overwrite this default with the :keyword:`defaultfmt` argument, that
+takes any format string::
+
     >>> data = StringIO("1 2 3\n 4 5 6")
     >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
-    array([(1, 2.0, 3), (4, 5.0, 6)],
+    array([(1, 2.0, 3), (4, 5.0, 6)],
           dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

 .. note::
-   We need to keep in mind that ``defaultfmt`` is used only if some names are expected but not defined.
+
+   We need to keep in mind that ``defaultfmt`` is used only if some names
+   are expected but not defined.

 Validating names
 ----------------

-Numpy arrays with a structured dtype can also be viewed as :class:`~numpy.recarray`, where a field can be accessed as if it were an attribute.
-For that reason, we may need to make sure that the field name doesn't contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like ``size`` or ``shape``), which would confuse the interpreter.
-:func:`~numpy.genfromtxt` accepts three optional arguments that provide a finer control on the names:
+Numpy arrays with a structured dtype can also be viewed as
+:class:`~numpy.recarray`, where a field can be accessed as if it were an
+attribute. For that reason, we may need to make sure that the field name
+doesn't contain any space or invalid character, or that it does not
+correspond to the name of a standard attribute (like ``size`` or
+``shape``), which would confuse the interpreter. :func:`~numpy.genfromtxt`
+accepts three optional arguments that provide a finer control on the names:

 :keyword:`deletechars`
-   Gives a string combining all the characters that must be deleted from the name. By default, invalid characters are ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``.
+   Gives a string combining all the characters that must be deleted from
+   the name. By default, invalid characters are
+   ``~!@#$%^&*()-=+~\|]}[{';:
+   /?.>,<``.
 :keyword:`excludelist`
-   Gives a list of the names to exclude, such as ``return``, ``file``, ``print``...
-   If one of the input name is part of this list, an underscore character (``'_'``) will be appended to it.
+   Gives a list of the names to exclude, such as ``return``, ``file``,
+   ``print``... If one of the input name is part of this list, an
+   underscore character (``'_'``) will be appended to it.
 :keyword:`case_sensitive`
    Whether the names should be case-sensitive (``case_sensitive=True``),
-   converted to upper case (``case_sensitive=False`` or ``case_sensitive='upper'``) or to lower case (``case_sensitive='lower'``).
+   converted to upper case (``case_sensitive=False`` or
+   ``case_sensitive='upper'``) or to lower case
+   (``case_sensitive='lower'``).

@@ -296,46 +354,57 @@ Tweaking the conversion
 The :keyword:`converters` argument
 ----------------------------------

-Usually, defining a dtype is sufficient to define how the sequence of strings must be converted.
-However, some additional control may sometimes be required.
-For example, we may want to make sure that a date in a format ``YYYY/MM/DD`` is converted to a :class:`datetime` object, or that a string like ``xx%`` is properly converted to a float between 0 and 1.
-In such cases, we should define conversion functions with the :keyword:`converters` arguments.
+Usually, defining a dtype is sufficient to define how the sequence of
+strings must be converted. However, some additional control may sometimes
+be required. For example, we may want to make sure that a date in a format
+``YYYY/MM/DD`` is converted to a :class:`datetime` object, or that a string
+like ``xx%`` is properly converted to a float between 0 and 1. In such
+cases, we should define conversion functions with the :keyword:`converters`
+arguments.

-The value of this argument is typically a dictionary with column indices or column names as keys and a conversion functions as values.
-These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type.
+The value of this argument is typically a dictionary with column indices or
+column names as keys and a conversion functions as values. These
+conversion functions can either be actual functions or lambda functions. In
+any case, they should accept only a string as input and output only a
+single element of the wanted type.
+
+In the following example, the second column is converted from as string
+representing a percentage to a float between 0 and 1::

-In the following example, the second column is converted from as string representing a percentage to a float between 0 and 1

     >>> convertfunc = lambda x: float(x.strip("%"))/100.
     >>> data = "1, 2.3%, 45.\n6, 78.9%, 0"
     >>> names = ("i", "p", "n")
     >>> # General case .....
-    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
-    array([(1.0, nan, 45.0), (6.0, nan, 0.0)],
+    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
+    array([(1.0, nan, 45.0), (6.0, nan, 0.0)],
           dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

-We need to keep in mind that by default, ``dtype=float``.
-A float is therefore expected for the second column.
-However, the strings ``' 2.3%'`` and ``' 78.9%'`` cannot be converted to float and we end up having ``np.nan`` instead.
-Let's now use a converter.
+We need to keep in mind that by default, ``dtype=float``. A float is
+therefore expected for the second column. However, the strings ``' 2.3%'``
+and ``' 78.9%'`` cannot be converted to float and we end up having
+``np.nan`` instead. Let's now use a converter::

     >>> # Converted case ...
-    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
+    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
     ...               converters={1: convertfunc})
-    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
+    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
           dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

-The same results can be obtained by using the name of the second column (``"p"``) as key instead of its index (1).
+The same results can be obtained by using the name of the second column
+(``"p"``) as key instead of its index (1)::

     >>> # Using a name for the converter ...
-    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
+    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
     ...               converters={"p": convertfunc})
-    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
+    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
           dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

-Converters can also be used to provide a default for missing entries.
-In the following example, the converter ``convert`` transforms a stripped string into the corresponding float or into -999 if the string is empty.
-We need to explicitly strip the string from white spaces as it is not done by default.
+Converters can also be used to provide a default for missing entries. In
+the following example, the converter ``convert`` transforms a stripped
+string into the corresponding float or into -999 if the string is empty.
+We need to explicitly strip the string from white spaces as it is not done
+by default::

     >>> data = "1, , 3\n 4, 5, 6"
     >>> convert = lambda x: float(x.strip() or -999)
@@ -350,33 +419,42 @@ We need to explicitly strip the string from white spaces as it is not done by de
 Using missing and filling values
 --------------------------------

-Some entries may be missing in the dataset we are trying to import.
-In a previous example, we used a converter to transform an empty string into a float.
-However, user-defined converters may rapidly become cumbersome to manage.
+Some entries may be missing in the dataset we are trying to import. In a
+previous example, we used a converter to transform an empty string into a
+float. However, user-defined converters may rapidly become cumbersome to
+manage.

-The :func:`~nummpy.genfromtxt` function provides two other complementary mechanisms: the :keyword:`missing_values` argument is used to recognize missing data and a second argument, :keyword:`filling_values`, is used to process these missing data.
+The :func:`~nummpy.genfromtxt` function provides two other complementary
+mechanisms: the :keyword:`missing_values` argument is used to recognize
+missing data and a second argument, :keyword:`filling_values`, is used to
+process these missing data.

 :keyword:`missing_values`
 -------------------------

-By default, any empty string is marked as missing.
-We can also consider more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing or invalid data.
-The :keyword:`missing_values` argument accepts three kind of values:
+By default, any empty string is marked as missing. We can also consider
+more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing
+or invalid data. The :keyword:`missing_values` argument accepts three kind
+of values:

 a string or a comma-separated string
-   This string will be used as the marker for missing data for all the columns
+   This string will be used as the marker for missing data for all the
+   columns
 a sequence of strings
    In that case, each item is associated to a column, in order.
 a dictionary
-   Values of the dictionary are strings or sequence of strings.
-   The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key ``None`` can be used to define a default applicable to all columns.
+   Values of the dictionary are strings or sequence of strings. The
+   corresponding keys can be column indices (integers) or column names
+   (strings). In addition, the special key ``None`` can be used to
+   define a default applicable to all columns.

 :keyword:`filling_values`
 -------------------------

-We know how to recognize missing data, but we still need to provide a value for these missing entries.
-By default, this value is determined from the expected dtype according to this table:
+We know how to recognize missing data, but we still need to provide a value
+for these missing entries. By default, this value is determined from the
+expected dtype according to this table:

 ============= ==============
 Expected type Default
@@ -388,37 +466,43 @@ Expected type Default
 ``string``    ``'???'``
 ============= ==============

-We can get a finer control on the conversion of missing values with the :keyword:`filling_values` optional argument.
-Like :keyword:`missing_values`, this argument accepts different kind of values:
+We can get a finer control on the conversion of missing values with the
+:keyword:`filling_values` optional argument. Like
+:keyword:`missing_values`, this argument accepts different kind of values:

 a single value
    This will be the default for all columns
 a sequence of values
    Each entry will be the default for the corresponding column
 a dictionary
-   Each key can be a column index or a column name, and the corresponding value should be a single object.
-   We can use the special key ``None`` to define a default for all columns.
-
-In the following example, we suppose that the missing values are flagged with ``"N/A"`` in the first column and by ``"???"`` in the third column.
-We wish to transform these missing values to 0 if they occur in the first and second column, and to -999 if they occur in the last column.
-
->>> data = "N/A, 2, 3\n4, ,???"
->>> kwargs = dict(delimiter=",",
-...               dtype=int,
-...               names="a,b,c",
-...               missing_values={0:"N/A", 'b':" ", 2:"???"},
-...               filling_values={0:0, 'b':0, 2:-999})
->>> np.genfromtxt(StringIO.StringIO(data), **kwargs)
-array([(0, 2, 3), (4, 0, -999)],
-      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
+   Each key can be a column index or a column name, and the
+   corresponding value should be a single object. We can use the
+   special key ``None`` to define a default for all columns.
+
+In the following example, we suppose that the missing values are flagged
+with ``"N/A"`` in the first column and by ``"???"`` in the third column.
+We wish to transform these missing values to 0 if they occur in the first
+and second column, and to -999 if they occur in the last column::
+
+    >>> data = "N/A, 2, 3\n4, ,???"
+    >>> kwargs = dict(delimiter=",",
+    ...               dtype=int,
+    ...               names="a,b,c",
+    ...               missing_values={0:"N/A", 'b':" ", 2:"???"},
+    ...               filling_values={0:0, 'b':0, 2:-999})
+    >>> np.genfromtxt(StringIO.StringIO(data), **kwargs)
+    array([(0, 2, 3), (4, 0, -999)],
+          dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

 :keyword:`usemask`
 ------------------

-We may also want to keep track of the occurrence of missing data by constructing a boolean mask, with ``True`` entries where data was missing and ``False`` otherwise.
-To do that, we just have to set the optional argument :keyword:`usemask` to ``True`` (the default is ``False``).
-The output array will then be a :class:`~numpy.ma.MaskedArray`.
+We may also want to keep track of the occurrence of missing data by
+constructing a boolean mask, with ``True`` entries where data was missing
+and ``False`` otherwise. To do that, we just have to set the optional
+argument :keyword:`usemask` to ``True`` (the default is ``False``). The
+output array will then be a :class:`~numpy.ma.MaskedArray`.

 .. unpack=None, loose=True, invalid_raise=True)
@@ -427,8 +511,10 @@ The output array will then be a :class:`~numpy.ma.MaskedArray`.
 Shortcut functions
 ==================

-In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.io` module provides several convenience functions derived from :func:`~numpy.genfromtxt`.
-These functions work the same way as the original, but they have different default values.
+In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.io` module
+provides several convenience functions derived from
+:func:`~numpy.genfromtxt`. These functions work the same way as the
+original, but they have different default values.

 :func:`~numpy.ndfromtxt`
    Always set ``usemask=False``.
@@ -437,8 +523,9 @@ These functions work the same way as the original, but they have different defau
    Always set ``usemask=True``.
    The output is always a :class:`~numpy.ma.MaskedArray`
 :func:`~numpy.recfromtxt`
-   Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a :class:`~numpy.ma.MaskedRecords` array (if ``usemaske=True``).
-   The default dtype is ``dtype=None``, meaning that the types of each column will be automatically determined.
+   Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a
+   :class:`~numpy.ma.MaskedRecords` array (if ``usemaske=True``). The
+   default dtype is ``dtype=None``, meaning that the types of each column
+   will be automatically determined.
 :func:`~numpy.recfromcsv`
    Like :func:`~numpy.recfromtxt`, but with a default ``delimiter=","``.
-
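
Note on the patched introduction: the two-loop mechanism it describes (split every line into strings first, then convert every string) can be sketched in plain Python. This is only an illustration under stated assumptions, not NumPy's implementation. The function ``two_loop_parse`` and its parameters are hypothetical names chosen for this sketch, and it applies a single converter to all columns.

```python
def two_loop_parse(text, delimiter=None, comments="#",
                   converter=float, filling_value=float("nan")):
    """Hypothetical sketch of the two-loop idea; not NumPy's actual code."""
    # Loop 1: turn each non-empty, non-comment line into a list of strings.
    rows = []
    for line in text.splitlines():
        line = line.split(comments)[0]   # drop everything after the comment marker
        if not line.strip():
            continue                     # skip empty or fully commented lines
        rows.append([field.strip() for field in line.split(delimiter)])

    # Loop 2: convert each string, substituting filling_value for empty fields.
    result = []
    for row in rows:
        result.append([converter(field) if field else filling_value
                       for field in row])
    return result


data = "# header\n1, , 3\n4, 5, 6  # trailing comment\n"
print(two_loop_parse(data, delimiter=",", filling_value=-999.0))
```

Keeping the splitting pass separate from the conversion pass is what lets an empty field be recognized and filled before any conversion is attempted, which matches the flexibility-for-speed trade-off stated in the introduction.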