.. sectionauthor:: Pierre Gerard-Marchant <pierregmcode@gmail.com>

*********************************************
Importing data with :func:`~numpy.genfromtxt`
*********************************************

NumPy provides several functions to create arrays from tabular data.
We focus here on the :func:`~numpy.genfromtxt` function.

In a nutshell, :func:`~numpy.genfromtxt` runs two main loops.
The first loop converts each line of the file into a sequence of strings.
The second loop converts each string to the appropriate data type.
This mechanism is slower than a single loop, but gives more flexibility.
In particular, :func:`~numpy.genfromtxt` is able to take missing data into account, whereas faster and simpler functions like :func:`~numpy.loadtxt` cannot.
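
The two-pass idea can be pictured with a deliberately simplified sketch in plain Python.
This is not the actual implementation of :func:`~numpy.genfromtxt`: the helper name ``parse_table`` and its arguments are hypothetical, and the sketch only illustrates the split-then-convert structure with a default for empty fields.

```python
def parse_table(text, delimiter=",", converter=float, missing=float("nan")):
    """Hypothetical helper illustrating a split pass followed by a convert pass."""
    # First loop: split every non-empty line into a list of strings.
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    # Second loop: convert each string, substituting a default for empty fields.
    return [
        [converter(item) if item.strip() else missing for item in row]
        for row in rows
    ]


table = parse_table("1,2,3\n4,,6")
```

Keeping the two passes separate is what leaves room for per-column decisions, such as handling an empty field, before any conversion is attempted.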


.. note::
   When giving examples, we will use the following conventions:
   
   >>> import numpy as np
   >>> from StringIO import StringIO



Defining the input
==================

The only mandatory argument of :func:`~numpy.genfromtxt` is the source of the data.
It can be a string corresponding to the name of a local or remote file, or a file-like object with a :meth:`read` method (such as an actual file or a :class:`StringIO.StringIO` object).
If the argument is the URL of a remote file, the file is automatically downloaded to the current directory.

The input file can be a text file or an archive.
Currently, the function recognizes :class:`gzip` and :class:`bz2` (`bzip2`) archives.
The type of the archive is determined by examining the extension of the file:
if the filename ends with ``'.gz'``, a :class:`gzip` archive is expected; if it ends with ``'.bz2'``, a :class:`bz2` archive is assumed.
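
As a minimal sketch of reading from an archive, the snippet below writes a small table to a gzip file and reads it back by name.
It assumes Python 3 and a writable temporary directory; the file name ``example.csv.gz`` is made up for the illustration.

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name; only the '.gz' extension matters here.
    path = os.path.join(tmpdir, "example.csv.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("1,2,3\n4,5,6\n")

    # Passing the file name lets genfromtxt pick the gzip reader
    # from the extension.
    from_archive = np.genfromtxt(path, delimiter=",")
```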



Splitting the lines into columns
================================

The :keyword:`delimiter` argument
---------------------------------

Once the file is defined and open for reading, :func:`~numpy.genfromtxt` splits each non-empty line into a sequence of strings.
Empty or commented lines are just skipped.
The :keyword:`delimiter` keyword is used to define how the splitting should take place.

Quite often, a single character marks the separation between columns.
For example, comma-separated files (CSV) use a comma (``,``) or a semicolon (``;``) as delimiter.

   >>> data = "1, 2, 3\n4, 5, 6"
   >>> np.genfromtxt(StringIO(data), delimiter=",")
   array([[ 1.,  2.,  3.],
          [ 4.,  5.,  6.]])

Another common separator is ``"\t"``, the tabulation character.
However, we are not limited to a single character: any string will do.
By default, :func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
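
The three situations described above (a tab, a multi-character string, and the default white-space splitting) can be sketched as follows.
The snippet assumes Python 3, where ``StringIO`` is imported from :mod:`io`; the sample strings are invented for the illustration.

```python
from io import StringIO

import numpy as np

# Tab-separated values.
tab_data = "1\t2\t3\n4\t5\t6"
tabbed = np.genfromtxt(StringIO(tab_data), delimiter="\t")

# A delimiter made of several characters.
multi_data = "1::2::3\n4::5::6"
multi = np.genfromtxt(StringIO(multi_data), delimiter="::")

# Default delimiter=None: runs of spaces and tabs count as one separator.
space_data = "1   2 \t 3\n4 5      6"
spaced = np.genfromtxt(StringIO(space_data))
```

All three calls are expected to produce the same 2x3 array of floats.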

Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters.
In that case, we need to set :keyword:`delimiter` to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes).

   >>> data = "  1  2  3\n  4  5 67\n890123  4"
   >>> np.genfromtxt(StringIO(data), delimiter=3)
   array([[   1.,    2.,    3.],
          [   4.,    5.,   67.],
          [ 890.,  123.,    4.]])
   >>> data = "123456789\n   4  7 9\n   4567 9"
   >>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
   array([[ 1234.,   567.,    89.],
          [    4.,     7.,     9.],
          [    4.,   567.,     9.]])


The :keyword:`autostrip` argument
---------------------------------

By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces.
This behavior can be overridden by setting the optional argument :keyword:`autostrip` to a value of ``True``.

   >>> data = "1, abc , 2\n 3, xxx, 4"
   >>> # Without autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5")
   array([['1', ' abc ', ' 2'],
          ['3', ' xxx', ' 4']],
         dtype='|S5')
   >>> # With autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5", autostrip=True)
   array([['1', 'abc', '2'],
          ['3', 'xxx', '4']],
         dtype='|S5')
   

The :keyword:`comments` argument
--------------------------------

The optional argument :keyword:`comments` is used to define a character string that marks the beginning of a comment.
By default, :func:`~numpy.genfromtxt` assumes ``comments='#'``.
The comment marker may occur anywhere on the line.
Any character present after the comment marker(s) is simply ignored.

   >>> data = """#
   ... # Skip me !
   ... # Skip me too !
   ... 1, 2
   ... 3, 4
   ... 5, 6 #This is the third line of the data
   ... 7, 8
   ... # And here comes the last line
   ... 9, 0
   ... """
   >>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
   array([[ 1.,  2.],
          [ 3.,  4.],
          [ 5.,  6.],
          [ 7.,  8.],
          [ 9.,  0.]])

.. note::
   There is one notable exception to this behavior: if ``names=True`` is set, the first commented line will be examined for names.



Skipping lines and choosing columns
===================================

The :keyword:`skip_header` and :keyword:`skip_footer` arguments
---------------------------------------------------------------

The presence of a header in the file can hinder data processing.
In that case, we need to use the :keyword:`skip_header` optional argument.
The value of this argument must be an integer corresponding to the number of lines to skip at the beginning of the file, before any other action is performed.
Similarly, we can skip the last ``n`` lines of the file by using the :keyword:`skip_footer` argument and giving it a value of ``n``.

   >>> data = "\n".join(str(i) for i in range(10))
   >>> np.genfromtxt(StringIO(data))
   array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
   >>> np.genfromtxt(StringIO(data),
   ...               skip_header=3, skip_footer=5)
   array([ 3.,  4.])

By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines are skipped.


The :keyword:`usecols` argument
-------------------------------

In some cases, we are not interested in all the columns of the data but only a few of them.
We can select which columns to import with the :keyword:`usecols` argument.
This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import.
Remember that by convention, the first column has an index of 0.
Negative integers behave like regular Python negative indexes, counting from the last column.

For example, if we want to import only the first and the last columns, we can use ``usecols=(0, -1)``:

   >>> data = "1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data), usecols=(0, -1))
   array([[ 1.,  3.],
          [ 4.,  6.]])

If the columns have names, we can also select which columns to import by giving their name to the :keyword:`usecols` argument, either as a sequence of strings or a comma-separated string.

   >>> data = "1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a", "c"))
   array([(1.0, 3.0), (4.0, 6.0)], 
         dtype=[('a', '<f8'), ('c', '<f8')])
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a, c"))
   array([(1.0, 3.0), (4.0, 6.0)],
         dtype=[('a', '<f8'), ('c', '<f8')])




Choosing the data type
======================

The main way to control how the sequences of strings we have read from the file are converted to other types is to set the :keyword:`dtype` argument.
Acceptable values for this argument are:

* a single type, such as ``dtype=float``.
  The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the :keyword:`names` argument (see below).
  Note that ``dtype=float`` is the default for :func:`~numpy.genfromtxt`. 
* a sequence of types, such as ``dtype=(int, float, float)``.
* a comma-separated string, such as ``dtype="i4,f8,|S3"``.
* a dictionary with two keys ``'names'`` and ``'formats'``.
* a sequence of tuples ``(name, type)``, such as ``dtype=[('A', int), ('B', float)]``.
* an existing :class:`numpy.dtype` object.
* the special value ``None``.
  In that case, the type of the columns will be determined from the data itself (see below).
  
In all the cases but the first one, the output will be a 1D array with a structured dtype.
This dtype has as many fields as items in the sequence.
The field names are defined with the :keyword:`names` keyword.
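
The different ways of spelling a structured dtype can be compared side by side.
This is a small sketch assuming Python 3 (``StringIO`` from :mod:`io`); the sample data and the field names ``A``, ``B``, ``C`` and ``x``, ``y``, ``z`` are invented for the illustration.

```python
from io import StringIO

import numpy as np

data = "1 2.5 3.5\n4 5.5 6.5"

# A sequence of types: fields receive default names.
by_sequence = np.genfromtxt(StringIO(data), dtype=(int, float, float))

# A comma-separated string of type codes.
by_string = np.genfromtxt(StringIO(data), dtype="i4,f8,f8")

# A sequence of (name, type) tuples.
by_tuples = np.genfromtxt(
    StringIO(data), dtype=[("A", int), ("B", float), ("C", float)]
)

# A dictionary with the keys 'names' and 'formats'.
by_dict = np.genfromtxt(
    StringIO(data),
    dtype={"names": ("x", "y", "z"), "formats": (int, float, float)},
)
```

In each case the result is a 1D array with one record per line, and a column is retrieved by field name, for example ``by_tuples["B"]``.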


When ``dtype=None``, the type of each column is determined iteratively from its data.
We start by checking whether a string can be converted to a boolean (that is, if the string matches ``true`` or ``false`` in lower case),
then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string.
This behavior may be changed by modifying the default mapper of the :class:`~numpy.lib._iotools.StringConverter` class.

The option ``dtype=None`` is provided for convenience.
However, it is significantly slower than setting the dtype explicitly.
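
As an illustration of the automatic detection, the sketch below lets :func:`~numpy.genfromtxt` guess the type of three columns.
It assumes Python 3 and the default mapper; string columns are left out on purpose, because their resulting type can depend on the NumPy version.

```python
from io import StringIO

import numpy as np

# One integer column, one float column, one boolean column.
data = "1 2.5 true\n4 5.5 false"
guessed = np.genfromtxt(StringIO(data), dtype=None)

# The kind codes are 'i' for integers, 'f' for floats and 'b' for booleans.
kinds = [guessed.dtype[name].kind for name in guessed.dtype.names]
```

Because the three columns end up with different types, the output is a structured array with default field names.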



Setting the names
=================

The :keyword:`names` argument
-----------------------------

A natural approach when dealing with tabular data is to allocate a name to each column.
A first possibility is to use an explicit structured dtype, as mentioned previously.

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
   array([(1, 2, 3), (4, 5, 6)], 
         dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
   
Another, simpler possibility is to use the :keyword:`names` keyword with a sequence of strings or a comma-separated string.

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, names="A, B, C")
   array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], 
         dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In the example above, we used the fact that by default, ``dtype=float``.
By giving a sequence of names, we are forcing the output to a structured dtype.

We may sometimes need to define the column names from the data itself.
In that case, we must use the :keyword:`names` keyword with a value of ``True``.
The names will then be read from the first line (after the ``skip_header`` ones), even if the line is commented out.

   >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, skip_header=1, names=True)
   array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], 
         dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

The default value of :keyword:`names` is ``None``.
If we give any other value to the keyword, the new names will overwrite the field names we may have defined with the dtype.

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> ndtype=[('a',int), ('b', float), ('c', int)]
   >>> names = ["A", "B", "C"]
   >>> np.genfromtxt(data, names=names, dtype=ndtype)
   array([(1, 2.0, 3), (4, 5.0, 6)], 
         dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])


The :keyword:`defaultfmt` argument
----------------------------------

If ``names=None`` but a structured dtype is expected, names are defined with the standard NumPy default of ``"f%i"``, yielding names like ``f0``, ``f1`` and so forth.

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int))
   array([(1, 2.0, 3), (4, 5.0, 6)], 
         dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we don't give enough names to match the length of the dtype, the missing names will be defined with this default template.

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), names="a")
   array([(1, 2.0, 3), (4, 5.0, 6)], 
         dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can override this default with the :keyword:`defaultfmt` argument, which takes any format string:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
   array([(1, 2.0, 3), (4, 5.0, 6)], 
         dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

.. note::
   We need to keep in mind that ``defaultfmt`` is used only if some names are expected but not defined.


Validating names
----------------

NumPy arrays with a structured dtype can also be viewed as :class:`~numpy.recarray`, where a field can be accessed as if it were an attribute.
For that reason, we may need to make sure that the field name doesn't contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like ``size`` or ``shape``), which would confuse the interpreter.
:func:`~numpy.genfromtxt` accepts three optional arguments that provide finer control over the names:

   :keyword:`deletechars`
      Gives a string combining all the characters that must be deleted from the name. By default, invalid characters are ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``.
   :keyword:`excludelist`
      Gives a list of the names to exclude, such as ``return``, ``file``, ``print``...
      If one of the input names is part of this list, an underscore character (``'_'``) will be appended to it.
   :keyword:`case_sensitive`
      Whether the names should be case-sensitive (``case_sensitive=True``),
      converted to upper case (``case_sensitive=False`` or ``case_sensitive='upper'``) or to lower case (``case_sensitive='lower'``).
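
A short sketch of these three mechanisms working together, assuming Python 3 and the default values of :keyword:`deletechars` and :keyword:`excludelist`.
The header names ``a-b``, ``return`` and ``Size`` are invented to trigger each rule.

```python
from io import StringIO

import numpy as np

data = "a-b,return,Size\n1,2,3\n4,5,6"
validated = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    names=True,              # read the names from the first line
    case_sensitive="lower",  # convert every name to lower case
)

# '-' is one of the characters deleted by default, 'return' belongs to the
# default exclusion list and gets a trailing underscore, and 'Size' is
# lowered by the case_sensitive setting.
field_names = validated.dtype.names
```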



Tweaking the conversion
=======================

The :keyword:`converters` argument
----------------------------------

Usually, defining a dtype is sufficient to define how the sequence of strings must be converted.
However, some additional control may sometimes be required.
For example, we may want to make sure that a date in a format ``YYYY/MM/DD`` is converted to a :class:`datetime` object, or that a string like ``xx%`` is properly converted to a float between 0 and 1.
In such cases, we should define conversion functions with the :keyword:`converters` arguments.

The value of this argument is typically a dictionary with column indices or column names as keys and a conversion function as values.
These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type.

In the following example, the second column is converted from a string representing a percentage to a float between 0 and 1:

   >>> convertfunc = lambda x: float(x.strip("%"))/100.
   >>> data = "1, 2.3%, 45.\n6, 78.9%, 0"
   >>> names = ("i", "p", "n")
   >>> # General case .....
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names) 
   array([(1.0, nan, 45.0), (6.0, nan, 0.0)], 
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

We need to keep in mind that by default, ``dtype=float``.
A float is therefore expected for the second column.
However, the strings ``' 2.3%'`` and ``' 78.9%'`` cannot be converted to float and we end up having ``np.nan`` instead.
Let's now use a converter.

   >>> # Converted case ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names, 
   ...               converters={1: convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)], 
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same results can be obtained by using the name of the second column (``"p"``) as key instead of its index (1).

   >>> # Using a name for the converter ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names, 
   ...               converters={"p": convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)], 
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])


Converters can also be used to provide a default for missing entries.
In the following example, the converter ``convert`` transforms a stripped string into the corresponding float or into -999 if the string is empty.
We need to explicitly strip the white spaces from the string, as this is not done by default.

   >>> data = "1, , 3\n 4, 5, 6"
   >>> convert = lambda x: float(x.strip() or -999)
   >>> np.genfromtxt(StringIO(data), delimiter=",",
   ...               converters={1: convert})
   array([[   1., -999.,    3.],
          [   4.,    5.,    6.]])




Using missing and filling values
--------------------------------

Some entries may be missing in the dataset we are trying to import.
In a previous example, we used a converter to transform an empty string into a float.
However, user-defined converters may rapidly become cumbersome to manage.

The :func:`~numpy.genfromtxt` function provides two other complementary mechanisms: the :keyword:`missing_values` argument is used to recognize missing data and a second argument, :keyword:`filling_values`, is used to process these missing data.

:keyword:`missing_values`
-------------------------

By default, any empty string is marked as missing.
We can also consider more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing or invalid data.
The :keyword:`missing_values` argument accepts three kinds of values:

   a string or a comma-separated string
      This string will be used as the marker for missing data for all the columns.
   a sequence of strings
      In that case, each item is associated with a column, in order.
   a dictionary
      Values of the dictionary are strings or sequences of strings.
      The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key ``None`` can be used to define a default applicable to all columns.


:keyword:`filling_values`
-------------------------

We know how to recognize missing data, but we still need to provide a value for these missing entries.
By default, this value is determined from the expected dtype according to this table:

=============  ==============
Expected type  Default
=============  ==============
``bool``       ``False``
``int``        ``-1``
``float``      ``np.nan``
``complex``    ``np.nan+0j``
``string``     ``'???'``
=============  ==============

We can get a finer control on the conversion of missing values with the :keyword:`filling_values` optional argument.
Like :keyword:`missing_values`, this argument accepts different kinds of values:

   a single value
      This will be the default for all columns.
   a sequence of values
      Each entry will be the default for the corresponding column.
   a dictionary
      Each key can be a column index or a column name, and the corresponding value should be a single object.
      We can use the special key ``None`` to define a default for all columns.

In the following example, we suppose that the missing values are flagged with ``"N/A"`` in the first column and by ``"???"`` in the third column.
We wish to transform these missing values to 0 if they occur in the first and second column, and to -999 if they occur in the last column.

   >>> data = "N/A, 2, 3\n4, ,???"
   >>> kwargs = dict(delimiter=",",
   ...               dtype=int,
   ...               names="a,b,c",
   ...               missing_values={0:"N/A", 'b':" ", 2:"???"},
   ...               filling_values={0:0, 'b':0, 2:-999})
   >>> np.genfromtxt(StringIO(data), **kwargs)
   array([(0, 2, 3), (4, 0, -999)],
         dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])


:keyword:`usemask`
------------------

We may also want to keep track of the occurrence of missing data by constructing a boolean mask, with ``True`` entries where data was missing and ``False`` otherwise.
To do that, we just have to set the optional argument :keyword:`usemask` to ``True`` (the default is ``False``).
The output array will then be a :class:`~numpy.ma.MaskedArray`.
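
A minimal sketch of the resulting mask, assuming Python 3; the sample data and the ``"N/A"`` marker are chosen for the illustration.

```python
from io import StringIO

import numpy as np

# The second column holds a custom marker on the first line
# and an empty entry on the second line.
data = "1,N/A,3\n4,,6"
masked = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    missing_values="N/A",
    usemask=True,
)

# True marks the positions where the data was missing.
mask = masked.mask
# Replace the masked entries with a value of our choice.
filled = masked.filled(0)
```

The mask keeps the information about where data was missing even after the values themselves have been replaced.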


.. unpack=None, loose=True, invalid_raise=True)


Shortcut functions
==================

In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.io` module provides several convenience functions derived from :func:`~numpy.genfromtxt`.
These functions work the same way as the original, but they have different default values.

:func:`~numpy.ndfromtxt`
   Always set ``usemask=False``.
   The output is always a standard :class:`numpy.ndarray`.
:func:`~numpy.mafromtxt`
   Always set ``usemask=True``.
   The output is always a :class:`~numpy.ma.MaskedArray`.
:func:`~numpy.recfromtxt`
   Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a :class:`~numpy.ma.MaskedRecords` array (if ``usemask=True``).
   The default dtype is ``dtype=None``, meaning that the type of each column will be automatically determined.
:func:`~numpy.recfromcsv`
   Like :func:`~numpy.recfromtxt`, but with a default ``delimiter=","``.
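
Since the availability of these shortcuts may depend on the NumPy version installed, a record-style result can also be approximated with :func:`~numpy.genfromtxt` alone.
The sketch below assumes Python 3; the keyword values (``names=True``, ``dtype=None``, ``delimiter=","``) are picked for the illustration and are not claimed to reproduce every default of :func:`~numpy.recfromcsv`.

```python
from io import StringIO

import numpy as np

data = "a,b\n1,2.5\n3,4.5"

# Read the names from the first line and let the column types be guessed.
structured = np.genfromtxt(StringIO(data), delimiter=",", names=True, dtype=None)

# View the structured array as a recarray to get attribute access to fields.
records = structured.view(np.recarray)
first_column = records.a
```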