doc/source/user/how-to-io.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343

.. _how-to-io:

.. Setting up files temporarily to be used in the examples below. Clear-up
   has to be done at the end of the document.

.. testsetup::

   >>> from numpy.testing import temppath
   >>> with open("csv.txt", "wt") as f:
   ...    _ = f.write("1, 2, 3\n4,, 6\n7, 8, 9")
   >>> with open("fixedwidth.txt", "wt") as f:
   ...    _ = f.write("1   2      3\n44      6\n7   88889")
   >>> with open("nan.txt", "wt") as f:
   ...    _ = f.write("1 2 3\n44 x 6\n7 8888 9")
   >>> with open("skip.txt", "wt") as f:
   ...    _ = f.write("1 2   3\n44   6\n7 888 9")
   >>> with open("tabs.txt", "wt") as f:
   ...    _ = f.write("1\t2\t3\n44\t \t6\n7\t888\t9")


===========================
 Reading and writing files
===========================

This page tackles common applications; for the full collection of I/O
routines, see :ref:`routines.io`.


Reading text and CSV_ files
===========================

.. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values

With no missing values
----------------------

Use :func:`numpy.loadtxt`.

With missing values
-------------------

Use :func:`numpy.genfromtxt`.

:func:`numpy.genfromtxt` will either

  - return a :ref:`masked array<maskedarray.generic>`
    **masking out missing values** (if ``usemask=True``), or

  - **fill in the missing value** with the value specified in
    ``filling_values`` (default is ``np.nan`` for float, -1 for int).

With non-whitespace delimiters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


    >>> with open("csv.txt", "r") as f:
    ...     print(f.read())
    1, 2, 3
    4,, 6
    7, 8, 9


Masked-array output
+++++++++++++++++++

    >>> np.genfromtxt("csv.txt", delimiter=",", usemask=True)
    masked_array(
      data=[[1.0, 2.0, 3.0],
            [4.0, --, 6.0],
            [7.0, 8.0, 9.0]],
      mask=[[False, False, False],
            [False,  True, False],
            [False, False, False]],
      fill_value=1e+20)

Array output
++++++++++++

    >>> np.genfromtxt("csv.txt", delimiter=",")
    array([[ 1.,  2.,  3.],
           [ 4., nan,  6.],
           [ 7.,  8.,  9.]])

Array output, specified fill-in value
+++++++++++++++++++++++++++++++++++++


    >>> np.genfromtxt("csv.txt", delimiter=",", dtype=np.int8, filling_values=99)
    array([[ 1,  2,  3],
           [ 4, 99,  6],
           [ 7,  8,  9]], dtype=int8)

Whitespace-delimited
~~~~~~~~~~~~~~~~~~~~

:func:`numpy.genfromtxt` can also parse whitespace-delimited data files
that have missing values if

* **Each field has a fixed width**: Use the width as the `delimiter` argument.

    # File with width=4. The data does not have to be justified (for example,
    # the 2 in row 1), the last column can be less than width (for example, the 6
    # in row 2), and no delimiting character is required (for instance 8888 and 9
    # in row 3)

    >>> with open("fixedwidth.txt", "r") as f:
    ...    data = (f.read())
    >>> print(data)
    1   2      3
    44      6
    7   88889

    # Showing spaces as ^
    >>> print(data.replace(" ","^"))
    1^^^2^^^^^^3
    44^^^^^^6
    7^^^88889

    >>> np.genfromtxt("fixedwidth.txt", delimiter=4)
    array([[1.000e+00, 2.000e+00, 3.000e+00],
           [4.400e+01,       nan, 6.000e+00],
           [7.000e+00, 8.888e+03, 9.000e+00]])

* **A special value (e.g. "x") indicates a missing field**: Use it as the
  `missing_values` argument.

    >>> with open("nan.txt", "r") as f:
    ...     print(f.read())
    1 2 3
    44 x 6
    7  8888 9

    >>> np.genfromtxt("nan.txt", missing_values="x")
    array([[1.000e+00, 2.000e+00, 3.000e+00],
           [4.400e+01,       nan, 6.000e+00],
           [7.000e+00, 8.888e+03, 9.000e+00]])

* **You want to skip the rows with missing values**: Set
  `invalid_raise=False`.

    >>> with open("skip.txt", "r") as f:
    ...     print(f.read())
    1 2   3
    44    6
    7 888 9

    >>> np.genfromtxt("skip.txt", invalid_raise=False)  # doctest: +SKIP
    __main__:1: ConversionWarning: Some errors were detected !
        Line #2 (got 2 columns instead of 3)
    array([[  1.,   2.,   3.],
           [  7., 888.,   9.]])


* **The delimiter whitespace character is different from the whitespace that
  indicates missing data**. For instance, if columns are delimited by ``\t``,
  then missing data will be recognized if it consists of one
  or more spaces.

    >>> with open("tabs.txt", "r") as f:
    ...    data = (f.read())
    >>> print(data)
    1       2       3
    44              6
    7       888     9

    # Tabs vs. spaces
    >>> print(data.replace("\t","^"))
    1^2^3
    44^ ^6
    7^888^9

    >>> np.genfromtxt("tabs.txt", delimiter="\t", missing_values=" +")
    array([[  1.,   2.,   3.],
           [ 44.,  nan,   6.],
           [  7., 888.,   9.]])

Read a file in .npy or .npz format
==================================

Choices:

  - Use :func:`numpy.load`. It can read files generated by any of
    :func:`numpy.save`, :func:`numpy.savez`, or :func:`numpy.savez_compressed`.

  - Use memory mapping. See `numpy.lib.format.open_memmap`.

Write to a file to be read back by NumPy
========================================

Binary
------

Use
:func:`numpy.save`, or to store multiple arrays :func:`numpy.savez`
or :func:`numpy.savez_compressed`.

For :ref:`security and portability <how-to-io-pickle-file>`, set
``allow_pickle=False`` unless the dtype contains Python objects, which
requires pickling.

Masked arrays :any:`can't currently be saved <MaskedArray.tofile>`,
nor can other arbitrary array subclasses.

Human-readable
--------------

:func:`numpy.save` and :func:`numpy.savez` create binary files. To **write a
human-readable file**, use :func:`numpy.savetxt`. The array can only be 1- or
2-dimensional, and there's no ` savetxtz` for multiple files.

Large arrays
------------

See :ref:`how-to-io-large-arrays`.

Read an arbitrarily formatted binary file ("binary blob")
=========================================================

Use a :doc:`structured array <basics.rec>`.

**Example:**

The ``.wav`` file header is a 44-byte block preceding ``data_size`` bytes of the
actual sound data::

    chunk_id         "RIFF"
    chunk_size       4-byte unsigned little-endian integer
    format           "WAVE"
    fmt_id           "fmt "
    fmt_size         4-byte unsigned little-endian integer
    audio_fmt        2-byte unsigned little-endian integer
    num_channels     2-byte unsigned little-endian integer
    sample_rate      4-byte unsigned little-endian integer
    byte_rate        4-byte unsigned little-endian integer
    block_align      2-byte unsigned little-endian integer
    bits_per_sample  2-byte unsigned little-endian integer
    data_id          "data"
    data_size        4-byte unsigned little-endian integer

The ``.wav`` file header as a NumPy structured dtype::

    wav_header_dtype = np.dtype([
        ("chunk_id", (bytes, 4)), # flexible-sized scalar type, item size 4
        ("chunk_size", "<u4"),    # little-endian unsigned 32-bit integer
        ("format", "S4"),         # 4-byte string, alternate spelling of (bytes, 4)
        ("fmt_id", "S4"),
        ("fmt_size", "<u4"),
        ("audio_fmt", "<u2"),     #
        ("num_channels", "<u2"),  # .. more of the same ...
        ("sample_rate", "<u4"),   #
        ("byte_rate", "<u4"),
        ("block_align", "<u2"),
        ("bits_per_sample", "<u2"),
        ("data_id", "S4"),
        ("data_size", "<u4"),
        #
        # the sound data itself cannot be represented here:
        # it does not have a fixed size
    ])

    header = np.fromfile(f, dtype=wave_header_dtype, count=1)[0]

This ``.wav`` example is for illustration; to read a ``.wav`` file in real
life, use Python's built-in module :mod:`wave`.

(Adapted from Pauli Virtanen, :ref:`advanced_numpy`, licensed
under `CC BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_.)

.. _how-to-io-large-arrays:

Write or read large arrays
==========================

**Arrays too large to fit in memory** can be treated like ordinary in-memory
arrays using memory mapping.

- Raw array data written with :func:`numpy.ndarray.tofile` or
  :func:`numpy.ndarray.tobytes` can be read with :func:`numpy.memmap`::

      array = numpy.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))

- Files output by :func:`numpy.save` (that is, using the numpy format) can be read
  using :func:`numpy.load` with the ``mmap_mode`` keyword argument::

      large_array[some_slice] = np.load("path/to/small_array", mmap_mode="r")

Memory mapping lacks features like data chunking and compression; more
full-featured formats and libraries usable with NumPy include:

* **HDF5**: `h5py <https://www.h5py.org/>`_ or `PyTables <https://www.pytables.org/>`_.
* **Zarr**: `here <https://zarr.readthedocs.io/en/stable/tutorial.html#reading-and-writing-data>`_.
* **NetCDF**: :class:`scipy.io.netcdf_file`.

For tradeoffs among memmap, Zarr, and HDF5, see
`pythonspeed.com <https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/>`_.

Write files for reading by other (non-NumPy) tools
==================================================

Formats for **exchanging data** with other tools include HDF5, Zarr, and
NetCDF (see :ref:`how-to-io-large-arrays`).

Write or read a JSON file
=========================

NumPy arrays are **not** directly
`JSON serializable <https://github.com/numpy/numpy/issues/12481>`_.


.. _how-to-io-pickle-file:

Save/restore using a pickle file
================================

Avoid when possible; :doc:`pickles <python:library/pickle>` are not secure
against erroneous or maliciously constructed data.

Use :func:`numpy.save` and :func:`numpy.load`.  Set ``allow_pickle=False``,
unless the array dtype includes Python objects, in which case pickling is
required.

Convert from a pandas DataFrame to a NumPy array
================================================

See :meth:`pandas.DataFrame.to_numpy`.

Save/restore using `~numpy.ndarray.tofile` and `~numpy.fromfile`
================================================================

In general, prefer :func:`numpy.save` and :func:`numpy.load`.

:func:`numpy.ndarray.tofile` and :func:`numpy.fromfile` lose information on
endianness and precision and so are unsuitable for anything but scratch
storage.


.. testcleanup::

   >>> import os
   >>> # list all files created in testsetup. If needed there are
   >>> # convenienes in e.g. astroquery to do this more automatically
   >>> for filename in ['csv.txt', 'fixedwidth.txt', 'nan.txt', 'skip.txt', 'tabs.txt']:
   ...     os.remove(filename)