Diffstat (limited to 'doc/src/sgml/wal.sgml')
| -rw-r--r-- | doc/src/sgml/wal.sgml | 60 |
1 file changed, 30 insertions, 30 deletions
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 76f1fdcf3b..2c5ce01112 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.60 2009/11/28 16:21:31 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.61 2010/02/03 17:25:06 momjian Exp $ -->
 
 <chapter id="wal">
  <title>Reliability and the Write-Ahead Log</title>
@@ -42,9 +42,9 @@
   <para>
    Next, there might be a cache in the disk drive controller; this is
    particularly common on <acronym>RAID</> controller cards. Some of
-   these caches are <firstterm>write-through</>, meaning writes are passed
-   along to the drive as soon as they arrive.  Others are
-   <firstterm>write-back</>, meaning data is passed on to the drive at
+   these caches are <firstterm>write-through</>, meaning writes are sent
+   to the drive as soon as they arrive.  Others are
+   <firstterm>write-back</>, meaning data is sent to the drive at
    some later time.  Such caches can be a reliability hazard because the
    memory in the disk controller cache is volatile, and will lose its
    contents in a power failure.  Better controller cards have
@@ -61,7 +61,7 @@
    particularly likely to have write-back caches that will not survive
    a power failure.  To check write caching on <productname>Linux</> use
    <command>hdparm -I</>; it is enabled if there is a <literal>*</> next
-   to <literal>Write cache</>.  <command>hdparm -W</> to turn off
+   to <literal>Write cache</>; <command>hdparm -W</> to turn off
    write caching.  On <productname>FreeBSD</> use <application>atacontrol</>.
    (For SCSI disks use
    <ulink url="http://sg.torque.net/sg/sdparm.html"><application>sdparm</></ulink>
@@ -79,10 +79,10 @@
   </para>
 
   <para>
-   When the operating system sends a write request to the disk hardware,
+   When the operating system sends a write request to the storage hardware,
    there is little it can do to make sure the data has arrived at a truly
    non-volatile storage area.  Rather, it is the
-   administrator's responsibility to be sure that all storage components
+   administrator's responsibility to make certain that all storage components
    ensure data integrity.  Avoid disk controllers that have non-battery-backed
    write caches.  At the drive level, disable write-back caching if the
    drive cannot guarantee the data will be written before shutdown.
@@ -100,11 +100,11 @@
    to power loss at any time, meaning some of the 512-byte sectors were
    written, and others were not.  To guard against such failures,
    <productname>PostgreSQL</> periodically writes full page images to
-   permanent storage <emphasis>before</> modifying the actual page on
+   permanent WAL storage <emphasis>before</> modifying the actual page on
    disk.  By doing this, during crash recovery <productname>PostgreSQL</> can
    restore partially-written pages.  If you have a battery-backed disk
    controller or file-system software that prevents partial page writes
-   (e.g., ReiserFS 4), you can turn off this page imaging by using the
+   (e.g., ZFS), you can turn off this page imaging by turning off the
    <xref linkend="guc-full-page-writes"> parameter.
   </para>
  </sect1>
@@ -140,12 +140,12 @@
   <tip>
    <para>
     Because <acronym>WAL</acronym> restores database file
-    contents after a crash, journaled filesystems are not necessary for
+    contents after a crash, journaled file systems are not necessary for
     reliable storage of the data files or WAL files.  In fact, journaling
     overhead can reduce performance, especially if journaling
     causes file system <emphasis>data</emphasis> to be flushed
     to disk.  Fortunately, data flushing during journaling can
-    often be disabled with a filesystem mount option, e.g.
+    often be disabled with a file system mount option, e.g.
     <literal>data=writeback</> on a Linux ext3 file system.
     Journaled file systems do improve boot speed after a crash.
    </para>
@@ -308,7 +308,7 @@
    committing at about the same time.  Setting <varname>commit_delay</varname>
    can only help when there are many concurrently
    committing transactions, and it is difficult to tune it to a
    value that actually helps rather
-   than hurting throughput.
+   than hurt throughput.
   </para>
  </sect1>
@@ -326,7 +326,7 @@
   <para>
    <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
    are points in the sequence of transactions at which it is guaranteed
-   that the data files have been updated with all information written before
+   that the heap and index data files have been updated with all information written before
    the checkpoint.  At checkpoint time, all dirty data pages are flushed to
    disk and a special checkpoint record is written to the log file.
    (The changes were previously flushed to the <acronym>WAL</acronym> files.)
@@ -349,18 +349,18 @@
   </para>
 
   <para>
-   The server's background writer process will automatically perform
+   The server's background writer process automatically performs
    a checkpoint every so often.  A checkpoint is created every
    <xref linkend="guc-checkpoint-segments"> log segments, or every
    <xref linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
-   The default settings are 3 segments and 300 seconds respectively.
+   The default settings are 3 segments and 300 seconds (5 minutes), respectively.
    It is also possible to force a checkpoint by using the SQL
    command <command>CHECKPOINT</command>.
   </para>
 
   <para>
    Reducing <varname>checkpoint_segments</varname> and/or
-   <varname>checkpoint_timeout</varname> causes checkpoints to be done
+   <varname>checkpoint_timeout</varname> causes checkpoints to occur
    more often.  This allows faster after-crash recovery (since less work
    will need to be redone).  However, one must balance this against the
    increased cost of flushing dirty data pages more often.  If
@@ -469,7 +469,7 @@
    server processes to add their commit records to the log so as to have all
    of them flushed with a single log sync.  No sleep will occur if <xref linkend="guc-fsync">
-   is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
+   is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
    other sessions are currently in active transactions; this avoids
    sleeping when it's unlikely that any other session will commit soon.
    Note that on most platforms, the resolution of a sleep request is
@@ -483,7 +483,7 @@
    The <xref linkend="guc-wal-sync-method"> parameter determines how
    <productname>PostgreSQL</productname> will ask the kernel to force
    <acronym>WAL</acronym> updates out to disk.
-   All the options should be the same as far as reliability goes,
+   All the options should be the same in terms of reliability,
    but it's quite platform-specific which one will be the fastest.
    Note that this parameter is irrelevant if <varname>fsync</varname>
    has been turned off.
@@ -521,26 +521,26 @@
    <filename>access/xlog.h</filename>; the record content is dependent on
    the type of event that is being logged.  Segment files are given
    ever-increasing numbers as names, starting at
-   <filename>000000010000000000000000</filename>. The numbers do not wrap, at
-   present, but it should take a very very long time to exhaust the
+   <filename>000000010000000000000000</filename>. The numbers do not wrap,
+   but it will take a very, very long time to exhaust the
    available stock of numbers.
   </para>
 
   <para>
-   It is of advantage if the log is located on another disk than the
-   main database files.  This can be achieved by moving the directory
-   <filename>pg_xlog</filename> to another location (while the server
+   It is advantageous if the log is located on a different disk from the
+   main database files.  This can be achieved by moving the
+   <filename>pg_xlog</filename> directory to another location (while the server
    is shut down, of course) and creating a symbolic link from the
    original location in the main data directory to the new location.
   </para>
 
   <para>
-   The aim of <acronym>WAL</acronym>, to ensure that the log is
-   written before database records are altered, can be subverted by
+   The aim of <acronym>WAL</acronym> is to ensure that the log is
+   written before database records are altered, but this can be subverted by
    disk drives<indexterm><primary>disk drive</></> that falsely report a
    successful write to the kernel,
    when in fact they have only cached the data and not yet stored it
-   on the disk.  A power failure in such a situation might still lead to
+   on the disk.  A power failure in such a situation might lead to
    irrecoverable data corruption.  Administrators should try to ensure
    that disks holding <productname>PostgreSQL</productname>'s
    <acronym>WAL</acronym> log files do not make such false reports.
@@ -549,8 +549,8 @@
    After a checkpoint has been made and the log flushed, the
    checkpoint's position is saved in the file
-   <filename>pg_control</filename>. Therefore, when recovery is to be
-   done, the server first reads <filename>pg_control</filename> and
+   <filename>pg_control</filename>. Therefore, at the start of recovery,
+   the server first reads <filename>pg_control</filename> and
    then the checkpoint record; then it performs the REDO operation by
    scanning forward from the log position indicated in the checkpoint
    record.  Because the entire content of data pages is saved in the
@@ -562,12 +562,12 @@
   <para>
    To deal with the case where <filename>pg_control</filename> is
-   corrupted, we should support the possibility of scanning existing log
+   corrupt, we should support the possibility of scanning existing log
    segments in reverse order — newest to oldest — in order to find the
    latest checkpoint.  This has not been implemented yet.
    <filename>pg_control</filename> is small enough (less than one disk page) that
    it is not subject to partial-write problems, and as of this writing
-   there have been no reports of database failures due solely to inability
+   there have been no reports of database failures due solely to the inability
    to read <filename>pg_control</filename> itself.  So while it is
    theoretically a weak spot, <filename>pg_control</filename> does not
    seem to be a problem in practice.
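Several parameters touched by this patch are set in <filename>postgresql.conf</filename>. A hedged sketch of the relevant settings follows; the values shown are the defaults discussed above or illustrative placeholders, not tuning recommendations.

```
# postgresql.conf excerpt (settings discussed in this chapter)
fsync = on                  # leave on unless the data is expendable
full_page_writes = on       # disable only if the storage stack prevents
                            # partial page writes (e.g., ZFS)
checkpoint_segments = 3     # checkpoint every 3 WAL log segments...
checkpoint_timeout = 5min   # ...or every 5 minutes, whichever comes first
commit_delay = 0            # microseconds to wait before a commit flush
commit_siblings = 5         # minimum concurrently active transactions
                            # before commit_delay takes effect
wal_sync_method = fsync     # one of several options; equally reliable,
                            # but the fastest choice is platform-specific
```

A manual checkpoint can also be forced at any time with the SQL command <command>CHECKPOINT</command>.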
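The full_page_writes hunk above concerns torn pages: a crash mid-write can update some 512-byte sectors of an 8 kB page but not others. As a rough illustration of why a previously saved full page image makes such a page recoverable, the shell sketch below simulates a torn page with ordinary files. All file names here are invented for the demo; PostgreSQL does this internally inside WAL, not with separate files like these.

```shell
set -e
workdir=$(mktemp -d)

# An 8192-byte "data page" (PostgreSQL's default page size).
head -c 8192 /dev/urandom > "$workdir/page"

# Before modifying the page, a full page image is written to permanent
# WAL storage; a plain copy stands in for that here.
cp "$workdir/page" "$workdir/full_page_image"

# A crash mid-write leaves a torn page: only the first 512-byte
# sector was replaced, the rest of the page still holds old data.
dd if=/dev/zero of="$workdir/page" bs=512 count=1 conv=notrunc 2>/dev/null

if cmp -s "$workdir/page" "$workdir/full_page_image"; then
    status=intact
else
    status=torn
fi

# Crash recovery: restore the whole page from the full page image,
# then replay later WAL records on top of it (not shown).
cp "$workdir/full_page_image" "$workdir/page"
cmp -s "$workdir/page" "$workdir/full_page_image" && restored=yes

echo "status=$status restored=$restored"
```

The point of the sketch is only that a full image taken before the write makes the partially-written page reconstructible; per-record WAL entries alone would not suffice, because they assume the rest of the page is consistent.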
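The <filename>pg_xlog</filename> hunk above describes relocating WAL to a different disk by moving the directory and leaving a symbolic link behind. A minimal sketch of that procedure, run against throwaway directories rather than a real cluster — the paths are invented for the demo, and on a live server the cluster must be shut down first and the real data directory and mount point used.

```shell
set -e
base=$(mktemp -d)

# Stand-ins for the data directory and the other disk's mount point.
mkdir -p "$base/pgdata/pg_xlog" "$base/otherdisk"
echo "wal segment" > "$base/pgdata/pg_xlog/000000010000000000000000"

# With the server shut down, move the WAL directory to the other disk...
mv "$base/pgdata/pg_xlog" "$base/otherdisk/pg_xlog"

# ...and create a symbolic link at the original location.
ln -s "$base/otherdisk/pg_xlog" "$base/pgdata/pg_xlog"

# The server still reaches its WAL segments through the original path.
cat "$base/pgdata/pg_xlog/000000010000000000000000"
```

The server never needs to know about the relocation; it keeps opening files under <filename>pg_xlog</filename> in the data directory, and the filesystem resolves the link.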
