path: root/src/backend/access
Each entry below gives the commit message, followed by the author, date, files changed, and lines removed/added.
* Rename recently-added pg_stat_activity column from txn_start to xact_start,
  for consistency with other column names such as in pg_stat_database.
  (Tom Lane, 2007-09-11, 1 file changed, -4/+4 lines)
* Replace the former method of determining snapshot xmax --- to wit, calling
  ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is
  updated during transaction commit or abort. Since latestCompletedXid is written only in places
  that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock
  ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being
  cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the
  need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes
  held across I/O this can be a significant win. Some preliminary benchmarking suggested that
  this patch has no effect on average throughput but can significantly improve the worst-case
  transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.
  (Tom Lane, 2007-09-08, 6 files changed, -160/+131 lines)
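  The core of this scheme can be shown with a small self-contained model; the names and layout
  below are illustrative only, not the actual ProcArray code. Snapshot xmax is derived from a
  counter advanced at commit/abort, so taking a snapshot never consults the next-XID counter
  under XidGenLock.

      /* Simplified model of the latestCompletedXid idea; not PostgreSQL source. */
      #include <stdio.h>

      typedef unsigned int Xid;

      static Xid latestCompletedXid = 0;   /* advanced under the exclusive lock at commit/abort */
      static Xid runningXids[8];           /* in-progress XIDs; 0 means an empty slot */

      typedef struct { Xid xmin; Xid xmax; } Snapshot;

      /* At commit or abort: remember the highest XID that has completed. */
      static void transaction_completed(Xid xid)
      {
          if (xid > latestCompletedXid)
              latestCompletedXid = xid;
      }

      /* Take a snapshot without consulting the XID generator. */
      static Snapshot get_snapshot(void)
      {
          Snapshot snap;
          snap.xmax = latestCompletedXid + 1;   /* first XID we must treat as in-progress */
          snap.xmin = snap.xmax;
          for (int i = 0; i < 8; i++)
              if (runningXids[i] != 0 && runningXids[i] < snap.xmin)
                  snap.xmin = runningXids[i];
          return snap;
      }

      int main(void)
      {
          runningXids[0] = 100;        /* XID 100 is still running */
          transaction_completed(101);  /* XID 101 just committed */
          Snapshot s = get_snapshot();
          printf("xmin=%u xmax=%u\n", s.xmin, s.xmax);   /* prints xmin=100 xmax=102 */
          return 0;
      }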
* Don't take ProcArrayLock while exiting a transaction that has no XID; there is
  no need for serialization against snapshot-taking because the xact doesn't affect anyone
  else's snapshot anyway. Per discussion. Also, move various info about the interlocking of
  transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion
  in access/transam/README. Also, remove a couple of now-obsolete comments about having to force
  some WAL to be written to persuade RecordTransactionCommit to do its thing.
  (Tom Lane, 2007-09-07, 3 files changed, -70/+194 lines)
* Improve page split in rtree emulation. Now if the split result is badly
  misaligned, it tries to split the page based on the distribution of the boxes' centers.
  Per report from Dolafi, Tom <dolafit@janelia.hhmi.org>.
  Backpatch is needed; the change doesn't affect on-disk storage.
  (Teodor Sigaev, 2007-09-07, 1 file changed, -45/+34 lines)
* Quick hack to make the VXID of a prepared transaction be -1/XID,
  so that different prepared xacts can be told apart in the pg_locks view.
  Per suggestion from Florian.
  (Tom Lane, 2007-09-05, 1 file changed, -2/+3 lines)
* Implement lazy XID allocation: transactions that do not modify any database
  rows will normally never obtain an XID at all. We already did things this way for
  subtransactions, but this patch extends the concept to top-level transactions. In applications
  where there are lots of short read-only transactions, this should improve performance
  noticeably; not so much from removal of the actual XID-assignments, as from reduction of
  overhead that's driven by the rate of XID consumption. We add a concept of a "virtual
  transaction ID" so that active transactions can be uniquely identified even if they don't have
  a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed
  over the short term, and no on-disk record is made about them.
  Florian Pflug, with some editorialization by Tom.
  (Tom Lane, 2007-09-05, 7 files changed, -502/+411 lines)
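  A rough model of the virtual-transaction-ID concept, with made-up names rather than the real
  PostgreSQL structs: a VXID is just (backend ID, per-backend counter), handed out with no shared
  counter traffic and no on-disk record; a permanent XID would be assigned only when the
  transaction first writes.

      /* Illustrative model of virtual transaction IDs; not the PostgreSQL definitions. */
      #include <stdio.h>

      typedef struct
      {
          int          backendId;  /* which backend; unique while that backend lives */
          unsigned int localXid;   /* per-backend counter; unique only over the short term */
      } VirtualXid;

      static unsigned int localCounter = 0;   /* one per backend in reality */

      static VirtualXid start_transaction(int myBackendId)
      {
          VirtualXid vxid = { myBackendId, ++localCounter };
          return vxid;   /* no shared state touched, nothing written to disk */
      }

      int main(void)
      {
          /* Two read-only transactions in backend 3: cheap, and no real XID is consumed. */
          VirtualXid a = start_transaction(3);
          VirtualXid b = start_transaction(3);
          printf("%d/%u then %d/%u\n", a.backendId, a.localXid, b.backendId, b.localXid);
          return 0;
      }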
* Implement function-local GUC parameter settings, as per recent discussion.
  There are still some loose ends: I didn't do anything about the SET FROM CURRENT idea yet, and
  it's not real clear whether we are happy with the interaction of SET LOCAL with function-local
  settings. The documentation is a bit spartan, too.
  (Tom Lane, 2007-09-03, 1 file changed, -8/+16 lines)
* Add a debug logging message when a resource manager rejects an attempted
  restart point. Per suggestion from Simon Riggs.
  (Tom Lane, 2007-08-28, 1 file changed, -1/+7 lines)
* Tsearch2 functionality migrates to core. The bulk of this work is by
  Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing, so anything that's broken
  is probably my fault. Documentation is nonexistent as yet, but let's land the patch so we can
  get some portability testing done.
  (Tom Lane, 2007-08-21, 1 file changed, -2/+11 lines)
* Fix oversight in async-commit patch: there were some places in heapam.c
  that still thought they could set HEAP_XMAX_COMMITTED immediately after seeing the other
  transaction commit. Make them use the same logic as tqual.c does to determine if the hint bit
  can be set yet.
  (Tom Lane, 2007-08-14, 1 file changed, -31/+35 lines)
* Fix two bugs induced in VACUUM FULL by async-commit patch.
  First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be settable,
  because clog.c's inexact LSN bookkeeping results in windows where a previously flushed
  transaction is considered unhintable because it shares an LSN slot with a later unflushed
  transaction. But repair_frag requires XMIN_COMMITTED to be correct so that it can distinguish
  tuples moved by the current vacuum. Since not being able to set the bit is an uncommon corner
  case, the most practical way of dealing with it seems to be to abandon shrinking (ie, don't
  invoke repair_frag) when we find a non-dead tuple whose XMIN_COMMITTED bit couldn't be set.

  Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not get its
  XMAX_COMMITTED bit set during scan_heap. But by the time repair_frag examines the tuple it
  might be possible to set the bit. We therefore must take buffer content lock when calling
  HeapTupleSatisfiesVacuum a second time, else we can get an Assert failure in
  SetBufferCommitInfoNeedsSave. This latter bug is latent in existing releases, but I think it
  cannot actually occur without async commit, since the first HeapTupleSatisfiesVacuum call
  should always have set the bit. So I'm not going to back-patch it.

  In passing, reduce the existing "cannot shrink relation" messages from NOTICE to LOG level.
  The new message must be no higher than LOG if we don't want unpredictable regression test
  failures, and consistency seems like a good idea. Also arrange that only one such message is
  reported per VACUUM FULL; in typical scenarios you could get spammed with many such messages,
  which seems a bit useless.
  (Tom Lane, 2007-08-13, 1 file changed, -1/+5 lines)
* Switch over to using the src/timezone functions for formatting timestamps
  displayed in the postmaster log. This avoids Windows-specific problems with localized time
  zone names that are in the wrong encoding, and generally seems like a good idea to forestall
  other potential platform-dependent issues. To preserve the existing behavior that all backends
  will log in the same time zone, create a new GUC variable log_timezone that can only be
  changed on a system-wide basis, and reference log-related calculations to that zone instead of
  the TimeZone variable. This fixes the issue reported by Hiroshi Saito that timestamps printed
  by xlog.c startup could be improperly localized on Windows. We still need a simpler patch for
  that problem in the back branches, however.
  (Tom Lane, 2007-08-04, 1 file changed, -28/+18 lines)
* Support an optional asynchronous commit mode, in which we don't flush WAL
  before reporting a transaction committed. Data consistency is still guaranteed (unlike setting
  fsync = off), but a crash may lose the effects of the last few transactions.
  Patch by Simon, some editorialization by Tom.
  (Tom Lane, 2007-08-01, 9 files changed, -90/+513 lines)
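  A toy sketch of the trade-off, with hypothetical names rather than the real xact.c/xlog.c code:
  a synchronous commit flushes WAL up to the commit record before reporting success, while an
  asynchronous commit only notes where its commit record lies and lets a later background flush
  catch up, so a crash can drop the last few transactions without breaking consistency.

      /* Toy model of synchronous vs. asynchronous commit; not PostgreSQL source. */
      #include <stdbool.h>
      #include <stdio.h>

      typedef unsigned long Lsn;      /* position in the WAL stream */

      static Lsn wal_insert_pos = 0;  /* how much WAL has been generated */
      static Lsn wal_flush_pos  = 0;  /* how much WAL is safely on disk */

      static Lsn  wal_insert_commit_record(void) { return ++wal_insert_pos; }
      static void wal_flush_up_to(Lsn lsn) { if (lsn > wal_flush_pos) wal_flush_pos = lsn; }

      static void commit_transaction(bool synchronous)
      {
          Lsn commit_lsn = wal_insert_commit_record();
          if (synchronous)
              wal_flush_up_to(commit_lsn);   /* wait for the flush before reporting success */
          /* else: report success now; a background flush covers commit_lsn shortly afterward,
           * so a crash in that window loses this transaction but leaves nothing half-applied. */
          printf("committed at %lu, flushed through %lu\n", commit_lsn, wal_flush_pos);
      }

      int main(void)
      {
          commit_transaction(true);    /* committed at 1, flushed through 1 */
          commit_transaction(false);   /* committed at 2, flushed through 1 */
          return 0;
      }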
* Create a new dedicated Postgres process, "wal writer", which exists to write
  and fsync WAL at convenient intervals. For the moment it just tries to offload this work from
  backends, but soon it will be responsible for guaranteeing a maximum delay before
  asynchronously-committed transactions will be flushed to disk. This is a portion of Simon
  Riggs' async-commit patch, committed to CVS separately because a background WAL writer seems
  like it might be a good idea independently of the async-commit feature. I rebased walwriter.c
  on bgwriter.c because it seemed like a more appropriate way of handling signals; while the
  startup/shutdown logic in postmaster.c is more like autovac because we want walwriter to quit
  before we start the shutdown checkpoint.
  (Tom Lane, 2007-07-24, 1 file changed, -39/+80 lines)
* Improve logging of checkpoints. Patch by Greg Smith, worked over
  by Heikki and a little bit by me.
  (Tom Lane, 2007-06-30, 1 file changed, -45/+108 lines)
* Implement "distributed" checkpoints in which the checkpoint I/O is spread
  over a fairly long period of time, rather than being spat out in a burst. This happens only
  for background checkpoints carried out by the bgwriter; other cases, such as a shutdown
  checkpoint, are still done at full speed. Remove the "all buffers" scan in the bgwriter, and
  associated stats infrastructure, since this seems no longer very useful when the checkpoint
  itself is properly throttled. Original patch by Itagaki Takahiro, reworked by Heikki
  Linnakangas, and some minor API editorialization by me.
  (Tom Lane, 2007-06-28, 1 file changed, -36/+76 lines)
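  The throttling can be sketched in a few lines (illustrative only, not the bgwriter code): after
  each buffer written, compare the fraction of the checkpoint's work done with the fraction of
  the target window elapsed, and sleep whenever the writes are running ahead of schedule.

      /* Toy throttling loop for a spread-out checkpoint; not PostgreSQL source. */
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      static double seconds_since(struct timespec start)
      {
          struct timespec now;
          clock_gettime(CLOCK_MONOTONIC, &now);
          return (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) / 1e9;
      }

      int main(void)
      {
          const int    buffers_to_write = 200;
          const double target_seconds   = 2.0;   /* spread the I/O over this window */
          struct timespec start;
          clock_gettime(CLOCK_MONOTONIC, &start);

          for (int written = 1; written <= buffers_to_write; written++)
          {
              /* write_one_buffer() would go here */
              double done_fraction = (double) written / buffers_to_write;

              /* Nap while the write progress is ahead of the elapsed-time fraction. */
              while (done_fraction > seconds_since(start) / target_seconds)
                  usleep(10 * 1000);
          }
          printf("checkpoint spread over %.1f seconds\n", seconds_since(start));
          return 0;
      }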
* Teach heapam code to know the difference between a real seqscan and the
  pseudo HeapScanDesc created for a bitmap heap scan. This avoids some useless overhead during a
  bitmap scan startup, in particular invoking the syncscan code. (We might someday want to do
  that, but right now it's merely useless contention for shared memory, to say nothing of
  possibly pushing useful entries out of syncscan's small LRU list.) This also allows
  elimination of ugly pgstat_discount_heap_scan() kluge.
  (Tom Lane, 2007-06-09, 1 file changed, -6/+34 lines)
* Arrange for large sequential scans to synchronize with each other, so that
  when multiple backends are scanning the same relation concurrently, each page is (ideally)
  read only once. Jeff Davis, with review by Heikki and Tom.
  (Tom Lane, 2007-06-08, 3 files changed, -16/+432 lines)
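  The synchronization idea in miniature (made-up names, not the heapam/syncscan code): scans on a
  relation advertise their current block in shared state, and a newly starting scan begins at the
  advertised position and wraps around, so concurrent scans tend to want the same pages at about
  the same time and each page only has to be read from disk once.

      /* Miniature model of synchronized sequential scans; not PostgreSQL source. */
      #include <stdio.h>

      #define NBLOCKS 8

      static int scan_position_hint = 0;   /* shared: where some scan is currently reading */

      /* Scan every block, starting where the last reported scan is, wrapping around. */
      static void seq_scan(const char *who)
      {
          int start = scan_position_hint;
          for (int i = 0; i < NBLOCKS; i++)
          {
              int block = (start + i) % NBLOCKS;
              scan_position_hint = block;   /* advertise progress for scans that start later */
              printf("%s reads block %d\n", who, block);
          }
      }

      int main(void)
      {
          seq_scan("scan A");   /* starts at block 0 */
          seq_scan("scan B");   /* starts near where A left off and piggybacks on its pages */
          return 0;
      }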
* Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state,
  which is the only state in which it's safe to initiate database queries. It turns out that all
  but two of the callers thought that's what it meant; and the other two were using it as a
  proxy for "will GetTopTransactionId() return a nonzero XID"? Since it was in fact an
  unreliable guide to that, make those two just invoke GetTopTransactionId() always, then deal
  with a zero result if they get one.
  (Tom Lane, 2007-06-07, 1 file changed, -28/+18 lines)
* Move call of MarkBufferDirty() before XLogInsert() as required.
  Many thanks to Heikki Linnakangas <heikki@enterprisedb.com> for his sharp eyes.
  (Teodor Sigaev, 2007-06-05, 3 files changed, -18/+25 lines)
* Fix a bundle of GIN bugs:
  - Fix possible deadlock between UPDATE and VACUUM queries. The bug was never observed in 8.2,
    but it still exists there. HEAD is more sensitive to the bug after the recent buffer-"ring"
    improvements.
  - Fix WAL creation: if the parent page is stored as-is after a split, then the incomplete
    split isn't removed during replay. This happens rather rarely, only on large tables with a
    lot of updates/inserts.
  - Fix WAL replay: there was a wrong test of XLR_BKP_BLOCK_* for the left page after deletion
    of a page. That caused a wrong rightlink field: it pointed to the deleted page.
  - Add a check that clearing of an incomplete split matches.
  - Clean up the incomplete-split list after processing.
  None of these changes affect on-disk storage, so backpatch... But the second point may be an
  issue when replaying logs from a previous version.
  (Teodor Sigaev, 2007-06-04, 5 files changed, -54/+146 lines)
* Clarify some error messages about duplicate things.
  (Peter Eisentraut, 2007-06-03, 2 files changed, -4/+4 lines)
* Fix several hash functions that were taking chintzy shortcuts instead of
  delivering a well-randomized hash value. I got religion on this after observing that
  performance of multi-batch hash join degrades terribly if the higher-order bits of hash values
  aren't random, as indeed was true for say hashes of small integer values. It's now expected
  and documented that hash functions should use hash_any or some comparable method to ensure
  that all bits of their output are about equally random. initdb forced because this change
  invalidates existing hash indexes. For the same reason, this isn't back-patchable; the hash
  join performance problem will get a band-aid fix in the back branches.
  (Tom Lane, 2007-06-01, 1 file changed, -10/+42 lines)
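  A small self-contained demonstration of the problem (the mixer below is a generic 32-bit
  finalizer, not PostgreSQL's hash_any): an identity-style hash of small integers leaves the
  high-order bits all zero, so anything that picks batches or buckets from those bits puts every
  key in one batch, while a full-avalanche mix spreads them out.

      /* Why identity-style hashes are bad when batches come from the high-order bits. */
      #include <stdint.h>
      #include <stdio.h>

      /* Identity "hash": roughly what a chintzy int4 hash amounted to. */
      static uint32_t hash_identity(uint32_t x) { return x; }

      /* Generic 32-bit finalizer-style mixer (full avalanche); illustrative only. */
      static uint32_t hash_mixed(uint32_t x)
      {
          x ^= x >> 16;  x *= 0x85ebca6bU;
          x ^= x >> 13;  x *= 0xc2b2ae35U;
          x ^= x >> 16;
          return x;
      }

      int main(void)
      {
          /* Pick a "batch" from the top 4 bits of the hash value, as a hash join might. */
          int ident_batches[16] = {0}, mixed_batches[16] = {0};
          for (uint32_t key = 0; key < 1000; key++)
          {
              ident_batches[hash_identity(key) >> 28]++;
              mixed_batches[hash_mixed(key) >> 28]++;
          }
          printf("batch: identity  mixed\n");
          for (int b = 0; b < 16; b++)
              printf("%5d: %8d %6d\n", b, ident_batches[b], mixed_batches[b]);
          /* The identity hash lands all 1000 keys in batch 0; the mixed hash spreads them. */
          return 0;
      }

  The actual fix routes the datatype hash functions through hash_any (or something comparable)
  so that every bit of the output is about equally random.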
* Make some messages more consistent
  (Peter Eisentraut, 2007-05-31, 1 file changed, -2/+2 lines)
* Replace ReadBuffer with ReadBufferWithStrategy in all vacuum-involved places
  to implement a limited-size "ring" of buffers for VACUUM for GIN & GIST.
  (Teodor Sigaev, 2007-05-31, 2 files changed, -16/+21 lines)
* Downgrade some low-level startup messages to DEBUG1.
  (Peter Eisentraut, 2007-05-31, 1 file changed, -6/+6 lines)
* Fix overly-strict sanity check in BeginInternalSubTransaction that made it
  fail when used in a deferred trigger. Bug goes back to 8.0; no doubt the reason it hadn't been
  noticed is that we've been discouraging use of user-defined constraint triggers.
  Per report from Frank van Vugt.
  (Tom Lane, 2007-05-30, 1 file changed, -7/+8 lines)
* Make large sequential scans and VACUUMs work in a limited-size "ring" of
  buffers, rather than blowing out the whole shared-buffer arena. Aside from avoiding cache
  spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every
  page it modified, because we had it hacked to use only a single buffer. Those flushes will now
  occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch
  into the ring usage pattern, remain under debate; but the infrastructure seems done.

  The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed
  to ReadBuffer operations; this replaces the former StrategyHintVacuum API.

  This patch also changes the buffer usage-count methodology a bit: we now advance usage_count
  when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that
  a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to
  not decrement usage_count of pinned buffers.

  Work not done in this commit: teach GiST and GIN indexes to use the vacuum
  BufferAccessStrategy for vacuum-driven fetches.

  Original patch by Simon, reworked by Heikki and again by Tom.
  (Tom Lane, 2007-05-30, 6 files changed, -29/+154 lines)
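  A stripped-down sketch of the ring idea (a made-up structure, not the real BufferAccessStrategy
  API): a bulk operation cycles through a small fixed set of buffer slots instead of claiming
  ever more of the shared pool, so a big seqscan or VACUUM can only ever occupy its own handful
  of buffers.

      /* Toy buffer ring for bulk reads; not PostgreSQL source. */
      #include <stdio.h>

      #define RING_SIZE 4

      typedef struct
      {
          int slots[RING_SIZE];   /* buffer IDs owned by this strategy; -1 = none claimed yet */
          int next;               /* next ring position to reuse */
      } RingStrategy;

      static int next_free_buffer = 0;   /* stand-in for the shared buffer allocator */

      /* Get a buffer for the next page: reuse the ring slot if one is already claimed. */
      static int ring_get_buffer(RingStrategy *ring)
      {
          int pos = ring->next;
          ring->next = (ring->next + 1) % RING_SIZE;
          if (ring->slots[pos] == -1)
              ring->slots[pos] = next_free_buffer++;   /* claim a shared buffer once per slot */
          return ring->slots[pos];
      }

      int main(void)
      {
          RingStrategy ring = { { -1, -1, -1, -1 }, 0 };
          for (int page = 0; page < 10; page++)
              printf("page %d -> buffer %d\n", page, ring_get_buffer(&ring));
          /* Only RING_SIZE distinct buffers are ever used, however many pages get scanned. */
          return 0;
      }

  In the real patch this behavior sits behind the optional BufferAccessStrategy argument to the
  ReadBuffer operations, replacing the former StrategyHintVacuum API.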
* Fix up pgstats counting of live and dead tuples to recognize that committed
  and aborted transactions have different effects; also teach it not to assume that prepared
  transactions are always committed. Along the way, simplify the pgstats API by tying counting
  directly to Relations; I cannot detect any redeeming social value in having stats pointers in
  HeapScanDesc and IndexScanDesc structures. And fix a few corner cases in which counts might be
  missed because the relation's pgstat_info pointer hadn't been set.
  (Tom Lane, 2007-05-27, 10 files changed, -48/+56 lines)
* To support external compression of archived WAL data, add a flag bit to
  WAL records that shows whether it is safe to remove full-page images (ie, whether or not an
  on-line backup was in progress when the WAL entry was made). Also make provision for an
  XLOG_NOOP record type that can be used to fill in the extra space when decompressing the data
  for restore. This is the portion of Koichi Suzuki's "full page writes" patch that has to go
  into the core database. The remainder of that work is two external compression and
  decompression programs, which for the time being will undergo separate development on
  pgfoundry. Per discussion. Also, twiddle the handling of BTREE_SPLIT records to ensure it'll
  be possible to compress them (the previous coding caused essential info to be omitted). The
  other commonly-used record types seem OK already, with the possible exception of GIN and GIST
  WAL records, which I don't understand well enough to opine on.
  (Tom Lane, 2007-05-20, 3 files changed, -11/+38 lines)
* Move the tuple freezing point in CLUSTER to a point further back in the past,
  to avoid losing useful Xid information in not-so-old tuples. This makes CLUSTER behave the
  same as VACUUM as far as tuple-freezing behavior goes (though CLUSTER does not yet advance the
  table's relfrozenxid). While at it, move the actual freezing operation in rewriteheap.c to a
  more appropriate place, and document it thoroughly. This part of the patch is from Tom Lane.
  (Alvaro Herrera, 2007-05-17, 1 file changed, -5/+20 lines)
* Have the rewriteheap code freeze old tuples. This is safe because it is only
  applied to live tuples older than a recent Xmin, not to tuples that may be part of an update
  chain. Those still keep their original markings. This patch makes it possible for CLUSTER to
  advance relfrozenxid, thus avoiding the need of vacuuming the table for Xid wraparound
  purposes. That will be patched separately. Patch from Heikki Linnakangas.
  (Alvaro Herrera, 2007-05-16, 1 file changed, -1/+3 lines)
* Tweak hash index AM to use the new ReadOrZeroBuffer bufmgr API when fetching
  pages it intends to zero immediately. Just to show there is some use for that function besides
  WAL recovery :-). Along the way, fold _hash_checkpage and _hash_pageinit calls into
  _hash_getbuf and friends, instead of expecting callers to do that separately.
  (Tom Lane, 2007-05-03, 6 files changed, -80/+103 lines)
* During WAL recovery, when reading a page that we intend to overwrite completely
  from the WAL data, don't bother to physically read it; just have bufmgr.c return a zeroed-out
  buffer instead. This speeds recovery significantly, and also avoids unnecessary failures when
  a page-to-be-overwritten has corrupt page headers on disk. This replaces a former kluge that
  accomplished the latter by pretending zero_damaged_pages was always ON during WAL recovery;
  which was OK when the kluge was put in, but is unsafe when restoring a WAL log that was
  written with full_page_writes off. Heikki Linnakangas
  (Tom Lane, 2007-05-02, 1 file changed, -3/+8 lines)
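  The gist of the optimization, sketched with a hypothetical reader function: when replay is
  about to restore a full-page image, the old page contents are irrelevant, so the buffer
  manager can hand back a zero-filled buffer and skip both the physical read and any header
  check on possibly-corrupt on-disk data.

      /* Sketch of "read or zero": skip the disk read when the page will be overwritten. */
      #include <stdbool.h>
      #include <stdio.h>
      #include <string.h>

      #define PAGE_SIZE 8192

      /* Stand-in for a real disk read, which could also trip over a corrupt page header. */
      static void read_page_from_disk(char *page)
      {
          memset(page, 0xAA, PAGE_SIZE);   /* pretend this came from disk */
      }

      static void get_page_buffer(char *page, bool will_overwrite_completely)
      {
          if (will_overwrite_completely)
              memset(page, 0, PAGE_SIZE);  /* just zero it: no I/O, no risk from bad data */
          else
              read_page_from_disk(page);
      }

      int main(void)
      {
          char page[PAGE_SIZE];
          get_page_buffer(page, true);            /* replay has a full-page image to apply */
          printf("first byte: %d\n", page[0]);    /* 0: no disk read happened */
          return 0;
      }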
* Change the timestamps recorded in transaction commit/abort xlog records
  from time_t to TimestampTz representation. This provides full gettimeofday() resolution of the
  timestamps, which might be useful when attempting to do point-in-time recovery --- previously
  it was not possible to specify the stop point with sub-second resolution. But mostly this is
  to get rid of TimestampTz-to-time_t conversion overhead during commit. Per my proposal of a
  day or two back.
  (Tom Lane, 2007-04-30, 3 files changed, -35/+31 lines)
* Implement rate-limiting logic on how often backends will attempt to send
  messages to the stats collector. This avoids the problem that enabling stats_row_level for
  autovacuum has a significant overhead for short read-only transactions, as noted by Arjen van
  der Meijden. We can avoid an extra gettimeofday call by piggybacking on the one done for
  WAL-logging xact commit or abort (although that doesn't help read-only transactions, since
  they don't WAL-log anything). In my proposal for this, I noted that we could change the WAL
  log entries for commit/abort to record full TimestampTz precision, instead of only time_t as
  at present. That's not done in this patch, but will be committed separately.
  (Tom Lane, 2007-04-30, 1 file changed, -4/+33 lines)
* Fix dynahash.c to suppress hash bucket splits while a hash_seq_search() scan
  is in progress on the same hashtable. This seems the least invasive way to fix the
  recently-recognized problem that a split could cause the scan to visit entries twice or (with
  much lower probability) miss them entirely. The only field-reported problem caused by this is
  the "failed to re-find shared lock object" PANIC in COMMIT PREPARED reported by Michel
  Dorochevsky, which was caused by multiply visited entries. However, it seems certain that
  mdsync() is vulnerable to missing required fsync's due to missed entries, and I am fearful
  that RelationCacheInitializePhase2() might be at risk as well. Because of that and the
  generalized hazard presented by this bug, back-patch all the supported branches.

  Along the way, fix pg_prepared_statement() and pg_cursor() to not assume that the hashtables
  they are examining will stay static between calls. This is risky regardless of the newly noted
  dynahash problem, because hash_seq_search() has never promised to cope with deletion of table
  entries other than the just-returned one. There may be no bug here because the only supported
  way to call these functions is via ExecMakeTableFunctionResult() which will cycle them to
  completion before doing anything very interesting, but it seems best to get rid of the
  assumption. This affects 8.2 and HEAD only, since those functions weren't there earlier.
  (Tom Lane, 2007-04-26, 1 file changed, -1/+6 lines)
* Repair PANIC condition in hash indexes when a previous index extension attempt
  failed (due to lock conflicts or out-of-space). We might have already extended the index's
  filesystem EOF before failing, causing the EOF to be beyond what the metapage says is the last
  used page. Hence the invariant maintained by the code needs to be "EOF is at or beyond last
  used page", not "EOF is exactly the last used page". Problem was created by my patch of
  2006-11-19 that attempted to repair bug #2737. Since that was back-patched to 7.4, this needs
  to be as well. Per report and test case from Vlastimil Krejcir.
  (Tom Lane, 2007-04-19, 3 files changed, -87/+132 lines)
* Fix condition for whether end_heap_rewrite must fsync, per Heikki.
  (Tom Lane, 2007-04-17, 1 file changed, -4/+11 lines)
* Don't assume rd_smgr stays open across all of a rewriteheap operation;
  doing so can result in a crash if an sinval reset occurs meanwhile. I believe this explains
  intermittent buildfarm failures in the cluster test.
  (Tom Lane, 2007-04-17, 1 file changed, -2/+3 lines)
* Code review for btree page split WAL reduction patch. Make it actually work
  (original code *always* created a full-page image for the left page, thus leaving the intended
  savings unrealized), avoid risk of not having enough room on the page during xlog restore,
  squeeze out another couple bytes in the xlog record, clean up neglected comments.
  (Tom Lane, 2007-04-11, 2 files changed, -106/+141 lines)
* Minor tweaking of index special-space definitions so that the various
  index types can be reliably distinguished by examining the special space on an index page. Per
  my earlier proposal, plus the realization that there's no need for btree's vacuum cycle ID to
  cycle through every possible 16-bit value. Restricting its range a little costs nearly nothing
  and eliminates the possibility of collisions. Memo to self: remember to make bitmap indexes
  play along with this scheme, assuming that patch ever gets accepted.
  (Tom Lane, 2007-04-09, 4 files changed, -16/+19 lines)
* Make CLUSTER MVCC-safe. Heikki Linnakangas
  (Tom Lane, 2007-04-08, 4 files changed, -30/+682 lines)
* Make 'col IS NULL' clauses be indexable conditions.
  Teodor Sigaev, with some kibitzing from Tom Lane.
  (Tom Lane, 2007-04-06, 4 files changed, -47/+121 lines)
* Support varlena fields with single-byte headers and unaligned storage.
  This commit breaks any code that assumes that the mere act of forming a tuple (without writing
  it to disk) does not "toast" any fields. While all available regression tests pass, I'm not
  totally sure that we've fixed every nook and cranny, especially in contrib.
  Greg Stark with some help from Tom Lane
  (Tom Lane, 2007-04-06, 3 files changed, -333/+752 lines)
* Remove the CheckpointStartLock in favor of having backends show whether they
  are in their commit critical sections via flags in the ProcArray. Checkpoint can watch the
  ProcArray to determine when it's safe to proceed. This is a considerably better solution to
  the original problem of race conditions between checkpoint and transaction commit: it speeds
  up commit, since there's one less lock to fool with, and it prevents the problem of checkpoint
  being delayed indefinitely when there's a constant flow of commits.
  Heikki, with some kibitzing from Tom.
  (Tom Lane, 2007-04-03, 3 files changed, -47/+78 lines)
* Decouple the values of TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE.
  Add the latter to the values checked in pg_control, since it can't be changed without
  invalidating toast table content. This commit in itself shouldn't change any behavior, but it
  lays some necessary groundwork for experimentation with these toast-control numbers. Note:
  while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some thought still needs to be
  given to needs_toast_table() in toasting.c before unleashing random changes.
  (Tom Lane, 2007-04-03, 3 files changed, -16/+58 lines)
* Support enum data types. Along the way, use macros for the values of
  pg_type.typtype wherever practical. Tom Dunstan, with some kibitzing from Tom Lane.
  (Tom Lane, 2007-04-02, 1 file changed, -1/+7 lines)
* Fix oversight in coding of _bt_start_vacuum: we can't assume that the LWLock
  will be released by transaction abort before _bt_end_vacuum gets called. If either of these
  "can't happen" errors actually happened, we'd freeze up trying to acquire an already-held
  lock. Latest word is that this does not explain Martin Pitt's trouble report, but it still
  looks like a bug.
  (Tom Lane, 2007-03-30, 1 file changed, -1/+13 lines)
* Teach CLUSTER to skip writing WAL if not needed (ie, not using archiving)
  --- Simon. Also, code review and cleanup for the previous COPY-no-WAL patches --- Tom.
  (Tom Lane, 2007-03-29, 2 files changed, -38/+59 lines)