1 files changed, 2900 insertions, 0 deletions
diff --git a/doc/TODO.detail/pglog b/doc/TODO.detail/pglog
new file mode 100644
index 0000000000..1810a8911f
--- /dev/null
+++ b/doc/TODO.detail/pglog
@@ -0,0 +1,2900 @@
+From aoki@postgres.Berkeley.EDU Sun Jun 22 19:31:06 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id TAA19488
+	for <maillist@candle.pha.pa.us>; Sun, 22 Jun 1997 19:31:03 -0400 (EDT)
+Received: from faerie.CS.Berkeley.EDU (faerie.CS.Berkeley.EDU [128.32.37.53]) by renoir.op.net ($ Revision: 1.12 $) with SMTP id TAA18795 for <maillist@candle.pha.pa.us>; Sun, 22 Jun 1997 19:18:06 -0400 (EDT)
+Received: from localhost.Berkeley.EDU (localhost.Berkeley.EDU [127.0.0.1]) by faerie.CS.Berkeley.EDU (8.6.10/8.6.3) with SMTP id QAA07816 for maillist@candle.pha.pa.us; Sun, 22 Jun 1997 16:16:44 -0700
+Message-Id: <199706222316.QAA07816@faerie.CS.Berkeley.EDU>
+X-Authentication-Warning: faerie.CS.Berkeley.EDU: Host localhost.Berkeley.EDU didn't use HELO protocol
+From: aoki@CS.Berkeley.EDU (Paul M. Aoki)
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+Subject: Re: PostgreSQL psort() function performance 
+Reply-To: aoki@CS.Berkeley.EDU (Paul M. Aoki)
+In-reply-to: Your message of Sun, 22 Jun 1997 09:45:31 -0400 (EDT) 
+	     <199706221345.JAA11476@candle.pha.pa.us> 
+Date: Sun, 22 Jun 97 16:16:43 -0700
+Sender: aoki@postgres.Berkeley.EDU
+X-Mts: smtp
+Status: OR
+
+the mariposa distribution (http://mariposa.cs.berkeley.edu/) contains
+some hacks to nodeSort.c and psort.c that 
+	- make psort read directly from the executor node below it
+	(instead of an input relation)
+	- makes the Sort node read directly from the last set of psort runs
+	(instead of an output relation)
+speeds things up quite a bit.  kind of ruins psort for other purposes,
+though (which is why nbtsort.c exists).
+
+i'd merge these in first and see how far that gets you.
+--
+  Paul M. Aoki         | University of California at Berkeley
+  aoki@CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
+                       | Berkeley, CA 94720-1776
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 09:31:04 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01676
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:31:02 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id JAA07345 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:13:20 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA13315; Mon, 3 Nov 1997 08:50:26 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:48:07 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA11722 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:48:02 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA11539 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 08:47:34 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id UAA19066; Mon, 3 Nov 1997 20:48:04 +0700 (KRS)
+Message-ID: <345DD614.345BF651@sable.krasnoyarsk.su>
+Date: Mon, 03 Nov 1997 20:48:04 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Marc Howard Zuckman <marc@fallon.classyad.com>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <Pine.LNX.3.95.971103090709.21917A-100000@fallon.classyad.com>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Marc Howard Zuckman wrote:
+> 
+> On Mon, 3 Nov 1997, Bruce Momjian wrote:
+> 
+> > With fsync off, I just did an insert of 1000 integers into a table
+> > containing a single int4 column and no indexes, and it completed in 2.3
+> > seconds.  This is on the new source tree..  That is 434 inserts/second.
+> > Pretty major performance, or 2.3 ms/insert.  This is on a idle PP200
+> > with UltraSCSI drives.
+> >
+> > With fsync on, the time goes to 51 seconds.  Wow, big difference.
+> 
+> If better alternative error recovery methods were available, perhaps
+> a facility to replay an interval transactions log from a prior dump,
+> it would be reasonable to run the backend without fsync and
+> take advantage of the performance gains.
+
+???
+
+> 
+> I don't know the answer, but I suspect that the commercial databases
+> don't "fsync" the way pgsql does.
+
+Could someone try 1000 int4 inserts using postgres and 
+some commercial database (on the same machine) ?
+
+Vadim
+
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 09:01:02 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01183
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 09:01:00 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id IAA06632 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 08:51:58 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA05964; Mon, 3 Nov 1997 08:39:39 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:37:32 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA04729 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:37:26 -0500 (EST)
+Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA04614 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 08:37:16 -0500 (EST)
+Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id JAA22108; Mon, 3 Nov 1997 09:11:09 -0500
+Date: Mon, 3 Nov 1997 09:11:09 -0500 (EST)
+From: Marc Howard Zuckman <marc@fallon.classyad.com>
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+cc: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+In-Reply-To: <199711030513.AAA23474@candle.pha.pa.us>
+Message-ID: <Pine.LNX.3.95.971103090709.21917A-100000@fallon.classyad.com>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+On Mon, 3 Nov 1997, Bruce Momjian wrote:
+
+> > 
+> > Removed...
+> > 
+> > Also, ItemPointerData t_chain (6 bytes) removed from HeapTupleHeader.
+> > CommandId is uint32 now (up to the 2^32 - 1 commands per transaction).
+> > DOUBLEALIGN(Sizeof(HeapTupleHeader)) is 40 bytes now.
+> > 
+> > 1000 inserts (into table with single int4 column, 1 insert per transaction) 
+> > takes 70 - 80 sec now (12.5 - 14 transactions/sec).
+> > This is hardware/OS limitation: 
+> > 
+> >     fd = open ("t", O_RDWR);
+> >     for (i = 1; i <= 1000; i++)
+> >     {
+> >         lseek(fd, 0, SEEK_END);
+> >         write(fd, buf, 56);
+> >         fsync(fd);
+> >     }
+> >     close (fd);
+> > 
+> > takes 33 - 39 sec and so it's not possible to be faster
+> > having 2 fsync-s per transaction.
+> > 
+> > The same test on 6.2.1: 92 - 107 sec
+> 
+> With fsync off, I just did an insert of 1000 integers into a table
+> containing a single int4 column and no indexes, and it completed in 2.3
+> seconds.  This is on the new source tree..  That is 434 inserts/second. 
+> Pretty major performance, or 2.3 ms/insert.  This is on a idle PP200
+> with UltraSCSI drives.
+> 
+> With fsync on, the time goes to 51 seconds.  Wow, big difference.
+
+If better alternative error recovery methods were available, perhaps
+a facility to replay an interval transactions log from a prior dump,
+it would be reasonable to run the backend without fsync and
+take advantage of the performance gains.
+
+I don't know the answer, but I suspect that the commercial databases
+don't "fsync" the way pgsql does.
+
+Marc Zuckman
+marc@fallon.classyad.com
+
+_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
+_     Visit The Home and Condo MarketPlace		      _
+_          http://www.ClassyAd.com			      _
+_							      _
+_  FREE basic property listings/advertisements and searches.  _
+_							      _
+_  Try our premium, yet inexpensive services for a real	      _
+_   selling or buying edge!				      _
+_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
+
+
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 11:31:03 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA04080
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 11:31:00 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA13680 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 11:21:30 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA07566; Mon, 3 Nov 1997 11:04:52 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:02:59 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA07372 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:02:52 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA07196 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 11:02:22 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id KAA02525;
+	Mon, 3 Nov 1997 10:42:03 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711031542.KAA02525@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Mon, 3 Nov 1997 10:42:03 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <345DD614.345BF651@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 3, 97 08:48:04 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> > I don't know the answer, but I suspect that the commercial databases
+> > don't "fsync" the way pgsql does.
+> 
+> Could someone try 1000 int4 inserts using postgres and 
+> some commercial database (on the same machine) ?
+
+I have been thinking about this since seeing the performance change
+with/without fsync.
+
+Commerical databases usually do a log write every 5 or 15 minutes, and
+guarantee the logs will contain everything up to this time interval.
+
+Couldn't we have some such mechanism?  Usually they have raw space, so
+they can control when the data is hitting the disk.   Using a file
+system, some of it may be getting to the disk without our knowing it.
+
+What exactly is a scenario where lack of doing explicit fsync's will
+cause data corruption, rather than just lost data from the past few
+minutes?
+
+I think Vadim has gotten fsync's down to fsync'ing the modified data
+page, and pg_log.
+
+Let's suppose we did not fsync.  There could be cases where pg_log was
+fsync'ed by the OS, and some of the modified data pages are fyncs'ed by
+the OS, but not others.  This would leave us with a partial transaction.
+
+However, let's suppose we prevent pg_log from being fsync'ed somehow. 
+Then, because we have a no-overwrite database, we could keep control of
+this, and write of some data pages, but not others would not cause us
+problems because the pg_log would show all such transactions, which had
+not had all their modified data pages fsync'ed, as non-committed.
+
+Perhaps we can even set a flag in pg_log every five minutes to indicate
+whether all buffers for the page have been flushed?  That way we could
+not have to worry about preventing flushing of pg_log.
+
+Comments?
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 12:00:42 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA04456
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 12:00:40 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA26054; Mon, 3 Nov 1997 11:46:49 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:46:33 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA25932 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:46:30 -0500 (EST)
+Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id LAA25750 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 11:45:53 -0500 (EST)
+Received: by orion.SAPserv.Hamburg.dsh.de 
+	(Linux Smail3.1.29.1 #1)}
+	id m0xSPfE-000BGZC; Mon, 3 Nov 97 17:47 MET
+Message-Id: <m0xSPfE-000BGZC@orion.SAPserv.Hamburg.dsh.de>
+From: wieck@sapserv.debis.de
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: maillist@candle.pha.pa.us (Bruce Momjian)
+Date: Mon, 3 Nov 1997 17:47:43 +0100 (MET)
+Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
+        hackers@postgreSQL.org
+Reply-To: wieck@sapserv.debis.de (Jan Wieck)
+In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=iso-8859-1
+Content-Transfer-Encoding: 8bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+>
+> > > I don't know the answer, but I suspect that the commercial databases
+> > > don't "fsync" the way pgsql does.
+> >
+> > Could someone try 1000 int4 inserts using postgres and
+> > some commercial database (on the same machine) ?
+>
+> I have been thinking about this since seeing the performance change
+> with/without fsync.
+>
+> Commerical databases usually do a log write every 5 or 15 minutes, and
+> guarantee the logs will contain everything up to this time interval.
+>
+
+    Without  fsync  PostgreSQL  would  only  loose data if the OS
+    crashes between the last write operation of a backend and the
+    next regular update sync. This is seldom but if it happens it
+    really hurts.
+
+    A database can omit fsync on data files (e.g. tablespaces) if
+    it  writes  a  redo  log. With that redo log, a backup can be
+    restored and than all transactions since the backup redone.
+
+    PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
+    after  the fsync of pg_log could corrupt the database without
+    a chance to recover.
+
+    Isn't it time to get an (optional) redo log. I don't  exactly
+    know all the places where our datafiles can get modified, but
+    I hope this is only done  in  the  heap  access  methods  and
+    vacuum.  So these are the places from where the redo log data
+    comes from (plus transaction commit/rollback).
+
+
+Until later, Jan
+
+--
+#define OPINIONS "they are all mine - not those of debis or daimler-benz"
+
+#======================================================================#
+# It's easier to get forgiveness for being wrong than for being right. #
+# Let's break this rule - forgive me.                                  #
+#================================== wieck@sapserv.debis.de (Jan Wieck) #
+
+
+
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 14:01:06 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA06775
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 14:01:04 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA22235 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 13:43:15 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA11482; Mon, 3 Nov 1997 13:32:40 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 13:32:02 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA11204 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 13:31:58 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA11119 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 13:31:44 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id MAA05464;
+	Mon, 3 Nov 1997 12:59:01 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711031759.MAA05464@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: wieck@sapserv.debis.de
+Date: Mon, 3 Nov 1997 12:59:01 -0500 (EST)
+Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
+        hackers@postgreSQL.org
+In-Reply-To: <m0xSPfE-000BGZC@orion.SAPserv.Hamburg.dsh.de> from "wieck@sapserv.debis.de" at Nov 3, 97 05:47:43 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> 
+> >
+> > > > I don't know the answer, but I suspect that the commercial databases
+> > > > don't "fsync" the way pgsql does.
+> > >
+> > > Could someone try 1000 int4 inserts using postgres and
+> > > some commercial database (on the same machine) ?
+> >
+> > I have been thinking about this since seeing the performance change
+> > with/without fsync.
+> >
+> > Commerical databases usually do a log write every 5 or 15 minutes, and
+> > guarantee the logs will contain everything up to this time interval.
+> >
+> 
+>     Without  fsync  PostgreSQL  would  only  loose data if the OS
+>     crashes between the last write operation of a backend and the
+>     next regular update sync. This is seldom but if it happens it
+>     really hurts.
+> 
+>     A database can omit fsync on data files (e.g. tablespaces) if
+>     it  writes  a  redo  log. With that redo log, a backup can be
+>     restored and than all transactions since the backup redone.
+> 
+>     PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
+>     after  the fsync of pg_log could corrupt the database without
+>     a chance to recover.
+> 
+>     Isn't it time to get an (optional) redo log. I don't  exactly
+>     know all the places where our datafiles can get modified, but
+>     I hope this is only done  in  the  heap  access  methods  and
+>     vacuum.  So these are the places from where the redo log data
+>     comes from (plus transaction commit/rollback).
+> 
+
+Yes, but because we are a non-over-write database, I don't see why we
+can't just do this without a redo log.
+
+Every five minutes, we fsync() all dirty pages, mark all completed
+transactions as fsync'ed in pg_log, and fsync() pg_log.
+
+On postmaster startup, any transaction marked as completed, but not
+marked as fsync'ed gets marked as aborted.
+
+Of course, all vacuum operations would have to be fsync'ed.
+
+Comments?
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Mon Nov  3 16:46:01 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id QAA10292
+	for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 16:45:59 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id QAA02040 for <maillist@candle.pha.pa.us>; Mon, 3 Nov 1997 16:42:40 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id QAA17422; Mon, 3 Nov 1997 16:34:28 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 16:34:10 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id QAA17210 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 16:34:06 -0500 (EST)
+Received: from fallon.classyad.com (root@classyad.com [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id QAA16690 for <hackers@postgreSQL.org>; Mon, 3 Nov 1997 16:33:27 -0500 (EST)
+Received: from fallon.classyad.com (marc@fallon.classyad.com [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id RAA32498; Mon, 3 Nov 1997 17:33:42 -0500
+Date: Mon, 3 Nov 1997 17:33:42 -0500 (EST)
+From: Marc Howard Zuckman <marc@fallon.classyad.com>
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+cc: wieck@sapserv.debis.de, vadim@sable.krasnoyarsk.su, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+In-Reply-To: <199711031759.MAA05464@candle.pha.pa.us>
+Message-ID: <Pine.LNX.3.95.971103173129.32055B-100000@fallon.classyad.com>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+On Mon, 3 Nov 1997, Bruce Momjian wrote:
+
+> > 
+> > >
+> > > > > I don't know the answer, but I suspect that the commercial databases
+> > > > > don't "fsync" the way pgsql does.
+> > > >
+> > > > Could someone try 1000 int4 inserts using postgres and
+> > > > some commercial database (on the same machine) ?
+> > >
+> > > I have been thinking about this since seeing the performance change
+> > > with/without fsync.
+> > >
+> > > Commerical databases usually do a log write every 5 or 15 minutes, and
+> > > guarantee the logs will contain everything up to this time interval.
+> > >
+> > 
+> >     Without  fsync  PostgreSQL  would  only  loose data if the OS
+> >     crashes between the last write operation of a backend and the
+> >     next regular update sync. This is seldom but if it happens it
+> >     really hurts.
+> > 
+> >     A database can omit fsync on data files (e.g. tablespaces) if
+> >     it  writes  a  redo  log. With that redo log, a backup can be
+> >     restored and than all transactions since the backup redone.
+> > 
+> >     PostgreSQL doesn't write such a redo  log.  So  an  OS  crash
+> >     after  the fsync of pg_log could corrupt the database without
+> >     a chance to recover.
+> > 
+> >     Isn't it time to get an (optional) redo log. I don't  exactly
+> >     know all the places where our datafiles can get modified, but
+> >     I hope this is only done  in  the  heap  access  methods  and
+> >     vacuum.  So these are the places from where the redo log data
+> >     comes from (plus transaction commit/rollback).
+> > 
+> 
+> Yes, but because we are a non-over-write database, I don't see why we
+> can't just do this without a redo log.
+
+Because if the hard drive is the reason for the failure (instead of 
+power out, OS bites dust, etc), the database won't be of much help.
+
+The redo log should be on a device different than the database.
+
+Marc Zuckman
+marc@fallon.classyad.com
+
+_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
+_     Visit The Home and Condo MarketPlace		      _
+_          http://www.ClassyAd.com			      _
+_							      _
+_  FREE basic property listings/advertisements and searches.  _
+_							      _
+_  Try our premium, yet inexpensive services for a real	      _
+_   selling or buying edge!				      _
+_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
+
+
+
+From maillist Mon Nov  3 22:59:31 1997
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id WAA16264;
+	Mon, 3 Nov 1997 22:59:31 -0500 (EST)
+From: Bruce Momjian <maillist>
+Message-Id: <199711040359.WAA16264@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: maillist@candle.pha.pa.us (Bruce Momjian)
+Date: Mon, 3 Nov 1997 22:59:30 -0500 (EST)
+Cc: vadim@sable.krasnoyarsk.su, marc@fallon.classyad.com,
+        hackers@postgreSQL.org
+In-Reply-To: <199711031542.KAA02525@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:42:03 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+> 
+> > > I don't know the answer, but I suspect that the commercial databases
+> > > don't "fsync" the way pgsql does.
+> > 
+> > Could someone try 1000 int4 inserts using postgres and 
+> > some commercial database (on the same machine) ?
+> 
+> I have been thinking about this since seeing the performance change
+> with/without fsync.
+> 
+> Commercial databases usually do a log write every 5 or 15 minutes, and
+> guarantee the logs will contain everything up to this time interval.
+> 
+> Couldn't we have some such mechanism?  Usually they have raw space, so
+> they can control when the data is hitting the disk.   Using a file
+> system, some of it may be getting to the disk without our knowing it.
+> 
+> What exactly is a scenario where lack of doing explicit fsync's will
+> cause data corruption, rather than just lost data from the past few
+> minutes?
+> 
+> I think Vadim has gotten fsync's down to fsync'ing the modified data
+> page, and pg_log.
+> 
+> Let's suppose we did not fsync.  There could be cases where pg_log was
+> fsync'ed by the OS, and some of the modified data pages are fyncs'ed by
+> the OS, but not others.  This would leave us with a partial transaction.
+> 
+> However, let's suppose we prevent pg_log from being fsync'ed somehow. 
+> Then, because we have a no-overwrite database, we could keep control of
+> this, and write of some data pages, but not others would not cause us
+> problems because the pg_log would show all such transactions, which had
+> not had all their modified data pages fsync'ed, as non-committed.
+> 
+> Perhaps we can even set a flag in pg_log every five minutes to indicate
+> whether all buffers for the page have been flushed?  That way we could
+> not have to worry about preventing flushing of pg_log.
+> 
+> Comments?
+
+OK, here is a more formal description of what I am suggesting.  It will
+give us commercial dbms reliability with no-fsync performance. 
+Commercial dbms's usually only give restore up to 5 minutes before the
+crash, and this is what I am suggesting.  If we can do this, we can
+remove the no-fsync option.
+
+First, lets suppose there exists a shared queue that is visible to all
+backends and the postmaster that allows transaction id's to be added to
+the queue.  We also add a bit to the pg_log record called 'been_synced'
+that is initially false.
+
+OK, once a backend starts a transaction, it puts a transaction id in
+pg_log.  Once the transaction is finished, it is marked as committed. 
+At the same time, we now put the transaction id on the shared queue.
+
+Every five minutes, or as defined by the administrator, the postmaster
+does a sync() call.  On my OS, anyone use can call sync, and I think
+this is typical.  update/pagecleaner does this every 30 seconds anyway,
+so it is no big deal for the postmaster to call it every 5 minutes.  The
+nice thing about this is that the OS does the syncing of all the dirty
+pages for us.  (An alarm() call can set up this 5 minute timing.)
+
+The postmaster then locks the shared transaction id queue, makes a copy
+of the entries in the queue, clears the queue, and unlocks the queue. 
+It does this so no one else modifies the queue while it is being
+cleared.
+
+The postmaster then goes through pg_log, and marks each transaction as
+'been_synced'.
+
+The postmaster also performs this on shutdown.
+
+On postmaster startup, all transactions are checked and any transaction
+that is marked as committed but not 'been_synced' is marked as not
+committed.  In this way, we prevent non-synced or partially synced
+transactions from being used.
+
+Of course, vacuum would have to do normal fsyncs because it is removing
+the transaction log.
+
+We need the shared transaction id queue because there is no way to find
+the newly committed transactions since the last sync.  A transaction
+can last for hours.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 02:13:08 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id CAA17544
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 02:13:06 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id CAA14126; Tue, 4 Nov 1997 02:07:55 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 02:04:59 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id CAA12859 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 02:04:51 -0500 (EST)
+Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id CAA12625 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 02:04:12 -0500 (EST)
+Received: by orion.SAPserv.Hamburg.dsh.de 
+	(Linux Smail3.1.29.1 #1)}
+	id m0xSd44-000BFQC; Tue, 4 Nov 97 08:06 MET
+Message-Id: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de>
+From: wieck@sapserv.debis.de
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: maillist@candle.pha.pa.us (Bruce Momjian)
+Date: Tue, 4 Nov 1997 08:06:16 +0100 (MET)
+Cc: maillist@candle.pha.pa.us, vadim@sable.krasnoyarsk.su,
+        marc@fallon.classyad.com, hackers@postgreSQL.org
+Reply-To: wieck@sapserv.debis.de (Jan Wieck)
+In-Reply-To: <199711040359.WAA16264@candle.pha.pa.us> from "Bruce Momjian" at Nov 3, 97 10:59:30 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=iso-8859-1
+Content-Transfer-Encoding: 8bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> OK, here is a more formal description of what I am suggesting.  It will
+> give us commercial dbms reliability with no-fsync performance.
+> Commercial dbms's usually only give restore up to 5 minutes before the
+> crash, and this is what I am suggesting.  If we can do this, we can
+> remove the no-fsync option.
+
+    I'm not 100% sure but as far as I know Oracle, it can recover
+    up to the last committed transaction using  the  online  redo
+    logs.   And even if commercial dbms's aren't able to do that,
+    it should be our target.
+
+> [description about transaction queue]
+
+    This all  depends  on  the  fact  that  PostgreSQL  is  a  no
+    overwrite  dbms.  Otherwise the space of deleted tuples might
+    get overwritten by later transactions and the information  is
+    finally lost.
+
+    Another  issue:  All we up to now though of are crashes where
+    the database files are still usable after restart.  But  take
+    the  simple  case  of a write error. A new bad block or track
+    will get remapped (in some way) but the data in it  is  lost.
+    So  we  end  up  with  one or more totally corrupted database
+    files. And I don't trust mirrored  disks  farer  than  I  can
+    throw  them.  A  bug  in the OS or a memory failure (many new
+    PeeCee boards don't support parity and even with parity a two
+    bit  failure  is still the wrong data but with a valid parity
+    bit) can also currupt the data.
+
+    I still prefer redo logs. They should reside on  a  different
+    disk  and the possibility of loosing the database files along
+    with the redo log is very small.
+
+
+Until later, Jan
+
+--
+#define OPINIONS "they are all mine - not those of debis or daimler-benz"
+
+#======================================================================#
+# It's easier to get forgiveness for being wrong than for being right. #
+# Let's break this rule - forgive me.                                  #
+#================================== wieck@sapserv.debis.de (Jan Wieck) #
+
+
+
+
+From vadim@sable.krasnoyarsk.su Tue Nov  4 04:12:50 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18487
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:12:48 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id EAA03152 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:12:06 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su>
+Date: Tue, 04 Nov 1997 16:14:05 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711040359.WAA16264@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> OK, here is a more formal description of what I am suggesting.  It will
+> give us commercial dbms reliability with no-fsync performance.
+> Commercial dbms's usually only give restore up to 5 minutes before the
+                                      ^^^^^^^^^^^^^^^^^^^^^^^
+I'm sure that this is not true!
+If on-line redo_file is damaged then you have
+single ability: restore your last backup.
+In all other cases database will be recovered up to the last
+committed transaction automatically!
+
+DBMS-s using WAL have to fsync only redo file on commit
+(and they do it!), non-overwriting systems have to
+fsync data files and transaction log. 
+
+We could optimize fsync-s for multi-user environment: do not
+fsync when we're ensured that our changes flushed to disk by
+another backend.
+
+> crash, and this is what I am suggesting.  If we can do this, we can
+> remove the no-fsync option.
+> 
+...
+> 
+> On postmaster startup, all transactions are checked and any transaction
+> that is marked as committed but not 'been_synced' is marked as not
+> committed.  In this way, we prevent non-synced or partially synced
+> transactions from being used.
+
+And what should users (ensured that their transaction are
+committed) do in this case ?
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 04:21:04 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18536
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 04:21:01 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id EAA15551; Tue, 4 Nov 1997 04:15:15 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 04:14:23 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id EAA14464 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 04:14:18 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id EAA13437 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 04:13:33 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS)
+Message-ID: <345EE75D.398A68D@sable.krasnoyarsk.su>
+Date: Tue, 04 Nov 1997 16:14:05 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711040359.WAA16264@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> OK, here is a more formal description of what I am suggesting.  It will
+> give us commercial dbms reliability with no-fsync performance.
+> Commercial dbms's usually only give restore up to 5 minutes before the
+                                      ^^^^^^^^^^^^^^^^^^^^^^^
+I'm sure that this is not true!
+If on-line redo_file is damaged then you have
+single ability: restore your last backup.
+In all other cases database will be recovered up to the last
+committed transaction automatically!
+
+DBMS-s using WAL have to fsync only redo file on commit
+(and they do it!), non-overwriting systems have to
+fsync data files and transaction log. 
+
+We could optimize fsync-s for multi-user environment: do not
+fsync when we're ensured that our changes flushed to disk by
+another backend.
+
+> crash, and this is what I am suggesting.  If we can do this, we can
+> remove the no-fsync option.
+> 
+...
+> 
+> On postmaster startup, all transactions are checked and any transaction
+> that is marked as committed but not 'been_synced' is marked as not
+> committed.  In this way, we prevent non-synced or partially synced
+> transactions from being used.
+
+And what should users (ensured that their transaction are
+committed) do in this case ?
+
+Vadim
+
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 06:43:00 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id GAA19743
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 06:42:57 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id GAA10352; Tue, 4 Nov 1997 06:36:08 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 06:35:42 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id GAA10158 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 06:35:37 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id GAA10096 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 06:35:27 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id GAA19665;
+	Tue, 4 Nov 1997 06:35:10 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711041135.GAA19665@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: wieck@sapserv.debis.de
+Date: Tue, 4 Nov 1997 06:35:10 -0500 (EST)
+Cc: hackers@postgreSQL.org (PostgreSQL-development)
+In-Reply-To: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de> from "wieck@sapserv.debis.de" at Nov 4, 97 08:06:16 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> 
+> > OK, here is a more formal description of what I am suggesting.  It will
+> > give us commercial dbms reliability with no-fsync performance.
+> > Commercial dbms's usually only give restore up to 5 minutes before the
+> > crash, and this is what I am suggesting.  If we can do this, we can
+> > remove the no-fsync option.
+> 
+>     I'm not 100% sure but as far as I know Oracle, it can recover
+>     up to the last committed transaction using  the  online  redo
+>     logs.   And even if commercial dbms's aren't able to do that,
+>     it should be our target.
+> 
+> > [description about transaction queue]
+> 
+>     This all  depends  on  the  fact  that  PostgreSQL  is  a  no
+>     overwrite  dbms.  Otherwise the space of deleted tuples might
+>     get overwritten by later transactions and the information  is
+>     finally lost.
+> 
+>     Another  issue:  All we up to now though of are crashes where
+>     the database files are still usable after restart.  But  take
+>     the  simple  case  of a write error. A new bad block or track
+>     will get remapped (in some way) but the data in it  is  lost.
+>     So  we  end  up  with  one or more totally corrupted database
+>     files. And I don't trust mirrored  disks  farer  than  I  can
+>     throw  them.  A  bug  in the OS or a memory failure (many new
+>     PeeCee boards don't support parity and even with parity a two
+>     bit  failure  is still the wrong data but with a valid parity
+>     bit) can also currupt the data.
+> 
+>     I still prefer redo logs. They should reside on  a  different
+>     disk  and the possibility of loosing the database files along
+>     with the redo log is very small.
+
+I have been thinking about re-do logs, and I think it is a good idea. 
+It would not be hard to have the queries spit out to a separate file
+configurable by the user.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 07:31:01 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22051
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:30:59 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07444 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:25:14 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08818; Tue, 4 Nov 1997 07:03:30 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:02:44 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08418 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:29 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08331 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 07:02:07 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id GAA21484;
+	Tue, 4 Nov 1997 06:50:24 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711041150.GAA21484@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Tue, 4 Nov 1997 06:50:24 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> 
+> Bruce Momjian wrote:
+> > 
+> > OK, here is a more formal description of what I am suggesting.  It will
+> > give us commercial dbms reliability with no-fsync performance.
+> > Commercial dbms's usually only give restore up to 5 minutes before the
+>                                       ^^^^^^^^^^^^^^^^^^^^^^^
+> I'm sure that this is not true!
+
+You may be right.  This five minute figure is when you restore from your
+previous backup, then restore from the log file.
+
+Can't we do something like sync every 5 seconds, rather than after every
+transaction?  It just seems like such overkill.
+
+Actually, I found a problem with my description.  Because pg_log is not
+fsync'ed, after a crash, pages with new transactions could have been
+flushed to disk, but not the pg_log table that contains the transaction
+ids.  The problem is that the new backend could assign a transaction id
+that is already in use.
+
+We could set a flag upon successful shutdown, and if it is not set on
+reboot, either do a vacuum to find the max transaction id, and
+invalidate all them not in pg_log as synced, or increase the next
+transaction id to some huge number and invalidate all them in between.
+
+
+> If on-line redo_file is damaged then you have
+> single ability: restore your last backup.
+> In all other cases database will be recovered up to the last
+> committed transaction automatically!
+> 
+> DBMS-s using WAL have to fsync only redo file on commit
+> (and they do it!), non-overwriting systems have to
+> fsync data files and transaction log. 
+> 
+> We could optimize fsync-s for multi-user environment: do not
+> fsync when we're ensured that our changes flushed to disk by
+> another backend.
+> 
+> > crash, and this is what I am suggesting.  If we can do this, we can
+> > remove the no-fsync option.
+> > 
+> ...
+> > 
+> > On postmaster startup, all transactions are checked and any transaction
+> > that is marked as committed but not 'been_synced' is marked as not
+> > committed.  In this way, we prevent non-synced or partially synced
+> > transactions from being used.
+> 
+> And what should users (ensured that their transaction are
+> committed) do in this case ?
+> 
+> Vadim
+> 
+> 
+
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From wieck@sapserv.debis.de Tue Nov  4 07:01:00 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA21697
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:00:58 -0500 (EST)
+From: wieck@sapserv.debis.de
+Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by renoir.op.net (o1/$ Revision: 1.14 $) with SMTP id GAA06401 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 06:48:25 -0500 (EST)
+Received: by orion.SAPserv.Hamburg.dsh.de 
+	(Linux Smail3.1.29.1 #1)}
+	id m0xShVQ-000BGZC; Tue, 4 Nov 97 12:50 MET
+Message-Id: <m0xShVQ-000BGZC@orion.SAPserv.Hamburg.dsh.de>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: maillist@candle.pha.pa.us (Bruce Momjian)
+Date: Tue, 4 Nov 1997 12:50:45 +0100 (MET)
+Cc: wieck@sapserv.debis.de, hackers@postgreSQL.org
+Reply-To: wieck@sapserv.debis.de (Jan Wieck)
+In-Reply-To: <199711041135.GAA19665@candle.pha.pa.us> from "Bruce Momjian" at Nov 4, 97 06:35:10 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=iso-8859-1
+Content-Transfer-Encoding: 8bit
+Status: OR
+
+
+Bruce Momjian wrote:
+> I have been thinking about re-do logs, and I think it is a good idea.
+> It would not be hard to have the queries spit out to a separate file
+> configurable by the user.
+
+    This  way the recovery process will be very complicated. When
+    multiple  backends  run  concurrently,  there  are   multiple
+    transactions  active  at  the  same time. And what tuples are
+    affected by an update e.g. depends much on the timing.
+
+    I had something different in mind. The redo log contains  the
+    information  from  the  executor (e.g. the transactionId, the
+    tupleId and the new tuple values when calling  ExecReplace())
+    and  the information which transactions commit and which not.
+    When recovering,  those  operations  where  the  transactions
+    committed are again passed to the executors functions that do
+    the real updates with the values from the logfile.
+
+
+Until later, Jan
+
+--
+#define OPINIONS "they are all mine - not those of debis or daimler-benz"
+
+#======================================================================#
+# It's easier to get forgiveness for being wrong than for being right. #
+# Let's break this rule - forgive me.                                  #
+#================================== wieck@sapserv.debis.de (Jan Wieck) #
+
+
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 07:30:59 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22048
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:30:57 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07189 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:18:02 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08856; Tue, 4 Nov 1997 07:03:37 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:03:03 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08487 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:46 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08192 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 07:02:02 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id HAA21653;
+	Tue, 4 Nov 1997 07:00:20 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711041200.HAA21653@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!u
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Tue, 4 Nov 1997 07:00:19 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <345EE75D.398A68D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 04:14:05 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> 
+> Bruce Momjian wrote:
+> > 
+> > OK, here is a more formal description of what I am suggesting.  It will
+> > give us commercial dbms reliability with no-fsync performance.
+> > Commercial dbms's usually only give restore up to 5 minutes before the
+>                                       ^^^^^^^^^^^^^^^^^^^^^^^
+> I'm sure that this is not true!
+> If on-line redo_file is damaged then you have
+> single ability: restore your last backup.
+> In all other cases database will be recovered up to the last
+> committed transaction automatically!
+
+I doubt commercial dbms's sync to disk after every transaction.  They
+pick a time, maybe five seconds, and see all dirty pages get flushed by
+then.
+
+What they do do is to make certain that you are restored to a consistent
+state, perhaps 15 seconds ago.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From vadim@sable.krasnoyarsk.su Tue Nov  4 07:32:45 1997
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22066
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:32:35 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20889; Tue, 4 Nov 1997 19:35:12 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <345F1680.60E33853@sable.krasnoyarsk.su>
+Date: Tue, 04 Nov 1997 19:35:12 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Jan Wieck <wieck@sapserv.debis.de>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, marc@fallon.classyad.com,
+        hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <m0xSd44-000BFQC@orion.SAPserv.Hamburg.dsh.de>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+wieck@sapserv.debis.de wrote:
+> 
+>     I still prefer redo logs. They should reside on  a  different
+>     disk  and the possibility of loosing the database files along
+>     with the redo log is very small.
+
+Agreed. This way we could don't fsync data files and 
+fsync both redo and pg_log. This is much faster.
+
+Vadim
+
+From vadim@sable.krasnoyarsk.su Tue Nov  4 08:00:58 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id IAA22371
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 08:00:56 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA08540 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 07:57:25 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20935; Tue, 4 Nov 1997 19:59:46 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <345F1C42.1F1A7590@sable.krasnoyarsk.su>
+Date: Tue, 04 Nov 1997 19:59:46 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Jan Wieck <wieck@sapserv.debis.de>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <m0xShVQ-000BGZC@orion.SAPserv.Hamburg.dsh.de>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+wieck@sapserv.debis.de wrote:
+> 
+> Bruce Momjian wrote:
+> > I have been thinking about re-do logs, and I think it is a good idea.
+> > It would not be hard to have the queries spit out to a separate file
+> > configurable by the user.
+> 
+>     This  way the recovery process will be very complicated. When
+>     multiple  backends  run  concurrently,  there  are   multiple
+>     transactions  active  at  the  same time. And what tuples are
+>     affected by an update e.g. depends much on the timing.
+> 
+>     I had something different in mind. The redo log contains  the
+>     information  from  the  executor (e.g. the transactionId, the
+>     tupleId and the new tuple values when calling  ExecReplace())
+>     and  the information which transactions commit and which not.
+>     When recovering,  those  operations  where  the  transactions
+>     committed are again passed to the executors functions that do
+>     the real updates with the values from the logfile.
+
+It seems that this is what Oracle does, but Sybase writes queries 
+(with transaction ids, of 'course, and before execution) and 
+begin, commit/abort events <-- this is better for non-overwriting
+system (shorter redo file), but, agreed, recovering is more complicated.
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Tue Nov  4 22:35:45 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA05060
+	for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 22:35:43 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA26725 for <maillist@candle.pha.pa.us>; Tue, 4 Nov 1997 22:35:10 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id WAA27875; Tue, 4 Nov 1997 22:23:14 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 22:20:55 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id WAA24162 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 22:20:50 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id WAA22727 for <hackers@postgreSQL.org>; Tue, 4 Nov 1997 22:20:18 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id WAA04674;
+	Tue, 4 Nov 1997 22:17:52 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711050317.WAA04674@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Tue, 4 Nov 1997 22:17:52 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <345F14E7.28CC1042@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 4, 97 07:28:23 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> 
+> Bruce Momjian wrote:
+> > 
+> > >
+> > > Bruce Momjian wrote:
+> > > >
+> > > > OK, here is a more formal description of what I am suggesting.  It will
+> > > > give us commercial dbms reliability with no-fsync performance.
+> > > > Commercial dbms's usually only give restore up to 5 minutes before the
+> > >                                       ^^^^^^^^^^^^^^^^^^^^^^^
+> > > I'm sure that this is not true!
+> > 
+> > You may be right.  This five minute figure is when you restore from your
+> > previous backup, then restore from the log file.
+> > 
+> > Can't we do something like sync every 5 seconds, rather than after every
+> > transaction?  It just seems like such overkill.
+> 
+> Isn't -F and sync in crontab the same ?
+
+OK, let me again try to marshall some (any?) support for my suggestion.
+
+Informix version 5/7 has three levels of logging:  unbuffered
+logging(our normal fsync mode), buffered logging, and no logging(our no
+fsync mode).
+
+We don't have buffered logging.  Buffered logging guarantees you get put
+back to a consistent state after an os/server crash, usually to within
+30/90 seconds.  You do not have any partial transactions lying around,
+but you do have some transactions that you thought were done, but are
+not.
+
+This is faster then non-buffered logging, but not as fast as no logging.
+Guess what mode everyone uses?  The one we don't have, buffered logging!
+
+Unbuffered logging performance is terrible.  Non-buffered logging is
+used to load huge chunks of data during off-hours.
+
+The problem we have is that we fsync every transaction, which causes a
+9-times slowdown in performance on single-integer inserts.
+
+That is a pretty heavy cost.  But the alternative we give people is
+no-fsync mode, where we don't sync anything, and in a crash, you could
+come back with partially committed data in your database, if pg_log was
+sync'ed by the database, and only some of the data pages were sync'ed,
+so if any data was changing within 30 seconds of the crash, you have to
+restore your previous backup.
+
+We really need a middle solution, that gives better data integrity, for
+a smaller price.
+
+> 
+> > 
+> > Actually, I found a problem with my description.  Because pg_log is not
+> > fsync'ed, after a crash, pages with new transactions could have been
+> > flushed to disk, but not the pg_log table that contains the transaction
+> > ids.  The problem is that the new backend could assign a transaction id
+> > that is already in use.
+> 
+> Impossible. Backend flushes pg_variable after fetching nex 32 xids.
+
+My suggestion is that we don't need to flush pg_variable or pg_log that
+much.  My suggestion would speed up the test you do with 100 inserts
+inside a single transaction vs. 100 separate inserts.
+
+> > 
+> > We could set a flag upon successful shutdown, and if it is not set on
+> > reboot, either do a vacuum to find the max transaction id, and
+> > invalidate all them not in pg_log as synced, or increase the next
+> > transaction id to some huge number and invalidate all them in between.
+> >
+
+I have a fix for the problem stated above, and it doesn't require a
+vacuum.
+
+We decide to fsync pg_variable and pg_log every 10,000 transactions or
+oids.  Then if the database is brought up, and it was not brought down
+cleanly, you increment oid and transaction_id by 10,000, because you
+know you couldn't have gotten more than that.  All intermediate
+transactions that are not marked committed/synced are marked aborted.
+
+---------------------------------------------------------------------------
+
+The problem we have with the current system is that we sync by action,
+not by time interval.  If you are doing tons of inserts or updates, it
+is syncing after every one.  What people really want is something that
+will sync not after every action, but after every minute or five
+minutes, so when the system is busy, the syncing every minutes is just a
+small amount, and when the system is idle, no one cares if is syncs, and
+no one has to wait for the sync to complete.
+
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From matti@algonet.se Wed Nov  5 11:02:33 1997
+Received: from smtp.algonet.se (tomei.algonet.se [194.213.74.114])
+	by candle.pha.pa.us (8.8.5/8.8.5) with SMTP id LAA02099
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:02:28 -0500 (EST)
+Received: (qmail 6685 invoked from network); 5 Nov 1997 17:01:06 +0100
+Received: from du228-6.ppp.algonet.se (HELO gamma) (root@195.100.6.228)
+  by tomei.algonet.se with SMTP; 5 Nov 1997 17:01:06 +0100
+Sender: root
+Message-ID: <34609871.27EED9D@algonet.se>
+Date: Wed, 05 Nov 1997 17:02:16 +0100
+From: Mattias Kregert <matti@algonet.se>
+Organization: Algonet ISP
+X-Mailer: Mozilla 3.0Gold (X11; I; Linux 2.0.29 i586)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: pgsql-hackers@postgresql.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711050317.WAA04674@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+>
+> We don't have buffered logging.  Buffered logging guarantees you get put
+> back to a consistent state after an os/server crash, usually to within
+> 30/90 seconds.  You do not have any partial transactions lying around,
+> but you do have some transactions that you thought were done, but are
+                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+> not.
+  ^^^^
+> 
+> This is faster then non-buffered logging, but not as fast as no logging.
+> Guess what mode everyone uses?  The one we don't have, buffered logging!
+
+Ouch! I would *not* like to use "buffered logging".
+What's the point in having the wrong data in the database and not
+knowing what updates, inserts or deletes to do to get the correct data?
+
+That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
+
+
+> We really need a middle solution, that gives better data integrity, for
+> a smaller price.
+
+What I would like to have is this:
+
+If a backend tells the frontend that a transaction has completed,
+then that transaction should absolutely not get lost in case of a crash.
+
+What is needed is a log of changes since the last backup. This
+log would preferrably reside on a remote machine or at least
+another disk. Then, if the power goes in the middle of a disk write,
+the disk explodes and the computer goes up in flames, you can
+install Postgresql on a new machine, restore the last backup and
+re-run the change log.
+
+
+> The problem we have with the current system is that we sync by action,
+> not by time interval.  If you are doing tons of inserts or updates, it
+> is syncing after every one.  What people really want is something that
+> will sync not after every action, but after every minute or five
+> minutes, so when the system is busy, the syncing every minutes is just a
+> small amount, and when the system is idle, no one cares if is syncs, and
+> no one has to wait for the sync to complete.
+
+Yes, but this would only be the first step on the way to better
+crash-recovery.
+
+/* m */
+
+From vadim@sable.krasnoyarsk.su Wed Nov  5 12:20:23 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05156
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:13 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24123 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:44:49 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id XAA23062; Wed, 5 Nov 1997 23:48:52 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <3460A374.41C67EA6@sable.krasnoyarsk.su>
+Date: Wed, 05 Nov 1997 23:48:52 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711050317.WAA04674@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> OK, let me again try to marshall some (any?) support for my suggestion.
+> 
+> Informix version 5/7 has three levels of logging:  unbuffered
+> logging(our normal fsync mode), buffered logging, and no logging(our no
+> fsync mode).
+> 
+> We don't have buffered logging.  Buffered logging guarantees you get put
+> back to a consistent state after an os/server crash, usually to within
+> 30/90 seconds.  You do not have any partial transactions lying around,
+> but you do have some transactions that you thought were done, but are
+> not.
+> 
+> This is faster then non-buffered logging, but not as fast as no logging.
+> Guess what mode everyone uses?  The one we don't have, buffered logging!
+> 
+> Unbuffered logging performance is terrible.  Non-buffered logging is
+> used to load huge chunks of data during off-hours.
+> 
+> The problem we have is that we fsync every transaction, which causes a
+> 9-times slowdown in performance on single-integer inserts.
+> 
+> That is a pretty heavy cost.  But the alternative we give people is
+> no-fsync mode, where we don't sync anything, and in a crash, you could
+> come back with partially committed data in your database, if pg_log was
+> sync'ed by the database, and only some of the data pages were sync'ed,
+> so if any data was changing within 30 seconds of the crash, you have to
+> restore your previous backup.
+> 
+> We really need a middle solution, that gives better data integrity, for
+> a smaller price.
+
+There is no fsync synchronization currently.
+How could we be ensured that all modified data pages are flushed
+when we decided to flush pg_log ?
+If backend doesn't fsync data pages & pg_log at the commit time
+then when he must flush them (data first) ?
+
+This is what Oracle does:
+
+it uses dedicated DBWR process for writing/flushing modified 
+data pages and LGWR process for writing/flushing redo log
+(redo log is transaction log also). LGWR always flushes log pages 
+when committing, but durty data pages can be flushed _after_ transaction 
+commit when DBWR decides that it's time to do it (ala checkpoints interval).
+
+Using redo log we could implement buffered logging quite easy.
+We can even don't use dedicated processes (but flush redo before pg_log),
+though having LGWR could simplify things.
+
+Without redo log or without some fsync synchronization we can't implement
+buffered logging. BTW, shared system cache could help with
+fsync synchonization, but, imho, redo is better (and faster for
+un-buffered logging too).
+
+> > > Actually, I found a problem with my description.  Because pg_log is not
+> > > fsync'ed, after a crash, pages with new transactions could have been
+> > > flushed to disk, but not the pg_log table that contains the transaction
+> > > ids.  The problem is that the new backend could assign a transaction id
+> > > that is already in use.
+> >
+> > Impossible. Backend flushes pg_variable after fetching nex 32 xids.
+> 
+> My suggestion is that we don't need to flush pg_variable or pg_log that
+> much.  My suggestion would speed up the test you do with 100 inserts
+> inside a single transaction vs. 100 separate inserts.
+> 
+> > >
+> > > We could set a flag upon successful shutdown, and if it is not set on
+> > > reboot, either do a vacuum to find the max transaction id, and
+> > > invalidate all them not in pg_log as synced, or increase the next
+> > > transaction id to some huge number and invalidate all them in between.
+> > >
+> 
+> I have a fix for the problem stated above, and it doesn't require a
+> vacuum.
+> 
+> We decide to fsync pg_variable and pg_log every 10,000 transactions or
+> oids.  Then if the database is brought up, and it was not brought down
+> cleanly, you increment oid and transaction_id by 10,000, because you
+> know you couldn't have gotten more than that.  All intermediate
+> transactions that are not marked committed/synced are marked aborted.
+
+This is what I suppose to do by placing next available oid/xid
+in shmem: this allows pre-fetch much more than 32 ids at once
+without losing them when session closed.
+
+> The problem we have with the current system is that we sync by action,
+> not by time interval.  If you are doing tons of inserts or updates, it
+> is syncing after every one.  What people really want is something that
+> will sync not after every action, but after every minute or five
+> minutes, so when the system is busy, the syncing every minutes is just a
+> small amount, and when the system is idle, no one cares if is syncs, and
+> no one has to wait for the sync to complete.
+
+When I'm really doing tons of inserts/updates/deletes I use
+BEGIN/END. But it doesn't work for multi-user environment, of 'course.
+As for about what people really want, I remember that recently someone
+said in user list that if one want to have 10-20 inserts/sec then he
+should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
+when using one session, 32 inserts/sec with two sessions inserting
+in two different tables and only 20 inserts/sec with two sessions
+inserting in the same table. Imho, this difference between 20 and 32
+is more important thing to fix, and these results are not so bad
+in comparison with others.
+
+(BTW, we shouldn't forget about using raw devices to speed up things).
+
+Vadim
+
+From vadim@sable.krasnoyarsk.su Wed Nov  5 12:20:08 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05150
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:07 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24889 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 11:59:27 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su>
+Date: Thu, 06 Nov 1997 00:03:19 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Mattias Kregert <matti@algonet.se>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Mattias Kregert wrote:
+> 
+> Bruce Momjian wrote:
+> >
+> > We don't have buffered logging.  Buffered logging guarantees you get put
+> > back to a consistent state after an os/server crash, usually to within
+> > 30/90 seconds.  You do not have any partial transactions lying around,
+> > but you do have some transactions that you thought were done, but are
+>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+> > not.
+>   ^^^^
+> >
+> > This is faster then non-buffered logging, but not as fast as no logging.
+> > Guess what mode everyone uses?  The one we don't have, buffered logging!
+> 
+> Ouch! I would *not* like to use "buffered logging".
+
+And I.
+
+> What's the point in having the wrong data in the database and not
+> knowing what updates, inserts or deletes to do to get the correct data?
+> 
+> That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
+> 
+> > We really need a middle solution, that gives better data integrity, for
+> > a smaller price.
+> 
+> What I would like to have is this:
+> 
+> If a backend tells the frontend that a transaction has completed,
+> then that transaction should absolutely not get lost in case of a crash.
+
+Agreed.
+
+> 
+> What is needed is a log of changes since the last backup. This
+> log would preferrably reside on a remote machine or at least
+> another disk. Then, if the power goes in the middle of a disk write,
+> the disk explodes and the computer goes up in flames, you can
+> install Postgresql on a new machine, restore the last backup and
+> re-run the change log.
+
+Yes. And as I already said - this will speed up things because
+redo flushing is faster than flushing NNN tables which can be
+unflushed for some interval.
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Wed Nov  5 12:20:39 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05168
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:20:38 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id MAA25888 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 12:14:14 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id MAA02259; Wed, 5 Nov 1997 12:02:33 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 12:00:21 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id MAA00750 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 12:00:10 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA00598 for <pgsql-hackers@postgreSQL.org>; Wed, 5 Nov 1997 11:59:45 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS)
+Message-ID: <3460A6D7.167EB0E7@sable.krasnoyarsk.su>
+Date: Thu, 06 Nov 1997 00:03:19 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Mattias Kregert <matti@algonet.se>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711050317.WAA04674@candle.pha.pa.us> <34609871.27EED9D@algonet.se>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Mattias Kregert wrote:
+> 
+> Bruce Momjian wrote:
+> >
+> > We don't have buffered logging.  Buffered logging guarantees you get put
+> > back to a consistent state after an os/server crash, usually to within
+> > 30/90 seconds.  You do not have any partial transactions lying around,
+> > but you do have some transactions that you thought were done, but are
+>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+> > not.
+>   ^^^^
+> >
+> > This is faster then non-buffered logging, but not as fast as no logging.
+> > Guess what mode everyone uses?  The one we don't have, buffered logging!
+> 
+> Ouch! I would *not* like to use "buffered logging".
+
+And I.
+
+> What's the point in having the wrong data in the database and not
+> knowing what updates, inserts or deletes to do to get the correct data?
+> 
+> That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
+> 
+> > We really need a middle solution, that gives better data integrity, for
+> > a smaller price.
+> 
+> What I would like to have is this:
+> 
+> If a backend tells the frontend that a transaction has completed,
+> then that transaction should absolutely not get lost in case of a crash.
+
+Agreed.
+
+> 
+> What is needed is a log of changes since the last backup. This
+> log would preferrably reside on a remote machine or at least
+> another disk. Then, if the power goes in the middle of a disk write,
+> the disk explodes and the computer goes up in flames, you can
+> install Postgresql on a new machine, restore the last backup and
+> re-run the change log.
+
+Yes. And as I already said - this will speed up things because
+redo flushing is faster than flushing NNN tables which can be
+unflushed for some interval.
+
+Vadim
+
+
+From owner-pgsql-hackers@hub.org Wed Nov  5 14:01:02 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07017
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:00:59 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA01759 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 13:52:36 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA03611; Wed, 5 Nov 1997 13:29:43 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 13:27:48 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA03291 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 13:27:41 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA02823 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 13:26:20 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id NAA05863;
+	Wed, 5 Nov 1997 13:16:09 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711051816.NAA05863@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Wed, 5 Nov 1997 13:16:09 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> There is no fsync synchronization currently.
+> How could we be ensured that all modified data pages are flushed
+> when we decided to flush pg_log ?
+> If backend doesn't fsync data pages & pg_log at the commit time
+> then when he must flush them (data first) ?
+
+My idea was to have the backend do a 'sync' that causes the OS to sync
+all dirty pages, then mark all committed transactions on pg_log as
+'synced'.  Then sync pg_log.  That way, there is a clear system where we
+know everything is flushed to disk, and we mark the transactions as
+synced.
+
+The only time that synced flag is used, is when the database starts up,
+and it sees that the previous shutdown was not clean.
+
+What am I missing here?
+
+> 
+> This is what Oracle does:
+> 
+> it uses dedicated DBWR process for writing/flushing modified 
+> data pages and LGWR process for writing/flushing redo log
+> (redo log is transaction log also). LGWR always flushes log pages 
+> when committing, but durty data pages can be flushed _after_ transaction 
+> commit when DBWR decides that it's time to do it (ala checkpoints interval).
+> 
+> Using redo log we could implement buffered logging quite easy.
+> We can even don't use dedicated processes (but flush redo before pg_log),
+> though having LGWR could simplify things.
+> 
+> Without redo log or without some fsync synchronization we can't implement
+> buffered logging. BTW, shared system cache could help with
+> fsync synchonization, but, imho, redo is better (and faster for
+> un-buffered logging too).
+> 
+
+I suggested my solution because it is clean, does flushing in one
+central location(postmaster), and does quick restores.
+
+> > > > Actually, I found a problem with my description.  Because pg_log is not
+> > > > fsync'ed, after a crash, pages with new transactions could have been
+> > > > flushed to disk, but not the pg_log table that contains the transaction
+> > > > ids.  The problem is that the new backend could assign a transaction id
+> > > > that is already in use.
+> > >
+> > > Impossible. Backend flushes pg_variable after fetching nex 32 xids.
+> > 
+> > My suggestion is that we don't need to flush pg_variable or pg_log that
+> > much.  My suggestion would speed up the test you do with 100 inserts
+> > inside a single transaction vs. 100 separate inserts.
+> > 
+> > > >
+> > > > We could set a flag upon successful shutdown, and if it is not set on
+> > > > reboot, either do a vacuum to find the max transaction id, and
+> > > > invalidate all them not in pg_log as synced, or increase the next
+> > > > transaction id to some huge number and invalidate all them in between.
+> > > >
+> > 
+> > I have a fix for the problem stated above, and it doesn't require a
+> > vacuum.
+> > 
+> > We decide to fsync pg_variable and pg_log every 10,000 transactions or
+> > oids.  Then if the database is brought up, and it was not brought down
+> > cleanly, you increment oid and transaction_id by 10,000, because you
+> > know you couldn't have gotten more than that.  All intermediate
+> > transactions that are not marked committed/synced are marked aborted.
+> 
+> This is what I suppose to do by placing next available oid/xid
+> in shmem: this allows pre-fetch much more than 32 ids at once
+> without losing them when session closed.
+> 
+> > The problem we have with the current system is that we sync by action,
+> > not by time interval.  If you are doing tons of inserts or updates, it
+> > is syncing after every one.  What people really want is something that
+> > will sync not after every action, but after every minute or five
+> > minutes, so when the system is busy, the syncing every minutes is just a
+> > small amount, and when the system is idle, no one cares if is syncs, and
+> > no one has to wait for the sync to complete.
+> 
+> When I'm really doing tons of inserts/updates/deletes I use
+> BEGIN/END. But it doesn't work for multi-user environment, of 'course.
+> As for about what people really want, I remember that recently someone
+> said in user list that if one want to have 10-20 inserts/sec then he
+> should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
+> when using one session, 32 inserts/sec with two sessions inserting
+> in two different tables and only 20 inserts/sec with two sessions
+> inserting in the same table. Imho, this difference between 20 and 32
+> is more important thing to fix, and these results are not so bad
+> in comparison with others.
+> 
+> (BTW, we shouldn't forget about using raw devices to speed up things).
+> 
+> Vadim
+> 
+
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From james@blarg.net Wed Nov  5 13:26:46 1997
+Received: from animal.blarg.net (mail@animal.blarg.net [206.114.144.1])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id NAA06130
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 13:26:26 -0500 (EST)
+Received: from animal.blarg.net (james@animal.blarg.net [206.114.144.1])
+          by animal.blarg.net (8.8.5/8.8.4) with SMTP
+          id KAA09775; Wed, 5 Nov 1997 10:26:10 -0800
+Date: Wed, 5 Nov 1997 10:26:10 -0800 (PST)
+From: "James A. Hillyerd" <james@blarg.net>
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+cc: Mattias Kregert <matti@algonet.se>, pgsql-hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+In-Reply-To: <199711051615.LAA02260@candle.pha.pa.us>
+Message-ID: <Pine.LNX.3.95.971105102332.6252E-100000@animal.blarg.net>
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Wed, 5 Nov 1997, Bruce Momjian wrote:
+> 
+> The strange thing I am hearing is that the people who use PostgreSQL are
+> more worried about data recovery from a crash than million-dollar
+> companies that use commercial databases.
+> 
+
+If I may throw in my 2 cents, I'd prefer to see that database in a
+consistent state, with the data being up to date as of 1 minute or
+less before the crash.  I'd rather have higher performance than up to the
+second data.
+
+-james
+
+[  James A. Hillyerd (JH2162) - james@blarg.net - Web Developer  ]
+[  http://www.blarg.net/~james/   http://www.hyperglyphics.com/  ]
+[ 1024/B11C3751 CA 1C B3 A9 07 2F 57 C9  91 F4 73 F2 19 A4 C5 88 ]
+
+
+From vadim@sable.krasnoyarsk.su Wed Nov  5 14:24:03 1997
+Received: from renoir.op.net (root@renoir.op.net [206.84.208.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07830
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:24:02 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id OAA02778 for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:13:45 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id CAA23376; Thu, 6 Nov 1997 02:17:51 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <3460C65E.446B9B3D@sable.krasnoyarsk.su>
+Date: Thu, 06 Nov 1997 02:17:50 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711051816.NAA05863@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> > There is no fsync synchronization currently.
+> > How could we be ensured that all modified data pages are flushed
+> > when we decided to flush pg_log ?
+> > If backend doesn't fsync data pages & pg_log at the commit time
+> > then when he must flush them (data first) ?
+> 
+> My idea was to have the backend do a 'sync' that causes the OS to sync
+> all dirty pages, then mark all committed transactions on pg_log as
+> 'synced'.  Then sync pg_log.  That way, there is a clear system where we
+> know everything is flushed to disk, and we mark the transactions as
+> synced.
+> 
+> The only time that synced flag is used, is when the database starts up,
+> and it sees that the previous shutdown was not clean.
+> 
+> What am I missing here?
+
+Ok, I see. But we can avoid 'synced' flag: we can make (just before
+sync-ing data pages) in-memory copies of "on-line" durty pg_log pages 
+to being written/fsynced and perform write/fsync from these copies
+without stopping new commits in "on-line" page(s) (nothing must go
+to disk from "on-line" log pages).
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Wed Nov  5 14:32:25 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA08101
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 14:32:21 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id OAA22970; Wed, 5 Nov 1997 14:26:47 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:24:59 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA22344 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:24:56 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA22319 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 14:24:38 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id OAA07661;
+	Wed, 5 Nov 1997 14:22:46 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711051922.OAA07661@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Wed, 5 Nov 1997 14:22:45 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <3460A374.41C67EA6@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 5, 97 11:48:52 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Just a clarification.  When I say the postmaster issues a sync, I mean
+sync(2), not fsync(2).
+
+The sync flushes all dirty pages on all file systems.  Ordinary users
+can issue this, and update usually does this every 30 seconds anyway.
+
+By using this, we let the kernel figure out which buffers are dirty.  We
+don't have to figure this out in the postmaster.
+
+Then we update the pg_log table to mark those transactions as synced. 
+On recovery from a crash, we mark the committed transactions as
+uncommitted if they do not have the synced flag.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Wed Nov  5 15:11:07 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA08751
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 15:10:59 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA01986; Wed, 5 Nov 1997 15:01:24 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:59:32 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA01414 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:59:28 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA01403 for <hackers@postgreSQL.org>; Wed, 5 Nov 1997 14:59:14 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id OAA08283;
+	Wed, 5 Nov 1997 14:53:55 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711051953.OAA08283@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Wed, 5 Nov 1997 14:53:54 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <3460C65E.446B9B3D@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 6, 97 02:17:50 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> > The only time that synced flag is used, is when the database starts up,
+> > and it sees that the previous shutdown was not clean.
+> > 
+> > What am I missing here?
+> 
+> Ok, I see. But we can avoid 'synced' flag: we can make (just before
+> sync-ing data pages) in-memory copies of "on-line" durty pg_log pages 
+> to being written/fsynced and perform write/fsync from these copies
+> without stopping new commits in "on-line" page(s) (nothing must go
+> to disk from "on-line" log pages).
+
+[Working late tonight?]
+
+OK, now I am lost.  We need the sync'ed flag so when we start the
+postmaster, and we see the database we not shut down properly, we use
+the flag to clear the commit flag from comitted transactions that were
+not sync'ed by the postmaster.
+
+In my opinion, we don't need any extra copies of pg_log, we can set
+those sync'ed flags while others are making changes, because before we
+did our sync, we gathered a list of committed transaction ids from the
+shared transaction id queue that I mentioned a while ago.
+
+We need this queue so we can find the newly-committed transactions that
+do not have a sync flag.  Another way we could do this would be to scan
+pg_log before we sync, getting all the committed transaction ids without
+sync flags.  No lock is needed on the table.  If we miss some new ones,
+we will get them next time we scan.  The problem I saw is that there is
+no way to see when to stop scanning the pg_log table for such
+transactions, so I thought each backend would have to put its newly
+committed transactions in a separate place.  Maybe I am wrong.
+
+This syncing method just seems so natural since we have pg_log.  That is
+why I keep bringing it up until people tell me I am stupid.
+
+This transaction commit/sync stuff is complicated, and takes a while to
+hash out in a group.
+
+---------------------------------------------------------------------------
+
+I just re-read your description, and I see what you are saying.  My idea
+has pg_log commit flag be real commit flags while the system is running,
+but on reboot after failure, we remove the commit flags on non-synced
+stuff before we start up.
+
+Your idea is to make pg_log commit flags only appear in in-memory copies
+of pg_log, and write the commit flags to disk only after the sync is
+done.
+
+Either way will work.  The question is, "Which is easier?"  The OS is
+going to sync pg_log on its own.  We would almost need a second copy of
+pg_log, one copy to be used on postmaster startup, and a second to be
+used by running backends, and the postmaster would make a copy of the
+running backend pg_log, sync the disks, and copy it to the boot copy.
+
+I don't see how the backend is going to figure out which pg_log pages
+were modified and need to be sent to the boot copy of pg_log.
+
+Now that I am thinking, here is a good idea.  Instead of a fancy
+transaction queue, what if we just have the backend record the lowest
+numbered transaction they commit in a shared memory area.  If the
+current transaction id they commit is greater than the minimum, then
+change nothing.  That way, the backend could copy all pg_log pages
+containing that minimum pg_log transaction id up to the most recent
+pg_log page, do the sync, and copy just those to the boot copy of
+pg_log.
+
+This eliminates the transaction id queue.
+
+The nice thing about the sync-flag in pg_log is that there is no copying
+by the backend.  But we would have to spin through the file to set those
+sync bits.  Your method just copies whole pages to the boot copy.
+
+---------------------------------------------------------------------------
+
+I don't want to force this idea on anyone, or annoy anyone.  I just
+think it needs to be considered.  The concepts are unusual, so once
+people get the full idea, if they don't like it, we can trash it.  I
+still think it holds promise.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From hotz@jpl.nasa.gov Wed Nov  5 15:30:18 1997
+Received: from hotzsun.jpl.nasa.gov (hotzsun.jpl.nasa.gov [137.79.51.138])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA09500
+	for <maillist@candle.pha.pa.us>; Wed, 5 Nov 1997 15:30:16 -0500 (EST)
+Received: from [137.79.51.141] (hotzmac [137.79.51.141]) by hotzsun.jpl.nasa.gov (8.7.6/8.7.3) with SMTP id MAA10100; Wed, 5 Nov 1997 12:29:58 -0800 (PST)
+X-Sender: hotzmail@hotzsun.jpl.nasa.gov
+Message-Id: <v02140b02b0868294bbc1@[137.79.51.141]>
+Mime-Version: 1.0
+Content-Type: text/plain; charset="us-ascii"
+Date: Wed, 5 Nov 1997 12:29:58 -0800
+To: Bruce Momjian <maillist@candle.pha.pa.us>,
+        matti@algonet.se (Mattias Kregert)
+From: hotz@jpl.nasa.gov (Henry B. Hotz)
+Subject: Re: [HACKERS] My $.02, was: PERFORMANCE and Good Bye, Time Travel!
+Cc: pgsql-hackers@postgreSQL.org
+Status: OR
+
+At 11:15 AM 11/5/97, Bruce Momjian wrote:
+>The strange thing I am hearing is that the people who use PostgreSQL are
+>more worried about data recovery from a crash than million-dollar
+>companies that use commercial databases.
+>
+>I don't get it.
+
+I would run PG to make sure that committed transactions were really written
+to disk because that seems "correct" and I don't have the kind of
+performance requirements that would push me to do otherwise.
+
+That said, I can see a need for varying performance/crash-immunity
+tradeoffs, and at least *one* option in between "correct" and "unprotected"
+operation would seem desirable.
+
+Signature failed Preliminary Design Review.
+Feasibility of a new signature is currently being evaluated.
+h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu
+
+
+
+From owner-pgsql-hackers@hub.org Thu Nov  6 15:51:23 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA04634
+	for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 15:51:08 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA24783; Thu, 6 Nov 1997 15:36:47 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 15:36:07 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA24514 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 15:36:02 -0500 (EST)
+Received: from guevara.bildbasen.kiruna.se (guevara.bildbasen.kiruna.se [193.45.225.110]) by hub.org (8.8.5/8.7.5) with SMTP id PAA24319 for <pgsql-hackers@postgreSQL.org>; Thu, 6 Nov 1997 15:35:32 -0500 (EST)
+Received: (qmail 9764 invoked by uid 129); 6 Nov 1997 20:34:35 -0000
+Date: 6 Nov 1997 20:34:35 -0000
+Message-ID: <19971106203435.9763.qmail@guevara.bildbasen.kiruna.se>
+From: Goran Thyni <goran@bildbasen.se>
+To: pgsql-hackers@postgreSQL.org
+In-reply-to: <34619E9E.622F563@algonet.se> (message from Mattias Kregert on
+	Thu, 06 Nov 1997 11:40:30 +0100)
+Subject: [HACKERS] Re: Performance vs. Crash Recovery
+Mime-Version: 1.0
+Content-Type: text/plain; charset=ISO-8859-1
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+
+I am getting quiet bored by this discussion,
+if someone has a strong opinion about how this
+should be done go ahead and make a test implementation
+then we have something to discuss.
+
+In the mean time, if you want best possible data protection
+mount you database disk sync:ed. This is safer than any scheme
+we could come up with.
+D*mned slow too, so everybody should be happy. :-)
+
+And I see no point implement a periodic sync in postmaster.
+All unices has cron, why not just use that.
+Or even a stupid 1-liner (ba)sh-script like:
+
+while true; do sleep 20; sync; done
+
+	best regards,
+-- 
+---------------------------------------------
+G�ran Thyni, sysadm, JMS Bildbasen, Kiruna
+
+
+
+From vadim@sable.krasnoyarsk.su Thu Nov  6 23:31:41 1997
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA04723
+	for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:31:21 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25438; Fri, 7 Nov 1997 11:36:25 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <34629AC9.15FB7483@sable.krasnoyarsk.su>
+Date: Fri, 07 Nov 1997 11:36:25 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711051953.OAA08283@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> > > The only time that synced flag is used, is when the database starts up,
+> > > and it sees that the previous shutdown was not clean.
+> > >
+> > > What am I missing here?
+> >
+> > Ok, I see. But we can avoid 'synced' flag: we can make (just before
+> > sync-ing data pages) in-memory copies of "on-line" durty pg_log pages
+> > to being written/fsynced and perform write/fsync from these copies
+> > without stopping new commits in "on-line" page(s) (nothing must go
+> > to disk from "on-line" log pages).
+> 
+> [Working late tonight?]
+
+[Yes]
+
+> I just re-read your description, and I see what you are saying.  My idea
+> has pg_log commit flag be real commit flags while the system is running,
+> but on reboot after failure, we remove the commit flags on non-synced
+> stuff before we start up.
+> 
+> Your idea is to make pg_log commit flags only appear in in-memory copies
+> of pg_log, and write the commit flags to disk only after the sync is
+> done.
+> 
+> Either way will work.  The question is, "Which is easier?"  The OS is
+> going to sync pg_log on its own.  We would almost need a second copy of
+> pg_log, one copy to be used on postmaster startup, and a second to be
+> used by running backends, and the postmaster would make a copy of the
+> running backend pg_log, sync the disks, and copy it to the boot copy.
+> 
+> I don't see how the backend is going to figure out which pg_log pages
+> were modified and need to be sent to the boot copy of pg_log.
+> 
+> Now that I am thinking, here is a good idea.  Instead of a fancy
+> transaction queue, what if we just have the backend record the lowest
+> numbered transaction they commit in a shared memory area.  If the
+> current transaction id they commit is greater than the minimum, then
+> change nothing.  That way, the backend could copy all pg_log pages
+> containing that minimum pg_log transaction id up to the most recent
+> pg_log page, do the sync, and copy just those to the boot copy of
+> pg_log.
+> 
+> This eliminates the transaction id queue.
+> 
+> The nice thing about the sync-flag in pg_log is that there is no copying
+> by the backend.  But we would have to spin through the file to set those
+> sync bits.  Your method just copies whole pages to the boot copy.
+
+   In my plans to re-design transaction system I supposed to keep in shmem
+two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer 
+to access them is not good idea. Also, we could use spinlock instead of
+lock manager to synchronize access to these pages (as I see in spin.c
+spinlock-s could be shared, but only exclusive ones are used) - spinlocks
+are faster. 
+   These two last pg_log pages are "online" ones. Race condition: when one or
+both of online pages becomes non-online ones, i.e. pg_log has to be expanded 
+when writing commit/abort of "big" xid. This is how we could handle this
+in "buffered" logging (delayed fsync) mode:
+
+   When backend want to write commit/abort status he acquires exclusive 
+OnLineLogLock. If xid belongs to online pages then backend writes status
+and releases spin. If xid is less than least xid on 1st online page then
+backend releases spin and does exactly the same what he does in normal mode:
+flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, 
+update xid status, WriteBuffer, release write lock, flush pg_log.
+If xid is greater than max xid on 2nd online page then the simplest way is 
+just do sync(); sync() (two times), flush 1st or both online pages, 
+read new page(s) into online pages space, update xid status,
+release OnLineLogLock spin. We could try other ways but pg_log expanding
+is rare case (32K xids in one pg_log page)...
+   All what postmaster will have to do is:
+1. Get shared OnLineLogLock.
+2. Copy 2 x 8K data to private place.
+3. Release spinlock.
+4. sync(); sync(); (two times!)
+5. Flush online pages.
+
+We could use -F DELAY_TIME to turn fsync delayed mode ON.
+
+And, btw, having two bits for xact status we have only one unused
+status value (0x11) currently - I would like to use this for
+nested xactions and savepoints...
+
+> I don't want to force this idea on anyone, or annoy anyone.  I just
+> think it needs to be considered.  The concepts are unusual, so once
+> people get the full idea, if they don't like it, we can trash it.  I
+> still think it holds promise.
+
+Agreed.
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Fri Nov  7 01:32:49 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07651
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:32:47 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23328 for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:46:08 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA19565; Thu, 6 Nov 1997 23:38:55 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 23:36:53 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA18911 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 23:36:44 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA18779 for <pgsql-hackers@postgreSQL.org>; Thu, 6 Nov 1997 23:36:02 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25448; Fri, 7 Nov 1997 11:40:29 +0700 (KRS)
+Message-ID: <34629BBD.59E2B600@sable.krasnoyarsk.su>
+Date: Fri, 07 Nov 1997 11:40:29 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: Mattias Kregert <matti@algonet.se>, pgsql-hackers@postgreSQL.org
+Subject: Re: Sync:ing data and log (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
+References: <199711061810.NAA02118@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> >
+> > Never use sync(). Use fsync(). Other processes should take care of their
+> > own syncing. If you use sync(), and you have a lot of disks, the sync
+> > can
+> > take half a minute if you are unlucky.
+> 
+> We could use fsync() but then the postmaster has to know what tables
+> have dirty buffers, and I don't think there is an easy way to do this.
+
+There is one way - shared system cache...
+
+Vadim
+
+
+From vadim@sable.krasnoyarsk.su Fri Nov  7 01:31:24 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07639
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:31:22 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23094 for <maillist@candle.pha.pa.us>; Thu, 6 Nov 1997 23:39:00 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25457; Fri, 7 Nov 1997 11:43:52 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <34629C87.3F54BC7E@sable.krasnoyarsk.su>
+Date: Fri, 07 Nov 1997 11:43:51 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Mattias Kregert <matti@algonet.se>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, pgsql-hackers@postgreSQL.org
+Subject: Re: Performance vs. Crash Recovery (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
+References: <199711051615.LAA02260@candle.pha.pa.us> <34619E9E.622F563@algonet.se>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Mattias Kregert wrote:
+> 
+> > The strange thing I am hearing is that the people who use PostgreSQL are
+> > more worried about data recovery from a crash than million-dollar
+> > companies that use commercial databases.
+> >
+> > I don't get it.
+> 
+> Perhaps the million-dollar companies have more sophisticated hardware,
+> like big expensive disk arrays, big UPS:es and parallell backup
+> servers?
+> If so, the risk of harware failure is much smaller for them.
+
+More of that - Informix is more stable than postgres: elog(FATAL)
+occures sometime and in fsync delayed mode this will cause
+of losing xaction too, not onle hard/OS failure.
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Fri Nov  7 01:31:26 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07642
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:31:24 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id AAA24358 for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 00:09:47 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA00167; Fri, 7 Nov 1997 00:03:17 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:01:26 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA29427 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:01:19 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA29364 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 00:01:02 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id XAA05565;
+	Thu, 6 Nov 1997 23:54:33 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711070454.XAA05565@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Thu, 6 Nov 1997 23:54:33 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <34629AC9.15FB7483@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 11:36:25 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+I was worried when you didn't respond to my last list of ideas.  I
+thought perhaps the idea was getting on your nerves.
+
+I haven't dropped the idea because:
+
+	1) it offers 2-9 times speedup in database modifications
+	2) this is how the big commercial system handle it, and I think
+	   we need to give users this option.
+	3) in the way I had it designed, it wouldn't take much work to
+	   do it.
+
+Anything that promises that much speedup, if it can be done easy, I say
+lets consider it, even if you loose 60 seconds of changes.
+
+
+>    In my plans to re-design transaction system I supposed to keep in shmem
+> two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer 
+> to access them is not good idea. Also, we could use spinlock instead of
+> lock manager to synchronize access to these pages (as I see in spin.c
+> spinlock-s could be shared, but only exclusive ones are used) - spinlocks
+> are faster. 
+
+Ah, so you already had the idea of having on-line pages in shared memory
+as part of a transaction system overhaul?  Right now, does each backend
+lock/read/write/unlock to get at pg_log?  Wow, that is bad.
+
+Perhaps mmap() would be a good idea.  My system has msync() to flush
+mmap()'ed pages to the underlying file.  You would still run fsync()
+after that.  This may give us the best of both worlds: a shared-memory
+area of variable size, and control of when it get flushed to disk.  Do
+other OS's have this?  I have a feeling OS's with unified buffer caches
+don't have this ability to determine when the underlying mmap'ed file
+gets sent to the underlying file and disk.
+
+
+>    These two last pg_log pages are "online" ones. Race condition: when one or
+> both of online pages becomes non-online ones, i.e. pg_log has to be expanded 
+> when writing commit/abort of "big" xid. This is how we could handle this
+> in "buffered" logging (delayed fsync) mode:
+> 
+>    When backend want to write commit/abort status he acquires exclusive 
+> OnLineLogLock. If xid belongs to online pages then backend writes status
+> and releases spin. If xid is less than least xid on 1st online page then
+> backend releases spin and does exactly the same what he does in normal mode:
+> flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer, 
+> update xid status, WriteBuffer, release write lock, flush pg_log.
+> If xid is greater than max xid on 2nd online page then the simplest way is 
+> just do sync(); sync() (two times), flush 1st or both online pages, 
+> read new page(s) into online pages space, update xid status,
+> release OnLineLogLock spin. We could try other ways but pg_log expanding
+> is rare case (32K xids in one pg_log page)...
+>    All what postmaster will have to do is:
+> 1. Get shared OnLineLogLock.
+> 2. Copy 2 x 8K data to private place.
+> 3. Release spinlock.
+> 4. sync(); sync(); (two times!)
+> 5. Flush online pages.
+> 
+> We could use -F DELAY_TIME to turn fsync delayed mode ON.
+> 
+> And, btw, having two bits for xact status we have only one unused
+> status value (0x11) currently - I would like to use this for
+> nested xactions and savepoints...
+
+I saw that.  By keeping two copies of pg_log, one in memory to be used
+by all backend, and another that hits the disk, it certainly will work.
+
+> 
+> > I don't want to force this idea on anyone, or annoy anyone.  I just
+> > think it needs to be considered.  The concepts are unusual, so once
+> > people get the full idea, if they don't like it, we can trash it.  I
+> > still think it holds promise.
+> 
+> Agreed.
+> 
+> Vadim
+> 
+
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Fri Nov  7 01:03:09 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07314
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:03:05 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA07879; Fri, 7 Nov 1997 00:57:42 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:55:52 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA03918 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:55:46 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA02961 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 00:55:18 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id MAA25567; Fri, 7 Nov 1997 12:59:29 +0700 (KRS)
+Message-ID: <3462AE40.FF6D5DF@sable.krasnoyarsk.su>
+Date: Fri, 07 Nov 1997 12:59:28 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711070454.XAA05565@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> I was worried when you didn't respond to my last list of ideas.  I
+> thought perhaps the idea was getting on your nerves.
+
+No, I was (and, unfortunately, I still) busy...
+
+> 
+> I haven't dropped the idea because:
+> 
+>         1) it offers 2-9 times speedup in database modifications
+>         2) this is how the big commercial system handle it, and I think
+>            we need to give users this option.
+>         3) in the way I had it designed, it wouldn't take much work to
+>            do it.
+> 
+> Anything that promises that much speedup, if it can be done easy, I say
+> lets consider it, even if you loose 60 seconds of changes.
+
+I agreed with your un-buffered logging idea. This would be excellent
+feature for un-critical dbase usings (WWW, etc).
+
+> 
+> >    In my plans to re-design transaction system I supposed to keep in shmem
+> > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
+> > to access them is not good idea. Also, we could use spinlock instead of
+> > lock manager to synchronize access to these pages (as I see in spin.c
+> > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
+> > are faster.
+> 
+> Ah, so you already had the idea of having on-line pages in shared memory
+> as part of a transaction system overhaul?  Right now, does each backend
+
+Yes. I hope to implement this in the next 1-2 weeks.
+
+> lock/read/write/unlock to get at pg_log?  Wow, that is bad.
+
+Yes, he does.
+
+> 
+> Perhaps mmap() would be a good idea.  My system has msync() to flush
+> mmap()'ed pages to the underlying file.  You would still run fsync()
+> after that.  This may give us the best of both worlds: a shared-memory
+                                                           ^^^^^^^^^^^^^
+> area of variable size, and control of when it get flushed to disk.  Do
+  ^^^^^^^^^^^^^^^^^^^^^
+I like it. FreeBSD supports 
+
+MAP_ANON    Map anonymous memory not associated with any specific file. 
+
+It would be nice to use mmap to get more "shared" memory, but I don't see 
+reasons to mmap any particular file to memory. Having two last pg_log pages 
+in memory + xact commit/abort writeback optimization (updation of commit/abort 
+xmin/xmax status in tuples by any scan - we already have this) reduce access 
+to "old" pg_log pages to zero.
+
+> other OS's have this?  I have a feeling OS's with unified buffer caches
+> don't have this ability to determine when the underlying mmap'ed file
+> gets sent to the underlying file and disk.
+> 
+> >    These two last pg_log pages are "online" ones. Race condition: when one or
+> > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
+> > when writing commit/abort of "big" xid. This is how we could handle this
+> > in "buffered" logging (delayed fsync) mode:
+> >
+> >    When backend want to write commit/abort status he acquires exclusive
+> > OnLineLogLock. If xid belongs to online pages then backend writes status
+> > and releases spin. If xid is less than least xid on 1st online page then
+> > backend releases spin and does exactly the same what he does in normal mode:
+> > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
+> > update xid status, WriteBuffer, release write lock, flush pg_log.
+> > If xid is greater than max xid on 2nd online page then the simplest way is
+> > just do sync(); sync() (two times), flush 1st or both online pages,
+> > read new page(s) into online pages space, update xid status,
+> > release OnLineLogLock spin. We could try other ways but pg_log expanding
+> > is rare case (32K xids in one pg_log page)...
+> >    All what postmaster will have to do is:
+> > 1. Get shared OnLineLogLock.
+> > 2. Copy 2 x 8K data to private place.
+> > 3. Release spinlock.
+> > 4. sync(); sync(); (two times!)
+> > 5. Flush online pages.
+> >
+> > We could use -F DELAY_TIME to turn fsync delayed mode ON.
+> >
+> > And, btw, having two bits for xact status we have only one unused
+> > status value (0x11) currently - I would like to use this for
+> > nested xactions and savepoints...
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+More about this: 0x11 could mean "this _child_ transaction is committed -
+you have to lookup in pg_xact_child to get parent xid and use pg_log again
+to get parent xact status". If parent committed then child xact status
+will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this
+we could get xact nesting and savepoints by starting new child xaction
+inside running one...
+
+> 
+> I saw that.  By keeping two copies of pg_log, one in memory to be used
+                                        ^^^^^^
+                             Just two pg_log pages...
+
+> by all backend, and another that hits the disk, it certainly will work.
+
+Vadim
+
+
+From vadim@sable.krasnoyarsk.su Fri Nov  7 01:30:59 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07599
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:30:58 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id BAA26793 for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 01:12:33 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id NAA25592; Fri, 7 Nov 1997 13:16:39 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <3462B247.ABD322C@sable.krasnoyarsk.su>
+Date: Fri, 07 Nov 1997 13:16:39 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Jan Wieck <wieck@sapserv.debis.de>
+CC: Bruce Momjian <maillist@candle.pha.pa.us>, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <m0xT9nq-000BFQC@orion.SAPserv.Hamburg.dsh.de>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+wieck@sapserv.debis.de wrote:
+> 
+> Bruce wrote:
+> >
+> > > > It seems that this is what Oracle does, but Sybase writes queries
+> > > > (with transaction ids, of 'course, and before execution) and
+> > > > begin, commit/abort events <-- this is better for non-overwriting
+> > > > system (shorter redo file), but, agreed, recovering is more complicated.
+> > > >
+> > > > Vadim
+> > > >
+> > >
+> > >     Writing  only  the queries (and only those that really modify
+> > >     data - no selects) would be much smarter and the  redo  files
+> > >     will  be  shorter. But it wouldn't fit for PostgreSQL as long
+> > >     as someone can submit a query like
+> > >
+> > >         DELETE FROM xxx WHERE oid = 59337;
+> >
+> > Interesting point.  Currently, an insert shows the OID as output in
+> > psql.  Perhaps we could do a little oid-manipulating to set the oid of
+> > the insert.
+> 
+>     Only for simple inserts, not on
+> 
+>         INSERT INTO xxx SELECT any_type_of_merge_join;
+
+I don't know how but Sybase handle this and IDENTITY (case of OIDs) too.
+But I don't object you, Jan, just because I havn't time to do
+"log queries" redo implementation and so I would like to have "log changes"
+redo at least. (Actually, "log changes" is good for my production dbase
+with 1 - 2 thousand updations per day).
+(BTW, "incrementing" backup could be implemented without redo - I have
+some thoughts about this, - but having additional recovering is good
+in any case).
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Fri Nov  7 15:42:58 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA22341
+	for <maillist@candle.pha.pa.us>; Fri, 7 Nov 1997 15:42:55 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA02769; Fri, 7 Nov 1997 15:28:54 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 15:24:00 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA01318 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 15:23:52 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id PAA00705 for <hackers@postgreSQL.org>; Fri, 7 Nov 1997 15:21:56 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id PAA20010;
+	Fri, 7 Nov 1997 15:20:10 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711072020.PAA20010@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Fri, 7 Nov 1997 15:20:10 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <3462AE40.FF6D5DF@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 7, 97 12:59:28 pm
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> > Anything that promises that much speedup, if it can be done easy, I say
+> > lets consider it, even if you loose 60 seconds of changes.
+> 
+> I agreed with your un-buffered logging idea. This would be excellent
+> feature for un-critical dbase usings (WWW, etc).
+
+Actually, it is buffered logging.  We currently have unbuffered logging,
+I think.
+
+> > >    In my plans to re-design transaction system I supposed to keep in shmem
+> > > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
+> > > to access them is not good idea. Also, we could use spinlock instead of
+> > > lock manager to synchronize access to these pages (as I see in spin.c
+> > > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
+> > > are faster.
+> > 
+> > Ah, so you already had the idea of having on-line pages in shared memory
+> > as part of a transaction system overhaul?  Right now, does each backend
+> 
+> Yes. I hope to implement this in the next 1-2 weeks.
+> 
+> > lock/read/write/unlock to get at pg_log?  Wow, that is bad.
+> 
+> Yes, he does.
+> 
+> > 
+> > Perhaps mmap() would be a good idea.  My system has msync() to flush
+> > mmap()'ed pages to the underlying file.  You would still run fsync()
+> > after that.  This may give us the best of both worlds: a shared-memory
+>                                                            ^^^^^^^^^^^^^
+> > area of variable size, and control of when it get flushed to disk.  Do
+>   ^^^^^^^^^^^^^^^^^^^^^
+> I like it. FreeBSD supports 
+> 
+> MAP_ANON    Map anonymous memory not associated with any specific file. 
+> 
+> It would be nice to use mmap to get more "shared" memory, but I don't see 
+> reasons to mmap any particular file to memory. Having two last pg_log pages 
+> in memory + xact commit/abort writeback optimization (updation of commit/abort 
+> xmin/xmax status in tuples by any scan - we already have this) reduce access 
+> to "old" pg_log pages to zero.
+
+I totally agree.  There is no advantage to mmap() vs. shared memory for
+us.  I thought if we could control when the mmap() gets flushed to disk,
+we could let the OS handle the syncing, but I doubt this is going to be
+portable.
+
+Though, we could mmap() pg_log, and that way backends would not have to
+read/write the blocks, and they could all see the same data.  But with
+the new scheme, they have most transaction ids in shared memory.
+
+Interesting you mention the scan updating the transaction status.  We
+would have a problem here.  It is possible a backend will update the
+commit status of a data page, and that data page will make it to disk,
+but if there is a crash before the update pg_log gets sync'ed, there
+would be a partial transaction in the system.
+
+I don't know any way that a backend would know the transaction has hit
+disk, and the data commit flag could be set.  You don't want to update
+the commit flag of the data page until entire transaction has been
+sync'ed.  The only way to do that would be to have a 'commit and synced'
+flag, but you want to save that for nested transactions.
+
+Another case this could come in handy is to allow reuse of superceeded
+data rows.  If the transaction is committed and synced, the row space
+could be reused by another transaction.
+
+> > other OS's have this?  I have a feeling OS's with unified buffer caches
+> > don't have this ability to determine when the underlying mmap'ed file
+> > gets sent to the underlying file and disk.
+> > 
+> > >    These two last pg_log pages are "online" ones. Race condition: when one or
+> > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
+> > > when writing commit/abort of "big" xid. This is how we could handle this
+> > > in "buffered" logging (delayed fsync) mode:
+> > >
+> > >    When backend want to write commit/abort status he acquires exclusive
+> > > OnLineLogLock. If xid belongs to online pages then backend writes status
+
+This confuses me.  Why does a backend need to lock pg_log to update a
+transaction status?
+
+> > > and releases spin. If xid is less than least xid on 1st online page then
+> > > backend releases spin and does exactly the same what he does in normal mode:
+> > > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
+> > > update xid status, WriteBuffer, release write lock, flush pg_log.
+> > > If xid is greater than max xid on 2nd online page then the simplest way is
+> > > just do sync(); sync() (two times), flush 1st or both online pages,
+> > > read new page(s) into online pages space, update xid status,
+> > > release OnLineLogLock spin. We could try other ways but pg_log expanding
+> > > is rare case (32K xids in one pg_log page)...
+> > >    All what postmaster will have to do is:
+> > > 1. Get shared OnLineLogLock.
+> > > 2. Copy 2 x 8K data to private place.
+> > > 3. Release spinlock.
+> > > 4. sync(); sync(); (two times!)
+> > > 5. Flush online pages.
+
+Great.
+
+> > >
+> > > We could use -F DELAY_TIME to turn fsync delayed mode ON.
+> > >
+> > > And, btw, having two bits for xact status we have only one unused
+> > > status value (0x11) currently - I would like to use this for
+> > > nested xactions and savepoints...
+>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+> More about this: 0x11 could mean "this _child_ transaction is committed -
+> you have to lookup in pg_xact_child to get parent xid and use pg_log again
+> to get parent xact status". If parent committed then child xact status
+> will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this
+> we could get xact nesting and savepoints by starting new child xaction
+> inside running one...
+
+OK.
+
+> 
+> > 
+> > I saw that.  By keeping two copies of pg_log, one in memory to be used
+>                                         ^^^^^^
+>                              Just two pg_log pages...
+
+Got it.
+
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From owner-pgsql-hackers@hub.org Sun Nov  9 22:07:36 1997
+Received: from hub.org (hub.org [209.47.148.200])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA04655
+	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 22:07:30 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id VAA07023; Sun, 9 Nov 1997 21:55:54 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 21:52:20 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id VAA06174 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 21:52:13 -0500 (EST)
+Received: from candle.pha.pa.us (maillist@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id VAA06092 for <hackers@postgreSQL.org>; Sun, 9 Nov 1997 21:51:58 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id VAA04150;
+	Sun, 9 Nov 1997 21:50:29 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711100250.VAA04150@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! (fwd)
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Sun, 9 Nov 1997 21:50:29 -0500 (EST)
+Cc: hackers@postgreSQL.org (PostgreSQL-development)
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+Forwarded message:
+> > > Perhaps mmap() would be a good idea.  My system has msync() to flush
+> > > mmap()'ed pages to the underlying file.  You would still run fsync()
+> > > after that.  This may give us the best of both worlds: a shared-memory
+> >                                                            ^^^^^^^^^^^^^
+> > > area of variable size, and control of when it get flushed to disk.  Do
+> >   ^^^^^^^^^^^^^^^^^^^^^
+> > I like it. FreeBSD supports 
+> > 
+> > MAP_ANON    Map anonymous memory not associated with any specific file. 
+> > 
+> > It would be nice to use mmap to get more "shared" memory, but I don't see 
+> > reasons to mmap any particular file to memory. Having two last pg_log pages 
+> > in memory + xact commit/abort writeback optimization (updation of commit/abort 
+> > xmin/xmax status in tuples by any scan - we already have this) reduce access 
+> > to "old" pg_log pages to zero.
+> 
+> I totally agree.  There is no advantage to mmap() vs. shared memory for
+> us.  I thought if we could control when the mmap() gets flushed to disk,
+> we could let the OS handle the syncing, but I doubt this is going to be
+> portable.
+> 
+> Though, we could mmap() pg_log, and that way backends would not have to
+> read/write the blocks, and they could all see the same data.  But with
+> the new scheme, they have most transaction ids in shared memory.
+> 
+> Interesting you mention the scan updating the transaction status.  We
+> would have a problem here.  It is possible a backend will update the
+> commit status of a data page, and that data page will make it to disk,
+> but if there is a crash before the update pg_log gets sync'ed, there
+> would be a partial transaction in the system.
+> 
+> I don't know any way that a backend would know the transaction has hit
+> disk, and the data commit flag could be set.  You don't want to update
+> the commit flag of the data page until entire transaction has been
+> sync'ed.  The only way to do that would be to have a 'commit and synced'
+> flag, but you want to save that for nested transactions.
+> 
+> Another case this could come in handy is to allow reuse of superceeded
+> data rows.  If the transaction is committed and synced, the row space
+> could be reused by another transaction.
+> 
+
+I have been thinking about the mmap() issue, and it seems a natural for
+pg_log.  You can have every backend mmap() pg_log.  It becomes a dynamic
+shared memory area that is auto-initialized to the contents of pg_log,
+and all changes can be made by all backends.  No locking needed.  We can
+also flush the changes to the underlying file.  Under bsdi, you can also
+have the mmap area follow you across exec() calls, so each backend
+doesn't have to do anything.  I want to replace exec with fork also, so
+the stuff would be auto-loaded in the address space of each backend.
+
+This way, you don't have to have two on-line pages and move them around
+as pg_log grows.
+
+The only problem remains how to mark certain transactions as synced or
+force only synced transactions to hit the pg_log file itself, and data
+row commit status only should be updated for synced transactions.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+
+From vadim@sable.krasnoyarsk.su Sun Nov  9 23:00:58 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA05394
+	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:00:55 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA25139 for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 22:42:33 -0500 (EST)
+Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id KAA01845; Mon, 10 Nov 1997 10:49:25 +0700 (KRS)
+Sender: root@www.krasnet.ru
+Message-ID: <34668444.237C228A@sable.krasnoyarsk.su>
+Date: Mon, 10 Nov 1997 10:49:24 +0700
+From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
+Organization: ITTS (Krasnoyarsk)
+X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+CC: marc@fallon.classyad.com, hackers@postgreSQL.org
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+References: <199711072020.PAA20010@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+> 
+> > > Anything that promises that much speedup, if it can be done easy, I say
+> > > lets consider it, even if you loose 60 seconds of changes.
+> >
+> > I agreed with your un-buffered logging idea. This would be excellent
+> > feature for un-critical dbase usings (WWW, etc).
+> 
+> Actually, it is buffered logging.  We currently have unbuffered logging,
+> I think.
+
+Sorry - mistyping.
+
+> 
+> Interesting you mention the scan updating the transaction status.  We
+> would have a problem here.  It is possible a backend will update the
+> commit status of a data page, and that data page will make it to disk,
+> but if there is a crash before the update pg_log gets sync'ed, there
+> would be a partial transaction in the system.
+
+You're right! Currently, only system relations can be affected by this:
+backend releases locks on user tables after syncing data and pg_log.
+I'll keep this in mind...
+
+> > > >    These two last pg_log pages are "online" ones. Race condition: when one or
+> > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
+> > > > when writing commit/abort of "big" xid. This is how we could handle this
+> > > > in "buffered" logging (delayed fsync) mode:
+> > > >
+> > > >    When backend want to write commit/abort status he acquires exclusive
+> > > > OnLineLogLock. If xid belongs to online pages then backend writes status
+> 
+> This confuses me.  Why does a backend need to lock pg_log to update a
+> transaction status?
+
+What if two backends try to change xact statuses in the same byte ?
+
+Vadim
+
+From owner-pgsql-hackers@hub.org Sun Nov  9 23:59:50 1997
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA06523
+	for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:59:48 -0500 (EST)
+Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA27105 for <maillist@candle.pha.pa.us>; Sun, 9 Nov 1997 23:41:39 -0500 (EST)
+Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA08860; Sun, 9 Nov 1997 23:35:42 -0500 (EST)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 23:31:50 -0500 (EST)
+Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA07962 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 23:31:43 -0500 (EST)
+Received: from candle.pha.pa.us (root@s3-03.ppp.op.net [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA07875 for <hackers@postgreSQL.org>; Sun, 9 Nov 1997 23:31:28 -0500 (EST)
+Received: (from maillist@localhost)
+	by candle.pha.pa.us (8.8.5/8.8.5) id XAA05566;
+	Sun, 9 Nov 1997 23:17:41 -0500 (EST)
+From: Bruce Momjian <maillist@candle.pha.pa.us>
+Message-Id: <199711100417.XAA05566@candle.pha.pa.us>
+Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
+To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
+Date: Sun, 9 Nov 1997 23:17:41 -0500 (EST)
+Cc: marc@fallon.classyad.com, hackers@postgreSQL.org
+In-Reply-To: <34668444.237C228A@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 10, 97 10:49:24 am
+X-Mailer: ELM [version 2.4 PL25]
+MIME-Version: 1.0
+Content-Type: text/plain; charset=US-ASCII
+Content-Transfer-Encoding: 7bit
+Sender: owner-hackers@hub.org
+Precedence: bulk
+Status: OR
+
+> > > > >    These two last pg_log pages are "online" ones. Race condition: when one or
+> > > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
+> > > > > when writing commit/abort of "big" xid. This is how we could handle this
+> > > > > in "buffered" logging (delayed fsync) mode:
+> > > > >
+> > > > >    When backend want to write commit/abort status he acquires exclusive
+> > > > > OnLineLogLock. If xid belongs to online pages then backend writes status
+> > 
+> > This confuses me.  Why does a backend need to lock pg_log to update a
+> > transaction status?
+> 
+> What if two backends try to change xact statuses in the same byte ?
+
+Ooo, you got me.  I so hoped to prevent locking.  It would be nice if:
+
+	*x |= 3;
+
+would be atomic, but I don't think it is.  Most RISC machines don't even
+have an OR against a memory address, I think.
+
+-- 
+Bruce Momjian
+maillist@candle.pha.pa.us
+
+