From aa779adc487929fb6732437b41904681d7479eba Mon Sep 17 00:00:00 2001
From: "Stephen D. Huston" <shuston@apache.org>
Date: Fri, 5 Nov 2010 22:02:57 +0000
Subject: Add design doc for new Windows hybrid SQL-CLFS store.

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk/qpid@1031841 13f79535-47bb-0310-9956-ffa450edef68
---
 cpp/design_docs/windows_clfs_store_design.txt | 239 ++++++++++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 cpp/design_docs/windows_clfs_store_design.txt

diff --git a/cpp/design_docs/windows_clfs_store_design.txt b/cpp/design_docs/windows_clfs_store_design.txt
new file mode 100644
index 0000000000..76ae419b40
--- /dev/null
+++ b/cpp/design_docs/windows_clfs_store_design.txt
@@ -0,0 +1,239 @@
+Design for Hybrid SQL/CLFS-Based Store in Qpid
+==============================================
+
+CLFS (Common Log File System) is a new facility in recent Windows versions.
+CLFS is an ARIES-compliant log intended to support high performance and
+transactional applications. CLFS is available in Windows Server 2003R2 and
+higher, as well as Windows Vista and Windows 7.
+
+There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS store
+moves the message, message-to-queue mapping, and transaction aspects
+of the SQL store into CLFS logs. Records of queues, exchanges, bindings,
+and configurations will remain in SQL. The main goal of this change is
+to yield higher performance on the time-critical messaging operations.
+CLFS, and therefore the new hybrid store, is not available on Windows XP
+or on Windows Server versions prior to 2003R2; these platforms will need to
+run the all-SQL store.
+
+Note for future consideration: it is possible to maintain all durable
+objects in CLFS, which would remove the need for SQL completely. It would
+require added log handling as well as the logic to ensure referential
+integrity between exchanges and queues via bindings as SQL does today.
+Also, the CLFS store counts on the SQL-stored queue records being correct
+when recovering messages; if a message operation in the log refers to a queue
+ID that's unknown, the CLFS store assumes the queue was deleted in the
+previous broker session and the log wasn't updated. That sort of assumption
+would need to be revisited if all content moves to a log.
+
+CLFS Capabilities
+-----------------
+
+This section explains some of the key CLFS concepts that are important
+in order to understand the designed use of CLFS for the store. It is
+not a complete explanation and is not feature-complete. Please see the
+CLFS documentation at MSDN for complete details
+(http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx).
+
+CLFS provides logs; each log can be dedicated or multiplexed. A multiplexed
+log has multiple streams of independent log records; a dedicated log has
+only one stream. Each log uses containers to hold the actual data; a log
+requires a minimum of two containers, each of which must be at least 512KB.
+Thus, the smallest log possible is 1MB. Logs can, of course, be larger, but
+with 1MB as the minimum size, logs should not be created indiscriminately.
+The maximum number of streams per log is approximately 100.
+
+As records are written to the log, CLFS assigns Log Sequence Numbers (LSNs).
+The first valid LSN in a log stream is called the Base, or Tail. CLFS
+can automatically reclaim and reuse container space for the log as the
+base LSN is moved when records are no longer needed. When a log is multiplexed,
+a stream which doesn't move its tail can prevent CLFS from reclaiming space
+and cause the log to grow indefinitely. Thus, mixing streams which don't
+update (and, thus, don't move their tails) with streams that are very dynamic in
+a single log will probably cause the log to continue to expand even though
+much of the space will be unused.
+
+CLFS provides three LSN types that are used to chain records together:
+
+- Next: This is a forward sequence maintained by CLFS itself, following the
+  order in which records are put into the stream.
+- Undo-next, Undo-prev: These are backward-looking chains that are used
+  to link a new record to some previous record(s) in the same stream.
+
+Also note that although log files reside in the file system and are easily
+located, the streams within a log cannot easily be discovered or listed
+unless the application itself records the stream names somewhere.
+
+Log Usage
+---------
+
+There are two logs in use.
+
+- Message: Each message will be represented by a chain of log records. All
+  messages will be intermixed in the same dedicated stream. Each portion of
+  a message's content (content is sometimes written in multiple chunks) as
+  well as each operation involving a message (enqueue, dequeue, etc.) will be
+  in a log record chained to the others related to the same message.
+
+- Transaction: Each transaction, local and distributed, will be represented
+  by a chain of log records. The record content will denote the transaction
+  as local or distributed.
+
+Both transaction and message logs use the LSN of the first record for a
+given object (message or transaction) as the persistence ID for that object.
+The LSN is a CLFS-maintained, always-increasing value that is 64 bits long,
+the same as a persistence ID.
+
+Log records that relate to a transaction or message previously logged use the
+log record undo-prev LSN to indicate which transaction/message the record
+relates to.
+
+Message Log Records
+-------------------
+
+Message log records will be one of the following types:
+
+- Message-Start: the first (and possibly only) section of message content
+- Message-Chunk: second and succeeding message content chunks
+- Message-Delete: marks the end of the message's lifetime
+- Message-Enqueue: records the message's placement on a queue
+- Message-Dequeue: records the message's removal from a queue
+
+The LSN of the Message-Start record is the persistence ID for the message.
+The log record undo-prev LSN is used to link each subsequent record for that
+message to the Message-Start record.
+
+A message's sequence of log records is extended for each operation on that
+message, until the message is deleted, whereupon a Message-Delete record is
+written. When the Message-Delete is written, the log's base LSN can be moved
+up to the next earliest message if the deleted one opens up a set of
+records at the tail of the log that are no longer needed. To help maintain
+the order and know when the base can be moved, the store keeps message
+information in an STL map whose key is the message ID (Message-Start LSN).
+Thus, the first entry in the map is the earliest ID/LSN in use.
+During recovery, messages still residing in the log can be ignored when the
+record sequence for the message ends with Message-Delete. Similarly, there
+may be log records for messages that are deleted; in this case the undo-prev
+LSN won't be one that's still within the log. Therefore, no Message-Start
+record will have been recovered for it, and the record can be ignored.
+
+Transaction Log Records
+-----------------------
+
+Transaction log records will be one of the following types:
+
+- Dtx-Start: Start of a distributed transaction
+- Tx-Start: Start of a local transaction
+- End: End of the transaction
+- Rollback: Marks that the transaction is rolled back
+- Prepare: Marks the dtx as prepared
+- Commit: Marks the transaction as committed
+- Delete: Notes that the transaction is no longer valid
+
+Transactions are also identified by the LSN of the start (Dtx-Start or
+Tx-Start) record. Successive records associated with the same transaction
+are linked backwards using the undo-prev LSN.
+
+The association between messages and transactions is maintained in the
+message log; if the message enqueue/dequeue operation is part of a transaction,
+the operation includes a transaction ID. The transaction log maintains the
+state of the transaction itself. Thus, each operation (enqueue, dequeue,
+prepare, rollback, commit) is a single log record.
+
+A few notes:
+- The transactions need to be recovered and sorted out prior to recovering
+  the messages. The message recovery needs to know if an enqueue/dequeue
+  associated with a transaction can be discarded or should be acted on.
+
+- Transaction IDs need to remain valid as long as any messages exist that
+  refer to them. This prevents the problem of trying to recover a message
+  with a transaction ID that doesn't exist - was it finalized? was it aborted?
+  Reference to a missing transaction ID can be ignored with assurance that
+  the message was deleted further along or the transaction would still be there.
+
+- Keeping transaction IDs valid requires that a refcount be kept on each
+  transaction at run time. As messages are deleted, the transaction set can
+  be notified that the message is gone. To enforce this, Message objects have
+  a boost::shared_ptr to each Transaction they're associated with. When a
+  Message is destroyed, its references to Transactions are released too. When
+  a Transaction is destroyed, it is done, so its Delete is written to the log.
+
+In-Memory Objects
+-----------------
+
+The store holds the message and transaction relationships in memory. CLFS is
+a backing store for that information so it can be reliably reconstructed in
+the event of a failure. This is a change from the SQL-only store where all
+of the information is maintained in SQL and none is kept in memory. The
+CLFS-using store is designed for high-throughput operation where it is assumed
+that messages will transit the broker (and, therefore, the store) quickly.
+
+- Message list: this is a map of persistence ID (message LSN) to a list of
+  queues where the message is located and an indication that there is
+  (or isn't) a transaction involved and in which direction (enqueue/dequeue)
+  so a dequeued message doesn't get deleted while a transacted enqueue is
+  pending.
+
+- Transaction list: also probably a map of id/LSN to a transaction object.
+  The transaction object needs to keep a list of messages/queues that are
+  impacted as well as the transaction state and Xid (for dtx).
+
+- Right now log records are written as needed with no preallocation or
+  reservation. It may be better to pre-reserve records in some cases, such
+  as a transaction prepare where the space for commit or rollback may be
+  reserved at the same time. This may be the only case where losing a
+  record may be an issue - needs some more thought.
+
+Recovery
+--------
+
+During recovery, the store needs to verify that recovered messages' queues
+exist. If there's a failure after a queue's deletion is final but before the
+messages are recorded as dequeued (and possibly deleted), the remaining
+dequeues (and possible message deletions) need to be handled during recovery
+by not restoring those messages for the broker, and also logging their
+deletion. The store could also skip the logging of deletion and let the normal
+tail maintenance eventually move up over the old message entries. Since the
+invalid messages won't be kept in the message map, their IDs won't be taken
+into account when maintaining the tail - the tail will move up over them as
+soon as enough messages come and go.
+
+Plugin Options
+--------------
+
+The command-line options added by the CLFS plugin are:
+
+  --connect             The SQL connect string for the SQL parts; same as the
+                        SQL plugin.
+  --catalog             The SQL database (catalog) name; same as the SQL plugin.
+  --store-dir           The directory to store the logs in. Defaults to the
+                        broker --data-dir value. If --no-data-dir is specified,
+                        --store-dir must be specified as well.
+  --container-size      The size of each container in the log, in bytes. The
+                        minimum size is 512K (smaller sizes will be rounded up).
+                        Additionally, the size will be rounded up to a multiple
+                        of the sector size on the disk holding the log. Once
+                        the log is created, each newly added container will
+                        be the same size as the initial container(s). Default
+                        is 1MB.
+  --initial-containers  The number of containers to populate a new log with
+                        if a new log is created. Ignored if the log exists.
+                        Default is 2.
+  --max-write-buffers   The maximum number of write buffers that the plugin can
+                        use before CLFS automatically flushes the log to disk.
+                        Lower values flush more often; higher values have
+                        higher performance. Default is 10.
+
+  Maybe need an option to hold messages of a certain size in memory? I think
+  maybe the broker proper holds the message content, so the store need not.
+
+Testing
+-------
+
+More tests will need to be written to stress the log container extension
+capability and ensure that moving the base LSN works properly and the store
+doesn't continually grow the log without bounds.
+
+Note that running "qpid-perftest --durable yes" stresses the log extension
+and tail maintenance. It doesn't get run as a normal regression test but should
+be run when playing with the container/tail maintenance logic to ensure it's
+not broken.
-- 
cgit v1.2.1