author     Tommi Virtanen <tommi.virtanen@dreamhost.com>  2011-09-01 13:24:06 -0700
committer  Tommi Virtanen <tommi.virtanen@dreamhost.com>  2011-09-01 13:28:12 -0700
commit     e09d4a96025f58f370f30b0aa61d88b2173074e4 (patch)
tree       145e3d891e6f3bd7391a20a595c54ef67b0ba774
parent     0a14c75b1a9ce3534425832eaddad496ef1839b2 (diff)
download   ceph-e09d4a96025f58f370f30b0aa61d88b2173074e4.tar.gz
doc: Architecture, placeholder in install, and first appendix.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
-rw-r--r--  doc/appendix/differences-from-posix.rst   17
-rw-r--r--  doc/appendix/index.rst                    10
-rw-r--r--  doc/architecture.rst                     179
-rw-r--r--  doc/index.rst                              1
-rw-r--r--  doc/ops/install.rst                       19
5 files changed, 210 insertions, 16 deletions
diff --git a/doc/appendix/differences-from-posix.rst b/doc/appendix/differences-from-posix.rst
new file mode 100644
index 00000000000..f327e786dc9
--- /dev/null
+++ b/doc/appendix/differences-from-posix.rst
@@ -0,0 +1,17 @@
+========================
+ Differences from POSIX
+========================
+
+.. todo:: delete http://ceph.newdream.net/wiki/Differences_from_POSIX
+
+Ceph does have a few places where it diverges from strict POSIX
+semantics, for various reasons:
+
+- Sparse files propagate incorrectly to tools like ``df``. They only
+  use up the required space, but ``df`` will increase the "used" space
+  by the full file size. We do this because actually keeping track of
+  the space a large, sparse file uses is very expensive.
+- In shared simultaneous writer situations, a write that crosses
+  object boundaries is not necessarily atomic. This means that you
+  could have writer A write "aa|aa" and writer B write "bb|bb"
+  simultaneously (where | is the object boundary), and end up with
+  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
diff --git a/doc/appendix/index.rst b/doc/appendix/index.rst
new file mode 100644
index 00000000000..a98bf899a2a
--- /dev/null
+++ b/doc/appendix/index.rst
@@ -0,0 +1,10 @@
+============
+ Appendices
+============
+
+.. toctree::
+   :glob:
+   :numbered:
+   :titlesonly:
+
+   *
diff --git a/doc/architecture.rst b/doc/architecture.rst
index cf67f6be505..3afbe6bcc8b 100644
--- a/doc/architecture.rst
+++ b/doc/architecture.rst
@@ -2,26 +2,173 @@
  Architecture of Ceph
 ======================
 
-- Introduction to Ceph Project
+Ceph is a distributed network storage and file system with distributed
+metadata management and POSIX semantics.
 
-  - High-level overview of project benefits for users (few paragraphs, mention each subproject)
-  - Introduction to sub-projects (few paragraphs to a page each)
+RADOS is a reliable object store, used by Ceph, but also directly
+accessible.
 
-    - RADOS
-    - RGW
-    - RBD
-    - Ceph
+``radosgw`` is an S3-compatible RESTful HTTP service for object
+storage, using RADOS storage.
 
-  - Example scenarios Ceph projects are/not suitable for
-  - (Very) High-Level overview of Ceph
+RBD is a Linux kernel feature that exposes RADOS storage as a block
+device. Qemu/KVM also has a direct RBD client that avoids the kernel
+overhead.
 
-    This would include an introduction to basic project terminology,
-    the concept of OSDs, MDSes, and Monitors, and things like
-    that. What they do, some of why they're awesome, but not how they
-    work.
+
+Monitor cluster
+===============
 
-- Discussion of MDS terminology, daemon types (active, standby,
-  standby-replay)
+``cmon`` is a lightweight daemon that provides consensus for
+distributed decision-making in a Ceph/RADOS cluster.
 
-.. todo:: write me
+It is also the initial point of contact for new clients, and will hand
+out information about the topology of the cluster, such as the
+``osdmap``.
+
+You normally run 3 ``cmon`` daemons, on 3 separate physical machines,
+isolated from each other; for example, in different racks or rows.
+
+You could run just 1 instance, but that means giving up on high
+availability.
+
+You may use the same hosts for ``cmon`` and other purposes.
+
+``cmon`` processes talk to each other using a Paxos_\-style
+protocol. They discover each other via the ``[mon.X] mon addr`` fields
+in ``ceph.conf``.
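+
+As an illustrative sketch (the host names and addresses below are
+made up), the monitor sections of a 3-monitor ``ceph.conf`` could
+look like::
+
+   ; one [mon.X] section per monitor; 6789 is the usual monitor port
+   [mon.alpha]
+       mon addr = 192.168.0.10:6789
+
+   [mon.beta]
+       mon addr = 192.168.0.11:6789
+
+   [mon.gamma]
+       mon addr = 192.168.0.12:6789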
+
+.. todo:: What about ``monmap``? Fact check.
+
+Any decision requires the majority of the ``cmon`` processes to be
+healthy and communicating with each other. For this reason, you never
+want an even number of ``cmon``\s; there is no unambiguous majority
+subgroup for an even number.
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_algorithm
+
+.. todo:: explain monmap
+
+
+RADOS
+=====
+
+``cosd`` is the storage daemon that provides the RADOS service. It
+uses ``cmon`` for cluster membership, services object read/write/etc.
+requests from clients, and peers with other ``cosd``\s for data
+replication.
+
+The data model is fairly simple on this level. There are multiple
+named pools, and within each pool there are named objects, in a flat
+namespace (no directories). Each object has both data and metadata.
+
+The data for an object is a single, potentially big, series of
+bytes. Additionally, the series may be sparse: it may have holes that
+contain binary zeros and take up no actual storage.
+
+The metadata is an unordered set of key-value pairs. Its semantics
+are completely up to the client; for example, the Ceph filesystem
+uses metadata to store the file owner, etc.
+
+.. todo:: Verify that metadata is unordered.
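+
+As a sketch of the data model in action (the pool and object names
+below are made up), the ``rados`` command-line tool can store and
+retrieve objects directly::
+
+   # create a pool, store a file's contents as an object, read it back
+   rados mkpool swimming
+   rados -p swimming put greeting /etc/motd
+   rados -p swimming ls
+   rados -p swimming get greeting /tmp/greeting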
+
+Underneath, ``cosd`` stores the data on a local filesystem. We
+recommend using Btrfs_, but any POSIX filesystem that has extended
+attributes should work (see :ref:`xattr`).
+
+.. _Btrfs: http://en.wikipedia.org/wiki/Btrfs
+
+.. todo:: write about access control
+
+.. todo:: explain osdmap
+
+.. todo:: explain plugins ("classes")
+
+
+Ceph filesystem
+===============
+
+The Ceph filesystem service is provided by a daemon called
+``cmds``. It uses RADOS to store all the filesystem metadata
+(directories, file ownership, access modes, etc.), and directs clients
+to access RADOS directly for the file contents.
+
+The Ceph filesystem aims for POSIX compatibility, except for a few
+chosen differences. See :doc:`/appendix/differences-from-posix`.
+
+``cmds`` can run as a single process, or it can be distributed across
+multiple physical machines, either for high availability or for
+scalability.
+
+For high availability, the extra ``cmds`` instances can be `standby`,
+ready to take over the duties of any failed ``cmds`` that was
+`active`. This is easy because all the data, including the journal, is
+stored on RADOS. The transition is triggered automatically by
+``cmon``.
+
+For scalability, multiple ``cmds`` instances can be `active`, and they
+will split the directory tree into subtrees (and shards of a single
+busy directory), effectively balancing the load amongst all `active`
+servers.
+
+Combinations of `standby` and `active` etc. are possible, for example
+running 3 `active` ``cmds`` instances for scaling, and one `standby`.
+
+To control the number of `active` ``cmds`` instances, see
+:doc:`/ops/grow/mds`.
+
+.. topic:: Status as of 2011-09:
+
+   Multiple `active` ``cmds`` operation is stable under normal
+   circumstances, but some failure scenarios may still cause
+   operational issues.
+
+.. todo:: document `standby-replay`
+
+.. todo:: mds.0 vs mds.alpha etc details
+
+
+``radosgw``
+===========
+
+``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to
+store objects and metadata. It layers on top of RADOS with its own
+data formats, and maintains its own user database, authentication,
+access control, and so on.
+
+.. _RESTful: http://en.wikipedia.org/wiki/RESTful
+
+
+Rados Block Device (RBD)
+========================
+
+In virtual machine scenarios, RBD is typically used via the ``rbd``
+network storage driver in Qemu/KVM, where the host machine uses
+``librbd`` to provide a block device service to the guest.
+
+Alternatively, as no direct ``librbd`` support is available in Xen,
+the Linux kernel can act as the RBD client and provide a real block
+device on the host machine, which can then be accessed by the
+virtualization layer. This is done with the command-line tool ``rbd``
+(see :doc:`/ops/rbd`).
+
+The latter is also useful in non-virtualized scenarios.
+
+Internally, RBD stripes the device image over multiple RADOS objects,
+each typically located on a separate ``cosd``, allowing it to perform
+better than a single server could.
+
+
+Client
+======
+
+.. todo:: cephfs, cfuse, librados, libceph, librbd
+
+.. todo:: Summarize how much Ceph trusts the client, for what parts
+   (security vs. reliability).
+
+
+TODO
+====
+
+.. todo:: Example scenarios Ceph projects are/not suitable for
diff --git a/doc/index.rst b/doc/index.rst
index 195c74993bb..d3ad0c669ec 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -94,6 +94,7 @@ Table of Contents
    man/index
    papers
    glossary
+   appendix/index
 
 Indices and tables
diff --git a/doc/ops/install.rst b/doc/ops/install.rst
index 692ac926bec..1ffca6f4417 100644
--- a/doc/ops/install.rst
+++ b/doc/ops/install.rst
@@ -12,3 +12,22 @@ mentioning all the design tradeoffs and options like journaling
 locations or filesystems
 
 At this point, either use 1 or 3 mons, point to :doc:`grow/mon`
+
+OSD installation
+================
+
+btrfs
+-----
+
+what does btrfs give you (the journaling thing)
+
+
+ext4/ext3
+---------
+
+.. _xattr:
+
+Enabling extended attributes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+how to enable xattr on ext4/3
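+
+As a sketch (the device and mount point below are examples only, and
+kernel defaults vary), user extended attributes are typically enabled
+with the ``user_xattr`` mount option, either in ``/etc/fstab``::
+
+   /dev/sdb1  /srv/osd.0  ext3  user_xattr,noatime  0  0
+
+or on an already-mounted filesystem::
+
+   # remount in place, adding the user_xattr option
+   mount -o remount,user_xattr /srv/osd.0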