author      John Wilkins <john.wilkins@inktank.com>    2013-02-22 15:38:20 -0800
committer   John Wilkins <john.wilkins@inktank.com>    2013-02-22 15:38:20 -0800
commit      10f50d353e3072c621f9924f11bac9f8dbf1a119 (patch)
tree        10a91caf474d53e1b41a54e4c9be2acb55272174
parent      e68f2c85d35f0a19e2f5b7cd2b7f1ff1e5c98187 (diff)
download    ceph-10f50d353e3072c621f9924f11bac9f8dbf1a119.tar.gz
doc: Added a lot of info to OSD troubleshooting.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
-rw-r--r--  doc/rados/operations/troubleshooting-osd.rst | 549
1 file changed, 457 insertions, 92 deletions
diff --git a/doc/rados/operations/troubleshooting-osd.rst b/doc/rados/operations/troubleshooting-osd.rst
index 1dffa02bb42..6bc77db2e40 100644
--- a/doc/rados/operations/troubleshooting-osd.rst
+++ b/doc/rados/operations/troubleshooting-osd.rst
@@ -1,10 +1,187 @@
 ==============================
- Recovering from OSD Failures
+ Troubleshooting OSDs and PGs
 ==============================
 
-Single OSD Failure
+Before troubleshooting your OSDs, check your monitors and network first. If
+you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
+a health status, the monitors have a quorum. If you don't have a monitor quorum
+or if there are errors with the monitor status, address the monitor issues
+first. Check your networks to ensure they are running properly, because
+networks may have a significant impact on OSD operation and performance.
+
+
+The Ceph Community
 ==================
+
+The Ceph community is an excellent source of information and help. For
+operational issues with Ceph releases, we recommend you `subscribe to the
+ceph-users email list`_. When you no longer want to receive emails, you can
+`unsubscribe from the ceph-users email list`_.
+
+If you have read through this guide and you have contacted ``ceph-users``,
+but you haven't resolved your issue, you may contact `Inktank`_ for support.
+
+You may also `subscribe to the ceph-devel email list`_. You should do so if
+your issue is:
+
+- Likely related to a bug
+- Related to a development release package
+- Related to a development testing package
+- Related to your own builds
+
+If you no longer want to receive emails from the ``ceph-devel`` email list, you
+may `unsubscribe from the ceph-devel email list`_.
+
+.. tip:: The Ceph community is growing rapidly, and community members can help
+   you if you provide them with detailed information about your problem. See
+   `Obtaining Data About OSDs`_ before you post questions to ensure that
+   community members have sufficient data to help you.
+
+
+Obtaining Data About OSDs
+=========================
+
+A good first step in troubleshooting your OSDs is to obtain information in
+addition to the information you collected while `monitoring your OSDs`_
+(e.g., ``ceph osd tree``).
+
+
+Ceph Logs
+---------
+
+If you haven't changed the default path, you can find Ceph log files at
+``/var/log/ceph``::
+
+    ls /var/log/ceph
+
+If you don't get enough log detail, you can change your logging level. See
+`Ceph Logging and Debugging`_ and `Logging and Debugging Config Reference`_ in
+the Ceph Configuration documentation for details. Also, see `Debugging and
+Logging`_ in the Ceph Operations documentation to ensure that Ceph performs
+adequately under high logging volume.
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. First, list the
+sockets for your Ceph processes::
+
+    ls /var/run/ceph
+
+Then, execute the following, replacing ``{socket-name}`` with an actual
+socket name, to show the list of available options::
+
+    ceph --admin-daemon /var/run/ceph/{socket-name} help
+
+The admin socket, among other things, allows you to:
+
+- List your configuration at runtime
+- Dump historic operations
+- Dump the operation priority queue state
+- Dump operations in flight
+- Dump perfcounters
+
+
+Display Free Space
+------------------
+
+Filesystem issues may arise. To display your filesystem's free space, execute
+``df``. ::
+
+    df -h
+
+Execute ``df --help`` for additional usage.
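For illustration, a minimal sketch of checking free space on the OSD data
mounts specifically, assuming the default data path ``/var/lib/ceph/osd``
(adjust the glob to your layout)::

    df -h /var/lib/ceph/osd/ceph-*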
+
+
+I/O Statistics
+--------------
+
+Use `iostat`_ to identify I/O-related issues. ::
+
+    iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
+or ``tail``. For example::
+
+    dmesg | grep scsi
+
+
+Stopping w/out Rebalancing
+==========================
+
+If the problem with your cluster requires you to bring down a failure domain
+(e.g., a rack) for maintenance and you do not want CRUSH to automatically
+rebalance the cluster as you stop OSDs for maintenance, set the cluster to
+``noout`` first::
+
+    ceph osd set noout
+
+Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
+failure domain that requires maintenance work. ::
+
+    ceph osd stop osd.{num}
+
+.. note:: Placement groups within the OSDs you stop will become ``degraded``
+   while you are addressing issues within the failure domain.
+
+Once you have completed your maintenance, restart the OSDs. ::
+
+    ceph osd start osd.{num}
+
+Finally, you must unset the cluster from ``noout``. ::
+
+    ceph osd unset noout
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
+allow it to rejoin the cluster and recover.
+
+An OSD Won't Start
+------------------
+
+If you start your cluster and an OSD won't start, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from
+  a new installation, check your configuration file to ensure it conforms
+  (e.g., ``host`` not ``hostname``, etc.).
+
+- **Check Paths:** Check the paths in your configuration, and the actual
+  paths themselves for data and journals. If you separate the OSD data from
+  the journal data and there are errors in your configuration file or in the
+  actual mounts, you may have trouble starting OSDs. If you want to store the
+  journal on a block device, you should partition your journal disk and assign
+  one partition per OSD.
+
+- **Kernel Version:** Identify the kernel version and distribution you
+  are using. Ceph uses some third party tools by default, which may be
+  buggy or may conflict with certain distributions and/or kernel
+  versions (e.g., Google perftools). Check the `OS recommendations`_
+  to ensure you have addressed any issues related to your kernel.
+
+- **Segmentation Fault:** If there is a segmentation fault, turn your logging
+  up (if it isn't already) and try again. If it segfaults again, contact the
+  ceph-devel email list and provide your Ceph configuration file, your monitor
+  output and the contents of your log file(s).
+
+If you cannot resolve the issue and the email list isn't helpful, you may
+contact `Inktank`_ for support.
+
+
+An OSD Failed
+-------------
+
 When a ``ceph-osd`` process dies, the monitor will learn about the failure
 from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
 command::
@@ -20,8 +197,7 @@ processes that are marked ``in`` and ``down``. You can identify which
 
     HEALTH_WARN 1/3 in osds are down
     osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
 
-Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
-allow it to rejoin the cluster and recover. If there is a disk
+If there is a disk
 failure or other fault preventing ``ceph-osd`` from functioning or restarting,
 an error message should be present in its log file in ``/var/log/ceph``.
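For illustration, a minimal sketch of inspecting that log, assuming the default
log path and a hypothetical ``osd.0`` (substitute the id of the failed daemon)::

    less /var/log/ceph/ceph-osd.0.log
    grep -iE 'error|assert|abort' /var/log/ceph/ceph-osd.0.log | tail -n 20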
@@ -31,17 +207,27 @@ kernel file system may be unresponsive.
 
 Check ``dmesg`` output for disk or other kernel errors.
 
 If the problem is a software error (failed assertion or other
-unexpected error), it should be reported to the :ref:`mailing list
-<mailing-list>`.
+unexpected error), it should be reported to the `ceph-devel`_ email list.
+
+No Free Drive Space
+-------------------
 
-The Cluster Has No Free Disk Space
-==================================
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster
+is getting near its full ratio. The ``mon osd full ratio`` defaults to
+``0.95``, or 95% of capacity, before it stops clients from writing data.
+The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity,
+when it generates a health warning.
 
-If the cluster fills up, the monitor will prevent new data from being
-written. The system puts ``ceph-osds`` in two categories: ``nearfull``
-and ``full``, with configurable threshholds for each (80% and 90% by
-default). In both cases, full ``ceph-osds`` will be reported by ``ceph health``::
+Full cluster issues usually arise when testing how Ceph handles an OSD
+failure on a small cluster. When one node has a high percentage of the
+cluster's data, the cluster can easily eclipse its nearfull and full ratios
+immediately. If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
+
+Full ``ceph-osds`` will be reported by ``ceph health``::
 
     ceph health
     HEALTH_WARN 1 nearfull osds
@@ -54,43 +240,236 @@ Or::
 
     osd.2 is near full at 85%
     osd.3 is full at 97%
 
-The best way to deal with a full cluster is to add new ``ceph-osds``,
-allowing the cluster to redistribute data to the newly available
-storage.
+The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
+the cluster to redistribute data to the newly available storage.
+
+If you cannot start an OSD because it is full, you may delete some data by deleting
+some placement group directories in the full OSD.
 
-Homeless Placement Groups
-=========================
+.. important:: If you choose to delete a placement group directory on a full OSD,
+   **DO NOT** delete the same placement group directory on another full OSD, or
+   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
+   at least one OSD.
 
-It is possible for all OSDs that had copies of a given placement groups to fail.
-If that's the case, that subset of the object store is unavailable, and the
-monitor will receive no status updates for those placement groups. To detect
-this situation, the monitor marks any placement group whose primary OSD has
-failed as ``stale``. For example::
-
-    ceph health
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+OSDs are Slow/Unresponsive
+==========================
 
-You can identify which placement groups are ``stale``, and what the last OSDs to
-store them were, with::
+A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your network(s) is working properly
+and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
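For illustration, a minimal sketch of checking for recovery activity and
in-flight requests before digging into performance, assuming a hypothetical
``osd.0`` and the default admin socket path::

    ceph -s    # look for recovering or backfilling placement groups
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight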
-    ceph health detail
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
-    ...
-    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
-    ...
-    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
-    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
-    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+   recovering OSDs from using up so many system resources that ``up`` and ``in``
+   OSDs aren't available or are otherwise slow.
 
-If we want to get placement group 2.5 back online, for example, this tells us that
-it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
-daemons will allow the cluster to recover that placement group (and, presumably,
-many others).
+
+Networking Issues
+-----------------
+
+Ceph is a distributed storage system, so it depends upon networks to peer with
+OSDs, replicate objects, recover from faults and check heartbeats. Networking
+issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
+details.
+
+Ensure that Ceph processes and Ceph-dependent processes are connected and/or
+listening. ::
+
+    netstat -a | grep ceph
+    netstat -l | grep ceph
+    sudo netstat -p | grep ceph
+
+Check network statistics. ::
+
+    netstat -s
+
+
+Drive Configuration
+-------------------
+
+A storage drive should only support one OSD. Sequential read and sequential
+write throughput can bottleneck if other processes share the drive, including
+journals, operating systems, monitors, other OSDs and non-Ceph processes.
+
+Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
+option to accelerate the response time, particularly when using the ``ext4`` or
+XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal
+simultaneously.
+
+.. note:: Partitioning a drive does not change its total throughput or
+   sequential read/write limits. Running a journal in a separate partition
+   may help, but you should prefer a separate physical drive.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your disks for bad sectors and fragmentation. These can cause total
+throughput to drop substantially.
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Monitors are generally light-weight processes, but they do lots of ``fsync()``,
+which can interfere with other workloads, particularly if monitors run on the
+same drive as your OSDs. Additionally, if you run monitors on the same host as
+the OSDs, you may incur performance issues related to:
+
+- Running an older kernel (pre-3.0)
+- Running Argonaut with an old ``glibc``
+- Running a kernel with no ``syncfs(2)`` syscall
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits. That often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as cloud-based solutions, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. The practice of separating Ceph operations from other applications
+may help improve performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+logging levels back down, the OSD may be putting a lot of logs onto the disk. If
+you intend to keep logging levels high, you may consider mounting a drive to the
+default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
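For illustration, a minimal sketch of turning debug levels back down at
runtime, assuming a hypothetical ``osd.0``; ``debug-osd`` and ``debug-ms`` are
example subsystems (see `Ceph Logging and Debugging`_ for the full list)::

    ceph tell osd.0 injectargs '--debug-osd 0/5 --debug-ms 0/5'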
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS or ext4. The btrfs
+filesystem has many attractive features, but bugs in the filesystem may
+lead to performance issues.
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident applications,
+VMs and so forth. However, when OSDs go into recovery mode, their memory
+utilization spikes. If there is no RAM available, OSD performance will slow
+considerably.
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log
+messages complaining about requests that are taking too long. The warning
+threshold defaults to 30 seconds, and is configurable via the
+``osd op complaint time`` option. When this happens, the cluster log will
+receive messages.
+
+Legacy versions of Ceph complain about 'old requests'::
+
+    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about 'slow requests'::
+
+    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon
+
+Possible solutions include:
+
+- Remove VMs and cloud solutions from Ceph hosts
+- Upgrade your kernel
+- Upgrade Ceph
+- Restart OSDs
+
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it isn't connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+'flapping'.
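For reference, the public and cluster networks described above are defined in
``ceph.conf``; a minimal sketch, with example subnets that are assumptions::

    [global]
        public network  = 192.168.0.0/24
        cluster network = 10.0.0.0/24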
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+    ceph osd set noup      # prevent OSDs from getting marked up
+    ceph osd set nodown    # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+    ceph osd dump | grep flags
+    flags no-up,no-down
+
+You can clear the flags with::
+
+    ceph osd unset noup
+    ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent booting
+OSDs from being marked ``in`` (allocated data) and prevent down ``ceph-osd``
+daemons from eventually being marked ``out`` (regardless of what the current
+value for ``mon osd down out interval`` is).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+   sense that once the flags are cleared, the action they were blocking
+   should occur shortly after. The ``noin`` flag, on the other hand,
+   prevents OSDs from being marked ``in`` on boot, and any daemons that
+   started while the flag was set will remain that way.
+
+
+Troubleshooting PG Errors
+=========================
 
 Stuck Placement Groups
-======================
+----------------------
 
 It is normal for placement groups to enter states like "degraded" or "peering"
 following a failure. Normally these states indicate the normal progression
@@ -122,10 +501,11 @@ recovery from completing, like unfound objects (see :ref:`failures-osd-unfound`);
+
 .. _failures-osd-peering:
 
 Placement Group Down - Peering Failure
-======================================
+--------------------------------------
 
 In certain cases, the ``ceph-osd`` `Peering` process can run into problems,
 preventing a PG from becoming active and usable. For
@@ -190,7 +570,7 @@ Recovery will proceed.
 
 .. _failures-osd-unfound:
 
 Unfound Objects
-===============
+---------------
 
 Under certain combinations of failures Ceph may complain about
 ``unfound`` objects::
@@ -288,62 +668,47 @@ was a new object) forget about it entirely. Use this with caution, as it
 may confuse applications that expected the object to exist.
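For illustration, the caution above refers to marking unfound objects lost; a
minimal sketch using the hypothetical placement group ``2.5`` from the examples
in this document::

    ceph pg 2.5 mark_unfound_lost revert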
+Homeless Placement Groups
+-------------------------
 
-Slow or Unresponsive OSD
-========================
-
-If, for some reason, a ``ceph-osd`` is slow to respond to a request, it will
-generate log messages complaining about requests that are taking too
-long. The warning threshold defaults to 30 seconds, and is configurable
-via the ``osd op complaint time`` option. When this happens, the cluster
-log will receive messages like::
-
-    slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
-
-Possible causes include:
-
- * bad disk (check ``dmesg`` output)
- * kernel file system bug (check ``dmesg`` output)
- * overloaded cluster (check system load, iostat, etc.)
- * ceph-osd bug
-
-Pay particular attention to the ``currently`` part, as that will give
-some clue as to what the request is waiting for. You can further look
-at exactly what requests the slow OSD is working on are, and what
-state(s) they are in with::
-
-    ceph --admin-daemon /var/run/ceph/ceph-osd.{ID}.asok dump_ops_in_flight
-
-These are sorted oldest to newest, and the dump includes an ``age``
-indicating how long the request has been in the queue.
-
-Flapping OSDs
-=============
-
-If something is causing OSDs to "flap" (repeatedly getting marked ``down`` and then
-``up`` again), you can force the monitors to stop with::
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
 
-    ceph osd set noup    # prevent osds from getting marked up
-    ceph osd set nodown  # prevent osds from getting marked down
+    ceph health
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
 
-These flags are recorded in the osdmap structure::
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
 
-    ceph osd dump | grep flags
-    flags no-up,no-down
+    ceph health detail
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+    ...
+    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+    ...
+    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
 
-You can clear the flags with::
+If we want to get placement group 2.5 back online, for example, this tells us that
+it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
+daemons will allow the cluster to recover that placement group (and, presumably,
+many others).
 
-    ceph osd unset noup
-    ceph osd unset nodown
-
-Two other flags are supported, ``noin`` and ``noout``, which prevent
-booting OSDs from being marked ``in`` (allocated data) or down
-ceph-osds from eventually being marked ``out`` (regardless of what the
-current value for ``mon osd down out interval`` is).
-
-Note that ``noup``, ``noout``, and ``noout`` are temporary in the
-sense that once the flags are cleared, the action they were blocking
-should occur shortly after. The ``noin`` flag, on the other hand,
-prevents ceph-osds from being marked in on boot, and any daemons that
-started while the flag was set will remain that way.
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging Config Reference: ../../configuration/log-and-debug-ref
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _monitoring your OSDs: ../monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
+.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
+.. _Inktank: http://inktank.com
+.. _OS recommendations: ../../../install/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file