author      John Wilkins <john.wilkins@inktank.com>    2013-02-22 15:38:20 -0800
committer   John Wilkins <john.wilkins@inktank.com>    2013-02-22 15:38:20 -0800
commit      10f50d353e3072c621f9924f11bac9f8dbf1a119 (patch)
tree        10a91caf474d53e1b41a54e4c9be2acb55272174
parent      e68f2c85d35f0a19e2f5b7cd2b7f1ff1e5c98187 (diff)
download    ceph-10f50d353e3072c621f9924f11bac9f8dbf1a119.tar.gz
doc: Added a lot of info to OSD troubleshooting.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
-rw-r--r--  doc/rados/operations/troubleshooting-osd.rst | 549
1 file changed, 457 insertions, 92 deletions
diff --git a/doc/rados/operations/troubleshooting-osd.rst b/doc/rados/operations/troubleshooting-osd.rst
index 1dffa02bb42..6bc77db2e40 100644
--- a/doc/rados/operations/troubleshooting-osd.rst
+++ b/doc/rados/operations/troubleshooting-osd.rst
@@ -1,10 +1,187 @@
 ==============================
- Recovering from OSD Failures
+ Troubleshooting OSDs and PGs
 ==============================
 
-Single OSD Failure
+Before troubleshooting your OSDs, check your monitors and network first. If
+you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
+a health status, the monitors have a quorum. If you don't have a monitor quorum
+or if there are errors with the monitor status, address the monitor issues
+first. Check your networks to ensure they are running properly, because
+networks may have a significant impact on OSD operation and performance.
+
+
+The Ceph Community
 ==================
+
+The Ceph community is an excellent source of information and help. For
+operational issues with Ceph releases, we recommend you `subscribe to the
+ceph-users email list`_. When you no longer want to receive emails, you can
+`unsubscribe from the ceph-users email list`_.
+
+If you have read through this guide and you have contacted ``ceph-users``,
+but you haven't resolved your issue, you may contact `Inktank`_ for support.
+
+You may also `subscribe to the ceph-devel email list`_. You should do so if
+your issue is:
+
+- Likely related to a bug
+- Related to a development release package
+- Related to a development testing package
+- Related to your own builds
+
+If you no longer want to receive emails from the ``ceph-devel`` email list, you
+may `unsubscribe from the ceph-devel email list`_.
+
+.. tip:: The Ceph community is growing rapidly, and community members can help
+   you if you provide them with detailed information about your problem. See
+   `Obtaining Data About OSDs`_ before you post questions to ensure that
+   community members have sufficient data to help you.
+
+
+Obtaining Data About OSDs
+=========================
+
+A good first step in troubleshooting your OSDs is to obtain information in
+addition to the information you collected while `monitoring your OSDs`_
+(e.g., ``ceph osd tree``).
+
+
+Ceph Logs
+---------
+
+If you haven't changed the default path, you can find Ceph log files at
+``/var/log/ceph``::
+
+    ls /var/log/ceph
+
+If you don't get enough log detail, you can change your logging level. See
+`Ceph Logging and Debugging`_ and `Logging and Debugging Config Reference`_ in
+the Ceph Configuration documentation for details. Also, see `Debugging and
+Logging`_ in the Ceph Operations documentation to ensure that Ceph performs
+adequately under high logging volume.
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. First, list the
+sockets for your Ceph processes::
+
+    ls /var/run/ceph
+
+Then, execute the following, replacing ``{socket-name}`` with an actual
+socket name, to show the list of available options::
+
+    ceph --admin-daemon /var/run/ceph/{socket-name} help
+
+The admin socket, among other things, allows you to:
+
+- List your configuration at runtime
+- Dump historic operations
+- Dump the operation priority queue state
+- Dump operations in flight
+- Dump perfcounters
+
+
+Display Free Space
+------------------
+
+Filesystem issues may arise. To display your filesystem's free space, execute
+``df``. ::
+
+    df -h
+
+Execute ``df --help`` for additional usage.
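For illustration, a minimal sketch of checking free space on the OSD data
mounts specifically, assuming the default data path ``/var/lib/ceph/osd``
(adjust the glob to your layout)::

    df -h /var/lib/ceph/osd/ceph-*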
+
+
+I/O Statistics
+--------------
+
+Use `iostat`_ to identify I/O-related issues. ::
+
+    iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
+or ``tail``. For example::
+
+    dmesg | grep scsi
+
+
+Stopping w/out Rebalancing
+==========================
+
+If the problem with your cluster requires you to bring down a failure domain
+(e.g., a rack) for maintenance and you do not want CRUSH to automatically
+rebalance the cluster as you stop OSDs for maintenance, set the cluster to
+``noout`` first::
+
+    ceph osd set noout
+
+Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
+failure domain that requires maintenance work. ::
+
+    ceph osd stop osd.{num}
+
+.. note:: Placement groups within the OSDs you stop will become ``degraded``
+   while you are addressing issues within the failure domain.
+
+Once you have completed your maintenance, restart the OSDs. ::
+
+    ceph osd start osd.{num}
+
+Finally, you must unset the cluster from ``noout``. ::
+
+    ceph osd unset noout
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
+allow it to rejoin the cluster and recover.
+
+An OSD Won't Start
+------------------
+
+If you start your cluster and an OSD won't start, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from
+  a new installation, check your configuration file to ensure it conforms
+  (e.g., ``host`` not ``hostname``, etc.).
+
+- **Check Paths:** Check the paths in your configuration, and the actual
+  paths themselves for data and journals. If you separate the OSD data from
+  the journal data and there are errors in your configuration file or in the
+  actual mounts, you may have trouble starting OSDs. If you want to store the
+  journal on a block device, you should partition your journal disk and assign
+  one partition per OSD.
+
+- **Kernel Version:** Identify the kernel version and distribution you
+  are using. Ceph uses some third party tools by default, which may be
+  buggy or may conflict with certain distributions and/or kernel
+  versions (e.g., Google perftools). Check the `OS recommendations`_
+  to ensure you have addressed any issues related to your kernel.
+
+- **Segmentation Fault:** If there is a segmentation fault, turn your logging
+  up (if it isn't already) and try again. If it segfaults again, contact the
+  ceph-devel email list and provide your Ceph configuration file, your monitor
+  output and the contents of your log file(s).
+
+If you cannot resolve the issue and the email list isn't helpful, you may
+contact `Inktank`_ for support.
+
+
+An OSD Failed
+-------------
+
 When a ``ceph-osd`` process dies, the monitor will learn about the failure
 from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
 command::
@@ -20,8 +197,7 @@ processes that are marked ``in`` and ``down``. You can identify which
 
     HEALTH_WARN 1/3 in osds are down
     osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
 
-Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
-allow it to rejoin the cluster and recover. If there is a disk
+If there is a disk
 failure or other fault preventing ``ceph-osd`` from functioning or restarting,
 an error message should be present in its log file in ``/var/log/ceph``.
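For illustration, a minimal sketch of inspecting that log, assuming the default
log path and a hypothetical ``osd.0`` (substitute the id of the failed daemon)::

    less /var/log/ceph/ceph-osd.0.log
    grep -iE 'error|assert|abort' /var/log/ceph/ceph-osd.0.log | tail -n 20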
@@ -31,17 +207,27 @@ kernel file system may be unresponsive.
 
 Check ``dmesg`` output for disk or other kernel errors.
 
 If the problem is a software error (failed assertion or other
-unexpected error), it should be reported to the :ref:`mailing list
-<mailing-list>`.
+unexpected error), it should be reported to the `ceph-devel`_ email list.
+
+No Free Drive Space
+-------------------
 
-The Cluster Has No Free Disk Space
-==================================
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster
+is getting near its full ratio. The ``mon osd full ratio`` defaults to
+``0.95``, or 95% of capacity, before it stops clients from writing data.
+The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity,
+when it generates a health warning.
 
-If the cluster fills up, the monitor will prevent new data from being
-written. The system puts ``ceph-osds`` in two categories: ``nearfull``
-and ``full``, with configurable threshholds for each (80% and 90% by
-default). In both cases, full ``ceph-osds`` will be reported by ``ceph health``::
+Full cluster issues usually arise when testing how Ceph handles an OSD
+failure on a small cluster. When one node has a high percentage of the
+cluster's data, the cluster can easily eclipse its nearfull and full ratios
+immediately. If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
+
+Full ``ceph-osds`` will be reported by ``ceph health``::
 
     ceph health
     HEALTH_WARN 1 nearfull osds
@@ -54,43 +240,236 @@ Or::
 
     osd.2 is near full at 85%
     osd.3 is full at 97%
 
-The best way to deal with a full cluster is to add new ``ceph-osds``,
-allowing the cluster to redistribute data to the newly available
-storage.
+The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
+the cluster to redistribute data to the newly available storage.
+
+If you cannot start an OSD because it is full, you may delete some data by deleting
+some placement group directories in the full OSD.
 
-Homeless Placement Groups
-=========================
+.. important:: If you choose to delete a placement group directory on a full OSD,
+   **DO NOT** delete the same placement group directory on another full OSD, or
+   **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
+   at least one OSD.
 
-It is possible for all OSDs that had copies of a given placement groups to fail.
-If that's the case, that subset of the object store is unavailable, and the
-monitor will receive no status updates for those placement groups. To detect
-this situation, the monitor marks any placement group whose primary OSD has
-failed as ``stale``. For example::
-
-    ceph health
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+OSDs are Slow/Unresponsive
+==========================
 
-You can identify which placement groups are ``stale``, and what the last OSDs to
-store them were, with::
+A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your network(s) is working properly
+and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
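For illustration, a minimal sketch of checking for recovery activity and
in-flight requests before digging into performance, assuming a hypothetical
``osd.0`` and the default admin socket path::

    ceph -s    # look for recovering or backfilling placement groups
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight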
-    ceph health detail
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
-    ...
-    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
-    ...
-    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
-    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
-    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+   recovering OSDs from using up so many system resources that ``up`` and ``in``
+   OSDs aren't available or are otherwise slow.
 
-If we want to get placement group 2.5 back online, for example, this tells us that
-it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
-daemons will allow the cluster to recover that placement group (and, presumably,
-many others).
+
+Networking Issues
+-----------------
+
+Ceph is a distributed storage system, so it depends upon networks to peer with
+OSDs, replicate objects, recover from faults and check heartbeats. Networking
+issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
+details.
+
+Ensure that Ceph processes and Ceph-dependent processes are connected and/or
+listening. ::
+
+    netstat -a | grep ceph
+    netstat -l | grep ceph
+    sudo netstat -p | grep ceph
+
+Check network statistics. ::
+
+    netstat -s
+
+
+Drive Configuration
+-------------------
+
+A storage drive should only support one OSD. Sequential read and sequential
+write throughput can bottleneck if other processes share the drive, including
+journals, operating systems, monitors, other OSDs and non-Ceph processes.
+
+Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
+option to accelerate the response time, particularly when using the ``ext4`` or
+XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal
+simultaneously.
+
+.. note:: Partitioning a drive does not change its total throughput or
+   sequential read/write limits. Running a journal in a separate partition
+   may help, but you should prefer a separate physical drive.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your disks for bad sectors and fragmentation. These can cause total
+throughput to drop substantially.
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Monitors are generally light-weight processes, but they do lots of ``fsync()``,
+which can interfere with other workloads, particularly if monitors run on the
+same drive as your OSDs. Additionally, if you run monitors on the same host as
+the OSDs, you may incur performance issues related to:
+
+- Running an older kernel (pre-3.0)
+- Running Argonaut with an old ``glibc``
+- Running a kernel with no ``syncfs(2)`` syscall
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits. That often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as cloud-based solutions, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. The practice of separating Ceph operations from other applications
+may help improve performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+logging levels back down, the OSD may be putting a lot of logs onto the disk. If
+you intend to keep logging levels high, you may consider mounting a drive to the
+default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
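For illustration, a minimal sketch of turning debug levels back down at
runtime, assuming a hypothetical ``osd.0``; ``debug-osd`` and ``debug-ms`` are
example subsystems (see `Ceph Logging and Debugging`_ for the full list)::

    ceph tell osd.0 injectargs '--debug-osd 0/5 --debug-ms 0/5'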
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS or ext4. The btrfs
+filesystem has many attractive features, but bugs in the filesystem may
+lead to performance issues.
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident applications,
+VMs and so forth. However, when OSDs go into recovery mode, their memory
+utilization spikes. If there is no RAM available, OSD performance will slow
+considerably.
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log
+messages complaining about requests that are taking too long. The warning
+threshold defaults to 30 seconds, and is configurable via the
+``osd op complaint time`` option. When this happens, the cluster log will
+receive messages.
+
+Legacy versions of Ceph complain about 'old requests'::
+
+    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about 'slow requests'::
+
+    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon
+
+Possible solutions include:
+
+- Remove VMs and cloud solutions from Ceph hosts
+- Upgrade your kernel
+- Upgrade Ceph
+- Restart OSDs
+
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it isn't connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+'flapping'.
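For reference, the public and cluster networks described above are defined in
``ceph.conf``; a minimal sketch, with example subnets that are assumptions::

    [global]
        public network  = 192.168.0.0/24
        cluster network = 10.0.0.0/24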
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+    ceph osd set noup      # prevent OSDs from getting marked up
+    ceph osd set nodown    # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+    ceph osd dump | grep flags
+    flags no-up,no-down
+
+You can clear the flags with::
+
+    ceph osd unset noup
+    ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent booting
+OSDs from being marked ``in`` (allocated data) and prevent down ``ceph-osd``
+daemons from eventually being marked ``out`` (regardless of what the current
+value for ``mon osd down out interval`` is).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+   sense that once the flags are cleared, the action they were blocking
+   should occur shortly after. The ``noin`` flag, on the other hand,
+   prevents OSDs from being marked ``in`` on boot, and any daemons that
+   started while the flag was set will remain that way.
+
+
+Troubleshooting PG Errors
+=========================
 
 Stuck Placement Groups
-======================
+----------------------
 
 It is normal for placement groups to enter states like "degraded" or "peering"
 following a failure. Normally these states indicate the normal progression
@@ -122,10 +501,11 @@ recovery from completing, like unfound objects (see :ref:`failures-osd-unfound`);
+
 .. _failures-osd-peering:
 
 Placement Group Down - Peering Failure
-======================================
+--------------------------------------
 
 In certain cases, the ``ceph-osd`` `Peering` process can run into problems,
 preventing a PG from becoming active and usable. For
@@ -190,7 +570,7 @@ Recovery will proceed.
 
 .. _failures-osd-unfound:
 
 Unfound Objects
-===============
+---------------
 
 Under certain combinations of failures Ceph may complain about
 ``unfound`` objects::
@@ -288,62 +668,47 @@ was a new object) forget about it entirely. Use this with caution, as it
 may confuse applications that expected the object to exist.
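For illustration, the caution above refers to marking unfound objects lost; a
minimal sketch using the hypothetical placement group ``2.5`` from the examples
in this document::

    ceph pg 2.5 mark_unfound_lost revert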
+Homeless Placement Groups
+-------------------------
 
-Slow or Unresponsive OSD
-========================
-
-If, for some reason, a ``ceph-osd`` is slow to respond to a request, it will
-generate log messages complaining about requests that are taking too
-long. The warning threshold defaults to 30 seconds, and is configurable
-via the ``osd op complaint time`` option. When this happens, the cluster
-log will receive messages like::
-
-    slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
-
-Possible causes include:
-
- * bad disk (check ``dmesg`` output)
- * kernel file system bug (check ``dmesg`` output)
- * overloaded cluster (check system load, iostat, etc.)
- * ceph-osd bug
-
-Pay particular attention to the ``currently`` part, as that will give
-some clue as to what the request is waiting for. You can further look
-at exactly what requests the slow OSD is working on are, and what
-state(s) they are in with::
-
-    ceph --admin-daemon /var/run/ceph/ceph-osd.{ID}.asok dump_ops_in_flight
-
-These are sorted oldest to newest, and the dump includes an ``age``
-indicating how long the request has been in the queue.
-
-Flapping OSDs
-=============
-
-If something is causing OSDs to "flap" (repeatedly getting marked ``down`` and then
-``up`` again), you can force the monitors to stop with::
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
 
-    ceph osd set noup    # prevent osds from getting marked up
-    ceph osd set nodown  # prevent osds from getting marked down
+    ceph health
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
 
-These flags are recorded in the osdmap structure::
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
 
-    ceph osd dump | grep flags
-    flags no-up,no-down
+    ceph health detail
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+    ...
+    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+    ...
+    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
 
-You can clear the flags with::
+If we want to get placement group 2.5 back online, for example, this tells us that
+it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
+daemons will allow the cluster to recover that placement group (and, presumably,
+many others).
 
-    ceph osd unset noup
-    ceph osd unset nodown
-
-Two other flags are supported, ``noin`` and ``noout``, which prevent
-booting OSDs from being marked ``in`` (allocated data) or down
-ceph-osds from eventually being marked ``out`` (regardless of what the
-current value for ``mon osd down out interval`` is).
-
-Note that ``noup``, ``noout``, and ``noout`` are temporary in the
-sense that once the flags are cleared, the action they were blocking
-should occur shortly after. The ``noin`` flag, on the other hand,
-prevents ceph-osds from being marked in on boot, and any daemons that
-started while the flag was set will remain that way.
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging Config Reference: ../../configuration/log-and-debug-ref
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _monitoring your OSDs: ../monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
+.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
+.. _Inktank: http://inktank.com
+.. _OS recommendations: ../../../install/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file