path: root/src
Commit message | Author | Age | Files | Lines
* PG: don't write out pg map epoch every handle_activate_map (Samuel Just, 2013-05-31, 3 files, -2/+18)

    We don't actually need to write out the pg map epoch on every
    activate_map as long as:
    a) the osd does not trim past the oldest pg map persisted
    b) the pg does update the persisted map epoch from time to time.

    To that end, we now keep a reference to the last map persisted. The
    OSD already does not trim past the oldest live OSDMapRef. Second,
    handle_activate_map will trim if the difference between the current
    map and the last_persisted_map is large enough.

    Fixes: #4731
    Signed-off-by: Samuel Just <sam.just@inktank.com>
    Reviewed-by: Greg Farnum <greg@inktank.com>
    (cherry picked from commit 2c5a9f0e178843e7ed514708bab137def840ab89)

    Conflicts:
        src/common/config_opts.h
        src/osd/PG.cc - last_persisted_osdmap_ref gets set in the
        non-static PG::write_info

    Conflicts:
        src/osd/PG.cc

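The "persist only when the gap is large enough" rule described above can be sketched as a small predicate. This is an illustrative sketch, not Ceph's actual code; the function name and the threshold default are hypothetical stand-ins for the real config option:

```cpp
#include <cstdint>

using epoch_t = uint32_t;

// Hypothetical sketch of the persistence rule: rewrite the persisted pg
// map epoch only once the current map has drifted at least `max_lag`
// epochs past the last one we persisted. Trimming must never go past
// `last_persisted`, so the lag bounds how many old maps are retained.
bool should_persist_map_epoch(epoch_t current, epoch_t last_persisted,
                              epoch_t max_lag = 150) {
  return current >= last_persisted + max_lag;
}
```

With a small lag the epoch write is skipped; once the drift reaches the threshold the epoch (and the trim) happens.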
* move log, ondisklog, missing from PG to PGLog (Loic Dachary, 2013-05-30, 12 files, -1225/+1472)

    PG::log, PG::ondisklog, PG::missing are moved from PG to a new PGLog
    class and are made protected data members. It is a preliminary step
    before writing unit tests to cover the methods that have side
    effects on these data members and define a clean PGLog API. It
    improves encapsulation and does not change any of the logic already
    in place.

    Possible issues:

    * an additional reference (PG->PGLog->IndexedLog instead of
      PG->IndexedLog, for instance) is introduced: is it optimized?
    * rewriting log.log into pg_log.get_log().log affects readability
      but should be optimized and have no impact on performance

    The guidelines followed for this patch are:

    * const access to the data members is preserved; no attempt is made
      to define accessors
    * all non-const methods are in PGLog; no access to non-const methods
      of PGLog::log, PGLog::ondisklog and PGLog::missing is provided
    * when methods are moved from PG to PGLog, the change to their
      implementation is restricted to the minimum
    * the PG::OndiskLog and PG::IndexedLog sub-classes are moved to
      PGLog sub-classes unmodified and remain public

    A const version of the pg_log_t::find_entry method was added. A
    const accessor is provided for PGLog::get_log, PGLog::get_missing
    and PGLog::get_ondisklog, but no non-const accessor. Arguments are
    added to most of the methods moved from PG to PGLog so that they can
    get access to PG data members such as info or log_oid. The PGLog
    methods are sorted according to the data member they modify.

    //////////////////// missing ////////////////////

    * The pg_missing_t::{got,have,need,add,rm} methods are wrapped as
      PGLog::missing_{got,have,need,add,rm}

    //////////////////// log ////////////////////

    * PGLog::get_tail, PGLog::get_head getters are created
    * PGLog::set_tail, PGLog::set_head, PGLog::set_last_requested
      setters are created
    * PGLog::index, PGLog::unindex, PGLog::add wrappers and
      PGLog::reset_recovery_pointers are created
    * PGLog::clear_info_log replaces PG::clear_info_log
    * PGLog::trim replaces PG::trim

    //////////////////// log & missing ////////////////////

    * PGLog::claim_log is created with code extracted from
      PG::RecoveryState::Stray::react.
    * PGLog::split_into is created with code extracted from
      PG::split_into.
    * PGLog::recover_got is created with code extracted from
      ReplicatedPG::recover_got.
    * PGLog::activate_not_complete is created with code extracted from
      PG::activate
    * PGLog::proc_replica_log is created with code extracted from
      PG::proc_replica_log
    * PGLog::merge_old_entry replaces PG::merge_old_entry. The
      remove_snap argument is used to collect hobject_t
    * PGLog::rewind_divergent_log replaces PG::rewind_divergent_log. The
      remove_snap argument is used to collect hobject_t. A new
      PG::rewind_divergent_log method is added to call
      remove_snap_mapped_object on each of the remove_snap elements
    * PGLog::merge_log replaces PG::merge_log. The remove_snap argument
      is used to collect hobject_t. A new PG::merge_log method is added
      to call remove_snap_mapped_object on each of the remove_snap
      elements
    * PGLog::write_log is created with code extracted from
      PG::write_log. A non-static version is created for convenience but
      is a simple wrapper.
    * PGLog::read_log replaces PG::read_log. A non-static version is
      created for convenience but is a simple wrapper.
    * PGLog::read_log_old replaces PG::read_log_old.

    http://tracker.ceph.com/issues/5046 refs #5046

    Signed-off-by: Loic Dachary <loic@dachary.org>

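The encapsulation pattern this commit describes (protected data members, a const-only accessor, all mutation through named wrappers) can be illustrated with heavily simplified stand-in types. These are not Ceph's real classes; the fields and methods below are only meant to show the shape of the API:

```cpp
#include <cstdint>
#include <list>
#include <string>

// Simplified stand-in for the indexed log (not Ceph's real type).
struct IndexedLog {
  std::list<std::string> log;   // stand-in for the list of log entries
  uint64_t head = 0, tail = 0;  // stand-ins for eversion_t bounds
};

// The pattern from the commit message: the log is a protected member,
// readers only get a const reference, and every mutation goes through a
// named PGLog method (setter or wrapper), never a non-const accessor.
class PGLog {
protected:
  IndexedLog log;
public:
  const IndexedLog &get_log() const { return log; }  // const access only
  void set_head(uint64_t v) { log.head = v; }        // mutation via setter
  void add(const std::string &entry) { log.log.push_back(entry); }
};
```

Callers that used to write `pg->log.log` now write `pg_log.get_log().log` for reads, and must go through the wrappers to modify state, which is what makes the side-effecting methods unit-testable.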
* os/WBThrottle: remove asserts in clear() (Samuel Just, 2013-05-30, 1 file, -2/+0)

    cur_ios, etc. may not be zero due to an in-progress flush.

    Signed-off-by: Samuel Just <sam.just@inktank.com>
    Reviewed-by: Greg Farnum <greg@inktank.com>

* Added -r option to usage (Christophe Courtaut, 2013-05-30, 1 file, -0/+1)

    Added the -r option (which starts the radosgw and the apache2 used
    to access it) to the usage message.

    Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>

* Merge pull request #331 from ceph/wip-osd-interfacecheck (Sage Weil, 2013-05-29, 4 files, -73/+230)

    Reviewed-by: Samuel Just <sam.just@inktank.com>

| * osd: wait for healthy pings from peers in waiting-for-healthy state (Sage Weil, 2013-05-29, 2 files, -23/+80)

    If we are (wrongly) marked down, we need to go into the
    waiting-for-healthy state and verify that our network interfaces are
    working before trying to rejoin the cluster.

    - make the _is_healthy() check require positive proof of pings
      working
    - do heartbeat checks and updates in this state
    - reset the random peers every heartbeat_interval, in case we keep
      picking bad ones

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: distinguish between definitely healthy and definitely not unhealthy (Sage Weil, 2013-05-29, 2 files, -8/+12)

    is_unhealthy() will assume peers are healthy for some period after
    we send our first ping attempt. is_healthy() is now a strict check
    that we know they are healthy.

    Switch the failure report check to use is_unhealthy(); use
    is_healthy() everywhere else, including the waiting-for-healthy
    pre-boot checks.

    Signed-off-by: Sage Weil <sage@inktank.com>

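The distinction above is effectively tri-state: a peer can be known-healthy, known-unhealthy, or in a grace window where we simply don't know yet. A minimal sketch of that logic, with hypothetical field and parameter names (the real OSD code tracks this per-peer in its heartbeat map):

```cpp
// Illustrative sketch only; names and the grace default are assumptions,
// not Ceph's actual code.
struct PeerStatus {
  double first_ping_sent = 0;  // 0 means we have never pinged this peer
  double last_reply = 0;       // 0 means no ping reply received yet
};

// Strict check: positive proof, i.e. a sufficiently recent ping reply.
bool is_healthy(const PeerStatus &p, double now, double grace = 20.0) {
  return p.last_reply > 0 && now - p.last_reply < grace;
}

// Lenient check: only true once the peer has had a full grace period to
// answer our first ping and still has not proven itself healthy.
bool is_unhealthy(const PeerStatus &p, double now, double grace = 20.0) {
  if (p.first_ping_sent == 0 || now - p.first_ping_sent < grace)
    return false;  // benefit of the doubt shortly after the first ping
  return !is_healthy(p, now, grace);
}
```

Note that in the grace window both predicates return false, which is exactly why the failure-report path wants `is_unhealthy()` while the pre-boot checks want the strict `is_healthy()`.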
| * osd: remove down hb peers (Sage Weil, 2013-05-29, 1 file, -4/+10)

    If a (say, random) peer goes down, filter it out.

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: only add pg peers if active (Sage Weil, 2013-05-29, 1 file, -17/+19)

    We will soon be in this method for the waiting-for-healthy state. As
    a consequence, we need to remove any down peers.

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: factor out _remove_heartbeat_peer (Sage Weil, 2013-05-29, 2 files, -12/+19)

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: augment osd heartbeat peers with neighbors and randoms, to up some min (Sage Weil, 2013-05-29, 3 files, -16/+84)

    - always include our neighbors to ensure we have a fully-connected
      graph
    - include some random peers to get at least some minimum number of
      peers

    Signed-off-by: Sage Weil <sage@inktank.com>

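The selection policy described above (both neighbors first, then random top-up to a minimum) can be sketched as a small function. This is a simplified illustration with hypothetical names, not the actual `OSD::maybe_update_heartbeat_peers()` code, which works from the OSDMap rather than a flat vector:

```cpp
#include <algorithm>
#include <random>
#include <set>
#include <vector>

// Sketch: given a sorted list of up OSD ids, pick our two neighbors in
// that order (fully-connected ring), then add random other OSDs until we
// reach `min_peers`. Names and signature are illustrative assumptions.
std::set<int> pick_hb_peers(int whoami, const std::vector<int> &osds,
                            size_t min_peers, std::mt19937 &rng) {
  std::set<int> peers;
  // locate ourselves and take the neighbor on each side
  auto it = std::find(osds.begin(), osds.end(), whoami);
  if (it != osds.end()) {
    if (it != osds.begin()) peers.insert(*std::prev(it));
    if (std::next(it) != osds.end()) peers.insert(*std::next(it));
  }
  // top up with random peers until the minimum count is reached
  std::vector<int> others;
  for (int o : osds)
    if (o != whoami) others.push_back(o);
  std::shuffle(others.begin(), others.end(), rng);
  for (int o : others) {
    if (peers.size() >= min_peers) break;
    peers.insert(o);
  }
  return peers;
}
```

The neighbor rule guarantees the heartbeat graph stays connected even if every random pick is unlucky; the random picks are what get refreshed each heartbeat_interval in the waiting-for-healthy change above.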
| * osd: move health checks into a single helper (Sage Weil, 2013-05-29, 2 files, -3/+14)

    For now we still only look at the internal heartbeats.

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: avoid duplicate mon requests for a new osdmap (Sage Weil, 2013-05-29, 1 file, -2/+2)

    sub_want() returns true if this is a new sub; only renew then.

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: tell peers that ping us if they are dead (Sage Weil, 2013-05-29, 1 file, -0/+7)

    Signed-off-by: Sage Weil <sage@inktank.com>

| * osd: simplify is_healthy() check during boot (Sage Weil, 2013-05-29, 1 file, -7/+2)

    This has a slight behavior change in that we ask the mon for the
    latest osdmap if our internal heartbeat is failing. That isn't
    useful yet, but will be shortly.

    Signed-off-by: Sage Weil <sage@inktank.com>

* | Merge branch 'next' (Sage Weil, 2013-05-29, 2 files, -1/+9)

| * | osd: initialize new_state field when we use it (Sage Weil, 2013-05-29, 1 file, -1/+4)

    If we use operator[] on a new int field its value is undefined;
    avoid reading it or using |= et al. until we initialize it.

    Fixes: #4967
    Backport: cuttlefish, bobtail
    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: David Zafman <david.zafman@inktank.com>

| * | mds: stay in SCAN state in file_eval (Sage Weil, 2013-05-29, 1 file, -0/+4)

    If we are in the SCAN state, stay there until the recovery finishes.
    Do not jump to another state from file_eval().

    Signed-off-by: Sage Weil <sage@inktank.com>
    (cherry picked from commit 0071b8e75bd3f5a09cc46e2225a018f6d1ef0680)

| * | osd: do not assume head obc object exists when getting snapdir (Sage Weil, 2013-05-29, 1 file, -0/+5)

    For a list-snaps operation on the snapdir, do not assume that the
    obc for the head means the object exists. This fixes a race between
    a head deletion and a list-snaps that wrongly returns ENOENT,
    triggered by the DiffItersateStress test when thrashing OSDs.

    Fixes: #5183
    Backport: cuttlefish
    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Samuel Just <sam.just@inktank.com>

* | Merge branch 'wip_osd_throttle' (Samuel Just, 2013-05-29, 10 files, -280/+671)

    Fixes: #4782
    Reviewed-by: Sage Weil

| * | WBThrottle: add some comments and some asserts (Samuel Just, 2013-05-29, 2 files, -0/+6)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | WBThrottle: rename replica nocache (Samuel Just, 2013-05-29, 2 files, -9/+9)

    We may want to influence the caching behavior for other reasons.

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | WBThrottle: add perfcounters (Samuel Just, 2013-05-28, 2 files, -0/+40)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | ReplicatedPG::submit_push_complete: don't remove the head object (Samuel Just, 2013-05-22, 1 file, -1/+0)

    The object would have had to have been removed already. With fd
    caching, this extra remove might check the wrong replay_guard, since
    the fd caching mechanism assumes that between any operation on an
    hobject_t oid and a remove operation, all operations on that
    hobject_t refer to the same inode.

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | FileStore: integrate WBThrottle (Samuel Just, 2013-05-21, 3 files, -158/+9)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | os/: Add WBThrottle (Samuel Just, 2013-05-21, 5 files, -1/+384)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | FileStore: add fd cache (Samuel Just, 2013-05-21, 5 files, -127/+228)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | common/shared_cache.hpp: fix set_size() (Samuel Just, 2013-05-21, 1 file, -1/+1)

    Signed-off-by: Samuel Just <sam.just@inktank.com>

| * | common/shared_cache.hpp: add clear() (Samuel Just, 2013-05-21, 1 file, -0/+11)

    clear() removes a key/value pair from the cache.

    Signed-off-by: Samuel Just <sam.just@inktank.com>

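The per-key clear() semantics described above can be shown with a minimal stand-in. Ceph's actual SharedLRU is considerably more involved (weak_ptr tracking of in-use values, LRU eviction, locking); this sketch only illustrates the interface the commit adds:

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Minimal illustrative cache (not Ceph's SharedLRU): values are held via
// shared_ptr so callers can keep a value alive after it is cleared from
// the cache. clear(k) drops exactly one key/value pair.
template <typename K, typename V>
class SimpleSharedCache {
  std::map<K, std::shared_ptr<V>> contents;
public:
  void add(const K &k, V v) {
    contents[k] = std::make_shared<V>(std::move(v));
  }
  std::shared_ptr<V> lookup(const K &k) const {
    auto it = contents.find(k);
    return it == contents.end() ? nullptr : it->second;
  }
  void clear(const K &k) { contents.erase(k); }  // remove one key/value
};
```

The shared_ptr return is the design point that matters for the fd cache above: an fd can be evicted or cleared from the cache while an in-flight operation still holds a reference to it.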
* | | mds: stay in SCAN state in file_eval (Sage Weil, 2013-05-29, 1 file, -0/+4)

    If we are in the SCAN state, stay there until the recovery finishes.
    Do not jump to another state from file_eval().

    Signed-off-by: Sage Weil <sage@inktank.com>

* | | Makefile: include new message header files (Sage Weil, 2013-05-29, 1 file, -0/+2)

    Signed-off-by: Sage Weil <sage@inktank.com>

* | Merge remote-tracking branch 'yan/wip-mds' (Sage Weil, 2013-05-29, 32 files, -703/+1534)

    Reviewed-by: Sage Weil <sage@inktank.com>

    Conflicts:
        src/mds/MDCache.cc

| * | mds: use "open-by-ino" function to open remote link (Yan, Zheng, 2013-05-28, 3 files, -20/+39)

    Also add a new config option "mds_open_remote_link_mode". The anchor
    approach is used by default. If the mode is non-zero, use the
    open-by-ino function. In case the open-by-ino function fails: if the
    mode is 1, retry using the anchor approach; otherwise trigger an
    assertion.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: open missing cap inodes (Yan, Zheng, 2013-05-28, 6 files, -87/+185)

    When a recovering MDS enters the reconnect stage, clients send
    reconnect messages to it. The message lists open files, their paths,
    and issued caps. If an inode is not in the cache, the recovering MDS
    uses the path the client provides to determine if it is the inode's
    authority. If not, the recovering MDS exports the inode's caps to
    the other MDS.

    The issue here is that the path the client provides isn't always
    accurate. The fix is to use the recently added "open inode by ino"
    function to open any missing cap inodes when the recovering MDS
    enters the rejoin stage, and to send cache rejoin messages to the
    other MDSes after all caps' authorities are determined.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: bump the protocol version (Yan, Zheng, 2013-05-28, 1 file, -1/+1)

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: open inode by ino (Yan, Zheng, 2013-05-28, 9 files, -7/+612)

    This patch adds an "open-by-ino" helper. It utilizes the backtrace
    to find the inode's path and open the inode. The algorithm looks
    like:

    1. Check MDS peers. If any MDS has the inode in its cache, go to
       step 6.
    2. Fetch the backtrace. If the backtrace was previously fetched and
       we get the same backtrace again, return -EIO.
    3. Traverse the path in the backtrace. If the inode is found, go to
       step 6; if a non-auth dirfrag is encountered, go to the next
       step. If we fail to find the inode in its parent dir, go to
       step 1.
    4. Request that MDS peers traverse the path in the backtrace. If the
       inode is found, go to step 6. If an MDS peer encounters a
       non-auth dirfrag, it stops traversing. If any MDS peer fails to
       find the inode in its parent dir, go to step 1.
    5. Use the same algorithm to open the inode's parent. Go to step 3
       on success; go to step 1 on failure.
    6. Return the inode's auth MDS id.

    The algorithm has two main assumptions:

    1. If an inode is in its auth MDS's cache, its on-disk backtrace can
       be out of date.
    2. If an inode is not in any MDS's cache, its on-disk backtrace must
       be up to date.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

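The numbered steps above can be sketched as a retry loop. This is a hypothetical, heavily simplified skeleton: the helper hooks, types, and synchronous control flow are all assumptions for illustration (the real MDCache code is asynchronous and message-driven, and step 5 is omitted here):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using inodeno_t = uint64_t;

// Stand-in: a backtrace is the chain of (ancestor ino, dentry name) pairs.
struct Backtrace {
  std::vector<std::pair<inodeno_t, std::string>> ancestors;
};

// Hypothetical hooks standing in for the asynchronous MDS machinery.
struct OpenByInoHelpers {
  std::function<int(inodeno_t)> check_peers;                   // step 1: auth mds id, or -1
  std::function<bool(inodeno_t, Backtrace *)> fetch_backtrace; // step 2
  std::function<bool(const Backtrace &, int *)> traverse;      // steps 3-4
};

// Returns the inode's auth MDS id (step 6), or -1 on failure (the real
// code would report -EIO for the repeated-backtrace case).
int open_ino(inodeno_t ino, OpenByInoHelpers &h) {
  Backtrace prev;
  bool have_prev = false;
  while (true) {
    int auth = h.check_peers(ino);          // step 1
    if (auth >= 0)
      return auth;                          // step 6: a peer caches it
    Backtrace bt;
    if (!h.fetch_backtrace(ino, &bt))       // step 2
      return -1;
    if (have_prev && bt.ancestors == prev.ancestors)
      return -1;                            // same backtrace twice: give up
    prev = bt;
    have_prev = true;
    if (h.traverse(bt, &auth))              // steps 3-4
      return auth;
    // step 5 (recursively open the parent) omitted from this sketch;
    // on failure we loop back to step 1.
  }
}
```

The repeated-backtrace check is what makes the loop terminate: progress requires either a peer learning about the inode or the on-disk backtrace changing between iterations.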
| * | mds: move fetch_backtrace() to class MDCache (Yan, Zheng, 2013-05-28, 4 files, -68/+45)

    We may want to fetch a backtrace while the corresponding inode isn't
    instantiated. MDCache::fetch_backtrace() will be used by a later
    patch.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: remove old backtrace handling (Yan, Zheng, 2013-05-28, 5 files, -215/+14)

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: update backtraces when unlinking inodes (Yan, Zheng, 2013-05-28, 1 file, -4/+7)

    unlink moves inodes to the stray dir; it's a special form of rename.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: bring back old style backtrace handling (Yan, Zheng, 2013-05-28, 9 files, -11/+180)

    To queue a backtrace update, the current code allocates a
    BacktraceInfo structure and adds it to the log segment's
    update_backtraces list. The main issue with this approach is that
    BacktraceInfo is independent from the inode, so it is very
    inconvenient to find pending backtrace updates for given inodes.
    When exporting inodes from one MDS to another, we need to find and
    cancel all pending backtrace updates on the source MDS.

    This patch brings back the old backtrace handling code and adapts it
    to the current backtrace format. The basic idea behind the old code
    is: when an inode's backtrace becomes dirty, add the inode to the
    log segment's dirty_parent_inodes list.

    Compared to the current backtrace handling, another difference is
    that the backtrace update is journalled in EMetaBlob::fullbit.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: rename last_renamed_version to backtrace_version (Yan, Zheng, 2013-05-28, 5 files, -17/+21)

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: journal backtrace update in EMetaBlob::fullbit (Yan, Zheng, 2013-05-28, 2 files, -22/+53)

    The current way to journal a backtrace update is to set
    EMetaBlob::update_bt to true. The problem is that an EMetaBlob can
    include several inodes; if an EMetaBlob's update_bt is true, journal
    replay code has to queue backtrace updates for all inodes in the
    EMetaBlob.

    This patch adds two new flags to class EMetaBlob::fullbit, making it
    able to journal backtrace updates.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: reorder EMetaBlob::add_primary_dentry's parameters (Yan, Zheng, 2013-05-28, 6 files, -33/+33)

    Prepare for adding a new state parameter such as 'dirty_parent'.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: warn on unconnected snap realms (Yan, Zheng, 2013-05-28, 1 file, -1/+14)

    When there is more than one active MDS, restarting an MDS triggers
    the assertion "reconnected_snaprealms.empty()" quite often. If there
    are no snapshots in the FS, the items left in
    reconnected_snaprealms should be other MDSes' mdsdirs, which I think
    is harmless. If there are snapshots in the FS, the assertion
    probably can catch real bugs. But at present the snapshot feature is
    broken, and fixing it is non-trivial. So replace the assertion with
    a warning.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: silence MDCache::trim_non_auth() (Yan, Zheng, 2013-05-28, 1 file, -20/+6)

    No need to output the function's debug messages to the console.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: fix check for base inode discovery (Yan, Zheng, 2013-05-28, 1 file, -1/+2)

    If an MDiscover message is for discovering a base inode,
    want_base_dir should be false and the path should be empty.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: Fix replica's allowed caps for filelock in SYNC_LOCK state (Yan, Zheng, 2013-05-28, 1 file, -2/+2)

    For a replica, a filelock in the LOCK_LOCK state doesn't allow the
    Fc cap. So a filelock in the LOCK_SYNC_LOCK/LOCK_EXCL_LOCK state
    shouldn't allow the Fc cap either.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: defer releasing cap if necessary (Yan, Zheng, 2013-05-28, 2 files, -32/+58)

    When an inode is freezing or frozen, we defer processing MClientCaps
    messages and cap releases embedded in requests. The same deferral
    logic should also cover MClientCapRelease messages.

| * | mds: fix Locker::request_inode_file_caps() (Yan, Zheng, 2013-05-28, 1 file, -3/+2)

    After sending the cache rejoin message, a replica needs to notify
    the auth MDS when cap_wanted changes, but it can send the
    MInodeFileCaps message only after receiving the auth MDS' rejoin
    ack. Locker::request_inode_file_caps() has correct wait logic, but
    it skips sending the MInodeFileCaps message if the auth MDS is still
    in the rejoin state.

    The fix is to defer sending the MInodeFileCaps message until the
    auth MDS is active. It makes the function's wait logic less tricky.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

| * | mds: notify auth MDS when cap_wanted changes (Yan, Zheng, 2013-05-28, 1 file, -2/+4)

    So the auth MDS can choose lock states based on our cap_wanted.

    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
