Commit message · Author · Age · Files · Lines
* asdf [wip-before] · Sage Weil · 2013-07-21 · 1 · -0/+1
|
* mds: stay in SCAN state in file_eval · Sage Weil · 2013-05-29 · 1 · -0/+4
|   If we are in the SCAN state, stay there until the recovery finishes. Do not jump to another state from file_eval().
|   Signed-off-by: Sage Weil <sage@inktank.com>
* Makefile: include new message header files · Sage Weil · 2013-05-29 · 1 · -0/+2
|   Signed-off-by: Sage Weil <sage@inktank.com>
* Merge remote-tracking branch 'yan/wip-mds' · Sage Weil · 2013-05-29 · 32 · -703/+1534
|\
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| |   Conflicts: src/mds/MDCache.cc
| * mds: use "open-by-ino" function to open remote link · Yan, Zheng · 2013-05-28 · 3 · -20/+39
| |   Also add a new config option, "mds_open_remote_link_mode". The anchor approach is used by default. If the mode is non-zero, use the open-by-ino function. If the open-by-ino function fails and the mode is 1, retry using the anchor approach; otherwise, trigger an assertion.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
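The mode handling described above can be sketched as a small dispatch function. This is a hypothetical C++ illustration, not the MDS code: `open_remote_link_sketch`, `Opener`, and the return convention (0 on success, a negative errno on failure) are assumptions; only the meaning of the `mds_open_remote_link_mode` values comes from the commit message.

```cpp
#include <cassert>
#include <functional>

// Hypothetical opener type: returns 0 on success, a negative errno on failure.
using Opener = std::function<int()>;

// Sketch of the "mds_open_remote_link_mode" semantics:
//   0          -> anchor approach only (the default)
//   1          -> open-by-ino, falling back to the anchor approach on failure
//   any other  -> open-by-ino; a failure trips an assertion
int open_remote_link_sketch(int mode, const Opener& open_by_ino,
                            const Opener& open_by_anchor) {
  if (mode == 0)
    return open_by_anchor();
  int r = open_by_ino();
  if (r == 0)
    return 0;
  if (mode == 1)
    return open_by_anchor();  // retry with the anchor approach
  assert(false && "open-by-ino failed and the mode allows no fallback");
  return r;  // only reached when assertions are compiled out (NDEBUG)
}
```

Passing the two strategies as callables keeps the sketch independent of any real lookup machinery; in the MDS the fallback would of course involve asynchronous messaging rather than a direct call.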
| * mds: open missing cap inodes · Yan, Zheng · 2013-05-28 · 6 · -87/+185
| |   When a recovering MDS enters the reconnect stage, clients send reconnect messages to it. Each message lists open files, their paths, and issued caps. If an inode is not in the cache, the recovering MDS uses the path the client provides to determine whether it is the inode's authority. If not, the recovering MDS exports the inode's caps to another MDS.
| |   The issue here is that the path the client provides isn't always accurate. The fix is to use the recently added "open inode by ino" function to open any missing cap inodes when the recovering MDS enters the rejoin stage, and to send cache rejoin messages to other MDSs only after all caps' authorities are determined.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: bump the protocol version · Yan, Zheng · 2013-05-28 · 1 · -1/+1
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: open inode by ino · Yan, Zheng · 2013-05-28 · 9 · -7/+612
| |   This patch adds an "open-by-ino" helper. It uses the backtrace to find the inode's path and open the inode. The algorithm looks like this:
| |   1. Check MDS peers. If any MDS has the inode in its cache, go to step 6.
| |   2. Fetch the backtrace. If the backtrace was fetched previously and the same backtrace is returned again, return -EIO.
| |   3. Traverse the path in the backtrace. If the inode is found, go to step 6; if a non-auth dirfrag is encountered, go to the next step. If the inode cannot be found in its parent dir, go to step 1.
| |   4. Request MDS peers to traverse the path in the backtrace. If the inode is found, go to step 6. If an MDS peer encounters a non-auth dirfrag, it stops traversing. If any MDS peer fails to find the inode in its parent dir, go to step 1.
| |   5. Use the same algorithm to open the inode's parent. Go to step 3 if this succeeds; go to step 1 if it fails.
| |   6. Return the inode's auth MDS ID.
| |   The algorithm has two main assumptions:
| |   1. If an inode is in its auth MDS's cache, its on-disk backtrace can be out of date.
| |   2. If an inode is not in any MDS's cache, its on-disk backtrace must be up to date.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
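The loop over steps 1, 2, 3, and 6 can be sketched as a toy, single-process C++ model. It is not the MDS implementation: `ToyCluster`, `Ancestor`, `open_ino_sketch`, the backtrace ordering, and the rule that an inode's auth equals its parent directory's auth are assumptions made for illustration, and steps 4 and 5 (peer traversal and recursively opening the parent) are omitted. It does show why the repeated-backtrace check in step 2 guarantees termination: a stale backtrace sends the loop back to step 1, and fetching the identical backtrace again ends with -EIO.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using inodeno = std::uint64_t;

// One backtrace entry: dentry `dname` inside directory `dirino`.
// Assumed order: front() is the immediate parent, back() sits in the root dir.
struct Ancestor {
  inodeno dirino;
  std::string dname;
  bool operator==(const Ancestor& o) const { return dirino == o.dirino && dname == o.dname; }
};
using Backtrace = std::vector<Ancestor>;

// Toy stand-in for the cluster state (all names hypothetical).
struct ToyCluster {
  std::map<inodeno, int> cached_auth;                          // inode cached somewhere -> auth rank
  std::map<inodeno, Backtrace> backtraces;                     // on-disk backtraces
  std::map<std::pair<inodeno, std::string>, inodeno> dentries; // (dir, name) -> child inode
  std::map<inodeno, int> dir_auth;                             // directory -> auth rank
};

// Returns the inode's auth rank, or -ENOENT / -EIO.
int open_ino_sketch(const ToyCluster& c, inodeno ino) {
  std::optional<Backtrace> prev;
  for (;;) {
    // Step 1: does any MDS already have the inode cached?
    if (auto it = c.cached_auth.find(ino); it != c.cached_auth.end())
      return it->second;  // step 6
    // Step 2: fetch the backtrace; the same one twice in a row means no progress.
    auto bt_it = c.backtraces.find(ino);
    if (bt_it == c.backtraces.end() || bt_it->second.empty())
      return -ENOENT;
    const Backtrace& bt = bt_it->second;
    if (prev && *prev == bt)
      return -EIO;
    prev = bt;
    // Step 3: walk from the root-most ancestor down to the inode itself.
    bool stale = false;
    for (std::size_t i = bt.size(); i-- > 0;) {
      inodeno expected = (i == 0) ? ino : bt[i - 1].dirino;
      auto d = c.dentries.find({bt[i].dirino, bt[i].dname});
      if (d == c.dentries.end() || d->second != expected) {
        stale = true;  // not found in its parent dir: back to step 1
        break;
      }
    }
    if (!stale) {
      // Step 6: in this toy model the parent directory's auth is the inode's auth.
      auto a = c.dir_auth.find(bt.front().dirino);
      return a == c.dir_auth.end() ? -ENOENT : a->second;
    }
  }
}
```

Because the toy state never changes between iterations, a stale backtrace always resolves to -EIO on the second pass; in a live cluster the intervening step 1 check gives other MDSs a chance to report the inode first.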
| * mds: move fetch_backtrace() to class MDCache · Yan, Zheng · 2013-05-28 · 4 · -68/+45
| |   We may want to fetch a backtrace while the corresponding inode isn't instantiated. MDCache::fetch_backtrace() will be used by a later patch.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: remove old backtrace handling · Yan, Zheng · 2013-05-28 · 5 · -215/+14
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: update backtraces when unlinking inodes · Yan, Zheng · 2013-05-28 · 1 · -4/+7
| |   Unlink moves inodes to the stray dir; it's a special form of rename.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: bring back old style backtrace handling · Yan, Zheng · 2013-05-28 · 9 · -11/+180
| |   To queue a backtrace update, the current code allocates a BacktraceInfo structure and adds it to the log segment's update_backtraces list. The main issue with this approach is that BacktraceInfo is independent of the inode, so it's very inconvenient to find pending backtrace updates for given inodes. When exporting inodes from one MDS to another, we need to find and cancel all pending backtrace updates on the source MDS.
| |   This patch brings back the old backtrace handling code and adapts it for the current backtrace format. The basic idea behind the old code is: when an inode's backtrace becomes dirty, add the inode to the log segment's dirty_parent_inodes list.
| |   Compared to the current backtrace handling, another difference is that the backtrace update is journalled in EMetaBlob::full_bit.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
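The difference between the two approaches is largely about which object owns the pending-update record. A minimal sketch, assuming hypothetical `Inode`/`LogSegment` types and helper names (only the `dirty_parent_inodes` list name comes from the commit message): because the inode itself sits on the segment's list and remembers its position, cancelling its pending update on export is a constant-time erase rather than a search through independent `BacktraceInfo` records.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

struct LogSegment;

// Hypothetical inode that remembers where it sits on its segment's list.
struct Inode {
  std::uint64_t ino = 0;
  LogSegment* dirty_parent_seg = nullptr;        // non-null while an update is pending
  std::list<Inode*>::iterator dirty_parent_pos;  // valid only while pending
};

struct LogSegment {
  std::list<Inode*> dirty_parent_inodes;
};

// Queue a backtrace update by putting the inode itself on the segment's list.
void mark_dirty_parent(Inode& in, LogSegment& seg) {
  if (in.dirty_parent_seg)
    return;  // already queued (this sketch keeps the original segment)
  in.dirty_parent_pos =
      seg.dirty_parent_inodes.insert(seg.dirty_parent_inodes.end(), &in);
  in.dirty_parent_seg = &seg;
}

// Cancel a pending update, e.g. when the inode is exported to another MDS.
void clear_dirty_parent(Inode& in) {
  if (!in.dirty_parent_seg)
    return;
  in.dirty_parent_seg->dirty_parent_inodes.erase(in.dirty_parent_pos);
  in.dirty_parent_seg = nullptr;
}
```

`std::list` is used because erasing one element leaves the stored iterators of all other inodes valid.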
| * mds: rename last_renamed_version to backtrace_version · Yan, Zheng · 2013-05-28 · 5 · -17/+21
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: journal backtrace update in EMetaBlob::fullbit · Yan, Zheng · 2013-05-28 · 2 · -22/+53
| |   The current way to journal a backtrace update is to set EMetaBlob::update_bt to true. The problem is that an EMetaBlob can include several inodes. If an EMetaBlob's update_bt is true, the journal replay code has to queue backtrace updates for all inodes in the EMetaBlob.
| |   This patch adds two new flags to class EMetaBlob::fullbit, making it able to journal a backtrace update.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
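To see why a per-inode flag helps, here is a hedged sketch of the replay side. `FullBit`, `MetaBlob`, `dirty_parent`, and `backtraces_to_update` are hypothetical simplifications: the commit adds two flags to `EMetaBlob::fullbit` without naming them here, and `dirty_parent` merely borrows the state name mentioned in the `add_primary_dentry` commit. The point is only that replay can select exactly the flagged inodes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified journal records.
struct FullBit {
  std::uint64_t ino;
  bool dirty_parent;  // assumed per-inode flag: this inode's backtrace needs updating
};

struct MetaBlob {
  std::vector<FullBit> full_bits;
};

// Replay: queue backtrace updates only for inodes whose own entry is flagged,
// instead of for every inode in the blob as a single blob-wide flag forces.
std::vector<std::uint64_t> backtraces_to_update(const MetaBlob& blob) {
  std::vector<std::uint64_t> out;
  for (const FullBit& fb : blob.full_bits)
    if (fb.dirty_parent)
      out.push_back(fb.ino);
  return out;
}
```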
| * mds: reorder EMetaBlob::add_primary_dentry's parameters · Yan, Zheng · 2013-05-28 · 6 · -33/+33
| |   Prepare for adding new state parameters such as 'dirty_parent'.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: warn on unconnected snap realms · Yan, Zheng · 2013-05-28 · 1 · -1/+14
| |   When there is more than one active MDS, restarting an MDS triggers the assertion "reconnected_snaprealms.empty()" quite often. If there are no snapshots in the FS, the items left in reconnected_snaprealms should be other MDSs' mdsdirs. I think that's harmless.
| |   If there are snapshots in the FS, the assertion can probably catch real bugs. But at present the snapshot feature is broken, and fixing it is non-trivial. So replace the assertion with a warning.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: slient MDCache::trim_non_auth() · Yan, Zheng · 2013-05-28 · 1 · -20/+6
| |   There is no need to output the function's debug message to the console.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix check for base inode discovery · Yan, Zheng · 2013-05-28 · 1 · -1/+2
| |   If an MDiscover message is for discovering the base inode, want_base_dir should be false and the path should be empty.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: Fix replica's allowed caps for filelock in SYNC_LOCK state · Yan, Zheng · 2013-05-28 · 1 · -2/+2
| |   For a replica, a filelock in the LOCK_LOCK state doesn't allow the Fc cap. So a filelock in the LOCK_SYNC_LOCK/LOCK_EXCL_LOCK state shouldn't allow the Fc cap either.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: defer releasing cap if necessary · Yan, Zheng · 2013-05-28 · 2 · -32/+58
| |   When an inode is freezing or frozen, we defer processing MClientCaps messages and cap releases embedded in requests. The same deferral logic should also cover MClientCapRelease messages.
| * mds: fix Locker::request_inode_file_caps() · Yan, Zheng · 2013-05-28 · 1 · -3/+2
| |   After sending the cache rejoin message, a replica needs to notify the auth MDS when cap_wanted changes. But it can send an MInodeFileCaps message only after receiving the auth MDS's rejoin ack. Locker::request_inode_file_caps() has the correct wait logic, but it skips sending the MInodeFileCaps message if the auth MDS is still in the rejoin state.
| |   The fix is to defer sending the MInodeFileCaps message until the auth MDS is active. This makes the function's wait logic less tricky.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: notify auth MDS when cap_wanted changes · Yan, Zheng · 2013-05-28 · 1 · -2/+4
| |   This lets the auth MDS choose locks' states based on our cap_wanted.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: export CInode:mds_caps_wanted · Yan, Zheng · 2013-05-28 · 2 · -5/+6
| |   CInode:mds_caps_wanted is used to keep track of caps wanted by non-auth MDSs. The auth MDS checks it when choosing locks' states.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: export CInode::STATE_NEEDSRECOVER · Yan, Zheng · 2013-05-28 · 4 · -14/+21
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: send slave request after target MDS is active · Yan, Zheng · 2013-05-28 · 3 · -13/+53
| |   When the failure of a peer is detected, MDCache::handle_mds_failure() checks whether there are requests waiting for slave replies from the failed peer, and adds them to the "wait for active peer" list. The "retry request" logic only covers slave requests sent before MDCache::handle_mds_failure() is called. If a slave request was sent while the peer wasn't up, we wait for its reply forever.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
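The rule in the commit title (send the slave request only after the target MDS is active) can be sketched with a per-peer wait list. This is a hypothetical model: `SlaveRequestQueue`, `send_or_defer`, and `handle_peer_active` are invented names, and real requests carry far more state than an integer id.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Requests to a peer that is not active are parked per rank and sent when
// that peer becomes active, instead of being sent and then waited on forever.
struct SlaveRequestQueue {
  std::set<int> active_peers;
  std::map<int, std::vector<int>> waiting_for_active;  // rank -> request ids
  std::vector<std::pair<int, int>> sent;                // (rank, request id)

  void send_or_defer(int rank, int reqid) {
    if (active_peers.count(rank))
      sent.emplace_back(rank, reqid);
    else
      waiting_for_active[rank].push_back(reqid);
  }

  void handle_peer_active(int rank) {
    active_peers.insert(rank);
    auto it = waiting_for_active.find(rank);
    if (it == waiting_for_active.end())
      return;
    std::vector<int> reqs = std::move(it->second);
    waiting_for_active.erase(it);
    for (int r : reqs)
      sent.emplace_back(rank, r);  // send each deferred request now
  }
};
```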
| * mds: unfreeze inode after rename rollback finishes · Yan, Zheng · 2013-05-28 · 2 · -27/+40
| |   We should not wake up the unfreeze waiter while the inode is still linked to a non-auth dirfrag.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: remove buggy cache rejoin code · Yan, Zheng · 2013-05-28 · 1 · -20/+11
| |   I previously added code to handle a corner case of cache rejoin: the entire subtree, together with the inode the subtree root belongs to, was trimmed between sending cache rejoin and receiving the rejoin ack. In this case, we should send a cache expire message to the subtree's auth MDS. But the code is completely broken, so remove it temporarily.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix typo in Server::do_rename_rollback · Yan, Zheng · 2013-05-28 · 1 · -2/+2
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix import cancel race · Yan, Zheng · 2013-05-28 · 2 · -32/+34
| |   The current code uses the import state to detect obsolete import discover/prep messages. This does not work for the following case: cancel a subtree import, import the same subtree again, and then the discover/prep message for the first import gets dispatched.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix straydn race · Yan, Zheng · 2013-05-28 · 4 · -19/+39
| |   For an unlink/rename request, the target dentry's linkage may change before all locks are acquired. So we need to check whether the existing stray dentry is valid.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix slave commit tracking · Yan, Zheng · 2013-05-28 · 3 · -23/+36
| |   An MDS may crash after journalling a slave commit but before sending the commit ack to the master. Later, when the MDS restarts, it will not send the commit ack to the master, so the master waits for the commit ack forever.
| |   The fix is to remove the failed MDS from requests' uncommitted slave lists. When the failed MDS recovers, its resolve message tells the master which slave requests are not committed. The master re-adds the recovering MDS to requests' uncommitted slave lists if necessary.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
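The bookkeeping behind this fix can be sketched on the master side. A minimal, hypothetical model: `MasterCommitTracker` and its methods are invented for illustration, and requests and ranks are plain integers. A failure removes the rank from every request's uncommitted set, and the recovering slave's resolve message re-adds it only for the requests it reports as uncommitted.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical master-side state: request id -> ranks whose slave commit
// has not been acknowledged yet.
struct MasterCommitTracker {
  std::map<int, std::set<int>> uncommitted_slaves;

  void handle_commit_ack(int reqid, int rank) {
    uncommitted_slaves[reqid].erase(rank);
  }

  // A slave failed: stop waiting on it, since after restarting it may never
  // send the ack for a commit it had already journalled.
  void handle_mds_failure(int rank) {
    for (auto& [reqid, slaves] : uncommitted_slaves)
      slaves.erase(rank);
  }

  // The recovering slave's resolve message lists its uncommitted requests;
  // only those get the slave re-added.
  void handle_resolve(int rank, const std::set<int>& uncommitted_reqs) {
    for (int reqid : uncommitted_reqs) {
      auto it = uncommitted_slaves.find(reqid);
      if (it != uncommitted_slaves.end())
        it->second.insert(rank);
    }
  }

  bool fully_committed(int reqid) const {
    auto it = uncommitted_slaves.find(reqid);
    return it == uncommitted_slaves.end() || it->second.empty();
  }
};
```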
| * mds: fix uncommitted master wait · Yan, Zheng · 2013-05-28 · 2 · -15/+10
| |   We may add a new waiter while the master is committing, so we should take the waiters and wake them up when the master is committed.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
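The fix follows a common take-then-wake pattern. In this hypothetical sketch (`UncommittedMaster`, `start_commit`, and `finish_commit` are invented names), waiters are swapped out only when the commit finishes, so a waiter registered while the commit is in progress is woken too instead of being stranded.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical uncommitted-master record. Waiters may be added at any time
// while the commit is in flight; they are taken and woken only once it is done.
struct UncommittedMaster {
  bool committing = false;
  std::vector<std::function<void()>> waiters;

  void add_waiter(std::function<void()> w) { waiters.push_back(std::move(w)); }

  void start_commit() { committing = true; }  // waiters are NOT taken here

  void finish_commit() {
    committing = false;
    std::vector<std::function<void()>> ls;
    ls.swap(waiters);  // take everything, including late additions
    for (auto& w : ls)
      w();
  }
};
```

Swapping into a local vector before invoking the callbacks also keeps the loop safe if a callback registers a new waiter.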
| * mds: adjust subtree auth if import aborts in PREPPED state · Yan, Zheng · 2013-05-28 · 1 · -1/+7
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: don't stop at export bounds when journaling dir context · Yan, Zheng · 2013-05-28 · 1 · -1/+1
| |   We only journal the finish of exporting a subtree, so we shouldn't consider export bounds as subtree roots.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: fix underwater dentry cleanup · Yan, Zheng · 2013-05-28 · 1 · -1/+1
| |   If the underwater dentry is a remote link, we shouldn't mark the inode clean.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * mds: journal new subtrees created by rename · Yan, Zheng · 2013-05-28 · 2 · -1/+12
| |   This avoids creating bare dirfrags during journal replay.
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* | Merge pull request #329 from javacruft/wip-fuse-deps · Sage Weil · 2013-05-29 · 1 · -2/+2
|\ \
| | |   Use new fuse package instead of fuse-utils
| * | Use new fuse package instead of fuse-utils · James Page · 2013-05-29 · 1 · -2/+2
|/ /
| |   The fuse-utils package was deprecated a while ago. Switch the primary dependency for fuse tools to use the preferred 'fuse' package.
| |   Signed-off-by: James Page <james.page@ubuntu.com>
* | mon: disable tdump by default · Sage Weil · 2013-05-28 · 1 · -1/+1
| |   Grr.
| |   Signed-off-by: Sage Weil <sage@inktank.com>
* | Merge remote-tracking branch 'gh/last' · Sage Weil · 2013-05-28 · 12 · -70/+111
|\ \
| * | v0.63 [v0.63] · Gary Lowell · 2013-05-28 · 2 · -1/+7
| | |
| * | HashIndex: sync top directory during start_split,merge,col_split · Samuel Just · 2013-05-28 · 1 · -3/+12
| |/
| |   Otherwise, the links might be ordered after the in-progress operation tag write. We need the in-progress operation tag to correctly recover from an interrupted merge, split, or col_split.
| |   Fixes: #5180
| |   Backport: cuttlefish, bobtail
| |   Signed-off-by: Samuel Just <sam.just@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| * Merge branch 'wip_scrub_tphandle' into next · Samuel Just · 2013-05-23 · 4 · -33/+66
| |\
| | |   Fixes: #5159
| | |   Reviewed-by: Sage Weil <sage@inktank.com>
| | * PG: ping tphandle during omap loop as well · Samuel Just · 2013-05-23 · 2 · -0/+8
| | |   Signed-off-by: Samuel Just <sam.just@inktank.com>
| | * PG: reset timeout in _scan_list for each object, read chunk · Samuel Just · 2013-05-23 · 1 · -0/+2
| | |   Signed-off-by: Samuel Just <sam.just@inktank.com>
| | * OSD,PG: pass tphandle down to _scan_list · Samuel Just · 2013-05-23 · 3 · -33/+56
| | |   Signed-off-by: Samuel Just <sam.just@inktank.com>
| * | rgw: iterate usage entries from correct entry · Yehuda Sadeh · 2013-05-23 · 1 · -3/+16
| | |   Fixes: #5152
| | |   When iterating through usage entries, and when a user id was provided, we started at the user's first entry and not from the entry indexed by the request start time. This commit fixes the issue.
| | |   Backport: bobtail
| | |   Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
| | |   Reviewed-by: Greg Farnum <greg@inktank.com>
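A minimal sketch of the corrected iteration, assuming a hypothetical in-memory index ordered by (user, epoch). `UsageKey`, `UsageIndex`, and `read_usage` are illustrative names, not rgw's API or its on-disk key format. The point is where the scan begins: at the first key not less than (user, start time), rather than at the user's first entry.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical usage index ordered by (user, epoch).
using UsageKey = std::pair<std::string, std::uint64_t>;
using UsageIndex = std::map<UsageKey, int>;  // value: e.g. bytes used

// Collect a user's entries with start <= epoch < end. Seeking to
// (user, start) instead of (user, 0) skips entries before the start time.
std::vector<int> read_usage(const UsageIndex& idx, const std::string& user,
                            std::uint64_t start, std::uint64_t end) {
  std::vector<int> out;
  for (auto it = idx.lower_bound({user, start});
       it != idx.end() && it->first.first == user && it->first.second < end;
       ++it)
    out.push_back(it->second);
  return out;
}
```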
| * | mon: drop unnecessary conditionals · Sage Weil · 2013-05-23 · 1 · -5/+3
| | |   Signed-off-by: Sage Weil <sage@inktank.com>
| * | Merge pull request #311 from ceph/wip-5102 · Sage Weil · 2013-05-23 · 4 · -30/+12
| |\ \
| | | |   Reviewed-by: Sage Weil <sage@inktank.com>
| | * | mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism · Joao Eduardo Luis · 2013-05-22 · 4 · -26/+4
| | | |   We don't need it after all. If we are in the middle of some proposal, then we guarantee that said proposal is likely to be retried. If we haven't yet proposed, then it's forever more likely that a client will eventually retry the message that triggered this proposal.
| | | |   Basically, this mechanism attempted to fix a non-problem, and was in fact triggering some unforeseen issues that would have required increasing the code complexity for no good reason.
| | | |   Fixes: #5102
| | | |   Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>