| Commit message | Author | Age | Files | Lines |
We don't actually need to write out the pg map epoch on every
activate_map as long as:
a) the osd does not trim past the oldest pg map persisted
b) the pg does update the persisted map epoch from time
to time.
To that end, we now keep a reference to the last map persisted; the
OSD already does not trim past the oldest live OSDMapRef. In
addition, handle_activate_map will trim if the difference between
the current map and last_persisted_map is large enough.
Fixes: #4731
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 2c5a9f0e178843e7ed514708bab137def840ab89)
Conflicts:
src/common/config_opts.h
src/osd/PG.cc
- last_persisted_osdmap_ref gets set in the non-static
PG::write_info
Conflicts:
src/osd/PG.cc
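The throttling described above can be sketched roughly as follows; `PGMapPersist`, `maybe_persist`, and `persist_interval` are hypothetical names for this illustration, not the actual Ceph code:

```cpp
#include <cassert>

// Hypothetical sketch: persist the map epoch only when it has advanced
// far enough past the last persisted one, instead of on every
// activate_map. 'persist_interval' stands in for the config option the
// patch introduces.
struct PGMapPersist {
  unsigned last_persisted_epoch = 0;
  unsigned persist_interval;   // epochs allowed between on-disk updates

  explicit PGMapPersist(unsigned interval) : persist_interval(interval) {}

  // Called from something like handle_activate_map(); returns true when
  // the epoch should be written out (and remembered) this time.
  bool maybe_persist(unsigned cur_epoch) {
    if (cur_epoch - last_persisted_epoch < persist_interval)
      return false;            // skip the write, keep the old reference
    last_persisted_epoch = cur_epoch;
    return true;
  }
};
```

Because the OSD never trims past the oldest persisted epoch, skipping writes this way is safe as long as the persisted epoch is refreshed once the gap grows.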
PG::log, PG::ondisklog, PG::missing are moved from PG to a new PGLog
class and are made protected data members. It is a preliminary step
before writing unit tests to cover the methods that have side effects
on these data members and defining a clean PGLog API. It improves
encapsulation and does not change any of the logic already in
place.
Possible issues:
* an additional reference (PG->PGLog->IndexedLog instead of
PG->IndexedLog for instance) is introduced: is it optimized?
* rewriting log.log into pg_log.get_log().log affects readability
but should be optimized away and have no impact on performance
The guidelines followed for this patch are:
* const access to the data members is preserved, no attempt is made
to define accessors
* all non-const methods are in PGLog; no access to the non-const
methods of PGLog::log, PGLog::ondisklog and PGLog::missing is
provided
* when methods are moved from PG to PGLog the change to their
implementation is restricted to the minimum.
* the PG::OndiskLog and PG::IndexedLog sub classes are moved
to PGLog sub classes unmodified and remain public
A const version of the pg_log_t::find_entry method was added.
A const accessor is provided for PGLog::get_log, PGLog::get_missing,
PGLog::get_ondisklog but no non-const accessor.
Arguments are added to most of the methods moved from PG to PGLog so
that they can get access to PG data members such as info or log_oid.
The PGLog methods are sorted according to the data member they modify.
//////////////////// missing ////////////////////
* The pg_missing_t::{got,have,need,add,rm} methods are wrapped as
PGLog::missing_{got,have,need,add,rm}
//////////////////// log ////////////////////
* PGLog::get_tail, PGLog::get_head getters are created
* PGLog::set_tail, PGLog::set_head, PGLog::set_last_requested setters
are created
* PGLog::index, PGLog::unindex, PGLog::add wrappers,
PGLog::reset_recovery_pointers are created
* PGLog::clear_info_log replaces PG::clear_info_log
* PGLog::trim replaces PG::trim
//////////////////// log & missing ////////////////////
* PGLog::claim_log is created with code extracted from
PG::RecoveryState::Stray::react.
* PGLog::split_into is created with code extracted from
PG::split_into.
* PGLog::recover_got is created with code extracted from
ReplicatedPG::recover_got.
* PGLog::activate_not_complete is created with code extracted
from PG::active
* PGLog::proc_replica_log is created with code extracted from
PG::proc_replica_log
* PGLog::write_log is created with code extracted from
PG::write_log
* PGLog::merge_old_entry replaces PG::merge_old_entry
The remove_snap argument is used to collect hobject_t
* PGLog::rewind_divergent_log replaces PG::rewind_divergent_log
The remove_snap argument is used to collect hobject_t
A new PG::rewind_divergent_log method is added to call
remove_snap_mapped_object on each of the remove_snap
elements
* PGLog::merge_log replaces PG::merge_log
The remove_snap argument is used to collect hobject_t
A new PG::merge_log method is added to call
remove_snap_mapped_object on each of the remove_snap
elements
* PGLog::write_log is created with code extracted from PG::write_log. A
non-static version is created for convenience but is a simple
wrapper.
* PGLog::read_log replaces PG::read_log. A non-static version is
created for convenience but is a simple wrapper.
* PGLog::read_log_old replaces PG::read_log_old.
http://tracker.ceph.com/issues/5046 refs #5046
Signed-off-by: Loic Dachary <loic@dachary.org>
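A minimal sketch of the encapsulation move, with simplified stand-in types (not the real `IndexedLog`/`pg_log_entry_t` definitions): the log becomes a protected member, readable only through a const accessor, with all mutation funneled through PGLog methods:

```cpp
#include <cassert>
#include <list>
#include <string>

// Stand-in for the real IndexedLog; entries are just strings here.
struct IndexedLog {
  std::list<std::string> entries;
};

class PGLog {
protected:
  IndexedLog log;            // was a public PG data member before the move
public:
  // Const-only access, matching the "no non-const accessor" guideline.
  const IndexedLog &get_log() const { return log; }

  // All mutation goes through PGLog wrappers.
  void add(const std::string &e) { log.entries.push_back(e); }
  void trim_to(size_t n) {   // keep at most n entries, dropping the oldest
    while (log.entries.size() > n)
      log.entries.pop_front();
  }
};
```

Callers that used to write `log.log` now go through `pg_log.get_log().log`, which is the readability cost the message mentions; the compiler should inline the accessor away.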
cur_ios, etc., may not be zero due to an in-progress
flush.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Added the -r option (which starts the radosgw and the apache2
instance used to access it) to the usage message.
Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
If we are (wrongly) marked down, we need to go into the waiting-for-healthy
state and verify that our network interfaces are working before trying to
rejoin the cluster.
- make _is_healthy() check require positive proof of pings working
- do heartbeat checks and updates in this state
- reset the random peers every heartbeat_interval, in case we keep picking
bad ones
Signed-off-by: Sage Weil <sage@inktank.com>
is_unhealthy() will assume they are healthy for some period after we
send our first ping attempt. is_healthy() is now a strict check that we
know they are healthy.
Switch the failure report check to use is_unhealthy(); use is_healthy()
everywhere else, including the waiting-for-healthy pre-boot checks.
Signed-off-by: Sage Weil <sage@inktank.com>
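The asymmetry described above can be sketched as follows; the struct, fields, and time handling are illustrative stand-ins, not the real OSD heartbeat structures:

```cpp
#include <cassert>

// Stand-in per-peer heartbeat record.
struct PeerStat {
  double first_tx = 0;   // when we first pinged this peer
  double last_rx = 0;    // last reply received (0 = never)
};

// Strict check: we only call a peer healthy with positive proof
// (a reply has actually come back).
bool is_healthy(const PeerStat &p, double now, double grace) {
  (void)now; (void)grace;          // not needed for the strict check
  return p.last_rx > 0;
}

// Lenient check: a peer we only just started pinging is assumed
// healthy until the grace period after the first ping expires.
bool is_unhealthy(const PeerStat &p, double now, double grace) {
  if (p.last_rx > 0)
    return false;                  // replying, so not unhealthy
  return now - p.first_tx > grace; // benefit of the doubt within grace
}
```

Note that within the grace window a peer is neither "healthy" nor "unhealthy", which is exactly why failure reports use the lenient check while the pre-boot checks use the strict one.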
If a (say, random) peer goes down, filter it out.
Signed-off-by: Sage Weil <sage@inktank.com>
We will soon be in this method for the waiting-for-healthy state. As
a consequence, we need to remove any down peers.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
- always include our neighbors to ensure we have a fully-connected
graph
- include some random neighbors to get at least some min number of peers.
Signed-off-by: Sage Weil <sage@inktank.com>
For now we still only look at the internal heartbeats.
Signed-off-by: Sage Weil <sage@inktank.com>
sub_want() returns true if this is a new sub; only renew then.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
This has a slight behavior change in that we ask the mon for the latest
osdmap if our internal heartbeat is failing. That isn't useful yet, but
will be shortly.
Signed-off-by: Sage Weil <sage@inktank.com>
If we use operator[] on a new int field, its value is undefined; avoid
reading it or using |= et al. until we initialize it.
Fixes: #4967
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
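An illustration of the pitfall and the fix pattern, with a made-up `Flags` struct: `std::map::operator[]` default-inserts the mapped value, and when that value has a user-provided constructor that leaves an int member uninitialized, a bare `|=` would read indeterminate data. Initializing before or-ing makes it well-defined:

```cpp
#include <cassert>
#include <map>

// Made-up struct: its user-provided constructor does NOT zero 'bits',
// so map default-insertion leaves 'bits' indeterminate.
struct Flags {
  Flags() {}
  int bits;
};

// Safe pattern: initialize the field on first touch, then or in the flag.
int set_flag(std::map<int, Flags> &m, int key, int flag) {
  if (m.find(key) == m.end())
    m[key].bits = 0;     // explicit init before any read or |=
  m[key].bits |= flag;   // now well-defined
  return m[key].bits;
}
```

(Had `Flags` no user-provided constructor at all, `operator[]` value-initialization would zero the member; it is the empty constructor that makes the read undefined.)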
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If we are in the SCAN state, stay there until the recovery finishes. Do
not jump to another state from file_eval().
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0071b8e75bd3f5a09cc46e2225a018f6d1ef0680)
For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffIterateStress
test when thrashing OSDs.
Fixes: #5183
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Fixes: #4782
Reviewed-by: Sage Weil
Signed-off-by: Samuel Just <sam.just@inktank.com>
We may want to influence the caching behavior for other
reasons.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
The object must already have been removed. With
fd caching, this extra remove might check the wrong replay_guard,
since the fd caching mechanism assumes that between any operation
on an hobject_t oid and a remove operation, all operations on that
hobject_t refer to the same inode.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Clear removes a key/value pair from the cache.
Signed-off-by: Samuel Just <sam.just@inktank.com>
If we are in the SCAN state, stay there until the recovery finishes. Do
not jump to another state from file_eval().
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/mds/MDCache.cc
Also add a new config option "mds_open_remote_link_mode". The anchor
approach is used by default. If the mode is non-zero, use the open-by-ino
function. If the open-by-ino function fails and the mode is 1, retry
using the anchor approach; otherwise trigger an assertion.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
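The fallback policy can be sketched like this; the function and enum names are placeholders, only the mode semantics follow the commit message (0 = anchor, 1 = open-by-ino with anchor retry, other = open-by-ino with assert on failure, modeled here as a thrown exception):

```cpp
#include <cassert>
#include <stdexcept>

enum class Result { ByIno, ByAnchor };

// Illustrative dispatch on mds_open_remote_link_mode; the bool parameter
// simulates whether the open-by-ino path succeeds.
Result open_remote_link(int mode, bool open_by_ino_succeeds) {
  if (mode == 0)
    return Result::ByAnchor;                // default: anchor approach
  if (open_by_ino_succeeds)
    return Result::ByIno;
  if (mode == 1)
    return Result::ByAnchor;                // retry using the anchor
  throw std::logic_error("open-by-ino failed");  // stands in for assert()
}
```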
When a recovering MDS enters the reconnect stage, clients send reconnect
messages to it. The message lists open files, their paths, and issued
caps. If an inode is not in the cache, the recovering MDS uses the
path the client provides to determine if it's the inode's authority. If
not, the recovering MDS exports the inode's caps to another MDS. The
issue here is that the path the client provides isn't always accurate.
The fix is to use the recently added "open inode by ino" function to open
any missing cap inodes when the recovering MDS enters the rejoin stage,
and to send cache rejoin messages to other MDSes after all caps'
authorities are determined.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
This patch adds "open-by-ino" helper. It utilizes backtrace to find
inode's path and open the inode. The algorithm looks like:
1. Check MDS peers. If any MDS has the inode in its cache, goto step 6.
2. Fetch the backtrace. If the backtrace was previously fetched and we
get the same backtrace again, return -EIO.
3. Traverse the path in the backtrace. If the inode is found, goto step 6;
if a non-auth dirfrag is encountered, goto the next step. If we fail to
find the inode in its parent dir, goto step 1.
4. Request MDS peers to traverse the path in the backtrace. If the inode
is found, goto step 6. If an MDS peer encounters a non-auth dirfrag, it
stops traversing. If any MDS peer fails to find the inode in its
parent dir, goto step 1.
5. Use the same algorithm to open the inode's parent. Goto step 3 if
it succeeds; goto step 1 if it fails.
6. Return the inode's auth MDS ID.
The algorithm has two main assumptions:
1. If an inode is in its auth MDS's cache, its on-disk backtrace
can be out of date.
2. If an inode is not in any MDS's cache, its on-disk backtrace
must be up to date.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
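Step 2's loop-breaker is the part most easily shown in isolation: refetching the backtrace and getting the same bytes back means no progress is possible, so the algorithm fails with -EIO rather than looping forever. A sketch with illustrative types (the real code is asynchronous and message-based):

```cpp
#include <cassert>
#include <cerrno>
#include <optional>
#include <string>

// Per-lookup state for the staleness check in step 2.
struct OpenByInoState {
  std::optional<std::string> prev_backtrace;

  // Returns 0 to continue the algorithm, -EIO when the freshly fetched
  // backtrace is identical to the one we already tried.
  int check_backtrace(const std::string &fetched) {
    if (prev_backtrace && *prev_backtrace == fetched)
      return -EIO;               // same backtrace twice: give up
    prev_backtrace = fetched;    // remember it for the next retry
    return 0;
  }
};
```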
We may want to fetch a backtrace while the corresponding inode isn't
instantiated. MDCache::fetch_backtrace() will be used by a later
patch.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Unlink moves inodes to the stray dir; it's a special form of rename.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
To queue a backtrace update, the current code allocates a BacktraceInfo
structure and adds it to the log segment's update_backtraces list. The
main issue with this approach is that BacktraceInfo is independent
of the inode. It's very inconvenient to find pending backtrace updates
for a given inode. When exporting inodes from one MDS to another
MDS, we need to find and cancel all pending backtrace updates on the
source MDS.
This patch brings back the old backtrace handling code and adapts it
to the current backtrace format. The basic idea behind the old
code is: when an inode's backtrace becomes dirty, add the inode to
the log segment's dirty_parent_inodes list.
Compared to the current backtrace handling, another difference is
that the backtrace update is journalled in EMetaBlob::fullbit
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
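The bookkeeping change can be sketched like this (types and names are stand-ins): keeping dirty-backtrace inodes on the log segment directly makes "find all pending updates for this inode" a simple lookup, which the independent BacktraceInfo records made awkward:

```cpp
#include <cassert>
#include <set>

// Stand-in log segment tracking inodes with dirty backtraces by number.
struct LogSegment {
  std::set<int> dirty_parent_inodes;

  // Called when an inode's backtrace becomes dirty.
  void mark_dirty_parent(int ino) { dirty_parent_inodes.insert(ino); }

  // Cheap membership test: is a backtrace update pending for this inode?
  bool is_dirty_parent(int ino) const {
    return dirty_parent_inodes.count(ino) > 0;
  }

  // Cancel a pending update, e.g. when exporting the inode to another MDS.
  void clear_dirty_parent(int ino) { dirty_parent_inodes.erase(ino); }
};
```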
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The current way to journal a backtrace update is to set
EMetaBlob::update_bt to true. The problem is that an EMetaBlob can
include several inodes. If an EMetaBlob's update_bt is true, the journal
replay code has to queue backtrace updates for all inodes in the
EMetaBlob.
This patch adds two new flags to class EMetaBlob::fullbit, making it
able to journal backtrace updates.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Prepare for adding new state parameters such as 'dirty_parent'.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When there is more than one active MDS, restarting an MDS triggers
the assertion "reconnected_snaprealms.empty()" quite often. If there
are no snapshots in the FS, the items left in reconnected_snaprealms
should be other MDSes' mdsdirs. I think it's harmless.
If there are snapshots in the FS, the assertion probably can catch
real bugs. But at present the snapshot feature is broken and fixing it is
non-trivial. So replace the assertion with a warning.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
No need to output the function's debug message to console.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If an MDiscover message is for discovering a base inode, want_base_dir
should be false and the path should be empty.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
For a replica, a filelock in the LOCK_LOCK state doesn't allow the Fc
cap. So a filelock in the LOCK_SYNC_LOCK/LOCK_EXCL_LOCK state shouldn't
allow the Fc cap either.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When an inode is freezing or frozen, we defer processing MClientCaps
messages and cap releases embedded in requests. The same deferral
logic should also cover MClientCapRelease messages.
After sending the cache rejoin message, the replica needs to notify the
auth MDS when cap_wanted changes. But it can send the MInodeFileCaps
message only after receiving the auth MDS' rejoin ack.
Locker::request_inode_file_caps() has the correct wait logic, but it
skips sending the MInodeFileCaps message if the auth MDS is still in
the rejoin state.
The fix is to defer sending the MInodeFileCaps message until the auth
MDS is active. It makes the function's wait logic less tricky.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
So the auth MDS can choose locks' states based on our cap_wanted.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>