Commit message (Author, Age, Files, Lines)
* osd: kludge to efficiently rebuild past_intervals in parallel on startup [historic/past-intervals-hack] (Sage Weil, 2012-04-29, 3 files, -1/+137)
| Particularly tortured clusters might be buried under thousands of osdmap epochs of thrashing with thousands of pgs. Rebuilding the past_intervals becomes O(n^2) in that case, and can take days and days. Instead, do the rebuild for all PGs in parallel during a single pass over the osdmap history.
| This is an ugly (mostly) one-time use hack that can be removed soon.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
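The single-pass idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `Epoch`, `PgId`, `Acting`, `OsdMap`, `Interval`, and `rebuild_all` are hypothetical, simplified stand-ins, and an "interval" here is just a run of epochs during which a PG's acting set did not change.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins for the real OSD types.
typedef unsigned Epoch;
typedef int PgId;
typedef std::vector<int> Acting;          // OSD ids serving a PG
typedef std::map<PgId, Acting> OsdMap;    // one epoch: PG -> acting set

struct Interval {
  Epoch first, last;   // inclusive epoch range with an unchanged acting set
  Acting acting;
};

// Build the intervals of every PG in one pass over the map history.
// history[i] is assumed to be the map for epoch first_epoch + i.
std::map<PgId, std::vector<Interval> >
rebuild_all(const std::vector<OsdMap>& history, Epoch first_epoch) {
  std::map<PgId, std::vector<Interval> > out;
  std::map<PgId, Interval> current;       // interval being extended, per PG
  for (std::size_t i = 0; i < history.size(); ++i) {
    const Epoch e = first_epoch + static_cast<Epoch>(i);
    for (const auto& kv : history[i]) {
      auto it = current.find(kv.first);
      if (it == current.end()) {
        current[kv.first] = Interval{e, e, kv.second};
      } else if (it->second.acting == kv.second) {
        it->second.last = e;              // same mapping: extend
      } else {
        out[kv.first].push_back(it->second);   // mapping changed: close it
        it->second = Interval{e, e, kv.second};
      }
    }
  }
  // This sketch also emits the still-open interval of each PG.
  for (const auto& kv : current) out[kv.first].push_back(kv.second);
  return out;
}
```

Each map is visited once regardless of the number of PGs, whereas rebuilding PG by PG would re-read the whole history for every PG.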
* osd: keep pgs locked during handle_osd_map dance (Sage Weil, 2012-04-29, 3 files, -21/+61)
| Currently we drop and retake locks during handle_osd_map calls to advance_map and activate_map. Instead, take them all once, and hold them. This avoids leaving dirty in-core state in the PG without the lock held.
| This will clearly go away as soon as the map threading stuff is redone.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* mon: drop obsolete osd/PG.h #includes (Sage Weil, 2012-04-29, 3 files, -3/+0)
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: set dirty flags on rewind_divergent_log (Sage Weil, 2012-04-29, 1 file, -0/+4)
| Make sure we record any rewind_divergent_log. In the activate case, this will happen anyway, but mark it dirty here for correctness/completeness. The merge_log case might be a bug.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: use dirty flags in activate(), merge_log() (Sage Weil, 2012-04-29, 1 file, -4/+4)
| These are all called from within the state machine, so we can simply set the dirty flags.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: fix nested transaction in all_activated_and_committed() (Sage Weil, 2012-04-29, 1 file, -4/+9)
| all_activated_and_committed() is called from _activate_committed(), called from an objectstore completion, and also from the state machine, which is part of a larger transaction. Instead, set dirty_info, and build/apply a transaction in the caller (the completion) as needed.
| Fixes part of #2360.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
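The dirty-flag pattern that this commit and its neighbours rely on might look roughly like the sketch below. The `Transaction` struct, the member layout of `PG`, and the signature of `write_if_dirty()` are assumptions made for illustration; only the names `dirty_info`, `all_activated_and_committed()`, and `write_if_dirty()` come from the log.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an object store transaction: a list of ops.
struct Transaction {
  std::vector<std::string> ops;
};

struct PG {
  bool dirty_info = false;
  int last_epoch_started = 0;

  // Called from the state machine or from a completion: only mutate
  // in-memory state and flag it. No transaction is opened here.
  void all_activated_and_committed(int epoch) {
    last_epoch_started = epoch;
    dirty_info = true;
  }

  // Called by whoever owns the transaction. Appends the info write only
  // if something changed, and reports whether it did.
  bool write_if_dirty(Transaction& t) {
    if (!dirty_info) return false;
    t.ops.push_back("write_info les=" + std::to_string(last_epoch_started));
    dirty_info = false;
    return true;
  }
};
```

Because the callee never builds a transaction itself, it is safe to call both from a completion (where the caller creates and applies one) and from inside a larger transaction (where the caller appends to the existing one).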
* osd: use PG::write_if_dirty() helper (Sage Weil, 2012-04-29, 3 files, -10/+17)
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: do not merge history on query (Sage Weil, 2012-04-28, 1 file, -0/+2)
| We shouldn't modify the local notion of the history without recording it to disk. And we (probably) also don't need to do that at all on query.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: dirty_info if history.merge updated anything (Sage Weil, 2012-04-28, 2 files, -12/+30)
| In proc_replica_info and proc_primary_info, we may or may not update the pg_info_t. If we do, set dirty_info, so that it will be recorded. Same goes for when the primary pushes out updated stats to us.
| Also, do not write a purged_snaps() update directly; rely on the caller to write out dirty info.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: write dirty info on handle info, notify, log (Sage Weil, 2012-04-28, 1 file, -0/+6)
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: skip scrub scheduling if we aren't up (Sage Weil, 2012-04-28, 1 file, -0/+4)
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: fix dirty_info check for advance/activate paths (Sage Weil, 2012-04-28, 1 file, -9/+14)
| Previously we would check and write dirty_info *without the pg lock* after doing the advance and activate map calls. This was unlikely to race with anything because the queues were drained, but definitely not right.
| Instead, do the write in activate_map, or explicitly if activate_map is not called (so that we record our progress after handling maps when we are not up).
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: always share past_intervals (Sage Weil, 2012-04-28, 7 files, -47/+96)
| Share past intervals when starting up new replicas. This can happen via an MOSDPGInfo or an MOSDPGLog message.
| Fix up get_or_create_pg() so the past_intervals arg is required (and a ref, like the other args). Fix doxygen comment.
| Now the only time generate_past_intervals() should do any work is when upgrading old clusters, during pg creation, and (possibly) during pg split (when that is fully implemented).
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: set dirty_info in generate_past_intervals (Sage Weil, 2012-04-28, 1 file, -0/+3)
| This ensures that we save our work.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: fill in past intervals during advance_map (Sage Weil, 2012-04-28, 1 file, -0/+5)
| If ceph-osd is way behind, we will advance through past maps before we mark ourselves up. This avoids the slow recalculation once we are up, and the ensuing badness.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: drop useless PG::fulfill_info() (Sage Weil, 2012-04-28, 2 files, -17/+5)
| There is a nice symmetry there with fulfill_log(), but it is a short function with a single caller that mostly just forces us to copy a bunch of data structures around unnecessarily. Drop it.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: share past intervals with notifies (Sage Weil, 2012-04-28, 5 files, -40/+77)
| Send past_intervals along with pg_info_t on every notify. The reasoning here is as follows:
|  - we already have the state in memory
|  - if we don't send it, and the primary doesn't have it, it will recalculate it by reading/decoding many previous maps from disk
|  - for a highly-tortured cluster, I see past_intervals on the order of ~6 KB, times 600 pgs means ~2.5 MB sent for every activate_map(). For comparison, the same cluster would need to read and decode ~1 GB of maps to recalculate the same info.
|  - for healthy clusters, the data is small, and costs little.
|  - for unhealthy clusters, the data is large, but most useful.
| In theory we could set a threshold so that we don't send it if it is large, but allow the primary to query it explicitly. I doubt it's worth the complexity.
| Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
* osd: only generate missing intervals in generate_past_intervals (Sage Weil, 2012-04-28, 1 file, -2/+2)
| We can (currently) get into a situation where we don't have the full history back to last_epoch_clean because non-primaries record past intervals but don't initially have the full history, resulting in a partial recent history. If this happens, only fill in what's missing; no need to rebuild the recent parts too.
| Signed-off-by: Sage Weil <sage@newdream.net>
* osd: include past_intervals in pg debug printout (Sage Weil, 2012-04-28, 1 file, -0/+5)
| Signed-off-by: Sage Weil <sage@newdream.net>
* osd: fix check for whether to recalculate past_intervals (Sage Weil, 2012-04-28, 1 file, -11/+12)
| We may not recalculate all the way back to last_interval_clean due to the oldest_map floor. Figure out what we want and could calculate before deciding whether what we have is insufficient.
| Also, print something if we discard and recalculate so it is clear what is happening and why.
| Signed-off-by: Sage Weil <sage@newdream.net>
* osd: PG::Interval -> pg_interval_t (Sage Weil, 2012-04-28, 5 files, -78/+102)
| Signed-off-by: Sage Weil <sage@newdream.net>
* Merge branch 'next' into t (Sage Weil, 2012-04-28, 14 files, -178/+288)
|\
| * Stop rebuild of libcommon.la on "make dist" (Dan Mick, 2012-04-28, 1 file, -2/+2)
| | Fixes: 2356
| | Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
| * mon: limit size of MOSDMap message sent as reply (Sage Weil, 2012-04-28, 1 file, -1/+5)
| | We may send an MOSDMap as a reply to various requests, including
| |  - a failure report
| |  - a boot message
| |  - a pg_temp message
| |  - an up_thru message
| | In these cases, send a single MOSDMap message, but limit how big it gets. All recipients here are osds, which are smart enough to request more maps based on the MOSDMap::newest_map field.
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
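A bounded reply of this kind could be sketched as below. The `MapReply` struct, `build_reply()`, `needs_more()`, and the `max_maps` cap are hypothetical names chosen for illustration; only the idea of a `newest_map` field that lets the receiver ask for more comes from the commit message.

```cpp
#include <algorithm>
#include <vector>

typedef unsigned Epoch;

// Hypothetical reply shape: a bounded batch of epochs plus the newest
// epoch the sender knows, so the receiver can tell whether to ask again.
struct MapReply {
  std::vector<Epoch> epochs;
  Epoch newest_map;
};

// Build one reply covering [from, newest], capped at max_maps epochs.
MapReply build_reply(Epoch from, Epoch newest, Epoch max_maps) {
  MapReply r;
  r.newest_map = newest;
  if (from > newest || max_maps == 0) return r;
  const Epoch last = std::min(newest, from + max_maps - 1);
  for (Epoch e = from; e <= last; ++e) r.epochs.push_back(e);
  return r;
}

// Receiver side: true if the batch stops short of the newest known map.
bool needs_more(const MapReply& r) {
  return !r.epochs.empty() && r.epochs.back() < r.newest_map;
}
```

The sender never has to track how far behind a peer is: an OSD that receives a short batch simply requests the next range.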
| * ceph-object-corpus: revert rewind (Sage Weil, 2012-04-28, 1 file, -0/+0)
| | From 92becb696bde7f0aa9687b2fe7505ed1ac9f493b
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
| * FileJournal: simply flush by waiting for completions to empty (Samuel Just, 2012-04-27, 3 files, -46/+6)
| | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
| * PG: in GetInfo Notify handler, fix peer_info_requested filter (Samuel Just, 2012-04-27, 1 file, -1/+1)
| | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
| * librados: test get/set of debug levels (Sage Weil, 2012-04-26, 1 file, -0/+35)
| | Also do some sanity checks on the subsystem log level settings.
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
| * config: allow {get,set}_val on subsystem debug levels (Sage Weil, 2012-04-26, 2 files, -3/+37)
| | This allows you to get and set subsystem debug levels via the normal config access methods. Among other things, this allows librados users to set debug levels.
| | Fixes: #2350
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
| * librbd: the length argument of aio_discard should be uint64_t (Josh Durgin, 2012-04-26, 3 files, -5/+5)
| | size_t was accidentally copy-pasted.
| | Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
| * filestore: interpret any fiemap error as EOPNOTSUPP (Sage Weil, 2012-04-26, 1 file, -1/+1)
| | On 2.6.32-5-amd64 (debian) and XFS I'm getting EINVAL.
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
| * filestore: fix a journal replay issue with collection_add() (Joao Eduardo Luis, 2012-04-26, 1 file, -0/+8)
| | Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
| * osd: filter osds removed from probe set from peer_info_requested (Sage Weil, 2012-04-26, 1 file, -51/+64)
| | peer_info_requested should be a strict subset of the probe set. Filter osds that are dropped from probe from peer_info_requested. We could also restart peering from scratch here, but this is less expensive, because we don't have to re-probe everyone.
| | Once we adjust the probe and peer_info_requested sets, (re)check if we're done: we may have been blocked on a previous peer_info_requested entry.
| | The situation I saw was:
| |   "recovery_state": [
| |         { "name": "Started\/Primary\/Peering\/GetInfo",
| |           "enter_time": "2012-04-25 14:39:56.905748",
| |           "requested_info_from": [
| |                 { "osd": 193}]},
| |         { "name": "Started\/Primary\/Peering",
| |           "enter_time": "2012-04-25 14:39:56.905748",
| |           "probing_osds": [
| |                 79,
| |                 191,
| |                 195],
| |           "down_osds_we_would_probe": [],
| |           "peering_blocked_by": []},
| |         { "name": "Started",
| |           "enter_time": "2012-04-25 14:39:56.905742"}]}
| | Once in this state, cycling osd.193 doesn't help, because the prior_set is not affected.
| | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
| | Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
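The filtering step described above amounts to intersecting the outstanding-request set with the current probe set and then re-checking for completion. A minimal sketch, with the hypothetical helpers `filter_requested()` and `got_all_info()` and plain `std::set<int>` standing in for the real PG state:

```cpp
#include <cstddef>
#include <set>

// Drop every OSD from `requested` that is no longer in `probe`, so that
// `requested` stays within `probe`. Returns how many entries were removed.
std::size_t filter_requested(const std::set<int>& probe,
                             std::set<int>& requested) {
  std::size_t removed = 0;
  for (auto it = requested.begin(); it != requested.end();) {
    if (probe.count(*it) == 0) {
      it = requested.erase(it);   // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}

// Peering can move on once no info request is outstanding.
bool got_all_info(const std::set<int>& requested) {
  return requested.empty();
}
```

With the sets from the state dump (probing 79, 191, 195 while still waiting on osd 193), the filter removes 193 and the completion check then passes instead of blocking indefinitely.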
| * PG: get_infos() should not post GotInfo (Samuel Just, 2012-04-26, 1 file, -4/+3)
| | The MNotifyRec handler also posts GotInfo under the same conditions after calling get_infos().
| | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
| * Revert "PG: whitelist MNotifyRec in started" (Samuel Just, 2012-04-26, 1 file, -2/+0)
| | This reverts commit 9579365720818125a4b15741ae65e58948b9c69f.
| * PG: whitelist MNotifyRec in started (Samuel Just, 2012-04-26, 1 file, -0/+2)
| | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
| * mon: decode old PGMap Incrementals differently from new ones (Greg Farnum, 2012-04-24, 1 file, -3/+9)
| | We need to distinguish between the old 0 (meaning undefined) and the new 0 (meaning switch to 0 and disable the flags). So rev the encoding version on PGMap::Incremental, and if you decode an old version with [near]full_ratio == 0, set the ratio to -1 instead. Then when applying the Incremental interpret -1 as no change.
| | Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
| | Reviewed-by: Sage Weil <sage@newdream.net>
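The sentinel translation described in this commit can be illustrated with the sketch below. The struct layout, the function names, and in particular the version cutoff of 2 are assumptions for illustration; the commit message only says that the encoding version was bumped and that an old 0 becomes -1.

```cpp
// Hypothetical, simplified Incremental carrying a single ratio field.
struct Incremental {
  float full_ratio;
};

// Old encodings used 0 for "undefined"; new ones use 0 for "set to 0".
// The cutoff (struct_v < 2 means "old") is an assumed value.
Incremental decode_incremental(int struct_v, float raw_full_ratio) {
  Incremental inc;
  inc.full_ratio = raw_full_ratio;
  if (struct_v < 2 && raw_full_ratio == 0) {
    inc.full_ratio = -1;          // old 0: translate to "no change"
  }
  return inc;
}

// Applying an Incremental: a negative ratio leaves the current value alone.
float apply_incremental(float current, const Incremental& inc) {
  return inc.full_ratio < 0 ? current : inc.full_ratio;
}
```

Doing the translation at decode time keeps the apply path free of version checks: it only ever needs to understand -1.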
| * Merge remote branch 'origin/wip-rbd-snapid' into next (Josh Durgin, 2012-04-24, 2 files, -62/+113)
| |\
| | | Reviewed-by: Sage Weil <sage.weil@dreamhost.com>
| | * test_rbd: add tests for snap_set and more complicated resizing (Josh Durgin, 2012-04-24, 1 file, -2/+56)
| | |  * snap_set to a deleted (and recreated) snapshot
| | |  * resizing down (truncating) and back up
| | |  * resizing to non-object-aligned sizes
| | | Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
| | * librbd: reset needs_refresh flag before re-reading header (Josh Durgin, 2012-04-24, 1 file, -4/+7)
| | | This way we can't miss an update if we get a notify during ictx_refresh. Specifically, a race like this:
| | |
| | |   Thread 1                Thread 2                Process 2
| | |   ictx_refresh()
| | |     read_header()
| | |                                                   snap_create()
| | |                                                   notify()
| | |                           need_refresh = true
| | |     process header...
| | |     need_refresh = false
| | |
| | | If this happened, we would not re-read the header with the new snapshot, so the snapshot would not happen at the intended point in time, but only after we re-read the header again.
| | | Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
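The ordering fix can be sketched as follows: clear the flag first, then read. This is a simplified illustration and not librbd's code; the `ImageCtx` layout, `on_notify()`, and the callback-style `read_header` parameter are hypothetical, and `std::atomic<bool>` stands in for whatever locking the real code uses.

```cpp
#include <atomic>

// Hypothetical, simplified image context.
struct ImageCtx {
  std::atomic<bool> needs_refresh{true};
  int header_version = 0;         // version of the header we last parsed
};

// Notify callback: another client changed the header.
void on_notify(ImageCtx& ictx) { ictx.needs_refresh = true; }

// Clear the flag *before* reading. A notify that lands while we read or
// parse the header sets the flag again, so the next check refreshes again.
// `read_header` is a caller-supplied callable returning the header version.
template <typename ReadHeader>
void ictx_refresh(ImageCtx& ictx, ReadHeader read_header) {
  ictx.needs_refresh = false;
  ictx.header_version = read_header();
}
```

Clearing the flag after parsing, as in the race shown in the commit message, would overwrite the `true` written by the notify and lose the update.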
| | * librbd: clean up snapshot handling a bit (Josh Durgin, 2012-04-24, 1 file, -51/+47)
| | |  * snapid should determine whether our mapped snapshot is gone, not snapname
| | |  * snap_set(<nonexistent_snap>) shouldn't reset us to CEPH_NOSNAP
| | |  * snapname should be set before using it in the perfcounter name
| | |  * snapname and image name don't need to be passed as arguments since an ImageCtx already contains that info
| | |  * ictx_check() doesn't need to check for non-existent snaps - only I/Os care, so check in check_io() instead
| | | Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
| | * librbd: clarify handle_sparse_read condition (Josh Durgin, 2012-04-24, 1 file, -5/+3)
| | | The earlier condition is >. != means < at this point, and the nesting is unnecessary.
| | | Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
* | | osdmap: fix addr dedup check (Sage Weil, 2012-04-27, 1 file, -17/+13)
| | | Compare *every* address for a match, or else note that it is (or might be) different. Previously, we falsely took diff==0 to mean that all addrs were definitely equal, which was not necessarily the case.
| | | Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
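One way to read "compare every address, or else count it as different" is sketched below. This is only an illustration of the principle: `Addr`, `AddrVec`, and `dedup_addrs()` are hypothetical, and the real OSDMap code is structured differently.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

typedef std::string Addr;                        // stand-in for an address
typedef std::vector<std::shared_ptr<Addr> > AddrVec;

// Make `cur` share storage with `prev` wherever the addresses are equal.
// Every slot of `cur` is either compared and matched, or counted in the
// returned diff, so diff == 0 means every slot of `cur` matched `prev`.
int dedup_addrs(const AddrVec& prev, AddrVec& cur) {
  int diff = 0;
  for (std::size_t i = 0; i < cur.size(); ++i) {
    if (i < prev.size() && prev[i] && cur[i] && *prev[i] == *cur[i]) {
      cur[i] = prev[i];                          // identical: reuse old object
    } else {
      ++diff;                                    // different, or not comparable
    }
  }
  return diff;
}
```

The important property is that a slot that was never compared (missing or null on either side) is counted as a difference rather than silently treated as equal.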
* | | OSD: use map bl cache pinning during handle_osd_map (Samuel Just, 2012-04-27, 2 files, -3/+27)
| | | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
* | | simple_cache.hpp: add pinning (Samuel Just, 2012-04-27, 1 file, -3/+30)
| | | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
* | | OSD.cc: track osdmap refs using an LRU (Samuel Just, 2012-04-26, 3 files, -122/+78)
| | | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
* | | common/: added templated simple lru implementations (Samuel Just, 2012-04-26, 3 files, -0/+221)
| | | Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
* | | osdmap: dedup pg_temp (Sage Weil, 2012-04-26, 1 file, -0/+6)
| | | We only deal with the case where the entire map is identical, since the individual items are too small to make the pointer overhead worthwhile.
| | | Too bad. An in-memory btree-like structure would work better for this.
| | | Signed-off-by: Sage Weil <sage@newdream.net>
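Whole-map deduplication through a shared pointer, as described above, might look roughly like this. The `PgTemp` type, the single-member `OsdMap` struct, and `dedup_pg_temp()` are simplified assumptions for illustration.

```cpp
#include <map>
#include <memory>
#include <vector>

typedef std::map<int, std::vector<int> > PgTemp;   // pg -> temporary OSD list

// Hypothetical, simplified map holding only the pg_temp table.
struct OsdMap {
  std::shared_ptr<PgTemp> pg_temp;
};

// If the newer map's pg_temp equals the older one's as a whole, point both
// maps at the same object. Returns true when the storage is shared.
bool dedup_pg_temp(const OsdMap& older, OsdMap& newer) {
  if (!older.pg_temp || !newer.pg_temp) return false;
  if (older.pg_temp == newer.pg_temp) return true;      // already shared
  if (*older.pg_temp != *newer.pg_temp) return false;   // any difference: keep both
  newer.pg_temp = older.pg_temp;                        // drop the duplicate copy
  return true;
}
```

Sharing at the granularity of the whole table costs one pointer per map, which is the trade-off the commit message describes: per-entry sharing would spend more on pointers than it saves.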
* | | osdmap: use shared_ptr<> for pg_temp (Sage Weil, 2012-04-26, 3 files, -24/+25)
| | | This will let us dedup later.
| | | Signed-off-by: Sage Weil <sage@newdream.net>
* | | osd: make map dedup optional (Sage Weil, 2012-04-26, 2 files, -6/+9)
| | | On by default. This trades CPU for memory. Some might have unlimited RAM and not care.
| | | Signed-off-by: Sage Weil <sage@newdream.net>