17.2.0 officially released

Ceph.io - v17.2.0 Quincy released
- Telemetry report improvements: are they enough to locate performance problems?

the "kvs" Ceph object class is not packaged anymore.
- What capability is this?

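Crimson

Judging from "History for src/crimson - quincy ceph/ceph", upstream does not seem to plan to ship this feature in this release.

Compared with "History for src/crimson - master ceph/ceph", the quincy branch is already about two months of updates behind.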

Ceph Leadership Team meeting

2022-05-25

These are the topics discussed in today's meeting:

- Change in the release process
  - Patrick suggesting version bump PRs vs current commit push approach
  - Commits are not signed
  - Avoids freezing the branch during hotfixes
  - Both for hotfixes and regular dot releases
  - Needs further discussion
  - https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow <-- more closely matches current model + proposed changes re: PRs, with the addition of a development branch
  - https://www.atlassian.com/continuous-delivery/continuous-integration/trunk-based-development
  - ceph-Jenkins account needs admin privs to ceph.git in order to push directly to branches
  - doesn't apply to PR version bump
- publishing (Windows) binaries signed by cloudbase?
  - Issues with Linux/Ceph Foundation
  - RH might provide the signed binaries
  - Probably don't publish binaries signed by Cloudbase b/c we wouldn't get telemetry data back from crashes etc.
- master-main rename
  - rename completed
  - there are still issues with some Jenkins jobs
- quincy blogs
- 17.2.1 readiness
  - 3 PRs left
  - release candidate by Jun 1st week
- 16.2.8 issue retrospective: https://docs.google.com/presentation/d/1hbwo_GW48O4nnM78US2ghVXglxZw5TRwmWUOy_oxam8/
  - scale testing
  - stricter backport policy
  - lack of reviews in the backport
  - mgr component is orphaned (RADOS, lack of experience in other teams)
  - conflict solving was not properly documented
  - mgr has become too critical (due to cephadm it is now key)
  - don't merge with 1 approval
  - reviewers on sensitive PRs/files should acknowledge they understand the code changes
  - different standards across teams
  - try out requiring reviews from CODEOWNERS on the pacific branch

2022-04-06

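CDS Reef: Crimson session

Stability and deployment

Crimson can currently be deployed with rook and cephadm
Effort has gone into a teuthology test suite
Many stability bugs were fixed based on the initial teuthology tests
- refinements to op ordering logic
- improvements to the watch/notify APIs
Improvements and fixes to crimson's bluestore support.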

seastore metrics

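Seastore changed a lot over the past year. It now has a set of metrics for monitoring it, which will help future optimization.

New metrics that help in understanding seastore's performance patterns:
- cache: transaction conflict rate, cache utilization
- lba tree: lba allocation
- transaction manager
- segment_manager/block: read and write counters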
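
Metrics such as the transaction conflict rate and cache utilization are ratios derived from raw counters. A minimal sketch of that derivation follows. The counter names (`trans_committed`, `trans_conflicted`, `cache_used_bytes`, `cache_capacity_bytes`) are hypothetical and are not SeaStore's actual metric names.

```python
from dataclasses import dataclass


@dataclass
class CacheCounters:
    """Hypothetical raw counters; the field names are illustrative only."""
    trans_committed: int
    trans_conflicted: int
    cache_used_bytes: int
    cache_capacity_bytes: int


def conflict_rate(c: CacheCounters) -> float:
    """Fraction of finished transactions that were invalidated by a conflict."""
    total = c.trans_committed + c.trans_conflicted
    return c.trans_conflicted / total if total else 0.0


def cache_utilization(c: CacheCounters) -> float:
    """Fraction of the cache capacity currently in use."""
    if c.cache_capacity_bytes == 0:
        return 0.0
    return c.cache_used_bytes / c.cache_capacity_bytes


# Made-up sample values, only to show the calculation.
sample = CacheCounters(
    trans_committed=900,
    trans_conflicted=100,
    cache_used_bytes=256 << 20,
    cache_capacity_bytes=1 << 30,
)
print(f"conflict rate: {conflict_rate(sample):.2%}")
print(f"cache utilization: {cache_utilization(sample):.2%}")
```

Tracking the ratios rather than the raw counters makes runs with different durations or load levels comparable.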

Seastore Internals

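joyhound added support for placing extents in non-journal segments
Internal refactoring to support multiple devices
Improved conflict detection
Added a ZNS segment manager that has been tested on real ZNS devices
rados_api_test coverage fix
Performance
- lba hinting on onode tree, omap tree
- journal coalescing

In quincy, crimson should support testing rbd workloads without snapshots, in a single-reactor configuration, on a cyanstore or seastore backend, deployed with rook or cephadm.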

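Crimson plans for Reef

The big items on the crimson roadmap are:

Multi-core (op processing)
Multi-core (messengers)
Snapshots
scrub

Snapshots are a fairly core feature that Baton is developing, and the rbd test cases for it are already filled in. Multi-core will be the harder challenge going forward.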

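Seastore plans for Reef

Further improvements are planned for multi-device support and the tiered GC mechanism,
as well as for the random block manager on NVMe devices and multi-core GC.
Multi-core support
- Initially, an NVMe device will be split into multiple partitions, each running seastore, to obtain a performance baseline. That baseline will later be used to evaluate the metadata structures.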

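Usability/testing

Logging
- Keep the debug_* configuration options
- Keep the pg log message format consistent with the classic OSD
- vstart and teuthology cluster logs go to the same files as with the classic OSD
- Use the common debug options where possible so that the cause of a problem can be collected quickly, and make sure everyone is on the same page when debugging. This includes the vstart and teuthology logs, which currently differ somewhat from the classic OSD.
- A teuthology suite that tests every PR is planned.
- It would be great if someone were willing to do an audit.
- One difference in the base logging framework: in the classic OSD, level 20 is already the most verbose debug level, while crimson goes beyond that.
  - Currently only debug_ms behaves well: level 1 already prints every message.
  - Ideally this would not be limited to debug_ms.
  - The convention here is that 10-20 is debug and 30 is trace.
  - However, we have created log messages at boundary levels such as 5.
Gate crimson PRs with teuthology tests
- Select a set of tests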
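
The level convention mentioned in the logging notes (10-20 is debug, 30 is trace) can be written down as a small classifier. This is only a sketch of the convention as recorded here, not crimson's logging code. The label for levels below 10 and the treatment of levels 21-30 are assumptions.

```python
def classify_level(level: int) -> str:
    """Classify a numeric log level: 10-20 is debug, 30 is trace."""
    if level < 0:
        raise ValueError("log level must be non-negative")
    if level < 10:
        # Assumption: anything below 10 (for example the "boundary" level 5)
        # is treated as ordinary informational output.
        return "info-or-below"
    if level <= 20:
        return "debug"
    # Assumption: everything above 20 is grouped with trace (30).
    return "trace"


def should_log(message_level: int, configured_level: int) -> bool:
    """Emit a message when its level does not exceed the configured level."""
    return message_level <= configured_level


for lvl in (1, 5, 10, 20, 30):
    print(lvl, classify_level(lvl), should_log(lvl, configured_level=20))
```

With a configured level of 20, everything up to debug is emitted and trace messages at 30 are suppressed, which is why an audit of which messages sit at which level matters.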

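Crimson discussion

QLC and ZNS discussion: QLC is denser than TLC.

My strategy is that a device is either fast enough to be handled by the random block manager, or slow enough to be handled by the segment manager, in which case it falls into the same bucket as ZNS.

Red Hat will share performance benchmarks, for anyone who wants to test rbd early.

It will not be very stable this year, but it should improve within a few years.

The industry is moving to "accuracy" rather than TLC. TODO: unclear. Does "accuracy" here refer to some new medium, or to something else?

There is interest in this area, so we want to see how crimson can help with the transition away from TLC.

Seastore currently lags behind crimson, so reaching maturity will be somewhat difficult, but it should already be functional.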

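Testing the classic OSD

The latest release is what people use for testing.

For crimson, however, people are not expected to test on quincy.

Perhaps monthly or weekly, some kind of automatically built snapshot.

We would not do any testing on these, nor provide backports.

They would simply be point-in-time snapshots to download and run, giving people a way to test crimson more often than the yearly release.

It would be even better with some images and instructions.

Moving from bluestore to crimson is another area where new contributors could help.

This could also serve as a forcing function for improving the release pipeline.

Crimson can be used with quincy, but without snapshots, so rbd still needs further work, as does support for rgw and cephfs.

rbd has always been our main target and focus because, from a CPU-per-IO perspective, rbd is the most demanding workload in most cases.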

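GC strategy (yangson)

SeaStore Generational Segment Cleaning - Google Docs (the document for the discussion that follows)

I did not look at the GC part in detail, so I am skipping it. See CDS Reef: Crimson - YouTube.

> SeaStore stores extents on disk. Each extent has a category that marks how hot it is: a root extent, a logical (LBA) extent, or unknown metadata.
>
> It can be, for example, an object data block or an omap leaf.
>
> Each extent has an age, following the LFS design. Extents with the same category and a similar age are grouped together.
>
> According to the paper, there is a cost-benefit garbage collection policy. Reclamation is more effective if the implementation places similar extents in the same segment.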
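
The grouping and cost-benefit ideas quoted above can be sketched as follows. The benefit/cost ratio `(1 - u) * age / (1 + u)` is the one from the original log-structured file system (LFS) paper. Whether SeaStore uses exactly this formula is an assumption. The category labels, the age bucket size, and all names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    category: str  # illustrative labels such as "root", "logical", "metadata"
    age: float     # time since last modification, arbitrary units


@dataclass(frozen=True)
class Segment:
    name: str
    utilization: float  # fraction of the segment that is still live, 0.0..1.0
    age: float          # age of the data in the segment, arbitrary units


def group_extents(extents, age_bucket=10.0):
    """Group extents that share a category and fall into the same age bucket."""
    groups = defaultdict(list)
    for e in extents:
        groups[(e.category, int(e.age // age_bucket))].append(e)
    return dict(groups)


def cost_benefit(seg: Segment) -> float:
    """LFS-style benefit/cost ratio: (1 - u) * age / (1 + u)."""
    u = seg.utilization
    return (1.0 - u) * seg.age / (1.0 + u)


def pick_victim(segments):
    """Pick the segment whose cleaning gives the best benefit/cost ratio."""
    return max(segments, key=cost_benefit)


segments = [
    Segment("hot", utilization=0.5, age=1.0),
    Segment("cold", utilization=0.8, age=100.0),
]
print(pick_victim(segments).name)
```

In this example the cold segment is chosen even though it holds more live data, because the space reclaimed from old data is likely to stay free for longer. Grouping extents of similar age into the same segment is what makes segments clearly hot or clearly cold in the first place.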

device tier

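This approach divides devices into multiple tiers, each tier faster than the next. The idea seems sound.

Reef performance session

This is presumably the planning discussion for the next Ceph release?

rocksdb column families: pg log optimization?

rocksdb: add merging for the pg log.

recycling pg log

A mapping from the real pg info to the pg log key causes a performance drop.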

There is a more or less working implementation now, a PoC at the moment. The idea is to replace the rocksdb write-ahead log with an external one residing in the store, with its own write-ahead log implementation.

bluestore performance improvement.

you prototyped moving the pg log out of rocksdb completely and you stopped because

it increased our IOPS by 20

yeah, the first one is about the write-ahead log, and the second one is about removing the pg log implementation altogether, along with all this replication stuff

IOPS increased by 20%.

igor

Earlier on

just write out pg log updates to 64k allocations in BlueFS

She did not see any benefit, so she gave it up quickly.

it sounds like, Igor, you're having much better success with it, with rocksdb iterator boundaries

corey ?

rgw integration with DAOS

rgw: add DAOS SAL implementation by zalsader · Pull Request #45888 · ceph/ceph

Ceph User + Dev Monthly Meeting

2022-06-01 ceph developer monthly

block deduction feature

2022-02-02

trace

Ceph Crimson/SeaStore

2022-06-07

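The multi-core plan continues to move forward
Fixed capacity issues
Found that fio's zipf mode is a better workload for testing GC
- zipf set up as a uniform distribution is the worst case for GC
  - zipf distribution?
- Consider adding some sampling in Seastore to observe and tune the zipf parameters

2022-05-11

seastore concurrency discussion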
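
On the zipf question raised in the 2022-06-07 notes: in a Zipf distribution the block with rank r is accessed with probability proportional to 1 / r^theta, so a few blocks are hot and most are cold. With theta = 0 every block is equally likely, which is the uniform case. A minimal sketch of the effect follows. The function names are made up for illustration, and this is not how fio generates its offsets internally.

```python
import random


def zipf_weights(n: int, theta: float) -> list[float]:
    """Unnormalized Zipf weights: rank r (1-based) gets weight 1 / r**theta."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]


def hot_fraction(n: int, theta: float, top: float = 0.1) -> float:
    """Expected share of accesses that hit the hottest `top` share of blocks."""
    weights = zipf_weights(n, theta)
    hot = max(1, int(n * top))
    return sum(weights[:hot]) / sum(weights)


def sample_blocks(n: int, theta: float, count: int, seed: int = 0) -> list[int]:
    """Draw `count` block indices (0-based, 0 is the hottest) from the distribution."""
    rng = random.Random(seed)
    return rng.choices(range(n), weights=zipf_weights(n, theta), k=count)


for theta in (0.0, 0.8, 1.2):
    share = hot_fraction(1000, theta)
    print(f"theta={theta}: hottest 10% of blocks get {share:.1%} of accesses")
print(sample_blocks(1000, 1.2, 10))
```

A likely reason the uniform case is the worst for GC is that no block is colder than any other, so every segment ends up with similar utilization and cleaning always has to relocate a similar amount of live data. In fio the skew is set with `random_distribution=zipf:<theta>`.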

performance weekly

2022-05-05

   closed:
       https://github.com/ceph/ceph/pull/46095 (kv/RocksDBStore: Remove ability to bound WholeSpaceIterator, aclamk) <-- merged to master by yuriw
       https://github.com/ceph/ceph/pull/45993 (crimson/osd: fix argument parsing after seastar changes, markhpc) <-- merged to master by markhpc

   updated:
       https://github.com/ceph/ceph/pull/46062 (crimson: Enable tcmalloc when using seastar, markhpc) <-- Discussion, updates
       https://github.com/ceph/ceph/pull/45771 (os/bluestore: Switch to time-based adaptive near-fit alogrithm, markhpc) <-- discussion
       https://github.com/ceph/ceph/pull/45888 (rgw: add DAOS SAL implementation, zalsader) <-- discussion, review, needs rebase

Igor's bluestore implementation will be presented next week.

2022-04-13

Video not uploaded.

- CURRENT STATUS OF PULL REQUESTS (since 2022-04-07):

   new:
       https://github.com/ceph/ceph/pull/45904 (os/bluestore: set upper and lower bounds on rocksdb omap iterators, cfsnyder) <-- cbodley reviews
       https://github.com/ceph/ceph/pull/45888 (rgw: add DAOS SAL implementation, zalsader) <-- new PR


   closed:
       https://github.com/ceph/ceph/pull/45884 (os/bluestore: Always update the cursor position in AVL near-fit search, markhpc) <-- Merged to master by Yuri
       https://github.com/ceph/ceph/pull/45755 (common/options: Disable AVL allocator first-fit optimizations, markhpc) <-- superseded by #45884 and #45771


   updated:
       https://github.com/ceph/ceph/pull/45771 (os/bluestore: Switch to time-based adaptive near-fit alogrithm, markhpc) <-- discussion
       https://github.com/ceph/ceph/pull/44684 (tracer: set tracing compiled in by default, zenomri) <-- reviews, updates, discussion, more testing, ideepika reviews, updates
       https://github.com/ceph/ceph/pull/31694 (♪ I've got the world on a string, sittin' on a rainbow ♪, adamemerson) <-- cbodley reviews, mbenjamin reviews, needs-rebase, anything left to merge?

2022-04-07

Ceph Performance Meeting 2022-04-07 - YouTube

- CURRENT STATUS OF PULL REQUESTS (since 2022-03-31):

   new:
       https://github.com/ceph/ceph/pull/45771 (os/bluestore: Switch to time-based adaptive near-fit alogrithm, markhpc) <-- discussion
       https://github.com/ceph/ceph/pull/45755 (common/options: Disable AVL allocator first-fit optimizations, markhpc) <-- discussion

   closed:


   updated:
       https://github.com/ceph/ceph/pull/44684 (tracer: set tracing compiled in by default, zenomri) <-- reviews, updates, discussion, more testing, review req for ideepika and markhpc

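The master branch tends to regress performance.

Gabby talked about the fast NVMe test nodes: the baseline gave fairly good results? 70K-80K small-object random IOPS.

But the master branch now only reaches 20-30K.

mako nodes, and still saw high performance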

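AVL allocator

Two new PRs this week relate to the AVL allocator.

Determine when to go into best-fit mode in the AVL allocator, replacing the fit mode from last summer?

On Samsung drives, large sequential write performance slows down significantly. It looks like the allocation mode is being adjusted dynamically instead of allocating linearly.

The 64K write I/O pattern does not seem to suit the Samsung drives. A follow-up attempt raised a parameter by 4x or 8x to make it fit? That helped, but it did not solve the problem.

The cost probably shows up mainly in the time spent searching the free space. Hence this change, which chooses the best fit based on cycles and bytes. If the search takes more than 1 ms, switch to the fast mode? Pacific 16.2.7 seems to behave better. 64K is split into 64 chunks. Perhaps these drives simply do not like the pattern our change produces, which makes it easy to fall into degraded performance? On other NVMe drives, including Intel P3700-based ones, the performance difference is small, but on Samsung the difference is large.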
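
The idea of bounding the search by time can be sketched as follows: look for the tightest-fitting free extent, but once the time budget (1 ms in the notes) is spent, stop and settle for the best candidate seen so far. This is a simplified illustration over a flat list, not the code of PR #45771, which works on an AVL tree. All names are hypothetical, and the injectable `clock` parameter exists only to make the behavior testable.

```python
import time
from typing import Callable, Optional

Extent = tuple[int, int]  # (offset, length)


def allocate(
    free: list[Extent],
    want: int,
    budget_s: float = 0.001,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[Extent]:
    """Return a free extent of at least `want` bytes, or None if none fits.

    The search prefers the tightest fit, but once `budget_s` seconds have
    elapsed it stops at the best candidate found so far (the "fast mode").
    """
    start = clock()
    best: Optional[Extent] = None
    for ext in free:
        if ext[1] < want:
            continue  # too small, keep looking
        if best is None or ext[1] < best[1]:
            best = ext
        if best[1] == want:
            break  # exact fit, cannot do better
        if clock() - start > budget_s:
            break  # budget exhausted: accept what we have
    return best


free_list = [(0, 128 * 1024), (1 << 20, 64 * 1024), (2 << 20, 96 * 1024)]
print(allocate(free_list, 64 * 1024))
```

The trade-off is that a bounded search may pick a looser fit and fragment the free space more, but it caps the time during which the allocator blocks other operations.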

os/bluestore: Switch to time-based adaptive near-fit alogrithm by markhpc · Pull Request #45771 · ceph/ceph

common/options: Disable AVL allocator first-fit optimizations by markhpc · Pull Request #45755 · ceph/ceph

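There seem to be performance test charts here, which presumably drove the decision to either disable the optimizations or adopt the approach above?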

david galloway ?

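Workload tests

dp compaction. With the AVL allocator at 4K, cases of 70-700 ms were observed.

The stupid and hybrid allocators?

Is it mainly about the 64K requests? When the allocator spends 2 ms finding 2 contiguous blocks, it blocks other operations on the same blocks. Does that leave only about 500 allocations per second (1 s / 2 ms)? When I request a 500K chunk, the actual allocation unit is 64K, but the free space itself may be fragmented into 16K pieces.

Currently the modified stupid allocator does not perform better than the original (without this patch).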

...

The rest is all discussion of this allocation logic, which I have not studied yet. I will translate it once I understand it.

Sources

https://pad.ceph.com/p/performance_weekly
https://www.youtube.com/c/Cephstorage/videos
TODO:https://www.openeuler.org/zh/interaction/blog-list/
- openEuler's write-ups here seem better organized?

Ceph community progress follow-up, 2022 Q1+Q2

2022-04-18