Ceph RBD: distributing small write operations better

Joachim Kraftmayer

When creating an RBD image, you can set the stripe unit and the stripe count.
With a smaller stripe unit, consecutive small writes land in different objects, so they are spread across more of the OSDs in the Ceph cluster.

rbd -p benchpool create image-su-64kb --size 102400 --stripe-unit 65536 --stripe-count 16
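To make the numbers in this command concrete, here is a minimal Python sketch of the layout it implies. It assumes that --size without a suffix is read as megabytes (MiB) and that the image keeps the default object size of 4 MB (order 22); the variable names are purely illustrative.

```python
# Values taken from the rbd create command above.
MIB = 1024 * 1024
size_mib = 102400        # --size (assumed to be in MiB)
stripe_unit = 65536      # --stripe-unit, in bytes
stripe_count = 16        # --stripe-count
order = 22               # assumed default: 2**22 bytes per object

object_size = 1 << order
image_bytes = size_mib * MIB

# One full stripe covers stripe_count objects with stripe_unit bytes each.
stripe_width = stripe_unit * stripe_count

# A set of stripe_count objects is filled completely before the next set starts.
object_set_bytes = object_size * stripe_count

# Total number of objects backing the image once it is fully written.
num_objects = image_bytes // object_size

print("stripe width:", stripe_width // 1024, "KiB")
print("object set size:", object_set_bytes // MIB, "MiB")
print("objects in image:", num_objects)
```

Under these assumptions, every 1 MiB of contiguous image data is spread over 16 objects in 64 KiB pieces.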

RBD images are striped over many objects, which are then stored by the Ceph distributed object store (RADOS). As a result, read and write requests for the image are distributed across many nodes in the cluster, generally preventing any single node from becoming a bottleneck when individual images get large or busy.

The striping is controlled by three parameters:

order: The size of objects we stripe over is a power of two, specifically 2^[order] bytes. The default is 22, or 4 MB.

stripe_unit: Each [stripe_unit] contiguous bytes are stored adjacently in the same object, before we move on to the next object.

stripe_count: After we write [stripe_unit] bytes to [stripe_count] objects, we loop back to the initial object and write another stripe, until the object reaches its maximum size (as specified by [order]). At that point, we move on to the next [stripe_count] objects.

By default, [stripe_unit] is the same as the object size and [stripe_count] is 1. Specifying a different [stripe_unit] requires that the STRIPINGV2 feature be supported (added in Ceph v0.53) and format 2 images be used.
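The placement rule described by these three parameters can be sketched in Python. The function name locate and its arguments are hypothetical and not part of any Ceph API; the arithmetic follows the description above and is not taken from the Ceph source code.

```python
def locate(offset, order=22, stripe_unit=None, stripe_count=1):
    """Map an image byte offset to (object number, offset inside that object)."""
    object_size = 1 << order
    if stripe_unit is None:
        # Default described above: stripe_unit equals the object size.
        stripe_unit = object_size
    if stripe_unit <= 0 or object_size % stripe_unit != 0:
        raise ValueError("stripe_unit must evenly divide the object size")
    if stripe_count < 1:
        raise ValueError("stripe_count must be at least 1")

    stripes_per_object = object_size // stripe_unit

    # Which stripe_unit-sized block of the image contains the offset?
    block_no = offset // stripe_unit
    # Blocks are dealt round-robin over stripe_count objects.
    stripe_no, stripe_pos = divmod(block_no, stripe_count)
    # Once every object in the set is full, move on to the next set.
    object_set_no, stripe_in_object = divmod(stripe_no, stripes_per_object)

    object_no = object_set_no * stripe_count + stripe_pos
    object_off = stripe_in_object * stripe_unit + offset % stripe_unit
    return object_no, object_off


# Sixteen consecutive 64 KiB writes, first with 64 KiB x 16 striping,
# then with the default layout.
offsets = [i * 65536 for i in range(16)]
striped = {locate(o, stripe_unit=65536, stripe_count=16)[0] for o in offsets}
default = {locate(o)[0] for o in offsets}
print(len(striped), "objects with striping,", len(default), "with the defaults")
```

With a 64 KiB stripe unit and a stripe count of 16, the sixteen writes touch sixteen different objects, whereas with the defaults they all fall into the first object.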

Source: docs.ceph.com/docs/giant/man/8/rbd/#striping