Storage Backend Migration

As the TeamBeam installed base grows, the requirements on the storage backend evolve and change. To deal with these changing requirements, the storage backend is becoming modular and supports multiple types of repositories for payload storage.

Currently available repository types:

  • Proxy: a delegating meta-repository. Bundles multiple repositories and directs requests to the correct one.
  • Riak: highly available, distributed, eventually consistent object storage. Supported in a multi-node installation with multiple storage backend nodes and multiple Riak nodes. Default repository type for all installed systems. Deprecated, since the third-party software is no longer maintained.
  • Filesystem: simple repository without high-availability. Stores payload in a directory in the filesystem. Only supported in a single-node installation. Very fast, since no coordination with other nodes is required.

Future repository types:

  • Cassandra: highly available, distributed, eventually consistent column store database. Supported in a multi-node installation with multiple storage backend nodes and multiple Cassandra nodes. Future use, not yet implemented.

All repositories have in common that they separate metadata from the payload.
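
To make this split concrete, the following is a minimal sketch of a common repository contract. The interface, the Manifest record and all method names are assumptions for illustration; they are not the actual skp-storage API.

import java.io.InputStream;
import java.util.Optional;

// Illustrative only: hypothetical common repository contract; all names are assumptions.
interface Repository {

    // Metadata is kept separately from the payload chunks.
    record Manifest(String objectId, String filename, long filesize, String contentType, boolean locked) {}

    Optional<Manifest> findManifest(String objectId);

    InputStream readChunk(String objectId, int chunkIndex);

    void writeChunk(String objectId, int chunkIndex, byte[] data);

    // Removes manifest and all chunks of the object; expected to be idempotent.
    void delete(String objectId);
}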

Repository Proxy

The storage backend can support multiple backends in parallel. To achieve this, the Repository Proxy must be enabled in the configuration:

proxy.enabled=true
proxy.preferredBackend=sfs

If enabled, the repository proxy will take precedence over all other repositories. It requires at least one other repository to be enabled.

Requests are handled according to their type (a sketch of the delegation logic is shown below):

  • Read requests: The proxy searches for the object's manifest in all enabled backends. The request is then delegated to the backend that reports ownership of the manifest. Note: A locked manifest is treated as if the manifest does not exist! If the manifest cannot be found in any backend, the object is reported as not found.
  • Write requests: Similar to read requests: the write request is delegated to the backend that reports ownership. If the manifest is not found anywhere, the request is delegated to the preferredBackend.
  • Delete requests: The request is delegated to all enabled repositories, regardless of whether they hold the manifest or whether it is locked.

The repository proxy is a soft approach to data migration: old data is read from wherever it exists, new data is written to the preferred backend, and delete operations from expired transfers are executed by all backends.
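
A minimal sketch of these delegation rules, building on the hypothetical Repository interface shown earlier; the class and method names are assumptions, not the actual implementation:

import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

// Illustrative only: delegation rules of the repository proxy as described above.
class RepositoryProxy {

    private final List<Repository> backends;   // all enabled repositories
    private final Repository preferredBackend; // target for new objects

    RepositoryProxy(List<Repository> backends, Repository preferredBackend) {
        this.backends = backends;
        this.preferredBackend = preferredBackend;
    }

    // A backend "owns" an object if it holds an unlocked manifest for it.
    private Optional<Repository> findOwner(String objectId) {
        return backends.stream()
                .filter(b -> b.findManifest(objectId)
                        .filter(m -> !m.locked())  // a locked manifest counts as "not found"
                        .isPresent())
                .findFirst();
    }

    InputStream read(String objectId, int chunkIndex) {
        Repository owner = findOwner(objectId)
                .orElseThrow(() -> new NoSuchElementException("object not found: " + objectId));
        return owner.readChunk(objectId, chunkIndex);
    }

    void write(String objectId, int chunkIndex, byte[] data) {
        // New objects fall through to the preferred backend.
        findOwner(objectId).orElse(preferredBackend).writeChunk(objectId, chunkIndex, data);
    }

    void delete(String objectId) {
        // Deletes are broadcast to every enabled backend, locked or not.
        backends.forEach(b -> b.delete(objectId));
    }
}
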

Because every request first has to locate the owning backend, the repository proxy adds latency to the execution of requests. It should not be enabled unless necessary.

Riak Repository

The metadata is stored in a dedicated bucket, with the objectId as the key. On top of the required filename, filesize and content type, the manifest stores a map of chunk-addresses. This facilitates the lookup of chunks when streaming only a specific range of the data.

The payload is sliced into chunks of typically ~1MB and stored in a bucket. Entries are keyed "{objectId}-{index}". For every chunk, a separate chunk-digest is stored in a bucket, holding the result of a SipHash (https://en.wikipedia.org/wiki/SipHash) calculation. The SipHash is used to detect and rule out invalid chunks resulting from Riak's eventually consistent architecture.
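
The following sketch shows how chunk keys and digests could be derived, here using Guava's SipHash-2-4 implementation; the class name, the concrete SipHash key and the digest encoding are assumptions, not the actual implementation:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

// Illustrative only: chunk keying and digest verification as described above.
class RiakChunkKeys {

    // Chunks are keyed "{objectId}-{index}" in the chunk bucket.
    static String chunkKey(String objectId, int index) {
        return objectId + "-" + index;
    }

    // SipHash-2-4; the key used by the real implementation is not documented here.
    private static final HashFunction SIPHASH = Hashing.sipHash24();

    static long chunkDigest(byte[] chunkData) {
        return SIPHASH.hashBytes(chunkData).asLong();
    }

    // A chunk read from Riak is only accepted if its digest matches the stored one,
    // ruling out stale replicas returned by eventually consistent reads.
    static boolean isValid(byte[] chunkData, long storedDigest) {
        return chunkDigest(chunkData) == storedDigest;
    }
}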

Simple Filesystem Repository

The SFS creates a directory structure in the filesystem, starting at a configured root directory. A directory is created for each object, named after the objectId. This directory is nested in a parent directory identified by the last three characters of the objectId. This prevents overloading the directory index and makes traversal easier when several thousand objects exist.

Since the objectId is an encoded timestamp, its leading characters barely vary; the last three characters are used instead, ensuring a much better distribution.

The metadata is stored as a JSON file. Chunks are stored as regular files, named "{chunkId / 100}/{objectId}-{chunkId}". Chunks typically have a file size of 4 MB. This can be tuned according to the expected use case: when transferring predominantly large files, use a larger chunk size; when seeking is used often, prefer a smaller chunk size.

The structure looks like this:

  • /srv/sfs/data/
    • ysw/
      • g34fidcysw/
        • manifest.json
        • chunks/
          • 0/
            • g34fidcysw-0
            • g34fidcysw-1
            • ...
            • g34fidcysw-98
            • g34fidcysw-99
          • 1/
            • g34fidcysw-100
            • g34fidcysw-101
            • ...
            • g34fidcysw-199
          • 2/
            • g34fidcysw-200
            • g34fidcysw-201
            • ...

The SFS manages the files and directories within its root directory by itself. It is normal for empty directories to exist.
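
A minimal sketch of how these paths can be derived from an objectId, matching the layout above; the class and method names are assumptions and the real implementation may differ:

import java.nio.file.Path;

// Illustrative only: SFS path derivation as described above.
class SfsPaths {

    private final Path root; // e.g. /srv/sfs/data

    SfsPaths(Path root) {
        this.root = root;
    }

    // Parent directory is the last three characters of the objectId.
    private Path objectDir(String objectId) {
        String shard = objectId.substring(objectId.length() - 3);
        return root.resolve(shard).resolve(objectId);
    }

    Path manifest(String objectId) {
        return objectDir(objectId).resolve("manifest.json");
    }

    // Chunks are grouped into subdirectories of 100: "{chunkId / 100}/{objectId}-{chunkId}".
    Path chunk(String objectId, int chunkId) {
        return objectDir(objectId)
                .resolve("chunks")
                .resolve(String.valueOf(chunkId / 100))
                .resolve(objectId + "-" + chunkId);
    }
}

For example, chunk("g34fidcysw", 101) resolves to /srv/sfs/data/ysw/g34fidcysw/chunks/1/g34fidcysw-101, matching the structure shown above.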

Migration

Data can be moved from old repositories to the preferred repository. Migration is supported during live traffic by using the repository proxy. If live migration is not required, the migration can be executed while the skp-storage service is shut down.

Note:

  • The migration process is executed through the skp-storage shell. It runs in the JVM of the shell. It does not require a running skp-storage service.
  • The migration process will fetch a list of objects hosted in a repository and iterate over it, migrating every single object. It will not determine if the object is required by other services (e.g. TeamBeam).
  • The migration process can run unattended, or the operator can be required to confirm each step.
  • During migration, the shell captures CTRL-C and will only allow orderly interruption of the migration after the ongoing migration of an object has been completed.
  • Only a single migration process is allowed to operate within a cluster at any time.
  • The migration processor reports its progress by emitting InProgress events. These events are logged as well as broadcast on the message broker (a minimal sketch of such an event follows this list).
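
A minimal sketch of what such a progress event might carry; the event class and all of its fields are assumptions, not the actual event format:

// Illustrative only: hypothetical payload of an InProgress event.
record InProgressEvent(
        String objectId,      // object currently being migrated
        long migratedObjects, // number of objects migrated so far
        long totalObjects,    // total number of objects to migrate
        String sourceBackend, // repository the object is read from
        String targetBackend  // preferred repository the object is written to
) {}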

The migration processor requires the same connectivity as the skp-storage service. Typically it is executed on a storage backend host.

The migration of a single object proceeds through the following stages (a condensed sketch follows the list):

  • Copy stage
    • Check if the object has already been migrated to the preferred backend. If it exists and is complete (has the expected filesize), consider it copied. If not, remove whatever exists already.
    • Create manifest in preferred backend, set to locked.
    • Read data from old backend, write stream to preferred backend.
    • When finished, unlock manifest in preferred backend. Lock manifest in old backend.
    • NOTE: From this moment on, the repository proxy would direct new read-requests to the preferred backend!
  • Wait stage
    • The migration processor repeatedly checks a countdown latch for the specific objectId. If the latch is >0 it indicates that a process is reading from the object. The migration processor waits until the latch is 0.
  • Delete stage
    • The migration processor issues a delete-object-request to the old backend. Note: this request is idempotent and can be repeated, even if the backend does not hold the object.
  • Report stage
    • Inform that the object has been migrated; report on progress.
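
A condensed sketch of these stages, building on the hypothetical Repository interface shown earlier; the abstract helper methods stand in for internals that are not documented here:

import java.util.concurrent.CountDownLatch;

// Illustrative only: per-object migration stages as described above.
abstract class ObjectMigration {

    void migrate(String objectId, Repository oldBackend, Repository preferred) throws InterruptedException {
        // Copy stage: reuse a complete copy, otherwise remove leftovers and copy again.
        if (!isComplete(objectId, preferred)) {
            preferred.delete(objectId);                 // drop any partial copy
            createLockedManifest(objectId, preferred);  // manifest starts out locked
            copyPayload(objectId, oldBackend, preferred);
            unlockManifest(objectId, preferred);        // proxy now directs new reads here
        }
        lockManifest(objectId, oldBackend);

        // Wait stage: readers still streaming the old copy hold the latch above zero.
        CountDownLatch readers = readerLatch(objectId);
        while (readers.getCount() > 0) {
            Thread.sleep(500);
        }

        // Delete stage: idempotent, safe to repeat even if the backend no longer holds the object.
        oldBackend.delete(objectId);

        // Report stage: emit an InProgress event (logged and broadcast on the message broker).
        reportProgress(objectId);
    }

    abstract boolean isComplete(String objectId, Repository backend);
    abstract void createLockedManifest(String objectId, Repository backend);
    abstract void copyPayload(String objectId, Repository source, Repository target);
    abstract void lockManifest(String objectId, Repository backend);
    abstract void unlockManifest(String objectId, Repository backend);
    abstract CountDownLatch readerLatch(String objectId);
    abstract void reportProgress(String objectId);
}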