Tips for managing OpenEBS Jiva Volumes

Utkarsh Mani Tripathi
Sep 27, 2019 · 5 min read

OpenEBS Jiva Volumes are a great way to provide highly available persistent volumes to Kubernetes stateful workloads by making use of the local storage available on each node. Installing and setting up OpenEBS is very easy, almost like running just another application in a Kubernetes cluster. Of course, the best part is that OpenEBS Jiva volumes are completely Kubernetes native and can be managed directly via kubectl.

For setting up OpenEBS and using Jiva volumes, please refer to the OpenEBS documentation.
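
As a quick reference, provisioning a Jiva volume is just a matter of creating a PVC against a Jiva StorageClass. The sketch below assumes the default openebs-jiva-default StorageClass created by the OpenEBS installation and uses a hypothetical claim name; substitute your own StorageClass and sizing:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: demo-jiva-claim            # hypothetical name, for illustration only
    spec:
      storageClassName: openebs-jiva-default
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5G
    EOF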

In this blog, I will focus on some of the common questions that come up in the OpenEBS Users Slack channel, related to running Jiva volumes in production.

Overview of Jiva Volumes

Before we get into the specifics: OpenEBS Jiva volumes are implemented as microservices/containers and consist of two main components, described below.

  • Jiva Controller is responsible for serving data to the application by contacting the Jiva Replicas on different nodes. The Jiva controller performs synchronous replication: a copy of every write is sent to all the replicas.
  • Jiva Replica is responsible for saving the data on the local storage of its node and maintaining data consistency. After recovering from a failure, a replica rebuilds its data from the other replicas. Note that replicas, once scheduled onto a node, remain on that node; in fact, they are pinned to the node using the hostname label.
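
Both components run as regular deployments in the namespace of the PVC, so you can see them with plain kubectl. As a rough guide (exact names depend on your OpenEBS version), the controller deployment is typically named after the PV with a -ctrl suffix and the replicas with a -rep suffix:

    kubectl get pods -n <ns_of_pvc> -o wide
    # look for pods named like pvc-<uuid>-ctrl-... (controller) and pvc-<uuid>-rep-... (replicas),
    # and note which node each replica pod is pinned to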

The health of a Jiva volume can be checked using mayactl, which provides a detailed status of the controller and the replicas. A sample invocation looks as follows:

mayactl volume describe --volname <vol-name>

More information on how to run mayactl is available in the OpenEBS Docs.
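
Since mayactl ships inside the maya-apiserver pod, one common way to run it is via kubectl exec. The label selector and flags below follow what the OpenEBS docs describe for this release line, but they may vary slightly with your version, so treat this as a sketch:

    # find the maya-apiserver pod (assuming OpenEBS is installed in the openebs namespace)
    kubectl get pods -n openebs -l name=maya-apiserver
    # describe the Jiva volume backing a PV to see the controller and replica status
    kubectl exec -it -n openebs <maya-apiserver-pod> -- mayactl volume describe --volname <pv-name>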

Understanding the Replica Status

A replica is considered healthy if it is in the RW (read-write) state.

The OpenEBS Jiva controller requires a minimum of 51% (i.e., more than half) of its replicas to be healthy (aka RW) for serving data to the application. So if there are 3 replicas, then for the volume to be available, at least two out of the three replicas must be in the RW state.

Due to cluster conditions like a node shutdown or loss of network connectivity, it is possible for a replica to go into the following states:

  • NA — This indicates that the controller is no longer able to contact the replica and vice versa.
  • WO (Write Only) — This indicates that the replica has just restarted either due to network loss or node shutdown, and is in the process of rebuilding the data from other replicas.

WO is a valid state, and a replica will remain in it until all the data is rebuilt. In the current version, if a replica is lagging behind the healthy replicas, a full rebuild is initiated. The time taken to rebuild depends on the amount of data, the speed of the network, and the storage.
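
There is no separate command needed to track the rebuild; simply re-running the describe command periodically shows the replica transitioning from WO back to RW. A simple watch loop (with the same hypothetical pod and volume names as above) could look like:

    # re-check the replica status every 30 seconds until the WO replica turns RW
    watch -n 30 "kubectl exec -n openebs <maya-apiserver-pod> -- mayactl volume describe --volname <pv-name>"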

The replicas that are stuck in the NA state need some manual intervention. The NA state can be due to the following scenarios:

  • The replica pod is hung or is unable to make any outgoing connections to the controller. In this case, restarting the replica pod can help (see the example after this list).
  • The node on which the replica pod was running has been taken out of the cluster. In this case, the replica deployment's node affinity needs to be updated so that the replica pod can be scheduled to the new node in the cluster that replaced the older one. Steps to recover from this scenario are mentioned below.
  • All the replicas are in the NA state. This indicates that the controller pod is unable to respond to replica requests, that the controller pod is down, or that all the nodes where the replicas are scheduled are down. If the controller remains down for a very long time, scheduling it on another node will help.
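
For the first case, deleting the stuck replica pod is usually enough, because its deployment recreates it on the same node (remember that replicas are pinned via node affinity). The pod name below is a placeholder:

    # restart a hung replica; the deployment will bring it back on the same node
    kubectl delete pod <replica-pod-name> -n <ns_of_pvc>
    # if all replicas are NA, also check that the controller pod is running and where it is scheduled
    kubectl get pods -n <ns_of_pvc> -o wide | grep ctrl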

Recovering from the NA state when the replica pod is in a Pending state

Note: We use node affinity to pin the replicas to the desired worker nodes, to avoid creating more copies of the data than the desired replication factor and to optimize the rebuilding process.

Prerequisites:

  • A minimum of n/2+1 replicas should be up and healthy (RW) to ensure the volume is up, where n is the replication factor. For example, if the replication factor is 3, a minimum of 2 replicas should be up and healthy.

Here are the steps to schedule the replica on another node:

  1. kubectl get deployment -n <ns_of_pvc> (get the deployment of the replica that requires rescheduling, say replica_deployment)
  2. kubectl get pods -n <ns_of_pvc> -o wide (get the name of the node from which you want to detach the replica, say N3)
  3. kubectl edit deployment <replica_deployment> -n <ns_of_pvc>
  4. Now replace the old worker node (N3) in the following field with the new worker node (N4):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - N1
          - N2
          - N3   # replace N3 with N4

5. Save and exit, and you will notice that a new replica pod is scheduled on the new worker node.
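
If you prefer a non-interactive change instead of kubectl edit, the same update can be made with kubectl patch. The JSON path below assumes N3 is the third entry (index 2) in the values list, so adjust the index to match your deployment:

    kubectl patch deployment <replica_deployment> -n <ns_of_pvc> --type=json \
      -p='[{"op": "replace", "path": "/spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/0/values/2", "value": "N4"}]'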

Another multi-failure scenario to consider: if multiple replicas remain in a non-healthy state for a longer duration, the application node can mark the volume as read-only. After using the above resolution steps to bring the volume back to a healthy state, a manual step needs to be performed on the application node to make the volume read-write again. The steps for recovering from the read-only state are available in the OpenEBS Troubleshooting section.
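
The exact procedure is in the troubleshooting docs and should be followed as written; purely as an illustration of what that manual step involves (a generic Linux remount with placeholder device and mount point, not necessarily the exact OpenEBS procedure), it looks roughly like this on the application node once the volume is healthy again:

    # find the device and mount point backing the volume, then remount it read-write
    mount | grep <pv-name>
    mount -o remount,rw <device> <mount-point>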

Upcoming Enhancements

We are working on optimizing the rebuild process to reduce the time taken for a replica to go from the WO to the RW state. I am also actively working on a Jiva Operator that can help automatically recover replicas that are stuck in the NA state due to a complete node failure.

Feel free to reach out to me to learn about the internals of Jiva or to provide feedback and suggestions to improve your experience with OpenEBS Jiva volumes.

Thank you, Kiran Mova, for your valuable suggestions and edits.

~ उत्कर्ष_उवाच
