Unable to reach leader in group 1 - dir structures help

@MichelDiz
We’re running RAID 5 (striped with 1 parity) plus a hot spare, on SSD SAS Mixed Use 12Gbps drives. The RAM is ECC. These are 6x Dell R940s with redundant power supplies, fed by 2x UPSs.

We have 4 Groups. Each Group has 3 Alphas. There are 3 Zeros.

We’re running Dgraph in Docker Swarm.
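
For context, each Zero and each Alpha runs as its own Swarm service with its data directory bind-mounted from the host. A minimal sketch (service names, host paths, network, and image tag are placeholders, not our exact stack):

```
# One service per Zero; --replicas 3 tells Zero to build 3-Alpha groups.
docker service create --name zero1 --network dgraph_net \
  --mount type=bind,source=/data/zero1,target=/dgraph \
  dgraph/dgraph:latest dgraph zero --my=zero1:5080 --replicas 3 --idx 1

# One service per Alpha; Zero assigns each new Alpha to a group until it has 3.
docker service create --name alpha1 --network dgraph_net \
  --mount type=bind,source=/data/alpha1,target=/dgraph \
  dgraph/dgraph:latest dgraph alpha --my=alpha1:7080 --zero zero1:5080
```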

The issue
We experienced an unscheduled power outage. 3 of the 6 servers came back online on their own after the outage; the other 3 had to be manually restarted a few hours later.

With all 6 servers up, the Dgraph state was:
1 of the 3 Zeros was alive, with no leader.
Group #1 had 1 Alpha alive, but it was not the leader.
Group #2 had 2x Alphas alive, with a leader.
Group #3 had 2x Alphas alive, with a leader.
Group #4 had 2x Alphas alive, with a leader.

Repairs
Zero #2 came back online just fine.
Zero #1 didn’t have the --peer flag in its startup command and created its own cluster. This resulted in a split brain, with some of the Alphas now attached to it. We eventually fixed this by adding the --peer flag.
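
The fix amounted to making sure a restarting Zero points at one that is already part of the cluster. A sketch of the difference (hostnames are placeholders):

```
# What Zero #1 was effectively doing after the outage - no --peer, so it
# bootstrapped its own brand-new cluster:
dgraph zero --my=zero1:5080 --replicas 3 --idx 1

# What it needed - point at a Zero from the existing cluster so it rejoins it:
dgraph zero --my=zero1:5080 --peer zero2:5080 --replicas 3 --idx 1
```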

Groups #2, #3, and #4’s dead Alphas couldn’t move past a DirectedEdge: illegal tag error. To correct this, we called removeNode for each one and created a replacement Alpha. Replication then occurred.
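
For reference, removeNode is an HTTP call against the Zero leader (port 6080 by default), using the group and Raft ID shown in /state; the replacement Alpha then starts with empty p and w directories. Roughly (group/id values here are examples):

```
# Drop the dead member from its group.
curl -s "http://zero1:6080/removeNode?group=2&id=5"

# Start a replacement Alpha with empty p/ and w/ directories; Zero hands it a
# fresh Raft ID and assigns it to the under-replicated group.
dgraph alpha --my=alpha-new:7080 --zero zero1:5080
```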

Group #1 - the alive Alpha had all the data needed; it just needed to replicate it to the 2x dead ones.
The 2x dead Alphas also couldn’t move past the DirectedEdge: illegal tag error. We called removeNode on both of them and created 2x new Alphas, expecting a leader to be elected and replication to occur. Instead, they just continued to throw the error: Error while calling hasPeer: Unable to reach leader in group 1. Retrying... (described here)

Group #1’s Alpha that had stayed up appeared to believe that the original Alphas in its group were never removed. We confirmed this by running the dgraph debug -o command on the w directory. The Zeros also thought the original Group #1 Alphas were still there: they correctly appeared in the /state “removed nodes” list, but when running the dgraph debug -o command on the w directory, the Snapshot Metadata: {ConfState:{Nodes:[]... still showed the removed nodes. That lined up with the error messages we were seeing from the Alpha that was up the whole time: Unable to send message to peer: 0x1. Error: Do not have address of peer 0x1
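
The mismatch was visible by comparing Zero’s /state output against the snapshot metadata the debug tool printed from the w directory. For example (hostname is a placeholder):

```
# Zero's view of membership: removed members are listed under "removed",
# current members of group 1 under groups["1"].members.
curl -s http://zero1:6080/state | jq '{removed: .removed, group1: .groups["1"].members}'
```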

Once we decided we couldn’t fix the group’s ability to elect a leader, we put the original nodes’ data (everything was backed up) back into the newly created Alphas. That brought us back to the DirectedEdge: illegal tag error.

We attempted to inspect/repair the data with badger info --read-only=false --truncate=true, badger flatten, and badger stream; however, they all failed because we have maxlevels configured to 8 instead of the default 7, and there’s currently no option to override that in the badger CLI. We have maxlevels set to 8 because badger panics once it reaches 1.1TB.
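
For the record, these are roughly the invocations we tried against a copy of an Alpha’s p directory (paths are placeholders and exact flag spellings depend on the badger CLI version); each of them opens the store with the default MaxLevels of 7, which is why they fail on our 8-level store:

```
badger info    --dir ./p --read-only=false --truncate=true
badger flatten --dir ./p
badger stream  --dir ./p --out ./p_repaired
```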

Once we realized we couldn’t repair the data because of the badger errors, we decided we needed to take more drastic action to get replication working. We removed all nodes from Dgraph. We started up 3 completely clean Zeros (brand-new zw folders) with no knowledge of the previous cluster state. We started up Group #1 Alpha 1 (the one with the working copy of the data). We then started up an empty 2nd and 3rd Alpha. Dgraph elected a leader from the 2nd and 3rd empty Alphas, which replicated the empty database to Alpha 1.

We tore it all down again, repeated the steps to get the Zeros up clean, copied Alpha 1’s data to Alpha 2’s p directory, and started up an empty Alpha 3. They elected a leader - this time the one with the data. Ratel showed data for the group, though all sizes were 0 bytes. We brought Groups 2, 3, and 4 online, and they came up with data. After waiting about 10 minutes for each group to come up, the tablets changed from 0 bytes to the appropriate gigabytes, except that roughly half the tablets in each group didn’t load. Those are now erroring with: While retrieving snapshot, error: cannot retrieve snapshot from peer: rpc error: code = Unknown desc = operation opSnapshot is already running. Retrying...
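
The copy step itself was just cloning the surviving p directory while the group’s Alphas were stopped (service names and host paths here are hypothetical):

```
# Scale the group's Alphas to 0 so the p directory isn't changing underneath us.
docker service scale dgraph_alpha1=0 dgraph_alpha2=0
rsync -a /data/alpha1/p/ /data/alpha2/p/   # byte-for-byte copy of the surviving data
rm -rf /data/alpha2/w                      # Alpha 2 comes up with fresh Raft state
docker service scale dgraph_alpha1=1 dgraph_alpha2=1
```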

At this point we’ve invested 3 days of 4 people’s effort into trying to recover the data, and there isn’t a clear known path to recovery. I do think that when we called removeNode on the erroring nodes in Group #1, we went down a path we were unable to revert.

We’re shifting gears to rebuilding the database from scratch.

From this experience, we do have a couple recommendations…

  1. Update badger to work with maxlevels of 8 (vs. default of 7) so info, stream, and flatten work.
  2. removeNode needs to be easily revertible in some cases, maybe via an addNode.
  3. All of this could have been avoided if we could have just declared a leader in Group #1, or manually manipulated the state back to the “correct” version instead of letting Dgraph try to figure it out.

Thanks for all the support on this.