I have not tried downgrading the cluster yet. Previous discussions suggest the crash is related to a schema query, but I’m not sure how I would figure out which part of the schema is causing it.
Update: Restarting zero on the dgraph-0 node brings the cluster back up.
But it seems one of my applications is triggering the issue, so as soon as I run that app Dgraph goes down again. I’ll try to find the root issue.
Update 2: It turns out it crashes even if I simply use Ratel to load the schema.
All I did was update the Docker image from 1.0.3 to 1.0.5. I think I’m going to have to wipe all the data and rebuild Dgraph in order to keep working.
No, an export wasn’t possible because the cluster was in a crash loop.
Since then, as we’re getting closer to production use, I’ve implemented an automatic export and backup function.
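For anyone wanting something similar, here is a minimal Python sketch of triggering an export. It assumes the 1.0.x HTTP endpoint `/admin/export` on the server's HTTP port (8080 by default), which is how I recall the docs describing it, so check it against your version. The names `export_url`, `trigger_export` and the injectable `fetch` parameter are made up for illustration; this is not the actual backup function mentioned above, and copying the exported files off the node is left out.

```python
import urllib.request


def export_url(host="localhost", port=8080):
    """Build the export URL (assumed endpoint: /admin/export on the HTTP port)."""
    return f"http://{host}:{port}/admin/export"


def trigger_export(host="localhost", port=8080, fetch=None, timeout=60):
    """Ask a Dgraph server to start an export and return the response body.

    `fetch` is a hypothetical hook taking a URL and returning (status, body),
    so the function can be exercised without a running cluster.
    """
    if fetch is None:
        def fetch(url):
            # Default: plain HTTP GET using the standard library.
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status, resp.read().decode("utf-8", "replace")

    status, body = fetch(export_url(host, port))
    if status != 200:
        raise RuntimeError(f"export request failed with HTTP {status}: {body}")
    return body
```

Run on a schedule (a cron job, for example), something like `trigger_export("dgraph-0")` would start the export; the host name here is only an example.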
Since I completely wiped and reinstalled the cluster, things have been working fine. I have not changed any of the code or the schema.
So I’m going to assume that something went really wrong when switching the docker image from 1.0.3 straight to 1.0.5. Perhaps I should have gone to 1.0.4 first.
I think for upgrades you could plan a better way out. Say you have a stack that needs to be “abandoned.” You start a new Dgraph stack on the latest version and connect it to the existing cluster, wait for the new stack to finish syncing, and then kill the old version.
This avoids a number of problems. This is just my own idea; I have not yet checked whether Dgraph recommends it. But it seems like a good way out, and a plausible one, since Dgraph can work like that and was built for it.
Currently I’m using a 3-node Kubernetes cluster. If I update the Docker image version, it first updates node 3; once that node has rebooted, it waits a couple of minutes and then does node 2, and so on. This way there is no downtime, and it doesn’t require additional resources to migrate to a new cluster.
But I see what you’re saying; perhaps it could be done automatically using a Helm chart. It would require a bit of testing.
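To make the rolling update described above concrete, here is a rough Python sketch of the ordering: highest-numbered node first, wait until it is healthy, pause, then move to the next one down. It only simulates what Kubernetes does for me; `restart` and `is_healthy` are hypothetical callables you would have to supply, and the ordinals are zero-based (so a 3-node cluster is 2, 1, 0).

```python
import time


def rolling_update_order(replicas):
    """Return node ordinals from highest to lowest, e.g. 3 replicas -> [2, 1, 0]."""
    return list(range(replicas - 1, -1, -1))


def rolling_update(replicas, restart, is_healthy,
                   settle_seconds=0, poll_seconds=1, max_polls=60):
    """Restart nodes one at a time, waiting for each to be healthy.

    `restart(ordinal)` and `is_healthy(ordinal)` are hypothetical hooks.
    Stops with an error if a node never becomes healthy, so the remaining
    nodes keep running the old version.
    """
    updated = []
    for ordinal in rolling_update_order(replicas):
        restart(ordinal)
        for _ in range(max_polls):
            if is_healthy(ordinal):
                break
            time.sleep(poll_seconds)
        else:
            raise RuntimeError(f"node {ordinal} did not become healthy")
        # Give the node some time to settle before touching the next one.
        time.sleep(settle_seconds)
        updated.append(ordinal)
    return updated
```

The useful property is the early stop: if the first updated node goes into a crash loop, the other nodes are never restarted.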