I'm working at a tech startup with around 60 technical people plus a few dozen more on the business side.
Our Scrum sprints are 2 weeks long, and a major release happens once every 3 sprints; that's one major release every 6 weeks. A major release comprises more than 12 services.
We wanted zero-downtime deployment, and we thought Kubernetes was what we needed. I was brought into the company to "Kubernize" the system with that aim.
After working on it for a year and a half, I've realized that zero-downtime deployment does not require K8s. K8s is nothing but a tool; what we need is the right mindset, and then principles.
I've realized we are not mature enough in the way we deliver software to be able to ship services with zero downtime. Let me explain:
- We have 4 small Scrum teams, each of which has BAs, QAs, Devs, and Operations.
- At the beginning of a sprint, BAs discuss stories with QAs, Devs, and Operations, and estimate the necessary effort.
- QAs then work with Devs to produce a MANUAL acceptance test suite
- Devs begin implementation using a Git branching model. They make no effort to make the code "zero-downtime deployable" and are not aware of which techniques exist, on the Dev side, to deliver software without downtime.
- Devs (should) make sure their code passes the acceptance test suite provided by QA.
- QAs run the same acceptance test suite on the test env
- QAs perform MANUAL regression tests on the staging env
- Operations plan the release
- Operations handle the release
The problems I have found so far with this working model are:
- Releases are very stressful for the operations team.
- No amount of planning is enough for a major release (12 services being deployed at the same time).
- The operations team alone cannot achieve zero-downtime deployment: what about the DB? What about long-running sessions? What about… …
I am now proposing an organizational change in how we deliver software: I want my company to shift from infrequent releases of big changes to frequent releases of small changes.
And those small changes will be released in a zero-downtime fashion.
My idea is this:
1. A typical feature involves changes to the DB, backend-1, backend-2, and the GUI.
2. Instead of shipping these changes to production in one go, ship them gradually: say, today change the DB schema (zero downtime), next week backend-1 (also zero downtime)… Of course, end users won't be able to use the feature until all the components are available.
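The "change the DB schema with zero downtime" step usually follows what's known as the expand/contract (parallel change) pattern. Here is a minimal sketch of the phases, assuming a hypothetical `users` table where `name` is being replaced by `full_name`; each phase is a separate small release:

```python
# Expand/contract (parallel change) sketch for a zero-downtime schema change.
# Each phase ships as its own release; table and column names are hypothetical.

PHASES = [
    # 1. EXPAND: add the new column; existing readers/writers are unaffected
    "ALTER TABLE users ADD COLUMN full_name TEXT",
    # 2. DUAL-WRITE: deploy app code that writes both `name` and `full_name`
    "deploy app vN+1 (writes old + new column)",
    # 3. BACKFILL: copy historical data, ideally in batches to avoid long locks
    "UPDATE users SET full_name = name WHERE full_name IS NULL",
    # 4. SWITCH READS: deploy app code that reads only `full_name`
    "deploy app vN+2 (reads new column only)",
    # 5. CONTRACT: drop the old column once nothing references it
    "ALTER TABLE users DROP COLUMN name",
]

for step, phase in enumerate(PHASES, start=1):
    print(f"release {step}: {phase}")
```

At no point does a deployed app version depend on a schema that doesn't exist yet, which is exactly what makes each step independently releasable.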
In simple terms, I'm suggesting we break a feature into many small releasable parts and ship them to production gradually. The full feature becomes available to end users only after all the releasable parts are shipped.
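The usual way to keep a partially shipped feature invisible to end users is a feature flag. A minimal sketch, with a hypothetical flag name and an in-memory flag store (real systems read flags from config, a DB, or a service such as LaunchDarkly or Unleash):

```python
# Feature-flag sketch: unfinished code can live in production without being
# reachable by users. The flag name and flow functions are hypothetical.

FLAGS = {
    # flip to True once DB, backend-1, backend-2, and GUI are all shipped
    "new_checkout_flow": False,
}

def is_enabled(flag: str) -> bool:
    """Return whether a feature flag is on (unknown flags default to off)."""
    return FLAGS.get(flag, False)

def legacy_checkout(cart):
    return {"flow": "legacy", "items": len(cart)}

def new_checkout(cart):
    return {"flow": "new", "items": len(cart)}

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)   # partially shipped code path
    return legacy_checkout(cart)    # current behaviour, untouched

print(checkout(["book", "pen"]))  # → {'flow': 'legacy', 'items': 2} while the flag is off
```

The release of code and the release of the feature become two separate decisions: shipping a part is safe because the flag keeps it dark until every part is in place.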
After interviewing Devs, QAs and BAs, the one question I'm struggling to answer is:
Normally, we build the whole feature, do manual acceptance testing on the test environment and manual regression on the staging environment, and then ship it to production. If a feature is broken into small releasable parts that are shipped gradually, how do Devs and QAs guarantee the quality of each of those small parts when the full flow has not been implemented yet?
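One answer I'm considering: test each small part against the interface the teams agreed on, with a stub standing in for the parts that don't exist yet (contract-style testing). A sketch, assuming hypothetical names throughout; the point is the technique, not the specific API:

```python
# Testing a small releasable part before the full flow exists: backend-1's
# logic is exercised against a stub of backend-2 that honours the agreed
# interface (the "contract"). All names here are hypothetical.
from unittest.mock import Mock

def reserve_stock(order, inventory_service):
    """backend-1 logic under test: calls backend-2 through the agreed interface."""
    if not order["items"]:
        return {"status": "rejected", "reason": "empty order"}
    reservation = inventory_service.reserve(order["items"])
    return {"status": "reserved", "reservation_id": reservation["id"]}

# backend-2 is not shipped yet, so a stub stands in for the agreed contract
inventory = Mock()
inventory.reserve.return_value = {"id": "r-42"}

result = reserve_stock({"items": ["sku-1"]}, inventory)
inventory.reserve.assert_called_once_with(["sku-1"])  # contract was honoured
print(result)  # → {'status': 'reserved', 'reservation_id': 'r-42'}
```

Each part is then verified against the contract at build time, and the manual end-to-end acceptance pass happens once, after the last part lands, just before the flag is flipped.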
How do I persuade them to get on board with me? Both technical and non-technical answers are welcome.