Auto-failover Groups – GracePeriodWithDataLossHours

March 31, 2021

The auto-failover groups feature for Azure SQL Database can be configured with an automatic failover policy. Azure triggers failover after the failure is detected and the grace period has expired. The grace period is determined by a setting called ‘GracePeriodWithDataLossHours’, which cannot be set to less than one hour. Why is a shorter period not allowed? Can your business tolerate the application being down for that long? Should you turn off automatic failover and set the policy to manual?

I noticed a lot of confusion around this setting, including my own. Some of the confusion is due to a lack of clarity in the documentation. I checked with the Microsoft Azure SQL team, and they are actively working on clarifying some of the questions I raised.

I want to thank Dimitri Furman and Roberto Bustos from the Azure SQL team for the clarifications that I share here.

Why is it not allowed to set a time which is less than an hour?

“Because verification of the scale of the outage and how quickly it can be mitigated involves human actions by the operations team, the grace period cannot be set below one hour. This limitation applies to all databases in the failover group regardless of their data synchronization state.”
https://docs.microsoft.com/en-us/azure/azure-sql/database/auto-failover-group-overview?tabs=azure-powershell

Once Microsoft detects an issue with the primary, it tries to resolve it within the grace period; if it cannot, it initiates automatic failover with possible data loss. At exactly what point will the automatic failover be initiated?

If Microsoft is not certain that there is no data loss for a database, it will wait for at least GracePeriodWithDataLossHours before failing over. Note that Microsoft does not guarantee that the failover will happen exactly when the grace period expires in all cases. Some parts of this process are done manually as part of major incident mitigation, so you can only assume that failover with possible data loss will happen no sooner than GracePeriodWithDataLossHours.
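As a minimal sketch of the "no sooner than" rule above, the earliest possible time for a failover with data loss can be computed from the time the incident was declared. The function name and the sample timestamp are hypothetical, chosen only for illustration; the result is a lower bound, not a prediction of when failover will actually happen.

```python
from datetime import datetime, timedelta, timezone

# The post states that the setting cannot be set under one hour.
MIN_GRACE_PERIOD_HOURS = 1


def earliest_data_loss_failover(incident_declared_at: datetime,
                                grace_period_hours: int) -> datetime:
    """Return the earliest time a failover with possible data loss may occur.

    This is only a lower bound: the actual failover can happen later,
    because parts of the process are manual.
    """
    if grace_period_hours < MIN_GRACE_PERIOD_HOURS:
        raise ValueError("GracePeriodWithDataLossHours cannot be set under one hour")
    return incident_declared_at + timedelta(hours=grace_period_hours)


# Hypothetical incident declared at 09:30 UTC with a one-hour grace period.
declared = datetime(2021, 3, 31, 9, 30, tzinfo=timezone.utc)
print(earliest_data_loss_failover(declared, 1).isoformat())
# prints 2021-03-31T10:30:00+00:00
```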

If at any point during the grace period it becomes evident that the system cannot recover within that period, will the automatic failover be initiated earlier, or will it still wait for the full period?

Microsoft will still wait for the grace period. The only exception is if a determination is possible that there will be no data loss. In that case, it may fail over sooner.

Can customers always initiate a forced failover (assuming the primary is not available) instead of waiting for the grace period?

Correct. In fact, for customers who have relatively mature operations teams, handling geo-failover manually is usually a better option because they can assess the risk and tradeoffs in their specific context and minimize the impact that way. The automatic failover policy is for those customers who must (or prefer to) rely on the platform to make this decision.
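For teams that script their manual failover, the sketch below builds (but does not run) an Azure CLI command for a forced failover. It assumes the `az sql failover-group set-primary` command with its `--allow-data-loss` flag, issued against the secondary server; confirm the exact syntax with `az sql failover-group set-primary --help` before relying on it. The resource names are placeholders, and the helper function is hypothetical.

```python
import shlex
from typing import List


def forced_failover_command(resource_group: str,
                            secondary_server: str,
                            failover_group: str) -> List[str]:
    """Build the argument list for a forced failover to the secondary server.

    Assumption: the Azure CLI command and flags below match the installed
    CLI version. Nothing is executed here; the caller decides when to run it.
    """
    return [
        "az", "sql", "failover-group", "set-primary",
        "--resource-group", resource_group,
        "--server", secondary_server,   # the server that should become primary
        "--name", failover_group,
        "--allow-data-loss",            # proceed even if data loss is possible
    ]


# Placeholder names, for illustration only.
cmd = forced_failover_command("my-rg", "my-secondary-server", "my-failover-group")
print(shlex.join(cmd))
```

Keeping the command as an argument list makes it easy to review in a runbook and to pass to `subprocess.run` only after a human has made the call.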

Is there a way for the customers to know when exactly the grace period started?

If customers know when the grace period started, they can decide whether to let Microsoft keep trying to resolve the outage or to initiate a manual failover. The countdown is initiated when the incident is detected and declared. This is typically very close to when you start observing database unavailability that is not mitigated by built-in HA capabilities. Again, some manual actions are required given the large scale of incidents we are talking about, so we cannot provide an exact point in time.
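Since the exact start of the countdown is not published, one way to frame the wait-or-fail-over decision is to measure from the moment you first observed the outage. The sketch below is purely illustrative: the function, its return values, and the idea of a "tolerated downtime" threshold are assumptions, not part of any Azure API.

```python
from datetime import datetime, timedelta, timezone


def recommend_action(outage_observed_at: datetime,
                     now: datetime,
                     tolerated_downtime: timedelta,
                     grace_period_hours: int) -> str:
    """Suggest a next step based on how long the outage has lasted.

    outage_observed_at only approximates the start of the grace period,
    so the result is a hint for the operations team, not a guarantee.
    """
    elapsed = now - outage_observed_at
    if elapsed >= tolerated_downtime:
        # The business cannot wait any longer: fail over manually.
        return "manual_failover"
    if elapsed >= timedelta(hours=grace_period_hours):
        # The grace period has probably expired; automatic failover
        # may start, but its exact timing is not guaranteed.
        return "automatic_failover_possible"
    return "wait"


# Hypothetical outage first observed at 09:00 UTC.
observed = datetime(2021, 3, 31, 9, 0, tzinfo=timezone.utc)
print(recommend_action(observed, observed + timedelta(minutes=45),
                       timedelta(minutes=30), 1))
# prints manual_failover
```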

Other than ‘I cannot connect,’ what monitoring can I put in place to know that the grace period clock has started?

You will see a service outage notification in the Azure portal under Service Health.

Why does the database keep syncing even after I remove it from the failover group?

Failover groups are built on top of geo-replication. You can create a geo-replica of a database and then add that database to a failover group, which will inherit that geo-replication link. Since a geo-replication link could have existed before the database was added to the failover group, it is not automatically dropped when you remove the database from the failover group. On a related note, if you drop the secondary replica for a database in a failover group, connectivity to the ‘primary failover group endpoint’ will no longer work because the primary database is automatically removed from the failover group. Connectivity to the (previous) primary database using its server and database names will continue to work.

What happens when the previous primary is back online again?

There are no automatic fail-back capabilities. When service in the old primary region is restored after you have failed over to the secondary region (manually or automatically), you will need to initiate a manual failover to move the primary back to the old primary region.

Summary

Manual failover is a better option for customers who need certainty on failover timing. The automatic failover policy is only “automatic” in the sense that a customer doesn’t have to do anything. The implementation on Microsoft’s side includes manual steps to arrive at the optimal compromise between maximizing availability and minimizing data loss across multiple customer databases. With manual steps, there is always some uncertainty.

At the same time, one can argue for keeping the automatic failover policy: depending on the situation and the impact on the business, customers can still initiate a manual failover before Microsoft triggers the automatic one.