When a Front End Server (Skype for Business) Goes Down
In a Skype for Business enterprise pool, each user's data is kept on as many as three front end servers (replicas). This is why Microsoft recommends having at least three front end servers in a pool. With the distributed model for Front End pools, a certain number of a pool’s servers must be running for the pool to function.
The following table shows the minimum number of servers that must be running in a pool.
|Front Ends in a Pool||Must be running|
To begin with, a pool with fewer than three front end servers is not recommended. You may run into the “service not starting” issue every now and then, so I would avoid it at all costs.
What happens when you have three or more front end servers in the pool? Should we still expect the “service not starting” issue, with the following error?
- Pool Manager failed to connect to Fabric Pool Manager.
- Cause: This could happen because insufficient number of Front-Ends are currently active in the Pool.
- Resolution: Ensure that 85% of the Front-Ends configured for this Pool are up and running. For 2 or 3 machine Pools, initial cold-start of the Pool requires all machines to be started. If multiple Front-Ends have been recently decommissioned, run Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery to enable the Pool to recover from Quorum Loss and make progress.
The answer is: it depends! If you restart the front end service on one of the three front end servers (replicas), you may encounter the error above. Personally, I have found it frustrating to clear this error just to start or restart the front end service.
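Before resetting anything, it can help to confirm how many front end servers are actually running the service. The following is a minimal sketch, not a definitive procedure: it assumes the Skype for Business Server Management Shell is loaded, and `pool01.contoso.com` is a hypothetical pool FQDN that you would replace with your own.

```powershell
# Hypothetical pool FQDN (an assumption for illustration); replace with your own.
$poolFqdn = 'pool01.contoso.com'

# Show the state of the Front End service (RTCSRV) on every server in the pool.
foreach ($server in (Get-CsPool -Identity $poolFqdn).Computers) {
    Get-CsWindowsService -ComputerName $server -Name RTCSRV |
        Select-Object @{ Name = 'Server'; Expression = { $server } }, Name, Status
}

# Show how Windows Fabric currently sees the pool's replicas.
Get-CsPoolFabricState -PoolFqdn $poolFqdn
```

If too few servers report the service as running, the quorum condition described in the error is the likely cause.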
You can try the following steps to resolve this error.
Launch the Skype for Business Server Management Shell and run the following commands:
Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery -PoolFQDN Pool_Name
Reset-CsPoolRegistrarState -ResetType FullReset -PoolFQDN Pool_Name
Reset-CsPoolRegistrarState -ResetType ServiceReset -PoolFQDN Pool_Name
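After a reset, the front end service still has to be started and checked. A rough sketch of that follow-up, run locally on the affected front end server, might look like this. It assumes the management shell is loaded and uses a hypothetical pool FQDN; the choice of `ServiceReset` here is only for illustration, since as far as I understand it is the least disruptive of the three reset types.

```powershell
# Hypothetical pool FQDN (an assumption for illustration); replace with your own.
$poolFqdn = 'pool01.contoso.com'

# Run one of the resets listed above (the cmdlet asks for confirmation).
Reset-CsPoolRegistrarState -PoolFqdn $poolFqdn -ResetType ServiceReset

# Start the Front End service on this server and check its status.
Start-CsWindowsService -Name RTCSRV
Get-CsWindowsService -Name RTCSRV

# Confirm that Windows Fabric sees the pool's replicas again.
Get-CsPoolFabricState -PoolFqdn $poolFqdn
```

If the service still does not start, move on to the next reset type and repeat the check.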
In most cases, running the above commands resolved the issue. However, there have been occasions when the issue was resolved only by creating the following registry key on the front end server.
Key Name: ClientAuthTrustMode
Typically, user registrations are distributed across the front end servers in the pool, so only the users registered on the failed server at that moment are impacted. Personally, I have seen these clients take anywhere between 2 and 10 minutes to reconnect to another server in the pool. Until they reconnect to another front end server, these clients show the following warning message.