Day: August 22, 2016

Skype Front End not starting on dual-homed VM

TL;DR: When dual-homing, make sure both NICs report the same link speed (both 1Gb or both 10Gb). VMware’s E1000E reports 1Gb and VMXNET3 reports 10Gb. Windows automatic metrics will prefer the 10Gb NIC, and that may cause the issue below.

After a power loss over the weekend, two Skype for Business Front-Ends were restarted and the RTCSRV service failed to start. A bit about these machines that’s relevant to the issue:

  • Running Windows Server 2012 R2 as VMware ESXi 5.5 guests.
  • Collocated Mediation service.
  • Dual-homed, with Data network as default gateway, and Voice network to talk to a Sonus SBC, with service usage limited to the specified addresses for Primary and PSTN in the topology.

Certificate stores were in good order, so KB2795828 did not apply.

Event IDs seen in the log were LS User Services 32178:

Failed to sync data for Routing group {0FCDD1FD-39AF-502A-AECA-E702A5E8FC55} from backup store.
Cause: This may indicate a problem with connectivity to backup database or some unknown product issue.
Resolution:
Ensure that connectivity to backup database is proper. If the error persists, please contact product support with server traces.

LS User Services 30988:

Sending HTTP request failed. Server functionality will be affected if messages are failing consistently.

Sending the message to https://FE1.domain.org:444/LiveServer/Replication failed. IP Address is IPOFVOICENIC. Error code is 0x2EFD. Content-Type is application/replication+xml. Http Error Code is 0x0.
Cause: Network connectivity issues or an incorrectly configured certificate on the destination server. Check the eventlog description for more information.
Resolution:
Check the destination server to see that it is listening on the same URI and it has certificate configured for MTLS. Other reasons might be network connectivity issues between the two servers.

and LS User Services 32174:

Server startup is being delayed because fabric pool manager has not finished initial placement of users.

Currently waiting for routing group: {EF5151C7-B5E1-53B8-9F61-0CC90C82B9F6}.
Number of groups potentially not yet placed: 9.
Total number of groups: 9.

[…]

The issue ended up being that the two NICs had different adapter types in VMware. The primary NIC was set to E1000E, which reports a 1Gb/s link, and the Voice NIC, which was added after the server was deployed, was set to VMXNET 3, which reports 10Gb/s regardless of the uplink bandwidth from the host.

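It turns out the Windows Automatic Metric feature was skewing interface preference here: it assigned the 10Gb/s Voice NIC a lower automatic metric than the 1Gb/s primary NIC, so Windows preferred the Voice NIC for outbound traffic, which matches the Voice NIC address showing up in the 30988 event above.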
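To make the metric behaviour concrete, here is a rough Python sketch of how the automatic metric falls out of the reported link speed. The thresholds are the pre-Windows 10 table as I recall it from Microsoft's automatic metric documentation, so treat them as an assumption and check them against your OS version. The NIC names are just labels for this example, and the comparison ignores route metrics and prefix matching, which also play a part in real route selection.

```python
# Sketch only: thresholds are assumed from Microsoft's automatic metric
# documentation for pre-Windows 10 systems; verify them for your OS.

def automatic_metric(link_speed_bps):
    """Return the interface metric Windows would assign for a link speed."""
    if link_speed_bps >= 2_000_000_000:   # 2 Gb/s and up
        return 5
    if link_speed_bps > 200_000_000:      # above 200 Mb/s
        return 10
    if link_speed_bps > 20_000_000:       # above 20 Mb/s
        return 20
    if link_speed_bps > 4_000_000:        # above 4 Mb/s
        return 30
    if link_speed_bps > 500_000:          # above 500 kb/s
        return 40
    return 50


def preferred_interface(nics):
    """Pick the NIC with the lowest automatic metric (lowest wins)."""
    return min(nics, key=lambda name: automatic_metric(nics[name]))


# Hypothetical labels; speeds are what each adapter type reports to the guest.
nics = {
    "Primary (E1000E)": 1_000_000_000,
    "Voice (VMXNET3)": 10_000_000_000,
}

for name, speed in nics.items():
    print(name, "metric", automatic_metric(speed))
print("Preferred:", preferred_interface(nics))
```

With these numbers the E1000E NIC gets a metric of 10 and the VMXNET3 NIC gets 5, so the Voice NIC wins even though the default gateway sits on the other network.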

Manually setting a lower metric for the Primary NIC and rebooting the server resolved the issue.
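If you would rather script the change than click through the adapter's advanced TCP/IP settings, one option is the `Set-NetIPInterface` PowerShell cmdlet, which is available on Server 2012 R2. The small Python helper below only builds the command line so you can review it before running it in an elevated PowerShell session. The interface alias `Primary` and the metric value `5` are placeholders for this example, not the values used on these servers.

```python
# Sketch only: builds (does not run) a PowerShell command that pins an
# interface metric. The alias and metric passed in below are hypothetical.

def build_metric_command(interface_alias, metric):
    """Return a Set-NetIPInterface command that sets a manual metric."""
    if metric < 1:
        raise ValueError("metric must be a positive integer")
    # Escape single quotes for a PowerShell single-quoted string.
    alias = interface_alias.replace("'", "''")
    return (
        f"Set-NetIPInterface -InterfaceAlias '{alias}' "
        f"-AutomaticMetric Disabled -InterfaceMetric {metric}"
    )


print(build_metric_command("Primary", 5))
```

Whatever value you pick, it has to be lower than the metric on the Voice NIC, and you can confirm the result afterwards with `Get-NetIPInterface`.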