Hooking up Twilio SIP to Skype for Business

If you’ve never heard of Twilio before, you might be surprised to learn that they are the largest backend for automated calling and text messaging (including verification) services, and that they are pioneering Software Defined Telephony, using APIs to route and handle texts, calls and faxes. Is your Uber driver calling you now? That’s Twilio… Got a text from Netflix for a password reset? Yup, that’s Twilio… PagerDuty sending you an SMS alert? You guessed it!

There are MANY things you can build with Twilio, but you can also use its simple services to set up PSTN origination/termination with your Skype for Business infrastructure. Why?

  • Fast and easy provisioning of trunks and numbers. No contracts, and you pay for what you use. Buy a number and it’s ready to use in less than a minute!
  • Crazy scalable. If large companies rely on Twilio for their backend integrations, why wouldn’t you?
  • Support for SIPS and SRTP, which means encrypted, secure calls over that internet trunk
  • Record calls and pull them from the Twilio portal or over API. Need recording for certain Response Groups? Done.
  • Failover mechanisms that can use preference/weight to balance call targets, or set up a script that can at least give callers a notice that your phones are down, or route them somewhere else, automatically
  • Use add-ons to do fancy things like transcribe calls with speaker recognition, translate calls then play them using text-to-speech, or even cleanse call recordings of sensitive PCI data. Yes. I know. It’s crazy.
  • Entire platform is built on top of AWS and globally scaled, so you know it’s good…

In this post I’ll guide you on setting up a trunk over the internet between a SfB infrastructure and Twilio’s Elastic SIP Trunking service, and you can start using it as a failover, aggregate, or maybe a conferencing bridge number so you’re not limited by your PRIs.

First, Sign up for Twilio

Obvious step here: go to twilio.com and sign up for an account.

Load it with some of your cash, which you can source from a Credit Card or over PayPal. You could use Twilio’s free test features, but if you want to call real numbers you need to have some money loaded there.

Create and configure a Twilio Elastic SIP Trunk

Now that you have an account with some moolah, it’s time to make that SIP trunk. Go on Elastic SIP Trunking (if you don’t see it, hit the “…” button) then Create new SIP Trunk

Give it a friendly name.

Then go on Termination and enter a Termination SIP URI. You’ll use this when creating the PSTN Gateway in Skype or Lync (or your favorite SBC). Don’t worry about call recording or encryption yet, you can play with that stuff later.

Then under Authentication, create an ACL and add the IP addresses of your Mediation servers. If you’re NATted outbound, use the public NAT address instead, but in order to receive calls you’ll need inbound NAT or a public IP assigned.

Now save your Trunk by hitting Save at the bottom.

Next we will configure origination, so we can receive PSTN calls over the SIP trunk.

Configure an Origination URI in the format sip:IPADDR;transport=tcp. The ;transport=tcp parameter forces Twilio’s edge to use TCP instead of the default UDP transport, still over port 5060. If you want a different port, just use sip:IPADDR:PORT;transport=tcp. This is very similar, if not identical, to how Flowroute’s inbound routes work. If you’ve got multiple servers, you can play with priorities and weights… but that’s out of scope for now.
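If you have a handful of Mediation servers, it can help to generate these URIs instead of typing them. Here is a minimal Python sketch of the format described above. The function name and the sample address are made up for illustration, and the list of transports is just the ones this post mentions.

```python
def origination_uri(host, port=None, transport="tcp"):
    """Build an Origination URI like sip:IPADDR:PORT;transport=tcp.

    Hypothetical helper: 'host' is the public IP or FQDN of your
    Mediation server, 'port' is left out when the default is fine.
    """
    if transport not in ("udp", "tcp", "tls"):
        raise ValueError(f"unsupported transport: {transport}")
    target = host if port is None else f"{host}:{port}"
    return f"sip:{target};transport={transport}"


# Sample address only, replace it with your own public IP.
print(origination_uri("203.0.113.10"))
print(origination_uri("203.0.113.10", 5068))
```

The first call leaves the port out, the second one shows the sip:IPADDR:PORT variant.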

Next, you can assign numbers for your DIDs. If you don’t, you can still make outbound calls and mask caller ID as anything you want, but for receiving calls you need a PSTN number…

You can probably figure out the rest, but the basics are done. Let’s move to SfB now.

Configure Lync / Skype for Business Trunk

Remember that Termination SIP URI? We need it now. So we start by creating a new PSTN Gateway in the topology and using it as the FQDN.

Then we use 5060 as the port and TCP as the protocol. If we were using TLS, we would change the port to 5061.

Publish the topology and then start using your new root trunk in your voice routes for outbound calls. That’s pretty much it! (assuming you’ve got the networking done right, either NAT or Public with dual-home).

To secure your edge you can use Twilio’s public IP list to make sure you’re not getting unauthorized SIP requests. You can get that here: https://www.twilio.com/console/sip-trunking/your-network
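The allow-list idea is simple enough to sketch in Python. The ranges below are placeholders taken from the RFC 5737 documentation blocks, NOT Twilio’s real addresses, so copy the actual list from the console page above before using anything like this. The function name is also just an assumption for the example.

```python
import ipaddress

# Placeholder CIDRs only (documentation ranges), not Twilio's real list.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_allowed_source(ip, networks=ALLOWED_NETWORKS):
    """Return True when the source IP falls inside one of the allowed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


for source in ("192.0.2.15", "198.51.100.200", "203.0.113.5"):
    verdict = "allow" if is_allowed_source(source) else "drop"
    print(source, verdict)
```

In practice you would put the same ranges in your firewall rules. The sketch only shows the matching logic.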

Test and smile

Since we’re not encrypting SIP traffic, and it’s flowing over 5060, we can fire up Wireshark and start looking at dialogs. Even with no calls flowing, we should be seeing OPTIONS requests roughly every 60 seconds sourcing from the Skype servers that have the trunk attached.

Making an inbound call we see the INVITE sourcing from Twilio.

And making an outbound call we see the call outgoing:

Also, Twilio has built-in pcaps that you can use to troubleshoot the remote-end of the trunk. Think of this as having Wireshark running on Twilio’s edge. VERY COOL!

Important note for NAT and non-SIP-aware edge

If you’re using a Mediation server, either dedicated or collocated, and use RFC1918 private IPs on your inside network, you have to do NAT to translate a public address to the inside IP and get calls flowing.

The issue this introduces is it’s not a supported configuration with Skype because the Contact header (and many others) will have the server’s internal IP, when it really should have the external, public IP. That’s why when doing Direct SIP with certified providers, you need to use the Edge server with a Public IP.

Some providers like IntelePeer will happily mangle SIP headers to make sure they have your external IP in there, and everything is well. In my case, using a NAT address kills some functionality, specifically:

  • Some calls tend to hang up after 30 seconds
  • Calls can’t be put on hold for longer than 30 seconds
  • When hanging up the call on the far end, the BYE message coming from Twilio is sent to the Contact IP, which is the server’s internal address, so you never get it… and the call hangs up after 30 seconds anyway
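A quick way to tell whether you are in this situation is to look at the Contact header of an outbound INVITE in your capture. The Python sketch below does that check on a raw SIP message. The sample message is made up, the helper names are hypothetical, and the parsing is deliberately naive (no compact "m:" header form, no IPv6).

```python
import ipaddress
import re

# The three RFC1918 private ranges.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def contact_host(sip_message):
    """Return the host part of the first Contact header, or None."""
    for line in sip_message.splitlines():
        if line.lower().startswith("contact:"):
            match = re.search(r"sips?:(?:[^@>;\s]+@)?([^:;>\s]+)", line)
            return match.group(1) if match else None
    return None


def contact_is_rfc1918(sip_message):
    """True when the Contact header carries a private IPv4 address."""
    host = contact_host(sip_message)
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not an IP literal
    return any(addr in net for net in RFC1918)


# Made-up example message, trimmed to the headers that matter here.
SAMPLE_INVITE = (
    "INVITE sip:+15551230000@trunk.example.com SIP/2.0\r\n"
    "Contact: <sip:10.1.2.3:5068;transport=Tcp>\r\n"
    "\r\n"
)

print(contact_host(SAMPLE_INVITE), contact_is_rfc1918(SAMPLE_INVITE))
```

If this reports a private address, in-dialog requests from the far end (like that BYE) will be aimed at an IP that isn’t reachable from the internet.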

Using a Session Border Controller to trunk to Twilio is one answer. Using a SIP-aware firewall or edge device is another, but most can’t do SIP over TCP, and definitely not SIP over TLS… so what then?

There’s a bit of a hack, and it involves setting EnableSessionTimer to $True, RTCPActiveCalls to $False and RTCPCallsOnHold to $False on the trunk configuration (these are parameters of Set-CsTrunkConfiguration).

Not ideal, but it gets the job done. The SessionTimer will check every 30 seconds for an active RTP session, regardless of whether RTCP “control” packets were received or not. Without the hack, calls hang up after 30 seconds because no RTCP arrives from Twilio, since it is sent to the Contact IP.

This hack is best reserved for when it is a MUST and no other solution is viable. My recommendation would be to use a Public IP with proper edge security (limiting to Twilio’s service addresses), or to use an SBC or B2BUA.

Hope you enjoyed this post!!! Please leave a comment!!!

LS Data MCU events 41025 and 41026 starting in May-June?

If running Lync Server 2010, Lync Server 2013 or Skype for Business Server 2015, and started noticing these in your Lync Server event log sometime between May and June of 2017, while at the same time people can’t share PowerPoints or do Whiteboards / Q&A…

Then you’ve fallen victim to a known issue with the May 2017 .NET Framework Security and Quality Rollup affecting the Web Conferencing Service.

Luckily, Microsoft detailed two workarounds: one involves getting a new Edge Internal cert, and the other involves a registry change. Here are the details:

https://support.microsoft.com/en-us/help/4023993/ls-data-mcu-events-41025-and-41026-are-constantly-generated-after-you-

Skype Front End not starting on dual-homed VM

TL;DR: When dual-homing, make sure both your NICs have the same link speed (1Gb, 10Gb). VMware’s E1000E is 1Gb, and VMXNET3 is 10Gb. Automatic metrics will prefer the 10Gb and that may cause the issue below….

After a power loss over the weekend, two Skype for Business Front-Ends were restarted and the RTCSRV service failed to start. A bit about these machines that’s relevant to the issue:

  • Running Windows Server 2012 R2 as VMware ESXi 5.5 guests.
  • Collocated Mediation service.
  • Dual-homed, with Data network as default gateway, and Voice network to talk to a Sonus SBC, with service usage limited to the specified addresses for Primary and PSTN in the topology.

Certificate stores were in good order, so KB2795828 did not apply.

Event IDs seen in the log were LS User Services 32178:

Failed to sync data for Routing group {0FCDD1FD-39AF-502A-AECA-E702A5E8FC55} from backup store.
Cause: This may indicate a problem with connectivity to backup database or some unknown product issue.
Resolution:
Ensure that connectivity to backup database is proper. If the error persists, please contact product support with server traces.

LS User Services 30988:

Sending HTTP request failed. Server functionality will be affected if messages are failing consistently.

Sending the message to https://FE1.domain.org:444/LiveServer/Replication failed. IP Address is IPOFVOICENIC. Error code is 0x2EFD. Content-Type is application/replication+xml. Http Error Code is 0x0.
Cause: Network connectivity issues or an incorrectly configured certificate on the destination server. Check the eventlog description for more information.
Resolution:
Check the destination server to see that it is listening on the same URI and it has certificate configured for MTLS. Other reasons might be network connectivity issues between the two servers.

and LS User Services 32174:

Server startup is being delayed because fabric pool manager has not finished initial placement of users.

Currently waiting for routing group: {EF5151C7-B5E1-53B8-9F61-0CC90C82B9F6}.
Number of groups potentially not yet placed: 9.
Total number of groups: 9.

[…]

The issue ended up being different Adapter Types in VMware for the two NICs. The primary NIC was set to E1000E, so 1Gb/s max, and the Voice NIC, which was added after the server was deployed, was set to VMXNET 3, which runs at 10Gb/s regardless of uplink bandwidth from the host.

Turns out the Windows Automatic Metric was messing up interface preference here because it was assigning the 10Gb/s NIC a lower automatic metric, so the Voice NIC was preferred over the primary one.

Manually setting a lower metric for the Primary NIC and rebooting the server resolved the issue.
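To see why the faster NIC wins, here is a small Python sketch of the automatic metric logic. The threshold table is an assumption based on the table Microsoft has documented for the Automatic Metric feature on Windows versions of that era, so verify it against your own Windows build before relying on it. The NIC labels are just for illustration.

```python
def automatic_metric(link_bps):
    """Map a link speed in bits per second to an interface metric.

    Assumed thresholds, lower metric means the interface is preferred.
    """
    if link_bps >= 2_000_000_000:
        return 5
    if link_bps > 200_000_000:
        return 10
    if link_bps > 20_000_000:
        return 20
    if link_bps > 4_000_000:
        return 30
    if link_bps > 500_000:
        return 40
    return 50


# Illustrative labels for the two NICs in this story.
nics = {
    "Data (E1000E, 1Gb/s)": 1_000_000_000,
    "Voice (VMXNET3, 10Gb/s)": 10_000_000_000,
}

metrics = {name: automatic_metric(speed) for name, speed in nics.items()}
preferred = min(metrics, key=metrics.get)
print(metrics)
print("Preferred interface:", preferred)
```

With both NICs on the same adapter type the metrics tie, which is the point of the TL;DR above. Setting the metric by hand on the primary NIC gets you the same result without swapping adapters.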

Is your Lync/SfB starved for memory?

Let’s say it was totally underprovisioned at some point. Just bumping up the RAM won’t give you the performance you expect. Here’s why:

You deploy a Server 2012 R2 template with 2GB RAM using your favorite hypervisor and just roll with the Skype for Business or Lync deployment without even thinking about it. Or… let’s say you ask for a VM to be provisioned so you can roll out SfB, and it’s underprovisioned from the start, but changing the resources would take too long so you go ahead with the deployment anyway and just wait for resources to be added later on. No time wasted. How many times has that happened? Plenty to me…

Down the road, whether it’s a reactive need for more memory, or you just realized the VMs were completely underprovisioned and nowhere near the 32GB of RAM Microsoft really asks for… What then? Just bump up the RAM, right?

Not quite…

Do that, and your SQL instances RTCLOCAL and LYNCLOCAL will just daydream about those sweet 32GB you allocated… Let’s take a look at a VM with only 4GB on it:

[Screenshots: SQL Server memory settings for the two instances on the 4GB VM]

Pretty sad, right? At no point in time can the two SQL instances together consume more than 941MB. What if you add RAM, you say? The Minimum and Maximum Server Memory stay exactly the same!!!

If you want to go ahead and change these values yourself, you’re free to do so, but it’s not technically supported. Don’t care? Then pick values of 6%-8% of your total RAM for LYNCLOCAL, and 12%-15% for RTCLOCAL. Care? Then:

  1. Open the Deployment Wizard, Install or Update, and run Step 1 again. Go grab a coffee while RTCLOCAL gets pimped out with more RAM.
  2. Then run Step 2 again. If done with coffee, get a new one while LYNCLOCAL gets a memory makeover.

You can verify the added memory now. Big difference. But, because these instances actually run on SQL Express, they won’t be able to address more than 1GB each (or 1400MB depending on who you ask). The difference between a max 327MB and 1GB is quite substantial, so this change will still make a difference.
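If you do go the unsupported manual route, the percentages mentioned earlier are easy to turn into numbers. A minimal Python sketch, where the function name is made up and the output is just the 6%-8% and 12%-15% ranges applied to whatever total RAM you give it, rounded down to whole megabytes:

```python
def manual_max_memory_mb(total_ram_mb):
    """Return (low, high) Maximum Server Memory candidates in MB per instance."""
    return {
        "LYNCLOCAL": (total_ram_mb * 6 // 100, total_ram_mb * 8 // 100),
        "RTCLOCAL": (total_ram_mb * 12 // 100, total_ram_mb * 15 // 100),
    }


# Example: a VM with 32GB of RAM.
for instance, (low, high) in manual_max_memory_mb(32 * 1024).items():
    print(f"{instance}: {low}MB to {high}MB")
```

Keep the SQL Express limit in mind when reading the result: anything above roughly 1GB per instance won’t actually get used.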

[Screenshots: SQL Server memory settings for the two instances after rerunning the wizard steps]

7/28/2016 Edit: Looks like Tom Pacyk wrote a better post over two years ago, and also points out the SQL Express limit of 1GB per instance. http://www.confusedamused.com/notebook/lync-2013-sql-express-instance-memory

Skype for Business Server June 2016 CU

New features!

  • Video Based Screen Sharing (VBSS) in meetings, which enables much more efficient screen sharing with fluid motion (not the 2fps we’re used to in meetings).
  • Multiple Emergency Numbers in a location policy, useful for universities that may have their own local emergency number in addition to 911
  • Busy Options like Busy on Busy and Voicemail on Busy.

Get it here:

https://support.microsoft.com/en-us/kb/3061064

T.38 Fax over IP call on Wireshark

Ever wondered what a proper T.38 Fax over IP (FoIP) transmission looks like running through Wireshark? Maybe you’re troubleshooting a call flow, or have never seen a T.38 capture. Below I’ll try to explain the call flow and the steps to look out for when troubleshooting T.38 calls. Here’s an outbound FAX call originating from an FXS port in a Cisco CUBE, and going towards Flowroute.

  • Initial SIP INVITE and early media receipt (ringback). Note this is all RTP.
  • SDP from the INVITE shows media offered is all voice (RTP).
  • 183 Session Progress, and we start sending media too (again, RTP). Later on comes the 200 OK, meaning the call was answered on the remote end.
  • Things change now… an in-dialog re-INVITE goes from the Cisco CUBE to the SIP trunk. RTP and T.38 packets are mixed because the remote end has not accepted our re-INVITE yet, but we start sending media either way.
  • The SDP of the new INVITE now shows all T.38 media.
  • Once we get the 200 OK from Flowroute, it’s all T.38 media both ways.
  • Now the flow gets interesting, more Fax-ey. Wireshark will decode the HDLC data and show interesting bits here:
    • TSI is our Fax station number programmed in the machine.
    • DCS: our Fax machine communicates the capabilities and starts training.
    • If we look inside the packet’s data, our DCS has a lot more information about our Fax machine’s settings and resolution.
    • Then we get an FTT, which means “Failure to Train” on the remote end. It’s not usually a sign something is wrong, but more of a capability mismatch. The remote fax may accept only lower baud rates, and will fail to train any higher. This is normal unless it’s the only response we get back from the remote end.
    • We see the same process of TSI, DCS and FTT until we hit the right baud rate… in our case it’s 9600… Once we get that, we receive a CFR.
    • Followed by a short training to sync up and data (because we did long training before the CFR).
    • And our actual FAX data, which will vary.
    • At the end of the data, Wireshark reassembles the packets and tells us whether there was a loss or not. In our case, we’re good!
    • We send an EOP to signal the end of the transmission.
    • The remote end sends an MCF to acknowledge receipt (this is how your Fax machine knows the fax is “good” on the other end).
    • And then we send a DCN to logically hang up the HDLC stream, but we wait for the remote end…
    • Remote end hangs up the call… and we’re done…

And that was it. Many exchanges and training but in the end our page was sent over a SIP trunk, negotiating T.38, training with the remote fax machine at 9600 baud, and transmitting one page in about a minute.
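When you are staring at a capture, the quickest tell for the switch from voice to fax is the m= line in the SDP: audio over RTP/AVP for the original INVITE, image over udptl with the t38 format for the re-INVITE. Here is a minimal Python sketch that checks for it. The SDP bodies are trimmed, made-up samples (not taken from this call), and the helper names are hypothetical.

```python
def media_lines(sdp):
    """Return (media, port, proto, formats) for every m= line in an SDP body."""
    result = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            media, port, proto, *formats = line[2:].split()
            result.append((media, port, proto, formats))
    return result


def is_t38_offer(sdp):
    """True when the SDP offers T.38 (m=image ... udptl t38)."""
    return any(
        media == "image" and proto.lower() == "udptl" and "t38" in formats
        for media, _port, proto, formats in media_lines(sdp)
    )


# Made-up, trimmed SDP bodies for illustration.
VOICE_SDP = "v=0\r\nc=IN IP4 198.51.100.7\r\nm=audio 16384 RTP/AVP 0 101\r\n"
T38_SDP = "v=0\r\nc=IN IP4 198.51.100.7\r\nm=image 16386 udptl t38\r\n"

print(media_lines(VOICE_SDP), is_t38_offer(VOICE_SDP))
print(media_lines(T38_SDP), is_t38_offer(T38_SDP))
```

If the re-INVITE never shows an image/udptl line, or the far end answers it with an error, the call stays on RTP and you are looking at a pass-through fax rather than T.38.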

Remote Wireshark capture for Sophos UTM over SSH

Sophos UTM v9 comes with the tcpdump utility, which lets you run packet captures from the shell. This is great and all, but in order to look at those pcaps with Wireshark, you need to pipe to a file, copy the file, then run Wireshark against it. Annoying. All of it.

What if we could remotely capture packets over an SSH tunnel? YES… Turns out it’s a bit tricky if you’re on Windows, and so is the authentication piece: getting root access without having to log in as loginuser first. How? Keep reading…

First, the necessary ingredients:

  • Sophos UTM
  • Wireshark (or your favorite pcap application)
  • Putty suite (specifically Plink and PuttyGen)

To start, we’ll need to enable Shell Access, with public key authentication, and with Root access but only with SSH key.

[Screenshot: Shell Access settings in the UTM]

We need to use PuttyGen to generate the key pair we’ll use for root authentication, so open it, Generate the key, then copy the Public Key into the Authorized Keys for root in the UTM, apply and save… and also Save private key to somewhere you’ll remember. We’ll need this for Plink.

[Screenshot: generating the key pair in PuttyGen]

There’s our new key…

[Screenshot: the new key]

Then run the actual magic using Plink. Take the following command as an example:

plink -ssh root@firewall.domain.com -i C:\ssh-priv.ppk "tcpdump -s 0 -U -n -w - not port 22 and not host 192.168.0.1" | "C:\Program Files\Wireshark\Wireshark.exe" -k -i -

Replace the SSH connection string with your actual firewall FQDN, the path C:\ssh-priv.ppk with the location of the Private Key you saved from PuttyGen, and the 192.168.0.1 in "not host 192.168.0.1" with the IP address of the firewall interface you’re reaching it on.

Wireshark will open and start showing packets. You can smile and jump now.

You can modify the tcpdump parameters to better match the capture, for example, using -i eth1 to capture a specific interface, or filter specific traffic… once you’re done, just close Wireshark and CTRL+C the command.
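If you end up tweaking the capture a lot, it can be handy to build the remote tcpdump command from a few parameters and paste it into the Plink line. A minimal Python sketch, using only the tcpdump flags already shown in this post. The function name and the sample values are made up for the example.

```python
def tcpdump_command(exclude_host, interface=None, ssh_port=22):
    """Build the remote tcpdump command that writes raw packets to stdout.

    The filter skips the SSH session itself, otherwise the capture
    would record its own traffic.
    """
    args = ["tcpdump", "-s", "0", "-U", "-n", "-w", "-"]
    if interface:
        args += ["-i", interface]
    capture_filter = f"not port {ssh_port} and not host {exclude_host}"
    return " ".join(args) + " " + capture_filter


# Same command as the Plink example, then one pinned to eth1.
print(tcpdump_command("192.168.0.1"))
print(tcpdump_command("192.168.0.1", interface="eth1"))
```

Wrap the printed string in double quotes when you drop it into the Plink command, exactly like the example above.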

Note, if you’re doing this capture remotely over WAN or Internet, it will tunnel ALL packets over SSH, so it will take up a lot of bandwidth…

Have fun!!!

VMware Power Policy and CPU Ready latency

VMware’s Performance Best Practices mentions you should set power management in the BIOS to “OS Controlled Mode” or equivalent. This is so you can control power saving from the hypervisor itself. It’s very useful when you want to change these settings on the fly without having to reboot into the BIOS, similar to how Windows power profiles work.

But the “gotcha” here, which is also mentioned in the best practices documentation, is that the default Power Policy setting is set to Balanced, when you most likely want to set this to High Performance, as you’ll see later…

[Screenshot: Power Policy setting on the host]

This makes it pretty awful for latency-sensitive workloads. In Balanced, your CPU sometimes has to scale up or down through its power states before it can process an instruction, and this adds latency. The difference can be clearly seen by looking at the CPU Ready (%RDY) metric. Here’s the difference that changing to High Performance made in a single vCPU VM:

[Chart: CPU Ready for the single vCPU VM, before and after the change]

And here is a 4 vCPU VM running Exchange 2013:

[Chart: CPU Ready for the 4 vCPU Exchange 2013 VM, before and after the change]

My VMs felt “snappier” after the change. It’s hard to avoid speaking subjectively here, but click-to-action felt quicker. Maybe it’s in my head, but I feel those charts tell a different story.

The effect the “Balanced” power savings has on CPU Ready times is clear as day, even though it’s mentioned that Balanced has minimal to no impact on performance. I have yet to do benchmarks to show how CPU Ready affects real workloads, but at the very least, CPU instruction latency from a guest VM is dramatically decreased, which benefits real-time workloads like Lync, Skype for Business or VoIP.
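One thing that trips people up when comparing numbers: the vSphere performance charts report CPU Ready as a summation in milliseconds per sample, not as a percentage. The usual conversion divides the summation by the sample interval. Here is a minimal Python sketch, assuming the 20-second interval of the real-time charts. The function name and the per-vCPU averaging are my own additions for illustration.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage.

    interval_s: chart sample interval in seconds (20 assumed, as in
    the real-time charts). vcpus: divide a VM-level summation by the
    vCPU count to get an average per vCPU.
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


# 1000ms of ready time in a 20s sample on a single vCPU VM.
print(cpu_ready_percent(1000))
# 4000ms of ready time in a 20s sample, averaged over 4 vCPUs.
print(cpu_ready_percent(4000, vcpus=4))
```

Historical charts (past day, week and so on) use longer intervals, so pass the matching interval_s or the percentage will be way off.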

VMware NSX Lab in a night = awesome

So… VMware’s NSX is super awesome! I’m one of those weird guys that find playing with networking and virtualization on a Monday night more fun and exciting than a weekend in Vegas. Ok, maybe not so much, but still somehow I managed to stay up past midnight deploying an NSX “Lab” just by messing with it. I say screw the guide, I learn better by just pressing buttons and breaking things… I’m not doing this for a client so what gives? Let’s poke…

After some fun I’ve gone from just knowing concepts of SDN to a fully usable network running on top of VMware NSX. It’s complete with:

  • Single 6.2 controller
  • VXLAN transport on a Force10 S60 with PIM and IGMP snooping enabled
    • Since I already had Distributed vSwitches, it was very easy to provision the transport
  • Multicast Transport Zone and segment ID
  • Single NSX Edge running OSPF connecting to the S60 core and redistributing connected networks
  • Single logical switch (for now)
  • Two VMs on two different hosts to test connectivity
  • Smiles

Captured live flows while downloading a CentOS ISO from a mirror site just to test speeds.

[Screenshot: live flows captured during the download]

So far I’m very impressed with what NSX can do, and I’ve only scratched the surface. Think stretched networks over L3, per-VM firewall policies at both Layer 3 and Layer 2, logical routers between virtual switches, each with its own ACLs, HA edges, so many cool things! Only 59 days left…

It’s almost 1am and I should really go to sleep now. Good night.

New Technical Diagrams for Skype for Business Server 2015

Released last week, new technical diagrams in Visio and PDF for Skype for Business workloads, Call Quality Methodology (CQM) and different hybrid scenarios.

https://technet.microsoft.com/en-us/library/dn594589.aspx

SfB Protocol Workloads poster

Thumbnail for the CQM poster

Plan Voice Solution poster Thumbnail