Bad network route?

HCHTech

I have a client that is experiencing poor performance with their VOIP phones. After troubleshooting, the phone vendor tells me this:

"This is not DNS - There is a bad network route in Verizon's core network, that is adding latency and packet loss. That is what we believe is causing the issues"

and asked me to open a ticket with Verizon.

I've never run into this as an issue, so I'm wondering if this is a real thing or not. I'm not looking forward to trying to explain this to Verizon level 1 support, let alone expecting them to do anything.
 
Can you take a look yourself with mtr or pathping? That would tell you exactly where the problem is, if there's a problem somewhere upstream.
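If you do, run them long enough to catch intermittent loss. Something like this (the hostname is a placeholder for the VoIP server):

```
# Linux/macOS: report mode, wide output, show AS numbers, 300 cycles
mtr -rwz -c 300 voip.example.com

# Windows: pathping with more queries per hop than the default
pathping -q 50 voip.example.com
```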
 
Well, pings to the problem servers this morning for 15 minutes didn't show any problems. A 15-minute run with MTR didn't even have the problem servers in the route, so maybe Verizon took those servers out of service all by themselves - haha. Here's hoping the problem is fixed without further action!
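For what it's worth, a long ping run is easier to argue about with the vendor (or Verizon) once it's reduced to numbers. A quick sketch for doing that, where the RTT samples are invented and `None` marks a timed-out request:

```python
# Summarize a long ping run: packet loss percentage and average RTT.
# The RTT samples below are invented; None marks a timed-out request.
def summarize(rtts_ms):
    timeouts = sum(1 for r in rtts_ms if r is None)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * timeouts / len(rtts_ms)
    avg_rtt = sum(replies) / len(replies) if replies else float("nan")
    return loss_pct, avg_rtt

samples = [22.1, 21.8, None, 23.0, 22.4, None, 21.9, 22.7]
loss, avg = summarize(samples)
print(f"loss={loss:.1f}%  avg_rtt={avg:.1f} ms")  # loss=25.0%  avg_rtt=22.3 ms
```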
 
Maybe, but usually network issues are sporadic, like:

Reply
Reply
Reply
Request Times Out
Reply
Reply


If it is a network problem at all, it is not a routing issue… "routing issue" is just a crappy vendor's way of blaming the network.

A network issue would mean all the phones on a subnet don't work.



I agree, may as well traceroute and look at latency for each hop. That said, I would also look at STP to see if you have changes, and check all the interfaces from point A to Z, looking at error counters too.

I suspect it is something strange, like a 1 Gbps layer-2 loop on a 10 Gbps+ network, or maybe QoS dropping the wrong packets. I would debug there.

Maybe try moving a test phone to a different VoIP VLAN.
 


If it is a network problem at all, it is not a routing issue… "routing issue" is just a crappy vendor's way of blaming the network.

My first reaction was that this was textbook vendor finger-pointing, PLUS, aren't these things self-policing? If a hop on the route to anything is having trouble, isn't it automatically taken offline or traffic rerouted? That was my understanding, although admittedly I don't really know.

When I do a tracert or an MTR run to their servers from the client location, the problem hop isn't even on the list! Of course, they didn't even want to hear that, they just sent me another screenshot of their test results. So now, I'm opening a ticket with Verizon with the inability to reproduce the "evidence" AND the inability to test whether any fix they might do was successful - it's maddening.

When I suggested THEY talk to Verizon, their response was "It's our policy NOT to communicate with a client's ISP." Right. Just lay the blame somewhere else and back away. Close ticket.
I would also look at STP to see if you have changes, check all the interfaces from point A to Z look at error counters too.

Their router is at the edge now, so I have no visibility into that. Our equipment looks all good. We've got a managed UniFi switch there and there are no problems at all that I can see. I haven't dug through the firewall logs specifically, but we do have alerting enabled and are receiving no alerts.
 
My first reaction was that this was textbook vendor finger-pointing, PLUS, aren't these things self-policing? If a hop on the route to anything is having trouble, isn't it automatically taken offline or traffic rerouted? That was my understanding, although admittedly I don't really know.

Yes, my initial reaction was that this feels like typical finger-pointing. If this were truly a Verizon backbone issue, I would expect broader impact across other Verizon customers.

Most impairments such as latency, jitter, or high utilization will not cause a BGP session within the provider network to drop or a route to be withdrawn! BGP reacts to reachability failures, not performance degradation. The kind of automatic failover you're describing is more consistent with a hard outage (i.e. a fiber cut or complete path failure).

In short, the routing protocol has no awareness of application performance metrics such as VoIP call quality. As long as the next hop remains reachable, the route will stay installed. It would take some detailed engineering such as BFD tied to route withdrawal or automation modifying policy (i.e. prefix-lists) to remove a path based on performance rather than reachability.
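As a sketch of what that kind of engineering looks like, BFD can be tied to a BGP neighbor so the session drops within about a second of a true path failure; note this is still reachability-based, not performance-based. Cisco-style syntax, with interface, timers, addresses, and AS numbers all placeholders:

```
interface GigabitEthernet0/0
 bfd interval 300 min_rx 300 multiplier 3
!
router bgp 65001
 neighbor 192.0.2.1 remote-as 65002
 neighbor 192.0.2.1 fall-over bfd
```

With 300 ms intervals and a multiplier of 3, detection takes roughly 900 ms, but a lossy-yet-reachable path would still never trip it.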

When I do a tracert or an MTR run to their servers from the client location, the problem hop isn't even on the list! Of course, they didn't even want to hear that, they just sent me another screenshot of their test results. So now, I'm opening a ticket with Verizon with the inability to reproduce the "evidence" AND the inability to test whether any fix they might do was successful - it's maddening.

If the problem hop does not appear in your MTR, the path being tested may not be the same, due to policy-based routing. Traceroute/MTR typically use ICMP or UDP probes, which may not follow the exact same forwarding path as application traffic like VoIP due to ECMP or upstream routing differences, but Verizon would have to tell you for sure.
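To illustrate the ECMP point: routers commonly hash the flow 5-tuple to choose among equal-cost next hops, so an ICMP probe and a UDP voice flow to the same destination can land on different paths. A toy sketch, where the addresses, ports, and hash are invented (real routers use vendor-specific hashes):

```python
# Toy ECMP illustration: hash the flow 5-tuple to pick one of several
# equal-cost next hops, the way many routers do. An ICMP traceroute
# probe and a UDP/RTP voice flow to the same destination can therefore
# take different paths. All values here are made up for illustration.
import hashlib

NEXT_HOPS = ["path-A", "path-B"]

def pick_path(src, dst, proto, sport, dport):
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

icmp_probe = pick_path("198.51.100.10", "203.0.113.5", "icmp", 0, 0)
voip_flow = pick_path("198.51.100.10", "203.0.113.5", "udp", 16384, 5060)
print(icmp_probe, voip_flow)
```

The same 5-tuple always hashes to the same path, which is why a single flow doesn't get reordered, but two different flows may never share a path at all.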

If there were true packet loss at a specific hop, you would normally see that loss continue in subsequent hops. If subsequent hops are clean, it is likely Verizon is simply rate-limiting ICMP replies rather than dropping transit traffic.
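That rule of thumb can be sketched as a quick classifier; the per-hop loss percentages and the 5% threshold below are made up:

```python
# Real loss at a hop should carry through to every later hop, while
# loss that vanishes downstream is usually just that router
# rate-limiting its own ICMP replies. Sample data is invented.
def classify(per_hop_loss, threshold=5.0):
    verdicts = []
    for i, loss in enumerate(per_hop_loss):
        downstream = per_hop_loss[i + 1:]
        if loss >= threshold and all(d >= threshold for d in downstream):
            verdicts.append("real loss")       # loss persists to the end
        elif loss >= threshold:
            verdicts.append("likely ICMP rate-limiting")
        else:
            verdicts.append("clean")
    return verdicts

print(classify([0.0, 40.0, 0.0, 0.0]))    # mid-path loss, clean afterwards
print(classify([0.0, 12.0, 15.0, 11.0]))  # loss continues downstream
```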

Also, between two visible Layer-3 hops there can be significant Layer-2 infrastructure that never appears in a traceroute at all. For example, several switches carrying a trunked VLAN between Hop C and Hop E in your A-to-Z path. Something like an unmitigated loop and broadcast storm could drown out traffic on that VLAN, making it hard for the hops to reliably exchange the L2 frames containing the packets they need to route.

The larger issue is that without being able to reproduce the problem from the client network, it is difficult to pinpoint. You probably need to bring Verizon into this.

When I suggested THEY talk to Verizon, their response was "It's our policy NOT to communicate with a client's ISP." Right. Just lay the blame somewhere else and back away. Close ticket.

I understand they may not communicate directly with a client's ISP; however, since the reported issue is within the transit path rather than the customer's network, coordinated validation is necessary. If they are seeing consistent loss or latency, they should hand over source/destination IPs, timestamps, the protocol used, and evidence of downstream impact, so you can take actionable data to Verizon. Beyond that, this is ultimately what SLAs are for... measurable, verifiable performance issues.

You only get SLA credits when it is confirmed on the Carrier's side though.

Their router is at the edge now, so I have no visibility into that. Our equipment looks all good. We've got a managed UniFi switch there and there are no problems at all that I can see. I haven't dug through the firewall logs specifically, but we do have alerting enabled and are receiving no alerts.

If their PE router is at the demarc, you really do NOT have visibility into anything upstream of that. From what you describe, your side looks clean: no switch errors, no alerts, nothing obvious pointing to local impairment. That said, UniFi lacks deep diagnostics and historical interface counters, so it is hard to prove your network is clean.

If I were isolating this, I would temporarily bypass the UniFi entirely. Plug a single phone (or small test device) directly into a known-good switch with clear interface counters, even better a Cisco like a 9300X, and make that the whole network for testing!

Either way... One phone.... One switch.... Direct to the VoIP carrier handoff.

If the issue still happens in that stripped-down setup, it is almost certainly upstream. If it disappears, then you know the UniFi setup deserves a closer look.


Good Luck
 
It absolutely can be a thing, and is a PITA. As NETWiz mentioned, a tracert might not actually show the path the specific traffic is going down.

In my first and most memorable incident, it was with RDP traffic. I only clued in that it wasn't my local network because RDP performance to the destination was fine when connecting over a VPN. I had to run quite a few tracerts before I found the problem IP, and I pretty much only knew which IP it was because I was throwing things into PingPlotter (the VPN did rule some out, though). IIRC, there was jitter on the problem hop.
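For anyone wanting to quantify that kind of jitter from raw RTT samples, one simple measure is the average change between consecutive round trips; the sample values here are invented:

```python
# Average absolute change between consecutive RTTs as a simple jitter
# measure (a steady hop stays near zero even if its latency is high).
def jitter_ms(rtts):
    deltas = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(deltas) / len(deltas)

steady = [20.0, 21.0, 20.5, 20.8]    # invented sample data
jittery = [20.0, 55.0, 18.0, 60.0]   # invented sample data
print(f"steady hop: {jitter_ms(steady):.1f} ms, problem hop: {jitter_ms(jittery):.1f} ms")
```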

It was a Zayo hop in between my ISP and the server provider I was RDPing into. I had no relationship with Zayo, but I sent an email; they claimed they couldn't see a problem. But a notice of maintenance had been added to his email signature, and after that maintenance it was fine.
 