Mastering Azure: Key Benefits and Best Practices for 2024

```text
$ az network vnet peering show --name PeerToProd --resource-group rg-core-connectivity --vnet-name vnet-hub --output json
Deployment failed. Correlation ID: 4e52-b91a-8823471bc. {
  "error": {
    "code": "InternalServerError",
    "message": "An error occurred while processing your request. Please try again in a few minutes.",
    "details": []
  }
}

$ az network vnet list -g rg-core-connectivity --query "[].{Name:name, State:provisioningState}"
[
  {
    "Name": "vnet-hub",
    "State": "Updating"
  }
]
```

The terminal is mocking me. It’s 03:14 AM. My eyes feel like someone rubbed them with a handful of fiberglass insulation. I’ve been staring at this specific `Updating` state for forty-five minutes. In the world of Azure, "Updating" is the polite way of saying "I’ve fallen into a deep coma and I might never wake up." If you try to touch it, you get a 409 Conflict. If you leave it alone, your entire regional routing table stays in a state of quantum uncertainty.

The pager went off at midnight. "High Latency - US East 2." A classic. A masterpiece of understatement. It wasn't just latency; it was a total packet graveyard. The "architects"—the guys who spend their days drawing pretty boxes in slide decks and using words like "synergy"—decided last week that we needed to move to a "hub-and-spoke" model with a centralized NVA (Network Virtual Appliance). They promised it would simplify things. They lied.

## The First Alert at 02:14: The Ghost in the VNet

By 2:00 AM, I realized the hub VNet was black-holing everything. I tried to check the effective routes on the primary gateway. The Azure portal, that bloated, JavaScript-heavy labyrinth of broken dreams, just gave me a spinning blue circle. It’s a special kind of hell when the UI designed to help you troubleshoot is the first thing to fail under load.

I dropped back to the CLI. I needed to see what the hell happened to the User Defined Routes (UDRs).

```json
{
  "addressPrefix": "0.0.0.0/0",
  "nextHopType": "VirtualAppliance",
  "nextHopIpAddress": "10.0.0.4",
  "provisioningState": "Succeeded",
  "name": "DefaultRouteToFirewall"
}
```

The JSON looked fine. On paper. But 10.0.0.4 wasn’t responding to ARP. The NVA was healthy according to the load balancer probes, but the traffic was just… vanishing. I started digging into the VNet peering. We have three spokes peering into the hub. One of them—the most critical one, obviously—was stuck in “Initiated” state.

Why? Because someone (probably the same guy who thinks “serverless” means there are no servers) tried to enable “Use Remote Gateways” on a peering that already had a Gateway Load Balancer associated with the interface. You can’t do that. The Azure API won’t stop you from trying, though. It’ll just let the resource provider chew on that impossible request until it chokes and dies, leaving your VNet in the “Updating” state I’m currently staring at.

I tried to force a manual update.

```bash
az network vnet update --ids /subscriptions/sub-id-here/resourceGroups/rg-core-connectivity/providers/Microsoft.Network/virtualNetworks/vnet-hub
```

Result? Another 500 Internal Server Error. The underlying Resource Manager is having a panic attack. I’m on my fourth cup of sludge that used to be coffee. My keyboard is covered in crumbs from a protein bar that expired in 2022.
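The error body literally says to try again in a few minutes, which is an argument for backing off instead of hammering a resource provider that is already choking. A minimal sketch of that idea follows. `TransientError`, `backoff_delays`, and `retry` are hypothetical names invented for illustration, and `op` stands in for whatever callable wraps your CLI or SDK call.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 409 or 5xx)."""


def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> List[float]:
    """Delay before each retry: base, base*factor, ... never above `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry(op: Callable[[], T], delays: Iterable[float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `op`; on TransientError, wait for the next delay and try again."""
    for delay in delays:
        try:
            return op()
        except TransientError:
            sleep(delay)
    return op()  # final attempt; any error now propagates to the caller
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In practice you would probably add jitter so that several on-call engineers don't retry in lockstep.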

## The DNS Black Hole and the Private Link Lie

While the hub VNet was busy having its existential crisis, the application logs started screaming about database connection timeouts. We use Private Link for everything because “security.” Private Link is great until it isn’t. It relies on a delicate, fragile web of Private DNS Zones and CNAME redirects that would make a 1990s sysadmin weep.

I checked the KQL logs for the Private Endpoint.

```kusto
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Category == "NetworkSecurityGroupEvent"
| where OperationName == "NetworkSecurityGroupRuleCounter"
| where ResultSeverity == "Deny"
| where properties_s contains "10.0.2.55"
| project TimeGenerated, srcIP_s, destIP_s, inner_action_s
```

Nothing. The NSGs weren’t dropping the traffic. The traffic wasn’t even getting to the NSG. I ran a name lookup from a jumpbox inside the spoke.

```text
$ nslookup prod-db-01.database.windows.net
Address: 52.157.24.12
```

There it is. The smoking gun. That’s a public IP. It should be resolving to a 10.x.x.x address via the Private Endpoint. The Azure-provided DNS resolver (the magical 168.63.129.16 address that we’re all supposed to just trust) decided to stop honoring the Private DNS Zone link.
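This is the kind of check worth automating from inside the spoke: resolve the name and complain if any answer is not a private address. A rough sketch using only the Python standard library; `resolve` and `public_answers` are hypothetical helper names, and treating "private" as "answered by the Private Endpoint" is an assumption that only holds if your endpoints live in RFC 1918 space.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve(hostname: str) -> List[str]:
    """IPv4 addresses the local resolver returns for `hostname`."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def public_answers(addresses: Iterable[str]) -> List[str]:
    """Subset of `addresses` that is NOT in private address space."""
    return [a for a in addresses if not ipaddress.ip_address(a).is_private]


# Usage from a jumpbox (hostname taken from the lookup above):
#   leaked = public_answers(resolve("prod-db-01.database.windows.net"))
#   if leaked:
#       print("Private DNS zone link is not being honored:", leaked)
```

Unlike `provisioningState`, this tests what the workload actually sees.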

I checked the DNS Zone link configuration.

```json
{
  "id": "/subscriptions/.../providers/Microsoft.Network/privateDnsZones/privatelink.database.windows.net/virtualNetworkLinks/link-to-spoke-01",
  "properties": {
    "registrationEnabled": false,
    "virtualNetwork": {
      "id": "/subscriptions/.../resourceGroups/rg-network/providers/Microsoft.Network/virtualNetworks/vnet-spoke-01"
    },
    "provisioningState": "Succeeded"
  },
  "location": "global",
  "type": "Microsoft.Network/privateDnsZones/virtualNetworkLinks"
}
```

It says “Succeeded.” It’s lying. I’ve learned that in the Azure ecosystem, “Succeeded” just means “I sent the command to the worker node and I haven’t heard back that it exploded yet.” I had to delete the link and recreate it. While waiting for the “global” replication, which is about as fast as a tectonic plate, I had to explain to a Project Manager why “just restarting the database” wouldn’t fix a DNS resolution failure in the SDN layer. He didn’t understand. He never does.

## Why RBAC is My Mortal Enemy

At 04:30, I discovered why the NVA was failing. Someone had “cleaned up” the Managed Identities. Our NVA uses a system-assigned identity to update UDRs dynamically when it detects a failure in the primary node. Standard stuff, right? Wrong.

When the identity was nuked, the NVA lost the Network Contributor role on the route table. So when the primary node hit a transient memory leak (thanks, vendor-provided “optimized” Linux kernel), the secondary node tried to take over the 0.0.0.0/0 route and got a 403 Forbidden.

```bash
# Checking the role assignments... or lack thereof
az role assignment list --assignee <nva-identity-id> --scope /subscriptions/.../resourceGroups/rg-core-connectivity/providers/Microsoft.Network/routeTables/rt-spoke-to-hub
# Output: []
```

Empty. Just like my soul. I had to re-assign the role, but because the Azure IAM system is eventually consistent, I had to wait another ten minutes for the assignment to propagate. Ten minutes of watching the “Request Timed Out” count climb in the dashboard.
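With an eventually consistent system, the honest approach is to poll until the change is visible and give up at a deadline, rather than sleeping for a fixed ten minutes and hoping. A minimal sketch; `wait_until` is a hypothetical helper, and `check` is whatever callable tells you the assignment now shows up (for example, a wrapper around the listing command above).

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # still not visible; caller decides what to do
        sleep(interval_s)
```

The `clock` and `sleep` parameters are injectable so the helper can be exercised with a fake clock instead of burning real minutes.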

I tried to automate the fix with a script, but the az CLI decided this was the perfect time to tell me my login session had expired. I had to do the device-code dance on my phone while squinting at a screen that was starting to blur.

The documentation says that role assignments take “up to 10 minutes.” In reality, when you’re in the middle of a regional outage, it feels like a decade. I spent that time reading the release notes for API version `2022-03-01`. Apparently, they changed how `Microsoft.Network/networkInterfaces` handles secondary IP configurations in relation to Gateway Load Balancers. This is the kind of “minor change” that breaks everything but is buried on page 45 of a PDF no one reads.

## The Load Balancer Latency Trap

By 05:15, the NVA was back, the DNS was resolving, and the VNet was finally out of the “Updating” state. But the application was still crawling. We’re talking 500ms of jitter on a simple internal API call.

I started looking at the Load Balancer. We’re using a Standard Load Balancer as the front end for the NVA cluster. The “Architect” (I really hate that guy) insisted on using a Gateway Load Balancer (GWLB) for “advanced packet inspection.”

Here’s the thing about the Azure Gateway Load Balancer: it uses VXLAN encapsulation to tunnel traffic to the appliance. That adds overhead. It adds headers. It fragments packets if your MTU isn’t perfectly tuned. And if your NVA isn’t handling the VXLAN decapsulation in hardware, your CPU usage spikes, and your latency goes to the moon.

I ran a quick test.

```bash
# Checking the LB metrics via CLI because the portal is still garbage
az monitor metrics list --resource <gwlb-id> --metric "DipAvailability" --interval PT1M
```

The health probes were flapping. Why? Because the VXLAN overhead was pushing the packets over the 1500-byte MTU limit, and the packets were being dropped by the virtual switch. I had to go into every single VMSS (Virtual Machine Scale Set) instance and drop the MTU to 1450.
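The 1450 figure is not a magic number. It falls out of the VXLAN header arithmetic, sketched below for VXLAN over IPv4 with no outer VLAN tag and no IP options (both assumptions; your underlay may differ). The constant and function names are made up for illustration.

```python
# Bytes added in front of the original packet by VXLAN over IPv4.
OUTER_IPV4 = 20      # outer IP header, no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # the original frame's Ethernet header rides inside

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def max_inner_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that still fits in one underlay packet."""
    return underlay_mtu - VXLAN_OVERHEAD


def fits(inner_packet_bytes: int, underlay_mtu: int = 1500) -> bool:
    """True if the packet survives encapsulation without fragmentation."""
    return inner_packet_bytes <= max_inner_mtu(underlay_mtu)
```

With a 1500-byte underlay, `max_inner_mtu(1500)` gives 1450, which is why a full-size 1500-byte packet from the VM no longer fits once the tunnel headers are added.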

Do you know how fun it is to run a pssh command across 200 instances when the network is already flaky? It’s like trying to perform surgery with a pair of rusty garden shears during an earthquake.

```bash
# The "Fix"
sudo ip link set dev eth0 mtu 1450
```

I watched the latency graph. It dropped from 500ms to 15ms. I felt a brief moment of triumph, which was immediately extinguished when I realized I still had to figure out why the “Auto-Scale” rule hadn’t triggered during the spike.

## The ARM Template from Hell

06:45 AM. The sun is coming up. I hate the sun. It’s too bright, and it reminds me that the rest of the world is waking up and expecting the “Azure” environment to be stable.

I pulled the ARM template for the scale set to see why the scaling failed.

```json
{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2022-03-01",
  "name": "app-scale-set",
  "properties": {
    "overprovision": true,
    "upgradePolicy": {
      "mode": "Automatic"
    },
    "virtualMachineProfile": {
      "networkProfile": {
        "networkInterfaceConfigurations": [
          {
            "name": "nic01",
            "properties": {
              "primary": true,
              "enableAcceleratedNetworking": true,
              "networkSecurityGroup": {
                "id": "..."
              }
            }
          }
        ]
      }
    }
  }
}
```

There it was. `enableAcceleratedNetworking: true`. Sounds good, right? Marketing loves that word. “Accelerated.” But on the specific instance size we were using (`Standard_D2s_v3`), Accelerated Networking is only supported on certain images and requires a specific driver version that wasn’t in our “golden” image.

So, when the scale set tried to spin up new nodes to handle the load, they failed to initialize the NIC. They just sat there in a “Failed” state, drawing billing cycles but doing zero work. I had to manually patch the scale set model to disable the “acceleration” just to get the capacity back.
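A cheap guard is to scan the template before deployment and flag every NIC configuration that turns the feature on, so a human can confirm the image and size actually support it. The sketch below walks a scale set resource shaped like the template above; `accelerated_nics` is a hypothetical helper, and it only reports the flag. It does not know which sizes or images are supported.

```python
from typing import Any, Dict, List


def accelerated_nics(resource: Dict[str, Any]) -> List[str]:
    """Names of NIC configurations with Accelerated Networking enabled."""
    profile = resource.get("properties", {}).get("virtualMachineProfile", {})
    nic_configs = profile.get("networkProfile", {}).get(
        "networkInterfaceConfigurations", [])
    return [
        nic.get("name", "<unnamed>")
        for nic in nic_configs
        if nic.get("properties", {}).get("enableAcceleratedNetworking") is True
    ]


# Usage: load the template with json.load() and call the function on each
# entry of its "resources" list whose "type" is
# "Microsoft.Compute/virtualMachineScaleSets".
```

Run against the resource above, it would report `nic01`, which is the prompt to go check the golden image's driver before the next scale-out does it for you.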

The irony is thick enough to choke on. The feature meant to make things faster made everything stop.

I spent the next hour manually deleting the “Failed” instances. The Azure CLI kept timing out on the bulk delete command, so I had to write a bash loop to do it one by one.

```bash
# Delete the failed instances one at a time (instance IDs 0 through 15)
for i in {0..15}; do
  az vmss delete-instances --resource-group rg-app --name app-scale-set --instance-ids "$i" --no-wait
  echo "Nuking instance $i..."
done
```

Each command took 30 seconds to return. The `--no-wait` flag is a suggestion, apparently.

## The Final Realization: It Was Always a Typo

08:00 AM. The “Morning Stand-up” is starting. I’m still in my pajamas, I smell like a burnt-out fuse, and I have to explain what happened.

I did one last audit of the route tables. I found it. In the `rg-core-connectivity` group, there was a second route table I hadn’t seen earlier, named `rt-spoke-to-hub-backup`. It carried a more specific route for one subnet, and because Azure picks the longest matching prefix, that route won over the default.

Inside that route table, someone had typed 10.0.1.0/24 instead of 10.0.11.0/24.

That one missing ‘1’. That single character. It was routing all the traffic for the authentication service into a non-existent subnet in a different VNet. Because Azure doesn’t validate if the destination of a UDR actually exists (it just trusts you), the packets were being sent to a black hole.
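Why a single /24 can hijack traffic that a perfectly good default route was handling comes down to longest-prefix matching: the most specific prefix that contains the destination wins. A small sketch with Python's `ipaddress` module; `select_route` is a hypothetical helper, the prefixes and the 10.0.0.4 firewall address come from this post, and the `"black-hole"` label is just a stand-in for the bad next hop.

```python
import ipaddress
from typing import Dict, Optional


def select_route(destination: str, routes: Dict[str, str]) -> Optional[str]:
    """Next hop of the most specific prefix containing `destination`."""
    dest = ipaddress.ip_address(destination)
    best = None  # (prefix length, next hop) of the best match so far
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None


routes = {
    "0.0.0.0/0": "10.0.0.4",      # DefaultRouteToFirewall
    "10.0.1.0/24": "black-hole",  # the fat-fingered prefix
}
```

Here `select_route("10.0.1.20", routes)` lands on the /24 entry, while `select_route("10.0.11.20", routes)` falls through to the firewall. This only models prefix length; it ignores how Azure breaks ties between user-defined, BGP, and system routes.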

I deleted the rogue route.

```bash
az network route-table route delete --name RogueRoute --resource-group rg-core-connectivity --route-table-name rt-spoke-to-hub-backup
```

The “Request Success Rate” on the dashboard immediately shot up to 99.9%. The alerts cleared. My phone stopped vibrating.

I looked at the “Azure” portal one last time. It looked so peaceful. So “comprehensive.” (Wait, I can’t use that word. It looked so… complete. No, that’s too nice. It looked like a liar.) It sat there with its green checkmarks, pretending that the last eight hours of absolute chaos never happened.

The “Architect” messaged me on Slack. “Hey, I saw some blips overnight. Glad the hub-and-spoke model is providing the resilience we talked about. Let’s discuss how we can further optimize the tapestry of our cloud infrastructure in the next sprint.”

I didn’t reply. I closed my laptop, put it in the freezer (not really, but I thought about it), and went to bed.

The worst part? I have to do it all again tomorrow. Because someone wants to “simplify” the ExpressRoute configuration.

God help us all.

Post-Mortem Summary for the “Management” (The stuff I won’t actually send):
- Root Cause: A combination of API versioning mismatches (2022-03-01), a typo in a UDR, and the inherent latency of the Gateway Load Balancer’s VXLAN encapsulation.
- Resolution: Manually corrected the MTU on 200 instances, recreated the Private DNS Zone links, and deleted a rogue route that was fat-fingered by a “Senior Cloud Engineer.”
- Lessons Learned: Don’t trust the portal. Don’t trust the “Succeeded” state. And for the love of everything holy, keep the MTU at 1450 if you’re going to use fancy networking abstractions.

Now, if you’ll excuse me, I’m going to sleep for three days. Or until the next “Azure” regional outage. So, probably about four hours.
