Thursday, April 24, 2008

Broadcom has one of the Biggest Networking Bugs Ever

This is a big one, boys and girls, and I guarantee you it can affect your IT infrastructure if you use clustering, NIC teaming, or half a dozen other configurations with Broadcom gigabit ethernet interfaces.

This is what I've been working on all day for the last two days, and will be working on all day, every day, for the next few weeks, maybe months. If this hasn't hit your shop yet, it will, and you'll be working on it too.

Here's a sanitized version of my incident report on the issue:

Background:

Broadcom is a leading vendor of networking chipsets in the server, appliance, embedded, and network device markets. Its chipsets are the most common gigabit ethernet chipsets in use across all classes of device within the organization.

Problem:

Recently, a critical bug was discovered in all known implementations of Broadcom's family of gigabit ethernet chipsets. Under certain circumstances, it causes interfaces that have been configured for proxy ARP (which is used with clustering, teaming and bonding, load balancing and other high availability configurations, and some security related configurations) to respond promiscuously to ARP requests.

This behavior causes network interfaces on other systems within the same broadcast domain (effectively all other devices on the same VLAN) to see degraded quality of service and packet loss, and it can cause complete loss of network communications (systems can get knocked off the network). This problem can also adversely affect the operation of the switches carrying this traffic.

Impact:

These problems are intermittent, but reproducible, and have been seen in multiple environments within the organization over the past several months.

Because of the nature of ARP, these problems are difficult if not impossible to detect before services are significantly impacted, and without specific knowledge of this issue they are difficult to diagnose in a timely manner. In each environment where the issue has appeared, it took several weeks to detect, diagnose, and understand.
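Once you know what to look for, the symptom itself is simple to describe: one MAC address answering ARP requests for many IP addresses. The following is a minimal sketch of that check, assuming you have already extracted (sender MAC, sender IP) pairs from ARP replies in a packet capture. The function name, the threshold, and the sample addresses are all made up for illustration, and anything legitimately doing proxy ARP (routers, load balancers, cluster nodes) will also show up, so the output is a list of candidates to investigate, not a verdict.

```python
from collections import defaultdict


def find_promiscuous_responders(arp_replies, max_ips_per_mac=2):
    """Return MACs that answered ARP for more than max_ips_per_mac distinct IPs.

    arp_replies: iterable of (sender_mac, sender_ip) tuples taken from ARP
    replies (hypothetical input format; how you capture them is up to you).
    max_ips_per_mac: assumed threshold; tune it for your environment.
    """
    ips_by_mac = defaultdict(set)
    for mac, ip in arp_replies:
        # Normalize case so the same MAC is not counted twice.
        ips_by_mac[mac.lower()].add(ip)

    return {
        mac: sorted(ips)
        for mac, ips in ips_by_mac.items()
        if len(ips) > max_ips_per_mac
    }


# Made-up sample data: the first MAC claims four addresses, the second only one.
sample_replies = [
    ("02:00:00:00:00:01", "10.0.0.5"),
    ("02:00:00:00:00:01", "10.0.0.6"),
    ("02:00:00:00:00:01", "10.0.0.7"),
    ("02:00:00:00:00:01", "10.0.0.8"),
    ("02:00:00:00:00:02", "10.0.0.20"),
]

suspects = find_promiscuous_responders(sample_replies)
for mac, ips in suspects.items():
    print(mac, "answered for", len(ips), "addresses")
```

The threshold is deliberately a parameter: a teamed or clustered host may legitimately answer for a couple of addresses, so what counts as "too many" depends on the subnet.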

Scope of affected infrastructure:

This issue may appear in existing devices when their drivers or patch levels are updated, or in new devices which have not been patched to specifically address this problem.

All subnets which have Broadcom gigabit ethernet chipsets attached, in a proxy ARP configuration, may be affected by this bug (this would constitute the majority of the subnets within the organization). Most major server vendors have already issued patches for older revisions of their operating environments which address this problem, or have such patches in development. However, until patches exist and are applied for all systems, the subnets to which these systems are attached may be at risk.

Notably, the most recent patch update of Solaris 10 does not include a fix for this issue (a fix is in development); the most recent patches for Microsoft Windows Server 2000, 2003, and 2008 do not include a fix; and IBM has decided that this issue does not present a problem for iSeries systems, even though they may use the affected chipsets in a proxy ARP configuration for some adapters (though IBM has issued, or will issue, fixes for the X series and p series).

Mitigation and/or Resolution:

At this time no definitive resolution exists for this issue; however, the following means of mitigation are available to us:

  • Do not add any new systems or devices using Broadcom gigabit ethernet chipsets and configured for proxy ARP until those systems have been properly patched
  • Do not update operating systems or patch levels on existing systems or devices using those chipsets and configured for proxy ARP (even if the devices in question have not yet demonstrated the problem) until those systems have been properly patched for this specific issue
  • Identify all systems and devices which have these chipsets, and target them for specific patching to resolve this issue before it can occur
  • Segregate critical infrastructure onto controlled broadcast domains by creating private VLANs, to prevent systems exhibiting the issue from impacting those environments (this presents extremely significant difficulties, to the point of effective impossibility for many environments)
  • Remove or segregate from the public network segments any systems or devices which have these chipsets, are configured for proxy ARP, and have not yet been patched or for which a patch does not yet exist (even if the devices in question have not yet exhibited the issue)
Other methods of mitigation may arise or be developed as this issue is worked further.
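
For anyone who has to turn the list above into a work queue, here is a rough sketch of the triage logic, assuming you already have an inventory that records each host's NIC chipset vendor, whether it is configured for proxy ARP, and whether it has been patched for this specific issue. The `Host` record, its field names, the action labels, and the sample hosts are all hypothetical; real inventory data will look different.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    chipset_vendor: str   # vendor of the gigabit ethernet chipset
    proxy_arp: bool       # configured for proxy ARP (teaming, clustering, ...)
    patched: bool         # patched for this specific issue


def triage(host):
    """Map a host to one of three made-up action labels."""
    if host.chipset_vendor.strip().lower() != "broadcom":
        return "no_action"
    if host.patched:
        return "no_action"
    if host.proxy_arp:
        # Unpatched and in the risky configuration: keep it off shared
        # segments and hold OS or driver updates until a fix is applied.
        return "segregate_and_patch"
    # Affected chipset but not in proxy ARP today: patch before that changes.
    return "patch"


# Made-up inventory for illustration.
inventory = [
    Host("db-cluster-01", "Broadcom", proxy_arp=True, patched=False),
    Host("web-07", "Broadcom", proxy_arp=False, patched=False),
    Host("file-02", "Broadcom", proxy_arp=True, patched=True),
    Host("build-03", "OtherVendor", proxy_arp=True, patched=False),
]

actions = {host.name: triage(host) for host in inventory}
for name, action in actions.items():
    print(name, action)
```

The sketch treats "patched" as meaning patched for this specific bug, not merely up to date, since a general patch-level update is exactly what can introduce the problem.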