Overview of Error Counters

The idea is that an "End Bad Packet" can be used instead of EGP (End Good Packet) whenever you know there is something wrong with the packet. So, if a packet is passing through the fabric and some port notices a problem (i.e. bad CRC), it will end it with EBP instead of EGP. If the packet progress requires store-and-forward, an option would be to just drop it and not waste bandwidth sending EBP packets. The CA that reports this error is NOT where the corruption occurred. It occurred elsewhere in the fabric.

SymbolErrors

The interpretation of symbols within the packet is done on the HCA/CA. If the translation or interpretation fails, it creates a minor event called a symbol error. 99% of all !SymbolErrors are hardware related. If the counts are small (small being a relative term that is up for interpretation) they can be ignored. If the numbers are large and/or the same CA is reporting this error regularly it should be looked into. On a node, the HCA and/or cable should be reseated. If the reseat is unsuccessful it should be replaced. On a switch, reseat the cable or replace the cable.

VL15Drop

VL15 is the default virtual lane for management packets. They are the first to be dropped when there are resource limitations on the port. This is usually related to not enough space in the buffers. In many instances these errors can be ignored. There have been instances, however, when these messages were very closely correlated to user problems in time and fabric space. Obviously, if they are being dropped the buffers are being kept very busy with other data and therefore could indicate congestion.

XmtDiscards

This counter tracks packets that were discarded instead of transmitted. This usually indicates congestion in the fabric. The CA this packet was supposed to be sent to cannot accept it. After so many retries and/or too many incoming packets, the packet to be transmitted gets dropped. If the fabric is being routed well, without deadlocks or credit loops, these should be transient.

Overview of Error Counters

Contents

IB Error Counter Definitions and Examples

LinkDowned

Linkspeed not at maximum

Linkwidth not at maximum

PortRcvErrors

PortRcvRemotePhysicalErrors

PortRcvSwitchRelayErrors

Port[Rcv|Xmit]ConstraintErrors

PortXmitWait

RcvRemotePhys(ical)Errors

SymbolErrors

VL15Drop

XmtDiscards

Navigation menu

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Tools