# High-Speed, Standard Serial Protocols
With all the context that the sections on signal integrity and interconnect types have provided, it is time now to turn our attention to how we can reliably transfer high-speed digital data across imperfect channels using industry-standard interfaces. By adopting standard interfaces and protocols, we can design our digital systems to interoperate with different blocks and components or even with other digital systems.
By exploring these interfaces, we will see how digital communications engineers keep pushing the boundaries of data rate in the presence of the physical and constructive limitations of real channels.
### Network Planes: Control Plane, Data Plane, Utility Plane, Timing Plane
As if thinking of networks in terms of layers were not enough, there is also an alternative architectural viewpoint for analyzing networks based on the concept of "planes".
Network "planes" are a reasonably new conceptual framework used in networking to separate and organize different functions within a network system.
Originally, networks were relatively simple, with devices like routers and switches performing both control functions (deciding where to send data) and data functions (actually sending the data) in a combined manner. As networks grew in size and complexity, there was a need to scale and manage them more easily and introduce new services without overhauling the entire network architecture. This led to the conceptual separation of network functionalities into different _planes_.
The earliest distinction was between the control and data planes. ==The control plane is responsible for making decisions about how data should be routed through the network. It uses protocols and algorithms to determine the best path for data packets. The data plane, on the other hand, is responsible for the actual forwarding of packets based on the decisions made by the control plane.==
As network technology evolved, particularly with the advent of software-defined networking (SDN), this separation became more pronounced. SDN, for instance, centralizes the control plane in a controller, which can be a software application, separating it from the data plane devices that are responsible for forwarding traffic. This allows for greater flexibility and programmability in network management and operations.
In addition to the control and data planes, the concept of a management or utility plane emerged. This plane is concerned with the overall management of the network, including configuration, monitoring, and maintaining network health. The management plane provides the tools and interfaces necessary for network administrators to manage network resources, enforce policies, and ensure the network operates efficiently.
The timing plane is not always present, but when present, it is in charge of carrying signals that are important for synchronization purposes. Typically, the timing plane carries a Pulse Per Second (PPS) signal to ensure different elements of the system are synchronized.
This separation into different planes allows for specialized technologies and approaches to be developed for each aspect of network functionality, leading to more scalable networks. It also aids in troubleshooting and security, as issues can be isolated to a specific plane.
Note that the plane conceptual distinction has spilled outside of the telco industry into other industrial domains. We will see in the following sections that point-to-point, high-speed standard interfaces also use the plane distinction.
### PCS & PMA
When specifying physical layers of high-speed data interfaces, it is very common to find a sub-layering of the physical layer, and it has been quite popular to subdivide the physical layer into two sublayers: the PCS (Physical Coding Sublayer) and the PMA (Physical Medium Attachment).
PCS converts the digital data from the MAC (Media Access Control) layer into a form that can be handled by the physical medium. This involves processes like 8b/10b encoding, 64b/66b encoding, or other self-synchronous encoding schemes depending on the standard.
In self-clocked interconnects, because the absence of data transitions for long periods can lead to loss of synchronization, the PCS includes mechanisms such as scrambling to prevent synchronization issues from these inconvenient patterns in the data by reordering or encoding the data so that it appears random but can be unscrambled (remember data-dependent [[Physical Layer#Jitter|jitter]]). PCS may also insert idle patterns or generate pseudorandom binary sequences (PRBS) for channel characterization that will be used by upper layers for link training.
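To make the scrambling idea concrete, here is a minimal sketch of a byte-wise LFSR scrambler in C. The polynomial (x^16 + x^5 + x^4 + x^3 + 1) and the 0xFFFF seed are the ones PCIe uses at 2.5 and 5.0 GT/s, but the bit ordering and the rules for which Symbols get scrambled are simplified here for illustration; because the scrambler is a synchronous XOR stream, the same function also descrambles.
```C
#include <stdint.h>

/* Simplified byte-wise LFSR scrambler sketch. Polynomial
 * x^16 + x^5 + x^4 + x^3 + 1 and seed 0xFFFF as in PCIe Gen1/Gen2;
 * bit ordering is simplified relative to the specification. */
static uint16_t lfsr = 0xFFFF;

uint8_t scramble_byte(uint8_t data)
{
    uint8_t out = data ^ (uint8_t)(lfsr >> 8); /* XOR with top LFSR byte */
    for (int i = 0; i < 8; i++) {              /* advance the LFSR 8 bits */
        uint16_t fb = (lfsr >> 15) & 1;        /* feedback from x^16 term */
        lfsr = (uint16_t)((lfsr << 1) | fb);   /* the "+1" term feeds bit 0 */
        if (fb)
            lfsr ^= 0x0038;                    /* taps at x^5, x^4, x^3 */
    }
    return out;
}
```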
PCS includes mechanisms for error detection and correction using forward error correction (FEC) schemes, and it acts as a bridge between the MAC layer and the PMA layer, ensuring that the data is in the correct format for transmission over the physical medium.
The PMA is, in a nutshell, responsible for signal quality. The PMA can control the way symbols are driven and received to/from the physical medium. PMA includes signal conditioning circuitry for equalization, reflection cancellation, pulse shaping, etc. The PMA provides mechanisms to adjust the signal to the characteristics of the physical medium and thus ensure the highest data rate achievable. Note that the PHY is not always split into PCS and PMA; in some cases, standards may even create more subdivisions.
## PCI Express
PCI Express, often abbreviated as PCIe, is a high-speed serial communication standard conceived to replace the older PCI, PCI-X, and AGP bus standards. To understand PCIe's significance, we must first explore its direct ancestor, PCI.
PCI, or Peripheral Component Interconnect, introduced in the early 1990s, was a groundbreaking development in computer architecture. It replaced the older ISA (Industry Standard Architecture) and EISA (Extended Industry Standard Architecture) as the standard bus for connecting peripheral devices in computing systems, from personal computers' motherboards to industrial data acquisition systems. PCI offered a significant performance improvement over its predecessors, thanks to its higher clock speeds and bus width. It allowed devices like network cards, sound cards, and later, graphics cards, to communicate efficiently with the CPU and memory.
As technology advanced, the limitations of PCI began to surface. Its parallel bus architecture was not scalable with the increasing speed demands of newer processors and peripherals. This limitation led to the development of PCI-X, an enhanced version of PCI, which offered higher speeds but still relied on the parallel bus concept. Despite these enhancements, the parallel architecture was reaching its physical limits in terms of speed and efficiency.
In the early 2000s, the industry sought a solution that could overcome these limitations, leading to the development of PCI Express. PCIe was a significant departure from the parallel architecture of PCI. Instead, it adopted a serial point-to-point architecture. This design change allowed for higher data transfer rates, improved scalability, and better performance. With the move from a parallel to a serial, point-to-point architecture, the standard ceased being signal-based and transitioned into a packet-based serial protocol. In fact, it would be reductive to call PCIe merely a packet-based serial protocol; it is a full networking protocol stack.
PCI Express is an evolving standard. Therefore, it is difficult to write about PCI Express because there are several flavors—revisions, now called generations—of it out in the field. One risk of unpacking a particular generation—at the time of writing these lines, the latest is Gen6—is that this revision will be superseded by another one by the time these lines are published (if that ever happens). Therefore, the approach in this section is to describe PCIe from a conceptual, architectural perspective, highlighting its main aspects, and accepting the fact that it is a continuously moving standard and that new features and specs are surely coming down the road.
#### PCI Express Evolution
PCI Express (PCIe) has significantly evolved since its introduction, continually adapting to the escalating demands of high-speed data transfer in modern computing. Each revision of PCIe has brought substantial improvements, primarily in terms of higher data transfer rates, greater power efficiency, and enhanced functionality. Here's an exploration of these revisions and their key differences:
PCIe 1.0 (Gen1): Launched in 2003, PCIe 1.0 was a significant step forward from the older PCI standards. It introduced a serial point-to-point connection, replacing the parallel bus architecture of PCI. With a data transfer rate of 2.5 GT/s (Giga Transfers per second) and an effective throughput of 250 MB/s per lane in each direction, it offered a significant performance boost. PCIe 1.0 set the foundation for the future of high-speed data transfer in computing systems, providing scalability and reliability much needed for the evolving digital landscape.
> [!info]
> Note that the PCIe specification uses the term GT/s (GigaTransfers/sec) to mean the same as what Gbaud means in some other standards. As per the PCIe terminology, a _symbol_ is decoded into an 8-bit byte.
PCIe 2.0 (Gen2): Released in 2007, PCIe 2.0 doubled the data rate to 5 GT/s, effectively providing 500 MB/s per lane, addressing the increasing bandwidth requirements of applications like high-end graphics, networking, and storage. PCIe 2.0 retained the 8b/10b encoding of PCIe 1.0; a more efficient encoding scheme would not arrive until PCIe 3.0.
PCIe 3.0 (Gen3): Introduced in 2010, PCIe 3.0 marked another leap in data transfer efficiency. It further doubled the rate to 8 GT/s, equating to approximately 1 GB/s per lane. This version employed a more advanced 128b/130b encoding scheme, reducing overhead, and thus allowing for higher net data transfer rates. PCIe 3.0's enhancements were pivotal for data-intensive tasks and began to see widespread adoption in enterprise systems and consumer electronics.
PCIe 4.0 (Gen4): PCIe 4.0, launched in 2017, continued the trend, doubling the data rate to 16 GT/s per lane, resulting in a throughput of nearly 2 GB/s per lane. This advancement catered to emerging needs in high-performance computing, artificial intelligence, and server applications. With increased bandwidth, PCIe 4.0 also introduced improvements in areas such as power efficiency, and it started to address the challenges posed by emerging technologies like NVMe storage.
PCIe 5.0 (Gen5): Released in 2019, PCIe 5.0 once again doubled the data rate to 32 GT/s per lane, achieving a throughput of around 4 GB/s per lane. This revision was particularly significant for sectors where ultra-high-speed data transfer is critical, such as data centers and advanced computing applications. PCIe 5.0 also brought enhancements in areas like latency reduction and reliability, further solidifying PCIe's role as the backbone interconnect of modern computing systems.
PCIe 6.0 (Gen6): Released in 2023, PCIe 6.0 appears as the latest evolution of the standard, greatly shaped by the increasing demand for advanced applications in high-performance computing (HPC), data centers, artificial intelligence/machine learning (AI/ML), automotive, aerospace and military, and more. The bandwidth demand curve is up and to the right. This revision includes a data transfer rate of 64 GT/s per lane, increased power efficiency via a new low-power state, and data encryption. Perhaps the most salient change in this revision is the shift in the electrical signaling scheme, moving from the traditional non-return-to-zero (NRZ) signaling to pulse amplitude modulation with four voltage levels (PAM-4), doubling the data rate without increasing the signaling rate. With PCIe 6.0, there is a new packet header format with a streamlined organization. PCIe 6.0 uses flow control units (FLITs) to transfer data, eliminating the need for encoding schemes. In past revisions, 8 bits of data would end up being 10 bits on the wire due to the encoding. In newer revisions, 128 bits of data would be 130 bits on the wire. FLITs, on the other hand, are not encoded at all. This means that for every 1 bit of data, 1 bit ends up on the wire. The features and functions previously performed by the encoding in PCIe 5.0 are now covered by both the scrambling polynomial and the change to the FLIT headers in PCIe 6.0.

> [!Figure]
> _PCIe evolution_

> [!attention]
> In the context of PCI Express (PCIe), a FLIT, which stands for **Flow Control Unit**, is a fundamental data unit used in the physical layer of PCIe's layered architecture, particularly starting from the PCIe 6.0 specification. It is a fixed-size packet that encapsulates data or control information for transmission across a PCIe link.
> Prior to PCIe 6.0, data transmission in PCIe was primarily based on variable-sized packets, which could create inefficiencies in high-speed communication. PCIe 6.0 introduced FLITs as part of its shift to Pulse Amplitude Modulation 4-level (PAM4) signaling and the use of forward error correction (FEC) to improve reliability and efficiency at extremely high data rates (64 GT/s).
> Each FLIT is 256 bytes in size. It can carry either user data (such as memory read or write transactions) or protocol control information (used to manage link operations). FLITs are structured in a way that allows error correction codes to be embedded directly into the FLIT, improving the robustness of data transmission.
> The adoption of FLITs standardizes the size of transmitted data units, simplifying link management, enhancing error recovery, and enabling more efficient use of bandwidth. This change is particularly critical as PCIe continues to scale its data rates to meet the growing demands of high-performance computing, data centers, and AI workloads.
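Before moving on, the encoding arithmetic behind these per-lane numbers can be made concrete with a short C program; it reproduces the rough per-lane throughput figures quoted above from the signaling rate and coding efficiency alone (FEC, framing, and protocol overheads are deliberately ignored, and Gen6's FLIT mode is modeled as having no line-coding overhead).
```C
#include <stdio.h>

/* Effective per-lane throughput = GT/s * (payload bits / line bits) / 8.
 * FEC, framing, and protocol overheads are ignored in this rough estimate. */
struct gen { const char *name; double gts; double eff; };

int main(void)
{
    struct gen gens[] = {
        { "Gen1",  2.5, 8.0 / 10.0 },    /* 8b/10b */
        { "Gen2",  5.0, 8.0 / 10.0 },    /* 8b/10b */
        { "Gen3",  8.0, 128.0 / 130.0 }, /* 128b/130b */
        { "Gen4", 16.0, 128.0 / 130.0 },
        { "Gen5", 32.0, 128.0 / 130.0 },
        { "Gen6", 64.0, 1.0 },           /* FLIT-based, no line coding */
    };
    for (int i = 0; i < 6; i++)
        printf("%s: %.3f GB/s per lane per direction\n",
               gens[i].name, gens[i].gts * gens[i].eff / 8.0);
    return 0;
}
```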
An advantage of PCIe is its backward compatibility. Each new version of the standard maintained compatibility with cards and slots of previous versions, ensuring a smooth transition for both users and manufacturers. This feature has been instrumental in PCIe's widespread adoption across various computing platforms, from consumer desktops to enterprise servers.
PCIe's influence extends beyond traditional PCs. It is also used in other forms of technology, such as in high-performance computing, data centers, and industrial computing architectures like [[Backplanes and Standard Form Factors#CompactPCI Serial|CompactPCI Serial]], and even in spacecraft avionics form factors like [[Backplanes and Standard Form Factors#CompactPCI Serial Space|CompactPCI Serial Space]] where PCIe links and lanes travel through backplanes. Its versatility and performance have made it a cornerstone of modern computer architecture. Far from being a mere mechanism to send serial data over an interconnect, PCI Express is a complex protocol stack that includes flow control, link training, and power management capabilities.
We will cover in the following sections the architectural aspects of PCI Express which should remain, to a certain extent, reasonably fixed across revisions.
#### PCI Express Link
A Link represents a communications channel between two components. The fundamental PCI Express Link consists of two, low-voltage, differentially driven signal pairs: a Transmit pair and a Receive pair (see figure below). A PCI Express Link includes a PCIe PHY.

> [!Figure]
> _A PCI Express Link composed of one Lane_
The primary Link attributes for a PCI Express Link are:
- A link consists of dual unidirectional [[Physical Layer#Differential Interconnects|differential]] Links, implemented as a Transmit pair and a Receive pair. A data clock is embedded using an encoding scheme to achieve very high data rates.
- Signaling rate: Once initialized, each Link must only operate at one of the supported signaling levels.
- Lanes: Every PCIe Link must support at least one Lane. Each Lane represents a set of differential signal pairs (one pair for transmission, one pair for reception). To scale bandwidth, a Link may aggregate multiple Lanes denoted by xN where N may be any of the supported Link widths (see [[Physical Layer#Link/Lane aggregation and Channel Bonding|link aggregation]]). A x8 Link operating at the 2.5 GT/s data rate represents an aggregate bandwidth of 20 Gigabits/second of raw bandwidth in each direction. PCIe specifications describe operations for x1, x2, x4, x8, x12, x16, and x32 Lane widths.

> [!Figure]
> _A PCIe link can contain 1 or more lanes (source: #ref/Wilen )_
- Initialization: During hardware initialization, each PCI Express Link is set up following a negotiation of Lane widths and frequency of operation by the two agents at each end of the Link. No firmware or operating system software is involved.
- Symmetry: Each Link must support a symmetric number of Lanes in each direction, i.e., an x16 Link indicates there are 16 differential signal pairs in each direction.
#### PCI Express Fabric Topology
A fabric is composed of point-to-point Links that interconnect a set of components. An example topology is shown in the figure below. This figure illustrates a single fabric instance referred to as a hierarchy -- composed of a Root Complex (RC), multiple Endpoints (I/O devices), a Switch, and a PCI Express to PCI/PCI-X Bridge, all interconnected via PCI Express Links.

> [!Figure]
> _PCI Express Typical Topology_
#### Root Complex
A Root Complex (RC) denotes the root of an I/O hierarchy that connects the CPU/memory subsystem to the I/O. As illustrated in the figure above, a Root Complex may support one or more PCI Express Ports. Each interface defines a separate hierarchy domain. Each hierarchy domain may be composed of a single Endpoint or a sub-hierarchy containing one or more Switch components and Endpoints. The capability to route peer-to-peer transactions between hierarchy domains through a Root Complex is optional and implementation-dependent. For example, an implementation may incorporate a real or virtual Switch internally within the Root Complex to enable full peer-to-peer support in a software-transparent way. Unlike Switches (described below), a Root Complex is generally permitted to split a packet into smaller packets when routing transactions between hierarchy domains, e.g., split a single packet with a 256-byte payload into two packets of 128-byte payload each. The resulting packets are subject to the normal packet formation rules contained in the specification. It should be noted that packet splitting may incur performance penalties.
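As an aside, the payload-splitting rule just described is easy to picture in code. The sketch below is purely illustrative—`emit_packet` is a hypothetical callback standing in for actual TLP formation—and simply chunks a payload so that no piece exceeds a maximum payload size.
```C
#include <stddef.h>

/* Illustrative sketch: split one payload into chunks of at most
 * max_payload bytes, as a Root Complex is permitted to do when
 * routing between hierarchy domains. emit_packet() is a hypothetical
 * callback standing in for actual packet formation. */
void split_payload(const unsigned char *buf, size_t len, size_t max_payload,
                   void (*emit_packet)(const unsigned char *, size_t))
{
    size_t offset = 0;
    while (offset < len) {
        size_t chunk = len - offset;
        if (chunk > max_payload)
            chunk = max_payload; /* e.g., 256-byte payload -> 2 x 128 */
        emit_packet(buf + offset, chunk);
        offset += chunk;
    }
}
```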
#### Endpoints
Endpoint refers to a type of Function that can be the Requester or Completer of a PCI Express transaction either on its own behalf or on behalf of a distinct non-PCI Express device (other than a PCI device or Host CPU), e.g., a PCI Express attached graphics controller or a PCI Express-USB host controller. Endpoints are classified as either legacy, PCI Express, or Root Complex Integrated Endpoints.
#### Switch
A Switch is defined as a logical assembly of multiple virtual PCI-to-PCI Bridge devices as illustrated in the figure below.

> [!Figure]
> _Logical Block Diagram of a Switch_
A PCIe Switch forwards transactions using PCI Bridge mechanisms; e.g., address-based routing. A Switch must forward all types of Transaction Layer Packets ([[High-Speed Standard Serial Interfaces#Transaction Layer Specification|TLP]]) between any set of Ports. A PCIe Switch is not allowed to split a packet into smaller packets. Arbitration between inbound Links of a Switch may be implemented using round robin or weighted round robin when contention occurs on the same Virtual Channel. This is described in more detail in the specification #footnote_needed.
#### PCI Protocol Layers
PCIe is designed following three discrete logical layers: the Transaction Layer, the Data Link Layer, and the Physical Layer (see next figure). Each of these layers is divided into two sections: one that processes outbound (to be transmitted) information and one that processes inbound (received) information.
PCI Express uses packets to communicate information between components. Packets are formed in the Transaction and Data Link Layers to carry the information from the transmitting component to the receiving component. As the transmitted packets flow through the other layers, they are extended with additional information necessary to handle packets at those layers. On the receiving side, the reverse process occurs, and packets get transformed from their Physical Layer representation to the Data Link Layer representation and finally (for Transaction Layer Packets) to the form that can be processed by the Transaction Layer of the receiving device.

> [!Figure]
> _PCI Express Layers_
The figure below shows the conceptual boundaries or jurisdictions for each layer.

> [!Figure]
> _Layers' jurisdictions for a PCI Express Packet_
##### Transaction Layer
The Transaction Layer's primary responsibility is the assembly and disassembly of Transaction Layer Packets (TLPs). TLPs are used to communicate transactions, such as read and write, as well as certain types of events. The Transaction Layer is also responsible for managing flow control for TLPs. Every request packet requiring a response packet is implemented as a split transaction. Each packet has a unique identifier that enables response packets to be directed to the correct originator. The packet format supports different forms of addressing depending on the type of the transaction (Memory, I/O, Configuration, and Message). Packets may also have attributes such as No Snoop, Relaxed Ordering, and ID-Based Ordering (IDO). The Transaction Layer supports four address spaces: it includes the three PCI address spaces (memory, I/O, and configuration) and adds Message Space. The specification uses Message Space to support all prior sideband signals, such as interrupts, power-management requests, and so on, as in-band Message transactions. You could think of PCI Express Message transactions as "virtual wires" since their effect is to eliminate the wide array of sideband signals currently used in a platform implementation.
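Since every request that needs a response is a split transaction, a requester must track outstanding requests so that returning completions can be matched by Transaction ID. The sketch below is a conceptual model only; the table layout and function names are assumptions, and a real device would also track the Requester ID and completion status.
```C
#include <stdbool.h>
#include <stdint.h>

/* Illustrative outstanding-request table keyed by an 8-bit Tag,
 * one component of the PCIe Transaction ID. */
#define MAX_TAGS 256

struct pending { bool in_flight; void *context; };
static struct pending table[MAX_TAGS];

int alloc_tag(void *context)            /* called when issuing a request */
{
    for (int tag = 0; tag < MAX_TAGS; tag++) {
        if (!table[tag].in_flight) {
            table[tag].in_flight = true;
            table[tag].context = context;
            return tag;
        }
    }
    return -1;                          /* all tags outstanding */
}

void *complete_tag(uint8_t tag)         /* called when a completion arrives */
{
    table[tag].in_flight = false;
    return table[tag].context;          /* hand result back to originator */
}
```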
###### Transaction Layer Services
The Transaction Layer, in the process of generating and receiving TLPs, exchanges Flow Control information with its complementary Transaction Layer on the other side of the Link. It is also responsible for supporting both software and hardware-initiated power management.
Initialization and configuration functions require the Transaction Layer to:
- Store Link configuration information generated by the processor or management device
- Store Link capabilities generated by Physical Layer hardware negotiation of width and operational frequency
The Transaction Layer's packet generation and processing services require it to:
- Generate TLPs from device core Requests
- Convert received Request TLPs into Requests for the device core
- Convert received Completion Packets into a payload, or status information, deliverable to the core
- Detect unsupported TLPs and invoke appropriate mechanisms for handling them
- If end-to-end data integrity is supported, generate the end-to-end data integrity CRC and update the TLP header accordingly.
Flow control services:
- The Transaction Layer tracks flow control credits for TLPs across the Link.
- Transaction credit status is periodically transmitted to the remote Transaction Layer using transport services of the Data Link Layer.
- Remote Flow Control information is used to throttle TLP transmission.
Ordering rules:
- PCI/PCI-X compliant producer-consumer ordering model
- Extensions to support Relaxed Ordering
- Extensions to support ID-based ordering
Power management services:
- ACPI/PCI power management, as dictated by system software.
- Hardware-controlled autonomous power management minimizes power during full-on power states.
Virtual Channels and Traffic Class: Virtual channels in PCI Express are a feature that enables the segregation of traffic based on their bandwidth and latency requirements. Essentially, they create multiple logical data flows within a single physical connection. Each virtual channel can be assigned different priorities and resources. This is particularly useful in environments where multiple types of data (such as control, management, and payload data) are transmitted simultaneously.
- The combination of Virtual Channel mechanism and Traffic Class identification is provided to support differentiated services and QoS support for certain classes of applications.
- Virtual Channels: Virtual Channels provide a means to support multiple independent logical data flows over given common physical resources of the Link. Conceptually this involves multiplexing different data flows onto a single physical Link.
- Traffic Class: The Traffic Class is a Transaction Layer Packet label that is transmitted unmodified end-to-end through the fabric. At every service point (e.g., Switch) within the fabric, Traffic Class labels are used to apply appropriate servicing policies. Each Traffic Class label defines a unique ordering domain - no ordering guarantees are provided for packets that contain different Traffic Class labels.
###### Transaction Layer Specification
The Transaction Layer comprises the following:
- TLP (Transaction Layer Packet) construction and processing
- Association of transaction-level mechanisms with device resources, including:
	- Flow Control
	- Virtual Channel management
- Rules for ordering and management of TLPs
	- PCI/PCI-X compatible ordering
	- Including Traffic Class differentiation
Transactions form the basis for information transfer between a Requester and a Completer. Four address spaces are defined, and different Transaction types are defined, each with its own unique intended usage:
| **ADDRESS SPACE** | **TRANSACTION TYPES** | **BASIC USAGE** |
| ----------------- | ------------------------------------ | ------------------------------------------------------------ |
| **Memory** | Read/Write | Transfer data to/from a memory-mapped location |
| **I/O** | Read/Write | Transfer data to/from an I/O-mapped location |
| **Configuration** | Read/Write | Device function configuration and setup |
| **Message** | Baseline (including vendor-defined) | From event signaling mechanism to general-purpose messaging |
###### Packet Format Overview
Transactions consist of Requests and Completions, which are communicated using packets. PCI Express conceptually transfers information as a serialized stream of bytes as shown below. Note that at the byte level, information is transmitted/received over the interconnect with the leftmost byte of the TLP being transmitted/received first (byte 0 if one or more optional TLP Prefixes are present else byte H). The header layout is optimized for performance on a serialized interconnect, driven by the requirement that the most time-critical information be transferred first. For example, within the TLP header, the most significant byte of the address field is transferred first so that it may be used for early address decode.


> [!Figure]
> _Generic TLP format_
PCI Express uses a packet-based protocol to exchange information between the Transaction Layers of the two components communicating with each other over the Link. PCI Express supports the following basic transaction types: Memory, I/O, Configuration, and Messages. Two addressing formats for Memory Requests are supported: 32-bit and 64-bit. Transactions are carried out using Requests and Completions. Completions are used only where required, for example, to return read data, or to acknowledge Completion of I/O and Configuration Write Transactions. Completions are associated with their corresponding Requests by the value in the Transaction ID field of the Packet header. All TLP fields marked Reserved (sometimes abbreviated as R) must be filled with all 0's when a TLP is formed. Values in such fields must be ignored by Receivers and forwarded unmodified by Switches. Note that for certain fields there are both specified and Reserved values; the handling of Reserved values in these cases is specified separately for each case.

> [!Figure]
> _Fields Present in All TLPs_
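To make the header layout tangible, the following hedged sketch extracts the Fmt, Type, Traffic Class, and Length fields from the first DWORD of a TLP header. It assumes the DWORD was assembled with byte 0 (transmitted first) in the most significant position; real implementations must also validate the Fmt/Type combination.
```C
#include <stdint.h>
#include <stdio.h>

/* Extract generic TLP header fields from the first header DWORD.
 * Assumes dw0 was assembled with byte 0 of the TLP in the most
 * significant byte position (as transmitted first on the wire). */
void decode_tlp_dw0(uint32_t dw0)
{
    uint32_t fmt    = (dw0 >> 29) & 0x7;  /* Fmt[2:0]: header size, data present */
    uint32_t type   = (dw0 >> 24) & 0x1F; /* Type[4:0]: transaction type */
    uint32_t tc     = (dw0 >> 20) & 0x7;  /* Traffic Class */
    uint32_t length = dw0 & 0x3FF;        /* payload length in DWORDs */
    printf("Fmt=%u Type=0x%02X TC=%u Length=%u DW\n", fmt, type, tc, length);
}
```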
##### Data Link Layer
The middle layer in the simple PCIe stack, the Data Link Layer, serves as an intermediate stage between the Transaction Layer and the Physical Layer. The primary responsibilities of the Data Link Layer include Link management and data integrity, including error detection and error correction. The transmission side of the Data Link Layer accepts TLPs assembled by the Transaction Layer, calculates and applies a data protection code and TLP sequence number, and submits them to the Physical Layer for transmission across the Link. The receiving Data Link Layer is responsible for checking the integrity of received TLPs and for submitting them to the Transaction Layer for further processing. On detection of TLP error(s), this Layer is responsible for requesting retransmission of TLPs until information is correctly received, or the Link is determined to have failed. The Data Link Layer also generates and consumes packets that are used for Link management functions. To differentiate these packets from those used by the Transaction Layer (TLP), the term Data Link Layer Packet (DLLP) will be used when referring to packets that are generated and consumed at the Data Link Layer.
###### Data Link Services
The Data Link Layer is responsible for reliably exchanging information with its counterpart on the opposite side of the Link. It includes:
Initialization and power management services:
- Accept power state Requests from the Transaction Layer and convey to the Physical Layer
- Convey active/reset/disconnected/power managed state to the Transaction Layer
Data protection, error checking, and retry services:
- CRC generation
- Transmitted TLP storage for Data Link level retry
- Error checking
- TLP acknowledgment and retry Messages
- Error indication for error reporting and logging
The Data Link Layer is responsible for reliably conveying Transaction Layer Packets (TLPs) supplied by the Transaction Layer across a PCI Express Link to the other component's Transaction Layer. Services provided by the Data Link Layer include:
Data Exchange:
- Accept TLPs for transmission from the Transmit Transaction Layer and convey them to the Transmit Physical Layer
- Accept TLPs received over the Link from the Physical Layer and convey them to the Receive Transaction Layer
Error Detection and Retry:
- TLP Sequence Number and LCRC generation
- Transmitted TLP storage for Data Link Layer Retry
- Data integrity checking for TLPs and Data Link Layer Packets (DLLPs)
- Positive and negative acknowledgement DLLPs
- Error indications for error reporting and logging mechanisms
- Link Acknowledgement Timeout replay mechanism
Initialization and power management:
- Track Link state and convey active/reset/disconnected state to Transaction Layer
Data Link Layer Packets (DLLPs) are used for Link Management functions including TLP acknowledgment, power management, and exchange of Flow Control information. Data Link Layer Packets are transferred between Data Link Layers of the two directly connected components on a Link.
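The ack/nak retry mechanism described above can be modeled conceptually as follows. This sketch only shows the idea of holding transmitted TLPs in a replay buffer keyed by a 12-bit sequence number; real hardware also handles sequence-number wraparound, the replay timer, and LCRC generation, all omitted here.
```C
#include <stdbool.h>
#include <stdint.h>

/* Conceptual replay buffer: TLPs stay queued until acknowledged. */
#define REPLAY_DEPTH 64

struct tlp_entry { uint16_t seq; bool valid; /* plus the TLP bytes... */ };
static struct tlp_entry replay[REPLAY_DEPTH];
static uint16_t next_seq;

void dll_transmit(struct tlp_entry *t)
{
    t->seq = next_seq++ % 4096;          /* 12-bit TLP sequence number */
    t->valid = true;
    replay[t->seq % REPLAY_DEPTH] = *t;  /* hold a copy for possible retry */
    /* ...append LCRC and hand the TLP to the Physical Layer... */
}

void dll_receive_ack(uint16_t acked_seq) /* Ack DLLP from the far end */
{
    /* Release every buffered TLP up to and including acked_seq
     * (wraparound handling omitted for clarity). */
    for (int i = 0; i < REPLAY_DEPTH; i++)
        if (replay[i].valid && replay[i].seq <= acked_seq)
            replay[i].valid = false;
}

void dll_receive_nak(uint16_t bad_seq)   /* Nak DLLP: replay from bad_seq */
{
    for (int i = 0; i < REPLAY_DEPTH; i++)
        if (replay[i].valid && replay[i].seq >= bad_seq) {
            /* re-submit replay[i] to the Physical Layer */
        }
}
```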
##### Physical Layer
Since we have been focusing on the physical layer in general, we will delve into the physical layer of PCI Express in greater depth.
Fundamentally, the Physical Layer isolates the Transaction and Data Link Layers from the signaling technology used for Link data interchange.
The Physical Layer is divided into the logical and electrical sub-blocks, and it includes all circuitry for interface operation, including driver and input buffers, parallel-to-serial and serial-to-parallel conversion, PLLs, and impedance matching circuitry. It also includes logical functions related to interface initialization and maintenance.
The Physical Layer exchanges information with the Data Link Layer in an implementation-specific format. This Layer is responsible for converting information received from the Data Link Layer into an appropriate serialized format and transmitting it across the PCI Express Link at a frequency and width compatible with the device connected to the other side of the Link. The PCI Express architecture has "hooks" to support future performance enhancements via speed upgrades and advanced encoding techniques; future speeds, encoding techniques, or media may only impact the Physical Layer definition.
It is important to note that the logical sublayer is sometimes further divided into a MAC sublayer and a PCS. This division is not formally part of the PCIe specification, although it is reasonably easy to find content online referring to a PCIe PCS. The reason for the confusion is a separate specification published[^37] by Intel, the PHY Interface for PCI Express (called PIPE)—intended to enable the development of functionally equivalent PCI Express, SATA, and USB PHYs—which defines the MAC/PCS functional partitioning and the interface between these two sublayers. The PIPE specification also identifies the physical media attachment (PMA) layer, which includes the serializer/deserializer (SerDes) and other analog circuitry; however, since SerDes implementations vary greatly among ASIC vendors, PIPE does not specify an interface between the PCS and PMA.
| | Lanes | Pairs | Gbaud | Encoding Bits/Baud | Gbps/ Direction | Gbytes/s per Direction | Directions | Total Gbytes/s |
| --------------------------------------------- | ----- | ----- | ----- | ------------------ | --------------- | ---------------------- | ---------- | -------------- |
| **PCIe (2, 12, and 32 lanes also supported)** | | | | | | | | |
| x1 gen1 | 1 | 2 | 2.5 | 8/10 | 2.000 | 0.250 | 2 | 0.500 |
| x4 gen1 | 4 | 8 | 2.5 | 8/10 | 8.000 | 1.000 | 2 | 2.000 |
| x8 gen1 | 8 | 16 | 2.5 | 8/10 | 16.000 | 2.000 | 2 | 4.000 |
| x16 gen1 | 16 | 32 | 2.5 | 8/10 | 32.000 | 4.000 | 2 | 8.000 |
| x1 gen2 | 1 | 2 | 5.0 | 8/10 | 4.000 | 0.500 | 2 | 1.000 |
| x4 gen2 | 4 | 8 | 5.0 | 8/10 | 16.000 | 2.000 | 2 | 4.000 |
| x8 gen2 | 8 | 16 | 5.0 | 8/10 | 32.000 | 4.000 | 2 | 8.000 |
| x16 gen2 | 16 | 32 | 5.0 | 8/10 | 64.000 | 8.000 | 2 | 16.000 |
| x1 gen3 | 1 | 2 | 8.0 | 128/130 | 7.877 | 0.985 | 2 | 1.969 |
| x4 gen3 | 4 | 8 | 8.0 | 128/130 | 31.508 | 3.938 | 2 | 7.877 |
| x8 gen3 | 8 | 16 | 8.0 | 128/130 | 63.015 | 7.877 | 2 | 15.754 |
| x16 gen3 | 16 | 32 | 8.0 | 128/130 | 126.031 | 15.754 | 2 | 31.508 |
###### Physical Layer Services
The PCI Express physical layer is responsible for interface initialization, maintenance control, and status tracking:
- Reset/Hot-Plug control/status
- Interconnect power management
- Width and Lane mapping negotiation
- Lane polarity inversion
Symbol and special Ordered Set generation:
- 8b/10b encoding/decoding
- Embedded clock tuning and alignment
Symbol transmission and alignment:
- Transmission circuits
- Reception circuits
- Elastic buffer at the receiving side
- Multi-Lane de-skew (for widths > x1) at receiving side
###### Logical Sub Block
The logical sub-block has two main sections: a Transmit section or unit that prepares outgoing information passed from the Data Link Layer for transmission by the electrical sub-block, and a Receiver section or unit that identifies and prepares received information before passing it to the Data Link Layer. The logical sub-block and electrical sub-block coordinate the state of each Transceiver through a status and control register interface or functional equivalent. The logical sub-block directs control and management functions of the Physical Layer. PCI Express uses 8b/10b encoding when the data rate is 2.5 GT/s or 5.0 GT/s. For data rates of 8.0 GT/s and higher, PCI Express uses 128b/130b encoding with per-lane scrambling.

> [!Figure]
> _Logical Sub-Block Primary Stages_
The primary purpose of 8-bit/10-bit encoding is to [[Physical Layer#Self-synchronous|embed a clock signal into the data stream]]. By embedding a clock into the data, this encoding scheme renders external clock signals unnecessary. As we saw on the signal integrity page, this does not come without a dose of challenges.
For 2.5 GT/s and 5.0 GT/s data rates, PCI Express uses the aforementioned 8b/10b transmission code. The definition of this transmission code is identical to that specified in ANSI X3.230-1994, clause 11 (and also IEEE 802.3z, 36.2.4). Using this scheme, 8-bit data characters are treated as 3 bits and 5 bits mapped onto a 4-bit code group and a 6-bit code group, respectively. The control bit in conjunction with the data character is used to identify when to encode one of the 12 Special Symbols included in the 8b/10b transmission code. These code groups are concatenated to form a 10-bit Symbol.

> [!Figure]
> _Character to Symbol Mapping_
The bits of a Symbol are placed on a Lane starting with bit "a" and ending with bit "j".

> [!Figure]
> _Bit Transmission Order on Physical Lanes - x1 Example_

> [!Figure]
> _Bit Transmission Order on Physical Lanes - x4 Example_
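The 3-bit/5-bit split can be sketched in a few lines of C. The code-group tables below are placeholders (the real 5b/6b and 3b/4b tables from ANSI X3.230 assign one or two code groups per value, selected by running disparity), so only the structural idea is shown.
```C
#include <stdint.h>

/* Placeholder tables: the real 8b/10b code (ANSI X3.230) assigns each
 * 5-bit and 3-bit value one or two code groups chosen by running
 * disparity; here they are left zeroed to show only the structure. */
static uint8_t table_5b6b[32]; /* 5-bit value -> 6-bit code group */
static uint8_t table_3b4b[8];  /* 3-bit value -> 4-bit code group */

uint16_t encode_8b10b(uint8_t byte)
{
    uint8_t low5   = byte & 0x1F;             /* EDCBA: 5-bit sub-block */
    uint8_t high3  = (uint8_t)(byte >> 5);    /* HGF:   3-bit sub-block */
    uint8_t abcdei = table_5b6b[low5];        /* 6-bit group, bits a..i */
    uint8_t fghj   = table_3b4b[high3];       /* 4-bit group, bits f..j */
    return (uint16_t)((abcdei << 4) | fghj);  /* 10-bit Symbol, bit a first */
}
```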
###### Electrical Sub Block
The electrical sub-block works as the delivery mechanism for the physical architecture. The electrical sub-block contains transmit and receive buffers that transform the data into electrical signals that can be transmitted across the link. The electrical sub-block may also contain the PLL circuitry, which provides internal clocks for the device. The following paragraphs describe how the signaling of PCI Express works and what a PLL (Phase-Locked Loop) actually does. The concepts of AC coupling and de-emphasis are also discussed briefly.
Serial/Parallel Conversion:
The transmit buffer in the electrical sub-block takes the encoded/packetized data from the logical sub-block and converts it into serial format. Once the data has been serialized it is then routed to an associated lane for transmission across the link. On the receive side, the receivers deserialize the data and feed it back to the logical sub-block for further processing.
Clock Extraction:
In addition to the parallel-to-serial conversion described above, the receive buffer in the electrical sub-block is responsible for recovering the link clock that has been embedded in the data. With every incoming bit transition, the receive side PLL circuits are resynchronized to maintain bit and symbol (10 bits) lock.
Lane-to-Lane De-Skew:
PCI Express uses [[Physical Layer#Differential Interconnects|differential signaling]]. The receive buffer in the electrical sub-block [[Physical Layer#Line-to-Line Skew|de-skews]] data from the various lanes of the link before assembling the serial data into a parallel data packet. This is necessary to compensate for the allowable 20 nanoseconds of lane-to-lane skew. Depending on the flight time characteristics of a given transmission medium this could correlate to nearly 7 inches of variance from lane to lane. The actual amount of skew the receive buffer must compensate for is discovered during the [[Physical Layer#Adaptive Equalization Backchannel|training process for the link]].
Phase Locked Loop (PLL):
A clock derived from a PLL circuit may provide the internal clocking to the PCI Express device. Each PCI Express device is given a 100 megahertz differential pair clock. This clock can be fed into a PLL circuit, which multiplies it by 25 to achieve the 2.5 gigahertz PCI Express frequency. The PLL clocks may be used to drive the state machines that configure the link and transmit and receive data.
AC Coupling:
PCI Express uses AC coupling on the transmit side of the differential pair to eliminate the DC common-mode element. By removing the DC common-mode element, the buffer design process for PCI Express becomes much simpler. Each PCI Express device can have a unique DC common-mode voltage, which is used during the detection process. The link AC coupling removes the common-mode element from the view of the receiving device. The range of AC coupling capacitance permitted by the PCI Express specification is 75 to 200 nF.
Equalization and Link Training
When two PCIe devices communicate, they must agree on the conditions through which this communication will take place. This includes negotiating not only data rates but also signal integrity aspects using data patterns and measurements.
Tx Voltage Parameters
Tx voltage parameters include equalization coefficients, equalization presets, and min/max voltage swings. Tx voltage swing and equalization presets at 8.0 and 16 GT/s are measured using a low-frequency pattern within the compliance pattern. Consisting of a sequence of 64 zeroes followed by 64 ones, this pattern is used to perform an accurate measurement of voltage since ISI effects will have decayed and the signal will approach a steady state.
Tx Equalization
A PCIe transmitter implements a coefficient-based equalization mode to support fine-grained control over Tx equalization resolution. Additionally, transmitters support several presets that give a coarser control over Tx equalization resolution. Both coefficient space and preset space are controllable through messaging from the receiver as part of an equalization procedure. The equalization procedure operates on the same physical path as normal signaling and is implemented via extensions to the existing protocol link layer. All transmitters in PCIe implement the equalization procedure, whereas receivers may optionally implement it. PCIe Tx equalization is implemented as a FIR filter architecture (see figure below).

> [!Figure]
> _Tx Equalization FIR Representation_
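A hedged sketch of the FIR idea from the figure: a pre-cursor tap (C-1), a main cursor (C0), and a post-cursor tap (C+1) weight the next, current, and previous transmitted symbols. The coefficient values used in the example are arbitrary illustrations, not presets from the specification.
```C
#include <stdio.h>

/* 3-tap transmit FIR: out[n] = c_pre*x[n+1] + c_main*x[n] + c_post*x[n-1].
 * Symbols are NRZ levels (+1/-1); coefficients are illustrative only. */
void tx_fir(const double *x, double *out, int n,
            double c_pre, double c_main, double c_post)
{
    for (int i = 0; i < n; i++) {
        double prev = (i > 0)     ? x[i - 1] : 0.0;
        double next = (i < n - 1) ? x[i + 1] : 0.0;
        out[i] = c_pre * next + c_main * x[i] + c_post * prev;
    }
}

int main(void)
{
    double bits[] = { -1, -1, +1, +1, +1, -1, +1 };
    double shaped[7];
    tx_fir(bits, shaped, 7, -0.1, 0.7, -0.2); /* de-emphasis-like setting */
    for (int i = 0; i < 7; i++)
        printf("%+.2f ", shaped[i]);
    printf("\n");
    return 0;
}
```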
Link Training:
The Link equalization procedure enables components to adjust the Transmitter and the Receiver setup of each Lane to improve the signal quality and meet the requirements specified by the standard when operating at 8.0 GT/s and higher data rates. All the Lanes that are currently operational or may be operational in the future participate in the Equalization procedure. The procedure must be executed during the first data rate change to 8.0 GT/s as well as the first change to all data rates greater than 8.0 GT/s. Components must not require that the equalization procedure be repeated at any data rate for reliable operation, although there is provision to repeat the procedure. Components must store the Transmitter setups that were agreed to during the equalization procedures and use them for future operation at 8.0 GT/s and higher data rates. Components are permitted to fine-tune their Receiver setup even after the equalization procedure is complete as long as doing so does not cause the Link to be unreliable. The equalization procedure can be initiated either autonomously or by software. The PCIe specification strongly recommends that components use the autonomous mechanism. However, a component that chooses not to participate in the autonomous mechanism must rely on its associated software to ensure that the software-based mechanism is applied.
Rx Equalization and Testing
All PCIe receiver speeds (2.5, 5.0, 8.0, and 16 GT/s) must be tested using a stressed eye applied over a calibration channel that approximates the near worst-case loss characteristics encountered in an actual channel. The recovered eye is defined at the input to the receiver's latch. For 2.5 GT/s and 5.0 GT/s, this point is equivalent to the Rx die pad; for 8 GT/s and 16 GT/s, it is equivalent to the signal at the Rx die pad after behavioral Rx equalization has been applied.

> [!Figure]
> _Rx Test Topology_
###### PHY blocks
In general, the logical and electrical subblocks in PCI Express are managed by specialized devices called PHYs. PHYs can be standalone chips, or they might be included as IP cores or included in [[Semiconductors#SerDes and High-Speed Transceivers|SerDes and transceivers]] in FPGAs and System-on-Chips.
A PHY device, either standalone or as a core, will interface between the electrical medium and a PCIe data link layer (or MAC in the PIPE terminology).
Standalone PHYs:
Examples of single-lane, standalone PCIe PHYs are for instance the XIO1100 (Gen1) from Texas Instruments and PX1011B (Gen1) from NXP.
The XIO1100 PHY complies with the PCI Express PHY as defined by the Physical Layer specifications of the PCI-SIG PCI Express Base Specification. The XIO1100 conforms to the functional behavior described in the PHY Interface for the PCI Express Architecture by Intel Corporation. The XIO1100 comes in a 100-pin GGB BGA package.
The PX1011B from NXP is a high-performance, low-power, single-lane PCI Express electrical Physical layer (PHY) that handles the low-level PCI Express protocol and signaling, in an LFBGA81 package. The PX1011B PCI Express PHY is compliant with the PCI Express Base Specification, Rev. 1.0a, and Rev. 1.1. The PX1011B includes features such as Clock and Data Recovery (CDR), data serialization and de-serialization, 8b/10b encoding, analog buffers, elastic buffer, and receiver detection.
The PX1011B is a 2.5 Gbit/s PCI Express PHY with an 8-bit data PXPIPE interface. Its PXPIPE interface is a superset of the PHY Interface for the PCI Express (PIPE) specification, enhanced and adapted for off-chip applications with the introduction of a source synchronous clock for transmit and receive data.

> [!Figure]
> _XIO1100 Schematic Diagram with external components (source: TI)_

> [!Figure]
> _XIO1100 (Source: TI)_

> [!Figure]
> _PX1011B (Source: NXP)_
When implementing PHYs in IP cores, the transceiver technology needs to be compatible with the intended PHY.
#### PCI Express Transactions
The PCI Express architecture defines four transaction types: memory, I/O, configuration, and message. This is similar to the traditional PCI transactions, with the notable difference being the addition of a message transaction type.
Memory Transactions: transactions targeting the memory space transfer data to or from a memory-mapped location. There are several types of memory transactions: Memory Read Request, Memory Read Completion, and Memory Write Request. Memory transactions use one of two different address formats, either 32-bit addressing (short address) or 64-bit addressing (long address).
I/O Transactions: transactions targeting the I/O space transfer data to or from an I/O-mapped location. PCI Express supports this address space for compatibility with existing devices that utilize this space. There are several types of I/O transactions: I/O Read Request, I/O Read Completion, I/O Write Request, and I/O Write Completion. I/O transactions use only 32-bit addressing (short address format).
Configuration Transactions: transactions targeting the configuration space are used for device configuration and setup. These transactions access the configuration registers of PCI Express devices. Compared to traditional PCI, PCI Express allows for many more configuration registers. For each function of each device, PCI Express defines a configuration register block four times the size of PCI. There are several types of configuration transactions: Configuration Read Request, Configuration Read Completion, Configuration Write Request, and Configuration Write Completion.
Message Transactions: PCI Express adds a new transaction type to communicate a variety of miscellaneous messages between PCI Express devices. Referred to simply as messages, these transactions are used for things like interrupt signaling, error signaling or power management. This address space is a new addition for PCI Express and is necessary since these functions are no longer available via sideband signals such as PME\#, IERR\#, and so on.
#### Flow Control
The PCI Express specification defines several ordering rules to govern which types of transactions are allowed to pass or be passed. Passing occurs when a newer transaction bypasses a previously issued transaction and the device executes the newer transaction first. The ordering rules apply uniformly to all transaction types—memory, I/O, configuration, and messages—but only within a given traffic class. There are no ordering rules between transactions with different traffic classes.
PCI Express provides a virtual channel mechanism that, along with traffic class designations, serves to facilitate traffic flow throughout the system. The basic idea behind virtual channels is to provide independent resources (queues, buffers, and so on) that allow fully independent flow control between different virtual channels. Conceptually, traffic that flows through multiple virtual channels is multiplexed onto a single physical link and then de-multiplexed on the receiver side.
Virtual channels are logical representations of channels between two devices. The finite physical link bandwidth is divided up amongst the supported virtual channels as appropriate. Each virtual channel has its own set of queues and buffers, control logic, and a credit-based mechanism to track how full or empty those buffers are on each side of the link. If the receive queues and buffers for a virtual channel on one side of the link are full, then no further transactions can travel across that virtual channel until space is made available.

> [!Figure]
> _Flow Control through Virtual Channels and Traffic Classes (source: #ref/Wilen )_
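The credit mechanism can be reduced to a simple gate on the transmit side: send a TLP only if enough credits remain for its virtual channel, and replenish the counter when the receiver returns credits. The sketch below is a conceptual model with assumed names; real PCIe keeps separate header and data credit pools for Posted, Non-Posted, and Completion traffic.
```C
#include <stdbool.h>
#include <stdint.h>

/* Conceptual per-VC credit gate. Real PCIe keeps separate header and
 * data credit pools for Posted, Non-Posted, and Completion traffic. */
#define NUM_VCS 8

static uint32_t credits_available[NUM_VCS]; /* updated by FC DLLPs */

bool can_transmit(int vc, uint32_t credits_needed)
{
    return credits_available[vc] >= credits_needed;
}

void on_transmit(int vc, uint32_t credits_consumed)
{
    credits_available[vc] -= credits_consumed;
}

void on_fc_update(int vc, uint32_t credits_returned)
{
    /* The receiver freed buffer space and advertised it via an
     * UpdateFC DLLP; replenish the transmit-side counter. */
    credits_available[vc] += credits_returned;
}
```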
#### Using PCIe from Application Software
Let's now dive into how we would use a PCI Express device from application software. If we assume a Linux environment and a topology like the ones described above, the application software will execute in the CPU and interact with a PCIe device (connected under the Root Complex) using a driver. But before the application software can use the device, and even before the driver can kick in, some important things need to happen in the device.
When a PCIe device receives power, it initially remains inactive, only responding to configuration transactions. At this stage, the device does not have any memory or I/O ports assigned within the computer's address space, and all specific features of the device, such as its ability to report interrupts, are also disabled. However, all PCIe boards come equipped with firmware that allows interaction with the device's configuration address space through the reading and writing of registers in the PCIe controller. During system startup, this firmware, or the Linux kernel if it's set up that way, carries out configuration transactions with the PCIe device. This process allocates a designated space for the address regions the device requires. Consequently, by the time a device driver starts interacting with the PCI device, its memory and I/O regions have already been allocated and set up.
At its core, the PCIe driver's primary responsibility is to initialize and control the PCIe device, ensuring that it functions correctly within the system.
The driver begins by registering itself with the kernel's PCI subsystem, specifying which devices it is capable of managing, typically identified by vendor and device IDs. This registration process makes the driver aware of any relevant PCIe devices connected to the system.
Once the operating system detects a compatible PCIe device, it invokes the driver's probe function. This function is necessary for initializing the device, which involves setting up device registers, configuring device-specific settings, and mapping any necessary memory regions into the kernel's address space for direct access.
In addition to initialization, the PCIe driver must handle data transfers between the device and the system. This often involves implementing read-and-write operations that facilitate data movement to and from the device's memory or registers. For high-performance devices, this could also include setting up Direct Memory Access (DMA) operations, enabling the device to transfer data directly to and from system memory, bypassing the CPU to increase throughput and efficiency.
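For the DMA case just described, a driver typically allocates a coherent buffer that both the CPU and the device can address. Below is a minimal, hedged sketch using the Linux kernel DMA API; the buffer size, the 32-bit address mask, and the idea of writing the bus address to a device register are assumptions for illustration.
```C
#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Minimal sketch: allocate a DMA-coherent buffer for the device.
 * bus_addr is what the device uses; cpu_addr is what the driver uses. */
static void *cpu_addr;
static dma_addr_t bus_addr;

static int setup_dma(struct pci_dev *pdev)
{
	/* Declare which addresses the device can generate
	 * (assumption: a 32-bit-capable device). */
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
		return -EIO;

	cpu_addr = dma_alloc_coherent(&pdev->dev, 4096, &bus_addr, GFP_KERNEL);
	if (!cpu_addr)
		return -ENOMEM;

	/* bus_addr would then be written to a device register
	 * (device-specific, omitted here). */
	return 0;
}
```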
Another important aspect of the PCIe driver is managing interrupts. PCIe devices often use interrupts to notify the system about various events, such as the completion of data transfers or the occurrence of errors. The driver must be able to handle these interrupts appropriately, which typically involves clearing the interrupt on the device and performing any necessary follow-up actions.
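Similarly, here is a hedged sketch of how a driver might set up a single MSI interrupt with the modern Linux PCI API; the handler body is device-specific and therefore left as a comment.
```C
#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	/* Read/clear the device's interrupt status register here
	 * (device-specific), then perform any follow-up work. */
	return IRQ_HANDLED;
}

static int setup_irq(struct pci_dev *pdev)
{
	/* Request exactly one MSI vector */
	int nvecs = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
	if (nvecs < 0)
		return nvecs;

	/* pci_irq_vector() maps vector index 0 to a Linux IRQ number */
	return request_irq(pci_irq_vector(pdev, 0), my_irq_handler, 0,
			   "my_pci_device", pdev);
}
```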
Additionally, the driver will provide an interface to user-space application software, enabling it to interact with the PCIe device. This is usually achieved through creating device files in the /dev directory, which user-space programs can access using standard system calls like `open`, `read`, `write`, and `ioctl`. Through these system calls, applications can perform operations like sending data to the device, reading data from it, or controlling specific device functions.
Finally, a PCIe driver must also include clean-up and exit routines. These routines are executed when the device is removed or when the driver is unloaded from the system. They ensure that all allocated resources are freed, and the device is left in a stable state. Let's see a schematic example below.
```C
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/device.h>

static int my_pci_major;
static struct cdev my_cdev;
static struct class *my_class = NULL;

/* Devices this driver manages, matched by vendor/device ID */
static struct pci_device_id my_pci_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_YOUR_VENDOR, PCI_DEVICE_ID_YOUR_DEVICE), },
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, my_pci_ids);

static int my_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) {
	printk(KERN_INFO "PCI device found and probed\n");
	/* Enable the device before touching it; further initialization
	 * (BAR mapping, DMA setup, interrupts) goes here */
	if (pci_enable_device(pdev))
		return -EIO;
	return 0;
}

static void my_pci_remove(struct pci_dev *pdev) {
	printk(KERN_INFO "PCI device removed\n");
	/* Release resources acquired in probe */
	pci_disable_device(pdev);
}

static struct pci_driver my_pci_driver = {
	.name = "my_pci_driver",
	.id_table = my_pci_ids,
	.probe = my_pci_probe,
	.remove = my_pci_remove,
};

static int my_pci_open(struct inode *inode, struct file *file) {
	// Open code here
	return 0;
}

static int my_pci_close(struct inode *inode, struct file *file) {
	// Close code here
	return 0;
}

static ssize_t my_pci_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) {
	// Read code here
	return 0;
}

static ssize_t my_pci_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) {
	// Write code here
	return count;
}

static struct file_operations my_pci_fops = {
	.owner = THIS_MODULE,
	.open = my_pci_open,
	.release = my_pci_close,
	.read = my_pci_read,
	.write = my_pci_write,
};

static int __init my_pci_init(void) {
	dev_t dev_id;
	int result;

	/* Reserve a major/minor number for the character device */
	result = alloc_chrdev_region(&dev_id, 0, 1, "my_pci_device");
	if (result < 0) {
		printk(KERN_WARNING "my_pci: unable to get major number\n");
		return result;
	}
	my_pci_major = MAJOR(dev_id);

	/* Register the character device and its file operations */
	cdev_init(&my_cdev, &my_pci_fops);
	my_cdev.owner = THIS_MODULE;
	result = cdev_add(&my_cdev, dev_id, 1);
	if (result < 0) {
		printk(KERN_WARNING "Error %d adding my_pci", result);
		unregister_chrdev_region(dev_id, 1);
		return result;
	}

	/* Create /dev/my_pci_device so user space can reach the driver */
	my_class = class_create(THIS_MODULE, "my_pci_class");
	if (IS_ERR(my_class)) {
		cdev_del(&my_cdev);
		unregister_chrdev_region(dev_id, 1);
		return PTR_ERR(my_class);
	}
	device_create(my_class, NULL, dev_id, NULL, "my_pci_device");

	/* Finally, register with the PCI subsystem */
	return pci_register_driver(&my_pci_driver);
}

static void __exit my_pci_exit(void) {
	dev_t dev_id = MKDEV(my_pci_major, 0);

	pci_unregister_driver(&my_pci_driver);
	device_destroy(my_class, dev_id);
	class_destroy(my_class);
	cdev_del(&my_cdev);
	unregister_chrdev_region(dev_id, 1);
}

module_init(my_pci_init);
module_exit(my_pci_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("Simple PCIe Driver with /dev node creation");
```
Once compiled, this would become a loadable module to be dynamically linked to the kernel utilizing the `insmod` command. The user-space application code that uses the loadable module would look like this:
```C
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Hypothetical ioctl command; a real driver would define this in a
 * header shared with user space using the _IO/_IOW macros */
#define MY_PCI_IOCTL_COMMAND _IO('p', 1)

int main() {
	int fd;
	char buffer[128];

	// Open the device file
	fd = open("/dev/my_pci_device", O_RDWR);
	if (fd < 0) {
		perror("Failed to open device");
		return -1;
	}

	// Read data from the device
	if (read(fd, buffer, sizeof(buffer)) < 0) {
		perror("Failed to read from device");
	}

	// Write data to the device
	if (write(fd, buffer, sizeof(buffer)) < 0) {
		perror("Failed to write to device");
	}

	// Perform a device-specific operation
	if (ioctl(fd, MY_PCI_IOCTL_COMMAND, 0) < 0) {
		perror("Failed to perform ioctl");
	}

	// Close the device file
	close(fd);
	return 0;
}
```
## Universal Chiplet Interconnect Express (UCIe)
Although we could have placed this interface standard next to [[Semiconductors#System-on-Chips|System-on-Chip]] matters, or even more accurately in the [[Semiconductors#System-on-Chips#On-Chip Interconnects|on-chip interconnects]] section, we will cover it here because, technically speaking, UCIe is an inter-chip protocol and not an on-chip one. Also, given its close relationship with [[High-Speed Standard Serial Interfaces#PCI Express|PCIe]], it is convenient to discuss it alongside its sibling protocol.
Universal Chiplet Interconnect Express (UCIe) is an open industry standard designed to enable efficient communication between [[Semiconductors#Chiplets|chiplets]]. As the demand for more powerful and specialized processors grows, traditional monolithic chip designs face challenges related to manufacturing costs, yields, and the inability to combine different technologies optimally. UCIe addresses these issues by standardizing the interconnect between chiplets, allowing them to communicate seamlessly regardless of their origin or design. This standardization fosters a modular approach where manufacturers can mix and match chiplets (compute, memory, or accelerators) from different vendors, tailoring systems to specific performance, power, or cost requirements. Its stack is broken down into three major layers, namely:
- Protocol Layer
- Die-to-Die Adapter Layer and
- Physical Layer
Each of the above-mentioned layers has a wide array of capabilities, such as multi-protocol support (CXL, PCIe, and Streaming), varying interface widths and data rates, protocol multiplexing, and packaging options, to name a few. This makes it possible to design anything from simple to highly sophisticated chiplets, based on the end application. As an example, one version of a Physical Layer chiplet design could support Standard Packaging, while another version could support Advanced Packaging with increased link width.
![[Pasted image 20250128133731.png]]
> [!Figure]
> UCIe layers (source: Synopsys)
A key advantage of UCIe is its ability to support heterogeneous integration. By enabling diverse chiplets to work together in a unified package, it unlocks new possibilities for system design. For example, a single package could integrate a high-performance CPU, a machine learning accelerator, and high-bandwidth memory, all optimized on different semiconductor processes. This flexibility is particularly valuable for applications like artificial intelligence, data centers, and edge computing, where balancing performance, power, and space is critical. Additionally, UCIe reduces development costs by promoting reuse of validated chiplets and minimizing the need for custom interfaces.
The UCIe consortium, which includes major industry players, ensures the standard remains vendor-neutral and widely adopted. By fostering collaboration, UCIe aims to accelerate innovation, reduce time-to-market for advanced systems, and create a competitive marketplace for chiplets. As semiconductor technology continues to evolve, UCIe is poised to play a central role in enabling scalable, efficient, and versatile chip designs that meet the growing demands of modern computing.
## Ethernet
ALOHAnet, developed at the University of Hawaii in the 1970s, laid the foundation for modern wireless communication. Originally designed to provide data communication between computer systems on the Hawaiian Islands using radio frequency, it allowed multiple computer terminals to connect to a central mainframe computer and communicate with each other. One of its key innovations was the use of a random access protocol which enabled multiple users to share a common communication channel without centralized control. ALOHAnet became operational in 1971, demonstrating the feasibility of wireless networking and providing valuable insights into the design and management of shared communication channels.
Standing on the shoulders of ALOHAnet, Robert Metcalfe, along with his team at Xerox's Palo Alto Research Center (PARC), developed Ethernet as a means to connect computers and share resources in a local area network (LAN) using coaxial cables and the concept of packet switching, that is, breaking data into small packets for transmission. The original Ethernet operated at a speed of 2.94 megabits per second, and it utilized a bus topology, where multiple devices shared the same physical cable.
Ethernet proved to be a rather simple way of sending chunks of data from node to node in a small area network and within the same network context. That simplicity made it a widely adopted protocol.
Several decades on from Metcalfe's pioneering work, Ethernet continues to be widely adopted, although the protocol has morphed and extended so much that it is fair to say it has become something else. In essence, we cannot consider Ethernet "a protocol" but an intricate family of protocols and architectures working over a breadth of physical media technologies. The IEEE 802.3 standard captures the Ethernet family of specifications.
==Ethernet has become a challenging standard to understand. The Ethernet standard has many different specifications, even for the same data rate. For example, 10GBASE-ER and 10GBASE-KR are both 10 Gbps Ethernet specifications, but they describe different interconnect medium interfaces. There are at least twenty different types of Gigabit Ethernet and close to 30 different 10 Gigabit Ethernet specifications defined by the IEEE 802.3 standard. In a nutshell, Ethernet is in fact a family of many protocols that share the same frame structure.==
When we discussed [[Physical Layer#Parallel versus Serial Interconnects|serial data interfaces]], we said that serial data communication consists of data bits transmitted one at a time over an interconnect medium. The data rate is the number of bits transmitted per second (bits/s or bps), so if the bit time is 1 nanosecond, then the data rate is 1000 million bits per second (1000 Mbps or 1 Gbps). However, serial interfaces carry some overhead as a result of the encoding needed to transfer data reliably at high speeds. To achieve the target data rate, the line rate (the physical layer's gross data rate) must therefore be higher. In Ethernet, to achieve an effective 1 Gbps throughput, the actual line rate is 1.25 Gbps, and for 10 Gigabit Ethernet throughput, the line rate is 10.3125 Gbps. Ethernet speeds refer to the actual data throughput rate without the overhead, this overhead being control bits, source address, destination address, and other non-data bits. The actual data throughput rate is also the operating rate of the Ethernet controller, which is known as the media access control (MAC) or Ethernet MAC.
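The encoding arithmetic is easy to verify. The short computation below (a minimal sketch; the coding ratios are the well-known 8b/10b and 64b/66b overheads) reproduces the two line rates quoted above:
```C
#include <stdio.h>

int main(void) {
    // Gigabit Ethernet uses 8b/10b coding: 10 line bits carry 8 data bits
    double gig_line_rate = 1.0 * 10.0 / 8.0;        // 1.25 Gbps

    // 10 Gigabit Ethernet uses 64b/66b coding: 66 line bits carry 64 data bits
    double ten_gig_line_rate = 10.0 * 66.0 / 64.0;  // 10.3125 Gbps

    printf("1G Ethernet line rate:  %.4f Gbps\n", gig_line_rate);
    printf("10G Ethernet line rate: %.4f Gbps\n", ten_gig_line_rate);
    return 0;
}
```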
Ethernet can work with different types of interconnects. An Ethernet physical medium may consist of only differential pairs such as PCB traces connecting two PHYs at each end, or it may include additional devices such as connectors, cable (optical or copper), and backplanes. All options are illustrated below.

> [!Figure]
> _Ethernet on a PCB (note that the trace connecting PHYs may represent one or more lanes)_

> [!Figure]
> _Ethernet on a mezzanine (again, note that the trace connecting PHYs may represent one or more lanes)_

> [!Figure]
> _top: Ethernet on a backplane, middle: Ethernet on a copper cable, bottom: Ethernet over Fiber_
All nodes are of equal importance in an Ethernet network; there is no master network controller of any sort. The medium between two Ethernet PHYs can be either electrical or optical. The PHYs driving each medium may have the same throughput data rate but have different specifications, depending on the medium interface.
The typical electrical medium for Ethernet is copper-based cable (twisted pair or twinaxial) or backplanes with multiple connectors. ==Due to differences in cable, connector, and backplane characteristics, the PHY specification for each of these interfaces may need to be medium-dependent, with different specifications for each. Therefore, two Ethernet MACs might be unable to interoperate if their PHYs implement different clauses of the specification.==
For [[Physical Layer#Optical Channels|optical channels]], Ethernet utilizes optical transceivers for conversion between electrical signals and light signals at the two ends of a fiber optic cable. The two major classes of optical transceivers are those for single-mode fiber (SMF) and multi-mode fiber (MMF), each supporting multiple different wavelengths, fiber types, and cable reaches.
### Architectural Aspects
Ethernet is organized along architectural lines, emphasizing the separation of the architecture into two main parts: the Media Access Control (MAC) sublayer of the Data Link Layer, and the Physical Layer (PHY); in the apparent simplicity of this two-layer split, a lot of complexity is hidden. These layers are intended to correspond closely to the lowest layers of the OSI Reference Model. The Logical Link Control (LLC) sublayer and MAC sublayer together encompass the functions intended for the Data Link Layer as defined in the OSI model.

> [!Figure]
> _IEEE 802.3 standard relationship to the ISO/IEC Open Systems Interconnection (OSI) reference model_
Other relevant architectural aspects of Ethernet are the ways different sublayers connect. For example:
- Medium Dependent Interfaces (MDI): To communicate in a compatible manner, all stations shall adhere rigidly to the exact specification of physical media signals defined in the appropriate clauses of the standard, and to the procedures that define the correct behavior of a station. The medium-independent aspects of the LLC sublayer and the MAC sublayer should not be taken as detracting from this point; communication in an Ethernet Local Area Network requires complete compatibility at the Physical Medium interface (that is, the physical cable interface).
- Media Independent Interface (MII): It is anticipated that some devices will be connected to a remote PHY, and/or to different medium-dependent PHYs. The MII is defined as a third compatibility interface. While conformance with the implementation of this interface is not strictly necessary to ensure communication, it is recommended, since it allows maximum flexibility in intermixing PHYs and DTEs.
- Gigabit Media Independent Interface (GMII). The GMII is designed to connect a 1 Gb/s capable MAC or repeater unit to a 1 Gb/s PHY. While conformance with the implementation of this interface is not strictly necessary to ensure communication, it is recommended, since it allows maximum flexibility in intermixing PHYs and DTEs at 1 Gb/s speeds. The GMII is intended for use as a chip-to-chip interface. No mechanical connector is specified for use with the GMII. The GMII is optional.
- 10 Gigabit Media Independent Interface (XGMII). The XGMII is designed to connect a 10 Gb/s capable MAC to a 10 Gb/s PHY. While conformance with the implementation of this interface is not necessary to ensure communication, it allows maximum flexibility in intermixing PHYs and DTEs at 10 Gb/s speeds. The XGMII is intended for use as a chip-to-chip interface. No mechanical connector is specified for use with the XGMII. The XGMII is optional.
Note that lower data rate, legacy versions of Ethernet used to include an Attachment Unit Interface (AUI) connecting sublayers of the PHY. AUIs have made a comeback in multi-gigabit clauses of Ethernet, which will be explained in the following sections.
### MAC
The MAC in Ethernet is a sublayer that provides services to a client, and it does so by providing a set of primitives depicted in the figure below.
> [!Note]
> What are primitives? In network protocols, a primitive is a function that a user or an entity can use to access a service provided by a layer of a protocol. A primitive requests that a service carry out an action, or reports on an action taken by the peer entity. Primitives in general do not reveal any implementation details.
The MAC defines a medium-independent facility, built on the medium-dependent physical facility provided by the Physical Layer, and under the access-layer-independent LLC sublayer (or some other MAC client). The MAC is in charge of:
- Data encapsulation (transmit and receive):
	- Framing (frame boundary delimitation, frame synchronization)
	- Addressing (handling of source and destination addresses)
	- Error detection (detection of physical medium transmission errors)
- Media Access Management:
	- Medium allocation (collision avoidance)
	- Contention resolution (collision handling)

> [!Figure]
> _Generic MAC primitives_
MAC realizes its services utilizing data units called _frames_. The mapping between MAC service interface primitives and Ethernet frames, including the syntax and semantics of the various fields of MAC frames and the fields used to form those MAC frames into packets, is described in the next two figures.

> [!Figure]
> _Ethernet Frame format_

> [!Figure]
> _MAC request and indication services mappings with frame format_
The Ethernet frame fields are:
- Preamble: The Preamble field is a 7-octet field that is used to allow the signaling circuitry to reach its steady-state synchronization with the received packet's timing.
- Start Frame Delimiter (SFD) field: The SFD field is the sequence 10101011. It immediately follows the preamble pattern. A MAC frame starts immediately after the SFD.
- Address fields: Each MAC frame shall contain two address fields: the Destination Address field and the Source Address field, in that order. The Destination Address field shall specify the destination addressee(s) for which the MAC frame is intended. The Source Address field shall identify the station from which the MAC frame was initiated. The representation of each address field shall be as follows:
	- Each address field shall be 48 bits in length.
	- The first bit (LSB) shall be used in the Destination Address field as an address type designation bit to identify the Destination Address either as an individual or as a group address. If this bit is 0, it shall indicate that the address field contains an individual address. If this bit is 1, it shall indicate that the address field contains a group address that identifies none, one or more, or all of the stations connected to the LAN. In the Source Address field, the first bit is reserved and set to 0.
	- The second bit shall be used to distinguish between locally or globally administered addresses. For globally administered (or U, universal) addresses, the bit is set to 0. If an address is to be assigned locally, this bit shall be set to 1. Note that for the broadcast address, this bit is also a 1.
	- Each octet of each address field shall be transmitted least significant bit first.
- Destination Address field: The Destination Address field specifies the station(s) for which the MAC frame is intended. It may be an individual or multicast (including broadcast) address.
- Source Address field: The Source Address field specifies the station sending the MAC frame. The Source Address field is not interpreted by the MAC sublayer.
- Length/Type field: This two-byte field takes one of two meanings, depending on its numeric value. For numerical evaluation, the first octet is the most significant octet of this field.
	- If the value of this field is less than or equal to 1500 decimal (05DC hexadecimal), then the Length/Type field indicates the number of MAC client data octets contained in the subsequent MAC Client Data field of the basic frame (Length interpretation).
	- If the value of this field is greater than or equal to 1536 decimal (0600 hexadecimal), then the Length/Type field indicates the Ethertype of the MAC client protocol (Type interpretation). The Length and Type interpretations of this field are mutually exclusive. When used as a Type field, it is the responsibility of the MAC client to ensure that the MAC client operates properly when the MAC sublayer pads the supplied MAC Client data.
Regardless of the interpretation of the Length/Type field, if the length of the MAC Client Data field is less than the minimum required for proper operation of the protocol, a Pad field (a sequence of octets) will be added after the MAC Client Data field but before the FCS field, specified below. The Length/Type field is transmitted and received with the high-order octet first.
- MAC Client Data field: The MAC Client Data field contains a sequence of octets. Any arbitrary sequence of bytes may appear in the MAC Client Data field up to a maximum field length determined by the particular implementation.
- Pad field: A minimum MAC frame size is required for correct protocol operation. If necessary, the Pad field is appended to bring the frame up to this minimum size.
- Frame Check Sequence (FCS) field: A cyclic redundancy check (CRC) is used by the transmit and receive algorithms to generate a CRC value for the FCS field. The FCS field contains a 4-octet (32-bit) CRC value. This value is computed as a function of the contents of the protected fields of the MAC frame: the Destination Address, Source Address, Length/Type field, MAC Client Data, and Pad (that is, all fields except FCS). For information about the polynomial used, see the specification.
- Extension field: The Extension field follows the FCS field and is made up of a sequence of extension bits, which are readily distinguished from data bits. The contents of the Extension field are not included in the FCS computation.
Each byte of the MAC frame, except the FCS, is transmitted least-significant bit first.
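The Length/Type decision rule above is compact enough to state in code. The following is a minimal sketch; the function names and the byte offset for an untagged frame are illustrative:
```C
#include <stdint.h>
#include <stdbool.h>

// The Length/Type field sits at byte offset 12 of an untagged frame
// (right after the two 6-byte address fields) and is big-endian
uint16_t get_length_type(const uint8_t *frame) {
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

// Length interpretation: value <= 1500 (0x05DC)
bool is_length(uint16_t lt) {
    return lt <= 1500;
}

// Type (Ethertype) interpretation: value >= 1536 (0x0600);
// values in between are not defined by the standard
bool is_ethertype(uint16_t lt) {
    return lt >= 1536;
}
```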
An invalid MAC frame shall be defined as one that meets at least one of the following conditions:
- The frame length is inconsistent with a length value specified in the Length/Type field.
- It is not an integral number of bytes in length.
- The bits of the incoming frame (exclusive of the FCS field itself) do not generate a CRC value identical to the one received.
The contents of invalid MAC frames shall not be passed to the LLC or MAC Control sublayers. The occurrence of invalid MAC frames may require the frame to be discarded, or it may be communicated to network management for further action.
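The third condition above is the FCS check. A minimal sketch of how a receiver could verify it is shown below; it uses the standard reflected form (0xEDB88320) of the Ethernet CRC-32 polynomial and assumes the frame bytes sit in a buffer exactly as received, with the 4-byte FCS at the end:
```C
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// Bit-reflected CRC-32 (polynomial 0x04C11DB7, reflected form 0xEDB88320),
// processing each byte least-significant bit first, matching the serial
// transmission order described above
uint32_t eth_crc32(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// A frame passes the check when the CRC computed over everything between
// the SFD and the FCS matches the received FCS bytes
bool eth_frame_crc_ok(const uint8_t *frame, size_t len_with_fcs) {
    if (len_with_fcs <= 4)
        return false;
    uint32_t computed = eth_crc32(frame, len_with_fcs - 4);
    const uint8_t *f = frame + len_with_fcs - 4;
    uint32_t received = (uint32_t)f[0] | ((uint32_t)f[1] << 8) |
                        ((uint32_t)f[2] << 16) | ((uint32_t)f[3] << 24);
    return computed == received;
}
```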
#### Services
The Ethernet MAC can be viewed as a generic block providing a set of services to a client sitting in an upper layer.
##### MA_DATA.request
This primitive defines the transfer of data from a MAC client entity to a single peer entity or multiple peer entities in the case of group addresses. The semantics of the primitive are as follows:
```pseudocode
MA_DATA.request (destination_address, source_address, mac_service_data_unit, frame_check_sequence)
```
The ```destination_address``` parameter may specify either an individual or a group MAC entity address. It must contain sufficient information to create the Destination Address field that is prepended to the frame by the local MAC sublayer entity, plus any physical information required. The ```source_address``` parameter, if present, must specify an individual MAC address. If the ```source_address``` parameter is omitted, the local MAC sublayer entity will insert a value associated with that entity. The ```mac_service_data_unit``` parameter specifies the MAC service data unit to be transmitted by the MAC sublayer entity. There is sufficient information associated with the ```mac_service_data_unit``` for the MAC sublayer entity to determine the length of the data unit. The ```frame_check_sequence``` parameter, if present, must specify the frame check sequence field for the frame. If the ```frame_check_sequence``` parameter is omitted, the local MAC sublayer entity will compute this field and append it to the end of the frame.
This primitive is generated by the MAC client entity whenever data shall be transferred to a peer entity or entities. This can be in response to a request from higher protocol layers or from data generated internally to the MAC client.
The triggering of this primitive will cause the MAC entity to insert all MAC-specific fields, including destination address (DA), source address (SA), and any fields that are unique to the particular media access method, and pass the properly formed frame to the lower protocol layers for transfer to the peer MAC sublayer entity or entities.
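The standard defines this primitive abstractly, not as a concrete API, but it can help to picture how its parameters might map onto an implementation. The sketch below is purely illustrative; every name in it is hypothetical:
```C
#include <stdint.h>
#include <stddef.h>

// Hypothetical representation of the MA_DATA.request parameters
typedef struct {
    uint8_t destination_address[6];  // individual or group MAC address
    uint8_t source_address[6];       // filled in by the MAC if absent
    int source_address_present;
    const uint8_t *mac_service_data_unit;
    size_t msdu_length;              // length is known to the MAC entity
    uint32_t frame_check_sequence;   // computed by the MAC if absent
    int fcs_present;
} ma_data_request_t;

// A real MAC entity would build the frame here (DA, SA, Length/Type,
// pad, FCS) and hand it to the lower layers for transmission
int ma_data_request(const ma_data_request_t *req) {
    (void)req;
    return 0;
}
```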
##### MA_DATA.indication
This primitive defines the transfer of data from the MAC sublayer to the MAC client entity or entities in the case of group addresses. The semantics of the primitive are as follows:
```pseudocode
MA_DATA.indication (destination_address, source_address, mac_service_data_unit, frame_check_sequence, reception_status)
```
The ```destination_address``` parameter may be either an individual or a group address as specified by the DA field of the incoming frame. The ```source_address``` parameter is an individual address as specified by the Source Address field of the incoming frame. The ```mac_service_data_unit``` parameter specifies the MAC service data unit as received by the local MAC entity. The ```frame_check_sequence``` parameter is the cyclic redundancy check value as specified by the FCS field of the incoming frame. This parameter may be either omitted or (optionally) passed by the MAC sublayer entity to the MAC client. The ```reception_status``` parameter is used to pass status information to the MAC client entity.
The ```MA_DATA.indication``` is passed from the MAC entity to the MAC client entity or entities to indicate the arrival of a frame to the local MAC sublayer entity that is destined for the MAC client. Such frames are reported only if they are validly formed and received without error, and their destination address designates the local MAC entity.
### Ethernet Physical Layer
The MAC couples with the transmission medium using a PHY block. An Ethernet PHY is a two-sided box that, on one end, connects to the medium through the Media Dependent Interface (MDI), and on the other end, it connects to the MAC through the media-independent interface (MII), as shown in the figure below.

> [!Figure]
> _Ethernet physical and data link layers (most applicable for Gigabit and 10 Gigabit Ethernet)_
Speed-specific MII is an interface that provides a means to connect a MAC to a PHY, especially when the MAC is connected to an off-chip PHY. The MII interface is a chip-to-chip interface without a mechanical connector and is typically realized by PCB traces. For higher speeds, a Gigabit MAC connects to a Gigabit PHY through the Gigabit Medium Independent Interface (GMII). A 10 Gigabit MAC can connect to a 10 Gigabit PHY through the optional 10 Gigabit MII (XGMII). These Media Independent Interfaces are basically signal-based parallel interconnects.
The portion of the Physical Layer between the Medium Dependent Interface (MDI) and the Media Independent Interface (MII) consists of a "sandwich" typically composed of the Physical Coding Sublayer (PCS), the Physical Medium Attachment (PMA), and the Physical Medium Dependent (PMD) sublayers:
- Physical Coding Sublayer (PCS): The PCS contains the functions to encode data bits for transmission via the PMA and to decode the received conditioned signal from the PMA.
- Physical Medium Attachment (PMA) sublayer: this portion of the Physical Layer contains the functions for transmission, reception, and (depending on the PHY) collision detection, clock recovery, and skew alignment. PMA performs signal conditioning.
- Physical Medium Dependent (PMD) sublayer: the PMD is responsible for interfacing with the transmission medium. The PMD is located just above the Medium Dependent Interface (MDI).
Succinctly, the MII is a logical interface that connects the MAC to a PHY, and the AUI is a physical interface that extends the connection between the PCS and the PMA. The naming of these interfaces follows the convention found in the 10 Gigabit Ethernet specification (IEEE Std 802.3ae), where the 'X' in XAUI and XGMII represents the Roman numeral 10. Since the Roman numeral for 40 is 'XL' and the Roman numeral for 100 is 'C', the same convention yields XLAUI and XLGMII for 40 Gbps and CAUI and CGMII for 100 Gbps. The final interface is the parallel physical interface (PPI), which is the physical interface for the connection between the PMA and the PMD for certain clauses such as the 40GBASE-SR4 and 100GBASE-SR10 PMDs.
In the context of the complex IEEE 802.3 standard, the PCS and PMA details strongly depend on the specific speed and medium; therefore, it is difficult to describe these sublayers in a generic capacity. As an example, we will discuss the PCS and PMA for multi-gigabit Ethernet in the following section.
### Backplane-Based, Multi-Gigabit Ethernet
Backplane Ethernet supports the full-duplex MAC operating at 1000 Mbps, 2.5 Gbps, 5 Gbps, 10 Gbps, 25 Gbps, 40 Gbps, 50 Gbps, or 100 Gbps providing a bit error ratio (BER) better than or equal to $10^{-12}$ at the MAC/PLS service interface or 200 Gbps providing a BER better than or equal to $10^{-13}$ at the MAC/PLS service interface. The following Physical Layers are supported:
- 1000BASE-KX for 1 Gbps operation over a single lane
- 2.5GBASE-KX for 2.5 Gbps operation over a single lane
- 5GBASE-KR for 5 Gbps operation over a single lane
- 10GBASE-KX4 for 10 Gbps operation over four lanes
- 10GBASE-KR for 10 Gbps operation over a single lane
- 25GBASE-KR and 25GBASE-KR-S for 25 Gbps operation over a single lane
- 40GBASE-KR4 for 40 Gbps operation over four lanes
- 50GBASE-KR for 50 Gbps operation over a single lane
- 100GBASE-KR4 and 100GBASE-KP4 for 100 Gbps operation over four lanes
- 100GBASE-KR2 for 100 Gbps operation over two lanes
- 200GBASE-KR4 for 200 Gbps operation over four lanes
Backplane Ethernet couples the MAC to a family of Physical Layers defined for operation over electrical backplanes. The relationships among Backplane Ethernet, the MAC, and the OSI layers are shown in the figure below.
Different flavors may implement different PCS, PMA, and PMD sublayers. The same flavor may also implement different physical sublayer functions, such as RS-FEC or signaling schemes.
The 50GBASE-KR flavor, for instance, employs 4-level PAM over one differential path in each direction. The 100GBASE-KR2 employs 4-level PAM over two differential paths in each direction. The 200GBASE-KR4 flavor employs 4-level PAM over four differential paths in each direction.

> [!Figure]
> _MAC, PHY, and transmission media for backplane-based Ethernet_
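To connect the PAM levels mentioned above with the rates on the wire: a PAM4 symbol carries log2(4) = 2 bits, so the per-lane symbol rate is the per-lane line rate divided by two. The snippet below is a back-of-the-envelope sketch; the 53.125 Gbps per-lane line rate is an assumption based on the published 802.3 rates (data rate plus 64b/66b transcoding and RS-FEC overhead):
```C
#include <stdio.h>
#include <math.h>

int main(void) {
    double line_rate_gbps = 53.125;      // assumed per-lane line rate
    double bits_per_symbol = log2(4.0);  // PAM4: 2 bits per symbol

    // 50GBASE-KR: one such lane; 100GBASE-KR2: two; 200GBASE-KR4: four
    printf("Per-lane symbol rate: %.4f GBd\n",
           line_rate_gbps / bits_per_symbol);  // 26.5625 GBd
    return 0;
}
```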
> [!warning]
> This section is under #development
### Ethernet Type Nomenclature
Ethernet nomenclature is based on the interconnect data rate (R), modulation type (mTYPE), medium lengths (L), and a reference to the PHY's PCS coding (C) scheme. When multiple lanes are aggregated, there is additional information on the number of aggregated lanes (n). In the absence of a reference to the number of lane(s), a single-lane interface is assumed. The R mTYPE - L C n parameters used in Ethernet nomenclature are defined as:
1. Data rate (R):
	- 1000: 1000 Mbps or 1 Gbps; the megabit unit is omitted in the data rate reference
	- 10G: 10 Gbps
	- 100G: 100 Gbps
	- 200G: 200 Gbps
	- 400G: 400 Gbps
	- 10/1G: 10 Gbps downstream, 1 Gbps upstream
2. Modulation type (mTYPE): BASE → Baseband
3. Medium types/wavelength/reach (L):
	- B: Bidirectional optics, with downstream (D) or upstream (U) asymmetric qualifiers
	- C: Twinaxial copper
	- D: Parallel single mode (500 m)
	- E: Extra-long optical wavelength λ (1510/1550 nm) / reach (40 km)
	- F: Fiber (2 km)
	- K: Backplane
	- L: Long optical wavelength λ (~1310 nm) / reach (10 km)
	- P: Passive optics, with single or multiple downstream (D) or upstream (U) asymmetric qualifiers, as well as eXternal sourced coding (X) of 4B/5B or 8B/10B
	- RH: Red LED plastic optical fiber with PAM16 coding and different transmit power optics
	- S: Short optical wavelength λ (850 nm) / reach (100 m)
	- T: Twisted pair
4. PCS coding (C):
	- R → scRambled coding (64B/66B)
	- X → eXternal sourced coding (4B/5B, 8B/10B)
5. Number of Lanes (n):
	- Blank space without lane number: defaults to 1 lane
	- 4: 4 lanes
For example, 10GBASE-KR is a 10 Gbps (10G) data rate baseband (BASE) specification, with a backplane (K) medium, using a 64B/66B (R) coding scheme, in a single lane configuration. This is purely an electrical specification that fully defines the features and characteristics of a compliant Ethernet PHY.
10GBASE-KX4 is also a 10 Gbps baseband specification for a backplane; however, it uses 8B/10B (X) coding, in an aggregated 4-lane configuration. Even though both 10GBASE-KX4 and 10GBASE-KR are 10 Gbps electrical interfaces, they describe different PHYs. A 10GBASE-KX4 PHY operates at 1/4 rate of the 10GBASE-KR across 4 lanes to achieve the same throughput.
Similarly, although 10GBASE-ER is a 10 Gbps baseband specification, it is not an electrical description like 10GBASE-KR or 10GBASE-KX4. 10GBASE-ER is an extra-long reach (E) single-mode optical transceiver specification that utilizes 64B/66B (R) coding, capable of 40 km fiber optic cable reach. 10GBASE-ER mainly describes the requirements of an optical transceiver and does not provide the electrical requirements of a PHY that could drive the transceiver.
Therefore, it is important to distinguish the differences between the optical transceiver specification and electrical specifications defined in the IEEE 802.3 standard.
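Because the naming scheme is so regular, decoding it mechanically is straightforward. The toy decoder below handles a few of the medium-type letters from the list above; it is illustrative only and far from exhaustive:
```C
#include <stdio.h>
#include <string.h>

const char *medium_meaning(char medium) {
    switch (medium) {
        case 'K': return "backplane";
        case 'T': return "twisted pair";
        case 'C': return "twinaxial copper";
        case 'S': return "short optical wavelength (850 nm)";
        case 'L': return "long optical wavelength (~1310 nm)";
        case 'E': return "extra-long optical wavelength (40 km reach)";
        default:  return "unknown";
    }
}

int main(void) {
    // 10GBASE-KR: medium 'K' (backplane), coding 'R' (64B/66B), one lane
    const char *name = "10GBASE-KR";
    const char *dash = strchr(name, '-');
    if (dash != NULL && dash[1] != '\0' && dash[2] != '\0') {
        printf("%s: medium = %s, coding = %c\n",
               name, medium_meaning(dash[1]), dash[2]);
    }
    return 0;
}
```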
### Summary of Ethernet Types

> [!Figure]
> _Ethernet physical layer summary_
### Avionics Full DupleX Switched Ethernet (AFDX)
Avionics Full-Duplex Switched Ethernet (AFDX) is a data network, based on Ethernet technology, developed for safety-critical applications in the aerospace and defense industries. It is a specialized and more robust version of Ethernet, specifically designed to meet the stringent requirements of avionics systems in aircraft.
AFDX was developed as an upgrade to older avionics data buses like ARINC 429, which were limited in bandwidth and unidirectional. AFDX, standardized as ARINC 664 Part 7, is part of the Integrated Modular Avionics (IMA) architecture. It was first implemented on the Airbus A380, but it is now widely used in many modern aircraft due to its efficiency and reliability.
AFDX operates in a full-duplex mode, meaning data can be transmitted and received simultaneously, which increases the efficiency and speed of data exchange. Unlike traditional Ethernet, AFDX is deterministic, ensuring that messages are delivered within a guaranteed time frame. This is critical for avionics, where timely and predictable data delivery is essential for safety.
AFDX uses a unique method known as Bandwidth Allocation Gap (BAG) to regulate data traffic. BAG controls the maximum rate at which a particular data source can send packets to avoid network congestion.
Data on an AFDX network is transmitted through Virtual Links (VLs). Each VL is a unidirectional logical path with its own specified bandwidth and quality of service, which guarantees that critical messages have a reserved path with no interference from other network traffic.
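A sketch of how a BAG-based shaper could work is shown below. It enforces the single rule that matters: on a given Virtual Link, consecutive frames must be separated by at least one BAG interval (ARINC 664 BAG values are powers of two from 1 ms to 128 ms). All names are illustrative:
```C
#include <stdint.h>
#include <stdbool.h>

// Per-Virtual-Link shaper state
typedef struct {
    uint32_t bag_ms;      // Bandwidth Allocation Gap: 1, 2, 4, ... 128 ms
    uint64_t last_tx_ms;  // timestamp of the previous frame on this VL
} virtual_link_t;

// A frame may be sent on the VL only if at least one BAG interval has
// elapsed since the previous frame; this caps the VL's bandwidth
bool vl_may_transmit(virtual_link_t *vl, uint64_t now_ms) {
    if (now_ms - vl->last_tx_ms < vl->bag_ms)
        return false;  // sending now would exceed the allocated rate
    vl->last_tx_ms = now_ms;
    return true;
}
```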
In terms of reliability, AFDX networks are inherently redundant, with two separate networks carrying duplicate data streams.
AFDX is predominantly used in avionics systems for both civil and military aircraft. Its applications range from flight control systems to monitoring and navigation systems. The technology is also being explored for use in other critical systems in trains and ships.
While AFDX offers many advantages, it also presents challenges such as higher costs compared to traditional Ethernet and the need for specialized equipment and training. However, with the ongoing advancements in aerospace technology, AFDX is expected to evolve further, offering even greater efficiency and reliability in future avionics systems.
### Industrial Ethernet (Single Pair Ethernet)
Ethernet has become a mainstream communications protocol for process and building automation. The Institute of Electrical and Electronics Engineers (IEEE) has defined a new Ethernet standard, IEEE 802.3cg, for 10 Mb/s operation and associated power delivery over a single balanced pair of conductors. Because a single-pair cable can now support both data and power, adoption of the standard can lead to significant cost savings and easier installation in building automation applications.
There are numerous efforts to take Ethernet to the edge devices. Multiple communication network technologies currently coexist in building automation: for example, heating, ventilation, and air-conditioning (HVAC) applications use Modbus, access control uses BACnet, lighting uses LonWorks, and fire safety uses Ethernet. This fragmentation of networks requires the use of gateways to perform protocol conversion and merge networks at the top of the building automation control pyramid, and end users must in turn manage complex systems. Reasons for the coexistence of these networks include the need for longer distances, multidrop connectivity, powering schemes, and support for unique protocols. Single-pair Ethernet (SPE) can address many of these needs. Taking Ethernet to the edge devices offers benefits such as direct accessibility from the control system, status updates, predictive maintenance, standardized hardware, and interoperability across various systems.
As we saw in previous sections, standard Ethernet uses unidirectional (simplex) paths, with independent wire pairs for transmitting and receiving data, at multiple data rates depending on the physical layer.
Single Pair Ethernet is broadly classified into three categories:
- IEEE 802.3cg (10 Mbps)
- IEEE 802.3bw (100 Mbps)
- IEEE 802.3bp (1000 Mbps)
IEEE 802.3cg has two further classifications, covering long and short reach over a single balanced pair of either shielded or unshielded wire:
- 10BASE-T1L: IEEE 802.3 Physical Layer specification for a 10 Mb/s Ethernet local area network over a single balanced pair of conductors up to at least 1000 m reach (long reach using 18 AWG wire for point-to-point connection).
- 10BASE-T1S: IEEE 802.3 Physical Layer specification for a 10 Mb/s Ethernet local area network over a single balanced pair of conductors up to at least 15 m reach (short reach using 24-26 AWG wire with multidrop connection).

> [!Figure]
> _SPE Interface for 10 Mbps Using the DP83TD510E PHY chip from TI (source: https://www.ti.com/lit/wp/snla360/snla360.pdf)_
SPE also enables sending power over data lines (PoDL) along the same single-pair cable through a low-pass filter like the one shown in the example below.

> [!Figure]
> _PoDL example (source: https://www.ti.com/lit/wp/snla360/snla360.pdf)_
### Automotive Ethernet
Standing on the shoulders of Single Pair Ethernet, automotive Ethernet is used in cars to connect different electronic components and systems. Traditionally, cars used different types of networks for this purpose, but these older networks were limited in terms of data rate. As cars have become more complex, with features like self-driving capabilities and advanced entertainment systems, they need to process higher volumes of data at a faster pace.
The main advantage of Automotive Ethernet is its ability to transfer data at high speeds, up to 1 gigabit per second or more. This is a significant improvement over previous in-vehicle networking technologies, which could only handle a few megabits per second. This higher-speed data transfer capability is relevant for modern vehicles that need to process large amounts of information from cameras, radars, lidars, sensors, and other devices.
Automotive Ethernet also helps simplify the car's wiring. Older networks required multiple wires, which made the wiring system in cars complex and heavy. Automotive Ethernet, however, can carry different types of signals over a single differential pair, reducing the amount of wiring needed. This is particularly useful in electric cars, where reducing the weight of the car can help increase its driving range and efficiency.

> [!Figure]
> _A typical ADAS system showing the in-vehicle network architecture_
One of the key advantages of Automotive Ethernet is its support for a standardized IP-based network, similar to those used in ground-based environments. This compatibility simplifies the integration of car networks with external networks and devices, facilitating updates, diagnostics, and connectivity.
Given that all Automotive Ethernet technologies utilize a single twisted-pair cable (T1) and a point-to-point network topology, they differ primarily in their data rates and signaling methods. For the most part, the data rate determines the applications for which a particular Automotive Ethernet standard can be used. If we take the case of an automatic driver assistance system (ADAS), we can see where each Automotive Ethernet variant might best be used to replace existing automotive protocols.
##### 10Base-T1S
10Base-T1S is a 10 Mbps Automotive Ethernet described in the IEEE 802.3cg standard. It is intended for 10-to-25-meter, short-reach applications.
Using differential Manchester encoding (DME), all physical layers employing 10Base-T1S must support the point-to-point topology up to a 15-meter reach, with full duplex operation.
How does 10Base-T1S reach full duplex capability over a single differential pair?
It does so through an approach called "Echo Cancellation" and "Physical Layer Collision Avoidance" (PLCA).
Echo cancellation is a technique where the transmitted signal is used as a reference to remove (or cancel out) the part of the signal that is echoed back on the same line. By subtracting the known transmitted signal from the combined signal on the line, the system can isolate and accurately interpret the incoming data.
Physical Layer Collision Avoidance is another critical feature of 10Base-T1S. It helps to manage situations where two devices try to transmit data simultaneously, which could lead to signal collision and data corruption. PLCA works by detecting the presence of another transmitting device on the line and coordinating to avoid collisions. This is different from the collision detection used in traditional Ethernet, which simply detects collisions and then retries transmission.
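Conceptually, PLCA grants transmit opportunities in a round-robin fashion: a coordinator node emits a beacon, and each node ID in turn gets a short window in which it may start transmitting; a silent node simply yields the window to the next ID. The sketch below captures that rotation only, with illustrative names and none of the real timing details:
```C
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t node_count;  // nodes sharing the mixing segment
    uint8_t current_id;  // node currently holding the transmit opportunity
} plca_cycle_t;

// Called when the current opportunity ends (unused or after a frame);
// when current_id wraps to 0, the coordinator sends a new beacon
void plca_advance(plca_cycle_t *c) {
    c->current_id = (uint8_t)((c->current_id + 1) % c->node_count);
}

// A node may start transmitting only during its own opportunity,
// which avoids collisions without traditional collision detection
bool plca_may_transmit(const plca_cycle_t *c, uint8_t my_id) {
    return c->current_id == my_id;
}
```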
Multidrop is an optional topology for 10Base-T1S, where eight or more nodes can be supported on a maximum bus length of 25 meters.
10Base-T1S is best used for lower-speed data communications between vehicle subsystems like power train ECUs, sensors, and car body ECUs (see figure below). Here, it would replace traditional in-vehicle networks like CAN, removing the need to manage multiple technologies and simplifying designs.

> [!Figure]
> _10Base-T1S is best used for lower data rate communications from sensors and ECUs._
##### 100Base-T1
100Base-T1 is a 100 Mbps Ethernet standard described in IEEE 802.3bw. It was originally specified by Broadcom as BroadR-Reach. 100Base-T1 uses PAM3 encoding, full-duplex communication, and a point-to-point topology.
100Base-T1 is used in applications where moderately high data rates are required, such as infotainment systems, where it replaces existing technologies such as LVDS that use heavier and more expensive cabling. It could also be used in Passive ADAS operations, such as lane departure warnings and backup cameras.

> [!Figure]
> _100Base-T1 is used for applications requiring moderate data rates, such as infotainment and passive ADAS systems._
##### 1000Base-T1
1000Base-T1 is 1 Gbps Automotive Ethernet described in IEEE 802.3bp. Like 100Base-T1, it uses PAM3 encoding, full-duplex communication, and a point-to-point topology.
The higher data rate of 1000Base-T1 is suited for active ADAS systems, such as lane departure assistance or automated emergency braking. In these applications, the systems are connected to actuators and can take control of the vehicle. It can also be used as the communications backbone for in-vehicle data transfer.

> [!Figure]
> _1000Base-T1 is used in applications where the high data rate is essential, like active ADAS or as the communications backbone._
##### MultiGBase-T1
MultiGBase-T1 provides the highest bandwidth of Automotive Ethernet variants (so far) to support high-speed applications. It is defined in IEEE 802.3ch for three data rates: 2.5GBase-T1 at 2.5 Gbps, 5GBase-T1 at 5 Gbps, and 10GBase-T1 at 10 Gbps. MultiGBase-T1 utilizes PAM4 encoding, full duplex communication, and a point-to-point topology.
MultiGBase-T1 enables the high data rates needed for systems associated with autonomous driving. These applications require lossless video transmission to provide high-resolution video for image recognition. Autonomous driving systems are connected to actuators that can take control of the vehicle, and there can be no dropouts in the video stream or other communications.

> [!Figure]
> _MultiGBase-T1 is used for very high data rate applications, such as autonomous driving systems._
## Serial ATA
Parallel ATA, often abbreviated as PATA and originally known as AT Attachment (ATA), represents a standard interface for connecting storage devices in personal computers.
Developed in the mid-1980s, PATA was the evolution of the earlier Integrated Drive Electronics (IDE) interface. It became the primary method of connecting storage devices in PCs for over two decades. The standard was maintained and enhanced by the ANSI committee, which released several revisions under the name "ATA" with increasing data transfer speeds and capabilities.
One of the key features of PATA is its 40-pin connector; in the higher-speed versions of the standard (like ATA-5 and ATA-6), the ribbon cable was expanded from 40 to 80 conductors. The additional conductors are ground lines interleaved between the signals to reduce crosstalk and electromagnetic interference, allowing higher data transfer rates. The cable length was limited to 18 inches (about 46 cm), a restriction that was partly responsible for the eventual shift towards Serial ATA (SATA) in newer systems, as SATA supports longer cable lengths and higher transfer rates.
PATA supports two devices on a single channel, typically configured as a master and a slave device. Each device is connected via a ribbon cable that "daisy chains" the devices (see figure below).

> [!Figure]
> _Parallel ATA device connectivity_
Serial ATA is a high-speed serial link replacement for the parallel ATA attachment of mass storage devices. The serial link employed in SATA is a high-speed differential physical layer that runs at gigabit line rates and uses 8b/10b encoding.

> [!Figure]
> _Serial ATA connectivity_
The figure above shows an example of how the same two devices are connected using a Serial ATA Host Bus Adapter (HBA). In this figure, the dark grey portion is functionally identical to the dark grey portion of parallel ATA. Host software that is only parallel ATA aware accesses the Serial ATA subsystem in the same manner. In this case, however, the software views the two devices as if they were "masters" on two separate ports. The right-hand portion of the HBA is of a new design that converts the normal operations of the software into a serial data/control stream. The Serial ATA structure connects each of the two devices with their own respective cables in a point-to-point fashion.
### Architecture
There are four layers in the Serial ATA architecture: Application, Transport, Link, and Physical. The Application layer is responsible for overall ATA command execution. The Transport layer is responsible for placing control information and data to be transferred between the host and device in a packet/frame, known as a Frame Information Structure (FIS). The Link layer is responsible for taking data from the constructed frames, encoding or decoding each byte using 8b/10b, and inserting control characters such that the 10-bit stream of data may be decoded correctly. The Physical layer is responsible for the signaling and transmitting and receiving the encoded information as a serial data stream on the interconnect.

> [!Figure]
> _SATA Cable/Connector Connection Between Host and Device_
### Physical Layer
The SATA physical layer provides a set of services which are listed below:
- Transmit a 1.5 Gbps, 3.0 Gbps, or 6.0 Gbps differential NRZ serial stream at specified voltage levels as per section 7.2 in the SATA specification[^38].
- Provide a 100-ohm matched termination (differential) at the transmitter.
- Serialize a 10, 20, 40, or other width parallel input from the Link for transmission.
- Receive a 1.5 Gbps, 3.0 Gbps, or 6.0 Gbps differential NRZ serial stream at a data rate within ±350 ppm of nominal, with an additional +0/-5000 ppm tolerance to accommodate spread-spectrum clocking.
- Provide a 100-ohm matched termination (differential) at the receiver.
- Extract data (and, optionally, clock) from the serial stream.
- De-serialize the serial stream.
- Detect the K28.5 comma character and provide a bit and word aligned 10, 20, 40, or other width parallel output.
- Provide specified out-of-band (OOB) signaling detection and transmission.
- Use OOB signaling protocol for initializing the Serial ATA interface and use this OOB sequence to execute a pre-defined speed negotiation function.
- Perform proper power-on sequencing and speed negotiation.
- Provide device status to Link layer:
	- Device present.
	- Device absent.
	- Device was present but failed to negotiate communications.
- Optionally support power management modes.
- Optionally perform transmitter and receiver impedance calibration.
- Handle the input data rate frequency variation due to a spread spectrum transmitter clock.
OOB in the context of the Serial ATA specification refers to "Out-of-Band" signaling. This is a method used for communication between devices that are separate from the main data channel. In SATA, OOB signaling plays a role during the initialization and wake-up sequences.
When a SATA host (for instance a computer's motherboard) and a device (a hard drive or SSD) are first connected, they need to establish communication. However, since they haven't yet synchronized their data transfer parameters, they can't use the regular data channel for this initial communication. OOB signaling provides a way to do this.
SATA devices can enter a low-power state when not in use. To wake them up, the host sends a special signal. This signal is sent using the OOB signaling method, ensuring that the device wakes up correctly and starts listening on the main data channel.
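The OOB handshake follows a fixed order of signals named COMRESET, COMINIT, and COMWAKE. The enum below sketches that order as a simple state progression; burst and gap timings are deliberately omitted (see the SATA specification for the exact values):
```C
// Canonical SATA OOB startup handshake, as a sequence of states
typedef enum {
    HOST_SENDS_COMRESET,   // host resets the interface
    DEVICE_SENDS_COMINIT,  // device announces its presence
    HOST_SENDS_COMWAKE,    // host wakes the device
    DEVICE_SENDS_COMWAKE,  // device acknowledges
    SPEED_NEGOTIATION,     // ALIGN primitives at candidate rates
    LINK_UP
} sata_oob_state_t;

// Each OOB signal observed on the wire advances the handshake one step
sata_oob_state_t oob_next(sata_oob_state_t s) {
    return (s == LINK_UP) ? LINK_UP : (sata_oob_state_t)(s + 1);
}
```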
### Link Layer
The Link layer transmits and receives frames, transmits primitives based on control signals from the Transport layer, and receives primitives from the Physical layer which are converted to control signals to the Transport layer. The Link layer need not be aware of the content of frames. Host and device Link layer state machines are similar; however, the device is given precedence when both the host and device request ownership for transmission.
#### Frame Transmission
When requested by the Transport layer to transmit a frame, the Link layer provides the following services:
- Negotiates with its peer Link layer to transmit a frame, resolves arbitration conflicts if both host and device request transmission
- Inserts frame envelope around Transport layer data (i.e., SOFP, CRC, EOFP, etc.).
- Receives data in the form of Dwords from the Transport layer.
- Calculates CRC on Transport layer data.
- Transmits frame.
- Provides frame flow control in response to requests from the FIFO or the peer Link layer.
- Receives frame receipt acknowledge from peer Link layer.
- Reports good transmission or Link/Phy layer errors to Transport layer.
- Performs 8b/10b encoding.
- Scrambles data Dwords in such a way as to distribute the potential EMI emissions over a broader range.
#### Frame Reception
When data is received from the Physical layer, the Link layer provides the following services:
- Acknowledges to the peer Link layer readiness to receive a frame.
- Receives data in the form of encoded characters from the Phy layer.
- Decodes the encoded 8b/10b character stream into aligned Dwords of data.
- Removes the envelope around frames (i.e., SOFP, CRC, EOFP).
- Calculates CRC on the received Dwords.
- Provides frame flow control in response to requests from the FIFO or the peer Link layer.
- Compares the calculated CRC to the received CRC.
- Reports good reception or Link/Phy layer errors to the Transport layer and the peer Link layer.
- Descrambles data Dwords received from a peer Link layer.
#### Encoding
Information to be transmitted over Serial ATA shall be encoded a byte (eight bits) at a time along with a data or control character indicator into a 10-bit encoded character and then sent serially bit by bit. Information received over Serial ATA shall be collected ten bits at a time, assembled into an encoded character, and decoded into the correct data characters and control characters. The 8b/10b code allows for the encoding of all 256 combinations of eight-bit data. A subset of the control character set is utilized by Serial ATA.
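One property of the 8b/10b code that the Physical layer exploits (as listed in the PHY services earlier) is the comma: a 7-bit pattern, 0011111 or its complement 1100000, that can only appear at the start of a comma character such as K28.5, making it safe to establish word alignment on it. The scanner below is a minimal sketch of that alignment search; the bit-ordering convention is an assumption:
```C
#include <stdint.h>

// Scan a 32-bit window of received serial bits for the 7-bit comma
// pattern (0011111 or 1100000). This sketch assumes the oldest
// received bit sits in the most significant position of the window
int find_comma_offset(uint32_t window) {
    const uint32_t comma = 0x1F;  // 0011111
    for (int off = 0; off <= 32 - 7; off++) {
        uint32_t bits = (window >> (32 - 7 - off)) & 0x7F;
        if (bits == comma || bits == ((~comma) & 0x7F))
            return off;  // word alignment can be established here
    }
    return -1;  // no comma in this window
}
```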
## Double Data Rate (DDR)
The Double Data Rate (DDR) interface is a widely used data transfer scheme for memory devices in modern computing systems. It's designed to enhance memory performance by transferring data on both the rising and falling edges of the clock signal, effectively doubling the data transfer rate compared to single data rate (SDR) SDRAM. Some relevant aspects of the DDR interface include:
1. **Clocking Mechanism**: DDR SDRAM synchronizes data transfers with a system clock. The rising and falling edges of this clock signal are utilized to transfer data, effectively doubling the data transfer rate compared to SDR SDRAM, where data is transferred only on one edge of the clock. As the name implies, DDR enables two data transactions to occur within a single clock cycle without doubling the applied clock or without doubling the size of the data bus. This increased data bus performance is due to [[Physical Layer#Source-synchronous|source-synchronous]] data strobes that permit data to be captured on both the falling and rising edges of the strobe.
2. **Data Bus**: DDR SDRAM employs a bidirectional data bus, allowing data to be transferred to and from the memory device. This bidirectional nature enhances memory throughput and efficiency.
3. **Memory Banks and Ranks**: DDR SDRAM typically organizes memory into multiple banks and ranks. Memory banks allow for concurrent access to different memory locations, improving overall memory access speed. Ranks enable systems to address multiple memory modules in parallel, further enhancing memory capacity and bandwidth.
4. **Data Rate**: The data rate of DDR SDRAM is expressed in megatransfers per second (MT/s). DDR memory standards, such as DDR, DDR2, DDR3, DDR4, and DDR5, define specific data rates and timings for memory operation.
5. **Prefetching**: DDR SDRAM utilizes prefetch buffers to fetch multiple data words in a single operation. This prefetching mechanism reduces the latency of memory accesses and increases memory bandwidth by fetching additional data in anticipation of future requests.
6. **Timings and Latencies**: DDR SDRAM specifications include various timings and latencies, such as CAS latency (CL), tRCD, tRP, and tRAS, which define the timing relationships between different memory operations. Optimizing these timings is relevant for achieving maximum memory performance (see the worked example after this list).
7. **Termination**: Proper termination of DDR signal lines is essential to minimize signal reflections and ensure [[Physical Layer#Signal Integrity|signal integrity]]. On-Die Termination (ODT) and Series Termination Resistors (RTT) are commonly used techniques to achieve impedance matching and minimize signal distortion.
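To make item 6 concrete, the absolute latency contributed by a timing parameter is just its cycle count multiplied by the clock period, remembering that the DDR clock runs at half the transfer rate. The worked example below uses illustrative part parameters (DDR4-3200, CL22):
```C
#include <stdio.h>

int main(void) {
    double transfer_rate_mts = 3200.0;           // DDR4-3200: 3200 MT/s
    double clock_mhz = transfer_rate_mts / 2.0;  // DDR clock: 1600 MHz
    double clock_period_ns = 1000.0 / clock_mhz; // 0.625 ns

    int cas_latency_cycles = 22;                 // CL22 (illustrative)
    printf("CAS latency: %.3f ns\n",
           cas_latency_cycles * clock_period_ns); // prints 13.750 ns
    return 0;
}
```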
DDR has ruled the roost as the main system memory in PCs for a long time. Of late, it's seeing more usage in embedded systems as well.
A DDR interface entails each DRAM chip transferring data to/from the memory controller over several digital data lines. These data streams are accompanied by a strobe signal. Because data can flow both from the controller to the DRAM (write operation) and from the DRAM to the controller (read operation), these digital lines are bidirectional in nature.
Global clock, command, and address lines serve all DRAM chips present. Because these lines control the interface's operation, they are unidirectional between the controller and the memory chips. The figure below illustrates the "fly-by" topology in use beginning with the DDR3 standard.

> [!Figure]
> _Common clock, command, and address lines link DRAM chips and controller (source: https://www.signalintegrityjournal.com/blogs/8-for-good-measure/post/473-ddr-memory-interface-basics)_

>[!Figure]
>_DDR5 layout and structure (credit: Altium)_
In a DDR interface, a bit is transmitted on the rising edge of the clock, and another on the falling edge of the clock. The clock runs at half of the DDR data rate and is distributed to all memory chips.
The DDR command bus consists of several signals that control the operation of the DDR interface. Command signals are clocked only on the rising edge of the clock. Possible command states vary by DDR speed grade but can include: deselect, no operation, read, write, bank activate, precharge, refresh, and mode register set.
The address bus selects which cells of the DRAM are being written to or read from. Like the command bus, the address bus is single-clocked. The bit values on the bus determine the bank, row, and column being written or read.
Due to the interface's bi-directional nature, data is transferred between the memory and controller in bursts. To that end, the strobe (DQS) signal is a differential "bursted clock" that only functions during read and write operations. In most DDR generations since its inception, the timing relationship between the strobe and data signals is different for reads and writes: the controller launches write data with DQS center-aligned to the data eye, whereas the DRAM returns read data with DQS edge-aligned to the data, leaving the controller to delay the strobe before sampling (see figure below).

> [!Figure]
> _The timing relationship between the DDR strobe and data signals is different for reads and writes (source: https://www.signalintegrityjournal.com/blogs/8-for-good-measure/post/473-ddr-memory-interface-basics)_
Each DRAM chip has multiple parallel data lines (DQ0, DQ1, and so on) that carry data from the controller to the DRAM for write operations and vice versa for read operations. The data signals are true double-data-rate signals that transition at the same rate as the clock/strobe (two transfers per clock cycle).
Although DDR can bring improved performance to an embedded design, care must be observed in the schematic and layout phases to ensure that the desired performance is realized. Smaller setup and hold times, cleaner reference voltages, tighter trace match, and the need for proper termination can present the board designer with a new set of challenges that were not present for single data rate designs.
Two topology types are supported for DDR4 SDRAM: fly-by and clamshell. The fly-by topology consists of all memory devices on one layer, usually in-line (see below). This type of topology is generally easier to route and can offer the best signal integrity, but can take up precious board real estate.

> [!Figure]
> _Fly-by topology_
Clamshell topology places memory devices on both sides of the board, directly opposite each other. It requires more intricate routing but is optimal for designs where board space is at a premium.

> [!Figure]
> _Clamshell topology_
The table below lists the common signals found in DDR4 memory:
| Signal Name | Size | Type | Explanation |
| ------------------------ | ------ | ------------ | -------------------------------------------------------------------------------------------------------------- |
| DQ (Data) | Varies | Single-ended | Data input/output pins used for transferring data between the memory controller and the DDR4 memory devices. |
| DQS (Data Strobe) | Varies | Differential | Data strobe signal used to indicate the valid data is present on the DQ pins during read and write operations. |
| CK (Clock) | 1 | Differential | Clock signal used to synchronize data transfer between the memory controller and DDR4 memory devices. |
| CA (Command/Address) | Varies | Single-ended | Command and Address signals used to control and specify memory operations, such as read, write, and refresh. |
| CKE (Clock Enable) | 1 | Single-ended | Clock enable signal used to enable/disable clocking to the DDR4 memory devices. |
| CS (Chip Select) | 1 | Single-ended | Chip select signal used to enable/disable a specific DDR4 memory device on the memory bus. |
| ODT (On-Die Termination) | 1 | Single-ended | On-die termination signal used to match the impedance of the memory bus for better signal integrity. |
| RESET_n | 1 | Single-ended | Reset signal used to initialize or reset the DDR4 memory devices. |
| VDD/VDDQ | N/A | N/A | Power supply voltage pins for the DDR4 memory devices (VDD for logic, VDDQ for I/O). |
| VSS/GND | N/A | N/A | Ground pins for the DDR4 memory devices. |
The size of signals like DQ, DQS, and CA can vary depending on the specific DDR4 memory configuration and the number of memory channels used.
#### Routing
Design challenges confronting the board designer concerning DDR signals can be summarized as follows:
- Routing requirements
- Power supply and decoupling, which includes the DDR devices and controller
- Proper termination for a given memory topology
Manufacturers provide layout considerations within these areas and include recommendations that can serve as an initial baseline for board designers as they begin specific implementation.
In a typical application, a DDR memory controller may consist of more than 130 signals to provide a glueless interface for the memory subsystem. These signals can be divided into the following signal groups to analyze their routing:
- Clocks
- Data
- Address/Command
- Control
- Feedback signals
To help ensure that the DDR interface is properly implemented, manufacturers may recommend a given sequence for routing the signals, for instance:
1. Power (VTT island with termination resistors, VREF)
2. Pin swapping within resistor networks
3. Route data
4. Route address/command
5. Route control
6. Route clocks
7. Route feedback
The data group is typically listed before the address/command and control groups because it operates at twice the clock rate, so its signal integrity is of higher concern. In addition, the data group constitutes the largest portion of the memory bus and carries the majority of the trace-matching requirements, namely those of the data lanes. The address/command, control, and data groups all have a timing relationship to the routed clock; therefore, the effective clock lengths used in the system must satisfy multiple relationships simultaneously. The designer should construct system timing budgets to verify that all of these relationships are properly satisfied.
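To make the notion of a timing budget concrete, the sketch below converts routed trace-length mismatches within a byte lane into delay skew and checks them against a budget. The propagation delay, trace lengths, and skew budget are all illustrative assumptions; real numbers come from the board stack-up and the controller's datasheet.

```python
# Minimal sketch of a trace-matching check for one DDR byte lane.
# Assumed propagation delay for a stripline trace; actual values depend
# on the board stack-up and dielectric material.

PROP_DELAY_PS_PER_MM = 6.7   # ~170 ps/inch, a common stripline rule of thumb

def skew_ps(length_a_mm: float, length_b_mm: float) -> float:
    """Delay skew between two traces, in picoseconds."""
    return abs(length_a_mm - length_b_mm) * PROP_DELAY_PS_PER_MM

# Hypothetical routed lengths of DQ0..DQ7 and their associated DQS pair (mm)
dq_lengths = [51.2, 50.8, 51.5, 50.9, 51.1, 51.4, 50.7, 51.3]
dqs_length = 51.0
budget_ps = 5.0              # example intra-byte-lane skew budget

for i, dq in enumerate(dq_lengths):
    s = skew_ps(dq, dqs_length)
    status = "OK" if s <= budget_ps else "VIOLATION"
    print(f"DQ{i}: skew to DQS = {s:.1f} ps -> {status}")
```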
> [!Warning]
> This section is under #development
## InfiniBand
> [!Warning]
> This section is under #development
## Proprietary Interfaces
### NVLink
NVLink is a high-speed, high-bandwidth interconnect technology developed by NVIDIA primarily for connecting multiple [[Semiconductors#Graphics Processing Units (GPUs)|GPUs]] or other accelerators within a single system. It's designed to address the demand for faster data transfer rates and improved communication between GPUs in large-scale parallel computing applications such as deep learning, scientific simulations, and data analytics.
NVLink is a proprietary interface; therefore, its internal details are not fully public. NVIDIA claims it provides significantly higher bandwidth and lower latency than PCIe, enabling faster data exchange between GPUs.
An important feature of NVLink is its support for coherent memory addressing across multiple GPUs. This means that each GPU can access the memory of other GPUs in the system as if it were its own. Coherent memory addressing is especially valuable for applications that require large shared memory spaces or distributed data structures.
NVLink technology has been integrated into various NVIDIA GPU architectures, starting with the Pascal architecture and continuing with subsequent generations such as Volta, Turing, and Ampere. NVIDIA has also partnered with server manufacturers to incorporate NVLink support into high-performance computing (HPC) systems, supercomputers, and [[Data Centers and "The Cloud"|data center]] platforms, further expanding its adoption in scientific and enterprise computing environments.
NVLink's design and implementation details are controlled by NVIDIA, and access to the technology is limited to NVIDIA's own products and partner systems that incorporate NVLink support under license from NVIDIA.
# Deterministic Networking
Best-effort delivery describes a network service in which the network does not provide any guarantee that data is delivered or that delivery meets any quality of service; in a best-effort network, all users obtain the same best-effort service. Under best-effort policies, network performance characteristics such as delay and packet loss depend on the current traffic load and on the network hardware capacity. As network load increases, this can lead to packet loss, retransmissions, packet delay variation, additional delay, or even timeouts and session disconnects.
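How quickly delay degrades under load can be illustrated with a classic single-queue (M/M/1) model, in which the average time a packet spends in the system is W = 1 / (mu - lambda). The sketch below is only an illustration of the best-effort behavior described above; the service rate is an arbitrary assumption, not a property of any real network.

```python
# Average time in system for an M/M/1 queue: W = 1 / (mu - lambda).
# Shows how best-effort delay grows without bound as load nears capacity.

service_rate_pps = 100_000.0   # packets/s the link can serve (mu), assumed

for load in (0.5, 0.8, 0.9, 0.99):
    arrival_rate_pps = load * service_rate_pps        # offered load (lambda)
    avg_delay_us = 1e6 / (service_rate_pps - arrival_rate_pps)
    print(f"load {load:.0%}: average delay ~ {avg_delay_us:.0f} us")
# load 50%: ~20 us ... load 99%: ~1000 us
```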
To use a familiar analogy: the postal service physically delivers letters using a best-effort approach. The delivery of a given letter or package is not scheduled in advance; no resources are preallocated in the post offices. The service does the best it can to deliver, but delivery may be delayed if too many letters or packages suddenly arrive at a post office or sorting center. You can never be sure when a package will arrive at your home, only that it will most likely arrive when you're not there.
Best-effort networks are not suitable for mission-critical applications. For instance, imagine connecting a patient's pacemaker to a nondeterministic network, where each heartbeat would depend on a packet crossing that network to a server for validation before the pacemaker is authorized to deliver an electrical pulse to the heart.
The same goes for other mission-critical systems. Imagine an autopilot on an airliner sending aircraft state through a nondeterministic network, where a remote algorithm decides what torque to exert on the control surfaces to keep the aircraft level. Or the accelerometer signal of a car's airbag system traveling over the Internet before a decision is made on whether to protect a passenger from a crash about to happen at 100 kilometers per hour.
Many, if not most, of the networks around us offer best-effort service for delivering packets. Packets may be lost, arbitrarily delayed, corrupted, or duplicated. The Internet has been designed to try "the best it can". But is that enough?
It was for a long while, but with countless machines now connected to the network exchanging information related to industrial processes and critical infrastructure, network determinism is an increasing need.
The alternative to this is deterministic networking.
Deterministic networks provide guaranteed latency on a per-deterministic-flow basis. The data traffic of each flow is delivered within a guaranteed bounded latency and low delay variation constraint. Deterministic networks aim to deliver zero data loss due to congestion for all allocated deterministic flows. Deterministic networks may reject or degrade flows to maintain the characteristics of the admitted deterministic flows. Deterministic networks support a wide range of applications, each possibly having different Quality of Service (QoS) requirements.
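The flow-admission behavior described above can be sketched as a simple bandwidth-reservation check: a new flow is admitted only if the total reservation stays within link capacity, so the latency guarantees of already-admitted flows are preserved. The policy, flow names, and numbers below are hypothetical and are not taken from any DetNet or TSN specification.

```python
# Hypothetical admission control for deterministic flows on a single link.
# A flow is admitted only if total reserved bandwidth stays within capacity,
# preserving the guarantees of flows that were already admitted.

LINK_CAPACITY_MBPS = 1000.0   # assumed 1 Gb/s link
admitted = []

def try_admit(flow):
    """Admit the flow if the aggregate reservation fits the link capacity."""
    reserved = sum(f["mbps"] for f in admitted)
    if reserved + flow["mbps"] <= LINK_CAPACITY_MBPS:
        admitted.append(flow)
        return True
    return False   # reject rather than degrade the admitted flows

for flow in ({"name": "motion-control", "mbps": 400.0},
             {"name": "machine-vision", "mbps": 500.0},
             {"name": "bulk-backup", "mbps": 300.0}):
    verdict = "admitted" if try_admit(flow) else "rejected"
    print(f"{flow['name']}: {verdict}")
# motion-control: admitted / machine-vision: admitted / bulk-backup: rejected
```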
Engineering deterministic services requires a different paradigm compared to engineering traditional packet-switched services. The latter exhibit loss/latency/jitter curves with wide probability distributions. In traditional networks, achieving lower latency means discarding more packets (or requires heavy over-provisioning).
A core objective of DetNet is to enable the convergence of sensitive non-IP networks onto a common network infrastructure. This requires the accurate emulation of currently deployed mission-specific networks, which, for example, rely on point-to-point analog (e.g., 4-20 mA) and serial-digital cables or buses for highly reliable, synchronized, and jitter-free communications. While the latency of analog transmissions is low, legacy serial links are usually slow (on the order of kbps) compared to, say, Gigabit Ethernet, and some latency is usually acceptable. What is not acceptable is the introduction of excessive jitter, which may, for instance, affect the stability of control systems.
Several emerging standards are defining the technology building blocks needed for the delivery of reliable and predictable network services over Deterministic Networks. IEEE 802.1 is working to support deterministic Ethernet services in its Time-Sensitive Networking (TSN) Task Group. 3GPP is working to deliver deterministic 5G in support of Ultra-Reliable and Low Latency Communication (URLLC) usage scenarios. The IETF is working to deliver deterministic services over IP routers and wireless networks in the respective Deterministic Networking (DetNet) and Reliable and Available Wireless (RAW) Working Groups. The technology standards being developed in the IEEE, 3GPP, and IETF aim to deliver solutions that are complementary and can be combined by network operators to deliver end-to-end converged networks supporting both traditional and deterministic services. The ability to ensure the delivery of traffic for each flow's different Quality of Service requirements is critical for all deterministic networking technologies.
[^34]: PCI Express is maintained by the Peripheral Component Interconnect Special Interest Group (PCI-SIG). More information: https://pcisig.com/
[^35]: https://pcisig.com/specifications
[^36]: TLPs can have different "prefixes" that modify or augment the information in the packet. These prefixes provide additional control or status information that affects how the packet is processed.
[^37]: https://www.intel.com/content/www/us/en/io/pci-express/phy-interface-pci-express-sata-usb30-architectures-3-1.html
[^38]: https://sata-io.org/system/files/specifications/SerialATA_Revision_3_1_Gold.pdf