# Semiconductors

We will start discussing digital systems—quite literally—from the ground up. So, we will kick off by talking about sand. The sand we find on the beaches is mostly composed of silica, which is another name for silicon dioxide, or SiO2. Silica is one of the most abundant families of materials, occurring as a component of many minerals. Silica is a crystalline material, which means that its atoms are linked in an orderly spatial lattice of silicon-oxygen tetrahedra, with each oxygen being shared between two adjacent tetrahedra. Sand contains silica along with many other things, including macroscopic particles such as plastic and other debris, so SiO2 must be cleaned before it can be used industrially. Once the macroscopic impurities are removed, silica is melted in a furnace at high temperature and reacted with carbon to produce silicon of relative purity[^39].

Sometime in 1915, a Polish scientist called Jan Czochralski woke up one morning on the wrong side of the bed and made a mistake: instead of dipping his pen into his inkwell, he dipped it in molten tin and drew a tin filament, which later proved to be a single crystal. He had invented by accident a method[^40] which remains in use across the semiconductor industry around the world to grow silicon monocrystalline structures, manufactured as ingots[^41] that are then sliced into ultra-thin wafers on which companies etch their integrated circuit layouts[^42]. The process provides the almost pure, monocrystalline silicon that chip makers can work with.

Crystals and their orderly structure have fascinated scientists for ages, perhaps because they provide an illusion of order and for that reason offer an easier grasp of the underlying physics: condensed matter is a complex matter—heh—but when it's arranged in a more or less symmetrical way in three dimensions, it may give the impression of being a tad simpler to comprehend. In a silicon crystal, each silicon atom forms four covalent bonds with its four neighboring silicon atoms; that is, each silicon atom shares a pair of electrons with each of its four neighbors. (In silica, by contrast, each silicon atom bonds to four oxygen atoms, as shown in the figure below.)

![](site/Resources/media/image172.png)

>[!Figure]
>SiO2 structure

As we know, temperature is a quantitative measure of the kinetic energy of the particles that form a substance or material. In crystals, atoms do not really go anywhere; they vibrate around their fixed positions. The temperature of a crystalline structure indicates how violently atoms shake at their spots. Valence electrons, in thermal equilibrium with the crystal they belong to, share the kinetic energy with the rest of the material. But temperature only describes the average energy across the lattice. Momentary differences in local temperature may cause an electron to muster the guts to break its covalent bond and go free[^44]. A bond without its precious electron is a broken bond, and as such will try to recover from this absence, so its affinity for neighboring electrons intensifies. If the broken bond manages to capture an electron from a neighboring bond, the problem is only passed to the neighbor, which will also soon pass it to the next one, and so on. The "hole" left behind by the initial emancipated electron spreads across the lattice. What happens with the initial fugitive electron? It travels across the structure, emotionally disengaged from the problem it caused. Worth noting is that a broken bond creates two phenomena: wandering holes and wandering free electrons. Such free electrons are also called conduction electrons.
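To get a feel for how rare these thermally generated pairs are, here is a minimal sketch (not taken from the text; it assumes textbook round numbers for silicon, namely a 1.12 eV bandgap and roughly $10^{10}$ carriers per cm³ at 300 K) of how the intrinsic carrier density scales with temperature:

```python
import math

K_B = 8.617e-5     # Boltzmann constant, eV/K
E_G = 1.12         # silicon bandgap, eV (room-temperature value)
N_I_300 = 1.0e10   # intrinsic carrier density of Si at 300 K, cm^-3 (textbook round number)

def intrinsic_density(temp_k: float) -> float:
    """Scale n_i with temperature using the T^(3/2) * exp(-Eg / 2kT) dependence."""
    power_term = (temp_k / 300.0) ** 1.5
    exp_term = math.exp(-E_G / (2 * K_B) * (1 / temp_k - 1 / 300.0))
    return N_I_300 * power_term * exp_term

for t in (250, 300, 350, 400):
    print(f"T = {t} K -> n_i ≈ {intrinsic_density(t):.2e} cm^-3")
```

Silicon packs about $5 \times 10^{22}$ atoms per cm³, so at room temperature only roughly one bond in $10^{12}$ is broken at any instant, and the count grows steeply (roughly doubling every ten kelvin or so) as the lattice heats up.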
> [!info]
> By the mid-1920s, scientists had developed most of the modern picture of the atom. Every atom has a fixed number of protons, which specifies its atomic number, and this number uniquely determines which element the atom represents. The protons reside in the nucleus of the atom, while electrons orbit far away, on the scale of the atom. The swarm of electrons is arranged in shells of increasing energy levels. Shells are further subdivided into orbitals of slightly differing energies. Only two electrons may occupy each orbital, but the total number of orbitals differs for each shell. The innermost shell, of lowest energy, has only one orbital and can contain only two electrons, whereas outer shells can hold more electrons, always in multiples of two. The chemical properties of an atom are determined by the number and arrangement of its electrons. Atoms that are electrically neutral rarely possess a set of fully occupied shells. Atoms engage in chemical reactions in an attempt to fill shells that have available orbitals. Only the inert gases (sometimes called the noble gases) of helium, neon, argon, krypton, xenon, and radon have filled orbitals; consequently, they participate in almost no chemical reactions. The strict regularity in the filling of electron shells accounts for the patterns in the Periodic Table; atoms with similar numbers of unpaired electrons have similar chemical properties. An atom that literally gains or loses an electron, thereby acquiring a net electrical charge, is called an ion. Two ions of opposite charge that approach closely can be electrically attracted and thus can sometimes adhere to form a chemical compound; such a bond is said to be ionic. The most familiar example of a compound held together by an ionic bond is sodium chloride, ordinary table salt. Most atoms cling fairly tightly to their electrons, however, and the most common type of chemical bond is the covalent bond, in which the atoms share electrons. Many elements have an unfilled outer shell; under the right circumstances they are likely to lose an electron, becoming a positively charged ion. One means by which this can happen is related to temperature; a sufficiently large heating can provide enough energy to an outer electron to liberate it completely from the nucleus.

Undisputed kings of negative charge, electrons leave less negative, thus more positively charged zones behind them. Therefore, in the vicinity of holes, the charge is more positive, and such positivity travels as the hole travels. Thus, we can say holes have a positive charge.

A wafer of pure mono-crystalline silicon or germanium does not do much in and of itself. It is just an 'intrinsic' material with electrons and holes moving around because of bonds constantly being broken due to thermal agitation. Intrinsic materials create electron-hole pairs in exactly equal numbers, since each one exists because of the other (along with some other particles existing inside intrinsic silicon as well, like photons). Intrinsic materials would be of little practical use if we couldn't break the balance between electrons and holes. How to break that harmony? By opportunistically sprinkling our crystals with more electrons (or more holes) by means of adding impurities. Didn't we say impurities were bad? Yes, but these are more sophisticated, controlled impurities, unlike the microplastic that washes ashore on beaches as a product of our pointless mass consumption urges.
However, here's the catch: we cannot just add loose electrons like we add pepper to salad; the Coulomb forces would be insane due to the sudden electric charge imbalance. All we can do is add atoms that can contribute with electrons, called donors. Examples of donors are phosphorus or arsenic. Typical proportions of impurity atoms are one of these guys for every million silicon atoms. When a donor atom is implanted in the lattice, it mimics the Si atom quite well; it completes the four covalent bonds the same way as Si atoms do. But arsenic happens to have 5 valence electrons, so one electron does not belong to any bond, and because it's not trapped in any potential barrier, it has higher energy than its other 4 cousins, and thus it has higher chances of leaving the atom behind, leaving it positively charged as a gift. An ion is born, fixed in the crystalline structure. The material remains electrically neutral at the macro level, but it's now populated with positively charged spots, all balanced by the free electrons wandering around. Conversely, acceptor impurities do the opposite. Aluminum, Indium, and Gallium, for instance, are good examples of acceptor elements. Adding acceptors is a way of adding holes to a lattice, without breaking the macroelectric neutrality. An Indium atom fits comfortably in the lattice, impersonating a Silicon atom, but it has only 3 valence electrons. You get the score. A hole is now there because one covalent bond is missing. This vacant bond is open for business, and eventually, it will get filled by an electron, breaking the impurity atom's neutrality and thus creating a negative ion. > [!Note] > Valence electrons are the outermost electrons in an atom, involved in chemical bonding. They determine an element's reactivity and bonding abilities. Valence electrons occupy the highest energy levels and are important in forming chemical bonds with other atoms. The number of valence electrons corresponds to an element's position on the periodic table, with elements in the same group sharing similar properties due to identical valence electron configurations. Chemical reactions involve the exchange, sharing, or transfer of these electrons, influencing an atom's stability and the formation of molecules and compounds. Understanding valence electrons is fundamental in predicting and explaining chemical behavior. > [!info] > In a conductor—say, a copper wire—current is the collective drift of charges (usually electrons) under the influence of an electric field. But the microscopic mechanism is subtler than just “electrons move”. > At the atomic level, metals have a lattice of positively charged ions, with valence electrons that are not tightly bound to any single atom. These electrons form what’s often called an **electron gas** or a **Fermi sea**: a huge population of mobile charges that can move freely through the lattice. In equilibrium (no applied voltage), electrons fly around randomly at very high speeds (on the order of 10^6 m/s, the Fermi velocity), but with no net flow of charge. > When you apply a voltage across the conductor, you establish an electric field inside it. That field slightly “tilts” the energy landscape, so electrons acquire a tiny additional average velocity superimposed on their random thermal motion. This average motion is the **drift velocity**. Despite the large number of electrons involved, the drift velocity is extremely small—typically millimeters per second for common currents. 
That slow drift, multiplied by the huge density of electrons, produces the macroscopic current you measure. The lattice doesn’t just let electrons glide freely forever. Electrons scatter off imperfections, phonons (vibrations of the lattice), and other electrons. Each scattering event randomizes their direction. The electric field constantly re-accelerates them between collisions, which gives rise to a steady drift. While the electrons themselves move slowly, the **signal**—the electromagnetic disturbance that sets them in motion—propagates close to the speed of light in the medium. > So in short: > >- Conduction electrons exist in a “sea” not bound to atoms. >- An electric field gives them a small drift velocity. >- They move in a stop-and-go fashion, constantly scattering. >- The aggregate drift constitutes current, while the signal propagates nearly at light speed. In summary, impurities, whether donors or acceptors, will end up all being ionized. Donors will quickly lose an electron, and acceptors will quickly lose a hole (or gain an electron) because the energy to allow such ionization is quite low. Thermal agitation will make sure that practically all impurities will be ionized; therefore, we can consider that all donors will lose their extra electron. This simplifies the math: we can estimate that the density of conduction electrons will be more or less equal to the density of donor atoms. The same goes for conduction holes. This is important: a piece of silicon crystal with more donor impurities than acceptor impurities will be called **type n**. Similarly, if more acceptors than donors are added to the silicon, the material will be called **type p**. Conduction electrons and holes will not have it easy while traveling inside the lattice. Multiple things will alter their trajectories: repulsion forces coming from fellow moving carriers, un-ionized impurity atoms, ionized impurity atoms, and whatnot. The life of a charge carrier is not simple. > [!info] > The electromagnetic force is the force that exists between charged particles; it is ultimately responsible for many of the everyday forces we experience. It directly holds ions together in ionic bonds, by the attraction of positive and negative charges. It also causes molecules to stick to one another, because molecules almost always have some distribution of charge even if they are neutral overall. It is the adherence of molecules, through the relatively weak electromagnetic forces between them, which holds together almost all everyday objects, including our bodies. Glues work by causing various molecules to link together. The floor does not collapse under a weight because its molecules are electrostatically bound to one another. Friction is simply the very weak attraction of the surface molecules on one object to the surface molecules on the other object. The electromagnetic force is also responsible for the generation and transmission of electromagnetic [[Antennas#Radiation Mechanism|radiation]], that is, light. # The Junction The magic starts to unfold when we sandwich type n and type p materials together. This is called a *junction*, and its properties are worth mentioning because it sets the foundations of all solid-state devices out there. Junctions are not perfect; it is impossible to define an ideally abrupt boundary between a material partially doped with donors and another part partially doped with acceptors. Junctions must be gradual, and this does not affect the physics behind them. 
It is very important to note that junctions are not made by welding a type n crystal to a type p crystal. A junction must still be made of a single crystal; there are no practical means of attaching two bars of silicon with different impurity dosages and expecting it to work. Crystal lattice perfection is a key factor when it comes to the junction's performance.

In equilibrium (that is, with the piece of silicon that hosts the junction at some nonzero temperature, with no electric field applied), the concentration of acceptors will be maximum on the p side and decrease to zero as we approach the junction; the same goes for donors on the n side. As carriers move due to thermal agitation, they cross the boundary, driven by the gradient of impurity concentrations between the far ends. Holes come across the chasm and reach the n side, where they recombine easily because of the high density of electrons there. Equivalently, electrons cross the boundary to the p side and recombine. Then, a zone starts to appear around the border, a zone without carriers. A no man's land of sorts, where all ions are complete. Because acceptor and donor ions are fixed to the lattice, the area around the boundary will be charged slightly negative on the p side (because electrons have found their spots in acceptors) and slightly positive on the n side, because electrons have fled the scene. These non-zero charge levels stemming from the fixed ions create an electric field, and the diffusion process settles when this electric field is intense enough to create drift currents that cancel further diffusion currents driven by the doping concentration gradient.

![[Pasted image 20250204142124.png]]

>[!Figure]
>From top to bottom: hole and electron concentrations through the junction; charge densities; electric field; electric potential

In all our analyses thus far, we have considered the piece of material to be interacting with its surroundings only through thermal energy. But that is only one part of the story. There are several other ways equilibrium in a silicon bar can be disrupted: electric fields, magnetic fields, and light. In an n-type material, holes are the minority carriers. Equivalently, in a p-type material, electrons are the minority carriers. Minority carriers are many, many orders of magnitude less numerous than majority carriers. Now if we put the silicon bar under uniform light, the photons of the light beam will break bonds all across the lattice, creating electron-hole pairs. The photons have created carriers of both signs in equal amounts, but the minority carriers are the ones noticed here. A few extra electrons on the n-side will not move the needle; at the end of the day, there are already myriad electrons there, so they are nothing special. However, an increase in the number of holes on the n-side will indeed be comparatively noticeable. The injection of minority carriers is an important effect that will also play a part in the discovery of the bipolar transistor. You start to see the tendency of semiconductors to easily become a mess just by being beamed with some harmless light.

> [!attention]
> To know more about how light interacts with semiconductors, and actually how light can even disrupt the work of integrated circuits, see this: https://www.raspberrypi.com/news/xenon-death-flash-a-free-physics-lesson/

Now, to break the equilibrium in the junction, we must apply a voltage to the junction.
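Before applying that voltage, it helps to put a number on the barrier that the fixed ions have built up in equilibrium. The sketch below evaluates the standard expression for the built-in potential of an abrupt silicon junction; the doping levels are illustrative assumptions, not values from the text:

```python
import math

KT_Q = 0.02585   # thermal voltage kT/q at ~300 K, volts
N_I = 1.0e10     # intrinsic carrier density of silicon, cm^-3 (approximate)

def built_in_potential(n_a: float, n_d: float) -> float:
    """Built-in potential of an abrupt p-n junction: Vbi = (kT/q) * ln(Na * Nd / ni^2)."""
    return KT_Q * math.log(n_a * n_d / N_I**2)

# Illustrative doping: a moderately doped p side and a heavily doped n side
print(f"Vbi ≈ {built_in_potential(1e16, 1e18):.2f} V")
```

The result, a bit under a volt, is the familiar order of magnitude of the barrier a silicon junction presents before any external bias is applied.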
In forward bias, the p-type side is connected to the positive terminal of a voltage source and the n-type side to the negative terminal. Only the majority carriers (electrons in n-type material or holes in p-type) can flow through a semiconductor for a macroscopic length. The forward bias causes a force on the electrons pushing them from the n side toward the p side. With forward bias, the depletion region is narrow enough that electrons can cross the junction and inject into the p-type material. However, they do not continue to flow through the p-type material indefinitely, because it is favorable for them to recombine with holes. The average length an electron travels through the p-type material before recombining is called the diffusion length, and it is typically on the order of micrometers. Although the electrons penetrate only a short distance into the p-type material, the electric current continues uninterrupted, because holes (the majority carriers on that side) begin to flow in the opposite direction. The total current (the sum of the electron and hole currents) is spatially constant. The flow of holes from the p-type region into the n-type region is exactly analogous to the flow of electrons from n to p. Therefore, the macroscopic picture of the current flow through this device involves electrons flowing through the n-type region toward the junction, holes flowing through the p-type region in the opposite direction toward the junction, and the two species of carriers constantly recombining in the vicinity of the junction. The electrons and holes travel in opposite directions, but they also have opposite charges, so the current is in the same direction on both sides of the material, as required.

Now we do the opposite. Connecting the p-type region to the negative terminal of the voltage source and the n-type region to the positive terminal corresponds to reverse bias. Because the p-type material is now connected to the negative terminal of the power supply, the holes in the p-type material are pulled away from the junction, leaving behind charged ions. Likewise, because the n-type region is connected to the positive terminal, the electrons are pulled away from the junction, with a similar effect. This increases the voltage barrier, causing a high resistance to the flow of charge carriers and thus allowing minimal electric current to cross the boundary. But some current—a leakage current—does flow. Leakage current is caused by the movement of minority carriers (electrons in p-type and holes in n-type) across the depletion region of the junction. As the depletion region widens, the potential barrier at the junction increases. However, even though the potential barrier is high, a small number of minority carriers can still cross the junction by thermionic emission[^45] or tunneling. The amount of leakage current depends on several factors, including the doping concentration of the semiconductor material, the temperature, and the voltage applied across the diode. Higher doping concentrations and higher temperatures can increase the number of minority carriers and therefore increase the leakage current. The increase in resistance of the p-n junction results in the junction behaving as an insulator. The strength of the depletion zone electric field increases as the reverse-bias voltage increases.
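The asymmetry just described (an exponentially growing forward current versus a tiny, nearly constant reverse leakage) is captured by the Shockley diode equation. Here is a minimal sketch, using an illustrative saturation current and an ideality factor of 1 (neither value comes from the text):

```python
import math

V_T = 0.02585    # thermal voltage at ~300 K, volts
I_S = 1e-12      # saturation (leakage) current, amperes -- illustrative value
N = 1.0          # ideality factor, assumed ideal

def diode_current(v: float) -> float:
    """Shockley diode equation: I = Is * (exp(V / (n * Vt)) - 1)."""
    return I_S * (math.exp(v / (N * V_T)) - 1.0)

for v in (-5.0, -0.5, 0.3, 0.5, 0.7):
    print(f"V = {v:+.1f} V -> I ≈ {diode_current(v):+.3e} A")
```

Note how the reverse current simply settles at roughly $-I_S$ no matter how negative the voltage gets, which is exactly the leakage behavior described above.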
But everything has a limit. Once the electric field intensity increases beyond a critical level, the p-n junction depletion zone may break down and current will begin to flow even under reverse bias, usually through what is called the avalanche breakdown[^46] process. When the electric field is strong enough, the mobile electrons or holes may be accelerated to high enough speeds to knock other bound electrons free, creating more free charge carriers, increasing the current, and leading to further "knocking out" processes and creating an avalanche. In this way, large portions of a normally insulating crystal can begin to conduct. This breakdown process is non-destructive and reversible, as long as the amount of current flowing does not reach levels that cause the semiconductor material to overheat and suffer thermal damage.

# Noise

It is important to say that the hectic scene inside a semiconductor described in this section can be noticed from the outside. All these electrons and holes knocking about the junction create a good deal of noise which can affect external circuits. For instance, shot noise, also known as Schottky noise, is a type of electrical noise that arises from the random nature of the flow of electric charge carriers in the material. In semiconductors, shot noise occurs when electrons and holes cross the junction, and it is caused by the discrete nature of charge carriers and their motion. Because of this discreteness, current in a junction does not flow smoothly but rather in bursts or "shots" of current; its mean-square value grows with the average current and the measurement bandwidth ($\overline{i_n^2} = 2qI\Delta f$). These bursts occur when electrons or holes overcome the potential barrier and move from one side to the other. The size and frequency of these bursts depend on several factors, including the voltage applied, the temperature of the material, and the concentration of charge carriers.

At the beginning of this section, we commented that thermal agitation caused electrons to break loose from their atoms in the lattice and go wild, creating electron-hole pairs. The same thermal agitation of charge carriers gives rise to Johnson-Nyquist noise, also known as thermal noise, a type of electrical noise that stems from the random thermal motion of charge carriers and therefore increases with temperature (for a resistance $R$ over a bandwidth $\Delta f$, its mean-square voltage is $\overline{v_n^2} = 4k_BTR\Delta f$). Thermal noise is present in all electric circuits, and in radio receivers, it may affect weak signals. There is also flicker noise, which, although not fully understood, is believed to be related to the trapping and release of charge carriers by defects or impurities in the semiconductor material. All these noise sources can affect the performance of the external circuits using the semiconductor—low-noise circuits above all—and their relevance changes with the application and with the current levels and frequencies involved.

All in all, what we have described in these paragraphs is nothing but the inner workings of a diode. A diode is a solid-state device which conducts current primarily in one direction. As we will see, being able to control the direction of the flow of electrons and holes would prove to be of great importance. Why stop with only one junction?

# The Transistor Drama

A drama you didn't expect: the transistor drama. After Bardeen and Brattain's December 1947 invention of the point-contact transistor[^47], William Shockley dissociated himself from many of his colleagues at Bell Labs, and eventually became disenchanted with the institution itself.
Some hint that this was the result of jealousy at not being fully involved in the final experiments with the point-contact transistor, and of frustration at not progressing rapidly up the laboratory management ladder. Mr. Shockley had, in the words of his employees, an unusual management style[^48]. Shockley recognized that the point-contact transistor's delicate mechanical configuration would be difficult to manufacture in high volume with sufficient reliability. He also disagreed with Bardeen's explanation of how their transistor worked. Shockley claimed that positively charged holes could also penetrate through the bulk germanium material, not only trickle along a surface layer. And he was right. On February 16, 1948, physicist John Shive achieved transistor action in a sliver of germanium with point contacts on opposite sides, not next to each other, demonstrating that holes were indeed flowing through the thickest part of the crystal.

All we have said before about the p-n junction applies to transistors. But transistors have three distinctive regions, with two boundaries or junctions: n-p-n or p-n-p, typically called emitter, base, and collector. The emitter is heavily doped with impurities, and for that reason it is usually labeled n++ or p++. The base is weakly doped; for the collector the doping level is less critical and depends on the manufacturing process. The most important constructive parameter is the base width, or W. The junction separating the emitter from the base is called, unsurprisingly, the emitter junction, whereas the junction separating the base from the collector is called the collector junction. Naming at least is not complicated.

![](image173.jpg)

> [!Figure]
> A bipolar transistor with one junction in forward bias and another one in reverse bias (source: #ref/Tremosa )

To understand the inner workings of a transistor of this kind, let's assume a p-n-p arrangement where we forward bias the emitter junction, that is, the positive terminal of the voltage source connected to the emitter, and the negative terminal to the base (see figure above). Conversely, we reverse-bias the collector junction: negative terminal of a power source to the collector, positive terminal to the base. This way, the emitter-to-base current is large because the junction is forward-biased—with the current value being governed by the diode equation[^49]. Given that this junction is highly asymmetric (the doping of the emitter p-region is orders of magnitude higher than the doping of the base n-region), the emitter current will be largely composed of holes going from the p-side to the n-side (current 1 in the figure above). If the base width (W) is narrow enough, and because the base area is electrically neutral, the holes traversing the emitter junction will find their way to the collector junction, where the electric field will capture and inject them into the collector area (currents 3 and 4 in the figure). Some holes will recombine in the base (current 6), creating a base current that is very small due to the low doping of the base section and the small width of the base. With all this, the emitter current passes almost unaltered to the collector. The collector current is almost independent of the collector-base voltage, as long as this voltage remains negative. Otherwise, the collector would also inject holes into the base, altering the functioning of the device. This is an important mode (saturation mode) we will talk about later.
The electric field at the collector junction injects the holes into the collector area, and the magnitude of this electric field does not affect the number of holes arriving at that place. It is the base, and the diffusion that happens there, that defines how many holes make it to the collector. Even zero volts between the collector and base would keep that current flowing.

Thus far, we have been analyzing the behavior of the transistor mostly from its direct-current (DC) biasing perspective. The analysis could continue by observing how the transistor behaves in the active region when fed with small—and not-so-small—AC signals superimposed on the base voltage, causing the device's biasing to fluctuate around certain points, and how the output signal should track the input, minimizing alterations (i.e., distortion). Although understanding this is of great importance and a topic in itself that finds applications in a myriad of fields such as analog circuits, radiofrequency, communications, hi-fi audio, and whatnot, for this discussion we shall focus on the device in switching mode, that is, moving between defined, discrete conduction states: from cut-off to saturation, and swinging between them as fast as possible. In this mode, the transistor acts as a switch, evolving from one extreme state (cut-off, or open switch) to the other (saturation, or closed switch) as fast as possible.

A transistor operating in the cut-off region has its two junctions working in reverse bias mode. In this situation, only leakage current flows from collector to emitter. Conversely, in saturation, the device has both junctions in forward-bias mode, with thin depletion layers, allowing the maximum current to flow through it. By controlling the biasing of the emitter-base junction, we can make the transistor transition between these two modes: full current conduction or practically zero current conduction.

# Solid-State Switches and Digital Logic

The transistor in switching mode sets the foundation of the underlying behavior of practically all digital systems out there. The solid-state switch would go down in history as the spark of a revolution[^50]. The junctions we described above, in the form of diodes and transistors, would become the basic building blocks of our modern digital systems. Combining transistors in switching mode can form logic gates. For instance, a simple bipolar junction transistor (BJT) can form a NOT gate, which basically takes an input and inverts it[^51].

![NOT gate with BJT transistor.](image174.jpg)

> [!Figure]
> _NOT gate with BJT transistor._

Its truth table is:

| **A** | **Output** |
| ----- | ---------- |
| 0     | 1          |
| 1     | 0          |

Similarly, a BJT can form a NAND gate:

![NAND gate with BJT transistor](image175.jpg)

> [!Figure]
> _NAND gate with BJT transistor._

Its truth table is:

| **A** | **B** | **Output** |
| ----- | ----- | ---------- |
| 0     | 0     | 1          |
| 0     | 1     | 1          |
| 1     | 0     | 1          |
| 1     | 1     | 0          |

![NAND logic ANSI symbol](site/Resources/media/image176.png)

> [!Figure]
> NAND logic ANSI symbol

The NAND gate has the property of functional completeness. That is, any other logic function can be implemented using only NAND gates; in fact, an entire microprocessor could be created using NAND gates alone (a short demonstration in code follows below). We will not revisit every logic gate, since that information can easily be found elsewhere.
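Functional completeness is easy to demonstrate in a few lines. The sketch below (not tied to any particular circuit in the text) builds NOT, AND, OR, and XOR out of nothing but a two-input NAND and prints their truth tables:

```python
def nand(a: int, b: int) -> int:
    """The only primitive we allow ourselves."""
    return 0 if (a and b) else 1

# Every other gate expressed exclusively in terms of NAND
def not_(a):     return nand(a, a)
def and_(a, b):  return not_(nand(a, b))
def or_(a, b):   return nand(not_(a), not_(b))
def xor_(a, b):  return and_(or_(a, b), nand(a, b))

print(" A B | NOT A | AND | OR | XOR")
for a in (0, 1):
    for b in (0, 1):
        print(f" {a} {b} |   {not_(a)}   |  {and_(a, b)}  |  {or_(a, b)} |  {xor_(a, b)}")
```

Larger blocks (adders, multiplexers, registers) are just deeper compositions of the same primitive, which is why a NAND-only library can, in principle, implement an entire processor.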
The important point here is to show the microscopic foundation of digital systems, and how a rather simple crystalline structure can create something as revolutionary as the transistor. Our ingenuity made it possible to use the transistor to devise logic gates, and logic gates would then form flip-flops. Flip-flops would form registers, while gates would also combine into decoders, multiplexers, demultiplexers, adders, subtractors, and multipliers, which in turn would grow into full-fledged arithmetic logic units (ALUs). As integration technology and processes matured, designers would start packing many logic blocks such as memories, ALUs, and buses inside smaller and smaller die areas. Let's discuss next what all these building blocks are about. But first, MOSFETs.

# Field-Effect Transistors (FETs)

All members of the FET family are also called unipolar transistors, because they base their operation on charge carriers of a single polarity: electrons for an $n$ channel, or holes for a $p$ channel. A fundamental feature of FETs is the existence of a channel whose conductivity is a function of a voltage applied between certain terminals of the device. Thus, FETs are voltage-controlled devices with high input impedance, making them efficient for amplification and switching applications.

![[Pasted image 20250203132629.png]]

> [!Figure]
> The JFET device cross section and symbol

In a JFET, the channel is sandwiched between two heavily doped gate regions of the opposite doping type. The three terminals of the JFET are the source, from which charge carriers enter, the drain, where carriers exit, and the gate, which modulates the flow of carriers. When a voltage is applied between the source and drain, a current flows through the channel due to the movement of majority carriers. The conductivity of this channel is influenced by the voltage applied to the gate. By applying a reverse bias voltage to the gate with respect to the source, a depletion region forms at the p-n junction between the gate and the channel. This depletion region expands inward into the channel, reducing its effective width and thereby increasing the channel's resistance. Since the depletion region is created by the movement of majority carriers away from the junction, the higher the reverse bias voltage at the gate, the more the depletion region grows. At a sufficiently high negative gate-to-source voltage, the depletion regions from both sides of the channel meet in the middle, effectively pinching off the channel and reducing the current flow to a very small value. This is known as the pinch-off effect, which occurs at a specific voltage called the pinch-off voltage. Beyond this point, any further increase in the drain-to-source voltage does not significantly increase the current because the channel remains constricted. Instead, the JFET enters saturation, where the current stabilizes at a nearly constant value determined by the gate voltage.

## Leakage

Leakage current in FETs refers to the unintended flow of electric current that occurs even when the transistor is nominally in its off state, representing a deviation from ideal behavior. This phenomenon is critical in modern devices, particularly as device dimensions shrink, leading to increased static power consumption and challenges in circuit design. The leakage current arises from several physical mechanisms, each contributing to the total off-state current in distinct ways.
One primary source of leakage is the reverse-biased p-n junctions formed between the drain/source regions and the body of the transistor. In a FET, the drain and source regions are heavily doped, creating p-n junctions with the lightly doped body. When these junctions are reverse-biased—such as when the transistor is in the off state—a small leakage current flows, as we discussed before, due to minority carrier diffusion and generation-recombination processes in the depletion region. This current is temperature-dependent and increases with higher reverse voltages. Subthreshold conduction constitutes another significant leakage mechanism. Even when the gate-to-source voltage ($V_{GS}$) is below the threshold voltage ($V_{TH}$), a residual current flows between the drain and source. This occurs because the potential barrier in the channel is not infinitely high, allowing some carriers to diffuse through the weak inversion layer. The subthreshold current exhibits an exponential dependence on $V_{GS}$, governed by the subthreshold swing, a parameter quantifying how sharply the transistor transitions between on and off states. As process technologies scale to smaller nodes, reduced $V_{TH}$ and shorter channel lengths exacerbate subthreshold leakage, making it a dominant contributor to static power dissipation. Gate oxide tunneling represents another leakage pathway, particularly in advanced nodes with ultra-thin gate dielectrics. When the oxide layer becomes sufficiently thin (typically below 2 nm), quantum mechanical tunneling allows electrons to traverse the barrier between the gate electrode and the channel or between the gate and source/drain overlaps. This direct tunneling current depends on the oxide thickness, electric field across the dielectric, and material properties. The industry’s shift to high-$k$ dielectrics (e.g., hafnium-based oxides) instead of silicon dioxide was partly driven by the need to mitigate this tunneling current while maintaining equivalent capacitance. Gate-induced drain leakage (GIDL) arises in the overlap region between the gate and drain under specific biasing conditions. When a high drain-to-gate voltage creates a strong electric field, it can induce band-to-band tunneling in the drain junction. In this process, electrons tunnel from the valence band to the conduction band in the drain’s depletion region, generating electron-hole pairs. The resulting current is particularly pronounced in transistors with negative gate biases or during rapid switching transitions, contributing to both leakage and potential reliability concerns. Punch-through leakage occurs in short-channel devices when the depletion regions surrounding the drain and source merge under high drain-to-source voltages ($V_{DS}$). This effectively creates a conductive path beneath the channel, bypassing gate control. Punch-through is mitigated through channel doping engineering and the use of halo implants, but remains a challenge in aggressively scaled technologies. Collectively, these leakage mechanisms impose trade-offs between performance, power, and reliability in FET design. Subthreshold and gate leakage currents dominate in low-power applications, while GIDL and punch-through are more relevant in high-voltage scenarios. Understanding these effects is essential for optimizing transistor architectures, material choices, and circuit techniques to balance speed, energy efficiency, and thermal management in modern integrated circuits. 
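Among these mechanisms, subthreshold conduction is the easiest to put numbers on: below threshold, the drain current drops by one decade for every "subthreshold swing" worth of gate voltage. A minimal sketch with illustrative values (the threshold voltage, swing, and normalization current below are assumptions, not figures from the text):

```python
import math

V_THERMAL = 0.02585   # kT/q at ~300 K, volts
SS = 0.080            # subthreshold swing, V/decade -- illustrative modern-node value
I_AT_VTH = 1e-7       # drain current at Vgs = Vth, amperes -- illustrative normalization
V_TH = 0.35           # threshold voltage, volts -- illustrative

def subthreshold_current(v_gs: float) -> float:
    """Exponential subthreshold model: every SS volts of Vgs changes Id by one decade."""
    n = SS / (V_THERMAL * math.log(10))   # slope (ideality) factor implied by the swing
    return I_AT_VTH * math.exp((v_gs - V_TH) / (n * V_THERMAL))

for v_gs in (0.0, 0.1, 0.2, 0.3):
    print(f"Vgs = {v_gs:.1f} V -> Id ≈ {subthreshold_current(v_gs):.2e} A")
```

Even with the gate a full 350 mV below threshold, a few picoamps still flow; multiplied by billions of transistors, that residual current is precisely the static power problem discussed next.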
These leakage currents in FET transistors directly contribute to thermal dissipation in integrated circuits. This occurs because any current flowing through a semiconductor device, even when it is nominally "off," results in power dissipation proportional to the product of the leakage current and the voltage across the device ($P = I_{leakage} \times V$). In modern chips, where billions of transistors are densely packed, the cumulative effect of these leakage currents becomes significant, leading to static power consumption that manifests as heat. This thermal dissipation poses critical challenges for chip design, performance, and reliability.

The relationship between leakage and thermal dissipation is multifaceted. Because subthreshold leakage depends exponentially on gate voltage and temperature, even small increases in leakage current per transistor, when aggregated across millions or billions of transistors, can lead to substantial power loss. For example, a modern microprocessor might exhibit static power consumption in the range of tens of watts due to leakage alone, contributing significantly to the chip's thermal profile. Gate oxide tunneling and GIDL further exacerbate this issue. In advanced nodes with ultra-thin gate dielectrics, tunneling currents allow carriers to traverse the oxide, creating additional leakage paths; GIDL, in turn, generates electron-hole pairs in the drain depletion region under high electric fields. Both mechanisms add to the total static power, converting electrical energy into heat even when the transistor is not actively switching. This heat raises the chip's temperature, which in turn increases leakage currents due to the temperature dependence of carrier mobility and generation-recombination rates. This creates a positive feedback loop: higher temperatures increase leakage, which generates more heat, further degrading performance and reliability.

Thermal dissipation from leakage currents is particularly problematic in high-performance computing and mobile devices. In servers or [[Semiconductors#Graphics Processing Units (GPUs)|GPUs]], excessive heat necessitates aggressive cooling solutions to prevent thermal throttling or device failure. In battery-powered systems, static power from leakage reduces energy efficiency, shortening operational lifetimes. Moreover, localized heating from leakage can create temperature gradients across the chip, inducing mechanical stress and electromigration, which degrade interconnects and transistor [[Reliability Assessment Methods#Physics of Failure#Semiconductor-Level Failure Mechanisms|reliability]] over time.

To mitigate these effects, semiconductor technologies employ several strategies. High-k gate dielectrics reduce gate tunneling currents by increasing the physical thickness of the dielectric while maintaining capacitance, and FinFET or gate-all-around (GAA) architectures improve gate control, suppressing subthreshold leakage. Power gating techniques disconnect unused circuit blocks from the supply voltage to eliminate leakage paths, and dynamic voltage/frequency scaling (DVFS) adjusts operating conditions to minimize leakage during low-activity states. Additionally, advanced cooling solutions, such as liquid cooling or phase-change materials, are integrated into systems to manage the thermal load.
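To see how quickly "a few nanoamps per transistor" turns into watts, and how the temperature-leakage feedback loop plays out, here is a deliberately crude sketch. Every number in it (transistor count, per-device leakage, supply voltage, thermal resistance, and the doubling-per-10 °C rule of thumb) is an assumption chosen only for illustration:

```python
N_TRANSISTORS = 2e9   # transistor count -- illustrative
I_LEAK_25C = 5e-9     # average leakage per transistor at 25 C, amperes -- illustrative
V_DD = 0.8            # supply voltage, volts -- illustrative
THETA_JA = 0.3        # junction-to-ambient thermal resistance, C/W -- illustrative
T_AMBIENT = 25.0      # ambient temperature, C
DOUBLING = 10.0       # leakage roughly doubles every ~10 C (rule of thumb, not exact)

temp = T_AMBIENT
for _ in range(20):   # iterate until temperature and leakage settle
    i_leak = I_LEAK_25C * 2 ** ((temp - T_AMBIENT) / DOUBLING)
    p_static = N_TRANSISTORS * i_leak * V_DD          # P = I_leakage * V, aggregated
    temp = T_AMBIENT + THETA_JA * p_static            # heat raises the junction temperature
print(f"Static power ≈ {p_static:.1f} W at a junction temperature of ≈ {temp:.1f} °C")
```

With these particular numbers the loop converges to a modest equilibrium; with a leakier process or a higher thermal resistance, the same loop diverges, which is the thermal-runaway scenario designers must guard against.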
## Energy Efficiency

FETs mostly operate as voltage-controlled switches. Their energy efficiency is governed by how effectively input electrical energy is converted into useful work (e.g., logic state transitions) versus dissipated as heat. This efficiency depends on both dynamic and static power consumption, which are influenced by device physics, process technology, and operational parameters.

Dynamic energy arises from charging and discharging parasitic capacitances inherent to the transistor and interconnects during switching events. The energy per switching cycle is proportional to $C \cdot V_{DD}^2$, where $C$ is the effective capacitance and $V_{DD}$ is the supply voltage. In modern nodes (e.g., 3 nm FinFET or gate-all-around architectures), aggressive voltage scaling (down to ~0.7 V) and reduced capacitance (via geometric scaling and low-k dielectrics) minimize dynamic losses. However, even in advanced nodes, only a fraction of this energy contributes to computational work. A significant portion is lost as heat due to resistive dissipation in interconnects, non-ideal switching behavior (e.g., short-circuit current during transitions), and parasitic capacitances unrelated to logic functions. For a typical FET switching at high frequencies, approximately 40–60% of dynamic energy is effectively utilized for signal swinging, while the remainder is wasted.

Static energy stems from the leakage currents that flow even when the transistor is nominally "off", as we discussed above. In modern nodes, static power can account for 20–40% of total energy consumption under active operation, rising to nearly 100% in idle states. It's worth noting that leakage is highly temperature-dependent and worsens with process variations, limiting the minimum achievable power.

At the device level, energy efficiency is quantified by metrics like the energy-delay product (EDP) or power efficiency (operations per joule). For a modern FET in a 3 nm process, when actively switching, roughly 50–70% of total input energy is expended as heat (dynamic and static losses), with the remainder enabling computational work. However, this varies with workload: high-frequency operation increases dynamic losses, while low-duty-cycle applications emphasize static losses. Advanced techniques like near-threshold computing (operating FETs just above the threshold voltage) or fully depleted silicon-on-insulator (FD-SOI) designs can improve efficiency by reducing $V_{DD}$ and leakage, but these trade off performance for power savings.

FET energy efficiency is a balance between process innovations (e.g., high-k/metal gates, strain engineering), circuit design (e.g., clock gating, power domains), and system-level optimizations. While no FET achieves perfect efficiency, continuous scaling and novel architectures (e.g., tunneling FETs, carbon nanotube transistors) aim to push these limits further.
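The split between dynamic and static power described above can be reproduced with the two standard back-of-the-envelope formulas: switching power $\alpha \cdot C \cdot V_{DD}^2 \cdot f$ plus a leakage term. In the sketch below, every parameter (switched capacitance, activity factor, clock frequency, leakage power) is an illustrative assumption rather than process data:

```python
# Rough split between dynamic and static power for a hypothetical digital block
C_SWITCHED = 1e-8   # total switched capacitance, farads -- illustrative aggregate
V_DD = 0.7          # supply voltage, volts
F_CLOCK = 2e9       # clock frequency, hertz
ACTIVITY = 0.2      # fraction of the capacitance switching each cycle (alpha)
P_STATIC = 0.8      # leakage power of the block, watts -- illustrative

p_dynamic = ACTIVITY * C_SWITCHED * V_DD**2 * F_CLOCK   # alpha * C * Vdd^2 * f
p_total = p_dynamic + P_STATIC
print(f"Dynamic: {p_dynamic:.2f} W, static: {P_STATIC:.2f} W "
      f"(leakage is {100 * P_STATIC / p_total:.0f}% of the total)")

# The quadratic dependence on Vdd is why voltage scaling is such a powerful lever:
for vdd in (1.0, 0.8, 0.6):
    print(f"Vdd = {vdd:.1f} V -> dynamic power ≈ {ACTIVITY * C_SWITCHED * vdd**2 * F_CLOCK:.2f} W")
```

Halving the supply voltage cuts dynamic power by roughly a factor of four, which is why near-threshold operation is so attractive whenever the accompanying loss of speed can be tolerated.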
# MOSFETs

The MOSFET, also known as the MOS transistor, was invented by Mohamed M. Atalla and Dawon Kahng in 1959 at Bell Labs. The concept of the field-effect transistor (FET) itself was not new by the time the MOSFET was invented. The theoretical principles underlying FET operation were first proposed by Julius Edgar Lilienfeld in the 1920s and further elaborated by Shockley in the late 1940s. However, these early attempts to create FETs faced significant challenges, primarily due to the lack of high-quality semiconductor materials and the absence of reliable fabrication techniques.

>[!info]
>The bipolar junction transistor amplifies a small change in input current to provide a large change in output current. The gain of a bipolar transistor is thus defined as the ratio of output to input current (beta). A field-effect transistor (FET) transforms a change in input voltage into a change in output current. The gain of an FET is measured by its transconductance, defined as the ratio of change in output current to change in input voltage.

>[!info]
>The field-effect transistor is so named because its input terminal (called the gate) influences the flow of current through the transistor by projecting an electric field across an insulating layer.

The breakthrough came with the development of the silicon planar process by Jean Hoerni, a physicist at Fairchild Semiconductor, in 1959. This process enabled the fabrication of electronic devices by doping very thin layers on a silicon wafer's surface, which was then protected by a layer of silicon oxide. The planar process was critical for the successful implementation of the MOSFET. Atalla and Kahng utilized the planar process to fabricate the first MOSFET. Their device used a metal gate (aluminum) deposited on a layer of silicon oxide, which insulated the gate electrically from the silicon substrate (sometimes called the backgate) beneath it. This structure allowed effective control of the conductivity of the underlying silicon layer by applying a voltage to the gate, thus modulating the current between the source and drain terminals.

The MOSFET offered several advantages over existing transistors, including lower power consumption and greater density potential, making it ideally suited for integrated circuits. However, its widespread adoption was initially slow, primarily due to reliability issues like the susceptibility of the silicon oxide layer to contamination and defects, which could trap charge and affect the device's performance. Significant improvements in manufacturing processes and materials throughout the 1960s and 1970s addressed many of the early reliability issues. The MOSFET's ability to be scaled down in size with retained functionality has been critical to its success.

## MOSFET structure

The MOS transistor can be better understood by first considering a simpler device called a MOS capacitor. This device consists of two electrodes, one of metal and one of extrinsic silicon, separated by a thin layer of silicon dioxide (see figure below). The metal electrode forms the gate, while the semiconductor slab forms the substrate (or body). The insulating oxide layer between the two is called the gate dielectric. The illustrated device has a substrate consisting of lightly doped P-type silicon.

The electrical behavior of this MOS capacitor can be demonstrated by grounding the substrate and biasing the gate to various voltages. The MOS capacitor of the figure below (part A) has a gate potential of 0V. The difference in work functions between the metal gate and the semiconductor substrate causes a small electric field to appear across the dielectric. In the illustrated device, this field biases the metal plate slightly positive with respect to the P-type silicon. This electric field attracts electrons from deep within the silicon up toward the surface, while it repels holes away from the surface. The field is weak, so the change in carrier concentrations is small, and the overall effect upon the device characteristics is minimal.
![[Pasted image 20250816150119.png]]

>[!Figure]
>*MOS capacitor: (A) unbiased (VBG = 0V), (B) inversion (VBG = 3V), (C) accumulation (VBG = -3V) (source: #ref/Hastings )*

Part B of the figure above shows what occurs when the gate of the MOS capacitor is biased positively with respect to the substrate. The electric field across the gate dielectric strengthens, and more electrons are drawn up from the bulk. Simultaneously, holes are repelled away from the surface. As the gate voltage rises, a point is reached where more electrons than holes are present at the surface. Due to the excess electrons, the surface layers of the silicon behave as if they were N-type. The apparent reversal of doping polarity is called *inversion*, and the layer of silicon that inverts is called a *channel*. As the gate voltage increases still further, more electrons accumulate at the surface and the channel becomes more strongly inverted. The voltage at which the channel just begins to form is called the threshold voltage $V_{T}$. When the voltage difference between the gate and substrate is less than the threshold voltage, no channel forms. When the voltage difference exceeds the threshold voltage, a channel appears.

Finally, part C of the figure above shows what happens if the gate of the MOS capacitor is biased negatively with respect to the substrate. The electric field now reverses, drawing holes toward the surface and repelling electrons away from it. The surface layers of silicon appear to be more heavily doped, and the device is said to be in accumulation.

The behavior of the MOS capacitor can be utilized to form a true MOS transistor. The figure below (part A) shows the cross-section of the resulting device. The gate, dielectric, and backgate or substrate remain as before. Two additional regions are formed by selectively doping the silicon on either side of the gate. One of these regions is called the source, and the other is called the drain. Imagine that the source and backgate are both grounded and that a positive voltage is applied to the drain. As long as the gate-to-substrate voltage remains less than the threshold voltage, no channel forms. The p-n junction formed between drain and backgate is reverse-biased, so very little current flows from drain to backgate. If the gate voltage exceeds the threshold voltage, a channel forms beneath the gate dielectric. In this way, exploiting the principles of the basic MOS capacitor, a conductive connection between source and drain can be conveniently formed by applying a voltage to the gate terminal.

>[!info]
>The threshold voltage of a MOS transistor equals the gate-to-source bias required to just form a channel with the substrate of the transistor connected to the source. If the gate-to-source bias is less than the threshold voltage, then no channel forms. The threshold voltage exhibited by a given transistor depends on a number of factors, including substrate doping, dielectric thickness, gate material, and excess charge in the dielectric. Each of these effects will be briefly examined.

**Substrate doping has a major effect on the threshold voltage**. If the substrate is doped more heavily, then it becomes more difficult to invert. A stronger electric field is required to achieve inversion, and the threshold voltage increases. The substrate doping of an MOS transistor can be adjusted by performing a shallow implant beneath the surface of the gate dielectric to dope the channel region; the sketch below puts rough numbers on this trend.
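The doping dependence can be made concrete with the classic long-channel threshold-voltage expression. The sketch below uses assumed values for the oxide thickness and the flat-band voltage (neither appears in the text), so the absolute numbers are only indicative; the trend with substrate doping is the point:

```python
import math

Q = 1.602e-19             # elementary charge, C
EPS_SI = 11.7 * 8.854e-14 # permittivity of silicon, F/cm
EPS_OX = 3.9 * 8.854e-14  # permittivity of SiO2, F/cm
N_I = 1.0e10              # intrinsic carrier density, cm^-3
V_THERMAL = 0.02585       # kT/q at ~300 K, volts
T_OX = 10e-7              # gate oxide thickness: 10 nm expressed in cm -- illustrative
V_FB = -0.8               # flat-band voltage, volts -- assumed value for an n+ poly gate

def threshold_voltage(n_a: float) -> float:
    """Classic long-channel NMOS threshold: Vt = Vfb + 2*phi_F + Qdep/Cox."""
    phi_f = V_THERMAL * math.log(n_a / N_I)               # bulk Fermi potential
    c_ox = EPS_OX / T_OX                                  # oxide capacitance per unit area
    q_dep = math.sqrt(2 * Q * EPS_SI * n_a * 2 * phi_f)   # depletion charge at inversion
    return V_FB + 2 * phi_f + q_dep / c_ox

for n_a in (1e15, 1e16, 1e17):
    print(f"Na = {n_a:.0e} cm^-3 -> Vt ≈ {threshold_voltage(n_a):+.2f} V")
```

Raising the substrate doping by two orders of magnitude pushes the threshold up by well over half a volt, which is exactly the lever that shallow channel implants give process engineers.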
![[Pasted image 20250816150541.png]] >[!figure] >*Cross-section of MOS transistor (source: #ref/Hastings )* >[!important] >The most commonly used substrate materials in integrated circuits are silicon (Si) and gallium arsenide (GaAs), where Si is traditionally used for RF and lower frequencies and GaAs is used for microwave and millimeter-wave frequencies. However, with the advancement of silicon germanium (SiGe) transistors, the frequency response of circuits on Si substrates is pushing firmly into the microwave-frequency region. GaAs, on the other hand, is finding more applications further into the millimeter-wave-frequency region using transistors with gate lengths of one-tenth of a micron. More exotic substrate materials, such as indium phosphide (InP), are tending to take over from GaAs as frequencies extend beyond 100 GHz. Other substrate materials that are increasingly being used include silicon carbide (SiC) and gallium nitride (GaN). These are both wide band-gap semiconductors, which means they have much higher breakdown voltages and can operate at higher junction temperatures and higher output powers than the other semiconductor materials. The characteristics of these commonly used semiconductor materials are shown in the table below (source: #ref/Marsh ): > >![[Pasted image 20250816153442.png]] Thus, the structure of a modern MOSFET includes four primary regions: the source, the drain, the gate, and the substrate (also body or backgate). These are integrated into a semiconductor material, typically silicon, with the gate separated from the body by a thin layer of insulating material (usually silicon dioxide or, more recently, materials with higher dielectric constants). Just like we saw in the previous section, the operation of a MOSFET is also based on the ability to control the conductivity between the drain and source terminals by applying a voltage to the gate terminal. Just like in any other FET, this voltage alters the distribution of charges in the semiconductor material, enabling or hindering the flow of current between the drain and source. The "source" terminal is so named because it is the source of the charge carriers that flow through the channel. In an N-channel MOSFET, these carriers are electrons, while in a P-channel MOSFET, they are holes. The source serves as the origin point for these carriers as they are injected into the channel and move toward the drain under the influence of an electric field. The "drain" terminal is called this because it is where the charge carriers leave the channel. The term reflects its role in "draining" the carriers that have traversed the channel from the source. The voltage applied to the drain relative to the source influences the current flow through the MOSFET, making the drain critical in determining the device's operation. The "gate" terminal controls the conductivity of the channel between the source and drain. It acts as the gatekeeper for the current flow. By applying a voltage to the gate, an electric field is created across the insulating layer (usually made of silicon dioxide or a high-k dielectric material), which modulates the channel's conductivity. This field-effect operation is fundamental to the MOSFET's functionality, allowing it to function as a switch or amplifier without requiring direct contact between the gate and the semiconductor material of the channel. The gate effectively controls the opening and closing of the conductive pathway between the source and drain, hence the name. 
![N-channel MOSFET cross-section](site/Resources/media/image177.png)

> [!Figure]
> _N-channel MOSFET cross-section_

The gate material shown in the figure above was originally aluminum (Al) in early MOS devices. As geometries were reduced in size, however, fabrication processes changed, and the higher temperatures required for these processes prohibited the use of low-melting-point metals such as Al. Replacing the metal gate material with highly doped (and therefore highly conducting) polycrystalline silicon provided good long-term reliability as well as tolerance to high-temperature processing. The term polysilicon (or *poly*) is traditionally used when discussing the polycrystalline silicon gate material. The insulating layer can be silicon dioxide or some other insulating material with good material matching properties, but the term *oxide* is typically used when discussing this gate insulating material.

Some CMOS fabrication technologies provide the capability for a second poly layer over the first (with a thin insulating oxide layer in between). The thin oxide between the two poly layers provides the chip designer with a high capacitance per unit area and hence the possibility of high values of on-chip capacitance. There may be up to seven layers of metal above the wafer surface in advanced CMOS processes (aluminum and copper) with thicker lines on the higher layers. Each metal layer is separated from the next by a thick layer of insulating oxide; therefore, adjacent metal layers have no DC connection to each other. These metal layers can be used for signal interconnections and power supply connections. Vias are used to connect adjoining metal layers, with contacts used to connect metal layers to semiconductor (or polysilicon) layers. Manufacturers publish or otherwise make available to designers process parameters that describe electrical parameters such as threshold voltage, resistance, and capacitance, or physical parameters such as oxide and other layer thicknesses. Knowledge of these process parameters is crucial to successful designs. More detailed process parameters can often be obtained from the IC manufacturer, but in many cases, nondisclosure agreements (NDAs) are involved because of proprietary issues and protection of trade secrets.

## Behavior and Characteristic Curves

There are primarily two types of characteristic curves for a MOSFET: transfer characteristics and output characteristics.

### Transfer Characteristics

The transfer characteristic curve of a MOSFET shows the relationship between the gate-source voltage $V_{\text{GS}}$ and the drain current $I_{D}$ when the drain-source voltage $V_{\text{DS}}$ is kept constant. Key regions of the transfer characteristics are:

- Cut-off Region: When $V_{\text{GS}}$ is below a certain threshold voltage $V_{T}$, the MOSFET is off, and $I_{D}$ is essentially zero.
- Linear (Ohmic) Region: As $V_{\text{GS}}$ exceeds $V_{T}$ and for small $V_{\text{DS}}$, the MOSFET turns on and $I_{D}$ grows roughly linearly with $V_{\text{GS}}$. In this region the MOSFET behaves like a voltage-controlled resistor.
- Saturation Region: When $V_{\text{DS}}$ is large enough that the channel pinches off at the drain end ($V_{\text{DS}} > V_{\text{GS}} - V_{T}$), $I_{D}$ no longer depends on $V_{\text{DS}}$ and is set by $V_{\text{GS}}$ alone, growing approximately with the square of $V_{\text{GS}} - V_{T}$ (see the sketch below).
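These regions are captured by the simplest hand-analysis description of the MOSFET, the long-channel "square-law" model. The sketch below uses an assumed threshold voltage and transconductance parameter (purely illustrative values) and reproduces the overall shape of both the transfer and the output characteristics discussed here:

```python
def drain_current(v_gs: float, v_ds: float, v_t: float = 0.5, k_n: float = 2e-3) -> float:
    """Long-channel square-law model with cutoff, triode, and saturation regions.

    v_t : threshold voltage, volts (illustrative)
    k_n : transconductance parameter mu_n * Cox * W / L, A/V^2 (illustrative)
    """
    v_ov = v_gs - v_t                                  # overdrive voltage
    if v_ov <= 0:
        return 0.0                                     # cutoff (only leakage in a real device)
    if v_ds < v_ov:
        return k_n * (v_ov * v_ds - v_ds**2 / 2)       # triode (ohmic) region
    return 0.5 * k_n * v_ov**2                         # saturation: Id set by Vgs, not Vds

# Sweep Vds at a few gate voltages to reproduce the output characteristic curves
for v_gs in (0.8, 1.0, 1.2):
    row = ", ".join(f"{drain_current(v_gs, v_ds) * 1e3:5.2f}" for v_ds in (0.1, 0.3, 0.5, 1.0))
    print(f"Vgs = {v_gs:.1f} V -> Id [mA] at Vds = 0.1 / 0.3 / 0.5 / 1.0 V: {row}")
```

Each row rises at first (the ohmic region) and then flattens out once $V_{\text{DS}}$ exceeds the overdrive voltage, which is exactly the saturation behavior of the output curves described next.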
![MOSFET output characteristics](image179.jpg)
> [!Figure]
> _MOSFET output characteristics_

![A MOSFET in perspective](site/Resources/media/image180.png)
> [!Figure]
> _A MOSFET in perspective_

### Output Characteristics

The output characteristic curve depicts the relationship between the drain-source voltage $V_{\text{DS}}$ and the drain current $I_{D}$ for different fixed gate-source voltages $V_{\text{GS}}$.

Important regions of the output characteristics are:

- Ohmic Region: At low $V_{\text{DS}}$ values, the MOSFET operates in the linear or ohmic region, where $I_{D}$ increases with $V_{\text{DS}}$.
- Saturation (Active) Region: As $V_{\text{DS}}$ increases, $I_{D}$ enters a saturation state where it becomes relatively constant despite further increases in $V_{\text{DS}}$. This is due to the formation of a depletion region that extends and eventually leads to channel pinch-off at the drain end, limiting current flow.
- Breakdown Region: If $V_{\text{DS}}$ is increased beyond a certain point, the MOSFET may enter the breakdown region, where $I_{D}$ suddenly increases, potentially damaging the device.

![MOSFET characteristic curves](site/Resources/media/image181.png)
> [!Figure]
> _MOSFET characteristic curves_

Some important characteristic values of MOSFETs are:

- Threshold Voltage ($V_{T}$): The minimum gate-to-source voltage required to create a conductive channel between the source and drain.
- On Resistance ($R_{\text{DS}}\left( \text{on} \right)$): The resistance between the drain and source when the MOSFET is in the "on" state. Lower values are generally better for power efficiency.
- Gate Capacitance ($C_{g}$): The capacitance between the gate and the channel/substrate, which affects how quickly the device can turn on and off.

> [!important]
> In a MOSFET, the gate does not inject energy into the channel directly. Instead, it modulates the density of charge carriers, and this modulation of channel conductivity controls the power flowing from drain to source (supplied by an external source). That's what gives MOSFETs the ability to amplify signals.

## MOS inverter

Following the example of the logic inverter we explored when we discussed BJTs, we can now do the same with an NMOS inverter, illustrated in the figure below.

![NMOS inverter](site/Resources/media/image183.png)
> [!Figure]
> _NMOS inverter_

When no voltage is applied to the gate, the gate capacitance is not charged, so there is no channel, and $R_{D}$ pulls the output voltage up to $V_{\text{DD}}$. When a positive voltage is applied to the gate, it attracts electrons towards the gate, creating a conductive channel between the source and drain. A low-resistance path between source and drain means that the output is effectively connected to ground, so the output voltage will be close to zero volts. The inverter thus implements the same truth table as the BJT inverter we discussed before.

One of the issues with NMOS inverters is that, when the transistor is in the ON state, there is always some static power dissipation in $R_{D}$. For a single transistor, this would not represent a terrible problem. But when there are millions or billions of transistors integrated into a chip, those tiny dissipations can add up to a big problem. One of the key innovations to solve this problem was to employ complementary topologies. In complementary circuits, instead of a pull-up resistor, a PMOS transistor is used (see below).
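Before moving on, it is worth putting a rough number on the static-dissipation problem just mentioned. Take the supply and pull-up values that appear later in this chapter's own simulation ($V_{\text{DD}} = 1.8\ \text{V}$, $R_{D} = 10\ \text{k}\Omega$) and, for simplicity, ignore the transistor's on-resistance. Whenever the NMOS is on and the output sits low:

$$
I \approx \frac{V_{\text{DD}}}{R_{D}} = \frac{1.8\ \text{V}}{10\ \text{k}\Omega} = 180\ \mu\text{A},
\qquad
P \approx V_{\text{DD}} \cdot I \approx 0.32\ \text{mW per gate.}
$$

If just one million such gates happened to be holding their outputs low at the same time, the chip would burn on the order of 300 W while doing nothing useful. This is precisely the waste the complementary topology shown in the figure below avoids: in steady state either the PMOS or the NMOS is off, so (ideally) there is no static current path between $V_{\text{DD}}$ and ground.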
![Complementary Metal-Oxide (CMOS) topology](site/Resources/media/image184.png) > [!Figure] > _An Inverter using Complementary Metal-Oxide (CMOS) topology_ ### Simulating a logic inverter We will now put our hands on a VLSI tool and design our own integrated logic inverter. As we saw, a logic inverter is in charge of simply negating/inverting the signal at its input; if we input a logical 0, the inverter shall output a logical 1, and vice versa. Note that the voltage levels represent a 0 or a 1 depending on which logic family we are talking about. A basic logic inverter will need a transistor, so we must draw one in the VLSI tool. For this, we will use Magic^[http://opencircuitdesign.com/magic/]. The Magic VLSI tool is a venerable piece of software in the field of Very Large Scale Integration (VLSI) design, primarily used for creating and editing the layout of integrated circuits (ICs). Developed in the early 1980s at the University of California, Berkeley, it has been a fundamental tool in the education and practice of semiconductor design. Magic stands out due to its use of technology-specific design rules to automatically check for errors in IC layouts, ensuring that the designs are manufacturable and will function as intended. > [!attention] > Before continuing, take a look at this introduction to stick diagrams in the video below: > > ![](https://youtu.be/9G-R_jy6wEU) Magic operates on a principle called "virtual grid", allowing designers to work at a higher level of abstraction without worrying about the underlying fabrication process details. It supports various technologies through technology files that define the rules for layer thicknesses, spacings, and electrical characteristics, making it adaptable to different manufacturing processes. Magic provides a real-time, graphical view of the IC layout, where designers can directly manipulate the geometric shapes representing different layers of the IC. This is complemented by Magic's capability to perform automatic design rule checking (DRC), which continuously verifies that the layout conforms to the specific set of rules for the target fabrication process. This immediate feedback loop significantly speeds up the design process and helps in identifying and resolving errors early. Furthermore, Magic integrates well with other tools in the VLSI design process, such as simulation tools (like SPICE) for testing the electrical behavior of the IC designs and extraction tools for generating the parasitic resistance and capacitance estimates. Despite its age, Magic remains relevant in the academic world and among VLSI enthusiasts or smaller design firms due to its open-source nature, extensive documentation, and the community that continues to maintain and update it for modern fabrication processes. Its simplicity and efficiency make it an excellent tool for teaching the fundamentals of VLSI design, offering students and newcomers a hands-on experience with the intricacies of IC layout and the challenges of ensuring manufacturability and functionality. We start our quest simulating a MOS inverter by opening Magic UI. The tool must be configured to load with a specific technology (in our case Sky130) for the Design Rule Checks (DRC) to be consistent. This will greatly impact the manufacturability of the design. Once the tool starts with a given technology, the necessary layers will show on the right hand (figure below). 
![Magic VLSI tool User Interface](site/Resources/media/image185.png) We will now draw a polysilicon rectangle which will form the gate of our poor man's MOSFET (see figure below). Note that the DRC legend at the top must always stay at 0 and with a green check mark, indicating our design matches what the technology (in our case Sky130) specifies[^52]. ![A polysilicon polygon that will form the gate of a MOSFET](site/Resources/media/image186.png) Now we draw the diffusion area that will form the source and the drain of the MOSFET. Note that we do not need to explicitly add the substrate because Magic considers there is a substrate by default. ![Adding diffusion area to form source and drain of a MOSFET](site/Resources/media/image187.png) This is, in fact, already an N-MOSFET. The tool has also detected the combination of the n-diffusion with the polysilicon (observe the shared area where the poly and the diffusion overlap) and this can be already extracted for simulation in SPICE. Before extraction, we must add labels to the different elements of the transistor. Then, to extract the model for simulation purposes, we will execute some commands in Magic using its command console: ```shell extract # extract the netlist in Magic's format ext2spice lvs # Turn on a bunch of default settings ext2spice cthresh 0 # turn on capacitor parasitic extraction ext2spice # extract to a spice file. ``` Which yields the following SPICE file: ```Spice * NGSPICE file created from mosfet.ext - technology: sky130A .subckt mosfet X0 a_30_n100# a_n30_n210# a_n148_n100# VSUBS sky130_fd_pr__nfet_01v8 ad=0.6 pd=3.2 as=0.59 ps=3.18 w=1 l=0.3 C0 a_n30_n210# VSUBS 0.216f .ends ``` To simulate the inverter we drew in Magic, we cannot just use the MOSFET SPICE model extracted above in a circuit. We need biasing and we also need to add a pull-up resistor in the drain. This is done by a parent SPICE model that includes the MOSFET model, and adds the extra elements, along with the conditions of the simulation: ```Spice MOSFET Simulation .lib "./sky130_fd_pr/models/sky130.lib.spice" tt * Instantiate the inverter Xmosfet DRAIN GATE VGND VGND X0 .subckt X0 drain gate source VSUBS * NGSPICE file created from mosfet.ext - technology: sky130A X0 a_30_n100# a_n30_n210# a_n148_n100# VSUBS sky130_fd_pr__nfet_01v8 ad=0.6 pd=3.2 as=0.59 ps=3.18 w=1 l=0.3 C0 a_n30_n210# VSUBS 0.216f .ends * set gnd and power Vgnd VGND 0 0 Vdd VPWR VGND 1.8 * create a resistor between the MOSFET drain and VPWR R VPWR DRAIN 10k * create pulse Vin GATE VGND pulse(0 1.8 1p 10p 10p 1n 2n) .tran 10p 2n 0 .control run set color0 = white set color1 = black plot GATE DRAIN .endc .end ``` Which yields: ![Simulation in SPICE of the N-MOS inverter we drew in the VLSI tool (red: gate voltage; blue: voltage measured at the drain)](site/Resources/media/image188.png) But this design is not really usable. For the design to be of practical use and perhaps even added to a [[Semiconductors#Standard Cells|standard cell]] library, we must add some extra things like a local interconnect and connect that to the polysilicon and N diffusion. Then we must connect the local interconnect to the metal1 layer. All standard cells connect at the metal1 layer so the labels should really be on metal1 instead of directly on the gate, drain, and source of the MOSFET. We will use local interconnect (palette name *locali*) to connect to the gate, drain and source. We will need two types of vias: -*Ndcontact* (n diffusion contact): connects between local interconnect and n diffusion. 
-*Pcontact* (poly contact): connects between local interconnect and polysilicon. As we draw, this will create lots of DRC errors. You will see the white hatching that shows an error, and see the number at the top of the window increase and turn red. To find out what went wrong, you can click DRC->Find next error in the GUI window. The reports can be pretty confusing to understand, and unfortunately, there isn't a shortcut as the DRC rules are so numerous and complex. It's not strictly necessary to end up with a DRC clean design to extract (the extraction will normally work anyway), but it's a good thing to have a bit of practice in case you need to edit a completed design at a later date. ![A more complete MOSFET with local interconnects and metal1 terminals](site/Resources/media/image189.png) We said before that the MOS inverter with a pull-up resistor suffers from several drawbacks. To overcome these limitations, we use complementary MOS, which adds a P MOS instead of the pull-up resistor. ![Complementary Metal-Oxide (CMOS) topology](site/Resources/media/image184.png) And the CMOS inverter design in Magic looks like the figure below. ![A CMOS logic inverter (both the NMOS at the bottom and PMOS at the top share the gate in polysilicon layer)](site/Resources/media/image190.png) > [!Figure] > _A CMOS logic inverter (both the NMOS at the bottom and PMOS at the top share the gate in polysilicon layer) (credit: ZeroToAsic Course, Matt Venn)_ > [!info] > Cumulated resistance and the back gate effect tend to make MOSFET stacks slow. CMOS subcircuits such as logic gates and bistables are thus not normally designed with more than three MOSFETs connected in series. Complex combinational operations that would ask for more series transistors are broken down into smaller subfunctions and implemented as cascades of simpler gates #ref/Kaeslin . ## Fabrication The production of a MOSFET is a multi-step process that involves various techniques of semiconductor fabrication. The process includes doping, oxidation, photolithography, etching, and deposition, among others. Basically, the transfer of layout patterns to the various layers of material on a semiconductor die is obtained from photolithographic methods, followed by selective removal of unwanted material. A brief, somewhat oversimplified overview of a complementary MOS process—which includes both NMOS and PMOS transistors—is described below. Silicon Substrate Preparation: - The process starts with a pure silicon wafer, which serves as the base or substrate for the MOSFETs. For a CMOS process, this is typically a p-type substrate. Creation of N-Wells: - Oxidation: The silicon wafer is oxidized to grow a thin layer of silicon dioxide (SiO2) on its surface. This oxide layer serves as a mask for selective doping and as a part of the gate oxide in the MOSFET structure. - Photolithography: Photolithography is used to transfer the pattern of the n-well regions onto the wafer. The wafer is coated with a light-sensitive material called photoresist. A mask with the desired pattern is then placed over the wafer, and it is exposed to ultraviolet (UV) light. The exposed areas of the photoresist become soluble and are washed away to reveal the underlying oxide. ==This section does not do real justice to the photolithography topic so I plan to cover it in proper depth when time allows.== - Doping: To create an n-well, the exposed areas of the silicon wafer are doped with n-type impurities, such as phosphorus or arsenic. 
This can be done through various doping techniques, such as ion implantation or diffusion. The dopants increase the concentration of electrons in these regions, creating n-type semiconductor areas.

Ion implantation deserves a bit more detail. This process introduces dopants into a material and thereby changes its physical, chemical, or electrical properties. During ion implantation, ions of an element are accelerated into a solid target like Si or SiC at relatively low temperatures (below 300°C). Ion implantation equipment typically consists of 1) an ion source, where ions of the desired element are produced, 2) an accelerator, where the ions are electrostatically accelerated to high energy, and 3) a target chamber, where the ions impinge on a target, which is the material to be implanted. Therefore, ion implantation is considered a special case of particle radiation. Each ion is typically a single atom or molecule. The total amount of implanted material in a target, known as the dose, is proportional to the time integral of the ion current and is commonly expressed in ions per square centimeter (ions/cm²). Dopant ions are generally created from a gas source, for purity reasons, and are afterwards accelerated towards the wafer to penetrate the crystal lattice. The crystal structure of the target can be damaged or even destroyed by the energetic collision cascades caused by high-energy ions. Moreover, the desired carrier concentration is typically not achieved right after implantation, due to defects and clusters that form during the process. Therefore, post-implantation annealing steps are necessary to repair lattice damage and increase the electrical activation of the implanted species.

- Oxide Removal: The remaining photoresist and exposed oxide layers are removed, leaving behind the doped n-well regions in the p-type substrate.

Gate Oxide Formation:

- As we said before, a thin layer of silicon dioxide is grown or deposited over the entire wafer. This layer will serve as the gate oxide for the MOSFETs.

Gate Formation:

- Polysilicon (polycrystalline silicon) is deposited over the gate oxide. This polysilicon layer is then patterned using photolithography and etching to form the gate electrodes of the MOSFETs.

Source and Drain Formation:

- Additional photolithography steps define the areas for the source and drain regions adjacent to the gate. For PMOS transistors in the n-well, p-type dopants are introduced, and for NMOS transistors in the p-type substrate, n-type dopants are introduced, using ion implantation or diffusion.
- This doping process creates heavily doped regions that serve as the source and drain of the MOSFETs.

Insulation and Contacts:

- An insulating layer of silicon dioxide or silicon nitride is deposited over the entire wafer.
- Holes are etched in the insulating layer to expose parts of the source, drain, and gate where electrical contacts will be made.
- Metal or highly conductive material is deposited to fill these holes, forming electrical contacts. These contacts are then interconnected according to the circuit design.

Final Steps:

- The wafer is subjected to further processing, including metallization to form interconnects, and passivation to protect the surface.
- Finally, the wafer is tested, and individual chips are separated from the wafer, packaged, and tested again.

A graphical depiction of the process (for a SiC substrate, and a discrete device) is shown below. Note that the figure is ordered somewhat unusually: it should be read in columns, not rows.
![MOSFET production process for a SiC substrate (for a discrete device)](image191.jpeg) > [!Figure] > _MOSFET production process for a SiC substrate (for a discrete device)_ This overview simplifies a complex and precise process that requires advanced equipment and cleanroom conditions. Each step involves careful control of conditions and materials to ensure the functionality and reliability of the resulting MOSFETs and integrated circuits. ## Floating Gate MOSFET A floating gate MOSFET resembles a standard MOSFET except that the transistor has two gates instead of one. The transistor can be functionally seen as an electrical switch in which current flows between two terminals (source and drain) and is controlled by a floating gate (FG) and a control gate (CG). The CG is similar to the gate in other MOS transistors, but below this, there is the FG insulated all around by an oxide layer. The FG is interposed between the CG and the MOSFET channel. Because the FG is electrically isolated by its insulating layer, electrons placed on it are trapped. When the FG is charged with electrons, this charge screens^[https://en.wikipedia.org/wiki/Electric-field_screening] from the CG, thus, increasing the threshold voltage ($V_{T}$) of the transistor. This means that the transistor's $V_{T}$ can be changed between the _uncharged FG threshold voltage_ ($V_{T1}$) and the higher _charged FG threshold voltage_ ($V_{T2}$) by changing the FG charge. ![](fg_mosfet.webp) > [!Figure] > Floating-gate MOSFET, cross-section ## Planar Technology, FinFETs and GAAFETs All FET transistors discussed so far are planar transistors. In planar MOSFETs, the current flows through a flat channel between the source and drain, with the gate on top controlling the flow (see figure right above). As feature sizes shrank below 22nm, planar transistors faced severe leakage issues due to short-channel effects, meaning the gate had less control over the channel. To overcome the limitations of planar designs, manufacturers introduced FinFETs (Fin Field-Effect Transistors) around the 22nm node (Intel 22nm in 2011, TSMC and Samsung 16/14nm in 2015). Instead of a flat channel, the FinFET transistor’s channel is raised into a thin vertical "fin". The gate wraps around three sides of the fin, improving electrostatic control. This reduces leakage current, allowing for lower power consumption and better switching performance. Multiple fins can be stacked to increase current flow, improving transistor drive strength. FinFETs are used from 22nm down to 4nm in most modern chips, including mobile processors and GPUs. ![[Pasted image 20250221131144.png]] >[!figure] >FinFET device structure As FinFETs approach their physical limits, Gate-All-Around Field-Effect Transistors (GAAFETs) are the next step in transistor evolution. GAAFETs provide even better control over current flow by fully surrounding the channel with the gate. Instead of a single fin, GAAFETs use stacked nanowires or nanosheets that are completely enclosed by the gate. The gate surrounds all four sides of the channel, leading to superior electrostatic control, achieving less leakage and higher [[Semiconductors#Energy Efficiency|efficiency]] than FinFETs. GAAFETs allow for more design flexibility: manufacturers can vary the width of nanosheets to optimize power and performance. Samsung introduced GAAFETs at the 3nm node, while TSMC and Intel plan to adopt them around 2nm. Intel refers to its GAAFET design as RibbonFET, and TSMC is calling it Nanosheet Transistors. 
![[Pasted image 20250221132029.png]] >[!Figure] > GAAFET (a) structure and (b) cross-sectional view with Si-Nanowire (NW) channel. MBC-FET (c) structure and (d) cross-sectional view with Si-Nanosheet (NS) channel (source: https://www.researchgate.net/publication/344063509_First_Principle_and_NEGF_Based_Study_of_Silicon_Nano-wire_and_Nano-sheet_for_Next_Generation_FETs_Performance_Interface_Effects_and_Lifetime) ![[Pasted image 20250221131544.png]] >[!figure] >Planar vs FinFET vs GAAFET (Source: Lam Research) # Integrated Circuit Design A modern system-on-chip can have 50 to 100 billion MOSFETs. Drawing individual transistors by hand is not an applicable approach for very complex chips. For those, the approach is to encapsulate more and more functionality in packaged blocks that can be reused for different designs. Still, in some selected cases, it is necessary to design custom chips, for instance, in fault-tolerant applications where internal measures must be added to the circuits to make them more robust against radiation or other environmental factors. But for more mainstream designs, a hierarchical collection of building blocks is used. This is done by combining transistors for implementing different logic functions, which in turn form other building blocks. We will discuss two important aspects of this approach: standard cells and [[Semiconductors#Integrated Circuit Design#Process Design Kits (PDKs)|Process Design Kits]] (PDK). > [!info] > In the context of semiconductors, III-V refers to a class of compound semiconductors made from elements in groups III and V of the [periodic table](https://www.acs.org/content/dam/acsorg/education/whatischemistry/periodic-table-of-elements/acs-periodic-table-poster_download.pdf). These materials combine a group III element, such as gallium (Ga), aluminum (Al), or indium (In), with a group V element, such as arsenic (As), phosphorus (P), or nitrogen (N). Examples of III-V semiconductors include gallium arsenide (GaAs), indium phosphide (InP), and gallium nitride (GaN). ## Nodes and Nanometers When semiconductor nodes refer to "nm" (nanometers), it historically represented the **gate length** of a transistor, which is the distance between the source and drain under the transistor gate. However, in modern semiconductor manufacturing, the "nm" value no longer directly corresponds to a specific physical dimension in the chip. Instead, it serves as a marketing term representing a new generation of technology with improved **transistor density, power efficiency, and performance**. ### Historical Meaning: Gate Length In early semiconductor manufacturing, the process node (e.g., 90nm, 65nm) referred to the **minimum feature size** of the transistor gate length. This distance was important because it determined how fast transistors could switch and how much power they consumed. Smaller gates meant faster and more efficient transistors. ### Modern Meaning: Not a Direct Physical Measurement With advanced nodes (e.g., 7nm, 5nm, 3nm), the "nm" value no longer represents a specific feature like gate length. Instead, it reflects a **combination of improvements** in multiple aspects of semiconductor design, including: 1. **FinFET & GAAFET structures** – Modern transistors use 3D structures that no longer have a single, simple "gate length" anymore. 2. **Increased transistor density** – A lower "nm" number generally means a higher number of transistors per unit area. For example, TSMC’s 5nm process has about 1.8 times the transistor density of 7nm. 3. 
**Power efficiency and performance** – Each new node improves energy efficiency and switching speed, even if the exact physical dimensions do not scale linearly. For instance, despite the different names, TSMC’s 5nm and Intel’s 10nm have similar transistor densities, showing how "nm" no longer directly corresponds to a single dimension but instead indicates a relative advancement in technology. ## Standard Cells In reality, chip designers do not have full freedom to decide the geometry of the transistors they will use for their designs. Transistor geometries are highly standardized and packed in libraries, making the fabrication process more efficient and repeatable. Standard cells go a bit beyond pure transistor geometry and define a collection of logic building blocks including inverters, gates, flip flops, and multiplexers. Standard cells have a uniform height but vary in width depending on the complexity of the function they perform. This uniformity allows for easier placement and routing in the VLSI design process. Each standard cell's electrical properties, such as timing, power consumption, and area, are pre-characterized. This pre-characterization simplifies the design process, as designers can use these cells knowing their exact specifications. Also, standard cells are organized into libraries. A standard cell library is a collection of these cells, each optimized for a specific manufacturing process technology. Designers select cells from these libraries to build their ICs. Being pre-designed and verified, standard cells can be reused across different designs, significantly reducing the design time and effort required to create new chips. The use of standard cells enables the automation of the IC layout process. Electronic Design Automation (EDA) tools can automatically place and route these cells, optimizing the chip for performance, area, and power consumption. ![A 2-input NAND gate standard cell in an EDA tool](site/Resources/media/image192.png) > [!Figure] > _A 2-input NAND gate standard cell in an EDA tool_ ### Process Design Kits (PDKs) Process Development Kits (PDKs) are collections of foundry-specific data files and design resources that enable integrated circuit (IC) designers to create semiconductor devices compatible with a specific manufacturing process. PDKs serve as the interface between electronic design automation (EDA) tools and semiconductor fabrication technologies, ensuring that the circuit designs can be accurately fabricated, tested, and implemented within the physical constraints of the manufacturing process. PDKs typically include a variety of resources: - Device Models: Accurate representations of the electrical behavior of the transistors and passive components available in the technology. - Design Rules: Specifications detailing the geometric constraints for layout designs, including minimum widths and spacings of circuit elements, to ensure manufacturability and reliability. - Layout Cells (LEFs and GDS2 files): Library elements that include standard cells, I/O pads, and other pre-designed structures, provided in formats like LEF (Library Exchange Format) for abstracted views and GDSII for detailed layout information. - Technology Files: Define layer information, DRC (Design Rule Checking), LVS (Layout Versus Schematic), and extraction setups for EDA tools, aligning them with the specific manufacturing process. 
- Parameterized Cells (PCells): Advanced library components that can be customized based on certain parameters, allowing for flexible design adjustments while adhering to design rules. #### SKY130 The SkyWater Technology Foundry's SKY130 is a notable example within the landscape of open-source PDKs. It is a 130-nanometer CMOS process that has garnered attention for being the first semiconductor process to be fully open-sourced, marking a significant shift in access to semiconductor manufacturing technologies. This initiative is a collaboration between SkyWater Technology and Google, aiming to democratize access to semiconductor fabrication, enabling a broader community of designers and researchers to innovate in IC design. Features of Sky130 PDK include: - Accessibility: Being open-source, Sky130 PDK allows academic institutions, individual researchers, and small companies to access state-of-the-art semiconductor fabrication processes without the high costs traditionally associated with proprietary PDKs. - Comprehensive Tool Integration: The Sky130 PDK supports a wide range of EDA tools, from open-source options like Magic, KLayout, NGSPICE, and Qflow, to commercial tools, providing flexibility in design approaches. - Diverse Device Offerings: The PDK includes a wide array of device models suitable for various applications, including high-performance digital, analog, mixed-signal, and RF designs. - Community Support: The open nature of the Sky130 PDK fosters a growing community of users and contributors, enhancing the PDK's capabilities, documentation, and support infrastructure. For advanced users, the Sky130 PDK provides a great opportunity to engage with a fully open-source semiconductor process. It enables not only the design and fabrication of custom ICs at relatively low cost but also encourages experimentation, innovation, and learning within the semiconductor design field. The availability of comprehensive documentation and an active community further enhances its value, making it an attractive platform for sophisticated design projects, research, and education in advanced digital, analog, and mixed-signal circuits. ![SKY130 PDK stack up](site/Resources/media/image193.png) > [!Figure] > _SKY130 PDK stack up (source: https://skywater-pdk.readthedocs.io/)_ ![[Pasted image 20250912133546.png]] > [!Figure] > _Generic stack-up showcasing the FEoL/BEoL hierarchy_ #### OSU180nm The OSU180nm (Oregon State University 180 nanometer) Process Design Kit (PDK) is a set of documents, models, and data used by semiconductor designers to create integrated circuits (ICs) using the 180nm CMOS manufacturing process. Developed by Oregon State University, the OSU180nm PDK provides essential information and tools for designing chips at the 180nm technology node. Here's a summary of it: 4. **Technology Node**: The term "180nm" refers to the minimum feature size (also known as the process node) of the semiconductor fabrication process. In the OSU180nm PDK, this means that the smallest feature that can be reliably manufactured is approximately 180 nanometers. This determines the resolution and accuracy of the design rules for creating circuits. 5. **Process Design Kit (PDK)**: A PDK is essentially a collection of files, libraries, and documentation that describe the fabrication process, design rules, device models, and simulation parameters specific to a particular semiconductor manufacturing process. 
The OSU180nm PDK contains the information necessary for designing and simulating ICs using the 180nm CMOS process. 6. **Components of OSU180nm PDK**: - **Design Rules**: These rules specify the constraints and limitations of the fabrication process. Designers must adhere to these rules to ensure that their designs can be reliably manufactured. - **Device Models**: These models provide mathematical representations of transistors, capacitors, resistors, and other components used in IC design. They describe how these devices behave under different operating conditions and are essential for circuit simulation and analysis. - **Library Cells**: The PDK includes a library of standard cells, such as logic gates, flip-flops, and other basic building blocks commonly used in IC design. These cells are optimized for the 180nm process and serve as the foundation for designing more complex circuits. - **Simulation Models**: These models describe the electrical behavior of components and interconnects within the IC. They enable designers to simulate the performance of their designs under various operating conditions, helping to identify and address potential issues early in the design process. - **Layout Tools**: The PDK may include layout tools or scripts tailored for the 180nm process, helping designers create physical layouts of their circuits that conform to the design rules and manufacturing constraints. - **Documentation**: Detailed documentation is provided to guide designers through the process of using the PDK, including instructions for installing the PDK, designing circuits, running simulations, and preparing designs for fabrication. 7. **Applications**: The OSU180nm PDK is used by semiconductor companies, research institutions, and universities for designing a wide range of ICs, including digital circuits, analog circuits, mixed-signal circuits, and microelectromechanical systems (MEMS). It enables designers to create custom ICs tailored to specific applications, such as consumer electronics, telecommunications, automotive systems, and medical devices. 8. **Advantages and Challenges**: The 180nm CMOS process offers a balance between performance, power consumption, and cost, making it suitable for a variety of applications. However, designers must carefully consider the limitations and trade-offs associated with this process, such as limited scalability compared to more advanced nodes, increased power consumption, and lower performance compared to newer technologies. #### GF180MCU The GF180MCU (GlobalFoundries 180nm Microcontroller Unit) PDK (Process Design Kit) is a suite of tools, libraries, and documentation provided by GlobalFoundries for designing and manufacturing integrated circuits using their 180nm process technology. This PDK is specifically tailored for microcontroller unit (MCU) applications, offering designers the resources to develop microcontroller-based systems. The GF180MCU PDK provides access to transistor-level models, standard cell libraries, and various analog and digital IP blocks optimized for the 180nm process node. Additionally, the PDK encompasses simulation environments and tools that enable designers to verify the functionality and performance of their designs before fabrication. This aids in reducing design iterations and mitigating potential issues during the chip development process. 
Furthermore, the GF180MCU PDK offers documentation and guidelines to assist designers in effectively utilizing the provided resources and optimizing their designs for power, performance, and area (PPA) metrics. ### Simulating standard cells As we saw for the simple MOSFET transistor we drew above, to simulate a standard cell we can also use SPICE. In particular, the open-source NGSPICE software. But what is NGSPICE anyway? NGSPICE is a powerful, open-source software tool for the simulation of electronic circuits. It is an enhanced version of the original SPICE (Simulation Program with Integrated Circuit Emphasis), which was developed in the early 1970s at the University of California, Berkeley. Over the years, SPICE has become the de facto standard for circuit simulation, and NGSPICE has continued this legacy by integrating improvements and extensions to the original program, ensuring compatibility with modern simulation needs and technology processes. The core functionality of SPICE relies on solving a set of nonlinear differential equations that describe the behavior of an electronic circuit. These equations are derived from Kirchhoff's current and voltage laws, along with the constitutive relations (i.e., device models) that describe how individual circuit components (like resistors, capacitors, inductors, diodes, and transistors) respond to electrical stimuli. To simulate the dynamic behavior of circuits over time, especially for transient analysis, SPICE uses numerical integration methods to solve these differential equations. SPICE commonly employs numerical integration techniques such as the Gear method and the trapezoidal (TRAP) rule. These methods approximate the continuous changes in circuit variables over time by discretizing the time domain into small intervals and estimating the integral of the circuit's equations across these intervals. Through integration, SPICE calculates the voltage and current values at each node and through each component in the circuit at discrete time points. The choice of numerical integration method affects the accuracy and stability of the simulation. SPICE allows users to adjust parameters related to the integration process, such as the time step size, to balance between computational efficiency and the precision of the simulation results. Solving circuit equations often involves iterative methods to deal with nonlinearity and ensure convergence to a stable solution. Typically, standard cell libraries come equipped with SPICE models for their devices. These models are hierarchically composed of subcircuits including other circuits. For instance, a NAND gate standard cell is modeled in SPICE in the following manner: ```Spice .subckt sky130_fd_sc_hd__nand2_1 A B VGND VNB VPB VPWR Y X0 Y A VPWR VPB sky130_fd_pr__pfet_01v8_hvt w=1e+06u l=150000u X1 VPWR B Y VPB sky130_fd_pr__pfet_01v8_hvt w=1e+06u l=150000u X2 VGND B a_113_47# VNB sky130_fd_pr__nfet_01v8 w=650000u l=150000u X3 a_113_47# A Y VNB sky130_fd_pr__nfet_01v8 w=650000u l=150000u .ends ``` It can be seen that the subcircuit employs discrete MOS transistors to implement the NAND gate by using PFET and NFET models with parameterized geometries such as widths and lengths. The rest of the SPICE directives are to indicate how the different terminals of these models connect with each other. It's not the most intuitive nor graphical way ever to connect a circuit but the advantage of this text-based design is that it is highly automatable by EDA tools. We will simulate now the 2-input NAND cell. 
For this, we will create a high-level test bench that will instantiate the cell model, which in turn instantiates the transistor models.

```Spice
Standard cell Simulation
* this file edited to remove everything not in tt lib
.lib "./sky130_fd_pr/models/sky130.lib.spice" tt
* include the standard cells
.include "./sky130_fd_sc_hd.spice"
* instantiate the cell - adjust this to match your standard cell
Xcell A B VGND VGND VPWR VPWR Y sky130_fd_sc_hd__nand2_1
* set gnd and power
Vgnd VGND 0 0
Vdd VPWR VGND 1.8
* create pulse for A and B
* parameters are: initial value, pulsed value, delay time, rise time, fall time, pulse width, period
Va A VGND pulse(0 1.8 1n 10p 10p 1n 2n)
Vb B VGND pulse(0 1.8 1.5n 10p 10p 1n 2n)
* setup the transient analysis
.tran 10p 3n 0
.control
run
set color0 = white
set color1 = black
plot A B Y
.endc
.end
```

As expected, the cell has three signal terminals: A and B for the inputs and Y for the output. The cell also has power and body terminals that must be connected to ground and to Vdd (VGND/VNB and VPWR/VPB, respectively). In the SPICE testbench, we must also generate the stimulus for the cell model. We do that with the Va and Vb sources; we specify voltage swings, rise and fall times, and pulse duration. The simulation yields the result seen in the figure below.

![NAND standard cell simulation](site/Resources/media/image196.png)
> [!Figure]
> _NAND standard cell simulation in SPICE_

### OpenLane

OpenLane is an automated RTL-to-GDSII flow based on hooking together several building blocks including OpenROAD, Yosys, Magic, Netgen, KLayout, and a number of custom scripts for design exploration and optimization. The OpenLane flow performs all ASIC implementation steps from RTL all the way down to GDSII. Currently, it supports both the A and B variants of the [[Semiconductors#SKY130|sky130]] PDK and the C variant of the [[Semiconductors#GF180MCU|GF180MCU]] PDK, and instructions to add support for other (including proprietary) PDKs are documented. OpenLane abstracts the underlying open-source utilities and allows users to configure all their behavior with a single configuration file. It should be noted that the journey from RTL to GDSII is complex and long, and includes a sequence of many steps that require careful configuration and customization.

#### Flow

The OpenLane flow is depicted in the figure below:

![[Pasted image 20250110214628.png]]
> [!Figure]
> OpenLane architecture and sequence (source: https://github.com/The-OpenROAD-Project/OpenLane/)

The OpenLane flow consists of several stages. By default, all flow steps are run in sequence. Each stage may consist of multiple sub-stages.

1. **Synthesis**
	1. [Yosys](https://github.com/yosyshq/yosys) - Performs RTL synthesis and technology mapping.
	2. [OpenSTA](https://github.com/the-openroad-project/opensta) - Performs static timing analysis on the resulting netlist to generate timing reports
2. **Floorplanning**
	1. [OpenROAD/Initialize Floorplan](https://github.com/the-openroad-project/openroad/tree/master/src/ifp) - Defines the core area for the macro as well as the rows (used for placement) and the tracks (used for routing)
	2. OpenLane IO Placer - Places the macro input and output ports
	3. [OpenROAD/PDN Generator](https://github.com/the-openroad-project/openroad/tree/master/src/pdn) - Generates the power distribution network
	4. [OpenROAD/Tapcell](https://github.com/the-openroad-project/openroad/tree/master/src/tap) - Inserts welltap and endcap cells in the floorplan
3. **Placement**
	1.
[OpenROAD/RePlace](https://github.com/the-openroad-project/openroad/tree/master/src/gpl) - Performs global placement 2. [OpenROAD/Resizer](https://github.com/the-openroad-project/openroad/tree/master/src/rsz) - Performs optional optimizations on the design 3. [OpenROAD/OpenDP](https://github.com/the-openroad-project/openroad/tree/master/src/dpl) - Performs detailed placement to legalize the globally placed components 4. **CTS** 1. [OpenROAD/TritonCTS](https://github.com/the-openroad-project/openroad/tree/master/src/cts) - Synthesizes the clock distribution network (the clock tree) 5. **Routing** 1. [OpenROAD/FastRoute](https://github.com/the-openroad-project/openroad/tree/master/src/grt) - Performs global routing to generate a guide file for the detailed router 2. [OpenROAD/TritonRoute](https://github.com/the-openroad-project/openroad/tree/master/src/drt) - Performs detailed routing 3. [OpenROAD/OpenRCX](https://github.com/the-openroad-project/openroad/tree/master/src/rcx) - Performs SPEF extraction 6. **Tapeout** 1. [Magic](https://github.com/rtimothyedwards/magic) - Streams out the final GDSII layout file from the routed def 2. [KLayout](https://github.com/klayout/klayout) - Streams out the final GDSII layout file from the routed def as a back-up 7. **Signoff** 1. [Magic](https://github.com/rtimothyedwards/magic) - Performs DRC Checks & Antenna Checks 2. [Magic](https://github.com/klayout/klayout) - Performs DRC Checks & an XOR sanity-check between the two generated GDS-II files 3. [Netgen](https://github.com/rtimothyedwards/netgen) - Performs LVS Checks While OpenLane itself as a script (and its associated build scripts) are under the Apache License, version 2.0, tools may fall under stricter licenses. Everything in Floorplanning through Routing is done using [OpenROAD](https://github.com/The-OpenROAD-Project/OpenROAD) and its various sub-utilities, hence the name "OpenLane." #### OpenLane Output All output run data is placed by default under ./designs/design_name/runs. Each flow cycle will output a timestamp-marked folder containing the following file structure: ``` <design_name> ├── config.json/config.tcl ├── runs │ ├── <tag> │ │ ├── config.tcl │ │ ├── {logs, reports, tmp} │ │ │ ├── cts │ │ │ ├── signoff │ │ │ ├── floorplan │ │ │ ├── placement │ │ │ ├── routing │ │ │ └── synthesis │ │ ├── results │ │ │ ├── final │ │ │ ├── cts │ │ │ ├── signoff │ │ │ ├── floorplan │ │ │ ├── placement │ │ │ ├── routing │ │ │ └── synthesis ``` > [!warning] > This section is under #development ## Fabrication For a great introduction to the CMOS fabrication flow, see #ref/Kaeslin, chapter 14, section 14.2. ### Foundries Semiconductor foundries are highly specialized industrial facilities dedicated to the manufacturing of integrated circuits (ICs) and other semiconductor devices. ==These foundries operate on a contract basis, producing chips designed by external clients, typically fabless semiconductor companies that lack their own fabrication capabilities.== The core function of a foundry is to translate intricate circuit designs into physical silicon products through a series of sophisticated and precise manufacturing steps. Critical to maintaining precision is the use of cleanrooms, where air filtration and contamination control systems minimize particulate levels, as even microscopic dust can disrupt nanoscale features. 
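As a simple illustration of why yield is so sensitive to die size, a commonly used first-order model treats fabrication defects as randomly (Poisson) distributed over the wafer, giving a die yield of

$$
Y \approx e^{-A \cdot D_0}
$$

where $A$ is the die area and $D_0$ the defect density. With an illustrative $D_0 = 0.1\ \text{defects/cm}^2$, a $1\ \text{cm}^2$ die yields $e^{-0.1} \approx 90\%$, while a $4\ \text{cm}^2$ die on the same process yields only $e^{-0.4} \approx 67\%$. The numbers here are invented for illustration, but the exponential dependence on area is why large dies are so much more expensive to manufacture and why reducing defect density dominates a foundry's process-improvement effort.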
Advanced lithography tools, including extreme ultraviolet (EUV) scanners, enable the creation of features as small as a few nanometers by utilizing shorter wavelengths of light, essential for cutting-edge process nodes like 5 nm or 3 nm. These nodes denote the smallest feature size achievable, directly impacting transistor density, power efficiency, and performance. Post-fabrication, wafers must undergo rigorous testing to identify functional dies. Automated probes check electrical characteristics, marking defective units. Functional dies are then separated via dicing and packaged, a process that encases the silicon in protective materials and connects it to external leads or solder bumps for integration into electronic systems. Foundries face significant technical and economic challenges. The capital expenditure for constructing a fabrication plant (fab) can exceed tens of billions of dollars, driven by the cost of advanced machinery and cleanroom infrastructure. Yield management—maximizing the proportion of operational dies per wafer—is critical for profitability, requiring continuous optimization of processes to mitigate defects. Additionally, thermal management, material limitations, and the physical constraints of atomic-scale manufacturing necessitate ongoing research and development. # Digital Logic Building Blocks ## Flip-Flops A flip-flop is a fundamental building block in digital systems, functioning as a logic cell capable of storing one bit of data. It is a bistable device, which means it has two stable states that it can latch onto, representing a binary 0 or 1. Flip-flops maintain their state indefinitely until an input pulse, known as a trigger, prompts them to switch states. This characteristic makes them useful for registers of several kinds (latch registers, shift registers), counters, and data storage within CPUs. The simplest type of flip-flop is the SR (Set-Reset) flip-flop, which has two inputs and two corresponding outputs. One input sets the state to 1, and the other resets it to 0. The interesting part of this, which in a way gives sense to the whole section, is that one can observe a flip-flop from multiple levels of abstraction, which in a way illustrates the layered nature of digital, complex systems. There are four different levels of abstraction we can use while observing a flip-flop: 16. As a [[Semiconductors#Standard Cells|standard cell]] in VLSI[^53] : ![Flip-flop standard cell layout](site/Resources/media/image197.png) 17. A transistor-level view shows the flip-flop synthesized with complementary MOSFET transistors: ![Flip-flop depicted with discrete CMOS transistors](site/Resources/media/image198.png) 18. Gate-level which shows the flip-flop synthesized with NAND gates: ![SR Flip-Flop with NAND gates](site/Resources/media/image199.png) 19. And the highest level of abstraction which depicts the flip-flop as a schematic box with inputs and outputs. ![RS Flip-Flop (block)](site/Resources/media/image200.png) Here lies the core of the hierarchical analysis of digital systems: depending on the zoom level, we can choose to take a look at the underlying, almost physical details, or we can choose to use digital systems as mere black boxes and only use them by manipulating their inputs and outputs. There are multiple types of flip-flops. 
Common types include the D (Data) flip-flop, which captures the value on its data line when a clock edge arrives, and the JK flip-flop, which toggles its output when both of its inputs are high and a clock edge arrives. The truth table of the RS flip-flop is shown below.

| **S** | **R** | **Q**      | **Q#**     | **Condition** |
| ----- | ----- | ---------- | ---------- | ------------- |
| 0     | 0     | Prev state | Prev state | Latched       |
| 0     | 1     | 0          | 1          | Reset         |
| 1     | 0     | 1          | 0          | Set           |
| 1     | 1     | X          | X          | Invalid       |

The precise timing of the flip-flop's state change is controlled by a clock signal, making it a synchronous device; its output changes only at defined times, providing predictability that is fundamental for the timing and control within sequential logic circuits. Flip-flops are integral to the design of sequential circuits where the notion of time and order of operations are critical, serving as the memory elements in a wide array of digital devices.

> [!info]
> A factor that tends to be underestimated when it comes to discussing flip-flops is **feedback**. Feedback is an essential factor of flip-flops, acting as the mechanism that enables them to retain data and maintain a stable state. In a flip-flop, logic gates are configured in a way that the output of the circuit is fed back to its input (see the gate-level depiction above). This feedback loop is what creates the two stable states a flip-flop can hold. When a flip-flop is set or reset, the output changes state accordingly. This new state is then fed back into the circuit through the feedback loop, effectively telling the circuit to "latch" onto this new state and hold it. The feedback ensures that once the flip-flop is in a given state, it remains in that state until an external force (usually a different input signal) causes it to change. Without feedback, a flip-flop would not be able to maintain its state and would not function as a storage cell. Feedback loops in flip-flops are carefully designed to avoid unwanted oscillations and to ensure that the flip-flop responds correctly to control signals, such as clock pulses. The control signals synchronize changes to the flip-flop's state, allowing it to be used effectively in sequential logic circuits where the precise timing of data storage and retrieval is paramount.

### D-flip-flop Simulation in SPICE

We will now simulate a D flip-flop standard cell from the Sky130 PDK library. For this, we will trace and locate the flip-flop subcircuit in the PDK library. The SPICE code is shown below. It can be seen that the flip-flop model uses NMOS and PMOS devices as per the technology process, with widths and lengths matching Sky130 specifications (150 nm gate length).
```Spice
.subckt sky130_fd_sc_hd__dfxtp_1 CLK D VGND VNB VPB VPWR Q
X0 a_891_413# a_193_47# a_975_413# VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X1 a_1059_315# a_891_413# VGND VNB sky130_fd_pr__nfet_01v8 w=650000u l=150000u
X2 a_466_413# a_27_47# a_561_413# VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X3 a_634_159# a_27_47# a_891_413# VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X4 a_381_47# a_193_47# a_466_413# VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X5 VPWR D a_381_47# VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X6 VPWR a_466_413# a_634_159# VPB sky130_fd_pr__pfet_01v8_hvt w=750000u l=150000u
X7 VGND a_466_413# a_634_159# VNB sky130_fd_pr__nfet_01v8 w=640000u l=150000u
X8 a_1017_47# a_1059_315# VGND VNB sky130_fd_pr__nfet_01v8 w=420000u l=150000u
X9 a_1059_315# a_891_413# VPWR VPB sky130_fd_pr__pfet_01v8_hvt w=1e+06u l=150000u
X10 a_561_413# a_634_159# VPWR VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X11 VPWR a_1059_315# Q VPB sky130_fd_pr__pfet_01v8_hvt w=1e+06u l=150000u
X12 a_891_413# a_27_47# a_1017_47# VNB sky130_fd_pr__nfet_01v8 w=360000u l=150000u
X13 a_634_159# a_193_47# a_891_413# VNB sky130_fd_pr__nfet_01v8 w=360000u l=150000u
X14 a_592_47# a_634_159# VGND VNB sky130_fd_pr__nfet_01v8 w=420000u l=150000u
X15 a_466_413# a_193_47# a_592_47# VNB sky130_fd_pr__nfet_01v8 w=360000u l=150000u
X16 VGND a_27_47# a_193_47# VNB sky130_fd_pr__nfet_01v8 w=420000u l=150000u
X17 a_381_47# a_27_47# a_466_413# VNB sky130_fd_pr__nfet_01v8 w=360000u l=150000u
X18 a_27_47# CLK VGND VNB sky130_fd_pr__nfet_01v8 w=420000u l=150000u
X19 a_27_47# CLK VPWR VPB sky130_fd_pr__pfet_01v8_hvt w=640000u l=150000u
X20 VPWR a_27_47# a_193_47# VPB sky130_fd_pr__pfet_01v8_hvt w=640000u l=150000u
X21 VGND D a_381_47# VNB sky130_fd_pr__nfet_01v8 w=420000u l=150000u
X22 a_975_413# a_1059_315# VPWR VPB sky130_fd_pr__pfet_01v8_hvt w=420000u l=150000u
X23 VGND a_1059_315# Q VNB sky130_fd_pr__nfet_01v8 w=650000u l=150000u
.ends
```

We can see from the code that the model has 7 pins: CLK, D, VGND, VNB, VPB, VPWR, and Q.

- CLK is the clock pin
- D is the data pin
- VGND is the ground pin
- VNB is the NMOS body (p-substrate/p-well) pin, normally tied to ground
- VPB is the PMOS body (n-well) pin, which must be connected to VPWR
- VPWR is the power (VDD) pin
- Q is the output

To simulate a D-type flip-flop using the Sky130 standard cell, we must provide power to it, as well as stimulus (clock and data). The snippet below shows how to do this:

```Spice
Standard cell Simulation
* this file edited to remove everything not in tt lib
.lib "./sky130_fd_pr/models/sky130.lib.spice" tt
* include the standard cells
.include "./sky130_fd_sc_hd.spice"
* instantiate the cell - adjust this to match your standard cell
Xcell CLK D VGND VGND VPWR VPWR Q sky130_fd_sc_hd__dfxtp_1
* set gnd and power
Vgnd VGND 0 0
Vdd VPWR VGND 1.8
* create clock and data stimuli for the flip-flop
* parameters are: initial value, pulsed value, delay time, rise time, fall time, pulse width, period
Vd CLK VGND PULSE(0 1.8 5n 1n 1n 5n 10n)
Vb D VGND 0 PULSE(0 1.8 0 1n 1n 10n 20n)
* setup the transient analysis
.tran 1n 200n
.control
run
set color0 = white
set color1 = black
plot D Q CLK
.endc
.end
```

The result is shown below. It can be seen in the picture that the output Q (in blue) is set whenever D is high and there is a rising edge in the CLK signal. Conversely, the output Q is cleared whenever D is low and there is again a rising edge in the clock signal.
![D-type flip-flop, standard cell SPICE simulation](ff_spice.png) > [!Figure] > _D-type flip-flop, standard cell SPICE simulation_ We haven't discussed [[Semiconductors#Registers|registers]] yet, but we will jump the gun here a little bit and chain a few flip-flops together to create a shift register. For creating a shift register in SPICE, we will hook the output of the first flip-flop to the data input of the second one, and so on. A schematic diagram of what we are trying to achieve is shown below. ![4-bit shift register made with D flip-flops](site/Resources/media/image202.png) > [!Figure] > _4-bit shift register made with D flip-flops_ In this circuit, as the figure above depicts, all D flip-flops will share a global clock. The code in SPICE is shown below. ```Spice Standard cell Simulation * this file edited to remove everything not in tt lib .lib "./sky130_fd_pr/models/sky130.lib.spice" tt * include the standard cells .include "./sky130_fd_sc_hd.spice" * instantiate the cell - adjust this to match your standard cell XFF1 CLK D VGND VGND VPWR VPWR Q1 sky130_fd_sc_hd__dfxtp_1 XFF2 CLK Q1 VGND VGND VPWR VPWR Q2 sky130_fd_sc_hd__dfxtp_1 XFF3 CLK Q2 VGND VGND VPWR VPWR Q3 sky130_fd_sc_hd__dfxtp_1 XFF4 CLK Q3 VGND VGND VPWR VPWR Q4 sky130_fd_sc_hd__dfxtp_1 * set gnd and power Vgnd VGND 0 0 Vdd VPWR VGND 1.8 * create pulses * parameters are: initial value, pulsed value, delay time, rise time, fall time, pulse width, period * Clock signal - shared by all flip-flops Vd CLK VGND PULSE(0 1.8 5n 1n 1n 5n 10n) * Data input for the shift register (initial bit to be shifted in) Vb D VGND 0 PULSE(0 1.8 0 1n 1n 10n 20n) * setup the transient analysis .tran 1n 200n .control run set color0 = white set color1 = black plot D CLK plot Q1 plot Q2 plot Q3 plot Q4 .endc .end ``` Whose results are shown below, in separate plots for clarity. Observe signal legends for reference. ![](clk_d.png) ![](q1.png) ![](q2.png) ![](q3.png) ![](q4.png) Creating more complex digital circuits would require writing quite some more code in SPICE. It should be reasonably visible by now that SPICE does not appear to be the most intuitive language ever to create circuits; no one in their sane mind would design a complex digital system in SPICE. This is not a criticism of SPICE if we consider that SPICE was never conceived as a hardware description language but as a language for analyzing particular aspects of circuits like timing, transients, delays, overshoots, and the like. Horses for courses. For designing complex digital systems, we must use Hardware Description Languages. ## Hardware Description Languages Hardware Description Languages (HDLs) are specialized computer languages used to describe the structure, design, and operation of digital circuits and systems. Unlike traditional procedural programming languages, HDLs allow designers to specify the operation of digital circuits at various levels of abstraction, from high-level behavioral descriptions to detailed gate-level or even transistor-level specifications. The most widely used HDLs in the field of electronic design automation (EDA) are VHDL (VHSIC Hardware Description Language) and Verilog, each having unique syntax, semantics, and capabilities tailored to the needs of hardware design and simulation. > [!info] > Verilog and VHDL are not the only way of describing hardware. Chisel (Constructing Hardware in a Scala Embedded Language^[https://www.chisel-lang.org/docs/installation]) and Scala form an intriguing pair in the context of HDLs. 
Chisel leverages the high-level programming capabilities of Scala, a modern programming language known for its expressive syntax and functional programming features, to enable the design of complex and reusable hardware modules. Unlike traditional VHDL and Verilog, which are more verbose and primarily tailored for describing hardware at a lower abstraction level, Chisel allows designers to write hardware designs that are both compact and powerful, harnessing the full might of Scala's abstraction, type inference, and object-oriented features. The integration of Chisel within the Scala ecosystem means that hardware designers can utilize Scala's extensive library support and sophisticated build tools, enhancing productivity and facilitating the creation of highly parametrizable and type-safe hardware designs. This approach represents a significant paradigm shift in hardware design methodologies, moving away from the rigid and often cumbersome syntax of traditional HDLs towards a more software-oriented design process. This not only speeds up the design cycle but also makes the hardware design more accessible to software engineers who may not be familiar with the intricacies of traditional hardware description languages. Furthermore, Chisel generates Verilog code as its output, ensuring compatibility with existing workflows. This compatibility is convenient for integrating Chisel-based designs into standard hardware development processes, allowing for simulation, synthesis, and testing using well-established tools in the industry. At the core of HDLs is the ability to model concurrent operations, a fundamental aspect of digital hardware where multiple operations can occur simultaneously, unlike the sequential execution seen in software programming. This concurrency is achieved through constructs like processes in VHDL and always blocks in Verilog, which allow the description of parallel operations within a digital system. Furthermore, HDLs support a range of abstraction levels, enabling designers to focus on high-level functional behavior without detailing the underlying implementation or, conversely, to specify precise gate-level or structural details for fabrication. An advanced use of HDLs extends into the domain of synthesis and simulation. Synthesis tools interpret HDL descriptions to generate physical hardware implementations, typically in the form of gate-level netlists suitable for fabrication on silicon through Application-Specific Integrated Circuits (ASICs) or for configuration of Field-Programmable Gate Arrays (FPGAs). This process involves complex algorithms that translate behavioral descriptions into optimized sets of logic gates, considering constraints like timing, area, and power consumption. Simulation, on the other hand, plays a critical role in verifying the correctness and performance of a design before it is physically realized. HDLs allow for the creation of testbenches, which are environments constructed to apply stimuli to the design under test, enabling designers to observe and verify their responses. Through simulation, potential issues can be identified and corrected early in the design cycle, significantly reducing development time and costs. HDLs facilitate a methodology known as parametric or generic design. By using parameters or generics, designers can create flexible and reusable modules where specific characteristics, such as bit widths or operation modes, can be easily adjusted without altering the core logic. 
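To make the idea of parametric design concrete, below is a minimal sketch of a width-parameterized register in Verilog (the module name and the ```WIDTH``` parameter are ours, not taken from any particular library); changing a single parameter at instantiation time resizes the whole block without touching its core logic.

```Verilog
module param_register #(
    parameter WIDTH = 8              // Bit width, chosen at instantiation time
) (
    input  wire             clk,
    input  wire             reset,
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] q
);

// The core logic is identical regardless of the chosen width
always @(posedge clk or posedge reset) begin
    if (reset)
        q <= {WIDTH{1'b0}};
    else
        q <= d;
end

endmodule

// Two differently sized instances of the very same module:
// param_register #(.WIDTH(8))  reg8  (.clk(clk), .reset(rst), .d(d8),  .q(q8));
// param_register #(.WIDTH(32)) reg32 (.clk(clk), .reset(rst), .d(d32), .q(q32));
```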
This approach enhances design efficiency and promotes the development of modular, scalable hardware architectures. In addition to VHDL and Verilog, there are other HDLs and hardware specification languages that have gained popularity for specific purposes or methodologies. For instance, SystemVerilog extends Verilog with stronger support for system-level design and verification, while Hardware Construction Languages (HCLs) like Chisel and Bluespec offer functional programming paradigms to hardware design, enabling highly abstract and concise descriptions that can be efficiently synthesized into hardware. ## Metastability The phenomenon of metastability is inherent in clocked digital logic. Every bistable device will have two stable states and also a third metastable state. If the device enters its metastable state, it will stay there for an indeterminate and unbounded length of time before eventually transitioning into one of the stable states. Digital latches and flip-flops are bistable devices that can store a one or zero (the two stable states) but may also enter their metastable state under marginal triggering conditions if the specifications of the cell are violated (e.g., input setup time, input hold time, clock slew rate, power supply voltage, clock pulse width). Since the flip-flop will stay metastable for an indeterminate time, the system has by definition failed its timing requirements. Physics dictates that one cannot eliminate metastability. However, its effects can be minimized, and the probability of failure can be reduced by isolating the asynchronous inputs using a synchronizer circuit. The synchronizer conditions the input into a known relationship with the system clock. ![](d_flop.png) > [!Figure] > D-flop (source: #ref/Golson ) For the D flop in the figure above, initially, the data D is low and clock CLK is high. The latch is transparent and thus Q is low and /Q is high. Now assume D goes high just before CLK goes low. What happens? At first, M1 will transition low causing output Q to head towards high, but then M1 will transition back to high because CLK has just gone low. If the overlap between D and CLK is sufficiently small, M1 and M2 can both be high and output Q may not yet have reached a high state. At this point, because M1 and M2 are high they cannot have any further effect on the outputs. The outcome for Q and /Q is determined solely by the cross-coupled gates. This situation is now similar to the cross-coupled inverters in the figure below (which we will revisit when we discuss the SRAM [[Semiconductors#SRAM cell|6T cell]]). ![](coupled_inv.png) > [!Figure] > Cross-coupled inverters (source: #ref/Golson) The steady-state DC transfer curves for each inverter are shown in the figure below. ![](DC_inv.png) > [!Figure] > Transfer curves for cross-coupled inverters (source: #ref/Golson ) What are the possible steady-state behaviors for these cross-coupled inverters? We can determine this graphically from the figure above; it is simply the points at which the two transfer curves intersect. Surprisingly there are not two but three such points. There are two stable states at A=0, B=1 and A=1, B=0. The third point is A=B=$V_m$ where $V_m$ is not a legal logic level. This third intersection represents a metastable state. This is a valid solution to the transfer equations; the voltages are self-consistent and the latch can theoretically remain in this state indefinitely. However, any noise or other disruption will tend to drive the latch toward one of the stable states. 
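In practice, the synchronizer mentioned above is usually just a chain of two (or more) flip-flops in the destination clock domain: the first flop may occasionally go metastable, but it gets close to a full clock period to resolve before the second flop samples it. The sketch below (module and signal names are ours) shows the common two-flip-flop arrangement; the mean time between failures of such a chain is typically estimated with an expression of the form $MTBF = e^{t_r/\tau} / (T_0 \cdot f_{clk} \cdot f_{data})$, where $t_r$ is the resolution time granted to the first flop and $\tau$ and $T_0$ are device-dependent constants.

```Verilog
module sync_2ff (
    input  wire clk,       // Destination-domain clock
    input  wire async_in,  // Asynchronous input signal
    output wire sync_out   // Synchronized output, safe to use in the clk domain
);

reg ff1, ff2;

always @(posedge clk) begin
    ff1 <= async_in;  // This stage may occasionally go metastable
    ff2 <= ff1;       // Samples ff1 after ~1 clock period of settling time
end

assign sync_out = ff2;

endmodule
```

Note that this arrangement is only safe for single-bit signals (or for multi-bit values protected by handshaking or Gray coding).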
## Using Verilog for Simulating Digital Logic Building Blocks

Verilog is a hardware description language (HDL) used to model digital systems at multiple levels of abstraction, ranging from the behavioral level down to the gate level. It was first developed by Gateway Design Automation in 1984 as a proprietary language for logic simulation. Later, Cadence Design Systems acquired Gateway and made Verilog an open standard in 1990, which broadened its adoption and established it as a de facto language in the electronic design automation (EDA) industry. The language syntax of Verilog is somewhat similar to that of C, which allows engineers with software backgrounds to get acquainted with it relatively quickly. Verilog enables designers to describe the structure and behavior of electronic systems in text form, which is then used to simulate the design before it's ever made into physical hardware. In 1995, Verilog was standardized as IEEE Standard 1364. Over the years, it has evolved through several iterations, adding features to enhance its modeling capabilities, improve simulation performance, and support newer forms of synthesis. Verilog is not just for simulating how a design will behave but also plays a critical role in the synthesis of designs into actual hardware. Synthesis tools convert Verilog code into a netlist that can be mapped onto a physical IC.

Abstraction levels kick in again. With Verilog, one can design a logic module at different levels of abstraction. Irrespective of the abstraction level chosen, the module behaves in exactly the same way as seen from the external environment. The following are the three levels of abstraction we will cover, each corresponding to a different Verilog coding style: behavioral (algorithmic) level, dataflow level, and gate level. (A fourth style, switch level, describes a design in terms of individual transistors and is rarely written by hand.)

### Behavioral Level

Behavioral level modeling refers to the description of a digital system in terms of algorithmic or sequential execution, rather than specific hardware constructs such as gates or flip-flops. This type of modeling focuses on what the design does (its functionality), rather than how it is implemented in hardware. It's usually more abstract and can lead to more concise code that is easier to write and understand. Here's the D flip-flop we have been using all along, this time implemented by means of behavioral modeling in Verilog:

```Verilog
module dff (
    input wire clk,
    input wire reset,
    input wire d,
    output reg q
);

always @(posedge clk or posedge reset) begin
    if (reset)
        q <= 0; // Asynchronously reset the flip-flop
    else
        q <= d; // At every rising edge of the clock, output follows input
end

endmodule
```

The code snippet above abstractly models the behavior of the flip-flop without delving into its gate-level implementation details, or into how the wires are connected.

### Gate Level

Gate-level modeling in Verilog is used to describe the circuit in terms of logic gates and the connections between them. It is closer to the actual hardware implementation and is often used for detailed simulation and synthesis. The behavior of the gates is defined by the built-in primitives of Verilog (AND, OR, NAND, NOR, etc.), and the interconnections are explicitly described.
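As a minimal first taste of structural description — before the flip-flop example that follows — here is a half adder built purely from Verilog's built-in gate primitives (a small sketch of ours, not part of the flow above):

```Verilog
module half_adder (
    input  wire a,
    input  wire b,
    output wire sum,
    output wire carry
);

// Built-in primitives: the output terminal comes first, then the inputs
xor (sum, a, b);
and (carry, a, b);

endmodule
```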
Here is an example of the D Flip-Flop using gate-level modeling with NAND gates:

```Verilog
module dff_nand (
    input wire clk,
    input wire reset,   // Active-high asynchronous reset (clear)
    input wire d,
    output wire q,
    output wire q_bar   // This is the inverted output
);

wire clk_bar, d_bar, reset_bar;

// Create the inverted signals using NAND gates
nand(clk_bar, clk, clk);
nand(d_bar, d, d);
nand(reset_bar, reset, reset);

// Master latch: transparent while clk is low
wire s, r;
nand(s, d, clk_bar);
nand(r, d_bar, clk_bar);

wire qm, qm_bar;
nand(qm, s, qm_bar);
nand(qm_bar, r, qm, reset_bar);   // reset clears the master latch

// Slave latch: transparent while clk is high,
// so the output updates on the rising edge of the clock
wire s1, r1;
nand(s1, qm, clk);
nand(r1, qm_bar, clk);

// The actual D Flip-Flop outputs
nand(q, s1, q_bar);
nand(q_bar, r1, q, reset_bar);    // reset forces q_bar high and hence q low

endmodule
```

This code uses NAND primitives to create both the master and slave latches of a master-slave (edge-triggered) D Flip-Flop, with the asynchronous reset folded into the latch gates. Gate-level models are much more detailed compared to behavioral models and are useful for a detailed understanding of how logic operates at a fundamental level, as well as for low-level simulations where the timing of individual gates could be critical. The penalty paid is verbosity and effort: the description grows quickly with the size of the design and soon becomes impractical to write and maintain by hand.

### Data Flow Level Modeling

Data flow modeling in Verilog describes the behavior of a circuit in terms of the flow of data through the circuit. It uses continuous assignments (```assign``` statements) with expressions to define relationships between inputs and outputs. Unlike behavioral modeling which uses procedural blocks (```always``` or ```initial```), data flow modeling is more declarative and expresses the logic directly. Here is an example of our infamous D Flip-Flop using data flow level modeling with Verilog operators to describe the flip-flop functionality:

```Verilog
module dff_dataflow (
    input wire clk,
    input wire d,
    output wire q
);

reg q_internal;

// The actual flip-flop behavior using data flow modeling
always @(posedge clk) begin
    q_internal <= d;
end

assign q = q_internal;

endmodule
```

In this example:

- The ```assign``` statement is used for the continuous assignment to output ```q```, which means the output is always driven by the value of ```q_internal```.
- The ```always``` block still looks like behavioral modeling, but the only behavior it models is the data flow: on every positive edge of ```clk```, ```q_internal``` takes the value of ```d```.
- There are no explicit gate-level or structural details, as the intent is to simply show the data flow (the value of ```d``` being stored in ```q``` at every clock edge).

This modeling level abstracts away the exact timing and logic gate structure, focusing solely on the propagation of data within the circuit. It's a higher level of abstraction compared to gate-level modeling, yet it still can be synthesized into hardware because the semantics of a D flip-flop are clear and consistent with hardware implementation. We will continue using Verilog to explore some simple building blocks of digital systems.

### Setting Up IVerilog and GTKWave for running the examples

We will use Verilog (through the Icarus Verilog simulator) to illustrate the examples that follow; even as the building blocks get more complex, we will keep the examples deliberately simple.

> [!info]
> The steps below have been tested on Ubuntu 20.04. To compile Icarus Verilog from sources, you will need ```make```, ```autoconf```, ```gcc```, ```g++```, ```flex```, ```bison```.
Clone the git repo:

```Shell
$ git clone https://github.com/steveicarus/iverilog.git
```

Change to the directory:

```Shell
$ cd iverilog
```

We run the configuration script:

```Shell
$ sh autoconf.sh
```

If this gives ```autoconf.sh: line 10: autoconf: command not found```, we do:

```Shell
$ sudo apt-get install autoconf
```

We run configure:

```Shell
$ ./configure # for default settings - installs to /usr/local/bin
$ ./configure --prefix=<your directory> # installs to the specific directory provided
```

Now we can compile the sources by running make:

```Shell
$ make
```

This command will require ```gcc```, ```g++```, ```bison``` and ```flex```. And finally, we install it:

```Shell
$ sudo make install
```

It is also highly recommended to install a waveform dumpfile viewer like GTKWave. On Ubuntu:

```Shell
$ sudo apt-get install gtkwave
```

(On macOS, MacPorts users can instead run ```sudo port -v install gtkwave```.)

Note: Windows binaries are available at <http://bleyer.org/icarus/>

For more details on installation, visit <https://iverilog.fandom.com/wiki/Installation_Guide>

### Simulating a D-type Flip-Flop in Verilog

Let's create some logic. We go again with our now-familiar D-type flip-flop, which stores the value present on the data line (D) when a clock edge is detected. It has a single data input along with a clock input and typically a reset. Here's an example of how you might describe a D-type flip-flop in Verilog (```D_FF.v```):

```Verilog
module d_flip_flop(
    input wire clk,    // Clock input
    input wire reset,  // Asynchronous reset input
    input wire d,      // Data input
    output reg q       // Q output
    // Output wire q_bar could be added if the complement is needed
);

// Set or reset the flip-flop based on reset and clock
always @(posedge clk or posedge reset) begin
    if (reset) begin
        // Asynchronously reset the Q output to 0 when reset is high
        q <= 1'b0;
    end else begin
        // On a rising edge of the clock, set Q to the value of D
        q <= d;
    end
end

endmodule
```

In this Verilog code:

- ```clk``` is the clock input that the flip-flop uses to time its operations.
- ```reset``` is an asynchronous reset that sets the output ```q``` to 0 regardless of the clock.
- ```d``` is the data input that the flip-flop is supposed to store on the rising edge of the clock.
- ```q``` is the output that holds the value of ```d``` that was present on the last rising edge of ```clk```.
- The ```always``` block triggers either on the positive edge of the clock or the positive edge of the reset signal. If ```reset``` is high, ```q``` is set to 0; otherwise, ```q``` follows ```d``` on the rising edge of the clock.

The D flip-flop described here is positive-edge-triggered because it responds to the rising edge (```posedge```) of the clock signal. The asynchronous reset (```posedge reset```) has priority over the clock, which means the output will be reset as soon as the reset goes high, regardless of the clock state.
A testbench for this code would look like this (```testbench.v```): ```Verilog `timescale 1ns / 1ps module d_flip_flop_tb; // Inputs reg clk_tb; reg reset_tb; reg d_tb; // Outputs wire q_tb; // Instantiate the Unit Under Test (UUT) d_flip_flop uut ( .clk(clk_tb), .reset(reset_tb), .d(d_tb), .q(q_tb) ); // Clock generation initial begin clk_tb = 0; forever #5 clk_tb = ~clk_tb; // Generate a clock with a period of 10ns end // Test sequence initial begin // Initialize Inputs reset_tb = 0; d_tb = 0; // Add VCD file generation for GTKWave $dumpfile("d_flip_flop_tb.vcd"); $dumpvars(0, d_flip_flop_tb); // Apply asynchronous reset #10 reset_tb = 1; #10 reset_tb = 0; // Apply stimulus to the D input #10 d_tb = 1; #10 d_tb = 0; #10 d_tb = 1; #10 d_tb = 0; // Finish the simulation $finish; end endmodule ``` And to run the example we shall invoke iverilog: ```Shell $ iverilog -o d_ff_tb.vvp D_FF.v testbench.v ``` And then we create the dump file: ```Shell $ vvp d_ff_tb.vvp ``` And finally, we invoke GTKWave to see the timing diagram of our D-flip-flop: ```Shell $ gtkwave d_flip_flop_tb.vcd ``` Which yields: ![D-Flip-Flop timing diagram](site/Resources/media/image201.png) > [!Figure] > _D-Flip-Flop timing diagram as implemented in Verilog_ > [!attention] > **A moment of reflection.** > > In this last example, we got again, as an output, a waveform or a plot, where we see the expected behavior of the D flip-flop, just like we did some paragraphs ago with SPICE. What's going on here? We seem to be doing extremely different things, only to arrive at the same results. What we are somehow now making very explicit is the workflow to create digital circuits, from semiconductors up. It might seem natural to start in Magic by drawing MOSFETs by hand, but in fact, MOSFETs are the end result of the work. When we design complex digital circuits, we start at high levels of abstraction and then we eventually synthesize the circuit into the physical fabric that will make NMOS and PMOS transistors to be hooked together to achieve the task. ## Registers Registers are, simply put, a convenient arrangement of flip-flops, with each flip-flop capable of storing a single bit of data. When you string together multiple flip-flops, each controlled by the same clock signal, you basically create a register that can store a multi-bit value. The most common flip-flop used in registers is the D (data) flip-flop, which captures the input value at the moment of a clock edge (typically the rising edge) and holds this value until the next clock edge. A 32-bit register, for instance, will have 32 D flip-flops, each holding one of the 32 bits that make up the word. In a register, all flip-flops share the same clock signal, ensuring that they all capture their inputs simultaneously. This synchronization allows the register to store a coherent multi-bit value from a data bus. Additionally, registers often have control signals for reset (to set all bits to 0), set (to set all bits to 1), or enable (to decide when the register should capture data from the input). Registers play a key role in processor architecture; they hold the operands for arithmetic logic units ([[Semiconductors#Arithmetic Logical Units (ALUs)|ALUs]]), store addresses for the memory unit, hold the results of operations, and keep the current instruction during processing. Without registers, a CPU core would not have a way to perform sequential operations or remember any interim values, making the execution of programs infeasible. 
### Verilog Code of a 32-bit register

You may notice that, as the complexity of the examples increases, the hierarchical nature of digital systems and the choice of abstraction level become design decisions in their own right. When modeling a register out of flip-flops, we wouldn't naturally use gate-level modeling; it would be too cumbersome to describe every gate and wire involved in even an 8-bit register, and it only gets worse with wider bit widths. As complexity increases, behavioral models make more and more sense, as they remain tractable for simulation and analysis purposes. This way, a 32-bit register is modeled practically the same way as the 1-bit D-type flip-flop we saw before. Let's see what the Verilog code for a 32-bit register looks like:

```Verilog
module latch_register_32bit (
    input wire clk,        // Clock signal
    input wire enable,     // Strobe/Enable signal
    input wire [31:0] d,   // 32-bit data input
    output reg [31:0] q    // 32-bit data output (registered)
);

// On every rising edge of the clock, if enable is high, the input is latched
always @(posedge clk) begin
    if (enable) begin
        q <= d;
    end
    // else, retain the current value (i.e., latch the value)
end

endmodule
```

And we now write a testbench for it:

```Verilog
`timescale 1ns / 1ps

module latch_register_32bit_tb;

// Inputs
reg clk;
reg enable;
reg [31:0] d;

// Outputs
wire [31:0] q;

// Instantiate the Unit Under Test (UUT)
latch_register_32bit uut (
    .clk(clk),
    .enable(enable),
    .d(d),
    .q(q)
);

// Clock generation
initial begin
    clk = 0;
    forever #5 clk = ~clk; // Toggle clock every 5ns
end

// Stimulus
initial begin
    // Initialize Inputs
    enable = 0;
    d = 0;

    $dumpfile("latch_register_32bit_tb.vcd");
    $dumpvars(0, latch_register_32bit_tb);

    // Wait 100 ns before applying stimulus
    #100;

    // Apply test vectors
    // Test 1: No latch when enable is 0
    d = 32'hA5A5A5A5;
    #10;
    enable = 1'b0;
    #10;
    d = 32'hFFFFFFFF; // Change data while enable is low
    #10;

    // Test 2: Latch when enable is 1
    enable = 1'b1;    // Enable should latch the data on next clock
    #10;
    d = 32'h12345678; // Change data after latching
    #10;
    enable = 1'b0;
    #10;

    // Test 3: Keep the latched data when enable is 0
    d = 32'h87654321; // Change data but it should not be latched
    #20;

    // Finish simulation
    $finish;
end

// Monitor changes
initial begin
    $monitor("At time %t, enable = %b, d = %h, q = %h", $time, enable, d, q);
end

endmodule
```

The waveforms look like this:

![32-bit register](site/Resources/media/image204.png)

As expected, the output of the register stays undefined (we didn't assign any default value to its output) until the enable signal is asserted and a rising edge of the clock arrives. Then, we rewrite the data input of the register with a different value and, while the enable signal stays high, the register outputs the new value at the next rising edge of the clock signal.

## Decoders

> [!Warning]
> This section is under #development

## Multiplexers and Demultiplexers

A multiplexer is a combinational[^54] logic circuit designed to switch between several input signals and transmit the chosen input to a single output line. Think of it as a railroad switch, governed by the select lines, that channels one of several trains (input signals) onto a single track (output line). The selection of the input is controlled by a set of binary selection lines, also known as select lines. The number of select lines determines the number of inputs the multiplexer can handle: with $n$ select lines, you can control $2^{n}$ inputs.
For example, a 2-to-1 multiplexer has one select line used to choose between two inputs, while an 8-to-1 multiplexer would require three select lines. Note that each input could be 1 or more bits wide. Multiplexers are foundational building blocks in digital systems: wherever multiple data streams must be routed or channeled through a single pipeline, a multiplexer does the switching, steered by digital selection commands.

![A 4-to-1 Multiplexer](site/Resources/media/image205.png)
> [!Figure]
> _A 4-to-1 Multiplexer_

Equivalently, a demultiplexer is a device that takes a single input and directs it to one of the multiple output lines. The decision of which output line is selected is made based on additional inputs called select lines. Essentially, it functions as the opposite of a multiplexer, which takes multiple inputs and selects one of them to pass through to a single output. In a demultiplexer, the number of possible outputs is a power of two, being determined by the number of select lines. At any one time, only one of the output lines is active, transmitting the input signal to the chosen output while all other outputs remain inactive. By pairing multiplexers with demultiplexers, digital systems can channel complex data paths across physical architectures, minimizing interconnects and making more efficient use of routing space.

![An n-bit width mux paired with a demux](site/Resources/media/image206.png)
> [!Figure]
> _An n-bit width mux paired with a demux_

### Verilog code for a 2:1, 8-bit multiplexer

An 8-bit multiplexer (MUX) selects one of the multiple input signals based on the select lines and forwards the selected input to the output. Below is an example of a 2-to-1 8-bit multiplexer written in Verilog. This multiplexer will select between two 8-bit inputs in0 and in1 based on the select line sel:

```Verilog
module mux2to1_8bit(
    input wire [7:0] in0, // Input 0
    input wire [7:0] in1, // Input 1
    input wire sel,       // Select line
    output wire [7:0] out // Output
);

// Generate the MUX logic for each bit of the output
genvar i;
generate
    for (i = 0; i < 8; i = i + 1) begin : mux_loop
        assign out[i] = sel ? in1[i] : in0[i];
    end
endgenerate

endmodule
```

In this code:

- ```in0``` and ```in1``` are the 8-bit input vectors to the multiplexer.
- ```sel``` is a single-bit input that determines which input vector is selected.
- ```out``` is the 8-bit output vector of the multiplexer.
- The ```assign``` statement within the generate block applies the multiplexer function to each bit of the output. The ternary conditional operator (? :) is used to select between ```in0``` and ```in1```. If ```sel``` is 1, ```in1``` is selected; if ```sel``` is 0, ```in0``` is selected.
- The ```genvar``` and ```generate``` construct is used to create repetitive hardware structures. In this case, it generates a set of multiplexers for each bit of the 8-bit input signals.
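A quick note on the design choice: the generate loop above is there mainly to illustrate the construct. Since the conditional operator works directly on whole vectors, the same multiplexer can also be written as a single continuous assignment, as in the sketch below (an alternative of ours, functionally equivalent to the module above):

```Verilog
module mux2to1_8bit_simple(
    input wire [7:0] in0, // Input 0
    input wire [7:0] in1, // Input 1
    input wire sel,       // Select line
    output wire [7:0] out // Output
);

// One vector-wide continuous assignment replaces the per-bit generate loop
assign out = sel ? in1 : in0;

endmodule
```

Both descriptions synthesize to the same logic; the per-bit version simply makes the bitwise structure explicit. The testbench that follows exercises the original ```mux2to1_8bit``` module.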
A testbench for this overly basic multiplexer looks like this:

```Verilog
`timescale 1ns / 1ps

module testbench;

// Inputs to the multiplexer
reg [7:0] in0;
reg [7:0] in1;
reg sel;

// Output from the multiplexer
wire [7:0] out;

// Instantiate the multiplexer (DUT - Device Under Test)
mux2to1_8bit mux(
    .in0(in0),
    .in1(in1),
    .sel(sel),
    .out(out)
);

// File for VCD output
initial begin
    // Dump all signals for the module and submodules
    $dumpfile("mux2to1_8bit.vcd");
    $dumpvars(0, testbench);

    // Initialize Inputs
    in0 = 8'h55; // Binary pattern: 01010101
    in1 = 8'hAA; // Binary pattern: 10101010
    sel = 0;

    // Wait 100 ns before applying the test cases
    #100;

    // Apply different test cases
    sel = 1; // Select input 1
    #100;    // Wait for 100 ns

    sel = 0; // Select input 0
    #100;    // Wait for 100 ns

    sel = 1; // Select input 1 again
    #100;    // Wait for 100 ns

    // Finish the simulation
    $finish;
end

endmodule
```

The resulting waveforms are shown below. As expected, when the select line is in a low state, the output shows the contents of input 0; when the select line goes high, the output shows the value of input 1, exactly as a multiplexer should behave.

![MUX waveforms](site/Resources/media/image207.png)

Note that you could use two 4:1 multiplexers and a 2:1 mux to build an 8:1 mux.

### Verilog code for an 8-bit, 2:1 MUX/DEMUX pairing

As for the multiplexer, let's reuse the one from the previous section. Now let's define an 8-bit 2:1 demultiplexer. A demultiplexer takes one 8-bit input and a select line, and it directs the input to one of the two 8-bit outputs based on the select line.

```Verilog
module demux2to1_8bit(
    input [7:0] in,
    input sel,
    output [7:0] out_a,
    output [7:0] out_b
);

assign out_a = sel ? 8'b0 : in; // Output A gets the input if sel is 0
assign out_b = sel ? in : 8'b0; // Output B gets the input if sel is 1

endmodule
```

To pair these two modules, one would simply connect the output of the multiplexer to the input of the demultiplexer and use the same select line for both. In a practical scenario, these two would likely be part of a larger design, since using them back-to-back in this manner wouldn't be very useful (it effectively just passes the signal through when the select lines are in agreement). Let's create a testbench to simulate this configuration, doing something like the following:

```Verilog
module testbench;

reg [7:0] input_a, input_b;
reg sel;
wire [7:0] mux_out, demux_out_a, demux_out_b;

// Instantiate the multiplexer
mux2to1_8bit mux(
    .in0(input_a),
    .in1(input_b),
    .sel(sel),
    .out(mux_out)
);

// Instantiate the demultiplexer
demux2to1_8bit demux(
    .in(mux_out),
    .sel(sel),
    .out_a(demux_out_a),
    .out_b(demux_out_b)
);

initial begin
    // Dump waveforms for GTKWave
    $dumpfile("muxdemux_tb.vcd");
    $dumpvars(0, testbench);

    // Initialize inputs
    input_a = 8'hAA; // 10101010 in binary
    input_b = 8'h55; // 01010101 in binary
    sel = 0;

    #10;     // Wait for 10 time units
    sel = 1; // Change select line
    #10;     // Wait for 10 more time units
    sel = 0; // Switch back select line
    #10;

    $finish; // End simulation
end

initial begin
    $monitor("Time = %d : sel = %b : input_a = %h : input_b = %h : mux_out = %h : demux_out_a = %h : demux_out_b = %h",
             $time, sel, input_a, input_b, mux_out, demux_out_a, demux_out_b);
end

endmodule
```

And GTKWave gives:

![Waveforms of the 8-bit 2:1 MUXDEMUX pairing](site/Resources/media/image208.png)

## Look-Up Tables (LUTs)

When we discussed registers, we commented that gate-level modeling started to become cumbersome because complexity was growing beyond manageable levels, and that a behavioral model was a better fit.
So, there is an undeniable relationship between the complexity of logic operations and the associated complexity of the circuitry needed to implement it. In short: the more complex the logic operation, the higher the number of gates needed to synthesize it. Look-up tables (LUTs) come to break a bit that relationship. A logic look-up table (LUT) is a fundamental building block used in digital systems, to implement combinational logic. It operates on a simple principle: for given input signals, the output is determined by a stored value in a table, much like a dictionary lookup. Each combination of input values is associated with a corresponding output value, stored in the LUT's memory. The LUT is a form of memory that contains the output for every possible input configuration. When the LUT receives an input, it uses this input as an address to access a memory location and immediately retrieves the precomputed output. This ability to provide instant outputs for inputs makes LUTs extremely efficient for implementing complex logic functions, as the time to compute the output does not increase with the complexity of the function. But of course, there's a caveat in this, another side of the story: simple operations will take as long as complex operations (a logic inverter would take the same time as a 4-input NAND). In some configurable devices that we will discuss soon, you may have arrays of LUTs that can be interconnected and configured to perform a wide array of logical operations, from simple AND, OR gates to more complex functions like decoders or mathematical operations. Because the LUT is essentially a programmable piece of memory, it offers great flexibility in digital circuit design. Here's a Verilog example that demonstrates a 4-to-1 LUT: ```Verilog module lut4to1 ( input wire [3:0] data, // 4-bit input data input wire [1:0] sel, // 2-bit select input output reg out // Output of the LUT ); always @(*) begin // Decide output based on select lines case(sel) 2'b00: out = data[0]; // If sel is 00, output is data bit 0 2'b01: out = data[1]; // If sel is 01, output is data bit 1 2'b10: out = data[2]; // If sel is 10, output is data bit 2 2'b11: out = data[3]; // If sel is 11, output is data bit 3 default: out = 1'b0; // Default case endcase end endmodule ``` In this module, the data input represents the 4-bit input from which the LUT will output one bit based on the 2-bit ```sel``` select input. The ```always @(*)``` block ensures combinational logic, where the output changes immediately with any change in the inputs. The case statement selects which bit of the data input to output based on the ```sel``` input. 
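To make the idea of "filling the LUT" concrete, here is a small, hypothetical wrapper that hard-wires the data inputs of the ```lut4to1``` module above to ```4'b0110```, turning it into a 2-input XOR of the select lines:

```Verilog
module xor_from_lut (
    input  wire a,
    input  wire b,
    output wire y
);

// 4'b0110 is the XOR truth table, addressed by {b, a}:
// sel=00 -> 0, sel=01 -> 1, sel=10 -> 1, sel=11 -> 0
lut4to1 u_lut (
    .data(4'b0110),
    .sel ({b, a}),
    .out (y)
);

endmodule
```

Loading a different 4-bit pattern (4'b1000 for AND, 4'b1110 for OR, and so on) changes the implemented function without touching the structure — which is precisely how LUT-based configurable logic works.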
A testbench for this LUT goes like this:

```Verilog
`timescale 1ns / 1ps

module tb_lut4to1;

// Testbench signals
reg [3:0] data_tb;
reg [1:0] sel_tb;
wire out_tb;

// Instantiate the LUT module
lut4to1 uut (
    .data(data_tb),
    .sel(sel_tb),
    .out(out_tb)
);

// File for the waveform
initial begin
    $dumpfile("tb_lut4to1.vcd");
    $dumpvars(0, tb_lut4to1);
end

// Test cases
initial begin
    // Initialize testbench signals
    data_tb = 4'b1011; // Initialize data
    sel_tb = 2'b00;    // Select line 0
    #10;               // Wait for 10 time units

    sel_tb = 2'b01;    // Select line 1
    #10;               // Wait for 10 time units

    sel_tb = 2'b10;    // Select line 2
    #10;               // Wait for 10 time units

    sel_tb = 2'b11;    // Select line 3
    #10;               // Wait for 10 time units

    $finish;           // End the simulation
end

endmodule
```

And the GTKWave output shows the modeled behavior:

![4-to-1 LUT behavior](site/Resources/media/image209.png)
> [!Figure]
> _4-to-1 LUT behavior_

The 4-to-1 LUT of the example above can implement any 2-input, 1-output function: its four data bits act as the truth table addressed by the two select inputs, so it is just a matter of filling the LUT with the right configuration to achieve the logic function we need. Note that LUTs and multiplexers get along quite well. A 4-to-1 Lookup Table (LUT) can be implemented with multiplexers. A 4-to-1 multiplexer selects one of the four inputs to pass to the output, based on two selection lines. The LUT can be represented as a multiplexer where the inputs are the possible values that the LUT can output, and the select lines are the inputs to the LUT. Since a 4-to-1 multiplexer can itself be constructed using 2-to-1 multiplexers, this is how you can represent this in Verilog with instantiated 2-to-1 multiplexers:

```Verilog
module mux2to1(
    input a,
    input b,
    input sel,
    output out
);

assign out = sel ? b : a;

endmodule

module lut4to1_using_mux(
    input [1:0] select,
    output out
);

wire lower_mux_out;
wire upper_mux_out;

// Instantiating the first level of 2-to-1 MUXes
mux2to1 lower_mux(
    .a(1'b0), // Value for LUT input 00
    .b(1'b1), // Value for LUT input 01
    .sel(select[0]),
    .out(lower_mux_out)
);

mux2to1 upper_mux(
    .a(1'b0), // Value for LUT input 10
    .b(1'b1), // Value for LUT input 11
    .sel(select[0]),
    .out(upper_mux_out)
);

// Instantiating the second level of 2-to-1 MUX to select between the outputs of the first level
mux2to1 final_mux(
    .a(lower_mux_out),
    .b(upper_mux_out),
    .sel(select[1]),
    .out(out)
);

endmodule
```

In the code above, ```mux2to1``` is a simple 2-to-1 multiplexer module that selects between two inputs, ```a``` and ```b```, based on a selection line ```sel```. The ```lut4to1_using_mux``` module then instantiates two of these for the first stage, to select between 00 and 01, and 10 and 11, respectively, and one more to select between the results of the first stage based on the higher bit of the ```select``` input. In real applications, we would probably not hardcode the LUT values in the code but instead use parameters or an external configuration method to set them.

Conversely, multiplexers can also be implemented with LUTs. To behave as a 4-to-1 mux, a LUT needs 6 inputs:

- 4 data inputs
- 2 control inputs to select between the 4 data inputs.

Thus, a 6-input LUT can be used to build a 4:1 mux.

## Arithmetic Logical Units (ALUs)

An Arithmetic Logical Unit (ALU) is a fundamental component of a CPU core that is responsible for carrying out arithmetic and logical operations. It performs basic arithmetic operations such as addition, subtraction, multiplication, and division, as well as logical operations including AND, OR, XOR, and NOT between operands.
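Before tackling the floating-point case discussed next, a minimal integer ALU makes the concept concrete. The behavioral sketch below (the opcode encoding, port names, and width are our own choices) covers a handful of the operations just listed:

```Verilog
module int_alu_8bit (
    input  wire [7:0] a,        // First operand
    input  wire [7:0] b,        // Second operand
    input  wire [2:0] op,       // Operation select (encoding chosen arbitrarily here)
    output reg  [7:0] result,   // Operation result
    output wire       zero      // Flag: result is all zeros
);

always @(*) begin
    case (op)
        3'b000: result = a + b;  // Addition
        3'b001: result = a - b;  // Subtraction
        3'b010: result = a & b;  // Bitwise AND
        3'b011: result = a | b;  // Bitwise OR
        3'b100: result = a ^ b;  // Bitwise XOR
        3'b101: result = ~a;     // Bitwise NOT of the first operand
        default: result = 8'h00;
    endcase
end

assign zero = (result == 8'h00);

endmodule
```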
The ALU receives data from the processor's registers, processes it according to the operation specified by the instruction set, and then returns the result back to the registers or memory. This processing is critical for the execution of software applications and the overall operation of the computer. The efficiency and capabilities of an ALU directly impact the performance of a CPU, influencing both the speed and accuracy of computations.

Implementing a full-fledged floating-point Arithmetic Logic Unit (ALU) is a complex undertaking, as it involves handling various aspects of floating-point arithmetic like normalization, rounding, and exception handling. The IEEE-754 standard is commonly used for floating-point representation and operations. Here we will write a naïve floating-point ALU just for the sake of seeing a behavioral depiction of this building block that is present in CPU cores.

![](ALU.png)
> [!Figure]
> Arithmetic Logic Unit symbol, with typical inputs and outputs

Below is the simplified version of the floating-point ALU in Verilog that can perform basic operations like addition, subtraction, multiplication, and division on single-precision floating-point numbers.

```Verilog
module floating_point_alu(
    input [31:0] a, b,                      // Input operands
    input [1:0] op,                         // Operation selector: 00-add, 01-subtract, 10-multiply, 11-divide
    output reg [31:0] result,               // Result
    output reg overflow, underflow, invalid // Flags
);

// Extracting fields from IEEE 754 representation
wire signed [7:0] a_exponent, b_exponent;
wire [23:0] a_mantissa, b_mantissa;
wire a_sign, b_sign;

assign a_sign = a[31];
assign a_exponent = a[30:23] - 127;   // Exponent biased by 127
assign a_mantissa = {1'b1, a[22:0]};  // Implicit leading 1 (24 bits in total)

assign b_sign = b[31];
assign b_exponent = b[30:23] - 127;   // Exponent biased by 127
assign b_mantissa = {1'b1, b[22:0]};  // Implicit leading 1 (24 bits in total)

always @(*) begin
    overflow = 0;
    underflow = 0;
    invalid = 0;
    case(op)
        2'b00: begin // Addition
            // Simplified addition logic
        end
        2'b01: begin // Subtraction
            // Simplified subtraction logic
        end
        2'b10: begin // Multiplication
            // Simplified multiplication logic
        end
        2'b11: begin // Division
            // Simplified division logic
        end
        default: begin
            invalid = 1;
        end
    endcase
end

// Additional functions and procedures for normalization, rounding, etc. go here

endmodule
```

This code provides the structure for a floating-point ALU but does not include the full implementation of each operation, which can be quite extensive. The actual implementation of each arithmetic operation should handle cases like normalization (shifting the mantissa to the correct position while adjusting the exponent), rounding, and dealing with special cases like NaN (Not a Number), infinities, zeros, and denormalized numbers. Implementing a fully compliant IEEE 754 floating-point ALU requires a detailed understanding of the standard, including handling various rounding modes, exceptional conditions, and precision requirements.

Creating a testbench for the simplified floating-point ALU outlined previously involves stimulating the ALU with various floating-point operations and monitoring the results. To generate waveforms for viewing in GTKWave, you should dump the simulation data to a VCD (Value Change Dump) file. Below is a basic example of how to write a testbench in Verilog for the floating-point ALU module. Remember, the ALU implementation was simplified and did not cover full IEEE 754 compliance. The testbench reflects this simplicity.
```Verilog
`timescale 1ns / 1ps

module floating_point_alu_tb;

// Inputs
reg [31:0] a;
reg [31:0] b;
reg [1:0] op;

// Outputs
wire [31:0] result;
wire overflow, underflow, invalid;

// Instantiate the Unit Under Test (UUT)
floating_point_alu uut (
    .a(a),
    .b(b),
    .op(op),
    .result(result),
    .overflow(overflow),
    .underflow(underflow),
    .invalid(invalid)
);

initial begin
    // Initialize Inputs
    a = 0;
    b = 0;
    op = 0;

    // Initialize VCD dump
    $dumpfile("floating_point_alu.vcd");
    $dumpvars(0, floating_point_alu_tb);

    // Stimulus
    // Example: Add two numbers
    a = 32'h3F800000; // 1.0 in IEEE 754
    b = 32'h40000000; // 2.0 in IEEE 754
    op = 2'b00;       // Addition operation
    #10;

    // Example: Subtract two numbers
    // Provide different stimulus for subtraction, multiplication, and division
    // ...

    // Complete the simulation
    #100;
    $finish;
end

endmodule
```

This testbench initializes the inputs, sets up the VCD file for waveform dump, provides stimulus to the ALU, and runs the simulation for a certain duration. You need to expand the stimulus section with different test vectors to cover various scenarios like different operations, edge cases (like overflow or underflow), and any other specific conditions.

## Memory

### Volatile

There is a thick body of knowledge on memories, and the idea of this section is by no means to give a thorough lecture on them but only to describe their most salient features. Needless to say, memory devices are fundamental components in the architecture of modern digital systems, serving as repositories for data that can be rapidly accessed and manipulated. Among the various types of volatile memory, DRAM (Dynamic RAM) and SRAM (Static RAM) are particularly relevant for their distinctive characteristics and applications.

SDRAM, a type of DRAM, operates in sync with a system clock. This synchronization allows for the precise timing of data flow, making SDRAM much faster than its asynchronous predecessors. SDRAM stores bits of data in cells made up of a capacitor and a transistor, leveraging the dynamic nature of its memory storage; the word "dynamic" refers to the periodic refresh required to maintain the data within the cells (more about this in the next section). These refresh cycles are necessary because the capacitors gradually leak charge, and without a refresh, the information would be lost. In computing systems, SDRAM is commonly used for main system memory due to its speed and cost-effectiveness. Its DDR variants, described below, achieve high bandwidth by transferring data on both the rising and falling edges of the clock signal, which makes them ideal for complex tasks that require quick access to large volumes of data. A great source on SDRAM basics can be found [here](https://www.systemverilog.io/design/ddr4-basics/).

### DRAM

In principle, DRAM is simple. It comprises an array of memory cells laid out in a grid, each storing one bit of information. All modern DRAM uses a 1T1C cell (see figure below), denoting 1 transistor and 1 capacitor. The transistor controls access to the cell, and the capacitor stores the information in the form of a small electrical charge. Wordlines (WL) connect all cells in a single row; they control the access transistor for each cell. Bitlines (BL) connect all cells in a single column; they connect to the source of the access transistor. When a word line is energized, the access transistors for all cells in the row open and allow current flow from the bit line into the cell (when writing to the cell) or from the cell to the BL (when reading from the cell).
Only 1 word line and 1 bit line will be active at once, meaning only the 1 cell where the active word- and bitlines intersect will be written or read. ![](SDRAM.png) > [!Figure] > _DRAM cell_ DRAM is a volatile memory technology: the storage capacitors leak charge, and thus require frequent refreshes (as often as every ~32 milliseconds) to maintain stored data. Each refresh reads the contents of a cell, boosts the voltage on the bit line to an ideal level, and lets that refreshed value flow back into the capacitor. Refreshes happen entirely inside the DRAM chip, with no data flowing in or out of the chip. This minimizes wasted power, but refreshes can still come to 10%+ of the total DRAM power draw. Capacitors, much like transistors, have been shrunk to nanoscopic width but also with extreme aspect ratios ~1,000nm high but only 10s of nm in diameter – aspect ratios are approaching 100:1, with capacitance on the order of 6-7 fF (femto-Farad). Each capacitor stores an extremely small charge, about 40,000 electrons when freshly written. The cell must get electrons in and out via the bitline, but voltage put onto the bitline is diluted by all the other cells attached to the same bitline. Total bitline capacitance may total more than 30fF – a 5x dilution. The bitline is also very thin which slows the electrons. Finally, the cell may have drained significantly if it has not been refreshed recently, so has only a fraction of charge to deliver. All these factors mean that discharging a cell to read its value can result in a very weak signal which must be amplified. To this end, sense amplifiers (SA) are attached at the end of each bitline to detect the extremely small charges read from the memory cells and amplify the signal to a useful strength. These stronger signals can then be read elsewhere in the system as a binary 1 or 0. The sense amplifier has a clever circuit design: it compares the active bitline to a matching neighbor which is not in use, starting with both lines brought to a similar voltage. The voltage on the active bitline will be compared to the inactive neighbor, shifting the sense amp off balance and causing it to amplify the difference back into that active bitline, both amplifying the signal and driving a fresh full value, high or low, back into the cell which remains open to the bitline. It’s a 2 birds, 1 stone situation: the cell is read and refreshed at the same time. After reading/refreshing the active cell, the value can either be copied out of the chip or overwritten by a write operation. A write ignores the refreshed value and uses a stronger signal to force the bitline to match the new value. When the read or write is finished the wordlines are disabled, shutting off the access transistors and thus trapping any resident charges in the storage capacitors. ![](https://youtu.be/7J7X7aZvMXQ?t=730) Modern DRAM is made possible by two separate and complementary inventions: the 1T1C memory cell, and the sense amplifier. The 1T1C cell was invented in 1967 at IBM by Dr. Robert Dennard, also well known for his eponymous MOS transistor scaling law. Both DRAM and the scaling are based on MOS transistors (metal oxide silicon, the layers in the transistor gate). Despite the invention of the 1T1C memory cell structure, early DRAM shipped by Intel in 1973 used 3 transistors per cell with the gate on the middle transistor acting as a storage capacitor. 
This was a “gain cell” where the middle and final transistor provided gain to amplify the very small charge on the middle gate, enabling the cell to be read easily and without disturbing the stored value. A 1T1C cell is better in theory: fewer devices, simpler to wire together, and smaller. Why was it not immediately adopted? It was not yet practical to read the cell. At the time of invention, the small capacitance of the 1T1C cell made it infeasible to operate. A second key invention was needed: the sense amplifier. The first modern sense amplifier was developed in 1971 by Karl Stein at Siemens, presented at a conference in California, and completely overlooked. The 1T1C architecture was not widely adopted at that point and Siemens had no idea what to do with this invention. Stein was moved to another assignment where he had a successful career unrelated to DRAM. This design was well matched to the spacing of the bit lines and has been able to scale smaller to keep pace with cell size. The sense amp is completely powered down when not in use which allows there to be millions of them on a chip without draining power. They have been a small miracle. It took more than 5 years for the sense amplifier’s time to come. Robert Proebsting at Mostek independently (re)discovered the concept and by 1977 their 16kb DRAM with 1T1C + SA architecture became the market leader. This winning formula stuck; DRAM architecture is fundamentally the same nearly 5 decades later. #### DDR SDRAM SDRAM DDR memories represent a significant advancement in memory technology. DDR stands for Double Data Rate, and it's an innovative kind of SDRAM that essentially doubles the rate at which it can transfer data compared to its predecessor. Unlike conventional SDRAM, which only transfers data on the rising edge of the clock signal, DDR memory transfers data on both the rising and falling edges of the clock signal, effectively doubling the data rate without increasing the clock frequency. The first generation of DDR SDRAM, known as DDR1, substantially increased the memory bandwidth without a significant cost increase. As the technology evolved, subsequent generations like DDR2, DDR3, and DDR4 introduced further improvements, including lower power consumption, larger storage capacities, and even faster transfer rates. ![](SDRAM_org.png) > [!Figure] > _SDRAM chip internal structure (source: Micron)_ Each generation of DDR memory has managed to double the transfer rate of the previous generation while operating at a lower voltage. This reduction in voltage contributes to less power consumption and less heat generation, which is particularly important as systems become more compact and powerful. With the progression from DDR1 through DDR4, and now moving towards DDR5, the preeminent focus has been on achieving higher bandwidth and energy efficiency. There is, though, no free lunch. And with higher and higher data rates come the obligatory [[Physical Layer#Signal Integrity|signal integrity]] issues we described before. On the other hand, SRAM, or Static Random-Access Memory, is called "static" because, unlike SDRAM, it does not need to be periodically refreshed to maintain its data as long as power is supplied. SRAM cells are made up of several transistors and typically do not contain capacitors, which results in faster access times compared to SDRAM. However, this design also means that SRAM consumes more power and is more expensive to manufacture, leading to its use in smaller cache memories where speed is paramount. 
SRAM retains data as long as it is powered on, making it an attractive choice for cache memory of [[Semiconductors#CPU Cores|CPU cores]], where quick retrieval of frequently accessed data is important for performance.

#### SRAM cell

The following schematic depicts the most widely used implementation of a static RAM cell.

![](SRAM_MOS.png)
> [!Figure]
> _SRAM memory cell implemented with MOS transistors_

The working principle of the SRAM memory cell is easier to understand if the transistors M1 through M4 in the schematic above are drawn as logic gates. That way it is clear that, at its heart, the cell storage is built by using two cross-coupled inverters.

![](SRAM_2.png)
> [!Figure]
> _SRAM memory cell with cross-coupled inverters_

This simple loop creates a bi-stable circuit. A logic 1 at the input of the first inverter turns into a 0 at its output, and it is fed into the second inverter, which transforms that logic 0 back to a logic 1, feeding the same value back to the input of the first inverter. That creates a stable state that does not change over time. Similarly, the other stable state of the circuit is to have a logic 0 at the input of the first inverter: after being inverted twice, it will also feed back the same value.

> [!attention]
> It should be no surprise that the SRAM cell's capability of data retention (as long as power is applied) originates from the use of feedback. In this case, it's the mutual feedback between the two inverters depicted in the figure above.

To read the contents of the memory cell stored in the loop, the transistors $M_{5}$ and $M_6$ must be turned on. When they receive voltage at their gates from the word line ($WL$), they become conductive and so the $Q$ and $\overline Q$ values get transmitted to the bit lines $BL$ and $\overline{BL}$. Finally, these values get amplified at the end of the bit lines (amplification circuitry not shown). The writing process is similar; the difference is that now the new value to be stored in the memory cell is driven into the bit line ($BL$) and the inverted one into its complement ($\overline{BL}$). Next, transistors $M_5$ and $M_6$ are opened by driving a logic 1 (voltage high) onto the word line ($WL$). This effectively connects the bit lines to the bi-stable inverter loop. There are two possible cases:

1. If the value in the loop is the same as the new desired value, there is no change;
2. if the value in the loop is different from the new value, there are two conflicting values. For the voltage on the bit lines to overwrite the output of the inverters, the size of the $M_5$ and $M_6$ transistors must be larger than that of the $M_1$-$M_4$ transistors. This allows more current to flow through the former and therefore tips the voltage in the direction of the new value; at some point the loop then amplifies this intermediate value to full rail.

#### Simulation and Layout of a 6T SRAM cell memory (on OSU180nm PDK)

##### Block Diagram and Control Signals

The cell design to implement is depicted in the figure below.

[![](https://github.com/Anushar123/vsdsram/raw/master/Circuit-Inv/BlockSram.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Circuit-Inv/BlockSram.PNG)
> [!Figure]
> _SRAM cell, block diagram_

##### Read/Write Analysis of SRAM

A schematic diagram of the 6T cell is depicted below. A 6T SRAM cell pairs two access transistors, used for read and write access, with cross-coupled inverters that hold and regenerate the stored state.
During a write operation — say, writing Q=0 while Q is initially 1 — the bitline ($BL$) is driven to logic 0 and BLbar ($\overline{BL}$) to its complement, logic 1, while the word line $WL$ is held high. Once the voltage at node Q is pulled down to the threshold at which PMOS $M_{5}$ turns on, the voltage at node Qbar starts to rise and the regenerative action of the cross-coupled inverters forces Q to 0.

During a read operation — say, reading Q=vdd (logic 1) — the word line $WL$ is again held high. If the voltage at node Qbar rises up to the threshold at which NMOS M3 turns on, the voltage at node Q starts to fall and the cross-coupled inverter pair can flip the bit stored in the cell; this read-disturb scenario is what constrains the sizing of the cell transistors.

[![](https://github.com/Anushar123/vsdsram/raw/master/Circuit-Inv/Sram.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Circuit-Inv/Sram.PNG)
> [!Figure]
> SRAM 6T cell schematic diagram

##### Static Noise Margin (SNM) Analysis

Static noise margin (SNM) is the standard metric for analyzing the stability of SRAM cells. Graphically, it corresponds to the largest square that can be fitted between the voltage transfer characteristics of the two cross-coupled inverters (the "butterfly" curves shown below). It quantifies the maximum DC noise the cell can tolerate at its internal nodes while still retaining the stored information.

[![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/SNM-Butterfly.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/SNM-Butterfly.PNG)

##### Cell Physical Layout in Magic

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/Sram.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/Sram.PNG)

The dimensions of the cell layout from Magic are:

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/Sram-Width,height.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/Sram-Width,height.PNG)

SRAM 6T cells require precharge circuitry to ensure reliable operation. The precharge circuitry brings the bitlines (BL and BLbar) to a defined voltage level (usually Vdd/2) before a read or write operation, so that they are stable and ready to sense or drive data. Precharging also reduces noise susceptibility: noise can distort the signals on the bitlines and lead to read/write errors, and precharged bitlines offer a better noise margin, making the SRAM cell more robust against external disturbances. Additionally, precharging helps achieve faster access times, because the bitlines do not have to overcome a large voltage difference before sensing or driving the data. Finally, while the precharge circuits consume some power themselves, they reduce the overall power needed for sensing and driving compared to operating on unprecharged bitlines. The schematic below illustrates a precharging circuit.

The sense amplifier in a 6T SRAM cell serves the purpose of amplifying the small voltage difference between the true and complement bitlines (BL and BLbar) during a read operation.
Both bitlines start at the same precharged level; during a read operation, the selected SRAM cell's state (either '0' or '1') causes one of the bitlines to discharge more than the other due to the conductivity of the SRAM cell transistors, creating a small differential voltage. The sense amplifier detects this voltage difference and amplifies it to drive the bitlines to full logic levels, allowing the SRAM cell's data to be correctly sensed and read out.

[![](https://github.com/Anushar123/vsdsram/raw/master/Circuit-Inv/Precharge.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Circuit-Inv/Precharge.PNG)

> [!Figure]
> _Precharge circuit_

The layout of the precharge is shown below.

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/Precharge.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/Precharge.PNG)

The dimensions of the precharge circuitry are shown below.

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/Precharge-width,height.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/Precharge-width,height.PNG)

The sense amplifier circuitry is shown below.

[![](https://github.com/Anushar123/vsdsram/raw/master/Circuit-Inv/SenseAmplifier.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Circuit-Inv/SenseAmplifier.PNG)

The sense amplifier layout and dimensions are shown in the figure below.

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/senseamplifier.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/senseamplifier.PNG)

[![](https://github.com/Anushar123/vsdsram/raw/master/Layout/senseamplifier-width,height.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Layout/senseamplifier-width,height.PNG)

##### SPICE Extraction

```Spice
.include NMOS-180nm.lib
.include PMOS-180nm.lib
m1 q qbar vdd vdd CMOSP W=0.9u L=0.18u M=1
m3 gnd qbar q gnd CMOSN W=0.36u L=0.18u M=1
m5 vdd q qbar vdd CMOSP W=0.9u L=0.18u M=1
m4 qbar q gnd gnd CMOSN W=0.36u L=0.18u M=1
m6 blbar wl qbar gnd CMOSN W=0.36u L=0.18u M=1
m2 q wl bl gnd CMOSN W=0.36u L=0.18u M=1
* u1 blbar qbar q port
v3 bl gnd pulse(0v 1.8v 10ns 50ns 20ns 40us 80us)
m8 blbar bl vdd vdd CMOSP W=0.9u L=0.18u M=1
m7 blbar bl gnd gnd CMOSN W=0.36u L=0.18u M=1
v1 wl gnd pulse(1.8v 2.0v 5ns 50ns 3ns 10ns 80us)
c1 q gnd capacitor
v2 vdd gnd dc 1.8v
.tran 10e-09 100e-09 3e-09
.SUBCKT precharge
m1 vcc bl gnd vcc CMOSP W=0.9u L=0.18u M=1
m2 vcc bl gnd vcc CMOSP W=0.9u L=0.18u M=1
* u1 bl port
.ends
.SUBCKT senseamp
m2 net-_m1-pad1_ net-_m1-pad1_ vcc vcc CMOSP W=0.9u L=0.18u M=1
m4 vcc net-_m1-pad1_ net-_m4-pad3_ vcc CMOSP W=0.9u L=0.18u M=1
m7 vcc net-_m4-pad3_ net-_m6-pad1_ vcc CMOSP W=0.9u L=0.18u M=1
m9 vcc net-_m6-pad1_ net-_m8-pad1_ vcc CMOSP W=0.9u L=0.18u M=1
m8 net-_m8-pad1_ net-_m6-pad1_ gnd gnd CMOSN W=0.36u L=0.18u M=1
m6 net-_m6-pad1_ net-_m4-pad3_ gnd gnd CMOSN W=0.36u L=0.18u M=1
m5 net-_m1-pad3_ blbar net-_m4-pad3_ gnd CMOSN W=0.36u L=0.18u M=1
m1 net-_m1-pad1_ bl net-_m1-pad3_ gnd CMOSN W=0.36u L=0.18u M=1
m3 net-_m1-pad3_ vcc gnd gnd CMOSN W=0.36u L=0.18u M=1
* u1 bl blbar net-_m8-pad1_ port
.IC V(1)=0
.ends
* Control Statements
.control
run
print allv > plot_data_v.txt
print alli > plot_data_i.txt
.endc
.end
```

Which yields:

[![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/Q.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/Q.PNG)

ngspice 1 -> plot qbar
[![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/Qbar.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/Qbar.PNG) ngspice 1 -> plot q qbar [![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/Q-Qbar.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/Q-Qbar.PNG) ngspice 1 -> plot bl [![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/BL.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/BL.PNG) ngspice 1 -> plot blbar [![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/BLbar.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/BLbar.PNG) ngspice 1 -> plot bl blbar [![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/BL-Blbar.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/BL-Blbar.PNG) ngspice 1 -> plot bl q [![](https://github.com/Anushar123/vsdsram/raw/master/Waveforms/Ngspice/Spice/Butterfly.PNG)](https://github.com/Anushar123/vsdsram/blob/master/Waveforms/Ngspice/Spice/Butterfly.PNG) #### OpenRAM You don't need to lay out SRAM cells by hand. Tools exist to do this programmatically when you are designing an ASIC that needs SRAM memory, such as OpenRAM^[https://openram.org/]. OpenRAM is an open-source Python framework to create the layout, netlists, timing and power models, placement and routing models, and other views necessary to use SRAMs in ASIC design. OpenRAM supports integration in both commercial and open-source flows with both predictive and fabricable technologies. ##### Supported Technologies - NCSU FreePDK 45nm - Non-fabricable but contains DSM rules - Calibre or klayout for DRC/LVS - MOSIS 0.35um (SCN4M_SUBM) - Fabricable technology - Magic/Netgen or Calibre for DRC/LVS - [[Semiconductors#SKY130|Skywater 130nm]] (sky130) - Fabricable technology - Magic/Netgen or klayout ##### Implementation - Front-end mode - Generates SPICE, layout views, timing models - Netlist-only mode can skip the physical design too - Doesn't perform DRC/LVS - Estimates power/delay analytically - Back-end mode - Generates SPICE, layout views, timing models - Performs DRC/LVS - Can perform at each level of hierarchy or at the end - Simulates power/delay - Can be back-annotated or not ##### Technology and Tool Portability - OpenRAM is technology-independent by using a technology directory that includes: - Technology's specific information - Technology's rules such as DRC rules and the GDS layer map - Custom-designed library cells (6T, sense amp, DFF) to improve the SRAM density. - For technologies with specific design requirements, such as specialized well contacts, the user can include helper functions in the technology directory. - Verification wrapper scripts - Uses a wrapper interface with DRC and LVS tools that allow flexibility - DRC and LVS can be performed at all levels of the design hierarchy to enhance bug tracking. - DRC and LVS can be disabled completely for improved run-time or if licenses are not available. ##### Architecture ![](openram_architecture.png) ### Non-Volatile > [!warning] > This section is under #development ## Design of a Supervisor ASIC Here I will summarize the steps to design a basic ASIC using the Sky130 PDK and the Openlane flow. 
> [!warning]
> This section is under #development

# Instruction Set Architectures (ISAs)

Instruction Set Architectures, or ISAs, are specifications that define how a microprocessor behaves and, therefore, how it communicates with the software running on it. Essentially, an ISA serves as an abstract definition of the behavior of a processor, which it does by defining a set of instructions that the hardware must execute. These instructions encompass everything from basic arithmetic operations to complex memory and data management tasks using different addressing modes. When software, like an operating system, needs to perform a function, it does so by issuing a series of such instructions, which are then interpreted and executed by the microprocessor's core or cores. The ISA specifies how these instructions are formatted, what operations they can carry out, and how they interact with the processor's various components, such as registers, caches, and execution units.

Each instruction within an ISA is a simple operation that the processor can understand directly without needing any translation or interpretation. Think of it as a common language spoken between software and hardware, where both sides need to know this language for any effective communication to happen. Different types of ISAs exist, with various balances between complexity and efficiency. Some ISAs include a wide array of instructions capable of doing very intricate tasks in one go, while others focus on simplicity, with each instruction doing a smaller, more atomic operation, potentially leading to a greater number of instructions for the same task but executed with greater efficiency.

The design of an ISA affects not just how a processor will execute instructions but also influences the entire design of the processor itself, including its architecture and the microarchitecture of its cores. It's one of the key factors that dictate the processor's compatibility with software and its performance. As technology advances and computing needs evolve, ISAs continue to adapt, offering more capabilities and enabling more efficient processing to meet the demands of modern applications.

The ISA acts as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the instructions, execution model, processor registers, and address and data formats, among other things. In contrast, the [[Semiconductors#Microarchitecture|microarchitecture]] is in charge of implementing the constituent parts of the processor that will interconnect and interoperate to implement the ISA. The ISA is the dialect that the microarchitecture understands.

> [!note]
> Companies may license both the ISA (Instruction Set Architecture) and the hardware implementation designs for their processors. When it comes to ISA licensing, a company can design its own processor that implements a particular instruction set. This allows the licensee to create custom microarchitectures as long as they are compatible with the original ISA specification, which will ensure software compatibility. For example, Apple's A-series chips used in iPhones and iPads are based on the popular ARM ISA, but the chip designs are Apple's own.

## ISA Licensing

ISAs can be open or commercial. You may think: aren't ISAs just documents defining instruction bit fields and opcodes? Aren't they all _open_? You can just get the document, can't you? ARM, for instance, does provide comprehensive documentation for its instruction set architecture (ISA).
These documents are available to developers and the public and detail how ARM processors work, including the ARM instruction sets and other architectural details. However, there are important distinctions between making the ISA documentation available and the ISA itself being open for use. ARM documentation is publicly available for educational and development purposes. This allows software developers to write code for ARM-based processors and understand the ARM architecture. However, the right to use ARM's ISA in designing and implementing a processor is not free and requires a license agreement with ARM Holdings, the company that owns the architecture. This is where the distinction lies: ARM provides the knowledge of how to use their ISA, but implementing the ISA in silicon requires a commercial arrangement. ARM's business model is based on licensing its IP. Companies pay to license ARM's processor designs (cores) or the right to create their custom cores using the ARM ISA.

RISC-V does not impose such licensing for the use of its ISA. It operates under open-source licenses, allowing for free implementation and modification, provided the license terms are met. The ARM ISA is copyrighted, and using it in custom processor designs without a license could result in legal action for IP infringement. With RISC-V, such legal restrictions do not exist, as it is designed to be freely implemented, extended, and distributed.

Companies may also license their hardware implementations in the form of microprocessor core designs. Customer companies can license these designs and use them as-is, or with some customization, in their products. This means they are licensing the actual silicon-ready design and not just the ISA.

## Popular ISAs

Several Instruction Set Architectures (ISAs) are popular in the world of computing, each with its specific applications and strengths. Some of the most notable ISAs include:

- x86 and x86-64: Developed by Intel and used in most desktops, laptops, and servers. The x86-64 ISA (also known as AMD64) is the 64-bit version of the x86 ISA, which supports larger amounts of memory and is the standard for modern personal computers. ==Before the smartphone era, x86 was the dominant instruction set within general-purpose CPUs. Almost every PC and server was guaranteed to have an x86-based CPU as the software was written to be compatible with the x86 instruction set.== ==For a long time, most x86 CPUs were Intel CPUs. While AMD also had the IP rights to design x86-based CPUs, AMD was for a long time tied to its own fabs with inferior process technology to Intel’s, making it uncompetitive.==
- ARM: ARM's ISA is widely used in portable devices such as smartphones and tablets, but also increasingly in servers and desktops. The ARM architecture is known for its power efficiency, which is why it's the preferred choice for battery-powered devices.
- MIPS: Originally developed by MIPS Technologies, the MIPS ISA was historically popular in embedded systems and is known for its simple, clean design.
- PowerPC: Originally developed by the AIM alliance (Apple, IBM, Motorola), PowerPC was used in older Apple Macintosh computers and game consoles, and is still extensively used in embedded systems.
- [[Semiconductors#RISC-V|RISC-V]]: An open-source ISA that is gaining popularity for its modularity and extensibility. RISC-V allows any company or individual to design, manufacture, and sell RISC-V chips and software without the need to pay royalties.
- SPARC: Developed by Sun Microsystems, SPARC is a RISC architecture used in servers and workstations, particularly in high-performance computing environments, but also in space systems, like the LEON family of processors, the ERC32, and the NGMP[^55].
- Itanium (IA-64): Developed by Intel, Itanium's ISA is distinct from x86 and was designed for use in high-end server and supercomputing applications.
- AVR: Developed by Atmel (now part of Microchip Technology), AVR is a family of ISAs used in microcontrollers found in embedded systems.
- PIC: The ISA for a family of microcontrollers made by Microchip Technology, PIC is widely used in both education and embedded system applications.

Each of these ISAs serves different segments of the market, from high-end servers and workstations to embedded devices, and they have various architectural features that make them suitable for their target applications. Adopting a popular ISA has an undeniable advantage for core developers: tools, compilers, and operating systems will most likely work on a given implementation. Starting a new ISA would require all these toolchains to be grown from scratch. That's why an increasingly adopted open standard like RISC-V is so appealing.

## A Naïve example of an ISA

Let's now create a simple 32-bit ISA, for illustration purposes. We will start by defining how many registers there will be in our abstract architecture, which in a way defines the programmer's model of our simple ISA, and then expand on each instruction's behavior.

CPU Register File:

- The processor has a register file that contains 16 general-purpose registers (R0 to R15).
- Each register is 32 bits wide.
- There is a special-purpose Program Counter (PC) register that holds the address of the next instruction.
- There is a special-purpose Stack Pointer (SP) register (R14) for stack operations.
- There is a Flags or Status register to indicate zero, carry, overflow, and negative conditions after arithmetic operations.

Instructions: Each instruction is 32 bits wide, with the first 6 bits typically representing the opcode, which defines the operation to be performed. The remaining bits are divided into fields that specify the registers and immediate values or addresses involved.

- `LOAD` `Rn`, \[address\]
    - Opcode: 000001
    - Description: Load a word from the specified memory address into register `Rn`.
    - Format: \[opcode\]\[Rn\]\[unused\]\[address\]
    - The address is a direct address in memory.
    - Unused bits are set to 0.
- `STORE` `Rn`, \[address\]
    - Opcode: 000010
    - Description: Store the word in register `Rn` to the specified memory address.
    - Format: \[opcode\]\[Rn\]\[unused\]\[address\]
    - The address is a direct address in memory.
- `ADD` `Rn`, `Rm`, `Ro`
    - Opcode: 000011
    - Description: Add the contents of `Rm` to `Ro` and store the result in `Rn`.
    - Format: \[opcode\]\[Rn\]\[Rm\]\[Ro\]
    - Sets the Flags register accordingly.
- `SUB` `Rn`, `Rm`, `Ro`
    - Opcode: 000100
    - Description: Subtract `Ro` from `Rm` and store the result in `Rn`.
    - Format: \[opcode\]\[Rn\]\[Rm\]\[Ro\]
    - Sets the Flags register accordingly.

Of course, more instructions are needed for a minimally decent ISA.
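To make the bit-level layout concrete, below is a minimal C++ sketch, purely illustrative and belonging to no real ISA, showing how a word of this hypothetical architecture could be packed and unpacked. It assumes the opcode sits in the top six bits followed by three 4-bit register fields, which is consistent with the formats above and with the naive Verilog core shown later in this document.

```C++
#include <cstdint>
#include <iostream>

// Assumed field layout: 6-bit opcode in bits 31-26, then Rn in 25-22,
// Rm in 21-18, and Ro in 17-14, as in the register-register format above.
constexpr uint32_t OPCODE_ADD = 0b000011;

// Pack an ADD Rn, Rm, Ro instruction into a 32-bit word.
constexpr uint32_t encode_add(uint32_t rn, uint32_t rm, uint32_t ro) {
    return (OPCODE_ADD << 26) | (rn << 22) | (rm << 18) | (ro << 14);
}

int main() {
    uint32_t word = encode_add(2, 0, 1);  // ADD R2, R0, R1
    uint32_t opcode = word >> 26;         // top 6 bits
    uint32_t rn = (word >> 22) & 0xF;     // destination register
    uint32_t rm = (word >> 18) & 0xF;     // first source register
    uint32_t ro = (word >> 14) & 0xF;     // second source register
    std::cout << "opcode=" << opcode << " Rn=" << rn
              << " Rm=" << rm << " Ro=" << ro << "\n";
    return 0;
}
```

Real ISAs devote a great deal of care to how these fields are laid out, since the decoder hardware has to extract them every single cycle.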
For brevity, we will just list the rest of the opcodes without diving into many details:

- `MUL` `Rn`, `Rm`, `Ro` - Opcode: 000101
- `DIV` `Rn`, `Rm`, `Ro` - Opcode: 000110
- `AND` `Rn`, `Rm`, `Ro` - Opcode: 000111
- `OR` `Rn`, `Rm`, `Ro` - Opcode: 001000
- `XOR` `Rn`, `Rm`, `Ro` - Opcode: 001001
- `NOT` `Rn`, `Rm` - Opcode: 001010
- `JMP` \[address\] - Opcode: 001011
- `JEQ` `Rn`, `Rm`, \[address\] - Opcode: 001100
- `JNE` `Rn`, `Rm`, \[address\] - Opcode: 001101
- `JGT` `Rn`, `Rm`, \[address\] - Opcode: 001110
- `JLT` `Rn`, `Rm`, \[address\] - Opcode: 001111
- `INC` `Rn` - Opcode: 010000
- `DEC` `Rn` - Opcode: 010001
- `PUSH` `Rn` - Opcode: 010010
- `POP` `Rn` - Opcode: 010011
- `NOP` - Opcode: 010100

Operand Encoding:

- Registers `Rn`, `Rm`, and `Ro` are each represented with 4 bits (since there are 16 registers).
- Immediate values or addresses can vary in size depending on the instruction, but for direct memory access (like in `LOAD` or `STORE`), we might allocate 18 bits for the address field.
- For register-based operations (like `ADD` or `SUB`), there would be no immediate values and unused bits would be set to 0.
- For branch instructions with addresses, the opcode would be followed by an address field.

Flags/Status Register:

- The flags register would contain at least 4 bits to represent the Zero, Carry, Overflow, and Negative conditions.
- After an arithmetic operation like `ADD` or `SUB`, the flags would be set based on the result. For example, if the result is 0, the Zero flag would be set.

Control Flow:

- The `JMP` and conditional jump instructions modify the program counter (`PC`) and therefore modify the flow of execution. For conditional jumps, if the condition is met based on the flags and the register values, the `PC` would be set to the address specified; otherwise, it would move to the next sequential instruction.

Note that, despite the obvious naivety of the ISA we just defined, if this architecture were, for one reason or another, to become widely adopted, then anyone defining a core microarchitecture would be quite interested in making that microarchitecture compatible with this ISA. This way, software libraries, compilers, and toolchains would remain compatible, increasing the probability of the microarchitecture being adopted.

## Microcode

In processor design, microcode serves as an intermediary layer situated between the central processing unit (CPU) hardware and the programmer-visible instruction set architecture of a computer, also known as its machine code. Microcode consists of a set of hardware-level instructions that implement the higher-level machine code instructions or control internal finite-state machine sequencing in many digital processing components. Initially, CPU instruction sets were hardwired. Each step needed to fetch, decode, and execute the machine instructions (including any operand address calculations, reads, and writes) was controlled directly by combinational logic and rather minimal sequential state machine circuitry. While such hard-wired processors were very efficient, the need for powerful instruction sets with multi-step addressing and complex operations made them difficult to design and debug; highly encoded and variable-length instructions can contribute to this as well, especially when very irregular encodings are used. Housed in special high-speed memory, microcode translates machine instructions, state machine data, or other input into sequences of detailed circuit-level operations and signals.
It separates the machine instructions from the underlying electronics, thereby enabling greater flexibility in designing and altering instructions. Moreover, microcode facilitates the construction of complex multi-step instructions, while reducing the complexity of computer circuits. The act of writing microcode is often referred to as microprogramming, and the microcode in a specific processor implementation is sometimes termed a microprogram. ## RISC-V RISC-V (pronounced “risk-five”) is a relatively new instruction set architecture (ISA) that was originally designed to support computer architecture research and education but has slowly become a standard free and open architecture for industry implementations. As per the RISC-V Instruction Set Manual^[https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf] The goals in defining RISC-V include: - A completely open ISA that is freely available to academia and industry. - A real ISA that is suitable for direct native hardware implementation, not just simulation or binary translation. - An ISA that avoids “over-architecting” for a particular [[Semiconductors#Core Microarchitecture|microarchitecture]] style or implementation technology, but allows efficient implementation in any of these. - An ISA that is separated into a small base integer ISA, usable by itself as a base for customized accelerators or educational purposes, and optional standard extensions, to support general-purpose software development. - Support for the revised 2008 IEEE-754 floating-point standard. - An ISA supporting extensive user-level ISA extensions and specialized variants. - Both 32-bit and 64-bit address space variants for applications, operating system kernels, and hardware implementations. - An ISA with support for highly parallel multicore or manycore implementations, including heterogeneous multiprocessors. - Optional variable-length instructions to both expand available instruction encoding space and support an optional dense instruction encoding for improved performance, static code size, and energy efficiency. - A virtualizable ISA to ease [[Semiconductors#Hypervisors/Virtual Machine Monitors (VMM)|hypervisor]] development. - An ISA that simplifies experiments with new supervisor-level and hypervisor-level ISA designs. > [!note] > The name RISC-V was chosen to represent the fifth major RISC ISA design from UC Berkeley. RISC-V pursued a highly flexible and extensible base ISA around which to build research efforts. A question that the steering group has been repeatedly asked is “Why develop a new ISA?” The biggest obvious benefit of using an existing commercial ISA is the large and widely supported software ecosystem, both development tools and ported applications, which can be leveraged in research and teaching. Other benefits include the existence of large amounts of documentation and tutorial examples. However, the experience of using commercial instruction sets for research and education is that these benefits are smaller in practice, and do not outweigh the disadvantages: - Commercial ISAs are proprietary. Except for SPARC V8, which is an open IEEE standard, most owners of commercial ISAs carefully guard their intellectual property and do not welcome freely available competitive implementations. This is much less of an issue for academic research and teaching using only software simulators but has been a major concern for groups wishing to share actual RTL implementations. 
It is also a major concern for entities who do not want to trust the few sources of commercial ISA implementations, but who are prohibited from creating their own clean room implementations. We cannot guarantee that all RISC-V implementations will be free of third-party patent infringements, but we can guarantee we will not attempt to sue a RISC-V implementor. - Commercial ISAs are only popular in certain market domains. ==The most obvious examples at the time of writing are that the ARM architecture is not well supported in the server space, and the Intel x86 architecture (or for that matter, almost every other architecture) is not well supported in the mobile space, though both Intel and ARM are attempting to enter each other’s market segments==. Another example is ARC and Tensilica, which provide extensible cores but are focused on the embedded space. This market segmentation dilutes the benefit of supporting a particular commercial ISA as in practice the software ecosystem only exists for certain domains, and has to be built for others. - Commercial ISAs come and go. Previous research infrastructures have been built around commercial ISAs that are no longer popular (SPARC, MIPS) or even no longer in production. These lose the benefit of an active software ecosystem, and the lingering intellectual property issues around the ISA and supporting tools interfere with the ability of interested third parties to continue supporting the ISA. An open ISA might also lose popularity, but any interested party can continue using and developing the ecosystem. - Popular commercial ISAs are complex. The dominant commercial ISAs (x86 and ARM) are both very complex to implement in hardware to the level of supporting common software stacks and operating systems. Worse, nearly all the complexity is due to bad, or at least outdated, ISA design decisions rather than features that truly improve efficiency. - Commercial ISAs alone are not enough to bring up applications. Even if we expend the effort to implement a commercial ISA, this is not enough to run existing applications for that ISA. Most applications need a complete ABI (application binary interface, see callout below) to run, not just the user-level ISA. Most ABIs rely on libraries, which in turn rely on operating system support. To run an existing operating system requires implementing the supervisor-level ISA and device interfaces expected by the OS. These are usually much less well-specified and considerably more complex to implement than the user-level ISA. - Popular commercial ISAs were not designed for extensibility. The dominant commercial ISAs were not particularly designed for extensibility, and as a consequence have added considerable instruction encoding complexity as their instruction sets have grown. - A modified commercial ISA is a new ISA. One of RISC-V's main goals is to support architecture research, including major ISA extensions. Even small extensions diminish the benefit of using a standard ISA, as compilers have to be modified and applications rebuilt from source code to use the extension. Larger extensions that introduce new architectural state also require modifications to the operating system. Ultimately, the modified commercial ISA becomes a new ISA but carries along all the legacy baggage of the base ISA. > [!info] > An Application Binary Interface (ABI) is a set of conventions that dictate how different programs or components interact at the binary level. 
This encompasses the calling conventions, which determine how functions' arguments are passed and return values are received, the binary format for object files, and how static and dynamic libraries are used. Essentially, an ABI allows software and operating systems to communicate without the need for source or intermediate code, ensuring compatibility between different systems and software versions. It serves as a critical bridge between high-level source code and the low-level machine code that processors actually execute, enabling software compiled by different compilers to work together as long as they adhere to the same ABI. This ensures that applications function correctly on any given operating system and hardware configuration without needing to recompile the software specifically for each platform. ### RISC-V Overview The RISC-V ISA is defined as a base integer ISA, which must be present in any implementation, plus optional extensions to the base ISA. The base integer ISA is very similar to that of the early RISC processors except with no branch delay slots and with support for optional variable-length instruction encodings. The base ISA is carefully restricted to a minimal set of instructions sufficient to provide a reasonable target for compilers, assemblers, linkers, and operating systems (with additional supervisor-level operations), and so provides a convenient ISA and software toolchain “skeleton” around which more customized processor ISAs can be built. Each base integer instruction set is characterized by the width of the integer registers and the corresponding size of the user address space. There are two primary base integer variants, RV32I and RV64I, which provide 32-bit or 64-bit user-level address spaces respectively. Hardware implementations and operating systems might provide only one or both of RV32I and RV64I for user programs. The base integer instruction sets use a two’s-complement representation for signed integer values. RISC-V has been designed to support extensive customization and specialization. The base integer ISA can be extended with one or more optional instruction-set extensions, but the base integer instructions cannot be redefined. RISC-V instruction-set extensions are divided into standard and non-standard extensions. Standard extensions should be generally useful and should not conflict with other standard extensions. Non-standard extensions may be highly specialized or may conflict with other standard or non-standard extensions. Instruction-set extensions may provide slightly different functionality depending on the width of the base integer instruction set. To support more general software development, a set of standard extensions are defined to provide integer multiply/divide, atomic operations, and single and double-precision floating-point arithmetic. The base integer ISA is named “I” (prefixed by RV32 or RV64 depending on integer register width), and contains integer computational instructions, integer loads, integer stores, and control-flow instructions, and is mandatory for all RISC-V implementations. The standard integer multiplication and division extension is named “M”, and adds instructions to multiply and divide values held in the integer registers. The standard atomic instruction extension, denoted by “A”, adds instructions that atomically read, modify, and write memory for inter-processor synchronization. 
The standard single-precision floating-point extension, denoted by “F”, adds floating-point registers, single-precision computational instructions, and single-precision loads and stores. The standard double-precision floating-point extension, denoted by “D”, expands the floating-point registers, and adds double-precision computational instructions, loads, and stores. An integer base plus these four standard extensions (“IMAFD”) is given the abbreviation “G” and provides a general-purpose scalar instruction set. RV32G and RV64G are currently the default targets of RISC-V compiler toolchains. The RISC-V specification describes these and other planned standard extensions.

Beyond the base integer ISA and the standard extensions, a new instruction will rarely provide a significant benefit for all applications, although it may be very beneficial for a certain domain. As energy efficiency concerns are forcing greater specialization, the RISC-V designers believe it is important to simplify the required portion of an ISA specification. Whereas other architectures usually treat their ISA as a single entity, which changes to a new version as instructions are added over time, RISC-V will endeavor to keep the base and each standard extension constant over time, and instead layer new instructions as further optional extensions. For example, the base integer ISAs will continue as fully supported standalone ISAs, regardless of any subsequent extensions.

### Programmer's Model

#### Register File

The RISC-V architecture defines a set of registers that are used for various purposes, including general computations, controlling the execution flow, and managing system status. The core of the register file in a typical RISC-V implementation includes:

- General-Purpose Registers: RISC-V specifies 32 general-purpose registers, labeled x0 through x31, each 32 or 64 bits wide depending on the architecture variant (RV32I for 32 bits, RV64I for 64 bits). Register x0 is hardwired to zero, always reading 0, and ignores writes. The other registers (x1 through x31) are used for general computation and function arguments.
- Program Counter (PC): A special register that holds the address of the current instruction being executed. It is automatically updated after each instruction but can also be modified by jump and branch instructions.

The Application Binary Interface (ABI) defines how software applications interact with the operating system, including calling conventions, which specify how functions receive arguments and return results, and how system calls are invoked. In RISC-V, the ABI details include:

- Calling Convention: Specifies which registers are used to pass function arguments and return values. In RISC-V, the first few arguments of a function are usually passed in the registers a0-a7 (x10-x17), and return values are passed back in a0 and a1.
- Saved and Temporary Registers: The ABI defines certain registers as callee-saved (must be preserved across calls) and others as caller-saved (can be freely used by the callee, with the caller responsible for saving and restoring if needed).
- Stack Management: The ABI specifies how the stack should be used for function calls, including pushing and popping frame records and managing local variables.

A summary is listed below.

| Register | ABI | Use by convention | Preserved? |
| :------- | :--------- | :------------------------------------ | ---------- |
| x0 | zero | hardwired to 0, ignores writes | _n/a_ |
| x1 | ra | return address for jumps | no |
| x2 | sp | stack pointer | yes |
| x3 | gp | global pointer | _n/a_ |
| x4 | tp | thread pointer | _n/a_ |
| x5 | t0 | temporary register 0 | no |
| x6 | t1 | temporary register 1 | no |
| x7 | t2 | temporary register 2 | no |
| x8 | s0 _or_ fp | saved register 0 _or_ frame pointer | yes |
| x9 | s1 | saved register 1 | yes |
| x10 | a0 | return value _or_ function argument 0 | no |
| x11 | a1 | return value _or_ function argument 1 | no |
| x12 | a2 | function argument 2 | no |
| x13 | a3 | function argument 3 | no |
| x14 | a4 | function argument 4 | no |
| x15 | a5 | function argument 5 | no |
| x16 | a6 | function argument 6 | no |
| x17 | a7 | function argument 7 | no |
| x18 | s2 | saved register 2 | yes |
| x19 | s3 | saved register 3 | yes |
| x20 | s4 | saved register 4 | yes |
| x21 | s5 | saved register 5 | yes |
| x22 | s6 | saved register 6 | yes |
| x23 | s7 | saved register 7 | yes |
| x24 | s8 | saved register 8 | yes |
| x25 | s9 | saved register 9 | yes |
| x26 | s10 | saved register 10 | yes |
| x27 | s11 | saved register 11 | yes |
| x28 | t3 | temporary register 3 | no |
| x29 | t4 | temporary register 4 | no |
| x30 | t5 | temporary register 5 | no |
| x31 | t6 | temporary register 6 | no |
| pc | _(none)_ | program counter | _n/a_ |

As a general rule, the saved registers `s0` to `s11` are preserved across function calls, while the argument registers `a0` to `a7` and the temporary registers `t0` to `t6` are not.

#### Control and Status Registers

A control and status register (CSR) is a register that stores control and status information in the CPU. RISC-V defines a separate address space of 4096 CSRs. Only part of this address space is allocated by the specification, so implementers can add custom CSRs at unused addresses. Also, not all CSRs are required in all implementations.

#### Memory Model and Hardware Threads (Harts)

The base RISC-V ISA supports multiple concurrent threads of execution within a single user address space. Each RISC-V hardware thread, or _hart_, has its own user register state and program counter and executes an independent sequential instruction stream. The execution environment will define how RISC-V harts are created and managed. RISC-V harts can communicate and synchronize with other harts either via calls to the execution environment, which are documented separately in the specification for each execution environment, or directly via the shared memory system. RISC-V harts can also interact with I/O devices, and indirectly with each other, via loads and stores to portions of the address space assigned to I/O.

> [!Note]
> The RISC-V spec uses the term _hart_ to unambiguously and concisely describe a hardware thread as opposed to software-managed thread contexts.

In the base RISC-V ISA, each RISC-V hart observes its own memory operations as if they executed sequentially in program order. RISC-V has a relaxed memory model between harts, requiring an explicit FENCE instruction to guarantee ordering between memory operations from different RISC-V harts.
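Because the memory model between harts is relaxed, ordering has to be requested explicitly. The minimal C++ sketch below (the `data` and `ready` names are invented for this illustration, and the code is portable C++ rather than anything RISC-V-specific) shows the kind of release/acquire pairing that a compiler targeting RISC-V typically enforces with FENCE or other ordered instructions; without it, a consumer hart could legally observe `ready` set before it observes the new value of `data`.

```C++
#include <atomic>
#include <cassert>
#include <thread>

// Toy message passing between two threads (two harts, in RISC-V terms).
int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain store
    ready.store(true, std::memory_order_release); // ordered after the data store
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { } // spin until published
    assert(data == 42); // guaranteed only because of the release/acquire pair
}

int main() {
    std::thread producer_thread(producer);
    std::thread consumer_thread(consumer);
    producer_thread.join();
    consumer_thread.join();
    return 0;
}
```

The same program compiled for a strongly ordered ISA would need fewer explicit fences, which is exactly the kind of difference an ISA's memory model section pins down.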
> [!Note]
> To illustrate the importance—even at the geopolitical level—of ISAs and chips, at the time of writing this material, US lawmakers are pressing Biden's administration to restrict American companies from working on RISC-V, as they see Chinese use of the technology as a potential national security threat, due to the fact that RISC-V, being open source, is not captured by the export controls the U.S. has imposed on sending microprocessor technology to China[^56].

> [!Info]
> RISC-V is gaining popularity rapidly, and there are many implementations of the ISA in IP cores out there. There are still not that many commercial implementations of the ISA in standalone microprocessors or system-on-chips, though. Some examples are Efabless' Caravel Harness Chip^[https://caravel-harness.readthedocs.io/en/latest/index.html], SiFive's FE310-G000^[https://static.dev.sifive.com/FE310-G000-DS.pdf], and the somewhat mysterious XiangShan Open-source 64-bit RISC-V Processor^[https://github.com/OpenXiangShan/XiangShan]

# Compilers

A compiler translates a program in a source language to a program in a target language. The most well-known form of a compiler is one that translates a high-level language like C into the native assembly language of a machine (as per its ISA) so that it can be executed. And of course, there are compilers for other languages like C++, Java, C#, Rust, and many others.

The same techniques used in a traditional compiler are also used in any kind of program that processes a language. For example, a typesetting program like $\TeX$ translates a manuscript into a PostScript document. A web browser translates an HTML document into an interactive graphical display. To write programs like these, you need to understand and use the same techniques as in traditional compilers.

Compilers exist not only to translate programs but also to improve them. A compiler assists a programmer by finding errors in a program at compile time so that the user does not have to encounter them at runtime. Usually, a stricter language results in more compile-time errors. This makes the programmer’s job harder but makes it more likely that the program is correct. For example, the Ada language is infamous among programmers as challenging to write without compile-time errors, but once working, it is trusted to run safety-critical systems such as those on the Boeing 777 aircraft.

A compiler is distinct from an interpreter, which reads in a program and then executes it directly, without producing a translation. Languages like Python and Ruby are typically executed by an interpreter that reads the source code directly. Compilers and interpreters are closely related, and it is sometimes possible to exchange one for the other. For example, Java compilers translate Java source code into Java bytecode, which is an abstract form of assembly language. Some implementations of the Java Virtual Machine work as interpreters that execute one instruction at a time. Others work by translating the bytecode into local machine code and then running the machine code directly. This is known as just-in-time compiling, or JIT.

## Compiler Toolchain

A compiler is one component in a toolchain of programs used to create executables from source code. Typically, when you invoke a single command to compile a program, a whole sequence of programs is invoked in the background. The figure below shows the programs typically used in a Unix system for compiling C source code to assembly code.
![](compiler1.png) > [!Figure] > *A Typical Compiler Toolchain (source: #ref/Thain)* - The preprocessor prepares the source code for the compiler. In the C and C++ languages, this means consuming all directives that start with the # symbol. For example, an `#include` directive causes the preprocessor to open the named file and insert its contents into the source code. A `#define` directive causes the preprocessor to substitute a value wherever a macro name is encountered. Not all languages rely on a preprocessor. - The compiler consumes the clean output of the preprocessor. It scans and parses the source code, performs type-checking and other semantic routines, optimizes the code, and then produces assembly language as the output. - The assembler consumes the assembly code and produces object code. Object code is “almost executable” in that it contains raw machine language instructions in the form needed by the CPU's ISA. However, object code does not know the final memory addresses in which it will be loaded, and so it contains gaps that must be filled in by the linker. - The linker consumes one or more object files and library files and combines them into a complete, executable program. It selects the final memory locations where each piece of code and data will be loaded, and then “links” them together by writing in the missing address information. For example, an object file that calls the `printf` function does not initially know the address of the function. An empty (zero) address will be left where the address must be used. Once the linker selects the memory location of `printf`, it must go back and write in the address at every place where printf is called. In Unix-like operating systems, the preprocessor, compiler, assembler, and linker are historically named `cpp`, `cc1`, `as`, and `ld` respectively. The user-visible program `cc` simply invokes each element of the toolchain to produce the final executable. ## Compiler Stages The compiler itself can be divided into several stages: - The scanner: it consumes the plain text of a program and groups together individual characters to form complete tokens. This is much like grouping characters into words in a natural language. - The parser: it consumes tokens and groups them together into complete statements and expressions, much like words are grouped into sentences in a natural language. The parser is guided by a grammar that states the formal rules of composition in a given language. The output of the parser is an abstract syntax tree (AST) that captures the grammatical structures of the program. The AST also remembers where in the source file each construct appeared, so it can generate targeted error messages if needed. - The semantic routines traverse the AST and derive additional meaning (semantics) about the program from the rules of the language and the relationship between elements of the program. For example, we might determine that x + 10 is a float expression by observing the type of x from an earlier declaration, then applying the language rule that addition between int and float values yields a float. After the semantic routines, the AST is often converted into an intermediate representation (IR), which is a simplified form of assembly code suitable for detailed analysis. - One or more optimizers can be applied to the intermediate representation, to make the program smaller, faster, or more efficient. 
Typically, each optimizer reads the program in IR format and then emits the same IR format, so that each optimizer can be applied independently, in arbitrary order. - Finally, a code generator consumes the optimized IR and transforms it into a concrete assembly language program. Typically, a code generator must perform register allocation to effectively manage the limited number of hardware registers, and instruction selection and sequencing to order assembly instructions in the most efficient form. ![](compiler2.png) > [!Figure] > *Stages of a compiler (source: #ref/Thain)* ## Example Compilation Suppose we wish to compile this fragment of code into assembly: ```C height = (width+56) * factor(foo); ``` The first stage of the compiler (the scanner) will read in the text of the source code character by character, identify the boundaries between symbols, and emit a series of tokens. Each token is a small data structure that describes the nature and contents of each symbol: ![](compiler3.png) At this stage, the purpose of each token is not yet clear. For example, `factor` and `foo` are simply known to be identifiers, even though one is the name of a function, and the other is the name of a variable. Likewise, we do not yet know the type of `width`, so the + could potentially represent integer addition, floating point addition, string concatenation, or something else entirely. The next step is to determine whether this sequence of tokens forms a valid program. The parser does this by looking for patterns that match the grammar of a language. Suppose that our compiler understands a language with the following grammar: ![](compiler4.png) Each line of the grammar is called a rule and explains how various parts of the language are constructed. Rules 1-3 indicate that an expression can be formed by joining two expressions with operators. Rule 4 describes a function call. Rule 5 describes the use of parentheses. Finally, rules 6 and 7 indicate that identifiers and integers are atomic expressions. The parser looks for sequences of tokens that can be replaced by the left side of a rule in our grammar. Each time a rule is applied, the parser creates a node in a tree and connects the sub-expressions into the **abstract syntax tree** (AST). The AST shows the structural relationships between each symbol: addition is performed on width and 56, while a function call is applied to `factor` and `foo`. With this data structure in place, we are now prepared to analyze the meaning of the program. The semantic routines traverse the AST and derive additional meaning by relating parts of the program to each other and to the definition of the programming language. An important component of this process is typechecking, in which the type of each expression is determined, and checked for consistency with the rest of the program. To keep things simple here, we will assume that all of our variables are plain integers. To generate linear intermediate code, we perform a post-order traversal of the AST and generate an IR instruction for each node in the tree. A typical IR looks like an abstract assembly language, with load/store instructions, arithmetic operations, and an infinite number of registers. For example, this is a possible IR representation of our example program: ![](compiler5.png) ```Assembler LOAD $56 -> r1 LOAD width -> r2 IADD r1, r2 -> r3 ARG foo CALL factor -> r4 IMUL r3, r4 -> r5 STOR r5 -> height ``` The intermediate representation is where most forms of optimization occur. 
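As a rough illustration of the shape of such a pass, here is a minimal C++ sketch of constant folding over a toy, invented IR (the `IRInstr` structure below only loosely mimics the load/add style shown above and is not the IR of any real compiler): it reads a list of IR instructions and emits a list in the same format, which is what lets passes be chained in arbitrary order.

```C++
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Toy IR, invented for this sketch: every instruction writes one destination
// virtual register. "LOADI" loads an immediate, "IADD" adds two registers.
struct IRInstr {
    std::string op;
    int dst;
    int src1, src2;   // source registers (ignored by LOADI)
    int64_t imm;      // immediate value (used by LOADI only)
};

// A tiny optimization pass: reads IR, emits IR of the same form.
std::vector<IRInstr> fold_constants(const std::vector<IRInstr>& in) {
    std::vector<std::pair<bool, int64_t>> known(64, {false, 0}); // constant known per register?
    std::vector<IRInstr> out;
    for (IRInstr i : in) {
        if (i.op == "LOADI") {
            known[i.dst] = {true, i.imm};
        } else if (i.op == "IADD" && known[i.src1].first && known[i.src2].first) {
            // Both operands are compile-time constants: replace the add with a load-immediate.
            i = {"LOADI", i.dst, 0, 0, known[i.src1].second + known[i.src2].second};
            known[i.dst] = {true, i.imm};
        } else {
            known[i.dst] = {false, 0}; // destination is no longer a known constant
        }
        out.push_back(i);
    }
    return out;
}

int main() {
    std::vector<IRInstr> ir = {{"LOADI", 1, 0, 0, 56},
                               {"LOADI", 2, 0, 0, 10},
                               {"IADD", 3, 1, 2, 0}};
    for (const auto& instr : fold_constants(ir))  // the IADD becomes a LOADI of 66 into r3
        std::cout << instr.op << " -> r" << instr.dst << "\n";
    return 0;
}
```

Because the pass consumes and produces the same representation, it can be dropped into the flow of the figure above without the other stages noticing.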
Dead code is removed, common operations are combined, and code is generally simplified to consume fewer resources and run more quickly. Finally, the intermediate code must be converted to the desired assembly code, as per an actual [[Semiconductors#Instruction Set Architectures (ISAs)|ISA]]. The code below shows x86 assembly code, which is one possible translation of the IR given above.

```Assembler
MOVQ width, %rax    # load width into rax
ADDQ $56, %rax      # add 56 to rax
MOVQ %rax, -8(%rbp) # save sum in temporary
MOVQ foo, %rdi      # load foo into arg 0 register
CALL factor         # invoke factor, result in rax
MOVQ -8(%rbp), %rbx # load sum into rbx
IMULQ %rbx          # multiply rbx by rax
MOVQ %rax, height   # store result into height
```

Note that the assembly instructions do not necessarily correspond one-to-one with IR instructions. A well-engineered compiler is highly modular so that common code elements can be shared and combined as needed. To support multiple languages, a compiler can provide distinct scanners and parsers, each emitting the same intermediate representation. Different optimization techniques can be implemented as independent modules (each reading and writing the same IR) so that they can be enabled and disabled independently. A retargetable compiler contains multiple code generators so that the same IR can be translated into assembly for a variety of ISAs.

# CPU Cores

A CPU core is an implementation in hardware of an Instruction Set Architecture (ISA). Note that a core comprises strictly the hardware elements necessary to execute a full ISA, but it does not necessarily include peripherals or input/output capabilities. In general, CPU cores are meant to be instantiated either in a [[Semiconductors#System-on-Chips|System-on-Chip]] or in a standalone [[Semiconductors#Microprocessor Devices|microprocessor device]].

## A naïve implementation of a naïve ISA

Implementing an entire Instruction Set Architecture (ISA) in Verilog is a complex task that involves creating a CPU core with fetch, decode, and execute stages at the very least. This would be a lengthy process and would result in quite a bit of code. For the sake of brevity and clarity, let's see a simplistic implementation of a subset of the ISA we discussed before, focusing on the structure of a simple CPU and the implementation of a few instructions. This example will give you an idea of how you might begin to implement an ISA in an [[Semiconductors#Hardware Description Languages|HDL]], but it is far from a complete processor. It will include the definition of the register file, a simple [[Semiconductors#Arithmetic Logical Units (ALUs)|ALU]], and a control unit that can decode and execute a few instructions.
```Verilog
module SimpleCPU (
    input clk,
    input reset,
    input [31:0] instruction, // Instruction fetched from memory
    output reg [31:0] pc      // Program Counter
);

// Register file
reg [31:0] registers[15:0];

// ALU inputs and output
reg [31:0] alu_a;
reg [31:0] alu_b;
reg [31:0] alu_out;
reg alu_carry_out;

// Instruction decode
wire [5:0] opcode = instruction[31:26];
wire [3:0] reg_dst = instruction[25:22];
wire [3:0] reg_src1 = instruction[21:18];
wire [3:0] reg_src2 = instruction[17:14];
wire [17:0] immediate = instruction[17:0];

// Flags
reg zero_flag, carry_flag, overflow_flag, negative_flag;

// ALU Operations and LOAD instruction
parameter ADD = 6'b000011;
parameter SUB = 6'b000100;
parameter NOP = 6'b010100;
parameter LOAD = 6'b010101;

// Control unit
always @(posedge clk) begin
    if (reset) begin
        pc <= 32'b0;
        zero_flag <= 1'b0;
        carry_flag <= 1'b0;
        overflow_flag <= 1'b0;
        negative_flag <= 1'b0;
    end else begin
        case (opcode)
            ADD: begin
                // Blocking assignments are used for the ALU temporaries so that
                // the result can be written back in the same clock cycle
                alu_a = registers[reg_src1];
                alu_b = registers[reg_src2];
                {alu_carry_out, alu_out} = alu_a + alu_b;
                registers[reg_dst] <= alu_out;
            end
            SUB: begin
                alu_a = registers[reg_src1];
                alu_b = registers[reg_src2];
                {alu_carry_out, alu_out} = alu_a - alu_b;
                registers[reg_dst] <= alu_out;
            end
            LOAD: begin
                // Directly load the immediate value into the destination register
                registers[reg_dst] <= {14'b0, immediate};
            end
            NOP: begin
                // Do nothing
            end
            // Additional cases for other instructions would go here...
        endcase

        // Update flags
        zero_flag <= (alu_out == 0);
        carry_flag <= alu_carry_out;
        overflow_flag <= (alu_a[31] == alu_b[31]) && (alu_out[31] != alu_a[31]);
        negative_flag <= alu_out[31];

        // Update program counter
        pc <= pc + 4;
    end
end

endmodule
```

The Verilog module `SimpleCPU` represents an implementation of a naive CPU core following our ISA. This simple core can perform `ADD`, `SUB`, `LOAD`, and `NOP` instructions. Each instruction is expected to come in as a 32-bit value on the `instruction` input (which is _automagically_ loaded with instructions), and the core updates the program counter `pc` every clock cycle. The `registers` array models the register file, the `alu_*` signals are used for ALU operations, and the `opcode`, `reg_dst`, `reg_src1`, `reg_src2`, and `immediate` signals are used to decode the instruction.
A testbench for our rudimentary core would look like this: ```Verilog module SimpleCPU_tb; reg clk; reg reset; reg [31:0] instruction; wire [31:0] pc; // Instantiate the Unit Under Test (UUT) SimpleCPU uut ( .clk(clk), .reset(reset), .instruction(instruction), .pc(pc) ); // Procedure to display the contents of the register file task display_registers; integer i; begin $display("Register File Contents at Time: %t", $time); for (i = 0; i < 16; i = i + 1) begin $display("R%0d: %h", i, uut.registers[i]); end $display(""); // Blank line for readability end endtask // Clock generation always begin #5 clk = ~clk; // Toggle clock every 5 time units end initial begin $dumpfile("SimpleCPU_tb.vcd"); // Specify the VCD file name $dumpvars(0, SimpleCPU_tb); // Dump all signals in the testbench and the UUT // Initialize Inputs clk = 0; reset = 1; // Apply reset instruction = 0; // Wait for global reset #50; reset = 0; // Release reset // Wait for a clock cycle after reset @(posedge clk); // Example instruction to test LOAD operation // Let's assume LOAD uses the opcode '6'b010101' and immediate value is the data instruction = {6'b010101, 4'd0, 22'd5}; // LOAD R0 with the value 5 (Immediate load) @(posedge clk); display_registers(); instruction = {6'b010101, 4'd1, 22'd10}; // LOAD R1 with the value 10 display_registers(); @(posedge clk); // Example instruction to test ADD operation // ADD R2, R0, R1 (R2 = 5 + 10) instruction = {6'b000011, 4'd2, 4'd0, 4'd1, 18'b0}; display_registers(); @(posedge clk); // Example instruction to test SUB operation // SUB R3, R1, R0 (R3 = 10 - 5) instruction = {6'b000100, 4'd3, 4'd1, 4'd0, 18'b0}; display_registers(); @(posedge clk); // More test instructions can be added here... // Finish the simulation #50; $finish; end // Optional: Monitor changes in the program counter and output register initial begin $monitor("Time: %t, Program Counter: %h, Instruction: %b", $time, pc, instruction); end endmodule ``` ## ISA implementation using High-Level Synthesis There's also another way of defining behavior in logic building blocks, and that is High-Level Synthesis. High-Level Synthesis (HLS) refers to the process of converting an algorithmic description of the desired behavior (often written in C, C++, or SystemC) into digital hardware. HLS tools such as AMD Vitis HLS, Intel HLS Compiler, and Cadence Stratus HLS take high-level code and automatically generate the corresponding RTL code in Verilog or VHDL, which can then be synthesized to an FPGA or ASIC. Given the ISA described previously, we can create a high-level representation of a simple CPU core. Again, note that the complete implementation of a CPU core with an ISA is quite complex and cannot be fully covered in this format. Instead, I'll provide a high-level description of a simple processor that can handle a few instructions to give you an idea of how it might be approached with HLS. 
Consider a simple C++ snippet that can be used with an HLS tool to demonstrate the concept: ```C++ #include <cstdint> #include <iostream> // Define the memory size #define MEMORY_SIZE 256 // Define the opcode enum Opcode { NOP = 0x00, ADD = 0x01, SUB = 0x02, // Add more opcodes here }; // Define a structure for the instruction struct Instruction { Opcode opcode; uint8_t reg_dst; uint8_t reg_src1; uint8_t reg_src2; uint16_t immediate; // Only used for some instructions }; // Define the CPU with a register file and program memory class SimpleCPU { public: uint32_t registers[16]; // Register file Instruction memory[MEMORY_SIZE]; // Program memory // Method to execute instructions void execute() { uint32_t pc = 0; // Program counter while (pc < MEMORY_SIZE) { Instruction inst = memory[pc]; // Fetch instruction switch (inst.opcode) { case ADD: registers[inst.reg_dst] = registers[inst.reg_src1] + registers[inst.reg_src2]; break; case SUB: registers[inst.reg_dst] = registers[inst.reg_src1] - registers[inst.reg_src2]; break; case NOP: // Do nothing break; // Add cases for more opcodes } pc++; // Increment program counter } } }; int main() { // Instantiate the CPU SimpleCPU cpu; // Initialize the CPU and memory // ... // Execute the program cpu.execute(); return 0; } ``` This C++ code can be fed to an HLS tool to generate the corresponding RTL implementation. The tool will infer the required registers, the finite state machine for the instruction cycle, and the control logic for the `switch` statement that handles each opcode. In practice, you would need to provide more details to the HLS tool, like directives on how to optimize the design, timing constraints, and interface definitions. HLS tools also allow the designer to make trade-offs between the area, speed, and power consumption of the generated hardware. It's also essential to thoroughly validate the high-level design with extensive testing and verification before using HLS to generate the RTL. > [!tip] > A great reference on HLS is "High-Level Synthesis Blue Book" by Michael Fingeroff. ## Making the core more efficient with pipelining A production-ready core design would be significantly more complex and would include many more considerations such as pipelining, memory hierarchy, exception handling, and so forth. Our little core was magically loaded with instructions and those were executed right away. A full implementation would require creating a pipelined CPU which would involve breaking down the instruction execution into separate stages, where each stage performs a part of the instruction processing. In a basic pipeline, these stages typically include: - Fetch (IF): Retrieve an instruction from memory. - Decode (ID): Determine the operation to perform and which registers or values are involved. - Execute (EX): Perform the arithmetic or logic operation. - Memory (MEM): Access data memory if needed (for load/store instructions). - Write-back (WB): Write the result back to the register file. 
Here's a Verilog pseudocode for a pipelined version of our SimpleCPU: ```Verilog module PipelinedCPU ( input clk, input reset, // Other I/O ports as necessary ); // Define pipeline registers between each stage reg [31:0] IF_ID_instr, IF_ID_pc; reg [31:0] ID_EX_instr, ID_EX_pc, ID_EX_regA, ID_EX_regB, ID_EX_imm; reg [31:0] EX_MEM_instr, EX_MEM_pc, EX_MEM_aluResult, EX_MEM_regB; reg [31:0] MEM_WB_instr, MEM_WB_pc, MEM_WB_aluResult, MEM_WB_readData; // IF Stage - Instruction Fetch always @(posedge clk or posedge reset) begin if (reset) begin // Reset IF/ID pipeline registers end else begin // Fetch instruction and update PC // Pass instruction and PC to the next pipeline stage end end // ID Stage - Instruction Decode and Register file read always @(posedge clk or posedge reset) begin if (reset) begin // Reset ID/EX pipeline registers end else begin // Decode instruction and read from register file // Pass necessary values to the next pipeline stage end end // EX Stage - Execution or address calculation always @(posedge clk or posedge reset) begin if (reset) begin // Reset EX/MEM pipeline registers end else begin // Perform ALU operations or calculate addresses // Pass results to the next pipeline stage end end // MEM Stage - Data memory access always @(posedge clk or posedge reset) begin if (reset) begin // Reset MEM/WB pipeline registers end else begin // Access memory if needed // Pass results to the next pipeline stage end end // WB Stage - Write back to the register file always @(posedge clk or posedge reset) begin if (reset) begin // Reset write-back logic end else begin // Write results back to the register file end end // Additional logic for handling hazards, forwarding, and branch predictions // would also be necessary for a complete pipelined CPU. endmodule ``` Below is a high-level C++ example of a simple 5-stage pipelined core: ```C++ #include <cstdint> #include <iostream> // Define the memory size and width #define MEMORY_SIZE 256 #define REG_COUNT 16 // Define the opcode enum Opcode { NOP = 0x00, ADD = 0x01, SUB = 0x02, // ... 
Add more opcodes here }; // Define a structure for the instruction struct Instruction { Opcode opcode; uint8_t reg_dst; uint8_t reg_src1; uint8_t reg_src2; uint16_t immediate; // Only used for some instructions // Constructor for default NOP instruction Instruction() : opcode(NOP), reg_dst(0), reg_src1(0), reg_src2(0), immediate(0) {} }; // Define a pipeline register to hold intermediate values between stages struct PipelineRegister { Instruction inst; uint32_t alu_out; uint32_t reg_val1; uint32_t reg_val2; uint32_t pc; }; // Define the CPU with a register file and program memory class PipelinedCPU { public: uint32_t registers[REG_COUNT]; // Register file Instruction memory[MEMORY_SIZE]; // Program memory // Pipeline stages PipelineRegister IF_ID, ID_EX, EX_MEM, MEM_WB; // Reset pipeline registers void reset_pipeline() { IF_ID = PipelineRegister(); ID_EX = PipelineRegister(); EX_MEM = PipelineRegister(); MEM_WB = PipelineRegister(); } // Constructor to initialize the CPU and pipeline PipelinedCPU() { reset_pipeline(); // Initialize registers and memory with zeroes or predefined values for (int i = 0; i < REG_COUNT; ++i) { registers[i] = 0; } // Similarly, initialize program memory with NOPs or actual instructions } // Method to simulate the pipeline execution void run() { uint32_t pc = 0; // Program counter bool running = true; while (running) { // Write-back stage if (MEM_WB.inst.opcode != NOP) { // Only ADD and SUB write back results for this example registers[MEM_WB.inst.reg_dst] = MEM_WB.alu_out; } // Memory stage (stubbed for this example) EX_MEM = ID_EX; // In a real scenario, handle memory access here // Execute stage ID_EX = IF_ID; switch (ID_EX.inst.opcode) { case ADD: ID_EX.alu_out = ID_EX.reg_val1 + ID_EX.reg_val2; break; case SUB: ID_EX.alu_out = ID_EX.reg_val1 - ID_EX.reg_val2; break; case NOP: // Do nothing break; // ... Add cases for more opcodes } // Decode stage IF_ID.pc = pc; IF_ID.inst = memory[pc]; // Fetch instruction IF_ID.reg_val1 = registers[IF_ID.inst.reg_src1]; // Read registers IF_ID.reg_val2 = registers[IF_ID.inst.reg_src2]; // Fetch stage // Normally you'd fetch the next instruction here, but we'll assume a NOP // for the sake of simplicity. In a real pipeline, you'd increment the PC // and fetch from memory. // Update the program counter (stubbed for this example) pc = (pc + 1) % MEMORY_SIZE; // Move to the next instruction for simplicity // Check for termination condition (e.g., a special HALT instruction) // ... If detected, set running to false // Pipeline stage movement MEM_WB = EX_MEM; // Move the Execute results to Memory stage } } }; int main() { // Instantiate the CPU PipelinedCPU cpu; // Initialize the CPU with a program // ... // Run the CPU cpu.run(); return 0; } ``` In the example above: - Each pipeline stage is represented by a `PipelineRegister` structure. - The `run` function contains the logic for each stage. For a real processor, this would also include mechanisms to handle hazards, such as data hazards (with forwarding or stalls) and control hazards (with prediction or stalls). - The example provided assumes a perfect memory system with no latency, which is not realistic. In practice, memory access would take several cycles, and instructions might take different amounts of time to complete each stage. Actual HLS code would also need to specify the required interfaces and timing constraints to guide the synthesis tool in generating an efficient hardware implementation. 
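To give a flavor of what such directives look like, below is a sketch of how the execute loop of the earlier `SimpleCPU` example could be annotated for an AMD Vitis-HLS-style flow. The function name, the instruction field layout (a 6-bit opcode followed by three 4-bit register indices), and the specific pragma choices are illustrative assumptions, not a verified, tool-ready design:

```C++
#include <cstdint>

#define MEMORY_SIZE 256

// Hypothetical HLS top-level function derived from SimpleCPU::execute().
// The #pragma lines are synthesis hints only; they do not change the C++ semantics.
void simple_cpu_top(const uint32_t instr_mem[MEMORY_SIZE], uint32_t registers[16]) {
    // Split the register file into individual registers so that two source
    // operands can be read and one result written in the same cycle.
    #pragma HLS ARRAY_PARTITION variable=registers complete

    for (uint32_t pc = 0; pc < MEMORY_SIZE; ++pc) {
        // Ask the tool to start a new instruction every clock cycle (II = 1).
        #pragma HLS PIPELINE II=1
        const uint32_t instr  = instr_mem[pc];
        const uint32_t opcode = instr >> 26;          // Assumed layout: opcode in bits 31..26
        const uint32_t rd     = (instr >> 22) & 0xF;  // Destination register
        const uint32_t rs1    = (instr >> 18) & 0xF;  // First source register
        const uint32_t rs2    = (instr >> 14) & 0xF;  // Second source register

        switch (opcode) {
            case 0x03: registers[rd] = registers[rs1] + registers[rs2]; break; // ADD
            case 0x04: registers[rd] = registers[rs1] - registers[rs2]; break; // SUB
            default:   break;                                                  // NOP and others
        }
    }
}
```

The `PIPELINE` directive trades area for throughput, while `ARRAY_PARTITION` removes the single-port memory bottleneck on the register file; equivalent directives exist in other HLS tools under different names, and interface pragmas or separate constraint files would still be needed to define how the synthesized block talks to the rest of the system.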
Debugging and optimization would likely be complex, requiring careful consideration of the hardware architecture and the behavior of the high-level language constructs under synthesis. ## Microarchitecture From the code snippets above, one can see that certain blocks or components seem to be exchanging information every time a new instruction is executed. The microarchitecture for a simple ISA as described previously would define the physical building blocks, control signals, data paths, and the logic necessary to implement the instruction set of the processor. A brief overview of what the main blocks of the microarchitecture—otherwise called computer organization—might consist of for a hypothetical 32-bit ISA with 20 instructions is shown below: Core Components: - Program Counter (PC): A register that holds the address of the next instruction to be fetched from the instruction memory. - Instruction Memory: A memory where the program code that the CPU will execute resides. - Register File: A set of 32-bit registers, including general-purpose registers and special-purpose registers like the stack pointer or link register. - Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations. - Control Unit (CU): Decodes instructions and generates control signals to orchestrate the operation of the CPU core as it goes. - Floating Point Unit (FPU): a core component designed to handle complex arithmetic operations on decimal numbers, which are represented as floating-point numbers. This component is relevant for tasks requiring high numerical precision and range, such as scientific computations, graphics processing, and any application involving real numbers. Pipeline Stages: - Instruction Fetch (IF): Fetches instructions from memory using the PC, increments the PC, and passes the instruction to the next stage. - Instruction Decode (ID): Decodes the fetched instruction and reads the needed registers from the register file. - Execution (EX): Executes the instruction using the ALU or performs a branch/jump calculation. - Memory Access (MEM): For load and store instructions, interacts with the data memory. Not all instructions will use this stage. - Write Back (WB): Writes the results back to the register file or updates the PC if it was a branch/jump instruction. Data Paths: In general, data paths are parallel buses connecting different building blocks in the microarchitecture. - Instruction Data Path: From instruction memory to the instruction register in the IF stage, and then to the Control Unit and register file in the ID stage. - Operand Data Path: From the register file to the ALU. - Result Data Path: From the ALU to the register file (or to data memory and back for load/store instructions). Control Signals: - ALU Control: Determines the operation of the ALU based on the opcode. - Register File Control: Signals to read or write from/to the register file. - Memory Control: Signals to perform read or write operations on data memory. - Branch Control: Determines whether to take a branch and computes the new PC value. Handling Hazards: - Data Hazard Handling: The microarchitecture would need forwarding paths to resolve data hazards, or it would need to stall the pipeline until the data is ready. - Control Hazard Handling: Implements branch prediction, delay slots, or stalls to handle situations where the next instruction to execute depends on a branch outcome. Specialized Units (if needed): - Load/Store Unit: Specifically deals with accessing data memory. 
- Branch Unit: Deals with branch prediction and address calculation. Support for Interrupts and Exceptions - Exception Handling Unit: Manages interrupts and exceptions, altering the flow of control as necessary. An example of a microarchitecture for the 8-bit 8085 processor is shown below: ![8085 microarchitecture](site/Resources/media/image210.png) > [!Figure] > _8085 microarchitecture_ It is quite didactic to observe the microarchitecture of the 8085 above, as the core only supports direct or indirect addressing, it does not support [[Semiconductors#Virtual Memory|virtual memory]], it does not support privileged operating modes, and has a rather simplified interrupt scheme. ### Optimization The design of the core microarchitecture typically focuses on achieving some optimizations, which might aim to lower power consumption, faster execution, or a combination of both. #### Faster, more parallelized execution The design of the microarchitecture allows for cores to be capable of executing a significant number of operations in parallel or with very few CPU clocks. Let's see as an example the microarchitecture of the ADSP-210XX family of CPU cores from Analog Devices, known as the SHARC (Super Harvard Architecture Computer) family. The SHARC is a 32-bit processor designed for fast numerical computation for digital signal processing (DSP). This processor is characterized by its high performance, which is attributed to its unique architecture optimized for the parallel execution of instructions and efficient handling of multiple operations typical in signal processing tasks. Here's a brief overview of its key microarchitectural features: Memory Architecture: - Super Harvard Architecture: The core implements a modified Harvard architecture that provides separate pathways for instruction and data, allowing simultaneous access to both the program and data memory, hence improving throughput. - On-chip Memory: It includes a large on-chip RAM, which is partitioned into instruction and data memory blocks for simultaneous access. Processing Elements - Pipelined Floating-Point Unit (FPU): It can execute a floating-point multiply and an arithmetic or logical operation simultaneously in each clock cycle. - Multiple Functional Units: Allows for simultaneous operations, such as an ALU operation, a multiplier operation, and a barrel shifter operation, in one cycle. - Dedicated Hardware for Circular Buffering: Optimizes the implementation of digital filters by providing hardware support for circular buffering, which is critical for DSP algorithms. Parallelism and Performance: - Instruction Cache: Improves performance by prefetching instructions to reduce access time. - [[Semiconductors#CPU Cores#Microarchitecture#Direct Memory Access|DMA]] Controllers: Direct Memory Access (DMA) controllers facilitate parallel data transfers without CPU intervention, greatly enhancing data I/O throughput. - Zero Overhead Looping: The processor can execute loops with zero overhead, which is particularly useful in DSP applications where many operations are iterative and performed on arrays of data. I/O and Peripheral Support: - Link Ports: These are high-speed serial ports for direct processor-to-processor communication without using external hardware. - I/O Processor: A dedicated I/O processor (separate from the main CPU core) manages peripheral operations, offloading routine tasks from the main processor. 
Instruction Set:

- Rich Instruction Set: The SHARC's instruction set is optimized for DSP operations, with support for complex arithmetic operations, bit-reverse, and modulo addressing modes.
- SIMD Support: Single Instruction, Multiple Data (SIMD) capabilities allow it to perform the same operation on multiple data points simultaneously.

The ADSP-210XX, like other SHARC processors, is designed to be highly efficient for DSP operations, offering high throughput for applications such as audio processing, telecommunications, and control. Its combination of on-chip memory, parallel functional units, and specialized DSP hardware features allows for the efficient execution of complex signal processing algorithms.

![ADSP-210XX family of DSPs (Credit: Analog Devices)](site/Resources/media/image211.png)

> [!Figure]
> _ADSP-210XX family of DSPs (Credit: Analog Devices)_

In all the microarchitectures shown above, memory access is rather literal. This means that, provided the right instructions are used, there is unrestricted freedom to read and write any memory location in the memory map. This is, per se, not a terrible problem. But as software complexity increased and operating systems made an appearance, the idea of protecting memory from unrestricted access gained traction. We will talk about that at some point.

#### Work Smarter, Not Harder: Direct Memory Access (DMA)

Optimizing a microarchitecture is not always about making it faster, that is, about executing more instructions per second. Another way of optimizing a microarchitecture is to offload the CPU from some tasks so we can use it more efficiently. One way of doing this is to relieve the core from having to move data between a device and memory. Moving data tends to be a repetitive task in which bytes are hauled from an initial address to an end address in a rather mechanistic way. Delegating this work to dedicated hardware is called Direct Memory Access (DMA), and microarchitectures incorporate special blocks that perform it, called DMA controllers. By using DMA, data that needs to go from a peripheral device into memory (or vice versa) does not require the microprocessor to take part in the transfer, although the data bus will of course be occupied anyway.

In the early days, DMA controllers were discrete devices (like the historic 8257 from Intel^[https://www.eecs.northwestern.edu/~ypa448/Microp/8257.pdf], see figure below). Nowadays, DMA controllers are incorporated on-chip in most modern processors. We can still use the 8257 as an illustrative example of how DMA controllers work.

![8257 discrete DMA controller from Intel](8257.png)

> [!Figure]
> _8257 discrete DMA controller from Intel_

The 8257 has four independent DMA channels, each capable of handling its own DMA request. These channels move data between peripherals and memory, with different priorities assigned to each channel (Channel 0 has the highest priority and Channel 3 the lowest). The CPU accesses the internal registers of the 8257 DMA controller through the system's I/O address space. This is done by driving the 8257's chip select (CS#), I/O read (IOR#), and I/O write (IOW#) control lines, along with address lines to select the specific register within the 8257. The registers in the 8257 are:

- **Mode Set Register (MSR):** Used to set the mode of operation for each channel.
- **Command Register:** Controls the overall operation of the 8257.
- **Status Register:** Indicates the status of each channel and the DMA controller.
- **Mask Register:** Used to enable or disable the individual channels.
- **Address Register and Counter Register for each channel:** These registers store the base address and the word count of the data block to be transferred for each channel.

And the 8257 pins are described below:

- **HRQ (Hold Request):** This signal is used by the 8257 to request control of the system bus.
- **HLDA (Hold Acknowledge):** This signal is sent by the CPU to acknowledge that the 8257 has been granted control of the system bus.
- **DREQx (DMA Request):** A channel-specific signal used by peripherals to request a DMA operation.
- **DACKx (DMA Acknowledge):** A channel-specific signal used by the 8257 to acknowledge the DREQ from the peripheral.
- **EOP# (End of Process):** Indicates that the DMA transfer for the current block is complete.
- **IOR# (I/O Read) and IOW# (I/O Write):** Used for reading from and writing to an I/O device.
- **MEMRD# (Memory Read) and MEMWR# (Memory Write):** Used for reading from and writing to memory.

The 8257 can operate in various modes, set by the Mode Set Register:

- **Single Mode:** In this mode, each channel needs to be enabled individually after each transfer.
- **Block Mode:** Allows a continuous transfer of a block of data without intervention from the CPU.
- **Demand Mode:** The transfer continues as long as the demand (DREQ) is present.
- **Cascade Mode:** Allows the 8257 channels to be connected in a cascaded fashion, expanding the number of available channels.

The typical behavior of the 8257 controller is as follows:

1. **Initialization:** The CPU initializes the 8257 by loading the base address and word count into the respective registers for each channel, along with the mode of operation.
2. **DMA Request:** When a peripheral device requires a data transfer, it sends a DREQ signal to the corresponding channel of the 8257.
3. **DMA Acknowledge:** Upon receiving DREQ, if the channel is not masked and is the highest-priority channel requesting service, the 8257 asserts the HRQ signal to the CPU, requesting control of the system bus. Once the CPU acknowledges with HLDA, the 8257 takes over the bus and sends a DACK signal to the peripheral device.
4. **Data Transfer:** The 8257 reads from or writes to the memory or I/O device by asserting the appropriate control signals (IOR#/IOW# for I/O operations and MEMRD#/MEMWR# for memory operations) until the word count reaches zero, indicating the end of the transfer.
5. **Transfer Completion:** Once the transfer is complete, the 8257 updates the status register, indicating the completion of the operation, and sends an EOP signal if programmed to do so. The CPU is then notified that the DMA transfer is complete, and control of the system bus is returned to the CPU.

On-chip DMA controllers follow a similar philosophy, although the registers and internal state machines may vary greatly from vendor to vendor.

#### Power efficiency

Microarchitectural design for power efficiency involves various strategies aimed at reducing energy consumption while maintaining performance. Here are some approaches:

- Clock Gating: By disabling the clock signal to portions of the core when they're not in use, this technique prevents power from being wasted on unnecessary calculations.
- Power Gating: Completely turns off power to inactive building blocks within the core to save energy.
- Dynamic Voltage and Frequency Scaling (DVFS): Adjusts the voltage and frequency according to the workload.
When full performance is not needed, the core can run at a lower frequency, which reduces power consumption. - Multi-Vdd Design: Uses different supply voltage rails for different parts of the chip, optimizing the power for each unit's performance requirements. - Multi-threshold CMOS (MTCMOS): Combines high threshold voltage transistors for low leakage in idle states with low threshold voltage transistors where high speed is required. - Pipeline Balancing: Designs the pipeline stages to have a balanced workload, thus avoiding power-hungry pipeline stalls and bubbles. - Speculative Execution: Although it can increase power consumption, carefully controlled speculative execution can reduce wasted cycles, making execution more efficient. - Branch Prediction Improvements: By improving branch prediction accuracy, the number of speculative execution paths that do not result in committed instructions (thus wasted energy) can be reduced. - Fine-grained Execution Units: Incorporating more, but smaller and specialized execution units can allow for running only the necessary units for particular tasks at a lower power. - On-Chip Memory: Using levels of [[Semiconductors#Caches|cache]] and optimizing its size and access algorithms can reduce the need to access off-chip memory, which is more power-intensive. - Asynchronous Design: Eliminates global clocks, which can save power by allowing different parts of the processor to run only when needed. - Efficient Interconnects: Reducing the power required for data to travel across the chip with optimized buses and networks-on-chip. - Hardware Acceleration: Using dedicated hardware for specific tasks (like encryption, graphics, or signal processing) can be more power-efficient. #### Area A core microarchitecture might be required to take as little area as possible. When designing a core microarchitecture to minimize the area used, there are some sacrifices to be made, where designers may need to compromise on certain performance-enhancing features or reduce the complexity of the core, which can affect overall performance. ## Memory Map When the microarchitecture is running and instructions are being fetched and executed at the clock rate, the core sends out addresses over its address bus, along with the control signals to read the content in memory-like devices placed outside of the core, and a data bus to read or write into the positions pointed by the address bus. We as designers can hook anything we want out of the address and data buses, as long as we can feed the CPU with numbers. We can also use address lines to apply logic functions and create complex memory maps. A memory map is, in essence, the complete list of devices from which the core can read and in some cases also write data to. This may include RAM, EEPROM, cache, or other I/O devices like converters or serial peripherals like UARTs (see figure below). ![16550 UART chip with address and data bus, ready to be memory-mapped with a CPU](site/Resources/media/image212.png) > [!Figure] > _16550 UART chip with address and data bus, ready to be memory-mapped with a CPU_ Memory-mapped locations are then accessible from the core in runtime using memory move—or load/store—instructions. This approach is especially powerful when I/O devices are memory-mapped, coupling these devices very closely to the application software. The layout of the memory map is a critical part of any digital system architecture for it acts as a blueprint of any digital system. 
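As a small illustration of what "memory-mapped" means from the software side, the sketch below drives a UART through plain pointer accesses. The base address, register layout, and status bit are made up for the example (a real 16550 has its own, well-defined register map), and on actual hardware the address would come from the board's memory map or a linker script rather than a hard-coded constant:

```C++
#include <cstdint>

// Hypothetical register block of a memory-mapped UART.
// 'volatile' forces the compiler to perform every access on the device.
struct UartRegs {
    volatile uint8_t data;    // Transmit/receive data register
    volatile uint8_t status;  // Bit 0 (assumed): transmitter ready
    volatile uint8_t control; // Baud rate and frame format configuration
};

// Assumed location where the address decoding logic places the UART.
constexpr uintptr_t UART_BASE = 0x10000000;

inline UartRegs* uart() {
    return reinterpret_cast<UartRegs*>(UART_BASE);
}

// Transmitting a character is just a couple of ordinary load/store
// instructions; from the core's point of view the UART is simply memory.
void uart_putc(char c) {
    while ((uart()->status & 0x01) == 0) {
        // Busy-wait until the transmitter reports it is ready
    }
    uart()->data = static_cast<uint8_t>(c);
}
```

Port-mapped I/O, described next, reaches the same kind of peripherals through a separate address space and dedicated I/O instructions instead of ordinary loads and stores.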
### Port-Mapped Input/Output

Port-mapped I/O uses a separate address space for I/O devices, which is accessed using a different set of instructions specifically designed for I/O operations. This method does not encroach on the memory address space, allowing the full range of addresses to be used for memory. However, it can be more cumbersome to deal with because it requires different instructions for memory and I/O operations.

## Virtual Memory

As computer architectures grew in complexity, and software complexity grew with them, operating systems proliferated. Operating systems introduced the concept of tasks as isolated units of execution running next to each other, cleverly orchestrated by a scheduler to give the impression of "multitasking" even though CPU cores were still executing one instruction at a time. With more tasks running together and performing complex operations, operating systems required improved protection and isolation between tasks. Additionally, it became clear that tasks did not need to know the subtleties of memory maps and chips; tasks needed memory, and whether that memory was physically contiguous or not should not be their concern. All this paved the way for the idea of decoupling the memory software sees from the physical memory behind it.

The idea of virtual memory is rather straightforward: to provide an application with the perception that it has a contiguous and potentially very large address space, regardless of the actual amount of physical memory that is present. The idea of virtual memory was initially developed in the late 1950s for the Atlas Computer at the University of Manchester. It was designed to provide programmers with a large, uniform address space without the need to manage memory physically. Throughout the 1960s, the concepts of paging and segmentation evolved. Paging involves dividing memory into fixed-size blocks, while segmentation divides memory into variable-length blocks. Both methods are used to map virtual addresses to physical memory locations. With the advent of more sophisticated CPU designs in the 1970s, virtual memory became a standard feature in commercial computer architectures, including the influential VAX computers from Digital Equipment Corporation. The introduction of the Intel 386 processor brought virtual memory capabilities to personal computers. It provided a hardware mechanism for translating virtual addresses to physical addresses, a process managed through a multilevel paging system. As processors moved from 32-bit to 64-bit architectures, the virtual address space was expanded significantly, allowing for even larger amounts of virtual memory to be utilized, far exceeding the limits of physical RAM.

Key Points in Virtual Memory:

- Abstraction: Applications are given the illusion that they are using a contiguous space of memory.
- Protection: Each process operates in its own space, preventing one process from corrupting the memory used by another.
- Swapping: Data can be swapped in and out of physical memory to allow more applications to run than there is RAM available.
- Mapping: The operating system maintains tables to map virtual addresses to physical locations.

### Memory Management in the i386 Microarchitecture

In the i386 microarchitecture, the segmentation unit and the paging unit (see figure below) work in tandem to convert logical addresses into physical addresses.
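The sketch below shows the address arithmetic behind a generic two-level page walk for a hypothetical 32-bit machine with 4 KiB pages. It is a simplified software model of what a paging unit like the i386's (described in detail next) does in hardware once a linear address is available:

```C++
#include <cstdint>

// Hypothetical 32-bit translation: 10 bits of directory index,
// 10 bits of page-table index, and 12 bits of offset (4 KiB pages).
struct Entry {
    uint32_t frame_base; // Physical base address of the page frame (or next-level table)
    bool     present;    // Whether the mapping is resident in physical memory
};

// The tables are flattened into plain arrays here for brevity; in a real design
// the directory entry holds the physical address of the second-level table.
uint32_t translate(uint32_t linear_addr,
                   const Entry page_directory[1024],
                   const Entry page_tables[1024][1024]) {
    const uint32_t dir_index   = (linear_addr >> 22) & 0x3FF; // Bits 31..22
    const uint32_t table_index = (linear_addr >> 12) & 0x3FF; // Bits 21..12
    const uint32_t offset      =  linear_addr        & 0xFFF; // Bits 11..0

    if (!page_directory[dir_index].present) {
        // Page fault: the operating system must allocate or load the page table
        return 0;
    }
    const Entry& pte = page_tables[dir_index][table_index];
    if (!pte.present) {
        // Page fault: the operating system must bring the page into memory
        return 0;
    }
    return pte.frame_base + offset; // Final physical address
}
```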
==When a program runs, it does not deal directly with physical memory addresses; instead, it operates with logical addresses that need to be translated to actual locations in physical memory.== The segmentation unit comes into play first; it handles logical addresses made up of a segment selector and an offset. The selector refers to an entry in a table of segment descriptors, each containing information about the segment's size, location, and access permissions. These descriptors enable the system to enforce access controls and keep different parts of a program isolated from one another. Once the segmentation unit has identified the base address of the segment via the segment descriptor, it adds this base to the offset, resulting in a linear address. This address is what the paging unit will work with if paging is enabled on the system.

The paging unit now takes over, using the linear address to find the actual data in physical memory. It does this through a multi-level table system that includes a page directory and page tables. The linear address is divided into parts that index into these tables and ultimately point to a page frame in physical memory. A final addition of the offset within the page leads to the precise physical address where the desired data resides. This linear-to-physical translation is critical for virtual memory systems. Software might be tricked into thinking it's working with a contiguous block of memory, but behind the scenes, the paging unit maps these logical segments to non-contiguous physical memory, including, with operating system support, areas on the disk.

This dual-layered approach allowed for flexibility in programming and operating system design. Segmentation provided a way to isolate different parts of a program for protection and control, while paging provided a mechanism for implementing virtual memory, allowing the system to use disk storage as an extension of RAM and managing memory in more granular page-sized chunks. More modern x86 processors have largely deprecated segmentation in favor of flat memory models but retain it for compatibility reasons. Paging, however, remains a fundamental part of modern processors, playing a critical role in memory management and protection.

The virtual memory management feature in the i386 is part of a larger design feature called protected mode. The 386 protected mode supports multiple privilege levels, as defined by the ring architecture. This allows for better control over the access rights of various parts of the system. This separation ensures that user applications cannot directly access critical system resources, safeguarding the system from potential threats.

![i386 microarchitecture (note the segmentation unit and the paging unit at the top)](site/Resources/media/image213.png)

> [!Figure]
> _i386 microarchitecture (note the segmentation unit and the paging unit at the top)_

## Caches

In essence, CPU cores can only be as fast as the slowest device they need to interact with. All microarchitectural measures to improve core efficiency and speed are useless when slow devices are connected to the core. Memories are slower than cores, and caches are an essential feature for bridging the gap between the rapid processing capabilities of the CPU core and the slower pace of access to main memory. This disparity, often referred to as the *memory wall*[^57], has motivated the creation of a hierarchy of caches in modern processors.
At the heart of their function, caches are small, fast memory stores that hold copies of data and instructions that the CPU core is likely to need imminently. By anticipating the processor's requirements, caches reduce the frequency and duration of its interactions with the more sluggish main memory, thus enhancing system performance.

==Caches are organized in a hierarchy, usually denoted as L1, L2, and L3, with L1 being the smallest and fastest, situated closest to the CPU core. This cache is typically split into two parts: one for data and one for instructions, allowing simultaneous access to both by the CPU. The L2 cache, typically larger and slightly slower, might be shared between cores or dedicated to a single core, depending on the specific CPU design. The L3 cache, often much larger, serves as a reservoir of data for the entire CPU, smoothing out the demands placed on main memory.==

The efficiency of caches hinges on the principle of locality, which comes in two forms: temporal and spatial. Temporal locality posits that if a data location is accessed, it is likely to be accessed again soon, thus it makes sense to keep this data close at hand in the cache. Spatial locality suggests that data locations near recently accessed data are likely to be accessed in the near future, prompting the cache to fetch blocks of contiguous memory rather than isolated addresses.

The operation of caches is largely invisible to software, managed instead by hardware algorithms that determine which data to store and when to store it. One such algorithm is the least recently used (LRU) policy, which predicts that data that has not been accessed for the longest period is the least likely to be needed imminently. Therefore, when new data needs to be fetched into a full cache, the LRU data is evicted to make room. To make this lookup fast, the cache is divided into fixed-size blocks called cache lines, typically 64 bytes long, and the cache controller checks whether the data at a particular memory address is contained within one of these lines.

Conflicts and misses do occur, however. A cache miss happens when the data sought by the CPU is not found in the cache, necessitating a fetch from main memory—a relatively time-consuming process. A cache conflict, on the other hand, arises when multiple data pieces compete for the same cache line, leading to the eviction of some data, which may soon need to be reloaded. Sophisticated cache designs use associative mappings to minimize this problem, allowing data to be stored in several possible lines rather than a predetermined one.

In the grand landscape of CPU core operations, the cache is a critical mediator, handling the delicate balance between the need for speed in data access and the physical limitations of memory technologies. Through the strategic use of caches, CPUs manage to maintain a near-constant feed of data to the hungry processor core, ensuring that the system operates with the efficiency and speed demanded by users and applications alike.

## Multi-Core Architectures

Before we move on to other devices, let's discuss multi-core architectures. The advent of multi-core architectures was driven by the physical limitations of scaling single-core CPUs. As the demand for higher performance increased, simply ramping up the clock speed of CPUs led to prohibitive power consumption and heat dissipation challenges.
Multi-core architectures emerged as a solution, providing a way to continue performance improvements without the drawbacks associated with higher clock speeds. In a multi-core processor, each core can independently execute tasks, and depending on the application's nature, it can act in concert with other cores to tackle complex, multithreaded workloads more efficiently than a single-core processor. This allows for a division of labor, where different cores can handle separate tasks or threads within a program, leading to better system responsiveness and multitasking capabilities. To fully exploit the capabilities of multi-core architectures, operating systems and applications must be designed to operate in a multithreaded manner, where tasks are broken down into discrete parts that can be allocated to different cores and run in parallel. This approach can significantly reduce the time required to execute application software, especially those designed for parallel processing like numerical simulations, graphics rendering, and data analysis. ### Coherence Coherence between cores in a multi-core architecture is maintained by ensuring that all cores have a consistent view of the shared memory space. This is important because each core may have its own cache, and without a proper coherence policy, one core might be unaware of the changes in data made by another core, leading to inconsistent or stale data being used in computations. The typical mechanism for maintaining coherence is known as a cache coherence protocol. One of the most common types of protocols is the snooping protocol where the cache is put on a bus that "snoops" to observe all the transactions. A coherency controller watching the bus can detect when a memory location it has a copy of is being read or written by another core and take appropriate action. If a core wants to write to a location, it must ensure that no other caches have a valid copy of that location, usually by broadcasting an invalidation message on the bus. An alternative to the snooping protocols are directory-based protocols where coherence is managed by a central directory that keeps track of which caches have a copy of each block of memory. When a core wants to write to a block, the directory is consulted to see which caches have copies, and invalidation messages are sent to those caches. This method is more scalable than snooping because it doesn't require all caches to monitor all data traffic, which becomes impractical in systems with many cores. Both coherency methods must handle scenarios where multiple cores want to read or write to the same memory location nearly simultaneously, necessitating a consistent strategy for dealing with such conflicts. In addition to hardware mechanisms, software also plays a role in maintaining coherence. Memory barriers or fence instructions are used by compilers and programmers to prevent the cores from performing certain types of reordering of memory operations, which is critical for maintaining coherence in multithreading contexts. The combination of hardware protocols and software mechanisms ensures that all cores in a multi-core architecture have a coherent view of memory, allowing for efficient and correct execution of concurrent processes. ![P2020, Dual-core processor. Note the coherency module in blue, located in the middle of the image. Credit: NXP](site/Resources/media/image214.png) > [!Figure] > _P2020, Dual-core processor. Note the coherency module in blue, located in the middle of the image. 
Credit: NXP_

We will dive again into coherence in multi-core architectures when we discuss ACE (AXI Coherency Extensions) in the context of [[Semiconductors#System-on-Chips#Advanced Microcontroller Bus Architecture (AMBA)|Advanced Microcontroller Bus Architecture]] release 4 (AMBA 4), as well as the evolution of ACE in AMBA 5 with the introduction of the [[Semiconductors#System-on-Chips#CHI|CHI]] (Coherent Hub Interconnect) protocol, a re-design of the signal-based AXI/ACE protocol into a packet-based, layered protocol.

### Manycore

Unlike multicore architectures, which might feature a handful of cores, manycore systems can have hundreds or even thousands of cores. This leap in the number of cores is not just quantitative but also brings qualitative changes in how computing tasks are handled, optimized, and scaled.

At the heart of manycore architectures is the principle of parallelism. In traditional computing, tasks are often executed sequentially, which can create bottlenecks as computational demands increase. Manycore systems address this issue by enabling a higher degree of parallel processing. This means that more tasks can be executed simultaneously, significantly speeding up computation for applications that can be parallelized.

The design of manycore processors is different from that of traditional CPUs. Manycore processors often feature simpler, more energy-efficient cores. This design choice reflects a trade-off between the complexity and power consumption of individual cores versus the overall computational capacity of the chip. By opting for simpler cores, manycore architectures can pack more processing units into the same space, boosting parallel computational abilities while keeping power consumption in check.

Interconnect technology is another critical aspect of manycore architectures. As the number of cores increases, the challenge of efficiently connecting them grows. Manycore systems employ advanced interconnect technologies to ensure that data can be rapidly and efficiently shared among cores.

Programming for manycore systems also presents unique challenges and opportunities. Traditional sequential programming models do not fully exploit the parallelism potential of manycore architectures. As a result, developers have turned to parallel programming models and languages specifically designed to manage and leverage the computational resources of manycore systems. These models allow for the division of tasks into smaller, concurrent operations that can be executed across multiple cores, maximizing the architectural benefits.

Manycore architectures are particularly well-suited to applications that require significant computational power and can be effectively parallelized. This includes fields such as scientific computing, data analysis, machine learning, and graphics processing. In these areas, manycore systems can dramatically reduce computation times, enabling more complex simulations, analyses, and real-time processing capabilities.

![](manycore.png)

![](manycore2.png)

How do we develop software for a manycore? A manycore microprocessor can use the same compiler flow as a single-processor system, although task partitioning and communication require additional tools. When data is held in multiple memories, coherence must be guaranteed. Process communication and coherency approaches can be categorized into two groups:
1. Shared memory (implicit messaging): In the shared-memory approach, a single memory address space is shared by multiple processors. Although explicit communication code is not required, exclusive control (a mechanism used to synchronize the different processes, commonly implemented as a lock) over the shared variables is needed, and coherence must be maintained. Recent processors offer transactional memory, which applies exclusive control speculatively.
2. Distributed memory (message passing): In the distributed-memory approach, each processor has its own memory address space and does not share variables with the other processors. Processor cores communicate by exchanging messages, typically through the API of the message passing interface (MPI) [272], which makes the communication explicit in the software program. Because no variables are shared among processes, there is no need for exclusive control or for maintaining coherency.

Studies on both shared-memory and message-passing approaches have been conducted. For manycore processors, a software program and its dataset are subdivided to fit into each core (the so-called working set), and care must be taken with the cache memory architecture: a cache miss means paying the penalty of accessing the external world and creates unnecessary traffic on the chip.

### Pros and Cons of Multi-Core Systems

How exactly are more CPU cores beneficial? Are there scalability issues as we add more cores to a chip? At first glance, it might seem obvious that, for a given algorithm or routine, the running time using one core will be longer than what we could expect with more CPUs. However, this is not necessarily true. More things need to be considered; the effort to create a parallel algorithm is usually split between a part required for creating parallelism (a set of threads or processes), the computations required for running the concurrent threads, and everything necessary for communication and synchronization. Given more cores, the main option for reducing the running time of a parallel algorithm consists of increasing the number of independent computations. However, this will probably require more communication, more synchronization, and a larger overhead for creating the parallelism. Hence, there is no guarantee that the gain induced by the increase in parallelism will outweigh these additional operations.

Amdahl's Law and Gustafson's Law are two fundamental principles in the field of parallel computing, each offering insights into the potential benefits and limitations of parallelization in improving computational performance. Amdahl's Law, named after Gene Amdahl, focuses on the relationship between the overall performance improvement of a computational task and the proportion of the task that can actually be parallelized.

> [!note]
> Not all algorithms can be fully parallelized, as they may contain parts that must be inescapably executed sequentially.

According to this law, the maximum speed-up of a program using multiple processors in parallel computing is limited by the time needed for the sequential part of the task. The law says that even with an infinite number of processors, the speed-up of a program is limited if a portion of the program must be executed sequentially.
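In their usual textbook form, with $p$ the fraction of the work that can be parallelized and $N$ the number of processors, the two laws can be written as:

$$
S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad S_{\text{Gustafson}}(N) = (1 - p) + p\,N
$$

Letting $N \to \infty$ in Amdahl's expression gives an upper bound of $1/(1-p)$: with $p = 0.9$, no number of extra cores can speed the program up by more than a factor of 10. Gustafson's expression instead assumes the problem size grows with $N$, so the serial fraction refers to the scaled workload and the achievable speed-up keeps growing roughly linearly with the number of processors.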
The formula for Amdahl's Law expresses the theoretical speed-up in processing time as a function of the number of processors and of the fraction of the program that can be parallelized. ==The key takeaway is that increasing the number of processors yields diminishing returns on performance improvement, especially if a significant portion of the program cannot be parallelized.==

Gustafson's Law, proposed by John L. Gustafson, presents a different perspective on the scalability of parallel systems. It addresses a limitation of Amdahl's Law by considering the case where the total workload can be increased. Gustafson's Law suggests that the speed-up of a program is more closely related to the increase in workload size than to the fixed proportion of parallelizable code. As the workload grows, it becomes possible to achieve near-linear speed-up because the overhead of the sequential portion of the code becomes negligible compared to the overall computation time. This law implies that in real-world scenarios, where larger problems can be divided into parallel tasks, the potential for performance improvement through parallelization is significantly higher than what Amdahl's Law might suggest. Gustafson's Law proposes that ==programmers tend to increase the size of problems to fully exploit the computing power that becomes available as the resources improve==.

Besides all the challenges associated with analyzing which parts of our algorithms are naturally sequential and which parts can be parallelized, multi-core systems also face other challenges to scalability: the ever-present memory wall, and energy consumption. The memory wall is due to an imbalance between memory bandwidth and latency on one side and processor speed on the other. ==In short: processors spend most of their time waiting, and most of that waiting is for memory.==

Last but not least is energy consumption. The power consumption of a processor grows with processor utilization. This consumed energy is transformed into heat that must be dissipated. Several studies showed that cooling can account for up to 40% of the energy consumed in a data center. To reduce this cost, the Power Usage Effectiveness (PUE) metric was introduced to estimate the efficiency of data centers. Roughly speaking, the PUE is the ratio between the total energy consumed by a data center and the energy devoted to computations. The closer the PUE is to 1, the better the data center. In such a context, it is important to keep the parallel efficiency of an algorithm high enough that the extra cores do not consume energy out of proportion to the speed-up they deliver, in the perspective of PUE minimization.

## Software and Supervisors (Aka Operating Systems)

By combining all the building blocks that we explored in previous paragraphs, engineers would eventually devise a digital machine of sorts whose behavior could be modified, that is, it would perform different arithmetic operations and data movements between parts of its microarchitecture according to commands, called instructions, stored in memory, giving way to machine code and CPU architectures. Corrado Böhm, in his Ph.D. thesis[^58], would conceive the foundations of the first compiler—which still lacked the name, as he called it "automatic programming"—with Böhm being one of the first computer science doctorates awarded anywhere in the world. The invention would appear as a way of coping with the natural lack of human readability of machine code.
The word "compiler" would eventually be coined by Grace Hopper, who would go on to implement the first compiler ever. Compilers would become more widely available and accelerate the process of developing the runtime behavior of CPUs, what we now call software. Not without creating some crisis in the process[^59].

With CPU cores proliferating, and with software rapidly increasing in complexity, it started to become problematic to port code from one architecture to another. Moreover, software programs were initially single-purpose monoliths; there was no concept of "application software". Processors were running big software blobs that included everything, from drivers to the parts that interacted with the user. Running any other application would still require the programmer to replicate the whole stack; there was little reuse.

> [!Info]
> *In the context of compilers, **self-hosting** refers to a compiler that is written in the same programming language it is designed to compile. Essentially, the language is used to implement itself. This concept is a significant milestone for programming languages because it demonstrates the language's robustness and maturity, as it is capable of expressing all the constructs needed to build its own compiler.*
> *For example, a C compiler written in C or a Python interpreter written in Python would be considered self-hosting. Achieving self-hosting often involves several stages: initially, the compiler for a language might be written in another, already existing language. Once the language is stable and functional enough to support its own features, the compiler can be rewritten in the same language, transitioning to self-hosting.*
> *Self-hosting is valuable because it validates the language's expressive power and often simplifies the maintenance and development of the compiler itself. However, achieving it can be complex, as it requires the language to already have sufficient features to support its own development.*

Trends started to change when the software industry began packing layers of common functionality into libraries and created sets of routines and services that aimed to ease the deployment of different applications. Through organized scheduling, two or more applications could run side by side by taking turns using the underlying hardware. Moreover, common libraries and services started to be portable across architectures. These system "stacks" were initially called supervisors; we still use them, only now we call them operating systems.

It's hard to think of life without operating systems. An OS performs a variety of critical tasks, including managing memory, scheduling execution, managing storage, and handling input/output operations. It ensures that different programs and users running simultaneously on the machine do not interfere with each other. The operating system is essential for managing the interaction with users, offering a way to interact with the computer, typically through a graphical user interface (GUI) or a command-line interface (CLI). It also oversees security, ensuring that unauthorized users cannot access the system, and manages file permissions to protect data integrity. There are different kinds of operating systems, using a variety of architectures, each with pros and cons. Operating systems, being software at the end of the day, cannot escape the typical [[Software#Software Architecture|software architectural styles]].
Operating systems may or may not incorporate a kernel (a portion of the architecture designed for handling critical tasks and orchestrating the services offered to user applications). When they do, the kernel may run together with the rest of the system in a single memory space, or the design may follow a "microkernel" approach. The basic idea behind the microkernel design is to achieve high reliability by splitting the operating system up into small, well-defined modules, only one of which—said microkernel—runs in kernel mode while the rest run as relatively powerless, ordinary user processes. In particular, by running each device driver and file system as a separate user process, a bug in one of these can crash that component but cannot crash the entire system. Thus, a bug in a driver may cause it to stop but will not crash the computer.

> [!tip]
> An obligatory read on all-things-operating-systems is *Tanenbaum, A. S., & Bos, H. (2015). Modern Operating Systems (4th ed.). Pearson.*

It shall be emphasized that operating systems remain an "optional" feature. Almost no one in their right mind would go without one, but in some cases one may choose to go "bare metal", that is, writing software the good old way, starting from a clean sheet. In bare-metal scenarios, the programmer must handle not only all operations related to data processing but also interrupt handling and the configuration of the underlying hardware. For very simple processors, this is not a complicated task. For modern processors, bare metal appears as an almost impossible scenario given that the complexity of the underlying hardware begs for the services operating systems bring.

### Protection

With operating systems growing in complexity, the idea of creating a structured way of assigning different levels of privileges to various parts of the system started to gain relevance. This would help in protecting sensitive system resources from user errors and malicious actions. The history of protection can be traced back to the early mainframe computers. One of the first implementations was in the Multics system, developed in the 1960s. Multics introduced a hierarchical structure for access control and was influential in shaping the design of subsequent operating systems.

One of the implementations of protection mechanisms is the ring model. The ring model was formalized and gained widespread recognition with the development of the x86 architecture by Intel, where it was implemented explicitly with multiple privilege levels. The architecture of x86 processors incorporated a scheme of four rings of protection, numbered from 0 to 3, with each ring representing a different privilege level. Ring 0, the most privileged, was reserved for critical system code like the kernel, while Ring 3 was for user-space applications with the least privileges. This design was intended to safeguard the system's stability and security by controlling the access levels of various parts of the operating system and the applications running on it.

![Privilege rings for the x86 available in protected mode](site/Resources/media/image215.png)

> [!Figure]
> _Privilege rings for the x86 available in protected mode_

- Ring 0, often referred to as "kernel mode," is where the operating system kernel resides. This level has full access to all hardware and system resources, allowing it to execute any CPU instruction and to reference any memory address.
- Ring 1 and Ring 2 are typically used less frequently.
They are often reserved for specific purposes, like device drivers or other system-level code that requires certain privileges but not full kernel access. These intermediate rings allow for a nuanced approach to system security, providing an environment that is more privileged than user applications but less privileged than the kernel. - Ring 3, commonly known as "user mode," is where user applications run. Applications in this ring have limited access to system resources and hardware. They cannot directly execute certain CPU instructions or access protected memory areas. The transition between these rings is tightly controlled and typically occurs through mechanisms like system calls. When a user application (in Ring 3) needs to perform an operation that requires higher privileges (like accessing hardware), it makes a system call, which securely transfers control to the operating system (in Ring 0) to operate on its behalf. This ring architecture helps ensure system stability and security by enforcing strict control over resource access and execution privileges. It prevents less trusted code from performing operations that could compromise the system, creating a more secure and stable operating environment. ### Single Memory Space/No protection In CPUs without the protection mechanisms described in the previous section and without sophisticated memory management, the separation between user space and kernel space has little merit. In this scenario, the line between application level and driver level blurs because there is one single space of execution where low-level and high-level routines coexist and run next to each other. This, in some way, is similar to the early "monolith" scenarios we discussed at the beginning of this section where software was a single entity containing everything. In designs using CPUs without protection and running Real-Time Operating Systems (RTOS), this is the architecture. Many RTOS are not complete operating systems in the strict sense of the term but mostly kernels providing basic services such as scheduling, inter-process communication (IPC), synchronization, resource locking, and timer services. RTOSes often do not define a driver model nor provide any I/O, networking, or filesystem services, therefore the kernel space concept has little merit. RTOSes do provide mechanisms to assign a small amount of memory per task, but there is no real active control in cases where a task oversteps its assigned section of memory into some other task. > [!warning] > This section is under #development ## Human-Computer Interface (HCI) Picture yourself sitting in front of a computer system. And imagine you want to interact with it. You may want to know its status or command it to do something you need. What's next? Well, we'll have to assume said system offers a User Interface (UI) that includes a mechanism for inputting what you want to do (a keyboard, a touchscreen, etc.), and a way of displaying outputs (a screen, some LED lights, some audible buzzer). This user interface, in the generic sense of the word, is typically called a shell. The concept of a shell in operating systems is integral to how users interact with the system. A shell is essentially a special interface for accessing the services of an operating system. It can come in two main forms: a command-line interface (CLI) or a graphical user interface (GUI). In the context of a shell, people often refer to the command-line type, but it's important to remember that GUIs are also a form of shell. 
The primary function of a shell is to interpret commands entered by the user. In a command-line shell, users type commands in text form and receive text responses. These commands are used to launch processes, manage files, and control other aspects of the operating system. The shell reads the command, interprets it, and then calls the operating system to execute it. This process involves looking up the location of the program to be executed, setting up any required inputs and outputs, and managing the program's execution.

Shells also provide a scripting language that can be used to automate tasks. A script is a series of commands that the shell can execute as a single new command. This capability is particularly powerful for system administrators and power users who need to perform complex or repetitive tasks efficiently. In addition to these basic functions, shells often provide various features to enhance user experience and efficiency. These include command history, tab completion, and command aliasing, which allow users to interact with the system more effectively.

On the other hand, graphical shells, like those found in the most popular operating systems, allow users to interact with the system through graphical icons and windows. These shells are often more user-friendly, especially for those who are not familiar with command-line syntax.

Operating a digital system that contains an operating system takes on another meaning when we are not able to sit in front of the shell and the human-computer interface stretches across a long physical distance, with unreliable links in between. Imagine a digital system that is remotely located. You are at a specific longitude and latitude, and the system is on the diametrically opposite side of the globe. Then, you have no keyboard, no screen, no possibility of observing any LED lights whatsoever, let alone hearing a beep; a wide air gap separates the user and the system. This is, in a nutshell, the challenge of administering remote systems that contain computers. It could be a server, a satellite orbiting a distant planet, a power plant in a distant location, an ultra-low-power IoT sensor gauging the humidity of a crop in a distant field, or a drone patrolling a border. How to interact with and manage distant systems when all we have is a noisy, unreliable channel that carries information as a stream of faint electromagnetic energy?

The type of tasks we would like to perform on this remote system are still pretty much the same as if the system were in front of us in an office:

- Analyzing system logs for identifying potential issues.
- Applying updates, patches, and configuration changes.
- Adding, removing, or updating passwords, encryption keys, etc.
- Tuning system performance.
- Configuring, adding, and managing memory contents and file systems.
- Performing tests, commissioning procedures, etc.

When digital systems adopt operating systems, part of the problem is (kind of) solved. Operating systems come with a rich set of pre-loaded commands and responses (`ls`, `nc`, `pwd`, `dd`, `chmod`, and the like), so system developers and operators do not have to reinvent the wheel and can focus on application-level behavior. Operating systems have also undergone a good deal of standardization, with notable efforts such as the Portable Operating System Interface (POSIX). But—there's always a but—most operating systems expect commands and responses to be sent and received through user interfaces.
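To see why the "keyboard + screen" assumption runs so deep, it helps to look at how small the core of a command-line shell really is: it is a loop around standard input and standard output. The sketch below is a toy, not a real shell (no quoting, no pipes, no redirection, and the 16-argument cap is an arbitrary choice for this example); it only shows the read, interpret, execute, and wait cycle described above, built on the standard `fork()`, `execvp()`, and `waitpid()` calls.

```C
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_ARGS 16

int main(void)
{
    char line[256];

    for (;;) {
        /* 1. Prompt and read a command line from the user */
        printf("tinysh> ");
        fflush(stdout);
        if (fgets(line, sizeof(line), stdin) == NULL)
            break;                      /* EOF: leave the shell */

        /* 2. Interpret: split the line into program name + arguments */
        char *argv[MAX_ARGS + 1];
        int argc = 0;
        char *token = strtok(line, " \t\n");
        while (token != NULL && argc < MAX_ARGS) {
            argv[argc++] = token;
            token = strtok(NULL, " \t\n");
        }
        argv[argc] = NULL;
        if (argc == 0)
            continue;                   /* empty line: prompt again */
        if (strcmp(argv[0], "exit") == 0)
            break;                      /* a built-in command */

        /* 3. Ask the OS to execute it: fork a child and replace its image */
        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[0], argv);      /* searches PATH for the program */
            perror(argv[0]);            /* only reached if exec failed */
            _exit(127);
        } else if (pid > 0) {
            /* 4. Wait for the child and report the outcome */
            int status;
            waitpid(pid, &status, 0);
            if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
                fprintf(stderr, "exit status %d\n", WEXITSTATUS(status));
        } else {
            perror("fork");
        }
    }
    return 0;
}
```

Everything in this loop assumes an interactive text stream in both directions, which is precisely what a remote, noisy, asymmetric link struggles to provide.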
Then, when operating a remote digital system that contains an OS, we must replicate the "keyboard + screen" scenario somehow, because that's how shells work. If you happen to be on a good network, this is just a remote shell. Nothing too fancy, and it works well. Challenges appear when the remote system and the user are not on a good network but on links with high bit error rates and data rate asymmetries. When the channel is bad, shells are a suboptimal way of administering remote operating systems. One solution for this is to define more efficiently packed, specific commands with carefully defined responses. In short, human readability is abandoned in favor of reducing overhead and optimizing the use of channel bandwidth. But, because the standard commands in operating systems are designed for humans, an operator still needs to parse the commands' outputs to understand the status of their execution. There is a reality: making things more "human-eye friendly" tends to bring a good dose of informational redundancy. For example, informing about the quantity zero over a serial line for a human to interpret takes 8 bits (the ASCII code 0x30), whereas a computer can manage with just 1 bit[^60]. Operating systems and their interfaces to the outside world remain highly human-oriented.

Needless to say, an operating system cannot predict the behavior that underlies the system it's installed on. A fuel pump, an electric scooter, a drone, a radar, and a rover exploring Mars all have very different architectures and purposes, and therefore they interact very differently with the environment. Even if all those systems run the same operating system distribution and version, the system designers must create their own application software and system-specific commands to alter and gauge the execution of the state machine that models the underlying behavior. How to define all these application-specific details? There is no real recipe for this; it's up to the designers and is largely system-dependent. This also means there is very little standardization when it comes to defining commands and responses to handle a remote system's time evolution.

To summarize, although digital systems share a lot in common—after all, they all carry CPUs, memories, operating systems, data interfaces, and application software—the fact that their underlying behavior is largely architecture- and application-dependent requires creating unique sets of commands and responses to manage that unique behavior. What's more, when operating remote systems over unreliable, noisy channels, the designers' choices of command and response structure become critical to ensure reliable operation.

## Emulation and Virtualization

Having discussed ISAs, CPU cores, and operating systems, it's a good time to make an observation: software is ultimately a collection of opcodes that encode behavior that an underlying architecture must execute: loading data, moving data, jumping. That being said, nowhere does it say that said behavior must be provided strictly by hardware. Such behavior could be provided by software, or by mechanical pulleys, for that matter. CPU instructions will execute anywhere, provided there is an underlying structure that unpacks the opcodes and provides the behavior the instructions expect. Picture a basic software routine running on the simplistic [[Semiconductors#Instruction Set Architectures (ISAs)|ISA we defined before]] where we count from 0 to 9.
If we wrote a small program in C that could read the opcodes/instructions from a file and "act" like it is a CPU core implementing our basic ISA and behaving accordingly, then said routine would still produce the same result and have no way of telling it was executing on a pure software impersonator. This is the basic principle behind emulation, and it proved to be a powerful tool. Emulators do precisely that: impersonate a CPU. Emulators ensure that they will be able to replicate, for each instruction in a given instruction set, all the behavior every instruction expects, therefore creating a software layer that will mimic the targeted CPU microarchitecture to the last beat, including the register file, the ALU, the MMU, the data paths, caches, the pipeline, etc. In an emulator, instruction execution involves a dispatch loop that calls appropriate procedures to update these data structures for each instruction. Examples of this are all the old video game console emulators out there giving us the chance to play video games from our childhood. Emulators are good choices when needing to replicate old, obsolete processors. There is, though, a big disadvantage with emulators: they are horrendously slow. There is a lot of work to perform for every emulated instruction at each emulated execution cycle. When emulating old, simple hardware, this is not necessarily a problem: A modern CPU in a laptop can do a good job mimicking a 1 MHz 6502 CPU doing 2-3 clock cycles per instruction. However, the method is less convincing when emulating more powerful hardware; the performance declines abruptly. The advantage of true emulators is that they are beautifully self-contained and thus the emulated software can run anywhere, as it requires no support from the underlying operating system more than the usual system calls used in any user space application. ![Emulator](image216.png) > [!Figure] > _Emulator block diagram. An emulator creates a completely artificial computing environment which is isolated from the host system_ > [!Storytime] > Around 12 years ago, I was part of a project that sought to create a flight simulator for training satellite operators. The idea was that the simulated computers would run the exact same binaries as in flight. One challenge popped up when one of the computers on board the satellite used a somewhat dated processor whose architecture was not ARM or x86, therefore all the known emulators and simulators were not applicable. >I embarked on the task of creating a cycle-accurate emulator for this eccentric processor. I chose C++, and I had to painstakingly read, page by page, the chip's Technical Reference Manual, implementing instructions and their runtime behavior, the instruction pipeline, the microarchitecture. The processor was a DSP so it executed a lot of things in one single instruction. >As the emulator was coming into life, an unsettling feeling assaulted me: one single bit wrong in my implementation and the program flow could be totally different. I realized it would require a deep verification, instruction per instruction. Cycle-accurate emulators are no joke and formal verification of CPU simulators is a huge topic in itself. ### Dynamic Translation A trick to make emulators faster is to use dynamic translation. Dynamic translation involves translating the guest's machine instructions into a form that the host architecture can execute natively. Dynamic translators do this at runtime, translating instructions on the fly. 
The emulator achieves this by monitoring the guest code during execution and translating the instructions into host instructions dynamically. To speed things up, the emulator will cleverly cache frequently used instruction sequences and translate them once before storing them in a cache. If these sequences are encountered again, the emulator will directly use the translated code from the cache, avoiding re-translation. This kind of emulator may also use just-in-time (JIT) compilation, which is a form of dynamic translation where the guest code is compiled into host machine code just before it is executed, improving performance. Needless to say, ==implementing an efficient dynamic translator can be complex, especially when dealing with intricate architectural differences between the guest and host.== QEMU[^61] was initially an emulator which later on incorporated dynamic translation to efficiently emulate different architectures. For instance, QEMU can emulate ARM architecture on an x86 host, translating ARM instructions to x86 instructions dynamically. QEMU translates guest code into intermediate code using a mechanism called Tiny Code Generators (TCG) which are like tiny Instruction Set Architectures in their own right[^62], and then the TCG code is translated into the host's code. It is worth taking a little time to describe TCG in more detail. Originally QEMU worked as a "template" translator where each individual instruction in the guest architecture had a snippet of C code associated with it. The translation was a case of stitching these templates together into larger blocks of code. This meant porting QEMU to a new system was relatively easy. However, eventually, the limits of this approach required moving to a new code generator and TCG was born. TCG has its roots as a generic back end for a C compiler. The main difference is instead of converting an abstract syntax tree from a high-level language into micro-operations, its input is the decomposed operations of an individual instruction. A simplified version might look like this: ```C static void disas_add_imm(DisasContext *s, uint32_t insn) { /* Decode Instruction */ int rd = extract32(insn, 0, 5); int rn = extract32(insn, 5, 5); uint64_t imm = extract32(insn, 10, 12); /* Allocate Temporaries */ TCGv_i64 tcg_rn = cpu_reg_sp(s, rn); TCGv_i64 tcg_rd = cpu_reg_sp(s, rd); TCGv_i64 tcg_result = tcg_temp_new_i64(); /* Do operation */ tcg_gen_addi_i64(tcg_result, tcg_rn, imm); tcg_gen_mov_i64(tcg_rd, tcg_result); /* Clean-up */ tcg_temp_free_i64(tcg_result); } ``` The decode step involved dissecting the various fields of the instruction to work out what registers and immediate values are needed. The operation is synthesized from TCG operations which are the basic units of the code generator. After a simple optimization pass, these ops are then converted into host instructions and executed. The original implementation of QEMU was single-threaded and although the user-mode emulation followed the threading model of the programs it translated, it was suboptimal in its behavior. The process of converting QEMU to a fully multi-threaded software was a multi-year effort involving contributions from many different sections of the QEMU community. Now Multi-Threaded TCG (MTTCG) is the default architecture. Nowadays QEMU has considerably grown and remains a very active open source project and it offers virtualization through different accelerators. In QEMU, virtualization capabilities are tied to the host OS and the hardware architecture. 
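Stepping back from QEMU for a moment to the plain, non-translating emulator: the sketch below shows the per-instruction dispatch loop mentioned earlier in its most naive form, applied to a made-up counting program. The opcodes and their encoding are invented for this example (they are not the encoding of the ISA defined earlier, nor anything resembling TCG); the point is only the structure: fetch an opcode, switch on it, update the machine state, repeat. This per-instruction overhead is exactly what dynamic translation amortizes by caching blocks of already-translated host code.

```C
#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcodes and encoding, invented for this sketch only */
enum { OP_LOADI = 0x01, OP_ADDI = 0x02, OP_PRINT = 0x03,
       OP_BNE   = 0x04, OP_HALT = 0xFF };

typedef struct {
    uint8_t  reg[4];    /* a tiny register file */
    uint16_t pc;        /* program counter      */
    int      halted;
} Cpu;

int main(void)
{
    /* "Guest" program: count from 0 to 9, then stop.
       offset  0: LOADI r0, 0
       offset  3: PRINT r0
       offset  5: ADDI  r0, 1
       offset  8: BNE   r0, 10, 3   (loop back to PRINT while r0 != 10)
       offset 12: HALT                                                  */
    const uint8_t program[] = {
        OP_LOADI, 0, 0,
        OP_PRINT, 0,
        OP_ADDI,  0, 1,
        OP_BNE,   0, 10, 3,
        OP_HALT
    };

    Cpu cpu = { .pc = 0, .halted = 0 };

    while (!cpu.halted) {
        uint8_t op = program[cpu.pc++];            /* fetch */
        switch (op) {                              /* decode + dispatch */
        case OP_LOADI: {
            uint8_t r = program[cpu.pc++];
            cpu.reg[r] = program[cpu.pc++];
            break;
        }
        case OP_ADDI: {
            uint8_t r = program[cpu.pc++];
            cpu.reg[r] += program[cpu.pc++];
            break;
        }
        case OP_PRINT: {
            uint8_t r = program[cpu.pc++];
            printf("%d\n", cpu.reg[r]);
            break;
        }
        case OP_BNE: {
            uint8_t r      = program[cpu.pc++];
            uint8_t value  = program[cpu.pc++];
            uint8_t target = program[cpu.pc++];
            if (cpu.reg[r] != value)
                cpu.pc = target;
            break;
        }
        case OP_HALT:
            cpu.halted = 1;
            break;
        default:
            fprintf(stderr, "illegal opcode 0x%02X\n", (unsigned)op);
            return 1;
        }
    }
    return 0;
}
```

A real emulator replaces the `Cpu` struct with the full architectural state (register file, flags, MMU, caches, pipeline state, and so on) and the `switch` with decoders for every instruction of the guest ISA, which is where the per-instruction cost piles up.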
![QEMU dynamic translation and Tiny Code Generators (TCG)](site/Resources/media/image217.png) > [!Figure] > _QEMU dynamic translation and Tiny Code Generators (TCG)_ ### Hypervisors/Virtual Machine Monitors (VMM) A different approach is to use what are called hypervisors or Virtual Machine Monitors (VMM), which lay the foundations of virtualization technology by allowing multiple operating systems to run on a single physical machine. A hypervisor, also known as a Virtual Machine Monitor (VMM), is a layer of software, firmware, or hardware that creates and manages virtual machines (VMs). In essence, a hypervisor allows multiple operating systems to share a single hardware host. There are two types of hypervisors: - Type 1 (Bare-Metal Hypervisors): These run directly on top of the host's hardware to control the hardware and manage guest operating systems. Examples include VMware ESXi, Microsoft Hyper-V, and Xen. - Type 2 (Hosted Hypervisors): These run on a conventional operating system just like other computer programs. Examples include VMware Workstation and Oracle VirtualBox. The seminal paper titled "Formal Requirements for Virtualizable Third Generation Architectures" by Gerald J. Popek and Robert P. Goldberg, published in 1974, laid down the theoretical foundation for virtualization. As per the definition of Popek and Goldberg, a virtual machine is taken to be an efficient, isolated duplicate of the real machine. The paper explains this notion through the idea of a virtual machine monitor (VMM). As a piece of software, a VMM has three essential characteristics: - Fidelity: The VMM provides an environment for programs that is essentially identical to the original machine; - Performance: Programs run in this environment show at worst only minor decreases in speed. - Isolation and Safety: The VMM is in complete control of system resources. By an "essentially identical" environment, the first characteristic is meant the following. Any program run under the VMM should exhibit an effect identical to that demonstrated if the program had been run on the original machine directly, with the possible exception of differences caused by the availability of system resources and differences caused by timing dependencies. The latter qualification is required because of the intervening level of software and because of the effect of any other virtual machines concurrently existing on the same hardware. The former qualification arises, for example, from the desire to include in the definition the ability to have varying amounts of memory made available by the virtual machine monitor. The identical environment requirement excludes the behavior of the usual time-sharing operating system from being classed as a virtual machine monitor. The second characteristic of a virtual machine monitor is efficiency. It demands that a statistically dominant subset of the virtual processor's instructions be executed directly by the real processor, with no software intervention by the VMM. This statement rules out the traditional emulators we discussed before and complete software interpreters (simulators) from the virtual machine umbrella. The third characteristic, resource control, labels as resources the usual items such as memory, peripherals, and the like, although not necessarily processor activity. 
The VMM is said to have complete control of these resources if:

- It is not possible for a program running under it in the created environment to access any resource not explicitly allocated to it;
- It is possible, under certain circumstances, for the VMM to regain control of resources already allocated.

A virtual machine is the environment created by the virtual machine monitor. This definition is intended not only to reflect generally accepted notions of virtual machines but also to provide a reasonable environment for proof. A VMM as defined is not necessarily a time-sharing system, although it may be. However, the identical-effect requirement applies regardless of any other activity on the real computer, so that isolation, in the sense of protection of the virtual machine environment, is meant to be implied. This requirement also distinguishes the virtual machine concept from virtual memory. Virtual memory is just one possible ingredient in a virtual machine, and techniques such as segmentation and paging are often used to provide virtual memory. The virtual machine effectively has a virtual processor, too, and possibly other virtual devices.

![Where a hypervisor can live](site/Resources/media/image218.png)

> [!Figure]
> _Where a hypervisor can live_

More modern terminology calls VMMs _hypervisors_; the two terms are synonyms. A hypervisor is a layer of software between the hardware and the virtual machine. Without a hypervisor, an operating system communicates directly with the hardware beneath it: storage operations go directly to the disk subsystem, and memory calls are handled directly by the Memory Management Unit. Without a hypervisor, more than one operating system from multiple virtual machines would request simultaneous control of the hardware, which would result in unreliable operation. The hypervisor manages the interactions between each virtual machine and the hardware that the guests share.

#### Type 1 Hypervisor (Bare Metal)

Type 1 hypervisors, also known as bare-metal hypervisors, run directly on top of the host's hardware and manage guest virtual machines running guest operating systems. Type 1 hypervisors have direct access to the physical hardware, and one of the key selling points of this type of hypervisor is its ability to provide strong isolation between virtual machines. This means that if one VM crashes, it doesn't affect the others, offering a high degree of reliability and security. They also allow for better resource allocation and management, as the hypervisor can control and orchestrate access to the underlying hardware resources such as CPU, memory, and storage among the various VMs based on their needs.

![Type 1 Hypervisor](site/Resources/media/image219.png)

> [!Figure]
> _Type 1 Hypervisor_

#### Type 2 Hypervisor

Type 2 hypervisors run as a software layer on top of an existing operating system, rather than directly on the hardware. This setup allows users to create and run virtual machines (VMs) on a host system that already has its own operating system. An important characteristic of Type 2 hypervisors is that they are installed on the host operating system like any other software application, so they must share computing resources with those other applications. Examples of Type 2 hypervisors include VMware Workstation and Oracle VirtualBox. Because they operate within a host OS, Type 2 hypervisors are generally easier to install and use, making them popular for personal use, testing, and development environments.
However, the presence of an underlying operating system introduces an additional layer between the virtual machine and the physical hardware, which can lead to decreased performance compared to Type 1 hypervisors, because Type 2 hypervisors must go through the host OS to access hardware resources, which can create overhead and latency. Type 2 hypervisors are particularly useful for scenarios where ease of use, flexibility, and convenience are more important than achieving the highest levels of performance and efficiency. They allow users to run different operating systems on their existing hardware without needing to reboot or partition the hard drive, offering a practical solution for running multiple environments on a single machine. Despite their performance limitations compared to Type 1 hypervisors, Type 2 hypervisors remain a popular tool for a wide range of virtualization needs. ![Type 2 Hypervisor](site/Resources/media/image220.png) > [!Figure] > _Type 2 Hypervisor_ #### How Hypervisors Work In a nutshell, a hypervisor decouples an operating system from the hardware. The hypervisor becomes the transporter and broker of resources to and from the virtual guests it supports. It achieves this capability by fooling the guest operating system into believing that the hypervisor is actually the hardware. To understand how a virtual machine works, one needs to look more closely at how virtualization works. Without going too far, let's examine how a native operating system manages hardware resources. When a program needs some data from a file on a disk, it makes a request through a program language command, such as an `fgets()` in C, which gets passed through to the operating system. The operating system has filesystem information available to it and passes the request on to the correct device driver in kernel space, which then works with the physical disk I/O controller and storage device to retrieve the proper data. The data comes back through the I/O controller and device driver where the operating system returns the data to the requesting program. Requesting data from disk requires many other operations to happen, like memory block transfers, scheduling, probably PCI transfers, etc. As we discussed when we talked about [[Semiconductors#Software and Supervisors (Aka Operating Systems)|operating systems]], CPU cores tend to add layers of safety measures to prevent wrongful execution of instructions from corrupting the applications or the operating system. Applications themselves cannot execute processor instructions directly. Those requests are passed through the levels via system calls, where they are executed on behalf of the application, or they throw an error because the request would violate a constraint. If a program with the right rank wants to affect some hardware state, it does so by executing privileged instructions in ring 0. A shutdown request would be one example of this. A hypervisor runs in ring 0, and the operating systems in the guests believe that they run in ring 0. When a guest OS does not have the required privileges to run a particular instruction, a trap will happen. Remember that operating systems running on top of a hypervisor do not know that they are running in user mode; i.e., they can only execute a limited set of instructions, since they do not have full control over the hardware. 
Then, when the guest operating system executes some privileged instruction (an instruction that can only be run in kernel mode or some other privileged mode), that instruction creates a trap into the hypervisor, which then emulates the functionality the guest OS expects. If a guest wants to issue a shutdown, the hypervisor intercepts that request and responds to the guest, indicating that the shutdown is proceeding so the operating system can continue through its steps to complete the software shutdown. If the hypervisor did not trap this command, any guest would be able to directly affect the resources and environment of all of the guests on a host, which would violate the isolation rule of Popek and Goldberg's definition. Like a native operating system that manages concurrent program requests for resources, hypervisors abstract one layer further and manage multiple operating systems' requests for resources. Note that guest operating system complexity is a factor that drives hypervisor complexity.

#### Paravirtualization

We said right above that hypervisors are in charge of "fooling" operating systems to make them believe they have full access to the hardware. There are, though, certain challenges in this approach. Because the guest operating systems think they fully control the hardware, they will issue syscalls and I/O operations like there is no tomorrow, which the hypervisor will have to manage. But if operating systems could be aware that there is a busy hypervisor below and cooperate with it, things could be more efficient: the work of organizing all interactions with the hardware would be less hectic. This is called paravirtualization. ==In paravirtualization, the guest OS is aware of the hypervisor and collaborates and interacts with it.== This awareness allows the guest OS to make calls to the hypervisor for certain operations, like memory management and I/O processing, which in full virtualization would be handled by emulating hardware. This interaction is facilitated through a special API (such calls are often referred to as hypercalls). The key advantage of paravirtualization is, as we said, efficiency. Since the guest OS is cooperating with the hypervisor, it reduces the overhead typically associated with hardware emulation. As a result, paravirtualized environments tend to offer better performance than fully virtualized ones, particularly in terms of I/O operations and system calls. However, the need for the guest OS to be aware of the hypervisor also brings a notable limitation: operating systems must be modified or specifically designed to run in a paravirtualized environment. This means that not all operating systems can be used as-is in such setups, limiting flexibility in some cases.

#### Virtualization in Mission-Critical Digital Systems

Mission-critical applications—anything where [[Dependability, Reliability, and Availability|failure]] could be catastrophic—tend to be designed in a conservative manner, and for valid reasons. Surprisingly, among these applications, the aerospace industry appears less conservative than one would think. When it comes to exploring new architectural approaches, and despite the stringent safety requirements that permeate the industry, aerospace leads the way. The main reason for this flexibility is the never-ending quest for better fuel efficiency, which in the end means weight reduction. In the last fifteen or twenty years, avionics architectures have shifted paradigms considerably.
Federated architectures (one computer assigned to one function), which were popular up to the end of the last century, are being replaced by a different approach called Integrated Modular Avionics (IMA). Some point out that the IMA concept originated in the United States with the F-22 and F-35 fighters and later migrated to the commercial airliner sector. Others say the modular avionics concept has been used in business jets and regional airliners since the late 1980s or early 90s. Either way, the modular approach is also seen in military tankers and transport aircraft, such as the KC-135 and C-130, as well as in the Airbus A400M.

In a federated architecture, a system's main function is decomposed into smaller blocks realized by units that provide more specific functions. Each box—often called a Line Replaceable Unit (LRU) or more recently [[Units, Chassis and Racks#ARINC-600|MCU]]—contains the hardware and software required for it to provide its function. In the federated concept, each new function added to an avionics system requires the addition of new LRUs/MCUs. This means that there is a linear correlation between functionality and mass, volume, and power; i.e., every new functionality proportionally increases all these factors. What is more, for every new function added to the system there is a consequent increase in multidisciplinary configuration control efforts, updates, iterations, etc. This approach quickly met a limit. The aerospace industry understood that the classical concept of "one function maps to one computer" could no longer be maintained.

To tackle this issue, the IMA concept emerged. Exploiting the fact that software does not weigh anything in and of itself, IMA allowed for retaining some advantages of the federated architecture, like fault containment, while decreasing the overhead of separating each function physically from the others. The main architectural principle behind IMA is the introduction of a shared computing environment that hosts functions from several LRUs. This means the function does not directly map 1:1 to the physical architecture; one physical computing unit (CPU) can share its computing resources to execute more than one function.

A contradiction surrounds the IMA concept. It could be argued that IMA proposes an architecture that technology has already rendered obsolete: centralized architectures. With embedded processors, memories, and other devices becoming more reliable and less expensive, surely this trend should favor *less* rather than more centralization. Thus, following this argument, a modern avionics architecture should be more, not less, federated, with existing functions "deconstructed" into smaller components, each having its own processor. There is some plausibility to this argument, but the distinction between the "more federated" architecture and centralized IMA proves to be debatable on closer inspection. A federated architecture is one whose components are very loosely coupled, meaning that they can operate largely independently. However, the different elements of a function are usually rather tightly coupled, so the deconstructed function would not be a federated system so much as a *distributed* one, meaning a system whose components may be physically separated, but which must closely coordinate to achieve a collective purpose.
Consequently, a conceptually centralized architecture will be, internally, a distributed system, and the basic services that it provides will not differ in a significant way from those required for the more federated architecture. Squeezing more functionality per unit of mass was too attractive for aerospace to pass up. IMA quickly gained traction, and its success also caught the attention of the space industry.

![Federated architecture (form follows function)](site/Resources/media/image221.png)

> [!Figure]
> _Federated architecture (form follows function)_

Combining multiple functions in one processing/computing environment requires specific considerations. The physical separation that existed between LRUs in the federated style must still be provided for applications running on the same computing environment.

![A more integrated architecture, where form stops following function](site/Resources/media/image222.png)

> [!Figure]
> _A more integrated architecture, where form stops following function_

In the IMA concept, input/output functionalities are provided by I/O modules. These I/O Modules (or IOMs) interface with sensors/actuators, acting as a bridge between them and core processing modules (CPMs). A core processing module that also contains I/O capabilities is called a Core Processing Input/Output Module (CPIOM). If the architecture does not make use of CPIOMs, the I/O layer remains as *thin* as can be. Removing all software from the IOMs removes complexity and therefore considerably reduces verification and configuration control costs.

The physical separation that was inherent to the LRUs in the federated architecture must, from a software point of view, be virtually enforced in an IMA platform. The performance of each application shall be unaffected by the presence of others. This separation is provided by partitioning the common resources and assigning those partitioned resources to an application. The partitioning of processing power is enforced by strictly limiting the time each application can use the processor and by restraining its memory access. Memory access is controlled by hardware, thus preventing partitions from interfering with each other. Each software application is therefore partitioned in space and time.

Avionic software applications have different levels of [[The Quality of Quality#Criticality Levels|criticality]] based on the effects that a failure in a given application would cause to the system. Those criticality levels are specified in standards such as RTCA/DO-178C[^63], which defines five criticality levels (from A, catastrophic, to E, no safety effect). The software development efforts and costs grow exponentially with the criticality level required for certification, since the verification process becomes more complex. In an integral, non-partitioned architecture, all the software in a functional block has to be validated at the same criticality level. The IMA approach enables different software partitions with different criticality levels to be integrated on the same platform and certified separately, which eases the certification process. Since each application is isolated from the others, it is guaranteed that faults will not propagate, provided the separation "agent" can indeed enforce that isolation. As a result, it is possible to create an integrated system that has the same inherent fault containment as a federated system. For achieving such containment, J.
Rushby in "Partitioning in Avionics Architectures: Requirements, Mechanisms, and Assurance", specifies a set of guidelines: - Gold Standard for Partitioning: A partitioned system should provide fault containment equivalent to an idealized system in which each partition is allocated an independent processor and associated peripherals, and all inter-partition communications are carried on dedicated lines. - Alternative Gold Standard for Partitioning: The behavior and performance of software in one partition must be unaffected by the software in other partitions. - Spatial Partitioning: Spatial partitioning must ensure that software in one partition cannot change the software or private data of another partition (either in memory or in transit) nor command the private devices or actuators of other partitions. - Temporal Partitioning: Temporal partitioning must ensure that the service received from shared resources by the software in one partition cannot be affected by the software in another partition. This includes the performance of the resource concerned, as well as the rate, latency, jitter, and duration of scheduled access to it. The mechanisms of partitioning must block the spatial and temporal pathways for fault propagation by interposing themselves between software functions and the shared resources that they use. ##### ARINC-653 From the previous section, it is still not clear what is the entity in charge of guaranteeing the isolation for different applications running on top of a common computing environment. ARINC 653 defines a standard interface between the software applications and the underlying operating system. This middle layer is known as the application executive (APEX) interface. The philosophy of ARINC 653 is centered upon a time and space-partitioned operating system, which allows independent execution of different partitions. In ARINC 653, a partition is defined as portions of software specific to avionics applications that are subject to robust space and time partitioning. They occupy a similar role to processes in regular operating systems, having their own data, context, attributes, etc. The underlying architecture of a partition is similar to that of a multitasking application within a general-purpose computer. Each partition consists of one or more concurrently executing processes (threads), sharing access to processor resources based on the requirements of the application. An application partition is limited to using the services provided by the APEX defined in ARINC 653 while system partitions can use interfaces that are specific to the underlying hardware or platform. ![Building blocks of a partitioned system](site/Resources/media/image223.png) > [!Figure] > _Building blocks of a partitioned system_ Each partition is scheduled to the processor on a fixed, predetermined, cyclic basis, guaranteeing temporal segregation. The static schedule is defined by specifying the period and the duration of each partition's execution. The period of a partition is defined as the interval at which computing resources are assigned to it while the duration is defined as the amount of execution time required by the partition within one period. The periods and durations of all partitions are used to compose a major time frame. The major time frame is the scheduler's basic unit that is cyclically repeated. Each partition has predetermined areas of memory allocated to it. 
These unique memory areas are identified based on the requirements of the individual partitions and vary in size and access rights. The expected benefits of implementing IMA solutions are:

- Optimization and savings in mass, volume, and power consumption;
- Simpler Assembly, Integration, and Verification (AIV) activities due to a smaller number of physical units and a simpler harness;
- Focused development efforts: the developer can focus wholly on their software, instead of focusing on the complete development of an LRU;
- Retaining federated system properties like fault containment;
- Incremental validation and certification of applications.

IMA has already been applied in the development of several aircraft, most notably the A380 and the Boeing 787. The Airbus and Boeing programs reported savings in terms of mass, power, and volume of 25%, 50%, and 60%, respectively[^64]. IMA has eliminated 100 LRUs from the Boeing 787[^65].

We said at the beginning of this section that the space industry is also looking at IMA now. There are, though, some fundamental differences between the space and aeronautical domains. The IMA approach, as used in the aeronautics domain, cannot be used directly in the space domain. The reasons are:

- The processing platforms used in space are usually less powerful than the platforms in aeronautics, i.e., there is a technology gap. This gap is narrowing though, with more modern processors and system-on-chips being increasingly adopted in space.
- There are strong requirements that hardware and software modules already developed for the current platform shall remain compatible with any new architecture, to keep the costs of the architecture transition low.

Space systems are constrained in terms of available power, mass, and volume. When compared with commercial aviation, the space industry lacks standardization. Each major player in the industry designs and operates its digital systems using its own internal principles. The European Space Agency, together with the European Cooperation for Space Standardization[^66] (ECSS, which includes ESA as a member), has invested a great amount of effort in the last decades to standardize space engineering across Europe. ESA has defined[^67] several ground rules that guide the IMA for space specification, adapting IMA to space without totally breaking with the current space avionics approach.

To reconcile the use of an operating system with the requirement to implement time and space partitioning, ESA defined a two-level software executive. The system executive level is composed of a software hypervisor that segregates computing resources between partitions. This hypervisor is responsible for the robust isolation of the applications and for implementing the static CPU allocation schedule. A second level, the application level, is composed of the user's applications, each running in an isolated environment (a partition). Each application can implement a system function by running several tasks/processes. The multitasking environment is provided by a paravirtualized operating system that runs in each partition. These partition operating systems (POS) are modified to operate along with the underlying hypervisor. The hypervisor supplies a software interface layer to which these operating systems attach. In the context of IMA for Space, ESA selected RTEMS as the main partition operating system. Three hypervisors are currently being evaluated by ESA as part of the IMA-SP platform: XtratuM[^68], AIR[^69], and PikeOS[^70].
XtratuM is an open-source hypervisor, developed by the Universidad Politécnica de Valencia, available for the x86 and LEON architectures. Despite providing services similar to those of the ARINC 653 standard, XtratuM does not aim at being ARINC 653 compatible. PikeOS is a commercial microkernel that supports many APIs and virtualized operating systems. Finally, AIR is an open-source operating system developed by GMV and based on RTEMS. The IMA for Space specification also leaves an option to have partitions without an RTOS. These "bare metal" partitions can be used for very critical single-threaded code that doesn't require a full-featured real-time operating system.

##### I/O Challenges in Integrated Architectures

Robust partitioning demands that applications strictly use the time resources that have been reserved for them during the system design phase. I/O activities shall, hence, be scheduled for periods when the applications that use these specific capabilities are being executed. Furthermore, safety requirements may forbid some partitions from being interrupted by hardware during their guaranteed execution time slices. Consequently, it must be ensured that I/O devices have enough buffering capabilities at their disposal to store data during the time non-interruptible applications are running.

Segregating a data bus is harder than segregating memory and processor resources. I/O handling software must be able to route data from an incoming bus to the application to which that data belongs. General-purpose operating systems use network stacks to route data to different applications. In a virtualized and partitioned architecture, the incoming data must be shared not only with applications in the same partition but among partitions themselves. Each partition operating system could manage its own devices, but this is only feasible if devices are not shared among partitions. If a device is used by more than one partition, then there is the latent risk of one partition leaving the shared device in an unknown state, thereby influencing the other. In aeronautical IMA this problem is approached by using partitioning-aware data buses like [[High-Speed Standard Serial Interfaces#Avionics Full DupleX Switched Ethernet (AFDX)|AFDX]]. AFDX devices are smart in the sense that they can determine to which partition of an end system the data belongs. This point is critical since the I/O in the platform must be managed in such a way that the behavior of one partition cannot affect the I/O services received by another.

----

Where does the technical value of virtualization reside, in reality? Besides the fact that it might be impressive to be able to boot several operating systems on a single piece of hardware, or the fact that virtualization also streamlines costs, the most important technical impact of virtualization is that it breaks an important relationship in the way we design digital systems: one computer does not mean one function anymore. Although it does not sound too impressive, its implications are. Virtualization permits "squeezing" more functionality out of a constant amount of hardware, and this means that you can add complex functions without increasing the hardware complexity. The implications of keeping hardware complexity constant are many: higher reliability, lower weight, and lower power consumption.

### Containers

Containers are yet another technology to isolate applications along with their entire runtime environment.
Technically, containers are an abstraction at the application layer that packages code and dependencies together. Multiple containers can run on the same machine and share the OS kernel with other containers, each running as an isolated process in user space. It is important to differentiate containers from traditional virtual machines (VMs). While VMs virtualize the hardware, containers virtualize the operating system. In a VM setup, each VM includes not only the application and the necessary binaries and libraries but also an entire guest operating system. This can be resource-intensive. Containers, on the other hand, share the host system's kernel and, where applicable, the binaries and libraries, which are read-only. This sharing makes containers much more lightweight and faster to start compared to VMs.

In essence, a container is more heavyweight than a process but less heavyweight than a VM. Like a process and unlike a VM, the container shares the OS with other containers running on the same host. But unlike a process, and akin to a VM, each container has its own private view of the network, including hostname and domain name, processes, users, filesystems, and inter-process communication.

![Containers vs Virtual Machines](site/Resources/media/image224.png)

> [!Figure]
> _Containers vs Virtual Machines_

The most popular containerization platform is Docker^[https://www.docker.com/], which provides tools to pack, distribute, and manage applications within containers. Docker containers are built from Docker images, which are the templates used to create containers. These images are created from a Dockerfile, a script that contains instructions for building the image. The benefits of using containers include:

- Portability: Since containers include everything needed to run an application, they can move between different environments (development, testing, production) with consistency.
- Efficiency: Containers are lightweight and require fewer resources than VMs, as they share the host system's kernel rather than requiring their own operating system.
- Isolation: Each container is isolated from others and from the host system, offering a level of security. Although they share the kernel, they are separated from each other and can have different user-space libraries and binaries as required. Still, containers can easily communicate with the host system by means of shared folders and the same IPC (inter-process communication) mechanisms used by the host operating system, such as shared memory, semaphores, etc. This is a significant advantage compared to virtual machines.
- Scalability and Management: Containers can be easily scaled up and down. Containers have given rise to tools for their automated deployment and scaling.
- Rapid Deployment: Containers can be started, stopped, and replicated quickly and easily.

On the disadvantages of using containers, one can mention a tendency to "over-containerize" architectures that could perfectly well be implemented with good old processes sharing the same operating system, either real or virtual. This tendency to over-containerize has given rise to intricate orchestration schemes and very complex deployment environments.

# Microprocessor Devices

A CPU core by itself is unusable.
For a core to be able to perform in a board-level design, it requires interfacing with memories and a set of internal peripherals that allow the core(s) to interact with the outside world: standard data interfaces and general-purpose input/output (GPIO), but also timers, counters, watchdogs, interrupt controllers, multimedia peripherals, multi-gigabit transceivers, sophisticated DMA schemes, and encryption/security cores. Microprocessors are typically used for automatic control applications, radar, and telecommunications, and tend to use complex packages and power management.

![i.MX 91XX microprocessor block diagram. Credit: NXP](site/Resources/media/image225.png)

> [!Figure]
> _i.MX 91XX microprocessor block diagram. Credit: NXP_

> [!Note]
> How are microcontrollers programmed in volume?
>
> For high-volume production (~200k units/year), companies typically set up automated jigs that access the boards through programming pins, where scripts invoke the device's programming utility. For even higher volumes (~1 million units/year), companies may use reel-to-reel programmers, which unreel the devices, program them, and put them on a second reel so the [[Printed Circuit Boards#SMT Line Layout|SMT line]] can place them preprogrammed.

> [!warning]
> This section is under #development

## Microcontrollers

Historically, in microprocessors the core buses are exposed to the outside so that the designer can hook up memories or other external devices and manage them in a memory-mapped fashion. That means (simplifying things quite a bit here) that any device hooked to the CPU bus can be accessed by the application software at a position in memory as per the [[Semiconductors#Memory Map|memory map]], either by using memory data movement instructions or specific I/O instructions. In a system of this kind, the selection of the device to be accessed by the MPU depends on address-decoding (glue) logic built from the address lines exposed by the microprocessor. In general, this approach requires more board real estate, since all the peripherals the processor will access are hosted in their own packages. Microprocessors will most likely also include [[High-Speed Standard Serial Interfaces|high-data rate interfaces]] using gigabit transceivers and complex power management, which makes integrating microprocessors into a design a rather complex adventure.

Microcontrollers appear as a good alternative to reduce some of this complexity by integrating many resources on-chip. Typically, this flexibility is paid for with lower performance compared to an MPU-based system. We will discuss this in more detail when we discuss [[Semiconductors#System-on-Chips|System-on-chips]], but we can already say that microcontrollers are the ancestors of SoCs, for they employ a very similar philosophy: integrating more building blocks on the chip. Microcontrollers typically incorporate not only a core but also on-chip RAM, Flash, ADC and DAC converters, timers, counters, and a breadth of standard interfaces like CAN, USB, I2C, SPI, and in some cases even an Ethernet MAC (the Stellaris family of MCUs from TI also integrates a PHY).
![Stellaris family of MCUs incorporates an on-chip Ethernet PHY (credit: Texas Instruments)](site/Resources/media/image227.png)

> [!Figure]
> _Stellaris family of MCUs incorporates an on-chip Ethernet PHY (credit: Texas Instruments)_

![LPC122x Microcontroller (credit: NXP)](site/Resources/media/image228.png)

> [!Figure]
> _LPC122x Microcontroller (credit: NXP)_

Microcontrollers find extensive use in industrial applications where timing and determinism are key, for instance in motor control, access control, HVAC, home appliances, and automation, but also in portable, low-power applications like wireless sensors, IoT devices, and the like.

> [!warning]
> This section is under #development

## Digital Signal Processing

It is hard to classify Digital Signal Processors (DSPs), mainly because they are microprocessors in their own right, so all we said about CPU cores and microprocessors also applies to DSPs. But, to be fair, DSPs are a special kind of microprocessor. A Digital Signal Processor (DSP) is a microprocessor purposely designed for efficiently performing mathematical operations on discrete signals. A discrete signal is a collection or array of "snapshots" of a continuous-time signal. This original signal could be a voltage, an audio signal, some signal from a sensor, or a carrier in a communications system. To see how DSPs are conceived for managing discrete-time signals, we must first explore what digital signal processing is about.

Continuous-time signals are defined along a continuum of times and thus are represented by a continuous independent variable. Continuous-time signals are often referred to as analog signals. Discrete-time signals are defined at discrete times, and thus the independent variable has discrete values; i.e., discrete-time signals are represented as sequences of numbers. Signals such as speech or images may have either a continuous- or a discrete-variable representation, and if certain conditions hold, these representations are entirely equivalent. Besides the independent variable being either continuous or discrete, the signal amplitude may be either continuous or discrete. Digital signals are those for which both time and amplitude are discrete.

Discrete-time signals are represented mathematically as sequences of numbers. A sequence of numbers $x$, in which the nth number in the sequence is denoted $x\lbrack n\rbrack$, is formally written as:

$\large x\ = \ \{ x\lbrack n\rbrack\},\ - \infty < n < \infty$

where $n$ is an integer. In a practical setting, such sequences can often arise from periodic sampling of an analog signal. In this case, the numeric value of the nth number in the sequence is equal to the value of the analog signal, $x_{c}(t)$, at time $nT$:

$\large x\lbrack n\rbrack = x_{c}(nT),\ - \infty < n < \infty$

The quantity $T$ is called the sampling period, and its reciprocal is the sampling frequency, $f_{s} = 1/T$, given in samples per second. Although sequences do not always arise from sampling analog waveforms (they can also be devised analytically), it is convenient to refer to $x\lbrack n\rbrack$ as the "nth sample" of the sequence. Also, although, strictly speaking, $x\lbrack n\rbrack$ denotes the nth number in the sequence, the notation is often unnecessarily complicated, and it is convenient and unambiguous to refer to "the sequence $x\lbrack n\rbrack$" when we mean the entire sequence, just as we refer to the "analog, continuous-time signal $x_{c}(t)$".
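As a concrete illustration of $x\lbrack n\rbrack = x_{c}(nT)$, the short program below evaluates a 50 Hz cosine at $f_{s} = 1000$ samples per second and stores the result in a plain array, which is exactly the buffer-in-memory view of a sequence. The signal, the rates, and the buffer length are arbitrary choices made for this example; in a real system the samples would come from an ADC rather than from a formula.

```C
#include <math.h>
#include <stdio.h>

#define N 16                     /* number of samples to keep (arbitrary) */

int main(void)
{
    const double pi = 3.14159265358979323846;
    const double f0 = 50.0;      /* analog frequency of x_c(t), in Hz    */
    const double fs = 1000.0;    /* sampling frequency, in samples/second */
    const double T  = 1.0 / fs;  /* sampling period                       */
    double x[N];                 /* the discrete-time sequence x[n]       */

    /* x[n] = x_c(nT), with x_c(t) = cos(2*pi*f0*t) */
    for (int n = 0; n < N; n++)
        x[n] = cos(2.0 * pi * f0 * n * T);

    for (int n = 0; n < N; n++)
        printf("x[%2d] = % .6f\n", n, x[n]);

    return 0;
}
```

Note that the array by itself carries no record of $f_{s}$: once sampled, the sequence is just numbers indexed by $n$.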
![Block diagram representation of an ideal continuous-to-discrete-time (C/D) converter](site/Resources/media/image229.png)

> [!Figure]
> _Block diagram representation of an ideal continuous-to-discrete-time (C/D) converter_

It should be quite visible by now how much the discrete-time sequence $x\left\lbrack n \right\rbrack$ looks like an array in a programming context. A discrete-time sequence can perfectly well be thought of as a buffer of a certain size allocated in memory. We will soon see that digital signal processing theory works with sequences of infinite length, and therein lies one of the main practical issues of digitally processing signals: we must truncate infinite sequences into finite ones because memory in processors is limited.

In a practical setting, the operation of sampling is implemented by an analog-to-digital (A/D or ADC) converter. Such systems can be viewed as approximations to the ideal C/D converter from the figure above. Important considerations in the implementation or choice of an ADC include quantization of the output samples (how many bits are used to represent a given voltage), linearity of the quantization steps, the need for sample-and-hold circuits, and limitations on the sampling rate. Other practical issues of analog-to-digital conversion are thermal stability and noise.

The sampling operation is generally not invertible. This means that, given the output $x\lbrack n\rbrack$, it is not possible in general to reconstruct $x_{c}\left( t \right)$, the input to the sampler, since many continuous-time signals can produce the same output sequence of samples. The inherent ambiguity in sampling is a fundamental issue in signal processing. Fortunately, it is possible to remove the ambiguity by restricting the input signals that go into the sampler.

It is convenient to further unpack the C/D converter into two stages, as depicted in the figure below. The stages consist of an impulse train modulator followed by the conversion of the impulse train to a sequence.

![Internal stages of the continuous-to-digital converter](site/Resources/media/image230.png)

> [!Figure]
> _Internal stages of the continuous-to-digital converter_

The figures below show a continuous-time signal $x_{c}\left( t \right)$ and the results of impulse train sampling for two different sampling rates.

![Sampling signal for two different sampling rates](site/Resources/media/image231.png)

> [!Figure]
> _Sampling signal for two different sampling rates_

The figures below depict the corresponding output sequences. The essential difference between $x_{S}\left( t \right)$ and $x\lbrack n\rbrack$ is that $x_{S}\left( t \right)$ is, in a sense, a continuous-time signal (specifically, an impulse train) that is zero except at integer multiples of $T$. The sequence $x\lbrack n\rbrack$, on the other hand, is indexed on the integer variable $n$, which in effect introduces a time normalization; i.e., the sequence of numbers $x\lbrack n\rbrack$ contains no explicit information about the sampling rate. Furthermore, the samples of $x_{c}\left( t \right)$ are represented by finite numbers in $x\lbrack n\rbrack$ rather than as the areas of impulses, as with $x_{S}\left( t \right)$.

![Output sequences for two different sampling rates](site/Resources/media/image232.png)

> [!Figure]
> _Output sequences for two different sampling rates_

==Discrete-time sequences are only defined for integer values of $n$.
It is not correct to think of $x\lbrack n\rbrack$ as being zero when $n$ is not an integer; $x\lbrack n\rbrack$ is simply undefined for non-integer values of $n$.==

In discussing the theory of discrete-time signals and systems, several basic sequences are of particular importance. In particular, the unit sample sequence is of special importance, and it is defined as:

$\LARGE \delta\lbrack n\rbrack = \ \left\{ \begin{matrix} 0,\ n \neq 0 \\ 1,\ n = 0 \\ \end{matrix} \right.\ $

![Impulse sequence](site/Resources/media/image233.png)
> [!Figure]
> _Impulse sequence_

One of the important aspects of the impulse sequence is that any arbitrary sequence can be represented as a sum of scaled, delayed impulses. For example, the sequence $p\lbrack n\rbrack$ in the figure below can be expressed in terms of delayed impulses:

$\large p\lbrack n\rbrack = \ a_{- 3}\delta\lbrack n + 3\rbrack + a_{1}\delta\lbrack n - 1\rbrack + a_{2}\delta\lbrack n - 2\rbrack + a_{7}\delta\lbrack n - 7\rbrack\ $

![An arbitrary, discrete sequence](site/Resources/media/image234.png)
> [!Figure]
> _An arbitrary, discrete sequence_

More generally, any sequence can be expressed as:

$\large x\lbrack n\rbrack = \sum_{k = - \infty}^{\infty}{x\lbrack k\rbrack\delta\lbrack n - k\rbrack}$

One may start to see two essential functionalities any Digital Signal Processor must have: the ability to quickly sum and multiply (what is called MAC, for Multiply-Accumulate), but also the capability of fast data movement of long sequences of numbers in memory, including reading, writing, and managing buffers of different kinds, such as circular (ring) buffers. We will soon see more evidence as to why DSPs must feel very comfortable doing all that, including handling complex numbers.

### Discrete-time Systems

A discrete-time system is defined mathematically as a transformation or operator that maps an input sequence with values $x\lbrack n\rbrack$ into an output sequence with values $y\lbrack n\rbrack$. This can be denoted as:

$\large y\lbrack n\rbrack = \ T\{ x\lbrack n\rbrack\}$

![Representation of a discrete-time system](site/Resources/media/image235.png)
> [!Figure]
> _Representation of a discrete-time system_

We will now describe a few useful discrete-time systems.

### The Ideal Delay System

The ideal delay system is defined by the equation:

$\large y\lbrack n\rbrack\ = \ x\lbrack n - n_{d}\rbrack,\ - \infty < n < \infty$

where $n_{d}$ is a fixed positive integer called the delay of the system. In other words, the ideal delay system simply shifts the input sequence to the right by $n_{d}$ samples to form the output. If, in the equation above, $n_{d}$ is a fixed negative integer, then the system would shift the input to the left by $\left| n_{d} \right|$ samples, corresponding to a time advance.

### Memoryless and Causal Systems

A system is referred to as memoryless if the output $y\lbrack n\rbrack$ at every value of $n$ depends only on the input $x\lbrack n\rbrack$ at the same value of $n$. A system is causal if, for every choice of $n_{0}$, the output sequence value at the index ${n = n}_{0}$ depends only on the input sequence values for ${n \leq n}_{0}$. This implies that if $x_{1}\lbrack n\rbrack = x_{2}\lbrack n\rbrack$ for ${n \leq n}_{0}$, then $y_{1}\lbrack n\rbrack = y_{2}\lbrack n\rbrack$ for ${n \leq n}_{0}$. That is, the system is non-anticipative in the sense that it only depends on current and past values.

### Linear Systems

Linear systems are defined by the principle of superposition.
If $y_{1}\lbrack n\rbrack$ and $y_{2}\lbrack n\rbrack$ are the responses of a system when $x_{1}\lbrack n\rbrack$ and $x_{2}\lbrack n\rbrack$ are the respective inputs, then the system is linear if and only if:

$\large T\{ x_{1}\lbrack n\rbrack + x_{2}\lbrack n\rbrack\} = \ T\{ x_{1}\lbrack n\rbrack\} + T\{ x_{2}\lbrack n\rbrack\}\ = \ y_{1}\lbrack n\rbrack + y_{2}\lbrack n\rbrack\ $

And,

$\large T\{ ax\lbrack n\rbrack\} = \ aT\{ x\lbrack n\rbrack\} = \ ay\lbrack n\rbrack$

where $a$ is an arbitrary constant. The first property is called the *additivity property*, and the second is called the *homogeneity* or *scaling property*. These two properties can be combined into the principle of superposition, stated as:

$\large T\{ ax_{1}\lbrack n\rbrack + bx_{2}\lbrack n\rbrack\} = \ aT\{ x_{1}\lbrack n\rbrack\} + bT\{ x_{2}\lbrack n\rbrack\}\ $

for any arbitrary constants $a$ and $b$. This equation can be generalized to the superposition of many inputs. Specifically, if:

$\large x\lbrack n\rbrack = \ \sum_{k}^{}a_{k}x_{k}\lbrack n\rbrack\ $

then the output of a linear system will be:

$\large y\left\lbrack n \right\rbrack = \sum_{k}^{}a_{k}y_{k}\left\lbrack n \right\rbrack$

where $y_{k}\left\lbrack n \right\rbrack$ is the system response to the input $x_{k}\lbrack n\rbrack$.

### Time-Invariant Systems

A time-invariant system (often referred to equivalently as a shift-invariant system) is a system for which a time shift or delay of the input sequence causes a corresponding shift in the output sequence. Specifically, suppose that a system transforms the input sequence with values $x\lbrack n\rbrack$ into the output sequence with values $y\lbrack n\rbrack$. Then the system is said to be time-invariant if, for all $n_{0}$, the input sequence with values $x_{1}\lbrack n\rbrack = x\lbrack n - n_{0}\rbrack$ produces the output sequence with values $y_{1}\lbrack n\rbrack = y\lbrack n - n_{0}\rbrack$. As in the case of linearity, proving that a system is time-invariant requires a general proof, without making any assumptions about the input signals.

### Stability

A system is said to be stable if and only if every bounded input sequence produces a bounded output sequence. The input $x\lbrack n\rbrack$ is bounded if there exists a fixed positive finite value $B_{x}$ such that

$\large \left| x\lbrack n\rbrack \right| \leq B_{x} < \infty$ (for all $n$)

Stability requires that, for every bounded input, there exists a fixed positive finite value $B_{y}$ such that

$\large \left| y\lbrack n\rbrack \right| \leq B_{y} < \infty$ (for all $n$)

### Linear Time-Invariant Systems and the Convolution Sum

A particularly important class of systems consists of those that are both linear and time-invariant. These two properties in combination lead to especially convenient representations for such systems. Most important, this class of systems has significant signal-processing applications. Linear systems are defined by the principle of superposition we described in the sections above. If the linearity property is combined with the representation of a general sequence as a linear combination of delayed impulses we discussed before, it follows that a linear system can be completely characterized by its impulse response. Specifically, let $h_{k}\lbrack n\rbrack$ be the response of the system to $\delta\lbrack n - k\rbrack$, which represents an impulse occurring at $n\ = \ k$.
Then,

$\large y\left\lbrack n \right\rbrack = T\left\{ \sum_{k = - \infty}^{\infty}x\left\lbrack k \right\rbrack\delta\lbrack n - k\rbrack \right\}$

From the principle of superposition that we described before,

$\large y\left\lbrack n \right\rbrack = \sum_{k = - \infty}^{\infty}x\left\lbrack k \right\rbrack T\{\delta\lbrack n - k\rbrack\} = \sum_{k = - \infty}^{\infty}x\left\lbrack k \right\rbrack h_{k}\lbrack n\rbrack\ $

According to this equation, the system response to any input can be expressed in terms of the responses of the system to the sequences $\delta\lbrack n - k\rbrack$. If only linearity is imposed, $h_{k}\lbrack n\rbrack$ will depend on both $n$ and $k$, in which case the usefulness of this equation is limited. We get a more useful result if we also impose the additional constraint of time invariance. The property of time invariance implies that if $h\lbrack n\rbrack$ is the response to $\delta\lbrack n\rbrack$, then the response to $\delta\lbrack n - k\rbrack$ is $h\lbrack n - k\rbrack$. With this additional constraint, the equation becomes:

$\LARGE y\left\lbrack n \right\rbrack = \sum_{k = - \infty}^{\infty}x\left\lbrack k \right\rbrack h\lbrack n - k\rbrack$

==As a consequence of this equation, a linear time-invariant system is completely characterized by its impulse response $h\lbrack n\rbrack$ in the sense that, given $h\lbrack n\rbrack$, it is possible to use the equation above to compute the output $y\lbrack n\rbrack$ due to any input $x\lbrack n\rbrack$.==

This equation is commonly called the convolution sum. If $y\lbrack n\rbrack$ is a sequence whose values are related to the values of the two sequences $h\lbrack n\rbrack$ and $x\lbrack n\rbrack$ as in the equation above, we say that $y\lbrack n\rbrack$ is the convolution of $x\lbrack n\rbrack$ with $h\lbrack n\rbrack$ and represent this by the notation:

$\large y\lbrack n\rbrack\ = \ x\lbrack n\rbrack\ *\ h\lbrack n\rbrack$

To implement discrete-time convolution, the two sequences $x\lbrack k\rbrack$ and $h\lbrack n - k\rbrack$ are multiplied together for $- \infty < k < \ \infty$, and the products are summed to compute the output sample $y\lbrack n\rbrack$. To obtain another output sample, the origin of the sequence $h\lbrack - k\rbrack$ is shifted to the new sample position, and the process is repeated. This computational procedure applies whether the computations are carried out numerically on sampled data or analytically with sequences for which the sample values have simple formulas.
Here's a simple implementation of the convolution sum of two sequences $x$ and $h$:

```C
#include <stdio.h>
#include <stdlib.h>

void convolve(const double *x, int lenx, const double *h, int lenh, double *y)
{
    int n, k;

    // Initialize output array with zeros
    for (n = 0; n < lenx + lenh - 1; n++) {
        y[n] = 0;
    }

    // Perform the convolution operation
    for (n = 0; n < lenx + lenh - 1; n++) {
        for (k = 0; k < lenh; k++) {
            if (n - k >= 0 && n - k < lenx) {
                y[n] += x[n - k] * h[k];
            }
        }
    }
}

int main()
{
    double x[] = {1, 2, 3}; // Example input sequence x
    int lenx = sizeof(x) / sizeof(x[0]);

    double h[] = {4, 5, 6}; // Example impulse response h
    int lenh = sizeof(h) / sizeof(h[0]);

    double *y = (double *)malloc((lenx + lenh - 1) * sizeof(double));
    if (y == NULL) {
        fprintf(stderr, "Memory allocation failed.\n");
        return 1;
    }

    convolve(x, lenx, h, lenh, y);

    printf("Convolution result:\n");
    for (int i = 0; i < lenx + lenh - 1; i++) {
        printf("%.1f ", y[i]);
    }
    printf("\n");

    free(y);
    return 0;
}
```

Whose result is: `4.0 13.0 28.0 27.0 18.0`

==An important property of the convolution sum is that convolution in the time domain is equivalent to multiplication in the frequency domain.== This is rooted in the properties of the Fourier Transform. For a continuous time-domain signal $x(t)$, the Fourier Transform is defined as:

$\large X(f) = \int_{-\infty}^{\infty}{x(t)e^{- j2\pi ft}}\,dt$

How does the Fourier Transform manage to "detect" frequency components in signals? What is this little, magic "spectrum analyzer" power it seems to have? The equation above is nothing but a mathematical "sweep" function. The Fourier Transform's job is to decompose the time-varying signal $x(t)$ into frequency components. It does this using a basis of complex exponentials, which are essentially sinusoidal functions (sines and cosines) expressed in complex number form. These complex exponentials are defined for every possible frequency. The core operation in the Fourier Transform is to multiply the signal by a complex exponential of a specific frequency and then integrate (or sum, in the case of the discrete Fourier Transform) this product over the entire duration of the signal. This operation is akin to asking, "How much of this specific frequency is present in my signal?"

Now, the reason the Fourier Transform gives non-zero values when it finds a frequency component in the signal is due to the properties of complex exponentials. When the frequency of the complex exponential matches a frequency component in the signal, the product of the signal and the complex exponential resonates, or aligns, over the integration period. This alignment leads to a constructive summing (or integration) where the values reinforce each other, resulting in a significant, non-zero value. Conversely, if the frequency of the complex exponential does not match any frequency component in the signal, the product of the signal and the complex exponential does not align consistently. Instead, it oscillates in such a way that positive and negative parts cancel each other out over the integration period, leading to a near-zero or insignificant value. This behavior is a direct consequence of the complex exponentials being **orthogonal** to each other. When a signal contains a frequency component that matches the frequency of a complex exponential, the orthogonality is broken, and the integral yields a significant value. Orthogonality is a concept from mathematics and signal processing that relates to the independence of two functions or signals.
Two functions $f(x)$ and $g(x)$ are said to be orthogonal over an interval if their inner product over that interval is zero. For continuous functions, the inner product is defined as the integral of the product of the two functions over the interval. For sine waves, let's consider two sine functions:

$\large f(x)\ = \ \sin(mx)$

$\large g(x)\ = \ \sin(nx)$

where $m$ and $n$ are different integers, and $x$ is a variable. The orthogonality of these two sine waves is evaluated over an interval, typically $\left\lbrack 0,2\pi \right\rbrack$ or $\lbrack - \pi,\ \pi\rbrack$. The inner product (integral of the product) of $f(x)$ and $g(x)$ over this interval is:

$\large \int_{0}^{2\pi}{\sin\left({mx} \right)\sin\left({nx} \right)\text{ dx}}$

To show that $f(x)$ and $g(x)$ are orthogonal, we need to prove that this integral evaluates to zero when $m \neq n$. The integral simplifies using trigonometric identities, such as the product-to-sum identities:

$\large \sin\left({mx} \right)\sin\left({nx} \right) = \frac{1}{2}\left\lbrack \cos\left( \left( m - n \right)x \right) - \cos\left( \left( m + n \right)x \right) \right\rbrack$

Thus, the integral becomes:

$\large \int_{0}^{2\pi}{\frac{1}{2}\left\lbrack \cos\left( \left( m - n \right)x \right) - \cos\left( \left( m + n \right)x \right) \right\rbrack}\text{ dx}$

Integrating this over a full period $\left\lbrack 0,2\pi \right\rbrack$, we find that both terms integrate to zero for any integers $m$ and $n$ such that $m \neq n$. This is because the integral of a cosine of a nonzero integer frequency over a full period is zero. Thus, when $m \neq n$, $\sin\left( mx \right)$ and $\sin\left( nx \right)$ are orthogonal over the interval $\left\lbrack 0,2\pi \right\rbrack$. To use a numerical example, for $m = 1$ and $n = 2$,

$\large \int_{0}^{2\pi}{\sin(x)\sin(2x)\ dx} = \frac{1}{2}\int_{0}^{2\pi}{\left\lbrack \cos(x) - \cos(3x) \right\rbrack\ dx} = \frac{1}{2}\left\lbrack \sin(x) - \frac{1}{3}\sin(3x) \right\rbrack_{0}^{2\pi} = 0$

More graphically, the basic idea is that if the frequencies of the two sines are different, then between $0$ and $2\pi$, the two sine curves are of opposite sign as much as they are of the same sign:

![](image236.gif)

Thus, their product will be positive as much as it is negative. In the integral, those positive contributions will exactly cancel the negative contributions, leading to an average of zero:

![](image237.gif)

==Convolution sum and Fourier Transform combined bring a useful property: the Convolution Theorem, which states that the Fourier Transform of a convolution of two signals (in the time domain) is equal to the product of their individual Fourier Transforms in the frequency domain.==

Mathematically, if $y(t)\ = \ x(t)*h(t)$, then the Fourier Transform of $y(t)$, $Y(f)$, is $X(f)H(f)$, where $X(f)$ and $H(f)$ are the Fourier Transforms of $x(t)$ and $h(t)$ respectively. But why is this true? This equivalence arises from the properties of the Fourier Transform. The Fourier Transform is a linear operation, meaning that it preserves linear combinations of signals. When you convolve two signals, you are essentially creating a linear combination of shifted and scaled versions of one signal, and the Fourier Transform translates this combination into multiplication in the frequency domain.
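Since we already have the convolution code above, the Convolution Theorem can also be checked numerically. The sketch below is NumPy-based and uses arbitrary sequences (the same illustrative values as the C example); note that both sequences are zero-padded to the length of the linear convolution so that the transform product corresponds to linear rather than circular convolution:

```python
import numpy as np

# Two arbitrary finite sequences (same illustrative values as the C example)
x = np.array([1.0, 2.0, 3.0])
h = np.array([4.0, 5.0, 6.0])

# Time-domain convolution
y_time = np.convolve(x, h)            # length len(x) + len(h) - 1 = 5

# Frequency-domain product: zero-pad both sequences to the convolution length
N = len(x) + len(h) - 1
Y_freq = np.fft.fft(x, N) * np.fft.fft(h, N)

# Back to the time domain; imaginary parts are numerically negligible
y_from_freq = np.real(np.fft.ifft(Y_freq))

print("time-domain convolution:   ", y_time)
print("via product of transforms: ", np.round(y_from_freq, 10))
```

Both lines print the same sequence, `4 13 28 27 18`, confirming that convolving in time and multiplying transforms in frequency are two routes to the same result.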
We will see in the next section that this property is particularly useful in understanding how filters modify the spectral components of a signal.

### Frequency-domain Representation of Discrete-time Signals and Systems

In the previous sections, we have introduced some of the fundamental concepts of the theory of discrete-time signals and systems. For linear time-invariant systems, we saw that a representation of the input sequence as a weighted sum of delayed impulses leads to a representation of the output as a weighted sum of delayed impulse responses. As with continuous-time signals, discrete-time signals may be represented in several different ways. For example, sinusoidal and complex exponential sequences play a particularly important role in representing discrete-time signals. This is because complex exponential sequences are eigenfunctions of linear time-invariant systems, and the response to a sinusoidal input is sinusoidal with the same frequency as the input and with amplitude and phase determined by the system. This fundamental property of linear time-invariant systems makes representations of signals in terms of sinusoids or complex exponentials (i.e., Fourier representations) very useful in linear system theory.

To demonstrate the eigenfunction property of complex exponentials for discrete-time systems, consider an input sequence $x\lbrack n\rbrack\ = \ e^{j\omega n}$ for $- \infty < n < \infty$; that is, a complex exponential of frequency $\omega$. From the convolution sum, the corresponding output of a linear, time-invariant system with impulse response $h\lbrack n\rbrack$ is:

$\large y\left\lbrack n \right\rbrack = \sum_{k = - \infty}^{\infty}h\left\lbrack k \right\rbrack x\lbrack n - k\rbrack = \sum_{k = - \infty}^{\infty}h\left\lbrack k \right\rbrack e^{j\omega (n - k)} = e^{j\omega n}\left( \sum_{k = - \infty}^{\infty}h\left\lbrack k \right\rbrack e^{- j\omega k} \right)$

If we define,

$\large H(e^{j\omega}) = \ \sum_{k = - \infty}^{\infty}h\left\lbrack k \right\rbrack e^{- j\omega k}$

Then we obtain,

$\large y\left\lbrack n \right\rbrack = H(e^{j\omega})e^{j\omega n}$

Therefore, $e^{j\omega n}$ is an eigenfunction of the system, and the eigenvalue is $H(e^{j\omega})$. We can see that $H(e^{j\omega})$ describes the change in complex amplitude of a complex exponential input signal as a function of frequency $\omega$. The eigenvalue $H(e^{j\omega})$ is called the frequency response of the system. In general, $H(e^{j\omega})$ is complex and can also be expressed in rectangular form:

$H(e^{j\omega})\ = \ H_{R}(e^{j\omega})\ + \ j\ H_{I}(e^{j\omega})$

The concept of the frequency response of linear time-invariant systems is essentially the same for continuous-time and discrete-time systems. However, an important distinction arises because the frequency response of discrete-time linear time-invariant systems is always a periodic function of the frequency variable $\omega$ with period $2\pi$. To show this, we can use $\omega + 2\pi$ in the equations above:

$\large H(e^{j(\omega + 2\pi)}) = \ \sum_{n = - \infty}^{\infty}h\left\lbrack n \right\rbrack e^{- j(\omega + 2\pi)n}$

Because $e^{- j2\pi n} = 1$ for any integer value of $n$, we have:

$\large e^{- j(\omega + 2\pi)n} = e^{- j\omega n}e^{- j2\pi n} = e^{- j\omega n}$

Then,

$H(e^{j(\omega + 2\pi)})\ = \ H(e^{j\omega})$

Therefore, $H(e^{j\omega})$ is periodic with period $2\pi$. The reason for the periodicity is directly related to the fact that the sequence $\left\{ e^{j\omega n} \right\},\ - \infty < n < \infty$, is indistinguishable from the sequence $\left\{ e^{j(\omega + 2\pi)n} \right\},\ - \infty < n < \infty$.
Because these two sequences have identical values for all $n$, the system must respond identically to both input sequences.

### Frequency-domain Representation of Sampling

To derive the frequency-domain relation between the input and output of an ideal C/D converter, we can first consider the conversion of $x_{c}\left( t \right)$ to $x_{S}\left( t \right)$ through modulation of the periodic impulse train

$\large s(t) = \sum_{n = - \infty}^{\infty}{\delta(t - nT)}$

where $\delta(t)$ is the unit impulse function, or Dirac delta function. We modulate $s(t)$ with $x_{c}\left( t \right)$, obtaining

$\large x_{S}\left( t \right) = x_{c}\left( t \right)s\left( t \right) = x_{c}\left( t \right)\sum_{n = - \infty}^{\infty}{\delta\left( t - nT \right)}$

The delta function has a useful property called the _sifting_ property, which is expressed as follows:

$\int_{- \infty}^{\infty}{f\left( t \right)\delta\left( t - a \right)}\, dt = f\left( a \right)$

where $f(t)$ is a continuous function at $t\ = \ a$. This property essentially "sifts out" the value of the function $f(t)$ at the point $t\ = \ a$. The explanation of the sifting property is rather simple: the delta function $\delta\left( t - a \right)$ is zero for all $t$ except at $t\ = \ a$. Therefore, the product $f\left( t \right)\delta\left( t - a \right)$ is zero everywhere except at $t\ = \ a$. Then, when one integrates $f(t)\ \delta(t\ - \ a)$ over all $t$, the only value that contributes to the integral is at $t\ = \ a$, where $\delta\left( t - a \right)$ is non-zero (infinitely large in a theoretical sense). The integral then "picks out" or "sifts" the value of $f(t)$ at $t\ = \ a$, hence the name sifting property. We now go back to our sampled signal $x_{S}\left( t \right)$:

$\large x_{S}\left( t \right) = x_{c}\left( t \right)s\left( t \right) = x_{c}\left( t \right)\sum_{n = - \infty}^{\infty}{\delta\left( t - nT \right)}$

If we apply the sifting property here, we can express $x_{S}\left( t \right)$ as

$\large x_{S}\left( t \right) = \sum_{n = - \infty}^{\infty}{x_{c}\left( nT \right)\delta\left( t - nT \right)}$

We can now consider the Fourier transform of $x_{S}\left( t \right) = x_{c}\left( t \right)s\left( t \right)$, which will be the convolution of the Fourier transforms $X_{c}\left( j\Omega \right)$ and $S(j\Omega)$. The Fourier transform of a periodic impulse train is a periodic impulse train. Why? A periodic impulse train in the time domain can be mathematically represented as a sum of Dirac delta functions spaced periodically at intervals $T$ (the period of the impulse train):

$\large x\left( t \right) = \sum_{n = - \infty}^{\infty}{\delta\left( t - nT \right)}$

Any periodic signal can be expressed as a Fourier series where the coefficients of the series are the samples of the Fourier transform over one period of the signal. When one applies the Fourier transform to a periodic signal, one effectively converts the time-domain periodicity into frequency-domain periodicity. This is because the Fourier transform of a single impulse is a constant function (flat spectrum) across all frequencies, and the periodic repetition of impulses in the time domain translates into a periodic repetition of the Fourier transform (the spectrum) of a single impulse in the frequency domain.
Mathematically, the Fourier transform of an impulse train results in a frequency spectrum that contains impulses at frequencies that are multiples of the fundamental frequency $f_{0} = \frac{1}{T}$, which is the inverse of the time period $T$:

$\large X\left( f \right) = \frac{1}{T}\sum_{k = - \infty}^{\infty}{\delta\left( f - kf_{0} \right)}$

Here, $kf_{0}$ are the harmonics of the fundamental frequency $f_{0}$, and the delta functions in the frequency domain represent the discrete frequencies at which the periodic time-domain impulses contribute energy. This dual relationship between time and frequency domains is a manifestation of the property of the Fourier transform known as the duality property. It states that if a time-domain signal is a periodic series of impulses, its frequency-domain representation will also be a periodic series of impulses, and vice versa. Following our nomenclature using frequency in radians, we can rewrite the Fourier transform of our impulse train as

$\large S(j\Omega) = \frac{2\pi}{T}\sum_{k = - \infty}^{\infty}\delta(\Omega - k\Omega_{S})$

where $\Omega_{S} = 2\pi/T$ is the sampling frequency in rad/s. Because $x_{S}\left( t \right)$ is the result of a multiplication in the time domain, $X_{S}\left( j\Omega \right)$ is a convolution in the frequency domain:

$\large X_{S}\left( j\Omega \right) = \frac{1}{T}\sum_{k = - \infty}^{\infty}{X_{c}\left( j(\Omega - k\Omega_{S}) \right)}$

This equation provides the relationship between the Fourier transforms of the input and the output of the impulse train modulator we described earlier. We see from the equation above that the Fourier transform of $x_{S}\left( t \right)$—that is, $X_{S}\left( j\Omega \right)$—consists of periodically repeated copies of the Fourier transform of $x_{c}\left( t \right)$, or $X_{c}\left( j\Omega \right)$. The copies of $X_{c}\left( j\Omega \right)$ are shifted by integer multiples of the sampling frequency and then superimposed to produce the periodic Fourier transform of the impulse train of samples.

![](site/Resources/media/image238.png)
> [!info]
> From the figure above: Effect in the frequency domain of sampling in the time domain. (a) spectrum of the original signal (b) spectrum of the sampling signal (c) spectrum of the sampled signal with $\Omega_{S} > \ 2\Omega_{N}$ (d) spectrum of the sampled signal with $\Omega_{S} < \ 2\Omega_{N}$ showing aliasing

![Exact recovery of a continuous-time domain signal from its samples using an ideal low-pass filter](site/Resources/media/image239.png)
> [!Figure]
> _Exact recovery of a continuous-time domain signal from its samples using an ideal low-pass filter_

It can be observed in the figures that if the copies of the spectrum of $x_{c}\left( t \right)$ are comfortably separated from each other, there is no issue. This separation is given by the difference between the sampling frequency and twice the maximum frequency in the spectrum of $x_{c}\left( t \right)$. It should be obvious to note now that improper selection of the sampling frequency may lead to the copies of the spectrum colliding with each other in the frequency domain. What is this collision about, and what effects does it create? It creates a distortion, known as aliasing, and makes the recovery of the original signal impossible.
To avoid this effect, the sampling frequency must satisfy the inequality:

$\large \Omega_{S} \geq {2\Omega}_{N}$

The frequency $\Omega_{N}$ (that is, the maximum frequency component in a bandlimited signal) is commonly referred to as the _Nyquist frequency_, and the frequency ${2\Omega}_{N}$ that must be exceeded by the sampling frequency to avoid aliasing is called the _Nyquist rate_.

#### Changing the Sampling Rate

We saw that a continuous-time signal $x_{c}\left( t \right)$ can be represented by a discrete-time signal consisting of a sequence of samples,

$\large x\lbrack n\rbrack = x_{c}(nT)$

It is often necessary to change the sampling rate of a discrete-time signal, i.e., to obtain a new discrete-time representation of the underlying continuous-time signal of the form

$\large x'\lbrack n\rbrack = x_{c}(nT')$

where $T^{'} \neq T$. One approach to obtaining $x'\lbrack n\rbrack$ from $x\lbrack n\rbrack$ could be to somehow reconstruct $x_{c}(t)$ from $x\lbrack n\rbrack$ and then resample in continuous time with period $T^{'}$ to obtain $x'\lbrack n\rbrack$. This turns out not to be the best approach, considering the non-ideal nature of the analog equipment needed for the reconstruction of the signal, combined with the limitations of practical DACs and ADCs. A better approach is to consider changing the sample rate purely in discrete time. The sampling rate of a discrete-time sequence can be reduced by "sampling" it, that is, by defining a new sequence:

$\large x_{d}\left\lbrack n \right\rbrack = x\left\lbrack nM \right\rbrack = x_{c}\left( nMT \right)$

The equation above represents the system depicted in the figure below.

![Representation of a compressor, or discrete-time sampler](site/Resources/media/image240.png)
> [!Figure]
> _Representation of a compressor, or discrete-time sampler_

It is clear that $x_{d}\left\lbrack n \right\rbrack$ is identical to the sequence that would be obtained by sampling $x_{c}\left( t \right)$ with period $T^{'} = MT$. The sampling rate can be reduced by a factor of $M$ without aliasing if the original sampling rate was at least $M$ times the Nyquist rate or if the bandwidth of the sequence is first reduced by a factor of $M$ by discrete-time filtering. In general, the operation of reducing the sampling rate (including any prefiltering) is called *downsampling*. It is useful to obtain a frequency-domain relation between the input and output of the compressor (or downsampler). It will be a relationship between discrete-time Fourier transforms (DTFTs).
The DTFT of $x\left\lbrack n \right\rbrack = x_{c}\left( nT \right)$ is:

$\large X\left( e^{j\omega} \right) = \frac{1}{T}\sum_{k = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega}{T} - \frac{2\pi k}{T} \right) \right)}$

Similarly, the DTFT of the downsampled sequence $x_{d}\lbrack n\rbrack = \ x\lbrack nM\rbrack = x_{c}(nT')$ with $T^{'} = MT$ is:

$\large X_{d}\left( e^{j\omega} \right) = \frac{1}{T'}\sum_{r = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega}{T'} - \frac{2\pi r}{T'} \right) \right)}$

But because $T^{'} = MT$, we can then write the previous equation as:

$\large X_{d}\left( e^{j\omega} \right) = \frac{1}{MT}\sum_{r = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega}{MT} - \frac{2\pi r}{MT} \right) \right)}$

To understand how $X_{d}\left( e^{j\omega} \right)$ and $X\left( e^{j\omega} \right)$ relate, the index $r$ in the previous equation can be expressed as:

$\large r = i + kM$

where $k$ and $i$ are integers such that $- \infty < k < \infty$ and $0 \leq i \leq M - 1$, and we can rewrite the DTFT of the downsampled signal $X_{d}\left( e^{j\omega} \right)$ as:

$\large X_{d}\left( e^{j\omega} \right) = \frac{1}{M}\sum_{i = 0}^{M - 1}\left\lbrack \frac{1}{T}\sum_{k = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega}{MT} - \frac{2\pi k}{T} - \frac{2\pi i}{MT} \right) \right)} \right\rbrack$

The term inside the square brackets in the equation above is recognized as $X\left( e^{j\omega} \right)$ evaluated at frequency $(\omega - 2\pi i)/M$, that is,

$\large X\left( e^{j(\omega - 2\pi i)/M} \right) = \frac{1}{T}\sum_{k = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega - 2\pi i}{MT} - \frac{2\pi k}{T} \right) \right)}$

Then, we can express the DTFT of the downsampled signal $X_{d}\left( e^{j\omega} \right)$ as

$\large X_{d}\left( e^{j\omega} \right) = \frac{1}{M}\sum_{i = 0}^{M - 1}{X\left( e^{j(\omega - 2\pi i)/M} \right)}$

There is a strong analogy between these two relationships. The first expresses the DTFT of the sequence $x\lbrack n\rbrack$ (sampled with period $T$) in terms of the Fourier transform of the continuous-time signal $x_{c}\left( t \right)$:

$\large X\left( e^{j\omega} \right) = \frac{1}{T}\sum_{k = - \infty}^{\infty}{X_{c}\left( j\left( \frac{\omega}{T} - \frac{2\pi k}{T} \right) \right)}$

whereas the second expresses the DTFT of the downsampled sequence $x_{d}\lbrack n\rbrack$ (downsampled by a factor $M$) in terms of the DTFT of the sequence $x\lbrack n\rbrack$:

$\large X_{d}\left( e^{j\omega} \right) = \frac{1}{M}\sum_{i = 0}^{M - 1}{X\left( e^{j(\omega - 2\pi i)/M} \right)}$

All in all, the DTFT of the downsampled sequence, $X_{d}\left( e^{j\omega} \right)$, can be thought of as being composed of either an infinite number of copies of $X_{c}\left( j\Omega \right)$, scaled (widened) in frequency through $\omega = \Omega T'$ and shifted by integer multiples of $2\pi/T^{'}$, or $M$ copies of the periodic Fourier transform $X\left( e^{j\omega} \right)$, frequency scaled by $M$ and shifted by integer multiples of $2\pi$ (see figure below).
Either interpretation makes clear that $X_{d}\left( e^{j\omega} \right)$ is periodic with period $2\pi$ (as all DTFTs are) and that aliasing can be avoided by ensuring that $X\left( e^{j\omega} \right)$ is bandlimited, in other words, that

$\large X\left( e^{j\omega} \right) = 0$ for $\large \omega_{N} \leq \left| \omega \right| \leq \pi$

and that

$\large 2\pi/M \geq 2\omega_{N}$

![ Effect in the frequency domain of downsampling (without aliasing)](site/Resources/media/image241.png)
> [!Figure]
> _Effect in the frequency domain of downsampling (without aliasing)_

![Effect in the frequency domain of downsampling (a-c with aliasing, c-e with pre-filtering to avoid aliasing)](image242.png)
> [!Figure]
> _Effect in the frequency domain of downsampling (a-c with aliasing, c-e with pre-filtering to avoid aliasing)_

### Digital Filtering and Discrete-Time Fourier Transform (DTFT)

An important class of linear time-invariant systems consists of those systems for which the frequency response is unity over a certain range of frequencies and is zero at the remaining frequencies. These correspond to ideal frequency-selective filters. The frequency response of an ideal lowpass filter is shown in the figure below. Because of the inherent periodicity of the discrete-time frequency response, it has the appearance of a multiband filter, since frequencies around $\omega = 2\pi$ are indistinguishable from frequencies around $\omega = 0$. In effect, however, the frequency response passes only low frequencies and rejects high frequencies. Since the frequency response is completely specified by its behavior over the interval $- \pi < \omega \leq \pi$, the ideal lowpass filter frequency response is more typically shown only in the interval $- \pi < \omega \leq \pi$. It is then understood that the frequency response repeats periodically with period $2\pi$ outside the plotted interval.

![Ideal lowpass filter showing: (a) periodicity of the frequency response and (b) one period of the periodic frequency response](image243.png)
> [!Figure]
> _Ideal lowpass filter showing: (a) periodicity of the frequency response and (b) one period of the periodic frequency response_

The frequency responses for ideal highpass, bandstop, and bandpass filters are shown in the figure below, panels (a), (b), and (c), respectively.

![Ideal frequency-selective filters. (a) Highpass filter. (b) Bandstop filter. (c) Bandpass filter. Only one period is shown.](image244.png)
> [!Figure]
> _Ideal frequency-selective filters. (a) Highpass filter. (b) Bandstop filter. (c) Bandpass filter. Only one period is shown._

Filtering, in essence, is about manipulating the spectral content of signals. When we are low-pass filtering, we are reducing or nullifying all frequency components higher than a threshold called the cutoff frequency. Beyond that marker, we are not interested in any signal present, so we ideally multiply any component in that region by zero. For high-pass filtering, we do the exact opposite: we nullify or minimize components below a threshold. For bandpass filtering, we keep a band of a certain width, and we discard all the rest. Conceptually speaking, filtering is very straightforward, and its math is practically primary-school level. By applying the convolution theorem, we can work in the frequency domain and multiply the spectrum of the signal by the frequency response of the filter, which is the Fourier transform of the filter's impulse response. That means we can simply multiply the Fourier representation of the sequence with the Fourier representation of the impulse response.
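As a quick numerical illustration of filtering as a multiplication in the frequency domain, here is a minimal sketch (the two-tone test signal, the 1 kHz sampling rate, and the 100 Hz brick-wall cutoff are all arbitrary choices made for illustration). It transforms the signal, multiplies by an ideal low-pass mask, and transforms back:

```python
import numpy as np

fs = 1000.0                     # sampling rate in Hz (arbitrary choice)
N = 1000                        # number of samples; 1 Hz per FFT bin with this choice
t = np.arange(N) / fs

# Test signal: a 50 Hz component to keep plus a 200 Hz component to remove
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)

# Signal spectrum
X = np.fft.fft(x)
freqs = np.fft.fftfreq(N, d=1 / fs)

# Ideal (brick-wall) low-pass frequency response: 1 below the cutoff, 0 above
cutoff = 100.0                  # Hz
H = (np.abs(freqs) < cutoff).astype(float)

# Filtering = multiplication in the frequency domain, then back to the time domain
y = np.real(np.fft.ifft(X * H))

Y = np.abs(np.fft.fft(y))
print("energy left at the 50 Hz bin: ", round(Y[50], 1))    # large: component kept
print("energy left at the 200 Hz bin:", round(Y[200], 6))   # ~0: component removed
```

This brick-wall approach is rarely used as-is in practice (an ideal cutoff implies an infinitely long impulse response); the window-method FIR design discussed in the following paragraphs is one way around that limitation.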
![Filtering as a multiplication in the frequency domain between the Fourier transform of the signal and the Fourier transform of the impulse response of the filter](image245.png)
> [!Figure]
> _Filtering as a multiplication in the frequency domain between the Fourier transform of the signal and the Fourier transform of the impulse response of the filter_

To obtain the Fourier representation of a sequence, we have,

$\large X(e^{j\omega}) = \sum_{n = - \infty}^{\infty}{x\lbrack n\rbrack e^{- j\omega n}}$

This equation is also called the DTFT, or discrete-time Fourier Transform. The DTFT transforms a discrete-time signal, which is usually infinite, into a continuous and periodic frequency spectrum. It must be emphasized that the DTFT yields a ==continuous and periodic frequency spectrum==. This means that the DTFT gives you the frequency content at every possible frequency within its range. All in all, the DTFT is more of a theoretical construct used for analysis and for understanding the frequency content of signals than anything that can be implemented in a real processor. The inverse of the DTFT is:

$\large x\lbrack n\rbrack = \frac{1}{2\pi}\int_{- \pi}^{\pi}{X(e^{j\omega})}e^{j\omega n}{d\omega}$

With all this, we can now obtain the discrete-time sequence for the impulse response of a low-pass discrete-time filter from its frequency-domain definition. We want the filter to have a frequency response of 1 if the signal frequency is lower than the cutoff frequency $\omega_{c}$, and zero if the frequency exceeds the cutoff frequency (and is smaller than or equal to $\pi$):

$\large H_{\text{lp}}(e^{j\omega}) = \left\{ \begin{matrix} 1,\ \left| \omega \right| < \omega_{c} \\ 0,\ \omega_{c} < \left| \omega \right| \leq \pi\ \\ \end{matrix} \right.\ $

By applying the inverse DTFT:

$\large h_{\text{lp}}\lbrack n\rbrack = \frac{1}{2\pi}\int_{- \omega_{c}}^{\omega_{c}}e^{j\omega n}{d\omega} = \frac{1}{2\pi jn}\left. e^{j\omega n} \right|_{- \omega_{c}}^{\omega_{c}} = \frac{1}{2\pi jn}\left( e^{j\omega_{c}n} - e^{- j\omega_{c}n} \right) = \frac{\sin\left( \omega_{c}n \right)}{\pi n},\ - \infty < n < \infty$

The discrete-time sequence needed to obtain an ideal low-pass frequency response is therefore a _sinc_ function.

The simplest method of FIR filter design is called the window method. This method generally begins with an ideal (desired) frequency response that can be represented as:

$\large H_{d}\left( e^{j\omega} \right) = \sum_{n = - \infty}^{\infty}{h_{d}\lbrack n\rbrack e^{- j\omega n}}$

where $h_{d}\lbrack n\rbrack$ is the corresponding impulse response sequence. Many ideal systems, like the low-pass filter above, have impulse responses that are non-causal and infinitely long. The most straightforward approach to obtaining a causal FIR approximation to such systems is to truncate the ideal response. The equation above can be thought of as a Fourier series representation of the periodic frequency response $H_{d}\left( e^{j\omega} \right)$, with the sequence $h_{d}\lbrack n\rbrack$ playing the role of the Fourier coefficients. The simplest way to obtain a causal FIR filter from $h_{d}\lbrack n\rbrack$ is to define a new system with impulse response $h\lbrack n\rbrack$ given by:

$\large h\left\lbrack n \right\rbrack = \left\{ \begin{matrix} h_{d}\left\lbrack n \right\rbrack,\ 0 \leq n \leq M \\ 0,\ \text{otherwise} \\ \end{matrix} \right.\ $

More generally, we can represent $h\left\lbrack n \right\rbrack$ as the product of the desired impulse response and a finite-duration "window" $w\left\lbrack n \right\rbrack$.
In other words,

$\large h\left\lbrack n \right\rbrack = h_{d}\left\lbrack n \right\rbrack w\left\lbrack n \right\rbrack$

For a simple truncation, the window is a rectangular window:

$\large w\left\lbrack n \right\rbrack = \left\{ \begin{matrix} 1,\ 0 \leq n \leq M \\ 0,\ \text{otherwise} \\ \end{matrix} \right.\ $

The multiplication with a window sequence in the time domain effectively equates to a convolution in the frequency domain. Therefore,

$\large H\left( e^{j\omega} \right) = \frac{1}{2\pi}\int_{- \pi}^{\pi}{H_{d}\left( e^{j\theta} \right)W\left( e^{j(\omega - \theta)} \right)}d\theta$

which represents the periodic convolution of the desired ideal frequency response with the Fourier transform of the window. A periodic convolution is a convolution of two periodic functions with the limits of integration extending over only one period. As a result, the frequency response $H\left( e^{j\omega} \right)$ will be a "smeared" version of the desired frequency response $H_{d}\left( e^{j\omega} \right)$.

![(a) Convolution performed by truncation of the ideal impulse response (b) Typical approximation resulting from windowing the ideal square impulse response](image246.png)
> [!Figure]
> _(a) Convolution performed by truncation of the ideal impulse response (b) Typical approximation resulting from windowing the ideal square impulse response_

If $w\lbrack n\rbrack$ is chosen so that $W\left( e^{j\omega} \right)$ is concentrated in a narrow band of frequencies around $\omega = 0$, then $H\left( e^{j\omega} \right)$ will resemble $H_{d}\left( e^{j\omega} \right)$, except where $H_{d}\left( e^{j\omega} \right)$ changes very abruptly, for instance at the band edges. Consequently, the choice of window is governed by the desire to have $w\lbrack n\rbrack$ as short as possible in duration, to minimize computation in the implementation of the filter, while having $W\left( e^{j\omega} \right)$ approximate an impulse in the frequency domain; that is, we want $W\left( e^{j\omega} \right)$ to be highly concentrated in frequency so that the convolution faithfully reproduces the desired frequency response (see figure above). These are conflicting requirements, as can be seen in the case of the rectangular window, where:

$\large W(e^{j\omega}) = \sum_{n = 0}^{M}e^{- j\omega n} = \frac{1 - e^{- j\omega(M + 1)}}{1 - e^{- j\omega}} = e^{- j\omega M/2}\frac{\sin\lbrack\omega(M + 1)/2\rbrack}{\sin(\omega/2)}$

The magnitude of the periodic sinc function $\frac{\sin\lbrack\omega(M + 1)/2\rbrack}{\sin(\omega/2)}$ is plotted below for $M = 7$.

![Magnitude of the Fourier Transform of a rectangular window](image247.png)
> [!Figure]
> _Magnitude of the Fourier Transform of a rectangular window_

As $M$ increases, the width of the main lobe decreases. The main lobe is defined as the region between the first zero-crossings on either side of the origin. For the rectangular window, the width of the main lobe is ${\Delta\omega}_{m} = 4\pi/(M + 1)$. However, for the rectangular window, the side lobes are large, and in fact, as $M$ increases, the peak amplitudes of the main lobe and the side lobes grow in a manner such that the area under each lobe is a constant while the width of each lobe decreases with $M$. Consequently, as $W(e^{j(\omega - \theta)})$ "slides by" a discontinuity of $H_{d}\left( e^{j\theta} \right)$ with increasing $\omega$, the integral of $W(e^{j(\omega - \theta)})H_{d}\left( e^{j\theta} \right)$ will oscillate as each side lobe of $W(e^{j(\omega - \theta)})$ moves past the discontinuity.
Since the area under each lobe remains constant with increasing $M$, the oscillations occur more rapidly but do not decrease in amplitude as $M$ increases (we will calculate this in code in a bit). The function $\frac{\sin\left\lbrack \omega\left( M + 1 \right)/2 \right\rbrack}{\sin\left( \omega/2 \right)}$ is recognized as the Dirichlet kernel. As said before, the main lobe width is inversely proportional to $M$, and the nulls of the function occur at multiples of $\frac{2\pi}{M + 1}$, except for $\omega = 0$. This function describes the phenomenon of Gibbs ringing in the frequency domain, caused by the sharp cutoff in the time domain that the rectangular window introduces. The next code snippet implements the Dirichlet kernel in Python and plots it for $M = \lbrack 7,11,15,50\rbrack$.

```python
import numpy as np
import matplotlib.pyplot as plt

# Define the Dirichlet kernel function
def dirichlet_kernel(omega, M):
    numerator = np.sin(omega * (M + 1) / 2)
    denominator = np.where(omega == 0, 1.0, np.sin(omega / 2))
    # At omega = 0 the ratio is 0/0; its limiting value is M + 1
    return np.where(omega == 0, M + 1, numerator / denominator)

# Define the range of omega and M values
omega = np.linspace(-np.pi, np.pi, 1000)
M_values = [7, 11, 15, 50]

# Plot setup
plt.figure(figsize=(14, 8))

# Plot the absolute value of the Dirichlet kernel for different M values
for M in M_values:
    plt.plot(omega, np.abs(dirichlet_kernel(omega, M)), label=f'M={M}')

plt.title('Absolute Value of Dirichlet Kernel for Different Values of M')
plt.xlabel('Frequency (rad/sample)')
plt.ylabel('Magnitude')
plt.grid(True)
plt.legend()
plt.show()
```

Which produces:

![Absolute value of the Dirichlet kernel for different values of M](image248.png)
> [!Figure]
> _Absolute value of the Dirichlet kernel for different values of M_

It is a well-known fact that the Gibbs phenomenon can be moderated through the use of a less abrupt truncation of the Fourier series. By tapering the window smoothly to zero at each end, the height of the side lobes can be diminished; however, this is achieved at the expense of a wider main lobe and thus a wider transition at the discontinuity. Some commonly used windows are computed in the next code snippet and plotted afterward.
```python
import numpy as np
import matplotlib.pyplot as plt

# Number of sample points
N = 500

# Define window functions with their formulas
def rectangular_window(N):
    return np.ones(N)

def hanning_window(N):
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / (N - 1))

def hamming_window(N):
    return 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))

def blackman_window(N):
    return 0.42 - 0.5 * np.cos(2 * np.pi * np.arange(N) / (N - 1)) + 0.08 * np.cos(4 * np.pi * np.arange(N) / (N - 1))

def kaiser_window(N, beta=14):
    return np.kaiser(N, beta)

def bartlett_window(N):
    return np.bartlett(N)

def gaussian_window(N, std=0.4):
    return np.exp(-0.5 * (np.arange(N) - (N - 1) / 2) ** 2 / (std * N) ** 2)

# Note: the Chebyshev window is omitted because it lives in scipy.signal, not NumPy

# Define the Welch window
def welch_window(N):
    n = np.arange(N)
    return 1 - ((n - (N - 1) / 2) / ((N - 1) / 2)) ** 2

# Window names
window_names = ["Rectangular", "Hanning", "Hamming", "Blackman", "Kaiser", "Bartlett", "Gaussian", "Welch"]

# Window functions
window_functions = [rectangular_window(N), hanning_window(N), hamming_window(N), blackman_window(N),
                    kaiser_window(N), bartlett_window(N), gaussian_window(N), welch_window(N)]

# Plot setup
plt.figure(figsize=(15, 10))

# Plotting windows
for i, window in enumerate(window_functions):
    plt.subplot(4, 2, i + 1)
    plt.plot(window)
    plt.title(window_names[i])
    plt.grid(True)

plt.tight_layout()
plt.show()
```

Which outputs:

![Commonly used windows](image249.png)
> [!Figure]
> _Commonly used windows_

The frequency response in dB for the windows shown above is calculated with the following code (for $M = 50$):

```python
import numpy as np
import matplotlib.pyplot as plt

# First, define each window function
def rectangular_window(M):
    return np.ones(M)

def hanning_window(M):
    return np.hanning(M)

def hamming_window(M):
    return np.hamming(M)

def blackman_window(M):
    return np.blackman(M)

def kaiser_window(M, beta=14):
    return np.kaiser(M, beta)

def bartlett_window(M):
    return np.bartlett(M)

def gaussian_window(M, std=0.4):
    return np.exp(-0.5 * ((np.arange(M) - (M - 1) / 2) / (std * M)) ** 2)

def welch_window(M):
    n = np.arange(0, M)
    return 1 - ((n - (M - 1) / 2) / ((M - 1) / 2))**2

# Define the Fourier Transform function for the window
def frequency_response(window):
    A = np.fft.fft(window, 2048) / (len(window) / 2.0)
    freq = np.linspace(-0.5, 0.5, len(A))
    response = np.fft.fftshift(A / abs(A).max())
    # The small epsilon avoids log10(0) warnings for near-zero bins
    return freq, 20 * np.log10(np.abs(response) + 1e-12)

# Define window length
M = 50

# Window functions
windows = {
    "Rectangular": rectangular_window(M),
    "Hanning": hanning_window(M),
    "Hamming": hamming_window(M),
    "Blackman": blackman_window(M),
    "Kaiser": kaiser_window(M),
    "Bartlett": bartlett_window(M),
    "Gaussian": gaussian_window(M),
    "Welch": welch_window(M)
}

# Plotting the frequency response of each window
plt.figure(figsize=(15, 10))
for window_name, window in windows.items():
    freq, response = frequency_response(window)
    plt.plot(freq, response, label=window_name)

plt.title("Frequency Response of Various Window Functions (in dB)")
plt.xlabel("Normalized Frequency")
plt.ylabel("Gain (dB)")
plt.grid(True)
plt.legend()
plt.show()
```

The figure below plots the frequency response of various windows and depicts the tradeoff between main lobe width and side lobe attenuation; the rectangular window shows the narrowest main lobe but poor side-lobe attenuation, whereas the Kaiser window shows the best attenuation of side lobes but the widest main lobe.
![Frequency response of typical windows (M=50)](image250.png)
> [!Figure]
> _Frequency response of typical windows (M=50)_

### How Do Processors Compute Trigonometric Functions such as Sines and Cosines?

With Digital Signal Processing relying so heavily on sine and cosine functions, a valid question pops up: how do computers compute such trigonometric functions? Computers calculate sine and cosine functions using a variety of methods, depending on the requirements for precision, performance, and the specific hardware or software environment. Some typical methods are:

- Polynomial Approximations (Taylor Series or Chebyshev Polynomials):
    - The Taylor series expansion of sine and cosine functions is a common method. These series provide an approximation that becomes more accurate with more terms:

$\large \sin\left( x \right) \approx x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \frac{x^{7}}{7!} + \cdots$

$\large \cos\left( x \right) \approx 1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \frac{x^{6}}{6!} + \cdots$

    - Chebyshev polynomials or other polynomial approximations can also be used, often providing better accuracy for a given number of terms compared to Taylor series.
- CORDIC Algorithm (Coordinate Rotation Digital Computer): The CORDIC algorithm is another popular method. It's an iterative algorithm that rotates a vector to a desired angle, computing trigonometric functions using only addition, subtraction, and bit shift operations, making it efficient for hardware implementation. We will see an example in Verilog soon.
- Lookup Tables: For applications where speed is more critical than precision, or where computational resources are limited, sine and cosine values can be precomputed and stored in lookup tables. The computer then retrieves the value from the table rather than computing it in real time. This is often used in embedded systems or real-time applications.
- Combination of Methods: Often, a combination of the above methods is used. For instance, a lookup table might provide a coarse approximation, and then a polynomial approximation refines this to the required precision.

In modern architectures, trigonometric functions are often computed directly by the CPU using special instructions. This hardware implementation takes advantage of the processor's architecture to perform the calculation more efficiently than software-based methods. The choice of method depends on the trade-off between computational efficiency and the level of precision required. In general, modern standard libraries and CPUs handle these calculations and selections automatically, abstracting these details from the user or programmer. Depending on the input angle, the libraries may choose between different methods at runtime. For instance, for small enough double-precision input angles, the best answer for $\sin(x)$ is in fact $x$ itself[^71].

### Discrete Fourier Transform (DFT)

One thing is clear: real microprocessors cannot work with infinite sequences or infinite spectra, nor can they deal with improper integrals. A DSP in fact cannot compute integrals at all, at least not in the theoretical sense of the term. Memory in processors is always limited, as are computing resources. How do we, then, compute discrete-time Fourier transforms? For finite sequences, it is possible to calculate a sort of alternative Fourier representation, referred to as the discrete Fourier transform (DFT). The DFT corresponds to samples, equally spaced in frequency, of the Fourier transform of the signal.
> [!info]
> In addition to its theoretical importance as a Fourier representation of sequences, the Discrete Fourier Transform (DFT) plays a central role in the implementation of a variety of digital signal-processing algorithms used in many areas such as video processing and telecommunications; for instance, the DFT is extensively used in Orthogonal Frequency-Division Multiplexing (OFDM), which is a method of encoding digital data on multiple carrier frequencies and widely adopted in modern mobile networks like 5G.

Although there are multiple ways of deriving and interpreting the DFT representation of a finite-duration sequence, we will use the relationship between periodic sequences and finite-length sequences, so we will begin by considering the Fourier series representation of periodic sequences. We accomplish this by constructing a periodic sequence for which each period is identical to the equivalent finite-length sequence. As we will see, the Fourier series representation of the periodic sequence corresponds to the DFT of the finite-length sequence. Thus, our approach is to define the Fourier series representation for periodic sequences and to study the properties of such representations. Then we repeat essentially the same derivations, assuming that the sequence to be represented is a finite-length sequence. This approach to the DFT emphasizes the fundamental inherent periodicity of the DFT representation and ensures that this periodicity is not overlooked in applications of the DFT.

Let's consider a sequence $x\lbrack n\rbrack$ that is periodic with period $N$, so that $x\lbrack n\rbrack\ = \ x\lbrack n + \ rN\rbrack$ for any integer values of $n$ and $r$. This means that, $rN$ samples past any sample, the sequence just repeats itself. As with continuous-time periodic signals, such a sequence can be represented by a Fourier series corresponding to a sum of harmonically related complex exponential sequences with frequencies that are integer multiples of the fundamental frequency $2\pi/N$ associated with the periodic sequence $x\lbrack n\rbrack$. These periodic complex exponentials are of the form

$\large e_{k}\lbrack n\rbrack = e^{j(2\pi/N)kn} = e_{k}\lbrack n + rN\rbrack$

where $k$ is an integer, and the Fourier representation has the form

$\large x\lbrack n\rbrack = \frac{1}{N}\sum_{k = 0}^{N - 1}{X\lbrack k\rbrack}e^{j(2\pi/N)kn}$

The Fourier series representation of a continuous-time periodic signal generally requires infinitely many harmonically related complex exponentials, whereas the Fourier series for any discrete-time signal with period $N$ requires only $N$ harmonically related complex exponentials. To see this, note that the harmonically related complex exponentials $e_{k}\lbrack n\rbrack$ are identical for values of $k$ separated by $N$; in other words, $e_{0}\lbrack n\rbrack = e_{N}\lbrack n\rbrack$, $e_{1}\lbrack n\rbrack = e_{N + 1}\lbrack n\rbrack$, and, in general, $e_{k}\lbrack n\rbrack = e_{k + N}\lbrack n\rbrack$. The coefficients $X\lbrack k\rbrack$ are, in turn, obtained from the sequence by the analysis equation:

$\large X\lbrack k\rbrack = \sum_{n = 0}^{N - 1}{x\lbrack n\rbrack e^{- j(2\pi/N)kn}}$

#### Implementation of DFT in Python

Let's leave theory aside for a moment and move into implementation. How do we compute the DFT in reality? We will see next an example code in Python showing the computation of a 128-point Discrete Fourier Transform (DFT) of a sequence with two harmonics. The sequence is defined as a combination of two cosine functions with different frequencies and amplitudes:

- The first harmonic is $\cos\left( 2\pi n/16 \right)$, with a larger amplitude.
- The second harmonic is $0.5\cos\left( 2\pi n/32 \right)$, with half the amplitude.

Note that there are plenty of libraries implementing Fourier Transforms, but this example is purposely done step by step (and using a rectangular form instead of the complex exponential) to show the iterative nature of the DFT. The Discrete Fourier Transform admits many optimizations to make it faster and lighter from a computational point of view, but we will not worry about this for now:

```python
import matplotlib.pyplot as plt
from math import cos, sin, pi

def DFT(x, N_points):
    """
    Compute the Discrete Fourier Transform (DFT) of the one-dimensional sequence x
    """
    N = len(x)
    X = [complex(0)] * N_points  # Create an array of N_points complex numbers
    for k in range(N_points):  # For each frequency bin
        for n in range(N):  # For each input element
            # Complex exponential multiplied by the input signal, accumulated
            X[k] += x[n] * complex(cos(2 * pi * k * n / N), -sin(2 * pi * k * n / N))
    return X

# Example sequence with harmonics of different amplitudes
N = 128
DFT_points = 128
x = [cos(2 * pi * n / 16) + 0.5 * cos(2 * pi * n / 32) for n in range(N)]

# Perform DFT
X = DFT(x, DFT_points)

# Output the result (magnitude of the DFT)
magnitude_X = [abs(X[k]) for k in range(DFT_points)]

# Plotting the results
plt.figure(figsize=(10, 6))
plt.stem(range(DFT_points), magnitude_X)
plt.title('Magnitude of the DFT of the Sequence')
plt.xlabel('Frequency Index')
plt.ylabel('Magnitude')
plt.grid(True)
plt.show()
```

Which yields:

![Output of the naive implementation of Discrete Fourier Transform (DFT)](image251.png)
> [!Figure]
> _Output of the naive implementation of Discrete Fourier Transform (DFT)_

In the DFT plot above, you can observe peaks corresponding to the two harmonic frequencies. The first peak is at the 8th frequency bin, corresponding to the $\cos\left( 2\pi n/16 \right)$ component, and the second peak, half as large, is at the 4th frequency bin, corresponding to the $\cos\left( 2\pi n/32 \right)$ component. To translate the frequency bins obtained from the DFT into Hertz, one needs to consider the sampling rate at which the signal was sampled, because the frequency in Hertz of each bin (or index) in the DFT is determined relative to the sampling rate. Suppose the signal is sampled at a sampling rate $f_{s}$ (in Hertz). The frequency resolution of each DFT bin (the spacing in frequency between each index) is given by $\large \frac{f_{s}}{N}$, where $N$ is the number of points in the DFT. For a given index $k$ in the DFT, the frequency in Hertz is:

$\large \text{Frequency (Hz)} = \frac{k \cdot f_{s}}{N}$

where:

- $k$ is the index in the DFT (ranging from 0 to $N - 1$).
- $f_{s}$ is the sampling rate in Hertz.
- $N$ is the total number of points in the DFT.

For example, if the sampling rate $f_{s}$ is 1000 Hz (1 kHz) and we compute a 128-point DFT:

- The frequency resolution is $\frac{1000\,\text{Hz}}{128} \approx 7.81\,\text{Hz/bin}$.
- For the 8th index (considering 0 as the first index), the frequency is $8 \times 7.81\,\text{Hz} \approx 62.5\,\text{Hz}$.
- For the 4th index, the frequency is $4 \times 7.81\,\text{Hz} \approx 31.25\,\text{Hz}$.

This means that each index in the DFT corresponds to a specific frequency component of the original signal, based on the sampling rate and the number of points in the DFT.
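As a tiny sketch of this bin-to-frequency mapping (using the same example values, $f_{s} = 1000$ Hz and $N = 128$, from the paragraph above):

```python
def bin_to_hz(k, fs, N):
    """Frequency in Hz corresponding to DFT bin index k, for sampling rate fs and an N-point DFT."""
    return k * fs / N

fs, N = 1000.0, 128            # example values from the text
print(bin_to_hz(8, fs, N))     # 62.5 Hz
print(bin_to_hz(4, fs, N))     # 31.25 Hz
```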
It can be easily noted that there seems to be an integer relationship between the frequency resolution and the frequencies chosen in this example. This greatly helps to obtain "clear" results in the DFT, with the frequency components' energy being fully concentrated around specific frequency bins. This is not always the case. Let's see an example where the harmonics are not integer multiples of the frequency resolution. In this case, one harmonic is now $\cos\left( 2\pi n\text{/}14 \right)$ and the other one is $\cos\left( 2\pi n\text{/}31 \right)$. The figure below shows how the bins are now "leaking" energy into adjacent bins. Note that a wrong interpretation of this effect would be to think the signal now has extra harmonics. This is not the case; it is only a numerical artifact of the transform.

![Leakage effect in DFT](image252.png)

> [!Figure]
> _Leakage effect in DFT_

Leakage in the context of the Discrete Fourier Transform (DFT) is a phenomenon that occurs when the frequency components of a signal are not exactly aligned with the frequency bins of the DFT. This misalignment results in the spreading of energy from a signal's true frequency components into adjacent DFT bins, which can make the frequency spectrum appear smeared or blurred. This is known as spectral leakage.

Leakage occurs primarily when the analyzed signal contains frequency components that do not complete an integer number of cycles within the DFT window. In such cases, the DFT treats the signal as if it were abruptly cut off at the window's edges, leading to discontinuities at the boundaries. This discontinuity introduces additional frequency components according to Fourier theory. Also, the finite length of the DFT introduces the equivalent of a rectangular window in time, which, when multiplied with the signal, can be thought of as convolving the signal's spectrum with the spectrum of the rectangular window. The rectangular window has a sinc-shaped frequency response, leading to the spreading of energy across multiple frequency bins. As a result, the peak of the actual frequency component might be lower than expected because some of its energy has 'leaked' into adjacent bins, which means that bins adjacent to a strong frequency component might show higher levels than they should, making it harder to detect low-amplitude harmonics near a strong harmonic.

There are some mitigation techniques for leakage. For instance, applying the window functions we discussed to the signal before performing the DFT can reduce leakage (a minimal sketch of this is shown a bit further below). These window functions taper the signal to zero at the boundaries of the DFT window, reducing the abruptness of the cut-off and hence the discontinuities. Another method is zero-padding. Adding zeros to the end of the signal sequence (extending its length before performing the DFT) increases the density of frequency samples, interpolating the spectrum more finely. This can help separate closely spaced frequency components, although it doesn't reduce the leakage itself. Using a longer DFT over more actual data (more points) increases the frequency resolution, allowing for a finer distinction between closely spaced frequencies, at the expense of a higher computational load. If possible, modifying the signal to ensure that it contains an integer number of cycles within the DFT window can eliminate leakage, but this is often not practical or possible.

![Zero-padding effect in DFT leakage](image253.png)

> [!Figure]
> _Zero-padding effect in DFT leakage_

From the figures shown, the known "mirrored" output of the DFT can be seen.
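Before examining that mirrored spectrum in more detail, here is the minimal windowing sketch promised above. It is an illustrative example, not part of the original code: it assumes the step-by-step `DFT` helper defined earlier is available, builds a Hann window explicitly, and compares the leaky two-tone spectrum with and without the window.

```python
import matplotlib.pyplot as plt
from math import cos, pi

# Assumes the DFT() helper defined earlier in this section is available.
N = 128

# Two tones that do NOT complete an integer number of cycles in the window
x = [cos(2 * pi * n / 14) + 0.5 * cos(2 * pi * n / 31) for n in range(N)]

# Hann window: tapers the sequence to zero at both ends of the DFT window
w = [0.5 - 0.5 * cos(2 * pi * n / (N - 1)) for n in range(N)]
xw = [x[n] * w[n] for n in range(N)]

# Magnitude spectra with and without the window
X_rect = [abs(v) for v in DFT(x, N)]
X_hann = [abs(v) for v in DFT(xw, N)]

plt.figure(figsize=(10, 6))
plt.plot(X_rect, label='Rectangular (no window)')
plt.plot(X_hann, label='Hann window')
plt.title('Effect of windowing on spectral leakage')
plt.xlabel('Frequency Index')
plt.ylabel('Magnitude')
plt.legend()
plt.grid(True)
plt.show()
```

The windowed spectrum trades a slightly wider main lobe for much lower sidelobes, which is usually the better bargain when looking for weak tones next to strong ones.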
The appearance of mirrored or symmetric harmonics in the DFT output is a fundamental characteristic of the DFT when applied to real-valued signals. This phenomenon is due to the inherent properties of the Fourier Transform for real-valued functions. For any real-valued time-domain sequence (meaning the signal does not have any imaginary component), its DFT is symmetric with respect to the complex conjugate. This means that the positive and negative frequency components of the DFT are complex conjugates of each other. Mathematically, if $X\left\lbrack k \right\rbrack$ is the DFT of a real-valued signal $x\lbrack n\rbrack$, then:

$\large X\left\lbrack k \right\rbrack = X^{*}\left\lbrack N - k \right\rbrack$

where $X^{*}\left\lbrack N - k \right\rbrack$ is the complex conjugate of the DFT coefficient at the $N - k$th frequency bin, and $N$ is the total number of points in the DFT. The DFT produces frequency bins ranging from 0 up to the Nyquist frequency (half of the sampling frequency) and then back down toward 0. In terms of indices, this range is from 0 to $\frac{N}{2}$ and then from $\frac{N}{2} + 1$ to $N - 1$ in a mirrored fashion. For a real-valued signal, the second half of the DFT (from $\frac{N}{2} + 1$ to $N - 1$) does not provide additional information; it's the mirror image (complex conjugate) of the first half. Therefore, when analyzing the frequency spectrum of a real-valued signal using the DFT, one only needs to consider the first half of the DFT results (up to the Nyquist frequency) because this contains all the unique frequency information. The second half is a mirrored version of the first half.

### The Z Transform

We have seen that the Fourier transform plays a key role in representing and analyzing discrete-time signals and systems. In this section, we explore the z-transform representation of a sequence and study how the properties of a sequence are related to the properties of its z-transform. The z-transform for discrete-time signals is the equivalent of the Laplace transform for continuous-time signals, and they each have a similar relationship to the corresponding Fourier transform. One motivation for introducing this generalization is that the Fourier transform does not converge for all sequences, and it is useful to have a generalization of the Fourier transform that encompasses a broader class of signals. A second advantage is that in analytical problems the z-transform notation is often more convenient than the Fourier transform notation.

The Fourier transform of a sequence $x\left\lbrack n \right\rbrack$ was defined before as

$\large X(e^{j\omega}) = \sum_{n = -\infty}^{\infty}{x\lbrack n\rbrack e^{- j\omega n}}$

Given a discrete-time signal $x[n]$ (where $n$ is an integer representing the time index), its z-transform $X(z)$ is defined as:

$\large X\left( z \right) = \sum_{n = - \infty}^{\infty}{x\left\lbrack n \right\rbrack z^{- n}}$

In this equation:

- $X\left( z \right)$ is the z-transform of the signal.
- $z$ is a complex number, generally represented as $z = re^{j\omega}$, where $r$ is the magnitude and $\omega$ is the phase angle.
- The sum is taken over all values of $n$, from negative to positive infinity, encompassing the entire signal.

The z-transform converts the discrete-time signal into a function in the complex plane. The inverse z-transform is used to go back from the z-domain to the time domain.
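To make the definition concrete, here is a standard worked example, not taken from the text above, of applying the sum directly to the right-sided exponential sequence $x[n] = a^{n}u[n]$ (where $u[n]$ is the unit step):

$\large X(z) = \sum_{n = 0}^{\infty}{a^{n}z^{- n}} = \sum_{n = 0}^{\infty}{\left( az^{- 1} \right)^{n}} = \frac{1}{1 - az^{- 1}}, \quad \left| z \right| > \left| a \right|$

The geometric series converges only when $\left| az^{- 1} \right| < 1$, which is why the result carries the condition $|z| > |a|$ (its region of convergence); when $|a| < 1$, the unit circle $z = e^{j\omega}$ lies inside that region and the expression reduces to the Fourier transform of the same sequence.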
Key properties of the Z-transform include:

- Linearity: The Z-transform of a linear combination of signals is the same linear combination of their Z-transforms.
- Time Shifting: Shifting a signal in time by $k$ samples corresponds to multiplying its Z-transform by $z^{- k}$:

$\large x\left\lbrack n - k \right\rbrack \leftrightarrow z^{- k}X\left( z \right)$

Since the z-transform is a function of a complex variable, it is convenient to describe and interpret it using the complex z-plane. In the z-plane, the contour corresponding to $\left| z \right| = 1$ is a circle of unit radius, as illustrated in the figure below. This contour is referred to as the unit circle. The z-transform evaluated on the unit circle corresponds to the Fourier transform. Note that $\omega$ is the angle between the vector to a point $z$ on the unit circle and the real axis of the complex z-plane. If we evaluate $X(z)$ at points on the unit circle in the z-plane beginning at $z = 1$ (that is, $\omega = 0$) through $z = j$ (i.e., $\omega = \pi/2$) to $z = -1$ (i.e., $\omega = \pi$), we obtain the Fourier transform for $0 \leq \omega \leq \pi$.

![The unit circle in the complex Z-plane](image254.png)

> [!Figure]
> _The unit circle in the complex Z-plane_

Now, a valid question to ask is: why use the z-transform? In real hardware, synchronizing calculations in a structure of registers can be daunting when using traditional time-domain analysis of waveforms to visualize data flow. The z-transform is a viable alternative. From the properties of the z-transform we explored above, we can recognize the notation $z^{- 1}$ for a unit-sample delay, which can be easily implemented with a register.

![Signal flow graph of direct-form FIR filter (source: https://www.eetimes.com/analyze-dsp-designs-in-fpgas-with-the-z-transform/)](image255.gif)

This refers to an important property of the z-transform: a delay in the time domain corresponds to the z-transform of the signal without delay, multiplied by a negative power of $z$ ($z^{-1}$ per sample of delay). We can represent discrete-time systems using signal-flow graphs combined with the z-transform. The signal flow graph is a time-tested tool for visualizing DSP algorithms. The figure above is the signal flow graph of an FIR filter. Three elements comprise a signal flow graph:

- Branch node (1 in figure below): sends a copy of the input signal to several output paths.
- Summing node (2 in figure below): outputs the sum of all signals flowing into it.
- Delay element (3 in figure below): stores a delayed sample of the input signal.

![Signal-flow graph main elements: (1): Branch; (2): Summing node; (3): Delay element (source: https://www.eetimes.com/analyze-dsp-designs-in-fpgas-with-the-z-transform/)](image256.png)

Note the filter coefficients $b_{0}\ldots b_{3}$, which multiply the signal flowing through each branch. For clarity, multiplier blocks are not explicitly shown. From the signal flow, we can derive the z-transform of the output:

$\large Y\left( z \right) = b_{0}X\left( z \right) + b_{1}z^{- 1}X\left( z \right) + b_{2}z^{- 2}X\left( z \right) + b_{3}z^{- 3}X\left( z \right)$

More generically,

$\large Y(z) = \sum_{k = 0}^{N - 1}{b_{k}z^{- k}X\left( z \right)}$

which is the familiar FIR filter sum-of-products, equivalent to discrete-time convolution in the time domain:

$\large y\left\lbrack n \right\rbrack = x\left\lbrack n \right\rbrack*b\left\lbrack n \right\rbrack$

## Digital Signal Processors

Digital signal processing (DSP) has many advantages over analog signal processing.
Digital signals are more robust than analog signals with respect to environmental and process variations. The accuracy of digital representations can be controlled better by changing the numerical parameters of the signal, such as word length and precision. Digital signals can be stored and recovered, transmitted and received, processed and manipulated, all with a reasonably small (and, to some extent, controlled) amount of error. Digital signal processors allow the realization of complex systems with high precision, high signal-to-noise ratio (SNR), repeatability, and flexibility. Digital Signal Processing algorithms can be realized using programmable processors or custom-designed hardware circuits fabricated using VLSI circuit technology. The key now is to translate the mathematical operations we discussed in the previous section into hardware, and see what architectures may help to implement said algorithms in the most sensible way possible.

It must be highlighted that [[Semiconductors#Graphics Processing Units (GPUs)|GPUs (Graphics Processing Units)]] have become increasingly popular for a wide range of computational tasks beyond just graphics processing, including tasks traditionally handled by DSPs (Digital Signal Processors). The combination of parallel processing power, flexibility, performance, efficiency, and strong ecosystem support has contributed to GPUs taking over many tasks that were traditionally handled by DSPs. However, it's worth noting that DSPs still have their niche in certain applications where specialized hardware and power consumption optimizations are required for tasks such as real-time audio processing, software-defined radio, and telecommunications.

### Multiply and Accumulate (MAC)

One of the most common operations required in digital signal processing applications is array multiplication. For example, performing the [[Semiconductors#Microprocessor Devices#Linear Time-Invariant Systems and the Convolution Sum|convolution]] sum requires array multiplication. One of the important requirements of these array multipliers is that they have to process the signals in real time; before the next sample of the input sequence arrives at the input to the array, the array multiplication should be completed. This requires multiplication as well as accumulation to be carried out using hardware elements. There are two approaches to solving this problem. A dedicated MAC unit may be implemented in hardware, which integrates a multiplier and accumulator in a single hardware unit. The other approach is to have the multiplier and accumulator separate. The presence of a MAC building block is one of the mandatory requirements of any DSP device.

Convolution is performed by using a special instruction called MACD, or multiply-accumulate with data movement. Modern DSPs have MACD instructions that multiply the content of the program memory at address pma with the content of the data memory at address dma and store the result in the product register. The content of the product register is added to the accumulator before the new product is stored. Further, the content of dma is copied to the next location, whose address is dma + 1. It may be noted that executing such an instruction requires four memory accesses per instruction cycle. The four memory accesses/clock periods required for the MACD instruction are as follows:

1. Fetch the MACD instruction from the program memory
2. Fetch one of the operands from the program memory
3. Fetch the second operand from the data memory
4. 
Write the content of the data memory with address dma into the location with the address dma + 1 ### Architecture And here's where the DSP internal architecture will play a significant part. The relatively static impulse response coefficients are stored in the program memory and the samples of the input data are stored in the data memory. If the MACD instruction is to be executed in a machine with Von Neumann architecture, it requires four clock cycles. This is because in the Von Neumann architecture (shown below) there is a single address bus and a single data bus for accessing the program as well as data memory area. ![](von_neumann.png) > [!Figure] > Von Neumann CPU organization (source: #ref/Venkataramani ) One of the ways by which the number of clock cycles required for memory access can be reduced is to use more than one bus for both address and data. For example in the Harvard architecture shown below, there are two separate buses for program and data memory, hence the content of program memory and data memory can be accessed in parallel. The instruction code can be fed from the program memory to the control unit while the operand is fed to the processing unit from the data memory. The processing unit consists of the register file and processing elements such as MAC units, multiplier, ALU, shifter, etc. ![](harvard_1.png) > [!Figure] > Harvard architecture (source: #ref/Venkataramani ) DSPs more typically follow the modified Harvard architecture shown in the figure below. A modified Harvard architecture is a variation that allows memory that contains instructions to be accessed as data. Most modern computers that are documented as Harvard architecture are in reality modified Harvard architectures. In modified Harvard architectures, one set of buses is used to access a memory that has both program and data and another that has data alone. ![](harvard_2.png) > [!Figure] > Modified Harvard architecture (source: #ref/Venkataramani ) Another architecture used for DSPs is the very long instruction word (VLIW) architecture. These DSPs have several data paths, ALUs, MAC units, shifters, etc. The VLIW is accessed from memory and is used to specify the operands and operations to be performed by each of the data paths. In these architectures, the multiple functional units share a common multi-ported register file for fetching the operands and storing the results. Parallel random access by the functional units to the register file is facilitated by the read/write crossbar. Execution of the operations in the functional units is carried out concurrently with the load/store operation of data between a RAM and the register file. The performance gains achieved with VLIW architecture depend on the degree of parallelism in the algorithm selected for a DSP application and the number of functional units. The throughput will be higher only if the algorithm involves the execution of independent operations. However, it may not always be possible to have an independent stream of data for processing. Further, the number of functional units is also limited by the hardware cost for the multi-ported register file and cross-bar switch. ![](VLIW.png) > [!Figure] > VLIW architecture with 3 independent functional units (FU) ### SHARC Processors We will take some commercial DSPs to observe how they are designed. Analog Devices' SHARC processors are a good example. SHARC stands for "Super Harvard Architecture Single Chip Computer" (don't ask me why the acronym doesn't add up). 
This "Super" Harvard architecture extends the original concepts of the modified Harvard architecture (separate program and data memory buses) by adding an I/O processor with its associated dedicated buses. In addition, SHARC processors integrate large memory arrays and application-specific peripherals. We will more specifically observe the ADSP-TS201 processor, which is a 128-bit, high-performance processor from the TigerSHARC family. The ADSP-TS201 processor combines multiple computation units for floating-point and fixed-point processing and wide word widths. As shown in the figure below, the processor has the following architectural features: - Dual computation blocks: X and Y – each consisting of a multiplier, ALU, CLU, shifter, and a 32-word register file - Dual integer ALUs: J and K – each containing a 32-bit IALU and 32-word register file - Program sequencer – Controls the program flow and contains an instruction alignment buffer (IAB) and a branch target buffer (BTB) - Three 128-bit buses provide high bandwidth connectivity between internal memory and the rest of the processor core (compute blocks, IALUs, program sequencer, and SOC interface) - A 128-bit bus providing high bandwidth connectivity between internal memory and external I/O peripherals (DMA, external port, and link ports) - External port interface including the host interface, SDRAM controller, static pipelined interface, four DMA channels, four LVDS link ports (each with two DMA channels), and multiprocessing support - 24Mbits of internal memory organized as six 4M bit blocks—each block containing 128K words x 32 bits; each block connects to the crossbar through its own buffers and a 128Kbit, 4-way set associative cache. ![](sharc.gif) > [!Figure] > _ADSP-TS201 processor block diagram_ > [!warning] > This section is under #development # Field Programmable Gate Arrays (FPGA) Having discussed the most salient building blocks of digital circuits (there are, of course, a plethora of other blocks but for the sake of discussing what is coming up next, we probably had enough). It's now time to imagine things a bit. Imagine that someone would give out a magic circuit board, a flat surface of a certain area, and that someone would just populate this board with tens of thousands of the building blocks we discussed in the sections above. This means, thousands of logic gates, flip flops, registers, multiplexers, demultiplexers, memories, and LUTs. And also imagine that this mysterious person would also give you a magic wand with routing powers where you could just choose how to connect all these things together; what NAND input to what other gate output, the output of a register to the input of a multiplexer, the select signals of the multiplexer to some other register, the output of a register into a memory data bus and so on. And also imagine this magic configuration system would also allow you to parameterize different things around, like for instance bit widths: with a wave of the wand, you could choose to have 8-bit muxes, 64-bit, etc. Well, with a bit of idealization—for narrative purposes—what we have just described is an FPGA. The roots of FPGAs trace back to the development of programmable logic devices (PLDs) in the 1970s, with the invention of Programmable Read-Only Memory (PROM) and Programmable Array Logic (PAL). 
These devices offered the flexibility to implement various logic functions by burning fuses within the chip, leading to customized digital logic but without the possibility of reconfiguration after programming. The invention of the FPGA in the 1980s, credited to Ross Freeman and Bernard Vonderschmitt, founders of Xilinx, brought a new dimension to programmable devices. Unlike their predecessors, FPGAs were reprogrammable, which allowed them to be reused for different tasks or to update their functionality after deployment. They achieved this by using an array of logic blocks and a hierarchy of reconfigurable interconnects that could be programmed to replicate complex logical structures, from simple gates to entire microprocessors. At their core, FPGAs consist of three key components: logic blocks, I/O blocks, and routing channels. The logic blocks are the heart of the FPGA and contain a small amount of logic circuitry that can be configured to perform a specific function. These blocks typically contain look-up tables (LUTs), multiplexers, and flip-flops. I/O blocks are used to connect the internal logic of the FPGA to the outside world, and they can be programmed to match the voltage and protocol requirements of the peripheral devices. Routing channels interconnect the logic blocks and I/O blocks in a flexible manner, often using programmable switches that can connect any input to any output. Over the past few decades, FPGAs have not only increased in logic density but have also expanded their capabilities to include high-speed serial transceivers. These transceivers are sophisticated communication blocks integrated into the FPGA fabric that can transmit and receive data at very high speeds, reaching into the gigabits per second (Gbps) range. High-speed serial transceivers are a testament to the evolution of FPGAs from simple glue logic devices to complex systems capable of handling advanced communication protocols and data transfer methods. These transceivers enable FPGAs to interface directly with high-speed communication protocols such as [[site/Data Interfaces#PCI Express|PCI Express]] and [[High-Speed Standard Serial Interfaces#Ethernet|Gigabit Ethernet]]. Point-to-point, high-speed channels in FPGAs are critical for applications requiring rapid and reliable data movement, such as in network infrastructure, where FPGAs can process and route data packets with low latency and determinism. In high-performance computing and data centers, they facilitate data flow between servers, storage systems, and network interfaces. Incorporating high-speed transceivers into FPGAs also means these devices can be used in applications previously dominated by ASICs and other specialized communication chips. For instance, in the telecommunications industry, the ability to handle multiple communication protocols at high speeds makes FPGAs ideal for base stations and other parts of the network backbone that need to dynamically support different standards. Moreover, the versatility of FPGA serial transceivers extends beyond communication protocols. They also find applications in high-speed data acquisition and digital signal processing, where large amounts of data need to be transferred quickly to and from sensors and other peripherals. This is particularly relevant in scientific instrumentation, radar, medical imaging, space systems, autonomous vehicles, and defense, where the speed of data processing can make a big difference. 
The high-speed transceiver blocks in FPGAs come with a range of features that support robust data transfer. They typically include programmable pre-emphasis and equalization to mitigate signal degradation over long distances or noisy environments. Clock data recovery (CDR) circuits maintain the integrity of the received data by compensating for variations in the input data stream. As FPGA technology continues to progress, the integration of these transceivers is becoming more sophisticated, supporting higher data rates, lower power consumption, and improved signal integrity.

FPGAs have also become popular in industries where customization and speed of deployment are critical. Their reconfigurability makes them a good fit for applications such as signal processing, cryptography, communications, and data center acceleration, where they can perform parallel processing much faster than traditional CPUs. With the integration of additional components such as memory blocks, digital signal processing slices, hard IP cores, and CPU cores, FPGAs represent a hybrid between programmable logic, fixed-function ASICs, and microprocessors. This integration has propelled FPGAs into the forefront of hardware acceleration in computing, particularly in areas that demand high performance with a considerable level of flexibility. The line between microprocessors, FPGAs, and System-on-Chips is rapidly blurring, so let's maybe take some time to discuss what all this means.

> [!Opinion Alert]
> Perhaps the ONE thing that has kept FPGAs from being really widely adopted is the typically closed-source nature of all parts of the toolchain, which is painful, and the strong vendor lock-in that permeates the industry.

## Internal Structure

Unlike CPU cores, where the internal structure is optimized for the task of performing several operations as per the instructions stored in memory, an FPGA does not pursue such specific functionality. FPGAs are designed so that the target application defines their functionality. FPGAs accomplish this by having a collection of on-chip generic building blocks (called primitives) that can be connected and combined to perform almost any logic operation. Examples of primitives are Configurable Logic Blocks (CLBs), block RAM, and I/O blocks, but also SerDes, Phase-Locked Loops (PLLs), transceivers, and DSP blocks. We will explore each one of these in this section.

### Configurable Logic Block (CLB)

A Configurable Logic Block (CLB) in a Field-Programmable Gate Array (FPGA) is a fundamental building block that can be programmed to perform a wide range of logical functions.

![Configurable Logic Block](image257.png)

> [!Figure]
> _Configurable Logic Block_

A CLB typically contains a set of basic logic elements, such as Look-Up Tables (LUTs), flip-flops, and multiplexers. The number and configuration of these elements can vary between different FPGA models and manufacturers. [[Semiconductors#Look-Up Tables (LUTs)|LUTs]] are an important component of CLBs. Each LUT can be configured to realize any function of its input variables. The number of inputs to a LUT (commonly 4, 5, or 6) determines the complexity of the logic function it can implement (a minimal sketch of the LUT-as-truth-table idea is shown below). [[Semiconductors#Flip-Flops|Flip-flops]] in the CLB enable the implementation of sequential logic, allowing the FPGA to handle clocked operations and store state information.
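Conceptually, a $k$-input LUT is nothing more than a $2^{k}$-entry truth table indexed by its inputs. The short Python sketch below is an illustrative model, not vendor code; it configures a 4-input LUT as a 4-input AND simply by choosing its 16 bits of contents:

```python
def lut4(lut_contents: int, inputs: int) -> int:
    """Model of a 4-input LUT: return the bit of lut_contents selected by inputs (0..15)."""
    return (lut_contents >> inputs) & 1

# Configure the LUT as a 4-input AND: only entry 15 (all inputs high) outputs 1
AND4 = 1 << 15  # 0x8000

for value in range(16):
    assert lut4(AND4, value) == (1 if value == 0b1111 else 0)
print("4-input AND realized purely by table lookup")
```

Changing those 16 bits of contents reprograms the same structure into any other 4-input function (XOR, majority, a decoder slice, and so on), which is exactly what the configuration bitstream does for the real LUTs in the fabric.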
The multiplexer allows the CLB to be used either for combinatorial logic (the LUT) or for sequential logic (the flip-flop). The CLB is connected to a network of programmable routing interconnects that allow it to communicate with other CLBs and I/O blocks. This interconnect network is configurable, allowing the CLB to be linked to basically any other block within the FPGA. Then, the behavior of the LUTs, flip-flops, and interconnects can be configured through user-defined bitstreams, which involves loading a configuration into the FPGA that will modify the connections in the fabric. This bitstream dictates how the CLBs and other resources interact to perform the desired functions. In many FPGAs, CLBs are organized hierarchically. A group of CLBs can form a larger logic block, which in turn forms part of a larger region. This hierarchy enables efficient utilization of resources and simplifies the design process. FPGA manufacturers provide specialized software tools that help designers map their digital logic designs onto the FPGA's array of CLBs, abstracting many of the low-level details and simplifying the design process.

![CLBs distributed across an FPGA (credit: AMD)](image258.png)

> [!Figure]
> _CLBs distributed across an FPGA (credit: AMD)_

#### Creating a CLB in Verilog

Creating a CLB in Verilog is a bit misleading as an exercise, because CLBs are fixed in the hardware of the FPGA, and languages like Verilog or VHDL are supposed to create logic by combining CLBs. Still, it is useful to see a CLB in action, so we will model it anyway. Our naive Configurable Logic Block (CLB) will contain:

1. A 4-input LUT, which can be represented by a 16-bit register (since there are $2^{4} = 16$ possible input combinations).
2. A D-type flip-flop for sequential logic.
3. An output multiplexer to select between the LUT output and the flip-flop output.

Here's the Verilog implementation:

```Verilog
module clb_with_mux(
    input clk,                  // Clock input for the flip-flop
    input [3:0] inputs,         // 4-bit inputs for the LUT
    input [15:0] lut_contents,  // 16-bit LUT content
    input ff_en,                // Flip-flop enable
    input mux_sel,              // Mux select input: 0 for LUT output, 1 for Flip-flop output
    output reg output_q         // Output of the CLB
);

    // LUT Implementation
    wire lut_output;
    assign lut_output = lut_contents[inputs];

    // Flip-Flop Implementation
    reg ff_output;
    always @(posedge clk) begin
        if (ff_en)
            ff_output <= lut_output;
    end

    // Output Mux Implementation
    always @(*) begin
        case(mux_sel)
            1'b0: output_q = lut_output;  // Select LUT output
            1'b1: output_q = ff_output;   // Select Flip-Flop output
            default: output_q = 1'b0;     // Default case
        endcase
    end

endmodule
```

In this module:

- `inputs` are the 4-bit inputs to the LUT.
- `lut_contents` is a 16-bit vector specifying the output of the LUT for each possible input combination.
- `ff_en` is the enable signal for the flip-flop. When enabled, the flip-flop captures the output of the LUT at each rising clock edge.
- `mux_sel` is the select line for the output multiplexer. Depending on its value, the output (`output_q`) will be either the current LUT output (if the selector is low) or the last captured output of the flip-flop (if the selector is high).
- The LUT output is directly and somewhat naively assigned using a bit-selection from `lut_contents`.
- The flip-flop output `ff_output` is updated on the rising edge of `clk` when `ff_en` is high.

To create a testbench for the scenario with two CLBs, we'll simulate various conditions by applying different inputs.
This testbench will initialize and modify the LUT contents, flip-flop enable signals, and multiplexer selections for both CLBs, showcasing how their behavior changes based on these inputs. Here's an example testbench for this purpose: ```Verilog `timescale 1ns / 1ps module clb_with_mux_tb; // Inputs reg clk; reg [3:0] clb1_inputs, clb2_inputs; reg [15:0] clb1_lut_contents, clb2_lut_contents; reg clb1_ff_en, clb2_ff_en; reg clb1_mux_sel, clb2_mux_sel; // Outputs wire clb1_output, clb2_output; // Instantiate CLBs clb_with_mux clb1( .clk(clk), .inputs(clb1_inputs), .lut_contents(clb1_lut_contents), .ff_en(clb1_ff_en), .mux_sel(clb1_mux_sel), .output_q(clb1_output) ); clb_with_mux clb2( .clk(clk), .inputs(clb2_inputs), .lut_contents(clb2_lut_contents), .ff_en(clb2_ff_en), .mux_sel(clb2_mux_sel), .output_q(clb2_output) ); // Clock generation always #10 clk = ~clk; initial begin $dumpfile("clb_with_mux_tb.vcd"); // Dump all variables in the testbench $dumpvars(0, clb_with_mux_tb); // Initialize Inputs clk = 0; clb1_inputs = 0; clb2_inputs = 0; clb1_lut_contents = 16'hAAAA; clb2_lut_contents = 16'h5555; clb1_ff_en = 1; clb2_ff_en = 1; clb1_mux_sel = 0; clb2_mux_sel = 0; // Wait for the global reset #100; // Simulate CLB behavior // Apply different inputs, enable/disable flip-flops, and switch mux selection clb1_inputs = 4'b1010; clb2_inputs = 4'b0101; #50 clb1_mux_sel = 1; // Switch to flip-flop output for CLB1 #50 clb1_inputs = 4'b1100; clb2_inputs = 4'b0011; #50 clb2_ff_en = 0; // Disable flip-flop for CLB2 #50 clb2_mux_sel = 1; // Switch to flip-flop output for CLB2 #50; // Finish the simulation $finish; end endmodule ``` In this testbench: - Two instances of `clb_with_mux` are created, representing our two CLBs. - The testbench controls the inputs, LUT contents, flip-flop enable signals, and multiplexer selections for both CLBs. - The behavior of each CLB is simulated under different conditions to demonstrate how their outputs change with the applied bitstreams and control signals. - The clock signal is generated with a period of 20ns. - The simulation goes through a sequence of steps, changing inputs and control signals at different time instances. The GTKWave output is as follows: ![GTKWave output for the 2-CLB testbench](image259.png) #### Full Adder implemented with 2 CLBs To implement a full adder using the two previously defined Configurable Logic Blocks (CLBs), each with a 4-input LUT, a flip-flop, and an output multiplexer, we need to first map the full adder's functionality onto the capabilities of these CLBs. A full adder adds two one-bit numbers (A and B) and a carry-in (C_in) to produce a sum (Sum) and a carry-out (C_out). The logic functions for a full adder are: - Sum = A XOR B XOR C_in - C_out = (A AND B) OR (A AND C_in) OR (B AND C_in) Given the 4-input LUT in each CLB, we can implement these functions as follows: - CLB1: Used to compute part of the Carry-out (C_out) calculation. Let's say it computes (A AND B). - CLB2: Used for the Sum. As the Sum operation is a 3-input function, we can use the remaining input as a constant or as a don't-care. 
```Verilog
module full_adder_using_clbs(
    input A,
    input B,
    input C_in,
    output Sum,
    output C_out
);

    wire clb1_output, clb2_output;

    // CLB1: Implementing A AND B
    clb_with_mux clb1(
        .clk(1'b0),                  // Clock not used
        .inputs({2'b00, A, B}),      // Only A and B are relevant
        .lut_contents(16'h0008),     // LUT for AND operation: only entry 3 (A=1, B=1) is 1
        .ff_en(1'b0),                // Flip-flop not used
        .mux_sel(1'b0),              // Use LUT output directly
        .output_q(clb1_output)
    );

    // CLB2: Implementing Sum = A XOR B XOR C_in
    clb_with_mux clb2(
        .clk(1'b0),                  // Clock not used
        .inputs({1'b0, A, B, C_in}), // A, B, and C_in for XOR
        .lut_contents(16'h6996),     // LUT for XOR operation
        .ff_en(1'b0),                // Flip-flop not used
        .mux_sel(1'b0),              // Use LUT output directly
        .output_q(clb2_output)
    );

    // Deriving Sum and C_out
    assign Sum = clb2_output;
    assign C_out = clb1_output | (A & C_in) | (B & C_in); // Combining CLB1 output with remaining OR operations

endmodule
```

In this implementation:

- CLB1 is programmed to perform the AND operation for A and B.
- CLB2 is programmed for the XOR operation needed for the Sum.

The carry-out logic is partially handled by CLB1 and the remaining OR operations are done outside the CLBs, as our defined CLBs can only handle one logic operation at a time. In a more complex design, one could have a CLB capable of handling more complex logic, or you could chain multiple CLBs to achieve the desired functionality. Now we write a testbench for the full adder using two CLBs:

```Verilog
`timescale 1ns / 1ps

module full_adder_using_clbs_tb;

    // Inputs
    reg A;
    reg B;
    reg C_in;

    // Outputs
    wire Sum;
    wire C_out;

    // Instantiate the Unit Under Test (UUT)
    full_adder_using_clbs uut (
        .A(A),
        .B(B),
        .C_in(C_in),
        .Sum(Sum),
        .C_out(C_out)
    );

    // Test all input combinations
    initial begin
        // Initialize Inputs
        A = 0;
        B = 0;
        C_in = 0;

        // Wait for global reset
        #100;

        // Apply different combinations of inputs
        {A, B, C_in} = 3'b000; #10;
        {A, B, C_in} = 3'b001; #10;
        {A, B, C_in} = 3'b010; #10;
        {A, B, C_in} = 3'b011; #10;
        {A, B, C_in} = 3'b100; #10;
        {A, B, C_in} = 3'b101; #10;
        {A, B, C_in} = 3'b110; #10;
        {A, B, C_in} = 3'b111; #10;

        // Finish the simulation
        $finish;
    end

endmodule
```

And we use GTKWave as usual to inspect the behavior of the circuit:

![GTKWave output for the full adder implemented with 2 CLBs](image260.png)

FPGAs are designed with a fast carry chain that allows the carry output of one CLB to be directly fed into the carry input of the adjacent CLB. This direct connection is important for high-speed arithmetic operations because it significantly reduces the delay typically associated with carry propagation. In a typical logic gate implementation, the carry would need to propagate through multiple gates, causing a delay at each gate, which adds up and slows down the operation. By using a dedicated carry chain, FPGAs can perform arithmetic operations, like addition, subtraction, and even more complex calculations, much more rapidly than if these operations had to be implemented using general logic gates. This is especially important for applications that require high-speed data processing, like digital signal processing, cryptography, and others.

### Clock Domains

From the section above, we saw that CLBs contain flip-flops, and we know that [[Semiconductors#Flip-Flops|flip-flops]] need a clock to latch their inputs with their outputs. If all CLBs in an FPGA shared the same clock, there would be only one clock domain and the design would be rather simple.
However, a single clock domain for the entire FPGA fabric would be inefficient in terms of power management, and in any case not everything on-chip must run at the same speed. Distributing a single high-speed clock across a large area can lead to significant clock skew, which is challenging to manage. Why not just use one clock and a set of dividers? In a single clock domain with dividers, synchronization issues might arise due to the phase differences introduced by the dividers. Dividers can introduce phase shifts, complicating the design of synchronous systems. Therefore, FPGAs create zones or domains for independent clocks. Each clock domain is defined by the scope of a particular clock signal, and components within the same clock domain receive the same clock signal. Note that working with multiple domains also brings a set of challenges, for instance, metastability arising when signals cross between asynchronous clock domains, which can cause functional failures if the appropriate clock synchronizers are not present. When signals and data move across domains governed by dissimilar clocks, techniques such as dual flip-flop synchronization, FIFO buffers, or handshaking protocols are used to ensure data integrity. Block RAMs are an important building block when it comes to buffering data across clock domains, so we will discuss this component next.

### Block RAM

Block RAMs (BRAMs) are embedded directly within the FPGA fabric, providing a high-speed, efficient way to store and access data. The primary purpose of BRAMs is to offer fast data storage and retrieval, which is essential for many digital applications. Since they are located within the FPGA, they provide much quicker access to data compared to external memory resources, resulting in lower latency. This makes them useful for applications that require rapid and frequent access to large amounts of data, such as image processing, digital signal processing, and video buffering. Another important aspect of BRAMs is their configurability. Unlike fixed memory structures, BRAMs in FPGAs can be configured to suit the specific needs of the application. They can be set up in various widths and depths, allowing designers to optimize memory usage based on the particular requirements of their project. This flexibility is a significant advantage in custom digital designs where specific memory configurations are needed. In addition to serving as simple storage units, BRAMs can be used in more complex applications such as implementing lookup tables, FIFO buffers, and even custom memory architectures like caches. They are essential in scenarios where data needs to be temporarily stored and manipulated within the FPGA, such as in data streaming applications where incoming data is buffered before processing.

### Input/Output Blocks

> [!warning]
> This section is under #development

### DSP Blocks

Performing arithmetic operations in an FPGA can be quite costly in terms of the FPGA resources discussed until now. In Application Specific Integrated Circuits (ASICs), the most expensive and time-consuming task is usually multiplying numbers, while adding is quicker and less demanding. To handle this, FPGA makers have been putting dedicated arithmetic cores directly into the fabric of the FPGA for many years. This flips things around in an FPGA: now, adding numbers can become the slower task, especially when dealing with wider numbers, because the multiplication process has been turned into a dedicated, complex, and pipelined operation.
One of the most common arithmetic operations in digital signal processing (DSP) is called the Multiply and Accumulate (MAC) operation. This is used extensively in DSP; one perfect example is filtering: FIR filter implementations in FPGA hardware make use of MAC operations, and the higher the order of the filter, the more MAC operations are needed. FPGAs have special parts called DSP blocks or slices. These DSP blocks help speed up common tasks like fast Fourier transforms (FFTs) and finite impulse response (FIR) filtering, which are related to processing signals and require a large number of arithmetic operations (multiply, divide, add, subtract, etc.).

DSP slices are not only for multiplying numbers—they can do more. The DSP slices make lots of things work faster and better in many applications, not just digital signal processing. They help with tasks like handling big buses that move data around, creating memory addresses, combining different data paths, and dealing with input and output registers that are linked to memory locations. You can also perform operations such as multiplication using regular logic (LUTs and flip-flops), but it uses up a lot of resources. Using the special DSP blocks for multiplication makes sense because it’s better for performance and using logic efficiently. That’s why even small FPGAs set aside space for DSP blocks.

#### DSP48E1

FPGAs excel in digital signal processing (DSP) tasks because they can use special, fully parallel methods that are customized for specific needs. DSP operations often involve a lot of binary multiplication and accumulation, and FPGAs have dedicated parts called DSP slices that are perfect for these tasks. In some FPGAs, there are plenty of these custom-designed, low-power DSP slices that are fast, compact, and still flexible for designing different systems. The figure below illustrates the basic DSP48E1 slice functionality in 7-series FPGAs from Xilinx:

![[Pasted image 20250217232627.png]]

> [!Figure]
> Basic DSP48E1 Slice Functionality (source: #ref/Digilent )

The DSP functionality in question offers several notable features. First, it includes a 25 × 18 two’s-complement multiplier with dynamic bypass and a 48-bit accumulator that can function as a synchronous up/down counter. There’s also a power-saving pre-adder designed to optimize symmetrical filter applications and reduce DSP slice requirements. The single-instruction-multiple-data (SIMD) arithmetic unit supports dual 24-bit or quad 12-bit add/subtract/accumulate operations, along with an optional logic unit capable of generating ten different logic functions of the two operands. Additional capabilities involve a pattern detector, convergent or symmetric rounding, and the ability to perform 96-bit-wide logic functions when used alongside the logic unit. Advanced features such as optional pipelining and dedicated buses for cascading further enhance the versatility of this DSP functionality.

### Peripherals

Peripherals in FPGAs refer to external devices or components that can be connected to the FPGA to enhance its functionality. These peripherals can include input/output interfaces, communication ports, memory modules, sensors, and other hardware components that extend the capabilities of the FPGA. They enable the FPGA to interact with the external world, process data from various sources, and perform specific tasks based on the application’s requirements.
Integrating peripherals allows FPGAs to be customized for a wide range of applications and makes them adaptable to different tasks and environments. Modern FPGA chips also incorporate hardened peripherals so that certain functions, such as inter-chip communications, do not have to be implemented using the available logic in the FPGA.

### ADCs

Analog-to-Digital Converters (ADCs) convert continuous analog signals into discrete digital representations. In other words, they transform real-world signals, such as those from sensors, audio devices, or other analog sources, into digital data that can be processed by digital systems like microcontrollers, computers, or digital signal processors. Xilinx's 7-series FPGAs include an on-chip user-configurable analog interface (XADC), incorporating dual 12-bit 1 MSPS analog-to-digital converters with on-chip thermal and supply sensors. If the specifications of on-chip ADCs are insufficient for certain applications, external ADCs are commonly used.

> [!warning]
> This section is under #development

### SerDes and High-Speed Transceivers

A serializer/deserializer is a logic block responsible for converting data between serial and parallel interfaces in either direction. In the serializer role, parallel data flows in (multiple bits of data aligned in parallel channels) and is converted into serial data. This serial data is just a single stream of data bits, sent sequentially over a channel. On the receiving end, the deserializer function comes into play. It takes this incoming serial data stream and converts it back into parallel data so that it can be processed or read by the receiving system.

![A SerDes generic block diagram](image261.png)

> [!Figure]
> _A SerDes generic block diagram_

The blocks of a typical SerDes are:

- Serializer: Takes n bits of parallel data changing at rate y and transforms them into a serial stream at a rate of n times y.
- Deserializer: Takes a serial stream at a rate of n times y and changes it into parallel data of width n changing at rate y.
- Rx (Receive) Align: Aligns the incoming data into the proper word boundaries. Several different mechanisms can be used, from automatic detection and alignment of a special reserved bit sequence (often called a comma) to user-controlled bit slips.
- Clock Manager: Manages various clocking needs including clock multiplication, clock division, and clock recovery.
- Transmit FIFO (First In First Out): Allows for storing of incoming data before transmission.
- Receive FIFO: Allows for storing of received data before removal; is essential in a system where clock correction is required.
- Receive Line Interface: Analog receive circuitry; includes a differential receiver and may include active or passive equalization.
- Transmit Line Interface: Analog transmission circuit; often allows varying drive strengths. It may also allow for pre-emphasis of transitions.
- Line Encoder: Encodes the data into a more line-friendly format. This usually involves eliminating long sequences of non-changing bits. May also adjust data for an even balance of ones and zeros. (This is an optional block sometimes not included in a SerDes.)
- Line Decoder: Decodes from line-encoded data to plain data. (This is an optional block that is sometimes done outside of the SerDes.)
- Clock Correction and Channel Bonding: Allows for correction of the difference between the transmit clock and the receive clock. Also allows for skew correction between multiple channels.
Channel bonding is optional and not always included in SerDes. Other possible functions to be included in a SerDes are cyclic redundancy check (CRC) generators, CRC checkers, other encoding and decoding schemes such as 4b/5b, 8b/10b, 64b/66b, settable scramblers, various alignment and daisy-chaining options, and clock configurable front and backends. ![Multi-gigabit transceiver core (credit: Xilinx)](image262.png) > [!Figure] > _Multi-gigabit transceiver core (credit: Xilinx)_ #### Encoding Schemes Line encoding schemes modify raw data into a form that the receiver can better accept. Specifically, the line encoding scheme ensures that there are enough transitions for the clock recovery circuit to operate. They provide a means of aligning the data into words with a good direct current (DC) balance on the line. There are two mainline encoding schemes—value lookup schemes and self-modifying streams, or scramblers. ##### 8b/10b Encoding/Decoding In 8b/10b encoding, 8-bit data words are converted into 10-bit code groups. This means that for each byte of data (8 bits), 10 bits are actually transmitted. The encoding ensures a balance between zeros and ones in the transmitted data, and it includes a mechanism to manage the running disparity between the numbers of ones and zeros. Running disparity can be either positive or negative, and the encoding ensures that it does not become too large in either direction. | **8-bit Value** | **10-bit Symbol** | | --------------- | ----------------- | | 00000000 | 1001110100 | | 00000001 | 0111010100 | Table 5 Example of 8-bit Values In addition to representing data bytes, the 8b/10b encoding also provides special symbol codes that can be used for control and signaling purposes, separate from the data being transmitted. The table below shows valid control K-Characters | **Name** | **Hex** | **8 Bits** | **RD-** | **RD +** | | -------- | ------- | ---------- | ---------- | ---------- | | K28.0 | 1C | 00011100 | 0011110100 | 1100001011 | | K28.1 | 3C | 00111100 | 0011111001 | 1100000110 | | K28.2 | 5C | 01011100 | 0011110101 | 1100001010 | | K28.3 | 7C | 01111100 | 0011110011 | 1100001100 | | K28.4 | 9C | 10011100 | 0011110010 | 1100001101 | | K28.5 | BC | 10111100 | 0011111010 | 1100000101 | | K28.6 | DC | 11011100 | 0011110110 | 1100001001 | | K28.7 | FC | 11111100 | 0011111000 | 1100000111 | | K23.7 | F7 | 11110111 | 1110101000 | 0001010111 | | K27.7 | FB | 11111011 | 1101101000 | 0010010111 | | K29.7 | FD | 11111101 | 1011101000 | 0100010111 | | K30.7 | FE | 11111110 | 0111101000 | 1000010111 | In 8b/10b, each 8-bit word is mapped to a corresponding 10-bit code, and it uses two different symbols assigned to each data value. In most cases, one of the symbols has six zeros and four ones, and the other has four zeros and six ones. The total number of ones and zeros is monitored, and the next symbol is chosen based on what is needed to bring the DC balance back in line. The two symbols are normally referred to as + and - symbols. The mapping is designed to ensure that each 10-bit code has no more than five consecutive zeros or ones. The choice of which 10-bit code to use for a given 8-bit word depends on the current running disparity. If the disparity is negative, a code with more ones than zeros will be chosen, and vice versa. One additional benefit of the running disparity is that the receiver can monitor the running disparity and detect that an error has occurred in the incoming stream because the running disparity rules have been violated. 
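As an illustration of the running-disparity bookkeeping described above, here is a minimal Python sketch. It is an illustrative model rather than production 8b/10b code; the two K28.5 codewords are taken from the K-character table above, and the initial disparity value is the commonly used convention:

```python
def disparity(symbol: str) -> int:
    """Ones minus zeros of a 10-bit symbol given as a bit string."""
    return symbol.count('1') - symbol.count('0')

# The two precomputed codewords for K28.5 (from the K-character table above)
K28_5 = {'RD-': '0011111010', 'RD+': '1100000101'}

def send_k28_5(running_disparity: int):
    """Pick the K28.5 codeword that keeps the running disparity bounded."""
    # If the running disparity is negative, send the RD- column entry, otherwise the RD+ entry
    chosen = K28_5['RD-'] if running_disparity <= 0 else K28_5['RD+']
    return chosen, running_disparity + disparity(chosen)

rd = -1  # running disparity is commonly initialized to -1
for _ in range(4):
    word, rd = send_k28_5(rd)
    print(word, "running disparity ->", rd)
```

A receiver can run the same bookkeeping on the incoming symbols; if the observed disparity ever violates these rules, it knows a bit error occurred, which is the error-detection side benefit mentioned above.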
Decoding is the reverse process of encoding. The 10-bit code groups are converted back into the original 8-bit data words. The decoder also checks the disparity of the incoming 10-bit code groups to ensure they are valid and have not been corrupted during transmission. Alignment of data is an important function of a deserializer. The figure below represents valid 8b/10b data in a serial stream.

![Serial Stream of Valid 8b/10b Data](image263.png)

> [!Figure]
> _Serial Stream of Valid 8b/10b Data_

How to know where the symbol boundaries are? Symbols are delineated by a **comma**. Here, a comma is one or two symbols specified to be the comma or alignment sequence. This sequence is usually settable in the transceiver, but in some cases, it may be predefined. Naturally, a SERDES requires a method for aligning the incoming stream into words. This special bit sequence or comma must be sent if the system requires clock correction. The comma could be a natural marker for the beginning or end of a frame. If clock correction is required, the clock correction sequence is usually the ideal character. After adding a couple of ordered sets to indicate the end or start of the packet and an ordered set to indicate a special type of packet, we have a simple, powerful transmission path. The idle symbol, or sequence, is another important packet concept. This symbol is sent whenever there is no information to send. Continuous transmission of data ensures that the link stays aligned.

The receiver then scans the incoming data stream for the specified bit sequence. If it finds the sequence, the deserializer resets the word boundaries to match the detected comma sequence. This is a continuous scan. Once the alignment has been made, all subsequent commas detected should find the alignment already set. Of course, the comma sequence must be unique within any combination of sequences. For example, if we are using a single symbol $c$ for the comma, then we must be certain that no ordered set of symbols $\text{xy}$ contains the bit sequence $c$. Using a predefined protocol is not a problem since the comma characters have already been defined. One or more of a special subset of K-characters is often used. The subset consists of K28.1, K28.5, and K28.7, all of which have 1100000 as the first seven bits. This pattern is only found in these characters; no ordered set of data and no other K-characters will ever contain this sequence. Hence, it is ideal for alignment use. In cases where a custom protocol is built, the safest and most common solution is to "borrow" a sequence from a well-known protocol. Gigabit Ethernet uses K28.5 as its comma. Because of this, it is often referred to as the comma symbol even though there are technically other choices.

![Encoders/Decoders Block Diagram](image264.png)

> [!Figure]
> _Encoders/Decoders Block Diagram_

The names used—such as D0.3 and K28.5—are derived from the way the encoders and decoders can be built. The 8 input bits are broken into 5-bit and 3-bit buses; that is how the names were developed. For example, the name Dx.y describes the data symbol for the input byte where the five least significant bits have a decimal value of x and the three most significant bits have a decimal value of y. A K indicates a control character. The three bits turn into four bits and the five bits turn into six. Another naming convention refers to the 8 input bits as HGF EDCBA and the 10 encoded bits as abcdei fghj. Overhead is one of the drawbacks of the 8b/10b scheme.
Getting 2.5 gigabits per second of payload bandwidth requires a wire speed of 3.125 Gbps. Scrambling techniques can easily handle the clock transition and DC bias problems without a need for increased bandwidth.

#### Scrambling

==Scrambling is a way of reordering or encoding the data so that it appears to be random, but it can still be unscrambled. Serial data transmissions benefit from randomizers that break up long runs of zeros and ones.== Obviously, we want the descrambler to unscramble the bits without requiring any special alignment information. This characteristic is called a self-synchronizing code. A simple scrambler consists of a series of flip-flops arranged to shift the data stream. Most of the flip-flops simply feed the next bit, but occasionally a flip-flop will be exclusively ORed or ANDed with an older bit in the stream.

![Basic scrambling and descrambling circuits](image265.png)

Scrambling eliminates long runs and works to eliminate other patterns that may have a negative impact on the receiver's ability to decode the signal. There are, however, other tasks provided by line encoding schemes such as 8b/10b that are not supplied by scrambling:

- Word alignment
- Clock correction mechanism
- Channel bonding mechanism

4b/5b is similar to 8b/10b but simpler. As the name implies, four bits are encoded into five bits with this scheme. 4b/5b offers simpler encoders and decoders than 8b/10b. However, there are few control characters, and it does not handle the DC balance or disparity problem. With the same coding overhead and less functionality, 4b/5b is not often used anymore. Its main advantage was implementation size, but gates are so cheap now that it is not much of an advantage. 4b/5b is still used in some applications.

One of the newer encoding methods is known as 64b/66b. We might think that it is simply a version of 8b/10b that has less coding overhead, but the details are vastly different. 64b/66b came about as a result of user needs not being met by the technology of the time. The 10 Gigabit Ethernet community had a need for Ethernet-based communication at 10 Gb/s. And while they could use four links, each with a 2.5 Gb payload and a 3.125 Gbps wire speed, SERDES technology was approaching the ultimate 10 Gb solution in a single link. There were new SERDES that could run at just over 10 Gbps but could not be pushed to the 12.5 Gbps needed to support the 8b/10b overhead. The laser driving diode was another issue. The telecommunications standard Synchronous Optical Network (SONET) used lasers capable of just over 10 Gbps. Faster lasers were much more expensive. The 10 Gigabit Ethernet community could either give up or create something with a significantly lower overhead to replace 8b/10b. They chose 64b/66b. The price for the lower overhead is longer alignment times, the possibility of a slight DC bias, and more complicated encoders and decoders. Complications such as turning the scramblers on and off for payload vs. sync and type fields make 64b/66b circuits more complicated than their 8b/10b cousins. There is also a complexity cost for using and setting up the encoder.

#### Drivers

The physical implementation of multi-gigabit SERDES typically takes the form of differential-based electrical interfaces. There are three common differential signal methods used in SerDes: Low-Voltage Differential Signaling (LVDS), Low-Voltage Pseudo Emitter-Coupled Logic (LVPECL), and Current Mode Logic (CML).
CML is preferred for multi-gigabit links, and it has the most common interface type and often provides for either AC or DC termination and selectable output drive. Some SerDes provide built-in line equalization and/or internal termination. Often the termination impedance is selectable as well. ![CML driver](image266.png) > [!Figure] > _CML driver_ ![CML receiver](image267.png) > [!Figure] > _CML receiver_ ### Modern FPGA architecture Contemporary FPGA architectures incorporate the basic elements like the ones we discussed above (CLBs, Block RAMs, I/O) along with additional blocks that increase the computational density and efficiency of the device. These additional elements, which are discussed in the following sections, include: - Phase-locked loops (PLLs) for driving the FPGA fabric at different clock rates - High-speed serial transceivers with PCS and PMA sublayers compliant with several standards like PCIe, GigE, SRIO, and others. - Off-chip memory controllers - Multiply-accumulate blocks (MAC) ![Modern FPGA architecture (credit: AMD)](image268.png) > [!Figure] > _Modern FPGA architecture (credit: AMD)_ In modern FPGAs, for instance in Xilinx (now AMD) FPGAs, each CLB element is connected to a switch matrix for access to the general routing matrix. A CLB element contains a pair of slices (see figure below). ![Arrangement of slices inside a FPGA (credit: Xilinx)](image269.png) > [!Figure] > _Arrangement of slices inside a FPGA (credit: Xilinx)_ A CLB element contains a pair of slices, and each slice is composed of four 6-input LUTs and eight storage elements. - SLICE(0): slice at the bottom of the CLB and in the left column - SLICE(1): slice at the top of the CLB and in the right column These two slices do not have direct connections to each other, and each slice is organized as a column. Each slice in a column has an independent carry chain. The Xilinx tools designate slices with these definitions: - An "X" followed by a number identifies the position of each slice in a pair as well as the column position of the slice. The "X" number counts slices starting from the bottom in sequence 0, 1 (the first CLB column); 2, 3 (the second CLB column); etc. - A "Y" followed by a number identifies a row of slices. The number remains the same within a CLB but counts up in sequence from one CLB row to the next CLB row, starting from the bottom. ![Row and Column relationship between CLBs and slices](image270.png) > [!Figure] > _Row and Column relationship between CLBs and slices_ Visibly more complex than the naïve CLB we implemented before, but still based on the same building blocks, each CLB in commercial FPGA contains: - Four logic-function generators (or look-up tables) - Eight storage elements - Wide-function multiplexers - Carry logic The function generators (LUTs) in some types of slices can be implemented as a synchronous RAM resource called a distributed RAM element. Multiple LUTs can be combined in various ways to store larger amounts of data. RAM elements are configurable within a slice to implement different RAM configurations. > [!warning] > This section will be #expanded #### Transceivers In modern FPGAs, for instance, the Xilinx(AMD) 7 series, transceivers support line rates from 500 Mbps to 12.5 Gbps for GTX type and 13.1 Gbps for GTH type. The GTX/GTH transceiver is highly configurable and tightly integrated with the programmable logic resources of the FPGA. 
The GTX/GTH transceiver supports these use modes:

- PCI Express, Revision 1.1/2.0/3.0
- 10GBASE-R
- 10 Gb Attachment Unit Interface (XAUI), Reduced Pin eXtended Attachment Unit Interface (RXAUI), 100 Gb Attachment Unit Interface (CAUI), 40 Gb Attachment Unit Interface (XLAUI)
- Serial RapidIO (SRIO)
- Serial Advanced Technology Attachment (SATA)/Serial Attached SCSI (SAS)
- Serial Digital Interface (SDI)
- SFF-8431 (SFP+)

The following table describes the features in the PCS sublayer for a modern FPGA:

| Feature | GTX | GTH |
| --- | --- | --- |
| 2-byte and 4-byte internal data path to support different line rate requirements | X | X |
| 8B/10B encoding and decoding | X | X |
| 64B/66B and 64B/67B support | X | X |
| Comma detection and byte and word alignment | X | X |
| PRBS generator and checker | X | X |
| FIFO for clock correction and channel bonding | X | X |
| Programmable FPGA logic interface | X | X |
| 100 Gb Attachment Unit Interface (CAUI) support | | X |
| Native multi-lane support for buffer bypass | | X |
| TX Phase Interpolator PPM Controller for external voltage-controlled crystal oscillator (VCXO) replacement | | X |

The next table lists transceiver PMA sublayer features (credit: Xilinx):

| Feature | GTX | GTH |
| --- | --- | --- |
| Shared LC tank phase-locked loop (PLL) per Quad for best jitter performance | X | X |
| One ring PLL per channel for best clocking flexibility | X | X |
| Power-efficient adaptive linear equalizer mode called the low-power mode (LPM) | X | X |
| 5-tap decision feedback equalization (DFE) | X | |
| 7-tap DFE | | X |
| Reflection cancellation for enhanced backplane support | | X |
| TX Pre-emphasis | X | X |
| Programmable TX output | X | X |
| Beacon signaling for PCI Express | X | X |
| Out-of-band (OOB) signaling including COM signal support for Serial ATA (SATA) | X | X |
| Line rate up to 12.5 Gbps | X | X |
| Line rate up to 13.1 Gbps | | X |

> [!info]
> We spoke about SerDes before, and now we speak about multi-gigabit transceivers. Are they the same thing? Yes. A Multi-Gigabit Transceiver is another name for a multi-gigabit Serializer/Deserializer (SERDES). It receives parallel data and allows transportation of high-bandwidth data over an n-lane serial link, with "n" equal to or greater than 1.

## IP Cores

Except for simple circuits, no one in their right mind would design in an FPGA by taking individual CLBs or block memories and connecting them manually. FPGAs are the versatile devices they are thanks to the capability of rapidly creating abstraction layers on top of them. The fact that designers can pack logic functionality in libraries and macros that they can reuse in different designs is what makes FPGAs so widely adopted. Thanks to this reuse, a derivative market appeared: organizations selling pre-packaged logic for customers to incorporate in their designs and save time and effort. These logic blocks are called IP cores (IP stands for Intellectual Property, due to the fact that these blocks are delivered encrypted or obfuscated, which means the designers can retain the intellectual property of the core), and they are, to some extent, like libraries in software.
Just as a software developer might use a pre-written, compiled library to handle complex tasks (like network communication or image processing) without needing to understand the intricate details, a hardware designer uses IP cores as high-level building blocks. This approach speeds up the design process and can also improve the quality and reliability of the final product, as IP cores are typically thoroughly verified.

IP cores come in various forms, such as soft cores, hard cores, and firm cores. Soft cores are provided as synthesizable code, typically in a hardware description language like VHDL or Verilog. This allows them to be highly flexible, as they can be synthesized for various target technologies. Hard cores, on the other hand, are provided as physical layouts, which means they are optimized for a specific manufacturing technology and offer high performance but less flexibility. Firm cores sit somewhere in between, offering a compromise between flexibility and optimized performance.

IP cores communicate with the rest of a customer's design in a digital system primarily through well-defined interfaces and protocols. The mechanism of communication depends on the nature of the IP core and the overall system architecture. Many IP cores are designed with standard on-chip interfaces like AMBA (AHB, APB, AXI, CHI), PCI Express, Ethernet, or low data rate interfaces like I2C. These standardized interfaces ensure that the IP core can easily communicate with other components that also adhere to these standards. In other cases, IP cores might be connected via a bus or a network-on-chip (NoC). The IP core will have an interface compatible with this bus, allowing it to send and receive data, control signals, and address information. For instance, an IP core might be connected to a common AMBA AXI bus, which allows it to interact with other cores and memory elements in the system.

IP cores typically have a set of registers that can be read or written by the main processor or other controlling entities in the system. These registers are used to control the operation of the IP core (e.g., start, stop, configure) and to read back status information. For high-throughput data transfer, some IP cores might use DMA channels. A DMA-capable IP core can directly read from or write to system memory without involving the central processor, thereby increasing data transfer efficiency. IP cores often use interrupts to signal the main processor about events like the completion of a task, an error condition, or the need for attention. The processor can then take appropriate action based on these signals. In some cases, especially when an IP core is specialized, it might use custom protocols or interfaces. In such scenarios, the customer's design must be adapted to interface correctly with these custom elements. For complex IP cores, especially those that perform more sophisticated functions, there may be software APIs and drivers that abstract the lower-level details and provide a higher-level interface for software applications to interact with the core.

## Synthesis

Synthesis is a key phase of the FPGA design flow. Programming the FPGA device requires translating a high-level hardware description into a gate-level representation. The task of synthesis is to map the desired design functionality onto the hardware resources of the FPGA, optimizing for various factors such as area utilization, performance, clocking, and power consumption.
The synthesis process involves several key steps to transform the RTL description into a gate-level representation suitable for implementation on an FPGA. These steps typically include:

- **Analysis:** The synthesis tool analyzes the RTL description to understand the circuit’s structure, functionality, and timing requirements. It identifies the various modules, signals, and their dependencies.
- **Optimization:** The synthesis tool performs various optimizations to improve the design’s performance, area utilization, and power consumption. These optimizations include constant propagation, logic folding, technology mapping, and more.
- **Technology Mapping:** During this step, the synthesis tool maps the RTL constructs to the target FPGA’s available resources, such as lookup tables ([[Semiconductors#Look-Up Tables (LUTs)|LUTs]]), [[Semiconductors#Flip-Flops|flip-flops]], [[Semiconductors#Configurable Logic Block (CLB)|configurable logic blocks]], and other specialized components. The tool tries to optimize the mapping based on the design requirements and constraints.
- **Timing Analysis:** The synthesized design is subjected to timing analysis to ensure that all timing constraints are met. The tool performs static timing analysis to estimate the design’s performance, checking if the required clock frequencies can be achieved and avoiding timing violations.
- **Netlist Generation:** The synthesis tool generates a gate-level netlist as output, representing the design in terms of gates, flip-flops, and interconnections. This netlist can be used for further implementation steps, such as placement and routing.

## Configuration and Programming

Configuration and programming are the steps through which the internal hardware resources of the FPGA are put to use to implement custom digital applications effectively and efficiently. In FPGA terminology, configuration refers to the process of initializing the FPGA with a specific set of instructions. These instructions, stored in a configuration memory, dictate the interconnections and functionalities of the FPGA's internal logic elements. Configuration occurs during startup and is crucial for defining the FPGA's operational characteristics. Remember that FPGAs lose their configuration on loss of power, so they must be configured upon every power cycle. That is why some type of configuration memory is required. Programming an FPGA involves a series of systematic steps:

36. Define the Objective: Clearly articulate the desired functionality or task the FPGA is intended to perform. This forms the basis for subsequent programming steps.
37. Hardware Description Language (HDL): Utilize HDL, such as Verilog or VHDL, to describe the desired circuitry and behavior. HDL serves as the intermediary language between human-readable code and the low-level hardware description.
38. Compilation Process: The HDL code undergoes synthesis and implementation processes using specialized tools. Synthesis translates the high-level HDL code into a netlist, representing the logical structure. Implementation maps this netlist onto the physical resources of the FPGA, considering factors like timing and resource utilization.
39. Bitstream Generation: The compiled design is converted into a bitstream – a binary file containing configuration data.
40. Configuration Upload: The bitstream is loaded onto the FPGA's configuration memory, effectively programming the device. This step is typically carried out during the power-up sequence.
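As a small, purely illustrative calculation (every number below is an assumption, not a figure for any particular device), the time the configuration upload takes can be estimated from the bitstream size, the width of the configuration interface, and the configuration clock frequency:

```python
# Rough estimate of FPGA configuration time (illustrative numbers only).

bitstream_bits = 30_000_000      # assumed bitstream size: ~30 Mbit
bus_width_bits = 16              # assumed parallel configuration interface width
cclk_hz = 50_000_000             # assumed configuration clock: 50 MHz

bits_per_second = bus_width_bits * cclk_hz
config_time_s = bitstream_bits / bits_per_second
print(f"~{config_time_s * 1000:.1f} ms to load the bitstream")   # ~37.5 ms
```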
# System-on-Chips

Although, in theory, FPGAs may appear to be an ideal device of sorts where almost any logic block can be devised, that is not entirely the case. A paradox that highly configurable systems suffer from is that their flexibility comes with a penalty in the form of a loss of efficiency that must be paid. In FPGAs, developing big, complex logic elements will be less efficient (in terms of area, power, and thermal behavior) compared to developing those elements in an ad-hoc manner, tailored for the underlying technology. The need for a compromise between the flexibility that FPGAs bring and the efficiency that ad-hoc cores represent became evident and, with the progress of integration technology, System-on-Chips appeared.

A System-on-Chip combines several logical building blocks that have historically been standalone onto a single package. Traditional digital systems had separate components for different functions, like processing, memory, and input/output control, each hosted on its own chip. A SoC puts these into one package, reducing size and power consumption while increasing performance. This integration makes devices smaller, faster, and more efficient. Modern SoCs often include programmable logic fabrics as well.

The rise of SoCs is closely linked to the demand for power-efficient, cost-effective, and high-performing digital systems. As PCB areas and form factors shrink in size but grow in capabilities, traditional multi-chip designs become impractical. SoCs offer an attractive alternative by integrating various functionalities into small-footprint devices. This integration not only saves space but also improves the speed of data transfer between components, leading to better device performance.

![35 years of microprocessor trend data](image271.png)

> [!Figure]
> _35 years of microprocessor trend data_

Designing an SoC is a complex process. Architecting SoCs involves deciding what components the SoC will need, such as processors, memory, and high-speed transceivers. The design process may require selecting IP cores from third-party vendors, or reusing previous in-house designs that the organization can use without paying license fees. In many cases, new IP cores are written. Needless to say, these internal cores also need to exchange data, so an on-chip interconnect must be used. SoC designs may employ high-level programming languages like C++ and be converted to RTL designs through high-level synthesis tools. These algorithmic synthesis tools allow designers to model and synthesize the system, circuit, software, and verification levels in a single language. This process also includes using simulation capabilities at different levels of abstraction, from behavioral-level and transaction-level modeling for architecture exploration to cycle-accurate simulations.

Eventually, the SoC design moves into synthesis, where a detailed blueprint of how its internal components will be integrated is defined. This includes selecting the target technology for the physical design and layout of the die. This phase gives almost the complete circuit diagram of the SoC in terms of individual logic gates. It is not the entire circuit diagram since components such as CPU cores are typically supplied as hard cores. Modern SoCs may include several dies inside the same package utilizing die-stacking techniques or interposers, in what is called 2.5D or 3D IC design, or [[Semiconductors#Chiplets|chiplets]] (see figures below).
![2.5D-IC assembly that includes two substrates (silicon interposer + organic package). Source: Siemens EDA, from "Dissolving The Barriers In Multi-Substrate 3D-IC Assembly Design" blog on Semiconductor Engineering, Feb. 2022.](image272.png)

> [!Figure]
> _2.5D IC assembly that includes two substrates (silicon interposer + organic package). Source: Siemens EDA, from "Dissolving The Barriers In Multi-Substrate 3D-IC Assembly Design" blog on Semiconductor Engineering, Feb. 2022._

![Example of multi-die system. Source: Synopsys](image273.png)

> [!Figure]
> _Example of multi-die system. Source: Synopsys_

> [!info]
> Active interposers play the same role in packaging as a standard interposer (called a passive interposer), as shown in the figure above. Active interposers provide interconnections between chiplets, and connections to the substrate through through-silicon vias (TSVs), but they also include active circuitry built into the structure of the interposer. These active circuits are built into silicon interposers, although there is also research indicating the capability of active interposer fabrication on glass.

SoCs are developed using hardware component IP core specifications, known as blocks. Hardware blocks are created using EDA tools and can be classified into the following categories:

- Hard IP cores are hard layouts using physical design libraries but are technology-dependent and may lack flexibility.
- Soft IP cores include HDL code like Verilog code with functional descriptions of IPs, making them more flexible and reconfigurable but requiring synthesis and verification before implementation.
- Firm IP cores strike a balance between the two and are provided as netlists mapped to specific physical libraries after synthesis.

SoCs have a variety of blocks that require sending data and instructions and thus require communication subsystems. For a while, traditional data bus architectures were used to connect the different blocks of the SoC. For example, ARM's royalty-free Advanced Microcontroller Bus Architecture (AMBA) was a common standard. However, computer buses have limited scalability and support only tens of cores. Wire delay does not scale down with further miniaturization; this results in system performance not scaling with the number of cores. For this reason, the SoC's operating frequency must decrease to remain sustainable. This fact, coupled with more wires consuming more electrical power, has led to the adoption of network-on-chip (NoC) technology. With NoC technology, advantages include application- and destination-specific routing, better power efficiency, and reduced bus contention. Newer NoC architectures are constantly being developed. For example, distributed computing network topologies such as torus, hypercube, mesh, and tree networks are increasingly popular. NoC architectures can efficiently meet the power and throughput needs of SoC designs.

>[!Note]
>In VLSI, another piece of jargon revolves around abstraction levels: front-end and back-end.
>
>Front-end design, also known as RTL (register-transfer level) design, involves the creation of a functional model of the system using high-level design languages such as Verilog or VHDL. This phase of the design process focuses on defining the logical behavior and functional specifications of the system, including the inputs and outputs, the data flow, and the overall architecture. The front-end design phase ensures that the system will meet its functional requirements and specifications.
Once the front-end design is complete and has been verified to meet the functional specifications, the design is handed off to the back-end design phase. > >Back-end design, also known as physical design, involves the translation of the RTL design into a physical layout that can be fabricated onto a chip. This phase of the design process focuses on the placement and routing of the various components on the chip, as well as the optimization of the design for performance, power, and other constraints. The back-end design phase is important for ensuring that the physical implementation of the design meets the performance, power, and other requirements of the system. ## On-Chip Interconnects When creating a System-on-chip, designers must allocate different building blocks like CPU cores, programmable logic, memory controllers, and peripherals. One need rapidly appears: a standardized way of hooking dissimilar blocks of logic together in a more or less seamless way. Due to the different bandwidths and speeds involved across the variety of building blocks, there are low-speed, medium-speed, and high-speed on-chip interconnects. SoC architectures may also include bridges between these. We explore on-chip interconnects next. ![UT699 SoC architectural block diagram](image423.png) > [!Figure] > _A System-on-Chip block diagram. Credit: Frontgrade/Gaisler_ ### Advanced Microcontroller Bus Architecture (AMBA) The Advanced Microcontroller Bus Architecture (AMBA) is a family of protocols developed by ARM, designed to connect different functional modules within a system-on-chip (SoC). These modules could be CPUs, memory controllers, peripherals, or interfaces. AMBA aims to manage how these components communicate with each other, ensuring data transfers are efficient, reliable, and scalable. AMBA was introduced by ARM in the mid-1990s, initially as a single standard bus. As technology evolved and the demands of systems increased, ARM expanded AMBA to include several specific protocols, each tailored to different aspects of system communication. - AXI (Advanced eXtensible Interface): This is the most advanced protocol in the AMBA family, designed for high-performance, high-speed data transfers. It supports burst-based transactions, separate read and write data channels, and out-of-order transaction completion, making it suitable for high-bandwidth and low-latency designs. - AHB (Advanced High-performance Bus): A step up from the original AMBA standard, AHB is used for high-performance modules like CPUs and DMA controllers. It's a single clock-edge protocol, which means it operates more efficiently than a multiple clock-edge design. AHB supports a single master and multiple slaves, ensuring efficient communication between major components. - APB (Advanced Peripheral Bus): This protocol is used for less demanding peripheral devices that don't require the high-speed data transfer capabilities of AXI or AHB. APB is simpler and has lower bandwidth, making it more suitable for controlling simple peripherals like timers, interrupt controllers, and I/O interfaces. - CHI (Coherent Hub Interface): This is a relatively newer addition to the AMBA family, designed for high-end digital systems. CHI provides support for data coherency among high-performance CPUs and other components, which is necessary for multi-core processors to maintain consistency in data across various caches. 
These protocols coexist in complex systems, allowing designers to optimize the communication paths according to the performance requirements and power constraints of different system components. For example, a SoC might use AXI for high-speed connections between the CPU and memory, AHB for communications between the CPU and intermediate-performance peripherals, and APB for low-speed, low-power peripheral connections. #### AHB AMBA AHB is a bus interface suitable for high-performance synthesizable designs. It defines the interface between components, such as Managers, interconnects, and Subordinates. AMBA AHB implements the features required for high-performance, high-clock frequency systems including: - Burst transfers - Single clock-edge operation - Non-tristate implementation - Configurable data bus widths - Configurable address bus widths The most common AHB Subordinates are internal memory devices, external memory interfaces, and high-bandwidth peripherals. Although low-bandwidth peripherals can be included as AHB Subordinates, for system performance reasons, they typically reside on the AMBA Advanced Peripheral Bus (APB). Bridging between the higher-performance AHB and APB is done using an AHB Subordinate, known as an APB bridge. ![AHB Block Diagram](image274.png) > [!Figure] > _AHB Block Diagram_ The roles in an AHB are: Managers and Subordinates. A Manager provides address and control information to initiate read and write operations. ![Manager Interface](image275.png) > [!Figure] > _Manager Interface_ A Subordinate responds to transfers initiated by Managers in the system. The Subordinate uses the HSELx select signal from the decoder to control when it responds to a bus transfer. The Subordinate signals back to the Manager: - The completion or extension of the bus transfer. - The success or failure of the bus transfer. The AHB interconnect between Managers and Subordinates is composed of decoders and multiplexers. A single Manager only requires the use of a Decoder and Multiplexor, as depicted in the AHB block diagram above. A multi-manager system requires the use of an interconnect that provides arbitration and the routing of signals from different Managers to the appropriate Subordinates. In AHB, a [[Semiconductors#Decoders|decoder]] decodes the address of each transfer and provides a select signal for the Subordinate that is involved in the transfer. It also provides a control signal to the multiplexer. A single centralized decoder is required in all implementations that use two or more Subordinates. A Subordinate-to-Manager multiplexer is required to multiplex the read data bus and response signals from the Subordinates to the Manager. The decoder provides control for the multiplexer. A single centralized multiplexor is required in all implementations that use two or more Subordinates. The Manager starts a transfer by driving the address and control signals. These signals provide information about the address, direction, and width of the transfer, and indicate if the transfer forms part of a burst. Transfers can be: - Single. - Incrementing bursts that do not wrap at address boundaries. - Wrapping bursts that wrap at particular address boundaries. The write data bus moves data from the Manager to a Subordinate, and the read data bus moves data from a Subordinate to the Manager. Every transfer consists of: - Address phase: One address and control cycle. - Data phase: One or more cycles for the data. 
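To make the decoder's role concrete, here is a minimal Python sketch of an AHB-style address decode (the memory map and Subordinate names are made up for illustration; a real design generates one HSELx wire per Subordinate in RTL):

```python
# Sketch of an AHB address decoder: inspect HADDR and assert exactly one
# HSELx line so the addressed Subordinate responds and so the read-data
# multiplexer knows which Subordinate to forward to the Manager.

ADDRESS_MAP = {                      # base address: (size, Subordinate) -- made-up map
    0x0000_0000: (0x1000_0000, "internal_sram"),
    0x2000_0000: (0x1000_0000, "external_memory_if"),
    0x4000_0000: (0x0010_0000, "apb_bridge"),
}

def decode(haddr: int) -> dict:
    """Return the one-hot HSELx signals for a given HADDR."""
    hsel = {name: False for _, name in ADDRESS_MAP.values()}
    for base, (size, name) in ADDRESS_MAP.items():
        if base <= haddr < base + size:
            hsel[name] = True
    return hsel

print(decode(0x4000_0004))   # only 'apb_bridge' is selected
```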
A Subordinate cannot request that the address phase be extended and therefore all Subordinates must be capable of sampling the address during this time. However, a Subordinate can request that the Manager extends the data phase by using HREADY. This signal, when LOW, causes wait states to be inserted into the transfer and enables the Subordinate to have extra time to provide or sample data. The Subordinate uses HRESP to indicate the success or failure of a transfer. #### APB The APB protocol is a low-cost interface, optimized for minimal power consumption and reduced interface complexity. The APB interface is not pipelined and is a simple, synchronous protocol. Every transfer takes at least two cycles to complete. The APB interface is designed for accessing the programmable control registers of peripheral devices. APB peripherals are typically connected to the main memory system using an APB bridge. For example, a bridge from AXI to APB could be used to connect a number of APB peripherals to an AXI memory system. APB transfers are initiated by an APB bridge. APB bridges can also be referred to as a Requester. A peripheral interface responds to requests. APB peripherals can also be referred to as a Completer. The AMBA APB specification uses Requester and Completer. APB Protocol Signals are described below. - PCLK: This is the clock signal. Everything on the APB is synchronized to the rising edge of this clock. - PRESETn: This is the reset signal. It's active low and is used to reset the peripherals connected to the APB. - PADDR: The address bus. This carries the address of the peripheral register being accessed. - PSELx: Peripheral select. This signal is used by the master to select one of the peripherals connected to the bus. Each peripheral connected to the APB has its own select signal (hence the 'x' in PSELx). - PENABLE: This signal is used to enable the data transfer. During a write operation, it indicates that the write data is available on the data bus. During a read operation, it tells the peripheral that the master is ready to accept data. - PWRITE: This is the write control signal. When high, it indicates a write operation; when low, it indicates a read operation. - PWDATA\[7:0\]: This is the write data bus. It carries the data to be written into the peripheral. - PRDATA\[7:0\]: Read data bus. This carries the data read from the peripheral to the master. - PREADY: This is used by the peripheral to extend an APB transfer. - PSLVERR: Optional signal. Used by the peripheral to indicate an error in the transfer. The operation of APB is managed by a state machine composed of the following states: - Idle State: Initially, the bus is in an idle state where PENABLE is low, and no transfers are happening. - Setup Phase: The master asserts the PSELx corresponding to the target peripheral and sets the PADDR to the desired address. The operation type (read or write) is indicated by the PWRITE signal. - Access Phase: In the next clock cycle, the PENABLE signal is asserted high, indicating that the transfer is in progress. For a write, the master places the data on PWDATA\[7:0\]. For a read, the peripheral places the data on PRDATA\[7:0\]. - Completion: If PREADY is used and goes high, it indicates that the peripheral has completed the data transfer, and the bus can move to the next transfer. - Error Handling: If PSLVERR is implemented and asserted, it indicates an error in the transfer. 
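A minimal behavioral sketch in Python (not RTL) of how a Requester walks through the setup and access phases for a write and a read; the tiny register-file Completer and the helper names are hypothetical:

```python
# Behavioral sketch of APB transfers (illustrative only). Signal-level
# details such as PCLK edges are collapsed into function calls.

class ApbCompleter:
    """A trivial Completer exposing a small register file."""
    def __init__(self):
        self.regs = {}

    def access(self, paddr, pwrite, pwdata=None):
        # Access phase: PSELx and PENABLE are both asserted here, and the
        # Completer raises PREADY when done (immediately, in this model).
        if pwrite:
            self.regs[paddr] = pwdata           # capture PWDATA
            return None, False                  # (PRDATA, PSLVERR)
        return self.regs.get(paddr, 0), False   # drive PRDATA

def apb_write(completer, addr, data):
    # Setup phase: Requester drives PSELx=1, PADDR, PWRITE=1, PENABLE=0.
    # Access phase (next cycle): PENABLE=1 until PREADY completes the transfer.
    _, pslverr = completer.access(addr, pwrite=True, pwdata=data)
    assert not pslverr, "Completer flagged PSLVERR"

def apb_read(completer, addr):
    prdata, pslverr = completer.access(addr, pwrite=False)
    assert not pslverr, "Completer flagged PSLVERR"
    return prdata

periph = ApbCompleter()
apb_write(periph, 0x04, 0xA5)
print(hex(apb_read(periph, 0x04)))   # 0xa5
```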
![AMBA APB State Machine](image276.png) > [!Figure] > _AMBA APB State Machine_ ![APB Write transfer with no wait states](image277.png) > [!Figure] > _APB Write transfer with no wait states_ ![Read transfer with no waiting states](image278.png) > [!Figure] > _Read transfer with no waiting states_ ![Error during write transfer](image279.png) > [!Figure] > _Error during write transfer_ ![Error during read transfer](image280.png) > [!Figure] > _Error during read transfer_ #### AXI The AMBA AXI protocol supports high-performance, high-frequency system designs. The AXI protocol is suitable for high-bandwidth and low-latency designs, and it provides high-frequency operation without using complex bridges. AXI meets the interface requirements of a wide range of components, and it is suitable for memory controllers with high initial access latency. It also provides flexibility in the implementation of interconnect architectures, while being backward-compatible with existing AHB and APB interfaces. The key features of the AXI protocol are: - Separate address/control and data phases - Support for unaligned data transfers, using byte strobes - Uses burst-based transactions with only the start address issued - Separate read and write data channels, that can provide low-cost Direct Memory Access (DMA) - Support for issuing multiple outstanding addresses - Support for out-of-order transaction completion - Permits easy addition of register stages to provide timing closure. The AXI protocol includes optional extensions that cover signaling for low-power operation. The AXI protocol includes the AXI4-Lite specification, a subset of AXI4 for communication with simpler control register style interfaces within components. The AXI protocol is burst-based and defines the following independent transaction channels: - Read address - Read data - Write address - Write data - Write response An address channel carries control information that describes the nature of the data to be transferred. The data is transferred between master and slave using : - A write data channel to transfer data from the master to the slave. In a write transaction, the slave uses the write response channel to signal the completion of the transfer to the master. - A read data channel to transfer data from the slave to the master. The AXI protocol permits address information to be issued ahead of the actual data transfer and supports multiple outstanding transactions, including support for out-of-order completion of transactions. ![AXI Channel architecture of reads](image281.png) > [!Figure] > _AXI Channel architecture of reads_ ![AXI channel architecture of writes](image282.png) > [!Figure] > _AXI channel architecture of writes_ Read and write transactions each have their own address channel. The appropriate address channel carries all of the required address and control information for a transaction. The read data channel carries both the read data and the read response information from the slave to the master and includes: - the data bus, which can be 8, 16, 32, 64, 128, 256, 512, or 1024 bits wide - a read response signal indicating the completion status of the read transaction. The write data channel carries the write data from the master to the slave and includes: - the data bus, which can be 8, 16, 32, 64, 128, 256, 512, or 1024 bits wide - a byte lane strobe signal for every eight data bits, indicating which bytes of the data are valid. 
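To illustrate the byte-lane strobes just described, here is a small sketch assuming a 32-bit (4-byte) data bus; the `wstrb` helper is ours, not part of the AXI specification:

```python
# Compute AXI write strobes (WSTRB) for a narrow, possibly unaligned
# write beat on a 32-bit data bus: one strobe bit per byte lane.

def wstrb(bus_bytes: int, addr: int, nbytes: int) -> int:
    """Bit i set means byte lane i carries valid write data."""
    start = addr % bus_bytes                     # first valid byte lane
    assert start + nbytes <= bus_bytes, "beat must fit within the bus width"
    return ((1 << nbytes) - 1) << start

print(bin(wstrb(4, addr=0x1001, nbytes=2)))   # 0b110  -> lanes 1 and 2 valid
print(bin(wstrb(4, addr=0x1000, nbytes=4)))   # 0b1111 -> full-width write
```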
Write data channel information is always treated as buffered so that the master can perform write transactions without slave acknowledgment of previous write transactions. A slave uses the write response channel to respond to write transactions. All write transactions require completion signaling on the write response channel. A typical AXI system consists of a number of master and slave devices connected together through some form of interconnect (see figure below). ![AXI interface and interconnect](image283.png) > [!Figure] > _AXI interface and interconnect_ The AXI protocol provides a single interface definition, for the interfaces between a master and the interconnect, between a slave and the interconnect, and between a master and a slave. This interface definition supports a variety of different interconnect implementations. Most AXI systems use one of three interconnect topologies: - shared address and data buses - shared address buses and multiple data buses - multilayer, with multiple address and data buses. In most systems, the address channel bandwidth requirement is significantly less than the data channel bandwidth requirement. Such systems can achieve a good balance between system performance and interconnect complexity by using a shared address bus with multiple data buses to enable parallel data transfers. An overview of the AMBA AXI signals in each channel, including their widths when they are not a single wire: - AXI Write Address Channel: - AWADDR: Write address, width typically varies - AWLEN: Burst length, 8 wires. - AWSIZE: Burst size, 3 wires. - AWBURST: Burst type, 2 wires. - AWLOCK: Lock type, 2 wires. - AWCACHE: Cache type, 4 wires. - AWPROT: Protection type, 3 wires. - AWID: Write ID, width varies. - AWVALID: Write address valid, 1 wire. - AWREADY: Write address ready, 1 wire. - AXI Write Data Channel - WDATA: Write data, width varies (e.g., 32-bit, 64-bit). - WSTRB: Write strobes, width equals the data bus width. - WLAST: Write last, 1 wire. - WID: Write ID, width varies. - WVALID: Write valid, 1 wire. - WREADY: Write ready, 1 wire. - AXI Write Response Channel: - BRESP: Write response, 2 wires. - BID: Response ID, width varies. - BVALID: Response valid, 1 wire. - BREADY: Response ready, 1 wire. - AXI Read Address Channel: - ARADDR: Read address, width varies - ARLEN: Burst length, 8 wires. - ARSIZE: Burst size, 3 wires. - ARBURST: Burst type, 2 wires. - ARLOCK: Lock type, 2 wires. - ARCACHE: Cache type, 4 wires. - ARPROT: Protection type, 3 wires. - ARID: Read ID, width varies. - ARVALID: Read address valid, 1 wire. - ARREADY: Read address ready, 1 wire. - AXI Read Data Channel - RDATA: Read data, width varies (e.g., 32-bit, 64-bit, 128-bit). - RRESP: Read response, 2 wires. - RLAST: Read last, 1 wire. - RID: Read ID, width varies. - RVALID: Read valid, 1 wire. - RREADY: Read ready, 1 wire. The width of address and data signals (AWADDR, WDATA, ARADDR, RDATA) usually depends on the specific implementation and requirements of the SoC architecture. The ID signals (AWID, WID, BID, ARID, RID) also vary in width based on the number of outstanding transactions the bus supports. #### CHI > [!warning] > This section is under #development ### CoreConnect > [!warning] > This section is under #development ### Wishbone The WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores is a flexible design methodology for use with semiconductor IP cores. Its purpose is to foster design reuse by alleviating System-on-Chip integration problems. 
This is accomplished by creating a common interface between IP cores. This improves the portability and reliability of the system, and results in faster time-to-market for the end user. Previously, IP cores used non-standard interconnection schemes that made them difficult to integrate. This required the creation of custom glue logic to connect each of the cores together. By adopting a standard interconnection scheme, the cores can be integrated more quickly and easily by the end user. This specification can be used for soft core, firm core, or hard core IP. Since firm and hard cores are generally conceived as soft cores, the specification is written from that standpoint. This specification does not require the use of specific development tools or target hardware. Furthermore, it is fully compliant with virtually all logic synthesis tools.

The WISHBONE interconnect is intended as a general-purpose interface. As such, it defines the standard data exchange between IP core modules. It does not attempt to regulate the application-specific functions of the IP core. The WISHBONE architects were strongly influenced by three factors. First, there was a need for a good, reliable System-on-Chip integration solution. Second, there was a need for a common interface specification to facilitate structured design methodologies on large project teams. Third, they were impressed by the traditional system integration solutions afforded by microcomputer buses such as PCI bus and [[Backplanes and Standard Form Factors#VME|VMEbus]].

The WISHBONE architecture is analogous to a regular computer bus in that they both: (a) offer a flexible integration solution that can be easily tailored to a specific application; (b) offer a variety of bus cycles and data path widths to solve various system problems; and (c) allow products to be designed by a variety of suppliers (thereby driving down price while improving performance and quality). However, traditional microcomputer buses are fundamentally handicapped for use as a System-on-Chip interconnection. That's because they are designed to drive long signal traces and connector systems, which are highly inductive and capacitive. In this regard, System-on-Chip is much simpler and faster. Furthermore, System-on-Chip solutions have a rich set of interconnection resources. These do not exist in microcomputer buses because they are limited by IC packaging and mechanical connectors. The WISHBONE architects have attempted to create a specification that is robust enough to ensure complete compatibility between IP cores.

#### Wishbone Features

The WISHBONE interconnection makes System-on-Chip and design reuse easy by creating a standard data exchange protocol. Features of this technology include:

- Simple, compact, logical IP core hardware interfaces that require very few logic gates.
- Supports structured design methodologies used by large project teams.
- Full set of popular data transfer bus protocols including:
    - READ/WRITE cycle
    - BLOCK transfer cycle
    - RMW cycle
- Modular data bus widths and operand sizes.
- Supports both BIG ENDIAN and LITTLE ENDIAN data ordering.
- Variable core interconnection methods support point-to-point, shared bus, crossbar switch, and switched fabric interconnections.
- Handshaking protocol allows each IP core to throttle its data transfer speed.
- Supports single-clock data transfers.
- Supports normal cycle termination, retry termination, and termination due to error.
- Modular address widths.
![Standard Wishbone connection](image284.png) > [!Figure] > _Standard Wishbone connection_ A somewhat simplified summary of the handshake process in Wishbone is as follows: 41. **Request Phase**: - **CYC (Cycle)**: The master asserts the CYC signal to indicate the beginning of a new transaction. - **STB (Strobe)**: Along with CYC, the master asserts the STB signal to indicate that it wants to initiate a transaction. - **ADR (Address)**: The master places the address of the desired slave device register or memory location on the ADR bus. - **WE (Write Enable)**: If it's a write transaction, the master asserts the WE signal to indicate that it intends to write data to the slave. - **SEL (Select)**: The master asserts the SEL signal to select the specific slave device it wants to communicate with. 42. **Slave Acknowledgment**: - **ACK (Acknowledge)**: Upon receiving the request, the selected slave asserts the ACK signal to acknowledge that it has received the request and is ready to proceed with the transaction. 43. **Data Transfer Phase**: - **DAT_I (Data Input)**: If it's a write transaction, the master places the data to be written on the DAT_I bus. - **DAT_O (Data Output)**: If it's a read transaction, the slave places the requested data on the DAT_O bus for the master to read. 44. **Completion Phase**: - **ERR (Error)**: The slave may assert the ERR signal if any error occurs during the transaction. - **RST (Reset)**: At the end of the transaction, either the master or the slave may assert the RST signal to reset or release any resources associated with the transaction. - **ACK (Acknowledge)**: The slave may assert the ACK signal again to indicate the completion of the transaction. 45. **Handshake Termination**: - **CYC (Cycle)**: At the end of the transaction, the master deasserts the CYC signal to indicate the end of the transaction. - **STB (Strobe)**: The master deasserts the STB signal, indicating that it no longer wants to initiate transactions. - **ACK (Acknowledge)**: The slave may deassert the ACK signal, indicating the completion of the transaction. ### Network-on-Chip (NOC) As the complexity and functionality of System-on-chips keeps increasing, traditional signal-oriented, bus-based and point-to-point interconnects have struggled to keep up with the demands for higher data throughput, lower latency, and better scalability. The NoC approach addresses these challenges by borrowing concepts from computer network design and applying them to on-chip communication. In a NoC system, the components of the SoC—such as processors, memory units, and IP blocks—are connected via a network-like structure instead of a traditional bus. This network comprises routers and links that facilitate communication between components like how data packets travel across a computer network. This design allows for parallel data transfers, reduces bottlenecks, and enables more efficient use of the chip's communication infrastructure. One of the key benefits of the NoC approach is its scalability. As SoC designs grow more complex, adding more cores or modules can be accomplished without a significant redesign of the interconnect architecture. This is because the network can be expanded or modified relatively easily compared to traditional bus systems. Furthermore, NoC designs can be tailored to the specific requirements of the SoC, such as optimizing for low power consumption, high performance, or a balance of both. 
Another advantage of NoCs is their ability to support multiple communication protocols and traffic types, ranging from control signals and low-latency messages to high-bandwidth data transfers. However, the NoC approach also introduces new challenges, such as the design and optimization of the network topology, routing algorithms, and protocols to ensure efficient communication while minimizing power consumption and silicon area. Additionally, the increased complexity of NoC-based SoCs requires more sophisticated design tools and methodologies. A network-on-chip (NoC) consists of endpoints, switching elements, domain-crossing elements, and bus resizers. Above a certain minimum size, NoCs lead to scaling gains over circuit-switched buses. The principal advantage of packetized communication is better utilization of the physical channels while still providing data isolation and QoS guarantees. Virtual Channels are commonly used, leading to even better utilization of the physical nets and the ability to handle disparate traffic requirements without deadlocks. The NoC concept is a miniature version of a wide area network with routers, packets, and links. This paradigm describes a way of communicating between IPs including features such as routing protocols, flow control, switching, arbitration, and buffering. ![](NOC_1.png) > [!Figure] > _Example of an NoC connecting 16 IP cores (source: #ref/Subhoda )_ The figure above shows an example of NoC interconnection architecture consisting of several processing elements connected together via routers and wires (links). A processing element can be any component such as a microprocessor, an ASIC, or an intellectual property block that performs a dedicated task. IPs are connected to the routers via a network interface (NI). We call the combination of an IP, an NI, and a router as a _node_ in the NoC. It can be observed that the words _node_ and _tile_ are used interchangeably in existing literature. NoC architecture uses a packet-based communication approach. A request or response that goes to a cache or to off-chip memory is divided into packets, and subsequently, to _flits_ and injected on the network. A flit is the smallest unit of flow control in a NoC. Packets can consist of one or more flits. For example, assume $S$ is a processor IP whereas node $D$ is connected to an off-chip memory interface (memory controller). When a load instruction is executed at $S$, it first checks the private cache located in the same node, and if it is a cache miss, the required data has to be fetched from the memory. Therefore, a memory fetch request message is created and sent on the appropriate virtual network to the NI. The network interface then converts it into network packets according to the packet format, "flit-cizes" the packets, and sends the flits into the network via the local router. The network is then responsible for routing the flits to the destination, $D$. Flits are routed either along the same path or different paths, depending on the routing protocol. The NI at $D$ creates the packet from the received flits and forwards the request to $D$, which then initiates the memory fetch request. The response message from the memory that contains the data block follows a similar process. Similarly, all IPs integrated into the SoC leverage the resources provided by the NoC to communicate with each other. 
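As a toy illustration of how flits can be forwarded hop by hop, here is a sketch of dimension-ordered (XY) routing on a hypothetical 2D mesh; the function and coordinates are ours, not a description of any particular NoC:

```python
# Dimension-ordered (XY) routing on a 2D mesh NoC (illustrative sketch):
# a flit travels along the X dimension until it reaches the destination
# column, then along the Y dimension. Routers are named by (x, y).

def xy_route(src, dst):
    """Return the list of routers a flit visits from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                  # route along X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                  # then along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# A flit from node S at (0, 0) to node D at (3, 2) crosses 5 links:
print(xy_route((0, 0), (3, 2)))
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
```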
![](NOC_topo.png) > [!Figure] > _NoC topologies (source: #ref/Subhoda )_ When NoC was first introduced, the discussion was on electrical (copper) wires connecting NoC components together, referred to as _electrical NoC_. However, recent advancements have demanded the exploration of alternatives. With the advancement of manufacturing technologies, the computational power of IPs has increased significantly. As a result, the communication between SoC components has become the bottleneck. Irrespective of the architectural optimizations, electrical NoC exhibits inherent limitations due to the physical characteristics of electrical wires, namely: - The resistance of wires, and as a result, the resistance of NoC, is increasing significantly under combined effects of enhanced grain boundary scattering, surface scattering, and the presence of a highly resistive diffusion barrier layer. - Electrical NoC can contribute a significant portion of the on-chip capacitance. In some cases, about 70% of the total capacitance. - The above two reasons combined make the Electrical NoC a major source of power dissipation. Therefore, it is becoming increasingly difficult for electrical NoC to keep up with the delay, power, bandwidth, reliability, and delay uncertainty requirements of state-of-the-art SoC architectures. These challenges can only intensify in future giga and tera-scale architectures. In fact, the International Technology Roadmap for Semiconductors (ITRS) has mentioned optical and wireless-based on-chip interconnect innovation to be key to addressing these challenges [^73]. Recent years have seen the introduction of emerging NoC technologies such as wireless NoC and optical NoC. Both electrical and optical NoCs represent similar topologies. Wireless NoCs integrate on-chip antennas and suitable transceivers that enable communication between two IPs without a wired medium. Silicon-integrated antennas communicating using millimeter-wave range are shown to be a viable technology for on-chip communication. However, optical NoC, also known as photonic NoC, uses photo emitters, optical waveguides, and transceivers for communication. The major advantage over metal/electrical NoCs is that it is possible to physically intersect light beams with minimal crosstalk. This enables simplified routing and together with other properties, optical NoC can achieve bandwidths in the range of several Gbps. >[!warning] >This section is under #development ## SoC Design: From High-Level Synthesis to Tapeout In SoC design, the journey from High-Level Synthesis (HLS) to tapeout encompasses several critical stages, each with its specialized tools and methodologies. This flow transforms abstract design concepts into a manufacturable semiconductor device. Here's a walkthrough: High-Level Synthesis (HLS): - Purpose: Convert algorithmic or behavioral descriptions of a design, often written in higher-level languages like C, C++, or SystemC, into synthesizable Register Transfer Level (RTL) code (Verilog or VHDL). - Process: It involves analyzing the high-level code to identify hardware constructs like parallelism, data paths, and control logic. The HLS tool then generates an optimized hardware description that meets specified constraints such as area, power, and performance. RTL Synthesis: - Purpose: Transform the RTL description into a gate-level netlist, which is a detailed representation of the design in terms of logic gates and interconnections based on a specific semiconductor technology. 
It is at this point that standard cells become of great usefulness. As we saw, [[Semiconductors#Standard Cells|standard cells]] are pre-designed and pre-characterized blocks representing basic logic functions such as AND, OR, NOT, flip-flops, multiplexers, and more. These cells have a uniform height but varying widths depending on their complexity, fan-out, and function. Each standard cell in a library is designed to meet specific performance, power, and area requirements and comes with associated timing, power, and noise models. During RTL Synthesis, the synthesizer tool maps the RTL descriptions to specific standard cells from a library provided by the foundry or a third-party vendor. This library is tailored to the target fabrication process technology, ensuring that the resulting gate-level netlist is optimized for performance, area, and power consumption for that specific process. - Process: This stage involves logic optimization, technology mapping, and timing closure. Tools at this stage ensure that the design will meet the required performance targets and operate reliably under all conditions. Design for Testability (DFT): - Purpose: Integrate test structures into the design to facilitate later testing of the chip for manufacturing defects. - Process: Techniques like scan chain insertion, Built-In Self-Test (BIST), and Automatic Test Pattern Generation (ATPG) are applied. This ensures that the chip can be efficiently tested for faults post-manufacture. Place and Route (P&R) - Purpose: Convert the gate-level netlist into a physical layout, which is a detailed map of where each gate, block, and interconnect will be placed on the silicon die. - Process: - Placement: Determines the optimal location for each component within the chip's floor plan to meet performance, area, and power constraints. - Routing: Connects the components with electrical wires while managing signal integrity and avoiding congestion. Physical Verification: - Purpose: Ensure the chip's layout meets all manufacturing design rules and accurately reflects the intended design. - Process: - Design Rule Checking (DRC): Verifies the layout against a set of rules defined by the semiconductor foundry to ensure manufacturability. - Layout Versus Schematic (LVS): Checks that the layout electrically matches the original schematic or RTL description. - Antenna Effect Checks: Ensures that during fabrication, no gate is damaged due to excess charge accumulation. - Electromigration (EM) and IR-drop Analysis: Verifies that current density and voltage drops within the chip are within safe limits to ensure reliability and functionality. Timing Closure: - Purpose: Ensure that the design meets all timing requirements, including setup and hold times for all paths, under all operating conditions. - Process: Involves iterative analysis and optimization to adjust the layout, tweak the placement, or insert buffering as necessary to meet timing constraints. Power Analysis and Optimization: - Purpose: Ensure that the chip operates within its power budget for both dynamic (switching) and static (leakage) power. - Process: Tools perform simulations to identify power hotspots and suggest optimizations like gate sizing, voltage islands, and power gating to reduce power consumption. Signoff: - Purpose: Final validation of the design against all specifications and conditions it will face in operation to ensure readiness for manufacturing. 
- Process: Comprehensive checks including final DRC, LVS, timing, power, and reliability analyses are performed. This stage often involves running simulations under worst-case scenarios. Tapeout: - Purpose: The final step where the design is sent to the foundry for fabrication. - Process: The completed design, encapsulated in a set of files (typically in GDSII or OASIS format), is handed off to the semiconductor foundry. This includes all layers of the chip design, from transistors to metal interconnects, ready for photolithography and subsequent manufacturing processes. Each stage of this flow is rather complex and requires close attention to detail to ensure that the final product meets its design specifications and can be manufactured without issues. The use of advanced Electronic Design Automation (EDA) tools and collaboration between design teams, tool vendors, and the manufacturing foundry is critical throughout this process. ## Design IP Portfolios No SoC design starts from scratch but stands on the shoulder of existing IP that is integrated together as per an overarching architecture. Companies like Synopsys^[https://www.synopsys.com/designware-ip/soc-infrastructure-ip.html], Cadence Design Systems^[https://www.cadence.com/en_US/home/tools/ip.html], ARM^[https://www.arm.com/products/ip-explorer] and others, provide libraries of Intellectual Property (IP) for System-on-Chip (SoC) design. These libraries consist of pre-designed and pre-verified components such as processors, memory controllers, SerDes, analog and mixed-signal blocks, logic blocks, and other functional units that are essential for building complex semiconductor devices. These components are designed to be integrated into SoC designs, saving significant time and effort for semiconductor companies and designers. These companies invest heavily in research and development to ensure that their IP libraries offer high performance, low power consumption, and compatibility with industry standards. This involves collaboration with semiconductor manufacturers, as well as continuous refinement and improvement of existing IP blocks to keep pace with evolving technology requirements. In addition to providing individual IP blocks, said companies offer comprehensive design tools and methodologies to help designers streamline the SoC design process. This includes tools for verification, synthesis, and physical implementation, as well as specialized solutions for tasks such as power analysis, thermal management, and design for manufacturing. Overall, design IP portfolios play an important role in the semiconductor ecosystem by providing essential building blocks and tools that enable semiconductor companies to develop complex SoCs cost-effectively. One concern, though, about the growing ecosystem of IP portfolios is [[Security#Securing Semiconductors|security]]. # Chiplets A chiplet is a [[Modularity|modular]], self-contained type of [[Semiconductors#System-on-Chips|System-on-chip]] designed to perform a specific function. Instead of building a single, large monolithic chip, chiplets allow manufacturers to break a complex processor into smaller, specialized components. These chiplets are then interconnected using advanced packaging technologies to work together as a single system. This approach is revolutionizing semiconductor design by improving performance, cost, and flexibility. Traditional monolithic chips integrate all components (CPU cores, GPU cores, memory controllers, I/O, etc.) onto a single die. 
Chiplets split these functions into separate, optimized blocks. For example, a processor might have one chiplet for CPU cores, another for GPU cores, and a third for I/O or memory control. Typically, chiplets are connected using high-speed interconnects (e.g., TSVs (Through-Silicon Vias), interposers, or EMIB (Embedded Multi-Die Interconnect Bridge)). These technologies enable fast communication between chiplets while minimizing latency and power consumption.

Chiplets allow for heterogeneous integration: different chiplets can be manufactured on separate process nodes. For example, performance CPU cores can sit on a cutting-edge 3 nm node, whereas I/O or analog components can use a cheaper, older 14 nm node. This optimizes cost and performance by using the right technology for each function.

The key disadvantages of chiplet design and heterogeneous integration packaging are larger package size and higher packaging cost. The reasons are simple: (a) to obtain higher semiconductor manufacturing yield, which translates to lower cost, the system-on-chip (SoC) is partitioned and/or split into smaller chiplets, which makes the package larger and more expensive; and (b) to let those chiplets communicate laterally (horizontally), additional packaging structures are needed, which further raises the cost of the package.

### Examples of Chiplet-Based Products

- AMD Ryzen/EPYC CPUs: Use "Zen" core chiplets connected via Infinity Fabric.
- Intel Ponte Vecchio GPU: Combines 47 chiplets (compute, memory, I/O) in one package.
- Apple M1 Ultra: Links two M1 Max dies via UltraFusion interconnect.
- NVIDIA Grace Hopper: Pairs a CPU chiplet with a GPU chiplet for AI/HPC workloads.

### Challenges

1. Design Complexity:
    - Requires standardized communication protocols (e.g., UCIe (Universal Chiplet Interconnect Express)) to ensure compatibility.
    - Managing heat and power across multiple chiplets.
2. Latency:
    - Communication between chiplets is slower than on-die wiring, so the interconnect must be fast enough to avoid becoming a bottleneck.
3. Standardization:
    - The industry is still developing universal standards for chiplet interoperability.

> [!attention]
> Some disambiguation:
> Chip (Integrated Circuit, IC): A chip refers to the complete packaged electronic component (e.g., a CPU, GPU, or memory module) that is ready to be mounted on a circuit board. The term "chip" is often used interchangeably with "package" in casual contexts, but technically, the "chip" includes the silicon die(s) and the packaging that protects and connects it to the outside world.
> Die (plural: dies or dice): A die is a single piece of silicon cut from a wafer, containing a functional circuit (e.g., a CPU core, GPU, or memory block). A monolithic die integrates all functional blocks (e.g., cores, cache, I/O) on a single piece of silicon. For example, traditional CPUs like Intel’s Core i9 or older GPUs are monolithic dies.
> Chiplet: A chiplet is a modular die designed to work alongside other chiplets in a single package. Chiplets are specialized for specific functions (e.g., compute, memory, I/O) and connected via high-speed interconnects. Chiplets enable heterogeneous integration (mixing process nodes or functions) and improve yield/cost. For instance, AMD’s Ryzen CPUs split compute cores (on 5nm chiplets) and I/O (on 14nm chiplets).
> Tile: A tile is a repeated, modular block of circuitry (e.g., a compute cluster, memory bank, or accelerator) that is replicated across a die or package. Tiles are often used in large designs to simplify scaling.
> Unlike chiplets, tiles are usually part of a single die. For example, Intel’s Xe GPUs use compute tiles; Apple’s M-series SoCs use GPU/core tiles.
> Dielet: The term dielet is less standardized but generally refers to a small, specialized die within a multi-die system. It is sometimes used synonymously with "chiplet", but in some contexts, it implies a passive or supporting role (e.g., interposers, bridges). Dielets often handle non-compute tasks like power delivery or signal routing. For example, Intel’s EMIB (Embedded Multi-Die Interconnect Bridge) is a dielet connecting chiplets.

# Graphics Processing Units (GPUs)

Graphics Processing Units (GPUs) are specialized hardware designed to handle the complex computations required for rendering images, video, and animations. Unlike CPU cores, which are designed for general-purpose computing, GPUs are optimized for tasks that can be executed in parallel, making them highly efficient for specific types of calculations. Processing graphics, for instance in video games or scientific visualization, requires a high number of computations performed in parallel. Why? When a 3D object is rendered on screen and this object has complex shapes and textures, geometric operations on the object such as rotations or translations require computing new positions for all the constituent parts of the object. A traditional CPU core would compute these positions sequentially, which would require pumping up the CPU clock speed to get faster operations. GPUs solve this by parallelizing these computations. A GPU is composed of hundreds or thousands of small cores, each capable of performing its own calculations. This architecture is particularly well-suited to the demands of graphics rendering, where many parallel computations, like pixel shading or vertex transformations, need to be performed simultaneously. In essence, while a CPU might have a small number of cores optimized for sequential (serial) processing, a GPU has a large number of smaller, more specialized cores designed for parallel processing.

The basic operation of a GPU involves receiving rendering instructions and data from the CPU. This data might include information about the shapes, textures, lighting, and movement within a 3D scene. The GPU then uses its multiple cores to process these instructions in parallel. This process involves various stages, including vertex processing (determining the position of each vertex in 3D space), geometry shading (manipulating the geometry of individual shapes), and pixel shading (calculating the color of each pixel). One of the key strengths of a GPU is its ability to handle a large number of calculations related to graphics and visuals simultaneously, significantly speeding up rendering times compared to a CPU. This makes GPUs essential for graphics-intensive applications like video games, 3D modeling, and video rendering.

Over time, the role of GPUs has expanded beyond just graphics rendering. Their ability to handle parallel tasks efficiently makes them suitable for a range of computational tasks in fields such as scientific research, machine learning, and digital signal processing. In these applications, the GPU's parallel processing capabilities are used to perform complex calculations much faster than a traditional CPU could.

## Latency Tolerance

To discuss GPUs, it is important to first define two factors: throughput and latency.
In computing, latency is defined as the time delay between the moment something is initiated and the moment one of its effects begins or becomes detectable; for example, the delay between the instant a data read request is issued and the instant the data is retrieved. Throughput is defined as the amount of work done in a given amount of time; for example, how many instructions are processed per second. With this said, we can point out that CPU cores are designed as low-latency, low-throughput processors, whereas GPUs are high-latency, high-throughput processors. GPUs are designed for tasks that can tolerate latency, for instance, graphics in a video game. To be efficient, GPUs must have high throughput, e.g., processing millions of pixels in a single frame. Because latency will continue to improve more slowly than bandwidth[^72], designers must implement solutions that can tolerate larger and larger amounts of latency by continuing to do useful work while waiting for data to return from operations that take a long time.

![GPU vs CPU latency scenarios (credit: NVidia)](image285.png)

> [!Figure]
> _GPU vs CPU latency scenarios (credit: NVidia)_

CPU cores, on the other hand, are designed to minimize latency, for instance, to serve a device interrupt in a mission-critical application. For this, [[Semiconductors#Caches|caches]] are needed to minimize latency in the core. As a result, CPUs typically need large caches, which require a considerable amount of die area. GPUs, by contrast, can dedicate more of the transistor area to computation horsepower.

![](Maxwell.png)

> [!Figure]
> _NVidia's Maxwell Multiprocessor (SMM). Credit: NVidia_

GPUs can thus have more ALUs for the same-sized chip and therefore run more threads of computation in parallel (modern GPUs can run tens of thousands of threads concurrently).

![CPU core vs GPU chip simplified layout (Credit: Nvidia)](image286.png)

> [!Figure]
> _CPU core vs GPU chip simplified layout (Credit: Nvidia)_

Managing threads on a GPU is a complex task that raises several critical questions. Firstly, how do we tackle the synchronization issues that arise when operating numerous threads simultaneously? The key to preventing these conflicts is to design GPUs to run specific types of threads that are independent of one another, thus avoiding synchronization issues altogether. This can be achieved through the use of SIMD (Single Instruction Multiple Data) threads, which minimize the management required for each thread by having them execute the same instruction on different data. Another challenge is dispatching, scheduling, caching, and context-switching tens of thousands of threads efficiently. This demands a reduction in the hardware overhead of these operations. By programming blocks of threads (for instance, dedicating one pixel shader per draw call, or per group of pixels), it becomes possible to streamline thread management and leverage the GPU's full potential. Through these measures, GPUs can be tailored to handle vast numbers of threads while maintaining efficiency and performance.

GPUs work with what is called stream processing. Stream processing involves handling a typically large dataset, referred to as a "stream", and applying the same series of operations to all of its data. GPUs enhance this process through various optimizations. These include utilizing on-chip memory and local caches to reduce the need for external memory bandwidth and batching thread groups to minimize incoherent memory access. However, inefficient access patterns can lead to increased latency and thread stalls.
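How much parallelism does it take to hide that latency? A rough, back-of-the-envelope relation (a variant of Little's law; the numbers below are illustrative assumptions, not figures for any specific GPU) gives a feel for it:

$$\text{operations in flight} \approx \text{latency} \times \text{issue rate}$$

If, say, a memory access takes on the order of 400 clock cycles to return and a multiprocessor can issue one memory operation per cycle, then roughly 400 independent operations, supplied by hundreds of ready threads, must be in flight at any given moment to keep the unit busy. This is why GPUs are built to hold, schedule, and switch among thousands of lightweight threads.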
To optimize performance, GPUs may eliminate unnecessary operations by terminating or killing threads when appropriate. In the stream programming paradigm, data is organized as ordered collections of a uniform data type. These types can range from simple (like streams of integers or floats) to complex (such as streams of points, triangles, or matrices). While streams can vary in length, their processing is most efficient when they are lengthy, typically containing hundreds or more elements. Permissible stream operations encompass copying, creating sub-streams, indexing through separate index streams, and processing with kernels.

Kernels process whole streams, accepting one or several streams as input and generating one or more output streams. Their key feature is processing streams in their entirety rather than individual elements. Commonly, a kernel applies a function to each element of an input stream (akin to a "map" operation). For instance, a transformation kernel might adjust each point in a stream to a different coordinate system. Other valuable kernel functions include expansions (producing multiple outputs from a single input), reductions (merging multiple elements into one output), and filters (selecting certain input elements for output). Kernel outputs depend solely on their inputs, and within a kernel, each stream element's processing is independent of the others. This approach has two main benefits. Firstly, the data needed for kernel execution is fully determined at the kernel's creation or compilation, leading to high efficiency, especially when input elements and intermediate data are locally stored or involve controlled global references. Secondly, this independence in processing allows seemingly serial kernel calculations to be adapted to data-parallel hardware.

GPUs leverage stream processing to achieve high throughput, making them particularly effective for problems that can tolerate high latencies. This high latency tolerance translates into lower cache requirements, allowing less transistor area to be dedicated to cache and more to computing units. Consequently, this leads to thousands of SIMD threads and significantly high throughput. Additionally, the hardware manages threads, relieving the user from writing and managing code for each thread individually. This makes it easier to scale up parallelism simply by adding more processors. Therefore, the fundamental unit of a modern GPU is a stream processor, embodying a design that maximizes efficiency and performance, especially in highly parallel computing environments.

## CUDA Cores

CUDA is an abbreviation for Compute Unified Device Architecture. It is the name given to the parallel computing platform and API used to access the Nvidia GPU instruction set directly. Each CUDA core has a floating-point unit and an integer unit, and each CUDA core is capable of executing a single thread independently. However, GPUs typically have hundreds or even thousands of CUDA cores working together simultaneously on different parts of a task. This parallel processing capability allows GPUs to handle highly parallelizable workloads, such as graphics rendering, scientific simulations, deep learning training, and more, much more efficiently than CPUs. CUDA cores are organized into streaming multiprocessors (SMs) within the GPU architecture. Each SM contains multiple CUDA cores along with dedicated memory caches and other resources.
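To make the stream-kernel model and the thread-block organization concrete, below is a minimal CUDA sketch; the kernel name, data sizes, and launch parameters are illustrative assumptions rather than code taken from any vendor example. It implements a "map"-style kernel that applies the same rotation to every 2D point of an input stream, with threads grouped into blocks that the hardware schedules onto the available SMs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// "Map"-style stream kernel: every thread applies the same rotation to one
// 2D point of the input stream and writes the result to the output stream.
__global__ void rotatePoints(const float2 *in, float2 *out, int n, float angle)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {                                     // guard the stream bounds
        float s = sinf(angle);
        float c = cosf(angle);
        out[i] = make_float2(c * in[i].x - s * in[i].y,
                             s * in[i].x + c * in[i].y);
    }
}

int main()
{
    const int n = 1 << 20;                           // ~1 million points
    float2 *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float2));     // unified (managed) memory
    cudaMallocManaged(&out, n * sizeof(float2));
    for (int i = 0; i < n; ++i) in[i] = make_float2(1.0f, 0.0f);

    // Threads are grouped into blocks; the GPU schedules blocks onto SMs.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    rotatePoints<<<blocks, threadsPerBlock>>>(in, out, n, 1.5707963f); // ~90 degrees
    cudaDeviceSynchronize();

    printf("out[0] = (%f, %f)\n", out[0].x, out[0].y);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Each thread executes the same instructions on its own element, which is exactly the SIMD-style independence described above; scaling to a longer stream simply means launching more blocks, and the hardware distributes them across the multiprocessors without the programmer managing individual threads.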
The number of CUDA cores and SMs can vary significantly between different GPU models, with higher-end GPUs typically featuring more cores and SMs for increased parallel processing power.

![](CUDA.jpg)

## Tensor Cores

> [!warning]
> This section is under #development

## Challenges

As GPUs evolve, power management has become an increasingly important aspect of their design, due to the rising energy consumption of each new generation. Looking ahead, we might see a more finely tuned approach to power management at various processing stages, tailored and energy-efficient design strategies for the most power-intensive sections of the GPU, and advanced thermal control for top-tier GPUs. Given the trend of escalating power requirements in forthcoming chip designs, persistent innovation in this field is essential to meet the upcoming challenges.

One of the primary reasons for GPUs' power-hungry nature lies in their architectural design. To accommodate thousands of cores and memory units, GPUs feature a dense arrangement of transistors on their silicon dies. These transistors consume power when switching between states, with higher transistor counts leading to increased power consumption. Additionally, the need for high-speed memory and extensive interconnects further contributes to the power requirements of GPUs. Moreover, although GPUs typically run at lower clock frequencies than CPU cores, they switch an enormous number of transistors on every cycle, and sustaining those frequencies across so much active silicon translates to high power consumption. To sustain these frequencies and prevent overheating, GPUs require robust cooling systems, often in the form of elaborate heat sinks and fans, which consume additional power.

The power conundrum of GPUs poses significant challenges, particularly in the context of environmental sustainability and energy efficiency. As the world grapples with climate change and the need to reduce carbon emissions, the energy-intensive nature of GPUs raises concerns about their environmental impact. High power consumption not only contributes to increased electricity bills for consumers but also places strain on power grids and exacerbates the demand for fossil fuels.

Addressing the power conundrum of GPUs requires a multifaceted approach. Hardware manufacturers must continue to innovate in areas such as transistor efficiency, thermal management, and power optimization. Techniques like dynamic voltage and frequency scaling (DVFS) can help adjust GPU performance based on workload requirements, thereby reducing power consumption during idle or low-intensity tasks. Additionally, advancements in renewable energy sources and more efficient cooling technologies can mitigate the environmental impact of GPU usage.

# Chip Packaging

Chip packaging is undergoing a major paradigm shift these days and promises to pick up the slack caused by the slowing down of CMOS scaling. This shift has been driven by the scaling of key packaging metrics such as bump pitch, trace pitch, inter-die spacing, and alignment. The goal of advanced packaging is to enable the same benefits that Moore/Dennard scaling has accomplished for CMOS with respect to density, performance, power, and cost. The vehicles that advanced packaging employs are somewhat different: dielets/chiplets, advanced assembly techniques, simplified inter-chip communication protocols, and cost optimization via the use of optimized heterogeneous technologies. Packaging technology plays a key role in improving the performance of the next generation of digital systems. But first things first: let's define what chip packages are.
Chip packages are the interconnectable housings for semiconductor devices. The major functions of a package are to provide electrical interconnections between the IC and the board and to efficiently remove the heat generated by the device.

Feature sizes are constantly shrinking, resulting in an increased number of transistors being packed into the device. To keep pace with this progress coming from silicon technologies, packages have also evolved to provide improved device functionality and performance. To meet these demands, electronic packages must be flexible enough to address high pin counts, and reduced pitch and form factor requirements. At the same time, packages must be reliable and cost-effective.

![[Pasted image 20250128160630.png]]

> [!Figure]
> Groups of advanced packaging: 2D, 2.1D, 2.3D, 2.5D, and 3D IC integration (source: #ref/Lau )

### Units

The JEDEC standards for PLCC, CQFP, and PGA packages define package dimensions in inches. The lead spacing is specified as 25 mils, 50 mils, or 100 mils (0.025 in., 0.050 in., or 0.100 in.). The JEDEC standards for PQFP, HQFP, TQFP, VQFP, CSP, and BGA packages define package dimensions in millimeters. The lead frame packages have lead spacings of 0.5 mm, 0.65 mm, or 0.8 mm. The CSP and BGA packages have ball pitches of 0.5 mm, 0.8 mm, 1.00 mm, or 1.27 mm.

### Cavity-Up, Cavity-Down

Chip suppliers usually attach the die to the inside bottom of the package. Called "Cavity-Up," this has been the standard IC assembly method for over 25 years. This method, however, does not provide the best thermal characteristics. Pin Grid Arrays (greater than 130 pins), copper-based BGA packages, and Ceramic Quad Flat Packs are assembled "Cavity-Down," with the die attached to the inside top of the package, for optimal heat transfer to the ambient air. For most packages, this information does not affect how the package is used, because the user has no choice in how the package is mounted on a board. For Ceramic Quad Flat Pack (CQFP) packages, however, the leads can be formed to either side. Therefore, for best heat transfer to the surrounding air, CQFP packages are mounted logo up, facing away from the PC board.

### Cavity-Up Plastic BGA

BGA is a plastic package technology that utilizes area-array solder balls at the bottom of the package to make electrical contact with the system circuit board. The area-array format of the solder balls reduces the package size considerably when compared to leaded products. It also results in improved electrical performance as well as higher manufacturing yields. The substrate is made of a multilayer BT (bismaleimide triazine) epoxy-based material. Power and ground pins are grouped together, and the signal pins are assigned in a perimeter format for ease of routing onto the board. The package is offered in a die-up format and contains a wire-bonded device that is covered with a mold compound.

![Cavity-Up Ball Grid Array (BGA) Package (credit: AMD)](image287.png)

> [!Figure]
> _Cavity-Up Ball Grid Array (BGA) Package (credit: AMD)_

As shown in the cross-section in the figure above, the BGA package contains a wire-bonded die on a single-core printed circuit board with an overmold. Beneath the die are thermal vias which can dissipate the heat through a portion of the solder ball array and ultimately into the power and ground planes of the system circuit board. This thermal management technique provides better thermal dissipation than a standard PQFP package.
Metal planes also distribute the heat across the entire package, enabling a 15-20% decrease in thermal resistance to the case. The advantages of the Cavity-Up BGA package are:

- High board assembly yield since the board attachment process is self-centering (refresh [[Printed Circuit Boards#Reflow Oven|reflow soldering concepts]])
- SMT compatible
- Extendable to multichip modules
- Low profile and small footprint
- Improved electrical performance (short wire length)
- Enhanced thermal performance
- Excellent board-level reliability

Copper-based cavity-down BGAs offer better electrical and thermal characteristics. This technology is especially applicable for high-speed, high-power semiconductors such as FPGAs and System-on-Chips.

Key Features/Advantages of Cavity-Down BGAs:

- Lowest thermal resistance
- Superior electrical performance
- Low-profile and lightweight construction
- Excellent board-level reliability

### Flip-Chip BGA Packages

Flip chip is a packaging interconnect technology that replaces the peripheral bond pads of traditional wire-bond interconnect technology with area-array interconnect technology at the die/substrate interface. The bond pads are either redistributed on the surface of the die or, in some very limited cases, directly dropped from the core of the die to the surface. Because of this inherent distribution of bond pads on the surface of the device, more bond pads and I/Os can be packed into the device.

The flip-chip BGA package is usually offered for high-performance FPGA products. Unlike traditional packaging, in which the die is attached to the substrate face up and the connection is made with wire, the solder-bumped die in a flip-chip BGA is flipped over and placed face down, with the conductive bumps connecting directly to the matching metal pads on the laminate substrate. In other words, whereas traditional packaging technology makes the interconnection between the die and the substrate with wire, flip chip utilizes conductive bumps that are placed directly on the area-array pads of the die surface. The area-array pads contain wettable metallization for solders (either eutectic or high-lead), where a controlled amount of solder is deposited either by plating or screen-printing. These parts are then reflowed to yield bumped dies with relatively uniform solder bumps over the surface of the device. The device is then flipped over and reflowed on a ceramic or organic laminate substrate. The solder material at the molten stage is self-aligning and produces good joints even if the chips are placed offset to the substrates. After the die is soldered to the substrate, the gap (standoff) formed between the chip and the substrate is filled with an organic compound called underfill. The underfill is a type of epoxy that helps distribute stresses from the solder joints across the surface of the whole die and hence improves the reliability and fatigue performance of these joints.

This interconnect technology has emerged in applications related to high-performance communications, networking, and computing, as well as in consumer applications where miniaturization, high I/O count, and good thermal performance are key attributes.

![Flip-Chip BGA Package (credit: AMD)](image288.png)

> [!Figure]
> _Flip-Chip BGA Package (credit: AMD)_

Flip-chip packages are typically not hermetically sealed, and exposure to cleaning solvents or excessive moisture during board assembly can pose serious package reliability concerns.
Small vents are placed by design between the heat spreader (lid) and the organic substrate to allow for outgassing and moisture evaporation. These vent holes are located in the middle of all four sides of FF flip-chip packages. Solvents or other corrosive chemicals can seep through these vents and attack the organic materials and components inside the package, so their use is strongly discouraged during board assembly of flip-chip BGA packages.

Key Features/Advantages of Flip-Chip BGA Packages:

- Easy access to core power/ground, resulting in better electrical performance
- Excellent thermal performance (direct heatsinking to the backside of the die)
- Higher I/O density since bond pads are in area array format
- Higher frequency switching with better noise control

#### Chip Scale Package (CSP)

Chip Scale Packages have emerged as a dominant packaging option for meeting the demands of miniaturization while offering improved performance. Applications for Chip Scale Packages are targeted at portable and consumer products where real estate is of utmost importance, miniaturization is key, and power consumption/dissipation must be low. A Chip Scale Package is defined as a package whose area is between 1 and 1.2 times the area of the die it contains, with a pitch of less than 1 mm. By employing CSP packages, system designers can dramatically reduce board real estate and increase the I/O counts. Although there are currently more than 50 different types of CSPs available in the market, the CSP types we will discuss here are flex-based substrates and rigid BT-based substrates. Although both types meet the reliability requirements at the component and board level, the BT-based substrate is typically chosen for newer devices because of the large vendor base producing/supporting BT-based substrates.

Key Features/Advantages of CSP Packages:

- An extremely small form factor that significantly reduces board real estate for such applications as portable and wireless designs, and PC add-in cards
- Lower inductance and lower capacitance
- The absence of thin, fragile leads found on other packages
- A very thin, lightweight package

![Rigid BT-Based Substrate Chip Scale Packages, Left; Flex-Based Tape Substrate (Right) (credit: AMD)](image289.png)

> [!Figure]
> _Rigid BT-Based Substrate Chip Scale Packages, Left; Flex-Based Tape Substrate (Right) (credit: AMD)_

#### Quad Flat No-Lead (QFN) Packages

The Quad Flat No-Lead (QFN), or MLF, package is a robust and low-profile lead-frame-based plastic package that has several advantages over traditional lead-frame packages. The exposed die-attach paddle enables efficient thermal dissipation when directly soldered to the PCB. Additionally, this near-chip-scale package offers improved electrical performance, a smaller package size, and an absence of external leads. Since the package has no external leads, coplanarity and bent leads are no longer a concern. Quad Flat No-Lead packages are ideal for portable applications where size, weight, and performance matter. The QFN is a molded leadless package with land pads on the bottom of the package. Electrical contact with the PCB is made by soldering the land pads to the PCB. The backside of the die is attached to the exposed paddle through an electrically conductive die-attach material. The exposed pad therefore represents a weak ground and should be left floating or connected to a ground net.
![QFN Cross Section (Left) and Bottom View (Right) (credit: AMD)](image290.png)

> [!Figure]
> _QFN Cross Section (Left) and Bottom View (Right) (credit: AMD)_

Key Features/Advantages of QFN Packages:

- Small size and lightweight
- Excellent thermal and electrical performance
- Compatible with conventional SMT processes

#### Ceramic Column Grid Array (CCGA) Packages

Ceramic Column Grid Array (CCGA) packages are surface-mount-compatible packages that use high-temperature solder columns as interconnections to the board. Compared to solder spheres, the columns have lower stiffness and provide a higher stand-off. These features significantly increase the reliability of the solder joints. When combined with a high-density, multilayer ceramic substrate, this packaging technology offers a high-density, reliable packaging solution.

Key Features/Advantages of CCGA Packages:

- High planarity and good thermal stability at high temperature
- CTE matches well with the silicon die
- Low moisture absorption

![CCGA Package (credit: AMD)](image291.png)

> [!Figure]
> _CCGA Package (credit: AMD)_

### Chip-on-Wafer-on-Substrate (CoWoS)

CoWoS is a 2.5D wafer-level multi-chip packaging technology that incorporates multiple dies side-by-side on a silicon interposer to achieve better interconnect density and performance. Individual chips are bonded through microbumps onto a silicon interposer, forming a chip-on-wafer (CoW) assembly. The CoW is then thinned such that the through-silicon via (TSV) perforations are exposed. This is followed by C4 bump formation and singulation. C4 differs from wire bonding, where the chip sits face up; here, the die pads are connected to the pads of an interconnect circuit, like a substrate, BGA, or glass or ceramic, thereby connecting to outside circuitry. C4 is an abbreviation for controlled collapse chip connection, and it has long been associated with the ball-grid array (BGA) packaging process. The "collapse" part of C4 is when the BGA balls undergo reflow. A CoWoS package is completed through bonding to a package substrate (see figure below).

![CoWoS package (credit: Wikichips)](image292.png)

> [!Figure]
> _CoWoS package (credit: Wikichips)_

In CoWoS technology, the process begins with the placement of several individual chips onto a wafer. This "chip-on-wafer" stage is where known good dies (KGDs), which are tested and functional chips, are attached to a larger wafer using a process known as die-to-wafer bonding. The bonding is typically done using microbumps, which are tiny solder bumps that provide both mechanical and electrical connections between the chip and the wafer. Once the chips are bonded to the wafer, the entire assembly is then attached to a substrate in the "wafer-on-substrate" stage. The substrate serves as a base and includes interconnections and other electronic components necessary for the final device's functionality. This substrate can also include additional layers for power delivery and heat dissipation, which are important in high-performance applications.

One of the key advantages of CoWoS technology is its ability to increase the density of interconnections between chips. This is achieved through the use of through-silicon vias (TSVs), which are vertical electrical connections passing through the silicon wafers. TSVs allow for shorter electrical paths, leading to faster signal transmission and reduced power consumption.
CoWoS is particularly valuable in high-performance computing applications, such as servers and data centers, where space and energy efficiency are critical. By placing dies side by side on a silicon interposer and stacking memory dies vertically, CoWoS enables the creation of compact and powerful devices, and it finds extensive use in GPUs. Additionally, CoWoS technology facilitates the integration of heterogeneous chips, meaning chips made using different manufacturing processes or for different functions, such as memory and logic, can be combined into a single package. These functional blocks are also called [[Semiconductors#Chiplets|chiplets]].

One of these manufacturing processes is High Bandwidth Memory (HBM) technology, which represents a significant advancement in memory design, particularly in the context of high-performance computing and graphics processing. The essence of HBM lies in its architecture, which is fundamentally different from traditional memory formats like DDR3 or DDR4. Traditional memory chips are typically laid out flat on a board, requiring data to travel relatively long distances at high speeds through copper interconnects, which can cause delays and affect [[Physical Layer#Signal Integrity|signal integrity]]. HBM, by contrast, stacks several memory dies on top of each other, creating a "3D" structure. These dies are connected by Through-Silicon Vias (TSVs) and microbumps (see figure above). The stacked design of HBM significantly reduces the footprint on the carrier board. But more importantly, it allows for much broader pathways for data to travel compared to the traditional flat layouts. This design results in higher bandwidth, meaning more data can be transferred to and from the memory in any given second. Additionally, the shorter distances and efficient layout help reduce losses and latency. HBM is particularly advantageous in scenarios that demand high-speed data processing. This is why it's commonly used in graphics cards, where rapid access to large amounts of memory is necessary for rendering complex images and video. It's also increasingly seen in high-performance computing applications, like AI and deep learning, where quick data processing and efficient power usage are attractive.

> [!info]
> The Through-Silicon Via (TSV) was invented more than 60 years ago by William Shockley (yes, the same Shockley from the [[Semiconductors#The Transistor Drama|transistor drama]]). He filed the patent, "_Semiconductive Wafer and Method of Making the Same_", on October 23, 1958, and was granted [US patent 3,044,909](https://patents.google.com/patent/US3044909A/en) on July 17, 1962. In a TSV, the via diameter can be as small as 1 μm, but in general it is about 5–10 μm. The via is usually filled with Cu, with a SiO2 insulation layer because silicon itself is electrically conductive.
[^39]: More about the process here: <https://aip.scitation.org/doi/pdf/10.1063/5.0046150> [^40]: Expectedly, the process is called after him: <https://en.wikipedia.org/wiki/Czochralski_method> [^41]: More about the production of ingots here: <https://www.microchemicals.com/technical_information/czochralski_floatzone_silicon_ingot_production.pdf> [^42]: See more about the etching process here: [https://www.kth.se/social/upload/510f795cf276544e1ddda13f/Lecture 7 Etching.pdf](https://www.kth.se/social/upload/510f795cf276544e1ddda13f/Lecture%207%20Etching.pdf) [^43]: A valence electron is an electron in the outer shell associated with an atom, that can participate in the formation of a chemical bond if the outer shell is not closed. [^44]: This is due to thermionic emission. Thermionic emission (also known as the Edison effect) is the liberation of electrons by virtue of their temperature. This occurs because the thermal energy given to the charge carrier overcomes the work function of the material. [^45]: Thermionic emission is the process by which electrons escape from the surface of a material due to their thermal energy. When a material is heated to a sufficiently high temperature, the thermal energy of the electrons increases, and some of them gain enough energy to overcome the potential barrier that holds them in the material. [^46]: https://en.wikipedia.org/wiki/Avalanche_breakdown [^47]: https://en.wikipedia.org/wiki/Point-contact_transistor [^48]: Sadly, Shockley would eventually become a notorious racist: <https://www.nature.com/articles/442631a> [^49]: https://en.wikipedia.org/wiki/Shockley_diode_equation [^50]: Transistor switches (MOSFETs) are the most produced object in the history of humanity with approximately 1.3E22 units produced. [^51]: Note that "1" and "0" states here are just logic states and do not represent specific voltages. Eventually, the semiconductor industry would standardize the voltage thresholds in logic among other specs, giving way to logic families. [^52]: Sky130A specification can be found here: https://skywater-pdk.readthedocs.io/en/main/rules/assumptions.html\#general [^53]: Fact: The flip flop is one of the largest and most complex standard cells in Sky130 [^54]: Note that in combinatorial logic there is no feedback, therefore as soon as the input changes, the outputs will change. [^55]: https://www.esa.int/Enabling\_Support/Space\_Engineering\_Technology/Onboard\_Computers\_and\_Data\_Handling/Microprocessors [^56]: https://www.reuters.com/technology/us-lawmakers-press-biden-plans-chinese-use-open-chip-technology-2023-11-02/ [^57]: The memory wall is the situation where improvements to processor speed will be masked by the much slower improvement in dynamic random access (DRAM) memory speed. For a great explanation on the topic, see this article: https://semianalysis.com/2024/09/03/the-memory-wall/# [^58]: Translated version of the thesis: <http://www.itu.dk/~sestoft/boehmthesis/boehm.pdf> [^59]: "Software crisis" is a term used in the early days of computing science to describe the difficulty of writing efficient computer programs in the required time. The software crisis was due to the rapid increases in computer power and the complexity of the problems that could not be tackled. With the increase in the complexity of the software, many software problems arose because existing methods were inadequate. [^60]: Actually, with 1 bit, more than a "quantity" you can represent a status (true/false, hi/low, on/off). 
[^61]: https://www.qemu.org/ [^62]: Here's a description of TCG instructions and operations: https://www.qemu.org/docs/master/devel/tcg-ops.html [^63]: DO-178C, Software Considerations in Airborne Systems and Equipment Certification is the primary document by which the certification authorities such as FAA, EASA, and Transport Canada approve all commercial software-based aerospace systems. The document is published by RTCA, Incorporated. [^64]: Itier, J.-B. (2007). A380 Integrated Modular Avionics The history, objectives, and challenges of the deployment of IMA on A380. Artist - European Network of Excellence in Embedded Systems Design. http://www.artist-embedded.org/docs/Events/2007/IMA/Slides/ARTIST2\_IMA\_Itier.pdf [^65]: Ramsey, J. (2007, February 1). Integrated Modular Avionics: Less is More. Aviation Today. https://www.aviationtoday.com/2007/02/01/integrated-modular-avionics-less-is-more/ [^66]: https://ecss.nl/ [^67]: https://www.esa.int/Enabling\_Support/Space\_Engineering\_Technology/Shaping\_the\_Future/IMA\_Separation\_Kernel\_Qualification\_-\_preparation [^68]: https://fentiss.com/products/hypervisor/ [^69]: https://indico.esa.int/event/225/contributions/4307/attachments/3343/4403/OBDP2019-S02-05-GMV\_Gomes\_AIR\_Hypervisor\_using\_RTEMS\_SMP.pdf [^70]: https://www.sysgo.com/products/pikeos-hypervisor/ [^71]: Link to the source code of sin(): https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/s\_sin.c;hb=HEAD\#l281 [^72]: Richard Brown et al. Report to Congress on Server and Data Center Energy Efficiency: Public Law 109-431. Technical report, Lawrence Berkeley National Laboratory, 2008. [^73]: Semiconductor Industry Association et al. 2011. International Technology Roadmap for Semiconductors (ITRS). 2003 edition. Incheon, Korea..