% Jens Korinth committed May 09, 2017
This chapter discusses the exemplary \gloss{Architecture} called \code{baseline} in \secref{sec:baseline-architecture} and the \gloss{Platform} for Zynq-7000 series devices, \code{zynq}, in \secref{sec:zynq-platform}.
Finally, some of the examples which are part of the \tapasco{} archive are discussed and a small tutorial shows how ModelSim can be used to perform hardware/software co-simulation using the \tapasco{} libraries.

\figref{fig:baseline-zynq} illustrates a complete design for the exemplary \gloss{Architecture} and \gloss{Platform} for a hypothetical \gloss{Composition} of \gloss{Kernels} "A-E" and is explained in detail in the following sections.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Zynq Platform}\label{sec:zynq-platform}
This \gloss{Platform} targets the Xilinx Zynq-7000 series of system-on-chips (SoC), which combine a \emph{28nm Artix-/Kintex-7 FPGA with a dual-core ARM Cortex A9 CPU} on a single chip.
The Zynq-7000 can be used as a standalone embedded system capable of running a variant of Linux using a standard Linux kernel.
The close connection of FPGA and CPU makes accessing the FPGA very convenient and eases prototyping.
Furthermore, one of the most interesting features of the Zynq-7000 is the fact that \emph{the CPU and the FPGA share main system memory}; via the so-called ACP port it is even possible to use level-2 cache coherent memory accesses between CPU and FPGA.
This is a very interesting design, since FPGA accelerators are usually rather distant from the memory controllers, often being connected by peripheral buses with high throughput, but also high latency, such as PCIe.
On Zynq-7000 boards, memory accesses from the FPGA are almost on par with those of the CPU.
This low-latency access facilitates a \emph{zero-copy approach}: In other designs, the common technique is to transfer the data for a \gloss{Job} to the device, compute, then transfer the data back.
On Zynq-7000 boards it is possible to directly access the shared main memory instead, removing the need for data transfers (zero copy).
This is the approach chosen for the \code{zynq} platform.

\paragraph{Host Connection}
The Zynq-7000 offers several AXI4 Master interfaces as hard IP, i.e., outside of the reconfigurable fabric.
To connect the threadpool register space, \code{zynq} uses the \code{GP0} port (cf. left-hand side of \figref{fig:baseline-zynq}), while it uses the \code{GP1} port to connect slave interfaces in the \gloss{Platform} infrastructure.

\paragraph{Memory Connection}
To access the shared main memory, the Zynq-7000 also offers several hard IP AXI4 Slave interfaces.
Since \tapasco{} currently provides only thread-private memory to the hardware threads, memory coherency is not a primary concern.
The hardware threads can therefore be connected to the fastest AXI4 slaves, \code{HP0}-\code{HP2}, each of which can support up to 16 parallel masters, giving a total of 48 possible master interfaces.

\paragraph{Signaling}
The interrupt lines of the hardware threads are connected to a standard component from the Xilinx IP catalog, the \emph{AXI4 Interrupt Controller IP}.
Among many features not required for this simple \gloss{Platform}, it supports up to 32 edge- or level-sensitive interrupt lines.
The \gloss{Platform} automatically instantiates the required number of interrupt controller instances, depending on the number of interrupt wires returned by \code{arch\_get\_irqs}.
Each interrupt controller in turn has an interrupt output wire; all of these are connected to the Zynq-7000 \code{IRQ2FP} port, which connects to the CPU interrupt controller.
The \code{IRQ2FP} port is up to 16 bits wide and can thus support up to 16 AXI4 interrupt controllers, giving a maximum of $16 \times 32 = 512$ hardware thread interrupts supported by the design, which leaves a lot of room to scale.

\paragraph{Address Mapping}
The simple \code{zynq} implementation only requires register space addresses for the slave registers of the interrupt controllers.
They start at \code{0x80800000}, using a \code{0x1000} window each; thus the address space \code{0x80800000-0x80810000} is reserved for the interrupt controllers and cannot be used by the threadpool.

\paragraph{Simulation Design}
For the simulation design, the \emph{Zynq Processing System BFM Core} is used instead of the regular core.
Furthermore, an instance of the \emph{AXI BFM Core} IP is instantiated, which is a simulation core that makes it easy to generate AXI4 transactions in simulation.
It is very helpful for debugging the memory system and memory accesses, and is directly connected to the last of the high-performance memory ports, \code{HP3}.

\paragraph{Platform API: Simulation Implementation}
The \gloss{Platform API} is implemented in \code{platform/\allowbreak zynq/\allowbreak include/\allowbreak platform-api.svh}.
The implementation is straightforward and uses the Zynq BFM core to simulate host accesses.
There is one slight deviation from the real design: Since the simulator cannot actually access the memory located in the client process, data transfers must be performed in simulation; the AXI4 BFM core mentioned in the previous paragraph is used to perform that task and initializes the memory seen by the simulated hardware threads.
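
\medskip
The capacity figures from the Signaling and Address Mapping paragraphs of this section can be retraced with a little arithmetic.
The following is only a sketch: the symbols $k$ (number of interrupt wires returned by \code{arch\_get\_irqs}), $n$ (number of interrupt controllers) and $i$ (zero-based controller index) are introduced here for illustration, and it is assumed that the controllers are filled densely and placed in consecutive address windows.
\[ n = \left\lceil \frac{k}{32} \right\rceil, \qquad n \le 16 \]
\[ \mathrm{base}(i) = \texttt{0x80800000} + i \times \texttt{0x1000}, \qquad 0 \le i < n \]
For example, $k = 48$ interrupt wires would need $n = 2$ controllers, located at \code{0x80800000} and \code{0x80801000}.
With all 16 controllers in use, the reserved region ends at $\texttt{0x80800000} + 16 \times \texttt{0x1000} = \texttt{0x80810000}$.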
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Baseline Architecture}\label{sec:baseline-architecture}
The \code{baseline} architecture serves as a proof-of-concept implementation for \tapasco{}; it is designed to support up to 48 independent hardware threads.
It is intended to be the simplest organization of a hardware threadpool which can be realized using only standard Xilinx IP components and the AXI4 bus.
Furthermore, as the name suggests, this \gloss{Architecture} provides a reasonable baseline to compare future optimizations against.

\paragraph{HLS Directives}
The first design choice is to provide an AXI4Lite Slave interface (a simplified AXI4 slave without burst support) for all value arguments of a \gloss{Kernel}.
This can be achieved by the following directive template:
\lstinputlisting[language={[TemplateTcl]Tcl}, firstline=2]{../arch/baseline/valuearg.directives.template}
Reference arguments receive an AXI4 Master interface with a configurable base address register, which enables them to perform random access patterns:
\lstinputlisting[language={[TemplateTcl]Tcl}, firstline=1]{../arch/baseline/referencearg.directives.template}

\paragraph{Bus Structure}
Toward the host, \code{baseline} therefore uses a two-tiered hierarchy of AXI4 Interconnect IPs (standard bus structure components from the Xilinx IP catalog), see "Host AXI Interconnect 1-3" in \figref{fig:baseline-zynq}.
In theory, each AXI4 Interconnect can support up to 16 Master/Slave interfaces, yielding a theoretical maximum of $16 \times 16 = 256$ slave interfaces, which easily suffices to accommodate the 48 hardware threads, even if some have several slave interfaces.
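
\medskip
As a rough sizing sketch for the host side, assume that the second tier is filled densely and that the first-tier interconnect only fans out to the second tier; the symbol $s$ (total number of slave interfaces in the threadpool) is introduced here for illustration.
The number of second-tier interconnects then is
\[ \left\lceil \frac{s}{16} \right\rceil \le 16 . \]
With 48 hardware threads and one slave interface each, $s = 48$ requires $\lceil 48/16 \rceil = 3$ second-tier interconnects; a hypothetical design in which every thread had two slave interfaces, $s = 96$, would require $6$.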
\medskip
Toward memory, \code{baseline} aims to provide reasonably short paths and uses only a single tier of AXI4 Interconnects; "Mem AXI Interconnect 1-3" are connected to \code{HP0-2}, each supporting up to 16 independent masters, giving a total of $16 \times 3 = 48$ masters this design can support, making at least one master per hardware thread possible.
Each of the \code{HPx} ports is mapped to the first GB of physical memory, making the address space \code{0x00000000-\allowbreak 0x3FFFFFFF} directly accessible to the hardware threads.

\medskip
\emph{Remark: In \figref{fig:baseline-zynq} "Mem AXI Interconnect 2" is actually connected to \code{HP2} instead of \code{HP1}.
The reason for this choice is that the pairs \code{HP0/1} and \code{HP2/3} are not completely disjoint in the hard IP, and full performance cannot be achieved when saturating both \code{HP0} and \code{HP1}, or both \code{HP2} and \code{HP3}, at once.
Thus a minor performance boost can be achieved by connecting the first 32 hardware threads to \code{HP0} and \code{HP2} instead of \code{HP0} and \code{HP1}.}

\paragraph{Threadpool Organization}
The internal organization of the threadpool is as simple as possible: Each \gloss{Kernel} in the \gloss{Composition} is instantiated the requested number of times and assigned ascending hardware thread slot IDs.
Each slave and master interface of the hardware threads is then connected to the first available master / slave interface on the host / memory interconnects.
This may result in asymmetric connections, and also in unused interfaces, as \figref{fig:baseline-zynq} illustrates: In the center of the diagram the hardware threads are depicted, their left-hand side (slave interfaces) facing toward the host, their right-hand side (master interfaces) facing toward memory.
There are several unused slave interfaces on "Mem AXI Interconnect 2"; these will in fact not be instantiated, since the IP is freely configurable, and will thus not waste area in the final design.
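
\medskip
The first-available rule can be written down compactly for the memory side.
This is a sketch under the assumption that master interfaces are numbered consecutively in order of ascending slot ID and that the interconnects are filled one after another; the index $m$ and the functions $c$ and $p$ are introduced here for illustration only.
\[ c(m) = \left\lfloor \frac{m}{16} \right\rfloor + 1, \qquad p(m) = m \bmod 16, \qquad 0 \le m < 48 \]
Here $c(m)$ is the number of the "Mem AXI Interconnect" and $p(m)$ the zero-based port on it.
For instance, in a hypothetical composition in which every hardware thread has exactly one master interface and slot IDs are zero-based, the thread in slot 20 has $m = 20$ and is attached to port $4$ of "Mem AXI Interconnect 2".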
\paragraph{Signaling}
By default, Vivado HLS automatically generates a level-sensitive interrupt line to indicate completion on the IP cores it produces.
The array of interrupt lines returned by \code{arch\_get\_irqs} (see \tblref{tbl:architecture.tcl}) contains each of these interrupt lines, ordered by hardware thread slot ID.

\paragraph{Address Mapping}
In Zynq designs it is common to have the primary AXI peripheral address space begin at \code{0x43C00000}, so \code{baseline} adheres to this convention.
Each of the up to 48 hardware threads is mapped into a window of size \code{0x00010000}, which amounts to $48 \times \texttt{0x10000} = \texttt{0x300000}$ bytes in total; thus the threadpool register space ranges from \code{0x43C00000-\allowbreak 0x43F00000}, as indicated on the left of \figref{fig:baseline-zynq}.
%
\begin{sidewaysfigure}[p]
	\centering%
	\includegraphics[width=\textwidth]{tikz/baseline_zynq}
	\caption{Complete design for example composition using zynq Platform and baseline Architecture.}
	\label{fig:baseline-zynq}
\end{sidewaysfigure}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}\label{sec:examples}
TODO