Files
ddr5-paper/main.tex
2025-03-24 20:59:55 +00:00

2915 lines
134 KiB
TeX
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
\makeatletter
\p@=1bp
\makeatother\documentclass[10pt,oneside,twocolumn,a4paper]{article}
\usepackage{graphicx}
\usepackage[dvipsnames]{xcolor}
\usepackage{pgfplots}
\usepackage{rotating}
\pgfplotsset{compat=1.12}
%% Font: Times
\usepackage[utf8]{inputenc}
\usepackage{mathptmx}
\usepackage{tabularx}
\usepackage{ragged2e}
\usepackage[singlelinecheck=false]{caption}
\usepackage[T1]{fontenc}
\usepackage[numbers]{natbib}
\usepackage{amsmath}
\usepackage{xcolor}
\usepackage{caption}
\usepackage{subcaption}
\usepackage[english]{babel}
\newcommand\todo[1]{\textcolor{blue}{#1}}
\def\figurename{Figure}%
\def\tablename{Table}%
\RequirePackage[blocks]{authblk}
\usepackage{babel}
\pagestyle{empty}
\hoffset-1in
\voffset-1in
\oddsidemargin20truemm
\usepackage{multirow}
%\usepackage{draftwatermark}
%\SetWatermarkText{Draft}
%\SetWatermarkScale{2}
\usepackage{listings}
\definecolor{darkgreen}{rgb}{0,0.43,0}
\lstdefinelanguage{DRAMml}{
keywords={},
otherkeywords={% Operators
->, ->>, -<>, -o, *\\
},
keywordstyle=\color{blue}\bfseries,
keywords=[2]{DRAM, Timings, Bank, Bankgroup, Places, Transitions, Arcs, TimingConstraints, WARNING},
keywordstyle=[2]\color{darkgreen}\bfseries,
identifierstyle=\color{black},
sensitive=false,
comment=[l]{//},
morecomment=[s]{/*}{*/},
commentstyle=\color{gray},
stringstyle=\color{red},
morestring=[b]',
morestring=[b]"
}
\makeatletter
\let\ps@plain\ps@empty
\def\@xivpt{14bp}
\setcounter{secnumdepth}{4}
\columnsep5mm
\def\@sect#1#2#3#4#5#6[#7]#8{%
\ifnum #2>\c@secnumdepth
\let\@svsec\@empty
\else
\refstepcounter{#1}%
\protected@edef\@svsec{%
\ifnum #2<4
\hb@xt@10mm{\csname the#1\endcsname}\relax
\else
\hb@xt@12mm{\csname the#1\endcsname}\relax
\fi}%
\fi
\@tempskipa #5\relax
\ifdim \@tempskipa>\z@
\begingroup
#6{%
\@hangfrom{\hskip #3\relax\@svsec}%
\interlinepenalty \@M #8\@@par}%
\endgroup
\csname #1mark\endcsname{#7}%
\addcontentsline{toc}{#1}{%
\ifnum #2>\c@secnumdepth \else
\protect\numberline{\csname the#1\endcsname}%
\fi
#7}%
\else
\def\@svsechd{%
#6{\hskip #3\relax
\@svsec #8}%
\csname #1mark\endcsname{#7}%
\addcontentsline{toc}{#1}{%
\ifnum #2>\c@secnumdepth \else
\protect\numberline{\csname the#1\endcsname}%
\fi
#7}}%
\fi
\@xsect{#5}}
%
\renewcommand\LARGE{\@setfontsize\LARGE{16}{20}}
%\def\abstract#1{\def\@abstract{#1}}
\def\abstractEn#1{\def\@abstractEn{#1}}
\def\titleEn#1{\def\@titleEn{#1}}
%% Def. Titelei
\headheight0bp
\headsep0mm
\topskip10bp
\topmargin18mm
\textwidth170mm
\textheight60\baselineskip
\def\@maketitle{%
\newpage
\null
\let \footnote \thanks
%{\LARGE\bfseries\RaggedRight \@title \par}%
{\LARGE\bfseries\RaggedRight \@titleEn \par}%
\vskip 1\baselineskip%
{\normalsize
% \lineskip 1ex%
\@author\par}%
\vskip 2\baselineskip%
%{\section*{Kurzfassung}
% \@abstract}%
%\vskip \baselineskip%
{\section*{Abstract}
\@abstractEn}%
\par
\vskip 3\baselineskip}
\renewcommand\section{\@startsection {section}{1}{\z@}%
{-3.5ex \@plus -1ex \@minus -.2ex}%
{\baselineskip}%
{\normalfont\Large\bfseries\RaggedRight}}
\renewcommand\subsection{\@startsection{subsection}{2}{\z@}%
{\baselineskip}%
{1ex}%
{\normalfont\large\bfseries\RaggedRight}}
\renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}%
{1\baselineskip}%
{3bp}%
{\normalfont\normalsize\bfseries\RaggedRight}}
\renewcommand\paragraph{\@startsection{paragraph}{4}{\z@}%
{1\baselineskip\@plus1ex \@minus.2ex}%
{3bp}%
{\normalfont\normalsize\RaggedRight}}
\renewcommand\subparagraph{\@startsection{subparagraph}{5}{\parindent}%
{3.25ex \@plus1ex \@minus .2ex}%
{-1em}%
{\normalfont\normalsize\bfseries\RaggedRight}}
\affilsep0pt
\parindent\p@
\makeatother
\bibsep3bp
\raggedbottom
\DeclareCaptionLabelSeparator{enskip}{\enskip}
\captionsetup{labelsep=enskip,justification=RaggedRight,labelfont=bf,skip=10pt}
\renewcommand\bibsection{\section{Bibliography}}
%\title{Beitragstitel (16 pt fett)}
\titleEn{Exploration of DDR5 with the Open-Source Simulator DRAMSys}
\author{M.Sc. Lukas Steiner$\,^1$,
Dr.-Ing. Matthias Jung$\,^2$,
Prof. Dr.-Ing. Norbert Wehn$\,^1$
}
\affil{%
1: Technische Universität Kaiserslautern, Germany \{lsteiner, wehn\}@eit.uni-kl.de\\
2: Fraunhofer IESE, Kaiserslautern, Germany, matthias.jung@iese.fraunhofer.de
}
\abstractEn{%
Over the last five decades, we have seen a continuous evolution in DRAM technology, always targeting lower cost per bit, higher device capacity, higher bandwidth, and lower power consumption. The most recent DRAM standard released by JEDEC in mid 2020 is DDR5. It exhibits several new features, for example two channels on a single DIMM, same-bank refresh, and data rates up to 8400\,MT/s. As a result, DDR5 greatly enlarges the DRAM device options, while the selection of a suitable device heavily depends on the application. In this paper, we investigate the performance of the new DDR5 standard in depth, compare it to its predecessor DDR4, and derive key observations that help selecting a suitable DRAM configuration. We present a new DDR5 simulation model based on the open-source simulator DRAMSys. This is to the best of our knowledge the first DDR5 simulation model. The model is the base for all of our investigations.
% Extremfälle für Anwendung
% seine Stärken ausspielen
% zwei grundverschiedene Benchmarks
% In recent years, more and more DRAM standards have been specified, which in parts differ significantly from previous ones. The latest standard released by JEDEC in mid 2020 is DDR5.
% It poses new features, for example two channels on a DIMM, same bank refresh and data rates up to 8400 MT/s. For system designers, DDR5 greatly enlarges the total DRAM design space, where certain design choices are not trivial depending on the application that uses the memory system.
% In this paper, we present for the first time a new DDR5 simulation model based on the open-source simulator DRAMSys. This simulation model can help designers to efficiently explore the DRAM design space and to find the best fitting DRAM solution for their application, and moreover, to find suitable optimizations for the memory subsystem. Furthermore, we show a first quantitative comparison between DDR4 and DDR5 for two typical example benchmarks.
%
}
% TODOs: make DRAMSys DRAMSys4.0 consistent
\hyphenation{DRAMSys} % prevent DRAMSys to be hyphened
\begin{document}
\maketitle
\section{Introduction}
Currently, we see a strong shift to memory-dominated applications. Thus, \textit{Dynamic Random Access Memories} (DRAMs) play a large role in compute platforms. Over the last years, the number of DRAM standards specified by the \textit{JEDEC Solid State Technology Association} has been growing rapidly. The most recent DRAM standard is DDR5, which was released in mid 2020. Because of the large amount of new features, system designers are either challenged to adopt the new standard or they can move on with well-established standards like DDR4. If DDR5 is a potential candidate for a specific application, a further challenge is the configuration of the DDR5 subsystem, which features a lot of parameter choices. Fast and accurate simulation models are mandatory to explore the new features and compare different configurations.
For our investigations we use the design space exploration framework DRAMSys, which relies on a \textit{Domain Specific Language} (DSL) to specify the DRAM's architecture, states, and timing behavior in a compressed and comprehensive fashion. This DSL, called \textit{DRAMml}~\cite{junkra_19}, is based on a Petri Net semantic~\cite{pet_62} and allows correct-by-construction source code generation for DRAMSys. To achieve a high simulation speed and high accuracy at the same time, the framework makes use of SystemC \textit{Transaction Level Modeling} (TLM) and only simulates relevant state changes instead of individual signal changes in each clock cycle. In this way, the overall number of events is reduced drastically and performance results can be generated several orders of magnitude faster compared to an RTL simulation while maintaining the same accuracy.
By using DRAMml for the source code generation, we were able to create a full-featured DDR5 simulation model from scratch within two weeks. This shows the applicability of the DRAMml methodology within the DRAMSys framework for the adoption of new DRAM standards.
In this paper, we compare the DDR5 standard to its predecessor DDR4 and investigate advantages and disadvantages for specific applications. Furthermore, we perform an in-depth analysis of different DDR5 configurations.
%
\begin{figure*}[t]
\includegraphics[width=\textwidth]{dram.pdf}
\caption{DRAM Architecture}
\label{fig:dram}
\end{figure*}
%
In summary, the paper presents three new contributions:
\begin{itemize}
\item We present the first DDR5 simulation model, which is integrated in the DRAMSys framework.
\item We provide an in-depth comparison between DDR5 and DDR4.
\item We investigate the performance of different DDR5 configurations.
\end{itemize}
%qual., quant.
The paper is structured as follows: Section 2 gives some background on DRAM technology and the new main features of the DDR5 standard. In Section 3 we describe the DRAM design space exploration framework DRAMSys. The integration of the DDR5 model is described in Section 4. Experiments and key observations are presented in Section 5. In Section 6 the related work is reviewed and the paper is concluded by Section 7.
%
%
\section{DRAM Background and DDR5}
%
In this section we introduce the basic terminology of DRAM devices and their controllers and give an overview on the new features of the DDR5 standard.
%
\newcommand{\rot}[0]{22.4}
\begin{figure}[t]
\resizebox{\columnwidth}{!}{% <------ Don't forget this %
\begin{tikzpicture}
\begin{loglogaxis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
height = 14cm,
grid=both,
major grid style={black!50},
width = \textwidth,
height = 0.9\textheight,
xmin = 600,
xmax = 13000
]
% HBM
\addplot[Orange, line width=3.1mm, line cap=round] coordinates { (2*500, 16.00) (2*1800, 57.6) }
node[below, pos=0.5, rotate=\rot] {1 Channel $\times$128};
\addplot[Orange, line width=3.1mm, line cap=round] coordinates { (2*500, 128.00) (2*1800, 460.8) }
node[above, pos=0.5, rotate=\rot] {8 Channels $\times$1024};
\addplot[Orange, line width=3.1mm, line cap=round] coordinates { (2*500, 512.00) (2*1800, 1843.2) }
node[above, pos=0.5, rotate=\rot] {4 Stacks with 8 Channels $\times$4096};
% DDR3
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 0.800/2) (2*1066, 2.1300/2) }
node[above, pos=0.5, rotate=\rot] {Device $\times$4};
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 0.800) (2*1066, 2.1300) }
node[above, pos=0.5, rotate=\rot] {Device $\times$8};
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 1.600) (2*1066, 4.2600) }
node[above, pos=0.5, rotate=\rot] {Device $\times$16};
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 6.400) (2*1066, 17.060) }
node[above, pos=0.5, rotate=\rot] {DIMM $\times$64};
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 6.400*2) (2*1066, 17.060*2) }
node[above, yshift=+2pt, pos=0.5, rotate=\rot] {2 Channels $\times$128};
\addplot[Gray, line width=1.8mm, line cap=round] coordinates { (2*400, 6.400*4) (2*1066, 17.060*4) }
node[above, pos=0.5, rotate=\rot] {4 Channels $\times$256};
%
%
% DDR5
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 1.600) (2*4200, 4.200) }
node[above, pos=0.5, rotate=\rot] {Device $\times$4};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 3.200) (2*4200, 8.400) }
node[above, pos=0.5, rotate=\rot] {Device $\times$8};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 6.400) (2*4200, 16.80) }
node[above, pos=0.5, rotate=\rot] {Device $\times$16};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 12.80) (2*4200, 33.60) }
node[above, pos=0.5, rotate=\rot] {1 Channel $\times$32};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 25.60) (2*4200, 67.20) }
node[above, pos=0.7, rotate=\rot] {2 Channels on 1 DIMM $\times$64};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 51.20) (2*4200, 134.4) }
node[above, pos=0.7, rotate=\rot] {4 Channels on 2 DIMMs $\times$128};
\addplot[MidnightBlue, line width=2mm, line cap=round] coordinates { (2*1600, 102.4) (2*4200, 268.8) }
node[above, pos=0.7, rotate=\rot] {8 Channels on 4 DIMMs $\times$256};
%
% DDR4
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 0.800) (2*2100, 2.1000) }
node[above, pos=0.5, rotate=\rot] {Device $\times$4};
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 1.600) (2*2100, 4.2000) }
node[above, pos=0.5, rotate=\rot] {Device $\times$8};
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 3.200) (2*2100, 8.4000) }
node[above, pos=0.5, rotate=\rot] {Device $\times$16};
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 12.80) (2*2100, 33.600) }
node[above, pos=0.5, rotate=\rot] {DIMM $\times$64};
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 51.20/2) (2*2100, 134.40/2) }
node[above, yshift=+4pt, pos=0.5, rotate=\rot] {2 Channels $\times$128};
\addplot[BrickRed, ultra thick, line cap=round] coordinates { (2*800, 51.20) (2*2100, 134.40) }
node[above, pos=0.5, rotate=\rot] {4 Channels $\times$256};
% LPDDR5
\addplot[Green, line width=1.2mm, line cap=round] coordinates { (2*2750, 11.00) (2*3200, 12.80) }
node[below, pos=0.5, rotate=\rot] {Device $\times$16};
\addplot[Green, line width=1.2mm, line cap=round] coordinates { (2*2750, 22.00) (2*3200, 25.60) }
node[below, pos=0.5, rotate=\rot] {2 Devices $\times$32};
\addplot[Green, line width=1.2mm, line cap=round] coordinates { (2*2750, 44.00) (2*3200, 51.20) }
node[below, pos=0.5, rotate=\rot] {4 Devices $\times$64};
\addplot[Green, line width=1.2mm, line cap=round] coordinates { (2*2750, 88.00) (2*3200, 102.4) }
node[below, pos=0.5, rotate=\rot] {8 Devices $\times$128};
\legend{HBM2E,,,DDR3,,,,,,DDR5,,,,,,,DDR4,,,,,,LPDDR5,,,,,,}
\end{loglogaxis}
\end{tikzpicture}%
}
\caption{Bandwidth Evolution of DRAM Standards}
\label{fig:evolution}
\end{figure}
\subsection{DRAM Basics}\label{sec:dram:basics}
%\todo{(Physical) RANKs, Channels, DIMMs etc. sollte noch dazu und Bankgruppen wegen Same-Bank Refresh, Prefetch erklärt?}
As shown in Figure~\ref{fig:dram}, DRAM can be organized in a multi-hierarchical fashion of \textit{DIMMs}, \textit{channels}, \textit{physical ranks}, \textit{devices}, \textit{logical ranks}, \textit{bank groups}, \textit{banks}, \textit{memory arrays}, \textit{sub arrays}, \textit{rows}, and \textit{columns}.
%
Several DRAM \textit{channels} can be connected to the \textit{Multi-Processor System on Chip}~(MPSoC). These channels are completely independent of each other and have separate command/address and data buses. A channel can be composed of one or multiple \textit{physical ranks}, which are sharing the data and command/address bus. A \textit{Dual Inline Memory Module}~(DIMM) is a small PCB that accommodates several DRAM devices, which work completely synchronously.
%
One single device is called $\times16$\footnote{pronunciation: \textit{by-sixteen}} if it has an I/O data width of 16 bit. A DIMM is assembled for instance out of four $\times16$ devices in order to have a total I/O data width $n=64~\mbox{bit}$ (called $\times64$). While the I/O data width is usually very limited, inside the DRAM a lot of data can be fetched or stored in parallel. However, the time between consecutive internal data accesses is very long due to the optimization for storage density, while the interface can be operated on much higher frequencies. To align this mismatch DRAM uses a so-called \textit{prefetching} technique. For a read large chunks of data are concurrently fetched to the interface and then transferred in one burst to the requester, for a write the process is reversed. In addition, data is transferred at the doubled interface frequency (\textit{double data rate}, short DDR). Current devices such as DDR4 use an 8n prefetch architecture, where n is the I/O data width, 8 the \textit{Burst Length}~(BL) and 8n the number of bits for an internal data transfer. That means with each DRAM access the total amount of data received or delivered is $BL\cdot n = 8 \cdot 64~\mbox{bit} = 512~\mbox{bit} = 64~\mbox{B}$, which is the usual cache line size in today's computing systems. In combination with interface frequencies up to 1600\,MHz or pin transfer rates up to 3200\,MT/s (megatransfers per second) DDR4 reaches a maximum bandwidth of 25.6\,GB/s per channel.
Each device itself can consist of several 3D-stacked \textit{logical ranks}, which can form several \textit{bank groups} that include several \textit{banks}. The concept of bank groups was introduced with GDDR5 and DDR4 in order to reduce the bank switching times to support a seamless burst behavior at high data rates and therefore a high bandwidth.
%
All banks in a whole channel can be used concurrently (so-called \textit{bank parallelism}). However, there are some constraints due to the shared buses. Each bank usually consists of $2^{12}$ to $2^{18}$ \textit{rows}, and each row can usually store $512\,\mbox{B}$ to $2\,\mbox{KB}$ of data in its \textit{columns}.
A memory controller is composed of a \textit{front end} and a \textit{back end}. The front end performs arbitration and scheduling of incoming read and write requests, whereas the task of the back end is to translate these incoming requests into a sequence of DRAM commands, which have to be orchestrated with respect to the current state of the device. To access data in a row of a certain bank, an \textit{activate} (\texttt{ACT}) command must be issued by the controller before any column access, i.e., \textit{read} (\texttt{RD}) or \textit{write} (\texttt{WR}) commands, can be executed. The \texttt{ACT} command opens an entire row of the memory array, which is transferred into the banks \textit{row buffer}\footnote{The row buffer is a model that abstracts the real physical DRAM architecture. It is basically a combination of \textit{Primary Sense Amplifiers} (PSAs) and \textit{Secondary Sense Amplifiers} (SSAs) of the memory arrays in one bank. This model is useful to describe the functionality of a memory controller and its scheduling algorithms. Unfortunately, it often leads to a misunderstanding of the real DRAM architecture. For further details on the internal DRAM architecture we refer to~\cite{jacng_10}.}.
%
It acts like a small cache that stores the most recently accessed row of the bank. The latency of a memory access to a bank largely varies depending on the state of this row buffer. If a memory access targets the same row as the currently cached row in the buffer (called~\textit{row hit}), it results in a low latency and low energy memory access. Whereas, if a memory access targets a different row as the current row in the buffer (called~\textit{row miss}), it results in a higher latency and energy consumption.
%
If a certain row in a bank is active it must first be \textit{precharged} (\texttt{PRE}) before another row can be activated. In addition to the normal \texttt{RD} and \texttt{WR} commands, there exist read and write commands with an integrated \textit{auto-precharge} (\texttt{RDA}, \texttt{WRA}). If auto-precharge is selected, the row being accessed will be precharged automatically at the end of the read or write access. Because a DRAM cell uses a capacitor with leakage effects for data storage, it usually has to be refreshed every $64\,\mbox{ms}$ to retain the data stored in it. Modern DRAMs are equipped with an \textit{all-bank refresh} (\texttt{REFab}) command to perform this operation automatically on all banks of a rank in parallel. However, a prerequisite is that all banks are in a precharged state. This can be achieved by issuing a special \textit{all-bank precharge} (\texttt{PREab}) command in advance.
In addition to the commands, each DRAM standard defines a set of timing dependencies, which are temporal constraints that must be satisfied between issued commands. For example, between two \texttt{ACT} commands to the same bank the timing dependency $t_{RC}$ (\textit{row cycle time}) must be satisfied. Timing dependencies can also exist on other hierarchies of the DRAM, e.g., between commands to the same bank group, to the same logical/physical rank or to different logical/physical ranks.
%
%Besides the normal active mode operations presented above, a DRAM is capable to enter power-down modes to save energy by setting the clock-enable signal \texttt{cke} to low. There exist three major power-down modes called \textit{Precharge Power-Down} (PDNP), \textit{Active Power-Down} (PDNA) and \textit{Self-Refresh} (SREF).
%%
%
The selection of a DRAM subsystem usually has three main dimensions: bandwidth, latency, and capacity. Bandwidth is the amount of data that can be transferred between DRAM and a computational unit within a given time. As shown in Figure~\ref{fig:evolution}, the maximum theoretical DRAM bandwidth is limited to the number of data pins times the interface pin data rate (number of accesses per time per pin). Latency is the time that it takes to complete an access. In fact, latency helps bandwidth, but not vice versa~\cite{pat_04}. For instance, lower DRAM latency results in more accesses per time, and therefore higher bandwidth, whereas increasing the number of data pins increases the bandwidth without decreasing latency. In realistic scenarios, the full theoretical bandwidth is never reached due to many timing dependencies, interference between different requests, and refresh. The actual achieved bandwidth for a specific application is called \textit{sustainable bandwidth}.
%
\subsection{DDR5 Standard}
\begin{table}[]
\caption{Comparison of DDR4 and DDR5 Key Parameters}
\centering
\resizebox{\columnwidth}{!}{% <------ Don't forget this %
\begin{tabular}{|l|l|l|}
\hline
$~$ & \textbf{DDR4} & \textbf{DDR5} \\ \hline
\textbf{Data Rates [MT/s]} & 1600 - 3200 & 3200 - 8400 \\ \hline
\textbf{Max. Channel BW [GB/s]} & 25.6 & 33.6 \\ \hline
\textbf{Max. DIMM BW [GB/s]} & 25.6 & 67.2 \\ \hline
\textbf{Error Correction} & - & On-Die \\ \hline
\textbf{Device Capacity [Gb]} & 2 - 16 & 8 - 64 \\ \hline
\textbf{Max. Stacked Devices} & 8 & 16 \\ \hline
%\textbf{\todo{DIMM Capacity??}} & ? & ? \\ \hline
\textbf{Channel Width [bit]} & 64 & 32 \\ \hline
\textbf{Channels per DIMM} & 1 & 2 \\ \hline
\textbf{Prefetch} & 8n & 16n \\ \hline
\textbf{Max. Banks} & 16 & 32 \\ \hline
\textbf{Max. Bank Groups} & 4 & 8 \\ \hline
\multirow{2}{*}{\textbf{Refresh Modes}} & \multirow{2}{*}{All-Bank} & All-Bank, \\
& & Same-Bank \\ \hline
\textbf{Burst Length} & 8 & 16, 32 \\ \hline
\textbf{Supply Voltage [V]} & 1.2 & 1.1 \\ \hline
\end{tabular}%
}
\label{tab:comparison}
\end{table}
% tREFI = 3.9us, but tRFC shorter
% higher bandwidth -> higher interface frequency -> increase of internal prefetch, interface is small but fast, internal DRAM is slow but highly parallel
% DDR3 used 8n prefetch, DDR4 introduced concept of bank groups (separate power supplies for each bank group) and kept 8n prefetch because with 16n prefetch each burst would be 128 Bytes, but cache line size only 64 Bytes
% DDR5 increased prefetch to 16n, but data bus is only 32 bit wide -> 2 channels per DIMM
% 8 bank groups instead of 4, up to 32 banks for higher bank parallelism -> hide row misses
% up to 16 logical ranks (stacked devices)
% total device size up to 512 Gb (DDR4 only 128 Gb)
% power supply located on DIMM, new timings: same bank, different bank same bank group, different bank group same logical rank, different logical rank same physical rank, different physical rank same DIMM
% internal ECC
% same bank refresh: refresh only one bank of each group -> other banks can still be accessed -> with increasing device size (more banks and rows) refresh overhead becomes bigger and bigger -> this way refresh can be hidden, no idle phases on data bus
% no per-bank refresh because 32 banks, high command bus utilization
%
\begin{figure*}[t!]
\centering
\includegraphics[width=.8\linewidth]{dramsys_tc.pdf}
\caption{Architecture of DRAMSys}
\label{fig:dramsys}
\end{figure*}
With the development of a new DRAM standard generation there are always several key parameters that should be enhanced, e.g., bandwidth, power consumption, and device capacity. Table~\ref{tab:comparison} shows a comparison between key parameters of the new DDR5 standard and its predecessor DDR4. In the following we will also describe the most important differences in more detail.
For a higher bandwidth DDR5 raises the maximum pin data rate to 8400\,MT/s compared to 3200\,MT/s for DDR4. Because the frequency of internal data accesses stays more or less the same as a result of the capacity- and cost-optimized architecture, the prefetch was incremented from 8n to 16n. When using the same 64-bit-wide data bus for one channel as all previous DDR generations, this would result in 128\,B of transferred data per access. However, since the usual cache line size of modern processors is only 64\,B, the data bus of each DDR5 DIMM is split up into two independent channels of 32\,bit width. That way only 64\,B of data are transferred per access. Theoretical transfer rates then reach a maximum of 33.6\,GB/s per channel and 67.2\,GB/s per DIMM compared to 25.6\,GB/s per channel/DIMM for DDR4, as shown in Figure~\ref{fig:evolution}.
%
At the same time supply voltages are reduced from 1.2\,V to 1.1\,V for an improved power consumption.
The maximum number of banks per device increases from 16 to 32 distributed over 8 instead of 4 bank groups, the total capacity of a single device from 16\,Gb to 64\,Gb. In addition, up to 16 instead of 8 devices can now be stacked in a three-dimensional fashion (logical ranks)\footnote{The initial DDR4 standard did not specify any stacked devices. This feature was first introduced with an addendum in 2017.}. This enables stack capacities of up to 512\,Gb (max. 16 $\times$ 32\,Gb or 8 $\times$ 64\,Gb because of limited address bits). One problem that always arises with higher device capacities is the increased refresh overhead, because each cell still has to be refreshed approximately every 64\,ms. As a consequence, either the controller has to issue refresh commands more frequently or the individual refresh cycles take a longer time. Since all banks of a rank cannot be accessed during an all-bank refresh, it can lead to significant performance drops.
To overcome this problem, DDR5 introduces \textit{same-bank refresh} (REFsb) and associated \textit{same-bank precharge} (PREsb) commands as an alternative to \textit{all-bank refresh} (REFab) and \textit{all-bank precharge} (PREab) commands. When issuing them, only one bank in each bank group of the target rank is refreshed and unaccessible, while all other banks can still process incoming read and write requests. Most modern DRAM controllers use advanced reordering techniques for an improved performance so they can try to hide the same-bank refresh by sending requests to other banks in the meantime.
Finally, DDR5 devices implement an on-die error correction to improve the data integrity.
\section{DRAMSys}
The simulation of DRAMs on system level requires highly accurate models due to their complex timing and latency behavior. However, conventional cycle-accurate DRAM controller models often become the bottleneck for the overall simulation speed~\cite{liver_19}. A promising alternative are DRAM simulation models based on TLM, which can be fast and accurate at the same time.
%
In this paper we use the open-source simulator DRAMSys~\cite{stejun_20}, which is, to the best of our knowledge, the fastest cycle-accurate open-source DRAM simulator and has a large range of functionalities. It features a very high simulation speed while maintaining full temporal accuracy. DRAMSys has a simulator architecture that enables a fast adaptation to new DRAM standards using a DSL. In the following we describe the DRAMSys simulation model, the internal architecture, the DSL-based code generation, and the integration of DDR5.
%
\subsection{The DRAMSys Simulation Model}
%
DRAMSys uses the concept of the SystemC/TLM2.0 IEEE 1666 Standard for a fast and fully cycle-accurate simulation. In accordance with the standard, all components are designed as \textit{SystemC modules} and connected by \textit{TLM sockets}. Each memory request is modelled with a special TLM transaction object, the so-called \textit{generic payload}. It stores all relevant information of the request, e.g., the address, the transfer direction (read or write) or corresponding data, and is passed by reference between the modules. The simulator utilizes the \textit{Approximately Timed} (AT) coding style, which defines a \textit{non-blocking four-phase handshake protocol}. This protocol is required to model the DRAM subsystem's pipelined behavior and out-of-order responses to the initiators. However, since a single memory access can cause the issuance of multiple DRAM commands depending on the device's current state (e.g., \texttt{PRE}, \texttt{ACT}, \texttt{RD/WR} for a row miss), four phases are still not sufficient to model the communication between controller and device with full temporal accuracy. To close this gap, a custom TLM protocol (called DRAM-AT) that defines application-specific phases for all DRAM commands was introduced in~\cite{junwei_13}. These phases allow a projection of the cycle-accurate DRAM protocol to TLM.
The rule of thumb for making cycle-accurate simulations fast is to reduce the number of simulated events and the executed control flow overhead. Therefore, DRAMSys only simulates relevant state changes instead of individual signal changes in each clock cycle. Especially in scenarios where the memory access density is low, this approach can lead to an enormous event reduction and a resulting simulation speedup of several orders of magnitude while still yielding fully cycle-accurate results~\cite{stejun_20}.
%
\begin{figure*}[t!]
\centering
\includegraphics[width=.95\linewidth]{analyzer_DDR5.png}
\caption{Graphical User Interface of the Trace Analyzer}
\label{fig:traceanalyzer}
\end{figure*}
%
\subsection{Architecture}\label{sec:dramsys:arch}
As shown in Figure~\ref{fig:dramsys}, DRAMSys consists of three main components: a shared \textit{arbitration \& mapping unit} (short arbiter), a \textit{channel controller} for each memory channel and a \textit{DRAM device} for each memory channel. The arbiter cross-couples multiple initiators and DRAM channels and translates the address of each request into a corresponding channel, logical/physical rank, bank group, bank, row, and column. This translation is done on the basis of a predefined \textit{address mapping}. Different address mappings can have a large impact on the overall system performance because of the timing dependencies between commands, which not only depend on the commands themselves, but also on the target location they are issued to (see Section~\ref{sec:dram:basics}). Thus, the address mapping should always be chosen with consideration of the application's memory access pattern and the DRAM configuration.
Following the arbiter each memory channel has a channel controller. It is composed of five components that will be explained shortly. The \textit{scheduler} enqueues incoming requests and reorders them on the basis of a specific \textit{scheduling policy} to improve, e.g., bandwidth and latency by avoiding row misses. For each bank in the whole channel there is a separate \textit{bank machine}. It keeps track of the current bank state and makes sure that commands are issued in a valid sequence, e.g., \texttt{PRE}, \texttt{ACT}, \texttt{RD/WR} for a row miss. The issuance of refresh commands to each DRAM rank at the right time is managed by \textit{refresh managers}. To ensure that all timing dependencies between commands are satisfied and DRAMSys behaves compliant to the corresponding JEDEC standard, both bank machines and refresh managers ask the \textit{timing checker} for the earliest point in time a command can be issued. The timing checker stores the relevant command history and all timing dependencies to calculate the right time. For each JEDEC standard there is a separate timing checker that can be instantiated. Finally, each channel controller includes a \textit{command multiplexer} to resolve conflicts between all bank machines and refresh managers, since only one request can be issued at a time due to the shared command/address bus.
As last component each channel controller is connected to a DRAM device. It manages the data storage and in addition can establish a connection to simulation tools for power estimation, thermal modeling or error modeling.
%
\subsection{Trace Analyzer}
To enable better analysis capabilities for the DRAM subsystem performance than the usual outputs to the console or a text file and to deeply investigate the channel controller's scheduling decisions, DRAMSys provides the Trace Analyzer. During a simulation there is a functionality to record all TLM transactions of the channel controller in an SQLite trace database. Afterwards, this database can be evaluated graphically using the Trace Analyzer. The user interface is shown in Figure~\ref{fig:traceanalyzer}. It illustrates a time window of requests, DRAM commands and the utilization of all banks. This can help system designers to understand the subsystem's internal behavior and to find limiting issues. Exploiting the power of SQL, the data aggregation happens very fast and the tool provides a user-friendly handling that offers a quick navigation through the whole trace database, sometimes with millions of requests and associated DRAM commands.
In addition, an evaluation of the traces can be performed with the powerful Python interface of the Trace Analyzer. Different metrics are described as SQL statements and formulas in Python and can be customized or extended without recompiling the tool. Typical metrics are for instance memory utilization (bandwidth), average response latency or number of accesses per activate (row hit rate).
%
\subsection{DRAMml: A DSL for DRAMs}
%
As stated in the introduction, an increasing number of different DRAM standards has been presented by JEDEC in recent years. Since each new standard comes with challenging changes in the DRAM protocol compared to previous ones, the memory simulation models as well as the RTL models must be modified and validated repeatedly. In order to keep pace with these frequent changes and the large variety of standards, a robust and error-free methodology for a fast adaption must be established.
In~\cite{junkra_19} we presented a comprehensive and formal DSL called DRAMml, which is based on a Petri Net~\cite{pet_62} semantic. DRAMml can describe the entire memory functionality of a DRAM standard including all timing dependencies in just a few lines of code. Using the formal description of a corresponding Petri Net, different simulation and validation models can be generated automatically and correct by construction, as shown in Figure~\ref{fig:methodology}. One of the simulation models is also the channel controller's standard-specific timing checker (see Section~\ref{sec:dramsys:arch}). In this way, the error-prone handwritten source code is replaced by source code generated from a high-level description.
%\todo{Furthermore, DRAMSys offers functionalities for embedding an RTL model of a memory controller into the framework and for recording the executed DRAM commands in an output trace database. Using the formal DSL descriptions, a standard-specific executable C++ validation model can be created, which analyses a recorded DRAM command trace for standard compliance. This approach provides fast feedback to an RTL developer if a change in the RTL description led to a protocol violation.}\todo{Das ist doppelt, weil der Teil auch nochmal bei Integration of DDR5 kommt. \textbf{Macht nix.. kann ruhig nochmal gesagt werden...}}
%
\begin{figure}
\centering
\includegraphics[width=0.7\columnwidth]{methology.pdf}
\caption{DRAMml Methodology}
\label{fig:methodology}
\end{figure}
%
%
\subsection{Integration of DDR5}
%
%\textcolor{red}{Major difference between two DRAM standards: maximum operating frequency, timings, data bus width, burst length, timing dependencies between commands; timing values, bus width and burst length are specified in JSON config files that are loaded during runtime -> new config files for DDR5 have to be created from the standard; timing dependencies are modeled in a standard-specific timing checker, timing checker is created from DRAMml, DRAMml is less error prone, dependencies can only be inserted at specific locations, source code is then generated. Functionality of new refresh manager: specific order must be observed, try to refresh idle banks first, DRAMml: 3D-DDR4: slr and dlr, DDR5 in addition dpr, timing dependencies on six hierarchies, two new commands REFsb and PREsb, maybe explain PREab and PREsb in advance, DRAMml code excerpt}
In order to expand DRAMSys by a DDR5 simulation model three steps had to be performed. All architectural parameters (e.g., number of channels/banks/bank groups, data bus width, burst length) as well as temporal parameters (frequency and timing values) of one specific DRAM subsystem configuration are defined in a JSON file and passed to the simulator as an argument. From the new DDR5 JEDEC standard JSON files for various speed grades and device sizes were put together first.
Second, a DRAMml description was created. Since DDR5 introduces same-bank refresh and specifies timing dependencies between commands to (1) the same bank, (2) different banks (same bank group), (3) different bank groups (same logical rank), (4) different logical ranks (same physical rank), (5) different physical ranks (same DIMM) and (6) different DIMMs\footnote{This increase results from the three-dimensional stacking of devices and the move of the power management integrated circuit (PMIC) from the main PCB onto the DIMM.}, the overall length of the description has increased by around 50\,\% compared to the predecessor DDR4. Nevertheless, it still fits on two pages and is very compact when considering the length of the actual JEDEC standard of almost 500 pages. In addition, the complexity has not increased. Because DRAMml fully relies on a Petri Net semantic, it is modular and could be adapted to all protocol changes of DDR5 without adding any language extensions. Listing~\ref{lst:DRAMml} shows an excerpt of the initial DRAMml descriptions of DDR4 and DDR5 for the timing dependencies between two \texttt{ACT} commands. The increase in hierarchies (additional logical and physical ranks) could be directly applied to the code. From this description the new DDR5 timing checker source code for DRAMSys was generated automatically and correct by construction.
\lstset{
language=DRAMml,
extendedchars=true,
basicstyle=\footnotesize\ttfamily,
showstringspaces=false,
showspaces=false,
tabsize=2,
breaklines=true,
showtabs=false
}
\begin{lstlisting}[
caption={Timing Dependencies between \texttt{ACT} Commands},
label={lst:DRAMml},
frame=single]
# DDR4:
ACT -<> ACT (tRC, tRRD_L, tRRD_S, 0);
# DDR5:
ACT -<> ACT (tRC, tRRD_L_slr, tRRD_S_slr, tRRD_dlr, 0, 0);
\end{lstlisting}
The third and final step was to derive a new same-bank refresh manager from the existing all-bank refresh manager. This was the only source code written by hand. Besides the change from \texttt{PREab} and \texttt{REFab} to \texttt{PREsb} and \texttt{REFsb} commands, the new component required some extra logic because \texttt{REFsb} commands have to be issued in a specific order to refresh all banks evenly. All in all, the whole integration process could be performed within two weeks. This time also included the study of the new JEDEC standard as well as testing and debugging. One major benefit of our DRAMml approach is that the major part of the source code can be generated correct by construction, so there are only very limited possibilities for introducing any errors, which keeps the debugging effort low.
%
%In order to expand DRAMSys by a DDR5 model only three steps had to be performed. First, a new refresh manager for the same-bank refresh was developed. Second, a DRAMml description for DDR5 was written on the basis of the JEDEC standard. Third, the source code for the DDR5 timing checker was automatically generated from the DRAMml and added to DRAMSys. Modeling two channels per DIMM could simply be achieved by instantiating two separate channel controllers connected to one shared arbiter device and thus carried no additional integration effort with it.
%Although the number of timing dependencies and thus the lines of code for the timing checker have largely increased for DDR5 compared to previous standards\footnote{DDR5 specifies timing dependencies between commands to the same bank, same bank group, same logical rank, same physical rank and different physical ranks. This results from the three-dimensional stacking of devices and the move of the power management integrated circuit (PMIC) from the main PCB onto the DIMM.}, the whole integration process could still be performed within two weeks. This time also included the study of the new JEDEC standard and the testing and debugging process. One major benefit of the DRAMml approach is that most of the source code can be generated correct by construction, so there are only very limited possibilities for introducing any errors and the debugging effort is low.
% 2 channels -> no problem, simply instantiate 2 memory controllers connected to one arbiter
% power supply located on DIMM??
% timings same bank, different bank same bank group, different bank group same logical rank, different logical rank same physical rank, different physical rank same DIMM
% "automatic" generation of timing checker using DRAMPetri -> less error prone, lots of timing dependencies!!! (e.g. ACT -> ACT has ... dependencies), still fast because no loops (comparison simulation speed DDR4 and DDR5?)
% same bank refresh -> new refresh manager
\section{Experiments and Observations}
In this section we analyze the bandwidth and latency behavior of different DDR5 device configurations with respect to two common application characteristics and compare it to DDR4. We will describe the experimental setup, the results and conclude with several key observations.
%
\subsection{Experimental Setup}
To evaluate the performance we use two characteristic benchmarks: (1) accesses with linearly-increasing addresses and (2) accesses with a random address distribution over the whole address space. These two benchmarks are the extreme borderline cases of all possible applications. Therefore, they are well suited to show the different limits, e.g., for server applications with many cores (random) and data-flow-oriented tasks (linear).
In both benchmarks the number of read requests is twice as high as the number of write requests. The total number of requests is set to 100,000 to omit short-time effects and to obtain results in a reasonable amount of time. For the linear benchmark a burst of 16 read requests is followed by a burst of 8 write requests, for the random benchmark read-write switches are also done randomly. The frequency in which requests are issued to the DRAM subsystem can be chosen arbitrarily and depends on the test case.
For DDR5 we use DIMMs with 8\,Gb (1\,Gb\,$\times$8) devices containing 16 banks and with 16\,Gb (2\,Gb\,$\times$8) devices containing 32 banks to find out the impact of the bank increase. Each channel is then composed of 4 devices and the whole DIMM of 8 devices. In addition to single-rank DIMMs (SR) we also use dual-rank DIMMs (DR) with 16 devices. Speed grades are varied from the slowest one (DDR5-3200) up to the fastest one (DDR5-6400) finally specified by JEDEC\footnote{Speed grades up to DDR5-8400 are planned but not yet finally specified.}. For comparison we use a DDR4 DIMM with 16\,Gb (2\,Gb\,$\times$8) devices containing 16 banks. The channel/DIMM is then composed of 8 devices. As for DDR5 we also investigate the behavior of a dual-rank DIMM with 16 devices. Speed grades are again varied from the slowest one (DDR4-1600) up to the fastest one specified in the standard (DDR4-3200). All-bank refresh is selected by default for both standards. In the case of DDR5 we always show the bandwidth sum of both channels, however, the number of data pins, which usually limits the amount of channels, is identical for DDR5 and DDR4.
All channel controllers implement separate queues for reads and writes with a queue depth of 32 each. Smaller and larger queue depths (16 and 64) were also investigated for completeness but only caused small shifts in all results, the general observations were the same. Thus, they are not presented in this work. As scheduling policy for incoming requests \textit{First-Ready First Come First Served} (FR-FCFS)~\cite{rixdal_00} is applied. The address mapping is chosen to yield the highest performance for the linear benchmark using techniques like channel and bank group interleaving. This means subsequent requests are mapped to different channels or bank groups and row or bank switching penalties are minimized. For the random benchmark the address mapping does not influence the performance at all because toggling rates for all address bits are uniformly distributed. Finally, for the linear benchmark a row is kept open after a read or write access because row hits are likely to happen (\textit{open-page policy}), while for the random benchmark a row is automatically precharged after each read or write access because row hits practically never happen (\textit{closed-page policy}).
%
\subsection{Experimental Results}
%
In the following we present the experimental results for different DDR5 device configurations, for the comparison of DDR5 and DDR4 and for the different DDR5 refresh modes. All results have been analyzed and verified with the Trace Analyzer.
%
\vspace{-5pt}
\subsubsection{Different DDR5 Configurations}
%
First, we investigate the bandwidth of all DDR5 configurations both for the linear and the random benchmark. The requests are issued as fast as possible such that either the read or the write queue is always filled up completely. The corresponding results are presented in Figure~\ref{fig:bw:linear_DDR5} and \ref{fig:bw:random_DDR5}. What stands out most obviously is that for the linear benchmark all four configurations come much closer to the maximum bandwidth as for the random benchmark. This is because the address mapping is optimized for the linear case to reduce switching penalties. Only the refresh and read-write switches cannot be omitted and constantly decrease the bandwidth. Hence, also the bank increase from 16 (8\,Gb device) to 32 (16\,Gb device) and the move from single rank to dual rank yield rather small improvements (on average 2.7\,\% and 11.4\,\%), the 8\,Gb dual rank configuration even outperforms the 16\,Gb dual rank configuration due to shorter refresh cycles. For increasing data rates the slope slightly drops because refresh cycles take a fixed amount of time and thus more clock cycles. The waviness of some graphs is caused by timings that limit the maximum bandwidth and do not take a fixed amount of time, but a fixed amount of clock cycles. This behavior can also be observed in Figure~\ref{fig:bw:linear_DDR4_DDR5} and \ref{fig:bw:random_DDR4_DDR5} and in Figure~\ref{fig:refresh}.
However, the random benchmark results in Figure~\ref{fig:bw:random_DDR5} look a lot different. While at low data rates all configurations are still in a close range, they largely drift apart when the data rates become higher. Then the gap between 16 and 32 banks and the move from single to dual rank is enormous (up to 41.8\,\% and up to 63.4\,\% bandwidth increase, on average 26.4\,\% and 42.4\,\%). When we consider that an 8\,Gb dual-rank configuration also provides in total 32 banks per channel it clearly shows that for high data rates and random accesses DDR5 requires at least the parallelism of 32 banks to keep the bandwidth up. To find a simple explanation for this behavior we have to recall several things into our minds. For the chosen configuration random requests always translate into a full row cycle consisting of an \texttt{ACT} and a \texttt{RDA/WRA}. These row cycles require the fixed minimal amount of time $t_{RC}$ (minimum time between \texttt{ACT} commands to the same bank, see Section~\ref{sec:dram:basics}). With increasing data rates the absolute time $t_{RC}$ stays more or less constant and therefore the corresponding number of clock cycles increases (around 74 for DDR5-3200 and 149 for DDR5-6400). But at the same time the number of clock cycles each request occupies the data bus stays constant (for DDR5 8 clock cycles at burst length 16). To successfully align this mismatch, hide the delays, and achieve a high data bus utilization and high sustainable bandwidth, a certain amount of bank parallelism is required to do row cycles in parallel on all banks. Simply put, with 149 clock cycles for one row cycle and 16 banks we can issue a new request every $\frac{149}{16}=9.3$ clock cycles, but each request only takes 8 on the bus. This means even for an ideal command placement, a perfectly uniform distribution of requests over all banks and without refresh we will never reach the theoretical bandwidth limit. However, with 32 banks we can issue a new request every $\frac{149}{32}=4.7$ clock cycles, which is fast enough even for the highest data rates of DDR5.
%
\vspace{-5pt}
\subsubsection{DDR4 versus DDR5}
%
Next up, we compare the bandwidth of DDR5 to its predecessor DDR4 for single- and dual-rank configurations with 16\,Gb devices (16 banks per rank for DDR4, 32 banks per rank for DDR5). The corresponding results are provided in Figure~\ref{fig:bw:linear_DDR4_DDR5} and \ref{fig:bw:random_DDR4_DDR5}. Each figure starts on the left side with DDR4-1600, the results for DDR5 start at a data rate of 3200\,MT/s. DDR4 finishes at this data rate but is additionally extrapolated to DDR4-4400. This extrapolation is done because by this time there are lots of overclocked DDR4 devices on the market with data rates up to 4400\,MT/s, so in reality the data rate limits specified in the JEDEC standard are not valid any more. For the linear benchmark the bandwidths of DDR4 and DDR5 come very close to each other at data rates where they overlap (average increase of 7.5\,\% with SR and 8.2\,\% with DR from DDR4 to DDR5). The slightly better performance of DDR5 can be explained when we look on the number of read-write switches for each standard. While the total number of requests issued to each subsystem is identical, the DDR5 DIMM is composed of two channels, both of which receive only half of the requests. Considering that a read-write switch on each channel is always done after a fixed amount of requests (usually when the write buffer is full and the read buffer empty or the other way round), the number of switches on each of them is divided in half. However, the total runtime is not divided in half but almost identical for DDR4 and DDR5 because of the doubled burst length of DDR5. To sum it up, when choosing the memory for a platform where transfer rates lower than 4400\,MT/s are targeted and the application's memory access scheme is rather linear, an upgrade from DDR4 to DDR5 is not worth the effort especially because prices for DDR5 at release are expected to be higher.
For the random benchmark the results in Figure~\ref{fig:bw:random_DDR4_DDR5} tell a completely different story. DDR4 with a single rank cannot keep up the bandwidth at all for increasing data rates, the dual-rank configuration performs better but also flattens for higher data rates. Their gap to DDR5 is large in both cases (approximately up to 130\,\% and 57\,\%, on average 107\,\% and 42\,\%). The problem is the same as for the 8\,Gb DDR5 device mentioned previously. In the case of DDR4 a single data transfer occupies the data bus only for 4 clock cycles and the device only has 16 banks. With two ranks the number of banks increases to 32, however the burst length is still to small. As a result, the data bus turns idle because row cycle delays cannot be hidden completely. With 32 or even 64 banks and a burst length of 16, both DDR5 configurations perform a lot better over the whole range. This shows that DDR5 is especially optimized for random workloads and is clearly ahead of DDR4 even at low data rates. %To sum it up, in the random case DDR5 with its 32 banks is clearly ahead of DDR4 even at the lowest data rates.
To compare both standards also from a different point of view, we fix the data rate to 3200\,MT/s and vary the frequency requests are issued to the DRAM subsystem. Starting at 25\,MHz and increasing in steps of 25\,MHz we then plot the average response latency of all requests (average latency from entry of requests until data has been transferred over the bus) over the bandwidth. These results are presented in Figure~\ref{fig:lat_bw}. For the linear case (Figures~\ref{fig:lat_bw:linear_16Gb_SR} and \ref{fig:lat_bw:linear_16Gb_DR}) DDR4 and DDR5 show a very similar behavior. In the beginning the bandwidth increases directly proportional to the frequency increase while the latency stays almost constant (around 40\,ns). Each request is immediately serviced by the DRAM in this region and the latency is mainly composed of the read or write latency (latency from issuing a \texttt{RD} or \texttt{WR} command until the data transfer starts) and the data transfer over the bus. At higher input frequencies multiple requests are in the DRAM subsystem in parallel and interfere with each other. The controller starts reordering requests to keep the bandwidth up but to the disadvantage of the response latency, which increases exponentially. At an input frequency around 350\,MHz the maximum sustainable bandwidth ($\sim$\,22\,GB/s) is reached and both bandwidth and response latency remain unchanged. These values come close to the maximum theoretical bandwidth of 25.6\,GB/s at an input frequency of 400\,MHz. DDR5 has a worse average response latency at this point mainly due to the doubled burst length ($\sim$\,270\,ns vs. 150\,ns), however, real systems would never be operated at this point anyway. Advanced memory controllers rather take a slight loss in bandwidth with the advantage of a much lower average response latency, which is a lot more beneficial for the overall system performance. The memory controller configuration we used for our experiments is only optimized for bandwidth and therefore also allows these operating points.
When both standards are compared for the random benchmark on the basis of these facts (Figures~\ref{fig:lat_bw:random_16Gb_SR} and \ref{fig:lat_bw:random_16Gb_DR}), the benefits of DDR5 are large again. At the same speed grade DDR5 can handle random request streams with much higher frequencies and the average response latency rises much slower. For the single-rank configuration DDR4 exceeds an average response latency of 100\,ns already at an input frequency of 125\,MHz, DDR5 reaches this latency at input frequencies between 200 and 225\,MHz. As a consequence, there is also a large gap between the maximum sustainable bandwidths of both standards ($\sim$\,11\,GB/s for DDR4 vs. 21\,GB/s for DDR5). In the case of two ranks the gap becomes smaller, but DDR5 still offers a big advantage over DDR4 both with respect to average response latency and bandwidth. These results confirm the optimization of DDR5 for random input patterns once more and show the advantage of the architectural changes compared to DDR4 (higher bank count, higher burst length, two channels per DIMM).
%
\vspace{-5pt}
\subsubsection{DDR5 Refresh Modes}
%
Finally, we investigate the performance of the new same-bank refresh in comparison to all-bank refresh. Again, this is done with 16\,Gb DDR5 devices for both benchmarks, for single- and dual-rank DIMMs and for all available DDR5 speed grades. The corresponding bandwidth results are shown in Figure~\ref{fig:refresh}. For reference, also the maximum theoretical bandwidth and the sustainable bandwidth without any refresh is shown. In the linear case (Figures~\ref{fig:refresh:linear_16Gb_SR} and \ref{fig:refresh:linear_16Gb_DR}) same-bank refresh clearly leads to a higher bandwidth compared to all-bank refresh over the whole range of data rates and for both DIMM configurations (on average 7.9\,\% for single rank and 9.1\,\% for dual rank). Also, the gap to the bandwidth completely without refresh is very little (in some cases less than 1\,\%). In the case of a single rank, all-bank refresh causes a large drop in performance because the data bus turns idle during the whole refresh cycle. With two ranks, refresh commands are staggered, i.e., there is an offset between them and only one rank is refreshed at a time. The other rank can still process requests in the meanwhile, but only as long as requests for it are available in the queues. Since the selected address mapping makes sure that work is equally distributed over both ranks for an increased average performance, at some point the queues are completely filled with requests for the blocked rank and the data bus also turns idle. In addition, refresh commands have to be issued twice as often compared to a single rank. An increased queue depth could solve this issue, but at the same time would require more hardware resources and would make the scheduling more complex. In contrast, during a same-bank refresh cycle only every fourth (SR) or every eighth (DR) bank is unaccessible and each refresh cycle is also shorter. In that case the blocking on individual banks can easily be compensated by the remaining banks and the data bus does not turn idle.
For the random benchmark the results with two ranks (Figure~\ref{fig:refresh:random_16Gb_DR}) are more or less identical (on average 10.0\,\% bandwidth increase), but with only a single rank (Figure~\ref{fig:refresh:random_16Gb_SR}) same-bank refresh does not bring an advantage (on average less than 1\,\% bandwidth increase). The reason for this behavior is already well-known and also occurred in the previous results: not enough bank parallelism. Although a single 16\,Gb DDR5 device already offers 32 banks, the blocking times of refresh cycles still noticeably lower the performance. In this case the refresh mode makes no difference because the overall bank blocking times induced by both modes are nearly identical. With all-bank refresh there are less but longer idle phases on the data bus, with same-bank refresh more but shorter ones. Only with two ranks and 64 banks the bank parallelism is large enough to compensate the refresh penalty and the performance depends on the queue depth as for the linear case. As a conclusion, DDR5's new same-bank refresh strategy performs better or at least just as good as the all-bank refresh in all tested configurations and should therefore always be preferred.
% trotz besserer Verteilung Datenbus nicht voll, Reduzierung des Bank Parallelism zu stark
% Bank Parallelism zwar hoch, aber kann Refresh Penalty nicht vollständig kompensieren
% IMPORTANCE OF TA!!!!!
%%%%%%%%%% ...
% stream and random
% different read:write ratios
% different frequencies
% 1/2 ranks, 16/32 banks for DDR5
% performance gain same bank refresh
% performance gain DDR3 -> DDR4 not so large (cite)
% refresh overhead becoming more and more problematic with increasing memory capacity
%
%
% Refresh is staggered! That way it can be hidden!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Old Figures
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{sidewaysfigure*}[p]
% \centering
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% xmax = 2*3400,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600,204.8/8)
% (2*1800,230.4/8)
% (2*2000,256 /8)
% (2*2200,281.6/8)
% (2*2400,307.2/8)
% (2*2600,332.8/8)
% (2*2800,358.4/8)
% (2*3000,384 /8)
% (2*3200,409.6/8)
% };
% % DDR5 8Gb SR
% \addplot[Purple, thick, line cap=round, smooth] coordinates {
% (2*1600,178.44/8)
% (2*1800,197.16/8)
% (2*2000,219.04/8)
% (2*2200,237.38/8)
% (2*2400,257.80/8)
% (2*2600,276.02/8)
% (2*2800,295.48/8)
% (2*3000,315.32/8)
% (2*3200,331.86/8)
% };
% % DDR5 16Gb SR
% \addplot[Magenta, thick, line cap=round, smooth] coordinates {
% (2*1600,180.3 /8)
% (2*1800,200.5 /8)
% (2*2000,222.76/8)
% (2*2200,242.74/8)
% (2*2400,264.28/8)
% (2*2600,285.1 /8)
% (2*2800,306.88/8)
% (2*3000,327.4 /8)
% (2*3200,346.16/8)
% };
% % DDR5 8Gb DR
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,197.52/8)
% (2*1800,215.02/8)
% (2*2000,243.36/8)
% (2*2200,264.08/8)
% (2*2400,288.86/8)
% (2*2600,307.5 /8)
% (2*2800,331.7 /8)
% (2*3000,353.92/8)
% (2*3200,374.92/8)
% };
% %
% \addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
% (2*1600,186.46/8)
% (2*1800,214.44/8)
% (2*2000,239.36/8)
% (2*2200,245.44/8)
% (2*2400,272.26/8)
% (2*2600,301.7 /8)
% (2*2800,315 /8)
% (2*3000,335.24/8)
% (2*3200,364.18/8)
% };
% \legend{
% Max,%
% DDR5 8\,Gb SR,%
% DDR5 16\,Gb SR,%
% DDR5 8\,Gb DR,%
% DDR5 16\,Gb DR%
% }
% \end{axis}
% \end{tikzpicture}
% \caption{DDR5, Linear}
% \label{fig:bw:linear_DDR5}
% \end{subfigure}
% %%%
% \hfill
% %%%
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% xmax = 2*3400,
% legend style={at={(0.05,0.70)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600, 204.8/8)
% (2*1800, 230.4/8)
% (2*2000, 256 /8)
% (2*2200, 281.6/8)
% (2*2400, 307.2/8)
% (2*2600, 332.8/8)
% (2*2800, 358.4/8)
% (2*3000, 384 /8)
% (2*3200, 409.6/8)
% };
% % DDR5 8Gb SR
% \addplot[Purple, thick, line cap=round, smooth] coordinates {
% (2*1600,154.2 /8)
% (2*1800,159.76/8)
% (2*2000,169.26/8)
% (2*2200,169.6 /8)
% (2*2400,176.2 /8)
% (2*2600,173.18/8)
% (2*2800,179.3 /8)
% (2*3000,184 /8)
% (2*3200,183.56/8)
% };
% %
% \addplot[Magenta, thick, line cap=round, smooth] coordinates {
% (2*1600,165.18/8)
% (2*1800,178.04/8)
% (2*2000,196.66/8)
% (2*2200,208.58/8)
% (2*2400,224.62/8)
% (2*2600,234.14/8)
% (2*2800,244.94/8)
% (2*3000,256.16/8)
% (2*3200,260.22/8)
% };
% %
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,179.98/8)
% (2*1800,198 /8)
% (2*2000,219 /8)
% (2*2200,233.72/8)
% (2*2400,251.58/8)
% (2*2600,264.92/8)
% (2*2800,279.62/8)
% (2*3000,292.44/8)
% (2*3200,299.98/8)
% };
% %
% \addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
% (2*1600,180.6 /8)
% (2*1800,199.1 /8)
% (2*2000,219.08/8)
% (2*2200,235.34/8)
% (2*2400,256.92/8)
% (2*2600,271.6 /8)
% (2*2800,291.44/8)
% (2*3000,310 /8)
% (2*3200,325.9 /8)
% };
% \legend{
% Max,
% DDR5 8\,Gb SR,
% DDR5 16\,Gb SR,
% DDR5 8\,Gb DR,
% DDR5 16\,Gb DR
% }
% \end{axis}
% \end{tikzpicture}
% \caption{DDR5, Random}
% \label{fig:bw:random_DDR5}
% \end{subfigure}
% %\caption{Sustainable Bandwidth for Different DDR5 Configurations, SR: Single Rank, DR: Dual Rank}
% \par\bigskip
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*400,
% ymin = 5,
% xmax = 2*3400,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*800 , 102.4 /8)
% (2*933 , 119.424/8)
% (2*1066.5, 136.512/8)
% (2*1200 , 153.6 /8)
% (2*1333 , 170.624/8)
% (2*1466.5, 187.712/8)
% (2*1600 , 204.8 /8)
% (2*1800 , 230.4 /8)
% (2*2000 , 256 /8)
% (2*2200 , 281.6 /8)
% (2*2400 , 307.2 /8)
% (2*2600 , 332.8 /8)
% (2*2800 , 358.4 /8)
% (2*3000 , 384 /8)
% (2*3200 , 409.6 /8)
% };
% % DDR4 SR
% \addplot[BrickRed, thick, line cap=round, smooth] coordinates {
% (2*800 , 88.26 /8)
% (2*933 , 101.31/8)
% (2*1066.5 , 114.62/8)
% (2*1200 , 128.43/8)
% (2*1333 , 140.64/8)
% (2*1466.5 , 153.37/8)
% (2*1600 , 166.5 /8)
% };
% \addplot[BrickRed, thick, line cap=round, smooth, dashed] coordinates {
% (2*1600 , 166.5/8)
% (2*1800 , 187 /8)
% (2*2000 , 207 /8)
% (2*2200 , 227 /8)
% };
% % DDR4 DR
% \addplot[Orange, thick, line cap=round, smooth] coordinates {
% (2*800 , 91.4 /8)
% (2*933 , 106.43/8)
% (2*1066.5, 116.63/8)
% (2*1200 , 132.34/8)
% (2*1333 , 147.23/8)
% (2*1466.5, 160.98/8)
% (2*1600 , 174.02/8)
% };
% \addplot[Orange, line cap=round, smooth, dashed] coordinates {
% (2*1600 , 174.02/8)
% (2*1800 , 195 /8)
% (2*2000 , 215 /8)
% (2*2200 , 235 /8)
% };
% % DDR5 SR
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,180.3 /8)
% (2*1800,200.5 /8)
% (2*2000,222.76/8)
% (2*2200,242.74/8)
% (2*2400,264.28/8)
% (2*2600,285.1 /8)
% (2*2800,306.88/8)
% (2*3000,327.4 /8)
% (2*3200,346.16/8)
% };
% % DDR5 DR
% \addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
% (2*1600, 186.46/8)
% (2*1800, 214.44/8)
% (2*2000, 239.36/8)
% (2*2200, 245.44/8)
% (2*2400, 272.26/8)
% (2*2600, 301.7 /8)
% (2*2800, 315 /8)
% (2*3000, 335.24/8)
% (2*3200, 364.18/8)
% };
% \legend{
% Max,
% DDR4 16\,Gb SR,,
% DDR4 16\,Gb DR,,
% DDR5 16\,Gb SR,
% DDR5 16\,Gb DR
% }
% \end{axis}
% \end{tikzpicture}
% \caption{DDR4 vs. DDR5, Linear}
% \label{fig:bw:linear_DDR4_DDR5}
% \end{subfigure}
% %%%
% \hfill
% %%%
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*400,
% ymin = 5,
% xmax = 2*3400,
% legend style={at={(0.05,0.70)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*800 , 102.4 /8)
% (2*933 , 119.424/8)
% (2*1066.5, 136.512/8)
% (2*1200 , 153.6 /8)
% (2*1333 , 170.624/8)
% (2*1466.5, 187.712/8)
% (2*1600 , 204.8 /8)
% (2*1800 , 230.4 /8)
% (2*2000 , 256 /8)
% (2*2200 , 281.6 /8)
% (2*2400 , 307.2 /8)
% (2*2600 , 332.8 /8)
% (2*2800 , 358.4 /8)
% (2*3000 , 384 /8)
% (2*3200 , 409.6 /8)
% };
% % DDR4 SR
% \addplot[BrickRed, thick, line cap=round, smooth] coordinates {
% (2*800 , 74.64/8)
% (2*933 , 79.12/8)
% (2*1066.5, 85.81/8)
% (2*1200 , 86.68/8)
% (2*1333 , 89.09/8)
% (2*1466.5, 88.52/8)
% (2*1600 , 88.51/8)
% };
% \addplot[BrickRed, thick, line cap=round, smooth, dashed] coordinates {
% (2*1600 , 88.51/8)
% (2*1800 , 90 /8)
% (2*2000 , 91 /8)
% (2*2200 , 91 /8)
% };
% % DDR4 DR
% \addplot[Orange, thick, line cap=round, smooth] coordinates {
% (2*800 , 87.33 /8)
% (2*933 , 95.61 /8)
% (2*1066.5, 101.87/8)
% (2*1200 , 113.43/8)
% (2*1333 , 124.85/8)
% (2*1466.5, 134.69/8)
% (2*1600 , 141.27/8)
% };
% \addplot[Orange, thick, line cap=round, smooth, dashed] coordinates {
% (2*1600 , 141.27/8)
% (2*1800 , 146 /8)
% (2*2000 , 148 /8)
% (2*2200 , 150 /8)
% % Alt
% %(2*1600 , 141.27/8)
% %(2*1800 , 150 /8)
% %(2*2000 , 160 /8)
% %(2*2200 , 170 /8)
% };
% % DDR5 SR
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,165.18/8)
% (2*1800,178.04/8)
% (2*2000,196.66/8)
% (2*2200,208.58/8)
% (2*2400,224.62/8)
% (2*2600,234.14/8)
% (2*2800,244.94/8)
% (2*3000,256.16/8)
% (2*3200,260.22/8)
% };
% % DDR5 DR
% \addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
% (2*1600,180.6 /8)
% (2*1800,199.1 /8)
% (2*2000,219.08/8)
% (2*2200,235.34/8)
% (2*2400,256.92/8)
% (2*2600,271.6 /8)
% (2*2800,291.44/8)
% (2*3000,310 /8)
% (2*3200,325.9 /8)
% };
% \legend{
% Max,
% DDR4 16\,Gb SR,,
% DDR4 16\,Gb DR,,
% DDR5 16\,Gb SR,
% DDR5 16\,Gb DR
% }
% \end{axis}
% \end{tikzpicture}
% \caption{DDR4 vs. DDR5, Random}
% \label{fig:bw:random_DDR4_DDR5}
% \end{subfigure}
% \caption{Sustainable Bandwidth for Different DDR5 Configurations and DDR4 vs. DDR5, SR: Single Rank, DR: Dual Rank}
% \label{fig:bw}
% \end{sidewaysfigure*}%*}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{sidewaysfigure*}[p]
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Latency [ns]}},
% xlabel={\textbf{Bandwidth [GB/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% %height = 0.9\columnwidth,
% xmin = 0,
% ymin = 0,
% xmax = 30,
% ymax = 300,
% legend style={at={(0.05,0.85)}, anchor=west}
% %xmax = 10000
% ]
% % DDR4 16Gb SR
% \addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8, 39.9 )
% (25.6 /8, 41.1 )
% (38.4 /8, 43.1 )
% (51.2 /8, 43.9 )
% (64.0 /8, 44.6 )
% (76.8 /8, 45.1 )
% (89.6 /8, 46.2 )
% (102.34/8, 46.8 )
% (115.21/8, 50.9 )
% (127.99/8, 57.0 )
% (140.81/8, 66.8 )
% (153.61/8, 86.5 )
% (166.18/8, 146.5)
% %(166.52, 151.9)
% %(166.52, 154.3)
% %(166.52, 154.9)
% %(166.52, 154.9)
% %(166.52, 154.9)
% };
% % DDR5 16Gb SR
% \addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8, 42.2 )
% (25.6 /8, 39.2 )
% (38.4 /8, 39.2 )
% (51.2 /8, 41.9 )
% (64 /8, 43.6 )
% (76.8 /8, 46.5 )
% (89.6 /8, 49.7 )
% (102.4 /8, 53.9 )
% (115.2 /8, 59.3 )
% (128 /8, 69.1 )
% (140.78/8, 79.3 )
% (153.6 /8, 99.2 )
% (166.38/8, 138.6)
% (178.94/8, 237.9)
% (180.14/8, 279.1)
% %(180.22, 279.9)
% %(180.22, 279.8)
% %(180.26, 280.6)
% };
% \addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
% (25.6, 0)
% (25.6, 300)
% } node[below, pos=0.5, rotate=90] {Max. Bandwidth 25.6\,GB/s};
% \legend{
% DDR4,% 16Gb SR
% DDR5,% 16Gb SR
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Single Rank, Linear}
% \label{fig:lat_bw:linear_16Gb_SR}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Latency [ns]}},
% xlabel={\textbf{Bandwidth [GB/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% %height = 0.9\columnwidth,
% xmin = 0,
% ymin = 0,
% xmax = 30,
% ymax = 300,
% legend style={at={(0.05,0.85)}, anchor=west}
% %xmax = 10000
% ]
% % DDR4 16Gb DR
% \addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8 , 38.4 )
% (25.6 /8 , 39.4 )
% (38.4 /8 , 39.2 )
% (51.2 /8 , 41 )
% (63.98 /8 , 42.8 )
% (76.8 /8 , 44.7 )
% (89.6 /8 , 48.1 )
% (102.38/8 , 52.8 )
% (115.21/8 , 57.1 )
% (128 /8 , 64.4 )
% (140.73/8 , 72.3 )
% (153.57/8 , 86.3 )
% (166.39/8 , 111 )
% (173.97/8 , 163.3)
% %(174.34 , 163.5)
% %(174.02 164.9)
% %(174.34 164.8)
% %(174.02 165.1)
% };
% % DDR5 16Gb DR
% \addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8, 47.3 )
% (25.6 /8, 42.2 )
% (38.4 /8, 39.5 )
% (51.2 /8, 39.1 )
% (64 /8, 39.4 )
% (76.8 /8, 39.9 )
% (89.56 /8, 41.1 )
% (102.4 /8, 43.4 )
% (115.2 /8, 45.3 )
% (127.92/8, 49.6 )
% (140.8 /8, 55.6 )
% (153.6 /8, 65.3 )
% (166.38/8, 79.1 )
% (179.14/8, 117.6)
% (186.46/8, 262.6)
% %(187.28, 290.6)
% %(184.9 274 )
% %(184.52 252 )
% };
% \addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
% (25.6, 0)
% (25.6, 300)
% } node[below, pos=0.5, rotate=90] {Max. Bandwidth 25.6\,GB/s};
% \legend{
% DDR4,
% DDR5,
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16 Gb Dual Rank, Linear}
% \label{fig:lat_bw:linear_16Gb_DR}
% \end{subfigure}
% %%%%
% \centering
% \par\bigskip
% %%%%
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Latency [ns]}},
% xlabel={\textbf{Bandwidth [GB/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 0,
% ymin = 0,
% xmax = 30,
% ymax = 300,
% legend style={at={(0.05,0.85)}, anchor=west}
% ]
% % DDR4 16Gb SR
% \addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8, 52.3)
% (25.6 /8, 60.3)
% (38.4 /8, 72.2)
% (51.2 /8, 84.5)
% (64.0 /8, 102.1)
% (76.79/8, 138.4)
% (88.42/8, 280.1)
% %(88.45, 282.6)
% %(88.5 , 282.6)
% %(88.49, 282.6)
% %(88.48, 282.8)
% %(88.49, 282.7)
% %(88.48, 282.7)
% %(88.47, 282.9)
% %(88.49, 282.9)
% %(88.51, 282.7)
% %(88.51, 282.7)
% %(88.51, 282.8)
% };
% % DDR5 16Gb SR
% \addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
% (12.78 /8, 47.5 )
% (25.58 /8, 51.2 )
% (38.36 /8, 56.4 )
% (51.16 /8, 60.6 )
% (63.94 /8, 67.2 )
% (76.72 /8, 73 )
% (89.52 /8, 82.3 )
% (102.28/8, 92.7 )
% (115.1 /8, 104.3)
% (127.86/8, 120.3)
% (140.56/8, 145.2)
% (153.34/8, 179.4)
% (164.96/8, 265.7)
% %(165.12, 266.7)
% %(165.32, 267.8)
% %(165.38, 267.3)
% %(165.24, 267.2)
% %(165.38, 266.7)
% };
% \addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
% (25.6, 0)
% (25.6, 300)
% } node[below, pos=0.5, rotate=90] {Max. Bandwidth 25.6\,GB/s};
% \legend{
% DDR4,% 16Gb SR
% DDR5,% 16Gb SR
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Single Rank, Random}
% \label{fig:lat_bw:random_16Gb_SR}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Latency [ns]}},
% xlabel={\textbf{Bandwidth [GB/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 0,
% ymin = 0,
% xmax = 30,
% ymax = 300,
% legend style={at={(0.05,0.85)}, anchor=west}
% %xmax = 10000
% ]
% % DDR4 16Gb DR
% \addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
% (12.8 /8, 49.5 )
% (25.6 /8, 52.8 )
% (38.4 /8, 58.4 )
% (51.2 /8, 63.7 )
% (63.97 /8, 69.1 )
% (76.79 /8, 75 )
% (89.6 /8, 83 )
% (102.34/8, 94.8 )
% (115.19/8, 110.3)
% (127.97/8, 132.1)
% (140.6 /8, 192.2)
% %(141.64, 199.1)
% %(141.51, 200.3)
% %(141.52, 200.6)
% %(141.56, 200.5)
% %(141.52, 200.7)
% %(141.51, 200.5)
% %(141.55, 200.4)
% };
% % DDR5 16Gb DR
% \addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
% (12.78 /8 , 46.9 )
% (25.58 /8 , 48.1 )
% (38.36 /8 , 51.6 )
% (51.16 /8 , 52.9 )
% (63.94 /8 , 56.9 )
% (76.72 /8 , 59 )
% (89.48 /8 , 63.7 )
% (102.3 /8 , 67.2 )
% (115.1 /8 , 73.8 )
% (127.78/8 , 81.1 )
% (140.64/8 , 91.2 )
% (153.44/8 , 108.2)
% (166.18/8 , 139.8)
% (178.8 /8 , 232.6)
% (181.66/8 , 268.6)
% %(181.8 , 269.8)
% %(181.7 , 267.3)
% %(181.5 , 267.3)
% };
% \addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
% (25.6, 0)
% (25.6, 300)
% } node[below, pos=0.5, rotate=90] {Max. Bandwidth 25.6\,GB/s};
% \legend{
% DDR4,
% DDR5,
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16 Gb Dual Rank, Random}
% \label{fig:lat_bw:random_16Gb_DR}
% \end{subfigure}
% %%%%
% \caption{Average Response Latency over Bandwidth for DDR4-3200 vs. DDR5-3200}
% \label{fig:lat_bw}
% \end{sidewaysfigure*}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{sidewaysfigure*}[p]
% \centering
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% %xmax = 200,
% %ymax = 300,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600,204.8/8)
% (2*1800,230.4/8)
% (2*2000,256 /8)
% (2*2200,281.6/8)
% (2*2400,307.2/8)
% (2*2600,332.8/8)
% (2*2800,358.4/8)
% (2*3000,384 /8)
% (2*3200,409.6/8)
% };
% % No Refresh
% \addplot[Red, thick, line cap=round, smooth] coordinates {
% (2*1600,197.08/8)
% (2*1800,220.14/8)
% (2*2000,244.2 /8)
% (2*2200,266.68/8)
% (2*2400,290.26/8)
% (2*2600,312.42/8)
% (2*2800,336.1 /8)
% (2*3000,359 /8)
% (2*3200,380.02/8)
% };
% % all bank
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,180.3 /8)
% (2*1800,200.5 /8)
% (2*2000,222.76/8)
% (2*2200,242.74/8)
% (2*2400,264.28/8)
% (2*2600,285.1 /8)
% (2*2800,306.88/8)
% (2*3000,327.4 /8)
% (2*3200,346.16/8)
% };
% % same-bank
% \addplot[Green, thick, line cap=round, smooth] coordinates {
% (2*1600,196.84/8)
% (2*1800,217.64/8)
% (2*2000,240.28/8)
% (2*2200,262 /8)
% (2*2400,284.58/8)
% (2*2600,306.6 /8)
% (2*2800,329.96/8)
% (2*3000,351.84/8)
% (2*3200,370.76/8)
% };
% %
% \legend{
% Max,
% No Refresh,
% All-Bank Refresh,
% Same-Bank Refresh
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Single Rank, Linear}
% \label{fig:refresh:linear_16Gb_SR}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% %xmax = 200,
% %ymax = 300,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600,204.8/8)
% (2*1800,230.4/8)
% (2*2000,256 /8)
% (2*2200,281.6/8)
% (2*2400,307.2/8)
% (2*2600,332.8/8)
% (2*2800,358.4/8)
% (2*3000,384 /8)
% (2*3200,409.6/8)
% };
% % No Refresh
% \addplot[Red, thick, line cap=round, smooth] coordinates {
% (2*1600,200.92/8)
% (2*1800,225.9 /8)
% (2*2000,251.16/8)
% (2*2200,275.66/8)
% (2*2400,299.84/8)
% (2*2600,326.64/8)
% (2*2800,352.34/8)
% (2*3000,377.74/8)
% (2*3200,401.78/8)
% };
% % all bank
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,186.46/8)
% (2*1800,214.44/8)
% (2*2000,239.36/8)
% (2*2200,245.44/8)
% (2*2400,272.26/8)
% (2*2600,301.7 /8)
% (2*2800,315 /8)
% (2*3000,335.24/8)
% (2*3200,364.18/8)
% };
% % same-bank
% \addplot[Green, thick, line cap=round, smooth] coordinates {
% (2*1600,200.48/8)
% (2*1800,225.46/8)
% (2*2000,250.64/8)
% (2*2200,275.42/8)
% (2*2400,300.44/8)
% (2*2600,325.92/8)
% (2*2800,351.54/8)
% (2*3000,376.6 /8)
% (2*3200,400.62/8)
% };
% %
% \legend{
% Max,
% No Refresh,
% All-Bank Refresh,
% Same-Bank Refresh
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Dual Rank, Linear}
% \label{fig:refresh:linear_16Gb_DR}
% \end{subfigure}
% %%%%
% \par\bigskip
% %%%%
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% %xmax = 200,
% %ymax = 300,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600,204.8/8)
% (2*1800,230.4/8)
% (2*2000,256 /8)
% (2*2200,281.6/8)
% (2*2400,307.2/8)
% (2*2600,332.8/8)
% (2*2800,358.4/8)
% (2*3000,384 /8)
% (2*3200,409.6/8)
% };
% % No Refresh
% \addplot[Red, thick, line cap=round, smooth] coordinates {
% (2*1600,183.06/8)
% (2*1800,197.48/8)
% (2*2000,216.9 /8)
% (2*2200,230.64/8)
% (2*2400,248.1 /8)
% (2*2600,258.6 /8)
% (2*2800,270.36/8)
% (2*3000,282.52/8)
% (2*3200,286.2 /8)
% };
% % all bank
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,165.18/8)
% (2*1800,178.04/8)
% (2*2000,196.66/8)
% (2*2200,208.58/8)
% (2*2400,224.62/8)
% (2*2600,234.14/8)
% (2*2800,244.94/8)
% (2*3000,256.16/8)
% (2*3200,260.22/8)
% };
% % same-bank
% \addplot[Green, thick, line cap=round, smooth] coordinates {
% (2*1600,173.7 /8)
% (2*1800,185.14/8)
% (2*2000,202.66/8)
% (2*2200,211.08/8)
% (2*2400,225.28/8)
% (2*2600,231.24/8)
% (2*2800,240.22/8)
% (2*3000,250.4 /8)
% (2*3200,253.24/8)
% };
% %
% \legend{
% Max,
% No Refresh,
% All-Bank Refresh,
% Same-Bank Refresh
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Single Rank, Random}
% \label{fig:refresh:random_16Gb_SR}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.49\textheight}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Bandwidth [GB/s]}},
% xlabel={\textbf{Pin Data Rate [MT/s]}},
% grid=minor,
% width = 0.49\textheight,
% height = 7cm,
% xmin = 2*1200,
% ymin = 15,
% legend style={at={(0.05,0.7)}, anchor=west}
% ]
% % Max
% \addplot[Black, thick, line cap=round, smooth] coordinates {
% (2*1600,204.8/8)
% (2*1800,230.4/8)
% (2*2000,256 /8)
% (2*2200,281.6/8)
% (2*2400,307.2/8)
% (2*2600,332.8/8)
% (2*2800,358.4/8)
% (2*3000,384 /8)
% (2*3200,409.6/8)
% };
% % No Refresh
% \addplot[Red, thick, line cap=round, smooth] coordinates {
% (2*1600,194.2 /8)
% (2*1800,216.34/8)
% (2*2000,240.78/8)
% (2*2200,262.66/8)
% (2*2400,286.56/8)
% (2*2600,308.88/8)
% (2*2800,332.98/8)
% (2*3000,355.5 /8)
% (2*3200,376.22/8)
% };
% % all bank
% \addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
% (2*1600,180.60/8)
% (2*1800,199.10/8)
% (2*2000,219.08/8)
% (2*2200,235.34/8)
% (2*2400,256.92/8)
% (2*2600,271.60/8)
% (2*2800,291.44/8)
% (2*3000,310.00/8)
% (2*3200,325.90/8)
% };
% % same-bank
% \addplot[Green, thick, line cap=round, smooth] coordinates {
% (2*1600,192.78/8)
% (2*1800,214.74/8)
% (2*2000,238.96/8)
% (2*2200,260.06/8)
% (2*2400,283.28/8)
% (2*2600,303.78/8)
% (2*2800,325.60/8)
% (2*3000,345.60/8)
% (2*3200,360.54/8)
% };
% %
% \legend{
% Max,
% No Refresh,
% All-Bank Refresh,
% Same-Bank Refresh
% }
% \end{axis}
% \end{tikzpicture}
% \caption{16\,Gb Dual Rank, Random}
% \label{fig:refresh:random_16Gb_DR}
% \end{subfigure}
% %
% \caption{DDR5 All-Bank Refresh vs. Same-Bank Refresh}
% \label{fig:refresh}
% \end{sidewaysfigure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% New Figures
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}%[h!]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
xmax = 2*3400,
legend style={legend pos=north west, font=\scriptsize}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600,204.8/8)
(2*1800,230.4/8)
(2*2000,256 /8)
(2*2200,281.6/8)
(2*2400,307.2/8)
(2*2600,332.8/8)
(2*2800,358.4/8)
(2*3000,384 /8)
(2*3200,409.6/8)
};
% DDR5 8Gb SR
\addplot[Purple, thick, line cap=round, smooth] coordinates {
(2*1600,178.44/8)
(2*1800,197.16/8)
(2*2000,219.04/8)
(2*2200,237.38/8)
(2*2400,257.80/8)
(2*2600,276.02/8)
(2*2800,295.48/8)
(2*3000,315.32/8)
(2*3200,331.86/8)
};
% DDR5 16Gb SR
\addplot[Magenta, thick, line cap=round, smooth] coordinates {
(2*1600,180.3 /8)
(2*1800,200.5 /8)
(2*2000,222.76/8)
(2*2200,242.74/8)
(2*2400,264.28/8)
(2*2600,285.1 /8)
(2*2800,306.88/8)
(2*3000,327.4 /8)
(2*3200,346.16/8)
};
% DDR5 8Gb DR
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,197.52/8)
(2*1800,215.02/8)
(2*2000,243.36/8)
(2*2200,264.08/8)
(2*2400,288.86/8)
(2*2600,307.5 /8)
(2*2800,331.7 /8)
(2*3000,353.92/8)
(2*3200,374.92/8)
};
%
\addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
(2*1600,186.46/8)
(2*1800,214.44/8)
(2*2000,239.36/8)
(2*2200,245.44/8)
(2*2400,272.26/8)
(2*2600,301.7 /8)
(2*2800,315 /8)
(2*3000,335.24/8)
(2*3200,364.18/8)
};
\legend{
Maximum,%
DDR5 8\,Gb SR,%
DDR5 16\,Gb SR,%
DDR5 8\,Gb DR,%
DDR5 16\,Gb DR%
}
\end{axis}
\end{tikzpicture}
\caption{DDR5, Linear}
\label{fig:bw:linear_DDR5}
\end{subfigure}
%%%
\hfill
%%%
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
xmax = 2*3400,
legend style={legend pos=north west, font=\scriptsize}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600, 204.8/8)
(2*1800, 230.4/8)
(2*2000, 256 /8)
(2*2200, 281.6/8)
(2*2400, 307.2/8)
(2*2600, 332.8/8)
(2*2800, 358.4/8)
(2*3000, 384 /8)
(2*3200, 409.6/8)
};
% DDR5 8Gb SR
\addplot[Purple, thick, line cap=round, smooth] coordinates {
(2*1600,154.2 /8)
(2*1800,159.76/8)
(2*2000,169.26/8)
(2*2200,169.6 /8)
(2*2400,176.2 /8)
(2*2600,173.18/8)
(2*2800,179.3 /8)
(2*3000,184 /8)
(2*3200,183.56/8)
};
%
\addplot[Magenta, thick, line cap=round, smooth] coordinates {
(2*1600,165.18/8)
(2*1800,178.04/8)
(2*2000,196.66/8)
(2*2200,208.58/8)
(2*2400,224.62/8)
(2*2600,234.14/8)
(2*2800,244.94/8)
(2*3000,256.16/8)
(2*3200,260.22/8)
};
%
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,179.98/8)
(2*1800,198 /8)
(2*2000,219 /8)
(2*2200,233.72/8)
(2*2400,251.58/8)
(2*2600,264.92/8)
(2*2800,279.62/8)
(2*3000,292.44/8)
(2*3200,299.98/8)
};
%
\addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
(2*1600,180.6 /8)
(2*1800,199.1 /8)
(2*2000,219.08/8)
(2*2200,235.34/8)
(2*2400,256.92/8)
(2*2600,271.6 /8)
(2*2800,291.44/8)
(2*3000,310 /8)
(2*3200,325.9 /8)
};
\legend{
Maximum,
DDR5 8\,Gb SR,
DDR5 16\,Gb SR,
DDR5 8\,Gb DR,
DDR5 16\,Gb DR
}
\end{axis}
\end{tikzpicture}
\caption{DDR5, Random}
\label{fig:bw:random_DDR5}
\end{subfigure}
%\caption{Sustainable Bandwidth for Different DDR5 Configurations, SR: Single Rank, DR: Dual Rank}
\vskip\baselineskip
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*400,
ymin = 5,
xmax = 2*3400,
legend style={legend pos=north west, font=\scriptsize}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*800 , 102.4 /8)
(2*933 , 119.424/8)
(2*1066.5, 136.512/8)
(2*1200 , 153.6 /8)
(2*1333 , 170.624/8)
(2*1466.5, 187.712/8)
(2*1600 , 204.8 /8)
(2*1800 , 230.4 /8)
(2*2000 , 256 /8)
(2*2200 , 281.6 /8)
(2*2400 , 307.2 /8)
(2*2600 , 332.8 /8)
(2*2800 , 358.4 /8)
(2*3000 , 384 /8)
(2*3200 , 409.6 /8)
};
% DDR4 SR
\addplot[BrickRed, thick, line cap=round, smooth] coordinates {
(2*800 , 88.26 /8)
(2*933 , 101.31/8)
(2*1066.5 , 114.62/8)
(2*1200 , 128.43/8)
(2*1333 , 140.64/8)
(2*1466.5 , 153.37/8)
(2*1600 , 166.5 /8)
};
\addplot[BrickRed, thick, line cap=round, smooth, dashed] coordinates {
(2*1600 , 166.5/8)
(2*1800 , 187 /8)
(2*2000 , 207 /8)
(2*2200 , 227 /8)
};
% DDR4 DR
\addplot[Orange, thick, line cap=round, smooth] coordinates {
(2*800 , 91.4 /8)
(2*933 , 106.43/8)
(2*1066.5, 116.63/8)
(2*1200 , 132.34/8)
(2*1333 , 147.23/8)
(2*1466.5, 160.98/8)
(2*1600 , 174.02/8)
};
\addplot[Orange, line cap=round, smooth, dashed] coordinates {
(2*1600 , 174.02/8)
(2*1800 , 195 /8)
(2*2000 , 215 /8)
(2*2200 , 235 /8)
};
% DDR5 SR
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,180.3 /8)
(2*1800,200.5 /8)
(2*2000,222.76/8)
(2*2200,242.74/8)
(2*2400,264.28/8)
(2*2600,285.1 /8)
(2*2800,306.88/8)
(2*3000,327.4 /8)
(2*3200,346.16/8)
};
% DDR5 DR
\addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
(2*1600, 186.46/8)
(2*1800, 214.44/8)
(2*2000, 239.36/8)
(2*2200, 245.44/8)
(2*2400, 272.26/8)
(2*2600, 301.7 /8)
(2*2800, 315 /8)
(2*3000, 335.24/8)
(2*3200, 364.18/8)
};
\legend{
Maximum,
DDR4 16\,Gb SR,,
DDR4 16\,Gb DR,,
DDR5 16\,Gb SR,
DDR5 16\,Gb DR
}
\end{axis}
\end{tikzpicture}
\caption{DDR4 vs. DDR5, Linear}
\label{fig:bw:linear_DDR4_DDR5}
\end{subfigure}
%%%
\hfill
%%%
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*400,
ymin = 5,
xmax = 2*3400,
legend style={legend pos=north west, font=\scriptsize}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*800 , 102.4 /8)
(2*933 , 119.424/8)
(2*1066.5, 136.512/8)
(2*1200 , 153.6 /8)
(2*1333 , 170.624/8)
(2*1466.5, 187.712/8)
(2*1600 , 204.8 /8)
(2*1800 , 230.4 /8)
(2*2000 , 256 /8)
(2*2200 , 281.6 /8)
(2*2400 , 307.2 /8)
(2*2600 , 332.8 /8)
(2*2800 , 358.4 /8)
(2*3000 , 384 /8)
(2*3200 , 409.6 /8)
};
% DDR4 SR
\addplot[BrickRed, thick, line cap=round, smooth] coordinates {
(2*800 , 74.64/8)
(2*933 , 79.12/8)
(2*1066.5, 85.81/8)
(2*1200 , 86.68/8)
(2*1333 , 89.09/8)
(2*1466.5, 88.52/8)
(2*1600 , 88.51/8)
};
\addplot[BrickRed, thick, line cap=round, smooth, dashed] coordinates {
(2*1600 , 88.51/8)
(2*1800 , 90 /8)
(2*2000 , 91 /8)
(2*2200 , 91 /8)
};
% DDR4 DR
\addplot[Orange, thick, line cap=round, smooth] coordinates {
(2*800 , 87.33 /8)
(2*933 , 95.61 /8)
(2*1066.5, 101.87/8)
(2*1200 , 113.43/8)
(2*1333 , 124.85/8)
(2*1466.5, 134.69/8)
(2*1600 , 141.27/8)
};
\addplot[Orange, thick, line cap=round, smooth, dashed] coordinates {
(2*1600 , 141.27/8)
(2*1800 , 146 /8)
(2*2000 , 148 /8)
(2*2200 , 150 /8)
% Alt
%(2*1600 , 141.27/8)
%(2*1800 , 150 /8)
%(2*2000 , 160 /8)
%(2*2200 , 170 /8)
};
% DDR5 SR
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,165.18/8)
(2*1800,178.04/8)
(2*2000,196.66/8)
(2*2200,208.58/8)
(2*2400,224.62/8)
(2*2600,234.14/8)
(2*2800,244.94/8)
(2*3000,256.16/8)
(2*3200,260.22/8)
};
% DDR5 DR
\addplot[BlueGreen, thick, line cap=round, smooth] coordinates {
(2*1600,180.6 /8)
(2*1800,199.1 /8)
(2*2000,219.08/8)
(2*2200,235.34/8)
(2*2400,256.92/8)
(2*2600,271.6 /8)
(2*2800,291.44/8)
(2*3000,310 /8)
(2*3200,325.9 /8)
};
\legend{
Maximum,
DDR4 16\,Gb SR,,
DDR4 16\,Gb DR,,
DDR5 16\,Gb SR,
DDR5 16\,Gb DR
}
\end{axis}
\end{tikzpicture}
\caption{DDR4 vs. DDR5, Random}
\label{fig:bw:random_DDR4_DDR5}
\end{subfigure}
\caption{Sustainable Bandwidth for Different DDR5 Configurations and DDR4 vs. DDR5, SR: Single Rank, DR: Dual Rank}
\label{fig:bw}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}%[h!]
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [GB/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
%height = 0.9\columnwidth,
xmin = 0,
ymin = 0,
xmax = 30,
ymax = 300,
legend style={legend pos=north west, font=\small}
%xmax = 10000
]
% DDR4 16Gb SR
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8, 39.9 )
(25.6 /8, 41.1 )
(38.4 /8, 43.1 )
(51.2 /8, 43.9 )
(64.0 /8, 44.6 )
(76.8 /8, 45.1 )
(89.6 /8, 46.2 )
(102.34/8, 46.8 )
(115.21/8, 50.9 )
(127.99/8, 57.0 )
(140.81/8, 66.8 )
(153.61/8, 86.5 )
(166.18/8, 146.5)
%(166.52, 151.9)
%(166.52, 154.3)
%(166.52, 154.9)
%(166.52, 154.9)
%(166.52, 154.9)
};
% DDR5 16Gb SR
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8, 42.2 )
(25.6 /8, 39.2 )
(38.4 /8, 39.2 )
(51.2 /8, 41.9 )
(64 /8, 43.6 )
(76.8 /8, 46.5 )
(89.6 /8, 49.7 )
(102.4 /8, 53.9 )
(115.2 /8, 59.3 )
(128 /8, 69.1 )
(140.78/8, 79.3 )
(153.6 /8, 99.2 )
(166.38/8, 138.6)
(178.94/8, 237.9)
(180.14/8, 279.1)
%(180.22, 279.9)
%(180.22, 279.8)
%(180.26, 280.6)
};
\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(25.6, 0)
(25.6, 300)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (25.6\,GB/s)};
\legend{
DDR4,% 16Gb SR
DDR5,% 16Gb SR
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Single Rank, Linear}
\label{fig:lat_bw:linear_16Gb_SR}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [GB/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
%height = 0.9\columnwidth,
xmin = 0,
ymin = 0,
xmax = 30,
ymax = 300,
legend style={legend pos=north west, font=\small}
%xmax = 10000
]
% DDR4 16Gb DR
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8 , 38.4 )
(25.6 /8 , 39.4 )
(38.4 /8 , 39.2 )
(51.2 /8 , 41 )
(63.98 /8 , 42.8 )
(76.8 /8 , 44.7 )
(89.6 /8 , 48.1 )
(102.38/8 , 52.8 )
(115.21/8 , 57.1 )
(128 /8 , 64.4 )
(140.73/8 , 72.3 )
(153.57/8 , 86.3 )
(166.39/8 , 111 )
(173.97/8 , 163.3)
%(174.34 , 163.5)
%(174.02 164.9)
%(174.34 164.8)
%(174.02 165.1)
};
% DDR5 16Gb DR
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8, 47.3 )
(25.6 /8, 42.2 )
(38.4 /8, 39.5 )
(51.2 /8, 39.1 )
(64 /8, 39.4 )
(76.8 /8, 39.9 )
(89.56 /8, 41.1 )
(102.4 /8, 43.4 )
(115.2 /8, 45.3 )
(127.92/8, 49.6 )
(140.8 /8, 55.6 )
(153.6 /8, 65.3 )
(166.38/8, 79.1 )
(179.14/8, 117.6)
(186.46/8, 262.6)
%(187.28, 290.6)
%(184.9 274 )
%(184.52 252 )
};
\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(25.6, 0)
(25.6, 300)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (25.6\,GB/s)};
\legend{
DDR4,
DDR5,
}
\end{axis}
\end{tikzpicture}
\caption{16 Gb Dual Rank, Linear}
\label{fig:lat_bw:linear_16Gb_DR}
\end{subfigure}
%%%%
\vskip\baselineskip
%%%%
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [GB/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 0,
ymin = 0,
xmax = 30,
ymax = 300,
legend style={legend pos=north west, font=\small}
]
% DDR4 16Gb SR
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8, 52.3)
(25.6 /8, 60.3)
(38.4 /8, 72.2)
(51.2 /8, 84.5)
(64.0 /8, 102.1)
(76.79/8, 138.4)
(88.42/8, 280.1)
%(88.45, 282.6)
%(88.5 , 282.6)
%(88.49, 282.6)
%(88.48, 282.8)
%(88.49, 282.7)
%(88.48, 282.7)
%(88.47, 282.9)
%(88.49, 282.9)
%(88.51, 282.7)
%(88.51, 282.7)
%(88.51, 282.8)
};
% DDR5 16Gb SR
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(12.78 /8, 47.5 )
(25.58 /8, 51.2 )
(38.36 /8, 56.4 )
(51.16 /8, 60.6 )
(63.94 /8, 67.2 )
(76.72 /8, 73 )
(89.52 /8, 82.3 )
(102.28/8, 92.7 )
(115.1 /8, 104.3)
(127.86/8, 120.3)
(140.56/8, 145.2)
(153.34/8, 179.4)
(164.96/8, 265.7)
%(165.12, 266.7)
%(165.32, 267.8)
%(165.38, 267.3)
%(165.24, 267.2)
%(165.38, 266.7)
};
\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(25.6, 0)
(25.6, 300)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (25.6\,GB/s)};
\legend{
DDR4,% 16Gb SR
DDR5,% 16Gb SR
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Single Rank, Random}
\label{fig:lat_bw:random_16Gb_SR}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [GB/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 0,
ymin = 0,
xmax = 30,
ymax = 300,
legend style={legend pos=north west, font=\small}
%xmax = 10000
]
% DDR4 16Gb DR
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(12.8 /8, 49.5 )
(25.6 /8, 52.8 )
(38.4 /8, 58.4 )
(51.2 /8, 63.7 )
(63.97 /8, 69.1 )
(76.79 /8, 75 )
(89.6 /8, 83 )
(102.34/8, 94.8 )
(115.19/8, 110.3)
(127.97/8, 132.1)
(140.6 /8, 192.2)
%(141.64, 199.1)
%(141.51, 200.3)
%(141.52, 200.6)
%(141.56, 200.5)
%(141.52, 200.7)
%(141.51, 200.5)
%(141.55, 200.4)
};
% DDR5 16Gb DR
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(12.78 /8 , 46.9 )
(25.58 /8 , 48.1 )
(38.36 /8 , 51.6 )
(51.16 /8 , 52.9 )
(63.94 /8 , 56.9 )
(76.72 /8 , 59 )
(89.48 /8 , 63.7 )
(102.3 /8 , 67.2 )
(115.1 /8 , 73.8 )
(127.78/8 , 81.1 )
(140.64/8 , 91.2 )
(153.44/8 , 108.2)
(166.18/8 , 139.8)
(178.8 /8 , 232.6)
(181.66/8 , 268.6)
%(181.8 , 269.8)
%(181.7 , 267.3)
%(181.5 , 267.3)
};
\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(25.6, 0)
(25.6, 300)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (25.6\,GB/s)};
\legend{
DDR4,
DDR5,
}
\end{axis}
\end{tikzpicture}
\caption{16 Gb Dual Rank, Random}
\label{fig:lat_bw:random_16Gb_DR}
\end{subfigure}
%%%%
\caption{Average Response Latency over Bandwidth for DDR4-3200 vs. DDR5-3200}
\label{fig:lat_bw}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}%[h!]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
%xmax = 200,
%ymax = 300,
legend style={legend pos=north west, font=\small}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600,204.8/8)
(2*1800,230.4/8)
(2*2000,256 /8)
(2*2200,281.6/8)
(2*2400,307.2/8)
(2*2600,332.8/8)
(2*2800,358.4/8)
(2*3000,384 /8)
(2*3200,409.6/8)
};
% No Refresh
\addplot[Red, thick, line cap=round, smooth] coordinates {
(2*1600,197.08/8)
(2*1800,220.14/8)
(2*2000,244.2 /8)
(2*2200,266.68/8)
(2*2400,290.26/8)
(2*2600,312.42/8)
(2*2800,336.1 /8)
(2*3000,359 /8)
(2*3200,380.02/8)
};
% all bank
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,180.3 /8)
(2*1800,200.5 /8)
(2*2000,222.76/8)
(2*2200,242.74/8)
(2*2400,264.28/8)
(2*2600,285.1 /8)
(2*2800,306.88/8)
(2*3000,327.4 /8)
(2*3200,346.16/8)
};
% same-bank
\addplot[Green, thick, line cap=round, smooth] coordinates {
(2*1600,196.84/8)
(2*1800,217.64/8)
(2*2000,240.28/8)
(2*2200,262 /8)
(2*2400,284.58/8)
(2*2600,306.6 /8)
(2*2800,329.96/8)
(2*3000,351.84/8)
(2*3200,370.76/8)
};
%
\legend{
Maximum,
No Refresh,
All-Bank,
Same-Bank
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Single Rank, Linear}
\label{fig:refresh:linear_16Gb_SR}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
%xmax = 200,
%ymax = 300,
legend style={legend pos=north west, font=\small}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600,204.8/8)
(2*1800,230.4/8)
(2*2000,256 /8)
(2*2200,281.6/8)
(2*2400,307.2/8)
(2*2600,332.8/8)
(2*2800,358.4/8)
(2*3000,384 /8)
(2*3200,409.6/8)
};
% No Refresh
\addplot[Red, thick, line cap=round, smooth] coordinates {
(2*1600,200.92/8)
(2*1800,225.9 /8)
(2*2000,251.16/8)
(2*2200,275.66/8)
(2*2400,299.84/8)
(2*2600,326.64/8)
(2*2800,352.34/8)
(2*3000,377.74/8)
(2*3200,401.78/8)
};
% all bank
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,186.46/8)
(2*1800,214.44/8)
(2*2000,239.36/8)
(2*2200,245.44/8)
(2*2400,272.26/8)
(2*2600,301.7 /8)
(2*2800,315 /8)
(2*3000,335.24/8)
(2*3200,364.18/8)
};
% same-bank
\addplot[Green, thick, line cap=round, smooth] coordinates {
(2*1600,200.48/8)
(2*1800,225.46/8)
(2*2000,250.64/8)
(2*2200,275.42/8)
(2*2400,300.44/8)
(2*2600,325.92/8)
(2*2800,351.54/8)
(2*3000,376.6 /8)
(2*3200,400.62/8)
};
%
\legend{
Maximum,
No Refresh,
All-Bank,
Same-Bank
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Dual Rank, Linear}
\label{fig:refresh:linear_16Gb_DR}
\end{subfigure}
%%%%
\vskip\baselineskip
%%%%
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
%xmax = 200,
%ymax = 300,
legend style={legend pos=north west, font=\small}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600,204.8/8)
(2*1800,230.4/8)
(2*2000,256 /8)
(2*2200,281.6/8)
(2*2400,307.2/8)
(2*2600,332.8/8)
(2*2800,358.4/8)
(2*3000,384 /8)
(2*3200,409.6/8)
};
% No Refresh
\addplot[Red, thick, line cap=round, smooth] coordinates {
(2*1600,183.06/8)
(2*1800,197.48/8)
(2*2000,216.9 /8)
(2*2200,230.64/8)
(2*2400,248.1 /8)
(2*2600,258.6 /8)
(2*2800,270.36/8)
(2*3000,282.52/8)
(2*3200,286.2 /8)
};
% all bank
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,165.18/8)
(2*1800,178.04/8)
(2*2000,196.66/8)
(2*2200,208.58/8)
(2*2400,224.62/8)
(2*2600,234.14/8)
(2*2800,244.94/8)
(2*3000,256.16/8)
(2*3200,260.22/8)
};
% same-bank
\addplot[Green, thick, line cap=round, smooth] coordinates {
(2*1600,173.7 /8)
(2*1800,185.14/8)
(2*2000,202.66/8)
(2*2200,211.08/8)
(2*2400,225.28/8)
(2*2600,231.24/8)
(2*2800,240.22/8)
(2*3000,250.4 /8)
(2*3200,253.24/8)
};
%
\legend{
Maximum,
No Refresh,
All-Bank,
Same-Bank
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Single Rank, Random}
\label{fig:refresh:random_16Gb_SR}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Bandwidth [GB/s]}},
xlabel={\textbf{Pin Data Rate [MT/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
xmin = 2*1200,
ymin = 15,
legend style={legend pos=north west, font=\small}
]
% Max
\addplot[Black, thick, line cap=round, smooth] coordinates {
(2*1600,204.8/8)
(2*1800,230.4/8)
(2*2000,256 /8)
(2*2200,281.6/8)
(2*2400,307.2/8)
(2*2600,332.8/8)
(2*2800,358.4/8)
(2*3000,384 /8)
(2*3200,409.6/8)
};
% No Refresh
\addplot[Red, thick, line cap=round, smooth] coordinates {
(2*1600,194.2 /8)
(2*1800,216.34/8)
(2*2000,240.78/8)
(2*2200,262.66/8)
(2*2400,286.56/8)
(2*2600,308.88/8)
(2*2800,332.98/8)
(2*3000,355.5 /8)
(2*3200,376.22/8)
};
% all bank
\addplot[MidnightBlue, thick, line cap=round, smooth] coordinates {
(2*1600,180.60/8)
(2*1800,199.10/8)
(2*2000,219.08/8)
(2*2200,235.34/8)
(2*2400,256.92/8)
(2*2600,271.60/8)
(2*2800,291.44/8)
(2*3000,310.00/8)
(2*3200,325.90/8)
};
% same-bank
\addplot[Green, thick, line cap=round, smooth] coordinates {
(2*1600,192.78/8)
(2*1800,214.74/8)
(2*2000,238.96/8)
(2*2200,260.06/8)
(2*2400,283.28/8)
(2*2600,303.78/8)
(2*2800,325.60/8)
(2*3000,345.60/8)
(2*3200,360.54/8)
};
%
\legend{
Maximum,
No Refresh,
All-Bank,
Same-Bank
}
\end{axis}
\end{tikzpicture}
\caption{16\,Gb Dual Rank, Random}
\label{fig:refresh:random_16Gb_DR}
\end{subfigure}
%
\caption{DDR5 All-Bank Refresh vs. Same-Bank Refresh}
\label{fig:refresh}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure*}
% \centering
% \begin{subfigure}[b]{0.33\textwidth}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Occurrence}},
% xlabel={\textbf{Latency [ns]}},
% grid=minor,
% width = \columnwidth,
% %height = 0.9\columnwidth,
% xmin = 0,
% ymin = 0,
% xmax = 1500,
% ymax = 350,
% legend style={at={(0.05,0.85)}, anchor=west}
% ]
%
% \addplot [BrickRed, fill=BrickRed, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_no_refresh_DDR5_4000_16Gb_SR_counts.csv};
%
% \end{axis}
% \end{tikzpicture}
% \caption{No Refresh}
% \label{fig:lat:a}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.33\textwidth}
% \centering
% \begin{tikzpicture}
% \draw[Black] (2.2,2) node[]{All-Bank Refresh};
% \draw[Black] (2.2,1.8) -- (1.6,0.4);
% \begin{axis}[
% ylabel={\textbf{Occurrence}},
% xlabel={\textbf{Latency [ns]}},
% grid=minor,
% width = \columnwidth,
% %height = 0.9\columnwidth,
% xmin = 0,
% ymin = 0,
% xmax = 1500,
% ymax = 350,
% legend style={at={(0.05,0.85)}, anchor=west}
% ]
%
% \addplot [BrickRed, fill=BrickRed, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_all_bank_DDR5_4000_16Gb_SR_counts.csv};
%
% \end{axis}
% \end{tikzpicture}
% \caption{All-Bank Refresh}
% \label{fig:lat:b}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.33\textwidth}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Occurrence}},
% xlabel={\textbf{Latency [ns]}},
% grid=minor,
% width = \columnwidth,
% xmin = 0,
% ymin = 0,
% xmax = 1500,
% ymax = 350,
% legend style={at={(0.05,0.85)}, anchor=west}
% ]
%
% \addplot [BrickRed, fill=BrickRed, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_same_bank_DDR5_4000_16Gb_SR_counts.csv};
%
%
% \end{axis}
% \end{tikzpicture}
% \caption{Same-Bank Refresh}
% \label{fig:lat:b}
% \end{subfigure}
%
% \caption{Latency Histrograms w.r.t. Refresh for 16Gb DDR5-4000, Random and High Sustainable Bandwidth}
% \label{fig:lat}
%\end{figure*}
%
%\begin{figure*}
% \centering
% \begin{subfigure}[b]{1.00\textwidth}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% ylabel={\textbf{Occurrence}},
% xlabel={\textbf{Latency [ns]}},
% grid=minor,
% width = \columnwidth,
% %height = 0.9\columnwidth,
% xmin = 0,
% ymin = 0,
% %xmax = 1500,
% %ymax = 350,
% legend style={at={(0.05,0.85)}, anchor=west}
% ]
%
% \addplot [Blue, fill=Blue, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_no_refresh_DDR5_4000_16Gb_SR_counts.csv};
% \addplot [Red, fill=Red, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_all_bank_DDR5_4000_16Gb_SR_counts.csv};
% \addplot [Green, fill=Green, ybar interval, mark=no] table [x=latency, y=count, col sep=comma] {latencies_same_bank_DDR5_4000_16Gb_SR_counts.csv};
%
% \end{axis}
% \end{tikzpicture}
% \caption{No Refresh}
% \label{fig:lat:a}
% \end{subfigure}
% \hfill
% \caption{Latency Histrograms w.r.t. Refresh for 16Gb DDR5-4000, Random and High Sustainable Bandwidth}
% \label{fig:lat}
%\end{figure*}
\subsection{Key Observations}
In this section we summarize the key observations for the performance of different DDR5 configurations and for the comparison to DDR4:
\begin{itemize}
\setlength{\itemsep}{0pt}
\item \textit{Applications with random memory access patterns:} Each DDR5 channel should be composed of at least 32 banks to achieve a high sustainable bandwidth.
\item \textit{Applications with linear memory access patterns:} DDR5 does not outperform DDR4 by much with respect to sustainable bandwidth and average response latency up to the highest DDR4 data rates.
\item \textit{Applications with random memory access patterns:} DDR5 should be preferred over DDR4 at all data rates with respect to sustainable bandwidth and average response latency.
\item \textit{Independent of the memory access patterns:} DDR5's same-bank refresh should always be preferred over all-bank refresh with respect to sustainable bandwidth.
\end{itemize}
\vspace{-15pt}
% 1. For random benchmark each DDR5 channel needs at least 32 banks.
%
% stream: only small relative performance drop for higher speed grades -> small gains with DDR5, DDR4 also already available with data rates up to 4600 MT/s
% random: large relative performance drop with increasing frequencies, DDR4 3200 only achieves ...% of the maximum bandwidth while DDR5-3200 achieves ...%
% random performance drop for high frequencies and single rank:
% for DDR4 data transfer takes 4tCK regardless of frequency, but all timing dependencies increase in clock cycles, with DDR5 data transfer takes 8tCK -> more time to do precharges, activates and RD/WR latencies
% 32 banks especially beneficial with single rank random, only small differences with dual rank
% lower response latencies for same bandwidth
%
%same-bank refresh beneficial for stream (refresh penalty can be hidden almost completely), for random it varies???
%
%
% DDR4: What if 4 * tCCD_S < tCCD_L -> no seamless bank group hopping possible, at which frequency does this happen?
% DDR5 has at least 8 bank groups
%
% same bank refresh can be hidden, performance gain especially when only 1 rank is available
%
% To avoid large tail latencies, clever techniques in the MC are required to balance latency vs. BW requirements!
%
%\begin{figure*}[p]
% \begin{tikzpicture}
% \begin{axis}[
% width = 1\textwidth,
% height = 8cm,
% ymode=log,
% axis y line*=left,
% axis x line=bottom,
% major x tick style = transparent,
% ybar=5*\pgflinewidth,
% bar width=14pt,
% ymajorgrids = true,
% ylabel = {\textbf{Bandwidth [GiB/s]}},
% symbolic x coords={
% DDR3,
% DDR4,
% DDR5 SC,
% DDR5 DC,
% DDR5 16H SC,
% DDR5 16H DC},
% xtick = data,
% scaled y ticks = false,
% enlarge x limits=0.1,
% axis line style={-},
% legend columns=2,
% legend cell align=left,
% legend style={
% at={(0.5,-0.2)},
% anchor=north,
% column sep=1ex
% }
% ]
% % 16,7 GiB/s 37,5 GiB/s 32,8 GiB/s 32,8 GiB/s
% \addplot[style={black,fill=MidnightBlue,mark=none}]
% coordinates {
% (DDR3, 16.7)
% (DDR4, 37.5)
% (DDR5 SC, 32.8)
% (DDR5 DC, 65.6)
% (DDR5 16H SC, 32.8)
% (DDR5 16H DC, 65.6)
% };
% %136,4 GiB/s 409,6 GiB/s 716,8 GiB/s 11468,8 GiB/s
% \addplot[style={black,fill=BrickRed,mark=none}]
% coordinates {
% (DDR3, 136.7)
% (DDR4, 409.6)
% (DDR5 SC, 716.8)
% (DDR5 DC, 1433.6)
% (DDR5 16H SC, 11468.8)
% (DDR5 16H DC, 22937.6)
% };
% \legend{Max. Ext. Bandwidth, Max Int. Bandwidth}
% \end{axis}
% \begin{axis}[
% width = 1\textwidth,
% height = 8cm,
% axis y line*=right,
% %axis x line=none,
% axis on top,
% major x tick style = transparent,
% enlarge x limits=0.1,
% axis line style={-},
% ylabel = {\textbf{Ratio Between Int. and Ext.}},
% xmin=0, xmax=5,
% scaled y ticks = false,
% xticklabel=\empty
% ]
% \addplot[YellowOrange,sharp plot,ultra thick, update limits=true]
% coordinates {
% (0,8.2)
% (1,10.9)
% (2,21.8)
% (3,43.7)
% (4,349.5)
% (5,699)
% };
% \end{axis}
% \end{tikzpicture}
% \caption{Bandwidth Evolvement of DDR DRAM Standards}
%\end{figure*}
%\begin{table}[h!]
% \begin{tabularx}{\hsize}{|X|X|X|X|X|}\hline
% & A& B& C& D \\\hline
% X& & & & \\\hline
% y& & & & \\\hline
% \end{tabularx}
% \caption{Tabellenunterschrift zweizeilig (10/12 pt) Tabellenunterschrift zweizeilig (10/12 pt)}\label{tab2}
%\end{table}
%
\section{Related Work}
With respect to functional and cycle-accurate DRAM simulation there exist several open-source solutions. \hbox{DrSim~\cite{jeodrsim}} was already developed several years ago but never updated over time, leading to an exclusive support of older JEDEC standards and making it unsuitable for most current system developments. DRAMSys~\cite{stejun_20}, DRAMsim3~\cite{liyan_20}, Ramulator~\cite{kimyan_15} and the gem5 DRAM model~\cite{hanaga_14} are all updated from time to time, however, only DRAMSys currently supports the DDR5 standard. Due to the Petri-Net-based source code generation~\cite{junkra_19} and the special TLM concept, DRAMSys can generate results a lot faster than the other cycle-accurate simulators. Furthermore, there are other approaches that only approximate the DRAM behavior. In~\cite{yuaaam_09} the authors propose an analytical DRAM performance model that uses traces to predict the efficiency of the DRAM subsystem. Todorov et\,al.~\cite{todmue_12} presented a statistical approach for the construction of a cycle-approximate TLM model of a DRAM controller based on a decision tree. However, these approaches suffer from a significant loss in accuracy. More promising approaches based on machine learning techniques have been presented recently. The paper~\cite{lijac_19} presents the modeling of DRAM behavior using decision trees. In \cite{junfel_20} the authors present a performance-optimized DRAM model that is based on a neural network. However, to the best of our knowledge, there exists no analytical nor approximated DDR5 simulation model so far.
In terms of the simulation-based analysis and comparison of different DRAMs there is a comprehensive study by Ghose et\,al.~\cite{gholi_19}. It features a large variety of standards and workloads and also makes some key observations. However, it was already published in 2019 and thus does not include the new DDR5 standard.
%
\section{Conclusion}
%
With the release of DDR5 the total DRAM design space is greatly enlarged, making the selection of a suitable DRAM device even more complex. In this paper, we deeply investigated the performance of the new DDR5 standard for different configurations and compared it to its predecessor DDR4. In order to generate those results, we developed a DDR5 simulation model, which is based on the open-source DRAM simulator DRAMSys. We presented several insights and key observations for two characteristic benchmarks, which will guide system designers to make the right decisions. For the future we plan to run full-system simulations with our new DDR5 model to investigate the impact on the whole system performance more accurately.
%
%\vspace{-7pt}
\section*{Acknowledgements}
%\vspace{-7pt}
%
This work was supported within the Fraunhofer and DFG cooperation programme (Grant no. 248750294) and supported by the Fraunhofer High Performance Center for Simulation- and Software-based Innovation. Furthermore, we thank Rambus for their support.
%
%\renewcommand{\baselinestretch}{1.0}\normalsize
%
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,references_JR}
\end{document}